# NLP Information Extraction
---

## SparkContext and SparkSession

In [1]:
from pyspark import SparkContext
sc = SparkContext(master = 'local')

from pyspark.sql import SparkSession
spark = SparkSession.builder \
          .appName("Python Spark SQL basic example") \
          .config("spark.some.config.option", "some-value") \
          .getOrCreate()

## Simple NLP pipeline architecture

![](images/simple-nlp-pipeline.png)

**Reference:** Bird, Steven, Ewan Klein, and Edward Loper. *Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit*. O'Reilly Media, Inc., 2009.

## Example data

The raw text comes from the Gutenberg corpus in the `nltk` package. The file ID is *milton-paradise.txt*.

### Get the data

#### Raw text

In [2]:
import nltk
from nltk.corpus import gutenberg

milton_paradise = gutenberg.raw('milton-paradise.txt')

## Create a Spark data frame to store raw text

* Use the `nltk.sent_tokenize()` function to split text into sentences.

In [5]:
import pandas as pd
pdf = pd.DataFrame({
        'sentences': nltk.sent_tokenize(milton_paradise)
    })
df = spark.createDataFrame(pdf)
df.show(n=5)

+--------------------+
|           sentences|
+--------------------+
|[Paradise Lost by...|
|And chiefly thou,...|
|Say first--for He...|
|Who first seduced...|
|Th' infernal Serp...|
+--------------------+
only showing top 5 rows



## Tokenization and POS tagging

In [8]:
from pyspark.sql.functions import udf
from pyspark.sql.types import *

# Python function to be wrapped as a UDF: tokenize a sentence, then POS-tag its words
def sent_to_tag_words(sent):
    wordlist = nltk.word_tokenize(sent)
    tagged_words = nltk.pos_tag(wordlist)
    return tagged_words

# Schema for the value returned by the UDF:
# a list of (word, tag) tuples, stored as an array of two-field structs.
schema = ArrayType(StructType([
            StructField('f1', StringType()),  # the word
            StructField('f2', StringType())   # its POS tag
        ]))

# wrap the Python function as a Spark UDF
sent_to_tag_words_udf = udf(sent_to_tag_words, schema)

#### Transform data

In [9]:
df_tagged_words = df.select(sent_to_tag_words_udf(df.sentences).alias('tagged_words'))
df_tagged_words.show(5)

+--------------------+
|        tagged_words|
+--------------------+
|[[[,JJ], [Paradis...|
|[[And,CC], [chief...|
|[[Say,NNP], [firs...|
|[[Who,WP], [first...|
|[[Th,NNP], [',POS...|
+--------------------+
only showing top 5 rows



## Chunking

Chunking is the process of segmenting and labeling multi-token sequences. The following example shows how to perform noun phrase chunking on the tagged-words data frame from the previous step.
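
Before moving to Spark, it may help to see the chunker on a single sentence. The sketch below uses the same grammar as this section but runs it on a short, hand-tagged sentence (a hypothetical example, not taken from the corpus), so it needs no tagger model. The grammar reads: an optional determiner, any number of adjectives, then a singular noun.

```python
import nltk

# Hand-tagged tokens (hypothetical example sentence), so no tagger model is needed.
tagged = [("the", "DT"), ("little", "JJ"), ("dog", "NN"),
          ("barked", "VBD"), ("loudly", "RB")]

# NP = optional determiner, zero or more adjectives, one singular noun.
grammar = "NP: {<DT>?<JJ>*<NN>}"
chunk_parser = nltk.RegexpParser(grammar)

# parse() returns an nltk.Tree whose NP subtrees are the chunks.
tree = chunk_parser.parse(tagged)

# Collect the words of every NP subtree into a plain string.
noun_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")
]
print(noun_phrases)
```

Tokens that do not match the pattern, such as the verb and adverb here, stay outside any chunk as direct children of the root of the tree.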

First we define a UDF that chunks noun phrases from a list of POS-tagged words.

In [10]:
import nltk
from pyspark.sql.functions import udf
from pyspark.sql.types import *

# define a udf function to chunk noun phrases from pos-tagged words
grammar = "NP: {<DT>?<JJ>*<NN>}"
chunk_parser = nltk.RegexpParser(grammar)
chunk_parser_udf = udf(lambda x: str(chunk_parser.parse(x)), StringType())

#### Transform data

In [15]:
df_NP_chunks = df_tagged_words.select(chunk_parser_udf(df_tagged_words.tagged_words).alias('NP_chunk'))

In [16]:
df_NP_chunks.show(2, truncate=False)

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------