This is a template for an NLP pipeline. It contains six parts:

* Cleaning
* Normalization
* Tokenization
* Stop Words
* Stemming and Lemmatizing
* Bag of Words and TF-IDF


Note:

All code is restructured from resource code provided by the Udacity "Data Scientist" class. All rights are reserved by Udacity.
This is sample code only and is not expected to run as-is or to produce meaningful results.

# Cleaning
### Step 1: Get text from the movie web page
You can use the `requests` library to do this.

Outputting all the javascript, CSS, and text may overload the space available to load this notebook, so we omit a print statement here.

In [None]:
# import statements
import requests
from bs4 import BeautifulSoup

In [None]:
# fetch web page
r = requests.get("https://www.rottentomatoes.com/m/et_the_extraterrestrial")
r.text

### Step 2: Use BeautifulSoup to remove HTML tags
Use `"lxml"` rather than `"html5lib"`.

Again, outputting all the results may overload the space available to load this notebook, so we omit a print statement here.

In [None]:
soup = BeautifulSoup(r.text, "lxml")
soup.get_text()
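
To see roughly what tag removal involves, here is a minimal sketch that uses only Python's standard-library `html.parser` instead of BeautifulSoup. It is not how `get_text()` is implemented; the class name `TextExtractor`, the helper `strip_tags`, and the sample HTML are made up for illustration. This sketch also skips anything inside `<script>` and `<style>` tags.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring anything inside <script> or <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.skip_depth = 0  # > 0 while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped tags
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def strip_tags(html):
    """Return the text content of an HTML string as one space-joined line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


sample_html = """
<html><head><style>p { color: red; }</style></head>
<body><h1>E.T.</h1><p>A <b>gentle</b> alien.</p>
<script>console.log("hi");</script></body></html>
"""
print(strip_tags(sample_html))
```

For real pages, BeautifulSoup remains the more robust choice because it tolerates malformed markup and supports searching the parsed tree.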

### Step 3: Find cast crew summaries
Use BeautifulSoup's `find_all` method to select elements by tag type and attribute value (the code below matches on the `data-qa` attribute). Just like in the video, you can right-click an item on the web page and click "Inspect" to view its HTML.

In [None]:
# Find all cast crew summaries
crew = soup.find_all('div', {'data-qa':'cast-crew-item'})
print('Number of people in the cast crew:', len(crew))

### Step 4: Inspect the first crew member to find tags for the member's name and role
Tip: `.prettify()` is a super helpful method BeautifulSoup provides to output html in a nicely indented form! Make sure to use `print()` to ensure whitespace is displayed properly.

In [None]:
# print the first summary in crew
print(crew[0].prettify())

Look for tags that contain the actor/actress's name and the role that you want to extract. Then, use the `find_all` method on the crew object to pull out the html with those tags. Afterwards, don't forget to do some extra cleaning to isolate the names (get rid of unnecessary html), as you saw in the last video.

In [None]:
# Extract name
crew[0].find_all('p')[0].get_text().strip()

In [None]:
# Extract role
crew[0].find_all('p')[1].get_text().strip()

### Step 5: Collect names and roles of ALL member listings
Reuse your code from the previous step, but now in a loop to extract the name and role from every crew summary in `crew`!

In [None]:
name_role = []
for summary in crew:
    # append name and role of each summary to name_role list
    name = summary.find_all('p')[0].get_text().strip()
    role = summary.find_all('p')[1].get_text().strip()
    name_role.append((name, role))

In [None]:
# display results
print(len(name_role), "actors found in cast crew. Sample:")
name_role[:5]

# Normalization

In [None]:
text = "The first time you see The Second Renaissance it may look boring. Look at it at least twice and definitely watch part 2. It will change your view of the matrix. Are the human people the ones who started the war ? Is AI a bad thing ?"
print(text)

### Case Normalization

In [None]:
# Convert to lowercase
text = text.lower() 
print(text)

### Punctuation Removal
Use the `re` library to remove punctuation with a regular expression (regex). Feel free to refer back to the video or Google to get your regular expression. You can learn more about regex [here](https://docs.python.org/3/howto/regex.html).

In [None]:
import re

# Remove punctuation characters
text = re.sub(r"[^a-zA-Z0-9]", " ", text) 
print(text)
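
As a quick check of what this pattern does, the sketch below applies it to a short made-up string. The character class `[^a-zA-Z0-9]` matches any single character that is not an ASCII letter or digit, so each such character becomes one space. The sample string and the extra whitespace-collapsing step are assumptions added for illustration.

```python
import re

sample = "Wow!! Is A.I. really 100% safe?"

# Replace every character that is not an ASCII letter or digit with a space
no_punct = re.sub(r"[^a-zA-Z0-9]", " ", sample.lower())

# Optional extra step: collapse the runs of spaces left behind
collapsed = re.sub(r"\s+", " ", no_punct).strip()

print(repr(no_punct))
print(collapsed)
```

Note that accented letters such as `é` are also replaced, because the class lists only ASCII ranges.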

# Tokenization

### Note on NLTK data download
Run the cell below to download the necessary NLTK data packages. Because we are working in classroom workspaces, each notebook in this lesson downloads only the specific packages it needs. On your own computer, you can instead open the interactive downloader with `nltk.download()` and fetch every package, but keep in mind that this takes up more space. You can learn more about NLTK data installation [here](https://www.nltk.org/data.html).


In [None]:
import nltk
nltk.download('punkt')

In [None]:
# import statements
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize

In [None]:
text = "Dr. Smith graduated from the University of Washington. He later started an analytics firm called Lux, which catered to enterprise customers."
print(text)

In [None]:
# Split text into words using NLTK
words = word_tokenize(text)
print(words)

In [None]:
# Split text into sentences
sentences = sent_tokenize(text)
print(sentences)
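
To see why a trained sentence tokenizer is useful here, compare it with a naive rule that splits after every `.`, `!`, or `?` followed by whitespace. This is a deliberately simple sketch for contrast, not what `sent_tokenize` does internally; the helper name `naive_sent_split` is made up.

```python
import re

text = ("Dr. Smith graduated from the University of Washington. "
        "He later started an analytics firm called Lux, "
        "which catered to enterprise customers.")


def naive_sent_split(text):
    """Split after ., ! or ? when followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text)


for piece in naive_sent_split(text):
    print(repr(piece))
```

The naive rule treats the period in "Dr." as a sentence boundary, so it returns three pieces for this two-sentence text.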

# Stop Words
Combine the steps you learned so far to normalize, tokenize, and remove stop words from the text below.

In [None]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')

In [None]:
# import statements
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

In [None]:
text = "The first time you see The Second Renaissance it may look boring. Look at it at least twice and definitely watch part 2. It will change your view of the matrix. Are the human people the ones who started the war ? Is AI a bad thing ?"
print(text)

In [None]:
# Normalize text
text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())

# Tokenize text
words = word_tokenize(text)
print(words)

In [None]:
# Remove stop words (build the set once instead of reloading the list for every word)
stop_words = set(stopwords.words("english"))
words = [w for w in words if w not in stop_words]
print(words)

In [None]:
# Take a look at the stop words included in NLTK's corpus
print(stopwords.words("english"))
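
The order of the steps matters: stop word lists are typically lowercase, so the text should be lowercased before filtering. The sketch below shows this with a tiny hand-picked stop word set (an assumption for illustration, not NLTK's list) and plain `str.split()` tokenization.

```python
# Tiny illustrative stop word set; NLTK's English list is much longer
stop_words = {"the", "a", "is", "it", "at", "of"}

text = "The matrix is a view of The war"

# Filtering before lowercasing misses capitalized stop words such as "The"
raw_tokens = text.split()
kept_raw = [w for w in raw_tokens if w not in stop_words]

# Lowercasing first lets every stop word match
lower_tokens = text.lower().split()
kept_lower = [w for w in lower_tokens if w not in stop_words]

print(kept_raw)
print(kept_lower)
```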

## Parts of Speech (POS) Tagging*

In [None]:
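# import statements (pos_tag and ne_chunk also need NLTK data; package names can vary by NLTK version)
import nltk
from nltk import pos_tag, ne_chunk
from nltk.tokenize import word_tokenize
for pkg in ('averaged_perceptron_tagger', 'maxent_ne_chunker', 'words'):
    nltk.download(pkg)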

In [None]:
text = "I always lie down to tell a lie."

In [None]:
# tokenize text, then tag each token with its part of speech
pos_tag(word_tokenize(text))

## Named Entity Recognition (NER)*

In [None]:
# tokenize, pos tag, then recognize named entities in text
tree = ne_chunk(pos_tag(word_tokenize(text)))
print(tree)

## Sentence Parsing*

In [None]:

# Define a custom grammar
my_grammar = nltk.CFG.fromstring("""
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP | 'I'
VP -> V NP | VP PP
Det -> 'an' | 'my'
N -> 'elephant' | 'pajamas'
V -> 'shot'
P -> 'in'
""")
parser = nltk.ChartParser(my_grammar)

In [None]:
# Parse a sentence
sentence = word_tokenize("I shot an elephant in my pajamas")
for tree in parser.parse(sentence):
    print(tree)

# Stemming and Lemmatizing

In [None]:
import re
import nltk
nltk.download('stopwords')
nltk.download('wordnet') # download for lemmatization

In [None]:
from nltk.corpus import stopwords

In [None]:
text = "The first time you see The Second Renaissance it may look boring. Look at it at least twice and definitely watch part 2. It will change your view of the matrix. Are the human people the ones who started the war ? Is AI a bad thing ?"

# Normalize text
text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())

# Tokenize text
words = text.split()
print(words)

In [None]:
# Remove stop words (build the set once instead of reloading the list for every word)
stop_words = set(stopwords.words("english"))
words = [w for w in words if w not in stop_words]
print(words)

### Stemming
**Stemming** is the process of reducing a word to its stem or root form. For example, branching, branched, and branches all stem from the word branch.

This is a quick, rough process, so the result is sometimes not a complete word. For example, caching, cached, and caches all reduce to the stem "cach", which is not a word. That is acceptable as long as all words related to cache produce the same stem, because the stem still captures their common idea.
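
To make the idea concrete, here is a toy suffix-stripping stemmer. It is far cruder than the Porter algorithm and is not how NLTK stems words; the suffix list and the helper name `toy_stem` are made up for illustration.

```python
# Suffixes to strip, longest first (an illustrative list, not Porter's rules)
SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix if enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


for group in (["branching", "branched", "branches"],
              ["caching", "cached", "caches"]):
    print([toy_stem(w) for w in group])
```

Each group collapses to a single stem, and the second stem is not a dictionary word, which is exactly the trade-off described above.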

In [None]:
from nltk.stem.porter import PorterStemmer

# Reduce words to their stems (create the stemmer once and reuse it)
stemmer = PorterStemmer()
stemmed = [stemmer.stem(w) for w in words]
print(stemmed)

### Lemmatization
**Lemmatization** is the process of mapping a word back to its root form using a dictionary. For example, is, was, and were would all be lemmatized to "be".
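
Conceptually, lemmatization is a dictionary lookup rather than suffix stripping. The sketch below uses a tiny hand-written dictionary (an assumption for illustration; WordNet is far larger and also takes the part of speech into account) to show the idea. The helper name `toy_lemmatize` is made up.

```python
# Tiny illustrative lemma dictionary; real lemmatizers use a resource like WordNet
LEMMAS = {
    "is": "be",
    "was": "be",
    "were": "be",
    "mice": "mouse",
    "started": "start",
}


def toy_lemmatize(word):
    """Return the dictionary form of a word, or the word itself if unknown."""
    return LEMMAS.get(word, word)


print([toy_lemmatize(w) for w in ["is", "was", "were", "mice", "matrix"]])
```

Unlike a stem, every result here is a real word, at the cost of needing a dictionary that covers the vocabulary.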

In [None]:
from nltk.stem.wordnet import WordNetLemmatizer

# Reduce words to their root form (create the lemmatizer once and reuse it)
lemmatizer = WordNetLemmatizer()
lemmed = [lemmatizer.lemmatize(w) for w in words]
print(lemmed)

In [None]:
# Lemmatize verbs by specifying pos
lemmed = [lemmatizer.lemmatize(w, pos='v') for w in lemmed]
print(lemmed)

# Bag of Words and TF-IDF
Below, we'll look at three useful methods of vectorizing text.
- `CountVectorizer` - Bag of Words
- `TfidfTransformer` - TF-IDF values
- `TfidfVectorizer` - Bag of Words AND TF-IDF values

Let's first use an example from earlier and apply the text processing steps we saw in this lesson.

In [52]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [12]:
corpus = ["The first time you see The Second Renaissance it may look boring.",
        "Look at it at least twice and definitely watch part 2.",
        "It will change your view of the matrix.",
        "Are the human people the ones who started the war?",
        "Is AI a bad thing ?"]

In [15]:
stop_words = stopwords.words("english")
lemmatizer = WordNetLemmatizer()

Use the skills you learned so far to create a function `tokenize` that takes in a string of text and applies the following:
- case normalization (convert to all lowercase)
- punctuation removal
- tokenization, lemmatization, and stop word removal using `nltk`

Feel free to refer back to previous sections to complete these steps!

In [29]:
def tokenize(text):
    # normalize case and remove punctuation
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    
    # tokenize text
    tokens = word_tokenize(text)
    
    # lemmatize and remove stop words
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]

    return tokens

### `CountVectorizer` (Bag of Words)

In [30]:
from sklearn.feature_extraction.text import CountVectorizer

# initialize count vectorizer object
vect = CountVectorizer(tokenizer=tokenize)

In [31]:
# get counts of each token (word) in text data
X = vect.fit_transform(corpus)

In [33]:
# convert sparse matrix to numpy array to view
X.toarray()

array([[0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0,
        0, 0, 0],
       [1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1,
        0, 0, 1],
       [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        1, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0,
        0, 1, 0],
       [0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
        0, 0, 0]], dtype=int64)

In [53]:
# view token vocabulary and counts
vect.vocabulary_

{'2': 0,
 'ai': 1,
 'bad': 2,
 'boring': 3,
 'change': 4,
 'definitely': 5,
 'first': 6,
 'human': 7,
 'least': 8,
 'look': 9,
 'matrix': 10,
 'may': 11,
 'one': 12,
 'part': 13,
 'people': 14,
 'renaissance': 15,
 'second': 16,
 'see': 17,
 'started': 18,
 'thing': 19,
 'time': 20,
 'twice': 21,
 'view': 22,
 'war': 23,
 'watch': 24}
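
The vocabulary above maps each token to a column index, and each row of the count matrix records how often each token occurs in one document. Here is a minimal pure-Python sketch of that idea, using two short pre-tokenized documents (made up for illustration) rather than scikit-learn:

```python
from collections import Counter

# Two pre-tokenized documents (illustrative data)
docs = [
    ["look", "boring", "look"],
    ["watch", "part", "look"],
]

# Assign each distinct token a column index, in sorted order
all_tokens = sorted({token for doc in docs for token in doc})
vocabulary = {token: idx for idx, token in enumerate(all_tokens)}

# Build one row of counts per document, one column per vocabulary entry
matrix = []
for doc in docs:
    counts = Counter(doc)
    matrix.append([counts.get(token, 0) for token in all_tokens])

print(vocabulary)
print(matrix)
```

Most entries are zero even in this tiny example, which is why `CountVectorizer` returns a sparse matrix.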

### `TfidfTransformer`

In [42]:
from sklearn.feature_extraction.text import TfidfTransformer

# initialize tf-idf transformer object
transformer = TfidfTransformer(smooth_idf=False)

In [43]:
# use counts from count vectorizer results to compute tf-idf values
tfidf = transformer.fit_transform(X)

In [45]:
# convert sparse matrix to numpy array to view
tfidf.toarray()

array([[ 0.        ,  0.        ,  0.        ,  0.36419547,  0.        ,
         0.        ,  0.36419547,  0.        ,  0.        ,  0.26745392,
         0.        ,  0.36419547,  0.        ,  0.        ,  0.        ,
         0.36419547,  0.36419547,  0.36419547,  0.        ,  0.        ,
         0.36419547,  0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.39105193,  0.        ,  0.        ,  0.        ,  0.        ,
         0.39105193,  0.        ,  0.        ,  0.39105193,  0.28717648,
         0.        ,  0.        ,  0.        ,  0.39105193,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.39105193,  0.        ,  0.        ,  0.39105193],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.57735027,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.57735027,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0
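
These values can be reproduced by hand. With `smooth_idf=False`, scikit-learn's documented formula is idf(t) = ln(n / df(t)) + 1, where n is the number of documents and df(t) is the number of documents containing token t. Each row of tf × idf is then scaled to unit Euclidean length (the default `norm='l2'`). The sketch below applies this to token lists read off from the count matrix and vocabulary shown earlier; the helper name `tfidf_rows` is made up.

```python
import math
from collections import Counter

# Token lists read off from the count matrix and vocabulary shown earlier
docs = [
    ["boring", "first", "look", "may", "renaissance", "second", "see", "time"],
    ["2", "definitely", "least", "look", "part", "twice", "watch"],
    ["change", "matrix", "view"],
    ["human", "one", "people", "started", "war"],
    ["ai", "bad", "thing"],
]


def tfidf_rows(docs):
    """Compute l2-normalized tf-idf weights with idf = ln(n / df) + 1."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each token
    df = Counter(token for doc in docs for token in set(doc))
    rows = []
    for doc in docs:
        tf = Counter(doc)
        weights = {t: count * (math.log(n_docs / df[t]) + 1)
                   for t, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return rows


rows = tfidf_rows(docs)
print(round(rows[0]["boring"], 8), round(rows[0]["look"], 8))
print(round(rows[2]["matrix"], 8))
```

"look" gets a lower weight than "boring" in the first document because it appears in two documents, so its idf is smaller.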

### `TfidfVectorizer`
`TfidfVectorizer` = `CountVectorizer` + `TfidfTransformer`

In [50]:
from sklearn.feature_extraction.text import TfidfVectorizer

# initialize tf-idf vectorizer object
# (uses the default tokenizer; pass tokenizer=tokenize to reuse the function defined earlier)
vectorizer = TfidfVectorizer()

In [None]:
# compute bag of word counts and tf-idf values
X = vectorizer.fit_transform(corpus)

In [49]:
# convert sparse matrix to numpy array to view
X.toarray()

array([[ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.30298183,  0.        ,  0.        ,  0.30298183,  0.        ,
         0.        ,  0.20291046,  0.        ,  0.24444384,  0.        ,
         0.30298183,  0.        ,  0.        ,  0.        ,  0.        ,
         0.30298183,  0.30298183,  0.30298183,  0.        ,  0.40582093,
         0.        ,  0.30298183,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.30298183,  0.        ],
       [ 0.        ,  0.30015782,  0.        ,  0.60031564,  0.        ,
         0.        ,  0.        ,  0.30015782,  0.        ,  0.        ,
         0.        ,  0.20101919,  0.30015782,  0.24216544,  0.        ,
         0.        ,  0.        ,  0.        ,  0.30015782,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.30015782,  0.        ,  0.        ,
         0.30015782,  0.        ,  0.        ,  0.