# Natural Language Processing Pipelines

This notebook is part of my learning journey through Udacity's Natural Language Processing program, which really helped me learn both basic and advanced NLP topics.

![alt text](download-1.jpg)
The 3 stages of an NLP pipeline are: Text Processing > Feature Extraction > Modeling.

- **Text Processing:** Take raw input text, clean it, normalize it, and convert it into a form that is suitable for feature extraction.
- **Feature Extraction:** Extract and produce feature representations that are appropriate for the type of NLP task you are trying to accomplish and the type of model you are planning to use.
- **Modeling:** Design a statistical or machine learning model, fit its parameters to training data, use an optimization procedure, and then use it to make predictions about unseen data.

This process isn't always linear and may require additional steps.

# Text Processing
The first part of this notebook will explore the steps involved in **text processing**, the first stage of the NLP pipeline.

**Why Do We Need to Process Text?**

- **Extracting plain text:** Textual data can come from a wide variety of sources: the web, PDFs, word documents, speech recognition systems, book scans, etc. Your goal is to extract plain text that is free of any source specific markup or constructs that are not relevant to your task.
- **Reducing complexity:** Some features of our language like capitalization, punctuation, and common words such as a, of, and the, often help provide structure, but don't add much meaning. Sometimes it's best to remove them if that helps reduce the complexity of the procedures you want to apply later.

You'll prepare text data from different sources with the following text processing steps:

1. Cleaning to remove irrelevant items, such as HTML tags
2. Normalizing by converting to all lowercase and removing punctuation
3. Splitting text into words or tokens
4. Removing words that are too common, also known as stop words
5. Identifying different parts of speech and named entities
6. Converting words into their dictionary forms, using stemming and lemmatization

After performing these steps, your text will capture the essence of what was being conveyed in a form that is easier to work with.
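
As a rough preview, steps 1 to 4 above can be sketched with nothing but the standard library. This is a minimal sketch, not the NLTK-based code used later in the notebook: the `process` function and the tiny `STOP_WORDS` set are hypothetical names made up for illustration, and steps 5 and 6 are left out because they need a tagger and a lemmatizer.

```python
import re

# Hypothetical, deliberately tiny stop word list (for illustration only).
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to"}

def process(raw_html):
    # 1. Cleaning: replace anything that looks like an HTML tag with a space.
    text = re.sub(r"<[^>]+>", " ", raw_html)
    # 2. Normalizing: lowercase, then replace non-alphanumerics with spaces.
    text = re.sub(r"[^a-z0-9]", " ", text.lower())
    # 3. Tokenizing: split on runs of whitespace.
    tokens = text.split()
    # 4. Stop word removal.
    return [t for t in tokens if t not in STOP_WORDS]

tokens = process("<p>The Matrix is a <b>great</b> movie!</p>")
print(tokens)  # ['matrix', 'great', 'movie']
```

The regex-based tag removal is only good enough for toy input; real pages are better handled by an HTML parser, as in the cleaning activity below.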

> ## Cleaning: Udacity's Course Catalog

This activity uses Udacity's [course catalog page](https://www.udacity.com/courses/all).

In this activity, you're going to extract the following information from each course listing on the page:
1. The course name - e.g. "Data Analyst"
2. The school the course belongs to - e.g. "School of Data Science"

### Step 1: Get text from Udacity's course catalog web page
You can use the `requests` library to do this.

Outputting all the javascript, CSS, and text may overload the space available to load this notebook, so we omit a print statement here.

In [1]:
# import statements
import requests

In [2]:
# fetch web page
r = requests.get('https://www.udacity.com/courses/all')

### Step 2: Use BeautifulSoup to remove HTML tags

In [3]:
# Installing bs4
%pip install lxml
!pip install bs4

In [None]:
# import statements
from bs4 import BeautifulSoup

In [None]:
soup = BeautifulSoup(r.text,'lxml')

In [None]:
print(soup.get_text())

Learn the Latest Tech Skills; Advance Your Career | UdacityLearnSchoolsArtificial IntelligenceAutonomous SystemsBusinessCareer ResourcesCloud ComputingCybersecurityData ScienceExecutive LeadershipProgrammingProduct ManagementPopularData Engineering with AWSIntroduction to ProgrammingC++Business AnalyticsData AnalystFeaturedDeep Reinforcement LearningComputer VisionNatural Language ProcessingData Structure and AlgorithmsSensor Fusion EngineerCatalogBusinessOverviewResourcesCompare PlansGovernmentCancelCancelLog InJoin for Freeall ProgramsSort by Most PopularMost PopularHighest RatedRecently Updatedall ProgramsSort Sort by Most PopularMost PopularHighest RatedRecently UpdatedSchool data scienceautonomous systemsartificial intelligencebusinessprogramming and developmentexecutive leadershipproduct managementcybersecuritycloud computingCareer resourcesSkill Type to SearchLevel discoveryfluencybeginnerintermediateadvancedDuration hoursdaysweeksmonthsType coursenanodegree programFilter & Sort
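
In the output above, neighboring pieces of text run together ("UdacityLearnSchools...") because they are joined without a separator. BeautifulSoup's `get_text` accepts `separator` and `strip` arguments that help with this. The sketch below shows the same idea offline with the standard library's `html.parser` instead of BeautifulSoup, so it is a swapped-in technique rather than the notebook's method. The `TextExtractor` class, the `html_to_text` helper, and the sample HTML are all hypothetical.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join chunks with a space so adjacent elements do not run together.
    return " ".join(parser.chunks)

# Hand-written sample page (hypothetical, not the real catalog markup).
sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Catalog</h1><p>Data <b>Analyst</b></p>"
    "<script>var x = 1;</script></body></html>"
)
print(html_to_text(sample))  # Catalog Data Analyst
```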

### Step 3: Find all course summaries
Use BeautifulSoup's `find_all` method to select elements by tag type and class name. Just like in the video, you can right-click an item on the web page and click "Inspect" to view its HTML.

In [None]:
# Find all course summaries
summaries = soup.find_all("li",{"class":"catalog-card-list_catalogCardListItem__AZBy6"})
print('Number of Courses:', len(summaries))


Number of Courses: 0


### Step 4: Inspect the first summary to find selectors for the course name and school

In [None]:
# I got an "index out of range" error here, so I added the if/else check; it seems the Udacity course catalog page was updated.

if summaries:
    print(summaries[0].prettify())
else:
    print("No summaries found.")


No summaries found.


Look for selectors that contain the course title and school name text you want to extract. Then use the `select_one` method on the summary object to pull out the HTML matching those selectors. Afterwards, don't forget to do some extra cleaning to isolate the names (get rid of unnecessary HTML).

In [None]:
# I got an "index out of range" error here, so I added the if/else check; it seems the Udacity course catalog page was updated.

if summaries:
    # print explicitly: a bare expression inside an if block is not displayed
    print(summaries[0].select_one("h2").get_text().strip())
else:
    print("No summaries found.")


No summaries found.


In [None]:
# I got an "index out of range" error here, so I added the if/else check; it seems the Udacity course catalog page was updated.

if summaries:
    # print explicitly: a bare expression inside an if block is not displayed
    print(summaries[0].select_one("h3").get_text().strip())
else:
    print("No summaries found.")


No summaries found.


### Step 5: Collect names and schools of ALL course listings
Reuse your code from the previous step, but now in a loop to extract the name and school from every course summary in `summaries`!

In [None]:
courses = []
for summary in summaries:
    # append name and school of each summary to courses list
    title = summary.select_one("h2").get_text().strip()
    school = summary.select_one("h3").get_text().strip()
    courses.append((title, school))

In [None]:
# display results
print(len(courses), "course summaries found. Sample:")
courses[:5]

0 course summaries found. Sample:


[]
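
Because the class name used above no longer matches the live page, the loop finds nothing. To check the extraction logic without depending on the site, one option is to run it against a small hand-written snippet. The sketch below does this with the standard library's `html.parser` rather than BeautifulSoup, so it is an offline stand-in for the steps above, not the same code. The `CourseParser` class, the `extract_courses` helper, and the sample markup (including the second school name) are assumptions made for illustration.

```python
from html.parser import HTMLParser

class CourseParser(HTMLParser):
    """Collect (title, school) pairs from the <h2>/<h3> inside each <li>."""

    def __init__(self):
        super().__init__()
        self.courses = []
        self._current = None  # text collected for the <li> being read
        self._field = None    # "h2" or "h3" while inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._current = {"h2": "", "h3": ""}
        elif tag in ("h2", "h3") and self._current is not None:
            self._field = tag

    def handle_data(self, data):
        if self._field is not None and self._current is not None:
            self._current[self._field] += data

    def handle_endtag(self, tag):
        if tag in ("h2", "h3"):
            self._field = None
        elif tag == "li" and self._current is not None:
            title = self._current["h2"].strip()
            school = self._current["h3"].strip()
            # Skip list items that lack either piece of text.
            if title and school:
                self.courses.append((title, school))
            self._current = None
            self._field = None

def extract_courses(html):
    parser = CourseParser()
    parser.feed(html)
    parser.close()
    return parser.courses

# Hand-written sample (hypothetical markup, not copied from the real page).
sample_html = """
<ul>
  <li class="card"><h2> Data Analyst </h2><h3>School of Data Science</h3></li>
  <li class="card"><h2>Introduction to Programming</h2><h3>School of Programming</h3></li>
  <li class="card"><h2>Untitled</h2></li>
</ul>
"""
courses = extract_courses(sample_html)
print(len(courses), "course summaries found.")  # 2 course summaries found.
print(courses)
```

The third list item has no `<h3>`, so it is skipped instead of raising an error, which is the same kind of guard the if/else checks above provide.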

# Normalization

In [None]:
text = "The first time you see The Second Renaissance it may look boring. Look at it at least twice and definitely watch part 2. It will change your view of the matrix. Are the human people the ones who started the war ? Is AI a bad thing ?"
print(text)

The first time you see The Second Renaissance it may look boring. Look at it at least twice and definitely watch part 2. It will change your view of the matrix. Are the human people the ones who started the war ? Is AI a bad thing ?


### Case Normalization

In [None]:
# Convert to lowercase
text = text.lower()
print(text)

the first time you see the second renaissance it may look boring. look at it at least twice and definitely watch part 2. it will change your view of the matrix. are the human people the ones who started the war ? is ai a bad thing ?


### Punctuation Removal
Use the `re` library to remove punctuation with a regular expression (regex). Feel free to refer to Google to find your regular expression. You can learn more about regex [here](https://docs.python.org/3/howto/regex.html).

In [None]:
# Remove punctuation characters
import re
text = re.sub(r"[^a-zA-Z0-9]"," ",text)
print(text)

the first time you see the second renaissance it may look boring  look at it at least twice and definitely watch part 2  it will change your view of the matrix  are the human people the ones who started the war   is ai a bad thing  
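
Each punctuation character is replaced by a space, which is why the output above contains double spaces and trailing whitespace. Tokenizers generally ignore this, but if you want a tidy string you can collapse the runs afterwards. A minimal sketch, using a shortened sample sentence of my own:

```python
import re

text = "It may look boring. Look at it twice! Is AI a bad thing ?"

# Replace every non-alphanumeric character with a space (as in the cell above).
no_punct = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())

# Collapse runs of whitespace into one space and trim both ends.
collapsed = re.sub(r"\s+", " ", no_punct).strip()

print(repr(no_punct))
print(repr(collapsed))  # 'it may look boring look at it twice is ai a bad thing'
```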


# Tokenization

In [None]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\hisha\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

In [None]:
# import statements
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize

In [None]:
text = "Dr. Smith graduated from the University of Washington. He later started an analytics firm called Lux, which catered to enterprise customers."
print(text)

Dr. Smith graduated from the University of Washington. He later started an analytics firm called Lux, which catered to enterprise customers.


In [None]:
# Split text into words using NLTK
words = word_tokenize(text)
print(words)

['Dr.', 'Smith', 'graduated', 'from', 'the', 'University', 'of', 'Washington', '.', 'He', 'later', 'started', 'an', 'analytics', 'firm', 'called', 'Lux', ',', 'which', 'catered', 'to', 'enterprise', 'customers', '.']


In [None]:
# Split text into sentences using NLTK
sentences = sent_tokenize(text)
print(sentences)

['Dr. Smith graduated from the University of Washington.', 'He later started an analytics firm called Lux, which catered to enterprise customers.']
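
To see why a dedicated tokenizer is worth using, compare the results above with naive string splitting on the same text. This sketch uses only built-in string methods; the variable names are mine.

```python
text = (
    "Dr. Smith graduated from the University of Washington. "
    "He later started an analytics firm called Lux, which catered to enterprise customers."
)

# Naive sentence split: breaks after "Dr" because of the abbreviation's period.
naive_sentences = text.split(". ")
print(len(naive_sentences))  # 3
print(naive_sentences[0])    # Dr

# Naive word split: punctuation stays glued to the neighboring word.
naive_words = text.split()
print(naive_words[-1])       # customers.
```

`sent_tokenize` keeps "Dr. Smith ..." as one sentence, and `word_tokenize` separates the trailing comma and period into their own tokens.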


# Stop Words

In [None]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\hisha\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [None]:
# import statements
from nltk.corpus import stopwords
import re
from nltk.tokenize import word_tokenize


In [None]:
text = "The first time you see The Second Renaissance it may look boring. Look at it at least twice and definitely watch part 2. It will change your view of the matrix. Are the human people the ones who started the war ? Is AI a bad thing ?"
print(text)

The first time you see The Second Renaissance it may look boring. Look at it at least twice and definitely watch part 2. It will change your view of the matrix. Are the human people the ones who started the war ? Is AI a bad thing ?


In [None]:
# Normalize text
text = text.lower()
text = re.sub(r"[^a-zA-Z0-9]"," ",text)


In [None]:
# Tokenize text
words = word_tokenize(text)
print(words)

['the', 'first', 'time', 'you', 'see', 'the', 'second', 'renaissance', 'it', 'may', 'look', 'boring', 'look', 'at', 'it', 'at', 'least', 'twice', 'and', 'definitely', 'watch', 'part', '2', 'it', 'will', 'change', 'your', 'view', 'of', 'the', 'matrix', 'are', 'the', 'human', 'people', 'the', 'ones', 'who', 'started', 'the', 'war', 'is', 'ai', 'a', 'bad', 'thing']


In [None]:
# remove stop words (build the set once instead of reloading the list for every word)
stop_words = set(stopwords.words("english"))
words = [w for w in words if w not in stop_words]
print(words)

['first', 'time', 'see', 'second', 'renaissance', 'may', 'look', 'boring', 'look', 'least', 'twice', 'definitely', 'watch', 'part', '2', 'change', 'view', 'matrix', 'human', 'people', 'ones', 'started', 'war', 'ai', 'bad', 'thing']


In [None]:
print(stopwords.words("english"))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

## Parts of Speech (POS) Tagging


In [None]:
nltk.download('words')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')

[nltk_data] Downloading package words to
[nltk_data]     C:\Users\hisha\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\hisha\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\hisha\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     C:\Users\hisha\AppData\Roaming\nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!


True

In [None]:
#  import statements
from nltk import pos_tag
from nltk import ne_chunk

In [None]:
text = "I always lie down to tell a lie."

In [None]:
# tokenize text
sentence = word_tokenize(text)

# tag each word with part of speech
pos_tag(sentence)

[('I', 'PRP'),
 ('always', 'RB'),
 ('lie', 'VBP'),
 ('down', 'RP'),
 ('to', 'TO'),
 ('tell', 'VB'),
 ('a', 'DT'),
 ('lie', 'NN'),
 ('.', '.')]

## Named Entity Recognition (NER)

In [None]:
text = "Antonio joined Udacity Inc. in California."

In [None]:
# tokenize, pos tag, then recognize named entities in text
tree = ne_chunk(pos_tag(word_tokenize(text)))
print(tree)

(S
  (PERSON Antonio/NNP)
  joined/VBD
  (ORGANIZATION Udacity/NNP Inc./NNP)
  in/IN
  (GPE California/NNP)
  ./.)


## Sentence Parsing

In [None]:
# Define a custom grammar
my_grammar = nltk.CFG.fromstring("""
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP | 'I'
VP -> V NP | VP PP
Det -> 'an' | 'my'
N -> 'elephant' | 'pajamas'
V -> 'shot'
P -> 'in'
""")
parser = nltk.ChartParser(my_grammar)

In [None]:
# Parse a sentence
sentence = word_tokenize("I shot an elephant in my pajamas")
for tree in parser.parse(sentence):
    print(tree)

(S
  (NP I)
  (VP
    (VP (V shot) (NP (Det an) (N elephant)))
    (PP (P in) (NP (Det my) (N pajamas)))))
(S
  (NP I)
  (VP
    (V shot)
    (NP (Det an) (N elephant) (PP (P in) (NP (Det my) (N pajamas))))))


## Stemming and Lemmatization


In [None]:
nltk.download('wordnet') # download for lemmatization

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\hisha\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [None]:
text = "The first time you see The Second Renaissance it may look boring. Look at it at least twice and definitely watch part 2. It will change your view of the matrix. Are the human people the ones who started the war ? Is AI a bad thing ?"

# Normalize text
text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())

# Tokenize text
words = text.split()
print(words)

['the', 'first', 'time', 'you', 'see', 'the', 'second', 'renaissance', 'it', 'may', 'look', 'boring', 'look', 'at', 'it', 'at', 'least', 'twice', 'and', 'definitely', 'watch', 'part', '2', 'it', 'will', 'change', 'your', 'view', 'of', 'the', 'matrix', 'are', 'the', 'human', 'people', 'the', 'ones', 'who', 'started', 'the', 'war', 'is', 'ai', 'a', 'bad', 'thing']


In [None]:
# Remove stop words (build the set once instead of reloading the list for every word)
stop_words = set(stopwords.words("english"))
words = [w for w in words if w not in stop_words]
print(words)

['first', 'time', 'see', 'second', 'renaissance', 'may', 'look', 'boring', 'look', 'least', 'twice', 'definitely', 'watch', 'part', '2', 'change', 'view', 'matrix', 'human', 'people', 'ones', 'started', 'war', 'ai', 'bad', 'thing']


### Stemming

In [None]:
from nltk.stem.porter import PorterStemmer

# Reduce words to their stems (create the stemmer once and reuse it)
stemmer = PorterStemmer()
stemmed = [stemmer.stem(w) for w in words]
print(stemmed)

['first', 'time', 'see', 'second', 'renaiss', 'may', 'look', 'bore', 'look', 'least', 'twice', 'definit', 'watch', 'part', '2', 'chang', 'view', 'matrix', 'human', 'peopl', 'one', 'start', 'war', 'ai', 'bad', 'thing']


### Lemmatization

In [None]:
from nltk.stem.wordnet import WordNetLemmatizer

# Reduce words to their dictionary form (the default pos is noun)
lemmatizer = WordNetLemmatizer()
lemmed = [lemmatizer.lemmatize(w) for w in words]
print(lemmed)

['first', 'time', 'see', 'second', 'renaissance', 'may', 'look', 'boring', 'look', 'least', 'twice', 'definitely', 'watch', 'part', '2', 'change', 'view', 'matrix', 'human', 'people', 'one', 'started', 'war', 'ai', 'bad', 'thing']


In [None]:
# Lemmatize verbs by specifying pos
lemmed = [lemmatizer.lemmatize(w, pos='v') for w in lemmed]
print(lemmed)

['first', 'time', 'see', 'second', 'renaissance', 'may', 'look', 'bore', 'look', 'least', 'twice', 'definitely', 'watch', 'part', '2', 'change', 'view', 'matrix', 'human', 'people', 'one', 'start', 'war', 'ai', 'bad', 'thing']


![alt text](<download (1)-1.jpg>)

# Feature Extraction

## Bag of Words and TF-IDF
Below, we'll look at three useful methods of vectorizing text.
- `CountVectorizer` - Bag of Words
- `TfidfTransformer` - TF-IDF values
- `TfidfVectorizer` - Bag of Words AND TF-IDF values
 

In [None]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\hisha\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\hisha\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\hisha\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [None]:
corpus = ["The first time you see The Second Renaissance it may look boring.",
        "Look at it at least twice and definitely watch part 2.",
        "It will change your view of the matrix.",
        "Are the human people the ones who started the war?",
        "Is AI a bad thing ?"]

In [None]:
stop_words = stopwords.words("english")
lemmatizer = WordNetLemmatizer()

In [None]:
def tokenize(text):
    # normalize case and remove punctuation
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    
    # tokenize text
    tokens = word_tokenize(text)
    
    # lemmatize and remove stop words
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]

    return tokens

## `CountVectorizer` (Bag of Words)

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# initialize count vectorizer object
vect = CountVectorizer(tokenizer=tokenize)

ModuleNotFoundError: No module named 'sklearn'

In [None]:
# get counts of each token (word) in text data
X = vect.fit_transform(corpus)

In [None]:
# convert sparse matrix to numpy array to view
X.toarray()

In [None]:
# view token vocabulary (maps each token to its column index, not its count)
vect.vocabulary_
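
To make the count matrix concrete, here is a pure-Python sketch of the bag-of-words idea that runs without scikit-learn or NLTK. It is a simplified stand-in, not what `CountVectorizer` does internally: `simple_tokenize` is a hypothetical helper that skips stop word removal and lemmatization, and only three sentences from the corpus are used.

```python
import re
from collections import Counter

def simple_tokenize(text):
    # Simplified stand-in for the notebook's tokenize(): normalize and split only.
    return re.sub(r"[^a-zA-Z0-9]", " ", text.lower()).split()

corpus = [
    "Look at it at least twice and definitely watch part 2.",
    "It will change your view of the matrix.",
    "Is AI a bad thing ?",
]

tokenized = [simple_tokenize(doc) for doc in corpus]

# Map each distinct token to a column index, in sorted order.
all_tokens = sorted({tok for doc in tokenized for tok in doc})
vocabulary = {tok: i for i, tok in enumerate(all_tokens)}

def count_vector(tokens):
    counts = Counter(tokens)
    # One entry per vocabulary term, in column order.
    return [counts[tok] for tok in vocabulary]

X = [count_vector(tokens) for tokens in tokenized]

print(len(vocabulary))        # 22
print(vocabulary["at"])       # 4
print(X[0][vocabulary["at"]]) # 2
```

Each row of `X` is one document and each column is one vocabulary term, so the word "at", which occurs twice in the first sentence, gives a 2 in that row.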

## `TfidfTransformer`

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer

# initialize tf-idf transformer object
transformer = TfidfTransformer(smooth_idf=False)

In [None]:
# use counts from count vectorizer results to compute tf-idf values
tfidf = transformer.fit_transform(X)

In [None]:
# convert sparse matrix to numpy array to view
tfidf.toarray()
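
For intuition about what the transformer computes, the sketch below reimplements the weighting in plain Python. It assumes the formula described in the scikit-learn documentation for `smooth_idf=False`, namely idf(t) = ln(n / df(t)) + 1, followed by L2 normalization of each row (the default `norm`). The `tfidf_rows` function and the small count matrix are hypothetical, and this is an approximation of the behavior for illustration, not the library's implementation.

```python
import math

def tfidf_rows(counts):
    """Turn a list of count rows into L2-normalized tf-idf rows."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency: in how many documents each term appears.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    # Unsmoothed idf, with 1 added so terms found in every document keep weight.
    idf = [math.log(n_docs / df[j]) + 1 if df[j] else 0.0 for j in range(n_terms)]
    rows = []
    for row in counts:
        weighted = [row[j] * idf[j] for j in range(n_terms)]
        norm = math.sqrt(sum(w * w for w in weighted))
        rows.append([w / norm if norm else 0.0 for w in weighted])
    return rows

# Hypothetical counts: 2 documents, 3 terms.
counts = [
    [2, 1, 0],
    [0, 1, 1],
]
for row in tfidf_rows(counts):
    print([round(v, 3) for v in row])
```

The middle term appears in both documents, so its idf is ln(2/2) + 1 = 1, while the other two terms get the larger weight ln(2/1) + 1.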

## `TfidfVectorizer`
`TfidfVectorizer` = `CountVectorizer` + `TfidfTransformer`

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# initialize tf-idf vectorizer object
vectorizer = TfidfVectorizer()

In [None]:
# compute bag of word counts and tf-idf values
X = vectorizer.fit_transform(corpus)

In [None]:
# convert sparse matrix to numpy array to view
X.toarray()

# Modeling
The final stage of the NLP pipeline is modeling, which includes designing a statistical or machine learning model, fitting its parameters to training data, using an optimization procedure, and then using it to make predictions about unseen data.

The nice thing about working with numerical features is that it allows you to choose from all machine learning models or even a combination of them.
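
As one small illustration of "fit, then predict", here is a sketch of a multinomial Naive Bayes classifier with add-one smoothing, written in plain Python. It is only one of many possible models, its parameters come from counting rather than from an iterative optimizer, and the training data, labels, and function names are all made up for this example.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels):
    """Fit per-class word counts from tokenized documents."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict(model, tokens):
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        # Start from the log prior of the class.
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            if tok not in vocab:
                continue  # ignore words never seen during training
            # Add-one (Laplace) smoothed log likelihood of the word.
            count = word_counts[label][tok]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical, already-tokenized training data.
train_docs = [["great", "movie"], ["loved", "matrix"], ["boring", "plot"], ["bad", "movie"]]
train_labels = ["positive", "positive", "negative", "negative"]

model = train_naive_bayes(train_docs, train_labels)
print(predict(model, ["boring", "bad", "thing"]))  # negative
print(predict(model, ["great", "matrix"]))         # positive
```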

Once you have a working model, you can deploy it as a web app, mobile app, or integrate it with other products and services. The possibilities are endless!