# <font color="maroon"> 1.0 NLP Toolkits and Preprocessing Techniques </font>

### NLP Toolkits
- Python libraries for natural language processing

### Text Preprocessing Techniques
- Converting text to a meaningful format for analysis
- Preprocessing and cleaning text

## Code: How to Install NLTK

### Command Line
pip install nltk

### Jupyter Notebook
import nltk
nltk.download()

# opens the interactive NLTK downloader, where you can select all data & models
# downloading everything will take a while



## Sample Text Data

**Hi Mr. Smith! I am going to buy some vegetables (tomatoes and cucumbers) from the
store. Should I pick up some black-eyed peas as well?**

Text data is messy.

To analyze this data, we need to preprocess the text.


![](https://i.imgur.com/pt5p6Hb.png)

# Code: Tokenization (Words)

In [None]:
import nltk

In [None]:
from nltk.tokenize import word_tokenize

my_text = "Hi Mr. Smith! I’m going to buy some vegetables \
(2 tomatoes and 4 cucumbers) from the store. Should I pick up some black-eyed peas as well?"

print(word_tokenize(my_text)) # print function requires Python 3

# Code: Tokenization (Sentences)

In [None]:
from nltk.tokenize import sent_tokenize

my_text = "Hi Mr. Smith! I’m going to buy some vegetables \
(2 tomatoes and 4 cucumbers) from the store. Should I pick up some black-eyed peas as well?"

print(sent_tokenize(my_text))

# Code: Tokenization (Regular Expressions)

![](https://i.imgur.com/3L6x92C.png)
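Since this step is shown only as an image, here is a minimal sketch of regular-expression tokenization using the standard-library `re` module. The function name `regex_tokenize` and the pattern are assumptions chosen for illustration, not taken from the slide; NLTK's `RegexpTokenizer` serves the same purpose.

```python
import re

def regex_tokenize(text, pattern=r"[A-Za-z]+(?:[-'’][A-Za-z]+)*|\d+"):
    """Return every non-overlapping match of `pattern` as a token.

    The default pattern keeps words (including hyphenated or
    apostrophe-joined ones) and runs of digits; punctuation is dropped.
    """
    return re.findall(pattern, text)

sample = "Should I pick up 2 black-eyed peas as well?"
tokens = regex_tokenize(sample)
print(tokens)
```

Changing the pattern changes what counts as a token, which is the main reason to reach for a regular expression instead of a fixed tokenizer.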

# Code: Remove Punctuation

In [None]:
import re # Regular expression library
import string
# Replace punctuations with a white space
#clean_text = re.sub('[%s]' % re.escape(string.punctuation), ' ', my_text)
#clean_text

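# Remove every character that is not a word character or whitespace
# (raw string avoids invalid-escape warnings)
s = re.sub(r'[^\w\s]', '', my_text)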
s

# Code: Make All Text Lowercase

In [None]:
clean_text = s.lower()
clean_text

# Code: Remove Numbers

In [None]:
# Removes every digit character
# (use r'\w*\d\w*' instead to drop whole words that contain digits)
clean_text = re.sub(r'\d', '', clean_text)
clean_text
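The three cleaning steps above (punctuation, lowercase, numbers) can be chained into one helper. This is a sketch under stated assumptions: the name `clean_text_pipeline` and the final whitespace-collapsing step are additions for illustration, not part of the original notebook.

```python
import re

def clean_text_pipeline(text):
    """Strip punctuation, lowercase, drop digits, then collapse spaces."""
    text = re.sub(r"[^\w\s]", "", text)        # remove punctuation
    text = text.lower()                        # make all text lowercase
    text = re.sub(r"\d", "", text)             # remove digit characters
    return re.sub(r"\s+", " ", text).strip()   # tidy leftover whitespace

cleaned = clean_text_pipeline("Hi Mr. Smith! I'm buying 2 tomatoes.")
print(cleaned)
```

Note that `\w` also matches the underscore, so underscores survive the punctuation step.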

# <font color='blue'>Preprocessing: Stop Words</font>

![](https://i.imgur.com/T5RJXrX.png)

# Code: Stop Words

from nltk.corpus import stopwords
set(stopwords.words('english'))
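To see what stop-word removal does to a token list, here is a minimal sketch in plain Python. The `STOP_WORDS` set below is a hypothetical, hand-picked subset for illustration; in practice you would pass in the much longer list returned by `stopwords.words('english')`.

```python
# Hypothetical mini stop-word list; NLTK's English list is much longer.
STOP_WORDS = {"i", "am", "to", "some", "and", "from", "the", "up", "as", "should"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only tokens whose lowercase form is not a stop word."""
    return [token for token in tokens if token.lower() not in stop_words]

tokens = "I am going to buy some vegetables from the store".split()
filtered = remove_stop_words(tokens)
print(filtered)
```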

# Code: Remove Stop Words

<a href="https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html">CountVectorizer</a>

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

my_text = ["Hi Mr. Smith! I’m going to buy some vegetables \
(2 tomatoes and 3 cucumbers) from the store. Should I pick up some black-eyed peas as well?"]
           
# Incorporate stop words when creating the count vectorizer
cv = CountVectorizer(stop_words='english') 
X = cv.fit_transform(my_text)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
pd.DataFrame(X.toarray(), columns=cv.get_feature_names_out())

![](https://i.imgur.com/9qllh8j.png)

# Code: Stemming

In [None]:
from nltk.stem.lancaster import LancasterStemmer
stemmer = LancasterStemmer()

# Try some stems
print('drive: {}'.format(stemmer.stem('drive')))
print('drives: {}'.format(stemmer.stem('drives')))
print('driver: {}'.format(stemmer.stem('driver')))
print('drivers: {}'.format(stemmer.stem('drivers')))
print('driven: {}'.format(stemmer.stem('driven')))

![](https://i.imgur.com/8edVsCR.png)

# Code: Lemmatization

In [None]:
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
lemmatizer=WordNetLemmatizer()

input_str="been had done languages cities mice"
input_str=word_tokenize(input_str)
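# lemmatize() treats each word as a noun unless told otherwise;
# pass pos='v' to reduce verb forms such as "been" or "done"
for word in input_str:
    print(lemmatizer.lemmatize(word))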

# Code: Parts of Speech Tagging

In [None]:
from nltk.tag import pos_tag

my_text = "James Smith lives in the United States."

tokens = pos_tag(word_tokenize(my_text))
print(tokens)

## Named Entity Recognition

In [None]:
from nltk.chunk import ne_chunk
my_text = "James Smith lives in the United States."
tokens = pos_tag(word_tokenize(my_text)) # this labels each word as a part of speech
entities = ne_chunk(tokens) # this extracts entities from the list of words
entities.draw()

# <font color="blue"> Preprocessing: Compound Term Extraction </font>

![](https://i.imgur.com/q1WuWai.png)

# Code: Compound Term Extraction

In [None]:
from nltk.tokenize import MWETokenizer # multi-word expression

my_text = "You all are the greatest students of all time."

mwe_tokenizer = MWETokenizer([('You','all'), ('of', 'all', 'time')])
mwe_tokens = mwe_tokenizer.tokenize(word_tokenize(my_text))

mwe_tokens
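The idea behind compound term extraction can be sketched without NLTK: scan the token list and join any run that matches a known multi-word expression. The function `merge_compounds` is a hypothetical helper written for this illustration; it is a simple greedy matcher, not NLTK's actual implementation.

```python
def merge_compounds(tokens, compounds, separator="_"):
    """Greedily join token runs that match a known multi-word expression."""
    # Try longer expressions first so a three-word match beats a two-word one.
    ordered = sorted((c for c in compounds if c), key=len, reverse=True)
    merged, i = [], 0
    while i < len(tokens):
        for expr in ordered:
            if tuple(tokens[i:i + len(expr)]) == tuple(expr):
                merged.append(separator.join(expr))
                i += len(expr)
                break
        else:
            merged.append(tokens[i])  # no expression starts at this position
            i += 1
    return merged

tokens = ["You", "all", "are", "the", "greatest", "students", "of", "all", "time", "."]
merged = merge_compounds(tokens, [("You", "all"), ("of", "all", "time")])
print(merged)
```

Matching is case-sensitive here, so `('You', 'all')` would not match the tokens `you all`.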

![](https://i.imgur.com/HpgLFOT.png)

# Basic Pandas Functionality

In [None]:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(6,4))
data = pd.read_csv('cookie_reviews.csv')
#data

# Selecting top and bottom rows:
# head() returns the first n rows (n=5 by default)
s = data.head()
# tail() returns the last n rows (n=5 by default)
e = data.tail()

print(s)
print(e)

#Selecting columns:
#data['column_name'] 
#or data.column_name
col_names=data.columns
print("\n column names = ",col_names)
print ("\n \n")
#Selecting by indexer:
data.iloc[0] #- first row of data frame
#data.iloc[-1] #- last row of data frame
#data.iloc[:,0] #- first column of data frame
#data.iloc[:,-1] #- last column of data frame
#data.iloc[0,1] #- first row, second column of the data frame
#data.iloc[0:4, 3:5] #- first 4 rows and the 4th and 5th columns (positions 3 and 4) of data frame

![](https://i.imgur.com/w9gWcfX.png)

In [None]:
# Basic example
square_me=lambda x: x*x

my_numbers=[9, 3, 4, 100, 2, 1]
my_numbers_squared = list(map(square_me, my_numbers))#map=applies a function to all the items in an input_list
print(my_numbers_squared)

# <font color=red>Preprocessing Exercise </font>



# Introduction

We will be using review data from Kaggle to practice preprocessing text data. The dataset contains user reviews for many products, but today we'll be focusing on the product in the dataset that had the most reviews - an oatmeal cookie.

The following code will help you load in the data. If this is your first time using nltk, you'll need to pip install it first.


In [None]:
import nltk
# nltk.download() <-- Run this if it's your first time using nltk to download all of the datasets and models

import pandas as pd

In [None]:
df = pd.read_csv('cookie_reviews.csv')
df.head()

**Question 1:**

Determine how many reviews there are in total.
   

**Question 2:**
    
Determine the percentage of 1, 2, 3, 4 and 5 star reviews.

**Question 3:**

(a) Remove stop words

(b) Change to lower case

(c) Perform stemming

# <font color="maroon"> 2.0 Text Similarity Measures </font>

- Measure the distance between two strings

## 2.1 Applications
- Information retrieval
- Text classification
- Document clustering
- Topic Modeling
- Matrix decomposition

To measure the word similarity, we use **<font color="blue"> Levenshtein distance </font>**.
- Minimum number of operations to get from one word to another.

![](https://i.imgur.com/FkdJmPi.png)
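As a concrete illustration of "minimum number of operations", here is a minimal dynamic-programming sketch of Levenshtein distance in plain Python. The function name `levenshtein` is an assumption for this example; it counts single-character insertions, deletions, and substitutions.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))    # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]                     # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]

print(levenshtein("kitten", "sitting"))  # 3
```

Only two rows of the table are kept at a time, so memory grows with the length of `b` rather than with the product of both lengths.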

# TextBlob

### Another toolkit other than NLTK

- Wraps around NLTK and makes it easier to use

### TextBlob capabilities

- Tokenization
- Parts of speech tagging
- Sentiment analysis
- Spell check


# TextBlob Demo: Tokenization

In [None]:
#pip install textblob

from textblob import TextBlob
my_text = TextBlob("We're moving from NLTK to TextBlob. How fun!")
my_text.words

# TextBlob Demo: Spell Check

In [None]:
blob = TextBlob("I'm graat at speling.")
print(blob.correct()) # print function requires Python 3

<font color="blue">

## How does the correct function work?

- Calculates the Levenshtein distance between the word ‘graat’ and all words in its word list
- Of the words with the smallest Levenshtein distance, it outputs the most popular word

</font>
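The two steps described above can be sketched as follows. This is an illustration of the idea only, not TextBlob's actual code: the `WORD_COUNTS` dictionary and its counts are made up, and `edit_distance` and `correct_word` are hypothetical names.

```python
from functools import lru_cache

# Hypothetical word list with made-up popularity counts, for illustration only.
WORD_COUNTS = {"great": 900, "grant": 300, "goat": 150, "spelling": 400, "spewing": 20}

@lru_cache(maxsize=None)
def edit_distance(a, b):
    """Recursive Levenshtein distance, memoized so repeated calls stay cheap."""
    if not a or not b:
        return len(a) + len(b)
    if a[0] == b[0]:
        return edit_distance(a[1:], b[1:])
    return 1 + min(edit_distance(a[1:], b),      # delete from a
                   edit_distance(a, b[1:]),      # insert into a
                   edit_distance(a[1:], b[1:]))  # substitute

def correct_word(word, word_counts=WORD_COUNTS):
    """Pick the closest known word; break ties by popularity."""
    distances = {known: edit_distance(word, known) for known in word_counts}
    smallest = min(distances.values())
    closest = [known for known, d in distances.items() if d == smallest]
    return max(closest, key=word_counts.get)

print(correct_word("graat"), correct_word("speling"))
```

Here "graat" is one edit away from both "great" and "grant", so the popularity count decides between them.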

# TextBlob Demo: Tagging

In [None]:
blob = TextBlob("John hits the ball.")
for word, tag in blob.tags:
    print(word, tag)

# TextBlob Demo: Language Detection and Translation

In [None]:
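# Note: detect_language() and translate() call the Google Translate web service
# and were deprecated in TextBlob 0.16; newer releases may not provide them.
word = TextBlob("Bonjour, comment allez-vous ")
word.detect_language()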


In [None]:
word.translate(from_lang='fr', to='en')

# Text Format for Analysis: Count Vectorizer

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
corpus = ['This is the first document.', 'This is the second document.', 'And the third one. One is fun.'] # corpus = collection of texts
cv = CountVectorizer()
X = cv.fit_transform(corpus)
pd.DataFrame(X.toarray(), columns=cv.get_feature_names_out())

![](https://i.imgur.com/OQDeQlb.png)
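To make the document-term matrix less of a black box, here is a minimal pure-Python sketch of what a count vectorizer computes. It assumes lowercasing and the "two or more word characters" token rule that `CountVectorizer` uses by default; the function name `count_matrix` is hypothetical.

```python
import re
from collections import Counter

def count_matrix(corpus):
    """Build a sorted vocabulary and per-document word counts."""
    # Lowercase, then keep runs of 2+ word characters as tokens.
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in corpus]
    vocabulary = sorted({token for doc in tokenized for token in doc})
    rows = []
    for doc in tokenized:
        counts = Counter(doc)  # missing terms count as 0
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows

corpus = ['This is the first document.',
          'This is the second document.',
          'And the third one. One is fun.']
vocabulary, rows = count_matrix(corpus)
print(vocabulary)
for row in rows:
    print(row)
```

Each row is one document and each column is one vocabulary term, which is the same layout as the DataFrame built from `X.toarray()`.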

# Document Similarity: Example

![](https://i.imgur.com/PyirXsy.png)

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
corpus = ['The weather is hot under the sun',
'I make my hot chocolate with milk',
'One hot encoding',
'I will have a chai latte with milk',
'There is a hot sale today']
# create the document-term matrix with count vectorizer
cv = CountVectorizer(stop_words="english")
X = cv.fit_transform(corpus).toarray()
dt = pd.DataFrame(X, columns=cv.get_feature_names_out())
dt

# Document Similarity: Example

In [None]:
# calculate the cosine similarity between all combinations of documents
from itertools import combinations
from sklearn.metrics.pairwise import cosine_similarity

# list all of the combinations of 5 take 2 as well as the pairs of phrases
pairs = list(combinations(range(len(corpus)),2)) #sentence (0, 1), (0, 2), (0, 3), (0, 4), (1, 2), (1, 3), .., (3,4))
print(pairs)
combos = [(corpus[a_index], corpus[b_index]) for (a_index, b_index) in pairs]
print (combos)

# calculate the cosine similarity for all pairs of phrases and sort by most similar
# cosine_similarity returns a 1x1 array for a single pair; [0][0] extracts the plain score
results = [cosine_similarity([X[a_index]], [X[b_index]])[0][0] for (a_index, b_index) in pairs]
sorted(zip(results, combos), reverse=True)
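For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. A minimal sketch in plain Python follows; the function name `cosine` and the two toy count vectors are assumptions for illustration, and returning 0.0 for an all-zero vector is a choice made in this sketch.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # sketch convention: an empty document matches nothing
    return dot / (norm_u * norm_v)

# Two toy word-count vectors over a four-word vocabulary.
doc_a = [1, 1, 0, 1]
doc_b = [1, 0, 1, 1]
print(round(cosine(doc_a, doc_b), 3))  # 0.667
```

For non-negative count vectors the score ranges from 0 (no shared terms) to 1 (same direction).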


In [None]:
pairs = list(combinations(range(5),2))
pairs

![](https://i.imgur.com/jrfN6Jj.png)

![](https://i.imgur.com/BI8XP92.png)

![](https://i.imgur.com/3IbfQXT.png)

![](https://i.imgur.com/pnNqzql.png)

In [None]:
import pandas as pd
corpus = ['This is the first document.',
         'This is the second document.',
         'And the third one. One is fun.']


# original Count Vectorizer
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
X = cv.fit_transform(corpus).toarray()
pd.DataFrame(X, columns=cv.get_feature_names_out())



In [None]:
# new TF-IDF Vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
cv_tfidf = TfidfVectorizer()
X_tfidf = cv_tfidf.fit_transform(corpus).toarray()
pd.DataFrame(X_tfidf, columns=cv_tfidf.get_feature_names_out())
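The weights produced by a TF-IDF vectorizer can be approximated by hand. The sketch below assumes raw term counts, the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1` that scikit-learn documents for its default settings, and L2 normalization of each row; the function name `tfidf_matrix` is hypothetical.

```python
import math
import re

def tfidf_matrix(corpus):
    """Return (vocabulary, rows) of L2-normalized TF-IDF weights."""
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in corpus]
    vocabulary = sorted({token for doc in tokenized for token in doc})
    n_docs = len(corpus)
    # Document frequency: number of documents that contain each term.
    df = {term: sum(term in doc for doc in tokenized) for term in vocabulary}
    # Smoothed inverse document frequency: rarer terms get larger weights.
    idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocabulary}
    rows = []
    for doc in tokenized:
        weights = [doc.count(term) * idf[term] for term in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0  # avoid dividing by 0
        rows.append([w / norm for w in weights])
    return vocabulary, rows

corpus = ['This is the first document.',
          'This is the second document.',
          'And the third one. One is fun.']
vocabulary, rows = tfidf_matrix(corpus)
for term, weight in zip(vocabulary, rows[0]):
    print(term, round(weight, 3))
```

In the first document, "first" outweighs "the" because "the" appears in every document and so carries the minimum idf.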

![](https://i.imgur.com/xlJibKw.png)

## Document Similarity: Example with TF-IDF

In [None]:
corpus = ['The weather is hot under the sun',
'I make my hot chocolate with milk',
'One hot encoding',
'I will have a chai latte with milk',
'There is a hot sale today']

from sklearn.feature_extraction.text import TfidfVectorizer
# create the document-term matrix with TF-IDF vectorizer
cv_tfidf = TfidfVectorizer(stop_words="english")
X_tfidf = cv_tfidf.fit_transform(corpus).toarray()
dt_tfidf = pd.DataFrame(X_tfidf, columns=cv_tfidf.get_feature_names_out())
dt_tfidf

In [None]:
# calculate the cosine similarity for all pairs of phrases and sort by most similar
# [0][0] extracts the plain score from the 1x1 array that cosine_similarity returns
results_tfidf = [cosine_similarity([X_tfidf[a_index]], [X_tfidf[b_index]])[0][0] for (a_index, b_index) in pairs]
sorted(zip(results_tfidf, combos), reverse=True)


![](https://i.imgur.com/mj4J60v.png)

# <font color=red>Text Similarity Exercise</font>

## Introduction

We will be using a song lyric dataset from Kaggle to identify songs with similar lyrics. The data set contains artists, songs and lyrics for 55K+ songs, but today we will be focusing on songs by one group in particular - The Beatles.

The following code will help you load in the data and get set up for this exercise.


In [None]:
import nltk
import pandas as pd

In [None]:
data = pd.read_csv('songdata.csv')
data.head()

## Question 1

Apply the following preprocessing steps:

- Note the '\n' (new line) characters in the lyrics. Remove them using regular expressions.


## Question 2

(a) List all the rows with "Imagine" in the title


## Question 3

(a) Extract the first line of the lyrics from the first song.


(b) Find out the sentiment of the extracted lyric. 

# NLP Showcase
**1 Name Gender Classifier**

In [None]:
# code to build a classifier to classify names as male or female
# demonstrates the basics of feature extraction and model building

names = [(name, 'male') for name in nltk.corpus.names.words("male.txt")]
names += [(name, 'female') for name in nltk.corpus.names.words("female.txt")]

def extract_gender_features(name):
    name = name.lower()
    features = {}
    features["suffix"] = name[-1:]
    features["suffix2"] = name[-2:] if len(name) > 1 else name[0]
    features["suffix3"] = name[-3:] if len(name) > 2 else name[0]
    features["suffix4"] = name[-4:] if len(name) > 3 else name[0]
    #features["suffix5"] = name[-5:] if len(name) > 4 else name[0]
    #features["suffix6"] = name[-6:] if len(name) > 5 else name[0]
    features["prefix"] = name[:1]
    features["prefix2"] = name[:2] if len(name) > 1 else name[0]
    features["prefix3"] = name[:3] if len(name) > 2 else name[0]
    features["prefix4"] = name[:4] if len(name) > 3 else name[0]
    features["prefix5"] = name[:5] if len(name) > 4 else name[0]
    #features["wordLen"] = len(name)
    
    #for letter in "abcdefghijklmnopqrstuvwyxz":
    #    features[letter + "-count"] = name.count(letter)
   
    return features

data = [(extract_gender_features(name), gender) for (name,gender) in names]

import random
random.shuffle(data)

#print(data[:10])
#print()
#print(data[-10:])

dataCount = len(data)
trainCount = int(.8*dataCount)

trainData = data[:trainCount]
testData = data[trainCount:]
bayes = nltk.NaiveBayesClassifier.train(trainData)

def classify(name):
    label = bayes.classify(extract_gender_features(name))
    print("name=", name, "classified as=", label)

print("trainData accuracy=", nltk.classify.accuracy(bayes, trainData))
print("testData accuracy=", nltk.classify.accuracy(bayes, testData))

bayes.show_most_informative_features(25)

In [None]:
# print gender classifier errors so we can design new features to identify the cases
errors = []

for (name,label) in names:
    if bayes.classify(extract_gender_features(name)) != label:
        errors.append({"name": name, "label": label})

errors


# **2 Sentiment Analysis**

In [None]:
# movie reviews / sentiment analysis - part #1
from nltk.corpus import movie_reviews as reviews
import random

# fileid avoids shadowing the built-in id()
docs = [(list(reviews.words(fileid)), cat) for cat in reviews.categories() for fileid in reviews.fileids(cat)]
random.shuffle(docs)

print([ (len(d[0]), d[0][:2], d[1]) for d in docs[:10]])

fd = nltk.FreqDist(word.lower() for word in reviews.words())
topKeys = [ key for (key,value) in fd.most_common(2000)]
print(topKeys)

In [None]:
# movie reviews sentiment analysis - part #2
import nltk

def review_features(doc):
    docSet = set(doc)
    features = {}
    
    for word in topKeys:
        features[word] = (word in docSet)
        
    return features

#review_features(reviews.words("pos/cv957_8737.txt"))

data = [(review_features(doc), label) for (doc,label) in docs]

dataCount = len(data)
trainCount = int(.8*dataCount)

trainData = data[:trainCount]
testData = data[trainCount:]
bayes2 = nltk.NaiveBayesClassifier.train(trainData)

print("train accuracy=", nltk.classify.accuracy(bayes2, trainData))
print("test accuracy=", nltk.classify.accuracy(bayes2, testData))

bayes2.show_most_informative_features(20)