##<font color='navy' face='Helvetica' size=12pt>Introduction to Natural Language Processing</font>

<font color='crimson'>**Objective:** analyze speech and text with computer algorithms.</font>

<span style="font-family:Calibri; color:darkblue; font-size:18pt;">Examples of projects/research with NLP:</span>

<font color='blue'>*Sentiment Analysis*</font> - How positive or negative is text about a topic? 

<font color='blue'>*Prediction*</font> - What genres should Netflix classify a movie as to maximize views? Based on product reviews, can we predict the star rating of a product?

<font color='blue'>*Translation*</font> - Recognize words in one language to provide similar words in another.

**Playground:** https://www.deepl.com/translator


<font color='blue'>*Summarization*</font> - Take a long document and produce a shorter one (a synthesis) without losing meaningful information.


<font color='forestgreen'>**Methods:**</font> <span style="font-family:Calibri; color:red; font-size:12pt;">The main idea is to quantify the occurrence of relevant words and, based on the context, to map them into vectors. That is to say, we want to create mathematically representable quantities from words and text; they will serve as features for data analysis. One approach is to separate the text data into sentences and then use the sentences to extract (key) words and expressions.</span>

###**Regular Expressions (regex)**

Goal: provide a language that allows us to search for different text strings.

For example, regular expressions (frequently called “regex”) allow us to label every tweet with a “1” if it contains any of the following expressions:

- college
- College of
- colleges
- The College

The idea is to detect that in all expressions above we have the same concept "college".
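As a minimal sketch of this labeling idea, the snippet below marks each tweet with a 1 when it mentions "college" in any of the forms listed above. The example tweets and the particular pattern are assumptions made for illustration; they are not taken from the original data.

```python
import re

# Hypothetical example tweets (not from the original data)
tweets = [
    "I love my college!",
    "The College of Arts is hiring",
    "Visiting several colleges this fall",
    "Chocolate is delicious",
]

# One case-insensitive pattern covers: college, College of, colleges, The College
pattern = re.compile(r"\bcolleges?\b", re.IGNORECASE)

# Label a tweet with 1 if the pattern occurs anywhere in it, else 0
labels = [1 if pattern.search(tweet) else 0 for tweet in tweets]
print(labels)
```

The `s?` makes the plural optional, and `re.IGNORECASE` removes the need to spell out `[cC]`.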



<font color='blue' face='Calibri' size=5pt>Examples of common REGEX patterns</font>

**[tT]**imber  - would match lower or uppercase T

**[A-Z]** - would match any capital character

**[a-z]** - would match any lowercase character

**[0-9]** - would match any single digit (e.g., 9)

**[^A-Z]** - would match anything that isn’t an uppercase letter.

**\w** - would match any word character (a letter, a digit, or an underscore).

A comprehensive manual on regex can be found here:
https://www3.ntu.edu.sg/home/ehchua/programming/howto/Regexe.html
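
The patterns listed above can be tried directly with `re.findall`, which returns every non-overlapping match. The sample string is made up for this illustration.

```python
import re

# Hypothetical sample string chosen to exercise each pattern
sample = "Timber! 3 tall Trees fell, timber_9"

timber = re.findall(r"[tT]imber", sample)   # lower- or uppercase T
capitals = re.findall(r"[A-Z]", sample)     # single uppercase letters
digits = re.findall(r"[0-9]", sample)       # single digits
words = re.findall(r"\w+", sample)          # runs of letters, digits, underscores

print(timber)
print(capitals)
print(digits)
print(words)
```

Note that `\w+` keeps `timber_9` together, because digits and the underscore count as word characters.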

In [None]:
# import regex in Python
import re

In [None]:
pattern = "[cC]hoca"
sentence1 = "Chocolate is very delicious"
sentence2 = "This new recipe deliciously implemented a new idea about the texture of the chocolate."
if re.search(pattern, sentence1):
  print("Match!")
else: print("Not a match!")

Not a match!


In [None]:
pattern

'[cC]hoca'

###An example of replacing the spaces (and any character other than letters, digits, hyphens, or commas) between words:

In [None]:
text = "This chocolate is delicious but it may have too many calories, such as 400."
re.sub('[^a-zA-Z0-9-,]','*',text)

'This*chocolate*is*delicious*but*it*may*have*too*many*calories,*such*as*400*'

In [None]:
text = "This chocolate is delicious but it may have too many calories, such as 400."

In [None]:
text.split()

['This',
 'chocolate',
 'is',
 'delicious',
 'but',
 'it',
 'may',
 'have',
 'too',
 'many',
 'calories,',
 'such',
 'as',
 '400.']

In [None]:
info = text.split(sep=' ')

In [None]:
# info is now a list of words
info[2]

'is'

###An example of matching a pattern (a sequence of characters)

In [None]:
pattern = r"[cC]hoco"
sequence = "Chocolate is delicious"
if re.match(pattern, sequence):
  print("Match!")
else: print("Not a match!")

Match!
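
One detail worth keeping in mind: `re.match` only tests the beginning of the string, while `re.search` scans the whole string. The short sketch below contrasts the two; the second sentence is a made-up variation of the one used above.

```python
import re

pattern = r"[cC]hoco"
at_start = "Chocolate is delicious"
in_middle = "I think chocolate is delicious"  # hypothetical variation

# re.match succeeds only if the pattern matches at position 0
match_start = re.match(pattern, at_start) is not None
match_middle = re.match(pattern, in_middle) is not None

# re.search succeeds if the pattern matches anywhere in the string
search_middle = re.search(pattern, in_middle) is not None

print(match_start, match_middle, search_middle)
```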


### Rooting (stemming) words is very important! (In short, the root is an identifier of the meaning of the word.)

In [None]:
pattern = r"good for you"
sentence = "Chocolate is delicious and good for you"
if re.search(pattern, sentence):
  print("Match!")
else: print("Not a match!")

Match!


###Example:

<figure>
<center>
<img src='https://drive.google.com/uc?id=1AMHbSgq3MHcv8Q8ljnHvl5IkxTKkzGkx' 
width='600px' />
<figcaption>Data from Twitter</figcaption></center>
</figure>


In [None]:
text = """Rep. Stephanie Murphy Verified account @RepStephMurphy Aug 30 More Celebrating 100yrs of coeducation at @williamandmary, 
        it was a true honor to return to my alma mater & join its first female president, Katherine Rowe, to welcome students at their convocation. 
        I spoke about the power of patriotism & the urgent need for active, engaged citizens."""

In [None]:
pattern = r"[cC]elebrating"
if re.search(pattern, text):
  print("Match!")
else: print("Not a match!")

Match!


In [None]:
pattern = r"\welebrat[a-z]+"
if re.search(pattern, text):
  print("Match!")
else: print("Not a match!")

Match!


<font face='Calibri' color='blue' size=5pt>The Bag of Words model (BoW)</font>

**Main Goal:** use co-occurrences within context and counts of keywords to make predictions.

**Observation:** there are many words that do not matter (such as prepositions or definite and indefinite articles). 

**Important:** each word can be translated into a binary value of occurrence.

<span style="font-family:Calibri; color:darkblue; font-size:5pt;">Analog Example:</span>

*Statement 1*: Jurassic World was the pinnacle of human achievement.

*Statement 2*: Human kind would be better without Jurassic World.


<figure>
<center>
<img src='https://drive.google.com/uc?id=1EUGNgop58BOOhFGHR3iKs5gXbrji6jEM' 
width='600px' />
<figcaption>What is the difference in the statements above?</figcaption></center>
</figure>



**Method**: we discard the *stopwords* (such as articles, prepositions, and auxiliary verbs) and retain the *corpus* (the important words or the *roots* of important words).



A simple model based on this data:

<figure>
<center>
<img src='https://drive.google.com/uc?id=1-uuXfXiYlmub8DauhxhYYCP2TKqfdvoB' 
width='600px' />
<figcaption>The differences can be highlighted by using a count/vectorizer method</figcaption></center>
</figure>

**Main idea:** analyze differences and co-occurrences.
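
A minimal sketch of the binary occurrence idea, applied to the two Jurassic World statements. The tiny stopword list is hand-picked for this illustration only, so the resulting vocabulary may differ from the one shown in the figure.

```python
import re

statements = [
    "Jurassic World was the pinnacle of human achievement.",
    "Human kind would be better without Jurassic World.",
]

# A tiny hand-picked stopword list, assumed only for this illustration
stop_words = {"was", "the", "of", "would", "be"}

def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in stop_words]

tokens = [tokenize(s) for s in statements]

# The vocabulary is the sorted set of all retained words
vocabulary = sorted(set(w for doc in tokens for w in doc))

# Binary occurrence: 1 if the word appears in the statement, else 0
vectors = [[1 if w in doc else 0 for w in vocabulary] for doc in tokens]

print(vocabulary)
for v in vectors:
    print(v)
```

The two vectors agree on "human", "jurassic", and "world"; the positions where they differ ("pinnacle", "achievement" versus "better", "without", "kind") are what a model can use to tell the statements apart.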

**Known Problems:**

 - If some sentences are much longer, the vocabulary grows, and so does the length of every vector; this is a dimensionality problem.
 - New sentences may contain words that did not appear in the previous sentences.
 - The vectors would also contain many zeros, thereby resulting in a sparse matrix.
 - No information on the grammatical structure or the actual ordering of the words is being used.
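
Two of these problems are easy to see on a toy example. The sketch below, which uses made-up sentences, shows that two sentences with opposite meanings get identical word counts (word order is lost) and that a large share of the count matrix is zero (sparsity).

```python
from collections import Counter

# Hypothetical sentences: the first two differ only in word order
docs = [
    "the dog bites the man",
    "the man bites the dog",
    "chocolate is delicious",
]

# Word order is ignored, so the first two bags of words are identical
bags = [Counter(d.split()) for d in docs]
same_bag = bags[0] == bags[1]

# Build the count matrix over the shared vocabulary
vocab = sorted(set(w for d in docs for w in d.split()))
matrix = [[bag[w] for w in vocab] for bag in bags]

# Fraction of entries that are zero
zeros = sum(row.count(0) for row in matrix)
sparsity = zeros / (len(docs) * len(vocab))

print(same_bag)
print(f"{zeros} of {len(docs) * len(vocab)} entries are zero ({sparsity:.0%})")
```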

**Possible Solution:** Term Frequency-Inverse Document Frequency (TF-IDF)

The term frequency-inverse document frequency is a measure that quantifies the importance of a word in the context of a document or a *corpus*.

The *term-frequency* of a word is the relative frequency of the term in the context of the document.

$$\text{TF}(t,d):=\frac{\text{# of times the term appears in the document}}{\text{# of terms in the document }}$$


The *inverse document frequency* depends on the term and on the whole collection of documents (not on a single document):

$$\text{IDF}(t):=\log\left(\frac{\text{# of documents}}{\text{# of documents with term } t}\right)$$

Our quantification of relative importance is defined as the product between TF and IDF.

TF-IDF gives larger values to words that appear in few documents. It is high when both the TF and IDF values are high, for instance when the word is rare across all the documents combined but frequent in a single document.
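
The two formulas above can be sketched in a few lines of plain Python. The three toy documents and the function names are assumptions made for this illustration. Libraries such as scikit-learn use a smoothed variant of IDF, so their numbers will generally differ from the ones computed here.

```python
import math

# Three toy documents (hypothetical), already split into words
docs = [
    ["chocolate", "is", "delicious"],
    ["chocolate", "has", "calories"],
    ["holmes", "is", "clever"],
]

def tf(term, doc):
    """Relative frequency of the term within one document."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Log of (# of documents) / (# of documents containing the term).
    Assumes the term occurs in at least one document."""
    n_containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    """Product of term frequency and inverse document frequency."""
    return tf(term, doc) * idf(term, docs)

score_delicious = tf_idf("delicious", docs[0], docs)  # appears in 1 of 3 documents
score_chocolate = tf_idf("chocolate", docs[0], docs)  # appears in 2 of 3 documents

print(round(score_delicious, 4), round(score_chocolate, 4))
```

Both words occur once in the first document, so they have the same TF; "delicious" scores higher only because it appears in fewer documents.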

A good Python example can be found here: 

https://towardsdatascience.com/natural-language-processing-feature-engineering-using-tf-idf-e8b9d00e7e76


In [None]:
import nltk
nltk.download('punkt')
import numpy as np

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [None]:
text = open("drive/MyDrive/Data Sets/SherlockHolmes.txt").read()

In [None]:
text[:1234]

'\ufeffThe Project Gutenberg eBook of The Adventures of Sherlock Holmes, by Arthur Conan Doyle\n\nThis eBook is for the use of anyone anywhere in the United States and\nmost other parts of the world at no cost and with almost no restrictions\nwhatsoever. You may copy it, give it away or re-use it under the terms\nof the Project Gutenberg License included with this eBook or online at\nwww.gutenberg.org. If you are not located in the United States, you\nwill have to check the laws of the country where you are located before\nusing this eBook.\n\nTitle: The Adventures of Sherlock Holmes\n\nAuthor: Arthur Conan Doyle\n\nRelease Date: November 29, 2002 [eBook #1661]\n[Most recently updated: May 20, 2019]\n\nLanguage: English\n\nCharacter set encoding: UTF-8\n\nProduced by: an anonymous Project Gutenberg volunteer and Jose Menendez\n\n*** START OF THE PROJECT GUTENBERG EBOOK THE ADVENTURES OF SHERLOCK HOLMES ***\n\ncover\n\n\n\n\nThe Adventures of Sherlock Holmes\n\nby Arthur Conan Doyle\n\n

<font face="Calibri" color='navy' size=4pt>We can extract all the sentences (based on punctuation):</font>

In [None]:
dataset = nltk.sent_tokenize(text) 
for i in range(len(dataset)): 
    dataset[i] = dataset[i].lower() 
    dataset[i] = re.sub(r'\W', ' ', dataset[i]) 
    dataset[i] = re.sub(r'\s+', ' ', dataset[i]) 

In [None]:
dataset[1000]

'but the maiden herself was most instructive you appeared to read a good deal upon her which was quite invisible to me i remarked '

In [None]:
# this is the 2001st sentence
dataset[2000]

'they could only have come from the old man at my side and yet he sat now as absorbed as ever very thin very wrinkled bent with age an opium pipe dangling down from between his knees as though it had dropped in sheer lassitude from his fingers '

What do you notice? There are no capital letters and no punctuation (because counting words does not need them).

We can also determine how frequent the different words are.

In [None]:
# we can count the occurrences of different words
# Creating the Bag of Words model 
word2count = {} # this is a dictionary mapping each word to its count
for data in dataset: 
    words = nltk.word_tokenize(data) 
    for word in words: 
        if word not in word2count: 
            word2count[word] = 1
        else: 
            word2count[word] += 1
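
The same counting can be written more compactly with `collections.Counter` from the standard library. This is only an alternative sketch: the three sentences stand in for the cleaned `dataset`, and plain `split()` is used here instead of `nltk.word_tokenize`.

```python
from collections import Counter

# Hypothetical mini-dataset standing in for the cleaned sentences
sentences = [
    "the game is afoot",
    "the man sat by the fire",
    "holmes watched the man",
]

# Count every word across all sentences in one pass
counts = Counter(word for s in sentences for word in s.split())

# most_common(k) returns the k most frequent (word, count) pairs
top2 = counts.most_common(2)

print(counts["the"], counts["ghost"])  # a missing word counts as 0
print(top2)
```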

In [None]:
word2count.get('follow')

15

In [None]:
word2count.get('watson')

81

This means that the word "watson" appeared 81 times.

In [None]:
word2count.get('the') # however 'the' is a stopword so it should not be counted!!

5815

<font face="Calibri" color='navy' size=4pt>We can determine which words are the most frequent, for example:</font>

In [None]:
# the top 100 most frequent words
import heapq 
freq_words = heapq.nlargest(100, word2count, key=word2count.get)
freq_words

Indeed this is a story about "sherlock" and "crime"...

Important: we want to discard all the unimportant words (also known as "stopwords").

In [None]:
# Stopword dictionary
from nltk.corpus import stopwords
nltk.download('stopwords')
# For stemming
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
txt = re.sub('[^a-zA-Z0-9 ]','',dataset[2300])
# Make everything lower case
txt = txt.lower()
# Make it a list of words
txt = txt.split()
# Build the stopword set once (rebuilding it for every word is slow)
stop_words = set(stopwords.words('english'))
# Get all the stop words out
txt = [word for word in txt if word not in stop_words]
# Stem the words
txt = [stemmer.stem(word) for word in txt]
# Put it all back together and look at the result
' '.join(txt)

'day stream penni vari silver pour upon bad day fail take 2'

..and we want to do this for every sentence in the book:

In [None]:
corpus = [] # 'corpus' refers to the sentences after we removed all stopwords and stemmed the remaining words
stop_words = set(stopwords.words('english')) # build the stopword set once, outside the loop
for sentence in dataset:
    txt = re.sub('[^a-zA-Z0-9 ]','',sentence)
    txt = txt.lower()
    txt = txt.split()
    txt = [word for word in txt if word not in stop_words]
    txt = [stemmer.stem(word) for word in txt]
    txt = ' '.join(txt)
    corpus.append(txt)

In [None]:
corpus[:100]

['project gutenberg ebook adventur sherlock holm arthur conan doyl ebook use anyon anywher unit state part world cost almost restrict whatsoev',
 'may copi give away use term project gutenberg licens includ ebook onlin www gutenberg org',
 'locat unit state check law countri locat use ebook',
 'titl adventur sherlock holm author arthur conan doyl releas date novemb 29 2002 ebook 1661 recent updat may 20 2019 languag english charact set encod utf 8 produc anonym project gutenberg volunt jose menendez start project gutenberg ebook adventur sherlock holm cover adventur sherlock holm arthur conan doyl content',
 'scandal bohemia ii',
 'red head leagu iii',
 'case ident iv',
 'boscomb valley mysteri v five orang pip vi',
 'man twist lip vii',
 'adventur blue carbuncl viii',
 'adventur speckl band ix',
 'adventur engin thumb x',
 'adventur nobl bachelor xi',
 'adventur beryl coronet xii',
 'adventur copper beech',
 'scandal bohemia',
 'sherlock holm alway woman',
 'seldom heard mention name'

In [None]:
# we can count the occurrences of different words in the corpus
# Creating the Bag of Words model 
word2count = {} 
for data in corpus: 
    words = nltk.word_tokenize(data) 
    for word in words: 
        if word not in word2count.keys(): 
            word2count[word] = 1
        else: 
            word2count[word] += 1

In [None]:
# .. and get the top 100 most frequent in the corpus:
freq_words = heapq.nlargest(100, word2count, key=word2count.get)
freq_words

## Application to Amazon customer reviews

In [None]:
import pandas as pd
import re

In [None]:
# this data is available via Kaggle
df = pd.read_csv('drive/MyDrive/Data Sets/amazon_reviews.csv', quoting=2 )
# Extract the ratings and text reviews
data = df[['reviews.text', 'reviews.rating']].dropna().reset_index(drop=True)

reviews = data['reviews.text']
y = data['reviews.rating']

In [None]:
data.loc[500,'reviews.text']

'I have a regular echo and now the tap. Both awesome products, use them to control lights, locks, and play music. Would buy again.'

In [None]:
y[2]

4.0

To learn more about the data:   

https://www.kaggle.com/bittlingmayer/amazonreviews

In [None]:
allreviews = []
stop_words = set(stopwords.words('english')) # build the stopword set once, outside the loop
for review in reviews:
    txt = re.sub('[^a-zA-Z0-9 ]','',review)
    txt = txt.lower()
    txt = txt.split()
    txt = [word for word in txt if word not in stop_words]
    txt = [stemmer.stem(word) for word in txt]
    txt = ' '.join(txt)
    allreviews.append(txt)

In [None]:
allreviews[500] # this btw is a 5 star review

'regular echo tap awesom product use control light lock play music would buy'

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
X_raw = cv.fit_transform(allreviews)
X = X_raw.toarray()

In [None]:
X.shape

(1177, 5084)

In [None]:
# for the number of stars we say 5 stars is a hit (1) and fewer than 5 is a miss (0)
yb = (y == 5).astype(float)

In [None]:
yb[309]

1.0

In [None]:
yb.shape

(1177,)

In [None]:
yb

0       1.0
1       1.0
2       0.0
3       1.0
4       1.0
       ... 
1172    0.0
1173    0.0
1174    0.0
1175    0.0
1176    0.0
Name: reviews.rating, Length: 1177, dtype: float64

### Logistic Regression Classifier 

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split as tts
from sklearn.metrics import accuracy_score as acc

Xtrain,Xtest,ytrain,ytest = tts(X,yb,random_state=310,test_size=0.25)
cls = LogisticRegression(random_state=310, solver='lbfgs')
cls.fit(Xtrain,ytrain)
ypred = cls.predict(Xtest)
cm = confusion_matrix(ytest, ypred)
pd.DataFrame(cm, columns=['Not 5', '5'], index =['Not 5', '5'])

| actual \ predicted | Not 5 | 5 |
|---|---|---|
| Not 5 | 59 | 57 |
| 5 | 30 | 149 |


In [None]:
acc(ytest,ypred)

0.7050847457627119
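
The accuracy can be recovered by hand from the confusion matrix reported above, where rows are the actual classes and columns are the predicted ones. The sketch below also computes precision and recall for the "5" class; these two metrics are not part of the original analysis and are added here only for illustration.

```python
# Entries of the logistic regression confusion matrix reported above
# (rows = actual class, columns = predicted class)
tn, fp = 59, 57    # actual "Not 5": predicted "Not 5", predicted "5"
fn, tp = 30, 149   # actual "5":     predicted "Not 5", predicted "5"

total = tn + fp + fn + tp

# Accuracy: share of all test reviews that were classified correctly
accuracy = (tn + tp) / total

# Precision: of the reviews predicted as "5", how many really are "5"
precision = tp / (tp + fp)

# Recall: of the reviews that really are "5", how many were found
recall = tp / (tp + fn)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
```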

In [None]:
# the input features are based on the Bag of Words Model
# the input features matrix X is sparse (mostly zeros)
X

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

### K-Nearest Neighbors Classifier

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
model = KNeighborsClassifier(n_neighbors=5,weights='distance')

In [None]:
model.fit(Xtrain,ytrain)
ypred = model.predict(Xtest)
cm = confusion_matrix(ytest, ypred)
pd.DataFrame(cm, columns=['Not 5', '5'], index =['Not 5', '5'])

| actual \ predicted | Not 5 | 5 |
|---|---|---|
| Not 5 | 43 | 73 |
| 5 | 21 | 158 |


### Naive Bayes Classifier

In [None]:
from sklearn.naive_bayes import GaussianNB
cls = GaussianNB()
cls.fit(Xtrain,ytrain)
ypred = cls.predict(Xtest)
cm = confusion_matrix(ytest, ypred)
pd.DataFrame(cm, columns=['Not 5', '5'], index =['Not 5', '5'])

| actual \ predicted | Not 5 | 5 |
|---|---|---|
| Not 5 | 82 | 34 |
| 5 | 71 | 108 |


In [None]:
acc(ytest,ypred)

0.6440677966101694

### Random Forest Classifier 

In [None]:
from sklearn.ensemble import RandomForestClassifier


cls = RandomForestClassifier(random_state=310, max_depth=100, n_estimators = 100)
cls.fit(Xtrain,ytrain)
ypred = cls.predict(Xtest)
cm = confusion_matrix(ytest, ypred)
pd.DataFrame(cm, columns=['Not 5', '5'], index =['Not 5', '5'])

| actual \ predicted | Not 5 | 5 |
|---|---|---|
| Not 5 | 47 | 69 |
| 5 | 12 | 167 |


In [None]:
acc(ytest,ypred)

0.7254237288135593
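
To put these numbers in context, it can help to compare each classifier with a baseline that always predicts the majority class. The sketch below uses only the confusion matrices reported in this section (116 actual "Not 5" and 179 actual "5" reviews in the test set); the baseline comparison itself is an addition for illustration and not part of the original analysis.

```python
# (tn, fp, fn, tp) read off the confusion matrices reported in this section
results = {
    "Logistic Regression": (59, 57, 30, 149),
    "K-Nearest Neighbors": (43, 73, 21, 158),
    "Gaussian Naive Bayes": (82, 34, 71, 108),
    "Random Forest": (47, 69, 12, 167),
}

# Test-set class sizes (row sums of any of the matrices)
n_not5, n_5 = 116, 179

# Accuracy of always predicting the majority class
baseline = max(n_not5, n_5) / (n_not5 + n_5)

# Accuracy of each classifier: correct predictions over all predictions
accuracies = {
    name: (tn + tp) / (tn + fp + fn + tp)
    for name, (tn, fp, fn, tp) in results.items()
}
best = max(accuracies, key=accuracies.get)

print(f"majority baseline: {baseline:.4f}")
for name, value in accuracies.items():
    print(f"{name}: {value:.4f}")
print("best:", best)
```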

## Application to wine ratings based on customer reviews

In [None]:
import matplotlib.pyplot as plt

from nltk import download
download('stopwords')

from sklearn.model_selection import train_test_split as tts
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

In [None]:
%%time
wine_data = pd.read_csv('winemagdata130kv2.csv',quoting=2)
wines = wine_data[["description","points"]]
wines_subset = wines.sample(1000,random_state=1693).reset_index(drop=True)
corpus = []

# build the stopword set and the stemmer once, outside the loop
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

for description in wines_subset["description"]:
    wine_descriptions = re.sub('[^a-zA-Z0-9 ]','',description)
    wine_descriptions = wine_descriptions.lower()
    wine_descriptions = wine_descriptions.split()
    wine_descriptions = [word for word in wine_descriptions if word not in stop_words]
    wine_descriptions = [stemmer.stem(word) for word in wine_descriptions]
    wine_descriptions = " ".join(wine_descriptions)
    corpus.append(wine_descriptions)

In [None]:
%%time
countVec = CountVectorizer()
X_raw = countVec.fit_transform(corpus)
X = X_raw.toarray()

In [None]:
#### Visualize the distribution of the wine ratings (points)
n, bins, patches = plt.hist(wines_subset["points"].values,10,density=1,facecolor='green',alpha=0.7)

In [None]:
y = wines_subset["points"]
# label wines rated above 90 points as 1 ('Good') and the rest as 0 ('Bad')
y = (y > 90).astype(int).values

In [None]:
X_train, X_test, Y_train, Y_test = tts(X,y,test_size=0.25,random_state=1693)
#scale_X = StandardScaler()
#X_train = scale_X.fit_transform(X_train)
#X_test = scale_X.transform(X_test)
classifier = LogisticRegression(random_state=1693,solver='lbfgs')
classifier.fit(X_train,Y_train)
Y_pred = classifier.predict(X_test)

In [None]:
spc = ['Bad','Good']
cm = confusion_matrix(Y_test,Y_pred)
pd.DataFrame(cm, columns=spc, index=spc)