## Natural Language Processing Reference Sheet
### Derek Wales

### Part One: Text Formatting

In [1]:
library = [('Author', 'Topic', 'Pages'), ('Twain', 'Rafting', 601), ('Feynman', 'Physics', 95), ('Hamilton', 'Mythology', 144)]

# :{10} sets the minimum field width (longer values overflow rather than truncate)
for book in library: 
    print(f'{book[0]:{10}} {book[1]:{8}} {book[2]:{7}}')

Author     Topic    Pages  
Twain      Rafting      601
Feynman    Physics       95
Hamilton   Mythology     144


In [2]:
# Wider columns separated by |, with the page counts right-aligned using >
for author,topic,pages in library:
    print(f"{author:{15}} | {topic:{15}} | {pages:>{7}}")

Author          | Topic           |   Pages
Twain           | Rafting         |     601
Feynman         | Physics         |      95
Hamilton        | Mythology       |     144
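The width-only specs above rely on the defaults: strings are left-aligned and numbers are right-aligned. The following is a minimal sketch of the explicit alignment flags; the helper name `format_row` and the chosen widths are illustrative assumptions, not part of the original notebook.

```python
def format_row(author, topic, pages):
    """Format one row using explicit alignment flags."""
    left = f'{author:<10}'      # '<' left-aligns (the default for strings)
    centered = f'{topic:^11}'   # '^' centers the value within the field
    right = f'{pages:>7}'       # '>' right-aligns (the default for numbers)
    dotted = f'{pages:.>7}'     # a fill character may precede the alignment flag
    return f'[{left}] [{centered}] [{right}] [{dotted}]'


print(format_row('Twain', 'Rafting', 601))
print(format_row('Hamilton', 'Mythology', 144))
```

The brackets are only there to make the padding visible.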


#### Creating a Test File

In [3]:
%%writefile test.txt 
Hello, this is a quick test file.
Second file line.

Writing test.txt


In [4]:
# Creating a file object (read mode is the default)
my_file = open('test.txt')

# seek(0) moves the cursor to the start of the file;
# readlines() then returns a list of the lines in the file
my_file.seek(0)
my_file.readlines()

['Hello, this is a quick test file.\n', 'Second file line.\n']

In [5]:
# Python offers many file operations (open, close, seek, etc.)
# Iterating through a file
with open('test.txt','r') as txt:
    for line in txt:
        print(line, end='')  # the end='' argument removes extra linebreaks

Hello, this is a quick test file.
Second file line.
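The `seek(0)` call used earlier matters because reading moves a cursor through the file. A small sketch of that behavior, assuming a temporary file with the same two lines as `test.txt` (the temporary path is created only for this illustration):

```python
import os
import tempfile

# Write a two-line file to a temporary location (stand-in for test.txt).
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('Hello, this is a quick test file.\nSecond file line.\n')
    path = tmp.name

with open(path) as fh:
    first_read = fh.readlines()   # reads through to the end of the file
    second_read = fh.readlines()  # cursor is already at the end, so nothing is left
    fh.seek(0)                    # move the cursor back to the start
    third_read = fh.readlines()   # the full contents are available again

os.remove(path)  # clean up the temporary file

print(first_read)
print(second_read)
print(third_read)
```

Using `with` also closes the file automatically, which the bare `open()` call does not.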


#### Working with PDFs

In [6]:
# note the capitalization
import PyPDF2

In [7]:
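# Note that the file is opened in binary mode with 'rb'
f = open('./data/US_Declaration.pdf','rb')

# Creating the reader object
# (PdfFileReader, getPage and extractText are the pre-3.0 PyPDF2 names;
#  PyPDF2 3.x renamed them to PdfReader, pages[...] and extract_text)
pdf_reader = PyPDF2.PdfFileReader(f)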

# Grabbing the first page
page_one = pdf_reader.getPage(0)

# Extracting the text
page_one_text = page_one.extractText()

# Printing
print(page_one_text)

Declaration of IndependenceIN CONGRESS, July 4, 1776. The unanimous Declaration of the thirteen united States of America, When in the Course of human events, it becomes necessary for one people to dissolve the
political bands which have connected them with another, and to assume among the powers of the
earth, the separate and equal station to which the Laws of Nature and of Nature's God entitle

them, a decent respect to the opinions of mankind requires that they should declare the causes

which impel them to the separation. 
We hold these truths to be self-evident, that all men are created equal, that they are endowed by

their Creator with certain unalienable Rights, that among these are Life, Liberty and the pursuit
of Happiness.ŠThat to secure these rights, Governments are instituted among Men, deriving

their just powers from the consent of the governed,ŠThat whenever any Form of Government
becomes destructive of these ends, it is the Right of the People to alter or to abolish it,

### Part Two: NLP Basics

|Tag|Description|doc2[0].tag|
|:------|:------:|:------|
|`.text`|The original word text|`Tesla`|
|`.lemma_`|The base form of the word|`tesla`|
|`.pos_`|The simple part-of-speech tag|`PROPN`/`proper noun`|
|`.tag_`|The detailed part-of-speech tag|`NNP`/`noun, proper singular`|
|`.shape_`|The word shape – capitalization, punctuation, digits|`Xxxxx`|
|`.is_alpha`|Does the token consist of alphabetic characters?|`True`|
|`.is_stop`|Is the token part of a stop list, i.e. the most common words of the language?|`False`|

In [8]:
# Importing spaCy and loading the small English model
import spacy
nlp = spacy.load('en_core_web_sm')

In [9]:
# Reading the text file into a spaCy Doc object
with open('./data/owlcreek.txt') as f:
    doc = nlp(f.read())

In [10]:
# Collecting tokens (words, punctuation, and whitespace tokens)
token_list = list(doc)

# Collecting sentences
sent_list = list(doc.sents)
    
# Printing the details
print(f'There are {len(token_list)} tokens and {len(sent_list)} sentences')

There are 4835 tokens and 229 sentences


In [11]:
# Printing the details of the third sentence (index 2)
for token in sent_list[2]:
    print(token.text, token.pos_, token.dep_)

A DET det
man NOUN nsubj
stood VERB ROOT
upon SCONJ prep
a DET det
railroad NOUN compound
bridge NOUN pobj
in ADP prep
northern ADJ amod
Alabama PROPN pobj
, PUNCT punct
looking VERB advcl
down ADV prt

 SPACE 
into ADP prep
the DET det
swift ADJ amod
water NOUN pobj
twenty NUM nummod
feet NOUN npadvmod
below ADV advmod
. PUNCT punct
  SPACE 


### Part Three: Parts of Speech Tagging

In [12]:
# Create a simple Doc object
doc = nlp(u"The quick brown fox jumped over the lazy dog's back.")

# Print the fifth word and associated tags:
print(doc[4].text, doc[4].pos_, doc[4].tag_, spacy.explain(doc[4].tag_))

jumped VERB VBD verb, past tense


In [13]:
for token in doc:
    print(f'{token.text:{10}} {token.pos_:{8}} {token.tag_:{6}} {spacy.explain(token.tag_)}')

The        DET      DT     determiner
quick      ADJ      JJ     adjective
brown      ADJ      JJ     adjective
fox        PROPN    NNP    noun, proper singular
jumped     VERB     VBD    verb, past tense
over       ADP      IN     conjunction, subordinating or preposition
the        DET      DT     determiner
lazy       ADJ      JJ     adjective
dog        NOUN     NN     noun, singular or mass
's         PART     POS    possessive ending
back       NOUN     NN     noun, singular or mass
.          PUNCT    .      punctuation mark, sentence closer


### Coarse-grained Part-of-speech Tags
Every token is assigned a POS Tag from the following list:


<table><tr><th>POS</th><th>DESCRIPTION</th><th>EXAMPLES</th></tr>
    
<tr><td>ADJ</td><td>adjective</td><td>*big, old, green, incomprehensible, first*</td></tr>
<tr><td>ADP</td><td>adposition</td><td>*in, to, during*</td></tr>
<tr><td>ADV</td><td>adverb</td><td>*very, tomorrow, down, where, there*</td></tr>
<tr><td>AUX</td><td>auxiliary</td><td>*is, has (done), will (do), should (do)*</td></tr>
<tr><td>CONJ</td><td>conjunction</td><td>*and, or, but*</td></tr>
<tr><td>CCONJ</td><td>coordinating conjunction</td><td>*and, or, but*</td></tr>
<tr><td>DET</td><td>determiner</td><td>*a, an, the*</td></tr>
<tr><td>INTJ</td><td>interjection</td><td>*psst, ouch, bravo, hello*</td></tr>
<tr><td>NOUN</td><td>noun</td><td>*girl, cat, tree, air, beauty*</td></tr>
<tr><td>NUM</td><td>numeral</td><td>*1, 2017, one, seventy-seven, IV, MMXIV*</td></tr>
<tr><td>PART</td><td>particle</td><td>*'s, not*</td></tr>
<tr><td>PRON</td><td>pronoun</td><td>*I, you, he, she, myself, themselves, somebody*</td></tr>
<tr><td>PROPN</td><td>proper noun</td><td>*Mary, John, London, NATO, HBO*</td></tr>
<tr><td>PUNCT</td><td>punctuation</td><td>*., (, ), ?*</td></tr>
<tr><td>SCONJ</td><td>subordinating conjunction</td><td>*if, while, that*</td></tr>
<tr><td>SYM</td><td>symbol</td><td>*$, %, §, ©, +, −, ×, ÷, =, :), 😝*</td></tr>
<tr><td>VERB</td><td>verb</td><td>*run, runs, running, eat, ate, eating*</td></tr>
<tr><td>X</td><td>other</td><td>*sfpksdpsxmsa*</td></tr>
<tr><td>SPACE</td><td>space</td><td></td></tr>
</table>

### Visualizing 

In [14]:
# Import the displaCy library
from spacy import displacy

# Render the dependency parse immediately inside Jupyter:
displacy.render(doc, style='dep', jupyter=True, options={'distance': 80})

#### Basic entity function

In [15]:
# Write a function to display basic entity info:
def show_ents(doc):
    if doc.ents:
        for ent in doc.ents:
            print(ent.text+' - '+ent.label_+' - '+str(spacy.explain(ent.label_)))
    else:
        print('No named entities found.')

In [16]:
doc = nlp(u'May I go to Washington, DC next May to see the Washington Monument?')

show_ents(doc)

Washington, DC - GPE - Countries, cities, states
next May - DATE - Absolute or relative dates or periods
the Washington Monument - ORG - Companies, agencies, institutions, etc.


In [17]:
# More Display Options
doc = nlp(u"Over the last quarter Apple sold nearly 20 thousand iPods for a profit of $6 million. " u"By contrast, Sony sold only 8 thousand Walkman music players.")

# Highlights the entities
displacy.render(doc, style='ent', jupyter = True)

In [18]:
# Reading in the doc
with open('./data/peterrabbit.txt') as f:
    doc = nlp(f.read())

In [19]:
# Creating a list of sentences and viewing the third one
sents = list(doc.sents)
sents[2]

They lived with their Mother in a sand-bank, underneath the root of a
very big fir-tree.


In [20]:
# Counting parts of speech across the whole document
POS_counts = doc.count_by(spacy.attrs.POS)

for k,v in sorted(POS_counts.items()):
    print(f'{k}. {doc.vocab[k].text:{5}}: {v}')

84. ADJ  : 49
85. ADP  : 122
86. ADV  : 67
87. AUX  : 48
89. CCONJ: 61
90. DET  : 117
92. NOUN : 169
93. NUM  : 8
94. PART : 28
95. PRON : 82
96. PROPN: 75
97. PUNCT: 174
98. SCONJ: 20
100. VERB : 139
103. SPACE: 99


### Part Four: Text Classification

#### Sklearn Ham vs Spam

In [21]:
# Perform imports and load the dataset:
import numpy as np
import pandas as pd

df = pd.read_csv('./data/smsspamcollection.tsv', sep='\t')
df.head()

| |label|message|length|punct|
|:------|:------|:------|:------|:------|
|0|ham|Go until jurong point, crazy.. Available only ...|111|9|
|1|ham|Ok lar... Joking wif u oni...|29|6|
|2|spam|Free entry in 2 a wkly comp to win FA Cup fina...|155|6|
|3|ham|U dun say so early hor... U c already then say...|49|6|
|4|ham|Nah I don't think he goes to usf, he lives aro...|61|2|


In [22]:
# Checking for null values (good practice before modeling)
df.isnull().sum()

label      0
message    0
length     0
punct      0
dtype: int64

In [23]:
# Train test split
from sklearn.model_selection import train_test_split

X = df['message']  # this time we want to look at the text
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [24]:
# Creating an instance of TfidfVectorizer, which combines count vectorization with TF-IDF weighting
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()

X_train_tfidf = vectorizer.fit_transform(X_train) # remember to use the original X_train set
X_train_tfidf.shape

(3733, 7082)
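To make the TF-IDF weighting concrete, here is a pure-Python sketch on three toy messages. It assumes scikit-learn's documented defaults (raw term counts, smoothed idf of `ln((1 + n) / (1 + df)) + 1`, then L2 normalization per document) and uses a plain whitespace split instead of the library's tokenizer. The function name `tfidf_vectors` and the toy messages are illustrative assumptions.

```python
import math
from collections import Counter

# Hypothetical toy messages, not taken from the SMS dataset
docs = ["free entry win cup", "ok see you later", "win free cash now"]


def tfidf_vectors(documents):
    """Return (vocabulary, list of L2-normalized TF-IDF vectors)."""
    tokenized = [d.split() for d in documents]
    vocab = sorted({t for doc in tokenized for t in doc})
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term
    df = Counter(t for doc in tokenized for t in set(doc))
    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        weights = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights))
        vectors.append([w / norm if norm else 0.0 for w in weights])
    return vocab, vectors


vocab, vectors = tfidf_vectors(docs)
for term, weight in zip(vocab, vectors[0]):
    if weight:
        print(f'{term:{8}} {weight:.3f}')
```

Terms that appear in fewer documents (here `entry` and `cup`) receive a larger weight than terms shared across documents (`free` and `win`).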

In [25]:
# Training the model
from sklearn.svm import LinearSVC
clf = LinearSVC()
clf.fit(X_train_tfidf,y_train)

LinearSVC()

#### All the steps above can be done with a pipeline object

In [26]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

text_clf = Pipeline([('tfidf', TfidfVectorizer()),
                     ('clf', LinearSVC()),
])

# Feed the training data through the pipeline
text_clf.fit(X_train, y_train)  

Pipeline(steps=[('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])

In [27]:
# Evaluating the model
from sklearn import metrics

# Form a prediction set
predictions = text_clf.predict(X_test)

# Print a classification report
print(metrics.classification_report(y_test,predictions))

              precision    recall  f1-score   support

         ham       0.99      1.00      0.99      1593
        spam       0.97      0.95      0.96       246

    accuracy                           0.99      1839
   macro avg       0.98      0.97      0.98      1839
weighted avg       0.99      0.99      0.99      1839



#### Sentiment of movie reviews problem

In [28]:
# Reading in the movie reviews
df = pd.read_csv('./data/moviereviews.tsv', sep='\t')

# Dropping Null Values
df.dropna(inplace=True)
len(df)

1965

#### Cool trick to remove blanks

In [29]:
blanks = []  # start with an empty list

for i,lb,rv in df.itertuples():  # iterate over the DataFrame
    if type(rv)==str:            # avoid NaN values
        if rv.isspace():         # test 'review' for whitespace
            blanks.append(i)     # add matching index numbers to the list
        
print(len(blanks), 'blanks: ', blanks)

27 blanks:  [57, 71, 147, 151, 283, 307, 313, 323, 343, 351, 427, 501, 633, 675, 815, 851, 977, 1079, 1299, 1455, 1493, 1525, 1531, 1763, 1851, 1905, 1993]


In [30]:
# Dropping blanks
df.drop(blanks, inplace=True)
len(df)

1938

In [31]:
# Splitting with Sklearn
from sklearn.model_selection import train_test_split

X = df['review']
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [32]:
# Using Pipeline/Naive Bayes
from sklearn.naive_bayes import MultinomialNB

# Naïve Bayes:
text_clf_nb = Pipeline([('tfidf', TfidfVectorizer()),
                     ('clf', MultinomialNB()),
])

In [33]:
text_clf_nb.fit(X_train, y_train)

Pipeline(steps=[('tfidf', TfidfVectorizer()), ('clf', MultinomialNB())])

In [34]:
# Form a prediction set
predictions = text_clf_nb.predict(X_test)

# Print a classification report
print(metrics.classification_report(y_test,predictions))

              precision    recall  f1-score   support

         neg       0.69      0.93      0.79       308
         pos       0.91      0.61      0.73       332

    accuracy                           0.76       640
   macro avg       0.80      0.77      0.76       640
weighted avg       0.80      0.76      0.76       640
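The f1-score column in the report is the harmonic mean of precision and recall. A short sketch that recomputes it from the rounded values printed above (the helper name `f1_from` is an assumption for illustration; because the inputs are already rounded, the result is only approximate):

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the classification report
reported = {'neg': (0.69, 0.93), 'pos': (0.91, 0.61)}

for label, (p, r) in reported.items():
    print(f'{label:{5}} f1 = {f1_from(p, r):.2f}')
```

The low recall for `pos` (0.61) is what pulls its f1-score down despite the high precision.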



### Another Example of Machine Learning to Classify Reviews

In [35]:
import pandas as pd
import numpy as np

df = pd.read_csv("./data/moviereviews2.tsv", sep='\t')

In [36]:
# Check for whitespace strings (it's OK if there aren't any!):
blanks = []  # start with an empty list

for i,lb,rv in df.itertuples():  # iterate over the DataFrame
    if type(rv)==str:            # avoid NaN values
        if rv.isspace():         # test 'review' for whitespace
            blanks.append(i)     # add matching index numbers to the list
        
len(blanks)

0

In [37]:
# Removing rows whose review is NaN
df = df[df['review'].notna()]

In [38]:
# Train Test Split
from sklearn.model_selection import train_test_split

# Splitting into train and test
X = df['review']  
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [39]:
# Modeling Tools
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Pipeline
from sklearn.pipeline import Pipeline

# Building the Pipeline
text_clf = Pipeline([('tfidf',TfidfVectorizer()), ('clf',LinearSVC())])

# Training
text_clf.fit(X_train, y_train)

Pipeline(steps=[('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])

In [40]:
# Assessing Accuracy
# Importing Metrics
from sklearn import metrics

# Form a prediction set
predictions = text_clf.predict(X_test)

In [41]:
# Report the confusion matrix
print(metrics.confusion_matrix(y_test,predictions))

[[900  91]
 [ 63 920]]


In [42]:
# Print the overall accuracy
metrics.accuracy_score(y_test,predictions)

0.9219858156028369
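The accuracy score can be recovered directly from the confusion matrix. In scikit-learn's convention, rows are true classes and columns are predicted classes, so the diagonal holds the correct predictions. The sketch below uses plain lists and hypothetical helper names (`accuracy`, `precision`, `recall`); classes are referred to by index because the label names are not shown in the output.

```python
# Confusion matrix copied from the output above: rows = true, columns = predicted
cm = [[900, 91],
      [63, 920]]


def accuracy(matrix):
    """Share of all samples that lie on the diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def precision(matrix, k):
    """Correct predictions of class k divided by all predictions of class k (column sum)."""
    return matrix[k][k] / sum(row[k] for row in matrix)


def recall(matrix, k):
    """Correct predictions of class k divided by all true members of class k (row sum)."""
    return matrix[k][k] / sum(matrix[k])


print(accuracy(cm))
print(precision(cm, 0), recall(cm, 0))
```

Here (900 + 920) / 1974 reproduces the accuracy reported by `metrics.accuracy_score`.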

### Part Five: Semantics

#### Using VADER for Amazon Reviews

In [43]:
# Importing NLTK
import nltk
nltk.download('vader_lexicon')

from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\derek\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [44]:
import numpy as np
import pandas as pd

df = pd.read_csv('./data/amazonreviews.tsv', sep='\t')
df.head()

| |label|review|
|:------|:------|:------|
|0|pos|Stuning even for the non-gamer: This sound tra...|
|1|pos|The best soundtrack ever to anything.: I'm rea...|
|2|pos|Amazing!: This soundtrack is my favorite music...|
|3|pos|Excellent Soundtrack: I truly like this soundt...|
|4|pos|Remember, Pull Your Jaw Off The Floor After He...|


In [45]:
# REMOVE NaN VALUES AND EMPTY STRINGS:
df.dropna(inplace=True)

blanks = []  # start with an empty list

for i,lb,rv in df.itertuples():  # iterate over the DataFrame
    if type(rv)==str:            # avoid NaN values
        if rv.isspace():         # test 'review' for whitespace
            blanks.append(i)     # add matching index numbers to the list

df.drop(blanks, inplace=True)

In [46]:
# Adding the scores to a column in the Dataframe
df['scores'] = df['review'].apply(lambda review: sid.polarity_scores(review))

df.head()

| |label|review|scores|
|:------|:------|:------|:------|
|0|pos|Stuning even for the non-gamer: This sound tra...|{'neg': 0.088, 'neu': 0.669, 'pos': 0.243, 'co...|
|1|pos|The best soundtrack ever to anything.: I'm rea...|{'neg': 0.018, 'neu': 0.837, 'pos': 0.145, 'co...|
|2|pos|Amazing!: This soundtrack is my favorite music...|{'neg': 0.04, 'neu': 0.692, 'pos': 0.268, 'com...|
|3|pos|Excellent Soundtrack: I truly like this soundt...|{'neg': 0.09, 'neu': 0.615, 'pos': 0.295, 'com...|
|4|pos|Remember, Pull Your Jaw Off The Floor After He...|{'neg': 0.0, 'neu': 0.746, 'pos': 0.254, 'comp...|


In [47]:
# Grabbing the compound score
df['compound']  = df['scores'].apply(lambda score_dict: score_dict['compound'])

# Labeling a review 'pos' when the compound score is >= 0, otherwise 'neg'
df['comp_score'] = df['compound'].apply(lambda c: 'pos' if c >=0 else 'neg')

In [48]:
# Checking Accuracy
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix

accuracy_score(df['label'],df['comp_score'])

0.7092
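The accuracy above depends on how the compound score is turned into a label. A minimal sketch of that step without pandas, using made-up compound scores; the function names and the `threshold` parameter are assumptions added for illustration (the notebook itself hard-codes 0):

```python
def label_from_compound(compound, threshold=0.0):
    """Map a compound score to 'pos' or 'neg'; scores equal to the threshold count as 'pos'."""
    return 'pos' if compound >= threshold else 'neg'


def simple_accuracy(true_labels, predicted_labels):
    """Fraction of positions where the two label lists agree."""
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)


# Hypothetical labels and compound scores, not taken from the Amazon data
true_labels = ['pos', 'pos', 'neg', 'neg']
compounds = [0.91, -0.2, -0.65, 0.0]

predicted = [label_from_compound(c) for c in compounds]
print(predicted)
print(simple_accuracy(true_labels, predicted))
```

Note that a perfectly neutral score of 0.0 is labeled `pos` under this rule, which is one source of misclassified negative reviews.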

### Part Six: Topic Modeling

In [49]:
import pandas as pd

npr = pd.read_csv('./data/npr.csv')

In [50]:
# Preprocessing
from sklearn.feature_extraction.text import CountVectorizer

# Building the count vectorizer for the Document Term Matrix (DTM)
cv = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')

# DTM Object
dtm = cv.fit_transform(npr['Article'])

In [51]:
# Performing LDA
from sklearn.decomposition import LatentDirichletAllocation

# LDA with 7 Topics
LDA = LatentDirichletAllocation(n_components=7,random_state=42)

# Fitting
LDA.fit(dtm)

LatentDirichletAllocation(n_components=7, random_state=42)

In [52]:
# Printing the top 15 words in each topic
# (get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out())
for index,topic in enumerate(LDA.components_):
    print(f'THE TOP 15 WORDS FOR TOPIC #{index}')
    print([cv.get_feature_names()[i] for i in topic.argsort()[-15:]])
    print('\n')

THE TOP 15 WORDS FOR TOPIC #0
['companies', 'money', 'year', 'federal', '000', 'new', 'percent', 'government', 'company', 'million', 'care', 'people', 'health', 'said', 'says']


THE TOP 15 WORDS FOR TOPIC #1
['military', 'house', 'security', 'russia', 'government', 'npr', 'reports', 'says', 'news', 'people', 'told', 'police', 'president', 'trump', 'said']


THE TOP 15 WORDS FOR TOPIC #2
['way', 'world', 'family', 'home', 'day', 'time', 'water', 'city', 'new', 'years', 'food', 'just', 'people', 'like', 'says']


THE TOP 15 WORDS FOR TOPIC #3
['time', 'new', 'don', 'years', 'medical', 'disease', 'patients', 'just', 'children', 'study', 'like', 'women', 'health', 'people', 'says']


THE TOP 15 WORDS FOR TOPIC #4
['voters', 'vote', 'election', 'party', 'new', 'obama', 'court', 'republican', 'campaign', 'people', 'state', 'president', 'clinton', 'said', 'trump']


THE TOP 15 WORDS FOR TOPIC #5
['years', 'going', 've', 'life', 'don', 'new', 'way', 'music', 'really', 'time', 'know', 'think',
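The `topic.argsort()[-15:]` idiom sorts word indices by ascending weight and keeps the last 15, so the strongest word for each topic appears last in the printed list. A pure-Python sketch of the same idea with a made-up five-word vocabulary (the names `top_k_indices`, `vocab`, and `topic_weights` are hypothetical):

```python
def top_k_indices(weights, k):
    """Indices of the k largest weights, in ascending order of weight."""
    order = sorted(range(len(weights)), key=lambda i: weights[i])
    return order[-k:]


# Hypothetical vocabulary and topic weights for illustration
vocab = ['health', 'music', 'trump', 'water', 'vote']
topic_weights = [0.2, 3.1, 0.05, 1.7, 0.9]

top3 = [vocab[i] for i in top_k_indices(topic_weights, 3)]
print(top3)                   # weakest of the top three first, strongest last
print(list(reversed(top3)))   # strongest first
```

Reversing the slice is a simple way to list the most important words first.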

In [53]:
# Preprocessing for Non-Negative Matrix Factorization
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')

dtm = tfidf.fit_transform(npr['Article'])

In [54]:
# Trying the same thing with Non-negative Matrix Factorization (NMF)
from sklearn.decomposition import NMF

# NMF model
nmf_model = NMF(n_components=7,random_state=42)

# This can take a while, since we're dealing with a large number of documents!
nmf_model.fit(dtm)

NMF(n_components=7, random_state=42)

In [55]:
# Printing the top 15 words per topic for NMF, which yields slightly different topics than LDA
# (get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out())
for index,topic in enumerate(nmf_model.components_):
    print(f'THE TOP 15 WORDS FOR TOPIC #{index}')
    print([tfidf.get_feature_names()[i] for i in topic.argsort()[-15:]])
    print('\n')

THE TOP 15 WORDS FOR TOPIC #0
['new', 'research', 'like', 'patients', 'health', 'disease', 'percent', 'women', 'virus', 'study', 'water', 'food', 'people', 'zika', 'says']


THE TOP 15 WORDS FOR TOPIC #1
['gop', 'pence', 'presidential', 'russia', 'administration', 'election', 'republican', 'obama', 'white', 'house', 'donald', 'campaign', 'said', 'president', 'trump']


THE TOP 15 WORDS FOR TOPIC #2
['senate', 'house', 'people', 'act', 'law', 'tax', 'plan', 'republicans', 'affordable', 'obamacare', 'coverage', 'medicaid', 'insurance', 'care', 'health']


THE TOP 15 WORDS FOR TOPIC #3
['officers', 'syria', 'security', 'department', 'law', 'isis', 'russia', 'government', 'state', 'attack', 'president', 'reports', 'court', 'said', 'police']


THE TOP 15 WORDS FOR TOPIC #4
['primary', 'cruz', 'election', 'democrats', 'percent', 'party', 'delegates', 'vote', 'state', 'democratic', 'hillary', 'campaign', 'voters', 'sanders', 'clinton']


THE TOP 15 WORDS FOR TOPIC #5
['love', 've', 'don', 'al