# NLP Introduction

## Important Terms

In a non-NLP **dataset**, each row constitutes an **observation**.

In NLP, we call each observation a **document**, and instead of a dataset we speak of a **corpus**.

# Preprocessing

Two steps:

1. Tokenization.
2. Vectorization.

In [2]:
review = "Food and service were fabulous! Everything from the  bread and balsamic and olive oil combo to the main pasta dishes and dessert. The sauces for each dish we tried were unique and flavorful. Didn't get to try the home made tiramisu, but we will definitely be back to do so."

In [3]:
print(review)

Food and service were fabulous! Everything from the  bread and balsamic and olive oil combo to the main pasta dishes and dessert. The sauces for each dish we tried were unique and flavorful. Didn't get to try the home made tiramisu, but we will definitely be back to do so.


In [6]:
print(review.split(' '))

['Food', 'and', 'service', 'were', 'fabulous!', 'Everything', 'from', 'the', '', 'bread', 'and', 'balsamic', 'and', 'olive', 'oil', 'combo', 'to', 'the', 'main', 'pasta', 'dishes', 'and', 'dessert.', 'The', 'sauces', 'for', 'each', 'dish', 'we', 'tried', 'were', 'unique', 'and', 'flavorful.', "Didn't", 'get', 'to', 'try', 'the', 'home', 'made', 'tiramisu,', 'but', 'we', 'will', 'definitely', 'be', 'back', 'to', 'do', 'so.']


In [8]:
["Did", "n't"]

['Did', "n't"]
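
Splitting on spaces leaves punctuation attached to words and produces an empty token wherever two spaces meet. A minimal sketch of a rule-based tokenizer that addresses both, and that separates `n't` from its verb as in the cell above, might look like the following. The regular expression and the `tokenize` helper are illustrative assumptions, not the tokenizer spaCy uses.

```python
import re

# Hypothetical pattern: try to split "n't" off its verb first,
# then fall back to whole words, then to single punctuation marks.
TOKEN_RE = re.compile(r"\w+(?=n't)|n't|\w+|[^\w\s]")

def tokenize(text):
    """Return word, contraction, and punctuation tokens; whitespace is dropped."""
    return TOKEN_RE.findall(text)

tokens = tokenize("Didn't get to try the  bread!")
print(tokens)
```

Because the pattern matches tokens instead of splitting on a delimiter, the doubled space no longer yields an empty string.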

**Stop Words**

In [9]:
from spacy.lang.en.stop_words import STOP_WORDS

In [12]:
# Lowercase before the lookup: STOP_WORDS entries are lowercase,
# so capitalized words such as 'The' would otherwise slip through.
print([s.lower() for s in review.split(' ')
       if s.lower() not in STOP_WORDS])

['food', 'service', 'fabulous!', '', 'bread', 'balsamic', 'olive', 'oil', 'combo', 'main', 'pasta', 'dishes', 'dessert.', 'sauces', 'dish', 'tried', 'unique', 'flavorful.', "didn't", 'try', 'home', 'tiramisu,', 'definitely', 'so.']


**Lemmatization**

In [16]:
run_list = ['run', 'runs', 'ran', 'running', 'runner']
run_string = ' '.join(run_list)

In [17]:
run_string

'run runs ran running runner'

In [14]:
import spacy

In [15]:
# The "en" shortcut was removed in spaCy v3; load the model by its full name.
nlp = spacy.load("en_core_web_sm")

In [18]:
doc = nlp(run_string)

In [20]:
[token.lemma_ for token in doc]

['run', 'run', 'run', 'run', 'runner']

**Vectorization**

In [2]:
corpus = ['The movie is ingenious fun',
          'The movie is a dud',
          'This thing is virtually unwatchable',
          'This thing is just garbage']

In [21]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import pandas as pd

In [8]:
cv = CountVectorizer()
corpus_trans = cv.fit_transform(corpus).toarray()
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
df = pd.DataFrame(corpus_trans, columns=cv.get_feature_names_out())

In [9]:
df

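|   | dud | fun | garbage | ingenious | is | just | movie | the | thing | this | unwatchable | virtually |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |
| 3 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 |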


In [15]:
print('NEGATIVE')
print('A bad movie that happened to good actors')
print()
print('POSITIVE')
print('A good movie that happened to bad actors')

NEGATIVE
A bad movie that happened to good actors

POSITIVE
A good movie that happened to bad actors


In [16]:
cv = CountVectorizer()
cv.fit_transform(['A bad movie that happened to good actors',
                  'A good movie that happened to bad actors']).toarray()

array([[1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1]])
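
The two reviews have opposite sentiment yet identical count vectors, because a bag-of-words representation keeps only how often each word occurs and discards word order. A minimal standard-library sketch of the same effect follows; the `bag_of_words` helper is a hypothetical stand-in for `CountVectorizer` and, unlike it, also counts the one-letter token `a`.

```python
from collections import Counter

negative = "A bad movie that happened to good actors"
positive = "A good movie that happened to bad actors"

def bag_of_words(text):
    """Count lowercase, whitespace-separated tokens, ignoring their order."""
    return Counter(text.lower().split())

# Both sentences contain exactly the same words, so their bags are equal.
same_bag = bag_of_words(negative) == bag_of_words(positive)
print(same_bag)
```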

**TF-IDF**

In [17]:
corpus = ['Bad movie',
          'Just plain bad',
          'A bad movie that happened to good actors',
          "It is that rare combination of bad writing , bad direction and bad acting -- the trifecta of badness"]

corpus

['Bad movie',
 'Just plain bad',
 'A bad movie that happened to good actors',
 'It is that rare combination of bad writing , bad direction and bad acting -- the trifecta of badness']

In [18]:
cv = CountVectorizer()
corpus_trans = cv.fit_transform(corpus).toarray()
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
df = pd.DataFrame(corpus_trans, columns=cv.get_feature_names_out())

In [19]:
df

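|   | acting | actors | and | bad | badness | combination | direction | good | happened | is | ... | just | movie | of | plain | rare | that | the | to | trifecta | writing |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 3 | 1 | 0 | 1 | 3 | 1 | 1 | 1 | 0 | 0 | 1 | ... | 0 | 0 | 2 | 0 | 1 | 1 | 1 | 0 | 1 | 1 |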


In [20]:
df['bad']

0    1
1    1
2    1
3    3
Name: bad, dtype: int64

In [26]:
tfidf = TfidfVectorizer(stop_words='english')
corpus_trans = tfidf.fit_transform(corpus).toarray()
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
df = pd.DataFrame(corpus_trans, columns=tfidf.get_feature_names_out())

In [29]:
df

|   | acting | actors | bad | badness | combination | direction | good | happened | just | movie | plain | rare | trifecta | writing |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.551939 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.833884 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.346182 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.663385 | 0.0 | 0.663385 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.506765 | 0.264451 | 0.0 | 0.0 | 0.0 | 0.506765 | 0.506765 | 0.0 | 0.39954 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.325285 | 0.0 | 0.509242 | 0.325285 | 0.325285 | 0.325285 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.325285 | 0.325285 | 0.325285 |
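
To see where these weights come from, the sketch below recomputes the first row by hand. It assumes scikit-learn's default settings (raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row) and takes the stop-word-filtered tokens directly from the column names in the table, so the `docs` list and the `tfidf` helper are illustrative rather than part of the original notebook.

```python
import math
from collections import Counter

# Tokens left after stop-word removal, read off the table's columns (assumption).
docs = [
    ["bad", "movie"],
    ["just", "plain", "bad"],
    ["bad", "movie", "happened", "good", "actors"],
    ["rare", "combination", "bad", "writing", "bad", "direction",
     "bad", "acting", "trifecta", "badness"],
]

n_docs = len(docs)
# Document frequency: number of documents that contain each term.
doc_freq = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Return L2-normalized tf-idf weights for one tokenized document."""
    counts = Counter(doc)
    raw = {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
           for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

weights = tfidf(docs[0])
print({t: round(w, 6) for t, w in weights.items()})
```

Since `bad` occurs in every document, its idf is the minimum value of 1, which is why `movie` outweighs it in the first row even though both appear once.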


**HashingVectorizer**

In [30]:
from sklearn.feature_extraction.text import HashingVectorizer

In [31]:
hv = HashingVectorizer()
corpus_trans = hv.transform(corpus).toarray()
df = pd.DataFrame(corpus_trans)

In [32]:
df

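|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 1048566 | 1048567 | 1048568 | 1048569 | 1048570 | 1048571 | 1048572 | 1048573 | 1048574 | 1048575 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |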
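
The frame has 1,048,576 columns because `HashingVectorizer` does not learn a vocabulary: each token is hashed straight to a column index in a fixed-width vector (2**20 columns by default), so almost every entry is zero. A minimal sketch of that hashing trick is shown below. It uses `hashlib.md5` and a tiny `n_features` purely for illustration; scikit-learn itself uses a signed MurmurHash3 and L2-normalizes each row, which this sketch omits.

```python
import hashlib

def hashed_counts(tokens, n_features=16):
    """Map tokens to a fixed-length count vector without storing a vocabulary."""
    vec = [0] * n_features
    for tok in tokens:
        # md5 is a stand-in here; any stable hash works for the illustration.
        digest = hashlib.md5(tok.encode("utf-8")).hexdigest()
        vec[int(digest, 16) % n_features] += 1
    return vec

vec = hashed_counts("just plain bad".split())
print(vec)
```

The trade-off is that distinct tokens can collide in the same column, and the mapping cannot be inverted to recover feature names.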


## N-grams

In [39]:
cv = CountVectorizer(ngram_range=(1,2))
corpus_trans = cv.fit_transform(corpus).toarray()
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
df = pd.DataFrame(corpus_trans, columns=cv.get_feature_names_out())

In [40]:
df.columns = [col.replace(' ', '_') for col in df.columns]

In [41]:
df

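|   | acting | acting_the | actors | and | and_bad | bad | bad_acting | bad_direction | bad_movie | bad_writing | ... | that_happened | that_rare | the | the_trifecta | to | to_good | trifecta | trifecta_of | writing | writing_bad |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| 3 | 1 | 1 | 0 | 1 | 1 | 3 | 1 | 1 | 0 | 1 | ... | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 |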
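
With `ngram_range=(1,2)` the vocabulary contains every single word plus every pair of adjacent words, which is where columns such as `bad_movie` come from. A minimal sketch of that expansion in plain Python follows; the `ngrams` helper is hypothetical and only mirrors the idea, not scikit-learn's implementation.

```python
def ngrams(tokens, n_min=1, n_max=2):
    """Return all contiguous n-grams, n_min <= n <= n_max, joined by spaces."""
    out = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out

features = ngrams("bad movie that happened".split())
print(features)

# Bigrams keep some word order, so reviews with the same words can differ.
neg = set(ngrams("a bad movie that happened to good actors".split(), 2, 2))
pos = set(ngrams("a good movie that happened to bad actors".split(), 2, 2))
print(sorted(neg - pos))
```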
