### Sprint Challenge: Natural Language Processing

In this Sprint Challenge, you will get an opportunity to work on additional exercises that will help further crystalize the concepts that you have been exposed to this week.






**Question 1**: Load the dataset  (only the first 1000 rows)related to restaurant reviews (Dataset: https://www.dropbox.com/s/i4zh5fb82x7i3sm/restaurant-test.csv?raw=1). 

This data set is a slight variation of the data set that you worked on in the project assignment.

Pre-process the dataset:

a) You will need to eliminate punctuations

b) You will have to deal with/remove stopwords

c) Tokenize the text

d) Stem or Lemmatize to determine the base form of the words

In [0]:
!pip install regex
!pip install gensim



In [0]:
import nltk
nltk.download('all')

import numpy as np
import regex as re
import pandas as pd
from gensim import corpora
from nltk.corpus import stopwords
from nltk import LancasterStemmer
from gensim.models import TfidfModel
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

# read first 1000 rows of data
df = pd.read_csv('https://www.dropbox.com/s/i4zh5fb82x7i3sm/restaurant-test.csv?raw=1', nrows=1000)

In [0]:
# remove junk
df1 = df[df.columns[0]].values

In [0]:
en_stopwords = list(set(nltk.corpus.stopwords.words('english')))

# tokenize function removes punctuation and tokenizes data
def tokenize(data):
  clean = [re.sub(r'[^\w\s]','',i).lower() for i in data]
  en_stopwords = list(set(nltk.corpus.stopwords.words('english')))

  tokens = [word_tokenize(x) for x in clean]
  return tokens

nontokens = []
tokens = tokenize(df1)
# tokens that are not stopwords collected here
for i in tokens:
  nontokens.append([])
  for j in i:
    if j in en_stopwords:
      continue
    else: nontokens[-1].append(j)

tokens = tokenize(df1)

In [0]:
LS = LancasterStemmer()

lemmatized = []

# lemmatized data
for l in nontokens: lemmatized.append([LS.stem(w) for w in l])

**Question 2**: **Perform Vectorization** - you will apply 3 different vectorization techniques. Each technique will generate similar document term matrices where the rows of the matrix will represent the respective text messages and the columns will represent each word or a combination of words. Note that the biggest difference between the techniques is the value depicted in the actual cells of the matrix.

1) Create a document term matrix based on the count of the words in the document. You may want to restrict the # of features/columns based on the top most features ordered by term frequency across the document

2) Create a bigram vector using a combination of adjacent words. In this case, n=2

3) Create a TF-IDF vector wherein the cells of the matrix contain values (i.e. weights) to depict how important a word is to an individual review

In [0]:
import operator
from operator import itemgetter
from collections import Counter

# flatten list
flat_list = [item for sublist in lemmatized for item in sublist]

# Count how many times each word appears
count = Counter(flat_list).items()
sorted_count = sorted(count, key=itemgetter(1))
sorted_count.reverse()                            # put in descending order

# Select 100 most frequent words
top100 = [i[0] for i in sorted_count[:100]]
print(top100)

# Create matrix with reviews as rows and top 100 words as columns, where each cell is 1 if the word appears in the review and 0 otherwise
m = []
for i in lemmatized: m.append([1 if j in i else 0 for j in top100])
print(np.matrix(m))

['plac', 'good', 'get', 'food', 'lik', 'on', 'tim', 'serv', 'us', 'ev', 'ord', 'gre', 'real', 'go', 'would', 'lov', 'back', 'wait', 'im', 'want', 'friend', 'got', 'dont', 'resta', 'try', 'com', 'went', 'mak', 'look', 'littl', 'could', 'know', 'say', 'ask', 'first', 'didnt', 'eat', 'din', 'also', 'loc', 'new', 'think', 'cam', 'much', 'going', 'iv', 'drink', 'nic', 'two', 'lunch', 'peopl', 'alway', 'giv', 'said', 'nev', 'best', 'wel', 'bit', 'night', 'thing', 'pretty', 'chick', 'minut', 'work', 'tak', 'way', 'bar', 'expery', 'review', 'year', 'menu', 'staff', 'long', 'mad', 'sandwich', 'nee', 'burg', 'sauc', 'bet', 'salad', 'sint', 'pric', 'start', 'day', 'right', 'meal', 'sid', 'wasnt', 'cal', 'seat', 'hour', 'fri', 'enjoy', 'around', 'tast', 'star', 'flav', 'man', 'last', 'find']
[[0 0 1 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 1 1 ... 0 0 1]
 ...
 [1 0 1 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


In [0]:
# Create bigram vector using a combination of two adjacent words

def find_ngrams(input_list, n):
  return zip(*[input_list[i:] for i in range(n)])

bigram = list(find_ngrams(top100, 2))
print(bigram)

[('plac', 'good'), ('good', 'get'), ('get', 'food'), ('food', 'lik'), ('lik', 'on'), ('on', 'tim'), ('tim', 'serv'), ('serv', 'us'), ('us', 'ev'), ('ev', 'ord'), ('ord', 'gre'), ('gre', 'real'), ('real', 'go'), ('go', 'would'), ('would', 'lov'), ('lov', 'back'), ('back', 'wait'), ('wait', 'im'), ('im', 'want'), ('want', 'friend'), ('friend', 'got'), ('got', 'dont'), ('dont', 'resta'), ('resta', 'try'), ('try', 'com'), ('com', 'went'), ('went', 'mak'), ('mak', 'look'), ('look', 'littl'), ('littl', 'could'), ('could', 'know'), ('know', 'say'), ('say', 'ask'), ('ask', 'first'), ('first', 'didnt'), ('didnt', 'eat'), ('eat', 'din'), ('din', 'also'), ('also', 'loc'), ('loc', 'new'), ('new', 'think'), ('think', 'cam'), ('cam', 'much'), ('much', 'going'), ('going', 'iv'), ('iv', 'drink'), ('drink', 'nic'), ('nic', 'two'), ('two', 'lunch'), ('lunch', 'peopl'), ('peopl', 'alway'), ('alway', 'giv'), ('giv', 'said'), ('said', 'nev'), ('nev', 'best'), ('best', 'wel'), ('wel', 'bit'), ('bit', 'night

In [0]:
from gensim.corpora import Dictionary
dict = corpora.Dictionary(tokens)
corpus = [dict.doc2bow(text) for text in lemmatized]

tfidf = TfidfModel(corpus);
print(np.matrix([tfidf[i] for i in corpus])[0])

[[list([(0, 0.12259278623561748), (1, 0.09138193783916615), (2, 0.41259817168433677), (3, 0.3445137674932477), (12, 0.08595627779238017), (14, 0.14180548177652033), (15, 0.05876959313977527), (20, 0.08444363789994767), (21, 0.10579187731317281), (22, 0.14685908173258724), (29, 0.14685908173258724), (36, 0.07106771746132792), (40, 0.06898788763180795), (41, 0.09628611392619955), (43, 0.3750137185622269), (45, 0.08248699821372676), (49, 0.1011614620745148), (57, 0.10985544129191069), (59, 0.19146957928752664), (61, 0.19146957928752664), (62, 0.07071006010441128), (68, 0.09719498422158092), (73, 0.13383149266481822), (74, 0.20449716835529572), (77, 0.0725463645861618), (78, 0.11293840018383028), (79, 0.09214138426551395), (85, 0.08248699821372676), (89, 0.17191255558476035), (185, 0.08595627779238017), (460, 0.08475095482535121), (572, 0.07492365454729984), (724, 0.10579187731317281), (777, 0.05612777980216647), (1306, 0.09064274575100788), (1433, 0.09064274575100788), (1815, 0.1129384001

**Question 3: ** 

**a)** Train the Word2vec model with tokenized content; size of the word vectors is 5; the word should show-up at least once in the raw content

**b)** List the number of words in the model's vocabulary

**c)**Examine word similarity to the word "awesome" and "loves"

**d)**Consider each review to be a document on its own. Examine document similarity with Doc2vec to any body of text of your choice

In [0]:
from gensim.models import Word2Vec

model = Word2Vec(nontokens, min_count = 1, size = 5, iter = 500)

words = list(model.wv.vocab)
print(len(words))
print(words)

# Find words similar to "awesome" and "loves"
print('awesome: ', model.wv.similar_by_word('awesome'))
print('loves:   ', model.wv.similar_by_word('loves'))

9911
awesome:  [('rancheros', 0.9964120388031006), ('aussie', 0.9928376078605652), ('huevos', 0.992628812789917), ('depeche', 0.9899281859397888), ('marsala', 0.9896745681762695), ('fans', 0.9896025061607361), ('pointunfortunatelynni', 0.9856349229812622), ('entreenni', 0.9842618107795715), ('warm', 0.9839239120483398), ('highnnthey', 0.983512818813324)]
loves:    [('olive', 0.9970366358757019), ('chile', 0.9970247745513916), ('cheesy', 0.9940910935401917), ('zinburger', 0.9939674139022827), ('seaweed', 0.992052435874939), ('salt', 0.9887025952339172), ('direct', 0.9885267615318298), ('ribs', 0.9883415102958679), ('organic', 0.9879022836685181), ('duo', 0.9875995516777039)]


In [0]:
from gensim.models.doc2vec import TaggedDocument, Doc2Vec

tagged = []
for i, j in enumerate(nontokens):
  tagged.append(TaggedDocument(j, ['sent_{}'.format(i)]))
  
model2 = Doc2Vec(tagged, vector_size=100, epochs=100)
vec = model2.infer_vector('Examine document similarity with Doc2vec to any body of text of your choice'.split())
model2.docvecs.most_similar([vec])

[('sent_260', 0.7732734680175781),
 ('sent_206', 0.7694045305252075),
 ('sent_307', 0.7686758041381836),
 ('sent_222', 0.76849365234375),
 ('sent_428', 0.7650771141052246),
 ('sent_572', 0.7530990242958069),
 ('sent_796', 0.7509182095527649),
 ('sent_84', 0.7508925795555115),
 ('sent_604', 0.7506247758865356),
 ('sent_191', 0.7487233877182007)]

**Question 4: **Iterate over the reviews and output the polarity and subjectivity of the respective tweets. What is the underlying trend with respect to polarity (i.e. positive or negative)?

In [0]:
!pip install vaderSentiment

Collecting vaderSentiment
[?25l  Downloading https://files.pythonhosted.org/packages/86/9e/c53e1fc61aac5ee490a6ac5e21b1ac04e55a7c2aba647bb8411c9aadf24e/vaderSentiment-3.2.1-py2.py3-none-any.whl (125kB)
[K    100% |████████████████████████████████| 133kB 3.7MB/s 
[?25hInstalling collected packages: vaderSentiment
Successfully installed vaderSentiment-3.2.1


In [0]:
#Load the SentimentIntensityAnalyzer object from the VADER package
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

#Create a handle to the SentimentIntensityAnalyzer object
analyzer = SentimentIntensityAnalyzer()

#function that outputs the sentiment ratings
def print_sentiment_ratings(sentence):
    sent = analyzer.polarity_scores(sentence)
    return sent

data = [' '.join(i) for i in nontokens]  
  
pos = 0
neg = 0
neu = 0
com = 0

for i in range(len(data)):
  s = print_sentiment_ratings(data[i])
  pos += s['pos']
  neg += s['neg']
  neu += s['neu']
  com += s['compound']
  if i < 20: print(i, s)

0 {'neg': 0.0, 'neu': 0.638, 'pos': 0.362, 'compound': 0.984}
1 {'neg': 0.175, 'neu': 0.688, 'pos': 0.138, 'compound': -0.1531}
2 {'neg': 0.085, 'neu': 0.581, 'pos': 0.335, 'compound': 0.9981}
3 {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
4 {'neg': 0.152, 'neu': 0.597, 'pos': 0.251, 'compound': 0.7168}
5 {'neg': 0.176, 'neu': 0.549, 'pos': 0.275, 'compound': 0.2263}
6 {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
7 {'neg': 0.168, 'neu': 0.778, 'pos': 0.053, 'compound': -0.4238}
8 {'neg': 0.253, 'neu': 0.421, 'pos': 0.326, 'compound': 0.1779}
9 {'neg': 0.0, 'neu': 0.578, 'pos': 0.422, 'compound': 0.8591}
10 {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
11 {'neg': 0.0, 'neu': 0.917, 'pos': 0.083, 'compound': 0.25}
12 {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
13 {'neg': 0.065, 'neu': 0.707, 'pos': 0.229, 'compound': 0.9558}
14 {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
15 {'neg': 0.0, 'neu': 0.58, 'pos': 0.42, 'compound': 0.7003}
16 {'neg':

In [0]:
print('positive: ', pos/len(data))
print('negative: ', neg/len(data))
print('neutral:  ',neu/len(data))
print('compound: ',com/len(data))

positive:  0.24269599999999975
negative:  0.06306100000000005
neutral:   0.6892170000000001
compound:  0.4843161


**Question 5:** Train a Naive Bayes classifier on a subset of the movie_reviews data set which is part of the NLTK corpus. Once the classifier has been trained, evaluate it's accuracy by testing it against a subset of data from the movie_reviews data set. 

**Step 1**: Import the data set from the nltk corpus

**Step 2**: Examine the categories within the movie_reviews data set

**Step 3**: Examine the files that constitute the movie_reviews data set

**Step 4**: Store a list of words for each file ID, followed by the positive or negative label in one big list.
*Note *that each review has its own ID

**Step 5**:  Randomize the items of a list in place. This is required since there is a high likelihood that we would train on all of the negatives, some positives, and then test only against positives

**Step 6**: Find the most-used words in the text and count how often they are used

**Step 7**: Select the top 5,000 most common words

**Step 8**: Iterate  over the top 5,000  words and build a **feature set **that contains words from across the reviews including whether the word is among the top 5000 words and the corresponding category 

**Step 9**: First split the feature set list in a training and testing subsets


**Step 10**: Train the Naive Bayes Classifier model with the training data set

**Step 11**: Evaluate the accuracy of the model against the testing subset

**Step 12**: Output the most informative features - (for example: which features appear more often in a postive review as compared to a negative review or vice versa).



In [0]:
from sklearn.utils import shuffle
from nltk.corpus import movie_reviews
from sklearn.model_selection import train_test_split
from nltk.sentiment.util import CategorizedPlaintextCorpusReader

mr = pd.DataFrame(columns=['text', 'sentiment'])

for id in movie_reviews.fileids():
  text = ' '.join(movie_reviews.words(id))
  sentiment = 1 if movie_reviews.categories(id) == 'pos' else 0
  mr = mr.append(pd.DataFrame({'text': text,'sentiment': sentiment}, index=[0]))

mr = shuffle(mr)

movie_tokens = tokenize(mr['text'])
movie_nontokens = []

for i in movie_tokens:
  movie_nontokens.append([])
  for j in i:
    if j in en_stopwords:
      continue
    else: movie_nontokens[-1].append(j)

In [0]:
print(mr.head())

  sentiment                                               text
0         0  note : some may consider portions of the follo...
0         0  after 1993 ' s " falling down , " i hoped that...
0         0  i ' m giving this stinker . normally , the wor...
0         0  an 80 - year old woman jumps enthusiastically ...
0         0  vegas vacation is the fourth film starring che...


In [0]:
print(len(movie_reviews.fileids('pos')))

1000


In [0]:
LS2 = LancasterStemmer()
lemmatized2 = []
for l in movie_nontokens: lemmatized2.append([LS.stem(w) for w in l])

In [0]:
flat_list_m = [item for sublist in lemmatized2 for item in sublist]

# Count how many times each work appears
count_m = Counter(flat_list_m).items()
sorted_count_m = sorted(count_m, key=itemgetter(1))
sorted_count_m.reverse()

# Select 5000 most frequent words
top5000 = [i[0] for i in sorted_count_m[:5000]]
print(top5000)

['film', 'movy', 'on', 'act', 'lik', 'ev', 'charact', 'mak', 'real', 'get', 'tim', 'us', 'scen', 'com', 'play', 'good', 'direct', 'story', 'see', 'man', 'would', 'tak', 'much', 'wel', 'also', 'seem', 'end', 'two', 'way', 'look', 'first', 'work', 'giv', 'year', 'thing', 'lov', 'plot', 'lif', 'know', 'perform', 'star', 'littl', 'bad', 'peopl', 'new', 'could', 'nev', 'show', 'best', 'fin', 'rol', 'gre', 'many', 'watch', 'want', 'car', 'mad', 'say', 'hum', 'find', 'writ', 'think', 'big', 'becom', 'stil', 'anoth', 'go', 'back', 'effect', 'turn', 'kil', 'audy', 'world', 'someth', 'liv', 'interest', 'set', 'day', 'feel', 'bet', 'long', 'old', 'howev', 'part', 'fact', 'sery', 'every', 'though', 'cast', 'guy', 'comedy', 'friend', 'run', 'seen', 'enough', 'point', 'cre', 'around', 'going', 'may', 'last', 'lin', 'mat', 'nam', 'bas', 'funny', 'try', 'origin', 'right', 'op', 'produc', 'mom', 'begin', 'wom', 'young', 'tru', 'minut', 'plac', 'high', 'almost', 'ear', 'sint', 'lot', 'person', 'noth', '

In [0]:
mm = []
for i in lemmatized2: mm.append([1 if j in i else 0 for j in top5000])
print(np.matrix(mm))

[[1 1 1 ... 0 0 0]
 [1 1 1 ... 0 0 0]
 [1 1 1 ... 0 0 0]
 ...
 [1 1 1 ... 0 0 0]
 [1 1 1 ... 0 0 0]
 [1 1 1 ... 0 0 0]]


In [0]:
mmdf = pd.DataFrame(mm, columns = top5000, index = pd.DataFrame(movie_nontokens))
mmdf['sentiment'] = mr['sentiment'].values
print(mmdf.head())

In [0]:
from sklearn.naive_bayes import MultinomialNB

train, test = train_test_split(mmdf, test_size = 0.3)

cols = train.columns[:-1]
gnb = MultinomialNB()
gnb.fit(train[cols], train['sentiment'])
y_pred = gnb.predict(test[cols])

print("Number of mislabeled points out of a total {} points : {}, performance {:05.2f}%"
      .format(
          test.shape[0],
          (test["sentiment"] != y_pred).sum(),
          100*(1-(test["sentiment"] != y_pred).sum()/test.shape[0])
))

Number of mislabeled points out of a total 600 points : 87, performance 85.50%


In [0]:
pos_r = mmdf[mmdf['sentiment'] == 1]
neg_r = mmdf[mmdf['sentiment'] == 0]

pnum = np.array(pos_r[pos_r.columns].sum())
nnum = np.array(neg_r[neg_r.columns].sum())

dif = pnum > nnum

print(mmdf.columns[dif].values[:15])
print(mmdf.columns[~dif].values[:15])

['film' 'one' 'time' 'good' 'story' 'much' 'character' 'also' 'two' 'well'
 'characters' 'first' 'see' 'way' 'life']
['movie' 'like' 'even' 'would' 'get' 'make' 'really' 'plot' 'little'
 'could' 'bad' 'director' 'know' 'action' 'another']
