<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Natural Language Processing (NLP) Review Lab

_Authors: Joseph Nelson (DC)_

---

> **Note: This lab is intended to be done as a walkthrough with the instructor.**

## Introduction


*Adapted from [NLP Crash Course](http://files.meetup.com/7616132/DC-NLP-2013-09%20Charlie%20Greenbacker.pdf) by Charlie Greenbacker, [Introduction to NLP](http://spark-public.s3.amazonaws.com/nlp/slides/intro.pdf) by Dan Jurafsky, and Kevin Markham's Data School Curriculum*

### What is NLP?

- Using computers to process (analyze, understand, generate) natural human languages
- Most knowledge created by humans is unstructured text, and we need a way to make sense of it
- Build probabilistic models using data about a language

### What are some of the higher level task areas?

- **Information retrieval**: Find relevant results and similar results
    - [Google](https://www.google.com/)
- **Information extraction**: Structured information from unstructured documents
    - [Events from Gmail](https://support.google.com/calendar/answer/6084018?hl=en)
- **Machine translation**: One language to another
    - [Google Translate](https://translate.google.com/)
- **Text simplification**: Preserve the meaning of text, but simplify the grammar and vocabulary
    - [Rewordify](https://rewordify.com/)
    - [Simple English Wikipedia](https://simple.wikipedia.org/wiki/Main_Page)
- **Predictive text input**: Faster or easier typing
    - [Kevin Markham's application](https://justmarkham.shinyapps.io/textprediction/)
    - [A much better application](https://farsite.shinyapps.io/swiftkey-cap/)
- **Sentiment analysis**: Attitude of speaker
    - [Hater News](http://haternews.herokuapp.com/)
- **Automatic summarization**: Extractive or abstractive summarization
    - [autotldr](https://www.reddit.com/r/technology/comments/35brc8/21_million_people_still_use_aol_dialup/cr2zzj0)
- **Natural Language Generation**: Generate text from data
    - [How a computer describes a sports match](http://www.bbc.com/news/technology-34204052)
    - [Publishers withdraw more than 120 gibberish papers](http://www.nature.com/news/publishers-withdraw-more-than-120-gibberish-papers-1.14763)
- **Speech recognition and generation**: Speech-to-text, text-to-speech
    - [Google's Web Speech API demo](https://www.google.com/intl/en/chrome/demos/speech.html)
    - [Vocalware Text-to-Speech demo](https://www.vocalware.com/index/demo)
- **Question answering**: Determine the intent of the question, match query with knowledge base, evaluate hypotheses
    - [How did supercomputer Watson beat Jeopardy champion Ken Jennings?](http://blog.ted.com/how-did-supercomputer-watson-beat-jeopardy-champion-ken-jennings-experts-discuss/)
    - [IBM's Watson Trivia Challenge](http://www.nytimes.com/interactive/2010/06/16/magazine/watson-trivia-game.html)
    - [The AI Behind Watson](http://www.aaai.org/Magazine/Watson/watson.php)

### What are some of the lower level components?

- **Tokenization**: breaking text into tokens (words, sentences, n-grams)
- **Stopword removal**: a/an/the
- **Stemming and lemmatization**: root word
- **TF-IDF**: word importance
- **Part-of-speech tagging**: noun/verb/adjective
- **Named entity recognition**: person/organization/location
- **Spelling correction**: "New Yrok City"
- **Word sense disambiguation**: "buy a mouse"
- **Segmentation**: "New York City subway"
- **Language detection**: "translate this page"
- **Machine learning**

### Why is NLP hard?

- **Ambiguity**:
    - Hospitals are Sued by 7 Foot Doctors
    - Juvenile Court to Try Shooting Defendant
    - Local High School Dropouts Cut in Half
- **Non-standard English**: text messages
- **Idioms**: "throw in the towel"
- **Newly coined words**: "retweet"
- **Tricky entity names**: "Where is A Bug's Life playing?"
- **World knowledge**: "Mary and Sue are sisters", "Mary and Sue are mothers"

NLP requires an understanding of the **language** and the **world**.

## Part 1: Reading in the Yelp Reviews

- "corpus" = collection of documents
- "corpora" = plural form of corpus

In [1]:
import pandas as pd
import numpy as np
import scipy as sp

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

from textblob import TextBlob, Word
from nltk.stem.snowball import SnowballStemmer

%matplotlib inline

In [2]:
csv_file = '../data/yelp.csv'

In [3]:
yelp = pd.read_csv(csv_file)

In [4]:
yelp[['stars', 'text']].sample(5)

| | stars | text |
|---|---|---|
| 7623 | 5 | I really don't like these types of Mexican foo... |
| 4547 | 2 | This was real dissapointment because I really ... |
| 398 | 3 | Sacks is a great little place. One issue and ... |
| 2940 | 2 | To even admit to serving tex-mex is the first ... |
| 4686 | 1 | UPDATE: This location is closed. Boo! |


In [5]:
# 10,000 reviews, each with 10 characteristics
yelp.shape

(10000, 10)

In [6]:
yelp['stars'].value_counts()

4    3526
5    3337
3    1461
2     927
1     749
Name: stars, dtype: int64

In [7]:
yelp.loc[9005, 'text']

"Being a creature of habit anytime I want good sushi I go to Tokyo Lobby.  Well, my group wanted to branch out and try something new so we decided on Sakana. Not a fan.  And what's shocking to me is this place was packed!  The restaurant opens at 5:30 on Saturday and we arrived at around 5:45 and were lucky to get the last open table.  I don't get it...\r\n\r\nMessy rolls that all tasted the same.  We ordered the tootsie roll and the crunch roll, both tasted similar, except of course for the crunchy captain crunch on top.  Just a mushy mess, that was hard to eat.  Bland tempura.  No bueno.  I did, however, have a very good tuna poke salad, but I would not go back just for that. \r\n\r\nIf you want good sushi on the west side, or the entire valley for that matter, say no to Sakana and yes to Tokyo Lobby."

In [8]:
yelp.loc[9005, 'stars']

2

### 1.1 Subset the reviews to best and worst.

- Select only 5-star and 1-star reviews.
- The text will be the features, the stars will be the target.
- Create a train-test split.

In [9]:
yelp_reviews = yelp[(yelp.stars==5) | (yelp.stars==1)]

yelp_reviews.sample(10)

| | business_id | date | review_id | stars | text | type | user_id | cool | useful | funny |
|---|---|---|---|---|---|---|---|---|---|---|
| 4241 | d_8bMNQd0mesbEUeq1U2kQ | 2012-01-16 | 5jqJpDe8P22HtS4GH4kZ_A | 5 | This is by far my favorite Indian restaurant i... | review | U1VhMJAKHaTqCSIp9RoRcg | 1 | 0 | 0 |
| 8611 | ncxBZxetREZ_jCma0c7mHA | 2011-05-14 | 18bXhCOgwf_-TnvRSr8zWA | 1 | The wash ain't bad....but the process to get i... | review | OoyMyBD0a-QmEJrFzw78Fw | 0 | 1 | 0 |
| 5042 | 9BJ5h9X1krpXFjKj0a6wbg | 2008-01-20 | vgjSeoz6mHX5wVh6MH09sQ | 5 | Love it!\r\nThe food is VERY spicy though. Lu... | review | PShy2RYNadDUhJf4ErOJ7w | 1 | 0 | 0 |
| 9696 | NA3tQYxR6Fq5O8nV6u41Tw | 2011-10-04 | 0XQNmyxaBbZ_2M9AGA8RrQ | 5 | My favorite place in the world to eat! Great s... | review | JRkqD8JvtQATNTTv6UI7RA | 0 | 0 | 0 |
| 5013 | N80E9zoWEpHhi4kH2eKAsw | 2010-08-22 | KH1QxaALC-vm1LvKPd95UQ | 5 | Those of us "foodies" know that if the company... | review | NLDDaat42UQXQCOpU4e2TA | 3 | 3 | 2 |
| 5054 | e0Or6HYHL03y7IHl0itIOw | 2011-02-24 | ZAU_99yxKqM_EEPjcDCqIg | 5 | All I have to say is wow. The auto picks and ... | review | wUCRCqCcRFAvX0e17_6odA | 1 | 1 | 0 |
| 4176 | NmtZuT8p4vNk259dvozbvg | 2012-10-09 | H8W9aUhwqWi__d4BuaDkyA | 5 | Kaley helped me get the best room in Mesa AZ, ... | review | jdPaIEMrUpzR4-dC-o5zSw | 0 | 0 | 0 |
| 9110 | t0NencbvVVlH6mcRlNTPcg | 2010-12-23 | ndSwfewtSaTTjXNm2wR71Q | 1 | I was excited to go to the lululemon athletica... | review | jwSTtW_q8PULge2dK1t_Lg | 0 | 3 | 0 |
| 178 | 3H2ttTM2aSIaZ6FTjHwDQQ | 2012-11-29 | 6wih8fh9hGHRCc4tk6U3ew | 5 | I have only gotten the cafe sua da (iced coffe... | review | UhryFhGe1tqdbBzAhdooTQ | 0 | 0 | 0 |
| 8327 | rQ4z0EStSZE4acgkne6Hmg | 2012-03-12 | OKh2j7gfxprNY_BcKSOp5w | 5 | Tuck Shop is definitely one of my favorite res... | review | JTwQpwYexzuEkYY53XHdVQ | 0 | 0 | 0 |


In [10]:
# 4086 reviews
yelp_reviews.shape

(4086, 10)

In [11]:
# Majority class is 5 stars; approx 82%
yelp_reviews['stars'].value_counts()

5    3337
1     749
Name: stars, dtype: int64

In [12]:
X = yelp_reviews.text
y = yelp_reviews.stars

In [13]:
# Split the review dataset into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

## Part 2: Tokenization

- **What:** Separate text into units such as sentences or words
- **Why:** Gives structure to previously unstructured text
- **Notes:** Relatively easy with English language text, not easy with some languages
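
As a rough illustration of the idea, the sketch below splits text into sentences and then into lowercase word tokens using only regular expressions. The helper names `split_sentences` and `split_words` are made up for this example, and the sentence rule is deliberately naive (it would break on abbreviations such as "Dr."). The word pattern mirrors the default `token_pattern` of `CountVectorizer`, which keeps runs of two or more word characters.

```python
import re

def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def split_words(text):
    """Lowercase word tokenizer that keeps runs of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

text = "It was amazing.  Anyway, I can't wait to go back!"
sentences = split_sentences(text)
words = split_words(sentences[0])

print(sentences)
print(words)
```

Note that single-character tokens are dropped by this pattern, so "I" and the "t" in "can't" disappear.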

### 2.1 Use CountVectorizer with stop words to convert the training and testing text data, and set up a function to vectorize and validate

[CountVectorizer documentation](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)

- **lowercase:** boolean, True by default
    - Convert all characters to lowercase before tokenizing.
- **ngram_range:** tuple (min_n, max_n)
    - The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.
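
To make the `ngram_range` bounds concrete, here is a minimal pure-Python sketch that builds every word n-gram with `min_n <= n <= max_n` from a list of tokens. The function `word_ngrams` is a hypothetical helper written for this illustration, not part of scikit-learn.

```python
def word_ngrams(tokens, ngram_range=(1, 1)):
    """Return all word n-grams whose length n satisfies min_n <= n <= max_n."""
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the sequence.
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

tokens = "the food was great".split()
bigram_features = word_ngrams(tokens, ngram_range=(1, 2))
print(bigram_features)
```

With four tokens, `(1, 2)` yields four unigrams plus three bigrams, which is why widening the range grows the feature count quickly.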

In [14]:
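def vectorise_test(vect):
    # Learn the vocabulary on the training text only, then vectorize it.
    X_train_vect = vect.fit_transform(X_train)
    print(('Features: ', X_train_vect.shape[1]))
    
    # Reuse the fitted vocabulary for the test text.
    X_test_vect = vect.transform(X_test)
    
    nb = MultinomialNB()
    nb.fit(X_train_vect, y_train)
    
    # f1_score defaults to pos_label=1, so F1 is reported for the 1-star class.
    y_pred = nb.predict(X_test_vect)
    print(('F1-score: ', metrics.f1_score(y_test, y_pred)))
    print(('Accuracy: ', metrics.accuracy_score(y_test, y_pred)))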

In [15]:
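# Note: a list is read as the stop words themselves, so this removes only the token "english".
# Pass stop_words='english' (a string) to use the built-in English stop word list.
vect = CountVectorizer(stop_words=['english'])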

In [16]:
vectorise_test(vect)

('Features: ', 16711)
('F1-score: ', 0.721407624633431)
('Accuracy: ', 0.9070450097847358)


### 2.2 Predict the star rating with the new features from CountVectorizer with Logistic Regression.

Validate on the test set.

In [17]:
logreg = LogisticRegression(max_iter=1000)
count_train = vect.fit_transform(X_train)
count_test = vect.transform(X_test)

logreg.fit(count_train, y_train)

LogisticRegression(max_iter=1000)

In [18]:
y_pred = logreg.predict(count_test)

In [20]:
print(metrics.f1_score(y_test, y_pred))
print(metrics.accuracy_score(y_test, y_pred))

0.7890410958904109
0.9246575342465754


In [None]:
# Logistic Regression yields better results than MultinomialNB

## Part 3: Stopword Removal

- **What:** Remove common words that will likely appear in any text
- **Why:** They don't tell you much about your text

### 3.1 Recreate your features with CountVectorizer removing stopwords.

- **stop_words:** string {'english'}, list, or None (default)
    - If 'english', a built-in stop word list for English is used.
    - If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens.
    - If None, no stop words will be used. max_df can be set to a value in the range [0.7, 1.0) to automatically detect and filter stop words based on intra corpus document frequency of terms.
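
The difference between the string and list forms is easy to miss, so the sketch below fits two vectorizers on a made-up two-document corpus (the `docs` list and variable names are assumptions for illustration). It assumes a recent scikit-learn is installed.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Tiny made-up corpus, only for illustration.
docs = ["The food was great", "The English breakfast was the best"]

# String form: use scikit-learn's built-in English stop word list.
builtin = CountVectorizer(stop_words="english").fit(docs)

# List form: the list itself is the stop word list, so only "english" is removed.
literal = CountVectorizer(stop_words=["english"]).fit(docs)

builtin_vocab = set(builtin.vocabulary_)
literal_vocab = set(literal.vocabulary_)

print(sorted(builtin_vocab))
print(sorted(literal_vocab))
```

In the first vocabulary common words such as "the" are gone while "english" stays. In the second, "the" stays and only "english" is dropped.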

In [21]:
count_vect = CountVectorizer()

In [22]:
vectorise_test(count_vect)

('Features: ', 16712)
('F1-score: ', 0.7235294117647059)
('Accuracy: ', 0.9080234833659491)


In [23]:
# Nearly identical to 2.1: that vectorizer dropped only the token "english" (16711 vs. 16712 features)

### 3.2 Using Logistic Regression, validate your model on the features with stopwords removed.

In [24]:
lgrg = LogisticRegression(max_iter=1000)
X_train_vect = vect.fit_transform(X_train)
X_test_vect = vect.transform(X_test)

lgrg.fit(X_train_vect, y_train)

LogisticRegression(max_iter=1000)

In [25]:
pred = lgrg.predict(X_test_vect)

In [27]:
print(metrics.f1_score(y_test, pred))
print(metrics.accuracy_score(y_test, pred))

0.7890410958904109
0.9246575342465754


In [None]:
# Scores are identical to 2.2 because `vect` from In [15] is reused here, so the features are unchanged

## Part 4: Other CountVectorizer Options

### 4.1 Shrink the maximum number of features and re-test the model.

- **max_features:** int or None, default=None
    - If not None, build a vocabulary that only considers the top max_features ordered by term frequency across the corpus.

In [28]:
count_features = CountVectorizer(stop_words='english', max_features=15000)

In [29]:
vectorise_test(count_features)

('Features: ', 15000)
('F1-score: ', 0.7377521613832853)
('Accuracy: ', 0.910958904109589)


In [None]:
# F1-score improves slightly; accuracy is essentially unchanged (about 0.91)

### 4.2 Change the minimum document frequency for terms and test the model's performance.

- **min_df:** float in range [0.0, 1.0] or int, default=1
    - When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts.
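
The following pure-Python sketch shows one way the two pruning options could interact: terms are first filtered by document frequency, then the survivors are ranked by corpus-wide term frequency. `build_vocabulary` is a hypothetical helper for illustration, it handles only integer `min_df`, and it is not scikit-learn's actual implementation.

```python
from collections import Counter

def build_vocabulary(tokenized_docs, min_df=1, max_features=None):
    """Keep terms found in at least min_df documents, then the most frequent max_features."""
    doc_freq = Counter()   # number of documents containing each term
    term_freq = Counter()  # total occurrences of each term in the corpus
    for tokens in tokenized_docs:
        term_freq.update(tokens)
        doc_freq.update(set(tokens))

    kept = [term for term in doc_freq if doc_freq[term] >= min_df]
    if max_features is not None:
        # Rank by descending term frequency, breaking ties alphabetically.
        kept = sorted(kept, key=lambda t: (-term_freq[t], t))[:max_features]
    return sorted(kept)

docs = [["great", "sushi"], ["bad", "sushi"], ["great", "service", "great"]]
print(build_vocabulary(docs))
print(build_vocabulary(docs, min_df=2))
print(build_vocabulary(docs, max_features=1))
```

Raising `min_df` removes rare terms such as typos and one-off names, which is consistent with the large drop in feature count seen in this section.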

In [30]:
vectorizer = CountVectorizer(stop_words='english', min_df=3, max_features=15000)

In [31]:
vectorise_test(vectorizer)

('Features: ', 6036)
('F1-score: ', 0.7631578947368421)
('Accuracy: ', 0.9119373776908023)


In [39]:
# F1-score improves again; accuracy improves slightly

## Part 5: Introduction to TextBlob

TextBlob: "Simplified Text Processing"

### 5.1 Use `TextBlob` to convert the text in the first review in the dataset.

In [40]:
print(yelp.text[0])

My wife took me here on my birthday for breakfast and it was excellent.  The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure.  Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning.  It looked like the place fills up pretty quickly so the earlier you get here the better.

Do yourself a favor and get their Bloody Mary.  It was phenomenal and simply the best I've ever had.  I'm pretty sure they only use ingredients from their garden and blend them fresh when you order it.  It was amazing.

While EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious.  It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete.  It was the best "toast" I've ever had.

Anyway, I can't wait to go back!


In [32]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\shmel\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [33]:
review = TextBlob(yelp.text[0])

### 5.2 List the words in the `TextBlob` object.

In [34]:
review.words

WordList(['My', 'wife', 'took', 'me', 'here', 'on', 'my', 'birthday', 'for', 'breakfast', 'and', 'it', 'was', 'excellent', 'The', 'weather', 'was', 'perfect', 'which', 'made', 'sitting', 'outside', 'overlooking', 'their', 'grounds', 'an', 'absolute', 'pleasure', 'Our', 'waitress', 'was', 'excellent', 'and', 'our', 'food', 'arrived', 'quickly', 'on', 'the', 'semi-busy', 'Saturday', 'morning', 'It', 'looked', 'like', 'the', 'place', 'fills', 'up', 'pretty', 'quickly', 'so', 'the', 'earlier', 'you', 'get', 'here', 'the', 'better', 'Do', 'yourself', 'a', 'favor', 'and', 'get', 'their', 'Bloody', 'Mary', 'It', 'was', 'phenomenal', 'and', 'simply', 'the', 'best', 'I', "'ve", 'ever', 'had', 'I', "'m", 'pretty', 'sure', 'they', 'only', 'use', 'ingredients', 'from', 'their', 'garden', 'and', 'blend', 'them', 'fresh', 'when', 'you', 'order', 'it', 'It', 'was', 'amazing', 'While', 'EVERYTHING', 'on', 'the', 'menu', 'looks', 'excellent', 'I', 'had', 'the', 'white', 'truffle', 'scrambled', 'eggs', 

### 5.3 List the sentences in the `TextBlob` object.

In [35]:
review.sentences

[Sentence("My wife took me here on my birthday for breakfast and it was excellent."),
 Sentence("The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure."),
 Sentence("Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning."),
 Sentence("It looked like the place fills up pretty quickly so the earlier you get here the better."),
 Sentence("Do yourself a favor and get their Bloody Mary."),
 Sentence("It was phenomenal and simply the best I've ever had."),
 Sentence("I'm pretty sure they only use ingredients from their garden and blend them fresh when you order it."),
 Sentence("It was amazing."),
 Sentence("While EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious."),
 Sentence("It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete."),
 Sentence("It was the best "toast" I've ever had."),


## Part 6: Stemming and Lemmatization

**Stemming:**

- **What:** Reduce a word to its base/stem/root form
- **Why:** Often makes sense to treat related words the same way
- **Notes:**
    - Uses a "simple" and fast rule-based approach
    - Stemmed words are usually not shown to users (used for analysis/indexing)
    - Some search engines treat words with the same stem as synonyms

### 6.1 Initialize the `SnowballStemmer` and stem the words in the first review.

In [36]:
stemmer = SnowballStemmer('english')

In [37]:
# See "amazing" becomes "amaz"
print([stemmer.stem(word) for word in review.words])

['my', 'wife', 'took', 'me', 'here', 'on', 'my', 'birthday', 'for', 'breakfast', 'and', 'it', 'was', 'excel', 'the', 'weather', 'was', 'perfect', 'which', 'made', 'sit', 'outsid', 'overlook', 'their', 'ground', 'an', 'absolut', 'pleasur', 'our', 'waitress', 'was', 'excel', 'and', 'our', 'food', 'arriv', 'quick', 'on', 'the', 'semi-busi', 'saturday', 'morn', 'it', 'look', 'like', 'the', 'place', 'fill', 'up', 'pretti', 'quick', 'so', 'the', 'earlier', 'you', 'get', 'here', 'the', 'better', 'do', 'yourself', 'a', 'favor', 'and', 'get', 'their', 'bloodi', 'mari', 'it', 'was', 'phenomen', 'and', 'simpli', 'the', 'best', 'i', 've', 'ever', 'had', 'i', "'m", 'pretti', 'sure', 'they', 'onli', 'use', 'ingredi', 'from', 'their', 'garden', 'and', 'blend', 'them', 'fresh', 'when', 'you', 'order', 'it', 'it', 'was', 'amaz', 'while', 'everyth', 'on', 'the', 'menu', 'look', 'excel', 'i', 'had', 'the', 'white', 'truffl', 'scrambl', 'egg', 'veget', 'skillet', 'and', 'it', 'was', 'tasti', 'and', 'delic

### 6.2 Use the built-in `lemmatize` function on the words of the first review (parsed by `TextBlob`)

**Lemmatization**

- **What:** Derive the canonical form ('lemma') of a word
- **Why:** Can be better than stemming, since a lemma is a real word rather than a truncated stem
- **Notes:** Uses a dictionary-based approach (slower than stemming)
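
A dictionary-based approach can be pictured as a lookup keyed on the word and its part of speech. The table below is a tiny hypothetical stand-in for a real resource such as WordNet, and `toy_lemmatize` is an illustrative helper, not TextBlob's implementation.

```python
# Hypothetical miniature lemma table: (word, part of speech) -> lemma.
LEMMAS = {
    ("was", "v"): "be",
    ("were", "v"): "be",
    ("eggs", "n"): "egg",
    ("better", "a"): "good",
}

def toy_lemmatize(word, pos="n"):
    """Look up the lemma for a word and part of speech; fall back to the lowercased word."""
    w = word.lower()
    return LEMMAS.get((w, pos), w)

print(toy_lemmatize("was", pos="v"))
print(toy_lemmatize("was"))
print(toy_lemmatize("Eggs"))
```

The part-of-speech argument matters: treating "was" as a verb gives "be", while treating it as a noun finds no entry in this table.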

In [39]:
# See "was" becomes "wa": lemmatize() treats each word as a noun by default and strips the final "s"
nltk.download('wordnet')
print([word.lemmatize() for word in review.words])

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\shmel\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


['My', 'wife', 'took', 'me', 'here', 'on', 'my', 'birthday', 'for', 'breakfast', 'and', 'it', 'wa', 'excellent', 'The', 'weather', 'wa', 'perfect', 'which', 'made', 'sitting', 'outside', 'overlooking', 'their', 'ground', 'an', 'absolute', 'pleasure', 'Our', 'waitress', 'wa', 'excellent', 'and', 'our', 'food', 'arrived', 'quickly', 'on', 'the', 'semi-busy', 'Saturday', 'morning', 'It', 'looked', 'like', 'the', 'place', 'fill', 'up', 'pretty', 'quickly', 'so', 'the', 'earlier', 'you', 'get', 'here', 'the', 'better', 'Do', 'yourself', 'a', 'favor', 'and', 'get', 'their', 'Bloody', 'Mary', 'It', 'wa', 'phenomenal', 'and', 'simply', 'the', 'best', 'I', "'ve", 'ever', 'had', 'I', "'m", 'pretty', 'sure', 'they', 'only', 'use', 'ingredient', 'from', 'their', 'garden', 'and', 'blend', 'them', 'fresh', 'when', 'you', 'order', 'it', 'It', 'wa', 'amazing', 'While', 'EVERYTHING', 'on', 'the', 'menu', 'look', 'excellent', 'I', 'had', 'the', 'white', 'truffle', 'scrambled', 'egg', 'vegetable', 'skill

### 6.3 Write a function that uses `TextBlob` and `lemmatize` to lemmatize text.

In [45]:
def lemma_text(text):
    text = str(text).lower()
    words = TextBlob(text).words
    return [word.lemmatize() for word in words]

In [46]:
lemma_text(yelp.text[5])

['quiessence',
 'is',
 'simply',
 'put',
 'beautiful',
 'full',
 'window',
 'and',
 'earthy',
 'wooden',
 'wall',
 'give',
 'a',
 'feeling',
 'of',
 'warmth',
 'inside',
 'this',
 'restaurant',
 'perched',
 'in',
 'the',
 'middle',
 'of',
 'a',
 'farm',
 'the',
 'restaurant',
 'seemed',
 'fairly',
 'full',
 'even',
 'on',
 'a',
 'tuesday',
 'evening',
 'we',
 'had',
 'secured',
 'reservation',
 'just',
 'a',
 'couple',
 'day',
 'before',
 'my',
 'friend',
 'and',
 'i',
 'had',
 'sampled',
 'sandwich',
 'at',
 'the',
 'farm',
 'kitchen',
 'earlier',
 'that',
 'week',
 'and',
 'were',
 'impressed',
 'enough',
 'to',
 'want',
 'to',
 'eat',
 'at',
 'the',
 'restaurant',
 'the',
 'crisp',
 'fresh',
 'veggie',
 'did',
 "n't",
 'disappoint',
 'we',
 'ordered',
 'the',
 'salad',
 'with',
 'orange',
 'and',
 'grapefruit',
 'slice',
 'and',
 'the',
 'crudites',
 'to',
 'start',
 'both',
 'were',
 'very',
 'good',
 'i',
 'did',
 "n't",
 'even',
 'know',
 'how',
 'much',
 'i',
 'liked',
 'raw',
 

### 6.4 Provide your function to `CountVectorizer` as the `analyzer` and test the performance of your model.

In [49]:
# Note: stop_words only applies when analyzer='word', so it has no effect with a custom analyzer.
vector = CountVectorizer(stop_words='english', analyzer=lemma_text, min_df=2)

In [50]:
vectorise_test(vector)

('Features: ', 7878)
('F1-score: ', 0.7584415584415584)
('Accuracy: ', 0.9090019569471625)


In [None]:
# Both accuracy and F1-score are slightly worse than in 4.2

## Part 7: Term Frequency-Inverse Document Frequency (TF-IDF)

- **What:** Computes the "relative frequency" with which a word appears in a document compared to its frequency across all documents
- **Why:** More useful than "term frequency" for identifying "important" words in each document (high frequency in that document, low frequency in other documents)
- **Notes:** Used for search engine scoring, text summarization, document clustering

### 7.1 Build a simple TF-IDF using CountVectorizer

- Term Frequency can be calculated with the default CountVectorizer.
- Document Frequency, from which Inverse Document Frequency is derived, can be calculated with CountVectorizer and the argument `binary=True`.

**More details:** [TF-IDF is about what matters](http://planspace.org/20150524-tfidf_is_about_what_matters/)

In [18]:
# A:

## Part 8: Using TF-IDF to Summarize a Yelp Review

> **Note:** Reddit's autotldr uses the [SMMRY](http://smmry.com/about) algorithm, which is based on TF-IDF!

### 8.1 Build a TF-IDF predictor matrix excluding stopwords with `TfidfVectorizer`

In [51]:
def tf_idf_vectorize(vect):
    X_train_vect = vect.fit_transform(X_train)
    print(('Features: ', X_train_vect.shape[1]))
    
    X_test_vect = vect.transform(X_test)
    
    nb = MultinomialNB()
    nb.fit(X_train_vect, y_train)
    
    y_pred = nb.predict(X_test_vect)
    print(('F1-score: ', metrics.f1_score(y_test, y_pred)))
    print(('Accuracy: ', metrics.accuracy_score(y_test, y_pred)))

In [52]:
words_vect = TfidfVectorizer()

In [53]:
# F1-score is very low: almost no 1-star reviews are identified, and accuracy is close to the share of 5-star reviews
tf_idf_vectorize(words_vect)

('Features: ', 16712)
('F1-score: ', 0.01)
('Accuracy: ', 0.8062622309197651)


In [54]:
tfidf_vect = TfidfVectorizer(stop_words='english')

In [55]:
# F1-score slightly improved, but it is still close to zero
tf_idf_vectorize(tfidf_vect)

('Features: ', 16415)
('F1-score: ', 0.01990049751243781)
('Accuracy: ', 0.8072407045009785)
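
The gap between accuracy and F1 can be reproduced with a majority-class baseline. The sketch below uses the class counts of the full 5-star/1-star subset (3337 and 749) rather than the actual test split, and scores F1 for the 1-star class, which is the class `metrics.f1_score` reports here by default (`pos_label=1`). The helper functions are written for this illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_for_label(y_true, y_pred, label):
    """F1-score treating `label` as the positive class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Class counts from the subset above; a baseline that always predicts 5 stars.
y_true = [5] * 3337 + [1] * 749
always_five = [5] * len(y_true)

print(round(accuracy(y_true, always_five), 3))
print(f1_for_label(y_true, always_five, label=1))
```

A model that never predicts 1 star still reaches roughly 82% accuracy on this subset while its F1 for the 1-star class is zero, so accuracy alone says little when classes are this imbalanced.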


### 8.2 Write a function to pull out the top 5 words by TF-IDF score from a review

In [80]:
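def tfidf_scores(text):
    
    tfidf_vectorize = TfidfVectorizer(stop_words='english')
    tfidf_vect = tfidf_vectorize.fit_transform(text)
    features = tfidf_vectorize.get_feature_names_out()
    
    # One row per review, one column per term (dense, so memory-heavy on large corpora).
    scores = pd.DataFrame(tfidf_vect.toarray(), columns=features)
    
    # Caution: nlargest with every column sorts rows by the first column ('00'), breaking ties with later columns,
    # so this prints the 5 reviews scoring highest on '00' rather than the top 5 words of one review.
    print(scores.nlargest(n=5, columns=features))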

In [81]:
tfidf_scores(X_train)

            00  000  00a  00am  00pm   01   02   03  03342   04  ...  \
1515  0.430474  0.0  0.0   0.0   0.0  0.0  0.0  0.0    0.0  0.0  ...   
1783  0.371373  0.0  0.0   0.0   0.0  0.0  0.0  0.0    0.0  0.0  ...   
2721  0.294527  0.0  0.0   0.0   0.0  0.0  0.0  0.0    0.0  0.0  ...   
946   0.291961  0.0  0.0   0.0   0.0  0.0  0.0  0.0    0.0  0.0  ...   
2987  0.277590  0.0  0.0   0.0   0.0  0.0  0.0  0.0    0.0  0.0  ...   

      zucchini  zuccini  zuchinni  zumba  zupa  zupas  zuzu  zzed  école   ém  
1515       0.0      0.0       0.0    0.0   0.0    0.0   0.0   0.0    0.0  0.0  
1783       0.0      0.0       0.0    0.0   0.0    0.0   0.0   0.0    0.0  0.0  
2721       0.0      0.0       0.0    0.0   0.0    0.0   0.0   0.0    0.0  0.0  
946        0.0      0.0       0.0    0.0   0.0    0.0   0.0   0.0    0.0  0.0  
2987       0.0      0.0       0.0    0.0   0.0    0.0   0.0   0.0    0.0  0.0  

[5 rows x 16415 columns]


## Part 9: Sentiment Analysis

### 9.1 Extract sentiment from a review parsed with `TextBlob`

Sentiment polarity ranges from -1, the most negative, to 1, the most positive. A parsed TextBlob object has sentiment which can be accessed with:

    review.sentiment.polarity
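
For intuition about what a polarity score in [-1, 1] represents, here is a toy lexicon-based scorer that averages the scores of known words. The `LEXICON` values are invented for this example and `toy_polarity` is not TextBlob's algorithm, which uses a much larger lexicon and also handles modifiers and negation.

```python
import re

# Hypothetical word scores in [-1, 1], invented for illustration.
LEXICON = {
    "excellent": 1.0,
    "amazing": 0.8,
    "good": 0.7,
    "bland": -0.5,
    "mushy": -0.6,
    "terrible": -1.0,
}

def toy_polarity(text):
    """Average the lexicon scores of the words found; 0.0 when none are known."""
    words = re.findall(r"[a-z']+", text.lower())
    scores = [LEXICON[w] for w in words if w in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

print(toy_polarity("The food was excellent"))
print(toy_polarity("Bland tempura, mushy rolls"))
print(toy_polarity("We arrived at 5:45"))
```

Because the result is an average of values in [-1, 1], it always stays within that range, and text without any scored words comes out neutral.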

In [21]:
# A:

### 9.2 Calculate the sentiment for every review in the full Yelp dataset as a new column.

In [22]:
# A:

### 9.3 Create a boxplot of sentiment by star rating

In [23]:
# A:

### 9.4 Print reviews with the highest and lowest sentiment.

In [24]:
# A:

## Part 10: [Bonus] Explore fun TextBlob features

### 10.1 Correct spelling with `.correct()`

In [25]:
# A:

### 10.2 Perform spellchecking with `.spellcheck()`

In [26]:
# A:

### 10.3 Extract definitions with `.define()`

In [27]:
# A:

## Conclusion

- NLP is a gigantic field
- Understanding the basics broadens the types of data you can work with
- Simple techniques go a long way
- Use scikit-learn for NLP whenever possible