The datasets used here were obtained from open sources.

---

Let's begin by importing the libraries we need.

In [1]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from nltk.stem.snowball import SnowballStemmer
from sklearn.naive_bayes import MultinomialNB
from textblob import TextBlob, Word
from sklearn import metrics
import pandas as pd
import numpy as np
import scipy as sp

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

In data science, we often need to analyze unstructured text or build a predictive model from it. Unfortunately, most modeling techniques require numeric data. Natural Language Processing (NLP) provides a toolset of methods for converting unstructured text into meaningful numeric data. More broadly, NLP is used to process (analyze, understand, and generate) natural human languages, often by building probabilistic models from data about the language.

Here, we will practice common low-level NLP techniques, often using Naive Bayes, a model that is very popular for text classification. Let's begin with a dataset containing Yelp reviews.

In [2]:
# Let's read in the data #
yelp = pd.read_csv('./data/yelp.csv')
yelp

Unnamed: 0,business_id,date,review_id,stars,text,type,user_id,cool,useful,funny
0,9yKzy9PApeiPPOUJEtnvkg,2011-01-26,fWKvX83p0-ka4JS3dc6E5A,5,My wife took me here on my birthday for breakf...,review,rLtl8ZkDX5vH5nAx9C3q5Q,2,5,0
1,ZRJwVLyzEJq1VAihDhYiow,2011-07-27,IjZ33sJrzXqU-0X6U8NwyA,5,I have no idea why some people give bad review...,review,0a2KyEL0d3Yb1V6aivbIuQ,0,0,0
2,6oRAC4uyJCsJl1X0WZpVSA,2012-06-14,IESLBzqUCLdSzSqm0eCSxQ,4,love the gyro plate. Rice is so good and I als...,review,0hT2KtfLiobPvh6cDC8JQg,0,1,0
3,_1QQZuf4zZOyFCvXc0o6Vg,2010-05-27,G-WvGaISbqqaMHlNnByodA,5,"Rosie, Dakota, and I LOVE Chaparral Dog Park!!...",review,uZetl9T0NcROGOyFfughhg,1,2,0
4,6ozycU1RpktNG2-1BroVtw,2012-01-05,1uJFq2r5QfJG_6ExMRCaGw,5,General Manager Scott Petello is a good egg!!!...,review,vYmM4KTsC8ZfQBg-j5MWkw,0,0,0
5,-yxfBYGB6SEqszmxJxd97A,2007-12-13,m2CKSsepBCoRYWxiRUsxAg,4,"Quiessence is, simply put, beautiful. Full wi...",review,sqYN3lNgvPbPCTRsMFu27g,4,3,1
6,zp713qNhx8d9KCJJnrw1xA,2010-02-12,riFQ3vxNpP4rWLk_CSri2A,5,Drop what you're doing and drive here. After I...,review,wFweIWhv2fREZV_dYkz_1g,7,7,4
7,hW0Ne_HTHEAgGF1rAdmR-g,2012-07-12,JL7GXJ9u4YMx7Rzs05NfiQ,4,"Luckily, I didn't have to travel far to make m...",review,1ieuYcKS7zeAv_U15AB13A,0,1,0
8,wNUea3IXZWD63bbOQaOH-g,2012-08-17,XtnfnYmnJYi71yIuGsXIUA,4,Definitely come for Happy hour! Prices are ama...,review,Vh_DlizgGhSqQh4qfZ2h6A,0,0,0
9,nMHhuYan8e3cONo3PornJA,2010-08-11,jJAIXA46pU1swYyRCdfXtQ,5,Nobuo shows his unique talents with everything...,review,sUNkXg8-KFtCMQDV6zRzQg,0,1,0


In [3]:
# Let's create a new dataframe that only contains the 5-star and 1-star reviews #
yelp_best_worst = yelp[(yelp.stars==5) | (yelp.stars==1)]

In [4]:
# Let's define X and y #
X = yelp_best_worst.text
y = yelp_best_worst.stars

In [5]:
# Let's split the new dataframe into training and testing sets #
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [6]:
# Let's create document-term matrices from X_train and X_test #
vect = CountVectorizer()
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)

In [7]:
# Let's check the shape of X_train_dtm (rows are documents and columns are terms or "tokens"/"features") #
print('Rows: %s \nColumns: %s' %(X_train_dtm.shape[0], X_train_dtm.shape[1]))

Rows: 3064 
Columns: 16712
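
To make the difference between fitting and transforming concrete, here is a minimal pure-Python sketch of how a document-term matrix can be built. It is not scikit-learn's implementation: the helper names (tokenize, fit_vocabulary, transform) and the toy documents are made up for illustration, and it returns dense lists rather than a sparse matrix.

```python
import re
from collections import Counter

# Same regular expression as CountVectorizer's default token_pattern.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and split it into tokens of 2+ word characters."""
    return TOKEN_PATTERN.findall(text.lower())

def fit_vocabulary(docs):
    """Map each distinct token to a column index, in alphabetical order."""
    tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}

def transform(docs, vocabulary):
    """Count tokens per document; tokens not in the vocabulary are dropped."""
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows

# Hypothetical toy documents, not taken from the Yelp data.
train_docs = ["great food great service", "terrible service"]
test_docs = ["great pizza"]

vocabulary = fit_vocabulary(train_docs)       # learned from training data only
train_dtm = transform(train_docs, vocabulary)
test_dtm = transform(test_docs, vocabulary)   # "pizza" was never seen, so it is ignored

print(vocabulary)
print(train_dtm)
print(test_dtm)
```

The vocabulary is learned from the training documents only, which is why the test set goes through transform rather than fit_transform: a word that appears only in the test set has no column to land in.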


In [8]:
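# Let's check out the last 50 features #
# Note: get_feature_names() was removed in scikit-learn 1.2; newer versions provide get_feature_names_out() instead #
print((vect.get_feature_names()[-50:]))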

['zach', 'zam', 'zanella', 'zankou', 'zappos', 'zatsiki', 'zen', 'zero', 'zest', 'zexperience', 'zha', 'zhou', 'zia', 'zichini', 'zihuatenejo', 'zilch', 'zillion', 'zin', 'zinburger', 'zinburgergeist', 'zinc', 'zinfandel', 'zing', 'zip', 'zipcar', 'ziploc', 'zipper', 'zippers', 'zipps', 'ziti', 'zoe', 'zombies', 'zone', 'zoners', 'zones', 'zoning', 'zoo', 'zoom', 'zoyo', 'zucca', 'zucchini', 'zuccini', 'zuchinni', 'zumba', 'zupa', 'zupas', 'zuzu', 'zzed', 'école', 'ém']


In [9]:
# Let's check out the vectorizer options #
vect

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

Note: A common method of reducing the number of features is converting all text to lowercase before generating features. This is because, to a computer, aPPle is stored as a different token than apple. However, there are cases where it might be useful not to convert them to lowercase if capitalization matters. We can see above that CountVectorizer converts the tokens to lowercase by default.
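
As a rough illustration of the note above, the sketch below applies the same regular expression shown in token_pattern to a made-up sentence, with and without lowercasing. The tokenize helper is hypothetical and only approximates what the vectorizer does internally.

```python
import re

# Same regular expression as the default token_pattern shown above:
# a token is a run of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text, lowercase=True):
    """Split text into tokens, optionally lowercasing it first."""
    if lowercase:
        text = text.lower()
    return TOKEN_PATTERN.findall(text)

doc = "I ate an aPPle, an Apple and an apple"

vocab_cased = sorted(set(tokenize(doc, lowercase=False)))
vocab_lower = sorted(set(tokenize(doc, lowercase=True)))

print(len(vocab_cased), vocab_cased)
print(len(vocab_lower), vocab_lower)
```

Without lowercasing, the three spellings of "apple" become three separate features; with lowercasing they collapse into one. The single-character word "I" is dropped in both cases because the pattern requires at least two characters.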

In [10]:
# Let's use Naive Bayes to predict the star rating and check the accuracy #
nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)
y_pred_class = nb.predict(X_test_dtm)

print('Accuracy:',metrics.accuracy_score(y_test, y_pred_class))

Accuracy: 0.9080234833659491


In [11]:
# Let's check out the value counts for 1-star and 5-star reviews #
y_test.value_counts()

5    823
1    199
Name: stars, dtype: int64

In [12]:
# Let's calculate the null accuracy (set 5-star to 1 and 1-star to 0) #
y_test_binary = np.where(y_test==5, 1, 0)

print('Percent of 5-stars: %s, \nPercent of 1-stars: %s' %(y_test_binary.mean(), (1 - y_test_binary.mean())))

Percent of 5-stars: 0.8052837573385518, 
Percent of 1-stars: 0.19471624266144816


We can see that our model achieved ~91% accuracy, which is an improvement over the ~81% null accuracy (the accuracy of a baseline that always predicts the majority class, 5-star). Let's look more closely at how the vectorizer works.
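
The null accuracy can also be worked out directly from the class counts. The short sketch below uses the counts printed by value_counts() above (823 five-star and 199 one-star reviews in the test set); the variable names are only illustrative.

```python
from collections import Counter

# Class counts in the test set, copied from the value_counts() output above.
class_counts = Counter({5: 823, 1: 199})

# A baseline that always predicts the most frequent class is right
# exactly as often as that class occurs.
majority_class, majority_count = class_counts.most_common(1)[0]
null_accuracy = majority_count / sum(class_counts.values())

print("Majority class:", majority_class)
print("Null accuracy:", null_accuracy)
```

Any model worth keeping should beat this number, because it can be reached without looking at the review text at all.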

In [13]:
# We have 3,064 Yelp reviews in our training set and 16,712 unique words were found across all documents #
X_train_dtm

<3064x16712 sparse matrix of type '<class 'numpy.int64'>'
	with 236431 stored elements in Compressed Sparse Row format>

In [14]:
# "vocabulary_" is a dictionary that maps each token to its column index in the sparse matrix #
vect.vocabulary_

{'my': 9819,
 'boyfriend': 2017,
 'and': 803,
 'stopped': 14198,
 'by': 2336,
 'to': 15127,
 'grab': 6577,
 'bite': 1746,
 'eat': 4920,
 'before': 1578,
 'work': 16458,
 'long': 8821,
 'story': 14212,
 'short': 13296,
 'greeted': 6660,
 'immediately': 7546,
 'lots': 8862,
 'of': 10232,
 'variety': 15828,
 'meat': 9280,
 'lovers': 8886,
 'pizza': 11133,
 'is': 7935,
 'amazing': 739,
 'ask': 1100,
 'for': 5984,
 'tawnya': 14762,
 'she': 13213,
 'really': 11993,
 'nice': 9983,
 'wonderful': 16430,
 'server': 13125,
 'will': 16337,
 'be': 1523,
 'back': 1312,
 'loved': 8881,
 'the': 14932,
 'excellent': 5341,
 'service': 13129,
 'when': 16254,
 'was': 16122,
 'last': 8463,
 'time': 15084,
 'that': 14927,
 'you': 16610,
 'went': 16228,
 'restaurant': 12357,
 'were': 16229,
 'still': 14162,
 'bragging': 2028,
 'about': 330,
 'your': 16617,
 'experience': 5411,
 'two': 15489,
 'weeks': 16205,
 'later': 8471,
 'this': 14991,
 'exactly': 5327,
 'what': 16242,
 'am': 725,
 'doing': 4647,
 'arrog

In [15]:
# Let's convert the sparse matrix into a typical array object #
# Note: Although this takes up much more memory than the sparse matrix, the conversion is sometimes necessary #
X_test_dtm.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [2, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)
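
To get a feel for why the dense form costs so much more memory, here is a back-of-the-envelope calculation using the figures reported for X_train_dtm above (3064 x 16712, 236431 stored elements, int64 values). The sparse estimate assumes 32-bit index arrays in the Compressed Sparse Row layout, which is an assumption and not something shown in the output.

```python
# Figures taken from the X_train_dtm repr shown earlier.
rows, cols = 3064, 16712
stored = 236431

total_cells = rows * cols
density = stored / total_cells

# Dense array: every cell holds an 8-byte int64, zero or not.
dense_bytes = total_cells * 8

# CSR estimate (assumes int32 indices): one value and one column index
# per stored element, plus one row pointer per row (and one extra).
sparse_bytes = stored * 8 + stored * 4 + (rows + 1) * 4

print("Cells:", total_cells)
print("Density: %.2f%%" % (density * 100))
print("Dense size:  %.1f MB" % (dense_bytes / 1e6))
print("Sparse size: %.1f MB" % (sparse_bytes / 1e6))
```

Fewer than half a percent of the cells are non-zero, so the dense array spends almost all of its memory storing zeros.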

In [16]:
# Let's create a function that accepts a vectorizer and calculates the accuracy #
def tokenize_test(vect):
    X_train_dtm = vect.fit_transform(X_train)
    print('Number of features:', X_train_dtm.shape[1])
    X_test_dtm = vect.transform(X_test)
    nb = MultinomialNB()
    nb.fit(X_train_dtm, y_train)
    y_pred_class = nb.predict(X_test_dtm)
    print('Accuracy:', metrics.accuracy_score(y_test, y_pred_class))

In [17]:
# Let's vary the "document frequency" parameter (min_df=i ignores terms that appear in fewer than i documents), keeping at most 10,000 features #
print('---'*25)
for i in range(1,6):
    vect = CountVectorizer(min_df=i, max_features=10000)
    tokenize_test(vect)
    print('---'*25)

---------------------------------------------------------------------------
Number of features: 10000
Accuracy: 0.9099804305283757
---------------------------------------------------------------------------
Number of features: 8712
Accuracy: 0.9041095890410958
---------------------------------------------------------------------------
Number of features: 6318
Accuracy: 0.9021526418786693
---------------------------------------------------------------------------
Number of features: 5093
Accuracy: 0.9031311154598826
---------------------------------------------------------------------------
Number of features: 4326
Accuracy: 0.9021526418786693
---------------------------------------------------------------------------
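
The key detail of min_df is that it counts documents, not total occurrences. Here is a small pure-Python sketch of that idea on made-up documents; document_frequency and vocabulary_with_min_df are hypothetical helpers, not scikit-learn functions.

```python
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def document_frequency(docs):
    """Count, for each token, the number of documents that contain it."""
    df = Counter()
    for doc in docs:
        # set() so that repeats inside one document count only once
        df.update(set(TOKEN_PATTERN.findall(doc.lower())))
    return df

def vocabulary_with_min_df(docs, min_df):
    """Keep only tokens that appear in at least min_df documents."""
    df = document_frequency(docs)
    return sorted(tok for tok, n_docs in df.items() if n_docs >= min_df)

# Hypothetical toy documents, not taken from the Yelp data.
docs = ["good pizza good price", "good service", "bad pizza"]

df = document_frequency(docs)
vocab_1 = vocabulary_with_min_df(docs, min_df=1)
vocab_2 = vocabulary_with_min_df(docs, min_df=2)

print(df["good"])
print(vocab_1)
print(vocab_2)
```

Here "good" occurs three times overall but in only two documents, so its document frequency is 2. With min_df=2 only "good" and "pizza" survive.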


Let's now make use of n-grams. N-grams are features that consist of N consecutive words. They are useful because the bag-of-words model ignores word order; treating "data scientist" as a single feature carries more meaning than the two independent features "data" and "scientist".
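
As a quick illustration, the sketch below generates word n-grams from a made-up sentence. The word_ngrams helper is hypothetical; it mirrors the idea behind ngram_range=(n_min, n_max) but is not how scikit-learn is implemented.

```python
def word_ngrams(tokens, n_min, n_max):
    """Return every run of n consecutive tokens, for n_min <= n <= n_max."""
    features = []
    for n in range(n_min, n_max + 1):
        for start in range(len(tokens) - n + 1):
            features.append(" ".join(tokens[start:start + n]))
    return features

tokens = "the data scientist cleaned data".split()

unigrams = word_ngrams(tokens, 1, 1)
uni_and_bigrams = word_ngrams(tokens, 1, 2)

print(sorted(set(unigrams)))
print(sorted(set(uni_and_bigrams)))
```

Even in this five-word sentence, adding bigrams doubles the number of distinct features from 4 to 8, and "data scientist" now exists as a feature of its own.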

In [18]:
# Let's include 1-grams and 2-grams #
vect = CountVectorizer(ngram_range=(1, 2))
X_train_dtm = vect.fit_transform(X_train)
print('Rows: %s \nColumns: %s' %(X_train_dtm.shape[0], X_train_dtm.shape[1]))

Rows: 3064 
Columns: 168423


We can see that adding bigrams raised the number of feature columns from 16,712 to 168,423. We should be wary when computing n-grams from an entire corpus (a collection of documents), because the number of unique n-grams can be vastly higher than the number of unique unigrams. This can lead to an undesired feature explosion in which many of the new features are just noise. If we do not have much data, adding n-grams can therefore decrease model performance: when an n-gram is present only once or twice in the training set, we are effectively adding mostly noisy features to the mix.

In [19]:
# Let's check out the last 50 features (use get_feature_names_out() on scikit-learn 1.2 or newer) #
print((vect.get_feature_names()[-50:]))

['zoners out', 'zones', 'zones dolls', 'zones so', 'zoning', 'zoning issues', 'zoo', 'zoo but', 'zoo if', 'zoo is', 'zoo not', 'zoo the', 'zoo tour', 'zoom', 'zoom in', 'zoyo', 'zoyo for', 'zucca', 'zucca appetizer', 'zucchini', 'zucchini and', 'zucchini broccoli', 'zucchini carrots', 'zucchini fires', 'zucchini fries', 'zucchini pieces', 'zucchini strips', 'zucchini very', 'zucchini we', 'zucchini with', 'zuccini', 'zuccini italian', 'zuchinni', 'zuchinni again', 'zumba', 'zumba class', 'zumba or', 'zumba yogalates', 'zupa', 'zupa flavors', 'zupas', 'zupas cater', 'zuzu', 'zuzu was', 'zzed', 'zzed in', 'école', 'école lenôtre', 'ém', 'ém all']


Stop words are some of the most common words in a language. They are function words that make a sentence grammatical, such as prepositions, determiners, and conjunctions (for example "to", "the", and "and"). However, they are so commonly used that they are generally of little use for predicting the class of a document. Stop-word removal is the process of removing these common words, which are likely to appear in any text.

In [20]:
# Let's remove English stop words #
vect = CountVectorizer(stop_words='english')
tokenize_test(vect)
vect.get_params()

Number of features: 16415
Accuracy: 0.9119373776908023


{'analyzer': 'word',
 'binary': False,
 'decode_error': 'strict',
 'dtype': numpy.int64,
 'encoding': 'utf-8',
 'input': 'content',
 'lowercase': True,
 'max_df': 1.0,
 'max_features': None,
 'min_df': 1,
 'ngram_range': (1, 1),
 'preprocessor': None,
 'stop_words': 'english',
 'strip_accents': None,
 'token_pattern': '(?u)\\b\\w\\w+\\b',
 'tokenizer': None,
 'vocabulary': None}

In [21]:
# Let's check out the set of stop words #
print(vect.get_stop_words())

frozenset({'neither', 'seem', 'most', 'hereby', 'whose', 'fire', 'nevertheless', 'ie', 'otherwise', 'sometimes', 'on', 'afterwards', 'across', 'himself', 'nobody', 'down', 'else', 'amongst', 'we', 'detail', 'sometime', 'has', 'behind', 'third', 'which', 'would', 'that', 'whoever', 'from', 'perhaps', 'eleven', 'twelve', 'part', 'whereafter', 'same', 'although', 'least', 'whereas', 'anything', 'three', 'thus', 'throughout', 'after', 'where', 'more', 'itself', 'once', 'system', 'and', 'co', 'became', 'first', 'indeed', 'find', 'since', 'latterly', 'many', 'found', 'have', 'these', 'wherein', 'a', 'sixty', 'go', 'who', 'everything', 'why', 'less', 'than', 'whence', 'besides', 'here', 'while', 'sincere', 'thereupon', 'per', 'former', 'also', 'his', 'formerly', 'mill', 'her', 'four', 'thin', 'empty', 'ours', 'when', 'own', 'thereafter', 'not', 'around', 'hasnt', 'always', 'mine', 'any', 'nothing', 'ltd', 'therein', 'show', 'eg', 'get', 'whatever', 'yourselves', 'twenty', 'yours', 'now', 'her
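
The filtering step itself is simple. Here is a minimal sketch, assuming a tiny hand-picked stop list (DEMO_STOP_WORDS) in place of the much longer built-in English list; the review text and helper name are made up for illustration.

```python
import re

TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

# Illustrative stop list only; the built-in 'english' list is far longer.
DEMO_STOP_WORDS = frozenset({"the", "and", "to", "was", "is"})

def tokenize_without_stop_words(text, stop_words):
    """Lowercase, tokenize, then drop any token found in stop_words."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [tok for tok in tokens if tok not in stop_words]

review = "The pizza was amazing and the service is great"
kept = tokenize_without_stop_words(review, DEMO_STOP_WORDS)

print(kept)
```

What remains are the content words, which are the ones most likely to differ between 1-star and 5-star reviews.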

As we did earlier, we can use the max_features parameter. If it is not set to None, the vectorizer builds a vocabulary from only the top max_features terms (ordered by term frequency across the corpus). This lets us keep the more common n-grams and remove ones that may appear only once. Terms that occur only once can, by chance, end up highly associated with one class and cause overfitting.

In [22]:
# Let's remove English stop words and only keep 100 features #
vect = CountVectorizer(stop_words='english', max_features=100)
tokenize_test(vect)

Number of features: 100
Accuracy: 0.8532289628180039


In [23]:
# Let's check out all 100 features (use get_feature_names_out() on scikit-learn 1.2 or newer) #
print(vect.get_feature_names())

['amazing', 'area', 'asked', 'awesome', 'bad', 'bar', 'best', 'better', 'big', 'came', 'cheese', 'chicken', 'clean', 'come', 'day', 'definitely', 'delicious', 'did', 'didn', 'different', 'dinner', 'don', 'eat', 'excellent', 'experience', 'favorite', 'feel', 'food', 'free', 'fresh', 'friendly', 'friends', 'going', 'good', 'got', 'great', 'happy', 'home', 'hot', 'hour', 'just', 'know', 'like', 'little', 'll', 'location', 'long', 'looking', 'love', 'lunch', 'make', 'meal', 'menu', 'minutes', 'need', 'new', 'nice', 'night', 'order', 'ordered', 'people', 'perfect', 'phoenix', 'pizza', 'place', 'pretty', 'price', 'prices', 'really', 'recommend', 'restaurant', 'right', 'said', 'salad', 'sauce', 'say', 'service', 'staff', 'store', 'sure', 'sweet', 'table', 'thing', 'things', 'think', 'time', 'times', 'told', 'took', 'tried', 'try', 've', 'wait', 'want', 'way', 'went', 'wine', 'work', 'worth', 'years']


In [24]:
# Let's remove English stop words and only keep 1000 features #
vect = CountVectorizer(stop_words='english', max_features=1000)
tokenize_test(vect)

Number of features: 1000
Accuracy: 0.8992172211350293
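
To see what max_features is doing, here is a rough pure-Python sketch that keeps only the most frequent terms across a made-up corpus. The top_terms helper is hypothetical, and how scikit-learn breaks ties between equally frequent terms is not modelled here.

```python
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def top_terms(docs, max_features):
    """Return the max_features most frequent terms across all documents."""
    term_counts = Counter()
    for doc in docs:
        term_counts.update(TOKEN_PATTERN.findall(doc.lower()))
    most_common = [term for term, _ in term_counts.most_common(max_features)]
    return sorted(most_common)

# Hypothetical toy documents, not taken from the Yelp data.
docs = ["good food good service good price", "bad food"]
vocab = top_terms(docs, max_features=2)

print(vocab)
```

Unlike min_df, the ranking here is by total term count across the corpus, so a word repeated many times in a single document can still make the cut.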


As with any other model, including more features does not always lead to a better model. Thus, we must tune our feature generator to remove features with little or no predictive value. Below, allowing bigrams as well as unigrams and raising max_features 1,000-fold (from 100 to 100,000) gives only a small increase in accuracy over the 100-feature model (~85% to ~86%). However, if we restrict the vocabulary to unigrams only, the accuracy increases further (~91%). This indicates that our bigrams were very likely adding more noise than signal.

In [25]:
# Let's include 1-grams and 2-grams, and limit the number of features #
print('---'*25,'\n1-grams and 2-grams, up to 100K features')
vect = CountVectorizer(ngram_range=(1, 2), max_features=100000)
tokenize_test(vect)

print('---'*25)

print('1-grams only, up to 100K features')
vect = CountVectorizer(ngram_range=(1, 1), max_features=100000)
tokenize_test(vect)
print('---'*25)

--------------------------------------------------------------------------- 
1-grams and 2-grams, up to 100K features
Number of features: 100000
Accuracy: 0.863013698630137
---------------------------------------------------------------------------
1-grams only, up to 100K features
Number of features: 16712
Accuracy: 0.9080234833659491
---------------------------------------------------------------------------


We can see above that using only the 16,712 unigram features gives us a much smaller, simpler, and easier-to-reason-about model that also achieves higher accuracy.

In [26]:
# Let's include 1-grams and 2-grams, and only keep terms that appear in at least two documents #
vect = CountVectorizer(ngram_range=(1, 2), min_df=2)
tokenize_test(vect)

Number of features: 43839
Accuracy: 0.913894324853229


We can always strive for better accuracy by tuning the vectorizer and model parameters. Let's continue with TextBlob, a Python library that provides a simplified interface for common NLP tasks, including part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.

In [27]:
# Let's print the first review in our dataset #
print(yelp_best_worst.text[0])

My wife took me here on my birthday for breakfast and it was excellent.  The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure.  Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning.  It looked like the place fills up pretty quickly so the earlier you get here the better.

Do yourself a favor and get their Bloody Mary.  It was phenomenal and simply the best I've ever had.  I'm pretty sure they only use ingredients from their garden and blend them fresh when you order it.  It was amazing.

While EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious.  It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete.  It was the best "toast" I've ever had.

Anyway, I can't wait to go back!
