# Text Data Preprocessing

- Tokenizing the text
- Comparing the accuracy of different approaches
- Removing frequent terms (stop words)
- Removing infrequent terms
- Handling Unicode errors

# Importing key modules

In [1]:
# for Python 2: use print only as a function
from __future__ import print_function
import pandas as pd
from sklearn.model_selection import train_test_split

## Part 1: Reading in the Yelp reviews corpus

- "corpus" = collection of documents
- "corpora" = plural form of corpus

In [2]:
# read yelp.csv into a DataFrame using a relative path

path = '../data/yelp.csv'
yelp = pd.read_csv(path)

In [3]:
# alternative: read from a URL instead
# path = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/yelp.csv'
# yelp = pd.read_csv(path)

In [4]:
# examine the first three rows
yelp.head(3)

| | business_id | date | review_id | stars | text | type | user_id | cool | useful | funny |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9yKzy9PApeiPPOUJEtnvkg | 2011-01-26 | fWKvX83p0-ka4JS3dc6E5A | 5 | My wife took me here on my birthday for breakf... | review | rLtl8ZkDX5vH5nAx9C3q5Q | 2 | 5 | 0 |
| 1 | ZRJwVLyzEJq1VAihDhYiow | 2011-07-27 | IjZ33sJrzXqU-0X6U8NwyA | 5 | I have no idea why some people give bad review... | review | 0a2KyEL0d3Yb1V6aivbIuQ | 0 | 0 | 0 |
| 2 | 6oRAC4uyJCsJl1X0WZpVSA | 2012-06-14 | IESLBzqUCLdSzSqm0eCSxQ | 4 | love the gyro plate. Rice is so good and I als... | review | 0hT2KtfLiobPvh6cDC8JQg | 0 | 1 | 0 |


In [5]:
# examine the text for the first row
yelp.loc[0, 'text']

'My wife took me here on my birthday for breakfast and it was excellent.  The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure.  Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning.  It looked like the place fills up pretty quickly so the earlier you get here the better.\n\nDo yourself a favor and get their Bloody Mary.  It was phenomenal and simply the best I\'ve ever had.  I\'m pretty sure they only use ingredients from their garden and blend them fresh when you order it.  It was amazing.\n\nWhile EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious.  It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete.  It was the best "toast" I\'ve ever had.\n\nAnyway, I can\'t wait to go back!'

**Goal:** Distinguish between 5-star and 1-star reviews using **only** the review text. (We will not be using the other columns.)

In [6]:
# examine the class distribution
yelp.stars.value_counts().sort_index()

1     749
2     927
3    1461
4    3526
5    3337
Name: stars, dtype: int64

In [7]:
# create a new DataFrame that only contains the 5-star and 1-star reviews
yelp_best_worst = yelp[(yelp.stars==5) | (yelp.stars==1)]

In [8]:
# examine the shape
yelp_best_worst.shape

(4086, 10)

In [9]:
# define X and y
X = yelp_best_worst.text
y = yelp_best_worst.stars

In [10]:
# split X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [11]:
# examine the object shapes
print(X_train.shape)
print(X_test.shape)

(3064,)
(1022,)


## Part 2: Tokenizing the text

- **What:** Separate text into units such as words, n-grams, or sentences
- **Why:** Gives structure to previously unstructured text
- **Notes:** Relatively easy for English text; harder for languages that do not separate words with spaces (for example, Chinese or Japanese)

In [12]:
# use CountVectorizer to create document-term matrices from X_train and X_test
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

In [13]:
# fit and transform X_train
X_train_dtm = vect.fit_transform(X_train)

In [14]:
# only transform X_test
X_test_dtm = vect.transform(X_test)

In [15]:
# examine the shapes: rows are documents, columns are terms (aka "tokens" or "features")
print(X_train_dtm.shape)
print(X_test_dtm.shape)

(3064, 16825)
(1022, 16825)
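
The reason for calling `fit_transform` on the training set but only `transform` on the test set is that the vocabulary is learned from the training documents alone. The test matrix then has the same columns, and any term not seen during training is ignored. A minimal sketch on a made-up toy corpus (the sentences and the `toy_*` names are assumptions for illustration, not Yelp data):

```python
from sklearn.feature_extraction.text import CountVectorizer

# hypothetical toy documents, not taken from the Yelp data
toy_train = ['the food was great', 'the service was slow']
toy_test = ['the pizza was great']   # 'pizza' never appears in toy_train

toy_vect = CountVectorizer()
toy_train_dtm = toy_vect.fit_transform(toy_train)  # learn the vocabulary, then count
toy_test_dtm = toy_vect.transform(toy_test)        # reuse the learned vocabulary

# both matrices have one column per training term
print(toy_train_dtm.shape, toy_test_dtm.shape)

# 'pizza' is not in the learned vocabulary, so it is dropped from the test matrix
print('pizza' in toy_vect.vocabulary_)
print(toy_test_dtm.toarray())
```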


In [16]:
# examine the last 50 features
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there)
print(vect.get_feature_names()[-50:])

['yyyyy', 'z11', 'za', 'zabba', 'zach', 'zam', 'zanella', 'zankou', 'zappos', 'zatsiki', 'zen', 'zero', 'zest', 'zexperience', 'zha', 'zhou', 'zia', 'zihuatenejo', 'zilch', 'zin', 'zinburger', 'zinburgergeist', 'zinc', 'zinfandel', 'zing', 'zip', 'zipcar', 'zipper', 'zippers', 'zipps', 'ziti', 'zoe', 'zombi', 'zombies', 'zone', 'zones', 'zoning', 'zoo', 'zoyo', 'zucca', 'zucchini', 'zuchinni', 'zumba', 'zupa', 'zuzu', 'zwiebel', 'zzed', 'éclairs', 'école', 'ém']


In [17]:
# show default parameters for CountVectorizer
vect

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

[CountVectorizer documentation](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)
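
The default `token_pattern` shown above, `(?u)\b\w\w+\b`, keeps runs of two or more word characters, so single-character tokens and punctuation are dropped. The sketch below approximates the default behavior with the standard `re` module; `simple_tokenize` is a hypothetical helper and does not cover options such as `strip_accents` or stop words.

```python
import re

# default token_pattern from the CountVectorizer parameters shown above
TOKEN_PATTERN = r'(?u)\b\w\w+\b'

def simple_tokenize(text):
    """Lowercase the text, then keep runs of two or more word characters."""
    return re.findall(TOKEN_PATTERN, text.lower())

# hypothetical example sentence
tokens = simple_tokenize("I'm pretty sure it was a 5-star meal!")
print(tokens)
# ['pretty', 'sure', 'it', 'was', 'star', 'meal']
```

Note that "I", "m", "a", and "5" disappear because each is a single character.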

### 2a. Lowercase

- **lowercase:** boolean, True by default
    - Convert all characters to lowercase before tokenizing.

In [18]:
# don't convert to lowercase
vect = CountVectorizer(lowercase=False)
X_train_dtm = vect.fit_transform(X_train)
X_train_dtm.shape

(3064, 20838)
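
The feature count grows because differently cased spellings of the same word become separate columns. A small sketch on made-up documents (the toy strings are assumptions, not Yelp data):

```python
from sklearn.feature_extraction.text import CountVectorizer

# hypothetical toy documents: the same word in three different casings
toy_docs = ['Great food', 'great service', 'GREAT prices']

lower_vect = CountVectorizer()                 # lowercase=True by default
cased_vect = CountVectorizer(lowercase=False)  # keep the original casing

lower_vect.fit(toy_docs)
cased_vect.fit(toy_docs)

# with lowercasing, the three casings collapse into one feature
print(sorted(lower_vect.vocabulary_))
# without it, 'GREAT', 'Great', and 'great' are three separate features
print(sorted(cased_vect.vocabulary_))
```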

### 2b. N-gram range

- **ngram_range:** tuple (min_n, max_n), default=(1, 1)
    - The lower and upper boundary of the range of n-values for different n-grams to be extracted.
    - All values of n such that min_n <= n <= max_n will be used.
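
To make the `min_n <= n <= max_n` rule concrete, here is a plain-Python sketch of word n-gram extraction. It illustrates the idea only and is not scikit-learn's implementation; `word_ngrams` is a hypothetical helper.

```python
def word_ngrams(tokens, min_n, max_n):
    """Return every n-gram with min_n <= n <= max_n, joined by spaces."""
    ngrams = []
    for n in range(min_n, max_n + 1):
        # slide a window of n tokens across the document
        for start in range(len(tokens) - n + 1):
            ngrams.append(' '.join(tokens[start:start + n]))
    return ngrams

toy_tokens = ['the', 'food', 'was', 'great']

print(word_ngrams(toy_tokens, 1, 2))
# ['the', 'food', 'was', 'great', 'the food', 'food was', 'was great']

print(len(word_ngrams(toy_tokens, 1, 3)))
# 9
```

A range of `(1, 3)` therefore yields 1-grams, 2-grams, and 3-grams, which is why the feature count grows so quickly.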

In [19]:
# include 1-grams and 2-grams
vect = CountVectorizer(ngram_range=(1, 2))
X_train_dtm = vect.fit_transform(X_train)
X_train_dtm.shape

(3064, 169847)

In [20]:
# examine the last 20 features
print(vect.get_feature_names()[-20:])

['zumba', 'zumba class', 'zumba or', 'zumba yogalates', 'zupa', 'zupa flavors', 'zuzu', 'zuzu in', 'zuzu is', 'zuzu the', 'zwiebel', 'zwiebel kräuter', 'zzed', 'zzed in', 'éclairs', 'éclairs napoleons', 'école', 'école lenôtre', 'ém', 'ém all']


In [21]:
# include 1-grams, 2-grams, and 3-grams
vect2 = CountVectorizer(ngram_range=(1, 3))
X_train_dtm2 = vect2.fit_transform(X_train)
X_train_dtm2.shape

(3064, 456398)

In [22]:
# examine the last 20 features
print(vect2.get_feature_names()[-20:])

['zuzu in downtown', 'zuzu is', 'zuzu is at', 'zuzu the', 'zuzu the ultimate', 'zwiebel', 'zwiebel kräuter', 'zwiebel kräuter salat', 'zzed', 'zzed in', 'zzed in my', 'éclairs', 'éclairs napoleons', 'éclairs napoleons and', 'école', 'école lenôtre', 'école lenôtre trained', 'ém', 'ém all', 'ém all they']


## Part 3: Comparing the accuracy of different approaches

### 3a. Approach 1:

Null accuracy: always predict the most frequent class

In [23]:
# calculate null accuracy: the share of the most frequent class (5-star reviews)
y_test.value_counts().head(1) / len(y_test)

5    0.819961
Name: stars, dtype: float64

In [24]:
y_test.value_counts()

5    838
1    184
Name: stars, dtype: int64
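
The same figure can be checked by hand from the counts above: null accuracy is the count of the most frequent class divided by the total number of test reviews. A short sketch (the `test_counts` dictionary simply copies the printed counts):

```python
# class counts in y_test, copied from the value_counts() output above
test_counts = {5: 838, 1: 184}

most_frequent_class = max(test_counts, key=test_counts.get)
# float() keeps the division exact under Python 2 as well
null_accuracy = test_counts[most_frequent_class] / float(sum(test_counts.values()))

print(most_frequent_class)
# 5
print(round(null_accuracy, 6))
# 0.819961
```

Any classifier has to beat roughly 82% accuracy on this test set to be more useful than always predicting 5 stars.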

### 3b. Approach 2:

Use the default parameters for CountVectorizer

In [25]:
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

# define a function that accepts a vectorizer and calculates the accuracy
def tokenize_test(vect):
    
    # create document-term matrices using the vectorizer
    X_train_dtm = vect.fit_transform(X_train)
    X_test_dtm = vect.transform(X_test)
    
    # print the number of features that were generated
    print('Features: ', X_train_dtm.shape[1])
    
    # use Multinomial Naive Bayes to predict the star rating
    nb = MultinomialNB()
    nb.fit(X_train_dtm, y_train)
    y_pred_class = nb.predict(X_test_dtm)
    
    # print the accuracy of its predictions
    print('Accuracy: ', metrics.accuracy_score(y_test, y_pred_class))

In [26]:
# use the default parameters
vect = CountVectorizer()
tokenize_test(vect)

Features:  16825
Accuracy:  0.9187866927592955


### 3c. Approach 3:

Don't convert to lowercase

In [27]:
# don't convert to lowercase
vect = CountVectorizer(lowercase=False)
tokenize_test(vect)

Features:  20838
Accuracy:  0.9099804305283757


### 3d. Approach 4:

Include 1-grams and 2-grams

In [28]:
# include 1-grams and 2-grams
vect = CountVectorizer(ngram_range=(1, 2))
tokenize_test(vect)

Features:  169847
Accuracy:  0.8542074363992173


**Summary:** Tuning CountVectorizer is a form of **feature engineering**, the process through which you create features that don't natively exist in the dataset. Your goal is to create features that contain the **signal** from the data (with respect to the response value), rather than the **noise**.

## Part 4: Removing frequent terms (stop words)

- **What:** Remove common words that appear in most documents
- **Why:** They probably don't tell you much about your text

In [29]:
# show vectorizer parameters
vect

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 2), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

- **stop_words:** string {'english'}, list, or None (default)
    - If 'english', a built-in stop word list for English is used.
    - If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens.
    - If None, no stop words will be used.
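
The two non-default options can be compared on a single made-up sentence. This is a minimal sketch; the toy document and the custom stop word list are assumptions chosen for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# hypothetical toy document
toy_docs = ['the food was great and the service was slow']

# built-in English stop word list
english_vect = CountVectorizer(stop_words='english').fit(toy_docs)

# custom list: only these words are removed
custom_vect = CountVectorizer(stop_words=['food', 'service']).fit(toy_docs)

# common function words such as 'the', 'was', and 'and' are gone
print(sorted(english_vect.vocabulary_))

# only 'food' and 'service' are gone; 'the', 'was', and 'and' remain
print(sorted(custom_vect.vocabulary_))
```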

### 4a. Removing default stop words

In [30]:
# remove English stop words
vect = CountVectorizer(stop_words='english')
tokenize_test(vect)

Features:  16528
Accuracy:  0.9158512720156555


In [31]:
# examine the stop words
print(sorted(vect.get_stop_words()))

['a', 'about', 'above', 'across', 'after', 'afterwards', 'again', 'against', 'all', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', 'am', 'among', 'amongst', 'amoungst', 'amount', 'an', 'and', 'another', 'any', 'anyhow', 'anyone', 'anything', 'anyway', 'anywhere', 'are', 'around', 'as', 'at', 'back', 'be', 'became', 'because', 'become', 'becomes', 'becoming', 'been', 'before', 'beforehand', 'behind', 'being', 'below', 'beside', 'besides', 'between', 'beyond', 'bill', 'both', 'bottom', 'but', 'by', 'call', 'can', 'cannot', 'cant', 'co', 'con', 'could', 'couldnt', 'cry', 'de', 'describe', 'detail', 'do', 'done', 'down', 'due', 'during', 'each', 'eg', 'eight', 'either', 'eleven', 'else', 'elsewhere', 'empty', 'enough', 'etc', 'even', 'ever', 'every', 'everyone', 'everything', 'everywhere', 'except', 'few', 'fifteen', 'fifty', 'fill', 'find', 'fire', 'first', 'five', 'for', 'former', 'formerly', 'forty', 'found', 'four', 'from', 'front', 'full', 'further', 'get', 'give

- **max_df:** float in range [0.0, 1.0] or int, default=1.0
    - When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words).
    - If float, the parameter represents a proportion of documents.
    - If integer, the parameter represents an absolute count.
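
The "strictly higher" wording matters at the boundary. In the sketch below, on a made-up four-document corpus, a term that appears in exactly half of the documents survives `max_df=0.5`, while terms that appear in every document are removed.

```python
from sklearn.feature_extraction.text import CountVectorizer

# hypothetical toy corpus of four short documents
toy_docs = [
    'the food was great',
    'the service was slow',
    'the pizza was cold',
    'the staff was great',
]

# drop terms that appear in more than 50% of the documents (more than 2 of 4)
toy_vect = CountVectorizer(max_df=0.5).fit(toy_docs)

# 'the' and 'was' appear in all four documents and are removed;
# 'great' appears in exactly 2 of 4, which is not strictly more than 50%, so it stays
print(sorted(toy_vect.vocabulary_))
```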

In [32]:
# ignore terms that appear in more than 50% of the documents
vect = CountVectorizer(max_df=0.5)
tokenize_test(vect)

Features:  16815
Accuracy:  0.9207436399217221


- **stop\_words\_:** Terms that were ignored because they either:
    - occurred in too many documents (max_df)
    - occurred in too few documents (min_df)
    - were cut off by feature selection (max_features)

### 4b. Corpus-specific stop words

In [33]:
# examine the terms that were removed due to max_df ("corpus-specific stop words")
print(vect.stop_words_)

{'the', 'of', 'my', 'is', 'to', 'it', 'and', 'for', 'in', 'this'}


In [34]:
# vect.stop_words_ is completely distinct from vect.get_stop_words()
print(vect.get_stop_words())

None


## Part 5: Removing infrequent terms

- **max_features:** int or None, default=None
    - If not None, build a vocabulary that only considers the top max_features ordered by term frequency across the corpus.
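
Note that `max_features` ranks terms by their total count across the corpus, not by the number of documents they appear in. In this made-up example, a term that occurs three times in a single document outranks a term that occurs once in each of two documents:

```python
from sklearn.feature_extraction.text import CountVectorizer

# hypothetical toy corpus
toy_docs = ['good pizza pizza pizza', 'good service', 'good service']

# total counts: good=3, pizza=3, service=2
toy_vect = CountVectorizer(max_features=2).fit(toy_docs)

# 'service' appears in more documents than 'pizza' but has a lower total count
print(sorted(toy_vect.vocabulary_))
# ['good', 'pizza']
```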

In [35]:
# only keep the top 1000 most frequent terms
vect = CountVectorizer(max_features=1000)
tokenize_test(vect)

Features:  1000
Accuracy:  0.8923679060665362


- **min_df:** float in range [0.0, 1.0] or int, default=1
    - When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. (This value is also called "cut-off" in the literature.)
    - If float, the parameter represents a proportion of documents.
    - If integer, the parameter represents an absolute count.
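
The `min_df` rule can be sketched in plain Python by counting, for each term, the number of documents that contain it. This illustrates the filtering rule only and is not scikit-learn's implementation; `keep_terms` and the toy documents are hypothetical.

```python
from collections import Counter

# hypothetical, already tokenized toy documents
toy_docs = [
    ['good', 'food', 'good'],
    ['good', 'service'],
    ['slow', 'service'],
]

# document frequency: each term counts at most once per document
doc_freq = Counter(term for doc in toy_docs for term in set(doc))

def keep_terms(doc_freq, n_docs, min_df):
    """Keep terms whose document frequency is not strictly below min_df.

    An int min_df is an absolute count; a float is a proportion of documents.
    """
    threshold = min_df if isinstance(min_df, int) else min_df * n_docs
    return sorted(term for term, df in doc_freq.items() if df >= threshold)

print(keep_terms(doc_freq, len(toy_docs), 2))
# ['good', 'service']
print(keep_terms(doc_freq, len(toy_docs), 0.5))
# ['good', 'service']
```

Here "good" counts once for the first document even though it occurs twice in it, because document frequency ignores repeats within a document.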

In [36]:
# only keep terms that appear in at least 2 documents
vect = CountVectorizer(min_df=2)
tokenize_test(vect)

Features:  8783
Accuracy:  0.9246575342465754


In [37]:
# include 1-grams and 2-grams, and only keep terms that appear in at least 2 documents
vect = CountVectorizer(ngram_range=(1, 2), min_df=2)
tokenize_test(vect)

Features:  43957
Accuracy:  0.9324853228962818


**Guidelines for tuning CountVectorizer:**

- Use your knowledge of the **problem** and the **text**, and your understanding of the **tuning parameters**, to help you decide what parameters to tune and how to tune them.
- **Experiment**, and let the data tell you the best approach!
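
One way to run such experiments systematically is a cross-validated grid search over the vectorizer parameters. The sketch below is an assumption about how that could be set up, not part of the original analysis: it uses a tiny made-up dataset so that it runs on its own, and the resulting scores are not meaningful. With the real data you would pass `X_train` and `y_train` instead of the toy lists.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# hypothetical toy reviews and star ratings, made up for illustration
toy_text = [
    'great food and great service', 'loved the food', 'amazing pizza',
    'best service ever', 'great place', 'excellent food and staff',
    'terrible food', 'awful service', 'the pizza was cold and bad',
    'worst place ever', 'bad food and rude staff', 'horrible experience',
]
toy_stars = [5] * 6 + [1] * 6

# chain the vectorizer and the classifier so each fold refits the vocabulary
pipe = Pipeline([('vect', CountVectorizer()), ('nb', MultinomialNB())])

# parameter names use the '<step name>__<parameter>' convention
param_grid = {
    'vect__lowercase': [True, False],
    'vect__ngram_range': [(1, 1), (1, 2)],
    'vect__min_df': [1, 2],
}

grid = GridSearchCV(pipe, param_grid, cv=3, scoring='accuracy')
grid.fit(toy_text, toy_stars)

print(grid.best_params_)
print(grid.best_score_)
```

Putting the vectorizer inside the pipeline means the vocabulary is rebuilt from the training portion of each fold, which avoids leaking terms from the held-out documents.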