# Text Data Pre-Processing
## Natural Language Processing (NLP)



### Scope of this notebook:

### 1.  Data Inspection
### 2.  Add Sentiment Feature to data set
### 3.  Tokenization, Normalization & Custom Stopword Filtering
### 4.  Extract the most common words
### 5.  Create "Bag of Words" data set

In [61]:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np

In [62]:
# read data source
df = pd.read_csv("../Resources/helpful_clean_reviews_combined.csv")
df = df.drop(["Unnamed: 0"], axis=1)
df.head()

Unnamed: 0,key,stars,helpful_yes,helpful_no,text,rating
0,0_breyers,1,11,0,I am interested in the flavoring components us...,4.1
1,0_breyers,1,7,0,"Boy, was I surprised when I got my Bryers home...",4.1
2,0_breyers,1,8,0,I havent purchased this product in awhile and ...,4.1
3,0_breyers,1,4,0,The Natural Vanilla recipe change to include T...,4.1
4,0_breyers,5,21,2,I had the same issue with breyers. I finally f...,4.1


### 1. Data Inspection

In [63]:
# data overview
print ('Rows     : ', df.shape[0])
print ('Columns  : ', df.shape[1])
print ('\nFeatures : ', df.columns.tolist())
print ('\nMissing values :  ', df.isnull().sum().values.sum())
print ('\nUnique values :  \n', df.nunique())

Rows     :  3419
Columns  :  6

Features :  ['key', 'stars', 'helpful_yes', 'helpful_no', 'text', 'rating']

Missing values :   0

Unique values :  
 key             184
stars             5
helpful_yes      66
helpful_no       20
text           3419
rating           11
dtype: int64


In [64]:
# find missing values and view data types
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3419 entries, 0 to 3418
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   key          3419 non-null   object 
 1   stars        3419 non-null   int64  
 2   helpful_yes  3419 non-null   int64  
 3   helpful_no   3419 non-null   int64  
 4   text         3419 non-null   object 
 5   rating       3419 non-null   float64
dtypes: float64(1), int64(3), object(2)
memory usage: 160.4+ KB


In [65]:
# Find null values
for column in df.columns:
    print(f"Column {column} has {df[column].isnull().sum()} null values")

Column key has 0 null values
Column stars has 0 null values
Column helpful_yes has 0 null values
Column helpful_no has 0 null values
Column text has 0 null values
Column rating has 0 null values


In [66]:
# Find duplicate entries
# duplicate entries tell us nothing new and can skew results
print(f"Duplicate entries: {df.duplicated().sum()}")

Duplicate entries: 0
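
Note that the check above compares entire rows, whereas the cleanup step in this notebook drops rows by `text` alone, so the two can disagree. A minimal sketch on a made-up three-row frame (the `toy` data is hypothetical and not taken from the review data set):

```python
import pandas as pd

# hypothetical toy data: rows 0 and 2 share the same text but differ in stars
toy = pd.DataFrame({
    "stars": [5, 1, 4],
    "text": ["great ice cream", "too sweet", "great ice cream"],
})

# whole-row comparison: no row is an exact copy of another
full_row_dupes = int(toy.duplicated().sum())

# text-only comparison: the third row repeats the first row's text
text_dupes = int(toy.duplicated(subset=["text"]).sum())

# drop by text and rebuild a gap-free 0..n-1 index
deduped = toy.drop_duplicates(subset=["text"]).reset_index(drop=True)

print(full_row_dupes, text_dupes, len(deduped))
```

Counting with the same `subset` that is later used for dropping keeps the reported number and the rows removed in agreement.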


In [67]:
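# drop reviews with duplicate text and rebuild the index
# (a gap-free index keeps the row-by-row lookups later in the notebook valid)
df = df.drop_duplicates(subset=['text']).reset_index(drop=True)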

In [68]:
# create df_data to hold new dataset without duplicates
# (copy so later changes do not modify df)
df_data = df.copy()
df_data

Unnamed: 0,key,stars,helpful_yes,helpful_no,text,rating
0,0_breyers,1,11,0,I am interested in the flavoring components us...,4.1
1,0_breyers,1,7,0,"Boy, was I surprised when I got my Bryers home...",4.1
2,0_breyers,1,8,0,I havent purchased this product in awhile and ...,4.1
3,0_breyers,1,4,0,The Natural Vanilla recipe change to include T...,4.1
4,0_breyers,5,21,2,I had the same issue with breyers. I finally f...,4.1
...,...,...,...,...,...,...
3414,9_hd,5,1,0,I tried the new flavor with layers and it was ...,4.9
3415,9_hd,5,1,0,"love this ice cream, taste fantastic!! will ne...",4.9
3416,9_hd,5,1,0,This is my favorite cream. Where can I find th...,4.9
3417,9_hd,5,1,0,The best tasting ice cream out there! It is ve...,4.9


In [69]:
# Update helpful_clean_reviews_combined to exclude duplicates
#df_data.to_csv("Resources/helpful_clean_reviews_combined.csv", index=True)

### 2.  Add Sentiment Feature to data set

Any review with 4 or more stars gets a value of 1 to reflect positive sentiment. 
Any review with fewer than 4 stars gets a value of 0 to reflect negative sentiment.
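
The labeling rule can also be sketched as a vectorized comparison. The `stars` series here is made up for illustration; this is an alternative sketch, not the cell used below:

```python
import pandas as pd

# hypothetical star ratings, one per review
stars = pd.Series([1, 3, 4, 5])

# True where stars >= 4, then cast the booleans to 1/0
sentiment = (stars >= 4).astype(int)

print(sentiment.tolist())
```

A 3-star review lands on the negative side under this threshold, which is worth keeping in mind when interpreting the two classes.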

In [70]:
# add sentiment column to df_data
df_data['sentiment'] = pd.Series(dtype='int64')
df_data.head()

Unnamed: 0,key,stars,helpful_yes,helpful_no,text,rating,sentiment
0,0_breyers,1,11,0,I am interested in the flavoring components us...,4.1,
1,0_breyers,1,7,0,"Boy, was I surprised when I got my Bryers home...",4.1,
2,0_breyers,1,8,0,I havent purchased this product in awhile and ...,4.1,
3,0_breyers,1,4,0,The Natural Vanilla recipe change to include T...,4.1,
4,0_breyers,5,21,2,I had the same issue with breyers. I finally f...,4.1,


In [71]:
# assign 1 for positive sentiment, 0 for negative
# if stars 4 or higher, sentiment is positive

def applyFunc(s):
    if s >= 4:
        return 1
    else:
        return 0

# populate column        
df_data['sentiment'] = df_data['stars'].apply(applyFunc)
df_data.head()

Unnamed: 0,key,stars,helpful_yes,helpful_no,text,rating,sentiment
0,0_breyers,1,11,0,I am interested in the flavoring components us...,4.1,0
1,0_breyers,1,7,0,"Boy, was I surprised when I got my Bryers home...",4.1,0
2,0_breyers,1,8,0,I havent purchased this product in awhile and ...,4.1,0
3,0_breyers,1,4,0,The Natural Vanilla recipe change to include T...,4.1,0
4,0_breyers,5,21,2,I had the same issue with breyers. I finally f...,4.1,1


In [72]:
# Create positive sentiment dataframe
# delete if not used in rest of notebook
# again, seeing output may drive inspiration for new ideas or provide clarity on the direction. 

df_positive_sentiment = df_data[df_data['sentiment'] ==1]
df_positive_sentiment

Unnamed: 0,key,stars,helpful_yes,helpful_no,text,rating,sentiment
4,0_breyers,5,21,2,I had the same issue with breyers. I finally f...,4.1,1
17,0_breyers,4,45,3,The taste of Breyers vanilla ice cream decline...,4.1,1
42,0_breyers,4,2,0,This product no longer has specks of vanilla i...,4.1,1
56,0_breyers,5,53,32,After trying Bryers Natural Vanilla Ice Cream ...,4.1,1
68,0_breyers,4,1,0,"Hi. My husband and I like the ice cream, but w...",4.1,1
...,...,...,...,...,...,...,...
3414,9_hd,5,1,0,I tried the new flavor with layers and it was ...,4.9,1
3415,9_hd,5,1,0,"love this ice cream, taste fantastic!! will ne...",4.9,1
3416,9_hd,5,1,0,This is my favorite cream. Where can I find th...,4.9,1
3417,9_hd,5,1,0,The best tasting ice cream out there! It is ve...,4.9,1


In [73]:
# Create negative sentiment dataframe
# delete if not used in rest of notebook

df_negative_sentiment = df_data[df_data['sentiment'] ==0]
df_negative_sentiment

Unnamed: 0,key,stars,helpful_yes,helpful_no,text,rating,sentiment
0,0_breyers,1,11,0,I am interested in the flavoring components us...,4.1,0
1,0_breyers,1,7,0,"Boy, was I surprised when I got my Bryers home...",4.1,0
2,0_breyers,1,8,0,I havent purchased this product in awhile and ...,4.1,0
3,0_breyers,1,4,0,The Natural Vanilla recipe change to include T...,4.1,0
5,0_breyers,1,4,0,I rarely eat ice cream these days but bought t...,4.1,0
...,...,...,...,...,...,...,...
3345,8_talenti,2,3,1,"I dont buy a lot of ice cream, gelato, or swee...",4.3,0
3348,8_talenti,3,3,0,The top layers are great. Tastes like cheeseca...,4.3,0
3354,8_talenti,3,2,1,I was really excited to try this flavor but wa...,4.3,0
3357,8_talenti,3,1,0,All of your flavors are such high quality and ...,4.3,0


In [74]:
# create product_sentiment_reviews.csv
#df_data.to_csv("../Resources/product_sentiment_reviews.csv", index=False)

### 3.  Tokenization, Normalization & Custom Stopword Filtering with NLTK

This is where each review is split into individual words, each word is lower-cased, stop words are removed, and the remaining words are lemmatized to their base form. Punctuation is dropped during tokenization.

We use the NLTK library for this step, as it is widely used in NLP education and research.  
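
To make the steps concrete, here is a minimal standard-library sketch on a single made-up sentence. It assumes `re.findall` as a stand-in for NLTK's `RegexpTokenizer(r'\w+')` and uses a tiny hand-written stop word set (`toy_stop_words` is hypothetical); lemmatization is left out because it needs the WordNet data.

```python
import re

# tiny hand-written stop word set, for illustration only
toy_stop_words = {"i", "the", "was", "my", "it"}

review = "I LOVED the 1.5qt tub, it was my favorite!"

# \w+ matches runs of letters, digits and underscores, so punctuation is dropped
tokens = re.findall(r"\w+", review)

# lower-case each token and drop the stop words
clean = [t.lower() for t in tokens if t.lower() not in toy_stop_words]

print(clean)
```

Because the pattern splits on the period, "1.5qt" becomes the two tokens `1` and `5qt`, which is why numeric fragments show up in the tokenized reviews.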

In [17]:
# create tokenizer dataframe (copy so df_data keeps the original review text)
df_tokenize = df_data.copy()

In [18]:
# import the Tokenizer library
import nltk
from nltk.tokenize import word_tokenize, RegexpTokenizer

# RegexpTokenizer will tokenize according to any regular expression assigned. 
# The regular expression r'\w+' matches one or more consecutive word characters
# (letters, digits or underscores), so punctuation is dropped.
reTokenizer = RegexpTokenizer(r'\w+')



from nltk.corpus import stopwords
from string import punctuation
stop_words = set(stopwords.words('english'))


from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

In [19]:
# collect all the words from all the reviews into one list

# initialize list to hold words
all_words = []


# holds one list of cleaned tokens per review
tokenized_reviews = []

for review in df_tokenize['text']:
    # separate review text into a list of words
    tokens = reTokenizer.tokenize(review)

    # cleaned tokens for this review
    clean_tokens = []

    # iterate through tokens
    for word in tokens:
        # lower the case of each word
        word = word.lower()
        # exclude stop words
        if word not in stop_words:
            # Lemmatize words into a standard form so that variants of the same word are counted together
            word = lemmatizer.lemmatize(word)
            # add to list of words
            all_words.append(word)
            # add to the token list for this review
            clean_tokens.append(word)

    tokenized_reviews.append(clean_tokens)

# replace the text column with the token lists in a single assignment
# (avoids chained assignment such as df['text'][i] = ...)
df_tokenize['text'] = tokenized_reviews
            

### 4.  Extract the most common words

"Bag of words" and "most common words" are used interchangeably throughout the rest of this notebook. The naming will be made consistent once everyone is comfortable with the code.

In [20]:
# Extract the most common words from the list of all_words.

from nltk import FreqDist

# count how often each word occurs in the all_words list
# (note: all_words is rebound from a list to a FreqDist here)
all_words = FreqDist(all_words)
# Extract the 500 most common words, with their counts, from all_words
most_common_words = all_words.most_common(500)

# create a list of most common words without the frequency count
word_features = [word for word, count in most_common_words]

# display the (word, count) pairs
most_common_words

[('cream', 3171),
 ('ice', 3014),
 ('flavor', 2677),
 ('chocolate', 1416),
 ('love', 1121),
 ('like', 987),
 ('one', 909),
 ('taste', 855),
 ('favorite', 725),
 ('good', 675),
 ('best', 645),
 ('vanilla', 566),
 ('would', 547),
 ('pint', 533),
 ('ever', 518),
 ('time', 504),
 ('get', 469),
 ('creamy', 457),
 ('cookie', 456),
 ('store', 452),
 ('find', 431),
 ('delicious', 431),
 ('please', 430),
 ('great', 398),
 ('try', 389),
 ('really', 388),
 ('butter', 382),
 ('im', 381),
 ('tried', 378),
 ('sweet', 378),
 ('gelato', 373),
 ('perfect', 370),
 ('make', 368),
 ('chip', 351),
 ('product', 350),
 ('buy', 348),
 ('texture', 344),
 ('amazing', 344),
 ('caramel', 341),
 ('breyers', 339),
 ('new', 338),
 ('eat', 338),
 ('peanut', 336),
 ('ive', 334),
 ('first', 331),
 ('year', 327),
 ('much', 314),
 ('go', 308),
 ('dairy', 308),
 ('never', 304),
 ('bought', 285),
 ('dont', 265),
 ('every', 264),
 ('chunk', 261),
 ('back', 254),
 ('always', 252),
 ('better', 244),
 ('could', 240),
 ('even',

In [21]:
print ('There are ', len(all_words), 'unique words total in our text dataset.')
print ('There are ', len(most_common_words), 'unique words in the most common words list.')

There are  6153 unique words total in our text dataset.
There are  500 unique words in the most common words list.
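
The counting step can be reproduced on a toy list with `collections.Counter`, which `nltk.FreqDist` subclasses. The token list below is made up and the cutoff of 2 is arbitrary; it is only meant to show how the (word, count) pairs turn into a plain vocabulary list.

```python
from collections import Counter

# hypothetical token list standing in for all_words
toy_words = ["cream", "ice", "cream", "flavor", "cream", "ice"]

# count how often each token occurs
counts = Counter(toy_words)

# keep the 2 most frequent tokens as (word, count) pairs
top = counts.most_common(2)

# strip the counts to get the vocabulary
vocab = [word for word, _ in top]

print(top, vocab)
```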


### 5.  Create "Bag of Words" data set

In [22]:
# create Bag of Words DataFrame (copy so df_tokenize is left unchanged)
df_bagofwords = df_tokenize.copy()

In [23]:
# create column for bag of words
df_bagofwords['bag_of_words'] = ""
df_bagofwords.head()

Unnamed: 0,key,stars,helpful_yes,helpful_no,text,rating,sentiment,bag_of_words
0,0_breyers,1,11,0,"[interested, flavoring, component, used, notic...",4.1,0,
1,0_breyers,1,7,0,"[boy, surprised, got, bryers, home, discover, ...",4.1,0,
2,0_breyers,1,8,0,"[havent, purchased, product, awhile, surprised...",4.1,0,
3,0_breyers,1,4,0,"[natural, vanilla, recipe, change, include, ta...",4.1,0,
4,0_breyers,5,21,2,"[issue, breyers, finally, found, turkey, hill,...",4.1,1,


In [24]:
# build the bag of words for each review
bags = []
for tokens in df_bagofwords['text']:
    # keep only the words in 'text' that are in the most common words
    # note: word_features is simply "most_common_words" without the count column
    bags.append([word for word in tokens if word in word_features])

# populate the bag of words column in a single assignment
# (avoids chained assignment such as df['bag_of_words'][i] = ...)
df_bagofwords['bag_of_words'] = bags
            
             

In [25]:
df_bagofwords.head()

Unnamed: 0,key,stars,helpful_yes,helpful_no,text,rating,sentiment,bag_of_words
0,0_breyers,1,11,0,"[interested, flavoring, component, used, notic...",4.1,0,"[used, ingredient, list, vanilla, bean, vanill..."
1,0_breyers,1,7,0,"[boy, surprised, got, bryers, home, discover, ...",4.1,0,"[surprised, got, home, frozen, dairy, dessert,..."
2,0_breyers,1,8,0,"[havent, purchased, product, awhile, surprised...",4.1,0,"[havent, purchased, product, surprised, today,..."
3,0_breyers,1,4,0,"[natural, vanilla, recipe, change, include, ta...",4.1,0,"[natural, vanilla, recipe, change, gum, change..."
4,0_breyers,5,21,2,"[issue, breyers, finally, found, turkey, hill,...",4.1,1,"[issue, breyers, finally, found, natural, ice,..."


In [26]:
# Example to compare text vs bag of words
# set example variable equal to the review row you'd like to see
example = 7

print('text: ', df_bagofwords['text'][example])
print('\nbag_of_words: ',df_bagofwords['bag_of_words'][example])

text:  ['upset', '1', '5qt', 'container', 'natural', 'vanilla', 'two', 'different', 'store', 'lacked', 'little', 'black', 'speck', 'come', 'love', 'expect']

bag_of_words:  ['1', 'container', 'natural', 'vanilla', 'two', 'different', 'store', 'little', 'come', 'love']
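
The filtering behind this comparison can be sketched on its own. The tokens are copied from the output above; `vocab` is a hypothetical slice of the most common words, stored as a set so that membership tests stay fast as the vocabulary grows.

```python
# tokens of review 7, copied from the output above
tokens = ['upset', '1', '5qt', 'container', 'natural', 'vanilla', 'two',
          'different', 'store', 'lacked', 'little', 'black', 'speck',
          'come', 'love', 'expect']

# hypothetical slice of the most common words; a set gives fast membership tests
vocab = {'1', 'container', 'natural', 'vanilla', 'two', 'different',
         'store', 'little', 'come', 'love', 'cream', 'ice'}

# keep tokens that are in the vocabulary, preserving order and repeats
bag = [t for t in tokens if t in vocab]

print(bag)
```

Words such as "upset", "lacked" and "speck" fall outside the vocabulary and are dropped, so rarer but potentially informative words do not reach the bag of words column.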
