# Natural Language Processing (NLP)
## Feature Extraction & Vectorizing

The primary goal of this notebook is to get the data compatible with supervised machine learning algorithms. 

At this stage in the project, it is beneficial to decision making to see as much of the output as possible to ensure the code
is working as intended and adjust/improve where possible.  The markdowns here are meant to be a guide to walk through what the
code is doing and why.

This is a shell state of the pre-processing model. Need to limit the "tokenized reviews" dataset to the 3000 most common words for step 7 vectorizing of "text" to reflect accurately.

## Scope of this notebook:

### 1.  Data Inspection
### 2.  Add Sentiment Feature to data set
### 3.  Create Product Sentiment Reviews Dataset
### 4.  Tokenize "text" words
### 5.  Bag of Words - Extract the most common words
### 6.  Create Tokenized Reviews data set
### 7.  TFID Vectorizing for Supervised  ML algorithms

In [228]:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np

We are starting with a review dataset that has been filtered down to ice cream products that have achieved an amazon rating of 4 stars or higher joined on the key feature with consumer reviews that have been filtered down those that received more helpful_yes votes than helpful_no votes.

(insert why we chose to filter and clean the dataset this way)

In [229]:
# read data source
df = pd.read_csv("Resources/helpful_clean_reviews_combined.csv")
df.head()

Unnamed: 0,key,stars,helpful_yes,helpful_no,text,rating
0,0_breyers,1,11,0,I am interested in the flavoring components us...,4.1
1,0_breyers,1,7,0,"Boy, was I surprised when I got my Bryers home...",4.1
2,0_breyers,1,8,0,I havent purchased this product in awhile and ...,4.1
3,0_breyers,1,4,0,The Natural Vanilla recipe change to include T...,4.1
4,0_breyers,5,21,2,I had the same issue with breyers. I finally f...,4.1


### 1. Data Inspection

Knowing the data is key to ensuring its compatible with any functions or methods required for the code to perform.
We know, off the cusp, that unsupervised ML doesn't like strings or null values so lets identify any of those. Also, we will remove any duplicate data as it doesn't tell us anything new and may skew results. 

In [230]:
# data overview
print ('Rows     : ', df.shape[0])
print ('Columns  : ', df.shape[1])
print ('\nFeatures : ', df.columns.tolist())
print ('\nMissing values :  ', df.isnull().sum().values.sum())
print ('\nUnique values :  \n', df.nunique())

Rows     :  3424
Columns  :  6

Features :  ['key', 'stars', 'helpful_yes', 'helpful_no', 'text', 'rating']

Missing values :   0

Unique values :  
 key             184
stars             5
helpful_yes      66
helpful_no       20
text           3419
rating           11
dtype: int64


In [231]:
# find missing values and view data types
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3424 entries, 0 to 3423
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   key          3424 non-null   object 
 1   stars        3424 non-null   int64  
 2   helpful_yes  3424 non-null   int64  
 3   helpful_no   3424 non-null   int64  
 4   text         3424 non-null   object 
 5   rating       3424 non-null   float64
dtypes: float64(1), int64(3), object(2)
memory usage: 160.6+ KB


In [232]:
# Find null values
for column in df.columns:
    print(f"Column {column} has {df[column].isnull().sum()} null values")

Column key has 0 null values
Column stars has 0 null values
Column helpful_yes has 0 null values
Column helpful_no has 0 null values
Column text has 0 null values
Column rating has 0 null values


In [233]:
# Find duplicate entries
# duplicate entries are not telling us anything new  and can skew results
print(f"Duplicate entries: {(df.duplicated().sum()) * 2}")

Duplicate entries: 4


In [234]:
# drop duplicate entries
df.drop_duplicates(subset=['text'])

Unnamed: 0,key,stars,helpful_yes,helpful_no,text,rating
0,0_breyers,1,11,0,I am interested in the flavoring components us...,4.1
1,0_breyers,1,7,0,"Boy, was I surprised when I got my Bryers home...",4.1
2,0_breyers,1,8,0,I havent purchased this product in awhile and ...,4.1
3,0_breyers,1,4,0,The Natural Vanilla recipe change to include T...,4.1
4,0_breyers,5,21,2,I had the same issue with breyers. I finally f...,4.1
...,...,...,...,...,...,...
3419,9_hd,5,1,0,I tried the new flavor with layers and it was ...,4.9
3420,9_hd,5,1,0,"love this ice cream, taste fantastic!! will ne...",4.9
3421,9_hd,5,1,0,This is my favorite cream. Where can I find th...,4.9
3422,9_hd,5,1,0,The best tasting ice cream out there! It is ve...,4.9


In [235]:
# create data_df to hold new dataset without duplicates
df_data = pd.DataFrame(df)
df_data

Unnamed: 0,key,stars,helpful_yes,helpful_no,text,rating
0,0_breyers,1,11,0,I am interested in the flavoring components us...,4.1
1,0_breyers,1,7,0,"Boy, was I surprised when I got my Bryers home...",4.1
2,0_breyers,1,8,0,I havent purchased this product in awhile and ...,4.1
3,0_breyers,1,4,0,The Natural Vanilla recipe change to include T...,4.1
4,0_breyers,5,21,2,I had the same issue with breyers. I finally f...,4.1
...,...,...,...,...,...,...
3419,9_hd,5,1,0,I tried the new flavor with layers and it was ...,4.9
3420,9_hd,5,1,0,"love this ice cream, taste fantastic!! will ne...",4.9
3421,9_hd,5,1,0,This is my favorite cream. Where can I find th...,4.9
3422,9_hd,5,1,0,The best tasting ice cream out there! It is ve...,4.9


### 2.  Add Sentiment Feature to data set

We include a sentiment in case we run a sentiment analysis, which is very popular with NLP modeling. 
Here we will assign a value of 1 to reflect positive sentiment. This consists of star rating greater than or equal to 5. 
Any review with a star rating less than 5 gets a value of 0 to reflect negative sentiment. 

In [236]:
# add sentiment column to df_data
df_data['sentiment'] = pd.Series(dtype='int64')
df_data.head()

Unnamed: 0,key,stars,helpful_yes,helpful_no,text,rating,sentiment
0,0_breyers,1,11,0,I am interested in the flavoring components us...,4.1,
1,0_breyers,1,7,0,"Boy, was I surprised when I got my Bryers home...",4.1,
2,0_breyers,1,8,0,I havent purchased this product in awhile and ...,4.1,
3,0_breyers,1,4,0,The Natural Vanilla recipe change to include T...,4.1,
4,0_breyers,5,21,2,I had the same issue with breyers. I finally f...,4.1,


In [237]:
# assign 1 for positive sentiment, 0 for negative
# I'm bad with functions, this link helped me get it right. 
# https://stackoverflow.com/questions/30953299/pandas-if-row-in-column-a-contains-x-write-y-to-row-in-column-b
def applyFunc(s):
    if s >= 5:
        return 1
    else:
        return 0

# populate column        
df_data['sentiment'] = df_data['stars'].apply(applyFunc)
df_data.head()

Unnamed: 0,key,stars,helpful_yes,helpful_no,text,rating,sentiment
0,0_breyers,1,11,0,I am interested in the flavoring components us...,4.1,0
1,0_breyers,1,7,0,"Boy, was I surprised when I got my Bryers home...",4.1,0
2,0_breyers,1,8,0,I havent purchased this product in awhile and ...,4.1,0
3,0_breyers,1,4,0,The Natural Vanilla recipe change to include T...,4.1,0
4,0_breyers,5,21,2,I had the same issue with breyers. I finally f...,4.1,1


In [238]:
# Create positive sentiment dataframe
# delete if not used in rest of notebook
# again, seeing output may drive inspiration for new ideas or provide clarity on the direction. 

df_positive_sentiment = df_data[df_data['sentiment'] ==1]
df_positive_sentiment

Unnamed: 0,key,stars,helpful_yes,helpful_no,text,rating,sentiment
4,0_breyers,5,21,2,I had the same issue with breyers. I finally f...,4.1,1
56,0_breyers,5,53,32,After trying Bryers Natural Vanilla Ice Cream ...,4.1,1
74,0_hd,5,27,0,"if this flavor is ever retired, i swear -- my ...",4.9,1
75,0_hd,5,10,0,I am an ice cream addict and this flavour has ...,4.9,1
76,0_hd,5,4,0,"This flavor is sloop good, I eat about 2 a day...",4.9,1
...,...,...,...,...,...,...,...
3419,9_hd,5,1,0,I tried the new flavor with layers and it was ...,4.9,1
3420,9_hd,5,1,0,"love this ice cream, taste fantastic!! will ne...",4.9,1
3421,9_hd,5,1,0,This is my favorite cream. Where can I find th...,4.9,1
3422,9_hd,5,1,0,The best tasting ice cream out there! It is ve...,4.9,1


In [239]:
# Create negative sentiment dataframe
# delete if not used in rest of notebook

df_negative_sentiment = df_data[df_data['sentiment'] ==0]
df_negative_sentiment

Unnamed: 0,key,stars,helpful_yes,helpful_no,text,rating,sentiment
0,0_breyers,1,11,0,I am interested in the flavoring components us...,4.1,0
1,0_breyers,1,7,0,"Boy, was I surprised when I got my Bryers home...",4.1,0
2,0_breyers,1,8,0,I havent purchased this product in awhile and ...,4.1,0
3,0_breyers,1,4,0,The Natural Vanilla recipe change to include T...,4.1,0
5,0_breyers,1,4,0,I rarely eat ice cream these days but bought t...,4.1,0
...,...,...,...,...,...,...,...
3349,8_talenti,2,3,1,"I dont buy a lot of ice cream, gelato, or swee...",4.3,0
3352,8_talenti,3,3,0,The top layers are great. Tastes like cheeseca...,4.3,0
3358,8_talenti,3,2,1,I was really excited to try this flavor but wa...,4.3,0
3361,8_talenti,3,1,0,All of your flavors are such high quality and ...,4.3,0


### 3.  Create Product Sentiment Reviews Dataset

This is the helpful_cleaned_reviews_combined.csv with duplicates removed and sentiment column added.

In [240]:
# create product_sentiment_reviews.csv
df_data.to_csv("Resources/product_sentiment_reviews.csv", index=False)

### 4.  Tokenize "text" words

We are working with text data we have to count the number of words in the text, as well as, identify the number of times a particular word is present. We Tokenize the text data to do just that.

Here is where all the magic of splitting the reviews into single words, putting each word into lower case, lemmatizing each to its base form, removing punctuations and excluding stop words occurs. We still have more work to do to clean this up, but hey, the code is there. 

While there are plenty of library and program options, we've tokenized with NLTK as this is the best way to see whats happening step by step.  

As we do more research, we may change our minds and adopt libraries and programs that offer "cleaner" coding opportunities and better performance now that we have a clearer vision of what the code is doing function by function.

In [241]:
# create tokenizer dataframe
df_tokenize = pd.DataFrame(df_data)

In [242]:
# import the Tokenizer library
import nltk
from nltk.tokenize import word_tokenize, RegexpTokenizer

# RegexpTokenizer will tokenize according to any regular expression assigned. 
# The regular expression r'\w+' matches any pattern consisting of one or more consecutive letters.
reTokenizer = RegexpTokenizer(r'\w+')



from nltk.corpus import stopwords
from string import punctuation
stop_words = set(stopwords.words('english'))

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

In [243]:
#stop_words

In [244]:
# collect all the words from all the reviews into one list

# initialize list to hold words
all_words = []


for i in range(len(df_tokenize['text'][i])):
    # separate review text into a list of words
    tokens = reTokenizer.tokenize(df_tokenize['text'][i])
    
    
    df_tokenize['text'][i] = []
    
    # iterate through tokens
    for word in tokens:
        # lower the case of each word
        word = word.lower()
        # exclude stop words
        if word not in stop_words:
            
            # Lemmatize words into a standard form and avoid counting the same word more than once
            word = lemmatizer.lemmatize(word)
            # add to list of words
            all_words.append(word)
            df_tokenize['text'][i].append(word)
            

### 5.  Bag of Words? Extract the most common words

Of the most common words, frequency ranges from 13-3175, with one word "i" as an outlier at 7281.
To Do: make a print statement that tells us this automatically since we will be changing this to find the perfect data for our modeling. 

In [245]:
# Extract the 3000 most common words from the list.

from nltk import FreqDist

all_words = FreqDist(all_words)
most_common_words = all_words.most_common(500)

word_features = []
for w in most_common_words:
    word_features.append(w[0])
    
most_common_words

[('cream', 590),
 ('ice', 567),
 ('flavor', 407),
 ('chocolate', 226),
 ('like', 197),
 ('vanilla', 191),
 ('love', 181),
 ('taste', 173),
 ('butter', 129),
 ('one', 128),
 ('peanut', 126),
 ('breyers', 116),
 ('product', 111),
 ('good', 109),
 ('favorite', 104),
 ('pint', 100),
 ('cookie', 97),
 ('would', 93),
 ('best', 86),
 ('time', 85),
 ('creamy', 85),
 ('coffee', 83),
 ('dough', 83),
 ('really', 80),
 ('chip', 78),
 ('eat', 78),
 ('texture', 77),
 ('natural', 72),
 ('get', 72),
 ('find', 68),
 ('make', 67),
 ('please', 67),
 ('im', 67),
 ('used', 65),
 ('sweet', 65),
 ('tried', 63),
 ('much', 62),
 ('store', 62),
 ('perfect', 62),
 ('delicious', 62),
 ('first', 61),
 ('ever', 61),
 ('chunk', 61),
 ('ingredient', 59),
 ('go', 56),
 ('buy', 56),
 ('brand', 55),
 ('try', 54),
 ('ive', 53),
 ('back', 52),
 ('great', 52),
 ('still', 51),
 ('dairy', 51),
 ('year', 51),
 ('amazing', 51),
 ('bought', 50),
 ('free', 50),
 ('new', 49),
 ('always', 49),
 ('could', 48),
 ('recipe', 45),
 ('h

In [246]:
print ('There are ', len(all_words), 'unique words total in our text dataset.')
print ('There are ', len(most_common_words), 'unique words in the most common words list.')

There are  2790 unique words total in our text dataset.
There are  500 unique words in the most common words list.


### 6.  Create Tokenized Reviews data set

The export is commented out because I haven't limited the text data to the 3000 most common words.

In [247]:
# create tokenized_reviews.csv
# df_tokenize.to_csv("Resources/tokenized_reviews.csv", index=False)

In [248]:
df_tokenize.head()

Unnamed: 0,key,stars,helpful_yes,helpful_no,text,rating,sentiment
0,0_breyers,1,11,0,"[interested, flavoring, component, used, notic...",4.1,0
1,0_breyers,1,7,0,"[boy, surprised, got, bryers, home, discover, ...",4.1,0
2,0_breyers,1,8,0,"[havent, purchased, product, awhile, surprised...",4.1,0
3,0_breyers,1,4,0,"[natural, vanilla, recipe, change, include, ta...",4.1,0
4,0_breyers,5,21,2,"[issue, breyers, finally, found, turkey, hill,...",4.1,1


### 7.  Vectorizing for Supervised ML algorithms

We can now iterate through each review in our Tokenized Reviews dataset and create a vector of 1's and 0's for a given review depending on which words from our chosen 3000 show up in that review. However we should think ahead a little --- which ML algorithm will we use, and what format does it prefer its data in?


Only running TFID because of scaling benefits. 
I've vectorized both the key and text features though we may not need all of this. Just wanted to get it done since we are exploring.

Links to education and code syntax:

https://datascience.stackexchange.com/questions/22250/what-is-the-difference-between-a-hashing-vectorizer-and-a-tfidf-vectorizer

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html#sklearn.feature_extraction.text.HashingVectorizer



#### TFID Key

In [249]:
from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer

In [250]:
# get 'key' term frequencies weighted by their relative importance (IDF)

df_tfidf_key = pd.DataFrame(df_tokenize)

vectorizer = TfidfVectorizer()

sparse_out_key = vectorizer.fit_transform(df_tfidf_key['key'])

tfidf_key_df = pd.DataFrame(data = sparse_out_key.toarray(),
                        columns = vectorizer.get_feature_names())

tfidf_key_df.head()

Unnamed: 0,0_breyers,0_hd,0_talenti,10_bj,10_breyers,10_talenti,11_bj,11_breyers,11_talenti,12_bj,...,6_talenti,7_bj,7_breyers,7_hd,7_talenti,8_bj,8_hd,8_talenti,9_bj,9_hd
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [251]:
tfidf_key_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3424 entries, 0 to 3423
Columns: 184 entries, 0_breyers to 9_hd
dtypes: float64(184)
memory usage: 4.8 MB


In [252]:
print ('\nFeatures : ', tfidf_key_df.columns.tolist())


Features :  ['0_breyers', '0_hd', '0_talenti', '10_bj', '10_breyers', '10_talenti', '11_bj', '11_breyers', '11_talenti', '12_bj', '12_breyers', '12_hd', '12_talenti', '13_bj', '13_hd', '13_talenti', '14_bj', '14_breyers', '14_hd', '14_talenti', '15_breyers', '15_hd', '15_talenti', '16_bj', '16_breyers', '16_hd', '16_talenti', '17_breyers', '17_hd', '17_talenti', '18_breyers', '18_hd', '18_talenti', '19_bj', '19_breyers', '19_hd', '19_talenti', '1_bj', '1_breyers', '1_hd', '1_talenti', '20_bj', '20_hd', '20_talenti', '21_bj', '21_breyers', '21_hd', '22_bj', '22_breyers', '22_hd', '22_talenti', '23_bj', '24_breyers', '24_hd', '24_talenti', '25_bj', '25_breyers', '25_hd', '25_talenti', '26_breyers', '26_hd', '26_talenti', '27_bj', '27_breyers', '27_hd', '27_talenti', '28_bj', '28_talenti', '29_bj', '29_hd', '29_talenti', '2_bj', '2_breyers', '2_hd', '2_talenti', '30_bj', '30_breyers', '30_hd', '30_talenti', '31_bj', '31_breyers', '31_hd', '31_talenti', '32_bj', '32_hd', '32_talenti', '33

#### TFID Text

In [253]:
from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer
df_tfidf_text = pd.DataFrame(df_tokenize)

# convert text list to string and create string column
# required for vectorizer, learned after getting error
# https://stackoverflow.com/questions/45306988/column-of-lists-convert-list-to-string-as-a-new-column
df_tfidf_text['text_str'] = df_tfidf_text['text'].apply(lambda x: ','.join(map(str, x)))

df_tfidf_text.head()

# create tokenized_reviews.csv
df_tfidf_text.to_csv("Resources/tokenized_reviews.csv", index=False)

In [254]:
# get 'text' term frequencies weighted by their relative importance (IDF)
vectorizer2 = TfidfVectorizer()

sparse_out_text = vectorizer2.fit_transform(df_tfidf_text['text_str'])

tfidf_text_df = pd.DataFrame(data = sparse_out_text.toarray(),
                        columns = vectorizer2.get_feature_names())

tfidf_text_df.head()

Unnamed: 0,10,100,101,107th,11,11th,120,1200,13,15,...,youll,young,youre,youve,yuck,yucky,yum,yummm,yummy,zero
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.261575,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
