# NLP Toolkits and Preprocessing Exercises

## Introduction

We will be using [review data from Kaggle](https://www.kaggle.com/snap/amazon-fine-food-reviews) to practice preprocessing text data. The dataset contains user reviews for many products, but today we'll be focusing on the product in the dataset that had the most reviews - an oatmeal cookie. 

The following code will help you load the data. If this is your first time using nltk, you'll need to pip install it first.

In [None]:
import nltk
# nltk.download() <-- Run this if it's your first time using nltk to download all of the datasets and models

import pandas as pd
import numpy as np
import re
import string
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
data = pd.read_csv('Downloads/NLP_v01-class2/data/cookie_reviews.csv')
data.head()

## Question 1 ##

* Determine how many reviews there are in total.
* Determine the percent of 1, 2, 3, 4 and 5 star reviews.
* Determine the distribution of character lengths for the reviews, by listing the values and by plotting a histogram.
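
As a rough sketch of the three tasks above, the snippet below works on a handful of made-up `(stars, review)` pairs rather than the real CSV. The data and the variable names are assumptions for illustration only; the same arithmetic applies to the full dataset.

```python
from collections import Counter

# Hypothetical (stars, review) pairs standing in for the real data.
reviews = [
    (5, "Great cookie!"),
    (5, "Soft and chewy."),
    (4, "Good, a bit sweet."),
    (1, "Stale."),
]

# Total number of reviews.
total = len(reviews)

# Share of each star rating, as a percentage of all reviews.
star_counts = Counter(stars for stars, _ in reviews)
percentages = {star: star_counts[star] / total * 100 for star in range(1, 6)}

# Character length of every review.
lengths = [len(text) for _, text in reviews]

print(total)
print(percentages)
print(lengths)
```

Ratings that never occur show up as 0.0 because a `Counter` returns 0 for missing keys.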

## Answer to Question 1 ##

In [None]:
###   1. Determine how many reviews there are in total.
num_of_reviews = len(data)
num_of_reviews

In [None]:
###   2. Determine the percent of 1, 2, 3, 4 and 5 star reviews.

# The star rating is the second column; select it by position with iloc.
ratings = data.iloc[:, 1]

print("Rating Percentages: ")
print()

# Count each rating from 1 to 5 and report it as a share of all reviews.
for star in range(1, 6):
    percent = (ratings == star).sum() / len(data) * 100
    print('{}_star : {:.2f}%'.format(star, percent))

In [None]:
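###  3. Determine the distribution of character lengths for the reviews, by listing the values and by plotting a histogram.

# Character length of each review
review_lengths = data['reviews'].str.len()
# List the values, then plot them as a histogram
print(review_lengths.value_counts().sort_index())
review_lengths.plot(kind='hist', bins=50)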

## Question 2 ##

* Apply the following preprocessing steps:

     1. Remove all words that contain numbers
     2. Make all the text lowercase
     3. Remove punctuation
     4. Tokenize the reviews into words
     
  Hint #1: Use regular expressions.
  
  Hint #2: The cookie review in the second row has numbers, upper case letters and punctuation. You can use it to test out your regular expressions.
     
     
* Find the most common words.
* Determine the word length distribution over the entire corpus.
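
The four preprocessing steps listed above can be sketched on a single made-up sentence. The `preprocess` helper and the sample text are hypothetical, and whitespace splitting is used here as a simplified stand-in for a real tokenizer such as NLTK's `word_tokenize`.

```python
import re
import string

def preprocess(text):
    """Apply the four steps to one review and return a list of tokens."""
    # 1. Remove all words that contain a digit.
    text = re.sub(r'\w*\d\w*', ' ', text)
    # 2. Make the text lowercase.
    text = text.lower()
    # 3. Replace every punctuation character with a space.
    text = re.sub('[%s]' % re.escape(string.punctuation), ' ', text)
    # 4. Tokenize on whitespace (an assumption; simpler than word_tokenize).
    return text.split()

sample = "I bought 2 boxes of the Oatmeal2Go cookies. They're GREAT!"
tokens = preprocess(sample)
print(tokens)
```

Note that replacing punctuation with spaces splits contractions, so "They're" becomes the two tokens `they` and `re`.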

## Answers to Question 2

In [None]:
#                              Apply the following preprocessing steps:


###   1. Remove all words that contain numbers
dd = data['reviews']

new_list_removed_number = []
for i in range(len(dd)):
    texts = dd[i]
    new_text = re.sub(r'\w*\d\w*', ' ', texts)  # raw string avoids invalid-escape warnings
    new_list_removed_number.append(new_text)

print(new_list_removed_number)
print(len(new_list_removed_number))

In [None]:
###   2. Make all the text lowercase

new_list_lower_case = []
for i in range(len(new_list_removed_number)):
    texts = new_list_removed_number[i].lower()
    new_list_lower_case.append(texts)

print(new_list_lower_case)
print(len(new_list_lower_case))

In [None]:
###   3. Remove punctuation

new_list_removed_punct = []
for i in range(len(new_list_lower_case)):
    texts = new_list_lower_case[i]
    new_txt = re.sub('[%s]' % re.escape(string.punctuation), ' ', texts)
    new_list_removed_punct.append(new_txt)

print(new_list_removed_punct)
print(len(new_list_removed_punct))

In [None]:
### 4. Tokenize the reviews into words

new_list_word_tokenize = []
from nltk.tokenize import word_tokenize
for i in range(len(new_list_removed_punct)):
    texts = new_list_removed_punct[i]
    new_txts = word_tokenize(texts)
    new_list_word_tokenize.append([new_txts])

print(new_list_word_tokenize)
print(len(new_list_word_tokenize))

In [None]:
####  Find the most common words.

import collections

# Count every token across the whole corpus, not review by review.
counter = collections.Counter()
for i in range(len(new_list_word_tokenize)):
    texts = new_list_word_tokenize[i]
    texts = sum(texts, [])  # unwrap the extra list added during tokenization
    counter.update(texts)

# Keep the 20 most frequent words together with their counts.
most_common_words = counter.most_common(20)
print('most common words  = ', most_common_words)
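
Question 2 also asks for the word length distribution over the entire corpus. One possible sketch, assuming each review is already a flat list of tokens (the `tokenized_reviews` data below is made up for illustration):

```python
import collections

# Hypothetical tokenized reviews; in the notebook these would come from step 4.
tokenized_reviews = [
    ['great', 'oatmeal', 'cookie'],
    ['soft', 'and', 'chewy', 'cookie'],
]

# Tally how many words of each length occur across all reviews.
length_counts = collections.Counter(
    len(word) for review in tokenized_reviews for word in review
)

# Print the distribution from the shortest to the longest word length.
for length, count in sorted(length_counts.items()):
    print(length, count)
```

The resulting counts could then be passed to a bar plot to visualize the distribution.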

## Question 3 ##

* Apply the following preprocessing techniques:

     * Remove stopwords
     * Perform parts of speech tagging
     * Perform stemming
     * Optional: Perform lemmatization

  Recommendation: Create a new column in your data set for every preprocessing technique you apply, so you can see the progression of the reviews text.
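
The first technique in the list, stopword removal, comes down to a membership filter. The sketch below uses a tiny hand-picked stopword set and a made-up token list, both assumptions for illustration; NLTK's English stopword list is much longer.

```python
# Hypothetical, hand-picked stopword set; NLTK's English list is much longer.
stop_words = {'i', 'the', 'is', 'and', 'a', 'br'}

# Made-up tokens from an already lowercased, punctuation-free review.
tokens = ['i', 'think', 'the', 'cookie', 'is', 'soft', 'and', 'chewy', 'br']

# Keep only tokens that are not stopwords; a set gives fast membership checks.
kept = [word for word in tokens if word not in stop_words]
print(kept)
```

Storing the stopwords in a set rather than a list keeps each lookup cheap, which matters when filtering many reviews.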

## Answers to Question 3

In [None]:
#                 Apply the following preprocessing techniques:

### Remove stopwords

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

stop_words = stopwords.words('english')
stop_words.append('br') # Add 'br' to the stopwords: it is the most common token, carries no meaning, and is likely left over from HTML line-break tags.
Stop_W = []

new_list_remove_stopwords = []
for i in range(len(new_list_word_tokenize)):
    texts = new_list_word_tokenize[i]
    texts = sum(texts, [])
    new_tt = [word for word in texts if word not in stop_words]
    new_t = [word for word in texts if word in stop_words] # words that are removed 
    new_list_remove_stopwords.append(new_tt)
    Stop_W.append(new_t)
    
#print('Stop_Words   = ', Stop_W)
print(new_list_remove_stopwords)
print(len(new_list_remove_stopwords))

In [None]:
###  Perform parts of speech tagging

from nltk.tag import pos_tag

new_list_Parts_of_speech_tagging = []
for i in range(len(new_list_remove_stopwords)):
    texts = new_list_remove_stopwords[i]
    tokens = pos_tag(texts)
    new_list_Parts_of_speech_tagging.append(tokens)


    
print(new_list_Parts_of_speech_tagging)
print(len(new_list_Parts_of_speech_tagging))

In [None]:
###      Perform stemming

from nltk.stem.lancaster import LancasterStemmer
stemmer = LancasterStemmer()


new_list_Perform_stemming = []
for i in range(len(new_list_remove_stopwords)):
    texts = new_list_remove_stopwords[i]
    # Stem every token in the review
    mew = [stemmer.stem(word) for word in texts]
    new_list_Perform_stemming.append(mew)
    
print(new_list_Perform_stemming)

## Create a new DataFrame (adding the processed 'reviews')

In [None]:
import pandas as pd

# Copy the original columns by position and add the stemmed tokens as a new column.
dataa = {'user_id': data.iloc[:, 0].tolist(),
         'stars': data.iloc[:, 1].tolist(),
         'reviews': data.iloc[:, 2].tolist(),
         'processed_reviews': new_list_Perform_stemming}

# Create the pandas DataFrame.
df = pd.DataFrame(dataa)

# Display the data
df

## Question 4 ##

* After going through these preprocessing steps, what are the most common words now? Do they make more sense?

## Answer to Question 4

In [None]:
#                    most common words now

import collections

# Count every stemmed token across the whole corpus.
counter = collections.Counter()
for i in range(len(new_list_Perform_stemming)):
    counter.update(new_list_Perform_stemming[i])

# Keep the 20 most frequent stems together with their counts.
most_common_words = counter.most_common(20)
print('most common words  = ', most_common_words)

In [None]:
#   Do the common words make more sense?

print('Yeah :)' )