Problem Statement No. 16
Consider the Amazon Alexa Reviews dataset. It consists of about 3,000 Amazon customer reviews (input
text) together with the star rating, review date, product variant, and feedback for various Amazon Alexa products such as
the Alexa Echo, Echo Dot, and Alexa Fire Stick. Perform the following operations on this dataset.
(I) Remove all punctuation from the review text.
(II) Tokenize the review text into words.
(III) Remove stopwords from the tokenized text.
(IV) Perform stemming and lemmatization on the review text.
(V) Vectorize the review text using the Bag of Words technique.
(VI) Create a representation of the review text by calculating Term Frequency-Inverse Document Frequency (TF-IDF).

In [12]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import re
import nltk
import string

In [13]:
df = pd.read_csv("data/Alexa-Dataset.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3150 entries, 0 to 3149
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   rating            3150 non-null   int64 
 1   date              3150 non-null   object
 2   variation         3150 non-null   object
 3   verified_reviews  3149 non-null   object
 4   feedback          3150 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 123.2+ KB


In [14]:
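def rm_punc(text):
    # Strip ASCII punctuation. Non-string values (verified_reviews has one
    # missing entry, which pandas loads as NaN) become an empty string.
    if isinstance(text, str):
        return "".join(c for c in text if c not in string.punctuation)
    return ""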

df['clean_text'] = df['verified_reviews'].apply(rm_punc)
df.head()

Unnamed: 0,rating,date,variation,verified_reviews,feedback,clean_text
0,5,31-Jul-18,Charcoal Fabric,Love my Echo!,1,Love my Echo
1,5,31-Jul-18,Charcoal Fabric,Loved it!,1,Loved it
2,4,31-Jul-18,Walnut Finish,"Sometimes while playing a game, you can answer...",1,Sometimes while playing a game you can answer ...
3,5,31-Jul-18,Charcoal Fabric,I have had a lot of fun with this thing. My 4 ...,1,I have had a lot of fun with this thing My 4 y...
4,5,31-Jul-18,Charcoal Fabric,Music,1,Music
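
The character-by-character filter above can also be written with a translation table. The sketch below is one possible alternative, not the notebook's method; `rm_punc_translate` and `_PUNCT_TABLE` are hypothetical names introduced here for illustration.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def rm_punc_translate(text):
    """Hypothetical alternative to rm_punc: same result via str.translate."""
    if isinstance(text, str):
        return text.translate(_PUNCT_TABLE)
    return ""  # NaN / None / other non-string values

print(rm_punc_translate("Love my Echo!"))  # Love my Echo
```

Like `rm_punc`, this only covers the characters in `string.punctuation`, so non-ASCII punctuation and emoji pass through unchanged.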


In [15]:
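def rm_emoticons(text):
    # Drop anything that is neither a word character nor whitespace
    # (emoji, emoticon symbols, non-ASCII punctuation that rm_punc missed).
    return re.sub(r"[^\w\s]", "", text)
df['clean_text'] = df['clean_text'].apply(rm_emoticons)

def tokenize(text):
    # Split on runs of non-word characters and discard the empty strings
    # that re.split returns for empty input or leading/trailing whitespace.
    return [tok for tok in re.split(r'\W+', text) if tok]
df['tokenized_text'] = df['clean_text'].apply(tokenize)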
df.head()

Unnamed: 0,rating,date,variation,verified_reviews,feedback,clean_text,tokenized_text
0,5,31-Jul-18,Charcoal Fabric,Love my Echo!,1,Love my Echo,"[Love, my, Echo]"
1,5,31-Jul-18,Charcoal Fabric,Loved it!,1,Loved it,"[Loved, it]"
2,4,31-Jul-18,Walnut Finish,"Sometimes while playing a game, you can answer...",1,Sometimes while playing a game you can answer ...,"[Sometimes, while, playing, a, game, you, can,..."
3,5,31-Jul-18,Charcoal Fabric,I have had a lot of fun with this thing. My 4 ...,1,I have had a lot of fun with this thing My 4 y...,"[I, have, had, a, lot, of, fun, with, this, th..."
4,5,31-Jul-18,Charcoal Fabric,Music,1,Music,[Music]
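
One edge case of regex tokenization is worth seeing in isolation: splitting on `\W+` produces empty strings when the text is empty or starts or ends with a separator, whereas matching `\w+` does not. The helper names below are hypothetical and exist only for this comparison.

```python
import re

def tokenize_split(text):
    # Split on runs of non-word characters (may yield empty strings).
    return re.split(r'\W+', text)

def tokenize_findall(text):
    # Collect runs of word characters instead (never yields empty strings).
    return re.findall(r'\w+', text)

print(tokenize_split(""))              # ['']
print(tokenize_split(" Loved it "))    # ['', 'Loved', 'it', '']
print(tokenize_findall(" Loved it "))  # ['Loved', 'it']
```

Empty tokens are harmless for the vectorizers used later, but they show up as stray spaces once tokens are joined back into a string.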


In [16]:
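from nltk.corpus import stopwords

# Requires the NLTK stopword corpus; run nltk.download('stopwords') once if it is missing.
STOPWORDS = set(stopwords.words('english'))

def rm_stopwords(tokens):
    # Takes the token list and returns a space-joined string.
    # The comparison is case-sensitive and the NLTK list is lowercase,
    # so capitalized stopwords such as "I" and "My" are kept (see row 3 below).
    return " ".join([w for w in tokens if w not in STOPWORDS])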

df['clean_tokenized_text'] = df['tokenized_text'].apply(rm_stopwords)
df.head()

Unnamed: 0,rating,date,variation,verified_reviews,feedback,clean_text,tokenized_text,clean_tokenized_text
0,5,31-Jul-18,Charcoal Fabric,Love my Echo!,1,Love my Echo,"[Love, my, Echo]",Love Echo
1,5,31-Jul-18,Charcoal Fabric,Loved it!,1,Loved it,"[Loved, it]",Loved
2,4,31-Jul-18,Walnut Finish,"Sometimes while playing a game, you can answer...",1,Sometimes while playing a game you can answer ...,"[Sometimes, while, playing, a, game, you, can,...",Sometimes playing game answer question correct...
3,5,31-Jul-18,Charcoal Fabric,I have had a lot of fun with this thing. My 4 ...,1,I have had a lot of fun with this thing My 4 y...,"[I, have, had, a, lot, of, fun, with, this, th...",I lot fun thing My 4 yr old learns dinosaurs c...
4,5,31-Jul-18,Charcoal Fabric,Music,1,Music,[Music],Music
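
Row 3 above still contains "I" and "My" because the membership test is case-sensitive. A minimal sketch of a case-insensitive variant follows. It is an assumption about what may be wanted, not the notebook's code: `DEMO_STOPWORDS` is a tiny stand-in for the NLTK list and `rm_stopwords_ci` is a hypothetical name.

```python
# Tiny stand-in for the NLTK English stopword list (illustration only).
DEMO_STOPWORDS = {"i", "my", "a", "of", "with", "this", "have", "had"}

def rm_stopwords_ci(tokens, stopwords=DEMO_STOPWORDS):
    """Drop tokens whose lowercase form is a stopword; keep original casing."""
    return " ".join(w for w in tokens if w.lower() not in stopwords)

tokens = ["I", "have", "had", "a", "lot", "of", "fun", "with",
          "this", "thing", "My", "4", "yr", "old"]
print(rm_stopwords_ci(tokens))  # lot fun thing 4 yr old
```

Applying this to the real data would shrink the vocabulary, so the matrix shapes reported later would likely differ.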


In [17]:
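from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
def stem_text(text):
    # Stem each whitespace-separated word. PorterStemmer.stem lowercases its
    # input by default, which is why the stemmed column is all lowercase.
    return " ".join([stemmer.stem(w) for w in text.split()])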

df['stemed_text'] = df['clean_tokenized_text'].apply(stem_text)
df.head()

Unnamed: 0,rating,date,variation,verified_reviews,feedback,clean_text,tokenized_text,clean_tokenized_text,stemed_text
0,5,31-Jul-18,Charcoal Fabric,Love my Echo!,1,Love my Echo,"[Love, my, Echo]",Love Echo,love echo
1,5,31-Jul-18,Charcoal Fabric,Loved it!,1,Loved it,"[Loved, it]",Loved,love
2,4,31-Jul-18,Walnut Finish,"Sometimes while playing a game, you can answer...",1,Sometimes while playing a game you can answer ...,"[Sometimes, while, playing, a, game, you, can,...",Sometimes playing game answer question correct...,sometim play game answer question correctli al...
3,5,31-Jul-18,Charcoal Fabric,I have had a lot of fun with this thing. My 4 ...,1,I have had a lot of fun with this thing My 4 y...,"[I, have, had, a, lot, of, fun, with, this, th...",I lot fun thing My 4 yr old learns dinosaurs c...,i lot fun thing my 4 yr old learn dinosaur con...
4,5,31-Jul-18,Charcoal Fabric,Music,1,Music,[Music],Music,music


In [19]:
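from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
def lemmatize_text(text):
    # lemmatize() treats every word as a noun unless a pos argument is given,
    # so plural nouns are reduced ("dinosaurs" -> "dinosaur") while verb forms
    # such as "Loved" and "learns" are left unchanged.
    return " ".join([lemmatizer.lemmatize(w) for w in text.split()])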

df['lemmatized_text'] = df['clean_tokenized_text'].apply(lemmatize_text)
df.head()

[nltk_data] Downloading package wordnet to /home/tammy/nltk_data...


Unnamed: 0,rating,date,variation,verified_reviews,feedback,clean_text,tokenized_text,clean_tokenized_text,stemed_text,lemmatized_text
0,5,31-Jul-18,Charcoal Fabric,Love my Echo!,1,Love my Echo,"[Love, my, Echo]",Love Echo,love echo,Love Echo
1,5,31-Jul-18,Charcoal Fabric,Loved it!,1,Loved it,"[Loved, it]",Loved,love,Loved
2,4,31-Jul-18,Walnut Finish,"Sometimes while playing a game, you can answer...",1,Sometimes while playing a game you can answer ...,"[Sometimes, while, playing, a, game, you, can,...",Sometimes playing game answer question correct...,sometim play game answer question correctli al...,Sometimes playing game answer question correct...
3,5,31-Jul-18,Charcoal Fabric,I have had a lot of fun with this thing. My 4 ...,1,I have had a lot of fun with this thing My 4 y...,"[I, have, had, a, lot, of, fun, with, this, th...",I lot fun thing My 4 yr old learns dinosaurs c...,i lot fun thing my 4 yr old learn dinosaur con...,I lot fun thing My 4 yr old learns dinosaur co...
4,5,31-Jul-18,Charcoal Fabric,Music,1,Music,[Music],Music,music,Music


In [21]:
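from sklearn.feature_extraction.text import CountVectorizer

# CountVectorizer lowercases the text and, with its default token pattern,
# keeps only tokens of two or more word characters.
bow_vectorizer = CountVectorizer()
# Sparse document-term matrix: one row per review, one column per vocabulary term.
bow_matrix = bow_vectorizer.fit_transform(df['lemmatized_text'])

print("shape of bow matrix: ", bow_matrix.shape)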

shape of bow matrix:  (3150, 4156)
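
To make the Bag of Words idea concrete, here is a minimal hand-rolled sketch on three toy documents. It is not how `CountVectorizer` is implemented: it only lowercases and splits on whitespace, and `bag_of_words` is a hypothetical helper name.

```python
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, count matrix) for a list of strings."""
    tokenized = [d.lower().split() for d in docs]
    # Vocabulary: every distinct token, sorted so columns are in a fixed order.
    vocab = sorted({tok for doc in tokenized for tok in doc})
    matrix = []
    for doc in tokenized:
        counts = Counter(doc)
        matrix.append([counts[term] for term in vocab])
    return vocab, matrix

vocab, matrix = bag_of_words(["Love Echo", "Loved", "Music love music"])
print(vocab)  # ['echo', 'love', 'loved', 'music']
for row in matrix:
    print(row)
# [1, 1, 0, 0]
# [0, 0, 1, 0]
# [0, 1, 0, 2]
```

Each row has one column per vocabulary term, which is why the real matrix above has as many columns as there are distinct terms in the corpus.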


In [22]:
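from sklearn.feature_extraction.text import TfidfVectorizer

# TfidfVectorizer tokenizes the same way as CountVectorizer, so the vocabulary
# size matches the Bag of Words matrix; by default each row is L2-normalized.
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(df['lemmatized_text'])

print("Shape: ", tfidf_matrix.shape)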

Shape:  (3150, 4156)
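
The weighting behind the matrix above can be sketched by hand. The code below assumes scikit-learn's default settings (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, then L2 normalization of each row) and uses plain whitespace tokenization; `tfidf` is a hypothetical helper, not part of any library.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, TF-IDF rows) using smoothed IDF and L2 row norms."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    n = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for doc in tokenized for tok in set(doc))
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        weights = [counts[t] * idf[t] for t in vocab]
        # Scale to unit length; an empty document keeps an all-zero row.
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows

vocab, rows = tfidf(["love echo", "love music"])
print(vocab)  # ['echo', 'love', 'music']
print([round(w, 3) for w in rows[0]])
```

In the first toy document, "love" occurs in both documents and therefore receives a lower weight than "echo", which occurs in only one.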
