# 1. LOAD TEXT DATA AND PERFORM BASIC DATA EXPLORATION

In [1]:
# The dataset consists of 3,150 Amazon customer reviews of Alexa products such as the Echo and Echo Dot, with the star rating, review date, product variation and feedback for each review.
# The objective is to discover insights into consumer reviews.
# Note that sentiment analysis could be performed on this data, but AI/ML is beyond the scope of this course.
# Dataset: www.kaggle.com/sid321axn/amazon-alexa-reviews

# Import Pandas for data manipulation using dataframes
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


In [6]:
# import the data using read_csv
alexa_df = pd.read_csv('amazon_reviews.csv')
alexa_df

Unnamed: 0,rating,date,variation,verified_reviews,feedback
0,5,31-Jul-18,Charcoal Fabric,Love my Echo!,1
1,5,31-Jul-18,Charcoal Fabric,Loved it!,1
2,4,31-Jul-18,Walnut Finish,"Sometimes while playing a game, you can answer...",1
3,5,31-Jul-18,Charcoal Fabric,I have had a lot of fun with this thing. My 4 ...,1
4,5,31-Jul-18,Charcoal Fabric,Music,1
...,...,...,...,...,...
3145,5,30-Jul-18,Black Dot,"Perfect for kids, adults and everyone in betwe...",1
3146,5,30-Jul-18,Black Dot,"Listening to music, searching locations, check...",1
3147,5,30-Jul-18,Black Dot,"I do love these things, i have them running my...",1
3148,5,30-Jul-18,White Dot,Only complaint I have is that the sound qualit...,1


In [19]:
# Show the first five rows of the data (the default for head())
alexa_df.head()

Unnamed: 0,rating,date,variation,verified_reviews,feedback
0,5,31-Jul-18,Charcoal Fabric,Love my Echo!,1
1,5,31-Jul-18,Charcoal Fabric,Loved it!,1
2,4,31-Jul-18,Walnut Finish,"Sometimes while playing a game, you can answer...",1
3,5,31-Jul-18,Charcoal Fabric,I have had a lot of fun with this thing. My 4 ...,1
4,5,31-Jul-18,Charcoal Fabric,Music,1


In [20]:
# Show the last seven rows of the data
alexa_df.tail(7)

Unnamed: 0,rating,date,variation,verified_reviews,feedback
3143,5,30-Jul-18,Black Dot,Awesome device wish I bought one ages ago.,1
3144,5,30-Jul-18,Black Dot,love it,1
3145,5,30-Jul-18,Black Dot,"Perfect for kids, adults and everyone in betwe...",1
3146,5,30-Jul-18,Black Dot,"Listening to music, searching locations, check...",1
3147,5,30-Jul-18,Black Dot,"I do love these things, i have them running my...",1
3148,5,30-Jul-18,White Dot,Only complaint I have is that the sound qualit...,1
3149,4,29-Jul-18,Black Dot,Good,1


**MINI CHALLENGE #1:**
- **What is the average rating?**
- **How many unique classes do we have in the variation column?**
- **What is the memory usage of this DataFrame?**

In [21]:
alexa_df.describe()

Unnamed: 0,rating,feedback
count,3150.0,3150.0
mean,4.463175,0.918413
std,1.068506,0.273778
min,1.0,0.0
25%,4.0,1.0
50%,5.0,1.0
75%,5.0,1.0
max,5.0,1.0


In [22]:
alexa_df['variation'].nunique()

16

In [8]:
alexa_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3150 entries, 0 to 3149
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   rating            3150 non-null   int64 
 1   date              3150 non-null   object
 2   variation         3150 non-null   object
 3   verified_reviews  3150 non-null   object
 4   feedback          3150 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 123.2+ KB
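
The three challenge questions can also be answered with direct calls rather than by reading them off `describe()` and `info()`. The sketch below uses a tiny made-up DataFrame (`demo_df` and its rows are illustrative, not the real dataset). The `+` in the `info()` output above indicates that the reported figure is a lower bound, because the contents of `object` columns are not measured by default; passing `deep=True` asks pandas to measure them.

```python
import pandas as pd

# Tiny illustrative frame; the column names mirror the dataset, the rows are made up
demo_df = pd.DataFrame({
    'rating': [5, 5, 4],
    'variation': ['Charcoal Fabric', 'Charcoal Fabric', 'Walnut Finish'],
    'verified_reviews': ['Love my Echo!', 'Loved it!', 'Music'],
})

# Average rating
mean_rating = demo_df['rating'].mean()

# Number of unique classes in the variation column
n_variations = demo_df['variation'].nunique()

# Memory usage in bytes: shallow (default) versus deep (also measures string contents)
shallow_bytes = demo_df.memory_usage().sum()
deep_bytes = demo_df.memory_usage(deep=True).sum()

print(mean_rating, n_variations, shallow_bytes, deep_bytes)
```

On the real data the same calls would be made on `alexa_df`; `alexa_df.info(memory_usage='deep')` prints the deep figure in the usual summary.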


# 2. UPPER AND LOWER

In [23]:
# Check out the verified_reviews column (Text data)
# Now we can convert all the words into upper case or lower case
alexa_df['verified_reviews']

0                                           Love my Echo!
1                                               Loved it!
2       Sometimes while playing a game, you can answer...
3       I have had a lot of fun with this thing. My 4 ...
4                                                   Music
                              ...                        
3145    Perfect for kids, adults and everyone in betwe...
3146    Listening to music, searching locations, check...
3147    I do love these things, i have them running my...
3148    Only complaint I have is that the sound qualit...
3149                                                 Good
Name: verified_reviews, Length: 3150, dtype: object

In [24]:
# You can convert all words in a given column into upper case by applying str.upper()
alexa_df['verified_reviews'].str.upper()

0                                           LOVE MY ECHO!
1                                               LOVED IT!
2       SOMETIMES WHILE PLAYING A GAME, YOU CAN ANSWER...
3       I HAVE HAD A LOT OF FUN WITH THIS THING. MY 4 ...
4                                                   MUSIC
                              ...                        
3145    PERFECT FOR KIDS, ADULTS AND EVERYONE IN BETWE...
3146    LISTENING TO MUSIC, SEARCHING LOCATIONS, CHECK...
3147    I DO LOVE THESE THINGS, I HAVE THEM RUNNING MY...
3148    ONLY COMPLAINT I HAVE IS THAT THE SOUND QUALIT...
3149                                                 GOOD
Name: verified_reviews, Length: 3150, dtype: object

In [12]:
# You can convert all words in a given column into lower case by applying str.lower()
alexa_df['verified_reviews'].str.lower()

0                                           love my echo!
1                                               loved it!
2       sometimes while playing a game, you can answer...
3       i have had a lot of fun with this thing. my 4 ...
4                                                   music
                              ...                        
3145    perfect for kids, adults and everyone in betwe...
3146    listening to music, searching locations, check...
3147    i do love these things, i have them running my...
3148    only complaint i have is that the sound qualit...
3149                                                 good
Name: verified_reviews, Length: 3150, dtype: object

In [25]:
# You can also convert the column names into upper case
alexa_df.columns = alexa_df.columns.str.upper()
alexa_df

Unnamed: 0,RATING,DATE,VARIATION,VERIFIED_REVIEWS,FEEDBACK
0,5,31-Jul-18,Charcoal Fabric,Love my Echo!,1
1,5,31-Jul-18,Charcoal Fabric,Loved it!,1
2,4,31-Jul-18,Walnut Finish,"Sometimes while playing a game, you can answer...",1
3,5,31-Jul-18,Charcoal Fabric,I have had a lot of fun with this thing. My 4 ...,1
4,5,31-Jul-18,Charcoal Fabric,Music,1
...,...,...,...,...,...
3145,5,30-Jul-18,Black Dot,"Perfect for kids, adults and everyone in betwe...",1
3146,5,30-Jul-18,Black Dot,"Listening to music, searching locations, check...",1
3147,5,30-Jul-18,Black Dot,"I do love these things, i have them running my...",1
3148,5,30-Jul-18,White Dot,Only complaint I have is that the sound qualit...,1


In [26]:
# Let's convert them back to lowercase!
alexa_df.columns = alexa_df.columns.str.lower()
alexa_df


Unnamed: 0,rating,date,variation,verified_reviews,feedback
0,5,31-Jul-18,Charcoal Fabric,Love my Echo!,1
1,5,31-Jul-18,Charcoal Fabric,Loved it!,1
2,4,31-Jul-18,Walnut Finish,"Sometimes while playing a game, you can answer...",1
3,5,31-Jul-18,Charcoal Fabric,I have had a lot of fun with this thing. My 4 ...,1
4,5,31-Jul-18,Charcoal Fabric,Music,1
...,...,...,...,...,...
3145,5,30-Jul-18,Black Dot,"Perfect for kids, adults and everyone in betwe...",1
3146,5,30-Jul-18,Black Dot,"Listening to music, searching locations, check...",1
3147,5,30-Jul-18,Black Dot,"I do love these things, i have them running my...",1
3148,5,30-Jul-18,White Dot,Only complaint I have is that the sound qualit...,1


**MINI CHALLENGE #2:**
- **Apply a method to return strings where the first character in every word is upper case (external research is required)**

In [29]:
alexa_df['verified_reviews'].str.title()


0                                           Love My Echo!
1                                               Loved It!
2       Sometimes While Playing A Game, You Can Answer...
3       I Have Had A Lot Of Fun With This Thing. My 4 ...
4                                                   Music
                              ...                        
3145    Perfect For Kids, Adults And Everyone In Betwe...
3146    Listening To Music, Searching Locations, Check...
3147    I Do Love These Things, I Have Them Running My...
3148    Only Complaint I Have Is That The Sound Qualit...
3149                                                 Good
Name: verified_reviews, Length: 3150, dtype: object

In [30]:
# Alternatively, apply a custom function to every review
def titleCase(word):
  return word.title()

alexa_df['verified_reviews'].apply(titleCase)

0                                           Love My Echo!
1                                               Loved It!
2       Sometimes While Playing A Game, You Can Answer...
3       I Have Had A Lot Of Fun With This Thing. My 4 ...
4                                                   Music
                              ...                        
3145    Perfect For Kids, Adults And Everyone In Betwe...
3146    Listening To Music, Searching Locations, Check...
3147    I Do Love These Things, I Have Them Running My...
3148    Only Complaint I Have Is That The Sound Qualit...
3149                                                 Good
Name: verified_reviews, Length: 3150, dtype: object
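
One caveat with title casing: Python's `str.title()`, which the pandas `.str.title()` accessor mirrors, treats every non-letter as a word boundary, so apostrophes produce odd capitals. The sketch below shows the effect on a made-up review string and compares it with `string.capwords`, which splits on whitespace only. Which behaviour is preferable depends on the text.

```python
import string

# Made-up review text containing an apostrophe
review = "it's great, my kids love it"

# str.title() capitalises the first letter after any non-letter, including apostrophes
titled = review.title()

# string.capwords() splits on whitespace and capitalises each piece
capworded = string.capwords(review)

print(titled)     # It'S Great, My Kids Love It
print(capworded)  # It's Great, My Kids Love It
```

With pandas, the second variant can be applied per row with `alexa_df['verified_reviews'].apply(string.capwords)`.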

# 3. PANDAS OPERATIONS PART #1

In [31]:
# obtain the length of a given string (how many characters per string)
alexa_df['reviews_length'] = alexa_df['verified_reviews'].str.len()
alexa_df

Unnamed: 0,rating,date,variation,verified_reviews,feedback,reviews_length
0,5,31-Jul-18,Charcoal Fabric,Love my Echo!,1,13
1,5,31-Jul-18,Charcoal Fabric,Loved it!,1,9
2,4,31-Jul-18,Walnut Finish,"Sometimes while playing a game, you can answer...",1,195
3,5,31-Jul-18,Charcoal Fabric,I have had a lot of fun with this thing. My 4 ...,1,172
4,5,31-Jul-18,Charcoal Fabric,Music,1,5
...,...,...,...,...,...,...
3145,5,30-Jul-18,Black Dot,"Perfect for kids, adults and everyone in betwe...",1,50
3146,5,30-Jul-18,Black Dot,"Listening to music, searching locations, check...",1,135
3147,5,30-Jul-18,Black Dot,"I do love these things, i have them running my...",1,441
3148,5,30-Jul-18,White Dot,Only complaint I have is that the sound qualit...,1,380


In [33]:
# Let's obtain the shortest review
min_char = alexa_df['reviews_length'].min()
min_char

1

In [34]:
# Let's obtain the longest review
max_char = alexa_df['reviews_length'].max()
max_char

2851

In [35]:
# Let's select the rows with the shortest reviews
alexa_df[alexa_df['reviews_length'] == min_char]

Unnamed: 0,rating,date,variation,verified_reviews,feedback,reviews_length
60,5,30-Jul-18,Heather Gray Fabric,😍,1,1
85,5,30-Jul-18,Heather Gray Fabric,,1,1
183,3,29-Jul-18,Heather Gray Fabric,,1,1
219,5,29-Jul-18,Sandstone Fabric,,1,1
374,1,26-Jul-18,Black,,0,1
...,...,...,...,...,...,...
3114,3,30-Jul-18,Black Dot,,1,1
3120,5,30-Jul-18,Black Dot,,1,1
3123,4,30-Jul-18,Black Dot,,1,1
3126,5,30-Jul-18,Black Dot,,1,1


**MINI CHALLENGE #3:**
- **Locate the verified review that has the maximum number of characters**

In [41]:
alexa_df[alexa_df['reviews_length'] == max_char]

Unnamed: 0,rating,date,variation,verified_reviews,feedback,reviews_length
2016,5,20-Jul-18,Black Plus,Incredible piece of technology.I have this rig...,1,2851


In [42]:
alexa_df[alexa_df['reviews_length'] == max_char]['verified_reviews'].iloc[0]

"Incredible piece of technology.I have this right center of my living room on an island kitchen counter. The mic and speaker goes in every direction and the quality of the sound is quite good. I connected the Echo via Bluetooth to my Sony soundbar on my TV but find the Echo placement and 360 sound more appealing. It's no audiophile equipment but there is good range and decent bass. The sound is more than adequate for any indoor entertaining and loud enough to bother neighbors in my building. The knob on the top works great for adjusting volume. This is my first Echo device and I would imagine having to press volume buttons (on the Echo 2) a large inconvenience and not as precise. For that alone I would recommend this over the regular Echo (2nd generation).The piece looks quality and is quite sturdy with some weight on it. The rubber material on the bottom has a good grip on the granite counter-- my cat can even rub her scent on it without tipping it over.This order came with a free Phi
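
As an alternative to comparing against `max_char`, `idxmax()` returns the index label of the largest value directly. This is a minimal sketch on a made-up three-row DataFrame (`reviews` is an illustrative name); note that `idxmax()` returns only the first match when several rows tie for the maximum.

```python
import pandas as pd

# Made-up sample; the real notebook would use alexa_df instead
reviews = pd.DataFrame({
    'verified_reviews': [
        'Love my Echo!',
        'Music',
        'I have had a lot of fun with this thing.',
    ]
})
reviews['reviews_length'] = reviews['verified_reviews'].str.len()

# Index label of the longest review, then the full text at that label
longest_idx = reviews['reviews_length'].idxmax()
longest_text = reviews.loc[longest_idx, 'verified_reviews']

print(longest_idx, longest_text)
```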

# 4. PANDAS OPERATIONS PART #2

In [43]:
# You can replace elements in a dataframe as follows:
alexa_df['variation'] = alexa_df['variation'].str.replace('Walnut Finish', 'Walnut')
alexa_df

Unnamed: 0,rating,date,variation,verified_reviews,feedback,reviews_length
0,5,31-Jul-18,Charcoal Fabric,Love my Echo!,1,13
1,5,31-Jul-18,Charcoal Fabric,Loved it!,1,9
2,4,31-Jul-18,Walnut,"Sometimes while playing a game, you can answer...",1,195
3,5,31-Jul-18,Charcoal Fabric,I have had a lot of fun with this thing. My 4 ...,1,172
4,5,31-Jul-18,Charcoal Fabric,Music,1,5
...,...,...,...,...,...,...
3145,5,30-Jul-18,Black Dot,"Perfect for kids, adults and everyone in betwe...",1,50
3146,5,30-Jul-18,Black Dot,"Listening to music, searching locations, check...",1,135
3147,5,30-Jul-18,Black Dot,"I do love these things, i have them running my...",1,441
3148,5,30-Jul-18,White Dot,Only complaint I have is that the sound qualit...,1,380


In [44]:
# Filter the DataFrame by selecting only the rows whose review ends with the word "love"
# Note that we convert the text to lower case first so that the match is case-insensitive
mask = alexa_df['verified_reviews'].str.lower().str.endswith('love')
alexa_df[mask]

Unnamed: 0,rating,date,variation,verified_reviews,feedback,reviews_length
438,5,7-Jul-18,Black,Love,1,4
2018,5,19-Jul-18,Black Plus,"Love, Love, Love",1,16


In [46]:
# Filter the DataFrame by selecting only the rows whose review starts with the word "love"
# Note that we convert the text to lower case first so that the match is case-insensitive
mask = alexa_df['verified_reviews'].str.lower().str.startswith('love')
alexa_df[mask]

Unnamed: 0,rating,date,variation,verified_reviews,feedback,reviews_length
0,5,31-Jul-18,Charcoal Fabric,Love my Echo!,1,13
1,5,31-Jul-18,Charcoal Fabric,Loved it!,1,9
9,5,30-Jul-18,Heather Gray Fabric,Love it! I’ve listened to songs I haven’t hear...,1,114
13,5,30-Jul-18,Charcoal Fabric,"Love, Love, Love!!",1,18
20,5,30-Jul-18,Charcoal Fabric,Love the Echo and how good the music sounds pl...,1,246
...,...,...,...,...,...,...
3089,5,30-Jul-18,Black Dot,Love Alexa!! I own 2 and gave one for a gift ...,1,67
3110,5,30-Jul-18,White Dot,"Love it! I personally prefer Spotify music, so...",1,401
3111,5,30-Jul-18,Black Dot,Love it. It works great. Alexa still has som...,1,174
3124,5,30-Jul-18,Black Dot,Love my Alexa! Actually have 3 throughout the ...,1,128


In [47]:
# you can split the string into a list
alexa_df['verified_reviews'].str.split()

0                                       [Love, my, Echo!]
1                                            [Loved, it!]
2       [Sometimes, while, playing, a, game,, you, can...
3       [I, have, had, a, lot, of, fun, with, this, th...
4                                                 [Music]
                              ...                        
3145    [Perfect, for, kids,, adults, and, everyone, i...
3146    [Listening, to, music,, searching, locations,,...
3147    [I, do, love, these, things,, i, have, them, r...
3148    [Only, complaint, I, have, is, that, the, soun...
3149                                               [Good]
Name: verified_reviews, Length: 3150, dtype: object

In [48]:
# you can also select the index within the extracted list as follows
# Note that index 0 indicates the first element in a given list
alexa_df['verified_reviews'].str.split().str.get(0)

0            Love
1           Loved
2       Sometimes
3               I
4           Music
          ...    
3145      Perfect
3146    Listening
3147            I
3148         Only
3149         Good
Name: verified_reviews, Length: 3150, dtype: object

**MINI CHALLENGE #4:**
- **Filter the DataFrame by selecting rows that contain the word "love" in any location**


In [52]:
mask = alexa_df['verified_reviews'].str.lower().str.contains('love')
alexa_df[mask]

Unnamed: 0,rating,date,variation,verified_reviews,feedback,reviews_length
0,5,31-Jul-18,Charcoal Fabric,Love my Echo!,1,13
1,5,31-Jul-18,Charcoal Fabric,Loved it!,1,9
9,5,30-Jul-18,Heather Gray Fabric,Love it! I’ve listened to songs I haven’t hear...,1,114
11,5,30-Jul-18,Charcoal Fabric,I love it! Learning knew things with it eveyda...,1,169
13,5,30-Jul-18,Charcoal Fabric,"Love, Love, Love!!",1,18
...,...,...,...,...,...,...
3124,5,30-Jul-18,Black Dot,Love my Alexa! Actually have 3 throughout the ...,1,128
3135,5,30-Jul-18,White Dot,I loved it does exactly what it says,1,36
3142,4,30-Jul-18,White Dot,My three year old loves it. Good for doing ba...,1,117
3144,5,30-Jul-18,Black Dot,love it,1,7


In [50]:
# Alternatively, apply a custom function to every review

def containsLove(word):
  return 'love' in word.lower()

mask = alexa_df['verified_reviews'].apply(containsLove)
alexa_df[mask]

Unnamed: 0,rating,date,variation,verified_reviews,feedback,reviews_length
0,5,31-Jul-18,Charcoal Fabric,Love my Echo!,1,13
1,5,31-Jul-18,Charcoal Fabric,Loved it!,1,9
9,5,30-Jul-18,Heather Gray Fabric,Love it! I’ve listened to songs I haven’t hear...,1,114
11,5,30-Jul-18,Charcoal Fabric,I love it! Learning knew things with it eveyda...,1,169
13,5,30-Jul-18,Charcoal Fabric,"Love, Love, Love!!",1,18
...,...,...,...,...,...,...
3124,5,30-Jul-18,Black Dot,Love my Alexa! Actually have 3 throughout the ...,1,128
3135,5,30-Jul-18,White Dot,I loved it does exactly what it says,1,36
3142,4,30-Jul-18,White Dot,My three year old loves it. Good for doing ba...,1,117
3144,5,30-Jul-18,Black Dot,love it,1,7
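
Both approaches above match "love" as a substring, so words such as "loved" or "gloves" also count. The sketch below, on a made-up Series, shows two variations that may be useful: `case=False` avoids the separate `.str.lower()` step, and a regular expression with word boundaries (`\b`) restricts the match to the standalone word. The sample strings are illustrative only.

```python
import pandas as pd

# Made-up sample reviews
reviews = pd.Series([
    'Love my Echo!',
    'Gloves included',
    'I loved it',
    'Good',
    'love it',
])

# Case-insensitive substring match; na=False treats missing values as non-matches
substring_mask = reviews.str.contains('love', case=False, na=False)

# Whole-word match: \b marks a word boundary, so "loved" and "Gloves" are excluded
whole_word_mask = reviews.str.contains(r'\blove\b', case=False, na=False)

print(reviews[substring_mask].tolist())
print(reviews[whole_word_mask].tolist())
```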


# 5. PERFORM TEXT DATA CLEANING BY REMOVING PUNCTUATION

In [12]:
# The string module is very useful when dealing with text data
# It contains constants and classes for working with text
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [7]:
Test = '$I love Pandas &Data Analytics!!'
Test_punc_removed = [char for char in Test if char not in string.punctuation]
Test_punc_removed_join = ''.join(Test_punc_removed)
Test_punc_removed_join

'I love Pandas Data Analytics'

In [10]:
# Let's define a function that removes punctuation from a string
def remove_punc(message):
  punc_removed = [char for char in message if char not in string.punctuation]
  punc_removed_join = ''.join(punc_removed)
  return punc_removed_join
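
The same result can be obtained without a character-by-character loop by building a translation table once and reusing it. This is an alternative sketch, not the notebook's own method; `remove_punc_translate` is a hypothetical name. Like the version above, it only removes the ASCII characters listed in `string.punctuation`, so typographic characters such as the curly apostrophe in "I’ve" are left in place.

```python
import string

# Translation table that maps every ASCII punctuation character to None (deletion)
PUNC_TABLE = str.maketrans('', '', string.punctuation)

def remove_punc_translate(message):
    """Return message with all characters from string.punctuation removed."""
    return message.translate(PUNC_TABLE)

cleaned = remove_punc_translate('$I love Pandas &Data Analytics!!')
curly = remove_punc_translate('I’ve listened!')

print(cleaned)  # I love Pandas Data Analytics
print(curly)    # I’ve listened
```

It can be applied to the column in the same way, for example `alexa_df['verified_reviews'].apply(remove_punc_translate)`.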

In [13]:
# Let's remove punctuation from the reviews in our dataset
alexa_df['verified_reviews_no_punc'] = alexa_df['verified_reviews'].apply(remove_punc)
alexa_df

Unnamed: 0,rating,date,variation,verified_reviews,feedback,verified_reviews_no_punc
0,5,31-Jul-18,Charcoal Fabric,Love my Echo!,1,Love my Echo
1,5,31-Jul-18,Charcoal Fabric,Loved it!,1,Loved it
2,4,31-Jul-18,Walnut Finish,"Sometimes while playing a game, you can answer...",1,Sometimes while playing a game you can answer ...
3,5,31-Jul-18,Charcoal Fabric,I have had a lot of fun with this thing. My 4 ...,1,I have had a lot of fun with this thing My 4 y...
4,5,31-Jul-18,Charcoal Fabric,Music,1,Music
...,...,...,...,...,...,...
3145,5,30-Jul-18,Black Dot,"Perfect for kids, adults and everyone in betwe...",1,Perfect for kids adults and everyone in between
3146,5,30-Jul-18,Black Dot,"Listening to music, searching locations, check...",1,Listening to music searching locations checkin...
3147,5,30-Jul-18,Black Dot,"I do love these things, i have them running my...",1,I do love these things i have them running my ...
3148,5,30-Jul-18,White Dot,Only complaint I have is that the sound qualit...,1,Only complaint I have is that the sound qualit...


**MINI CHALLENGE #5:**
- **Explore at least 3 rows from the DataFrame and check if the function worked as expected**

In [15]:
test1 = remove_punc(alexa_df.iloc[3]['verified_reviews'])
test1

'I have had a lot of fun with this thing My 4 yr old learns about dinosaurs i control the lights and play games like categories Has nice sound when playing music as well'

In [18]:
alexa_df.iloc[3]['verified_reviews']

'I have had a lot of fun with this thing. My 4 yr old learns about dinosaurs, i control the lights and play games like categories. Has nice sound when playing music as well.'

In [19]:
alexa_df.iloc[3]['verified_reviews_no_punc']

'I have had a lot of fun with this thing My 4 yr old learns about dinosaurs i control the lights and play games like categories Has nice sound when playing music as well'

In [16]:
test2 = remove_punc(alexa_df.iloc[2]['verified_reviews'])
test2

'Sometimes while playing a game you can answer a question correctly but Alexa says you got it wrong and answers the same as you  I like being able to turn lights on and off while away from home'

In [20]:
alexa_df.iloc[2]['verified_reviews']

'Sometimes while playing a game, you can answer a question correctly but Alexa says you got it wrong and answers the same as you.  I like being able to turn lights on and off while away from home.'

In [21]:
alexa_df.iloc[2]['verified_reviews_no_punc']

'Sometimes while playing a game you can answer a question correctly but Alexa says you got it wrong and answers the same as you  I like being able to turn lights on and off while away from home'

In [17]:
test3 = remove_punc(alexa_df.iloc[1]['verified_reviews'])
test3

'Loved it'

In [22]:
alexa_df.iloc[1]['verified_reviews']

'Loved it!'

In [23]:
alexa_df.iloc[1]['verified_reviews_no_punc']

'Loved it'

# 6. PERFORM TEXT DATA CLEANING BY REMOVING STOPWORDS

In [2]:
import nltk
from nltk.corpus import stopwords
import gensim
from gensim.utils import simple_preprocess


# import re
# from nltk.stem import PorterStemmer, WordNetLemmatizer
# from nltk.tokenize import word_tokenize, sent_tokenize
# from gensim.parsing.preprocessing import STOPWORDS

In [3]:
# download stopwords
nltk.download('stopwords')
stopwords.words('english')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/danielmevs/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [4]:
# Add more stopwords by using extend
stop_words = stopwords.words('english')
stop_words.extend(['amazon', 'Amazon', 'alexa', 'Alexa', 'device', 'Dot', 'dot'])

In [7]:
# simple_preprocess converts a string into a list of lowercase tokens
# Let's try it on a sample review
gensim.utils.simple_preprocess(alexa_df['verified_reviews'][0])

['love', 'my', 'echo']

In [8]:
# Remove stopwords and short words (fewer than 2 characters)
def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in stop_words and len(token) >= 2:
            result.append(token)
    return result

In [14]:
# apply pre-processing to the text column
alexa_df['verified_reviews_no_punc_no_stopwords'] = alexa_df['verified_reviews_no_punc'].apply(preprocess)

In [15]:
alexa_df['verified_reviews'][38]

'This thing is way cool!  You should get one.  If you want to be cool, that is.'

In [18]:
' '.join(alexa_df['verified_reviews_no_punc_no_stopwords'][38])

'thing way cool get one want cool'

In [19]:
alexa_df['verified_reviews_no_punc_no_stopwords_joined'] = alexa_df['verified_reviews_no_punc_no_stopwords'].apply(lambda x: " ".join(x))

In [20]:
alexa_df

Unnamed: 0,rating,date,variation,verified_reviews,feedback,verified_reviews_no_punc,verified_reviews_no_punc_no_stopwords,verified_reviews_no_punc_no_stopwords_joined
0,5,31-Jul-18,Charcoal Fabric,Love my Echo!,1,Love my Echo,"[love, echo]",love echo
1,5,31-Jul-18,Charcoal Fabric,Loved it!,1,Loved it,[loved],loved
2,4,31-Jul-18,Walnut Finish,"Sometimes while playing a game, you can answer...",1,Sometimes while playing a game you can answer ...,"[sometimes, playing, game, answer, question, c...",sometimes playing game answer question correct...
3,5,31-Jul-18,Charcoal Fabric,I have had a lot of fun with this thing. My 4 ...,1,I have had a lot of fun with this thing My 4 y...,"[lot, fun, thing, yr, old, learns, dinosaurs, ...",lot fun thing yr old learns dinosaurs control ...
4,5,31-Jul-18,Charcoal Fabric,Music,1,Music,[music],music
...,...,...,...,...,...,...,...,...
3145,5,30-Jul-18,Black Dot,"Perfect for kids, adults and everyone in betwe...",1,Perfect for kids adults and everyone in between,"[perfect, kids, adults, everyone]",perfect kids adults everyone
3146,5,30-Jul-18,Black Dot,"Listening to music, searching locations, check...",1,Listening to music searching locations checkin...,"[listening, music, searching, locations, check...",listening music searching locations checking t...
3147,5,30-Jul-18,Black Dot,"I do love these things, i have them running my...",1,I do love these things i have them running my ...,"[love, things, running, entire, home, tv, ligh...",love things running entire home tv lights ther...
3148,5,30-Jul-18,White Dot,Only complaint I have is that the sound qualit...,1,Only complaint I have is that the sound qualit...,"[complaint, sound, quality, isnt, great, mostl...",complaint sound quality isnt great mostly use ...


**MINI CHALLENGE #6:**

- **Modify the code to keep only words that are 3 or more characters long**
- **Add the word 'really' to the list of stopwords and rerun the code**


In [23]:
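# Pitfall: extend() iterates over its argument, so a bare string adds 'r', 'e', 'a', 'l', 'l', 'y' as six separate entries
stop_words.extend('really')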

In [24]:
stop_words

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [25]:
# Undo the mistake: pop the six single-character entries that were just appended
for _ in range(len('really')):
    stop_words.pop()

In [26]:
stop_words

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [27]:
# Correct: wrap the word in a list so that it is added as a single stopword
stop_words.extend(['really'])
stop_words

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each
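
The difference between the two `extend` calls above is easy to miss because the printed list is long. A minimal sketch on a short, made-up list (`words` is an illustrative name) shows what happens: `extend` iterates over whatever it is given, and iterating over a string yields its characters.

```python
# Short stand-in for the stopword list
words = ['the', 'and']

# A bare string is iterated character by character
words.extend('really')
after_mistake = list(words)

# Remove the six stray characters again
del words[-len('really'):]

# Either of these adds the word as one element
words.append('really')          # single item
# words.extend(['really'])      # equivalent, using a one-element list

print(after_mistake)  # ['the', 'and', 'r', 'e', 'a', 'l', 'l', 'y']
print(words)          # ['the', 'and', 'really']
```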

In [30]:
# Remove stopwords and short words (fewer than 3 characters), then join the remaining tokens into one string
def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in stop_words and len(token) >= 3:
            result.append(token)
    return ' '.join(result)

In [31]:
# apply pre-processing to the text column
alexa_df['verified_reviews_no_punc_no_stopwords'] = alexa_df['verified_reviews_no_punc'].apply(preprocess)

In [32]:
alexa_df

Unnamed: 0,rating,date,variation,verified_reviews,feedback,verified_reviews_no_punc,verified_reviews_no_punc_no_stopwords,verified_reviews_no_punc_no_stopwords_joined
0,5,31-Jul-18,Charcoal Fabric,Love my Echo!,1,Love my Echo,love echo,love echo
1,5,31-Jul-18,Charcoal Fabric,Loved it!,1,Loved it,loved,loved
2,4,31-Jul-18,Walnut Finish,"Sometimes while playing a game, you can answer...",1,Sometimes while playing a game you can answer ...,sometimes playing game answer question correct...,sometimes playing game answer question correct...
3,5,31-Jul-18,Charcoal Fabric,I have had a lot of fun with this thing. My 4 ...,1,I have had a lot of fun with this thing My 4 y...,lot fun thing old learns dinosaurs control lig...,lot fun thing yr old learns dinosaurs control ...
4,5,31-Jul-18,Charcoal Fabric,Music,1,Music,music,music
...,...,...,...,...,...,...,...,...
3145,5,30-Jul-18,Black Dot,"Perfect for kids, adults and everyone in betwe...",1,Perfect for kids adults and everyone in between,perfect kids adults everyone,perfect kids adults everyone
3146,5,30-Jul-18,Black Dot,"Listening to music, searching locations, check...",1,Listening to music searching locations checkin...,listening music searching locations checking t...,listening music searching locations checking t...
3147,5,30-Jul-18,Black Dot,"I do love these things, i have them running my...",1,I do love these things i have them running my ...,love things running entire home lights thermos...,love things running entire home tv lights ther...
3148,5,30-Jul-18,White Dot,Only complaint I have is that the sound qualit...,1,Only complaint I have is that the sound qualit...,complaint sound quality isnt great mostly use ...,complaint sound quality isnt great mostly use ...


In [33]:
del alexa_df['verified_reviews_no_punc_no_stopwords_joined']

In [34]:
alexa_df

Unnamed: 0,rating,date,variation,verified_reviews,feedback,verified_reviews_no_punc,verified_reviews_no_punc_no_stopwords
0,5,31-Jul-18,Charcoal Fabric,Love my Echo!,1,Love my Echo,love echo
1,5,31-Jul-18,Charcoal Fabric,Loved it!,1,Loved it,loved
2,4,31-Jul-18,Walnut Finish,"Sometimes while playing a game, you can answer...",1,Sometimes while playing a game you can answer ...,sometimes playing game answer question correct...
3,5,31-Jul-18,Charcoal Fabric,I have had a lot of fun with this thing. My 4 ...,1,I have had a lot of fun with this thing My 4 y...,lot fun thing old learns dinosaurs control lig...
4,5,31-Jul-18,Charcoal Fabric,Music,1,Music,music
...,...,...,...,...,...,...,...
3145,5,30-Jul-18,Black Dot,"Perfect for kids, adults and everyone in betwe...",1,Perfect for kids adults and everyone in between,perfect kids adults everyone
3146,5,30-Jul-18,Black Dot,"Listening to music, searching locations, check...",1,Listening to music searching locations checkin...,listening music searching locations checking t...
3147,5,30-Jul-18,Black Dot,"I do love these things, i have them running my...",1,I do love these things i have them running my ...,love things running entire home lights thermos...
3148,5,30-Jul-18,White Dot,Only complaint I have is that the sound qualit...,1,Only complaint I have is that the sound qualit...,complaint sound quality isnt great mostly use ...


# 7. TOKENIZATION AND PADDING

![alt text](https://drive.google.com/uc?id=1w7r6pUAm1WkFRWAzp0tdcsVfDwzR9dWw)


**MINI CHALLENGE #7:**
- **Explore the data with index 13 and confirm that the tokenization works**
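
This section presents tokenization and padding only as a figure, so the following is a plain-Python sketch of the general idea rather than the API of any particular library. Every function name here (`fit_vocab`, `texts_to_sequences`, `pad_sequences`) is hypothetical, and the choice to pad at the end with zeros is an assumption. Tokenization maps each word to an integer id; padding brings all sequences to the same length. The third sample string corresponds to what the review at index 13 ("Love, Love, Love!!") looks like after cleaning.

```python
from collections import Counter

def fit_vocab(texts):
    """Assign integer ids by descending frequency, starting at 1 (0 is reserved for padding)."""
    counts = Counter(word for text in texts for word in text.split())
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i for i, word in enumerate(ordered, start=1)}

def texts_to_sequences(texts, vocab):
    """Replace each known word with its id; unknown words are skipped."""
    return [[vocab[w] for w in text.split() if w in vocab] for text in texts]

def pad_sequences(sequences, maxlen=None, value=0):
    """Truncate or right-pad every sequence to maxlen (default: the longest sequence)."""
    if maxlen is None:
        maxlen = max(len(s) for s in sequences)
    return [s[:maxlen] + [value] * (maxlen - len(s)) for s in sequences]

# Cleaned sample texts in the style of the stopword-free column
texts = ['love echo', 'loved', 'love love love']

vocab = fit_vocab(texts)
sequences = texts_to_sequences(texts, vocab)
padded = pad_sequences(sequences)

print(vocab)      # {'love': 1, 'echo': 2, 'loved': 3}
print(sequences)  # [[1, 2], [3], [1, 1, 1]]
print(padded)     # [[1, 2, 0], [3, 0, 0], [1, 1, 1]]
```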


# 8. TEXT DATA VISUALIZATION

In [None]:
# Let's obtain the number of words in every row in the DataFrame


In [None]:
# Let's plot the histogram of the length column


In [None]:
# Use count plot to show how many samples have positive/negative feedback


In [None]:
# Use Seaborn barplot to show variations/ratings


**MINI CHALLENGE #8:**
- **Plot the count plot for the ratings column**
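
The cells in this section are left empty, so here is a hedged sketch of the quantities those plots would display, computed with pandas alone on a made-up DataFrame (`demo_df` and its rows are illustrative). The word count per row is what the histogram would bin, and `value_counts()` gives the bar heights a count plot would draw for the feedback and rating columns.

```python
import pandas as pd

# Made-up sample; the column names mirror the dataset
demo_df = pd.DataFrame({
    'rating': [5, 5, 4, 1],
    'feedback': [1, 1, 1, 0],
    'verified_reviews': [
        'Love my Echo!',
        'Loved it!',
        'Music',
        'Stopped working after a week',
    ],
})

# Number of words in every row: split on whitespace, then take the list length
demo_df['word_count'] = demo_df['verified_reviews'].str.split().str.len()

# Counts behind a count plot of positive/negative feedback
feedback_counts = demo_df['feedback'].value_counts()

# Counts behind a count plot of the ratings, ordered by star value
rating_counts = demo_df['rating'].value_counts().sort_index()

print(demo_df['word_count'].tolist())
print(feedback_counts.to_dict())
print(rating_counts.to_dict())
```

To draw the figures, these results can be passed to the plotting libraries imported at the top of the notebook, for example `demo_df['word_count'].plot(kind='hist')` or `sns.countplot(x='rating', data=demo_df)`.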



# 9. WORD CLOUD

In [None]:
# Take a dataframe column and convert it into a list


In [None]:
# Join all elements in the list into one massive string!


**MINI CHALLENGE #9:**
- **Plot the wordcloud for the negative ratings**
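
The two steps named in the empty cells above (column to list, list to one string) can be sketched as follows, restricted to negative feedback as the challenge asks. The DataFrame, the `cleaned` column name and its rows are made up for illustration. A word cloud sizes each word by how often it occurs, so the `Counter` at the end shows the frequencies such a plot would be based on; rendering the image itself would require a separate word-cloud library, which is not shown here.

```python
from collections import Counter

import pandas as pd

# Made-up sample with a cleaned-text column
demo_df = pd.DataFrame({
    'feedback': [1, 0, 0],
    'cleaned': ['love echo', 'stopped working', 'sound quality poor stopped'],
})

# Keep only negative feedback, then convert the column into a list
negative_list = demo_df[demo_df['feedback'] == 0]['cleaned'].tolist()

# Join all elements in the list into one string
one_string = ' '.join(negative_list)

# Word frequencies: the quantity a word cloud visualises
word_freq = Counter(one_string.split())

print(one_string)
print(word_freq.most_common(3))
```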