# Data Cleaning and Preprocessing
Data preprocessing transforms a raw dataset into a format that analytic algorithms can work with. It is a fundamental stage in data mining, because the preprocessing methods chosen directly affect the outcome of any downstream algorithm.
### 1. Data overview
######    1.1 Import dataset and libraries
The dataset was collected from https://data.world/datafiniti/consumer-reviews-of-amazon-products and is a .csv file with a size of 365.82 MB. It is a list of over 1,500 consumer reviews for Amazon products like the Kindle, Fire TV Stick, and more, provided by Datafiniti's Product Database. The dataset includes basic product information, rating, review text, and more for each product.
######    1.2 Check missing values
Check for missing values and remove the rows that contain them, so that they do not cause errors in later classification or regression steps.
### 2. Text preprocessing
######    2.1 Remove non-text and special characters
Text data may include website links, hashtags, and similar elements. These are best removed from the text before the model is run.
######    2.2 Convert all text to lowercase
Without normalization, words such as “We” and “we” would be learned as different tokens during training, so all uppercase letters are converted to lowercase.
######    2.3 Tokenization
A sentence consists of many words, but not all of them are important. To analyze words individually, each sentence is split into single-word tokens.
######    2.4 Remove stopwords
Stopwords are words such as ‘I’, ‘we’, ‘my’, ‘you’, ‘own’, ‘only’, etc. These words are unlikely to carry particular meaning, and the model may treat them as noise, so they are removed to keep the noise level down.
######    2.5 Lemmatization vs. Stemming
Stemming reduces the morphological variants of a word to a common root/base form, typically by stripping suffixes; the result is not always a dictionary word. Stemming programs are commonly referred to as stemming algorithms or stemmers. In contrast to stemming, lemmatization looks beyond word reduction and considers a language’s full vocabulary to apply a morphological analysis to words, returning the dictionary form (lemma) of each word.
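
To make the contrast between the two approaches concrete, here is a deliberately tiny sketch. It does not use NLTK: `toy_stem`, `toy_lemmatize`, the suffix list and the lookup table are all hypothetical stand-ins invented for illustration. Real stemmers apply many more rules, and real lemmatizers consult a full dictionary.

```python
# Toy illustration only; the rule set and the lookup table are made up.
SUFFIXES = ("ing", "ed", "ly", "s")

def toy_stem(word: str) -> str:
    """Strip the first matching suffix; the result need not be a real word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# A lemmatizer maps each word to its dictionary form via vocabulary knowledge.
LEMMA_TABLE = {"keys": "key", "spaced": "space", "hated": "hate", "trying": "try"}

def toy_lemmatize(word: str) -> str:
    """Look the word up; unknown words are returned unchanged."""
    return LEMMA_TABLE.get(word, word)

for w in ["keys", "spaced", "hated", "trying", "easily"]:
    print(f"{w:8} stem={toy_stem(w):6} lemma={toy_lemmatize(w)}")
```

In this sketch the rule-based stemmer turns "spaced" into the non-word "spac", while the lookup returns "space". The same kind of difference appears between the stemmed and lemmatized columns later in the notebook.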

In [1]:
# General packages
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import os

In [2]:
# Visualization libraries (matplotlib.pyplot and seaborn are already imported above)
from matplotlib import rcParams
from plotly import tools
import plotly.graph_objs as go
from plotly.offline import iplot

In [3]:
# NLP packages
import nltk
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from collections import Counter
from wordcloud import WordCloud
from nltk.corpus import stopwords

### Importing Data

In [4]:
# Reading data from .csv file
Reviews = pd.read_csv('DatafinitiElectronicsProductData.csv')

In [5]:
Reviews.head()

,id,asins,brand,categories,colors,dateAdded,dateUpdated,dimension,ean,imageURLs,...,reviews.doRecommend,reviews.numHelpful,reviews.rating,reviews.sourceURLs,reviews.text,reviews.title,reviews.username,sourceURLs,upc,weight
0,AVpf3txeLJeJML43FN82,B0168YIWSI,Microsoft,"Electronics,Computers,Computer Accessories,Key...",Black,2015-11-13T12:28:09Z,2018-01-29T02:15:13Z,11.6 in x 8.5 in x 0.19 in,890000000000.0,https://i5.walmartimages.com/asr/2a41f6f0-844e...,...,True,0.0,5.0,http://reviews.bestbuy.com/3545/4562009/review...,"This keyboard is very easy to type on, but the...",Love the fingerprint reader,JNH1,https://www.walmart.com/ip/Microsoft-Surface-P...,890000000000.0,1.1 pounds
1,AVpf3txeLJeJML43FN82,B0168YIWSI,Microsoft,"Electronics,Computers,Computer Accessories,Key...",Black,2015-11-13T12:28:09Z,2018-01-29T02:15:13Z,11.6 in x 8.5 in x 0.19 in,890000000000.0,https://i5.walmartimages.com/asr/2a41f6f0-844e...,...,True,0.0,4.0,http://reviews.bestbuy.com/3545/4562009/review...,It's thin and light. I can type pretty easily ...,Nice,Appa,https://www.walmart.com/ip/Microsoft-Surface-P...,890000000000.0,1.1 pounds
2,AVpf3txeLJeJML43FN82,B0168YIWSI,Microsoft,"Electronics,Computers,Computer Accessories,Key...",Black,2015-11-13T12:28:09Z,2018-01-29T02:15:13Z,11.6 in x 8.5 in x 0.19 in,890000000000.0,https://i5.walmartimages.com/asr/2a41f6f0-844e...,...,True,0.0,4.0,http://reviews.bestbuy.com/3545/4562009/review...,I love the new design the keys are spaced well...,New,Kman,https://www.walmart.com/ip/Microsoft-Surface-P...,890000000000.0,1.1 pounds
3,AVpf3txeLJeJML43FN82,B0168YIWSI,Microsoft,"Electronics,Computers,Computer Accessories,Key...",Black,2015-11-13T12:28:09Z,2018-01-29T02:15:13Z,11.6 in x 8.5 in x 0.19 in,890000000000.0,https://i5.walmartimages.com/asr/2a41f6f0-844e...,...,True,0.0,5.0,http://reviews.bestbuy.com/3545/4562009/review...,Attached easily and firmly. Has a nice feel. A...,Nice keyboard,UpstateNY,https://www.walmart.com/ip/Microsoft-Surface-P...,890000000000.0,1.1 pounds
4,AVpf3txeLJeJML43FN82,B0168YIWSI,Microsoft,"Electronics,Computers,Computer Accessories,Key...",Black,2015-11-13T12:28:09Z,2018-01-29T02:15:13Z,11.6 in x 8.5 in x 0.19 in,890000000000.0,https://i5.walmartimages.com/asr/2a41f6f0-844e...,...,True,0.0,5.0,http://reviews.bestbuy.com/3545/4562009/review...,"Our original keyboard was okay, but did not ha...",Nice improvement,Glickster,https://www.walmart.com/ip/Microsoft-Surface-P...,890000000000.0,1.1 pounds


### Data Exploration

In [6]:
Reviews.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7299 entries, 0 to 7298
Data columns (total 27 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   id                   7299 non-null   object 
 1   asins                7299 non-null   object 
 2   brand                7299 non-null   object 
 3   categories           7299 non-null   object 
 4   colors               5280 non-null   object 
 5   dateAdded            7299 non-null   object 
 6   dateUpdated          7299 non-null   object 
 7   dimension            6090 non-null   object 
 8   ean                  2951 non-null   float64
 9   imageURLs            7299 non-null   object 
 10  keys                 7299 non-null   object 
 11  manufacturer         4632 non-null   object 
 12  manufacturerNumber   7299 non-null   object 
 13  name                 7299 non-null   object 
 14  primaryCategories    7299 non-null   object 
 15  reviews.date         7238 non-null   o

In [7]:
display(Reviews.describe().round(2))

,ean,reviews.numHelpful,reviews.rating,upc
count,2951.0,5813.0,7135.0,7299.0
mean,298649200000.0,0.75,4.37,386671300000.0
std,338551000000.0,3.42,1.04,368169300000.0
min,27108110000.0,0.0,1.0,17817660000.0
25%,97855100000.0,0.0,4.0,50036330000.0
50%,97855100000.0,0.0,5.0,97855100000.0
75%,649000000000.0,0.0,5.0,793000000000.0
max,890000000000.0,128.0,5.0,890000000000.0


### Checking Missing Values

In [8]:
Reviews.isnull().any()

id                     False
asins                  False
brand                  False
categories             False
colors                  True
dateAdded              False
dateUpdated            False
dimension               True
ean                     True
imageURLs              False
keys                   False
manufacturer            True
manufacturerNumber     False
name                   False
primaryCategories      False
reviews.date            True
reviews.dateSeen       False
reviews.doRecommend     True
reviews.numHelpful      True
reviews.rating          True
reviews.sourceURLs     False
reviews.text            True
reviews.title           True
reviews.username       False
sourceURLs             False
upc                    False
weight                 False
dtype: bool

In [9]:
Reviews.isnull().sum()

id                        0
asins                     0
brand                     0
categories                0
colors                 2019
dateAdded                 0
dateUpdated               0
dimension              1209
ean                    4348
imageURLs                 0
keys                      0
manufacturer           2667
manufacturerNumber        0
name                      0
primaryCategories         0
reviews.date             61
reviews.dateSeen          0
reviews.doRecommend    1391
reviews.numHelpful     1486
reviews.rating          164
reviews.sourceURLs        0
reviews.text              5
reviews.title             4
reviews.username          0
sourceURLs                0
upc                       0
weight                    0
dtype: int64

In [10]:
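# Replace missing values in the numeric review columns with zero
# (0 lies outside the 1-5 rating scale, so in reviews.rating it only marks "no rating")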
Reviews['reviews.numHelpful'] = Reviews['reviews.numHelpful'].fillna(0)
Reviews['reviews.rating'] = Reviews['reviews.rating'].fillna(0)

In [11]:
Reviews.isna().sum()

id                        0
asins                     0
brand                     0
categories                0
colors                 2019
dateAdded                 0
dateUpdated               0
dimension              1209
ean                    4348
imageURLs                 0
keys                      0
manufacturer           2667
manufacturerNumber        0
name                      0
primaryCategories         0
reviews.date             61
reviews.dateSeen          0
reviews.doRecommend    1391
reviews.numHelpful        0
reviews.rating            0
reviews.sourceURLs        0
reviews.text              5
reviews.title             4
reviews.username          0
sourceURLs                0
upc                       0
weight                    0
dtype: int64

In [12]:
# Drop every row that still has a missing value in any column
Reviews.dropna(how = 'any', inplace = True)

In [13]:
Reviews.isnull().sum()

id                     0
asins                  0
brand                  0
categories             0
colors                 0
dateAdded              0
dateUpdated            0
dimension              0
ean                    0
imageURLs              0
keys                   0
manufacturer           0
manufacturerNumber     0
name                   0
primaryCategories      0
reviews.date           0
reviews.dateSeen       0
reviews.doRecommend    0
reviews.numHelpful     0
reviews.rating         0
reviews.sourceURLs     0
reviews.text           0
reviews.title          0
reviews.username       0
sourceURLs             0
upc                    0
weight                 0
dtype: int64

In [14]:
Reviews['reviews.rating'].value_counts()

5.0    1102
4.0     469
3.0     114
2.0      58
1.0      57
Name: reviews.rating, dtype: int64
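
The rating counts above add up to 1,800 rows, down from 7,299, because `dropna(how='any')` discards a row whenever any of the 27 columns is missing, including columns such as `ean` or `colors` that the text analysis does not use. If only the review text and rating matter, one possible alternative is to restrict the check with `subset`. The sketch below uses a small made-up frame (`demo`) in place of `Reviews`; its values are illustrative only.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for Reviews (illustrative values only)
demo = pd.DataFrame({
    "reviews.text": ["great sound", None, "works well", "too bulky"],
    "reviews.rating": [5.0, 4.0, np.nan, 2.0],
    "colors": [None, "Black", "White", None],
})

# Drops a row if ANY column is missing, as in the cell above
drop_any = demo.dropna(how="any")

# Drops a row only if one of the columns needed for the analysis is missing
drop_subset = demo.dropna(subset=["reviews.text", "reviews.rating"])

print(len(demo), len(drop_any), len(drop_subset))
```

Whether the narrower drop is appropriate depends on which columns later steps rely on; the notebook itself keeps the stricter `how='any'` behaviour.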

### Preprocessing

In [15]:
# Converting all words into lower case
Reviews['lowercase_text_reviews'] = Reviews['reviews.text'].str.lower()
print(Reviews['lowercase_text_reviews'])

0       this keyboard is very easy to type on, but the...
1       it's thin and light. i can type pretty easily ...
2       i love the new design the keys are spaced well...
3       attached easily and firmly. has a nice feel. a...
4       our original keyboard was okay, but did not ha...
                              ...                        
7287    best feature is being rechargableworks nice, t...
7288    i'm still trying to learn all the features of ...
7289     great sound system would definitely recommend...
7290    i hated my cable company bulky remote control ...
7291    we were forced to add a cable box as charter c...
Name: lowercase_text_reviews, Length: 1800, dtype: object


In [16]:
# Count unique words that are found in reviews
from nltk import word_tokenize

In [17]:
# Number of unique tokens before converting reviews to lowercase
token_lists = [word_tokenize(each) for each in Reviews['reviews.text']]
tokens = [item for sublist in token_lists for item in sublist]
print("Number of unique tokens then: ",len(set(tokens)))

# Number of unique tokens after converting reviews to lowercase
token_lists_lower = [word_tokenize(each) for each in Reviews['lowercase_text_reviews']]
tokens_lower = [item for sublist in token_lists_lower for item in sublist]
print("Number of unique tokens now: ",len(set(tokens_lower)))

Number of unique tokens then:  5715
Number of unique tokens now:  4856
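
The drop from 5,715 to 4,856 unique tokens comes from case variants collapsing into a single form. As a rough sketch of how such variants could be listed, the snippet below groups tokens by their lowercase form. The helper `case_variants` and the sample sentence are made up for illustration, and a plain `split()` is used instead of `word_tokenize` to keep the example free of NLTK data downloads.

```python
from collections import defaultdict

def case_variants(tokens):
    """Group tokens by lowercase form and keep groups with more than one spelling."""
    groups = defaultdict(set)
    for tok in tokens:
        groups[tok.lower()].add(tok)
    return {key: sorted(forms) for key, forms in groups.items() if len(forms) > 1}

# Made-up sample text, split on whitespace
sample = "We love it . we LOVE the keyboard . The keys are great".split()

merged = case_variants(sample)
unique_before = len(set(sample))
unique_after = len({tok.lower() for tok in sample})

print(merged)
print(unique_before, unique_after)
```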


In [18]:
# Collect the special (non-alphanumeric, non-space) characters that occur in the reviews
special_chars = Reviews['lowercase_text_reviews'].apply(lambda review: [char for char in review if not char.isalnum() and char != ' '])
flat_list = [item for sublist in special_chars for item in sublist]

# Create a set containing the distinct special characters
set(flat_list)

{'!',
 '"',
 '#',
 '$',
 '%',
 '&',
 "'",
 '(',
 ')',
 '*',
 '+',
 ',',
 '-',
 '.',
 '/',
 ':',
 ';',
 '=',
 '?',
 '@',
 ']',
 '~',
 '–',
 '’',
 '…'}

In [19]:
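# Keep a copy of the text before special characters are removed, for the comparison below
Review2 = Reviews['lowercase_text_reviews'].copy()
# regex=True is passed explicitly, because the default changes from True to False in newer pandas versions
Reviews['lowercase_text_reviews'] = Reviews['lowercase_text_reviews'].str.replace(r'[^A-Za-z0-9 ]+', ' ', regex=True)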



In [20]:
# Checking outcomes
print(Reviews['lowercase_text_reviews'][0])

this keyboard is very easy to type on  but the fingerprint reader is the best feature  it is very accurate and simplifies login 
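
The pattern `[^A-Za-z0-9 ]+` replaces punctuation with spaces, so a link or hashtag would be split into word fragments (for instance `http`, `example`, `com`) rather than removed. If links and hashtags should disappear entirely, one option is to strip them before the generic replacement. The following is a sketch under that assumption; `clean_text`, the three patterns and the sample string are hypothetical and not part of the notebook. Unlike the cell above, it also collapses repeated spaces.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")   # web links
HASHTAG_RE = re.compile(r"#\w+")                 # hashtags such as #kindle
NON_ALNUM_RE = re.compile(r"[^a-z0-9 ]+")        # anything else that is not a letter, digit or space

def clean_text(text: str) -> str:
    """Lowercase, drop links and hashtags, then drop remaining special characters."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = HASHTAG_RE.sub(" ", text)
    text = NON_ALNUM_RE.sub(" ", text)
    return " ".join(text.split())  # collapse runs of whitespace

sample = "Love it! See https://example.com/review #kindle"
print(clean_text(sample))
```

On a pandas column the same function could be applied with `Series.apply(clean_text)`.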


In [21]:
token_lists = [word_tokenize(each) for each in Review2]

tokens = [item for sublist in token_lists for item in sublist]
print("Number of unique tokens then: ",len(set(tokens)))

token_lists = [word_tokenize(each) for each in Reviews['lowercase_text_reviews']]
tokens = [item for sublist in token_lists for item in sublist]
print("Number of unique tokens now: ",len(set(tokens)))

Number of unique tokens then:  4856
Number of unique tokens now:  4464


In [22]:
print('All Languages in NLTK: \n')
print(stopwords.fileids())

All Languages in NLTK: 

['arabic', 'azerbaijani', 'bengali', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'greek', 'hungarian', 'indonesian', 'italian', 'kazakh', 'nepali', 'norwegian', 'portuguese', 'romanian', 'russian', 'slovene', 'spanish', 'swedish', 'tajik', 'turkish']


In [23]:
# clean noise words
noise_words = []
eng_stop_words = stopwords.words('english')
eng_stop_words

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [24]:
# separating stopwords and non-stopwords from the reviews
stop_words = set(eng_stop_words)
del_stop_words = []
stopword = []

# using the review at index 1 as an example
sentence = Reviews['lowercase_text_reviews'][1] 

# tokenize the sentence into separate tokens
words = nltk.word_tokenize(sentence)

# Add each word to one of two lists: stopwords, or non-stopwords (del_stop_words)
for word in words:
    if word in stop_words:
        stopword.append(word)
    else:
        del_stop_words.append(word)
        
print('-- Original Sentence --\n', sentence)
print('\n-- Stopwords in the sentence --\n', stopword)
print('\n-- Non-stopwords in the sentence --\n', del_stop_words)

-- Original Sentence --
 it s thin and light  i can type pretty easily on it 

-- Stopwords in the sentence --
 ['it', 's', 'and', 'i', 'can', 'on', 'it']

-- Non-stopwords in the sentence --
 ['thin', 'light', 'type', 'pretty', 'easily']


In [25]:
# Removing these words to give more focus to the important information

def remove_stopwords(stop_words, sentence):
    return [word for word in nltk.word_tokenize(sentence) if word not in stop_words]

Reviews['withoutstop_reviews_text'] = Reviews['lowercase_text_reviews'].apply(lambda row: remove_stopwords(stop_words, row))
Reviews[['lowercase_text_reviews','withoutstop_reviews_text']]

,lowercase_text_reviews,withoutstop_reviews_text
0,this keyboard is very easy to type on but the...,"[keyboard, easy, type, fingerprint, reader, be..."
1,it s thin and light i can type pretty easily ...,"[thin, light, type, pretty, easily]"
2,i love the new design the keys are spaced well...,"[love, new, design, keys, spaced, well, mis, t..."
3,attached easily and firmly has a nice feel a...,"[attached, easily, firmly, nice, feel, must, s..."
4,our original keyboard was okay but did not ha...,"[original, keyboard, okay, laptop, feel, bit, ..."
...,...,...
7287,best feature is being rechargableworks nice t...,"[best, feature, rechargableworks, nice, touch,..."
7288,i m still trying to learn all the features of ...,"[still, trying, learn, features, controller, t..."
7289,great sound system would definitely recommend,"[great, sound, system, would, definitely, reco..."
7290,i hated my cable company bulky remote control ...,"[hated, cable, company, bulky, remote, control..."
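
One design choice worth weighing here: NLTK's English stopword list also contains negations such as "not" and "no", so they are removed along with the other stopwords. For review text, where a negation can reverse the meaning of a phrase, it may be preferable to keep them. The sketch below shows the idea with a small hand-picked stopword set standing in for `stopwords.words('english')`; the set, the `negations` name and the sample sentence are assumptions made for illustration.

```python
# Small hand-picked subset standing in for stopwords.words('english') (assumption)
stop_words = {"i", "it", "is", "the", "a", "was", "but", "not", "no"}

# Negations that should survive stopword removal (hypothetical choice)
negations = {"not", "no"}
stop_words_keep_neg = stop_words - negations

def remove_stopwords(stop_words, tokens):
    """Return the tokens that are not in the given stopword set."""
    return [tok for tok in tokens if tok not in stop_words]

tokens = "the screen is not bright".split()  # made-up example sentence

without_neg = remove_stopwords(stop_words, tokens)
with_neg = remove_stopwords(stop_words_keep_neg, tokens)

print(without_neg)
print(with_neg)
```

With the real list, the same subtraction could be applied to `set(eng_stop_words)` before calling `remove_stopwords`.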


### Stemming & lemmatization

In [26]:
from nltk.stem import PorterStemmer, LancasterStemmer
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet') 
from nltk.corpus import wordnet

porter = PorterStemmer()
lancaster = LancasterStemmer()
lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/lingzhang/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [27]:
# defining stemSentence() function to do data stemming
def stemSentence(sentence):
    # Tokenize the input and stem each token with the Porter stemmer created above.
    # Every stem is followed by a space, so the returned string ends with a trailing space.
    token_words = word_tokenize(sentence)
    stem_sentence = []
    for word in token_words:
        stem_sentence.append(porter.stem(word))
        stem_sentence.append(" ")
    return "".join(stem_sentence)

Reviews['stemmed_reviews_text'] = Reviews['withoutstop_reviews_text'].apply(lambda x: [stemSentence(y) for y in x])
Reviews[['withoutstop_reviews_text','stemmed_reviews_text']]

,withoutstop_reviews_text,stemmed_reviews_text
0,"[keyboard, easy, type, fingerprint, reader, be...","[keyboard , easi , type , fingerprint , reader..."
1,"[thin, light, type, pretty, easily]","[thin , light , type , pretti , easili ]"
2,"[love, new, design, keys, spaced, well, mis, t...","[love , new , design , key , space , well , mi..."
3,"[attached, easily, firmly, nice, feel, must, s...","[attach , easili , firmli , nice , feel , must..."
4,"[original, keyboard, okay, laptop, feel, bit, ...","[origin , keyboard , okay , laptop , feel , bi..."
...,...,...
7287,"[best, feature, rechargableworks, nice, touch,...","[best , featur , rechargablework , nice , touc..."
7288,"[still, trying, learn, features, controller, t...","[still , tri , learn , featur , control , thin..."
7289,"[great, sound, system, would, definitely, reco...","[great , sound , system , would , definit , re..."
7290,"[hated, cable, company, bulky, remote, control...","[hate , cabl , compani , bulki , remot , contr..."
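
Because `withoutstop_reviews_text` already holds token lists, the cell above calls `stemSentence()` once per token, which tokenizes again and appends a space. That is why each stem in the output carries a trailing space (for example `pretti `). If clean tokens are preferred, a possible alternative is to stem the list directly. In this sketch, `stem_tokens` is a hypothetical helper and the token list is copied from row 1 of the table above.

```python
from nltk.stem import PorterStemmer

porter = PorterStemmer()

def stem_tokens(tokens):
    """Stem each token of an already tokenized review, without re-tokenizing."""
    return [porter.stem(token) for token in tokens]

# Tokens of the review at index 1, as shown in the table above
example = ["thin", "light", "type", "pretty", "easily"]
stemmed = stem_tokens(example)
print(stemmed)
```

On the DataFrame this would be used as `Reviews['withoutstop_reviews_text'].apply(stem_tokens)`; the lemmatization step could be handled the same way.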


In [28]:
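# defining lemmSentence() function to do data lemmatization 
def lemmSentence(sentence):
    # Tokenize the input and lemmatize each token. pos="v" treats every token as a verb,
    # so most non-verbs are left unchanged. Every lemma is followed by a space.
    token_words = word_tokenize(sentence)
    lemma_sentence = []
    for word in token_words:
        lemma_sentence.append(lemmatizer.lemmatize(word, pos="v"))
        lemma_sentence.append(" ")
    return "".join(lemma_sentence)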

Reviews['lemma_reviews_text'] = Reviews['withoutstop_reviews_text'].apply(lambda x: [lemmSentence(y) for y in x])
Reviews[['withoutstop_reviews_text','lemma_reviews_text']]

,withoutstop_reviews_text,lemma_reviews_text
0,"[keyboard, easy, type, fingerprint, reader, be...","[keyboard , easy , type , fingerprint , reader..."
1,"[thin, light, type, pretty, easily]","[thin , light , type , pretty , easily ]"
2,"[love, new, design, keys, spaced, well, mis, t...","[love , new , design , key , space , well , mi..."
3,"[attached, easily, firmly, nice, feel, must, s...","[attach , easily , firmly , nice , feel , must..."
4,"[original, keyboard, okay, laptop, feel, bit, ...","[original , keyboard , okay , laptop , feel , ..."
...,...,...
7287,"[best, feature, rechargableworks, nice, touch,...","[best , feature , rechargableworks , nice , to..."
7288,"[still, trying, learn, features, controller, t...","[still , try , learn , feature , controller , ..."
7289,"[great, sound, system, would, definitely, reco...","[great , sound , system , would , definitely ,..."
7290,"[hated, cable, company, bulky, remote, control...","[hat , cable , company , bulky , remote , cont..."


In [29]:
# Save the preprocessed data; the .csv extension makes the file type explicit
Reviews.to_csv('Pre-processing_DatafinitiElectronicsProductData.csv')
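
One thing to keep in mind when saving: columns that hold Python lists (such as the token columns created above) are written to CSV as their string representation and come back as plain strings when the file is read again. A minimal sketch of the round trip follows, using an in-memory buffer and a made-up frame instead of the real file. Parsing with `ast.literal_eval` is one possible way to recover the lists, not something the notebook itself does.

```python
import ast
import io

import pandas as pd

# Made-up frame with a list-valued column, standing in for the token columns
df = pd.DataFrame({"tokens": [["thin", "light"], ["great", "sound"]]})

buffer = io.StringIO()          # in-memory stand-in for the CSV file
df.to_csv(buffer, index=False)
buffer.seek(0)

loaded = pd.read_csv(buffer)
raw_value = loaded.loc[0, "tokens"]
print(type(raw_value), raw_value)   # the list comes back as a string

# Parse the string representation back into a Python list
loaded["tokens"] = loaded["tokens"].apply(ast.literal_eval)
print(loaded.loc[0, "tokens"])
```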