# Description

The dataset is official Amazon Fashion Review dataset. Dataset contains total of 11 columns of which we will be using two columns:

1. The overall column based on which we define the sentiment.
2. The reviewText column which contains reviews in text format.

Preprocessing techniques used:

1. Lower casing the corpus.
2. Dropping the columns which have 5 or less than 5 words.
3. Removing the unwanted numbers.
4. Removing the emojis.
5. Removing all the punctuations.
6. Removing the frequent words.

---



In [1]:
# Importing the necessary libraries

import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt

import re
import nltk
import string
from string import digits
from nltk.corpus import stopwords

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

from collections import Counter, defaultdict

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [2]:
# Reading the first 100000 lines of dataset
df = pd.read_json('/content/AMAZON_FASHION.json', nrows = 100000, lines = True)

In [3]:
df.head()

Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,unixReviewTime,vote,style,image
0,5,True,"10 20, 2014",A1D4G1SNUZWQOT,7106116521,Tracy,Exactly what I needed.,perfect replacements!!,1413763200,,,
1,2,True,"09 28, 2014",A3DDWDH9PX2YX2,7106116521,Sonja Lau,"I agree with the other review, the opening is ...","I agree with the other review, the opening is ...",1411862400,3.0,,
2,4,False,"08 25, 2014",A2MWC41EW7XL15,7106116521,Kathleen,Love these... I am going to order another pack...,My New 'Friends' !!,1408924800,,,
3,2,True,"08 24, 2014",A2UH2QQ275NV45,7106116521,Jodi Stoner,too tiny an opening,Two Stars,1408838400,,,
4,3,False,"07 27, 2014",A89F3LQADZBS5,7106116521,Alexander D.,Okay,Three Stars,1406419200,,,


In [4]:
#Looking at the data information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 12 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   overall         100000 non-null  int64  
 1   verified        100000 non-null  bool   
 2   reviewTime      100000 non-null  object 
 3   reviewerID      100000 non-null  object 
 4   asin            100000 non-null  object 
 5   reviewerName    99994 non-null   object 
 6   reviewText      99891 non-null   object 
 7   summary         99961 non-null   object 
 8   unixReviewTime  100000 non-null  int64  
 9   vote            9405 non-null    float64
 10  style           71963 non-null   object 
 11  image           1577 non-null    object 
dtypes: bool(1), float64(1), int64(2), object(8)
memory usage: 8.5+ MB


In [5]:
# Dropping the unwanted columns
df.drop(['verified', 'reviewTime', 'reviewerID', 'asin', 'reviewerName', 'unixReviewTime', 'vote', 'style', 'summary', 'image'], axis =1, inplace=True)

In [6]:
# Defining the sentiment of the text
df['Sentiment'] = np.where(df['overall']>3, 'positive', 'negative')

In [7]:
# Dropping the NA values
df.dropna(inplace = True)

Dropped the NA rows as there is no way to impute the reviews and the number of NA rows are low as well.

In [8]:
df.drop(['overall'], axis =1, inplace = True)

# Preprocessing

In [9]:
# Normalizing the text
df['reviewText'] = df['reviewText'].str.lower()

In [12]:
# Dropping rows which have 5 or less than 5 words 
def review_length(text):
    if len(text.split()) < 5:
        df.drop(index = df[df['reviewText'] == text].index, inplace =True)

df['reviewText'].apply(lambda x: review_length(x))

1        None
2        None
6        None
7        None
8        None
         ... 
99995    None
99996    None
99997    None
99998    None
99999    None
Name: reviewText, Length: 91240, dtype: object

In [11]:
# Removing the unwamted numbers from the text
def remove_numbers(text):
    """custom function to remove the numbers"""
    return text.translate(str.maketrans('', '', string.digits))

df["reviewText"] = df["reviewText"].apply(lambda text: remove_numbers(text))
df.head()

Unnamed: 0,reviewText,Sentiment
1,"i agree with the other review, the opening is ...",negative
2,love these... i am going to order another pack...,positive
6,these little plastic backs work great. no mor...,positive
7,mother - in - law wanted it as a present for h...,negative
8,"item is of good quality. looks great, too. but...",negative


In [13]:
# Removing the emojis from the text
def remove_emoji(string):
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', string)

df["reviewText"] = df["reviewText"].apply(lambda text: remove_numbers(text))
df.head()

Unnamed: 0,reviewText,Sentiment
1,"i agree with the other review, the opening is ...",negative
2,love these... i am going to order another pack...,positive
6,these little plastic backs work great. no mor...,positive
7,mother - in - law wanted it as a present for h...,negative
8,"item is of good quality. looks great, too. but...",negative


In [14]:
# Removing all the punctutions
def remove_punctuation(text):
    """custom function to remove the punctuation"""
    return text.translate(str.maketrans('', '', string.punctuation))

df["reviewText"] = df["reviewText"].apply(lambda text: remove_punctuation(text))
df.head()

Unnamed: 0,reviewText,Sentiment
1,i agree with the other review the opening is t...,negative
2,love these i am going to order another pack to...,positive
6,these little plastic backs work great no more...,positive
7,mother in law wanted it as a present for her...,negative
8,item is of good quality looks great too but it...,negative


In [15]:
# Removing the most frequent words
cnt = Counter()
for text in df["reviewText"].values:
    for word in text.split():
        cnt[word] += 1
        
FREQWORDS = set([w for (w, wc) in cnt.most_common(25)])
def remove_freqwords(text):
    """custom function to remove the frequent words"""
    return " ".join([word for word in str(text).split() if word not in FREQWORDS])

df["text_wo_stopfreq"] = df["reviewText"].apply(lambda text: remove_freqwords(text))
df.head()

Unnamed: 0,reviewText,Sentiment,text_wo_stopfreq
1,i agree with the other review the opening is t...,negative,agree other review opening too small almost be...
2,love these i am going to order another pack to...,positive,love am going order another pack keep work som...
6,these little plastic backs work great no more...,positive,little plastic backs work great no more loosin...
7,mother in law wanted it as a present for her...,negative,mother law wanted as present her sister she li...
8,item is of good quality looks great too but it...,negative,item good quality looks great too does fit s c...


In [16]:
df.reset_index(inplace =True, drop = True)

In [17]:
df.drop(['reviewText'], axis = 1, inplace = True)

# Module 1

In [18]:
# Getting the word count and proportion 
cnt_new = Counter()
for text in df["text_wo_stopfreq"].values:
    for word in text.split():
        cnt_new[word] += 1

In [19]:
cnt_df = pd\
        .DataFrame.from_dict(cnt_new, orient='index')\
        .sort_values(0, ascending=False) \
        .reset_index() \
        .rename(columns={'index':'word', 0:'count'})

In [20]:
cnt_df['proportion'] = cnt_df['count']/cnt_df['count'].sum()
cnt_df.head(10)

Unnamed: 0,word,count,proportion
0,as,19952,0.009769
1,great,19772,0.009681
2,fit,17257,0.008449
3,size,16926,0.008287
4,you,16747,0.0082
5,like,15096,0.007391
6,be,14620,0.007158
7,love,13214,0.00647
8,its,13077,0.006403
9,just,13074,0.006401


In [23]:
print('The total number of words in the corpus is: {}'.format(cnt_df['word'].__len__()))
print('The total word count is: {}'.format(cnt_df['count'].sum()))

The total number of words in the corpus is: 34491
The total word count is: 2042410


In [58]:
# Counting the unigrams and bigrams

count = defaultdict()
def ngrams(text, n):
    Tokens = nltk.word_tokenize(text)
    output = list(nltk.ngrams(Tokens, n))
    for a in output:
        if a in count.keys():
            count[a] += 1
        else:
            count[a] = 1
    return count

In [59]:
n_grams = df['text_wo_stopfreq'].apply(lambda x: ngrams(x, 1))

In [27]:
# Unigram distribution
sorted(count.items(), key = lambda x: x[1], reverse = True)[:10]

[(('as',), 19952),
 (('great',), 19772),
 (('fit',), 17257),
 (('size',), 16926),
 (('you',), 16747),
 (('like',), 15096),
 (('be',), 14620),
 (('love',), 13214),
 (('its',), 13077),
 (('just',), 13074)]

In [31]:
# Bigram distribution
sorted(count.items(), key = lambda x: x[1], reverse = True)[:10]

[(('if', 'you'), 3042),
 (('would', 'be'), 1843),
 (('will', 'be'), 1688),
 (('good', 'quality'), 1658),
 (('too', 'big'), 1594),
 (('you', 'can'), 1512),
 (('year', 'old'), 1491),
 (('well', 'made'), 1470),
 (('as', 'well'), 1429),
 (('too', 'small'), 1416)]

In [32]:
# Getting the POS collection
complete = ''
for w in df['text_wo_stopfreq']:
    complete += w 

In [33]:
tokens = nltk.word_tokenize(complete)
tags = nltk.pos_tag(tokens)
counts = Counter( tag for word,  tag in tags)

pos_df = pd\
        .DataFrame.from_dict(counts, orient='index')\
        .sort_values(0, ascending=False) \
        .reset_index() \
        .rename(columns={'index':'word', 0:'count'})

In [34]:
pos_df.head()

Unnamed: 0,word,count
0,NN,448193
1,JJ,344623
2,RB,193711
3,NNS,144978
4,IN,132735


Neither of the three perfectly proves Zipf's law, however, POS collections and Bigrams are comparatively better than unigram and word count.

# Module 2

In [41]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score, confusion_matrix

In [35]:
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(df['text_wo_stopfreq'])

In [37]:
y = LabelEncoder().fit_transform(df['Sentiment'])

In [36]:
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

In [42]:
lr = LogisticRegression(max_iter= 500)
dt = DecisionTreeClassifier()

In [61]:
models = defaultdict(list)
for model in [lr, dt]:
  for vector in [X_train_counts, X_train_tfidf]:
    for scores in['f1', 'accuracy', 'recall']:
      score = cross_val_score(model, vector, y, cv=5, scoring = scores)
      models['Technique'].append(model)
      models['Vector'].append(vector)
      models['Scoring'].append(scores)
      models['Score'].append(score.mean())

In [62]:
pd.DataFrame(models)

Unnamed: 0,Technique,Vector,Scoring,Score
0,"LogisticRegression(C=1.0, class_weight=None, d...","(0, 552)\t1\n (0, 613)\t1\n (0, 838)\t1\n ...",f1,0.91279
1,"LogisticRegression(C=1.0, class_weight=None, d...","(0, 552)\t1\n (0, 613)\t1\n (0, 838)\t1\n ...",accuracy,0.864419
2,"LogisticRegression(C=1.0, class_weight=None, d...","(0, 552)\t1\n (0, 613)\t1\n (0, 838)\t1\n ...",recall,0.936456
3,"LogisticRegression(C=1.0, class_weight=None, d...","(0, 33962)\t0.11482439387422463\n (0, 32033...",f1,0.914855
4,"LogisticRegression(C=1.0, class_weight=None, d...","(0, 33962)\t0.11482439387422463\n (0, 32033...",accuracy,0.86628
5,"LogisticRegression(C=1.0, class_weight=None, d...","(0, 33962)\t0.11482439387422463\n (0, 32033...",recall,0.946833
6,"DecisionTreeClassifier(ccp_alpha=0.0, class_we...","(0, 552)\t1\n (0, 613)\t1\n (0, 838)\t1\n ...",f1,0.88975
7,"DecisionTreeClassifier(ccp_alpha=0.0, class_we...","(0, 552)\t1\n (0, 613)\t1\n (0, 838)\t1\n ...",accuracy,0.830602
8,"DecisionTreeClassifier(ccp_alpha=0.0, class_we...","(0, 552)\t1\n (0, 613)\t1\n (0, 838)\t1\n ...",recall,0.900791
9,"DecisionTreeClassifier(ccp_alpha=0.0, class_we...","(0, 33962)\t0.11482439387422463\n (0, 32033...",f1,0.88701


# Result

For both TF-IDF and count vectoriser, Logistic Regression gives the best result in all metrics. So, overall Logit model does better than Decision Tree and the best vectoriser is TF-IDF.