In [1]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")
import ast
from textblob import TextBlob
import datetime
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

In [2]:
url = "Amazon_Reviews_Clothing_Shoes_and_Jewelry_5_sample.csv"
df = pd.read_csv(url)

In [6]:
df.head()

Unnamed: 0,asin,helpful,overall,reviewText,reviewTime,reviewerID,reviewerName,summary,unixReviewTime
0,B00008ID1O,"[3, 4]",5,I was waiting to have a bad odor from these pa...,"09 10, 2012",ARZLW6MFV58E9,"Denise Elkin-andrews ""dee""",Excellent,1347235200
1,B00008ID1O,"[1, 1]",4,"I have worn these for years, but was concerned...","01 5, 2014",A454H175JAT32,Emily,"some odor, but not really a concern",1388880000
2,B00008ID1O,"[7, 7]",1,Horrible smell gave me an asthmatic attack whe...,"07 19, 2013",A1X5NM3HXQANFL,Jasminenirvana,Terrible Chemical Smell,1374192000
3,B00008ID1O,"[0, 2]",3,These are really verging on &#34;granny pantie...,"10 15, 2013",AVKF05BAX2Z3I,K. H. Noland,If these are bikini then I hate to see briefs!,1381795200
4,B00008ID1O,"[3, 3]",5,"Have been wearing these for years, but was hes...","03 21, 2013",AQJWVL7YBSMOL,Mail Debaser,Fine for Me,1363824000


# 1. Analysis

<h4><li>Is there a correlation between the rating of the product and the helpfulness of the review?</li></h4>


In [None]:
#Dividing the helpful column into nº of Yes votes and total number of votes

df['helpful_column_list'] = df["helpful"].apply(lambda x: ast.literal_eval(str(x)))
df[['Nº of YES votes','Nº of Total votes']] = pd.DataFrame(df["helpful_column_list"].values.tolist(), index= df.index)

#Convert helpful column to a probability: [3,4] e.g. represents a 3/4 probability of choosing "Yes" to the review question. 

df["Helpfulness ratio"] = df["Nº of YES votes"] / df["Nº of Total votes"]

#Substitute NaN values by 0, since our methodology gave 0/0 providing the obvious NaN:

df["Helpfulness ratio"] = df["Helpfulness ratio"].fillna(0)

#Cleaning Dataframe

corr_rating_helpfulness =  df[["Helpfulness ratio", "overall"]]

#Correlation between Helpfulness ratio and Overall rating

corr_rating_helpfulness['Helpfulness ratio'].corr(corr_rating_helpfulness['overall'])

<h4><li>Suggested Answer:</li></h4>
When the value of the correlation is close to zero, generally between -0.1 and +0.1, the variables are said to have no linear relationship or a very weak linear relationship. Since the relationship between "Helpfulness ratio" and "overall" is approximately -0.03, it is possible to state that approximately 0.09% of variation in Helpfulness of the review can be explained by the rating of the product, meaning that around 99.9% is left to explain!

___
<h4><li>Who are the most helpful reviewers?</li></h4>

In [None]:
#Convert ReviwerID column to list

reviewID_list = df["reviewerID"].tolist()
uniqueID = list(sorted(set(reviewID_list)))

#Query all the reviwersID: Helpfulness ratio higher or equal than 0.9 and Nº of total votes higher than 8.

ID_count = {}

for i in uniqueID:
    df_ID = df[df["reviewerID"] == i]
    df_ID = df_ID[df_ID["Helpfulness ratio"] >= 0.9]
    df_ID = df_ID[df_ID["Nº of Total votes"] > 8]
    good_reviews = df_ID.shape[0]
    ID_count[i] = good_reviews
    
#Most helpful reviewers: the presented ID's are individuals that had more than 1 sucessful review.    
    
most_helpful = {k: v for k, v in ID_count.items() if v > 1}

#Passing the most helpful reviewers ID into a list, to represent later the proper reviewerName:

reviewerID = list(most_helpful)  

for reviewer in reviewerID:
    df1 = df[df["reviewerID"] == reviewer]
    name = df1.iloc[1]["reviewerName"]
    print(name)

<h4><li>Suggested Answer:</li></h4>
Since the question is a bit too abstract, I decided to implement my own logic into solving this problem. I decided to query all the ReviewersID's into only the individuals that have more than 8 total votes in a specific review and that had an helpfulness ratio higher than 90%. From this, I was able to verify which reviewers had the most occurencies that satisfied this type of conditions, which were: Sakura67456, Matthew G. Sherwin, and Libri Mundi "Libri Mundi". Arising from this, these were the reviewers that in my opinion were the most helpful ones!

___
<h4><li>Have reviews been getting more or less helpful over time?</li></h4>

In [None]:
#Convert reviewTime into datetime.

df['reviewTime'] = pd.to_datetime(df['reviewTime'], format="%m %d, %Y")

#Create another dataframe containing only the necessary columns

df_question3 = df[["reviewTime", "Helpfulness ratio"]]

#Sort by reviewTime to make it ascending.

df_question3_ascending = df_question3.sort_values(by='reviewTime')
df_question3_ascending.tail()

In [None]:
#Aggregating by Years

df_question3_ascending["Year"] = df_question3_ascending["reviewTime"].dt.year

df_question3_ascending = df_question3_ascending.groupby(['Year']).mean().reset_index()

df_question3_ascending

In [None]:
#Creating a linear regression for the Helpfulness ratio from 2006 to 2014.

# values converts it into a numpy array

X = df_question3_ascending["Year"].values.reshape(-1, 1)
Y = df_question3_ascending["Helpfulness ratio"].values.reshape(-1, 1)

# create object for the class

linear_regressor = LinearRegression() 

#Fit our model -> essentially generates a function that explains Y through X

linear_regressor.fit(X, Y) 

# make predictions

Y_pred = linear_regressor.predict(X)

In [None]:
#Setting the two graphs

plt.scatter(df_question3_ascending["Year"], df_question3_ascending["Helpfulness ratio"] , label= "Ratio", color= "xkcd:navy", linewidth=0.2) 

plt.plot(X, Y_pred, color='red')

# X-axis & Y-axis label 

plt.xlabel('Year') 
plt.ylabel('Ratio') 

# Title 

plt.title('Review Helpfulness over time')

# showing legend 

plt.legend() 
  
# function to show the plot 

plt.show()


<h4><li>Suggested Answer:</li></h4>
From the question, I decided to aggregate the reviews by year, so that the analysis gets more efficient. Through the computations on the lines above, and through the display of the graph, It is clear to see that the helpfulness of the reviews have suffered a downfall. From 2006, the ratio was nearly 0.85, but at 2014, this value decreased substantially to approx. 0.15. The fall has been almost linear, which is visible with the comparation with the linear regression that was done, exactly to check how the reviews were reacting over the time. Although, this ratio has been decreasing over the year, this doesn't mean, we are having lower helpful reviews! On 2006, only three records were made, that explains the huge level of helpfulness, while on 2014, 5362 records were already made and the data about that year hadn't finish yet. This reveals that, despite the decrease in the ratio, part of it, can be explained through the higher increase of the total reviews being done, when compared to the helpful ones! 

# 2. Modelling

<h4><li>After someone writes a review, will it be considered helpful by other
users? Predictive Model with a Binary approach (Classification)</li></h4>

I'm going to consider the dependent variable, the binary variable, by "1" if it is helpful or "0" otherwise. So the objective is to build a mathematical expression that will compute and predict whether a future review will be helpful based on past experiences. Secondly, I will describe a review by being helpful ("1") if it has an Helpfullness ratio higher or equal than 60%, and below this threshhold, I will consider it not helpful ("0"). 

In [None]:
# Create another columns associated with the past of the ID: Quality of review might be influenced whether the user already have done some reviews in the past.

df["Nº Previous Total Reviews"] = ""
df["Nº Previous Helpful Reviews"] = ""

reviewID_list = df["reviewerID"].tolist()
uniqueUSER = list(sorted(set(reviewID_list)))

#Empty Dataframe to fill after the loop.

newDF = pd.DataFrame()


for i in uniqueUSER:
    df1 = df[df["reviewerID"] == i]
    df1.sort_values(by='reviewTime')

    dfSIZE = np.arange(df1.shape[0])

    number = 0
    
    #Loop for the Total reviews before the current one.

    for e in dfSIZE:

        df1.iloc[e, df1.columns.get_loc('Nº Previous Total Reviews')] = number
        number = number + 1

    number1 = 0
    
    #Loop for the Previous Helpful reviews before the currrent one.

    for ii in dfSIZE:


        if df1["Helpfulness ratio"].iloc[ii] > 0.6:
            df1.iloc[ii, df1.columns.get_loc('Nº Previous Helpful Reviews')] = number1
            number1 = number1 + 1
        else:
            df1.iloc[ii, df1.columns.get_loc('Nº Previous Helpful Reviews')] = number1
            number1 = number1 + 0
            
    #This gives us a certain Dataframe, df1, which contains the data for one ID, we just need to merge this DF with the empty one.
    
    newDF = newDF.append(df1)

# newDF contains the Dataframe that contains the data regarding the past reviews.
#newDF


In [None]:
newDFF = newDF[["reviewerID", "Nº Previous Helpful Reviews", "Nº Previous Total Reviews"]]

In [None]:
Aggregated_df = pd.merge(df, newDFF, on='reviewerID')

In [None]:
#sentiment Analysis through textBlob

Aggregated_df["reviewText"] = Aggregated_df["reviewText"].astype(str)
Aggregated_df['Sentiment'] = Aggregated_df['summary'].apply(lambda tweet: TextBlob(tweet).sentiment.polarity)

#Convert Helpfulness ratio into binary variable: >0.6 => 1 ; <0.6 => 0

Aggregated_df.loc[Aggregated_df['Helpfulness ratio'] >= 0.6, 'Binary Helpfulness ratio'] = 1
Aggregated_df.loc[Aggregated_df['Helpfulness ratio'] < 0.6, 'Binary Helpfulness ratio'] = 0

In [None]:
#Prepare Data for Test and Training sets

modelling_X = Aggregated_df[["overall", "Sentiment", "Nº Previous Helpful Reviews_y", 'Nº Previous Total Reviews_y']]
modelling_Y = Aggregated_df["Binary Helpfulness ratio"]

#Convert everything to int

modelling_X["Sentiment"] = modelling_X["Sentiment"].astype(int)
modelling_Y = modelling_Y.to_frame()
modelling_Y["Binary Helpfulness ratio"] = modelling_Y["Binary Helpfulness ratio"].astype(int)

In [None]:
#Naive Bayes Algorithm

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, BernoulliNB, MultinomialNB
from sklearn.metrics import f1_score

#Defining our test and train mechanisms

X_train, X_test, y_train, y_test = train_test_split(modelling_X, modelling_Y, test_size=0.3, random_state=42)

#Fit our model -> essentially generates a multilinear function based on train data 

model = BernoulliNB().fit(X_train, y_train) 

#Apply that function into the test dataset

y_pred_BAYES = model.predict(X_test)

#Evaluate the model

print("Accuracy of Naive Bayes classifier on test set:", model.score(X_test, y_test))
print("The f1 score is:", f1_score(y_test, y_pred_BAYES))

print("\nFor example, an individual that reviewed a product with overall rating of \
1, with a text sentiment of 0.5, and that already had 1 helpful review in the past of 1 in total, \
will most likely, according to my model, generate a prediction of:", model.predict([[1, 0.5, 1, 1]]), ". \
Meaning that it will be helful!")

In [None]:
#Logistic Regression

from sklearn.linear_model import LogisticRegression

#Defining our test and train mechanisms

X_train, X_test, y_train, y_test = train_test_split(modelling_X, modelling_Y, test_size=0.3, random_state=42)

#Fit our model -> essentially generates a multilinear function based on train data 

logreg = LogisticRegression()

logreg.fit(X_train, y_train)

#Apply that function into the test dataset

y_pred_LOGREG = logreg.predict(X_test)

#Evaluate the model

print('Accuracy of logistic regression classifier on test set:', logreg.score(X_test, y_test))
print("The f1 score is:", f1_score(y_test, y_pred_LOGREG))

print("\nFor exameple, an individual that reviewed a product with overall rating of \
5, with a text sentiment of 1, and that already had 1 helpful review in the past of 2 in total, \
will most likely, according to my model, generate a prediction of:", logreg.predict([[5, 1,1, 2]]), ". \
Meaning that it will not helpful to others!")

<h4><li>Suggested Answer:</li></h4>
In this question, I've built a model that aims to explain whether a future review will be helpful or not. It is a binary model, where the dependent variable is assigned with value 1 if it helpful, or value 0, otherwise. As independent variables, I selected the following variables: "overall", "Sentiment", "Nº Previous Helpful Reviews_y", "Nº Previous Total Reviews_y". I believe that the overall rating of the product will have a greater influence on the review, specially on low rating items, where customers will seek more information on these products, and so they will resort to these reviews. Following next, "Sentiment" reflects whether the text written by the user is, in fact, positive or negative regarding that specific product, my intuition followed the logic that if a reviewer describes the situation perfectly, than the sentiment analysis would provide an index extremely close to 1 or -1, and therefore would be more informative to the end customer, meaning that it could have an higher probability of being helpful. The last two variables are associated with the past of the reviewer, I believe that customers don't want fake reviews, and eventually will check if this reviewerID is definitely legit and has reputation, therefore the past of the user can play a major role in deciding whether the review will be helpful or not.

From the selected variables, I conducted some train and test scenarios, with different classfiers: Naive Bayes and Logistic Regression. Why Have I chosen this two? So, the Naive Bayes, because it is a probabilistic model, and made sense to use it, since it's an algorithm that can be coded up easily and the predictions made real quick. Next, we have Logistic Regression, which works, exactly, when the dependent variable is binary.  Like all regression analyses, the logistic regression is a predictive analysis.  Logistic regression is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, ratio-level or other independent variables.

After splitting our dataframe into test and train sets, I fit the train sets and this generated the functions that describes the relationship between the dependent and independent variables. From the Naive Bayes, I got an accuracy of 0.774, while with Logistic regression, this yield a value of 0.777, showing that this is a better approach.

Some ideas to explore: I would try to use TPOT library to achieve the best accuracy level. Definitely lacks creativity regarding time, for example understanding the period when the review was made, and perhaps understand if this has any impact whether the review will be helpful or not (e.g. in Christmas time perhaps the probability of a review being helpful may increase). Plus, the sentiment analysis regarding the text that is on the reviews, in my opinion, requires a better text-processor in order to achieve higher results.

# 3. Bonus Question

<h4><li>Would you use it to build a new model to
predict the helpfulness of a review?</li></h4>

Definitely yes! I wasn't familiar with this type of architecture, but from what I read It's encoder it's bidirectional, therefore it takes into account all the words in its surroundings, and not from only one direction, making this mechanism a better way to do sentiment analysis, which will most likely yield a higher accuracy score.

<h4><li>How do you expect that this new model’s performance compare with the previous one
(from Exercise 2)? What makes the BERT model better/worse?</li></h4>

I expect that this new model behaves in a better way than the previous one, simply because it is the most complex and complete architecture at the moment, and that can amplify the results in a wide variety of NLP tasks. From "Towards Data Science", I've managed to understand that BERT reads the entire sequence of words at once. This characteristic allows the model to learn the context of a word based on all of its surroundings (left and right of the word), which dethrones the idea of directional models, which read the text input sequentially (left-to-right or right-to-left).