# Rating Products & Sorting Reviews on Amazon

# Business Problem:

One of the most important problems in e-commerce is correctly calculating the rating a product receives after sale. Solving it means higher customer satisfaction for the e-commerce site, better product visibility for sellers, and a smoother shopping experience for buyers. A second problem is correctly ranking the reviews a product receives. Because misleading reviews that rise to the top directly affect sales, they cause both financial loss and customer loss. By solving these two basic problems, e-commerce sites and sellers increase their sales, and customers complete their purchasing journey without friction.

#### About the Dataset
This dataset of Amazon product data includes product categories and various metadata. It contains the user ratings and reviews of the most-reviewed product in the Electronics category.
#### **Variables**:
- reviewerID - ID of the reviewer, e.g. A2SUAM1J3GNN3B
- asin - ID of the product, e.g. 0000013714
- reviewerName - name of the reviewer
- helpful - helpfulness rating of the review, e.g. 2/3
- reviewText - text of the review
- overall - rating of the product
- summary - summary of the review
- unixReviewTime - time of the review (Unix time)
- reviewTime - time of the review (raw)
- day_diff - number of days since the review
- helpful_yes - number of times the review was found helpful
- total_vote - total number of votes given to the review

### Necessary Libraries

In [3]:
import pandas as pd
import math
import scipy.stats as st

pd.set_option('display.max_columns', None)
# pd.set_option('display.max_rows', 10)
pd.set_option('display.expand_frame_repr', False)
pd.set_option('display.float_format', lambda x: '%.5f' % x)


### Read the Data Set and Calculate the Average Score of the Product.

In [7]:
df = pd.read_csv("amazon_review.csv")
df["overall"].mean()

4.587589013224822

In [10]:
df.head()

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime,day_diff,helpful_yes,total_vote
0,A3SBTW3WS4IQSN,B007WTAJTO,,"[0, 0]",No issues.,4.0,Four Stars,1406073600,2014-07-23,138,0,0
1,A18K1ODH1I2MVB,B007WTAJTO,0mie,"[0, 0]","Purchased this for my device, it worked as adv...",5.0,MOAR SPACE!!!,1382659200,2013-10-25,409,0,0
2,A2FII3I2MBMUIA,B007WTAJTO,1K3,"[0, 0]",it works as expected. I should have sprung for...,4.0,nothing to really say....,1356220800,2012-12-23,715,0,0
3,A3H99DFEG68SR,B007WTAJTO,1m2,"[0, 0]",This think has worked out great.Had a diff. br...,5.0,Great buy at this price!!! *** UPDATE,1384992000,2013-11-21,382,0,0
4,A375ZM4U047O79,B007WTAJTO,2&amp;1/2Men,"[0, 0]","Bought it with Retail Packaging, arrived legit...",5.0,best deal around,1373673600,2013-07-13,513,0,0


### Calculate the Time-Weighted Average Rating.

In [11]:
# Time-based weighted average: more recent reviews (smaller day_diff) get larger weights.
def time_based_weighted_average(dataframe, w1=50, w2=25, w3=15, w4=10):
    day_diff = dataframe["day_diff"]
    # Quartile cut points of day_diff, computed once instead of in every condition.
    q1, q2, q3 = day_diff.quantile([0.25, 0.50, 0.75])
    return dataframe.loc[day_diff <= q1, "overall"].mean() * w1 / 100 + \
           dataframe.loc[(day_diff > q1) & (day_diff <= q2), "overall"].mean() * w2 / 100 + \
           dataframe.loc[(day_diff > q2) & (day_diff <= q3), "overall"].mean() * w3 / 100 + \
           dataframe.loc[day_diff > q3, "overall"].mean() * w4 / 100

In [12]:
time_based_weighted_average(df, w1=50, w2=25, w3=15, w4=10)

4.637306192407316
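
To make the quartile weighting concrete, here is a minimal standard-library sketch of the same idea on a small toy dataset. The eight `(day_diff, overall)` pairs and the name `toy_time_based_weighted_average` are made up for illustration and do not come from `amazon_review.csv`. `statistics.quantiles(..., method="inclusive")` is assumed here as a stand-in for the linear interpolation that pandas' `quantile` uses by default.

```python
from statistics import mean, quantiles

# Hypothetical toy reviews: (day_diff, overall). Not taken from the dataset.
reviews = [(10, 5), (20, 5), (30, 4), (40, 4), (50, 3), (60, 3), (70, 2), (80, 2)]

def toy_time_based_weighted_average(rows, weights=(50, 25, 15, 10)):
    """Weight the mean rating of each day_diff quartile, newest quartile first."""
    if sum(weights) != 100:
        raise ValueError("weights must sum to 100")
    # Quartile cut points of day_diff (linear interpolation between data points).
    q1, q2, q3 = quantiles([d for d, _ in rows], n=4, method="inclusive")
    buckets = [
        [r for d, r in rows if d <= q1],
        [r for d, r in rows if q1 < d <= q2],
        [r for d, r in rows if q2 < d <= q3],
        [r for d, r in rows if d > q3],
    ]
    # Assumes every bucket is non-empty, which holds for this toy data.
    return sum(mean(b) * w / 100 for b, w in zip(buckets, weights))

plain = mean(r for _, r in reviews)
weighted = toy_time_based_weighted_average(reviews)
print(plain, weighted)
```

In the toy data the newest reviews carry the highest ratings, so the weighted result lands above the plain mean. The dataset shows the same direction on a smaller scale: the weighted score of about 4.637 is above the plain mean of about 4.588, which suggests that recent reviews are rated somewhat higher.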

### Determine the 20 Reviews to Display on the Product Detail Page.

In [15]:
# Generate the helpful_no variable.
# Note: total_vote is the total number of up and down votes given to a review; an up vote means helpful.
# The dataset has no helpful_no variable, so it must be derived from the existing variables.

df["helpful_no"] = df["total_vote"] - df["helpful_yes"]
df.head()

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime,day_diff,helpful_yes,total_vote,helpful_no
0,A3SBTW3WS4IQSN,B007WTAJTO,,"[0, 0]",No issues.,4.0,Four Stars,1406073600,2014-07-23,138,0,0,0
1,A18K1ODH1I2MVB,B007WTAJTO,0mie,"[0, 0]","Purchased this for my device, it worked as adv...",5.0,MOAR SPACE!!!,1382659200,2013-10-25,409,0,0,0
2,A2FII3I2MBMUIA,B007WTAJTO,1K3,"[0, 0]",it works as expected. I should have sprung for...,4.0,nothing to really say....,1356220800,2012-12-23,715,0,0,0
3,A3H99DFEG68SR,B007WTAJTO,1m2,"[0, 0]",This think has worked out great.Had a diff. br...,5.0,Great buy at this price!!! *** UPDATE,1384992000,2013-11-21,382,0,0,0
4,A375ZM4U047O79,B007WTAJTO,2&amp;1/2Men,"[0, 0]","Bought it with Retail Packaging, arrived legit...",5.0,best deal around,1373673600,2013-07-13,513,0,0,0


In [16]:
df = df[["reviewerName", "overall", "summary", "helpful_yes", "helpful_no", "total_vote", "reviewTime"]]

### Calculate the score_pos_neg_diff, score_average_rating and wilson_lower_bound Scores and Add Them to the Data

In [18]:
def score_up_down_diff(up, down):
    return up - down

df["score_pos_neg_diff"] = df.apply(lambda x: score_up_down_diff(x["helpful_yes"], x["helpful_no"]), axis=1)

def score_average_rating(up, down):
    if up + down == 0:
        return 0
    return up / (up + down)

### score_average_rating

In [20]:
df["score_average_rating"] = df.apply(lambda x: score_average_rating(x["helpful_yes"], x["helpful_no"]), axis=1)

In [21]:

def wilson_lower_bound(up, down, confidence=0.95):
    """
    Calculate the Wilson Lower Bound score.

    - The lower bound of the confidence interval for the Bernoulli parameter p is taken as the WLB score.
    - The resulting score is used to rank products.
    - Note:
    If the scores are on a 1-5 scale, 1-3 can be marked as negative and 4-5 as positive to fit the Bernoulli setting.
    This brings some problems of its own, so a Bayesian average rating should be used in that case.

    Parameters
    ----------
    up: int
        up count
    down: int
        down count
    confidence: float
        confidence

    Returns
    -------
    wilson score: float

    """
    n = up + down
    if n == 0:
        return 0
    z = st.norm.ppf(1 - (1 - confidence) / 2)
    phat = 1.0 * up / n
    return (phat + z * z / (2 * n) - z * math.sqrt((phat * (1 - phat) + z * z / (4 * n)) / n)) / (1 + z * z / n)
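
Written out, the expression returned by `wilson_lower_bound` corresponds to the formula below, where $n = up + down$, $\hat{p} = up / n$, and $z$ is the standard normal quantile at $1 - (1 - \text{confidence})/2$ (about 1.96 for the default confidence of 0.95):

$$
\text{WLB} = \frac{\hat{p} + \dfrac{z^2}{2n} - z\sqrt{\dfrac{\hat{p}(1-\hat{p}) + \dfrac{z^2}{4n}}{n}}}{1 + \dfrac{z^2}{n}}
$$

As a quick check, a review with one helpful vote and no unhelpful votes has $\hat{p} = 1$ and $n = 1$, so the numerator reduces to $1 + z^2/2 - z^2/2 = 1$ and $\text{WLB} = 1/(1 + z^2) \approx 0.2066$, far below its raw helpful ratio of 1.0.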


In [22]:
df["wilson_lower_bound"] = df.apply(lambda x: wilson_lower_bound(x["helpful_yes"], x["helpful_no"]), axis=1)

In [23]:
df.tail(10)

Unnamed: 0,reviewerName,overall,summary,helpful_yes,helpful_no,total_vote,reviewTime,score_pos_neg_diff,score_average_rating,wilson_lower_bound
4905,zht,5.0,Good product,0,0,0,2012-12-28,0,0.0,0.0
4906,Zigcarruse,5.0,just what you'd expect,0,0,0,2014-02-13,0,0.0,0.0
4907,Zim5,5.0,Works as Advertised,0,0,0,2014-06-06,0,0.0,0.0
4908,Zimms,5.0,Works well. Compatible with Mac,0,0,0,2014-12-05,0,0.0,0.0
4909,Zman,5.0,Just gave me soime elbow room,0,0,0,2014-01-29,0,0.0,0.0
4910,"ZM ""J""",1.0,Do not waste your money.,0,0,0,2013-07-23,0,0.0,0.0
4911,Zo,5.0,Great item!,0,0,0,2013-08-22,0,0.0,0.0
4912,Z S Liske,5.0,Fast and reliable memory card,0,0,0,2014-03-31,0,0.0,0.0
4913,Z Taylor,5.0,Great little card,0,0,0,2013-09-16,0,0.0,0.0
4914,Zza,5.0,So far so good.,0,0,0,2014-02-01,0,0.0,0.0


In [25]:
df = df[df["total_vote"] != 0]
df.head()

Unnamed: 0,reviewerName,overall,summary,helpful_yes,helpful_no,total_vote,reviewTime,score_pos_neg_diff,score_average_rating,wilson_lower_bound
8,4evryoung,5.0,Loads of room,1,0,1,2014-03-24,1,1.0,0.20655
17,Aaron F. Virginie,5.0,Get Fast Load Times,0,1,1,2013-04-07,-1,0.0,0.0
26,Aaron T. Swain,5.0,64 GB,1,1,2,2012-07-26,0,0.5,0.09453
28,ABailey8833,5.0,Great product!,1,0,1,2012-12-28,1,1.0,0.20655
37,Abhi gautam,5.0,100% satisfied. Recommend to all.,1,0,1,2012-07-10,1,1.0,0.20655


### Identify the Top 20 Reviews and Interpret the Results.

In [27]:
df.sort_values("wilson_lower_bound", ascending=False).head(20)

Unnamed: 0,reviewerName,overall,summary,helpful_yes,helpful_no,total_vote,reviewTime,score_pos_neg_diff,score_average_rating,wilson_lower_bound
2031,"Hyoun Kim ""Faluzure""",5.0,UPDATED - Great w/ Galaxy S4 & Galaxy Tab 4 10...,1952,68,2020,2013-01-05,1884,0.96634,0.95754
3449,NLee the Engineer,5.0,Top of the class among all (budget-priced) mic...,1428,77,1505,2012-09-26,1351,0.94884,0.93652
4212,SkincareCEO,1.0,1 Star reviews - Micro SDXC card unmounts itse...,1568,126,1694,2013-05-08,1442,0.92562,0.91214
317,"Amazon Customer ""Kelly""",1.0,"Warning, read this!",422,73,495,2012-02-09,349,0.85253,0.81858
4672,Twister,5.0,Super high capacity!!! Excellent price (on Am...,45,4,49,2014-07-03,41,0.91837,0.80811
1835,goconfigure,5.0,I own it,60,8,68,2014-02-28,52,0.88235,0.78465
3981,"R. Sutton, Jr. ""RWSynergy""",5.0,"Resolving confusion between ""Mobile Ultra"" and...",112,27,139,2012-10-22,85,0.80576,0.73214
3807,R. Heisler,3.0,"Good buy for the money but wait, I had an issue!",22,3,25,2013-02-27,19,0.88,0.70044
4306,Stellar Eller,5.0,Awesome Card!,51,14,65,2012-09-06,37,0.78462,0.67033
4596,"Tom Henriksen ""Doggy Diner""",1.0,Designed incompatibility/Don't support SanDisk,82,27,109,2012-09-22,55,0.75229,0.66359
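
The difference between the three scores is easiest to see side by side. The sketch below re-ranks six reviews using vote counts copied from the outputs above. It is a standard-library approximation of the notebook's functions: `statistics.NormalDist().inv_cdf` is assumed as a replacement for `st.norm.ppf` so that it runs without SciPy, and the helper `ranked` is a hypothetical name introduced only for this illustration.

```python
import math
from statistics import NormalDist

def wilson_lower_bound(up, down, confidence=0.95):
    """Lower bound of the Wilson interval for the share of helpful votes."""
    n = up + down
    if n == 0:
        return 0.0
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    phat = up / n
    return (phat + z * z / (2 * n)
            - z * math.sqrt((phat * (1 - phat) + z * z / (4 * n)) / n)) / (1 + z * z / n)

# (reviewerName, helpful_yes, helpful_no), taken from the tables above.
votes = [
    ("4evryoung", 1, 0),
    ("Twister", 45, 4),
    ('Amazon Customer "Kelly"', 422, 73),
    ("SkincareCEO", 1568, 126),
    ("NLee the Engineer", 1428, 77),
    ('Hyoun Kim "Faluzure"', 1952, 68),
]

def ranked(score):
    """Names ordered from highest to lowest under the given score function."""
    ordered = sorted(votes, key=lambda v: score(v[1], v[2]), reverse=True)
    return [name for name, _, _ in ordered]

by_diff = ranked(lambda up, down: up - down)
by_ratio = ranked(lambda up, down: up / (up + down))
by_wilson = ranked(wilson_lower_bound)

for label, order in [("up - down", by_diff), ("up / total", by_ratio), ("wilson", by_wilson)]:
    print(f"{label:>10}: {order}")
```

The raw ratio puts the single-vote review first, and the difference ranks SkincareCEO above NLee the Engineer purely on vote volume. The Wilson lower bound rewards a high helpful share only when enough votes back it up, which is why it is the score used for the final sort.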


# The End