
In [1]:

# IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES
# TO THE CORRECT LOCATION (/kaggle/input) IN YOUR NOTEBOOK,
# THEN FEEL FREE TO DELETE THIS CELL.
# NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S PYTHON
# ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
# NOTEBOOK.

import os
import sys
from tempfile import NamedTemporaryFile
from urllib.request import urlopen
from urllib.parse import unquote, urlparse
from urllib.error import HTTPError
from zipfile import ZipFile
import tarfile
import shutil

CHUNK_SIZE = 40960
DATA_SOURCE_MAPPING = 'amazon:https%3A%2F%2Fstorage.googleapis.com%2Fkaggle-data-sets%2F5721658%2F9420296%2Fbundle%2Farchive.zip%3FX-Goog-Algorithm%3DGOOG4-RSA-SHA256%26X-Goog-Credential%3Dgcp-kaggle-com%2540kaggle-161607.iam.gserviceaccount.com%252F20240929%252Fauto%252Fstorage%252Fgoog4_request%26X-Goog-Date%3D20240929T145141Z%26X-Goog-Expires%3D259200%26X-Goog-SignedHeaders%3Dhost%26X-Goog-Signature%3Db81782444f19ea80dadd93e450bc46801f7cb1da3ba9e3bfe041b0d0fdd30f6ff6b19caefc56c69c0fc716292ad43e6b9a062e33b4261297274eb6f14d76124d7ca11a5ba5e7fa76a15cd703837229d4dbbf473885bbf0ffaff078fdd770774753b92ae05520713b49995397712c4eac95e4d370e242fe95b7aad7524b47e7324041b4b37a6c59f48569b254c76299bfc0b5f7c7825fcffbe1c8c6a61d75736aab9e34cdb03adcda5f8e389aed3ce8ba1802c9b0c66f60d860c7aec7e0405b5f32197b7699d22158dedcd17e4289b1f26a756c8d70cb223d35acaf059b31145d1804cdf90fbc2f160d19c2bb61f7869fccac846f9f4419c8c4bb83b107d5d024'

KAGGLE_INPUT_PATH='/kaggle/input'
KAGGLE_WORKING_PATH='/kaggle/working'
KAGGLE_SYMLINK='kaggle'

!umount /kaggle/input/ 2> /dev/null
shutil.rmtree('/kaggle/input', ignore_errors=True)
os.makedirs(KAGGLE_INPUT_PATH, 0o777, exist_ok=True)
os.makedirs(KAGGLE_WORKING_PATH, 0o777, exist_ok=True)

try:
  os.symlink(KAGGLE_INPUT_PATH, os.path.join("..", 'input'), target_is_directory=True)
except FileExistsError:
  pass
try:
  os.symlink(KAGGLE_WORKING_PATH, os.path.join("..", 'working'), target_is_directory=True)
except FileExistsError:
  pass

for data_source_mapping in DATA_SOURCE_MAPPING.split(','):
    directory, download_url_encoded = data_source_mapping.split(':')
    download_url = unquote(download_url_encoded)
    filename = urlparse(download_url).path
    destination_path = os.path.join(KAGGLE_INPUT_PATH, directory)
    try:
        with urlopen(download_url) as fileres, NamedTemporaryFile() as tfile:
            total_length = fileres.headers['content-length']
            print(f'Downloading {directory}, {total_length} bytes compressed')
            dl = 0
            data = fileres.read(CHUNK_SIZE)
            while len(data) > 0:
                dl += len(data)
                tfile.write(data)
                done = int(50 * dl / int(total_length))
                sys.stdout.write(f"\r[{'=' * done}{' ' * (50-done)}] {dl} bytes downloaded")
                sys.stdout.flush()
                data = fileres.read(CHUNK_SIZE)
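            tfile.flush()  # make sure buffered bytes are on disk before re-opening by name
            if filename.endswith('.zip'):
              with ZipFile(tfile) as zfile:
                zfile.extractall(destination_path)
            else:
              # Bind the archive to its own name so the `tarfile` module is not shadowed.
              with tarfile.open(tfile.name) as tar:
                tar.extractall(destination_path)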
            print(f'\nDownloaded and uncompressed: {directory}')
    except HTTPError as e:
        print(f'Failed to load (likely expired) {download_url} to path {destination_path}')
        continue
    except OSError as e:
        print(f'Failed to load {download_url} to path {destination_path}')
        continue

print('Data source import complete.')


Downloading amazon, 721801 bytes compressed
Downloaded and uncompressed: amazon
Data source import complete.


**Case:** One of the most important problems in e-commerce is calculating post-sale product ratings correctly. Solving it means higher customer satisfaction for the e-commerce site, better visibility for sellers' products, and a smoother shopping experience for buyers. A second problem is ordering product reviews correctly. Misleading reviews that rise to the top directly affect a product's sales, so they cause both financial loss and loss of customers. Solving these two problems helps the site and its sellers increase sales and lets customers complete their purchases without friction.

This dataset contains Amazon product data, including product categories and various metadata. It covers the user ratings and reviews of the most-reviewed product in the Electronics category.

**Task 1:** Calculate the Average Rating based on current reviews and compare with the existing average rating.

**Task 2:** Determine 20 reviews to be displayed on the product detail page for the product.

**Variables**

reviewerID - ID of the reviewer

asin - ID of the product

reviewerName - name of the reviewer

helpful - helpfulness rating of the review

reviewText - text of the review

overall - rating of the product

summary - summary of the review

unixReviewTime - time of the review (unix time)

reviewTime - time of the review (raw)

day_diff - Number of days since the review

helpful_yes - Number of times the review was found helpful

total_vote - Number of votes given to the review

In [None]:
import pandas as pd

# for the Wilson lower bound:
import math
import scipy.stats as st

In [None]:
df=pd.read_csv("/kaggle/input/amazon/amazon_review.csv")
df.head()

**Task 1: Calculate the Average Rating based on current reviews and compare with the existing average rating.**

In the shared dataset, users have rated and reviewed a product. The aim of this task is to weight the ratings by date, then compare the plain average rating with the date-weighted average rating.

**Step 1:** Calculate the average rating of the product.

In [None]:
df["overall"].mean()

**Step 2:** Calculate the weighted average score by date.

In [None]:
def average_score(data, w1=50, w2=25, w3=15, w4=10):
    # Quartile boundaries of day_diff; a smaller day_diff means a more recent review.
    q25, q50, q75 = data["day_diff"].quantile([0.25, 0.50, 0.75])
    return data.loc[data["day_diff"] <= q25, "overall"].mean() * w1 / 100 + \
           data.loc[(data["day_diff"] > q25) & (data["day_diff"] <= q50), "overall"].mean() * w2 / 100 + \
           data.loc[(data["day_diff"] > q50) & (data["day_diff"] <= q75), "overall"].mean() * w3 / 100 + \
           data.loc[data["day_diff"] > q75, "overall"].mean() * w4 / 100

average_score(df)
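
The same weighting can be sketched more compactly with `pd.qcut`, which labels each review with its `day_diff` quartile. This is only an illustration on made-up data: the `toy` frame and the `time_weighted_rating` helper are hypothetical names, not part of the dataset or the notebook. `pd.qcut` uses right-closed bins, which matches the `<=` comparisons above, but by default it raises an error when two quartile edges coincide.

```python
import pandas as pd

# Toy data (hypothetical): 8 reviews, two per quartile of day_diff.
toy = pd.DataFrame({
    "day_diff": [10, 20, 100, 120, 300, 320, 600, 650],
    "overall":  [5, 5, 4, 4, 3, 3, 2, 2],
})

def time_weighted_rating(data, weights=(50, 25, 15, 10)):
    """Weighted mean of per-quartile ratings; the first weight applies to the newest reviews."""
    # Label each review with its day_diff quartile (0 = newest, 3 = oldest).
    quartile = pd.qcut(data["day_diff"], q=4, labels=False)
    period_means = data.groupby(quartile)["overall"].mean()
    return sum(period_means[i] * w / 100 for i, w in enumerate(weights))

print(toy["overall"].mean())      # plain mean: 3.5
print(time_weighted_rating(toy))  # 5*0.50 + 4*0.25 + 3*0.15 + 2*0.10, about 4.15
```

Because the newest quartile carries half of the weight, the weighted score moves toward recent ratings: here it is higher than the plain mean because the recent toy reviews are the better ones.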

**Step 3:** Compare and interpret the average of each time period in the weighted score.

In [None]:
def period(data, w1=50, w2=25, w3=15, w4=10):
    # Unweighted mean rating of each day_diff quartile (Q1 = newest reviews).
    q1 = data.loc[data["day_diff"] <= data["day_diff"].quantile(0.25), "overall"].mean()
    q2 = data.loc[(data["day_diff"] > data["day_diff"].quantile(0.25)) & (data["day_diff"] <= data["day_diff"].quantile(0.50)), "overall"].mean()
    q3 = data.loc[(data["day_diff"] > data["day_diff"].quantile(0.50)) & (data["day_diff"] <= data["day_diff"].quantile(0.75)), "overall"].mean()
    q4 = data.loc[data["day_diff"] > data["day_diff"].quantile(0.75), "overall"].mean()
    # Apply each weight exactly once when combining the period means.
    weighted_avg = q1 * w1 / 100 + q2 * w2 / 100 + q3 * w3 / 100 + q4 * w4 / 100

    return {
        "Q1 (Newest 25%)": q1,
        "Q2": q2,
        "Q3": q3,
        "Q4 (Oldest 25%)": q4,
        "Weighted Average": weighted_avg
    }

period(df)

**Task 2: Determine 20 reviews to be displayed on the product detail page for the product.**

**Step 1:** Generate the helpful_no variable.

• total_vote is the total number of up and down votes given to a review.

• up means helpful.

The dataset has no helpful_no variable, so it must be derived from the existing variables: subtract the number of helpful votes (helpful_yes) from the total number of votes (total_vote) to get the number of unhelpful votes (helpful_no).

In [None]:
df["helpful_no"]=df["total_vote"]-df["helpful_yes"]
df.head()

**Step 2:** Calculate score_pos_neg_diff, score_average_rating and wilson_lower_bound scores and add them to the data.

score_pos_neg_diff: Helpful votes - Unhelpful votes

score_average_rating: Helpful votes / Total votes

wilson_lower_bound: The lower bound of the Wilson score confidence interval for the share of positive votes. It is used to rank items reliably when ratings are binary (e.g., "like/dislike", "up/down"). Because the bound is pulled down when there are few votes, a review with a small number of positive votes is not ranked above a review with a large number of votes and a somewhat lower helpful share. Reviews with only a few positive votes therefore stay lower in the ranking.
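
In symbols, the bound described above can be written as follows, where $n$ is the number of up votes plus down votes, $\hat{p}$ is the share of up votes, and $z$ is the standard normal quantile for the chosen confidence level (about 1.96 for 95%):

$$
\text{WLB} = \frac{\hat{p} + \dfrac{z^2}{2n} - z\sqrt{\dfrac{\hat{p}(1-\hat{p}) + \dfrac{z^2}{4n}}{n}}}{1 + \dfrac{z^2}{n}},
\qquad \hat{p} = \frac{\text{up}}{n}, \quad n = \text{up} + \text{down}
$$

When $n = 0$ the expression is undefined, so a score of 0 is used in that case.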

• Create scores according to score_pos_neg_diff, then save them in df under the name score_pos_neg_diff.

• Create scores according to score_average_rating, then save them in df under the name score_average_rating.

• Create scores according to wilson_lower_bound, then save them in df under the name wilson_lower_bound.

In [None]:
df["score_pos_neg_diff"] = df["helpful_yes"] - df["helpful_no"]
# Reviews with no votes give 0 / 0 = NaN here.
df["score_average_rating"] = df["helpful_yes"] / df["total_vote"]


def wilson_lower_bound(up, down, confidence=0.95):
    n = up + down
    if n == 0:
        return 0
    z = st.norm.ppf(1 - (1 - confidence) / 2)
    phat = 1.0 * up / n
    return (phat + z * z / (2 * n) - z * math.sqrt((phat * (1 - phat) + z * z / (4 * n)) / n)) / (1 + z * z / n)

df["wilson_lower_bound"]=df.apply(lambda x: wilson_lower_bound(x["helpful_yes"], x["helpful_no"]), axis=1)
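
To see how the three scores can disagree, here is a small sketch on made-up vote counts. It also shows one possible vectorized form of the Wilson formula with NumPy, as an alternative to the row-wise `apply`. The names `wilson_lower_bound_vec` and `toy` are hypothetical and used only for this illustration.

```python
import numpy as np
import pandas as pd
import scipy.stats as st

def wilson_lower_bound_vec(up, down, confidence=0.95):
    """Vectorized Wilson lower bound; rows with no votes get 0."""
    up = np.asarray(up, dtype=float)
    down = np.asarray(down, dtype=float)
    n = up + down
    z = st.norm.ppf(1 - (1 - confidence) / 2)
    safe_n = np.where(n > 0, n, 1.0)  # placeholder value to avoid division by zero
    phat = up / safe_n
    bound = (phat + z**2 / (2 * safe_n)
             - z * np.sqrt((phat * (1 - phat) + z**2 / (4 * safe_n)) / safe_n)) / (1 + z**2 / safe_n)
    return np.where(n > 0, bound, 0.0)

# Hypothetical reviews: few unanimous votes, many mostly positive votes, no votes.
toy = pd.DataFrame({"helpful_yes": [2, 100, 0], "helpful_no": [0, 20, 0]})
toy["score_pos_neg_diff"] = toy["helpful_yes"] - toy["helpful_no"]
total = toy["helpful_yes"] + toy["helpful_no"]
toy["score_average_rating"] = (toy["helpful_yes"] / total).fillna(0)
toy["wilson_lower_bound"] = wilson_lower_bound_vec(toy["helpful_yes"], toy["helpful_no"])
print(toy)
```

In this toy frame, score_average_rating puts the review with 2 of 2 helpful votes first, while wilson_lower_bound puts the review with 100 of 120 helpful votes first, because two votes are weak evidence.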

**Step 3:** Rank the reviews by wilson_lower_bound and show the top 20.

In [None]:
df.sort_values("wilson_lower_bound", ascending=False).head(20)