# AIN212 Fall 2023 Project Assignment
Rating prediction on Women's E-Commerce Clothing Reviews dataset.

### Dataset: We will be working on the Women's E-Commerce Clothing Reviews dataset from Kaggle.
https://www.kaggle.com/datasets/nicapotato/womens-ecommerce-clothing-reviews

* Authors : Alperen Demirci(2220765010) & Bora Dere(2220765021)

* Course : AIN212 - Data Science

* Emails : alperendemirci@hacettepe.edu.tr & boradere@hacettepe.edu.tr , b2220765010@cs.hacettepe.edu.tr & b2220765021@cs.hacettepe.edu.tr

* Date : 28/12/2023

* Description : This project is about classifying reviews written by women who bought clothes from an online shopping site. Each review is rated from 1 to 5, and we will classify every entry into one of 5 bins: very bad, bad, neutral, good, very good. The dataset is explained in the next cell.
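
As a minimal sketch of the binning described above, the five ratings can be mapped to the five label names with a plain dictionary. The column name `Rating Label`, the helper `add_rating_label` and the toy frame are assumptions made for illustration; they are not part of the dataset.

```python
import pandas as pd

# Assumed one-to-one mapping from the 1-5 rating to the five bins named in the description.
RATING_LABELS = {1: "very bad", 2: "bad", 3: "neutral", 4: "good", 5: "very good"}


def add_rating_label(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `frame` with a hypothetical `Rating Label` column."""
    out = frame.copy()
    out["Rating Label"] = out["Rating"].map(RATING_LABELS)
    return out


# Toy ratings standing in for the real `Rating` column.
example = pd.DataFrame({"Rating": [4, 5, 3, 1, 2]})
labelled = add_rating_label(example)
print(labelled)
```

Since the mapping is one-to-one, the labels carry the same information as the integer ratings; they mainly make plots and reports easier to read.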

## Explaining the dataset:
There are 11 columns in the raw file: an unnamed index column (`Unnamed: 0`) and the 10 columns described below.
- <span style="color: red;">Clothing ID:</span> Integer Categorical variable that refers to the specific piece being reviewed.
- <span style="color: red;">Age:</span> Positive Integer variable of the reviewer's age.
- <span style="color: red;">Title:</span> String variable for the title of the review.
- <span style="color: red;">Review Text:</span> String variable for the review body.
- <span style="color: red;">Rating:</span> Positive Ordinal Integer variable for the product score granted by the customer, from 1 (worst) to 5 (best).
- <span style="color: red;">Recommended IND:</span> Binary variable stating whether the customer recommends the product, where 1 is recommended and 0 is not recommended.
- <span style="color: red;">Positive Feedback Count:</span> Positive Integer documenting the number of other customers who found this review positive.
- <span style="color: red;">Division Name:</span> Categorical name of the product's high-level division.
- <span style="color: red;">Department Name:</span> Categorical name of the product's department.
- <span style="color: red;">Class Name:</span> Categorical name of the product's class.

## Our approach:

* First, we clean the data (remove the null values, remove the unnecessary columns, etc.).
* Then we visualize the data in order to understand it better.
* After that, we use different classification algorithms to predict the ratings of the reviews (Naive Bayes, Decision Tree, Random Forest, etc.).
* Finally, we compare the results of the algorithms and choose the best one.
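
The comparison step could look roughly like the sketch below. It assumes scikit-learn is available and uses a bag-of-words representation (`CountVectorizer`), which is our choice for illustration rather than a decision made in this notebook. The texts and ratings are made-up stand-ins, so the scores say nothing about the real dataset.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Made-up stand-in data; in the notebook these would be the cleaned reviews and ratings.
texts = [
    "love this dress so pretty",
    "perfect fit love the fabric",
    "great quality very flattering",
    "love it perfect and comfortable",
    "terrible fit had to return",
    "poor quality very disappointed",
    "awful fabric cheap and itchy",
    "disappointed had to return it",
]
ratings = [5, 5, 5, 5, 1, 1, 1, 1]

# The three algorithms named in the approach list.
candidates = {
    "naive_bayes": MultinomialNB(),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Mean cross-validated accuracy per model (2 folds, because the toy set is tiny).
scores = {}
for name, classifier in candidates.items():
    pipeline = make_pipeline(CountVectorizer(), classifier)
    scores[name] = cross_val_score(pipeline, texts, ratings, cv=2).mean()

best_name = max(scores, key=scores.get)
print(scores)
print("best on this toy data:", best_name)
```

Wrapping the vectorizer and the classifier in one pipeline keeps the vocabulary from being fitted on the validation folds.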

## Personal Thoughts and Comments:

* Since our problem and data are close to the spam classification problem, I think the fastest and most accurate algorithm will be Naive Bayes.
* Since we are not familiar with deep NLP concepts, we will try to solve this problem using only statistics and the essentials of data science. In my opinion, this will be challenging but educational for us.
* Since our most valuable column consists of strings, cleaning the data will be the most challenging part of this project.
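
To give an idea of what cleaning a string column can involve, here is a small sketch that uses only the standard library. The helper `clean_text` and its rules (lowercasing, replacing punctuation with spaces, collapsing whitespace) are assumptions for illustration; the notebook may end up using different rules.

```python
import re


def clean_text(text: str) -> str:
    """Lowercase, replace anything that is not a letter, digit or space, collapse whitespace."""
    lowered = text.lower()
    alphanumeric = re.sub(r"[^a-z0-9\s]", " ", lowered)
    return re.sub(r"\s+", " ", alphanumeric).strip()


sample = "Love this dress!  it's sooo pretty."
print(clean_text(sample))  # love this dress it s sooo pretty
```

Note that this simple rule splits contractions such as "it's" into two tokens, which may or may not be desirable.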

## Imports and loading the dataset

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [11]:
data = pd.read_csv('Womens Clothing E-Commerce Reviews.csv')  # read_csv already returns a DataFrame
df = data.copy()  # work on a copy so the raw data stays untouched

In [12]:
df.head()

| | Unnamed: 0 | Clothing ID | Age | Title | Review Text | Rating | Recommended IND | Positive Feedback Count | Division Name | Department Name | Class Name |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 767 | 33 | NaN | Absolutely wonderful - silky and sexy and comf... | 4 | 1 | 0 | Initmates | Intimate | Intimates |
| 1 | 1 | 1080 | 34 | NaN | Love this dress! it's sooo pretty. i happene... | 5 | 1 | 4 | General | Dresses | Dresses |
| 2 | 2 | 1077 | 60 | Some major design flaws | I had such high hopes for this dress and reall... | 3 | 0 | 0 | General | Dresses | Dresses |
| 3 | 3 | 1049 | 50 | My favorite buy! | I love, love, love this jumpsuit. it's fun, fl... | 5 | 1 | 0 | General Petite | Bottoms | Pants |
| 4 | 4 | 847 | 47 | Flattering shirt | This shirt is very flattering to all due to th... | 5 | 1 | 6 | General | Tops | Blouses |


## Data preprocessing

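First we check the numerical columns for null values: in the output of `df.describe()`, a `count` lower than the number of rows would indicate missing entries.

Then we check the categorical columns for null values.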

In [4]:
df.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Unnamed: 0 | 23486.0 | 11742.5 | 6779.968547 | 0.0 | 5871.25 | 11742.5 | 17613.75 | 23485.0 |
| Clothing ID | 23486.0 | 918.118709 | 203.29898 | 0.0 | 861.0 | 936.0 | 1078.0 | 1205.0 |
| Age | 23486.0 | 43.198544 | 12.279544 | 18.0 | 34.0 | 41.0 | 52.0 | 99.0 |
| Rating | 23486.0 | 4.196032 | 1.110031 | 1.0 | 4.0 | 5.0 | 5.0 | 5.0 |
| Recommended IND | 23486.0 | 0.822362 | 0.382216 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| Positive Feedback Count | 23486.0 | 2.535936 | 5.702202 | 0.0 | 0.0 | 1.0 | 3.0 | 122.0 |


The `Unnamed: 0` column is a leftover index column that simply repeats the row number, so it is redundant and we drop it.

In [13]:
df.drop(columns='Unnamed: 0',inplace=True)

In [14]:
df.head()

| | Clothing ID | Age | Title | Review Text | Rating | Recommended IND | Positive Feedback Count | Division Name | Department Name | Class Name |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 767 | 33 | NaN | Absolutely wonderful - silky and sexy and comf... | 4 | 1 | 0 | Initmates | Intimate | Intimates |
| 1 | 1080 | 34 | NaN | Love this dress! it's sooo pretty. i happene... | 5 | 1 | 4 | General | Dresses | Dresses |
| 2 | 1077 | 60 | Some major design flaws | I had such high hopes for this dress and reall... | 3 | 0 | 0 | General | Dresses | Dresses |
| 3 | 1049 | 50 | My favorite buy! | I love, love, love this jumpsuit. it's fun, fl... | 5 | 1 | 0 | General Petite | Bottoms | Pants |
| 4 | 847 | 47 | Flattering shirt | This shirt is very flattering to all due to th... | 5 | 1 | 6 | General | Tops | Blouses |


In [6]:
df.isnull().sum()

Clothing ID                   0
Age                           0
Title                      3810
Review Text                 845
Rating                        0
Recommended IND               0
Positive Feedback Count       0
Division Name                14
Department Name              14
Class Name                   14
dtype: int64

As seen, the Title column has 3810 null values. To handle this, we concatenated the Title and Review Text columns and then dropped the Review Text column.

* The concatenation format is: Title + ": " + Review Text

* We did not drop the rows with a null Review Text, since a customer may have written the whole review in the Title column (e.g. Title = "I love it!" and Review Text = null).

* For the Division Name, Department Name and Class Name columns, we filled the null values with each column's mode (mode imputation).

In [None]:
## Code for title and review concatenation and removing null values
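
The cell above is still a placeholder. A possible sketch of the concatenation, under stated assumptions, is given here: the result is stored in a hypothetical `Review` column, both original text columns are dropped afterwards, the ": " separator is only used when both parts exist, and rows with no text at all are removed. The toy frame stands in for `df`.

```python
import pandas as pd

# Toy rows covering the four cases: both present, title missing, body missing, both missing.
toy = pd.DataFrame({
    "Title": ["My favorite buy!", None, "I love it!", None],
    "Review Text": ["Fun and flattering.", "Love this dress!", None, None],
})


def combine_title_and_review(frame: pd.DataFrame) -> pd.DataFrame:
    """Merge Title and Review Text into one hypothetical `Review` column."""
    out = frame.copy()
    title = out["Title"].fillna("").str.strip()
    body = out["Review Text"].fillna("").str.strip()
    both = (title != "") & (body != "")
    # "Title: Review Text" when both exist, otherwise whichever part is present.
    out["Review"] = (title + ": " + body).where(both, title + body)
    out = out.drop(columns=["Title", "Review Text"])
    # Rows with no text at all carry no information for a text classifier.
    return out[out["Review"] != ""].reset_index(drop=True)


combined = combine_title_and_review(toy)
print(combined)
```

On the real data the same function would be called as `combine_title_and_review(df)`.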

In [15]:
## Code for imputing mode values in place of null values in division, department and class columns
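
This cell is also a placeholder. A minimal sketch of mode imputation for the three categorical columns could look like this; the helper `impute_mode` and the toy frame are assumptions for illustration.

```python
import pandas as pd

# Toy frame with one missing value per categorical column.
toy = pd.DataFrame({
    "Division Name": ["General", "General", None, "General Petite"],
    "Department Name": ["Dresses", None, "Dresses", "Tops"],
    "Class Name": ["Dresses", "Dresses", "Blouses", None],
})


def impute_mode(frame: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Fill nulls in each listed column with that column's most frequent value."""
    out = frame.copy()
    for col in columns:
        # mode() can return several values on ties; we take the first one.
        mode_value = out[col].mode(dropna=True).iloc[0]
        out[col] = out[col].fillna(mode_value)
    return out


filled = impute_mode(toy, ["Division Name", "Department Name", "Class Name"])
print(filled)
```

With only 14 missing values per column in the real data, mode imputation should barely change the category distributions.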

* But hey, what about the null values in the Review Text column? After the concatenation, we dropped the rows that still had no text, due to lack of information. We cannot generate fake reviews :)

In [9]:
df.head()

| | Clothing ID | Age | Title | Review Text | Rating | Recommended IND | Positive Feedback Count | Division Name | Department Name | Class Name |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 767 | 33 | NaN | Absolutely wonderful - silky and sexy and comf... | 4 | 1 | 0 | Initmates | Intimate | Intimates |
| 1 | 1080 | 34 | NaN | Love this dress! it's sooo pretty. i happene... | 5 | 1 | 4 | General | Dresses | Dresses |
| 2 | 1077 | 60 | Some major design flaws | I had such high hopes for this dress and reall... | 3 | 0 | 0 | General | Dresses | Dresses |
| 3 | 1049 | 50 | My favorite buy! | I love, love, love this jumpsuit. it's fun, fl... | 5 | 1 | 0 | General Petite | Bottoms | Pants |
| 4 | 847 | 47 | Flattering shirt | This shirt is very flattering to all due to th... | 5 | 1 | 6 | General | Tops | Blouses |


<span style="color: cyan;">I suggest we merge the title and the review text, using the format (title): (content). That way we do not throw away the empty ones and we also solve the problem of a column full of nulls. Besides, division and department name are already categorical (one has 3 unique values, the other 6), so instead of dropping them let's one-hot encode them, which also increases the number of features. Yes, Naive Bayes makes a lot of sense, because what we are doing resembles a spam classifier, which is what Naive Bayes is good at; we are even using the same method.
</span>
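
The one-hot encoding suggested in the note above could be sketched with `pd.get_dummies`. The toy frame is a stand-in for `df`, and `dtype=int` is our choice so that the new columns hold 0/1 instead of booleans.

```python
import pandas as pd

# Toy stand-in for df with the two low-cardinality categorical columns.
toy = pd.DataFrame({
    "Age": [33, 34, 60],
    "Division Name": ["Initmates", "General", "General Petite"],
    "Department Name": ["Intimate", "Dresses", "Bottoms"],
})

# Each category becomes its own 0/1 column; the original columns are removed.
encoded = pd.get_dummies(toy, columns=["Division Name", "Department Name"], dtype=int)
print(encoded.columns.tolist())
```

On the real data this would add 3 + 6 = 9 indicator columns in place of the two original ones.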

<span style="color: yellow;">Buddy, we could also just call division, department name and so on unnecessary and drop them at the start. From here on it is just train/test split and prediction anyway. ChatGPT said we could do it with MultinomialNB or something similar. I'm going down to eat, we'll look into it afterwards.</span>
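
The train/test split and `MultinomialNB` step mentioned in the note above could be sketched as follows, assuming scikit-learn is available. The texts and ratings are made up, and the vectorizer choice, split size and random seed are assumptions for illustration, not settings taken from this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up stand-in data; the real inputs would be the combined review text and Rating.
texts = [
    "love this dress so pretty",
    "perfect fit love the fabric",
    "great quality very flattering",
    "love it perfect and comfortable",
    "terrible fit had to return",
    "poor quality very disappointed",
    "awful fabric cheap and itchy",
    "disappointed had to return it",
]
ratings = [5, 5, 5, 5, 1, 1, 1, 1]

# Stratify so that every rating appears in both the training and the test part.
X_train, X_test, y_train, y_test = train_test_split(
    texts, ratings, test_size=0.25, stratify=ratings, random_state=42
)

# Word counts feed a multinomial Naive Bayes model, as in a typical spam classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print("held-out accuracy on toy data:", model.score(X_test, y_test))
```

Because the real ratings are heavily skewed towards 4 and 5, stratifying the split is likely worth keeping when this is applied to `df`.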

In [10]:
for col in df.columns:
    print(col, df[col].nunique())

Clothing ID 1179
Age 77
Title 13992
Review Text 22634
Rating 5
Recommended IND 2
Positive Feedback Count 82
Division Name 3
Department Name 6
Class Name 20
