<a href="https://colab.research.google.com/github/Ayavie/Wish.com-Product-Rating-Prediction.ipynb/blob/main/Wish_com_Product_Rating_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Problem Formulation ✅

We have a dataset from an e-commerce website. Each row describes the features of one product together with its rating on the website.

Our goal is to build a machine learning model that can classify the rating of a new product from its features.


The input features available for each product are:
```
      'price', 'retail_price', 'currency_buyer', 'units_sold',
       'uses_ad_boosts', 'rating_count', 'badges_count',
       'badge_local_product', 'badge_product_quality', 'badge_fast_shipping',
       'tags', 'product_color', 'product_variation_size_id',
       'product_variation_inventory', 'shipping_option_name',
       'shipping_option_price', 'shipping_is_express', 'countries_shipped_to',
       'inventory_total', 'has_urgency_banner', 'urgency_text',
       'origin_country', 'merchant_title', 'merchant_name',
       'merchant_info_subtitle', 'merchant_rating_count', 'merchant_rating',
       'merchant_id', 'merchant_has_profile_picture',
       'merchant_profile_picture', 'theme', 'crawl_month', 'id' 

```

The target variable is `'rating'`.

We'll treat this as a classification problem, since the ratings take a small set of discrete values.

The experimental protocol is to train a Decision Tree with arbitrarily chosen hyperparameter values and then search for better values with GridSearchCV, to repeat the same procedure with Random Forest and SVM, and finally to apply Naive Bayes.

The preprocessing is explained step by step as we investigate the dataset.
The main challenge is that the dataset is very messy and needs a lot of preprocessing.

The real-life stakes of this project are low, as it only predicts the rating of a product offered for sale.

In my opinion, a good solution reaches above 80% accuracy.

Tuning with GridSearchCV is expected to give better results than arbitrarily chosen hyperparameter values, although an improvement on unseen data is not guaranteed.
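As a rough illustration of the protocol described above, the sketch below fits a Decision Tree with arbitrarily chosen hyperparameters and then tunes it with `GridSearchCV`. The synthetic data from `make_classification`, the parameter grid, and all variable names are assumptions made for this example; they are not values taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the cleaned product features and rating classes (assumption)
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           n_classes=3, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Baseline: a tree with arbitrarily chosen hyperparameters
baseline = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

# Hypothetical search grid; the ranges are illustrative only
param_grid = {"max_depth": [2, 4, 6, None], "min_samples_leaf": [1, 5, 10]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid,
                      cv=5, scoring="accuracy")
search.fit(X_train, y_train)

# Compare both models on the held-out split
baseline_acc = baseline.score(X_val, y_val)
tuned_acc = search.best_estimator_.score(X_val, y_val)
print(search.best_params_, baseline_acc, tuned_acc)
```

The tuned model is selected on cross-validation accuracy, so comparing both models on a held-out split is what shows whether the search actually helped.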

## Importing Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

## Investigating the dataset

In [None]:
#Loading the dataset into pandas dataframe
df=pd.read_csv('train_new.csv')

#Reading few lines
df.head(5)

Unnamed: 0,price,retail_price,currency_buyer,units_sold,uses_ad_boosts,rating,rating_count,badges_count,badge_local_product,badge_product_quality,badge_fast_shipping,tags,product_color,product_variation_size_id,product_variation_inventory,shipping_option_name,shipping_option_price,shipping_is_express,countries_shipped_to,inventory_total,has_urgency_banner,urgency_text,origin_country,merchant_title,merchant_name,merchant_info_subtitle,merchant_rating_count,merchant_rating,merchant_id,merchant_has_profile_picture,merchant_profile_picture,theme,crawl_month,id
0,8.0,7,EUR,10000,1,4.0,1670,0,0,0,0,"Summer,soildcolor,Plus Size,Tank,camisole,Tops...",yellow,M,50,Livraison standard,3,0,30,50,,,CN,keepahorse,keepahorse,"88 % avis positifs (66,644 notes)",66644,4.137582,577fb2b368116418674befd9,1,https://s3-us-west-1.amazonaws.com/sweeper-pro...,summer,2020-08,272
1,8.0,7,EUR,1000,0,4.0,843,1,1,0,0,"bathing suit,Plus Size,bikini set,sexy swimsui...",black,L,50,Livraison standard,3,0,37,50,,,CN,shanghaimingqidianqikejiyouxiangongsi,上海铭启电气科技有限公司,"91 % avis positifs (25,752 notes)",25752,4.256873,566a3ef17233ff2686443082,0,,summer,2020-08,1515
2,6.0,34,EUR,100,0,4.0,10,0,0,0,0,"Summer,Vest,momshirt,Get,summer t-shirts,funny...",white,L,50,Livraison standard,2,0,41,50,,,CN,zhaodongmei,zhaodongmei,"83 % avis positifs (3,897 notes)",3897,3.92892,5aacd6c1ccf0c83e5a6f3de0,0,,summer,2020-08,940
3,4.67,4,EUR,100,1,3.0,82,0,0,0,0,"Summer,Shorts,pants,Beach,Plus Size,beachpant,...",lakeblue,XS,5,Livraison standard,1,0,25,50,,,CN,pookie0331,pookie0331,"87 % avis positifs (7,497 notes)",7497,4.079365,583141fbfef4094e51453d9b,0,,summer,2020-08,309
4,8.0,11,EUR,1000,0,4.0,127,0,0,0,0,"Summer,Floral print,women dresses,fashion dres...",apricot,S,50,Livraison standard,3,0,15,50,1.0,Quantité limitée !,CN,shitongyi12683626,shitongyi12683626,91 % avis positifs (173 notes),173,4.306358,5d3bc9f8e13a7e654424a4cb,0,,summer,2020-08,226


In [None]:
#Checking dataset size
df.shape

(1094, 34)

In [None]:
#Checking the features of the dataset

df.columns

Index(['price', 'retail_price', 'currency_buyer', 'units_sold',
       'uses_ad_boosts', 'rating', 'rating_count', 'badges_count',
       'badge_local_product', 'badge_product_quality', 'badge_fast_shipping',
       'tags', 'product_color', 'product_variation_size_id',
       'product_variation_inventory', 'shipping_option_name',
       'shipping_option_price', 'shipping_is_express', 'countries_shipped_to',
       'inventory_total', 'has_urgency_banner', 'urgency_text',
       'origin_country', 'merchant_title', 'merchant_name',
       'merchant_info_subtitle', 'merchant_rating_count', 'merchant_rating',
       'merchant_id', 'merchant_has_profile_picture',
       'merchant_profile_picture', 'theme', 'crawl_month', 'id'],
      dtype='object')

In [None]:
#Getting description overview of data
df.describe()

Unnamed: 0,price,retail_price,units_sold,uses_ad_boosts,rating,rating_count,badges_count,badge_local_product,badge_product_quality,badge_fast_shipping,product_variation_inventory,shipping_option_price,shipping_is_express,countries_shipped_to,inventory_total,has_urgency_banner,merchant_rating_count,merchant_rating,merchant_has_profile_picture,id
count,1094.0,1094.0,1094.0,1094.0,1094.0,1094.0,1094.0,1094.0,1094.0,1094.0,1094.0,1094.0,1094.0,1094.0,1094.0,349.0,1094.0,1094.0,1094.0,1094.0
mean,8.320786,23.712066,4518.661792,0.432358,4.006399,915.523766,0.11426,0.019196,0.083181,0.011883,32.80713,2.344607,0.002742,40.484461,49.849177,1.0,26783.99,4.036191,0.155393,772.207495
std,4.027448,31.802661,9825.608568,0.49563,0.585706,2037.561158,0.348461,0.137275,0.276282,0.108409,21.454655,1.044781,0.052318,20.196229,2.480687,0.0,86161.71,0.200397,0.362445,453.78443
min,1.0,2.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,8.0,1.0,1.0,4.0,2.941176,0.0,2.0
25%,5.83,7.0,100.0,0.0,4.0,24.0,0.0,0.0,0.0,0.0,6.0,2.0,0.0,31.0,50.0,1.0,1926.0,3.918464,0.0,382.25
50%,8.0,10.0,1000.0,0.0,4.0,143.5,0.0,0.0,0.0,0.0,50.0,2.0,0.0,40.0,50.0,1.0,8225.0,4.04517,0.0,752.5
75%,11.0,26.0,5000.0,1.0,4.0,845.5,0.0,0.0,0.0,0.0,50.0,3.0,0.0,43.0,50.0,1.0,23062.0,4.166826,0.0,1166.5
max,49.0,252.0,100000.0,1.0,6.0,20744.0,3.0,1.0,1.0,1.0,50.0,12.0,1.0,140.0,50.0,1.0,2174765.0,4.577519,1.0,1572.0


In [None]:
#Getting some info on the data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1094 entries, 0 to 1093
Data columns (total 34 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   price                         1094 non-null   float64
 1   retail_price                  1094 non-null   int64  
 2   currency_buyer                1094 non-null   object 
 3   units_sold                    1094 non-null   int64  
 4   uses_ad_boosts                1094 non-null   int64  
 5   rating                        1094 non-null   float64
 6   rating_count                  1094 non-null   int64  
 7   badges_count                  1094 non-null   int64  
 8   badge_local_product           1094 non-null   int64  
 9   badge_product_quality         1094 non-null   int64  
 10  badge_fast_shipping           1094 non-null   int64  
 11  tags                          1094 non-null   object 
 12  product_color                 1065 non-null   object 
 13  pro

In [None]:
#Checking duplicated records presence and their count
df.duplicated().value_counts()

False    1094
dtype: int64

In [None]:
#Checking nulls
df.isnull().sum()

price                             0
retail_price                      0
currency_buyer                    0
units_sold                        0
uses_ad_boosts                    0
rating                            0
rating_count                      0
badges_count                      0
badge_local_product               0
badge_product_quality             0
badge_fast_shipping               0
tags                              0
product_color                    29
product_variation_size_id        12
product_variation_inventory       0
shipping_option_name              0
shipping_option_price             0
shipping_is_express               0
countries_shipped_to              0
inventory_total                   0
has_urgency_banner              745
urgency_text                    745
origin_country                   12
merchant_title                    0
merchant_name                     4
merchant_info_subtitle            0
merchant_rating_count             0
merchant_rating             

In [None]:
#Checking the value counts of every column
for col in df.columns:
    print(col)
    print(df[col].value_counts())
    print("------------------------")

price
8.00     196
11.00    135
7.00      89
6.00      88
9.00      88
        ... 
25.00      1
5.70       1
3.90       1
4.69       1
4.73       1
Name: price, Length: 117, dtype: int64
------------------------
retail_price
7      127
6      103
10      86
11      73
5       71
      ... 
87       1
139      1
37       1
53       1
57       1
Name: retail_price, Length: 95, dtype: int64
------------------------
currency_buyer
EUR    1094
Name: currency_buyer, dtype: int64
------------------------
units_sold
100       353
1000      280
5000      139
10000     125
20000      79
50         54
10         37
50000      12
100000      5
8           4
1           3
2           2
7           1
Name: units_sold, dtype: int64
------------------------
uses_ad_boosts
0    621
1    473
Name: uses_ad_boosts, dtype: int64
------------------------
rating
4.0    774
5.0    170
3.0    135
2.0     12
1.0      2
6.0      1
Name: rating, dtype: int64
------------------------
rating_count
0       30
2    

# Cleaning the data

First, let's drop the outlier in the `'rating'` column: one record has a rating of 6, while every other record is rated between 1 and 5.

In [None]:
#Getting the index of the record

df[df['rating']==6]

Unnamed: 0,price,retail_price,currency_buyer,units_sold,uses_ad_boosts,rating,rating_count,badges_count,badge_local_product,badge_product_quality,badge_fast_shipping,tags,product_color,product_variation_size_id,product_variation_inventory,shipping_option_name,shipping_option_price,shipping_is_express,countries_shipped_to,inventory_total,has_urgency_banner,urgency_text,origin_country,merchant_title,merchant_name,merchant_info_subtitle,merchant_rating_count,merchant_rating,merchant_id,merchant_has_profile_picture,merchant_profile_picture,theme,crawl_month,id
971,49.0,42,EUR,100,0,6.0,6,1,0,0,1,"party,Goth,Plus Size,Summer,Dress,V-neck,bodyc...",Rose red,XXXXXL,2,Livraison Express,12,1,41,2,,,CN,Yanfeng Living museum,yanfenglivingmuseum,"90 % avis positifs (29,977 notes)",29977,4.223138,5925555b2cd9447854443c41,0,,summer,2020-08,118


In [None]:
#Dropping the record with the rating outlier using its index label
df.drop(index=971, inplace=True)
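Dropping by a hard-coded index works for this file, but it depends on the row order of the CSV. A minimal alternative sketch, assuming valid ratings lie between 1 and 5, filters on the value itself. The toy dataframe `ratings` is made up for illustration.

```python
import pandas as pd

# Toy stand-in for the 'rating' column; the index label 971 mimics the outlier row
ratings = pd.DataFrame({"rating": [4.0, 6.0, 3.0, 5.0, 1.0]},
                       index=[10, 971, 12, 13, 14])

# Keep only rows whose rating falls in the assumed valid range [1, 5]
valid = ratings[ratings["rating"].between(1, 5)]
print(valid)
```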

Since the columns `'merchant_profile_picture'` and `'urgency_text'` have a lot of missing values relative to the dataset size, we'll drop them. The columns `'id'`, `'merchant_title'`, `'merchant_name'` and `'merchant_id'` will also be dropped because they are identifiers rather than features.

Features dominated by a single value across records will be dropped as well, since their influence on the target variable will be small: `'currency_buyer'`, `'shipping_option_name'`, `'tags'`, `'origin_country'`, `'shipping_is_express'`, `'theme'` and `'crawl_month'`. Note that `'inventory_total'` also has a dominant value, but the cell below does not drop it, so it remains in the dataframe.



In [None]:
df.drop(columns=['merchant_profile_picture', 'urgency_text', 'id', 'merchant_title',
                 'merchant_name', 'merchant_id', 'currency_buyer', 'shipping_option_name',
                 'tags', 'origin_country', 'shipping_is_express', 'theme', 'crawl_month'],
        inplace=True)
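The columns above were picked by reading the `value_counts()` output. As a hedged sketch of how the "dominant value" check could be automated, the helper below computes the share of the most frequent value per column. The function name `dominant_share`, the 0.85 threshold, and the toy data are assumptions for illustration.

```python
import pandas as pd

def dominant_share(frame):
    """Return, per column, the fraction of rows taken by the most frequent value."""
    return frame.apply(
        lambda col: col.value_counts(normalize=True, dropna=False).iloc[0])

# Toy frame: one constant column, one nearly constant column, one varied column
toy = pd.DataFrame({
    "currency_buyer": ["EUR"] * 10,
    "shipping_is_express": [0] * 9 + [1],
    "price": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
})

shares = dominant_share(toy)
# Hypothetical cut-off: flag columns where one value covers at least 85% of rows
near_constant = shares[shares >= 0.85].index.tolist()
print(near_constant)
```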

In [None]:
#For the 'has_urgency_banner' column, replace null values with 0, meaning the product has no urgency banner
df['has_urgency_banner']=df['has_urgency_banner'].fillna(0)

In [None]:
#Dropping rows that have null values in 'product_color'

df.dropna(subset=['product_color'], inplace=True)

In [None]:
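#Checking correlation between numeric features
#(numeric_only=True is needed on pandas 2.0+, where non-numeric columns otherwise raise an error)
df.corr(numeric_only=True)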

Unnamed: 0,price,retail_price,units_sold,uses_ad_boosts,rating,rating_count,badges_count,badge_local_product,badge_product_quality,badge_fast_shipping,product_variation_inventory,shipping_option_price,countries_shipped_to,inventory_total,has_urgency_banner,merchant_rating_count,merchant_rating,merchant_has_profile_picture
price,1.0,0.333535,0.002055,-0.041148,0.199357,0.023134,0.063551,0.07377,0.037648,0.012101,0.320003,0.875298,-0.031085,0.000782,-0.007343,-0.023093,0.052539,0.036427
retail_price,0.333535,1.0,0.018258,0.001531,0.086916,0.060486,0.027698,0.032409,0.009092,0.025478,0.144839,0.295083,0.048397,0.01333,-0.024873,0.016917,0.048394,0.026451
units_sold,0.002055,0.018258,1.0,-0.022974,0.022247,0.899239,0.045979,0.004737,0.062469,-0.023484,0.134271,-0.014028,-0.02423,-0.022932,-0.027766,0.203997,0.13133,0.145115
uses_ad_boosts,-0.041148,0.001531,-0.022974,1.0,-0.04625,-0.049061,0.016092,0.054502,-0.018676,0.03316,-0.100526,-0.049736,0.014941,0.02182,0.047889,0.032867,0.031476,0.007264
rating,0.199357,0.086916,0.022247,-0.04625,1.0,0.035312,0.220186,0.000437,0.269887,0.000316,0.053802,0.171135,0.019633,-0.071608,-0.046469,0.062911,0.208252,0.037115
rating_count,0.023134,0.060486,0.899239,-0.049061,0.035312,1.0,0.070437,0.006804,0.079709,0.009576,0.142383,0.001677,-0.008001,-0.017914,-0.003823,0.209205,0.155355,0.160314
badges_count,0.063551,0.027698,0.045979,0.016092,0.220186,0.070437,1.0,0.532487,0.832399,0.37849,0.060393,0.071991,0.034047,-0.107189,-0.026199,0.088941,0.222196,0.048375
badge_local_product,0.07377,0.032409,0.004737,0.054502,0.000437,0.006804,0.532487,1.0,0.064973,0.281068,0.005575,0.085061,0.024206,0.007263,-0.031279,0.000584,0.116874,-0.018859
badge_product_quality,0.037648,0.009092,0.062469,-0.018676,0.269887,0.079709,0.832399,0.064973,1.0,-0.028701,0.096144,0.026587,0.019269,-0.136958,-0.015488,0.116761,0.196058,0.054823
badge_fast_shipping,0.012101,0.025478,-0.023484,0.03316,0.000316,0.009576,0.37849,0.281068,-0.028701,1.0,-0.069552,0.056789,0.030035,0.005247,-0.00389,-0.02219,0.054882,0.03865
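To read a correlation matrix like the one above with respect to the target, one option is to rank the features by the absolute value of their correlation with `'rating'`. The sketch below does this on made-up data; the toy values and the variable names are assumptions, and `select_dtypes` is used here only to keep the example independent of the pandas version.

```python
import pandas as pd

# Toy data with one non-numeric column, loosely mirroring the notebook's columns
toy = pd.DataFrame({
    "rating": [3.0, 4.0, 4.0, 5.0, 4.0, 3.0],
    "badge_product_quality": [0, 0, 0, 1, 0, 0],
    "uses_ad_boosts": [1, 0, 1, 0, 1, 0],
    "product_color": ["red", "blue", "red", "black", "blue", "white"],
})

# Restrict to numeric columns, then take the 'rating' column of the matrix
numeric = toy.select_dtypes(include="number")
target_corr = numeric.corr()["rating"].drop("rating")

# Rank features by the strength (not the sign) of their linear relationship
ranked = target_corr.abs().sort_values(ascending=False)
print(ranked)
```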


In [None]:
#Checking null values in columns
df.isna().any()

price                           False
retail_price                    False
units_sold                      False
uses_ad_boosts                  False
rating                          False
rating_count                    False
badges_count                    False
badge_local_product             False
badge_product_quality           False
badge_fast_shipping             False
product_color                   False
product_variation_size_id        True
product_variation_inventory     False
shipping_option_price           False
countries_shipped_to            False
inventory_total                 False
has_urgency_banner              False
merchant_info_subtitle          False
merchant_rating_count           False
merchant_rating                 False
merchant_has_profile_picture    False
dtype: bool

In [None]:
#Checking the value counts of every column
for col in df.columns:
    print(col)
    print(df[col].value_counts())
    print("------------------------")

price
8.00     191
11.00    133
6.00      88
7.00      87
9.00      86
        ... 
5.82       1
25.00      1
5.70       1
3.90       1
4.73       1
Name: price, Length: 113, dtype: int64
------------------------
retail_price
7      125
6      102
10      86
11      72
5       68
      ... 
27       1
29       1
37       1
139      1
57       1
Name: retail_price, Length: 95, dtype: int64
------------------------
units_sold
100       343
1000      271
5000      137
10000     124
20000      77
50         53
10         35
50000      11
8           4
100000      4
1           3
2           1
7           1
Name: units_sold, dtype: int64
------------------------
uses_ad_boosts
0    605
1    459
Name: uses_ad_boosts, dtype: int64
------------------------
rating
4.0    758
5.0    160
3.0    132
2.0     12
1.0      2
Name: rating, dtype: int64
------------------------
rating_count
0       27
2       26
4       22
27      15
10      15
        ..
370      1
569      1
4421     1
69       1
135 

## Recategorizing and label encoding `'product_variation_size_id'`

In [None]:
#Replacing all sizes less than small by -S
df['product_variation_size_id'] = df['product_variation_size_id'].replace(['XS','XXS','XXXS','XS.','Size-XXS','Size-XS','SIZE XXS','Size -XXS','Size XXS','SIZE-XXS','SIZE XS'],'-S')

#Replacing all small sizes by S
df['product_variation_size_id'] = df['product_variation_size_id'].replace(['25-S','US-S','S(bust 88cm)','S.','s','Size S', 'Size-S','S/M(child)','S (waist58-62cm)','S Pink','pants-S','Size S.', 'S..','size S','Size/S','Suit-S','SIZE S'],'S')

#Replacing all medium sizes by M
df['product_variation_size_id'] = df['product_variation_size_id'].replace(['Size M','M.','Women Size 36'],'M')


#Replacing all L sizes by L
df['product_variation_size_id'] = df['product_variation_size_id'].replace(['SizeL'],'L')

#Replacing all XL sizes by XL
df['product_variation_size_id'] = df['product_variation_size_id'].replace(['X   L','One Size','1 PC - XL'],'XL')


#Replacing all sizes bigger than XL by +XL
df['product_variation_size_id'] = df['product_variation_size_id'].replace(['XXL','4XL','6XL','04-3XL','XXXXL','XXXXXL','2XL', 'Size-5XL','Size4XL','5XL'],'+XL')
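The explicit lists above have to be extended whenever a new spelling appears. As a hedged alternative sketch, a small normalizing function can strip the `Size` prefix and punctuation before matching. The function `normalize_size` and its rules are assumptions for illustration; it does not reproduce every manual mapping above (for example, it leaves `'One Size'` and `'25-S'` unrecognized).

```python
import re
import pandas as pd

SMALLER = {"XS", "XXS", "XXXS"}
LARGER = {"XXL", "XXXL", "XXXXL", "XXXXXL", "2XL", "3XL", "4XL", "5XL", "6XL"}

def normalize_size(raw):
    """Map a raw size label to -S, S, M, L, XL or +XL; return None if unrecognized."""
    if pd.isna(raw):
        return None
    label = str(raw).upper()
    label = re.sub(r"^SIZE", "", label)      # drop a leading "SIZE"
    label = re.sub(r"[^0-9A-Z]", "", label)  # drop spaces, dots, dashes, slashes
    if label in SMALLER:
        return "-S"
    if label in LARGER:
        return "+XL"
    if label in {"S", "M", "L", "XL"}:
        return label
    return None

examples = ["Size-XXS", "S.", "SizeL", "X   L", "Size4XL", "80 X 200 CM", None]
normalized = [normalize_size(value) for value in examples]
print(normalized)
```

Values that come back as `None` can then be inspected by hand, which keeps the manual lists short.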



In [None]:
#Keeping only rows whose size is one of '-S','S','M','L','XL' or '+XL'
#(rows with any other size, including nulls, are dropped)

df = df[df['product_variation_size_id'].isin(['-S', 'S', 'M', 'L', 'XL', '+XL'])]



In [None]:
#Creating dictionary for replacement values then replace them in the main dataframe to label encode 'product_variation_size_id' column

cleanup_nums = {"product_variation_size_id":{"-S": 1, "S": 2,"M": 3, "L": 4, "XL": 5, "+XL":6,'daughter 24M':2, '6':7, '4':7, '1 pc.':7, '80 X 200 CM':7,
       'EU 35':7, '34':7, '100pcs':7, 'first  generation':7, '2pcs':7,
       'Women Size 37':7, 'AU plug Low quality':7, '3 layered anklet':7, '17':7,
       'US 6.5 (EU 37)':7, 'US5.5-EU35':7, 'EU39(US8)':7, 'nan':7, '36':7, '33':7}}
df = df.replace(cleanup_nums)

In [None]:
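#Filling nulls of the 'product_variation_size_id' column with its mode value
#(assigning the result back instead of calling fillna(inplace=True) on a column selection)
df['product_variation_size_id'] = df['product_variation_size_id'].fillna(df['product_variation_size_id'].mode()[0])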

## Recategorizing and One-Hot-Encoding `'product_color'`

In [None]:
#Re-categorize 'product_color' column

df['product_color'] = df['product_color'].replace(['Black', 'coolblack'],'black')
df['product_color'] = df['product_color'].replace(['White','whitefloral','star', 'offwhite'],'white')
df['product_color'] = df['product_color'].replace(['lakeblue','navyblue','lightblue','skyblue','navy','navy blue','darkblue','denimblue','Blue'],'blue')
df['product_color'] = df['product_color'].replace(['lightpink', 'pink','Pink','dustypink'],'pink')
df['product_color'] = df['product_color'].replace(['lightyellow'],'yellow')
df['product_color'] = df['product_color'].replace(['winered','watermelonred', 'rosered','rose','lightred','burgundy','wine red','claret','orange-red','wine','Rose red'],'red')
df['product_color'] = df['product_color'].replace(['khaki','armygreen','mintgreen','camouflage','lightgreen','darkgreen','Army green','light green','fluorescentgreen','applegreen'],'green')
df['product_color'] = df['product_color'].replace(['gray','lightgrey','lightgray'],'grey')
df['product_color'] = df['product_color'].replace(['violet',],'purple')
df['product_color'] = df['product_color'].replace(['coffee',],'brown')
df['product_color'] = df['product_color'].replace(['apricot',],'orange')
df['product_color'] = df['product_color'].replace(['camel' ,'ivory'],'beige')
df['product_color'] = df['product_color'].replace(['blackwhite','leopard','pink & grey','army','winered & yellow','white & green','pink & black','pink & white','leopardprint','black & yellow','whitestripe','navyblue & white', 'black & blue','black & green','white & black','black & white','pink & blue','rainbow','floral','brown & yellow','blue & pink'],'multicolor')
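The same recategorization can be expressed as one lookup table built from groups of variants, with case and surrounding whitespace normalized first so that `'Black'` and `'black'` do not need separate entries. This is a sketch under assumptions: the dictionary `COLOR_GROUPS` holds only a few of the groups above, and `recategorize` is a hypothetical helper name.

```python
import pandas as pd

# Shortened, illustrative subset of the colour groups used above
COLOR_GROUPS = {
    "blue": ["lakeblue", "navyblue", "navy", "skyblue"],
    "red": ["winered", "rose red", "burgundy"],
    "grey": ["gray", "lightgrey"],
}

# Invert the groups into a variant -> base colour lookup
LOOKUP = {variant: base
          for base, variants in COLOR_GROUPS.items()
          for variant in variants}

def recategorize(colors):
    """Lower-case and strip the labels, then map known variants to their base colour."""
    cleaned = colors.str.strip().str.lower()
    return cleaned.replace(LOOKUP)

raw = pd.Series(["Navy", "Rose red", "gray", "black", None])
result = recategorize(raw)
print(result)
```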

In [None]:
#Applying One-Hot-Encoding with pandas
#(dtype=int keeps the 0/1 columns shown below; recent pandas versions default to True/False)
one_hot_encoded_data = pd.get_dummies(df, columns=['product_color'], dtype=int)
one_hot_encoded_data

Unnamed: 0,price,retail_price,units_sold,uses_ad_boosts,rating,rating_count,badges_count,badge_local_product,badge_product_quality,badge_fast_shipping,product_variation_size_id,product_variation_inventory,shipping_option_price,countries_shipped_to,inventory_total,has_urgency_banner,merchant_info_subtitle,merchant_rating_count,merchant_rating,merchant_has_profile_picture,product_color_beige,product_color_black,product_color_blue,product_color_brown,product_color_green,product_color_grey,product_color_multicolor,product_color_orange,product_color_pink,product_color_purple,product_color_red,product_color_white,product_color_yellow
0,8.00,7,10000,1,4.0,1670,0,0,0,0,3,50,3,30,50,0.0,"88 % avis positifs (66,644 notes)",66644,4.137582,1,0,0,0,0,0,0,0,0,0,0,0,0,1
1,8.00,7,1000,0,4.0,843,1,1,0,0,4,50,3,37,50,0.0,"91 % avis positifs (25,752 notes)",25752,4.256873,0,0,1,0,0,0,0,0,0,0,0,0,0,0
2,6.00,34,100,0,4.0,10,0,0,0,0,4,50,2,41,50,0.0,"83 % avis positifs (3,897 notes)",3897,3.928920,0,0,0,0,0,0,0,0,0,0,0,0,1,0
3,4.67,4,100,1,3.0,82,0,0,0,0,1,5,1,25,50,0.0,"87 % avis positifs (7,497 notes)",7497,4.079365,0,0,0,1,0,0,0,0,0,0,0,0,0,0
4,8.00,11,1000,0,4.0,127,0,0,0,0,2,50,3,15,50,1.0,91 % avis positifs (173 notes),173,4.306358,0,0,0,0,0,0,0,0,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1088,8.00,7,1000,0,4.0,124,0,0,0,0,2,50,3,43,50,1.0,"85 % avis positifs (3,950 notes)",3950,4.057468,0,0,0,0,0,0,1,0,0,0,0,0,0,0
1089,4.00,4,10000,0,4.0,2150,0,0,0,0,2,50,1,41,50,0.0,"92 % avis positifs (230,951 notes)",230951,4.311066,0,0,0,0,0,0,0,0,0,0,0,1,0,0
1091,9.00,48,1000,1,4.0,135,0,0,0,0,3,50,3,25,50,0.0,"78 % avis positifs (6,632 notes)",6632,3.748794,0,0,0,0,0,0,0,0,0,0,1,0,0,0
1092,11.00,10,10000,1,4.0,1479,0,0,0,0,3,50,2,45,50,1.0,"83 % avis positifs (5,816 notes)",5816,3.936382,0,0,1,0,0,0,0,0,0,0,0,0,0,0


In [None]:
#Assigning the one-hot-encoded data to the original dataframe 
df=one_hot_encoded_data

## Handling the `'merchant_info_subtitle'` column

In [None]:
#Splitting each subtitle at the first space; the first part holds the percentage

percent=df['merchant_info_subtitle'].str.split(" ", n=1, expand=True)

In [None]:
#Assigning the first column (the one with the percentage) to the original column
df['merchant_info_subtitle']=percent[0]

In [None]:
df['merchant_info_subtitle']

0       88
1       91
2       83
3       87
4       91
        ..
1088    85
1089    92
1091    78
1092    83
1093    86
Name: merchant_info_subtitle, Length: 1023, dtype: object

In [None]:
#Converting the type to numeric and assigning it to the original column
#using errors='coerce' to set values that cannot be parsed to NaN

df['merchant_info_subtitle']=pd.to_numeric(df['merchant_info_subtitle'],errors='coerce')

In [None]:
#Filling NaNs with the mean value of the column
df['merchant_info_subtitle'] = df['merchant_info_subtitle'].fillna(df['merchant_info_subtitle'].mean())

In [None]:
df

Unnamed: 0,price,retail_price,units_sold,uses_ad_boosts,rating,rating_count,badges_count,badge_local_product,badge_product_quality,badge_fast_shipping,product_variation_size_id,product_variation_inventory,shipping_option_price,countries_shipped_to,inventory_total,has_urgency_banner,merchant_info_subtitle,merchant_rating_count,merchant_rating,merchant_has_profile_picture,product_color_beige,product_color_black,product_color_blue,product_color_brown,product_color_green,product_color_grey,product_color_multicolor,product_color_orange,product_color_pink,product_color_purple,product_color_red,product_color_white,product_color_yellow
0,8.00,7,10000,1,4.0,1670,0,0,0,0,3,50,3,30,50,0.0,88.0,66644,4.137582,1,0,0,0,0,0,0,0,0,0,0,0,0,1
1,8.00,7,1000,0,4.0,843,1,1,0,0,4,50,3,37,50,0.0,91.0,25752,4.256873,0,0,1,0,0,0,0,0,0,0,0,0,0,0
2,6.00,34,100,0,4.0,10,0,0,0,0,4,50,2,41,50,0.0,83.0,3897,3.928920,0,0,0,0,0,0,0,0,0,0,0,0,1,0
3,4.67,4,100,1,3.0,82,0,0,0,0,1,5,1,25,50,0.0,87.0,7497,4.079365,0,0,0,1,0,0,0,0,0,0,0,0,0,0
4,8.00,11,1000,0,4.0,127,0,0,0,0,2,50,3,15,50,1.0,91.0,173,4.306358,0,0,0,0,0,0,0,0,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1088,8.00,7,1000,0,4.0,124,0,0,0,0,2,50,3,43,50,1.0,85.0,3950,4.057468,0,0,0,0,0,0,1,0,0,0,0,0,0,0
1089,4.00,4,10000,0,4.0,2150,0,0,0,0,2,50,1,41,50,0.0,92.0,230951,4.311066,0,0,0,0,0,0,0,0,0,0,0,1,0,0
1091,9.00,48,1000,1,4.0,135,0,0,0,0,3,50,3,25,50,0.0,78.0,6632,3.748794,0,0,0,0,0,0,0,0,0,0,1,0,0,0
1092,11.00,10,10000,1,4.0,1479,0,0,0,0,3,50,2,45,50,1.0,83.0,5816,3.936382,0,0,1,0,0,0,0,0,0,0,0,0,0,0


## Checking duplicates

In [None]:
df.duplicated().value_counts()

False    1011
True       12
dtype: int64

In [None]:
#Dropping duplicates
df.drop_duplicates(inplace=True)

# Assigning dependent and independent variables

In [None]:
xtr=df.drop(columns=['rating'])
ytr=df['rating']

# Applying preprocessing on test data

We will apply the same preprocessing steps to the test data as to the training data, so that both sets share the same feature representation.
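One way to keep the two sets consistent is to put the shared steps in a single function and call it on both dataframes. The sketch below is only an illustration of that idea: `preprocess`, the shortened `DROP_COLUMNS` list, and the toy frames are assumptions, and the cells that follow repeat the steps by hand instead.

```python
import pandas as pd

# Shortened, illustrative list of columns to drop
DROP_COLUMNS = ["currency_buyer", "urgency_text"]

def preprocess(frame):
    """Apply the shared cleaning steps and return a new dataframe."""
    out = frame.drop(columns=DROP_COLUMNS, errors="ignore")
    # A missing urgency banner flag is read as "no banner"
    out["has_urgency_banner"] = out["has_urgency_banner"].fillna(0)
    return out

train_raw = pd.DataFrame({
    "price": [8.0, 6.0],
    "currency_buyer": ["EUR", "EUR"],
    "has_urgency_banner": [1.0, None],
    "rating": [4.0, 3.0],
})
test_raw = train_raw.drop(columns=["rating"])

train_clean = preprocess(train_raw)
test_clean = preprocess(test_raw)
print(train_clean.columns.tolist(), test_clean.columns.tolist())
```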



In [None]:
#Reading test data into a dataframe
df_test=pd.read_csv('test_new.csv')


In [None]:
#Dropping the same columns as in the training set
df_test.drop(columns=['merchant_profile_picture', 'urgency_text','id','merchant_title' , 'merchant_name', 'merchant_id','currency_buyer', 'shipping_option_name' , 'tags' , 'origin_country', 'shipping_is_express' , 'theme' , 'crawl_month'],inplace=True)

## Recategorizing and label encoding `'product_variation_size_id'` of test data

In [None]:
#Replacing all sizes less than small by -S
df_test['product_variation_size_id'] = df_test['product_variation_size_id'].replace(['S(Pink & Black)','Size--S','XS','XXS','XXXS','XS.','Size-XXS','Size-XS','SIZE XXS','Size -XXS','Size XXS','SIZE-XXS','SIZE XS'],'-S')

#Replacing all small sizes by S
df_test['product_variation_size_id'] = df_test['product_variation_size_id'].replace(['S Diameter 30cm','25-S','US-S','S(bust 88cm)','S.','s','Size S', 'Size-S','S/M(child)','S (waist58-62cm)','S Pink','pants-S','Size S.', 'S..','size S','Size/S','Suit-S','SIZE S'],'S')

#Replacing all medium sizes by M
df_test['product_variation_size_id'] = df_test['product_variation_size_id'].replace(['Size M','M.','Women Size 36'],'M')


#Replacing all L sizes by L
df_test['product_variation_size_id'] = df_test['product_variation_size_id'].replace(['SizeL','L.','Size-L','32/L'],'L')

#Replacing all XL sizes by XL
df_test['product_variation_size_id'] = df_test['product_variation_size_id'].replace(['X   L','One Size','1 PC - XL'],'XL')


#Replacing all sizes bigger than XL by +XL
df_test['product_variation_size_id'] = df_test['product_variation_size_id'].replace(['SIZE-4XL','XXXL','3XL','XXL','4XL','6XL','04-3XL','XXXXL','XXXXXL','2XL', 'Size-5XL','Size4XL','5XL'],'+XL')

In [None]:
#Creating dictionary for replacement values then replace them in the main dataframe and assign value 7 to unknown size

cleanup_nums = {"product_variation_size_id":{"-S": 1, "S": 2,"M": 3, "L": 4, "XL": 5, "+XL":6,'daughter 24M':2, '6':7, '4':7, '1 pc.':7, '80 X 200 CM':7,
       'EU 35':7, '34':7, '100pcs':7, 'first  generation':7, '2pcs':7,
       'Women Size 37':7, 'AU plug Low quality':7, '3 layered anklet':7, '17':7,
       'US 6.5 (EU 37)':7, 'US5.5-EU35':7, 'EU39(US8)':7, 'nan':7, '36':7, '33':7}}
df_test = df_test.replace(cleanup_nums)

In [None]:
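#Filling nulls of the 'product_variation_size_id' column with the mode of the training column
#(assigning the result back instead of calling fillna(inplace=True) on a column selection)
df_test['product_variation_size_id'] = df_test['product_variation_size_id'].fillna(df['product_variation_size_id'].mode()[0])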

## Recategorizing and One-Hot-Encoding `'product_color'` of test data

In [None]:
#Re-categorize 'product_color' column

df_test['product_color'] = df_test['product_color'].replace(['offblack','Black', 'coolblack'],'black')
df_test['product_color'] = df_test['product_color'].replace(['White','whitefloral','star', 'offwhite'],'white')
df_test['product_color'] = df_test['product_color'].replace(['prussianblue','lakeblue','navyblue','lightblue','skyblue','navy','navy blue','darkblue','denimblue','Blue'],'blue')
df_test['product_color'] = df_test['product_color'].replace(['lightpink', 'pink','Pink','dustypink'],'pink')
df_test['product_color'] = df_test['product_color'].replace(['lightyellow'],'yellow')
df_test['product_color'] = df_test['product_color'].replace(['coralred','jasper','winered','watermelonred', 'rosered','rose','lightred','burgundy','wine red','claret','orange-red','wine','Rose red'],'red')
df_test['product_color'] = df_test['product_color'].replace(['army green','lightkhaki','khaki','armygreen','mintgreen','camouflage','lightgreen','darkgreen','Army green','light green','fluorescentgreen','applegreen'],'green')
df_test['product_color'] = df_test['product_color'].replace(['gray','lightgrey','lightgray','silver'],'grey')
df_test['product_color'] = df_test['product_color'].replace(['violet','lightpurple'],'purple')
df_test['product_color'] = df_test['product_color'].replace(['coffee',],'brown')
df_test['product_color'] = df_test['product_color'].replace(['apricot','tan'],'orange')
df_test['product_color'] = df_test['product_color'].replace(['camel' ,'ivory'],'beige')
df_test['product_color'] = df_test['product_color'].replace(['white & red','gray & white','orange & camouflage','red & blue','greysnakeskinprint','black & stripe','white & red ','leopard','pink & grey','army','winered & yellow','white & green','pink & black','pink & white','leopardprint','black & yellow','whitestripe','navyblue & white', 'black & blue','black & green','white & black','black & white','pink & blue','rainbow','floral','brown & yellow','blue & pink'],'multicolor')

In [None]:
#Applying One-Hot-Encoding 

test_one_hot_encoded_data = pd.get_dummies(df_test, columns = ['product_color'])
test_one_hot_encoded_data

Unnamed: 0,price,retail_price,units_sold,uses_ad_boosts,rating_count,badges_count,badge_local_product,badge_product_quality,badge_fast_shipping,product_variation_size_id,product_variation_inventory,shipping_option_price,countries_shipped_to,inventory_total,has_urgency_banner,merchant_info_subtitle,merchant_rating_count,merchant_rating,merchant_has_profile_picture,product_color_beige,product_color_black,product_color_blue,product_color_brown,product_color_green,product_color_grey,product_color_multicolor,product_color_orange,product_color_pink,product_color_purple,product_color_red,product_color_white,product_color_yellow
0,5.75,5,10,1,0,0,0,0,0,1,5,2,139,50,,33 % avis positifs (3 notes),3,2.333333,0,0,0,0,0,0,0,0,0,0,0,0,1,0
1,7.00,7,5000,0,579,0,0,0,0,1,50,2,31,50,,"86 % avis positifs (4,213 notes)",4213,4.058391,0,0,1,0,0,0,0,0,0,0,0,0,0,0
2,5.00,16,1000,1,62,0,0,0,0,1,20,2,41,50,,"85 % avis positifs (10,501 notes)",10501,4.029235,0,0,0,1,0,0,0,0,0,0,0,0,0,0
3,18.00,27,5000,1,1903,0,0,0,0,3,50,5,42,50,,"86 % avis positifs (15,855 notes)",15855,4.052917,0,0,0,1,0,0,0,0,0,0,0,0,0,0
4,12.00,81,20000,0,1799,0,0,0,0,3,50,3,38,50,,"88 % avis positifs (151,914 notes)",151914,4.127921,0,0,0,0,0,0,0,0,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
474,16.00,14,100,1,27,0,0,0,0,3,50,4,35,50,,70 % avis positifs (572 notes),572,3.367133,0,0,1,0,0,0,0,0,0,0,0,0,0,0
475,6.00,17,5000,1,817,0,0,0,0,1,1,2,41,50,,"87 % avis positifs (50,864 notes)",50864,4.079644,1,0,0,1,0,0,0,0,0,0,0,0,0,0
476,4.83,51,50,1,1,0,0,0,0,1,3,1,39,50,1.0,85 % avis positifs (57 notes),57,3.824561,0,0,0,0,0,0,0,0,0,0,0,0,1,0
477,3.00,3,100,1,8,0,0,0,0,7,3,1,26,50,1.0,"87 % avis positifs (1,086 notes)",1086,4.056169,0,0,1,0,0,0,0,0,0,0,0,0,0,0


In [None]:
#Assigning the one-hot-encoded data to the original dataframe 
df_test=test_one_hot_encoded_data
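
One-hot encoding the test set separately can yield a different set of dummy columns than the training set if a color appears in only one of them. The sketch below shows one way to guard against that by reindexing the test frame onto the training columns. The two small frames are hypothetical stand-ins, not the notebook's data.

```python
import pandas as pd

# Hypothetical miniature frames standing in for the training and test data
train = pd.DataFrame({"price": [5.0, 7.0, 3.0],
                      "product_color": ["black", "white", "multicolor"]})
test = pd.DataFrame({"price": [6.0, 4.0],
                     "product_color": ["white", "pink"]})

train_enc = pd.get_dummies(train, columns=["product_color"], dtype=int)
test_enc = pd.get_dummies(test, columns=["product_color"], dtype=int)

# Keep exactly the training columns, in the same order:
# dummies missing from the test set are filled with 0, unseen ones are dropped
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(test_aligned)
```

After this step the test frame can be passed to a model fitted on the training frame without a column mismatch.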

##Handling 'merchant_info_subtitle' column of test data

In [None]:
#Splitting only the percentage out of it into a new column

test_percent=df_test[ 'merchant_info_subtitle' ].str.split(" ", n=1,expand=True)

In [None]:
#Assigning the first column (the one with the percentage) to the original column
df_test['merchant_info_subtitle']=test_percent[0]

In [None]:
df_test['merchant_info_subtitle']

0       33
1       86
2       85
3       86
4       88
      ... 
474     70
475     87
476     85
477     87
478    86%
Name: merchant_info_subtitle, Length: 479, dtype: object

In [None]:
#Converting the type to numeric and assigning it to the original column
#using errors='coerce' to set values that cannot be parsed (such as '86%') to NaN

df_test['merchant_info_subtitle']=pd.to_numeric(df_test['merchant_info_subtitle'],errors='coerce')

In [None]:
#Filling NaNs with the mean value of the column
df_test['merchant_info_subtitle'] = df_test['merchant_info_subtitle'].fillna(df_test['merchant_info_subtitle'].mean())
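
Splitting on the first space leaves entries such as `86%` unparsed, so they end up replaced by the column mean. As an alternative sketch, a regular expression can pull the number in front of the percent sign whether or not a space separates them. The sample strings below are illustrative.

```python
import pandas as pd

# Illustrative subtitles, including one without a space before '%' and one missing value
subtitle = pd.Series([
    "33 % avis positifs (3 notes)",
    "86 % avis positifs (4,213 notes)",
    "86%",
    None,
])

# Capture the number that directly precedes the percent sign
extracted = subtitle.str.extract(r"(\d+(?:\.\d+)?)\s*%", expand=False)
percent = pd.to_numeric(extracted, errors="coerce")

# Only genuinely missing subtitles are filled with the mean
percent = percent.fillna(percent.mean())
print(percent)
```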

In [None]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 479 entries, 0 to 478
Data columns (total 32 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   price                         479 non-null    float64
 1   retail_price                  479 non-null    int64  
 2   units_sold                    479 non-null    int64  
 3   uses_ad_boosts                479 non-null    int64  
 4   rating_count                  479 non-null    int64  
 5   badges_count                  479 non-null    int64  
 6   badge_local_product           479 non-null    int64  
 7   badge_product_quality         479 non-null    int64  
 8   badge_fast_shipping           479 non-null    int64  
 9   product_variation_size_id     479 non-null    object 
 10  product_variation_inventory   479 non-null    int64  
 11  shipping_option_price         479 non-null    int64  
 12  countries_shipped_to          479 non-null    int64  
 13  inven

In [None]:
#For the 'has_urgency_banner' column, we replace NULL values with 0, meaning the product has no urgency banner
df_test['has_urgency_banner']=df_test['has_urgency_banner'].fillna(0)

In [None]:
#Checking the shape of the test dataframe
df_test.shape

(479, 32)

#Decision Tree Model

In [None]:
from sklearn.tree import DecisionTreeClassifier

# Create Decision Tree classifier object with arbitrarily chosen hyperparameter values
DT_clf = DecisionTreeClassifier(criterion= "entropy", max_depth= 8 , min_samples_leaf= 4 , random_state= 42)

# Train Decision Tree classifier
DT_clf = DT_clf.fit(xtr,ytr)


In [None]:
#Predicting test values
y_pred = DT_clf.predict(df_test)

In [None]:
#Printing the predicted values
y_pred

array([5., 4., 3., 4., 4., 5., 4., 4., 4., 3., 4., 4., 3., 4., 4., 4., 4.,
       4., 3., 5., 4., 5., 3., 4., 4., 4., 4., 5., 4., 4., 4., 4., 4., 4.,
       4., 4., 4., 4., 4., 4., 5., 4., 3., 4., 4., 4., 2., 4., 4., 4., 4.,
       1., 4., 4., 4., 4., 4., 4., 4., 4., 4., 5., 4., 4., 3., 4., 4., 4.,
       4., 4., 4., 4., 4., 4., 4., 4., 4., 4., 3., 4., 3., 4., 4., 4., 3.,
       4., 4., 4., 4., 5., 5., 4., 3., 5., 4., 4., 4., 4., 4., 4., 4., 4.,
       4., 4., 5., 4., 4., 4., 4., 4., 4., 4., 4., 4., 4., 4., 4., 4., 5.,
       4., 5., 4., 4., 4., 4., 4., 4., 4., 4., 4., 3., 4., 4., 4., 4., 4.,
       4., 4., 4., 4., 5., 4., 2., 4., 4., 4., 4., 4., 3., 4., 4., 4., 4.,
       4., 4., 4., 4., 4., 4., 4., 4., 5., 4., 4., 4., 4., 4., 4., 4., 4.,
       4., 4., 4., 4., 4., 4., 4., 4., 4., 4., 4., 4., 4., 4., 4., 4., 4.,
       4., 4., 4., 4., 3., 4., 4., 4., 4., 4., 3., 3., 4., 4., 4., 4., 4.,
       4., 5., 4., 4., 4., 3., 4., 3., 4., 4., 4., 4., 4., 4., 4., 3., 4.,
       4., 5., 4., 4., 4.

#Finding the optimal Hyperparameters using GridSearchCV

In [None]:
# Creating parameters dictionary for applying GridSearchCV
param_dist = {
     "criterion" : ["gini" , "entropy"],
     "max_depth" : [2 , 4 , 6 , 8, 10, 12, 14, 16],
     "min_samples_leaf": [1 , 2, 3, 4, 5, 6, 7, 8, 9, 10]
 }

In [None]:
from sklearn.model_selection import GridSearchCV

#Instantiating GridSearch
GS = GridSearchCV(DT_clf , param_grid= param_dist , cv= 10 , n_jobs= -1)

In [None]:
#Fitting GridSeach with training data
GS.fit(xtr,ytr)



GridSearchCV(cv=10,
             estimator=DecisionTreeClassifier(criterion='entropy', max_depth=8,
                                              min_samples_leaf=4,
                                              random_state=42),
             n_jobs=-1,
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': [2, 4, 6, 8, 10, 12, 14, 16],
                         'min_samples_leaf': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})

In [None]:
#Printing the best hyperparameters value for the model
GS.best_params_

{'criterion': 'entropy', 'max_depth': 6, 'min_samples_leaf': 10}

In [None]:
#Printing the mean cross-validated accuracy obtained with the best parameters
GS.best_score_

0.7774606872451951
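
`best_score_` is the mean cross-validated score of the best parameter combination, not an upper bound on what the model can reach. The sketch below illustrates this on synthetic data (the dataset and the small grid are assumptions made for the example): the value matches the corresponding entry of `cv_results_`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for xtr / ytr
X, y = make_classification(n_samples=200, n_features=8, n_informative=4,
                           random_state=42)

grid = {"max_depth": [2, 4], "min_samples_leaf": [1, 5]}
gs = GridSearchCV(DecisionTreeClassifier(random_state=42), grid, cv=5)
gs.fit(X, y)

# best_score_ is the mean test-fold score of the winning combination
best_mean = gs.cv_results_["mean_test_score"][gs.best_index_]
print(gs.best_params_, gs.best_score_, best_mean)
```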

In [None]:
# Create Decision Tree classifier object with the best hyperparameters found by GridSearchCV
clf2 = DecisionTreeClassifier(criterion= "entropy", max_depth= 6 , min_samples_leaf= 10 , random_state= 42)

# Train Decision Tree classifier
clf2 = clf2.fit(xtr,ytr)

In [None]:
df_test

Unnamed: 0,price,retail_price,units_sold,uses_ad_boosts,rating_count,badges_count,badge_local_product,badge_product_quality,badge_fast_shipping,product_variation_size_id,product_variation_inventory,shipping_option_price,countries_shipped_to,inventory_total,has_urgency_banner,merchant_info_subtitle,merchant_rating_count,merchant_rating,merchant_has_profile_picture,product_color_beige,product_color_black,product_color_blue,product_color_brown,product_color_green,product_color_grey,product_color_multicolor,product_color_orange,product_color_pink,product_color_purple,product_color_red,product_color_white,product_color_yellow
0,5.75,5,10,1,0,0,0,0,0,1,5,2,139,50,0.0,33.000000,3,2.333333,0,0,0,0,0,0,0,0,0,0,0,0,1,0
1,7.00,7,5000,0,579,0,0,0,0,1,50,2,31,50,0.0,86.000000,4213,4.058391,0,0,1,0,0,0,0,0,0,0,0,0,0,0
2,5.00,16,1000,1,62,0,0,0,0,1,20,2,41,50,0.0,85.000000,10501,4.029235,0,0,0,1,0,0,0,0,0,0,0,0,0,0
3,18.00,27,5000,1,1903,0,0,0,0,3,50,5,42,50,0.0,86.000000,15855,4.052917,0,0,0,1,0,0,0,0,0,0,0,0,0,0
4,12.00,81,20000,0,1799,0,0,0,0,3,50,3,38,50,0.0,88.000000,151914,4.127921,0,0,0,0,0,0,0,0,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
474,16.00,14,100,1,27,0,0,0,0,3,50,4,35,50,0.0,70.000000,572,3.367133,0,0,1,0,0,0,0,0,0,0,0,0,0,0
475,6.00,17,5000,1,817,0,0,0,0,1,1,2,41,50,0.0,87.000000,50864,4.079644,1,0,0,1,0,0,0,0,0,0,0,0,0,0
476,4.83,51,50,1,1,0,0,0,0,1,3,1,39,50,1.0,85.000000,57,3.824561,0,0,0,0,0,0,0,0,0,0,0,0,1,0
477,3.00,3,100,1,8,0,0,0,0,7,3,1,26,50,1.0,87.000000,1086,4.056169,0,0,1,0,0,0,0,0,0,0,0,0,0,0


#Applying feature importance to know the importance of each feature

In [None]:
from sklearn.inspection import permutation_importance

#Instantiating permutation importance 
result = permutation_importance(clf2, xtr, ytr, n_repeats=10,random_state=42)

#Getting the mean of all importance of the feature and storing them in an array
result_mean=result.importances_mean

In [None]:
#Printing the mean of importance
result_mean

array([0.0206726 , 0.00870425, 0.00682493, 0.        , 0.09930762,
       0.        , 0.        , 0.04559842, 0.        , 0.        ,
       0.        , 0.        , 0.01780415, 0.        , 0.        ,
       0.        , 0.01503462, 0.00820969, 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        ])
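
The importances above are computed on the training data. As a hedged alternative, permutation importance is often measured on a held-out split, so that it reflects what the model relies on for unseen data. The sketch below uses synthetic data and hypothetical column names `f0`..`f5`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with hypothetical column names
X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           random_state=42)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(6)])
X_fit, X_val, y_fit, y_val = train_test_split(X, y, test_size=0.3,
                                              random_state=42)

tree = DecisionTreeClassifier(max_depth=6, min_samples_leaf=10,
                              random_state=42).fit(X_fit, y_fit)

# Shuffle each column of the validation split and record the drop in accuracy
result = permutation_importance(tree, X_val, y_val, n_repeats=10,
                                random_state=42)

# Attach column names so the ranking is readable
importance = pd.Series(result.importances_mean,
                       index=X.columns).sort_values(ascending=False)
print(importance)
```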

In [None]:
#Checking shape of training set
xtr.shape

(1011, 32)

In [None]:
#Looping over the columns and adding those with 0 permutation importance to a list
col_names=[]
for i in range(result_mean.shape[0]):
  if result_mean[i]==0:
    colname = xtr.columns[i]
    col_names.append(colname)

In [None]:
#Printing column names to drop
col_names

['uses_ad_boosts',
 'badges_count',
 'badge_local_product',
 'badge_fast_shipping',
 'product_variation_size_id',
 'product_variation_inventory',
 'shipping_option_price',
 'inventory_total',
 'has_urgency_banner',
 'merchant_info_subtitle',
 'merchant_has_profile_picture',
 'product_color_beige',
 'product_color_black',
 'product_color_blue',
 'product_color_brown',
 'product_color_green',
 'product_color_grey',
 'product_color_multicolor',
 'product_color_orange',
 'product_color_pink',
 'product_color_purple',
 'product_color_red',
 'product_color_white',
 'product_color_yellow']
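
The loop that collects the zero-importance columns can also be written with a boolean mask. This is a minimal sketch with made-up importance values and a small stand-in frame. It uses a non-mutating `drop` so the original frame stays available.

```python
import numpy as np
import pandas as pd

# Made-up importances for four of the feature names
columns = pd.Index(["price", "uses_ad_boosts", "rating_count", "badges_count"])
importances = np.array([0.02, 0.0, 0.1, 0.0])

# Boolean mask selects every column whose mean importance is exactly 0
zero_cols = columns[importances == 0].tolist()

# Stand-in frame; drop returns a new frame and leaves the original untouched
frame = pd.DataFrame(np.ones((2, 4)), columns=columns)
reduced = frame.drop(columns=zero_cols)
print(zero_cols, reduced.shape)
```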

In [None]:
#Dropping columns with no importance
xtr.drop(columns=col_names,inplace=True)

In [None]:
#Dropping columns with no importance from test set as well
df_test.drop(columns=col_names,inplace=True)

In [None]:
#Checking training set shape
xtr.shape

(1011, 8)

In [None]:
#Checking testing set shape
df_test.shape

(479, 8)

In [None]:
# Create Decision Tree classifier object to retrain on the reduced feature set
clf2 = DecisionTreeClassifier(criterion= "entropy", max_depth= 6 , min_samples_leaf= 10 , random_state= 42)

# Train Decision Tree classifier
clf2 = clf2.fit(xtr,ytr)

In [None]:

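#Predicting test values with the tuned model; stored in y_pred because the submission cell writes y_pred
y_pred = clf2.predict(df_test)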

In [None]:
#Reading submission file

preds2 = pd.read_csv('sample_submission_new.csv')

#Replace the rating column with the predicted value
preds2["rating"] = y_pred

#convert it to csv
preds2.to_csv('submission_DT.csv',index=False)
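
Each submission cell writes whatever `y_pred` currently holds, so it is easy to save predictions from a different model than intended. A small helper that takes the predictions explicitly is one way to avoid that. In the sketch below the function name `write_submission`, the `id` column, and the in-memory template are assumptions for illustration; the real template is `sample_submission_new.csv`.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def write_submission(template, predictions, path, target="rating"):
    """Save a copy of `template` with `target` replaced by `predictions`."""
    if len(predictions) != len(template):
        raise ValueError("prediction count does not match the submission template")
    out = template.copy()
    out[target] = np.asarray(predictions)
    out.to_csv(path, index=False)
    return out


# Hypothetical template standing in for sample_submission_new.csv
template = pd.DataFrame({"id": [10, 11, 12], "rating": [0, 0, 0]})
path = os.path.join(tempfile.mkdtemp(), "submission_demo.csv")

written = write_submission(template, [4.0, 5.0, 3.0], path)
print(pd.read_csv(path))
```

Passing the model's predictions directly, for example `write_submission(template, clf2.predict(df_test), 'submission_DT.csv')`, ties each file to the model that produced it.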

#Random Forest Model

In [None]:
from sklearn.ensemble import RandomForestClassifier
#Instantiating the model
rfc=RandomForestClassifier(random_state=42)

In [None]:
# Creating parameters dictionary for applying GridSearchCV
param_grid = { 
    'n_estimators': [100, 200],
    'max_features': ['auto', 'sqrt', 'log2'],  # for classifiers 'auto' behaves like 'sqrt'; it was removed in scikit-learn 1.3
    'max_depth' : [4,5,6,7,8],
    'criterion' :['gini', 'entropy']
}

In [None]:
#Instantiating and fitting the GridSearch
rfc_gs = GridSearchCV(estimator=rfc, param_grid=param_grid, cv= 10)
rfc_gs.fit(xtr, ytr)



GridSearchCV(cv=10, estimator=RandomForestClassifier(random_state=42),
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': [4, 5, 6, 7, 8],
                         'max_features': ['auto', 'sqrt', 'log2'],
                         'n_estimators': [100, 200]})

In [None]:
#Getting best parameters
rfc_gs.best_params_

{'criterion': 'entropy',
 'max_depth': 8,
 'max_features': 'log2',
 'n_estimators': 200}

In [None]:
#Getting best score
rfc_gs.best_score_




0.8011745292176278

In [None]:
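#Instantiating Random Forest model with the best n_estimators, max_depth and criterion
#Note: max_features='auto' and random_state=2 are used here, while the grid search selected max_features='log2' with random_state=42
rfc1=RandomForestClassifier(n_estimators=200, max_features='auto', max_depth=8, criterion='entropy',random_state=2 )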

#Fitting the model with training data
rfc1.fit(xtr,ytr)

RandomForestClassifier(criterion='entropy', max_depth=8, n_estimators=200,
                       random_state=2)
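
The grid search above reports `max_features='log2'`, which is easy to mistype when copying values by hand. As a sketch, the reported dictionary can be unpacked directly into the constructor. The dictionary below is copied from the `best_params_` output, and `rf_best` is a hypothetical name.

```python
from sklearn.ensemble import RandomForestClassifier

# Values copied from the best_params_ output of the random forest grid search
best_params = {
    "criterion": "entropy",
    "max_depth": 8,
    "max_features": "log2",
    "n_estimators": 200,
}

# Unpacking the dictionary keeps the model in sync with the search result
rf_best = RandomForestClassifier(**best_params, random_state=42)
print(rf_best)
```

Since `GridSearchCV` refits on the full training data by default (`refit=True`), `rfc_gs.best_estimator_` should give an already fitted model with the same settings.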

In [None]:
#Making predictions using the Random Forest model and storing them for the submission file
y_pred = rfc1.predict(df_test)
y_pred

array([5., 4., 4., 4., 4., 5., 4., 4., 4., 4., 4., 4., 4., 4., 4., 4., 4.,
       4., 3., 5., 4., 5., 4., 4., 4., 4., 4., 5., 4., 4., 4., 4., 4., 4.,
       4., 4., 4., 4., 4., 4., 5., 4., 4., 4., 4., 4., 5., 4., 4., 4., 4.,
       5., 4., 4., 4., 4., 4., 4., 4., 4., 4., 4., 4., 4., 4., 4., 4., 4.,
       4., 4., 4., 4., 4., 4., 4., 4., 4., 4., 4., 4., 4., 4., 4., 4., 4.,
       4., 4., 4., 4., 5., 5., 4., 3., 5., 4., 4., 4., 4., 4., 4., 4., 4.,
       4., 4., 5., 4., 4., 4., 4., 4., 4., 4., 4., 4., 4., 4., 4., 4., 5.,
       4., 3., 4., 4., 4., 4., 4., 4., 4., 4., 4., 4., 4., 4., 4., 4., 4.,
       4., 4., 4., 4., 5., 4., 5., 4., 4., 4., 4., 4., 4., 4., 4., 4., 4.,
       4., 3., 4., 4., 4., 4., 4., 4., 4., 4., 4., 4., 4., 4., 4., 4., 4.,
       4., 4., 4., 4., 4., 4., 4., 4., 4., 4., 4., 4., 4., 4., 4., 4., 4.,
       4., 4., 4., 4., 4., 4., 4., 4., 4., 4., 4., 4., 4., 4., 4., 4., 4.,
       4., 4., 4., 4., 4., 3., 4., 3., 4., 4., 4., 4., 4., 4., 4., 4., 4.,
       4., 5., 4., 4., 4.

In [None]:
#Reading submission file

preds2 = pd.read_csv('sample_submission_new.csv')

#Replace the rating column with the predicted value
preds2["rating"] = y_pred

#convert it to csv
preds2.to_csv('submission_rfc.csv',index=False)

#Naive Bayes Model

In [None]:
#Instantiating Naive Bayes Model
from sklearn.naive_bayes import BernoulliNB
nbc = BernoulliNB()

#Fitting the model
nbc.fit(xtr,ytr)

BernoulliNB()
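
`BernoulliNB` binarizes its inputs at the `binarize` threshold (0.0 by default), so strictly positive numeric features all become 1 and carry little information. The sketch below illustrates the effect on synthetic positive features and compares it with `GaussianNB`, which models the raw values. The data and the choice of `GaussianNB` are assumptions made for the illustration, not part of the original experiment.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB

rng = np.random.default_rng(0)

# Strictly positive synthetic features; the label depends on the first one
X = np.column_stack([
    np.linspace(1.0, 100.0, 60),
    rng.uniform(1.0, 100.0, size=60),
    rng.uniform(1.0, 100.0, size=60),
])
y = (X[:, 0] > 50).astype(int)

# With binarize=0.0 every value above 0 becomes 1, so all rows look identical
bnb_preds = BernoulliNB().fit(X, y).predict(X)

# GaussianNB keeps the magnitudes and can separate the two classes
gnb_preds = GaussianNB().fit(X, y).predict(X)

print(np.unique(bnb_preds), np.unique(gnb_preds))
```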

In [None]:
y_pred = nbc.predict(df_test)

In [None]:
#Reading submission file

preds2 = pd.read_csv('sample_submission_new.csv')

#Replace the rating column with the predicted value
preds2["rating"] = y_pred

#convert it to csv
preds2.to_csv('submission_NB.csv',index=False)

# Support Vector Machine

In [None]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
xtr_scaled = pd.DataFrame(sc.fit_transform(xtr))


In [None]:
#Scaling test data 
df_test = pd.DataFrame(sc.transform(df_test))

In [None]:
from sklearn.svm import SVC

#Instantiating the svm model
model_svc = SVC(random_state=42)
model_svc.fit(xtr_scaled, ytr)

SVC(random_state=42)

## Finding the best hyperparameter

In [None]:
# Instantiating GridSearchCV with the parameter grid and fitting it
GS_SVC = GridSearchCV(
    model_svc,
    {
        'kernel': ['poly', 'rbf', 'sigmoid'],
        'C': [0.1, 1, 10, 100, 1000],
        'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
    },
    refit=True, n_jobs=-1, cv=10
)
GS_SVC.fit(xtr_scaled, ytr)



GridSearchCV(cv=10, estimator=SVC(), n_jobs=-1,
             param_grid={'C': [0.1, 1, 10, 100, 1000],
                         'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
                         'kernel': ['poly', 'rbf', 'sigmoid']})

In [None]:
df2 = pd.DataFrame(GS_SVC.cv_results_)
df2[['param_kernel','param_gamma','param_C','mean_test_score']]

Unnamed: 0,param_kernel,param_gamma,param_C,mean_test_score
0,poly,1,0.1,0.735847
1,rbf,1,0.1,0.720083
2,sigmoid,1,0.1,0.660726
3,poly,0.1,0.1,0.741827
4,rbf,0.1,0.1,0.720083
...,...,...,...,...
70,rbf,0.001,1000,0.741837
71,sigmoid,0.001,1000,0.725005
72,poly,0.0001,1000,0.720083
73,rbf,0.0001,1000,0.712163


In [None]:
#GridSearch best parameters used
GS_SVC.best_params_

{'C': 1, 'gamma': 0.1, 'kernel': 'rbf'}

In [None]:
#GridSearch best score got
GS_SVC.best_score_

0.7438167346146379
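
Here the scaler is fitted on the whole training set before cross-validation, so each validation fold has already influenced the scaling. One common way to avoid this is to put the scaler and the SVM in a `Pipeline` and search over the pipeline. The sketch below uses synthetic data and a reduced grid as assumptions; the step names `scaler` and `svc` are chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data standing in for xtr / ytr
X, y = make_classification(n_samples=200, n_features=8, n_informative=4,
                           random_state=42)

# The scaler is refitted inside every cross-validation fold
pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC())])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
grid = {"svc__C": [0.1, 1, 10], "svc__gamma": [0.1, 0.01], "svc__kernel": ["rbf"]}

gs_pipe = GridSearchCV(pipe, grid, cv=5)
gs_pipe.fit(X, y)

# The fitted search scales new data with the same scaler before predicting
preds = gs_pipe.predict(X[:5])
print(gs_pipe.best_params_, preds)
```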

In [None]:
## The SVM model with optimal hyperparameters 

svm = SVC(C=1,kernel='rbf',gamma=0.1,probability=True)

#Fitting the model
svm.fit(xtr_scaled, ytr)

#Predicting the output
y_pred=svm.predict(df_test)


In [None]:
#Reading submission file

preds2 = pd.read_csv('sample_submission_new.csv')

#Replace the rating column with the predicted value
preds2["rating"] = y_pred

#convert it to csv
preds2.to_csv('submission_SVM.csv',index=False)

#Questions



🌈 Why is Data Mining a misnomer? What is another preferred name?

**Answer: It should have been named Knowledge Discovery in Databases, because its main goal is to discover patterns and knowledge in huge datasets. The process is not about mining (extracting) the dataset itself.**



---



🌈 What is the general knowledge discovery process? What is the difference between a data engineer and data scientist/AI engineer?

**Answer: It is the process of finding patterns and knowledge in the data. The process includes:**

1.   Developing an understanding of the application domain, the prior knowledge relevant to the problem, and the goals of the end user
2.   Choosing the dataset
3.   Cleaning and preprocessing the data
4.   Data reduction and projection: choosing features that help solve the problem and applying dimensionality reduction methods
5.   Deciding on the data mining task, for example regression or classification
6.   Choosing a data mining algorithm
7.   The data mining process itself: discovering the knowledge and patterns in the dataset
8.   Interpreting those patterns into meaningful results
9.   Merging and consolidating the discovered pieces of knowledge

**A Data Engineer is the person who prepares data from raw data that is unformatted and may include machine or human errors. They collect, extract and integrate data from various sources, and they also work to improve data efficiency, quality, and reliability.**

**A Data Scientist works on the data provided by the data engineer: they analyze the data and give useful insights. They use machine learning and statistical models to build both prescriptive and predictive models, and they examine hidden patterns to reach conclusions that help in decision making.**

---


🌈 In data mining, what is the difference between prediction and categorization?


**Answer: Categorization is the process of grouping data items that are similar to each other, whereas prediction is about fitting a specific shape (or model) as closely as possible to the data in order to find the value of a missing target variable.**


---



🌈 Why is data science/machine learning a bad idea in the context of information security?

**Answer: Because the system can be violated through:**

1. Adversarial samples, which break system integrity: fake samples that trick the model are used to manipulate the behavior of the system
2. Backdoor attacks, which change the data used to train the model, for example by adding noise
3. Information leakage: the model is evaluated only on its performance on test data using some evaluation metric, without checking what else it captured from the data that attackers could use
4. Reconstructing the original training data
5. Stealing the neural network
6. Membership inference
 
---

🌈 What is the CIA principle and how can we use it to assess the security/privacy aspect of AI systems/pipelines?



**Answer: C stands for Confidentiality, which means the data is not disclosed to unauthorized parties. I stands for Integrity, which means the data is correct. A stands for Availability, which means the data is accessible and not lost. We can assess the security aspects of an AI system by looking at: the internal and operational data of the system, client/collected data, and external data.**


---

References: 
http://www2.cs.uregina.ca/~dbd/cs831/notes/kdd/1_kdd.html 

https://www.geeksforgeeks.org/difference-between-data-scientist-and-data-engineer/

