<a href="https://colab.research.google.com/github/Arwa-Esam/Wish.com-Product-Rating-Prediction/blob/main/Wish_final.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#  Problem Formulation:

**Problem:** Predict the product rating of Wish.com products based on their features.

---
**Input:** Product features such as color, size, price, merchant rating, and shipping information.

---
**Output:** A predicted rating for the product: a discrete value from 1 to 5, where a higher value indicates that customers are more likely to like the product.

---
**Data mining functions required:** classification and predictive modeling. We need to build a classification model that predicts the product rating from the available features.
Examples: Support Vector Machines (SVM), Random Forest.

---
Challenges that could be faced in this task are:

1.   Data quality issues: the dataset is not clean and has some noise, so preprocessing and cleaning will be required before building the model.
2.   Feature selection: Selecting the most relevant features for the model will be crucial as it will impact the accuracy and generalization capability of the model.

---
The ideal solution would involve building a classification model using various techniques such as data preprocessing and model selection.
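
As a rough illustration of what "data preprocessing and model selection" could look like in scikit-learn, the sketch below chains scaling, one-hot encoding, and a Random Forest in one pipeline and tunes a single hyperparameter with cross-validation. The data is synthetic, and the column choice, grid values, and toy target are assumptions made for the example, not the setup used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (hypothetical values, not the Wish.com dataset).
rng = np.random.default_rng(0)
n = 60
toy = pd.DataFrame({
    "price": rng.uniform(1, 20, n),
    "merchant_rating": rng.uniform(3, 5, n),
    "product_color": rng.choice(["black", "white", "red"], n),
})
# Toy target: two rating classes derived from merchant_rating.
toy_y = np.where(toy["merchant_rating"] > 4, 5, 4)

# Preprocessing: scale numeric columns, one-hot encode the categorical one.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["price", "merchant_rating"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["product_color"]),
])
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", RandomForestClassifier(random_state=0)),
])

# Model selection: a small grid evaluated with 3-fold cross-validation.
search = GridSearchCV(pipeline, {"model__n_estimators": [10, 50]}, cv=3)
search.fit(toy, toy_y)
print(search.best_params_, round(search.best_score_, 3))
```

Keeping the preprocessing inside the pipeline means it is refit on each training fold, so the cross-validation score is not influenced by the held-out fold.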

# Loading Data


In [None]:
# Import necessary libraries and load the data

import pandas as pd
import numpy as np
import sklearn
from matplotlib import pyplot as plt      

from sklearn.preprocessing import OneHotEncoder,LabelEncoder
from sklearn.model_selection import train_test_split  
from sklearn.preprocessing import StandardScaler , MinMaxScaler


from sklearn.metrics import accuracy_score, f1_score, confusion_matrix
from sklearn.model_selection import GridSearchCV

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import CategoricalNB 
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC


In [None]:
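# Read the train and test data from CSV files
df_train1 = pd.read_csv('train_new.csv')
df_test1 = pd.read_csv('test_new.csv')
# The test set has no labels; add a placeholder rating of 0 so both frames share the same columns
df_test1['rating'] = 0
# Stack train and test so the same cleaning steps apply to both
df_train = pd.concat([df_train1, df_test1], ignore_index=True, sort=False)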
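
One possible alternative to a placeholder rating of 0 is an explicit flag column, so that the test rows can be separated again later and their missing labels do not mix with real ratings in summary statistics. The sketch below uses tiny hypothetical frames, and the name `is_test` is an assumption introduced for the example.

```python
import pandas as pd

# Tiny stand-in frames (hypothetical values) for the train and test files.
toy_train = pd.DataFrame({"price": [8.0, 6.0, 4.67], "rating": [4.0, 4.0, 3.0]})
toy_test = pd.DataFrame({"price": [16.0, 3.0]})

# Mark the origin of each row instead of relying on a placeholder rating.
toy_train["is_test"] = False
toy_test["is_test"] = True
combined = pd.concat([toy_train, toy_test], ignore_index=True, sort=False)

# ... shared cleaning steps would run on `combined` here ...

# Split back: the test rows carry NaN ratings, which are dropped with the flag.
train_part = combined[~combined["is_test"]].drop(columns="is_test")
test_part = combined[combined["is_test"]].drop(columns=["is_test", "rating"])
print(train_part.shape, test_part.shape)
```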

In [None]:
# print first five rows of the dataset
df_train.head()

Unnamed: 0,price,retail_price,currency_buyer,units_sold,uses_ad_boosts,rating,rating_count,badges_count,badge_local_product,badge_product_quality,...,merchant_name,merchant_info_subtitle,merchant_rating_count,merchant_rating,merchant_id,merchant_has_profile_picture,merchant_profile_picture,theme,crawl_month,id
0,8.0,7,EUR,10000,1,4.0,1670,0,0,0,...,keepahorse,"88 % avis positifs (66,644 notes)",66644,4.137582,577fb2b368116418674befd9,1,https://s3-us-west-1.amazonaws.com/sweeper-pro...,summer,2020-08,272
1,8.0,7,EUR,1000,0,4.0,843,1,1,0,...,上海铭启电气科技有限公司,"91 % avis positifs (25,752 notes)",25752,4.256873,566a3ef17233ff2686443082,0,,summer,2020-08,1515
2,6.0,34,EUR,100,0,4.0,10,0,0,0,...,zhaodongmei,"83 % avis positifs (3,897 notes)",3897,3.92892,5aacd6c1ccf0c83e5a6f3de0,0,,summer,2020-08,940
3,4.67,4,EUR,100,1,3.0,82,0,0,0,...,pookie0331,"87 % avis positifs (7,497 notes)",7497,4.079365,583141fbfef4094e51453d9b,0,,summer,2020-08,309
4,8.0,11,EUR,1000,0,4.0,127,0,0,0,...,shitongyi12683626,91 % avis positifs (173 notes),173,4.306358,5d3bc9f8e13a7e654424a4cb,0,,summer,2020-08,226


In [None]:
# Print the shape of the data
print(df_train.shape)


(1573, 34)


# Exploring  Data

In [None]:
# Print the column names of the training and testing data
print('columns of Train data: ','\n',df_train1.columns,'\n','\n')
print('columns of Test data: ','\n',df_test1.columns)

columns of Train data:  
 Index(['price', 'retail_price', 'currency_buyer', 'units_sold',
       'uses_ad_boosts', 'rating', 'rating_count', 'badges_count',
       'badge_local_product', 'badge_product_quality', 'badge_fast_shipping',
       'tags', 'product_color', 'product_variation_size_id',
       'product_variation_inventory', 'shipping_option_name',
       'shipping_option_price', 'shipping_is_express', 'countries_shipped_to',
       'inventory_total', 'has_urgency_banner', 'urgency_text',
       'origin_country', 'merchant_title', 'merchant_name',
       'merchant_info_subtitle', 'merchant_rating_count', 'merchant_rating',
       'merchant_id', 'merchant_has_profile_picture',
       'merchant_profile_picture', 'theme', 'crawl_month', 'id'],
      dtype='object') 
 

columns of Test data:  
 Index(['price', 'retail_price', 'currency_buyer', 'units_sold',
       'uses_ad_boosts', 'rating_count', 'badges_count', 'badge_local_product',
       'badge_product_quality', 'badge_fast_shi

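Notice: The train and test sets differ in one column. The train set has 'rating' as its sixth column, while the original test set has no 'rating' column; the 'rating' column that appears at the end of the test columns is the placeholder added when loading. All other column names, including 'rating_count', are identical in both lists.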
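
Rather than comparing the two printed lists by eye, the difference can be computed directly. This is a minimal sketch on empty hypothetical frames that carry only a few of the column names.

```python
import pandas as pd

# Hypothetical frames with a handful of the real column names.
toy_train = pd.DataFrame(columns=["price", "rating", "rating_count", "id"])
toy_test = pd.DataFrame(columns=["price", "rating_count", "id"])

# Columns present in one frame but not in the other.
only_in_train = toy_train.columns.difference(toy_test.columns)
only_in_test = toy_test.columns.difference(toy_train.columns)
print(list(only_in_train), list(only_in_test))
```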

In [None]:
print('shape of Train data: ','\n',df_train1.shape,'\n')
print('shape of Test data: ','\n',df_test1.shape,'\n')

print('first five lines of Train data: ','\n',df_train1.head(),'\n','\n')
print('first five lines of Test data: ','\n',df_test1.head(),'\n','\n')


shape of Train data:  
 (1094, 34) 

shape of Test data:  
 (479, 34) 

first five lines of Train data:  
    price  retail_price currency_buyer  units_sold  uses_ad_boosts  rating  \
0   8.00             7            EUR       10000               1     4.0   
1   8.00             7            EUR        1000               0     4.0   
2   6.00            34            EUR         100               0     4.0   
3   4.67             4            EUR         100               1     3.0   
4   8.00            11            EUR        1000               0     4.0   

Notice: The shapes differ only in the number of rows. The train data (1094 rows) is a little more than twice the size of the test data (479 rows), and both datasets have the same number of columns (34).

In [None]:
# Display detailed information about data
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1573 entries, 0 to 1572
Data columns (total 34 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   price                         1573 non-null   float64
 1   retail_price                  1573 non-null   int64  
 2   currency_buyer                1573 non-null   object 
 3   units_sold                    1573 non-null   int64  
 4   uses_ad_boosts                1573 non-null   int64  
 5   rating                        1573 non-null   float64
 6   rating_count                  1573 non-null   int64  
 7   badges_count                  1573 non-null   int64  
 8   badge_local_product           1573 non-null   int64  
 9   badge_product_quality         1573 non-null   int64  
 10  badge_fast_shipping           1573 non-null   int64  
 11  tags                          1573 non-null   object 
 12  product_color                 1532 non-null   object 
 13  pro

**Notice:** Some columns have missing values, and several columns have data type object, which must be converted to numeric types before modeling.

In [None]:
# Display all Data
df_train

Unnamed: 0,price,retail_price,currency_buyer,units_sold,uses_ad_boosts,rating,rating_count,badges_count,badge_local_product,badge_product_quality,...,merchant_name,merchant_info_subtitle,merchant_rating_count,merchant_rating,merchant_id,merchant_has_profile_picture,merchant_profile_picture,theme,crawl_month,id
0,8.00,7,EUR,10000,1,4.0,1670,0,0,0,...,keepahorse,"88 % avis positifs (66,644 notes)",66644,4.137582,577fb2b368116418674befd9,1,https://s3-us-west-1.amazonaws.com/sweeper-pro...,summer,2020-08,272
1,8.00,7,EUR,1000,0,4.0,843,1,1,0,...,上海铭启电气科技有限公司,"91 % avis positifs (25,752 notes)",25752,4.256873,566a3ef17233ff2686443082,0,,summer,2020-08,1515
2,6.00,34,EUR,100,0,4.0,10,0,0,0,...,zhaodongmei,"83 % avis positifs (3,897 notes)",3897,3.928920,5aacd6c1ccf0c83e5a6f3de0,0,,summer,2020-08,940
3,4.67,4,EUR,100,1,3.0,82,0,0,0,...,pookie0331,"87 % avis positifs (7,497 notes)",7497,4.079365,583141fbfef4094e51453d9b,0,,summer,2020-08,309
4,8.00,11,EUR,1000,0,4.0,127,0,0,0,...,shitongyi12683626,91 % avis positifs (173 notes),173,4.306358,5d3bc9f8e13a7e654424a4cb,0,,summer,2020-08,226
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1568,16.00,14,EUR,100,1,0.0,27,0,0,0,...,hernadezyeli,70 % avis positifs (572 notes),572,3.367133,5dc818ca030a154e48a3a8af,0,,summer,2020-08,1166
1569,6.00,17,EUR,5000,1,0.0,817,0,0,0,...,firstcrush,"87 % avis positifs (50,864 notes)",50864,4.079644,587f85cbb8038f4cb049c56c,1,https://s3-us-west-1.amazonaws.com/sweeper-pro...,summer,2020-08,1416
1570,4.83,51,EUR,50,1,0.0,1,0,0,0,...,saulstore,85 % avis positifs (57 notes),57,3.824561,5e3c5fbfe217d492138970c1,0,,summer,2020-08,63
1571,3.00,3,EUR,100,1,0.0,8,0,0,0,...,hey1827,"87 % avis positifs (1,086 notes)",1086,4.056169,5c6d6c28619caa1d94891d06,0,,summer,2020-08,626


In [None]:
# Print summary statistics to get a quick view of the distribution of each numeric column
print(df_train.describe())

             price  retail_price     units_sold  uses_ad_boosts       rating  \
count  1573.000000   1573.000000    1573.000000     1573.000000  1573.000000   
mean      8.325372     23.288620    4339.005086        0.432931     2.786395   
std       3.932030     30.357863    9356.539302        0.495639     1.907903   
min       1.000000      1.000000       1.000000        0.000000     0.000000   
25%       5.810000      7.000000     100.000000        0.000000     0.000000   
50%       8.000000     10.000000    1000.000000        0.000000     4.000000   
75%      11.000000     26.000000    5000.000000        1.000000     4.000000   
max      49.000000    252.000000  100000.000000        1.000000     6.000000   

       rating_count  badges_count  badge_local_product  badge_product_quality  \
count   1573.000000   1573.000000          1573.000000            1573.000000   
mean     889.659250      0.105531             0.018436               0.074380   
std     1983.928834      0.340709   

In [None]:
# Display columns' names in Data 
df_train.columns

Index(['price', 'retail_price', 'currency_buyer', 'units_sold',
       'uses_ad_boosts', 'rating', 'rating_count', 'badges_count',
       'badge_local_product', 'badge_product_quality', 'badge_fast_shipping',
       'tags', 'product_color', 'product_variation_size_id',
       'product_variation_inventory', 'shipping_option_name',
       'shipping_option_price', 'shipping_is_express', 'countries_shipped_to',
       'inventory_total', 'has_urgency_banner', 'urgency_text',
       'origin_country', 'merchant_title', 'merchant_name',
       'merchant_info_subtitle', 'merchant_rating_count', 'merchant_rating',
       'merchant_id', 'merchant_has_profile_picture',
       'merchant_profile_picture', 'theme', 'crawl_month', 'id'],
      dtype='object')

In [None]:
# Check for duplicate rows in the data
df_train.duplicated().value_counts()

False    1573
dtype: int64

**Notice:** There are no duplicate rows in the DataFrame.

In [None]:
# Check Null values in Data in each column of DataFrame.
df_train.isnull().sum()

price                              0
retail_price                       0
currency_buyer                     0
units_sold                         0
uses_ad_boosts                     0
rating                             0
rating_count                       0
badges_count                       0
badge_local_product                0
badge_product_quality              0
badge_fast_shipping                0
tags                               0
product_color                     41
product_variation_size_id         14
product_variation_inventory        0
shipping_option_name               0
shipping_option_price              0
shipping_is_express                0
countries_shipped_to               0
inventory_total                    0
has_urgency_banner              1100
urgency_text                    1100
origin_country                    17
merchant_title                     0
merchant_name                      4
merchant_info_subtitle             1
merchant_rating_count              0
m

From the output above, these columns have missing values (out of 1573 rows):

        has_urgency_banner            1100
        urgency_text                  1100
        merchant_profile_picture      1347
        product_color                   41
        origin_country                  17
        product_variation_size_id       14
        merchant_name                    4
        merchant_info_subtitle           1

We can drop [has_urgency_banner, urgency_text] because about 70% of their values (1100 of 1573) are missing.
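
The same decision can be made from the share of missing values per column instead of the raw counts. Below is a minimal sketch on a hypothetical frame; the 0.5 cut-off is an assumption chosen for illustration, not a value used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one mostly-empty column and one with a single gap.
toy = pd.DataFrame({
    "price": [8.0, 6.0, 4.67, 8.0],
    "urgency_text": [np.nan, np.nan, "limited", np.nan],
    "product_color": ["black", np.nan, "white", "red"],
})

# Fraction of missing values in each column.
missing_share = toy.isnull().mean()

# Drop every column whose missing share exceeds the assumed cut-off.
threshold = 0.5
cols_to_drop = missing_share[missing_share > threshold].index.tolist()
reduced = toy.drop(columns=cols_to_drop)
print(cols_to_drop, list(reduced.columns))
```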

In [None]:
# Print the value counts of every column
for column in df_train.columns:
    print(column)
    print(df_train[column].value_counts())
    print("------------------------")

price
8.00     282
11.00    202
7.00     129
9.00     126
6.00     120
        ... 
4.94       1
1.84       1
4.74       1
3.74       1
1.71       1
Name: price, Length: 127, dtype: int64
------------------------
retail_price
7      177
6      136
10     128
5      102
11      99
      ... 
101      1
250      1
66       1
69       1
1        1
Name: retail_price, Length: 104, dtype: int64
------------------------
currency_buyer
EUR    1573
Name: currency_buyer, dtype: int64
------------------------
units_sold
100       509
1000      405
5000      217
10000     177
20000     103
50         76
10         49
50000      17
100000      6
8           4
1           3
2           2
7           2
3           2
6           1
Name: units_sold, dtype: int64
------------------------
uses_ad_boosts
0    892
1    681
Name: uses_ad_boosts, dtype: int64
------------------------
rating
4.0    774
0.0    479
5.0    170
3.0    135
2.0     12
1.0      2
6.0      1
Name: rating, dtype: int64
--------------

**Notice:** From this detailed information, some features such as 'id', 'merchant_title', and 'merchant_name' look unimportant: they identify the product or merchant and should not affect a user's rating.

---
['merchant_profile_picture'] has a lot of missing data, so I will drop it.

---
['rating'] contains one value of 6. Valid ratings range from 1 to 5, so this value must be dropped or replaced with 5. The 479 rows with rating 0 are the test rows that received the placeholder rating when the data was loaded.

---
['merchant_has_profile_picture'] We don't need this column because it has far more 0 values than 1 values.

---
['rating_count'] The number of ratings left by other users is assumed not to influence an individual user's rating, so I will drop it.

In [None]:
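# Check for highly correlated features
# Restrict to numeric columns; correlation is not defined for text columns
cor_matrix = df_train.select_dtypes(include='number').corr().abs()
# Keep only the upper triangle so each pair of features is checked once
upper_tri = cor_matrix.where(np.triu(np.ones(cor_matrix.shape),k=1).astype(bool))
to_drop = [column for column in upper_tri.columns if any(upper_tri[column] > 0.95)]
print(to_drop)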

[]


**Notice:** The empty list means that no pair of numeric features has an absolute correlation coefficient greater than 0.95.
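
To see what the upper-triangle filter reports when a redundant feature does exist, the sketch below applies the same steps to synthetic data in which one column is a linear function of another. The column names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic data: "a_copy" is a linear function of "a", "b" is independent.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"a": rng.normal(size=100), "b": rng.normal(size=100)})
toy["a_copy"] = toy["a"] * 2 + 1

# Absolute correlations, keeping only entries above the diagonal.
cor_matrix = toy.corr().abs()
upper_tri = cor_matrix.where(np.triu(np.ones(cor_matrix.shape), k=1).astype(bool))

# A column is flagged when it correlates strongly with an earlier column.
to_drop = [c for c in upper_tri.columns if (upper_tri[c] > 0.95).any()]
print(to_drop)
```

Because only the upper triangle is inspected, the later column of each correlated pair is flagged and the earlier one is kept.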

In [None]:
# Display the unique values in the column, and count of each unique value
print(df_train.loc[:,'merchant_name'].unique())
print(df_train.loc[:,'merchant_name'].value_counts())

['keepahorse' '上海铭启电气科技有限公司' 'zhaodongmei' 'pookie0331'
 'shitongyi12683626' 'pashesa' 'shenzhenfinejofashioncoltd'
 'xiakeliuxingfushi' 'leescloset' 'litiannetworking' 'dududustore'
 'redisland' 'fancykiniwardrobe' 'toplifeyouneed' 'jununearstore'
 'huanjun411' 'hzxuch' 'zufanqiudinli' 'hellohorse' 'zengdaitao'
 'kinziewoolshop' 'vvfashion2016' 'fashionbelts' 'liminny' 'llpp'
 'gaoying666300' 'ahaashop' 'zhuangyuping1' 'liveupp' 'mansocean'
 'prosehome' 'happygirlinternationalshoppingpark' 'persisttree'
 'aqlangweigarmentstradingcoltd' 'raverks' '公狼电子'
 'chentidalcurrentclothing' 'good999' 'youngsboutique'
 'shuilingjiaointernationaltradecompany' 'cjinxiang' 'memo2' 'xvsivo7412'
 'masacoltd' nan 'nalininternational' 'longming' 'lionmouse' 'chailing'
 'yellowstore' 'girlsshowtime' 'mattolivo' 'yanzhang198411'
 'sellbestqualitygoods' 'molesfashion' 'harunoutanail' '广州唯适唯服饰有限公司'
 'qb_fashion' 'yshsbjj' 'homeaccessoryinternational' 'fashionhappyworld'
 'langwei' 'hongyuli' 'soband' 'cp743

**Notice:** ['merchant_name'] will not affect the user rating, so drop it.

In [None]:
#df_train.loc[:,'currency_buyer'].unique()
#df_train.loc[:,'currency_buyer'].value_counts()

In [None]:
#df_train.loc[:,'tags'].unique()
#df_train.loc[:,'tags'].value_counts()

In [None]:
#df_train.loc[:,'has_urgency_banner'].unique()
#df_train['has_urgency_banner'].value_counts()

In [None]:
#df_train.loc[:,'merchant_has_profile_picture'].unique()
#df_train.loc[:,'merchant_has_profile_picture'].value_counts()

In [None]:
df_train.loc[:,'merchant_title'].unique()
df_train.loc[:,'merchant_title'].value_counts()
# The merchant title will not affect the user rating, so drop it

guangzhouweishiweifushiyouxiangongsi        15
Suyi Technology                             12
sjhdstoer                                    9
shuilingjiao international trade company     8
Cenic Beauty                                 8
                                            ..
yuhang081899                                 1
Ruihong Home&Garden Co.,ltd                  1
liudandan28652                               1
HillnTrees                                   1
lkwelyly                                     1
Name: merchant_title, Length: 958, dtype: int64

In [None]:
df_train.loc[:,'shipping_option_name'].unique()
df_train.loc[:,'shipping_option_name'].value_counts()
# 1508 of the 1573 rows use 'Livraison standard', so this column carries little information

Livraison standard         1508
Standard Shipping            21
Envio Padrão                  9
Expediere Standard            6
Envío normal                  5
الشحن القياسي                 4
Standardowa wysyłka           3
Standardversand               3
Livraison Express             3
Стандартная доставка          3
Spedizione standard           2
Standart Gönderi              2
การส่งสินค้ามาตรฐาน           2
ការដឹកជញ្ជូនតាមស្តង់ដារ       1
Ekspresowa wysyłka            1
Name: shipping_option_name, dtype: int64
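
Columns dominated by a single value, like the one above, can also be found programmatically. This is a minimal sketch on hypothetical data; the 0.9 cut-off is an assumption for the example.

```python
import pandas as pd

# Hypothetical frame: one near-constant column and one balanced column.
toy = pd.DataFrame({
    "shipping_option_name": ["Livraison standard"] * 19 + ["Standard Shipping"],
    "units_sold": [100, 1000, 5000, 10000] * 5,
})

# Share of rows taken by the most frequent value in each column.
top_share = toy.apply(
    lambda col: col.value_counts(normalize=True, dropna=False).iloc[0]
)

# Columns where one value covers at least 90% of the rows.
near_constant = top_share[top_share >= 0.9].index.tolist()
print(near_constant)
```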

# Cleaning Data

In [None]:
df_train.columns

Index(['price', 'retail_price', 'currency_buyer', 'units_sold',
       'uses_ad_boosts', 'rating', 'rating_count', 'badges_count',
       'badge_local_product', 'badge_product_quality', 'badge_fast_shipping',
       'tags', 'product_color', 'product_variation_size_id',
       'product_variation_inventory', 'shipping_option_name',
       'shipping_option_price', 'shipping_is_express', 'countries_shipped_to',
       'inventory_total', 'has_urgency_banner', 'urgency_text',
       'origin_country', 'merchant_title', 'merchant_name',
       'merchant_info_subtitle', 'merchant_rating_count', 'merchant_rating',
       'merchant_id', 'merchant_has_profile_picture',
       'merchant_profile_picture', 'theme', 'crawl_month', 'id'],
      dtype='object')

In [None]:
# Drop columns that are unlikely to be useful in predicting product rating
df_train.drop(columns=['shipping_is_express', 'theme', 'crawl_month',
                       'has_urgency_banner', 'countries_shipped_to',
                       'urgency_text', 'merchant_profile_picture',
                       'inventory_total', 'currency_buyer', 'tags',
                       'shipping_option_name',
                       'merchant_has_profile_picture', 'merchant_id', 'merchant_info_subtitle',
                       'merchant_name', 'merchant_title', 'origin_country',
                       'rating_count'], inplace=True)

In [None]:
df_train.columns

Index(['price', 'retail_price', 'units_sold', 'uses_ad_boosts', 'rating',
       'badges_count', 'badge_local_product', 'badge_product_quality',
       'badge_fast_shipping', 'product_color', 'product_variation_size_id',
       'product_variation_inventory', 'shipping_option_price',
       'merchant_rating_count', 'merchant_rating', 'id'],
      dtype='object')

In [None]:
# Display count of unique values for each feature 
df_train.nunique()

price                           127
retail_price                    104
units_sold                       15
uses_ad_boosts                    2
rating                            7
badges_count                      4
badge_local_product               2
badge_product_quality             2
badge_fast_shipping               2
product_color                   101
product_variation_size_id       106
product_variation_inventory      48
shipping_option_price             8
merchant_rating_count           917
merchant_rating                 952
id                             1573
dtype: int64

In [None]:
# List columns that contain a single unique value (constant columns)
df_train.columns[df_train.nunique()==1].tolist()

[]

In [None]:
# Display the remaining column names
df_train.columns

Index(['price', 'retail_price', 'units_sold', 'uses_ad_boosts', 'rating',
       'badges_count', 'badge_local_product', 'badge_product_quality',
       'badge_fast_shipping', 'product_color', 'product_variation_size_id',
       'product_variation_inventory', 'shipping_option_price',
       'merchant_rating_count', 'merchant_rating', 'id'],
      dtype='object')

In [None]:
df_train.isna().any()


price                          False
retail_price                   False
units_sold                     False
uses_ad_boosts                 False
rating                         False
badges_count                   False
badge_local_product            False
badge_product_quality          False
badge_fast_shipping            False
product_color                   True
product_variation_size_id       True
product_variation_inventory    False
shipping_option_price          False
merchant_rating_count          False
merchant_rating                False
id                             False
dtype: bool

## Rating

In [None]:
df_train["rating"].value_counts()

4.0    774
0.0    479
5.0    170
3.0    135
2.0     12
1.0      2
6.0      1
Name: rating, dtype: int64

In [None]:
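# Check for missing ratings
df_train["rating"].isnull().sum()
# Replace the single out-of-range rating 6 with the maximum valid rating 5
df_train["rating"] = df_train["rating"].replace(6, 5)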

In [None]:
# We don't need 'rating==6' value
# df_train[df_train['rating']==6]  #df_train = df_train[df_train['rating'] != 6] 

In [None]:
#Now Drop it using index number
# df_train.drop(axis=0,index=971, inplace=True)
# Drop rows in Rating columns when == 6  --- df_train = df_train[df_train['rating'] != 6] 

In [None]:
df_train["rating"].value_counts()

4.0    774
0.0    479
5.0    171
3.0    135
2.0     12
1.0      2
Name: rating, dtype: int64
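
A more general way to handle out-of-range labels is to test each value against the valid range and clip the offenders. The sketch below uses a hypothetical series without placeholder zeros; applied to the combined frame, a lower bound of 1 would also change the test placeholders, so it assumes labeled rows only.

```python
import pandas as pd

# Hypothetical ratings containing one out-of-range value.
ratings = pd.Series([4.0, 5.0, 6.0, 3.0, 1.0])

# Flag values outside the valid 1-5 range.
valid = ratings.between(1, 5)
out_of_range = ratings[~valid].tolist()

# Clip to the valid range, which maps 6 to 5.
clipped = ratings.clip(lower=1, upper=5)
print(out_of_range, clipped.tolist())
```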

## product_color

In [None]:
df_train['product_color'].isnull().sum()

41

In [None]:
df_train['product_color'].unique()

array(['yellow', 'black', 'white', 'lakeblue', 'apricot', 'brown',
       'winered', 'blue', 'red', 'navyblue', 'green', 'khaki', 'White',
       'white & green', 'multicolor', 'lightpink', 'pink', 'RED',
       'armygreen', 'lightblue', nan, 'coffee', 'grey', 'skyblue',
       'watermelonred', 'pink & black', 'whitefloral', 'purple', 'navy',
       'pink & white', 'rosered', 'orange', 'Black', 'mintgreen',
       'leopardprint', 'gray', 'navy blue', 'star', 'rose', 'lightyellow',
       'camouflage', 'black & yellow', 'whitestripe', 'navyblue & white',
       'black & blue', 'lightred', 'violet', 'gold', 'black & green',
       'white & black', 'burgundy', 'black & white', 'lightgrey',
       'coolblack', 'lightgreen', 'beige', 'darkblue', 'darkgreen',
       'silver', 'wine red', 'Army green', 'pink & blue', 'rainbow',
       'claret', 'floral', 'brown & yellow', 'light green', 'Pink',
       'blue & pink', 'dustypink', 'camel', 'orange-red', 'rosegold',
       'ivory', 'fluorescentg

In [None]:
df_train['product_color'].value_counts()

black               302
white               254
yellow              105
blue                 99
pink                 99
                   ... 
navyblue & white      1
winered & yellow      1
offwhite              1
lightgray             1
gray & white          1
Name: product_color, Length: 101, dtype: int64

In [None]:
df_train['product_color'] = df_train['product_color'].str.lower()

In [None]:
df_train['product_color'].unique()

array(['yellow', 'black', 'white', 'lakeblue', 'apricot', 'brown',
       'winered', 'blue', 'red', 'navyblue', 'green', 'khaki',
       'white & green', 'multicolor', 'lightpink', 'pink', 'armygreen',
       'lightblue', nan, 'coffee', 'grey', 'skyblue', 'watermelonred',
       'pink & black', 'whitefloral', 'purple', 'navy', 'pink & white',
       'rosered', 'orange', 'mintgreen', 'leopardprint', 'gray',
       'navy blue', 'star', 'rose', 'lightyellow', 'camouflage',
       'black & yellow', 'whitestripe', 'navyblue & white',
       'black & blue', 'lightred', 'violet', 'gold', 'black & green',
       'white & black', 'burgundy', 'black & white', 'lightgrey',
       'coolblack', 'lightgreen', 'beige', 'darkblue', 'darkgreen',
       'silver', 'wine red', 'army green', 'pink & blue', 'rainbow',
       'claret', 'floral', 'brown & yellow', 'light green', 'blue & pink',
       'dustypink', 'camel', 'orange-red', 'rosegold', 'ivory',
       'fluorescentgreen', 'winered & yellow', 'offwh

In [None]:
df_train['product_color'].value_counts()

black               305
white               257
yellow              105
pink                101
blue                100
                   ... 
gold                  1
winered & yellow      1
offwhite              1
lightgray             1
gray & white          1
Name: product_color, Length: 95, dtype: int64

In [None]:
# Group two-color combinations under a single 'multicolor' label
df_train['product_color'] = df_train['product_color'].replace(
    ['white & green', 'black & stripe', 'pink & black', 'pink & white',
     'black & blue', 'navyblue & white', 'black & yellow', 'brown & yellow',
     'pink & blue', 'black & green', 'blue & pink', 'pink & grey',
     'winered & yellow', 'red & blue', 'white & red', 'gray & white',
     'orange & camouflage'], 'multicolor')

In [None]:
# Merge alternative spellings of the same color
df_train['product_color'] = df_train['product_color'].replace({
    'black & white': 'blackwhite', 'white & black': 'blackwhite',
    'navy blue': 'navyblue', 'wine red': 'winered',
    'army green': 'armygreen', 'light green': 'lightgreen',
    'rose red': 'rosered'})
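
The manual replacement lists could be partly replaced by a small rule-based normalizer. The sketch below is one possible approach, not the method used in this notebook: the helper `normalize_color` is hypothetical, and it maps every "a & b" combination to 'multicolor', whereas the cells above keep 'blackwhite' as its own label.

```python
import pandas as pd

# Hypothetical sample of raw color labels, including a missing value.
colors = pd.Series(["White", "RED", "navy blue", "pink & black", "Army green", None])

def normalize_color(value):
    """Lowercase, remove inner spaces, and collapse 'a & b' combinations."""
    if pd.isna(value):
        return value  # leave missing values for a later imputation step
    value = value.strip().lower()
    if "&" in value:
        return "multicolor"
    return value.replace(" ", "")

cleaned = colors.map(normalize_color)
print(cleaned.tolist())
```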


In [None]:
df_train['product_color'].unique()

array(['yellow', 'black', 'white', 'lakeblue', 'apricot', 'brown',
       'winered', 'blue', 'red', 'navyblue', 'green', 'khaki',
       'multicolor', 'lightpink', 'pink', 'armygreen', 'lightblue', nan,
       'coffee', 'grey', 'skyblue', 'watermelonred', 'whitefloral',
       'purple', 'navy', 'rosered', 'orange', 'mintgreen', 'leopardprint',
       'gray', 'star', 'rose', 'lightyellow', 'camouflage', 'whitestripe',
       'lightred', 'violet', 'gold', 'blackwhite', 'burgundy',
       'lightgrey', 'coolblack', 'lightgreen', 'beige', 'darkblue',
       'darkgreen', 'silver', 'rainbow', 'claret', 'floral', 'dustypink',
       'camel', 'orange-red', 'rosegold', 'ivory', 'fluorescentgreen',
       'offwhite', 'lightgray', 'wine', 'army', 'applegreen', 'nude',
       'denimblue', 'leopard', 'coralred', 'tan', 'offblack', 'jasper',
       'greysnakeskinprint', 'lightpurple', 'lightkhaki', 'prussianblue'],
      dtype=object)

In [None]:
df_train['product_color'].value_counts()

black           305
white           257
yellow          105
pink            101
blue            100
               ... 
gold              1
whitestripe       1
star              1
leopardprint      1
prussianblue      1
Name: product_color, Length: 71, dtype: int64

In [None]:
df_null=df_train.loc[df_train['product_color'].isnull(),'rating']

In [None]:
df_null.value_counts()

4.0    16
0.0    12
5.0    10
3.0     3
Name: rating, dtype: int64

In [None]:
df=df_train.loc[df_train['product_color'] == 'black','rating']

In [None]:
df.value_counts()


4.0    159
0.0     83
5.0     43
3.0     17
2.0      3
Name: rating, dtype: int64

In [None]:
df.unique()

array([4., 5., 3., 2., 0.])

In [None]:
mode = df_train['product_color'].mode()
mode

0    black
Name: product_color, dtype: object

In [None]:
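# Fill missing colors with the most frequent value (the mode), assigning the result back to the column
df_train['product_color'] = df_train['product_color'].fillna('black')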
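
The mode does not have to be typed in by hand; it can be read from the column and passed straight to `fillna`. This is a minimal sketch on a hypothetical column.

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing entries.
toy = pd.DataFrame({"product_color": ["black", "white", np.nan, "black", np.nan]})

# mode() returns a Series because ties are possible; take the first entry.
most_common = toy["product_color"].mode()[0]
toy["product_color"] = toy["product_color"].fillna(most_common)
print(most_common, toy["product_color"].tolist())
```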

In [None]:
df_train.columns

Index(['price', 'retail_price', 'units_sold', 'uses_ad_boosts', 'rating',
       'badges_count', 'badge_local_product', 'badge_product_quality',
       'badge_fast_shipping', 'product_color', 'product_variation_size_id',
       'product_variation_inventory', 'shipping_option_price',
       'merchant_rating_count', 'merchant_rating', 'id'],
      dtype='object')

## product_variation_size_id

In [None]:
df_train.rename(columns={'product_variation_size_id':'size_id'},inplace=True)

In [None]:
df_train['size_id'].value_counts()

S                641
XS               356
M                200
XXS              100
L                 49
                ... 
2                  1
20PCS-10PAIRS      1
Size-5XL           1
Size/S             1
36                 1
Name: size_id, Length: 106, dtype: int64

In [None]:
df_train['size_id'].unique()

array(['M', 'L', 'XS', 'S', 'XL', '26(Waist 72cm 28inch)', 'S.',
       'S(bust 88cm)', 'XXS', 's', '29', nan, 'choose a size', 'XXXS',
       'Base Coat', 'Size M', 'XXL', 'M.', 'XS.',
       '100 x 100cm(39.3 x 39.3inch)', '2pcs', '4XL', '1', '25-S',
       'Size-XXS', '5PAIRS', '35', 'Pack of 1', 'Size S', 'Size-S', '6XL',
       '25', 'S/M(child)', '60', 'Size-XS', 'S (waist58-62cm)',
       'SIZE XXS', '10 ml', 'X   L', 'Women Size 36', '04-3XL',
       'Size -XXS', '1 pc.', 'Floating Chair for Kid', 'S Pink', '34',
       'US-S', 'Size XXS', 'pants-S', 'XXXXL', 'SIZE-XXS', 'SIZE XS',
       '1pc', 'Size S.', '100 cm', 'S..', 'Round', '4-5 Years', '5', '33',
       '30 cm', '2', 'XXXXXL', '20PCS-10PAIRS', '2XL', 'Size-5XL',
       'Size4XL', 'One Size', 'size S', 'Size/S', 'B', 'SizeL', '20pcs',
       '1 PC - XL', 'Suit-S', 'Base & Top & Matte Top Coat',
       'Baby Float Boat', '1m by 3m', 'SIZE S', 'White', '40 cm', '5XL',
       '10pcs', 'H01', 'S(Pink & Black)', '32/L', 'dau

In [None]:
df_train['size_id']= df_train['size_id'].replace(['S.','S(bust 88cm)','25-S','Size S', 'Size-S',
'S (waist58-62cm)','S Pink','US-S', 'pants-S', 'S..', 'Size S.','size S', 'Size/S','Suit-S','SIZE S' ],   'S')

In [None]:
df_train['size_id']=df_train['size_id'].str.upper()

In [None]:
df_train['size_id']= df_train['size_id'].replace(['XXXS'],'3XS')
df_train['size_id']= df_train['size_id'].replace(['SIZE-XXS','SIZE XXS','XXS','SIZE -XXS'],'2XS')
df_train['size_id']= df_train['size_id'].replace(['XS.','SIZE-XS','SIZE XS'],   'XS')
df_train['size_id']= df_train['size_id'].replace(['WOMEN SIZE 36', '29','S(PINK & BLACK)','25-S','SIZE--S',
                                                  'US 6.5 (EU 37)','WOMEN SIZE 37','US5.5-EU35','EU 35' ],   'S')
df_train['size_id']= df_train['size_id'].replace(['M.','60','S/M(CHILD)','SIZE M','26(WAIST 72CM 28INCH)' ],   'M')
df_train['size_id']= df_train['size_id'].replace(['SIZEL','32/L','SIZE-L', '100 X 100CM(39.3 X 39.3INCH)','L.',
                                                  'EU39(US8)'],'L')
df_train['size_id']= df_train['size_id'].replace(['X   L','40 CM', '1 PC - XL','34','35','36'],   'XL')
df_train['size_id']= df_train['size_id'].replace(['XXL'],'2XL')
df_train['size_id']= df_train['size_id'].replace(['04-3XL','XXXL'],'3XL')
df_train['size_id']= df_train['size_id'].replace(['SIZE4XL','XXXXL','SIZE-4XL'],'4XL')
df_train['size_id']= df_train['size_id'].replace([ 'SIZE-5XL', 'XXXXXL'],'5XL')
df_train['size_id']= df_train['size_id'].replace([ 'XXXXXXL'],'6XL')
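
The chain of `replace` calls above can also be written as a single lookup table. This is a minimal sketch, not the notebook's method: `SIZE_ALIASES` and `normalize_size` are hypothetical names, the table holds only a few of the aliases listed above, and a toy Series stands in for `df_train['size_id']`.

```python
import pandas as pd

# Hypothetical alias table; only a handful of the aliases from the cells above.
SIZE_ALIASES = {
    "XXXS": "3XS",
    "XXS": "2XS", "SIZE-XXS": "2XS", "SIZE XXS": "2XS",
    "XS.": "XS", "SIZE-XS": "XS",
    "S.": "S", "SIZE S": "S", "SIZE-S": "S",
    "M.": "M", "SIZE M": "M",
    "XXL": "2XL",
    "XXXXL": "4XL", "SIZE4XL": "4XL",
}

def normalize_size(sizes: pd.Series) -> pd.Series:
    """Upper-case and strip the labels, then map known aliases to canonical sizes."""
    cleaned = sizes.str.upper().str.strip()
    # replace() leaves labels that are not keys of the dict untouched
    return cleaned.replace(SIZE_ALIASES)

toy = pd.Series(["Size-XXS", "s", "M.", "XXXXL", None, "Base Coat"])
normalized = normalize_size(toy)
print(normalized.tolist())
```

Keeping the aliases in one dict makes it easier to spot duplicates and to apply the same cleaning to another DataFrame.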


In [None]:
df_train['size_id'].unique()

array(['M', 'L', 'XS', 'S', 'XL', '2XS', nan, 'CHOOSE A SIZE', '3XS',
       'BASE COAT', '2XL', '2PCS', '4XL', '1', '5PAIRS', 'PACK OF 1',
       '6XL', '25', '10 ML', '3XL', '1 PC.', 'FLOATING CHAIR FOR KID',
       '1PC', '100 CM', 'ROUND', '4-5 YEARS', '5', '33', '30 CM', '2',
       '5XL', '20PCS-10PAIRS', 'ONE SIZE', 'B', '20PCS',
       'BASE & TOP & MATTE TOP COAT', 'BABY FLOAT BOAT', '1M BY 3M',
       'WHITE', '10PCS', 'H01', 'DAUGHTER 24M', '4', '80 X 200 CM',
       '100PCS', 'FIRST  GENERATION', 'S DIAMETER 30CM',
       'AU PLUG LOW QUALITY', '3 LAYERED ANKLET', '17'], dtype=object)

In [None]:
df_train.size_id.mode() 


0    S
Name: size_id, dtype: object

In [None]:
df_train['size_id'] = df_train['size_id'].fillna('S')

In [None]:
df_train['size_id'].isnull().sum()

0
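
The imputation step above hardcodes `'S'` after reading it off the `mode()` output. A small sketch of the same idea with the mode looked up programmatically, assuming a toy Series in place of the real column (`toy`, `most_frequent` and `filled` are illustrative names):

```python
import numpy as np
import pandas as pd

toy = pd.Series(["S", "M", np.nan, "S", "XS", np.nan])

# mode() ignores NaN by default and returns a Series; take its first entry
most_frequent = toy.mode().iloc[0]

# Assign the result back rather than calling fillna(..., inplace=True) on a column selection
filled = toy.fillna(most_frequent)

print(most_frequent, filled.isna().sum())
```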

In [None]:
df_train['size_id']=df_train['size_id'].replace([
       'CHOOSE A SIZE',
       'BASE COAT' , '2PCS', '1', '5PAIRS', 'PACK OF 1',
        '25', '10 ML', '1 PC.', 'FLOATING CHAIR FOR KID',
       '1PC', '100 CM', 'ROUND', '4-5 YEARS', '5', '33', '30 CM', '2',
        '20PCS-10PAIRS', 'ONE SIZE', 'B', '20PCS',
       'BASE & TOP & MATTE TOP COAT', 'BABY FLOAT BOAT', '1M BY 3M',
       'WHITE', '10PCS', 'H01' , 'DAUGHTER 24M', '4',
       '80 X 200 CM', '100PCS', 'FIRST  GENERATION', 'S DIAMETER 30CM',
       'AU PLUG LOW QUALITY', '3 LAYERED ANKLET', '17'],'other')

In [None]:
df_train['size_id'].unique()

array(['M', 'L', 'XS', 'S', 'XL', '2XS', 'other', '3XS', '2XL', '4XL',
       '6XL', '3XL', '5XL'], dtype=object)
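
Listing every non-size label by hand works, but a whitelist of canonical sizes gives the same result with less typing. A sketch under that assumption; `KNOWN_SIZES` is a hypothetical name, and the toy Series only mimics the column:

```python
import pandas as pd

# Canonical labels to keep, taken from the unique() output above
KNOWN_SIZES = ["3XS", "2XS", "XS", "S", "M", "L", "XL",
               "2XL", "3XL", "4XL", "5XL", "6XL"]

toy = pd.Series(["S", "BASE COAT", "2XL", "10 ML", "ROUND", "XS"])

# where() keeps entries for which the condition is True and replaces the rest
collapsed = toy.where(toy.isin(KNOWN_SIZES), "other")
print(collapsed.tolist())
```

Missing values would also be mapped to `'other'` by this approach, so it is best applied after imputation.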

In [None]:
df_train['size_id'].value_counts()

S        713
XS       369
M        209
2XS      107
L         57
other     47
XL        26
2XL       19
4XL       10
3XS        6
5XL        5
3XL        4
6XL        1
Name: size_id, dtype: int64

In [None]:
df_train.columns

Index(['price', 'retail_price', 'units_sold', 'uses_ad_boosts', 'rating',
       'badges_count', 'badge_local_product', 'badge_product_quality',
       'badge_fast_shipping', 'product_color', 'size_id',
       'product_variation_inventory', 'shipping_option_price',
       'merchant_rating_count', 'merchant_rating', 'id'],
      dtype='object')

In [None]:
df_train.isna().any()

price                          False
retail_price                   False
units_sold                     False
uses_ad_boosts                 False
rating                         False
badges_count                   False
badge_local_product            False
badge_product_quality          False
badge_fast_shipping            False
product_color                  False
size_id                        False
product_variation_inventory    False
shipping_option_price          False
merchant_rating_count          False
merchant_rating                False
id                             False
dtype: bool

In [None]:
df_train.corr(numeric_only=True)

Unnamed: 0,price,retail_price,units_sold,uses_ad_boosts,rating,badges_count,badge_local_product,badge_product_quality,badge_fast_shipping,product_variation_inventory,shipping_option_price,merchant_rating_count,merchant_rating,id
price,1.0,0.304748,-0.024815,-0.073269,0.044306,0.063542,0.069592,0.009512,0.087336,0.274819,0.87258,-0.02379,0.047121,0.026923
retail_price,0.304748,1.0,0.012638,-0.035325,0.039483,0.019625,0.022677,9.8e-05,0.032201,0.118842,0.262516,0.029159,0.054399,-0.024683
units_sold,-0.024815,0.012638,1.0,-0.016055,0.030855,0.045402,-0.007544,0.063187,-0.000898,0.133846,-0.030987,0.272897,0.122504,0.008429
uses_ad_boosts,-0.073269,-0.035325,-0.016055,1.0,-0.01151,0.038173,0.061471,0.001697,0.038265,-0.135384,-0.061495,0.029735,-0.022564,0.010322
rating,0.044306,0.039483,0.030855,-0.01151,1.0,0.087689,0.010444,0.110653,-0.005105,-0.007898,0.040162,0.017713,0.07235,-0.031351
badges_count,0.063542,0.019625,0.045402,0.038173,0.087689,1.0,0.568036,0.801352,0.481274,0.031346,0.106984,0.058436,0.215401,0.019927
badge_local_product,0.069592,0.022677,-0.007544,0.061471,0.010444,0.568036,1.0,0.069215,0.364066,-0.009599,0.101477,-0.009927,0.11686,0.027756
badge_product_quality,0.009512,9.8e-05,0.063187,0.001697,0.110653,0.801352,0.069215,1.0,0.011081,0.081776,0.022741,0.090089,0.191399,0.006707
badge_fast_shipping,0.087336,0.032201,-0.000898,0.038265,-0.005105,0.481274,0.364066,0.011081,1.0,-0.084695,0.150136,-0.021419,0.066266,0.011546
product_variation_inventory,0.274819,0.118842,0.133846,-0.135384,-0.007898,0.031346,-0.009599,0.081776,-0.084695,1.0,0.265225,0.083577,0.158771,0.011502
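
The full matrix is hard to scan for the column that matters most here, the target. A sketch of ranking features by absolute correlation with `rating`, using synthetic data (the `toy` frame and its `noisy_copy` column are made up for illustration and are not part of the dataset):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "rating": rng.integers(1, 6, size=200).astype(float),
    "price": rng.normal(8, 2, size=200),
    "color": rng.choice(["red", "blue"], size=200),  # non-numeric, excluded below
})
# A synthetic feature that is strongly related to the target
toy["noisy_copy"] = toy["rating"] + rng.normal(0, 0.5, size=200)

# Keep numeric columns only, then take the 'rating' column of the matrix
numeric = toy.select_dtypes(include="number")
target_corr = numeric.corr()["rating"].drop("rating")

# Sort by magnitude so strong negative correlations are not buried
ranked = target_corr.abs().sort_values(ascending=False)
print(ranked)
```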


In [None]:
# df_train['inventory_total'].unique()

In [None]:
# df_train['inventory_total'].value_counts()

In [None]:
# df_train[:,'rating'].isna().sum()

In [None]:
df_train.columns.tolist()

['price',
 'retail_price',
 'units_sold',
 'uses_ad_boosts',
 'rating',
 'badges_count',
 'badge_local_product',
 'badge_product_quality',
 'badge_fast_shipping',
 'product_color',
 'size_id',
 'product_variation_inventory',
 'shipping_option_price',
 'merchant_rating_count',
 'merchant_rating',
 'id']

In [None]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1573 entries, 0 to 1572
Data columns (total 16 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   price                        1573 non-null   float64
 1   retail_price                 1573 non-null   int64  
 2   units_sold                   1573 non-null   int64  
 3   uses_ad_boosts               1573 non-null   int64  
 4   rating                       1573 non-null   float64
 5   badges_count                 1573 non-null   int64  
 6   badge_local_product          1573 non-null   int64  
 7   badge_product_quality        1573 non-null   int64  
 8   badge_fast_shipping          1573 non-null   int64  
 9   product_color                1573 non-null   object 
 10  size_id                      1573 non-null   object 
 11  product_variation_inventory  1573 non-null   int64  
 12  shipping_option_price        1573 non-null   int64  
 13  merchant_rating_co

# Encoding

In [None]:
#df_train.drop('merchant_id',axis=1, inplace=True)

In [None]:
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
# one_hot_encoded_data = pd.get_dummies(df_train, columns = ['product_color','size_id','origin_country'])
# one_hot_encoded_data

# cat_cols = ['product_color','size_id','origin_country']
# data = pd.get_dummies(df_train, columns=cat_cols)

# df_train=one_hot_encoded_data

In [None]:
categorical_columns = ['product_color', 'size_id']
encoded_data = LabelEncoder()

# LabelEncoder expects a 1-D array, so pass each column directly;
# the encoder is refit on every column
for i in categorical_columns:
    df_train[i] = encoded_data.fit_transform(df_train[i])
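
Because a single `LabelEncoder` is refit inside the loop, only the mapping of the last column survives. If the original labels are needed later, one option is to keep one encoder per column. A minimal sketch on toy data; the `encoders` dict is a hypothetical addition:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy = pd.DataFrame({
    "product_color": ["red", "blue", "red", "green"],
    "size_id": ["S", "M", "S", "XL"],
})

encoders = {}
for col in ["product_color", "size_id"]:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])  # 1-D input, integer codes out
    encoders[col] = enc                     # keep the fitted mapping

print(toy)
print(encoders["size_id"].classes_)

# Codes can be mapped back to the original labels
decoded = encoders["product_color"].inverse_transform(toy["product_color"])
print(list(decoded))
```

The integer codes follow the sorted order of the labels, so they carry no meaningful ranking for tree or SVM models.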


In [None]:
# Print the value counts of every column
for column in df_train.columns:
    print(column)
    print(df_train[column].value_counts())
    print("------------------------")

price
8.00     282
11.00    202
7.00     129
9.00     126
6.00     120
        ... 
4.94       1
1.84       1
4.74       1
3.74       1
1.71       1
Name: price, Length: 127, dtype: int64
------------------------
retail_price
7      177
6      136
10     128
5      102
11      99
      ... 
101      1
250      1
66       1
69       1
1        1
Name: retail_price, Length: 104, dtype: int64
------------------------
units_sold
100       509
1000      405
5000      217
10000     177
20000     103
50         76
10         49
50000      17
100000      6
8           4
1           3
2           2
7           2
3           2
6           1
Name: units_sold, dtype: int64
------------------------
uses_ad_boosts
0    892
1    681
Name: uses_ad_boosts, dtype: int64
------------------------
rating
4.0    774
0.0    479
5.0    171
3.0    135
2.0     12
1.0      2
Name: rating, dtype: int64
------------------------
badges_count
0    1422
1     138
2      11
3       2
Name: badges_count, dtype: int64
-

In [None]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1573 entries, 0 to 1572
Data columns (total 16 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   price                        1573 non-null   float64
 1   retail_price                 1573 non-null   int64  
 2   units_sold                   1573 non-null   int64  
 3   uses_ad_boosts               1573 non-null   int64  
 4   rating                       1573 non-null   float64
 5   badges_count                 1573 non-null   int64  
 6   badge_local_product          1573 non-null   int64  
 7   badge_product_quality        1573 non-null   int64  
 8   badge_fast_shipping          1573 non-null   int64  
 9   product_color                1573 non-null   int64  
 10  size_id                      1573 non-null   int64  
 11  product_variation_inventory  1573 non-null   int64  
 12  shipping_option_price        1573 non-null   int64  
 13  merchant_rating_co

In [None]:
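def getLog(x):
    # log(1 + x), so a value of 0 maps to 0 instead of -inf
    return np.log1p(x)

# Compress the long right tails of the sales and price columns
df_train['units_sold'] = getLog(df_train['units_sold'])
df_train['retail_price'] = getLog(df_train['retail_price'])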
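
To see what the transform does to a column such as `units_sold`, which spans several orders of magnitude, here is a small sketch on illustrative values (not rows from the dataset):

```python
import numpy as np
import pandas as pd

units_sold = pd.Series([1, 10, 100, 1000, 10000, 100000])

# log1p(x) computes log(1 + x)
logged = np.log1p(units_sold)
print(logged.round(2).tolist())

# expm1 is the inverse, useful for reporting values on the original scale
recovered = np.expm1(logged)
print(recovered.round().tolist())
```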

# Split the data into training and testing sets


In [None]:
train_index = df_train1.shape[0]              
train_d = df_train.iloc[:train_index,:]  
print(train_d.shape)  



(1094, 16)


In [None]:
test_d = df_train.iloc[train_index :]
print(test_d.shape)  

 

(479, 16)


In [None]:
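_id = test_d.id

# Note: this drops 'id' from df_train only; train_d and test_d were sliced earlier and still contain it
df_train = df_train.drop(columns=["id"])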

In [None]:
_id

1094     937
1095     270
1096     308
1097     913
1098     403
        ... 
1568    1166
1569    1416
1570      63
1571     626
1572     196
Name: id, Length: 479, dtype: int64

In [None]:
# Reassign instead of dropping in place to avoid pandas' SettingWithCopyWarning on a slice
test_d = test_d.drop('rating', axis=1)
print(test_d.shape)

(479, 15)


## Splitting the training data into X (features) and y (label)

In [None]:
X = train_d.drop(['rating'],axis=1)
y = train_d['rating']

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=25)
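
The ratings are heavily imbalanced, so a plain random split can leave rare classes out of the hold-out set. One option, not used in this notebook, is the `stratify` argument of `train_test_split`. A sketch on synthetic labels (it needs at least two samples per class, which may not hold for the rarest ratings here):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(25)
X_toy = rng.normal(size=(100, 3))
# Imbalanced toy labels: 70% class 4, 20% class 5, 10% class 3
y_toy = np.array([4] * 70 + [5] * 20 + [3] * 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=25, stratify=y_toy
)

# Class proportions in the test split mirror those of the full label array
labels, counts = np.unique(y_te, return_counts=True)
test_counts = dict(zip(labels.tolist(), counts.tolist()))
print(test_counts)
```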

# Naive Bayes with grid search for hyperparameter tuning

In [None]:
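from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB

# Search var_smoothing over 100 log-spaced values from 1 down to 1e-9
param_grid_nb = {
    'var_smoothing': np.logspace(0,-9, num=100)
}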

nbModel_grid = GridSearchCV(estimator=GaussianNB(), param_grid=param_grid_nb, verbose=1, cv=10, n_jobs=-1)

nbModel_grid.fit(X_train, y_train)

Fitting 10 folds for each of 100 candidates, totalling 1000 fits




In [None]:
y_pred = nbModel_grid.predict(X_test)

# Evaluate the performance of the model using metrics such as accuracy, F1 score, and confusion matrix.

from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

nbModel_grid_accuracy = accuracy_score(y_test, y_pred)
nbModel_grid_f1 = f1_score(y_test, y_pred, average='weighted')
cm_nbModel_grid = confusion_matrix(y_test, y_pred)

print('Accuracy:', nbModel_grid_accuracy)
print('F1 score:', nbModel_grid_f1)
print('Confusion matrix:\n', cm_nbModel_grid)

Accuracy: 0.7123287671232876
F1 score: 0.6056751467710371
Confusion matrix:
 [[  0   0   3   0]
 [  0   0  20   1]
 [  0   0 155   2]
 [  0   0  37   1]]
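
The matrix above has four rows. `confusion_matrix` infers its labels from the values present in `y_test` and the predictions, so its shape can differ between models. Passing `labels=` fixes the shape and the row order. A sketch on made-up arrays (the values are illustrative, not the notebook's predictions):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([2.0, 3.0, 4.0, 4.0, 5.0])
y_pred = np.array([4.0, 4.0, 4.0, 5.0, 4.0])

# Labels inferred from the data: only 2, 3, 4 and 5 appear
inferred = confusion_matrix(y_true, y_pred)

# Explicit labels: one row and column per rating, even if a rating is absent
fixed = confusion_matrix(y_true, y_pred, labels=[1.0, 2.0, 3.0, 4.0, 5.0])

print(inferred.shape, fixed.shape)
print(fixed)
```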


# Decision Tree

### Decision tree with grid search for hyperparameter tuning

In [None]:
DT_param = {
    'max_depth': [2, 3, 5, 10, 20],
    'min_samples_leaf': [5, 10, 20, 50, 100],
    'min_samples_split':[2,3,4,5],
    'criterion': ["entropy", "gini"]
}


dt_grid_search = GridSearchCV(estimator = DecisionTreeClassifier(), param_grid= DT_param, cv=5, n_jobs=-1, verbose=3)


dt_grid_search.fit(X_train, y_train)

dt_pred_test = dt_grid_search.predict(X_test)


dt_grid_search_accuracy = accuracy_score(y_test, dt_pred_test)
dt_grid_search_f1 = f1_score(y_test, dt_pred_test, average='weighted')
dt_grid_search_cm= confusion_matrix(y_test, dt_pred_test)

print('Accuracy:', dt_grid_search_accuracy)
print('F1 score:', dt_grid_search_f1)
print('Confusion matrix:\n', dt_grid_search_cm)


Fitting 5 folds for each of 200 candidates, totalling 1000 fits




Accuracy: 0.771689497716895
F1 score: 0.7173464386141423
Confusion matrix:
 [[  0   0   0   0   0]
 [  0   1   0   1   1]
 [  1   0   0  18   2]
 [  0   0   2 154   1]
 [  0   0   0  24  14]]
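
After fitting, a `GridSearchCV` object also exposes which parameter combination won and its cross-validated score, which helps when choosing values for a manual model. A self-contained sketch on synthetic data with a reduced grid (the data and grid are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=0)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 5], "criterion": ["entropy", "gini"]},
    cv=5,
)
search.fit(X_toy, y_toy)

print(search.best_params_)            # winning combination
print(round(search.best_score_, 3))   # mean cross-validated accuracy of that combination
best_tree = search.best_estimator_    # refit on all of X_toy with the best parameters
```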


In [None]:
pred_test = dt_grid_search.predict(test_d)

pred_df = pd.DataFrame(data={'id':_id, 'rating': pred_test})

pred_df.to_csv('pred_walkthrough.csv', index=False)

### Model 2: Decision tree with manually tuned hyperparameters


In [None]:
dt_model2 = DecisionTreeClassifier(max_depth = 5, min_samples_leaf = 3, min_samples_split = 5, criterion = 'entropy')

dt_model2.fit(X_train, y_train)

dt_pred_test2 = dt_model2.predict(X_test)


dt_grid_search2_accuracy = accuracy_score(y_test, dt_pred_test2)
dt_grid_search2_f1 = f1_score(y_test, dt_pred_test2, average='weighted')
dt_grid_search2_cm= confusion_matrix(y_test, dt_pred_test2)

print('Accuracy:', dt_grid_search2_accuracy)
print('F1 score:', dt_grid_search2_f1)
print('Confusion matrix:\n', dt_grid_search2_cm)

Accuracy: 0.7671232876712328
F1 score: 0.712098045925513
Confusion matrix:
 [[  0   0   0   0   0]
 [  0   1   0   1   1]
 [  1   0   0  18   2]
 [  0   0   2 154   1]
 [  0   2   0  23  13]]


In [None]:
# from sklearn.metrics import classification_report
# print(classification_report(y_test, y_pred))

# SVM

### SVM with grid search for hyperparameter tuning

In [None]:
from sklearn.svm import SVC

svm_param_1 = {
    'C': [0.01,0.1,1.0],
    'kernel' : ['linear', 'rbf'],
    'gamma': [0.5,0.1,1]
}

model_svm_1 = SVC(probability=True)


svm_grid_search = GridSearchCV(estimator=model_svm_1, param_grid= svm_param_1, cv=5, n_jobs=-1, verbose=3)



svm_grid_search.fit(X_train, y_train)
predicted_test_svm = svm_grid_search.predict(X_test)

Fitting 5 folds for each of 18 candidates, totalling 90 fits




In [None]:

svm_grid_search_accuracy = accuracy_score(y_test, predicted_test_svm)
svm_grid_search_f1 = f1_score(y_test, predicted_test_svm, average='weighted')
svm_grid_search_cm= confusion_matrix(y_test, predicted_test_svm)

print('Accuracy:', svm_grid_search_accuracy)
print('F1 score:', svm_grid_search_f1)
print('Confusion matrix:\n', svm_grid_search_cm)
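
SVMs are sensitive to feature scale, and the features here range from 0/1 flags to prices and counts. One common option, which this notebook does not apply, is to put a `StandardScaler` in a `Pipeline` so that scaling is refit inside each cross-validation fold. A sketch on synthetic data; the step names `scale` and `svc` are arbitrary:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_toy, y_toy = make_classification(n_samples=150, n_features=5, random_state=0)

# Scaling happens inside the pipeline, so each CV fold is scaled on its own training part
pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC())])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {"svc__C": [0.1, 1.0], "svc__kernel": ["linear", "rbf"]}

search = GridSearchCV(pipe, param_grid=param_grid, cv=5)
search.fit(X_toy, y_toy)
print(search.best_params_)
```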


### SVM with manually tuned hyperparameters

In [None]:
model_svm_2 = SVC(C = 1, kernel ='poly',degree = 2,probability=True)
model_svm_2.fit(X_train, y_train)
predicted_test_svm2 = model_svm_2.predict(X_test)

svm2_accuracy = accuracy_score(y_test, predicted_test_svm2)
svm2_f1 = f1_score(y_test, predicted_test_svm2, average='weighted')
svm2_cm= confusion_matrix(y_test, predicted_test_svm2)

print('Accuracy:', svm2_accuracy)
print('F1 score:', svm2_f1)
print('Confusion matrix:\n', svm2_cm)