<center><h1 style="color:#FF0000; font-size:50px; padding:10px; font-family:'serif'">
    FINAL HACKATHON </h1></center>

## PROBLEM STATEMENT

Artificial Intelligence is an integral part of all major e-commerce companies today. Today's online retail platforms are heavily powered by algorithms and applications that use AI. Machine learning is used in a variety of ways, from inventory control and quality assurance in the warehouse to product recommendations and sales demographics on the website.

Let’s say you want to create a promotional campaign for an e-commerce store and offer discounts to customers in the hopes that this might increase your sales.

You have been provided descriptions of products on Amazon and Flipkart, including details like product title, ratings, reviews, and actual prices. In this challenge, you will predict discounted prices of the listed products based on their ratings and actual prices.

## Data Description

- title - Name of the product
- Rating- average rating given to a product
- maincateg - category that the product is listed under(men/women)
- platform - platform on which it is sold on (Eg. Amazon, Flipkart)
- price1 - Discounted Price of the listed product
- actprice1 - Actual price of the listed product
- Offer % - Discount percent
- norating1 - number of ratings available for a particular product
- noreviews1 - number of reviews available for a particular product
- star_5f - number of five star ratings given to a particular product
- star_4f - number of four star ratings given to a particular product
- star_3f - number of three star ratings given to a particular product
- star_2f - number of two star ratings given to a particular product
- star_1f - number of one star ratings given to a particular product
- fulfilled1- whether it is Amazon fulfilled or not

In [1]:
# importing libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# model selection
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score

# importing preprocessing tools and Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.preprocessing import StandardScaler


# importing models
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from xgboost import XGBRegressor
from sklearn.ensemble import RandomForestRegressor

In [2]:
# importing datasets

df_train = pd.read_csv('./train.csv')
df_test = pd.read_csv('./test.csv')

In [3]:
df_train.head()

Unnamed: 0,id,title,Rating,maincateg,platform,price1,actprice1,Offer %,norating1,noreviews1,star_5f,star_4f,star_3f,star_2f,star_1f,fulfilled1
0,16695,Fashionable & Comfortable Bellies For Women (...,3.9,Women,Flipkart,698,999,30.13%,38.0,7.0,17.0,9.0,6.0,3,3,0
1,5120,Combo Pack of 4 Casual Shoes Sneakers For Men ...,3.8,Men,Flipkart,999,1999,50.03%,531.0,69.0,264.0,92.0,73.0,29,73,1
2,18391,Cilia Mode Leo Sneakers For Women (White),4.4,Women,Flipkart,2749,4999,45.01%,17.0,4.0,11.0,3.0,2.0,1,0,1
3,495,Men Black Sports Sandal,4.2,Men,Flipkart,518,724,15.85%,46413.0,6229.0,1045.0,12416.0,5352.0,701,4595,1
4,16408,Men Green Sports Sandal,3.9,Men,Flipkart,1379,2299,40.02%,77.0,3.0,35.0,21.0,7.0,7,7,1


In [4]:
df_train.dtypes

id              int64
title          object
Rating        float64
maincateg      object
platform       object
price1          int64
actprice1       int64
Offer %        object
norating1     float64
noreviews1    float64
star_5f       float64
star_4f       float64
star_3f       float64
star_2f         int64
star_1f         int64
fulfilled1      int64
dtype: object

In [5]:
df_train.isna().sum()

id              0
title           0
Rating          0
maincateg     526
platform        0
price1          0
actprice1       0
Offer %         0
norating1     678
noreviews1    578
star_5f       588
star_4f       539
star_3f       231
star_2f         0
star_1f         0
fulfilled1      0
dtype: int64

In [6]:
df_train.describe()

Unnamed: 0,id,Rating,price1,actprice1,norating1,noreviews1,star_5f,star_4f,star_3f,star_2f,star_1f,fulfilled1
count,15730.0,15730.0,15730.0,15730.0,15052.0,15152.0,15142.0,15191.0,15499.0,15730.0,15730.0,15730.0
mean,10479.541577,4.012873,688.070693,1369.286777,3057.660776,423.976307,1585.239466,655.92331,357.260662,155.085188,275.500572,0.601526
std,6080.166276,0.29844,649.409586,1240.900227,11846.965689,1768.230384,6177.476241,2855.735531,1402.24661,558.650254,958.589075,0.4896
min,3.0,0.0,69.0,42.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,5212.0,3.9,349.0,699.0,63.0,9.0,30.0,12.0,7.0,3.0,6.0,0.0
50%,10458.5,4.0,474.0,999.0,308.0,44.0,150.0,60.0,34.0,17.0,30.0,1.0
75%,15766.75,4.2,699.0,1299.0,1526.0,215.0,788.0,300.0,172.0,77.0,140.0,1.0
max,20973.0,5.0,5998.0,13499.0,289973.0,45448.0,151193.0,74037.0,34978.0,11705.0,18060.0,1.0


In [7]:
df_train['title'].nunique()

4782

In [8]:
# imputing maincateg column by checking the title for gender

df_maincateg_impute = df_train.loc[df_train.maincateg.isna(), :]

df_maincateg_impute

Unnamed: 0,id,title,Rating,maincateg,platform,price1,actprice1,Offer %,norating1,noreviews1,star_5f,star_4f,star_3f,star_2f,star_1f,fulfilled1
19,12336,Women Beige Heels Sandal,4.0,,Flipkart,499,999,50.05%,,,28.0,9.0,10.0,2,5,0
38,20804,"Men Brown, Orange Sports Sandal",4.1,,Flipkart,819,999,18.02%,26640.0,3667.0,14148.0,,,1113,2094,1
68,5575,Women Copper Flats Sandal,4.1,,Flipkart,349,999,65.07%,,27.0,,,22.0,5,15,1
91,4262,Slippers,3.6,,Flipkart,213,249,14.46%,,,,,466.0,281,476,1
116,2767,Ace Slip-On Running Shoes For Men (Black),4.3,,Flipkart,2999,4999,40.01%,,,46.0,27.0,5.0,1,5,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15536,20013,Slides,4.5,,Flipkart,499,799,37.55%,,1.0,23.0,,1.0,1,2,0
15570,9968,Women Multicolor Bellies Sandal,3.7,,Flipkart,501,999,49.85%,,,,141.0,111.0,75,128,1
15583,11224,Jutis For Men (Black),3.7,,Flipkart,494,499,1.00%,,,22.0,,4.0,4,8,0
15674,7209,Zod Runner V3 Wn s IDP Running Shoes For Women...,4.3,,Flipkart,1699,3999,57.51%,,,,,268.0,102,140,1


In [9]:
for i in df_maincateg_impute.index:
    if 'Men' in df_maincateg_impute.loc[i, 'title']:
        df_train.loc[i, 'maincateg'] = 'Men'
    else:
        df_train.loc[i, 'maincateg'] = 'Women'
        
df_train.isna().sum()

id              0
title           0
Rating          0
maincateg       0
platform        0
price1          0
actprice1       0
Offer %         0
norating1     678
noreviews1    578
star_5f       588
star_4f       539
star_3f       231
star_2f         0
star_1f         0
fulfilled1      0
dtype: int64

In [10]:
# similar imputation in df_test
df_maincateg_impute_test = df_test.loc[df_test.maincateg.isna(), :]

for i in df_maincateg_impute_test.index:
    if 'Men' in df_maincateg_impute_test.loc[i, 'title']:
        df_test.loc[i, 'maincateg'] = 'Men'
    else:
        df_test.loc[i, 'maincateg'] = 'Women'
        
df_test.isna().sum()

id              0
title           0
Rating        203
maincateg       0
platform        0
actprice1       0
norating1       0
noreviews1      0
star_5f        68
star_4f         0
star_3f         0
star_2f         0
star_1f       186
fulfilled1      0
dtype: int64

In [11]:
# # imputing Rating column

# def get_average_rating(df):
#     for i in df[df.Rating.isna()].index:
#         df.loc[i:'Rating'] = np.average(df.loc[i, ['star_5f', 'star_4f', 'star_3f', 'star_2f', 'star_1f']])

In [12]:
impute_ratings = KNNImputer()

imputing_df = pd.DataFrame(df_train.loc[:, ['star_5f', 'star_4f', 'star_3f', 'star_2f', 'star_1f']])
imputing_df = pd.DataFrame(impute_ratings.fit_transform(imputing_df))

imputing_df.columns = ['star_5f', 'star_4f', 'star_3f', 'star_2f', 'star_1f']

imputing_df

Unnamed: 0,star_5f,star_4f,star_3f,star_2f,star_1f
0,17.0,9.0,6.0,3.0,3.0
1,264.0,92.0,73.0,29.0,73.0
2,11.0,3.0,2.0,1.0,0.0
3,1045.0,12416.0,5352.0,701.0,4595.0
4,35.0,21.0,7.0,7.0,7.0
...,...,...,...,...,...
15725,485.0,177.0,61.0,41.0,43.0
15726,120.0,45.0,37.0,16.0,28.0
15727,65.8,27.8,20.0,10.0,15.0
15728,13.0,6.0,10.0,25.0,47.0


In [13]:
# # resetting the column orders:
# cols = X_imputed.columns.tolist()
# cols = cols[5:] + cols[0:5]
# cols

# X_imputed = X_imputed[cols]

# X_imputed.columns = X.columns.to_list()

# X_imputed

df_train.update(imputing_df)

df_train.isna().sum()

id              0
title           0
Rating          0
maincateg       0
platform        0
price1          0
actprice1       0
Offer %         0
norating1     678
noreviews1    578
star_5f         0
star_4f         0
star_3f         0
star_2f         0
star_1f         0
fulfilled1      0
dtype: int64

In [14]:
# Same imputation on test set

imputing_df_test = pd.DataFrame(df_test.loc[:, ['star_5f', 'star_4f', 'star_3f', 'star_2f', 'star_1f']])
imputing_df_test = pd.DataFrame(impute_ratings.fit_transform(imputing_df_test))

imputing_df_test.columns = ['star_5f', 'star_4f', 'star_3f', 'star_2f', 'star_1f']

imputing_df_test

Unnamed: 0,star_5f,star_4f,star_3f,star_2f,star_1f
0,14238.0,4295.0,3457.0,1962.0,3976.0
1,1458.0,657.0,397.0,182.0,321.0
2,229.0,70.0,71.0,33.0,46.0
3,141.0,51.0,49.0,17.0,32.0
4,1265.0,414.0,293.0,143.0,308.0
...,...,...,...,...,...
5239,656.4,323.0,155.0,62.0,92.8
5240,350.0,37.0,60.0,5.0,37.0
5241,574.0,290.0,172.0,94.0,150.0
5242,2384.0,974.0,648.0,328.0,533.0


In [15]:
df_test.update(imputing_df)
df_test.isna().sum()

id              0
title           0
Rating        203
maincateg       0
platform        0
actprice1       0
norating1       0
noreviews1      0
star_5f         0
star_4f         0
star_3f         0
star_2f         0
star_1f         0
fulfilled1      0
dtype: int64

In [16]:
df_train.dropna(how = 'any', inplace = True)
df_train.drop(['title'], axis = 1, inplace = True)
df_test.drop(['title'], axis = 1, inplace = True)

In [17]:
X = df_train.loc[:, [i for i in df_train if i != 'price1']]
y = df_train['price1']

In [18]:
imputing_df = X.loc[:, ['norating1', 'noreviews1']]
si = SimpleImputer(strategy = 'median')
imputing_df = pd.DataFrame(si.fit_transform(imputing_df))

imputing_df.columns = ['norating1', 'noreviews1']

X.update(imputing_df)
X.isna().sum()

id            0
Rating        0
maincateg     0
platform      0
actprice1     0
Offer %       0
norating1     0
noreviews1    0
star_5f       0
star_4f       0
star_3f       0
star_2f       0
star_1f       0
fulfilled1    0
dtype: int64

In [19]:
test_impute_df = df_test.loc[:, ['Rating']]
test_impute_df = pd.DataFrame(si.fit_transform(test_impute_df))

test_impute_df.columns = ['Rating']

df_test.update(test_impute_df)
df_test.isna().sum()

id            0
Rating        0
maincateg     0
platform      0
actprice1     0
norating1     0
noreviews1    0
star_5f       0
star_4f       0
star_3f       0
star_2f       0
star_1f       0
fulfilled1    0
dtype: int64

In [20]:
X = X.drop(['id'], axis = 1)

X

Unnamed: 0,Rating,maincateg,platform,actprice1,Offer %,norating1,noreviews1,star_5f,star_4f,star_3f,star_2f,star_1f,fulfilled1
0,3.9,Women,Flipkart,999,30.13%,38.0,7.0,17.0,9.0,6.0,3.0,3.0,0
1,3.8,Men,Flipkart,1999,50.03%,531.0,69.0,264.0,92.0,73.0,29.0,73.0,1
2,4.4,Women,Flipkart,4999,45.01%,17.0,4.0,11.0,3.0,2.0,1.0,0.0,1
3,4.2,Men,Flipkart,724,15.85%,46413.0,6229.0,1045.0,12416.0,5352.0,701.0,4595.0,1
4,3.9,Men,Flipkart,2299,40.02%,77.0,3.0,35.0,21.0,7.0,7.0,7.0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
15724,5.0,Men,Flipkart,2199,70.49%,2.0,1.0,2.0,0.0,0.0,0.0,0.0,0
15725,4.3,Women,Flipkart,1199,52.71%,807.0,114.0,485.0,177.0,61.0,41.0,43.0,0
15726,3.9,Women,Flipkart,998,50.00%,246.0,34.0,120.0,45.0,37.0,16.0,28.0,1
15728,3.9,Men,Amazon,4499,50.01%,750.0,479.0,13.0,6.0,10.0,25.0,47.0,1


In [21]:
#Removing percentage sign from offer%
X.drop('Offer %', axis = 1, inplace = True)

In [22]:
# same process on df_test
# df_test['Offer %'] = df_test['Offer %'].str.replace('%', '').astype(float)
df_test

Unnamed: 0,id,Rating,maincateg,platform,actprice1,norating1,noreviews1,star_5f,star_4f,star_3f,star_2f,star_1f,fulfilled1
0,2242,3.8,Men,Flipkart,999,27928,3543,17.0,9.0,6.0,3.0,3.0,1
1,20532,3.9,Women,Flipkart,499,3015,404,264.0,92.0,73.0,29.0,73.0,1
2,10648,3.9,Women,Flipkart,999,449,52,11.0,3.0,2.0,1.0,0.0,1
3,20677,3.9,Men,Flipkart,2999,290,40,1045.0,12416.0,5352.0,701.0,4595.0,1
4,12593,3.9,Men,Flipkart,999,2423,326,35.0,21.0,7.0,7.0,7.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
5239,14033,4.0,Women,Flipkart,699,1235,153,140.0,69.0,20.0,9.0,14.0,1
5240,297,3.9,Men,Flipkart,1993,329,56,143.0,39.0,27.0,12.0,29.0,0
5241,18733,3.8,Women,Flipkart,999,1280,135,15.0,11.0,5.0,24.0,44.0,0
5242,6162,3.9,Women,Flipkart,499,4867,574,1140.0,491.0,329.0,136.0,338.0,0


# Tuning Model Parameters

In [23]:
ohe = OneHotEncoder(handle_unknown = 'ignore', sparse = False)

cols_to_encode = [col for col in X.columns if X[col].dtype == 'object']

new_cols = ohe.fit_transform(X[cols_to_encode])

df_enc = pd.DataFrame(new_cols, columns = ['Women', 'Men', 'Amazon', 'Flipkart'])

X = pd.concat([X.reset_index(drop = True),df_enc.reset_index(drop = True)], axis = 1)

X.drop(['maincateg', 'platform'],axis = 1, inplace = True)

In [24]:
X

Unnamed: 0,Rating,actprice1,norating1,noreviews1,star_5f,star_4f,star_3f,star_2f,star_1f,fulfilled1,Women,Men,Amazon,Flipkart
0,3.9,999,38.0,7.0,17.0,9.0,6.0,3.0,3.0,0,0.0,1.0,0.0,1.0
1,3.8,1999,531.0,69.0,264.0,92.0,73.0,29.0,73.0,1,1.0,0.0,0.0,1.0
2,4.4,4999,17.0,4.0,11.0,3.0,2.0,1.0,0.0,1,0.0,1.0,0.0,1.0
3,4.2,724,46413.0,6229.0,1045.0,12416.0,5352.0,701.0,4595.0,1,1.0,0.0,0.0,1.0
4,3.9,2299,77.0,3.0,35.0,21.0,7.0,7.0,7.0,1,1.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14970,5.0,2199,2.0,1.0,2.0,0.0,0.0,0.0,0.0,0,1.0,0.0,0.0,1.0
14971,4.3,1199,807.0,114.0,485.0,177.0,61.0,41.0,43.0,0,0.0,1.0,0.0,1.0
14972,3.9,998,246.0,34.0,120.0,45.0,37.0,16.0,28.0,1,0.0,1.0,0.0,1.0
14973,3.9,4499,750.0,479.0,13.0,6.0,10.0,25.0,47.0,1,1.0,0.0,1.0,0.0


In [25]:
# Same encoding on test dataset
cols_to_encode_test = [col for col in df_test.columns if df_test[col].dtype == 'object']
cols_to_encode_test
new_cols_test = ohe.fit_transform(df_test[cols_to_encode_test])
new_cols_test
df_enc_test = pd.DataFrame(new_cols_test, columns = ['Women', 'Men', 'Amazon', 'Flipkart'])
df_test = pd.concat([df_test.reset_index(drop = True),df_enc_test.reset_index(drop = True)], axis = 1)

df_test.drop(['maincateg', 'platform'],axis = 1, inplace = True)

df_test

Unnamed: 0,id,Rating,actprice1,norating1,noreviews1,star_5f,star_4f,star_3f,star_2f,star_1f,fulfilled1,Women,Men,Amazon,Flipkart
0,2242,3.8,999,27928,3543,17.0,9.0,6.0,3.0,3.0,1,1.0,0.0,0.0,1.0
1,20532,3.9,499,3015,404,264.0,92.0,73.0,29.0,73.0,1,0.0,1.0,0.0,1.0
2,10648,3.9,999,449,52,11.0,3.0,2.0,1.0,0.0,1,0.0,1.0,0.0,1.0
3,20677,3.9,2999,290,40,1045.0,12416.0,5352.0,701.0,4595.0,1,1.0,0.0,0.0,1.0
4,12593,3.9,999,2423,326,35.0,21.0,7.0,7.0,7.0,0,1.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5239,14033,4.0,699,1235,153,140.0,69.0,20.0,9.0,14.0,1,0.0,1.0,0.0,1.0
5240,297,3.9,1993,329,56,143.0,39.0,27.0,12.0,29.0,0,1.0,0.0,0.0,1.0
5241,18733,3.8,999,1280,135,15.0,11.0,5.0,24.0,44.0,0,0.0,1.0,0.0,1.0
5242,6162,3.9,499,4867,574,1140.0,491.0,329.0,136.0,338.0,0,0.0,1.0,0.0,1.0


In [26]:
df_test.isna().sum()

id            0
Rating        0
actprice1     0
norating1     0
noreviews1    0
star_5f       0
star_4f       0
star_3f       0
star_2f       0
star_1f       0
fulfilled1    0
Women         0
Men           0
Amazon        0
Flipkart      0
dtype: int64

In [27]:
df_test.dtypes

id              int64
Rating        float64
actprice1       int64
norating1       int64
noreviews1      int64
star_5f       float64
star_4f       float64
star_3f       float64
star_2f       float64
star_1f       float64
fulfilled1      int64
Women         float64
Men           float64
Amazon        float64
Flipkart      float64
dtype: object

In [28]:
# model1 = RandomForestRegressor(random_state = 0)
# model2 = XGBRegressor()

In [29]:
# model1.get_params().keys()

In [30]:
# model2.get_params().keys()

In [31]:
# params_1 = {
#     'n_estimators' : [1,50,100,500,1000],
#     'max_depth': [1,3,5,10,15]
# }

# params_2 = {
#     'n_estimators' : [50,100,300,500,700],
#     'max_depth': [5, 10, 15, 20, 25,30],
#     'learning_rate':[0.01, 0.1, 1, 0.001, 0.5]
# }

# params_combined = [params_1, params_2]

In [32]:
import sklearn
sorted(sklearn.metrics.SCORERS.keys())

# getting the valid scorers for gridsearch 

['accuracy',
 'adjusted_mutual_info_score',
 'adjusted_rand_score',
 'average_precision',
 'balanced_accuracy',
 'completeness_score',
 'explained_variance',
 'f1',
 'f1_macro',
 'f1_micro',
 'f1_samples',
 'f1_weighted',
 'fowlkes_mallows_score',
 'homogeneity_score',
 'jaccard',
 'jaccard_macro',
 'jaccard_micro',
 'jaccard_samples',
 'jaccard_weighted',
 'max_error',
 'mutual_info_score',
 'neg_brier_score',
 'neg_log_loss',
 'neg_mean_absolute_error',
 'neg_mean_absolute_percentage_error',
 'neg_mean_gamma_deviance',
 'neg_mean_poisson_deviance',
 'neg_mean_squared_error',
 'neg_mean_squared_log_error',
 'neg_median_absolute_error',
 'neg_root_mean_squared_error',
 'normalized_mutual_info_score',
 'precision',
 'precision_macro',
 'precision_micro',
 'precision_samples',
 'precision_weighted',
 'r2',
 'rand_score',
 'recall',
 'recall_macro',
 'recall_micro',
 'recall_samples',
 'recall_weighted',
 'roc_auc',
 'roc_auc_ovo',
 'roc_auc_ovo_weighted',
 'roc_auc_ovr',
 'roc_auc_ovr_we

In [33]:
# gs_1 = GridSearchCV(estimator = model1,
#                  param_grid = params_1,
#                  cv = 3,
#                  scoring = 'neg_root_mean_squared_error',
#                  verbose = 1)

# gs_1.fit(X,y)

In [34]:
id_col_test = df_test['id']
df_test.drop('id', axis = 1, inplace = True)

df_test

Unnamed: 0,Rating,actprice1,norating1,noreviews1,star_5f,star_4f,star_3f,star_2f,star_1f,fulfilled1,Women,Men,Amazon,Flipkart
0,3.8,999,27928,3543,17.0,9.0,6.0,3.0,3.0,1,1.0,0.0,0.0,1.0
1,3.9,499,3015,404,264.0,92.0,73.0,29.0,73.0,1,0.0,1.0,0.0,1.0
2,3.9,999,449,52,11.0,3.0,2.0,1.0,0.0,1,0.0,1.0,0.0,1.0
3,3.9,2999,290,40,1045.0,12416.0,5352.0,701.0,4595.0,1,1.0,0.0,0.0,1.0
4,3.9,999,2423,326,35.0,21.0,7.0,7.0,7.0,0,1.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5239,4.0,699,1235,153,140.0,69.0,20.0,9.0,14.0,1,0.0,1.0,0.0,1.0
5240,3.9,1993,329,56,143.0,39.0,27.0,12.0,29.0,0,1.0,0.0,0.0,1.0
5241,3.8,999,1280,135,15.0,11.0,5.0,24.0,44.0,0,0.0,1.0,0.0,1.0
5242,3.9,499,4867,574,1140.0,491.0,329.0,136.0,338.0,0,0.0,1.0,0.0,1.0


# Trying Ridge Regression

In [35]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)

In [36]:
X_train.shape

(11231, 14)

In [37]:
X_test.shape

(3744, 14)

In [38]:
y_train = y_train.values.reshape(-1, 1)
y_test = y_test.values.reshape(-1, 1)

In [39]:
rg = Ridge()
rg.get_params().keys()

dict_keys(['alpha', 'copy_X', 'fit_intercept', 'max_iter', 'normalize', 'positive', 'random_state', 'solver', 'tol'])

In [40]:
# params = {
#     'alpha' : [1000,50,10,5, 3, 1]
# }

# gs = GridSearchCV(rg, 
#                  param_grid = params,
#                  cv = 5,
#                  verbose = 5)

# gs.fit(X_train, y_train)

In [41]:
# gs.best_params_

In [42]:
# rg_final = Ridge(alpha = 3)

In [43]:
linreg = LinearRegression()

In [44]:
cv = cross_val_score(linreg, X, y, scoring = 'neg_root_mean_squared_error')
cv.mean()

-287.5459852750643

In [45]:
linreg.fit(X,y)

LinearRegression()

In [46]:
preds = linreg.predict(df_test)

In [47]:
submission = pd.DataFrame({
    'id': id_col_test,
    'price1': preds
})

In [48]:
submission

Unnamed: 0,id,price1
0,2242,513.569466
1,20532,296.828028
2,10648,522.643059
3,20677,1604.019833
4,12593,465.990389
...,...,...
5239,14033,403.704471
5240,297,913.686010
5241,18733,442.277848
5242,6162,231.209510


In [49]:
submission.to_csv('submission.csv', index = False)

In [50]:
# sns.pairplot(X_train)

# Trying SVM Regression

In [51]:
scale = StandardScaler()

X = pd.DataFrame(scale.fit_transform(X), columns = X.columns.to_list())
X

Unnamed: 0,Rating,actprice1,norating1,noreviews1,star_5f,star_4f,star_3f,star_2f,star_1f,fulfilled1,Women,Men,Amazon,Flipkart
0,-0.380302,-0.299354,-0.256597,-0.242464,-0.252857,-0.228614,-0.248247,-0.271891,-0.283617,-1.224915,-0.854175,0.854175,-0.148944,0.148944
1,-0.715007,0.506612,-0.213782,-0.205704,-0.213014,-0.199013,-0.200571,-0.224876,-0.210045,0.816383,1.170721,-1.170721,-0.148944,0.148944
2,1.293227,2.924511,-0.258421,-0.244243,-0.253825,-0.230754,-0.251093,-0.275507,-0.286770,0.816383,-0.854175,0.854175,-0.148944,0.148944
3,0.623815,-0.520994,3.770902,3.446546,-0.087030,4.196269,3.555877,0.990256,4.542685,0.816383,1.170721,-1.170721,-0.148944,0.148944
4,-0.380302,0.748402,-0.253210,-0.244835,-0.249954,-0.224334,-0.247536,-0.264658,-0.279412,0.816383,1.170721,-1.170721,-0.148944,0.148944
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14970,3.301460,0.667806,-0.259724,-0.246021,-0.255277,-0.231824,-0.252517,-0.277315,-0.286770,-1.224915,1.170721,-1.170721,-0.148944,0.148944
14971,0.958521,-0.138161,-0.189812,-0.179024,-0.177364,-0.168698,-0.209110,-0.203178,-0.241576,-1.224915,-0.854175,0.854175,-0.148944,0.148944
14972,-0.380302,-0.300160,-0.238533,-0.226456,-0.236242,-0.215775,-0.226188,-0.248383,-0.257341,0.816383,-0.854175,0.854175,-0.148944,0.148944
14973,-0.380302,2.521528,-0.194762,0.037384,-0.253502,-0.229684,-0.245401,-0.232109,-0.237372,0.816383,1.170721,-1.170721,6.713935,-6.713935


In [52]:
df_test = pd.DataFrame(scale.fit_transform(df_test), columns = df_test.columns.to_list())
df_test

Unnamed: 0,Rating,actprice1,norating1,noreviews1,star_5f,star_4f,star_3f,star_2f,star_1f,fulfilled1,Women,Men,Amazon,Flipkart
0,-0.711707,-0.296490,1.936287,1.636973,-0.231653,-0.208750,-0.228931,-0.257341,-0.272038,0.808863,1.190674,-1.190674,-0.142244,0.142244
1,-0.373861,-0.686960,0.002051,-0.006015,-0.195028,-0.181750,-0.185337,-0.212968,-0.201315,0.808863,-0.839861,0.839861,-0.142244,0.142244
2,-0.373861,-0.296490,-0.197172,-0.190255,-0.232543,-0.210702,-0.231533,-0.260754,-0.275069,0.808863,-0.839861,0.839861,-0.142244,0.142244
3,-0.373861,1.265390,-0.209517,-0.196536,-0.079221,3.827285,3.249471,0.933897,4.367395,0.808863,1.190674,-1.190674,-0.142244,0.142244
4,-0.373861,-0.296490,-0.043911,-0.046841,-0.228984,-0.204846,-0.228280,-0.250514,-0.267997,-1.236304,1.190674,-1.190674,-0.142244,0.142244
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5239,-0.036014,-0.530772,-0.136147,-0.137391,-0.213415,-0.189232,-0.219822,-0.247101,-0.260925,0.808863,-0.839861,0.839861,-0.142244,0.142244
5240,-0.373861,0.479764,-0.206489,-0.188162,-0.212970,-0.198991,-0.215267,-0.241981,-0.245770,-1.236304,1.190674,-1.190674,-0.142244,0.142244
5241,-0.711707,-0.296490,-0.132653,-0.146812,-0.231950,-0.208099,-0.229582,-0.221501,-0.230615,-1.236304,-0.839861,0.839861,-0.142244,0.142244
5242,-0.373861,-0.686960,0.145840,0.082965,-0.065134,-0.051954,-0.018769,-0.030357,0.066422,-1.236304,-0.839861,0.839861,-0.142244,0.142244


In [75]:
svm = SVR()

params_svr = {
#     'kernel' : ['linear', 'poly', 'rbf', 'sigmoid', 'precomputed']
    'C': [10000,20000,9000]
}

In [76]:
gs_svm = GridSearchCV(estimator = svm,
                 param_grid = params_svr,
                 cv = 5,
                 scoring = 'neg_root_mean_squared_error',
                 verbose = 1,
                 n_jobs = 3)

gs_svm.fit(X,y)

Fitting 5 folds for each of 3 candidates, totalling 15 fits


GridSearchCV(cv=5, estimator=SVR(), n_jobs=3,
             param_grid={'C': [10000, 20000, 9000]},
             scoring='neg_root_mean_squared_error', verbose=1)

In [77]:
gs_svm.best_params_

{'C': 20000}

In [69]:
svm = SVR(C=10000)

svm.fit(X,y)

SVR(C=10000)

In [70]:
preds_svm = svm.predict(df_test)

In [71]:
submission_svm = pd.DataFrame({
    'id': id_col_test,
    'price1': preds_svm
})

submission_svm

Unnamed: 0,id,price1
0,2242,438.422068
1,20532,309.308144
2,10648,471.160001
3,20677,917.806040
4,12593,431.120868
...,...,...
5239,14033,394.216490
5240,297,610.525669
5241,18733,413.987749
5242,6162,317.145427


In [72]:
submission_svm.to_csv('submission_svm.csv', index = False)