<center><h1 style="color:#FF0000; font-size:50px; padding:10px; font-family:'serif'">
    FINAL HACKATHON </h1></center>

## PROBLEM STATEMENT

Artificial Intelligence is an integral part of all major e-commerce companies today. Today's online retail platforms are heavily powered by algorithms and applications that use AI. Machine learning is used in a variety of ways, from inventory control and quality assurance in the warehouse to product recommendations and sales demographics on the website.

Let’s say you want to create a promotional campaign for an e-commerce store and offer discounts to customers in the hopes that this might increase your sales.

You have been provided descriptions of products on Amazon and Flipkart, including details like product title, ratings, reviews, and actual prices. In this challenge, you will predict discounted prices of the listed products based on their ratings and actual prices.

## Data Description

- title - Name of the product
- Rating- average rating given to a product
- maincateg - category that the product is listed under(men/women)
- platform - platform on which it is sold on (Eg. Amazon, Flipkart)
- price1 - Discounted Price of the listed product
- actprice1 - Actual price of the listed product
- Offer % - Discount percent
- norating1 - number of ratings available for a particular product
- noreviews1 - number of reviews available for a particular product
- star_5f - number of five star ratings given to a particular product
- star_4f - number of four star ratings given to a particular product
- star_3f - number of three star ratings given to a particular product
- star_2f - number of two star ratings given to a particular product
- star_1f - number of one star ratings given to a particular product
- fulfilled1- whether it is Amazon fulfilled or not

In [42]:
# importing libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# model selection
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score

# importing preprocessing tools and Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler


# importing models
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.svm import SVR
# from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import Ridge
from xgboost import XGBRegressor
from sklearn.ensemble import RandomForestRegressor

In [2]:
# importing datasets

df_train = pd.read_csv('./train.csv')
df_test = pd.read_csv('./test.csv')

In [41]:
df_train.isna().sum()

Rating          0
maincateg       0
platform        0
price1          0
actprice1       0
norating1     678
noreviews1    578
star_5f       588
star_4f       539
star_3f       231
star_2f         0
star_1f         0
fulfilled1      0
dtype: int64

In [4]:
df_test.isna().sum()

id              0
title           0
Rating        203
maincateg      67
platform        0
actprice1       0
norating1       0
noreviews1      0
star_5f        68
star_4f         0
star_3f         0
star_2f         0
star_1f       186
fulfilled1      0
dtype: int64

In [5]:
df_train.drop('Offer %', inplace = True, axis = 1) # this column not present in test dataset

In [6]:
# imputing maincateg column using title names

def impute_main_categ(df):
    df_maincateg_impute = df.loc[df.maincateg.isna(), :]
    for i in df_maincateg_impute.index:
        if 'Men' in df_maincateg_impute.loc[i, 'title']:
            df.loc[i, 'maincateg'] = 'Men'
        else:
            df.loc[i, 'maincateg'] = 'Women'
            
impute_main_categ(df_train)
impute_main_categ(df_test)

print(df_train.isna().sum())
print(df_test.isna().sum())

id              0
title           0
Rating          0
maincateg       0
platform        0
price1          0
actprice1       0
norating1     678
noreviews1    578
star_5f       588
star_4f       539
star_3f       231
star_2f         0
star_1f         0
fulfilled1      0
dtype: int64
id              0
title           0
Rating        203
maincateg       0
platform        0
actprice1       0
norating1       0
noreviews1      0
star_5f        68
star_4f         0
star_3f         0
star_2f         0
star_1f       186
fulfilled1      0
dtype: int64


In [7]:
df_train.drop(['title', 'id'], inplace = True, axis = 1)

id_col = df_test['id'] # saving the id column of test dataset

df_test.drop(['title', 'id'], inplace = True, axis = 1)

In [8]:
X = df_train.loc[:, [col for col in df_train if col != 'price1']]
y = df_train.loc[:, 'price1']

In [9]:
median_imputer = SimpleImputer(strategy= 'median')
ohe = OneHotEncoder()
mean_imputer = SimpleImputer(strategy = 'mean')

transformer = ColumnTransformer(transformers = [
    ('ohe_encoder', ohe, ['maincateg', 'platform']),
    ('imputer', median_imputer, ['star_5f', 'star_4f', 'star_3f', 'star_2f', 'star_1f']),
    ('imputer_mean', mean_imputer, ['Rating', 'norating1', 'noreviews1'])
], remainder = 'passthrough')

In [10]:
random_f = RandomForestRegressor(random_state = 0)
pipe = Pipeline(steps = [
    ('imputing and encoding', transformer),
    ('model', random_f)
])

In [11]:
score = cross_val_score(pipe, X, y, cv = 5)

In [12]:
score

array([0.91229786, 0.89746921, 0.91378218, 0.89785711, 0.89854571])

In [13]:
pipe.get_params().keys()

dict_keys(['memory', 'steps', 'verbose', 'imputing and encoding', 'model', 'imputing and encoding__n_jobs', 'imputing and encoding__remainder', 'imputing and encoding__sparse_threshold', 'imputing and encoding__transformer_weights', 'imputing and encoding__transformers', 'imputing and encoding__verbose', 'imputing and encoding__verbose_feature_names_out', 'imputing and encoding__ohe_encoder', 'imputing and encoding__imputer', 'imputing and encoding__imputer_mean', 'imputing and encoding__ohe_encoder__categories', 'imputing and encoding__ohe_encoder__drop', 'imputing and encoding__ohe_encoder__dtype', 'imputing and encoding__ohe_encoder__handle_unknown', 'imputing and encoding__ohe_encoder__sparse', 'imputing and encoding__imputer__add_indicator', 'imputing and encoding__imputer__copy', 'imputing and encoding__imputer__fill_value', 'imputing and encoding__imputer__missing_values', 'imputing and encoding__imputer__strategy', 'imputing and encoding__imputer__verbose', 'imputing and encodi

In [14]:
# params = {
#     'model__n_estimators':[1000, 5000, 10000],
#     'model__max_depth': [10, 15]
# }

# gs = GridSearchCV(
#     pipe, param_grid = params, scoring = 'neg_root_mean_squared_error', cv = 3, n_jobs = 3, verbose = 5
# )

In [15]:
# gs.fit(X,y)

In [16]:
# gs.best_params_

In [17]:
random_f_final = RandomForestRegressor(n_estimators = 10000, random_state = 0)

pipe = Pipeline(steps = [
    ('imputing and encoding', transformer),
    ('model', random_f)
])

In [18]:
pipe.fit(X,y)

Pipeline(steps=[('imputing and encoding',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('ohe_encoder',
                                                  OneHotEncoder(),
                                                  ['maincateg', 'platform']),
                                                 ('imputer',
                                                  SimpleImputer(strategy='median'),
                                                  ['star_5f', 'star_4f',
                                                   'star_3f', 'star_2f',
                                                   'star_1f']),
                                                 ('imputer_mean',
                                                  SimpleImputer(),
                                                  ['Rating', 'norating1',
                                                   'noreviews1'])])),
                ('model', RandomForestRegressor(random_s

In [19]:
preds = pd.DataFrame(pipe.predict(df_test), columns = ['price1'])

In [20]:
preds

Unnamed: 0,price1
0,428.047714
1,282.440000
2,466.024000
3,918.410000
4,399.765714
...,...
5239,425.030000
5240,744.209000
5241,388.701430
5242,188.580000


In [21]:
id_col = pd.DataFrame(id_col, columns = ['id'])
id_col

Unnamed: 0,id
0,2242
1,20532
2,10648
3,20677
4,12593
...,...
5239,14033
5240,297
5241,18733
5242,6162


In [22]:
submission = pd.concat([id_col, preds], axis = 1)
submission

Unnamed: 0,id,price1
0,2242,428.047714
1,20532,282.440000
2,10648,466.024000
3,20677,918.410000
4,12593,399.765714
...,...,...
5239,14033,425.030000
5240,297,744.209000
5241,18733,388.701430
5242,6162,188.580000


In [23]:
submission.to_csv('submission_simple_model.csv', index = False)

# Trying XGBRegressor

In [24]:
xgb = XGBRegressor()

pipe_xgb = Pipeline(steps = [
    ('imputing and encoding', transformer),
    ('model_xgb', xgb)
])

In [25]:
score = cross_val_score(pipe, X, y, cv = 5)

In [26]:
score

array([0.91229786, 0.89746921, 0.91378218, 0.89785711, 0.89854571])

In [27]:
params = {
    'model_xgb__n_estimators':[1000, 5000],
    'model_xgb__max_depth': [1, 3, 5, 10, 15]
}

gs = GridSearchCV(
    pipe_xgb, param_grid = params, scoring = 'neg_root_mean_squared_error', cv = 3, n_jobs = 3, verbose = 5
)

In [28]:
gs.fit(X,y)

Fitting 3 folds for each of 10 candidates, totalling 30 fits


GridSearchCV(cv=3,
             estimator=Pipeline(steps=[('imputing and encoding',
                                        ColumnTransformer(remainder='passthrough',
                                                          transformers=[('ohe_encoder',
                                                                         OneHotEncoder(),
                                                                         ['maincateg',
                                                                          'platform']),
                                                                        ('imputer',
                                                                         SimpleImputer(strategy='median'),
                                                                         ['star_5f',
                                                                          'star_4f',
                                                                          'star_3f',
                                  

In [29]:
gs.best_params_

{'model_xgb__max_depth': 10, 'model_xgb__n_estimators': 1000}

In [30]:
xgb = XGBRegressor(max_depth = 5, n_estimators = 1000)

pipe_xgb_final = Pipeline(steps = [
    ('imputing and encoding', transformer),
    ('model_xgb', xgb)
])

pipe_xgb_final.fit(X,y)

Pipeline(steps=[('imputing and encoding',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('ohe_encoder',
                                                  OneHotEncoder(),
                                                  ['maincateg', 'platform']),
                                                 ('imputer',
                                                  SimpleImputer(strategy='median'),
                                                  ['star_5f', 'star_4f',
                                                   'star_3f', 'star_2f',
                                                   'star_1f']),
                                                 ('imputer_mean',
                                                  SimpleImputer(),
                                                  ['Rating', 'norating1',
                                                   'noreviews1'])])),
                ('model_xgb',
                 XGBRegres

In [31]:
preds_xgb = pd.DataFrame(pipe_xgb_final.predict(df_test), columns = ['price1'])
preds_xgb

Unnamed: 0,price1
0,418.289337
1,271.637878
2,408.082672
3,878.644409
4,398.354218
...,...
5239,332.200409
5240,857.268311
5241,387.738861
5242,203.177795


In [32]:
submission_xgb_simple = pd.concat([id_col, preds_xgb], axis = 1)
submission_xgb_simple

Unnamed: 0,id,price1
0,2242,418.289337
1,20532,271.637878
2,10648,408.082672
3,20677,878.644409
4,12593,398.354218
...,...,...
5239,14033,332.200409
5240,297,857.268311
5241,18733,387.738861
5242,6162,203.177795


In [33]:
submission_xgb_simple.to_csv('submission_xgb_simple.csv', index = False)

## Trying Ridge Regression

<div class = 'alert alert-danger'> Very poor result of this approach </div>

In [34]:
rg = Ridge(alpha = 0.001)

pipe_rg = Pipeline(steps = [
    ('imputing and encoding', transformer),
    ('model_rg', rg)
])


In [35]:
pipe_rg.fit(X,y)

Pipeline(steps=[('imputing and encoding',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('ohe_encoder',
                                                  OneHotEncoder(),
                                                  ['maincateg', 'platform']),
                                                 ('imputer',
                                                  SimpleImputer(strategy='median'),
                                                  ['star_5f', 'star_4f',
                                                   'star_3f', 'star_2f',
                                                   'star_1f']),
                                                 ('imputer_mean',
                                                  SimpleImputer(),
                                                  ['Rating', 'norating1',
                                                   'noreviews1'])])),
                ('model_rg', Ridge(alpha=0.001))])

In [36]:
preds_rg = pd.DataFrame(pipe_rg.predict(df_test), columns = ['price1'])
preds_rg

Unnamed: 0,price1
0,447.121399
1,292.138971
2,519.509797
3,1433.362971
4,465.337908
...,...
5239,398.794553
5240,917.241446
5241,437.408141
5242,216.756559


In [37]:
submission_rg_simple = pd.concat([id_col, preds_rg], axis = 1)
submission_rg_simple

Unnamed: 0,id,price1
0,2242,447.121399
1,20532,292.138971
2,10648,519.509797
3,20677,1433.362971
4,12593,465.337908
...,...,...
5239,14033,398.794553
5240,297,917.241446
5241,18733,437.408141
5242,6162,216.756559


In [38]:
submission_rg_simple.to_csv('ridge_reg_sub.csv', index = False)

## Trying SVR

In [48]:
gb = GradientBoostingRegressor(loss = 'squared_error')

pipe_gb = Pipeline(steps = [
    ('imputing and encoding', transformer),
    ('model_gb', gb)
])

In [49]:
pipe_gb.get_params().keys()

dict_keys(['memory', 'steps', 'verbose', 'imputing and encoding', 'model_gb', 'imputing and encoding__n_jobs', 'imputing and encoding__remainder', 'imputing and encoding__sparse_threshold', 'imputing and encoding__transformer_weights', 'imputing and encoding__transformers', 'imputing and encoding__verbose', 'imputing and encoding__verbose_feature_names_out', 'imputing and encoding__ohe_encoder', 'imputing and encoding__imputer', 'imputing and encoding__imputer_mean', 'imputing and encoding__ohe_encoder__categories', 'imputing and encoding__ohe_encoder__drop', 'imputing and encoding__ohe_encoder__dtype', 'imputing and encoding__ohe_encoder__handle_unknown', 'imputing and encoding__ohe_encoder__sparse', 'imputing and encoding__imputer__add_indicator', 'imputing and encoding__imputer__copy', 'imputing and encoding__imputer__fill_value', 'imputing and encoding__imputer__missing_values', 'imputing and encoding__imputer__strategy', 'imputing and encoding__imputer__verbose', 'imputing and enc

In [40]:
gs_lg = GridSearchCV(
    pipe_lg,
    param_grid = {'model_gb__learning_rate': [0.001, 0.1, 0.5, 1, 2],
                 'model_gb__n_estimators': [50, 100, 500, 1000]},
    cv = 5,
    verbose = 5
)

gs_lg.fit(X,y)

[CV 2/3] END model_xgb__max_depth=1, model_xgb__n_estimators=1000;, score=-276.745 total time=   7.7s
[CV 1/3] END model_xgb__max_depth=1, model_xgb__n_estimators=5000;, score=-264.710 total time=  37.8s
[CV 3/3] END model_xgb__max_depth=3, model_xgb__n_estimators=1000;, score=-226.149 total time=   8.8s
[CV 2/3] END model_xgb__max_depth=3, model_xgb__n_estimators=5000;, score=-230.727 total time= 1.0min
[CV 2/3] END model_xgb__max_depth=5, model_xgb__n_estimators=1000;, score=-214.302 total time=  18.8s
[CV 2/3] END model_xgb__max_depth=5, model_xgb__n_estimators=5000;, score=-214.344 total time= 1.4min
[CV 1/3] END model_xgb__max_depth=10, model_xgb__n_estimators=1000;, score=-204.332 total time=  34.7s
[CV 1/3] END model_xgb__max_depth=10, model_xgb__n_estimators=5000;, score=-204.332 total time=  56.0s
[CV 1/3] END model_xgb__max_depth=15, model_xgb__n_estimators=1000;, score=-207.445 total time=  23.2s
[CV 1/3] END model_xgb__max_depth=15, model_xgb__n_estimators=5000;, score=-207