# Author: Kazi Amit Hasan

Department of Computer Science & Engineering, <br/>
Rajshahi University of Engineering & Technology (RUET) <br/>
Website: https://amithasanshuvo.github.io/ <br/>
ResearchGate: https://www.researchgate.net/profile/Kazi_Amit_Hasan <br/>
LinkedIn: https://www.linkedin.com/in/kazi-amit-hasan/ <br/>
Email: kaziamithasan89@gmail.com <hr>

## Competition: HackerEarth Machine Learning challenge: Slashing prices for the biggest sale day

### Task: Predict the lowest price

A global e-commerce leader has over 150 million paid subscription users. One of the many perks of the subscription is the privilege of buying products at lower prices. For an upcoming sale, the organization has decided to promote local artisans and their products, to help them through these tough times. However, slashed prices may hurt local artists.

To keep discounts from hurting local artists, the company has decided to determine the lowest price at which a particular good can be sold. Your task is to build a predictive model using Machine Learning that helps them set up a lowest-pricing model for these products.

You have to predict the Low_Cap_Price column.

The dataset folder consists of the following files:

- Train.csv: Contains training data [9798 x 9] that must be used to build the model
- Test.csv: Contains test data [5763 x 8] to be predicted on
- sample_submission.csv: Contains sample submission format with dummy values filled for test data

In [23]:
# Importing all the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [24]:
# Loading the files
train = pd.read_csv("Dataset/Train.csv")
test = pd.read_csv("Dataset/Test.csv")

In [25]:
# Printing them
train.head()

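| | Item_Id | Date | State_of_Country | Market_Category | Product_Category | Grade | Demand | Low_Cap_Price | High_Cap_Price |
|---|---|---|---|---|---|---|---|---|---|
| 0 | IT_1 | 2007-07-05 | 0 | 0 | 0 | 0 | 0.5 | 2785 | 4240 |
| 1 | IT_2 | 2007-07-05 | 0 | 1 | 0 | 0 | 0.7 | 3574 | 4756 |
| 2 | IT_3 | 2007-07-05 | 0 | 103 | 0 | 1 | 1.6 | 5978 | 9669 |
| 3 | IT_4 | 2007-07-05 | 0 | 103 | 0 | 0 | 0.0 | 5681 | 8313 |
| 4 | IT_5 | 2007-07-05 | 0 | 103 | 0 | 2 | 0.0 | 4924 | 7257 |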


In [26]:
# Cast Demand to int64 (this truncates fractional values such as 0.5)
test['Demand'] = test['Demand'].astype('int64')


In [None]:
# Same cast for the training set
train['Demand'] = train['Demand'].astype('int64')
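
Casting a float column with `astype('int64')` drops the fractional part instead of rounding, so `Demand` values such as 0.5 and 0.7 both become 0. A minimal sketch of the difference, using a hypothetical three-value Series that mirrors the first `Demand` values printed above (the name `demand` is illustrative and not part of the notebook):

```python
import pandas as pd

# Hypothetical sample mirroring the first Demand values from train.head()
demand = pd.Series([0.5, 0.7, 1.6], name="Demand")

# astype('int64') truncates toward zero; it does not round
truncated = demand.astype("int64")

# Rounding before the cast keeps the nearest integer instead
rounded = demand.round().astype("int64")

print(truncated.tolist())
print(rounded.tolist())
```

Whether truncation or rounding is preferable depends on what `Demand` measures, which the dataset description does not state.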

In [27]:
# Build X_train by dropping the identifier, the date, and the target column
X_train = train.drop(['Item_Id', 'Date', 'Low_Cap_Price'], axis=1)
X_train.columns

Index(['State_of_Country', 'Market_Category', 'Product_Category', 'Grade',
       'Demand', 'High_Cap_Price'],
      dtype='object')

In [28]:
# Same process as above for the test set (it has no target column)
X_test = test.drop(['Item_Id', 'Date'], axis=1)
X_test.columns

Index(['State_of_Country', 'Market_Category', 'Product_Category', 'Grade',
       'Demand', 'High_Cap_Price'],
      dtype='object')

In [29]:
# Setting the target variable
y_train = train['Low_Cap_Price']

In [None]:
X_train.shape

In [9]:
# Trying LR
from sklearn.linear_model import LinearRegression
clf = LinearRegression()
clf.fit(X_train, y_train)

LinearRegression()

In [10]:
preds = clf.predict(X_test)
preds

array([ 2418.53967135,  4100.02927888, 11192.73258092, ...,
        5913.77916699,  5293.24905298,  6569.9374812 ])

In [11]:
submission = pd.DataFrame({'Item_Id':test['Item_Id'], 'Low_Cap_Price':preds})
submission.to_csv('SubmissionFile.csv', index=False)
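
The cells above go straight from fitting to writing a submission, with no local estimate of the error. One way to get such an estimate is k-fold cross-validation. The sketch below is an illustration on synthetic data, not the competition data: the arrays `X` and `y`, the six features, and the choice of RMSE are all assumptions made so that the example runs on its own.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for X_train / y_train (assumption: 6 numeric features)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
true_coef = np.array([3.0, -2.0, 0.5, 1.0, 0.0, 4.0])
y = X @ true_coef + rng.normal(scale=0.1, size=200)

# 5-fold cross-validated RMSE; scikit-learn reports it negated,
# so flip the sign to get positive error values
scores = cross_val_score(LinearRegression(), X, y, cv=5,
                         scoring="neg_root_mean_squared_error")
rmse_per_fold = -scores

print(rmse_per_fold.mean())
```

With the notebook's data, the same call would take `X_train` and `y_train` in place of `X` and `y`.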

In [37]:
# Trying random forest

from sklearn.ensemble import RandomForestRegressor
clf2 = RandomForestRegressor(bootstrap=True, max_depth=10, max_features=6,
                             n_estimators=400, min_samples_leaf=6,
                             min_samples_split=12, random_state=10,
                             ccp_alpha=0, n_jobs=-1)
clf2.fit(X_train,y_train)

RandomForestRegressor(bootstrap=True, ccp_alpha=0, criterion='mse',
                      max_depth=10, max_features=6, max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=6,
                      min_samples_split=12, min_weight_fraction_leaf=0.0,
                      n_estimators=400, n_jobs=-1, oob_score=False,
                      random_state=10, verbose=0, warm_start=False)

In [34]:
preds2 = clf2.predict(X_test)
preds2

array([3202.2141618 , 2844.18582316, 5728.89556476, ..., 6573.06454377,
       5385.94048123, 7425.8844049 ])

In [35]:
submission = pd.DataFrame({'Item_Id':test['Item_Id'], 'Low_Cap_Price':preds2})
submission.to_csv('SubmissionFile.csv', index=False)
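
A fitted random forest also exposes `feature_importances_`, which can hint at which columns drive the predictions. The sketch below uses synthetic data in which only the first feature matters; the arrays and the name `forest` are illustrative assumptions, not part of the notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the target depends only on feature 0
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 10.0 * X[:, 0] + rng.normal(scale=0.1, size=300)

forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(X, y)

# Impurity-based importances, one value per feature, normalized to sum to 1
importances = forest.feature_importances_
print(importances)
```

For the model above, `clf2.feature_importances_` could be paired with `X_train.columns` in the same way.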

In [18]:
# Trying StandardScaler (note: the models below are still fit on the unscaled X_train)
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X_train)
rescaled_X_train = scaler.transform(X_train)

In [63]:
# Trying KNN
from sklearn.neighbors import KNeighborsRegressor
clf3 = KNeighborsRegressor(algorithm='brute')

clf3.fit(X_train,y_train)

KNeighborsRegressor(algorithm='brute', leaf_size=30, metric='minkowski',
                    metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                    weights='uniform')

In [64]:
preds3 = clf3.predict(X_test)
preds3

array([3615.4, 3731. , 8944. , ..., 7283.6, 5817.4, 6286.4])

In [65]:
submission = pd.DataFrame({'Item_Id':test['Item_Id'], 'Low_Cap_Price':preds3})
submission.to_csv('Submission Files.csv', index=False)
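
The KNN model above is fit on the unscaled `X_train`, even though a `StandardScaler` was fit in the previous cell. Because KNN relies on distances, a feature with a large numeric range can dominate the neighbor search. The sketch below illustrates this on synthetic data with two features on very different scales; all names and values are hypothetical. Wrapping the scaler and the model in one pipeline also ensures that the same transformation is applied at prediction time.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two synthetic features on very different scales
small = rng.uniform(0, 1, size=300)
large = rng.uniform(0, 10000, size=300)
X = np.column_stack([small, large])
y = 100.0 * small  # the target depends only on the small-scale feature

X_fit, X_new = X[:200], X[200:]
y_fit, y_new = y[:200], y[200:]

# Plain KNN: distances are dominated by the large-scale feature
unscaled = KNeighborsRegressor().fit(X_fit, y_fit)

# Pipeline: the scaler is fit on the training rows and reused on new rows
scaled = make_pipeline(StandardScaler(), KNeighborsRegressor()).fit(X_fit, y_fit)

r2_unscaled = unscaled.score(X_new, y_new)
r2_scaled = scaled.score(X_new, y_new)
print(r2_unscaled, r2_scaled)
```

Whether scaling helps on the competition data is not something this notebook measures; the sketch only shows the mechanism.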

In [31]:
# Trying SVR
from sklearn.svm import SVR
clf4 = SVR(kernel = 'linear')
clf4.fit(X_train,y_train)


SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='scale',
    kernel='linear', max_iter=-1, shrinking=True, tol=0.001, verbose=False)

In [32]:
preds4 = clf4.predict(X_test)
preds4

array([ 2542.02264256,  4907.44105381, 13236.55784459, ...,
        6283.09381718,  5457.45188755,  7140.40947956])

In [33]:
submission = pd.DataFrame({'Item_Id':test['Item_Id'], 'Low_Cap_Price':preds4})
submission.to_csv('Submission Files.csv', index=False)

In [17]:
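# Note: preds5 is not defined in any cell shown in this notebook; this cell
# appears to be left over from a model that is no longer included.
submission = pd.DataFrame({'Item_Id':test['Item_Id'], 'Low_Cap_Price':preds5})
submission.to_csv('Submission Files.csv', index=False)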

In [97]:
# Selecting the best params

from sklearn.model_selection import GridSearchCV
# Create the parameter grid based on the results of random search 
param_grid = {
    'bootstrap': [True],
    'max_depth': [10, 50, 80, 90, 100, 110],
    'max_features': [2, 3, 5, 6],
    'min_samples_leaf': [3, 4, 5],
    'min_samples_split': [8, 10, 12],
    'n_estimators': [100, 200, 400, 300, 1000]
}
# Create a base model
rf = RandomForestRegressor()
# Instantiate the grid search model
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid,
                           cv=10, n_jobs=-1, verbose=2)

In [98]:
grid_search.fit(X_train, y_train)

Fitting 10 folds for each of 1080 candidates, totalling 10800 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:   16.4s
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed:  1.9min
[Parallel(n_jobs=-1)]: Done 357 tasks      | elapsed:  4.5min
[Parallel(n_jobs=-1)]: Done 640 tasks      | elapsed:  9.1min
[Parallel(n_jobs=-1)]: Done 1005 tasks      | elapsed: 17.0min
[Parallel(n_jobs=-1)]: Done 1450 tasks      | elapsed: 31.6min
[Parallel(n_jobs=-1)]: Done 1977 tasks      | elapsed: 48.8min
[Parallel(n_jobs=-1)]: Done 2584 tasks      | elapsed: 63.6min
[Parallel(n_jobs=-1)]: Done 3273 tasks      | elapsed: 87.9min
[Parallel(n_jobs=-1)]: Done 4042 tasks      | elapsed: 111.2min
[Parallel(n_jobs=-1)]: Done 4893 tasks      | elapsed: 135.7min
[Parallel(n_jobs=-1)]: Done 5824 tasks      | elapsed: 161.7min
[Parallel(n_jobs=-1)]: Done 6837 tasks      | elapsed: 191.5min
[Parallel(n_jobs=-1)]: Done 7930 tasks      | elapsed: 495.2min
[Parallel(n_jobs=-1)]: Done 9105 tasks   

GridSearchCV(cv=10, estimator=RandomForestRegressor(), n_jobs=-1,
             param_grid={'bootstrap': [True],
                         'max_depth': [10, 50, 80, 90, 100, 110],
                         'max_features': [2, 3, 5, 6],
                         'min_samples_leaf': [3, 4, 5],
                         'min_samples_split': [8, 10, 12],
                         'n_estimators': [100, 200, 400, 300, 1000]},
             verbose=2)
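
The figure of 1080 candidates in the log is the product of the list lengths in `param_grid`, and each candidate is fit once per fold. A short check using scikit-learn's `ParameterGrid`, with the grid copied from the cell above (the names `n_candidates` and `n_fits` are illustrative):

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the grid search cell
param_grid = {
    'bootstrap': [True],
    'max_depth': [10, 50, 80, 90, 100, 110],
    'max_features': [2, 3, 5, 6],
    'min_samples_leaf': [3, 4, 5],
    'min_samples_split': [8, 10, 12],
    'n_estimators': [100, 200, 400, 300, 1000],
}

# 1 * 6 * 4 * 3 * 3 * 5 combinations, each fit once per fold with cv=10
n_candidates = len(ParameterGrid(param_grid))
n_fits = n_candidates * 10

print(n_candidates, n_fits)
```

Counting the fits before calling `fit` gives a rough idea of the runtime to expect from a grid of this size.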

In [100]:
grid_search.best_params_


{'bootstrap': True,
 'max_depth': 80,
 'max_features': 3,
 'min_samples_leaf': 3,
 'min_samples_split': 10,
 'n_estimators': 1000}
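
The exhaustive search above ran for many hours. A cheaper alternative, named here as a different technique and not what the notebook ran, is `RandomizedSearchCV`, which samples a fixed number of combinations. The sketch below runs on small synthetic data with a reduced grid; the data, the grid values, and `n_iter=5` are assumptions chosen so the example finishes quickly.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic regression problem (assumption: 6 numeric features)
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 6))
y = 5.0 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.1, size=120)

# Reduced, illustrative search space
param_distributions = {
    "max_depth": [5, 10, None],
    "max_features": [2, 3, 6],
    "min_samples_leaf": [3, 4, 5],
    "n_estimators": [20, 50],
}

# Sample 5 combinations instead of trying all 54
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,
    cv=3,
    random_state=0,
)
search.fit(X, y)

# With the default refit=True, best_estimator_ is already refit on all data
best_model = search.best_estimator_
preds = best_model.predict(X[:5])
print(search.best_params_)
```

`GridSearchCV` exposes `best_estimator_` in the same way, so `grid_search.best_estimator_.predict(X_test)` would reuse the tuned model without retyping its parameters.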

In [None]:
# The settings that gave the best result

from sklearn.ensemble import RandomForestRegressor
clf2 = RandomForestRegressor(bootstrap=True, max_depth=10, max_features=6,
                             n_estimators=400, min_samples_leaf=5,
                             min_samples_split=12, random_state=10)
clf2.fit(X_train, y_train)

# This configuration scored 99.84988
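
The notebook does not say how the score of 99.84988 is computed. One possibility, stated here purely as an assumption and not confirmed anywhere in this notebook, is a formula of the form max(0, 100 - MSLE), where MSLE is the mean squared logarithmic error. The sketch below shows how a score of that form would behave; the function `assumed_score` and the predicted values are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error


def assumed_score(actual, predicted):
    """Hypothetical score: 100 minus MSLE, floored at 0 (an assumption)."""
    return max(0.0, 100.0 - mean_squared_log_error(actual, predicted))


# First three Low_Cap_Price values from train.head(); predictions are made up
actual = np.array([2785.0, 3574.0, 5978.0])
predicted = np.array([2800.0, 3500.0, 6100.0])

score = assumed_score(actual, predicted)
perfect = assumed_score(actual, actual)
print(score, perfect)
```

Under this assumed formula, small relative errors keep the score close to 100, so differences between models would show up only in the decimal places.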