## Black Friday
### Problem Statement
A retail company, “ABC Private Limited”, wants to understand customer purchase behaviour (specifically, purchase amount) across products in different categories. They have shared a purchase summary of various customers for selected high-volume products from the last month.
The data set also contains customer demographics (age, gender, marital status, city_type, stay_in_current_city), product details (product_id and product category), and the total purchase_amount from the last month.
They now want to build a model that predicts a customer's purchase amount for a given product, which will help them create personalized offers for customers across different products.

Import all necessary packages

In [1]:
import numpy as np
import pandas as pd
from sklearn import model_selection
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.svm import LinearSVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import scale
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import ExtraTreesRegressor
import xgboost as xgb
import types
from sklearn.model_selection import KFold

Import the training and test datasets

In [2]:
dataset_train = pd.read_csv(r'./Data/train.csv')
dataset_test = pd.read_csv(r'./Data/test.csv')

Check the size of each dataset and the first 5 rows of the training data

In [3]:
print(len(dataset_train))
print(len(dataset_test))
dataset_train.head()

550068
233599


Unnamed: 0,User_ID,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase
0,1000001,P00069042,F,0-17,10,A,2,0,3,,,8370
1,1000001,P00248942,F,0-17,10,A,2,0,1,6.0,14.0,15200
2,1000001,P00087842,F,0-17,10,A,2,0,12,,,1422
3,1000001,P00085442,F,0-17,10,A,2,0,12,14.0,,1057
4,1000002,P00285442,M,55+,16,C,4+,0,8,,,7969


Store the categorical column names in a dictionary that maps each column to the unique values it takes across the training and test sets

In [4]:
columns = dataset_train.columns
dic_columnwise_acceped_value = {}
for i in columns[2:-1] :
    temp1 = dataset_train[i].unique()
    temp2 = dataset_test[i].unique()
    try :
        if np.isnan(temp1).any() and np.isnan(temp2).any() :
            temp1 = temp1[~np.isnan(temp1)]
    except TypeError :
        pass
    tem_dup =  np.hstack([temp1, temp2])
    tem_dup = np.unique(tem_dup)
    dic_columnwise_acceped_value[i] = list(tem_dup)
    print(i,tem_dup)

Gender ['F' 'M']
Age ['0-17' '18-25' '26-35' '36-45' '46-50' '51-55' '55+']
Occupation [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20]
City_Category ['A' 'B' 'C']
Stay_In_Current_City_Years ['0' '1' '2' '3' '4+']
Marital_Status [0 1]
Product_Category_1 [ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20]
Product_Category_2 [ 2.  3.  4.  5.  6.  7.  8.  9. 10. 11. 12. 13. 14. 15. 16. 17. 18. nan]
Product_Category_3 [ 3.  4.  5.  6.  8.  9. 10. 11. 12. 13. 14. 15. 16. 17. 18. nan]
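
As a cross-check on the loop above, the same per-column value sets can be obtained by concatenating the two frames first. This is only a minimal sketch on toy data: the frames `train` and `test` and the helper `accepted_values` are hypothetical names, not part of the notebook, and unlike the loop this version drops NaN instead of keeping it in the list.

```python
import pandas as pd

def accepted_values(train, test, columns):
    """Map each column to the sorted unique non-null values seen in train or test."""
    combined = pd.concat([train[columns], test[columns]], ignore_index=True)
    return {c: sorted(combined[c].dropna().unique().tolist()) for c in columns}

# toy frames standing in for dataset_train / dataset_test (hypothetical data)
train = pd.DataFrame({'Gender': ['F', 'M', 'F'], 'Product_Category_2': [6.0, None, 14.0]})
test = pd.DataFrame({'Gender': ['M', 'M'], 'Product_Category_2': [2.0, None]})

values = accepted_values(train, test, ['Gender', 'Product_Category_2'])
print(values)
```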


Check and print whether any column contains null or NaN values

In [5]:
print(dataset_train.isna().sum())
print(dataset_test.isna().sum())

User_ID                            0
Product_ID                         0
Gender                             0
Age                                0
Occupation                         0
City_Category                      0
Stay_In_Current_City_Years         0
Marital_Status                     0
Product_Category_1                 0
Product_Category_2            173638
Product_Category_3            383247
Purchase                           0
dtype: int64
User_ID                            0
Product_ID                         0
Gender                             0
Age                                0
Occupation                         0
City_Category                      0
Stay_In_Current_City_Years         0
Marital_Status                     0
Product_Category_1                 0
Product_Category_2             72344
Product_Category_3            162562
dtype: int64


In this case Product_Category_2 and Product_Category_3 contain null values.
Use a SimpleImputer to replace NaN with 999 so that these columns can be one-hot encoded later on.

In [6]:
def replace_NaN(data, columns, *args) :    
    mp = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=999)
    transformed_val = mp.fit_transform(data.iloc[:,[columns.get_loc(i) for i in list(args)]].values)
    df = data.copy()
    df[list(args)] = transformed_val
    return df
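
For a constant fill value, pandas alone can do the same job as the imputer. The sketch below is an alternative under that assumption, not the notebook's method; `fill_missing` and the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd

def fill_missing(data, cols, fill_value=999):
    """Return a copy of data with NaN in the given columns replaced by fill_value."""
    df = data.copy()
    df[cols] = df[cols].fillna(fill_value)
    return df

# toy frame with the same missing-value pattern as the first rows of the training data
toy = pd.DataFrame({'Product_Category_1': [3, 1],
                    'Product_Category_2': [np.nan, 6.0],
                    'Product_Category_3': [np.nan, 14.0]})
filled = fill_missing(toy, ['Product_Category_2', 'Product_Category_3'])
print(filled)
```

Because the function works on a copy, the input frame keeps its NaN values, which mirrors the `data.copy()` in `replace_NaN`.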

Replace NaN in the two columns Product_Category_2 and Product_Category_3

In [7]:
dataset_train = replace_NaN(dataset_train, dataset_train.columns, 'Product_Category_2', 'Product_Category_3')
dataset_test = replace_NaN(dataset_test, dataset_test.columns, 'Product_Category_2', 'Product_Category_3')

In [8]:
dataset_train.head(6)

Unnamed: 0,User_ID,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase
0,1000001,P00069042,F,0-17,10,A,2,0,3,999.0,999.0,8370
1,1000001,P00248942,F,0-17,10,A,2,0,1,6.0,14.0,15200
2,1000001,P00087842,F,0-17,10,A,2,0,12,999.0,999.0,1422
3,1000001,P00085442,F,0-17,10,A,2,0,12,14.0,999.0,1057
4,1000002,P00285442,M,55+,16,C,4+,0,8,999.0,999.0,7969
5,1000003,P00193542,M,26-35,15,A,3,0,1,2.0,999.0,15227


For each categorical column with n categories we create n-1 dummy variables, each containing either 0 or 1.
The function below takes the data, the list of column names, the name of the column to encode,
and the category value whose dummy column should be dropped in order to avoid the dummy variable trap.

In [9]:
def replace_column_with_Dummy_Columns(data, columns, column_name, remove_column_val, remove_one_dummy=False, dtype=int) :
    temp = pd.get_dummies(data[column_name], prefix=column_name, dtype=dtype)
    col = list(temp.columns)
    removed_col = column_name+'_'+remove_column_val
    removed_col_index = col.index(removed_col)
    temp = temp[col[:removed_col_index] + col[removed_col_index+(1 if remove_one_dummy else 0):]] # removing to avoid dummy variable trap
    previous = data[columns[:columns.index(column_name)]]
    after = data[columns[columns.index(column_name)+1:]]
    previous = previous.join(temp)
    previous = previous.join(after)
    return previous
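
The core of this step, one-hot encoding a column and dropping one reference category, can be sketched in a few lines. The helper `dummies_without` and the toy data are hypothetical; the notebook's function additionally keeps the surrounding columns in place.

```python
import pandas as pd

def dummies_without(data, column_name, dropped_value):
    """One-hot encode one column and drop the dummy for dropped_value (the reference level)."""
    dummies = pd.get_dummies(data[column_name], prefix=column_name, dtype=int)
    return dummies.drop(columns=f'{column_name}_{dropped_value}')

toy = pd.DataFrame({'City_Category': ['A', 'C', 'B', 'A']})
encoded = dummies_without(toy, 'City_Category', 'A')
print(encoded)
```

A row of all zeros then stands for the dropped category, so no information is lost while the dummy columns stop summing to a constant.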

For each categorical column, create the one-hot encoded columns with replace_column_with_Dummy_Columns, dropping the dummy of the most frequent category

In [10]:
def get_dummy_dataset(data, columns, cat_col_list) :
    col = cat_col_list
    df = data[list(filter(lambda x: x not in cat_col_list, columns))]
    for i in cat_col_list :
        dic = data[i].value_counts().to_dict()
        max_key = max(dic, key=dic.get) # most frequent category in this column;
        # its dummy column is the one that gets dropped, so it acts as the reference level
        data = replace_column_with_Dummy_Columns(data, col, i, str(max_key), True)
        col = list(data.columns)
    df= df.join(data)
    return df

Iterate through each categorical column in the training and test sets to get the dummy-coded columns

In [11]:
def get_encoded_dataset(dataset, category_column) :
    columns = list(dataset.columns)
    df = get_dummy_dataset(dataset, columns, category_column)
    return df
def get_label_encoded_data(df, label_col) :
    # label encode data
    for i in label_col :
        le = LabelEncoder()
        df[i] = le.fit_transform(df[i])
    return df
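
`LabelEncoder` assigns codes in sorted order of the class labels. For `Age` the sorted string order happens to match the natural order of the age bands, so the codes are ordinal. When that cannot be relied on, an explicit mapping makes the order visible. The sketch below is an assumption-level alternative; `AGE_ORDER` and `ordinal_encode` are hypothetical names.

```python
import pandas as pd

# category order taken from the unique Age values printed earlier in the notebook
AGE_ORDER = ['0-17', '18-25', '26-35', '36-45', '46-50', '51-55', '55+']

def ordinal_encode(series, ordered_values):
    """Replace each value by its position in ordered_values."""
    mapping = {value: code for code, value in enumerate(ordered_values)}
    return series.map(mapping)

ages = pd.Series(['0-17', '55+', '26-35'])
codes = ordinal_encode(ages, AGE_ORDER)
print(codes.tolist())
```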

Create the one-hot encoded data for the training and test sets

In [12]:
def get_train_test_data(df, slice_index, data_label) : # works for both numpy arrays and dataframes
    if not isinstance(data_label, pd.DataFrame) :
        raise TypeError('data_label must be a pandas DataFrame')
    if type(df) is pd.DataFrame :
        sliced_train_data = df.iloc[:slice_index,:]
        sliced_test_data = df.iloc[slice_index:, :]
        sliced_train_data = pd.concat([sliced_train_data, data_label], axis = 1)
    else :
        data_label = data_label.iloc[:, :].values
        sliced_train_data = df[:slice_index]
        sliced_test_data = df[slice_index:]
        sliced_train_data = np.concatenate((sliced_train_data, data_label), axis = 1)
    return sliced_train_data, sliced_test_data
def get_encoded_data(dataset_train, dataset_test, model_type='non_tree_based') :
    column = list(dataset_train.columns)
    df = dataset_train[column[:-1]] # separate the label column
    df = pd.concat([df, dataset_test], ignore_index=True) # merge the two dataframes
    category_column = ['Gender', 'Occupation', 'City_Category',
                       'Product_Category_1', 'Product_Category_2',
                       'Product_Category_3']
    label_col = ['Age', 'Stay_In_Current_City_Years']
    if model_type == 'non_tree_based' :
        df = get_label_encoded_data(df, label_col)
        df = get_encoded_dataset(df, category_column)
    elif model_type == 'tree_based' :
        df = get_label_encoded_data(df, label_col+category_column)
    else :
        pass
    return df
def PCR_feature_selection_graph(df) :
    pca = PCA().fit(df)
    #Plotting the Cumulative Summation of the Explained Variance
    plt.figure()
    plt.plot(np.cumsum(pca.explained_variance_ratio_))
    plt.xlabel('Number of Components')
    plt.ylabel('Variance (%)') #for each component
    plt.title('Black friday Explained Variance')
    plt.show()
def PCR_feature_selection(df, n_comp) :
    pca = PCA(n_components=n_comp)
    return pca.fit_transform(df)
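
The curve plotted by `PCR_feature_selection_graph` is the cumulative explained variance ratio. As a rough illustration of how it could be used to pick `n_comp`, the sketch below computes the same quantity with a plain SVD on synthetic data. The 0.95 threshold and both helper names are assumptions, not values from the notebook.

```python
import numpy as np

def cumulative_explained_variance(X):
    """Cumulative share of variance captured by the leading principal components."""
    centered = X - X.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variance = singular_values ** 2
    return np.cumsum(variance / variance.sum())

def components_needed(X, threshold=0.95):
    """Smallest number of components whose cumulative share reaches threshold."""
    cumulative = cumulative_explained_variance(X)
    return int(np.searchsorted(cumulative, threshold) + 1)

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
# the third column is almost a copy of the first, so two components carry nearly all variance
X = np.column_stack([base[:, 0], base[:, 1], base[:, 0] + 0.01 * rng.normal(size=200)])
print(cumulative_explained_variance(X))
print(components_needed(X, threshold=0.95))
```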

The principal components that are dropped correspond to the near collinearities. Consequently, the standard errors of the parameter estimates are reduced; the tradeoff is that the estimates are biased, and the bias increases as more principal components are dropped. For this reason we do not apply principal component analysis here.

We separate our data transformation into two different sections:
1. Non-tree-based models, which use one-hot and label encoded data
2. Tree-based models, which use only label encoded data

In [13]:
def get_data_model_based(dataset_train, dataset_test, model_type='non_tree_based') :
    data_label = dataset_train[['Purchase']]
    slice_index = len(dataset_train)
    if model_type == 'non_tree_based' :
        df = get_encoded_data(dataset_train, dataset_test, model_type)
        df = df.drop(['User_ID', 'Product_ID'], axis = 1)
        # PCR_feature_selection_graph(df)
        #df = PCR_feature_selection(df, 60)
        data_train, data_test = get_train_test_data(df, slice_index, data_label)
    elif  model_type == 'tree_based' :
        df = get_encoded_data(dataset_train, dataset_test, model_type)
        df = df.drop(['User_ID', 'Product_ID'], axis = 1)
        df = df.applymap(str)
        data_train, data_test = get_train_test_data(df, slice_index, data_label)
    else  :
        return None, None
    return data_train, data_test
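
The reason for encoding the training and test rows together and then splitting them back by position can be seen on a toy example: a category that appears only in the test set still gets a column in the training frame. This is a minimal sketch with hypothetical data, using `pd.get_dummies` directly rather than the notebook's helpers.

```python
import pandas as pd

train = pd.DataFrame({'City_Category': ['A', 'B'], 'Purchase': [8370, 15200]})
test = pd.DataFrame({'City_Category': ['C', 'A', 'B']})

labels = train[['Purchase']]
n_train = len(train)

# encode train and test together so both halves share the same dummy columns
combined = pd.concat([train.drop(columns='Purchase'), test], ignore_index=True)
encoded = pd.get_dummies(combined, columns=['City_Category'], dtype=int)

# split back by position: the first n_train rows are the training rows
train_encoded = pd.concat([encoded.iloc[:n_train], labels], axis=1)
test_encoded = encoded.iloc[n_train:]
print(train_encoded)
print(test_encoded)
```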

For non-tree-based data

In [14]:
data_train, data_test = get_data_model_based(dataset_train, dataset_test, model_type='non_tree_based')

For tree-based data

In [15]:
datatb_train, datatb_test = get_data_model_based(dataset_train, dataset_test, model_type='tree_based')

Create a generic model object that stores all the information related to a model

In [16]:
class ModelObject :
    def __init__(self, name) :
        self._name = name
        self._cross_validation_score = None
        self._model = None
    @property
    def name(self) :
        return self._name
    @property
    def cross_validation_score(self) :
        return self._cross_validation_score
    @cross_validation_score.setter
    def cross_validation_score(self, value) :
        try :
            self._cross_validation_score = value
        except Exception as e:
            raise Exception('value object is not in format',e)
    @property
    def model(self) :
        return self._model
    @model.setter
    def model(self, value) :
        self._model = value
        self._cross_validation_score = None
    def __str__(self) :
        res = '\n'
        res += 'Model Name :- ' + self._name + '\n'
        return res

Since data_test doesn't contain the target variable, I split data_train into X_train, y_train, X_test and y_test.

Create a function which returns the train/test split

In [17]:
def get_test_train_data(data) :
    X = data[:, :-1]
    y = data[:, -1]
    X_train, X_test, y_train , y_test = model_selection.train_test_split(X, y, test_size = 0.2, random_state=0)
    return (X_train, X_test, y_train , y_test)

## Build Models
### 1. Linear Regression

In [18]:
def get_optimal_modellr(X_train, y_train) :
    seed_k_fold = 7
    no_of_split = 10
    scoring = 'neg_mean_squared_error'
    lr = ModelObject('Linear Regression')
    kfold = model_selection.KFold(n_splits= no_of_split, random_state=seed_k_fold)
    lr.model = LinearRegression(fit_intercept=True, normalize=False, copy_X=True, n_jobs=None)
    cv_results = -model_selection.cross_val_score(lr.model, X_train, y_train,cv=kfold, scoring=scoring)
    """The unified scoring API always maximizes the score, so scores which need to be minimized are negated
    in order for the unified scoring API to work correctly. The score that is returned is therefore negated
    when it is a score that should be minimized and left positive if it is a score that should be maximized."""
    cv_results.sort(axis=-1, kind='mergesort', order=None)
    lr.cross_validation_score = cv_results
    return lr

Fit a linear model without any regularization and use its cross-validated mean squared error (MSE) as the performance metric

In [19]:
data = get_test_train_data(data_train.values)
if data is not None :
    X_train, X_test, y_train , y_test = data[0], data[1], data[2], data[3]
    lr = get_optimal_modellr(X_train, y_train)
    print('cr mean -',lr.cross_validation_score.mean())

cr mean - 8890732.525132675
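
The sign convention of the scoring API and the step from MSE to RMSE can be illustrated on synthetic data. This is a small sketch, not the notebook's run: the data, the 5 folds and the variable names are all assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# scikit-learn reports negated MSE so that "larger is better" holds for every scorer
neg_mse = cross_val_score(LinearRegression(), X, y,
                          cv=KFold(n_splits=5), scoring='neg_mean_squared_error')
mse_per_fold = -neg_mse                # flip the sign back to a plain MSE
rmse_per_fold = np.sqrt(mse_per_fold)  # RMSE is in the same units as the target
print(mse_per_fold.mean(), rmse_per_fold.mean())
```

Applied to the figure above, a mean MSE of about 8.89 million corresponds to an RMSE of roughly 2,982 in purchase units.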


Apply regularization techniques

### 2. Ridge Regression

In [20]:
def get_ridge_regression_model(lambda_val) :
    return Ridge(alpha = lambda_val, fit_intercept=True, normalize=False, copy_X=True,
                         max_iter=None, tol=0.001, solver='auto', random_state=0)
def get_optimal_model_ridge(X_train, y_train, verbose=True, lambda_start=0.001, lambda_stop=1.2, no_split=10) :
    seed_k_fold = 7
    no_of_split = no_split
    range_lambda  = np.logspace(lambda_start, lambda_stop, num=no_split) # lambda_start and lambda_stop are base-10 exponents
    kf = model_selection.KFold(n_splits= no_of_split, random_state=seed_k_fold)
    index = 0
    val_score_list = []
    # Each candidate lambda is scored on one distinct fold (fold i is paired with range_lambda[i])
    for train_index, test_index in kf.split(X_train):
        #print("TRAIN:", train_index, "TEST:", test_index)
        X_tr, X_tst = X_train[train_index], X_train[test_index]
        y_tr, y_tst = y_train[train_index], y_train[test_index]
        temp = get_ridge_regression_model(range_lambda[index])
        temp.fit(X_tr,y_tr)
        predicted = temp.predict(X_tst)
        validation_score = mean_squared_error(y_tst, predicted)
        val_score_list.append(validation_score)
        if verbose :
            print('lambda - '+str(range_lambda[index]) + '--validation score '+str(validation_score))
        index += 1
    min_val_score_index = val_score_list.index(min(val_score_list))
    lambda_optimal = range_lambda[min_val_score_index]
    ridge_reg = ModelObject('Ridge Regression')
    ridge_reg.model = get_ridge_regression_model(lambda_optimal)
    ridge_reg.cross_validation_score = val_score_list[min_val_score_index]
    return ridge_reg

In [21]:
# fitting optimal model
rr = get_optimal_model_ridge(X_train, y_train, verbose=True)
rr.model.fit(X_train,y_train)
predicted = rr.model.predict(X_test)
print(mean_squared_error(y_test, predicted))

lambda - 1.0023052380778996--validation score 8925627.527068092
lambda - 1.362141492331366--validation score 8844921.419810327
lambda - 1.8511620758251646--validation score 8845271.176420026
lambda - 2.515745280696361--validation score 8826084.63764821
lambda - 3.4189196073092853--validation score 8842036.4666963
lambda - 4.646341333097262--validation score 8896723.662220791
lambda - 6.314418080347418--validation score 8870320.828464543
lambda - 8.581348815120236--validation score 8950951.27574684
lambda - 11.66212730131956--validation score 9031182.293514831
lambda - 15.848931924611133--validation score 8873092.101679886
8886996.47578433
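
In the search above each candidate lambda is scored on a different single fold, so the scores are not directly comparable across candidates. A more conventional variant scores every candidate on every fold and compares the means. The sketch below shows that variant on synthetic data; `select_alpha_by_cv`, the 5 folds and the shuffled `KFold` are assumptions rather than the notebook's settings.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

def select_alpha_by_cv(X, y, alphas, n_splits=5):
    """Score every alpha on every fold and return (best_alpha, mean MSE per alpha)."""
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=7)
    mean_mse = []
    for alpha in alphas:
        neg_mse = cross_val_score(Ridge(alpha=alpha), X, y,
                                  cv=kfold, scoring='neg_mean_squared_error')
        mean_mse.append(-neg_mse.mean())
    best = int(np.argmin(mean_mse))
    return alphas[best], mean_mse

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X @ np.array([1.0, 0.0, -2.0, 0.5, 3.0]) + rng.normal(scale=0.5, size=300)

alphas = np.logspace(0.001, 1.2, num=10)
best_alpha, scores = select_alpha_by_cv(X, y, alphas)
print(best_alpha, min(scores))
```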


### 3. Lasso Regression

In [22]:
def get_lasso_regression_model(lambda_val) :
    # precompute - Whether to use a precomputed Gram matrix to speed up calculations
    # warm_start - When set to True, reuse the solution of the previous call to fit as initialization
    return Lasso(alpha = lambda_val, fit_intercept=True, normalize=False, precompute=False,
                            copy_X=True, max_iter=1000, tol=0.0001, warm_start=False, positive=False,
                            random_state=0, selection='cyclic')
def get_optimal_model_lasso(X_train, y_train, verbose=True, lambda_start=0.001, lambda_stop=1.2, no_split=10) :
    seed_k_fold = 7
    no_of_split = no_split
    range_lambda  = np.logspace(lambda_start, lambda_stop, num=no_split) # lambda_start and lambda_stop are base-10 exponents
    kf = model_selection.KFold(n_splits= no_of_split, random_state=seed_k_fold)
    index = 0
    val_score_list = []
    # Each candidate lambda is scored on one distinct fold (fold i is paired with range_lambda[i])
    for train_index, test_index in kf.split(X_train):
        X_tr, X_tst = X_train[train_index], X_train[test_index]
        y_tr, y_tst = y_train[train_index], y_train[test_index]
        temp =  get_lasso_regression_model(range_lambda[index])
        temp.fit(X_tr, y_tr)
        predicted = temp.predict(X_tst)
        validation_score = mean_squared_error(y_tst, predicted)
        val_score_list.append(validation_score)
        if verbose :
            print('lambda - '+str(range_lambda[index]) + '--validation score '+str(validation_score))
        index += 1
    min_val_score_index = val_score_list.index(min(val_score_list))
    lambda_optimal = range_lambda[min_val_score_index]
    lasso_reg = ModelObject('Lasso Regression')
    lasso_reg.model = get_lasso_regression_model(lambda_optimal)
    lasso_reg.cross_validation_score = val_score_list[min_val_score_index]
    return lasso_reg

In [23]:
# fitting optimal model
lr = get_optimal_model_lasso(X_train, y_train, verbose=True)
lr.model.fit(X_train,y_train)
predicted = lr.model.predict(X_test)
print(mean_squared_error(y_test, predicted))

lambda - 1.0023052380778996--validation score 8936631.365144437
lambda - 1.362141492331366--validation score 8853675.972363977
lambda - 1.8511620758251646--validation score 8875320.468876196
lambda - 2.515745280696361--validation score 8877660.431314765
lambda - 3.4189196073092853--validation score 8944085.185055582
lambda - 4.646341333097262--validation score 9060190.162094783
lambda - 6.314418080347418--validation score 9123042.470936557
lambda - 8.581348815120236--validation score 9318790.540123885
lambda - 11.66212730131956--validation score 9491816.247081403
lambda - 15.848931924611133--validation score 9497639.233877284
8901754.690498047
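
The printed grid starts at about 1.0023 rather than 0.001 because `np.logspace` interprets its first two arguments as base-10 exponents. If a grid running from 0.001 to 1.2 was intended (an assumption; the notebook does not say), `np.geomspace` takes the endpoints themselves. A short comparison:

```python
import numpy as np

# np.logspace treats its first two arguments as base-10 exponents
grid_from_exponents = np.logspace(0.001, 1.2, num=10)
print(grid_from_exponents[0], grid_from_exponents[-1])   # 10**0.001 and 10**1.2

# np.geomspace takes the endpoints themselves
grid_from_endpoints = np.geomspace(0.001, 1.2, num=10)
print(grid_from_endpoints[0], grid_from_endpoints[-1])
```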


### 4. Support vector regression

In [24]:
def get_svr_model(c=1.0) :
    # C is a hyperparameter that controls how much we penalize the use of slack variables
    # slack variable: a value ζ that, roughly, indicates how far a point lies outside the
    # epsilon-insensitive margin (in classification, how far it must move to be correctly classified)
    return LinearSVR(epsilon=0.0, tol=0.0001, C=c, loss='epsilon_insensitive', fit_intercept=True,
                     intercept_scaling=1.0, dual=True, verbose=0, random_state=None, max_iter=1000)
def get_optimal_model_svr(X_train, y_train, c_start=14, c_stop=33, verbose=True) :
    range_c  = np.linspace(c_start, c_stop, num=8)
    X_tr, X_tst, y_tr, y_tst = model_selection.train_test_split(X_train, y_train, test_size = 0.1, random_state=0)
    val_score_list = []
    # Score each candidate value of c on a single hold-out validation split
    for i in range_c:
        temp =  get_svr_model(i)
        temp.fit(X_tr, y_tr)
        predicted = temp.predict(X_tst)
        validation_score = mean_squared_error(y_tst, predicted)
        val_score_list.append(validation_score)
        if verbose :
            print('c - '+str(i) + '--validation score '+str(validation_score))
    min_val_score_index = val_score_list.index(min(val_score_list))
    c_optimal = range_c[min_val_score_index]
    svr_reg = ModelObject('SVR')
    svr_reg.model = get_svr_model(c_optimal)
    svr_reg.cross_validation_score = val_score_list[min_val_score_index]
    return svr_reg

In [25]:
# fitting optimal model
svr = get_optimal_model_svr(X_train, y_train)
svr.model.fit(X_train,y_train)
predicted = svr.model.predict(X_test)
print(mean_squared_error(y_test, predicted))

c - 14.0--validation score 9740006.926035345
c - 16.714285714285715--validation score 9724116.276088322
c - 19.42857142857143--validation score 9737061.310684197
c - 22.142857142857142--validation score 9735489.389833312
c - 24.857142857142858--validation score 9715213.616381284
c - 27.571428571428573--validation score 9708551.151698155
c - 30.285714285714285--validation score 9716244.617435142
c - 33.0--validation score 9705327.730926773
9762564.69189564


### 5. k-nearest neighbors Regression

In [26]:
def get_knr_model(k=5) :
    algo = 'auto' #{‘auto’, ‘ball_tree’, ‘kd_tree’, ‘brute’}
    return KNeighborsRegressor(n_neighbors=k, weights='uniform', algorithm=algo, leaf_size=30, p=2,
                               metric='minkowski', metric_params=None, n_jobs=-1)
def get_optimal_model_knr(X_train, y_train,k_start=1, k_stop=5, verbose=True) :
    range_k  = np.arange(k_start,k_stop+1,dtype=int)
    X_tr, X_tst, y_tr, y_tst = model_selection.train_test_split(X_train, y_train, test_size = 0.1, random_state=0)
    val_score_list = []
    # Score each candidate value of k on a single hold-out validation split
    for j in range_k:
        temp = get_knr_model(j)
        temp.fit(X_tr, y_tr)
        predicted = temp.predict(X_tst)
        validation_score = mean_squared_error(y_tst, predicted)
        val_score_list.append(validation_score)
        if verbose :
            print('k - '+str(j) + '--validation score '+str(validation_score))
    knr_reg = ModelObject('KNR')
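    min_val_score_index = val_score_list.index(min(val_score_list))
    k_optimal = int(range_k[min_val_score_index]) # map the position back to k so that k_start != 1 also works
    knr_reg.model = get_knr_model(k_optimal)
    knr_reg.cross_validation_score = val_score_list[min_val_score_index]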
    return knr_reg

In [25]:
knr = get_optimal_model_knr(X_train, y_train, k_start=1, k_stop=14, verbose=True)
knr.model.fit(X_train,y_train)
predicted = knr.model.predict(X_test)
print(mean_squared_error(y_test, predicted))

k - 1--validation score 16758322.596668636
k - 2--validation score 12796929.822882107
k - 3--validation score 11675475.985835265
k - 4--validation score 11203160.269369518
k - 5--validation score 10986796.915093398
k - 6--validation score 10925121.449397052
k - 7--validation score 10888404.722758586
k - 8--validation score 10881650.094229693
k - 9--validation score 10891254.60540089
k - 10--validation score 10923843.527569193
k - 11--validation score 10941208.553442938
k - 12--validation score 11014805.403263347
k - 13--validation score 11061729.665181616
k - 14--validation score 11094325.79974507
10963517.764887594
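
When the search range does not start at 1, the position of the best score and the value of k are no longer the same thing. The sketch below maps the position back through the range. `best_k` is a hypothetical helper, and the scores are the k = 6 to 10 values printed above, rounded to one decimal.

```python
import numpy as np

def best_k(range_k, val_score_list):
    """Return the k whose validation MSE is lowest, plus that MSE."""
    best_position = int(np.argmin(val_score_list))
    return int(range_k[best_position]), val_score_list[best_position]

# hypothetical search that starts at k=6 instead of k=1
range_k = np.arange(6, 11)
val_score_list = [10925121.4, 10888404.7, 10881650.1, 10891254.6, 10923843.5]
k_optimal, score = best_k(range_k, val_score_list)
print(k_optimal, score)
```

Taking "position + 1" on the same list would report k = 3, a value that was never evaluated in this range.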


## Tree based learning

Get the data for the tree-based models

In [19]:
data = get_test_train_data(datatb_train.values)
if data is not None :
    X_train, X_test, y_train , y_test = data[0], data[1], data[2], data[3]

In [20]:
X_train[:2]

array([['1', '2', '7', '2', '4', '1', '7', '17', '15'],
       ['1', '2', '14', '1', '2', '0', '7', '15', '15']], dtype=object)

### 6. Decision Tree Regression

In [28]:
def get_decision_tree_model( max_depth = None, min_samples_lf = 1, min_samples_splt = 2, verbose = True) :
    criteria = 'mse'
    split = 'best'
    dt = DecisionTreeRegressor(criterion=criteria, splitter=split, max_depth=max_depth,
                                 min_samples_split=min_samples_splt, min_samples_leaf=min_samples_lf,
                                 min_weight_fraction_leaf=0.0,max_features=None, random_state=0,
                                 max_leaf_nodes=None,min_impurity_decrease=0.0,
                                 min_impurity_split=None, presort=False)
    if verbose :
        print('max depth {0} , minimum sample leaf - {1} minimum sample split {2}'.format(str(max_depth), 
                                                                                          str(min_samples_lf),
                                                                                         str(min_samples_splt)))
    dt_reg = ModelObject('Decision Tree')
    dt_reg.model = dt
    return dt_reg

In [29]:
dtr = get_decision_tree_model(14, 10, 10)
dtr.model.fit(X_train, y_train)
predicted = dtr.model.predict(X_test)
print(mean_squared_error(y_test, predicted))

max depth 14 , minimum sample leaf - 10 minimum sample split 10
8691766.743108112


### 7. Random Forest Regressor

In [30]:
def get_random_forest_regressor_model(no_of_trees = 10, max_depth = None, min_samples_lf = 1,
                                              min_samples_splt = 2, verbose = True) :
    #no_of_trees, max_depth, min_samples_lf, min_samples_splt  = 15, None, 10, 10
    criteria = 'mse'
    rf = RandomForestRegressor(n_estimators=no_of_trees, criterion=criteria, max_depth=max_depth, min_samples_split=min_samples_splt,
                                 min_samples_leaf=min_samples_lf, min_weight_fraction_leaf=0.0, max_features=None,
                                 max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None,
                                 bootstrap=True, oob_score=False, n_jobs=None, random_state=0, verbose=0,
                                 warm_start=False)
    #temp =  get_random_forest_regressor_model(no_of_trees, max_depth, min_samples_lf, min_samples_splt)
    #temp.fit(X_train, y_train)
    if verbose :
        print('No of trees {0} max depth {1} , minimum sample leaf - {2} minimum sample split {3}'.format(str(no_of_trees),
                                                                                            str(max_depth), 
                                                                                          str(min_samples_lf),
                                                                                         str(min_samples_splt)))
    rf_reg = ModelObject('Random Forest Regressor')
    rf_reg.model = rf
    return rf_reg

In [31]:
rfr = get_random_forest_regressor_model()
rfr.model.fit(X_train, y_train)
predicted = rfr.model.predict(X_test)
print(mean_squared_error(y_test, predicted))

No of trees 10 max depth None , minimum sample leaf - 1 minimum sample split 2
9516116.01576707


In [32]:
# no_of_trees = 10
# max depth None , minimum sample leaf - 10 minimum sample split 10   8375037.457358595
# max depth 8 , minimum sample leaf - 10 minimum sample split 10     10564781.562420707
# max depth 10 , minimum sample leaf - 10 minimum sample split 10     9739102.832783798
# max depth 12 , minimum sample leaf - 10 minimum sample split 10     9236034.181042949
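
A comparison like the one recorded in the comments above can be automated with a small loop over `max_depth`. This is only a sketch on synthetic data: `sweep_max_depth`, the hold-out split and the toy target are assumptions, and the numbers it prints are unrelated to the notebook's results.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def sweep_max_depth(X, y, depths, n_trees=10):
    """Fit one forest per max_depth and return {max_depth: hold-out MSE}."""
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
    results = {}
    for depth in depths:
        model = RandomForestRegressor(n_estimators=n_trees, max_depth=depth,
                                      min_samples_leaf=10, min_samples_split=10,
                                      random_state=0)
        model.fit(X_tr, y_tr)
        results[depth] = mean_squared_error(y_val, model.predict(X_val))
    return results

rng = np.random.default_rng(0)
X = rng.integers(0, 20, size=(500, 4)).astype(float)
y = 100.0 * X[:, 0] + 10.0 * X[:, 1] + rng.normal(scale=5.0, size=500)

for depth, mse in sweep_max_depth(X, y, depths=[8, 10, 12, None]).items():
    print('max depth', depth, 'MSE', mse)
```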

### 8. ADA Boost Regressor

In [33]:
def get_ADABoost_regressor_model(n_est=1000, l_r=1.0) :
    _loss = 'linear' # one of {'linear', 'square', 'exponential'}; the default is 'linear'
    # base_estimator: if None, the base estimator is DecisionTreeRegressor(max_depth=3)
    ada =  AdaBoostRegressor(base_estimator=None, n_estimators=n_est, learning_rate=l_r, loss=_loss, random_state=0)
    ada_reg = ModelObject('ADA Boost Regressor')
    ada_reg.model = ada
    return ada_reg
ada = get_ADABoost_regressor_model()
ada.model.fit(X_train, y_train)
predicted = ada.model.predict(X_test)
print(mean_squared_error(y_test, predicted))

12216741.593958858


### 9. Gradient Boost Regressor

In [35]:
def get_gradient_boosting_regressor_model(n_est=100, max_dpt = 8, min_leaf = 5, l_r=0.1) :
    loss = 'ls' # 'ls' is least squares regression; options are {'ls', 'lad', 'huber', 'quantile'}
    gbr =  GradientBoostingRegressor(loss=loss, learning_rate=l_r, n_estimators=n_est, subsample=1.0, criterion='mse',
                                     min_samples_split=2,min_samples_leaf=min_leaf, min_weight_fraction_leaf=0.0,
                                     max_depth=max_dpt, min_impurity_decrease=0.0,min_impurity_split=None, init=None,
                                     random_state=0, max_features=None, alpha=0.9,verbose=0, max_leaf_nodes=None,
                                     warm_start=False, presort='auto', validation_fraction=0.1, n_iter_no_change=None,
                                     tol=0.0001)
    gbr_reg = ModelObject('Gradient Boost Regressor')
    gbr_reg.model = gbr
    return gbr_reg
gbr = get_gradient_boosting_regressor_model()
gbr.model.fit(X_train, y_train)
predicted = gbr.model.predict(X_test)
print(mean_squared_error(y_test, predicted))

8216126.166656249
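
One way to judge whether `n_est` is large enough is to track the validation error after every boosting stage with `staged_predict`. The sketch below does this on synthetic data; the data, the split and the smaller model settings are assumptions chosen to keep the example fast, not the notebook's configuration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=400)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

gbr = GradientBoostingRegressor(n_estimators=200, max_depth=3, learning_rate=0.1,
                                random_state=0)
gbr.fit(X_tr, y_tr)

# staged_predict yields the prediction after each boosting stage
val_mse = [mean_squared_error(y_val, pred) for pred in gbr.staged_predict(X_val)]
best_n_estimators = int(np.argmin(val_mse)) + 1   # stages are counted from 1
print(best_n_estimators, val_mse[best_n_estimators - 1])
```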


### 10. XGBoost

In [36]:
def get_XBGoosting_regressor_model(base_scr, num_rounds = 3000, max_dpth = 6, min_chld_wt=10) :
    param = {}
    param['booster'] = 'gbtree'
    param['verbosity'] = 0
    param['eta'] = 0.3
    param['gamma'] = 0
    param['max_depth'] = max_dpth
    param['min_child_weight'] = min_chld_wt
    param['colsample_bytree'] = 0.7 # subsample ratio of columns when constructing each tree
    param['lambda'] = 1
    param['alpha'] = 0
    param['scale_pos_weight'] = 0.8 # Control the balance of positive and negative weights, useful for unbalanced classes.
    #A typical value to consider: sum(negative instances) / sum(positive instances)
    param['objective'] = 'reg:squarederror'
    param['base_score'] = base_scr
    param['eval_metric'] = 'rmse'
    param['seed'] = 0
    def train_xgb(X_train, y_train) : # closure, so the model can be trained on any training data passed in later
        xgtrain = xgb.DMatrix(X_train, label=y_train)
        xgb_m = xgb.train(param, xgtrain, num_rounds)
        xgb_reg = ModelObject('XGBoost Regressor')
        xgb_reg.model = xgb_m
        return xgb_reg
    return train_xgb
xgbr = get_XBGoosting_regressor_model(y_train.mean())
predicted = xgbr(X_train, y_train).model.predict( xgb.DMatrix(X_test))
print(mean_squared_error(y_test, predicted))

8239876.402650073


### 11. Extra Tree Regressor

In [37]:
def get_extra_tree_regressor_model(no_of_trees = 10, max_depth = None, min_samples_lf = 1,
                                              min_samples_splt = 2, verbose = True) :
    #no_of_trees, max_depth, min_samples_lf, min_samples_splt  = 15, None, 10, 10
    criteria = 'mse'
    et = ExtraTreesRegressor(n_estimators=no_of_trees, criterion=criteria, max_depth=max_depth, min_samples_split=min_samples_splt,
                                 min_samples_leaf=min_samples_lf, min_weight_fraction_leaf=0.0, max_features=None,
                                 max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None,
                                 bootstrap=True, oob_score=False, n_jobs=None, random_state=0, verbose=0,
                                 warm_start=False)
    #temp.fit(X_train, y_train)
    if verbose :
        print('No of trees {0} max depth {1} , minimum sample leaf - {2} minimum sample split {3}'.format(str(no_of_trees),
                                                                                            str(max_depth), 
                                                                                          str(min_samples_lf),
                                                                                         str(min_samples_splt)))
    et_reg = ModelObject('Extra Tree Regressor')
    et_reg.model = et
    return et_reg

In [38]:
etr = get_extra_tree_regressor_model()
etr.model.fit(X_train, y_train)
predicted = etr.model.predict(X_test)
print(mean_squared_error(y_test, predicted))

No of trees 10 max depth None , minimum sample leaf - 1 minimum sample split 2
9492625.38208102


### Blending Model

Create the base models for blending. All of them are tree-based: decision trees, random forests, extra trees, AdaBoost,
and gradient boosting (XGBoost), each configured with different hyperparameter values.

In [39]:
model_1 = get_decision_tree_model( max_depth = 10, min_samples_lf = 10, min_samples_splt = 10, verbose = False)
model_2 = get_decision_tree_model( max_depth = 12, min_samples_lf = 10, min_samples_splt = 10, verbose = False)
model_3 = get_decision_tree_model( max_depth = 14, min_samples_lf = 10, min_samples_splt = 10, verbose = False)
model_4 = get_random_forest_regressor_model(no_of_trees = 500, max_depth = 10, min_samples_lf = 10,min_samples_splt = 10,
                                            verbose = False)
model_5 = get_random_forest_regressor_model(no_of_trees = 1000, max_depth = 12, min_samples_lf = 10,min_samples_splt = 10,
                                            verbose = False)
model_6 = get_random_forest_regressor_model(no_of_trees = 1500, max_depth = 14, min_samples_lf = 10,min_samples_splt = 10,
                                            verbose = False)
model_7 = get_extra_tree_regressor_model(no_of_trees = 500, max_depth = 10, min_samples_lf = 10,min_samples_splt = 10,
                                            verbose = False)
model_8 = get_extra_tree_regressor_model(no_of_trees = 1000, max_depth = 12, min_samples_lf = 10,min_samples_splt = 10,
                                            verbose = False)
model_9 = get_extra_tree_regressor_model(no_of_trees = 1500, max_depth = 14, min_samples_lf = 10,min_samples_splt = 10,
                                            verbose = False)
model_10 = get_ADABoost_regressor_model(n_est=1000, l_r=1.0)
# wrapped in a function so that the model is only built at fit time, with no dependency on a particular train/test split
model_11 = lambda base_scr : get_XBGoosting_regressor_model(base_scr, num_rounds = 3000, max_dpth = 6, min_chld_wt=10)
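The deferred-construction idea behind `model_11` can be sketched without XGBoost. In the sketch below, `make_baseline_trainer` and its `shrink` parameter are hypothetical stand-ins that do not appear in this notebook. The point is only that hyperparameters are fixed up front while the data arrives at fit time.

```python
import types

import numpy as np

def make_baseline_trainer(base_score, shrink=0.5):
    """Hypothetical stand-in for a model factory: fixes the settings now
    and returns a function that trains later on whatever data it gets."""
    def train(X, y):
        # toy "model": a constant between base_score and the training mean
        constant = base_score + shrink * (np.mean(y) - base_score)
        return lambda X_new: np.full(len(X_new), constant)
    return train

# Defer even the base score, as model_11 does: no data is touched yet.
deferred = lambda base_scr: make_baseline_trainer(base_scr, shrink=0.5)

X = np.arange(6).reshape(3, 2)
y = np.array([10.0, 20.0, 30.0])
predict = deferred(y.mean())(X, y)   # data enters only at fit time
print(predict(X))
print(isinstance(deferred, types.FunctionType))
```

A lambda is an instance of `types.FunctionType`, which is why an `isinstance` check can tell such a deferred model apart from an already constructed estimator wrapper.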

### Creating first-level predictions

In [89]:
class ModelsPrediction :
    def __init__(self, *models) :
        self.models = list(models)
        self.gboost_model = {}
    def fit_models(self, X_tr, y_tr, verbose = False) :
        print('start fitting {0} model one by one'.format(len(self.models)))
        counter = 1
        for m in self.models :
            print(counter)
            if isinstance(m, types.FunctionType) :
                base_scr = y_tr.mean()
                temp = m(base_scr)
                model = temp(X_tr, y_tr)
                self.gboost_model[self.models.index(m)] = model
                if verbose :
                    print('fitting {0} model completed:----'.format(model.name))
            else :
                model = m.model.fit(X_tr, y_tr)
                if verbose :
                    print('fitting {0} model completed:----'.format(m.name))
            counter += 1
    def predict(self, X_tst, verbose= False) :
        tp = ()
        counter = 1
        for m in self.models :
            print(counter)
            if isinstance(m, types.FunctionType) :
                temp = self.gboost_model[self.models.index(m)]
                predicted = temp.model.predict( xgb.DMatrix(X_tst))
                if verbose :
                    print('model Name - {0} prediction completed'.format(temp.name))
            else :
                predicted = m.model.predict(X_tst)
                if verbose :
                    print('model Name - {0} prediction completed'.format(m.name))
            counter += 1
            tp += (predicted, )
        return np.vstack(tp).T # shape (n_samples, n_models): one column of predictions per model
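The final stacking step in `predict` turns a tuple of 1-D prediction arrays into a matrix with one row per sample and one column per model. A small sketch with two made-up prediction vectors (the values are illustrative only):

```python
import numpy as np

pred_a = np.array([1.0, 2.0, 3.0])   # hypothetical predictions of model A
pred_b = np.array([1.5, 2.5, 3.5])   # hypothetical predictions of model B

stacked = np.vstack((pred_a, pred_b)).T   # rows: samples, columns: models
print(stacked.shape)

# np.column_stack builds the same matrix in one call
same = np.array_equal(stacked, np.column_stack((pred_a, pred_b)))
print(same)
```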

In [90]:
# create first level prediction
flp = ModelsPrediction(model_1, model_2, model_3, model_4, model_5, model_6, model_7, model_8, model_9, model_10, model_11)

Fit all the first-level models

In [50]:
flp.fit_models(X_train, y_train, True)

start fitting 11 model one by one
1
fitting Decision Tree model completed:----
2
fitting Decision Tree model completed:----
3
fitting Decision Tree model completed:----
4
fitting Random Forest Regressor model completed:----
5
fitting Random Forest Regressor model completed:----
6
fitting Random Forest Regressor model completed:----
7
fitting Extra Tree Regressor model completed:----
8
fitting Extra Tree Regressor model completed:----
9
fitting Extra Tree Regressor model completed:----
10
fitting ADA Boost Regressor model completed:----
11
fitting XGBoost Regressor model completed:----


Predict on the held-out split (X_test) to get the features for the second-level model

In [51]:
X_test_second_level = flp.predict(X_test, True)

1
model Name - Decision Tree prediction completed
2
model Name - Decision Tree prediction completed
3
model Name - Decision Tree prediction completed
4
model Name - Random Forest Regressor prediction completed
5
model Name - Random Forest Regressor prediction completed
6
model Name - Random Forest Regressor prediction completed
7
model Name - Extra Tree Regressor prediction completed
8
model Name - Extra Tree Regressor prediction completed
9
model Name - Extra Tree Regressor prediction completed
10
model Name - ADA Boost Regressor prediction completed
11
model Name - XGBoost Regressor prediction completed
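The blending flow used here (fit the base models on the training split, predict the held-out split, use those predictions as features) can be sketched on synthetic data. This is an illustration under assumptions: the data, the three small decision trees, and the `LinearRegression` meta-model are stand-ins chosen to keep the example fast, not the models used in this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(400, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=400)

# Split once: the base models only ever see the training part.
X_tr, X_hold, y_tr, y_hold = train_test_split(X, y, test_size=0.3, random_state=0)

base_models = [DecisionTreeRegressor(max_depth=d, random_state=0) for d in (2, 4, 6)]
for m in base_models:
    m.fit(X_tr, y_tr)

# Held-out predictions become the meta-features, one column per base model.
meta_features = np.column_stack([m.predict(X_hold) for m in base_models])

# Second level: a linear model, swapped in for XGBoost to keep the sketch small.
meta_model = LinearRegression().fit(meta_features, y_hold)
print(meta_features.shape, meta_model.coef_.shape)
```

Because the second-level model is trained on predictions for rows the base models never saw, it learns how to weight them without rewarding base models that merely memorized the training split.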


Create the second-level model (XGBoost)

In [55]:
def get_second_level_model(x_base, X_tr, y_tr, num_rounds = 1500, max_dpth = 8, min_chld_wt=10, verbose = False) :
    # x_base is kept for compatibility with the call below; the base score is recomputed from y_tr
    # random_state only applies when shuffle=True (newer scikit-learn versions raise an error otherwise)
    kf = KFold(n_splits= 2, shuffle=False)
    for train_index, test_index in kf.split(X_tr):
        X_train, X_test = X_tr[train_index], X_tr[test_index]
        y_train, y_test = y_tr[train_index], y_tr[test_index]
        xgbr = get_XBGoosting_regressor_model(y_train.mean(), num_rounds = num_rounds, max_dpth = max_dpth,
                                              min_chld_wt = min_chld_wt)
        predicted = xgbr(X_train, y_train).model.predict( xgb.DMatrix(X_test))
        mse = mean_squared_error(y_test, predicted)
        if verbose :
            print('fold MSE {0}'.format(mse))
    # fit the final second-level model on all the meta-features
    xgbr = get_XBGoosting_regressor_model(y_tr.mean(), num_rounds = num_rounds, max_dpth = max_dpth,
                                          min_chld_wt = min_chld_wt)
    model = xgbr(X_tr, y_tr)
    return model
sl = get_second_level_model(y_test.mean(), X_test_second_level, y_test, num_rounds = 1500, max_dpth = 8,
                            min_chld_wt=10, verbose = True)
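To make the cross-validation loop concrete: with `shuffle=False`, `KFold` cuts the rows into contiguous blocks, and each block serves once as the validation fold. The sketch below uses a tiny made-up dataset and a mean predictor as a stand-in for the XGBoost model, so only the splitting and per-fold scoring mirror the cell above.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(12).reshape(6, 2)     # six toy rows
y = np.arange(6, dtype=float)       # toy targets 0..5

kf = KFold(n_splits=2, shuffle=False)
fold_mse = []
for train_idx, valid_idx in kf.split(X):
    # toy "model": predict the mean target of the training fold
    pred = np.full(len(valid_idx), y[train_idx].mean())
    fold_mse.append(float(np.mean((y[valid_idx] - pred) ** 2)))
print(fold_mse)
```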

Create Final Prediction

In [91]:
first_level_prediction = flp.predict(datatb_test.values, True)
final_prediction = sl.model.predict(xgb.DMatrix(first_level_prediction))

1
model Name - Decision Tree prediction completed
2
model Name - Decision Tree prediction completed
3
model Name - Decision Tree prediction completed
4
model Name - Random Forest Regressor prediction completed
5
model Name - Random Forest Regressor prediction completed
6
model Name - Random Forest Regressor prediction completed
7
model Name - Extra Tree Regressor prediction completed
8
model Name - Extra Tree Regressor prediction completed
9
model Name - Extra Tree Regressor prediction completed
10
model Name - ADA Boost Regressor prediction completed
11
model Name - XGBoost Regressor prediction completed


Store the predictions in the test dataset

In [99]:
dataset_test['Purchase'] = np.rint(final_prediction).astype(np.int32)
dataset_test.head(5)

|   | User_ID | Product_ID | Gender | Age | Occupation | City_Category | Stay_In_Current_City_Years | Marital_Status | Product_Category_1 | Product_Category_2 | Product_Category_3 | Purchase |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000004 | P00128942 | M | 46-50 | 7 | B | 2 | 1 | 1 | 11.0 | 999.0 | 19659 |
| 1 | 1000009 | P00113442 | M | 26-35 | 17 | C | 0 | 0 | 3 | 5.0 | 999.0 | 10323 |
| 2 | 1000010 | P00288442 | F | 36-45 | 1 | B | 4+ | 1 | 5 | 14.0 | 999.0 | 8500 |
| 3 | 1000010 | P00145342 | F | 36-45 | 1 | B | 4+ | 1 | 4 | 9.0 | 999.0 | 2647 |
| 4 | 1000011 | P00053842 | F | 26-35 | 1 | C | 1 | 0 | 4 | 5.0 | 12.0 | 2787 |
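One detail of the rounding step: `np.rint` rounds exact halves to the nearest even integer, so it can differ from the "always round .5 up" rule. A short sketch with made-up prediction values (not taken from the model output):

```python
import numpy as np

raw = np.array([19658.6, 10323.2, 8500.5, 2647.5])   # hypothetical raw predictions
rounded = np.rint(raw).astype(np.int32)               # halves go to the even neighbour
print(rounded)
```

Here 8500.5 becomes 8500 while 2647.5 becomes 2648.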
