## Black Friday
### Problem Statement
A retail company, “ABC Private Limited”, wants to understand customer purchase behaviour (specifically, purchase amount) across products of different categories. They have shared a purchase summary of various customers for selected high-volume products from last month.
The data set also contains customer demographics (age, gender, marital status, city_type, stay_in_current_city), product details (product_id and product category) and the total purchase_amount from last month.
Now they want to build a model that predicts a customer's purchase amount for various products, which will help them create personalized offers for customers across different products.

Import all necessary packages

In [1]:
import numpy as np
import pandas as pd
from sklearn import model_selection
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import scale
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer

Importing dataset 

In [2]:
dataset_train = pd.read_csv(r'./Data/train.csv')
dataset_test = pd.read_csv(r'./Data/test.csv')

Check the size of each dataset and preview the first 5 rows of the training data

In [3]:
print(len(dataset_train))
print(len(dataset_test))
dataset_train.head()

550068
233599


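| | User_ID | Product_ID | Gender | Age | Occupation | City_Category | Stay_In_Current_City_Years | Marital_Status | Product_Category_1 | Product_Category_2 | Product_Category_3 | Purchase |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000001 | P00069042 | F | 0-17 | 10 | A | 2 | 0 | 3 | NaN | NaN | 8370 |
| 1 | 1000001 | P00248942 | F | 0-17 | 10 | A | 2 | 0 | 1 | 6.0 | 14.0 | 15200 |
| 2 | 1000001 | P00087842 | F | 0-17 | 10 | A | 2 | 0 | 12 | NaN | NaN | 1422 |
| 3 | 1000001 | P00085442 | F | 0-17 | 10 | A | 2 | 0 | 12 | 14.0 | NaN | 1057 |
| 4 | 1000002 | P00285442 | M | 55+ | 16 | C | 4+ | 0 | 8 | NaN | NaN | 7969 |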


Store the categorical column names in a dictionary that maps each column to the unique values it takes across the training and test sets

In [4]:
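columns = dataset_train.columns
dic_columnwise_acceped_value = {}
# columns[2:-1] skips User_ID, Product_ID and the Purchase label
for i in columns[2:-1] :
    temp1 = dataset_train[i].unique()
    temp2 = dataset_test[i].unique()
    try :
        # if both sets contain NaN, drop it from one so it appears only once after merging
        if np.isnan(temp1).any() and np.isnan(temp2).any() :
            temp1 = temp1[~np.isnan(temp1)]
    except TypeError :
        # np.isnan is undefined for string columns; nothing to drop there
        pass
    tem_dup = np.hstack([temp1, temp2])
    tem_dup = np.unique(tem_dup)
    dic_columnwise_acceped_value[i] = list(tem_dup)
    print(i,tem_dup)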

Gender ['F' 'M']
Age ['0-17' '18-25' '26-35' '36-45' '46-50' '51-55' '55+']
Occupation [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20]
City_Category ['A' 'B' 'C']
Stay_In_Current_City_Years ['0' '1' '2' '3' '4+']
Marital_Status [0 1]
Product_Category_1 [ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20]
Product_Category_2 [ 2.  3.  4.  5.  6.  7.  8.  9. 10. 11. 12. 13. 14. 15. 16. 17. 18. nan]
Product_Category_3 [ 3.  4.  5.  6.  8.  9. 10. 11. 12. 13. 14. 15. 16. 17. 18. nan]
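
The same per-column summary can also be sketched with pandas alone, which avoids the `try`/`except` around `np.isnan`. This is only an illustrative alternative on made-up toy frames (`train` and `test` here are hypothetical stand-ins, not the real datasets): it pools both sets, drops NaN before collecting the levels, and records separately whether anything was missing.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the two datasets (values are made up for illustration).
train = pd.DataFrame({"Product_Category_2": [np.nan, 6.0, 14.0]})
test = pd.DataFrame({"Product_Category_2": [2.0, np.nan, 6.0]})

col = "Product_Category_2"

# Pool the column from both sets so levels seen in only one set are kept.
combined = pd.concat([train[col], test[col]], ignore_index=True)

# Distinct non-missing levels, sorted; missingness is tracked on its own.
levels = sorted(combined.dropna().unique())
has_missing = bool(combined.isna().any())

print(col, levels, "contains NaN:", has_missing)
```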


Check and print whether any column contains null or NaN values

In [5]:
print(dataset_train.isna().sum())
print(dataset_test.isna().sum())

User_ID                            0
Product_ID                         0
Gender                             0
Age                                0
Occupation                         0
City_Category                      0
Stay_In_Current_City_Years         0
Marital_Status                     0
Product_Category_1                 0
Product_Category_2            173638
Product_Category_3            383247
Purchase                           0
dtype: int64
User_ID                            0
Product_ID                         0
Gender                             0
Age                                0
Occupation                         0
City_Category                      0
Stay_In_Current_City_Years         0
Marital_Status                     0
Product_Category_1                 0
Product_Category_2             72344
Product_Category_3            162562
dtype: int64


Here Product_Category_2 and Product_Category_3 contain null values.
Create a helper that uses SimpleImputer to replace NaN with 0 so that these columns can be one-hot encoded later

In [6]:
def replace_NaN(data, columns, *args) :    
    mp = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=0)
    transformed_val = mp.fit_transform(data.iloc[:,[columns.get_loc(i) for i in list(args)]].values)
    df = data.copy()
    df[list(args)] = transformed_val
    return df
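
To see what the constant-fill strategy does in isolation, here is a minimal sketch on a toy frame with made-up values (the frame `toy` and the variable names are assumptions for illustration only). It also shows that, for this simple case, `DataFrame.fillna` should give the same result.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with the same kind of gaps as the real data (values are made up).
toy = pd.DataFrame({
    "Product_Category_1": [3, 1, 12],
    "Product_Category_2": [np.nan, 6.0, 14.0],
    "Product_Category_3": [np.nan, 14.0, np.nan],
})
cols = ["Product_Category_2", "Product_Category_3"]

# strategy='constant' replaces every NaN with fill_value, here 0.
imputer = SimpleImputer(missing_values=np.nan, strategy="constant", fill_value=0)
filled = toy.copy()
filled[cols] = imputer.fit_transform(toy[cols])

# pandas-only alternative for comparison.
filled_pd = toy.fillna({c: 0 for c in cols})

print(filled)
```

Using 0 as the fill value works here because 0 is not an existing category code in either column, so it acts as a separate "missing" level once the column is dummy-coded.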

Replace NaN in the two columns Product_Category_2 and Product_Category_3

In [7]:
dataset_train = replace_NaN(dataset_train, dataset_train.columns, 'Product_Category_2', 'Product_Category_3')
dataset_test = replace_NaN(dataset_test, dataset_test.columns, 'Product_Category_2', 'Product_Category_3')

In [8]:
dataset_train.head(6)

| | User_ID | Product_ID | Gender | Age | Occupation | City_Category | Stay_In_Current_City_Years | Marital_Status | Product_Category_1 | Product_Category_2 | Product_Category_3 | Purchase |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000001 | P00069042 | F | 0-17 | 10 | A | 2 | 0 | 3 | 0.0 | 0.0 | 8370 |
| 1 | 1000001 | P00248942 | F | 0-17 | 10 | A | 2 | 0 | 1 | 6.0 | 14.0 | 15200 |
| 2 | 1000001 | P00087842 | F | 0-17 | 10 | A | 2 | 0 | 12 | 0.0 | 0.0 | 1422 |
| 3 | 1000001 | P00085442 | F | 0-17 | 10 | A | 2 | 0 | 12 | 14.0 | 0.0 | 1057 |
| 4 | 1000002 | P00285442 | M | 55+ | 16 | C | 4+ | 0 | 8 | 0.0 | 0.0 | 7969 |
| 5 | 1000003 | P00193542 | M | 26-35 | 15 | A | 3 | 0 | 1 | 2.0 | 0.0 | 15227 |


Now, for each categorical column with n levels, we create n-1 dummy variables, each containing either 0 or 1.
Create a function which takes the data, the list of column names, the name of the column to encode,
and the level whose dummy column should be dropped in order to avoid the dummy variable trap

In [9]:
def replace_column_with_Dummy_Columns(data, columns, column_name, remove_column_val) :
    temp = pd.get_dummies(data[column_name], prefix=column_name)
    col = list(temp.columns)
    removed_col = column_name+'_'+remove_column_val
    removed_col_index = col.index(removed_col)
    temp = temp[col[:removed_col_index] + col[removed_col_index+1:]] # removing to avoid dummy variable trap
    previous = data[columns[:columns.index(column_name)]]
    after = data[columns[columns.index(column_name)+1:]]
    previous = previous.join(temp)
    previous = previous.join(after)
    return previous
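
The dummy variable trap can be illustrated on a single toy column. The sketch below uses made-up `City_Category` values: with all three dummies every row sums to 1, so the columns are perfectly collinear with an intercept, and dropping one level (here the most frequent, as an assumed baseline choice) removes that redundancy.

```python
import pandas as pd

# Toy categorical column (values are made up for illustration).
city = pd.Series(["A", "B", "C", "B", "B"], name="City_Category")

# One dummy per level: each row has exactly one 1 across the three columns.
dummies = pd.get_dummies(city, prefix="City_Category")

# Drop the dummy of the most frequent level; it becomes the implicit baseline.
baseline = "City_Category_" + str(city.value_counts().idxmax())
reduced = dummies.drop(columns=baseline)

print(list(dummies.columns))
print("dropped:", baseline)
print(list(reduced.columns))
```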

For each categorical column, create the one-hot encoded columns with replace_column_with_Dummy_Columns, dropping the dummy for the most frequent level

In [10]:
def get_dummy_dataset(data, columns, cat_col_list) :
    col = cat_col_list
    df = data[list(filter(lambda x: x not in cat_col_list, columns))]
    for i in cat_col_list :
        dic = data[i].value_counts().to_dict()
        max_key = max(dic, key=dic.get) # most frequent level; dropping its dummy makes it the implicit baseline,
        # so it does not need to be handled separately
        data = replace_column_with_Dummy_Columns(data, col, i, str(max_key))
        col = list(data.columns)
    df= df.join(data)
    return df

Combine the training and test sets, dummy-code every categorical column, then split the result back into the two sets

In [11]:
def get_hotencoded_dataset(columns, dataset_train, dataset_test) :  
    column = list(columns)
    data_label = dataset_train[[column[-1]]]
    slice_index = len(dataset_train)
    df = dataset_train[column[:-1]] # separate label column
    df = pd.concat([df, dataset_test], ignore_index=True) # merge 2 dataframe
    df = get_dummy_dataset(df, column[:-1], column[2:-1])
    sliced_train_data = df.loc[:slice_index-1]
    sliced_test_data = df.loc[slice_index:]
    sliced_train_data = pd.concat([sliced_train_data, data_label], sort=False, axis=1)
    return sliced_train_data, sliced_test_data
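
Encoding the two sets together matters because a level that appears in only one of them would otherwise produce mismatched columns. The following sketch uses two tiny made-up frames (hypothetical `train` and `test`, not the real data) to show the mismatch and the concatenate, encode, split pattern that avoids it.

```python
import pandas as pd

# Tiny made-up frames: '18-25' appears only in test, '0-17' and '55+' only in train.
train = pd.DataFrame({"Age": ["0-17", "55+", "26-35"], "Purchase": [8370, 7969, 15227]})
test = pd.DataFrame({"Age": ["26-35", "18-25"]})

# Encoding each set on its own gives different column sets.
sep_train_cols = set(pd.get_dummies(train[["Age"]]).columns)
sep_test_cols = set(pd.get_dummies(test[["Age"]]).columns)
print(sep_train_cols ^ sep_test_cols)  # columns present in only one of the two

# Encoding the stacked frame guarantees identical columns, then split by row count.
n_train = len(train)
combined = pd.concat([train.drop(columns="Purchase"), test], ignore_index=True)
encoded = pd.get_dummies(combined)

enc_train = encoded.iloc[:n_train].copy()
enc_train["Purchase"] = train["Purchase"].values  # re-attach the label
enc_test = encoded.iloc[n_train:].reset_index(drop=True)
```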

Create one-hot encoded data for the training and test sets

In [12]:
data_train, data_test = get_hotencoded_dataset(columns, dataset_train, dataset_test)

Create a generic model object which stores all the information related to a model

In [13]:
class ModelObject :
    def __init__(self, name) :
        self._name = name
        self._cross_validation_score = None
        self._cross_validation_score_mean = None
        self._cross_validation_score_std = None
        self._model = None
    @property
    def name(self) :
        return self._name
    @property
    def cross_validation_score(self) :
        return self._cross_validation_score
    @cross_validation_score.setter
    def cross_validation_score(self, value) :
        try :
            self._cross_validation_score = value
            self._cross_validation_score_mean = value.mean()
            self._cross_validation_score_std = value.std()
        except Exception as e:
            raise Exception('value object is not in format',e)
    @property
    def cross_validation_mean(self) :
        return self._cross_validation_score_mean
    @property
    def cross_validation_std(self) :
        return self._cross_validation_score_std
    @property
    def model(self) :
        return self._model
    @model.setter
    def model(self, value) :
        self._model = value
        self._cross_validation_score = None
        self._cross_validation_score_mean = None
        self._cross_validation_score_std = None
    def __str__(self) :
        res = '\n'
        res += 'Model Name :- ' + self._name + '\n'
        res += 'cross validation score :- ' + str(self._cross_validation_score) + '\n'
        res += 'cross validation mean :- ' + str(self._cross_validation_score_mean) + '\n'
        res += 'cross validation standard deviation :- ' + str(self._cross_validation_score_std) + '\n'
        if self._model is not None :
            res += 'Model Parameter ' + str(self._model.get_params()) + '\n'
        else :
            res += '\n'
        return res

Since data_test doesn't contain the target variable, I split data_train into X_train, y_train, X_test and y_test.

Create a function which eliminates features from the end of the column list and returns the train/test split

In [14]:
def get_feature_list(data) :
    return list(data.columns)
def get_dataSet_with_eliminated_features(data, label_col, eliminated_features_count=0, ahead_start=2) :
    f_count = len(get_feature_list(data))
    if label_col < 0 or label_col >= f_count:
        return None
    if (eliminated_features_count  < 0) or (eliminated_features_count > (f_count - ahead_start - 1)) :
        return None
    col_index_list = list(range(ahead_start, f_count))
    del col_index_list[col_index_list.index(label_col)]
    col_index_list = col_index_list if eliminated_features_count == 0 else col_index_list[:-eliminated_features_count]
    #print(col_index_list)
    #print(label_col)
    X = data.iloc[:,col_index_list].values
    y = data.iloc[:, label_col].values
    X_train, X_test, y_train , y_test = model_selection.train_test_split(X, y, test_size = 0.2, random_state=0)
    return (X_train, X_test, y_train , y_test)
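
The column selection inside this function can be hard to follow, so here is a stripped-down sketch of just the index arithmetic. The helper name `feature_indices` is hypothetical and introduced only for illustration. Note that this scheme simply drops trailing columns; it does not rank features by significance.

```python
def feature_indices(n_columns, label_col, eliminated=0, ahead_start=2):
    """Column positions used as features after dropping `eliminated` trailing ones."""
    # Skip the first `ahead_start` identifier columns and the label column.
    idx = [i for i in range(ahead_start, n_columns) if i != label_col]
    # Slicing with [:-0] would return an empty list, hence the explicit check.
    return idx if eliminated == 0 else idx[:-eliminated]

# 8 columns in total: 0-1 are identifiers, 7 is the label.
print(feature_indices(8, 7))      # [2, 3, 4, 5, 6]
print(feature_indices(8, 7, 2))   # [2, 3, 4]
```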

For feature selection we will use backward elimination. Since most categorical variables (all except Gender and Marital_Status)
have more than 2 levels, we can't apply biserial correlation, which requires a binary variable.

## Build Models
### 1. Linear Regression

In [15]:
def get_optimal_modellr(X_train, y_train) :
    no_of_split = 10
    scoring = 'neg_mean_squared_error'
    lr = ModelObject('Linear Regression')
    # shuffle defaults to False, so the folds are deterministic and no random_state is needed
    # (recent scikit-learn versions raise an error if random_state is set without shuffle=True)
    kfold = model_selection.KFold(n_splits=no_of_split)
    # normalize=False was the default; the argument has since been removed from scikit-learn
    lr.model = LinearRegression(fit_intercept=True, copy_X=True, n_jobs=None)
    # The unified scoring API always maximizes the score, so scores which need to be minimized are negated
    # in order for the unified scoring API to work correctly. Negating the result here turns it back
    # into a positive mean squared error.
    cv_results = -model_selection.cross_val_score(lr.model, X_train, y_train, cv=kfold, scoring=scoring)
    cv_results.sort(axis=-1, kind='mergesort', order=None)
    lr.cross_validation_score = cv_results
    return lr
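
The sign convention of `neg_mean_squared_error` is easy to get wrong, so the following self-contained sketch shows it on synthetic data (generated with `make_regression`; the data and fold settings are assumptions, not the notebook's). It also shows how an RMSE could be derived from the per-fold MSE if that metric is preferred.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression problem, used only to demonstrate the scoring convention.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

kfold = KFold(n_splits=5, shuffle=True, random_state=7)
raw = cross_val_score(LinearRegression(), X, y, cv=kfold,
                      scoring="neg_mean_squared_error")

mse = -raw            # flip the sign to get an ordinary (positive) MSE per fold
rmse = np.sqrt(mse)   # RMSE is in the same units as the target

print("raw scores are all <= 0:", bool((raw <= 0).all()))
print("mean MSE :", mse.mean())
print("mean RMSE:", rmse.mean())
```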

Fit the linear model without any regularization and check its mean squared error (MSE) as the performance metric

In [16]:
col_len = len(get_feature_list(data_train))
for i in range(85) :
    data = get_dataSet_with_eliminated_features(data_train, col_len - 1, i)
    if data is not None :
        X_train, X_test, y_train , y_test = data[0], data[1], data[2], data[3]
        lr = get_optimal_modellr(X_train, y_train)
        print('cr mean -',lr.cross_validation_mean)
        lr.model.fit(X_train, y_train)
        predicted = lr.model.predict(X_test)
        print('mean square error -', mean_squared_error(y_test, predicted))

cr mean - 8889010.0525569
mean square error - 8884286.643687665
cr mean - 8889119.329643253
mean square error - 8884367.552119851
cr mean - 8914314.381786192
mean square error - 8913808.300128395
cr mean - 8914245.64527319
mean square error - 8913802.14324399
cr mean - 8922265.748940362
mean square error - 8922228.094122896
cr mean - 8922520.556201119
mean square error - 8922662.143345544
cr mean - 8926868.407182265
mean square error - 8926438.81577517
cr mean - 8928969.93978287
mean square error - 8928363.68107792
cr mean - 8929490.54441819
mean square error - 8929115.627608754
cr mean - 8930888.804993855
mean square error - 8930261.347693019
cr mean - 8933608.409974288
mean square error - 8933719.199544003
cr mean - 8982816.887958018
mean square error - 8995016.242183307
cr mean - 8982939.051252978
mean square error - 8994636.0060196
cr mean - 8987309.134024603
mean square error - 8996845.336198498
cr mean - 9011593.702206098
mean square error - 9017456.519448606
cr mean - 9011786.98

KeyboardInterrupt: 

### 2. Ridge Regression

In [None]:
def get_optimal_model_ridge(X_train, y_train, lambda_val) :
    no_of_split = 10
    scoring = 'neg_mean_squared_error'
    ridge_reg = ModelObject('Ridge Regression')
    # shuffle defaults to False, so the folds are deterministic and no random_state is needed
    kfold = model_selection.KFold(n_splits=no_of_split)
    # lambda_val is the regularization strength (called alpha in scikit-learn)
    ridge_reg.model = Ridge(alpha=lambda_val, fit_intercept=True, copy_X=True,
                            max_iter=None, tol=0.001, solver='auto', random_state=0)
    # The unified scoring API always maximizes the score, so scores which need to be minimized are negated.
    # Negating the result here turns it back into a positive mean squared error.
    cv_results = -model_selection.cross_val_score(ridge_reg.model, X_train, y_train, cv=kfold, scoring=scoring)
    cv_results.sort(axis=-1, kind='mergesort', order=None)
    ridge_reg.cross_validation_score = cv_results
    return ridge_reg
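
The notebook does not show how `lambda_val` would be chosen. One possible approach, sketched below on synthetic data, is to cross-validate a small grid of candidate values and keep the one with the lowest mean MSE. The grid, the data, and the fold settings are all assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data standing in for the encoded training set.
X, y = make_regression(n_samples=200, n_features=10, noise=15.0, random_state=0)
kfold = KFold(n_splits=5, shuffle=True, random_state=7)

# Hypothetical grid of regularization strengths to compare.
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
mean_mse = {}
for alpha in alphas:
    scores = -cross_val_score(Ridge(alpha=alpha), X, y, cv=kfold,
                              scoring="neg_mean_squared_error")
    mean_mse[alpha] = scores.mean()

# Lower mean MSE is better.
best_alpha = min(mean_mse, key=mean_mse.get)
for alpha in alphas:
    print(alpha, mean_mse[alpha])
print("best alpha:", best_alpha)
```

scikit-learn's `RidgeCV` and `GridSearchCV` wrap the same idea; the explicit loop is used here only to keep each step visible.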

In [107]:
df1 = pd.DataFrame({'A':[1,2,3,4]})
df2 = pd.DataFrame({'B':[5,6,7,8], 'C': [6,4,3,2]})
dd = pd.concat([df1, df2], sort=False, axis=1)
dd

| | A | B | C |
|---|---|---|---|
| 0 | 1 | 5 | 6 |
| 1 | 2 | 6 | 4 |
| 2 | 3 | 7 | 3 |
| 3 | 4 | 8 | 2 |


In [108]:
df1 = pd.DataFrame({'A':[1,2,3,4]})
df2 = pd.DataFrame({'A':[5,6,7,8]})
dd = pd.concat([df1, df2], sort=False, ignore_index=True, axis=0)
dd

| | A |
|---|---|
| 0 | 1 |
| 1 | 2 |
| 2 | 3 |
| 3 | 4 |
| 4 | 5 |
| 5 | 6 |
| 6 | 7 |
| 7 | 8 |
