# Black Friday sales prediction 2023

This notebook continues the work from the previous notebook and contains the second part of the study project on the Black Friday sales data.

**Goal:** predict the purchase amount on Black Friday from several customer demographics and product details.

### The project pipeline
 1. ✅look at the big picture
 2. ✅get the data
 3. ✅discover the dataset
 4. ✅visualise & analyse the data to get insights
 5. ⌛preprocess the data for ML algorithms
 6. ⌛model selection
 7. ⌛fine tune the model
 8. ⌛deploy the model


### Import the dataset


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

black_df = pd.read_csv("datasets/Black_data.csv")

In the previous notebook we explored some statistics of the dataset, so here we start directly with data preprocessing.

In [2]:
black_df.head()

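| | User_ID | Product_ID | Gender | Age | Occupation | City_Category | Stay_In_Current_City_Years | Marital_Status | Product_Category_1 | Product_Category_2 | Product_Category_3 | Purchase |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000001 | P00069042 | F | 0-17 | 10 | A | 2 | 0 | 3 | NaN | NaN | 8370 |
| 1 | 1000001 | P00248942 | F | 0-17 | 10 | A | 2 | 0 | 1 | 6.0 | 14.0 | 15200 |
| 2 | 1000001 | P00087842 | F | 0-17 | 10 | A | 2 | 0 | 12 | NaN | NaN | 1422 |
| 3 | 1000001 | P00085442 | F | 0-17 | 10 | A | 2 | 0 | 12 | 14.0 | NaN | 1057 |
| 4 | 1000002 | P00285442 | M | 55+ | 16 | C | 4+ | 0 | 8 | NaN | NaN | 7969 |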


## Data preprocessing

We will:
- clean the data
- convert the categorical columns into numerical ones
- drop features that are not useful
- split the data into train and test sets

### Data cleaning

First, identifier features such as "User_ID" and "Product_ID" are irrelevant to our regression problem, so we drop them.

In [3]:
black_df.drop(["User_ID", "Product_ID"], axis=1, inplace=True)

In [4]:
black_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 550068 entries, 0 to 550067
Data columns (total 10 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   Gender                      550068 non-null  object 
 1   Age                         550068 non-null  object 
 2   Occupation                  550068 non-null  int64  
 3   City_Category               550068 non-null  object 
 4   Stay_In_Current_City_Years  550068 non-null  object 
 5   Marital_Status              550068 non-null  int64  
 6   Product_Category_1          550068 non-null  int64  
 7   Product_Category_2          376430 non-null  float64
 8   Product_Category_3          166821 non-null  float64
 9   Purchase                    550068 non-null  int64  
dtypes: float64(2), int64(4), object(4)
memory usage: 42.0+ MB


Fortunately, most of the columns don't have any missing values, but the two columns "Product_Category_2" and "Product_Category_3" have many.

In [5]:
# calculate the % of missing values in the two columns
data = {
    "attribute": ["Product_Category_2", "Product_Category_3"],
    "% of missing values": [round(black_df["Product_Category_2"].isnull().sum() / black_df.shape[0] * 100, 2),
                            round(black_df["Product_Category_3"].isnull().sum() / black_df.shape[0] * 100, 2)]
}
missing_values = pd.DataFrame(data=data)
missing_values

| | attribute | % of missing values |
|---|---|---|
| 0 | Product_Category_2 | 31.57 |
| 1 | Product_Category_3 | 69.67 |
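As a side note, the same percentages can be computed for every column at once. The sketch below is only an illustration on a made-up toy frame (`toy_df` and its values are assumptions, not the real dataset); it relies on the fact that the mean of a boolean column is the fraction of `True` values.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for black_df (values are made up for illustration)
toy_df = pd.DataFrame({
    "Product_Category_1": [3, 1, 12, 12],
    "Product_Category_2": [np.nan, 6.0, np.nan, 14.0],
    "Product_Category_3": [np.nan, 14.0, np.nan, np.nan],
})

# isnull() gives a boolean frame; the column mean is the fraction of missing values
missing_pct = (toy_df.isnull().mean() * 100).round(2)

# keep only the columns that actually contain missing values
missing_pct = missing_pct[missing_pct > 0]
print(missing_pct)
```

On the real data, replacing `toy_df` with `black_df` should give the same two percentages as the table above.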


Here is the strategy:
- we impute the column "Product_Category_2" with the median value of the entire column
- we drop the entire column "Product_Category_3" because it has a large percentage of missing values (about 70%)

In [6]:
black_df.drop("Product_Category_3", axis=1, inplace=True)

In [7]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="median")
product_Category_2_imputed = imputer.fit_transform(black_df[["Product_Category_2"]])
black_df["Product_Category_2"] = product_Category_2_imputed

### Removing outliers

In [8]:
# Calculate the interquartile range (IQR)
q1 = black_df['Purchase'].quantile(0.25)
q3 = black_df['Purchase'].quantile(0.75)
iqr = q3 - q1

# Set the lower and upper bounds for outliers
low = q1 - 1.5 * iqr
high = q3 + 1.5 * iqr

# Remove outliers from the DataFrame
black_df = black_df.loc[(black_df['Purchase'] > low) & (black_df['Purchase'] < high)]

# Reset the index of the cleaned DataFrame
black_df.reset_index(drop=True, inplace=True)
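To make the IQR rule easier to follow, here is a minimal sketch on a tiny made-up series. The helper name `iqr_bounds` and the toy values are assumptions for illustration only; the logic is the same 1.5 × IQR rule used in the cell above.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the (low, high) fences q1 - k*IQR and q3 + k*IQR for a numeric Series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up purchase amounts with one obvious extreme value
toy_purchase = pd.Series([10, 12, 14, 16, 18, 20, 22, 24, 200])
low, high = iqr_bounds(toy_purchase)

# Same strict comparisons as in the cell above
kept = toy_purchase[(toy_purchase > low) & (toy_purchase < high)]
print(low, high, len(toy_purchase) - len(kept))
```

Here q1 = 14 and q3 = 22, so the fences are 2 and 34 and only the value 200 is dropped.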

In [9]:
black_df

Unnamed: 0,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Purchase
0,F,0-17,10,A,2,0,3,9.0,8370
1,F,0-17,10,A,2,0,1,6.0,15200
2,F,0-17,10,A,2,0,12,9.0,1422
3,F,0-17,10,A,2,0,12,14.0,1057
4,M,55+,16,C,4+,0,8,9.0,7969
...,...,...,...,...,...,...,...,...,...
547386,M,51-55,13,B,1,1,20,9.0,368
547387,F,26-35,1,C,3,0,20,9.0,371
547388,F,26-35,15,B,4+,1,20,9.0,137
547389,F,55+,1,C,2,0,20,9.0,365


### Handling Text and Categorical Attributes

In [10]:
for feature in black_df:
    if len(black_df[feature].unique()) < 25:
        print(feature, black_df[feature].unique())

Gender ['F' 'M']
Age ['0-17' '55+' '26-35' '46-50' '51-55' '36-45' '18-25']
Occupation [10 16 15  7 20  9  1 12 17  0  3  4 11  8 19  2 18  5 14 13  6]
City_Category ['A' 'C' 'B']
Stay_In_Current_City_Years ['2' '4+' '3' '1' '0']
Marital_Status [0 1]
Product_Category_1 [ 3  1 12  8  5  4  2  6 14 11 13 15  7 16 18 10 17  9 20 19]
Product_Category_2 [ 9.  6. 14.  2.  8. 15. 16. 11.  5.  3.  4. 12. 10. 17. 13.  7. 18.]


It seems that almost all the attributes are categorical, so we need to encode them into numerical values for machine learning algorithms.

Before encoding them, we note that there are two types of categorical attributes:
- the first is ordered categories, such as "Age" and "Stay_In_Current_City_Years", whose values can be sorted.
- the second is unordered categories, whose values have no meaningful order.

We need to handle the two types separately: if unordered categories were encoded as consecutive integers, the machine learning algorithm would assume that two nearby encoded values are similar, when they are actually unrelated.

In [11]:
ordered_attr = ["Age", "Stay_In_Current_City_Years"]
unordered_attr = ['Gender', 'Occupation', 'City_Category', 'Marital_Status', 'Product_Category_1', 'Product_Category_2']

I will use the ordinal encoder to encode the ordered categories

In [12]:
from sklearn.preprocessing import OrdinalEncoder

# function to encode the ordered attributes in the DataFrame
def encodeOrderedAttributes(df, ordered_attr):
    # encode the attributes; by default the categories of each column are
    # sorted, which matches the natural order of these two columns
    ord_encoder = OrdinalEncoder()
    cat_encoded = ord_encoder.fit_transform(df[ordered_attr])

    # return the new dataframe with encoded attributes, keeping the original index
    return pd.DataFrame(cat_encoded, columns=ordered_attr, index=df.index)

# example
encodeOrderedAttributes(black_df, ordered_attr)

Unnamed: 0,Age,Stay_In_Current_City_Years
0,0.0,2.0
1,0.0,2.0
2,0.0,2.0
3,0.0,2.0
4,6.0,4.0
...,...,...
547386,5.0,1.0
547387,2.0,3.0
547388,2.0,4.0
547389,6.0,2.0
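The default `OrdinalEncoder` sorts the category strings, which happens to give the right order for these two columns ("55+" sorts after "51-55", and "4+" after "3"). If one prefers not to rely on that, the order can be passed explicitly through the `categories` parameter. The sketch below is an illustration on a made-up toy frame; the two order lists are written by hand from the unique values printed earlier.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Orders written out by hand from the unique values printed earlier
age_order = ["0-17", "18-25", "26-35", "36-45", "46-50", "51-55", "55+"]
stay_order = ["0", "1", "2", "3", "4+"]

# Toy rows standing in for black_df (made up for illustration)
toy_df = pd.DataFrame({
    "Age": ["0-17", "55+", "26-35", "51-55"],
    "Stay_In_Current_City_Years": ["2", "4+", "3", "1"],
})

# categories takes one list per column, in the same order as the columns
explicit_encoder = OrdinalEncoder(categories=[age_order, stay_order])
encoded = explicit_encoder.fit_transform(toy_df)
print(encoded)
```

With explicit lists, the encoded value of each category no longer depends on which categories happen to be present in the data being fitted.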


I will use OneHotEncoder to encode the unordered categories

In [13]:
from sklearn.preprocessing import OneHotEncoder


# function to encode the unordered attributes in the DataFrame
def encodeUnorderedAttributes(df, unordered_attr):
    # encode the attributes (the encoder returns a sparse matrix, so convert it to a dense array)
    hot_encoder = OneHotEncoder()
    cat_encoded = hot_encoder.fit_transform(df[unordered_attr])
    cat_encoded = cat_encoded.toarray()

    # build one column name per (attribute, category) pair, e.g. "Gender(F)"
    columns = []
    for attribute, category in zip(unordered_attr, hot_encoder.categories_):
        attributeCategories = [f'{attribute}({cat})' for cat in category]
        columns.extend(attributeCategories)

    # return the new dataframe with encoded attributes, keeping the original index
    return pd.DataFrame(data=cat_encoded, columns=columns, index=df.index)


# example
encodeUnorderedAttributes(black_df, unordered_attr)

Unnamed: 0,Gender(F),Gender(M),Occupation(0),Occupation(1),Occupation(2),Occupation(3),Occupation(4),Occupation(5),Occupation(6),Occupation(7),...,Product_Category_2(9.0),Product_Category_2(10.0),Product_Category_2(11.0),Product_Category_2(12.0),Product_Category_2(13.0),Product_Category_2(14.0),Product_Category_2(15.0),Product_Category_2(16.0),Product_Category_2(17.0),Product_Category_2(18.0)
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
4,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
547386,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
547387,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
547388,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
547389,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


This function encodes the whole DataFrame by combining the two previous functions.

In [14]:
def encode(df, ordered_attr, unordered_attr):
    ordered_cat = encodeOrderedAttributes(df, ordered_attr)
    unordered_cat = encodeUnorderedAttributes(df, unordered_attr)
    return pd.concat([ordered_cat, unordered_cat], axis=1)

# example
encode(black_df, ordered_attr, unordered_attr)

Unnamed: 0,Age,Stay_In_Current_City_Years,Gender(F),Gender(M),Occupation(0),Occupation(1),Occupation(2),Occupation(3),Occupation(4),Occupation(5),...,Product_Category_2(9.0),Product_Category_2(10.0),Product_Category_2(11.0),Product_Category_2(12.0),Product_Category_2(13.0),Product_Category_2(14.0),Product_Category_2(15.0),Product_Category_2(16.0),Product_Category_2(17.0),Product_Category_2(18.0)
0,0.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
4,6.0,4.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
547386,5.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
547387,2.0,3.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
547388,2.0,4.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
547389,6.0,2.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
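For comparison, scikit-learn's `ColumnTransformer` can apply both encoders in a single object, which can then be fitted on the training data and reused on new data. This is only a sketch of an alternative, not what the notebook does; the toy frame and the names `toy_ordered`, `toy_unordered` and `preprocessor` are assumptions for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Toy rows standing in for black_df (made up for illustration)
toy_df = pd.DataFrame({
    "Age": ["0-17", "55+", "26-35"],
    "Gender": ["F", "M", "F"],
    "City_Category": ["A", "C", "B"],
})

toy_ordered = ["Age"]
toy_unordered = ["Gender", "City_Category"]

preprocessor = ColumnTransformer(
    transformers=[
        ("ordinal", OrdinalEncoder(), toy_ordered),
        ("onehot", OneHotEncoder(handle_unknown="ignore"), toy_unordered),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Output columns: Age, Gender(F), Gender(M), City(A), City(B), City(C)
encoded = preprocessor.fit_transform(toy_df)
print(encoded.shape)
```

The design difference is that `preprocessor.transform(...)` can later encode unseen rows with the categories learned during `fit`, whereas the helper functions above refit the encoders on every call.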


Now we can encode the whole DataFrame so that it is ready for the machine learning algorithms.

In [15]:
df = encode(black_df, ordered_attr, unordered_attr)
df.head()

Unnamed: 0,Age,Stay_In_Current_City_Years,Gender(F),Gender(M),Occupation(0),Occupation(1),Occupation(2),Occupation(3),Occupation(4),Occupation(5),...,Product_Category_2(9.0),Product_Category_2(10.0),Product_Category_2(11.0),Product_Category_2(12.0),Product_Category_2(13.0),Product_Category_2(14.0),Product_Category_2(15.0),Product_Category_2(16.0),Product_Category_2(17.0),Product_Category_2(18.0)
0,0.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
4,6.0,4.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Feature scaling
Our data is already on a small scale: the one-hot columns are 0 or 1, and the two ordinal columns range from 0 to 6 ("Age") and from 0 to 4 ("Stay_In_Current_City_Years"). So there is no need to scale it.

### Split the data into train and test sets

In [16]:
from sklearn.model_selection import train_test_split

X = df
y = black_df["Purchase"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

In [17]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((492651, 67), (54740, 67), (492651,), (54740,))

## Model selection

In [18]:
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(X_train, y_train)

In [19]:
from sklearn.metrics import mean_squared_error

y_pred = lr.predict(X_test)
MSE = mean_squared_error(y_test, y_pred)
RMSE = np.sqrt(MSE)

print(RMSE)

2988.203684740162
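The RMSE used throughout this section is the square root of the mean of the squared errors, so it is expressed in the same unit as "Purchase". A minimal sketch on made-up numbers (the arrays `y_true` and `y_hat` are assumptions for illustration) shows that the manual formula and the scikit-learn route agree:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up targets and predictions
y_true = np.array([100.0, 200.0, 300.0])
y_hat = np.array([110.0, 190.0, 330.0])

# errors are -10, 10, -30; squared errors are 100, 100, 900; their mean is 1100 / 3
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))

print(rmse_manual, rmse_sklearn)
```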


In [20]:
from sklearn.linear_model import Ridge, Lasso

ridge = Ridge()
lasso = Lasso()

In [21]:
ridge = Ridge()
ridge.fit(X_train, y_train)
y_pred = ridge.predict(X_test)
MSE = mean_squared_error(y_test, y_pred)
RMSE = np.sqrt(MSE)
print(RMSE)

2988.2100628153253


In [38]:
lasso = Lasso()
lasso.fit(X_train, y_train)
y_pred = lasso.predict(X_test)
MSE = mean_squared_error(y_test, y_pred)
RMSE = np.sqrt(MSE)

print(RMSE)

2989.896005879615




In [39]:
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor()
# note: trained on the first 100,000 rows only (presumably to limit training time)
rf.fit(X_train[:100000], y_train[:100000])
y_pred = rf.predict(X_test)
MSE = mean_squared_error(y_test, y_pred)
RMSE = np.sqrt(MSE)

print(RMSE)

3143.515937963064


In [89]:
from sklearn.ensemble import GradientBoostingRegressor

gbr = GradientBoostingRegressor(n_estimators=100)
gbr.fit(X_train, y_train)
y_pred = gbr.predict(X_test)
MSE = mean_squared_error(y_test, y_pred)
RMSE = np.sqrt(MSE)

print(RMSE)

3014.149377371139


In [23]:
from xgboost import XGBRegressor

xgb = XGBRegressor(booster='gbtree' , n_estimators=300, learning_rate=0.45, reg_lambda=1, reg_alpha=0.05)
xgb.fit(X_train, y_train)
y_pred = xgb.predict(X_test)
MSE = mean_squared_error(y_test, y_pred)
RMSE = np.sqrt(MSE)

print(RMSE)

2889.8260120212276


As seen above, most of the models reach an RMSE of roughly 2990, but it is clear that the XGBRegressor model gives the best RMSE (about 2890). We therefore choose it and fine-tune it to try to get an even better score.
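The fit / predict / RMSE steps repeated in the cells above could also be written once as a loop. The sketch below is an illustration under assumptions: the helper name `compare_models` is hypothetical, and it runs on synthetic data from `make_regression` rather than on the Black Friday data, so its numbers are not comparable to the scores above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def compare_models(models, X_tr, X_te, y_tr, y_te):
    """Fit each model on the train split and return {name: test RMSE}."""
    scores = {}
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        preds = model.predict(X_te)
        scores[name] = float(np.sqrt(mean_squared_error(y_te, preds)))
    return scores

# Synthetic data so the sketch runs on its own
X_toy, y_toy = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.1, random_state=42)

scores = compare_models(
    {"linear": LinearRegression(), "ridge": Ridge()}, X_tr, X_te, y_tr, y_te
)

# Print the models from best (lowest RMSE) to worst
for name, rmse in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name}: {rmse:.2f}")
```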

### Fine tune the model

In this step we try to get the most out of the XGBoost model by finding the best combination of hyperparameters.
For that, I use GridSearchCV with several candidate values for each hyperparameter.

The hyperparameter **n_estimators** specifies how many estimators the ensemble trains, so the larger n_estimators is, the more computationally expensive the training becomes. For that reason, I search for the best value of this parameter separately from the other hyperparameters (just to keep the fitting time of the grid search down).

In [27]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [200, 300, 400, 500],    
}

grid = GridSearchCV(estimator=XGBRegressor(), param_grid=param_grid, scoring='neg_root_mean_squared_error', cv=2)
grid.fit(X_train, y_train)

In [28]:
print(f"best parameter: {grid.best_params_}, best score: {-grid.best_score_}")

best parameter: {'n_estimators': 300}, best score: 2905.1666602119585
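A cheaper way to look at the effect of the number of estimators is to fit once with the largest value and score the intermediate ensembles on a validation set. The sketch below is an assumption-laden illustration: it swaps in scikit-learn's `GradientBoostingRegressor` (whose `staged_predict` method yields the predictions after each added tree) in place of `XGBRegressor`, and it uses synthetic data, so it only demonstrates the idea.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data so the sketch runs on its own
X_toy, y_toy = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# Fit a single model with the largest number of trees we want to consider
gbr_toy = GradientBoostingRegressor(n_estimators=100, random_state=42)
gbr_toy.fit(X_tr, y_tr)

# staged_predict yields the predictions after 1, 2, ..., n_estimators trees
val_rmse = [
    np.sqrt(mean_squared_error(y_val, staged_pred))
    for staged_pred in gbr_toy.staged_predict(X_val)
]

# Index 0 corresponds to one tree, hence the + 1
best_n = int(np.argmin(val_rmse)) + 1
print(best_n, val_rmse[best_n - 1])
```

Unlike the grid search above, this uses a single validation split instead of cross-validation, so the selected value is a rougher estimate.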


Now we use GridSearchCV over combinations of the following hyperparameters, fixing `n_estimators` at the best value found, 300:
- **`booster`**: the type of booster XGBoost uses (here the tree-based `gbtree`)
- **`learning_rate`**: the learning rate eta
- **`reg_lambda`**: the L2 regularisation term lambda
- **`reg_alpha`**: the L1 regularisation term alpha
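As a reminder of where the two regularisation terms enter, the complexity penalty that XGBoost adds to the training loss for a single tree $f$ can be written as follows. This is a sketch based on the XGBoost documentation; $T$ (number of leaves), $w_j$ (leaf weights) and $\gamma$ (the `gamma` parameter, which is not tuned in this notebook) are notation introduced here.

$$
\Omega(f) = \gamma T + \frac{1}{2}\,\lambda \sum_{j=1}^{T} w_j^{2} + \alpha \sum_{j=1}^{T} \lvert w_j \rvert
$$

So `reg_lambda` ($\lambda$) scales the squared (L2) penalty on the leaf weights, and `reg_alpha` ($\alpha$) scales the absolute-value (L1) penalty.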

In [40]:
from sklearn.model_selection import GridSearchCV

param_grid = {'booster': ['gbtree'], 'n_estimators': [300], 'learning_rate': [0.43, 0.45], 'reg_lambda': [0.9, 1], 'reg_alpha':[0.05, 0.1]}

grid = GridSearchCV(estimator=XGBRegressor(), param_grid=param_grid, scoring='neg_root_mean_squared_error', cv=None)
grid.fit(X_train, y_train)


In [36]:
grid.best_params_

{'booster': 'gbtree',
 'learning_rate': 0.45,
 'n_estimators': 300,
 'reg_alpha': 0.05,
 'reg_lambda': 1}

In [37]:
grid.best_score_

-2893.015259254636

After trying several combinations of hyperparameters with the grid search, we found an even better score (`best_score_` is the negated RMSE, so about 2893) with the combination *{'booster': 'gbtree','learning_rate': 0.45,'n_estimators': 300,'reg_alpha': 0.05,'reg_lambda': 1}*, so we refit the XGBRegressor with those parameters.

In [None]:
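# refit with the best hyperparameters found by the grid search
final_xgb = XGBRegressor(booster='gbtree', n_estimators=300, learning_rate=0.45, reg_lambda=1, reg_alpha=0.05)
final_xgb.fit(X_train, y_train)

# compare the train and test RMSE to check for overfitting
y_train_pred = final_xgb.predict(X_train)
y_test_pred = final_xgb.predict(X_test)
train_RMSE = np.sqrt(mean_squared_error(y_train, y_train_pred))
test_RMSE = np.sqrt(mean_squared_error(y_test, y_test_pred))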

print(f"train RMSE: {train_RMSE}\ntest RMSE: {test_RMSE}")