# Black Friday Sales Prediction

## Problem Statement

A retail company, “ABC Private Limited”, wants to understand customer purchase behaviour (specifically, the purchase amount) across products of different categories. They have shared a purchase summary of various customers for selected high-volume products from last month.
The data set also contains customer demographics (age, gender, marital status, city_type, stay_in_current_city), product details (product_id and product category), and the total purchase_amount from last month.
Now they want to build a model that predicts a customer's purchase amount for various products, which will help them create personalized offers for customers on different products.

![BlackFridayTheory](./img.png 'BlackFridayTheory')

## Import Statements

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns

## Reading Data From Training Set

In [2]:
data = pd.read_csv('train.csv')
# Copy the selected columns so that later column assignments do not raise SettingWithCopyWarning
dataset = data[['User_ID','Product_ID','Gender', 'Age', 'Occupation', 'City_Category',
       'Stay_In_Current_City_Years', 'Marital_Status', 'Product_Category_1',
       'Product_Category_2', 'Product_Category_3', 'Purchase']].copy()
dataset.head()

## Data Analysis & Visualization

### Purchase

In [None]:
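# distplot is deprecated in recent seaborn releases; histplot with kde=True replaces it
sns.histplot(dataset['Purchase'], color='maroon', bins=25, kde=True)
plt.ylabel('No. of purchases')
plt.xlabel('Amount spent in purchase')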

In [None]:
print('Skewness of data is :',dataset['Purchase'].skew())
print('Kurtosis of data is :',dataset['Purchase'].kurtosis())

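#### Univariate analysis of the target variable (Purchase) shows a roughly Gaussian distribution with a skewness of 0.56 (mild right skew) and an excess kurtosis of -0.42 (slightly lighter tails than a normal distribution), both reasonably close to 0.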
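
To put those two numbers in context, the sketch below computes the same statistics on synthetic samples (made-up data, not `train.csv`). It relies on the fact that pandas' `kurtosis()` reports excess kurtosis, so a normal sample should score near 0 on both measures, while a strongly right-skewed sample should not.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins for a purchase column (assumed data, not the real training set)
normal_sample = pd.Series(rng.normal(loc=9000, scale=5000, size=100_000))
skewed_sample = pd.Series(rng.exponential(scale=5000, size=100_000))

for name, sample in [("normal", normal_sample), ("exponential", skewed_sample)]:
    # Series.kurtosis() is Fisher (excess) kurtosis: 0 for a perfect Gaussian
    print(f"{name:12s} skew={sample.skew():6.2f}  excess kurtosis={sample.kurtosis():6.2f}")
```

Against that baseline, a skewness of 0.56 and an excess kurtosis of -0.42 indicate a modest departure from normality.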

In [None]:
sns.boxplot(x=dataset['Purchase'], color='green')

#### The box plot of the target variable shows outliers, which need to be removed from the data.
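
A box plot with default settings draws points beyond 1.5 times the interquartile range from the quartiles as outliers. As a minimal sketch of that rule, the hypothetical helper `iqr_outlier_mask` below flags such points on a toy series; the values are invented for illustration.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Toy purchase amounts (hypothetical values)
purchases = pd.Series([5000, 6000, 7000, 8000, 9000, 10000, 11000, 12000, 23000])
mask = iqr_outlier_mask(purchases)
print(purchases[mask].tolist())  # only 23000 lies beyond the upper fence of 17000
```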

### Gender

In [None]:
sns.countplot(x=dataset['Gender'])

#### The data shows that purchases by male buyers outnumber those by female buyers.
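
A count plot counts rows, and each row in this data set is a purchase rather than a distinct customer. The sketch below, on a made-up toy frame that reuses the notebook's column names, shows how the share of purchase rows and the share of unique buyers can differ.

```python
import pandas as pd

# Toy frame using the notebook's column names (values are made up)
toy = pd.DataFrame({
    "User_ID": [1, 1, 1, 2, 3, 3],
    "Gender":  ["M", "M", "M", "F", "M", "M"],
})

# Share of purchase rows per gender (what a count plot displays)
row_share = toy["Gender"].value_counts(normalize=True)

# Share of distinct buyers per gender (each User_ID counted once)
buyer_share = toy.drop_duplicates("User_ID")["Gender"].value_counts(normalize=True)

print(row_share)
print(buyer_share)
```

The same two calls can be applied to `dataset` before `Gender` is one-hot encoded.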

### Age

In [None]:
sns.countplot(x=dataset['Age'])

#### As expected, most purchases are made by people between 18 and 45 years old.

### City Category

In [None]:
sns.countplot(x=dataset['City_Category'])

#### City B has relatively more buyers than cities A and C.

### Marital Status

In [None]:
sns.countplot(x=dataset['Marital_Status'])

#### Unmarried buyers outnumber married buyers.

### Stay In Current City Years

In [None]:
sns.countplot(x=dataset['Stay_In_Current_City_Years'])

#### People who have stayed in their current city for a year are the most frequent buyers.

### Occupation

In [None]:
plt.figure(figsize=(14,5))
sns.countplot(x=dataset['Occupation'])

<ul>
    <li>Among all occupations, purchases are concentrated in occupations 0, 4 and 7.</li>
    <li>There are very few buyers with occupation 8.</li>
</ul>

## Handling Categorical Values

### OneHotEncoding

In [None]:
gen_onehot_features = pd.get_dummies(dataset['Gender'])
# Replace the Gender column with its one-hot columns (F, M)
dataset = pd.concat([dataset.drop(columns='Gender'), gen_onehot_features], axis=1)
gen_onehot_features.head()

In [None]:
gen_onehot_features_city = pd.get_dummies(dataset['City_Category'])
# Replace the City_Category column with its one-hot columns (A, B, C)
dataset = pd.concat([dataset.drop(columns='City_Category'), gen_onehot_features_city], axis=1)
gen_onehot_features_city.head()

In [None]:
dataset.head()
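
As an alternative sketch, `pd.get_dummies` can encode several columns in one call through its `columns` argument. The toy frame below is made up. With the default prefix the new columns are named `Gender_M`, `City_Category_A` and so on, rather than the bare `M`, `A` used in this notebook, so later column selections would have to be adapted.

```python
import pandas as pd

# Toy frame with the notebook's column names (values are invented)
toy = pd.DataFrame({
    "Gender": ["M", "F", "M"],
    "City_Category": ["A", "C", "B"],
    "Purchase": [8370, 15200, 1422],
})

# Encode both categorical columns at once; untouched columns are kept in front
encoded = pd.get_dummies(toy, columns=["Gender", "City_Category"])
print(encoded.columns.tolist())
```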

### Missing Values

In [None]:
# Count the missing values per column
dataset.isnull().sum()

In [None]:
dataset['Product_Category_2'] = dataset['Product_Category_2'].fillna(999)
dataset['Product_Category_3'] = dataset['Product_Category_3'].fillna(999)
dataset['Product_Category_2'] = dataset['Product_Category_2'].astype(int)
dataset['Product_Category_3'] = dataset['Product_Category_3'].astype(int)

In [None]:
dataset.head()
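
The missing-value handling can be checked on a small example. The sketch below uses a made-up frame with the same column names: it counts the gaps per column, then fills them with the notebook's sentinel value 999 and casts the result to integers.

```python
import numpy as np
import pandas as pd

# Toy frame (invented values); NaN marks a product without a second or third category
toy = pd.DataFrame({
    "Product_Category_1": [3, 1, 12],
    "Product_Category_2": [np.nan, 6.0, np.nan],
    "Product_Category_3": [np.nan, 14.0, np.nan],
})

# Number of missing entries in each column
missing_counts = toy.isnull().sum()
print(missing_counts)

# Fill with the sentinel and cast to integers, one column at a time
sentinel = 999
for col in ["Product_Category_2", "Product_Category_3"]:
    toy[col] = toy[col].fillna(sentinel).astype(int)
print(toy)
```

A sentinel such as 999 suits tree-based models, which can split it off as its own group; a linear model would instead treat it as a very large category number.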

### Mapping Ordered Data

In [None]:
gen_ord_map = {'0-17': 0, '18-25': 1, '26-35': 2, 
               '36-45': 3, '46-50': 4, '51-55': 5,'55+':6}
dataset['Age'] = dataset['Age'].map(gen_ord_map)
dataset.head()
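
One thing to watch with `Series.map` is that any label missing from the dictionary silently becomes `NaN`. The sketch below shows a quick check for that on a toy series; the label `'56-60'` is deliberately invalid and purely hypothetical.

```python
import pandas as pd

age_order = {'0-17': 0, '18-25': 1, '26-35': 2,
             '36-45': 3, '46-50': 4, '51-55': 5, '55+': 6}

# Toy column; '56-60' is not a key of the mapping
ages = pd.Series(['26-35', '0-17', '55+', '56-60'])

mapped = ages.map(age_order)

# Labels that had no entry in the mapping end up as NaN
unmapped = ages[mapped.isna()].unique()
print(mapped.tolist())
print("Labels missing from the mapping:", list(unmapped))
```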

### LabelEncoding

In [None]:
from sklearn.preprocessing import LabelEncoder

gle = LabelEncoder()
genre_labels = gle.fit_transform(dataset['Stay_In_Current_City_Years'])
genre_mappings = {index: label for index, label in enumerate(gle.classes_)}
dataset['Stay_In_Current_City_Years'] = genre_labels
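
To see what the encoder does, the sketch below runs `LabelEncoder` on a toy series shaped like `Stay_In_Current_City_Years` (the values are assumed to be strings, including `'4+'`). The fitted `classes_` array is sorted, so each label's position is its integer code, and `inverse_transform` recovers the original labels.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy values in the format of Stay_In_Current_City_Years (made up)
stay = pd.Series(['2', '4+', '0', '1', '3', '4+'])

encoder = LabelEncoder()
codes = encoder.fit_transform(stay)

# classes_ is sorted, so the position of each label is its integer code
code_to_label = dict(enumerate(encoder.classes_))
print(code_to_label)

# inverse_transform maps the integer codes back to the original labels
restored = encoder.inverse_transform(codes)
print(list(restored))
```

Because the sorted order of these labels matches their natural order, the codes keep the ordinal meaning of the column.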

In [None]:
# gle = LabelEncoder()
# genre_labels = gle.fit_transform(dataset['Product_ID'])
# genre_mappings = {index: label for index, label in enumerate(gle.classes_)}
# genre_mappings
# dataset['Product_ID'] = genre_labels

In [None]:
gle = LabelEncoder()
genre_labels = gle.fit_transform(dataset['User_ID'])
genre_mappings = {index: label for index, label in enumerate(gle.classes_)}
dataset['User_ID'] = genre_labels

In [None]:
dataset.head()

## Removing Outliers

In [None]:
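from scipy import stats
z = np.abs(stats.zscore(dataset['Purchase']))

# Keep only rows whose purchase amount lies within `threshold` standard deviations of the mean
threshold = 2.33
print('Rows flagged as outliers:', int((z > threshold).sum()))

dataset = dataset[z < threshold]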

In [None]:
sns.boxplot(dataset['Purchase'])
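
For a sense of how much data a cut-off of 2.33 discards, the sketch below computes the two-sided tail of a standard normal distribution and compares it with the share removed from a synthetic, normally distributed sample. The sample is invented; the real `Purchase` column is only roughly Gaussian, so its share will differ.

```python
import numpy as np
from scipy import stats

threshold = 2.33

# For a standard normal variable, the fraction with |z| above the threshold
two_sided_tail = 2 * stats.norm.sf(threshold)
print(f"Expected share removed under normality: {two_sided_tail:.4f}")

# Synthetic purchases (hypothetical data) filtered with the same z-score rule
rng = np.random.default_rng(0)
purchases = rng.normal(loc=9000, scale=5000, size=50_000)
z = np.abs(stats.zscore(purchases))
kept = purchases[z < threshold]
removed_share = 1 - kept.size / purchases.size
print(f"Observed share removed: {removed_share:.4f}")
```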

## Splitting Data

In [None]:
X = dataset[['User_ID','Age', 'Occupation', 'Stay_In_Current_City_Years', 'Marital_Status',
       'Product_Category_1', 'Product_Category_2', 'Product_Category_3', 'M', 'A', 'B']] 

y = dataset['Purchase'] 

from sklearn.model_selection import train_test_split  
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

## Training Model

### Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression

regressor = LinearRegression()
regressor.fit(X_train,y_train)

print("Intercept:",regressor.intercept_)
print("\nCoefficients:",regressor.coef_)

y_pred = regressor.predict(X_test)

from sklearn import metrics
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))  
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
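
The three metrics printed above are simple functions of the prediction errors. As a sketch on made-up numbers, the block below computes them by hand and checks them against scikit-learn's helpers.

```python
import numpy as np
from sklearn import metrics

# Toy targets and predictions (made-up numbers)
y_true = np.array([8000.0, 12000.0, 5000.0, 15000.0])
y_hat = np.array([9000.0, 11000.0, 7000.0, 12000.0])

errors = y_true - y_hat
mae = np.mean(np.abs(errors))   # mean absolute error
mse = np.mean(errors ** 2)      # mean squared error
rmse = np.sqrt(mse)             # root mean squared error, in the target's units

# The hand-computed values should agree with scikit-learn's helpers
assert np.isclose(mae, metrics.mean_absolute_error(y_true, y_hat))
assert np.isclose(mse, metrics.mean_squared_error(y_true, y_hat))
print(f"MAE={mae:.1f}  MSE={mse:.1f}  RMSE={rmse:.1f}")
```

RMSE penalises large errors more heavily than MAE, which is why the two can rank models differently.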

### XGBoost

In [None]:
%%time
import xgboost as xgb

xg_reg = xgb.XGBRegressor(objective ='reg:squarederror', colsample_bytree = 0.3, learning_rate = 0.2,
                max_depth = 10, alpha = 15, n_estimators = 1000)

xg_reg.fit(X_train,y_train)

y_pred = xg_reg.predict(X_test)

from sklearn import metrics
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))  
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

In [None]:
plt.scatter(y_test,y_pred,alpha=0.5)
plt.plot(y_test,y_test,color='red')

### Random Forest

In [None]:
from sklearn.ensemble import RandomForestRegressor

regressor = RandomForestRegressor(n_estimators=66, random_state=0)  
regressor.fit(X_train, y_train)  
y_pred = regressor.predict(X_test) 

from sklearn import metrics
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))  
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
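
A fitted random forest also exposes `feature_importances_`, which can hint at which columns drive the predictions. The sketch below uses synthetic data that borrows two of the notebook's column names; the target is constructed to depend only on `Product_Category_1`, so the numbers say nothing about the real data set.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 2000

# Synthetic features reusing two of the notebook's column names (values are made up)
X_toy = pd.DataFrame({
    "Product_Category_1": rng.integers(1, 19, size=n),
    "Occupation": rng.integers(0, 21, size=n),
})
# The target depends only on Product_Category_1, plus Gaussian noise
y_toy = 1000 * X_toy["Product_Category_1"] + rng.normal(scale=500, size=n)

forest = RandomForestRegressor(n_estimators=20, random_state=0)
forest.fit(X_toy, y_toy)

# Impurity-based importances sum to 1 across features
importances = pd.Series(forest.feature_importances_, index=X_toy.columns)
print(importances.sort_values(ascending=False))
```

On the notebook's data, the same two lines after `regressor.fit(X_train, y_train)` would work with `X_train.columns` as the index.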


<h2> CONCLUSION </h2>
<p>
<b>We tried 3 models on the same regression problem:</b>
<ul>
<li>Random forest regressor gives an RMSE of 2900.</li>
<li> Linear regression gives an RMSE of 4444.</li>
 <li>XGBoost gives an RMSE of 2729.</li>
</ul>
<br>
 <b>Out of the 3 models, XGBoost gives the lowest RMSE.
 Hence we will use that model.</b>
    </p>
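
To make the comparison concrete, the short sketch below takes the three RMSE values reported in the conclusion and prints each model's gap to the best one as a percentage.

```python
# RMSE values reported in the conclusion
rmse = {"Linear regression": 4444, "Random forest": 2900, "XGBoost": 2729}

best_name = min(rmse, key=rmse.get)
for name, value in sorted(rmse.items(), key=lambda item: item[1]):
    # Relative gap to the best model, in percent
    gap = 100 * (value - rmse[best_name]) / rmse[best_name]
    print(f"{name:18s} RMSE={value:5d}  gap to best={gap:5.1f}%")
```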