# Black Friday Sales Prediction

- A retail company, “ABC Private Limited”, wants to understand customer purchase behaviour (specifically, purchase amount) against various products of different categories. They have shared a purchase summary of various customers for selected high-volume products from last month.
The data set also contains customer demographics (age, gender, marital status, city_type, stay_in_current_city), product details (product_id and product category) and the total purchase_amount from last month.

- Now they want to build a model that predicts a customer's purchase amount against various products, which will help them create personalized offers for customers against different products.

- https://datahack.analyticsvidhya.com/contest/black-friday/#ProblemStatement

- A description of the data set is given at the link above.

In [None]:
import re
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

- Import all the required libraries, then load the dataset from the GitHub account into the Saledf dataframe.

In [None]:
Saledf = pd.read_csv('https://raw.githubusercontent.com/Manju410/MLPractice/main/train.csv')

In [None]:
Saledf.head()

Unnamed: 0,User_ID,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase
0,1000001,P00069042,F,0-17,10,A,2,0,3,,,8370
1,1000001,P00248942,F,0-17,10,A,2,0,1,6.0,14.0,15200
2,1000001,P00087842,F,0-17,10,A,2,0,12,,,1422
3,1000001,P00085442,F,0-17,10,A,2,0,12,14.0,,1057
4,1000002,P00285442,M,55+,16,C,4+,0,8,,,7969


- Check the dataset and its number of rows and columns.

In [None]:
row, col = Saledf.shape
row, col

(550068, 12)

- Information about dataset

In [None]:
Saledf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 550068 entries, 0 to 550067
Data columns (total 12 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   User_ID                     550068 non-null  int64  
 1   Product_ID                  550068 non-null  object 
 2   Gender                      550068 non-null  object 
 3   Age                         550068 non-null  object 
 4   Occupation                  550068 non-null  int64  
 5   City_Category               550068 non-null  object 
 6   Stay_In_Current_City_Years  550068 non-null  object 
 7   Marital_Status              550068 non-null  int64  
 8   Product_Category_1          550068 non-null  int64  
 9   Product_Category_2          376430 non-null  float64
 10  Product_Category_3          166821 non-null  float64
 11  Purchase                    550068 non-null  int64  
dtypes: float64(2), int64(5), object(5)
memory usage: 50.4+ MB


# Summary of above output

- The dataset contains 12 columns.
- 5 columns have an integer datatype, 2 have a float datatype and 5 have an object datatype.
- The Product_Category_2 and Product_Category_3 columns contain null values.
- The dataset has 550068 entries.
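The summary above can also be derived programmatically instead of being read off the `info()` printout. The following is a minimal sketch on a small stand-in frame (`toy` and its values are illustrative assumptions, not rows from the real data); the same two expressions should work on `Saledf`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for Saledf (illustrative values only).
toy = pd.DataFrame({
    "User_ID": [1000001, 1000002, 1000003],
    "Gender": ["F", "M", "M"],
    "Product_Category_2": [np.nan, 6.0, 14.0],
})

# Number of columns per dtype, the same tally info() prints on its "dtypes:" line.
dtype_counts = toy.dtypes.astype(str).value_counts()

# Names of the columns that contain at least one null value.
cols_with_nulls = toy.columns[toy.isna().any()].tolist()

print(dtype_counts)
print(cols_with_nulls)
```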

# Checking for null values in dataset

In [None]:
Saledf.isna().sum()

User_ID                            0
Product_ID                         0
Gender                             0
Age                                0
Occupation                         0
City_Category                      0
Stay_In_Current_City_Years         0
Marital_Status                     0
Product_Category_1                 0
Product_Category_2            173638
Product_Category_3            383247
Purchase                           0
dtype: int64

- The Product_Category_2 column has 173638 null values.
- The Product_Category_3 column has 383247 null values.
- The remaining columns don't have any null values.
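Raw counts are easier to judge as a share of all rows. The sketch below redoes the arithmetic from the counts printed above (roughly 32% and 70% missing); on the dataframe itself, `Saledf.isna().mean() * 100` should give the same percentages for every column.

```python
# Counts copied from the isna().sum() output and the dataset shape above.
total_rows = 550068
null_counts = {"Product_Category_2": 173638, "Product_Category_3": 383247}

# Share of rows with a missing value, expressed as a percentage.
null_pct = {col: 100 * n / total_rows for col, n in null_counts.items()}

for col, pct in null_pct.items():
    print(f"{col}: {pct:.2f}% missing")
```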

# Checking All the columns for unique values to analyze the pattern

In [None]:
Saledf.User_ID.unique()

array([1000001, 1000002, 1000003, ..., 1004113, 1005391, 1001529])

In [None]:
len(Saledf.User_ID.unique())

5891

- The User_ID column has 550068 rows but only 5891 unique values, so most users appear in many rows.

In [None]:
Saledf.Product_ID.unique()

array(['P00069042', 'P00248942', 'P00087842', ..., 'P00370293',
       'P00371644', 'P00370853'], dtype=object)

In [None]:
len(Saledf.Product_ID.unique())

3631

- The Product_ID column has 550068 rows but only 3631 unique values.
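The `len(... .unique())` pattern used above can be shortened with `nunique()`, which also works on several columns at once. A small sketch on a stand-in frame (`toy` is an illustrative assumption, not the real data):

```python
import pandas as pd

# Toy frame standing in for Saledf: 5 rows, 3 users, 3 products.
toy = pd.DataFrame({
    "User_ID": [1000001, 1000001, 1000002, 1000003, 1000003],
    "Product_ID": ["P1", "P2", "P1", "P3", "P1"],
})

# Distinct values per column in a single call.
unique_counts = toy[["User_ID", "Product_ID"]].nunique()

# Average number of rows (purchases) per user.
rows_per_user = len(toy) / unique_counts["User_ID"]

print(unique_counts)
print(rows_per_user)
```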

In [None]:
Saledf.Gender.unique()

array(['F', 'M'], dtype=object)

- The Gender column has 2 values: F (female) and M (male).

In [None]:
age = Saledf.Age.unique()
age

array(['0-17', '55+', '26-35', '46-50', '51-55', '36-45', '18-25'],
      dtype=object)

In [None]:
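def agematch(age):
  # Map each age bracket to its rank (0 = youngest, 6 = oldest).
  if re.match("0-17",age): return 0
  elif re.match("18-25",age): return 1
  elif re.match("26-35",age): return 2
  elif re.match("36-45",age): return 3
  elif re.match("46-50",age): return 4
  elif re.match("51-55",age): return 5
  # Escape "+" so it is matched literally instead of acting as a regex quantifier.
  elif re.match(r"55\+",age): return 6
  else: return age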

In [None]:
Saledf.Age = Saledf.Age.apply(agematch)

In [None]:
Saledf.Age.unique()

array([0, 6, 2, 4, 5, 3, 1])

- The Age column is divided into 7 categories.
- Because the age brackets have a natural order, I created the agematch function to assign them the numbers 0 to 6 in that order.
- I applied this function to the Age column of the Saledf dataframe.
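The ordinal encoding described above can also be written as a plain dictionary lookup, which avoids regular expressions altogether. This is a sketch of an alternative, not the notebook's own method; `AGE_ORDER` and `ages` are names introduced here for illustration.

```python
import pandas as pd

# Age brackets in ascending order, mapped to their rank (hypothetical helper name).
AGE_ORDER = {
    "0-17": 0, "18-25": 1, "26-35": 2, "36-45": 3,
    "46-50": 4, "51-55": 5, "55+": 6,
}

# The seven brackets in the order unique() returned them above.
ages = pd.Series(["0-17", "55+", "26-35", "46-50", "51-55", "36-45", "18-25"])

# Series.map returns NaN for any label missing from the dictionary,
# so an unexpected bracket shows up as a null instead of passing through silently.
encoded = ages.map(AGE_ORDER)

print(encoded.tolist())
```

One difference in behaviour: `agematch` returns an unrecognised label unchanged, whereas `map` turns it into NaN, which is easier to detect with `isna()`.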

In [None]:
Saledf.Occupation.unique()

array([10, 16, 15,  7, 20,  9,  1, 12, 17,  0,  3,  4, 11,  8, 19,  2, 18,
        5, 14, 13,  6])

In [None]:
Saledf.City_Category.unique()

array(['A', 'C', 'B'], dtype=object)

- The City_Category column has 3 categories (A, B, C) with object datatype.

In [None]:
Saledf.Stay_In_Current_City_Years.unique()

array(['2', '4+', '3', '1', '0'], dtype=object)

In [None]:
a = '4+'
mat = re.match("[0-9]+",str(a))[0]
mat

'4'

In [None]:
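# Keep only the leading digits of each value, e.g. '4+' -> '4' (same pattern as the test cell above).
Saledf.Stay_In_Current_City_Years = [ re.match("[0-9]+", str(x))[0] for x in Saledf.Stay_In_Current_City_Years]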

In [None]:
Saledf.Stay_In_Current_City_Years=Saledf.Stay_In_Current_City_Years.astype(int)

In [None]:
Saledf.Stay_In_Current_City_Years.unique()

array([2, 4, 3, 1, 0])

- The Stay_In_Current_City_Years column contains numerical values, but it has object datatype because of the + symbol. So I matched only the digits in this column and converted it to integer, on the understanding that 4+ denotes 4 or more years and is now stored as 4.
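The same clean-up can be done with pandas' vectorised string methods instead of a list comprehension. A minimal sketch on a stand-in Series (`stay` holds the five values seen above; the variable names are illustrative):

```python
import pandas as pd

# The five distinct values of Stay_In_Current_City_Years shown above.
stay = pd.Series(["2", "4+", "3", "1", "0"])

# Extract the leading run of digits from each value, then convert to integer.
# expand=False makes str.extract return a Series for a single capture group.
stay_years = stay.str.extract(r"(\d+)", expand=False).astype(int)

print(stay_years.tolist())
```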

In [None]:
Saledf.Marital_Status.unique()

array([0, 1])

- The Marital_Status column has 2 values (0, 1), which mean:
  0 - Unmarried (single)
  1 - Married

In [None]:
Saledf.Product_Category_1.unique()

array([ 3,  1, 12,  8,  5,  4,  2,  6, 14, 11, 13, 15,  7, 16, 18, 10, 17,
        9, 20, 19])

In [None]:
Saledf.Product_Category_2.unique()

array([nan,  6., 14.,  2.,  8., 15., 16., 11.,  5.,  3.,  4., 12.,  9.,
       10., 17., 13.,  7., 18.])

In [None]:
Saledf.Product_Category_2.mode()

0    8.0
dtype: float64

In [None]:
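# Assign the result back: fillna(inplace=True) on a selected column is flagged as chained assignment by recent pandas.
Saledf['Product_Category_2'] = Saledf['Product_Category_2'].fillna(Saledf['Product_Category_2'].mode()[0])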

- The Product_Category_2 column contains 173638 null values, so I fill them with the mode of that column (8.0), the most frequent category, to stay consistent with the products purchased.

In [None]:
Saledf.Product_Category_3.unique()

array([nan, 14., 17.,  5.,  4., 16., 15.,  8.,  9., 13.,  6., 12.,  3.,
       18., 11., 10.])

In [None]:
Saledf.Product_Category_3.mode()[0]

16.0

In [None]:
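# Assign the result back: fillna(inplace=True) on a selected column is flagged as chained assignment by recent pandas.
Saledf['Product_Category_3'] = Saledf['Product_Category_3'].fillna(Saledf['Product_Category_3'].mode()[0])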

- The Product_Category_3 column contains 383247 null values, so I fill them with the mode of that column (16.0), the most frequent category, to stay consistent with the products purchased.
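The mode imputation applied to both category columns can be sketched in one loop. Because a large share of the values is filled in, the sketch also records which rows were originally null; that `_was_missing` flag is an optional addition assumed here, not something the notebook does, and `toy` is an illustrative stand-in for `Saledf`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for Saledf (illustrative values only).
toy = pd.DataFrame({
    "Product_Category_2": [np.nan, 6.0, 8.0, 8.0, np.nan],
    "Product_Category_3": [np.nan, 14.0, 16.0, np.nan, 16.0],
})

for col in ["Product_Category_2", "Product_Category_3"]:
    # Hypothetical indicator column: remember which rows were null before filling.
    toy[col + "_was_missing"] = toy[col].isna()
    # mode() returns a Series (ties are possible); [0] takes the first mode.
    toy[col] = toy[col].fillna(toy[col].mode()[0])

print(toy)
```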

In [None]:
Saledf.isna().sum()

User_ID                       0
Product_ID                    0
Gender                        0
Age                           0
Occupation                    0
City_Category                 0
Stay_In_Current_City_Years    0
Marital_Status                0
Product_Category_1            0
Product_Category_2            0
Product_Category_3            0
Purchase                      0
dtype: int64

- Checking null values again after applying fillna: no column has nulls left.

In [None]:
Saledf.head()

Unnamed: 0,User_ID,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase
0,1000001,P00069042,F,0,10,A,2,0,3,8.0,16.0,8370
1,1000001,P00248942,F,0,10,A,2,0,1,6.0,14.0,15200
2,1000001,P00087842,F,0,10,A,2,0,12,8.0,16.0,1422
3,1000001,P00085442,F,0,10,A,2,0,12,14.0,16.0,1057
4,1000002,P00285442,M,6,16,C,4,0,8,8.0,16.0,7969


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
Saledf.to_csv('/content/drive/MyDrive/BlackFridayPredictionClean.csv', index=False)
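After saving, it can be worth reloading the file to confirm it round-trips cleanly. The sketch below uses a temporary directory and a small stand-in frame (both are assumptions for illustration, not the Drive path or the real data). Writing with `index=False` keeps the row index out of the file, so no extra `Unnamed: 0` column appears on reload.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for the cleaned Saledf (illustrative values only).
toy = pd.DataFrame({
    "User_ID": [1000001, 1000002],
    "Age": [0, 6],
    "Purchase": [8370, 7969],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "clean.csv")  # hypothetical file name
    toy.to_csv(path, index=False)          # same call style as the cell above
    reloaded = pd.read_csv(path)

# True when values, column names and dtypes all match.
same = reloaded.equals(toy)
print(same, reloaded.columns.tolist())
```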