<a href="https://colab.research.google.com/github/Gul-Fatima/Data-Science/blob/main/EdaonBlackFridayDataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Black Friday dataset**
* Objective: Cleaning and preparing data for model training

**Problem Statement:**
A retail company, “ABC Private Limited”, wants to understand customer purchase behaviour (specifically, purchase amount) across products of different categories. They have shared a purchase summary of various customers for selected high-volume products from last month. The data set also contains customer demographics (age, gender, marital status, city_type, stay_in_current_city), product details (product_id and product category), and the total purchase_amount from last month.

Now they want to build a model that predicts a customer's purchase amount for various products, which will help them create personalized offers for customers on different products.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
#importing dataset
df_train = pd.read_csv('train.csv')
#df_test = pd.read_csv('test.csv')

In [None]:
df_train.head()

Unnamed: 0,User_ID,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase
0,1000001,P00069042,F,0-17,10,A,2,0,3,,,8370
1,1000001,P00248942,F,0-17,10,A,2,0,1,6.0,14.0,15200
2,1000001,P00087842,F,0-17,10,A,2,0,12,,,1422
3,1000001,P00085442,F,0-17,10,A,2,0,12,14.0,,1057
4,1000002,P00285442,M,55+,16,C,4+,0,8,,,7969


In [None]:
df_train.columns

Index(['User_ID', 'Product_ID', 'Gender', 'Age', 'Occupation', 'City_Category',
       'Stay_In_Current_City_Years', 'Marital_Status', 'Product_Category_1',
       'Product_Category_2', 'Product_Category_3', 'Purchase'],
      dtype='object')

In [None]:
#importing test data
df_test = pd.read_csv('test.csv')
df_test.head(2)

Unnamed: 0,User_ID,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3
0,1000004,P00128942,M,46-50,7,B,2,1,1,11.0,
1,1000009,P00113442,M,26-35,17,C,0,0,3,5.0,


In [None]:
df_test.columns

Index(['User_ID', 'Product_ID', 'Gender', 'Age', 'Occupation', 'City_Category',
       'Stay_In_Current_City_Years', 'Marital_Status', 'Product_Category_1',
       'Product_Category_2', 'Product_Category_3'],
      dtype='object')

In [None]:
# stack the train and test rows so both receive the same preprocessing
df = pd.concat([df_train, df_test], ignore_index=True)  # test rows have no Purchase column, so Purchase is NaN for them
df.head()

Unnamed: 0,User_ID,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase
0,1000001,P00069042,F,0-17,10,A,2,0,3,,,8370.0
1,1000001,P00248942,F,0-17,10,A,2,0,1,6.0,14.0,15200.0
2,1000001,P00087842,F,0-17,10,A,2,0,12,,,1422.0
3,1000001,P00085442,F,0-17,10,A,2,0,12,14.0,,1057.0
4,1000002,P00285442,M,55+,16,C,4+,0,8,,,7969.0
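
As a hedged illustration of what the concatenation does: the test rows carry no `Purchase` column, so after `pd.concat` they hold `NaN` there and the column becomes float. The sketch below uses two tiny made-up frames (`toy_train` and `toy_test` are hypothetical stand-ins, not the real files) and shows one possible way to separate the two parts again later. It assumes the training rows themselves have no missing `Purchase` values.

```python
import pandas as pd

# Hypothetical miniature stand-ins for train.csv and test.csv
toy_train = pd.DataFrame({"Gender": ["F", "M"], "Purchase": [8370, 7969]})
toy_test = pd.DataFrame({"Gender": ["M"]})

# Stack rows; a column missing from one frame is filled with NaN
toy_df = pd.concat([toy_train, toy_test], ignore_index=True)

# Rows that came from the test set are the ones without a Purchase value
is_test = toy_df["Purchase"].isna()
train_part = toy_df[~is_test]
test_part = toy_df[is_test].drop(columns="Purchase")
```

Splitting on the missing target like this only works if the target is never missing in the training data; otherwise an explicit source flag column would be safer.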


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 783667 entries, 0 to 783666
Data columns (total 12 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   User_ID                     783667 non-null  int64  
 1   Product_ID                  783667 non-null  object 
 2   Gender                      783667 non-null  object 
 3   Age                         783667 non-null  object 
 4   Occupation                  783667 non-null  int64  
 5   City_Category               783667 non-null  object 
 6   Stay_In_Current_City_Years  783667 non-null  object 
 7   Marital_Status              783667 non-null  int64  
 8   Product_Category_1          783667 non-null  int64  
 9   Product_Category_2          537685 non-null  float64
 10  Product_Category_3          237858 non-null  float64
 11  Purchase                    550068 non-null  float64
dtypes: float64(3), int64(4), object(5)
memory usage: 71.7+ MB
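
Subtracting the non-null counts from the 783667 rows gives 245982 missing values in `Product_Category_2`, 545809 in `Product_Category_3`, and 233599 in `Purchase`. A small sketch of how such a summary could be tabulated directly is shown here; the frame `toy` is made up for illustration and is not the real data.

```python
import numpy as np
import pandas as pd

# Made-up sample with the same kind of gaps as the product category columns
toy = pd.DataFrame({
    "Product_Category_1": [3, 1, 12, 12],
    "Product_Category_2": [np.nan, 6.0, np.nan, 14.0],
    "Product_Category_3": [np.nan, 14.0, np.nan, np.nan],
})

# Count of missing values per column, and the same as a fraction of all rows
missing_count = toy.isna().sum()
missing_share = toy.isna().mean()

summary = pd.DataFrame({"missing": missing_count, "share": missing_share})
```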


In [None]:
df.describe()

Unnamed: 0,User_ID,Occupation,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase
count,783667.0,783667.0,783667.0,783667.0,537685.0,237858.0,550068.0
mean,1003029.0,8.0793,0.409777,5.366196,9.844506,12.668605,9263.968713
std,1727.267,6.522206,0.491793,3.87816,5.089093,4.12551,5023.065394
min,1000001.0,0.0,0.0,1.0,2.0,3.0,12.0
25%,1001519.0,2.0,0.0,1.0,5.0,9.0,5823.0
50%,1003075.0,7.0,0.0,5.0,9.0,14.0,8047.0
75%,1004478.0,14.0,1.0,8.0,15.0,16.0,12054.0
max,1006040.0,20.0,1.0,20.0,18.0,18.0,23961.0


In [None]:
# User_ID is an identifier rather than a descriptive feature (it repeats for every purchase by the same customer), so we drop it
df.drop(['User_ID'], axis=1, inplace=True)
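
Before dropping an identifier column it can be worth checking how many distinct values it holds. In the rows printed earlier, `1000001` appears several times, so the ID identifies a customer rather than a row. A minimal sketch with made-up data (`toy` is hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    "User_ID": [1000001, 1000001, 1000001, 1000002],
    "Purchase": [8370, 15200, 1422, 7969],
})

n_rows = len(toy)
n_users = toy["User_ID"].nunique()  # distinct customers, not rows

# Without inplace=True, drop() returns a new frame and leaves `toy` untouched
toy_no_id = toy.drop(columns=["User_ID"])
```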

In [None]:
df.head(15)

Unnamed: 0,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase
0,P00069042,F,0-17,10,A,2,0,3,,,8370.0
1,P00248942,F,0-17,10,A,2,0,1,6.0,14.0,15200.0
2,P00087842,F,0-17,10,A,2,0,12,,,1422.0
3,P00085442,F,0-17,10,A,2,0,12,14.0,,1057.0
4,P00285442,M,55+,16,C,4+,0,8,,,7969.0
5,P00193542,M,26-35,15,A,3,0,1,2.0,,15227.0
6,P00184942,M,46-50,7,B,2,1,1,8.0,17.0,19215.0
7,P00346142,M,46-50,7,B,2,1,1,15.0,,15854.0
8,P0097242,M,46-50,7,B,2,1,1,16.0,,15686.0
9,P00274942,M,26-35,20,A,1,1,8,,,7871.0


* Next we handle the categorical features, i.e.

1.   Gender
2.   Age (because it is given as a range)
3.   City category, etc.

* So we'll convert this categorical data into numbers, just like marital status, which is already given as 0/1



In [None]:
#resolving gender : either it would be male or female
pd.get_dummies(df['Gender'])

Unnamed: 0,F,M
0,True,False
1,True,False
2,True,False
3,True,False
4,False,True
...,...,...
783662,True,False
783663,True,False
783664,True,False
783665,True,False


* Female (`F` is True) will be encoded as 0
* Male (`M` is True) will be encoded as 1

In [None]:
# instead of creating a new DataFrame, recode the Gender column itself: F -> 0, M -> 1
df['Gender'] = df['Gender'].map({'F': 0, 'M': 1})
df.head()

Unnamed: 0,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase
0,P00069042,0,0-17,10,A,2,0,3,,,8370.0
1,P00248942,0,0-17,10,A,2,0,1,6.0,14.0,15200.0
2,P00087842,0,0-17,10,A,2,0,12,,,1422.0
3,P00085442,0,0-17,10,A,2,0,12,14.0,,1057.0
4,P00285442,1,55+,16,C,4+,0,8,,,7969.0


* Now the gender is either 0 or 1 rather than M/F
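
One thing to keep in mind with `Series.map` and a dictionary: any value that is not a key in the dictionary becomes `NaN`, so a quick check after mapping can catch unexpected labels. The sketch below uses a made-up `gender` series and also shows that taking the `M` column from `get_dummies` gives the same codes.

```python
import pandas as pd

gender = pd.Series(["F", "M", "M", "F"])  # made-up sample
gender_codes = gender.map({"F": 0, "M": 1})

# Any label missing from the dictionary would have become NaN
assert gender_codes.notna().all()

# Same result from one-hot encoding: keep only the 'M' indicator column
gender_from_dummies = pd.get_dummies(gender)["M"].astype(int)
```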

## Handling categorical features

In [None]:
df['Age'].unique()

array(['0-17', '55+', '26-35', '46-50', '51-55', '36-45', '18-25'],
      dtype=object)

* In order to change this categorical data into numbers, we use encoding:

In [None]:
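# Technique 1: ordinal encoding with an explicit mapping.

# one-hot alternative (the table shown below is the output of this commented-out call):
#pd.get_dummies(df['Age'], drop_first=True)
df['Age'] = df['Age'].map({'0-17': 1, '18-25': 2, '26-35': 3, '36-45': 4, '46-50': 5, '51-55': 6, '55+': 7})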

Unnamed: 0,18-25,26-35,36-45,46-50,51-55,55+
0,False,False,False,False,False,False
1,False,False,False,False,False,False
2,False,False,False,False,False,False
3,False,False,False,False,False,False
4,False,False,False,False,False,True
...,...,...,...,...,...,...
783662,False,True,False,False,False,False
783663,False,True,False,False,False,False
783664,False,True,False,False,False,False
783665,False,False,False,True,False,False
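
The two encodings in this cell behave differently. With `drop_first=True`, the first category (`'0-17'`) gets no column of its own, and a row where every indicator is `False` stands for it. The explicit mapping instead keeps a single column whose numbers preserve the age order. A sketch on a made-up `age` series:

```python
import pandas as pd

age = pd.Series(["0-17", "55+", "26-35", "0-17"])  # made-up sample

# One-hot encoding; the first (sorted) category '0-17' is dropped
dummies = pd.get_dummies(age, drop_first=True)

# Ordinal encoding: one integer column that keeps the age order
age_map = {"0-17": 1, "18-25": 2, "26-35": 3, "36-45": 4,
           "46-50": 5, "51-55": 6, "55+": 7}
ordinal = age.map(age_map)
```

Which form suits better depends on the model: the ordinal column assumes the age groups are meaningfully ordered, while one-hot columns make no such assumption.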


In [None]:
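# Technique 2: label encoding with scikit-learn
# (if Technique 1 has already run, Age holds 1-7 and is simply shifted to 0-6 here)
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
df['Age'] = le.fit_transform(df['Age'])
df['Age'].unique()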

array([0, 6, 2, 4, 5, 3, 1])
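
`LabelEncoder` assigns codes according to the sorted order of the labels. For these age labels the sorted order happens to coincide with the natural age order, which is why the codes run from 0 to 6 by age. Where that coincidence cannot be relied on, the order can be stated explicitly. The sketch below does this with a pandas ordered `Categorical`, which is a swapped-in alternative rather than what the cell above uses; `age_order` and `age` are illustrative names.

```python
import pandas as pd

# Explicit category order, youngest to oldest
age_order = ["0-17", "18-25", "26-35", "36-45", "46-50", "51-55", "55+"]
age = pd.Series(["0-17", "55+", "26-35", "46-50", "51-55", "36-45", "18-25"])

# Codes follow each label's position in age_order
age_codes = pd.Categorical(age, categories=age_order, ordered=True).codes

# For these particular labels, plain string sorting yields the same order
same_as_sorted = sorted(age_order) == age_order
```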

In [None]:
df.head(10)

Unnamed: 0,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase
0,P00069042,0,0,10,A,2,0,3,,,8370.0
1,P00248942,0,0,10,A,2,0,1,6.0,14.0,15200.0
2,P00087842,0,0,10,A,2,0,12,,,1422.0
3,P00085442,0,0,10,A,2,0,12,14.0,,1057.0
4,P00285442,1,6,16,C,4+,0,8,,,7969.0
5,P00193542,1,2,15,A,3,0,1,2.0,,15227.0
6,P00184942,1,4,7,B,2,1,1,8.0,17.0,19215.0
7,P00346142,1,4,7,B,2,1,1,15.0,,15854.0
8,P0097242,1,4,7,B,2,1,1,16.0,,15686.0
9,P00274942,1,2,20,A,1,1,8,,,7871.0
