
# **Black Friday Sales Prediction**

This is the **first** of 3 notebooks in this Black Friday sales prediction project. The **description** of the project is given in this notebook.

Link to the Dataset: https://www.kaggle.com/sdolezel/black-friday



## About the Dataset

This project uses the Black Friday dataset uploaded to Kaggle. It contains 2 datasets, train and test. Since the target "Purchase", which is the final sales amount, is not present in the test dataset, we will use only the train dataset.

This dataset has approximately **550 thousand** rows and exactly 12 columns: 11 features and the target.

1. User_ID : ID number of the user.
2. Product_ID : ID number of the product.
3. Gender : Either female or male.
4. Age : The age range of the people purchasing the items.
5. Occupation : Each occupation has been numbered, so we don't know what the different occupations are.
6. City_Category : 3 different city categories, A, B and C.
7. Stay_In_Current_City_Years : The number of years the person purchasing the item has lived in their current city.
8. Marital_Status : Either married or not.
9. Product_Category_1 : Each number represents a different category, but we do not know what the categories are.
10. Product_Category_2 : Same as Product_Category_1
11. Product_Category_3 : Same as Product_Category_1
12. Purchase : This is the target. It contains the final price of the item in that record.



## Problem Statement

Based on the given data, we need to predict the final price of the item in each record. Hence, this is a regression problem. As stated above, we are given information about the user who is buying the item, such as their age, gender, occupation and marital status. We will try different approaches to train a model that predicts the price with the least loss possible. The goal is to estimate the approximate price of a given product and to draw a few conclusions about the relationships between the features.
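The notebook does not name a specific loss at this point. As a minimal sketch, assuming root-mean-squared error (RMSE) as one common choice for regression, the loss could be computed as follows. The function name `rmse` and the sample values are illustrative only.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root-mean-squared error between actual and predicted purchase amounts."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Made-up purchase amounts, for illustration only.
actual = [8000, 15000, 1500]
predicted = [9000, 14000, 1500]
error = rmse(actual, predicted)
```

A lower value means the predicted prices are closer to the actual ones, in the same units as "Purchase".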

# Importing Necessary Libraries

In [0]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from mpl_toolkits.mplot3d import Axes3D

# Importing the Dataset

In [0]:
data=pd.read_csv("train.csv")

In [3]:
data

Unnamed: 0,User_ID,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase
0,1000001,P00069042,F,0-17,10,A,2,0,3,,,8370
1,1000001,P00248942,F,0-17,10,A,2,0,1,6.0,14.0,15200
2,1000001,P00087842,F,0-17,10,A,2,0,12,,,1422
3,1000001,P00085442,F,0-17,10,A,2,0,12,14.0,,1057
4,1000002,P00285442,M,55+,16,C,4+,0,8,,,7969
...,...,...,...,...,...,...,...,...,...,...,...,...
550063,1006033,P00372445,M,51-55,13,B,1,1,20,,,368
550064,1006035,P00375436,F,26-35,1,C,3,0,20,,,371
550065,1006036,P00375436,F,26-35,15,B,4+,1,20,,,137
550066,1006038,P00375436,F,55+,1,C,2,0,20,,,365


In [0]:
dff = data

In [0]:
# Drop the two identifier columns by name, then fill missing values with 0
dff = dff.drop(columns=["User_ID", "Product_ID"]).fillna(0)


As the first few rows show, this dataset contains many NaN values. If we removed those rows, we could lose a lot of information. The NaN values occur mainly in the Product Category features, where they signify that the item has no category in that feature. Hence filling them with 0's should not distort the dataset.

We also observe that User_ID and Product_ID are identifiers rather than descriptive features, so we drop them from the dataset.
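As a small sketch of the two steps described above, the following toy frame reuses a few column names from train.csv. The frame `toy` and the variable names are assumptions made for illustration. It counts the missing values per column first, then drops the identifier columns and fills the gaps.

```python
import numpy as np
import pandas as pd

# Toy frame with a subset of the train.csv columns (three sample rows).
toy = pd.DataFrame({
    "User_ID": [1000001, 1000001, 1000002],
    "Product_ID": ["P00069042", "P00248942", "P00285442"],
    "Product_Category_1": [3, 1, 8],
    "Product_Category_2": [np.nan, 6.0, np.nan],
    "Product_Category_3": [np.nan, 14.0, np.nan],
    "Purchase": [8370, 15200, 7969],
})

# Count missing values per column before deciding how to handle them.
missing_per_column = toy.isna().sum()

# Drop the identifier columns by name, then fill the remaining NaNs with 0.
cleaned = toy.drop(columns=["User_ID", "Product_ID"]).fillna(0)
```

Checking `missing_per_column` on the full dataset would confirm whether the NaN values are really confined to the Product Category features.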

In [6]:
dff

Unnamed: 0,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase
0,F,0-17,10,A,2,0,3,0.0,0.0,8370
1,F,0-17,10,A,2,0,1,6.0,14.0,15200
2,F,0-17,10,A,2,0,12,0.0,0.0,1422
3,F,0-17,10,A,2,0,12,14.0,0.0,1057
4,M,55+,16,C,4+,0,8,0.0,0.0,7969
...,...,...,...,...,...,...,...,...,...,...
550063,M,51-55,13,B,1,1,20,0.0,0.0,368
550064,F,26-35,1,C,3,0,20,0.0,0.0,371
550065,F,26-35,15,B,4+,1,20,0.0,0.0,137
550066,F,55+,1,C,2,0,20,0.0,0.0,365


In [7]:
dff.dtypes

Gender                         object
Age                            object
Occupation                      int64
City_Category                  object
Stay_In_Current_City_Years     object
Marital_Status                  int64
Product_Category_1              int64
Product_Category_2            float64
Product_Category_3            float64
Purchase                        int64
dtype: object

Here, we see that there are 4 attributes (Gender, Age, City_Category and Stay_In_Current_City_Years) that we need to change from object type to integer type.
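Rather than reading the list of dtypes by eye, the non-numeric columns can be listed programmatically. This is a sketch on a made-up two-row frame, and the variable names are hypothetical.

```python
import pandas as pd

# Two illustrative rows with a subset of the columns.
toy = pd.DataFrame({
    "Gender": ["F", "M"],
    "Age": ["0-17", "55+"],
    "Occupation": [10, 16],
    "City_Category": ["A", "C"],
    "Stay_In_Current_City_Years": ["2", "4+"],
    "Purchase": [8370, 7969],
})

# Every column that is not numeric still needs to be encoded.
non_numeric_columns = toy.select_dtypes(exclude="number").columns.tolist()
```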

## Changing Gender Column (F=0, M=1)

In [8]:
dff["Gender"].value_counts()

M    414259
F    135809
Name: Gender, dtype: int64

In [0]:
dff["Gender"] = dff["Gender"].astype("category").cat.codes

Since there are only two unique values in this feature, we can apply cat.codes to convert female to 0 and male to 1.
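The reason F becomes 0 and M becomes 1 is that pandas sorts the inferred categories, and each code is the position of the label in that sorted list. A short sketch on a made-up series:

```python
import pandas as pd

gender = pd.Series(["F", "M", "M", "F"])
as_category = gender.astype("category")

# Categories are sorted, so "F" sits at position 0 and "M" at position 1.
categories = as_category.cat.categories.tolist()
code_for_label = {label: position for position, label in enumerate(categories)}

# The integer code of each row.
codes = as_category.cat.codes.tolist()
```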

## Changing City Category Column (A=2, B=1, C=0)

In [10]:
dff["City_Category"].value_counts()

B    231173
C    171175
A    147720
Name: City_Category, dtype: int64

In [0]:
# Encode the city category as an integer (A=2, B=1, C=0) in one vectorized step
dff["City_Category"] = dff["City_Category"].map({"A": 2, "B": 1, "C": 0})

## Changing Stay In Current City Years Column

In [0]:
# Replace the open-ended label "4+" with "4" so the column can be cast to int
dff["Stay_In_Current_City_Years"] = dff["Stay_In_Current_City_Years"].replace("4+", "4")

In [0]:
dff["Stay_In_Current_City_Years"] = dff["Stay_In_Current_City_Years"].astype(int)

Here, a stay of 4 or more years ("4+") is treated as 4 years. We then convert this entire feature to int.
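The same conversion can be sketched on a small made-up series. This variant strips the trailing "+" with a string method, which is one possible approach and not necessarily the one used in the notebook.

```python
import pandas as pd

# Illustrative values in the same format as the original column.
stay = pd.Series(["2", "4+", "1", "0", "3"])

# Strip the trailing "+" so that "4+" becomes "4", then cast to integers.
stay_years = stay.str.rstrip("+").astype(int)
```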

In [14]:
dff

Unnamed: 0,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase
0,0,0-17,10,2,2,0,3,0.0,0.0,8370
1,0,0-17,10,2,2,0,1,6.0,14.0,15200
2,0,0-17,10,2,2,0,12,0.0,0.0,1422
3,0,0-17,10,2,2,0,12,14.0,0.0,1057
4,1,55+,16,0,4,0,8,0.0,0.0,7969
...,...,...,...,...,...,...,...,...,...,...
550063,1,51-55,13,1,1,1,20,0.0,0.0,368
550064,0,26-35,1,0,3,0,20,0.0,0.0,371
550065,0,26-35,15,1,4,1,20,0.0,0.0,137
550066,0,55+,1,0,2,0,20,0.0,0.0,365


## Changing Age Column

In [15]:
dff['Age'].value_counts()

26-35    219587
36-45    110013
18-25     99660
46-50     45701
51-55     38501
55+       21504
0-17      15102
Name: Age, dtype: int64

In [0]:
# Map each age range to a single representative integer; "55+" is mapped to 55
age_map = {"0-17": 10, "18-25": 20, "26-35": 30, "36-45": 40,
           "46-50": 48, "51-55": 53, "55+": 55}
dff["Age"] = dff["Age"].map(age_map)

In the Age feature, we see that there are 7 different ranges. To convert each range into a single integer, we take an approximate average of the range and assign that value to every record in it. The open-ended "55+" range is assigned 55.
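To see how close the chosen values are to the true averages, the midpoint of each closed range can be computed and compared. The helper `midpoint` is a hypothetical name introduced for this sketch, and the `age_codes` dictionary restates the values assigned in this notebook.

```python
# Values assigned to each age range in this notebook.
age_codes = {"0-17": 10, "18-25": 20, "26-35": 30, "36-45": 40,
             "46-50": 48, "51-55": 53, "55+": 55}


def midpoint(label):
    """Midpoint of a closed range such as "26-35"; None for open-ended labels such as "55+"."""
    if "-" not in label:
        return None
    low, high = (int(part) for part in label.split("-"))
    return (low + high) / 2


midpoints = {label: midpoint(label) for label in age_codes}
```

For the closed ranges the assigned values stay within 1.5 of the midpoint, while "55+" has no upper bound and therefore no midpoint.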

## Changing the columns to their respective data types

In [0]:
dff = dff.infer_objects()

In [18]:
dff.dtypes

Gender                           int8
Age                             int64
Occupation                      int64
City_Category                   int64
Stay_In_Current_City_Years      int64
Marital_Status                  int64
Product_Category_1              int64
Product_Category_2            float64
Product_Category_3            float64
Purchase                        int64
dtype: object

In [19]:
dff

Unnamed: 0,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase
0,0,10,10,2,2,0,3,0.0,0.0,8370
1,0,10,10,2,2,0,1,6.0,14.0,15200
2,0,10,10,2,2,0,12,0.0,0.0,1422
3,0,10,10,2,2,0,12,14.0,0.0,1057
4,1,55,16,0,4,0,8,0.0,0.0,7969
...,...,...,...,...,...,...,...,...,...,...
550063,1,53,13,1,1,1,20,0.0,0.0,368
550064,0,30,1,0,3,0,20,0.0,0.0,371
550065,0,30,15,1,4,1,20,0.0,0.0,137
550066,0,55,1,0,2,0,20,0.0,0.0,365


We now see that all the columns are in either integer or float format. It will be easier to work with this cleaned dataset.

In [0]:
dff.to_csv("Black Friday cleaned Data set.csv")
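One detail to keep in mind when loading this file later: `to_csv` writes the row index by default, so a plain `read_csv` shows it as an extra `Unnamed: 0` column. The sketch below demonstrates this on a made-up frame, using an in-memory buffer in place of a file.

```python
import io

import pandas as pd

toy = pd.DataFrame({"Gender": [0, 1], "Purchase": [8370, 7969]})

# Default behaviour: the index is written as an unnamed first column.
buffer = io.StringIO()
toy.to_csv(buffer)

# A plain read turns that column into "Unnamed: 0".
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Passing index_col=0 restores it as the index instead.
buffer.seek(0)
reloaded_indexed = pd.read_csv(buffer, index_col=0)
```

Either `index_col=0` on reading or `index=False` on writing avoids the extra column. Which one fits depends on how the later notebooks load the file.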

Please refer to the **Black Friday Visualization** notebook for the second part of this project.