# Beautiful insights into Black Friday data
![black friday](https://payfastcoza-bef7.kxcdn.com/wp-content/uploads/PayFast-Black-Friday-2019.png)
Welcome all 👋<br>

In this kernel we will explore the Black Friday data together to discover some insights.<br>

We will present each insight in the form **question** ➡ **analysis** ➡ **answer** ➡ **decision (if available).** Our analysis will be divided into two parts 👇
* **Exploratory Data Analysis (Numerically)**
* **Exploratory Data Analysis (Graphically)**

First, we will import the libraries and load the data 👇


In [None]:
#import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.impute import SimpleImputer

import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

print("libraries loaded successfully")

In [None]:
data_train = pd.read_csv('/kaggle/input/black-friday/train.csv')
print("Data loaded successfully")

# 1. Data exploration
Now let's take a quick look at our data 👇

In [None]:
#explore data
print("data shape : ",data_train.shape)
data_train.describe()

In [None]:
print("Data columns")
print("--------------------------------------------")
print(pd.DataFrame(data_train.columns))
print("--------------------------------------------")

As the output above ☝ shows, some columns, such as **User_ID** and **Product_ID**, are not useful for our analysis, so we will drop them.

In [None]:
data_train = data_train.drop(['User_ID','Product_ID'], axis=1)
print("User_ID and Product_ID columns dropped successfully")

# 2. Missing data 
Now let's deal with missing data 👇

In [None]:
#count the missing values in each column
total = data_train.isnull().sum().sort_values(ascending=False)

#get the fraction of missing values in each column (a value between 0 and 1)
percent = (data_train.isnull().sum()/data_train.isnull().count()).sort_values(ascending=False)

missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(10)
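
The same table can be built a little more directly, because the mean of a boolean mask is the share of `True` values. This is only a minimal sketch on a made-up toy frame (the `toy` values are assumptions for illustration, not taken from the real data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for data_train (values are invented for illustration).
toy = pd.DataFrame({
    "Product_Category_1": [1, 5, 8, 3],
    "Product_Category_2": [2.0, np.nan, 6.0, np.nan],
    "Product_Category_3": [np.nan, np.nan, np.nan, 14.0],
})

# isnull() yields booleans: their sum is the missing count,
# their mean is the missing share (between 0 and 1).
missing = pd.DataFrame({
    "Total": toy.isnull().sum(),
    "Percent": toy.isnull().mean(),
}).sort_values("Total", ascending=False)

print(missing)
```

Columns whose `Percent` is above the chosen threshold (0.6 in this kernel) are the candidates for deletion.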

### How to Handle Missing Data? 🙄
One of the most common problems I have faced in data cleaning and exploratory analysis is handling missing values.<br>
This picture gives us a guide to dealing with missing data 👇 <br>
<img src='https://miro.medium.com/max/1528/1*_RA3mCS30Pr0vUxbp25Yxw.png' width="550px" style='float:left;'>
<div style='clear:both'></div>
<br>
As the picture above shows, there are many ways to deal with missing data. In this **kernel** I will use two of them, one from each branch.<br><br>
In **Deletion** I will use the **Deleting Columns** technique.<br>

Sometimes we can drop a variable if its data is missing for more than 60% of observations, because too little of it is left to be useful.

In **Imputation**, because our problem is a general one, I will use **SimpleImputer** from the **scikit-learn** library.<br>

I will delete the **Product_Category_3** column because more than **60%** of its observations are missing.<br>

And I will impute the **Product_Category_2** column because fewer than **60%** of its observations are missing.


In [None]:
data_train = data_train.drop('Product_Category_3', axis=1)
print("Product_Category_3 column dropped successfully")

In [None]:
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(pd.DataFrame(data_train['Product_Category_2']))
data_train['Product_Category_2'] = imputer.transform(pd.DataFrame(data_train['Product_Category_2']))
data_train['Product_Category_2'] = np.round(data_train['Product_Category_2'])

print("Product_Category_2 column imputed successfully")
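
One possible alternative, offered only as a sketch: if **Product_Category_2** holds category codes rather than quantities (an assumption on my side), a rounded mean can land on a code that has little to do with the missing rows. Filling with the most frequent code avoids that. The toy values below are made up; in scikit-learn the counterpart would be `SimpleImputer(strategy='most_frequent')`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Product_Category_2 column (invented values).
codes = pd.Series([2.0, 8.0, np.nan, 8.0, np.nan, 14.0], name="Product_Category_2")

# mode() returns the most frequent value(s); take the first one.
most_common = codes.mode().iloc[0]

# Fill the gaps with that code and go back to integer codes.
filled = codes.fillna(most_common).astype(int)

print(filled.tolist())
```

Whether the mean or the most frequent value is the better choice depends on what the column really encodes, so treat this as an option rather than a correction.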

**Moment of truth** 😧<br>
Now let's check whether our data still contains any missing values

In [None]:
#print the largest per-column count of null values (0 means nothing is missing)
print('Number of missing values = ',data_train.isnull().sum().max())

Now let's get some background on our cleaned data 💪: the data types 😊

In [None]:
data_train.dtypes

# 3. Exploratory Data Analysis
Now we will take each column in turn and build our insights from it. Let's start with the **Gender** column 👇.
## Gender 👨👩

**Question 1 :** Do females make higher-value purchases than males?

### Exploratory Data Analysis (Numerically) 💰

In [None]:
malesPurchaserData = data_train.loc[data_train['Gender'] == 'M']
malesPurchaseMean = np.mean(malesPurchaserData['Purchase'])
print("Purchase mean for male purchasers = ",malesPurchaseMean)

femalsPurchaserData = data_train.loc[data_train['Gender'] == 'F']
femalsPurchaseMean = np.mean(femalsPurchaserData['Purchase'])
print("Purchase mean for female purchasers = ",femalsPurchaseMean)

### Exploratory Data Analysis (Graphically) 📈

In [None]:
labels=['males','females']
values = [malesPurchaseMean,femalsPurchaseMean]

plt.bar(labels,values, width=.9, facecolor='b', edgecolor='g', alpha=.5)
plt.text(-0.4,10800,'Average purchase in black friday based on gender')
         
plt.show()

**Answer :** No, males make higher-value purchases than females, but the difference is not large.
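
The two filtered means above can also be computed in one step with `groupby`. A minimal sketch on invented numbers (the `toy` frame is an assumption, not the real data):

```python
import pandas as pd

# Toy purchases (invented values) with the same column names as data_train.
toy = pd.DataFrame({
    "Gender":   ["M", "F", "M", "F", "M"],
    "Purchase": [9000, 8000, 10000, 9000, 9500],
})

# One mean per gender, indexed by the gender label.
mean_by_gender = toy.groupby("Gender")["Purchase"].mean()

# Relative gap between the two means, to judge how large the difference is.
relative_gap = (mean_by_gender["M"] - mean_by_gender["F"]) / mean_by_gender["F"]

print(mean_by_gender)
print("relative gap =", round(relative_gap, 3))
```

Looking at the relative gap rather than the raw difference makes "not large" easier to judge.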
<hr>

**Question 2 :** Are there more male buyers than female buyers?

### Exploratory Data Analysis (Numerically) 💰

In [None]:
print('Number of male purchasers = ',malesPurchaserData.shape[0])
print('Number of female purchasers = ',femalsPurchaserData.shape[0])

### Exploratory Data Analysis (Graphically) 📈

In [None]:
genderCountData = [malesPurchaserData.shape[0],femalsPurchaserData.shape[0]]
labels=['Males','Females']
plt.axis('equal')

plt.pie(genderCountData, labels=labels,
              explode=[0.1,0],
              autopct='%1.1f%%',
              shadow=True,
              startangle=0,
              labeldistance=1.1,
              pctdistance=.6)

plt.legend(labels)
plt.title('Purchaser number in black friday based on gender')
plt.show()

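**Answer :** Yes, there are significantly more male buyers than female buyers. Note that each row is one purchase record, so these counts measure purchases rather than distinct people.<br><br>
**Decision :** We should focus more on males than on females in the next marketing campaigns, since the ratio is 75 to 25 percent respectively.<hr>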
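
Counting rows and counting distinct buyers can give different pictures when one person buys many times. The sketch below shows the difference on invented data; it assumes the **User_ID** column is still available, whereas this kernel dropped it earlier, so the column would have to be kept for this check.

```python
import pandas as pd

# Toy purchase log (invented): user 1 bought three times, user 3 twice.
toy = pd.DataFrame({
    "User_ID": [1, 1, 1, 2, 3, 3],
    "Gender":  ["M", "M", "M", "F", "M", "M"],
})

# Rows per gender: what shape[0] on the filtered frame measures.
rows_by_gender = toy["Gender"].value_counts()

# Distinct buyers per gender: each User_ID counted once.
buyers_by_gender = toy.groupby("Gender")["User_ID"].nunique()

print(rows_by_gender)
print(buyers_by_gender)
```

In this toy log males account for 5 of 6 rows but only 2 of 3 buyers, which is why the two measures are worth separating.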
Now let's deal with **Occupation** column 👇.
## Occupation 💼
The Occupation column is a numerical column with unique values from 0 to 20. I don't know the meaning of these values, but let's show them 😊.

In [None]:
list(data_train['Occupation'].sort_values().unique())

**Question 3 :** Does the purchase value clearly vary with the occupation value?
### Exploratory Data Analysis (Numerically) 💰

In [None]:
labels = []
values = []

for uniqueOccupationValue in data_train['Occupation'].sort_values().unique():
    OccPurchaserData = data_train.loc[data_train['Occupation'] == uniqueOccupationValue]
    OccPurchaserMean = np.mean(OccPurchaserData['Purchase'])
    labels.append(uniqueOccupationValue)
    values.append(OccPurchaserMean)
    
    print("When occupation = ",uniqueOccupationValue," mean purchase value = ",OccPurchaserMean)
    print("------------------------------------------------------------")


### Exploratory Data Analysis (Graphically) 📈

In [None]:
plt.bar(labels,values, width=.9, facecolor='b', edgecolor='w', alpha=.5)
plt.text(-1,10800,'Average purchase in black friday based on Occupation')
         
plt.show()

**Answer :** Purchase value is not affected by occupation value, because the mean purchase values for the different occupation values are close to each other.
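
"Close to each other" can be made a bit more concrete by comparing the spread of the group means with the overall mean. This is just one possible way to do it, sketched on invented numbers (the `toy` frame and the `spread` measure are my assumptions):

```python
import pandas as pd

# Toy data (invented values) with the kernel's column names.
toy = pd.DataFrame({
    "Occupation": [0, 0, 1, 1, 2, 2],
    "Purchase":   [9000, 9200, 9100, 9500, 8800, 9400],
})

# Mean purchase per occupation value.
means = toy.groupby("Occupation")["Purchase"].mean()

# Gap between the highest and lowest group mean, relative to the overall mean.
spread = (means.max() - means.min()) / toy["Purchase"].mean()

print(means)
print("relative spread =", round(spread, 4))
```

A small relative spread supports the reading that occupation has little effect on the average purchase.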

<hr>
Now let's deal with **City_Category** column 👇.
## City_Category 🏡
The City_Category column is a categorical column with the unique values **A, B,** and **C**. I don't know the meaning of these values, but I guess that category **A** is better than category **B**, and category **B** is better than category **C**, in terms of city level. Let's show them 😊.

In [None]:
list(data_train['City_Category'].sort_values().unique())

**Question 4 :** Does the purchase value clearly vary with the city category?
### Exploratory Data Analysis (Numerically) 💰

In [None]:
labels = []
values = []
cityCatCount = []

for uniqueCityCategoryValue in data_train['City_Category'].sort_values().unique():
    CityCatPurchaserData = data_train.loc[data_train['City_Category'] == uniqueCityCategoryValue]
    CityCatPurchaserMean = np.mean(CityCatPurchaserData['Purchase'])
    labels.append(uniqueCityCategoryValue)
    values.append(CityCatPurchaserMean)
    cityCatCount.append(CityCatPurchaserData.shape[0])
    
    print("When City_Category = ",uniqueCityCategoryValue," mean Purchase value = ",CityCatPurchaserMean)
    print("------------------------------------------------------------------------------------------")


### Exploratory Data Analysis (Graphically) 📈

In [None]:
plt.bar(labels,values, width=.9, facecolor='b', edgecolor='w', alpha=.5)
plt.text(-0.7,10800,'Average purchase in black friday based on City_Category')
         
plt.show()

**Answer :** Purchase value doesn't vary much from one category to another. But we can say that people who live in category C cities purchase more on average than people who live in category A or B cities.

<hr>

According to the info above ☝<br>
**Question 5 :** Can we conclude that there are more purchasers living in category C cities than in category A or B cities?

### Exploratory Data Analysis (Numerically) 💰

In [None]:
for uniqueCityCategoryValue in data_train['City_Category'].sort_values().unique():
    CityCatPurchaserData = data_train.loc[data_train['City_Category'] == uniqueCityCategoryValue]
    print("Purchasers count who live in city of category ",uniqueCityCategoryValue," = ",CityCatPurchaserData.shape[0])
    print("------------------------------------------------------------------------------------------")

### Exploratory Data Analysis (Graphically) 📈

In [None]:
values = cityCatCount
labels = ['A', 'B', 'C']
plt.axis('equal')

plt.pie(values, labels=labels,
              explode=[0.1,0.05,0.05],
              autopct='%1.1f%%',
              shadow=True,
              startangle=0,
              labeldistance=1.1,
              pctdistance=.6)

plt.legend(labels)
plt.title('Number of purchasers in black friday based on City_Category')
plt.show()

**Answer :** No, because category B cities have the largest number of purchasers, more than A or C.
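
Questions 4 and 5 look at the same groups from two angles: the average purchase and the number of purchases. Both can sit side by side in one table. A minimal sketch on invented values (the `toy` frame is an assumption):

```python
import pandas as pd

# Toy data (invented values) with the kernel's column names.
toy = pd.DataFrame({
    "City_Category": ["A", "B", "B", "C", "B", "C"],
    "Purchase":      [8000, 9000, 9400, 9800, 8900, 10000],
})

# One row per city category, with the mean purchase and the row count.
summary = toy.groupby("City_Category")["Purchase"].agg(["mean", "count"])

print(summary)
print("highest mean :", summary["mean"].idxmax())
print("most rows    :", summary["count"].idxmax())
```

In this toy example the category with the highest mean is not the one with the most rows, which mirrors the distinction between the two questions.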
<hr>
Now let's deal with Stay_In_Current_City_Years column 👇.
## Stay_In_Current_City_Years 🌔
Stay_In_Current_City_Years is a categorical column with the unique values **'0'**, **'1'**, **'2'**, **'3'**, and **'4+'**. It represents the number of years a purchaser has stayed in the current city.<br>Let's show them 😊.

In [None]:
list(data_train['Stay_In_Current_City_Years'].sort_values().unique())

Now we will replace the **'4+'** value with 4 so that we can convert the **Stay_In_Current_City_Years** column from string to int. 👇

In [None]:
data_train.loc[data_train['Stay_In_Current_City_Years'] == '4+','Stay_In_Current_City_Years'] = '4'
data_train['Stay_In_Current_City_Years'] = pd.to_numeric(data_train['Stay_In_Current_City_Years'])
print("Stay_In_Current_City_Years converted to int successfully")
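
The same conversion can be written as a single chained expression by stripping the trailing `+` before casting. A minimal sketch on a made-up series (the `stay` values are assumptions for illustration):

```python
import pandas as pd

# Toy stand-in for the Stay_In_Current_City_Years column (invented values).
stay = pd.Series(["2", "4+", "0", "3", "1", "4+"])

# Remove a trailing '+' where present, then cast the strings to integers.
stay_int = stay.str.rstrip("+").astype(int)

print(stay_int.tolist())
```

After the conversion, 4 still stands for the original "4 or more" bucket, so it should be read as a lower bound rather than an exact number of years.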

**Question 6 :** Do people who have stayed more years buy less than people who have stayed fewer?

### Exploratory Data Analysis (Numerically) 💰

In [None]:
labels = []
values = []
yearsCountData = []

for uniqueYearsValue in data_train['Stay_In_Current_City_Years'].sort_values().unique():
    CityYearsPurchaserData = data_train.loc[data_train['Stay_In_Current_City_Years'] == uniqueYearsValue]
    CityYearsPurchaserMean = np.mean(CityYearsPurchaserData['Purchase'])
    labels.append(uniqueYearsValue)
    values.append(CityYearsPurchaserMean)
    yearsCountData.append(CityYearsPurchaserData.shape[0])
    
    if uniqueYearsValue != 4:
        print("Mean purchase of people who stay ",uniqueYearsValue," years = ",CityYearsPurchaserMean)
    else:
        #4 stands for the original '4+' bucket (4 years or more)
        print("Mean purchase of people who stay ",uniqueYearsValue," years or more = ",CityYearsPurchaserMean)



### Exploratory Data Analysis (Graphically) 📈

In [None]:
plt.bar(labels,values, width=.9, facecolor='b', edgecolor='w', alpha=.5)
plt.text(-0.7,10800,'Average purchase in black friday based on Stay In Current City Years')
         
plt.show()

**Answer :** No, because the mean purchases are very close despite the varying years of stay.
<hr>

**Question 7 :** Are people who have stayed longer in the city more active buyers?

### Exploratory Data Analysis (Numerically) 💰

In [None]:
for uniqueYearsValue in data_train['Stay_In_Current_City_Years'].sort_values().unique():
    CityYearsPurchaserData = data_train.loc[data_train['Stay_In_Current_City_Years'] == uniqueYearsValue]
    if uniqueYearsValue != 4:
        print("Number of purchasers who stay ",uniqueYearsValue," years = ",CityYearsPurchaserData.shape[0])
    else:
        #4 stands for the original '4+' bucket (4 years or more)
        print("Number of purchasers who stay ",uniqueYearsValue," years or more = ",CityYearsPurchaserData.shape[0])

    

### Exploratory Data Analysis (Graphically) 📈

In [None]:
values = yearsCountData
labels = list(data_train['Stay_In_Current_City_Years'].sort_values().unique())
plt.axis('equal')

plt.pie(values, labels=labels,
              explode=[0,0.1,0,0,0],
              autopct='%1.1f%%',
              shadow=True,
              startangle=0,
              labeldistance=1.1,
              pctdistance=.6)

plt.legend(labels)
plt.title('Number of purchasers in black friday based on Stay_In_Current_City_Years')
plt.show()

**Answer :** No, because people who have stayed longer in the city are less active buyers than the others.<br>
**Decision :** We should focus on people who have stayed about 1 year in the city, because they are more active buyers than the others.
<hr>

## Marital_Status 🚩
Marital_Status is a numerical column with two values, **0** and **1**. This column describes whether the purchaser is married or not. Let's show it 😊

In [None]:
list(data_train['Marital_Status'].unique())

**Question 8 :** Do married people buy more than the others?
### Exploratory Data Analysis (Numerically) 💰

In [None]:
labels = []
values = []
maritalStatusCount = []

for maritalStatusValue in data_train['Marital_Status'].unique():
    maritalStatusPurchaserData = data_train.loc[data_train['Marital_Status'] == maritalStatusValue]
    maritalStatusPurchaserMean = np.mean(maritalStatusPurchaserData['Purchase'])
    labels.append(maritalStatusValue)
    values.append(maritalStatusPurchaserMean)
    maritalStatusCount.append(maritalStatusPurchaserData.shape[0])
    
    print("Mean purchase of people whose marital status is ",maritalStatusValue," = ",maritalStatusPurchaserMean)

### Exploratory Data Analysis (Graphically) 📈

In [None]:
plt.bar(labels,values, width=.9, facecolor='b', edgecolor='w', alpha=.5)
plt.text(-0.7,10800,'Average purchase in black friday based on Marital_Status')
         
plt.show()

**Answer :** No, because married and unmarried purchasers have almost the same average purchase.
<hr>

**Question 9 :** Are married people more active buyers?

### Exploratory Data Analysis (Numerically) 💰

In [None]:
for maritalStatusValue in data_train['Marital_Status'].unique():
    maritalStatusPurchaserData = data_train.loc[data_train['Marital_Status'] == maritalStatusValue]
    print("Number of purchasers whose marital status is ",maritalStatusValue," = ",maritalStatusPurchaserData.shape[0])

### Exploratory Data Analysis (Graphically) 📈

In [None]:
values = maritalStatusCount
labels = list(data_train['Marital_Status'].unique())
plt.axis('equal')

plt.pie(values, labels=labels,
              explode=[0,0.1],
              autopct='%1.1f%%',
              shadow=True,
              startangle=0,
              labeldistance=1.1,
              pctdistance=.6)

plt.legend(labels)
plt.title('Number of purchasers in black friday based on Marital_Status')
plt.show()

**Answer :** Yes, because 59% of purchasers are married. This reading assumes that the 59% slice carries the code that stands for married; the chart labels are the raw 0/1 codes, so it is worth confirming that mapping before acting on it.<br>
**Decision :** We should focus more on married people than on singles in the next marketing campaigns, since the ratio is 59 to 41 percent respectively.
<hr>

**Question 10 :** Does more profit come from products of category 1 or from products of category 2?
### Exploratory Data Analysis (Numerically) 💰


In [None]:
#note: each sum adds up the values stored in the category column itself, not purchase amounts
Product_Category_1_sum = data_train['Product_Category_1'].sum()
Product_Category_2_sum = data_train['Product_Category_2'].sum()

print("Profit from product of category 1 = ",Product_Category_1_sum)
print("Profit from product of category 2 = ",Product_Category_2_sum)

### Exploratory Data Analysis (Graphically) 📈

In [None]:
labels = ['Product_Category_1','Product_Category_2']
values = [Product_Category_1_sum,Product_Category_2_sum]

plt.bar(labels,values, width=.9, facecolor='b', edgecolor='w', alpha=.5)
plt.text(-0.7,6000000,'Comparison between number of purchases of category_1 and category_2 in black friday')
         
plt.show()

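**Answer :** Profit from products of category 2 is higher than from products of category 1. Keep in mind that the sums above add up the values of the category columns themselves, which appear to be category codes rather than amounts of money, so this comparison should be read with caution.<br>
**Decision :** We should focus on products of category 2 in the next marketing campaigns.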
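
If the product category columns hold category codes (my assumption about this dataset), a money-based comparison would sum the **Purchase** column within each category instead of summing the codes. The sketch below uses invented values, and it measures the amount spent, which is not the same thing as profit:

```python
import pandas as pd

# Toy data (invented values) with the kernel's column names.
toy = pd.DataFrame({
    "Product_Category_1": [1, 1, 5, 8, 5],
    "Purchase":           [15000, 12000, 7000, 6000, 5000],
})

# Total amount spent per category code, largest first.
total_by_category = (
    toy.groupby("Product_Category_1")["Purchase"]
       .sum()
       .sort_values(ascending=False)
)

print(total_by_category)
```

The top rows of such a table point to the category codes that bring in the most money, which may be a more direct input for a marketing decision.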
<hr>
**Finally, I hope this kernel is helpful, and any suggestions to improve it will be much appreciated.**<br>

**Good luck. 👍**