In [None]:
#import necessary libraries
import warnings

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
%matplotlib inline

### Business Understanding

Black Friday is the 'official' kickoff to the holiday shopping season, the most important shopping period of the year. Black Friday 2018 opened a shopping season that included what was then the highest U.S. e-commerce sales day in history, with $7.9 billion in revenue. It's important for sellers to study historical sales data and prepare early for the next shopping season so as not to lose ground.

I'm going to explore the Black Friday Dataset from Kaggle. The questions I will target are:

Question 1: Which users spent the most during Black Friday? List the top 20 spenders.

Question 2: How are users distributed across age groups, and how does that split by gender?

Question 3: Which products were most popular during Black Friday? List the top 20.

Question 4: How are users distributed by occupation across the different city categories?

Question 5: How do Gender, Age, Occupation, City_Category, Stay_In_Current_City_Years, Marital_Status, and Product_Category_x correlate with Purchase?

Answering these questions may help sellers understand how customer purchase behaviour varies across products, so that they can prepare well for the next shopping season.

### Data Understanding
This project will use Black Friday Dataset From Kaggle, which is a sample of the transactions made in a retail store. 

Below are the steps to look at and understand the dataset.

In [None]:
# Read in the Complete Dataset
BlackFriday_Dataset = pd.read_csv('./BlackFriday.csv')
BlackFriday_Dataset.head()

In [None]:
# Get the Basic info of the dataset
BlackFriday_Dataset.describe()

In [None]:
BlackFriday_Dataset.info()

In [None]:
num_rows = BlackFriday_Dataset.shape[0] #Provide the number of rows    in the dataset
num_cols = BlackFriday_Dataset.shape[1] #Provide the number of columns in the dataset
print("Row    number: {}".format(num_rows))
print("Column number: {}".format(num_cols))

In [None]:
# To check the column names in the dataset
BlackFriday_Dataset.columns

### Prepare Data
Some data preparation steps need to be done before using the dataset for exploration, including:
1. Checking which columns have missing values and analyzing the impact
2. Dealing with the missing values
3. Encoding categorical variables such as Gender, Age, City_Category, and Stay_In_Current_City_Years as integers

In [None]:
# Data Preparation Step 1: check how many missing values are in the dataset
BlackFriday_Dataset.isnull().sum()

The check shows that missing values exist only in the columns "Product_Category_2" and "Product_Category_3". From the `describe()` output, the minimum values of both columns are non-zero, so 0 is not used as a real category code. My understanding is that a missing value in either column means the purchased product does not belong to a second or third category. We can therefore fill the missing values with 0.
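
The reasoning above can be checked on a small toy frame. This is only a sketch with made-up values (the `toy` frame and its contents are hypothetical, not taken from the dataset); it confirms that 0 is free to use as a sentinel and shows that `fillna` returns a new frame rather than modifying the original.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the two sparse columns (values are made up for illustration)
toy = pd.DataFrame({
    "Product_Category_2": [2.0, np.nan, 14.0],
    "Product_Category_3": [np.nan, np.nan, 5.0],
})

# Confirm 0 is not already a real category code before using it as a sentinel
zero_is_free = not (toy == 0).any().any()

# fillna returns a new DataFrame; the original is left unchanged
filled = toy.fillna(0)

missing_before = int(toy.isnull().sum().sum())
missing_after = int(filled.isnull().sum().sum())
```

Because `toy` still contains its missing values afterwards, the result has to be assigned to a name for the fill to have any effect downstream.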

In [None]:
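# Data Preparation Step 2: Fill the missing cells with zero
# fillna returns a new DataFrame, so assign the result back to keep the change
BlackFriday_Dataset = BlackFriday_Dataset.fillna(0)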

In [None]:
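# Data Preparation Step 3: Encode the categorical variables as integers
# LabelEncoder maps each category to an integer code (label encoding, not true one-hot);
# the '_onehot_encode' column names are kept because later cells refer to them
# Encoded features: Gender, Age, City_Category, Stay_In_Current_City_Years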
le = LabelEncoder()
BlackFriday_Dataset['Gender_onehot_encode']              = le.fit_transform(BlackFriday_Dataset['Gender'].astype(str))
BlackFriday_Dataset['Age_onehot_encode']                 = le.fit_transform(BlackFriday_Dataset['Age'].astype(str))
BlackFriday_Dataset['City_Category_onehot_encode']       = le.fit_transform(BlackFriday_Dataset['City_Category'].astype(str))
BlackFriday_Dataset['Stay_In_Current_City_Years_encode'] = le.fit_transform(BlackFriday_Dataset['Stay_In_Current_City_Years'].astype(str))
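
The cell above uses `LabelEncoder`, which produces a single integer column per feature. For comparison, here is a rough sketch of how that differs from one-hot encoding, on a hypothetical toy column. It uses the pandas `cat.codes` accessor as a stand-in for `LabelEncoder`, on the assumption that both assign codes in sorted category order for string data, and `pd.get_dummies` for the one-hot version.

```python
import pandas as pd

# Toy column with made-up values, for illustration only
toy = pd.DataFrame({"City_Category": ["A", "C", "B", "A"]})

# Label encoding: one integer column; codes follow sorted category order (A=0, B=1, C=2)
label_codes = toy["City_Category"].astype("category").cat.codes

# One-hot encoding: one indicator column per category
one_hot = pd.get_dummies(toy["City_Category"], prefix="City_Category")
```

Label encoding implies an ordering between categories, which is reasonable for Age or Stay_In_Current_City_Years but arbitrary for City_Category or Gender; one-hot encoding avoids that at the cost of extra columns.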

In [None]:
BlackFriday_Dataset.head()

### Answer Questions based on the dataset
The cells below answer the questions listed above through data exploration.

In [None]:
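def plot_groupby(col1, col2, agg='count'):
    # Group by col1, aggregate col2 ('count' of rows or 'sum' of values), and plot the top 20
    plt.figure(figsize = (20,8))
    BlackFriday_Dataset.groupby(col1)[col2].agg(agg).nlargest(20).sort_values().plot(kind='barh')

In [None]:
# Question 1: Which users spent the most during Black Friday? List the top 20 spenders
# Use the sum of Purchase so users are ranked by amount spent, not by number of purchases
plot_groupby('User_ID','Purchase', agg='sum')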

It's important for the seller to identify high-value customers. Customers with higher purchase amounts should be valued. Understanding their needs will help the merchant make better operational decisions about product range, pricing, after-sales service, and so on. Loyalty programs and targeted advertisements can help keep these customers shopping with the merchant.
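
Ranking users by total amount spent and ranking them by number of purchases can give different answers. A small sketch on made-up data (the `toy` frame is hypothetical) shows the two rankings side by side:

```python
import pandas as pd

# Made-up transactions: user 1 buys often but cheaply, user 2 buys once but spends a lot
toy = pd.DataFrame({
    "User_ID": [1, 1, 1, 2, 3, 3],
    "Purchase": [100, 200, 50, 5000, 800, 900],
})

# Number of purchase records per user
purchase_counts = toy.groupby("User_ID")["Purchase"].count()
# Total amount spent per user
purchase_totals = toy.groupby("User_ID")["Purchase"].sum()

top_by_count = purchase_counts.nlargest(2).index.tolist()
top_by_total = purchase_totals.nlargest(2).index.tolist()
```

For a question about who spent the most, the `sum` aggregation is the one that matches; `count` answers who bought most often.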

In [None]:
# Question 2: How about the User Distribution by Age Group? And also consider Gender
plt.figure(figsize = (20,8))
# Pass the column by keyword; newer seaborn versions no longer accept it positionally
sns.countplot(x='Age', data=BlackFriday_Dataset)

In [None]:
plt.figure(figsize = (20,8))
# Same count plot, split by gender within each age group
sns.countplot(x='Age', hue='Gender', data=BlackFriday_Dataset)

The plot shows that most purchase records in the Black Friday sale come from the age groups 26-35, 36-45, and 18-25. This seems reasonable, as these customers are in the prime of their lives: they probably earn more than other age groups and also have more shopping needs.

From the second plot, male customers account for more purchases than female customers in every age group. I think this is because the most worthwhile things to buy on Black Friday are electrical appliances, small appliances, and game consoles, as well as Apple products, especially the iPad, whose Black Friday price is often the best of the year. Such products may be more popular with male customers.
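
Note that a count plot over the raw table counts purchase records, so a user with many purchases is counted many times. A minimal sketch, on hypothetical toy data, of counting distinct users per age group instead:

```python
import pandas as pd

# Made-up records: user 1 appears three times, users 2 and 3 once each
toy = pd.DataFrame({
    "User_ID": [1, 1, 1, 2, 3],
    "Age": ["26-35", "26-35", "26-35", "18-25", "18-25"],
    "Gender": ["M", "M", "M", "F", "M"],
})

# Purchase records per age group (what a count plot of the raw table shows)
rows_per_age = toy.groupby("Age")["User_ID"].count()

# Distinct users per age group
users_per_age = toy.groupby("Age")["User_ID"].nunique()
```

In this toy example the 26-35 group has the most records but the fewest distinct users, so the two views can lead to different conclusions.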

In [None]:
# Question 3: Which products are most popular during Black Friday, list the top 20
plot_groupby('Product_ID','Purchase')

Listing the most popular products may help the merchant adjust their business strategy and prepare better for the next shopping season, so as to increase revenue and profit.

In [None]:
# Question 4: Look at the users again, this time focus on group by Occupation in different city
plt.figure(figsize = (20,8))
# Count purchase records per occupation, split by city category
sns.countplot(x='Occupation', hue='City_Category', data=BlackFriday_Dataset)

The plot shows that for almost all occupation categories, users from City B did more shopping than users from City A and City C. I think the reason is that City B is larger than City A and City C and thus has a larger population. Customers in occupations 0, 4, and 7 also did more shopping than those in other occupations.
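
Since raw counts are affected by how many records each city contributes overall, one way to compare cities on equal footing is to look at each occupation's share within a city. A sketch using `pd.crosstab` on made-up data (the `toy` frame is hypothetical):

```python
import pandas as pd

# Made-up records for two occupations across three city categories
toy = pd.DataFrame({
    "Occupation": [0, 0, 0, 4, 4, 0, 4, 4],
    "City_Category": ["A", "B", "B", "B", "B", "C", "C", "C"],
})

# Raw record counts per occupation and city (what the count plot shows)
counts = pd.crosstab(toy["Occupation"], toy["City_Category"])

# Share of each occupation within a city; every column sums to 1
shares = pd.crosstab(toy["Occupation"], toy["City_Category"], normalize="columns")
```

Here City B has the highest raw counts for both occupations, while the shares show that occupation 0 is proportionally most common in City A.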

In [None]:
# Question 5: Correlation between Gender, Age, Occupation, City_Category, Stay_In_Current_City_Years, Marital_Status, Product_Category_x vs Purchase
Correlation_DF = BlackFriday_Dataset[['Gender_onehot_encode', 'Age_onehot_encode', 'Occupation', 'City_Category_onehot_encode', 
    'Stay_In_Current_City_Years_encode', 'Marital_Status', 'Product_Category_1', 'Product_Category_2', 'Product_Category_3', 'Purchase']]
Correlation_DF.corr()

In [None]:
colormap = plt.cm.inferno
plt.figure(figsize=(16,12))
plt.title('Correlation between Gender, Age, Occupation, City_Category, Stay_In_Current_City_Years, Marital_Status, Product_Category_x vs Purchase', y=1.05, size=15)
sns.heatmap(Correlation_DF.corr(),linewidths=0.1,vmax=1.0, 
            square=True, cmap=colormap, linecolor='white', annot=True)

From the correlation heatmap above, we can conclude that Gender and City_Category are the features most positively correlated with Purchase, while all Product_Category features are negatively correlated with Purchase. Marital_Status and Stay_In_Current_City_Years show little relationship with Purchase. All three Product_Category features are highly correlated with each other. Besides that, we can also see that Marital_Status is highly correlated with Age, which is quite reasonable.
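
The correlations for Gender and City_Category are computed on integer codes, so their values depend on how the categories happen to be numbered. As a complementary check, one could compare the mean Purchase per category directly. A rough sketch on hypothetical toy data:

```python
import pandas as pd

# Made-up purchases for three city categories
toy = pd.DataFrame({
    "City_Category": ["A", "A", "B", "B", "C", "C"],
    "Purchase": [100, 300, 400, 600, 900, 1100],
})

# Mean purchase per category: does not depend on any numeric coding of the labels
mean_by_city = toy.groupby("City_Category")["Purchase"].mean()

# Pearson correlation on integer codes (A=0, B=1, C=2), as in the heatmap approach
codes = toy["City_Category"].astype("category").cat.codes
pearson_on_codes = codes.corr(toy["Purchase"])
```

The group means show the size of the difference between categories, which a single correlation coefficient on arbitrary codes does not.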