# Exploratory Data Analysis on the Black Friday dataset

## Importing the dependencies

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Problem Statement

A retail company, “ABC Private Limited”, wants to understand customer purchase behaviour (specifically, purchase amount) against various products of different categories. They have shared a purchase summary of various customers for selected high-volume products from last month. The data set also contains customer demographics (age, gender, marital status, city_type, stay_in_current_city), product details (product_id and product category) and the total purchase_amount from last month.

Now they want to build a model that predicts a customer's purchase amount for various products, which will help them create personalized offers for customers against different products.

## Importing the training dataset

In [None]:
df_train = pd.read_csv('black_friday_train.csv')

In [None]:
df_train.head()

## Importing the testing dataset

In [None]:
df_test = pd.read_csv('black_friday_test.csv')

In [None]:
df_test.head()

## Combining both the train and test data

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows the same way
df = pd.concat([df_train, df_test])
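If the test file lacks the target column (an assumption here, based on the usual layout of this dataset), the combined frame will hold missing values in that column for every test row. A minimal sketch, using small made-up frames instead of the real CSVs, of how the two parts could be labelled while combining so that they can be separated again later; the `keys` labels and the toy values are illustrative only:

```python
import pandas as pd

# Toy stand-ins for the real files; values are made up for illustration
train = pd.DataFrame({"Gender": ["M", "F"], "Purchase": [8370, 15200]})
test = pd.DataFrame({"Gender": ["F", "M", "M"]})  # assumed: no target column

# keys adds an outer index level recording where each row came from
combined = pd.concat([train, test], keys=["train", "test"])

# Selecting on the outer level recovers each part
train_part = combined.loc["train"]
test_part = combined.loc["test"]

print(combined)
```

Rows that came from the toy test frame show `NaN` in `Purchase`, which is worth keeping in mind when reading the missing-value counts of a combined frame.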

In [None]:
df.head()

In [None]:
df.tail()

## Basic Data stats

In [None]:
df.shape

In [None]:
df.describe()

In [None]:
df.info()

In [None]:
df.isnull().sum()

## Data analysis

## Data cleaning

In [None]:
# The customer ID column is unnecessary for our purposes so we can remove it entirely
df.drop(['User_ID'], axis = 1, inplace = True)

In [None]:
df.head()

## Handling Categorical features (Age, Gender, City category, etc.)

## Gender

In [None]:
df['Gender']

In [None]:
df['Gender'].value_counts()

In [None]:
df.groupby('Gender').size().plot(kind = 'barh', color = ['green', 'orange'])

One option is one-hot encoding of the Gender column by creating dummy variables:

In [None]:
# pd.get_dummies(df['Gender'])

But we can map male and female directly to 0 and 1 in place rather than creating dummy variables:

In [None]:
# df['Gender'] = df['Gender'].map({'M': 0,'F': 1})
df['Gender']=df['Gender'].apply(lambda x:1 if x == 'F' else 0)
df.head()
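The three encodings mentioned above behave differently when a value is missing or unexpected. A small sketch on a made-up series (the `None` entry is added purely for illustration; nothing here says the real `Gender` column contains gaps):

```python
import pandas as pd

# Toy series with one missing entry to expose the differences
gender = pd.Series(["M", "F", "F", None], name="Gender")

# map: anything not in the dictionary becomes NaN
mapped = gender.map({"M": 0, "F": 1})

# apply with a lambda: anything that is not 'F' silently becomes 0
applied = gender.apply(lambda x: 1 if x == "F" else 0)

# get_dummies with drop_first: one indicator column, here for 'M'
dummies = pd.get_dummies(gender, drop_first=True)

print(mapped.tolist())
print(applied.tolist())
print(dummies.columns.tolist())
```

With `map`, a gap stays visible as `NaN`; with the lambda, it is folded into the 0 class, so checking `value_counts()` beforehand is a reasonable safeguard.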

## Age

In [None]:
df['Age'].dtype

In [None]:
df['Age'].value_counts()

In [None]:
df['Age'].unique()

In [None]:
df.groupby(['Age']).size().plot(kind = 'barh', color = sns.color_palette('Dark2'))

In [None]:
df['Age'] = df['Age'].map({'0-17': 1, '18-25' : 2, '26-35' : 3, '36-45' : 4, '46-50': 5, '51-55': 6, '55+': 7})
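The dictionary above encodes the age brackets as ordered integers. An alternative sketch, assuming the same seven bracket labels, uses an ordered `pd.Categorical` so that the ordering is stated once as a list; the sample values are made up:

```python
import pandas as pd

# Bracket labels in ascending order, as used in the mapping above
age_order = ["0-17", "18-25", "26-35", "36-45", "46-50", "51-55", "55+"]

# Toy sample of the Age column
age = pd.Series(["26-35", "0-17", "55+", "46-50"])

# An ordered categorical assigns codes 0..6 following age_order;
# a label not in age_order would get code -1
age_cat = pd.Categorical(age, categories=age_order, ordered=True)

# Shift by one to match the 1..7 scheme of the dictionary
age_codes = pd.Series(age_cat.codes + 1, index=age.index)

print(age_codes.tolist())
```

Both approaches give the same integers for known labels; the difference is that `map` turns an unknown label into `NaN`, while the categorical codes turn it into 0 after the shift.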

In [None]:
df.head()

## City Category

In [None]:
df.columns

In [None]:
df['City_Category']

We can use one-hot encoding for this type of categorical feature.

In [None]:
df = pd.get_dummies(df, columns = ['City_Category'], drop_first=True)
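To make the effect of `drop_first=True` concrete, here is a small sketch on a made-up frame. The category labels `A`, `B`, `C` and the `Purchase` values are assumptions for illustration:

```python
import pandas as pd

# Toy frame with three city categories
toy = pd.DataFrame({
    "City_Category": ["A", "C", "B", "A"],
    "Purchase": [10, 20, 30, 40],
})

# drop_first removes the first level (A), which becomes the baseline:
# a row with 0 in both remaining columns belongs to category A
encoded = pd.get_dummies(toy, columns=["City_Category"], drop_first=True)

print(encoded.columns.tolist())
print(encoded)
```

Dropping one level avoids a redundant column, since the dropped category is implied whenever all the remaining indicators are 0.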

In [None]:
df.head()

In [None]:
df.info()

So, we have encoded most of the categorical features; now we move on to further processing.

## Missing values

In [None]:
df.isnull().sum()

In [None]:
sns.heatmap(df.isnull(), yticklabels = False, cbar = False, cmap = 'viridis')
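Alongside the raw counts and the heatmap, the share of missing values per column can help decide how to treat each one. A minimal sketch on a made-up frame (the values are invented; only the column names follow the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in two of the product category columns
toy = pd.DataFrame({
    "Product_Category_1": [1, 5, 8, 3],
    "Product_Category_2": [np.nan, 6.0, np.nan, 8.0],
    "Product_Category_3": [np.nan, np.nan, np.nan, 14.0],
})

# The mean of a boolean mask is the fraction of True values per column
missing_share = toy.isnull().mean().sort_values(ascending=False)

print(missing_share)
```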

Next, we focus on handling the missing values.

## Product category 2

In [None]:
df['Product_Category_2']

In [None]:
df['Product_Category_2'].unique()

In [None]:
df['Product_Category_2'].value_counts()

As we can see, category 8.0 occurs more often than the others. One way to deal with the missing values is to impute them with a summary statistic such as the mode, mean or median.

In [None]:
df['Product_Category_2'].mode()

In [None]:
# Product_Category_2 holds category codes, that is, labels rather than quantities,
# so the mode is a natural choice of fill value:
# - it is the most frequent value in the column;
# - it is always an existing category, whereas a mean can land between two codes;
# - it is not pulled around by extreme values.
# Caveat: filling every gap with a single value inflates that category's share.

# Assign the result back instead of calling fillna(inplace=True) on the selected
# column, which recent pandas versions warn about and which may not update df.
df['Product_Category_2'] = df['Product_Category_2'].fillna(df['Product_Category_2'].mode()[0])
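Mode imputation hides which rows were originally empty. A hedged sketch on a made-up column that records that information in an extra indicator before filling; the `_missing` column name is hypothetical and not part of the dataset:

```python
import numpy as np
import pandas as pd

# Toy column with two gaps
toy = pd.DataFrame({"Product_Category_2": [8.0, np.nan, 8.0, 2.0, np.nan]})

# Remember which rows were empty before filling (hypothetical helper column)
toy["Product_Category_2_missing"] = toy["Product_Category_2"].isna().astype(int)

# mode() returns a Series because ties are possible; [0] takes the first mode
most_common = toy["Product_Category_2"].mode()[0]
toy["Product_Category_2"] = toy["Product_Category_2"].fillna(most_common)

print(toy)
```

Whether the indicator is useful depends on the model; it is shown here only as one way to keep the information that imputation would otherwise discard.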


In [None]:
df.head()

In [None]:
df.info()

## Product Category 3

In [None]:
df['Product_Category_3']

In [None]:
df['Product_Category_3'].unique()

In [None]:
df['Product_Category_3'].value_counts()

In [None]:
df['Product_Category_3'].mode()

Replacing all the missing values with the mode

In [None]:
# Assign back rather than using inplace=True on the selected column
df['Product_Category_3'] = df['Product_Category_3'].fillna(df['Product_Category_3'].mode()[0])

In [None]:
df.info()

## Taking care of remaining missing values and datatype issues

In [None]:
df['Stay_In_Current_City_Years'].unique()

We can see the value '4+', which makes the column an object type. We need to strip the '+' and then convert the column to an integer column.

In [None]:
# regex=False treats '+' as a literal character rather than a regex quantifier
df['Stay_In_Current_City_Years'] = df['Stay_In_Current_City_Years'].str.replace('+', '', regex=False)

In [None]:
df['Stay_In_Current_City_Years'].unique()

In [None]:
df['Stay_In_Current_City_Years'] = df['Stay_In_Current_City_Years'].astype(int)
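The two steps above can be tried on a small made-up series first. This sketch assumes the column contains only digit strings and `'4+'`:

```python
import pandas as pd

# Toy values mirroring the kind of strings seen in the column
stay = pd.Series(["2", "4+", "0", "1", "3"])

# Strip the literal '+' and convert the remaining digit strings to integers
stay_int = stay.str.replace("+", "", regex=False).astype(int)

print(stay_int.tolist())
```

Note that after the conversion, 4 stands for "four or more years", so the column is capped rather than exact.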