# Black Friday Purchases Analysis | Walmart

## Objective 

This notebook analyzes customer purchase behavior at Walmart Inc., focusing on how the purchase amount relates to the customer's gender and other influencing factors. The goal is to provide insights that help the management team make more informed business decisions. Specifically, it asks whether spending habits differ between male and female customers:
  - Do women spend more on Black Friday than men? 

### Table of Contents
- [Exploratory Data Analysis](#Exploratory-Data-Analysis)
    - [Data Overview](#Data-Overview)
    - [Finding Unique Values](#Finding-Unique-Values)
    - [Data Cleaning](#Data-Cleaning)
    - [Statistical Summary](#Statistical-Summary)
        - [Observations from Analysis](#Observations-from-Analysis)
        - [Missing & Duplicate Value Detection](#Missing-and-duplicate-value-detection)
- [Data Visualization](#Data-Visualization)
    - [Numerical Features](#Numerical-Features)
    - [Categorical Features](#Categorical-Features)
    - [Purchase & Our Features](#Purchase-&-Our-Features)
- [Data Analysis](#Data-Analysis)
- [Final Insights](#Insights)
    - [Actionable Insights](#So?)
    - [Recommendations](#Recommendations)


## Exploratory Data Analysis

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set(color_codes = True)
import scipy.stats as stats
from scipy.stats import norm
import warnings
warnings.filterwarnings("ignore")

In [None]:
walmart_df = pd.read_csv("walmart_data.csv")

### Data Overview
Taking a look at the first and last few rows to see what the data looks like.

In [None]:
walmart_df.head()

In [None]:
walmart_df.tail()

In [None]:
#shape of the dataframe (rows, columns) 
walmart_df.shape

In [None]:
# column names in the dataframe 
walmart_df.columns

### Finding Unique Values
Having a look at the unique values will help understand how varied the dataset is.

In [None]:
def summarize_unique_values(df, max_display=10):
    """
    Summarizes unique values for each column in a DataFrame.
    
    """
    for column in df.columns:
        try:
            unique_count = df[column].nunique()
            unique_values = df[column].unique()
            
            print(f"\nColumn: {column}")
            print(f"  Type: {df[column].dtype}")
            print(f"  Unique Count: {unique_count}")
            
            if unique_count <= max_display:
                print(f"  Unique Values: {unique_values}")
            else:
                print(f"  First {max_display} Unique Values: {unique_values[:max_display]}")
                print(f"  ... and {unique_count - max_display} more unique values.")
        except Exception as e:
            print(f"Error processing column {column}: {e}")

# Example usage
summarize_unique_values(walmart_df)


### Data Cleaning
For better analysis, some adjustments are needed. For example, the plus sign in the 'Stay_In_Current_City_Years' column should be removed so that the column can be stored as a numeric type.

In [None]:
walmart_df.Stay_In_Current_City_Years.unique()


In [None]:
# Removing "+" symbol
walmart_df['Stay_In_Current_City_Years'] = walmart_df['Stay_In_Current_City_Years'].str.replace("+", "", regex=False)

In [None]:
# As '+' symbols have already been removed, it is time to change its data type
walmart_df['Stay_In_Current_City_Years'] = pd.to_numeric(walmart_df['Stay_In_Current_City_Years'])
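
The cleaning step can be checked on a small stand-in before running it on the full column. The sketch below uses a toy Series (the values are illustrative, not taken from the dataset) and passes `regex=False` so that "+" is matched literally, whatever the default of the installed pandas version is.

```python
import pandas as pd

# Toy stand-in for the 'Stay_In_Current_City_Years' column (illustrative values)
stay_years = pd.Series(["2", "4+", "1", "0", "4+"])

# regex=False treats "+" as a literal character rather than a regex quantifier
cleaned = pd.to_numeric(stay_years.str.replace("+", "", regex=False))

print(cleaned.tolist())
print(cleaned.dtype)
```

After the conversion the column holds integers, so it can be used in numeric summaries and plots.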

### Statistical Summary

In this section, various statistics of the data will be examined, including the mean, standard deviation, minimum, and maximum values. This will provide a good idea of the overall distribution and range of the data. Before this analysis, data types that are numerical will be selected.

In [None]:
walmart_df.select_dtypes(include=['int64']).skew()

###### A Product_Category skewness of 1.03 indicates a right-skewed distribution: most purchases fall into a few lower-numbered categories, with a long tail of rarely purchased ones.
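
To make the meaning of a positive skewness value concrete, the following sketch compares two small made-up samples. Both Series are assumptions for illustration only; they are not drawn from the Walmart data.

```python
import pandas as pd

# Illustrative values only: most observations are small, a few are large
right_skewed = pd.Series([1, 1, 1, 2, 2, 3, 5, 8, 15, 20])

# Evenly spread values, symmetric around their mean
symmetric = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

print(f"Right-skewed sample skewness: {right_skewed.skew():.2f}")
print(f"Symmetric sample skewness: {symmetric.skew():.2f}")
```

A clearly positive value means the right tail is longer, while a value near zero means the sample is roughly symmetric.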

In [None]:
walmart_df.describe(include = 'all')

### Observations from Analysis
- The dataset contains no missing values.

- The 26–35 age group made the most purchases (219,587), more than any other age group.

- Customers from City_Category B made the most purchases (231,173) of any city category.

- Of the roughly 550,000 records, 414,259 belong to male customers; the remainder belong to female customers.

- The minimum recorded purchase amount is $12.

- The maximum recorded purchase amount is $23,961.

- The purchase data may contain outliers.
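
One common way to follow up on the suspected outliers is the interquartile-range (IQR) rule, the same rule a standard box plot uses for its whiskers. The helper `iqr_outlier_bounds` and the sample amounts below are hypothetical and serve only as a sketch of the idea.

```python
import pandas as pd

def iqr_outlier_bounds(values: pd.Series, k: float = 1.5):
    """Return the (lower, upper) fences used by a standard box plot."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Illustrative purchase amounts, not taken from the dataset
purchases = pd.Series([10, 12, 13, 14, 15, 16, 18, 60])

lower, upper = iqr_outlier_bounds(purchases)
outliers = purchases[(purchases < lower) | (purchases > upper)]

print(f"Fences: ({lower:.2f}, {upper:.2f})")
print(f"Flagged values: {outliers.tolist()}")
```

Applied to the real data, the same function could be called on `walmart_df['Purchase']`; whether flagged values should be removed is a separate judgment call.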

### Missing and duplicate value detection
Both missing and duplicate values can impact the analysis, making it essential to identify their presence. 

In [None]:
# General info about the dataset
walmart_df.info()
print('% of nulls\n', walmart_df.isna().sum() / len(walmart_df) * 100)

In [None]:
walmart_df.duplicated(subset=None,keep='first').sum() # 0 indicates no duplicate values in the dataset
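
The two checks can also be wrapped in a small reusable report. The function name `data_quality_report` and the toy DataFrame are assumptions made for this sketch.

```python
import numpy as np
import pandas as pd

def data_quality_report(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column missing-value counts and percentages."""
    missing = df.isna().sum()
    return pd.DataFrame({
        "missing_count": missing,
        "missing_pct": (missing / len(df) * 100).round(2),
    })

# Toy data: one missing purchase and one fully duplicated row
toy_df = pd.DataFrame({
    "User_ID": [1, 2, 3, 3],
    "Purchase": [100.0, np.nan, 250.0, 250.0],
})

report = data_quality_report(toy_df)
n_duplicates = int(toy_df.duplicated().sum())

print(report)
print(f"Duplicate rows: {n_duplicates}")
```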

## Data Visualization

### Numerical Features

In this section, visual representations of the numerical data will be created, including graphs that display the distributions of features such as occupation, years in the current city, marital status, and purchase amounts. These graphs provide a clearer view of patterns and trends, making the data easier to interpret than the raw numbers alone.

In [None]:
# Setting up a 2x2 grid of subplots
fig, axis = plt.subplots(nrows=2, ncols=2, figsize=(12, 10))
fig.subplots_adjust(top=0.9)  # Adjusting the spacing at the top of the subplots

# Generating distribution plots for each specified column
# (histplot replaces the deprecated distplot; stat="density" keeps a density y-axis)
sns.histplot(walmart_df['Occupation'], kde=True, stat="density", ax=axis[0,0], color="#900000")
sns.histplot(walmart_df['Stay_In_Current_City_Years'].astype(int), kde=True, stat="density", ax=axis[0,1], color="#900000")
sns.histplot(walmart_df['Marital_Status'], kde=True, stat="density", ax=axis[1,0], color="#900000")

# Plotting the distribution of the 'Purchase' variable
sns.histplot(walmart_df['Purchase'], kde=True, stat="density", ax=axis[1,1], color="#900000")

# Fitting the 'Purchase' data to a normal distribution curve
mu, sigma = norm.fit(walmart_df['Purchase'])
print(f"The mean (μ) is {mu:.2f} and the standard deviation (σ) is {sigma:.2f} for the normal distribution.")

# Overlaying the fitted normal curve and labelling it in the legend
x_grid = np.linspace(walmart_df['Purchase'].min(), walmart_df['Purchase'].max(), 200)
axis[1,1].plot(x_grid, norm.pdf(x_grid, mu, sigma), color="black",
               label=f'Normal Distribution (μ = {mu:.2f}, σ = {sigma:.2f})')
axis[1,1].legend(loc='best')

# Displaying the plots
plt.show()


In [None]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots
# Creating subplots with specific titles
fig = make_subplots(
    rows=4, cols=2,
    subplot_titles=("Gender", "Age", "Occupation", "City Category",
                    "Stay In Current City Years", "Marital Status", "Product Category", "Purchase")
)

# Adding histograms for each subplot
fig.add_trace(go.Histogram(x=walmart_df['Gender']), row=1, col=1)
fig.add_trace(go.Histogram(x=walmart_df['Age']), row=1, col=2)
fig.add_trace(go.Histogram(x=walmart_df['Occupation']), row=2, col=1)
fig.add_trace(go.Histogram(x=walmart_df['City_Category']), row=2, col=2)
fig.add_trace(go.Histogram(x=walmart_df['Stay_In_Current_City_Years']), row=3, col=1)
fig.add_trace(go.Histogram(x=walmart_df['Marital_Status']), row=3, col=2)
fig.add_trace(go.Histogram(x=walmart_df['Product_Category']), row=4, col=1)
fig.add_trace(go.Histogram(x=walmart_df['Purchase']), row=4, col=2)

# Updating layout for better visualization
fig.update_layout(
    height=1200, 
    width=1000, 
    title_text="Count Plots", 
    showlegend=False  # Hiding the legend if unnecessary
)

# Displaying the figure
fig.show()


### Observations 

- Most buyers are male; female buyers are a minority. This gap may reflect the product categories on sale during Black Friday, so the gender distribution could look different within a specific category.
- Buyer age is classified into seven categories.
- The majority of buyers are single.
- Occupation 8 has an extremely low count compared with the other occupations, so excluding it is unlikely to affect the results significantly.
- The majority of products belong to categories 1, 5, and 8. The low-count categories can be combined into a single category to reduce the complexity of the problem.
- A higher count for a City_Category likely indicates a larger population in that category of city.
- Most buyers have lived in their current city for one year; the remaining categories are roughly uniformly distributed.
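
The suggestion to combine low-count product categories could be implemented along the following lines. This is a minimal sketch: the helper name `lump_rare_categories`, the 5% threshold, and the toy values are all assumptions, and the labels are converted to strings so that the "Other" label fits the column type.

```python
import pandas as pd

def lump_rare_categories(values: pd.Series, min_share: float = 0.05,
                         other_label: str = "Other") -> pd.Series:
    """Replace categories whose relative frequency is below min_share."""
    labels = values.astype(str)
    shares = labels.value_counts(normalize=True)
    rare = shares[shares < min_share].index
    # Keep frequent labels as they are, replace rare ones with other_label
    return labels.where(~labels.isin(rare), other_label)

# Toy product categories: 1, 5 and 8 dominate, 17 and 19 are rare
product_category = pd.Series([1] * 10 + [5] * 8 + [8] * 6 + [17, 19])

lumped = lump_rare_categories(product_category)
print(lumped.value_counts())
```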

### Categorical Features
In this section, the focus will be on the categorical data, such as gender, age, and city category. Different types of charts will be used to illustrate how these categories relate to purchases. This will aid in understanding which categories have the most impact on purchasing behavior.

In [None]:
fig, axis = plt.subplots(nrows=2, ncols=2, figsize=(12, 10))
fig.subplots_adjust(top=1.2)

sns.boxplot(data=walmart_df, x="Occupation", ax=axis[0,0])
sns.boxplot(data=walmart_df, x="Stay_In_Current_City_Years", orient='h', ax=axis[0,1])
sns.boxplot(data=walmart_df, x="Purchase", orient='h', ax=axis[1,0])
axis[1,1].set_visible(False)  # only three plots are drawn, so hide the unused fourth panel

plt.show()

### Purchase & Our Features

In [None]:
attrs = ['Gender', 'Age', 'Occupation', 'City_Category', 'Stay_In_Current_City_Years', 'Marital_Status', 'Product_Category']
sns.set(color_codes = True)
fig, axs = plt.subplots(nrows=3, ncols=2, figsize=(20, 16))
fig.subplots_adjust(top=1.3)
count = 0
for row in range(3):
    for col in range(2):
        sns.boxplot(data=walmart_df, y='Purchase', x=attrs[count], ax=axs[row, col])
        axs[row,col].set_title(f"Purchase vs {attrs[count]}", pad=12, fontsize=13)
        count += 1
plt.show()

plt.figure(figsize=(10, 8))
sns.boxplot(data=walmart_df, y='Purchase', x=attrs[-1])
plt.show()

In [None]:
sns.set(color_codes = True)
fig, axs = plt.subplots(nrows=2, ncols=2, figsize=(20, 6))

fig.subplots_adjust(top=1.5)
sns.boxplot(data=walmart_df, y='Purchase', x='Gender', hue='Age', ax=axs[0,0])
sns.boxplot(data=walmart_df, y='Purchase', x='Gender', hue='City_Category', ax=axs[0,1])

sns.boxplot(data=walmart_df, y='Purchase', x='Gender', hue='Marital_Status', ax=axs[1,0])
sns.boxplot(data=walmart_df, y='Purchase', x='Gender', hue='Stay_In_Current_City_Years', ax=axs[1,1])
axs[1,1].legend(loc='upper left')

plt.show()

## Data Analysis

##### Do women spend more per customer than men?

In [None]:
# Total amount spent per customer, keeping each customer's gender
amt_df = walmart_df.groupby(['User_ID', 'Gender'])[['Purchase']].sum()
avg_amt_df = amt_df.reset_index()
avg_amt_df
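
Summing purchases per customer before comparing genders answers a different question than averaging individual transactions. The toy data below (entirely made up) is a sketch of how the two views can even point in opposite directions.

```python
import pandas as pd

# Illustrative transactions: one male customer with three purchases,
# two female customers with one and two purchases
toy = pd.DataFrame({
    "User_ID": [1, 1, 1, 2, 3, 3],
    "Gender": ["M", "M", "M", "F", "F", "F"],
    "Purchase": [100, 200, 300, 500, 150, 250],
})

# View 1: mean purchase per transaction, by gender
per_transaction = toy.groupby("Gender")["Purchase"].mean()

# View 2: total per customer first, then the mean of those totals by gender
customer_totals = toy.groupby(["User_ID", "Gender"])["Purchase"].sum().reset_index()
per_customer = customer_totals.groupby("Gender")["Purchase"].mean()

print("Mean per transaction:\n", per_transaction)
print("Mean total per customer:\n", per_customer)
```

In this toy case women spend more per transaction, while men spend more per customer, so it is worth stating which of the two quantities a conclusion refers to.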

In [None]:
# Distribution of the total amount spent per male customer
avg_amt_df[avg_amt_df['Gender']=='M']['Purchase'].hist(bins=100)
plt.show()

In [None]:
# Distribution of the total amount spent per female customer
avg_amt_df[avg_amt_df['Gender']=='F']['Purchase'].hist(bins=100)
plt.show()

In [None]:
male_avg = avg_amt_df[avg_amt_df['Gender'] == 'M']['Purchase'].mean()
female_avg = avg_amt_df[avg_amt_df['Gender'] == 'F']['Purchase'].mean()

print(f"Average amount spent by male customers: {male_avg:.2f}")
print(f"Average amount spent by female customers: {female_avg:.2f}")


##### Confidence intervals and the distribution of mean expenses for female and male customers.

In [None]:
# Filtering data by gender
male_df = avg_amt_df[avg_amt_df['Gender'] == 'M']
female_df = avg_amt_df[avg_amt_df['Gender'] == 'F']

# Defining sample sizes and number of repetitions
male_sample_size = 3000
female_sample_size = 1500
num_repetitions = 1000

# Initializing lists to store sample means
male_means = []
female_means = []

# Performing bootstrapping and calculating means for each gender
for _ in range(num_repetitions):
    male_sample_mean = male_df.sample(male_sample_size, replace=True)['Purchase'].mean()
    female_sample_mean = female_df.sample(female_sample_size, replace=True)['Purchase'].mean()
    
    male_means.append(male_sample_mean)
    female_means.append(female_sample_mean)
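
Because the resampling is random, results change between runs unless a seed is fixed. The sketch below shows a seeded, vectorized variant of the same bootstrap idea using NumPy only. The synthetic exponential data and all variable names are assumptions for illustration, not the notebook's actual data.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the resampling is reproducible

# Synthetic right-skewed "spending" data (an assumption for illustration only)
spending = rng.exponential(scale=1000.0, size=5000)

sample_size = 1500
num_repetitions = 1000

# Draw all resamples at once: one row per repetition, sampled with replacement
resamples = rng.choice(spending, size=(num_repetitions, sample_size), replace=True)
bootstrap_means = resamples.mean(axis=1)

# By the Central Limit Theorem the spread of the sample means
# should be close to sigma / sqrt(n)
expected_se = spending.std(ddof=1) / np.sqrt(sample_size)

print(f"Mean of bootstrap means: {bootstrap_means.mean():.2f}")
print(f"Std of bootstrap means:  {bootstrap_means.std(ddof=1):.2f}")
print(f"sigma / sqrt(n):         {expected_se:.2f}")
```

Even though the underlying data is skewed, the distribution of the sample means is approximately normal and centered on the mean of the data, which is what justifies the z-based intervals used later.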


In [None]:
fig, axis = plt.subplots(nrows=1, ncols=2, figsize=(20, 6))

axis[0].hist(male_means, bins=100)
axis[1].hist(female_means, bins=100)
axis[0].set_title(f"Male - Distribution of Sample Means, Sample Size: {male_sample_size}")
axis[1].set_title(f"Female - Distribution of Sample Means, Sample Size: {female_sample_size}")

plt.show()

# Displaying population mean (mean of sample means) for both genders
print("\n")
print(f"Population mean for Male (Mean of Sample Means of Amount Spent): {np.mean(male_means):.2f}")
print(f"Population mean for Female (Mean of Sample Means of Amount Spent): {np.mean(female_means):.2f}")

# Displaying sample statistics (mean and standard deviation) for both genders
print(f"\nMale - Sample Mean: {male_df['Purchase'].mean():.2f}, Sample Std: {male_df['Purchase'].std():.2f}")
print(f"Female - Sample Mean: {female_df['Purchase'].mean():.2f}, Sample Std: {female_df['Purchase'].std():.2f}")


In [None]:
# Function to calculate confidence intervals
def calculate_confidence_interval(data, confidence_level):
    z_score = norm.ppf((1 + confidence_level) / 2)  # Dynamically calculate z-score
    sample_mean = data.mean()
    sample_std = data.std()
    sample_size = len(data)
    margin_of_error = z_score * sample_std / np.sqrt(sample_size)
    lower_limit = sample_mean - margin_of_error
    upper_limit = sample_mean + margin_of_error
    return lower_limit, upper_limit
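
The z-based interval computed by this function is mean ± z · s / √n. As a sanity check, the same interval can be obtained from `scipy.stats.norm.interval`. The toy values below are assumptions for illustration, and the confidence level is passed positionally.

```python
import numpy as np
import pandas as pd
from scipy.stats import norm

# Illustrative sample, not taken from the dataset
data = pd.Series([12.0, 15.0, 11.0, 18.0, 14.0, 16.0, 13.0, 17.0])
confidence_level = 0.95

# Manual z-based interval: mean +/- z * s / sqrt(n)
z_score = norm.ppf((1 + confidence_level) / 2)
standard_error = data.std() / np.sqrt(len(data))
manual = (data.mean() - z_score * standard_error,
          data.mean() + z_score * standard_error)

# Same interval from SciPy, using the standard error as the scale
library = norm.interval(confidence_level, loc=data.mean(), scale=standard_error)

print(f"z-score: {z_score:.3f}")
print(f"Manual interval:  ({manual[0]:.3f}, {manual[1]:.3f})")
print(f"Library interval: ({library[0]:.3f}, {library[1]:.3f})")
```

For small samples a t-based interval would be wider; the z-based form is a reasonable approximation when each group contains many customers.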

In [None]:
# Defining confidence levels for the three cases
confidence_levels = [0.90, 0.95, 0.99]

for level in confidence_levels:
    male_lower_lim, male_upper_lim = calculate_confidence_interval(male_df['Purchase'], level)
    female_lower_lim, female_upper_lim = calculate_confidence_interval(female_df['Purchase'], level)
    
    print(f"Confidence level {level * 100}%")
    print("Male confidence interval of means: ({:.2f}, {:.2f})".format(male_lower_lim, male_upper_lim))
    print("Female confidence interval of means: ({:.2f}, {:.2f})".format(female_lower_lim, female_upper_lim))
    print("\n")

##### Do the confidence intervals for average male and female spending overlap? How can Walmart utilize these insights to drive improvements or implement changes?

The confidence intervals for average male and female spending do not overlap, which indicates a statistically significant difference in mean spending per customer between the two groups. Walmart can use this result to tailor marketing strategies by gender, and can check its robustness by repeating the analysis at different confidence levels and sample sizes, relying on the Central Limit Theorem to treat the sample means as approximately normal.
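
Checking for overlap can be done programmatically rather than by eye. The helper `intervals_overlap` and the interval values below are hypothetical, chosen only to exercise the check.

```python
def intervals_overlap(first, second):
    """Return True when two closed intervals (low, high) share at least one point."""
    return first[0] <= second[1] and second[0] <= first[1]

# Hypothetical interval values, not results from the dataset
overlapping = intervals_overlap((100.0, 120.0), (115.0, 130.0))
disjoint = intervals_overlap((100.0, 120.0), (125.0, 130.0))

print(f"(100, 120) vs (115, 130) overlap: {overlapping}")
print(f"(100, 120) vs (125, 130) overlap: {disjoint}")
```

Non-overlapping intervals point to a significant difference at that confidence level, but overlapping intervals do not by themselves prove the absence of one; a two-sample test would be needed for that.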

##### Outcomes of the same analysis conducted for Married versus Unmarried individuals

In [None]:
amt_df = walmart_df.groupby(['User_ID', 'Marital_Status'])[['Purchase']].sum()
# transforming the group labels into columns for easier manipulation
avg_amt_df = amt_df.reset_index()
avg_amt_df

In [None]:
avg_amt_df['Marital_Status'].value_counts()

In [None]:
sample_sizes = {'married': 3000, 'unmarried': 2000}

num_repetitions = 1000
means = {'married': [], 'unmarried': []}

for _ in range(num_repetitions):
    # Calculating the mean for both married and unmarried samples
    for status, size in sample_sizes.items():
        group_data = avg_amt_df[avg_amt_df['Marital_Status'] == (1 if status == 'married' else 0)]
        mean_value = group_data.sample(size, replace=True)['Purchase'].mean()
        means[status].append(mean_value)

# Plotting the histograms for both groups
fig, axis = plt.subplots(nrows=1, ncols=2, figsize=(20, 6))

axis[0].hist(means['married'], bins=100)
axis[1].hist(means['unmarried'], bins=100)
axis[0].set_title(f"Married - Distribution of means, Sample size: {sample_sizes['married']}")
axis[1].set_title(f"Unmarried - Distribution of means, Sample size: {sample_sizes['unmarried']}")

plt.show()

print("\n")
print(f"Population mean - Mean of sample means of amount spent for Married: {np.mean(means['married']):.2f}")
print(f"Population mean - Mean of sample means of amount spent for Unmarried: {np.mean(means['unmarried']):.2f}")

# Sample statistics for both groups
for status in ['married', 'unmarried']:
    group_data = avg_amt_df[avg_amt_df['Marital_Status'] == (1 if status == 'married' else 0)]
    print(f"\n{status.capitalize()} - Sample mean: {group_data['Purchase'].mean():.2f} Sample std: {group_data['Purchase'].std():.2f}")


In [None]:
# The list [0.90, 0.95, 0.99] for confidence levels was introduced earlier

for val in ["Married", "Unmarried"]:
    new_val = 1 if val == "Married" else 0
    new_df = avg_amt_df[avg_amt_df['Marital_Status'] == new_val]
    
    for confidence_level in confidence_levels:
        lower_lim, upper_lim = calculate_confidence_interval(new_df['Purchase'], confidence_level)
        
        print(f"{val} - {int(confidence_level*100)}% confidence interval of means: ({lower_lim:.2f}, {upper_lim:.2f})")

##### Conducting the analysis for age
Do spending patterns vary across age groups? Which age range tends to spend more?

In [None]:
amt_df = walmart_df.groupby(['User_ID', 'Age'])[['Purchase']].sum()
# transforming the group labels into columns for easier manipulation
avg_amt_df = amt_df.reset_index()
avg_amt_df

In [None]:
avg_amt_df['Age'].value_counts()

In [None]:
age_intervals = ['26-35', '36-45', '18-25', '46-50', '51-55', '55+', '0-17']

# Looping through age groups and calculating confidence intervals for each confidence level
for age_interval in age_intervals:
    new_df = avg_amt_df[avg_amt_df['Age'] == age_interval]
    
    print(f"\nFor age group {age_interval}:")
    for confidence_level in confidence_levels:
        lower_lim, upper_lim = calculate_confidence_interval(new_df['Purchase'], confidence_level)
        print(f"Confidence Interval at {int(confidence_level*100)}%: ({lower_lim:.2f}, {upper_lim:.2f})")
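
The per-group loop can also be expressed as a single `groupby` aggregation that returns all intervals in one table. The function name `group_confidence_intervals` and the toy data are assumptions for this sketch.

```python
import numpy as np
import pandas as pd
from scipy.stats import norm

def group_confidence_intervals(df, group_col, value_col, confidence_level=0.95):
    """z-based confidence interval of the mean of value_col for each group."""
    z_score = norm.ppf((1 + confidence_level) / 2)
    summary = df.groupby(group_col)[value_col].agg(["mean", "std", "count"])
    margin = z_score * summary["std"] / np.sqrt(summary["count"])
    return summary.assign(lower=summary["mean"] - margin,
                          upper=summary["mean"] + margin)

# Illustrative per-customer totals for two age groups
toy = pd.DataFrame({
    "Age": ["18-25"] * 4 + ["26-35"] * 4,
    "Purchase": [100.0, 120.0, 110.0, 130.0, 200.0, 220.0, 210.0, 230.0],
})

intervals = group_confidence_intervals(toy, "Age", "Purchase")
print(intervals[["mean", "lower", "upper"]].round(2))
```

Having the intervals side by side in one DataFrame makes it easier to sort the groups by mean spending and to spot which intervals overlap.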

## Insights

### So?

- **Age Group Spending**: Customers aged 18-45 (18-25: 18%, 26-35: 40%, 36-45: 20%) contribute the most spending, accounting for roughly 78% of total expenditure.

- **Gender Spending**: Males account for 75% of purchases and spend more on average than females, highlighting their key role in overall sales.

- **Marital Status Influence**: Single men spend more during sales events than married men, suggesting that marriage may reduce discretionary spending.

- **City Tenure**: People who have lived in the city for one year tend to spend the most, while long-term residents show reduced spending, likely due to established lifestyles.

- **City Category**: Although City B contributes most to overall sales, City C leads in specific product purchases, revealing regional preferences.

- **Product Categories**: Categories 1, 5, 8, and 11 exhibit the highest purchasing frequency.

- **Occupation Distribution**: A diverse consumer base is reflected in the data, showing a wide range of occupations.

- **Confidence Intervals**: Confidence intervals support the analysis of spending patterns across various segments, revealing significant variations in purchasing behaviors.
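
Percentage shares such as those quoted for age groups and gender can be derived with a grouped sum divided by the overall total. The numbers in this sketch are made up to keep the arithmetic easy to verify by hand.

```python
import pandas as pd

# Illustrative purchases, not taken from the dataset (total = 1000)
toy = pd.DataFrame({
    "Age": ["18-25", "26-35", "26-35", "36-45", "55+"],
    "Purchase": [200, 300, 100, 250, 150],
})

# Share of total expenditure contributed by each age group, in percent
share_pct = (toy.groupby("Age")["Purchase"].sum() / toy["Purchase"].sum() * 100).round(1)

print(share_pct.sort_values(ascending=False))
```

The same pattern works for any segment column, for example `Gender` or `City_Category`.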


### Recommendations

- **Female Customer Retention**: Since men spend more than women, the company should focus on retaining female customers and attracting new ones to balance spending across genders.

- **Product Focus**: Product Categories 1, 5, 8, and 11 have the highest purchasing frequency, indicating strong customer preference. The company should either increase sales of these popular products or focus on promoting lesser-purchased items to diversify offerings.

- **Married Customer Acquisition**: Unmarried customers spend more than married customers, suggesting that the company should target married individuals to boost spending.

- **Age Group Expansion**: Customers aged 18-45 tend to spend more, so the company should focus on acquiring customers from other age groups to expand its customer base.

- **Geographic Expansion**: With the most purchases coming from Tier-2 City B, the company should consider opening more outlets in the other city categories, A and C, to enhance business growth.
