# Business Data Analysis 
## Company XYZ Objective
This notebook aims to analyze customer purchase behavior, specifically focusing on purchase amounts in relation to gender and other influencing factors. The goal is to help Company XYZ make data-driven decisions by understanding if spending habits differ between male and female customers. The key question is:
Do women spend more on Black Friday than men?



#### 1. Exploratory Data Analysis 

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set(color_codes = True)
import scipy.stats as stats
from scipy.stats import norm
import warnings
warnings.filterwarnings("ignore")

In [2]:
XYZ_df = pd.read_csv("XYZ_data.csv")

#### 2. Data Overview 

In [3]:
XYZ_df.head()

Unnamed: 0,User_ID,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category,Purchase
0,1000001,P00069042,F,0-17,10,A,2,0,3,8370
1,1000001,P00248942,F,0-17,10,A,2,0,1,15200
2,1000001,P00087842,F,0-17,10,A,2,0,12,1422
3,1000001,P00085442,F,0-17,10,A,2,0,12,1057
4,1000002,P00285442,M,55+,16,C,4+,0,8,7969


In [4]:
XYZ_df.tail()

Unnamed: 0,User_ID,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category,Purchase
550063,1006033,P00372445,M,51-55,13,B,1,1,20,368
550064,1006035,P00375436,F,26-35,1,C,3,0,20,371
550065,1006036,P00375436,F,26-35,15,B,4+,1,20,137
550066,1006038,P00375436,F,55+,1,C,2,0,20,365
550067,1006039,P00371644,F,46-50,0,B,4+,1,20,490


In [5]:
# Shape of the dataframe
XYZ_df.shape

(550068, 10)

In [6]:
# Name of each column in dataframe
XYZ_df.columns

Index(['User_ID', 'Product_ID', 'Gender', 'Age', 'Occupation', 'City_Category',
       'Stay_In_Current_City_Years', 'Marital_Status', 'Product_Category',
       'Purchase'],
      dtype='object')

In [7]:
# Datatype of each column in dataframe
XYZ_df.dtypes

User_ID                        int64
Product_ID                    object
Gender                        object
Age                           object
Occupation                     int64
City_Category                 object
Stay_In_Current_City_Years    object
Marital_Status                 int64
Product_Category               int64
Purchase                       int64
dtype: object

**Next, we'll find out the unique values in each column. This will help us understand how varied the data is.**

In [8]:
def print_unique_values(df):
    for column in df.columns:
        unique_values = df[column].unique()
        print(f"\nUnique Values of {column}: ", unique_values)

# Example usage
print_unique_values(XYZ_df)


Unique Values of User_ID:  [1000001 1000002 1000003 ... 1004113 1005391 1001529]

Unique Values of Product_ID:  ['P00069042' 'P00248942' 'P00087842' ... 'P00370293' 'P00371644'
 'P00370853']

Unique Values of Gender:  ['F' 'M']

Unique Values of Age:  ['0-17' '55+' '26-35' '46-50' '51-55' '36-45' '18-25']

Unique Values of Occupation:  [10 16 15  7 20  9  1 12 17  0  3  4 11  8 19  2 18  5 14 13  6]

Unique Values of City_Category:  ['A' 'C' 'B']

Unique Values of Stay_In_Current_City_Years:  ['2' '4+' '3' '1' '0']

Unique Values of Marital_Status:  [0 1]

Unique Values of Product_Category:  [ 3  1 12  8  5  4  2  6 14 11 13 15  7 16 18 10 17  9 20 19]

Unique Values of Purchase:  [ 8370 15200  1422 ...   135   123   613]


**Count how many unique values there are in each column.**

In [9]:
def print_nunique_values(df):
    for column in df.columns:
        unique_values = df[column].nunique()
        print(f"\nUnique Values of {column}: ", unique_values)

# Example usage
print_nunique_values(XYZ_df)


Unique Values of User_ID:  5891

Unique Values of Product_ID:  3631

Unique Values of Gender:  2

Unique Values of Age:  7

Unique Values of Occupation:  21

Unique Values of City_Category:  3

Unique Values of Stay_In_Current_City_Years:  5

Unique Values of Marital_Status:  2

Unique Values of Product_Category:  20

Unique Values of Purchase:  18105
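
The same per-column counts are also available in a single call through `DataFrame.nunique()`. The sketch below runs on a small made-up frame; `toy_df` and its values are illustrative only and are not taken from `XYZ_data.csv`.

```python
import pandas as pd

# Hypothetical toy data standing in for XYZ_df
toy_df = pd.DataFrame({
    "Gender": ["F", "M", "M", "F"],
    "Age": ["0-17", "55+", "26-35", "26-35"],
    "Purchase": [8370, 15200, 1422, 1422],
})

# Number of distinct values per column, returned as a Series
unique_counts = toy_df.nunique()
print(unique_counts)
```

Returning a Series instead of printing inside a loop makes the counts easy to sort or filter afterwards.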


#### 3. Data Cleaning

We'll make some changes to the data for better analysis. For example, we'll adjust the 'Stay_In_Current_City_Years' column by removing the '+' symbol and converting it to a numeric format. But first, let's look at the unique values.

In [10]:
XYZ_df.Stay_In_Current_City_Years.unique()

array(['2', '4+', '3', '1', '0'], dtype=object)

In [11]:
# Removing "+" symbol
# regex=False treats "+" as a literal character rather than a regex quantifier
XYZ_df.Stay_In_Current_City_Years = XYZ_df.Stay_In_Current_City_Years.str.replace("+", "", regex=False)

In [12]:
XYZ_df.Stay_In_Current_City_Years.unique()

array(['2', '4', '3', '1', '0'], dtype=object)

In [13]:
# Convert the column to a numeric dtype now that the '+' symbols have been removed
XYZ_df['Stay_In_Current_City_Years'] = pd.to_numeric(XYZ_df['Stay_In_Current_City_Years'])
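
The two cleaning steps can also be combined into one helper. This is a sketch under the assumption that the only non-numeric marker in the column is a trailing `+`; `clean_stay_years` is a hypothetical name and the sample series is made up.

```python
import pandas as pd

def clean_stay_years(series: pd.Series) -> pd.Series:
    """Strip a literal '+' and convert the values to numbers."""
    # regex=False treats "+" as a plain character, not a regex quantifier
    stripped = series.astype(str).str.replace("+", "", regex=False)
    return pd.to_numeric(stripped)

# Hypothetical sample mirroring the unique values shown above
sample = pd.Series(["2", "4+", "3", "1", "0"])
cleaned = clean_stay_years(sample)
print(cleaned.tolist())
```

Note that after this conversion the value 4 stands for "4 or more years", so it is a capped category rather than an exact count.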

#### 4. Statistical Summary 

**In this section, we'll examine some basic statistics of the data, including the mean, standard deviation, minimum, and maximum values. This will help us understand the overall distribution and range of the data. Before that, we'll first filter for numerical data types.**

In [14]:
XYZ_df.select_dtypes(include=['int64']).skew()

User_ID                       0.003066
Occupation                    0.400140
Stay_In_Current_City_Years    0.317236
Marital_Status                0.367437
Product_Category              1.025735
Purchase                      0.600140
dtype: float64

In [15]:
XYZ_df.describe(include = 'all')

Unnamed: 0,User_ID,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category,Purchase
count,550068.0,550068,550068,550068,550068.0,550068,550068.0,550068.0,550068.0,550068.0
unique,,3631,2,7,,3,,,,
top,,P00265242,M,26-35,,B,,,,
freq,,1880,414259,219587,,231173,,,,
mean,1003029.0,,,,8.076707,,1.858418,0.409653,5.40427,9263.968713
std,1727.592,,,,6.52266,,1.289443,0.49177,3.936211,5023.065394
min,1000001.0,,,,0.0,,0.0,0.0,1.0,12.0
25%,1001516.0,,,,2.0,,1.0,0.0,1.0,5823.0
50%,1003077.0,,,,7.0,,2.0,0.0,5.0,8047.0
75%,1004478.0,,,,14.0,,3.0,1.0,8.0,12054.0


#### Observations from Analysis

- There are no missing values in the data.
- Customers in the age group of 26-35 have made more purchases (**219,587**) compared to other age groups.
- Customers in **City_Category B** have made more purchases (**231,173**) compared to other City_Categories.
- Out of **550,068** data points, **414,259** are Male, and the rest are Female.
- The smallest single purchase amount is **\$12**.
- The largest single purchase amount is **\$23,961**.
- There might be outliers in the purchase data.
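
One common way to check the last observation is Tukey's 1.5 × IQR rule. The sketch below applies it to the `Purchase` quartiles copied from the `describe()` output and compares the resulting fences with the minimum and maximum reported in this list. The choice of this rule is an assumption; the notebook does not state which outlier definition it has in mind.

```python
# Quartiles of Purchase, copied from the describe() output above
q1, q3 = 5823.0, 12054.0

iqr = q3 - q1                  # interquartile range
lower_fence = q1 - 1.5 * iqr   # values below this are flagged
upper_fence = q3 + 1.5 * iqr   # values above this are flagged

print(f"IQR = {iqr}, fences = ({lower_fence}, {upper_fence})")

# Compare the reported minimum and maximum purchase with the fences
reported_min, reported_max = 12, 23961
print("min flagged:", reported_min < lower_fence)
print("max flagged:", reported_max > upper_fence)
```

Under this rule the reported maximum lies above the upper fence, which is consistent with the suspicion that the purchase data contains high-value outliers.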


***Missing Value Detection***

We'll check if there are any missing values in the data. Missing values can affect the analysis, so it's important to know whether any are present.

In [16]:
# Missing value detection
XYZ_df.isna().sum()

User_ID                       0
Product_ID                    0
Gender                        0
Age                           0
Occupation                    0
City_Category                 0
Stay_In_Current_City_Years    0
Marital_Status                0
Product_Category              0
Purchase                      0
dtype: int64

***Duplicate Value Detection***

We'll check for any duplicate entries in the data. Duplicates can skew our analysis, so we need to identify and handle them.

In [17]:
# Checking duplicate values in the data set
XYZ_df.duplicated(subset=None,keep='first').sum() 

np.int64(0)
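
Both checks can be wrapped in one small helper so they are easy to rerun after each cleaning step. This is a minimal sketch; `quality_report` is a hypothetical name and the toy frame is made up to contain one missing value and one repeated row.

```python
import pandas as pd

def quality_report(df: pd.DataFrame) -> dict:
    """Return counts of missing cells and fully duplicated rows."""
    return {
        "missing_cells": int(df.isna().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
    }

# Hypothetical toy frame: one missing value and one repeated row
toy_df = pd.DataFrame({
    "User_ID": [1, 1, 2, 3],
    "Purchase": [100.0, 100.0, None, 250.0],
})
report = quality_report(toy_df)
print(report)
```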

#### 5. Data Visualization 

In [18]:
XYZ_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 550068 entries, 0 to 550067
Data columns (total 10 columns):
 #   Column                      Non-Null Count   Dtype 
---  ------                      --------------   ----- 
 0   User_ID                     550068 non-null  int64 
 1   Product_ID                  550068 non-null  object
 2   Gender                      550068 non-null  object
 3   Age                         550068 non-null  object
 4   Occupation                  550068 non-null  int64 
 5   City_Category               550068 non-null  object
 6   Stay_In_Current_City_Years  550068 non-null  int64 
 7   Marital_Status              550068 non-null  int64 
 8   Product_Category            550068 non-null  int64 
 9   Purchase                    550068 non-null  int64 
dtypes: int64(6), object(4)
memory usage: 42.0+ MB


**In this part, we'll create visual representations of the numerical data. This will include graphs showing distributions of various numerical features like occupation, years in the current city, marital status, and purchase amounts. Graphs help us see patterns and trends more easily than looking at numbers alone.**

In [19]:
[col for col in XYZ_df.select_dtypes(include=['int64']).columns]

['User_ID',
 'Occupation',
 'Stay_In_Current_City_Years',
 'Marital_Status',
 'Product_Category',
 'Purchase']

***Note:** From that list, we can remove User_ID and Product_Category, because they won't contribute to our analysis.*

In [None]:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import norm

# Create a 2x2 grid of subplots
fig, axis = plt.subplots(nrows=2, ncols=2, figsize=(12, 10))
fig.subplots_adjust(top=0.9)  # Adjust the top spacing of the subplots

# Plot distributions for each specified column
# (histplot is used because distplot is deprecated in recent seaborn releases)
sns.histplot(XYZ_df['Occupation'], kde=True, stat="density", ax=axis[0,0], color="#900000")
sns.histplot(XYZ_df['Stay_In_Current_City_Years'].astype(int), kde=True, stat="density", ax=axis[0,1], color="#900000")
sns.histplot(XYZ_df['Marital_Status'], kde=True, stat="density", ax=axis[1,0], color="#900000")

# Distribution of the 'Purchase' variable on a density scale
sns.histplot(XYZ_df['Purchase'], kde=True, stat="density", ax=axis[1,1], color="#900000")

# Fitting the target variable to the normal curve
mu, sigma = norm.fit(XYZ_df['Purchase'])
print("The mu (mean) is {} and sigma (standard deviation) is {} for the curve".format(mu, sigma))

# Overlay the fitted normal curve and label it in the legend
x = np.linspace(XYZ_df['Purchase'].min(), XYZ_df['Purchase'].max(), 200)
axis[1,1].plot(x, norm.pdf(x, mu, sigma), color="black",
               label='Normal Distribution (μ = {:.2f}, σ = {:.2f})'.format(mu, sigma))
axis[1,1].legend(loc='best')

# Show the plots
plt.show()

In [None]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Create subplots
fig = make_subplots(
    rows=4, cols=2,
    subplot_titles=("Gender", "Age", "Occupation", "City Category",
                    "Stay In Current City Years", "Marital Status", "Product Category", "Purchase")
)

# Add histograms for each subplot
fig.add_trace(go.Histogram(x=XYZ_df['Gender']), row=1, col=1)
fig.add_trace(go.Histogram(x=XYZ_df['Age']), row=1, col=2)
fig.add_trace(go.Histogram(x=XYZ_df['Occupation']), row=2, col=1)
fig.add_trace(go.Histogram(x=XYZ_df['City_Category']), row=2, col=2)
fig.add_trace(go.Histogram(x=XYZ_df['Stay_In_Current_City_Years']), row=3, col=1)
fig.add_trace(go.Histogram(x=XYZ_df['Marital_Status']), row=3, col=2)
fig.add_trace(go.Histogram(x=XYZ_df['Product_Category']), row=4, col=1)
fig.add_trace(go.Histogram(x=XYZ_df['Purchase']), row=4, col=2)

# Update layout if needed
fig.update_layout(height=1200, width=1000, title_text="Count Plots")
fig.update_layout(showlegend=False)  # Hide the legend if not needed

# Show the figure
fig.show()

**Observations:**
- The majority of buyers are **male**, while females are the minority. This gap may be influenced by the product categories on sale during Black Friday, so the gender split could look different within a specific category.
- Buyers are classified into **seven age groups**.
- **Most buyers are single**.
- Buyers' occupations vary, but **Occupation 8 has an extremely low count**. It can be ignored in calculations as it has minimal impact on results.
- The majority of purchases fall in **Product Category 1, 5, and 8**. Low-count categories could be merged into a single category to simplify the problem.
- The higher count of buyers in **City_Category B** may reflect a larger customer population there.
- **The largest group of buyers has lived in the city for one year**, while the remaining categories are fairly evenly distributed.

Next, we'll focus on the categorical data, like gender, age, and city category. We'll use different types of charts to show how these categories relate to purchases. This will help us understand which categories have the most impact on purchasing behavior.

In [None]:
fig, axis = plt.subplots(nrows=2, ncols=2, figsize=(12, 10))
fig.subplots_adjust(top=1.2)

sns.boxplot(data=XYZ_df, x="Occupation", ax=axis[0,0])
sns.boxplot(data=XYZ_df, x="Stay_In_Current_City_Years", orient='h', ax=axis[0,1])
sns.boxplot(data=XYZ_df, x="Purchase", orient='h', ax=axis[1,0])
axis[1,1].set_visible(False)  # hide the unused fourth panel


plt.show()

In [None]:
attrs = ['Gender', 'Age', 'Occupation', 'City_Category', 'Stay_In_Current_City_Years', 'Marital_Status', 'Product_Category']
sns.set(color_codes = True)
fig, axs = plt.subplots(nrows=3, ncols=2, figsize=(20, 16))
fig.subplots_adjust(top=1.3)
count = 0
for row in range(3):
    for col in range(2):
        sns.boxplot(data=XYZ_df, y='Purchase', x=attrs[count], ax=axs[row, col])
        axs[row,col].set_title(f"Purchase vs {attrs[count]}", pad=12, fontsize=13)
        count += 1
plt.show()

plt.figure(figsize=(10, 8))
sns.boxplot(data=XYZ_df, y='Purchase', x=attrs[-1])
plt.show()

In [None]:
sns.set(color_codes = True)
fig, axs = plt.subplots(nrows=2, ncols=2, figsize=(20, 6))

fig.subplots_adjust(top=1.5)
sns.boxplot(data=XYZ_df, y='Purchase', x='Gender', hue='Age', ax=axs[0,0])
sns.boxplot(data=XYZ_df, y='Purchase', x='Gender', hue='City_Category', ax=axs[0,1])

sns.boxplot(data=XYZ_df, y='Purchase', x='Gender', hue='Marital_Status', ax=axs[1,0])
sns.boxplot(data=XYZ_df, y='Purchase', x='Gender', hue='Stay_In_Current_City_Years', ax=axs[1,1])
axs[1,1].legend(loc='upper left')

plt.show()

#### 6. Data Analysis 

**1. Are women spending more money per transaction than men? Why or Why not?**


In [None]:
# Total amount spent per customer, keeping each customer's gender
amt_df = XYZ_df.groupby(['User_ID', 'Gender'])[['Purchase']].sum()
avg_amt_df = amt_df.reset_index()
avg_amt_df

In [None]:
# Gender wise value counts in avg_amt_df
avg_amt_df['Gender'].value_counts()

In [None]:
# Histogram of the total amount spent by each customer - Male
avg_amt_df[avg_amt_df['Gender']=='M']['Purchase'].hist(bins=100)
plt.show()

In [None]:
# Histogram of the total amount spent by each customer - Female
avg_amt_df[avg_amt_df['Gender']=='F']['Purchase'].hist(bins=100)
plt.show()

In [None]:
male_avg = avg_amt_df[avg_amt_df['Gender']=='M']['Purchase'].mean()
female_avg = avg_amt_df[avg_amt_df['Gender']=='F']['Purchase'].mean()

print("Average amount spent by Male customers: {:.2f}".format(male_avg))
print("Average amount spent by Female customers: {:.2f}".format(female_avg))
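
The cells above total the spending of each customer before averaging, whereas the question is phrased per transaction. The two views can rank the groups differently, so it may help to compute both. The sketch below uses a made-up frame (`toy_df` is hypothetical and its values are not from `XYZ_data.csv`).

```python
import pandas as pd

# Hypothetical transactions, for illustration only
toy_df = pd.DataFrame({
    "User_ID":  [1, 1, 1, 2, 3, 3],
    "Gender":   ["F", "F", "F", "M", "M", "M"],
    "Purchase": [100, 200, 300, 900, 150, 150],
})

# Mean amount per transaction, by gender
per_transaction = toy_df.groupby("Gender")["Purchase"].mean()

# Total per customer first, then the mean of those totals by gender
per_customer = (
    toy_df.groupby(["User_ID", "Gender"])["Purchase"].sum()
    .groupby(level="Gender").mean()
)

print(per_transaction)
print(per_customer)
```

In this toy data the per-transaction means differ between the groups while the per-customer means are equal, which shows why the unit of analysis should be stated alongside any conclusion.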

**2. Confidence intervals and distribution of the mean of the expenses by female and male customers.**

In [None]:
male_df = avg_amt_df[avg_amt_df['Gender']=='M']
female_df = avg_amt_df[avg_amt_df['Gender']=='F']

In [None]:
genders = ["M", "F"]

male_sample_size = 3000
female_sample_size = 1500
num_repitions = 1000
male_means = []
female_means = []

for _ in range(num_repitions):
    male_mean = male_df.sample(male_sample_size, replace=True)['Purchase'].mean()
    female_mean = female_df.sample(female_sample_size, replace=True)['Purchase'].mean()
    
    male_means.append(male_mean)
    female_means.append(female_mean)

In [None]:
fig, axis = plt.subplots(nrows=1, ncols=2, figsize=(20, 6))

axis[0].hist(male_means, bins=100)
axis[1].hist(female_means, bins=100)
axis[0].set_title("Male - Distribution of means, Sample size: {}".format(male_sample_size))
axis[1].set_title("Female - Distribution of means, Sample size: {}".format(female_sample_size))

plt.show()

print("\n")
print("Population mean - Mean of sample means of amount spent for Male: {:.2f}".format(np.mean(male_means)))
print("Population mean - Mean of sample means of amount spent for Female: {:.2f}".format(np.mean(female_means)))

print("\nMale - Sample mean: {:.2f} Sample std: {:.2f}".format(male_df['Purchase'].mean(), male_df['Purchase'].std()))
print("Female - Sample mean: {:.2f} Sample std: {:.2f}".format(female_df['Purchase'].mean(), female_df['Purchase'].std()))

In [None]:
# 90% confidence interval (z = 1.64)
male_margin_of_error_clt = 1.64*male_df['Purchase'].std()/np.sqrt(len(male_df))
male_sample_mean = male_df['Purchase'].mean()
male_lower_lim = male_sample_mean - male_margin_of_error_clt
male_upper_lim = male_sample_mean + male_margin_of_error_clt

female_margin_of_error_clt = 1.64*female_df['Purchase'].std()/np.sqrt(len(female_df))
female_sample_mean = female_df['Purchase'].mean()
female_lower_lim = female_sample_mean - female_margin_of_error_clt
female_upper_lim = female_sample_mean + female_margin_of_error_clt

print("Male confidence interval of means: ({:.2f}, {:.2f})".format(male_lower_lim, male_upper_lim))
print("Female confidence interval of means: ({:.2f}, {:.2f})".format(female_lower_lim, female_upper_lim))

In [None]:
# 95% confidence interval (z = 1.96)
male_margin_of_error_clt = 1.96*male_df['Purchase'].std()/np.sqrt(len(male_df))
male_sample_mean = male_df['Purchase'].mean()
male_lower_lim = male_sample_mean - male_margin_of_error_clt
male_upper_lim = male_sample_mean + male_margin_of_error_clt

female_margin_of_error_clt = 1.96*female_df['Purchase'].std()/np.sqrt(len(female_df))
female_sample_mean = female_df['Purchase'].mean()
female_lower_lim = female_sample_mean - female_margin_of_error_clt
female_upper_lim = female_sample_mean + female_margin_of_error_clt

print("Male confidence interval of means: ({:.2f}, {:.2f})".format(male_lower_lim, male_upper_lim))
print("Female confidence interval of means: ({:.2f}, {:.2f})".format(female_lower_lim, female_upper_lim))

In [None]:
# 99% confidence interval (z = 2.58)
male_margin_of_error_clt = 2.58*male_df['Purchase'].std()/np.sqrt(len(male_df))
male_sample_mean = male_df['Purchase'].mean()
male_lower_lim = male_sample_mean - male_margin_of_error_clt
male_upper_lim = male_sample_mean + male_margin_of_error_clt

female_margin_of_error_clt = 2.58*female_df['Purchase'].std()/np.sqrt(len(female_df))
female_sample_mean = female_df['Purchase'].mean()
female_lower_lim = female_sample_mean - female_margin_of_error_clt
female_upper_lim = female_sample_mean + female_margin_of_error_clt

print("Male confidence interval of means: ({:.2f}, {:.2f})".format(male_lower_lim, male_upper_lim))
print("Female confidence interval of means: ({:.2f}, {:.2f})".format(female_lower_lim, female_upper_lim))
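
The three cells above repeat the same formula with a different multiplier. A reusable version might look like the sketch below, which derives the multiplier from the confidence level instead of hard-coding it. The function name `mean_confidence_interval` and the toy values are hypothetical, and only the Python standard library is used.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(values, confidence=0.95):
    """Normal-approximation (CLT) interval for the mean of `values`."""
    # Two-sided critical value, e.g. about 1.96 for 95%
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    center = mean(values)
    margin = z * stdev(values) / sqrt(len(values))
    return center - margin, center + margin

# Critical values for the three levels used in this notebook
for level in (0.90, 0.95, 0.99):
    print(level, round(NormalDist().inv_cdf(0.5 + level / 2), 3))

# Hypothetical per-customer totals, for illustration only
toy_totals = [820.0, 910.0, 1005.0, 760.0, 990.0, 875.0]
low, high = mean_confidence_interval(toy_totals, confidence=0.90)
print(f"90% CI: ({low:.2f}, {high:.2f})")
```

The multipliers 1.64, 1.96, and 2.58 used in the cells are rounded forms of these critical values. The toy list is far too short for the normal approximation to be reliable; it only demonstrates the call.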

**3. Are confidence intervals of average male and female spending overlapping? How can Company XYZ leverage this conclusion to make changes or improvements?**

- The confidence intervals of average male and female spending do not overlap.
- Company XYZ can leverage this by working from sample data: applying the Central Limit Theorem to the sample yields confidence intervals at the 90\%, 95\%, or 99\% level for the whole customer population, and the confidence level can be adjusted to trade interval width against certainty before the findings are reported.
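
The overlap claim can be checked mechanically. The sketch below hard-codes the gender intervals reported in the Confidence Intervals section of this notebook and tests each pair; `intervals_overlap` is a hypothetical helper name.

```python
def intervals_overlap(a, b):
    """True when closed intervals a=(low, high) and b=(low, high) share any point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Gender intervals as reported in the Confidence Intervals section
reported = {
    "90%": {"Male": (900471.15, 950217.65), "Female": (679584.51, 744464.28)},
    "95%": {"Male": (895617.83, 955070.97), "Female": (673254.77, 750794.02)},
    "99%": {"Male": (886214.53, 964474.27), "Female": (660990.91, 763057.88)},
}

for level, ci in reported.items():
    print(level, "overlap:", intervals_overlap(ci["Male"], ci["Female"]))
```

The same helper can be reused for the marital-status and age-group intervals, where neighbouring groups may well overlap.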

**4. Results when the same activity is performed for Married vs Unmarried**

In [None]:
amt_df = XYZ_df.groupby(['User_ID', 'Marital_Status'])[['Purchase']].sum()
avg_amt_df = amt_df.reset_index()
avg_amt_df

In [None]:
avg_amt_df['Marital_Status'].value_counts()

In [None]:
married_samp_size = 3000
unmarried_samp_size = 2000
num_repitions = 1000
married_means = []
unmarried_means = []

for _ in range(num_repitions):
    married_mean = avg_amt_df[avg_amt_df['Marital_Status']==1].sample(married_samp_size, replace=True)['Purchase'].mean()
    unmarried_mean = avg_amt_df[avg_amt_df['Marital_Status']==0].sample(unmarried_samp_size, replace=True)['Purchase'].mean()
    
    married_means.append(married_mean)
    unmarried_means.append(unmarried_mean)
    
    
fig, axis = plt.subplots(nrows=1, ncols=2, figsize=(20, 6))

axis[0].hist(married_means, bins=100)
axis[1].hist(unmarried_means, bins=100)
axis[0].set_title("Married - Distribution of means, Sample size: 3000")
axis[1].set_title("Unmarried - Distribution of means, Sample size: 2000")

plt.show()
print("\n")
print("Population mean - Mean of sample means of amount spent for Married: {:.2f}".format(np.mean(married_means)))
print("Population mean - Mean of sample means of amount spent for Unmarried: {:.2f}".format(np.mean(unmarried_means)))

print("\nMarried - Sample mean: {:.2f} Sample std: {:.2f}".format(avg_amt_df[avg_amt_df['Marital_Status']==1]['Purchase'].mean(), avg_amt_df[avg_amt_df['Marital_Status']==1]['Purchase'].std()))
print("Unmarried - Sample mean: {:.2f} Sample std: {:.2f}".format(avg_amt_df[avg_amt_df['Marital_Status']==0]['Purchase'].mean(), avg_amt_df[avg_amt_df['Marital_Status']==0]['Purchase'].std()))

In [None]:
for val in ["Married", "Unmarried"]:
    
    new_val = 1 if val == "Married" else 0
    
    new_df = avg_amt_df[avg_amt_df['Marital_Status']==new_val] 
    
    margin_of_error_clt = 1.64*new_df['Purchase'].std()/np.sqrt(len(new_df))
    sample_mean = new_df['Purchase'].mean()
    lower_lim = sample_mean - margin_of_error_clt
    upper_lim = sample_mean + margin_of_error_clt

    print("{} confidence interval of means: ({:.2f}, {:.2f})".format(val, lower_lim, upper_lim))

In [None]:
for val in ["Married", "Unmarried"]:
    
    new_val = 1 if val == "Married" else 0
    
    new_df = avg_amt_df[avg_amt_df['Marital_Status']==new_val] 
    
    margin_of_error_clt = 2.58*new_df['Purchase'].std()/np.sqrt(len(new_df))
    sample_mean = new_df['Purchase'].mean()
    lower_lim = sample_mean - margin_of_error_clt
    upper_lim = sample_mean + margin_of_error_clt

    print("{} confidence interval of means: ({:.2f}, {:.2f})".format(val, lower_lim, upper_lim))

**5. Results when the same activity is performed for Age**

We will examine customer spending patterns across different age groups. By computing confidence intervals for each group, we can identify which age ranges spend more and assess variations in spending habits with statistical certainty.

In [None]:
amt_df = XYZ_df.groupby(['User_ID', 'Age'])[['Purchase']].sum()
avg_amt_df = amt_df.reset_index()
avg_amt_df

In [None]:
avg_amt_df['Age'].value_counts()

In [None]:
sample_size = 200
num_repitions = 1000

all_means = {}

age_intervals = ['26-35', '36-45', '18-25', '46-50', '51-55', '55+', '0-17']
for age_interval in age_intervals:
    all_means[age_interval] = []

for age_interval in age_intervals:
    for _ in range(num_repitions):
        mean = avg_amt_df[avg_amt_df['Age']==age_interval].sample(sample_size, replace=True)['Purchase'].mean()
        all_means[age_interval].append(mean)
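
The resampled means collected in `all_means` can also be turned into an interval directly, without the CLT formula, by taking percentiles of the bootstrap distribution. The sketch below shows the idea for a single group. The data is synthetic (drawn from a gamma distribution purely for illustration), and the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the sketch is repeatable

# Hypothetical per-customer totals for one age group
toy_totals = rng.gamma(shape=2.0, scale=400_000.0, size=500)

# Resample with replacement and record the mean of each resample
sample_size, num_repetitions = 200, 1000
boot_means = np.array([
    rng.choice(toy_totals, size=sample_size, replace=True).mean()
    for _ in range(num_repetitions)
])

# Percentile interval: the middle 90% of the bootstrap means
low, high = np.percentile(boot_means, [5, 95])
print(f"90% percentile interval: ({low:.2f}, {high:.2f})")
```

Because each resample has 200 rows, the interval describes the variability of a mean over 200 customers. It will be wider than an interval based on the full group when the group is larger than that.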

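Using the Central Limit Theorem formula, we can now estimate the range that contains the population mean spending for each age group, starting with 90\% confidence: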

In [None]:
for val in ['26-35', '36-45', '18-25', '46-50', '51-55', '55+', '0-17']:
    
    new_df = avg_amt_df[avg_amt_df['Age']==val] 
    
    margin_of_error_clt = 1.64*new_df['Purchase'].std()/np.sqrt(len(new_df))
    sample_mean = new_df['Purchase'].mean()
    lower_lim = sample_mean - margin_of_error_clt
    upper_lim = sample_mean + margin_of_error_clt
    
    print("For age {}, confidence interval of means: ({:.2f}, {:.2f})".format(val, lower_lim, upper_lim))

In [None]:
for val in ['26-35', '36-45', '18-25', '46-50', '51-55', '55+', '0-17']:
    
    new_df = avg_amt_df[avg_amt_df['Age']==val] 
    
    margin_of_error_clt = 1.96*new_df['Purchase'].std()/np.sqrt(len(new_df))
    sample_mean = new_df['Purchase'].mean()
    lower_lim = sample_mean - margin_of_error_clt
    upper_lim = sample_mean + margin_of_error_clt

    print("For age {}, confidence interval of means: ({:.2f}, {:.2f})".format(val, lower_lim, upper_lim))

In [None]:
for val in ['26-35', '36-45', '18-25', '46-50', '51-55', '55+', '0-17']:
    
    new_df = avg_amt_df[avg_amt_df['Age']==val] 
    
    margin_of_error_clt = 2.58*new_df['Purchase'].std()/np.sqrt(len(new_df))
    sample_mean = new_df['Purchase'].mean()
    lower_lim = sample_mean - margin_of_error_clt
    upper_lim = sample_mean + margin_of_error_clt

    print("For age {}, confidence interval of means: ({:.2f}, {:.2f})".format(val, lower_lim, upper_lim))

#### 7. Final Insights 

**After analyzing the data, we have gathered key insights about customer spending patterns based on age, gender, marital status, city category, and product categories.**

***Actionable Insights***

**Age Group Analysis**
- Customers aged **18-45** account for around **80\%** of purchases and are the highest spenders:
  - **26-35 years:** 40\%  
  - **18-25 years:** 18\%  
  - **36-45 years:** 20\%  
- This suggests that younger adults and middle-aged customers contribute the most to overall spending.

**Gender-Based Spending Patterns**
- **Male customers account for ~75\% of total purchases**, while female customers contribute the remaining **25%**.
- On average, **male consumers spend more than females** in the retail store. This is evident when analyzing the total purchase value:
  - **Average amount spent by Male customers:** ₹925,408.28  
  - **Average amount spent by Female customers:** ₹712,217.18  
- This trend suggests that **male shoppers are the primary drivers of sales**.

**Marital Status & Purchase Behavior**
- When analyzing **marital status and purchase behavior**:
  - **60% of buyers are Single**, while **40\% are Married**.
  - **Single men spend the most during Black Friday**, whereas spending decreases after marriage.
  - This decline in spending among married men may be attributed to **increased financial responsibilities**.

**Impact of Years Spent in the City**
- The **Stay_In_Current_City_Years** feature reveals an interesting trend:
  - **Customers who have lived in the city for 1 year tend to spend the most.**
  - Possible explanation: **New residents are more likely to purchase household items and essentials**, whereas those who have lived in the city for over four years are **more settled and spend less on new purchases**.
  - **Breakdown of spending by years in the city:**
    - **1 year:** 35\%  
    - **2 years:** 18\%  
    - **3 years:** 17\%  
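
A breakdown like this can be produced with `value_counts(normalize=True)`. The sketch below assumes the percentages are shares of purchase records rather than shares of revenue, which the notebook does not state. The toy series is made up.

```python
import pandas as pd

# Hypothetical toy column standing in for XYZ_df['Stay_In_Current_City_Years']
stay_years = pd.Series([1, 1, 2, 3, 1, 0, 4, 2, 1, 3])

# Share of rows per category, as percentages sorted by category value
share_pct = (stay_years.value_counts(normalize=True) * 100).sort_index()
print(share_pct.round(1))
```

To get shares of revenue instead, the same idea applies to `groupby(...)['Purchase'].sum()` divided by the overall total.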

**City-Wise Purchasing Trends**
- The **City_Category** feature reveals:
  - **City B contributes the most to total sales revenue.**
  - However, **for specific products, City C records higher purchase volumes**, which was unexpected.

**Product Category Insights**
- The dataset contains **20 different product categories**.
- The **most frequently purchased categories** are:
  - **Product Category 1, 5, 8, and 11**.

**Occupational Distribution**
- There are **21 distinct occupations** (coded 0 to 20) among customers.



#### Confidence Intervals 

**Central Limit Theorem and Confidence Intervals for the Population**

**Average Spending by Gender**
Using the Central Limit Theorem, we can estimate the confidence intervals for the average spending by male and female customers.

- **Average amount spent by male customers:** ₹925,408.28  
- **Average amount spent by female customers:** ₹712,217.18  

**90\% Confidence Interval**
- Male: **(₹900,471.15, ₹950,217.65)**
- Female: **(₹679,584.51, ₹744,464.28)**

**95% Confidence Interval**
- Male: **(₹895,617.83, ₹955,070.97)**
- Female: **(₹673,254.77, ₹750,794.02)**

**99% Confidence Interval**
- Male: **(₹886,214.53, ₹964,474.27)**
- Female: **(₹660,990.91, ₹763,057.88)**

---

**Confidence Interval by Marital Status**
**90\% Confidence Interval**
- **Married:** (₹812,686.46, ₹874,367.13)  
- **Unmarried:** (₹853,938.67, ₹907,212.90)  

**95\% Confidence Interval**
- **Married:** (₹806,668.83, ₹880,384.76)  
- **Unmarried:** (₹848,741.18, ₹912,410.38)  

**99\% Confidence Interval**
- **Married:** (₹795,009.68, ₹892,043.91)  
- **Unmarried:** (₹838,671.05, ₹922,480.51)  

---

**Confidence Interval by Age Group**
**90\% Confidence Interval**
- **Age 26-35:** (₹952,320.12, ₹1,026,998.51)  
- **Age 36-45:** (₹832,542.56, ₹926,788.86)  
- **Age 18-25:** (₹810,323.44, ₹899,402.80)  
- **Age 46-50:** (₹726,410.64, ₹858,686.93)  
- **Age 51-55:** (₹703,953.00, ₹822,448.85)  
- **Age 55+:** (₹487,192.99, ₹592,201.50)  
- **Age 0-17:** (₹542,553.13, ₹695,182.50)  

**95\% Confidence Interval**
- **Age 26-35:** (₹945,034.42, ₹1,034,284.21)  
- **Age 36-45:** (₹823,347.80, ₹935,983.62)  
- **Age 18-25:** (₹801,632.78, ₹908,093.46)  
- **Age 46-50:** (₹713,505.63, ₹871,591.93)  
- **Age 51-55:** (₹692,392.43, ₹834,009.42)  
- **Age 55+:** (₹476,948.26, ₹602,446.23)  
- **Age 0-17:** (₹527,662.46, ₹710,073.17)  

**99\% Confidence Interval**
- **Age 26-35:** (₹930,918.39, ₹1,048,400.25)  
- **Age 36-45:** (₹805,532.95, ₹953,798.47)  
- **Age 18-25:** (₹784,794.60, ₹924,931.63)  
- **Age 46-50:** (₹688,502.19, ₹896,595.37)  
- **Age 51-55:** (₹669,993.82, ₹856,408.03)  
- **Age 55+:** (₹457,099.09, ₹622,295.40)  
- **Age 0-17:** (₹498,811.78, ₹738,923.84)  


#### 8. Recommendations 

1. **Men** spent more money than women, so the company should focus on retaining its existing female customers and attracting more of them.

2. **Product_Category 1, 5, 8, and 11** have the highest purchase frequency, which suggests that customers favor the products in these categories. The company can either focus on selling more of these products or promote the categories that are purchased less.

3. **Unmarried** customers spend more on average than married customers, although the confidence intervals for the two groups overlap, so the difference is not conclusive. The company could focus on acquiring married customers.

4. Customers aged **18-45** spend more than the other age groups, so the company should focus on acquiring customers in the other age groups.

5. City category **B**, described here as tier-2, records the highest number of purchases. Management could open more outlets in the other city categories, A and C, in order to grow the business.