<mark>############################## 09/05/2025 ###############################
<mark>################################# Friday ################################

# Warm up

**What Is EDA (Exploratory Data Analysis)?**

Exploratory Data Analysis (EDA) is the process of examining a dataset to uncover its structure, patterns, anomalies, and relationships—before applying formal modeling or statistical tests. It’s the “look before you leap” phase of data science.

Key Goals of EDA:
- Understand data types, distributions, and missing values
- Detect outliers, inconsistencies, or data quality issues
- Reveal relationships between variables (correlation, trends)
- Guide feature selection and preprocessing for modeling


**What Is Preprocessing?**

Preprocessing is the essential first step in any data analysis or machine learning pipeline. It’s where raw, messy, inconsistent data gets cleaned, transformed, and structured so that downstream models or insights are accurate and reliable.

Think of it as the QA phase for your data—before you let algorithms touch it.

*Data Cleaning:* Handles missing values, removes duplicates, corrects outliers and formats

*Data Integration:* Merges data from multiple sources (e.g., WMS + ERP) into a unified dataset

*Data Transformation:* Normalizes, scales, encodes, or reshapes data for modeling readiness

*Data Reduction:* Simplifies data by removing irrelevant features or aggregating values


# Exploratory Data Analysis

# 1 Declare Library

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder

# 2 Import Dataset

In [None]:
df = pd.read_csv('/content/purchase_data.csv')

# 3 Show the first 5 rows

In [None]:
df.head() # Show the first 5 rows

Unnamed: 0,User_ID,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase
0,1000001,P00069042,F,0-17,10.0,A,2,0.0,3.0,,,8370.0
1,1000001,P00248942,F,0-17,10.0,A,2,0.0,1.0,6.0,14.0,15200.0
2,1000001,P00087842,F,0-17,10.0,A,2,0.0,12.0,,,1422.0
3,1000001,P00085442,F,0-17,10.0,A,2,0.0,12.0,14.0,,1057.0
4,1000002,P00285442,M,55+,16.0,C,4+,0.0,8.0,,,7969.0


# 4 Show the last 5 rows

In [None]:
df.tail() # Show the last 5 rows

Unnamed: 0,User_ID,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase
263010,1004473,P00041942,M,36-45,1.0,B,3.0,0.0,5.0,18.0,,3722.0
263011,1004473,P00115142,M,36-45,1.0,B,3.0,0.0,1.0,8.0,17.0,19253.0
263012,1004473,P00188442,M,36-45,1.0,B,3.0,0.0,5.0,7.0,,3608.0
263013,1004473,P00119442,M,36-45,1.0,B,3.0,0.0,5.0,,,3604.0
263014,10,,,,,,,,,,,


# 5 Show the statistical summary

In [None]:
df.describe() # Show the statistical summary

Unnamed: 0,User_ID,Occupation,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase
count,263015.0,263014.0,263014.0,263014.0,181501.0,80582.0,263014.0
mean,1002941.0,8.083558,0.408685,5.291099,9.844756,12.658298,9319.305269
std,2593.126,6.524052,0.491592,3.745722,5.086696,4.129156,4970.152966
min,10.0,0.0,0.0,1.0,2.0,3.0,185.0
25%,1001457.0,2.0,0.0,1.0,5.0,9.0,5863.0
50%,1002972.0,7.0,0.0,5.0,9.0,14.0,8060.0
75%,1004335.0,14.0,1.0,8.0,15.0,16.0,12059.0
max,1006040.0,20.0,1.0,18.0,18.0,18.0,23961.0


# 6 Show the Columns

In [None]:
df.columns

Index(['User_ID', 'Product_ID', 'Gender', 'Age', 'Occupation', 'City_Category',
       'Stay_In_Current_City_Years', 'Marital_Status', 'Product_Category_1',
       'Product_Category_2', 'Product_Category_3', 'Purchase'],
      dtype='object')

# 7 Number of Rows and columns (Shape)

In [None]:
df.shape

(263015, 12)

# 8 Check if there are Null values and the total Null values

In [None]:
df.isnull().sum().sum()

np.int64(263956)

# 9 Info - summary of a DataFrame

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 263015 entries, 0 to 263014
Data columns (total 12 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   User_ID                     263015 non-null  int64  
 1   Product_ID                  263014 non-null  object 
 2   Gender                      263014 non-null  object 
 3   Age                         263014 non-null  object 
 4   Occupation                  263014 non-null  float64
 5   City_Category               263014 non-null  object 
 6   Stay_In_Current_City_Years  263014 non-null  object 
 7   Marital_Status              263014 non-null  float64
 8   Product_Category_1          263014 non-null  float64
 9   Product_Category_2          181501 non-null  float64
 10  Product_Category_3          80582 non-null   float64
 11  Purchase                    263014 non-null  float64
dtypes: float64(6), int64(1), object(5)
memory usage: 24.1+ MB


# 10 Null values per column

In [None]:
df.isnull().sum()

Unnamed: 0,0
User_ID,0
Product_ID,1
Gender,1
Age,1
Occupation,1
City_Category,1
Stay_In_Current_City_Years,1
Marital_Status,1
Product_Category_1,1
Product_Category_2,81514
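
Raw null counts are often easier to judge as a share of the rows. The following is a minimal sketch on a small made-up frame (the `toy` data is hypothetical and is not the purchase dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the purchase data.
toy = pd.DataFrame({
    "Product_Category_1": [3.0, 1.0, 12.0, 8.0],
    "Product_Category_2": [np.nan, 6.0, np.nan, 14.0],
    "Product_Category_3": [np.nan, 14.0, np.nan, np.nan],
})

# isnull() gives a boolean frame; the mean of booleans is the fraction of True values.
null_pct = toy.isnull().mean() * 100
print(null_pct.sort_values(ascending=False))
```

Applied to the real `df`, the same expression would show at a glance which columns are mostly empty.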


## This Concludes Exploratory Data Analysis

## <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

# Preprocessing

# Field = Stay_In_Current_City_Years

In [None]:
df['Stay_In_Current_City_Years'].unique() # displays the unique from the column

array(['2', '4+', '3', '1', '0', nan], dtype=object)

The `Stay_In_Current_City_Years` column contains the value `4+` in some rows. Because of the `+`, pandas reads the whole column as `object` (strings) instead of an integer type.

As a rule of thumb, most ML algorithms cannot work with `object` columns directly; the values have to be converted to a numeric type first.

For example, to predict tomorrow's weather, you pass in all the past data, and the model goes through the patterns in those numbers to predict what the weather will look like the next day.
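
To find which columns still need conversion, one option is to list every non-numeric column. This is a small sketch on hypothetical data; the name `non_numeric_cols` is introduced only for illustration:

```python
import pandas as pd

# Hypothetical rows shaped like the purchase data.
toy = pd.DataFrame({
    "Gender": ["F", "M", "M"],
    "Stay_In_Current_City_Years": ["2", "4+", "3"],
    "Purchase": [8370.0, 7969.0, 15227.0],
})

# Everything that is not a numeric dtype still needs cleaning or encoding.
non_numeric_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(non_numeric_cols)
```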

In [None]:
# Now replace 4+ with 4.
# Since we know there is only one value that is 4+ we can execute the below command
df['Stay_In_Current_City_Years'] = df['Stay_In_Current_City_Years'].replace("4+", 4)
df.head(10)

Unnamed: 0,User_ID,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase
0,1000001,P00069042,F,0-17,10.0,A,2,0.0,3.0,,,8370.0
1,1000001,P00248942,F,0-17,10.0,A,2,0.0,1.0,6.0,14.0,15200.0
2,1000001,P00087842,F,0-17,10.0,A,2,0.0,12.0,,,1422.0
3,1000001,P00085442,F,0-17,10.0,A,2,0.0,12.0,14.0,,1057.0
4,1000002,P00285442,M,55+,16.0,C,4,0.0,8.0,,,7969.0
5,1000003,P00193542,M,26-35,15.0,A,3,0.0,1.0,2.0,,15227.0
6,1000004,P00184942,M,46-50,7.0,B,2,1.0,1.0,8.0,17.0,19215.0
7,1000004,P00346142,M,46-50,7.0,B,2,1.0,1.0,15.0,,15854.0
8,1000004,P0097242,M,46-50,7.0,B,2,1.0,1.0,16.0,,15686.0
9,1000005,P00274942,M,26-35,20.0,A,1,1.0,8.0,,,7871.0


In [None]:
df['Stay_In_Current_City_Years'].unique()

array(['2', 4, '3', '1', '0', nan], dtype=object)

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 263015 entries, 0 to 263014
Data columns (total 12 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   User_ID                     263015 non-null  int64  
 1   Product_ID                  263014 non-null  object 
 2   Gender                      263014 non-null  object 
 3   Age                         263014 non-null  object 
 4   Occupation                  263014 non-null  float64
 5   City_Category               263014 non-null  object 
 6   Stay_In_Current_City_Years  263014 non-null  object 
 7   Marital_Status              263014 non-null  float64
 8   Product_Category_1          263014 non-null  float64
 9   Product_Category_2          181501 non-null  float64
 10  Product_Category_3          80582 non-null   float64
 11  Purchase                    263014 non-null  float64
dtypes: float64(6), int64(1), object(5)
memory usage: 24.1+ MB


In [None]:
# Check for the null values in the columns
df.isnull().sum()

Unnamed: 0,0
User_ID,0
Product_ID,1
Gender,1
Age,1
Occupation,1
City_Category,1
Stay_In_Current_City_Years,1
Marital_Status,1
Product_Category_1,1
Product_Category_2,81514


In [None]:
# What if more than one value carried a '+', for example 2+ or 3+?
# Executing '.replace' for each and every value is not productive.
# Also notice that 'Stay_In_Current_City_Years' still remains object and cannot be interpreted by ML.
# This is because the column now mixes strings ('2', '3', ...) with the integer 4 and NaN.

# df['Stay_In_Current_City_Years'].astype(int) will not work because the column also contains NaN.
df['Stay_In_Current_City_Years'] = pd.to_numeric(df['Stay_In_Current_City_Years'], errors='coerce')
                                                 # This converts every value to a numeric type.
                                                 # Values that cannot be converted become NaN (so '2+' would also become NaN).
                                                 # Example: '1' becomes 1.0
                                                 # Example: '2' becomes 2.0
                                                 # Example: '@' becomes NaN

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 263015 entries, 0 to 263014
Data columns (total 12 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   User_ID                     263015 non-null  int64  
 1   Product_ID                  263014 non-null  object 
 2   Gender                      263014 non-null  object 
 3   Age                         263014 non-null  object 
 4   Occupation                  263014 non-null  float64
 5   City_Category               263014 non-null  object 
 6   Stay_In_Current_City_Years  263014 non-null  float64
 7   Marital_Status              263014 non-null  float64
 8   Product_Category_1          263014 non-null  float64
 9   Product_Category_2          181501 non-null  float64
 10  Product_Category_3          80582 non-null   float64
 11  Purchase                    263014 non-null  float64
dtypes: float64(7), int64(1), object(4)
memory usage: 24.1+ MB
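
If several values carried a trailing `+` (say `2+` or `3+`), `errors='coerce'` alone would turn them into NaN and lose information. One possible alternative is to strip the suffix before converting. The values below are made up for illustration; the real column only contains `4+`:

```python
import numpy as np
import pandas as pd

# Hypothetical values, including a junk entry '@' and a missing value.
years = pd.Series(["2", "4+", "3+", "1", np.nan, "@"])

# Remove a trailing '+' from each string, then convert.
# Anything still non-numeric (here '@') becomes NaN because of errors='coerce'.
cleaned = pd.to_numeric(years.str.rstrip("+"), errors="coerce")
print(cleaned.tolist())
```

Whether `4+` should really be treated as exactly 4 is a modeling decision; the sketch only shows the mechanics.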


# Field = Age

In [None]:
df['Age'].unique()

array(['0-17', '55+', '26-35', '46-50', '51-55', '36-45', '18-25', nan],
      dtype=object)

These ranges are hard to convert to a single number, so for now it is best to leave the column as it is.

Leaving it as-is preserves interpretability and aligns with how business teams typically think about customer segmentation.

In [None]:
# Let's step back and see how many Null values there are in each column
df.isnull().sum()

Unnamed: 0,0
User_ID,0
Product_ID,1
Gender,1
Age,1
Occupation,1
City_Category,1
Stay_In_Current_City_Years,1
Marital_Status,1
Product_Category_1,1
Product_Category_2,81514


# Product_Category_1, Product_Category_2, Product_Category_3

Product_Category_2 81514

Product_Category_3 182433

What can be done with this?

Here we have three product categories. Let's assume:

P1 = Vegetables

P2 = Milk

P3 = Fruits

Suppose a purchase is made (carrots, onions, and apples) and the same person buys an apple again.

Here P1 = 2

P2 = 0. Nothing was purchased in this category, so it shows up as a Null value, but it really means "none" rather than "unknown".

P3 = 1

So for Product_Category_2 and Product_Category_3, it is safer to replace the Null values with 0.

In [None]:
# Fill the Null values in Product_Category_2 and Product_Category_3 with 0.
# Assign the result back instead of calling fillna(..., inplace=True) on df[col]:
# the chained inplace form raises a FutureWarning and stops working in pandas 3.0.
df['Product_Category_2'] = df['Product_Category_2'].fillna(0)
df['Product_Category_3'] = df['Product_Category_3'].fillna(0)


In [None]:
df.isnull().sum()

Unnamed: 0,0
User_ID,0
Product_ID,1
Gender,1
Age,1
Occupation,1
City_Category,1
Stay_In_Current_City_Years,1
Marital_Status,1
Product_Category_1,1
Product_Category_2,0
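
Several columns can also be filled in one call by passing a dict to `DataFrame.fillna`. A minimal sketch on a hypothetical frame:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps in all three category columns.
toy = pd.DataFrame({
    "Product_Category_1": [3.0, 1.0, np.nan],
    "Product_Category_2": [np.nan, 6.0, np.nan],
    "Product_Category_3": [np.nan, 14.0, np.nan],
})

# Keys are column names, values are the fill value for that column.
# Columns that are not listed (Product_Category_1) are left untouched.
toy = toy.fillna({"Product_Category_2": 0, "Product_Category_3": 0})
print(toy.isnull().sum())
```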


Now that Product_Category_2 and Product_Category_3 are filled, the only Null values left are the single missing value in each of the other columns.

# Drop Null Values

In [None]:
# Drop the Null values.
# Note that dropna() does not just clean a column:
# it removes every row that contains at least one NaN,
# so the record count goes down.

df.dropna(inplace = True)

In [None]:
df.isnull().sum()

Unnamed: 0,0
User_ID,0
Product_ID,0
Gender,0
Age,0
Occupation,0
City_Category,0
Stay_In_Current_City_Years,0
Marital_Status,0
Product_Category_1,0
Product_Category_2,0


All Null values got wiped out.
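
Because `dropna()` removes whole rows, it can help to compare the row count before and after, or to restrict the check to specific columns with `subset`. A sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one row is missing Purchase, another is missing Gender.
toy = pd.DataFrame({
    "User_ID": [1, 2, 3, 4],
    "Gender": ["F", "M", np.nan, "M"],
    "Purchase": [8370.0, np.nan, 1422.0, 7969.0],
})

all_dropped = toy.dropna()                      # drops rows with a null in any column
subset_dropped = toy.dropna(subset=["Gender"])  # only looks at the Gender column

print(len(toy), len(all_dropped), len(subset_dropped))
```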

In [None]:
# Let's check for duplicate rows
df.duplicated().sum()

np.int64(0)

In [None]:
# Let's check the shape
# Before we removed the Null values, the shape was (263015, 12)
df.shape

(263014, 12)

We removed one row. Make sure there are no records left that have Null values.

In [None]:
# Command to see if there are any records that have Null values
df[df.isnull().any(axis=1)]

Unnamed: 0,User_ID,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase


In [None]:
# If you want to see how many columns are null in each row
df[df.isnull().any(axis=1)].isnull().sum(axis=1)

Unnamed: 0,0
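
To list the names of the null columns in each row, rather than only a count, one possible approach is to apply a small function to the boolean mask. The frame below is hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 1 is missing Gender and Age, row 2 is missing Purchase.
toy = pd.DataFrame({
    "Gender": ["F", np.nan, "M"],
    "Age": ["0-17", np.nan, "55+"],
    "Purchase": [8370.0, 7969.0, np.nan],
})

mask = toy.isnull()
rows_with_nulls = mask[mask.any(axis=1)]

# For each such row, keep only the True entries and collect their column names.
null_columns = rows_with_nulls.apply(lambda row: row[row].index.tolist(), axis=1)
print(null_columns)
```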


# Encoding

Let's take Gender = M, F, T.
A machine learning model cannot work with these letters directly; this is where encoding comes into play.
Encoding converts the object datatype to a numeric datatype, which can be either integer or float.

First, let's assign values

F = 0

M = 1

T = 2

On what basis are the numbers assigned? The categories are sorted in alphabetical order and the values are assigned starting from 0. Here F, M, T are already sorted.

There are four types of encoding

1. Label Encoding
2. Ordinal Encoding
3. One-Hot Encoding
4. Target Encoding (Advanced)


*Label Encoding*

Label Encoding assigns each category an integer code, following the alphabetical order of the categories.
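
The alphabetical rule can be sketched in plain Python. This is only an illustration of the idea (scikit-learn's `LabelEncoder` also sorts its classes before numbering them); the names `genders`, `mapping`, and `codes` are made up for the example:

```python
# Hypothetical raw values, deliberately not in sorted order.
genders = ["M", "F", "T", "F", "M"]

# Sort the unique categories, then number them starting from 0.
mapping = {label: code for code, label in enumerate(sorted(set(genders)))}
codes = [mapping[g] for g in genders]

print(mapping)
print(codes)
```

Because the codes depend only on sort order, they carry no real ranking, which is why label encoding suits categories where the numeric order does not matter to the model.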

*Ordinal Encoding*
Ordinal Encoding is a way to convert ordered categories into numbers so that machine learning models can understand them. It’s used when the categories have a natural ranking, but aren’t numeric by default.

| Age Group | Encoded Value |
|-----------|---------------|
| 0-17      | 0             |
| 18-25     | 1             |
| 26-35     | 2             |
| 36-45     | 3             |
| 46-50     | 4             |
| 51-55     | 5             |
| 55+       | 6             |

The age groups above have a natural order. It would not make any sense to put 55+ right after the age group 0-17.

Another example would be LKG, UKG, 1st Grade, 2nd Grade,...PhD. The order would be broken if you put PhD right after LKG.
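
An explicit ranking like this can be sketched with an ordered pandas categorical. This is a pandas-only alternative shown for illustration, not the scikit-learn `OrdinalEncoder` used later in this notebook; the `ages` values are made up:

```python
import pandas as pd

# The ranking has to be spelled out by hand, lowest to highest.
age_order = ["0-17", "18-25", "26-35", "36-45", "46-50", "51-55", "55+"]
ages = pd.Series(["55+", "0-17", "26-35", "18-25"])

# An ordered categorical keeps the explicit ranking; .codes is each value's position in it.
age_cat = pd.Categorical(ages, categories=age_order, ordered=True)
codes = pd.Series(age_cat.codes, index=ages.index)
print(codes.tolist())
```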

*One-Hot Encoding*

One-Hot Encoding transforms categorical values into binary columns, where each column represents one category, and rows are marked with 1 or 0 depending on whether the category is present.
It’s perfect for nominal categories—those with no inherent order.
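
As a rough sketch of one-hot encoding, `pd.get_dummies` can expand a nominal column such as `City_Category` into one 0/1 column per category. The rows and the `City` prefix below are assumptions made for the example:

```python
import pandas as pd

# Hypothetical rows; City_Category has no inherent order.
toy = pd.DataFrame({"City_Category": ["A", "C", "B", "A"]})

# One column per category; dtype=int gives 0/1 instead of True/False.
one_hot = pd.get_dummies(toy["City_Category"], prefix="City", dtype=int)
print(one_hot)
```

Each row has exactly one 1, so no artificial ranking between A, B, and C is introduced.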


### Gender - Label Encoding

In [None]:
df.columns

Index(['User_ID', 'Product_ID', 'Gender', 'Age', 'Occupation', 'City_Category',
       'Stay_In_Current_City_Years', 'Marital_Status', 'Product_Category_1',
       'Product_Category_2', 'Product_Category_3', 'Purchase'],
      dtype='object')

In [None]:
# Before running the Label Encoding
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 263014 entries, 0 to 263013
Data columns (total 12 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   User_ID                     263014 non-null  int64  
 1   Product_ID                  263014 non-null  object 
 2   Gender                      263014 non-null  object 
 3   Age                         263014 non-null  object 
 4   Occupation                  263014 non-null  float64
 5   City_Category               263014 non-null  object 
 6   Stay_In_Current_City_Years  263014 non-null  float64
 7   Marital_Status              263014 non-null  float64
 8   Product_Category_1          263014 non-null  float64
 9   Product_Category_2          263014 non-null  float64
 10  Product_Category_3          263014 non-null  float64
 11  Purchase                    263014 non-null  float64
dtypes: float64(7), int64(1), object(4)
memory usage: 26.1+ MB


Before applying Label Encoding, check that the column holds a single, consistent type. LabelEncoder sorts the unique values, so a column that mixes types (for example strings and NaN) raises an error; here Gender should be a clean object (string) column.


In [None]:
df['Gender'].dtype

dtype('O')

In [None]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['Gender_Encoded'] = le.fit_transform(df['Gender'])


In [None]:
df.columns

Index(['User_ID', 'Product_ID', 'Gender', 'Age', 'Occupation', 'City_Category',
       'Stay_In_Current_City_Years', 'Marital_Status', 'Product_Category_1',
       'Product_Category_2', 'Product_Category_3', 'Purchase',
       'Gender_Encoded'],
      dtype='object')

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 263014 entries, 0 to 263013
Data columns (total 13 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   User_ID                     263014 non-null  int64  
 1   Product_ID                  263014 non-null  object 
 2   Gender                      263014 non-null  object 
 3   Age                         263014 non-null  object 
 4   Occupation                  263014 non-null  float64
 5   City_Category               263014 non-null  object 
 6   Stay_In_Current_City_Years  263014 non-null  float64
 7   Marital_Status              263014 non-null  float64
 8   Product_Category_1          263014 non-null  float64
 9   Product_Category_2          263014 non-null  float64
 10  Product_Category_3          263014 non-null  float64
 11  Purchase                    263014 non-null  float64
 12  Gender_Encoded              263014 non-null  int64  
dtypes: float64(7), int6

In [None]:
# Move Gender_Encoded next to Gender
# Get list of columns
cols = list(df.columns)

# Remove 'Gender_Encoded' and reinsert it after 'Gender'
cols.remove('Gender_Encoded')
gender_index = cols.index('Gender') + 1
cols.insert(gender_index, 'Gender_Encoded')

# Reorder DataFrame
df = df[cols]

In [None]:
df.info(verbose = True)

<class 'pandas.core.frame.DataFrame'>
Index: 263014 entries, 0 to 263013
Data columns (total 13 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   User_ID                     263014 non-null  int64  
 1   Product_ID                  263014 non-null  object 
 2   Gender                      263014 non-null  object 
 3   Gender_Encoded              263014 non-null  int64  
 4   Age                         263014 non-null  object 
 5   Occupation                  263014 non-null  float64
 6   City_Category               263014 non-null  object 
 7   Stay_In_Current_City_Years  263014 non-null  float64
 8   Marital_Status              263014 non-null  float64
 9   Product_Category_1          263014 non-null  float64
 10  Product_Category_2          263014 non-null  float64
 11  Product_Category_3          263014 non-null  float64
 12  Purchase                    263014 non-null  float64
dtypes: float64(7), int6

In [None]:
df.head()

Unnamed: 0,User_ID,Product_ID,Gender,Gender_Encoded,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase
0,1000001,P00069042,F,0,0-17,10.0,A,2.0,0.0,3.0,0.0,0.0,8370.0
1,1000001,P00248942,F,0,0-17,10.0,A,2.0,0.0,1.0,6.0,14.0,15200.0
2,1000001,P00087842,F,0,0-17,10.0,A,2.0,0.0,12.0,0.0,0.0,1422.0
3,1000001,P00085442,F,0,0-17,10.0,A,2.0,0.0,12.0,14.0,0.0,1057.0
4,1000002,P00285442,M,1,55+,16.0,C,4.0,0.0,8.0,0.0,0.0,7969.0
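
A new column can also be placed next to its source in one step with `DataFrame.insert`, instead of reordering the column list afterwards. A minimal sketch on hypothetical rows, with a hand-written mapping standing in for the encoder:

```python
import pandas as pd

# Hypothetical rows shaped like the purchase data.
toy = pd.DataFrame({
    "User_ID": [1000001, 1000002],
    "Gender": ["F", "M"],
    "Age": ["0-17", "55+"],
})

# Hand-written mapping used here in place of LabelEncoder (F = 0, M = 1).
encoded = toy["Gender"].map({"F": 0, "M": 1})

# Insert the new column directly after 'Gender'.
position = toy.columns.get_loc("Gender") + 1
toy.insert(position, "Gender_Encoded", encoded)
print(list(toy.columns))
```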


### Age - Ordinal Encoding

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 263014 entries, 0 to 263013
Data columns (total 13 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   User_ID                     263014 non-null  int64  
 1   Product_ID                  263014 non-null  object 
 2   Gender                      263014 non-null  object 
 3   Gender_Encoded              263014 non-null  int64  
 4   Age                         263014 non-null  object 
 5   Occupation                  263014 non-null  float64
 6   City_Category               263014 non-null  object 
 7   Stay_In_Current_City_Years  263014 non-null  float64
 8   Marital_Status              263014 non-null  float64
 9   Product_Category_1          263014 non-null  float64
 10  Product_Category_2          263014 non-null  float64
 11  Product_Category_3          263014 non-null  float64
 12  Purchase                    263014 non-null  float64
dtypes: float64(7), int6

In [None]:
df['Age'].unique()

array(['0-17', '55+', '26-35', '46-50', '51-55', '36-45', '18-25'],
      dtype=object)

In [None]:
from sklearn.preprocessing import OrdinalEncoder
# You have to sort it first
age_order = [['0-17', '18-25', '26-35', '36-45', '46-50', '51-55', '55+']]
encoder = OrdinalEncoder(categories=age_order)
df['Age_Encoded'] = encoder.fit_transform(df[['Age']])

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 263014 entries, 0 to 263013
Data columns (total 14 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   User_ID                     263014 non-null  int64  
 1   Product_ID                  263014 non-null  object 
 2   Gender                      263014 non-null  object 
 3   Gender_Encoded              263014 non-null  int64  
 4   Age                         263014 non-null  object 
 5   Occupation                  263014 non-null  float64
 6   City_Category               263014 non-null  object 
 7   Stay_In_Current_City_Years  263014 non-null  float64
 8   Marital_Status              263014 non-null  float64
 9   Product_Category_1          263014 non-null  float64
 10  Product_Category_2          263014 non-null  float64
 11  Product_Category_3          263014 non-null  float64
 12  Purchase                    263014 non-null  float64
 13  Age_Encoded        

In [None]:
# Move Age_Encoded next to Age
# Get list of columns
cols = list(df.columns)

# Remove 'Age_Encoded' and reinsert it after 'Age'
cols.remove('Age_Encoded')
age_index = cols.index('Age') + 1
cols.insert(age_index, 'Age_Encoded')

# Reorder DataFrame
df = df[cols]

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 263014 entries, 0 to 263013
Data columns (total 14 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   User_ID                     263014 non-null  int64  
 1   Product_ID                  263014 non-null  object 
 2   Gender                      263014 non-null  object 
 3   Gender_Encoded              263014 non-null  int64  
 4   Age                         263014 non-null  object 
 5   Age_Encoded                 263014 non-null  float64
 6   Occupation                  263014 non-null  float64
 7   City_Category               263014 non-null  object 
 8   Stay_In_Current_City_Years  263014 non-null  float64
 9   Marital_Status              263014 non-null  float64
 10  Product_Category_1          263014 non-null  float64
 11  Product_Category_2          263014 non-null  float64
 12  Product_Category_3          263014 non-null  float64
 13  Purchase           

In [None]:
# Just to make the Encoding uniform for Gender and Age you can convert the Age_Encoded to Integer
df['Age_Encoded'] = df['Age_Encoded'].astype(int)

In [None]:
df

Unnamed: 0,User_ID,Product_ID,Gender,Gender_Encoded,Age,Age_Encoded,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase
0,1000001,P00069042,F,0,0-17,0,10.0,A,2.0,0.0,3.0,0.0,0.0,8370.0
1,1000001,P00248942,F,0,0-17,0,10.0,A,2.0,0.0,1.0,6.0,14.0,15200.0
2,1000001,P00087842,F,0,0-17,0,10.0,A,2.0,0.0,12.0,0.0,0.0,1422.0
3,1000001,P00085442,F,0,0-17,0,10.0,A,2.0,0.0,12.0,14.0,0.0,1057.0
4,1000002,P00285442,M,1,55+,6,16.0,C,4.0,0.0,8.0,0.0,0.0,7969.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
263009,1004473,P00296542,M,1,36-45,3,1.0,B,3.0,0.0,8.0,0.0,0.0,8061.0
263010,1004473,P00041942,M,1,36-45,3,1.0,B,3.0,0.0,5.0,18.0,0.0,3722.0
263011,1004473,P00115142,M,1,36-45,3,1.0,B,3.0,0.0,1.0,8.0,17.0,19253.0
263012,1004473,P00188442,M,1,36-45,3,1.0,B,3.0,0.0,5.0,7.0,0.0,3608.0


### Summary
Steps to perform a statistical analysis (using a significance level of 0.05):

1. Data Collection
2. Preparing Samples
3. Creating two hypotheses (Null, Alternate)
4. Applying the Appropriate Test

  4a. if p_value >= 0.05, fail to reject the null hypothesis

  4b. if p_value < 0.05, reject the null hypothesis
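
The decision rule in step 4 can be sketched as a tiny helper. The function name `decide` and the default `alpha` of 0.05 are assumptions for illustration:

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Return the test decision for a p-value at significance level alpha."""
    if p_value < alpha:
        return "reject H0"
    return "fail to reject H0"


print(decide(0.20))
print(decide(0.001))
```

Note that "fail to reject" is not the same as proving the null hypothesis; it only means the data do not provide enough evidence against it.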


**Question: 1**

It was observed that the average purchase made by men in the 18-25 age group was Rs. 10000. Is this still the same or not?

Null Hypothesis (H₀): The mean purchase of men aged 18-25 is Rs. 10000 (μ = 10000)

Alternative Hypothesis (H₁): The mean purchase of men aged 18-25 is not Rs. 10000 (μ ≠ 10000)


In [None]:
Age_group_18_25 = df[(df['Age_Encoded'] == 1) & (df['Gender_Encoded'] == 1)]
Age_group_18_25

Unnamed: 0,User_ID,Product_ID,Gender,Gender_Encoded,Age,Age_Encoded,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase
97,1000021,P00220242,M,1,18-25,1,16.0,B,0.0,0.0,3.0,12.0,0.0,3055.0
98,1000022,P00351142,M,1,18-25,1,15.0,A,4.0,0.0,1.0,8.0,17.0,12099.0
99,1000022,P00213242,M,1,18-25,1,15.0,A,4.0,0.0,5.0,8.0,0.0,8797.0
100,1000022,P00195942,M,1,18-25,1,15.0,A,4.0,0.0,3.0,4.0,0.0,10681.0
101,1000022,P00115642,M,1,18-25,1,15.0,A,4.0,0.0,8.0,14.0,0.0,7801.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
262767,1004446,P00145742,M,1,18-25,1,18.0,B,1.0,1.0,1.0,2.0,0.0,15431.0
262951,1004465,P00309942,M,1,18-25,1,4.0,A,1.0,1.0,5.0,0.0,0.0,1763.0
262952,1004465,P00144042,M,1,18-25,1,4.0,A,1.0,1.0,2.0,3.0,4.0,3240.0
262953,1004465,P00293342,M,1,18-25,1,4.0,A,1.0,1.0,8.0,0.0,0.0,5832.0


In [None]:
print(f"Total rows in age group 18-25 and gender 1: {Age_group_18_25.shape[0]}")

Total rows in age group 18-25 and gender 1: 36332


In [None]:
Age_group_18_25.describe(include='all')

Unnamed: 0,User_ID,Product_ID,Gender,Gender_Encoded,Age,Age_Encoded,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase
count,36332.0,36332,36332,36332.0,36332,36332.0,36332.0,36332,36332.0,36332.0,36332.0,36332.0,36332.0,36332.0
unique,,2737,1,,1,,,3,,,,,,
top,,P00112142,M,,18-25,,,B,,,,,,
freq,,144,36332,,36332,,,15477,,,,,,
mean,1002664.0,,,1.0,,1.0,6.948062,,1.851288,0.199714,4.884124,6.643785,4.086233,9440.512276
std,1684.545,,,0.0,,0.0,5.944068,,1.348962,0.399791,3.653687,6.138793,6.339762,5046.085813
min,1000021.0,,,1.0,,1.0,0.0,,0.0,0.0,1.0,0.0,0.0,198.0
25%,1001186.0,,,1.0,,1.0,4.0,,1.0,0.0,1.0,0.0,0.0,5477.0
50%,1002682.0,,,1.0,,1.0,4.0,,2.0,0.0,5.0,5.0,0.0,8109.0
75%,1003871.0,,,1.0,,1.0,12.0,,3.0,0.0,8.0,14.0,9.0,12468.25


In [None]:
# Now Calculate the mean on the purchase column
Age_group_18_25['Purchase'].mean()

np.float64(9440.512275679841)

The mean over all 36,332 rows for this group is Rs. 9440.51, which differs from the hypothesized mean of Rs. 10000.
A difference in means alone is not a statistical test, so the decision about the Null Hypothesis (H₀) is left to the t-test on the sample.


Now try the same with a sample.

In [None]:
# Create a sample of 3,600 rows, roughly 10% of the 36,332 rows in this group
Sample_Age_group_18_25 = Age_group_18_25.sample(3600, random_state = 42) # pandas picks 3,600 rows at random.
                                                            # If you don't set random_state = 42 (or any other integer),
                                                            # every run picks different rows,
                                                            # so you won't see the same records twice.
                                                            # random_state = 42 fixes the random seed,
                                                            # so the same rows are picked on every run.
Sample_Age_group_18_25

Unnamed: 0,User_ID,Product_ID,Gender,Gender_Encoded,Age,Age_Encoded,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase
205154,1001647,P00196642,M,1,18-25,1,18.0,B,0.0,0.0,8.0,14.0,0.0,5831.0
153395,1005709,P00002142,M,1,18-25,1,4.0,B,4.0,0.0,1.0,5.0,8.0,19455.0
23439,1003653,P00251242,M,1,18-25,1,15.0,C,3.0,1.0,5.0,11.0,0.0,8893.0
258890,1003868,P00223642,M,1,18-25,1,12.0,C,0.0,0.0,11.0,15.0,0.0,3151.0
251998,1002896,P00268442,M,1,18-25,1,14.0,B,0.0,0.0,8.0,17.0,0.0,5933.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99763,1003471,P00256842,M,1,18-25,1,4.0,A,1.0,0.0,5.0,0.0,0.0,5221.0
240739,1001137,P00222842,M,1,18-25,1,4.0,B,4.0,0.0,8.0,0.0,0.0,8059.0
208800,1002127,P00127242,M,1,18-25,1,4.0,C,0.0,1.0,1.0,16.0,0.0,11628.0
225640,1004761,P00265242,M,1,18-25,1,15.0,C,1.0,0.0,5.0,8.0,0.0,8565.0


In [None]:
# Now Calculate the mean on the purchase column
Sample_Age_group_18_25['Purchase'].mean()

np.float64(9332.707222222221)
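
The effect of `random_state` can be checked on a small made-up population. The sketch also uses `frac=0.10`, which asks for 10% of the rows directly instead of a hard-coded count; the `population` frame is hypothetical:

```python
import pandas as pd

# Hypothetical population of 1,000 purchase amounts.
population = pd.DataFrame({"Purchase": range(1000, 2000)})

a = population.sample(100, random_state=42)
b = population.sample(100, random_state=42)       # same seed, so the same rows
c = population.sample(frac=0.10, random_state=7)  # 10% of the rows, different seed

print(a.index.equals(b.index))
print(len(c))
```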

In [None]:
from scipy.stats import ttest_1samp
a_mean = 10000
t_statistic, p_value = ttest_1samp(Sample_Age_group_18_25['Purchase'], a_mean)
print(f"T-statistic: {t_statistic}")
print(f"P-value: {p_value}")

T-statistic: -8.031949285851002
P-value: 1.2879087768827267e-15


Here p_value < 0.05, so we reject the Null Hypothesis (H₀): the average purchase made by men aged 18-25 is not Rs. 10000.
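
For intuition, the one-sample t-statistic is the distance between the sample mean and the hypothesized mean, measured in standard errors: t = (x̄ − μ₀) / (s / √n). A standard-library sketch on made-up purchase amounts (the helper `one_sample_t` and the `purchases` list are hypothetical and are not taken from the dataset):

```python
import math
import statistics


def one_sample_t(values, mu0):
    """Return the one-sample t-statistic for the mean of values against mu0."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation (divides by n - 1)
    return (mean - mu0) / (sd / math.sqrt(n))


# Hypothetical purchase amounts, not taken from the dataset.
purchases = [9500, 10200, 8700, 9900, 9100, 10400, 8800, 9600]
t_stat = one_sample_t(purchases, 10000)
print(round(t_stat, 4))
```

A negative statistic means the sample mean lies below the hypothesized mean, as it does in the notebook's result.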

**Question 2:**

It was observed that the percentage of women that spend more than 10000 was 35%. Is it still the same?

Null Hypothesis (H₀): the proportion is 35% (p = 0.35)

Alternative Hypothesis (H₁): the proportion is not 35% (p ≠ 0.35)

In [None]:
Purchase_GT_10000 = df[df['Purchase'] > 10000] # Retrieved from the population
Purchase_GT_10000

Unnamed: 0,User_ID,Product_ID,Gender,Gender_Encoded,Age,Age_Encoded,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase
1,1000001,P00248942,F,0,0-17,0,10.0,A,2.0,0.0,1.0,6.0,14.0,15200.0
5,1000003,P00193542,M,1,26-35,2,15.0,A,3.0,0.0,1.0,2.0,0.0,15227.0
6,1000004,P00184942,M,1,46-50,4,7.0,B,2.0,1.0,1.0,8.0,17.0,19215.0
7,1000004,P00346142,M,1,46-50,4,7.0,B,2.0,1.0,1.0,15.0,0.0,15854.0
8,1000004,P0097242,M,1,46-50,4,7.0,B,2.0,1.0,1.0,16.0,0.0,15686.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
262997,1004472,P00262242,F,0,46-50,4,16.0,B,0.0,1.0,1.0,11.0,16.0,15175.0
263001,1004472,P00209742,F,0,46-50,4,16.0,B,0.0,1.0,1.0,11.0,15.0,15430.0
263002,1004472,P00183042,F,0,46-50,4,16.0,B,0.0,1.0,15.0,16.0,0.0,12567.0
263003,1004472,P00345742,F,0,46-50,4,16.0,B,0.0,1.0,1.0,2.0,15.0,15387.0


In [None]:
Purchase_GT_10000['Gender'].value_counts() # Check the count

Unnamed: 0_level_0,count
Gender,Unnamed: 1_level_1
M,72372
F,18685


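Here the female count is 18685 and the male count is 72372. The question is whether these 18685 rows make up 35% of all rows with a purchase above 10000.

nobs = number of observations = total number of rows (men + women) = 72372 + 18685 = 91057

Share of women among purchases > 10000 = 18685/91057 = 0.2052011 = 20.52%

Note that this measures the share of women among high-value purchases, which is how the question is interpreted here. The share of all women's purchases that exceed 10000 would need the total female count as the denominator instead.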


When testing whether 35% of women spend over Rs. 10,000, you need:

- `nobs`: total sample size (91,057)

- `count`: number of women in that sample (18,685)

- `p0`: hypothesized proportion (0.35)

Then you can run a one-sample proportion z-test like this:

In [None]:
# Gender_Encoded == 0 marks female rows, so summing the boolean mask gives the female count
Fem_count = (Purchase_GT_10000['Gender_Encoded'] == 0).sum()
# nobs
nobs = len(Purchase_GT_10000['Gender_Encoded']) # total number of rows (male and female)

Fem_proportion = 0.35

In [None]:
from statsmodels.stats.proportion import proportions_ztest

count = 18685
nobs = 91057
p0 = 0.35

stat, pval = proportions_ztest(count, nobs, value=p0)
print(f"Z-statistic: {stat:.4f}, p-value: {pval:.4f}")

Z-statistic: -108.1940, p-value: 0.0000


Since the p-value is < 0.05, we reject the null hypothesis and conclude that the proportion of women spending over Rs. 10,000 is not 35%; the observed share is about 20.5%.
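
The z-statistic can be reproduced by hand with the standard library, which makes the formula z = (p̂ − p₀) / √(p̂(1 − p̂)/n) explicit. This sketch reuses the counts reported in this notebook and assumes the variance is based on the sample proportion, which is the default behavior of `proportions_ztest`:

```python
import math

count = 18685   # female rows among purchases above Rs. 10000
nobs = 91057    # all rows with a purchase above Rs. 10000
p0 = 0.35       # hypothesized proportion

p_hat = count / nobs

# Standard error computed from the sample proportion.
se = math.sqrt(p_hat * (1 - p_hat) / nobs)
z = (p_hat - p0) / se

# Two-sided p-value from the standard normal distribution.
p_value = math.erfc(abs(z) / math.sqrt(2))
print(f"p_hat={p_hat:.4f}, z={z:.3f}, p-value={p_value:.4f}")
```

The hand-computed statistic agrees with the Z-statistic of about -108.19 reported in the output above.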