Assignment Title: Understanding Descriptive Statistics and Sampling for
Machine Learning in Python

Assignment Overview:

In this assignment, you will explore descriptive statistics and sampling techniques.
You will analyze a dataset from an e-commerce platform and perform statistical analysis to
understand customer purchasing behavior. By the end of the assignment, you will have a deeper
understanding of the role that descriptive statistics and sampling techniques play in machine
learning preprocessing.

Scenario:

You have been hired as a data scientist at an e-commerce company called "ShopSmart." The
company wants to understand customer behavior to improve sales and customer satisfaction. They
have collected data over the past year from 10 million customers, including transaction details such
as purchase amounts, customer demographics, product categories, and timestamps. Since the dataset
is enormous, the company wants you to perform a statistical analysis on a representative sample of
the dataset.


Your goal is to:
1. Use descriptive statistics to understand the general customer behavior.
2. Implement different sampling techniques to create a manageable dataset for analysis.
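
As a rough illustration of the second goal, the sketch below draws three common kinds of sample with pandas: simple random, systematic, and stratified. The `population` table is a small made-up stand-in for the real data, and the sample sizes, the step `k`, and the choice of `Purchase_Category` as the stratification column are assumptions made for the example only.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the full table (made-up values, assignment column names)
rng = np.random.default_rng(0)
population = pd.DataFrame({
    "Customer_ID": np.arange(1, 1001),
    "Purchase_Category": rng.choice(["Electronics", "Fashion", "Groceries"], size=1000),
    "Purchase_Amount": np.round(rng.gamma(shape=2.0, scale=50.0, size=1000), 2),
})

# Simple random sampling: every row has the same chance of being selected
simple_sample = population.sample(n=100, random_state=42)

# Systematic sampling: take every k-th row after a random starting offset
k = 10
start = int(rng.integers(0, k))
systematic_sample = population.iloc[start::k]

# Stratified sampling: draw 10% within each category so that the sample
# keeps roughly the same category proportions as the population
stratified_sample = population.groupby("Purchase_Category").sample(
    frac=0.10, random_state=42
)

print(len(simple_sample), len(systematic_sample), len(stratified_sample))
```

Fixing `random_state` makes the random and stratified samples reproducible between runs, which helps when comparing sample statistics with population statistics.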

Dataset Details:

The dataset consists of the following columns:

• Customer_ID: Unique identifier for each customer.

• Gender: Gender of the customer (Male/Female).

• Age: Age of the customer.

• Country: Country of residence.

• Purchase_Amount: The amount spent on the purchase.

• Purchase_Category: The category of the purchased item (Electronics, Fashion,
Groceries, etc.).

• Transaction_Timestamp: Date and time of the transaction.

You can either use a synthetic dataset or download an open-source dataset from Kaggle to simulate
the scenario.
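
If you go the synthetic route, one possible way to generate a table with the columns listed above is sketched here. The function name `make_synthetic_ecommerce`, the country list, the age bounds, and the log-normal spend distribution are all assumptions chosen for illustration; they do not describe the real ShopSmart data.

```python
import numpy as np
import pandas as pd

def make_synthetic_ecommerce(n_rows=1_000, seed=42):
    """Build a small synthetic table with the assignment's column names (values are made up)."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "Customer_ID": np.arange(1, n_rows + 1),
        "Gender": rng.choice(["Male", "Female"], size=n_rows),
        "Age": rng.integers(18, 70, size=n_rows),  # upper bound is exclusive
        "Country": rng.choice(["USA", "India", "Germany", "Brazil"], size=n_rows),
        # Log-normal amounts give a right-skewed, strictly positive spend distribution
        "Purchase_Amount": np.round(rng.lognormal(mean=4.0, sigma=0.8, size=n_rows), 2),
        "Purchase_Category": rng.choice(
            ["Electronics", "Fashion", "Groceries", "Books", "Furniture"], size=n_rows
        ),
        # Random second offsets within one year from an arbitrary start date
        "Transaction_Timestamp": pd.Timestamp("2023-01-01")
        + pd.to_timedelta(rng.integers(0, 365 * 24 * 3600, size=n_rows), unit="s"),
    })

synthetic_data = make_synthetic_ecommerce()
print(synthetic_data.head())
```

Calling `synthetic_data.to_csv('ecommerce_data.csv', index=False)` would write the table to the file name that the loading cell expects.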

Tasks:

Task 1: Descriptive Statistics

• Step 1: Load the dataset into Python using Pandas.

• Step 2: Perform the following descriptive statistics:

• Calculate the mean, median, and mode for the Purchase_Amount column.

• Find the standard deviation and variance for the Purchase_Amount.

• Determine the age distribution of the customers using measures like mean, quartiles, and range.

• Count the frequency of purchases by Purchase_Category to identify the most
popular categories.

• Visualize the distribution of Purchase_Amount using histograms and box plots.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Load the dataset into Python using pandas.

ecommerce_data = pd.read_csv('ecommerce_data.csv')
ecommerce_data.head()

In [None]:
# Perform the following descriptive statistics: Calculate the mean, median, and mode for the Purchase_Amount column.

mean_purchase_amount = ecommerce_data['Purchase_Amount'].mean()
mean_purchase_amount

In [None]:
median_purchase_amount = ecommerce_data['Purchase_Amount'].median()
median_purchase_amount

In [10]:
mode_purchase_amount = ecommerce_data['Purchase_Amount'].mode()
mode_purchase_amount

0    67.79
Name: Purchase_Amount, dtype: float64
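
Note that `Series.mode()` returns a Series rather than a single number, because several values can tie for the highest frequency. The small example below uses made-up amounts to show a tie and one way to pick a single value from it; taking the first entry is a convention assumed here, not a rule.

```python
import pandas as pd

# Made-up amounts in which 35.5 and 67.79 both appear twice
amounts = pd.Series([20.0, 35.5, 35.5, 67.79, 67.79, 120.0], name="Purchase_Amount")

modes = amounts.mode()       # all tied values, sorted in ascending order
first_mode = modes.iloc[0]   # smallest of the tied modes

print(modes.tolist())
print(first_mode)
```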

In [13]:
# Find the standard deviation and variance for the Purchase_Amount

std_purchase_amount = ecommerce_data['Purchase_Amount'].std()
std_purchase_amount

143.0583534531564

In [14]:
var_purchase_amount = ecommerce_data['Purchase_Amount'].var()
var_purchase_amount

20465.692492728227
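
The variance printed above is the square of the standard deviation from the previous cell. Both are sample statistics: pandas divides by n - 1 by default (`ddof=1`), whereas NumPy's `np.var` and `np.std` divide by n (`ddof=0`). The sketch below shows the difference on a small made-up series.

```python
import numpy as np
import pandas as pd

# Made-up values with mean 5 and squared deviations summing to 32
amounts = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sample_var = amounts.var()               # pandas default ddof=1: 32 / 7
population_var = amounts.var(ddof=0)     # divide by n instead: 32 / 8
numpy_var = np.var(amounts.to_numpy())   # NumPy default ddof=0, same as population_var
sample_std = amounts.std()               # square root of sample_var

print(sample_var, population_var, numpy_var, sample_std)
```

With a large number of rows the two conventions give nearly identical results, but it is worth stating which one is being reported.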

In [16]:
# Determine the age distribution of the customers using measures like mean, quartiles, and range.

mean_age = ecommerce_data['Age'].mean()
mean_age

43.46592

In [19]:
quartiles = ecommerce_data['Age'].quantile([0.25, 0.50, 0.75])  # quantile() at 0.25, 0.50, 0.75 gives the quartiles
quartiles

0.25    31.0
0.50    43.0
0.75    56.0
Name: Age, dtype: float64

In [20]:
age_range = ecommerce_data['Age'].max() - ecommerce_data['Age'].min()
age_range

51
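
The mean, quartiles, and range can also be obtained in one call with `describe()`, and the interquartile range (IQR) follows from the two outer quartiles. The ages below are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Made-up ages, used only to illustrate the calls
ages = pd.Series([18, 25, 31, 43, 56, 62, 69], name="Age")

summary = ages.describe()                # count, mean, std, min, 25%, 50%, 75%, max
iqr = summary["75%"] - summary["25%"]    # spread of the middle half of the data
value_range = summary["max"] - summary["min"]

print(summary)
print(iqr, value_range)
```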

In [22]:
# Count the frequency of purchases by Purchase_Category to identify the most popular categories.

category_frequency = ecommerce_data['Purchase_Category'].value_counts()
category_frequency

Purchase_Category
Furniture      20203
Groceries      20095
Fashion        19970
Books          19926
Electronics    19806
Name: count, dtype: int64
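
Raw counts are easier to compare as percentages. The sketch below rebuilds the counts printed above as a Series and converts them to shares; on the full DataFrame, `value_counts(normalize=True)` on the `Purchase_Category` column should give the same proportions directly.

```python
import pandas as pd

# Counts copied from the value_counts() output above
category_counts = pd.Series(
    {
        "Furniture": 20203,
        "Groceries": 20095,
        "Fashion": 19970,
        "Books": 19926,
        "Electronics": 19806,
    },
    name="count",
)

# Share of all purchases per category, in percent
category_share = (category_counts / category_counts.sum() * 100).round(2)
most_popular = category_counts.idxmax()

print(category_share)
print(most_popular)
```

The shares all sit close to 20%, so the five categories are almost equally popular and the ranking rests on small differences.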

In [2]:
# Visualize the distribution of Purchase_Amount using histograms and box plots.

# Re-import the plotting libraries so this cell also works after a kernel restart;
# ecommerce_data still comes from the loading cell above.
import matplotlib.pyplot as plt
import seaborn as sns

# Set up the matplotlib figure
plt.figure(figsize=(14, 6))

# Histogram
plt.subplot(1, 2, 1)  # (rows, columns, panel number)
sns.histplot(ecommerce_data['Purchase_Amount'], kde=True, bins=30)
plt.title('Histogram of Purchase Amount')
plt.xlabel('Purchase Amount')
plt.ylabel('Frequency')

# Box Plot
plt.subplot(1, 2, 2)
sns.boxplot(x=ecommerce_data['Purchase_Amount'])
plt.title('Box Plot of Purchase Amount')
plt.xlabel('Purchase Amount')

# Show plots
plt.tight_layout()
plt.show()
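
A box plot draws its whiskers using the 1.5 × IQR rule by default, and points beyond the whiskers appear as individual markers. The same rule can be applied numerically to list those points. The sketch below uses made-up amounts, and treating flagged values as outliers is a working assumption rather than a conclusion about the real data.

```python
import pandas as pd

# Made-up amounts with one unusually large purchase
amounts = pd.Series(
    [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 1000.0],
    name="Purchase_Amount",
)

q1 = amounts.quantile(0.25)
q3 = amounts.quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles, as in a default box plot
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = amounts[(amounts < lower_fence) | (amounts > upper_fence)]
print(lower_fence, upper_fence, outliers.tolist())
```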