🛠️ Task 1: Data Setup & Validation
Use your knowledge of the os module and DataFrame attributes to prepare the environment.

Check Existence: Use os.path.exists() to verify the CSV is in your folder before loading.

Inspect Structure: Call .info() to see the column types and non-null counts, and print .describe() to see the average age and price.

Missing Values: Check if there are any null values using .isnull().sum().
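The three checks above can be sketched on a tiny stand-in file. The file name `sample_sales.csv`, its three rows, and the helper `load_if_present` are hypothetical and only illustrate the flow; the notebook itself reads `retail_sales_dataset.csv`.

```python
import os
import tempfile

import pandas as pd


def load_if_present(path):
    """Return a DataFrame if the CSV exists, otherwise raise a clear error."""
    if not os.path.exists(path):
        raise FileNotFoundError(f"CSV not found: {path}")
    return pd.read_csv(path)


# Hypothetical three-row stand-in for the real dataset.
sample = pd.DataFrame({
    "Age": [34, 26, 50],
    "Price per Unit": [50, 500, 30],
    "Gender": ["Male", "Female", None],  # one deliberate missing value
})

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "sample_sales.csv")
    sample.to_csv(path, index=False)
    df_sample = load_if_present(path)

df_sample.info()                         # dtypes and non-null counts
summary = df_sample.describe()           # statistics for numeric columns only
mean_age = summary.loc["mean", "Age"]
mean_price = summary.loc["mean", "Price per Unit"]
null_counts = df_sample.isnull().sum()   # missing values per column
print(mean_age, mean_price)
print(null_counts)
```

Reading the averages from the `mean` row of `.describe()` avoids scanning the printed table by eye.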

In [101]:
import os
import pandas as pd
import numpy as np
import random
import datetime as dt

In [102]:
# confirm the CSV is in the working folder, then read it

csv_path = 'retail_sales_dataset.csv'
assert os.path.exists(csv_path), f'{csv_path} not found'
df = pd.read_csv(csv_path)
df.head()

Unnamed: 0,Transaction ID,Date,Customer ID,Gender,Age,Product Category,Quantity,Price per Unit,Total Amount
0,1,2023-11-24,CUST001,Male,34,Beauty,3,50,150
1,2,2023-02-27,CUST002,Female,26,Clothing,2,500,1000
2,3,2023-01-13,CUST003,Male,50,Electronics,1,30,30
3,4,2023-05-21,CUST004,Male,37,Clothing,1,500,500
4,5,2023-05-06,CUST005,Male,30,Beauty,2,50,100


In [103]:
# basic info about the data
# df.info() prints its report itself and returns None, so it is not wrapped in print()

df.info()
print('============================================================================')
print('')
print('============================================================================')
print(df.describe())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Transaction ID    1000 non-null   int64 
 1   Date              1000 non-null   object
 2   Customer ID       1000 non-null   object
 3   Gender            1000 non-null   object
 4   Age               1000 non-null   int64 
 5   Product Category  1000 non-null   object
 6   Quantity          1000 non-null   int64 
 7   Price per Unit    1000 non-null   int64 
 8   Total Amount      1000 non-null   int64 
dtypes: int64(5), object(4)
memory usage: 70.4+ KB

       Transaction ID         Age     Quantity  Price per Unit  Total Amount
count     1000.000000  1000.00000  1000.000000     1000.000000   1000.000000
mean       500.500000    41.39200     2.514000      179.890000    456.000000
std        288.819436    13.68143     1.132734      189.681356    559.997632
min          1.000000   

In [104]:
# checking for null values

df.isnull().sum()

Transaction ID      0
Date                0
Customer ID         0
Gender              0
Age                 0
Product Category    0
Quantity            0
Price per Unit      0
Total Amount        0
dtype: int64

🛠️ Task 2: Time-Based Feature Engineering
The Date column is currently just text (an object). Let's make it useful:

Convert to Datetime: Use pd.to_datetime(df['Date']).

Extract Month: Create a new column Month by extracting the month from the date.

Question: Which month in 2023 had the highest total sales?


In [105]:
df.head()

Unnamed: 0,Transaction ID,Date,Customer ID,Gender,Age,Product Category,Quantity,Price per Unit,Total Amount
0,1,2023-11-24,CUST001,Male,34,Beauty,3,50,150
1,2,2023-02-27,CUST002,Female,26,Clothing,2,500,1000
2,3,2023-01-13,CUST003,Male,50,Electronics,1,30,30
3,4,2023-05-21,CUST004,Male,37,Clothing,1,500,500
4,5,2023-05-06,CUST005,Male,30,Beauty,2,50,100


In [106]:
# convert the Date column from object (text) to datetime

df['Date'] = pd.to_datetime(df['Date'])
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   Transaction ID    1000 non-null   int64         
 1   Date              1000 non-null   datetime64[ns]
 2   Customer ID       1000 non-null   object        
 3   Gender            1000 non-null   object        
 4   Age               1000 non-null   int64         
 5   Product Category  1000 non-null   object        
 6   Quantity          1000 non-null   int64         
 7   Price per Unit    1000 non-null   int64         
 8   Total Amount      1000 non-null   int64         
dtypes: datetime64[ns](1), int64(5), object(3)
memory usage: 70.4+ KB


In [107]:
# extract the month from the Date and create a new column

df.insert(2, 'Month', df['Date'].dt.month)
df.head()

Unnamed: 0,Transaction ID,Date,Month,Customer ID,Gender,Age,Product Category,Quantity,Price per Unit,Total Amount
0,1,2023-11-24,11,CUST001,Male,34,Beauty,3,50,150
1,2,2023-02-27,2,CUST002,Female,26,Clothing,2,500,1000
2,3,2023-01-13,1,CUST003,Male,50,Electronics,1,30,30
3,4,2023-05-21,5,CUST004,Male,37,Clothing,1,500,500
4,5,2023-05-06,5,CUST005,Male,30,Beauty,2,50,100
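Numeric months are compact but easy to misread later (for example, remembering that 5 means May). As an optional variation that is not part of the original notebook, `Series.dt.month_name()` returns readable labels. The `Month Name` column is an illustrative assumption; the three dates are taken from the rows shown above.

```python
import pandas as pd

dates = pd.DataFrame({"Date": ["2023-11-24", "2023-02-27", "2023-05-21"]})
dates["Date"] = pd.to_datetime(dates["Date"])

dates["Month"] = dates["Date"].dt.month              # integer month, 1-12
dates["Month Name"] = dates["Date"].dt.month_name()  # English month names by default
print(dates)
```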


In [112]:
# count transactions per month (this counts rows; it does not sum 'Total Amount')

df.value_counts('Month')

# 5 = May
# therefore the month with the most transactions is May; total sales would need a sum of 'Total Amount' per month

Month
5     105
10     96
8      94
12     91
4      86
2      85
1      78
11     78
6      77
3      73
7      72
9      65
Name: count, dtype: int64
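`value_counts` ranks months by number of transactions, which is not necessarily the same ranking as revenue. Below is a sketch of the revenue version, assuming "total sales" means the sum of `Total Amount`. It uses only the five rows shown by `df.head()`, so its result says nothing about the answer for the full dataset.

```python
import pandas as pd

# The five rows displayed by df.head(), reduced to the two columns needed here.
rows = pd.DataFrame({
    "Date": ["2023-11-24", "2023-02-27", "2023-01-13", "2023-05-21", "2023-05-06"],
    "Total Amount": [150, 1000, 30, 500, 100],
})
rows["Date"] = pd.to_datetime(rows["Date"])
rows["Month"] = rows["Date"].dt.month

# Sum revenue per month, largest first.
monthly_sales = (
    rows.groupby("Month")["Total Amount"]
    .sum()
    .sort_values(ascending=False)
)
best_month = monthly_sales.idxmax()  # month number with the highest summed revenue
print(monthly_sales)
print(best_month)
```

In this five-row sample, May has the most transactions (two) while February has the highest revenue (1000), which shows why the two rankings can differ.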

🛠️ Task 3: Customer Demographic Analysis
This dataset includes Gender and Age, which is perfect for understanding who is buying.

Gender Split: Use .value_counts() on the Gender column to see the breakdown of shoppers.

Age Bins: Create a simple function and use .apply() to categorize customers into Young (under 30), Adult (30-50), and Senior (over 50).

Question: Which age group spends the most on average?

In [113]:
df.head()

Unnamed: 0,Transaction ID,Date,Month,Customer ID,Gender,Age,Product Category,Quantity,Price per Unit,Total Amount
0,1,2023-11-24,11,CUST001,Male,34,Beauty,3,50,150
1,2,2023-02-27,2,CUST002,Female,26,Clothing,2,500,1000
2,3,2023-01-13,1,CUST003,Male,50,Electronics,1,30,30
3,4,2023-05-21,5,CUST004,Male,37,Clothing,1,500,500
4,5,2023-05-06,5,CUST005,Male,30,Beauty,2,50,100


In [115]:
df.value_counts('Gender')

Gender
Female    510
Male      490
Name: count, dtype: int64
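Raw counts can also be expressed as shares with `normalize=True`. This sketch rebuilds a stand-in Series from the two counts printed above (510 and 490) instead of loading the real file:

```python
import pandas as pd

# Stand-in Series with the same Gender counts as the output above.
gender = pd.Series(["Female"] * 510 + ["Male"] * 490, name="Gender")

shares = gender.value_counts(normalize=True)  # fractions that sum to 1
print(shares)
```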

In [124]:
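# label each age as Young (under 30), Adult (30-50) or Senior (over 50);
# `if df['Age'] < 30:` on the whole column raises a ValueError, so use a per-value function
# ('Age Group' is a column name chosen here)
def age_group(age):
    if age < 30:
        return 'Young'
    elif age <= 50:
        return 'Adult'
    return 'Senior'

df['Age Group'] = df['Age'].apply(age_group)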
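One possible way to answer the Task 3 question is to group by the age label and average `Total Amount`. This is a sketch under assumptions: the helper `age_group` and the column name `Age Group` are illustrative, and the data are the five rows from `df.head()` plus one made-up senior row, so the result does not reflect the full dataset.

```python
import pandas as pd


def age_group(age):
    """Map a single age to Young (under 30), Adult (30-50) or Senior (over 50)."""
    if age < 30:
        return "Young"
    elif age <= 50:
        return "Adult"
    return "Senior"


sample = pd.DataFrame({
    "Age": [34, 26, 50, 37, 30, 63],                 # last row is hypothetical
    "Total Amount": [150, 1000, 30, 500, 100, 200],  # last row is hypothetical
})
sample["Age Group"] = sample["Age"].apply(age_group)

# Average spend per transaction within each age group.
avg_spend = sample.groupby("Age Group")["Total Amount"].mean()
top_group = avg_spend.idxmax()
print(avg_spend)
print(top_group)
```

`pd.cut` would be an alternative way to build the bins, but the task asks for a function with `.apply()`, so the sketch follows that.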