In [15]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [16]:
np.random.seed(42)

data_size = 500
categories = ['Electronics', 'Clothing', 'Home & Garden', 'Books']
products = {
    'Electronics': ['Smartphone', 'Laptop', 'Headphones', 'Monitor'],
    'Clothing': ['T-Shirt', 'Jeans', 'Sneakers', 'Jacket'],
    'Home & Garden': ['Chair', 'Desk', 'Lamp', 'Plant'],
    'Books': ['Novel', 'Textbook', 'Comic', 'Cookbook']
}

date_range = pd.date_range(start='2023-01-01', periods=data_size, freq='D')
category_list = np.random.choice(categories, data_size)
product_list = [np.random.choice(products[cat]) for cat in category_list]
price_list = np.random.randint(20, 1000, data_size)
quantity_list = np.random.randint(1, 5, data_size)

df = pd.DataFrame({
    'Date': date_range,
    'Category': category_list,
    'Product': product_list,
    'Price': price_list,
    'Quantity': quantity_list,
    'Rating': np.random.uniform(1.0, 5.0, data_size) # Ratings between 1 and 5
})

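# Inject missing values for the cleaning step (.loc slices are label-based and inclusive)
df.loc[10:20, 'Price'] = np.nan     # rows 10..20 -> 11 missing prices
df.loc[50:55, 'Category'] = np.nan  # rows 50..55 -> 6 missing categories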

print("✅ Synthetic Dataset Created")
df.head()

✅ Synthetic Dataset Created


,Date,Category,Product,Price,Quantity,Rating
0,2023-01-01,Home & Garden,Lamp,521.0,2,1.884838
1,2023-01-02,Books,Novel,978.0,1,4.950672
2,2023-01-03,Electronics,Smartphone,164.0,1,4.776237
3,2023-01-04,Home & Garden,Plant,220.0,1,1.157707
4,2023-01-05,Home & Garden,Lamp,948.0,1,3.822301


Before analyzing, we must understand the shape and health of our data. We check the data types, the summary statistics, and the number of missing values per column.

In [17]:
# DATA INSPECTION

print("--- Data Info ---")
df.info()

print("\n--- Summary Statistics ---")
display(df.describe())

print("\n--- Missing Values ---")
print(df.isnull().sum())


--- Data Info ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   Date      500 non-null    datetime64[ns]
 1   Category  494 non-null    object        
 2   Product   500 non-null    object        
 3   Price     489 non-null    float64       
 4   Quantity  500 non-null    int64         
 5   Rating    500 non-null    float64       
dtypes: datetime64[ns](1), float64(2), int64(1), object(2)
memory usage: 23.6+ KB

--- Summary Statistics ---


,Date,Price,Quantity,Rating
count,500,489.0,500.0,500.0
mean,2023-09-07 12:00:00,504.290389,2.438,3.058653
min,2023-01-01 00:00:00,20.0,1.0,1.01976
25%,2023-05-05 18:00:00,248.0,1.0,1.964912
50%,2023-09-07 12:00:00,521.0,2.0,3.120213
75%,2024-01-10 06:00:00,770.0,3.0,4.082144
max,2024-05-14 00:00:00,996.0,4.0,4.997655
std,,293.545186,1.108428,1.185758



--- Missing Values ---
Date         0
Category     6
Product      0
Price       11
Quantity     0
Rating       0
dtype: int64
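
Raw counts are easier to judge as a share of all rows. The following is a minimal sketch on a small stand-in frame (the `toy` data is invented for illustration and is not the notebook's `df`); the same expression should apply to `df` unchanged.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df; values are illustrative only
toy = pd.DataFrame({
    "Category": ["Books", None, "Clothing", "Books"],
    "Price": [120.0, 80.0, np.nan, np.nan],
})

# isnull() yields booleans, so the column mean is the fraction of missing entries
missing_pct = toy.isnull().mean() * 100
print(missing_pct)
```

For the dataset above, 11 missing prices out of 500 rows is 2.2% and 6 missing categories is 1.2%, small enough that simple imputation is a reasonable choice.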


Data is rarely perfect. Here we handle the missing values we introduced earlier.

In [18]:
# DATA CLEANING

# 1. Fill missing 'Category' values with 'Unknown'
df["Category"] = df["Category"].fillna("Unknown")

# 2. Fill missing 'Price' values with the median price (more robust to outliers and skew than the mean)
median_price = df["Price"].median()
df["Price"] = df["Price"].fillna(median_price)

# 3. Check if cleaning was successful
print("\n--- Missing Values After Cleaning ---")
print(df.isnull().sum())


--- Missing Values After Cleaning ---
Date        0
Category    0
Product     0
Price       0
Quantity    0
Rating      0
dtype: int64
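
To see why the median is often preferred for imputation, here is a small sketch with made-up prices that include one extreme value (the numbers are assumptions chosen for illustration, not taken from the dataset).

```python
import pandas as pd

# Hypothetical prices with a single extreme outlier
prices = pd.Series([20.0, 25.0, 30.0, 35.0, 5000.0])

mean_price = prices.mean()        # pulled far upward by the outlier
median_price_toy = prices.median()  # stays at the middle value

print(f"mean:   {mean_price}")
print(f"median: {median_price_toy}")
```

In our synthetic data the prices are drawn uniformly, so the two statistics are close (mean of about 504.29 versus a median of 521.0 in the summary above) and either would give a similar fill value.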


We often need to create new columns based on existing ones. Here, we calculate the revenue of each row (Price × Quantity) and extract the month name from the date.

In [19]:
# FEATURE ENGINEERING

df["Revenue"] = df["Price"] * df["Quantity"]
df["Month"] = df["Date"].dt.month_name()

df.head()

,Date,Category,Product,Price,Quantity,Rating,Revenue,Month
0,2023-01-01,Home & Garden,Lamp,521.0,2,1.884838,1042.0,January
1,2023-01-02,Books,Novel,978.0,1,4.950672,978.0,January
2,2023-01-03,Electronics,Smartphone,164.0,1,4.776237,164.0,January
3,2023-01-04,Home & Garden,Plant,220.0,1,1.157707,220.0,January
4,2023-01-05,Home & Garden,Lamp,948.0,1,3.822301,948.0,January
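
One thing to watch: `dt.month_name()` returns plain strings, so grouping or sorting by `Month` orders the months alphabetically (April, August, ...). A possible workaround, sketched here on an invented `toy` frame, is to convert the column to an ordered categorical.

```python
import pandas as pd

# Hypothetical stand-in for df; dates and revenues are illustrative only
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2023-03-05", "2023-01-10", "2023-02-20", "2023-01-25"]),
    "Revenue": [300.0, 100.0, 200.0, 50.0],
})
toy["Month"] = toy["Date"].dt.month_name()

# Calendar order, written out explicitly to avoid locale surprises
month_order = ["January", "February", "March", "April", "May", "June",
               "July", "August", "September", "October", "November", "December"]
toy["Month"] = pd.Categorical(toy["Month"], categories=month_order, ordered=True)

# observed=True keeps only the months that actually occur in the data
monthly = toy.groupby("Month", observed=True)["Revenue"].sum()
print(monthly)
```

Because our dataset spans more than one year, grouping by month name alone would merge, for example, January 2023 and January 2024; whether that is desirable depends on the question being asked.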


Let's answer business questions. Which category generates the most revenue?

In [25]:
# DATA ANALYSIS (AGGREGATIONS)

category_revenue = df.groupby("Category")["Revenue"].sum().sort_values(ascending=False)

print("--- Total Revenue by Category ---")
category_revenue

--- Total Revenue by Category ---


Category
Books            182615.0
Clothing         145544.0
Electronics      145263.0
Home & Garden    130621.0
Unknown            8991.0
Name: Revenue, dtype: float64
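
A total alone does not say whether a category leads because of many rows or a high value per row. As a sketch, named aggregation can compute several statistics in one pass; the `toy` frame and the output column names (`total`, `orders`, `average`, `share_pct`) are assumptions made for this example.

```python
import pandas as pd

# Hypothetical stand-in for df; values are illustrative only
toy = pd.DataFrame({
    "Category": ["Books", "Books", "Clothing", "Clothing", "Clothing"],
    "Revenue": [100.0, 300.0, 50.0, 150.0, 100.0],
})

# Named aggregation: keyword = output column, value = aggregation function
summary = (
    toy.groupby("Category")["Revenue"]
    .agg(total="sum", orders="count", average="mean")
    .sort_values("total", ascending=False)
)

# Each category's share of overall revenue, in percent
summary["share_pct"] = summary["total"] / summary["total"].sum() * 100
print(summary)
```

Applied to `df`, the same pattern would also show how little the "Unknown" bucket contributes, since it only covers the 6 rows whose category we filled in.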

Next, we visualize the revenue by category.