----------------------------------------------
# 🧪 Synthetic Dataset Generator

This notebook creates synthetic customer data for two different datasets: **2023** and **2024**.  
The data is generated using the `Faker` library and can be used for practice, analysis, or demo projects.

**Author:** Chirag Suri

----------------------------------------------


## 📦 Import required libraries

In [18]:
import pandas as pd
import numpy as np
import random
from faker import Faker
from datetime import datetime

# 🔁 Set seed for reproducibility
np.random.seed(42)
random.seed(42)

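# Initialise Faker and seed its generator as well;
# seeding `random` and NumPy alone does not make Faker's output repeatable
Faker.seed(42)
fake = Faker()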
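
A quick way to confirm that a seed behaves as expected is to draw the same values twice and compare them. The sketch below uses only the standard-library `random` module; `draw_sales` is a hypothetical helper introduced for illustration and is not part of this notebook.

```python
import random


def draw_sales(seed, n=3):
    """Return n pseudo-random sales amounts from a dedicated, seeded generator."""
    rng = random.Random(seed)  # local generator; the global random state is untouched
    return [round(rng.uniform(100.0, 1000.0), 2) for _ in range(n)]


first = draw_sales(42)
second = draw_sales(42)
other = draw_sales(7)

print(first == second)  # same seed, same values
print(first == other)   # a different seed gives a different sequence
```

Using a local `random.Random` instance instead of the global seed is one possible design choice: it keeps each generator independent of whatever other cells have already drawn.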

## 🏗️ Generate Synthetic Data

Use Faker, `random`, and NumPy to create fields such as state, city, and date.

## 🗂️ Dataset for 2024

In [19]:
# Constants
num_customers = 689
num_products = 20
num_orders = 1255
categories = ['Electronics', 'Furniture', 'Clothing', 'Toys', 'Books']
states = ["Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado", "Connecticut", "Delaware",
          "Florida", "Georgia", "Hawaii", "Idaho", "Illinois", "Indiana", "Iowa", "Kansas", "Kentucky", "Louisiana",
          "Maine", "Maryland", "Massachusetts", "Michigan", "Minnesota", "Mississippi", "Missouri", "Montana",
          "Nebraska", "Nevada", "New Hampshire", "New Jersey", "New Mexico", "New York", "North Carolina",
          "North Dakota", "Ohio", "Oklahoma", "Oregon", "Pennsylvania", "Rhode Island", "South Carolina", "South Dakota",
          "Tennessee", "Texas", "Utah", "Vermont", "Virginia", "Washington", "West Virginia", "Wisconsin", "Wyoming"]
cities = {state: [fake.city() for _ in range(5)] for state in states}

# Generate data
data = []

order_ids = random.sample(range(1, num_orders + 1), num_orders)  # Unique order IDs

for order_id in order_ids:
    customer_id = random.randint(1, num_customers)
    product_id = random.randint(1, num_products)
    state = random.choice(states)
    city = random.choice(cities[state])
    date = fake.date_between_dates(date_start=pd.to_datetime("2024-01-01"), date_end=pd.to_datetime("2024-12-31"))
    category = random.choice(categories)
    sales = round(random.uniform(100.0, 1000.0), 2)
    quantity_sold = random.randint(1, 10)
    
    data.append([customer_id, product_id, order_id, city, state, date, category, sales, quantity_sold])

### 🧱 Convert to DataFrame

Convert the generated list of rows into a `pandas` DataFrame.

In [20]:
columns = ["customer_id", "product_id", "order_id", "city", "state", "date", "product_category", "sales", "quantity_sold"]
df = pd.DataFrame(data, columns=columns)

### 🔍 Preview the Data

Display the first 5 records to validate the structure.

In [21]:
print(df.head())

   customer_id  product_id  order_id               city           state  \
0          643           2       229        West Angela        Maryland   
1           93          10        52       Robinsontown   Massachusetts   
2          375          17       564     Port Kevinland  North Carolina   
3          303          11       502  South Matthewfort        Delaware   
4          407          18       458         East David     Mississippi   

         date product_category   sales  quantity_sold  
0  2024-06-23      Electronics  621.62              3  
1  2024-06-05        Furniture  280.72              9  
2  2024-11-13        Furniture  331.27              8  
3  2024-09-28      Electronics  226.68              4  
4  2024-06-25             Toys  112.53              9  


In [22]:
df.head()

|   | customer_id | product_id | order_id | city | state | date | product_category | sales | quantity_sold |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 643 | 2 | 229 | West Angela | Maryland | 2024-06-23 | Electronics | 621.62 | 3 |
| 1 | 93 | 10 | 52 | Robinsontown | Massachusetts | 2024-06-05 | Furniture | 280.72 | 9 |
| 2 | 375 | 17 | 564 | Port Kevinland | North Carolina | 2024-11-13 | Furniture | 331.27 | 8 |
| 3 | 303 | 11 | 502 | South Matthewfort | Delaware | 2024-09-28 | Electronics | 226.68 | 4 |
| 4 | 407 | 18 | 458 | East David | Mississippi | 2024-06-25 | Toys | 112.53 | 9 |
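
Beyond looking at the first rows, a few programmatic checks can validate the structure. The sketch below is one possible approach, not part of the original notebook: `check_orders` is a hypothetical helper, and the bounds it tests (sales between 100 and 1000, quantity between 1 and 10) are taken from the generation loop above. The sample rows are copied from the preview.

```python
import pandas as pd


def check_orders(df, year):
    """Return a list of human-readable problems found in an orders DataFrame."""
    problems = []
    if not df["order_id"].is_unique:
        problems.append("duplicate order_id values")
    dates = pd.to_datetime(df["date"])
    if not (dates.dt.year == year).all():
        problems.append(f"dates outside {year}")
    if not df["sales"].between(100.0, 1000.0).all():
        problems.append("sales outside 100-1000")
    if not df["quantity_sold"].between(1, 10).all():
        problems.append("quantity_sold outside 1-10")
    return problems


sample = pd.DataFrame({
    "order_id": [229, 52, 564],
    "date": ["2024-06-23", "2024-06-05", "2024-11-13"],
    "sales": [621.62, 280.72, 331.27],
    "quantity_sold": [3, 9, 8],
})

ok_report = check_orders(sample, 2024)
bad_report = check_orders(sample.assign(order_id=[229, 229, 564]), 2024)

print(ok_report)   # empty list: no problems in the sample rows
print(bad_report)  # reports the duplicated order_id
```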


### 🗃️ Exporting Dataset

In [34]:
# Save to CSV (uncomment the next line to export)
# df.to_csv('sales_data_2024.csv', index=False) 
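
One thing to keep in mind when the exported file is read back: CSV stores dates as text, so the `date` column returns as strings unless it is parsed on load. The sketch below illustrates this with an in-memory buffer standing in for the file on disk; `df_demo` is a small hypothetical frame, not the full dataset.

```python
import io

import pandas as pd

df_demo = pd.DataFrame({
    "order_id": [229, 52],
    "date": pd.to_datetime(["2024-06-23", "2024-06-05"]),
    "sales": [621.62, 280.72],
})

buffer = io.StringIO()               # in-memory stand-in for a file on disk
df_demo.to_csv(buffer, index=False)  # index=False avoids an extra unnamed column

buffer.seek(0)
plain = pd.read_csv(buffer)          # dates come back as plain strings

buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["date"])  # dates restored as datetimes

print(plain["date"].dtype, parsed["date"].dtype)
```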

## 🗂️ Dataset for 2023

In [29]:
# Set random seeds for reproducibility
# (the loop below draws from both `random` and NumPy)
np.random.seed(42)
random.seed(42)

# Constants
num_customers = 346
num_orders = 923
num_products = 20
num_categories = 5

# Generate customer IDs
customer_ids = [f'C{str(i).zfill(3)}' for i in range(1, num_customers + 1)]

# Generate product IDs and categories
product_ids = [f'P{str(i).zfill(3)}' for i in range(1, num_products + 1)]
categories = ['Electronics', 'Furniture', 'Clothing', 'Toys', 'Books']

# Create a mapping of products to categories
product_category_map = {product_id: random.choice(categories) for product_id in product_ids}

# Generate states (sample data)
states = ["Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado", "Connecticut", "Delaware",
          "Florida", "Georgia", "Hawaii", "Idaho", "Illinois", "Indiana", "Iowa", "Kansas", "Kentucky", "Louisiana",
          "Maine", "Maryland", "Massachusetts", "Michigan", "Minnesota", "Mississippi", "Missouri", "Montana",
          "Nebraska", "Nevada", "New Hampshire", "New Jersey", "New Mexico", "New York", "North Carolina",
          "North Dakota", "Ohio", "Oklahoma", "Oregon", "Pennsylvania", "Rhode Island", "South Carolina", "South Dakota",
          "Tennessee", "Texas", "Utah", "Vermont", "Virginia", "Washington", "West Virginia", "Wisconsin", "Wyoming"]

# Generate cities (sample data)
cities = ["New York", "Los Angeles", "Chicago", "Houston", "Phoenix", "Philadelphia", "San Antonio", "San Diego", "Dallas", "San Jose"]

# Generate unique order IDs
order_ids = [f'O{str(i).zfill(3)}' for i in range(1, num_orders + 1)]

# Generate order data
data_23 = []
for i in range(num_orders):
    customer_id = random.choice(customer_ids)
    product_id = random.choice(product_ids)
    order_id = order_ids[i]
    city = random.choice(cities)
    state = random.choice(states)
    date = np.random.choice(pd.date_range(start='2023-01-01', end='2023-12-31'))
    category = product_category_map[product_id]
    sales = round(random.uniform(100, 1000), 2)
    quantity_sold = random.randint(1, 10)
    
    data_23.append([customer_id, product_id, order_id, city, state, date, category, sales, quantity_sold])
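
Because the loop above picks `city` and `state` independently, a row can pair a city with a state it does not belong to (for example San Antonio with Montana). If consistent locations matter for an analysis, one option is to draw the city first and look up its state. The sketch below is an assumption about how that could be done, not the notebook's method; `CITY_STATE` and `pick_location` are hypothetical names.

```python
import random

# Each sample city paired with the state it is actually located in.
CITY_STATE = {
    "New York": "New York",
    "Los Angeles": "California",
    "Chicago": "Illinois",
    "Houston": "Texas",
    "Phoenix": "Arizona",
    "Philadelphia": "Pennsylvania",
    "San Antonio": "Texas",
    "San Diego": "California",
    "Dallas": "Texas",
    "San Jose": "California",
}


def pick_location(rng):
    """Pick a random city and return it together with its matching state."""
    city = rng.choice(list(CITY_STATE))
    return city, CITY_STATE[city]


rng = random.Random(42)  # local seeded generator for a repeatable demo
locations = [pick_location(rng) for _ in range(5)]
print(locations)
```

The trade-off is that only the states containing these ten cities would appear in the data, so the independent draw may still be preferable when broad state coverage matters more than geographic accuracy.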

### 🧱 Convert to DataFrame

Convert the generated list of rows into a `pandas` DataFrame.

In [30]:
# Build the 2023 DataFrame from data_23 (the list filled in the loop above)
df_2023 = pd.DataFrame(data_23, columns=['customer_id', 'product_id', 'order_id', 'City', 'State', 'Date', 'Product_Category', 'Sales', 'Quantity_Sold'])

### 🔍 Preview the Data

Display the first 5 records to validate the structure.

In [31]:
# Ensure the Date column is in datetime format
df_2023['Date'] = pd.to_datetime(df_2023['Date'])

# Display the first few rows of the DataFrame
print(df_2023.head())

  customer_id product_id order_id         City          State       Date  \
0        C060       P003     O001  San Antonio        Montana 2023-04-13   
1        C266       P007     O002  San Antonio         Hawaii 2023-12-15   
2        C155       P017     O003       Dallas          Texas 2023-09-28   
3        C158       P014     O004  Los Angeles    Connecticut 2023-04-17   
4        C074       P009     O005      Houston  West Virginia 2023-03-13   

  Product_Category   Sales  Quantity_Sold  
0             Toys  735.46              2  
1      Electronics  846.77              5  
2            Books  897.99             10  
3        Furniture  672.56              6  
4      Electronics  458.17              5  


In [32]:
df_2023.head()

|   | customer_id | product_id | order_id | City | State | Date | Product_Category | Sales | Quantity_Sold |
|---|---|---|---|---|---|---|---|---|---|
| 0 | C060 | P003 | O001 | San Antonio | Montana | 2023-04-13 | Toys | 735.46 | 2 |
| 1 | C266 | P007 | O002 | San Antonio | Hawaii | 2023-12-15 | Electronics | 846.77 | 5 |
| 2 | C155 | P017 | O003 | Dallas | Texas | 2023-09-28 | Books | 897.99 | 10 |
| 3 | C158 | P014 | O004 | Los Angeles | Connecticut | 2023-04-17 | Furniture | 672.56 | 6 |
| 4 | C074 | P009 | O005 | Houston | West Virginia | 2023-03-13 | Electronics | 458.17 | 5 |


### 🗃️ Exporting Dataset

In [33]:
# Save to CSV (uncomment the next line to export)
# df_2023.to_csv('sales_data_2023.csv', index=False) 
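
The two datasets use different column casing (`city` in 2024, `City` in 2023) and different ID formats (integers in 2024, prefixed strings such as `C060` in 2023). If they need to be analysed together, the headers have to be aligned first. The sketch below shows one possible way, using two tiny hypothetical frames (`df_2024_demo`, `df_2023_demo`) built from the preview rows; the added `year` column is an assumption for illustration.

```python
import pandas as pd

df_2024_demo = pd.DataFrame({
    "customer_id": [643],
    "order_id": [229],
    "city": ["West Angela"],
    "sales": [621.62],
})
df_2023_demo = pd.DataFrame({
    "customer_id": ["C060"],
    "order_id": ["O001"],
    "City": ["San Antonio"],
    "Sales": [735.46],
})

# Lower-case the 2023 headers so both frames share one schema.
df_2023_demo.columns = df_2023_demo.columns.str.lower()

# Tag each row with its source year before stacking the frames.
combined = pd.concat(
    [df_2023_demo.assign(year=2023), df_2024_demo.assign(year=2024)],
    ignore_index=True,
)
print(combined)
```

The ID columns end up holding a mix of strings and integers, so IDs from one year should not be compared against or joined to IDs from the other without converting them to a common format first.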