# Data Generation for E-Commerce Project

This notebook creates a small, simulated dataset for a fictional online shop.  
The goal is to build a realistic set of users, products, website sessions, orders, and marketing spend records that we can use later for database work and analytics.

All data is randomly generated using the Faker library and Python's built-in `random` module.
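
Because the data is random, every run of the notebook produces a different dataset. A minimal sketch of how a fixed seed makes a run repeatable is shown below, using only the standard library. The seed value `42` and the helper name `sample_channels` are illustrative assumptions, not part of the original notebook. Faker has its own `Faker.seed()` class method, which would need to be called as well for the Faker-generated fields.

```python
import random

CHANNELS = ['Google Ads', 'Email', 'Social Media', 'Referral']

def sample_channels(seed, n=5):
    """Return n random channels drawn from a generator seeded with `seed`."""
    rng = random.Random(seed)  # local generator, leaves the global state untouched
    return [rng.choice(CHANNELS) for _ in range(n)]

# The same seed always yields the same sequence
first_run = sample_channels(seed=42)
second_run = sample_channels(seed=42)
print(first_run == second_run)  # True
```

In the notebook itself, calling `random.seed(42)` once in the first cell would have the same effect for the `random` calls.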

## 1. Generate Users

Here we create a set of users with:
- a unique user ID  
- signup date  
- email  
- country  
- marketing channel (e.g., Google Ads, Email, Social Media)

The data is saved as `users.csv` in the `data/raw/` folder.


In [1]:
import pandas as pd
from faker import Faker
import random

fake = Faker()
channels = ['Google Ads', 'Email', 'Social Media', 'Referral']

# Create 500 users
users = []
for i in range(1, 501):
    users.append({
        "user_id": i,
        "signup_date": fake.date_between(start_date='-2y', end_date='today'),
        "email": fake.email(),
        "country": fake.country(),
        "marketing_channel": random.choice(channels)
    })

# Save to CSV in data/raw/
users_df = pd.DataFrame(users)
users_df.to_csv('../data/raw/users.csv', index=False)
print("Users data created!")


Users data created!
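
Note that `to_csv` does not create missing folders, so the cell above fails if `data/raw/` does not exist yet. A small sketch of creating the folder first with `pathlib` follows; the helper name `ensure_output_dir` is hypothetical.

```python
from pathlib import Path

def ensure_output_dir(path='../data/raw'):
    """Create the output folder (and any missing parents) if needed."""
    out_dir = Path(path)
    out_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    return out_dir
```

Calling `ensure_output_dir()` once before the first `to_csv` would be enough. The returned `Path` can be joined with a file name, e.g. `ensure_output_dir() / 'users.csv'`.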


## 2. Generate Products

This section creates a small product catalogue.  
Each product has:  
- product ID  
- name  
- category  
- price  
- cost

The output is stored as `products.csv` in `data/raw/`.


In [2]:
import pandas as pd
from faker import Faker
import random

fake = Faker()
categories = ['Electronics', 'Clothing', 'Home', 'Books', 'Toys']
products = []

# Create 50 products
for i in range(1, 51):
    products.append({
        "product_id": i,
        "name": fake.word().capitalize(),
        "category": random.choice(categories),
        "price": round(random.uniform(5, 500), 2),
        "cost": round(random.uniform(2, 400), 2)
    })

# Save to CSV in data/raw/
products_df = pd.DataFrame(products)
products_df.to_csv('../data/raw/products.csv', index=False)
print("Products data created!")


Products data created!
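
In the cell above, `price` and `cost` are drawn independently, so some products end up costing more than they sell for. If positive margins are wanted, one option is to derive the cost from the price. The sketch below assumes a cost of 40 to 80 percent of the price; that range and the helper name `make_price_and_cost` are assumptions for illustration, not values from the original notebook.

```python
import random

def make_price_and_cost(rng, low=5, high=500, min_share=0.4, max_share=0.8):
    """Draw a price, then a cost that is a fraction of that price."""
    price = round(rng.uniform(low, high), 2)
    cost = round(price * rng.uniform(min_share, max_share), 2)
    return price, cost

rng = random.Random(0)
pairs = [make_price_and_cost(rng) for _ in range(50)]

# Every product now has a positive margin
print(all(cost < price for price, cost in pairs))  # True
```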


## 3. Website Sessions

We simulate visits to the website.  
Each session contains:  
- a user ID  
- session date  
- traffic source  
- number of pages viewed  
- time spent on the site

This will be saved as `sessions.csv` in `data/raw/`.


In [3]:
sessions = []
session_id = 1
channels = ['Google Ads', 'Email', 'Social Media', 'Referral']

# For each user, generate 1-5 random sessions
for user_id in range(1, 501):
    num_sessions = random.randint(1, 5)
    for _ in range(num_sessions):
        sessions.append({
            "session_id": session_id,
            "user_id": user_id,
            "session_date": fake.date_between(start_date='-1y', end_date='today'),
            "source": random.choice(channels),
            "pages_viewed": random.randint(1, 10),
            "duration_sec": random.randint(30, 600)
        })
        session_id += 1

sessions_df = pd.DataFrame(sessions)
sessions_df.to_csv('../data/raw/sessions.csv', index=False)
print("Sessions data created!")


Sessions data created!
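
Signup dates in section 1 go back up to two years, while session dates are drawn from the last year regardless of the user, so a session can fall before its user's signup date. A minimal sketch of bounding the session date by the signup date is shown below, using only the standard library. The helper name `random_date_between` and the example signup date are hypothetical.

```python
import random
from datetime import date, timedelta

def random_date_between(rng, start, end):
    """Return a random date d with start <= d <= end."""
    if start > end:
        raise ValueError("start must not be after end")
    span_days = (end - start).days
    return start + timedelta(days=rng.randint(0, span_days))

rng = random.Random(1)
today = date.today()
signup_date = today - timedelta(days=200)  # example signup date

# Sessions start at the signup date or one year ago, whichever is later
earliest = max(signup_date, today - timedelta(days=365))
session_date = random_date_between(rng, earliest, today)
print(signup_date <= session_date <= today)  # True
```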


## 4. Orders and Order Items

Here we generate transactions for the store.  
For each order we store:  
- the user who made it  
- date of purchase  
- order amount  
- order status (completed / returned)

We also create the line items for each order: which products were bought, in what quantity, and at what unit price.

Files saved in `data/raw/`:
- `orders.csv`
- `order_items.csv`


In [4]:
orders = []
order_items = []
order_id = 1
order_item_id = 1

# Each user makes 1-3 orders
for user_id in range(1, 501):
    num_orders = random.randint(1, 3)
    for _ in range(num_orders):
        order_date = fake.date_between(start_date='-1y', end_date='today')
        num_products = random.randint(1, 5)
        total_amount = 0
        products_in_order = random.sample(range(1, 51), num_products)
        
        for product_id in products_in_order:
            quantity = random.randint(1, 3)
            price = round(random.uniform(5, 500), 2)  # random unit price, not looked up from products.csv
            total_amount += price * quantity
            
            order_items.append({
                "order_item_id": order_item_id,
                "order_id": order_id,
                "product_id": product_id,
                "quantity": quantity,
                "price": price
            })
            order_item_id += 1
        
        orders.append({
            "order_id": order_id,
            "user_id": user_id,
            "order_date": order_date,
            "total_amount": round(total_amount, 2),
            "order_status": random.choice(['completed', 'returned'])
        })
        order_id += 1

# Save to CSV
orders_df = pd.DataFrame(orders)
orders_df.to_csv('../data/raw/orders.csv', index=False)

order_items_df = pd.DataFrame(order_items)
order_items_df.to_csv('../data/raw/order_items.csv', index=False)

print("Orders and Order_Items data created!")


Orders and Order_Items data created!
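
The order items above use a freshly drawn random price, so the `price` in `order_items.csv` will not match the `price` column in `products.csv`. If the two should agree, the unit price can be looked up from the catalogue instead. The sketch below uses a tiny hand-written catalogue; the three products and the helper name `build_order_items` are made up for illustration.

```python
import random

# Hypothetical catalogue: product_id -> price
catalogue = {1: 19.99, 2: 249.50, 3: 5.25}

def build_order_items(rng, order_id, price_by_product, max_quantity=3):
    """Pick random products and price each line from the catalogue."""
    num_products = rng.randint(1, len(price_by_product))
    chosen = rng.sample(sorted(price_by_product), num_products)
    items = [
        {
            "order_id": order_id,
            "product_id": pid,
            "quantity": rng.randint(1, max_quantity),
            "price": price_by_product[pid],  # catalogue price, not a new random one
        }
        for pid in chosen
    ]
    total = round(sum(i["quantity"] * i["price"] for i in items), 2)
    return items, total

items, total = build_order_items(random.Random(7), order_id=1, price_by_product=catalogue)
print(items, total)
```

In the notebook, a lookup such as `dict(zip(products_df['product_id'], products_df['price']))` could play the role of `catalogue`.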


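## 5. Marketing Spend

Finally, we generate monthly marketing spend for each channel.  
Each row has:  
- channel  
- spend date (first day of the month)  
- spend amount

The output is stored as `marketing_spend.csv` in `data/raw/`.


In [5]: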
marketing_spend = []
channels = ['Google Ads', 'Email', 'Social Media', 'Referral']

# Generate monthly spend for 12 months
for month in range(1, 13):
    for channel in channels:
        marketing_spend.append({
            "channel": channel,
            "spend_date": f"2025-{month:02d}-01",  # Example year
            "spend_amount": round(random.uniform(1000, 10000), 2)
        })

marketing_spend_df = pd.DataFrame(marketing_spend)
marketing_spend_df.to_csv('../data/raw/marketing_spend.csv', index=False)
print("Marketing spend data created!")


Marketing spend data created!
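
The spend dates above are fixed to 2025, while sessions and orders are generated relative to today, so the two date ranges drift apart as time passes. A sketch of producing the first day of each of the last 12 months instead is shown below; the helper name `last_n_month_starts` is hypothetical.

```python
from datetime import date

def last_n_month_starts(n=12, today=None):
    """Return the first day of the last n months (current month included), oldest first."""
    today = today or date.today()
    year, month = today.year, today.month
    starts = []
    for _ in range(n):
        starts.append(date(year, month, 1))
        month -= 1
        if month == 0:  # step back across the year boundary
            year, month = year - 1, 12
    return list(reversed(starts))

months = last_n_month_starts(12, today=date(2025, 3, 15))
print(months[0].isoformat(), months[-1].isoformat())  # 2024-04-01 2025-03-01
```

Each date can be written to the `spend_date` column with `isoformat()`, which matches the `YYYY-MM-DD` strings used in the cell above.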


### Data files created

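All generated files are now stored in `data/raw/`:
- `users.csv`
- `products.csv`
- `sessions.csv`
- `orders.csv`
- `order_items.csv`
- `marketing_spend.csv`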

These files will be loaded into the SQLite database in the next notebook.
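
Before loading, it can be worth checking that every foreign key in the generated data points at an existing row, for example that each `user_id` in the orders exists among the users. A minimal sketch with hand-written rows follows; the helper name `find_orphans` and the sample rows are hypothetical.

```python
def find_orphans(child_rows, parent_rows, key):
    """Return child rows whose `key` value has no match in parent_rows."""
    parent_keys = {row[key] for row in parent_rows}
    return [row for row in child_rows if row[key] not in parent_keys]

sample_users = [{"user_id": 1}, {"user_id": 2}]
sample_orders = [
    {"order_id": 1, "user_id": 1},
    {"order_id": 2, "user_id": 3},  # no such user
]
print(find_orphans(sample_orders, sample_users, "user_id"))
# [{'order_id': 2, 'user_id': 3}]
```

With the notebook's own lists, `find_orphans(orders, users, 'user_id')` and `find_orphans(order_items, products, 'product_id')` should both return an empty list.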
