# Environment (Installation)

_This notebook provides installation instructions and code to generate example data._

Note: This Jupyter Notebook was originally compiled by Alex Reppel (AR) based on conversations with [ClaudeAI](https://claude.ai/) *(version 3.5 Sonnet)*. For this year's materials, further revisions were made using [Claude Code](https://www.anthropic.com/claude-code) *(Opus 4.1)*, including updated documentation and git commit messages.

## Installing Anaconda

### Official instructions

- Installing Anaconda: https://docs.anaconda.com/anaconda/install/
- Verifying your installation: https://docs.anaconda.com/anaconda/install/verify-install/
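
Once Anaconda is installed, a quick check from inside a notebook can confirm that the packages used below are importable. The following is a minimal sketch, not part of the official verification steps; the helper name `check_packages` is hypothetical and introduced only for illustration.

```python
import importlib
import sys


def check_packages(names):
    """Return {name: version string, or None if the package cannot be imported}."""
    versions = {}
    for name in names:
        try:
            module = importlib.import_module(name)
            versions[name] = getattr(module, "__version__", "unknown")
        except ImportError:
            versions[name] = None
    return versions


# pandas and numpy are the two packages this notebook relies on
report = check_packages(["pandas", "numpy"])
print(f"Python {sys.version.split()[0]}")
for name, version in report.items():
    print(f"{name}: {version if version else 'NOT INSTALLED'}")
```

If either package is reported as not installed, the Anaconda installation (or the active environment) is probably not the one the notebook is running in.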

### Tutorial

In [None]:
from IPython.display import YouTubeVideo
YouTubeVideo("YJC6ldI3hWk", width=560)

## Example data

The following code generates two example datasets, which we will be using throughout the module.

### Setup

In [None]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import os

In [None]:
# Create the output directory (no error if it already exists)
os.makedirs("data/synthetic_data", exist_ok=True)

### Example dataset 1: Business metrics

The following script will create an example dataset `business_metrics.csv` for you to work with:

1. Create a `data/synthetic_data/` directory _(if it doesn't exist)_
2. Generate the business metrics dataset
3. Save the dataset as `business_metrics.csv` to `data/synthetic_data/`

In [None]:
# Set random seed for reproducibility
np.random.seed(42)

In [None]:
# Generate dates for about 6-7 months
start_date = datetime(2023, 6, 1)
dates = [start_date + timedelta(days=x) for x in range(200)]

In [None]:
# Base visitors (weekly and seasonal patterns with random noise)
base_visitors = np.random.normal(1000, 100, 200)
weekly_pattern = 1 + 0.3 * np.sin(np.arange(200) * 2 * np.pi / 7)  # 7-day cycle; with a Thursday start date it peaks on Saturdays
seasonal_pattern = 1 + 0.2 * np.sin(np.arange(200) * 2 * np.pi / 90)  # Slower 90-day cycle

visitors = (base_visitors * weekly_pattern * seasonal_pattern).astype(int)
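
To see which weekdays the 7-day sine pattern actually favours, one cycle can be mapped back to calendar days. This is a small sketch that reuses the start date and formula from the cell above; the names `WEEKDAYS`, `offsets`, and `multipliers` are introduced here for illustration only.

```python
from datetime import datetime, timedelta

import numpy as np

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

start_date = datetime(2023, 6, 1)  # same start date as the dataset (a Thursday)
offsets = np.arange(7)             # one full weekly cycle
weekly_pattern = 1 + 0.3 * np.sin(offsets * 2 * np.pi / 7)

# Map each day offset to its weekday name and visitor multiplier
multipliers = {
    WEEKDAYS[(start_date + timedelta(days=int(k))).weekday()]: round(float(m), 3)
    for k, m in zip(offsets, weekly_pattern)
}
for name, value in multipliers.items():
    print(f"{name:<10}{value}")

peak_day = max(multipliers, key=multipliers.get)
low_day = min(multipliers, key=multipliers.get)
print(f"Peak: {peak_day}, low: {low_day}")
```

With this start date the multiplier is highest on Saturday and lowest on Tuesday, so the weekend uplift is approximate rather than exact.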

In [None]:
# Marketing spend (planned campaigns)
base_marketing = np.random.normal(500, 50, 200)
# Add some campaign spikes
campaign_dates = [30, 60, 90, 120, 150, 180]
for date in campaign_dates:
    base_marketing[date-3:date+4] *= 2.5
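
Because the end of a slice is exclusive, `date-3:date+4` covers seven days centred on each campaign date. The sketch below illustrates this on a flat baseline (an assumption made here so the boosted days are easy to spot; the real data uses random noise).

```python
import numpy as np

spend = np.full(200, 500.0)  # flat baseline instead of random noise (illustrative)
campaign_days = [30, 60, 90, 120, 150, 180]
for day in campaign_days:
    spend[day - 3:day + 4] *= 2.5  # slice end is exclusive: days day-3 .. day+3

# Indices of all days whose spend was raised above the baseline
boosted = np.flatnonzero(spend > 500)
print("First campaign window:", boosted[:7].tolist())
print("Total boosted days:", len(boosted))
```

The campaign dates are 30 days apart, so the windows do not overlap and each day is boosted at most once.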

In [None]:
# Conversion rate (partially influenced by marketing)
base_conversion = np.random.normal(0.025, 0.003, 200)
marketing_effect = 0.3 * (base_marketing - base_marketing.mean()) / base_marketing.std()
conversion_rate = np.maximum(0.01, base_conversion + marketing_effect * 0.005)
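
The marketing effect is a standardised (z-score) version of the spend, scaled so that one standard deviation of extra spend adds 0.3 × 0.005 = 0.0015 to the conversion rate. The sketch below walks through that arithmetic with a handful of made-up spend values and a fixed base rate of 0.025; both are assumptions for illustration only.

```python
import numpy as np

marketing = np.array([400.0, 500.0, 600.0, 1250.0])  # illustrative spend values
z_scores = (marketing - marketing.mean()) / marketing.std()

# Same scaling as in the cell above: 0.3 * z, then multiplied by 0.005
uplift = 0.3 * z_scores * 0.005

# Fixed base rate of 2.5% instead of random noise; floor at 1% as in the dataset
conversion = np.maximum(0.01, 0.025 + uplift)

for spend_value, z, rate in zip(marketing, z_scores, conversion):
    print(f"spend={spend_value:7.1f}  z={z:+.2f}  conversion={rate:.4f}")
```

The `np.maximum(0.01, ...)` floor only matters when noise and a low spend combine to push the rate below 1%.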

In [None]:
# Average order value (slight weekly pattern)
aov_base = np.random.normal(50, 5, 200)
aov_weekly = 1 + 0.1 * np.sin(np.arange(200) * 2 * np.pi / 7)
avg_order_value = aov_base * aov_weekly

In [None]:
# Calculate revenue
revenue = visitors * conversion_rate * avg_order_value

In [None]:
# Customer satisfaction (negatively correlated with visitor numbers)
satisfaction_base = np.random.normal(4.2, 0.2, 200)
visitor_effect = -0.2 * (visitors - visitors.mean()) / visitors.std()
satisfaction = np.clip(satisfaction_base + visitor_effect, 1, 5)
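
A quick way to check that the intended relationship is present is to compute the correlation between the two series. The sketch below regenerates a simplified version of the data (no weekly or seasonal pattern, and a separate random generator with seed 0, both assumptions made here) and measures the correlation.

```python
import numpy as np

rng = np.random.default_rng(0)  # independent generator; does not touch np.random.seed
visitors = rng.normal(1000, 100, 200)

# Standardise visitors, then subtract 0.2 per standard deviation from satisfaction
z = (visitors - visitors.mean()) / visitors.std()
satisfaction = np.clip(rng.normal(4.2, 0.2, 200) - 0.2 * z, 1, 5)

corr = np.corrcoef(visitors, satisfaction)[0, 1]
print(f"Correlation between visitors and satisfaction: {corr:.2f}")
```

Since the noise (standard deviation 0.2) and the visitor effect (0.2 per standard deviation) are of equal size, the correlation implied by these parameters is about -0.7 before clipping, which is stronger than "slight".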

In [None]:
# Create DataFrame
df = pd.DataFrame({
    "date": dates,
    "visitors": visitors,
    "marketing_spend": np.round(base_marketing, 2),
    "conversion_rate": np.round(conversion_rate, 4),
    "avg_order_value": np.round(avg_order_value, 2),
    "revenue": np.round(revenue, 2),
    "satisfaction": np.round(satisfaction, 2)
})

In [None]:
# Add some special event effects
# Black Friday effect (note: the revenue column is not recalculated afterwards)
bf_date = 180  # Arbitrary day for demonstration
df.at[bf_date, "visitors"] = int(df.at[bf_date, "visitors"] * 2.5)
df.at[bf_date, "conversion_rate"] = df.at[bf_date, "conversion_rate"] * 1.5
df.at[bf_date, "avg_order_value"] = df.at[bf_date, "avg_order_value"] * 1.3

In [None]:
# Add another special event
# Technical issue day
tech_issue = 145  # Random day
df.at[tech_issue, "visitors"] = int(df.at[tech_issue, "visitors"] * 0.5)
df.at[tech_issue, "conversion_rate"] = df.at[tech_issue, "conversion_rate"] * 0.3
df.at[tech_issue, "satisfaction"] = df.at[tech_issue, "satisfaction"] * 0.7

In [None]:
# Add some realistic missing values (NaN)
# Randomly remove some satisfaction scores (customer didn't respond to survey)
np.random.seed(42)
satisfaction_missing = np.random.choice(df.index, size=15, replace=False)
df.loc[satisfaction_missing, "satisfaction"] = np.nan

# Randomly remove some marketing spend values (data collection issues)
marketing_missing = np.random.choice(df.index, size=8, replace=False)
df.loc[marketing_missing, "marketing_spend"] = np.nan

# Remove avg_order_value for days with very low visitors (system issues)
df.loc[df["visitors"] < df["visitors"].quantile(0.05), "avg_order_value"] = np.nan

# Add some outliers for students to detect
# Extreme high marketing spend
df.loc[np.random.choice(df.index, size=3, replace=False), "marketing_spend"] = df["marketing_spend"].max() * 3

# Negative satisfaction scores (data entry errors)
df.loc[np.random.choice(df.index, size=2, replace=False), "satisfaction"] = -1

# Save to CSV in data directory
df.to_csv("data/synthetic_data/business_metrics.csv", index=False)

In [None]:
print("Dataset saved to data/synthetic_data/business_metrics.csv")
print("\nFirst few rows of the dataset:")
print(df.head())
print("\nDataset summary statistics:")
print(df.describe())
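
The missing values and outliers injected above are meant to be found later. As a hedged preview of how that might look, the sketch below runs three common checks on a tiny hand-made frame (the values are invented for illustration and are not taken from `business_metrics.csv`): counting missing values, flagging high outliers with the 1.5 × IQR rule, and finding scores outside the valid 1 to 5 range.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the real dataset (values invented for illustration)
sample = pd.DataFrame({
    "marketing_spend": [480.0, 510.0, np.nan, 495.0, 505.0, 490.0, 3900.0],
    "satisfaction": [4.1, np.nan, 4.3, -1.0, 4.0, 4.2, 4.4],
})

# 1. Missing values per column
missing_counts = sample.isna().sum()
print(missing_counts)

# 2. High outliers via the 1.5 * IQR rule (one common convention among several)
q1, q3 = sample["marketing_spend"].quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
spend_outliers = sample[sample["marketing_spend"] > upper_fence]
print(spend_outliers)

# 3. Scores outside the valid 1-5 range (missing scores are not counted as invalid)
invalid_scores = sample[~sample["satisfaction"].between(1, 5) & sample["satisfaction"].notna()]
print(invalid_scores)
```

The same checks should carry over to the full dataset after loading it with `pd.read_csv`, although the thresholds may need adjusting.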

### Example dataset 2: Retail data

This script creates a detailed retail dataset with 5,000 transactions over a 7-month period (June to December 2023). It will allow you to practice visualization techniques, including:

1. Time series analysis
2. Distribution analysis
3. Categorical comparisons
4. Correlation studies
5. Customer segmentation analysis

The dataset has the following characteristics:

#### Features

- Transaction details _(ID, date, amount)_
- Customer information _(ID, segment)_
- Product information _(category, price, discount)_
- Purchase details _(quantity, total amount)_
- Payment method
- Satisfaction scores

#### Built-in patterns

- Seasonal effects _(weekends, holidays)_
- Time-of-day patterns
- Price variations by product category
- Discount patterns
- Satisfaction scores influenced by discounts and prices
- Realistic customer segmentation

#### Categorical variables

- 5 product categories
- 3 customer segments
- 4 payment methods

In [None]:
# Generate hourly timestamps from June to December 2023 (about 7 months)
start_date = datetime(2023, 6, 1)
end_date = datetime(2023, 12, 31)
date_range = pd.date_range(start=start_date, end=end_date, freq="h")  # lowercase "h": uppercase "H" is deprecated in pandas 2.2+
n_transactions = 5000

In [None]:
# Generate random transaction timestamps
transaction_dates = np.random.choice(date_range, size=n_transactions)
transaction_dates = sorted(transaction_dates)

In [None]:
# Product categories and their base prices
product_categories = {
    "Electronics": {"base_price": 500, "std": 200},
    "Clothing": {"base_price": 50, "std": 20},
    "Books": {"base_price": 25, "std": 10},
    "Home & Garden": {"base_price": 100, "std": 40},
    "Sports & Outdoors": {"base_price": 75, "std": 30}
}

In [None]:
# Customer segments
customer_segments = ["New", "Regular", "Premium"]
customer_segment_weights = [0.3, 0.5, 0.2]  # 30% new, 50% regular, 20% premium

In [None]:
# Generate transaction data
data = {
    "transaction_id": range(1, n_transactions + 1),
    "date": transaction_dates,
    "customer_id": np.random.randint(1, 1001, size=n_transactions),  # IDs 1-1000 (up to 1,000 distinct customers)
    "customer_segment": np.random.choice(customer_segments, size=n_transactions, p=customer_segment_weights)
}

In [None]:
# Generate products and prices
categories = np.random.choice(list(product_categories.keys()), size=n_transactions)
prices = []
discounts = []
quantities = []

for category in categories:
    base = product_categories[category]["base_price"]
    std = product_categories[category]["std"]
    
    # Generate base price with some variation
    price = np.random.normal(base, std)
    
    # Generate discount (more likely, and potentially larger, for Clothing and Sports & Outdoors)
    if category in ["Clothing", "Sports & Outdoors"]:
        discount = np.random.choice([0, 0.1, 0.2, 0.3], p=[0.7, 0.1, 0.1, 0.1])
    else:
        discount = np.random.choice([0, 0.1, 0.2], p=[0.8, 0.15, 0.05])
    
    # Generate quantity (usually 1, sometimes more)
    quantity = np.random.choice([1, 2, 3], p=[0.7, 0.2, 0.1])
    
    prices.append(max(0, price))
    discounts.append(discount)
    quantities.append(quantity)

data["product_category"] = categories
data["unit_price"] = np.round(prices, 2)
data["discount"] = discounts
data["quantity"] = quantities
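
The two discount distributions used in the loop differ both in how often a discount occurs and in how large it can be. The following sketch computes the expected discount for each group directly from the probabilities above; the dictionary name `discount_tables` and the group labels are introduced here for illustration.

```python
# Discount value -> probability, copied from the generation loop above
discount_tables = {
    "Clothing / Sports & Outdoors": {0: 0.7, 0.1: 0.1, 0.2: 0.1, 0.3: 0.1},
    "Other categories": {0: 0.8, 0.1: 0.15, 0.2: 0.05},
}

expected_discount = {
    group: sum(discount * prob for discount, prob in table.items())
    for group, table in discount_tables.items()
}

for group, value in expected_discount.items():
    share_discounted = 1 - discount_tables[group][0]
    print(f"{group}: mean discount {value:.1%}, {share_discounted:.0%} of transactions discounted")
```

On average, then, Clothing and Sports & Outdoors transactions carry a 6% discount compared with 2.5% for the other categories.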

In [None]:
# Calculate final prices and revenues
data["discounted_price"] = np.round(np.array(prices) * (1 - np.array(discounts)), 2)
data["total_amount"] = np.round(np.array(data["discounted_price"]) * np.array(quantities), 2)

In [None]:
# Create DataFrame
df = pd.DataFrame(data)

In [None]:
# Add seasonal effects
# 1. Weekend uplift
df["is_weekend"] = df["date"].dt.weekday >= 5
df.loc[df["is_weekend"], "total_amount"] *= 1.2

In [None]:
# 2. Holiday season effect (December)
df.loc[df["date"].dt.month == 12, "total_amount"] *= 1.5

In [None]:
# 3. Time of day effect (peak shopping hours)
peak_hours = (df["date"].dt.hour >= 11) & (df["date"].dt.hour <= 19)
df.loc[peak_hours, "total_amount"] *= 1.1
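
The three effects are multiplicative, so they compound when a transaction falls on a weekend, in December, and during peak hours at the same time (1.2 × 1.5 × 1.1 = 1.98). The sketch below applies the same three rules to four hand-picked timestamps; the frame `demo` and its values are invented for illustration.

```python
import pandas as pd

demo = pd.DataFrame({
    "date": pd.to_datetime([
        "2023-06-07 09:00",  # Wednesday morning: no uplift
        "2023-06-10 09:00",  # Saturday morning: weekend only
        "2023-12-06 14:00",  # Wednesday afternoon in December: holiday + peak hours
        "2023-12-09 14:00",  # Saturday afternoon in December: all three effects
    ]),
    "total_amount": [100.0, 100.0, 100.0, 100.0],
})

# Same rules and multipliers as in the cells above
demo.loc[demo["date"].dt.weekday >= 5, "total_amount"] *= 1.2
demo.loc[demo["date"].dt.month == 12, "total_amount"] *= 1.5
demo.loc[demo["date"].dt.hour.between(11, 19), "total_amount"] *= 1.1

demo["total_amount"] = demo["total_amount"].round(2)
print(demo)
```

A consequence worth keeping in mind is that `total_amount` no longer equals `discounted_price * quantity` for any row that received an uplift.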

In [None]:
# Add payment method
payment_methods = ["Credit Card", "Debit Card", "Digital Wallet", "Cash"]
df["payment_method"] = np.random.choice(payment_methods, size=n_transactions, p=[0.4, 0.3, 0.2, 0.1])

In [None]:
# Add customer satisfaction scores (influenced by discount and price)
base_satisfaction = np.random.normal(4, 0.5, size=n_transactions)
discount_effect = df["discount"] * 2  # Higher discount = higher satisfaction
price_effect = -0.3 * (df["unit_price"] - df["unit_price"].mean()) / df["unit_price"].std()
df["satisfaction_score"] = np.clip(base_satisfaction + discount_effect + price_effect, 1, 5)
df["satisfaction_score"] = df["satisfaction_score"].round(2)

In [None]:
# Clean up and reorder columns
df = df[[
    "transaction_id", 
    "date", 
    "customer_id",
    "customer_segment",
    "product_category",
    "quantity",
    "unit_price",
    "discount",
    "discounted_price",
    "total_amount",
    "payment_method",
    "satisfaction_score"
]].copy()  # Work on a copy so the in-place edits below do not trigger SettingWithCopyWarning

In [None]:
# Add realistic missing values and data quality issues
np.random.seed(42)

# Missing customer segments (new customers without classification)
segment_missing = np.random.choice(df.index, size=200, replace=False)
df.loc[segment_missing, "customer_segment"] = np.nan

# Missing satisfaction scores (customers who didn't complete survey)
satisfaction_missing = np.random.choice(df.index, size=500, replace=False)
df.loc[satisfaction_missing, "satisfaction_score"] = np.nan

# Missing payment method for some cash transactions
cash_transactions = df[df["payment_method"] == "Cash"].index
payment_missing = np.random.choice(cash_transactions, size=20, replace=False)
df.loc[payment_missing, "payment_method"] = np.nan

# Add data entry errors
# Negative quantities (should be caught and fixed)
df.loc[np.random.choice(df.index, size=5, replace=False), "quantity"] = -1

# Extremely high unit prices (decimal point errors)
price_error_rows = np.random.choice(df.index, size=3, replace=False)
df.loc[price_error_rows, "unit_price"] *= 100

# Zero prices (promotional items not properly recorded)
df.loc[np.random.choice(df.index, size=10, replace=False), "unit_price"] = 0

# Add duplicate transactions (same customer, same time, needs deduplication)
duplicates = df.sample(n=20)
duplicates["transaction_id"] = duplicates["transaction_id"] + 10000
df = pd.concat([df, duplicates], ignore_index=True)

# Sort by date to maintain chronological order
df = df.sort_values("date").reset_index(drop=True)

# Save to CSV
df.to_csv("data/synthetic_data/retail_sales.csv", index=False)

In [None]:
print("Dataset saved to data/synthetic_data/retail_sales.csv")
print("\nFirst few rows of the dataset:")
print(df.head())
print("\nDataset summary statistics:")
print(df.describe())
print("\nValue counts for categorical variables:")
print("\nProduct Categories:")
print(df["product_category"].value_counts())
print("\nCustomer Segments:")
print(df["customer_segment"].value_counts())
print("\nPayment Methods:")
print(df["payment_method"].value_counts())
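
Because the injected duplicates receive a new `transaction_id` (the original ID plus 10,000), a plain `drop_duplicates()` over all columns would not catch them. One possible approach, sketched below on a small invented frame, is to compare every column except the ID; the names `sales` and `key_columns` are assumptions for illustration.

```python
import pandas as pd

# Small stand-in for retail_sales.csv (values invented for illustration)
sales = pd.DataFrame({
    "transaction_id": [1, 2, 3, 10002],
    "date": pd.to_datetime([
        "2023-06-01 10:00", "2023-06-01 11:00", "2023-06-02 09:00", "2023-06-01 11:00",
    ]),
    "customer_id": [17, 42, 17, 42],
    "total_amount": [25.0, 80.0, 25.0, 80.0],
})

# Compare rows on everything except the transaction ID
key_columns = [col for col in sales.columns if col != "transaction_id"]
is_repeat = sales.duplicated(subset=key_columns, keep="first")

deduplicated = sales[~is_repeat]
print(f"Removed {int(is_repeat.sum())} duplicate row(s)")
print(deduplicated)
```

Here the row with ID 10002 is flagged because it matches transaction 2 on date, customer, and amount, while transactions 1 and 3 are kept since their timestamps differ.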