# 📊 Dataset Generation: Employee Bonus Prediction

This notebook generates a synthetic dataset for demonstrating gradient descent optimization. The employee data is built from known ground-truth weights, so we can check whether our optimization algorithms recover the true parameters.

**Goal:** Generate data where `bonus = w₁×performance + w₂×experience + w₃×projects + bias`

---
## Setup

Import NumPy for numerical operations and Pandas for data handling.

In [None]:
import numpy as np
import pandas as pd

---
## Define Ground Truth Parameters

We set the "true" weights and bias that determine employee bonuses. Our gradient descent algorithms should recover these values.

| Parameter | Value | Meaning |
|-----------|-------|---------|
| w₁ | 12 | Each performance point adds $12 to bonus |
| w₂ | 6 | Each year of experience adds $6 |
| w₃ | 2 | Each completed project adds $2 |
| bias | 20 | Base bonus amount ($20) |

In [None]:
# Set random seed for reproducibility
np.random.seed(42)

# Number of employee samples
n_samples = 100

# Ground truth weights (what we want gradient descent to discover)
true_w1 = 12    # Weight for performance
true_w2 = 6     # Weight for years of experience  
true_w3 = 2     # Weight for projects completed
true_bias = 20  # Base bonus
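
The same parameters can also be written as one weight vector, which makes the bonus a dot product between an employee's features and the weights. The sketch below is only an illustration: the names `true_weights`, `predict_bonus`, and the example employee are hypothetical and are not used elsewhere in this notebook.

```python
import numpy as np

# Hypothetical names for illustration; the notebook itself keeps four scalars
true_weights = np.array([12.0, 6.0, 2.0])  # [performance, experience, projects]
true_bias = 20.0

def predict_bonus(features, weights, bias):
    """Linear model; `features` has shape (n_samples, 3)."""
    return features @ weights + bias

# One made-up employee: performance 8, 5 years of experience, 7 projects
example = np.array([[8, 5, 7]])
example_bonus = predict_bonus(example, true_weights, true_bias)
print(example_bonus)  # 12*8 + 6*5 + 2*7 + 20 = 160
```

Keeping the weights in a vector is a common convention because the gradient with respect to all weights can then be computed in one matrix operation.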

---
## Generate Feature Data

Create random but realistic employee data:
- **Performance**: Integer rating from 1 to 10
- **Years of Experience**: Integer from 1 to 10 years
- **Projects Completed**: Years of experience plus a random 1 to 3 additional projects, so more experienced employees have completed more projects

In [None]:
# Generate employee IDs
employee_id = np.array([f"EMP_{i:03}" for i in range(1, n_samples + 1)])

# Generate random features
performance = np.random.randint(1, 11, n_samples)           # 1 to 10
years_of_experience = np.random.randint(1, 11, n_samples)   # 1 to 10

# Projects completed correlates with experience (realistic assumption)
projects_completed = years_of_experience + np.random.randint(1, 4, n_samples)
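
Because `projects_completed` is derived from `years_of_experience`, the two features are strongly correlated. Correlated features stretch the loss surface, which tends to slow gradient descent, so it can be worth measuring the correlation before training. A minimal sketch that regenerates the same features with the same seed and call order as above; the variable names `features`, `correlation_matrix`, and `experience_projects_corr` are assumptions of this sketch.

```python
import numpy as np

# Regenerate the features exactly as in the cell above (same seed, same order)
np.random.seed(42)
n_samples = 100
performance = np.random.randint(1, 11, n_samples)
years_of_experience = np.random.randint(1, 11, n_samples)
projects_completed = years_of_experience + np.random.randint(1, 4, n_samples)

# np.corrcoef treats each row as one variable
features = np.vstack([performance, years_of_experience, projects_completed])
correlation_matrix = np.corrcoef(features)
experience_projects_corr = correlation_matrix[1, 2]

print("Correlation matrix (performance, experience, projects):")
print(np.round(correlation_matrix, 2))
print("Projects range:", projects_completed.min(), "to", projects_completed.max())
```

Note that `np.random.randint(1, 4, ...)` excludes the upper bound, so the random extra is 1, 2, or 3 and the project count always lies between 2 and 13.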

---
## Calculate Target Variable (Bonus)

Compute the bonus using our ground truth formula:

```
bonus = 12×performance + 6×experience + 2×projects + 20
```

No noise is added, so the relationship is exactly linear. Gradient descent should therefore be able to recover the true weights and bias to within numerical tolerance.

In [None]:
# Calculate bonus using the true weights and bias
bonus = (
    true_w1 * performance
    + true_w2 * years_of_experience
    + true_w3 * projects_completed
    + true_bias
)

print(f"Bonus range: ${bonus.min()} - ${bonus.max()}")
print(f"Average bonus: ${bonus.mean():.2f}")
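
Before running any gradient descent, one way to confirm that the parameters are recoverable from this data is to solve the same linear problem in closed form. The sketch below uses `np.linalg.lstsq` for that check; it is not part of the original notebook, and `design_matrix` and `solution` are names chosen for this sketch.

```python
import numpy as np

# Regenerate the data exactly as in the cells above
np.random.seed(42)
n_samples = 100
performance = np.random.randint(1, 11, n_samples)
years_of_experience = np.random.randint(1, 11, n_samples)
projects_completed = years_of_experience + np.random.randint(1, 4, n_samples)
bonus = 12 * performance + 6 * years_of_experience + 2 * projects_completed + 20

# Three feature columns plus a column of ones that plays the role of the bias
design_matrix = np.column_stack(
    [performance, years_of_experience, projects_completed, np.ones(n_samples)]
)

# Least-squares solution of design_matrix @ solution = bonus
solution, _, rank, _ = np.linalg.lstsq(design_matrix, bonus, rcond=None)

print("Rank of design matrix:", rank)
print("Least-squares estimate [w1, w2, w3, bias]:", np.round(solution, 6))
```

A rank of 4 means the three features and the bias column are linearly independent, so the parameters are uniquely determined even though experience and projects are correlated.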

---
## Create and Export DataFrame

Combine all data into a Pandas DataFrame and save to CSV for use in gradient descent notebooks.

In [None]:
# Create DataFrame
df = pd.DataFrame({
    'employee_id': employee_id,
    'performance': performance,
    'years_of_experience': years_of_experience,
    'projects_completed': projects_completed,
    'bonus': bonus
})

# Display the first 10 rows
print("Sample data:")
df.head(10)

In [None]:
# Export to CSV
file_path = 'emp_bonus.csv'
df.to_csv(file_path, index=False)
print(f"Dataset saved to {file_path}")
print(f"Shape: {df.shape}")
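
A quick round-trip check can confirm that the CSV reloads with the expected columns and that the bonus formula still holds row by row. The sketch below is self-contained and writes to a temporary directory instead of the working directory, which is a choice of this sketch and not something the notebook does.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Rebuild the dataset exactly as in the cells above
np.random.seed(42)
n_samples = 100
employee_id = [f"EMP_{i:03}" for i in range(1, n_samples + 1)]
performance = np.random.randint(1, 11, n_samples)
years_of_experience = np.random.randint(1, 11, n_samples)
projects_completed = years_of_experience + np.random.randint(1, 4, n_samples)
bonus = 12 * performance + 6 * years_of_experience + 2 * projects_completed + 20

df = pd.DataFrame({
    "employee_id": employee_id,
    "performance": performance,
    "years_of_experience": years_of_experience,
    "projects_completed": projects_completed,
    "bonus": bonus,
})

# Write to a temporary location (assumption of this sketch) and read it back
with tempfile.TemporaryDirectory() as tmp_dir:
    tmp_path = os.path.join(tmp_dir, "emp_bonus.csv")
    df.to_csv(tmp_path, index=False)
    loaded = pd.read_csv(tmp_path)

# Recompute the bonus from the reloaded features and compare
recomputed = (
    12 * loaded["performance"]
    + 6 * loaded["years_of_experience"]
    + 2 * loaded["projects_completed"]
    + 20
)
formula_holds = bool((recomputed == loaded["bonus"]).all())
print(loaded.shape, formula_holds)  # (100, 5) True
```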

---
## Summary

**Dataset created with:**
- 100 employee samples
- 3 features: performance, years_of_experience, projects_completed
- 1 target: bonus

**Ground truth parameters to recover:**

| Parameter | True Value |
|-----------|------------|
| w₁ (performance) | 12 |
| w₂ (experience) | 6 |
| w₃ (projects) | 2 |
| bias | 20 |

Use `gradient_descent.ipynb` and `gd_vs_mini_gd_vs_sgd.ipynb` to train models and verify they recover these values.
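
For orientation, a minimal batch gradient descent sketch on this dataset might look like the following. It is not taken from the notebooks named above, which may be implemented differently. The feature standardization, the learning rate of 0.5, and the 10,000 iterations are all assumptions of this sketch, chosen so that plain gradient descent converges despite the correlated features.

```python
import numpy as np

# Regenerate the same features and target as this notebook
np.random.seed(42)
n_samples = 100
performance = np.random.randint(1, 11, n_samples)
years_of_experience = np.random.randint(1, 11, n_samples)
projects_completed = years_of_experience + np.random.randint(1, 4, n_samples)
bonus = 12 * performance + 6 * years_of_experience + 2 * projects_completed + 20

X = np.column_stack([performance, years_of_experience, projects_completed]).astype(float)
y = bonus.astype(float)

# Standardize features (assumption of this sketch) to improve conditioning
mu = X.mean(axis=0)
sigma = X.std(axis=0)
Z = (X - mu) / sigma

w_scaled = np.zeros(3)
b_scaled = 0.0
learning_rate = 0.5    # hypothetical value
n_iterations = 10_000  # hypothetical value

for _ in range(n_iterations):
    error = Z @ w_scaled + b_scaled - y
    grad_w = Z.T @ error / n_samples  # gradient of 0.5 * mean squared error
    grad_b = error.mean()
    w_scaled -= learning_rate * grad_w
    b_scaled -= learning_rate * grad_b

# Map the parameters back to the original feature scale
w_recovered = w_scaled / sigma
b_recovered = b_scaled - w_recovered @ mu

print("Recovered weights:", np.round(w_recovered, 3))
print("Recovered bias:", round(float(b_recovered), 3))
```

The recovered values should be close to the ground truth of 12, 6, 2, and 20. Without standardization, the same loop would need a much smaller learning rate and many more iterations.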