# Pandas Application: Employee Sales Analysis (Demonstration)

_This notebook demonstrates applying [Pandas](https://pandas.pydata.org/) techniques to a real-world dataset. The employee sales dataset has an identical structure to the Olympics dataset for your group project, allowing you to see complete solutions to similar problems._

**Important**: This demonstration uses a **different dataset** (employee_sales.csv) than your group project (athlete_events.csv). The techniques and approaches demonstrated here can be adapted to your Olympics analysis.

Note: This Jupyter Notebook was originally compiled by Alex Reppel (AR) based on conversations with [ClaudeAI](https://claude.ai/) *(version 3.5 Sonnet)*. For this year's materials, further revisions were made using [Claude Code](https://www.anthropic.com/claude-code) *(Sonnet 4.5)*, including updated documentation and git commit messages.

## Building on previous sessions

This demonstration integrates concepts from:

- **Week 04**: Basic Pandas operations (loading data, filtering, grouping, merging)
- **Week 05**: Advanced techniques (reshaping, pivot tables, MultiIndex)

We'll apply these skills to a complete data analysis workflow.

## Dataset mapping

The employee sales dataset has a 1:1 column mapping with the Olympics dataset:

| Employee sales | Olympics | Description |
|----------------|----------|-------------|
| employee_id | ID | Unique identifier |
| name | Name | Full name |
| gender (F/M/D) | Sex (M/F) | Gender/Sex |
| age | Age | Age in years |
| height_cm | Height | Height in cm |
| weight_kg | Weight | Weight in kg |
| team | Team | Team affiliation |
| region | NOC | Region/Country code |
| quarter | Games | Time period (Quarter/Games) |
| year | Year | Year |
| half | Season | Half of year/Season |
| office | City | Location |
| product_category | Sport | Category |
| product | Event | Specific item/event |
| award | Medal | Recognition (Gold/Silver/Bronze) |

This mapping allows you to adapt the code patterns directly to your Olympics analysis.

---

## 🎯 CORE CONTENT (Essential for Group Project)

**Estimated time**: 50-60 minutes

The sections below demonstrate the complete workflow for your Olympics group project:
- Data loading and exploration
- Data cleaning (missing values, data types, duplicates)
- Data wrangling (age groups, temporal features)
- Core data analysis (averages, top countries, medal counts, gender analysis)

These techniques directly address your project requirements. Work through all examples and note how to adapt them for the Olympics dataset.

---

## Part 1: Setup and data loading

### Import libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Set display options for better output
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", 100)
pd.set_option("display.width", None)

### Load the dataset

We'll load the employee sales data using `pd.read_csv()`. For your Olympics project, you'll use the same approach with `athlete_events.csv`.

In [None]:
# Load the data
df = pd.read_csv("assets/data/employee_sales.csv")

print(f"Dataset loaded: {len(df)} rows, {len(df.columns)} columns")
print(f"\nColumn names:\n{list(df.columns)}")

### Initial exploration

Always start by exploring your data to understand its structure, contents, and quality.

In [None]:
# View first few rows
print("First 10 rows:")
df.head(10)

In [None]:
# Get dataset information
print("Dataset information:")
df.info()

In [None]:
# Get summary statistics for numeric columns
print("Summary statistics:")
df.describe()

**Observations from initial exploration**:

1. **Data structure**: 245 rows × 15 columns (the same 15-column structure as the Olympics dataset, which has 271,116 rows)
2. **Data types**: Mix of numeric (int64, float64) and text (object)
3. **Missing values**: Some columns have missing data (age, height_cm, weight_kg, award)
4. **Value ranges**: Numeric values appear reasonable (age: 22-59, height: 155-194, weight: 50-99)

For the Olympics dataset, you'll see similar patterns but at a much larger scale.

## Part 2: Data cleaning

Data cleaning is essential before analysis. We'll identify and handle:
- Missing values
- Data type issues
- Duplicates
- Outliers (if needed)

### Section 1: Missing values

First, identify which columns have missing data and how much.

In [None]:
# Count missing values
missing = df.isnull().sum()
missing_pct = (df.isnull().sum() / len(df) * 100).round(1)

missing_summary = pd.DataFrame({
    "Missing_Count": missing,
    "Missing_Percent": missing_pct
})

print("Missing values summary:")
print(missing_summary[missing_summary["Missing_Count"] > 0].sort_values("Missing_Count", ascending=False))

**Analysis of missing values**:

- **award** (76.3%): This is expected - most employees don't receive awards, just like most athletes don't win medals
- **weight_kg** (10.2%): Demographic data may be incomplete
- **height_cm** (6.9%): Similar to weight
- **age** (6.1%): Personal information may not always be recorded

**Decision**: Keep missing values as-is for demographic columns (age, height, weight) since:
1. The percentage is relatively low
2. Imputing (filling) could introduce bias
3. Analysis functions like `.mean()` handle NaN automatically

For **award**, NaN represents "no award" which is meaningful information.

**For Olympics**: You'll see similar patterns - most athletes have no medals, and some demographic data is missing.
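
The two decisions above can be checked on a toy frame. This is a minimal sketch with made-up values (the `toy` frame and the "No award" label are illustrative assumptions, not part of either dataset): `.mean()` skips NaN by default, and `.fillna()` can make the "no award" meaning explicit for display without altering the original column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing age, two rows without an award
toy = pd.DataFrame({
    "age": [30.0, np.nan, 40.0],
    "award": ["Gold", np.nan, np.nan],
})

# .mean() skips NaN by default (skipna=True): (30 + 40) / 2
mean_age = toy["age"].mean()

# .count() reports how many non-missing values the mean is based on
n_ages = toy["age"].count()

# Label missing awards in a separate Series; the original column keeps its NaN
award_label = toy["award"].fillna("No award")

print(mean_age, n_ages)
print(award_label.value_counts())
```

Reporting the non-missing count next to each mean makes it clear how much data each average rests on.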

### Section 2: Data types

Ensure columns have appropriate data types for analysis.

In [None]:
# Check current data types
print("Current data types:")
print(df.dtypes)

In [None]:
# Convert year to datetime for time-based analysis
# Note: We only have the year, not a full date
df["year_dt"] = pd.to_datetime(df["year"], format="%Y")

print("Added year_dt column:")
print(df[["year", "year_dt"]].head())
print(f"\nData type of year_dt: {df['year_dt'].dtype}")

**Why convert year to datetime?**

Even though we only have the year (not month/day), converting to datetime enables:
- Time series analysis
- Easy extraction of decade, century
- Chronological sorting and grouping
- Date-based filtering

For Olympics, this will help analyze trends over 120 years of Olympic history.
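
As a rough illustration of the benefits listed above, the sketch below uses a hypothetical three-row frame (the values are made up) to derive a decade, filter by date, and sort chronologically. It converts the year to a string before parsing, which is an assumption made here to keep the parsing step explicit.

```python
import pandas as pd

# Hypothetical toy data with a year-only column
toy = pd.DataFrame({"year": [1996, 2004, 2012]})

# Parse the year as a date; each value becomes 1 January of that year
toy["year_dt"] = pd.to_datetime(toy["year"].astype(str), format="%Y")

# Extract the decade through the .dt accessor
toy["decade"] = toy["year_dt"].dt.year // 10 * 10

# Date-based filtering: keep rows from 2000 onwards
since_2000 = toy[toy["year_dt"] >= pd.Timestamp("2000-01-01")]

# Chronological sorting, newest first
newest_first = toy.sort_values("year_dt", ascending=False)

print(toy)
print(since_2000)
```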

### Section 3: Duplicates

Check for and remove duplicate rows.

In [None]:
# Check for duplicates
duplicates = df.duplicated()
print(f"Number of duplicate rows: {duplicates.sum()}")

if duplicates.any():
    print("\nDuplicate rows:")
    print(df[duplicates])

In [None]:
# Remove duplicates
df_clean = df.drop_duplicates()
print(f"Original rows: {len(df)}")
print(f"After removing duplicates: {len(df_clean)}")
print(f"Rows removed: {len(df) - len(df_clean)}")

# Use the cleaned dataset going forward
df = df_clean

**Important distinction for Olympics**:

In the employee dataset, each row is a unique sale transaction, so duplicates are true data quality issues.

In the Olympics dataset, each row is an athlete-event combination. The same athlete appears multiple times (different events/years). This is NOT a duplicate - it's the correct data structure. Only remove truly identical rows (all columns matching).
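
The difference between a repeated athlete and a true duplicate can be seen on a small made-up frame (the rows below are hypothetical, using Olympics-style column names). `duplicated()` with no arguments flags only rows where every column matches, while `subset=["ID"]` flags every repeat appearance of an athlete, which is not what you want to remove.

```python
import pandas as pd

# Hypothetical athlete-event rows: athlete 1 enters two events,
# and the 200m row was accidentally recorded twice
toy = pd.DataFrame({
    "ID": [1, 1, 1, 2],
    "Name": ["A", "A", "A", "B"],
    "Event": ["100m", "200m", "200m", "100m"],
    "Year": [2012, 2012, 2012, 2012],
})

# True duplicates: all columns identical to an earlier row
full_row_dupes = toy.duplicated().sum()

# Repeat appearances of the same athlete: legitimate rows, not duplicates
repeated_ids = toy.duplicated(subset=["ID"]).sum()

# drop_duplicates() with no arguments removes only the fully identical row
deduped = toy.drop_duplicates()

print(full_row_dupes, repeated_ids, len(deduped))
```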

## Part 3: Data wrangling

Create new columns to enable richer analysis.

### Section 1: Age groups

Create categorical age groups for demographic analysis.

In [None]:
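# Create age groups using pd.cut()
# right=False makes each bin include its lower edge: [0, 30), [30, 40), ...
# so that age 30 is labelled "30s" and age 50 is labelled "50+"
df["age_group"] = pd.cut(
    df["age"],
    bins=[0, 30, 40, 50, 100],
    labels=["20s", "30s", "40s", "50+"],
    right=False
)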

print("Age group distribution:")
print(df["age_group"].value_counts().sort_index())

print("\nSample of age with age_group:")
print(df[["name", "age", "age_group"]].head(10))

**For Olympics**: You might use different bins based on typical athlete ages:
```python
df["age_group"] = pd.cut(
    df["Age"],
    bins=[0, 20, 25, 30, 35, 100],
    labels=["<20", "20-24", "25-29", "30-34", "35+"],
    right=False  # left-closed bins so the labels match: [20, 25) is "20-24"
)
```

This allows analysis like: "Which age group wins the most gold medals in swimming?"
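
One detail worth checking when binning is where the boundary ages land. By default `pd.cut()` uses right-closed intervals, so an age equal to a bin edge falls into the lower group. The sketch below uses a few made-up ages to compare the default with `right=False`; labels such as "20-24" only match their contents in the left-closed version.

```python
import pandas as pd

# Hypothetical ages chosen to sit on or near the bin edges
ages = pd.Series([20, 24, 25, 35])
bins = [0, 20, 25, 30, 35, 100]
labels = ["<20", "20-24", "25-29", "30-34", "35+"]

# Default right=True: intervals are (0, 20], (20, 25], ... so 20 lands in "<20"
right_closed = pd.cut(ages, bins=bins, labels=labels)

# right=False: intervals are [0, 20), [20, 25), ... so 20 lands in "20-24"
left_closed = pd.cut(ages, bins=bins, labels=labels, right=False)

print(pd.DataFrame({"age": ages, "right_closed": right_closed, "left_closed": left_closed}))
```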

### Section 2: Full name handling

Check if name field needs processing.

In [None]:
# Check name format
print("Sample names:")
print(df["name"].head(10))

# The name column already contains full names
# If it were split into first_name and last_name, we could combine them:
# df["full_name"] = df["first_name"] + " " + df["last_name"]

print("\nName column is already in full name format - no processing needed.")

**For Olympics**: The Name column is already a full name, so no processing is needed. If it were split into separate columns, you could combine them as shown in the comment above.

### Section 3: Temporal features

Extract additional time-based features for analysis.

In [None]:
# Extract century from year
# Century: 2001-2100 = 21st century, 1901-2000 = 20th century
df["century"] = ((df["year"] - 1) // 100 + 1)

print("Century distribution:")
print(df["century"].value_counts().sort_index())

print("\nSample of year with century:")
print(df[["year", "century"]].drop_duplicates().sort_values("year"))

**How the century calculation works**:

```python
(year - 1) // 100 + 1
```

Examples:
- 2024: `(2024 - 1) // 100 + 1 = 2023 // 100 + 1 = 20 + 1 = 21` (21st century ✓)
- 2000: `(2000 - 1) // 100 + 1 = 1999 // 100 + 1 = 19 + 1 = 20` (20th century ✓)
- 1896: `(1896 - 1) // 100 + 1 = 1895 // 100 + 1 = 18 + 1 = 19` (19th century ✓)

**For Olympics**: This enables analysis across centuries of Olympic history:
- "How has average athlete age changed from 19th to 21st century?"
- "Which century had the most female participation?"
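
The formula can be verified on the boundary years, where an off-by-one error is most likely. This short sketch (the list of years is chosen here for illustration) compares it with the simpler `year // 100 + 1`, which places years ending in 00 one century too late.

```python
import pandas as pd

# Years around the century boundaries
years = pd.Series([1896, 1900, 1901, 2000, 2001, 2024])

# Formula used above: a century runs from year xx01 to year (xx+1)00
century = (years - 1) // 100 + 1

# Simpler variant, shown for comparison: disagrees for 1900 and 2000
naive = years // 100 + 1

print(pd.DataFrame({"year": years, "century": century, "naive": naive}))
```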

## Part 4: Data analysis

Now we'll perform various analyses that demonstrate the key requirements of your Olympics project.

### Section 1: Average age by product category

**Olympics equivalent**: Average age by sport

This helps identify which product categories/sports attract employees/athletes of different ages.

In [None]:
# Calculate average age by product category
avg_age_by_category = df.groupby("product_category")["age"].mean().sort_values(ascending=False)

print("Average age by product category:")
print(avg_age_by_category.round(1))

# Visualize
plt.figure(figsize=(10, 6))
avg_age_by_category.plot(kind="barh", color="skyblue")
plt.xlabel("Average Age (years)")
plt.ylabel("Product Category")
plt.title("Average Employee Age by Product Category")
plt.grid(axis="x", alpha=0.3)
plt.tight_layout()
plt.show()

**For Olympics**, adapt this code:

```python
avg_age_by_sport = df.groupby("Sport")["Age"].mean().sort_values(ascending=False)
print(avg_age_by_sport.head(10))  # Top 10 sports by average age
```

This might reveal interesting patterns like:
- Equestrian and shooting tend to have older athletes
- Gymnastics and swimming tend to have younger athletes
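
An average on its own can mislead when a group contains only a handful of rows. As a hedged sketch on made-up data (the sports "A" and "B" and the ages are hypothetical), named aggregation can report the mean next to the number of ages it is based on:

```python
import pandas as pd

# Hypothetical data: sport "B" has a single, unusually old athlete
toy = pd.DataFrame({
    "Sport": ["A", "A", "A", "B"],
    "Age": [20.0, 22.0, None, 40.0],
})

# Mean age and number of non-missing ages per sport
summary = (
    toy.groupby("Sport")["Age"]
    .agg(mean_age="mean", n_ages="count")
    .sort_values("mean_age", ascending=False)
)

print(summary)
```

Sport "B" tops the ranking here on the strength of one row, which is the kind of result worth flagging or filtering out (for example with a minimum-count threshold) before drawing conclusions.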

### Section 2: Top regions by award count

**Olympics equivalent**: Top countries by medal count

Identify the most successful regions/countries.

In [None]:
# Count total awards by region (excluding NaN)
awards_by_region = df[df["award"].notna()].groupby("region")["award"].count().sort_values(ascending=False)

print("Total awards by region:")
print(awards_by_region)

# Visualize top regions
plt.figure(figsize=(8, 5))
awards_by_region.plot(kind="bar", color="coral")
plt.xlabel("Region")
plt.ylabel("Number of Awards")
plt.title("Total Awards by Region")
plt.xticks(rotation=0)
plt.grid(axis="y", alpha=0.3)
plt.tight_layout()
plt.show()

**Breaking down by award type**:

Let's see the distribution of gold, silver, and bronze awards by region.

In [None]:
# Create a pivot table showing award types by region
award_breakdown = pd.pivot_table(
    df[df["award"].notna()],
    values="employee_id",
    index="region",
    columns="award",
    aggfunc="count",
    fill_value=0
)

# Ensure columns are in Gold, Silver, Bronze order
award_order = ["Gold", "Silver", "Bronze"]
award_breakdown = award_breakdown[[col for col in award_order if col in award_breakdown.columns]]

print("Award breakdown by region:")
print(award_breakdown)

# Stacked bar chart
fig, ax = plt.subplots(figsize=(10, 6))
# Pass ax explicitly: DataFrame.plot() would otherwise open a second figure
award_breakdown.plot(kind="bar", stacked=True, color=["gold", "silver", "#CD7F32"], ax=ax)
plt.xlabel("Region")
plt.ylabel("Number of Awards")
plt.title("Award Distribution by Region")
plt.legend(title="Award Type")
plt.xticks(rotation=0)
plt.grid(axis="y", alpha=0.3)
plt.tight_layout()
plt.show()

**For Olympics**, adapt this code:

```python
# Top 10 countries by total medals
medals_by_country = df[df["Medal"].notna()].groupby("NOC")["Medal"].count().sort_values(ascending=False)
print(medals_by_country.head(10))

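# Medal breakdown (Gold, Silver, Bronze) for the top 10 countries
medal_breakdown = pd.pivot_table(
    df[df["Medal"].notna()],
    values="ID",
    index="NOC",
    columns="Medal",
    aggfunc="count",
    fill_value=0
)
# .head(10) on the pivot would return the first 10 NOC codes alphabetically,
# so select the top 10 by total medals instead
medal_breakdown = medal_breakdown.loc[medals_by_country.head(10).index]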
```

This is one of the core requirements of your Olympics project.
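
Because each Olympics row is an athlete-event combination, counting rows counts one medal per team member in team events. Whether that is what you want depends on the question you are asking. The sketch below, on made-up rows, shows one possible way to count each event's medal once per country; the choice of de-duplication columns is an assumption and would also collapse two athletes from the same country sharing the same medal in a tie.

```python
import pandas as pd

# Hypothetical rows: two USA relay runners share one gold medal
toy = pd.DataFrame({
    "NOC": ["USA", "USA", "USA", "JAM"],
    "Games": ["2012 Summer"] * 4,
    "Event": ["4x100m Relay", "4x100m Relay", "100m", "100m"],
    "Medal": ["Gold", "Gold", "Silver", "Gold"],
})

medalists = toy[toy["Medal"].notna()]

# Row count: one medal per athlete, so the relay counts twice
per_athlete = medalists.groupby("NOC").size()

# One medal per country, games, event and medal type
per_event = (
    medalists.drop_duplicates(subset=["NOC", "Games", "Event", "Medal"])
    .groupby("NOC")
    .size()
)

print(per_athlete)
print(per_event)
```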

### Section 3: Most decorated employees by category

**Olympics equivalent**: Most decorated athletes by sport

Identify which individuals have the most awards/medals in each category/sport.

In [None]:
# For each product category, find the employee with the most awards
# First, count awards per employee per category
employee_awards = df[df["award"].notna()].groupby(
    ["product_category", "name"]
).size().reset_index(name="award_count")

# Find the employee with most awards in each category
idx = employee_awards.groupby("product_category")["award_count"].idxmax()
top_employees = employee_awards.loc[idx]

print("Most decorated employee by product category:")
print(top_employees.sort_values("award_count", ascending=False))

**Alternative approach using groupby and rank**:

In [None]:
# Add category ranking to each employee
employee_awards["category_rank"] = employee_awards.groupby("product_category")["award_count"].rank(
    ascending=False, 
    method="min"
)

# Show top 3 employees in each category
top_3_per_category = employee_awards[employee_awards["category_rank"] <= 3].sort_values(
    ["product_category", "category_rank"]
)

print("\nTop 3 employees by award count in each product category:")
print(top_3_per_category)

**For Olympics**, adapt this code:

```python
# Count medals per athlete per sport
athlete_medals = df[df["Medal"].notna()].groupby(
    ["Sport", "Name"]
).size().reset_index(name="medal_count")

# Find the athlete with most medals in each sport
idx = athlete_medals.groupby("Sport")["medal_count"].idxmax()
top_athletes = athlete_medals.loc[idx]
print(top_athletes.sort_values("medal_count", ascending=False).head(10))
```

This should surface well-known names such as Michael Phelps (Swimming).
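
Note that `idxmax()` returns only the first row when several athletes tie for the top count. A minimal sketch on made-up counts (the names and numbers are hypothetical) shows one way to keep all tied leaders, by comparing each row with its group maximum via `transform("max")`:

```python
import pandas as pd

# Hypothetical medal counts: A and B tie in sport X
counts = pd.DataFrame({
    "Sport": ["X", "X", "Y"],
    "Name": ["A", "B", "C"],
    "medal_count": [3, 3, 1],
})

# idxmax() keeps only the first of the tied athletes
first_only = counts.loc[counts.groupby("Sport")["medal_count"].idxmax()]

# transform("max") broadcasts each group's maximum back to every row,
# so the comparison keeps every athlete who matches it
group_max = counts.groupby("Sport")["medal_count"].transform("max")
with_ties = counts[counts["medal_count"] == group_max]

print(first_only)
print(with_ties)
```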

### Section 4: Gender analysis

Analyse performance and participation by gender.

In [None]:
# Overall gender distribution
gender_dist = df["gender"].value_counts()

print("Gender distribution in dataset:")
print(gender_dist)
print("\nPercentages:")
print((gender_dist / len(df) * 100).round(1))

In [None]:
# Award win rate by gender
gender_performance = df.groupby("gender").agg({
    "employee_id": "count",
    "award": lambda x: x.notna().sum()
}).rename(columns={"employee_id": "total_sales", "award": "awards_won"})

gender_performance["win_rate_%"] = (
    gender_performance["awards_won"] / gender_performance["total_sales"] * 100
).round(1)

print("Performance by gender:")
print(gender_performance)
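
The same summary can also be written with named aggregation, which sets the output column names directly and avoids the lambda and the `.rename()` step. This is an alternative sketch on made-up rows, not a replacement for the cell above; it relies on `"count"` ignoring missing values, so counting the `award` column gives the number of awards won.

```python
import pandas as pd

# Hypothetical sales rows: None marks a sale without an award
toy = pd.DataFrame({
    "gender": ["F", "F", "M", "M", "M"],
    "employee_id": [1, 2, 3, 4, 5],
    "award": ["Gold", None, None, "Bronze", None],
})

# output_name=(column, function); "count" skips missing awards
perf = toy.groupby("gender").agg(
    total_sales=("employee_id", "count"),
    awards_won=("award", "count"),
)
perf["win_rate_%"] = (perf["awards_won"] / perf["total_sales"] * 100).round(1)

print(perf)
```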

**For Olympics**, adapt this code:

```python
# Gender distribution
print(df["Sex"].value_counts())

# Medal win rate by gender
gender_performance = df.groupby("Sex").agg({
    "ID": "count",
    "Medal": lambda x: x.notna().sum()
}).rename(columns={"ID": "total_participations", "Medal": "medals_won"})

gender_performance["medal_rate_%"] = (
    gender_performance["medals_won"] / gender_performance["total_participations"] * 100
).round(1)
```

This analysis might reveal:
- Historical trends in female participation (increasing over time)
- Whether medal win rates differ by gender
- Sports with high/low female participation

---

## 📚 SUPPLEMENTARY CONTENT (Stretch Goals)

**Estimated time**: 10-20 minutes

The sections below demonstrate advanced Week 05 techniques for stretch goals:
- Pivot tables for multi-dimensional analysis
- Trend analysis over time
- MultiIndex for hierarchical grouping

These are **not required** for the basic project but can help you achieve higher marks through more sophisticated analysis.

---

## Part 5: Advanced techniques (Week 05 concepts)

Apply advanced Pandas techniques for more sophisticated analyses.

### Section 1: Pivot tables for multi-dimensional analysis

Create comprehensive summary tables.

In [None]:
# Awards by region and product category
awards_pivot = pd.pivot_table(
    df[df["award"].notna()],
    values="employee_id",
    index="region",
    columns="product_category",
    aggfunc="count",
    fill_value=0,
    margins=True,
    margins_name="Total"
)

print("Awards by region and product category:")
print(awards_pivot)

**For Olympics**:

```python
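# Medals by country and sport
medals_pivot = pd.pivot_table(
    df[df["Medal"].notna()],
    values="ID",
    index="NOC",
    columns="Sport",
    aggfunc="count",
    fill_value=0
)

# Keep the 10 countries with the most medals overall
# (.head(10) alone would return the first 10 NOC codes alphabetically)
top_10 = medals_pivot.sum(axis=1).nlargest(10).index
medals_pivot = medals_pivot.loc[top_10]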
```

This reveals which countries dominate which sports (e.g., USA in Swimming).

### Section 2: Trends over time

Analyse how metrics change over time.

In [None]:
# Awards won per year
awards_by_year = df[df["award"].notna()].groupby("year")["award"].count()

print("Awards by year:")
print(awards_by_year)

# Visualize trend
plt.figure(figsize=(10, 6))
awards_by_year.plot(kind="line", marker="o", color="green")
plt.xlabel("Year")
plt.ylabel("Number of Awards")
plt.title("Award Trend Over Time")
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# Average age trend over time
avg_age_by_year = df.groupby("year")["age"].mean()

print("Average age by year:")
print(avg_age_by_year.round(1))

# Visualize
plt.figure(figsize=(10, 6))
avg_age_by_year.plot(kind="line", marker="s", color="purple")
plt.xlabel("Year")
plt.ylabel("Average Age (years)")
plt.title("Average Employee Age Over Time")
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

**For Olympics**:

```python
# Participation trend (total athlete-events per year)
participation_by_year = df.groupby("Year").size()

# Medal count trend
medals_by_year = df[df["Medal"].notna()].groupby("Year")["Medal"].count()

# Average age trend
avg_age_by_year = df.groupby("Year")["Age"].mean()

# Female participation rate
female_rate = df.groupby("Year")["Sex"].apply(lambda x: (x == "F").sum() / len(x) * 100)
```

These analyses reveal:
- Growing Olympic participation over 120 years
- Increasing female participation
- Changes in athlete age profiles
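
The female participation rate can also be computed without a lambda, since the mean of a boolean Series is the share of `True` values. The sketch below checks on made-up rows that this shortcut matches the `.apply()` version; the years and values are hypothetical.

```python
import pandas as pd

# Hypothetical athlete-event rows
toy = pd.DataFrame({
    "Year": [1900, 1900, 2000, 2000, 2000, 2000],
    "Sex": ["M", "M", "F", "M", "F", "F"],
})

# Mean of a boolean Series = fraction of True values
female_rate = toy["Sex"].eq("F").groupby(toy["Year"]).mean() * 100

# Lambda version, for comparison
female_rate_apply = toy.groupby("Year")["Sex"].apply(
    lambda x: (x == "F").sum() / len(x) * 100
)

print(female_rate)
```

Both versions measure the share of athlete-event rows, not of unique athletes, because an athlete who enters several events appears several times.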

### Section 3: MultiIndex analysis

Use hierarchical indexing for complex groupings.

In [None]:
# Create hierarchical summary: region → product_category → award type
award_hierarchy = df[df["award"].notna()].groupby(
    ["region", "product_category", "award"]
).size().unstack(fill_value=0)

print("Hierarchical award summary (Region → Category → Award):")
print(award_hierarchy)

In [None]:
# Access specific region's data using cross-section
print("\nLON region breakdown:")
print(award_hierarchy.xs("LON", level="region"))

**For Olympics**:

```python
# Country → Sport → Medal type
medal_hierarchy = df[df["Medal"].notna()].groupby(
    ["NOC", "Sport", "Medal"]
).size().unstack(fill_value=0)

# View specific country's performance
print(medal_hierarchy.xs("USA", level="NOC"))
```

This shows USA's gold/silver/bronze counts in each sport.
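
To make the MultiIndex access patterns concrete, here is a small sketch on made-up rows (the medal rows are hypothetical). It shows `.xs()` on either index level, and `.loc` with a tuple to reach a single cell.

```python
import pandas as pd

# Hypothetical medal rows
toy = pd.DataFrame({
    "NOC": ["USA", "USA", "USA", "KEN"],
    "Sport": ["Swimming", "Swimming", "Athletics", "Athletics"],
    "Medal": ["Gold", "Silver", "Gold", "Gold"],
})

# Rows indexed by (NOC, Sport); one column per medal type
hier = toy.groupby(["NOC", "Sport", "Medal"]).size().unstack(fill_value=0)

# Cross-section on the outer level: all sports for one country
usa = hier.xs("USA", level="NOC")

# Cross-section on the inner level: all countries for one sport
athletics = hier.xs("Athletics", level="Sport")

# A single cell: the tuple selects the row, the label selects the column
usa_swimming_gold = hier.loc[("USA", "Swimming"), "Gold"]

print(hier)
print(usa)
```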

## Summary

This demonstration covered the complete data analysis workflow:

### Part 1: Setup and data loading
- Import libraries
- Load CSV data
- Initial exploration with `.head()`, `.info()`, `.describe()`

### Part 2: Data cleaning
- Identify missing values
- Convert data types (year to datetime)
- Remove duplicates

### Part 3: Data wrangling
- Create age groups with `pd.cut()`
- Handle name fields
- Extract temporal features (century)

### Part 4: Data analysis
- Average age by category/sport
- Top regions/countries by awards/medals
- Most decorated individuals
- Gender analysis

### Part 5: Advanced techniques
- Pivot tables for multi-dimensional analysis
- Trend analysis over time
- MultiIndex for hierarchical data

## Next steps for your Olympics project

1. **Adapt the code patterns**: Replace employee sales columns with Olympics columns using the mapping table
2. **Customize analyses**: The Olympics dataset has 120 years of data - explore historical trends
3. **Add visualizations**: Create plots to illustrate your findings
4. **Explore stretch goals**: Use Week 05 advanced techniques (rolling averages, method chaining, etc.)
5. **Document your findings**: Explain what each analysis reveals about Olympic history

## Column mapping quick reference

When adapting code from this demonstration:

```python
# Replace employee sales columns with Olympics columns:
column_map = {
    "employee_id": "ID",
    "name": "Name",
    "gender": "Sex",
    "age": "Age",
    "height_cm": "Height",
    "weight_kg": "Weight",
    "team": "Team",
    "region": "NOC",
    "quarter": "Games",
    "year": "Year",
    "half": "Season",
    "office": "City",
    "product_category": "Sport",
    "product": "Event",
    "award": "Medal",
}
```
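
As an alternative to editing every code cell, you could rename the Olympics columns once so that the demonstration code runs unchanged. This is only a sketch of that idea on a one-row made-up frame; the dictionary name `olympics_to_demo` is hypothetical and covers just a subset of the columns.

```python
import pandas as pd

# Hypothetical mapping in the reverse direction: Olympics name -> demo name
olympics_to_demo = {
    "ID": "employee_id",
    "Name": "name",
    "Sex": "gender",
    "Age": "age",
    "NOC": "region",
    "Sport": "product_category",
    "Medal": "award",
}

# Made-up single row with Olympics-style column names
toy = pd.DataFrame({
    "ID": [1], "Name": ["A"], "Sex": ["F"], "Age": [23.0],
    "NOC": ["USA"], "Sport": ["Swimming"], "Medal": ["Gold"],
})

# rename() returns a new frame; columns not in the mapping are left as-is
renamed = toy.rename(columns=olympics_to_demo)

print(renamed.columns.tolist())
```

Keeping the original Olympics names and adapting the code is usually clearer in a final report, so treat the rename as a shortcut for quick experiments.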

Good luck with your Olympics group project!