# 🧪 Exploratory Data Analysis (EDA) — Daniel Ricciardo Dataset

Welcome! In this guided notebook, we'll practice **Exploratory Data Analysis (EDA)** using a dataset about Daniel Ricciardo's Formula 1 career entries.  
Our goals today are to:
- Understand the dataset structure (columns, types, missing values)
- Clean and standardize the data
- Explore **performance-related proxies** (e.g., teams, engines, event types) across **years**
- Practice **reading plots** and turning outputs into **insights**

> **Teaching note:** This notebook uses a *classroom tone* and explains **why** each step matters before we write any code.

## 1. Setup
We'll import the core libraries and load the dataset.  
If a cell raises an error, read the message carefully; it often tells you exactly what's wrong.

In [None]:
# Always run this cell first
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Make charts a bit bigger for teaching
plt.rcParams['figure.figsize'] = (9,5)

# Load data (assumes the CSV sits in the same folder as this notebook; adjust the path if yours differs)
df = pd.read_csv("daniel_ricciardo.csv")
print("Rows:", len(df))
print("Columns:", list(df.columns))
df.head(5)
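
If the load step fails, the most common cause is a wrong path. The sketch below shows one way to fail early with a readable message. `load_csv` is a hypothetical helper, not part of pandas, and the demo file contains made-up rows rather than the real dataset.

```python
from pathlib import Path
import tempfile

import pandas as pd


def load_csv(path):
    """Read a CSV, failing early with a readable message if the file is missing."""
    path = Path(path)
    if not path.exists():
        raise FileNotFoundError(
            f"Could not find {path.name!r} in {path.parent.resolve()}. "
            "Check the file name and your working directory."
        )
    return pd.read_csv(path)


# Demo with a throwaway file (made-up rows, not the real dataset)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.csv"
    demo_path.write_text("year,team\n2011,Team A\n2012,Team B\n")
    demo = load_csv(demo_path)

print("Rows:", len(demo))
print("Columns:", list(demo.columns))
```

Building the path with `pathlib.Path` keeps the code portable across Windows, macOS, and Linux.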

## 2. First Look at the Data
Before analysis, we want to know:
- What **columns** exist and what do they mean?
- What are the **data types**?
- Where are there **missing values**?

In [None]:
# Structure overview
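df.info()  # prints its report directly and returns None, so display() is not needed
display(df.describe(include='all'))  # the datetime_is_numeric argument was removed in pandas 2.0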
display(df.isna().sum().sort_values(ascending=False))

**Teacher’s interpretation:**  
- `info()` shows types; note that several columns are `object` (categorical text).  
- `describe()` mixes numeric and categorical summaries because we passed `include='all'`.  
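- The missing-value table shows where we’ll need to be careful downstream. It counts only true `NaN` values, so text placeholders may be left out (e.g., `grid_number` looks text-like because missing entries are stored as the string "na").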
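
Here is a small sketch of how to count text placeholders that `isna()` does not report. The rows are made up for illustration, and both `count_placeholders` and the list of placeholder spellings are assumptions you should adapt to what you actually see in the data.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Made-up rows for illustration only
toy = pd.DataFrame({
    "grid_number": ["3", "na", "12", None, " NA "],
    "team": ["Team A", "Team B", "na", "Team A", "Team B"],
})

PLACEHOLDERS = ["na", "n/a", "none", "null", "-", ""]  # assumed spellings


def count_placeholders(df, placeholders=PLACEHOLDERS):
    """Count text cells per column that look like missing-value markers."""
    counts = {}
    for col in df.columns:
        if is_numeric_dtype(df[col]):
            continue  # numeric columns cannot hold text placeholders
        cleaned = df[col].dropna().astype(str).str.strip().str.lower()
        counts[col] = int(cleaned.isin(placeholders).sum())
    return pd.Series(counts, dtype="int64")


true_missing = toy.isna().sum()
hidden_missing = count_placeholders(toy)
print(pd.DataFrame({"isna": true_missing, "placeholder_text": hidden_missing}))
```

Comparing the two columns of the printed table tells you how much missingness is hiding behind text values.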

## 3. Data Cleaning (Light)
We'll do **minimal, safe** cleaning to standardize types and values so plots don't break:
- Convert `year` and `race_number` to integers (nullable)  
- Coerce `grid_number` to numeric (set non-numeric to NaN)  
- Trim whitespace in text columns

In [None]:
# Cleaning
df_clean = df.copy()

# Convert to nullable integer where appropriate
for col in ['year','race_number']:
    df_clean[col] = pd.to_numeric(df_clean[col], errors='coerce').astype('Int64')

# Grid number to numeric (many are 'na')
df_clean['grid_number'] = pd.to_numeric(df_clean['grid_number'], errors='coerce').astype('Int64')

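# Standardize text: strip whitespace, leaving missing values as NaN
# (a plain astype(str) would turn NaN into the literal string 'nan')
text_cols = df_clean.select_dtypes(include='object').columns
for c in text_cols:
    s = df_clean[c]
    df_clean[c] = s.where(s.isna(), s.astype(str).str.strip())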

df_clean.dtypes


**Why this matters:** Consistent types prevent subtle bugs (e.g., plotting functions treat strings differently from numbers).  
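
To see the difference concretely, here is a tiny sketch with made-up grid slots stored as text. It only illustrates the idea and does not use the real dataset.

```python
import pandas as pd

# Made-up grid slots stored as text, with one placeholder
raw = pd.Series(["10", "9", "2", "na"])

# As text, values sort character by character, so "10" comes before "2"
text_order = sorted(raw[raw != "na"])

# As numbers, they sort by magnitude, and "na" becomes a proper missing value
numeric = pd.to_numeric(raw, errors="coerce").astype("Int64")
number_order = numeric.dropna().sort_values().tolist()

print("text order:  ", text_order)
print("number order:", number_order)
print("missing after coercion:", int(numeric.isna().sum()))
```

The same mismatch affects axis ordering in plots, which is why we coerce types before charting.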


## 4. What Columns Do We Have?
Let's quickly profile each column: unique counts and a few example values. This helps us form **questions** for analysis.

In [None]:

# Unique counts
display(df_clean.nunique().sort_values(ascending=False))

# Peek at sample values per column
for c in df_clean.columns:
    uniques = df_clean[c].dropna().unique()
    sample_vals = [str(v) for v in uniques[:8]]
    print(f"- {c}: {len(uniques)} uniques → {sample_vals}")
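
If you prefer a single table over printed lines, the profile can be collected into a DataFrame. This is a minimal sketch on made-up rows, and `profile_columns` is a hypothetical helper name.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "year": [2011, 2011, 2012, None],
    "team": ["Team A", "Team A", "Team B", "Team B"],
})


def profile_columns(df):
    """Return one row per column: dtype, unique count, missing count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_unique": df.nunique(),        # NaN is excluded by default
        "n_missing": df.isna().sum(),
    }).sort_values("n_unique", ascending=False)


profile = profile_columns(toy)
print(profile)
```

Columns with very few uniques are good candidates for bar charts, while columns with many uniques usually need grouping first.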


## 5. Univariate Analysis
We'll start with **one variable at a time**. For categorical columns, bar charts are great. For numeric columns, histograms or boxplots help.
We'll look at:
- `year` (activity over time)  
- `team` (which teams appear most)  
- `event` (types of entries, e.g., race, third driver, DNFs)

In [None]:
# Counts by year
year_counts = df_clean['year'].value_counts(dropna=False).sort_index()
year_counts.plot(kind='bar')
plt.title('Entries by Year')
plt.xlabel('Year')
plt.ylabel('Count')
plt.show()

In [None]:
# Top teams
team_counts = df_clean['team'].value_counts().head(10)
team_counts.plot(kind='barh')
plt.title('Top 10 Teams by Entry Count')
plt.xlabel('Count'); plt.ylabel('Team')
plt.gca().invert_yaxis()
plt.show()

In [None]:
# Event types (top 12 + 'Other')
event_counts = df_clean['event'].value_counts()
top_n = 12
top_events = event_counts.head(top_n)
other = pd.Series({'Other': event_counts.iloc[top_n:].sum()})
event_plot = pd.concat([top_events, other])
event_plot.plot(kind='bar')
plt.title('Event Types (Top 12 + Other)')
plt.ylabel('Count'); plt.xlabel('Event')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

**Reading the plots:**  
- **Year** shows when Ricciardo appears in the record (including early-career third driver roles).  
- **Teams** reveal team tenure and transitions.  
- **Event types** indicate activity mix (practice/third driver vs race-related entries).
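
The "top N plus Other" grouping used for the event chart can be wrapped in a reusable function. The sketch below is one possible version: `top_n_with_other` is a hypothetical helper, the event labels are made up, and it skips the "Other" bar when nothing is left over.

```python
import pandas as pd


def top_n_with_other(series, n=12, other_label="Other"):
    """Value counts for the n most common categories, plus a pooled remainder."""
    counts = series.value_counts()
    top = counts.head(n)
    remainder = int(counts.iloc[n:].sum())
    if remainder == 0:
        return top  # nothing left to pool, so no 'Other' entry
    return pd.concat([top, pd.Series({other_label: remainder})])


# Made-up event labels for illustration only
events = pd.Series(
    ["Race", "Race", "Race", "Third driver", "Third driver", "Test", "Demo"]
)
print(top_n_with_other(events, n=2))
print(top_n_with_other(events, n=10))
```

Pooling rare categories keeps the x-axis readable, but remember that "Other" hides detail you may want to inspect separately.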

## 6. Bivariate Relationships (Two Variables)
We'll examine how categories change **over time** and how **teams** intersect with **events**.

In [None]:
# Year x Team heatmap
ct_year_team = pd.crosstab(df_clean['year'], df_clean['team'])
plt.imshow(ct_year_team, aspect='auto')
plt.title('Entries by Year x Team'); plt.xlabel('Team'); plt.ylabel('Year')
plt.colorbar(label='Count')
plt.xticks(ticks=range(len(ct_year_team.columns)), labels=ct_year_team.columns, rotation=90)
plt.yticks(ticks=range(len(ct_year_team.index)), labels=ct_year_team.index)
plt.tight_layout(); plt.show()

In [None]:
# Event mix over time (row-normalized)
ct_year_event = pd.crosstab(df_clean['year'], df_clean['event'], normalize='index')
plt.imshow(ct_year_event, aspect='auto')
plt.title('Event Mix by Year (Proportion)'); plt.xlabel('Event'); plt.ylabel('Year')
plt.colorbar(label='Proportion')
plt.xticks(ticks=range(len(ct_year_event.columns)), labels=ct_year_event.columns, rotation=90)
plt.yticks(ticks=range(len(ct_year_event.index)), labels=ct_year_event.index)
plt.tight_layout(); plt.show()

**Interpretation tips:**  
- A team's entries concentrated in consecutive years mark its **tenure**; a shift from one team column to another marks a change of team and a new **career phase**.  
- A higher share of non-race events in early years likely reflects **development/third driver** roles.
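
Heatmaps are easier to defend when you back them with numbers. As a rough sketch on made-up rows, the crosstab can be summarised into the most frequent team per year and its share of that year's entries.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "year": [2011, 2011, 2011, 2012, 2012, 2013],
    "team": ["Team A", "Team A", "Team B", "Team B", "Team B", "Team C"],
})

counts = pd.crosstab(toy["year"], toy["team"])

summary = pd.DataFrame({
    "top_team": counts.idxmax(axis=1),                 # column with the largest count per year
    "share": counts.max(axis=1) / counts.sum(axis=1),  # that team's share of the year's entries
})
print(summary)
```

A year where `share` is well below 1 is a hint that more than one team appears, which is worth checking against the heatmap.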

## 7. Performance Proxies & Dataset Limitations
We intended to focus on **performance metrics** (positions, points, finishes).  
**However, this dataset does not contain columns like `position`, `points`, or `status` typically used for race results.**

We'll do two things:
1) Use **proxies** (e.g., entries by year/team/event) to describe career activity.  
2) Learn how to check for **missing columns** and plan a **merge** if we later obtain a race-results table.
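
For the second point, a simple check against an explicit list of required columns could look like the sketch below. `missing_columns` is a hypothetical helper, and the empty frame only mimics a few column names for illustration.

```python
import pandas as pd


def missing_columns(df, required):
    """Return the required column names that df does not have, in order."""
    present = set(df.columns)
    return [col for col in required if col not in present]


# Empty frame with illustrative column names only
toy = pd.DataFrame(columns=["year", "team", "event", "grand_prix"])
needed = ["year", "grand_prix", "position", "points"]
print("Missing:", missing_columns(toy, needed))
```

Running a check like this at the top of an analysis documents the gap explicitly instead of letting a `KeyError` surface later.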

In [None]:
# Confirm absence of typical performance columns
perf_like = [c for c in df_clean.columns if any(k in c.lower() for k in ['position','points','status','result','finish'])]
print("Performance-like columns found:", perf_like)

# Define a very rough proxy for "race context":
# label entries that are NOT clearly 'Third driver' as potentially race/weekend related
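# na=False treats missing event labels as "not third driver", so the ~ negation never hits NaN
df_clean['is_race_context'] = ~df_clean['event'].str.lower().str.contains('third driver', na=False)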
df_clean['is_race_context'].value_counts(dropna=False)


**Teaching moment:** It's okay to hit limitations! Good analysts **document gaps** and propose next steps (e.g., *"We need to join with race results to study points/finishes"*).  


## 8. Try It Yourself ✍️
1. Plot the **engine types** over years (counts or proportions). What do you notice?  
2. Identify the **top 5 circuits (grand_prix)** by entries. Do any coincide with team changes?  
3. If you had a `race_results.csv` with `year`, `grand_prix`, `position`, and `points`, how would you **merge** it with `df_clean`? *(Write the join code.)*

In [None]:
# Engine types by year (stacked proportions)
ct_year_engine = pd.crosstab(df_clean['year'], df_clean['engine_type'], normalize='index')
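ct_year_engine.plot(kind='bar', stacked=True)  # keep the legend so each engine type can be identified
plt.legend(title='Engine type', bbox_to_anchor=(1.02, 1), loc='upper left')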
plt.title('Engine Types by Year (Proportion)'); plt.ylabel('Proportion'); plt.xlabel('Year')
plt.tight_layout(); plt.show()
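
It helps to see what `normalize='index'` does before reading a proportion chart. The sketch below uses made-up rows and compares raw counts with row-normalized shares.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "year": [2011, 2011, 2011, 2011, 2012, 2012],
    "engine_type": ["Engine X", "Engine X", "Engine X", "Engine Y", "Engine Y", "Engine Y"],
})

counts = pd.crosstab(toy["year"], toy["engine_type"])
shares = pd.crosstab(toy["year"], toy["engine_type"], normalize="index")

print(counts)
print(shares)
print(shares.sum(axis=1))  # each year's shares add up to 1
```

Because every row is rescaled to 1, proportions hide how many entries a year has, so it is worth reading them alongside the counts.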

In [None]:
# Top 5 circuits by entries
top_gp = df_clean['grand_prix'].value_counts().head(5)
top_gp.plot(kind='bar')
plt.title('Top 5 Grand Prix by Entry Count'); plt.ylabel('Count'); plt.xlabel('Grand Prix')
plt.xticks(rotation=45, ha='right'); plt.tight_layout(); plt.show()

In [None]:
# Hypothetical merge pattern (no file provided)
# This demonstrates the code shape students should write.
# results = pd.read_csv('race_results.csv')
# merged = df_clean.merge(results, on=['year','grand_prix'], how='left')
# display(merged.head())
print("See commented example above for the standard left join on ['year','grand_prix'].")
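
Since no results file is provided, here is a runnable sketch of the same left join on made-up tables. The rows, and the assumption that a results table has one row per `year` and `grand_prix`, are illustrative only. `validate` and `indicator` are standard `DataFrame.merge` options that make the join easier to audit.

```python
import pandas as pd

# Made-up rows for illustration only; a real results table may differ
entries = pd.DataFrame({
    "year": pd.array([2011, 2011, 2012], dtype="Int64"),
    "grand_prix": ["GP One", "GP One", "GP Two"],
    "event": ["Third driver", "Race", "Race"],
})
results = pd.DataFrame({
    "year": [2011],
    "grand_prix": ["GP One"],
    "position": [19],
    "points": [0],
})

# Align key dtypes before joining
results["year"] = results["year"].astype("Int64")

merged = entries.merge(
    results,
    on=["year", "grand_prix"],
    how="left",
    validate="many_to_one",  # raises MergeError if results has duplicate keys
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)
print(merged)
print(merged["_merge"].value_counts())
```

Note that the result is attached to every entry for that weekend, including the "Third driver" row, so you would likely filter by `event` before analysing positions or points.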

## 9. Wrap-Up: What Did We Learn?
- We practiced a **clean → explore → interpret** EDA loop.  
- We read bar charts and heatmaps to describe **how** activity changed by **year**, **team**, and **event**.  
- We identified **dataset gaps** (no positions/points), and proposed a **merge plan** for future work.

> **Exit ticket:** Write 2–3 bullet points about Ricciardo’s career phases that you can defend using the visuals above.