# 🧪 Exploratory Data Analysis (EDA) — Daniel Ricciardo Dataset

Welcome! In this guided notebook, we'll practice **Exploratory Data Analysis (EDA)** using a dataset about Daniel Ricciardo's Formula 1 career entries.  
Our goals today are to:
- Understand the dataset structure (columns, types, missing values)
- Clean and standardize the data
- Explore **performance-related proxies** (e.g., teams, engines, event types) across **years**
- Practice **reading plots** and turning outputs into **insights**

> **Teaching note:** This notebook uses a *classroom tone* and explains **why** each step matters before we write any code.

## 1. Setup
We'll import the core libraries and load the dataset.  
If a cell raises an error, read the message carefully; it often tells you exactly what's wrong.

In [None]:
# Always run this cell first
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Make charts a bit bigger for teaching
plt.rcParams['figure.figsize'] = (9,5)

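# Load data. The r"..." prefix makes this a raw string, so the backslashes in the
# Windows path are not read as escape sequences. Point the path at your own copy of the CSV.
df = pd.read_csv(r"C:\Users\ellyh\Documents\UOW\DAC\NEW SYLLABUS\DACSIM\DAC-003_EDA\Session 2\daniel_ricciardo.csv")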
print("Rows:", len(df))
print("Columns:", list(df.columns))
df.head(5)

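
A `unicodeescape` `SyntaxError` on a Windows path usually means the path was written as an ordinary string, so Python tried to read `\U` in `C:\Users` as the start of a Unicode escape. The sketch below shows two common ways around it. The folder shown is a made-up placeholder, not the real location of the dataset.

```python
from pathlib import PureWindowsPath

# Hypothetical location; substitute wherever your copy of the CSV lives.
raw_path = r"C:\Users\student\data\daniel_ricciardo.csv"   # raw string: backslashes stay literal
slash_path = "C:/Users/student/data/daniel_ricciardo.csv"  # forward slashes also work on Windows

# Both spellings describe the same Windows path.
same_file = PureWindowsPath(raw_path) == PureWindowsPath(slash_path)
print(same_file)
```

Either form can be passed straight to `pd.read_csv`.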

## 2. First Look at the Data
Before analysis, we want to know:
- What **columns** exist and what do they mean?
- What are the **data types**?
- Where are there **missing values**?

In [None]:

# TODO: Show info about the dataframe structure and basic summary stats
# HINTS:
# - df.info()
# - df.describe(include='all')   # datetime_is_numeric was removed in pandas 2.0
# - df.isna().sum().sort_values(ascending=False)
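
To see what these three calls return before running them on the real file, here is a minimal sketch on a tiny invented table. The rows are made up for illustration and are not taken from `daniel_ricciardo.csv`; only the column names mirror the ones this notebook mentions.

```python
import pandas as pd

# Invented toy rows (NOT real data) with a couple of deliberate gaps.
toy = pd.DataFrame({
    "year": [2011, 2011, 2012, None],
    "team": ["Team A", "Team A", "Team B", "Team B"],
    "grid_number": ["12", "DNQ", None, "7"],
})

toy.info()                               # column names, dtypes, non-null counts
summary = toy.describe(include="all")    # numeric and text columns in one table
missing = toy.isna().sum().sort_values(ascending=False)

print(summary)
print(missing)
```

Notice that `grid_number` is reported as a text column because of the non-numeric entry, which is exactly the kind of thing the cleaning step has to handle.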



## 3. Data Cleaning (Light)
We'll do **minimal, safe** cleaning to standardize types and values so plots don't break:
- Convert `year` and `race_number` to integers (nullable)  
- Coerce `grid_number` to numeric (non-numeric entries become missing values)  
- Trim whitespace and standardize the case for text columns

In [None]:

# TODO: Copy the template below and complete each step.

# 1) Work on a copy so the original df stays untouched
# df_clean = df.copy()

# 2) Convert types (nullable integer)
# for col in ['year','race_number']:
#     df_clean[col] = pd.to_numeric(df_clean[col], errors='coerce').astype('Int64')

# 3) Coerce grid_number to numeric
# df_clean['grid_number'] = pd.to_numeric(df_clean['grid_number'], errors='coerce').astype('Int64')

# 4) Standardize text columns: strip spaces and title-case
#    (no .astype(str) here: it would turn missing values into the literal text 'nan')
# text_cols = df_clean.select_dtypes(include='object').columns
# for c in text_cols:
#     df_clean[c] = df_clean[c].str.strip().str.title()

# 5) Simple sanity check
# df_clean.dtypes
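
The cleaning steps can be tried on a small invented table first, so you can see exactly what each one changes. This is a sketch under assumptions: the rows are made up, and title-casing is only one possible way to "standardize the case" (it would turn an all-caps abbreviation into mixed case, so check your real values before using it).

```python
import pandas as pd

# Invented toy rows (NOT real data) with messy types and stray spaces.
toy = pd.DataFrame({
    "year": ["2011", "2012", "n/a"],
    "race_number": [1, 2, 3],
    "grid_number": ["12", "PL", "7"],
    "team": ["  team a", "Team A ", None],
})

toy_clean = toy.copy()  # leave the raw table untouched

# Non-numeric text becomes <NA>; 'Int64' (capital I) is pandas' nullable integer type.
for col in ["year", "race_number", "grid_number"]:
    toy_clean[col] = pd.to_numeric(toy_clean[col], errors="coerce").astype("Int64")

# Strip whitespace and title-case; missing values stay missing.
toy_clean["team"] = toy_clean["team"].str.strip().str.title()

print(toy_clean.dtypes)
print(toy_clean)
```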


## 4. What Columns Do We Have?
Let's quickly profile each column: unique counts and a few example values. This helps us form **questions** for analysis.

In [None]:

# TODO: Show unique counts per column and a preview of values
# HINT: Use .nunique() and .dropna().unique()[:10]
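
One possible shape for this profile, shown on invented rows rather than the real dataset: count the distinct values per column, then print a handful of examples for each.

```python
import pandas as pd

# Invented toy rows (NOT real data).
toy = pd.DataFrame({
    "year": [2011, 2011, 2012, 2012],
    "team": ["Team A", "Team A", "Team B", None],
    "event": ["Race", "Third Driver", "Race", "Race"],
})

unique_counts = toy.nunique()  # missing values are not counted as a category

for col in toy.columns:
    examples = toy[col].dropna().unique()[:10]  # up to 10 distinct example values
    print(f"{col}: {unique_counts[col]} unique, e.g. {list(examples)}")
```

Columns with very few unique values are good candidates for bar charts; columns with many are usually better summarized or grouped first.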



## 5. Univariate Analysis
We'll start with **one variable at a time**. For categorical columns, bar charts are great. For numeric columns, histograms or boxplots help.
We'll look at:
- `year` (activity over time)  
- `team` (which teams appear most)  
- `event` (types of entries, such as race, third driver, or DNF)

In [None]:

# TODO: Plot counts by year (bar). Label axes and rotate ticks for readability.
# HINT: value_counts().sort_index()
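
As a rough template for this kind of chart, the sketch below counts entries per year on invented data and draws a labelled bar chart. `sort_index()` matters here: without it the bars would be ordered by height instead of by year.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Invented toy rows (NOT real data).
toy = pd.DataFrame({"year": [2011, 2011, 2012, 2014, 2014, 2014]})

year_counts = toy["year"].value_counts().sort_index()  # chronological order

ax = year_counts.plot(kind="bar")
ax.set_xlabel("Year")
ax.set_ylabel("Number of entries")
ax.set_title("Entries per year (toy data)")
ax.tick_params(axis="x", rotation=45)  # rotate tick labels for readability
plt.tight_layout()
```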



In [None]:

# TODO: Top teams by count (horizontal bar for readability)
# HINT: value_counts().head(10)
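
A horizontal bar chart keeps long team names readable. The sketch below uses placeholder team names on invented data; the extra `sort_values()` is a design choice so that the largest bar ends up at the top of the chart.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Invented toy rows (NOT real data); team names are placeholders.
toy = pd.DataFrame({"team": ["Team A"] * 5 + ["Team B"] * 3 + ["Team C"]})

top_teams = toy["team"].value_counts().head(10)  # already sorted, largest first

# barh draws the first row at the bottom, so sort ascending before plotting.
ax = top_teams.sort_values().plot(kind="barh")
ax.set_xlabel("Number of entries")
ax.set_ylabel("Team")
plt.tight_layout()
```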



In [None]:

# TODO: Top event types (bar). Consider grouping rare categories into 'Other'.
# HINT: value_counts(normalize=True)
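
Grouping rare categories can be sketched as follows. The 10% cut-off and the event labels are assumptions chosen for illustration; pick a threshold that suits the real distribution.

```python
import pandas as pd

# Invented toy rows (NOT real data).
toy = pd.DataFrame(
    {"event": ["Race"] * 8 + ["Third Driver"] * 3 + ["Test"] + ["Demo"]}
)

shares = toy["event"].value_counts(normalize=True)  # proportions instead of counts

threshold = 0.10                                    # arbitrary cut-off for "rare"
rare = shares[shares < threshold].index

# Keep common labels as they are; replace rare ones with 'Other'.
event_grouped = toy["event"].where(~toy["event"].isin(rare), "Other")
grouped_shares = event_grouped.value_counts(normalize=True)
print(grouped_shares)
```

`grouped_shares` can then be passed to `.plot(kind="bar")` like any other Series.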



## 6. Bivariate Relationships (Two Variables)
We'll examine how categories change **over time** and how **teams** intersect with **events**.

In [None]:

# TODO: Entries per year by team (crosstab → heatmap-like with imshow or pcolor)
# HINTS:
# - pd.crosstab(df_clean['year'], df_clean['team'])
# - plt.imshow(...); plt.colorbar()
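
Here is a minimal sketch of the crosstab-to-heatmap idea on invented data. `imshow` only sees a grid of numbers, so the tick labels have to be set by hand from the crosstab's index and columns.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Invented toy rows (NOT real data); team names are placeholders.
toy = pd.DataFrame({
    "year": [2011, 2011, 2012, 2012, 2012, 2014],
    "team": ["Team A", "Team A", "Team B", "Team B", "Team B", "Team C"],
})

ctab = pd.crosstab(toy["year"], toy["team"])  # rows = years, columns = teams

fig, ax = plt.subplots()
im = ax.imshow(ctab.values, aspect="auto", cmap="viridis")
ax.set_xticks(range(len(ctab.columns)))
ax.set_xticklabels(ctab.columns, rotation=45, ha="right")
ax.set_yticks(range(len(ctab.index)))
ax.set_yticklabels(ctab.index)
ax.set_xlabel("Team")
ax.set_ylabel("Year")
fig.colorbar(im, ax=ax, label="Entries")
fig.tight_layout()
```

When reading the result, look for blocks of colour that start and stop: they mark the years in which a team appears.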



In [None]:

# TODO: Event mix by year (normalize rows to see proportions)
# HINT: ctab = pd.crosstab(df_clean['year'], df_clean['event'], normalize='index')
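
To see what `normalize='index'` does, compare the sketch below with a plain crosstab: each row is divided by its own total, so every year sums to 1 and years with different numbers of entries become comparable. The rows are invented for illustration.

```python
import pandas as pd

# Invented toy rows (NOT real data).
toy = pd.DataFrame({
    "year": [2011, 2011, 2011, 2011, 2012, 2012],
    "event": ["Third Driver", "Third Driver", "Third Driver", "Race", "Race", "Race"],
})

ctab = pd.crosstab(toy["year"], toy["event"], normalize="index")
print(ctab.round(2))
```

A stacked bar chart (`ctab.plot(kind="bar", stacked=True)`) is one natural way to display these row proportions.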



## 7. Performance Proxies & Dataset Limitations
We intended to focus on **performance metrics** (positions, points, finishes).  
**However, this dataset does not contain columns like `position`, `points`, or `status` typically used for race results.**

We'll do two things:
1) Use **proxies** (e.g., entries by year/team/event) to describe career activity.  
2) Learn how to check for **missing columns** and plan a **merge** if we later obtain a race-results table.

In [None]:

# TODO: Programmatically confirm that 'position' or 'points' columns are missing
# Then create a boolean column 'is_race_context' that flags entries that look race-related
# HINT: any('position' in c.lower() for c in df_clean.columns)
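
One way to approach this check is sketched below on invented data. The rule used for `is_race_context` (the `event` text contains the word "race") is only an assumed heuristic; inspect the real `event` values before settling on a rule.

```python
import pandas as pd

# Invented toy rows (NOT real data).
toy = pd.DataFrame({
    "year": [2011, 2011, 2012],
    "event": ["Third Driver", "Race", None],
})

# Which of the usual race-result columns are absent?
expected = ["position", "points", "status"]
lower_cols = [c.lower() for c in toy.columns]
missing_cols = [name for name in expected if not any(name in c for c in lower_cols)]
print("Missing result columns:", missing_cols)

# Assumed heuristic: an entry is race-related if its event text mentions 'race'.
# na=False makes rows with a missing event count as not race-related.
toy["is_race_context"] = toy["event"].str.contains("race", case=False, na=False)
print(toy)
```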



## 8. Try It Yourself ✍️
1. Plot the **engine types** over years (counts or proportions). What do you notice?  
2. Identify the **top 5 circuits (`grand_prix`)** by entries. Do any coincide with team changes?  
3. If you had a `race_results.csv` with `year`, `grand_prix`, `position`, and `points`, how would you **merge** it with `df_clean`? *(Write the join code.)*

In [None]:

# TODO 1: Engine types by year
# Your code here



In [None]:

# TODO 2: Top 5 circuits by entries
# Your code here



In [None]:

# TODO 3: Hypothetical merge with race_results
# HINT:
# results = pd.read_csv('race_results.csv')
# merged = df_clean.merge(results, on=['year','grand_prix'], how='left')
# merged.head()
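
Before trusting a merge, it helps to check two things: that no rows were duplicated and how many rows found a match. The sketch below does both on two invented tables. The `validate` and `indicator` arguments are standard `DataFrame.merge` options, but every value in the tables is made up, and the real `race_results.csv` is hypothetical.

```python
import pandas as pd

# Invented toy tables (NOT real data); names and numbers are placeholders.
entries = pd.DataFrame({
    "year": [2011, 2011, 2012],
    "grand_prix": ["Grand Prix X", "Grand Prix Y", "Grand Prix X"],
    "team": ["Team A", "Team A", "Team B"],
})
results = pd.DataFrame({
    "year": [2011, 2012],
    "grand_prix": ["Grand Prix X", "Grand Prix X"],
    "position": [10, 9],
    "points": [0, 2],
})

merged = entries.merge(
    results,
    on=["year", "grand_prix"],
    how="left",                # keep every entry, matched or not
    validate="many_to_one",    # raise if results has duplicate (year, grand_prix) keys
    indicator=True,            # adds a _merge column: 'both' or 'left_only'
)

print(merged["_merge"].value_counts())
print(merged)
```

Rows marked `left_only` are entries with no matching result; their `position` and `points` are missing and deserve a closer look (spelling differences in `grand_prix` are a common cause).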



## 9. Wrap-Up: What Did We Learn?
- We practiced a **clean → explore → interpret** EDA loop.  
- We read bar charts and heatmaps to describe **how** activity changed by **year**, **team**, and **event**.  
- We identified **dataset gaps** (no positions/points), and proposed a **merge plan** for future work.

> **Exit ticket:** Write 2–3 bullet points about Ricciardo’s career phases that you can defend using the visuals above.