# Phase II: Data Curation, Exploratory Analysis and Plotting (5\%)

### Group Members: Lauren Cummings, Riyana Roy, Satvik Repaka, Justin Huang

#### Due (Each Group): October 25

Each **project group** will submit a single **jupyter notebook** which contains:

1. (1\%) Expresses the central motivation of the project and explains the (at least) two key questions to be explored. Gives a summary of the data processing pipeline so a technical expert can easily follow along.
2. (2\%) Obtains, cleans, and merges all data sources involved in the project.
3. (2\%) Builds at least two visualizations (graphs/plots) from the data which help to understand or answer the questions of interest. These visualizations will be graded based on how much information they can effectively communicate to readers. Please make sure your visualizations are sufficiently distinct from each other.

# Project Details

### **Central Motivation**  
This project examines how tire strategy and weather conditions influence race outcomes. Analyzing these factors together should show how teams optimize performance and make decisions under changing conditions.

---

### **Key Questions:**  
1. How does tire strategy impact driver performance and race outcomes under varying weather conditions? 
2. How do changing weather conditions influence pit stop timing and tire choices throughout the race?

---

### **Data Processing Overview:**  

1. **Data Ingestion**: Collect data from race sessions, including event logs, telemetry, and weather data.  

2. **Preprocessing**:  
   - **Laps Data:**  
     - Fill missing pit stop times.  
     - Remove deleted or inaccurate laps.  
     - Map track statuses to meaningful labels (e.g., Safety Car, Yellow Flag).  

   - **Telemetry Data:**    
     - Remove outliers in speed, RPM, and gears.  

   - **Weather Data:**  
     - Align weather timestamps with lap data for consistency.  
     - Filter out unrealistic wind speed values.  
     - Group consecutive rainfall events into a single rain period.  
     - Fill missing temperature and pressure values using forward filling.


3. **Exploratory Data Analysis (EDA)**: Visualize strategy patterns and weather impacts.  
4. **Modeling/Analysis**: Apply statistical methods to find relationships between tires, weather, and strategy.  
5. **Reporting**: Write a report that answers the key questions.  



# Imported Libraries

In [3]:
import os
import fastf1
import logging
import pandas as pd

# Data Retrieval

In [5]:
# Create the cache directory if it doesn't exist
os.makedirs('cache', exist_ok=True)

# Enable cache 
fastf1.Cache.enable_cache('cache')

# Suppress INFO and WARNING messages
logging.getLogger('fastf1').setLevel(logging.ERROR)

# Load session data
session = fastf1.get_session(2021, 'Imola', 'R')
session.load()

# Extract data
laps = session.laps
print("Laps Data:")
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(laps.head())  # Print all columns for the first few rows

fastest_lap = laps.pick_fastest()
telemetry = fastest_lap.get_telemetry()
print("\nTelemetry Data:")
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(telemetry.head())  # Print all columns for the first few rows

weather = session.weather_data
print("\nWeather Data:")
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(weather.head())  # Print all columns for the first few rows


Laps Data:
                    Time Driver DriverNumber                LapTime  \
0 0 days 00:35:09.853000    GAS           10 0 days 00:01:54.003000   
1 0 days 00:37:32.883000    GAS           10 0 days 00:02:23.030000   
2 0 days 00:39:53.731000    GAS           10 0 days 00:02:20.848000   
3 0 days 00:42:18.428000    GAS           10 0 days 00:02:24.697000   
4 0 days 00:44:42.360000    GAS           10 0 days 00:02:23.932000   

   LapNumber  Stint PitOutTime PitInTime            Sector1Time  \
0        1.0    1.0        NaT       NaT                    NaT   
1        2.0    1.0        NaT       NaT 0 days 00:00:43.289000   
2        3.0    1.0        NaT       NaT 0 days 00:00:42.977000   
3        4.0    1.0        NaT       NaT 0 days 00:00:42.573000   
4        5.0    1.0        NaT       NaT 0 days 00:00:41.394000   

             Sector2Time            Sector3Time     Sector1SessionTime  \
0 0 days 00:00:37.569000 0 days 00:00:36.794000                    NaT   
1 0 days 00

# Data Cleaning

In [8]:
# convert times
laps['LapTime'] = pd.to_timedelta(laps['LapTime'])
laps['PitOutTime'] = pd.to_timedelta(laps['PitOutTime'])
laps['PitInTime'] = pd.to_timedelta(laps['PitInTime'])
laps['Sector1Time'] = pd.to_timedelta(laps['Sector1Time'])
laps['Sector2Time'] = pd.to_timedelta(laps['Sector2Time'])

# check if they're strings
if laps['LapStartTime'].dtype == 'object':  
    laps['LapStartTime'] = pd.to_datetime(laps['LapStartTime'])
if laps['LapStartDate'].dtype == 'object':  
    laps['LapStartDate'] = pd.to_datetime(laps['LapStartDate'])

In [11]:
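# fill NaN pit times with the column mean (assign back; fillna returns a new Series)
laps['PitOutTime'] = laps['PitOutTime'].fillna(laps['PitOutTime'].mean())
laps['PitInTime'] = laps['PitInTime'].fillna(laps['PitInTime'].mean())
laps['PitInTime']  # display the filled column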

0      0 days 01:28:05.076684210
1      0 days 01:28:05.076684210
2      0 days 01:28:05.076684210
3      0 days 01:28:05.076684210
4      0 days 01:28:05.076684210
                  ...           
1122   0 days 01:28:05.076684210
1123   0 days 01:28:05.076684210
1124   0 days 01:28:05.076684210
1125   0 days 01:28:05.076684210
1126   0 days 01:28:05.076684210
Name: PitInTime, Length: 1127, dtype: timedelta64[ns]
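
`Series.fillna` returns a new Series rather than modifying the column in place, so the fill only persists when the result is assigned back. The sketch below illustrates this on made-up pit times; the `pit_demo` frame and its values are hypothetical and are not taken from the session.

```python
import pandas as pd

# Hypothetical pit-in times; NaT marks laps without a recorded pit entry
pit_demo = pd.DataFrame({
    "PitInTime": pd.to_timedelta(["00:40:00", None, "01:10:00", None]),
})

# Calling fillna without assignment leaves the column unchanged
pit_demo["PitInTime"].fillna(pit_demo["PitInTime"].mean())
missing_before = int(pit_demo["PitInTime"].isna().sum())

# Assigning the result back makes the fill persist
pit_demo["PitInTime"] = pit_demo["PitInTime"].fillna(pit_demo["PitInTime"].mean())
missing_after = int(pit_demo["PitInTime"].isna().sum())

print(missing_before, missing_after)
```

Because most laps contain no pit stop, a mean-filled pit time no longer distinguishes pit laps from ordinary laps, which is worth keeping in mind when pit timing is analyzed later.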

In [12]:
# remove deleted laps (copy so later column assignments do not warn)
laps = laps[laps['Deleted'] == False].copy()

In [14]:
# adjusted labels
track_status_mapping = {0: 'Green', 1: 'Yellow Flag', 2: 'Red Flag', 3: 'Safety Car', 4: 'Virtual Safety Car'}
laps.loc[:, 'TrackStatus'] = laps['TrackStatus'].map(track_status_mapping)
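
FastF1 describes `TrackStatus` as a string of status digits, where a lap that saw several statuses carries several digits (for example `'12'`), so a dictionary keyed by integers may not match any value. The sketch below is one possible alternative, not the mapping used in this notebook. It assumes the status codes listed in the FastF1 documentation (`'1'` clear, `'2'` yellow, `'4'` Safety Car, `'5'` red flag, `'6'` and `'7'` Virtual Safety Car); the `label_status` helper and the demo values are hypothetical.

```python
import pandas as pd

# Assumed FastF1 status codes (verify against the FastF1 documentation)
STATUS_LABELS = {
    "1": "Green",
    "2": "Yellow Flag",
    "4": "Safety Car",
    "5": "Red Flag",
    "6": "Virtual Safety Car",  # VSC deployed
    "7": "Virtual Safety Car",  # VSC ending
}

def label_status(code):
    """Translate a multi-digit status string into a '/'-joined label."""
    labels = []
    for digit in str(code):
        label = STATUS_LABELS.get(digit, "Unknown")
        if label not in labels:  # avoid repeating a label, e.g. for '67'
            labels.append(label)
    return "/".join(labels)

# Hypothetical status strings, one per lap
status_demo = pd.Series(["1", "12", "4", "67"])
labelled = status_demo.map(label_status)
print(labelled.tolist())
```

Joining the labels keeps laps with mixed statuses distinguishable instead of collapsing them to missing values.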

In [19]:
# convert times to match
telemetry['SessionTime'] = pd.to_timedelta(telemetry['SessionTime'])
if telemetry['Date'].dtype == 'object':  
    telemetry['Date'] = pd.to_datetime(telemetry['Date'])

In [20]:
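# fill missing values with the column mean (assign back; fillna returns a new Series)
telemetry['RPM'] = telemetry['RPM'].fillna(telemetry['RPM'].mean())
telemetry['Speed'] = telemetry['Speed'].fillna(telemetry['Speed'].mean())
telemetry['Speed']  # display the filled column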

2      294
3      295
4      296
5      298
6      298
      ... 
657    278
658    279
659    281
660    282
661    282
Name: Speed, Length: 660, dtype: int64

In [21]:
# removing outliers
telemetry = telemetry[(telemetry['Speed'] < 350) & (telemetry['RPM'] < 15000)]
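
The overview also mentions gears, so the same idea can be extended with a gear bound. The following is a minimal sketch on made-up telemetry, assuming the gear column is named `nGear` as in FastF1 telemetry; the bounds and the `telemetry_demo` values are hypothetical choices, not tuned thresholds.

```python
import pandas as pd

# Hypothetical telemetry samples; rows 2 and 3 contain implausible values
telemetry_demo = pd.DataFrame({
    "Speed": [294, 310, 400, 280],
    "RPM": [11000, 11500, 12000, 16000],
    "nGear": [7, 8, 8, 12],
})

# Assumed plausibility bounds (Series.between is inclusive on both ends)
in_range = (
    telemetry_demo["Speed"].between(0, 350)
    & telemetry_demo["RPM"].between(0, 15000)
    & telemetry_demo["nGear"].between(0, 8)
)

# Keep only rows where every channel is plausible
telemetry_clean = telemetry_demo[in_range].copy()
print(telemetry_clean)
```

Building the mask first makes it easy to count how many samples each rule removes before the rows are dropped.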

In [22]:
# matching times in weather data
weather['Time'] = pd.to_timedelta(weather['Time'])
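
Once both tables use timedeltas, each lap can be paired with the most recent weather sample. The sketch below shows one way to do this with `pd.merge_asof` on made-up data; the `laps_demo` and `weather_align_demo` frames are hypothetical, and matching on `LapStartTime` with a backward search is an assumption about how the alignment should work.

```python
import pandas as pd

# Hypothetical lap start times (session time)
laps_demo = pd.DataFrame({
    "LapNumber": [1, 2, 3],
    "LapStartTime": pd.to_timedelta(["00:35:10", "00:37:33", "00:39:54"]),
})

# Hypothetical weather samples, roughly one per minute
weather_align_demo = pd.DataFrame({
    "Time": pd.to_timedelta(
        ["00:35:00", "00:36:00", "00:37:00", "00:38:00", "00:39:00"]
    ),
    "AirTemp": [9.1, 9.0, 9.2, 9.3, 9.4],
    "Rainfall": [True, True, True, False, False],
})

# For each lap, take the latest weather row at or before the lap start.
# merge_asof requires both inputs to be sorted by their key columns.
laps_weather = pd.merge_asof(
    laps_demo.sort_values("LapStartTime"),
    weather_align_demo.sort_values("Time"),
    left_on="LapStartTime",
    right_on="Time",
    direction="backward",
)
print(laps_weather[["LapNumber", "AirTemp", "Rainfall"]])
```

A backward search avoids attaching weather readings that were recorded after the lap had already started.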

In [25]:
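# fill missing values (assign back; fillna returns a new Series)
weather['AirTemp'] = weather['AirTemp'].fillna(weather['AirTemp'].mean())
weather['Humidity'] = weather['Humidity'].fillna(weather['Humidity'].mean())
weather['Pressure'] = weather['Pressure'].fillna(weather['Pressure'].mean())
weather['Rainfall'] = weather['Rainfall'].fillna(False)
weather['Rainfall']  # display the filled column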

0       True
1       True
2       True
3       True
4       True
       ...  
156    False
157    False
158    False
159    False
160    False
Name: Rainfall, Length: 161, dtype: bool

In [26]:
# consecutive rainfall grouped as one rainfall period
weather['Rainfall'] = weather['Rainfall'].astype(bool)
weather['RainPeriod'] = (weather['Rainfall'] != weather['Rainfall'].shift()).cumsum()
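
The shift-and-cumsum pattern assigns a new ID every time the rainfall flag changes, so dry spells receive their own IDs as well. A small sketch on a made-up flag series shows the resulting labels and how the rainy runs could be isolated; `rain_demo` and the summary column names are hypothetical.

```python
import pandas as pd

# Hypothetical rainfall flags, one per weather sample
rain_demo = pd.Series([True, True, False, False, True, False])

# A new period starts whenever the value differs from the previous row
period = (rain_demo != rain_demo.shift()).cumsum()

# Summarize each run: whether it was rainy and how many samples it spans
summary = (
    pd.DataFrame({"Rainfall": rain_demo, "RainPeriod": period})
    .groupby("RainPeriod")["Rainfall"]
    .agg(is_rain="first", n_samples="size")
)

# Keep only the runs that are actually rain
rain_only = summary[summary["is_rain"].astype(bool)]
print(period.tolist())
print(rain_only)
```

Filtering on the run's flag is needed because the period ID alone does not say whether a run is wet or dry.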

In [27]:
# forward fill for missing vals
weather[['AirTemp', 'TrackTemp', 'Pressure']] = weather[['AirTemp', 'TrackTemp', 'Pressure']].ffill()
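
The overview lists a filter for unrealistic wind speeds. One way it could look is sketched below on made-up data, assuming the weather table has a `WindSpeed` column in m/s as described in the FastF1 documentation; the `wind_demo` frame and the `MAX_WIND_SPEED` bound are hypothetical.

```python
import pandas as pd

# Hypothetical weather samples; rows 1 and 2 hold implausible wind speeds
wind_demo = pd.DataFrame({
    "Time": pd.to_timedelta(["00:00:30", "00:01:30", "00:02:30", "00:03:30"]),
    "WindSpeed": [1.2, -0.5, 85.0, 2.4],
})

MAX_WIND_SPEED = 40.0  # assumed upper bound in m/s, chosen for illustration

# Keep non-negative readings up to the assumed maximum
valid_wind = wind_demo["WindSpeed"].between(0, MAX_WIND_SPEED)
wind_filtered = wind_demo[valid_wind].reset_index(drop=True)
print(wind_filtered)
```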

In [28]:
print("Cleaned Laps Data:")
print(laps.head())
print("\nCleaned Telemetry Data:")
print(telemetry.head())
print("\nCleaned Weather Data:")
print(weather.head())

Cleaned Laps Data:
                    Time Driver DriverNumber                LapTime  \
0 0 days 00:35:09.853000    GAS           10 0 days 00:01:54.003000   
1 0 days 00:37:32.883000    GAS           10 0 days 00:02:23.030000   
2 0 days 00:39:53.731000    GAS           10 0 days 00:02:20.848000   
3 0 days 00:42:18.428000    GAS           10 0 days 00:02:24.697000   
4 0 days 00:44:42.360000    GAS           10 0 days 00:02:23.932000   

   LapNumber  Stint                PitOutTime                 PitInTime  \
0        1.0    1.0 0 days 01:33:59.584228070 0 days 01:28:05.076684210   
1        2.0    1.0 0 days 01:33:59.584228070 0 days 01:28:05.076684210   
2        3.0    1.0 0 days 01:33:59.584228070 0 days 01:28:05.076684210   
3        4.0    1.0 0 days 01:33:59.584228070 0 days 01:28:05.076684210   
4        5.0    1.0 0 days 01:33:59.584228070 0 days 01:28:05.076684210   

             Sector1Time            Sector2Time  ... FreshTyre        Team  \
0                    NaT 