# 0. F1 Prediction Project - Data Collection
#### Alex Boardman - BrainStation

# Table of Contents
- [1. Introduction](#introduction)
- [2. Import Libraries](#import-libraries)
- [3. Loading Data](#loading-data)
  - [Circuits](#circuits)
  - [Constructor Results](#constructor-results)
  - [Constructor Standings](#constructor-standings)
  - [Constructors](#constructors)
  - [Driver Standings](#driver-standings)
  - [Drivers](#drivers)
  - [Lap Times](#lap-times)
  - [Pit Stops](#pit-stops)
  - [Qualifying](#qualifying)
  - [Races](#races)
  - [Results](#results)
  - [Seasons](#seasons)
  - [Sprint Results](#sprint-results)
  - [Status](#status)
- [4. Summary](#f1-data-for-grand-prix-winner-predictiondataframes-summary)


### Introduction

Welcome to our F1 Win Prediction Project notebook, where we gather and combine Formula 1 data with the aim of predicting Grand Prix winners. This notebook walks through the datasets we collected, ranging from driver and constructor standings to lap times and pit stops, and covering both historical and recent seasons. Later in this project we will apply machine learning models such as Random Forests and Gradient Boosting to capture the interdependencies that influence race outcomes. Our goal is to turn raw data into predictions that could inform team tactics, engage the F1 enthusiast community, and support applications such as betting markets. Throughout, our focus will be on data integrity, careful feature engineering, and rigorous model validation.

Most of our data was gathered from this Kaggle dataset: https://www.kaggle.com/datasets/rohanrao/formula-1-world-championship-1950-2023, so we will dig into what the data source contains.

### Import Libraries

In [2]:
# Import libraries
import pandas as pd
import numpy as np
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

### Loading Data

#### **Circuits**

In [3]:
# Load the circuits data
circuits_df = pd.read_csv('C:/Users/Alex/F1_Capstone/Data/circuits.csv')

In [4]:
circuits_df.head()

Unnamed: 0,circuitId,circuitRef,name,location,country,lat,lng,alt,url
0,1,albert_park,Albert Park Grand Prix Circuit,Melbourne,Australia,-37.8497,144.968,10,http://en.wikipedia.org/wiki/Melbourne_Grand_P...
1,2,sepang,Sepang International Circuit,Kuala Lumpur,Malaysia,2.76083,101.738,18,http://en.wikipedia.org/wiki/Sepang_Internatio...
2,3,bahrain,Bahrain International Circuit,Sakhir,Bahrain,26.0325,50.5106,7,http://en.wikipedia.org/wiki/Bahrain_Internati...
3,4,catalunya,Circuit de Barcelona-Catalunya,Montmeló,Spain,41.57,2.26111,109,http://en.wikipedia.org/wiki/Circuit_de_Barcel...
4,5,istanbul,Istanbul Park,Istanbul,Turkey,40.9517,29.405,130,http://en.wikipedia.org/wiki/Istanbul_Park


In [5]:
circuits_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77 entries, 0 to 76
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   circuitId   77 non-null     int64  
 1   circuitRef  77 non-null     object 
 2   name        77 non-null     object 
 3   location    77 non-null     object 
 4   country     77 non-null     object 
 5   lat         77 non-null     float64
 6   lng         77 non-null     float64
 7   alt         77 non-null     object 
 8   url         77 non-null     object 
dtypes: float64(2), int64(1), object(6)
memory usage: 5.5+ KB


The circuits_df dataframe contains information on Formula 1 circuits. 

There are 77 entries, and each one lists the details of a circuit, including its unique identifier, reference code, name, location, country, geographical coordinates (latitude and longitude), altitude, and a URL to its Wikipedia page. 

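This dataset is useful for exploring the influence of circuit-specific characteristics, such as geographical location and altitude, on race outcomes. Note that alt is stored as object rather than as a numeric type, which suggests it contains some non-numeric placeholder values.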

#### **Constructor Results**

In [6]:
# Load the constructor results data
constructor_results_df = pd.read_csv('C:/Users/Alex/F1_Capstone/Data/constructor_results.csv')

In [7]:
constructor_results_df.head()

Unnamed: 0,constructorResultsId,raceId,constructorId,points,status
0,1,18,1,14.0,\N
1,2,18,2,8.0,\N
2,3,18,3,9.0,\N
3,4,18,4,5.0,\N
4,5,18,5,2.0,\N


In [8]:
constructor_results_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12290 entries, 0 to 12289
Data columns (total 5 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   constructorResultsId  12290 non-null  int64  
 1   raceId                12290 non-null  int64  
 2   constructorId         12290 non-null  int64  
 3   points                12290 non-null  float64
 4   status                12290 non-null  object 
dtypes: float64(1), int64(3), object(1)
memory usage: 480.2+ KB


The constructor_results_df dataframe records the results of Formula 1 constructors (teams) in races. 

The dataframe has 12,290 entries and includes information such as the unique ID for the constructor results, the race ID, the constructor ID, the points awarded, and the status of the result. 

The 'points' field indicates the number of points a constructor earned in a particular race. The 'status' field presumably describes the status of the constructor's entry in that race, but the rows shown contain only the placeholder \N. Because pandas reads \N as an ordinary string, info() reports the column as fully non-null even though these values are effectively missing.

This dataset is valuable for analyzing the performance of F1 teams across races and seasons.
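The \N placeholder seen above recurs in several of the files below. As a minimal sketch of one way to handle it, the example reads a small made-up sample twice: once with default settings and once with \N declared as a missing-value marker. The sample rows (including the "D" status value) are illustrative only and are not taken from the real file.

```python
import io

import pandas as pd

# Made-up sample in the same shape as constructor_results.csv.
# The "D" status value is illustrative only.
SAMPLE_CSV = (
    "constructorResultsId,raceId,constructorId,points,status\n"
    "1,18,1,14.0,\\N\n"
    "2,18,2,8.0,\\N\n"
    "3,18,3,9.0,D\n"
)

# Default read: "\N" is kept as an ordinary string, so nothing counts as missing.
raw = pd.read_csv(io.StringIO(SAMPLE_CSV))
raw_missing = int(raw["status"].isna().sum())

# Declaring "\N" as a missing-value marker turns it into NaN on load.
cleaned = pd.read_csv(io.StringIO(SAMPLE_CSV), na_values=["\\N"])
cleaned_missing = int(cleaned["status"].isna().sum())

print(raw_missing, cleaned_missing)
```

Passing the same na_values argument when loading the real CSV files should make info() report the true number of missing values, although whether that is preferable to cleaning the columns later is a design choice for the next notebook.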

#### **Constructor Standings**

In [9]:
# Load the constructor standings data
constructor_standings_df = pd.read_csv('C:/Users/Alex/F1_Capstone/Data/constructor_standings.csv')

In [10]:
constructor_standings_df.head()

Unnamed: 0,constructorStandingsId,raceId,constructorId,points,position,positionText,wins
0,1,18,1,14.0,1,1,1
1,2,18,2,8.0,3,3,0
2,3,18,3,9.0,2,2,0
3,4,18,4,5.0,4,4,0
4,5,18,5,2.0,5,5,0


In [11]:
constructor_standings_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13051 entries, 0 to 13050
Data columns (total 7 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   constructorStandingsId  13051 non-null  int64  
 1   raceId                  13051 non-null  int64  
 2   constructorId           13051 non-null  int64  
 3   points                  13051 non-null  float64
 4   position                13051 non-null  int64  
 5   positionText            13051 non-null  object 
 6   wins                    13051 non-null  int64  
dtypes: float64(1), int64(5), object(1)
memory usage: 713.9+ KB


The constructor_standings_df offers insights into Formula 1 team performance across various races. This DataFrame, encompassing 13,051 entries, provides details like constructor points, rank, wins, and race information. 

Analyzing this data allows for tracking team performance and standings throughout a season, potentially predicting future success based on accumulated points and wins. 

This information is valuable for assessing team competitiveness and their potential in upcoming races and championships.

#### **Constructors**

In [12]:
# Load the constructors data
constructors_df = pd.read_csv('C:/Users/Alex/F1_Capstone/Data/constructors.csv')

In [13]:
constructors_df.head()

Unnamed: 0,constructorId,constructorRef,name,nationality,url
0,1,mclaren,McLaren,British,http://en.wikipedia.org/wiki/McLaren
1,2,bmw_sauber,BMW Sauber,German,http://en.wikipedia.org/wiki/BMW_Sauber
2,3,williams,Williams,British,http://en.wikipedia.org/wiki/Williams_Grand_Pr...
3,4,renault,Renault,French,http://en.wikipedia.org/wiki/Renault_in_Formul...
4,5,toro_rosso,Toro Rosso,Italian,http://en.wikipedia.org/wiki/Scuderia_Toro_Rosso


In [14]:
constructors_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 211 entries, 0 to 210
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   constructorId   211 non-null    int64 
 1   constructorRef  211 non-null    object
 2   name            211 non-null    object
 3   nationality     211 non-null    object
 4   url             211 non-null    object
dtypes: int64(1), object(4)
memory usage: 8.4+ KB


The constructors_df provides a comprehensive history of Formula 1 teams, with 211 entries covering various constructors. Key information includes unique team identifiers, reference names, official team names, nationalities, and links to their Wikipedia pages. 

This data is valuable for understanding team history, origins, and their connection to other race-related data. The unique IDs serve as a key to link this data with other datasets for performance analysis, while team nationalities could be explored for potential correlations with success in the sport. 

Additionally, the provided URLs offer access to further information about each team.
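To illustrate how the unique IDs link the tables, here is a small sketch that attaches constructor names to constructor results via constructorId. The rows are copied from the head() outputs above. The variable names and the choice of a left join are our own assumptions, not a fixed part of the project.

```python
import pandas as pd

# Rows taken from the head() outputs shown earlier in this notebook.
constructors = pd.DataFrame({
    "constructorId": [1, 2, 3],
    "name": ["McLaren", "BMW Sauber", "Williams"],
})
constructor_results = pd.DataFrame({
    "constructorResultsId": [1, 2, 3],
    "raceId": [18, 18, 18],
    "constructorId": [1, 2, 3],
    "points": [14.0, 8.0, 9.0],
})

# Left join keeps every result row; validate raises an error if a
# constructorId appears more than once in the lookup table.
named_results = constructor_results.merge(
    constructors[["constructorId", "name"]],
    on="constructorId",
    how="left",
    validate="many_to_one",
)

print(named_results[["raceId", "name", "points"]])
```

The validate argument is optional, but it guards against accidentally duplicating rows if the lookup table ever contains repeated IDs.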

#### **Driver Standings**

In [15]:
# Load the driver standings data
driver_standings_df = pd.read_csv('C:/Users/Alex/F1_Capstone/Data/driver_standings.csv')

In [16]:
driver_standings_df.head()

Unnamed: 0,driverStandingsId,raceId,driverId,points,position,positionText,wins
0,1,18,1,10.0,1,1,1
1,2,18,2,8.0,2,2,0
2,3,18,3,6.0,3,3,0
3,4,18,4,5.0,4,4,0
4,5,18,5,4.0,5,5,0


In [17]:
driver_standings_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34124 entries, 0 to 34123
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   driverStandingsId  34124 non-null  int64  
 1   raceId             34124 non-null  int64  
 2   driverId           34124 non-null  int64  
 3   points             34124 non-null  float64
 4   position           34124 non-null  int64  
 5   positionText       34124 non-null  object 
 6   wins               34124 non-null  int64  
dtypes: float64(1), int64(5), object(1)
memory usage: 1.8+ MB


The driver_standings_df DataFrame, containing 34,124 entries, holds information about Formula 1 driver standings across races. 

Each row represents a driver's championship standing as of a specific race, with details like points, position (numerical and textual), race ID, driver ID, and the number of wins. 

This data allows for analyzing individual driver performance and their position changes throughout the season.
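As a rough sketch of how position changes could be tracked, the example below joins standings to the race calendar (to get the round order) and takes the round-to-round difference in position. The standings values are made up for illustration, and the position_change column name is hypothetical.

```python
import pandas as pd

# Made-up example: one driver's standings over three rounds of one season.
races = pd.DataFrame({
    "raceId": [18, 19, 20],
    "year": [2008, 2008, 2008],
    "round": [1, 2, 3],
})
driver_standings = pd.DataFrame({
    "raceId": [20, 18, 19],          # deliberately out of order
    "driverId": [1, 1, 1],
    "points": [14.0, 10.0, 14.0],
    "position": [3, 1, 1],
})

# raceId alone does not guarantee chronological order, so sort by year and round.
standings = (
    driver_standings
    .merge(races, on="raceId", how="left")
    .sort_values(["driverId", "year", "round"])
)

# Positive values mean the driver dropped places since the previous round.
standings["position_change"] = (
    standings.groupby(["driverId", "year"])["position"].diff()
)

print(standings[["round", "position", "position_change"]])
```

The first round of each season has no previous standing, so its change is left as NaN rather than being filled with zero.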

#### **Drivers**

In [18]:
# Load the drivers data
drivers_df = pd.read_csv('C:/Users/Alex/F1_Capstone/Data/drivers.csv')

In [19]:
drivers_df.head()

Unnamed: 0,driverId,driverRef,number,code,forename,surname,dob,nationality,url
0,1,hamilton,44,HAM,Lewis,Hamilton,1985-01-07,British,http://en.wikipedia.org/wiki/Lewis_Hamilton
1,2,heidfeld,\N,HEI,Nick,Heidfeld,1977-05-10,German,http://en.wikipedia.org/wiki/Nick_Heidfeld
2,3,rosberg,6,ROS,Nico,Rosberg,1985-06-27,German,http://en.wikipedia.org/wiki/Nico_Rosberg
3,4,alonso,14,ALO,Fernando,Alonso,1981-07-29,Spanish,http://en.wikipedia.org/wiki/Fernando_Alonso
4,5,kovalainen,\N,KOV,Heikki,Kovalainen,1981-10-19,Finnish,http://en.wikipedia.org/wiki/Heikki_Kovalainen


In [20]:
drivers_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 857 entries, 0 to 856
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   driverId     857 non-null    int64 
 1   driverRef    857 non-null    object
 2   number       857 non-null    object
 3   code         857 non-null    object
 4   forename     857 non-null    object
 5   surname      857 non-null    object
 6   dob          857 non-null    object
 7   nationality  857 non-null    object
 8   url          857 non-null    object
dtypes: int64(1), object(8)
memory usage: 60.4+ KB


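The drivers_df DataFrame stores information about Formula 1 drivers. Each row represents a driver, with details like unique ID, reference name, racing number, three-letter code, name, date of birth, nationality, and a link to their Wikipedia page. Some drivers have the placeholder \N in the number column, which is why it is stored as object.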

This data can be used for various purposes, such as identifying drivers, tracking their careers, or analyzing performance based on factors like nationality (potential home advantage) or age (potential experience or performance correlation).
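Since age is not stored directly, it has to be derived from dob and a race date. The sketch below shows one way to do that, using two drivers from the head() output and the 2009 Australian Grand Prix date from races_df. Dividing by 365.25 is an approximation, and the age_years column name is hypothetical.

```python
import pandas as pd

# Two drivers from the head() output above.
drivers = pd.DataFrame({
    "driverId": [1, 4],
    "surname": ["Hamilton", "Alonso"],
    "dob": ["1985-01-07", "1981-07-29"],
})

# dob is loaded as text, so convert it to a datetime first.
drivers["dob"] = pd.to_datetime(drivers["dob"])

# Date of the 2009 Australian Grand Prix, as listed in races_df.
race_date = pd.Timestamp("2009-03-29")

# Approximate age in years on race day.
drivers["age_years"] = (race_date - drivers["dob"]).dt.days / 365.25

print(drivers[["surname", "age_years"]])
```

In the full dataset the race date would come from joining results to races on raceId, so that each result row gets the driver's age on that particular day.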

#### **Lap Times**

In [21]:
# Load the lap times data
lap_times_df = pd.read_csv('C:/Users/Alex/F1_Capstone/Data/lap_times.csv')

In [22]:
lap_times_df.head()

Unnamed: 0,raceId,driverId,lap,position,time,milliseconds
0,841,20,1,1,1:38.109,98109
1,841,20,2,1,1:33.006,93006
2,841,20,3,1,1:32.713,92713
3,841,20,4,1,1:32.803,92803
4,841,20,5,1,1:32.342,92342


In [23]:
lap_times_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 551742 entries, 0 to 551741
Data columns (total 6 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   raceId        551742 non-null  int64 
 1   driverId      551742 non-null  int64 
 2   lap           551742 non-null  int64 
 3   position      551742 non-null  int64 
 4   time          551742 non-null  object
 5   milliseconds  551742 non-null  int64 
dtypes: int64(5), object(1)
memory usage: 25.3+ MB


The lap_times_df DataFrame holds Formula 1 lap times for individual drivers across races. Each row represents a lap, with information like race ID, driver ID, lap number, position during that lap, and the lap time both as a formatted string (m:ss.mmm) and in milliseconds. 

This dataset, containing over 550,000 entries, can be used to analyze lap performance, compare lap times between drivers and races, and potentially identify factors influencing lap times.
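One possible way to summarise lap performance is to aggregate the milliseconds column per driver per race. The sketch below uses the five laps shown in the head() output. The choice of median and standard deviation as pace and consistency measures is our assumption, and the output column names are hypothetical.

```python
import pandas as pd

# The five laps shown in the head() output above.
lap_times = pd.DataFrame({
    "raceId": [841, 841, 841, 841, 841],
    "driverId": [20, 20, 20, 20, 20],
    "lap": [1, 2, 3, 4, 5],
    "milliseconds": [98109, 93006, 92713, 92803, 92342],
})

# Median is less sensitive than the mean to slow laps (e.g. the opening lap).
pace = (
    lap_times
    .groupby(["raceId", "driverId"])["milliseconds"]
    .agg(median_ms="median", std_ms="std", laps="count")
    .reset_index()
)

print(pace)
```

Because the milliseconds column is already an integer, this avoids parsing the formatted time string altogether.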

#### **Pit Stops**

In [24]:
# Load the pit stops data
pit_stops_df = pd.read_csv('C:/Users/Alex/F1_Capstone/Data/pit_stops.csv')

In [25]:
pit_stops_df.head()

Unnamed: 0,raceId,driverId,stop,lap,time,duration,milliseconds
0,841,153,1,1,17:05:23,26.898,26898
1,841,30,1,1,17:05:52,25.021,25021
2,841,17,1,11,17:20:48,23.426,23426
3,841,4,1,12,17:22:34,23.251,23251
4,841,13,1,13,17:24:10,23.842,23842


In [26]:
pit_stops_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10089 entries, 0 to 10088
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   raceId        10089 non-null  int64 
 1   driverId      10089 non-null  int64 
 2   stop          10089 non-null  int64 
 3   lap           10089 non-null  int64 
 4   time          10089 non-null  object
 5   duration      10089 non-null  object
 6   milliseconds  10089 non-null  int64 
dtypes: int64(5), object(2)
memory usage: 551.9+ KB


The pit_stops_df stores Formula 1 pit stop information for individual drivers across races. Each row represents a single pit stop, providing details like race ID, driver ID, stop number, the lap on which the stop occurred, the time of day, and the duration in both seconds and milliseconds. 

This data, containing over 10,000 entries, can be valuable for analyzing pit stop strategies, comparing durations and identifying areas for improvement, and understanding the impact of pit stops on overall race performance.
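To compare pit stop strategies, one simple starting point is the number of stops and the total time spent in the pits per driver per race. In the sketch below the first and third rows come from the head() output, while the second stop for driver 153 is made up. The n_stops and total_pit_ms names are hypothetical.

```python
import pandas as pd

pit_stops = pd.DataFrame({
    "raceId": [841, 841, 841],
    "driverId": [153, 153, 30],
    "stop": [1, 2, 1],
    "lap": [1, 20, 1],
    "milliseconds": [26898, 24000, 25021],  # the 24000 row is made up
})

# The stop column numbers each driver's stops within a race,
# so its maximum is the number of stops made.
pit_summary = pit_stops.groupby(["raceId", "driverId"], as_index=False).agg(
    n_stops=("stop", "max"),
    total_pit_ms=("milliseconds", "sum"),
)

print(pit_summary)
```

We use milliseconds rather than duration here because duration is stored as object, which suggests that not every value in it is a plain number of seconds.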

#### **Qualifying**

Getting the qualifying time data was the trickiest part, mainly because the Ergast data repository has some gaps and because qualifying rules have changed so much over the years. Since 2006, qualifying has taken place on a Saturday afternoon in a three-stage “knockout” system in which the cars try to set their fastest lap time. In earlier seasons, qualifying consisted of only one or two sessions, which leaves missing values in my dataframe. I decided to consider only the best qualifying time for each driver, regardless of how many qualifying sessions were held in that year. The best qualifying time is reflected in the grid position, so I will later calculate the cumulative difference in times between the first qualified car and the others, hoping that it gives an indication of how much faster one car is than the rest.

In [28]:
# Load the qualifying data
qualifying_df = pd.read_csv('C:/Users/Alex/F1_Capstone/Data/qualifying.csv')

In [29]:
qualifying_df.head()

Unnamed: 0,qualifyId,raceId,driverId,constructorId,number,position,q1,q2,q3
0,1,18,1,1,22,1,1:26.572,1:25.187,1:26.714
1,2,18,9,2,4,2,1:26.103,1:25.315,1:26.869
2,3,18,5,1,23,3,1:25.664,1:25.452,1:27.079
3,4,18,13,6,2,4,1:25.994,1:25.691,1:27.178
4,5,18,2,2,3,5,1:25.960,1:25.518,1:27.236


In [30]:
qualifying_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9815 entries, 0 to 9814
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   qualifyId      9815 non-null   int64 
 1   raceId         9815 non-null   int64 
 2   driverId       9815 non-null   int64 
 3   constructorId  9815 non-null   int64 
 4   number         9815 non-null   int64 
 5   position       9815 non-null   int64 
 6   q1             9807 non-null   object
 7   q2             9651 non-null   object
 8   q3             9488 non-null   object
dtypes: int64(6), object(3)
memory usage: 690.2+ KB


The qualifying_df contains 9,815 records of Formula 1 qualifying results, offering insights into driver and team performance across various races. 

It includes details like driver and team IDs, qualifying positions, and lap times from each session segment (q1, q2, q3). While the lap times are currently stored as text, converting them to a numerical format (e.g., seconds) would be useful for analysis. 

This data is essential for studying qualifying performance, a crucial factor influencing race outcomes. By analyzing qualifying positions and comparing times across sessions, you can gain insights into driver consistency, team strategies, and the potential correlation between qualifying and final race results.
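The best-time approach described at the start of this section can be sketched as follows: convert each session time to seconds, take the minimum across q1 to q3, and measure each driver's gap to the fastest best time in that race. The rows come from the head() output. The helper lap_to_seconds and the columns best_s and gap_to_fastest_s are hypothetical names, and this is only one reading of the planned calculation.

```python
import pandas as pd


def lap_to_seconds(text):
    """Convert 'm:ss.mmm' to seconds; return NaN for missing or placeholder values."""
    if pd.isna(text) or ":" not in str(text):
        return float("nan")
    minutes, seconds = str(text).split(":")
    return int(minutes) * 60 + float(seconds)


# First three rows of the head() output above.
qualifying = pd.DataFrame({
    "raceId": [18, 18, 18],
    "driverId": [1, 9, 5],
    "q1": ["1:26.572", "1:26.103", "1:25.664"],
    "q2": ["1:25.187", "1:25.315", "1:25.452"],
    "q3": ["1:26.714", "1:26.869", "1:27.079"],
})

session_cols = ["q1", "q2", "q3"]
session_seconds = qualifying[session_cols].apply(lambda col: col.map(lap_to_seconds))

# min() skips NaN, so drivers with fewer sessions still get a best time.
qualifying["best_s"] = session_seconds.min(axis=1)

# Gap to the fastest best time within the same race.
qualifying["gap_to_fastest_s"] = (
    qualifying["best_s"] - qualifying.groupby("raceId")["best_s"].transform("min")
)

print(qualifying[["driverId", "best_s", "gap_to_fastest_s"]])
```

Note that in these rows each driver's best time comes from q2 rather than q3, so the best time is not always the lap that set the grid position.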

#### **Races**

In [31]:
# Load the races data
races_df = pd.read_csv('C:/Users/Alex/F1_Capstone/Data/races.csv')

In [32]:
races_df.head()

Unnamed: 0,raceId,year,round,circuitId,name,date,time,url,fp1_date,fp1_time,fp2_date,fp2_time,fp3_date,fp3_time,quali_date,quali_time,sprint_date,sprint_time
0,1,2009,1,1,Australian Grand Prix,2009-03-29,06:00:00,http://en.wikipedia.org/wiki/2009_Australian_G...,\N,\N,\N,\N,\N,\N,\N,\N,\N,\N
1,2,2009,2,2,Malaysian Grand Prix,2009-04-05,09:00:00,http://en.wikipedia.org/wiki/2009_Malaysian_Gr...,\N,\N,\N,\N,\N,\N,\N,\N,\N,\N
2,3,2009,3,17,Chinese Grand Prix,2009-04-19,07:00:00,http://en.wikipedia.org/wiki/2009_Chinese_Gran...,\N,\N,\N,\N,\N,\N,\N,\N,\N,\N
3,4,2009,4,3,Bahrain Grand Prix,2009-04-26,12:00:00,http://en.wikipedia.org/wiki/2009_Bahrain_Gran...,\N,\N,\N,\N,\N,\N,\N,\N,\N,\N
4,5,2009,5,4,Spanish Grand Prix,2009-05-10,12:00:00,http://en.wikipedia.org/wiki/2009_Spanish_Gran...,\N,\N,\N,\N,\N,\N,\N,\N,\N,\N


In [33]:
races_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1101 entries, 0 to 1100
Data columns (total 18 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   raceId       1101 non-null   int64 
 1   year         1101 non-null   int64 
 2   round        1101 non-null   int64 
 3   circuitId    1101 non-null   int64 
 4   name         1101 non-null   object
 5   date         1101 non-null   object
 6   time         1101 non-null   object
 7   url          1101 non-null   object
 8   fp1_date     1101 non-null   object
 9   fp1_time     1101 non-null   object
 10  fp2_date     1101 non-null   object
 11  fp2_time     1101 non-null   object
 12  fp3_date     1101 non-null   object
 13  fp3_time     1101 non-null   object
 14  quali_date   1101 non-null   object
 15  quali_time   1101 non-null   object
 16  sprint_date  1101 non-null   object
 17  sprint_time  1101 non-null   object
dtypes: int64(4), object(14)
memory usage: 155.0+ KB


The races_df (1,101 entries, 18 columns) offers details about Formula 1 races, including:

- Unique race IDs, years, rounds, and circuit IDs.
- Grand Prix names, dates, and times.
- Links to the Wikipedia page for each race.
- Dates and times for practice sessions, qualifying, and sprint races. Where these are unavailable, as in the 2009 rows shown, the placeholder \N appears instead, so info() still reports the columns as fully non-null.

This data is valuable for analyzing race schedules, historical trends, and exploring further details via the provided links.

#### **Results**

In [34]:
# Load the race results data
results_df = pd.read_csv('C:/Users/Alex/F1_Capstone/Data/results.csv')

In [35]:
results_df.head()

Unnamed: 0,resultId,raceId,driverId,constructorId,number,grid,position,positionText,positionOrder,points,laps,time,milliseconds,fastestLap,rank,fastestLapTime,fastestLapSpeed,statusId
0,1,18,1,1,22,1,1,1,1,10.0,58,1:34:50.616,5690616,39,2,1:27.452,218.3,1
1,2,18,2,2,3,5,2,2,2,8.0,58,+5.478,5696094,41,3,1:27.739,217.586,1
2,3,18,3,3,7,7,3,3,3,6.0,58,+8.163,5698779,41,5,1:28.090,216.719,1
3,4,18,4,4,5,11,4,4,4,5.0,58,+17.181,5707797,58,7,1:28.603,215.464,1
4,5,18,5,1,23,3,5,5,5,4.0,58,+18.014,5708630,43,1,1:27.418,218.385,1


In [36]:
results_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26080 entries, 0 to 26079
Data columns (total 18 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   resultId         26080 non-null  int64  
 1   raceId           26080 non-null  int64  
 2   driverId         26080 non-null  int64  
 3   constructorId    26080 non-null  int64  
 4   number           26080 non-null  object 
 5   grid             26080 non-null  int64  
 6   position         26080 non-null  object 
 7   positionText     26080 non-null  object 
 8   positionOrder    26080 non-null  int64  
 9   points           26080 non-null  float64
 10  laps             26080 non-null  int64  
 11  time             26080 non-null  object 
 12  milliseconds     26080 non-null  object 
 13  fastestLap       26080 non-null  object 
 14  rank             26080 non-null  object 
 15  fastestLapTime   26080 non-null  object 
 16  fastestLapSpeed  26080 non-null  object 
 17  statusId    

The results_df DataFrame holds Formula 1 race results (26,080 entries, 18 columns). 

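Information includes driver and constructor IDs, grid and finishing positions, points, laps completed, race time, and fastest-lap details. Several numeric-looking columns (for example position and milliseconds) are stored as object, which suggests they contain non-numeric placeholder values.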

This data allows for analyzing individual driver performance, constructor performance, and race strategies. The provided glimpse showcases entries from a single race (same raceId), with unique IDs (resultId) distinguishing each driver's performance details.
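Since the project aims to predict race winners, results_df is the natural source of a target variable. A minimal sketch, assuming the winner is the row with positionOrder equal to 1 (the winner column name is hypothetical, and the rows are taken from the head() output):

```python
import pandas as pd

# First three rows of the head() output above (selected columns only).
results = pd.DataFrame({
    "resultId": [1, 2, 3],
    "raceId": [18, 18, 18],
    "driverId": [1, 2, 3],
    "grid": [1, 5, 7],
    "positionOrder": [1, 2, 3],
})

# positionOrder is an integer column, unlike position, which is stored as object.
results["winner"] = (results["positionOrder"] == 1).astype(int)

# Sanity check: each race in this sample has exactly one winner.
winners_per_race = results.groupby("raceId")["winner"].sum()

print(results[["driverId", "grid", "winner"]])
```

We use positionOrder rather than position because it is already numeric, which avoids having to clean placeholder values first.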

#### **Seasons**

In [37]:
# Load the seasons data
seasons_df = pd.read_csv('C:/Users/Alex/F1_Capstone/Data/seasons.csv')

In [38]:
seasons_df.head()

Unnamed: 0,year,url
0,2009,http://en.wikipedia.org/wiki/2009_Formula_One_...
1,2008,http://en.wikipedia.org/wiki/2008_Formula_One_...
2,2007,http://en.wikipedia.org/wiki/2007_Formula_One_...
3,2006,http://en.wikipedia.org/wiki/2006_Formula_One_...
4,2005,http://en.wikipedia.org/wiki/2005_Formula_One_...


In [39]:
seasons_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 74 entries, 0 to 73
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   year    74 non-null     int64 
 1   url     74 non-null     object
dtypes: int64(1), object(1)
memory usage: 1.3+ KB


The seasons_df stores information about Formula One seasons (74 entries, 2 columns). 

It contains the year (integer) and a link (URL) to the season's Wikipedia page. 

This data allows for tracking information across seasons or exploring specific seasons in detail by following the provided links.

#### **Sprint Results**

In [40]:
# Load the sprint results data
sprint_results_df = pd.read_csv('C:/Users/Alex/F1_Capstone/Data/sprint_results.csv')

In [41]:
sprint_results_df.head()

Unnamed: 0,resultId,raceId,driverId,constructorId,number,grid,position,positionText,positionOrder,points,laps,time,milliseconds,fastestLap,fastestLapTime,statusId
0,1,1061,830,9,33,2,1,1,1,3,17,25:38.426,1538426,14,1:30.013,1
1,2,1061,1,131,44,1,2,2,2,2,17,+1.430,1539856,17,1:29.937,1
2,3,1061,822,131,77,3,3,3,3,1,17,+7.502,1545928,17,1:29.958,1
3,4,1061,844,6,16,4,4,4,4,0,17,+11.278,1549704,16,1:30.163,1
4,5,1061,846,1,4,6,5,5,5,0,17,+24.111,1562537,16,1:30.566,1


In [42]:
sprint_results_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 180 entries, 0 to 179
Data columns (total 16 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   resultId        180 non-null    int64 
 1   raceId          180 non-null    int64 
 2   driverId        180 non-null    int64 
 3   constructorId   180 non-null    int64 
 4   number          180 non-null    int64 
 5   grid            180 non-null    int64 
 6   position        180 non-null    object
 7   positionText    180 non-null    object
 8   positionOrder   180 non-null    int64 
 9   points          180 non-null    int64 
 10  laps            180 non-null    int64 
 11  time            180 non-null    object
 12  milliseconds    180 non-null    object
 13  fastestLap      180 non-null    object
 14  fastestLapTime  180 non-null    object
 15  statusId        180 non-null    int64 
dtypes: int64(10), object(6)
memory usage: 22.6+ KB


The sprint_results_df contains sprint race results and is considerably smaller (180 entries) than results_df (26,080 entries). 

It shares most of its columns with the main results DataFrame but covers a different event format. The data includes details like driver IDs, positions (numerical and text), laps, times, and fastest laps. 

The first five rows all come from a single race (raceId 1061): each of these drivers completed 17 laps, points went to the top three finishers (3, 2, and 1), and the fastest laps were closely matched, with the quickest shown being 1:29.937.

#### **Status**

In [43]:
# Load the status lookup data
status_df = pd.read_csv('C:/Users/Alex/F1_Capstone/Data/status.csv')

In [44]:
status_df.head()

Unnamed: 0,statusId,status
0,1,Finished
1,2,Disqualified
2,3,Accident
3,4,Collision
4,5,Engine


In [45]:
status_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 139 entries, 0 to 138
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   statusId  139 non-null    int64 
 1   status    139 non-null    object
dtypes: int64(1), object(1)
memory usage: 2.3+ KB


The status_df provides a reference for interpreting driver statuses in Formula 1 races. 

This small DataFrame (139 entries) maps each unique status code (statusId) to a textual description such as "Finished", "Disqualified", "Accident", "Collision", or "Engine". 

This information enriches race data by explaining why drivers might not finish a race or the nature of incidents they faced. This allows for deeper analysis, like understanding car reliability issues, driver errors, or the impact of race conditions, ultimately enhancing readability and providing valuable insights for data analysis and visualizations.
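As a small sketch of how this lookup could be used, the example below joins status descriptions onto results and flags whether each entry finished. The status rows come from the head() output, while the results rows are made up. Treating only "Finished" as a finish is a simplifying assumption, since the full table has 139 codes and some of the others may also represent classified finishes.

```python
import pandas as pd

# The five status codes shown in the head() output above.
status = pd.DataFrame({
    "statusId": [1, 2, 3, 4, 5],
    "status": ["Finished", "Disqualified", "Accident", "Collision", "Engine"],
})

# Made-up results rows for illustration.
results = pd.DataFrame({
    "resultId": [1, 2, 3],
    "statusId": [1, 5, 3],
})

labelled = results.merge(status, on="statusId", how="left", validate="many_to_one")

# Simplification: only the literal "Finished" status counts as a finish here.
labelled["finished"] = labelled["status"].eq("Finished")

print(labelled)
```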

### F1 Data for Grand Prix Winner Prediction:DataFrames Summary

This analysis explores various Formula 1 (F1) dataframes and their potential in predicting Grand Prix winners. Each dataframe offers unique insights:

- **Circuits, constructors, and drivers**: These dataframes (circuits_df; constructors_df, constructor_results_df, and constructor_standings_df; drivers_df and driver_standings_df) provide details on tracks, team performance, and driver profiles, including historical information and potentially influential factors like nationality (drivers_df).

- **Performance data**: Dataframes like lap_times_df and pit_stops_df offer crucial insights into race pace, strategy, and consistency.

- **Qualifying and race results**: qualifying_df and results_df (together with races_df and sprint_results_df) capture qualifying performance and race outcomes, essential for understanding race dynamics and factors influencing wins.

- **Seasonal context**: seasons_df and status_df provide context for performance across seasons and race statuses (finished/DNF).

The key to accurate predictions lies in combining these data points:

**What would an ideal dataset include?**
An ideal dataset for prediction would integrate information from:

- Qualifying: Grid position as a crucial predictor.
- Standings: Driver and constructor momentum throughout the season.
- Performance data: Lap times, pit stops, and strategy.
- Circuit characteristics: Track-specific features impacting car performance.
- Driver and team profiles: Historical data, experience, and resource capabilities.
- Weather conditions: Potential impact on race day performance.

**How to analyse the dataset**

Machine learning models like Random Forests or Gradient Boosting (GBMs) can leverage this comprehensive data to capture complex relationships between these factors and race outcomes.


**However, success hinges on:**

- Data quality: Ensuring clean and accurate data.
- Feature engineering: Transforming data to best represent the nuances of F1 racing.
- Model training and evaluation: Using cross-validation to avoid overfitting and testing on unseen data for generalizability.
- Advanced techniques: Employing ensemble learning and hyperparameter tuning for further accuracy refinement.
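For the point about testing on unseen data, one option with race data is a chronological hold-out: train on earlier seasons and evaluate on a later one, so the model never sees the future. The sketch below illustrates the idea with a made-up modelling table. The model_df table, its columns, and the split_by_season helper are all hypothetical, and this is not necessarily the validation scheme the project will end up using.

```python
import pandas as pd

# Made-up modelling table: one row per driver per race.
model_df = pd.DataFrame({
    "year": [2020, 2020, 2021, 2021, 2022, 2022],
    "grid": [1, 2, 1, 3, 2, 1],
    "winner": [1, 0, 1, 0, 0, 1],
})


def split_by_season(df, test_year):
    """Train on seasons before test_year and evaluate on test_year itself."""
    train = df[df["year"] < test_year]
    test = df[df["year"] == test_year]
    return train, test


train_df, test_df = split_by_season(model_df, test_year=2022)

print(len(train_df), "training rows;", len(test_df), "test rows")
```

Repeating this split for several consecutive test years would give a simple form of time-aware cross-validation.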

By effectively utilizing this data and model, we can gain valuable insights to:

- Inform team strategies.
- Enhance fan engagement.
- Potentially guide betting markets with more accurate predictions.

In the next notebook we will create our initial modelling dataframe and do some basic EDA. At the moment this data has most of the information I need to get started, but as the project goes on I may need data from other sources.