# F1 Data Science Project


## Data Science Learning Path

### 1. Mathematics for Data Science
- Algebra: Linear equations, matrix operations.
- Calculus: Derivatives, optimization.
- Probability fundamentals: Basic probability, conditional probability.

### 2. Statistics Basics
- Descriptive statistics: Mean, median, mode, standard deviation, variance.
- Probability distributions: Normal, binomial, Poisson.
- Inferential statistics: Sampling, estimation.

### 3. Python Programming
- Data types and structures: Lists, tuples, dictionaries, sets.
- Control structures: Loops, conditionals.
- Functions and modules.
- Data science libraries: Pandas, NumPy, Matplotlib.

### 4. Data Wrangling and Cleaning
- Handling missing values: Imputation, removal.
- Handling outliers: Detection and treatment.
- Data transformation: Scaling, normalization, encoding categorical variables.

### 5. Exploratory Data Analysis (EDA)
- Summary statistics: Mean, median, skewness, kurtosis.
- Data visualization: Histogram, boxplot, pairplot.
- Feature relationships: Correlation analysis, scatter plots.

### 6. Probability and Probability Distributions
- Basic probability: Rules of probability, Bayes' theorem.
- Probability distributions: Normal, binomial, Poisson, uniform distributions.
- Sampling methods: Random, stratified, cluster sampling.

### 7. Hypothesis Testing
- Basics: Null and alternative hypotheses.
- p-values and confidence intervals.
- Types of tests: t-tests, chi-square tests, ANOVA.

### 8. Data Visualization
- Basic plots: Histogram, scatter plot, line plot.
- Advanced visualizations: Heatmap, pairplot, violin plot.
- Interactive visualizations: Plotly, Dash, Tableau basics.

### 9. Linear Regression
- Simple linear regression.
- Multiple linear regression.
- Evaluation metrics: Mean Absolute Error (MAE), Mean Squared Error (MSE), R-squared.

### 10. Logistic Regression
- Binary classification.
- Sigmoid function and decision boundary.
- Model interpretation and performance metrics.

### 11. Decision Trees and Random Forests
- Basics of decision trees: Splitting, pruning, information gain.
- Random forests: Ensemble learning, bagging.
- Hyperparameters for tuning: Max depth, min samples split.

### 12. k-Nearest Neighbors (kNN)
- Distance metrics: Euclidean, Manhattan.
- Choosing k and model performance.
- Applications: Classification, regression.

### 13. Model Evaluation Metrics
- Classification metrics: Accuracy, precision, recall, F1-score, ROC-AUC.
- Regression metrics: MAE, MSE, RMSE.
- Cross-validation techniques.

### 14. Clustering Algorithms
- K-means clustering: Choosing k, cluster evaluation.
- Hierarchical clustering: Dendrograms, agglomerative and divisive methods.
- Evaluation metrics: Silhouette score, Davies-Bouldin index.

### 15. Dimensionality Reduction
- Principal Component Analysis (PCA): Eigenvalues, eigenvectors.
- t-SNE: Visualization of high-dimensional data.
- Application of dimensionality reduction in preprocessing.

### 16. Hyperparameter Tuning
- Grid Search and Random Search.
- Cross-validation: k-Fold, Leave-One-Out.
- Tuning with libraries: scikit-learn’s `GridSearchCV`.

### 17. Neural Networks
- Basics of neural networks: Perceptron, activation functions.
- Backpropagation and gradient descent.
- Types of layers: Input, hidden, output.

### 18. Deep Learning with CNNs and RNNs
- Convolutional Neural Networks (CNNs): Convolutional layers, pooling.
- Recurrent Neural Networks (RNNs): Sequence data, LSTM, GRU.
- Applications: Image classification, natural language processing.

### 19. Natural Language Processing (NLP)
- Text preprocessing: Tokenization, stemming, lemmatization.
- Vectorization methods: Bag-of-Words, TF-IDF.
- Advanced NLP: Word embeddings, language models (BERT, GPT).

### 20. Model Deployment and Monitoring
- Model deployment: Flask, FastAPI, Docker.
- Cloud platforms: AWS, GCP, Azure for model deployment.
- Monitoring models: Performance tracking, retraining triggers.


## Loading Libraries, Data, and Tables

### List of Tables and Columns

| Table                     | Columns                                                                                                                                                                        |
|---------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **circuits_df**           | `circuitId`, `circuitRef`, `name`, `location`, `country`, `lat`, `lng`, `alt`, `url`                                                                                           |
| **constructor_results_df** | `constructorResultsId`, `raceId`, `constructorId`, `points`, `status`                                                                                                          |
| **constructor_standings_df** | `constructorStandingsId`, `raceId`, `constructorId`, `points`, `position`, `positionText`, `wins`                                                                        |
| **lap_times_df**          | `raceId`, `driverId`, `lap`, `position`, `time`, `milliseconds`                                                                                                                |
| **pit_stops_df**          | `raceId`, `driverId`, `stop`, `lap`, `time`, `duration`, `milliseconds`                                                                                                        |
| **qualifying_df**         | `qualifyId`, `raceId`, `driverId`, `constructorId`, `number`, `position`, `q1`, `q2`, `q3`                                                                                     |
| **results_df**            | `resultId`, `raceId`, `driverId`, `constructorId`, `number`, `grid`, `position`, `positionText`, `positionOrder`, `points`, `laps`, `time`, `milliseconds`, `fastestLap`, `rank`, `fastestLapTime`, `fastestLapSpeed`, `statusId` |
| **seasons_df**            | `year`, `url`                                                                                                                                                                  |
| **sprint_results_df**     | `resultId`, `raceId`, `driverId`, `constructorId`, `number`, `grid`, `position`, `positionText`, `positionOrder`, `points`, `laps`, `time`, `milliseconds`, `fastestLap`, `fastestLapTime`, `statusId` |
| **status_df**             | `statusId`, `status`                                                                                                                                                           |
| **drivers_df**            | `driverId`, `driverRef`, `number`, `code`, `forename`, `surname`, `dob`, `nationality`, `url`                                                                                  |
| **races_df**              | `raceId`, `year`, `round`, `circuitId`, `name`, `date`, `time`, `url`, `fp1_date`, `fp1_time`, `fp2_date`, `fp2_time`, `fp3_date`, `fp3_time`, `quali_date`, `quali_time`, `sprint_date`, `sprint_time` |
| **constructors_df**       | `constructorId`, `constructorRef`, `name`, `nationality`, `url`                                                                                                                |
| **driver_standings_df**   | `driverStandingsId`, `raceId`, `driverId`, `points`, `position`, `positionText`, `wins`                                                                                        |


In [2]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import matplotlib
from sklearn.model_selection import train_test_split
import warnings

# Suppress warnings (can be removed for debugging)
warnings.simplefilter("ignore")

# Set display options for pandas
pd.set_option('display.max_columns', None)

# Check library versions
print("Pandas version:", pd.__version__)
print("Seaborn version:", sns.__version__)
print("Matplotlib version:", matplotlib.__version__)
print("NumPy version:", np.__version__)


Pandas version: 2.2.3
Seaborn version: 0.13.2
Matplotlib version: 3.9.2
NumPy version: 2.1.2


In [None]:
circuits_df = pd.read_csv(r'C:\Dataset\F1 Data set\circuits.csv')
constructor_results_df = pd.read_csv(r'C:\Dataset\F1 Data set\constructor_results.csv')
constructor_standings_df = pd.read_csv(r'C:\Dataset\F1 Data set\constructor_standings.csv')
lap_times_df = pd.read_csv(r'C:\Dataset\F1 Data set\lap_times.csv')
pit_stops_df = pd.read_csv(r'C:\Dataset\F1 Data set\pit_stops.csv')
qualifying_df = pd.read_csv(r'C:\Dataset\F1 Data set\qualifying.csv')
results_df = pd.read_csv(r'C:\Dataset\F1 Data set\results.csv')
seasons_df = pd.read_csv(r'C:\Dataset\F1 Data set\seasons.csv')
sprint_results_df = pd.read_csv(r'C:\Dataset\F1 Data set\sprint_results.csv')
status_df = pd.read_csv(r'C:\Dataset\F1 Data set\status.csv')
drivers_df = pd.read_csv(r'C:\Dataset\F1 Data set\drivers.csv')
races_df = pd.read_csv(r'C:\Dataset\F1 Data set\races.csv')
constructors_df = pd.read_csv(r'C:\Dataset\F1 Data set\constructors.csv')
driver_standings_df = pd.read_csv(r'C:\Dataset\F1 Data set\driver_standings.csv')


Collect the DataFrames in a dictionary keyed by name so they can be inspected in a loop.

In [None]:

dataframes = {
    'circuits_df': circuits_df,
    'constructor_results_df': constructor_results_df,
    'constructor_standings_df': constructor_standings_df,
    'lap_times_df': lap_times_df,
    'pit_stops_df': pit_stops_df,
    'qualifying_df': qualifying_df,
    'results_df': results_df,
    'seasons_df': seasons_df,
    'sprint_results_df': sprint_results_df,
    'status_df': status_df,
    'drivers_df': drivers_df,
    'races_df': races_df,
    'constructors_df': constructors_df,
    'driver_standings_df': driver_standings_df,
}

# Loop through the DataFrames and print their shapes and first few rows
for name, df in dataframes.items():
    print(f"{name}: shape {df.shape}")
    print(df.head(), "\n")

# Iterate through each DataFrame in the dictionary and print table names with column headers
for table_name, df in dataframes.items():
    print(f"Table: {table_name}")
    print("Columns:", list(df.columns))
    print()  # For readability


circuits_df: shape (77, 9)
   circuitId   circuitRef                            name      location  \
0          1  albert_park  Albert Park Grand Prix Circuit     Melbourne   
1          2       sepang    Sepang International Circuit  Kuala Lumpur   
2          3      bahrain   Bahrain International Circuit        Sakhir   
3          4    catalunya  Circuit de Barcelona-Catalunya      Montmeló   
4          5     istanbul                   Istanbul Park      Istanbul   

     country       lat        lng  alt  \
0  Australia -37.84970  144.96800   10   
1   Malaysia   2.76083  101.73800   18   
2    Bahrain  26.03250   50.51060    7   
3      Spain  41.57000    2.26111  109   
4     Turkey  40.95170   29.40500  130   

                                                 url  
0  http://en.wikipedia.org/wiki/Melbourne_Grand_P...  
1  http://en.wikipedia.org/wiki/Sepang_Internatio...  
2  http://en.wikipedia.org/wiki/Bahrain_Internati...  
3  http://en.wikipedia.org/wiki/Circuit_de_Barcel

## Cross Table Data Analysis


### Objective: Find out which countries have the most and the fewest circuits.

In [5]:
top_circuits = circuits_df.groupby('country').size().reset_index(name='circuit_count')
top_circuits = top_circuits.sort_values(by='circuit_count', ascending=False)
top_5_countries = top_circuits.head(5)
bottom_5_countries = top_circuits.tail(5)
print(top_5_countries)
print(bottom_5_countries)

     country  circuit_count
33       USA             11
9     France              7
27     Spain              6
32        UK              4
21  Portugal              4
          country  circuit_count
28         Sweden              1
30         Turkey              1
29    Switzerland              1
31            UAE              1
34  United States              1
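
The output above lists both `USA` and `United States`, which suggests that one country appears under two spellings and has its circuits split across two groups. A minimal sketch of normalizing the labels before counting is shown here on a small made-up frame instead of the real `circuits_df`. The `toy_circuits` rows and the `country_aliases` mapping are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

# Toy stand-in for circuits_df (made-up rows, not the real data)
toy_circuits = pd.DataFrame({
    "circuitId": [1, 2, 3, 4],
    "country": ["USA", "United States", "France", "USA"],
})

# Hypothetical alias map; extend it after inspecting the unique country values
country_aliases = {"United States": "USA"}

# Replace alias spellings, then count circuits per country
normalized = toy_circuits.assign(
    country=toy_circuits["country"].replace(country_aliases)
)
circuit_counts = (
    normalized.groupby("country")
    .size()
    .reset_index(name="circuit_count")
    .sort_values("circuit_count", ascending=False)
)
print(circuit_counts)
```

Checking `circuits_df['country'].unique()` first is one way to decide which aliases the mapping needs.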


### Objective: Find the top 5 drivers by race wins and the top 5 drivers by laps led.

In [8]:
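# Step 1: Count race wins per driver (positionOrder == 1 marks the winner)
wins = results_df[results_df['positionOrder'] == 1].groupby('driverId').size().reset_index(name='wins')

# Step 2: Count lap records per driver
# Note: this counts every timed lap, not only laps led; filtering on
# lap_times_df['position'] == 1 first would restrict the count to laps led.
laps_led = lap_times_df.groupby('driverId').size().reset_index(name='laps_led')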

# Step 3: Merge both results to have wins and laps led in one DataFrame
combined = pd.merge(wins, laps_led, on='driverId', how='outer').fillna(0)

# Step 4: Get top 5 drivers by wins
top_5_winners = combined.nlargest(5, 'wins')

# Step 5: Get top 5 drivers by laps led
top_5_laps_led = combined.nlargest(5, 'laps_led')

# Look up full driver names by driverId (drivers_df is loaded above)
driver_names = drivers_df.set_index('driverId')
full_names = driver_names['forename'] + ' ' + driver_names['surname']

# Step 6: Prepare tables for display
top_5_winners['Driver'] = top_5_winners['driverId'].map(full_names)
top_5_laps_led['Driver'] = top_5_laps_led['driverId'].map(full_names)

# Final tables
print("Top 5 Most Successful Drivers (Wins):")
print(top_5_winners[['Driver', 'wins']].to_string(index=False))

print("\nTop 5 Drivers Who Led the Most Laps:")
print(top_5_laps_led[['Driver', 'laps_led']].to_string(index=False))

Top 5 Most Successful Drivers (Wins):
            Driver  wins
    Lewis Hamilton 104.0
Michael Schumacher  91.0
    Max Verstappen  61.0
  Sebastian Vettel  53.0
       Alain Prost  51.0

Top 5 Drivers Who Led the Most Laps:
          Driver  laps_led
 Fernando Alonso   21123.0
  Lewis Hamilton   19587.0
  Kimi Räikkönen   18623.0
Sebastian Vettel   16427.0
   Jenson Button   16272.0
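
A sketch of counting laps led, under the assumption that `position` in `lap_times_df` is the driver's running position at the end of each lap: keep only the `position == 1` rows before grouping. Grouping all rows by `driverId` counts laps completed instead. The `toy_lap_times` frame below is made up for illustration.

```python
import pandas as pd

# Toy stand-in for lap_times_df (made-up rows)
toy_lap_times = pd.DataFrame({
    "raceId":   [1, 1, 1, 1, 1, 1],
    "driverId": [10, 20, 10, 20, 10, 20],
    "lap":      [1, 1, 2, 2, 3, 3],
    "position": [1, 2, 1, 2, 2, 1],
})

# Keep only the laps on which the driver was in first place, then count per driver
laps_led = (
    toy_lap_times[toy_lap_times["position"] == 1]
    .groupby("driverId")
    .size()
    .reset_index(name="laps_led")
)
print(laps_led)
```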


### Objective: Find out how many seasons Ferrari has constructor results for and which seasons are missing.

In [9]:
# Number of seasons on record
total_seasons = seasons_df['year'].nunique()

# Look up Ferrari's constructorId and attach the season year to each constructor result
ferrari_id = constructors_df[constructors_df['constructorRef'] == 'ferrari']['constructorId'].values[0]
merged_df = pd.merge(constructor_results_df, races_df[['raceId', 'year']], on='raceId')

# Seasons in which Ferrari has at least one constructor result
ferrari_participation = merged_df[merged_df['constructorId'] == ferrari_id]
ferrari_unique_years = ferrari_participation['year'].nunique()
ferrari_years = ferrari_participation['year'].unique()

# Seasons with no Ferrari rows in constructor_results_df
# Note: a year also lands here if the table simply does not cover that season.
all_years = seasons_df['year'].unique()
missed_years = set(all_years) - set(ferrari_years)
print(f"Total Seasons: {total_seasons}")
print(f"Ferrari Participated in {ferrari_unique_years} Seasons")
print(f"Years Ferrari Did Not Participate: {sorted(missed_years)}")

Total Seasons: 75
Ferrari Participated in 67 Seasons
Years Ferrari Did Not Participate: [np.int64(1950), np.int64(1951), np.int64(1952), np.int64(1953), np.int64(1954), np.int64(1955), np.int64(1956), np.int64(1957)]
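
Ferrari raced in the 1950 to 1957 seasons, and the Constructors' Championship was first awarded in 1958. The missing years above therefore most likely reflect where `constructor_results_df` starts, not real absences. A possible cross-check is to derive participation from race entries in `results_df` instead. The sketch below uses made-up frames, and the `seasons_entered` helper and the `ferrari_id` value are hypothetical.

```python
import pandas as pd

# Toy stand-ins for races_df, results_df and constructor_results_df (made-up rows)
toy_races = pd.DataFrame({"raceId": [1, 2, 3], "year": [1950, 1951, 1958]})
toy_results = pd.DataFrame({
    "raceId": [1, 2, 3, 3],
    "constructorId": [6, 6, 6, 1],
})
toy_constructor_results = pd.DataFrame({
    "raceId": [3, 3],
    "constructorId": [6, 1],
})

ferrari_id = 6  # hypothetical id used only for this toy example


def seasons_entered(entries, races, constructor_id):
    """Return the sorted years in which the constructor has rows in `entries`."""
    merged = entries.merge(races[["raceId", "year"]], on="raceId")
    years = merged.loc[merged["constructorId"] == constructor_id, "year"]
    return sorted(int(y) for y in years.unique())


years_from_results = seasons_entered(toy_results, toy_races, ferrari_id)
years_from_constructor_results = seasons_entered(
    toy_constructor_results, toy_races, ferrari_id
)
print(years_from_results)
print(years_from_constructor_results)
```

Converting each year with `int()` also keeps the printed list free of `np.int64(...)` wrappers.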


# 1. Mathematics for Data Science

## Algebra Exercises

1. **Filter and Analyze Circuit Locations**
   - **Goal**: List the unique (`location`, `country`) pairs in `circuits_df`, then build a table giving the number of circuits in each `country`.
   - **Hint**: Use `drop_duplicates()` for the pairs and `groupby()` to count circuits per country.
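
One possible approach to this exercise, sketched on a small made-up frame in place of the real `circuits_df` (the rows and the `toy_circuits` name are illustrative assumptions):

```python
import pandas as pd

# Toy stand-in for circuits_df (made-up rows)
toy_circuits = pd.DataFrame({
    "location": ["Melbourne", "Sakhir", "Montmeló", "Jerez", "Montmeló"],
    "country":  ["Australia", "Bahrain", "Spain", "Spain", "Spain"],
})

# Unique (location, country) pairs
unique_pairs = toy_circuits[["location", "country"]].drop_duplicates()

# Number of circuit rows per country
circuits_per_country = (
    toy_circuits.groupby("country").size().reset_index(name="circuit_count")
)
print(unique_pairs)
print(circuits_per_country)
```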

---

2. **Matrix of Constructor Standings**
   - **Goal**: Create a matrix showing the `position` of each constructor after each race, using `constructor_standings_df`. Each row is a `raceId` and each column is a `constructorId`.
   - **Hint**: Use `pivot()` or `pivot_table()` to reshape the data.
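
A minimal sketch of the reshaping step, assuming each (`raceId`, `constructorId`) pair occurs at most once. The `toy_standings` frame is made up for illustration.

```python
import pandas as pd

# Toy stand-in for constructor_standings_df (made-up rows)
toy_standings = pd.DataFrame({
    "raceId":        [1, 1, 2, 2],
    "constructorId": [6, 9, 6, 9],
    "position":      [1, 2, 2, 1],
})

# Rows are races, columns are constructors, cells hold the standings position
position_matrix = toy_standings.pivot(
    index="raceId", columns="constructorId", values="position"
)
print(position_matrix)
```

`pivot()` raises an error when a pair is duplicated. In that case `pivot_table()` with an explicit aggregation function is one alternative.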

---

3. **Sum of Points by Driver**
   - **Goal**: Calculate the total `points` each driver has scored across all races using the `results_df` table.
   - **Hint**: Group by `driverId` and use aggregation to sum the `points`.
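
The grouping and aggregation in the hint could look like the following sketch, run on a made-up `toy_results` frame in place of `results_df`:

```python
import pandas as pd

# Toy stand-in for results_df (made-up rows)
toy_results = pd.DataFrame({
    "driverId": [1, 1, 2, 2, 3],
    "points":   [25.0, 18.0, 10.0, 0.0, 1.0],
})

# Total points per driver, highest first
total_points = (
    toy_results.groupby("driverId")["points"].sum().sort_values(ascending=False)
)
print(total_points)
```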

---

4. **Eigenvalues of a Points Matrix**
   - **Goal**: Construct a 2x2 matrix of points scored by two constructors in two races from `constructor_results_df` and compute its eigenvalues.
   - **Hint**: Choose two specific `constructorId`s and `raceId`s for simplicity.
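
A sketch of one way to build the 2x2 matrix and compute its eigenvalues with NumPy. The points values are made up and chosen so that the characteristic polynomial is easy to check by hand: for the matrix [[4, 1], [2, 3]] the trace is 7 and the determinant is 10, so λ² − 7λ + 10 = 0.

```python
import numpy as np
import pandas as pd

# Toy stand-in for constructor_results_df (made-up points)
toy_constructor_results = pd.DataFrame({
    "raceId":        [1, 1, 2, 2],
    "constructorId": [6, 9, 6, 9],
    "points":        [4.0, 1.0, 2.0, 3.0],
})

# Rows are races, columns are constructors
points_matrix = toy_constructor_results.pivot(
    index="raceId", columns="constructorId", values="points"
).to_numpy()

eigenvalues = np.linalg.eigvals(points_matrix)
print(points_matrix)
print(eigenvalues)
```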

---

## Calculus Exercises

5. **Rate of Change in Lap Time**
   - **Goal**: Given a `driverId`, calculate how their lap time (`milliseconds`) changes from one `lap` to the next in the `lap_times_df` table.
   - **Hint**: Sort by `raceId` and `lap`, then use `diff()` within each race so that differences are not taken across races.
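
A minimal sketch of the lap-to-lap difference, treated as a discrete rate of change per lap. It assumes the `milliseconds` column holds the lap time and groups by `raceId` so that the first lap of each race has no predecessor. The toy rows are made up.

```python
import pandas as pd

# Toy stand-in for lap_times_df (made-up rows for one driver in two races)
toy_lap_times = pd.DataFrame({
    "raceId":       [1, 1, 1, 2, 2],
    "driverId":     [10, 10, 10, 10, 10],
    "lap":          [1, 2, 3, 1, 2],
    "milliseconds": [95000, 93500, 94000, 88000, 87500],
})

driver_laps = toy_lap_times[toy_lap_times["driverId"] == 10].sort_values(
    ["raceId", "lap"]
)

# Change in lap time versus the previous lap of the same race (NaN on lap 1)
driver_laps = driver_laps.assign(
    delta_ms=driver_laps.groupby("raceId")["milliseconds"].diff()
)
print(driver_laps)
```

A negative `delta_ms` means the lap was faster than the previous one.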

---

6. **Total Pit Stop Duration Over Time**
   - **Goal**: Calculate the cumulative pit stop time for a `driverId` over the course of a race, and analyze how the duration of each stop changes across laps in `pit_stops_df`.
   - **Hint**: Use `cumsum()` for the running total and `diff()` as a discrete derivative to see how stop durations change.
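
A sketch of the running total and the stop-to-stop change for one driver in one race. It uses the numeric `milliseconds` column, on the assumption that `duration` may be stored as text. The toy rows are made up.

```python
import pandas as pd

# Toy stand-in for pit_stops_df (made-up rows: one driver, one race)
toy_pit_stops = pd.DataFrame({
    "raceId":       [1, 1, 1],
    "driverId":     [10, 10, 10],
    "stop":         [1, 2, 3],
    "lap":          [12, 30, 45],
    "milliseconds": [23000, 24500, 22000],
})

stops = toy_pit_stops.sort_values("lap")
stops = stops.assign(
    cumulative_ms=stops["milliseconds"].cumsum(),  # running total of pit time
    change_ms=stops["milliseconds"].diff(),        # change versus the previous stop
)
print(stops[["lap", "milliseconds", "cumulative_ms", "change_ms"]])
```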

---

7. **Optimization of Fastest Lap Speed**
   - **Goal**: For each `driverId`, find the highest `fastestLapSpeed` in `results_df` and the race and lap (`fastestLap`) on which it was set.
   - **Hint**: Use `groupby()` with `max()` to isolate the fastest lap speeds, or `idxmax()` to keep the whole row.
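
A possible sketch using `idxmax()` to keep the full row for each driver's best speed. It assumes `fastestLapSpeed` may be read as text with a placeholder such as `\N` for missing values, which is why the column is converted with `pd.to_numeric` first. The toy rows are made up.

```python
import pandas as pd

# Toy stand-in for results_df (made-up rows; "\N" marks a missing speed)
toy_results = pd.DataFrame({
    "raceId":          [1, 2, 1, 2],
    "driverId":        [10, 10, 20, 20],
    "fastestLap":      [44, 51, 39, 47],
    "fastestLapSpeed": ["210.5", "215.2", "\\N", "208.9"],
})

# Convert speeds to numbers; unparseable entries become NaN and are dropped
speeds = toy_results.assign(
    fastestLapSpeed=pd.to_numeric(toy_results["fastestLapSpeed"], errors="coerce")
).dropna(subset=["fastestLapSpeed"])

# Row with the highest fastestLapSpeed for each driver
best_rows = speeds.loc[speeds.groupby("driverId")["fastestLapSpeed"].idxmax()]
print(best_rows)
```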

---


## Probability Fundamentals Exercises

8. **Probability of a Constructor Winning**
   - **Goal**: Calculate the probability that a specific `constructorId` wins a race (finishes in `position` 1), using the `results_df` table.
   - **Hint**: Filter for first-place finishes by that constructor, count them, and divide by the total number of races.
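
A sketch of the empirical win probability, wins divided by races, on made-up data. It uses `positionOrder == 1` to identify the winner, as the wins calculation earlier in this notebook does. The `constructor_id` value is hypothetical.

```python
import pandas as pd

# Toy stand-in for results_df (made-up rows: two constructors, four races)
toy_results = pd.DataFrame({
    "raceId":        [1, 1, 2, 2, 3, 3, 4, 4],
    "constructorId": [6, 9, 6, 9, 6, 9, 6, 9],
    "positionOrder": [1, 2, 2, 1, 1, 2, 1, 2],
})

constructor_id = 6  # hypothetical constructor of interest

# Races won by the constructor, divided by all races in the table
wins = (
    (toy_results["constructorId"] == constructor_id)
    & (toy_results["positionOrder"] == 1)
).sum()
total_races = toy_results["raceId"].nunique()
p_win = wins / total_races
print(p_win)
```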

---

9. **Conditional Probability of Qualifying Position**
   - **Goal**: Calculate the probability that a `driverId` qualifies in the top 3 (`position` <= 3) given that they participated in qualifying, using data from `qualifying_df`.
   - **Hint**: Calculate the proportion of `position` <= 3 among all qualifying entries for each driver.
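
Because every row of `qualifying_df` is already a qualifying entry, the conditional probability reduces to the share of a driver's entries with `position` <= 3. A minimal sketch on made-up rows:

```python
import pandas as pd

# Toy stand-in for qualifying_df (made-up rows)
toy_qualifying = pd.DataFrame({
    "driverId": [10, 10, 10, 10, 20, 20],
    "position": [1, 4, 2, 3, 5, 2],
})

# The mean of a boolean column is the proportion of True values per driver
p_top3 = (toy_qualifying["position"] <= 3).groupby(toy_qualifying["driverId"]).mean()
print(p_top3)
```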

---

10. **Joint Probability of Winning and Fastest Lap**
      - **Goal**: Calculate the probability that a driver both wins a race (`position` = 1) and sets its fastest lap, using `results_df`.
      - **Hint**: Count the races where both conditions hold for a `driverId` and divide by the total number of races.
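
A sketch of the joint probability on made-up data. It assumes that the `rank` column of `results_df` ranks each driver's fastest lap within the race, so that `rank == 1` marks the fastest lap. That reading should be checked against the dataset documentation. The `driver_id` value is hypothetical.

```python
import pandas as pd

# Toy stand-in for results_df (made-up rows: two drivers, four races)
toy_results = pd.DataFrame({
    "raceId":        [1, 1, 2, 2, 3, 3, 4, 4],
    "driverId":      [10, 20, 10, 20, 10, 20, 10, 20],
    "positionOrder": [1, 2, 1, 2, 2, 1, 1, 2],
    "rank":          [1, 2, 2, 1, 2, 1, 1, 2],
})

driver_id = 10  # hypothetical driver of interest
rows = toy_results[toy_results["driverId"] == driver_id]

# Races where the driver both won and set the fastest lap
both = ((rows["positionOrder"] == 1) & (rows["rank"] == 1)).sum()
p_joint = both / toy_results["raceId"].nunique()
print(p_joint)
```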