# F1 Data Science Project


## Data Science Learning Path

### 1. Mathematics for Data Science
   - Algebra: Linear equations, matrix operations.
   - Calculus: Derivatives, optimization.
   - Probability fundamentals: Basic probability, conditional probability.

### 2. Statistics Basics
   - Descriptive statistics: Mean, median, mode, standard deviation, variance.
   - Probability distributions: Normal, binomial, Poisson.
   - Inferential statistics: Sampling, estimation.

### 3. Python Programming
   - Data types and structures: Lists, tuples, dictionaries, sets.
   - Control structures: Loops, conditionals.
   - Functions and modules.
   - Data science libraries: Pandas, Numpy, Matplotlib.

### 4. Data Wrangling and Cleaning
   - Handling missing values: Imputation, removal.
   - Handling outliers: Detection and treatment.
   - Data transformation: Scaling, normalization, encoding categorical variables.

### 5. Exploratory Data Analysis (EDA)
   - Summary statistics: Mean, median, skewness, kurtosis.
   - Data visualization: Histogram, boxplot, pairplot.
   - Feature relationships: Correlation analysis, scatter plots.

### 6. Probability and Probability Distributions
   - Basic probability: Rules of probability, Bayes' theorem.
   - Probability distributions: Normal, binomial, Poisson, uniform distributions.
   - Sampling methods: Random, stratified, cluster sampling.

### 7. Hypothesis Testing
   - Basics: Null and alternative hypotheses.
   - p-values and confidence intervals.
   - Types of tests: t-tests, chi-square tests, ANOVA.

### 8. Data Visualization
   - Basic plots: Histogram, scatter plot, line plot.
   - Advanced visualizations: Heatmap, pairplot, violin plot.
   - Interactive visualizations: Plotly, Dash, Tableau basics.

### 9. Linear Regression
   - Simple linear regression.
   - Multiple linear regression.
   - Evaluation metrics: Mean Absolute Error (MAE), Mean Squared Error (MSE), R-squared.

### 10. Logistic Regression
- Binary classification.
- Sigmoid function and decision boundary.
- Model interpretation and performance metrics.

### 11. Decision Trees and Random Forests
- Basics of decision trees: Splitting, pruning, information gain.
- Random forests: Ensemble learning, bagging.
- Hyperparameters for tuning: Max depth, min samples split.

### 12. k-Nearest Neighbors (kNN)
- Distance metrics: Euclidean, Manhattan.
- Choosing k and model performance.
- Applications: Classification, regression.

### 13. Model Evaluation Metrics
- Classification metrics: Accuracy, precision, recall, F1-score, ROC-AUC.
- Regression metrics: MAE, MSE, RMSE.
- Cross-validation techniques.

### 14. Clustering Algorithms
- K-means clustering: Choosing k, cluster evaluation.
- Hierarchical clustering: Dendrograms, agglomerative and divisive methods.
- Evaluation metrics: Silhouette score, Davies-Bouldin index.

### 15. Dimensionality Reduction
- Principal Component Analysis (PCA): Eigenvalues, eigenvectors.
- t-SNE: Visualization of high-dimensional data.
- Application of dimensionality reduction in preprocessing.

### 16. Hyperparameter Tuning
- Grid Search and Random Search.
- Cross-validation: k-Fold, Leave-One-Out.
- Tuning with libraries: Scikit-Learn’s GridSearchCV.

### 17. Neural Networks
- Basics of neural networks: Perceptron, activation functions.
- Backpropagation and gradient descent.
- Types of layers: Input, hidden, output.

### 18. Deep Learning with CNNs and RNNs
- Convolutional Neural Networks (CNNs): Convolutional layers, pooling.
- Recurrent Neural Networks (RNNs): Sequence data, LSTM, GRU.
- Applications: Image classification, natural language processing.

### 19. Natural Language Processing (NLP)
- Text preprocessing: Tokenization, stemming, lemmatization.
- Vectorization methods: Bag-of-Words, TF-IDF.
- Advanced NLP: Word embeddings, language models (BERT, GPT).

### 20. Model Deployment and Monitoring
- Model deployment: Flask, FastAPI, Docker.
- Cloud platforms: AWS, GCP, Azure for model deployment.
- Monitoring models: Performance tracking, retraining triggers.


## Loading Data, Libraries, and Tables

### List of Tables and Columns

| Table                     | Columns                                                                                                                                                                        |
|---------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **circuits_df**           | `circuitId`, `circuitRef`, `name`, `location`, `country`, `lat`, `lng`, `alt`, `url`                                                                                           |
| **constructor_results_df** | `constructorResultsId`, `raceId`, `constructorId`, `points`, `status`                                                                                                          |
| **constructor_standings_df** | `constructorStandingsId`, `raceId`, `constructorId`, `points`, `position`, `positionText`, `wins`                                                                        |
| **lap_times_df**          | `raceId`, `driverId`, `lap`, `position`, `time`, `milliseconds`                                                                                                                |
| **pit_stops_df**          | `raceId`, `driverId`, `stop`, `lap`, `time`, `duration`, `milliseconds`                                                                                                        |
| **qualifying_df**         | `qualifyId`, `raceId`, `driverId`, `constructorId`, `number`, `position`, `q1`, `q2`, `q3`                                                                                     |
| **results_df**            | `resultId`, `raceId`, `driverId`, `constructorId`, `number`, `grid`, `position`, `positionText`, `positionOrder`, `points`, `laps`, `time`, `milliseconds`, `fastestLap`, `rank`, `fastestLapTime`, `fastestLapSpeed`, `statusId` |
| **seasons_df**            | `year`, `url`                                                                                                                                                                  |
| **sprint_results_df**     | `resultId`, `raceId`, `driverId`, `constructorId`, `number`, `grid`, `position`, `positionText`, `positionOrder`, `points`, `laps`, `time`, `milliseconds`, `fastestLap`, `fastestLapTime`, `statusId` |
| **status_df**             | `statusId`, `status`                                                                                                                                                           |
| **drivers_df**            | `driverId`, `driverRef`, `number`, `code`, `forename`, `surname`, `dob`, `nationality`, `url`                                                                                  |
| **races_df**              | `raceId`, `year`, `round`, `circuitId`, `name`, `date`, `time`, `url`, `fp1_date`, `fp1_time`, `fp2_date`, `fp2_time`, `fp3_date`, `fp3_time`, `quali_date`, `quali_time`, `sprint_date`, `sprint_time` |
| **constructors_df**       | `constructorId`, `constructorRef`, `name`, `nationality`, `url`                                                                                                                |
| **driver_standings_df**   | `driverStandingsId`, `raceId`, `driverId`, `points`, `position`, `positionText`, `wins`                                                                                        |


1. **Importing Libraries:**

  -   **Pandas** (`import pandas as pd`):  
  Pandas works like an advanced spreadsheet tool: it loads, manipulates, and analyzes large tabular datasets quickly.

  -   **Seaborn** (`import seaborn as sns`):  
  Seaborn builds on top of Matplotlib to produce statistical charts with better default styling and less code.

  -   **Matplotlib** (`import matplotlib.pyplot as plt` and `import matplotlib`):  
  Matplotlib is the core plotting library in Python. `pyplot` provides the drawing functions, and the top-level `matplotlib` import is used here to read the version number.

  -   **NumPy** (`import numpy as np`):  
  NumPy provides fast array and matrix operations, which makes it well suited to math-heavy tasks on large groups of numbers.

  -   **Scikit-Learn** (`from sklearn.model_selection import train_test_split`):  
  Scikit-Learn is a popular machine learning library. The imported `train_test_split` function splits data into training and testing parts, which is a key step in training predictive models.

2. **Setting Up Warnings:**

```python
import warnings
warnings.simplefilter("ignore")
```

3. **Printing Version Information:**
```python
print("Pandas version:", pd.__version__)
print("Seaborn version:", sns.__version__)
print("Matplotlib version:", matplotlib.__version__)
print("NumPy version:", np.__version__)
```

These lines display the versions of each library in use, which helps in keeping track of the exact setup, since different versions might have small differences in functionality.

In [36]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import matplotlib
from sklearn.model_selection import train_test_split
import warnings
warnings.simplefilter("ignore")
pd.set_option('display.max_columns', None)
print("Pandas version:", pd.__version__)
print("Seaborn version:", sns.__version__)
print("Matplotlib version:", matplotlib.__version__)
print("NumPy version:", np.__version__)


Pandas version: 2.2.3
Seaborn version: 0.13.2
Matplotlib version: 3.9.2
NumPy version: 2.1.2


In [37]:
circuits_df = pd.read_csv(r'C:\Dataset\F1 Data set\circuits.csv')
constructor_results_df = pd.read_csv(r'C:\Dataset\F1 Data set\constructor_results.csv')
constructor_standings_df = pd.read_csv(r'C:\Dataset\F1 Data set\constructor_standings.csv')
lap_times_df = pd.read_csv(r'C:\Dataset\F1 Data set\lap_times.csv')
pit_stops_df = pd.read_csv(r'C:\Dataset\F1 Data set\pit_stops.csv')
qualifying_df = pd.read_csv(r'C:\Dataset\F1 Data set\qualifying.csv')
results_df = pd.read_csv(r'C:\Dataset\F1 Data set\results.csv')
seasons_df = pd.read_csv(r'C:\Dataset\F1 Data set\seasons.csv')
sprint_results_df = pd.read_csv(r'C:\Dataset\F1 Data set\sprint_results.csv')
status_df = pd.read_csv(r'C:\Dataset\F1 Data set\status.csv')
drivers_df = pd.read_csv(r'C:\Dataset\F1 Data set\drivers.csv')
races_df = pd.read_csv(r'C:\Dataset\F1 Data set\races.csv')
constructors_df = pd.read_csv(r'C:\Dataset\F1 Data set\constructors.csv')
driver_standings_df = pd.read_csv(r'C:\Dataset\F1 Data set\driver_standings.csv')
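
The fourteen `read_csv` calls above follow one pattern, so they could be collapsed into a loop. The sketch below is a hedged alternative, not the notebook's own code: `load_tables` and `TABLE_NAMES` are hypothetical names, and the helper assumes every CSV sits in one folder and is named after its table. Passing `na_values=r"\N"` is also an assumption, based on the `\N` placeholders that show up in later outputs; it asks pandas to read them as missing values.

```python
from pathlib import Path

import pandas as pd

# Table names taken from the table list above (without the "_df" suffix).
TABLE_NAMES = [
    "circuits", "constructor_results", "constructor_standings", "lap_times",
    "pit_stops", "qualifying", "results", "seasons", "sprint_results",
    "status", "drivers", "races", "constructors", "driver_standings",
]


def load_tables(data_dir, table_names=TABLE_NAMES):
    """Read <data_dir>/<name>.csv for each name into a dict keyed '<name>_df'."""
    data_dir = Path(data_dir)
    return {
        # Treat the literal string \N as a missing value while reading.
        f"{name}_df": pd.read_csv(data_dir / f"{name}.csv", na_values=r"\N")
        for name in table_names
    }
```

With the folder used in this notebook, the call would be `dataframes = load_tables(r'C:\Dataset\F1 Data set')`, after which `dataframes['circuits_df']` holds the circuits table.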


1. **Creating a Dictionary of DataFrames**:
   - A dictionary called `dataframes` is created, where each key-value pair represents a table name and its corresponding DataFrame. This setup makes it easy to iterate over multiple tables.

2. **Looping to Display Shapes and Sample Rows**:
   - The first loop iterates over each DataFrame in the dictionary, printing:
     - The name of the DataFrame.
     - The shape of the DataFrame, which shows the number of rows and columns.
     - The first few rows of data using `head()`, giving a sample preview of the data.

3. **Looping to Display Column Names**:
   - The second loop iterates over each DataFrame again to print:
     - The name of each table.
     - A list of the column names in each DataFrame.
     - This part is helpful for understanding the structure of each table and identifying available fields for analysis or further processing.

List of Data Frames and their corresponding names.

In [38]:

dataframes = {
    'circuits_df': circuits_df,
    'constructor_results_df': constructor_results_df,
    'constructor_standings_df': constructor_standings_df,
    'lap_times_df': lap_times_df,
    'pit_stops_df': pit_stops_df,
    'qualifying_df': qualifying_df,
    'results_df': results_df,
    'seasons_df': seasons_df,
    'sprint_results_df': sprint_results_df,
    'status_df': status_df,
    'drivers_df': drivers_df,
    'races_df': races_df,
    'constructors_df': constructors_df,
    'driver_standings_df': driver_standings_df,
}
for name, df in dataframes.items():
    print(f"{name}: shape {df.shape}")
    print(df.head(), "\n")
for table_name, df in dataframes.items():
    print(f"Table: {table_name}")
    print("Columns:", list(df.columns))
    print()


circuits_df: shape (77, 9)
   circuitId   circuitRef                            name      location  \
0          1  albert_park  Albert Park Grand Prix Circuit     Melbourne   
1          2       sepang    Sepang International Circuit  Kuala Lumpur   
2          3      bahrain   Bahrain International Circuit        Sakhir   
3          4    catalunya  Circuit de Barcelona-Catalunya      Montmeló   
4          5     istanbul                   Istanbul Park      Istanbul   

     country       lat        lng  alt  \
0  Australia -37.84970  144.96800   10   
1   Malaysia   2.76083  101.73800   18   
2    Bahrain  26.03250   50.51060    7   
3      Spain  41.57000    2.26111  109   
4     Turkey  40.95170   29.40500  130   

                                                 url  
0  http://en.wikipedia.org/wiki/Melbourne_Grand_P...  
1  http://en.wikipedia.org/wiki/Sepang_Internatio...  
2  http://en.wikipedia.org/wiki/Bahrain_Internati...  
3  http://en.wikipedia.org/wiki/Circuit_de_Barcel

## Cross-Table Data Analysis


### Objective: Find out which countries have the most and the fewest circuits.

In [39]:
top_circuits = circuits_df.groupby('country').size().reset_index(name='circuit_count')
top_circuits = top_circuits.sort_values(by='circuit_count', ascending=False)
top_5_countries = top_circuits.head(5)
bottom_5_countries = top_circuits.tail(5)
print(top_5_countries)
print(bottom_5_countries)

     country  circuit_count
33       USA             11
9     France              7
27     Spain              6
32        UK              4
21  Portugal              4
          country  circuit_count
28         Sweden              1
30         Turkey              1
29    Switzerland              1
31            UAE              1
34  United States              1
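
In the output above, `USA` (11 circuits) and `United States` (1 circuit) appear as separate rows, so one country is split across two labels. The five rows returned by `tail(5)` also all have one circuit, so which of the tied countries appear depends on the sort order. A minimal sketch of one way to handle the first issue, assuming the two labels refer to the same country: map the variant spelling to a single label before grouping. The toy DataFrame and the `country_aliases` mapping are illustrative and not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for circuits_df; the real table has more rows and columns.
circuits_demo = pd.DataFrame({
    "circuitId": [1, 2, 3, 4],
    "country": ["USA", "United States", "USA", "France"],
})

# Hypothetical alias map: variant label -> canonical label.
country_aliases = {"United States": "USA"}
circuits_demo["country"] = circuits_demo["country"].replace(country_aliases)

# Same grouping as the cell above, now on the cleaned labels.
circuit_counts = (
    circuits_demo.groupby("country")
    .size()
    .reset_index(name="circuit_count")
    .sort_values(by="circuit_count", ascending=False)
)
print(circuit_counts)
```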


### Objective: Find the Top 5 Most Successful Drivers and the Top 5 Drivers Who Led the Most Laps.

In [40]:
# Step 1: Count race wins per driver
wins = results_df[results_df['positionOrder'] == 1].groupby('driverId').size().reset_index(name='wins')

# Step 2: Count laps per driver (this counts every recorded lap; add a position == 1 filter to count only laps led)
laps_led = lap_times_df.groupby('driverId').size().reset_index(name='laps_led')

# Step 3: Merge both results to have wins and laps led in one DataFrame
combined = pd.merge(wins, laps_led, on='driverId', how='outer').fillna(0)

# Step 4: Get top 5 drivers by wins
top_5_winners = combined.nlargest(5, 'wins')

# Step 5: Get top 5 drivers by laps led
top_5_laps_led = combined.nlargest(5, 'laps_led')

# Assuming you have a drivers_df to get driver names
driver_names = drivers_df.set_index('driverId')

# Step 6: Prepare tables for display
top_5_winners['Driver'] = top_5_winners['driverId'].map(driver_names['forename'] + ' ' + driver_names['surname'])
top_5_laps_led['Driver'] = top_5_laps_led['driverId'].map(driver_names['forename'] + ' ' + driver_names['surname'])

# Final tables
print("Top 5 Most Successful Drivers (Wins):")
print(top_5_winners[['Driver', 'wins']].to_string(index=False))

print("\nTop 5 Drivers Who Led the Most Laps:")
print(top_5_laps_led[['Driver', 'laps_led']].to_string(index=False))

Top 5 Most Successful Drivers (Wins):
            Driver  wins
    Lewis Hamilton 104.0
Michael Schumacher  91.0
    Max Verstappen  61.0
  Sebastian Vettel  53.0
       Alain Prost  51.0

Top 5 Drivers Who Led the Most Laps:
          Driver  laps_led
 Fernando Alonso   21123.0
  Lewis Hamilton   19587.0
  Kimi Räikkönen   18623.0
Sebastian Vettel   16427.0
   Jenson Button   16272.0
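
Note that `lap_times_df.groupby('driverId').size()` counts every lap a driver has on record, so the `laps_led` column above is really a count of laps completed. A hedged sketch of counting laps led instead, assuming `position == 1` in `lap_times_df` marks the race leader on that lap. The toy data is illustrative only.

```python
import pandas as pd

# Toy stand-in for lap_times_df: two drivers, three laps of one race.
lap_times_demo = pd.DataFrame({
    "raceId":   [1, 1, 1, 1, 1, 1],
    "driverId": [10, 20, 10, 20, 10, 20],
    "lap":      [1, 1, 2, 2, 3, 3],
    "position": [1, 2, 1, 2, 2, 1],
})

# What the original cell computes: all laps on record per driver.
laps_completed = lap_times_demo.groupby("driverId").size()

# Laps led: keep only the rows where the driver was in first position.
laps_led = (
    lap_times_demo[lap_times_demo["position"] == 1]
    .groupby("driverId")
    .size()
    .reset_index(name="laps_led")
)
print(laps_led)
```

In this toy race both drivers complete three laps, but driver 10 leads two of them and driver 20 leads one.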


### Objective: Find out how many seasons Ferrari participated in and which seasons it missed.

In [41]:
total_seasons = seasons_df['year'].nunique()
ferrari_id = constructors_df[constructors_df['constructorRef'] == 'ferrari']['constructorId'].values[0]
merged_df = pd.merge(constructor_results_df, races_df[['raceId', 'year']], on='raceId')
ferrari_participation = merged_df[merged_df['constructorId'] == ferrari_id]
ferrari_unique_years = ferrari_participation['year'].nunique()
ferrari_years = ferrari_participation['year'].unique()
all_years = seasons_df['year'].unique()
missed_years = set(all_years) - set(ferrari_years)
print(f"Total Seasons: {total_seasons}")
print(f"Ferrari Participated in {ferrari_unique_years} Seasons")
print(f"Years Ferrari Did Not Participate: {sorted(missed_years)}")

Total Seasons: 75
Ferrari Participated in 67 Seasons
Years Ferrari Did Not Participate: [np.int64(1950), np.int64(1951), np.int64(1952), np.int64(1953), np.int64(1954), np.int64(1955), np.int64(1956), np.int64(1957)]
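
The missing years, 1950 to 1957, are exactly the seasons before the Constructors' Championship began in 1958, so the result most likely reflects the coverage of `constructor_results_df` rather than seasons Ferrari skipped. A sketch of an alternative, assuming `results_df` (one row per driver entry, with a `constructorId`) covers every season: derive participation from race entries instead. Toy tables stand in for the real ones, and converting to plain `int` avoids the `np.int64(...)` wrappers seen in the printed list.

```python
import pandas as pd

# Toy stand-ins for results_df and races_df.
results_demo = pd.DataFrame({
    "raceId":        [1, 2, 3, 3],
    "constructorId": [6, 6, 6, 9],
})
races_demo = pd.DataFrame({
    "raceId": [1, 2, 3, 4],
    "year":   [1950, 1950, 1958, 1959],
})
ferrari_id = 6  # illustrative; the notebook looks this up via constructorRef

# Attach the season to every race entry, then collect Ferrari's seasons.
entries = results_demo.merge(races_demo[["raceId", "year"]], on="raceId", how="left")
ferrari_years = set(entries.loc[entries["constructorId"] == ferrari_id, "year"])
all_years = set(races_demo["year"])

# int() gives plain Python integers for cleaner printing.
missed_years = sorted(int(year) for year in all_years - ferrari_years)
print(f"Years without a Ferrari entry: {missed_years}")
```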


# 1. Mathematics for Data Science

## Algebra Exercises

1. **Filter and Analyze Circuit Locations**
   - **Goal**: Identify all unique `location` and `country` pairs in the `circuits_df` table and create a table where each `country` has a count of `circuits` it contains.
   - **Hint**: Use grouping to count occurrences and display results.
   

### Solution: Analyzing Unique Location-Country Pairs and Circuit Counts

This code looks at the `circuits_df` DataFrame to find unique locations and countries, as well as how many circuits each country has.

### Steps

1. **Find Unique Location-Country Pairs**:
   - The code gets unique combinations of `location` and `country` from the `circuits_df` DataFrame. This helps us see where circuits are located without duplicates.

   ```python
   unique_location_country_pairs = circuits_df[['location', 'country']].drop_duplicates()
   ```
2. **Print Unique Pairs**:
    - It then prints these unique location-country pairs to show the different circuits.

    ```python
    print("Unique location-country pairs:")
    print(unique_location_country_pairs)
    ```
3. **Count Circuits per Country**:
    - The code groups the data by country and counts the number of unique circuits in each country. This tells us how many circuits each country has.

    ```python
    country_circuit_count = circuits_df.groupby('country')['circuitId'].nunique().reset_index(name='circuit_count')
    ```
4. **Print Circuit Count**:
    - Finally, it prints the count of circuits for each country.

    ```python
    print("Circuit count per country:")
    print(country_circuit_count)
    ```

In [42]:
unique_location_country_pairs = circuits_df[['location', 'country']].drop_duplicates()
print("Unique location-country pairs:")
print(unique_location_country_pairs)
country_circuit_count = circuits_df.groupby('country')['circuitId'].nunique().reset_index(name='circuit_count')
print("Circuit count per country:")
print(country_circuit_count)

Unique location-country pairs:
        location       country
0      Melbourne     Australia
1   Kuala Lumpur      Malaysia
2         Sakhir       Bahrain
3       Montmeló         Spain
4       Istanbul        Turkey
..           ...           ...
72      Portimão      Portugal
73       Mugello         Italy
74        Jeddah  Saudi Arabia
75     Al Daayen         Qatar
76         Miami           USA

[75 rows x 2 columns]
Circuit count per country:
          country  circuit_count
0       Argentina              1
1       Australia              2
2         Austria              2
3      Azerbaijan              1
4         Bahrain              1
5         Belgium              3
6          Brazil              2
7          Canada              3
8           China              1
9          France              7
10        Germany              3
11        Hungary              1
12          India              1
13          Italy              4
14          Japan              3
15          Korea  

2. **Matrix of Constructor Standings**
   - **Goal**: Create a matrix showing the `position` of constructors across multiple `races`. Each row represents a `raceId` and each column represents a `constructorId` from the `constructor_standings_df` table.
   - **Hint**: Use pivoting or matrix transformation functions to reshape data.

### Solution: Creating a Constructor Position Matrix

This code builds a matrix showing the positions of constructors across multiple races. Each row represents a race (`raceId`), and each column represents a constructor (`name`).

### Steps

1. **Merge Constructor Data**:
    - The `constructor_standings_df` is merged with the `constructors_df` to replace `constructorId` with the constructor's name.
    - This makes the data more understandable by using names instead of numeric IDs.

    ```python
    merged_df = constructor_standings_df.merge(
        constructors_df[['constructorId', 'name']],
        on='constructorId',
        how='left'
    )
    ```
2. **Create the Matrix**:
    - The data is pivoted into a matrix where:
        - Rows (`index`) represent `raceId` (the race).
        - Columns (`columns`) represent `name` (constructor name).
        - Values (`values`) represent the `position` of each constructor in the respective race.

    ```python
    constructor_position_matrix = merged_df.pivot(
        index='raceId',
        columns='name',
        values='position'
    )
    ```
3. **Handle Missing Values**:
    - Missing positions (where a constructor did not participate in a race) are filled with `"N/A"` for clarity.

    ```python
    constructor_position_matrix.fillna("N/A", inplace=True)
    ```
4. **Print the Matrix**:
    - The final matrix is printed to display the constructor positions across races.

    ```python
    print("Constructor Position Matrix (Rows: raceId, Columns: Constructor Name):")
    print(constructor_position_matrix)
    ```
In [43]:
merged_df = constructor_standings_df.merge(
    constructors_df[['constructorId', 'name']],
    on='constructorId',
    how='left'
)
constructor_position_matrix = merged_df.pivot(
    index='raceId',  
    columns='name',  
    values='position' 
)
constructor_position_matrix.fillna("N/A", inplace=True)
print("Constructor Position Matrix (Rows: raceId, Columns: Constructor Name):")
print(constructor_position_matrix)

Constructor Position Matrix (Rows: raceId, Columns: Constructor Name):
name    AGS  ATS Alfa Romeo AlphaTauri Alpine F1 Team Amon Apollon Arrows  \
raceId                                                                      
1       N/A  N/A        N/A        N/A            N/A  N/A     N/A    N/A   
2       N/A  N/A        N/A        N/A            N/A  N/A     N/A    N/A   
3       N/A  N/A        N/A        N/A            N/A  N/A     N/A    N/A   
4       N/A  N/A        N/A        N/A            N/A  N/A     N/A    N/A   
5       N/A  N/A        N/A        N/A            N/A  N/A     N/A    N/A   
...     ...  ...        ...        ...            ...  ...     ...    ...   
1128    N/A  N/A        N/A        N/A            9.0  N/A     N/A    N/A   
1129    N/A  N/A        N/A        N/A            8.0  N/A     N/A    N/A   
1130    N/A  N/A        N/A        N/A            7.0  N/A     N/A    N/A   
1131    N/A  N/A        N/A        N/A            8.0  N/A     N/A    N/A   
1132 

3. **Sum of Points by Driver**
   - **Goal**: Calculate the total `points` each driver has scored across all races using the `results_df` table.
   - **Hint**: Group by `driverId` and use aggregation to sum the `points`.

In [44]:
merged_df = results_df.merge(drivers_df[['driverId', 'forename', 'surname']], on='driverId', how='left')
merged_df['driver_name'] = merged_df['forename'] + ' ' + merged_df['surname']
total_driver_points = merged_df.groupby('driver_name')['points'].sum().reset_index()
total_driver_points.rename(columns={'points': 'total_points'}, inplace=True)
total_driver_points = total_driver_points.sort_values(by='total_points', ascending=False)
print("Total Points Scored by Each Driver:")
print(total_driver_points)

Total Points Scored by Each Driver:
          driver_name  total_points
521    Lewis Hamilton        4713.5
763  Sebastian Vettel        3098.0
569    Max Verstappen        2744.5
256   Fernando Alonso        2304.0
500    Kimi Räikkönen        1873.0
..                ...           ...
812         Tony Gaze           0.0
811        Tony Crook           0.0
820      Torsten Palm           0.0
821     Toshio Suzuki           0.0
15    Albert Scherrer           0.0

[859 rows x 2 columns]
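
Grouping by `driver_name` would merge any two drivers who share a full name into one row. A minimal sketch that follows the hint more literally, grouping by `driverId` and attaching names afterwards. The toy tables are illustrative, and the two drivers with identical names are invented to show the difference.

```python
import pandas as pd

# Toy stand-ins for results_df and drivers_df.
results_demo = pd.DataFrame({
    "driverId": [1, 1, 2, 3],
    "points":   [25.0, 18.0, 10.0, 4.0],
})
drivers_demo = pd.DataFrame({
    "driverId": [1, 2, 3],
    "forename": ["Ann", "Sam", "Sam"],
    "surname":  ["Lee", "Ray", "Ray"],  # drivers 2 and 3 share a name
})

# Sum points per driverId first, then join the names for display.
total_points = (
    results_demo.groupby("driverId", as_index=False)["points"].sum()
    .rename(columns={"points": "total_points"})
    .merge(drivers_demo, on="driverId", how="left")
    .sort_values(by="total_points", ascending=False)
)
total_points["driver_name"] = total_points["forename"] + " " + total_points["surname"]
print(total_points[["driverId", "driver_name", "total_points"]])
```

Here the two drivers named Sam Ray keep separate totals, whereas grouping by name would report a single combined row.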


4. **Eigenvalues of a Points Matrix**
   - **Goal**: Construct a 2x2 matrix of points scored by two constructors in two races from `constructor_results_df` and compute its eigenvalues.
   - **Hint**: Choose two specific `constructorId`s and `raceId`s for simplicity.


### Solution: Analyzing Constructor Points and Eigenvalues

This code calculates a matrix of points scored by two constructors in two races and computes the eigenvalues of that matrix. It uses the `constructor_results_df` DataFrame to derive the points.


### Steps:

1. **Group Data by Constructor and Race**:
    - The data is grouped by `constructorId` and `raceId`, and the total points scored by each constructor in each race are calculated.

    ```python
    selected_data = constructor_results_df.groupby(['constructorId', 'raceId'])['points'].sum().reset_index()
    ```
2. **Select Two Constructors and Two Races**:
    - Two constructors and two races are chosen for simplicity and analysis.

    ```python
    constructors = selected_data['constructorId'].unique()[:2]
    races = selected_data['raceId'].unique()[:2]
    ```
3. **Construct the Points Matrix**:
    - A 2x2 matrix is created where each cell contains the points scored by a specific constructor in a specific race. If no points are available for a combination, it is set to 0.

    ```python
    points_matrix = np.zeros((2, 2))
    for i, constructor in enumerate(constructors):
        for j, race in enumerate(races):
            points = selected_data[
                (selected_data['constructorId'] == constructor) &
                (selected_data['raceId'] == race)
            ]['points']
            points_matrix[i, j] = points.values[0] if not points.empty else 0
    ```
4. **Compute Eigenvalues**:
    - The eigenvalues of the points matrix are calculated using NumPy's `eigvals` function. These eigenvalues provide mathematical insights into the matrix.

    ```python
    eigenvalues = np.linalg.eigvals(points_matrix)
    ```
5. **Print the Results**:
    - The constructed points matrix and its eigenvalues are displayed.

    ```python
    print("Points Matrix:")
    print(points_matrix)
    print("\nEigenvalues:")
    print(eigenvalues)
    ```

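### Explanation of the Output:

#### **Points Matrix**:
- The matrix represents the points scored by two constructors in two races.
- Each row is a constructor, and each column is a race, because the code fills `points_matrix[i, j]` with constructor `i` and race `j`.
- Example:
  - Constructor 1 scored **0 points** in Race 1 and **1 point** in Race 2.
  - Constructor 2 scored **0 points** in Race 1 and **4 points** in Race 2.

#### **Eigenvalues**:
- The matrix is upper triangular (the entry below the diagonal is 0), so its eigenvalues are its diagonal entries.
- For this matrix:
  - **0** shows that the matrix is singular: its first column is all zeros because neither constructor scored in Race 1.
  - **4** equals the second diagonal entry, Constructor 2's score in Race 2.

The eigenvalues describe the matrix as a linear transformation, not the constructors' overall performance. To compare the constructors directly, use the row sums (1 point versus 4 points).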
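
As a check on the eigenvalues, recall that a 2x2 matrix A has the characteristic polynomial `lambda^2 - trace(A) * lambda + det(A)`. The sketch below rebuilds the matrix from the printed output, finds the roots of that polynomial, and compares them with `np.linalg.eigvals`. It illustrates the calculation and is not part of the original notebook.

```python
import numpy as np

# Matrix copied from the printed output: rows are constructors, columns are races.
points_matrix = np.array([[0.0, 1.0],
                          [0.0, 4.0]])

# Trace and determinant of a 2x2 matrix, written out explicitly.
trace = points_matrix[0, 0] + points_matrix[1, 1]
det = (points_matrix[0, 0] * points_matrix[1, 1]
       - points_matrix[0, 1] * points_matrix[1, 0])

# Roots of lambda^2 - trace * lambda + det.
roots = np.sort(np.roots([1.0, -trace, det]))
eigenvalues = np.sort(np.linalg.eigvals(points_matrix))

print("Characteristic polynomial roots:", roots)
print("np.linalg.eigvals result:", eigenvalues)
```

Both approaches give 0 and 4, and the zero determinant confirms that the matrix is singular.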


In [45]:
selected_data = constructor_results_df.groupby(['constructorId', 'raceId'])['points'].sum().reset_index()
constructors = selected_data['constructorId'].unique()[:2]
races = selected_data['raceId'].unique()[:2]
points_matrix = np.zeros((2, 2))
for i, constructor in enumerate(constructors):
    for j, race in enumerate(races):
        points = selected_data[(selected_data['constructorId'] == constructor) & (selected_data['raceId'] == race)]['points']
        points_matrix[i, j] = points.values[0] if not points.empty else 0
eigenvalues = np.linalg.eigvals(points_matrix)
print("Points Matrix:")
print(points_matrix)
print("\nEigenvalues:")
print(eigenvalues)

Points Matrix:
[[0. 1.]
 [0. 4.]]

Eigenvalues:
[0. 4.]


## Calculus Exercises


5. **Rate of Change in Lap Time**
   - **Goal**: Given a `driverId`, calculate the rate of change of their `lap` times over successive laps in the `lap_times_df` table.
   - **Hint**: Use `diff()` to compute time differences between laps for the same `driverId`.

In [46]:
lap_times_df = lap_times_df.sort_values(by=['driverId', 'raceId', 'lap'])
# Group by driver and race so that laps from different races are never differenced against each other
lap_times_df['lap_time_change'] = lap_times_df.groupby(['driverId', 'raceId'])['milliseconds'].diff()
print("Rate of change of lap times for each driver:")
print(lap_times_df[['driverId', 'lap', 'milliseconds', 'lap_time_change']].tail())


Rate of change of lap times for each driver:
        driverId  lap  milliseconds  lap_time_change
564135       860   46         92749            136.0
564136       860   47         92583           -166.0
564137       860   48         92262           -321.0
564138       860   49         92484            222.0
564139       860   50         92186           -298.0


6. **Total Pit Stop Duration Over Time**
   - **Goal**: Calculate the total pit stop `duration` for a `driverId` over the course of a race, analyzing how pit stop duration changes across `laps` in `pit_stops_df`.
   - **Hint**: Use the cumulative sum and analyze derivatives of cumulative times to find patterns.

In [47]:
pit_stops_df['duration'] = pd.to_numeric(pit_stops_df['duration'], errors='coerce')
pit_stops_df['cumulative_duration'] = pit_stops_df.groupby(['driverId', 'raceId'])['duration'].cumsum()
pit_stops_df['duration_change'] = pit_stops_df.groupby(['driverId', 'raceId'])['cumulative_duration'].diff()
print("Pit stop analysis (total and change in duration across laps):")
print(pit_stops_df[['driverId', 'raceId', 'lap', 'duration', 'cumulative_duration', 'duration_change']].tail())


Pit stop analysis (total and change in duration across laps):
       driverId  raceId  lap  duration  cumulative_duration  duration_change
10985       807    1132   39    30.265               59.677           30.265
10986       840    1132   39    29.469               58.492           29.469
10987       839    1132   38    29.086              115.146           29.086
10988       815    1132   47    28.871              118.875           28.871
10989       832    1132   50    28.706               87.002           28.706
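
In the output above, `duration_change` equals `duration` on every row, because differencing a cumulative sum returns the original values. Two further caveats: `pd.to_numeric(..., errors='coerce')` turns any `duration` that is not a plain number of seconds into `NaN`, and the rows are not explicitly sorted by `stop`. A hedged sketch of one alternative, assuming the `milliseconds` column holds the same duration as an integer: build the cumulative total from `milliseconds` and take `diff()` of the per-stop duration itself. The toy rows and the `duration_s` column name are illustrative.

```python
import pandas as pd

# Toy stand-in for pit_stops_df, deliberately out of stop order.
pit_demo = pd.DataFrame({
    "raceId":       [1, 1, 1, 1],
    "driverId":     [7, 7, 7, 8],
    "stop":         [2, 1, 3, 1],
    "lap":          [30, 12, 45, 20],
    "milliseconds": [24000, 22500, 23100, 25000],
})

# Order stops within each driver's race, then convert to seconds.
pit_demo = pit_demo.sort_values(by=["raceId", "driverId", "stop"])
pit_demo["duration_s"] = pit_demo["milliseconds"] / 1000.0

per_driver_race = pit_demo.groupby(["raceId", "driverId"])["duration_s"]
# Running total of time spent in the pits during the race.
pit_demo["cumulative_duration"] = per_driver_race.cumsum()
# Change from the previous stop; NaN for each driver's first stop.
pit_demo["duration_change"] = per_driver_race.diff()
print(pit_demo)
```

A positive `duration_change` means the stop was slower than the driver's previous one in that race, and a negative value means it was faster.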


7. **Optimization of Fastest Lap Speed**
   - **Goal**: Identify the `fastestLapSpeed` from `results_df` for each `driverId` and find the lap where they achieved it to determine the race's optimal lap.
   - **Hint**: Use `groupby()` and `max()` to isolate the fastest lap speeds.

In [48]:
# Filter out rows where 'fastestLapSpeed' is NaN
valid_results_df = results_df.dropna(subset=['fastestLapSpeed'])

# Group by driverId to find the fastest lap speed and corresponding lap
fastest_lap = valid_results_df.groupby('driverId').apply(
    lambda x: x.loc[x['fastestLapSpeed'].idxmax(), ['raceId', 'fastestLap', 'fastestLapSpeed']]
).reset_index(drop=True)

# Rename columns for better readability
fastest_lap.rename(columns={'raceId': 'Race ID', 'fastestLap': 'Fastest Lap', 'fastestLapSpeed': 'Fastest Speed'}, inplace=True)

# Display the results
print("Fastest Lap Speed for Each Driver and Corresponding Lap:")
print(fastest_lap)


Fastest Lap Speed for Each Driver and Corresponding Lap:
     Race ID Fastest Lap Fastest Speed
0         12          \N            \N
1         62          \N            \N
2         55          \N            \N
3         29          \N            \N
4          1          \N            \N
..       ...         ...           ...
854     1089          41       240.750
855     1110          \N            \N
856     1128          \N            \N
857     1112          44       242.944
858     1122          50       241.103

[859 rows x 3 columns]
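
The `\N` entries in the output show that missing speeds are stored as the string `\N`, which `dropna` does not remove, and that the maximum is then taken over strings rather than numbers. A minimal sketch of a fix, assuming `\N` is the only non-numeric placeholder in the column: convert it with `pd.to_numeric(errors='coerce')` before filtering. Toy rows stand in for `results_df`.

```python
import pandas as pd

# Toy stand-in for results_df; \N marks a missing value, as in the output above.
results_demo = pd.DataFrame({
    "raceId":          [1, 2, 3, 4, 5],
    "driverId":        [1, 1, 2, 2, 3],
    "fastestLap":      ["40", "51", r"\N", "33", r"\N"],
    "fastestLapSpeed": ["210.5", "98.7", r"\N", "201.0", r"\N"],
})

# Convert to numbers; anything non-numeric (such as \N) becomes NaN.
speed = pd.to_numeric(results_demo["fastestLapSpeed"], errors="coerce")
valid = results_demo.assign(fastestLapSpeed=speed).dropna(subset=["fastestLapSpeed"])

# For each driver, pick the row holding their highest numeric speed.
best_rows = valid.loc[valid.groupby("driverId")["fastestLapSpeed"].idxmax()]
fastest_lap = best_rows[
    ["driverId", "raceId", "fastestLap", "fastestLapSpeed"]
].reset_index(drop=True)
print(fastest_lap)
```

Driver 3 has no recorded speed and drops out of the result, and driver 1's best is 210.5 rather than the string-wise maximum `"98.7"`.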


## Probability Fundamentals Exercises

8. **Probability of a Constructor Winning**
   - **Goal**: Calculate the probability of a specific `constructorId` having the highest `position` (winning) across all races in the `results_df` table.
   - **Hint**: Filter for the first position, count the occurrences, and divide by the total number of races.

In [49]:
winners_df = results_df[results_df['positionOrder'] == 1]
win_counts = winners_df['constructorId'].value_counts()
total_races = results_df['raceId'].nunique()
win_probabilities = (win_counts / total_races).reset_index()
win_probabilities.columns = ['constructorId', 'Win Probability']
win_probabilities = win_probabilities.merge(
    constructors_df[['constructorId', 'name']],
    on='constructorId',
    how='left'
)
win_probabilities = win_probabilities[['constructorId', 'name', 'Win Probability']]
win_probabilities.rename(columns={'name': 'Constructor Name'}, inplace=True)
print("Winning Probabilities for Each Constructor:")
print(win_probabilities)


Winning Probabilities for Each Constructor:
    constructorId Constructor Name  Win Probability
0               6          Ferrari         0.221024
1               1          McLaren         0.161725
2             131         Mercedes         0.114106
3               9         Red Bull         0.107817
4               3         Williams         0.102426
5              32       Team Lotus         0.040431
6               4          Renault         0.031447
7              22         Benetton         0.024259
8              34          Brabham         0.020665
9              25          Tyrrell         0.020665
10            172     Lotus-Climax         0.019766
11             66              BRM         0.015274
12            170    Cooper-Climax         0.010782
13             51       Alfa Romeo         0.009883
14            180       Lotus-Ford         0.009883
15            118          Vanwall         0.008985
16            196       Matra-Ford         0.008086
17             27   

9. **Conditional Probability of Qualifying Position**
   - **Goal**: Calculate the probability that a `driverId` qualifies in the top 3 (`position` <= 3) given that they participated in qualifying, using data from `qualifying_df`.
   - **Hint**: Calculate the proportion of `position` <= 3 among all qualifying entries for each driver.

In [50]:
# Filter qualifying data for top 3 positions
top_3_qualifying = qualifying_df[qualifying_df['position'] <= 3]

# Count the number of qualifying entries for each driver
total_qualifying_entries = qualifying_df.groupby('driverId').size()

# Count the number of top 3 qualifying entries for each driver
top_3_qualifying_entries = top_3_qualifying.groupby('driverId').size()

# Calculate the probability of qualifying in the top 3
qualifying_probabilities = (top_3_qualifying_entries / total_qualifying_entries).reset_index(name='Top 3 Probability')

# Merge with driver names for readability (optional, requires drivers_df)
qualifying_probabilities = qualifying_probabilities.merge(
    drivers_df[['driverId', 'forename', 'surname']],
    on='driverId',
    how='left'
)

# Create full driver name
qualifying_probabilities['Driver Name'] = qualifying_probabilities['forename'] + ' ' + qualifying_probabilities['surname']

# Reorganize columns
qualifying_probabilities = qualifying_probabilities[['driverId', 'Driver Name', 'Top 3 Probability']]

qualifying_probabilities_sorted = qualifying_probabilities.sort_values(by='Top 3 Probability', ascending=False)

# Display the results
print("Top 3 Qualifying Probabilities for Each Driver (Sorted Descending):")
print(qualifying_probabilities_sorted)


Top 3 Qualifying Probabilities for Each Driver (Sorted Descending):
     driverId     Driver Name  Top 3 Probability
99        102    Ayrton Senna           1.000000
0           1  Lewis Hamilton           0.639535
69         71      Damon Hill           0.542373
139       830  Max Verstappen           0.512690
92         95   Nigel Mansell           0.500000
..        ...             ...                ...
164       855     Guanyu Zhou                NaN
165       856   Nyck de Vries                NaN
167       858  Logan Sargeant                NaN
168       859     Liam Lawson                NaN
169       860  Oliver Bearman                NaN

[170 rows x 3 columns]
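
The `NaN` values at the bottom of the table arise because drivers with no top-3 result are absent from `top_3_qualifying_entries`, so the division has nothing to align with; their probability should be 0. A sketch of one way to get that, taking the per-driver mean of a boolean column. The toy data and the `top3` column name are illustrative. Reporting the number of entries alongside the probability also helps, since a value of exactly 1.0 may simply reflect very few qualifying sessions.

```python
import pandas as pd

# Toy stand-in for qualifying_df.
qualifying_demo = pd.DataFrame({
    "driverId": [1, 1, 1, 2, 2, 3],
    "position": [1, 4, 2, 10, 12, 3],
})

# True where the driver qualified in the top 3.
qualifying_demo["top3"] = qualifying_demo["position"] <= 3

# The mean of a boolean column is the share of True values, so drivers
# with no top-3 result get 0.0 instead of NaN.
top3_prob = (
    qualifying_demo.groupby("driverId")["top3"]
    .agg(entries="size", top3_probability="mean")
    .reset_index()
    .sort_values(by="top3_probability", ascending=False)
)
print(top3_prob)
```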
