# F1 Data Science Project


## Data Science Learning Path

### 1. Mathematics for Data Science
   - Algebra: Linear equations, matrix operations.
   - Calculus: Derivatives, optimization.
   - Probability fundamentals: Basic probability, conditional probability.

### 2. Statistics Basics
   - Descriptive statistics: Mean, median, mode, standard deviation, variance.
   - Probability distributions: Normal, binomial, Poisson.
   - Inferential statistics: Sampling, estimation.

### 3. Python Programming
   - Data types and structures: Lists, tuples, dictionaries, sets.
   - Control structures: Loops, conditionals.
   - Functions and modules.
   - Data science libraries: Pandas, NumPy, Matplotlib.

### 4. Data Wrangling and Cleaning
   - Handling missing values: Imputation, removal.
   - Handling outliers: Detection and treatment.
   - Data transformation: Scaling, normalization, encoding categorical variables.

### 5. Exploratory Data Analysis (EDA)
   - Summary statistics: Mean, median, skewness, kurtosis.
   - Data visualization: Histogram, boxplot, pairplot.
   - Feature relationships: Correlation analysis, scatter plots.

### 6. Probability and Probability Distributions
   - Basic probability: Rules of probability, Bayes' theorem.
   - Probability distributions: Normal, binomial, Poisson, uniform distributions.
   - Sampling methods: Random, stratified, cluster sampling.

### 7. Hypothesis Testing
   - Basics: Null and alternative hypotheses.
   - p-values and confidence intervals.
   - Types of tests: t-tests, chi-square tests, ANOVA.

### 8. Data Visualization
   - Basic plots: Histogram, scatter plot, line plot.
   - Advanced visualizations: Heatmap, pairplot, violin plot.
   - Interactive visualizations: Plotly, Dash, Tableau basics.

### 9. Linear Regression
   - Simple linear regression.
   - Multiple linear regression.
   - Evaluation metrics: Mean Absolute Error (MAE), Mean Squared Error (MSE), R-squared.

### 10. Logistic Regression
- Binary classification.
- Sigmoid function and decision boundary.
- Model interpretation and performance metrics.

### 11. Decision Trees and Random Forests
- Basics of decision trees: Splitting, pruning, information gain.
- Random forests: Ensemble learning, bagging.
- Hyperparameters for tuning: Max depth, min samples split.

### 12. k-Nearest Neighbors (kNN)
- Distance metrics: Euclidean, Manhattan.
- Choosing k and model performance.
- Applications: Classification, regression.

### 13. Model Evaluation Metrics
- Classification metrics: Accuracy, precision, recall, F1-score, ROC-AUC.
- Regression metrics: MAE, MSE, RMSE.
- Cross-validation techniques.

### 14. Clustering Algorithms
- K-means clustering: Choosing k, cluster evaluation.
- Hierarchical clustering: Dendrograms, agglomerative and divisive methods.
- Evaluation metrics: Silhouette score, Davies-Bouldin index.

### 15. Dimensionality Reduction
- Principal Component Analysis (PCA): Eigenvalues, eigenvectors.
- t-SNE: Visualization of high-dimensional data.
- Application of dimensionality reduction in preprocessing.

### 16. Hyperparameter Tuning
- Grid Search and Random Search.
- Cross-validation: k-Fold, Leave-One-Out.
- Tuning with libraries: Scikit-Learn’s GridSearchCV.

### 17. Neural Networks
- Basics of neural networks: Perceptron, activation functions.
- Backpropagation and gradient descent.
- Types of layers: Input, hidden, output.

### 18. Deep Learning with CNNs and RNNs
- Convolutional Neural Networks (CNNs): Convolutional layers, pooling.
- Recurrent Neural Networks (RNNs): Sequence data, LSTM, GRU.
- Applications: Image classification, natural language processing.

### 19. Natural Language Processing (NLP)
- Text preprocessing: Tokenization, stemming, lemmatization.
- Vectorization methods: Bag-of-Words, TF-IDF.
- Advanced NLP: Word embeddings, language models (BERT, GPT).

### 20. Model Deployment and Monitoring
- Model deployment: Flask, FastAPI, Docker.
- Cloud platforms: AWS, GCP, Azure for model deployment.
- Monitoring models: Performance tracking, retraining triggers.


## Loading Data, Libraries, and Tables

### List of Tables and Columns

| Table                     | Columns                                                                                                                                                                        |
|---------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **circuits_df**           | `circuitId`, `circuitRef`, `name`, `location`, `country`, `lat`, `lng`, `alt`, `url`                                                                                           |
| **constructor_results_df** | `constructorResultsId`, `raceId`, `constructorId`, `points`, `status`                                                                                                          |
| **constructor_standings_df** | `constructorStandingsId`, `raceId`, `constructorId`, `points`, `position`, `positionText`, `wins`                                                                        |
| **lap_times_df**          | `raceId`, `driverId`, `lap`, `position`, `time`, `milliseconds`                                                                                                                |
| **pit_stops_df**          | `raceId`, `driverId`, `stop`, `lap`, `time`, `duration`, `milliseconds`                                                                                                        |
| **qualifying_df**         | `qualifyId`, `raceId`, `driverId`, `constructorId`, `number`, `position`, `q1`, `q2`, `q3`                                                                                     |
| **results_df**            | `resultId`, `raceId`, `driverId`, `constructorId`, `number`, `grid`, `position`, `positionText`, `positionOrder`, `points`, `laps`, `time`, `milliseconds`, `fastestLap`, `rank`, `fastestLapTime`, `fastestLapSpeed`, `statusId` |
| **seasons_df**            | `year`, `url`                                                                                                                                                                  |
| **sprint_results_df**     | `resultId`, `raceId`, `driverId`, `constructorId`, `number`, `grid`, `position`, `positionText`, `positionOrder`, `points`, `laps`, `time`, `milliseconds`, `fastestLap`, `fastestLapTime`, `statusId` |
| **status_df**             | `statusId`, `status`                                                                                                                                                           |
| **drivers_df**            | `driverId`, `driverRef`, `number`, `code`, `forename`, `surname`, `dob`, `nationality`, `url`                                                                                  |
| **races_df**              | `raceId`, `year`, `round`, `circuitId`, `name`, `date`, `time`, `url`, `fp1_date`, `fp1_time`, `fp2_date`, `fp2_time`, `fp3_date`, `fp3_time`, `quali_date`, `quali_time`, `sprint_date`, `sprint_time` |
| **constructors_df**       | `constructorId`, `constructorRef`, `name`, `nationality`, `url`                                                                                                                |
| **driver_standings_df**   | `driverStandingsId`, `raceId`, `driverId`, `points`, `position`, `positionText`, `wins`                                                                                        |


1. **Importing Libraries:**

  -   **Pandas** (`import pandas as pd`):  
  Pandas is like an advanced spreadsheet tool that allows us to load, manipulate, and analyze large sets of data quickly.

  -   **Seaborn** (`import seaborn as sns`):  
  Seaborn is a tool for making nice-looking charts and graphs. It builds on top of another tool (Matplotlib) to make visualizations prettier and easier to create.

  -   **Matplotlib** (`import matplotlib.pyplot as plt` and `import matplotlib`):  
  This is a library for creating plots and charts in Python. Think of it like drawing tools that help us visualize data.

  -   **NumPy** (`import numpy as np`):  
  NumPy is used for handling numbers and calculations in a more efficient way. It’s great for working with large groups of numbers, especially in math-heavy tasks.

  -   **Scikit-Learn** (`from sklearn.model_selection import train_test_split`):  
  This is a popular library for machine learning. Its `train_test_split` function splits data into training and testing parts, which is a key step in training predictive models.

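2. **Setting Up Warnings:**

  Calling `warnings.simplefilter("ignore")` suppresses all Python warnings. This keeps the notebook output tidy, but it can also hide deprecation notices.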

```python
import warnings
warnings.simplefilter("ignore")
```

3. **Printing Version Information:**
```python
print("Pandas version:", pd.__version__)
print("Seaborn version:", sns.__version__)
print("Matplotlib version:", matplotlib.__version__)
print("NumPy version:", np.__version__)
```

These lines display the version of each library in use, which helps keep track of the exact setup, since different versions might have small differences in functionality.

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import matplotlib
from sklearn.model_selection import train_test_split
import warnings
warnings.simplefilter("ignore")
pd.set_option('display.max_columns', None)
print("Pandas version:", pd.__version__)
print("Seaborn version:", sns.__version__)
print("Matplotlib version:", matplotlib.__version__)
print("NumPy version:", np.__version__)


In [None]:
circuits_df = pd.read_csv(r'C:\Dataset\F1 Data set\circuits.csv')
constructor_results_df = pd.read_csv(r'C:\Dataset\F1 Data set\constructor_results.csv')
constructor_standings_df = pd.read_csv(r'C:\Dataset\F1 Data set\constructor_standings.csv')
lap_times_df = pd.read_csv(r'C:\Dataset\F1 Data set\lap_times.csv')
pit_stops_df = pd.read_csv(r'C:\Dataset\F1 Data set\pit_stops.csv')
qualifying_df = pd.read_csv(r'C:\Dataset\F1 Data set\qualifying.csv')
results_df = pd.read_csv(r'C:\Dataset\F1 Data set\results.csv')
seasons_df = pd.read_csv(r'C:\Dataset\F1 Data set\seasons.csv')
sprint_results_df = pd.read_csv(r'C:\Dataset\F1 Data set\sprint_results.csv')
status_df = pd.read_csv(r'C:\Dataset\F1 Data set\status.csv')
drivers_df = pd.read_csv(r'C:\Dataset\F1 Data set\drivers.csv')
races_df = pd.read_csv(r'C:\Dataset\F1 Data set\races.csv')
constructors_df = pd.read_csv(r'C:\Dataset\F1 Data set\constructors.csv')
driver_standings_df = pd.read_csv(r'C:\Dataset\F1 Data set\driver_standings.csv')
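
The fourteen `read_csv` calls above can also be written as a loop. The following is a minimal sketch, not the notebook's own method: the helper name `load_tables` is hypothetical, and the `na_values` setting assumes that the CSV files mark missing values with the literal text `\N`, which should be checked against the actual files.

```python
from pathlib import Path

import pandas as pd


def load_tables(data_dir, table_names):
    """Read <name>.csv from data_dir into a dict keyed as '<name>_df'."""
    data_dir = Path(data_dir)
    tables = {}
    for name in table_names:
        # Assumption: missing values appear as the literal text \N in the files
        tables[f"{name}_df"] = pd.read_csv(data_dir / f"{name}.csv", na_values=[r"\N"])
    return tables
```

For example, `load_tables(r'C:\Dataset\F1 Data set', ['circuits', 'drivers'])` would return a dictionary with the keys `circuits_df` and `drivers_df`, which has the same shape as the `dataframes` dictionary built later in this notebook.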


1. **Creating a Dictionary of DataFrames**:
   - A dictionary called `dataframes` is created, where each key-value pair represents a table name and its corresponding DataFrame. This setup makes it easy to iterate over multiple tables.

2. **Looping to Display Shapes and Sample Rows**:
   - The first loop iterates over each DataFrame in the dictionary, printing:
     - The name of the DataFrame.
     - The shape of the DataFrame, which shows the number of rows and columns.
     - The first few rows of data using `head()`, giving a sample preview of the data.

3. **Looping to Display Column Names**:
   - The second loop iterates over each DataFrame again to print:
     - The name of each table.
     - A list of the column names in each DataFrame.
     - This part is helpful for understanding the structure of each table and identifying available fields for analysis or further processing.

List of Data Frames and their corresponding names.

In [None]:

dataframes = {
    'circuits_df': circuits_df,
    'constructor_results_df': constructor_results_df,
    'constructor_standings_df': constructor_standings_df,
    'lap_times_df': lap_times_df,
    'pit_stops_df': pit_stops_df,
    'qualifying_df': qualifying_df,
    'results_df': results_df,
    'seasons_df': seasons_df,
    'sprint_results_df': sprint_results_df,
    'status_df': status_df,
    'drivers_df': drivers_df,
    'races_df': races_df,
    'constructors_df': constructors_df,
    'driver_standings_df': driver_standings_df,
}
for name, df in dataframes.items():
    print(f"{name}: shape {df.shape}")
    print(df.head(), "\n")
for table_name, df in dataframes.items():
    print(f"Table: {table_name}")
    print("Columns:", list(df.columns))
    print()


## Cross Table Data Analysis


### Objective: Find out which country has the most and least circuits.

In [None]:
top_circuits = circuits_df.groupby('country').size().reset_index(name='circuit_count')
top_circuits = top_circuits.sort_values(by='circuit_count', ascending=False)
top_5_countries = top_circuits.head(5)
bottom_5_countries = top_circuits.tail(5)
print(top_5_countries)
print(bottom_5_countries)

### Objective: Find out Top 5 Most Successful Drivers and Top 5 Drivers Who Led the Most Laps.

In [None]:
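# Step 1: Count race wins per driver (positionOrder == 1 marks the winner)
wins = results_df[results_df['positionOrder'] == 1].groupby('driverId').size().reset_index(name='wins')

# Step 2: Count laps led by each driver (only laps where the driver was in position 1)
laps_led = lap_times_df[lap_times_df['position'] == 1].groupby('driverId').size().reset_index(name='laps_led')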

# Step 3: Merge both results to have wins and laps led in one DataFrame
combined = pd.merge(wins, laps_led, on='driverId', how='outer').fillna(0)

# Step 4: Get top 5 drivers by wins
top_5_winners = combined.nlargest(5, 'wins')

# Step 5: Get top 5 drivers by laps led
top_5_laps_led = combined.nlargest(5, 'laps_led')

# Assuming you have a drivers_df to get driver names
driver_names = drivers_df.set_index('driverId')

# Step 6: Prepare tables for display
top_5_winners['Driver'] = top_5_winners['driverId'].map(driver_names['forename'] + ' ' + driver_names['surname'])
top_5_laps_led['Driver'] = top_5_laps_led['driverId'].map(driver_names['forename'] + ' ' + driver_names['surname'])

# Final tables
print("Top 5 Most Successful Drivers (Wins):")
print(top_5_winners[['Driver', 'wins']].to_string(index=False))

print("\nTop 5 Drivers Who Led the Most Laps:")
print(top_5_laps_led[['Driver', 'laps_led']].to_string(index=False))

### Objective: Find out how many seasons Ferrari participated in and which seasons it missed.

In [None]:
total_seasons = seasons_df['year'].nunique()
ferrari_id = constructors_df[constructors_df['constructorRef'] == 'ferrari']['constructorId'].values[0]
merged_df = pd.merge(constructor_results_df, races_df[['raceId', 'year']], on='raceId')
ferrari_participation = merged_df[merged_df['constructorId'] == ferrari_id]
ferrari_unique_years = ferrari_participation['year'].nunique()
ferrari_years = ferrari_participation['year'].unique()
all_years = seasons_df['year'].unique()
missed_years = set(all_years) - set(ferrari_years)
print(f"Total Seasons: {total_seasons}")
print(f"Ferrari Participated in {ferrari_unique_years} Seasons")
print(f"Years Ferrari Did Not Participate: {sorted(missed_years)}")

# 1. Mathematics for Data Science

## Algebra Exercises

1. **Filter and Analyze Circuit Locations**
   - **Goal**: Identify all unique `location` and `country` pairs in the `circuits_df` table and create a table where each `country` has a count of `circuits` it contains.
   - **Hint**: Use grouping to count occurrences and display results.
   

### Solution: Analyzing Unique Location-Country Pairs and Circuit Counts

This code looks at the `circuits_df` DataFrame to find unique locations and countries, as well as how many circuits each country has.

### Steps

1. **Find Unique Location-Country Pairs**:
   - The code gets unique combinations of `location` and `country` from the `circuits_df` DataFrame. This helps us see where circuits are located without duplicates.

   ```python
   unique_location_country_pairs = circuits_df[['location', 'country']].drop_duplicates()
   ```
2. **Print Unique Pairs**:
    - It then prints these unique location-country pairs to show the different circuits.

    ```python
    print("Unique location-country pairs:")
    print(unique_location_country_pairs)
    ```
3. **Count Circuits per Country**:
    - The code groups the data by country and counts the number of unique circuits in each country. This tells us how many circuits each country has.

    ```python
    country_circuit_count = circuits_df.groupby('country')['circuitId'].nunique().reset_index(name='circuit_count')
    ```
4. **Print Circuit Count**:
    - Finally, it prints the count of circuits for each country.

    ```python
    print("Circuit count per country:")
    print(country_circuit_count)
    ```

In [None]:
unique_location_country_pairs = circuits_df[['location', 'country']].drop_duplicates()
print("Unique location-country pairs:")
print(unique_location_country_pairs)
country_circuit_count = circuits_df.groupby('country')['circuitId'].nunique().reset_index(name='circuit_count')
print("Circuit count per country:")
print(country_circuit_count)

2. **Matrix of Constructor Standings**
   - **Goal**: Create a matrix showing the `position` of constructors across multiple `races`. Each row represents a `raceId` and each column represents a `constructorId` from the `constructor_standings_df` table.
   - **Hint**: Use pivoting or matrix transformation functions to reshape data.

### Solution: Creating a Constructor Position Matrix

This code builds a matrix showing the positions of constructors across multiple races. Each row represents a race (`raceId`), and each column represents a constructor (`name`).

### Steps

1. **Merge Constructor Data**:
   - The `constructor_standings_df` is merged with the `constructors_df` to replace `constructorId` with the constructor's name.
   - This makes the data more understandable by using names instead of numeric IDs.

   ```python
   merged_df = constructor_standings_df.merge(
       constructors_df[['constructorId', 'name']],
       on='constructorId',
       how='left'
   )
   ```
2. **Create the Matrix**:
   - The data is pivoted into a matrix where:
     - Rows (`index`) represent `raceId` (the race).
     - Columns (`columns`) represent `name` (the constructor name).
     - Values (`values`) represent the `position` of each constructor in the respective race.

   ```python
   constructor_position_matrix = merged_df.pivot(
       index='raceId',
       columns='name',
       values='position'
   )
   ```
3. **Handle Missing Values**:
   - Missing positions (where a constructor did not participate in a race) are filled with `"N/A"` for clarity.

   ```python
   constructor_position_matrix.fillna("N/A", inplace=True)
   ```
4. **Print the Matrix**:
   - The final matrix is printed to display the constructor positions across races.

   ```python
   print("Constructor Position Matrix (Rows: raceId, Columns: Constructor Name):")
   print(constructor_position_matrix)
   ```

In [None]:
merged_df = constructor_standings_df.merge(
    constructors_df[['constructorId', 'name']],
    on='constructorId',
    how='left'
)
constructor_position_matrix = merged_df.pivot(
    index='raceId',  
    columns='name',  
    values='position' 
)
constructor_position_matrix.fillna("N/A", inplace=True)
print("Constructor Position Matrix (Rows: raceId, Columns: Constructor Name):")
print(constructor_position_matrix)
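
Filling the gaps with the string `"N/A"` turns every column that has a gap into an `object` column, which makes later arithmetic on positions awkward. As an alternative, here is a minimal sketch that keeps the matrix numeric. The miniature `standings` table and its team names are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical miniature standings table, for illustration only
standings = pd.DataFrame({
    "raceId": [1, 1, 2],
    "name": ["Team A", "Team B", "Team A"],
    "position": [1, 2, 1],
})

matrix = standings.pivot(index="raceId", columns="name", values="position")

# Gaps stay missing; the nullable Int64 dtype keeps positions as integers
# and displays missing entries as <NA> instead of converting them to text
matrix = matrix.astype("Int64")
print(matrix)
```

With this layout, calls such as `matrix.mean()` still work, because missing entries are skipped rather than treated as text.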

3. **Sum of Points by Driver**
   - **Goal**: Calculate the total `points` each driver has scored across all races using the `results_df` table.
   - **Hint**: Group by `driverId` and use aggregation to sum the `points`.

In [None]:
merged_df = results_df.merge(drivers_df[['driverId', 'forename', 'surname']], on='driverId', how='left')
merged_df['driver_name'] = merged_df['forename'] + ' ' + merged_df['surname']
total_driver_points = merged_df.groupby('driver_name')['points'].sum().reset_index()
total_driver_points.rename(columns={'points': 'total_points'}, inplace=True)
total_driver_points = total_driver_points.sort_values(by='total_points', ascending=False)
print("Total Points Scored by Each Driver:")
print(total_driver_points)
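
The cell above groups by the concatenated driver name, so two different drivers who share a name would be merged into one row. A minimal sketch of a safer variant sums on `driverId` first and attaches names afterwards. The small `drivers` and `results` tables below are hypothetical stand-ins for `drivers_df` and `results_df`.

```python
import pandas as pd

# Hypothetical data: drivers 1 and 2 are different people with the same name
drivers = pd.DataFrame({
    "driverId": [1, 2, 3],
    "forename": ["Alex", "Alex", "Sam"],
    "surname": ["Smith", "Smith", "Jones"],
})
results = pd.DataFrame({
    "driverId": [1, 1, 2, 3],
    "points": [10.0, 8.0, 6.0, 4.0],
})

# Sum on the unique key first, then attach names for display only
totals = (
    results.groupby("driverId", as_index=False)["points"].sum()
    .rename(columns={"points": "total_points"})
    .merge(drivers, on="driverId", how="left")
)
totals["driver_name"] = totals["forename"] + " " + totals["surname"]
totals = totals.sort_values("total_points", ascending=False)
print(totals[["driverId", "driver_name", "total_points"]])
```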

4. **Eigenvalues of a Points Matrix**
   - **Goal**: Construct a 2x2 matrix of points scored by two constructors in two races from `constructor_results_df` and compute its eigenvalues.
   - **Hint**: Choose two specific `constructorId`s and `raceId`s for simplicity.


### Solution: Analyzing Constructor Points and Eigenvalues

This code calculates a matrix of points scored by two constructors in two races and computes the eigenvalues of that matrix. It uses the `constructor_results_df` DataFrame to derive the points.


### Steps:

1. **Group Data by Constructor and Race**:
   - The data is grouped by `constructorId` and `raceId`, and the total points scored by each constructor in each race are calculated.

   ```python
   selected_data = constructor_results_df.groupby(['constructorId', 'raceId'])['points'].sum().reset_index()
   ```
2. **Select Two Constructors and Two Races**:
   - Two constructors and two races are chosen for simplicity and analysis.

   ```python
   constructors = selected_data['constructorId'].unique()[:2]
   races = selected_data['raceId'].unique()[:2]
   ```
3. **Construct the Points Matrix**:
   - A 2x2 matrix is created where each cell contains the points scored by a specific constructor in a specific race. If no points are available for a combination, it is set to 0.

   ```python
   points_matrix = np.zeros((2, 2))
   for i, constructor in enumerate(constructors):
       for j, race in enumerate(races):
           points = selected_data[
               (selected_data['constructorId'] == constructor) &
               (selected_data['raceId'] == race)
           ]['points']
           points_matrix[i, j] = points.values[0] if not points.empty else 0
   ```
4. **Compute Eigenvalues**:
   - The eigenvalues of the points matrix are calculated using NumPy's `eigvals` function. These eigenvalues provide mathematical insights into the matrix.

   ```python
   eigenvalues = np.linalg.eigvals(points_matrix)
   ```
5. **Print the Results**:
   - The constructed points matrix and its eigenvalues are displayed.

   ```python
   print("Points Matrix:")
   print(points_matrix)
   print("\nEigenvalues:")
   print(eigenvalues)
   ```

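### Explanation of the Output:

#### **Points Matrix**:
- The matrix represents the points scored by two constructors in two races.
- Each row is a constructor, and each column is a race (the code fills `points_matrix[i, j]` with constructor `i` and race `j`).
- Example:
  - Constructor 1 scored **0 points** in Race 1 and **1 point** in Race 2.
  - Constructor 2 scored **0 points** in Race 1 and **4 points** in Race 2.

#### **Eigenvalues**:
- Eigenvalues are the scalars λ for which `A v = λ v` has a non-zero solution `v`; they describe the matrix as a linear transformation.
- The example matrix is upper triangular, so its eigenvalues are simply its diagonal entries:
  - **0** is Constructor 1's score in Race 1.
  - **4** is Constructor 2's score in Race 2.

The eigenvalues of a points matrix have no direct sporting interpretation, so treat this as a linear algebra exercise. To see which constructor performed better overall, compare the row totals instead.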


In [None]:
selected_data = constructor_results_df.groupby(['constructorId', 'raceId'])['points'].sum().reset_index()
constructors = selected_data['constructorId'].unique()[:2]
races = selected_data['raceId'].unique()[:2]
points_matrix = np.zeros((2, 2))
for i, constructor in enumerate(constructors):
    for j, race in enumerate(races):
        points = selected_data[(selected_data['constructorId'] == constructor) & (selected_data['raceId'] == race)]['points']
        points_matrix[i, j] = points.values[0] if not points.empty else 0
eigenvalues = np.linalg.eigvals(points_matrix)
print("Points Matrix:")
print(points_matrix)
print("\nEigenvalues:")
print(eigenvalues)
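
For a 2x2 matrix, the result of `np.linalg.eigvals` can be checked by hand, because the eigenvalues are the roots of λ² − tr(A)·λ + det(A) = 0. The sketch below applies that check to the example values quoted in the explanation (0 and 1 for the first constructor, 0 and 4 for the second). The real data may produce a different matrix.

```python
import numpy as np

# Example values from the explanation: rows are constructors, columns are races
A = np.array([[0.0, 1.0],
              [0.0, 4.0]])

trace = np.trace(A)
det = np.linalg.det(A)

# Roots of lambda^2 - trace * lambda + det = 0 (complex sqrt handles negative discriminants)
disc = np.sqrt(complex(trace**2 - 4 * det))
by_formula = np.array([(trace - disc) / 2, (trace + disc) / 2])
by_numpy = np.linalg.eigvals(A)

print(np.sort_complex(by_formula))
print(np.sort_complex(by_numpy))
```

Both lines should list the same two values, which confirms that the library result matches the characteristic polynomial.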

## Calculus Exercises

5. **Rate of Change in Lap Time**
   - **Goal**: Given a `driverId`, calculate the rate of change of their `lap` times over successive laps in the `lap_times_df` table.
   - **Hint**: Use `diff()` to compute time differences between laps for the same `driverId`.
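
One possible approach to this exercise is sketched below, under stated assumptions: the small `lap_times` table is hypothetical and stands in for `lap_times_df`, and the `milliseconds` column is used as the lap time. Grouping by race and driver before calling `diff()` prevents the difference from running across two different races.

```python
import pandas as pd

# Hypothetical lap times, standing in for lap_times_df
lap_times = pd.DataFrame({
    "raceId": [1, 1, 1, 1, 1, 1],
    "driverId": [7, 7, 7, 8, 8, 8],
    "lap": [1, 2, 3, 1, 2, 3],
    "milliseconds": [95000, 93500, 94000, 96000, 95500, 95000],
})

lap_times = lap_times.sort_values(["raceId", "driverId", "lap"])

# Change in lap time versus the previous lap of the same driver in the same race;
# the first lap of each group has no predecessor and is left as NaN
lap_times["delta_ms"] = lap_times.groupby(["raceId", "driverId"])["milliseconds"].diff()

driver_id = 7  # hypothetical driver chosen for the example
print(lap_times[lap_times["driverId"] == driver_id])
```

A negative `delta_ms` means the lap was faster than the previous one.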

---

6. **Total Pit Stop Duration Over Time**
   - **Goal**: Calculate the total pit stop `duration` for a `driverId` over the course of a race, analyzing how pit stop duration changes across `laps` in `pit_stops_df`.
   - **Hint**: Use the cumulative sum and analyze derivatives of cumulative times to find patterns.
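
A minimal sketch of the cumulative-sum idea follows. The `pit_stops` table is hypothetical and stands in for `pit_stops_df`. It uses `milliseconds` rather than `duration` so that the values are guaranteed to be numeric, which is an assumption about how the real columns are stored.

```python
import pandas as pd

# Hypothetical pit stops for one race, standing in for pit_stops_df
pit_stops = pd.DataFrame({
    "raceId": [1, 1, 1, 1],
    "driverId": [7, 7, 7, 8],
    "stop": [1, 2, 3, 1],
    "lap": [12, 30, 45, 20],
    "milliseconds": [23000, 24500, 22000, 25000],
})

pit_stops = pit_stops.sort_values(["raceId", "driverId", "lap"])
keys = ["raceId", "driverId"]

# Running total of time spent in the pits for each driver in each race
pit_stops["cumulative_ms"] = pit_stops.groupby(keys)["milliseconds"].cumsum()

# Discrete derivative: growth of the running total per lap since the previous stop
grouped = pit_stops.groupby(keys)
pit_stops["rate_ms_per_lap"] = grouped["cumulative_ms"].diff() / grouped["lap"].diff()

print(pit_stops)
```

The first stop of each driver has no earlier stop to compare against, so its rate is left as NaN.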

---

7. **Optimization of Fastest Lap Speed**
   - **Goal**: Identify the `fastestLapSpeed` from `results_df` for each `driverId` and find the lap where they achieved it to determine the race's optimal lap.
   - **Hint**: Use `groupby()` and `max()` to isolate the fastest lap speeds.
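
The hint mentions `max()`, which returns the speed but not the row it came from. The sketch below uses `idxmax()` instead, so that the matching `raceId` and `fastestLap` can be read from the same row. The `results` table is hypothetical, and the conversion step assumes that speeds may be stored as text with `\N` for missing values.

```python
import pandas as pd

# Hypothetical results, standing in for results_df
results = pd.DataFrame({
    "raceId": [1, 2, 1, 2],
    "driverId": [7, 7, 8, 8],
    "fastestLap": [41, 52, 39, 50],
    "fastestLapSpeed": ["210.5", "215.2", r"\N", "212.0"],
})

# Assumption: speeds may arrive as text; unparseable entries become NaN
results["fastestLapSpeed"] = pd.to_numeric(results["fastestLapSpeed"], errors="coerce")

# Row label of each driver's highest recorded speed
valid = results.dropna(subset=["fastestLapSpeed"])
best_idx = valid.groupby("driverId")["fastestLapSpeed"].idxmax()

best = results.loc[best_idx, ["driverId", "raceId", "fastestLap", "fastestLapSpeed"]]
print(best)
```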

---


## Probability Fundamentals Exercises

8. **Probability of a Constructor Winning**
   - **Goal**: Calculate the probability that a specific `constructorId` wins a race (one of its cars finishes in first position) across all races in the `results_df` table.
   - **Hint**: Filter for the first position, count the occurrences, and divide by the total number of races.
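
The hint can be sketched as an empirical probability: races won divided by races held. The `results` table and the chosen `constructor_id` below are hypothetical. `positionOrder` is used to identify the winner, as in the wins calculation earlier in this notebook.

```python
import pandas as pd

# Hypothetical results, standing in for results_df
results = pd.DataFrame({
    "raceId": [1, 1, 2, 2, 3, 3],
    "constructorId": [6, 9, 6, 9, 6, 9],
    "positionOrder": [1, 2, 2, 1, 1, 3],
})

constructor_id = 6  # hypothetical constructor chosen for the example

total_races = results["raceId"].nunique()
is_win = (results["positionOrder"] == 1) & (results["constructorId"] == constructor_id)
races_won = results.loc[is_win, "raceId"].nunique()

p_win = races_won / total_races
print(f"P(constructor {constructor_id} wins) = {p_win:.3f}")
```

Dividing by all races, as the hint suggests, treats races the constructor did not enter as losses. Dividing by the races it actually entered would give a conditional probability instead.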

---

9. **Conditional Probability of Qualifying Position**
   - **Goal**: Calculate the probability that a `driverId` qualifies in the top 3 (`position` <= 3) given that they participated in qualifying, using data from `qualifying_df`.
   - **Hint**: Calculate the proportion of `position` <= 3 among all qualifying entries for each driver.
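
Because every row of the qualifying table is already a session the driver took part in, the conditional probability reduces to the share of rows with `position` <= 3. A minimal sketch follows, with a hypothetical `qualifying` table standing in for `qualifying_df`.

```python
import pandas as pd

# Hypothetical qualifying entries, standing in for qualifying_df
qualifying = pd.DataFrame({
    "raceId": [1, 2, 3, 1, 2],
    "driverId": [7, 7, 7, 8, 8],
    "position": [1, 5, 3, 4, 2],
})

top3 = qualifying["position"] <= 3

# The mean of a boolean column is the fraction of True values,
# i.e. P(top 3 | driver took part in qualifying)
p_top3 = top3.groupby(qualifying["driverId"]).mean().rename("p_top3")
print(p_top3)
```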

---

10. **Joint Probability of Winning and Fastest Lap**
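      - **Goal**: Calculate the probability that a driver has both won (`position` = 1) and set the fastest lap in a race, using `results_df`. Note that the `fastestLap` column stores a lap number; the `rank` column (assumed here to rank drivers by fastest lap time, with 1 as the quickest) is the likelier indicator of who set the fastest lap.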
      - **Hint**: Find races where both conditions are true for a `driverId` and calculate the frequency relative to the total races.
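
One way to approach the joint probability is sketched below. The `results` table and `driver_id` are hypothetical. The sketch assumes that `rank == 1` marks the driver who set the fastest lap of the race, which should be verified against the real data, and it uses `positionOrder == 1` for the win.

```python
import pandas as pd

# Hypothetical results, standing in for results_df
results = pd.DataFrame({
    "raceId": [1, 1, 2, 2, 3, 3],
    "driverId": [7, 8, 7, 8, 7, 8],
    "positionOrder": [1, 2, 1, 2, 2, 1],
    "rank": [1, 2, 2, 1, 1, 2],  # assumption: 1 = fastest lap of the race
})

driver_id = 7  # hypothetical driver chosen for the example
driver_rows = results[results["driverId"] == driver_id]

# Both events must happen in the same race
both = (driver_rows["positionOrder"] == 1) & (driver_rows["rank"] == 1)

total_races = results["raceId"].nunique()
p_joint = both.sum() / total_races
print(f"P(win and fastest lap) = {p_joint:.3f}")
```

Comparing `p_joint` with the product of the two separate probabilities shows whether winning and setting the fastest lap look independent for that driver.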