# Executive Summary: Exploratory Data Analysis on F1 Dataset
Author: Alex Searle

## Project Overview
The purpose of this EDA is to understand the features of the F1 dataset well enough to classify whether a driver will finish on the podium (top 3 places).

## Dataset Description
- The dataset consists of 15,134 instances and 14 attributes.
- The target variable is top_3, which indicates whether a driver finished in the top 3 positions. This is the class we will predict.
- The features include driver_avg_finish_pos_season, top3_driver_season_percentage, driver_avg_finish_pos_season_lag, top3_driver_season_percentage_lag, Constructor_Top3_Percent, and grid, among others, which are explored in detail below.

## Key Findings

### 1. Data Distribution
- Target variable distribution:   
![Top 3 Distribution](Images/TargetDistribution.png)  
The distribution of the target variable shows that there are many more instances outside the top 3 than inside it.
- Class imbalance is observed: 83.45% of instances are finishes outside the top 3 and 16.55% are top 3 finishes.
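
A minimal sketch of how the class split could be checked, assuming the data sits in a pandas DataFrame with a top_3 column. The ten toy rows below are made up for illustration and are not the real dataset:

```python
import pandas as pd

# Toy stand-in for the target column (assumption: 1 = top 3, 0 = not top 3).
df = pd.DataFrame({"top_3": [1, 0, 0, 0, 0, 0, 1, 0, 0, 0]})

# Share of each class as a percentage of all rows.
class_share = df["top_3"].value_counts(normalize=True).mul(100).round(2)
print(class_share)
```

With an imbalance like the one reported above, plain accuracy can be misleading, which is one reason to score models with AUC instead.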

### 2. Feature Engineering
Feature engineering came early because the only usable feature in the original dataset was grid, which is not a strong predictor on its own.  
Features created:
- driver_avg_finish_pos_season: a driver's average finishing position for that season up to the last race
- top3_driver_season_percentage: percentage of races in which a driver has finished in the top 3 for the season up to the last race
- Constructor_Top3_Percent: percentage of races in which a constructor (team) has finished in the top 3 for the season up to the last race
- driver_avg_finish_pos_season_lag: the driver's average finishing position at the same round of the previous season
- top3_driver_season_percentage_lag: percentage of races in which the driver had finished in the top 3 at the same round of the previous season
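
One way the season-to-date and lag features could be built with pandas is sketched below. The column names year, round, and driverId are assumptions about the merged data, and the toy results are made up. The sketch also assumes that "up to the last race" excludes the current race, so a row never sees its own result:

```python
import pandas as pd

# Toy results for one driver over two seasons (hypothetical values).
results = pd.DataFrame({
    "year":          [2022, 2022, 2022, 2023, 2023, 2023],
    "round":         [1, 2, 3, 1, 2, 3],
    "driverId":      [1, 1, 1, 1, 1, 1],
    "positionOrder": [2, 5, 1, 4, 3, 8],
})
results["top_3"] = (results["positionOrder"] <= 3).astype(int)
results = results.sort_values(["driverId", "year", "round"]).reset_index(drop=True)

season = results.groupby(["driverId", "year"])

# shift(1) drops the current race, then the expanding mean averages
# every earlier race of the same driver and season.
results["driver_avg_finish_pos_season"] = season["positionOrder"].transform(
    lambda s: s.shift(1).expanding().mean()
)
results["top3_driver_season_percentage"] = season["top_3"].transform(
    lambda s: s.shift(1).expanding().mean() * 100
)

# Lag features: take last season's value at the same round by moving
# each row forward one year and joining it back.
lag = results[["driverId", "year", "round",
               "driver_avg_finish_pos_season",
               "top3_driver_season_percentage"]].copy()
lag["year"] += 1
lag = lag.rename(columns={
    "driver_avg_finish_pos_season": "driver_avg_finish_pos_season_lag",
    "top3_driver_season_percentage": "top3_driver_season_percentage_lag",
})
results = results.merge(lag, on=["driverId", "year", "round"], how="left")
print(results)
```

The first race of each season has no history, so its season-to-date features come out as NaN, and the first season in the data has no lag values. Rows like these are one likely source of missing values after merging.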

### 3. Feature Analysis
- grid exhibits a left skewed distribution with spikes at certain values. It may need further work, but it should be fine for now. It also shows one of the stronger correlations with the target variable.
- driver_avg_finish_pos_season exhibits an approximately normal distribution and shows a correlation with finishing in the top 3.
- top3_driver_season_percentage exhibits a mostly uniform distribution with a spike at zero, which makes sense because most drivers won't finish in the top 3 in a given season. It also shows a correlation with finishing in the top 3.
- Constructor_Top3_Percent exhibits a left skewed distribution with a spike at zero, which makes sense because most teams don't have a car in the top 3 in a given season. It also shows a correlation with finishing in the top 3.
- driver_avg_finish_pos_season_lag exhibits the same characteristics as driver_avg_finish_pos_season because it is the same data shifted forward a year.
- top3_driver_season_percentage_lag exhibits the same characteristics as top3_driver_season_percentage because it is the same data shifted forward a year.

For graphs and visualizations, see 5. Exploratory Visualization.

### 4. Correlation Analysis
Correlation Heat Map:  
![Heat Map](Images/CorrelationMap.png)
- Correlation analysis reveals no very strong correlations with the target variable. The strongest, positionOrder, cannot be used because it encodes the same information as top_3. Among the usable features, the top 3 correlations are Constructor_Top3_Percent (positive), top3_driver_season_percentage (positive), and grid (negative).
- Strong correlations are observed between top3_driver_season_percentage and Constructor_Top3_Percent (and their lagged counterparts), suggesting potential multicollinearity. This makes sense because these features are highly related.
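
The two checks above can be sketched roughly as follows: rank features by their correlation with top_3, then flag feature pairs that are highly correlated with each other. The toy values and the 0.8 cutoff are assumptions chosen for illustration, not numbers from this analysis:

```python
import pandas as pd

# Toy data (hypothetical values, not the real dataset).
df = pd.DataFrame({
    "grid": [1, 2, 3, 10, 15, 20, 4, 12],
    "Constructor_Top3_Percent": [60, 55, 50, 5, 0, 0, 40, 10],
    "top3_driver_season_percentage": [70, 50, 45, 0, 0, 0, 35, 5],
    "top_3": [1, 1, 1, 0, 0, 0, 0, 0],
})

corr = df.corr()  # Pearson correlation by default

# Correlation of each feature with the target, ordered by absolute strength.
with_target = corr["top_3"].drop("top_3")
order = with_target.abs().sort_values(ascending=False).index
target_corr = with_target.loc[order]
print(target_corr)

# Feature pairs whose absolute correlation exceeds the (assumed) cutoff.
features = corr.drop(index="top_3", columns="top_3")
high_pairs = [
    (a, b, round(features.loc[a, b], 2))
    for i, a in enumerate(features.columns)
    for b in features.columns[i + 1:]
    if abs(features.loc[a, b]) > 0.8
]
print(high_pairs)
```

Tree-based models are generally less sensitive to correlated features than linear models, but flagged pairs are still worth knowing about when reading feature importances.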

### 5. Exploratory Visualization
- #### Grid:  
![Grid Distribution](Images/GridDistribution.png)  
![Grid](Images/grid.png)  
##### The distributions for top 3 and non-top 3 finishers are very different, indicating that grid could be a very good predictor.
- #### driver_avg_finish_pos_season:  
![driver_avg_finish_pos_season dist](Images/driver_avg_finish_pos_seasonDistribution.png)  
![driver_avg_finish_pos_season](Images/PositionOrderVsdriver_avg_finish_pos_season.png)
##### The distributions for top 3 and non-top 3 finishers differ somewhat, so this feature could be used for prediction.
- #### top3_driver_season_percentage: 
![top3_driver_season_percentage dist](Images/top3_driver_season_percentageDistribution.png)  
![top3_driver_season_percentage](Images/top3_driver_season_percentage.png)
##### The distributions for top 3 and non-top 3 finishers are fairly different because of the spike at zero for non-top 3 finishers, so this could be a good feature for prediction.
- #### Constructor_Top3_Percent: 
![Constructor_Top3_Percent dist](Images/Constructor_Top3_PercentDistribution.png)  
![Constructor_Top3_Percent](Images/Constructor_Top_3Percent.png)
##### The distributions for top 3 and non-top 3 finishers are again fairly different because of the spike at zero for non-top 3 finishers, so this could be a good feature for prediction.
- #### driver_avg_finish_pos_season_lag: 
![driver_avg_finish_pos_season_lag dist](Images/driver_avg_finish_pos_season_lagDistribution.png)  
![driver_avg_finish_pos_season_lag](Images/driver_avg_pos_season_lag.png)
##### Very similar to its non-lag parent, with the same findings.
- #### top3_driver_season_percentage_lag: 
![top3_driver_season_percentage_lag dist](Images/top3_driver_season_percentage_lagDistribution.png)  
![top3_driver_season_percentage_lag](Images/top3_driver_season_percentage_lag.png)
##### Very similar to its non-lag parent, with the same findings.

## Next Steps:

### 1. Data Preprocessing
- Merging the data sources left some instances with missing values, and these instances were deleted from the dataset.
- Features need to be scaled for use with SVMs in modeling. If necessary, all values will be scaled by converting them to z-scores.
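
A minimal sketch of z-score scaling using scikit-learn's StandardScaler. The arrays are made-up toy values. Fitting the scaler on the training rows only is an assumption about the workflow, made here so that test data does not influence the scaling parameters:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrices (hypothetical values): two features per row.
X_train = np.array([[1.0, 10.0], [5.0, 30.0], [9.0, 50.0]])
X_test = np.array([[5.0, 70.0]])

# Learn the mean and standard deviation from the training rows only.
scaler = StandardScaler().fit(X_train)

# z = (x - train_mean) / train_std, applied to both sets.
X_train_z = scaler.transform(X_train)
X_test_z = scaler.transform(X_test)
print(X_train_z)
print(X_test_z)
```

Decision Trees and Random Forests do not need this step because their splits depend only on the ordering of values within each feature. It matters for distance and margin based models such as SVMs.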

### 2. Modeling
- The models best suited to this data and these features are Decision Trees and Random Forests, because the feature distributions differ greatly between top 3 and non-top 3 finishers, and these models work well when that is the case.
- The data was split into training and test sets by year to prevent data leakage, then grid search with cross-validation was used to find the best hyperparameters.
- The best model found was a Decision Tree with AUC scores of 0.85 on the training set and 0.71 on the test set. This means the model does an acceptable job of discriminating between drivers who will and won't finish in the top 3 of a given race, and that gains in true positive rate do not come at a steep cost in false positive rate. The large gap between the training and test scores also indicates that the model is overfit even after hyperparameter tuning, which suggests a problem with the data that can be addressed in another iteration of this exploration.
- This model is easily explainable because it is just a series of if/else statements, so it is easy to show clients exactly how each prediction is reached. It also scales easily in a production environment because evaluating if/else statements takes very little computational power, which means it could be deployed on a small computer trackside if need be.
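
The workflow described above could look roughly like the sketch below. Everything in it is an assumption made for illustration: the data is synthetic, the cutoff year and parameter grid are arbitrary, and the default 5-fold stratified cross-validation may differ from what was actually used in this project:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data: front-of-grid starters reach the top 3 more often.
rng = np.random.default_rng(0)
n = 600
df = pd.DataFrame({
    "year": rng.integers(2015, 2024, size=n),
    "grid": rng.integers(1, 21, size=n),
})
podium_prob = np.where(df["grid"] <= 3, 0.7, 0.08)
df["top_3"] = (rng.random(n) < podium_prob).astype(int)

# Split by year so no season contributes rows to both sets.
features = ["grid"]
train = df[df["year"] < 2022]
test = df[df["year"] >= 2022]

# Grid search over tree size, scored by cross-validated AUC on the training set.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 3, 5], "min_samples_leaf": [5, 20]},
    scoring="roc_auc",
    cv=5,
)
search.fit(train[features], train["top_3"])
best_tree = search.best_estimator_

# Compare AUC on both sets; a large gap points to overfitting.
train_auc = roc_auc_score(train["top_3"], best_tree.predict_proba(train[features])[:, 1])
test_auc = roc_auc_score(test["top_3"], best_tree.predict_proba(test[features])[:, 1])
print(f"train AUC = {train_auc:.2f}, test AUC = {test_auc:.2f}")

# The fitted tree printed as nested if/else rules.
print(export_text(best_tree, feature_names=features))
```

The text printed by export_text is the series of if/else rules mentioned above, which is what makes the model straightforward to walk through with a client.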


## Conclusion
Multiple features are good discriminators for the target because their distributions differ between the top 3 and non-top 3 classes, and we have already used them to build a decent model. Modeling also revealed flaws in the data, since the model remained overfit even after hyperparameter tuning. This is useful information: in the next iteration of this project we will need to revisit and improve some of these processes to get a better model.