# Predicting the number of pit stops

The number of pit stops and their duration have decreased over the years. In the first phase of this subproject I'll try to predict how many stops there will be in a race; later I would like to predict when they will happen.

But first, some EDA...

In [None]:
from IPython.display import Image 
from IPython.core.display import HTML 
Image(url= "https://media.giphy.com/media/MovqJSMROh1gA/giphy.gif")

## EDA

### Setting up the main dataset

In [None]:
#The data stored in this path is obtained from the API of https://ergast.com/mrd/. It is continuously updated.
#To update this data please run the file "API_Requests_Results_Qualifying_Laps_PitStops.py"

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

path = 'C:/Users/gabri/Dropbox/Gaby/Proyectos/My_Portafolio/F1/Data/'

PitsDF=pd.read_csv(path+"PitsDF.csv")

We should import info from the results data so that we know the result of each driver at the end of each race. Only the drivers that finished each race should be taken into account: a driver who does not finish the race may have fewer pit stops than planned, or none at all.

In [None]:
ResultsDF=pd.read_csv(path+"ResultsDF.csv")
PitsDF=PitsDF.drop(columns=['Unnamed: 0'])

In [None]:
#Transforming pits duration into seconds (handles both "23.456" and "1:02.345"):
DurationParts=PitsDF['duration'].astype(str).str.split(":")
PitsDF['duration_in_sec']=DurationParts.apply(lambda p: sum(float(v)*60**i for i,v in enumerate(reversed(p))))


In [None]:
#Creating the Season-Round-Driver key used to join both datasets:
ResultsDF["Season-Round-Driver"]=ResultsDF["season"].astype(str)+"-"+ResultsDF["round"].astype(str)+"-"+ResultsDF["Driver.driverId"].astype(str)
PitsDF["Season-Round-Driver"]=PitsDF["season"].astype(str)+"-"+PitsDF["round"].astype(str)+"-"+PitsDF["driverId"].astype(str)


In [None]:
#Left Join of the Pits DF with the Results DF
PitsExtraDF=PitsDF.merge(ResultsDF[["Season-Round-Driver","status",'Constructor.constructorId','Constructor.name',"laps"]],on="Season-Round-Driver",how="left")
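
A left join keeps pit stops that have no matching result and duplicates rows if the key is not unique on the right side. A small sketch, on made-up rows, of how the `validate` and `indicator` options of pandas' `merge` could be used to check both; the frames and values below are illustrative assumptions, not the real data.

```python
import pandas as pd

# Toy stand-ins for PitsDF and ResultsDF (values are made up)
pits = pd.DataFrame({
    "Season-Round-Driver": ["2021-1-hamilton", "2021-1-hamilton", "2021-1-unknown"],
    "stop": [1, 2, 1],
})
results = pd.DataFrame({
    "Season-Round-Driver": ["2021-1-hamilton", "2021-1-bottas"],
    "status": ["Finished", "Finished"],
})

# validate raises MergeError if a key appears more than once in `results`;
# indicator adds a `_merge` column telling where each row came from
merged = pits.merge(results, on="Season-Round-Driver", how="left",
                    validate="many_to_one", indicator=True)

unmatched = merged[merged["_merge"] == "left_only"]
print(len(merged), "rows,", len(unmatched), "without a result")
```

Rows flagged `left_only` end up with missing `status` and `laps`, so they would be dropped by any later filter on `status`.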

In [None]:
#Top number of pits per race:
PitsExtraDF["stop"].unique()

In [None]:
#Dividing the main df into separate ones taking into account the number of stops, to then unify them:
Pits1ExtraDF = PitsExtraDF[PitsExtraDF["stop"]==1].rename(columns={'lap': 'Pit1_lap',"time":"Pit1_time","duration_in_sec":"Pit1_duration"})
Pits2ExtraDF = PitsExtraDF[PitsExtraDF["stop"]==2].rename(columns={'lap': 'Pit2_lap',"time":"Pit2_time","duration_in_sec":"Pit2_duration"})
Pits3ExtraDF = PitsExtraDF[PitsExtraDF["stop"]==3].rename(columns={'lap': 'Pit3_lap',"time":"Pit3_time","duration_in_sec":"Pit3_duration"})
Pits4ExtraDF = PitsExtraDF[PitsExtraDF["stop"]==4].rename(columns={'lap': 'Pit4_lap',"time":"Pit4_time","duration_in_sec":"Pit4_duration"})

PitsUnified=Pits1ExtraDF[['Season-Round-Driver', 'date', 'status', 'driverId', 'season','round', 'raceName', 'Circuit.circuitId', 'Circuit.circuitName',
       'Circuit.Location.country','Constructor.constructorId', 'Constructor.name', 'laps',
       'Pit1_lap', 'Pit1_time', 'Pit1_duration']]

#Unifying the separate datasets forming one dataset with one row per race and per driver
PitsUnified=PitsUnified.merge(Pits2ExtraDF[["Season-Round-Driver",'Pit2_lap', 'Pit2_time', 'Pit2_duration']],on="Season-Round-Driver",how="left")
PitsUnified=PitsUnified.merge(Pits3ExtraDF[["Season-Round-Driver",'Pit3_lap', 'Pit3_time', 'Pit3_duration']],on="Season-Round-Driver",how="left")
PitsUnified=PitsUnified.merge(Pits4ExtraDF[["Season-Round-Driver",'Pit4_lap', 'Pit4_time', 'Pit4_duration']],on="Season-Round-Driver",how="left")

#Replacing na values of pits columns with 0
PitsUnified=PitsUnified.fillna(0)

len(PitsUnified) #counting the number of rows
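
The filter-and-merge steps above cap the data at four stops. As an alternative sketch (not the method used in this notebook), a single `pivot` can produce one row per driver and race with one column per stop, whatever the maximum number of stops is. The toy frame below is an assumption standing in for `PitsExtraDF`.

```python
import pandas as pd

# Toy long-format data: one row per pit stop (values are made up)
long_df = pd.DataFrame({
    "Season-Round-Driver": ["2021-1-a", "2021-1-a", "2021-1-b",
                            "2021-1-c", "2021-1-c", "2021-1-c"],
    "stop": [1, 2, 1, 1, 2, 3],
    "lap": [15, 40, 22, 10, 28, 45],
})

# One row per driver-race, one column per stop number
wide = long_df.pivot(index="Season-Round-Driver", columns="stop", values="lap")
wide.columns = [f"Pit{n}_lap" for n in wide.columns]

# Count the stops before filling the gaps, then use 0 for "no such stop"
wide["Num_Pits"] = wide.notna().sum(axis=1)
wide = wide.fillna(0).astype(int).reset_index()
print(wide)
```

Counting the non-missing columns before `fillna(0)` gives the number of stops directly, without a chain of conditions.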

In [None]:
#Adding the total number of pits per driver and per race
conditions = [
    (PitsUnified['Pit4_lap'] > 0),
    (PitsUnified['Pit4_lap'] == 0) & (PitsUnified['Pit3_lap'] > 0),
    (PitsUnified['Pit4_lap'] == 0) & (PitsUnified['Pit3_lap'] == 0) & (PitsUnified['Pit2_lap'] > 0),
    (PitsUnified['Pit4_lap'] == 0) & (PitsUnified['Pit3_lap'] == 0) & (PitsUnified['Pit2_lap'] == 0) & (PitsUnified['Pit1_lap'] > 0)
]

values = [4, 3, 2, 1]

PitsUnified['Num_Pits'] = np.select(conditions, values)

In [None]:
#Creating the feature of laps per pit stop. This tells us how many laps, on average, a driver completes between pit stops
PitsUnified["LapsbetweenPitstops"]=PitsUnified["laps"]/PitsUnified['Num_Pits']

#Pits of drivers that ended the races:
PitsUnified_Finished=PitsUnified[PitsUnified["status"]=="Finished"].reset_index()

#Now we have one row per race and driver, only of the drivers who finished the race
PitsUnified_Finished.tail(5)
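
Filtering on `status == "Finished"` may leave out drivers who reached the flag one or more laps down. If those should count as well, and assuming their status is recorded in the form `+1 Lap`, `+2 Laps` (an assumption about the results data, worth checking with `ResultsDF["status"].unique()`), a pattern match could be used instead. A sketch on made-up rows:

```python
import pandas as pd

# Made-up statuses; the "+N Lap(s)" format is an assumption about the data
toy = pd.DataFrame({
    "driverId": ["a", "b", "c", "d"],
    "status": ["Finished", "+1 Lap", "Engine", "+2 Laps"],
})

# Keep "Finished" plus any status such as "+1 Lap" or "+12 Laps"
reached_the_flag = toy["status"].str.fullmatch(r"Finished|\+\d+ Laps?")
finishers = toy[reached_the_flag].reset_index(drop=True)
print(finishers)
```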

### Average number of pit stops per circuit

On average, the Mugello circuit and Park Zandvoort are the circuits with the highest number of pit stops per race.

In [None]:
sns.set_theme(style="darkgrid", palette="magma",font_scale=1.2,font="serif")
sns.catplot(data=PitsUnified_Finished,x="Circuit.circuitName", y="Num_Pits",kind="bar",height=12,aspect=2)
plt.xticks(rotation=90)
plt.show()

### Average number of pit stops per year

The average number of pit stops decreased in 2018. This might have several causes, such as changes in the regulations or a different set of circuits in that season.

In [None]:
sns.set_theme(style="darkgrid", palette="magma",font_scale=1.2,font="serif")
sns.catplot(data=PitsUnified_Finished,x="season", y="Num_Pits",kind="bar",height=4,aspect=2)
plt.xticks(rotation=90)
plt.show()
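
The bars above show the mean of `Num_Pits` per season (the mean is the default estimator of seaborn's bar plots). To read the exact values and the number of rows behind each bar, a `groupby` can be used. A sketch on made-up data:

```python
import pandas as pd

# Made-up rows standing in for PitsUnified_Finished
toy = pd.DataFrame({
    "season": [2017, 2017, 2017, 2018, 2018],
    "Num_Pits": [2, 3, 1, 1, 2],
})

# Mean number of stops and number of driver-races per season
per_season = toy.groupby("season")["Num_Pits"].agg(["mean", "count"])
print(per_season)
```

The `count` column helps to judge how much weight a single season's average deserves.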

### Average number of laps between pit stops per circuit

Contrary to what was expected, the average number of laps between pit stops shows no single trend: it has increased at some circuits and decreased at others.

In [None]:
#Selecting the circuits where a race has been held in at least two seasons
SelectedCircuits=['albert_park', 'sepang', 'shanghai', 'bahrain', 'catalunya',
       'monaco', 'villeneuve', 'silverstone',
       'hockenheimring', 'hungaroring', 'spa', 'monza', 'marina_bay',
       'suzuka', 'yas_marina', 'americas',
       'interlagos', 'nurburgring', 'red_bull_ring', 'sochi', 'rodriguez',
       'BAK']

PitsUnified_Finished_Selected=PitsUnified_Finished[PitsUnified_Finished["Circuit.circuitId"].isin(SelectedCircuits) == True]

sns.set_theme(style="darkgrid", palette="magma",font_scale=0.8,font="serif")
sns.relplot(data=PitsUnified_Finished_Selected,x="season", y="LapsbetweenPitstops",col="Circuit.circuitName",col_wrap=4,kind="line",height=2,aspect=2)

### Average number of laps between pit stops

In [None]:
sns.set_theme(style="darkgrid", palette="magma",font_scale=1.2,font="serif")
sns.catplot(data=PitsUnified_Finished,x="season", y="LapsbetweenPitstops",kind="bar",height=4,aspect=2)

### Encoding the categorical features

In [None]:
from sklearn.preprocessing import LabelEncoder

#Features to encode:
to_encode=['Season-Round-Driver', 'driverId', 'Circuit.circuitId','Circuit.Location.country', 'Constructor.constructorId']

for n in to_encode:
    encoder = LabelEncoder()
    encoder.fit(PitsUnified_Finished[n])
    nameencoded=n+"_enc"
    encoders=encoder.transform(PitsUnified_Finished[n])
    PitsUnified_Finished[nameencoded]=encoders

PitsUnified_Finished=PitsUnified_Finished.reset_index(drop=True) #reset the index after adding the features (without it there is an error later); drop=True avoids adding another index column each time the cell runs
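
`LabelEncoder` assigns integer codes following the sorted order of the labels, so the numbers carry no meaning beyond identity (a tree model can still split on them as if they were ordered). Keeping the fitted encoders makes it possible to map codes back to names later. A sketch with made-up driver ids; the `encoders` dictionary is a hypothetical addition, not part of the loop above:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy = pd.DataFrame({"driverId": ["vettel", "alonso", "hamilton", "alonso"]})

encoders = {}  # hypothetical: one fitted encoder per column, kept for decoding
for column in ["driverId"]:
    encoders[column] = LabelEncoder()
    toy[column + "_enc"] = encoders[column].fit_transform(toy[column])

# Codes follow the alphabetical order of the ids
print(toy["driverId_enc"].tolist())

# The stored encoder maps the codes back to the original labels
decoded = encoders["driverId"].inverse_transform(toy["driverId_enc"])
print(list(decoded))
```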


### Relationship between features

In [None]:
#Relationship between all the variables

#Calculating correlation: Heatmap
sns.set_theme(style="whitegrid", palette="magma",font_scale=0.7,font="serif")
fig, ax = plt.subplots(figsize=(12, 10))
cmap = sns.diverging_palette(0, 210, 95, 49, as_cmap=True)
sns.heatmap(PitsUnified_Finished.corr(numeric_only=True), annot=True, fmt=".2f", 
           linewidths=5,cmap=cmap, vmin=-1, vmax=1, 
           cbar_kws={"shrink": .8}, square=True)
plt.show()

### Before making predictions...

In [None]:
from sklearn.metrics import accuracy_score, precision_score, f1_score, recall_score #measures used to evaluate the models
from sklearn.metrics import confusion_matrix, classification_report #confusion matrix to evaluate results

from sklearn.model_selection import GridSearchCV #Hyperparameter optimization
from sklearn.model_selection import KFold #set kfold configuration
from sklearn.model_selection import cross_val_score #cross validation
from sklearn.metrics import make_scorer #set scores desired to train models
from sklearn.metrics import mean_squared_error

#Set scorers (Num_Pits has more than two classes, so f1_score needs an explicit average)
f1_scorer=make_scorer(f1_score,average="macro")
accuracy_scorer=make_scorer(accuracy_score)

### Function to test and evaluate the algorithms
def testing_the_classifier(ticks_, thesize=(5,3)):
    Train=pd.DataFrame()
    Train["Predicted"]=y_train_predicted
    Train["Real"]=y_train.tolist()

    Test=pd.DataFrame()
    Test["Predicted"]=y_test_predicted
    Test["Real"]=y_test.tolist()

    #Generate the confusion matrices

    cf_matrixtrain = confusion_matrix(Train["Real"], Train["Predicted"])
    cf_matrixtest = confusion_matrix(Test["Real"], Test["Predicted"])

    print("\n Training Data:")
    sns.set_theme(style="whitegrid", palette="BuPu",font_scale=1,font="serif")
    plt.figure(figsize=thesize)
    ax = sns.heatmap(cf_matrixtrain, annot=True,cmap="BuPu",fmt='d')
    ax.set_xlabel('\nPredicted Values')
    ax.set_ylabel('Actual Values ')
    ax.xaxis.set_ticklabels(ticks_)
    ax.yaxis.set_ticklabels(ticks_)
    plt.show()

    print(classification_report(y_train, y_train_predicted))
    

    print("\n Testing Data:")
    sns.set_theme(style="whitegrid", palette="BuPu",font_scale=1,font="serif")
    plt.figure(figsize=thesize)
    ax = sns.heatmap(cf_matrixtest, annot=True,cmap="BuPu",fmt='d')
    ax.set_xlabel('\nPredicted Values')
    ax.set_ylabel('Actual Values ')
    ax.xaxis.set_ticklabels(ticks_)
    ax.yaxis.set_ticklabels(ticks_)
    plt.show()

    print(classification_report(y_test, y_test_predicted))
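
`f1_score` defaults to `average="binary"`, which raises an error for a target with more than two classes such as `Num_Pits`, so a multiclass scorer needs an explicit `average`. Inside a grid search such an error may surface as a `nan` score rather than as an exception. A sketch of the difference on made-up labels:

```python
from sklearn.metrics import f1_score, make_scorer

# Made-up multiclass labels (1, 2 or 3 pit stops)
y_true = [1, 1, 2, 2, 3, 3]
y_pred = [1, 1, 2, 3, 3, 3]

try:
    f1_score(y_true, y_pred)  # default average="binary"
    binary_failed = False
except ValueError:
    binary_failed = True

# Macro average: unweighted mean of the per-class F1 scores
macro_f1 = f1_score(y_true, y_pred, average="macro")
macro_f1_scorer = make_scorer(f1_score, average="macro")
print(binary_failed, round(macro_f1, 3))
```

The macro average treats every class equally, which matters when some numbers of stops are rare; `average="weighted"` would instead weight each class by its frequency.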


In [None]:
#Bar plot: class distribution
g=sns.catplot(x="Num_Pits",data=PitsUnified_Finished,kind="count",height=5,aspect=2)
g.set(xlabel="Number of pits in our data")
plt.show()

#There exists a class imbalance
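
One common way to account for such an imbalance is to weight the classes inversely to their frequency. A sketch of how scikit-learn computes "balanced" weights, on a made-up target (the counts are illustrative, not those of this dataset):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up target: 6 one-stop, 3 two-stop and 1 three-stop races
y = np.array([1] * 6 + [2] * 3 + [3] * 1)

classes = np.unique(y)
# "balanced" weights are n_samples / (n_classes * count_of_class)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
class_weights = {int(c): float(w) for c, w in zip(classes, weights)}
print(class_weights)
```

Passing `class_weight="balanced"` to `RandomForestClassifier` applies the same weighting during training; whether it helps here would need to be checked on the real data.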

In [None]:
len(PitsUnified_Finished[PitsUnified_Finished["Num_Pits"]==3]) #number of drivers-races with 3 pitstops

## 1st: Random Forest Classifier to predict the feature "Num_Pits"

In [None]:
#PitsUnified_Finished.columns

Selected=['season','round','laps', 'driverId_enc',
       'Circuit.circuitId_enc', 'Circuit.Location.country_enc',
       'Constructor.constructorId_enc']

selectednumber=31994 #randomseed

In [None]:
#Divide data into training and testing - stratified
from sklearn.model_selection import train_test_split #separate train and test data

X_train, X_test, y_train, y_test = train_test_split(PitsUnified_Finished[Selected], PitsUnified_Finished["Num_Pits"], test_size=0.20,stratify=PitsUnified_Finished["Num_Pits"],random_state=selectednumber)
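
With `stratify`, both parts of the split keep the class proportions of the full target, which matters for the rarer numbers of stops. A sketch on a made-up imbalanced target:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up imbalanced target: 80% class 1, 20% class 2
toy = pd.DataFrame({"x": range(50), "y": [1] * 40 + [2] * 10})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy[["x"]], toy["y"], test_size=0.20, stratify=toy["y"], random_state=0)

# Both parts keep the 80/20 class ratio
print(y_tr.value_counts(normalize=True).to_dict())
print(y_te.value_counts(normalize=True).to_dict())
```

Stratification needs at least two rows of every class, so a number of stops that appears only once in the data would make the split fail.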


In [None]:
from sklearn.ensemble import RandomForestClassifier

#Cross Validation using Grid Search
RF=RandomForestClassifier(random_state=selectednumber,n_estimators=100)
#"auto" is left out of the grid: for classifiers it meant "sqrt" and it was removed in scikit-learn 1.3
tuned_parameters = [{'criterion': ["gini","entropy"],"max_features":["sqrt","log2",None]}]
# for x in range(2,10):
#     # clf = GridSearchCV(RF, tuned_parameters, cv=KFold(n_splits=x), scoring=f1_scorer)
#     # clf.fit(X_train, y_train)
#     # print("Folds: ",x,"- F1 score: ",clf.best_score_," ",clf.best_params_)
#     #in this case the f1 scorer was nan in every fold
#     clf2 = GridSearchCV(RF, tuned_parameters, cv=KFold(n_splits=x), scoring=accuracy_scorer)
#     clf2.fit(X_train, y_train)
#     print("Folds: ",x,"- Accuracy: ",clf2.best_score_," ",clf2.best_params_)
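
The commented-out search above could be run along these lines. This is a sketch on synthetic data: `make_classification` stands in for the real features, `StratifiedKFold` and the macro-averaged F1 scorer are swapped in here as assumptions (they are not what the notebook used), and the grid leaves out `"auto"`, which recent scikit-learn versions no longer accept.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic 3-class data standing in for X_train / y_train
X, y = make_classification(n_samples=150, n_features=7, n_informative=4,
                           n_classes=3, random_state=0)

param_grid = {"criterion": ["gini", "entropy"],
              "max_features": ["sqrt", "log2", None]}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=25, random_state=0),
    param_grid,
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    scoring=make_scorer(f1_score, average="macro"),
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Stratified folds keep every class present in each fold, which plain `KFold` does not guarantee on an imbalanced target.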

In [None]:
#max_features="sqrt" is what "auto" meant for classifiers ("auto" was removed in scikit-learn 1.3)
RF=RandomForestClassifier(n_estimators=100,random_state=selectednumber,criterion="gini",max_features="sqrt")
RF.fit(X_train, y_train)
y_train_predicted=RF.predict(X_train)
y_test_predicted=RF.predict(X_test)
np.unique(y_test_predicted)
mse_train = mean_squared_error(y_train,y_train_predicted)
mse_test = mean_squared_error(y_test,y_test_predicted)

print(mse_train,"-",mse_test)



## 1st: Results

In [None]:
testing_the_classifier(["1","2","3"])

## 2nd: Random Forest Classifier to predict the feature Lap of pit stops (if any)

First, predicting the lap in which the first pit stop is made.

Secondly, predicting the lap of the second pit stop (a value of 0 means the driver made no second stop).

Thirdly, predicting the lap of the third pit stop (again, a value of 0 means there was no third stop).
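
These steps chain the models: each predicted lap becomes an input of the next one. A sketch of that idea on synthetic data follows. It departs from the notebook in one labelled way: the training-set feature is built with `cross_val_predict`, so each training row gets a prediction from a model that did not see it, while test rows get predictions from the model fitted on all training rows. All data and column names here are made up.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict, train_test_split

rng = np.random.default_rng(0)
n = 200
# Synthetic races: pit laps are drawn from a few values so every class is frequent
data = pd.DataFrame({"laps": rng.choice([50, 60, 70], size=n),
                     "circuit_enc": rng.integers(0, 5, size=n)})
data["Pit1_lap"] = rng.choice([10, 15, 20], size=n)
# 0 stands for "no second stop"
data["Pit2_lap"] = np.where(rng.random(n) < 0.5, 0, data["Pit1_lap"] + 20)

train, test = train_test_split(data, test_size=0.2, random_state=0)
X_train = train[["laps", "circuit_enc"]].copy()
X_test = test[["laps", "circuit_enc"]].copy()

for target in ["Pit1_lap", "Pit2_lap"]:
    model = RandomForestClassifier(n_estimators=25, random_state=0)
    # Out-of-fold predictions for the training rows
    train_pred = cross_val_predict(model, X_train, train[target], cv=3)
    # Test rows are predicted by a model fitted on the whole training set
    test_pred = model.fit(X_train, train[target]).predict(X_test)
    # The prediction becomes a feature for the next target
    X_train[target + "_predicted"] = train_pred
    X_test[target + "_predicted"] = test_pred

print(X_test.head())
```

In-sample predictions of a random forest are usually far more accurate than its predictions on unseen rows, so feeding them to the next model can make it rely on a feature that is much noisier at test time; out-of-fold predictions reduce that gap.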

In [None]:
#Function to evaluate the models
from sklearn import metrics

def evaluatelapspitstops(y_train,y_train_predicted,y_test,y_test_predicted):
    y_true = y_train 
    y_pred = y_train_predicted 

    print("\n Training Scores:")
    print('Mean Absolute Error (MAE):', metrics.mean_absolute_error(y_true, y_pred))
    print('Mean Squared Error (MSE):', metrics.mean_squared_error(y_true, y_pred))
    print('Root Mean Squared Error (RMSE):', np.sqrt(metrics.mean_squared_error(y_true, y_pred)))
    print('Mean Absolute Percentage Error (MAPE):', metrics.mean_absolute_percentage_error(y_true, y_pred))
    print('Explained Variance Score:', metrics.explained_variance_score(y_true, y_pred))
    print('Max Error:', metrics.max_error(y_true, y_pred))
    print('Mean Squared Log Error:', metrics.mean_squared_log_error(y_true, y_pred))
    print('Median Absolute Error:', metrics.median_absolute_error(y_true, y_pred))
    print('R^2:', metrics.r2_score(y_true, y_pred))

    y_true = y_test 
    y_pred = y_test_predicted 
    
    print("\n Testing Scores:")
    print('Mean Absolute Error (MAE):', metrics.mean_absolute_error(y_true, y_pred))
    print('Mean Squared Error (MSE):', metrics.mean_squared_error(y_true, y_pred))
    print('Root Mean Squared Error (RMSE):', np.sqrt(metrics.mean_squared_error(y_true, y_pred)))
    print('Mean Absolute Percentage Error (MAPE):', metrics.mean_absolute_percentage_error(y_true, y_pred))
    print('Explained Variance Score:', metrics.explained_variance_score(y_true, y_pred))
    print('Max Error:', metrics.max_error(y_true, y_pred))
    print('Mean Squared Log Error:', metrics.mean_squared_log_error(y_true, y_pred))
    print('Median Absolute Error:', metrics.median_absolute_error(y_true, y_pred))
    print('R^2:', metrics.r2_score(y_true, y_pred))
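
Two of the metrics above deserve care when the target contains zeros (here 0 means "no such stop"). MAPE divides by the true value, so a single true 0 with a non-zero prediction makes it explode. RMSE can be computed as the square root of MSE, which works across scikit-learn versions. A sketch with made-up laps:

```python
import numpy as np
from sklearn import metrics

# Made-up pit laps; 0 stands for "no such stop"
y_true = np.array([20, 0, 35, 0])
y_pred = np.array([22, 18, 35, 0])

mse = metrics.mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
mae = metrics.mean_absolute_error(y_true, y_pred)
# The true 0 predicted as lap 18 dominates MAPE completely
mape = metrics.mean_absolute_percentage_error(y_true, y_pred)
print(rmse, mae, mape)
```

For targets like `Pit2_lap` and `Pit3_lap`, where many rows are 0, MAE and RMSE are therefore easier to interpret than MAPE.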


In [None]:
#PitsUnified_Finished.columns

Selected=['season','round','laps', 'driverId_enc',
       'Circuit.circuitId_enc', 'Circuit.Location.country_enc',
       'Constructor.constructorId_enc','Pit1_lap','Pit2_lap','Pit3_lap','Pit4_lap',"Num_Pits"]

#The 5 pit-related features (the four Pit laps and Num_Pits) will be taken out later

#randomseed

In [None]:
PitsUnified_Finished.columns

In [None]:
#Divide data into training and testing - stratified
from sklearn.model_selection import train_test_split #separate train and test data

X_train, X_test, y_train, y_test = train_test_split(PitsUnified_Finished[Selected], PitsUnified_Finished['Pit1_lap'], test_size=0.20,stratify=PitsUnified_Finished["Num_Pits"],random_state=selectednumber)
#X_train, X_test, y_train, y_test = train_test_split(PitsUnified_Finished[Selected], PitsUnified_Finished['Pit1_lap'], test_size=0.20,random_state=selectednumber)


In [None]:
PitInfotrain=X_train[['Pit1_lap','Pit2_lap','Pit3_lap','Pit4_lap',"Num_Pits"]] #kept aside, in the same row order, because this info is needed later
X_train=X_train.drop(columns=['Pit1_lap','Pit2_lap','Pit3_lap','Pit4_lap',"Num_Pits"]) #the 5 pit-related features are dropped from the training set

PitInfotest=X_test[['Pit1_lap','Pit2_lap','Pit3_lap','Pit4_lap',"Num_Pits"]] #kept aside, in the same row order, because this info is needed later
X_test=X_test.drop(columns=['Pit1_lap','Pit2_lap','Pit3_lap','Pit4_lap',"Num_Pits"]) #the 5 pit-related features are dropped from the testing set


In [None]:
from sklearn.ensemble import RandomForestClassifier

#Cross Validation using Grid Search
RF=RandomForestClassifier(random_state=selectednumber,n_estimators=100)
#"auto" is left out of the grid: for classifiers it meant "sqrt" and it was removed in scikit-learn 1.3
tuned_parameters = [{'criterion': ["gini","entropy"],"max_features":["sqrt","log2",None]}]
# for x in range(2,10):
#     # clf = GridSearchCV(RF, tuned_parameters, cv=KFold(n_splits=x), scoring=f1_scorer)
#     # clf.fit(X_train, y_train)
#     # print("Folds: ",x,"- F1 score: ",clf.best_score_," ",clf.best_params_)
#     #in this case the f1 scorer was nan in every fold
#     clf2 = GridSearchCV(RF, tuned_parameters, cv=KFold(n_splits=x), scoring=accuracy_scorer)
#     clf2.fit(X_train, y_train)
#     print("Folds: ",x,"- Accuracy: ",clf2.best_score_," ",clf2.best_params_)

In [None]:
#max_features="sqrt" is what "auto" meant for classifiers ("auto" was removed in scikit-learn 1.3)
RF=RandomForestClassifier(n_estimators=100,random_state=selectednumber,criterion="gini",max_features="sqrt")
RF.fit(X_train, y_train)
y_train_predicted=RF.predict(X_train)
y_test_predicted=RF.predict(X_test)

ticks=PitsUnified_Finished["Pit1_lap"].unique()

evaluatelapspitstops(y_train,y_train_predicted,y_test,y_test_predicted)


In [None]:
#Including the laps PREDICTED for the first pit stop as an independent feature in the training and testing sets
X_train["Pit1_lap_predicted"]=y_train_predicted
X_test["Pit1_lap_predicted"]=y_test_predicted

y_train=PitInfotrain['Pit2_lap'].copy(deep=True) #changing the dependent feature to the lap of the second pit stop
y_test=PitInfotest['Pit2_lap'].copy(deep=True) #changing the dependent feature to the lap of the second pit stop


In [None]:
#max_features="sqrt" is what "auto" meant for classifiers ("auto" was removed in scikit-learn 1.3)
RF=RandomForestClassifier(n_estimators=100,random_state=selectednumber,criterion="gini",max_features="sqrt")
RF.fit(X_train, y_train)
y_train_predicted=RF.predict(X_train)
y_test_predicted=RF.predict(X_test)

ticks=PitsUnified_Finished["Pit2_lap"].unique()

evaluatelapspitstops(y_train,y_train_predicted,y_test,y_test_predicted)

In [None]:
#Including the laps PREDICTED for the second pit stop as an independent feature in the training and testing sets
X_train["Pit2_lap_predicted"]=y_train_predicted
X_test["Pit2_lap_predicted"]=y_test_predicted

y_train=PitInfotrain['Pit3_lap'].copy(deep=True) #changing the dependent feature to the lap of the third pit stop
y_test=PitInfotest['Pit3_lap'].copy(deep=True) #changing the dependent feature to the lap of the third pit stop


In [None]:
#max_features="sqrt" is what "auto" meant for classifiers ("auto" was removed in scikit-learn 1.3)
RF=RandomForestClassifier(n_estimators=100,random_state=selectednumber,criterion="gini",max_features="sqrt")
RF.fit(X_train, y_train)
y_train_predicted=RF.predict(X_train)
y_test_predicted=RF.predict(X_test)

ticks=PitsUnified_Finished["Pit3_lap"].unique()

evaluatelapspitstops(y_train,y_train_predicted,y_test,y_test_predicted)

In [None]:
#Including pits4 information
X_train["Pit4_lap"]=PitInfotrain['Pit4_lap'] 
X_test["Pit4_lap"]=PitInfotest['Pit4_lap'] 

#Including the laps PREDICTED for the third pit stop as an independent feature in the training and testing sets
X_train["Pit3_lap_predicted"]=y_train_predicted
X_test["Pit3_lap_predicted"]=y_test_predicted

#Adding the total number of pits per driver and per race p1
conditionsX_train = [
    (X_train['Pit4_lap'] > 0),
    (X_train['Pit4_lap'] == 0) & (X_train['Pit3_lap_predicted'] > 0),
    (X_train['Pit4_lap'] == 0) & (X_train['Pit3_lap_predicted'] == 0) & (X_train['Pit2_lap_predicted'] > 0),
    (X_train['Pit4_lap'] == 0) & (X_train['Pit3_lap_predicted'] == 0) & (X_train['Pit2_lap_predicted'] == 0) & (X_train['Pit1_lap_predicted'] > 0)
]

conditionsX_test = [
    (X_test['Pit4_lap'] > 0),
    (X_test['Pit4_lap'] == 0) & (X_test['Pit3_lap_predicted'] > 0),
    (X_test['Pit4_lap'] == 0) & (X_test['Pit3_lap_predicted'] == 0) & (X_test['Pit2_lap_predicted'] > 0),
    (X_test['Pit4_lap'] == 0) & (X_test['Pit3_lap_predicted'] == 0) & (X_test['Pit2_lap_predicted'] == 0) & (X_test['Pit1_lap_predicted'] > 0)
]

values = [4, 3, 2, 1]

#Adding the total number of pits per driver and per race p2
X_train['Num_Pits_Predicted'] = np.select(conditionsX_train, values)
X_test['Num_Pits_Predicted'] = np.select(conditionsX_test, values)

In [None]:
y_train=PitInfotrain['Num_Pits']
y_test=PitInfotest['Num_Pits']

y_train_predicted=X_train['Num_Pits_Predicted']
y_test_predicted=X_test['Num_Pits_Predicted']


## 2nd: Results

In [None]:
testing_the_classifier(["1","2","3"])

## NN1