# **Machine Learning Models**

Now that the data has been preprocessed, we can begin building the machine learning models. This will mostly be done with scikit-learn.

As I aim to predict who will finish in first place for each Grand Prix in a season, this can be treated as either a regression or a classification problem. The outline of this notebook reflects this:

* **1. Loading in the Data**
* **2. Classification**
* **3. Regression**
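
To make the two framings concrete, the sketch below builds both targets from a finishing-position column. The column name `podium` mirrors the dataset loaded later in this notebook, but the rows and driver names are made up for illustration.

```python
import pandas as pd

# Toy results for one race; values are illustrative, not taken from the dataset.
race = pd.DataFrame({
    "driver": ["driver_a", "driver_b", "driver_c"],
    "podium": [2, 1, 3],  # finishing position
})

# Classification target: 1 for the race winner, 0 for everyone else.
y_classification = (race["podium"] == 1).astype(int)

# Regression target: the finishing position itself; the predicted winner
# is then the driver with the lowest predicted position.
y_regression = race["podium"]

print(y_classification.tolist())
print(y_regression.tolist())
```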

## **Dependencies** 

In [1]:
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix, precision_score
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn import svm
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.neural_network import MLPClassifier, MLPRegressor

np.set_printoptions(precision=4)

## **1. Loading in the Data**

In [2]:
data_url = 'https://raw.githubusercontent.com/DeanLundie/Formula-1/master/Data/final_df.csv'
data = pd.read_csv(data_url)

In [3]:
data.head()

Unnamed: 0,season,round,weather_warm,weather_cold,weather_dry,weather_wet,weather_cloudy,driver,grid,podium,...,constructor_minardi,constructor_prost,constructor_red_bull,constructor_renault,constructor_sauber,constructor_team_lotus,constructor_toro_rosso,constructor_toyota,constructor_tyrrell,constructor_williams
0,1983,1,0,0,1,0,0,keke_rosberg,1,15,...,0,0,0,0,0,0,0,0,0,1
1,1983,1,0,0,1,0,0,prost,2,6,...,0,0,0,1,0,0,0,0,0,0
2,1983,1,0,0,1,0,0,tambay,3,4,...,0,0,0,0,0,0,0,0,0,0
3,1983,1,0,0,1,0,0,piquet,4,1,...,0,0,0,0,0,0,0,0,0,0
4,1983,1,0,0,1,0,0,warwick,5,7,...,0,0,0,0,0,0,0,0,0,0


## **2. Classification**

Before fitting any models, we need to recode the 'podium' variable to be 1 when the driver finished first and 0 otherwise, and scale the features. Seasons before 2019 form the training set, and the 2019 season is held out as the test set.

In [4]:
df = data.copy()
df.podium = df.podium.map(lambda x: 1 if x == 1 else 0)

train = df[df.season <2019]
X_train = train.drop(['driver', 'podium'], axis = 1)
y_train = train.podium

scaler = StandardScaler()
X_train = pd.DataFrame(scaler.fit_transform(X_train), columns = X_train.columns)

In [6]:
X_train.head()

Unnamed: 0,season,round,weather_warm,weather_cold,weather_dry,weather_wet,weather_cloudy,grid,driver_points,driver_wins,...,constructor_minardi,constructor_prost,constructor_red_bull,constructor_renault,constructor_sauber,constructor_team_lotus,constructor_toro_rosso,constructor_toyota,constructor_tyrrell,constructor_williams
0,-1.647288,-1.609202,-0.85429,-0.153865,1.795187,-0.337454,-0.368569,-1.609252,-0.465538,-0.310694,...,-0.216781,-0.109347,-0.194312,-0.203511,-0.239673,-0.163875,-0.185557,-0.140049,-0.187624,3.211253
1,-1.647288,-1.609202,-0.85429,-0.153865,1.795187,-0.337454,-0.368569,-1.461102,-0.465538,-0.310694,...,-0.216781,-0.109347,-0.194312,4.913749,-0.239673,-0.163875,-0.185557,-0.140049,-0.187624,-0.311405
2,-1.647288,-1.609202,-0.85429,-0.153865,1.795187,-0.337454,-0.368569,-1.312952,-0.465538,-0.310694,...,-0.216781,-0.109347,-0.194312,-0.203511,-0.239673,-0.163875,-0.185557,-0.140049,-0.187624,-0.311405
3,-1.647288,-1.609202,-0.85429,-0.153865,1.795187,-0.337454,-0.368569,-1.164803,-0.465538,-0.310694,...,-0.216781,-0.109347,-0.194312,-0.203511,-0.239673,-0.163875,-0.185557,-0.140049,-0.187624,-0.311405
4,-1.647288,-1.609202,-0.85429,-0.153865,1.795187,-0.337454,-0.368569,-1.016653,-0.465538,-0.310694,...,-0.216781,-0.109347,-0.194312,-0.203511,-0.239673,-0.163875,-0.185557,-0.140049,-0.187624,-0.311405


In [5]:
def score_classification(model):
    """Return the share of 2019 races whose winner the model predicted."""
    rounds = df[df.season == 2019]['round'].unique()
    score = 0
    for circuit in rounds:

        test = df[(df.season == 2019) & (df['round'] == circuit)]
        X_test = test.drop(['driver', 'podium'], axis = 1)
        y_test = test.podium

        # scale with the scaler fitted on the training data
        X_test = pd.DataFrame(scaler.transform(X_test), columns = X_test.columns)

        # rank drivers by predicted probability of winning
        prediction_df = pd.DataFrame(model.predict_proba(X_test), columns = ['proba_0', 'proba_1'])
        prediction_df['actual'] = y_test.reset_index(drop = True)
        prediction_df.sort_values('proba_1', ascending = False, inplace = True)
        prediction_df.reset_index(inplace = True, drop = True)

        # only the most probable driver is predicted as the winner
        prediction_df['predicted'] = (prediction_df.index == 0).astype(int)

        # one predicted positive per race, so precision is 1 for a correct pick and 0 otherwise
        score += precision_score(prediction_df.actual, prediction_df.predicted)

    # divide by the number of races rather than the largest round number
    model_score = score / len(rounds)
    return model_score
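
Because exactly one driver per race is marked as the predicted winner, the per-race precision is either 1 or 0, so the final score is the fraction of races whose winner was picked correctly. The sketch below reproduces that logic in plain Python; the probabilities, driver names, and the helper `fraction_of_winners_predicted` are hypothetical and only meant to illustrate the calculation.

```python
# Hypothetical win probabilities per race (round -> driver -> probability).
predicted = {
    1: {"driver_a": 0.60, "driver_b": 0.30, "driver_c": 0.10},
    2: {"driver_a": 0.20, "driver_b": 0.50, "driver_c": 0.30},
    3: {"driver_a": 0.45, "driver_b": 0.15, "driver_c": 0.40},
}
# Hypothetical actual winners.
winners = {1: "driver_a", 2: "driver_c", 3: "driver_a"}


def fraction_of_winners_predicted(predicted, winners):
    """Share of races where the most probable driver actually won."""
    hits = 0
    for race, probabilities in predicted.items():
        top_driver = max(probabilities, key=probabilities.get)
        hits += int(top_driver == winners[race])
    return hits / len(predicted)


score = fraction_of_winners_predicted(predicted, winners)
print(round(score, 4))
```

Here the top pick is correct in rounds 1 and 3 but not in round 2, so two of the three races count as hits.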

In [6]:
comparison_dict ={'model':[],
                  'params': [],
                  'score': []}

### **2.1. Logistic Regression**

In [10]:
# Logistic Regression

params={'penalty': ['l1', 'l2'],
        'solver': ['saga', 'liblinear'],
        'C': np.logspace(-3,1,20)}

for penalty in params['penalty']:
    for solver in params['solver']:
        for c in params['C']:
            model_params = (penalty, solver, c)
            model = LogisticRegression(penalty = penalty, solver = solver, C = c, max_iter = 10000)
            model.fit(X_train, y_train)
            
            model_score = score_classification(model)
            
            comparison_dict['model'].append('logistic_regression')
            comparison_dict['params'].append(model_params)
            comparison_dict['score'].append(model_score)

model
logistic_regression    0.571429
Name: score, dtype: float64
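
The three nested loops above visit every combination of penalty, solver, and C. As an alternative sketch, `itertools.product` from the standard library yields the same combinations in the same order with a single loop. The C values are recomputed here without NumPy so the example stands alone; the variable names are illustrative.

```python
from itertools import product

# Same grid shape as the logistic regression search: 2 x 2 x 20 combinations.
# The C values are log-spaced between 1e-3 and 1e1, computed without NumPy.
c_values = [10 ** (-3 + 4 * i / 19) for i in range(20)]
grid = {
    "penalty": ["l1", "l2"],
    "solver": ["saga", "liblinear"],
    "C": c_values,
}

# product() yields every combination in the same order as the nested loops.
combinations = list(product(grid["penalty"], grid["solver"], grid["C"]))

print(len(combinations))
print(combinations[0])
```

Each tuple can then be unpacked as `penalty, solver, c` inside one `for` loop, which keeps the body of the search at a single indentation level.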

### **2.2. Random Forest**

In [13]:
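# 'auto' was removed in recent scikit-learn releases; for classifiers it was equivalent to 'sqrt'
# max_depth must be an integer, so use range() instead of np.linspace()
params={'criterion': ['gini', 'entropy'],
        'max_features': [0.8, 'sqrt', None],
        'max_depth': list(range(5, 56, 2)) + [None]}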

for criterion in params['criterion']:
    for max_features in params['max_features']:
        for max_depth in params['max_depth']:
            model_params = (criterion, max_features, max_depth)
            model = RandomForestClassifier(criterion = criterion, max_features = max_features, max_depth = max_depth)
            model.fit(X_train, y_train)
            
            model_score = score_classification(model)
            
            comparison_dict['model'].append('random_forest_classifier')
            comparison_dict['params'].append(model_params)
            comparison_dict['score'].append(model_score)

### **2.3. Support Vector Machine**

In [15]:
# Support Vector Machines

params={'gamma': np.logspace(-4, -1, 20),
        'C': np.logspace(-2, 1, 20),
        'kernel': ['linear', 'poly', 'rbf', 'sigmoid']} 

for gamma in params['gamma']:
    for c in params['C']:
        for kernel in params['kernel']:
            model_params = (gamma, c, kernel)
            model = svm.SVC(probability = True, gamma = gamma, C = c, kernel = kernel )
            model.fit(X_train, y_train)
            
            model_score = score_classification(model)
            
            comparison_dict['model'].append('svm_classifier')
            comparison_dict['params'].append(model_params)
            comparison_dict['score'].append(model_score)

In [16]:
pd.DataFrame(comparison_dict).groupby('model')['score'].max()

model
logistic_regression         0.571429
random_forest_classifier    0.523810
svm_classifier              0.619048
Name: score, dtype: float64

To avoid having to run these models again, I am saving the results so far in a DataFrame and writing them to a CSV file.

In [24]:
ml_results_one = pd.DataFrame(comparison_dict)

ml_results_one.to_csv(r'F:\OneDrive\Documents\VSCode\python_projects\Formula1\Data\ml_results_one.csv', index = False)
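
One thing to watch when reloading such a file: the `params` tuples are written to the CSV as text, so they come back as strings. The sketch below shows a round trip with a hypothetical results table and a temporary file, using `ast.literal_eval` to recover the tuples. It assumes the tuples hold plain Python values; depending on the NumPy version, NumPy scalars inside a tuple may be written in a form that `literal_eval` cannot parse.

```python
import ast
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical results; plain Python values keep the tuple text parseable.
results = pd.DataFrame({
    "model": ["model_a", "model_b"],
    "params": [("l1", "saga", 0.001), ("entropy", None, 17)],
    "score": [0.38, 0.52],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "ml_results_example.csv"  # illustrative file name
    results.to_csv(csv_path, index=False)
    reloaded = pd.read_csv(csv_path)

# After the round trip each params entry is a string such as "('l1', 'saga', 0.001)".
# ast.literal_eval turns that text back into a tuple.
reloaded["params"] = reloaded["params"].apply(ast.literal_eval)

print(reloaded)
```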

In [47]:
ml_results_one

| | model | params | score |
|---|---|---|---|
| 0 | logistic_regression | (l1, saga, 0.001) | 0.380952 |
| 1 | logistic_regression | (l1, saga, 0.001623776739188721) | 0.380952 |
| 2 | logistic_regression | (l1, saga, 0.0026366508987303583) | 0.380952 |
| 3 | logistic_regression | (l1, saga, 0.004281332398719396) | 0.380952 |
| 4 | logistic_regression | (l1, saga, 0.0069519279617756054) | 0.428571 |
| ... | ... | ... | ... |
| 1837 | svm_classifier | (0.1, 6.951927961775605, sigmoid) | 0.238095 |
| 1838 | svm_classifier | (0.1, 10.0, linear) | 0.476190 |
| 1839 | svm_classifier | (0.1, 10.0, poly) | 0.380952 |
| 1840 | svm_classifier | (0.1, 10.0, rbf) | 0.380952 |


We can now use the following code to find the parameters that gave each model its highest score:

In [46]:
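models = ['logistic_regression', 'random_forest_classifier', 'svm_classifier']
for model_name in models:
    # use a separate name so the global df used by score_classification is not overwritten
    model_results = ml_results_one.loc[ml_results_one['model'] == model_name].reset_index(drop = True)
    print(model_results.loc[model_results['score'].argmax(),:])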

model                    logistic_regression
params    (l1, liblinear, 1.438449888287663)
score                               0.571429
Name: 35, dtype: object
model     random_forest_classifier
params       (entropy, None, 17.0)
score                      0.52381
Name: 141, dtype: object
model                            svm_classifier
params    (0.0001, 2.3357214690901213, sigmoid)
score                                  0.619048
Name: 63, dtype: object
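
The same lookup can be done without a loop. The sketch below uses `groupby(...).idxmax()` on a hypothetical results table with the same three columns as `ml_results_one`; the model names, parameters, and scores are invented for illustration.

```python
import pandas as pd

# Hypothetical results table with the same columns as ml_results_one.
results = pd.DataFrame({
    "model": ["model_a", "model_a", "model_b", "model_b"],
    "params": [("x", 1), ("x", 2), ("y", 1), ("y", 2)],
    "score": [0.40, 0.55, 0.60, 0.35],
})

# idxmax() returns the row label of the top score within each model group;
# .loc then pulls the full rows, including the winning parameters.
best_rows = results.loc[results.groupby("model")["score"].idxmax()]

print(best_rows)
```

When several parameter sets tie for the top score, `idxmax()` keeps only the first one, which matches the behaviour of `argmax()` in the loop above.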


### **2.4. Neural Network**

In [8]:
params={'hidden_layer_sizes': [(80,20,40,5), (75,25,50,10)], 
        'activation': ['identity', 'logistic', 'tanh', 'relu'], 
        'solver': ['lbfgs', 'sgd', 'adam'], 
        'alpha': np.logspace(-4,2,20)} 


for hidden_layer_sizes in params['hidden_layer_sizes']:
    for activation in params['activation']:
        for solver in params['solver']:
            for alpha in params['alpha']:
                model_params = (hidden_layer_sizes, activation, solver, alpha )
                model = MLPClassifier(hidden_layer_sizes = hidden_layer_sizes,
                                      activation = activation, solver = solver, alpha = alpha, random_state = 1)
                model.fit(X_train, y_train)

                model_score = score_classification(model)

                comparison_dict['model'].append('neural_network_classifier')
                comparison_dict['params'].append(model_params)
                comparison_dict['score'].append(model_score)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)

In [10]:
ml_results_two = pd.DataFrame(comparison_dict)

ml_results_two.to_csv(r'F:\OneDrive\Documents\VSCode\python_projects\Formula1\Data\ml_results_two.csv', index = False)

### **2.5. XGBoost**