# Minimum Viable Product for Tennis Match Result Predictions

In [None]:
import numpy as np
import pandas as pd

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn import metrics

Read the file produced by the preprocessing step.

In [None]:
Tennis_MVP_df = pd.read_csv('Tennis_MVP.csv')

We need to separate the explained variable from the predictors.

In [None]:
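y = Tennis_MVP_df['P1_WINS']
# x is a reference to Tennis_MVP_df (not a copy), so later in-place edits to Tennis_MVP_df also apply to x.
x = Tennis_MVP_df
x.drop('P1_WINS', axis=1, inplace=True)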
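
The cells in this notebook rely on `x` and `Tennis_MVP_df` being the same object. A minimal sketch of that behaviour on a toy frame follows; the `P1_RANK` column and all values are hypothetical and only serve the illustration.

```python
import pandas as pd

# Toy frame; P1_RANK and the values are hypothetical.
df = pd.DataFrame({
    "P1_WINS": [1, 0],
    "P1_RANK": [3, 10],
    "SURFACE": ["Clay", "Hard"],
})

alias = df               # plain assignment: same object, no copy
independent = df.copy()  # explicit copy: separate object

# An in-place drop on df is visible through the alias but not the copy.
df.drop("SURFACE", axis=1, inplace=True)

print("SURFACE" in alias.columns)
print("SURFACE" in independent.columns)
```

If an independent predictor matrix is preferred, `Tennis_MVP_df.drop('P1_WINS', axis=1)` without `inplace=True` returns a new frame, but every later transformation would then have to be applied to `x` directly.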

# Logistic Regression

## Adapt features set to Logistic Regression

To use Logistic Regression, we need to transform our **categorical** variables into **numerical** ones, specifically the surface variable and the names of the players.
For the MVP, we can drop the players' names and the dates of the matches, since no "memory" feature is present yet; it will be implemented at a later stage. <br>
The surface is not an ordinal feature, so we have several options available, such as one-hot encoding or Label Binarizer. We can then use the resulting *dummy* variables in our LR.
We will also drop the tournaments' names for now.

In [None]:
Tennis_MVP_df.drop(['P1_NAME','P2_NAME','T_NAME','T_DATE'], axis=1, inplace=True)

### OneHotEncoder - LabelBinarizer - get_dummies

We still need to look into the differences between these functions, which all transform categorical features into numerical ones.

Online sources suggest that `get_dummies` is the easiest to use, as it names each new binary column after its original feature.

__[Link for reference](https://stackabuse.com/one-hot-encoding-in-python-with-pandas-and-scikit-learn/)__


In [None]:
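# One-hot encode SURFACE in place so that x, which refers to the same DataFrame, sees the new columns.
surface_ohe = pd.get_dummies(Tennis_MVP_df['SURFACE'], prefix='SURFACE', dtype=int)
Tennis_MVP_df.drop('SURFACE', axis=1, inplace=True)
Tennis_MVP_df[surface_ohe.columns] = surface_ohe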

### Training Data - Test Data

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.20, random_state=26)

In [None]:
log_reg = LogisticRegression(random_state= 26)
log_reg.fit(x_train,y_train)

In [None]:
predictions = log_reg.predict(x_test)

In [None]:
score = log_reg.score(x_test, y_test)
cm = metrics.confusion_matrix(y_test, predictions)
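
For a classifier, `score` returns the mean accuracy, which can also be recovered from the confusion matrix. The sketch below shows the relationship on hypothetical labels, not on the tennis data.

```python
import numpy as np
from sklearn import metrics

# Hypothetical labels: 1 means player 1 wins, 0 means player 1 loses.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Rows are true classes, columns are predicted classes (ordered 0, 1).
cm = metrics.confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

# Accuracy is the share of correct predictions (the diagonal of cm).
accuracy = (tp + tn) / cm.sum()

print(cm)
print(accuracy, metrics.accuracy_score(y_true, y_pred))
```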


In [26]:
coefficients = np.round(log_reg.coef_,5)
names_features = Tennis_MVP_df.columns
coeffs_with_columns_names = pd.DataFrame(data=coefficients,columns=names_features)
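
For a binary problem, `coef_` has shape `(1, n_features)`, so it can also be viewed as one value per feature and ranked by magnitude. The sketch below does this on synthetic data; the feature names `F_A`, `F_B`, `F_C` and the data-generating rule are hypothetical. Magnitudes are only comparable across features when the features are on similar scales.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic data with hypothetical feature names.
rng = np.random.default_rng(26)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["F_A", "F_B", "F_C"])
# The label depends strongly on F_A, weakly on F_B, and not on F_C.
noise = rng.normal(scale=0.5, size=200)
y = (2.0 * X["F_A"] - 0.5 * X["F_B"] + noise > 0).astype(int)

model = LogisticRegression(random_state=26).fit(X, y)

# One coefficient per feature, ranked by absolute value.
coef_by_feature = pd.Series(model.coef_[0], index=X.columns)
order = coef_by_feature.abs().sort_values(ascending=False).index
ranked = coef_by_feature.reindex(order)

print(ranked.round(5))
```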