# 02 Logistic Regression Models

## Imports

In [21]:
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegressionCV, LogisticRegression
import pickle
from sklearn.model_selection import TimeSeriesSplit

## Read in the training and test data

In [3]:
with open('../../02_Data/02_Processed_Data/X_train.pkl', 'rb') as f:
    X_train = pickle.load(f)
    
with open('../../02_Data/02_Processed_Data/y_train.pkl', 'rb') as f:
    y_train = pickle.load(f)    

with open('../../02_Data/02_Processed_Data/X_test.pkl', 'rb') as f:
    X_test = pickle.load(f)
    
with open('../../02_Data/02_Processed_Data/y_test.pkl', 'rb') as f:
    y_test = pickle.load(f)    

## Naive Logistic Regression

In [3]:
lr = LogisticRegression(random_state=42)
lr.fit(X_train, y_train)
print('Train:', lr.score(X_train,y_train))
print('Test:', lr.score(X_test,y_test))

Train: 0.5129437869822485
Test: 0.4823529411764706


This untuned logistic regression is rerun here only as a reference point.

## Logistic Regression with Lasso Regularization

In [None]:
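# liblinear is set explicitly: the default lbfgs solver in current scikit-learn does not support 'l1'
lr = LogisticRegression(penalty='l1', C=0.1, solver='liblinear', max_iter=1000, random_state=42)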
lr.fit(X_train, y_train)
print('Train:', lr.score(X_train,y_train))
print('Test:', lr.score(X_test,y_test))

I can't get this cell to run successfully. I believe the problem is the large number of features: I let it run for over an hour with my CPU at 99.9% before interrupting the kernel and moving on to models with feature selection.

I expected this to do better because I have a lot of features relative to the amount of data. C is the inverse of the regularization strength, so reducing it from the default of 1 to 0.1 strengthens the regularization. Changing the penalty from 'l2' to 'l1' switches from ridge (shrinking coefficients toward zero) to lasso (which can set coefficients exactly to zero, effectively dropping features).
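To illustrate the ridge versus lasso difference described above, here is a minimal sketch on synthetic data (not the fight dataset). The `make_classification` settings and the `*_demo` names are assumptions chosen only for illustration, and `solver='liblinear'` is used because it accepts both penalties.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in: many features, only a few of them informative
X_demo, y_demo = make_classification(
    n_samples=500, n_features=200, n_informative=10, random_state=42
)

# Same C and solver for both models, so only the penalty differs
l1_model = LogisticRegression(penalty='l1', C=0.1, solver='liblinear', random_state=42)
l2_model = LogisticRegression(penalty='l2', C=0.1, solver='liblinear', random_state=42)
l1_model.fit(X_demo, y_demo)
l2_model.fit(X_demo, y_demo)

# Count coefficients that were driven exactly to zero
n_zero_l1 = int(np.sum(l1_model.coef_ == 0))
n_zero_l2 = int(np.sum(l2_model.coef_ == 0))
print('Zeroed coefficients with l1:', n_zero_l1)
print('Zeroed coefficients with l2:', n_zero_l2)
```

On data like this, the l1 model should zero out many coefficients while the l2 model only shrinks them.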

## Logistic Regression with GridSearch

In [None]:
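# liblinear is set explicitly so that both 'l1' and 'l2' penalties are supported
lr = LogisticRegression(solver='liblinear', random_state=42)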
lr_params = {
    'penalty' : ['l1','l2'],
    'C':[0.001, 0.01, 0.1],
    'max_iter':[1000]
}
gs = GridSearchCV(lr, param_grid=lr_params)
gs.fit(X_train, y_train)
print('Best Score:', gs.best_score_)
print('Best Parameters:', gs.best_params_)
print('Test:',gs.score(X_test,y_test))

I set up this model, but it won't run either.

----------------

# Try modeling using 100 Best Parameters and PCA

## Read in Select 100 Best and PCA datasets for modeling

In [4]:
with open('../../02_Data/02_Processed_Data/X_train_100b.pkl', 'rb') as f:
    X_train_100b = pickle.load(f)
    
with open('../../02_Data/02_Processed_Data/X_test_100b.pkl', 'rb') as f:
    X_test_100b = pickle.load(f)
    
with open('../../02_Data/02_Processed_Data/X_train_100b_pca.pkl', 'rb') as f:
    X_train_100b_pca = pickle.load(f)
    
with open('../../02_Data/02_Processed_Data/X_test_100b_pca.pkl', 'rb') as f:
    X_test_100b_pca = pickle.load(f)   

## Naive Logistic Regression using Select K Best 100 parameters 

In [6]:
lr = LogisticRegression(random_state=42)
lr.fit(X_train_100b, y_train)
print('Train:', lr.score(X_train_100b,y_train))
print('Test:', lr.score(X_test_100b,y_test))

Train: 0.6168639053254438
Test: 0.5686274509803921


Just selecting the 100 best features improves the untuned logistic regression's test score by about 0.09 (from 0.48 to 0.57). This is a nice improvement for such a simple step.
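The selection step itself happened in an earlier notebook that isn't shown here. As a rough sketch of what a "select 100 best" step might look like, assuming scikit-learn's `SelectKBest` with the `f_classif` score function (the actual score function used upstream is unknown) and synthetic data in place of the fight data:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in with more features than we want to keep
X_demo, y_demo = make_classification(
    n_samples=400, n_features=300, n_informative=15, random_state=42
)
X_tr, X_te = X_demo[:300], X_demo[300:]
y_tr = y_demo[:300]

# Fit the selector on the training rows only, then reuse it on the test rows
selector = SelectKBest(score_func=f_classif, k=100)
X_tr_100 = selector.fit_transform(X_tr, y_tr)
X_te_100 = selector.transform(X_te)
print(X_tr_100.shape, X_te_100.shape)
```

Fitting the selector only on training rows keeps the test set out of the feature ranking.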

## Logistic Regression with Lasso Regularization using Select K=100 Best parameters 

In [10]:
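# liblinear is set explicitly: the default lbfgs solver in current scikit-learn does not support 'l1'
lr = LogisticRegression(penalty='l1', C=0.1, solver='liblinear', max_iter=1000, random_state=42)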
lr.fit(X_train_100b, y_train)
print('Train:', lr.score(X_train_100b,y_train))
print('Test:', lr.score(X_test_100b,y_test))

Train: 0.603180473372781
Test: 0.5627450980392157


No improvement from adding lasso regularization. This is not too surprising, since I've already done feature selection by selecting the 100 best features.

## Logistic Regression with GridSearch using Select K=100 Best parameters 

In [30]:
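# liblinear is set explicitly so that both 'l1' and 'l2' penalties are supported
lr = LogisticRegression(solver='liblinear', random_state=42)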
lr_params = {
    'penalty' : ['l1','l2'],
    'C':[0.0001, 0.001, 0.01, 0.1],
    'max_iter':[1000, 2000]
}
gs = GridSearchCV(lr, param_grid=lr_params)
gs.fit(X_train_100b, y_train)
print('Best Score:', gs.best_score_)
print('Best Parameters:', gs.best_params_)
print('Train:',gs.score(X_train_100b,y_train))
print('Test:',gs.score(X_test_100b,y_test))

Best Score: 0.5968934911242604
Best Parameters: {'C': 0.001, 'max_iter': 1000, 'penalty': 'l2'}
Train: 0.5998520710059172
Test: 0.5803921568627451


The grid search improved the test score by about 0.02 over the lasso model, reaching 0.58. I ran this grid search a number of times, adding more search parameters, and it settled on C=0.001 (C is the inverse regularization strength, not a learning rate) with the L2 (ridge) penalty and a maximum of 1000 iterations. The choice of ridge over lasso is not surprising: since feature selection has already been done, shrinking the size of the coefficients is more effective than dropping more of them.
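When rerunning a grid search several times like this, it can help to look at every combination rather than only `best_params_`. A minimal sketch on synthetic data, using the `cv_results_` attribute of `GridSearchCV`; the `demo_*` names and the smaller grid are assumptions for illustration:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=42)

demo_params = {'penalty': ['l1', 'l2'], 'C': [0.001, 0.01, 0.1]}
demo_gs = GridSearchCV(
    LogisticRegression(solver='liblinear', random_state=42),
    param_grid=demo_params,
    cv=5,
)
demo_gs.fit(X_demo, y_demo)

# One row per parameter combination, best-ranked first
results = (
    pd.DataFrame(demo_gs.cv_results_)
    [['param_penalty', 'param_C', 'mean_test_score', 'std_test_score', 'rank_test_score']]
    .sort_values('rank_test_score')
)
print(results)
```

Comparing `mean_test_score` against `std_test_score` shows whether the winning combination is clearly better or within fold-to-fold noise of the others.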

#### Look at the best coefficients

In [19]:
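# One row per feature, sorted from most positive to most negative coefficient
coefs = pd.DataFrame(gs.best_estimator_.coef_, columns=X_train_100b.columns).T.sort_values(0, ascending=False)
# DataFrame.append was removed in pandas 2.0, so use pd.concat for the top 3 and bottom 3
top_bot_coefs = pd.concat([coefs.head(3), coefs.tail(3)])
top_bot_coefs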

| Feature | Coefficient |
|---|---|
| f1_reach_adv | 0.044867 |
| f1_grappling_submissions_attempts_avg_diff | 0.044572 |
| f1_knock_down_landed_avg_diff | 0.041376 |
| f2_total_strikes_landed_avg_diff | -0.022935 |
| f2_head_total_strikes_landed_avg_diff | -0.023089 |
| f1_f2_clinch_head_strikes_landed_avg | -0.038542 |


Reach advantage was the strongest positive predictor, which I expected based on my reading. Interestingly, the number of submission attempts was nearly as strong a predictor of wins; this is probably capturing the wins of submission fighters. The third-best predictor was the difference in average knockdowns. This is an indicator of punching power: a fighter who knocked down past opponents more often has a higher likelihood of winning.

Among the negative predictors (predictors of losses), the strongest one, f1_f2 clinch head strikes landed, is confusing at first. The f1_f2 prefix in the data indicates the inverse, "defensive" rating, so a high f1_f2 value means that fighter 1 is bad at avoiding this type of strike. Essentially, a fighter who is bad at avoiding clinch head strikes is more likely to lose. The other two negative predictors make sense: they are fighter 2's differences in average head strikes landed and average total strikes landed. If the opponent landed many strikes in past fights, fighter 1 has a higher chance of losing.
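As a reading aid for these coefficients: a logistic regression coefficient is the change in log-odds for a one-unit increase in the feature, so exponentiating it gives an odds ratio. A small sketch using the six reported values; the name `reported_coefs` is only for illustration, and what "one unit" means depends on how the features were scaled upstream, which this notebook doesn't show.

```python
import math

# Coefficients copied from the output table
reported_coefs = {
    'f1_reach_adv': 0.044867,
    'f1_grappling_submissions_attempts_avg_diff': 0.044572,
    'f1_knock_down_landed_avg_diff': 0.041376,
    'f2_total_strikes_landed_avg_diff': -0.022935,
    'f2_head_total_strikes_landed_avg_diff': -0.023089,
    'f1_f2_clinch_head_strikes_landed_avg': -0.038542,
}

# exp(coefficient) = multiplicative change in the odds of a fighter 1 win per unit of the feature
odds_ratios = {name: math.exp(coef) for name, coef in reported_coefs.items()}
for name, ratio in odds_ratios.items():
    print(f'{name}: {ratio:.4f}')
```

All six ratios sit within a few percent of 1, which is consistent with the heavy regularization (small C) chosen by the grid search.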

In [31]:
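# liblinear is set explicitly so that both 'l1' and 'l2' penalties are supported
lr = LogisticRegression(solver='liblinear', random_state=42)
lr_params = {
    'penalty' : ['l1','l2'],
    'C':[0.0001, 0.001, 0.01, 0.1, 1],
    'max_iter':[1000, 2000]
}
# Pass the splitter itself rather than a one-shot generator from .split()
time_cv = TimeSeriesSplit(n_splits=5)
gs = GridSearchCV(lr, param_grid=lr_params, cv=time_cv)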
gs.fit(X_train_100b, y_train)
print('Best Score:', gs.best_score_)
print('Best Parameters:', gs.best_params_)
print('Train:',gs.score(X_train_100b,y_train))
print('Test:',gs.score(X_test_100b,y_test))

Best Score: 0.5857777777777777
Best Parameters: {'C': 0.01, 'max_iter': 2000, 'penalty': 'l2'}
Train: 0.6068786982248521
Test: 0.5882352941176471


I'm not sure the time series split is actually working correctly. It looks like it just splits the data by row order, so I need to try sorting the data chronologically before splitting.
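`TimeSeriesSplit` does split purely on row position: each fold trains on the earlier rows and validates on the rows that follow. A minimal sketch of sorting before splitting, assuming a date column exists; `fight_date` and the toy frame are hypothetical names, not columns known to be in the pickled data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical frame with one row per month, deliberately shuffled out of order
demo = pd.DataFrame({
    'fight_date': pd.date_range('2015-01-01', periods=12, freq='MS'),
    'feature': np.arange(12),
}).sample(frac=1, random_state=42)

# Sort chronologically and reset the index so that row order equals time order
demo_sorted = demo.sort_values('fight_date').reset_index(drop=True)

tscv = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in tscv.split(demo_sorted):
    train_end = demo_sorted.loc[train_idx, 'fight_date'].max()
    test_start = demo_sorted.loc[test_idx, 'fight_date'].min()
    # After sorting, every training row precedes every validation row in time
    print(train_end.date(), '<', test_start.date())
```

The same sort order would have to be applied to `y_train` so that features and labels stay aligned.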

## Logistic Regression using Select K Best 100 parameters and then PCA

In [8]:
lr = LogisticRegression(random_state=42)
lr.fit(X_train_100b_pca, y_train)
print('Train:', lr.score(X_train_100b_pca,y_train))
print('Test:', lr.score(X_test_100b_pca,y_test))

Train: 0.5795118343195266
Test: 0.5705882352941176


There is very little overfitting now, even with an untuned model: the train and test scores differ by less than 0.01.
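To make the overfitting comparison concrete, the train/test gap for the untuned models can be computed directly from the scores printed in this notebook. The dict and its labels are just for illustration.

```python
# (train, test) accuracy of the untuned models, copied from the outputs in this notebook
reported_scores = {
    'all features': (0.5129437869822485, 0.4823529411764706),
    '100 best': (0.6168639053254438, 0.5686274509803921),
    '100 best + PCA': (0.5795118343195266, 0.5705882352941176),
}

# A larger train-minus-test gap suggests more overfitting
gaps = {name: train - test for name, (train, test) in reported_scores.items()}
for name, gap in gaps.items():
    print(f'{name}: gap = {gap:.4f}')
```

The gap shrinks from roughly 0.05 with the 100 best features to under 0.01 after PCA, while the test score stays about the same.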

## Logistic Regression with Lasso using Select K=100 Best parameters and PCA

In [11]:
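# liblinear is set explicitly: the default lbfgs solver in current scikit-learn does not support 'l1'
lr = LogisticRegression(penalty='l1', C=0.1, solver='liblinear', max_iter=1000, random_state=42)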
lr.fit(X_train_100b_pca, y_train)
print('Train:', lr.score(X_train_100b_pca,y_train))
print('Test:', lr.score(X_test_100b_pca,y_test))

Train: 0.5821005917159763
Test: 0.5666666666666667


Again, unsurprisingly, logistic regression with lasso regularization doesn't change much on data that has already had feature selection and dimensionality reduction applied.

## Logistic Regression with GridSearch using Select K=100 Best parameters and PCA

In [8]:
# liblinear is set explicitly so that both 'l1' and 'l2' penalties are supported
lr = LogisticRegression(solver='liblinear')
lr_params = {
    'penalty' : ['l1','l2'],
    'C':[0.0001, 0.001, 0.01, 0.1],
    'max_iter':[1000, 2000]
}
gs = GridSearchCV(lr, param_grid=lr_params)
gs.fit(X_train_100b_pca, y_train)
print('Best Score:', gs.best_score_)
print('Best Parameters:', gs.best_params_)
print('Train:',gs.score(X_train_100b_pca,y_train))
print('Test:',gs.score(X_test_100b_pca,y_test))

Best Score: 0.584689349112426
Best Parameters: {'C': 0.01, 'max_iter': 1000, 'penalty': 'l1'}
Train: 0.5835798816568047
Test: 0.5705882352941176


Surprisingly, the grid search on the 100-best-plus-PCA data picked the lasso (L1) penalty. The test score didn't change much, topping out at 0.57.
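One possible variation, sketched here on synthetic data, is to put the selection and PCA steps inside a scikit-learn `Pipeline` so that the grid search refits them within each cross-validation fold instead of using pre-transformed datasets. The step names, `n_components=20`, the scaling step, and the grid values are all assumptions for illustration, not the settings used to build the pickled PCA data.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(
    n_samples=400, n_features=150, n_informative=10, random_state=42
)

# Each step is refit on the training portion of every CV fold
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('select', SelectKBest(score_func=f_classif, k=100)),
    ('pca', PCA(n_components=20, random_state=42)),
    ('lr', LogisticRegression(solver='liblinear', random_state=42)),
])

# The 'lr__' prefix routes each parameter to the logistic regression step
pipe_params = {'lr__penalty': ['l1', 'l2'], 'lr__C': [0.01, 0.1, 1]}
pipe_gs = GridSearchCV(pipe, param_grid=pipe_params, cv=5)
pipe_gs.fit(X_demo, y_demo)

print('Best Score:', pipe_gs.best_score_)
print('Best Parameters:', pipe_gs.best_params_)
```

With this layout the cross-validated score never sees features selected using the validation fold, which can make `best_score_` a more honest estimate.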