# INSY695 Group Project

## Step 4: Modeling 

We apply supervised machine learning models to NHL game data in a sports betting context. The target and control features are defined as follows: <br>
- <b>Target Feature</b>: won <br>
- <b>Control Features</b>: x1, x2, x3, x4 ...<br>

The modeling objective is to investigate the factors that contribute to the outcome of a hockey game and to predict which NHL team will win. The results can support data-driven betting decisions. The following models are used:
- <b>Logistic Regression</b>
- <b>Artificial Neural Networks</b>
- <b>Random Forest</b>
- <b>Gradient Boosting</b>

Specifically, logistic regression serves as the baseline for this classification problem. The other models are built and compared against it on predictive performance and on feature importance.

## 4.1: Load Libraries

In [None]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
import numpy as np
from matplotlib import pyplot

## 4.2: Split Data

In [None]:
X = df.drop(columns=['won'])
y = df['won']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state = 5)
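
`StandardScaler` is imported in 4.1 but not applied in the cells of this section. Logistic regression and neural networks are usually sensitive to feature scale, so the sketch below shows one way a stratified split and scaling could fit together. It runs on synthetic data from `make_classification` as a stand-in for `df`; the names `df_demo`, `X_tr`, `X_te` and the columns `x1` to `x6` are illustrative assumptions, not the project's data.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the project DataFrame (assumption: numeric features, binary 'won')
X_arr, y_arr = make_classification(n_samples=500, n_features=6, random_state=5)
df_demo = pd.DataFrame(X_arr, columns=[f"x{i + 1}" for i in range(6)])
df_demo["won"] = y_arr

X_demo = df_demo.drop(columns=["won"])
y_demo = df_demo["won"]

# stratify keeps the win/loss ratio the same in the training and test splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=5, stratify=y_demo
)

# Fit the scaler on the training rows only, then reuse it on the test rows
scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.shape, X_te_scaled.shape)
```

Fitting the scaler on the training split alone keeps information about the test rows out of the preprocessing step.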

## 4.3: Build Model

### Logistic Regression

In [None]:
# baseline classifier with scikit-learn defaults
lr = LogisticRegression()
model1 = lr.fit(X_train, y_train)

In [None]:
# evaluate accuracy on the hold-out test set
y_test_pred = model1.predict(X_test)
accuracy_lr = metrics.accuracy_score(y_test, y_test_pred)
print(accuracy_lr)
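
A single hold-out split yields one accuracy figure. If an estimate of its variability is wanted, k-fold cross-validation is one option. The sketch below is a minimal example on synthetic data, not the project's pipeline; `X_demo`, `y_demo`, the choice of 5 folds and `max_iter=1000` are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the NHL features and the 'won' target
X_demo, y_demo = make_classification(n_samples=500, n_features=6, random_state=5)

# Scaling sits inside the pipeline so each fold is scaled from its own training part
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# One accuracy value per fold
scores = cross_val_score(clf, X_demo, y_demo, cv=5, scoring="accuracy")
print("Fold accuracies:", scores)
print("Mean accuracy: %.3f (std %.3f)" % (scores.mean(), scores.std()))
```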

In [None]:
# use the fitted coefficients as a proxy for feature importance
# (magnitudes are comparable across features only when the features share a scale)
importance = model1.coef_[0]
# summarize feature importance
for i, v in enumerate(importance):
    print('Feature: %0d, Score: %.5f' % (i, v))
# plot feature importance
pyplot.bar(range(len(importance)), importance)
pyplot.show()
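
The loop above reports features by position. Pairing each coefficient with its column name and sorting by absolute value can make the ranking easier to read. The following is a sketch on synthetic data; the column names `feat_a` to `feat_d` are hypothetical, and the features are standardized first so that coefficient magnitudes are on a common scale.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

feature_names = ["feat_a", "feat_b", "feat_c", "feat_d"]  # hypothetical column names

# Synthetic stand-in for the project data
X_arr, y_demo = make_classification(
    n_samples=400, n_features=4, n_informative=2, n_redundant=0, random_state=5
)
X_demo = pd.DataFrame(X_arr, columns=feature_names)

# Standardize so that coefficient sizes can be compared across features
X_scaled = StandardScaler().fit_transform(X_demo)
lr_demo = LogisticRegression(max_iter=1000).fit(X_scaled, y_demo)

# Label each coefficient with its column, then order by absolute size
coefs = pd.Series(lr_demo.coef_[0], index=X_demo.columns)
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(ranked)
```

The sign of each coefficient shows the direction of the association with the positive class, while the absolute value gives the ranking.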

### ANN

In [None]:
# one hidden layer with 5 units; (5,) is a one-element tuple, whereas (5) is just the integer 5
mlp = MLPClassifier(hidden_layer_sizes=(5,), max_iter=1000, random_state=5)
model2 = mlp.fit(X_train, y_train)

In [None]:
# evaluate accuracy on the hold-out test set
y_test_pred = model2.predict(X_test)
accuracy_ANN = metrics.accuracy_score(y_test, y_test_pred)
print(accuracy_ANN)

In [None]:
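# MLPClassifier has no coef_ attribute, so estimate importance by permutation on the test set
from sklearn.inspection import permutation_importance
result = permutation_importance(model2, X_test, y_test, n_repeats=10, random_state=5)
importance = result.importances_mean
# summarize feature importance (mean drop in accuracy when a feature is shuffled)
for i, v in enumerate(importance):
    print('Feature: %0d, Score: %.5f' % (i, v))
# plot feature importance
pyplot.bar(range(len(importance)), importance)
pyplot.show()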

### Random Forest

In [None]:
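randomforest = RandomForestClassifier(random_state=5, oob_score=True)
# fit on the training split only so the test set stays unseen, as for the other models
model3 = randomforest.fit(X_train, y_train)

# out-of-bag accuracy estimate (not cross-validation), then hold-out test accuracy
print(model3.oob_score_)
accuracy_rf = metrics.accuracy_score(y_test, model3.predict(X_test))
print(accuracy_rf)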

In [None]:
# get importance
importance = model3.feature_importances_
# summarize feature importance
for i,v in enumerate(importance):
    print('Feature: %0d, Score: %.5f' % (i,v))
# plot feature importance
pyplot.bar([x for x in range(len(importance))], importance)
pyplot.show()

### Gradient Boosting

In [None]:
gbt = GradientBoostingClassifier(random_state=5)
model4 = gbt.fit(X_train, y_train)

In [None]:
# evaluate accuracy on the hold-out test set
y_test_pred = model4.predict(X_test)
accuracy_gbt = metrics.accuracy_score(y_test, y_test_pred)
print(accuracy_gbt)

In [None]:
# get importance
importance = model4.feature_importances_
# summarize feature importance
for i,v in enumerate(importance):
    print('Feature: %0d, Score: %.5f' % (i,v))
# plot feature importance
pyplot.bar([x for x in range(len(importance))], importance)
pyplot.show()

Based on the test accuracies above, ___ makes the most correct predictions of the game's outcome. The factors that contribute most are outlined below:
- x1
- x2
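
To make the comparison explicit, the four accuracies could be collected in one table. The sketch below fits the same four model types on one split and ranks them by test accuracy. It uses synthetic data, so its numbers say nothing about the NHL results; the `candidates` dictionary, `max_iter=1000` for logistic regression and the variable names are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the project data
X_demo, y_demo = make_classification(n_samples=500, n_features=6, random_state=5)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=5
)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "ANN": MLPClassifier(hidden_layer_sizes=(5,), max_iter=1000, random_state=5),
    "Random Forest": RandomForestClassifier(random_state=5),
    "Gradient Boosting": GradientBoostingClassifier(random_state=5),
}

# Fit every model on the same training rows and score it on the same test rows
results = {}
for name, est in candidates.items():
    est.fit(X_tr, y_tr)
    results[name] = accuracy_score(y_te, est.predict(X_te))

summary = pd.Series(results, name="test_accuracy").sort_values(ascending=False)
print(summary)
```

Scoring every model on the same held-out rows keeps the accuracies directly comparable.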


## 4.4: Model Fine-Tuning

We select the best-performing model and fine-tune its hyperparameters via grid search.

In [None]:
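from sklearn.model_selection import GridSearchCV

# find optimal ___ with grid search for ____
# NOTE: `mlp` is a stand-in (its `alpha` is the L2 penalty); swap in the selected model and its own grid
alpha = [0.001, 0.01, 0.1, 1, 10, 100, 1000]
param_grid = dict(alpha=alpha)
# accuracy replaces neg_mean_squared_error because this is a classification task
grid = GridSearchCV(estimator=mlp, param_grid=param_grid, scoring='accuracy', verbose=1, n_jobs=-1)
grid_result = grid.fit(X_train, y_train)
print('Best Score: ', grid_result.best_score_)
print('Best Params: ', grid_result.best_params_)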
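
The parameter grid has to match the estimator being tuned. As an illustration only, the sketch below tunes the regularization strength `C` of a logistic regression wrapped in a pipeline, where grid keys take the form `<step name>__<parameter>`. The step names `scale` and `clf`, the grid values and the synthetic data are assumptions, not the project's final choice of model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the project data
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=5)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Keys address a pipeline step and one of its parameters
param_grid = {"clf__C": [0.01, 0.1, 1, 10, 100]}

# 5-fold cross-validated search scored by accuracy
search = GridSearchCV(pipe, param_grid=param_grid, scoring="accuracy", cv=5)
search.fit(X_demo, y_demo)

print("Best Score: ", search.best_score_)
print("Best Params: ", search.best_params_)
```

For a tree-based model the grid would instead list parameters such as `n_estimators` or `max_depth`.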