# Data Mining Project

## Knowledge Extraction Pipeline

This notebook defines a series of steps that lead to the desired prediction, covering Data Preparation, Modeling and Evaluation, according to the CRISP-DM guidelines.
For information regarding Data Understanding, please refer to [Data Understanding](data_understanding.ipynb).


### Dependencies

The code cell below imports the main dependencies of the project.
To make sure you are set up, run the following command to install or update them:

```bash
pip install -r requirements.txt
```

We chose a set of technologies that we were already familiar with and that should be adequate for the problem at hand.
These include **sklearn** to model the data, **matplotlib** and **seaborn** to create plots, and **pandas** to read and manipulate the data.


In [1]:
# Dependencies
import pandas as pd
import matplotlib.pyplot as plt
import os

from utils.files import *

## Data Understanding

### Related Work

Sports-related prediction is a fairly common problem.
It is valuable to different entities, such as bookmakers, sports teams and fans.
This, together with the recent increase in the availability of data, justifies applying machine learning techniques to the problem. [<a href="#ref1">1</a>]

The problem of predicting the outcome of a basketball game has been approached in different ways.
A common take on the subject is to predict the outcome of a single game, as opposed to the set of teams that qualify.
Nevertheless, the solutions share some similarities: the use of machine learning algorithms and of similar attributes (rebounds, free throws, turnovers, etc.).

In [<a href="#ref2">2</a>] the authors identify the characteristic high dimensionality of the problem and employ a Support Vector Machine (SVM) that predicted the outcome of a game with 88% accuracy.

Among the common attributes, the author of [<a href="#ref3">3</a>] found that the most important ones were Free Throws, Offensive Rebounds, Turnovers and +/- (Plus-Minus).
The author was also able to predict the champion team with 86% recall using a Random Forest.

Finally, the authors of [<a href="#ref4">4</a>] used a Naive Bayes Classifier to predict the outcome of games with 67% accuracy.


## Data Preparation

TODO: add text about data set (summary from data exploration)


TODO: add text about which transformations were made


In [2]:
df = pd.read_csv(os.path.join(DATA_PATH, DATA_MERGED))
df.head()

|   | year | playoff | EFF | confID |
|---|------|---------|-----|--------|
| 0 | 9 | 0 | 0.0 | EA |
| 1 | 10 | 1 | 4.642656 | EA |
| 2 | 1 | 0 | 0.0 | EA |
| 3 | 2 | 1 | 3.983925 | EA |
| 4 | 3 | 1 | 3.436282 | EA |


## Modeling and Evaluation

The following cell imports general utility functions (defined in `utils/modeling.py`) that are used to model the data and evaluate the results.

We chose to create the training and testing subsets chronologically.
Since our game data is ordered in time, it would not make sense to mix data from different years: the model could end up being trained on seasons that come after the one it is tested on.
For example, we can train the model on the first 9 years and use the 10th and last year to test the model's predictions.
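
The chronological split described above can be sketched as follows. This is only an illustration on made-up data: the helper name `temporal_split` and the toy DataFrame are assumptions, and the actual split used by this notebook lives in `utils/modeling.py`, which is not shown here.

```python
import pandas as pd

def temporal_split(data, test_year, target="playoff", time_col="year"):
    """Train on every season before `test_year`, test on `test_year` itself."""
    train = data[data[time_col] < test_year]
    test = data[data[time_col] == test_year]
    X_train, y_train = train.drop(columns=[target]), train[target]
    X_test, y_test = test.drop(columns=[target]), test[target]
    return X_train, y_train, X_test, y_test

# Toy data (made up for illustration): two teams per season over 10 seasons
toy = pd.DataFrame({
    "year": [y for y in range(1, 11) for _ in range(2)],
    "EFF": [float(i) for i in range(20)],
    "playoff": [i % 2 for i in range(20)],
})

X_train, y_train, X_test, y_test = temporal_split(toy, test_year=10)
print(len(X_train), len(X_test))
```

Filtering with `<` rather than `!=` keeps seasons later than the test year out of the training set, which matters when testing on year 9 while year 10 is also present in the data.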


In [3]:
from utils.modeling import *

### Ranking-based prediction

In [4]:
def pred():
    """Baseline: predict the four top-rated teams of each conference as playoff teams."""
    # The baseline needs the conference and rating columns; do nothing without them
    if not {'confID', 'per', 'eff'}.issubset(df.columns):
        return

    test_year = 10
    test_df = df[df['year'] == test_year]

    X_test = test_df.drop(columns=['playoff'])
    y_test = test_df['playoff']

    # Threshold per conference: the rating of its 4th best team
    criteria = 'eff'
    threshold = {
        'EA': X_test[X_test['confID'] == 'EA'][criteria].nlargest(4).min(),
        'WE': X_test[X_test['confID'] == 'WE'][criteria].nlargest(4).min()
    }
    y_pred = X_test.apply(lambda row: 1 if row[criteria] >= threshold[row['confID']] else 0, axis=1)
    X_test['playoff'] = y_test
    X_test['pred'] = y_pred
    X_test['Correct'] = X_test['pred'] == X_test['playoff']

    print(threshold)
    display(X_test)

    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    displayResults(Result(y_test, y_pred, accuracy, precision, recall, f1))
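
The ranking rule in the cell above (the four highest-rated teams of each conference are predicted to reach the playoffs) can also be expressed with a grouped rank. The sketch below runs on made-up ratings; the helper name `top_k_per_conference` is hypothetical and not part of the project's utilities.

```python
import pandas as pd

# Made-up ratings for illustration: six teams in each of two conferences
toy = pd.DataFrame({
    "confID": ["EA"] * 6 + ["WE"] * 6,
    "EFF": [5.0, 4.0, 3.0, 2.0, 1.0, 0.5, 9.0, 8.0, 7.0, 6.0, 5.5, 0.0],
})

def top_k_per_conference(data, criteria="EFF", k=4):
    """Return 1 for teams ranked in the top k of their conference, else 0."""
    # Rank 1 is the highest rating; tied teams share the best (lowest) rank
    ranks = data.groupby("confID")[criteria].rank(method="min", ascending=False)
    return (ranks <= k).astype(int)

toy["pred"] = top_k_per_conference(toy)
print(toy)
```

With `method="min"`, teams tied at the fourth place are all predicted as playoff teams, which mirrors the `>=` comparison against the per-conference threshold.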


### Modeling

In [5]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn import svm

models = {
    "Decision Tree" : DecisionTreeClassifier(random_state=42, criterion='entropy', max_depth=6, min_samples_leaf=8, min_samples_split=3),
    "Random Forest" : RandomForestClassifier(n_estimators=100, max_depth=2, random_state=42),
    "Naive Bayes" : GaussianNB(),
    "Neural Net" : MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(5, 2), random_state=42),
    "SVM" : svm.SVC(kernel='linear', C=1, random_state=42)
    }

def testDf(test_year, models):
    print("Testing on year", test_year)
    results = pd.DataFrame(columns=['Model', 'Accuracy', 'Precision', 'Recall', 'F1'])
    for name, model in models.items():
        res = runModel(df, model, test_year=test_year)
        results.loc[len(results)] = [name] + res.toRow()
    
    # Append a row with the average of each metric across all models
    results.loc[len(results)] = ["Average"] + results.mean(axis=0, numeric_only=True).tolist()

    display(results)
    return results

r9 = testDf(9,models)
r10 = testDf(10,models)

Testing on year 9


|   | Model | Accuracy | Precision | Recall | F1 |
|---|-------|----------|-----------|--------|----|
| 0 | Decision Tree | 0.785714 | 0.777778 | 0.875 | 0.823529 |
| 1 | Random Forest | 0.857143 | 0.875 | 0.875 | 0.875 |
| 2 | Naive Bayes | 0.714286 | 0.75 | 0.75 | 0.75 |
| 3 | Neural Net | 0.285714 | 0.375 | 0.375 | 0.375 |
| 4 | SVM | 0.571429 | 0.571429 | 1.0 | 0.727273 |
| 5 | Average | 0.642857 | 0.669841 | 0.775 | 0.71016 |


Testing on year 10


|   | Model | Accuracy | Precision | Recall | F1 |
|---|-------|----------|-----------|--------|----|
| 0 | Decision Tree | 0.538462 | 0.625 | 0.625 | 0.625 |
| 1 | Random Forest | 0.769231 | 0.777778 | 0.875 | 0.823529 |
| 2 | Naive Bayes | 0.692308 | 0.75 | 0.75 | 0.75 |
| 3 | Neural Net | 0.692308 | 0.75 | 0.75 | 0.75 |
| 4 | SVM | 0.615385 | 0.615385 | 1.0 | 0.761905 |
| 5 | Average | 0.661538 | 0.703632 | 0.8 | 0.742087 |
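
When reading these tables, note that a recall of 1.0 paired with accuracy equal to precision is what a classifier produces when it labels every team as a playoff team. The sketch below checks this for the year-10 SVM row. The label counts (13 test teams, 8 of them in the playoffs) are an assumption inferred from the reported fractions, not read from the data set.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Assumed label counts for the year-10 test set: 13 teams, 8 of them in the playoffs
y_true = [1] * 8 + [0] * 5
y_pred = [1] * 13  # a degenerate classifier that always predicts "playoff"

scores = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
}
print({name: round(value, 6) for name, value in scores.items()})
```

Under that assumption the four values reproduce the SVM row for year 10 (0.615385, 0.615385, 1.0, 0.761905), so that row is consistent with a model that learned nothing beyond the majority class.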


In [6]:
DecisionTree_GSCV = DecisionTree_GridSearch(df)
DecisionTree_bestParams = DecisionTree_GSCV.best_params_

# For EFF
# Fitting 5 folds for each of 3888 candidates, totalling 19440 fits
# DecisionTree_bestParams = {'criterion': 'gini', 'max_depth': 2, 'max_features': 'log2', 'min_samples_leaf': 4, 'min_samples_split': 5}
# 0.6206153846153847
# DecisionTreeClassifier(max_depth=2, max_features='log2', min_samples_leaf=4,
#                        min_samples_split=5)

Fitting 1 folds for each of 3888 candidates, totalling 3888 fits
0      9
11     9
22     9
32     9
42     9
      ..
135    4
136    5
137    6
138    7
139    8
Name: year, Length: 129, dtype: int64




{'criterion': 'gini', 'max_depth': 2, 'max_features': 'auto', 'min_samples_leaf': 2, 'min_samples_split': 4}
0.5478260869565217
DecisionTreeClassifier(max_depth=2, max_features='auto', min_samples_leaf=2,
                       min_samples_split=4)
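
The `Fitting 1 folds` line in the output above indicates that each candidate is scored on a single train/validation split rather than on 5-fold cross-validation. How `DecisionTree_GridSearch` builds that split is defined in `utils/modeling.py` and is not shown in this notebook. One way to obtain a single chronological fold with scikit-learn is `PredefinedSplit`, sketched here on made-up data with a deliberately small, hypothetical parameter grid.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, PredefinedSplit
from sklearn.tree import DecisionTreeClassifier

# Made-up data for illustration: 10 seasons, 12 teams each, one numeric feature
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "year": np.repeat(np.arange(1, 11), 12),
    "EFF": rng.normal(size=120),
})
toy["playoff"] = (toy["EFF"] + rng.normal(scale=0.5, size=120) > 0).astype(int)

validation_year = 10
X, y = toy[["EFF"]], toy["playoff"]

# -1 keeps a row in the training part; 0 assigns it to the single validation fold
test_fold = np.where(toy["year"] == validation_year, 0, -1)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={"max_depth": [1, 2, 3], "min_samples_leaf": [1, 2, 4]},
    cv=PredefinedSplit(test_fold),
    scoring="accuracy",
    verbose=1,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

With a single fold, `best_score_` is the score on that one validation season, so it is a noisier estimate than a multi-fold average. By default `GridSearchCV` then refits the best candidate on all rows, including the validation season.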




In [7]:
RandomForest_GSCV = RandomForest_GridSearch(df)
RandomForest_bestParams = RandomForest_GSCV.best_params_

# Fitting 5 folds for each of 6480 candidates, totalling 32400 fits
# RandomForest_bestParams = {'criterion': 'log_loss', 'max_depth': 1, 'max_features': None, 'min_samples_split': 3, 'n_estimators': 5}
# 0.6436923076923077
# RandomForestClassifier(criterion='log_loss', max_depth=1, max_features=None,
#                        min_samples_split=3, n_estimators=5)



Fitting 1 folds for each of 6480 candidates, totalling 6480 fits
0      9
11     9
22     9
32     9
42     9
      ..
135    4
136    5
137    6
138    7
139    8
Name: year, Length: 129, dtype: int64
{'criterion': 'entropy', 'max_depth': 1, 'max_features': 'sqrt', 'min_samples_split': 4, 'n_estimators': 5}
0.5652173913043478
RandomForestClassifier(criterion='entropy', max_depth=1, min_samples_split=4,
                       n_estimators=5)


In [8]:
NeuralNet_GSCV = NeuralNet_GridSearch(df)
NeuralNet_bestParams = NeuralNet_GSCV.best_params_

# Fitting 5 folds for each of 96 candidates, totalling 480 fits
# NeuralNet_bestParams = {'activation': 'relu', 'alpha': 1e-06, 'hidden_layer_sizes': (100, 100, 100, 100), 'learning_rate': 'constant', 'max_iter': 5000, 'solver': 'sgd'}
# 0.5741538461538462
# MLPClassifier(alpha=1e-06, hidden_layer_sizes=(100, 100, 100, 100),
#               max_iter=5000, random_state=42, solver='sgd')

Fitting 1 folds for each of 96 candidates, totalling 96 fits
0      9
11     9
22     9
32     9
42     9
      ..
135    4
136    5
137    6
138    7
139    8
Name: year, Length: 129, dtype: int64




{'activation': 'tanh', 'alpha': 1e-06, 'hidden_layer_sizes': (100, 100), 'learning_rate': 'constant', 'max_iter': 5000, 'solver': 'lbfgs'}
0.41739130434782606
MLPClassifier(activation='tanh', alpha=1e-06, hidden_layer_sizes=(100, 100),
              max_iter=5000, random_state=42, solver='lbfgs')


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)
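
The warning above suggests either raising `max_iter` or scaling the data. A minimal sketch of the second option, assuming the features are numeric, is to put a `StandardScaler` in front of the network inside a pipeline. The data below is made up, and the hyperparameters are only illustrative.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up data for illustration: two features on very different scales
rng = np.random.default_rng(42)
X = np.column_stack([rng.normal(size=200), rng.normal(scale=1000.0, size=200)])
y = (X[:, 0] + X[:, 1] / 1000.0 > 0).astype(int)

# Inside a pipeline the scaler is fitted only on the data passed to fit()
model = make_pipeline(
    StandardScaler(),
    MLPClassifier(solver="lbfgs", hidden_layer_sizes=(5, 2), max_iter=5000, random_state=42),
)
model.fit(X, y)
print(model.score(X, y))
```

If such a pipeline were passed to a grid search, the network's parameters would be addressed with the step prefix that `make_pipeline` generates, for example `mlpclassifier__alpha`.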


In [9]:
SVM_GSCV = SVM_GridSearch(df)
SVM_bestParams = SVM_GSCV.best_params_

# Fitting 5 folds for each of 128 candidates, totalling 640 fits
# SVM_bestParams = {'C': 0.2, 'kernel': 'poly', 'probability': False, 'shrinking': False}
# 0.5581538461538462
# SVC(C=0.2, kernel='poly', shrinking=False)



Fitting 1 folds for each of 128 candidates, totalling 128 fits
0      9
11     9
22     9
32     9
42     9
      ..
135    4
136    5
137    6
138    7
139    8
Name: year, Length: 129, dtype: int64
{'C': 0.2, 'kernel': 'rbf', 'probability': False, 'shrinking': False}
0.41739130434782606
SVC(C=0.2, shrinking=False)


In [10]:
# For year=10
improved_models = {
    "Decision Tree" : DecisionTreeClassifier(**DecisionTree_bestParams,random_state=42),
    "Random Forest" : RandomForestClassifier(**RandomForest_bestParams,random_state=42),
    "Naive Bayes" : GaussianNB(),
    "Neural Net" : MLPClassifier(**NeuralNet_bestParams,random_state=42),
    "SVM" : svm.SVC(**SVM_bestParams,random_state=42)
    }

testDf(10,improved_models)

Testing on year 10


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)


|   | Model | Accuracy | Precision | Recall | F1 |
|---|-------|----------|-----------|--------|----|
| 0 | Decision Tree | 0.615385 | 0.636364 | 0.875 | 0.736842 |
| 1 | Random Forest | 0.615385 | 0.636364 | 0.875 | 0.736842 |
| 2 | Naive Bayes | 0.692308 | 0.75 | 0.75 | 0.75 |
| 3 | Neural Net | 0.384615 | 0.5 | 0.5 | 0.5 |
| 4 | SVM | 0.615385 | 0.615385 | 1.0 | 0.761905 |
| 5 | Average | 0.584615 | 0.627622 | 0.8 | 0.697118 |




### References

<a id="ref1"></a> [1] Bunker, R. P., & Thabtah, F. (2019). A machine learning framework for sport result prediction. Applied Computing and Informatics, 15(1), 27-33. https://doi.org/10.1016/j.aci.2017.09.005

<a id="ref2"></a> [2] Jadhav, A. (2016). Predicting the NBA playoff using SVM. CORE. https://core.ac.uk/display/230494997?utm_source=pdf&utm_medium=banner&utm_campaign=pdf-decoration-v1

<a id="ref3"></a> [3] Jien, O. W. (2022, January 5). Prediction model for NBA championship by Machine Learning. Medium. https://medium.com/@weinjien99/prediction-model-for-nba-championship-by-machine-learning-8e8884ea72c8

<a id="ref4"></a> [4] Miljković, D., Gajić, L., Kovačević, A., & Konjović, Z. (2010). The use of data mining for basketball matches outcomes prediction. IEEE 8th International Symposium on Intelligent Systems and Informatics, Subotica, Serbia, 309-312. https://doi.org/10.1109/SISY.2010.5647440
