# Data Mining Project

## Knowledge Extraction Pipeline

This notebook defines the steps that lead to the desired prediction, covering Data Preparation, Modeling and Evaluation, in line with the CRISP-DM guidelines.
For information regarding Data Understanding, please refer to [Data Understanding](data_understanding.ipynb).


### Dependencies

The code block below imports the major dependencies for the project.
Before running it, install or update the dependencies with the following command:

```bash
pip install -r requirements.txt
```

We chose a set of technologies that we were familiar with and that should be adequate for the problem at hand.
These include **sklearn** to model the data, **matplotlib** and **seaborn** to create plots, and **pandas** to read and manipulate the data.


In [None]:
# Dependencies
import pandas as pd
import matplotlib.pyplot as plt
import os

from utils.files import *

## Data Understanding

### Related Work

Sports-related prediction is a fairly common problem.
It is valuable to different entities, such as bookmakers, sports teams and fans.
This, together with the recent increase in the availability of data, justifies applying machine learning techniques to the problem. [<a href="#ref1">1</a>]

The problem of predicting the outcome of a basketball game has been approached in different ways.
A common take on the subject is to predict the outcome of a single game, as opposed to the set of teams that qualify.
Nevertheless, the solutions share some similarities, namely the use of machine learning algorithms and of similar attributes (rebounds, free throws, turnovers, etc.).

In [<a href="#ref2">2</a>], the author identifies the characteristic high dimensionality of the problem and employs a Support Vector Machine algorithm that predicted the outcome of a game with 88% accuracy.

Among common attributes, the author of [<a href="#ref3">3</a>] found that the most important ones were Free Throws, Offensive Rebounds, Turnovers and +/- (Plus-Minus).
They were also able to predict the champion team with 86% recall using a Random Forest.

Finally, the authors of [<a href="#ref4">4</a>] used a Naive Bayes Classifier to predict the outcome of games with 67% accuracy.


## Data Preparation

TODO: add text about data set (summary from data exploration)


TODO: add text about which transformations were made


In [None]:
df = pd.read_csv(os.path.join(DATA_PATH, DATA_MERGED))
df.head()

## Modeling and Evaluation

The following block imports general utility functions (defined in `utils/modeling.py`) that are used to model the data and assess the results.

We chose to create the training and testing subsets chronologically.
Since our game data is ordered in time, it would not make sense to mix data from different years: the model would then be trained on seasons that come after the one it is tested on.
As an example, we can train the model on the first 9 years and use the 10th and last year to test the model's predictions.
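
The chronological split described above could be sketched as follows. This is a minimal illustration, not the implementation in `utils/modeling.py`: the helper name `temporal_split` and the toy data are assumptions, while the `year` and `playoff` column names are the ones used elsewhere in this notebook.

```python
import pandas as pd

def temporal_split(data, test_year, year_col="year", target_col="playoff"):
    """Hypothetical helper: train on every season before `test_year`, test on `test_year`."""
    train = data[data[year_col] < test_year]
    test = data[data[year_col] == test_year]
    X_train, y_train = train.drop(columns=[target_col]), train[target_col]
    X_test, y_test = test.drop(columns=[target_col]), test[target_col]
    return X_train, y_train, X_test, y_test

# Toy data (made up): 3 seasons with 2 teams each
toy = pd.DataFrame({
    "year": [1, 1, 2, 2, 3, 3],
    "eff": [10.0, 8.0, 11.0, 7.5, 9.0, 9.5],
    "playoff": [1, 0, 1, 0, 0, 1],
})

X_train, y_train, X_test, y_test = temporal_split(toy, test_year=3)
print(len(X_train), "training rows,", len(X_test), "test rows")
```

Filtering on the year column, rather than shuffling rows, keeps every training season strictly earlier than the test season.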


In [None]:
from utils.modeling import *


### Ranking-based prediction

In [None]:
def pred():
    """Baseline: predict a playoff spot for the top 4 teams of each conference, ranked by `criteria`."""
    # Skip the baseline if df lacks the columns it relies on
    if not {'confID', 'per', 'eff'}.issubset(df.columns):
        return

    test_year = 10
    test_df = df[df['year'] == test_year]

    X_test = test_df.drop(columns=['playoff'])
    y_test = test_df['playoff']

    # Per-conference threshold: the 4th-largest value of the ranking criterion
    criteria = 'eff'
    threshold = {
        'EA': X_test[X_test['confID'] == 'EA'][criteria].nlargest(4).min(),
        'WE': X_test[X_test['confID'] == 'WE'][criteria].nlargest(4).min(),
    }
    y_pred = X_test.apply(lambda row: 1 if row[criteria] >= threshold[row['confID']] else 0, axis=1)
    X_test['playoff'] = y_test
    X_test['pred'] = y_pred
    X_test['Correct'] = X_test['pred'] == X_test['playoff']

    print(threshold)
    display(X_test)

    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    displayResults(Result(y_test, y_pred, accuracy, precision, recall, f1))

pred()
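
The ranking rule in `pred()` marks a team as a playoff team when its score is at least the 4th-largest in its conference. A hedged sketch of the same rule on toy data is shown below. It swaps the explicit per-conference thresholds for a `groupby` rank, which should select the same rows (ties at the cutoff included) without hard-coding the conference names. The helper `top_k_per_group` and the toy values are hypothetical.

```python
import pandas as pd

def top_k_per_group(data, criteria="eff", group_col="confID", k=4):
    """Hypothetical helper: 1 for rows ranked in the top k of their group, else 0."""
    # method="min" gives tied values the best rank among them,
    # so ties at the cutoff are all kept (same effect as `>= threshold`)
    ranks = data.groupby(group_col)[criteria].rank(method="min", ascending=False)
    return (ranks <= k).astype(int)

# Toy data (made up): 4 teams per conference, keep the top 2
toy_teams = pd.DataFrame({
    "confID": ["EA"] * 4 + ["WE"] * 4,
    "eff": [5.0, 3.0, 9.0, 7.0, 2.0, 8.0, 6.0, 4.0],
})
toy_teams["pred"] = top_k_per_group(toy_teams, k=2)
print(toy_teams)
```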


### Modeling

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn import svm

# Models configured with the best hyperparameters found
models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42, criterion='entropy', max_depth=6, min_samples_leaf=8, min_samples_split=3),
    "Random Forest": RandomForestClassifier(n_estimators=100, max_depth=2, random_state=42),
    "Naive Bayes": GaussianNB(),
    "Neural Net": MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(5, 2), random_state=42, max_iter=500),
    "SVM": svm.SVC(kernel='linear', C=1, random_state=42),
}


def testDf(test_year, models, display_results=True):
    """Run every model with `test_year` as the test year; return one row of metrics per model plus an average row."""
    print("Testing on year", test_year)
    results = pd.DataFrame(columns=['Model', 'Accuracy', 'Precision', 'Recall', 'F1'])
    for name, model in models.items():
        res = runModel(df, model, test_year=test_year)
        results.loc[len(results)] = [name] + res.toRow()

    # Append a row with the mean of each metric across models
    results.loc[len(results)] = ["Average"] + results.mean(axis=0, numeric_only=True).tolist()

    if display_results:
        display(results)
    return results

avg_score = pd.DataFrame(columns=['year', 'Accuracy', 'Precision', 'Recall', 'F1'])

# Evaluate on each test year from 4 to 10, keeping only the per-year average row
for i in range(4, 11):
    r = testDf(i, models, display_results=True)
    avg_score.loc[len(avg_score)] = [str(i)] + r.iloc[-1].tolist()[1:]

# Final row: average of the per-year averages
avg_score.loc[len(avg_score)] = ["Average"] + avg_score.mean(axis=0, numeric_only=True).tolist()
display(avg_score)

In [None]:
from sklearn import ensemble as ens

models_dict = {}

# Voting with all models
# models_dict['ensemble hard'] = ens.VotingClassifier(estimators=list(models.items()), voting='hard')

# Voting just for Decision Tree and Random Forest
models_new = {
    "Decision Tree": DecisionTreeClassifier(random_state=42, criterion='entropy', max_depth=6, min_samples_leaf=8, min_samples_split=3),
    "Random Forest": RandomForestClassifier(n_estimators=100, max_depth=2, random_state=42),
}
models_dict['ensemble hard'] = ens.VotingClassifier(estimators=list(models_new.items()), voting='hard')

ensemble_score = pd.DataFrame(columns=['year', 'Accuracy', 'Precision', 'Recall', 'F1'])

for i in range(4, 11):
    r = testDf(i,models_dict,display_results=True)
    ensemble_score.loc[len(ensemble_score)] = [str(i)] + (r.iloc[-1]).tolist()[1:]

ensemble_score.loc[len(ensemble_score)] = ["Average"] + ensemble_score.mean(axis=0, numeric_only=True).tolist()
display(ensemble_score)
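
One property worth keeping in mind for the two-model hard-voting ensemble above: when the two members disagree, the vote is tied. As far as the scikit-learn documentation describes it, `VotingClassifier` resolves ties by ascending class label order, so with labels 0 and 1 a tie goes to class 0. The sketch below illustrates this with two constant `DummyClassifier` members, chosen here only to force a disagreement; they are not part of the project.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import VotingClassifier

# Toy data (made up): the features are irrelevant because both members are constant
X = np.zeros((4, 1))
y = np.array([0, 1, 0, 1])

vote = VotingClassifier(
    estimators=[
        ("always_0", DummyClassifier(strategy="constant", constant=0)),
        ("always_1", DummyClassifier(strategy="constant", constant=1)),
    ],
    voting="hard",
)
vote.fit(X, y)

# Every sample is a 1-1 tie between the two members
tie_pred = vote.predict(X)
print(tie_pred)
```

If this reading is right, a two-member hard vote predicts the positive class only when both members agree. Adding a third model, or using `voting='soft'` with models that expose `predict_proba`, would avoid ties.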

### References

<a id="ref1"></a> [1] Bunker, R. P., & Thabtah, F. (2019). A machine learning framework for sport result prediction. Applied Computing and Informatics, 15(1), 27-33. https://doi.org/10.1016/j.aci.2017.09.005

<a id="ref2"></a> [2] Jadhav, A. (2016). Predicting the NBA playoff using SVM. CORE. https://core.ac.uk/display/230494997

<a id="ref3"></a> [3] Jien, O. W. (2022, January 5). Prediction model for NBA championship by Machine Learning. Medium. https://medium.com/@weinjien99/prediction-model-for-nba-championship-by-machine-learning-8e8884ea72c8

<a id="ref4"></a> [4] Miljković, D., Gajić, L., Kovačević, A., & Konjović, Z. (2010). The use of data mining for basketball matches outcomes prediction. IEEE 8th International Symposium on Intelligent Systems and Informatics, Subotica, Serbia, 309-312. https://doi.org/10.1109/SISY.2010.5647440
