# FootbalPrediction classification
After comparing the experimental results (3), the model selected is a Decision Tree with TfidfVectorizer using a trigram approach.
The model assigns each match to one of three classes: 0 (draw), 1 (home win), 2 (away win).


In [1]:
from my_tokenizer import MyTokenizer
import pandas as pd
import util_strings as utils

The features need to be normalized. Because the texts contain team names and nicknames, every word referring to the home team is replaced with "home team", and every word referring to the away team is replaced with "away team".
This is done by the feature_normalization() method.

In [2]:
mt = MyTokenizer(pd.read_csv(utils.completed_dataset, index_col=0))
mt.feature_normalization()
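
The internals of feature_normalization() are not shown here, so the following is only a minimal sketch of the idea, assuming the team aliases are available as plain lists. The function `normalize_teams`, its parameters, and the sample sentence are hypothetical and are not part of `MyTokenizer`.

```python
import re

def normalize_teams(text, home_aliases, away_aliases):
    """Replace every alias of each team with a generic placeholder (illustrative only)."""
    def replace_all(s, aliases, placeholder):
        # Longest aliases first, so "Inter Milan" is matched before "Inter".
        for alias in sorted(aliases, key=len, reverse=True):
            pattern = r"\b" + re.escape(alias) + r"\b"
            s = re.sub(pattern, placeholder, s, flags=re.IGNORECASE)
        return s

    text = replace_all(text, home_aliases, "home team")
    text = replace_all(text, away_aliases, "away team")
    return text

# Hypothetical sample text and alias lists.
sample = "Juventus beat Inter Milan 2-0; the Bianconeri dominated while Inter struggled."
normalized = normalize_teams(sample, ["Juventus", "Bianconeri"], ["Inter Milan", "Inter"])
print(normalized)
```

Replacing names with placeholders keeps the classifier from memorizing specific clubs, so it has to rely on the wording that describes the match instead.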

Before training the model, the texts go through several preprocessing phases:
- punctuation and number removal, using regular expressions
- tokenization: each document is split into tokens. The NLTK tokenizer was chosen because it gave the best performance
- stop word filtering: words belonging to the "stopword set" are removed because they are considered noise
- stemming: each token is reduced to its root form
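
The phases above can be sketched roughly as follows. This is a simplified illustration that uses only the standard library: the whitespace tokenizer, the tiny `STOP_WORDS` set, and the `toy_stem` function are stand-ins (assumptions) for the NLTK tokenizer, stop word list, and stemmer that the notebook relies on through clean_text().

```python
import re

# Tiny illustrative stop word set; a real pipeline would use a full list.
STOP_WORDS = {"the", "a", "an", "in", "of", "and", "to", "was", "is"}

def toy_stem(token):
    """Crude suffix stripping, a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    text = re.sub(r"[^A-Za-z\s]", " ", text)             # remove punctuation and numbers
    tokens = text.lower().split()                         # simple whitespace tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop word filtering
    return [toy_stem(t) for t in tokens]                  # stemming

tokens = preprocess("The home team scored 2 goals in the second half!")
print(tokens)  # ['home', 'team', 'scor', 'goal', 'second', 'half']
```

As with real stemmers, the resulting roots (such as "scor") need not be dictionary words. They only have to be consistent across documents.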

These tokens then have to be transformed into numeric vectors. TF-IDF weighting quantifies how important a term is to a document relative to the whole collection of documents.
A TfidfVectorizer model is built and saved in the "ML model" folder.

In [3]:
mt.clean_text()
vectorizer = True  # if True use TfidfVectorizer, if False use CountVectorizer
X_train, X_test, y_train, y_test = mt.set_bigram_and_get_sets(vectorizer) 
path_vec = utils.TfidfVectorizer if vectorizer else utils.CountVectorizer
mt.save_vectorizer(path_vec)
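
To make the weighting concrete, here is a small pure-Python sketch of TF-IDF over word n-grams. It is not the notebook's implementation: it assumes n-grams of length 1 to 3 and the smoothed idf with L2 normalization that scikit-learn's TfidfVectorizer applies by default. The actual n-gram range is set inside `MyTokenizer` and is not visible here. The names `ngrams` and `tfidf` are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n_max=3):
    """All contiguous n-grams of length 1..n_max, joined by spaces."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def tfidf(docs, n_max=3):
    """Return one {term: weight} dict per tokenized document."""
    term_lists = [ngrams(doc, n_max) for doc in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for terms in term_lists for term in set(terms))
    vectors = []
    for terms in term_lists:
        tf = Counter(terms)
        # Smoothed idf: ln((1 + N) / (1 + df)) + 1
        raw = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
               for t, c in tf.items()}
        # L2-normalize each document vector (guard against empty documents).
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        vectors.append({t: w / norm for t, w in raw.items()})
    return vectors

docs = [["home", "team", "win"], ["away", "team", "win"]]
vecs = tfidf(docs)
print(sorted(vecs[0], key=vecs[0].get, reverse=True))
```

Terms that occur in only one document ("home", "home team", "home team win") get a higher weight than terms shared by both ("team", "win", "team win"), which is what makes them useful for telling the classes apart.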

## Building the model
The model used is a Decision Tree with TfidfVectorizer using a trigram approach.

In [4]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
import sklearn.metrics as mtr

In [40]:
# No random_state is set, so the fitted tree (and its scores) can vary between runs.
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
report = mtr.classification_report(y_test, y_pred, output_dict=True, zero_division=0)
report['accuracy']

0.7239263803680982

In [13]:
analysis = {
    'Model': 'Model', 'Accuracy': report['accuracy'],
    'Avg Precision (macro)': report['macro avg']['precision'],
    'Avg Recall (macro)': report['macro avg']['recall'],
    'Avg F1-score (macro)': report['macro avg']['f1-score'],
    'Avg Precision (weighted)': report['weighted avg']['precision'],
    'Avg Recall (weighted)': report['weighted avg']['recall'],
    'Avg F1-score (weighted)': report['weighted avg']['f1-score']
}
analysis

{'Model': 'Model',
 'Accuracy': 0.8312883435582822,
 'Avg Precision (macro)': 0.8211910048282247,
 'Avg Recall (macro)': 0.8059748512750217,
 'Avg F1-score (macro)': 0.8124245745729123,
 'Avg Precision (weighted)': 0.8312407219357778,
 'Avg Recall (weighted)': 0.8312883435582822,
 'Avg F1-score (weighted)': 0.8300314766901935}
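
The report contains both macro and weighted averages. As a rough illustration of the difference, the sketch below computes both from per-class scores. The per-class F1 values and supports are made up for the example and are not taken from this model.

```python
def averages(per_class):
    """per_class maps label -> (score, support). Returns (macro, weighted)."""
    scores = [score for score, _ in per_class.values()]
    total = sum(support for _, support in per_class.values())
    macro = sum(scores) / len(scores)                        # every class counts equally
    weighted = sum(s * n for s, n in per_class.values()) / total  # classes count by support
    return macro, weighted

# Hypothetical per-class F1 scores and supports for draw, home win, away win.
per_class = {0: (0.50, 20), 1: (0.90, 50), 2: (0.80, 30)}
macro, weighted = averages(per_class)
print(round(macro, 4), round(weighted, 4))  # 0.7333 0.79
```

When the weighted average is higher than the macro average, as in the report above, the less frequent classes tend to be the ones with lower scores. The weighted recall is always equal to the accuracy, which is why those two entries of the report match.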

The built model is saved in the "ML model" folder.

In [14]:
import pickle
file_name = utils.classificatorTfIdf if vectorizer else utils.classificatorCV_LR
with open(file_name, 'wb') as file:
    pickle.dump(model, file)
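
A saved model is loaded back with pickle.load. The sketch below shows the round trip using a plain dictionary as a stand-in for the fitted classifier, so that it runs without the dataset. The temporary path and the dictionary contents are assumptions made for illustration only.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted classifier: any picklable Python object works the same way.
stand_in_model = {"classes": [0, 1, 2],
                  "labels": {0: "draw", 1: "home win", 2: "away win"}}

# Hypothetical location; the notebook takes its path from util_strings instead.
path = os.path.join(tempfile.mkdtemp(), "model.pkl")

with open(path, "wb") as file:
    pickle.dump(stand_in_model, file)

with open(path, "rb") as file:
    restored = pickle.load(file)

print(restored["labels"][1])  # home win
```

At prediction time the saved vectorizer has to be loaded as well, and new texts must be transformed with it (not refitted) before calling the model. Only unpickle files from a trusted source, since loading a pickle can execute arbitrary code.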