# <font color='#000000'>Master Thesis Hugo Alves | <span style="color:#BFD62F;">__Nova__</span> <span style="color:#5C666C;">__IMS__</span></font>

Welcome to the third chapter of notebooks developed for this project. We (Professor Roberto Henriques, Professor Ricardo Santos, and Hugo Alves) aim to develop a machine learning (ML) framework to predict student admissions to postgraduate and master's programs at Nova IMS, as well as the final grade point average (GPA) of those who are accepted.

# KEEP ONLY THE CORRECT IMAGE
<br>
<div align="center">
  <img src="https://i.ibb.co/tbdN6KX/Notebooks-Workflow.png" alt="Workflow" width="800" />
</div>
<br>

<br>
<div align="center">
  <img src="https://i.ibb.co/BtH7KLZ/Notebooks-Workflow.png" alt="Workflow" width="800" />
</div>
<br>

After exploring and preprocessing our data, we finally arrive at the __modelling__ stage. The main focus of this notebook is therefore to create, select, and test the models that predict admissions, and to understand which variables contribute most to these predictions. Enjoy!


# <font color="#5C666C">Contents (TO BE ADAPTED!!!)</font> <a class="anchor" id="toc"></a>
[Initial Setup](#setup)<br>
- [Library and Functions Import](#library)<br>
- [Retrieving the Dataframes](#dataframes)<br>
   
[4. Modelling](#modelling)<br>
- [4.1. Decision Tree](#dt)<br>
- [4.2. Logistic Regression](#lr)<br>
- [4.3. Naïve Bayes](#nb)<br>
- [4.4. Neural Networks](#nn)<br>
- [4.5. Support Vector Machines](#svm)<br>
- [4.6. Random Forest](#rf)<br>
- [4.7. Bagging](#bagging)<br>
- [4.8. Adaptive Boosting](#ab)<br>
- [4.9. Gradient Boosting](#gb)<br>
- [4.10. Stacking](#stacking)<br>
- [4.11. Voting](#voting)<br>
- [4.12. Final Model Selection](#selection)<br>

[5. Evaluating variable importance](#shap)<br>

</div>

# <font color="#BFD62F">_____________</font>
# <font color='#5C666C'>Initial Setup</font> <a class="anchor" id="setup"></a>
[Back to Contents](#toc)

## <font color='#BFD62F'>Library and Functions Import</font> <a class="anchor" id="library"></a>
[Back to Contents](#toc)

We are using Python version 3.11.8.

In [1]:
#! pip install pandas==2.2.1
#! pip install numpy==1.24.4
#! pip install matplotlib==3.8.3
#! pip install seaborn==0.12.2
#! pip install plotly==5.20.0
#! pip install tenacity==8.2.2
#! pip install "openpyxl>=3.1.0"
#! pip install "nbformat>=4.3.0"
#! pip install rapidfuzz==3.11.0
#! pip install xlrd==2.0.1
#! pip install scikit-learn==1.2.2
#! pip install imbalanced-learn==0.11.0

In [2]:
%run Imports

In [3]:
import Functions as tf

## <font color='#BFD62F'>Retrieving the Dataframes </font> <a class="anchor" id="dataframes"></a>
[Back to Contents](#toc)

Let's retrieve the dataframes that we used in the previous notebook.

In [4]:
%store -r X_admissions_train_scaled
%store -r X_admissions_val_scaled
%store -r y_admissions_train
%store -r y_admissions_val

%store -r numerical_variables_admissions
%store -r binary_variables_admissions
%store -r categorical_variables_admissions

# TO BE ADAPTED

# <font color='#BFD62F'>______________</font>
# <font color='#5C666C'>4. Modelling </font> <a class="anchor" id="modelling"></a>
[Back to Contents](#toc)

It's finally time to start predicting which students to admit to Nova IMS' postgraduate and master's programs.

We will start by testing some of the single models employed in the literature, namely the __Decision Tree__, __Logistic Regression__, __Naïve Bayes__, and __Support Vector Machines__. Although one reviewed study [(Chakraborty et al. (2018))](https://doi.org/10.1007/s12597-017-0329-2) used a hybrid model based on a decision tree and neural networks, and another [(Priyadarshini et al. (2023))](https://doi.org/10.1109/TransAI60598.2023.00040) opted for a deep learning approach, we will resort to a "simpler" __Artificial Neural Network__ to aid us in our classification problem. <br>
As for the ensemble models, we will use a __Random Forest__, as well as __Bagging__, __Adaptive Boosting__, __Gradient Boosting__, __Stacking__, and __Voting__ algorithms (whose estimators will be determined through a grid search).

For each algorithm, the hyperparameters are tuned through a grid search, which evaluates every combination in a predefined grid and keeps the best-scoring one; this is the best configuration within the grid, not necessarily a global optimum. We use a function that fits the model to the training dataset and assesses it on the validation data, returning accuracy, precision, recall, F1 score, and AUROC (where applicable) as metrics. It also saves the model to a Pickle file, so that it can easily be retrieved without re-running the function.

__Note:__ To avoid running this notebook for a very long time, we will leave the code used for each algorithm commented out. If needed, the models are stored in Pickle files and can be loaded without running the function again.
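The helper `tf.run_model_classification` lives in the `Functions` module and is not shown in this notebook. The sketch below is only a guess at what such a helper might look like, based on how it is called and on the "Fitting 5 folds" messages in the logs. The name `grid_search_and_evaluate`, the `save_dir` parameter, and the toy data are all assumptions made for illustration.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.datasets import make_classification
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier


def grid_search_and_evaluate(model, param_grid, X_train, y_train, X_val, y_val,
                             scoring="f1", save_dir=None, name="model"):
    """Hypothetical stand-in for tf.run_model_classification (not the original)."""
    # 5-fold cross-validated grid search on the training data
    search = GridSearchCV(model, param_grid, scoring=scoring, cv=5)
    search.fit(X_train, y_train)
    best_model = search.best_estimator_
    y_pred = best_model.predict(X_val)

    # AUROC needs a continuous score, so it is only computed where one is available
    if hasattr(best_model, "predict_proba"):
        roc_auc = roc_auc_score(y_val, best_model.predict_proba(X_val)[:, 1])
    elif hasattr(best_model, "decision_function"):
        roc_auc = roc_auc_score(y_val, best_model.decision_function(X_val))
    else:
        roc_auc = None

    # Optionally persist the refitted best model as a Pickle file
    if save_dir is not None:
        with open(Path(save_dir) / f"{name}.pkl", "wb") as handle:
            pickle.dump(best_model, handle)

    return (search.best_params_, y_pred,
            accuracy_score(y_val, y_pred), precision_score(y_val, y_pred),
            recall_score(y_val, y_pred), f1_score(y_val, y_pred), roc_auc)


# Synthetic data, used only to make the sketch runnable
X, y = make_classification(n_samples=300, n_features=8, random_state=92)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3,
                                          random_state=92, stratify=y)

with tempfile.TemporaryDirectory() as tmp_dir:
    results = grid_search_and_evaluate(
        DecisionTreeClassifier(random_state=92),
        {"max_depth": [2, 4], "criterion": ["gini", "entropy"]},
        X_tr, y_tr, X_va, y_va, scoring="f1",
        save_dir=tmp_dir, name="Decision_Tree")
    saved = (Path(tmp_dir) / "Decision_Tree.pkl").exists()

best_params, y_pred, accuracy, precision, recall, f1, roc_auc = results
print(best_params, round(f1, 4))
```

Returning `None` for AUROC when the model exposes neither `predict_proba` nor `decision_function` is one way to read "where applicable"; the original function may handle this differently.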

## <font color='#BFD62F'>4.1. Decision Tree </font> <a class="anchor" id="dt"></a>
[Back to Contents](#toc)

Decision trees are among the simplest ML algorithms. Essentially, they are a set of rules that leads input data from the root node (the first in the tree) down to a leaf node, where all observations that reach that leaf are assigned the same label. Starting at the root node, the data are partitioned using a value of a certain variable as a condition. All variables are tested, and the one that yields the best split (for example, the largest reduction in Gini impurity or entropy) is set as the condition for the root node. The data are split according to whether they satisfy that condition, and the process is repeated for each of the new nodes that originate from the split. This procedure goes on until all samples in a node share the same target label or a stopping criterion (such as a maximum depth or a minimum number of samples per leaf) is met.
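To make the split selection concrete, here is a minimal sketch of how a single threshold could be chosen on one numeric feature using the Gini impurity. The toy `grades` and `admitted` values are hypothetical, and a real implementation such as scikit-learn's is considerably more elaborate.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_gini_after_split(values, labels, threshold):
    """Impurity of the two child nodes, weighted by their sizes."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


# Hypothetical toy data: one numeric feature and a binary admission label
grades = [10, 12, 13, 15, 16, 18]
admitted = [0, 0, 1, 1, 1, 1]

# Candidate thresholds that leave both child nodes non-empty
candidates = sorted(set(grades))[:-1]
best_threshold = min(
    candidates,
    key=lambda t: weighted_gini_after_split(grades, admitted, t))

print(best_threshold)                                               # 12
print(weighted_gini_after_split(grades, admitted, best_threshold))  # 0.0
```

Splitting at 12 separates the two classes perfectly, so both child nodes are pure and the tree would stop growing there.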

In [6]:
%%time

dt_admissions = DecisionTreeClassifier(random_state = 92)

dt_param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [None, 4, 6, 8, 10, 14, 18],
    "min_samples_split": [0.01, 0.05, 0.1, 0.2, 0.3, 0.5],
    "min_samples_leaf": [0.01, 0.03, 0.05, 0.1, 0.15, 0.2, 0.3, 0.5],
    "max_features": [None, 0.1, 0.3, 0.5, "sqrt", "log2"],
    "max_leaf_nodes": [None, 5, 10, 20, 30],
    "min_impurity_decrease": [0, 0.01, 0.05]
}

dt_best_params, dt_y_pred, dt_accuracy, dt_precision, dt_recall, dt_f1, dt_roc_auc = tf.run_model_classification(dt_admissions,
                                                                                                                 dt_param_grid,
                                                                                                                 X_admissions_train_scaled,
                                                                                                                 y_admissions_train,
                                                                                                                 X_admissions_val_scaled,
                                                                                                                 y_admissions_val,
                                                                                                                 "f1",
                                                                                                                 True,
                                                                                                                 "Decision_Tree")

Fitting 5 folds for each of 60480 candidates, totalling 302400 fits

Classification Report:
               precision    recall  f1-score   support

           0       0.58      0.15      0.24       808
           1       0.82      0.97      0.89      3257

    accuracy                           0.81      4065
   macro avg       0.70      0.56      0.56      4065
weighted avg       0.77      0.81      0.76      4065

Accuracy: 0.8096 | Precision: 0.8215 | Recall: 0.9739 | F1 score: 0.8913 | AUROC: 0.7222
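
The number of candidates reported in the log follows directly from the size of the grid: it is the product of the number of options per hyperparameter, and each candidate is fitted once per fold. A quick check, with the counts copied from `dt_param_grid`:

```python
from math import prod

# Number of options per hyperparameter in dt_param_grid
dt_grid_sizes = {
    "criterion": 2,
    "max_depth": 7,
    "min_samples_split": 6,
    "min_samples_leaf": 8,
    "max_features": 6,
    "max_leaf_nodes": 5,
    "min_impurity_decrease": 3,
}

n_candidates = prod(dt_grid_sizes.values())
n_fits = n_candidates * 5  # 5 cross-validation folds

print(n_candidates, n_fits)  # 60480 302400
```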


## <font color='#BFD62F'>4.2. Logistic Regression </font> <a class="anchor" id="lr"></a>
[Back to Contents](#toc)

Logistic regression is a statistical technique used to predict a categorical dependent variable from a set of independent variables. Unlike linear regression, which is suited for problems where the target is numerical, logistic regression predicts the probability of belonging to a certain category, using the sigmoid function to ensure that the predicted values always fall between 0 and 1. The coefficients that relate the predictors to the probability of occurrence are determined through maximum likelihood estimation.
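As an illustration of the two ingredients mentioned above, the following sketch computes the sigmoid-based probability and the log-likelihood that maximum likelihood estimation seeks to maximize. The toy data and the two coefficient sets are hypothetical; the point is only that coefficients which fit the data better yield a higher log-likelihood.

```python
import math


def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, coefficients, intercept):
    """Probability of the positive class for one observation."""
    z = intercept + sum(w * x for w, x in zip(coefficients, features))
    return sigmoid(z)


def log_likelihood(X, y, coefficients, intercept):
    """Sum of the log-probabilities assigned to the observed labels."""
    total = 0.0
    for features, label in zip(X, y):
        p = predict_proba(features, coefficients, intercept)
        total += label * math.log(p) + (1 - label) * math.log(1 - p)
    return total


# Hypothetical toy data: the first feature drives the outcome
X_toy = [[0.2, 1.0], [0.9, 0.0], [0.4, 1.0], [0.8, 1.0]]
y_toy = [0, 1, 0, 1]

ll_zero = log_likelihood(X_toy, y_toy, [0.0, 0.0], 0.0)     # uninformative model
ll_better = log_likelihood(X_toy, y_toy, [4.0, 0.0], -2.4)  # uses the first feature

print(ll_zero, ll_better)
```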

In [7]:
%%time

lr_admissions = LogisticRegression(random_state = 92)

lr_param_grid = {
    "penalty": ["l1", "l2", "elasticnet"],
    "C": [0.0001, 0.001, 0.01, 0.1, 0.5, 1, 10, 100, 1000],
    "solver": ["lbfgs", "liblinear", "saga", "sag", "newton-cholesky"],
    "max_iter": [50, 100, 150, 200, 250, 300, 350, 400, 450, 500]
}

lr_best_params, lr_y_pred, lr_accuracy, lr_precision, lr_recall, lr_f1, lr_roc_auc = tf.run_model_classification(lr_admissions,
                                                                                                                 lr_param_grid,
                                                                                                                 X_admissions_train_scaled,
                                                                                                                 y_admissions_train,
                                                                                                                 X_admissions_val_scaled,
                                                                                                                 y_admissions_val,
                                                                                                                 "f1",
                                                                                                                 True,
                                                                                                                 "Logistic_Regression")

Fitting 5 folds for each of 1350 candidates, totalling 6750 fits


Classification Report:
               precision    recall  f1-score   support

           0       0.69      0.11      0.20       808
           1       0.82      0.99      0.89      3257

    accuracy                           0.81      4065
   macro avg       0.75      0.55      0.55      4065
weighted avg       0.79      0.81      0.76      4065

Accuracy: 0.8138 | Precision: 0.8179 | Recall: 0.9874 | F1 score: 0.8947 | AUROC: 0.7341


## <font color='#BFD62F'>4.3. Naïve Bayes </font> <a class="anchor" id="nb"></a>
[Back to Contents](#toc)

Similarly to logistic regression, Bayesian classification predicts class membership probabilities, with a threshold then assigning each prediction to a target label. It relies on Bayes' theorem, which gives the probability of one event given that another event has already occurred. Essentially, it is the probability of a hypothesis given the presence of certain evidence. One of its key points (and the reason for it being "naïve") is the assumption that the predictors are independent of one another given the class; as such, each predictor contributes individually to a prediction, without relying on the other independent variables.

In our case, we will use a variation of Bayesian classification that assumes that the independent variables follow a Bernoulli distribution, that is, that they are binary. Although this is not true for all of our variables, we will still test it to assess its ability to generate accurate predictions.
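The following is a minimal, from-scratch sketch of the idea, not scikit-learn's implementation. It first binarizes scaled features with a threshold (the role played by the `binarize` hyperparameter), then estimates a Laplace-smoothed probability of each feature being 1 per class, and finally applies Bayes' theorem under the independence assumption. The toy data are hypothetical.

```python
import math


def binarize(rows, threshold):
    """Map each value to 1 if it exceeds the threshold, else 0."""
    return [[1 if value > threshold else 0 for value in row] for row in rows]


def fit_bernoulli_nb(X, y, alpha=1.0):
    """Estimate class priors and smoothed P(feature = 1 | class)."""
    model = {}
    for label in sorted(set(y)):
        rows = [row for row, target in zip(X, y) if target == label]
        prior = len(rows) / len(X)
        probs = [(sum(column) + alpha) / (len(rows) + 2 * alpha)
                 for column in zip(*rows)]
        model[label] = (prior, probs)
    return model


def predict(model, row):
    """Return the class with the highest log-posterior (up to a constant)."""
    scores = {}
    for label, (prior, probs) in model.items():
        log_score = math.log(prior)
        for value, p in zip(row, probs):
            log_score += math.log(p if value == 1 else 1 - p)
        scores[label] = log_score
    return max(scores, key=scores.get)


# Hypothetical scaled features in [0, 1] and binary labels
X_scaled = [[0.9, 0.1], [0.8, 0.3], [0.7, 0.2],
            [0.2, 0.9], [0.1, 0.7], [0.3, 0.8]]
y_toy = [1, 1, 1, 0, 0, 0]

X_bin = binarize(X_scaled, threshold=0.5)
nb = fit_bernoulli_nb(X_bin, y_toy)

print(predict(nb, [1, 0]), predict(nb, [0, 1]))  # 1 0
```

When most scaled features are continuous, the choice of threshold discards much of their information, which is one reason this variant can struggle on mixed data.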

In [9]:
%%time

nb_admissions = BernoulliNB()

nb_param_grid = {
    "binarize": np.arange(0, 1.1, 0.1)
}

nb_best_params, nb_y_pred, nb_accuracy, nb_precision, nb_recall, nb_f1, nb_roc_auc = tf.run_model_classification(nb_admissions,
                                                                                                                 nb_param_grid,
                                                                                                                 X_admissions_train_scaled,
                                                                                                                 y_admissions_train,
                                                                                                                 X_admissions_val_scaled,
                                                                                                                 y_admissions_val,
                                                                                                                 "f1",
                                                                                                                 True,
                                                                                                                 "Naive_Bayes")

Fitting 5 folds for each of 11 candidates, totalling 55 fits

Classification Report:
               precision    recall  f1-score   support

           0       0.00      0.00      0.00       808
           1       0.80      1.00      0.89      3257

    accuracy                           0.80      4065
   macro avg       0.40      0.50      0.44      4065
weighted avg       0.64      0.80      0.71      4065

Accuracy: 0.8012 | Precision: 0.8012 | Recall: 1.0 | F1 score: 0.8896 | AUROC: 0.4982


## <font color='#BFD62F'>4.4. Neural Networks </font> <a class="anchor" id="nn"></a>
[Back to Contents](#toc)

Artificial neural networks are computing systems that attempt to mimic the biological neural networks of animals. They are composed of several perceptrons (single units of logic that output a binary conclusion), which can be organized in a number of layers through which the input data is passed.

At the first iteration of a neural network, the input data is passed onto the first layer of nodes (perceptrons), which is called the input layer. Each input is assigned a randomly initialized weight, increasing or decreasing its relevance when it arrives at the input layer. Inside each node, an activation function transforms the input into an output, before passing it to the next layer (also through randomly initialized weights), where the process is repeated until the data arrives at the output layer, which returns a final prediction. The main goal is to optimize the weights and reduce the prediction error (loss), which is done through backpropagation over several iterations of the data passing through the nodes, where each iteration should, ideally, bring the network closer to an accurate prediction.
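The forward pass and weight update described above can be sketched for a tiny network with two inputs, two hidden nodes, and one output node. This is a simplified illustration under stated assumptions: there are no bias terms, the sigmoid is the only activation, a single hypothetical observation is used, and the starting weights are fixed by hand rather than drawn at random.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, hidden_weights, output_weights):
    """Propagate one observation through the hidden layer to the output."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(weights, x)))
              for weights in hidden_weights]
    output = sigmoid(sum(w * h for w, h in zip(output_weights, hidden)))
    return hidden, output


def train_step(x, target, hidden_weights, output_weights, learning_rate):
    """One backpropagation update for the log loss."""
    hidden, output = forward(x, hidden_weights, output_weights)
    # With a sigmoid output and log loss, the output-node error is (output - target)
    delta_out = output - target
    new_output_weights = [w - learning_rate * delta_out * h
                          for w, h in zip(output_weights, hidden)]
    new_hidden_weights = []
    for weights, h, w_out in zip(hidden_weights, hidden, output_weights):
        # Propagate the error back through the hidden node's sigmoid
        delta_hidden = delta_out * w_out * h * (1 - h)
        new_hidden_weights.append(
            [w - learning_rate * delta_hidden * xi for w, xi in zip(weights, x)])
    return new_hidden_weights, new_output_weights


def log_loss(target, output):
    return -(target * math.log(output) + (1 - target) * math.log(1 - output))


# Hypothetical observation and hand-picked starting weights
x, target = [0.5, 1.0], 1
hidden_w = [[0.1, -0.2], [0.4, 0.3]]
output_w = [0.2, -0.1]

_, before = forward(x, hidden_w, output_w)
for _ in range(20):
    hidden_w, output_w = train_step(x, target, hidden_w, output_w, learning_rate=0.5)
_, after = forward(x, hidden_w, output_w)

print(log_loss(target, before), log_loss(target, after))
```

After the updates, the predicted probability for the true class is higher and the loss is lower, which is the behaviour each training iteration aims for.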

In [10]:
%%time

nn_admissions = MLPClassifier(random_state = 92)

nn_param_grid = {
    "hidden_layer_sizes": [(100,), (100, 100), (100, 50), (50,), (50, 50), (50, 25)],
    "activation": ["relu", "logistic"],
    "solver": ["adam", "sgd"],
    "alpha": [0.00001, 0.0001, 0.0005, 0.001, 0.01, 0.1],
    "learning_rate": ["constant", "invscaling", "adaptive"],
    "batch_size": [100, 200, 500],
    "learning_rate_init": [0.0001, 0.001, 0.005, 0.01, 0.1],
    "max_iter": [10, 50, 100, 200]
}

nn_best_params, nn_y_pred, nn_accuracy, nn_precision, nn_recall, nn_f1, nn_roc_auc = tf.run_model_classification(nn_admissions,
                                                                                                                 nn_param_grid,
                                                                                                                 X_admissions_train_scaled,
                                                                                                                 y_admissions_train,
                                                                                                                 X_admissions_val_scaled,
                                                                                                                 y_admissions_val,
                                                                                                                 "f1",
                                                                                                                 True,
                                                                                                                 "Neural_Networks")

Fitting 5 folds for each of 25920 candidates, totalling 129600 fits

Classification Report:
               precision    recall  f1-score   support

           0       0.62      0.19      0.29       808
           1       0.83      0.97      0.89      3257

    accuracy                           0.82      4065
   macro avg       0.72      0.58      0.59      4065
weighted avg       0.79      0.82      0.77      4065

Accuracy: 0.8155 | Precision: 0.8282 | Recall: 0.9711 | F1 score: 0.894 | AUROC: 0.7464


## <font color='#BFD62F'>4.5. Support Vector Machines </font> <a class="anchor" id="svm"></a>
[Back to Contents](#toc)

In SVM, each observation is seen as a point in an n-dimensional space. The classification task is done by finding the hyperplane that best separates the two (or more) target classes, that is, the one with the largest margin to the closest observations of each class. Support vectors are those closest observations, which alone determine the position of the hyperplane. In regression problems, SVM finds the function that best fits the data within a specified margin of tolerance, with support vectors influencing the shape of the regression function.
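As a small geometric illustration, the sketch below takes a hyperplane that is already given (it is not fitted here) and shows how observations are classified by the side they fall on, and which observations lie closest to it. The hyperplane `x1 + x2 - 3 = 0` and the four points are hypothetical, chosen so that the hyperplane is the maximum-margin separator for them.

```python
import math


def decision_function(x, w, b):
    """Signed score: positive on one side of the hyperplane, negative on the other."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def distance_to_hyperplane(x, w, b):
    return abs(decision_function(x, w, b)) / math.sqrt(sum(wi ** 2 for wi in w))


# Hypothetical hyperplane x1 + x2 - 3 = 0 and labelled points
w, b = [1.0, 1.0], -3.0
points = {
    "A": ([1.0, 1.0], -1),
    "B": ([0.0, 0.0], -1),
    "C": ([2.0, 2.0], 1),
    "D": ([4.0, 3.0], 1),
}

predictions = {name: (1 if decision_function(x, w, b) >= 0 else -1)
               for name, (x, _) in points.items()}
distances = {name: distance_to_hyperplane(x, w, b)
             for name, (x, _) in points.items()}

# The closest observations define the margin: these are the support vectors
margin = min(distances.values())
support_vectors = sorted(name for name, d in distances.items()
                         if math.isclose(d, margin))

print(predictions)
print(support_vectors)  # ['A', 'C']
```

Moving B or D slightly would not change the hyperplane, whereas moving A or C would, which is why only the support vectors matter for the final boundary.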

In [None]:
%%time

svm_admissions = SVC(random_state = 92)

svm_param_grid = {
    "C": [0.01, 0.1, 0.5, 1, 10],
    "kernel": ["linear", "poly", "rbf", "sigmoid"],
    "gamma": ["scale", "auto", 0.001, 0.01, 0.1, 1, 10]
}

svm_best_params, svm_y_pred, svm_accuracy, svm_precision, svm_recall, svm_f1, svm_roc_auc = tf.run_model_classification(svm_admissions,
                                                                                                                        svm_param_grid,
                                                                                                                        X_admissions_train_scaled,
                                                                                                                        y_admissions_train,
                                                                                                                        X_admissions_val_scaled,
                                                                                                                        y_admissions_val,
                                                                                                                        "f1",
                                                                                                                        True,
                                                                                                                        "SVM")

Fitting 5 folds for each of 140 candidates, totalling 700 fits


## <font color='#BFD62F'>4.6. Random Forest </font> <a class="anchor" id="rf"></a>
[Back to Contents](#toc)

Ensemble learners rely on the combination of multiple supervised learning algorithms to make predictions. Following the Wisdom of the Crowds principle, several weak learners might perform better as a whole than a single strong learner.

Random Forests are a group of decision trees trained in parallel, each using a sample of observations and a certain set of features from the feature space. In the end, their individual predictions are combined to obtain a final prediction.
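The two sources of randomness and the final aggregation can be sketched as follows. This is a simplification for illustration: features are drawn once per tree here, whereas scikit-learn's random forest draws a feature subset at each split, and the feature names are hypothetical.

```python
import random
from collections import Counter


def draw_tree_inputs(n_rows, feature_names, n_features, rng):
    """Pick the rows (with replacement) and features (without) for one tree."""
    rows = [rng.randrange(n_rows) for _ in range(n_rows)]  # bootstrap sample
    features = rng.sample(feature_names, n_features)       # random feature subset
    return rows, features


def majority_vote(votes):
    """Combine the individual tree predictions into a final prediction."""
    return Counter(votes).most_common(1)[0][0]


rng = random.Random(92)
feature_names = ["grade", "age", "experience", "english_level"]  # hypothetical

for tree_id in range(3):
    rows, features = draw_tree_inputs(8, feature_names, 2, rng)
    print(tree_id, rows, features)

print(majority_vote([1, 0, 1]))  # 1
```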

In [1]:
%%time

# Explicit import in case the `%run Imports` cell was not executed in this session
from sklearn.ensemble import RandomForestClassifier

rf_admissions = RandomForestClassifier(random_state = 92)

rf_param_grid = {
    "n_estimators": [10, 20, 50, 75, 100, 200],
    "criterion": ["gini", "entropy"],
    "max_depth": [None, 3, 6, 9, 12, 15],
    "min_samples_split": [0.01, 0.05, 0.1, 0.15, 0.2, 0.3, 0.5],
    "min_samples_leaf": [0.01, 0.03, 0.05, 0.1, 0.15, 0.2, 0.3, 0.5],
    "max_features": [None, 0.1, 0.3, 0.5, "sqrt", "log2"],
    "max_leaf_nodes": [None, 5, 10, 20, 30],
    "max_samples": [0.1, 0.2, 0.4, 0.6, 0.8, None]
}

rf_best_params, rf_y_pred, rf_accuracy, rf_precision, rf_recall, rf_f1, rf_roc_auc = tf.run_model_classification(rf_admissions,
                                                                                                                 rf_param_grid,
                                                                                                                 X_admissions_train_scaled,
                                                                                                                 y_admissions_train,
                                                                                                                 X_admissions_val_scaled,
                                                                                                                 y_admissions_val,
                                                                                                                 "f1",
                                                                                                                 True,
                                                                                                                 "Random_Forest")

# REVIEW TEXT AND APPROACH

## <font color='#BFD62F'>4.7. Bagging </font> <a class="anchor" id="bagging"></a>
[Back to Contents](#toc)

Bootstrap aggregating (commonly known as bagging) trains, in parallel, multiple instances of the same algorithm (estimator) on bootstrap replicas of the training set (samples drawn with replacement), with the individual predictions being combined in the end to achieve a final prediction. The main focus of this technique is on reducing variance.

As may be apparent from the descriptions, random forests are bagging algorithms. However, in addition to bootstrapping samples, random forests also sample features, randomly selecting a subset of variables for the individual trees to consider (in scikit-learn, a new subset is drawn at each split).

__Note:__ In the function below, we are aware that when we use decision trees as estimators and the "bootstrap_features" parameter is set to true, we are in effect computing something very close to a random forest as well. The option to bootstrap features was included to assess the difference in performance of the remaining estimators, comparing a scenario where all features are included in all individual instances to one where a subset of variables is selected.
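The variance-reduction argument can be illustrated with a deliberately idealised simulation, in which each "estimator" is just the true value plus independent Gaussian noise. This is an assumption made for clarity: real bagged models are trained on overlapping bootstrap samples and are therefore correlated, so the reduction in practice is smaller than shown here.

```python
import random
import statistics

rng = random.Random(92)
true_value = 0.7  # hypothetical quantity every estimator tries to predict


def noisy_estimate():
    """One estimator: unbiased, but with high variance."""
    return true_value + rng.gauss(0, 0.2)


# 2000 single estimators versus 2000 ensembles averaging 10 estimators each
single = [noisy_estimate() for _ in range(2000)]
bagged = [statistics.fmean(noisy_estimate() for _ in range(10))
          for _ in range(2000)]

var_single = statistics.pvariance(single)
var_bagged = statistics.pvariance(bagged)

print(var_single, var_bagged)
```

With independent estimators, the variance of the average shrinks roughly in proportion to the number of estimators, while the bias is left unchanged.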

In [None]:
%%time

bg_admissions = BaggingClassifier(random_state = 92)

bg_param_grid = {
    "estimator": [DecisionTreeClassifier(**dt_best_params, random_state = 92),
                  MLPClassifier(**nn_best_params, random_state = 92),
                  SVC(**svm_best_params, random_state = 92),
                  RandomForestClassifier(**rf_best_params, random_state = 92)],
    "n_estimators": [5, 10, 15, 20],
    # Floats are fractions of the data; an integer 1 would mean a single sample or feature
    "max_samples": [0.1, 0.2, 0.4, 0.6, 0.8, 1.0],
    "max_features": [0.1, 0.2, 0.4, 0.6, 0.8, 1.0],
    "bootstrap_features": [True, False]
}

bg_best_params, bg_y_pred, bg_accuracy, bg_precision, bg_recall, bg_f1, bg_roc_auc = tf.run_model_classification(bg_admissions,
                                                                                                                 bg_param_grid,
                                                                                                                 X_admissions_train_scaled,
                                                                                                                 y_admissions_train,
                                                                                                                 X_admissions_val_scaled,
                                                                                                                 y_admissions_val,
                                                                                                                 "f1",
                                                                                                                 True,
                                                                                                                 "Bagging")

# REVIEW TEXT AND APPROACH

## <font color='#BFD62F'>4.8. Adaptive Boosting </font> <a class="anchor" id="ab"></a>
[Back to Contents](#toc)

The idea behind boosting ensembles is that a weak learner can be boosted into a strong learning algorithm through iterative training, where each new model is encouraged to pay more attention to the observations that were incorrectly classified by earlier models. A sequence of classifiers is created, where higher influence is given to the most accurate ones. The main focus of this technique is on reducing bias.

In Adaptive Boosting (or AdaBoost, for short), a common approach is to use very shallow decision trees, known as stumps, as weak learners. Each stump considers the mistakes of the previous stumps (the misclassified observations are given a higher weight when computing the following weak learner), and the stumps that are more accurate are assigned a higher weight in the final predictions.
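One round of the reweighting can be worked through by hand. The sketch below follows the classic discrete AdaBoost formulation with labels in {-1, +1}; scikit-learn's `AdaBoostClassifier` uses the related SAMME family of algorithms, so its exact numbers differ. The labels and stump predictions are hypothetical.

```python
import math


def adaboost_round(y_true, y_pred, weights):
    """Return the learner's weight (alpha) and the updated observation weights.

    Assumes the weighted error lies strictly between 0 and 1.
    """
    error = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    alpha = 0.5 * math.log((1 - error) / error)
    # Misclassified observations (t * p == -1) are scaled up, correct ones down
    updated = [w * math.exp(-alpha * t * p)
               for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(updated)
    return alpha, [w / total for w in updated]


y_true = [1, 1, -1, -1, 1]
stump_pred = [1, 1, -1, 1, 1]  # the fourth observation is misclassified
weights = [0.2] * 5            # uniform starting weights

alpha, new_weights = adaboost_round(y_true, stump_pred, weights)

print(alpha)        # ln(2), about 0.693
print(new_weights)  # the misclassified observation now carries weight 0.5
```

After a single round, the one misclassified observation weighs as much as the four correct ones combined, so the next stump is pushed to get it right.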

In [None]:
%%time

ab_admissions = AdaBoostClassifier(random_state = 92)

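# AdaBoost reweights observations, so each candidate estimator must accept `sample_weight` in `fit`
# (in scikit-learn 1.2.2, MLPClassifier does not) and, with the default SAMME.R algorithm,
# expose `predict_proba` (SVC only does so with `probability = True`)
ab_param_grid = {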
    "estimator": [DecisionTreeClassifier(**dt_best_params, random_state = 92),
                  MLPClassifier(**nn_best_params, random_state = 92),
                  SVC(**svm_best_params, random_state = 92),
                  RandomForestClassifier(**rf_best_params, random_state = 92)],
    "n_estimators": [5, 10, 20, 50, 100],
    "learning_rate": [0.001, 0.01, 0.05, 0.1, 0.25, 0.5, 1]
}

ab_best_params, ab_y_pred, ab_accuracy, ab_precision, ab_recall, ab_f1, ab_roc_auc = tf.run_model_classification(ab_admissions,
                                                                                                                 ab_param_grid,
                                                                                                                 X_admissions_train_scaled,
                                                                                                                 y_admissions_train,
                                                                                                                 X_admissions_val_scaled,
                                                                                                                 y_admissions_val,
                                                                                                                 "f1",
                                                                                                                 True,
                                                                                                                 "Adaptive_Boosting")

# REVIEW TEXT AND APPROACH

## <font color='#BFD62F'>4.9. Gradient Boosting </font> <a class="anchor" id="gb"></a>
[Back to Contents](#toc)

In contrast to AdaBoost, which assigns a higher weight to the observations that were incorrectly classified, Gradient Boosting fits each new weak learner to the errors of the current ensemble (the residuals, in regression problems), so that adding it (ideally) improves the fit. This process is repeated until a certain threshold (such as a maximum number of learners) is reached or the new instance of the model fails to improve the fit.
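The residual-fitting loop is easiest to see in a regression setting, so the sketch below uses one, even though this notebook deals with classification. Each round fits a one-split regression stump to the current residuals and adds a shrunken version of it to the prediction. The data, the learning rate, and the stump learner are all hypothetical choices for illustration.

```python
def fit_stump(x, residuals):
    """Best single-threshold regressor on one feature (least squares)."""
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean


def mse(y, pred):
    return sum((a - b) ** 2 for a, b in zip(y, pred)) / len(y)


# Hypothetical toy data
x = [1, 2, 3, 4, 5, 6]
y = [10.0, 11.0, 13.0, 15.0, 16.0, 18.0]
learning_rate = 0.5

# Start from the mean, then repeatedly correct the remaining errors
prediction = [sum(y) / len(y)] * len(y)
history = [mse(y, prediction)]
for _ in range(5):
    residuals = [yi - pi for yi, pi in zip(y, prediction)]
    stump = fit_stump(x, residuals)
    prediction = [pi + learning_rate * stump(xi) for xi, pi in zip(x, prediction)]
    history.append(mse(y, prediction))

print(history)
```

The training error falls at every round; in practice the number of rounds and the learning rate are tuned on validation data to avoid overfitting.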

In [None]:
%%time

gb_admissions = GradientBoostingClassifier(random_state = 92)

gb_param_grid = {
    # GradientBoostingClassifier always boosts regression trees and has no `estimator`
    # parameter, so the base learner cannot be searched over here
    "n_estimators": [10, 20, 50, 100],
    "learning_rate": [0.01, 0.05, 0.1, 0.5, 1],
    "min_samples_split": [0.01, 0.05, 0.1, 0.2, 0.3, 0.5],
    "min_samples_leaf": [0.01, 0.05, 0.1, 0.2, 0.3, 0.5],
    "max_depth": [None, 4, 6, 8, 10, 14, 20],
    "min_impurity_decrease": [0, 0.01, 0.05]
}

gb_best_params, gb_y_pred, gb_accuracy, gb_precision, gb_recall, gb_f1, gb_roc_auc = tf.run_model_classification(gb_admissions,
                                                                                                                 gb_param_grid,
                                                                                                                 X_admissions_train_scaled,
                                                                                                                 y_admissions_train,
                                                                                                                 X_admissions_val_scaled,
                                                                                                                 y_admissions_val,
                                                                                                                 "f1",
                                                                                                                 True,
                                                                                                                 "Gradient_Boosting")

# REVIEW TEXT AND APPROACH

## <font color='#BFD62F'>4.10. Stacking </font> <a class="anchor" id="stacking"></a>
[Back to Contents](#toc)

# REVIEW AND CHECK WHETHER IT IMPROVES WITH BAGGING BOOSTING

Stacking (or stacked generalization) consists of stacking the outputs of individual estimators and using a meta-learning algorithm to compute the final predictions. First, a group of base models is created and used to make predictions, as they normally would; then, a meta-model takes those predictions as inputs and produces the final one, given the knowledge provided by the base models.
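The two-level structure can be sketched without any library. Here the base models are two hand-written threshold rules, and the meta-model is a simple lookup table that learns, for each combination of base predictions, which label was most common. Everything in this example is hypothetical, and one important simplification is made: the meta-model is fitted on the same rows the base rules are applied to, whereas scikit-learn's `StackingClassifier` trains it on cross-validated predictions.

```python
from collections import Counter, defaultdict


# Hypothetical base models: threshold rules on two different scaled features
def base_grade(row):
    return 1 if row[0] >= 0.5 else 0


def base_experience(row):
    return 1 if row[1] >= 0.5 else 0


base_models = [base_grade, base_experience]


def meta_features(row):
    """The base models' predictions become the meta-model's inputs."""
    return tuple(model(row) for model in base_models)


def fit_meta_model(X, y):
    """Meta-learner: majority label for each combination of base predictions."""
    table = defaultdict(Counter)
    for row, label in zip(X, y):
        table[meta_features(row)][label] += 1
    return {key: counts.most_common(1)[0][0] for key, counts in table.items()}


X_meta = [[0.9, 0.8], [0.8, 0.2], [0.7, 0.1], [0.2, 0.9],
          [0.3, 0.7], [0.1, 0.2], [0.6, 0.9], [0.4, 0.1]]
y_meta = [1, 1, 1, 0, 0, 0, 1, 0]

meta_model = fit_meta_model(X_meta, y_meta)


def stacked_predict(row):
    return meta_model[meta_features(row)]


print(meta_model)
print(stacked_predict([0.2, 0.9]))  # 0
```

In this toy data the meta-model learns to follow the first rule and ignore the second whenever the two disagree, which is the kind of knowledge a meta-learner is meant to extract from its base models.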

In [None]:
%%time

# Base estimators shared by the constructor and the parameter grid
st_estimators = [("Decision Tree", DecisionTreeClassifier(**dt_best_params, random_state = 92)),
                 ("Logistic Regression", LogisticRegression(**lr_best_params, random_state = 92)),
                 ("Neural Networks", MLPClassifier(**nn_best_params, random_state = 92)),
                 ("SVM", SVC(**svm_best_params, random_state = 92))]

# StackingClassifier requires `estimators` at construction time
st_admissions = StackingClassifier(estimators = st_estimators)

# This is not a grid search; the single-option grid only lets us reuse the function
st_param_grid = {
    "estimators": [st_estimators],
    "final_estimator": [LogisticRegression(random_state = 92)]
}

st_best_params, st_y_pred, st_accuracy, st_precision, st_recall, st_f1, st_roc_auc = tf.run_model_classification(st_admissions,
                                                                                                                 st_param_grid,
                                                                                                                 X_admissions_train_scaled,
                                                                                                                 y_admissions_train,
                                                                                                                 X_admissions_val_scaled,
                                                                                                                 y_admissions_val,
                                                                                                                 "f1",
                                                                                                                 True,
                                                                                                                 "Stacking")

# REVIEW TEXT AND APPROACH

## <font color='#BFD62F'>4.11. Voting </font> <a class="anchor" id="voting"></a>
[Back to Contents](#toc)

# ADD A FIFTH ALGORITHM (AND CHECK THE WEIGHTS)

As the name suggests, a voting algorithm aggregates the predictions of its base learners and, for each observation, selects the class that was predicted most often.

In our case, we will also test the possibility of assigning higher weights to the base learners that achieved a higher F1 score.
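
To make the aggregation concrete, here is a minimal sketch of weighted majority (hard) voting in plain Python. The function `weighted_hard_vote` and the toy predictions are hypothetical and only illustrate the idea; in the notebook itself the aggregation is delegated to scikit-learn's `VotingClassifier`.

```python
from collections import defaultdict

def weighted_hard_vote(predictions, weights=None):
    """Return one winning label per observation.

    predictions: one list of predicted labels per model (all of equal length).
    weights: one weight per model; None means every model counts equally.
    """
    if weights is None:
        weights = [1.0] * len(predictions)
    winners = []
    # zip(*predictions) groups the votes of all models for each observation
    for votes in zip(*predictions):
        tally = defaultdict(float)
        for label, weight in zip(votes, weights):
            tally[label] += weight
        # Highest total weight wins; ties go to the smallest label
        # (an arbitrary choice made for this sketch)
        winners.append(min(tally, key=lambda label: (-tally[label], label)))
    return winners

# Hypothetical predictions from three models on three observations
toy_predictions = [[1, 0, 1],
                   [0, 0, 1],
                   [0, 1, 0]]

unweighted = weighted_hard_vote(toy_predictions)
weighted = weighted_hard_vote(toy_predictions, weights=[3, 1, 1])
print(unweighted, weighted)
```

With equal weights the first observation is assigned class 0 (two votes against one), whereas giving the first model a weight of 3 flips it to class 1. This is the effect the weighted variant is meant to test.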

In [None]:
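# Validation F1 scores of the base learners, in the same order as the estimators passed to the voting classifier
voting_f1_scores = [dt_f1, lr_f1, nn_f1, svm_f1]

sum_f1_scores = sum(voting_f1_scores)
avg_f1_score = sum_f1_scores / len(voting_f1_scores)

# Weight = 1 + 200 * (deviation from the average F1 score),
# so learners above the average get a weight above 1 and learners below it a weight below 1
weights_f1 = [1 + 200 * (f1 - avg_f1_score) for f1 in voting_f1_scores]

weights_f1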
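
Before passing these weights to the voting classifier, it may be worth checking their sign. The sketch below applies the same formula to made-up F1 scores (they are not results from this notebook) and shows that any learner more than 1/200 = 0.005 below the average receives a negative weight. The helper `f1_based_weights` and the clipping step are assumptions added for illustration, not part of the original approach.

```python
def f1_based_weights(f1_scores, scale=200):
    """Weight = 1 + scale * (score - mean score), the formula used in the cell above."""
    mean_score = sum(f1_scores) / len(f1_scores)
    return [1 + scale * (score - mean_score) for score in f1_scores]

# Hypothetical F1 scores, chosen only to illustrate the behaviour
example_scores = [0.79, 0.80, 0.81, 0.84]
example_weights = f1_based_weights(example_scores)

# Learners more than 1 / scale below the mean end up with a negative weight
negative_weights = [w for w in example_weights if w < 0]

# One possible guard (an assumption of this sketch): floor the weights at zero
clipped_weights = [max(0.0, w) for w in example_weights]

print([round(w, 3) for w in example_weights])
print([round(w, 3) for w in clipped_weights])
```

Because the deviations from the mean always sum to zero, the weights sum to the number of learners, so the formula redistributes influence rather than adding any. Whether a factor of 200 is appropriate depends on how far apart the F1 scores actually are.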

In [None]:
%%time

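# Base learners, each built with the best hyperparameters found in its own section
vt_estimators = [("Decision Tree", DecisionTreeClassifier(**dt_best_params, random_state = 92)),
                 ("Logistic Regression", LogisticRegression(**lr_best_params, random_state = 92)),
                 ("Neural Networks", MLPClassifier(**nn_best_params, random_state = 92)),
                 ("SVM", SVC(**svm_best_params, random_state = 92))]

# VotingClassifier requires `estimators` at construction time
vt_admissions = VotingClassifier(estimators = vt_estimators)

# Compare equal weights (None) against the F1-based weights
vt_param_grid = {
    "estimators": [vt_estimators],
    "weights": [None, weights_f1]
}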

vt_best_params, vt_y_pred, vt_accuracy, vt_precision, vt_recall, vt_f1, vt_roc_auc = tf.run_model_classification(vt_admissions,
                                                                                                                 vt_param_grid,
                                                                                                                 X_admissions_train_scaled,
                                                                                                                 y_admissions_train,
                                                                                                                 X_admissions_val_scaled,
                                                                                                                 y_admissions_val,
                                                                                                                 "f1",
                                                                                                                 True,
                                                                                                                 "Voting")

## <font color='#BFD62F'>4.12. Final Model Selection </font> <a class="anchor" id="selection"></a>
[Back to Contents](#toc)

We will now summarize our results in a table before making a final selection.

In [None]:
metrics = {
    "Model": ["Decision Tree", "Logistic Regression", "Naïve Bayes", "Neural Networks", "SVM",
              "Random Forest", "Bagging", "Adaptive Boosting", "Gradient Boosting", "Stacking", "Voting"],
    "Accuracy": [dt_accuracy, lr_accuracy, nb_accuracy, nn_accuracy, svm_accuracy,
                 rf_accuracy, bg_accuracy, ab_accuracy, gb_accuracy, st_accuracy, vt_accuracy],
    "Precision": [dt_precision, lr_precision, nb_precision, nn_precision, svm_precision,
                  rf_precision, bg_precision, ab_precision, gb_precision, st_precision, vt_precision],
    "Recall": [dt_recall, lr_recall, nb_recall, nn_recall, svm_recall,
               rf_recall, bg_recall, ab_recall, gb_recall, st_recall, vt_recall],
    "F1 Score": [dt_f1, lr_f1, nb_f1, nn_f1, svm_f1,
                 rf_f1, bg_f1, ab_f1, gb_f1, st_f1, vt_f1],
    "ROC AUC": [dt_roc_auc, lr_roc_auc, nb_roc_auc, nn_roc_auc, svm_roc_auc,
                rf_roc_auc, bg_roc_auc, ab_roc_auc, gb_roc_auc, st_roc_auc, vt_roc_auc]
}

scores = pd.DataFrame(metrics)
tf.highlight_max_column(scores)

In [190]:
num_prefixes = 10
models = ['dt', 'rf', 'xgb','lgbm']

scores_dict = {model: [] for model in models}
prefixes = [f'prefix_{i}' for i in range(1, num_prefixes + 1)]

for i in range(1, num_prefixes + 1):
    for model in models:
        variable_name = f'final_score_{i}_{model}'
        score = globals().get(variable_name, None)
        scores_dict[model].append(score)

scores_df = pd.DataFrame(scores_dict, index=prefixes)

scores_df

| Prefix | dt | rf | xgb | lgbm |
|---|---|---|---|---|
| prefix_1 | 0.573348 | 0.57491 | 0.560439 | 0.559533 |
| prefix_2 | 0.707851 | 0.689705 | 0.654775 | 0.703573 |
| prefix_3 | 0.639889 | 0.685729 | 0.656824 | 0.68993 |
| prefix_4 | 0.700765 | 0.73138 | 0.731635 | 0.737659 |
| prefix_5 | 0.735061 | 0.723577 | 0.738774 | 0.737084 |
| prefix_6 | 0.810566 | 0.798742 | 0.823003 | 0.801288 |
| prefix_7 | 0.770016 | 0.761049 | 0.845399 | 0.820338 |
| prefix_8 | 0.804595 | 0.84717 | 0.871546 | 0.842686 |
| prefix_9 | 0.746369 | 0.805241 | 0.79615 | 0.81882 |
| prefix_10 | 0.368622 | 0.368622 | 0.821865 | 0.583976 |


The table allows us to identify the best model for each prefix:
* __Prefix 1__: Random Forest
* __Prefix 2__: Decision Tree
* __Prefix 3__: LightGBM
* __Prefix 4__: LightGBM
* __Prefix 5__: XGBoost
* __Prefix 6__: XGBoost
* __Prefix 7__: XGBoost
* __Prefix 8__: XGBoost
* __Prefix 9__: LightGBM
* __Prefix 10__: XGBoost
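
The selection listed above can be reproduced programmatically rather than read off by eye. The sketch below rebuilds the scores table from the values shown and uses pandas' `idxmax` to pick the highest-scoring column in each row; the variable names `table_scores` and `best_model` are introduced here for illustration only.

```python
import pandas as pd

# Scores copied from the table above (rows: prefixes, columns: models)
table_scores = pd.DataFrame(
    {
        "dt":   [0.573348, 0.707851, 0.639889, 0.700765, 0.735061,
                 0.810566, 0.770016, 0.804595, 0.746369, 0.368622],
        "rf":   [0.57491, 0.689705, 0.685729, 0.73138, 0.723577,
                 0.798742, 0.761049, 0.84717, 0.805241, 0.368622],
        "xgb":  [0.560439, 0.654775, 0.656824, 0.731635, 0.738774,
                 0.823003, 0.845399, 0.871546, 0.79615, 0.821865],
        "lgbm": [0.559533, 0.703573, 0.68993, 0.737659, 0.737084,
                 0.801288, 0.820338, 0.842686, 0.81882, 0.583976],
    },
    index=[f"prefix_{i}" for i in range(1, 11)],
)

# For each prefix (row), return the name of the column with the highest score
best_model = table_scores.idxmax(axis=1)
print(best_model)
```

Deriving the winners from the data keeps the list in sync with the table if the scores are ever recomputed.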

As we can observe, the simpler models perform best on the simpler datasets (the first two prefixes), while the more complex models tend to perform better on the later prefixes. We will now retrain each selected model on all available data so that, once the project is implemented, the predictions also take the most recent information into account.
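
Retraining on all data can be sketched as follows, assuming scikit-learn estimators. Everything here is a stand-in: the arrays are synthetic, and the decision tree and its `max_depth` value are placeholders for whichever model and hyperparameters were selected for a given prefix.

```python
import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(92)

# Synthetic stand-ins for the training and validation splits
X_train, y_train = rng.normal(size=(80, 4)), rng.integers(0, 2, size=80)
X_val, y_val = rng.normal(size=(20, 4)), rng.integers(0, 2, size=20)

# Placeholder for a model selected during validation
selected_model = DecisionTreeClassifier(max_depth=3, random_state=92)

# Stack both splits so the final fit sees every available observation
X_all = np.vstack([X_train, X_val])
y_all = np.concatenate([y_train, y_val])

# clone() returns an unfitted copy with the same hyperparameters,
# so the object used during validation is left untouched
final_model = clone(selected_model).fit(X_all, y_all)
```

Note that once the validation data has been used for fitting, it can no longer provide an unbiased performance estimate for the final model.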

# <font color='#BFD62F'>___________________________________</font>
# <font color='#5C666C'>5. Evaluating Variable Importance </font> <a class="anchor" id="shap"></a>
[Back to Contents](#toc)

# USEFUL LINKS:
* youtube: https://www.youtube.com/watch?v=L8_sVRhBDLU
* (in case it does not work well for admissions because it is categorical): https://medium.com/towards-data-science/shap-for-categorical-features-7c63e6a554ea