# PART 3: Intermediate Data Processing

In this Jupyter Notebook, we conclude our end-to-end data pipeline through a **predictive** lens: we fit classical machine learning models on our training data and assess their predictive accuracy on our testing data.

- **NOTE**: Before working through this notebook, please ensure that you have all necessary dependencies as denoted in [Section A: Imports and Initializations](#section-A) of this notebook.

- **NOTE**: Before working through Sections A-D of this notebook, please run all code cells in [Appendix A: Supplementary Custom Objects](#appendix-A) to ensure that all relevant functions and objects are appropriately instantiated and ready for use.

---

## 🔵 TABLE OF CONTENTS 🔵 <a name="TOC"></a>

Use this **table of contents** to navigate the various sections of the processing notebook.

#### 1. [Section A: Imports and Initializations](#section-A)

    All necessary imports and object instantiations for data predictions.

#### 2. [Section B: Loading our Processed Data](#section-B)

    Loading processed data states for current access.

#### 3. [Section C: Machine Learning](#section-C)

    Applying classical machine learning algorithms on processed datasets.
    
#### 4. [Section D: Deep Learning](#section-D)

    Applying advanced machine learning and deep learning algorithms on processed datasets. 

#### 5. [Appendix A: Supplementary Custom Objects](#appendix-A)

    Custom Python object architectures used throughout the data processing.
    
---

## 🔹 Section A: Imports and Initializations <a name="section-A"></a>

General Imports for Data Manipulation and Visualization.

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Modules for Data Preparation and Model Evaluation.

In [3]:
from sklearn.model_selection import cross_val_score, GridSearchCV, KFold, StratifiedKFold
from sklearn.preprocessing import label_binarize
from sklearn.utils import class_weight

Algorithms for Data Resampling.

In [4]:
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

Algorithms for Classical Machine Learning.

In [5]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import confusion_matrix, roc_curve, auc, f1_score

Custom Algorithmic Support Structures.

In [6]:
import sys
sys.path.append("../structures/")
from custom_structures import *

##### [(back to top)](#TOC)

---

## 🔹 Section B: Loading our Processed Data <a name="section-B"></a>

In [7]:
REL_PATH_PROC_DATA = "../data/processed/"
DATA_X, DATA_y = "X/", "y/"
SUBDIR_PROC, SUBDIR_SCA, SUBDIR_RED = "processed/", "scaled/", "reduced/"

X_TRAIN_PROC, X_TEST_PROC = "train_pXp.csv", "test_pXp.csv"
X_TRAIN_SCA, X_TEST_SCA = "train_pXs.csv", "test_pXs.csv"
X_TRAIN_RED, X_TEST_RED = "train_pXr.csv", "test_pXr.csv"
y_TRAIN_PROC, y_TEST_PROC = "train_pyp.csv", "test_pyp.csv"

#### Loading Data: _Fully Processed X-Datasets_

In [8]:
X_train_pro = pd.read_csv(REL_PATH_PROC_DATA + DATA_X + SUBDIR_PROC + X_TRAIN_PROC, index_col=0)
X_test_pro = pd.read_csv(REL_PATH_PROC_DATA + DATA_X + SUBDIR_PROC + X_TEST_PROC, index_col=0)

#### Loading Data: _Scaled X-Datasets_

In [9]:
X_train_sca = pd.read_csv(REL_PATH_PROC_DATA + DATA_X + SUBDIR_SCA + X_TRAIN_SCA, index_col=0)
X_test_sca = pd.read_csv(REL_PATH_PROC_DATA + DATA_X + SUBDIR_SCA + X_TEST_SCA, index_col=0)

#### Loading Data: _Dimensionally Reduced X-Datasets_

In [10]:
X_train_red = pd.read_csv(REL_PATH_PROC_DATA + DATA_X + SUBDIR_RED + X_TRAIN_RED, index_col=0)
X_test_red = pd.read_csv(REL_PATH_PROC_DATA + DATA_X + SUBDIR_RED + X_TEST_RED, index_col=0)

#### Loading Data: _Fully Processed Targets (y)_

In [11]:
y_train_pro = np.ravel(pd.read_csv(REL_PATH_PROC_DATA + DATA_y + SUBDIR_PROC + y_TRAIN_PROC, index_col=0, header=None))
y_test_pro = np.ravel(pd.read_csv(REL_PATH_PROC_DATA + DATA_y + SUBDIR_PROC + y_TEST_PROC, index_col=0, header=None))
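The loading cells above assemble each file path by concatenating the same constants. As a minimal sketch, the same paths could be built in one place; `proc_path`, `X_FILES`, and `x_paths` are hypothetical names introduced here for illustration, and the directory layout is assumed to match the constants defined at the start of this section.

```python
from pathlib import PurePosixPath

# Root directory copied from the constants defined earlier in this section.
REL_PATH_PROC_DATA = "../data/processed/"

def proc_path(kind, subdir, filename):
    """Hypothetical helper: build '<root>/<kind>/<subdir>/<filename>' as a string."""
    return str(PurePosixPath(REL_PATH_PROC_DATA) / kind / subdir / filename)

# Each X-dataset variant mapped to (subdirectory, train file, test file).
X_FILES = {
    "processed": ("processed", "train_pXp.csv", "test_pXp.csv"),
    "scaled": ("scaled", "train_pXs.csv", "test_pXs.csv"),
    "reduced": ("reduced", "train_pXr.csv", "test_pXr.csv"),
}

# Resolve every variant to its (train_path, test_path) pair.
x_paths = {
    variant: (proc_path("X", subdir, train), proc_path("X", subdir, test))
    for variant, (subdir, train, test) in X_FILES.items()
}

for variant, (train_path, test_path) in x_paths.items():
    print(variant, train_path, test_path)
```

Each resulting string could then be passed to `pd.read_csv(..., index_col=0)` exactly as in the cells above.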

##### [(back to top)](#TOC)

---

## 🔹 Section C: Machine Learning <a name="section-C"></a>

#### Models to Use:
- **k-Nearest Neighbors Classifier**
    - _Hyperparameters_: `n_neighbors`, ~~`leaf_size`~~, `weights`, ~~`algorithm`~~
- ~~**Support Vector Classifier**~~
    - ~~_Hyperparameters_: `kernel`, `C`, `gamma`, `degree`~~
- **Decision Tree Classifier**
    - _Hyperparameters_: ~~`max_features`~~, `min_samples_split`, `min_samples_leaf`
- **Random Forest Classifier**
    - _Hyperparameters_: ~~`criterion`~~, ~~`n_estimators`~~, `min_samples_split`, `min_samples_leaf`
- **Logistic Regression Classifier**
    - _Hyperparameters_: `penalty`, `C`
    
_TODO_: Explore mechanics behind SVM dysfunction.

#### Define Hyperparameter Grids and Dataset Variants for Cross-Validated Grid Search.

In [12]:
model_hyperparams = {
    "kNN": (KNeighborsClassifier, {
        "n_neighbors": [1, 3, 5],
#         "leaf_size": [1, 2, 3, 5],
        "weights": ["uniform", "distance"],
#         "algorithm": ["auto", "ball_tree", "kd_tree", "brute"]
    }), 
#     "svc": (SVC, {
#         "kernel": ["linear", "rbf", "poly"],
#         "C": [0.1, 1, 10, 100],
#         "gamma": [0.1, 1, 10, 100],
#         "degree": [0, 1, 2, 3, 4, 5, 6]
#     }), 
    "dtree": (DecisionTreeClassifier, {
#         "max_features": ["auto", "sqrt", "log2"],
        "min_samples_split": [2, 3, 4, 5, 6],
        "min_samples_leaf": [1, 2, 3, 4, 5]  
    }), 
    "rforest": (RandomForestClassifier, {
#         "criterion": ["gini", "entropy"],
#         "n_estimators": [10, 15, 20, 25, 30],
        "min_samples_split": [2, 3, 4, 5, 6],
        "min_samples_leaf": [1, 2, 3, 4, 5]
    }), 
    "logreg": (LogisticRegression, {
        "penalty": ["l1", "l2"],
        "C": [0.1, 1, 10]
    })
}

param_table = {
    "processed": (
        (X_train_pro, X_test_pro),
        dict.fromkeys(model_hyperparams)
    ),
    "scaled": (
        (X_train_sca, X_test_sca),
        dict.fromkeys(model_hyperparams)
    ),
    "reduced": (
        (X_train_red, X_test_red),
        dict.fromkeys(model_hyperparams)
    )
}
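To get a feel for how much work an exhaustive search over these grids involves, the sketch below enumerates every hyperparameter combination and tallies the resulting model fits. It assumes five-fold cross-validation over the three dataset variants, as in this section; the classifier classes are left out so the snippet has no third-party dependencies, and `candidates` is a hypothetical helper.

```python
from itertools import product

# Grids copied from model_hyperparams (active, uncommented entries only).
grids = {
    "kNN": {"n_neighbors": [1, 3, 5], "weights": ["uniform", "distance"]},
    "dtree": {"min_samples_split": [2, 3, 4, 5, 6], "min_samples_leaf": [1, 2, 3, 4, 5]},
    "rforest": {"min_samples_split": [2, 3, 4, 5, 6], "min_samples_leaf": [1, 2, 3, 4, 5]},
    "logreg": {"penalty": ["l1", "l2"], "C": [0.1, 1, 10]},
}
N_FOLDS = 5     # assumed: cv=5 in the grid search
N_DATASETS = 3  # processed, scaled, reduced

def candidates(grid):
    """Enumerate every hyperparameter combination in a grid as a dict."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]

# Number of candidate settings per model, then the cross-validation fit totals.
counts = {name: len(candidates(grid)) for name, grid in grids.items()}
cv_fits_per_dataset = sum(counts.values()) * N_FOLDS
total_cv_fits = cv_fits_per_dataset * N_DATASETS

print(counts)
print(cv_fits_per_dataset, total_cv_fits)
```

Under these assumptions the tally comes to 62 candidate settings, 310 cross-validation fits per dataset, and 930 across the three datasets, not counting the final refit that `GridSearchCV` performs for each search by default.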

#### Run the Grid Search and Save Optimized Hyperparameters for Later Use

In [13]:
for dataset in param_table.keys():
    print("WORKING ON DATASET: {}".format(dataset.upper()))
    for model in model_hyperparams.keys():
        print("\tFITTING MODEL: {}".format(model.upper()))
        classifier = model_hyperparams[model][0]()
        clf_grid = GridSearchCV(classifier, model_hyperparams[model][1], cv=5, verbose=0)
        optimal_model = clf_grid.fit(param_table[dataset][0][0], y_train_pro)
        param_table[dataset][1][model] = optimal_model.best_estimator_.get_params()
        print("\t\tMODEL FITTED SUCCESSFULLY.")

WORKING ON DATASET: SCALED
	FITTING MODEL: KNN
		MODEL FITTED SUCCESSFULLY.
	FITTING MODEL: DTREE
		MODEL FITTED SUCCESSFULLY.
	FITTING MODEL: RFOREST
		MODEL FITTED SUCCESSFULLY.
	FITTING MODEL: LOGREG
		MODEL FITTED SUCCESSFULLY.
WORKING ON DATASET: PROCESSED
	FITTING MODEL: KNN
		MODEL FITTED SUCCESSFULLY.
	FITTING MODEL: DTREE
		MODEL FITTED SUCCESSFULLY.
	FITTING MODEL: RFOREST
		MODEL FITTED SUCCESSFULLY.
	FITTING MODEL: LOGREG
		MODEL FITTED SUCCESSFULLY.
WORKING ON DATASET: REDUCED
	FITTING MODEL: KNN
		MODEL FITTED SUCCESSFULLY.
	FITTING MODEL: DTREE
		MODEL FITTED SUCCESSFULLY.
	FITTING MODEL: RFOREST
		MODEL FITTED SUCCESSFULLY.
	FITTING MODEL: LOGREG
		MODEL FITTED SUCCESSFULLY.
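The search loop above stores `best_estimator_.get_params()`, which holds every constructor argument of the winning model rather than only the tuned ones. The sketch below illustrates the difference on synthetic stand-in data (an assumption; it is not the notebook's dataset), using the same kNN grid.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic three-class data, used only so the sketch runs on its own.
X, y = make_classification(n_samples=200, n_features=8, n_informative=4,
                           n_classes=3, random_state=0)

grid = {"n_neighbors": [1, 3, 5], "weights": ["uniform", "distance"]}
search = GridSearchCV(KNeighborsClassifier(), grid, cv=5)
search.fit(X, y)

tuned_only = search.best_params_                   # only the searched keys
full_config = search.best_estimator_.get_params()  # every constructor argument

print(sorted(tuned_only))
print(len(full_config) > len(tuned_only))

# The full configuration can rebuild an equivalent, unfitted classifier.
rebuilt = KNeighborsClassifier(**full_config)
```

Storing the full configuration, as the loop above does, makes the later re-instantiation self-contained; `best_params_` alone would also work as long as the remaining arguments stay at their defaults.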


#### Attain Generalized Accuracy Scores for Optimized Models

In [75]:
for dataset in param_table.keys():
    for model in model_hyperparams.keys():
        get_model_accuracy_metrics(model, dataset, param_table[dataset][1][model], scoring="standard")

For Data Case [SCALED] with Model [KNN] -> STANDARD ACCURACY SCORE: 0.6680.
For Data Case [SCALED] with Model [DTREE] -> STANDARD ACCURACY SCORE: 0.2456.
For Data Case [SCALED] with Model [RFOREST] -> STANDARD ACCURACY SCORE: 0.3367.
For Data Case [SCALED] with Model [LOGREG] -> STANDARD ACCURACY SCORE: 0.2962.
For Data Case [PROCESSED] with Model [KNN] -> STANDARD ACCURACY SCORE: 0.6319.
For Data Case [PROCESSED] with Model [DTREE] -> STANDARD ACCURACY SCORE: 0.2456.
For Data Case [PROCESSED] with Model [RFOREST] -> STANDARD ACCURACY SCORE: 0.3558.
For Data Case [PROCESSED] with Model [LOGREG] -> STANDARD ACCURACY SCORE: 0.3062.
For Data Case [REDUCED] with Model [KNN] -> STANDARD ACCURACY SCORE: 0.2513.
For Data Case [REDUCED] with Model [DTREE] -> STANDARD ACCURACY SCORE: 0.2574.
For Data Case [REDUCED] with Model [RFOREST] -> STANDARD ACCURACY SCORE: 0.2624.
For Data Case [REDUCED] with Model [LOGREG] -> STANDARD ACCURACY SCORE: 0.2637.


#### Attain F1 Scores for Optimized Models

In [83]:
for dataset in param_table.keys():
    for model in model_hyperparams.keys():
        get_model_accuracy_metrics(model, dataset, param_table[dataset][1][model], scoring="f1")

For Data Case [SCALED] with Model [KNN] & Optimal Average Parameter [MICRO] -> F1 ACCURACY SCORE 0.6680.
For Data Case [SCALED] with Model [DTREE] & Optimal Average Parameter [WEIGHTED] -> F1 ACCURACY SCORE 0.2457.
For Data Case [SCALED] with Model [RFOREST] & Optimal Average Parameter [MICRO] -> F1 ACCURACY SCORE 0.3544.
For Data Case [SCALED] with Model [LOGREG] & Optimal Average Parameter [MACRO] -> F1 ACCURACY SCORE 0.2971.
For Data Case [PROCESSED] with Model [KNN] & Optimal Average Parameter [MICRO] -> F1 ACCURACY SCORE 0.6319.
For Data Case [PROCESSED] with Model [DTREE] & Optimal Average Parameter [WEIGHTED] -> F1 ACCURACY SCORE 0.2457.
For Data Case [PROCESSED] with Model [RFOREST] & Optimal Average Parameter [MICRO] -> F1 ACCURACY SCORE 0.3705.
For Data Case [PROCESSED] with Model [LOGREG] & Optimal Average Parameter [MACRO] -> F1 ACCURACY SCORE 0.3118.
For Data Case [REDUCED] with Model [KNN] & Optimal Average Parameter [MICRO] -> F1 ACCURACY SCORE 0.2513.
For Data Case [RED


### For both standard accuracy and F1 score, the best-performing model is the _k-Nearest Neighbors_ classifier on the scaled data, with a test accuracy of approximately `66.8%`.

Possible ways to improve our accuracy:
- Use different data scaling/normalization techniques to retain more signal.
- Implement more fine-tuned encoding schemas, including frequency-based and categorical encoding.
- Try other variants of the classical machine learning models used here (e.g. forest variants).
- Create a deep learning approach to learn and retain more signal while dropping noisy data.
- Search a wider hyperparameter grid to find better-tuned models.

##### [(back to top)](#TOC)

---

## 🔹 Section D: Deep Learning <a name="section-D"></a>

TBD.

##### [(back to top)](#TOC)

---

## 🔹 Appendix A: Supplementary Custom Objects <a name="appendix-A"></a>

#### A[1]: Model Accuracy Metric Function.

Function that fits a machine learning model with its optimized hyperparameters on a user-specified dataset variant, prints its test-set accuracy or F1 score, and computes a confusion matrix.

In [82]:
def get_model_accuracy_metrics(model, variant, hyperparams, scoring="standard"):
    """ Fit a model with the given hyperparameters on a dataset variant and print its test-set accuracy or F1 score. """
    X_train, X_test = param_table[variant][0]
    y_train, y_test = y_train_pro, y_test_pro
    
    # Build the classifier from the hyperparameters passed in by the caller.
    classifier = model_hyperparams[model][0](**hyperparams)
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    # confusion_matrix expects (y_true, y_pred): rows are true labels, columns are predictions.
    cmat = confusion_matrix(y_test, y_pred)
    if scoring == "standard":
        optimal_accuracy = classifier.score(X_test, y_test)
        print("For Data Case [{}] with Model [{}] -> {} ACCURACY SCORE: {:.4f}.".format(
            variant.upper(), model.upper(), scoring.upper(), optimal_accuracy))
    if scoring == "f1":
        # Report the largest F1 score among the macro, micro, and weighted averages.
        optimal_accuracy, optimal_avg, averages = float(), str(), ["macro", "micro", "weighted"]
        for average in averages:
            acc = f1_score(y_test, y_pred, average=average)
            if acc > optimal_accuracy:
                optimal_avg, optimal_accuracy = average, acc
        print("For Data Case [{}] with Model [{}] & Optimal Average Parameter [{}] -> {} ACCURACY SCORE {:.4f}.".format(
            variant.upper(), model.upper(), optimal_avg.upper(), scoring.upper(), optimal_accuracy))
    # Return the confusion matrix so callers can inspect per-class errors.
    return cmat
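The F1 branch above keeps the largest of the macro, micro, and weighted averages. For single-label multiclass predictions, micro-averaged F1 equals plain accuracy, which is consistent with the kNN rows in Section C reporting the same value under both metrics. The sketch below works through the three averages by hand on toy labels invented for illustration; `f1_by_average` is a hypothetical helper, not part of scikit-learn.

```python
from collections import Counter

def f1_by_average(y_true, y_pred):
    """Return macro, micro, and weighted F1 for single-label multiclass data."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # class p was predicted, but wrongly
            fn[t] += 1  # true class t was missed

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    per_class = {c: f1(tp[c], fp[c], fn[c]) for c in labels}
    support = Counter(y_true)  # number of true samples per class
    return {
        "macro": sum(per_class.values()) / len(labels),
        "micro": f1(sum(tp.values()), sum(fp.values()), sum(fn.values())),
        "weighted": sum(per_class[c] * support[c] for c in labels) / len(y_true),
    }

# Toy labels, invented for illustration only.
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 2, 2, 0]

scores = f1_by_average(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(scores)
print(accuracy)
```

Because the micro average always matches accuracy in this setting, taking the maximum over the three averages can never report less than the standard accuracy for the same predictions; the macro average is the one that weights every class equally regardless of its frequency.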

##### [(back to top)](#TOC)

---