# PART 3: Intermediate Data Processing

In this Jupyter Notebook, we conclude our end-to-end data pipeline through a **predictive** lens: we fit classical machine learning models to the processed training data and assess their predictive accuracy on both the training and testing data.

- **NOTE**: Before working through this notebook, please ensure that you have installed all necessary dependencies listed in [Section A: Imports and Initializations](#section-A) of this notebook.

- **NOTE**: Before working through Sections A-D of this notebook, please run all code cells in [Appendix A: Supplementary Custom Objects](#appendix-A) to ensure that all relevant functions and objects are appropriately instantiated and ready for use.

---

## 🔵 TABLE OF CONTENTS 🔵 <a name="TOC"></a>

Use this **table of contents** to navigate the various sections of the processing notebook.

#### 1. [Section A: Imports and Initializations](#section-A)

    All necessary imports and object instantiations for data predictions.

#### 2. [Section B: Loading our Processed Data](#section-B)

    Loading processed data states for current access.

#### 3. [Section C: Machine Learning](#section-C)

    Applying classical machine learning algorithms on processed datasets.
    
#### 4. [Section D: Deep Learning](#section-D)

    Applying advanced machine learning and deep learning algorithms on processed datasets. 

#### 5. [Appendix A: Supplementary Custom Objects](#appendix-A)

    Custom Python object architectures used throughout the data processing.
    
---

## 🔹 Section A: Imports and Initializations <a name="section-A"></a>

General Imports for Data Manipulation and Visualization.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Modules for Data Preparation and Model Evaluation.

In [2]:
from sklearn.model_selection import cross_val_score, GridSearchCV, KFold, StratifiedKFold
from sklearn.preprocessing import label_binarize
from sklearn.utils import class_weight

Algorithms for Data Resampling.

In [3]:
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

Algorithms for Classical Machine Learning.

In [4]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import confusion_matrix, roc_curve, auc, f1_score

Custom Algorithmic Support Structures.

In [5]:
import sys
sys.path.append("../structures/")
from custom_structures import *

##### [(back to top)](#TOC)

---

## 🔹 Section B: Loading our Processed Data <a name="section-B"></a>

In [6]:
REL_PATH_PROC_DATA = "../data/processed/"
DATA_X, DATA_y = "X/", "y/"
SUBDIR_PROC, SUBDIR_SCA, SUBDIR_RED = "processed/", "scaled/", "reduced/"

X_TRAIN_PROC, X_TEST_PROC = "train_pXp.csv", "test_pXp.csv"
X_TRAIN_SCA, X_TEST_SCA = "train_pXs.csv", "test_pXs.csv"
X_TRAIN_RED, X_TEST_RED = "train_pXr.csv", "test_pXr.csv"
y_TRAIN_PROC, y_TEST_PROC = "train_pyp.csv", "test_pyp.csv"

#### Loading Data: _Fully Processed X-Datasets_

In [7]:
X_train_pro = pd.read_csv(REL_PATH_PROC_DATA + DATA_X + SUBDIR_PROC + X_TRAIN_PROC, index_col=0)
X_test_pro = pd.read_csv(REL_PATH_PROC_DATA + DATA_X + SUBDIR_PROC + X_TEST_PROC, index_col=0)

#### Loading Data: _Scaled X-Datasets_

In [8]:
X_train_sca = pd.read_csv(REL_PATH_PROC_DATA + DATA_X + SUBDIR_SCA + X_TRAIN_SCA, index_col=0)
X_test_sca = pd.read_csv(REL_PATH_PROC_DATA + DATA_X + SUBDIR_SCA + X_TEST_SCA, index_col=0)

#### Loading Data: _Dimensionally Reduced X-Datasets_

In [9]:
X_train_red = pd.read_csv(REL_PATH_PROC_DATA + DATA_X + SUBDIR_RED + X_TRAIN_RED, index_col=0)
X_test_red = pd.read_csv(REL_PATH_PROC_DATA + DATA_X + SUBDIR_RED + X_TEST_RED, index_col=0)

#### Loading Data: _Fully Processed Targets (y)_

In [10]:
y_train_pro = np.ravel(pd.read_csv(REL_PATH_PROC_DATA + DATA_y + SUBDIR_PROC + y_TRAIN_PROC, index_col=0, header=None))
y_test_pro = np.ravel(pd.read_csv(REL_PATH_PROC_DATA + DATA_y + SUBDIR_PROC + y_TEST_PROC, index_col=0, header=None))
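
Because the features and targets are read from separate files, a quick row-count check can catch a mismatched pair before any model is fitted. The sketch below is only an illustration: the in-memory CSV strings and the `check_alignment` helper are hypothetical stand-ins, not part of the original pipeline. It uses the same `index_col=0`, `header=None` and `np.ravel` pattern as the loading cells above.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-ins for the CSV files on disk (assumed layout:
# first column is the row index; the target file has no header row).
X_csv = io.StringIO(",f1,f2\n0,0.1,1.0\n1,0.2,0.0\n2,0.3,1.0\n")
y_csv = io.StringIO("0,1\n1,0\n2,1\n")

X_demo = pd.read_csv(X_csv, index_col=0)
# np.ravel flattens the single-column DataFrame into a 1-D label array.
y_demo = np.ravel(pd.read_csv(y_csv, index_col=0, header=None))


def check_alignment(X, y):
    """Return the shared row count, or raise if X and y disagree."""
    if X.shape[0] != y.shape[0]:
        raise ValueError(
            "Row mismatch: X has {} rows, y has {}".format(X.shape[0], y.shape[0])
        )
    return X.shape[0]


n_rows = check_alignment(X_demo, y_demo)
print("Aligned rows:", n_rows)
```

The same helper could be called on each `(X_train_*, y_train_pro)` and `(X_test_*, y_test_pro)` pair once the real files are loaded.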

##### [(back to top)](#TOC)

---

## 🔹 Section C: Machine Learning <a name="section-C"></a>

#### Models to Use:
- **k-Nearest Neighbors Classifier**
    - _Hyperparameters_: `n_neighbors`, ~~`leaf_size`~~, `weights`, ~~`algorithm`~~
- **Support Vector Classifier**
    - _Hyperparameters_: `kernel`, `C`, ~~`gamma`~~, ~~`degree`~~
- **Decision Tree Classifier**
    - _Hyperparameters_: ~~`max_features`~~, `min_samples_split`, `min_samples_leaf`
- **Random Forest Classifier**
    - _Hyperparameters_: ~~`criterion`~~, ~~`n_estimators`~~, `min_samples_split`, `min_samples_leaf`
- **Logistic Regression Classifier**
    - _Hyperparameters_: `penalty`, `C`

#### Determine Optimal Hyperparameters for Conducting Cross-Validation-Driven Machine Learning.

In [12]:
model_hyperparams = {
    "kNN": (KNeighborsClassifier, {
        "n_neighbors": [1, 3, 5],
#         "leaf_size": [1, 2, 3, 5],
        "weights": ["uniform", "distance"],
#         "algorithm": ["auto", "ball_tree", "kd_tree", "brute"]
    }), 
    "svc": (SVC, {
        "kernel": ["linear"],
        "C": [0.1, 1, 10, 100],
#         "gamma": [0.1, 1, 10],
#         "degree": [0, 1, 2, 3, 4, 5, 6]
    }), 
    "dtree": (DecisionTreeClassifier, {
#         "max_features": ["auto", "sqrt", "log2"],
        "min_samples_split": [2, 3, 4, 5, 6],
        "min_samples_leaf": [1, 2, 3, 4, 5]  
    }), 
    "rforest": (RandomForestClassifier, {
#         "criterion": ["gini", "entropy"],
#         "n_estimators": [10, 15, 20, 25, 30],
        "min_samples_split": [2, 3, 4, 5, 6],
        "min_samples_leaf": [1, 2, 3, 4, 5]
    }), 
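    "logreg": (LogisticRegression, {
        "penalty": ["l1", "l2"],
        "C": [0.1, 1, 10],
        # "liblinear" supports both "l1" and "l2"; the "lbfgs" solver rejects "l1".
        "solver": ["liblinear"]
    })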
}

param_table = {
    "processed": (
        (X_train_pro, X_test_pro),
        dict.fromkeys(model_hyperparams)
    ),
    "scaled": (
        (X_train_sca, X_test_sca),
        dict.fromkeys(model_hyperparams)
    ),
    "reduced": (
        (X_train_red, X_test_red),
        dict.fromkeys(model_hyperparams)
    )
}
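
Before launching the search, it can help to estimate how many model fits the grids imply. The sketch below copies the uncommented grids from `model_hyperparams` and counts candidates with scikit-learn's `ParameterGrid`. The fold count of 5 and the three datasets mirror the search loop and `param_table`; the names `grids`, `N_FOLDS` and `N_DATASETS` are introduced here only for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Uncommented hyperparameter grids, copied from model_hyperparams.
grids = {
    "kNN": {"n_neighbors": [1, 3, 5], "weights": ["uniform", "distance"]},
    "svc": {"kernel": ["linear"], "C": [0.1, 1, 10, 100]},
    "dtree": {
        "min_samples_split": [2, 3, 4, 5, 6],
        "min_samples_leaf": [1, 2, 3, 4, 5],
    },
    "rforest": {
        "min_samples_split": [2, 3, 4, 5, 6],
        "min_samples_leaf": [1, 2, 3, 4, 5],
    },
    "logreg": {"penalty": ["l1", "l2"], "C": [0.1, 1, 10]},
}

N_FOLDS = 5      # assumed to match cv=5 in the grid search
N_DATASETS = 3   # processed, scaled, reduced

# Each candidate is one combination of hyperparameter values.
candidates = {name: len(ParameterGrid(grid)) for name, grid in grids.items()}

# Every candidate is fitted once per fold (final refits are not counted).
fits_per_dataset = N_FOLDS * sum(candidates.values())
total_fits = N_DATASETS * fits_per_dataset

print(candidates)
print("Cross-validation fits in total:", total_fits)
```

Re-enabling any of the commented-out hyperparameters multiplies the corresponding candidate count, so this estimate is a cheap way to judge whether a larger grid is affordable.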

In [None]:
for dataset in param_table:
    print("WORKING ON DATASET: {}".format(dataset.upper()))
    X_train = param_table[dataset][0][0]
    for model, (estimator_cls, grid) in model_hyperparams.items():
        print("\tFITTING MODEL: {}".format(model.upper()))
        # 5-fold cross-validated grid search over this model's hyperparameters.
        clf_grid = GridSearchCV(estimator_cls(), grid, cv=5, verbose=0)
        optimal_model = clf_grid.fit(X_train, y_train_pro)
        # Store the best estimator's full parameter set for later refitting.
        param_table[dataset][1][model] = optimal_model.best_estimator_.get_params()
        print("\t\tMODEL FITTED SUCCESSFULLY.")

WORKING ON DATASET: SCALED
	FITTING MODEL: KNN
		MODEL FITTED SUCCESSFULLY.
	FITTING MODEL: DTREE
		MODEL FITTED SUCCESSFULLY.
	FITTING MODEL: RFOREST
		MODEL FITTED SUCCESSFULLY.
	FITTING MODEL: SVC
		MODEL FITTED SUCCESSFULLY.
	FITTING MODEL: LOGREG
		MODEL FITTED SUCCESSFULLY.
WORKING ON DATASET: PROCESSED
	FITTING MODEL: KNN
		MODEL FITTED SUCCESSFULLY.
	FITTING MODEL: DTREE
		MODEL FITTED SUCCESSFULLY.
	FITTING MODEL: RFOREST
		MODEL FITTED SUCCESSFULLY.
	FITTING MODEL: SVC
		MODEL FITTED SUCCESSFULLY.
	FITTING MODEL: LOGREG
		MODEL FITTED SUCCESSFULLY.
WORKING ON DATASET: REDUCED
	FITTING MODEL: KNN
		MODEL FITTED SUCCESSFULLY.
	FITTING MODEL: DTREE
		MODEL FITTED SUCCESSFULLY.
	FITTING MODEL: RFOREST
		MODEL FITTED SUCCESSFULLY.
	FITTING MODEL: SVC


##### [(back to top)](#TOC)

---

## 🔹 Section D: Deep Learning <a name="section-D"></a>

TBD.

##### [(back to top)](#TOC)

---

## 🔹 Appendix A: Supplementary Custom Objects <a name="appendix-A"></a>

##### [(back to top)](#TOC)

---