# PART 2: Intermediate Data Processing

In this Jupyter Notebook, we further investigate the interim datasets through a **processing** lens: we analyze, transform, scale, encode, reduce, and otherwise munge our data to prepare it for predictive analysis and machine learning-based modeling. 

- **NOTE**: Before working through this notebook, please ensure that all dependencies listed in [Section A: Imports and Initializations](#section-A) are installed.

- **NOTE**: Before working through Sections A-D of this notebook, please run all code cells in [Appendix A: Supplementary Custom Objects](#appendix-A) to ensure that all relevant functions and objects are appropriately instantiated and ready for use.

---

## 🔵 TABLE OF CONTENTS 🔵 <a name="TOC"></a>

Use this **table of contents** to navigate the various sections of the processing notebook.

#### 1. [Section A: Imports and Initializations](#section-A)

    All necessary imports and object instantiations for data processing.

#### 2. [Section B: Data Encoding](#section-B)

    Data encoding operations, including value range mapping, 
    correlational plotting, and categorical encoding.

#### 3. [Section C: Data Scaling & Transformation](#section-C)

    Data transformation techniques, including standard scaling/normalization
    and feature reduction techniques.

#### 4. [Section D: Saving Our Processed Datasets](#section-D)

    Saving processed data states for further access.

#### 5. [Appendix A: Supplementary Custom Objects](#appendix-A)

    Custom Python object architectures used throughout the data processing.
    
---

## 🔹 Section A: Imports and Initializations <a name="section-A"></a>

General imports for data manipulation and visualization.

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Algorithms for Data Scaling and Feature Reduction.

In [3]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

Custom Algorithmic Structures for Processed Data Visualization.

In [4]:
import sys
sys.path.append("../structures/")
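# Dataset_Processor is defined in Appendix A of this notebook; uncomment the
# next line to import it from ../structures/dataset_processor.py instead.
# from dataset_processor import Dataset_Processor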
from custom_structures import cmat_

#### Instantiate Our Processor Engine

Custom Processor Class for Target-Oriented Data Modification.

**NOTE**: Please refer to _Appendix A: Supplementary Custom Objects_ to view the fully implemented processor object.

In [18]:
proc = Dataset_Processor()

##### [(back to top)](#TOC)

---

## 🔹 Section B: Data Encoding <a name="section-B"></a>

#### Read Our Preprocessed Data Into Conditional DataFrame(s)

**Call the** `.load_data()` **method to load the interim training and testing datasets.**

_NOTE_: Both datasets are currently loaded as separate DataFrames using the default argument `which="both"`.

In [19]:
(df_train_i, df_test_i) = proc.load_data()
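
To make the difference between `which="both"` and `which="all"` concrete, here is a minimal sketch on toy data. The frames `df_train_demo` and `df_test_demo` and their columns are hypothetical stand-ins, not the actual interim datasets; only the `pd.concat` call mirrors what `load_data()` does for `which="all"`.

```python
import pandas as pd

# Hypothetical stand-ins for the interim training and testing sets.
df_train_demo = pd.DataFrame({"feature": [1, 2, 3], "target": [0, 1, 0]})
df_test_demo = pd.DataFrame({"feature": [4, 5]})  # test data has no target column

# which="both" keeps the two frames separate, as in the cell above.
# which="all" stacks them under "train"/"test" keys, producing a MultiIndex:
df_all_demo = pd.concat([df_train_demo, df_test_demo], keys=["train", "test"], sort=True)

# The outer index level recovers each original split.
train_part = df_all_demo.loc["train"]
test_part = df_all_demo.loc["test"]

print(df_all_demo.shape)                 # (5, 2)
print(test_part["target"].isna().all())  # True: columns missing from one split become NaN
```

The concatenated form is convenient when an encoding has to be applied identically to both splits, while the keyed index still lets the splits be separated again afterwards.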

### ⭐️ _TODO_: Re-encode each feature using binary, ternary, and quaternary encoding schemas. ⭐️
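
One possible reading of this TODO, sketched under assumptions: each numeric feature is mapped onto 2, 3, or 4 integer levels by value range. The helper `encode_by_range`, the `score` column, and the choice of equal-width bins are all hypothetical; the notebook does not specify how the levels should be defined.

```python
import pandas as pd

def encode_by_range(series, n_levels):
    """Map a numeric Series onto integer codes 0..n_levels-1 using equal-width bins.

    Hypothetical helper: equal-width binning is an assumption, not the notebook's method.
    """
    if n_levels < 2:
        raise ValueError("n_levels must be at least 2")
    # labels=False makes pd.cut return the integer bin index for each value.
    return pd.cut(series, bins=n_levels, labels=False)

# Toy feature spanning 0..100 in steps of 10.
scores = pd.Series(range(0, 101, 10), name="score")

encoded = pd.DataFrame({
    "score": scores,
    "binary": encode_by_range(scores, 2),      # 2 levels
    "ternary": encode_by_range(scores, 3),     # 3 levels
    "quaternary": encode_by_range(scores, 4),  # 4 levels
})
print(encoded)
```

If quantile-based levels are preferred, `pd.qcut` could be swapped in for `pd.cut`. In either case the bin edges would normally be derived from the training data and reused for the testing data.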

##### [(back to top)](#TOC)

---

## 🔹 Section C: Data Scaling & Transformation <a name="section-C"></a>
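
This section is not yet implemented. As a rough sketch of what the `StandardScaler` and `PCA` imports from Section A could be used for, the example below standardizes toy data and then reduces it to two components. The demo frames, column names, and `n_components=2` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Hypothetical numeric features standing in for the encoded datasets.
rng = np.random.default_rng(0)
X_train_demo = pd.DataFrame(rng.normal(5.0, 2.0, size=(100, 4)), columns=list("abcd"))
X_test_demo = pd.DataFrame(rng.normal(5.0, 2.0, size=(20, 4)), columns=list("abcd"))

# Fit the scaler on training data only, then apply the same transform to test data.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_demo)
X_test_scaled = scaler.transform(X_test_demo)

# Reduce to two principal components, again fitting on training data only.
pca = PCA(n_components=2)
X_train_reduced = pca.fit_transform(X_train_scaled)
X_test_reduced = pca.transform(X_test_scaled)

print(X_train_reduced.shape, X_test_reduced.shape)  # (100, 2) (20, 2)
print(pca.explained_variance_ratio_)
```

Fitting both transformers on the training split alone keeps information from the testing split out of the learned scaling and projection.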

##### [(back to top)](#TOC)

---

## 🔹 Section D: Saving Our Processed Datasets <a name="section-D"></a>
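
This section is not yet implemented. A minimal sketch of one way to persist the processed frames follows. The helper `save_processed` and the file names `train_p.csv` and `test_p.csv` are hypothetical (chosen to mirror the interim `train_i.csv` and `test_i.csv`), and a temporary directory stands in for the real output location.

```python
import tempfile
from pathlib import Path

import pandas as pd

def save_processed(df_train, df_test, out_dir):
    """Write both frames to CSV in out_dir and return the two file paths.

    Hypothetical helper; the file names are assumptions.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    train_path = out_dir / "train_p.csv"
    test_path = out_dir / "test_p.csv"
    # The index is written as the first column, matching index_col=0 on reload.
    df_train.to_csv(train_path)
    df_test.to_csv(test_path)
    return train_path, test_path

# Toy frames standing in for the processed datasets.
df_train_demo = pd.DataFrame({"feature": [1, 2, 3], "target": [0, 1, 0]})
df_test_demo = pd.DataFrame({"feature": [4, 5]})

with tempfile.TemporaryDirectory() as tmp:
    train_path, test_path = save_processed(df_train_demo, df_test_demo, tmp)
    # Reload the same way load_data() reads the interim files.
    reloaded = pd.read_csv(train_path, index_col=0)
    round_trip_ok = reloaded.to_dict("list") == df_train_demo.to_dict("list")

print(round_trip_ok)  # True
```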

##### [(back to top)](#TOC)

---

## 🔹 Appendix A: Supplementary Custom Objects <a name="appendix-A"></a>

#### A[1]: 6Nomads Dataset Processor.

To view the **Data Processor Engine**, follow these steps:

1. Navigate to the `structures` sibling directory. 
2. Access the `dataset_processor.py` file. 
3. View the `Dataset_Processor()` object architecture.

In [17]:
class Dataset_Processor(object):
    """ Class object instance for processing and transforming 6Nomads data for predictive analytics. """
    def __init__(self):
        """ Initializer method for object instance creation. """
        self.REL_PATH_TO_INT_DATA_TRAIN = "../data/interim/train_i.csv"
        self.REL_PATH_TO_INT_DATA_TEST = "../data/interim/test_i.csv"
        
    def load_data(self, which="both"):
        """ 
        Instance method to load dataset(s) into separate or concatenated Pandas DataFrame(s).
        
        INPUTS:
            {which}:
                - str(both): Reads in training and testing data files as tuple of individual DataFrames. (DEFAULT)
                - str(all): Reads in training and testing data files as single conjoined DataFrame.
                - str(train): Reads in training data file as single DataFrame.
                - str(test): Reads in testing data file as single DataFrame.
                
        OUTPUTS:
            pandas.DataFrame or tuple: A single DataFrame for "all", "train", or "test"; a (train, test) tuple of DataFrames for "both".
        """
        # Validate conditional data loading arguments
        if which not in ["all", "both", "train", "test"]:
            raise ValueError("ERROR: Inappropriate value passed to argument `which`.\n\nExpected value in range:\n - all\n - both\n - train\n - test\n\nActual:\n - {}".format(which))
        
        # Independently load training data
        if which == "train":
            return pd.read_csv(self.REL_PATH_TO_INT_DATA_TRAIN, index_col=0)
        
        # Independently load testing data
        if which == "test":
            return pd.read_csv(self.REL_PATH_TO_INT_DATA_TEST, index_col=0)
        else:
            df_train = pd.read_csv(self.REL_PATH_TO_INT_DATA_TRAIN, index_col=0)
            df_test = pd.read_csv(self.REL_PATH_TO_INT_DATA_TEST, index_col=0)
            
            # Load merged training and testing data
            if which == "all":
                return pd.concat([df_train, df_test], keys=["train", "test"], sort=True)
            
            # Load separated training and testing data (DEFAULT)
            if which == "both":
                return df_train, df_test

##### [(back to top)](#TOC)

---