# PART 1: Exploratory Data Analysis

In this Jupyter Notebook, we examine the provided external datasets through a **preprocessing** lens: we manipulate, curate, and prepare the data, both to understand its structure and to ready it for the prediction-oriented steps that follow.

---

## 🔵 TABLE OF CONTENTS 🔵 <a name="TOC"></a>

Use this **table of contents** to navigate the various sections of the preprocessing notebook.

#### 1. [Section A: Imports and Initializations](#section-A)

    All necessary imports and object instantiations for data preprocessing.
    
#### 2. [Section B: Manipulating Our Datasets](#section-B)

    Data manipulation operations, including null value removal/imputation, 
    data splitting/merging, and data frequency generation.

#### 3. [Section C: Visualizing Trends Across Our Data](#section-C)

    Data visualizations that outline trends and patterns in our data
    that may warrant further analysis.

#### 4. [Section D: Saving Our Interim Datasets](#section-D)

    Saving preprocessed data states for further access.

#### 5. [Appendix A: Supplementary Custom Objects](#appendix-A)

    Custom Python object architectures used throughout the data preprocessing.

#### 6. [Appendix B: Data Dictionary](#appendix-B)

    Data dictionary representation for our dataset.
    
---

## 🔹 Section A: Imports and Initializations <a name="section-A"></a>

General imports for data manipulation and visualization.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Custom Algorithmic Structures for Data Preprocessing.

In [2]:
import sys
sys.path.append("../structures/")
# Import disabled while Dataset_Preprocessor is defined in-notebook (see Appendix A)
# from dataset_preprocessor import Dataset_Preprocessor
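
Python resolves a relative `sys.path` entry such as `"../structures/"` against the current working directory, so the import above only works when the notebook is launched from its own folder. The following is a minimal sketch of a more defensive variant. It assumes the `structures` directory is a sibling of the notebook's working directory, and the name `structures_dir` is illustrative rather than part of the project.

```python
import sys
from pathlib import Path

# Assumption: the notebook's working directory sits next to "structures".
structures_dir = (Path.cwd().parent / "structures").resolve()

# Add the absolute path only once, so re-running the cell does not duplicate it.
if str(structures_dir) not in sys.path:
    sys.path.append(str(structures_dir))
```

With an absolute entry on `sys.path`, a later change of working directory inside the notebook does not affect where the module is looked up.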

#### Instantiate our Preprocessor Engine

Custom Preprocessor Class for Directed Data Manipulation.

**NOTE**: Please refer to _Appendix A: Supplementary Custom Objects_ for instructions on how to view the fully implemented dataset preprocessor object.

In [28]:
# Requires the class-definition cell in Appendix A to have been run first
preproc = Dataset_Preprocessor()

##### [(back to top)](#TOC)

---

## 🔹 Section B: Manipulating Our Datasets <a name="section-B"></a>

#### 🔸 CHECKPOINT 🔸

**Interim data ready to save.**

##### [(back to top)](#TOC)

---

## 🔹 Section C: Visualizing Trends Across Our Data <a name="section-C"></a>

##### [(back to top)](#TOC)

---

## 🔹 Section D: Saving Our Interim Datasets <a name="section-D"></a>

##### [(back to top)](#TOC)

---

## 🔹 Appendix A: Supplementary Custom Objects <a name="appendix-A"></a>

#### A[1]: 6Nomads Dataset Preprocessor.

To view the **Data Preprocessor Engine**, follow these steps:

1. Navigate to the `structures` sibling directory. 
2. Access the `dataset_preprocessor.py` file. 
3. View the `Dataset_Preprocessor()` object architecture.

_NOTE_: **The preprocessor engine is defined in this notebook for now, pending further separation of concerns.**

In [33]:
class Dataset_Preprocessor(object):
    """ Class object instance for preprocessing and cleaning 6Nomads data for predictive analytics. """
    def __init__(self):
        """ Initializer method for object instance creation. """
        self.REL_PATH_TO_EXT_DATA_TRAIN = "../data/external/train.csv"
        self.REL_PATH_TO_EXT_DATA_TEST = "../data/external/test.csv"
        
    def load_data(self, which="both"):
        """ 
        Instance method to load dataset(s) into separate or combined Pandas DataFrame(s). 
        
        INPUTS:
            {which}:
                - str(both): Reads training and testing data files as a tuple of individual DataFrames. (DEFAULT)
                - str(all): Reads training and testing data files as a single concatenated DataFrame, keyed by "train"/"test".
                - str(train): Reads training data file as a single DataFrame.
                - str(test): Reads testing data file as a single DataFrame.
                
        OUTPUTS:
            pandas.DataFrame or tuple: A single DataFrame, or a (train, test) tuple of DataFrames when {which} is "both".
        """
        # Validate conditional data loading arguments
        if which not in ["all", "both", "train", "test"]:
            raise ValueError("ERROR: Expected value in range:\n - all\n - both\n - train\n - test\n\nReceived:\n - {}".format(which))
        
        # Independently load training data
        if which == "train":
            return pd.read_csv(self.REL_PATH_TO_EXT_DATA_TRAIN)
        
        # Independently load testing data
        if which == "test":
            return pd.read_csv(self.REL_PATH_TO_EXT_DATA_TEST)
        
        # Remaining modes ("all", "both") need both files
        df_train = pd.read_csv(self.REL_PATH_TO_EXT_DATA_TRAIN)
        df_test = pd.read_csv(self.REL_PATH_TO_EXT_DATA_TEST)
        
        # Load merged training and testing data
        if which == "all":
            return pd.concat([df_train, df_test], keys=["train", "test"], sort=True)
        
        # Load separated training and testing data (DEFAULT)
        return df_train, df_test
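
The `which="all"` mode relies on `pd.concat` with `keys`, which gives the combined frame a two-level row index. The sketch below illustrates that behavior with tiny in-memory frames instead of the real CSV files. The column names and values are made up for illustration, and the missing `target` column in the test frame is an assumption, not a statement about the actual data.

```python
import pandas as pd

# Hypothetical stand-ins for train.csv / test.csv; columns are illustrative only.
df_train = pd.DataFrame({"feature": [1.0, 2.0], "target": [0, 1]})
df_test = pd.DataFrame({"feature": [3.0]})  # assumed: no target column here

# Same call as the which="all" branch of load_data.
df_all = pd.concat([df_train, df_test], keys=["train", "test"], sort=True)

# The outer index level records which frame each row came from,
# so either split can be recovered from the combined frame.
train_rows = df_all.loc["train"]
test_rows = df_all.loc["test"]

# Columns missing from one input frame are filled with NaN after concatenation.
print(df_all)
print(test_rows["target"].isna().all())
```

Keeping the source as an index level means shared cleaning steps can run once on the combined frame, and the two splits can still be separated afterwards with `.loc`.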

##### [(back to top)](#TOC)

---

## 🔹 Appendix B: Data Dictionary <a name="appendix-B"></a>

##### [(back to top)](#TOC)

---