# PART 1: Exploratory Data Analysis

In this Jupyter Notebook, we analyze the given external datasets through a **preprocessing** lens: we manipulate, curate, and clean the data to understand what it contains and to get our input data ready for the later prediction-driven stages.

- **NOTE**: Before working through this notebook, please ensure that you have all necessary dependencies as denoted in [Section A: Imports and Initializations](#section-A) of this notebook.

- **NOTE**: Before working through Sections A-D of this notebook, please run all code cells in [Appendix A: Supplementary Custom Objects](#appendix-A) to ensure that all relevant functions and objects are appropriately instantiated and ready for use.

---

## 🔵 TABLE OF CONTENTS 🔵 <a name="TOC"></a>

Use this **table of contents** to navigate the various sections of the preprocessing notebook.

#### 1. [Section A: Imports and Initializations](#section-A)

    All necessary imports and object instantiations for data preprocessing.
    
#### 2. [Section B: Manipulating Our Datasets](#section-B)

    Data manipulation operations, including null value removal/imputation, 
    data splitting/merging, and data frequency generation.

#### 3. [Section C: Visualizing Trends Across Our Data](#section-C)

    Data visualizations to outline trends and patterns inherent across our data
    that may mandate further analysis.

#### 4. [Section D: Saving Our Interim Datasets](#section-D)

    Saving preprocessed data states for further access.

#### 5. [Appendix A: Supplementary Custom Objects](#appendix-A)

    Custom Python object architectures used throughout the data preprocessing.

#### 6. [Appendix B: Data Dictionary](#appendix-B)

    Data dictionary representation for our dataset.
    
---

## 🔹 Section A: Imports and Initializations <a name="section-A"></a>

General Imports for Data Manipulation and Visualization.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Custom Algorithmic Structures for Data Preprocessing.

In [2]:
import sys
sys.path.append("../structures/")
# from dataset_preprocessor import Dataset_Preprocessor

#### Instantiate our Preprocessor Engine

Custom Preprocessor Class for Directed Data Manipulation.

**NOTE**: Please refer to _Appendix A: Supplementary Custom Objects_ for instructions on how to view the fully implemented dataset preprocessor object.

In [28]:
preproc = Dataset_Preprocessor()

#### Read Our Raw Data Into Conditional DataFrame(s)

**Call the** `.load_data()` **method to load the external training and testing datasets.**

_NOTE_: Currently loading both datasets as separate DataFrames using the default argument `which="both"`.

In [34]:
(df_train, df_test) = preproc.load_data()
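The `which="all"` option of `load_data()` (see Appendix A) returns one DataFrame built with `pd.concat(..., keys=["train", "test"])`. The sketch below reproduces that call on two toy DataFrames to show how the rows of each source can be recovered afterwards. The column names and values are made up for illustration and are not taken from the real files.

```python
import pandas as pd

# Toy stand-ins for the real train/test files (hypothetical columns and values)
df_train = pd.DataFrame({"x": [1, 2], "y": [0, 1]})
df_test = pd.DataFrame({"x": [3], "y": [1]})

# Same concat call that load_data(which="all") makes internally
df_all = pd.concat([df_train, df_test], keys=["train", "test"], sort=True)

# The outer index level records which source each row came from
train_rows = df_all.loc["train"]
test_rows = df_all.loc["test"]

print(df_all.index.nlevels, len(train_rows), len(test_rows))
```

Keeping the source label in the index means the combined frame can be cleaned once and split back into its training and testing parts without an extra marker column.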

#### Get Unique Values Across Each Feature in Training Dataset.

**Call the** `get_uniques()` **custom function to identify unique values across all input features for dataset(s).**

_NOTE_: Currently identifying unique data values across all features in each dataset using the default arguments `features=None` and `how="value"`.

In [74]:
unique_values_train, unique_values_test = get_uniques(df_train), get_uniques(df_test)
unique_types_train, unique_types_test = get_uniques(df_train, how="dtype"), get_uniques(df_test, how="dtype")

---
## 🔸 NEXT TASK: _Merge both dictionaries by key into single data dictionary._ 🔸
---
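One way this task could be approached is sketched below, assuming that "both dictionaries" means the value dictionary and the dtype dictionary produced by `get_uniques()` for the same dataset. The helper name `merge_by_key` and the `"values"`/`"dtypes"` field names are hypothetical and not part of the existing code.

```python
def merge_by_key(unique_values, unique_types):
    """Combine two feature-keyed dicts into {feature: {"values": ..., "dtypes": ...}}."""
    # Preserve the order of the first dict, then append keys found only in the second
    features = list(unique_values) + [f for f in unique_types if f not in unique_values]
    merged = {}
    for feature in features:
        merged[feature] = {
            "values": unique_values.get(feature, []),
            "dtypes": unique_types.get(feature, []),
        }
    return merged


# Toy inputs shaped like the get_uniques() outputs (hypothetical feature names)
values = {"a": [1, 2, 3], "b": ["x", "y"]}
types = {"a": [int], "b": [str]}

data_dictionary = merge_by_key(values, types)
print(data_dictionary)
```

A feature missing from one of the inputs gets an empty list for that field, so the merged dictionary always has the same shape for every key.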

In [73]:
unique_types_test

{'-1': [int],
 '-10': [float],
 '-10.1': [int],
 '-10.2': [int],
 '-11': [float],
 '-11.1': [int],
 '-12': [int],
 '-13': [int],
 '-14': [int],
 '-14.1': [int],
 '-19': [float],
 '-19.1': [int],
 '-2': [float],
 '-2.1': [int],
 '-2.2': [int],
 '-23': [int],
 '-23.1': [int],
 '-27': [int],
 '-27.1': [int],
 '-28': [float],
 '-3': [float],
 '-3.1': [int],
 '-3.2': [int],
 '-33': [float],
 '-36': [int],
 '-4': [int],
 '-4.1': [int],
 '-4.2': [int],
 '-4.3': [int],
 '-47': [int],
 '-5': [int],
 '-5.1': [float],
 '-59': [int],
 '-6': [int],
 '-68': [int],
 '-7': [int],
 '-9': [int],
 '0': [int],
 '0.1': [int],
 '0.2': [int],
 '0.3': [int],
 '0.4': [int],
 '1': [int],
 '11': [int],
 '12': [int],
 '12.1': [int],
 '14': [int],
 '15': [int],
 '16': [int],
 '16.1': [int],
 '19': [int],
 '2': [int],
 '2.1': [int],
 '20': [int],
 '29': [int],
 '3': [int],
 '37': [int],
 '4': [int],
 '41': [float],
 '5': [int],
 '5.1': [int],
 '5.2': [int],
 '60': [int],
 '7': [int],
 '9': [int],
 'A': [str]}
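
Many keys in the output above look like data values instead of feature names (`'-10'`, `'-10.1'`, `'-10.2'`), and the `.1`/`.2` suffixes match what pandas appends when it de-duplicates repeated column labels. One possible explanation, which is an assumption and not something confirmed by the data here, is that the CSV has no header row, so `read_csv` consumed the first data row as column names. A minimal sketch with made-up inline data shows the difference `header=None` makes:

```python
import io

import pandas as pd

# Made-up CSV text with no header row (hypothetical values, not the real dataset)
raw = "-1,-10,-10,A\n2,3,4,B\n"

# Default behaviour: the first line is treated as the header
df_default = pd.read_csv(io.StringIO(raw))

# header=None keeps every line as data and assigns integer column labels
df_no_header = pd.read_csv(io.StringIO(raw), header=None)

print(list(df_default.columns), len(df_default))
print(list(df_no_header.columns), len(df_no_header))
```

If the real files do turn out to be headerless, passing `header=None` (optionally with a `names=` list) to the `pd.read_csv` calls in `load_data()` would keep the first row as data.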

##### [(back to top)](#TOC)

---

## 🔹 Section B: Manipulating Our Datasets <a name="section-B"></a>

#### 🔸 CHECKPOINT 🔸

**Interim data ready to save.**

##### [(back to top)](#TOC)

---

## 🔹 Section C: Visualizing Trends Across Our Data <a name="section-C"></a>

##### [(back to top)](#TOC)

---

## 🔹 Section D: Saving Our Interim Datasets <a name="section-D"></a>

##### [(back to top)](#TOC)

---

## 🔹 Appendix A: Supplementary Custom Objects <a name="appendix-A"></a>

#### A[1]: 6Nomads Dataset Preprocessor.

To view the **Data Preprocessor Engine**, follow these steps:

1. Navigate to the `structures` sibling directory. 
2. Access the `dataset_preprocessor.py` file. 
3. View the `Dataset_Preprocessor()` object architecture.

_NOTE_: **The Preprocessor Engine is defined in this notebook for now, pending further separation of concerns.**

In [49]:
class Dataset_Preprocessor(object):
    """ Class object instance for preprocessing and cleaning 6Nomads data for predictive analytics. """
    def __init__(self):
        """ Initializer method for object instance creation. """
        self.REL_PATH_TO_EXT_DATA_TRAIN = "../data/external/train.csv"
        self.REL_PATH_TO_EXT_DATA_TEST = "../data/external/test.csv"
        
    def load_data(self, which="both"):
        """ 
        Instance method to load dataset(s) into conditionally separated/joined Pandas DataFrame(s). 
        
        INPUTS:
            {which}:
                - str(both): Reads in training and testing data files as tuple of individual DataFrames. (DEFAULT)
                - str(all): Reads in training and testing data files as single conjoined DataFrame.
                - str(train): Reads in training data file as single DataFrame.
                - str(test): Reads in testing data file as single DataFrame.
                
        OUTPUTS:
            pandas.DataFrame or tuple: Single DataFrame (`all`, `train`, `test`) or a `(train, test)` tuple of DataFrames (`both`).
        """
        # Validate conditional data loading arguments
        if which not in ["all", "both", "train", "test"]:
            raise ValueError("ERROR: Inappropriate value passed to argument `which`.\n\nExpected value in range:\n - all\n - both\n - train\n - test\n\nActual:\n - {}".format(which))
        
        # Independently load training data
        if which == "train":
            return pd.read_csv(self.REL_PATH_TO_EXT_DATA_TRAIN)
        
        # Independently load testing data
        if which == "test":
            return pd.read_csv(self.REL_PATH_TO_EXT_DATA_TEST)
        else:
            df_train = pd.read_csv(self.REL_PATH_TO_EXT_DATA_TRAIN)
            df_test = pd.read_csv(self.REL_PATH_TO_EXT_DATA_TEST)
            
            # Load merged training and testing data
            if which == "all":
                return pd.concat([df_train, df_test], keys=["train", "test"], sort=True)
            
            # Load separated training and testing data (DEFAULT)
            if which == "both":
                return df_train, df_test

#### A[2]: Function to Obtain Relevant Unique Values or Data Types from Feature(s) Across Dataset.

In [65]:
def get_uniques(dataset, features=None, how="value"):
    """
    Custom function that analyzes a dataset's given feature(s) and returns all unique values or data types
    across each inputted feature.
    
    INPUTS:
        {dataset}:
            - pandas.DataFrame: Dataset whose feature(s) are analyzed.
        {features}:
            - NoneType(None): Sets function to use all features across dataset. (DEFAULT)
            - str: Single referenced feature in dataset.
            - list: List of referenced features in dataset.
        {how}:
            - str(value): Identifies unique data values. (DEFAULT)
            - str(dtype): Identifies unique data types.
    
    OUTPUTS:
        dict: Dictionary structure mapping each input feature to relevantly identified unique values/types.
    """
    # Validate selected features argument
    if features is not None and not isinstance(features, (str, list)):
        raise TypeError("ERROR: Inappropriate data type passed to argument `features`.\n\nExpected type in range:\n - NoneType\n - str()\n - list()\n\nActual:\n - {}".format(str(type(features))))
    
    # Validate unique identifier argument
    if how not in ["value", "dtype"]:
        raise ValueError("ERROR: Inappropriate value passed to argument `how`.\n\nExpected value in range:\n - value\n - dtype\n\nActual:\n - {}".format(how))
        
    # Reformat `features` object into list
    if features is None:
        features = dataset.columns.tolist()
    if isinstance(features, str):
        features = [features]
    
    # Create uniques object and iteratively map each feature to associated unique data
    uniques = dict()
    for feature in features:
        # Create dictionary object associating feature(s) and unique values
        if how == "value":
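            values = dataset[feature].unique().tolist()
            try:
                uniques[feature] = sorted(values)
            except TypeError:
                # Mixed types (e.g. str and NaN) are not orderable; fall back to ordering by string form
                uniques[feature] = sorted(values, key=str)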
        # Create dictionary object associating feature(s) and unique data types
        if how == "dtype":
            uniques[feature] = list(set(map(type, dataset[feature])))
    return uniques
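
To illustrate what `how="dtype"` reports, the sketch below applies the same `set(map(type, ...))` expression to a toy DataFrame. The column names and values are hypothetical. It shows that the result describes the Python type of each element, which is why a numeric column containing a missing value is reported as `float` and a text column with a gap can report more than one type.

```python
import numpy as np
import pandas as pd

# Toy columns (hypothetical data) covering three common cases
df = pd.DataFrame({
    "whole": [1, 2, 3],            # integers with no missing values
    "with_gap": [1, 2, np.nan],    # NaN forces the whole column to float
    "mixed": ["a", np.nan, "b"],   # text column with one missing value
})

# Same expression get_uniques(..., how="dtype") applies to each feature
element_types = {col: set(map(type, df[col])) for col in df.columns}

for col in df.columns:
    print(col, df[col].dtype, element_types[col])
```

A feature that maps to more than one type is a useful flag for columns that may need null handling or type coercion in Section B.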

##### [(back to top)](#TOC)

---

## 🔹 Appendix B: Data Dictionary <a name="appendix-B"></a>

##### [(back to top)](#TOC)

---