# PART 1: Exploratory Data Analysis

In this Jupyter Notebook, we analyze the given external datasets through a **preprocessing** lens: we manipulate, curate, and prepare data to better understand what we're dealing with and to prepare our input data for more advanced prediction-driven modifications.

- **NOTE**: Before working through this notebook, please ensure that all necessary dependencies listed in [Section A: Imports and Initializations](#section-A) are installed.

- **NOTE**: Before working through Sections A-D of this notebook, please run all code cells in [Appendix A: Supplementary Custom Objects](#appendix-A) to ensure that all relevant functions and objects are appropriately instantiated and ready for use.

---

## 🔵 TABLE OF CONTENTS 🔵 <a name="TOC"></a>

Use this **table of contents** to navigate the various sections of the preprocessing notebook.

#### 1. [Section A: Imports and Initializations](#section-A)

    All necessary imports and object instantiations for data preprocessing.
    
#### 2. [Section B: Manipulating Our Datasets](#section-B)

    Data manipulation operations, including null value removal/imputation, 
    data splitting/merging, and data frequency generation.

#### 3. [Section C: Visualizing Trends Across Our Data](#section-C)

    Data visualizations to outline trends and patterns inherent across our data
    that may mandate further analysis.

#### 4. [Section D: Saving Our Interim Datasets](#section-D)

    Saving preprocessed data states for further access.

#### 5. [Appendix A: Supplementary Custom Objects](#appendix-A)

    Custom Python object architectures used throughout the data preprocessing.

#### 6. [Appendix B: Data Dictionary](#appendix-B)

    Data dictionary representation for our dataset.
    
---

## 🔹 Section A: Imports and Initializations <a name="section-A"></a>

General Imports for Data Manipulation and Visualization.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Custom Algorithmic Structures for Data Preprocessing.

In [2]:
import sys
sys.path.append("../structures/")
from custom_structures import corrplot_
from dataset_preprocessor import Dataset_Preprocessor

#### Instantiate our Preprocessor Engine

Custom Preprocessor Class for Directed Data Manipulation.

**NOTE**: Please refer to _Appendix A: Supplementary Custom Objects_ for instructions on how to view the fully implemented dataset preprocessor object.

In [3]:
preproc = Dataset_Preprocessor()

##### [(back to top)](#TOC)

---

## 🔹 Section B: Manipulating Our Datasets <a name="section-B"></a>

#### Read Our Raw Data Into Conditional DataFrame(s)

**Call** `.load_data()` **method to load in all conditionally separated external datasets.**

_NOTE_: Currently loading in both datasets independently using defaulted condition `which="both"`.

In [4]:
(df_train, df_test) = preproc.load_data()

#### Get Unique Values Across Each Feature in Training Dataset.

**Call the** `get_uniques()` **custom function to identify unique values across all input features for dataset(s).**

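_NOTE_: Currently identifying both unique data types and unique data values across all features in dataset using defaulted conditions `features=None` and `how="both"`. The `"dtypes"` entries produced by this mode are what `identify_typed_features()` reads in the next step.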

In [9]:
uniques_train, uniques_test = get_uniques(df_train), get_uniques(df_test)

#### Check Which Features Across Training/Testing Data Contain `NaN` (Null) Values.

_NOTE_: Null values are denoted as `np.nan` (float) datatypes and will appear as `<type 'float'>` data notations.
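
To illustrate why a `float` type can signal missing data, here is a minimal sketch on hypothetical toy data (the column names `label` and `count` are invented for illustration and do not come from our datasets). It mirrors the type-collection idea used by `get_uniques()`, but it is not the notebook's implementation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "label" has one missing entry, "count" has none
toy = pd.DataFrame({
    "label": pd.Series(["A", "B", np.nan], dtype=object),
    "count": [1, 2, 3],
})

# Collect the set of Python types observed in each column
types_per_feature = {col: set(map(type, toy[col])) for col in toy.columns}

# Columns where `float` appears alongside another type are null candidates
null_candidates = [col for col, types in types_per_feature.items()
                   if float in types and len(types) > 1]
```

Note that a genuinely numeric float column would also report `float` without containing any nulls, which is why the detection is only a first pass and is followed by an explicit null check.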

In [10]:
floating_features_train = identify_typed_features(uniques_train)

> _RESULT_: No null values detected across any features in training dataset.

In [11]:
floating_features_test = identify_typed_features(uniques_test)

IDENTIFIED FEATURE OF TYPE '<type 'float'>': -28
IDENTIFIED FEATURE OF TYPE '<type 'float'>': 41
IDENTIFIED FEATURE OF TYPE '<type 'float'>': -11
IDENTIFIED FEATURE OF TYPE '<type 'float'>': -10
IDENTIFIED FEATURE OF TYPE '<type 'float'>': -19
IDENTIFIED FEATURE OF TYPE '<type 'float'>': -5.1
IDENTIFIED FEATURE OF TYPE '<type 'float'>': -33
IDENTIFIED FEATURE OF TYPE '<type 'float'>': -3
IDENTIFIED FEATURE OF TYPE '<type 'float'>': -2


> _RESULT_: Null values potentially detected across nine (9) features in testing dataset.

#### Confirm Null Value Presence Across Identified "Floating" Features. 

_NOTE_: Judging by the execution counts, the outputs in this subsection appear to have been captured after the null-dropping cell (`In [14]`) had already run, which would explain why no null values remain in them.

In [15]:
get_null_metrics(df_test, 
                 subset=floating_features_test, 
                 metric="binary")

-28     False
41      False
-11     False
-10     False
-19     False
-5.1    False
-33     False
-3      False
-2      False
dtype: bool

#### Get Proportion of Null Values Across Each Identified "Floating" Feature.

In [16]:
get_null_metrics(df_test, 
                 subset=floating_features_test, 
                 metric="percent")

-28     0.0
41      0.0
-11     0.0
-10     0.0
-19     0.0
-5.1    0.0
-33     0.0
-3      0.0
-2      0.0
dtype: float64

#### Impute Null Values Across Floating Features

**NOTE**: Since null values are highly sparse across our data (the highest per-feature null frequency is ~0.1%) and our dataset is not small, we can safely drop null values rather than performing more advanced imputation (e.g. _null value flagging_, _mean/mode replacement_).

In [14]:
preproc.null_imputer(df_test, 
                     subset=floating_features_test, 
                     method="drop", 
                     na_filter="any")

Null imputation has successfully completed.
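
The drop strategy above can be sketched in plain pandas. This assumes that `null_imputer(..., method="drop", na_filter="any")` behaves roughly like `DataFrame.dropna(subset=..., how="any")`; the preprocessor's actual implementation is not shown in this notebook, and the toy columns below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with nulls in two "floating" features
toy = pd.DataFrame({
    "x": [1.0, np.nan, 3.0, 4.0],
    "y": [np.nan, 2.0, 3.0, 4.0],
    "z": [10, 20, 30, 40],
})
floating_features = ["x", "y"]

# Drop every row that has a null in ANY of the listed features
cleaned = toy.dropna(subset=floating_features, how="any")
rows_dropped = len(toy) - len(cleaned)
```

Unlike the call above, `dropna` returns a new DataFrame by default, so the result must be assigned to be kept.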


#### Reencode Alphabetical Features for Numerical Encoding Consistency

In [17]:
LOOKUP_TABLE_ALPHANUMERIC = {
    1: "A", 
    2: "B", 
    3: "C", 
    4: "D"
}

_NOTE_: Feature encoding occurs in place. If `drop_og` is `True`, the original target feature is dropped after encoding, so rerunning the method call on the same DataFrame will raise an error because the target no longer exists.

In [18]:
preproc.feature_encoder(df_train,
                        target="C",
                        lookup_table=LOOKUP_TABLE_ALPHANUMERIC,
                        drop_og=True)

preproc.feature_encoder(df_test,
                        target="A",
                        lookup_table=LOOKUP_TABLE_ALPHANUMERIC,
                        drop_og=True)
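
As a rough illustration of lookup-table encoding, the sketch below inverts the table so that letters map to numbers and then applies it with `Series.map`. Whether `feature_encoder` inverts the table this way is an assumption, and the column names `grade` and `grade_encoded` are hypothetical.

```python
import pandas as pd

LOOKUP_TABLE_ALPHANUMERIC = {1: "A", 2: "B", 3: "C", 4: "D"}

# Invert the table so letters map to their numeric codes (assumed direction)
letter_to_number = {letter: number
                    for number, letter in LOOKUP_TABLE_ALPHANUMERIC.items()}

# Hypothetical toy feature holding alphabetical values
toy = pd.DataFrame({"grade": ["A", "C", "B", "D"]})

# Encode, then drop the original column (analogous to `drop_og=True`)
toy["grade_encoded"] = toy["grade"].map(letter_to_number)
toy = toy.drop(columns=["grade"])
```

After the drop, any further access to `toy["grade"]` raises a `KeyError`, which is the rerun failure the note above warns about.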

#### 🔸 CHECKPOINT 🔸

**Interim data ready to save.**

##### [(back to top)](#TOC)

---

## 🔹 Section C: Visualizing Trends Across Our Data <a name="section-C"></a>

#### ⭐️ _TODO_: Include visualizations towards end of pipeline architectural creation. ⭐️

##### [(back to top)](#TOC)

---

## 🔹 Section D: Saving Our Interim Datasets <a name="section-D"></a>

Interim datasets are data states directly after preprocessing, where data is designated for curation and manipulation prior to target vs. non-target handling.

#### Save Current (Preprocessed) Data States to Interim Datasets

**Call** `.save_dataset()` **method to save data state to interim folder for processing accessibility.**

In [19]:
REL_PATH_TO_ITM_DATA = "../data/interim/"
FILENAME_TRAINING, FILENAME_TESTING = "train_i", "test_i"

In [20]:
preproc.save_dataset(df_train, REL_PATH_TO_ITM_DATA + FILENAME_TRAINING)
preproc.save_dataset(df_test, REL_PATH_TO_ITM_DATA + FILENAME_TESTING)
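
For readers without access to the preprocessor, saving an interim dataset can be sketched as follows. The CSV format, the `.csv` extension, and the helper name `save_dataset_sketch` are all assumptions made for illustration; the on-disk format used by `.save_dataset()` is not specified in this notebook.

```python
import os
import tempfile

import pandas as pd


def save_dataset_sketch(dataset, path_without_ext):
    """Hypothetical helper: write a DataFrame to `<path>.csv`, creating folders as needed."""
    os.makedirs(os.path.dirname(path_without_ext), exist_ok=True)
    full_path = path_without_ext + ".csv"
    dataset.to_csv(full_path, index=False)
    return full_path


# Hypothetical toy data, saved to a temporary folder and read back
toy = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
with tempfile.TemporaryDirectory() as tmp_dir:
    saved_path = save_dataset_sketch(toy, os.path.join(tmp_dir, "interim", "train_i"))
    reloaded = pd.read_csv(saved_path)
```

Writing with `index=False` keeps the row index out of the file, so the reloaded frame has the same columns as the original.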

##### [(back to top)](#TOC)

---

## 🔹 Appendix A: Supplementary Custom Objects <a name="appendix-A"></a>

#### A[1]: 6Nomads Dataset Preprocessor.

To view the **Data Preprocessor Engine**, follow these steps:

1. Navigate to the `structures` sibling directory. 
2. Access the `dataset_preprocessor.py` file. 
3. View the `Dataset_Preprocessor()` object architecture.

_NOTE_: **Creating Preprocessor Engine in Notebook Until Further Separation of Concerns.**

#### A[2]: Function to Obtain Relevant Unique Values or Data Types from Feature(s) Across Dataset.

In [6]:
def get_uniques(dataset, features=None, how="both"):
    """
    Custom function that analyzes a dataset's given feature(s) and returns all unique values or data types
    across each inputted feature.
    
    INPUTS:
        {features}:
            - NoneType(None): Sets function to use all features across dataset. (DEFAULT)
            - str: Single referenced feature in dataset.
            - list: List of referenced features in dataset.
        {how}:
            - str(both): Identifies both unique data types and values. (DEFAULT)
            - str(dtype): Identifies unique data types.
            - str(value): Identifies unique data values.
    
    OUTPUTS:
        dict(uniques): Dictionary structure mapping each input feature to relevantly identified unique values/types.
    """
    # Validate selected features argument
    if features is not None and not isinstance(features, (str, list)):
        raise TypeError("ERROR: Inappropriate data type passed to argument `features`.\n\nExpected type in range:\n - NoneType\n - str()\n - list()\n\nActual:\n - {}".format(str(type(features))))
    
    # Validate unique identifier argument
    if how not in ["both", "dtype", "value"]:
        raise ValueError("ERROR: Inappropriate value passed to argument `how`.\n\nExpected value in range:\n - both\n - dtype\n - value\n\nActual:\n - {}".format(how))
        
    # Reformat `features` object into list
    if features is None:
        features = dataset.columns.tolist()
    elif isinstance(features, str):
        features = [features]
        
    # Create uniques object and iteratively map each feature to associated unique data
    uniques = dict()
    # Create dictionary object associating feature(s) and unique data types and values
    if how == "both":
        unique_types, unique_values = dict(), dict()
        for feature in features:
            unique_types[feature] = list(set(map(type, dataset[feature])))
            unique_values[feature] = sorted(dataset[feature].unique().tolist())
        unique_components = [unique_types, unique_values]
        for feature in unique_types.keys():
            uniques[feature] = {"dtypes": unique_components[0][feature], "values": unique_components[1][feature]}
    else:
        for feature in features:
            # Create dictionary object associating feature(s) and unique data types
            if how == "dtype":
                uniques[feature] = list(set(map(type, dataset[feature])))
            # Create dictionary object associating feature(s) and unique values
            if how == "value":
                uniques[feature] = sorted(dataset[feature].unique().tolist())
    return uniques

#### A[3]: Function to Identify and Return Features Containing Unique Input Data Types.

In [7]:
def identify_typed_features(uniques, dtype=float):
    """
    Custom function that extracts features from previously generated unique feature data
    based on whether or not feature includes user-specified data type.
    
    INPUTS: 
        {uniques}:
            - dict: Dictionary object of feature associations generated by `get_uniques()`.
        {dtype}:
            - type(float): Float data type. (DEFAULT)
            - type(int): Integer data type.
            - type(str): String data type.
    
    OUTPUTS:
        list(typed_features): List of feature names corresponding to identified user-specified data types.
    """
    typed_features = list()
    for key in uniques.keys():
        # Membership test: the type list is built from a set, so its order is not guaranteed
        if dtype in uniques[key]["dtypes"]:
            print("IDENTIFIED FEATURE OF TYPE '{}': {}".format(str(dtype), key))
            typed_features.append(key)
    return typed_features

#### A[4]: Function to Calculate Null/Missing Metrics of Given Feature Data.

In [8]:
def get_null_metrics(dataset, subset=None, metric="percent"):
    """
    Custom function that produces series of associated features and metrics related to
    presence and proportion of null/missing values.
    
    INPUTS:
        {dataset}:
            - pd.DataFrame: Single input dataset.
        {subset}:
            - NoneType: If None, all features are used for null metric evaluation. (DEFAULT)
            - list: Array of features across data to consider; others are ignored.
        {metric}:
            - str(percent): Determines calculation of relative proportions of null values per feature. (DEFAULT)
            - str(count): Determines calculation of absolute count of null values per feature.
            - str(binary): Determines identification of whether or not any null values occur per feature.
    
    OUTPUTS:
        pd.Series: Series of associated feature names and relative null value prevalence metrics.
    """
    # Validate `subset` keyword argument
    if subset is None:
        subset = dataset.columns.tolist()
        
    # Calculate percentages for null values across each input feature
    if metric == "percent":
        return dataset[subset].isna().sum() / len(dataset)
    # Calculate total counts of null values across each input feature
    elif metric == "count":
        return dataset[subset].isna().sum()
    # Determine True/False based on null value presence across each input feature
    elif metric == "binary":
        return dataset[subset].isna().any()
    # Reject unsupported metrics instead of silently returning None
    else:
        raise ValueError("ERROR: Inappropriate value passed to argument `metric`.\n\nExpected value in range:\n - percent\n - count\n - binary\n\nActual:\n - {}".format(metric))
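
The three metrics can also be reproduced directly with pandas, which may help when checking the function's output. The sketch below uses hypothetical toy columns `x` and `y`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one null in "x", none in "y"
toy = pd.DataFrame({
    "x": [1.0, np.nan, 3.0, 4.0],
    "y": [1.0, 2.0, 3.0, 4.0],
})

null_counts = toy.isna().sum()             # comparable to metric="count"
null_share = toy.isna().sum() / len(toy)   # comparable to metric="percent"
null_flags = toy.isna().any()              # comparable to metric="binary"
```

Note that the `"percent"` metric is computed as a fraction between 0 and 1 rather than a value scaled to 100.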

##### [(back to top)](#TOC)

---

## 🔹 Appendix B: Data Dictionary <a name="appendix-B"></a>

##### [(back to top)](#TOC)

---