# Exploratory Data Analysis (EDA)
---

1.   **[Introduction to EDA](#1.-Introduction--to-EDA)**
2.   **[Foundations of Gradient Boosting](#2.-Foundations-of-Gradient-Boosting)**
3.   **[Gradient Boost Tuning](#3.-Gradient-Boost-Tuning)**
4.   **[Model Validation](#4.-Model-Validation)**
5.   **[Exploratory Data Analysis](#5.-Exploratory-Data-Analysis)**
6.   **[Model Construction](#6.-Model-Construction)**
7.   **[Model Evaluation](#7.-Model-Evaluation)**

---
<a name="1.-Introduction--to-EDA"></a>
### 1. Introduction to EDA

#### 1.1 Foundations

**Exploratory Data Analysis (EDA) |** The process of investigating, organizing, and analyzing datasets and summarizing their main characteristics, often employing data wrangling and visualization methods.

**6 Practices of EDA:**
- **Discovering |** the process of familiarizing yourself with the data in order to conceptualize how it can be used
- **Structuring |** the process of taking raw data and organizing or transforming it to be more easily visualized, explained, or modeled
- **Cleaning |** the process of removing errors that may distort your data or make it less useful
- **Joining |** the process of augmenting or adjusting data by adding values from other datasets
- **Validating |** the process of verifying that the data is consistent and high quality
- **Presenting |** the process of making the cleaned dataset or data visualizations available to others for analysis or further modeling
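
The practices above can be sketched on a small example. The two DataFrames below (`orders`, `customers`) and their column names are hypothetical and exist only for illustration. Real projects usually move back and forth between these practices rather than following a fixed order.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
orders = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "amount": [20.0, None, 35.5, 35.5],
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "region": ["north", "south", "north"],
})

# Discovering: get a first look at the size and summary statistics.
print(orders.shape)
print(orders.describe())

# Cleaning: drop rows where the amount is missing.
orders_clean = orders.dropna(subset=["amount"])

# Joining: add customer attributes from a second dataset.
joined = orders_clean.merge(customers, on="customer_id", how="left")

# Structuring: aggregate to one row per region.
by_region = joined.groupby("region", as_index=False)["amount"].sum()

# Validating: confirm that no missing values remain after cleaning and joining.
assert joined.isna().sum().sum() == 0

# Presenting: share the summarized result.
print(by_region)
```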


---
<a name="5.-Exploratory-Data-Analysis"></a>
### 5. Exploratory Data Analysis

#### 5.1 Imports

In [None]:
# Import relevant libraries and modules.
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dataset into a DataFrame and save it in a variable.
data = pd.read_csv("example_file.csv")

#### 5.2 Data Exploration
After loading the dataset, the next step is to prepare the data for modeling. This includes:

*   Exploring data
*   Checking for missing values
*   Encoding categorical data 
*   Dropping irrelevant columns
*   Renaming columns
*   Creating training and testing data
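
Several of these steps (checking for missing values, dropping irrelevant columns, renaming columns) might look like the following minimal sketch. The DataFrame `df` and its column names are hypothetical stand-ins for the loaded dataset, and dropping rows with missing values is only one of several possible strategies.

```python
import pandas as pd

# Hypothetical stand-in for the loaded dataset.
df = pd.DataFrame({
    "ID": [101, 102, 103, 104],
    "Cust Age": [34, None, 51, 29],
    "Plan": ["basic", "premium", "basic", None],
    "churned": [0, 1, 0, 1],
})

# Checking for missing values: count them per column.
missing_counts = df.isna().sum()
print(missing_counts)

# Dropping irrelevant columns: a row identifier carries no predictive signal.
df = df.drop(columns=["ID"])

# Renaming columns: use consistent snake_case names.
df = df.rename(columns={"Cust Age": "customer_age", "Plan": "plan"})

# Handling missing values: here rows with any missing value are dropped.
df = df.dropna().reset_index(drop=True)
print(df)
```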

In [None]:
# Display the first 10 rows of the data
data.head(10)

In [None]:
# Display number of rows, number of columns
data.shape

In [None]:
# Display the data type for each column. Note: most scikit-learn models expect numeric data.
data.dtypes

**Question to answer:** Identify the target (or predicted) variable. What is the initial hypothesis about which variables will be valuable in predicting the target variable?

#### 5.3 Prepare Model for Predictions

**Question to answer:** Before proceeding with modeling, consider which metrics should ultimately be leveraged to evaluate the model. Which metrics are most suited to evaluating this type of model?
- Accuracy alone is not enough; it is also important to evaluate the balance of false positives and false negatives that the model predicts. Precision, recall, and F1 score are therefore well suited to classification models.
- The ROC AUC score is also suited to classification modeling, because it measures how well the predicted scores rank positive cases above negative ones across all thresholds.
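
As a minimal sketch, these metrics can be computed with `sklearn.metrics`. The labels, predictions, and scores below are made up for illustration; in practice they would come from a fitted model evaluated on held-out data.

```python
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score

# Hypothetical true labels, hard predictions, and predicted probabilities.
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]
y_score = [0.1, 0.6, 0.8, 0.9, 0.4, 0.2, 0.7, 0.3]

# Precision: share of predicted positives that are truly positive.
precision = precision_score(y_true, y_pred)

# Recall: share of true positives that the model found.
recall = recall_score(y_true, y_pred)

# F1: harmonic mean of precision and recall.
f1 = f1_score(y_true, y_pred)

# ROC AUC uses the scores (not the hard predictions) to assess ranking quality.
auc = roc_auc_score(y_true, y_score)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} roc_auc={auc:.4f}")
```

Note that precision, recall, and F1 depend on the classification threshold, while ROC AUC is computed from the scores directly.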

##### 5.3.1 Convert Variables to Numeric

In [None]:
# Convert the object predictor variables to numerical dummies.
data_dummies = pd.get_dummies(data, 
                                columns=['categorical_column1','categorical_column2','categorical_column3','categorical_column4'])
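
To see what `pd.get_dummies` produces, here is a small sketch on a hypothetical two-column DataFrame (`toy`). Each category becomes its own indicator column, and columns not listed in `columns` are left unchanged. The `dtype=int` argument is an optional choice made here so the indicators are 0/1 integers.

```python
import pandas as pd

# Hypothetical toy data with one categorical and one numeric column.
toy = pd.DataFrame({"color": ["red", "blue", "red"], "size": [1, 2, 3]})

# One indicator column is created per category of "color".
toy_dummies = pd.get_dummies(toy, columns=["color"], dtype=int)
print(toy_dummies)
```

Passing `drop_first=True` would drop one indicator per encoded column, which can help avoid redundant features in linear models.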

##### 5.3.2 Isolate Target and Predictor Variables

In [None]:
# Separate the dataset into labels (y) and features (X).
# Use the dummy-encoded DataFrame created in the previous step.
y = data_dummies["target_variable"]

X = data_dummies.drop("target_variable", axis=1)

##### 5.3.3 Create Training and Test Data

In [None]:
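# Separate into train, validate, and test sets, stratifying on the target to preserve class balance.
# First split: 75% train+validate, 25% test.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, stratify=y, random_state=0)
# Second split: 25% of the remainder for validation (about 56% train, 19% validate, 25% test overall).
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.25, stratify=y_train, random_state=0)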
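
It can be worth confirming that the two-stage split produced the expected sizes and preserved the class balance. The sketch below repeats the same split on hypothetical synthetic data (1,600 rows, 25% positive) so that it runs on its own; the feature and target names are placeholders.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced dataset: 1,600 rows, 25% positive class.
X = pd.DataFrame({"feature": range(1600)})
y = pd.Series([1] * 400 + [0] * 1200, name="target_variable")

# Same two-stage stratified split as above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.25, stratify=y_train, random_state=0
)

# Report the size and positive-class rate of each subset.
for name, part in [("train", y_tr), ("validate", y_val), ("test", y_test)]:
    print(f"{name}: n={len(part)}, positive rate={part.mean():.3f}")
```

With stratification, each subset should show roughly the same positive rate as the full dataset.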