# ML Exploration Notebook

This notebook can be used to explore the data of an underlying problem and to check whether the data is suited for predictive analysis. Several regression models are compared on the predictive performance metrics R², mean squared error and mean absolute error, so that the user gets a head start in solving the problem or in managing expectations.

The notebook is structured in the following way:

1. Set project path
2. Read in (raw) Data Set
3. Basic Data Information
4. Set Dependent & Independent Variables
5. Set Parameter Values
6. Generate Predictions
7. Compare Models

### 1. Set project path

In order to use the files stored on GitHub from Colab, we first clone the GitHub repository into the current Colab folder.

In [None]:
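# Clone the entire repo over HTTPS (GitHub no longer serves the unauthenticated git:// protocol).
!git clone https://github.com/PippleNL/Pipple-Lecture-8-ML-prediction.git cloned-repo
%cd cloned-repo
# !ls  # uncomment to list the repository contents in Colab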

We use the os package to set the correct project_path.

In [None]:
import os
project_path = os.getcwd()

### 2. Read in (raw) Data Set

The (raw) data set of the underlying problem is read from a comma-separated values file (.csv).

In [None]:
import pandas as pd
data_path = os.path.join(project_path, 'data','WA_Fn-UseC_-HR-Employee-Attrition.csv')
data = pd.read_csv(data_path)

### 3. Basic Data Information

Below you can find some basic information about the data set: a summary of the dataframe, including the dtype and number of non-null values per column and the shape of the dataframe, followed by the first few rows.

In [None]:
data.info()

In [None]:
data.head()

If you'd like to change a column's dtype to numeric, add the column name to the list of strings below (e.g. columns2num = ['Age']).

In [None]:
from functions.data_preparation import column2num
columns2num = []

if len(columns2num) > 0:
    data = column2num(data, columns2num)
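
The helper column2num lives in the repository and its implementation is not shown in this notebook. As a rough idea of what such a conversion could look like, the sketch below uses pd.to_numeric on a toy dataframe. The function name to_numeric_columns, the toy values and the choice of errors='coerce' are assumptions made for illustration, not the repository's code.

```python
import pandas as pd

def to_numeric_columns(df, columns):
    """Return a copy of df with the given columns converted to a numeric dtype.

    Hypothetical stand-in for column2num: values that cannot be parsed become NaN.
    """
    out = df.copy()
    for col in columns:
        out[col] = pd.to_numeric(out[col], errors='coerce')
    return out

# Toy data: 'Age' was read as text and contains one unparsable entry.
toy = pd.DataFrame({'Age': ['41', '49', 'unknown'], 'Gender': ['F', 'M', 'M']})
converted = to_numeric_columns(toy, ['Age'])
print(converted.dtypes)
```

With errors='coerce', unparsable entries show up as NaN, so they are counted by the NaN check in the next cell instead of raising an error.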

We can check the number of NaN values for each column.

In [None]:
data.isnull().sum()

Next, we list the number of unique values per column. A column with only one unique value cannot have any impact on a prediction.

In [None]:
print('Unique Values for Each Feature: \n')
for i in data.columns:
    print(i, ':', data[i].nunique())
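
If constant columns turn up, one option is to drop them before modeling. The following is a minimal sketch on a toy dataframe with made-up column names and values; it is not part of the repository's functions.

```python
import pandas as pd

toy = pd.DataFrame({
    'Age': [41, 49, 37],
    'Country': ['NL', 'NL', 'NL'],   # constant column
    'WeeklyHours': [40, 40, 40],     # constant column
})

# Columns with at most one distinct value carry no information for a prediction.
n_unique = toy.nunique()
constant_cols = n_unique[n_unique <= 1].index.tolist()

# Keep only the columns that actually vary.
reduced = toy.drop(columns=constant_cols)
print(constant_cols)
```

Note that nunique ignores NaN by default, so a column that is entirely NaN counts as having zero unique values and is dropped as well.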

We can look at the correlation matrix to get an idea of the linear relations between the numeric variables.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
fig, ax = plt.subplots(figsize=(9, 9))
# Pearson correlation is only defined for numeric columns, so select those first.
corr_mat = data.select_dtypes(include='number').corr(method='pearson').round(2)
sns.heatmap(corr_mat, vmin=-1, vmax=1, center=0, annot=False, cmap=sns.diverging_palette(20, 220, n=200), square=True, ax=ax)

### 4. Set Dependent & Independent Variables

Specify below, as a string, which variable (i.e. column) is used as the dependent variable. This variable is set as y (i.e. the label) and is ultimately the one being modeled. Also specify, as a list of strings, the other (independent) variables that are used to explain the dependent variable. If the list is empty, all other variables are used.

In [None]:
dependent = 'MonthlyIncome'  # fill in your dependent variable here...
independent = ['Age', 'BusinessTravel', 'Department', 'Education', 'Gender', 'JobSatisfaction', 'TotalWorkingYears'] # fill the list of independent variables here...
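
The rule "an empty list means all other variables" can be sketched as follows. The helper split_xy and the toy values are hypothetical; how main_regression handles this internally is not shown in this notebook.

```python
import pandas as pd

def split_xy(df, dependent, independent=None):
    """Split df into features X and label y.

    An empty or missing `independent` list means: use every column except the label.
    """
    if not independent:
        independent = [c for c in df.columns if c != dependent]
    return df[independent], df[dependent]

toy = pd.DataFrame({'Age': [41, 49], 'Education': [2, 1], 'MonthlyIncome': [5993, 5130]})

# No independent variables given: all columns except the label are used.
X_all, y = split_xy(toy, 'MonthlyIncome')

# Explicit list: only the listed columns are used.
X_age, _ = split_xy(toy, 'MonthlyIncome', ['Age'])
print(list(X_all.columns), list(X_age.columns))
```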

### 5. Set Parameter Values

Specify below the parameter values used while comparing models. If a parameter is left commented out, its default value is used. If it is uncommented, also pass it to the function main_regression.

In [None]:
impute_strategy = 0.  # either a float or 'drop' (default), 'mean', 'median', 'most_frequent'
labelenc_x = ['BusinessTravel']  # fill the list of independent variables for label encoding here...
onehotenc_x = ['Gender', 'Department']  # fill the list of independent variables for one hot encoding
feature_scaling = 'auto'  # None, 'standardisation', 'minmax' or 'auto'
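
To make these options concrete, the sketch below applies each preprocessing step to a toy dataframe using plain pandas. It only illustrates the general techniques (constant imputation, label encoding, one-hot encoding, standardisation). The toy values are made up, and the way main_regression implements these steps, including what 'auto' scaling selects, is an open assumption.

```python
import pandas as pd

toy = pd.DataFrame({
    'Age': [41.0, None, 37.0],
    'BusinessTravel': ['Travel_Rarely', 'Travel_Frequently', 'Travel_Rarely'],
    'Gender': ['Female', 'Male', 'Male'],
})

# Constant imputation: a float impute_strategy is read here as the fill value for NaN.
toy['Age'] = toy['Age'].fillna(0.)

# Label encoding: map each category to an integer code (categories sorted alphabetically).
toy['BusinessTravel'] = toy['BusinessTravel'].astype('category').cat.codes

# One-hot encoding: one 0/1 indicator column per category.
toy = pd.get_dummies(toy, columns=['Gender'], dtype=int)

# Standardisation: zero mean and unit variance (population standard deviation).
toy['Age'] = (toy['Age'] - toy['Age'].mean()) / toy['Age'].std(ddof=0)
print(toy)
```

Label encoding imposes an order on the categories, which suits ordinal variables. One-hot encoding avoids that implied order at the cost of extra columns.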

### 6. Generate Predictions

Predictions are generated for several models using the function 'main_regression' from the Python script 'compare_models'. Unless specified otherwise, the default value is used for every parameter. More information can be retrieved by running main_regression().

In [None]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

from functions.compare_models import main_regression
predictions = main_regression(
    data, dependent, independent,
    impute_strategy=impute_strategy,
    labelenc_x=labelenc_x,
    onehotenc_x=onehotenc_x,
    feature_scaling_method=feature_scaling,
)

### 7. Compare Models

Models are compared on predictive performance metrics that are calculated and sorted by the custom function 'sort_compute_metrics_regr' in the Python script 'compare_models'. The residuals, i.e. the differences between the predicted values and the actual values, are drawn using the custom function 'draw_residual_plot'. More information on both functions can be retrieved using sort_compute_metrics_regr() and draw_residual_plot().

In [None]:
from functions.compare_models import sort_compute_metrics_regr
header, scores = sort_compute_metrics_regr(predictions)
pd.DataFrame(scores, columns=header)

In [None]:
from functions.compare_models import draw_residual_plot
draw_residual_plot(predictions)
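
For reference, the three metrics and the residuals can be computed by hand as in the sketch below. The function regression_metrics and the toy numbers are illustrative assumptions and are not taken from 'compare_models'.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return R2, mean squared error and mean absolute error for one model."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Residuals as defined above: predicted value minus actual value.
    residuals = y_pred - y_true
    mse = np.mean(residuals ** 2)
    mae = np.mean(np.abs(residuals))
    # R2 compares the squared residuals with the variation around the mean of y.
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return {'R2': r2, 'MSE': mse, 'MAE': mae}, residuals

toy_scores, toy_residuals = regression_metrics([3.0, 5.0, 7.0], [2.5, 5.0, 8.0])
print(toy_scores)
```

A higher R2 and lower mean squared and mean absolute errors indicate a better fit. Residuals scattered evenly around zero suggest that the model has no systematic bias.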