# ML Exploration Notebook

Use this notebook to explore the data of an underlying problem and judge whether the data is suited for predictive analysis. Several classifiers are compared on predictive performance metrics such as accuracy, precision, recall and area under the curve (AUC), which gives you a head start in solving the problem or in managing expectations.

The notebook is structured in the following way:

    1. Set project path
    2. Read in (raw) Data Set
    3. Basic Data Information
    4. Set Dependent & Independent Variables 
    5. Set Parameter Values
    6. Generate Predictions
    7. Compare Models

### 1. Set project path

To use the files stored on GitHub from Colab, we first clone the GitHub repository into the current Colab folder.

In [None]:
from pathlib import Path


base_path = Path('.')
projects_path = Path(base_path, 'projects')
ml_expl_path = Path(projects_path, 'ml_copy')


In [None]:
import subprocess


repo_url = "https://github.com/PippleNL/Pipple-Lecture-8-ML-prediction.git"
repo_dir = str(ml_expl_path.absolute())

if ml_expl_path.exists():
    # Repository already cloned: update it in place. `git -C <dir>` runs the
    # command inside <dir>; a `cd` issued through a separate os.system call
    # does not carry over to the next call.
    print(f"{repo_dir} exists")
    print("Updating repository ML-exploration")
    subprocess.run(["git", "-C", repo_dir, "pull", "origin", "main"])
else:
    # Create the parent folder if needed, then clone the repository.
    projects_path.mkdir(parents=True, exist_ok=True)
    print(f"Cloning repository into {repo_dir}.")
    subprocess.run(["git", "clone", "-s", repo_url, repo_dir])

print("Done")


In [None]:
# Add the cloned repository (ml_copy) to the module search path so that `functions` can be imported
import sys
sys.path.append(str(ml_expl_path.absolute()))

### 2. Read in (raw) Data Set

The (raw) data set of the underlying problem is read from a comma-separated values (.csv) file.

In [None]:
import pandas as pd


data_path = Path(ml_expl_path, 'data', 'Beer_data.csv')
data = pd.read_csv(data_path)

### 3. Basic Data Information

Below you can find some basic information about the data set: a summary of the dataframe, including the dtype (data type) and number of non-null values per column and the shape of the dataframe, followed by the first few rows.

In [None]:
data.info()

In [None]:
data.head()

If you want to change a column with dtype 'object' to 'numeric', you can use the following function. This is only possible if the column actually contains numerical values.

In [None]:
from functions.data_preparation import column2num
columns2num = []  # fill in the names of the columns to convert to numeric

if columns2num:
    data = column2num(data, columns2num)
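
The implementation of `column2num` is not shown in this notebook. As a rough sketch of what such a conversion can look like, the example below uses pandas' `pd.to_numeric`. The helper name `to_numeric_columns` and the toy data are hypothetical. With `errors="coerce"`, entries that cannot be parsed become NaN instead of raising an error, which is one possible way to handle columns that are not purely numerical.

```python
import pandas as pd


def to_numeric_columns(df, columns):
    """Return a copy of df with the given columns converted to a numeric dtype.

    Hypothetical stand-in for column2num; unparseable values become NaN.
    """
    out = df.copy()
    for col in columns:
        out[col] = pd.to_numeric(out[col], errors="coerce")
    return out


# Made-up data: 'Calories' is stored as text and contains one invalid entry.
toy = pd.DataFrame({"Calories": ["150", "200", "n/a"], "Name": ["a", "b", "c"]})
converted = to_numeric_columns(toy, ["Calories"])
print(converted.dtypes)
print(converted)
```

After such a conversion it is worth re-checking the NaN counts, because coerced values show up as missing.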

We can check the number of NaN values for each column.

In [None]:
data.isnull().sum()

We'll state the number of unique values per column. If a column only has one value, it will not have any impact on a prediction.

In [None]:
print('Unique Values for Each Feature: \n')
for i in data.columns:
    print(i, ':', data[i].nunique())
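
To act on these counts, columns with at most one unique value can be listed directly. The snippet below is a minimal sketch on made-up data; the column names are only illustrative.

```python
import pandas as pd

# Made-up data: 'Country' holds the same value in every row.
toy = pd.DataFrame({
    "Calories": [150, 200, 180],
    "Country": ["BE", "BE", "BE"],
    "Score": ["good", "bad", "good"],
})

# nunique() ignores NaN by default, so an all-NaN column counts 0 unique values.
n_unique = toy.nunique()
constant_columns = n_unique[n_unique <= 1].index.tolist()
print(constant_columns)  # ['Country']
```

Such columns are candidates to leave out of the list of independent variables.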

We can have a look at the correlation matrix to get an idea of relations between the numeric variables.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
fig, ax = plt.subplots(figsize=(9,9)) 
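# Restrict to numeric columns; recent pandas versions raise an error on text columns
corr_mat = data.select_dtypes(include='number').corr(method='pearson').round(2)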
sns.heatmap(corr_mat, vmin=-1, vmax=1, center=0, annot=True, cmap=sns.diverging_palette(20, 220, n=200), square=True, ax=ax)

In a classification problem the dependent variable is often not numeric, so its correlation with the other variables is not shown. To get a feeling for the dependencies between columns in terms of correlation, we can first transform the categorical variables into numerical variables and then create the correlation matrix again.

In [None]:
data_copy = data.copy()
from sklearn.preprocessing import LabelEncoder
import numpy as np
label_encoder_1 = LabelEncoder()

object_columns = data_copy.dtypes == object  # boolean mask: True for each column with dtype object (np.object was removed in NumPy 1.24)
for object_column in data_copy.columns[object_columns]:
    label_transformed = label_encoder_1.fit_transform(data_copy[object_column])  # fit and transform data per column
    data_copy[object_column] = label_transformed # Replace categorical values with transformed numerical values

data_copy.head()
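
To make the transformation concrete, here is a small sketch of what `LabelEncoder` does to a single categorical column. The values are made up. Note that the integer codes follow the sorted order of the labels, not a meaningful ranking, so Pearson correlations computed on encoded nominal columns should be read as a rough indication only.

```python
from sklearn.preprocessing import LabelEncoder

styles = ["lager", "ale", "stout", "ale"]  # made-up categorical values

encoder = LabelEncoder()
codes = encoder.fit_transform(styles)

print(list(encoder.classes_))   # classes are stored in sorted order
print(codes.tolist())           # position of each value in classes_
print(encoder.inverse_transform(codes).tolist())  # back to the original labels
```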

In [None]:
fig, ax = plt.subplots(figsize=(9,9)) 
corr_mat = round(data_copy.corr(method='pearson'), 2)
sns.heatmap(corr_mat, vmin=-1, vmax=1, center=0, annot=True, cmap=sns.diverging_palette(20, 220, n=200), square=True, ax=ax)

### 4. Set Dependent & Independent Variables

Specify below, as a string, which variable (i.e. column) is used as the dependent variable. This variable is set as y (i.e. the label) and is the one that is ultimately modeled. Also specify, as a list of strings, the other (independent) variables that are used to explain the dependent variable. If the list is empty, all other variables are used.

In [None]:
data.columns

In [None]:
dependent = 'Score'  # fill in your dependent variable here.
independent = ['Calories', 'Acid', 'Belgian']  # fill the list of independent variables here...
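
How the selection is applied inside the comparison function is not shown in this notebook. The sketch below illustrates the described behaviour under that caveat: `split_xy` is a hypothetical helper, and the toy dataframe only borrows the column names used above.

```python
import pandas as pd


def split_xy(df, dependent, independent=None):
    """Split df into features X and label y.

    Hypothetical helper: an empty or missing `independent` list selects
    every column except the dependent one.
    """
    if not independent:
        independent = [col for col in df.columns if col != dependent]
    return df[independent], df[dependent]


# Made-up rows, for illustration only.
toy = pd.DataFrame({
    "Calories": [150, 200, 180],
    "Acid": [0.3, 0.5, 0.4],
    "Belgian": ["yes", "no", "yes"],
    "Score": ["good", "bad", "good"],
})

X, y = split_xy(toy, "Score", ["Calories", "Acid"])
X_all, _ = split_xy(toy, "Score")  # empty selection: all other columns
print(list(X.columns), list(X_all.columns))
```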

### 5. Set Parameter Values

Specify below the parameter values used when comparing models. If a parameter is left commented out, its default value is used. If you uncomment a parameter, also add it to the call of the main function.

In [None]:
impute_strategy = 0.  # either a float or 'drop' (default), 'mean', 'median', 'most_frequent'
labelenc_x = ['Belgian']  # fill the list of independent variables for label encoding here..., if empty then []
onehotenc_x = []  # fill the list of independent variables for one hot encoding, if empty then []
labelenc_y = True  # boolean specifying if label encoding for y variable is necessary
feature_scaling = 'auto'  # None, 'standardisation', 'minmax' or 'auto'
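
To give an idea of what these options mean, the sketch below applies imputation and scaling to a made-up feature with scikit-learn. That the comparison function maps its options onto these scikit-learn classes is an assumption; its internals are not shown in this notebook.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [2.0], [np.nan], [3.0]])  # made-up feature with one gap

# 'mean' replaces NaN with the column mean; 'constant' with a fixed value
# (assumed here to correspond to passing a float as impute_strategy).
X_mean = SimpleImputer(strategy="mean").fit_transform(X)
X_const = SimpleImputer(strategy="constant", fill_value=0.0).fit_transform(X)

# Standardisation: zero mean and unit variance. Min-max: rescale to [0, 1].
X_std = StandardScaler().fit_transform(X_mean)
X_minmax = MinMaxScaler().fit_transform(X_mean)

print(X_mean.ravel(), X_const.ravel())
print(X_std.ravel(), X_minmax.ravel())
```

Scaling matters mostly for distance-based and gradient-based classifiers; tree-based models are largely insensitive to it.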

### 6. Generate Predictions

Predictions are generated for several models using the function 'main_classification' from the Python script 'compare_models'. Unless specified otherwise, all default parameter values are used. More information can be retrieved by running 'help(main_classification)'. The function returns two lists: predictions and classes.

In [None]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

from functions.compare_models import main_classification
predictions, classes = main_classification(data, dependent, independent, impute_strategy=impute_strategy, labelenc_x=labelenc_x, onehotenc_x=onehotenc_x, feature_scaling_method=feature_scaling)

#### Some additional information

- True Positives: number of correctly identified 'Positive' values; the model says 'Positive' and in reality it is 'Positive'
- False Positives: the model says 'Positive', but in reality it is not 'Positive'
- True Negatives: number of correctly identified 'Negative' values; the model says 'Negative' and in reality it is 'Negative'
- False Negatives: the model says 'Negative', but in reality it is not 'Negative'
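
The four counts can be computed by hand from the true and predicted labels. A minimal sketch on made-up binary labels, where 1 stands for 'Positive' and 0 for 'Negative':

```python
import numpy as np

# Made-up labels, for illustration only.
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 0, 1, 0, 1, 0, 0, 1])

tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # predicted positive, is positive
fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # predicted positive, is negative
tn = int(np.sum((y_pred == 0) & (y_true == 0)))  # predicted negative, is negative
fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # predicted negative, is positive

print(tp, fp, tn, fn)  # 3 1 3 1
```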

#### Evaluation metrics for classification
- Accuracy = share of observations whose class is predicted correctly -> (True Positive + True Negative) / (all observations)
- Precision = of the observations that the model assigns to a class, the share that in reality belongs to that class -> (True Positive) / (True Positive + False Positive)
- Recall = of the observations that in reality belong to a class, the share that the model assigns to that class -> (True Positive) / (True Positive + False Negative)

See also: https://en.wikipedia.org/wiki/Sensitivity_and_specificity
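
As a worked example, the formulas can be checked against scikit-learn on made-up binary labels. For a binary problem, `confusion_matrix(...).ravel()` returns the counts in the order TN, FP, FN, TP.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Made-up labels: 1 = 'Positive', 0 = 'Negative'.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + fp + tn + fn)   # 7 / 10
precision = tp / (tp + fp)                   # 3 / 5
recall = tp / (tp + fn)                      # 3 / 4

print(accuracy, accuracy_score(y_true, y_pred))
print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
```

In this example the model is more complete than it is exact: it finds 75% of the positives, but only 60% of its positive predictions are right.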

### 7. Compare Models

Models are compared on predictive performance metrics that are calculated and sorted by the custom function 'sort_compute_metrics_clf' in the Python script 'compare_models'. More information on the function can be retrieved using 'help(sort_compute_metrics_clf)'.

In [None]:
from functions.compare_models import sort_compute_metrics_clf
multi_class = len(classes) > 2
header, scores = sort_compute_metrics_clf(predictions, multi_class=multi_class)
pd.DataFrame(scores, columns=header)
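
The internals of `sort_compute_metrics_clf` are not shown here. As a rough sketch of the general idea, the example below computes a few metrics per model with scikit-learn and sorts the resulting table. The model names and predictions are made up. For more than two classes, `precision_score` and `recall_score` additionally need an `average` argument such as `'macro'`.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
# Made-up predictions of two hypothetical models.
toy_predictions = {
    "model_a": [1, 1, 1, 0, 1, 1, 0, 0, 0, 0],
    "model_b": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
}

rows = [
    [name,
     accuracy_score(y_true, y_pred),
     precision_score(y_true, y_pred),
     recall_score(y_true, y_pred)]
    for name, y_pred in toy_predictions.items()
]
table = pd.DataFrame(rows, columns=["Model", "Accuracy", "Precision", "Recall"])

# Best model first, judged by accuracy.
table = table.sort_values("Accuracy", ascending=False).reset_index(drop=True)
print(table)
```

Sorting on a single metric is a simplification; which metric should lead depends on whether false positives or false negatives are more costly for the problem at hand.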

In [None]:
from sklearn.metrics import confusion_matrix
%matplotlib inline

tick_positions = list(range(len(classes)))

for i in range(len(predictions)):
    cm = confusion_matrix(predictions[i][2], predictions[i][1])
    plt.figure(figsize=(6, 6))
    ax = plt.subplot()

    #df_cm = pd.DataFrame(cm, index = [i for i in classes], columns = [i for i in classes])
    #sns.heatmap(df_cm, annot=True, ax = ax, fmt='.3g')

    # Draw the matrix with matshow and annotate each cell with its count
    # (originally a workaround for an error in matplotlib 3.1.1).
    ax.matshow(cm, cmap=plt.cm.Blues)
    for row in range(len(classes)):
        for column in range(len(classes)):
            # ax.text takes (x, y), i.e. (column index, row index)
            ax.text(column, row, str(cm[row, column]), va='center', ha='center')
    ax.set_xlabel('Predicted labels')
    ax.set_xticks(tick_positions)
    ax.set_xticklabels(classes)
    ax.xaxis.set_ticks_position("bottom")
    # Set the y ticks explicitly so that the labels line up with the rows
    ax.set_yticks(tick_positions)
    ax.set_yticklabels(classes)
    ax.set_ylabel('True labels')
    ax.set_title('Confusion Matrix - ' + predictions[i][0])