# SEN163A - Responsible Data Analytics
## Lab session 5: Predictive Analytics: Regression and Classification
### Delft University of Technology
### Q3 2022

**Instructor**: Dr. Ir. Jacopo De Stefani - J.deStefani@tudelft.nl

**TAs**: Antonio Sanchez Martin - A.SanchezMartin@student.tudelft.nl

#### Instructions

Lab sessions aim to:
- Show and reinforce how models and ideas presented in class are put to practice.
- Help you gather hands-on machine learning skills.

Lab sessions are:

- Learning environments where you work with Jupyter notebooks and where you can get support from TAs and fellow students.
- Not graded and do not have to be submitted.
- A good preparation for the assignments (which are graded).


### Application: Predictive analytics of health- and insurance-related data

In this lab session, we will explore how to perform predictive analytics to solve both a classification task (predicting a categorical variable) and a regression task (predicting a numerical variable).
The classification case will be related to the prediction of the occurrence of a stroke, based on both physiological measurements as well as user features.
The regression case, on the other hand, will be related to the prediction of health insurance costs, based on user features and behaviour.

#### Learning objectives
After completing the following exercises you will be able to:

1. Apply common preprocessing techniques to prepare data for machine learning techniques: categorical preprocessing, imputation.
2. Split the available dataset into a training set (for model fitting) and a testing set (for performance evaluation).
3. Fit benchmark models to determine baseline performances on both a classification and regression case.
4. Compute the most commonly applied performance measures for classification and regression tasks.
5. Fit the most commonly applied machine learning predictive models for classification and regression tasks.
6. Compare predictive models across different performance metrics.

In [1]:
import pandas
import numpy

import seaborn
import matplotlib

# Global figure size and font sizes for all plots
seaborn.set(rc={"figure.figsize":(15, 10),
            'legend.title_fontsize' : 25,
            'legend.fontsize' : 20,
            'xtick.labelsize' : 20,
            'ytick.labelsize' : 20,
            'axes.labelsize' : 25})

# Choose the palette after seaborn.set(), which resets the palette to its default
seaborn.set_palette("Set2")

In [5]:
#seaborn.set_context('notebook')
seaborn.set_context('paper')
#seaborn.set_context('talk')
#seaborn.set_context('poster')

# Predictive Analytics - Classification example

The classification task we will be tackling is based on the [Stroke Prediction Dataset](https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset?select=healthcare-dataset-stroke-data.csv).

In this case, we will use the available data to try to predict the occurrence of a stroke (`stroke` variable) as a function of the other variables.

Before starting the modeling task, please have a look at the metadata about the [Stroke Prediction Dataset](https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset?select=healthcare-dataset-stroke-data.csv), in order to better understand the meaning of the different variables.




## Activity 1.1 - Descriptive analytics

We are going to use the `pandas` library to perform some exploratory understanding of the data.

1. Load the dataset `healthcare-dataset-stroke-data.csv` in the `stroke_df` variable
2. Display the content of the `stroke_df` variable
3. What are the types of the different columns? Use your knowledge of `pandas` to determine them.


In [8]:
# Load the dataset healthcare-dataset-stroke-data.csv in the stroke_df variable
# (forward slashes in the path work on Windows, macOS and Linux)
stroke_df = pandas.read_csv("data/healthcare-dataset-stroke-data.csv")

stroke_df.head()

|   | id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9046 | Male | 67.0 | 0 | 1 | Yes | Private | Urban | 228.69 | 36.6 | formerly smoked | 1 |
| 1 | 51676 | Female | 61.0 | 0 | 0 | Yes | Self-employed | Rural | 202.21 | NaN | never smoked | 1 |
| 2 | 31112 | Male | 80.0 | 0 | 1 | Yes | Private | Rural | 105.92 | 32.5 | never smoked | 1 |
| 3 | 60182 | Female | 49.0 | 0 | 0 | Yes | Private | Urban | 171.23 | 34.4 | smokes | 1 |
| 4 | 1665 | Female | 79.0 | 1 | 0 | Yes | Self-employed | Rural | 174.12 | 24.0 | never smoked | 1 |


In [9]:
stroke_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 5110 non-null   int64  
 1   gender             5110 non-null   object 
 2   age                5110 non-null   float64
 3   hypertension       5110 non-null   int64  
 4   heart_disease      5110 non-null   int64  
 5   ever_married       5110 non-null   object 
 6   work_type          5110 non-null   object 
 7   Residence_type     5110 non-null   object 
 8   avg_glucose_level  5110 non-null   float64
 9   bmi                4909 non-null   float64
 10  smoking_status     5110 non-null   object 
 11  stroke             5110 non-null   int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 479.2+ KB


In [11]:
# astype returns a new Series, so each result must be assigned back to its column

# make the gender column categorical
stroke_df['gender'] = stroke_df['gender'].astype('category')

# make the ever_married column categorical
stroke_df['ever_married'] = stroke_df['ever_married'].astype('category')

# make the work_type column categorical
stroke_df['work_type'] = stroke_df['work_type'].astype('category')

# make the Residence_type column categorical
stroke_df['Residence_type'] = stroke_df['Residence_type'].astype('category')

# make the smoking_status column categorical
stroke_df['smoking_status'] = stroke_df['smoking_status'].astype('category')

# make the stroke column boolean
stroke_df['stroke'] = stroke_df['stroke'].astype('bool')

# make the hypertension column boolean
stroke_df['hypertension'] = stroke_df['hypertension'].astype('bool')

# make the heart_disease column boolean
stroke_df['heart_disease'] = stroke_df['heart_disease'].astype('bool')

# make the age column int (note: this truncates any fractional ages)
stroke_df['age'] = stroke_df['age'].astype('int')

stroke_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype   
---  ------             --------------  -----   
 0   id                 5110 non-null   int64   
 1   gender             5110 non-null   category
 2   age                5110 non-null   int32   
 3   hypertension       5110 non-null   bool    
 4   heart_disease      5110 non-null   bool    
 5   ever_married       5110 non-null   category
 6   work_type          5110 non-null   category
 7   Residence_type     5110 non-null   category
 8   avg_glucose_level  5110 non-null   float64 
 9   bmi                4909 non-null   float64 
 10  smoking_status     5110 non-null   category
 11  stroke             5110 non-null   bool    
dtypes: bool(3), category(5), float64(2), int32(1), int64(1)


## Activity 1.2 - Diagnostic analytics

Common problems in many datasets are missing data (usually indicated by N/A, NA, or NaN) and extreme values (outliers).

As a reminder, several ways exist to deal with incomplete or missing data, the most common being:

![MissingData](figures/MissingData.png)

**Source:** *Skarga-Bandurova, I., Biloborodova, T., & Dyachenko, Y. (2018). Strategy to managing mixed datasets with missing items. In Information Processing and Management of Uncertainty in Knowledge-Based Systems. Theory and Foundations: 17th International Conference, IPMU 2018, Cádiz, Spain, June 11-15, 2018, Proceedings, Part II 17 (pp. 608-620). Springer International Publishing.*


1. Is there any column containing missing data in this dataset?
2. If there are any, display the column(s) containing missing data.
3. Count the number of missing values in the column(s) containing missing data.
4. Analyze the missing values and their potential causes, and propose the most appropriate way to process them in order to have a dataset without missing values for the further steps.
5. Produce a new dataset `stroke_noNA_df` containing no missing values.
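
The steps above could be approached along the following lines. This is only a minimal sketch: `toy_df` is a made-up stand-in for `stroke_df`, and median imputation is just one of several possible strategies (an assumption, not the expected answer).

```python
import numpy
import pandas

# Hypothetical toy data standing in for stroke_df (values are made up)
toy_df = pandas.DataFrame({
    "age": [67, 61, 80, 49, 79],
    "bmi": [36.6, numpy.nan, 32.5, 34.4, numpy.nan],
    "stroke": [True, True, False, False, True],
})

# Steps 1-3: count the missing values per column and keep the affected columns
missing_counts = toy_df.isna().sum()
columns_with_na = missing_counts[missing_counts > 0]
print(columns_with_na)

# Step 5, option A: drop every row containing at least one missing value
toy_dropped_df = toy_df.dropna()

# Step 5, option B: replace missing values with the column median
toy_imputed_df = toy_df.fillna({"bmi": toy_df["bmi"].median()})
```

Dropping rows is simple but discards data, while imputation keeps every row at the cost of introducing estimated values. Which trade-off is acceptable depends on how many values are missing and why.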


## Activity 1.3

In order to apply a Machine Learning predictive model on the [Stroke Prediction Dataset](https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset?select=healthcare-dataset-stroke-data.csv) that we had previously imported in the `stroke_df` variable, we need to perform the following operations:

1. Handle missing values (done in 1.2 by dropping or imputing them)
2. Split data into training and test using the `train_test_split` function (1.3)
3. Transform categorical variables (1.4)

**N.B.**: Please note that the transformation of categorical variables needs to be done after the split into training and test sets in order to avoid information leakage (normally the test set should not be seen by the model during its training phase).

We are going to use the `scikit-learn` library to perform most of the split and transformation tasks.

Here you need to:
1. Divide the `stroke_noNA_df` dataset into two variables:
    - `X` containing the input variables
    - `Y` containing the target variable (`stroke`)
2. Use the `train_test_split` function to obtain `X_train, X_test, Y_train, Y_test` with a 70% train - 30% test split.
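
A minimal sketch of the split described above, assuming a small made-up frame `toy_noNA_df` in place of `stroke_noNA_df`. The `random_state` value is an arbitrary choice that only serves to make the split reproducible.

```python
import pandas
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for stroke_noNA_df (values are made up)
toy_noNA_df = pandas.DataFrame({
    "age": [67, 61, 80, 49, 79, 35, 52, 44, 71, 28],
    "gender": ["Male", "Female", "Male", "Female", "Female",
               "Male", "Female", "Male", "Female", "Male"],
    "stroke": [True, True, True, False, False,
               False, False, False, False, False],
})

# Inputs: every column except the target; target: the stroke column
X = toy_noNA_df.drop(columns=["stroke"])
Y = toy_noNA_df["stroke"]

# 70% train / 30% test; random_state makes the split reproducible
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, random_state=0
)
print(X_train.shape, X_test.shape)
```

Passing `stratify=Y` to `train_test_split` keeps the class proportions similar in both sets, which can matter when one class is rare. On the real data you may also want to leave identifier columns such as `id` out of `X`.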

## Activity 1.4

Before inputting the data to a Machine Learning model, we need all the inputs to be numeric.
In order to transform categorical data into numeric ones, three techniques exist (cf. https://www.kaggle.com/code/alexisbcook/categorical-variables):
- Dropping Categorical variables
- Ordinal Encoding: A categorical variable is replaced by a single numerical variable, where each category is mapped to a different, increasing integer value.
- One-hot Encoding: A categorical variable with $n$ different categories is replaced by $n$ binary variables, each of them corresponding to a category. 

We are going to use the `scikit-learn` library to perform the transformation of the variables and to subsequently fit the models.

1. Have a look at the documentation of the [Scikit-learn](https://scikit-learn.org/stable/index.html) library 
2. Have a look at the following [code](https://www.kaggle.com/code/alexisbcook/categorical-variables) to perform the transformation of categorical variables and create the following variables to store the processed data:
    - Dropping Categorical variables: `drop_X_train` and `drop_X_test`
    - Ordinal Encoding: `label_X_train` and `label_X_test`
    - One-hot Encoding: `OH_X_train` and `OH_X_test`

## Activity 1.5

Finally, with the data cleaned of missing values and the categorical variables appropriately transformed, we are able to fit some models using the `scikit-learn` library.

As seen in Lecture 5, as a starter we will be using a baseline for classification models: a [Naive Bayes model](https://scikit-learn.org/stable/modules/naive_bayes.html).

1. Have a look at the documentation of the [Scikit-learn](https://scikit-learn.org/stable/index.html) library for the Naive Bayes model.
2. Initialize the model
3. Use the `fit` function to perform the training of the model on the training set
4. Use the `predict` function to perform the prediction of the model on the test set
5. Use the `accuracy_score`, `balanced_accuracy_score` and `f1_score` functions to compare the predictions with the actual values and obtain different performance metrics about the model.
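
The steps above might look roughly like this. The sketch assumes `GaussianNB` as the Naive Bayes variant (other variants exist in `sklearn.naive_bayes`) and uses synthetic, imbalanced data from `make_classification` instead of the encoded stroke data, so the printed scores say nothing about the real task.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic, imbalanced stand-in for the encoded stroke data (not the real dataset)
X, Y = make_classification(n_samples=1000, n_features=8,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, random_state=0, stratify=Y
)

# Initialize, train on the training set, predict on the test set
nb_model = GaussianNB()
nb_model.fit(X_train, Y_train)
Y_pred = nb_model.predict(X_test)

# Compare the predictions with the actual values
nb_scores = {
    "accuracy": accuracy_score(Y_test, Y_pred),
    "balanced_accuracy": balanced_accuracy_score(Y_test, Y_pred),
    "f1": f1_score(Y_test, Y_pred),
}
print(nb_scores)
```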

## Activity 1.6

Now that you are familiar with the pipeline of training, testing and evaluating one model, you can easily repeat the procedure for multiple models.

1. Have a look at the documentation of the [Scikit-learn](https://scikit-learn.org/stable/index.html) library for other classification models:
    - Logistic Regression
    - Decision Trees
    - Random Forest
    - Gradient Boosting
    - Artificial Neural Networks
    - K-Nearest Neighbors
2. For each model:
    1. Initialize the model
    2. Use the `fit` function to perform the training of the model on the training set
    3. Use the `predict` function to perform the prediction of the model on the test set
    4. Use the `accuracy_score`, `balanced_accuracy_score` and `f1_score` functions to compare the predictions with the actual values and obtain performance metrics about the models.
    
3. Create a dictionary/Data Frame in order to be able to compare the performance scores of the different models.
    1. Are there any differences in the values of the metrics?
    2. Why are these values different? Check the documentation to get to know more about the metrics.
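
One possible way to organize the comparison is a dictionary of models that is looped over, with the scores collected in a `DataFrame`. The sketch below assumes synthetic data from `make_classification` and mostly default hyperparameters; the names `models`, `results` and `results_df` are illustrative only.

```python
import pandas
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the encoded stroke data (not the real dataset)
X, Y = make_classification(n_samples=1000, n_features=8,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, random_state=0, stratify=Y
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
    "Neural Network": MLPClassifier(max_iter=1000, random_state=0),
    "K-Nearest Neighbors": KNeighborsClassifier(),
}

# Same fit / predict / score procedure for every model
results = {}
for name, model in models.items():
    model.fit(X_train, Y_train)
    Y_pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(Y_test, Y_pred),
        "balanced_accuracy": balanced_accuracy_score(Y_test, Y_pred),
        "f1": f1_score(Y_test, Y_pred),
    }

# One row per model, one column per metric
results_df = pandas.DataFrame(results).T
print(results_df)
```

On imbalanced data, accuracy is dominated by the majority class, whereas balanced accuracy averages the recall obtained on each class and the F1 score combines precision and recall on the positive class. This is one reason why the three columns can differ substantially for the same model.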



## Activity 1.7

Congratulations! By now you should be able to train, test and evaluate multiple models on a classification task.

1. Have a look at the documentation of the [Scikit-learn](https://scikit-learn.org/stable/index.html) library for the different parameters of other classification models.

2. Analyze the impact of different changes in the predictive setup on the model:
    - Does the amount of data in the training set affect the predictive performance? Try to apply the procedure by varying the training-test proportion.
    - Does the parameter setting of the different models have an impact on the model performances? Try to tweak the performance by varying the parameters.
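
As a sketch of the first question, the loop below varies the training-test proportion and records one score per setting. Synthetic data, logistic regression and balanced accuracy are all assumptions chosen for brevity; any model and metric from the previous activities could be substituted.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the encoded stroke data (not the real dataset)
X, Y = make_classification(n_samples=1000, n_features=8,
                           weights=[0.9, 0.1], random_state=0)

# Refit the same model with progressively less training data
scores_by_test_size = {}
for test_size in [0.1, 0.3, 0.5, 0.7, 0.9]:
    X_train, X_test, Y_train, Y_test = train_test_split(
        X, Y, test_size=test_size, random_state=0, stratify=Y
    )
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, Y_train)
    scores_by_test_size[test_size] = balanced_accuracy_score(
        Y_test, model.predict(X_test)
    )

for test_size, score in scores_by_test_size.items():
    print(f"test_size={test_size:.1f}  balanced accuracy={score:.3f}")
```

A single split per proportion gives a noisy picture. Repeating each setting with several `random_state` values and averaging the scores would make the trend easier to read.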

# Predictive Analytics - Regression

The regression task we will be tackling is based on the [Medical Cost Personal Dataset](https://www.kaggle.com/datasets/mirichoi0218/insurance?ref=hackernoon.com&select=insurance.csv).

In this case, we will use the available data to try to predict the insurance cost (`charges` variable) as a function of the other variables.

Before starting the modeling task, please have a look at the metadata about the [Medical Cost Personal Dataset](https://www.kaggle.com/datasets/mirichoi0218/insurance?ref=hackernoon.com&select=insurance.csv), in order to better understand the meaning of the different variables.

## Activity 2.1 - Descriptive analytics

We are going to use the `pandas` library to perform some exploratory understanding of the data.

1. Load the dataset in the `insurance_df` variable
2. Display the content of the `insurance_df` variable
3. What are the types of the different columns? Use your knowledge of `pandas` to determine them.


## Activity 2.2 - Diagnostic analytics

Common problems in many datasets are missing data (usually indicated by N/A, NA, or NaN) and extreme values (outliers).

As a reminder, several ways exist to deal with incomplete or missing data, the most common being:

![MissingData](figures/MissingData.png)

**Source:** *Skarga-Bandurova, I., Biloborodova, T., & Dyachenko, Y. (2018). Strategy to managing mixed datasets with missing items. In Information Processing and Management of Uncertainty in Knowledge-Based Systems. Theory and Foundations: 17th International Conference, IPMU 2018, Cádiz, Spain, June 11-15, 2018, Proceedings, Part II 17 (pp. 608-620). Springer International Publishing.*


1. Is there any column containing missing data in this dataset?
2. If there are any, display the column(s) containing missing data.
3. Count the number of missing values in the column(s) containing missing data.
4. Analyze the missing values and their potential causes, and propose the most appropriate way to process them in order to have a dataset without missing values for the further steps.
5. Produce a new dataset `insurance_noNA_df` containing no missing values.



## Activity 2.3

In order to apply a Machine Learning predictive model on the [Medical Cost Personal Dataset](https://www.kaggle.com/datasets/mirichoi0218/insurance?ref=hackernoon.com&select=insurance.csv) that we had previously imported in the `insurance_df` variable, we need to perform the following operations:

1. Handle missing values (done in 2.2 by dropping or imputing them)
2. Split data into training and test using the `train_test_split` function (2.3)
3. Transform categorical variables (2.4)

**N.B.**: Please note that the transformation of categorical variables needs to be done after the split into training and test sets in order to avoid information leakage (normally the test set should not be seen by the model during its training phase).

We are going to use the `scikit-learn` library to perform most of the split and transformation tasks.

Here you need to:
1. Divide the `insurance_noNA_df` dataset into two variables:
    - `X` containing the input variables
    - `Y` containing the target variable (`charges`)
2. Use the `train_test_split` function to obtain `X_train, X_test, Y_train, Y_test` with a 70% train - 30% test split.

## Activity 2.4

Before inputting the data to a Machine Learning model, we need all the inputs to be numeric.
In order to transform categorical data into numeric ones, three techniques exist (cf. https://www.kaggle.com/code/alexisbcook/categorical-variables):
- Dropping Categorical variables
- Ordinal Encoding: A categorical variable is replaced by a single numerical variable, where each category is mapped to a different, increasing integer value.
- One-hot Encoding: A categorical variable with $n$ different categories is replaced by $n$ binary variables, each of them corresponding to a category. 

We are going to use the `scikit-learn` library to perform the transformation of the variables and to subsequently fit the models.

1. Have a look at the documentation of the [Scikit-learn](https://scikit-learn.org/stable/index.html) library 
2. Have a look at the following [code](https://www.kaggle.com/code/alexisbcook/categorical-variables) to perform the transformation of categorical variables and create the following variables to store the processed data:
    - Dropping Categorical variables: `drop_X_train` and `drop_X_test`
    - Ordinal Encoding: `label_X_train` and `label_X_test`
    - One-hot Encoding: `OH_X_train` and `OH_X_test`

## Activity 2.5

Finally, with the data cleaned of missing values and the categorical variables appropriately transformed, we are able to fit some models using the `scikit-learn` library.

As seen in Lecture 5, as a starter we will be using a baseline for regression models: a [Linear Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression).

1. Have a look at the documentation of the [Scikit-learn](https://scikit-learn.org/stable/index.html) library for the Linear Regression model.
2. Initialize the model
3. Use the `fit` function to perform the training of the model on the training set.
4. Use the `predict` function to perform the prediction of the model on the test set.
5. Use the `mean_squared_error`, `mean_absolute_error` and `mean_absolute_percentage_error` functions to compare the predictions with the actual values and obtain different performance metrics about the model.
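
A minimal sketch of the regression baseline, assuming synthetic data from `make_regression` in place of the encoded insurance data. The `noise` and `bias` values are arbitrary choices for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import (
    mean_absolute_error,
    mean_absolute_percentage_error,
    mean_squared_error,
)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded insurance data (not the real dataset);
# the bias shifts the targets upwards so that most of them are far from zero
X, Y = make_regression(n_samples=500, n_features=6, noise=20.0,
                       bias=500.0, random_state=0)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, random_state=0
)

# Initialize, train on the training set, predict on the test set
lr_model = LinearRegression()
lr_model.fit(X_train, Y_train)
Y_pred = lr_model.predict(X_test)

# Compare the predictions with the actual values
lr_scores = {
    "mse": mean_squared_error(Y_test, Y_pred),
    "mae": mean_absolute_error(Y_test, Y_pred),
    "mape": mean_absolute_percentage_error(Y_test, Y_pred),
}
print(lr_scores)
```

The three metrics live on different scales: the MSE is expressed in squared target units, the MAE in target units, and the MAPE is a relative error returned as a fraction rather than a percentage.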

## Activity 2.6

Now that you are familiar with the pipeline of training, testing and evaluating one model, you can easily repeat the procedure for multiple models.

1. Have a look at the documentation of the [Scikit-learn](https://scikit-learn.org/stable/index.html) library for other regression models:
    - Perceptron
    - Lasso/ElasticNet
    - Decision Tree
    - Random Forest
    - Gradient Boosting
    - Artificial Neural Networks
    - K-Nearest Neighbors
2. For each model:
    1. Initialize the model
    2. Use the `fit` function to perform the training of the model on the training set
    3. Use the `predict` function to perform the prediction of the model on the test set
    4. Use the `mean_squared_error`, `mean_absolute_error` and `mean_absolute_percentage_error` functions to compare the predictions with the actual values and obtain performance metrics about the models.
    
3. Create a dictionary/Data Frame in order to be able to compare the performance scores of the different models.
    1. Are there any differences in the values of the metrics?
    2. Why are these values different? Check the documentation to get to know more about the metrics.
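
The comparison could be organized with a dictionary of regressors and the regression metrics from Activity 2.5, as sketched below. Synthetic data and mostly default hyperparameters are assumptions of this sketch. Note that scikit-learn's `Perceptron` class is a classifier, so it is left out here.

```python
import pandas
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import ElasticNet, Lasso
from sklearn.metrics import (
    mean_absolute_error,
    mean_absolute_percentage_error,
    mean_squared_error,
)
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the encoded insurance data (not the real dataset)
X, Y = make_regression(n_samples=500, n_features=6, noise=20.0,
                       bias=500.0, random_state=0)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, random_state=0
)

regressors = {
    "Lasso": Lasso(),
    "ElasticNet": ElasticNet(),
    "Decision Tree": DecisionTreeRegressor(random_state=0),
    "Random Forest": RandomForestRegressor(random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
    "Neural Network": MLPRegressor(max_iter=2000, random_state=0),
    "K-Nearest Neighbors": KNeighborsRegressor(),
}

# Same fit / predict / score procedure for every model
regression_results = {}
for name, model in regressors.items():
    model.fit(X_train, Y_train)
    Y_pred = model.predict(X_test)
    regression_results[name] = {
        "mse": mean_squared_error(Y_test, Y_pred),
        "mae": mean_absolute_error(Y_test, Y_pred),
        "mape": mean_absolute_percentage_error(Y_test, Y_pred),
    }

# One row per model, one column per metric (lower is better for all three)
regression_results_df = pandas.DataFrame(regression_results).T
print(regression_results_df.sort_values("mae"))
```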

## Activity 2.7

Congratulations! By now you should be able to train, test and evaluate multiple models on a regression task.

1. Have a look at the documentation of the [Scikit-learn](https://scikit-learn.org/stable/index.html) library for the different parameters of other regression models.

2. Analyze the impact of different changes in the predictive setup on the model:
    - Does the amount of data in the training set affect the predictive performance? Try to apply the procedure by varying the training-test proportion.
    - Does the parameter setting of the different models have an impact on the model performances? Try to tweak the performance by varying the parameters.
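
As a sketch of the second question, the loop below varies a single hyperparameter and records the test error for each value. The choice of `DecisionTreeRegressor`, its `max_depth` parameter, the candidate values and the synthetic data are all assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the encoded insurance data (not the real dataset)
X, Y = make_regression(n_samples=500, n_features=6, noise=20.0,
                       bias=500.0, random_state=0)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, random_state=0
)

# Refit the same model with different maximum tree depths (None = unlimited)
mae_by_depth = {}
for max_depth in [2, 4, 8, None]:
    tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
    tree.fit(X_train, Y_train)
    mae_by_depth[max_depth] = mean_absolute_error(Y_test, tree.predict(X_test))

for max_depth, mae in mae_by_depth.items():
    print(f"max_depth={max_depth}  MAE={mae:.2f}")
```

Selecting the parameter value that scores best on the test set makes the reported test performance optimistic. A separate validation set or cross-validation is the usual way to tune parameters while keeping the test set untouched.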