# EEC 174AY 2023 Fall Pre-lab B1



## Introduction

This assignment follows Lab 2 and uses the same dataset.

### Reading Data into Memory

In [1]:
# make sure we can plot in future if we want
%matplotlib notebook
# make sure to ignore warnings
import warnings
warnings.simplefilter('ignore')
# Import statement for pandas
import pandas as pd
# This is just a small configuration change for purposes of the class
pd.options.display.max_rows = 10

# Get our train X and y datasets for the problem
train_x = pd.read_csv('ece174_pva_train_x.csv')
train_y = pd.read_csv('ece174_pva_train_y.csv')

# Get our validation X and y datasets for the problem.
test_x = pd.read_csv('ece174_pva_validation_x.csv')
test_y = pd.read_csv('ece174_pva_validation_y.csv')

In [2]:
# output some rows of the dataset just to get a better feel for the information
train_x

Unnamed: 0,breath_id,i_time,tve,max_flow,min_flow,max_pressure,peep,ip_auc,ep_auc,patient
0,1,0.80,545.032222,51.06,-41.03,17.37,7.600,11.122367,16.057733,66
1,2,0.80,531.880278,53.13,-39.97,17.13,7.508,11.077750,17.310533,66
2,3,0.86,523.876667,52.86,-38.24,17.11,7.658,12.066000,16.697800,66
3,4,0.80,507.636111,51.04,-39.37,17.14,7.572,11.097800,15.774250,66
4,5,0.80,518.618889,47.88,-38.51,16.92,7.598,11.065400,18.483333,66
...,...,...,...,...,...,...,...,...,...,...
5970,296,0.90,355.365278,42.26,-51.51,23.53,13.194,19.216400,21.816367,662
5971,297,0.90,316.806944,42.10,-55.17,24.61,12.896,19.800467,21.739700,662
5972,298,0.92,395.971111,42.95,-22.47,21.35,13.090,16.997767,21.457600,662
5973,299,0.90,373.426389,40.34,-36.81,21.69,13.334,17.944000,21.798167,662


The raw table above has 10 columns; after the featurization step later in this notebook it has 18. Here is a detailed breakdown of all 18. Feel free to come back to it as necessary.

Features:
 * *breath_id* - matches with a specific breath identifier from the raw data file.
 * *patient* - the patient the data came from
 * *min_flow* - The minimum flow observation on the breath
 * *max_flow* - The maximum flow observation on the breath
 * *tvi* - The inhaled volume of air for each breath
 * *tve* - The exhaled volume of air for each breath
 * *tve_tvi_ratio* - The ratio of `tve / tvi`
 * *i_time* - The amount of time the patient was breathing in for each breath
 * *e_time* - The amount of time the patient was breathing out for each breath
 * *i_e_ratio* - The ratio of `i_time / e_time`
 * *rr* - The respiratory rate in breaths per minute. Computed as `60 / (i_time + e_time)`
 * *min_pressure* - the minimum pressure observation on the breath
 * *max_pressure* - the maximum pressure observation on the breath
 * *peep* - the baseline pressure setting on the ventilator
 * *pip* - the maximum pressure setting of inspiration. Slight difference compared to max_pressure
 * *maw* - the mean pressure for the entire breath
 * *ip_auc* - the area under the curve of the inspiratory pressure
 * *ep_auc* - the area under the curve of the expiratory pressure
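
The timing formulas in the list above can be checked with a few lines of Python. This is a minimal sketch, not part of the lab code: the helper name `timing_features` is hypothetical, and the 0.02 s sampling period is the value used by the featurization code later in this notebook.

```python
def timing_features(i_time, n_samples, dt=0.02):
    """Derive e_time, i_e_ratio and rr for one breath.

    n_samples is the number of flow samples in the breath and dt is the
    sampling period in seconds (0.02 s is the value used in this lab).
    """
    total_time = n_samples * dt
    e_time = total_time - i_time
    return {
        "e_time": e_time,
        "i_e_ratio": i_time / e_time,
        "rr": 60 / total_time,
    }


# Hypothetical breath: 0.80 s of inhalation in a 123-sample (2.46 s) breath.
features = timing_features(i_time=0.80, n_samples=123)
print(features)
```

With these inputs the expiratory time is 2.46 - 0.80 = 1.66 s, so the ratio and rate follow directly from the definitions in the list.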

## Featurization

Featurization is the process where you extract information from raw data. This information can then be fed into a machine learning algorithm to perform the task you want. In the current case we will need to extract additional information from the ventilator data in order to create a valid machine learning classifier.

### Processing the Data
The first step is to read the raw data files into memory. We have taken care of this for you in this homework; the code below does it.

In [3]:
import csv


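def process_ventilator_data(filename):
    """Read a raw ventilator file and return one dict of flow/pressure samples per breath."""
    breath_id = 1

    all_breath_data = []
    current_flow_data = []
    current_pressure_data = []

    # A context manager makes sure the file is closed once we are done reading.
    with open(filename) as descriptor:
        for row in csv.reader(descriptor):
            # csv.reader yields blank lines as empty lists; skip them so row[0] is safe
            if not row:
                continue
            # A 'BS' or 'BE' marker row closes the breath collected so far
            if row[0].strip() in ('BS', 'BE') and current_flow_data != []:
                all_breath_data.append({'breath_id': breath_id, 'flow': current_flow_data, 'pressure': current_pressure_data})
                breath_id += 1
                current_flow_data = []
                current_pressure_data = []
            else:
                # Any row that is not a (flow, pressure) pair of numbers is ignored
                try:
                    current_flow_data.append(round(float(row[0]), 2))
                    current_pressure_data.append(round(float(row[1]), 2))
                except (IndexError, ValueError):
                    continue
    return all_breath_data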

In [4]:
from glob import glob
import os

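# import for Simpson's method. This will be helpful for calculating TVi.
# `simps` was removed from recent SciPy releases in favor of `simpson`,
# so use `simpson` when it is available and fall back to `simps` otherwise.
try:
    from scipy.integrate import simpson as simps
except ImportError:
    from scipy.integrate import simps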
from statistics import mean


def extract_features_for_file(filename, existing_features):
    """
    Extract features for every single breath in file. To make matters a bit easier, we use
    existing features that we've already extracted from the file to help speed the process.
    """
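    # the patient id is the name of the directory that holds the file;
    # os.path works with both Windows and POSIX path separators
    patient = os.path.basename(os.path.dirname(filename))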
    all_breath_data = process_ventilator_data(filename)
    all_features = []

    for breath_data in all_breath_data:
        breath_id = breath_data['breath_id']
        existing_breath_features = existing_features[existing_features.breath_id == breath_id].iloc[0]

        flow = breath_data['flow']
        pressure = breath_data['pressure']

        # inspiratory time (the amount of time a patient is inhaling for)
        i_time = existing_breath_features.i_time
        # exhaled tidal volume
        tve = existing_breath_features.tve
        # maximum flow for breath
        max_flow = existing_breath_features.max_flow
        # minimum flow for the breath
        min_flow = existing_breath_features.min_flow
        # maximum pressure for the breath
        max_pressure = existing_breath_features.max_pressure
        # The minimum pressure setting on the ventilator
        peep = existing_breath_features.peep
        # The area under the curve of the inspiratory pressure curve
        ip_auc = existing_breath_features.ip_auc
        # The area under the curve of the expiratory pressure curve
        ep_auc = existing_breath_features.ep_auc

        # This is the array index where the inhalation ends. We divide by 0.02 because
        # that's how frequently the ventilator samples data: every 0.02 seconds.
        x0_index = int(i_time / 0.02)

        # Part of your assignment is to extract the following features for all breaths:
        #
        # Expiratory Time. The amount of time a patient is exhaling
        e_time = len(flow) * .02 - i_time
        #
        # I:E ratio. The ratio of inspiratory to expiratory time. Measured by i_time/e_time
        i_e_ratio = i_time / e_time
        #
        # Respiratory rate. The number of breaths a patient takes per minute. This is measured by
        # 60 / (total breath time in seconds). Note that i_time + e_time == len(flow) * .02.
        rr = 60 / (i_time + e_time)
        #
        # Tidal volume inhaled. The amount of air volume inhaled in the breath.
        # Hint: use the simps function.
        # Flow is in L/min and dx is in seconds, so the integral comes out in L*s/min;
        # multiply by 1000 / 60 to convert it to ml.
        tvi = simps(flow[:x0_index], dx=0.02) * 1000 / 60
        #
        # Tidal volume ratio. Measured by tve/tvi
        tve_tvi_ratio = tve / tvi
        #
        # Minimum pressure of the breath
        min_pressure = min(pressure)
        #
        # PIP - peak inspiratory pressure. The peak pressure during inhalation
        pip = max(pressure[:x0_index])
        #
        # MAW - mean airway pressure for inhalation.
        maw = mean(pressure[:x0_index])

        all_features.append([
            breath_id, i_time, e_time, i_e_ratio, rr, tvi, tve, tve_tvi_ratio,
            max_flow, min_flow, max_pressure, min_pressure, pip, maw,
            peep, ip_auc, ep_auc, int(patient)
        ])
    columns = [
        'breath_id', 'i_time', 'e_time', 'i_e_ratio', 'rr', 'tvi', 'tve',
        'tve_tvi_ratio', 'max_flow', 'min_flow', 'max_pressure',
        'min_pressure', 'pip', 'maw', 'peep', 'ip_auc', 'ep_auc', 'patient'
    ]
    return all_features, columns


def remake_dataset(dataset):
    data_files = glob(os.path.join('data', '*/*.csv'))

    patient_to_file_map = {}
    for filename in data_files:
        # patient id is the name of the directory holding the file (portable across OSes)
        patient = os.path.basename(os.path.dirname(filename))
        patient_to_file_map[patient] = filename

    data = []
    # iterate over all the unique patients in the train set
    for patient in dataset.patient.unique():
        existing_features = dataset[dataset.patient == patient]
        filename = patient_to_file_map[str(patient)]
        breath_data, columns = extract_features_for_file(filename, existing_features)
        # add breath rows
        data.extend(breath_data)
    # create new data frame with the new added information
    return pd.DataFrame(data, columns=columns)
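
The unit conversion behind `tvi` is easy to get wrong, so here is a small sanity check on made-up data. It is a sketch under stated assumptions: the constant 30 L/min flow is hypothetical, and it calls `scipy.integrate.simpson` directly rather than the `simps` name used in the cell above.

```python
import numpy as np
from scipy.integrate import simpson

dt = 0.02                    # sampling period in seconds
flow = np.full(51, 30.0)     # hypothetical constant flow of 30 L/min lasting 1.0 s (51 samples)

# Integrating L/min over seconds gives L*s/min; * 1000 / 60 converts that to ml.
tvi_ml = simpson(flow, dx=dt) * 1000 / 60
print(round(tvi_ml, 2))
```

A flow of 30 L/min is 0.5 L/s, so one second of it should deliver about 500 ml, which is the scale of the `tvi` values in the table that follows.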

In [5]:
# remake train set
train_x = remake_dataset(train_x)
# remake validation set.
test_x = remake_dataset(test_x)

In [6]:
train_x

Unnamed: 0,breath_id,i_time,e_time,i_e_ratio,rr,tvi,tve,tve_tvi_ratio,max_flow,min_flow,max_pressure,min_pressure,pip,maw,peep,ip_auc,ep_auc,patient
0,1,0.80,1.66,0.481928,24.390244,481.917778,545.032222,1.130965,51.06,-41.03,17.37,7.04,17.37,14.208500,7.600,11.122367,16.057733,66
1,2,0.80,1.80,0.444444,23.076923,484.920278,531.880278,1.096841,53.13,-39.97,17.13,7.04,17.13,14.149500,7.508,11.077750,17.310533,66
2,3,0.86,1.74,0.494253,23.076923,521.370000,523.876667,1.004808,52.86,-38.24,17.11,7.04,17.11,14.311860,7.658,12.066000,16.697800,66
3,4,0.80,1.64,0.487805,24.590164,482.908333,507.636111,1.051206,51.04,-39.37,17.14,7.04,17.14,14.174500,7.572,11.097800,15.774250,66
4,5,0.80,1.94,0.412371,21.897810,466.349722,518.618889,1.112081,47.88,-38.51,16.92,7.04,16.92,14.131250,7.598,11.065400,18.483333,66
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5970,296,0.90,1.60,0.562500,24.000000,349.682222,355.365278,1.016252,42.26,-51.51,23.53,13.00,23.53,21.750667,13.194,19.216400,21.816367,662
5971,297,0.90,1.60,0.562500,24.000000,350.130000,316.806944,0.904827,42.10,-55.17,24.61,12.80,24.61,22.414667,12.896,19.800467,21.739700,662
5972,298,0.92,1.58,0.582278,24.000000,355.670278,395.971111,1.113310,42.95,-22.47,21.35,12.90,21.35,18.785435,13.090,16.997767,21.457600,662
5973,299,0.90,1.60,0.562500,24.000000,337.941111,373.426389,1.105004,40.34,-36.81,21.69,13.02,21.69,20.308444,13.334,17.944000,21.798167,662


### Create Ground Truth (that the machine understands)

In [7]:
# Read the test dataset and set it up. Technically we're using the validation set.
test_y

Unnamed: 0,breath_id,patient,bsa,dta,cough,suction
0,20,292,1,0,0,0
1,21,292,1,0,0,0
2,22,292,0,0,0,0
3,23,292,1,0,0,0
4,24,292,1,0,0,0
...,...,...,...,...,...,...
1242,295,114,0,0,0,0
1243,296,114,0,0,0,0
1244,297,114,0,0,0,0
1245,298,114,0,0,0,0


What does this mean?

We have 6 columns here
 * *breath_id* - matches with a specific breath identifier from the raw data file.
 * *patient* - the patient the data came from
 * *bsa* - Breath Stacking Asynchrony. A single breath where the patient is trapping air in their chest
 * *dta* - Double Trigger Asynchrony. Two breaths in a row where the patient is trapping air
 * *cough* - What it sounds like, when a patient coughs
 * *suction* -  Nurses perform suction procedures to remove excess fluid from an endotracheal tube. This waveform is indicative of that.

Now that we understand what our columns are, we need to put them into a format the machine can learn from. Because this is a multiclass problem, let non-PVA breaths be class 0, breath stacking be class 1, and double trigger be class 2. The `cough` and `suction` columns are not used here.

In [8]:
# Create a multi-class y vector that we can use for training purposes.
train_y_vector = train_y.bsa * 1 + train_y.dta * 2
test_y_vector = test_y.bsa * 1 + test_y.dta * 2
test_y_vector

0       1
1       1
2       0
3       1
4       1
       ..
1242    0
1243    0
1244    0
1245    0
1246    0
Length: 1247, dtype: int64

In [9]:
# See if there are places where the data was mis-annotated, i.e. where both double trigger and breath stacking were annotated.
# It's good to know whether this is happening so that we can either drop the data or change it later on.
train_y_vector[train_y_vector > 2]

5438    3
5440    3
5521    3
dtype: int64
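
The encoding and the mis-annotation check can be reproduced on a toy frame to see where the value 3 comes from. This is an illustrative sketch only; `toy_y` and its four rows are made up.

```python
import pandas as pd

# Hypothetical annotations: one row per breath.
toy_y = pd.DataFrame({
    "bsa": [0, 1, 0, 1],
    "dta": [0, 0, 1, 1],   # the last row is annotated as both, which should not happen
})

# Same encoding as above: 0 = normal, 1 = breath stacking, 2 = double trigger.
toy_vector = toy_y.bsa * 1 + toy_y.dta * 2

# A value of 3 can only arise when bsa and dta are both 1.
double_annotated = toy_vector[toy_vector > 2]
print(toy_vector.tolist())
print(double_annotated.index.tolist())
```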

### Creating a Model


In [10]:
# Need to finalize dataset and remove misannotated examples first.

# just drop places where data is double annotated.
misannotated_train = train_y_vector > 2
misannotated_test = test_y_vector > 2

# ~ is the NOT operator
train_x = train_x.loc[~misannotated_train]
train_y_vector = train_y_vector.loc[~misannotated_train]

# do same thing for test
test_x = test_x.loc[~misannotated_test]
test_y_vector = test_y_vector.loc[~misannotated_test]



# Also make sure to drop data that is NaN. This is very important because otherwise your model won't train.
# The .any(axis=1) function basically says, if there are any nans in this *ROW* then mark the row as true.
# The .any(axis=0) would mark columns as True/False, but this isn't helpful now.
nans_train = train_x.isna().any(axis=1)
nans_test = test_x.isna().any(axis=1)

# now filter them out of the dataset in the same way
train_x = train_x.loc[~nans_train]
train_y_vector = train_y_vector.loc[~nans_train]

test_x = test_x.loc[~nans_test]
test_y_vector = test_y_vector.loc[~nans_test]


# any time we drop things from a data frame or series in pandas it is often helpful to re-index the object.
# the index is usually a sequential ordering of the rows like 0, 1, ..., n - 1. Sometimes it can be different
# but for now we'll just use sequential ordering
train_x.index = range(len(train_x))
train_y_vector.index = range(len(train_y_vector))

test_x.index = range(len(test_x))
test_y_vector.index = range(len(test_y_vector))
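
The same drop-then-reindex pattern can be wrapped in a small helper so that features and labels always stay aligned. This is a sketch rather than the lab's own code: `drop_rows` is a hypothetical name, and it uses `reset_index(drop=True)` in place of assigning a `range` to `.index`.

```python
import numpy as np
import pandas as pd


def drop_rows(x, y, bad_rows):
    """Drop the rows flagged in bad_rows from x and y, then renumber both from 0."""
    keep = ~bad_rows
    return x.loc[keep].reset_index(drop=True), y.loc[keep].reset_index(drop=True)


# Hypothetical toy data: row 1 has a NaN feature, row 3 is double-annotated.
toy_x = pd.DataFrame({"tvi": [480.0, np.nan, 520.0, 470.0]})
toy_y = pd.Series([0, 1, 2, 3])

# Combine both conditions into one boolean mask before filtering.
bad = (toy_y > 2) | toy_x.isna().any(axis=1)
clean_x, clean_y = drop_rows(toy_x, toy_y, bad)
print(clean_x)
print(clean_y.tolist())
```

Building one mask and applying it to both objects in the same call avoids the case where a filter is applied to `x` but forgotten for `y`.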

In [11]:
train_x

Unnamed: 0,breath_id,i_time,e_time,i_e_ratio,rr,tvi,tve,tve_tvi_ratio,max_flow,min_flow,max_pressure,min_pressure,pip,maw,peep,ip_auc,ep_auc,patient
0,1,0.80,1.66,0.481928,24.390244,481.917778,545.032222,1.130965,51.06,-41.03,17.37,7.04,17.37,14.208500,7.600,11.122367,16.057733,66
1,2,0.80,1.80,0.444444,23.076923,484.920278,531.880278,1.096841,53.13,-39.97,17.13,7.04,17.13,14.149500,7.508,11.077750,17.310533,66
2,3,0.86,1.74,0.494253,23.076923,521.370000,523.876667,1.004808,52.86,-38.24,17.11,7.04,17.11,14.311860,7.658,12.066000,16.697800,66
3,4,0.80,1.64,0.487805,24.590164,482.908333,507.636111,1.051206,51.04,-39.37,17.14,7.04,17.14,14.174500,7.572,11.097800,15.774250,66
4,5,0.80,1.94,0.412371,21.897810,466.349722,518.618889,1.112081,47.88,-38.51,16.92,7.04,16.92,14.131250,7.598,11.065400,18.483333,66
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5967,296,0.90,1.60,0.562500,24.000000,349.682222,355.365278,1.016252,42.26,-51.51,23.53,13.00,23.53,21.750667,13.194,19.216400,21.816367,662
5968,297,0.90,1.60,0.562500,24.000000,350.130000,316.806944,0.904827,42.10,-55.17,24.61,12.80,24.61,22.414667,12.896,19.800467,21.739700,662
5969,298,0.92,1.58,0.582278,24.000000,355.670278,395.971111,1.113310,42.95,-22.47,21.35,12.90,21.35,18.785435,13.090,16.997767,21.457600,662
5970,299,0.90,1.60,0.562500,24.000000,337.941111,373.426389,1.105004,40.34,-36.81,21.69,13.02,21.69,20.308444,13.334,17.944000,21.798167,662


__From Lab2, we know there are only 3 classes in our labels as shown below.__

In [12]:
test_y

Unnamed: 0,breath_id,patient,bsa,dta,cough,suction
0,20,292,1,0,0,0
1,21,292,1,0,0,0
2,22,292,0,0,0,0
3,23,292,1,0,0,0
4,24,292,1,0,0,0
...,...,...,...,...,...,...
1242,295,114,0,0,0,0
1243,296,114,0,0,0,0
1244,297,114,0,0,0,0
1245,298,114,0,0,0,0


As before, we encode these labels as a single multiclass target: __non-PVA breaths are class 0, breath stacking is class 1, and double trigger is class 2__.


## Pre-Lab
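It's often helpful to scale data into a common range of values. For neural networks it is essential in practice, and for models that rely on distances or gradient-based optimization, such as SVMs and logistic regression, it frequently improves performance. Tree-based models such as random forests split on one feature at a time, so rescaling a feature has little effect on them. There are several ways to scale your data.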

#### Standardization
A popular method is standardization, where you subtract the mean of a feature and then divide by its standard deviation.

$(x_f - \mu_f) \div \sigma_f$

Here $x_f$ is the feature vector, or more simply, a single column in the pandas dataframe.
$\mu_f$ is the mean of the feature vector, which can be computed in pandas via `data_frame.column_name.mean()`.
$\sigma_f$ is the standard deviation of the feature vector, which can be computed via `data_frame.column_name.std()`. Note that pandas uses the sample standard deviation (`ddof=1`) by default, while scikit-learn's `StandardScaler` uses the population standard deviation (`ddof=0`), so the two differ slightly on small datasets.

Scikit-Learn has a class that does this and also stores the fitted mean and standard deviation, so the same transform can be applied to new data.

```python
from sklearn.preprocessing import StandardScaler

# initialize scaler with default parameters. To play around with class options check out the scikit-learn
# documentation at https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
scaler = StandardScaler()
# Fit the scaler to training data, and then scale the training data.
train_set_scaled = scaler.fit_transform(train_set)
# Transform testing data based on fitted model.
#
# note the difference here! We don't fit our scaler to the test set because this could bias our model.
test_set_scaled = scaler.transform(test_set)
```
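
To connect the formula to the class, the sketch below standardizes a toy frame by hand and compares the result with `StandardScaler`. The column values are made up for illustration; `ddof=0` is passed to pandas so that both computations use the same standard deviation.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy features on very different scales.
toy = pd.DataFrame({
    "tve": [545.0, 531.9, 523.9, 507.6],
    "i_time": [0.80, 0.80, 0.86, 0.80],
})

# (x_f - mu_f) / sigma_f, column by column. ddof=0 matches StandardScaler.
manual = (toy - toy.mean()) / toy.std(ddof=0)
from_sklearn = StandardScaler().fit_transform(toy)

print(np.allclose(manual.to_numpy(), from_sklearn))
```

After scaling, both columns have mean 0 and unit variance, even though the raw values differ by almost three orders of magnitude.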

#### Min-Max
Min-max scaling maps each feature vector into the range [0, 1] by default. The math is again pretty simple.

$(x_f - \min(x_f)) \div (\max(x_f) - \min(x_f))$

Here $\min$ is the minimum value of a feature vector and $\max$ is its maximum value, both taken from the data the scaler was fitted on. You can do this quickly in Scikit-Learn too.

```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
train_set_scaled = scaler.fit_transform(train_set)
test_set_scaled = scaler.transform(test_set)
```
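
One consequence of fitting only on the training set is that test values can land outside [0, 1]. The sketch below shows this on a made-up single feature; the arrays are hypothetical and not drawn from the ventilator data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train = np.array([[10.0], [20.0], [30.0]])   # hypothetical single-feature training data
test = np.array([[15.0], [40.0]])            # 40 lies above the training maximum

scaler = MinMaxScaler().fit(train)

# The same formula written out by hand, using the training min and max.
manual_train = (train - train.min(axis=0)) / (train.max(axis=0) - train.min(axis=0))

train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)
print(train_scaled.ravel(), test_scaled.ravel())
```

The test value 40 maps to 1.5 because the scaler keeps using the training minimum (10) and maximum (30). That is expected behavior, not a bug.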

#### Robust Scaler
Probably the most advanced of these scalers (but not necessarily the best), the robust scaler subtracts the median of the feature vector and then divides by a quantile range (by default the interquartile range, IQR). Because the median and IQR are barely affected by extreme values, this scaler is a good choice when your data has many outliers. Different models may also work better with different scalers, so it is sometimes helpful just to experiment and see what works.

```python
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
train_set_scaled = scaler.fit_transform(train_set)
test_set_scaled = scaler.transform(test_set)
```
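
The effect of an outlier on the robust scaler is easiest to see by computing the median and IQR by hand. This is a sketch on a made-up column; the value 500 is a deliberately extreme, hypothetical outlier.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Hypothetical feature with one extreme outlier.
toy = pd.DataFrame({"max_flow": [40.0, 42.0, 44.0, 46.0, 500.0]})

# (x_f - median) / IQR, where IQR = 75th percentile - 25th percentile.
iqr = toy.quantile(0.75) - toy.quantile(0.25)
manual = (toy - toy.median()) / iqr
from_sklearn = RobustScaler().fit_transform(toy)

print(manual["max_flow"].tolist())
print(np.allclose(manual.to_numpy(), from_sklearn))
```

The four typical values end up between -1 and 0.5, while the outlier stays very large. The robust scaler keeps outliers from distorting the scale of the typical values, but it does not clip or bound them.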

### Implement these three scaling methods separately on your Random Forest, SVM, Logistic Regression from Lab 2, compare and discuss the results. Additionally, implement  Random Forest, SVM, Logistic Regression without any scaling. Compare and discuss results. What did you obsever for Logistic Regression? Why?

In [13]:
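# Use every column except the identifiers. A list comprehension (rather than a set)
# keeps the column order the same on every run.
columns_to_use = [col for col in train_x.columns if col not in ('patient', 'breath_id')]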
t_x = train_x

#### Standardization

In [14]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# Fit the scaler to training data, and then scale the training data.
scaler.fit(t_x[columns_to_use])
train_x_ss = scaler.transform(t_x[columns_to_use])

In [15]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

RF_model = RandomForestClassifier()
RF_model.fit(train_x_ss, train_y_vector)

predictions = RF_model.predict(scaler.transform(test_x[columns_to_use]))

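# A double trigger spans two consecutive breaths, so when a breath is predicted as
# class 2, also label the breath before it as class 2. Skip idx 0 so that
# predictions[-1] (the last breath) is not overwritten by mistake.
for idx, pred in enumerate(predictions):
    if pred == 2 and idx > 0:
        predictions[idx - 1] = 2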

print("RF & Standardization")
RF_report = classification_report(test_y_vector, predictions)
print(RF_report)

RF & Standardization
              precision    recall  f1-score   support

           0       0.90      0.96      0.93       842
           1       0.85      0.97      0.90       301
           2       1.00      0.06      0.11       104

    accuracy                           0.89      1247
   macro avg       0.92      0.66      0.65      1247
weighted avg       0.90      0.89      0.86      1247
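
The post-processing loop is repeated after every model in this notebook, so it could be factored into a helper. The sketch below is one way to do that; `mark_previous_breath` is a hypothetical name, and it returns a copy instead of modifying the predictions in place.

```python
import numpy as np


def mark_previous_breath(predictions, label=2):
    """Return a copy where the breath before each `label` prediction is also `label`."""
    adjusted = np.array(predictions, copy=True)
    for idx in range(1, len(adjusted)):   # start at 1 so index -1 is never touched
        if adjusted[idx] == label:
            adjusted[idx - 1] = label
    return adjusted


# Hypothetical predictions: the first breath is class 2, which must not wrap around.
example = mark_previous_breath([2, 0, 2, 1, 0])
print(example.tolist())
```

This helper treats the predictions as one continuous sequence. It does not know where one patient's breaths end and the next patient's begin, which is also true of the loop in the cells.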



In [16]:
from sklearn.linear_model import LogisticRegression

LR_model = LogisticRegression(penalty='l1',solver='saga')
LR_model.fit(train_x_ss, train_y_vector)

predictions = LR_model.predict(scaler.transform(test_x[columns_to_use]))

for idx, pred in enumerate(predictions):
    if pred == 2 and idx > 0:
        predictions[idx - 1] = 2

print("LR & Standardization")
LR_report = classification_report(test_y_vector, predictions)
print(LR_report)

LR & Standardization
              precision    recall  f1-score   support

           0       0.82      0.87      0.85       842
           1       0.63      0.73      0.68       301
           2       0.33      0.02      0.04       104

    accuracy                           0.77      1247
   macro avg       0.60      0.54      0.52      1247
weighted avg       0.74      0.77      0.74      1247



In [17]:
from sklearn.svm import SVC

SVM_model = SVC()
SVM_model.fit(train_x_ss, train_y_vector)
predictions = SVM_model.predict(scaler.transform(test_x[columns_to_use]))

for idx, pred in enumerate(predictions):
    if pred == 2 and idx > 0:
        predictions[idx-1] = 2

print("SVM & Standardization")
SVM_report = classification_report(test_y_vector, predictions)        
print(SVM_report)

SVM & Standardization
              precision    recall  f1-score   support

           0       0.86      0.88      0.87       842
           1       0.71      0.80      0.76       301
           2       0.44      0.21      0.29       104

    accuracy                           0.80      1247
   macro avg       0.67      0.63      0.64      1247
weighted avg       0.79      0.80      0.79      1247



#### Min-Max

In [18]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(t_x[columns_to_use])
train_x_mm = scaler.transform(t_x[columns_to_use])

In [19]:
RF_model.fit(train_x_mm, train_y_vector)

predictions = RF_model.predict(scaler.transform(test_x[columns_to_use]))

# Adjust predictions (if needed)
for idx, pred in enumerate(predictions):
    if pred == 2 and idx > 0:
        predictions[idx - 1] = 2

print("RF & Min-Max")
RF_report = classification_report(test_y_vector, predictions)
print(RF_report)

RF & Min-Max
              precision    recall  f1-score   support

           0       0.90      0.96      0.93       842
           1       0.85      0.96      0.90       301
           2       0.75      0.12      0.20       104

    accuracy                           0.89      1247
   macro avg       0.84      0.68      0.68      1247
weighted avg       0.88      0.89      0.86      1247



In [20]:
LR_model.fit(train_x_mm, train_y_vector)

predictions = LR_model.predict(scaler.transform(test_x[columns_to_use]))

for idx, pred in enumerate(predictions):
    if pred == 2 and idx > 0:
        predictions[idx - 1] = 2

print("LR & Min-Max")
LR_report = classification_report(test_y_vector, predictions)
print(LR_report)

LR & Min-Max
              precision    recall  f1-score   support

           0       0.87      0.70      0.78       842
           1       0.43      0.81      0.56       301
           2       0.50      0.02      0.04       104

    accuracy                           0.67      1247
   macro avg       0.60      0.51      0.46      1247
weighted avg       0.73      0.67      0.66      1247



In [21]:
SVM_model.fit(train_x_mm, train_y_vector)
predictions = SVM_model.predict(scaler.transform(test_x[columns_to_use]))

for idx, pred in enumerate(predictions):
    if pred == 2 and idx > 0:
        predictions[idx-1] = 2

print("SVM & Min-Max")
SVM_report = classification_report(test_y_vector, predictions)        
print(SVM_report)

SVM & Min-Max
              precision    recall  f1-score   support

           0       0.86      0.89      0.88       842
           1       0.59      0.74      0.66       301
           2       1.00      0.04      0.07       104

    accuracy                           0.78      1247
   macro avg       0.82      0.56      0.54      1247
weighted avg       0.81      0.78      0.76      1247



#### Robust Scaler

In [22]:
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
scaler.fit(t_x[columns_to_use])
train_x_rb = scaler.transform(t_x[columns_to_use])

In [23]:
RF_model.fit(train_x_rb, train_y_vector)

predictions = RF_model.predict(scaler.transform(test_x[columns_to_use]))

# Adjust predictions (if needed)
for idx, pred in enumerate(predictions):
    if pred == 2 and idx > 0:
        predictions[idx - 1] = 2

print("RF & Robust")
RF_report = classification_report(test_y_vector, predictions)
print(RF_report)

RF & Robust
              precision    recall  f1-score   support

           0       0.90      0.96      0.93       842
           1       0.85      0.96      0.90       301
           2       0.71      0.10      0.17       104

    accuracy                           0.89      1247
   macro avg       0.82      0.67      0.67      1247
weighted avg       0.87      0.89      0.86      1247



In [24]:
LR_model.fit(train_x_rb, train_y_vector)

predictions = LR_model.predict(scaler.transform(test_x[columns_to_use]))

for idx, pred in enumerate(predictions):
    if pred == 2 and idx > 0:
        predictions[idx - 1] = 2

print("LR & Robust")
LR_report = classification_report(test_y_vector, predictions)
print(LR_report)

LR & Robust
              precision    recall  f1-score   support

           0       0.80      0.94      0.86       842
           1       0.64      0.54      0.59       301
           2       0.00      0.00      0.00       104

    accuracy                           0.77      1247
   macro avg       0.48      0.49      0.48      1247
weighted avg       0.69      0.77      0.72      1247



In [25]:
SVM_model.fit(train_x_rb, train_y_vector)
predictions = SVM_model.predict(scaler.transform(test_x[columns_to_use]))

for idx, pred in enumerate(predictions):
    if pred == 2 and idx > 0:
        predictions[idx-1] = 2

print("SVM & Robust")
SVM_report = classification_report(test_y_vector, predictions)        
print(SVM_report)

SVM & Robust
              precision    recall  f1-score   support

           0       0.70      0.98      0.82       842
           1       0.48      0.10      0.16       301
           2       0.00      0.00      0.00       104

    accuracy                           0.69      1247
   macro avg       0.39      0.36      0.33      1247
weighted avg       0.59      0.69      0.59      1247



#### Non-scaled

In [26]:
RF_model.fit(train_x[columns_to_use], train_y_vector)
predictions = RF_model.predict(test_x[columns_to_use])
    
for idx, pred in enumerate(predictions):
    if pred == 2 and idx > 0:
        predictions[idx-1] = 2

print("RF & Non-scaled")
RF_report = classification_report(test_y_vector, predictions)
print(RF_report)

RF & Non-scaled
              precision    recall  f1-score   support

           0       0.90      0.96      0.93       842
           1       0.84      0.96      0.90       301
           2       0.50      0.02      0.04       104

    accuracy                           0.88      1247
   macro avg       0.75      0.65      0.62      1247
weighted avg       0.85      0.88      0.85      1247



In [27]:
LR_model.fit(train_x[columns_to_use], train_y_vector)
predictions = LR_model.predict(test_x[columns_to_use])
    
for idx, pred in enumerate(predictions):
    if pred == 2 and idx > 0:
        predictions[idx-1] = 2

print("LR & Non-scaled")
LR_report = classification_report(test_y_vector, predictions)
print(LR_report)

LR & Non-scaled
              precision    recall  f1-score   support

           0       0.76      0.97      0.85       842
           1       0.70      0.38      0.49       301
           2       0.00      0.00      0.00       104

    accuracy                           0.75      1247
   macro avg       0.48      0.45      0.45      1247
weighted avg       0.68      0.75      0.69      1247



In [28]:
SVM_model.fit(train_x[columns_to_use], train_y_vector)
predictions = SVM_model.predict(test_x[columns_to_use])
    
for idx, pred in enumerate(predictions):
    if pred == 2 and idx > 0:
        predictions[idx-1] = 2

print("SVM & Non-scaled")
SVM_report = classification_report(test_y_vector, predictions)
print(SVM_report)

SVM & Non-scaled
              precision    recall  f1-score   support

           0       0.88      0.97      0.92       842
           1       0.76      0.79      0.78       301
           2       0.00      0.00      0.00       104

    accuracy                           0.85      1247
   macro avg       0.55      0.59      0.57      1247
weighted avg       0.77      0.85      0.81      1247



#### Discussion for the results

Random Forest (RF):

Standardization: RF achieved an accuracy of 0.89, with high F1-scores for class 0 (0.93) and class 1 (0.90). For class 2 the precision is 1.00 but the recall is only 0.06, giving an F1-score of 0.11, so the model almost never predicts this class.

Min-Max Scaling: Accuracy is again 0.89, with the same F1-scores for classes 0 and 1. Class 2 is still weak (recall 0.12, F1 0.20).

Robust Scaling: Results match the other two scalers: accuracy 0.89, strong performance for classes 0 and 1, and a low F1-score for class 2 (0.17).

Non-scaled: RF keeps a high accuracy (0.88) but struggles with class 2 (F1 0.04). Overall, scaling makes little difference to RF. The forest is not seeded, so part of the variation in class 2 may be run-to-run randomness rather than an effect of the scaler.

Logistic Regression (LR):

Standardization: LR achieved a lower accuracy (0.77) than RF. The F1-score for class 0 is reasonably good (0.85), but LR is weaker on class 1 (0.68) and very weak on class 2 (0.04).

Min-Max Scaling: LR performs worst with Min-Max scaling, with accuracy falling to 0.67. Class 1 precision drops to 0.43 and the class 2 F1-score stays at 0.04.

Robust Scaling: Accuracy is 0.77, but the class 1 F1-score drops to 0.59 and class 2 is never identified correctly (F1 0.00).

Non-scaled: Accuracy is 0.75. Class 1 recall is only 0.38 and class 2 is again never identified correctly (F1 0.00).

Support Vector Machine (SVM):

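Standardization: SVM performs better than LR but worse than RF, with an accuracy of 0.80. Precision and recall are good for class 0 and moderate for class 1 (F1 0.76). Its class 2 F1-score (0.29) is low but is the highest of any model and scaler combination in this notebook.

Min-Max Scaling: Accuracy is similar to standardization (0.78), but class 2 recall falls from 0.21 to 0.04. Precision for class 2 is 1.00, so the few class 2 predictions are correct, yet the F1-score is only 0.07.

Robust Scaling: This is the worst SVM result, with an accuracy of 0.69. Class 1 recall collapses to 0.10 and class 2 is never identified correctly (F1 0.00).

Non-scaled: SVM reaches its highest accuracy here (0.85), driven by classes 0 and 1, but class 2 is never identified correctly (F1 0.00).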
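
Comparing twelve separate reports by eye is error-prone, so the whole grid can also be collected into one table. The sketch below is illustrative only and rests on several assumptions: it runs on synthetic data from `make_classification` instead of the ventilator dataset, it uses scikit-learn's default `LogisticRegression` settings (with a larger `max_iter`) rather than the L1/`saga` settings above, it seeds the random forest, and it skips the double-trigger post-processing step.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (NOT the ventilator dataset): 3 imbalanced classes.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=5, n_classes=3,
    weights=[0.7, 0.2, 0.1], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

scalers = {
    "none": None,
    "standard": StandardScaler,
    "min-max": MinMaxScaler,
    "robust": RobustScaler,
}
models = {
    "RF": lambda: RandomForestClassifier(random_state=0),
    "LR": lambda: LogisticRegression(max_iter=1000),
    "SVM": lambda: SVC(),
}

rows = []
for scaler_name, scaler_cls in scalers.items():
    for model_name, make_model in models.items():
        # A pipeline fits the scaler on the training data only, then the model.
        steps = [make_model()] if scaler_cls is None else [scaler_cls(), make_model()]
        pipeline = make_pipeline(*steps)
        pipeline.fit(X_train, y_train)
        predicted = pipeline.predict(X_test)
        rows.append({
            "scaler": scaler_name,
            "model": model_name,
            "accuracy": accuracy_score(y_test, predicted),
            "macro_f1": f1_score(y_test, predicted, average="macro"),
        })

# One row per model, one column per scaler.
results = pd.DataFrame(rows).pivot(index="model", columns="scaler", values="macro_f1")
print(results.round(2))
```

Macro F1 is used for the table because accuracy alone hides how poorly a minority class is handled, which is the main pattern in the reports above.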

#### What did you observe for Logistic Regression? Why?

Standardization:

LR performs best with standardization, although the result is still not good.
Its accuracy (0.77) ties with robust scaling and is higher than with Min-Max scaling (0.67) or no scaling (0.75).
The precision and recall for class 0 are reasonably good, suggesting that LR can correctly identify this class.
However, LR struggles with class 2, where the F1-score is only 0.04.

Min-Max Scaling:

LR performs worst with Min-Max scaling, achieving the lowest accuracy (0.67) of the four options.
Class 1 precision falls to 0.43, and the F1-score for class 2 remains very low (0.04).

Robust Scaling:

With robust scaling LR matches standardization in accuracy (0.77) but is worse on the minority classes: the class 1 F1-score drops from 0.68 to 0.59.
Class 2 is never identified correctly (F1 0.00).

Non-scaled:

Without any scaling, LR reaches a higher accuracy (0.75) than with Min-Max scaling but a lower one than with standardization or robust scaling.
As with robust scaling, class 2 is never identified correctly (F1 0.00).

Based on the results, LR has a poor performance for classifying class 2, but it is acceptable for class 0 and class 1.

#### Conjectures about the causes of LR's performance

Logistic Regression (LR) is a linear model commonly used for binary and multiclass classification tasks. With LR, the model learns the relationship between the input features and the target variable by fitting a logistic function (sigmoid) to the linear combination of the features. It then makes predictions by estimating the probability of an instance belonging to a specific class.

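Like many other machine learning algorithms, LR can be sensitive to the scale of the input features. The model here uses L1 regularization, which applies the same penalty to every coefficient, so a feature measured in the hundreds (such as `tve`) and one below 1 (such as `i_time`) are effectively regularized differently. The `saga` solver is also only guaranteed to converge quickly when features are on roughly the same scale, and because the first cell silences all warnings, any convergence warning would have been hidden. Standardization (mean centering and scaling to unit variance) and Min-Max scaling (scaling to a specified range, typically [0, 1]) are common scaling techniques used with logistic regression.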

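Robust scaling (using the median and interquartile range) does not bound its output, so breaths with outlying values still produce very large inputs, and with no scaling at all the raw features differ by orders of magnitude. Either situation may hurt LR's ability to fit and generalize, which is consistent with class 2 never being identified correctly (F1 0.00) in both cases.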

Additionally, LR assumes a linear relationship between the features and the log-odds of the target variable. If this assumption does not hold, LR may struggle to model the relationship accurately. In cases where the features have complex nonlinear relationships with the target variable, LR may underperform compared to more flexible models like RF or SVM.

Most importantly, class 2 is a small minority: only 104 of the 1,247 validation breaths (about 8%) are double triggers. Assuming the training set is similarly imbalanced, LR has few examples from which to learn a decision boundary for that class. Imbalanced classes can lead to poor model performance, and scaling alone does not address that.
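
The size of the imbalance can be quantified from the `support` column of the reports. The sketch below does that and also computes the weights scikit-learn would assign under `class_weight="balanced"`, an option accepted by `LogisticRegression`, `SVC`, and `RandomForestClassifier`. Treat it as a possible follow-up rather than something this notebook tested: the counts are from the validation set, and the training-set counts are assumed to be similar.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Class counts taken from the `support` column of the validation reports above.
support = {0: 842, 1: 301, 2: 104}

# Rebuild a label vector with those counts so the weights can be computed.
y_like = np.repeat(list(support.keys()), list(support.values()))

# Share of each class in the validation set.
share = {label: count / len(y_like) for label, count in support.items()}

# "balanced" weights are n_samples / (n_classes * class_count).
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1, 2]), y=y_like)

print({label: round(value, 3) for label, value in share.items()})
print(dict(zip([0, 1, 2], weights.round(2))))
```

Under this weighting, an error on a class 2 breath would cost roughly eight times as much as an error on a class 0 breath, which is one way a model can be pushed to pay attention to the minority class.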

Reference: https://www.sciencedirect.com/topics/computer-science/logistic-regression#:~:text=Logistic%20regression%20is%20a%20process,%2Fno%2C%20and%20so%20on.