# EEC 193A Pre Project B1: Improve your ML model

## Introduction

This asignment follows Lab2 and uses the same dataset as Lab2. 

### Reading Data into Memory

In [36]:
# make sure we can plot in future if we want
%matplotlib notebook
# make sure to ignore warnings
import warnings
warnings.simplefilter('ignore')
# Import statement for pandas
import pandas as pd
# This is just a small configuration change for purposes of the class
pd.options.display.max_rows = 10

# Get our train X and y datasets for the problem
train_x = pd.read_csv('ece193a_pva_train_x.csv')
train_y = pd.read_csv('ece193a_pva_train_y.csv')

# Get our validation X and y datasets for the problem. 
test_x = pd.read_csv('ece193a_pva_validation_x.csv')
test_y = pd.read_csv('ece193a_pva_validation_y.csv')

In [37]:
# output some rows of the dataset just to get a better feel for the information
train_x

Unnamed: 0,breath_id,i_time,tve,max_flow,min_flow,max_pressure,peep,ip_auc,ep_auc,patient
0,1,0.80,545.032222,51.06,-41.03,17.37,7.600,11.122367,16.057733,66
1,2,0.80,531.880278,53.13,-39.97,17.13,7.508,11.077750,17.310533,66
2,3,0.86,523.876667,52.86,-38.24,17.11,7.658,12.066000,16.697800,66
3,4,0.80,507.636111,51.04,-39.37,17.14,7.572,11.097800,15.774250,66
4,5,0.80,518.618889,47.88,-38.51,16.92,7.598,11.065400,18.483333,66
...,...,...,...,...,...,...,...,...,...,...
5970,296,0.90,355.365278,42.26,-51.51,23.53,13.194,19.216400,21.816367,662
5971,297,0.90,316.806944,42.10,-55.17,24.61,12.896,19.800467,21.739700,662
5972,298,0.92,395.971111,42.95,-22.47,21.35,13.090,16.997767,21.457600,662
5973,299,0.90,373.426389,40.34,-36.81,21.69,13.334,17.944000,21.798167,662


Here we have 18 columns. I'm going to give a detailed breakdown here. Feel free to come back to it as necessary. 

Features:
 * *breath_id* - matches with a specific breath identifier from the raw data file.
 * *patient* - the patient the data came from
 * *min_flow* - The minimum flow observation on the breath
 * *max_flow* - The maximum flow observation on the breath
 * *tvi* - The inhaled volume of air for each breath
 * *tve* - The exhaled volume of air for each breath
 * *tve_tvi_ratio* - The ratio of `tve / tvi`
 * *i_time* - The amount of time patient was breathing in for each breath
 * *e_time* - The amount of time patient was breathing out for each breath
 * *ie_ratio* - The ratio of `i_time / e_time`
 * *rr* - The respiratory rate in number of breaths per minute. Measured by `60 / (i_time + e_time)`
 * *min_pressure* - the minimum pressure observation on the breath
 * *max_pressure* - the maximum pressure observation on the breath
 * *peep* - the baseline pressure setting on the ventilator
 * *pip* - the maximum pressure setting of inspiration. Slight difference compared to max_pressure
 * *maw* - the mean pressure for the entire breath
 * *ip_auc* - the area under the curve of the inspiratory pressure
 * *ep_auc* - the area under the curve of the expiratory pressure

## Featurization

Featurization is the process where you extract information from raw data. This information can then be fed into a machine learning algorithm to perform the task you want. In the current case we will need to extract additional information from the ventilator data in order to create a valid machine learning classifier.

### Processing the Data
The first step we need to do is to be able to read the raw data files and put them into memory. We have taken this problem away from you for the purposes of this homework and have given you the code so that you can do this

In [38]:
import csv


def process_ventilator_data(filename):
    descriptor = open(filename)
    reader = csv.reader(descriptor)
    breath_id = 1
    
    all_breath_data = []
    current_flow_data = []
    current_pressure_data = []
    
    for row in reader:
        if (row[0].strip() == 'BS' or row[0].strip() == 'BE') and current_flow_data != []:
            all_breath_data.append({'breath_id': breath_id, 'flow': current_flow_data, 'pressure': current_pressure_data})
            breath_id += 1
            current_flow_data = []
            current_pressure_data = []
        else:
            try:
                current_flow_data.append(round(float(row[0]), 2))
                current_pressure_data.append(round(float(row[1]), 2))
            except (IndexError, ValueError):
                continue
    return all_breath_data

In [39]:
from glob import glob
import os

# import for Simpson's method. This will be helpful for calculating TVi
from scipy.integrate import simps
from statistics import mean


def extract_features_for_file(filename, existing_features):
    """
    Extract features for every single breath in file. To make matters a bit easier, we use
    existing features that we've already extracted from the file to help speed the process.
    """
    patient = filename.split('/')[-2]
    all_breath_data = process_ventilator_data(filename)
    all_features = []

    for breath_data in all_breath_data:
        breath_id = breath_data['breath_id']
        existing_breath_features = existing_features[existing_features.breath_id == breath_id].iloc[0]

        flow = breath_data['flow']
        pressure = breath_data['pressure']
        
        # inspiratory time (the amount of time a patient is inhaling for)
        i_time = existing_breath_features.i_time
        # exhaled tidal volume
        tve = existing_breath_features.tve
        # maximum flow for breath
        max_flow = existing_breath_features.max_flow
        # minimum flow for the breath
        min_flow = existing_breath_features.min_flow
        # maximum pressure for the breath
        max_pressure = existing_breath_features.max_pressure
        # The minimum pressure setting on the ventilator
        peep = existing_breath_features.peep
        # The area under the curve of the inspiratory pressure curve
        ip_auc = existing_breath_features.ip_auc
        # The area under the curve of the expiratory pressure curve
        ep_auc = existing_breath_features.ep_auc
        
        # This is the array index where the inhalation ends. We divide by 0.02 because
        # thats how frequently the ventilator samples data, every 0.02 seconds.
        x0_index = int(i_time / 0.02)
        
        # Part of your assignment is to extract the following features for all breaths:
        #
        # Expiratory Time. The amount of time a patient is exhaling
        e_time = len(flow) * .02 - i_time
        #
        # I:E ratio. The ratio of inspiratory to expiratory time. Measured by i_time/e_time
        i_e_ratio = i_time / e_time
        #
        # Respiratory rate. The number of breaths a patient is breathing. This is measured by
        # 60 / (total breath time in seconds)
        rr = 60 / (i_time + e_time)
        rr = 60 / (len(flow) * .02)
        #
        # Tidal volume inhaled. The amount of air volume inhaled in the breath. 
        # Hint: use the simps function.
        # This will output volume in L/min, convert to ml/sec (* 1000 / 60)
        tvi = simps(flow[:x0_index], dx=0.02) * 1000 / 60
        # 
        # Tidal volume ratio. Measured by tve/tvi
        tve_tvi_ratio = tve / tvi
        #
        # Minimum pressure of the breath
        min_pressure = min(pressure)
        #
        # PIP - peak inspiratory pressure. The peak pressure during inhalation
        pip = max(pressure[:x0_index])
        #
        # MAW - mean airway pressure for inhalation.
        maw = mean(pressure[:x0_index])
        
        all_features.append([
            breath_id, i_time, e_time, i_e_ratio, rr, tvi, tve, tve_tvi_ratio,
            max_flow, min_flow, max_pressure, min_pressure, pip, maw, 
            peep, ip_auc, ep_auc, int(patient) 
        ])
    columns = [
        'breath_id', 'i_time', 'e_time', 'i_e_ratio', 'rr', 'tvi', 'tve',
        'tve_tvi_ratio', 'max_flow', 'min_flow', 'max_pressure',
        'min_pressure', 'pip', 'maw', 'peep', 'ip_auc', 'ep_auc', 'patient'
    ]
    return all_features, columns


def remake_dataset(dataset):
    data_files = glob(os.path.join('data', '*/*.csv'))
    
    patient_to_file_map = {}
    for filename in data_files:
        patient = filename.split('/')[-2]  # patient is embedded in this part of filename
        patient_to_file_map[patient] = filename

    data = []
    # iterate over all the unique patients in the train set
    for patient in dataset.patient.unique():
        existing_features = dataset[dataset.patient == patient]
        filename = patient_to_file_map[str(patient)]
        breath_data, columns = extract_features_for_file(filename, existing_features)
        # add breath rows
        data.extend(breath_data)
    # create new data frame with the new added information
    return pd.DataFrame(data, columns=columns)

In [40]:
# remake train set
train_x = remake_dataset(train_x)
# remake validation set.
test_x = remake_dataset(test_x)

KeyError: '66'

In [41]:
train_x

Unnamed: 0,breath_id,i_time,tve,max_flow,min_flow,max_pressure,peep,ip_auc,ep_auc,patient
0,1,0.80,545.032222,51.06,-41.03,17.37,7.600,11.122367,16.057733,66
1,2,0.80,531.880278,53.13,-39.97,17.13,7.508,11.077750,17.310533,66
2,3,0.86,523.876667,52.86,-38.24,17.11,7.658,12.066000,16.697800,66
3,4,0.80,507.636111,51.04,-39.37,17.14,7.572,11.097800,15.774250,66
4,5,0.80,518.618889,47.88,-38.51,16.92,7.598,11.065400,18.483333,66
...,...,...,...,...,...,...,...,...,...,...
5970,296,0.90,355.365278,42.26,-51.51,23.53,13.194,19.216400,21.816367,662
5971,297,0.90,316.806944,42.10,-55.17,24.61,12.896,19.800467,21.739700,662
5972,298,0.92,395.971111,42.95,-22.47,21.35,13.090,16.997767,21.457600,662
5973,299,0.90,373.426389,40.34,-36.81,21.69,13.334,17.944000,21.798167,662


### Create Ground Truth (that the machine understands)

In [42]:
# Read the test dataset and set it up. Technically we're using the validation set.
test_y

Unnamed: 0,breath_id,patient,bsa,dta,cough,suction
0,20,292,1,0,0,0
1,21,292,1,0,0,0
2,22,292,0,0,0,0
3,23,292,1,0,0,0
4,24,292,1,0,0,0
...,...,...,...,...,...,...
1242,295,114,0,0,0,0
1243,296,114,0,0,0,0
1244,297,114,0,0,0,0
1245,298,114,0,0,0,0


What does this mean?

We have 6 columns here
 * *breath_id* - matches with a specific breath identifier from the raw data file.
 * *patient* - the patient the data came from
 * *bsa* - Breath Stacking Asynchrony. A single breath where the patient is trapping air in their chest
 * *dta* - Double Trigger Asynchrony. Two breaths in a row where the patient is trapping air
 * *cough* - What it sounds like, when a patient coughs
 * *suction* -  Nurses perform suction procedures to remove excess fluid from an endotracheal tube. This waveform is indicative of that. 
 
Now that we understand what our columns are, we need to put it into a format where the machine can understand it and create a learning model. Because this is a multiclass model, let's just have non-PVA breaths be class 0, breath stacking can be class 1, double trigger can be class 2.

In [43]:
# Create a multi-class y vector that we can use for training purposes. 
train_y_vector = train_y.bsa * 1 + train_y.dta * 2
test_y_vector = test_y.bsa * 1 + test_y.dta * 2
test_y_vector

0       1
1       1
2       0
3       1
4       1
       ..
1242    0
1243    0
1244    0
1245    0
1246    0
Length: 1247, dtype: int64

In [44]:
# See if there places where the data was mis-annotated, where both double trigger and breath stack was annotated.
# It's just good to know if this is happening or not so that we can either drop the data, or change it later on.
train_y_vector[train_y_vector > 2]

5438    3
5440    3
5521    3
dtype: int64

### Creating a Model


In [105]:
# Need to finalize dataset and remove misannotated examples first.

# just drop places where data is double annotated. 
misannotated_train = train_y_vector > 2
misannotated_test = test_y_vector > 2

# ~ is the NOT operator
train_x = train_x.loc[~misannotated_train]
train_y_vector = train_y_vector.loc[~misannotated_train]

# do same thing for test 
test_x = test_x.loc[~misannotated_test]
test_y_vector = test_y_vector.loc[~misannotated_test]



# Also make sure to drop data that is NaN. This is very important because otherwise your model won't train.
# The .any(axis=1) function basically says, if there are any nans in this *ROW* then mark the row as true.
# The .any(axis=0) would mark columns as True/False, but this isn't helpful now.
nans_train = train_x.isna().any(axis=1)
nans_test = test_x.isna().any(axis=1)

# now filter them out of the dataset in the same way
train_x = train_x.loc[~nans_train]
train_y_vector = train_y_vector.loc[~nans_train]

test_x = test_x.loc[~nans_test]
test_y_vector = test_y_vector.loc[~nans_test]


# any time we drop things from a data frame or series in pandas it is often helpful to re-index the object.
# the index is usually a sequential ordering of the rows like 1, 2, ... n. Sometimes it can be different
# but for now we'll just use sequential ordering
train_x.index = range(len(train_x))
train_y_vector.index = range(len(train_y_vector))

test_x.index = range(len(test_x))
test_y_vector.index = range(len(test_y_vector))

In [46]:
train_x

Unnamed: 0,breath_id,i_time,tve,max_flow,min_flow,max_pressure,peep,ip_auc,ep_auc,patient
0,1,0.80,545.032222,51.06,-41.03,17.37,7.600,11.122367,16.057733,66
1,2,0.80,531.880278,53.13,-39.97,17.13,7.508,11.077750,17.310533,66
2,3,0.86,523.876667,52.86,-38.24,17.11,7.658,12.066000,16.697800,66
3,4,0.80,507.636111,51.04,-39.37,17.14,7.572,11.097800,15.774250,66
4,5,0.80,518.618889,47.88,-38.51,16.92,7.598,11.065400,18.483333,66
...,...,...,...,...,...,...,...,...,...,...
5967,296,0.90,355.365278,42.26,-51.51,23.53,13.194,19.216400,21.816367,662
5968,297,0.90,316.806944,42.10,-55.17,24.61,12.896,19.800467,21.739700,662
5969,298,0.92,395.971111,42.95,-22.47,21.35,13.090,16.997767,21.457600,662
5970,299,0.90,373.426389,40.34,-36.81,21.69,13.334,17.944000,21.798167,662


__From Lab2, we know there are only 3 classes in our labels as shown below.__

In [47]:
test_y

Unnamed: 0,breath_id,patient,bsa,dta,cough,suction
0,20,292,1,0,0,0
1,21,292,1,0,0,0
2,22,292,0,0,0,0
3,23,292,1,0,0,0
4,24,292,1,0,0,0
...,...,...,...,...,...,...
1242,295,114,0,0,0,0
1243,296,114,0,0,0,0
1244,297,114,0,0,0,0
1245,298,114,0,0,0,0


__What does this mean?__

We have 6 columns here
 * *breath_id* - matches with a specific breath identifier from the raw data file.
 * *patient* - the patient the data came from
 * *bsa* - Breath Stacking Asynchrony. A single breath where the patient is trapping air in their chest
 * *dta* - Double Trigger Asynchrony. Two breaths in a row where the patient is trapping air
 * *cough* - What it sounds like, when a patient coughs
 * *suction* -  Nurses perform suction procedures to remove excess fluid from an endotracheal tube. This waveform is indicative of that. 
 
Now that we understand what our columns are, we need to put it into a format where the machine can understand it and create a learning model. Because this is a multiclass model, let's just have __non-PVA breaths be class 0, breath stacking can be class 1, double trigger can be class 2__.


## Assignment \#1 Scaling
It's often helpful to have data scaled into a certain range of values. For neural networks it is essential, and for random forests it very frequently helps improve performance. There are multiple different ways you can scale your data. 

#### Standardization
A popular method is standardization where you subtract the mean of feature and then divide by its standard deviation. 

$(x_f - \mu_f) \div \sigma_f$

Where $x_f$ is the feature vector, or more simply, a single column in the pandas dataframe.
$\mu_f$ is the mean of the feature vector. Which can also be computed in pandas via `data_frame.column_name.mean()`
$\sigma_f$ is the standard deviation of the feature vector. Which can be computed `data_frame.column_name.std()`. 

Scikit-Learn has a class to do this that also saves your coefficients.

```python
from sklearn.preprocessing import StandardScaler

# initialize scaler with default parameters. To play around with class options check out the scikit-learn 
# documentation at https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
scaler = StandardScaler()
# Fit the scaler to training data, and then scale the training data.
train_set_scaled = scaler.fit_transform(train_set)
# Transform testing data based on fitted model.
#
# note the difference here! We don't fit our scaler to the test set because this could bias our model.
test_set_scaled = scaler.transform(test_set)
```

#### Min-Max
Min max scaling natively scales all feature vectors between 0 and 1. The math doing this is again pretty simple.

$(x_f - min(x_f)) \div (max(x_f) - min(x_f))$

Where the `min` function is just finding the minimum value of a feature vector, and the `max` function is finding the maximum value of a feature vector. You can do this quickly in Scikit-Learn too.

```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
train_set_scaled = scaler.fit_transform(train_set)
test_set_scaled = scaler.transform(test_set)
```

#### Robust Scaler
Probably the most advanced out of these scalers (but not necessarily better), the robust scaler removes the median of the feature vector, and then scales it according to a quantile range (by default the IQR). This scaler is strong if your data has large amounts of outliers. Different models may also be good with different scalers. Sometimes it is helpful just to play around with your model and see what works.

```python
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
train_set_scaled = scaler.fit_transform(train_set)
test_set_scaled = scaler.transform(test_set)
```

### Implement these three scaling methods separately on your Random Forest, the model you selected in Assignment 3 from Lab 2, compare and discuss the results.

In [108]:
#######################################
#Random Forest with StandardScaler
#######################################
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
columns_to_use = list(set(train_x.columns).difference(['patient', 'breath_id']))
train_set_scaled= scaler.fit_transform(train_x[columns_to_use])
test_set_scaled = scaler.transform(test_x[columns_to_use])

from sklearn.ensemble import RandomForestClassifier

# initialize model
model = RandomForestClassifier()

# don't use patient and breath_id columns
#columns_to_use = list(set(train_x.columns).difference(['patient', 'breath_id']))

# fit the model to training 
model.fit(train_set_scaled, train_y_vector)

# Now that the model is fitted, evaluate how well it is performing
predictions = model.predict(test_set_scaled)

for idx, pred in enumerate(predictions):
    if pred == 2:
        predictions[idx-1] = 2
        
from sklearn.metrics import classification_report

print(classification_report(test_y_vector, predictions))

              precision    recall  f1-score   support

           0       0.84      0.65      0.73       842
           1       0.43      0.84      0.57       301
           2       1.00      0.02      0.04       104

    accuracy                           0.64      1247
   macro avg       0.76      0.50      0.45      1247
weighted avg       0.76      0.64      0.64      1247



In [109]:
#######################################
#####Logistic Regression with StandardScaler
#######################################
from sklearn.linear_model import LogisticRegression

modelLR = LogisticRegression()
# fit the model to training 
modelLR.fit(train_set_scaled, train_y_vector)

# Now that the model is fitted, evaluate how well it is performing
predictions = modelLR.predict(test_set_scaled)
for idx, pred in enumerate(predictions):
    if pred == 2:
        predictions[idx-1] = 2
from sklearn.metrics import classification_report

print(classification_report(test_y_vector, predictions))

              precision    recall  f1-score   support

           0       0.87      0.64      0.73       842
           1       0.39      0.81      0.53       301
           2       0.50      0.02      0.04       104

    accuracy                           0.63      1247
   macro avg       0.59      0.49      0.43      1247
weighted avg       0.72      0.63      0.63      1247



In [110]:
#######################################
#Random Forest with Min-Max
#######################################
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
train_set_scaled = scaler.fit_transform(train_x)
test_set_scaled = scaler.transform(test_x)

from sklearn.ensemble import RandomForestClassifier

# initialize model
model = RandomForestClassifier()

# don't use patient and breath_id columns
#columns_to_use = list(set(train_x.columns).difference(['patient', 'breath_id']))

# fit the model to training 
model.fit(train_set_scaled, train_y_vector)

# Now that the model is fitted, evaluate how well it is performing
predictions = model.predict(test_set_scaled)

for idx, pred in enumerate(predictions):
    if pred == 2:
        predictions[idx-1] = 2
from sklearn.metrics import classification_report

print(classification_report(test_y_vector, predictions))

              precision    recall  f1-score   support

           0       0.84      0.64      0.72       842
           1       0.42      0.85      0.57       301
           2       1.00      0.02      0.04       104

    accuracy                           0.64      1247
   macro avg       0.75      0.50      0.44      1247
weighted avg       0.75      0.64      0.63      1247



In [111]:
#######################################
#Logistic Regression with Min-Max
#######################################
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

modelLR = LogisticRegression()
# fit the model to training 
modelLR.fit(train_set_scaled, train_y_vector)

# Now that the model is fitted, evaluate how well it is performing
predictions = modelLR.predict(test_set_scaled)
for idx, pred in enumerate(predictions):
    if pred == 2:
        predictions[idx-1] = 2
from sklearn.metrics import classification_report

print(classification_report(test_y_vector, predictions))

              precision    recall  f1-score   support

           0       0.88      0.60      0.71       842
           1       0.37      0.84      0.52       301
           2       0.00      0.00      0.00       104

    accuracy                           0.61      1247
   macro avg       0.42      0.48      0.41      1247
weighted avg       0.69      0.61      0.61      1247



In [112]:
#####################################
#Random Forest with Robust Scaler
#####################################
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
train_set_scaled = scaler.fit_transform(train_x)
test_set_scaled = scaler.transform(test_x)
from sklearn.ensemble import RandomForestClassifier

# initialize model
model = RandomForestClassifier()

# don't use patient and breath_id columns
#columns_to_use = list(set(train_x.columns).difference(['patient', 'breath_id']))

# fit the model to training 
model.fit(train_set_scaled, train_y_vector)

# Now that the model is fitted, evaluate how well it is performing
predictions = model.predict(test_set_scaled)

for idx, pred in enumerate(predictions):
    if pred == 2:
        predictions[idx-1] = 2
from sklearn.metrics import classification_report

print(classification_report(test_y_vector, predictions))

              precision    recall  f1-score   support

           0       0.84      0.66      0.74       842
           1       0.44      0.85      0.58       301
           2       1.00      0.02      0.04       104

    accuracy                           0.65      1247
   macro avg       0.76      0.51      0.45      1247
weighted avg       0.76      0.65      0.64      1247



In [113]:
#####################################
#Logistic Regression with Robust Scaler
#####################################
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

modelLR = LogisticRegression()
# fit the model to training 
modelLR.fit(train_set_scaled, train_y_vector)

# Now that the model is fitted, evaluate how well it is performing
predictions = modelLR.predict(test_set_scaled)
for idx, pred in enumerate(predictions):
    if pred == 2:
        predictions[idx-1] = 2
from sklearn.metrics import classification_report

print(classification_report(test_y_vector, predictions))

              precision    recall  f1-score   support

           0       0.88      0.57      0.69       842
           1       0.36      0.84      0.51       301
           2       0.55      0.06      0.10       104

    accuracy                           0.59      1247
   macro avg       0.60      0.49      0.43      1247
weighted avg       0.73      0.59      0.60      1247



## Assignment \#2 Feature Selection

One thing to note is that we are using 16 different features for input into our model. Some of these features can be of little value to classifying whether a breath is asynchronous or not. So, one of the easiest things we can do for ourselves is to reduce the number of features that we have in an intelligent way.

### $\chi^2$ Feature Selection (chi squared)

Probably one of the easiest methods and intuitive methods to use for feature selection in classification problems. The [$\chi^2$ test](https://en.wikipedia.org/wiki/Chi-squared_test) measures whether a two statistical distributions are independent. In t[he applied case](https://nlp.stanford.edu/IR-book/html/htmledition/feature-selectionchi2-feature-selection-1.html), this means asking the question of whether a single feature is independent of the target vector. If a feature and the outcome are independent then this variable might not be helpful for our model. If a feature is independent of the outcome it will have a high chi2 value and a high pvalue. On the other hand, if a feature is not independent of the outcome, then it will have a high chi2 value and a low p-value (within range of 0-.05).

There is a function in scikit-learn that enables you to do the $\chi^2$ test.

In [48]:
# this is the PrettyPrint function. Just makes things look a bit nicer on output.
from pprint import pprint

from sklearn.feature_selection import chi2
from sklearn.preprocessing import MinMaxScaler

# get all columns in our dataset except patient and breath_id
columns_to_use = list(set(train_x.columns).difference(['patient', 'breath_id']))

# must scale feature vectors so they are non-negative
scaler = MinMaxScaler()
train_set = scaler.fit_transform(train_x[columns_to_use])

# the chi2 test will output two things, chi2 and p values. The p values are the most relevant item that we want
# to use. A feature with a p-value between 0 and 0.05 means that a feature might be a good predictor of our outcome.
chi2_vals, pvals = chi2(train_set, train_y_vector)

# mash column names with p-values so we know which p-value belongs to which feature
cols_to_pvals = zip(pvals, columns_to_use)
# Sort the p-values in ascending order (smallest first).
cols_sorted = sorted(cols_to_pvals)
# pretty print the sorted values.
pprint(cols_sorted)

[(2.0845956610568945e-15, 'ep_auc'),
 (4.1963626094160874e-10, 'tve'),
 (0.01875342252606717, 'i_time'),
 (0.05675212148048651, 'ip_auc'),
 (0.15346609771450404, 'max_pressure'),
 (0.4176834544812503, 'min_flow'),
 (0.4539978193631702, 'peep'),
 (0.6801265398971507, 'max_flow')]


There are 2 features that had p-values below 0.05:

 * tve 
 * ep_auc 
 
So let's use these features for our next model.

In [115]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import MinMaxScaler

model = RandomForestClassifier()
columns_to_use = ['tve', 'ep_auc']

scaler = MinMaxScaler()
train_set = scaler.fit_transform(train_x[columns_to_use])
test_set = scaler.transform(test_x[columns_to_use])

model.fit(train_set, train_y_vector)
predictions = model.predict(test_set)
for idx, pred in enumerate(predictions):
    if pred == 2:
        predictions[idx-1] = 2
from sklearn.metrics import classification_report
print(classification_report(test_y_vector, predictions))

              precision    recall  f1-score   support

           0       0.77      0.78      0.77       842
           1       0.40      0.48      0.44       301
           2       0.07      0.02      0.03       104

    accuracy                           0.64      1247
   macro avg       0.41      0.43      0.41      1247
weighted avg       0.62      0.64      0.63      1247



Our performance actually dropped when we were trying to use $\chi^2$ test. Does this mean that the $\chi^2$ method isn't good for our problem?

What is happening above?

Even though the $\chi^2$ test is telling us these features are relevant to prediction, this just isn't the case in  the test set. This can happen frequently in machine learning, where information that is relevant to the training set doesn't generalize to the testing set. Are there other methods of feature selection which are more likely to generalize to the testing set?

### Expert Feature Selection

It always helps to have expert knowledge on the problem to improve model performance. In this case expert knowledge can be considered medical knowledge. So what kind of medical knowledge can we use to help this?

#### Breath Stack (BSA)
Remember the waveforms here? This means that the patient is trapping air in their chest. We can measure this via the `tve_tvi_ratio`. The way that our doctors annotated breaths was if the breaths had a `tve_tvi_ratio < .9` and they weren't a suction/cough or another anomaly.

<img src="bsa-breath.png" alt="Drawing" style="width: 400px;"/>

#### Double Trigger (DTA)
Double trigger has a double-hump pattern to it. 

<img src="dta-breaths.png" alt="Drawing" style="width: 600px;"/>

The way our doctors annotated it was if 

1. It wasn't an anomaly
2. First breath in sequence had an `e_time < .3` seconds
3. First breath in sequence had `tve_tvi_ratio < .25` OR first breath had `0.25 <= tve_tvi_ratio < 0.5` and `tve < 100`

Knowing this which features can we use here?

In [114]:
############################################
#Assignment2
############################################

#Expert Feature Selection
model = RandomForestClassifier()
columns_to_use = list(set(train_x.columns).difference(['patient', 
                                                       'breath_id'], 
                                                      ['tve'],
                                                     ['max_flow'], 
                                                      ['min_flow'], 
                                                      ['max_pressure'], 
                                                      ['peep'], 
                                                      ['breath_id'], 
                                                      ['i_time']))

# pick features based on expert selection. left for reader to determine best columns
train_set = train_x[columns_to_use]
test_set = test_x[columns_to_use]

model.fit(train_set, train_y_vector)
predictions = model.predict(test_set)
for idx, pred in enumerate(predictions):
    if pred == 2:
        predictions[idx-1] = 2
from sklearn.metrics import classification_report
print(classification_report(test_y_vector, predictions))

              precision    recall  f1-score   support

           0       0.78      0.50      0.61       842
           1       0.33      0.76      0.46       301
           2       0.17      0.02      0.03       104

    accuracy                           0.52      1247
   macro avg       0.43      0.43      0.37      1247
weighted avg       0.62      0.52      0.53      1247



### Other Methods

You are welcome to use other methods / mathematical functions for feature selection as well. I will briefly outline some of them here.


#### Wrapper Methods

This performs feature selection by brute force. Using your validation set, train many models with every single possible feature combination you can have. Determination of which features work best can be chosen based on the best performing model. Then you can apply this model to your testing set to determine performance. 

Pros:
 * easy to understand
 * easy to code
 
Cons:
 * prone to overfitting
 * is time consuming. Must train $n!$ models if $n$ is the number of features.
 
#### PCA (Principal Component Analysis)

This method utilizes the [principal component analysis algorithm](https://en.wikipedia.org/wiki/Principal_component_analysis) to transform your dataset and generate new features that are independent of each other. The user gets to choose the number of features that are generated, and often modelers choose to generate an increasing number of features, and then train a new model for each PCA run while determining the performance of each model.

Pros:
 * Dimensionality reduction will cause models to train faster
 * Generated features are linearly uncorrelated with each other
 * Easy to utilize because there are multiple existing functions for this, like in [sckit-learn.](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html)
 
Cons:
 * Loss of information in your data will likely occur and may cause performance degradation
 * Human comprehension of features is lost when PCA is performed

#### Mutual Information

[Mutual Information](https://en.wikipedia.org/wiki/Mutual_information) is similar to $\chi^2$ feature selection and measures the dependency between two variables. For machine learning this dependency can be measured between a feature and the target. It is equal to zero if and only if two random variables are independent, and higher values mean higher dependency.

Pros:
 * Fast
 * Supported by [Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_classif.html#r50b872b699c4-1)
 
Cons:
 * Like $\chi^2$ may not generalize to the test set.

#### Mixed Methods

It is possible use a variety of methods in combination with each other. Generally expert feature selection is the first method used and then additional synthetic methods are added on top of this. Ultimately as the modeler, this work is on you to figure out how to perform best. One method might work for one problem and then completely fail for another. This is why it is often best to utilize as many possible methods as possible when performing modeling and only declare a winner when all other possible methods have been explored. It is critical to always beware of overfitting. Have a good validation set to evaluate your model, and don't pick your best methods versus your testing set. This is almost guaranteed to lead to overfitting.

### Finish Expert Feature Selection & Find another Feature Selection Method to Use.

Finish the coding for expert feature selection and use another feature selection method like PCA/mutual information/wrapper methods for use in your model. Which one performs best? 

In [None]:
# XXX code here

## Other Ways to Improve Your Model

### Class Imbalance

Class imbalance occurs when one class comprises a larger ratio of the observations in the dataset than another. This can be seen very clearly in our current training dataset.

In [None]:
train_y_vector.value_counts() / len(train_y_vector) * 100

We can see here that normal observations comprise 72.3% of our training dataset, BSA is 19.68%, and DTA is 8.01%. This imbalance can have implications on the training of machine learning models because our model may not have enough information to learn effective class boundaries. Some algorithms are more resistant to class imbalance than others. Neural networks however are particularly affected by imbalance issues because of the nature of the way training is performed with these algorithms. Often algorithms besides neural networks benefit from techniques to reduce the class imbalance issue too. There are a number of techniques to tackle class imbalance.

#### ROS (Random Over-Sampling)

Random over-sampling aims to oversample minority classes by choosing observations at random with replacement until we  meet a certain ratio of majority to minority class observations. This is a fairly easy thing to code yourself if you wanted to do it, but just for ease we're going to use the [imbalanced-learn python package.](https://imbalanced-learn.readthedocs.io/en/stable/)

In [80]:
import imblearn

# get all columns in our dataset except patient and breath_id
columns_to_use = list(set(train_x.columns).difference(['patient', 'breath_id']))
# Initialize the RandomOverSampler. This initialization will give us 1:1:1 class ratios. If we want different
# ratios then we can chance the sampling_strategy input argument. For more details see the documentation
# https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.over_sampling.RandomOverSampler.html#imblearn.over_sampling.RandomOverSampler
ros = imblearn.over_sampling.RandomOverSampler()
# re-sample the train set ONLY. Don't resample the testing set because otherwise you would be biasing model conclusions
train_x_ros, train_y_ros = ros.fit_resample(train_x[columns_to_use], train_y_vector)

# put the target vector into a series so we can just do some convenience function.
train_y_ros = pd.Series(train_y_ros)
# You'll see the dataset is equilibrated now with equal observations normal, BSA, and DTA breaths.
train_y_ros.value_counts()

# Now we can put this back into our model and see if performance changes. This is left for the reader
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
columns_to_use = list(set(train_x.columns).difference(['patient', 'breath_id']))
model.fit(train_x_ros, train_y_ros)
predictions = model.predict(test_x[columns_to_use])
for idx, pred in enumerate(predictions):
    if pred == 2:
        predictions[idx-1] = 2       
from sklearn.metrics import classification_report
print(classification_report(test_y_vector, predictions))

              precision    recall  f1-score   support

           0       0.86      0.63      0.73       842
           1       0.43      0.88      0.57       301
           2       1.00      0.13      0.24       104

    accuracy                           0.65      1247
   macro avg       0.76      0.55      0.51      1247
weighted avg       0.77      0.65      0.65      1247



#### RUS (Random Under-Sampling)

![](over-sampling-undersampling.png)

Random under-sampling is basically the inverse of the over-sampling technique. Instead of selecting with replacement from minority classes, here we randomly sample from the majority classes only until they meet some class ratio with the minority classes.

In [81]:
import imblearn

# get all columns in our dataset except patient and breath_id
columns_to_use = list(set(train_x.columns).difference(['patient', 'breath_id']))
# Initialize the RandomUnderSampler. This initialization will give us 1:1:1 class ratios. If we want different
# ratios then we can chance the sampling_strategy input argument. For more details see the documentation
# https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.under_sampling.RandomUnderSampler.html#imblearn.under_sampling.RandomUnderSampler
rus = imblearn.under_sampling.RandomUnderSampler()
# re-sample the train set ONLY. Don't resample the testing set because otherwise you would be biasing model conclusions
train_x_rus, train_y_rus = rus.fit_resample(train_x[columns_to_use], train_y_vector)

# put the target vector into a series so we can just do some convenience function.
train_y_rus = pd.Series(train_y_rus)
# You'll see the dataset is equilibrated now with equal observations normal, BSA, and DTA breaths.
train_y_rus.value_counts()

# Now we can put this back into our model and see if performance changes. This is left for the reader
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
columns_to_use = list(set(train_x.columns).difference(['patient', 'breath_id']))
model.fit(train_x_rus, train_y_rus)
predictions = model.predict(test_x[columns_to_use])
for idx, pred in enumerate(predictions):
    if pred == 2:
        predictions[idx-1] = 2        
from sklearn.metrics import classification_report
print(classification_report(test_y_vector, predictions))

              precision    recall  f1-score   support

           0       0.86      0.53      0.66       842
           1       0.41      0.80      0.54       301
           2       0.17      0.25      0.21       104

    accuracy                           0.57      1247
   macro avg       0.48      0.53      0.47      1247
weighted avg       0.70      0.57      0.59      1247



There are some downsides to RUS in that we are discarding data from the majority class which might be useful for the future. Also if some classes have very low ratios of data relative to the majority class then RUS may have more limited utility. With RUS, as with ROS, we will need to evaluate the effect of different class ratios on our validation set. Maybe a `4:2:1` ratio would be best for this problem, we just don't know until we try. I will leave this as an additional exercise for the reader.

#### SMOTE (Synthetic Minority Oversampling TEchnique)

One downside about the methods mentioned is that they always are drawn from the existing distribution of class data. It is quite possible that if we collected additional samples that there would be new observations that fit in between these existing data points. This is the intuition behind smote that can also be seen in the below image.

![](smote-intuition.png)

The benefit of SMOTE is that we are expanding our dataset, which means more data for our model to train on, while we are semi-intelligently generating new samples. Of course generated data may have no basis for reality, so good modeling habit should always check to see whether RUS, ROS, or SMOTE works best for a problem, and which class ratios work best for which technique. 

In [82]:
import imblearn 

# get all columns in our dataset except patient and breath_id
columns_to_use = list(set(train_x.columns).difference(['patient', 'breath_id']))
# Initialize SMOTE. This initialization will give us 1:1:1 class ratios. If we want different
# ratios then we can chance the sampling_strategy input argument. For more details see the documentation
# https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.over_sampling.SMOTE.html#imblearn.over_sampling.SMOTE
smote = imblearn.over_sampling.SMOTE()
# re-sample the train set ONLY. Don't resample the testing set because otherwise you would be biasing model conclusions
train_x_smote, train_y_smote = smote.fit_resample(train_x[columns_to_use], train_y_vector)

# put the target vector into a series so we can just do some convenience function.
train_y_smote = pd.Series(train_y_smote)
# You'll see the dataset is equilibrated now with equal observations normal, BSA, and DTA breaths.
train_y_smote.value_counts()


# Now we can put this back into our model and see if performance changes. This is left for the reader
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
columns_to_use = list(set(train_x.columns).difference(['patient', 'breath_id']))
model.fit(train_x_smote, train_y_smote)
predictions = model.predict(test_x[columns_to_use])
for idx, pred in enumerate(predictions):
    if pred == 2:
        predictions[idx-1] = 2
        
from sklearn.metrics import classification_report
print(classification_report(test_y_vector, predictions))

              precision    recall  f1-score   support

           0       0.69      0.15      0.25       842
           1       0.39      0.76      0.52       301
           2       0.06      0.27      0.10       104

    accuracy                           0.31      1247
   macro avg       0.38      0.39      0.29      1247
weighted avg       0.57      0.31      0.30      1247



## Assignment \#3 Utilize all 3 Imbalance Correction Techniques 

Utilize ROS, RUS, and SMOTE with the following imbalance ratios: 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0. Is there an algorithm that performs best? Are there ratios of imbalance that perform best?

In [118]:
# Example code for creating 0.3 imbalance ratio. The same parameters will work for ROS and RUS functions too.
##################################0.3 for SMOTE
import imblearn 
columns_to_use = list(set(train_x.columns).difference(['patient', 'breath_id']))
smote = imblearn.over_sampling.SMOTE(sampling_strategy = 0.3)
train_x_smote, train_y_smote = smote.fit_resample(train_x[columns_to_use], train_y_vector)
train_y_smote = pd.Series(train_y_smote)

from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(train_x_smote, train_y_smote)
predictions = model.predict(test_x[columns_to_use])
for idx, pred in enumerate(predictions):
    if pred == 2:
        predictions[idx-1] = 2
        
from sklearn.metrics import classification_report
print(classification_report(test_y_vector, predictions))
##################################0.3 for ROS
import imblearn 
columns_to_use = list(set(train_x.columns).difference(['patient', 'breath_id']))
smote = imblearn.over_sampling.SMOTE(sampling_strategy = 0.3)
train_x_smote, train_y_smote = smote.fit_resample(train_x[columns_to_use], train_y_vector)
train_y_smote = pd.Series(train_y_smote)
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(train_x_ros, train_y_ros)
predictions = model.predict(test_x[columns_to_use])
for idx, pred in enumerate(predictions):
    if pred == 2:
        predictions[idx-1] = 2       
from sklearn.metrics import classification_report
print(classification_report(test_y_vector, predictions))

##################################0.3 for RUS
import imblearn 
columns_to_use = list(set(train_x.columns).difference(['patient', 'breath_id']))
smote = imblearn.over_sampling.SMOTE(sampling_strategy = 0.3)
train_x_smote, train_y_smote = smote.fit_resample(train_x[columns_to_use], train_y_vector)
train_y_smote = pd.Series(train_y_smote)
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(train_x_rus, train_y_rus)
predictions = model.predict(test_x[columns_to_use])
for idx, pred in enumerate(predictions):
    if pred == 2:
        predictions[idx-1] = 2        
from sklearn.metrics import classification_report
print(classification_report(test_y_vector, predictions))


In [None]:
##################################0.4 SMOTE
import imblearn 
columns_to_use = list(set(train_x.columns).difference(['patient', 'breath_id']))
smote = imblearn.over_sampling.SMOTE(sampling_strategy = 0.4)
train_x_smote, train_y_smote = smote.fit_resample(train_x[columns_to_use], train_y_vector)
train_y_smote = pd.Series(train_y_smote)

from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(train_x_smote, train_y_smote)
predictions = model.predict(test_x[columns_to_use])
for idx, pred in enumerate(predictions):
    if pred == 2:
        predictions[idx-1] = 2
        
from sklearn.metrics import classification_report
print(classification_report(test_y_vector, predictions))
##################################0.4 for ROS
import imblearn 
columns_to_use = list(set(train_x.columns).difference(['patient', 'breath_id']))
smote = imblearn.over_sampling.SMOTE(sampling_strategy = 0.4)
train_x_smote, train_y_smote = smote.fit_resample(train_x[columns_to_use], train_y_vector)
train_y_smote = pd.Series(train_y_smote)
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(train_x_ros, train_y_ros)
predictions = model.predict(test_x[columns_to_use])
for idx, pred in enumerate(predictions):
    if pred == 2:
        predictions[idx-1] = 2       
from sklearn.metrics import classification_report
print(classification_report(test_y_vector, predictions))

##################################0.4 for RUS
import imblearn 
columns_to_use = list(set(train_x.columns).difference(['patient', 'breath_id']))
smote = imblearn.over_sampling.SMOTE(sampling_strategy = 0.4)
train_x_smote, train_y_smote = smote.fit_resample(train_x[columns_to_use], train_y_vector)
train_y_smote = pd.Series(train_y_smote)
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(train_x_rus, train_y_rus)
predictions = model.predict(test_x[columns_to_use])
for idx, pred in enumerate(predictions):
    if pred == 2:
        predictions[idx-1] = 2        
from sklearn.metrics import classification_report
print(classification_report(test_y_vector, predictions))


In [None]:
##################################0.5 SMOTE
import imblearn 
columns_to_use = list(set(train_x.columns).difference(['patient', 'breath_id']))
smote = imblearn.over_sampling.SMOTE(sampling_strategy = 0.5)
train_x_smote, train_y_smote = smote.fit_resample(train_x[columns_to_use], train_y_vector)
train_y_smote = pd.Series(train_y_smote)

from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(train_x_smote, train_y_smote)
predictions = model.predict(test_x[columns_to_use])
for idx, pred in enumerate(predictions):
    if pred == 2:
        predictions[idx-1] = 2
        
from sklearn.metrics import classification_report
print(classification_report(test_y_vector, predictions))
##################################0.5 for ROS
import imblearn 
columns_to_use = list(set(train_x.columns).difference(['patient', 'breath_id']))
smote = imblearn.over_sampling.SMOTE(sampling_strategy = 0.5)
train_x_smote, train_y_smote = smote.fit_resample(train_x[columns_to_use], train_y_vector)
train_y_smote = pd.Series(train_y_smote)
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(train_x_ros, train_y_ros)
predictions = model.predict(test_x[columns_to_use])
for idx, pred in enumerate(predictions):
    if pred == 2:
        predictions[idx-1] = 2       
from sklearn.metrics import classification_report
print(classification_report(test_y_vector, predictions))

##################################0.5 for RUS
import imblearn 
columns_to_use = list(set(train_x.columns).difference(['patient', 'breath_id']))
smote = imblearn.over_sampling.SMOTE(sampling_strategy = 0.5)
train_x_smote, train_y_smote = smote.fit_resample(train_x[columns_to_use], train_y_vector)
train_y_smote = pd.Series(train_y_smote)
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(train_x_rus, train_y_rus)
predictions = model.predict(test_x[columns_to_use])
for idx, pred in enumerate(predictions):
    if pred == 2:
        predictions[idx-1] = 2        
from sklearn.metrics import classification_report
print(classification_report(test_y_vector, predictions))

In [None]:
##################################0.6 SMOTE
import imblearn 
columns_to_use = list(set(train_x.columns).difference(['patient', 'breath_id']))
smote = imblearn.over_sampling.SMOTE(sampling_strategy = 0.6)
train_x_smote, train_y_smote = smote.fit_resample(train_x[columns_to_use], train_y_vector)
train_y_smote = pd.Series(train_y_smote)

from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(train_x_smote, train_y_smote)
predictions = model.predict(test_x[columns_to_use])
for idx, pred in enumerate(predictions):
    if pred == 2:
        predictions[idx-1] = 2
        
from sklearn.metrics import classification_report
print(classification_report(test_y_vector, predictions))
##################################0.6for ROS
import imblearn 
columns_to_use = list(set(train_x.columns).difference(['patient', 'breath_id']))
smote = imblearn.over_sampling.SMOTE(sampling_strategy = 0.6)
train_x_smote, train_y_smote = smote.fit_resample(train_x[columns_to_use], train_y_vector)
train_y_smote = pd.Series(train_y_smote)
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(train_x_ros, train_y_ros)
predictions = model.predict(test_x[columns_to_use])
for idx, pred in enumerate(predictions):
    if pred == 2:
        predictions[idx-1] = 2       
from sklearn.metrics import classification_report
print(classification_report(test_y_vector, predictions))

##################################0.6 for RUS
import imblearn 
columns_to_use = list(set(train_x.columns).difference(['patient', 'breath_id']))
smote = imblearn.over_sampling.SMOTE(sampling_strategy = 0.6)
train_x_smote, train_y_smote = smote.fit_resample(train_x[columns_to_use], train_y_vector)
train_y_smote = pd.Series(train_y_smote)
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(train_x_rus, train_y_rus)
predictions = model.predict(test_x[columns_to_use])
for idx, pred in enumerate(predictions):
    if pred == 2:
        predictions[idx-1] = 2        
from sklearn.metrics import classification_report
print(classification_report(test_y_vector, predictions))

In [None]:
##################################0.7 SMOTE
import imblearn 
columns_to_use = list(set(train_x.columns).difference(['patient', 'breath_id']))
smote = imblearn.over_sampling.SMOTE(sampling_strategy = 0.7)
train_x_smote, train_y_smote = smote.fit_resample(train_x[columns_to_use], train_y_vector)
train_y_smote = pd.Series(train_y_smote)

from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(train_x_smote, train_y_smote)
predictions = model.predict(test_x[columns_to_use])
for idx, pred in enumerate(predictions):
    if pred == 2:
        predictions[idx-1] = 2
        
from sklearn.metrics import classification_report
print(classification_report(test_y_vector, predictions))
##################################0.7 for ROS
import imblearn 
columns_to_use = list(set(train_x.columns).difference(['patient', 'breath_id']))
smote = imblearn.over_sampling.SMOTE(sampling_strategy = 0.7)
train_x_smote, train_y_smote = smote.fit_resample(train_x[columns_to_use], train_y_vector)
train_y_smote = pd.Series(train_y_smote)
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(train_x_ros, train_y_ros)
predictions = model.predict(test_x[columns_to_use])
for idx, pred in enumerate(predictions):
    if pred == 2:
        predictions[idx-1] = 2       
from sklearn.metrics import classification_report
print(classification_report(test_y_vector, predictions))

##################################0.7 for RUS
import imblearn 
columns_to_use = list(set(train_x.columns).difference(['patient', 'breath_id']))
smote = imblearn.over_sampling.SMOTE(sampling_strategy = 0.7)
train_x_smote, train_y_smote = smote.fit_resample(train_x[columns_to_use], train_y_vector)
train_y_smote = pd.Series(train_y_smote)
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(train_x_rus, train_y_rus)
predictions = model.predict(test_x[columns_to_use])
for idx, pred in enumerate(predictions):
    if pred == 2:
        predictions[idx-1] = 2        
from sklearn.metrics import classification_report
print(classification_report(test_y_vector, predictions))

In [None]:
##################################0.8SMOTE
import imblearn 
columns_to_use = list(set(train_x.columns).difference(['patient', 'breath_id']))
smote = imblearn.over_sampling.SMOTE(sampling_strategy = 0.8)
train_x_smote, train_y_smote = smote.fit_resample(train_x[columns_to_use], train_y_vector)
train_y_smote = pd.Series(train_y_smote)

from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(train_x_smote, train_y_smote)
predictions = model.predict(test_x[columns_to_use])
for idx, pred in enumerate(predictions):
    if pred == 2:
        predictions[idx-1] = 2
        
from sklearn.metrics import classification_report
print(classification_report(test_y_vector, predictions))
##################################0.8for ROS
import imblearn 
columns_to_use = list(set(train_x.columns).difference(['patient', 'breath_id']))
smote = imblearn.over_sampling.SMOTE(sampling_strategy = 0.8)
train_x_smote, train_y_smote = smote.fit_resample(train_x[columns_to_use], train_y_vector)
train_y_smote = pd.Series(train_y_smote)
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(train_x_ros, train_y_ros)
predictions = model.predict(test_x[columns_to_use])
for idx, pred in enumerate(predictions):
    if pred == 2:
        predictions[idx-1] = 2       
from sklearn.metrics import classification_report
print(classification_report(test_y_vector, predictions))

##################################0.8 for RUS
import imblearn 
columns_to_use = list(set(train_x.columns).difference(['patient', 'breath_id']))
smote = imblearn.over_sampling.SMOTE(sampling_strategy = 0.8)
train_x_smote, train_y_smote = smote.fit_resample(train_x[columns_to_use], train_y_vector)
train_y_smote = pd.Series(train_y_smote)
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(train_x_rus, train_y_rus)
predictions = model.predict(test_x[columns_to_use])
for idx, pred in enumerate(predictions):
    if pred == 2:
        predictions[idx-1] = 2        
from sklearn.metrics import classification_report
print(classification_report(test_y_vector, predictions))

In [None]:
##################################0.9SMOTE
import imblearn 
columns_to_use = list(set(train_x.columns).difference(['patient', 'breath_id']))
smote = imblearn.over_sampling.SMOTE(sampling_strategy = 0.9)
train_x_smote, train_y_smote = smote.fit_resample(train_x[columns_to_use], train_y_vector)
train_y_smote = pd.Series(train_y_smote)

from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(train_x_smote, train_y_smote)
predictions = model.predict(test_x[columns_to_use])
for idx, pred in enumerate(predictions):
    if pred == 2:
        predictions[idx-1] = 2
        
from sklearn.metrics import classification_report
print(classification_report(test_y_vector, predictions))
##################################0.9 for ROS
import imblearn 
columns_to_use = list(set(train_x.columns).difference(['patient', 'breath_id']))
smote = imblearn.over_sampling.SMOTE(sampling_strategy = 0.9)
train_x_smote, train_y_smote = smote.fit_resample(train_x[columns_to_use], train_y_vector)
train_y_smote = pd.Series(train_y_smote)
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(train_x_ros, train_y_ros)
predictions = model.predict(test_x[columns_to_use])
for idx, pred in enumerate(predictions):
    if pred == 2:
        predictions[idx-1] = 2       
from sklearn.metrics import classification_report
print(classification_report(test_y_vector, predictions))

##################################0.9for RUS
import imblearn 
columns_to_use = list(set(train_x.columns).difference(['patient', 'breath_id']))
smote = imblearn.over_sampling.SMOTE(sampling_strategy = 0.9)
train_x_smote, train_y_smote = smote.fit_resample(train_x[columns_to_use], train_y_vector)
train_y_smote = pd.Series(train_y_smote)
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(train_x_rus, train_y_rus)
predictions = model.predict(test_x[columns_to_use])
for idx, pred in enumerate(predictions):
    if pred == 2:
        predictions[idx-1] = 2        
from sklearn.metrics import classification_report
print(classification_report(test_y_vector, predictions))

In [None]:
##################################1.0 SMOTE
import imblearn 
columns_to_use = list(set(train_x.columns).difference(['patient', 'breath_id']))
smote = imblearn.over_sampling.SMOTE(sampling_strategy = 1.0)
train_x_smote, train_y_smote = smote.fit_resample(train_x[columns_to_use], train_y_vector)
train_y_smote = pd.Series(train_y_smote)

from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(train_x_smote, train_y_smote)
predictions = model.predict(test_x[columns_to_use])
for idx, pred in enumerate(predictions):
    if pred == 2:
        predictions[idx-1] = 2
        
from sklearn.metrics import classification_report
print(classification_report(test_y_vector, predictions))
##################################1.0 for ROS
import imblearn 
columns_to_use = list(set(train_x.columns).difference(['patient', 'breath_id']))
smote = imblearn.over_sampling.SMOTE(sampling_strategy = 1.0)
train_x_smote, train_y_smote = smote.fit_resample(train_x[columns_to_use], train_y_vector)
train_y_smote = pd.Series(train_y_smote)
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(train_x_ros, train_y_ros)
predictions = model.predict(test_x[columns_to_use])
for idx, pred in enumerate(predictions):
    if pred == 2:
        predictions[idx-1] = 2       
from sklearn.metrics import classification_report
print(classification_report(test_y_vector, predictions))

##################################1.0 for RUS
import imblearn 
columns_to_use = list(set(train_x.columns).difference(['patient', 'breath_id']))
smote = imblearn.over_sampling.SMOTE(sampling_strategy = 1.0)
train_x_smote, train_y_smote = smote.fit_resample(train_x[columns_to_use], train_y_vector)
train_y_smote = pd.Series(train_y_smote)
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(train_x_rus, train_y_rus)
predictions = model.predict(test_x[columns_to_use])
for idx, pred in enumerate(predictions):
    if pred == 2:
        predictions[idx-1] = 2        
from sklearn.metrics import classification_report
print(classification_report(test_y_vector, predictions))