# 2. Logistic Regression and Decision Trees

This notebook contains the workflow for the third milestone in the Manning liveProject *Handling Sensitive Data*.

In [1]:
import pandas as pd
import numpy as np
import os.path
from scipy.fft import fft
from scipy.io import wavfile
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

## Read data (copied from milestone 1)

In [2]:
csvpath = "../../data/" # path to the folder with the csv file
wavpath = "../../data/Recordings/" # path to the folder with wav files

In [3]:
# import metadata
meta = pd.read_csv(csvpath + "Covid19Study_June2021.csv", delimiter=";") 

# importing the wav files
wavdata = []

for pid in meta["ParticipantID"].values:
    wavfile_path = wavpath + "RecordingParticipant" + str(pid) + ".wav"
    if os.path.isfile(wavfile_path):
        _, data = wavfile.read(wavfile_path) # note: the wav files are 16-bit integer PCM
    else:
        # no recording for this participant: use 10000 zero samples as a placeholder
        data = np.zeros((10000,))
    wavdata.append(np.array(data))

## Create features (almost the same as milestone 2)

In [4]:
raw_last8192 = [wav[-8192:] for wav in wavdata] # extract the last 8192 sample values from each recording
fft_last8192 = [np.abs(fft(rawwav)) for rawwav in raw_last8192] # get the absolute Fourier transformed values

In [5]:
bins = [400,600,800,1000]
X = np.zeros((len(fft_last8192), len(bins))) # array to store feature vectors

# normalize the absolute Fourier transformed values for indices listed in bins
# (note: an all-zero placeholder recording has a maximum of 0, so its row would become NaN)
for i in range(len(fft_last8192)):
    X[i,:] = fft_last8192[i][bins] / np.max(fft_last8192[i])
    
# creating y from the dataframe values
y = np.array(meta["Covid19"].copy())
y = np.where(y=="y",1,-1)
y = y.reshape(-1,1)
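
To make the choice of `bins` easier to interpret, each FFT bin index can be converted to a frequency in Hz. The sketch below is only an illustration: it assumes a sample rate of 44100 Hz (the actual rate of the wav files is discarded when they are read, so this value is an assumption) and uses `np.fft.fftfreq`, which is not used elsewhere in this notebook.

```python
import numpy as np

n_samples = 8192       # window length used for the FFT above
sample_rate = 44100    # assumed; the real sample rate of the recordings is not checked here
bins = [400, 600, 800, 1000]

# fftfreq returns the frequency (in Hz) that each FFT bin index corresponds to
freqs = np.fft.fftfreq(n_samples, d=1 / sample_rate)
bin_freqs = freqs[bins]

for b, f in zip(bins, bin_freqs):
    print(f"bin {b}: {f:.1f} Hz")
```

Under that assumption, neighbouring bins are `sample_rate / n_samples` Hz apart, so the four features sample the spectrum at evenly spaced frequencies.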

## Split features and labels

In [6]:
random_state = 81

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=random_state)

## Learning models and predicting

In [8]:
# train logistic regression and print the train and test accuracy
logreg = LogisticRegression(random_state=random_state).fit(X_train,y_train.reshape(-1))
lr_train_pred = logreg.predict(X_train)
lr_test_pred = logreg.predict(X_test)

lr_train_acc = accuracy_score(y_train.reshape(-1), lr_train_pred)
lr_test_acc = accuracy_score(y_test.reshape(-1),lr_test_pred)

print("***********Logistic regression accuracies***********")
print("Training accuracy:",lr_train_acc)
print("Test accuracy:", lr_test_acc)

***********Logistic regression accuracies***********
Training accuracy: 0.53
Test accuracy: 0.47


In [9]:
# train decision tree and print the train and test accuracy
dtree = DecisionTreeClassifier(random_state=random_state).fit(X_train, y_train)
dt_train_pred = dtree.predict(X_train)
dt_test_pred = dtree.predict(X_test)

dt_train_acc = accuracy_score(y_train, dt_train_pred)
dt_test_acc = accuracy_score(y_test, dt_test_pred)

print("***********Decision tree accuracies***********")
print("Training accuracy:",dt_train_acc)
print("Test accuracy:", dt_test_acc)

***********Decision tree accuracies***********
Training accuracy: 1.0
Test accuracy: 0.76


## Do the models overfit?

I think the decision tree overfits: it reaches 100% accuracy on the training set but only 76% on the test set. Perfect training accuracy combined with a clearly lower test accuracy is a typical symptom of overfitting.
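
One way to probe this kind of overfitting is to limit the depth of the tree and see whether the gap between training and test accuracy shrinks. The following is a minimal sketch, not part of the milestone workflow: it uses synthetic data from `make_classification` as a stand-in for `X` and `y` (an assumption, so the numbers will not match the ones above), and `max_depth=3` is an arbitrary illustrative value.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in for the real features and labels (assumption, not the study data)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0,
    flip_y=0.2, random_state=81,
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.5, random_state=81)

results = {}
for depth in (None, 3):  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=81).fit(X_tr, y_tr)
    results[depth] = {
        "train": tree.score(X_tr, y_tr),
        "test": tree.score(X_te, y_te),
        "depth": tree.get_depth(),
    }
    print(f"max_depth={depth}: train={results[depth]['train']:.2f}, "
          f"test={results[depth]['test']:.2f}")
```

An unrestricted tree can memorize the training set, which is why its training accuracy alone says little about how well it generalizes.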

Logistic regression does not overfit, because its training accuracy (53%) and test accuracy (47%) are about the same. However, I don't think logistic regression is a good choice for these features: both scores are close to 50%, which is roughly what random guessing would achieve on a two-class problem if the classes are about balanced.
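
Whether an accuracy is "no better than a guess" depends on the class balance, so it can help to compare against a majority-class baseline. The sketch below uses scikit-learn's `DummyClassifier` on a small set of hypothetical labels in the same +1/-1 encoding as `y`; the labels are made up for illustration and are not the study data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# hypothetical labels in the same +1/-1 encoding as y (not the real study data)
y_demo = np.array([1, -1, -1, 1, -1, -1, 1, -1, -1, -1])
X_demo = np.zeros((len(y_demo), 1))  # the dummy model ignores the features

# always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
baseline_acc = accuracy_score(y_demo, baseline.predict(X_demo))
print("Majority-class baseline accuracy:", baseline_acc)
```

A model is only useful if it clearly beats this baseline on the test set; with imbalanced classes the baseline can be well above 50%.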