In [1]:
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score, confusion_matrix


# Feature manipulation 
Now that your data has been thoroughly cleaned (w.r.t. your goal to model diagnoses) and explored, you'll need to "play around" and prepare good features.

You don't have to think about modelling (machine learning) at this stage (although it won't do harm). Perform feature selection and feature engineering in ways that you think will be beneficial for a "mental" model of the data. Such a model consists of hypotheses that you should be able to test.

Feel free to do any sort of feature manipulation on the data you like. Ideally, at the end of the process, you'll have a rectangular data table consisting of only (floating-point) numbers and nothing else.
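
As a minimal sketch of what "a rectangular table of only floating-point numbers" can look like in pandas, the snippet below one-hot encodes a text column and casts everything to float. The toy DataFrame and its column names (`age`, `smoker`, `ethnicity`) are made up for illustration and are not taken from the asthma dataset.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
toy = pd.DataFrame({
    "age": [34, 51, 27],
    "smoker": [True, False, True],
    "ethnicity": ["a", "b", "a"],
})

# One-hot encode the text column, then cast every column to float.
features = pd.get_dummies(toy, columns=["ethnicity"], dtype=float).astype(float)

# Every column should now have a floating-point dtype.
print(features.dtypes)
```

Whether one-hot encoding is the right choice depends on the variable; it is shown here only because it is a common way to turn categories into numbers.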

In Problem 4, I saw that essentially none of the potential predictors were associated with the asthma diagnosis, so I'm not sure how transforming the variables (i.e. feature manipulation) would help. For the sake of the exercise, I'll run a logistic regression to see what happens.

In [2]:
asthma_an = pd.read_csv("../data/asthma_disease_data_analysis.csv")

## Split data

In [3]:
predictors = asthma_an.drop(columns="diagnosis")

In [4]:
X_train, X_test, y_train, y_test = train_test_split(predictors, asthma_an.diagnosis, test_size=0.3, random_state=42)
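
With a rare outcome, a purely random split can leave the train and test sets with slightly different shares of positive cases. One option, not used in the cell above, is the `stratify` argument of `train_test_split`. The sketch below uses synthetic labels (10% positives) rather than the asthma data, so the numbers are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels (10% positives); not the asthma data.
y = np.array([0] * 900 + [1] * 100)
X = np.arange(len(y), dtype=float).reshape(-1, 1)

# stratify=y keeps the class proportions the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print("Positive share, train:", y_tr.mean())
print("Positive share, test: ", y_te.mean())
```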

## Fit logistic regression

In [5]:
model = LogisticRegression(max_iter=1000)  # increase max_iter if needed
model.fit(X_train, y_train)


LogisticRegression(max_iter=1000)

Parameter          Value
penalty            'l2'
dual               False
tol                0.0001
C                  1.0
fit_intercept      True
intercept_scaling  1
class_weight       None
random_state       None
solver             'lbfgs'
max_iter           1000

## Make predictions

In [6]:
# Predicted probabilities for class 1
y_pred_proba = model.predict_proba(X_test)[:, 1]

# Predicted class labels
y_pred = model.predict(X_test)
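
For a binary `LogisticRegression`, `predict` effectively assigns class 1 only when the predicted probability for class 1 exceeds 0.5. The sketch below illustrates this with made-up probabilities (not the model's actual output) and shows how a lower cut-off would change the labels; the value 0.1 is an arbitrary choice for illustration.

```python
import numpy as np

# Made-up class-1 probabilities, standing in for predict_proba(...)[:, 1].
proba = np.array([0.02, 0.08, 0.15, 0.04, 0.30])

# Default behaviour: label 1 only when the probability exceeds 0.5.
default_labels = (proba > 0.5).astype(int)

# Hypothetical lower cut-off; 0.1 is an arbitrary illustrative value.
threshold = 0.1
custom_labels = (proba > threshold).astype(int)

print(default_labels)  # [0 0 0 0 0]
print(custom_labels)   # [0 0 1 0 1]
```

If no predicted probability reaches 0.5, every row is labelled 0, regardless of how the probabilities rank the rows.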


## Evaluate performance

In [7]:
# Accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))

# ROC AUC
print("ROC AUC:", roc_auc_score(y_test, y_pred_proba))

# Confusion matrix
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))


Accuracy: 0.9484679665738162
ROC AUC: 0.5404611660118268
Confusion matrix:
 [[681   0]
 [ 37   0]]


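The model always predicts the majority class: its accuracy (0.948) equals the share of negatives in the test set (681 of 718), and the ROC AUC of 0.54 is barely above chance. This is not surprising, given that the predictors are essentially uncorrelated with the outcome.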
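
The arithmetic behind that reading can be checked directly from the confusion matrix. A minimal sketch, using only the four counts printed above:

```python
import numpy as np

# Confusion matrix printed above: rows = true class, columns = predicted class.
cm = np.array([[681, 0],
               [37, 0]])

tn, fp, fn, tp = cm.ravel()

# Accuracy of the fitted model on the test set.
accuracy = (tn + tp) / cm.sum()

# Accuracy of a rule that always predicts the majority class.
majority_share = cm.sum(axis=1).max() / cm.sum()

# Share of true asthma cases that the model detected.
recall_positive = tp / (tp + fn)

print(accuracy, majority_share, recall_positive)
```

The model's accuracy and the majority-class share coincide, and the recall for the positive class is zero, which is why accuracy alone is a misleading summary here.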