The purpose of this notebook is to demonstrate proficiency in data preprocessing. The more complex "ML Model.ipynb" notebook required only a few preprocessing steps, so this notebook was created to cover the others. It takes two raw MIMIC-III tables and converts them into data that can be used to run various machine learning models. In this example, the model applied is logistic regression.

Library Imports: The cell imports the Python libraries and modules used for data handling (pandas), date parsing (datetime), and machine learning (scikit-learn: train/test splitting, logistic regression, evaluation metrics, and imputation).

Function to Calculate Age: The calculate_age function takes the date of birth (dob) and admission time (admittime) as timestamp strings and returns the patient's age in whole years at the time of admission.
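The birthday comparison inside calculate_age is easy to misread, so here is a small worked example. It reuses the function body from the cell below; the two date strings are made-up illustrative values, not MIMIC-III records.

```python
from datetime import datetime

def calculate_age(dob, admittime):
    fmt = '%Y-%m-%d %H:%M:%S'
    dob = datetime.strptime(dob, fmt)
    admittime = datetime.strptime(admittime, fmt)
    # True (counted as 1) when the birthday has not yet occurred in the admission year
    birthday_pending = (admittime.month, admittime.day) < (dob.month, dob.day)
    return admittime.year - dob.year - birthday_pending

# Hypothetical patient born on 15 June 2100
age_day_before = calculate_age('2100-06-15 00:00:00', '2150-06-14 12:00:00')
age_on_birthday = calculate_age('2100-06-15 00:00:00', '2150-06-15 08:30:00')

print(age_day_before, age_on_birthday)
```

Admitted one day before the 50th birthday, the patient is still counted as 49; admitted on the birthday itself, the result is 50.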

Data Extraction: Load the PATIENTS and ADMISSIONS tables from the MIMIC-III dataset, then merge them on the SUBJECT_ID field to combine patient demographic data with admission details.
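To make the effect of the merge concrete, the sketch below runs the same pd.merge call on two tiny hand-made tables. The rows are invented for illustration and only mimic the column names used in this notebook. Because a patient can have more than one admission, the merged table has one row per admission rather than one row per patient.

```python
import pandas as pd

# Toy stand-ins for PATIENTS and ADMISSIONS (hypothetical values)
patients = pd.DataFrame({
    'SUBJECT_ID': [1, 2],
    'GENDER': ['M', 'F'],
    'DOB': ['2100-01-01 00:00:00', '2090-05-20 00:00:00'],
})
admissions = pd.DataFrame({
    'SUBJECT_ID': [1, 1, 2],
    'ADMITTIME': ['2150-02-01 10:00:00', '2151-07-09 22:15:00', '2150-03-01 06:30:00'],
    'DEATHTIME': [None, None, '2150-03-05 10:00:00'],
})

# Default inner join on the shared key, as in the notebook cell
merged = pd.merge(patients, admissions, on='SUBJECT_ID')
print(merged)
```

Patient 1 appears twice in the result, once per admission, with GENDER and DOB repeated on each row.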

In [14]:
import pandas as pd
from datetime import datetime
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.impute import SimpleImputer

# Function to calculate age in whole years at admission
def calculate_age(dob, admittime):
    dob = datetime.strptime(dob, '%Y-%m-%d %H:%M:%S')
    admittime = datetime.strptime(admittime, '%Y-%m-%d %H:%M:%S')
    # Subtract 1 if the birthday has not yet occurred in the admission year
    return admittime.year - dob.year - ((admittime.month, admittime.day) < (dob.month, dob.day))

# Load data
patients = pd.read_csv('~/Desktop/mimic-iii-clinical-database-1.4/PATIENTS.csv')
admissions = pd.read_csv('~/Desktop/mimic-iii-clinical-database-1.4/ADMISSIONS.csv')

# Merge datasets on SUBJECT_ID
data = pd.merge(patients, admissions, on='SUBJECT_ID')

# Select relevant features
data = data[['SUBJECT_ID', 'GENDER', 'DOB', 'ADMITTIME', 'DEATHTIME']]

# Calculate age at admission (in years), then drop the raw date columns
data['AGE'] = data.apply(lambda x: calculate_age(x['DOB'], x['ADMITTIME']), axis=1)
data.drop(['DOB', 'ADMITTIME'], axis=1, inplace=True)

# Create a target variable for mortality
data['MORTALITY'] = data['DEATHTIME'].apply(lambda x: 0 if pd.isnull(x) else 1)
data.drop('DEATHTIME', axis=1, inplace=True)

# Encode gender
data['GENDER'] = data['GENDER'].map({'M': 1, 'F': 0})

# Handle missing values
imputer = SimpleImputer(strategy='mean')
data[['AGE']] = imputer.fit_transform(data[['AGE']])

# Split data into features and target (X keeps SUBJECT_ID, GENDER, and AGE)
X = data.drop('MORTALITY', axis=1)
y = data['MORTALITY']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train a logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Predictions and evaluations (ROC AUC is computed from the hard class labels in y_pred)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred)

print(f'Accuracy: {accuracy}')
print(f'ROC AUC Score: {roc_auc}')


Accuracy: 0.9037809426924381
ROC AUC Score: 0.5
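A ROC AUC of exactly 0.5 alongside roughly 90% accuracy is consistent with a model that predicts the majority class for every test row, since the AUC was computed from hard labels rather than predicted probabilities. The sketch below reproduces that pattern on a small synthetic label vector (not MIMIC-III data) and shows how the score changes when continuous scores are passed instead. The scores array is a made-up stand-in for predicted probabilities.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Synthetic, imbalanced labels: nine negatives and one positive
y_true = np.array([0] * 9 + [1])

# A classifier that always predicts the majority class
y_pred = np.zeros_like(y_true)

# Hypothetical predicted probabilities for the positive class
scores = np.array([0.05, 0.10, 0.08, 0.12, 0.07, 0.09, 0.11, 0.06, 0.04, 0.30])

acc = accuracy_score(y_true, y_pred)         # share of negatives
auc_labels = roc_auc_score(y_true, y_pred)   # constant predictions carry no ranking
auc_scores = roc_auc_score(y_true, scores)   # ranking-based AUC

print(acc, auc_labels, auc_scores)
```

In the notebook above, the analogous scores would come from model.predict_proba(X_test)[:, 1]; the value that would give on the MIMIC-III data is not reported here.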
