Using the **MIMIC dataset**, several types of predictions can be made by leveraging the rich clinical data it provides. Below are common predictive tasks, examples of columns and attributes used, and approaches for these analyses.

---

### **1. Predicting ICU Mortality**
   - **Goal**: Predict whether a patient will survive their ICU stay.
   - **Key Columns and Attributes**:
     - From `CHARTEVENTS`:
       - Vital signs: Heart rate, blood pressure, respiratory rate, temperature.
     - From `LABEVENTS`:
       - Lab results: Blood glucose, creatinine, pH, lactate (some are also charted in `CHARTEVENTS`).
     - From `ICUSTAYS`:
       - ICU admission and discharge times.
       - Length of ICU stay.
     - From `ADMISSIONS`:
       - Admission type: Elective, emergency.
       - Diagnosis text or ICD codes.
     - Demographics (from `PATIENTS` and `ADMISSIONS`):
       - Age (derived from `DOB` and the admission time), gender, ethnicity.
   - **Approach**:
     - Time-series models like LSTMs for sequential data.
     - Feature engineering for static features and gradient-boosted models like XGBoost.

---

### **2. Predicting Length of Stay (LOS)**
   - **Goal**: Estimate the number of days a patient will spend in the ICU or hospital.
   - **Key Columns and Attributes**:
     - From `ICUSTAYS`:
       - Admission and discharge times.
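     - From `CHARTEVENTS` and `LABEVENTS`:
       - First 24 hours of vitals (e.g., heart rate, blood pressure) and lab results (e.g., creatinine).
     - From `ADMISSIONS`:
       - Admission type and source.
     - From `PATIENTS` and `DIAGNOSES_ICD`:
       - Age (derived from `DOB` and the admission time) and chronic conditions (from ICD-9 diagnosis codes).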
   - **Approach**:
     - Regression models (e.g., Linear Regression, Random Forest Regressor).
     - Time-series analysis for sequential trends.
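
The regression target itself has to be derived from the timestamps. A minimal sketch, assuming `ADMITTIME` and `DISCHTIME` are available as strings; the two rows below are made-up stand-ins for `ADMISSIONS`, and `LOS_DAYS` is a hypothetical column name:

```python
import pandas as pd

# Toy rows standing in for ADMISSIONS; the values are invented for illustration.
admissions = pd.DataFrame({
    "HADM_ID": [1, 2],
    "ADMITTIME": ["2150-01-01 08:00:00", "2150-03-10 12:00:00"],
    "DISCHTIME": ["2150-01-03 20:00:00", "2150-03-11 00:00:00"],
})

# Parse the timestamps, then express each stay as fractional days.
for col in ("ADMITTIME", "DISCHTIME"):
    admissions[col] = pd.to_datetime(admissions[col])
admissions["LOS_DAYS"] = (
    admissions["DISCHTIME"] - admissions["ADMITTIME"]
).dt.total_seconds() / 86400

print(admissions[["HADM_ID", "LOS_DAYS"]])
```

The same pattern applies to ICU stays by swapping in the ICU admission and discharge times from `ICUSTAYS`.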

---

### **3. Sepsis Prediction**
   - **Goal**: Predict the onset of sepsis based on patient data.
   - **Key Columns and Attributes**:
     - From `CHARTEVENTS`:
       - Heart rate, respiratory rate, temperature.
     - From `LABEVENTS`:
       - White blood cell count and lactate levels.
     - From `MICROBIOLOGYEVENTS`:
       - Blood culture results.
     - From `INPUTEVENTS_CV` / `INPUTEVENTS_MV` and `PRESCRIPTIONS`:
       - Fluid intake and drug administration (e.g., antibiotics).
   - **Approach**:
     - Feature engineering to identify trends over time (e.g., lactate rising).
     - Gradient-boosting models or recurrent neural networks (RNNs).
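
One way to turn "lactate rising" into a feature is a per-stay slope. The sketch below is illustrative only: the measurements are invented, and `HOURS_SINCE_ADMIT`, `LACTATE`, and `slope_per_hour` are hypothetical names rather than MIMIC columns.

```python
import numpy as np
import pandas as pd

# Hypothetical lactate measurements (mmol/L) for two ICU stays.
labs = pd.DataFrame({
    "ICUSTAY_ID": [10, 10, 10, 20, 20, 20],
    "HOURS_SINCE_ADMIT": [0.0, 6.0, 12.0, 0.0, 6.0, 12.0],
    "LACTATE": [1.0, 2.0, 3.0, 2.5, 2.0, 1.5],
})

def slope_per_hour(group: pd.DataFrame) -> float:
    """Least-squares slope of lactate against time, in mmol/L per hour."""
    if len(group) < 2:
        return float("nan")  # a trend needs at least two points
    return float(np.polyfit(group["HOURS_SINCE_ADMIT"], group["LACTATE"], 1)[0])

# One slope per ICU stay: positive means lactate is rising.
lactate_trend = {
    stay_id: slope_per_hour(group)
    for stay_id, group in labs.groupby("ICUSTAY_ID")
}
print(lactate_trend)
```

A positive slope can then be fed to a gradient-boosting model alongside the static features.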

---

### **4. Readmission Prediction**
   - **Goal**: Predict whether a patient will be readmitted to the hospital within 30 days of discharge.
   - **Key Columns and Attributes**:
     - From `ADMISSIONS`:
       - Discharge date and type.
     - From `CHARTEVENTS`:
       - Clinical stability indicators at discharge.
     - From `DIAGNOSES_ICD`:
       - Chronic conditions and comorbidities (ICD-9 codes).
   - **Approach**:
     - Logistic regression or classification models.
     - Feature selection from discharge-related data.
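
The 30-day readmission label is not stored in any table; it has to be built from consecutive admissions of the same patient. A minimal sketch on made-up rows (`NEXT_ADMITTIME` and `READMIT_30D` are hypothetical column names):

```python
import pandas as pd

# Toy admissions: patient 1 returns 15 days after the first discharge.
adm = pd.DataFrame({
    "SUBJECT_ID": [1, 1, 1, 2],
    "HADM_ID": [100, 101, 102, 200],
    "ADMITTIME": pd.to_datetime(["2150-01-01", "2150-01-20", "2150-06-01", "2151-02-01"]),
    "DISCHTIME": pd.to_datetime(["2150-01-05", "2150-01-25", "2150-06-03", "2151-02-04"]),
})

# Order each patient's admissions in time and look up the next admission.
adm = adm.sort_values(["SUBJECT_ID", "ADMITTIME"]).reset_index(drop=True)
adm["NEXT_ADMITTIME"] = adm.groupby("SUBJECT_ID")["ADMITTIME"].shift(-1)

# Days between this discharge and the next admission (NaN if there is none).
gap_days = (adm["NEXT_ADMITTIME"] - adm["DISCHTIME"]).dt.days
adm["READMIT_30D"] = (gap_days <= 30).astype(int)

print(adm[["HADM_ID", "READMIT_30D"]])
```

In a real cohort, admissions that ended in death would normally be excluded before labeling, since those patients cannot be readmitted.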

---

### **5. Predicting Diagnoses (ICD Code Prediction)**
   - **Goal**: Predict ICD-9 codes based on patient clinical data.
   - **Key Columns and Attributes**:
     - From `NOTEEVENTS`:
       - Clinical notes and discharge summaries.
     - From `CHARTEVENTS`:
       - Vitals, interventions, and lab results.
     - From `LABEVENTS`:
       - Blood tests and other lab measurements.
   - **Approach**:
     - Natural Language Processing (NLP) for note text (e.g., embeddings using BERT or Word2Vec).
     - Multi-label classification using neural networks.
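
Because one admission usually carries several ICD-9 codes, the target is a binary indicator matrix rather than a single label. A small sketch using scikit-learn's `MultiLabelBinarizer`; the rows are invented stand-ins for `DIAGNOSES_ICD`:

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy stand-in for DIAGNOSES_ICD: one row per (admission, code) pair.
diagnoses = pd.DataFrame({
    "HADM_ID": [100, 100, 101, 102, 102],
    "ICD9_CODE": ["4280", "486", "486", "0389", "4280"],
})

# Collect all codes of an admission into one list.
codes_per_admission = diagnoses.groupby("HADM_ID")["ICD9_CODE"].apply(list)

# One column per code, one row per admission, 1 where the code is present.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(codes_per_admission)

print(mlb.classes_)
print(Y)
```

Each row of `Y` can then be paired with the note text or tabular features of the same `HADM_ID`.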

---

### **6. Ventilator Use Prediction**
   - **Goal**: Predict whether a patient will require mechanical ventilation.
   - **Key Columns and Attributes**:
     - From `CHARTEVENTS`:
       - SpO2 (oxygen saturation) and respiratory rate.
     - From `LABEVENTS`:
       - Blood gases (pCO2, pO2).
     - From `INPUTEVENTS_CV` / `INPUTEVENTS_MV`:
       - Drugs related to sedation or muscle relaxation.
   - **Approach**:
     - Binary classification using decision trees, random forests, or deep learning.

---

### **7. Predicting Outcomes for Specific Conditions**
   - **Example**: Predicting outcomes for patients with acute kidney injury (AKI).
   - **Key Columns and Attributes**:
     - From `LABEVENTS`:
       - Creatinine levels, electrolytes, pH levels.
     - From `OUTPUTEVENTS`:
       - Urine output.
     - From `CHARTEVENTS`:
       - Blood pressure.
   - **Approach**:
     - Combining static features (age, gender) and dynamic features (creatinine trends).
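
As a rough illustration of mixing the two feature types, the sketch below joins made-up static attributes with a simple creatinine trend. `CREAT_FIRST`, `CREAT_MAX`, and `CREAT_RATIO` are hypothetical feature names, and the ratio to the first measurement is only one possible trend summary.

```python
import pandas as pd

# Invented creatinine series (mg/dL) for two admissions.
creat = pd.DataFrame({
    "HADM_ID": [1, 1, 1, 2, 2],
    "CHARTTIME": pd.to_datetime([
        "2150-01-01 06:00", "2150-01-02 06:00", "2150-01-03 06:00",
        "2150-05-01 06:00", "2150-05-02 06:00",
    ]),
    "CREATININE": [1.0, 1.4, 2.0, 0.9, 1.0],
})

# Dynamic features: first value, peak value, and their ratio per admission.
creat = creat.sort_values(["HADM_ID", "CHARTTIME"])
grouped = creat.groupby("HADM_ID")["CREATININE"]
dynamic = pd.DataFrame({"CREAT_FIRST": grouped.first(), "CREAT_MAX": grouped.max()})
dynamic["CREAT_RATIO"] = dynamic["CREAT_MAX"] / dynamic["CREAT_FIRST"]

# Static features, indexed by the same admission key.
static = pd.DataFrame(
    {"HADM_ID": [1, 2], "AGE": [70, 55], "GENDER": ["M", "F"]}
).set_index("HADM_ID")

features = static.join(dynamic)
print(features)
```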

---

### General Workflow for Predictions Using MIMIC Data
1. **Data Extraction**:
   - Identify relevant tables (e.g., `CHARTEVENTS`, `LABEVENTS`, `ADMISSIONS`).
   - Use SQL queries to extract data, joining on `SUBJECT_ID`, `HADM_ID`, or `ICUSTAY_ID`.

2. **Data Cleaning**:
   - Handle missing values, outliers, and erroneous data.
   - Standardize units of measurement (e.g., converting °F to °C).

3. **Feature Engineering**:
   - Aggregate time-series data into summary statistics (e.g., max, min, mean).
   - Extract sequential patterns from time-series data.

4. **Modeling**:
   - Select appropriate algorithms (e.g., logistic regression for classification, LSTM for time-series).
   - Train models using patient data.

5. **Evaluation**:
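   - For classification, report AUROC, precision, and recall alongside accuracy; accuracy alone can mislead when the positive class (e.g., death) is rare. For regression tasks such as LOS, use RMSE.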
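
The cleaning and feature-engineering steps can be sketched on a handful of made-up events. The `LABEL` values (`heart_rate`, `temp_f`) are hypothetical; in MIMIC the measurement type is identified by `ITEMID`, so a real pipeline would map ITEMIDs to such labels first.

```python
import pandas as pd

# Invented charted events for two ICU stays.
events = pd.DataFrame({
    "ICUSTAY_ID": [10, 10, 10, 20, 20],
    "LABEL": ["heart_rate", "heart_rate", "temp_f", "heart_rate", "temp_f"],
    "VALUENUM": [80.0, 100.0, 98.6, 60.0, 102.2],
})

# Data cleaning: standardize units by converting Fahrenheit to Celsius.
is_f = events["LABEL"] == "temp_f"
events.loc[is_f, "VALUENUM"] = (events.loc[is_f, "VALUENUM"] - 32) * 5 / 9
events.loc[is_f, "LABEL"] = "temp_c"

# Feature engineering: one row per stay, min/max/mean per measurement type.
summary = events.pivot_table(
    index="ICUSTAY_ID", columns="LABEL", values="VALUENUM",
    aggfunc=["min", "max", "mean"],
)
# Flatten the (statistic, label) column pairs into names like heart_rate_mean.
summary.columns = [f"{label}_{stat}" for stat, label in summary.columns]

print(summary)
```

The resulting wide table has one row per stay and can be passed directly to a tabular model.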

---

### Python Example for ICU Mortality Prediction
```python
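import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

# Heart-rate ITEMIDs in MIMIC-III (211 = CareVue, 220045 = MetaVision)
HEART_RATE_ITEMIDS = [211, 220045]

# Load data (CHARTEVENTS.csv is very large; in practice filter in SQL or read in chunks)
chartevents = pd.read_csv("CHARTEVENTS.csv", usecols=['HADM_ID', 'ITEMID', 'VALUENUM', 'CHARTTIME'],
                          parse_dates=['CHARTTIME'])
admissions = pd.read_csv("ADMISSIONS.csv", usecols=['HADM_ID', 'ADMITTIME', 'DEATHTIME'],
                         parse_dates=['ADMITTIME', 'DEATHTIME'])

# Keep a single vital sign: averaging VALUE over every ITEMID would mix unrelated units
chartevents = chartevents[chartevents['ITEMID'].isin(HEART_RATE_ITEMIDS)]
chartevents = chartevents.dropna(subset=['HADM_ID', 'VALUENUM'])

# Merge on HADM_ID so each measurement is matched to its own admission
merged_data = pd.merge(chartevents, admissions, on='HADM_ID', how='inner')

# Feature engineering
# Example: mean heart rate over the first 24 hours after admission
first_24h = merged_data[merged_data['CHARTTIME'] < (merged_data['ADMITTIME'] + pd.Timedelta(hours=24))]
features = first_24h.groupby('HADM_ID')['VALUENUM'].mean().reset_index()
features.rename(columns={'VALUENUM': 'AVG_HEART_RATE'}, inplace=True)

# Add target variable (in-hospital mortality: DEATHTIME is set only if the patient died)
features = pd.merge(features, admissions[['HADM_ID', 'DEATHTIME']], on='HADM_ID', how='left')
features['MORTALITY'] = features['DEATHTIME'].notnull().astype(int)

# Train-test split (stratified, because deaths are the minority class)
X = features[['AVG_HEART_RATE']]
y = features['MORTALITY']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Model training
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Evaluation: AUROC is more informative than accuracy on imbalanced outcomes
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]
print("Accuracy:", accuracy_score(y_test, y_pred))
print("AUROC:", roc_auc_score(y_test, y_prob))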
```

By combining domain knowledge, robust feature engineering, and advanced models, MIMIC data allows for impactful predictive insights in healthcare.

To predict ICD-9 codes (International Classification of Diseases, 9th Revision), the following types of information are typically needed. These details describe the patient's medical condition and associated healthcare data:

### 1. **Demographic Information**
   - **Age**: Helps in diagnosing age-specific conditions.
   - **Sex**: Some diseases are sex-specific or more common in one sex.
   - **Ethnicity**: Certain conditions have higher prevalence in specific ethnic groups.

### 2. **Clinical Information**
   - **Primary Diagnosis**: The main reason for the patient's visit or admission.
   - **Secondary Diagnoses**: Any coexisting conditions that may impact treatment or outcomes.
   - **Symptoms**: A detailed description of the patient's symptoms.
   - **Vital Signs**: Blood pressure, heart rate, temperature, etc.

### 3. **Medical History**
   - **Chronic Conditions**: Previous diagnoses such as diabetes, hypertension, or asthma.
   - **Past Surgeries or Treatments**: Helps understand complications or predisposing factors.
   - **Family History**: Genetic predispositions to certain diseases.

### 4. **Laboratory Results**
   - **Blood Tests**: Levels of glucose, hemoglobin, cholesterol, etc.
   - **Urinalysis**: Indicates infections, kidney conditions, or other disorders.
   - **Imaging Results**: X-rays, CT scans, MRIs for structural abnormalities.

### 5. **Medications**
   - **Current Medications**: Can give clues about ongoing treatment and conditions.
   - **Medication History**: Allergies, prior adverse reactions, and treatment patterns.

### 6. **Procedures**
   - **Diagnostic Procedures**: Biopsies, endoscopies, etc.
   - **Therapeutic Procedures**: Surgeries or interventions already performed.

### 7. **Social Determinants of Health**
   - **Lifestyle Choices**: Smoking, alcohol use, diet, exercise.
   - **Occupational Hazards**: Exposure to chemicals, repetitive stress injuries.
   - **Living Conditions**: Housing stability, access to healthcare, and socioeconomic status.

### 8. **Encounter Information**
   - **Reason for Visit**: Symptoms or issues prompting the encounter.
   - **Length of Stay**: For inpatient cases, this may hint at the severity.
   - **Specialty**: The type of healthcare provider (e.g., cardiologist, neurologist).

### 9. **Natural Language Data**
   - **Clinical Notes**: Free-text descriptions from physicians or nurses about patient conditions, examination findings, and differential diagnoses.

### 10. **Behavioral and Psychological Assessments**
   - **Mental Health Diagnoses**: Depression, anxiety, or other psychiatric conditions.
   - **Cognitive Testing Results**: When relevant, for conditions like dementia or developmental delays.

Once this information is collected and preprocessed, machine learning models can map a patient record to the appropriate ICD-9 codes. However, it's crucial to protect patient privacy and follow HIPAA (Health Insurance Portability and Accountability Act) regulations when working with such sensitive data.

In [64]:
import pandas as pd
import numpy as np

In [65]:
df=pd.read_csv(r'D:\FINALYEARPROJECTREC\data\ADMISSIONS.csv')

In [66]:
df.head()

Unnamed: 0,ROW_ID,SUBJECT_ID,HADM_ID,ADMITTIME,DISCHTIME,DEATHTIME,ADMISSION_TYPE,ADMISSION_LOCATION,DISCHARGE_LOCATION,INSURANCE,LANGUAGE,RELIGION,MARITAL_STATUS,ETHNICITY,EDREGTIME,EDOUTTIME,DIAGNOSIS,HOSPITAL_EXPIRE_FLAG,HAS_CHARTEVENTS_DATA
0,21,22,165315,09-04-2196 12:26,10-04-2196 15:54,,EMERGENCY,EMERGENCY ROOM ADMIT,DISC-TRAN CANCER/CHLDRN H,Private,,UNOBTAINABLE,MARRIED,WHITE,09-04-2196 10:06,09-04-2196 13:24,BENZODIAZEPINE OVERDOSE,0,1
1,22,23,152223,03-09-2153 07:15,08-09-2153 19:10,,ELECTIVE,PHYS REFERRAL/NORMAL DELI,HOME HEALTH CARE,Medicare,,CATHOLIC,MARRIED,WHITE,,,CORONARY ARTERY DISEASE\CORONARY ARTERY BYPASS...,0,1
2,23,23,124321,18-10-2157 19:34,25-10-2157 14:00,,EMERGENCY,TRANSFER FROM HOSP/EXTRAM,HOME HEALTH CARE,Medicare,ENGL,CATHOLIC,MARRIED,WHITE,,,BRAIN MASS,0,1
3,24,24,161859,06-06-2139 16:14,09-06-2139 12:48,,EMERGENCY,TRANSFER FROM HOSP/EXTRAM,HOME,Private,,PROTESTANT QUAKER,SINGLE,WHITE,,,INTERIOR MYOCARDIAL INFARCTION,0,1
4,25,25,129635,02-11-2160 02:06,05-11-2160 14:55,,EMERGENCY,EMERGENCY ROOM ADMIT,HOME,Private,,UNOBTAINABLE,MARRIED,WHITE,02-11-2160 01:01,02-11-2160 04:27,ACUTE CORONARY SYNDROME,0,1


In [67]:
df['DIAGNOSIS'].value_counts()

DIAGNOSIS
NEWBORN                                  7823
PNEUMONIA                                1566
SEPSIS                                   1184
CONGESTIVE HEART FAILURE                  928
CORONARY ARTERY DISEASE                   840
                                         ... 
DIAPHRAGM RUPTURE                           1
RIGHT ANTERIOR CEREBRAL ARTERY STROKE       1
HYPOXIA, ACUTE RENAL FAILURE                1
S/P MOTOR VECHICLE ACCIDENT                 1
JOINT EFFUSION                              1
Name: count, Length: 15682, dtype: int64

In [68]:
df['DIAGNOSIS'].unique().tolist()

['BENZODIAZEPINE OVERDOSE',
 'CORONARY ARTERY DISEASE\\CORONARY ARTERY BYPASS GRAFT/SDA',
 'BRAIN MASS',
 'INTERIOR MYOCARDIAL INFARCTION',
 'ACUTE CORONARY SYNDROME',
 'V-TACH',
 'NEWBORN',
 'UNSTABLE ANGINA\\CATH',
 'STATUS EPILEPTICUS',
 'TRACHEAL STENOSIS/SDA',
 'SEPSIS;TELEMETRY',
 'CHEST PAIN\\CATH',
 'BRADYCARDIA',
 'AORTIC VALVE DISEASE\\CORONARY ARTERY BYPASS GRAFT WITH AVR /SDA',
 'CORONARY ARTERY DISEASE\\CORONARY ARTERY BYPASS GRAFT /SDA',
 'CHEST PAIN/SHORTNESS OF BREATH',
 'VENTRAL HERNIA/SDA',
 'CONGESTIVE HEART FAILURE',
 'ACUTE MYOCARDIAL INFARCTION-SEPSIS',
 'RIGHT BRAIN STEM LESION/SDA',
 'GASTROINTESTINAL BLEED',
 'SEIZURE',
 'SEPSIS',
 'PNEUMONIA',
 'ALTERED MENTAL STATUS',
 'R/O MYOCARDIAL INFARCTION',
 'CEREBROVASCULAR ACCIDENT;TELEMETRY',
 'HYPOTENSION',
 'DEEP VEIN THROMBOSIS;HEMOCULT POSITIVE',
 'HYPONATREMIA-R/O MYOCARDIAL INFARCTION-RHABDOMYOLYSIS',
 'SUBDURAL HEMATOMA',
 'MASSIVE HEMOPTYSIS',
 'CORONARY ARTERY DISEASE\\CARDIAC CATH',
 'NECROTIZING FASCITITI

In [69]:
df[df['DIAGNOSIS']=='PNEUMONIA'].to_csv(r'D:\FINALYEARPROJECTREC\artifacts\PNEUMONIA.csv')

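In [70]:
# NOTE: this writes the SEPSIS rows to PNEUMONIA.csv, overwriting the file saved in In [69].
# The PNEUMONIA frame loaded below therefore contains SEPSIS admissions.
df[df['DIAGNOSIS']=='SEPSIS'].to_csv(r'D:\FINALYEARPROJECTREC\artifacts\PNEUMONIA.csv')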

In [71]:
df[df['DIAGNOSIS']=='CONGESTIVE HEART FAILURE'].to_csv(r'D:\FINALYEARPROJECTREC\artifacts\CONGESTIVE HEART FAILURE.csv')

In [73]:
df[df['DIAGNOSIS']=='CORONARY ARTERY DISEASE'].to_csv(r'D:\FINALYEARPROJECTREC\artifacts\CORONARY ARTERY DISEASE.csv')

In [74]:
PNEUMONIA=pd.read_csv(r'D:\FINALYEARPROJECTREC\artifacts\PNEUMONIA.csv')

In [80]:
PNEUMONIA

Unnamed: 0.1,Unnamed: 0,ROW_ID,SUBJECT_ID,HADM_ID,ADMITTIME,DISCHTIME,DEATHTIME,ADMISSION_TYPE,ADMISSION_LOCATION,DISCHARGE_LOCATION,INSURANCE,LANGUAGE,RELIGION,MARITAL_STATUS,ETHNICITY,EDREGTIME,EDOUTTIME,DIAGNOSIS,HOSPITAL_EXPIRE_FLAG,HAS_CHARTEVENTS_DATA
0,24,458,357,122609,01-11-2198 22:36,14-11-2198 14:20,,EMERGENCY,EMERGENCY ROOM ADMIT,REHAB/DISTINCT PART HOSP,Private,ENGL,NOT SPECIFIED,MARRIED,WHITE,01-11-2198 18:01,01-11-2198 23:06,SEPSIS,0,1
1,37,471,366,134462,18-11-2164 20:27,22-11-2164 15:18,,EMERGENCY,EMERGENCY ROOM ADMIT,HOME HEALTH CARE,Medicare,ENGL,CATHOLIC,SINGLE,HISPANIC OR LATINO,18-11-2164 10:52,18-11-2164 21:31,SEPSIS,0,1
2,98,96,94,183686,25-02-2176 16:49,29-02-2176 17:45,,EMERGENCY,EMERGENCY ROOM ADMIT,HOME HEALTH CARE,Medicare,CANT,NOT SPECIFIED,MARRIED,ASIAN,25-02-2176 10:35,25-02-2176 18:14,SEPSIS,0,1
3,230,20,21,111970,30-01-2135 20:50,08-02-2135 02:08,08-02-2135 02:08,EMERGENCY,EMERGENCY ROOM ADMIT,DEAD/EXPIRED,Medicare,,JEWISH,MARRIED,WHITE,30-01-2135 18:46,30-01-2135 22:05,SEPSIS,1,1
4,300,448,353,108923,28-03-2151 16:01,13-04-2151 16:10,,EMERGENCY,EMERGENCY ROOM ADMIT,HOME,Medicare,PTUN,JEWISH,SINGLE,WHITE,28-03-2151 13:02,28-03-2151 17:46,SEPSIS,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1179,58852,57781,96261,150731,27-03-2196 17:05,08-04-2196 12:40,,EMERGENCY,CLINIC REFERRAL/PREMATURE,HOME HEALTH CARE,Private,ENGL,NOT SPECIFIED,MARRIED,WHITE,27-03-2196 15:00,27-03-2196 18:59,SEPSIS,0,1
1180,58860,55969,90688,112686,09-11-2154 11:29,13-11-2154 17:30,,EMERGENCY,CLINIC REFERRAL/PREMATURE,LONG TERM CARE HOSPITAL,Medicare,ENGL,CATHOLIC,WIDOWED,WHITE,09-11-2154 05:33,09-11-2154 12:50,SEPSIS,0,1
1181,58890,57483,95372,181449,17-04-2124 16:11,24-04-2124 15:38,,EMERGENCY,CLINIC REFERRAL/PREMATURE,HOME HEALTH CARE,Medicare,SPAN,OTHER,MARRIED,HISPANIC OR LATINO,17-04-2124 09:52,17-04-2124 17:17,SEPSIS,0,1
1182,58934,58557,98698,134977,18-10-2188 02:00,22-10-2188 15:53,,EMERGENCY,EMERGENCY ROOM ADMIT,HOME,Government,ENGL,CATHOLIC,MARRIED,WHITE,17-10-2188 22:01,18-10-2188 03:27,SEPSIS,0,1


In [75]:
CONGESTIVE_HEART_FAILURE=pd.read_csv(r'D:\FINALYEARPROJECTREC\artifacts\CONGESTIVE HEART FAILURE.csv')

In [76]:
CORONARY_ARTERY_DISEASE=df[df['DIAGNOSIS']=='CORONARY ARTERY DISEASE']

In [77]:
data1=pd.concat([PNEUMONIA,CONGESTIVE_HEART_FAILURE,CORONARY_ARTERY_DISEASE],axis=0,ignore_index=True)

In [78]:
data1['DIAGNOSIS'].value_counts()

DIAGNOSIS
SEPSIS                      1184
CONGESTIVE HEART FAILURE     928
CORONARY ARTERY DISEASE      840
Name: count, dtype: int64

In [79]:
data1['DIAGNOSIS']

0                        SEPSIS
1                        SEPSIS
2                        SEPSIS
3                        SEPSIS
4                        SEPSIS
                 ...           
2947    CORONARY ARTERY DISEASE
2948    CORONARY ARTERY DISEASE
2949    CORONARY ARTERY DISEASE
2950    CORONARY ARTERY DISEASE
2951    CORONARY ARTERY DISEASE
Name: DIAGNOSIS, Length: 2952, dtype: object
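
The filter, save, reload, and concatenate steps above can be collapsed into a single `isin` filter, which also avoids overwriting one diagnosis file with another. A minimal sketch on a toy frame; `TARGET_DIAGNOSES` is an assumed list of the diagnoses of interest and should be adjusted to the intended cohort.

```python
import pandas as pd

# Toy stand-in for the ADMISSIONS frame loaded as df above.
df = pd.DataFrame({
    "HADM_ID": [1, 2, 3, 4, 5],
    "DIAGNOSIS": ["PNEUMONIA", "SEPSIS", "NEWBORN", "CONGESTIVE HEART FAILURE", "SEPSIS"],
})

# Assumed set of diagnoses to keep.
TARGET_DIAGNOSES = [
    "PNEUMONIA", "SEPSIS", "CONGESTIVE HEART FAILURE", "CORONARY ARTERY DISEASE",
]

# Exact-match filter on the free-text DIAGNOSIS column, in one step.
subset = df[df["DIAGNOSIS"].isin(TARGET_DIAGNOSES)].reset_index(drop=True)
counts = subset["DIAGNOSIS"].value_counts()
print(counts)
```

Note that this matches exact strings only, so compound entries such as `SEPSIS;TELEMETRY` are not selected.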

In [None]:
import pandas as pd
import numpy as np

In [1]:
import pandas as pd

# Load the data
notes_df = pd.read_csv(r'D:\FINALYEARPROJECTREC\data\NOTEEVENTS.csv')
diagnoses_df = pd.read_csv(r'D:\FINALYEARPROJECTREC\data\DIAGNOSES_ICD.csv')

In [3]:
# Normalize column names to upper case before merging; if the CSV headers are
# lower case, merging on 'HADM_ID' fails with KeyError: 'HADM_ID'
notes_df.columns=notes_df.columns.str.upper()
diagnoses_df.columns=diagnoses_df.columns.str.upper()
# Merge the data on HADM_ID
# (each note is paired with every ICD-9 code of its admission, so the same TEXT can appear with several labels)
merged_df = pd.merge(notes_df, diagnoses_df, on='HADM_ID')

# Select relevant columns
data = merged_df[['TEXT', 'ICD9_CODE']]

# Drop rows with missing values
data = data.dropna()

# Sample the data for simplicity (optional)
data = data.sample(frac=0.1, random_state=42)

# Display the first few rows
print(data.head())

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Convert ICD9 codes to integer labels (cast to str so numeric and alphanumeric codes encode consistently)
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(data['ICD9_CODE'].astype(str))

# Split the raw text first so the TF-IDF vocabulary is learned from training notes only
text_train, text_test, y_train, y_test = train_test_split(data['TEXT'], y, test_size=0.2, random_state=42)

# Convert text to TF-IDF features
vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')
X_train = vectorizer.fit_transform(text_train)
X_test = vectorizer.transform(text_test)

In [None]:
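import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Train a Logistic Regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.4f}')

# Print classification report, restricted to the codes that actually occur in y_test or y_pred
# (passing every class name fails when some codes are absent from the test split)
labels_present = np.unique(np.concatenate([y_test, y_pred]))
print(classification_report(y_test, y_pred, labels=labels_present,
                            target_names=label_encoder.classes_[labels_present].astype(str),
                            zero_division=0))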

In [None]:
import joblib

# Save the model and vectorizer
joblib.dump(model, 'icd9_predictor_model.pkl')
joblib.dump(vectorizer, 'tfidf_vectorizer.pkl')
joblib.dump(label_encoder, 'label_encoder.pkl')

In [None]:
# Load the model, vectorizer, and label encoder
model = joblib.load('icd9_predictor_model.pkl')
vectorizer = joblib.load('tfidf_vectorizer.pkl')
label_encoder = joblib.load('label_encoder.pkl')

# Predict on new text
new_text = "Patient presents with chest pain and shortness of breath."
new_text_vectorized = vectorizer.transform([new_text])
predicted_label = model.predict(new_text_vectorized)
predicted_icd9 = label_encoder.inverse_transform(predicted_label)

print(f'Predicted ICD-9 Code: {predicted_icd9[0]}')
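
Since a note rarely maps to a single code, it can be more useful to rank several candidate codes by predicted probability than to return only the top one. The sketch below is self-contained and illustrative: it trains a tiny pipeline on invented notes instead of loading the saved model, and the text-to-code pairs are toy data.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented training notes with one ICD-9 code each (toy data only).
texts = [
    "fever cough and infiltrate on chest x-ray",
    "productive cough with fever and hypoxia",
    "leg edema and shortness of breath with orthopnea",
    "volume overload with dyspnea and edema",
    "hypotension fever and positive blood cultures",
    "septic shock with elevated lactate",
]
codes = ["486", "486", "4280", "4280", "0389", "0389"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(texts, codes)

# Rank codes for a new note by predicted probability and keep the top two.
proba = clf.predict_proba(["Patient presents with cough and fever."])[0]
top_k = np.argsort(proba)[::-1][:2]
ranked = [(clf.classes_[i], float(proba[i])) for i in top_k]

for code, p in ranked:
    print(f"{code}: {p:.3f}")
```

With the saved artifacts, the same ranking would use `model.predict_proba` on the vectorized text and `label_encoder.inverse_transform` on the selected indices.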