# Table Of Contents

1. Introduction
2. Challenges in Solving Hospital Return Visits
3. Dataset Description
    - 3.1. Overview
    - 3.2. Readmission Categorization
    - 3.3. Features in the Dataset
4. Project Setup
5. Exploratory Data Analysis
    - 5.1. Feature Type Classification
    - 5.2. Exploration of Categorical Variables and Data Distribution
    - 5.3. Data Insights through Violin Plots
    - 5.4. Exploring Age Groups in the Dataset
        - 5.4.1. Analysis 1: Number of Medications Across Age Groups by Readmission Status
        - 5.4.2. Analysis 2: Readmission Rates Across Age Groups
6. Data Cleaning and Pre-processing
7. Machine Learning Model Evaluation and Hyperparameter Tuning
    - 7.1. Model Selection and Justification
    - 7.2. Selection of Performance Metrics
    - 7.3. Cross-Validation and Hyperparameter Tuning for SVM
    - 7.4. In-depth Analysis of Cross-Validation and Pre-processing Strategies
    - 7.5. Plotting Model Performance with Hyper-Parameters
    - 7.6. Model Performance Overview: Mean and Std Metrics
8. Baseline Model Creation and Evaluation
9. Final Model Performance Evaluation
10. Comparison of Final Model v/s Baseline Model Metrics
11. Conclusion: Comparison and Insights from Both Models


### Introduction

Diabetes is a condition in which blood sugar levels are too high. Monitoring diabetic patients during a hospital stay is important. This study aims to predict whether a diabetic patient will return to the hospital within 30 days of discharge. Such predictions could help reduce treatment costs, address medical problems earlier, and keep patients healthy and safe. The study uses data from Virginia Commonwealth University.

### Challenges in Solving Hospital Return Visits

Several factors make it hard to prevent patients from returning to the hospital after discharge. Some health conditions are difficult to manage, especially when a patient has several of them or a chronic illness. Hospital departments and doctors do not always communicate well with each other, so patients may not get the right care when they need it. After discharge, patients can struggle to understand their doctors' instructions, get the help they need, or care for themselves, mostly because of financial problems, lack of support, or lack of knowledge. Some do not receive the help they need at home or the right medication. The way hospitals are paid may also give them too little incentive to prevent return visits. Finally, combining a patient's information from different sources to identify who needs extra help to avoid a return visit is itself difficult.

### Dataset Description

#### Overview

The dataset (https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008) used in this study covers ten years (1999-2008) and includes records from 130 hospitals and healthcare networks in the US. Each entry is a hospital record of a patient diagnosed with diabetes. It contains details about lab tests, medications, and a hospital stay of up to 14 days. The main aim is to predict whether these patients will return to the hospital within 30 days of discharge.

#### Readmission Categorization

In the dataset, there are three distinct classes used for readmission categorization:

1. `<30` indicates patients who were readmitted within less than 30 days.
2. `>30` represents patients who were readmitted after more than 30 days.
3. `NO` signifies no record of readmission.

To simplify the classification task, we group patients into two classes: those with no record of readmission and those with a record of readmission. The second class merges the "<30" and ">30" labels.
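The merge of the two readmission labels can be sketched as follows. This is a minimal illustration on a toy frame, not the notebook's own preprocessing code: the column name `readmitted` comes from the dataset, while `readmitted_binary` and the toy rows are assumptions made for the example.

```python
import pandas as pd

# Toy stand-in for the target column (hypothetical rows, not real records)
toy = pd.DataFrame({"readmitted": ["NO", "<30", ">30", "NO", ">30"]})

# 1 = any recorded readmission ("<30" or ">30"), 0 = no record of readmission.
# Matching on the two readmission labels means the exact spelling of the
# no-readmission label does not matter here.
toy["readmitted_binary"] = toy["readmitted"].isin(["<30", ">30"]).astype(int)

print(toy["readmitted_binary"].tolist())
```

Applying the same expression to the full `dataset` frame would produce the two-class target, assuming its labels are spelled as listed above.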

#### Features in the Dataset

The dataset comprises various features that provide detailed information about patient encounters in hospitals. Some of the features include:

1. **Encounter ID and Patient Number:** Unique identifiers for each encounter and for each individual patient in the dataset.

2. **Race and Gender:** Categorical features representing the patient's race and gender. Race includes categories such as Caucasian, Asian, African American, Hispanic, and others. Gender includes male, female, and unknown/invalid values.

3. **Age:** Categorized into 10-year intervals ranging from 0 to 100 years old.

4. **Weight:** Indicates the patient's weight in pounds.

5. **Admission Type, Discharge Disposition, and Admission Source IDs:** Categorical features represented by integer identifiers, signifying details about the admission type, discharge disposition, and admission source.

6. **Time in Hospital:** Represents the duration of the patient's hospital stay in days.

7. **Payer Code and Medical Specialty:** Categorical features denoting the payer code and medical specialty, providing information about the patient's insurance payer and the admitting physician's specialty.

8. **Number of Lab Procedures, Procedures, and Medications:** Numeric features indicating the count of lab procedures, non-lab procedures, and distinct medications administered during the patient encounter.

9. **Number of Outpatient, Emergency, and Inpatient Visits:** Numeric features indicating the count of outpatient, emergency, and inpatient visits of the patient in the year before the encounter.

10. **Diagnosis Codes (Diag_1, Diag_2):** Categorical features representing primary and secondary diagnoses using codes from the International Classification of Diseases, 9th Revision (ICD-9). These codes signify specific medical diagnoses or conditions.

11. **A1Cresult:** A categorical feature representing the range of the A1C test result or indicating that the test wasn't conducted. The values include >8 if the result was greater than 8%, >7 if between 7% and 8%, normal if less than 7%, and none if the test wasn't conducted. No missing values are present for this feature.

12. **Medication Features (e.g., Metformin, Repaglinide, Nateglinide, Chlorpropamide, Glimepiride, Acetohexamide):** These categorical features indicate if a drug was prescribed or if there was a change in dosage during the patient encounter. Values include 'up' for increased dosage, 'down' for decreased dosage, 'steady' for unchanged dosage, and 'no' if the drug was not prescribed. There are no missing values for these features.
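One way to check the value sets described above is to list the distinct values and missing counts per column. The sketch below runs on a toy frame: the helper `category_summary` is hypothetical, and the toy rows and the capitalization of the values are assumptions that should be checked against the real file.

```python
import pandas as pd

# Toy stand-in for two medication columns (hypothetical rows; value spellings assumed)
toy = pd.DataFrame({
    "metformin": ["No", "Steady", "Up", "No"],
    "insulin": ["Down", "No", "Steady", "Up"],
})

def category_summary(df, columns):
    """Return, per column, the sorted distinct values and the number of null entries."""
    return {
        col: {
            "values": sorted(df[col].dropna().unique()),
            "n_missing": int(df[col].isna().sum()),
        }
        for col in columns
    }

summary = category_summary(toy, ["metformin", "insulin"])
for col, info in summary.items():
    print(col, info)
```

Running the same helper on the loaded data would show whether the medication columns really contain only the four documented values and no nulls.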

### Setup and Library Imports

In [2]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder, OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer


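# Suppress all warnings to keep the notebook output readable
# (note that this also hides deprecation notices)
import warnings
warnings.filterwarnings("ignore")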

### Loading Dataset and Initial Data Exploration

In [3]:
path = 'diabetic_data.csv'
dataset = pd.read_csv(path)

print(f'The shape of the dataset is: {dataset.shape}\n')
print(f'The type of the columns is:\n{dataset.dtypes.value_counts()}')

The shape of the dataset is: (101766, 50)

The type of the columns is:
object    37
int64     13
Name: count, dtype: int64


#### Feature Type Classification

The data contains 50 features and 101,766 samples, a sample-to-feature ratio of about 2035:1.

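By dtype, the 50 columns split into:
- 37 `object` columns (categorical)
- 13 `int64` columns (numeric)

The cell below refines this split by also treating integer columns with 15 or fewer distinct values as categorical, which gives 40 categorical and 10 continuous features.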

In [4]:
categorical_features = []
continuous_features = []

# Treat object columns, and integer columns with at most 15 distinct values, as categorical
for column in dataset.columns:
    if dataset[column].dtype == 'object':
        categorical_features.append(column)
    else:
        unique_values = dataset[column].nunique()
        if unique_values <= 15:
            categorical_features.append(column)
        else:
            continuous_features.append(column)

# Keep the first five rows for a quick look at the raw values
sample_rows = dataset.head(5)


print("\nCategorical Features Identification:")
print("\n")
print(categorical_features)
print("\nContinuous Features Identification:")
print("\n")
print(continuous_features)

print("\nSample Rows:")
print("\n")
print(sample_rows)



Categorical Features Identification:


['race', 'gender', 'age', 'weight', 'admission_type_id', 'time_in_hospital', 'payer_code', 'medical_specialty', 'num_procedures', 'diag_1', 'diag_2', 'diag_3', 'max_glu_serum', 'A1Cresult', 'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide', 'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide', 'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone', 'tolazamide', 'examide', 'citoglipton', 'insulin', 'glyburide-metformin', 'glipizide-metformin', 'glimepiride-pioglitazone', 'metformin-rosiglitazone', 'metformin-pioglitazone', 'change', 'diabetesMed', 'readmitted']

Continuous Features Identification:


['encounter_id', 'patient_nbr', 'discharge_disposition_id', 'admission_source_id', 'num_lab_procedures', 'num_medications', 'number_outpatient', 'number_emergency', 'number_inpatient', 'number_diagnoses']

Sample Rows:


   encounter_id  patient_nbr             race  gender      age weight  \
0       22783

> We observe that many features in the dataset contain the value '?'. For the feature weight, most of the values are '?'.
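The share of '?' values per column can be measured directly. The following is a minimal sketch on a toy frame, not the notebook's own code: the rows and the name `placeholder_share` are assumptions made for the example.

```python
import pandas as pd

# Toy stand-in for a few columns (hypothetical rows, not real records)
toy = pd.DataFrame({
    "weight": ["?", "?", "[75-100)", "?"],
    "race": ["Caucasian", "?", "AfricanAmerican", "Caucasian"],
    "num_medications": [12, 7, 20, 3],
})

# Fraction of '?' entries per column, largest first
placeholder_share = toy.isin(["?"]).mean().sort_values(ascending=False)
print(placeholder_share)
```

On the full `dataset` frame, the same expression would rank columns by how often the placeholder appears. Another option is to pass `na_values="?"` to `pd.read_csv`, so that these entries are loaded as missing values and show up in `isna()`.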
