<a href="https://colab.research.google.com/github/PhysicianTechie/4-Data-Science-and-Machine-Learning/blob/main/Diabetes_130_US_Hospitals_for_Years_1999_2008_UC_IRVINE.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



***Diabetes Readmission Project***

**Objective**
The project aims to predict the likelihood of re-admission for diabetic patients using various machine learning techniques.

Summary by
Dr Vaibhav Joshi
Nov 2024


Diabetes 130-US Hospitals for Years 1999-2008
Donated on 5/2/2014
The dataset represents ten years (1999-2008) of clinical care at 130 US hospitals and integrated delivery networks. Each row is the hospital record of a patient diagnosed with diabetes who underwent laboratory tests, received medications, and stayed up to 14 days. The goal is to determine whether the patient is readmitted early, that is, within 30 days of discharge. The problem is important for the following reasons. Despite high-quality evidence showing improved clinical outcomes for diabetic patients who receive various preventive and therapeutic interventions, many patients do not receive them. This can be partially attributed to arbitrary diabetes management in hospital environments, which fails to attend to glycemic control. Failure to provide proper diabetes care not only increases management costs for hospitals (as patients are readmitted) but also affects the morbidity and mortality of patients, who may face complications associated with diabetes.

Dataset Characteristics
Multivariate

Subject Area
Health and Medicine

Associated Tasks
Classification, Clustering

Feature Type
Categorical, Integer

Number of Instances
101766

Number of Features
47


In [None]:
# prompt: please upload csv file

from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

Saving diabetic_data.csv to diabetic_data.csv
Saving IDS_mapping.csv to IDS_mapping.csv
User uploaded file "diabetic_data.csv" with length 19159383 bytes
User uploaded file "IDS_mapping.csv" with length 2547 bytes


Dataset Information
What do the instances in this dataset represent?

The instances are hospital records of patients diagnosed with diabetes.

Are there recommended data splits?

No recommendation is given. A standard train-test split can be used; a three-way holdout split (train-validation-test) is appropriate when doing model selection.
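
A three-way holdout split can be sketched as two successive calls to scikit-learn's `train_test_split`. This is only an illustration on a toy frame, not the notebook's actual split: the 60/20/20 proportions, the `random_state`, and the toy values are assumptions, and only the column names come from `diabetic_data.csv`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real data (100 rows); values are made up.
toy = pd.DataFrame({
    "time_in_hospital": list(range(1, 11)) * 10,
    "readmitted": (["NO"] * 6 + [">30"] * 3 + ["<30"]) * 10,
})

X = toy.drop(columns="readmitted")
y = toy["readmitted"]

# Step 1: hold out 20% of the rows as the final test set.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)

# Step 2: carve a validation set out of the remainder
# (0.25 of the remaining 80% is 20% of all rows).
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```

Because a patient can appear in several encounters, a split grouped on `patient_nbr` (for example with `GroupShuffleSplit`) may be preferable, so that the same patient does not end up in both the training and the test set.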

Does the dataset contain data that might be considered sensitive in any way?

Yes. The dataset contains information about the age, gender, and race of the patients.

Additional Information

The dataset represents ten years (1999-2008) of clinical care at 130 US hospitals and integrated delivery networks. It includes over 50 features representing patient and hospital outcomes. Information was extracted from the database for encounters that satisfied the following criteria.
(1)	It is an inpatient encounter (a hospital admission).
(2)	It is a diabetic encounter, that is, one during which any kind of diabetes was entered into the system as a diagnosis.
(3)	The length of stay was at least 1 day and at most 14 days.
(4)	Laboratory tests were performed during the encounter.
(5)	Medications were administered during the encounter.

The data contains such attributes as patient number, race, gender, age, admission type, time in hospital, medical specialty of admitting physician, number of lab tests performed, HbA1c test result, diagnosis, number of medications, diabetic medications, number of outpatient, inpatient, and emergency visits in the year before the hospitalization, etc.

Has Missing Values?

Yes

Data Loading
•	Dataset: diabetic_data.csv
•	Entries: 101,766
•	Columns: 50
•	Tool: Pandas DataFrame (df_diabetes)


In [None]:
# prompt: lets have some basic information about the dataset

import io

import pandas as pd

# Two files were uploaded above, so select the data file by name instead of
# relying on the order of the keys in the 'uploaded' dictionary.
file_name = "diabetic_data.csv"
# Read the uploaded bytes into a pandas DataFrame
df = pd.read_csv(io.BytesIO(uploaded[file_name]))


# Basic dataset information
print("Dataset shape:", df.shape)  # Number of rows and columns
print("\nDataset columns:", df.columns.tolist())  # List of column names
print("\nDataset info:\n")
df.info()  # Data types and non-null counts (prints directly and returns None)
print("\nDescriptive statistics:\n")
print(df.describe())  # Basic statistical information (count, mean, std, etc.)
print("\nNumber of missing values per column:\n")
print(df.isnull().sum())  # Count of missing values for each column

Dataset shape: (101766, 50)

Dataset columns: ['encounter_id', 'patient_nbr', 'race', 'gender', 'age', 'weight', 'admission_type_id', 'discharge_disposition_id', 'admission_source_id', 'time_in_hospital', 'payer_code', 'medical_specialty', 'num_lab_procedures', 'num_procedures', 'num_medications', 'number_outpatient', 'number_emergency', 'number_inpatient', 'diag_1', 'diag_2', 'diag_3', 'number_diagnoses', 'max_glu_serum', 'A1Cresult', 'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide', 'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide', 'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone', 'tolazamide', 'examide', 'citoglipton', 'insulin', 'glyburide-metformin', 'glipizide-metformin', 'glimepiride-pioglitazone', 'metformin-rosiglitazone', 'metformin-pioglitazone', 'change', 'diabetesMed', 'readmitted']

Dataset info:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101766 entries, 0 to 101765
Data columns (total 50 columns):
 #   Colu

In [None]:
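# prompt: please show me number of patients

# Each row is one hospital encounter and a patient can appear in several rows,
# so len(df) counts encounters. Unique patients would be counted with
# df["patient_nbr"].nunique().
print("Number of encounters:", len(df))

Number of encounters: 101766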


# ***Data cleaning***

In [None]:
# prompt: We will Check for missing values

# Check for missing values in the DataFrame
print("\nNumber of missing values per column:\n")
print(df.isnull().sum())  # Count of missing values for each column

# You can also calculate the percentage of missing values for each column
print("\nPercentage of missing values per column:\n")
print((df.isnull().sum() / len(df)) * 100)


Number of missing values per column:

encounter_id                    0
patient_nbr                     0
race                            0
gender                          0
age                             0
weight                          0
admission_type_id               0
discharge_disposition_id        0
admission_source_id             0
time_in_hospital                0
payer_code                      0
medical_specialty               0
num_lab_procedures              0
num_procedures                  0
num_medications                 0
number_outpatient               0
number_emergency                0
number_inpatient                0
diag_1                          0
diag_2                          0
diag_3                          0
number_diagnoses                0
max_glu_serum               96420
A1Cresult                   84748
metformin                       0
repaglinide                     0
nateglinide                     0
chlorpropamide                  0
glimepiri

# ***Identifying Missing Values***
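We examined the dataset to identify missing values. The null counts above flag the ‘max_glu_serum’ and ‘A1Cresult’ columns: 96,420 of 101,766 rows (about 94.7%) and 84,748 rows (about 83.3%) respectively have no recorded result, so these columns are mostly empty rather than occasionally incomplete. The zero counts for the other columns should be read with care, because the raw file marks missing entries in columns such as ‘weight’ with a ‘?’ placeholder, which isnull() does not detect.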

# ***Calculating Missing Value Percentage***
To understand the impact of missing values, we calculated the percentage of missing data for each column. This percentage guides the choice between dropping a column entirely and imputing its missing values.
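
The calculation can be sketched on a small toy frame as follows. The sketch assumes that a `?` string marks a missing entry, and the 50% drop threshold is a hypothetical rule of thumb rather than a value taken from this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame imitating a few columns of diabetic_data.csv; values are made up.
# Assumption: "?" marks a missing value in the raw file.
toy = pd.DataFrame({
    "weight": ["?", "?", "[75-100)", "?"],
    "race": ["Caucasian", "?", "AfricanAmerican", "Caucasian"],
    "A1Cresult": [np.nan, ">8", np.nan, np.nan],
    "time_in_hospital": [3, 5, 2, 7],
})

# Convert the placeholder to NaN so that isnull() can see it.
cleaned = toy.replace("?", np.nan)

# Share of missing entries per column, as a percentage.
missing_pct = (cleaned.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct)

# Hypothetical rule of thumb: drop columns that are more than 50% missing.
DROP_THRESHOLD = 50.0
to_drop = missing_pct[missing_pct > DROP_THRESHOLD].index.tolist()
reduced = cleaned.drop(columns=to_drop)
print("Dropped:", to_drop)
```

On the real file, the same conversion can be done at load time with `pd.read_csv(..., na_values="?")`. For the two lab-result columns, a missing entry may simply mean that the test was not performed, so recoding it as its own category could be more informative than dropping or imputing.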


# ***Topic 2***
Introduce the data, talk through some key properties and discuss a plan for EDA (e.g. bivariates, handling categorical data).

In [None]:
# prompt:  Introduce the data, talk through some key properties and discuss a plan for EDA (e.g. bivariates, handling categorical data).



**Data Introduction:**

The dataset contains patient records from 130 US hospitals over a decade (1999-2008) for patients diagnosed with diabetes. The main goal is to predict readmission within 30 days of discharge. The loaded file has 101,766 rows and 50 columns (the UCI summary lists 47 features), covering patient demographics, medical history, lab test results, medication information, and hospital visits.

**Key Properties:**

*   **Multivariate:** The dataset contains numerous features, making it a multivariate dataset.
*   **Classification Task:** The main task is to predict whether a patient will be readmitted, which is a classification problem.
*   **Feature Types:** The data includes both categorical features (e.g., gender, race, and age, which is recorded in 10-year brackets) and numerical features (e.g., time in hospital, number of lab tests).
*   **Missing Values:** The dataset has missing values, which need to be addressed during data cleaning.
*   **Potential Biases:** The dataset may contain biases due to factors like the geographical location of hospitals and the demographics of the patients treated.
*   **Sensitive Data:** It includes sensitive patient data like age, gender, and race, which need to be handled responsibly and ethically.

**EDA Plan:**

1.  **Data Cleaning:**
    *   Handle missing values: We can either remove rows/columns with missing values or impute them using strategies like mean/median imputation, or more advanced techniques (e.g., KNN imputation) for numerical features and mode imputation for categorical features.
    *   Check for outliers: Identify outliers in numerical features (e.g., number of medications, time in hospital) and consider strategies to address them (e.g., capping, removal).
    *   Data type consistency: Verify that the data types for each column are correctly assigned.

2.  **Univariate Analysis:**
    *   Descriptive statistics: Calculate basic statistics (e.g., mean, median, standard deviation, range) for numerical features.
    *   Frequency distributions: Analyze the distribution of categorical features (e.g., gender, race, admission type) using frequency tables or bar plots.
    *   Visualizations: Create histograms, box plots, and density plots for numerical features to understand their distributions and identify potential outliers.

3.  **Bivariate Analysis:**
    *   Correlation analysis: Calculate correlation coefficients between numerical features and identify potential relationships.
    *   Scatter plots: Visualize relationships between pairs of numerical features.
    *   Cross-tabulation: Analyze the relationships between categorical features and the target variable (readmission) using cross-tabulations or contingency tables.
    *   Analyze interactions between features and the target variable using visualizations.

4.  **Handling Categorical Data:**
    *   One-hot encoding: Convert categorical features into numerical representations using one-hot encoding for machine learning models.
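    *   Label encoding: Assign integer labels to categorical values; this suits ordered categories (e.g., age brackets) better than unordered ones, where the integers would imply an order that does not exist.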

5.  **Feature Engineering:**
    *   Create new features from existing ones that may be more informative for predicting readmission.
    *   For instance, combine the number of inpatient, outpatient, and emergency visits in the past year to create a total visit count.

6.  **Target Variable Analysis:**
    *   Analyze the distribution of the target variable (readmission).
    *   Understand the class imbalance, if any (e.g., are there more patients who are not readmitted than those who are?).
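
Several steps of this plan (feature engineering, target analysis, cross-tabulation, and one-hot encoding) can be sketched on a toy frame. The column names follow `diabetic_data.csv`, but the rows are made up, and the sketch assumes that `readmitted` takes the values `<30`, `>30`, and `NO`; the names `total_visits` and `readmit_30d` are hypothetical.

```python
import pandas as pd

# Toy rows; only the column names come from diabetic_data.csv.
toy = pd.DataFrame({
    "gender": ["Female", "Male", "Female", "Male", "Female", "Male"],
    "number_outpatient": [0, 2, 0, 1, 0, 3],
    "number_emergency": [0, 1, 0, 0, 2, 0],
    "number_inpatient": [1, 0, 0, 2, 0, 1],
    "readmitted": ["<30", "NO", "NO", ">30", "NO", "<30"],
})

# Feature engineering: total visits in the year before the encounter.
visit_cols = ["number_outpatient", "number_emergency", "number_inpatient"]
toy["total_visits"] = toy[visit_cols].sum(axis=1)

# Target: 1 if the patient was readmitted within 30 days, else 0.
toy["readmit_30d"] = (toy["readmitted"] == "<30").astype(int)

# Target analysis: share of each class, to reveal any imbalance.
class_share = toy["readmit_30d"].value_counts(normalize=True)
print(class_share)

# Bivariate analysis: categorical feature vs. target, as row proportions.
gender_vs_target = pd.crosstab(
    toy["gender"], toy["readmit_30d"], normalize="index"
)
print(gender_vs_target)

# Handling categorical data: one-hot encode the gender column.
encoded = pd.get_dummies(toy.drop(columns="readmitted"), columns=["gender"])
print(encoded.columns.tolist())
```

Collapsing `>30` and `NO` into a single negative class is one possible reading of the 30-day objective; keeping all three classes is an equally valid design choice for a multi-class model.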

**Additional Considerations:**

*   Visualizations: Use various visualizations (e.g., histograms, box plots, scatter plots, bar charts) to explore the relationships between variables and gain insights.
*   Feature Importance: Explore feature importance to identify the key factors influencing readmission.
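
One possible way to explore feature importance is the impurity-based ranking of a random forest. The sketch below uses synthetic data, not the real dataset: the label is constructed to depend only on `number_inpatient`, so that column should rank first. The choice of model and its settings are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500

# Synthetic features; only the column names come from diabetic_data.csv.
X = pd.DataFrame({
    "number_inpatient": rng.integers(0, 6, size=n),
    "num_lab_procedures": rng.integers(1, 100, size=n),
    "time_in_hospital": rng.integers(1, 15, size=n),
})

# Synthetic label that depends only on number_inpatient by construction.
y = (X["number_inpatient"] >= 3).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# Impurity-based importances; they sum to 1 across features.
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Impurity-based importances tend to favor features with many distinct values, so on the real data it may be worth cross-checking them with permutation importance on a held-out set.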

By following this plan, we can gain a deep understanding of the dataset, identify potential problems, and prepare the data for building machine learning models to predict readmission for diabetic patients.




