## Dimensionality Reduction & Unsupervised Learning

Dimensionality reduction and unsupervised learning rely on techniques such as Principal Component Analysis (PCA), t-SNE, and clustering algorithms like K-means or DBSCAN. These techniques reduce the number of dimensions in the data and reveal patterns or groups without requiring labeled data.
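As a rough illustration of how these pieces fit together, the sketch below standardizes a small synthetic dataset, projects it to two dimensions with PCA, and clusters the projection with K-means. The data, the cluster count, and all variable names are assumptions made for the example; this is not the notebook's actual analysis.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: two well-separated groups in 10 dimensions.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=1.0, size=(100, 10))
group_b = rng.normal(loc=5.0, scale=1.0, size=(100, 10))
X = np.vstack([group_a, group_b])

# Standardize, then project onto the two directions of largest variance.
X_scaled = StandardScaler().fit_transform(X)
X_2d = PCA(n_components=2).fit_transform(X_scaled)

# Group the projected points without using any labels.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_2d)

print(X_2d.shape)
print(np.bincount(labels))
```

The order matters: PCA is sensitive to feature scale, so standardization comes first.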

## Authors
* **Alireza Arbabi**
* **Hadi Babalou**
* **Ali Padyav**
* **Kasra Hajiheidari**

## Setting Up the Environment

In [832]:
# !pip install numpy
# !pip install pandas
# !pip install scikit-learn


In [833]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
import csv
from io import StringIO



import warnings
warnings.filterwarnings("ignore")

## Data Preparation

### Dataset Description

In 2014, researchers published the article "Impact of HbA1c Measurement on Hospital Readmission Rates: Analysis of 70,000 Clinical Database Patient Records." They gathered data on diabetic patients from many hospitals and clinics in the United States. Part of this data, about 100,000 encounters with 50 features, has been released publicly in de-identified form.

Features (the percentage in parentheses is the share of missing values)

- Encounter ID: Unique identifier of an encounter
- Patient number: Unique identifier of a patient
- Race: Caucasian, Asian, African American, Hispanic, and other (2%)
- Gender: male, female, and unknown/invalid (0%)
- Age: Grouped in 10-year intervals: [0, 10), [10, 20), …, [90, 100) (0%)
- Weight: Weight in pounds (97%)
- Admission type: Integer identifier corresponding to 9 distinct values, for example, emergency, urgent, elective, newborn, and not available (0%)
- Discharge disposition: Integer identifier corresponding to 29 distinct values, for example, discharged to home, expired, and not available (0%)
- Admission source: Integer identifier corresponding to 21 distinct values, for example, physician referral, emergency room, and transfer from a hospital (0%)
- Time in hospital: Integer number of days between admission and discharge (0%)
- Payer code: Integer identifier corresponding to 23 distinct values, for example, Blue Cross/Blue Shield, Medicare, and self-pay (52%)
- Medical specialty: Integer identifier of a specialty of the admitting physician, corresponding to 84 distinct values, for example, cardiology, internal medicine, family/general practice, and surgeon (53%)
- Number of lab procedures: Number of lab tests performed during the encounter (0%)
- Number of procedures: Number of procedures (other than lab tests) performed during the encounter (0%)
- Number of medications: Number of distinct generic names administered during the encounter (0%)
- Number of outpatient visits: Number of outpatient visits of the patient in the year preceding the encounter (0%)
- Number of emergency visits: Number of emergency visits of the patient in the year preceding the encounter (0%)
- Number of inpatient visits: Number of inpatient visits of the patient in the year preceding the encounter (0%)
- Diagnosis 1: The primary diagnosis (coded as first three digits of ICD9); 848 distinct values (0%)
- Diagnosis 2: Secondary diagnosis (coded as first three digits of ICD9); 923 distinct values (0%)
- Diagnosis 3: Additional secondary diagnosis (coded as first three digits of ICD9); 954 distinct values (1%)
- Number of diagnoses: Number of diagnoses entered to the system (0%)
- Glucose serum test result: Indicates the range of the result or if the test was not taken. Values: “>200,” “>300,” “normal,” and “none” if not measured (0%)
- A1c test result: Indicates the range of the result or if the test was not taken. Values: “>8” if the result was greater than 8%, “>7” if the result was greater than 7% but less than 8%, “normal” if the result was less than 7%, and “none” if not measured (0%)
- Change of medications: Indicates if there was a change in diabetic medications (either dosage or generic name). Values: “change” and “no change” (0%)
- Diabetes medications: Indicates if there was any diabetic medication prescribed. Values: “yes” and “no” (0%)
- 24 features for medications: For the generic names `metformin`, `repaglinide`, `nateglinide`, `chlorpropamide`, `glimepiride`, `acetohexamide`, `glipizide`, `glyburide`, `tolbutamide`, `pioglitazone`, `rosiglitazone`, `acarbose`, `miglitol`, `troglitazone`, `tolazamide`, `examide`, `sitagliptin`, `insulin`, `glyburide-metformin`, `glipizide-metformin`, `glimepiride-pioglitazone`, `metformin-rosiglitazone`, and `metformin-pioglitazone`, the feature indicates whether the drug was prescribed or there was a change in the dosage. Values: “up” if the dosage was increased during the encounter, “down” if the dosage was decreased, “steady” if the dosage did not change, and “no” if the drug was not prescribed (0%)
- Readmitted: Days to inpatient readmission. Values: “<30” if the patient was readmitted in less than 30 days, “>30” if the patient was readmitted in more than 30 days, and “No” for no record of readmission.

### Loading the Dataset

In [834]:
df = pd.read_csv('diabetic_data.csv')
df.head()

| | encounter_id | patient_nbr | race | gender | age | weight | admission_type_id | discharge_disposition_id | admission_source_id | time_in_hospital | ... | citoglipton | insulin | glyburide-metformin | glipizide-metformin | glimepiride-pioglitazone | metformin-rosiglitazone | metformin-pioglitazone | change | diabetesMed | readmitted |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2278392 | 8222157 | Caucasian | Female | [0-10) | ? | 6 | 25 | 1 | 1 | ... | No | No | No | No | No | No | No | No | No | NO |
| 1 | 149190 | 55629189 | Caucasian | Female | [10-20) | ? | 1 | 1 | 7 | 3 | ... | No | Up | No | No | No | No | No | Ch | Yes | >30 |
| 2 | 64410 | 86047875 | AfricanAmerican | Female | [20-30) | ? | 1 | 1 | 7 | 2 | ... | No | No | No | No | No | No | No | No | Yes | NO |
| 3 | 500364 | 82442376 | Caucasian | Male | [30-40) | ? | 1 | 1 | 7 | 2 | ... | No | Up | No | No | No | No | No | Ch | Yes | NO |
| 4 | 16680 | 42519267 | Caucasian | Male | [40-50) | ? | 1 | 1 | 7 | 1 | ... | No | Steady | No | No | No | No | No | Ch | Yes | NO |


Store the mappings from the integer ID values (admission type, discharge disposition, admission source) to their descriptions in dictionaries.

In [835]:
def csv_to_dict(csv_string, header):
    if header:
        csv_string = csv_string.split("\n", 1)[1]

    csv_file = StringIO(csv_string)
    dictionary = {}
    reader = csv.reader(csv_file, delimiter=',', quotechar='"')

    for row in reader:
        key, value = row
        dictionary[key] = value

    return dictionary

In [836]:
with open('IDs_mapping.csv', 'r') as file:
    csv_data = file.read()

admission_type, discharge_disposition, admission_source = csv_data.split('\n,\n')

admission_type_mapping = csv_to_dict(admission_type, header=True)
discharge_disposition_mapping = csv_to_dict(discharge_disposition, header=True)
admission_source_mapping = csv_to_dict(admission_source, header=True)

print(admission_type_mapping)
print(discharge_disposition_mapping)
print(admission_source_mapping)


{'1': 'Emergency', '2': 'Urgent', '3': 'Elective', '4': 'Newborn', '5': 'Not Available', '6': 'NULL', '7': 'Trauma Center', '8': 'Not Mapped'}
{'1': 'Discharged to home', '2': 'Discharged/transferred to another short term hospital', '3': 'Discharged/transferred to SNF', '4': 'Discharged/transferred to ICF', '5': 'Discharged/transferred to another type of inpatient care institution', '6': 'Discharged/transferred to home with home health service', '7': 'Left AMA', '8': 'Discharged/transferred to home under care of Home IV provider', '9': 'Admitted as an inpatient to this hospital', '10': 'Neonate discharged to another hospital for neonatal aftercare', '11': 'Expired', '12': 'Still patient or expected to return for outpatient services', '13': 'Hospice / home', '14': 'Hospice / medical facility', '15': 'Discharged/transferred within this institution to Medicare approved swing bed', '16': 'Discharged/transferred/referred another institution for outpatient services', '17': 'Discharged/transf
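The parsing approach above can be tried without the real file. The sketch below uses a hypothetical miniature of a multi-section mapping file (the section contents and the helper name `section_to_dict` are made up for illustration) and assumes, as the notebook code does, that sections are separated by a line containing only a comma.

```python
import csv
from io import StringIO

# Hypothetical miniature of a multi-section mapping file: each section has
# its own header row, and sections are separated by a line holding only ",".
sample = (
    "admission_type_id,description\n"
    "1,Emergency\n"
    "2,Urgent\n"
    ",\n"
    "admission_source_id,description\n"
    "1,Physician Referral\n"
    "7,Emergency Room\n"
)

def section_to_dict(section):
    """Turn one CSV section (with header row) into an id -> description dict."""
    rows = list(csv.reader(StringIO(section)))
    # Skip the header row and any blank or malformed rows.
    return {row[0]: row[1] for row in rows[1:] if len(row) == 2}

sections = sample.split("\n,\n")
mappings = [section_to_dict(s) for s in sections]
print(mappings[0])
print(mappings[1])
```

Filtering on `len(row) == 2` makes the helper tolerant of trailing blank lines, which would otherwise break a plain `key, value = row` unpacking.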

### Preprocessing

In [837]:
print(df.dtypes)

encounter_id                 int64
patient_nbr                  int64
race                        object
gender                      object
age                         object
weight                      object
admission_type_id            int64
discharge_disposition_id     int64
admission_source_id          int64
time_in_hospital             int64
payer_code                  object
medical_specialty           object
num_lab_procedures           int64
num_procedures               int64
num_medications              int64
number_outpatient            int64
number_emergency             int64
number_inpatient             int64
diag_1                      object
diag_2                      object
diag_3                      object
number_diagnoses             int64
max_glu_serum               object
A1Cresult                   object
metformin                   object
repaglinide                 object
nateglinide                 object
chlorpropamide              object
glimepiride         

As the preview of the data shows, missing values are encoded as "?". We replace them, along with the literal string "None", with NaN.

In [838]:
df = df.replace('?', np.nan)
df = df.replace('None', np.nan)
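An alternative is to declare the missing-value marker at load time through the `na_values` parameter of `pd.read_csv`. The sketch below demonstrates this on a hypothetical three-row sample rather than the real file.

```python
from io import StringIO

import pandas as pd

# Hypothetical three-row sample that uses "?" as the missing-value marker.
sample_csv = StringIO(
    "race,weight,time_in_hospital\n"
    "Caucasian,?,3\n"
    "?,[75-100),2\n"
    "Asian,?,5\n"
)

# na_values tells the parser which extra strings to read as NaN.
sample_df = pd.read_csv(sample_csv, na_values=["?"])
print(sample_df.isna().sum())
```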

#### Missing Values

A function to calculate the percentage of missing values in each column.

In [839]:
def null_percentage(df):
    missing = df.isnull().sum()
    missing = missing[missing > 0]
    missing_percentage = missing / df.shape[0] * 100
    missing_info = pd.DataFrame({'missing': missing, 'missing_percentage': missing_percentage})
    missing_info = missing_info.sort_values(by='missing', ascending=False)
    print(missing_info)
    return missing_info

In [840]:
missing_info = null_percentage(df)

                   missing  missing_percentage
weight               98569           96.858479
max_glu_serum        96420           94.746772
A1Cresult            84748           83.277322
medical_specialty    49949           49.082208
payer_code           40256           39.557416
race                  2273            2.233555
diag_3                1423            1.398306
diag_2                 358            0.351787
diag_1                  21            0.020636


In [841]:
missing_cols = missing_info[missing_info['missing_percentage'] > 30].index
print(missing_cols)
df = df.drop(missing_cols, axis=1)

Index(['weight', 'max_glu_serum', 'A1Cresult', 'medical_specialty',
       'payer_code'],
      dtype='object')
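The same thresholding can be written more compactly with `isna().mean()`, which returns the fraction of missing values per column. The toy frame and column names below are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
    "rarely_missing": [1.0, 2.0, np.nan, 4.0],
    "complete": [1, 2, 3, 4],
})

# isna().mean() gives the fraction of missing values per column.
missing_fraction = toy.isna().mean()

# Keep only columns whose missing fraction does not exceed 30%.
threshold = 0.30
kept = toy.loc[:, missing_fraction <= threshold]
print(missing_fraction)
print(list(kept.columns))
```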


In [842]:
df = df.dropna()

In [843]:
missing_info = null_percentage(df)

Empty DataFrame
Columns: [missing, missing_percentage]
Index: []


#### Duplicates

There are no duplicate encounters in the dataset: every `encounter_id` appears only once.

In [844]:
print(df['encounter_id'].duplicated().sum())

0


The ID columns carry no information useful for the analysis, so we drop them.

In [845]:
df.drop(['encounter_id', 'patient_nbr'], axis=1, inplace=True)

#### Type Conversion

Non-numerical columns are converted to numerical ones.

In [846]:
df.select_dtypes(exclude=np.number).columns

Index(['race', 'gender', 'age', 'diag_1', 'diag_2', 'diag_3', 'metformin',
       'repaglinide', 'nateglinide', 'chlorpropamide', 'glimepiride',
       'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide',
       'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone',
       'tolazamide', 'examide', 'citoglipton', 'insulin',
       'glyburide-metformin', 'glipizide-metformin',
       'glimepiride-pioglitazone', 'metformin-rosiglitazone',
       'metformin-pioglitazone', 'change', 'diabetesMed', 'readmitted'],
      dtype='object')

In [847]:
def encode_onehot(df, column):
    onehot = pd.get_dummies(df[column], prefix=column)
    df = pd.concat([df, onehot], axis=1)
    df = df.drop(column, axis=1)
    return df

def encode_label(df, column):
    le = LabelEncoder()
    df[column] = le.fit_transform(df[column])
    mapping = dict(zip(le.classes_, le.transform(le.classes_)))
    return mapping, df
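To make the difference between the two encodings concrete, here is a small sketch on a made-up frame. One-hot encoding produces one indicator column per category, while label encoding produces a single integer column whose codes follow the sorted class names.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up data for illustration.
toy = pd.DataFrame({"race": ["Asian", "Caucasian", "Asian", "Other"],
                    "change": ["Ch", "No", "No", "Ch"]})

# One-hot: one indicator column per category, no implied order.
onehot = pd.get_dummies(toy["race"], prefix="race")
print(list(onehot.columns))

# Label encoding: a single integer column; classes are sorted alphabetically.
le = LabelEncoder()
codes = le.fit_transform(toy["change"])
print(dict(zip(le.classes_, range(len(le.classes_)))))
print(codes.tolist())
```

One-hot encoding suits nominal features such as race, where integer codes would suggest an order that does not exist.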

##### Race

In [848]:
df['race'].value_counts()

Caucasian          75079
AfricanAmerican    18881
Hispanic            1984
Other               1484
Asian                625
Name: race, dtype: int64

In [849]:
df = encode_onehot(df, 'race')

##### Gender

In [850]:
df['gender'].value_counts()

Female             52833
Male               45219
Unknown/Invalid        1
Name: gender, dtype: int64

In [851]:
df = df[df['gender'] != 'Unknown/Invalid']

df = encode_onehot(df, 'gender')

##### Age

In [852]:
df['age'].value_counts()

[70-80)     25305
[60-70)     21809
[80-90)     16702
[50-60)     16697
[40-50)      9265
[30-40)      3548
[90-100)     2717
[20-30)      1478
[10-20)       466
[0-10)         65
Name: age, dtype: int64

In [853]:
age_mapping = {"[0-10)": 0, "[10-20)": 1, "[20-30)": 2,
               "[30-40)": 3, "[40-50)": 4, "[50-60)": 5,
               "[60-70)": 6, "[70-80)": 7, "[80-90)": 8,
               "[90-100)": 9}

df['age'] = df['age'].replace(age_mapping)
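Because the ten age brackets are ordered and evenly spaced, the same ordinal mapping can also be generated rather than typed out. This is only a sketch; the bracket format is assumed to match the labels shown above.

```python
# Build the labels "[0-10)", "[10-20)", ..., "[90-100)" and number them in order.
age_labels = [f"[{lo}-{lo + 10})" for lo in range(0, 100, 10)]
age_mapping = {label: i for i, label in enumerate(age_labels)}
print(age_mapping)
```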

##### Diag

`diag_1` is the primary diagnosis (848 distinct values), `diag_2` is the secondary diagnosis (923 distinct values), and `diag_3` is the additional secondary diagnosis (954 distinct values).

Each diagnosis is coded as the first three digits of an ICD-9 code.

ICD-9 codes classify diseases and injuries. The first three digits of a code identify the category of the diagnosis:

- 001–139: infectious and parasitic diseases
- 140–239: neoplasms
- 240–279: endocrine, nutritional and metabolic diseases, and immunity disorders
- 280–289: diseases of the blood and blood-forming organs
- 290–319: mental disorders
- 320–389: diseases of the nervous system and sense organs
- 390–459: diseases of the circulatory system
- 460–519: diseases of the respiratory system
- 520–579: diseases of the digestive system
- 580–629: diseases of the genitourinary system
- 630–679: complications of pregnancy, childbirth, and the puerperium
- 680–709: diseases of the skin and subcutaneous tissue
- 710–739: diseases of the musculoskeletal system and connective tissue
- 740–759: congenital anomalies
- 760–779: certain conditions originating in the perinatal period
- 780–799: symptoms, signs, and ill-defined conditions
- 800–999: injury and poisoning
- E and V codes: external causes of injury and supplemental classification

In [854]:
ranges = [
    (1, 139, 1),
    (140, 239, 2),
    (240, 279, 3),
    (280, 289, 4),
    (290, 319, 5),
    (320, 389, 6),
    (390, 459, 7),
    (460, 519, 8),
    (520, 579, 9),
    (580, 629, 10),
    (630, 679, 11),
    (680, 709, 12),
    (710, 739, 13),
    (740, 759, 14),
    (760, 779, 15),
    (780, 799, 16),
    (800, 999, 17)
]

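def map_diag_code(code):
    code = str(code)
    # E and V codes are supplementary classifications
    if code.startswith(('E', 'V')):
        return 18
    try:
        # Codes such as "250.83" carry a decimal part; keep the three-digit category
        num_code = int(float(code))
    except ValueError:
        return 0
    for start, end, label in ranges:
        if start <= num_code <= end:
            return label
    return 0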

In [855]:
df['diag_1'] = df['diag_1'].apply(map_diag_code)
df['diag_2'] = df['diag_2'].apply(map_diag_code)
df['diag_3'] = df['diag_3'].apply(map_diag_code)
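For larger frames, the same grouping can be sketched in vectorized form with `pd.cut`. The helper name `map_diag_series` and the sample codes are assumptions for illustration. The sketch also assumes that codes with a decimal part, such as `250.83`, belong to their three-digit category, and that unparseable codes fall into group 0.

```python
import numpy as np
import pandas as pd

# Upper bounds of the 17 numeric ICD-9 chapters listed above.
edges = [0, 139, 239, 279, 289, 319, 389, 459, 519,
         579, 629, 679, 709, 739, 759, 779, 799, 999]

def map_diag_series(codes):
    """Map a Series of ICD-9 code strings to chapter groups 0..18."""
    codes = codes.astype(str)
    is_ev = codes.str[0].isin(["E", "V"])
    # Non-numeric codes become NaN; decimals are floored to the category.
    numeric = np.floor(pd.to_numeric(codes, errors="coerce"))
    # labels=False returns the 0-based bin index (NaN when out of range).
    groups = pd.cut(numeric, bins=edges, labels=False) + 1
    groups = groups.fillna(0).astype(int)
    groups[is_ev] = 18
    return groups

sample_codes = pd.Series(["250.83", "428", "V27", "E909", "8", "?"])
print(map_diag_series(sample_codes).tolist())
```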

##### 24 features for medications

We encode these columns as integers using a fixed dictionary, so that every medication column shares the same codes.

In [856]:
medication_cols = ['metformin', 'repaglinide', 'nateglinide', 'chlorpropamide',
               'glimepiride', 'acetohexamide', 'glipizide', 'glyburide',
               'tolbutamide', 'pioglitazone', 'rosiglitazone', 'acarbose',
               'miglitol', 'troglitazone', 'tolazamide', 'examide', 'citoglipton',
               'insulin', 'glyburide-metformin', 'glipizide-metformin',
               'glimepiride-pioglitazone', 'metformin-rosiglitazone',
               'metformin-pioglitazone']

medication_mapping = { 'No': 0, 'Steady': 1, 'Up': 2, 'Down': 3 }

for col in medication_cols:
    df[col] = df[col].replace(medication_mapping)


##### Change

In [857]:
change_mapping, df = encode_label(df, 'change')
print(change_mapping)

{'Ch': 0, 'No': 1}


##### DiabetesMed

In [858]:
diabetes_med_mapping, df = encode_label(df, 'diabetesMed')
print(diabetes_med_mapping)

{'No': 0, 'Yes': 1}


##### Readmitted

In [859]:
def map_readmitted(value):
    if value == 'NO':
        return 0
    elif value == '>30':
        return 1
    else:
        return 2

In [860]:
df['readmitted'] = df['readmitted'].apply(map_readmitted)

**So far we have dropped the columns with more than 30% missing values, removed the remaining rows with missing values, checked for duplicates, and converted the non-numerical columns to numerical ones.**

In [861]:
len(df.select_dtypes(exclude=np.number).columns)

0

#### Normalization

StandardScaler is a preprocessing technique in machine learning used to standardize the features by removing the mean and scaling them to unit variance. This ensures that each feature has a mean of 0 and a standard deviation of 1.

Standardization is often applied to all features in a dataset before training machine learning models. It helps when features have different scales or units: each feature then contributes comparably to the analysis, and features with larger scales do not dominate the model's training process.
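The transformation can be checked by hand on a tiny made-up array: subtracting the column mean and dividing by the population standard deviation (`ddof=0`) should reproduce what `StandardScaler` returns.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up features on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

# Manual z-score: subtract the column mean, divide by the population
# standard deviation (ddof=0), which is what StandardScaler uses.
manual = (X - X.mean(axis=0)) / X.std(axis=0)

scaled = StandardScaler().fit_transform(X)
print(np.allclose(manual, scaled))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```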

In [862]:
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df)
df_scaled = pd.DataFrame(df_scaled, columns=df.columns)
df_scaled.head()

| | age | admission_type_id | discharge_disposition_id | admission_source_id | time_in_hospital | num_lab_procedures | num_procedures | num_medications | number_outpatient | number_emergency | ... | change | diabetesMed | readmitted | race_AfricanAmerican | race_Asian | race_Caucasian | race_Hispanic | race_Other | gender_Female | gender_Male |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -3.287868 | -0.707395 | -0.51859 | 0.300436 | -0.475104 | 0.804171 | -0.790599 | 0.231907 | -0.293279 | -0.214727 | ... | -1.079609 | 0.548896 | 0.614311 | -0.488348 | -0.080094 | 0.553159 | -0.143708 | -0.123923 | 0.925141 | -0.925141 |
| 1 | -2.646461 | -0.707395 | -0.51859 | 0.300436 | -0.80921 | -1.630937 | 2.136003 | -0.384733 | 1.265132 | -0.214727 | ... | 0.926261 | 0.548896 | -0.845175 | 2.04772 | -0.080094 | -1.8078 | -0.143708 | -0.123923 | 0.925141 | -0.925141 |
| 2 | -2.005054 | -0.707395 | -0.51859 | 0.300436 | -0.80921 | 0.0432 | -0.205279 | -0.014749 | -0.293279 | -0.214727 | ... | -1.079609 | 0.548896 | -0.845175 | -0.488348 | -0.080094 | 0.553159 | -0.143708 | -0.123923 | -1.080917 | 1.080917 |
| 3 | -1.363647 | -0.707395 | -0.51859 | 0.300436 | -1.143317 | 0.39832 | -0.790599 | -1.001373 | -0.293279 | -0.214727 | ... | -1.079609 | 0.548896 | -0.845175 | -0.488348 | -0.080094 | 0.553159 | -0.143708 | -0.123923 | -1.080917 | 1.080917 |
| 4 | -0.72224 | -0.017794 | -0.51859 | -0.927579 | -0.475104 | -0.616309 | 2.721323 | -0.014749 | -0.293279 | -0.214727 | ... | 0.926261 | 0.548896 | 0.614311 | -0.488348 | -0.080094 | 0.553159 | -0.143708 | -0.123923 | -1.080917 | 1.080917 |
