# Diabetes Cluster Analysis

Aim: Cluster the patients in the diabetes dataset by their characteristics and behaviour, and draw useful insights from the clusters.

To perform the clustering, we first import the basic libraries and load the dataset.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
data=pd.read_csv('C:\\Users\\Arman Soni\\OneDrive\\Desktop\\JU\\ML\\diabetic_data.csv',na_values=['?'])
data

In [None]:
#info() lists every column with its dtype and non-null count
print(data.info())

In [None]:
#describe() gives summary statistics for the numeric columns
data.describe()

Weight Analysis

In [None]:
print("the share of null values in the weight column is:", data['weight'].isnull().mean())

Of the 101766 rows, 98569 have no weight recorded, which is about 97% of the column, so we drop it.
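The same check can be run for every column at once. The sketch below uses a small made-up frame (`toy` and its values are assumptions, not the real data) to show how `isnull().mean()` gives the missing share per column without hard-coding the row count.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; all values are made up.
toy = pd.DataFrame({
    "weight": [np.nan, np.nan, np.nan, "[75-100)"],
    "race": ["Caucasian", None, "Asian", "Caucasian"],
    "age": ["[70-80)", "[60-70)", "[50-60)", "[40-50)"],
})

# isnull().mean() is the fraction of missing entries in each column.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

Columns at the top of this listing are the first candidates for dropping; columns with a small share are candidates for imputation.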

In [None]:
data1=data.drop('weight',axis=1)
data1

Gender Analysis

In [None]:
print('the unique values in the gender column is:',data1["gender"].unique())

In [None]:
data1.gender.value_counts()

Only 3 of the 101766 rows have an unknown gender, so removing them loses almost no data.

In [None]:
gender_index = data1[data1.gender == 'Unknown/Invalid'].index
data1=data1.drop(gender_index, axis=0)

In [None]:
data1

Race Analysis

In [None]:
print("the null values in race column is :",data1['race'].isnull().sum())

Instead of removing the rows with a null race, we impute the mode so that the row count is preserved.

In [None]:
mode=data1['race'].mode()[0]
#assign the result back; inplace=True on a column selection may not update data1 in newer pandas
data1['race']=data1['race'].fillna(mode)

In [None]:
data1.info()

Diagnosis Columns Cleaning

In [None]:
print("the null values in diag_1 column is:",data1['diag_1'].isnull().sum()/101763)
print("the null values in diag_2 column is:",data1['diag_2'].isnull().sum()/101763)
print("the null values in diag_3 column is:",data1['diag_3'].isnull().sum()/101763)

The null values make up only around 1% of each column, but deleting those rows would still reduce the data count, so the better option is to impute using the mode.

In [None]:
#fill each diagnosis column with its own mode (the variable mode above holds the race mode)
for col in ['diag_1','diag_2','diag_3']:
    data1[col]=data1[col].fillna(data1[col].mode()[0])
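Mode imputation over several columns can be sketched on toy data. The frame `toy` and its diagnosis codes are made up for illustration; the point is that each column gets the most frequent value of that same column.

```python
import numpy as np
import pandas as pd

# Made-up diagnosis codes with one gap per column.
toy = pd.DataFrame({
    "diag_1": ["250", "428", np.nan, "428"],
    "diag_2": ["401", np.nan, "401", "276"],
})

for col in ["diag_1", "diag_2"]:
    col_mode = toy[col].mode()[0]          # most frequent value of this column
    toy[col] = toy[col].fillna(col_mode)   # assign back rather than inplace

print(toy)
```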

In [None]:
print("the null values in diag_1 column is:",data1['diag_1'].isnull().sum()/101763)
print("the null values in diag_2 column is:",data1['diag_2'].isnull().sum()/101763)
print("the null values in diag_3 column is:",data1['diag_3'].isnull().sum()/101763)

In [None]:
print("the null values in Payer_code column is:",data1['payer_code'].isnull().sum()/101763)
print("the null values in medical specialty column is:",data1['medical_specialty'].isnull().sum()/101763)

The payer code column has null values, which may mean the client did not know who insured the patient, so we replace the null values with "others".

In [None]:
data1['payer_code']=data1['payer_code'].fillna('others')
data1

In [None]:
unique_values = pd.DataFrame(data1['medical_specialty'].unique(), columns=['medical_specialty_unique_values'])

print(unique_values)

A null value in the medical specialty column means the specialty was not recorded, so we replace it with "notknown".

In [None]:
data1['medical_specialty']=data1['medical_specialty'].fillna('notknown')
data1

In [None]:
data1.info()

In [None]:
data1['readmitted'].unique()

Both "<30" and ">30" indicate that the patient was readmitted, so we merge them into a single "YES" value, which makes the analysis easier.

In [None]:
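df = data1.copy()
#restrict the replacement to readmitted so that no other column is touched
df['readmitted'] = df['readmitted'].replace(['<30', '>30'], 'YES')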

Columns like encounter_id and patient_nbr are identifiers rather than features, so they can be removed to reduce the dimensions.

In [None]:
df=df.drop(['encounter_id','patient_nbr'],axis=1)
df

In [None]:
print('the unique values in examide are:',df['examide'].unique())
print('the unique values in citoglipton are:',df['citoglipton'].unique())

The columns examide and citoglipton have only one unique value and thus can be removed as they won't give fruitful insight

In [None]:
df=df.drop(['examide','citoglipton'],axis=1)
df

In [None]:
print("the values in metformin-pioglitazone are:",df['metformin-pioglitazone'].value_counts())
print("the values in metformin-rosiglitazone are:",df['metformin-rosiglitazone'].value_counts())
print("the values in glimepiride-pioglitazone are:",df['glimepiride-pioglitazone'].value_counts())


In the columns above, about 99% of the values are 'No' and only about 1% differ, so each column behaves almost like a constant column and can be removed.

In [None]:
df=df.drop(['metformin-pioglitazone','metformin-rosiglitazone','glimepiride-pioglitazone'],axis=1)
df
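Near-constant columns like these can also be found programmatically. The following is a minimal sketch on made-up data; the column names `drug_a` and `drug_b` and the 0.95 cutoff are assumptions chosen for illustration, not values from the notebook.

```python
import pandas as pd

# Made-up drug columns: drug_a is almost constant, drug_b is not.
toy = pd.DataFrame({
    "drug_a": ["No"] * 99 + ["Steady"],
    "drug_b": ["No"] * 60 + ["Steady"] * 40,
})

threshold = 0.95  # hypothetical cutoff for "almost constant"

# Share of the most frequent value in each column.
top_share = toy.apply(lambda s: s.value_counts(normalize=True).iloc[0])
near_constant = top_share[top_share >= threshold].index.tolist()
print(near_constant)
```

The resulting list could then be passed to `drop(..., axis=1)` in place of a hand-typed list.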

In [None]:
df.info()

Data is clean and processed

# Label Encoding

In [None]:
from sklearn.preprocessing import LabelEncoder

# Create a LabelEncoder object
le = LabelEncoder()

# Iterate over all columns except 'race'
for col in df.columns:
    if col not in ['race']:
        df[col] = le.fit_transform(df[col])


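Label encoding has been applied to all columns except race, converting the values to integers for easier analysis. Because the loop covers every column, numeric columns are also re-coded to 0-based integer codes.
The race column is encoded manually so that its labels are easy to remember.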
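`LabelEncoder` assigns integers in sorted label order, so the mapping for any column can be read back from its `classes_` attribute. The sketch below uses made-up dosage values to show this; the list `values` is an assumption for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up categorical values, e.g. dosage changes.
values = ["No", "Up", "Steady", "Down", "No"]

le = LabelEncoder()
codes = le.fit_transform(values)

# classes_ holds the labels in sorted order; the position is the code.
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)
print(list(codes))
```

Keeping one fitted encoder per column (rather than reusing a single object) would make it possible to call `inverse_transform` later when interpreting the clusters.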

In [None]:
#map each race to a fixed, easy-to-remember code
race_codes = {'Caucasian': 0, 'AfricanAmerican': 1, 'Other': 2, 'Asian': 3, 'Hispanic': 4}
df['race'] = df['race'].replace(race_codes)

df['race'].unique()

In [None]:
df

In [None]:
#caucasian-0
#african american 1
#other 2
#asian 3
#hispanic 4

# Standard Scaler

Standard scaling centres each column at mean 0 and rescales it to unit variance, so that no column dominates the distance calculations simply because of its scale.

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(df)
scaled_df = pd.DataFrame(scaler.transform(df),columns= df.columns )
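A quick check of what `StandardScaler` produces, on made-up numbers: each column ends up with mean 0 and standard deviation 1, and individual values are not confined to a fixed interval.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up columns on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0],
              [4.0, 400.0]])

scaled = StandardScaler().fit_transform(X)

print(scaled.mean(axis=0))         # approximately 0 for each column
print(scaled.std(axis=0))          # approximately 1 for each column
print(scaled.min(), scaled.max())  # values fall below 0 and above 1
```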

In [None]:
scaled_df

# PCA

PCA reduces the number of dimensions while retaining as much of the original dataset's variance as possible. Working in fewer dimensions can help in forming cleaner clusters and makes them easier to visualise.

In [None]:
from sklearn.decomposition import PCA
pca = PCA(n_components=3)
pca.fit(scaled_df)
PCA_df = pd.DataFrame(pca.transform(scaled_df), columns=(["col1","col2", "col3"]))
PCA_df.describe().T
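How much of the variance the retained components keep can be checked with `explained_variance_ratio_`. This is a sketch on random data (the array `X` is an assumption, not the scaled diabetes data), so the printed numbers say nothing about the real dataset.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in data with one deliberately correlated column.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=200)

pca = PCA(n_components=3).fit(X)

# Fraction of total variance captured by each component, largest first.
ratio = pca.explained_variance_ratio_
print(ratio)
print("total retained:", ratio.sum())
```

If the total retained share is low, the clusters found in the reduced space may not reflect the structure of the full feature set.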

# Elbow Visualiser
The elbow visualiser recommends a suitable number of clusters for the data, based on its features.

In [None]:
from yellowbrick.cluster import KElbowVisualizer
from sklearn.cluster import KMeans
Elbow_M = KElbowVisualizer(KMeans(), k=13)
Elbow_M.fit(PCA_df)
Elbow_M.show()
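The idea behind the elbow method can be sketched with plain scikit-learn: fit KMeans for several values of k and look at where the inertia (within-cluster sum of squares) stops dropping sharply. The three synthetic blobs below are an assumption for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Three well-separated synthetic blobs (made-up data).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Inertia for each candidate number of clusters.
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 1))
```

On data like this the inertia falls steeply up to k=3 and flattens afterwards; that bend is the "elbow".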

In [None]:
kmeans = KMeans(n_clusters=5, init='k-means++', n_init=10, random_state=42)
kmeans.fit(PCA_df)
labels_optimal = kmeans.labels_

In [None]:
#one scatter call per cluster, with a legend so the colours can be read
colors = ['green', 'red', 'yellow', 'blue', 'purple']
for cluster, color in enumerate(colors):
    points = PCA_df[labels_optimal == cluster]
    plt.scatter(points["col1"], points["col2"], color=color, label=f"cluster {cluster}")
plt.xlabel("col1")
plt.ylabel("col2")
plt.legend()
plt.show()

Cluster Validity
Cluster validity measures the quality of the clustering. We use the silhouette score to check it.
The silhouette coefficient, or silhouette score, ranges from -1 to 1; values near 1 indicate compact, well-separated clusters, and values near 0 indicate overlapping clusters.

In [None]:
from sklearn.metrics import silhouette_score
labels = kmeans.labels_
silhouette_avg = silhouette_score(PCA_df, labels)
print("The average silhouette_score is :", silhouette_avg)
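The silhouette score compares all pairs of points, which becomes slow on large datasets. `silhouette_score` accepts a `sample_size` argument to estimate the score on a random subset. The sketch below shows this on synthetic blobs; the data and the sample size of 100 are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Three well-separated synthetic blobs (made-up data).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(100, 2)) for c in centers])

labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Estimate the score on a random subset instead of all 300 points.
score = silhouette_score(X, labels, sample_size=100, random_state=42)
print("sampled silhouette score:", score)
```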

In [None]:
cluster_df = pd.DataFrame({'cluster': labels_optimal})

# Plot countplot
sns.countplot(data=cluster_df, x='cluster')

In [None]:
df['labels']=labels_optimal
df

In [None]:
import matplotlib.pyplot as plt
sns.countplot(data = df,x=df["age"], hue=df["labels"], palette='icefire')
plt.title("Relation between age and labels")
plt.legend()
plt.show()


In [None]:
#cluster 0: 70-90
#cluster 1: 60-80
#cluster 2: 50-70
#cluster 3: 40-60
#cluster 4: 30-50

In [None]:
#caucasian-0
#african american 1
#other 2
#asian 3
#hispanic 4

In [None]:
sns.countplot(data = df,x=df["race"], hue=df["labels"], palette='icefire')
plt.title("Relation between race and labels")
plt.legend()
plt.show()

In [None]:
#cluster 0 is caucasian
#cluster 1 is african american
#cluster 2 is other
#cluster 3 is asian
#cluster 4 is hispanic

In [None]:
sns.countplot(data=df, x='labels', hue='gender')
plt.title('Distribution of Labels by Gender')
plt.show()

In [None]:
#cluster 0 has females
#cluster 1 has females
#cluster 2 has males
#cluster 3 has males
#cluster 4 has females

In [None]:
#no-1- no need of insulin
#up-3- increasing dosage of insulin
#steady-2 - enough insulin
#down-0- decreasing insulin

In [None]:
sns.scatterplot(data=df, x="age", y="insulin", hue="gender")

# Set the title of the plot
plt.title("Relationship between Age, Insulin, and Gender")

# Display the plot
plt.show()

In [None]:
#we see that the majority of males increased their dosage, while females have steady insulin

In [None]:
#female-0
#male-1

In [None]:
import matplotlib.pyplot as plt
sns.countplot(data = df,x=df["diabetesMed"], hue=df["gender"])
plt.title("Relation between diabetesMed and gender")
plt.legend()
plt.show()

In [None]:
#females take diabetes medication more often than males

In [None]:
sns.barplot(data=df, x="age", y="time_in_hospital",hue='gender')
plt.title("Relationship between Age, Time in Hospital, and Gender")
plt.show()


In [None]:
#females aged 70-90 stay at least 4 days in hospital on average, while males of that age stay about 3 days; age groups below 70 stay around 2-3 days

In [None]:
import matplotlib.pyplot as plt
sns.countplot(data = df,x=df["number_diagnoses"], hue=df["labels"])
plt.title("Relation between number_diagnoses and labels")
plt.legend()
plt.show()


In [None]:
#cluster 0 has 8 diagnoses
#cluster 1 has 7 diagnoses
#cluster 2 has a range of 5-7 diagnoses
#cluster 3 has 4 diagnoses
#cluster 4 has 4 or 8 diagnoses

In [None]:
df.info()

In [None]:
sns.countplot(data=df, x='readmitted', hue='labels')

# Set the title of the plot
plt.title('Distribution of Readmissions by Cluster')

# Display the plot
plt.show()

In [None]:
#cluster 0 has the most readmissions
#cluster 4 patients mostly do not get readmitted
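Raw counts can be misleading when clusters differ in size, so it may help to compare readmission rates per cluster as well. The sketch below uses made-up labels and assumes readmitted is encoded as 0 for no and 1 for yes.

```python
import pandas as pd

# Made-up data: cluster 0 is large, cluster 1 is small.
# Assumption: readmitted is encoded as 0 = no, 1 = yes.
toy = pd.DataFrame({
    "labels":     [0] * 8 + [1] * 2,
    "readmitted": [1, 1, 1, 0, 0, 0, 0, 0] + [1, 1],
})

# Absolute counts per cluster.
counts = pd.crosstab(toy["labels"], toy["readmitted"])

# Row-normalised shares: each row sums to 1.
rates = pd.crosstab(toy["labels"], toy["readmitted"], normalize="index")

print(counts)
print(rates)
```

In this toy example cluster 0 has more readmissions in absolute terms, yet cluster 1 has the higher readmission rate.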

In [None]:
sns.barplot(data=df, x='labels', y='time_in_hospital')

# Set the title of the plot
plt.title('Relationship between Time in Hospital and Labels')

# Display the plot
plt.show()
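The per-cluster characteristics read off the plots can also be tabulated with a single `groupby`. This is a sketch on made-up rows; it assumes gender is encoded as 0 for female and 1 for male, and the aggregate names (`n`, `mean_stay`, and so on) are hypothetical.

```python
import pandas as pd

# Made-up rows; assumption: gender is 0 = female, 1 = male.
toy = pd.DataFrame({
    "labels":           [0, 0, 1, 1, 1],
    "time_in_hospital": [4, 6, 1, 2, 3],
    "number_diagnoses": [8, 9, 4, 4, 5],
    "gender":           [0, 0, 1, 1, 0],
})

# One row per cluster with size, typical stay, diagnoses and male share.
profile = toy.groupby("labels").agg(
    n=("time_in_hospital", "size"),
    mean_stay=("time_in_hospital", "mean"),
    median_diagnoses=("number_diagnoses", "median"),
    share_male=("gender", "mean"),
)
print(profile)
```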

# Insights

1. Cluster 0:
        age: 70-90
        race: caucasian
        gender: female
        no. of diagnosis: 8 
        readmitted: maximum
        time in hospital: 4 days

2. Cluster 1:
        age: 60-80
        race: african american
        gender: female
        no. of diagnosis: 7 
        readmitted: normal count
        time in hospital: 6-7 days

3. Cluster 2:
        age: 50-70
        race: other
        gender: male
        no. of diagnosis: 5-7
        readmitted: least
        time in hospital: 3 days

4. Cluster 3:
        age: 40-60
        race: asian
        gender: male
        no. of diagnosis: 4 
        readmitted: most likely
        time in hospital: 2 days

5. Cluster 4:
        age: 30-50
        race: hispanic
        gender: female
        no. of diagnosis: 4
        readmitted: least likely
        time in hospital: 1-2 days


Other Insights
1. The majority of males increased their insulin dosage, while females have steady insulin.
2. Females take diabetes medication more often than males.
3. Females aged 70-90 stay at least 4 days in hospital on average, while males of that age stay about 3 days; age groups below 70 stay around 2-3 days.


# Conclusion
The patients in cluster 0 are the ones who need the most treatment and care. Patients in clusters 3 and 4 are younger adults, with fewer diagnoses and shorter hospital stays.