In [None]:

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# Load the dataset
# The kidney disease data is read from a CSV file in the Kaggle input directory
df = pd.read_csv('/kaggle/input/ckdisease/kidney_disease.csv')

# Extract features for clustering
# Adjust the columns based on your dataset
features = ['age', 'bp', 'sg', 'al', 'su', 'rc', 'wc']
# Take an explicit copy so later assignments do not write to a view of df
X = df[features].copy()

# Convert 'rc' and 'wc' columns to numeric; unparseable entries become NaN
X['rc'] = pd.to_numeric(X['rc'], errors='coerce')
X['wc'] = pd.to_numeric(X['wc'], errors='coerce')

# Handle missing values by filling each column with its mean
X = X.fillna(X.mean())

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Determine the number of clusters (K)
num_clusters = 2 # You can adjust this based on your analysis

# Apply K-Means
kmeans = KMeans(n_clusters=num_clusters, random_state=42, n_init=10)
y_pred = kmeans.fit_predict(X_scaled)

# Compute silhouette score
silhouette_avg = silhouette_score(X_scaled, y_pred)
print(f"Silhouette Score: {silhouette_avg:.2f}")

# Randomly sample a subset of points for visualization
subset_indices = np.random.choice(len(X_scaled), size=min(500, len(X_scaled)), replace=False)
X_subset = X_scaled[subset_indices]
y_pred_subset = y_pred[subset_indices]

# Plot the clusters with a subset of points
plt.figure(figsize=(8, 6))
sns.scatterplot(x=X_subset[:, 0], y=X_subset[:, 1], hue=y_pred_subset,
                palette='viridis', legend='full', alpha=0.6)
plt.title("K-Means Clustering of Kidney Disease Data (Subset of Points)")
plt.xlabel("Standardized age")
plt.ylabel("Standardized bp")
plt.show()

This Python script uses the scikit-learn library to perform K-Means clustering on a kidney disease dataset. The goal is to group the data into two clusters based on various features such as age, blood pressure, specific gravity, albumin, sugar, red blood cell count, and white blood cell count.

The script starts by importing the necessary libraries. It then loads the kidney disease dataset from a specified CSV file into a pandas DataFrame. The features for clustering are extracted from the DataFrame.

The 'rc' (red blood cell count) and 'wc' (white blood cell count) columns are converted to numeric values with `pd.to_numeric(..., errors='coerce')`, which replaces any entry that cannot be parsed as a number with NaN instead of raising an error. Missing values in every feature, including the newly created NaNs, are then replaced with that feature's mean using `fillna()`.
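The conversion and imputation step can be illustrated on a toy column. This is a minimal sketch: the sample values below are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical raw column: numeric strings mixed with an
# unparseable token and a missing entry
raw = pd.Series(["5.2", "\t?", "4.8", None])

# errors='coerce' turns anything that cannot be parsed into NaN
# instead of raising an exception
numeric = pd.to_numeric(raw, errors="coerce")

# Replace each NaN with the mean of the values that did parse
filled = numeric.fillna(numeric.mean())

print(numeric.isna().sum())  # how many entries became NaN
print(filled.tolist())
```

Mean imputation keeps every row available for clustering, at the cost of pulling the imputed rows toward the center of that feature.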

The features are then standardized with scikit-learn's `StandardScaler`, which rescales each column to zero mean and unit variance. This matters for K-Means because the algorithm assigns points by Euclidean distance: without scaling, a feature with a large numeric range, such as white blood cell count, would dominate features with small ranges, such as specific gravity.
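A small sketch of what standardization does to features on very different scales. The numbers are hypothetical and chosen only to make the effect visible.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical two-feature sample: age in years and a white blood cell count
X_toy = np.array([[20.0, 5000.0],
                  [40.0, 8000.0],
                  [60.0, 11000.0]])

X_toy_scaled = StandardScaler().fit_transform(X_toy)

# Each column now has mean 0 and (population) standard deviation 1
print(X_toy_scaled.mean(axis=0))
print(X_toy_scaled.std(axis=0))
```

After scaling, both columns contribute comparably to the Euclidean distances that K-Means relies on.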

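The script then applies K-Means to the standardized features. The number of clusters (K) is set to 2, but this can be adjusted based on your analysis. Setting `random_state=42` makes the result reproducible, and `n_init=10` runs the algorithm from ten different centroid initializations and keeps the best result. The `fit_predict()` method computes the cluster centers and returns the cluster index for each sample.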
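One common way to adjust K is to fit K-Means for several candidate values and compare their silhouette scores. The sketch below assumes synthetic data from `make_blobs` with three hand-picked centers as a stand-in for the real features; the candidate range 2 to 6 is also an arbitrary choice.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data: three well-separated blobs (illustration only)
X_demo, _ = make_blobs(n_samples=300,
                       centers=[[0, 0], [10, 10], [-10, 10]],
                       cluster_std=1.0, random_state=0)

# Fit K-Means for each candidate K and record the mean silhouette score
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

# Pick the K with the highest score
best_k = max(scores, key=scores.get)
print(scores)
print(best_k)
```

On the kidney disease data, the same loop could be run with `X_scaled` in place of `X_demo`; the best-scoring K is a guide rather than a definitive answer.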

The silhouette score is then computed for the clustering. For a single sample, it measures how similar the sample is to its own cluster compared with the nearest other cluster; `silhouette_score` returns the mean over all samples. Values range from -1 to 1: values near 1 indicate well-separated clusters, values near 0 indicate overlapping clusters, and negative values suggest that samples have been assigned to the wrong cluster.
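To make the definition concrete, the silhouette of a sample is s = (b - a) / max(a, b), where a is the mean distance to the other members of its own cluster and b is the smallest mean distance to the members of any other cluster. The following is a plain-Python sketch of that formula on four made-up points; the helper `silhouette_mean` is a hypothetical name for illustration, not a scikit-learn function.

```python
from math import dist

def silhouette_mean(points, labels):
    """Mean silhouette coefficient, computed directly from the definition."""
    scores = []
    for i, (p, own) in enumerate(zip(points, labels)):
        # a: mean distance to the other members of the point's own cluster
        same = [dist(p, q) for j, (q, l) in enumerate(zip(points, labels))
                if l == own and j != i]
        if not same:
            scores.append(0.0)  # convention for single-point clusters
            continue
        a = sum(same) / len(same)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(p, q) for q, l in zip(points, labels) if l == other)
            / sum(1 for l in labels if l == other)
            for other in set(labels) if other != own
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Two tight pairs of points, far apart from each other
points = [(0, 0), (0, 1), (5, 0), (5, 1)]
labels = [0, 0, 1, 1]
print(round(silhouette_mean(points, labels), 2))  # 0.8
```

Swapping the labels so that each cluster spans both pairs, for example `[0, 1, 0, 1]`, drives the score below zero, which is the signature of a poor assignment.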

Finally, the script visualizes the clusters with a scatter plot. Up to 500 points are randomly sampled from the dataset for this visualization. Only the first two standardized features (age and blood pressure) are plotted on the x and y axes, and the color of each point shows its predicted cluster. Because the clustering uses all seven features, the plot is a two-dimensional view of the result, and clusters may overlap in it more than they do in the full feature space.
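The script's sampling step is unseeded, so each run plots a different subset even though the clustering itself is fixed by `random_state`. A minimal sketch of a reproducible variant, assuming NumPy's `default_rng` generator and stand-in arrays with hypothetical sizes:

```python
import numpy as np

rng = np.random.default_rng(42)  # the seed value is an arbitrary choice

# Stand-in arrays shaped like scaled features and cluster labels (hypothetical sizes)
X_fake = rng.normal(size=(1200, 7))
y_fake = rng.integers(0, 2, size=1200)

# Draw up to 500 distinct row indices, then index features and labels together
idx = rng.choice(len(X_fake), size=min(500, len(X_fake)), replace=False)
X_sub, y_sub = X_fake[idx], y_fake[idx]

print(X_sub.shape, y_sub.shape)  # (500, 7) (500,)
```

Indexing both arrays with the same `idx` keeps each plotted point paired with its own cluster label.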