### Machine Learning Models on Healthcare Dataset
In this notebook we use Python libraries such as NumPy, pandas, and scikit-learn to load and preprocess the healthcare dataset, apply label encoding and feature scaling, and then train and evaluate the seven machine learning models listed below.
### Machine Learning Models Used
1. Linear Regression
2. Logistic Regression
3. KMeans Clustering
4. K-Nearest Neighbors (KNN)
5. Support Vector Machine (SVM)
6. Random Forest
7. Decision Tree


In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

In [2]:
from google.colab import files
uploaded = files.upload()

Saving healthcare_dataset.csv to healthcare_dataset.csv


In [3]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
# Load dataset
df = pd.read_csv('healthcare_dataset.csv')
# Encode categorical variables
le = LabelEncoder()
for col in df.select_dtypes(include='object'):
    df[col] = le.fit_transform(df[col])
# Features and target variables
X = df.drop(['Billing Amount', 'Test Results'], axis=1)
y_reg = df['Billing Amount']
y_clf = df['Test Results']
# Train-test split
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(X, y_reg, test_size=0.2, random_state=42)
X_train_clf, X_test_clf, y_train_clf, y_test_clf = train_test_split(X, y_clf, test_size=0.2, random_state=42)
# Feature scaling
scaler = StandardScaler()
X_train_reg = scaler.fit_transform(X_train_reg)
X_test_reg = scaler.transform(X_test_reg)
X_train_clf = scaler.fit_transform(X_train_clf)
X_test_clf = scaler.transform(X_test_clf)
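
The loop above reuses a single `LabelEncoder`, so once it finishes, `le` only remembers the classes of the last column it encoded. Below is a minimal sketch of keeping one encoder per column so that integer codes such as 0, 1, 2 can be mapped back to their original labels. It uses a small made-up frame in place of `healthcare_dataset.csv`; the cell values and the `toy`, `encoders`, and `decoded` names are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Test Results": ["Normal", "Abnormal", "Inconclusive", "Normal"],
})

# Fit and store a separate encoder for each categorical column.
encoders = {}
encoded = toy.copy()
for col in ["Gender", "Test Results"]:
    encoders[col] = LabelEncoder()
    encoded[col] = encoders[col].fit_transform(toy[col])

# Map the integer codes of the target back to the original labels.
decoded = encoders["Test Results"].inverse_transform(encoded["Test Results"])
print(list(encoders["Test Results"].classes_))
print(list(decoded))
```

Keeping the fitted encoders around also makes it possible to label the rows of a classification report with the original class names.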

### Linear Regression
Linear Regression models the relationship between a dependent variable and one or more independent variables. Here it predicts a continuous value, `Billing Amount`, using Ridge regression (linear regression with an L2 penalty), with the penalty strength `alpha` tuned by 5-fold grid search.

In [4]:
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error
# parameter grid for Ridge
ridge_params = {'alpha': [0.01, 0.1, 1, 10, 100]}
ridge_grid = GridSearchCV(Ridge(), ridge_params, cv=5, scoring='neg_mean_squared_error')
ridge_grid.fit(X_train_reg, y_train_reg)
# Best model
best_ridge = ridge_grid.best_estimator_
pred_reg = best_ridge.predict(X_test_reg)
# Evaluation
rmse = np.sqrt(mean_squared_error(y_test_reg, pred_reg))
print("Best Ridge Alpha:", ridge_grid.best_params_['alpha'])
print("Tuned Ridge Regression RMSE:", rmse)

Best Ridge Alpha: 100
Tuned Ridge Regression RMSE: 14109.15730808872
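
An RMSE on its own is hard to judge without a reference point. One common check is to compare it with a baseline that always predicts the training mean. The sketch below shows that comparison on synthetic data from `make_regression`, not on the healthcare dataset; the `X_demo`, `y_demo`, and `baseline` names are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the scaled features and the regression target.
X_demo, y_demo = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Baseline: always predict the mean of the training target.
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
baseline_rmse = np.sqrt(mean_squared_error(y_te, baseline.predict(X_te)))

# Ridge model evaluated on the same split.
ridge = Ridge(alpha=1.0).fit(X_tr, y_tr)
ridge_rmse = np.sqrt(mean_squared_error(y_te, ridge.predict(X_te)))

print(f"Baseline RMSE: {baseline_rmse:.2f}")
print(f"Ridge RMSE:    {ridge_rmse:.2f}")
```

If the tuned model's RMSE is close to the baseline's, the features carry little linear signal for the target.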


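### Logistic Regression
Logistic Regression is a classification algorithm that is widely used in healthcare prediction, for example for binary outcomes such as whether a person is likely to have a stroke. Here it predicts the three-class `Test Results` column, using degree-2 polynomial features and a grid search over the regularization strength `C`.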

In [6]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

poly = PolynomialFeatures(degree=2, include_bias=False)
X_train_poly = poly.fit_transform(X_train_clf)
X_test_poly = poly.transform(X_test_clf)

log_params = {
    'C': [0.1, 1, 10],
    'solver': ['liblinear'],
    'penalty': ['l2'],
    'class_weight': ['balanced']
}
log_grid = GridSearchCV(LogisticRegression(max_iter=1000), log_params, cv=5, scoring='accuracy')
log_grid.fit(X_train_poly, y_train_clf)

log_preds = log_grid.predict(X_test_poly)
print("Best Logistic Regression Params:", log_grid.best_params_)
print("\nTuned Logistic Regression Report:\n", classification_report(y_test_clf, log_preds))

Best Logistic Regression Params: {'C': 1, 'class_weight': 'balanced', 'penalty': 'l2', 'solver': 'liblinear'}

Tuned Logistic Regression Report:
               precision    recall  f1-score   support

           0       0.34      0.34      0.34      3754
           1       0.32      0.34      0.33      3617
           2       0.33      0.31      0.32      3729

    accuracy                           0.33     11100
   macro avg       0.33      0.33      0.33     11100
weighted avg       0.33      0.33      0.33     11100
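
With three roughly balanced classes, random guessing already gives an accuracy of about 0.33, so the report above sits at chance level. The sketch below estimates that chance level with scikit-learn's `DummyClassifier`. It runs on synthetic three-class data, and the variable names are illustrative assumptions rather than part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic three-class problem standing in for the Test Results target.
X_demo, y_demo = make_classification(
    n_samples=3000, n_features=10, n_informative=5, n_classes=3, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Baseline that ignores the features and guesses each class uniformly at random.
dummy = DummyClassifier(strategy="uniform", random_state=42).fit(X_tr, y_tr)
chance_acc = accuracy_score(y_te, dummy.predict(X_te))

print(f"Chance-level accuracy: {chance_acc:.2f}")
```

Reporting this number next to each classifier's accuracy makes it clear how much a model gains over guessing.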



### K-Means Clustering
K-Means is an unsupervised algorithm that groups similar data points. We use it to look for natural clusters in the healthcare data (e.g., patient risk groups), choosing the number of clusters `k` by silhouette score.

In [None]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# Range of cluster numbers to try
best_score = -1
best_k = 2
scores = []
# Try different values of k (clusters)
for k in range(2, 11):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X_train_clf)
    score = silhouette_score(X_train_clf, kmeans.labels_)
    scores.append((k, score))
    if score > best_score:
        best_score = score
        best_k = k
print(f"\nBest k: {best_k} with Silhouette Score: {best_score:.4f}")
# Fit final model with best k
final_kmeans = KMeans(n_clusters=best_k, random_state=42)
final_kmeans.fit(X_train_clf)
kmeans_preds = final_kmeans.predict(X_test_clf)
print("\nK-Means Clustering Labels (Test):", np.unique(kmeans_preds))
print("Silhouette Score (Train):", silhouette_score(X_train_clf, final_kmeans.labels_))


Best k: 2 with Silhouette Score: 0.1090

K-Means Clustering Labels (Test): [0 1]
Silhouette Score (Train): 0.1089846064079903
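
The silhouette score ranges from -1 to 1, and values near 0 indicate overlapping clusters, so a score of about 0.11 suggests weak cluster structure. The sketch below gives a sense of scale by comparing well-separated synthetic blobs with unstructured Gaussian noise. Both datasets and the `kmeans_silhouette` helper are hypothetical and not part of the notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three tight, well-separated blobs versus pure noise with no clusters.
X_blobs, _ = make_blobs(n_samples=600, centers=3, cluster_std=0.5, random_state=42)
rng = np.random.default_rng(42)
X_noise = rng.normal(size=(600, 10))

def kmeans_silhouette(X, k):
    """Fit K-Means with k clusters and return the silhouette score of its labels."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    return silhouette_score(X, labels)

blob_score = kmeans_silhouette(X_blobs, 3)
noise_score = kmeans_silhouette(X_noise, 3)

print(f"Silhouette on blobs: {blob_score:.3f}")
print(f"Silhouette on noise: {noise_score:.3f}")
```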


### K-Nearest Neighbors (KNN)
KNN is a lazy learning algorithm that classifies a sample by the labels of its most similar training points. We use it to predict `Test Results` for new patients from their nearest neighbors.

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
from sklearn.preprocessing import StandardScaler

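# Note: X_train_clf and X_test_clf were already standardized in the
# preprocessing cell, so this second pass leaves them essentially unchanged.
scaler = StandardScaler()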
X_train_clf = scaler.fit_transform(X_train_clf)
X_test_clf = scaler.transform(X_test_clf)

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

knn_params = {'n_neighbors': list(range(1, 21))}

knn_grid = GridSearchCV(KNeighborsClassifier(), knn_params, cv=5, scoring='accuracy')
knn_grid.fit(X_train_clf, y_train_clf)

best_knn = knn_grid.best_estimator_
knn_preds = best_knn.predict(X_test_clf)

print("Best KNN Params:", knn_grid.best_params_)
print("\nTuned KNN Report:\n", classification_report(y_test_clf, knn_preds))

Best KNN Params: {'n_neighbors': 1}

Tuned KNN Report:
               precision    recall  f1-score   support

           0       0.44      0.44      0.44      3754
           1       0.44      0.44      0.44      3617
           2       0.44      0.44      0.44      3729

    accuracy                           0.44     11100
   macro avg       0.44      0.44      0.44     11100
weighted avg       0.44      0.44      0.44     11100
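
In the cells above the scaler is fit on the whole training split before cross-validation, so each validation fold has already influenced the scaling statistics. The sketch below puts the scaler and the classifier in a `Pipeline`, which makes `GridSearchCV` refit the scaler inside every fold. It runs on synthetic data, and the step names `scale` and `knn` and the reduced grid are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data standing in for the healthcare features.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=4, n_classes=3, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Scaling is a pipeline step, so it is refit on the training part of each CV fold.
knn_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
param_grid = {"knn__n_neighbors": [1, 3, 5, 11, 21]}
search = GridSearchCV(knn_pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

test_acc = search.score(X_te, y_te)
print("Best params:", search.best_params_)
print(f"Test accuracy: {test_acc:.2f}")
```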



### Support Vector Machine (SVM)
SVM is a powerful classification algorithm that finds the maximum-margin boundary between classes. Here an RBF-kernel SVM predicts `Test Results`; the kernel lets it handle classes that may not be linearly separable.

In [None]:
from sklearn.svm import SVC

svm_model = SVC(kernel='rbf', random_state=42)
svm_model.fit(X_train_clf, y_train_clf)
svm_preds = svm_model.predict(X_test_clf)
print("\nSVM Classification Report:\n", classification_report(y_test_clf, svm_preds))


SVM Classification Report:
               precision    recall  f1-score   support

           0       0.36      0.40      0.38      3754
           1       0.35      0.36      0.36      3617
           2       0.35      0.30      0.32      3729

    accuracy                           0.35     11100
   macro avg       0.35      0.35      0.35     11100
weighted avg       0.35      0.35      0.35     11100
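
To illustrate why a kernel matters when classes are not linearly separable, the sketch below compares a linear and an RBF kernel on scikit-learn's synthetic `make_circles` data, where one class encloses the other. It does not use the healthcare dataset, and the variable names are illustrative assumptions.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings: no straight line can separate the classes.
X_c, y_c = make_circles(n_samples=400, factor=0.4, noise=0.05, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_c, y_c, test_size=0.25, random_state=42)

# Same data, two kernels.
linear_acc = SVC(kernel="linear").fit(X_tr, y_tr).score(X_te, y_te)
rbf_acc = SVC(kernel="rbf").fit(X_tr, y_tr).score(X_te, y_te)

print(f"Linear kernel accuracy: {linear_acc:.2f}")
print(f"RBF kernel accuracy:    {rbf_acc:.2f}")
```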



### Random Forest
Random Forest is an ensemble model that combines the predictions of many decision trees. It helps improve prediction accuracy and reduce overfitting in both regression and classification tasks. Here it is used as a classifier for `Test Results`.

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
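# Note: the features were already standardized earlier, so this pass is
# redundant; tree-based models also do not require feature scaling.
scaler = StandardScaler()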
X_train_clf = scaler.fit_transform(X_train_clf)
X_test_clf = scaler.transform(X_test_clf)

In [None]:
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train_clf, y_train_clf)
rf_preds = rf_model.predict(X_test_clf)
print("\n Random Forest Classification Report:\n")
print(classification_report(y_test_clf, rf_preds))


 Random Forest Classification Report:

              precision    recall  f1-score   support

           0       0.44      0.47      0.45      3754
           1       0.44      0.42      0.43      3617
           2       0.45      0.43      0.44      3729

    accuracy                           0.44     11100
   macro avg       0.44      0.44      0.44     11100
weighted avg       0.44      0.44      0.44     11100
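
A fitted random forest also exposes `feature_importances_`, which can hint at which columns drive its predictions. The sketch below shows how to read them, using synthetic data with hypothetical feature names; on the real notebook the index would be the column names of `X`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical feature names for a synthetic three-class problem.
feature_names = [f"feature_{i}" for i in range(6)]
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=3, n_redundant=0,
    n_classes=3, random_state=42,
)

forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_demo, y_demo)

# Impurity-based importances, normalized to sum to 1 across features.
importances = pd.Series(forest.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```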



### Decision Tree
A Decision Tree is a flowchart-like structure used for decision making. It is easy to interpret and is used here to classify `Test Results`.

In [None]:
from sklearn.tree import DecisionTreeClassifier
dt_model = DecisionTreeClassifier()
dt_model.fit(X_train_clf, y_train_clf)
dt_preds = dt_model.predict(X_test_clf)
print("\n Decision Tree Report:\n", classification_report(y_test_clf, dt_preds))


 Decision Tree Report:
               precision    recall  f1-score   support

           0       0.43      0.42      0.43      3754
           1       0.41      0.42      0.42      3617
           2       0.44      0.44      0.44      3729

    accuracy                           0.43     11100
   macro avg       0.43      0.43      0.43     11100
weighted avg       0.43      0.43      0.43     11100
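
The flowchart-like structure can be inspected directly with `export_text`, which prints the learned rules. The sketch below does this for a shallow tree on synthetic data with hypothetical feature names. Unlike the cell above, it sets `max_depth` and `random_state` so that the output is short and reproducible.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical feature names for a small synthetic binary problem.
feature_names = ["feature_a", "feature_b", "feature_c", "feature_d"]
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0,
    n_classes=2, random_state=42,
)

# A depth limit keeps the printed tree small enough to read.
tree = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X_demo, y_demo)

# Each indented line is a threshold test; leaves report the predicted class.
rules = export_text(tree, feature_names=feature_names)
print(rules)
```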

