# Early stage Diabetes Risk Prediction
#### By Edison Musinde
## About the Data

This dataset contains information on the signs and symptoms of patients who are newly diagnosed with diabetes or at risk of developing it. The data was collected through direct questionnaires administered to patients at the Sylhet Diabetes Hospital in Sylhet, Bangladesh, and approved by a doctor.

### Variables Table:

* age (Feature, Integer): Age of the patient
* gender (Feature, Categorical): Gender of the patient
* polyuria (Feature, Binary): Presence of polyuria (Yes/No)
* polydipsia (Feature, Binary): Presence of polydipsia (Yes/No)
* sudden_weight_loss (Feature, Binary): Experience of sudden weight loss (Yes/No)
* weakness (Feature, Binary): Experience of weakness (Yes/No)
* polyphagia (Feature, Binary): Presence of polyphagia (Yes/No)
* genital_thrush (Feature, Binary): Presence of genital thrush (Yes/No)
* visual_blurring (Feature, Binary): Experience of visual blurring (Yes/No)
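* itching (Feature, Binary): Experience of itching (Yes/No)
* irritability (Feature, Binary): Experience of irritability (Yes/No)
* delayed_healing (Feature, Binary): Experience of delayed healing (Yes/No)
* partial_paresis (Feature, Binary): Presence of partial paresis (Yes/No)
* muscle_stiffness (Feature, Binary): Experience of muscle stiffness (Yes/No)
* alopecia (Feature, Binary): Presence of alopecia (Yes/No)
* obesity (Feature, Binary): Presence of obesity (Yes/No)
* class (Target, Binary): Diagnosis (Positive/Negative)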

### Description:

* Type: Multivariate
* Subject Area: Computer Science
* Associated Tasks: Classification
* Feature Type: Categorical, Integer
* Number of Instances: 520
* Number of Features: 16
* Missing Values: Yes (as listed in the dataset description; the null check in this notebook finds none)

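### Additional Variable Information:

The codes below come from the dataset description. The CSV used in this notebook stores the target as the strings `Positive` and `Negative`, which are mapped to 1 and 0 before modeling.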

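* Age: Range from 20 to 65 years (as listed in the dataset description; the age plot in this notebook includes patients older than 75)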
* Gender: 1 for Male, 2 for Female
* Polyuria: 1 for Yes, 2 for No
* Polydipsia: 1 for Yes, 2 for No
* Sudden Weight Loss: 1 for Yes, 2 for No
* Weakness: 1 for Yes, 2 for No
* Polyphagia: 1 for Yes, 2 for No
* Genital Thrush: 1 for Yes, 2 for No
* Visual Blurring: 1 for Yes, 2 for No
* Itching: 1 for Yes, 2 for No
* Irritability: 1 for Yes, 2 for No
* Delayed Healing: 1 for Yes, 2 for No
* Partial Paresis: 1 for Yes, 2 for No
* Muscle Stiffness: 1 for Yes, 2 for No
* Alopecia: 1 for Yes, 2 for No
* Obesity: 1 for Yes, 2 for No
* Class: 1 for Positive, 2 for Negative




# Objectives
* Visualize the data to find out which features have the highest correlation with early-stage diabetes.
* Experiment with various classification techniques to identify which machine learning algorithm is best for early diabetes risk prediction.
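
The first objective can be sketched as a correlation ranking once the Yes/No answers are mapped to 1/0. This is a minimal illustration, not the notebook's own analysis: the `toy` frame below is made-up data that only mimics the layout of the CSV, and the column subset is chosen arbitrarily.

```python
import pandas as pd

# Hypothetical toy rows that mimic the dataset's Yes/No layout; not real patient data.
toy = pd.DataFrame({
    "Polyuria":   ["Yes", "Yes", "No", "No", "Yes", "No"],
    "Polydipsia": ["Yes", "No", "No", "No", "Yes", "No"],
    "Itching":    ["Yes", "No", "Yes", "No", "No", "Yes"],
    "class":      ["Positive", "Positive", "Negative", "Negative", "Positive", "Negative"],
})

# Map every string label to 0/1 so a Pearson correlation can be computed.
mapping = {"Yes": 1, "No": 0, "Positive": 1, "Negative": 0}
numeric = toy.apply(lambda col: col.map(mapping))

# Correlation of each symptom with the target, strongest first.
corr_with_class = numeric.corr()["class"].drop("class").sort_values(ascending=False)
print(corr_with_class)
```

On the real data the same two steps would run on `df` after mapping its categorical columns; for pairs of binary variables the Pearson coefficient is the phi coefficient.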

In [None]:
# Import the necessary modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
from sklearn.preprocessing import OneHotEncoder

In [None]:
# Load the file
df = pd.read_csv('/kaggle/input/early-stage-diabetes-risk-prediction/diabetes_data_upload.csv')

In [None]:
# View the data
df

In [None]:
df.shape  # Returns the shape of the DataFrame as (rows, columns)

In [None]:
df.isna().sum()  # Returns the number of missing values per column

### Observation
1. There are no missing values in the dataset. 

# Data visualization

In [None]:
# Check the distribution of values in the target feature 'class'
sns.countplot(data=df, x='class')
plt.title('Countplot of positive and negative cases.')
plt.xlabel('Diagnosis')
plt.ylabel('Count')

### Observation
1. There are more positive than negative cases in the data.
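
The imbalance can also be quantified rather than read off the plot. The sketch below uses a hypothetical `labels` series as a stand-in for `df['class']`; the counts are placeholders, not the dataset's real numbers.

```python
import pandas as pd

# Hypothetical labels standing in for df['class']; not the real counts.
labels = pd.Series(
    ["Positive", "Positive", "Positive", "Negative", "Negative"], name="class"
)

# Absolute counts and relative shares of each diagnosis.
counts = labels.value_counts()
shares = labels.value_counts(normalize=True)
print(pd.DataFrame({"count": counts, "share": shares.round(2)}))
```

Knowing the shares helps when reading accuracy later, since a model that always predicts the majority class already scores the majority share.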

In [None]:
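# Pass the DataFrame by keyword so the call also works on seaborn versions
# that do not accept `data` as the first positional argument
sns.catplot(data=df, x='Gender', y='Age', hue='class', kind='swarm')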
plt.title('Catplot of Gender VS Age with Diagnosis as hue')

### Observations
1. Female patients account for the largest share of positive diagnoses.
2. All patients above the age of 75 tested positive.
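
Both observations can be checked numerically with a cross-tabulation and a filter. The following is a sketch on made-up rows; the `toy` frame is hypothetical and only borrows the column names `Gender`, `Age`, and `class` from the notebook.

```python
import pandas as pd

# Hypothetical rows; not real patient data.
toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Male", "Female", "Female", "Female"],
    "Age":    [40, 58, 80, 35, 48, 77],
    "class":  ["Negative", "Positive", "Positive", "Positive", "Positive", "Positive"],
})

# Share of each diagnosis within each gender (every row sums to 1).
rate_by_gender = pd.crosstab(toy["Gender"], toy["class"], normalize="index")
print(rate_by_gender)

# Diagnoses among patients older than 75.
over_75 = toy.loc[toy["Age"] > 75, "class"].value_counts()
print(over_75)
```

Normalizing by row matters here: comparing raw counts would mix up the positive rate with the number of patients of each gender in the sample.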

In [None]:
df.dtypes

In [None]:
df.columns

In [None]:
# Two-valued categorical columns to encode (every feature except Age)
categorical_cols = [
    'Gender', 'Polyuria', 'Polydipsia', 'sudden weight loss',
    'weakness', 'Polyphagia', 'Genital thrush', 'visual blurring',
    'Itching', 'Irritability', 'delayed healing', 'partial paresis',
    'muscle stiffness', 'Alopecia', 'Obesity',
]

In [None]:
import category_encoders as ce

In [None]:
# Use BinaryEncoder to encode categorical columns
binary_encoder = ce.BinaryEncoder(cols=categorical_cols)
df_encoded = binary_encoder.fit_transform(df)

# Check the encoded DataFrame
df_encoded
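
For columns that hold only two values, a plain 0/1 mapping is a common alternative that keeps exactly one output column per feature. The sketch below assumes the columns are stored as `Yes`/`No` and `Male`/`Female` strings; the `toy` frame is hypothetical and the choice of which value maps to 1 is arbitrary.

```python
import pandas as pd

# Hypothetical rows using a few of the notebook's column names.
toy = pd.DataFrame({
    "Gender":   ["Male", "Female", "Female"],
    "Polyuria": ["Yes", "No", "Yes"],
    "Obesity":  ["No", "No", "Yes"],
})

yes_no_cols = ["Polyuria", "Obesity"]
encoded = toy.copy()

# Map Yes/No answers to 1/0, one column in and one column out.
encoded[yes_no_cols] = encoded[yes_no_cols].apply(
    lambda col: col.map({"Yes": 1, "No": 0})
)

# Gender has two values as well, so it gets its own explicit mapping.
encoded["Gender"] = encoded["Gender"].map({"Male": 1, "Female": 0})
print(encoded)
```

An explicit mapping also makes the meaning of each coefficient or feature importance easier to read later, because 1 always stands for the same answer.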

In [None]:
# Encode the target as 1 (Positive) and 0 (Negative)
df_encoded['class'] = df_encoded['class'].map({'Positive': 1, 'Negative': 0})

In [None]:
from sklearn.preprocessing import StandardScaler

# Standardize Age to zero mean and unit variance; it is the only numeric feature
scaler = StandardScaler()
column_to_scale = 'Age'
df_encoded[column_to_scale] = scaler.fit_transform(df_encoded[[column_to_scale]])
df_encoded
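
The cell above fits the scaler on every row before the data is split, so the test rows influence the scaling statistics. One way to avoid that is to put the scaler inside a scikit-learn `Pipeline` so it is fitted on the training rows only. This is a sketch under assumptions: the data is synthetic, and the names `toy`, `preprocess`, and `pipe` are illustrative rather than part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: one numeric column and one 0/1 symptom column.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "Age": rng.integers(20, 80, size=n),
    "Polyuria": rng.integers(0, 2, size=n),
})
toy["class"] = toy["Polyuria"]  # artificial target, for illustration only

X_toy = toy[["Age", "Polyuria"]]
y_toy = toy["class"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

# Scale Age only; pass the remaining columns through unchanged.
preprocess = ColumnTransformer(
    [("scale_age", StandardScaler(), ["Age"])], remainder="passthrough"
)
pipe = Pipeline([("preprocess", preprocess), ("clf", LogisticRegression())])

# fit() learns the scaling statistics from the training rows only.
pipe.fit(X_tr, y_tr)
print(pipe.score(X_te, y_te))
```

With this layout the same fitted scaler is reused automatically when predicting on the test rows.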

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

In [None]:
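# Separate the features from the target
X = df_encoded.drop(['class'], axis=1)
y = df_encoded['class']

# Hold out 20% of the rows for testing; random_state fixes the split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

# Fit a baseline logistic regression and predict on the held-out rows
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
y_pred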

In [None]:
from sklearn.metrics import classification_report
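
The baseline predictions can be scored with the function imported above. The sketch below shows the call on short placeholder label lists; in the notebook the arguments would be `y_test` and `y_pred`. The `_demo` names and values are hypothetical.

```python
from sklearn.metrics import accuracy_score, classification_report

# Placeholder labels: 1 = Positive, 0 = Negative.
y_true_demo = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred_demo = [1, 1, 0, 0, 0, 1, 1, 1]

# Overall share of correct predictions.
accuracy = accuracy_score(y_true_demo, y_pred_demo)
print(f"Accuracy: {accuracy:.2f}")

# Per-class precision, recall, and F1; target_names follow the sorted labels 0, 1.
print(classification_report(
    y_true_demo, y_pred_demo, target_names=["Negative", "Positive"]
))
```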

# Model Comparison

In [None]:
# Additional classifiers and metrics; plotting and encoding modules are already imported above
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc, precision_recall_curve

In [None]:
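# Candidate classifiers; random_state is fixed for the stochastic estimators
# so that repeated runs give the same results
models = {
    'Logistic Regression': LogisticRegression(),
    'Random Forest': RandomForestClassifier(random_state=42),
    'SVM': SVC(probability=True, random_state=42),
    'KNN': KNeighborsClassifier()
}

# Fit every model on the same training split
for name, model in models.items():
    model.fit(X_train, y_train)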

In [None]:
predictions = {name: model.predict(X_test) for name, model in models.items()}

In [None]:
reports = {name: classification_report(y_test, pred) for name, pred in predictions.items()}
for name, report in reports.items():
    print(f"Classification Report for {name}:\n{report}\n")
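
With 520 rows and a 20% hold-out, each report above rests on 104 test rows, so small differences between models may come down to the particular split. Cross-validation is one way to get a steadier comparison. The sketch below runs on synthetic data from `make_classification`; the names `X_demo`, `y_demo`, and `candidates` are hypothetical, and in the notebook `X` and `y` would take their place.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic data with the same number of rows and features as the dataset.
X_demo, y_demo = make_classification(
    n_samples=520, n_features=16, n_informative=6, random_state=0
)

# Stratified folds keep the class ratio roughly equal in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
}

# One F1 score per fold for each candidate.
cv_scores = {
    name: cross_val_score(est, X_demo, y_demo, cv=cv, scoring="f1")
    for name, est in candidates.items()
}

for name, scores in cv_scores.items():
    print(f"{name}: mean F1 = {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Reporting the spread alongside the mean shows whether the gap between two models is larger than the fold-to-fold variation.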

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes = axes.flatten()

for ax, (name, pred) in zip(axes, predictions.items()):
    cm = confusion_matrix(y_test, pred)
    sns.heatmap(cm, annot=True, fmt='d', ax=ax, cmap='Blues')
    ax.set_title(f'{name} Confusion Matrix')
    ax.set_xlabel('Predicted')
    ax.set_ylabel('Actual')

plt.tight_layout()
plt.show()
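
The notebook imports `roc_curve` and `auc` and enables `probability=True` on the SVM, which suggests a ROC comparison was intended. A minimal sketch of that computation for a single model follows; it uses synthetic data and hypothetical names (`X_demo`, `clf`, `roc_auc`), and omits the plot.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded features and target.
X_demo, y_demo = make_classification(
    n_samples=520, n_features=16, n_informative=6, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# ROC analysis needs scores, so use the predicted probability of class 1.
positive_scores = clf.predict_proba(X_te)[:, 1]

# False and true positive rates across all decision thresholds.
fpr, tpr, thresholds = roc_curve(y_te, positive_scores)
roc_auc = auc(fpr, tpr)
print(f"ROC AUC: {roc_auc:.3f}")
```

To draw the curve, `plt.plot(fpr, tpr)` could be called once per fitted model on a shared axis.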

In [None]:
# Recompute the classification reports as dictionaries so individual metrics can be extracted
reports = {name: classification_report(y_test, pred, output_dict=True) for name, pred in predictions.items()}

# Extract weighted averages
weighted_averages = {
    name: {
        'precision': report['weighted avg']['precision'],
        'recall': report['weighted avg']['recall'],
        'f1-score': report['weighted avg']['f1-score']
    }
    for name, report in reports.items()
}

# Convert to DataFrame
df_weighted_averages = pd.DataFrame(weighted_averages).T
print(df_weighted_averages)


In [None]:
plt.figure(figsize=(12, 6))

metrics = ['precision', 'recall', 'f1-score']
for i, metric in enumerate(metrics, 1):
    plt.subplot(1, 3, i)
    sns.barplot(x=df_weighted_averages.index, y=df_weighted_averages[metric])
    plt.title(f'Weighted Average {metric.capitalize()}')
    plt.xlabel('Model')
    plt.ylabel(metric.capitalize())
    plt.xticks(rotation=90)

plt.tight_layout()
plt.show()
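
To close the second objective, the summary table can be sorted so the strongest model appears first. The sketch below uses placeholder scores and anonymous model names; they are hypothetical and do not reflect the notebook's results. On the real table, `df_weighted_averages` would replace `placeholder`.

```python
import pandas as pd

# Placeholder metrics in the same layout as df_weighted_averages.
placeholder = pd.DataFrame(
    {
        "precision": [0.90, 0.95, 0.85],
        "recall":    [0.88, 0.96, 0.80],
        "f1-score":  [0.89, 0.95, 0.82],
    },
    index=["Model A", "Model B", "Model C"],
)

# Rank by weighted F1, highest first, and pick the top row.
ranked = placeholder.sort_values("f1-score", ascending=False)
best_model = ranked.index[0]
print(ranked)
print(f"Best weighted F1: {best_model}")
```

For a screening task, recall on the positive class may matter more than the weighted F1, so the sort column is a choice that depends on the cost of a missed diagnosis.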