## Diabetes Prediction 

Diabetes prediction uses data such as medical records, demographic information, lifestyle factors, and genetic predispositions to estimate the likelihood that an individual will develop diabetes. Machine learning is commonly used for this task: a model is trained on historical data from individuals who have been diagnosed with diabetes or who have risk factors for it, and the trained model is then used to make predictions on new, unseen data.



This notebook follows these steps:

* Import libraries
* Load the diabetes dataset
* Split the data into training, validation, and test sets (6 files)
* Save the data files
* Define the models
* Define the pipeline
* Train the models
* Save the model
* Evaluate the model on the training set

#### Import libraries

In [52]:
# Standard library
import pickle

# Third-party libraries
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

#### Load the diabetes dataset

In [54]:
# 'Outcome' is the target column; all remaining columns are used as features
diabetes_df = pd.read_csv('Data/diabetes.csv')
X = diabetes_df.drop('Outcome', axis=1)
y = diabetes_df['Outcome']

#### Split the data into training, validation, and test sets

In [55]:
# Hold out 25% of the data for testing, then 25% of the remainder for validation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)
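
Because the second split takes 25% of what remains after the first, the overall proportions work out to 56.25% training, 18.75% validation, and 25% test. The sketch below checks this arithmetic on placeholder data; the 800-row size, the column names, and the random values are assumptions chosen for illustration and are not the diabetes dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder data: 800 rows, 4 numeric features, binary label (illustrative only)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(800, 4)), columns=['f0', 'f1', 'f2', 'f3'])
y_demo = pd.Series(rng.integers(0, 2, size=800), name='Outcome')

# Same two-stage split as above: 25% for test, then 25% of the rest for validation
X_tmp, X_te, y_tmp, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)

split_sizes = {'train': len(X_tr), 'val': len(X_va), 'test': len(X_te)}
split_fractions = {name: n / len(X_demo) for name, n in split_sizes.items()}
print(split_sizes)
print(split_fractions)
```

If a different ratio is wanted, for example a validation set as large as the test set, the second `test_size` has to be computed relative to the remaining rows rather than the full dataset.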

#### Save the data files

In [56]:
# Write each split to CSV; index=False leaves out the row index column
X_train.to_csv('X_train.csv', index=False)
y_train.to_csv('y_train.csv', index=False)
X_val.to_csv('X_val.csv', index=False)
y_val.to_csv('y_val.csv', index=False)
X_test.to_csv('X_test.csv', index=False)
y_test.to_csv('y_test.csv', index=False)
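
When the saved files are read back later, the label files need slightly different handling: a pandas Series is written as a one-column CSV, so `pd.read_csv` returns a DataFrame and the column has to be selected to recover a Series. A minimal round-trip sketch, using small placeholder frames and a temporary directory (both are assumptions for illustration):

```python
import os
import tempfile

import pandas as pd

# Placeholder frames standing in for X_train / y_train (illustrative values)
X_part = pd.DataFrame({'f0': [1.0, 2.0, 3.0], 'f1': [0.5, 0.25, 0.125]})
y_part = pd.Series([0, 1, 0], name='Outcome')

with tempfile.TemporaryDirectory() as tmp_dir:
    x_path = os.path.join(tmp_dir, 'X_train.csv')
    y_path = os.path.join(tmp_dir, 'y_train.csv')
    X_part.to_csv(x_path, index=False)
    y_part.to_csv(y_path, index=False)

    X_back = pd.read_csv(x_path)
    # The Series was written with its name as the header, so select that column
    y_back = pd.read_csv(y_path)['Outcome']

print('Features restored:', X_back.equals(X_part))
print('Labels restored:', list(y_back) == list(y_part))
```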

#### Define the models

In [57]:
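# Candidate classifiers with default hyperparameters.
# The tree-based models are not seeded, so their scores can vary between runs.
models = [
    ('Decision Tree', DecisionTreeClassifier()),
    ('Random Forest', RandomForestClassifier()),
    ('Logistic Regression', LogisticRegression())
]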


#### Define the pipeline

In [58]:
# Standardize the features, then keep 3 of them with recursive feature elimination (RFE)
pipeline = make_pipeline(StandardScaler(), RFE(estimator=LogisticRegression(), n_features_to_select=3))
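
After a pipeline like this is fitted, the RFE step records which features it kept in its `support_` and `ranking_` attributes. The sketch below shows one way to read them; it uses synthetic data from `make_classification` and hypothetical feature names (`feature_0` ... `feature_7`), since the real column names are not listed here.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with 8 features (names are hypothetical)
X_arr, y_arr = make_classification(n_samples=300, n_features=8, n_informative=3, random_state=0)
feature_names = [f'feature_{i}' for i in range(8)]
X_syn = pd.DataFrame(X_arr, columns=feature_names)

demo_pipeline = make_pipeline(
    StandardScaler(),
    RFE(estimator=LogisticRegression(), n_features_to_select=3),
)
demo_pipeline.fit(X_syn, y_arr)

# make_pipeline names each step after its lowercased class name, hence 'rfe'
rfe_step = demo_pipeline.named_steps['rfe']
selected = [name for name, keep in zip(feature_names, rfe_step.support_) if keep]
print('Selected features:', selected)
print('Ranking (1 = kept):', dict(zip(feature_names, rfe_step.ranking_)))
```

Because RFE is the last step, the pipeline's `predict` and `score` calls are delegated to the estimator that RFE refits on the selected features.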

#### Train the models

In [59]:
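# One pipeline object is reused: each iteration swaps the RFE estimator and refits,
# so after the loop the pipeline holds only the last model fitted.
for name, model in models:
    pipeline.set_params(rfe__estimator=model, rfe__n_features_to_select=3)
    pipeline.fit(X_train, y_train)
    acc = pipeline.score(X_val, y_val)
    print(f'{name} accuracy on validation set: {acc:.2%}')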

Decision Tree accuracy on validation set: 64.58%
Random Forest accuracy on validation set: 76.39%
Logistic Regression accuracy on validation set: 77.78%
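
The loop above reuses a single pipeline object, so once it finishes only the last fitted model remains. One possible way to keep the model with the highest validation accuracy explicitly is to clone the pipeline for each candidate, as sketched below. The synthetic data, the fixed `random_state` values, and the variable names (`template`, `candidates`, `best_pipeline`) are assumptions for illustration and are not part of the original notebook.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the diabetes data (assumption, for illustration only)
X_syn, y_syn = make_classification(n_samples=400, n_features=8, n_informative=4, random_state=0)
X_fit, X_hold, y_fit, y_hold = train_test_split(X_syn, y_syn, test_size=0.25, random_state=0)

candidates = [
    ('Decision Tree', DecisionTreeClassifier(random_state=0)),
    ('Random Forest', RandomForestClassifier(random_state=0)),
    ('Logistic Regression', LogisticRegression()),
]
template = make_pipeline(StandardScaler(), RFE(estimator=LogisticRegression(), n_features_to_select=3))

results = {}
best_name, best_acc, best_pipeline = None, -1.0, None
for name, estimator in candidates:
    # clone() gives each candidate its own unfitted copy of the pipeline
    candidate = clone(template).set_params(rfe__estimator=estimator)
    candidate.fit(X_fit, y_fit)
    acc = candidate.score(X_hold, y_hold)
    results[name] = acc
    if acc > best_acc:
        best_name, best_acc, best_pipeline = name, acc, candidate

print(f'Best on validation data: {best_name} ({best_acc:.2%})')
```

With this pattern, `best_pipeline` is the object to pickle, regardless of the order in which the candidates were tried.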


#### Save the model

In [60]:
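# Save the pipeline as left by the training loop: the last model fitted
# (Logistic Regression), which also scored highest on the validation set
with open('diabetes_model.pkl', 'wb') as f:
    pickle.dump(pipeline, f)

# Reload the saved model to confirm it can be restored
with open('diabetes_model.pkl', 'rb') as f:
    model = pickle.load(f)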
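
A quick way to confirm that a pickled model survived the round trip is to compare predictions from the original and the restored object. The sketch below does this with a small pipeline fitted on synthetic data and a temporary file; the data, the file name `demo_model.pkl`, and the variable names are assumptions for illustration.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data and a small fitted pipeline (illustrative only)
X_syn, y_syn = make_classification(n_samples=200, n_features=8, random_state=0)
fitted = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_syn, y_syn)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'demo_model.pkl')
    with open(model_path, 'wb') as f:
        pickle.dump(fitted, f)
    with open(model_path, 'rb') as f:
        restored = pickle.load(f)

# The restored pipeline should reproduce the original predictions exactly
same_predictions = np.array_equal(fitted.predict(X_syn), restored.predict(X_syn))
print('Predictions match after reload:', same_predictions)
```

Pickle files can execute arbitrary code when loaded, so they should only be loaded from trusted sources, and a pickled scikit-learn model is best reloaded with the same library version that produced it.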

#### Evaluate the model on the training set

In [61]:
# Accuracy on the data the model was fitted on; this does not measure generalization
acc = model.score(X_train, y_train)
print(f'Model accuracy on training set: {acc:.2%}')

Model accuracy on training set: 74.31%
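
Training accuracy shows how well the model fits data it has already seen. The held-out test split created earlier is not used above, and it is the usual place to estimate performance on new data. The sketch below shows the general pattern with an accuracy score and a confusion matrix; it runs on synthetic data with a plain scaler plus logistic regression pipeline, both of which are stand-ins and not the notebook's model or results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with a 25% held-out portion (illustrative only)
X_syn, y_syn = make_classification(n_samples=400, n_features=8, n_informative=4, random_state=0)
X_fit, X_holdout, y_fit, y_holdout = train_test_split(X_syn, y_syn, test_size=0.25, random_state=0)

clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(X_fit, y_fit)

# Score only on rows the model never saw during fitting
y_pred = clf.predict(X_holdout)
test_acc = accuracy_score(y_holdout, y_pred)
cm = confusion_matrix(y_holdout, y_pred)  # rows: true class, columns: predicted class
print(f'Accuracy on held-out data: {test_acc:.2%}')
print(cm)
```

In this notebook the equivalent call would be `model.score(X_test, y_test)`. The confusion matrix adds detail that accuracy hides, such as how many positive cases were missed.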
