<h2>Title of the project 2: “Diabetes Patients”</h2>
<hr>
 
<em>About Dataset</em>
<br>
This dataset is originally from the National Institute of Diabetes and Digestive and Kidney
Diseases. 
<br>
The objective of the dataset is to diagnostically predict whether a patient has diabetes
based on certain diagnostic measurements included in the dataset. Several constraints were placed
on the selection of these instances from a larger database. In particular, all patients here are females
at least 21 years old of Pima Indian heritage. The (.csv) file contains several independent variables
(medical predictor variables) and a single dependent target variable (Outcome).


In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


In [2]:
# Load the CSV data into a pandas DataFrame
diabetes_data = pd.read_csv('diabetes.csv')
# Print out the first few rows of the DataFrame to see what it looks like
diabetes_data.head(5)

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [3]:
# Split the data into features (X) and target (y)
X = diabetes_data.drop('Outcome', axis=1)
y = diabetes_data['Outcome']

# Split the data into training, validation, and test sets (70% train, 15% validation, 15% test)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

scaler = StandardScaler()

# Scale the training, validation, and test sets
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)
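
The two-step split and the scaling above can be checked with a short sketch. It does not read `diabetes.csv`; instead it assumes a stand-in dataset of 768 rows and 8 features built with `make_classification`, and all `demo_`/`X_tr`-style names are hypothetical. It also passes `stratify`, which the cell above does not use, to keep the class ratio similar in each subset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): 768 rows, 8 numeric features, binary target
X_demo, y_demo = make_classification(n_samples=768, n_features=8, random_state=42)

# Step 1: hold out 30% of the rows; stratify keeps the class ratio similar
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)
# Step 2: split the held-out 30% in half, giving roughly 15% validation and 15% test
X_va, X_te, y_va, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=42, stratify=y_tmp
)

# Fit the scaler on the training rows only, then reuse it for the other subsets
demo_scaler = StandardScaler()
X_tr_scaled = demo_scaler.fit_transform(X_tr)
X_va_scaled = demo_scaler.transform(X_va)
X_te_scaled = demo_scaler.transform(X_te)

split_sizes = (len(X_tr), len(X_va), len(X_te))
print(split_sizes)
# Training-set column means are approximately 0 after scaling
print(np.abs(X_tr_scaled.mean(axis=0)).max())
```

With 768 rows the split yields 537 training, 115 validation, and 116 test rows. Fitting the scaler on the training rows alone keeps information from the validation and test rows out of the preprocessing step.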


In [4]:

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score


In [5]:
# Initialize models
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(),
    'Support Vector Machine': SVC(),
    'Gradient Boosting': GradientBoostingClassifier()
}
models_list = list(models.items())
results = {}


In [6]:
# Logistic Regression
name, model = models_list[0]

model.fit(X_train, y_train)
y_val_pred = model.predict(X_val)
accuracy = accuracy_score(y_val, y_val_pred)
results[name] = accuracy
print(f'The accuracy of the {name} model is {accuracy}')

The accuracy of the Logistic Regression model is 0.7304347826086957


In [7]:
# Decision Tree
name, model = models_list[1]

model.fit(X_train, y_train)
y_val_pred = model.predict(X_val)
accuracy = accuracy_score(y_val, y_val_pred)
results[name] = accuracy
print(f'The accuracy of the {name} model is {accuracy}')

The accuracy of the Decision Tree model is 0.7304347826086957


In [8]:
# Random Forest
name, model = models_list[2]

model.fit(X_train, y_train)
y_val_pred = model.predict(X_val)
accuracy = accuracy_score(y_val, y_val_pred)
results[name] = accuracy
print(f'The accuracy of the {name} model is {accuracy}')

The accuracy of the Random Forest model is 0.7304347826086957


In [9]:
# Support Vector Machine
name, model = models_list[3]

model.fit(X_train, y_train)
y_val_pred = model.predict(X_val)
accuracy = accuracy_score(y_val, y_val_pred)
results[name] = accuracy
print(f'The accuracy of the {name} model is {accuracy}')

The accuracy of the Support Vector Machine model is 0.7391304347826086


In [10]:
# Gradient Boosting
name, model = models_list[4]

model.fit(X_train, y_train)
y_val_pred = model.predict(X_val)
accuracy = accuracy_score(y_val, y_val_pred)
results[name] = accuracy
print(f'The accuracy of the {name} model is {accuracy}')

The accuracy of the Gradient Boosting model is 0.7652173913043478
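
The five cells above repeat the same fit, predict, and score steps, so they could also be written as a single loop. The sketch below is one way to do that. It assumes synthetic stand-in data from `make_classification` rather than `diabetes.csv`, and it fixes `random_state` on the tree-based models (the notebook leaves it unset), so its scores will not match the values printed above. The `demo_` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption); in the notebook these would be the scaled splits
X_demo, y_demo = make_classification(n_samples=768, n_features=8, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)
demo_scaler = StandardScaler()
X_tr = demo_scaler.fit_transform(X_tr)
X_va = demo_scaler.transform(X_va)

# Same five model families as the notebook, with seeds fixed for repeatability
demo_models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(random_state=0),
    'Random Forest': RandomForestClassifier(random_state=0),
    'Support Vector Machine': SVC(),
    'Gradient Boosting': GradientBoostingClassifier(random_state=0),
}

# Fit each model on the training split and score it on the validation split
demo_results = {}
for name, model in demo_models.items():
    model.fit(X_tr, y_tr)
    demo_results[name] = accuracy_score(y_va, model.predict(X_va))
    print(f'The accuracy of the {name} model is {demo_results[name]:.4f}')
```

Fixing the seeds makes the Decision Tree, Random Forest, and Gradient Boosting scores repeatable from run to run.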


In [11]:
# Find the best model
best_model = max(results, key=results.get)
print(f'Best Model: {best_model} (Validation Accuracy: {results[best_model]:.2f})')

Best Model: Gradient Boosting (Validation Accuracy: 0.77)
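
The test split created earlier is not used anywhere in the notebook. Because the validation accuracy is what selects the model, scoring the selected model once on the held-out test split would give a less optimistic estimate of its performance. A minimal sketch of that step follows. It assumes synthetic stand-in data and only two candidate models for brevity, and the `demo_` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption), split 70/15/15 as in the notebook
X_demo, y_demo = make_classification(n_samples=768, n_features=8, random_state=1)
X_tr, X_tmp, y_tr, y_tmp = train_test_split(X_demo, y_demo, test_size=0.3, random_state=1)
X_va, X_te, y_va, y_te = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=1)

# Scale with statistics from the training split only
demo_scaler = StandardScaler()
X_tr = demo_scaler.fit_transform(X_tr)
X_va = demo_scaler.transform(X_va)
X_te = demo_scaler.transform(X_te)

demo_models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Gradient Boosting': GradientBoostingClassifier(random_state=1),
}

# Select the model using validation accuracy only
val_scores = {}
for name, model in demo_models.items():
    model.fit(X_tr, y_tr)
    val_scores[name] = accuracy_score(y_va, model.predict(X_va))
best_name = max(val_scores, key=val_scores.get)

# Score the selected model once on the untouched test split
test_accuracy = accuracy_score(y_te, demo_models[best_name].predict(X_te))
print(f'Best Model: {best_name} (Test Accuracy: {test_accuracy:.2f})')
```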


In [12]:
# Example input: one patient's eight feature values, in the same column order as X
input_values = [6, 148, 72, 35, 0, 33.6, 0.627, 50]

# Build a one-row DataFrame with the training column names, then apply the scaler
# fitted on the training set (the models were trained on scaled features)
input_df = pd.DataFrame([input_values], columns=X.columns)
input_scaled = scaler.transform(input_df)

# Use the best model for prediction
best_model_instance = models[best_model]
prediction = best_model_instance.predict(input_scaled)

print(f'Predicted Outcome: {"Diabetes" if prediction[0] == 1 else "No Diabetes"}')


Predicted Outcome: Diabetes
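
Because the models are trained on scaled features, any new input has to go through the same scaling before prediction. One way to guarantee this is to bundle the scaler and the classifier in a scikit-learn `Pipeline`. The sketch below shows that alternative; it is not the notebook's method, and the stand-in data and the `demo_` names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): 768 rows, 8 numeric features, binary target
X_demo, y_demo = make_classification(n_samples=768, n_features=8, random_state=2)

# The pipeline fits the scaler and the classifier together
demo_pipeline = make_pipeline(
    StandardScaler(),
    GradientBoostingClassifier(random_state=2),
)
demo_pipeline.fit(X_demo, y_demo)

# A raw, unscaled row can be passed directly; the pipeline scales it first
new_row = X_demo[:1]
demo_prediction = demo_pipeline.predict(new_row)

# Equivalent manual steps, shown for comparison
scaler_step = demo_pipeline.named_steps['standardscaler']
model_step = demo_pipeline.named_steps['gradientboostingclassifier']
manual_prediction = model_step.predict(scaler_step.transform(new_row))

print(demo_prediction[0], manual_prediction[0])
```

The pipeline and the manual two-step version return the same label, but the pipeline removes the chance of forgetting the scaling step at prediction time.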
