# Predicting the development of diabetes using machine learning

## Goal

My goal is to train a model on sample data so that it can predict the probability that a new patient will develop diabetes. The data is gathered and cleaned, then used to train the model with a supervised learning algorithm: logistic regression.
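As a rough illustration of what "predicting a probability" means here, logistic regression passes a weighted sum of the features through the logistic (sigmoid) function, which maps any real number to a value between 0 and 1. The sketch below uses hypothetical helper names (`logistic`, `predict_probability`) and made-up weights that are not fitted to this dataset.

```python
import math


def logistic(z: float) -> float:
    """Map a real-valued score z to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(features, weights, intercept):
    """Probability of the positive class for one patient.

    The weights and intercept are supplied by the caller; in a real model
    they would be learned from the training data.
    """
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return logistic(z)


# Hypothetical weights for two features (age, hemoglobin A1c).
# They are chosen for illustration only and are not fitted values.
example_weights = [0.04, 0.9]
example_intercept = -8.0

p = predict_probability([55, 6.4], example_weights, example_intercept)
print(f"Predicted probability: {p:.3f}")
```

A score of 0 corresponds to a probability of exactly 0.5, so the sign of the weighted sum decides which side of an even chance a patient falls on.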

## Project

### The data

The dataset contains the following features:

- Age: The patient's age, in years.
- Gender: The patient's gender, male or female.
- BMI: The patient's body mass index (BMI), a measure of weight relative to height.
- Blood pressure: The patient's blood pressure category, normal or high.
- Fasting Blood Sugar: The patient's fasting blood sugar, in mg/dL.
- Hemoglobin A1c: The patient's hemoglobin A1c, a measure of blood sugar control over the past 3 months.
- Family history of diabetes: Whether the patient has a family history of diabetes.
- Smoking: Whether the patient smokes.
- Diet: Whether the patient has a poor or healthy diet.
- Exercise: Whether the patient exercises regularly.
- Diagnosis: The patient's diagnosis, either diabetes or no diabetes.
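Several of the features listed above are categories rather than numbers, and scikit-learn's logistic regression expects numeric input. One possible way to handle this, sketched below on a two-row illustrative DataFrame, is to map each two-level column to 0/1 with pandas. The column names, the category labels, and the 0/1 assignments are all assumptions made for the example, not something the dataset prescribes.

```python
import pandas as pd

# Two illustrative rows; the column names and labels are assumed for this sketch.
sample = pd.DataFrame({
    "age": [45, 55],
    "gender": ["Male", "Female"],
    "smoking": ["No", "Yes"],
    "diet": ["Healthy", "Poor"],
    "diagnosis": ["No", "Yes"],
})

# Hypothetical label-to-number mappings for the two-level columns.
binary_maps = {
    "gender": {"Male": 0, "Female": 1},
    "smoking": {"No": 0, "Yes": 1},
    "diet": {"Healthy": 0, "Poor": 1},
    "diagnosis": {"No": 0, "Yes": 1},
}

# Work on a copy so the original text labels stay available.
encoded = sample.copy()
for column, mapping in binary_maps.items():
    encoded[column] = encoded[column].map(mapping)

print(encoded)
```

Explicit mappings keep the meaning of 0 and 1 visible in the code. Any label missing from a mapping becomes `NaN`, which makes unexpected values easy to spot.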


In [101]:
import pandas as pd

file_path = 'diabetes-classification-cleaned.csv'

# Creating a DataFrame from the CSV file
data_frame = pd.read_csv(file_path)

# Displaying the first few rows of the DataFrame
data_frame.head()

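|   | age | gender | bmi | blood_pressure | fasting_blood_sugar | hemoglobin_a1c | family_history_of_diabetes | smoking | diet | exercise | diagnosis |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 45 | Male | 25 | Normal | 100 | 5.7 | No | No | Healthy | Regular | No |
| 1 | 55 | Female | 30 | High | 120 | 6.4 | Yes | Yes | Poor | No | Yes |
| 2 | 65 | Male | 35 | High | 140 | 7.1 | Yes | Yes | Poor | No | Yes |
| 3 | 75 | Female | 40 | High | 160 | 7.8 | Yes | Yes | Poor | No | Yes |
| 4 | 40 | Male | 20 | Normal | 80 | 5.0 | No | No | Healthy | Regular | No |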


The data must be split into training data and test data. The training data is used to fit the machine learning model, and the test data, which the model has not seen during training, is then used to evaluate its predictions.

In [102]:
columns = data_frame.columns.tolist()
feature_columns = [col for col in columns if col != 'diagnosis']

# Displaying the applicable feature columns
print("Feature columns:", feature_columns)

Feature columns: ['age', 'gender', 'bmi', 'blood_pressure', 'fasting_blood_sugar', 'hemoglobin_a1c', 'family_history_of_diabetes', 'smoking', 'diet', 'exercise']


In [103]:
# Separating the features from the target variable 'diagnosis'
X = data_frame[feature_columns]  # features
y = data_frame['diagnosis']  # target variable
print("Features:")
print(X)
print("Target Variable:")
print(y)

Features:
     age  gender  bmi blood_pressure  fasting_blood_sugar  hemoglobin_a1c  \
0     45    Male   25         Normal                  100             5.7   
1     55  Female   30           High                  120             6.4   
2     65    Male   35           High                  140             7.1   
3     75  Female   40           High                  160             7.8   
4     40    Male   20         Normal                   80             5.0   
..   ...     ...  ...            ...                  ...             ...   
123   17  Female   15         Normal                  100             5.7   
124   22    Male   19         Normal                  120             6.4   
125   27  Female   24           High                  140             7.1   
126   32    Male   29           High                  160             7.8   
127   37  Female   34           High                  180             8.5   

    family_history_of_diabetes smoking     diet exercise  
0     

In [104]:
from sklearn.model_selection import train_test_split

# Splitting data into training and test datasets (75% / 25%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=7)
print("Training Data Row Count:", len(X_train))
print("Test Data Row Count:", len(X_test))

Training Data Row Count: 96
Test Data Row Count: 32
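With `test_size=0.25`, the 128 rows divide into 96 training rows and 32 test rows, which matches the counts printed above. On a dataset this small, a purely random split can leave the two parts with different proportions of diabetic patients. One optional variation, not used in the cell above, is the `stratify` parameter of `train_test_split`, which keeps the class ratio the same in both parts. The sketch below shows it on a synthetic 12-row dataset; `toy_X` and `toy_y` are made-up names and values.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data for illustration only: 12 rows, one third labelled 'Yes'.
toy_X = pd.DataFrame({"age": range(30, 42)})
toy_y = pd.Series(["Yes"] * 4 + ["No"] * 8, name="diagnosis")

# stratify=toy_y asks train_test_split to preserve the Yes/No ratio
# in both the training part and the test part.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.25, random_state=7, stratify=toy_y
)

print("Training rows:", len(X_tr), "| share of 'Yes':", (y_tr == "Yes").mean())
print("Test rows:", len(X_te), "| share of 'Yes':", (y_te == "Yes").mean())
```

Whether stratification is worth using here depends on how balanced the `diagnosis` column is, which could be checked with `y.value_counts()` before splitting.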
