
# Precious Kings
## Project Title: Time Series Predictive Modeling Diabetes Progression and Health Risk Stratification Using Electronic Health Records

In [3]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split # Imported for an optional train/test split; not used in the cells below
from sklearn.ensemble import RandomForestClassifier # Import RandomForest
from sklearn.preprocessing import StandardScaler     # For scaling features
import pickle                                       # For saving the model and scaler
import os                                           # To check if the file exists

print("Libraries imported successfully Precious")

Libraries imported successfully Precious


### Load dataset

In [6]:
# Load the dataset
data = pd.read_csv('/content/diabetes_prediction_india (1).csv')
display(data.head())

Unnamed: 0,Age,Gender,BMI,Family_History,Physical_Activity,Diet_Type,Smoking_Status,Alcohol_Intake,Stress_Level,Hypertension,...,Health_Insurance,Regular_Checkups,Medication_For_Chronic_Conditions,Pregnancies,Polycystic_Ovary_Syndrome,Glucose_Tolerance_Test_Result,Vitamin_D_Level,C_Protein_Level,Thyroid_Condition,Diabetes_Status
0,48,Male,35.5,No,High,Non-Vegetarian,Never,,Medium,Yes,...,No,No,No,0,0,124.3,31.5,7.46,Yes,Yes
1,18,Other,28.7,Yes,Medium,Non-Vegetarian,Current,Moderate,High,No,...,Yes,Yes,No,0,0,151.4,12.5,5.64,Yes,No
2,21,Other,30.0,Yes,High,Non-Vegetarian,Current,Moderate,High,Yes,...,No,No,Yes,0,0,106.1,35.8,7.2,No,Yes
3,25,Female,25.6,No,Medium,Vegetarian,Former,Moderate,High,Yes,...,No,No,Yes,1,No,85.6,15.4,6.53,Yes,No
4,78,Male,38.8,No,High,Non-Vegetarian,Current,High,High,No,...,No,No,Yes,0,0,77.0,28.6,0.58,No,Yes


### Define Features (X) and Target (y)

In [None]:
# Define the list of candidate feature column names
# These must match the CSV column names EXACTLY; names that are not found are skipped below
feature_cols = ['Age', 'BMI', 'Blood Glucose', 'Blood Pressure', 'HbA1c', 'Insulin Level', 'Skin thickness', 'Pregnancies', 'Family history', 'Physical Activity', 'Smoking status', 'Alcohol Intake', 'Diet Quality', 'Cholesterol', 'Triglycerides', 'Waist ratio']
target_col = 'Diabetes_Status' # Target column as it appears in the dataset preview above

# Create the features DataFrame (X), keeping only the columns that exist in the data
available_cols = [col for col in feature_cols if col in data.columns]
missing_cols = [col for col in feature_cols if col not in data.columns]
print(f"Skipped columns not found in the CSV: {missing_cols}")
X = data[available_cols]

# Create the target Series (y), encoded as 1 for 'Yes' and 0 otherwise
y = (data[target_col] == 'Yes').astype(int)

print("Features (X) and Target (y) defined.")
print(f"Shape of X: {X.shape}")
print(f"Shape of y: {y.shape}")
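
The preview above shows several text-valued columns (for example `Gender` and `Smoking_Status`), which `StandardScaler` cannot process directly. The following is a minimal sketch of one way to make such columns numeric, assuming one-hot encoding with `pd.get_dummies` is acceptable for this project. The small `sample` frame and the helper `encode_features` are hypothetical and only mirror a few values from the preview.

```python
import pandas as pd

# Hypothetical mini-frame mirroring a few columns of the preview above.
sample = pd.DataFrame({
    "Age": [48, 18, 21, 25, 78],
    "BMI": [35.5, 28.7, 30.0, 25.6, 38.8],
    "Gender": ["Male", "Other", "Other", "Female", "Male"],
    "Smoking_Status": ["Never", "Current", "Current", "Former", "Current"],
})


def encode_features(frame: pd.DataFrame) -> pd.DataFrame:
    """One-hot encode every non-numeric column; numeric columns pass through."""
    text_cols = frame.select_dtypes(exclude="number").columns.tolist()
    return pd.get_dummies(frame, columns=text_cols, dtype=int)


encoded = encode_features(sample)
print(encoded.columns.tolist())
```

After encoding, every column is numeric, so the frame could be passed to the scaler and the model. Whether one-hot encoding is the right choice for each column (as opposed to, say, an ordinal mapping for `Physical_Activity`) is a modelling decision this sketch does not settle.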

### Preprocess Data: Feature Scaling

In [None]:
# Initialize the StandardScaler
scaler = StandardScaler()

# Fit the scaler to the features (X) and transform X
# fit_transform() does both steps in one go
X_scaled = scaler.fit_transform(X)

print("Features scaled using StandardScaler.")

# Optionally, convert X_scaled back to a DataFrame to inspect it
# X_scaled_df = pd.DataFrame(X_scaled, columns=X.columns)
# print("\nFirst 5 rows of scaled features:")
# print(X_scaled_df.head())

print("\nScaler is fitted and ready to be saved.")

### Train the Random Forest Model

In [None]:
# Initialize the Random Forest Classifier model
# n_estimators is the number of trees in the forest
model = RandomForestClassifier(n_estimators=100, random_state=42) # random_state for reproducibility

# Train the model
model.fit(X_scaled, y)

print("Random Forest Classifier model trained successfully!")

### Save the Model and Scaler

In [None]:
# Define filenames for the saved files
model_filename = 'diabetes_rf_model.pkl' # Updated filename
scaler_filename = 'scaler.pkl'

# Save the trained model
with open(model_filename, 'wb') as model_file:
    pickle.dump(model, model_file)
print(f"Model saved successfully as '{model_filename}'")

# Save the fitted scaler
with open(scaler_filename, 'wb') as scaler_file:
    pickle.dump(scaler, scaler_file)
print(f"Scaler saved successfully as '{scaler_filename}'")
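
The saved files are only useful if they can be loaded again, and new records must pass through the same scaler before prediction. The sketch below shows that round trip under stated assumptions: it trains on synthetic stand-in data (not the notebook's dataset), writes to a temporary directory, and uses hypothetical names such as `model_demo` and `new_records`.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the notebook's dataset).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(60, 3))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

scaler_demo = StandardScaler().fit(X_demo)
model_demo = RandomForestClassifier(n_estimators=10, random_state=42)
model_demo.fit(scaler_demo.transform(X_demo), y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "diabetes_rf_model.pkl")
    scaler_path = os.path.join(tmp_dir, "scaler.pkl")

    # Save both artifacts, mirroring the cell above.
    with open(model_path, "wb") as f:
        pickle.dump(model_demo, f)
    with open(scaler_path, "wb") as f:
        pickle.dump(scaler_demo, f)

    # Reload them, as a separate inference script would.
    with open(model_path, "rb") as f:
        loaded_model = pickle.load(f)
    with open(scaler_path, "rb") as f:
        loaded_scaler = pickle.load(f)

# New records are scaled with the saved scaler before predicting.
new_records = rng.normal(size=(5, 3))
reloaded_preds = loaded_model.predict(loaded_scaler.transform(new_records))
original_preds = model_demo.predict(scaler_demo.transform(new_records))
print(reloaded_preds)
```

Pickle files should only be loaded from trusted sources, and they are generally expected to be read back with a compatible scikit-learn version.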

### Make Predictions with the Model

In [None]:
# Make predictions with the trained model (on the same data it was trained on)
predictions = model.predict(X_scaled)

# Display the first 10 predictions
print("First 10 predictions:", predictions[:10])

### Calculate the F1 Score

In [None]:
# Import the f1_score function
from sklearn.metrics import f1_score

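# Calculate the F1 score
# Note: y and predictions both come from the training data, so this score is an optimistic estimate
f1 = f1_score(y, predictions)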

# Display the F1 score
print(f'F1 Score: {f1:.2f}')
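
Because the model above is scored on the rows it was trained on, the F1 value says little about performance on unseen patients. A minimal sketch of a held-out evaluation follows, using the `train_test_split` function already imported at the top of the notebook. The data here is synthetic (`make_classification`), and the 80/20 split and the `_demo` names are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the notebook's dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=42)

# Hold out 20% of the rows, keeping the class balance similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Fit the scaler on the training rows only, then reuse it for the test rows.
scaler_demo = StandardScaler().fit(X_train)
model_demo = RandomForestClassifier(n_estimators=100, random_state=42)
model_demo.fit(scaler_demo.transform(X_train), y_train)

train_f1 = f1_score(y_train, model_demo.predict(scaler_demo.transform(X_train)))
test_f1 = f1_score(y_test, model_demo.predict(scaler_demo.transform(X_test)))
print(f"Train F1: {train_f1:.2f}  Test F1: {test_f1:.2f}")
```

Fitting the scaler on the training portion alone keeps information from the test rows out of preprocessing, so the test score better reflects how the pipeline would behave on new data.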

# Task
Update the notebook "Time Series Predictive Modeling Diabetes Progression and Health Risk Stratification Using Electronic Health Records" to work with the provided dataset "EHR_data.csv".

## Load and explore data

### Subtask:
Load the EHR data and perform initial exploration to understand its structure, features, and potential issues.


**Reasoning**:
Load the EHR data from the CSV file, display the first few rows, show the column names and their data types, and calculate basic descriptive statistics and missing values.
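
The exploration steps listed above can be sketched as follows. Since the EHR file itself is not available here, the example uses a small hypothetical `sample` frame whose values are borrowed from the dataset preview earlier in this notebook; with real data, `sample` would be replaced by the DataFrame returned from `pd.read_csv`.

```python
import pandas as pd

# Hypothetical stand-in frame; values mirror a few columns of the earlier preview.
sample = pd.DataFrame({
    "Age": [48, 18, 21, 25, 78],
    "BMI": [35.5, 28.7, 30.0, 25.6, 38.8],
    "Alcohol_Intake": [None, "Moderate", "Moderate", "Moderate", "High"],
    "Diabetes_Status": ["Yes", "No", "Yes", "No", "Yes"],
})

# First few rows.
print(sample.head())

# Column names and their data types.
print(sample.dtypes)

# Basic descriptive statistics for the numeric columns.
print(sample.describe())

# Number of missing values per column.
missing_counts = sample.isna().sum()
print(missing_counts)
```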



# Task
Update the Time Series Predictive Modeling Diabetes Progression and Health Risk Stratification Using Electronic Health Records model to use the "diabetes_prediction_india.csv" dataset.

## Load the new dataset

### Subtask:
Load the `diabetes_prediction_india.csv` file into a pandas DataFrame.


**Reasoning**:
Load the dataset from 'diabetes_prediction_india.csv' into a pandas DataFrame and display the first few rows.
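
A minimal loading sketch is shown below. It assumes the `Unnamed: 0` column seen in the earlier preview is a saved row index, which `index_col=0` absorbs, and it checks that the file exists before reading. The helper `load_dataset` and the inline `csv_text` are hypothetical; the inline text stands in for the real file so the example runs on its own.

```python
import io
import os

import pandas as pd


def load_dataset(source):
    """Read a CSV from a path or file-like object, using the first column as the index."""
    if isinstance(source, str) and not os.path.exists(source):
        raise FileNotFoundError(f"Dataset not found: {source}")
    return pd.read_csv(source, index_col=0)


# Stand-in CSV text with an unnamed index column, mirroring the earlier preview.
csv_text = """,Age,Gender,BMI,Diabetes_Status
0,48,Male,35.5,Yes
1,18,Other,28.7,No
2,21,Other,30.0,Yes
"""

demo = load_dataset(io.StringIO(csv_text))
print(demo.head())
```

With the real file, the call would take the CSV path instead, for example `load_dataset('diabetes_prediction_india.csv')`, provided the file is in the working directory.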

