## ML Revolution in Healthcare: The Diabetes Risk Prediction Challenge Using BRFSS Survey Data
 
**Problem Statement:**
This project addresses the prevalence of diabetes in the United States by predicting distinct risk categories from the Behavioral Risk Factor Surveillance System (BRFSS) 2015 dataset. The dataset, comprising 50,000 survey responses, is challenging because its classes are imbalanced. The objective is to develop a predictive model that classifies individuals into three categories: 0 for no diabetes/only during pregnancy, 1 for prediabetes, and 2 for diabetes. The emphasis is on mitigating class imbalance so that the model performs well on all three categories rather than only on the majority class.

**About the Dataset:**
The dataset used in this project is sourced from the CDC's Behavioral Risk Factor Surveillance System (BRFSS) 2015, obtained from the UC Irvine Machine Learning Repository. It comprises 21 feature variables and a target variable (Diabetes_012) categorizing respondents into stages: 0 for no diabetes/only during pregnancy, 1 for prediabetes, and 2 for diabetes.

### Initial Exploration:
In the initial exploration, the dataset is read and its shape is checked to understand its structure and dimensions.

In [None]:
import pandas as pd
from sklearn.model_selection import RandomizedSearchCV
from imblearn.ensemble import BalancedBaggingClassifier
from sklearn.preprocessing import StandardScaler, LabelEncoder, OrdinalEncoder, MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier
from sklearn.metrics import make_scorer, f1_score

In [None]:

dataset = pd.read_csv('./train_dataset.csv')
pd.set_option('display.max_columns', 500)
dataset.head()

# Load the holdout data that the final model will generate predictions for
X_test_df = pd.read_csv('./holdout.csv')
X_test_df.head()


### Data Cleaning:
Data preprocessing involves cleaning the dataset by handling missing values, removing duplicates, and addressing any inconsistencies or errors. This ensures the data is accurate and reliable for analysis.

In [None]:
df_selected = dataset.dropna()
# df_selected = dataset
df_selected.shape
df_selected.groupby(['Diabetes_012']).size()
df_selected.head()

column_name = 'Diabetes_012'
if df_selected[column_name].isna().any():
    print(f"The column '{column_name}' contains NaN values.")
else:
    print(f"The column '{column_name}' does not contain NaN values.")
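The cell above only drops rows with missing values. As a minimal sketch of the other cleaning checks mentioned in this section, the snippet below counts missing cells and duplicate rows on a small made-up frame (`toy` and its values are hypothetical, not BRFSS data). Whether duplicate survey rows should really be dropped is a judgment call, since two respondents can legitimately give identical answers.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the training data (values are made up)
toy = pd.DataFrame({
    "BMI": [22.0, 31.0, 31.0, np.nan, 27.0],
    "HighBP": [0.0, 1.0, 1.0, 1.0, 0.0],
    "Diabetes_012": [0.0, 2.0, 2.0, 1.0, 0.0],
})

n_missing = int(toy.isna().sum().sum())     # total number of NaN cells
n_duplicates = int(toy.duplicated().sum())  # rows identical to an earlier row

# Drop incomplete rows first, then exact duplicates
cleaned = toy.dropna().drop_duplicates().reset_index(drop=True)

# Class distribution after cleaning, ordered by label
class_counts = cleaned["Diabetes_012"].value_counts().sort_index()

print(n_missing, n_duplicates, cleaned.shape)
print(class_counts)
```

Checking the class distribution right after cleaning makes it easy to see whether the cleaning step changed the imbalance between the three categories.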


### Transforming Data:
This step includes transforming variables or features through methods like scaling, normalization, or encoding categorical variables. Transformation ensures that the data is on a consistent scale and format, preventing certain features from dominating the analysis due to their scale.



In [None]:
y_train = df_selected['Diabetes_012']
continuous_transformer = Pipeline(steps=[
    ('scaler', MinMaxScaler())  # or StandardScaler()
])

categorical_transformer = Pipeline(steps=[
    ('ordinal_encoder', OrdinalEncoder())
])
continuous_vars = [ 'BMI', 'PhysHlth', 'GenHlth', 'Age', 'MentHlth', 'Education', 'Income']
categorical_vars = [
    'Sex', 'DiffWalk', 'HighBP', 'HighChol', 'CholCheck', 'Smoker', 
    'Stroke' , 'HeartDiseaseorAttack', 'PhysActivity', 'Fruits', 'Veggies', 
    'HvyAlcoholConsump', 'AnyHealthcare', 'NoDocbcCost'
]

preprocessor = ColumnTransformer(
    transformers=[
        ('cont', continuous_transformer, continuous_vars),
        ('cat', categorical_transformer, categorical_vars)]
)
columns = continuous_vars + categorical_vars 


df_tmp = df_selected[columns]
# Apply transformations to the DataFrame
preprocessor.fit(df_tmp)


In [None]:
df_tmp = df_selected[columns]
df_transformed = preprocessor.transform(df_tmp)
X_train = pd.DataFrame(df_transformed, columns=columns)

# test Data
df_test_tmp = X_test_df[columns]
df_test_transformed = preprocessor.transform(df_test_tmp)
X_test = pd.DataFrame(df_test_transformed, columns=columns)
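
The preprocessor is fitted on the training data only and then reused for the holdout data. The sketch below illustrates what that implies on a tiny made-up example (`train`, `holdout`, and `toy_preprocessor` are hypothetical names): holdout values outside the training range are scaled outside [0, 1], because `MinMaxScaler` remembers the training minimum and maximum.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder

# Made-up data: one continuous and one binary column
train = pd.DataFrame({"BMI": [20.0, 30.0, 40.0], "Sex": [0.0, 1.0, 0.0]})
holdout = pd.DataFrame({"BMI": [25.0, 50.0], "Sex": [1.0, 0.0]})

toy_preprocessor = ColumnTransformer(transformers=[
    ("cont", MinMaxScaler(), ["BMI"]),
    ("cat", OrdinalEncoder(), ["Sex"]),
])

# Learn the scaling range and category order from the training data only
toy_preprocessor.fit(train)

train_scaled = toy_preprocessor.transform(train)
holdout_scaled = toy_preprocessor.transform(holdout)

print(train_scaled)
print(holdout_scaled)  # a BMI of 50 lies above the training maximum of 40
```

Fitting on the training data alone avoids leaking information from the holdout set into the transformation, at the cost of occasionally producing scaled values outside the nominal range.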

### Model Selection:
Choose an appropriate ML model based on the nature of the problem (classification) and the characteristics of the data. Common models include Logistic Regression, Decision Tree, Random Forest, KNN, XGBoost, and AdaBoost.


In [None]:
# Compare several candidate classifiers using stratified cross-validation and macro-averaged F1

# Create a DataFrame to store results
results = pd.DataFrame(columns=['Model', 'Mean F1 Score', 'Std Deviation'])

# List of models to evaluate
models = [
    ('LogisticRegression', LogisticRegression(multi_class='multinomial', solver='lbfgs')),
    ('Decision Tree', DecisionTreeClassifier()),
    ('Random Forest', RandomForestClassifier()),
    ('AdaBoost', AdaBoostClassifier()),
    ('KNN', KNeighborsClassifier()),
    ('XGBoost', XGBClassifier())
]

# 5-fold cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Evaluate each model and store results based on F1 score
# Macro averaging weights the three classes equally, which suits the imbalanced target
f1_scorer = make_scorer(f1_score, average='macro')
for model_name, model in models:
    # X_train and y_train are the transformed features and labels built above;
    # no resampled (e.g. SMOTE) versions are created in this notebook
    scores = cross_val_score(model, X_train, y_train, scoring=f1_scorer, cv=cv, n_jobs=-1)
    mean_f1_score = scores.mean()
    std_deviation = scores.std()
    model_results = pd.DataFrame({'Model': [model_name], 'Mean F1 Score': [mean_f1_score], 'Std Deviation': [std_deviation]})
    results = pd.concat([results, model_results], ignore_index=True)

# Display results
print(results)
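
To see why macro-averaged F1 is a reasonable scoring choice for an imbalanced target, here is a small worked example on made-up labels (`y_true` and `y_pred` are hypothetical). A model that always predicts the majority class looks acceptable by accuracy but poor by macro F1.

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up labels: 8 of class 0, 1 of class 1, 1 of class 2
y_true = [0] * 8 + [1] + [2]
# A degenerate model that always predicts the majority class
y_pred = [0] * 10

accuracy = accuracy_score(y_true, y_pred)

# zero_division=0 scores classes that are never predicted as 0 without a warning
per_class_f1 = f1_score(y_true, y_pred, average=None, zero_division=0)
macro_f1 = f1_score(y_true, y_pred, average='macro', zero_division=0)

print(f"accuracy = {accuracy:.3f}")
print(f"per-class F1 = {per_class_f1}")
print(f"macro F1 = {macro_f1:.3f}")
```

Class 0 has precision 0.8 and recall 1.0, giving an F1 of about 0.889, while the other two classes score 0. The macro average is therefore about 0.296, even though accuracy is 0.8.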

### Model Training:
Train the selected model on the training data. This involves feeding the model the input data and allowing it to learn the patterns within the data.

### Hyperparameter Tuning:
Fine-tune the model's hyperparameters to optimize its performance. This may involve using techniques like grid search or random search.


In [None]:
# Class counts in the cleaned training data:
# Diabetes_012
# 0.0    32599
# 1.0     4427
# 2.0    11872
# Weight each class by (total rows / class count) so that rarer classes count more
class_weights = {0.0: 48898 / 32599, 1.0: 48898 / 4427, 2.0: 48898 / 11872}
rf_model = RandomForestClassifier(random_state=42, class_weight=class_weights)

# Define the hyperparameter distributions to sample from
param_dist = {
    'n_estimators': [50, 100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10, 15],
    'min_samples_leaf': [1, 2, 4, 6],
    'max_features': ['sqrt', 'log2'],  # 'auto' (an alias of 'sqrt') was removed in scikit-learn 1.3
    'bootstrap': [True, False]
}

# Perform RandomizedSearchCV for hyperparameter tuning
# Note: accuracy favors the majority class; 'f1_macro' would match the model-selection metric
random_search = RandomizedSearchCV(estimator=rf_model, param_distributions=param_dist, n_iter=10,
                                   scoring='accuracy', cv=5, n_jobs=-1, random_state=42)
random_search.fit(X_train, y_train)

# Get the best model from the random search.
# With the default refit=True it has already been refit on the entire training set.
best_rf_model = random_search.best_estimator_
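
The hand-written class weights in the cell above follow the pattern total rows divided by class count. As a hedged sketch, the snippet below reproduces them from the class counts and compares them with scikit-learn's `compute_class_weight(class_weight="balanced", ...)`, which divides by the number of classes as well. The variable names (`counts`, `manual`, `balanced`) are illustrative only.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Class counts taken from the tuning cell
counts = {0.0: 32599, 1.0: 4427, 2.0: 11872}
total = sum(counts.values())

# Hand-written scheme: total / class count
manual = {label: total / n for label, n in counts.items()}

# Rebuild a label vector with the same class counts
classes = np.array(list(counts.keys()))
labels = np.repeat(classes, list(counts.values()))

# "balanced" scheme: total / (n_classes * class count)
balanced = compute_class_weight(class_weight="balanced", classes=classes, y=labels)

for label, w_manual, w_balanced in zip(classes, manual.values(), balanced):
    print(f"class {label}: manual = {w_manual:.3f}, balanced = {w_balanced:.3f}")
```

The two schemes differ only by a constant factor of 3 (the number of classes), so the relative weighting between classes is the same. Passing `class_weight='balanced'` to `RandomForestClassifier` is a possible alternative that avoids hard-coding the counts.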


### Predicting:
Once the machine learning model is trained and validated, apply it to the holdout data, that is, previously unseen data reserved for testing purposes. This step involves using the trained model to generate predictions for the holdout dataset, providing insight into how well the model generalizes to new, independent observations.

### Post-Prediction Processing:
Convert the predicted values to a DataFrame and export them to a CSV file.


In [None]:

predictions = best_rf_model.predict(X_test)

# Alternative models from earlier experiments (not defined in this notebook):
# predictions = svm_classifier.predict(X_test)
# predictions = best_xgb.predict(X_test)
# predictions = classifier.predict(X_test)

predictions_df = pd.DataFrame(predictions, columns=['predictions'])
predictions_df.to_csv('predictions.csv', index=False)
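
Before submitting the exported file, it can help to confirm that it round-trips cleanly and contains only the three valid labels. The sketch below does this with an in-memory buffer and made-up predictions (`fake_predictions` is a hypothetical stand-in for the model output). The cast to `int` is an assumption about the expected output format, not something the notebook requires.

```python
import io

import numpy as np
import pandas as pd

# Stand-in for the array returned by model.predict (values are made up)
fake_predictions = np.array([0.0, 2.0, 1.0, 0.0])

# Build the output frame; the integer cast is optional and assumed here
out = pd.DataFrame(fake_predictions, columns=["predictions"]).astype(int)

# Write to an in-memory buffer instead of a file, then read it back
buffer = io.StringIO()
out.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Every exported value should be one of the three class labels
valid = bool(reloaded["predictions"].isin([0, 1, 2]).all())

print(reloaded)
print("all labels valid:", valid)
```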
