### Purpose
This notebook applies the XGBoost concepts taught in the Intermediate Machine Learning course on Kaggle Learn. XGBoost is used to predict diabetes in the dataset, and the final section compares the results with a random forest model.

Credit for some of the code goes to [BABATUNDEADEKUNLE's Intro XGboost Classification](https://www.kaggle.com/code/babatee/intro-xgboost-classification).

# Loading Data and Relevant Libraries
The libraries below are not a comprehensive list and others will be loaded as we progress.

In [46]:
import pandas as pd  # needed for pd.read_csv below
from sklearn.model_selection import train_test_split
import xgboost as xgb

# Data Inspection and Preparation

In [47]:
diabetes_df = pd.read_csv('C:/Users/DELL/Desktop/aiml main/diabetes_prediction_dataset.csv')
diabetes_df.head() #checking the structure of the data

Unnamed: 0,gender,age,hypertension,heart_disease,smoking_history,bmi,HbA1c_level,blood_glucose_level,diabetes
0,Female,80.0,0,1,never,25.19,6.6,140,0
1,Female,54.0,0,0,No Info,27.32,6.6,80,0
2,Male,28.0,0,0,never,27.32,5.7,158,0
3,Female,36.0,0,0,current,23.45,5.0,155,0
4,Male,76.0,1,1,current,20.14,4.8,155,0


In [48]:
diabetes_df.dtypes #checking for categorical data types

gender                  object
age                    float64
hypertension             int64
heart_disease            int64
smoking_history         object
bmi                    float64
HbA1c_level            float64
blood_glucose_level      int64
diabetes                 int64
dtype: object

### Encoding Categorical Columns
I encoded the categorical columns as integers, the same way one would for a random forest model. XGBoost has optional support for categorical features, but I was unable to get the model to run after enabling that setting in the parameters.

In [49]:
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()

columns_to_encode = ['gender', 'smoking_history']

for column in columns_to_encode:
    diabetes_df[column] = encoder.fit_transform(diabetes_df[column])
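
One caveat with the loop above: the same LabelEncoder instance is refit on every column, so afterwards it only remembers the classes of the last column (smoking_history), and scikit-learn documents LabelEncoder as intended for target labels rather than features. The sketch below is one possible alternative, not the method used in this notebook. It uses pandas' category dtype on a small made-up frame (the rows are illustrative, not taken from the dataset) and keeps the code-to-label mapping for each column. The category dtype is also what XGBoost's categorical setting expects as input, which may be why enabling it on object columns failed; that is an assumption, since the original error is not shown.

```python
import pandas as pd

# Illustrative rows only; the values mimic the two categorical columns.
toy_df = pd.DataFrame({
    "gender": ["Female", "Male", "Female", "Other"],
    "smoking_history": ["never", "No Info", "current", "never"],
})

mappings = {}  # hypothetical helper dict: column name -> {code: label}
for column in ["gender", "smoking_history"]:
    as_category = toy_df[column].astype("category")
    # Remember which integer code stands for which original label.
    mappings[column] = dict(enumerate(as_category.cat.categories))
    # Replace the strings with their integer codes (sorted label order).
    toy_df[column] = as_category.cat.codes

print(mappings["gender"])
print(mappings["smoking_history"])
print(toy_df)
```

Codes are assigned in sorted label order, as LabelEncoder does. The toy frame contains only three smoking_history labels, so its codes differ from those in the real dataset.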

In [50]:
diabetes_df.head() #checking that columns have been encoded

Unnamed: 0,gender,age,hypertension,heart_disease,smoking_history,bmi,HbA1c_level,blood_glucose_level,diabetes
0,0,80.0,0,1,4,25.19,6.6,140,0
1,0,54.0,0,0,0,27.32,6.6,80,0
2,1,28.0,0,0,4,27.32,5.7,158,0
3,0,36.0,0,0,1,23.45,5.0,155,0
4,1,76.0,1,1,1,20.14,4.8,155,0


# Model Construction and Training

In [51]:
diabetes_df_complete = diabetes_df.copy()
X = diabetes_df_complete.drop('diabetes', axis=1) #dropping the outcome variable from features (X)
y = diabetes_df_complete.diabetes #redefining the outcome variable as y

In [52]:
X.head() #checking that features are correct

Unnamed: 0,gender,age,hypertension,heart_disease,smoking_history,bmi,HbA1c_level,blood_glucose_level
0,0,80.0,0,1,4,25.19,6.6,140
1,0,54.0,0,0,0,27.32,6.6,80
2,1,28.0,0,0,4,27.32,5.7,158
3,0,36.0,0,0,1,23.45,5.0,155
4,1,76.0,1,1,1,20.14,4.8,155


In [53]:
# Separating the dataset into training and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)
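
Only about 8.5% of the validation rows are positive (1,708 of 20,000 in the reports later in this notebook), so a purely random split can leave the training and validation sets with slightly different class balance. As an optional variation that this notebook does not use, train_test_split accepts a stratify argument that keeps the class proportions equal in both sets. The sketch below shows this on synthetic stand-in data; X_toy and y_toy are made-up names and values, not the real dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1,000 rows with 8.5% positives (not the real dataset).
y_toy = np.array([1] * 85 + [0] * 915)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratify, both splits keep the same share of positives.
print(f"train positives: {y_tr.mean():.3f}, validation positives: {y_va.mean():.3f}")
```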

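In this notebook I constructed two XGBoost models.

model_1 - Simple model with no parameter tuning.

model_2 - Uses the basic parameter tuning covered in the Intermediate Machine Learning course:
* n_estimators - The number of boosting rounds (trees). I assumed a default of 500 and doubled it to 1000, hoping to increase the accuracy of the model; note that XGBClassifier itself defaults to 100. Because early stopping is used, training can halt well before 1000 rounds.
* learning_rate - I took the default value to be 0.1 (recent XGBoost releases default to 0.3). A smaller value shrinks each tree's contribution, which often improves accuracy when paired with more trees, but takes longer to run.
* n_jobs - The number of parallel threads. It decreases runtime on large datasets but shouldn't affect model accuracy.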
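
To illustrate why a lower learning_rate is usually paired with a higher n_estimators, here is a toy gradient-boosting loop written in plain NumPy. It is a simplified sketch and not XGBoost's algorithm: it boosts one-split stumps on squared error for a made-up regression problem, and the names fit_stump and boost are hypothetical helpers. Each round fits a stump to the current residuals and adds only a learning_rate-sized fraction of it to the prediction.

```python
import numpy as np

def fit_stump(x, residual):
    """Return (threshold, left_value, right_value) minimising squared error."""
    best = None
    for threshold in np.unique(x)[:-1]:  # skip the max so the right side is non-empty
        left = residual[x <= threshold]
        right = residual[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    return best[1:]

def boost(x, y, n_estimators, learning_rate):
    """Boost stumps on squared error; return the training MSE after each round."""
    prediction = np.full_like(y, y.mean())
    history = []
    for _ in range(n_estimators):
        threshold, left_value, right_value = fit_stump(x, y - prediction)
        update = np.where(x <= threshold, left_value, right_value)
        # Shrinkage: only a fraction of each new stump is added.
        prediction = prediction + learning_rate * update
        history.append(float(np.mean((y - prediction) ** 2)))
    return history

# Made-up data: a noisy sine curve.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = np.sin(x) + rng.normal(scale=0.1, size=200)

fast = boost(x, y, n_estimators=20, learning_rate=0.5)
slow = boost(x, y, n_estimators=20, learning_rate=0.05)
print(f"training MSE after 20 rounds: lr=0.5 -> {fast[-1]:.3f}, lr=0.05 -> {slow[-1]:.3f}")
```

With the same number of rounds, the smaller learning rate leaves a higher training error, which is why it needs more trees to catch up. Whether the slower model generalizes better is a separate question that only held-out data can answer.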

In [54]:
#Simple XGBoost model
diabetes_model_1 = xgb.XGBClassifier()
diabetes_model_1_train = diabetes_model_1.fit(X_train, y_train)

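#XGBoost model with basic parameter tuning
#early_stopping_rounds is set on the constructor, as required by XGBoost 2.0+
#(releases before 1.6 accepted it only as a fit() argument)
diabetes_model_2 = xgb.XGBClassifier(n_estimators=1000, learning_rate=0.05, n_jobs=4,
                                     early_stopping_rounds=5)
diabetes_model_2_train = diabetes_model_2.fit(X_train, y_train,
             eval_set=[(X_valid, y_valid)],
             verbose=False)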



# Classification Reports and Accuracy Scores

In [55]:
from sklearn.metrics import classification_report

diabetes_pred1 = diabetes_model_1_train.predict(X_valid)

diabetes_pred2 = diabetes_model_2_train.predict(X_valid)

# %s (not %r) so that the newlines in each report are rendered as a table
print('model_1 XGBoost Report\n%s' % classification_report(y_valid, diabetes_pred1))
print('model_2 XGBoost Report\n%s' % classification_report(y_valid, diabetes_pred2))

model_1 XGBoost Report
              precision    recall  f1-score   support

           0       0.97      1.00      0.98     18292
           1       0.96      0.70      0.81      1708

    accuracy                           0.97     20000
   macro avg       0.97      0.85      0.90     20000
weighted avg       0.97      0.97      0.97     20000

model_2 XGBoost Report
              precision    recall  f1-score   support

           0       0.97      1.00      0.99     18292
           1       0.99      0.69      0.81      1708

    accuracy                           0.97     20000
   macro avg       0.98      0.84      0.90     20000
weighted avg       0.97      0.97      0.97     20000


In [56]:
from sklearn.metrics import accuracy_score

print("Accuracy for model 1: %.2f" % (accuracy_score(y_valid, diabetes_pred1) * 100))
print("Accuracy for model 2: %.2f" % (accuracy_score(y_valid, diabetes_pred2) * 100))

Accuracy for model 1: 97.14
Accuracy for model 2: 97.25


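From the classification reports, both models achieved high accuracy on the validation set (about 97%). Parameter tuning improved accuracy only slightly (97.14% vs. 97.25%), probably because my adjustments were conservative. Accuracy also hides a weaker result on the minority class: recall for class 1 (diabetes) is 0.70 for model_1 and 0.69 for model_2, so roughly three in ten diabetic cases in the validation set are missed. Note too that model_2 used the validation set for early stopping, so its score on that same set is slightly optimistic.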
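
To put the 97% figure in context, the arithmetic below uses only the support counts printed in the classification reports. It is a back-of-the-envelope sketch: the recall value is the rounded 0.70 from the model_1 report, so the missed-case count is approximate.

```python
# Support counts copied from the classification reports above.
n_negative = 18292
n_positive = 1708
n_total = n_negative + n_positive

# Accuracy of a trivial model that always predicts class 0 (no diabetes).
baseline_accuracy = n_negative / n_total * 100

# Recall is rounded to two decimals in the report, so this is an approximation.
recall_model_1 = 0.70
approx_missed_cases = round(n_positive * (1 - recall_model_1))

print(f"Majority-class baseline accuracy: {baseline_accuracy:.2f}%")
print(f"Approximate diabetic cases missed by model 1: {approx_missed_cases}")
```

A model that never predicts diabetes would already score about 91.46% on this validation set, so the gain from the fitted models is closer to six percentage points than it first appears.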

### Random Forest for Comparison

In [57]:
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier()
rfc_model = rfc.fit(X_train, y_train)
pred_rf = rfc_model.predict(X_valid)
print("Accuracy for Random Forest Model: %.2f" % (accuracy_score(y_valid, pred_rf) * 100))

Accuracy for Random Forest Model: 97.02


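For comparison, I fit a random forest model with default settings and compared its accuracy with the two XGBoost models. The random forest reached a comparable accuracy (97.02%), only marginally below XGBoost (97.14% and 97.25%). Because it was fit without a fixed random_state, its score will vary slightly between runs, so a gap this small should be read with caution.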

### Conclusions
* XGBoost and random forest models both achieved high accuracy on the validation set (about 97%).
* The difference between the XGBoost models (with and without parameter tuning) and the random forest was small, with XGBoost producing slightly more accurate predictions.
* Further exploration with more aggressive parameter tuning may affect XGBoost model performance more significantly.