### Predicting diabetes
In the two previous projects we used a decision tree and then a random forest to predict diabetes. The next step is to see whether those results can be improved further. Could boosting be a better alternative?

Boosting is a sequential ensemble of models (usually decision trees) in which each new model aims to correct the errors of the previous ones. This approach may suit this dataset, since it meets several of the assumptions studied in the module.

In this project you will put this idea into practice by training a boosting model on the dataset to improve accuracy.
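
The sequential, error-correcting idea behind boosting can be illustrated with a minimal sketch. This is not XGBoost's actual algorithm: it assumes a regression task with squared error, synthetic data, and scikit-learn's `DecisionTreeRegressor` as the weak learner. The names `learning_rate`, `n_rounds`, and `trees` are chosen for illustration only.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data for illustration only (not the diabetes dataset)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1  # shrinks the contribution of each tree
n_rounds = 50        # number of sequential trees

# Start from a constant prediction: the mean of the target
prediction = np.full_like(y, y.mean())
initial_mse = np.mean((y - prediction) ** 2)

trees = []
for _ in range(n_rounds):
    # Each new tree is fitted to the errors left by the current ensemble
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

final_mse = np.mean((y - prediction) ** 2)
print(f"MSE before boosting: {initial_mse:.4f}")
print(f"MSE after {n_rounds} rounds: {final_mse:.4f}")
```

Each round fits a shallow tree to what the ensemble still gets wrong, so the training error shrinks step by step. Libraries such as XGBoost build on the same principle with additional regularization and a more general loss.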

In [4]:
import pandas as pd

# Load the preprocessed train and test splits
train_data = pd.read_csv("/workspaces/Boosting-Algorithms/data/processed/diabetes_train.csv")
test_data = pd.read_csv("/workspaces/Boosting-Algorithms/data/processed/diabetes_test.csv")
train_data.head()

| Unnamed: 0 | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.058824 | 0.419355 | 0.142857 | 0.012195 | 0.123188 | 0.100204 | 0.140478 | 0.083333 | 0.0 |
| 1 | 0.0 | 0.36129 | 0.469388 | 0.109756 | 0.038647 | 0.257669 | 0.221605 | 0.0 | 0.0 |
| 2 | 0.0 | 0.877419 | 0.428571 | 0.268293 | 0.074879 | 0.486708 | 0.774979 | 0.066667 | 1.0 |
| 3 | 0.235294 | 0.548387 | 0.632653 | 0.036585 | 0.304348 | 0.345603 | 0.065329 | 0.033333 | 0.0 |
| 4 | 0.117647 | 0.258065 | 0.265306 | 0.073171 | 0.070048 | 0.249489 | 0.380017 | 0.0 | 0.0 |


In [5]:
# Separate the features from the target column ("Outcome")
X_train = train_data.drop(["Outcome"], axis=1)
y_train = train_data["Outcome"]
X_test = test_data.drop(["Outcome"], axis=1)
y_test = test_data["Outcome"]

In [12]:
from xgboost import XGBClassifier

# Baseline: XGBoost classifier with default hyperparameters
model = XGBClassifier(random_state=42)
model.fit(X_train, y_train)


In [13]:
y_pred = model.predict(X_test)
y_pred

array([1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0,
       0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1,
       1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1])

In [14]:
from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_pred)

0.7814569536423841
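
Accuracy is the fraction of predictions that match the true labels. A minimal sketch of that calculation follows, using small hypothetical label arrays (`y_true_toy` and `y_pred_toy` are made up for illustration and are not the notebook's test set). It also shows a confusion matrix, which breaks the same predictions down into the kinds of hits and misses that a single accuracy figure hides.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels for illustration only (not the notebook's test set)
y_true_toy = np.array([0, 1, 1, 0, 1, 0, 0, 1])
y_pred_toy = np.array([0, 1, 0, 0, 1, 1, 0, 1])

# Accuracy by hand: share of positions where prediction equals truth
manual_accuracy = np.mean(y_true_toy == y_pred_toy)
sklearn_accuracy = accuracy_score(y_true_toy, y_pred_toy)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_toy, y_pred_toy)
tn, fp, fn, tp = cm.ravel()

print("Manual accuracy: ", manual_accuracy)
print("sklearn accuracy:", sklearn_accuracy)
print("TN, FP, FN, TP:  ", tn, fp, fn, tp)
```

For a medical prediction task such as this one, the false negatives (`fn`) may matter more than the overall accuracy, so inspecting the matrix alongside the score can be worthwhile.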

In [21]:
from sklearn.model_selection import RandomizedSearchCV
import scipy.stats as stats
import xgboost as xgb

# Define the hyperparameter distributions
# Note: stats.uniform(loc, scale) samples from [loc, loc + scale],
# and stats.randint(low, high) excludes the upper bound
param_dist = {
    'max_depth': stats.randint(3, 10),          # integers 3 to 9
    'learning_rate': stats.uniform(0.01, 0.1),  # range [0.01, 0.11]
    'subsample': stats.uniform(0.5, 0.5),       # range [0.5, 1.0]
    'n_estimators': stats.randint(50, 200),     # integers 50 to 199
}

# Create the XGBoost model object
xgb_model = xgb.XGBClassifier()

# Create the RandomizedSearchCV object: 10 sampled candidates, 5-fold CV
random_search = RandomizedSearchCV(
    xgb_model,
    param_distributions=param_dist,
    n_iter=10,
    cv=5,
    scoring='accuracy',
    random_state=42,
)

# Fit the RandomizedSearchCV object to the training data
random_search.fit(X_train, y_train)

# Print the best set of hyperparameters
print("Best set of hyperparameters: ", random_search.best_params_)

Best set of hyperparameters:  {'learning_rate': np.float64(0.07118528947223794), 'max_depth': 4, 'n_estimators': 64, 'subsample': np.float64(0.728034992108518)}
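
The ranges that the search actually explores depend on how SciPy parameterizes its distributions: `stats.uniform(loc, scale)` covers `[loc, loc + scale]`, and `stats.randint(low, high)` excludes `high`. The sketch below checks this by sampling, and uses scikit-learn's `ParameterSampler` to list candidate settings drawn from the same distributions. The sample size of 1000 and the variable names are assumptions made for illustration.

```python
import scipy.stats as stats
from sklearn.model_selection import ParameterSampler

# Same distributions as in the search above
param_dist = {
    "max_depth": stats.randint(3, 10),
    "learning_rate": stats.uniform(0.01, 0.1),
    "subsample": stats.uniform(0.5, 0.5),
    "n_estimators": stats.randint(50, 200),
}

# Draw many samples to see the effective range of each distribution
learning_rates = param_dist["learning_rate"].rvs(size=1000, random_state=0)
depths = param_dist["max_depth"].rvs(size=1000, random_state=0)
print("learning_rate range:", learning_rates.min(), learning_rates.max())
print("max_depth range:    ", depths.min(), depths.max())

# List the kind of candidates a randomized search would evaluate
candidates = list(ParameterSampler(param_dist, n_iter=10, random_state=42))
for candidate in candidates[:3]:
    print(candidate)
```

With only 10 candidates over four hyperparameters, the search covers the space sparsely, so a larger `n_iter` may find a better configuration at the cost of more training time.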


In [22]:
# Retrain with the best hyperparameters found by the randomized search
model = XGBClassifier(random_state=42, learning_rate=0.07118528947223794, max_depth=4, n_estimators=64, subsample=0.728034992108518)
model.fit(X_train, y_train)

In [23]:
y_pred = model.predict(X_test)
y_pred

array([1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0,
       0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1,
       1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0,
       1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1,
       1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1])

In [24]:
from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_pred)

0.8079470198675497
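
To put the two accuracy figures in context, they can be converted back into counts of correct predictions. The sketch below assumes a test set of 151 rows, which is the length of the prediction arrays printed above. The standard error is a rough binomial approximation that treats each prediction as independent, so it is only an indication of how much the score could vary.

```python
import math

n_test = 151  # number of predictions in the arrays printed above
acc_default = 0.7814569536423841  # default XGBClassifier
acc_tuned = 0.8079470198675497    # after the randomized search

# Convert accuracies back into counts of correct predictions
correct_default = round(acc_default * n_test)
correct_tuned = round(acc_tuned * n_test)
extra_correct = correct_tuned - correct_default

# Rough binomial standard error of the tuned accuracy (an approximation)
std_err = math.sqrt(acc_tuned * (1 - acc_tuned) / n_test)

print(f"Correct with defaults: {correct_default} / {n_test}")
print(f"Correct after tuning:  {correct_tuned} / {n_test}")
print(f"Additional correct predictions: {extra_correct}")
print(f"Approximate standard error: {std_err:.3f}")
```

Under these assumptions the gain from tuning is smaller than one standard error, so it may not carry over to a different train/test split. Cross-validated scores on the full dataset would give a steadier comparison.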