# Boosting Algorithms Project

Predicting diabetes

- In the two previous projects we used a decision tree and then a random forest to predict diabetes. The results still leave room for improvement. Can boosting optimize them further?

- Boosting is a sequential composition of models (usually decision trees) in which each new model aims to correct the errors of the previous one. This approach may suit this dataset, since it meets several of the assumptions studied in the module.

- In this project you will apply this idea by training a boosting model on the dataset to improve accuracy.
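
The sequential idea described above can be illustrated with a minimal sketch. It is a simplified residual-fitting variant written in plain Python, not XGBoost's actual algorithm. The toy data and the helper names `fit_stump`, `boost` and `mse` are hypothetical and chosen only for illustration.

```python
def fit_stump(xs, residuals):
    """Find the one-split tree (stump) on a 1-D feature with the lowest squared error."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue  # a split must leave samples on both sides
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def mse(ys, preds):
    """Mean squared error between targets and predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


def boost(xs, ys, n_estimators=50, learning_rate=0.1):
    """Fit stumps one after another, each on the errors left by the previous ones."""
    base = sum(ys) / len(ys)          # start from the mean of the target
    preds = [base] * len(xs)
    for _ in range(n_estimators):
        residuals = [y - p for y, p in zip(ys, preds)]   # current errors
        stump = fit_stump(xs, residuals)                 # new model targets those errors
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
    return preds


# Hypothetical toy data: one feature and a noisy binary target
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

mse_start = mse(ys, [sum(ys) / len(ys)] * len(ys))  # error of the constant baseline
mse_end = mse(ys, boost(xs, ys))                    # error after boosting
print(mse_start, mse_end)
```

Each round shrinks the remaining error a little, and `learning_rate` controls how large each correction is.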

#### Step 1: Loading the dataset

In [28]:
import pandas as pd

# Reading the processed dataset

train_data = pd.read_csv("/workspaces/machine-learning-boosting-Juli-MM/data/processed/clean_train.csv")
test_data = pd.read_csv("/workspaces/machine-learning-boosting-Juli-MM/data/processed/clean_test.csv")

train_data.head()

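| Unnamed: 0 | Pregnancies | Glucose | BloodPressure | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 149.0 | 68.0 | 29.3 | 0.349 | 42.0 | 1 |
| 1 | 0.0 | 140.0 | 65.0 | 42.6 | 0.431 | 24.0 | 1 |
| 2 | 5.0 | 122.0 | 86.0 | 34.7 | 0.29 | 33.0 | 0 |
| 3 | 10.0 | 161.0 | 68.0 | 25.5 | 0.326 | 47.0 | 1 |
| 4 | 3.0 | 150.0 | 76.0 | 21.0 | 0.207 | 37.0 | 0 |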


#### Step 2: Build a boosting model

In [29]:
# Separate predictors and target variable in training and test data:

X_train = train_data.drop(["Outcome"], axis = 1)
y_train = train_data["Outcome"]
X_test = test_data.drop(["Outcome"], axis = 1)
y_test = test_data["Outcome"]

In [30]:
# Create and train the XGBoost model:

from xgboost import XGBClassifier

model = XGBClassifier(n_estimators = 200, learning_rate = 0.001, random_state = 42)
model.fit(X_train, y_train)

In [31]:
# Make predictions on test data:

y_pred = model.predict(X_test)
y_pred

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [32]:
# Calculating model accuracy on test data:

from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_pred)

0.71875
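
Every prediction above is 0, so this score equals the share of class 0 among the 128 test rows. The sketch below reproduces the number in plain Python. The class counts are an assumption inferred from the reported accuracy and the all-zero predictions, not read from the data. One plausible cause, offered as an assumption rather than a confirmed diagnosis, is that `learning_rate = 0.001` with 200 trees moves the predictions too little from their starting value to cross the 0.5 decision threshold.

```python
# Number of test rows, counted from the prediction array shown above
n_test = 128

# Assumption: with all-zero predictions, accuracy * n_test is the number of class-0 rows
n_negative = round(0.71875 * n_test)   # 92
n_positive = n_test - n_negative       # 36

# Stand-in labels with the inferred class counts (row order does not affect accuracy)
y_true = [0] * n_negative + [1] * n_positive
y_constant = [0] * n_test              # a "model" that always predicts class 0

baseline_accuracy = sum(t == p for t, p in zip(y_true, y_constant)) / n_test
print(baseline_accuracy)
```

A model is only useful if it beats this majority-class baseline.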

#### Step 3: Optimize the previous model

In [33]:
from sklearn.model_selection import GridSearchCV

# Define the grid of hyperparameters for search:

hyperparams = {
    'learning_rate': [0.1, 0.01, 0.001],
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 5, 7]
}

# Performing hyperparameter search using GridSearchCV:

grid = GridSearchCV(model, hyperparams, scoring="accuracy", cv=5)
grid.fit(X_train, y_train)

# Print the best hyperparameters from search:

print(f"The best parameters are: {grid.best_params_}")

The best parameters are: {'learning_rate': 0.01, 'max_depth': 5, 'n_estimators': 200}
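
To make the cost of this search concrete, the sketch below enumerates the same grid in plain Python. It only counts the candidates and cross-validation fits. It does not train anything, and the variable names `candidates` and `n_cv_fits` are illustrative.

```python
from itertools import product

# Same grid as in the cell above
hyperparams = {
    "learning_rate": [0.1, 0.01, 0.001],
    "n_estimators": [100, 200, 300],
    "max_depth": [3, 5, 7],
}
cv_folds = 5  # matches cv=5 in the GridSearchCV call

# Every combination of one value per hyperparameter
names = list(hyperparams)
candidates = [dict(zip(names, values)) for values in product(*hyperparams.values())]

# Each candidate is trained once per cross-validation fold
n_cv_fits = len(candidates) * cv_folds
print(len(candidates), n_cv_fits)
```

Adding one more value to any list, or one more hyperparameter, multiplies the number of fits, so grids are best kept small.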


In [34]:
# Retrain model with best hyperparameters:

model_grid = XGBClassifier(learning_rate=0.01, max_depth=5, n_estimators=200, random_state=42)
model_grid.fit(X_train, y_train)


In [35]:
# Make predictions on test data with the retrained model:

y_pred = model_grid.predict(X_test)
y_pred

array([0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0])

In [36]:
# Calculating model accuracy on retrained model:

accuracy_score(y_test, y_pred)

0.8125

#### Step 4: Save the model

In [37]:
from pickle import dump

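# Save the tuned model (model_grid), which matches the hyperparameters in the file name;
# the context manager closes the file once the dump finishes
with open("/workspaces/machine-learning-boosting-Juli-MM/models/boosting_classifier_maxdepth-5_learnrate-0.01_nestim-200_42", "wb") as f:
    dump(model_grid, f)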
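
A model saved with `pickle` can be restored later with `pickle.load`. The sketch below shows the round trip with a hypothetical stand-in object and a temporary path, so that it runs without XGBoost or the project directories. In the project, the object would be the trained classifier and the path would be the one used above.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the trained model, so the sketch runs without XGBoost
stand_in_model = {"learning_rate": 0.01, "max_depth": 5, "n_estimators": 200}

# Hypothetical temporary path; the project saves under its own models/ directory
path = os.path.join(tempfile.mkdtemp(), "stand_in_model.pkl")

# Save: the context manager closes the file after writing
with open(path, "wb") as f:
    pickle.dump(stand_in_model, f)

# Load: read the object back from disk
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == stand_in_model)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.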