This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

Pregnancies - Number of times pregnant<br>
Glucose - Plasma glucose concentration at 2 hours in an oral glucose tolerance test<br>
BloodPressure - Diastolic blood pressure (mm Hg)<br>
SkinThickness - Triceps skin fold thickness (mm)<br>
Insulin - 2-hour serum insulin (mu U/ml)<br>
BMI - Body mass index (weight in kg/(height in m)^2)<br>
DiabetesPedigreeFunction - Diabetes pedigree function<br>
Age - Age (years)<br>
Outcome - Class variable (0 or 1); 268 of 768 are 1, the others are 0<br>
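
The class counts above give a useful reference point for the accuracy scores reported later. The following is a minimal sketch that uses only the numbers stated in the description (768 rows, 268 positives); the variable names are illustrative and not part of the original notebook.

```python
# Class counts as stated in the dataset description.
n_total = 768
n_positive = 268
n_negative = n_total - n_positive

# Accuracy of a trivial model that always predicts the majority class (0).
baseline_accuracy = max(n_positive, n_negative) / n_total

print(f"Negative cases: {n_negative}")
print(f"Positive rate: {n_positive / n_total:.3f}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Any classifier evaluated with plain accuracy on this data should be judged against this baseline rather than against 50%.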

In [12]:
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_score

# Load the dataset
data = pd.read_csv("C:\\Users\\himan\\OneDrive\\Desktop\\csv\\Machine Learning\\pima-indians-diabetes.csv")
data.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [13]:
from sklearn.model_selection import KFold
cv = KFold(n_splits=5)
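
As a rough illustration of what `KFold(n_splits=5)` does with the 768 rows here: without shuffling it cuts the rows into 5 consecutive blocks, and the first `n_samples % n_splits` folds get one extra sample. The sketch below reproduces only that size rule in plain Python; `kfold_sizes` is a hypothetical helper, not a scikit-learn function.

```python
def kfold_sizes(n_samples, n_splits):
    """Test-fold sizes under the KFold rule: the first
    n_samples % n_splits folds receive one extra sample."""
    base, extra = divmod(n_samples, n_splits)
    return [base + 1 if i < extra else base for i in range(n_splits)]

sizes = kfold_sizes(768, 5)
print(sizes)       # size of each test fold
print(sum(sizes))  # every row is used as test data exactly once
```

This is consistent with the fold scores printed below: a score of 0.75974026 corresponds to 117 correct predictions out of 154. Note that plain `KFold` does not preserve the class ratio in each fold; `StratifiedKFold` is the scikit-learn splitter that does, if that matters for the comparison.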

In [14]:
# Split the dataset into features (X) and target (y)
# Note: X still includes the "Unnamed: 0" index column carried over from the CSV.
X = data.drop(columns=["Outcome"])
y = data["Outcome"]

In [15]:
from sklearn.ensemble import AdaBoostClassifier

# Initialize AdaBoost
# n_estimators: the maximum number of estimators at which boosting is terminated.
# In case of a perfect fit, the learning procedure is stopped early.
ada_model = AdaBoostClassifier(n_estimators=100, random_state=42)

# Perform cross-validation
ada_scores = cross_val_score(ada_model, X, y, cv=cv, scoring='accuracy')

print("Accuracy Scores:\n",ada_scores)
print(f"Mean Accuracy Score: {np.mean(ada_scores)}")

Accuracy Scores:
 [0.75974026 0.68831169 0.75974026 0.79084967 0.73856209]
Mean Accuracy Score: 0.747440794499618


In [16]:
from sklearn.ensemble import GradientBoostingClassifier

# Initialize Gradient Boosting
gb_model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42)

# Perform cross-validation
gb_scores = cross_val_score(gb_model, X, y, cv=cv, scoring='accuracy')

print("Accuracy Scores:\n",gb_scores)
print(f"Mean Accuracy Score: {np.mean(gb_scores)}")

Accuracy Scores:
 [0.79220779 0.71428571 0.78571429 0.79738562 0.75163399]
Mean Accuracy Score: 0.7682454800101859


In [17]:
!pip install xgboost



In [18]:
import xgboost as xgb

# Initialize XGBoost
xgb_model = xgb.XGBClassifier(n_estimators=100, learning_rate=0.1, random_state=42)

# Perform cross-validation
xgb_scores = cross_val_score(xgb_model, X, y, cv=cv, scoring='accuracy')

print("Accuracy Scores:\n",xgb_scores)
print(f"Mean Accuracy Score: {np.mean(xgb_scores)}")

Accuracy Scores:
 [0.77922078 0.71428571 0.75974026 0.79738562 0.74509804]
Mean Accuracy Score: 0.7591460826754945
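
To put the three results side by side, the sketch below recomputes the mean and the fold-to-fold standard deviation from the per-fold accuracies printed above. The scores are copied from the outputs (rounded to 8 decimals as displayed); the dictionary and variable names are illustrative only.

```python
from statistics import mean, stdev

# Per-fold accuracies copied from the printed outputs.
fold_scores = {
    "AdaBoost": [0.75974026, 0.68831169, 0.75974026, 0.79084967, 0.73856209],
    "GradientBoosting": [0.79220779, 0.71428571, 0.78571429, 0.79738562, 0.75163399],
    "XGBoost": [0.77922078, 0.71428571, 0.75974026, 0.79738562, 0.74509804],
}

# Mean and sample standard deviation across the 5 folds for each model.
summary = {name: (mean(s), stdev(s)) for name, s in fold_scores.items()}
for name, (m, sd) in summary.items():
    print(f"{name:>16}: mean={m:.4f}, std={sd:.4f}")

# Compare the gap between models with the spread across folds.
means = [m for m, _ in summary.values()]
gap = max(means) - min(means)
smallest_std = min(sd for _, sd in summary.values())
print(f"Gap between best and worst mean: {gap:.4f}")
print(f"Smallest fold-to-fold std: {smallest_std:.4f}")
```

Gradient Boosting has the highest mean, but the gap between the models is smaller than the spread across folds, so the ranking should be read with some caution.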


## GridSearchCV

In [19]:
from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 4, 5],
}
# Initialize XGBoost classifier
xgb_model = xgb.XGBClassifier()

# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=xgb_model, param_grid=param_grid, cv=cv, scoring='accuracy')

# Perform the grid search
grid_search.fit(X,y)

# Print the best parameters and the best score
print("GridSearchCV best parameters:", grid_search.best_params_)
print("GridSearchCV best score:", grid_search.best_score_)


GridSearchCV best parameters: {'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 50}
GridSearchCV best score: 0.7735081911552499
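
For a sense of the cost of the grid search above, the following sketch counts the candidate combinations in `param_grid` and the resulting number of cross-validation fits. It uses only `itertools` rather than scikit-learn itself, and the counting variables are illustrative.

```python
from itertools import product

# Same grid as in the cell above.
param_grid = {
    "n_estimators": [50, 100, 200],
    "learning_rate": [0.01, 0.1, 0.2],
    "max_depth": [3, 4, 5],
}
n_splits = 5  # folds in the KFold object passed as cv

# Every combination of one value per parameter is one candidate.
combinations = list(product(*param_grid.values()))
n_candidates = len(combinations)
n_cv_fits = n_candidates * n_splits

print(f"Candidates: {n_candidates}")
print(f"Cross-validation fits: {n_cv_fits}")
```

With the default `refit=True`, `GridSearchCV` additionally fits the best candidate once on the full data, so the total is one more than the cross-validation count.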


## RandomizedSearchCV

In [20]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform, randint

param_dist = {
    'n_estimators': randint(50, 200),  # Random integers from 50 to 199 (the upper bound is exclusive)
    'learning_rate': uniform(0.01, 0.2),  # uniform(loc, scale): uniform over [loc, loc + scale] = [0.01, 0.21]
    'max_depth': randint(3, 10),  # Random integers from 3 to 9
}

# Initialize RandomizedSearchCV
random_search = RandomizedSearchCV(estimator=xgb_model, param_distributions=param_dist, n_iter=20, cv=cv, scoring='accuracy')

# Perform the randomized search
random_search.fit(X, y)

# Print the best parameters and the best score
print("RandomizedSearchCV best parameters:", random_search.best_params_)
print("RandomizedSearchCV best score:", random_search.best_score_)


RandomizedSearchCV best parameters: {'learning_rate': 0.060709018007407196, 'max_depth': 3, 'n_estimators': 123}
RandomizedSearchCV best score: 0.7695866225277991
