1. What is Boosting in Machine Learning?




Boosting is a machine learning ensemble method that sequentially combines "weak learners" (models that are only slightly better than random guessing) to create a strong, accurate predictive model by focusing on correcting errors made by previous models.

2. How does Boosting differ from Bagging?

Bagging (Bootstrap Aggregating) trains models independently, and therefore in parallel if desired, on bootstrap samples of the data (random samples drawn with replacement) and averages or votes on their predictions, mainly to reduce variance. Boosting trains models sequentially, with each model focusing on the errors of the previous ones, mainly to reduce bias and improve accuracy.

3. What is the key idea behind AdaBoost?


The core idea behind AdaBoost (Adaptive Boosting) is to sequentially train multiple "weak learners" (simple models) and combine their predictions to create a strong learner. It works by assigning weights to training examples, focusing on misclassified instances in subsequent iterations, effectively "boosting" the performance of the weak learners.

4. Explain the working of AdaBoost with an example.




AdaBoost is an ensemble learning algorithm that sequentially trains weak learners, focusing on correcting mistakes from previous learners. It works by assigning weights to training examples, increasing the weight of misclassified instances, and then training new learners that pay more attention to these misclassified examples. This process is repeated, and the final prediction is a weighted combination of the predictions from all the weak learners.
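
As a small worked illustration, one round of the weight update can be traced in NumPy. The five labels and the weak learner's predictions below are made up for this sketch and do not come from any dataset; the update rule is the standard one for labels in {-1, +1}.

```python
import numpy as np

# Hypothetical toy data: true labels and one weak learner's predictions in {-1, +1}.
y = np.array([1, 1, -1, -1, 1])
stump_pred = np.array([1, 1, -1, -1, -1])  # only the last sample is misclassified

# Start with uniform sample weights.
w = np.full(len(y), 1 / len(y))

# Weighted error of the weak learner: total weight of the misclassified samples.
error = np.sum(w[stump_pred != y])

# Learner weight (alpha): larger when the weighted error is smaller.
alpha = 0.5 * np.log((1 - error) / error)

# Raise the weights of misclassified samples, lower the rest, then renormalize.
w_new = w * np.exp(-alpha * y * stump_pred)
w_new /= w_new.sum()

print("error:", round(float(error), 3))
print("alpha:", round(float(alpha), 3))
print("new weights:", np.round(w_new, 3))
```

With these numbers the weighted error is 0.2, so after the update the single misclassified sample carries half of the total weight (0.5) and each correctly classified sample carries 0.125. The next weak learner is therefore pushed to get that sample right.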

5. What is Gradient Boosting, and how is it different from AdaBoost?

Gradient Boosting and AdaBoost are both ensemble methods that build models sequentially, with each new model correcting errors made by the previous ones, but they differ in how they identify and address those errors. AdaBoost reweights the training samples after each round so that misclassified samples count more, which corresponds to minimizing an exponential loss. Gradient Boosting instead fits each new learner to the negative gradient of a chosen loss function (for squared error, simply the residuals), so it works with any differentiable loss and is typically used with deeper trees than the decision stumps common in AdaBoost.
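
The sequential error-correcting idea can be sketched for the squared-error case, where each new tree is fitted to the residuals of the current ensemble. This is a minimal illustration and not how any particular library implements it; the synthetic data, the tree depth, the learning rate, and the number of rounds are all assumptions chosen for the sketch.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression data (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
n_rounds = 50

# Start from a constant prediction: the mean minimizes squared error.
prediction = np.full_like(y, y.mean())
initial_mse = np.mean((y - prediction) ** 2)

trees = []
for _ in range(n_rounds):
    # For squared loss, the negative gradient is simply the residual.
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)
    # Take a small step in the direction the tree suggests.
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

final_mse = np.mean((y - prediction) ** 2)
print(f"training MSE before boosting: {initial_mse:.4f}")
print(f"training MSE after {n_rounds} rounds: {final_mse:.4f}")
```

Swapping the residual line for the negative gradient of another differentiable loss is what turns this loop into gradient boosting for that loss.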

6. What is the loss function in Gradient Boosting?


In gradient boosting, the loss function quantifies the difference between model predictions and actual values, guiding the algorithm to minimize errors iteratively. For regression tasks, common loss functions include mean squared error (MSE), while for classification, logarithmic loss (or cross-entropy) is often used.
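
For reference, the two losses mentioned above and their negative gradients with respect to the current model output can be written as follows. The symbols $F$ (the current model output), $p$ (the predicted probability), and $\sigma$ (the logistic function) are notation introduced here, and the factor $\tfrac{1}{2}$ in the squared-error loss is a common convention rather than a requirement.

```latex
\begin{aligned}
L_{\text{MSE}}(y, F) &= \tfrac{1}{2}\,(y - F)^2,
  & -\frac{\partial L_{\text{MSE}}}{\partial F} &= y - F \\[4pt]
L_{\text{log}}(y, F) &= -\bigl[\,y \log p + (1 - y)\log(1 - p)\,\bigr],
  \quad p = \sigma(F) = \frac{1}{1 + e^{-F}},
  & -\frac{\partial L_{\text{log}}}{\partial F} &= y - p
\end{aligned}
```

In both cases the negative gradient is "target minus current prediction", which is why each boosting round can be read as fitting a new learner to what the ensemble still gets wrong.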

7. How does XGBoost improve over traditional Gradient Boosting?

XGBoost improves upon traditional Gradient Boosting by adding L1 and L2 regularization to the objective, parallelizing the split search within each tree for speed, and using sparsity-aware split finding, which learns a default direction for missing values at each split. Together these reduce overfitting and shorten training time, which makes the algorithm practical on large, sparse datasets.


8. What is the difference between XGBoost and CatBoost?

XGBoost and CatBoost are both powerful gradient boosting algorithms, but they differ primarily in their handling of categorical data and in how they build trees. XGBoost has traditionally required categorical features to be encoded beforehand (for example with one-hot encoding), while CatBoost handles them natively, potentially leading to simpler workflows and improved performance, especially on datasets rich in categorical variables. CatBoost also grows symmetric (oblivious) trees by default.


9. What are some real-world applications of Boosting techniques?


Boosting techniques find applications in diverse fields, enhancing accuracy and performance in various tasks. Some key areas include finance, healthcare, e-commerce, and natural language processing. In finance, they help with credit scoring, fraud detection, and stock market prediction. Healthcare uses them for disease diagnosis, patient risk assessment, and medication development. E-commerce leverages boosting for personalized recommendations and customer segmentation. Natural language processing benefits from boosting in tasks like text classification and sentiment analysis.

10. How does regularization help in XGBoost?

Regularization in XGBoost helps prevent overfitting by adding penalties on the model's complexity to the objective function, encouraging simpler trees that generalize better to unseen data. The penalties are L1 (Lasso) and L2 (Ridge) terms on the leaf weights, together with a per-leaf cost that discourages splits whose gain is too small.
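
Written out, the regularized objective takes the following form. The notation is introduced here for illustration: $n$ training examples, $K$ trees $f_k$, $T$ leaves in a tree, and leaf weights $w_j$. The coefficients $\gamma$, $\lambda$, and $\alpha$ correspond to the `gamma`, `reg_lambda`, and `reg_alpha` parameters of the XGBoost scikit-learn wrapper.

```latex
\mathcal{L} = \sum_{i=1}^{n} l\bigl(y_i, \hat{y}_i\bigr) + \sum_{k=1}^{K} \Omega(f_k),
\qquad
\Omega(f) = \gamma\,T + \tfrac{1}{2}\,\lambda \sum_{j=1}^{T} w_j^{2} + \alpha \sum_{j=1}^{T} \lvert w_j \rvert
```

The $\gamma T$ term charges a fixed cost per leaf, the $\lambda$ term shrinks leaf weights toward zero, and the $\alpha$ term can set some leaf weights exactly to zero.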

11. What are some hyperparameters to tune in Gradient Boosting models?

Taking XGBoost as a representative implementation, the most important hyperparameters to tune are:
max_depth : Maximum depth of a tree. ...
min_child_weight : Minimum sum of instance weight needed in a child. ...
subsample : Subsample ratio of the training instances. ...
colsample_bytree : Subsample ratio of columns when constructing each tree.

12. What is the concept of Feature Importance in Boosting?



In the context of boosting algorithms, feature importance refers to a method of quantifying how much each input feature contributes to the model's predictive power. It essentially provides a score or ranking for each feature, indicating its relative usefulness in making predictions. This helps in understanding which features are most influential and can guide feature selection for model simplification and improvement.

13. Why is CatBoost efficient for categorical data?



CatBoost is efficient for categorical data because it handles categorical features directly, without requiring extensive preprocessing such as one-hot encoding. It converts categorical values into numbers using "target statistics": each category is replaced by a smoothed average of the target, computed for every row only from the rows that precede it in a random permutation, so that a row's own label does not leak into its encoding. In addition, CatBoost's "Ordered Boosting" scheme applies the same idea when estimating gradients, which reduces overfitting and leads to more robust predictive models.
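
The idea behind target statistics can be sketched in a few lines. This is a simplified illustration and not CatBoost's actual implementation: the function name, the `prior` and `prior_weight` arguments, and the toy data are all hypothetical, and the rows are processed in their given order rather than in random permutations.

```python
import numpy as np

def ordered_target_statistic(categories, targets, prior=0.5, prior_weight=1.0):
    """Encode each row using only the targets of earlier rows with the same category."""
    sums = {}    # running sum of targets seen so far, per category
    counts = {}  # running count of rows seen so far, per category
    encoded = np.empty(len(categories), dtype=float)
    for i, (cat, target) in enumerate(zip(categories, targets)):
        s = sums.get(cat, 0.0)
        c = counts.get(cat, 0)
        # Smoothed mean of the targets from previous rows only.
        encoded[i] = (s + prior * prior_weight) / (c + prior_weight)
        # Update the running statistics after encoding, so row i never sees its own label.
        sums[cat] = s + target
        counts[cat] = c + 1
    return encoded

# Hypothetical toy column and binary labels.
colors = ["red", "blue", "red", "red", "blue"]
labels = [1, 0, 1, 0, 1]
print(ordered_target_statistic(colors, labels))
```

The first occurrence of each category falls back to the prior (0.5 here), and later occurrences move toward the mean of the labels seen so far for that category.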

In [None]:
#14.  Train an AdaBoost Classifier on a sample dataset and print model accuracy



import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


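# Load the Iris dataset (the path assumes the Kaggle "iris" dataset is attached)
iris = pd.read_csv('/kaggle/input/iris/Iris.csv')

X = iris[['SepalLengthCm','SepalWidthCm','PetalLengthCm','PetalWidthCm']]
print(X.head())

y = iris['Species']
print(y.head())

# Train an AdaBoost classifier on a held-out split and print its accuracy
# (the split ratio and n_estimators are illustrative settings)
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
model = AdaBoostClassifier(n_estimators=50, random_state=42)
model.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))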

In [None]:
#15.  Train an AdaBoost Regressor and evaluate performance using Mean Absolute Error (MAE)


# check scikit-learn version
import sklearn
print(sklearn.__version__)


# test classification dataset
from sklearn.datasets import make_classification
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=6)
# summarize the dataset
print(X.shape, y.shape)


# evaluate adaboost algorithm for classification
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import AdaBoostClassifier
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=6)
# define the model
model = AdaBoostClassifier()
# evaluate the model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
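
The code above evaluates an AdaBoostClassifier by accuracy, while the task heading asks for a regressor scored by Mean Absolute Error. A minimal sketch of that variant might look as follows; the synthetic dataset from `make_regression`, the 80/20 split, and the hyperparameters are assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic regression data (illustrative only).
X_reg, y_reg = make_regression(n_samples=1000, n_features=20, n_informative=15,
                               noise=0.1, random_state=6)
X_train, X_test, y_train, y_test = train_test_split(X_reg, y_reg, test_size=0.2, random_state=1)

# Fit the regressor with its default weak learner (a shallow decision tree).
reg = AdaBoostRegressor(n_estimators=50, random_state=1)
reg.fit(X_train, y_train)

# Mean Absolute Error on the held-out split, in the units of the target.
mae = mean_absolute_error(y_test, reg.predict(X_test))
print('MAE: %.3f' % mae)
```

MAE is only meaningful relative to the scale of the target, so it helps to compare it against a simple baseline such as always predicting the training mean.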




In [None]:
#16.  Train a Gradient Boosting Classifier on the Breast Cancer dataset and print feature importance





import os
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from collections import OrderedDict

from sklearn import datasets
from sklearn.preprocessing import label_binarize, LabelBinarizer
from sklearn.model_selection import train_test_split, GridSearchCV

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

from sklearn.metrics import precision_recall_curve
from sklearn.metrics import classification_report
from sklearn.metrics import roc_curve, auc

DISPLAY_PRECISION = 4

pd.set_option("display.precision", DISPLAY_PRECISION)

dat = datasets.load_breast_cancer()

print("The sklearn breast cancer dataset keys:")
print(dat.keys()) # dict_keys(['target_names', 'target', 'feature_names', 'data', 'DESCR'])
print("---")

# Note that we need to reverse the original '0' and '1' mapping in order to end up with this mapping:
# Benign = 0 (negative class)
# Malignant = 1 (positive class)

li_classes = [dat.target_names[1], dat.target_names[0]]
li_target = [1 if x==0 else 0 for x in list(dat.target)]
li_ftrs = list(dat.feature_names)

print("There are 2 target classes:")
print("li_classes", li_classes)

print("---")
print("Target class distribution from a total of %d target values:" % len(li_target))
print(pd.Series(li_target).value_counts())
print("---")

df_all = pd.DataFrame(dat.data[:,:], columns=li_ftrs)
print("Describe dataframe, first 6 columns:")
print(df_all.iloc[:,:6].describe().to_string())
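
The code above only loads and summarizes the dataset. A minimal sketch of the remaining steps, training a Gradient Boosting classifier and printing its feature importances, could look like this. The split and hyperparameters are illustrative assumptions, and the sketch uses scikit-learn's original label encoding rather than the reversed mapping built above, which does not affect the importances.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

bc = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    bc.data, bc.target, test_size=0.2, random_state=42, stratify=bc.target)

gbc = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                 max_depth=3, random_state=42)
gbc.fit(X_train, y_train)
test_accuracy = gbc.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.3f}")

# Impurity-based importances: one score per feature, normalized to sum to 1.
importances = pd.Series(gbc.feature_importances_, index=bc.feature_names)
importances = importances.sort_values(ascending=False)
print(importances.head(10).to_string())
```

These scores are computed from the training data, so they show what the fitted trees relied on rather than what generalizes; permutation importance on held-out data is a common cross-check.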

In [None]:
#17. Train a Gradient Boosting Regressor and evaluate using R-Squared Score


import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt

# Sample dataset: House features (Square Footage, Number of Bedrooms) and Prices
X = np.array([[1500, 3], [1800, 4], [2400, 3], [3000, 5], [3500, 4]])  # Features
y = np.array([400000, 450000, 600000, 650000, 700000])  # House prices

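# Split the data into training and testing sets.
# With only five samples, hold out two so that R-squared is defined (it needs at least two test points).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)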

gbr = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
gbr.fit(X_train, y_train)

# Predict on the test set
y_pred = gbr.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"R-squared (R2 Score): {r2:.2f}")

# Visualize Actual vs. Predicted Prices
plt.scatter(range(len(y_test)), y_test, color='blue', label='Actual Prices')
plt.scatter(range(len(y_pred)), y_pred, color='red', label='Predicted Prices')



plt.title('Actual vs Predicted Prices')
plt.xlabel('Test Sample Index')
plt.ylabel('House Price')
plt.legend()
plt.show()


In [None]:
#18. Train an XGBoost Classifier on a dataset and compare accuracy with Gradient Boosting




import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings

warnings.filterwarnings('ignore')

data = 'C:/datasets/Wholesale customers data.csv'

df = pd.read_csv(data)
df.shape

df.isnull().sum()




In [None]:
#19. Train a CatBoost Classifier and evaluate using F1-Score



import numpy as np

from catboost import CatBoostClassifier, Pool

# initialize data
train_data = np.random.randint(0,
                               100,
                               size=(100, 10))

train_labels = np.random.randint(0,
                                 2,
                                 size=(100))

test_data = catboost_pool = Pool(train_data,
                                 train_labels)

model = CatBoostClassifier(iterations=2,
                           depth=2,
                           learning_rate=1,
                           loss_function='Logloss',
                           verbose=True)
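# train the model
model.fit(train_data, train_labels)
# make the prediction using the resulting model
# (test_data wraps the training rows, so the scores below are training scores)
preds_class = model.predict(test_data)
preds_proba = model.predict_proba(test_data)
print("class = ", preds_class)
print("proba = ", preds_proba)

# evaluate using F1-score
from sklearn.metrics import f1_score
print("F1 = ", f1_score(train_labels, preds_class.astype(int)))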


# CatBoostRegressor example
import numpy as np
from catboost import Pool, CatBoostRegressor
# initialize data
train_data = np.random.randint(0,
                               100,
                               size=(100, 10))
train_label = np.random.randint(0,
                                1000,
                                size=(100))
test_data = np.random.randint(0,
                              100,
                              size=(50, 10))
# initialize Pool
train_pool = Pool(train_data,
                  train_label,
                  cat_features=[0,2,5])
test_pool = Pool(test_data,
                 cat_features=[0,2,5])

# specify the training parameters
model = CatBoostRegressor(iterations=2,
                          depth=2,
                          learning_rate=1,
                          loss_function='RMSE')
#train the model
model.fit(train_pool)
# make the prediction using the resulting model
preds = model.predict(test_pool)
print(preds)


import numpy as np
from catboost import CatBoost, Pool

# read the dataset

train_data = np.random.randint(0,
                               100,
                               size=(100, 10))
train_labels = np.random.randint(0,
                                2,
                                size=(100))
test_data = np.random.randint(0,
                                100,
                                size=(50, 10))

train_pool = Pool(train_data,
                  train_labels)

test_pool = Pool(test_data)
# specify training parameters via map

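# Set a classification loss explicitly so that 'Class' and 'Probability'
# predictions are meaningful for the binary labels
param = {'iterations': 5, 'loss_function': 'Logloss'}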
model = CatBoost(param)
#train the model
model.fit(train_pool)
# make the prediction using the resulting model
preds_class = model.predict(test_pool, prediction_type='Class')
preds_proba = model.predict(test_pool, prediction_type='Probability')
preds_raw_vals = model.predict(test_pool, prediction_type='RawFormulaVal')
print("Class", preds_class)
print("Proba", preds_proba)
print("Raw", preds_raw_vals)

In [None]:
#20  Train an XGBoost Regressor and evaluate using Mean Squared Error (MSE)


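import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split

# Prepare features and target
# Assumption: `data` is a pandas DataFrame with a numeric 'target' column, loaded beforehand
X = data.drop('target', axis=1)
y = data['target']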

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model
xgb_model = xgb.XGBRegressor(objective='reg:squarederror', n_estimators=100, learning_rate=0.1)
xgb_model.fit(X_train, y_train)


import pandas as pd
import numpy as np
import xgboost as xgb
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import MinMaxScaler

# Load stock data
stock_data = pd.read_csv('stock_data.csv')

# Calculate technical indicators (e.g., Moving Average)
stock_data['MA_7'] = stock_data['Close'].rolling(window=7).mean()
stock_data['MA_21'] = stock_data['Close'].rolling(window=21).mean()

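# Prepare features and target
# Drop the rows where the moving averages are undefined, keeping X and y aligned
feature_cols = ['Open', 'High', 'Low', 'Volume', 'MA_7', 'MA_21']
clean = stock_data.dropna(subset=feature_cols + ['Close'])
X = clean[feature_cols]
y = clean['Close']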

# Scale features
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# Train-test split
split = int(0.8 * len(X_scaled))
X_train, X_test = X_scaled[:split], X_scaled[split:]
y_train, y_test = y[:split], y[split:]

# Train XGBoost model
xgb_model = xgb.XGBRegressor(objective='reg:squarederror', n_estimators=100)
xgb_model.fit(X_train, y_train)

# Make predictions and evaluate
y_pred = xgb_model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
print(f"Mean Squared Error: {mse:.2f}")
print(f"Root Mean Squared Error: ${rmse:.2f}")

In [None]:
#27 #Import libraries:
import pandas as pd
import numpy as np
import xgboost as xgb
from xgboost.sklearn import XGBClassifier
from sklearn import metrics   #Additional sklearn functions
from sklearn.model_selection import GridSearchCV
import matplotlib.pylab as plt

from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 12, 4

train = pd.read_csv('Train_Modified.csv', encoding='ISO-8859-1')
target = 'Disbursed'
IDcol = 'ID'

print("There will be no output for this particular block of code")



def modelfit(alg, dtrain, predictors, useTrainCV=True, cv_folds=5, early_stopping_rounds=50):

    if useTrainCV:
        # Use cross-validation with early stopping to choose the number of boosting rounds
        xgb_param = alg.get_xgb_params()
        xgtrain = xgb.DMatrix(dtrain[predictors].values, label=dtrain[target].values)
        cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=alg.get_params()['n_estimators'], nfold=cv_folds,
            metrics='auc', early_stopping_rounds=early_stopping_rounds, verbose_eval=False)
        alg.set_params(n_estimators=cvresult.shape[0])

    #Fit the algorithm on the data
    #(recent XGBoost releases take eval_metric as a model parameter rather than a fit() argument)
    alg.set_params(eval_metric='auc')
    alg.fit(dtrain[predictors], dtrain[target])

    #Predict training set:
    dtrain_predictions = alg.predict(dtrain[predictors])
    dtrain_predprob = alg.predict_proba(dtrain[predictors])[:,1]

    #Print model report:
    print("\nModel Report")
    print("Accuracy : %.4g" % metrics.accuracy_score(dtrain[target].values, dtrain_predictions))
    print("AUC Score (Train): %f" % metrics.roc_auc_score(dtrain[target], dtrain_predprob))

    #Plot how often each feature is used for splitting
    feat_imp = pd.Series(alg.get_booster().get_fscore()).sort_values(ascending=False)
    feat_imp.plot(kind='bar', title='Feature Importances')
    plt.ylabel('Feature Importance Score')

In [None]:
#28  Train a CatBoost Classifier on an imbalanced dataset and compare performance with class weighting



import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
data = pd.read_csv('healthcare-dataset-stroke-data.csv')
#Plotting barplot for target
plt.figure(figsize=(10,6))
# Recent seaborn versions require x and y to be passed as keyword arguments
g = sns.barplot(x=data['stroke'], y=data['stroke'], palette='Set1', estimator=lambda x: len(x) / len(data))

#Annotating the graph
for p in g.patches:
        width, height = p.get_width(), p.get_height()
        x, y = p.get_xy()
        g.text(x+width/2,
               y+height,
               '{:.0%}'.format(height),
               horizontalalignment='center',fontsize=15)

#Setting the labels
plt.xlabel('Heart Stroke', fontsize=14)
plt.ylabel('Percentage', fontsize=14)
plt.title('Percentage of patients will/will not have heart stroke', fontsize=16)
plt.show()


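#Baseline model: always predict the mode (most frequent class) of the target
from sklearn.metrics import f1_score, accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Assumption: a stratified 80/20 split of the 'stroke' column stands in for the
# y_train / y_test that this cell used without defining
y_train, y_test = train_test_split(data['stroke'], test_size=0.2, random_state=42, stratify=data['stroke'])
pred_test = [y_train.mode()[0]] * len(y_test)

#Printing f1 and accuracy scores
print('The accuracy for mode model is:', accuracy_score(y_test, pred_test))
print('The f1 score for the mode model is:', f1_score(y_test, pred_test, zero_division=0))

#Printing the confusion matrix
print(confusion_matrix(y_test, pred_test))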

In [None]:
#29 Train an AdaBoost Classifier and analyze the effect of different learning rates



import numpy as np

class DecisionStump:
    def __init__(self):
        self.polarity = 1
        self.feature_idx = None
        self.threshold = None
        self.alpha = None

    def predict(self, X):
        n_samples = X.shape[0]
        predictions = np.ones(n_samples)
        feature_column = X[:, self.feature_idx]

        if self.polarity == 1:
            predictions[feature_column < self.threshold] = -1
        else:
            predictions[feature_column > self.threshold] = -1

        return predictions

class AdaBoost:
    def __init__(self, n_clf=5):
        self.n_clf = n_clf
        self.clfs = []

    def fit(self, X, y):
        n_samples, n_features = X.shape
        w = np.full(n_samples, (1 / n_samples))

        for _ in range(self.n_clf):
            clf = DecisionStump()
            min_error = float('inf')

            for feature_i in range(n_features):
                X_column = X[:, feature_i]
                thresholds = np.unique(X_column)

                for threshold in thresholds:
                    predictions = np.ones(n_samples)
                    predictions[X_column < threshold] = -1

                    error = sum(w[y != predictions])

                    if error > 0.5:
                        error = 1 - error
                        p = -1
                    else:
                        p = 1

                    if error < min_error:
                        clf.polarity = p
                        clf.threshold = threshold
                        clf.feature_idx = feature_i
                        min_error = error

            EPS = 1e-10
            clf.alpha = 0.5 * np.log((1.0 - min_error + EPS) / (min_error + EPS))
            predictions = clf.predict(X)
            w *= np.exp(-clf.alpha * y * predictions)
            w /= np.sum(w)
            self.clfs.append(clf)

    def predict(self, X):
        clf_preds = [clf.alpha * clf.predict(X) for clf in self.clfs]
        y_pred = np.sum(clf_preds, axis=0)
        return np.sign(y_pred)


# The same algorithm using scikit-learn's implementation
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd

# Load dataset
data = pd.read_csv("Iris.csv")  # Adjust the file path as necessary
X = data.iloc[:, :-1].values  # Features
y = data.iloc[:, -1].values  # Target

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create the AdaBoost classifier
# (`estimator` replaces the deprecated `base_estimator` argument in scikit-learn 1.2 and later)
abc = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1), n_estimators=50)

# Fit the model
abc.fit(X_train, y_train)

# Predict and evaluate
y_pred = abc.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
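
The task heading also asks about the effect of the learning rate, which the code above does not vary. One way to explore it is to refit the classifier over a few values and compare held-out accuracy. In this sketch the dataset comes from `sklearn.datasets.load_iris` instead of the CSV file, and the list of learning rates and the split are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X_iris, y_iris = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X_iris, y_iris, test_size=0.2, random_state=42, stratify=y_iris)

results = {}
for lr in [0.01, 0.1, 0.5, 1.0, 2.0]:
    # The learning rate scales each weak learner's contribution to the ensemble.
    clf_lr = AdaBoostClassifier(n_estimators=50, learning_rate=lr, random_state=42)
    clf_lr.fit(X_train, y_train)
    results[lr] = accuracy_score(y_test, clf_lr.predict(X_test))
    print(f"learning_rate={lr}: accuracy={results[lr]:.3f}")
```

A smaller learning rate usually needs more estimators to reach the same fit, so the two settings are best tuned together. On a dataset as small as Iris the differences may be within the noise of a single split.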

In [None]:
#30 Train an XGBoost Classifier for multi-class classification and evaluate using log-loss.


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets
import xgboost as xgb

dbunch = datasets.load_breast_cancer(as_frame=True)
df = dbunch.frame
features = dbunch.feature_names
target_names = dbunch.target_names
target = 'target'
df.info()



df.target.value_counts().sort_index().plot.bar()
plt.xlabel('target')
plt.ylabel('count');

from sklearn.model_selection import train_test_split

n_valid = 50

train_df, valid_df = train_test_split(df, test_size=n_valid, random_state=42)
train_df.shape, valid_df.shape

params = {
    'tree_method': 'exact',
    'objective': 'binary:logistic',
}
num_boost_round = 50

dtrain = xgb.DMatrix(label=train_df[target], data=train_df[features])
dvalid = xgb.DMatrix(label=valid_df[target], data=valid_df[features])
model = xgb.train(params=params, dtrain=dtrain, num_boost_round=num_boost_round,
                  evals=[(dtrain, 'train'), (dvalid, 'valid')],
                  verbose_eval=10)




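# Fit the scikit-learn wrapper with the same settings: the evaluation below relies on
# predict / predict_proba, which the Booster returned by xgb.train does not provide
clf = xgb.XGBClassifier(n_estimators=num_boost_round, **params)
clf.fit(train_df[features], train_df[target])

y_true = valid_df[target]
y_pred = clf.predict(valid_df[features])
y_score = clf.predict_proba(valid_df[features])[:,1]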


from sklearn import metrics

print("Accuracy:", metrics.accuracy_score(y_true, y_pred))
print(metrics.classification_report(y_true, y_pred, target_names=target_names))
print("ROC AUC:", metrics.roc_auc_score(y_true, y_score))
print("Log loss:", metrics.log_loss(y_true, y_score))
from sklearn.inspection import permutation_importance

# 'neg_log_loss' is scikit-learn's built-in scorer for negated log loss on predicted probabilities
permu_imp = permutation_importance(clf, valid_df[features], valid_df[target],
                                   n_repeats=30, random_state=0, scoring='neg_log_loss')

importances_permutation = pd.Series(permu_imp['importances_mean'], index=features)
importances_permutation.sort_values(ascending=True)[-10:].plot.barh()
plt.title('Permutation Importance on Out-of-Sample Set')
plt.xlabel('change in log likelihood');