Ensembling techniques in machine learning combine multiple models to make better predictions than any single model could on its own. Think of it like asking several experts for their opinions and then combining their advice to make a final decision.

1. Bagging (Bootstrap Aggregating)
Concept: Train multiple models on different random samples of the data and then average their predictions. Example: Random Forest.
Analogy: Imagine you’re trying to guess the weight of a cake by asking several friends to each take a guess. You then average their guesses to get a more accurate estimate.

2. Boosting
Concept: Train models sequentially, each one trying to correct the mistakes of the previous one. The final prediction is a weighted sum of all models.
Examples: AdaBoost, Gradient Boosting.
Analogy: Think of it as an assembly line where each worker fixes the mistakes of the previous worker, gradually improving the final product.

3. Stacking (Stacked Generalization)
Concept: Train multiple different models and then use another model to learn how to best combine their predictions.
Example: Using a logistic regression model to combine the outputs of a decision tree and a neural network.

LOADING DATA


In [11]:

#Importing pandas library
import pandas as pd

#Reading the dataset
df = pd.read_excel("/content/Cancer.xlsx")

print(df)

    Country or Territory  \
0            Afghanistan   
1                Algeria   
2             Azerbaijan   
3                Albania   
4                Armenia   
..                   ...   
202              Réunion   
203     French Polynesia   
204           Guadeloupe   
205                 Guam   
206           Martinique   

    Cancer deaths attributable to alcohol\nProportion (%) of cancer deaths caused by alcohol drinking in men ages 15 years or older, 2016  \
0                                                  0.5                                                                                      
1                                                  1.1                                                                                      
2                                                  3.2                                                                                      
3                                                  4.9                                             

The dataset has 207 rows and 27 columns.

In [12]:
#Checking the first rows
df.head()

Unnamed: 0,Country or Territory,"Cancer deaths attributable to alcohol\nProportion (%) of cancer deaths caused by alcohol drinking in men ages 15 years or older, 2016",Smoking prevalence male\nPrevalence (%) of daily smoking for men,Smoking prevalence female\nPrevalence (%) of daily smoking for women,"Cancers attributable to infections\nProportion of cancers attributable to infections (%), by country","Obesity prevalence male\nInternational variation in the prevalence of obesity, 2016","Obesity prevalence female\nInternational variation in the prevalence of obesity, 2016","Melanoma skin cancer incidence\nAge-standardized rate (world) per 100,000, both sexes, 2018",Breastfeeding at 12 months\nPercent (%) of children who receive any breast milk at 12 months of age,Average births per woman\n2010-2015,...,"Most common cancer cases worldwide, females\n2018","Most common cancer deaths worldwide, females\n2018","Most common cancer cases worldwide, males\n2018","Most common cancer deaths worldwide, males\n2018","Cancer survivors\nEstimated number of cancer survivors diagnosed within the past five years per 100,000 population, both sexes, 2018","Years lived with disability due to cancer\nBoth sexes, all ages, 2017","Hepatitis B virus vaccination\nHepatitis B vaccination coverage (% of one-year-olds who have received three doses of hepatitis B vaccine), 2017","Radiotherapy availability\nNumber of radiotherapy machines per 1,000 cancer patients","Cervical cancer incidence rates\nAge-standardized rate (world) per 100,000, 2018","HIV prevalence (%)\nBoth sexes, 2017"
0,Afghanistan,0.5,21.4,7.0,16.0,3.2,7.6,0.3,88.0,5.3,...,Breast,Breast,"Lip, oral cavity",Stomach,151.0,8089.8,65,0.0,6.6,No data
1,Algeria,1.1,17.5,2.2,12.9,19.9,34.9,0.7,55.1,3.0,...,Breast,Breast,Lung,Lung,312.6,16404.9,91,0.75,8.1,0.05
2,Azerbaijan,3.2,36.7,0.3,15.4,15.8,23.6,0.53,36.1,2.1,...,Breast,Breast,Lung,Lung,228.1,6685.1,95,0.86,6.5,0.1
3,Albania,4.9,40.9,6.1,13.4,21.6,21.8,1.7,72.3,1.7,...,Breast,Breast,Lung,Lung,401.1,2644.3,99,0.6,6.5,0.05
4,Armenia,5.0,43.5,1.5,12.2,17.1,23.0,1.6,52.3,1.7,...,Breast,Breast,Lung,Lung,402.2,3418.0,94,0.34,8.4,0.2


These are the first five rows in the dataset.

In [13]:
#Checking the last rows
df.tail()

Unnamed: 0,Country or Territory,"Cancer deaths attributable to alcohol\nProportion (%) of cancer deaths caused by alcohol drinking in men ages 15 years or older, 2016",Smoking prevalence male\nPrevalence (%) of daily smoking for men,Smoking prevalence female\nPrevalence (%) of daily smoking for women,"Cancers attributable to infections\nProportion of cancers attributable to infections (%), by country","Obesity prevalence male\nInternational variation in the prevalence of obesity, 2016","Obesity prevalence female\nInternational variation in the prevalence of obesity, 2016","Melanoma skin cancer incidence\nAge-standardized rate (world) per 100,000, both sexes, 2018",Breastfeeding at 12 months\nPercent (%) of children who receive any breast milk at 12 months of age,Average births per woman\n2010-2015,...,"Most common cancer cases worldwide, females\n2018","Most common cancer deaths worldwide, females\n2018","Most common cancer cases worldwide, males\n2018","Most common cancer deaths worldwide, males\n2018","Cancer survivors\nEstimated number of cancer survivors diagnosed within the past five years per 100,000 population, both sexes, 2018","Years lived with disability due to cancer\nBoth sexes, all ages, 2017","Hepatitis B virus vaccination\nHepatitis B vaccination coverage (% of one-year-olds who have received three doses of hepatitis B vaccine), 2017","Radiotherapy availability\nNumber of radiotherapy machines per 1,000 cancer patients","Cervical cancer incidence rates\nAge-standardized rate (world) per 100,000, 2018","HIV prevalence (%)\nBoth sexes, 2017"
202,Réunion,No data,No data,No data,12.4,No data,No data,3.3,No data,2.4,...,Breast,Breast,Prostate,Lung,590.1,No data,No data,No data,10.5,No data
203,French Polynesia,No data,No data,No data,6.6,No data,No data,6.5,No data,2.1,...,Breast,Lung,Prostate,Lung,661.9,No data,No data,No data,10.1,No data
204,Guadeloupe,No data,No data,No data,10.2,No data,No data,1.5,No data,2.0,...,Breast,Breast,Prostate,Prostate,776.5,No data,No data,0.88,9.3,No data
205,Guam,No data,No data,No data,8.5,No data,No data,No data,No data,2.4,...,Breast,Lung,Lung,Lung,418.0,168.8,No data,2.58,18.7,No data
206,Martinique,No data,No data,No data,6.6,No data,No data,2,No data,2.0,...,Breast,Breast,Prostate,Prostate,747.0,No data,No data,1.46,7.6,No data


These are the last five rows in the dataset.

In [14]:
df.shape

(207, 27)

This implies that the dataset has 207 rows and 27 columns.

In [8]:
# Checking for information about the dataset.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 207 entries, 0 to 206
Data columns (total 27 columns):
 #   Column                                                                                                                                                     Non-Null Count  Dtype 
---  ------                                                                                                                                                     --------------  ----- 
 0   Country or Territory                                                                                                                                       207 non-null    object
 1   Cancer deaths attributable to alcohol
Proportion (%) of cancer deaths caused by alcohol drinking in men ages 15 years or older, 2016                       207 non-null    object
 2   Smoking prevalence male
Prevalence (%) of daily smoking for men                                                                                            207

In [15]:
#Summary statistics.
df.describe()


Unnamed: 0,Country or Territory,"Cancer deaths attributable to alcohol\nProportion (%) of cancer deaths caused by alcohol drinking in men ages 15 years or older, 2016",Smoking prevalence male\nPrevalence (%) of daily smoking for men,Smoking prevalence female\nPrevalence (%) of daily smoking for women,"Cancers attributable to infections\nProportion of cancers attributable to infections (%), by country","Obesity prevalence male\nInternational variation in the prevalence of obesity, 2016","Obesity prevalence female\nInternational variation in the prevalence of obesity, 2016","Melanoma skin cancer incidence\nAge-standardized rate (world) per 100,000, both sexes, 2018",Breastfeeding at 12 months\nPercent (%) of children who receive any breast milk at 12 months of age,Average births per woman\n2010-2015,...,"Most common cancer cases worldwide, females\n2018","Most common cancer deaths worldwide, females\n2018","Most common cancer cases worldwide, males\n2018","Most common cancer deaths worldwide, males\n2018","Cancer survivors\nEstimated number of cancer survivors diagnosed within the past five years per 100,000 population, both sexes, 2018","Years lived with disability due to cancer\nBoth sexes, all ages, 2017","Hepatitis B virus vaccination\nHepatitis B vaccination coverage (% of one-year-olds who have received three doses of hepatitis B vaccine), 2017","Radiotherapy availability\nNumber of radiotherapy machines per 1,000 cancer patients","Cervical cancer incidence rates\nAge-standardized rate (world) per 100,000, 2018","HIV prevalence (%)\nBoth sexes, 2017"
count,207,207,207,207,207,207,207,207,207,207,...,207,207,207,207,207,207,207,207,207,207
unique,204,96,145,114,142,137,152,111,135,49,...,6,7,11,11,180,185,42,101,142,43
top,Lesotho,No data,No data,No data,No data,No data,No data,No data,No data,No data,...,Breast,Breast,Prostate,Lung,No data,No data,99,0,No data,No data
freq,2,23,18,16,23,16,16,29,50,15,...,153,102,102,94,22,18,23,39,22,70


Summary of the dataset:

Country or Territory: 207 entries, 204 unique values, most frequent value is "Lesotho", appearing twice.
Cancer deaths attributable to alcohol: 207 entries, 96 unique values, "No data" is the most frequent, appearing 23 times.
Smoking prevalence male: 207 entries, 145 unique values, "No data" is the most frequent, appearing 18 times.
Smoking prevalence female: 207 entries, 114 unique values, "No data" is the most frequent, appearing 16 times.
Cancers attributable to infections: 207 entries, 142 unique values, "No data" is the most frequent, appearing 23 times.
Hepatitis B virus vaccination: 207 entries, 42 unique values, most frequent value is 99%, appearing 23 times.
Radiotherapy availability: 207 entries, 101 unique values, most frequent value is 0, appearing 39 times.
Cervical cancer incidence rates: 207 entries, 142 unique values, "No data" is the most frequent, appearing 22 times.
HIV prevalence: 207 entries, 43 unique values, "No data" is the most frequent, appearing 70 times.

In [16]:
#Checking for duplicates.
df.duplicated().sum()


3
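The check above only counts duplicated rows. As a minimal sketch of how such rows could be inspected and dropped, here is a toy DataFrame (the frame `toy` and its column names are hypothetical and not part of the cancer dataset):

```python
import pandas as pd

# Hypothetical toy data: the second "B" row is an exact copy of the first
toy = pd.DataFrame({
    "country": ["A", "B", "B", "C"],
    "rate": [1.0, 2.0, 2.0, 3.0],
})

# Count rows that are identical to an earlier row
n_dupes = toy.duplicated().sum()

# keep=False marks every copy, which helps when inspecting the duplicates
dupe_rows = toy[toy.duplicated(keep=False)]

# Keep the first occurrence of each row and rebuild a clean index
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_dupes, len(dupe_rows), len(deduped))
```

Whether duplicated rows should be dropped depends on the data; here they would likely be repeated country entries.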

In [18]:
#Checking for missing values in the whole dataset.
df.isnull().sum()

Country or Territory                                                                                                                                          0
Cancer deaths attributable to alcohol\nProportion (%) of cancer deaths caused by alcohol drinking in men ages 15 years or older, 2016                         0
Smoking prevalence male\nPrevalence (%) of daily smoking for men                                                                                              0
Smoking prevalence female\nPrevalence (%) of daily smoking for women                                                                                          0
Cancers attributable to infections\nProportion of cancers attributable to infections (%), by country                                                          0
Obesity prevalence male\nInternational variation in the prevalence of obesity, 2016                                                                           0
Obesity prevalence female\nInternational

## 1.2 Replace "No data" with NaN

In [20]:
import numpy as np

# Replace "No data" with NaN
df.replace("No data", np.nan, inplace=True)

# Check for missing values
print(df.isnull().sum())

Country or Territory                                                                                                                                           0
Cancer deaths attributable to alcohol\nProportion (%) of cancer deaths caused by alcohol drinking in men ages 15 years or older, 2016                         23
Smoking prevalence male\nPrevalence (%) of daily smoking for men                                                                                              18
Smoking prevalence female\nPrevalence (%) of daily smoking for women                                                                                          16
Cancers attributable to infections\nProportion of cancers attributable to infections (%), by country                                                          23
Obesity prevalence male\nInternational variation in the prevalence of obesity, 2016                                                                           16
Obesity prevalence female\nInterna
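After the replacement, a column that mixed numbers with "No data" may still have the `object` dtype, depending on the pandas version. A minimal sketch of an explicit conversion, assuming a toy frame with hypothetical column names (`smoking_male`, `top_cancer`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data mixing numbers and the "No data" placeholder
toy = pd.DataFrame({
    "country": ["A", "B", "C"],
    "smoking_male": [21.4, "No data", 36.7],
    "top_cancer": ["Breast", "Lung", "No data"],
})

toy = toy.replace("No data", np.nan)

# Convert the columns that should be numeric; anything unparseable becomes NaN
numeric_like = ["smoking_male"]  # assumption: chosen by hand for this sketch
for col in numeric_like:
    toy[col] = pd.to_numeric(toy[col], errors="coerce")

print(toy.dtypes)
```

Converting explicitly means that a later `select_dtypes(include=np.number)` call picks up these columns regardless of how `replace` handled the dtype.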

Handling missing values for numerical columns

In [21]:
from sklearn.impute import SimpleImputer

# Identify numerical columns
num_cols = df.select_dtypes(include=np.number).columns

# Impute missing values with the mean
imputer = SimpleImputer(strategy='mean')
df[num_cols] = imputer.fit_transform(df[num_cols])

# Check again for missing values in numerical columns to confirm
print(df[num_cols].isnull().sum())


Cancer deaths attributable to alcohol\nProportion (%) of cancer deaths caused by alcohol drinking in men ages 15 years or older, 2016                         0
Smoking prevalence male\nPrevalence (%) of daily smoking for men                                                                                              0
Smoking prevalence female\nPrevalence (%) of daily smoking for women                                                                                          0
Cancers attributable to infections\nProportion of cancers attributable to infections (%), by country                                                          0
Obesity prevalence male\nInternational variation in the prevalence of obesity, 2016                                                                           0
Obesity prevalence female\nInternational variation in the prevalence of obesity, 2016                                                                         0
Melanoma skin cancer incidence\nAge-stan

Handling missing values for categorical columns

In [22]:
# Identify categorical columns
cat_cols = df.select_dtypes(include='object').columns

# Fill missing values with the mode (most frequent value)
for col in cat_cols:
    df[col] = df[col].fillna(df[col].mode()[0])

# Check again for missing values in categorical columns to confirm
print(df[cat_cols].isnull().sum())


Country or Territory                                                                                                                      0
Cancer rank as leading cause of death among 30-69\n2016                                                                                   0
Breast most frequently diagnosed cancer in women\nCountries where breast cancer is the most frequently diagnosed cancer in women, 2018    0
Human Development Index (HDI) levels\n2017                                                                                                0
Most common cancer cases worldwide, females\n2018                                                                                         0
Most common cancer deaths worldwide, females\n2018                                                                                        0
Most common cancer cases worldwide, males\n2018                                                                                           0
Most common cancer d

Encoding categorical variables

In [24]:
from sklearn.preprocessing import LabelEncoder

# Encode categorical variables
for col in cat_cols:
    df[col] = LabelEncoder().fit_transform(df[col])

# Display data types to confirm encoding
print(df.dtypes)


Country or Territory                                                                                                                                            int64
Cancer deaths attributable to alcohol\nProportion (%) of cancer deaths caused by alcohol drinking in men ages 15 years or older, 2016                         float64
Smoking prevalence male\nPrevalence (%) of daily smoking for men                                                                                              float64
Smoking prevalence female\nPrevalence (%) of daily smoking for women                                                                                          float64
Cancers attributable to infections\nProportion of cancers attributable to infections (%), by country                                                          float64
Obesity prevalence male\nInternational variation in the prevalence of obesity, 2016                                                                           float64
Obes
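LabelEncoder maps each category to an integer, which implies an ordering that nominal columns such as the most common cancer type do not have. Tree-based models usually tolerate this, but one-hot encoding is a common alternative. A minimal sketch on a toy column (the name `top_cancer` is hypothetical):

```python
import pandas as pd

# Hypothetical nominal column with three categories
toy = pd.DataFrame({"top_cancer": ["Breast", "Lung", "Prostate", "Lung"]})

# One indicator column per category; no artificial order is introduced
one_hot = pd.get_dummies(toy, columns=["top_cancer"], dtype=int)

print(one_hot)
```

The trade-off is width: a column with many distinct values, such as the country name, would produce one indicator per value.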

Feature scaling

In [25]:
from sklearn.preprocessing import StandardScaler

# Scale numerical features
scaler = StandardScaler()
df[num_cols] = scaler.fit_transform(df[num_cols])

# Display the first few rows to confirm changes
print(df.head())


   Country or Territory  \
0                     0   
1                     2   
2                    10   
3                     1   
4                     7   

   Cancer deaths attributable to alcohol\nProportion (%) of cancer deaths caused by alcohol drinking in men ages 15 years or older, 2016  \
0                                          -1.728551                                                                                       
1                                          -1.512090                                                                                       
2                                          -0.754476                                                                                       
3                                          -0.141170                                                                                       
4                                          -0.105093                                                                                     
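The imputer and scaler above are fitted on the whole DataFrame, so statistics from rows that later end up in the test set influence the preprocessing. A minimal sketch of one way to avoid that, assuming synthetic data and a scikit-learn pipeline fitted on the training split only:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic numeric features (an assumption for illustration), ~10% missing
X = rng.normal(loc=10, scale=3, size=(100, 3))
X[rng.random(X.shape) < 0.1] = np.nan

X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

# Mean imputation followed by standardization
prep = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler())

# Means and standard deviations are learned from the training rows only
X_train_p = prep.fit_transform(X_train)

# The test rows are transformed with the training statistics
X_test_p = prep.transform(X_test)

print(X_train_p.mean(axis=0).round(6), X_test_p.shape)
```

With this arrangement the features only need to be scaled once, inside the pipeline.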

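Bagging, short for Bootstrap Aggregating, is an ensemble learning technique that improves the stability and accuracy of machine learning algorithms. Each model is trained on a bootstrap sample (rows drawn with replacement from the training data) and the predictions are averaged, which reduces variance and helps to avoid overfitting.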
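To make the idea concrete, here is a minimal hand-written sketch of bagging on synthetic data (a noisy sine curve, chosen only for illustration). It is not how BaggingRegressor is implemented internally, just the same two steps: bootstrap, then aggregate.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): noisy sine curve
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

n_models = 25
models = []
for _ in range(n_models):
    # Bootstrap: draw as many rows as the dataset has, with replacement
    idx = rng.integers(0, len(X), size=len(X))
    models.append(DecisionTreeRegressor(random_state=0).fit(X[idx], y[idx]))

# Aggregate: average the predictions of all trees
X_new = np.linspace(0, 6, 50).reshape(-1, 1)
all_preds = np.stack([m.predict(X_new) for m in models])  # shape (25, 50)
bagged_pred = all_preds.mean(axis=0)

# Compare against the noise-free curve
true = np.sin(X_new[:, 0])
avg_single_mse = np.mean((all_preds - true) ** 2)  # mean MSE of the individual trees
bagged_mse = np.mean((bagged_pred - true) ** 2)
print(f"average single-tree MSE: {avg_single_mse:.3f}, bagged MSE: {bagged_mse:.3f}")
```

The squared error of the averaged prediction can never exceed the average squared error of the individual trees, which is the variance reduction that bagging relies on.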

In [26]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, r2_score

Feature selection

In [27]:
X = df.drop('Cancers attributable to infections\nProportion of cancers attributable to infections (%), by country', axis=1)
y = df['Cancers attributable to infections\nProportion of cancers attributable to infections (%), by country']

In [28]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [29]:
# Standardize the data (important for some models)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

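# Create a Bagging Regressor with Decision Trees.
# The base tree is passed positionally because the keyword was renamed
# from base_estimator to estimator in scikit-learn 1.2.
bagging_model = BaggingRegressor(DecisionTreeRegressor(), n_estimators=100, random_state=42)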

# Train the model
bagging_model.fit(X_train, y_train)

# Predict on the test set
y_pred = bagging_model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Mean Squared Error: {mse}')
print(f'R^2 Score: {r2}')



Mean Squared Error: 0.18007396423823374
R^2 Score: 0.8125381794333825


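Summary:
An MSE of 0.180 indicates that the model's predictions are close to the actual values on average. Because the target column was standardized in the feature scaling step, this error is expressed in squared standard-deviation units rather than percentage points.
An R^2 score of 0.813 shows that about 81% of the variance in the target variable on the test set is explained by the model.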
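The two metrics are linked: R^2 equals one minus the MSE divided by the variance of the true values. A small sketch with made-up numbers (the arrays `y_true` and `y_hat` are hypothetical):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true values and predictions
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_hat = np.array([1.1, 1.9, 3.2, 3.7, 5.1])

mse = mean_squared_error(y_true, y_hat)

# np.var uses the population variance (ddof=0), which matches r2_score
r2_manual = 1 - mse / np.var(y_true)

print(mse, r2_manual, r2_score(y_true, y_hat))
```

Plugging in the bagging results reported above, 1 - 0.180 / Var(y_test) = 0.813 gives a test-target variance of roughly 0.96, which is consistent with a target that was standardized before the split.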

BOOSTING

Boosting is a sequential ensemble technique that builds models iteratively. Each new model focuses on the errors made by the previous models and tries to correct them. This approach often leads to a more accurate model, mainly by reducing bias, although it can overfit if too many models are added without regularization.

Gradient Boosting:

Gradient Boosting is a powerful technique because it can model complex patterns in the data by building successive models that learn from the errors of previous ones.
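As a rough illustration of "learning from the errors of previous models", here is a hand-written sketch of gradient boosting for squared error on synthetic data. The learning rate, tree depth, and number of rounds are arbitrary choices for this sketch, and it omits most of what GradientBoostingRegressor does.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): noisy sine curve
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

learning_rate = 0.1
prediction = np.full(len(y), y.mean())  # stage 0: predict the mean everywhere
train_mse = [np.mean((y - prediction) ** 2)]
stages = []

for _ in range(50):
    # For squared error, the residual is what the ensemble still gets wrong
    residual = y - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residual)
    # Add a damped correction from the new tree to the running prediction
    prediction = prediction + learning_rate * tree.predict(X)
    stages.append(tree)
    train_mse.append(np.mean((y - prediction) ** 2))

print(f"training MSE: start {train_mse[0]:.3f}, after 50 rounds {train_mse[-1]:.3f}")
```

The final prediction is the initial mean plus a weighted sum of the trees' outputs. The training error only goes down here, which is also why boosting needs a held-out set to judge when to stop.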

In [34]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score


Feature selection

In [31]:
X = df.drop('Cancers attributable to infections\nProportion of cancers attributable to infections (%), by country', axis=1)
y = df['Cancers attributable to infections\nProportion of cancers attributable to infections (%), by country']

In [38]:

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the data (important for some models)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Create a Gradient Boosting Regressor
boosting_model = GradientBoostingRegressor(n_estimators=100, random_state=42)

# Train the model
boosting_model.fit(X_train, y_train)

# Predict on the test set
y_pred = boosting_model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Mean Squared Error: {mse}')
print(f'R^2 Score: {r2}')

Mean Squared Error: 0.25185026343867306
R^2 Score: 0.7378171292328797


An MSE of approximately 0.252 indicates that, on average, the squared difference between the predicted and actual values is relatively small, but larger than the 0.180 achieved with bagging. The model makes fairly accurate predictions, though there is room for improvement.

An R^2 score of approximately 0.738 means that about 73.8% of the variance in the target variable is explained by the model, which indicates a decent fit.

Bagging resulted in a lower MSE and a higher R^2 than gradient boosting, which suggests that boosting with these settings is less effective at capturing the underlying patterns of this specific dataset.

ADABOOST

In [40]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, r2_score

In [41]:
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the data (important for some models)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

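# Create an AdaBoost Regressor with DecisionTreeRegressor as the base estimator.
# The base tree is passed positionally because the keyword was renamed
# from base_estimator to estimator in scikit-learn 1.2.
ada_model = AdaBoostRegressor(DecisionTreeRegressor(max_depth=4), n_estimators=100, random_state=42)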

# Train the model
ada_model.fit(X_train, y_train)

# Predict on the test set
y_pred = ada_model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Mean Squared Error: {mse}')
print(f'R^2 Score: {r2}')



Mean Squared Error: 0.23633699671369643
R^2 Score: 0.753966855460666


STACKING

Stacking, also known as stacked generalization, is an ensemble learning technique that combines several different machine learning models to improve overall performance. The idea is to train several base models and then use another model, called a meta-model or blender, to make the final predictions based on the outputs of the base models.
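The key detail is that the meta-model should be trained on predictions the base models made for rows they did not see during fitting. A minimal hand-written sketch on synthetic data (the choice of Ridge and a random forest as base models is an assumption for illustration):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_predict, train_test_split

# Synthetic regression data (an assumption for illustration)
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

base_models = [
    Ridge(alpha=1.0),
    RandomForestRegressor(n_estimators=50, random_state=0),
]

# Out-of-fold predictions: each training row is predicted by a model
# that was fitted without that row
oof = np.column_stack(
    [cross_val_predict(m, X_train, y_train, cv=5) for m in base_models]
)

# The meta-model learns how to weight the base models' predictions
meta = LinearRegression().fit(oof, y_train)

# Refit the base models on all training data, then stack their test predictions
fitted = [m.fit(X_train, y_train) for m in base_models]
test_features = np.column_stack([m.predict(X_test) for m in fitted])
stacked_pred = meta.predict(test_features)

print("meta-model weights:", meta.coef_.round(3))
```

StackingRegressor with `cv=5` follows the same general pattern, so in practice there is rarely a reason to write this by hand.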

In [42]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import StackingRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

In [44]:
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the data (important for some models)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Define base models
base_models = [
    ('rf', RandomForestRegressor(n_estimators=100, random_state=42)),
    ('gbr', GradientBoostingRegressor(n_estimators=100, random_state=42)),
    ('abr', AdaBoostRegressor(n_estimators=100, random_state=42))
]

# Define the stacking model
stacking_model = StackingRegressor(
    estimators=base_models,
    final_estimator=LinearRegression(),
    cv=5
)

# Train the stacking model
stacking_model.fit(X_train, y_train)

# Predict on the test set
y_pred = stacking_model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Mean Squared Error: {mse}')
print(f'R^2 Score: {r2}')

Mean Squared Error: 0.20066458024431041
R^2 Score: 0.7911027965926924


ANALYSIS

1  Bagging:

MSE: 0.180 (lowest of the models compared)
R^2: 0.813 (highest of the models compared)
Interpretation: Bagging (here, a BaggingRegressor over decision trees) performs best in this case, with the lowest error and the highest proportion of variance explained. This indicates a strong and accurate model.

2  Boosting:

MSE: 0.252 for Gradient Boosting (highest of the models compared), 0.236 for AdaBoost
R^2: 0.738 for Gradient Boosting (lowest of the models compared), 0.754 for AdaBoost
Interpretation: Boosting, while effective in many cases, underperforms bagging and stacking in this scenario. Gradient Boosting has the highest error and the lowest R^2 score, and AdaBoost is only slightly better, which suggests that boosting with these settings is not the best choice for this specific dataset.


3  Stacking:

MSE: 0.201
R^2: 0.791
Interpretation: Stacking performs well, with an MSE slightly higher than bagging but lower than boosting, and an R^2 score slightly lower than bagging but higher than boosting. Stacking is a strong performer and could be a good choice, especially if the base models complement each other well.

Conclusion:
Bagging appears to be the best technique for this dataset, offering the lowest error and the highest explanatory power.
Stacking is also a strong option and may be useful when the base models complement each other.
Boosting did not perform as well as the other two techniques in this instance but can still be useful in other scenarios or with different parameter tuning.
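All of the scores above come from a single 80/20 split of 207 rows, so the ranking could change with a different split. A minimal sketch of a cross-validated comparison, using synthetic data and smaller ensembles than the notebook (both are assumptions made to keep the example quick):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (an assumption for illustration)
X, y = make_regression(n_samples=200, n_features=10, noise=15.0, random_state=0)

# The same shuffled 5-fold split is reused for every candidate
cv = KFold(n_splits=5, shuffle=True, random_state=42)

candidates = {
    "bagging": BaggingRegressor(
        DecisionTreeRegressor(), n_estimators=50, random_state=42
    ),
    "boosting": GradientBoostingRegressor(n_estimators=50, random_state=42),
}

scores = {}
for name, model in candidates.items():
    # One R^2 value per fold
    fold_r2 = cross_val_score(model, X, y, cv=cv, scoring="r2")
    scores[name] = fold_r2
    print(f"{name}: mean R^2 = {fold_r2.mean():.3f} (std {fold_r2.std():.3f})")
```

Reporting the spread across folds alongside the mean shows whether a gap such as 0.813 versus 0.738 is larger than the split-to-split noise.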