# UCS2612 Machine Learning Laboratory
## A8 – Applications of Random Forest and AdaBoost Ensemble Techniques

**Name:** Rakshith

**Reg No:** 3122215001078



## 1. Loading the Dataset

- The Breast Cancer Wisconsin (Diagnostic) dataset (UCI repository id 17) is imported using the `fetch_ucirepo` function from the `ucimlrepo` library.
- The features and target labels are extracted and stored in separate variables (`X` and `y`).


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler, MinMaxScaler
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier
from sklearn.metrics import accuracy_score, roc_curve, auc
from ucimlrepo import fetch_ucirepo

# Task 1: Loading the dataset (requires the ucimlrepo package, e.g. `pip install ucimlrepo`)
breast_cancer_wisconsin_diagnostic = fetch_ucirepo(id=17)

# Extract features and target variable
X = breast_cancer_wisconsin_diagnostic.data.features
y = breast_cancer_wisconsin_diagnostic.data.targets
column_names = list(X.columns)
target_column_name = list(y.columns)



In [None]:
# Display features and targets
print("Features:")
print(X.head())


In [None]:
print("Targets:")
print(y.head())

## 2. Pre-processing the data

- The feature columns are checked for missing values to ensure data completeness.
- The dataset contains no missing values, so no imputation is performed.
- Non-numeric columns are converted to integer codes using label encoding (`LabelEncoder`).
- Numeric features are standardized using `StandardScaler` to have a mean of 0 and a standard deviation of 1.
- The standardized features are then normalized using `MinMaxScaler`, which rescales each feature to the range [0, 1] by default.
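
To make the two scaling steps concrete, here is a minimal sketch in plain Python (no scikit-learn needed). The helper names `standardize` and `min_max` and the sample values are hypothetical, chosen only for illustration; the sketch assumes the population standard deviation, which is what `StandardScaler` uses.

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def min_max(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical feature column (made-up values, not taken from the dataset)
radius = [12.0, 14.5, 17.0, 20.5, 11.0]

z_scores = standardize(radius)
scaled_after_standardizing = min_max(z_scores)
scaled_directly = min_max(radius)

print([round(z, 3) for z in z_scores])
print([round(s, 3) for s in scaled_after_standardizing])
print([round(s, 3) for s in scaled_directly])
```

Both transforms are linear with a positive slope, so min-max scaling of the standardized column gives the same values as min-max scaling of the raw column. When the two steps are chained, the final features therefore lie in [0, 1] and the standardization no longer affects the result.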


### Handling Missing Values

In [None]:
# Check for missing values in the DataFrame
missing_values = X.isnull().sum()

# Check if there are any missing values in each column
columns_with_missing_values = missing_values[missing_values > 0]

if columns_with_missing_values.empty:
    print("No missing values in the DataFrame")
else:
    print("Columns with missing values:")
    print(columns_with_missing_values)




### Encoding

In [None]:

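# Encode the target labels as integers (np.ravel flattens the single-column target to 1D)
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(np.ravel(y))
y = pd.DataFrame(y, columns=target_column_name)

# Encode only the non-numeric feature columns, each with its own encoder,
# so numeric features keep their values and label_encoder stays fitted on the target
X = pd.DataFrame(X, columns=column_names).copy()
for col in X.select_dtypes(exclude='number').columns:
    X[col] = LabelEncoder().fit_transform(X[col])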




In [None]:
print("Features after Handling Missing Values & Encoding:")
print(X.head())


In [None]:
print("Target after encoding")
print(y.head())


### Standardization

In [None]:

# Standardization
scaler = StandardScaler()
X = scaler.fit_transform(X)
X = pd.DataFrame(X, columns=column_names)
print("DataFrame Head after Standardization:")
print(X.head())


### Normalization

In [None]:

# Normalization
minmax_scaler = MinMaxScaler()
X = minmax_scaler.fit_transform(X)
X = pd.DataFrame(X, columns=column_names)
print("DataFrame Head after Normalization:")
print(X.head())


## 3. Exploratory Data Analysis

- A pie chart shows the class distribution of the target variable (benign vs. malignant).
- Pairwise scatter plots of the first five features reveal potential correlations or patterns between pairs of numeric features.
- A heatmap of the feature correlation matrix shows the strength and direction of the linear relationships among the features.
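
Each cell of the correlation heatmap holds a Pearson correlation coefficient, which is the default for `DataFrame.corr()`. The sketch below computes that coefficient by hand in plain Python. The function name `pearson` and the three sample columns are hypothetical and not taken from the dataset.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Made-up columns for illustration only
radius = [1.0, 2.0, 3.0, 4.0]
perimeter = [2 * 3.14159 * r for r in radius]  # exactly linear in radius
unrelated = [3.0, 1.0, 4.0, 2.0]               # no linear trend with radius

r_linear = pearson(radius, perimeter)
r_none = pearson(radius, unrelated)
r_inverse = pearson(radius, list(reversed(radius)))

print(round(r_linear, 3), round(r_none, 3), round(r_inverse, 3))
```

Values near +1 or -1 indicate a strong linear relationship and values near 0 indicate none. Pairs of features with coefficients close to 1 in the heatmap carry largely redundant information.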


### Pie Chart for Target Variable

In [None]:
# Pie chart for target variable

plt.figure(figsize=(6, 6))
y_1d = y.squeeze()  # Single-column DataFrame -> 1D Series of integer class codes
# LabelEncoder assigns codes in sorted order, so 0 = 'B' (benign) and 1 = 'M' (malignant)
plt.pie(np.bincount(y_1d), labels=['Benign', 'Malignant'], autopct='%1.1f%%', colors=['skyblue', 'lightcoral'])
plt.title('Distribution of Diagnosis')
plt.show()


### Pairwise Scatter Plot for First 5 Features

In [None]:

# Pairwise scatter plot for first 5 features
sns.pairplot(X.iloc[:, :5], diag_kind='kde')
plt.suptitle('Pairwise Scatter Plot for First 5 Features', y=1.02)
plt.show()


### Correlation Heatmap

In [None]:
# Correlation heatmap
plt.figure(figsize=(21, 18))
correlation_matrix = X.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)
plt.title('Correlation Heatmap')
plt.show()


## 4. Feature Engineering Techniques (Selecting best k features)

- Feature selection is applied to keep the features most relevant for modeling.
- `SelectKBest` from scikit-learn's `feature_selection` module ranks the features by their ANOVA F-value (`f_classif`) with respect to the target and keeps the top k.
- Here k = 20, so the 20 highest-scoring features are retained and the feature matrix is reduced accordingly.
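
The ANOVA F-value compares how far apart the class means are with how much the values vary inside each class. The following is a minimal plain-Python sketch of that score and of a top-k selection based on it. The function `anova_f`, the feature names, and the values are hypothetical and serve only to illustrate the idea behind `f_classif` and `SelectKBest`.

```python
def anova_f(groups):
    """One-way ANOVA F statistic for a single feature split by class."""
    all_values = [v for group in groups for v in group]
    n_total, n_groups = len(all_values), len(groups)
    grand_mean = sum(all_values) / n_total
    # Between-class variation: distance of each class mean from the grand mean
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-class variation: spread of the values around their own class mean
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    return (ss_between / (n_groups - 1)) / (ss_within / (n_total - n_groups))

# Made-up features: (values for benign samples, values for malignant samples)
features = {
    "strong_feature": ([1.0, 2.0, 3.0], [7.0, 8.0, 9.0]),
    "weak_feature": ([1.0, 5.0, 9.0], [2.0, 6.0, 10.0]),
}

scores = {name: anova_f(groups) for name, groups in features.items()}
k = 1  # hypothetical k for this toy example
top_k = sorted(scores, key=scores.get, reverse=True)[:k]

print(scores)
print(top_k)
```

A feature whose class means are well separated relative to its within-class spread receives a high score and is kept; a feature whose classes overlap heavily receives a low score and is dropped.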


In [None]:
from sklearn.feature_selection import SelectKBest, f_classif

# Using SelectKBest for feature selection
k = 20  # Number of top features to select
selector = SelectKBest(score_func=f_classif, k=k)
# f_classif expects a 1D target, so flatten the single-column target first
X = selector.fit_transform(X, np.ravel(y))


## 5. Split the data into training, testing and validation sets.

- The dataset is split into a training set (70%) and a test set (30%) using `train_test_split` with `random_state=42`, so that model performance is evaluated on unseen data. No separate validation set is created.
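
The section title also mentions a validation set, which the split in this section does not produce. As a sketch of what a three-way split could look like, the plain-Python function below shuffles the row indices once and cuts them into three disjoint parts. The function name `three_way_split` and the 70/15/15 proportions are assumptions made for illustration, not values from the notebook. With scikit-learn, the same effect is commonly obtained by calling `train_test_split` twice.

```python
import random

def three_way_split(rows, val_fraction=0.15, test_fraction=0.15, seed=42):
    """Shuffle once, then cut the rows into disjoint train/validation/test parts."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_test = round(len(rows) * test_fraction)
    n_val = round(len(rows) * val_fraction)
    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]
    return ([rows[i] for i in train_idx],
            [rows[i] for i in val_idx],
            [rows[i] for i in test_idx])

rows = list(range(100))  # stand-in for 100 samples
train, val, test = three_way_split(rows)
print(len(train), len(val), len(test))
```

The validation part would be used for choosing hyperparameters, leaving the test part untouched until the final evaluation.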



In [None]:

# Task 5: Split the data into training (70%) and test (30%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


## 6. Train the Model

- Three ensemble models (Bagging, Random Forest, and AdaBoost) are trained on the training data using scikit-learn's `BaggingClassifier`, `RandomForestClassifier`, and `AdaBoostClassifier` with default hyperparameters and `random_state=40`.
- Each model learns to predict the target labels from the input features.
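
Bagging and Random Forest train their base learners independently on resampled data, whereas AdaBoost trains them in sequence and reweights the samples after every round. The sketch below shows one such reweighting round in the classic discrete AdaBoost formulation with labels in {-1, +1}. The function name `adaboost_round` and the toy labels are hypothetical, and scikit-learn's `AdaBoostClassifier` implements a related variant (SAMME) whose formulas may differ in detail.

```python
from math import exp, log

def adaboost_round(y_true, y_pred, weights):
    """One boosting round: score the weak learner, then reweight the samples."""
    # Weighted error: total weight of the misclassified samples
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    # Learner weight: large when the error is small
    alpha = 0.5 * log((1 - err) / err)
    # Misclassified samples (t * p == -1) are scaled up, correct ones scaled down
    updated = [w * exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(updated)
    return alpha, [w / total for w in updated]

# Toy example: five samples, the weak learner gets the second one wrong
y_true = [1, 1, -1, -1, 1]
y_pred = [1, -1, -1, -1, 1]
weights = [0.2] * 5  # uniform starting weights

alpha, new_weights = adaboost_round(y_true, y_pred, weights)
print(round(alpha, 4), [round(w, 3) for w in new_weights])
```

After the update, the misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right.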


In [None]:

# Task 6: Train the model
# Ensemble Models: Bagging, Random Forest, AdaBoost
# np.ravel flattens the single-column target to the 1D shape scikit-learn expects
y_train_1d = np.ravel(y_train)

bagging_clf = BaggingClassifier(random_state=40)
bagging_clf.fit(X_train, y_train_1d)

rf_clf = RandomForestClassifier(random_state=40)
rf_clf.fit(X_train, y_train_1d)

adaboost_clf = AdaBoostClassifier(random_state=40)
adaboost_clf.fit(X_train, y_train_1d)


## 7. Test the Model

- The trained models are used to make predictions on the test data to evaluate their performance.
- The predictions are compared to the true labels to assess the accuracy and effectiveness of the models.


In [None]:

# Task 7: Test the model
y_pred_bagging = bagging_clf.predict(X_test)
y_pred_rf = rf_clf.predict(X_test)
y_pred_adaboost = adaboost_clf.predict(X_test)


## 8. Measure the performance of the trained model

- The accuracy of each ensemble model is calculated using scikit-learn's `accuracy_score` function.
- The accuracy metric quantifies the proportion of correctly classified instances in the test set.
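
As a worked example of the metric, the plain-Python sketch below counts matching labels by hand. The helper `accuracy` and the toy labels are hypothetical. The second part assumes that the dataset has 569 samples, so that a 30% test split yields 171 test samples; under that assumption, the accuracies reported in the Inference section correspond to 163, 164 and 166 correct predictions.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Toy labels: 6 of the 8 predictions are correct
y_true = [0, 0, 1, 1, 0, 1, 0, 0]
y_pred = [0, 0, 1, 0, 0, 1, 0, 1]
toy_accuracy = accuracy(y_true, y_pred)
print(toy_accuracy)

# Assumption: 171 test samples (30% of 569, rounded up)
n_test = 171
reported = {correct: round(correct / n_test, 4) for correct in (163, 164, 166)}
print(reported)
```

With a test set of this size, each additional misclassified sample changes the accuracy by roughly 0.6 percentage points, so the gaps between the three models amount to only a few samples.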


In [None]:

# Task 8: Measure the performance of the trained model
accuracy_bagging = accuracy_score(y_test, y_pred_bagging)
accuracy_rf = accuracy_score(y_test, y_pred_rf)
accuracy_adaboost = accuracy_score(y_test, y_pred_adaboost)
print("#########ACCURACY#########")
print(f"Bagging: {accuracy_bagging}")
print(f"Random Forest: {accuracy_rf}")
print(f"AdaBoost: {accuracy_adaboost}")


## 9. Compare the results of each ensemble model using graphs

- A bar plot compares the test accuracy of the three ensemble models.
- Confusion matrices show, for each model, how many benign and malignant cases are classified correctly and where the errors occur, which gives insight into each model's strengths and weaknesses.


In [None]:

# Task 9: Compare the results of each ensemble model using graphs
# Bar plot for accuracy comparison
models = ['Bagging', 'Random Forest', 'AdaBoost']
accuracies = [accuracy_bagging, accuracy_rf, accuracy_adaboost]

plt.figure(figsize=(8, 6))
plt.bar(models, accuracies, color=['skyblue', 'lightcoral', 'lightgreen'])
plt.xlabel('Ensemble Models')
plt.ylabel('Accuracy')
plt.title('Accuracy Comparison of Ensemble Models')
plt.ylim(0.9, 1.0)
plt.show()


### Confusion Matrix

In [None]:
from sklearn.metrics import confusion_matrix

# Plot confusion matrix for each model
plt.figure(figsize=(15, 5))
for i, (clf, name) in enumerate([(bagging_clf, 'Bagging'), (rf_clf, 'Random Forest'), (adaboost_clf, 'AdaBoost')], 1):
    plt.subplot(1, 3, i)
    y_pred = clf.predict(X_test)
    cm = confusion_matrix(y_test, y_pred)
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False,
                xticklabels=['Benign', 'Malignant'], yticklabels=['Benign', 'Malignant'])
    plt.title(f'Confusion Matrix - {name}')
    plt.xlabel('Predicted label')
    plt.ylabel('True label')
plt.tight_layout()
plt.show()


## 10. Represent the ROC of training and test results in the graphs

- Receiver Operating Characteristic (ROC) curves are plotted for each ensemble model to visualize their performance in terms of true positive rate (sensitivity) and false positive rate.
- The Area Under the ROC Curve (AUC) is calculated to quantify the model's discrimination ability between the positive and negative classes.
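
To show where the points of an ROC curve come from, the sketch below lowers the decision threshold one distinct score at a time, records the false and true positive rates, and integrates the curve with the trapezoidal rule. It is written in plain Python. The function names `roc_points` and `trapezoid_auc` and the toy scores are hypothetical, and the sketch is meant to illustrate the idea behind `roc_curve` and `auc` rather than to reproduce their implementation.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) lists obtained by sweeping the decision threshold."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    fpr, tpr = [0.0], [0.0]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        tpr.append(tp / n_pos)
        fpr.append(fp / n_neg)
    return fpr, tpr

def trapezoid_auc(fpr, tpr):
    """Area under the curve using the trapezoidal rule."""
    return sum((fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
               for i in range(1, len(fpr)))

# Toy example: 1 = positive class, scores = predicted probability of class 1
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

fpr, tpr = roc_points(y_true, scores)
area = trapezoid_auc(fpr, tpr)
print(list(zip(fpr, tpr)))
print(area)
```

An AUC of 1.0 means every positive sample is scored above every negative one, while 0.5 corresponds to the diagonal "random guessing" line.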


In [None]:

# Task 10: Represent the ROC of training and test results in the graphs
# Plot ROC curves for the training set (left) and the test set (right)
classifiers = [(bagging_clf, 'Bagging'), (rf_clf, 'Random Forest'), (adaboost_clf, 'AdaBoost')]
datasets = [(X_train, y_train, 'Training Set'), (X_test, y_test, 'Test Set')]

fig, axes = plt.subplots(1, 2, figsize=(16, 6))
for ax, (X_part, y_part, set_name) in zip(axes, datasets):
    for clf, name in classifiers:
        y_score = clf.predict_proba(X_part)[:, 1]  # Probability of the positive class
        fpr, tpr, _ = roc_curve(np.ravel(y_part), y_score)
        roc_auc = auc(fpr, tpr)
        ax.plot(fpr, tpr, label=f'{name} (AUC = {roc_auc:.5f})')
    ax.plot([0, 1], [0, 1], 'k--', label='Random Guessing')
    ax.set_xlabel('False Positive Rate')
    ax.set_ylabel('True Positive Rate')
    ax.set_title(f'Receiver Operating Characteristic (ROC) - {set_name}')
    ax.legend(loc='lower right')
    ax.grid(True)
plt.tight_layout()
plt.show()


## Inference
- Random Forest achieved the highest test accuracy of the three ensemble models, at 97.08%.
- AdaBoost (95.91%) and Bagging (95.32%) achieved slightly lower accuracies.
- Despite these differences in accuracy, all three models achieved AUC scores above 0.99.
- Bagging and Random Forest had similar AUC scores, with Random Forest ahead of Bagging by a small margin.
- AdaBoost, although less accurate than Random Forest, still had a high AUC score, indicating good discrimination between the two classes.
- Overall, all three ensemble models performed well in diagnosing breast cancer, with Random Forest showing a slight advantage in both accuracy and AUC.
