# **ML Assignment 10**

By 23520011 - Sharaneshwar Punjal

Download the following dataset - https://www.kaggle.com/datasets/erdemtaha/cancer-data/data 

1. Drop the `Id` column.

2. Use the `Diagnosis` column as the target with Classes B and M.

3. Perform a train-test split: 80% for training and 20% for testing.

4. The following manipulation is performed to increase the skew in the data (for this assignment only; it should not be done in practice):
   - From train data:
      - Consider all rows with diagnosis label = M.
      - From these, randomly remove 120 rows with label M and append them to the test data.

5. Build 10 decision trees using:
   - Feature bagging: Use all features for each tree, but at each node split, randomly choose a subset of features using `max_features` in sklearn.
   - Sample bagging: If the train data size is N, train each tree using N samples selected with replacement.

6. Combine feature importance from all trees:
   - Either use simple average or a weighted average using tree accuracy as weight.
   - Feature importance can be fetched from `.feature_importances_` attribute or computed using `sklearn.inspection.permutation_importance`.

7. Shortlist the most important features based on the above step and drop the rest from training data.

8. Train 10 new decision trees using only the shortlisted features.

9. Build two models, each taking as input the shortlisted features together with the predictions from the 10 trees trained in step 8:
   - Logistic Regression Model
   - Master Decision Tree

10. On test data:
   - Keep only the shortlisted features.
   - Predict using the 10 trees from step 8.
   - Combine these predictions with the shortlisted features and:
      - Predict using the logistic regression model.
      - Predict using the master decision tree.
   - Compare accuracies of both models and evaluate which gives better performance. Note how much improvement they offer over the original 10 trees.

11. Since the training data was made more skewed by reducing the minority class (M), use class weights while training. Also, while evaluating accuracy, pay special attention to recall of the minority class.
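
Step 11 relies on class weights. As a minimal sketch of what scikit-learn's `class_weight='balanced'` option computes, each class is weighted by `n_samples / (n_classes * class_count)`. The label counts below are hypothetical and are not taken from this dataset; `balanced_class_weights` is an illustrative helper, not a library function.

```python
import numpy as np

# Hypothetical training labels: 8 benign (0) and 2 malignant (1).
y_toy = np.array([0] * 8 + [1] * 2)

def balanced_class_weights(y):
    """Return {class: weight} using n_samples / (n_classes * class_count)."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return {int(c): float(w) for c, w in zip(classes, weights)}

weights = balanced_class_weights(y_toy)
print(weights)  # {0: 0.625, 1: 2.5}
```

The rarer class receives the larger weight, so errors on malignant rows cost more during training.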

In [1]:
import pandas as pd
import numpy as np

In [None]:
df = pd.read_csv('Cancer_Data.csv')
df.head()

In [6]:
# Drop the identifier column and the stray 'Unnamed: 32' column in one call
df = df.drop(columns=['id', 'Unnamed: 32'])

# Separate features and target with binary mapping: 'B' → 0, 'M' → 1
X = df.drop('diagnosis', axis=1)
y = df['diagnosis'].map({'B': 0, 'M': 1})

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

X_train.head()

In [9]:
# Combine X and y to filter together
train_data = X_train.copy()
train_data['Diagnosis'] = y_train

# Filter rows where Diagnosis == 1 (Malignant)
m_samples = train_data[train_data['Diagnosis'] == 1]

# Randomly select 120 rows with label 1 (Malignant)
m_selected = m_samples.sample(n=120, random_state=42)

# Drop these rows from training data
train_data = train_data.drop(m_selected.index)

# Prepare test data
test_data = X_test.copy()
test_data['Diagnosis'] = y_test

# Append the 120 'Malignant' rows to test data
test_data = pd.concat([test_data, m_selected])

# Separate features and target again
X_train = train_data.drop('Diagnosis', axis=1)
y_train = train_data['Diagnosis']

X_test = test_data.drop('Diagnosis', axis=1)
y_test = test_data['Diagnosis']
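
The row-moving step can be checked on a small example. The sketch below wraps the same drop-and-concat logic in a hypothetical helper, `move_rows`, and runs it on toy data (both the helper name and the data are assumptions made for illustration).

```python
import pandas as pd

def move_rows(train, test, label_col, label, n, seed=0):
    """Move n randomly chosen rows with the given label from train to test."""
    picked = train[train[label_col] == label].sample(n=n, random_state=seed)
    return train.drop(picked.index), pd.concat([test, picked])

# Toy data (hypothetical): 6 training rows and 2 test rows.
train = pd.DataFrame({"x": range(6), "Diagnosis": [1, 1, 1, 0, 0, 0]})
test = pd.DataFrame({"x": [10, 11], "Diagnosis": [0, 1]}, index=[100, 101])

new_train, new_test = move_rows(train, test, "Diagnosis", label=1, n=2)

# Class counts after the move: train loses two malignant rows, test gains them.
print(new_train["Diagnosis"].value_counts().to_dict())
print(new_test["Diagnosis"].value_counts().to_dict())
```

Printing `y_train.value_counts()` and `y_test.value_counts()` after the cell above gives the same kind of confirmation on the real data.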


In [10]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

trees = []
for i in range(10):
    # 1. Bootstrap sampling
    X_sample, y_sample = resample(X_train, y_train, replace=True, n_samples=len(X_train), random_state=i)

    # 2. Train tree with feature bagging
    tree = DecisionTreeClassifier(max_features='sqrt', random_state=i)
    tree.fit(X_sample, y_sample)

    trees.append(tree)

In [None]:
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

for i in range(10):
    plt.figure(figsize=(20, 10))
    # Pass feature names as a plain list; some scikit-learn versions reject a pandas Index here
    plot_tree(trees[i], filled=True, feature_names=list(X.columns), class_names=["Benign", "Malignant"], rounded=True)
    plt.title(f"Decision Tree {i+1}")
    plt.show()

In [12]:
import numpy as np
from sklearn.metrics import accuracy_score

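# Uses X_test and y_test from the split above.
# Note: the tree accuracies are computed on the test set, so the weighted average
# uses test-set information; a separate validation split would avoid this leakage.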
n_features = X.shape[1]
feature_names = X.columns

# Store feature importances and accuracy
importances = []
accuracies = []

for tree in trees:
    # 1. Get feature importance
    importances.append(tree.feature_importances_)

    # 2. Compute accuracy of this tree
    y_pred = tree.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    accuracies.append(acc)

importances = np.array(importances)  # shape: (10, num_features)
accuracies = np.array(accuracies)    # shape: (10,)

# ---------------------------
# Simple average
avg_importance = np.mean(importances, axis=0)

# Weighted average using accuracy
weighted_importance = np.average(importances, axis=0, weights=accuracies)

# ---------------------------
# Show feature rankings
for i, name in enumerate(feature_names):
    print(f"{name:<30}  Simple Avg: {avg_importance[i]:.4f}  Weighted Avg: {weighted_importance[i]:.4f}")


radius_mean                     Simple Avg: 0.0177  Weighted Avg: 0.0167
texture_mean                    Simple Avg: 0.0231  Weighted Avg: 0.0230
perimeter_mean                  Simple Avg: 0.0049  Weighted Avg: 0.0049
area_mean                       Simple Avg: 0.0000  Weighted Avg: 0.0000
smoothness_mean                 Simple Avg: 0.0054  Weighted Avg: 0.0054
compactness_mean                Simple Avg: 0.0067  Weighted Avg: 0.0067
concavity_mean                  Simple Avg: 0.0244  Weighted Avg: 0.0246
concave points_mean             Simple Avg: 0.0752  Weighted Avg: 0.0774
symmetry_mean                   Simple Avg: 0.0002  Weighted Avg: 0.0002
fractal_dimension_mean          Simple Avg: 0.0000  Weighted Avg: 0.0000
radius_se                       Simple Avg: 0.0087  Weighted Avg: 0.0088
texture_se                      Simple Avg: 0.0000  Weighted Avg: 0.0000
perimeter_se                    Simple Avg: 0.0043  Weighted Avg: 0.0040
area_se                         Simple Avg: 0.0012 
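
To make the two averaging schemes concrete, here is a small worked example with hypothetical importances and accuracies (none of these numbers come from the notebook). It also checks that `np.average` with `weights` matches the explicit formula, the sum of `w_i * importance_i` divided by the sum of `w_i`.

```python
import numpy as np

# Hypothetical importances for 3 trees over 2 features (each row sums to 1).
imp = np.array([[0.8, 0.2],
                [0.6, 0.4],
                [0.5, 0.5]])
acc = np.array([0.9, 0.8, 0.7])  # hypothetical per-tree accuracies

simple = imp.mean(axis=0)
weighted = np.average(imp, axis=0, weights=acc)

# Same weighted average written out explicitly.
manual = (acc[:, None] * imp).sum(axis=0) / acc.sum()

print("simple:  ", simple.round(4))
print("weighted:", weighted.round(4))
print("matches manual formula:", np.allclose(weighted, manual))
```

When the per-tree accuracies are close to one another, the weights are nearly uniform and the two averages differ only slightly.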

In [None]:
import matplotlib.pyplot as plt

top_n = 10
indices = np.argsort(weighted_importance)[-top_n:]

plt.figure(figsize=(10, 6))
plt.barh(range(top_n), weighted_importance[indices], align='center')
plt.yticks(range(top_n), [feature_names[i] for i in indices])
plt.xlabel('Weighted Feature Importance')
plt.title('Top 10 Features (Weighted Avg)')
plt.show()
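
The assignment also allows `sklearn.inspection.permutation_importance` as an alternative to `.feature_importances_`. The sketch below is one possible way to use it, not what this notebook does. As an assumption, it uses scikit-learn's bundled breast-cancer data as a stand-in for `Cancer_Data.csv`, so the feature names differ and the label encoding is reversed (0 = malignant, 1 = benign).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): scikit-learn's bundled breast-cancer dataset.
data = load_breast_cancer()
X_tr, X_val, y_tr, y_val = train_test_split(
    data.data, data.target, test_size=0.2, random_state=0
)

tree = DecisionTreeClassifier(max_features="sqrt", random_state=0)
tree.fit(X_tr, y_tr)

# Shuffle each feature on held-out data and measure the drop in score.
result = permutation_importance(tree, X_val, y_val, n_repeats=5, random_state=0)

# Indices of the five features with the largest mean importance.
top = result.importances_mean.argsort()[::-1][:5]
for idx in top:
    print(f"{data.feature_names[idx]:<25} {result.importances_mean[idx]:.4f}")
```

Because it is measured on held-out data, permutation importance reflects how much the predictions depend on a feature rather than how often the tree split on it.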

In [14]:
# Number of top features to keep
top_k = 10

# Get indices of top features based on weighted importance
top_indices = np.argsort(weighted_importance)[-top_k:]

# Get their names
selected_features = [feature_names[i] for i in top_indices]

print("Selected top features:", selected_features)

Selected top features: ['concavity_worst', 'radius_mean', 'symmetry_worst', 'texture_mean', 'concavity_mean', 'perimeter_worst', 'concave points_mean', 'area_worst', 'radius_worst', 'concave points_worst']
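
Note that `np.argsort(...)[-top_k:]` returns the top features in ascending order of importance, which is why the most important feature appears last in the list above. A small sketch with hypothetical names and scores shows the difference between the two orderings:

```python
import numpy as np

names = ["a", "b", "c", "d"]                  # hypothetical feature names
scores = np.array([0.10, 0.40, 0.05, 0.45])   # hypothetical importances

top_k = 2
ascending = np.argsort(scores)[-top_k:]        # least important of the top-k first
descending = np.argsort(scores)[::-1][:top_k]  # most important first

print([names[i] for i in ascending])   # ['b', 'd']
print([names[i] for i in descending])  # ['d', 'b']
```

Either ordering selects the same set of columns, so the choice only affects how the list reads.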


In [15]:
# Keep only selected features
X_train_selected = X_train[selected_features]
X_test_selected = X_test[selected_features]

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

# Train new trees using only selected (shortlisted) features
trees_selected = []

for i in range(10):
    # Sample N rows with replacement (bootstrap sampling); seeded so reruns give the same trees
    X_sample, y_sample = resample(X_train_selected, y_train, replace=True,
                                  n_samples=len(X_train_selected), random_state=i)

    # Train tree using random feature selection at each split
    clf = DecisionTreeClassifier(max_features='sqrt', random_state=i)
    clf.fit(X_sample, y_sample)

    trees_selected.append(clf)

In [None]:
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# Visualize the trees trained in the previous cell.
# They are not retrained here, so the plots match the trees used in later steps.
for i, clf in enumerate(trees_selected):
    plt.figure(figsize=(20, 10))
    plot_tree(clf, filled=True, feature_names=list(X_train_selected.columns), class_names=["Benign", "Malignant"], rounded=True)
    plt.title(f"Decision Tree {i+1} (Shortlisted Features)")
    plt.show()

#### Step 1: Get Tree Predictions on Train Data
We generate 10 new columns (`tree_1`, ..., `tree_10`), one per tree trained on the shortlisted features, each holding that tree's predictions:

In [18]:
# Collect predictions of each tree on the training data
tree_outputs_train = []

for tree in trees_selected:
    preds = tree.predict(X_train_selected)
    tree_outputs_train.append(preds)

# Convert list of arrays to a DataFrame (transpose to make columns)
tree_outputs_train = np.array(tree_outputs_train).T
tree_output_df_train = pd.DataFrame(tree_outputs_train, columns=[f"tree_{i+1}" for i in range(10)])


#### Step 2: Concatenate with Shortlisted Features

In [19]:
# Concatenate tree outputs with selected original features
X_meta_train = pd.concat([X_train_selected.reset_index(drop=True), tree_output_df_train], axis=1)
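
The `reset_index(drop=True)` call matters because `pd.concat(..., axis=1)` aligns rows by index label, not by position. A minimal sketch with hypothetical data shows what happens without it:

```python
import pandas as pd

# Hypothetical features that kept their original (shuffled) row labels.
features = pd.DataFrame({"f": [1.0, 2.0, 3.0]}, index=[7, 4, 5])
# Tree outputs built from a NumPy array get a fresh 0..n-1 index.
tree_out = pd.DataFrame({"tree_1": [0, 1, 0]})

misaligned = pd.concat([features, tree_out], axis=1)
aligned = pd.concat([features.reset_index(drop=True), tree_out], axis=1)

print(misaligned.shape)  # (6, 2)
print(aligned.shape)     # (3, 2)
```

In the misaligned frame no labels match, so every row has a `NaN` in one of the two columns.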


#### Step 3: Repeat for Test Data

In [20]:
tree_outputs_test = []

for tree in trees_selected:
    preds = tree.predict(X_test_selected)
    tree_outputs_test.append(preds)

tree_outputs_test = np.array(tree_outputs_test).T
tree_output_df_test = pd.DataFrame(tree_outputs_test, columns=[f"tree_{i+1}" for i in range(10)])

X_meta_test = pd.concat([X_test_selected.reset_index(drop=True), tree_output_df_test], axis=1)


#### Step 4a: Train Logistic Regression Model

In [21]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Logistic Regression with class weight 'balanced'
log_model = LogisticRegression(max_iter=1000, class_weight='balanced')
log_model.fit(X_meta_train, y_train)

y_pred_log = log_model.predict(X_meta_test)
print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred_log))

Logistic Regression Accuracy: 0.9188034188034188


#### Step 4b: Train Master Decision Tree

In [None]:
# Master Decision Tree with class weight 'balanced'
master_tree = DecisionTreeClassifier(max_features='sqrt', class_weight='balanced', random_state=42)
master_tree.fit(X_meta_train, y_train)

y_pred_master = master_tree.predict(X_meta_test)
print("Master Decision Tree Accuracy:", accuracy_score(y_test, y_pred_master))

Master Decision Tree Accuracy: 0.8846153846153846


In [None]:
plt.figure(figsize=(20, 10))
# Pass feature names as a plain list; some scikit-learn versions reject a pandas Index here
plot_tree(master_tree, filled=True, feature_names=list(X_meta_train.columns), class_names=["Benign", "Malignant"], rounded=True)
plt.title("Master Decision Tree")
plt.show()
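
A plain majority vote over the trees is a useful extra reference point when judging the two stacked models. It is not part of the assignment steps, and the votes and labels below are hypothetical.

```python
import numpy as np

# Hypothetical 0/1 predictions: rows are samples, columns are trees.
votes = np.array([[1, 1, 0],
                  [0, 0, 1],
                  [1, 0, 1],
                  [0, 0, 0]])
y_true = np.array([1, 0, 1, 1])

# Predict 1 when more than half of the trees vote 1.
majority = (votes.mean(axis=1) > 0.5).astype(int)
accuracy = float((majority == y_true).mean())

print(majority)  # [1 0 1 0]
print(accuracy)  # 0.75
```

The same expression could be applied to `tree_output_df_test.values`. With 10 trees a 5-5 tie is possible, and the strict `> 0.5` rule used here resolves ties to class 0.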

In [None]:
from sklearn.metrics import accuracy_score, recall_score, classification_report

# --- a. Just retain the shortlisted features from test set ---
X_test_selected = X_test[selected_features]  # already defined 'selected_features'

# --- b. Make predictions using the 10 decision trees ---
tree_preds_test = []

for tree in trees_selected:  # assuming 'trees_selected' = list of 10 trained trees on selected features
    preds = tree.predict(X_test_selected)
    tree_preds_test.append(preds)

tree_preds_test = np.array(tree_preds_test).T  # shape: (n_samples, 10)
tree_output_df_test = pd.DataFrame(tree_preds_test, columns=[f'tree_{i+1}' for i in range(10)])

# --- c. Combine tree outputs with shortlisted features for meta-models ---
X_meta_test = pd.concat([X_test_selected.reset_index(drop=True), tree_output_df_test], axis=1)

# --- i. Predict using logistic regression model ---
log_pred = log_model.predict(X_meta_test)
print("Logistic Regression Accuracy:", accuracy_score(y_test, log_pred))
print("Logistic Regression Recall (minority class):", recall_score(y_test, log_pred))
print(classification_report(y_test, log_pred))

# --- ii. Predict using master decision tree model ---
master_pred = master_tree.predict(X_meta_test)
print("Master Decision Tree Accuracy:", accuracy_score(y_test, master_pred))
print("Master Decision Tree Recall (minority class):", recall_score(y_test, master_pred))
print(classification_report(y_test, master_pred))

# --- d. Evaluate baseline accuracy from individual trees ---
tree_accuracies = [accuracy_score(y_test, tree_output_df_test[f'tree_{i+1}']) for i in range(10)]
avg_tree_accuracy = sum(tree_accuracies) / 10
print(f"Average Accuracy of 10 Decision Trees: {avg_tree_accuracy:.4f}")
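
To see why step 11 asks for recall alongside accuracy, the hypothetical example below scores 90% accuracy while finding only half of the malignant cases. Recall is computed directly as TP / (TP + FN); the labels are made up for illustration.

```python
import numpy as np

# Hypothetical labels: 2 malignant (1) and 8 benign (0).
y_true = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 0])
# A model that misses one of the two malignant cases.
y_pred = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])

tp = int(((y_true == 1) & (y_pred == 1)).sum())  # malignant, predicted malignant
fn = int(((y_true == 1) & (y_pred == 0)).sum())  # malignant, predicted benign

accuracy = float((y_true == y_pred).mean())
recall = tp / (tp + fn)

print(accuracy)  # 0.9
print(recall)    # 0.5
```

This is the quantity `recall_score` reports for the positive class, which is label 1 (malignant) under the mapping used in this notebook.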
