#### Module 14: Model Selection and Boosting

#### Case Study–3

Questions:

1. The data file contains numerical attributes that describe a letter and its corresponding class. Read the datafile “letterCG.data” and set all the numerical attributes as features. Split the data into train and test sets.

   
2. Fit a sequence of AdaBoostClassifier with varying numbers of weak learners ranging from 1 to 16, keeping the max_depth as 1. Plot the accuracy of the test set against the number of weak learners. Use the decision tree classifier as the base classifier.

   
3. Repeat step2 with max_depth set as 2.

In [6]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Step 1: Load dataset
# Assuming the first column is the class label (letter), rest are numerical features
df = pd.read_csv("letterCG.data")

In [7]:
y = df.iloc[:, 0]        # class label (letter)
X = df.iloc[:, 1:]       # numerical attributes

In [10]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

In [11]:
print("Training set shape:", X_train.shape)
print("Test set shape:", X_test.shape)
print("Target distribution in training set:\n", y_train.value_counts())

Training set shape: (1056, 0)
Test set shape: (453, 0)
Target distribution in training set:
 Class x-box y-box width high  onpix x-bar y-bar x2bar y2bar xybar x2ybr xy2br x-ege xegvy y-ege yegvx 
C 2 1 2 1 0 6 7 6 9 7 6 14 0 8 4 10    4
G 2 4 3 3 2 6 7 5 5 9 7 10 2 9 4 9     3
G 2 1 2 1 1 8 6 6 6 6 5 9 1 7 5 10     3
C 1 0 2 1 0 6 7 6 8 7 6 14 0 8 4 10    3
C 1 0 1 1 0 6 7 6 8 7 6 14 0 8 4 10    3
                                      ..
C 2 3 2 2 1 6 8 6 7 8 7 12 1 9 3 10    1
G 6 11 8 9 6 6 6 7 5 5 6 12 5 7 5 6    1
C 1 1 2 1 1 6 8 6 6 8 7 12 1 9 3 10    1
G 4 9 5 7 4 8 7 7 6 6 6 7 2 8 6 11     1
G 4 9 3 4 2 8 6 5 2 9 6 8 3 10 7 7     1
Name: count, Length: 1029, dtype: int64


In [14]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score

# Step 1: Load dataset
df = pd.read_csv("letterCG.data")

# Step 2: Inspect columns
print("Dataset shape:", df.shape)
print("Columns:", df.columns)

# Step 3: Separate target and features
# Assume first column is the class label, rest are numerical features
y = df.iloc[:, 0]
X = df.iloc[:, 1:]

# Safety check
if X.shape[0] == 0 or X.shape[1] == 0:
    raise ValueError("Feature matrix X is empty. Please check dataset structure.")
if y.shape[0] == 0:
    raise ValueError("Target vector y is empty. Please check dataset structure.")

# Step 4: Train-test split (skip stratify to avoid rare-class errors)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 5: Fit AdaBoost with varying weak learners
learners = range(1, 17)
accuracies = []

for n in learners:
    base_clf = DecisionTreeClassifier(max_depth=1, random_state=42)
    model = AdaBoostClassifier(
        estimator=base_clf,  # modern syntax
        n_estimators=n,
        random_state=42
    )
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracies.append(accuracy_score(y_test, y_pred))

# Step 6: Plot results
plt.figure(figsize=(8,6))
plt.plot(learners, accuracies, marker='o', color='blue')
plt.title("AdaBoost Accuracy vs Number of Weak Learners (max_depth=1)")
plt.xlabel("Number of Weak Learners")
plt.ylabel("Test Accuracy")
plt.grid(True)
plt.show()


Dataset shape: (1509, 1)
Columns: Index(['Class x-box y-box width high  onpix x-bar y-bar x2bar y2bar xybar x2ybr xy2br x-ege xegvy y-ege yegvx '], dtype='object')


ValueError: Feature matrix X is empty. Please check dataset structure.