**Question 1**

The “recursive” part of recursive binary splitting means that the same splitting procedure is applied again to each subset it produces, so the dataset is divided into smaller and smaller subsets. At each step the algorithm chooses a single predictor and a decision point, or “split,” to separate the data. The “binary” aspect refers to each split producing exactly two child nodes, each holding the observations for which a given condition is or is not met. In the case of our dataset, recursive binary splitting would involve evaluating potential splits based on each level of x1 and different thresholds of x2, and at each step choosing the split that reduces node impurity the most. Node impurity measures how mixed a node is, and the goal of splitting is to decrease impurity in the child nodes. For classification trees, impurity is usually measured with metrics such as the Gini index or entropy, which quantify the diversity of classes within a node. In regression trees, impurity is measured with metrics like the sum of squared deviations from the node mean, focusing on the spread of quantitative outcomes in each node. The algorithm continues until a stopping rule is met, such as nodes reaching a minimum impurity or size, resulting in a tree structure that organizes the data based on the chosen splits.
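As a rough illustration of the search described above, the sketch below finds the best threshold on one numeric predictor by Gini impurity and then applies the same search to each child node. The helper names (gini, best_threshold, grow) and the six-point toy dataset are assumptions made for illustration only; this is not how scikit-learn implements its trees.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 for a pure node)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_threshold(x, y):
    """Return (threshold, weighted child impurity) for the best split x <= t.

    The threshold is None when no split lowers the impurity of the node.
    """
    values = sorted(set(x))
    best = (None, gini(y))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2  # candidate threshold halfway between neighbours
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        w = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if w < best[1]:
            best = (t, w)
    return best

def grow(x, y):
    """Recursively split until no threshold lowers the impurity."""
    t, _ = best_threshold(x, y)
    if t is None:
        return Counter(y).most_common(1)[0][0]  # leaf: majority class
    left = [(xi, yi) for xi, yi in zip(x, y) if xi <= t]
    right = [(xi, yi) for xi, yi in zip(x, y) if xi > t]
    return {
        "threshold": t,
        "left": grow([p[0] for p in left], [p[1] for p in left]),
        "right": grow([p[0] for p in right], [p[1] for p in right]),
    }

# Hypothetical toy data: one numeric predictor and two classes
x2 = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
y = ["a", "a", "a", "b", "b", "b"]
tree = grow(x2, y)
print(best_threshold(x2, y))  # (6.5, 0.0)
print(tree)                   # {'threshold': 6.5, 'left': 'a', 'right': 'b'}
```

A real implementation would search over every predictor (including the levels of a categorical one such as x1) and would stop on a depth or node-size rule as well.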

**Question 2**

Tuning parameters for a decision tree include max depth, min samples split, and min samples leaf, along with the cost-complexity pruning penalty (ccp_alpha, which is the parameter tuned in Exercise 6). Choosing a larger max depth or lower minimum sample counts lets the tree fit the training data more closely, which reduces bias but increases variance and can cause overfitting. Limiting depth, increasing the minimum sample counts, or increasing the pruning penalty produces a smaller tree with higher bias but lower variance, which may help the model generalize better to unseen data.
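The trade-off can be seen in a small experiment. The sketch below is only an illustration on synthetic data (the dataset from make_classification, the noise level, and the depths tried are all assumptions, not part of the exercise); it compares training and test accuracy for trees of different maximum depth.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with label noise (flip_y) so that overfitting is visible
X, y = make_classification(n_samples=600, n_features=10, n_informative=4,
                           flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

scores = {}
for depth in (1, 3, None):  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    # Store (training accuracy, test accuracy) for this depth
    scores[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(depth, scores[depth])
```

The unrestricted tree reaches a training accuracy of 1.0 because it keeps splitting until every leaf is pure; how much test accuracy it gives up relative to the shallower trees depends on the data.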


**Question 3**

Variable importance in decision trees is determined by how much each feature reduces impurity, summed over every split that uses it (typically weighted by the number of observations reaching that split). Features that make more significant splits, leading to greater decreases in impurity, are assigned higher importance scores, as they help the model make more accurate predictions. This measure helps identify which variables have the most impact on the model's predictions, which can be useful for feature selection and understanding the relationships within the data.
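To make the calculation concrete, here is a minimal sketch of an impurity-based importance score for a hand-written tree. The three splits and all of their counts and impurities are made-up numbers, and the function name impurity_importance is hypothetical; the formula (weighted impurity decrease per split, summed by feature and normalized to sum to one) is the usual convention but is stated here as an assumption rather than taken from the exercise output.

```python
def impurity_importance(splits, n_total):
    """Sum the weighted impurity decrease of each split by feature, then normalize."""
    raw = {}
    for s in splits:
        decrease = (s["n"] * s["impurity"]
                    - s["n_left"] * s["imp_left"]
                    - s["n_right"] * s["imp_right"]) / n_total
        raw[s["feature"]] = raw.get(s["feature"], 0.0) + decrease
    total = sum(raw.values())
    return {feature: value / total for feature, value in raw.items()}

# Hypothetical tree on 100 observations: the root and one child split on x1,
# the other child splits on x2
splits = [
    {"feature": "x1", "n": 100, "impurity": 0.50,
     "n_left": 60, "imp_left": 0.40, "n_right": 40, "imp_right": 0.25},
    {"feature": "x2", "n": 60, "impurity": 0.40,
     "n_left": 30, "imp_left": 0.20, "n_right": 30, "imp_right": 0.40},
    {"feature": "x1", "n": 40, "impurity": 0.25,
     "n_left": 20, "imp_left": 0.10, "n_right": 20, "imp_right": 0.20},
]
importance = impurity_importance(splits, n_total=100)
print(importance)  # x1 is about 0.77, x2 about 0.23
```

Here x1 accounts for decreases of 0.16 and 0.04 and x2 for 0.06, so x1 receives 0.20 / 0.26 of the total importance.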

**Question 4**

Bagging, random forests, and boosting improve on a single decision tree by combining many trees. Bagging fits a separate tree to each of many bootstrap samples of the data and averages their predictions (or takes a majority vote for classification), which reduces variance. Random forests add further randomness by considering only a random subset of features at each split, which decorrelates the trees and makes the average more robust. Finally, boosting fits small trees sequentially, each one correcting the errors of the trees before it, which reduces bias and can produce a more accurate model.
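The bagging step can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions: the base learner is a one-split stump on a single predictor, the function names (fit_stump, bagged_stumps, predict_bagged) are hypothetical, and the ten-point dataset is invented. Random forests and boosting are not shown.

```python
import random
from collections import Counter

def fit_stump(x, y):
    """Fit a one-split tree by choosing the threshold with the fewest training errors."""
    values = sorted(set(x))
    majority = Counter(y).most_common(1)[0][0]
    # Start from "no split": predict the majority class everywhere
    best = (sum(yi != majority for yi in y), None, majority, majority)
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        l_lab = Counter(left).most_common(1)[0][0]
        r_lab = Counter(right).most_common(1)[0][0]
        errors = sum(yi != l_lab for yi in left) + sum(yi != r_lab for yi in right)
        if errors < best[0]:
            best = (errors, t, l_lab, r_lab)
    return best[1:]  # (threshold, left label, right label)

def predict_stump(stump, xi):
    t, l_lab, r_lab = stump
    if t is None:
        return l_lab
    return l_lab if xi <= t else r_lab

def bagged_stumps(x, y, n_trees=25, seed=0):
    """Fit one stump per bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(x)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        stumps.append(fit_stump([x[i] for i in idx], [y[i] for i in idx]))
    return stumps

def predict_bagged(stumps, xi):
    """Majority vote over all stumps."""
    votes = Counter(predict_stump(s, xi) for s in stumps)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: class 0 for small x, class 1 for large x
x = list(range(10))
y = [0] * 5 + [1] * 5
stumps = bagged_stumps(x, y)
print(predict_bagged(stumps, -100), predict_bagged(stumps, 100))
```

Each stump sees a slightly different sample, so the individual thresholds vary; the vote averages that variation out, which is the variance reduction described above.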

**Question 5**


For bagging, the primary tuning parameter is the number of trees, which affects variance reduction. In random forests, key parameters include the number of trees and the number of features considered at each split, balancing randomness and model robustness. For boosting, parameters like the learning rate, number of trees, and tree depth control how aggressively errors are corrected, impacting both bias and variance in the final model.
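The interaction between the learning rate and the number of trees in boosting can be shown with a deliberately simple toy. In the sketch below the "weak learner" is just the mean of the current residuals, standing in for a small tree; the function name rounds_to_fit, the three target values, and the 1% tolerance are all assumptions chosen for illustration.

```python
def rounds_to_fit(y, learning_rate, tol=0.01, max_rounds=10_000):
    """Count boosting rounds until the mean residual drops below tol times its initial size."""
    pred = [0.0] * len(y)
    initial = abs(sum(y) / len(y))
    for m in range(1, max_rounds + 1):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        step = sum(residuals) / len(residuals)              # "fit" the weak learner
        pred = [pi + learning_rate * step for pi in pred]   # shrunken update
        remaining = abs(sum(yi - pi for yi, pi in zip(y, pred)) / len(y))
        if remaining < tol * initial:
            return m
    return max_rounds

y = [10.0, 12.0, 14.0]  # hypothetical targets
fast = rounds_to_fit(y, learning_rate=0.1)
slow = rounds_to_fit(y, learning_rate=0.01)
print(fast, slow)  # 44 459
```

Each round removes a fraction of the remaining error equal to the learning rate, so a learning rate ten times smaller needs roughly ten times as many rounds. This is why the learning rate and the number of trees are usually tuned together.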

**Exercise 6**

In [None]:
from google.colab import files
import pandas as pd
import io

# Upload the file
uploaded = files.upload()

# Read the file into a pandas dataframe
df = pd.read_csv(io.BytesIO(uploaded['acs12.csv']))

# Convert string (object) columns to the pandas categorical dtype
for col in df.select_dtypes(include=['object']).columns:
  df[col] = df[col].astype('category')

print(df)
!pip install ISLP

1)

In [None]:
import numpy as np
import pandas as pd
from matplotlib.pyplot import subplots
from statsmodels.datasets import get_rdataset
import sklearn.model_selection as skm
from ISLP import load_data, confusion_table
from ISLP.models import ModelSpec as MS
from sklearn.tree import (DecisionTreeClassifier as DTC,
  DecisionTreeRegressor as DTR,
  plot_tree,
  export_text)
from sklearn.metrics import (accuracy_score,
  log_loss)
from sklearn.ensemble import \
  (RandomForestClassifier as RF,
  GradientBoostingClassifier as GBC)
from ISLP.bart import BART

# Create binary variable from 'income'
median_income = df['income'].median()
high = np.where(df['income'] > median_income, 'Yes', 'No')

# Fit classification tree
model = MS(df.columns.drop('income'), intercept=False)
D = model.fit_transform(df)
feature_names = list(D.columns)
X = np.asarray(D)

clf = DTC(criterion='entropy',
  max_depth=3,
  random_state=0)
print(clf.fit(X, high))

# Accuracy score
print(accuracy_score(high, clf.predict(X)))

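# Mean log loss on the training data (log_loss already averages over
# observations, so np.sum leaves the scalar unchanged)
resid_dev = np.sum(log_loss(high, clf.predict_proba(X)))
print(resid_dev)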

# Plot
ax = subplots(figsize=(12,12))[1]
plot_tree(clf,
  feature_names=feature_names,
  ax=ax);

print(export_text(clf,
  feature_names=feature_names,
  show_weights=True))

# Estimate test accuracy on a single validation split of 200 held-out observations
validation = skm.ShuffleSplit(n_splits=1,
  test_size=200,
  random_state=0)
results = skm.cross_validate(clf,
  D,
  high,
  cv=validation)
print(results['test_score'])

# Split data
(X_train,
  X_test,
  High_train,
  High_test) = skm.train_test_split(X,
    high,
    test_size=0.5,
    random_state=0)

# Refit tree on training set
clf = DTC(criterion='entropy', random_state=0)
clf.fit(X_train, High_train)
print(accuracy_score(High_test, clf.predict(X_test)))

# Extract cost-complexity values
ccp_path = clf.cost_complexity_pruning_path(X_train, High_train)
kfold = skm.KFold(10,
  random_state=1,
  shuffle=True)

# Extract optimal through cross-validation
grid = skm.GridSearchCV(clf,
  {'ccp_alpha': ccp_path.ccp_alphas},
  refit=True,
  cv=kfold,
  scoring='accuracy')
grid.fit(X_train, High_train)
print(grid.best_score_)

# Plot pruned
ax = subplots(figsize=(12, 12))[1]
best_ = grid.best_estimator_
plot_tree(best_,
  feature_names=feature_names,
  ax=ax);

# Query best
print(best_.tree_.n_leaves)

# Evaluate the pruned tree on the test set
print(accuracy_score(High_test,
  best_.predict(X_test)))
confusion = confusion_table(best_.predict(X_test),
  High_test)
print(confusion)

2)

In [None]:
# RandomForestClassifier is already imported above as RF
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt

# Bagging with 100 trees (a random forest that considers every feature at each split)
bagging_model1 = RF(n_estimators=100, max_features=X_train.shape[1], random_state=0)
bagging_model1.fit(X_train, High_train)

# Prediction on test set
y_hat_bagging1 = bagging_model1.predict(X_test)
bagging_accuracy1 = accuracy_score(High_test, y_hat_bagging1)
print(bagging_accuracy1)

# Confusion matrix
bag_conf_matrix1 = confusion_matrix(High_test, y_hat_bagging1)
print(bag_conf_matrix1)

# Feature importance
bag_feature_imp1 = pd.DataFrame({
    'importance': bagging_model1.feature_importances_},
    index=feature_names)
print(bag_feature_imp1.sort_values(by='importance', ascending=False))


# Bagging with 200 trees
bagging_model2 = RF(n_estimators=200, max_features=X_train.shape[1], random_state=0)
bagging_model2.fit(X_train, High_train)

# Prediction on test set
y_hat_bagging2 = bagging_model2.predict(X_test)
bagging_accuracy2 = accuracy_score(High_test, y_hat_bagging2)
print(bagging_accuracy2)

# Confusion matrix
bag_conf_matrix2 = confusion_matrix(High_test, y_hat_bagging2)
print(bag_conf_matrix2)

# Feature importance
bag_feature_imp2 = pd.DataFrame({
    'importance': bagging_model2.feature_importances_},
    index=feature_names)
print(bag_feature_imp2.sort_values(by='importance', ascending=False))

3)

In [None]:
# Random forest with 100 trees (default max_features, so only a subset of features is tried at each split)
random_forest_model1 = RF(n_estimators=100, random_state=0)
random_forest_model1.fit(X_train, High_train)

# Prediction on test set
y_hat_random_forest1 = random_forest_model1.predict(X_test)
random_forest_accuracy1 = accuracy_score(High_test, y_hat_random_forest1)
print(random_forest_accuracy1)

# Confusion matrix
rf_conf_matrix1 = confusion_matrix(High_test, y_hat_random_forest1)
print(rf_conf_matrix1)

# Feature importance
rf_feature_imp1 = pd.DataFrame({
    'importance': random_forest_model1.feature_importances_},
    index=feature_names)
print(rf_feature_imp1.sort_values(by='importance', ascending=False))


# Random forest with 200 trees
random_forest_model2 = RF(n_estimators=200, random_state=0)
random_forest_model2.fit(X_train, High_train)

# Prediction on test set
y_hat_random_forest2 = random_forest_model2.predict(X_test)
random_forest_accuracy2 = accuracy_score(High_test, y_hat_random_forest2)
print(random_forest_accuracy2)

# Confusion matrix
rf_conf_matrix2 = confusion_matrix(High_test, y_hat_random_forest2)
print(rf_conf_matrix2)

# Feature importance
rf_feature_imp2 = pd.DataFrame({
    'importance': random_forest_model2.feature_importances_},
    index=feature_names)
print(rf_feature_imp2.sort_values(by='importance', ascending=False))

4)

In [None]:
# Fit boost model (learning rate = 0.001)
boost_model1 = GBC(n_estimators=5000,
  learning_rate=0.001,
  max_depth=1,
  random_state=0)
boost_model1.fit(X_train, High_train)

# Test error (1 - accuracy) after each boosting iteration
test_error1 = np.zeros_like(boost_model1.train_score_)
for idx, y_ in enumerate(boost_model1.staged_predict(X_test)):
  test_error1[idx] = 1.0 - accuracy_score(High_test, y_)

# Plot training loss (train_score_ holds the loss, not accuracy) against test error
plot_idx = np.arange(boost_model1.train_score_.shape[0])
ax = subplots(figsize=(8,8))[1]
ax.plot(plot_idx,
  boost_model1.train_score_,
  'b',
  label='Training loss')
ax.plot(plot_idx,
  test_error1,
  'r',
  label='Test error')
ax.legend();

# Predictions
y_hat_boost = boost_model1.predict(X_test)

# Calculate accuracy
boost_accuracy1 = accuracy_score(High_test, y_hat_boost)
print(boost_accuracy1)

# Confusion matrix
boost_conf_matrix1 = confusion_matrix(High_test, y_hat_boost)
print(boost_conf_matrix1)


# Fit boost model (learning rate = 0.01)
boost_model2 = GBC(n_estimators=5000,
  learning_rate=0.01,
  max_depth=1,
  random_state=0)
boost_model2.fit(X_train, High_train)

# Test error (1 - accuracy) after each boosting iteration
test_error2 = np.zeros_like(boost_model2.train_score_)
for idx, y_ in enumerate(boost_model2.staged_predict(X_test)):
  test_error2[idx] = 1.0 - accuracy_score(High_test, y_)

# Plot training loss (train_score_ holds the loss, not accuracy) against test error
plot_idx = np.arange(boost_model2.train_score_.shape[0])
ax = subplots(figsize=(8,8))[1]
ax.plot(plot_idx,
  boost_model2.train_score_,
  'b',
  label='Training loss')
ax.plot(plot_idx,
  test_error2,
  'r',
  label='Test error')
ax.legend();

# Predictions
y_hat_boost2 = boost_model2.predict(X_test)

# Calculate accuracy
boost_accuracy2 = accuracy_score(High_test, y_hat_boost2)
print(boost_accuracy2)

# Confusion matrix
boost_conf_matrix2 = confusion_matrix(High_test, y_hat_boost2)
print(boost_conf_matrix2)

To classify individuals as having an income above or below the median, I began by building a basic classification tree. Using entropy as the splitting criterion and tuning with cost-complexity pruning, I identified the tree size that best balanced model complexity and accuracy. The pruned tree showed "hours worked" and "education level" as primary predictors, suggesting that individuals with more work hours or higher education are more likely to have above-median income. The model had a training accuracy of approximately 72.7% and was further validated with a test accuracy of 77.8%. Key nodes in the pruned tree indicated that marital status, age, and gender also contribute meaningfully to income classification, highlighting potential socioeconomic factors influencing income levels.

I then used a few other methods to try to improve predictive accuracy. For bagging, I created models with 100 and 200 trees, resulting in test accuracies of 77.3% and 76.8%, respectively, showing that doubling the number of trees did not improve test accuracy. Moving to random forests, I tested models with 100 and 200 trees, which yielded similar results; their feature importance scores helped with interpretation. The top predictors were "age," "hours worked," and "time to work," which aligns with expectations about income determinants. Finally, I used boosting with learning rates of 0.001 and 0.01. The model with the lower learning rate achieved slightly higher accuracy (78.3%) than the model with the higher learning rate (76.3%), indicating a trade-off between convergence speed and accuracy. Compared with the pruned tree's 77.8% test accuracy, only boosting with the lower learning rate did better, and only slightly, so on this dataset the gains from ensemble methods, additional trees, or a faster learning rate were limited.

**References**

I used Google Colab's built-in Gemini feature to help adapt the code in the ISLP boosting example from MSE (numeric response) to accuracy (categorical response).