# Q3

# 3A

Properties of Variance (I am not going to prove these):

$Var(cX) = c^2 \cdot Var(X)$


$Var(X_i + X_j) = Var(X_i) + Var(X_j) + 2Cov(X_i,X_j)$
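A quick numerical sanity check of the two properties above, as a sketch on synthetic data. The arrays `x` and `y`, the constant `c`, and the seed are illustrative assumptions, not part of the problem.

```python
import numpy as np

rng = np.random.default_rng(0)          # arbitrary seed (assumption)
x = rng.normal(size=1000)
y = 0.5 * x + rng.normal(size=1000)     # correlated with x by construction
c = 3.0

# Use ddof=0 moments throughout so the identities hold up to rounding error
var_x, var_y = np.var(x), np.var(y)
cov_xy = np.cov(x, y, bias=True)[0, 1]

# Var(cX) = c^2 Var(X)
scaling_ok = np.isclose(np.var(c * x), c**2 * var_x)
# Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y)
sum_ok = np.isclose(np.var(x + y), var_x + var_y + 2 * cov_xy)
print(scaling_ok, sum_ok)
```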

The problem statement implies:

$Cov(X_i,X_j) = \rho\sigma^2$ for $i \neq j$

$Var(X_i) = \sigma^2$

\begin{align*}
Var\left(\frac{1}{M}\sum_{i=1}^M X_i\right) &= \frac{1}{M^2} Var\left(\sum_{i=1}^M X_i\right)\\
&=\frac{1}{M^2}\left[\sum_{i=1}^M Var(X_i)+\sum_{i=1}^M\sum_{j=1, j\neq i}^M Cov(X_i,X_j)\right]\\
&=\frac{1}{M^2}\left[M\sigma^2+M(M-1)\rho\sigma^2\right] \\
&=\frac{1}{M^2}\left[M\sigma^2+M^2\rho\sigma^2-M\rho\sigma^2\right]\\
&=\frac{1}{M^2}\left[M\sigma^2(1-\rho)+M^2\rho\sigma^2\right] \\
&=\frac{M\sigma^2(1-\rho)}{M^2}+\rho\sigma^2 \\
&=\rho\sigma^2 + \frac{1-\rho}{M}\sigma^2
\end{align*}
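The result can be checked numerically. The sketch below sums every entry of the covariance matrix (diagonal $\sigma^2$, off-diagonal $\rho\sigma^2$) and compares it with the closed form. The function names and the values $\sigma^2 = 4$, $\rho = 0.3$ are illustrative assumptions.

```python
def variance_of_mean(M, sigma2, rho):
    """Var of the mean of M equicorrelated variables, by brute-force summation."""
    total = 0.0
    for i in range(M):
        for j in range(M):
            # Diagonal terms are variances, off-diagonal terms are covariances
            total += sigma2 if i == j else rho * sigma2
    return total / M**2


def closed_form(M, sigma2, rho):
    """The closed form derived above: rho*sigma^2 + (1 - rho)/M * sigma^2."""
    return rho * sigma2 + (1 - rho) / M * sigma2


for M in (1, 2, 10, 100):
    brute = variance_of_mean(M, sigma2=4.0, rho=0.3)
    formula = closed_form(M, sigma2=4.0, rho=0.3)
    print(M, round(brute, 6), round(formula, 6))
```

As $M$ grows, the second term vanishes and the variance approaches the floor $\rho\sigma^2$, which is why lowering $\rho$ matters for ensembles.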

# 3B i to ii

In [5]:
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor
from sklearn.ensemble import RandomForestRegressor
# Function to create a bootstrapped dataset given features and labels
def bootstrap_dataset(X, y, N):
    """
    Create a bootstrapped training dataset by uniformly sampling N samples with replacement.
    
    Args:
    - X: Features of the dataset (pandas DataFrame).
    - y: Labels of the dataset (pandas Series).
    - N: The number of samples to draw.
    
    Returns:
    - A tuple (X_bootstrap, y_bootstrap) where both are bootstrapped versions of X and y.
    """
    bootstrap_indices = np.random.choice(X.index, size=N, replace=True)
    X_bootstrap = X.loc[bootstrap_indices]
    y_bootstrap = y.loc[bootstrap_indices]
    return X_bootstrap, y_bootstrap

# Load the training and test datasets
X_train = pd.read_csv('wine_true_D.csv',header=None)  # Training features
y_train = pd.read_csv('wine_true_y.csv', header=None).squeeze()  # Training labels
X_test = pd.read_csv('wine_test.csv')  # Test features
X_test = X_test.to_numpy()


# Bootstrapped dataset size and number of datasets
N = 1024
num_bootstrap_datasets = 3000

# Initialize to store predictions for each tree on each bootstrap dataset
predictions_tree_1 = np.zeros((num_bootstrap_datasets, len(X_test)))
predictions_tree_2 = np.zeros((num_bootstrap_datasets, len(X_test)))

# Train the models on each of the 3000 bootstrapped datasets
for i in range(num_bootstrap_datasets):
    # Generate bootstrapped training data
    X_bootstrap, y_bootstrap = bootstrap_dataset(X_train, y_train, N)

    # Convert to numpy arrays to avoid warnings
    X_bootstrap = X_bootstrap.to_numpy()
    y_bootstrap = y_bootstrap.to_numpy()

    
    # Create a Bagging Regressor with 2 trees, each trained on 500 bootstrapped samples
    bagging_model = BaggingRegressor(estimator=DecisionTreeRegressor(), 
                                     n_estimators=2, 
                                     max_samples=500, 
                                     bootstrap=True, 
                                     random_state=np.random.randint(1e6))
    
    # Fit the model
    bagging_model.fit(X_bootstrap, y_bootstrap)
    
    # Get predictions for both trees
    pred_trees = np.array([est.predict(X_test) for est in bagging_model.estimators_])
    
    # Store the predictions from each tree
    predictions_tree_1[i, :] = pred_trees[0]  # Predictions from the first tree
    predictions_tree_2[i, :] = pred_trees[1]  # Predictions from the second tree

# Now compute the correlation for each test sample
correlations = []
for x_idx in range(len(X_test)):
    # Get predictions for the current test sample (x) across all bootstrap datasets
    preds_1 = predictions_tree_1[:, x_idx]
    preds_2 = predictions_tree_2[:, x_idx]
    
    # Compute the correlation coefficient between the predictions of the two trees
    corr_matrix = np.corrcoef(preds_1, preds_2)
    corr = corr_matrix[0, 1]  # The off-diagonal element is the correlation coefficient
    correlations.append(corr)

In [6]:
# Display the correlation results
correlations = np.array(correlations)
print("Correlations between tree predictions for each test point:")
print(np.mean(correlations))

Correlations between tree predictions for each test point:
0.04481161468025522
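The per-test-point loop above can also be written in vectorized form. The following is a sketch on synthetic arrays: `columnwise_corr` is a hypothetical helper, and `a` and `b` stand in for `predictions_tree_1` and `predictions_tree_2` (rows are bootstrap runs, columns are test points).

```python
import numpy as np


def columnwise_corr(a, b):
    """Pearson correlation between matching columns of a and b."""
    a_c = a - a.mean(axis=0)            # center each column
    b_c = b - b.mean(axis=0)
    num = (a_c * b_c).sum(axis=0)
    den = np.sqrt((a_c**2).sum(axis=0) * (b_c**2).sum(axis=0))
    return num / den


rng = np.random.default_rng(0)          # synthetic stand-in data (assumption)
a = rng.normal(size=(200, 5))
b = 0.5 * a + rng.normal(size=(200, 5))

vectorized = columnwise_corr(a, b)
looped = np.array([np.corrcoef(a[:, k], b[:, k])[0, 1] for k in range(a.shape[1])])
print(np.allclose(vectorized, looped))
```

Like `np.corrcoef`, this yields `nan` for a column whose predictions never vary, so such test points would need to be handled before averaging.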


# 3B iii

In [7]:
# Define the range of F values (number of features to consider at each split)
F_values = [2, 5, 7, 9] 

# Store average correlations for each value of F
average_correlations = {}

# Iterate over each value of F
for F in F_values:
    # Initialize to store predictions for each tree on each bootstrap dataset
    predictions_tree_1 = np.zeros((num_bootstrap_datasets, len(X_test)))
    predictions_tree_2 = np.zeros((num_bootstrap_datasets, len(X_test)))

    # Train the models on each of the 3000 bootstrapped datasets
    for i in range(num_bootstrap_datasets):
        # Generate bootstrapped training data
        X_bootstrap, y_bootstrap = bootstrap_dataset(X_train, y_train, N)

        # Convert to numpy arrays to avoid warnings
        X_bootstrap = X_bootstrap.to_numpy()
        y_bootstrap = y_bootstrap.to_numpy()

        # Create a RandomForestRegressor with 2 trees, each trained on 500 bootstrapped samples and considering F features at each split
        random_forest_model = RandomForestRegressor(n_estimators=2,
                                                    max_samples=500,
                                                    max_features=F,
                                                    bootstrap=True,
                                                    random_state=np.random.randint(1e6))
        
        # Fit the model on the bootstrapped training data
        random_forest_model.fit(X_bootstrap, y_bootstrap)

        # Get predictions for both trees
        pred_trees = np.array([est.predict(X_test) for est in random_forest_model.estimators_])
        
        # Store the predictions from each tree
        predictions_tree_1[i, :] = pred_trees[0]  # Predictions from the first tree
        predictions_tree_2[i, :] = pred_trees[1]  # Predictions from the second tree

    # Compute the basic correlation for each test sample and average over all test samples
    correlations = []
    for x_idx in range(len(X_test)):
        # Get predictions for the current test sample (x) across all bootstrap datasets
        preds_1 = predictions_tree_1[:, x_idx]
        preds_2 = predictions_tree_2[:, x_idx]
        
        # Compute the basic correlation coefficient between the predictions of the two trees
        corr_matrix = np.corrcoef(preds_1, preds_2)
        corr = corr_matrix[0, 1]  # The off-diagonal element is the correlation coefficient
        correlations.append(corr)

    # Compute the average correlation for the current value of F
    average_correlations[F] = np.mean(correlations)

# Display the average correlation results for each value of F
for F, avg_corr in average_correlations.items():
    print(f"Average correlation for F = {F}: {avg_corr:.4f}")

Average correlation for F = 2: 0.0268
Average correlation for F = 5: 0.0360
Average correlation for F = 7: 0.0415
Average correlation for F = 9: 0.0426


# 3B iv

Note: values will be different on different runs

The tree bagging method had an average correlation of about 0.047, whereas the random forest method had:

Average correlation for F = 2: 0.0267

Average correlation for F = 5: 0.0383

Average correlation for F = 7: 0.0386

Average correlation for F = 9: 0.0442

It makes sense that the tree bagging method produced a higher correlation than the random forest, since the main point of a random forest is to reduce the correlation between trees.

Clearly, as F increases, the correlation also increases. This makes sense: as F increases, more features are considered at each split, making it more likely that different trees will choose the same or similar features to split on. Small values of F naturally give the two trees more distinct subsets of the available features, which reduces the correlation between their predictions. So yes, this is about what I expected.
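To connect this back to 3A, the sketch below plugs the correlations reported above into $\rho\sigma^2 + \frac{1-\rho}{M}\sigma^2$. It assumes $\sigma^2 = 1$ for every method, which is a simplification (changing F also changes the variance of each individual tree), and $M = 100$ is a hypothetical ensemble size; only $M = 2$ matches the experiment.

```python
# Average correlations as reported in the write-up above
reported_rho = {
    "bagging": 0.047,
    "RF, F=2": 0.0267,
    "RF, F=5": 0.0383,
    "RF, F=7": 0.0386,
    "RF, F=9": 0.0442,
}


def ensemble_variance(rho, M, sigma2=1.0):
    """Variance of the average of M trees; sigma2 = 1 is an assumption."""
    return rho * sigma2 + (1 - rho) / M * sigma2


for name, rho in reported_rho.items():
    v2 = ensemble_variance(rho, M=2)
    v100 = ensemble_variance(rho, M=100)   # hypothetical larger ensemble
    print(f"{name}: M=2 -> {v2:.4f}, M=100 -> {v100:.4f}")
```

With only two trees the $\frac{1-\rho}{M}$ term dominates, so the differences in $\rho$ barely show; they matter more as $M$ grows and the variance approaches $\rho\sigma^2$.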