1. Prove formula (6.7) in the textbook, i.e. show that the variance of the mean of $m$ random but correlated variables can be written as $${\rm Var}\left(\frac{1}{m}\sum_{i=1}^m x_i\right)= \rho \sigma^2 + \frac{1}{m}(1-\rho)\sigma^2,$$
where ${\rm Var}(x_i)=\sigma^2$ and the correlation coefficient is $\rho_{x_i,x_j}={\rm Cov}(x_i,x_j)/\sigma^2=\rho$ for all $i \neq j$.

(Hint: Review some properties of the covariance)
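
One possible route, sketched under the assumption that every pair $i \neq j$ shares the same correlation coefficient $\rho$, uses the bilinearity of the covariance and then counts the $m$ diagonal terms and the $m(m-1)$ off-diagonal terms:

$$\begin{aligned}
{\rm Var}\left(\frac{1}{m}\sum_{i=1}^m x_i\right) &= \frac{1}{m^2}\sum_{i=1}^m\sum_{j=1}^m {\rm Cov}(x_i,x_j) \\
&= \frac{1}{m^2}\left[\sum_{i=1}^m {\rm Var}(x_i) + \sum_{i\neq j}{\rm Cov}(x_i,x_j)\right] \\
&= \frac{1}{m^2}\left[m\sigma^2 + m(m-1)\rho\sigma^2\right] \\
&= \frac{\sigma^2}{m} + \frac{m-1}{m}\rho\sigma^2 = \rho\sigma^2 + \frac{1}{m}(1-\rho)\sigma^2.
\end{aligned}$$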

You can write out the calculation by hand and attach a scanned pdf.
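
As a sanity check on the identity (not a substitute for the proof), the double sum over the covariance terms can be compared numerically with the closed form. This is a minimal sketch; the function names `variance_of_mean` and `formula_6_7` are made up for illustration, and it assumes a single common $\rho$ for all pairs.

```python
def variance_of_mean(m, sigma2, rho):
    """Variance of the sample mean from the full sum (1/m^2) * sum_ij Cov(x_i, x_j)."""
    total = 0.0
    for i in range(m):
        for j in range(m):
            # Diagonal terms are variances; off-diagonal terms are rho * sigma^2.
            total += sigma2 if i == j else rho * sigma2
    return total / m**2


def formula_6_7(m, sigma2, rho):
    """Closed form: rho * sigma^2 + (1 - rho) * sigma^2 / m."""
    return rho * sigma2 + (1.0 - rho) * sigma2 / m


# Compare both expressions for a few ensemble sizes and correlations.
for m in (1, 5, 50):
    for rho in (0.0, 0.3, 1.0):
        direct = variance_of_mean(m, 2.0, rho)
        closed = formula_6_7(m, 2.0, rho)
        print(f"m={m:2d} rho={rho:.1f} direct={direct:.6f} closed={closed:.6f}")
```

For $\rho = 0$ the result reduces to $\sigma^2/m$, and for $\rho = 1$ averaging gives no reduction at all, which is why decorrelating the trees matters for ensemble methods.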

2. Load again the cleaned dataset from Lab 13 for the photometric redshift prediction with 6,307 objects and 6 features (sel_feature.csv and sel_target.csv). You can also just re-do the data cuts from the original file if you prefer.

Optimize (using a Grid Search for the parameters you deem to be most relevant) the Extremely Random Tree algorithm and compute the performance metric and the outlier fraction. How do they compare to the optimal Random Forest model? Comment not just on the scoring parameter(s), but also on high variance vs high bias. Which model would you pick?


In [2]:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import astropy

from sklearn import metrics
from sklearn.model_selection import cross_validate, KFold, cross_val_predict, train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import AdaBoostRegressor, IsolationForest
from sklearn.metrics import mean_absolute_error, make_scorer
from astropy.io import fits

pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_colwidth', 100)

font = {'size'   : 16}
matplotlib.rc('font', **font)
matplotlib.rc('xtick', labelsize=14) 
matplotlib.rc('ytick', labelsize=14) 
matplotlib.rcParams.update({'figure.autolayout': False})
matplotlib.rcParams['figure.dpi'] = 100

In [3]:
with fits.open('DEEP2_uniq_Terapix_Subaru_v1.fits') as data:
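    # Convert the big-endian FITS records to native byte order for pandas.
    # Note: ndarray.newbyteorder() was removed in NumPy 2.0; there, use
    # arr.byteswap().view(arr.dtype.newbyteorder()) instead.
    df = pd.DataFrame(np.array(data[1].data).byteswap().newbyteorder())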
df = df[df['zquality'] >= 3]
df = df[df['cfhtls_source'] == 0]
mags_columns = ['u_apercor', 'g_apercor', 'r_apercor', 'i_apercor', 'z_apercor', 'y_apercor']
df = df[(df[mags_columns] != -99).all(axis=1)]
df = df[(df[mags_columns] != 99).all(axis=1)]
final_features_df = df[mags_columns]
final_target_df = df[['zhelio']]

final_features_df.shape

(6307, 6)

In [14]:
X_train, X_test, y_train, y_test = train_test_split(
    final_features_df, final_target_df, test_size=0.2, random_state=42
)

etr_model=ExtraTreesRegressor(random_state=42)


# Each key may appear only once: a repeated key silently overrides the
# earlier entry, so only one 'min_samples_split' list is kept.
param_grid = {
    'min_impurity_decrease': [0, 0.1, 0.5],
    'max_leaf_nodes': [None, 100, 200],
    'max_features': [None, 2, 4],
    'min_samples_leaf': [1, 2, 4],
    'min_samples_split': [2, 5, 10]
}
grid_search=GridSearchCV(estimator=etr_model,param_grid=param_grid,scoring='neg_mean_absolute_error',cv=3,n_jobs=-1)

grid_search.fit(X_train,y_train.values.ravel())

best_params=grid_search.best_params_

best_etr_model=ExtraTreesRegressor(random_state=42,**best_params)
best_etr_model.fit(X_train,y_train.values.ravel())
    
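# No `scoring` is passed, so cross_validate uses the regressor's default
# scorer (R^2); the train/test scores printed below are R^2, not MAE.
cv_scores=cross_validate(best_etr_model,final_features_df,final_target_df.values.ravel(),cv=3,return_train_score=True)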

predictions=cross_val_predict(best_etr_model,final_features_df,final_target_df.values.ravel(),cv=3)
mae=mean_absolute_error(final_target_df,predictions)
outlier_fraction=np.sum(np.abs(predictions-final_target_df.values.ravel())>0.15)/len(predictions)

print('Extremely Random Tree:')
print('Mean Test Score:',np.mean(cv_scores['test_score']))
print('Mean Train Score:',np.mean(cv_scores['train_score']))
print('Best Parameters:',best_params)
print('Mean Absolute Error:',mae)
print('Outlier Fraction:',outlier_fraction)

rf_model=RandomForestRegressor(random_state=42)

grid_search=GridSearchCV(estimator=rf_model,param_grid=param_grid,scoring='neg_mean_absolute_error',cv=3,n_jobs=-1)

grid_search.fit(X_train,y_train.values.ravel())

best_params=grid_search.best_params_

best_rf_model=RandomForestRegressor(random_state=42,**best_params)
best_rf_model.fit(X_train,y_train.values.ravel())

cv_scores=cross_validate(best_rf_model,final_features_df,final_target_df.values.ravel(),cv=3,return_train_score=True)

predictions=cross_val_predict(best_rf_model,final_features_df,final_target_df.values.ravel(),cv=3)
mae=mean_absolute_error(final_target_df,predictions)
outlier_fraction=np.sum(np.abs(predictions-final_target_df.values.ravel())>0.15)/len(predictions)

print('\nRandom Forest:')
print('Mean Test Score:',np.mean(cv_scores['test_score']))
print('Mean Train Score:',np.mean(cv_scores['train_score']))
print('Mean Absolute Error:',mae)
print('Outlier Fraction:',outlier_fraction)

Extremely Random Tree:
Mean Test Score: 0.7589850418354764
Mean Train Score: 1.0
Best Parameters: {'max_features': None, 'max_leaf_nodes': None, 'min_impurity_decrease': 0, 'min_samples_leaf': 1, 'min_samples_split': 2}
Mean Absolute Error: 0.08034802976287642
Outlier Fraction: 0.11986681465038845

Random Forest:
Mean Test Score: 0.7456736617631825
Mean Train Score: 0.9617793809347829
Mean Absolute Error: 0.08497041008311171
Outlier Fraction: 0.12731885206912955
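
The outlier fraction above uses an absolute threshold, $|z_{pred} - z_{true}| > 0.15$. Photometric-redshift analyses often normalize the residual by $1 + z_{true}$ instead; which definition the course intends is an assumption here. The sketch below puts both variants in one helper (the name `outlier_fraction_photoz` and its `normalized` flag are hypothetical, not part of the notebook) and runs it on a tiny made-up array rather than the real predictions.

```python
import numpy as np


def outlier_fraction_photoz(z_true, z_pred, threshold=0.15, normalized=False):
    """Fraction of objects whose redshift residual exceeds `threshold`.

    normalized=False: |z_pred - z_true| > threshold (as in the cell above).
    normalized=True:  |z_pred - z_true| / (1 + z_true) > threshold.
    """
    z_true = np.asarray(z_true, dtype=float).ravel()
    z_pred = np.asarray(z_pred, dtype=float).ravel()
    residual = np.abs(z_pred - z_true)
    if normalized:
        residual = residual / (1.0 + z_true)
    return float(np.mean(residual > threshold))


# Toy values for illustration only (not taken from the dataset).
z_true_demo = [0.5, 1.0, 1.0, 0.2]
z_pred_demo = [0.5, 1.2, 1.4, 0.4]
print(outlier_fraction_photoz(z_true_demo, z_pred_demo))
print(outlier_fraction_photoz(z_true_demo, z_pred_demo, normalized=True))
```

With the normalized definition, a fixed absolute error counts as less severe at high redshift, so the two fractions generally differ and should not be compared across definitions.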


#### -The Random Forest has a higher mean absolute error (0.085 vs. 0.080) and a higher outlier fraction (12.7% vs. 12.0%) than the Extremely Random Tree.
#### -The Extremely Random Tree fits the training data perfectly (train score 1.0) but scores only 0.76 on the held-out folds, which indicates high variance; the test score is still fairly high, which indicates low bias. The same holds for the Random Forest, although its train score is lower (0.96, with a test score of 0.75), so its variance is slightly lower and its bias slightly higher.
#### -Since the Extremely Random Tree has the lower mean absolute error and outlier fraction, and its train and test scores are similar to those of the Random Forest, I would choose the Extremely Random Tree. It also seemed to run faster.
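
The variance comparison can be made concrete by computing the gap between the mean train and test scores for each model. This is a small sketch using the scores copied from the printed output; the helper name `train_test_gap` is made up for illustration, and the gap is only a rough proxy for variance, not a formal bias-variance decomposition.

```python
# Scores copied from the printed output (default scorer, i.e. R^2).
scores = {
    "Extremely Random Tree": {"train": 1.0, "test": 0.7589850418354764},
    "Random Forest": {"train": 0.9617793809347829, "test": 0.7456736617631825},
}


def train_test_gap(train_score, test_score):
    """Train minus test score; a larger gap suggests more overfitting."""
    return train_score - test_score


gaps = {name: train_test_gap(s["train"], s["test"]) for name, s in scores.items()}
for name, gap in gaps.items():
    print(f"{name}: train-test gap = {gap:.3f}")
```

Both gaps are above 0.2, so both models overfit noticeably; the Random Forest gap is somewhat smaller, while the Extremely Random Tree still has the better test score.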