In [1]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn import ensemble
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

import numpy as np
from sklearn import datasets

In [2]:
# load data
cal = datasets.fetch_california_housing()
data = cal['data']
targets = cal['target']

# Task 1

In [3]:
# 1a
clf = LinearRegression()
scores = cross_val_score(clf, data, targets, cv=5)
print('Linear Regression R^2 Scores', scores)
print('Linear Regression Mean R^2 Score', np.mean(scores))
print()

clf = ensemble.GradientBoostingRegressor()
scores = cross_val_score(clf, data, targets, cv=5)
print('Boosting R^2 Scores', scores)
print('Boosting Scores Mean R^2 Score', np.mean(scores))

Linear Regression R^2 Scores [0.54866323 0.46820691 0.55078434 0.53698703 0.66051406]
Linear Regression Mean R^2 Score 0.5530311140279233

Boosting R^2 Scores [0.60253089 0.69877396 0.71802327 0.65021286 0.67975317]
Boosting Scores Mean R^2 Score 0.6698588296892413


In [4]:
# 1b
tuned_parameters = [{'max_depth': [3, 10],
                     'n_estimators': [50, 100],
                     'learning_rate': [0.01, 0.1]}]
clf = ensemble.GradientBoostingRegressor()
clf = GridSearchCV(clf, tuned_parameters)
clf.fit(data, targets)

print("Scores for parameter grid search:")
print()
means = clf.cv_results_['mean_test_score']
stds = clf.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, clf.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r"
          % (mean, std * 2, params))



Scores for parameter grid search:

0.334 (+/-0.055) for {'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 50}
0.476 (+/-0.060) for {'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 100}
0.407 (+/-0.082) for {'learning_rate': 0.01, 'max_depth': 10, 'n_estimators': 50}
0.561 (+/-0.098) for {'learning_rate': 0.01, 'max_depth': 10, 'n_estimators': 100}
0.644 (+/-0.082) for {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 50}
0.680 (+/-0.051) for {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 100}
0.645 (+/-0.108) for {'learning_rate': 0.1, 'max_depth': 10, 'n_estimators': 50}
0.647 (+/-0.107) for {'learning_rate': 0.1, 'max_depth': 10, 'n_estimators': 100}


**1c) Briefly discuss the performance and summarize your findings.**

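With 5-fold cross-validation, linear regression and gradient boosting (both with default parameters) gave the following mean R^2 scores:

    Linear Regression Mean R^2 Score = 0.5530311140279233
    Boosting Mean R^2 Score = 0.6698588296892413

Neither score is great. R^2 is the fraction of the target's variance that the model explains: 1 is a perfect fit, and 0 is no better than always predicting the mean. Gradient boosting scores about 0.12 higher than linear regression, which makes sense because boosted trees can capture nonlinear relationships and feature interactions that a linear model cannot. The boosting score can vary slightly from run to run because no random_state is set.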
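To make the R^2 interpretation concrete, here is a minimal sketch that computes it by hand on a toy target vector. The helper `r_squared` and the toy values are illustrative assumptions, not part of the assignment; the formula is the same one used by `sklearn.metrics.r2_score`.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Toy targets (hypothetical values, for illustration only).
y = np.array([1.0, 2.0, 3.0, 4.0])

perfect = r_squared(y, y)                           # predictions equal the targets
baseline = r_squared(y, np.full_like(y, y.mean()))  # always predict the mean
worse = r_squared(y, y[::-1])                       # predictions in reversed order

print(perfect, baseline, worse)
```

The mean predictor scores exactly 0, and a model that does worse than the mean gets a negative R^2, so the score is not bounded below by 0.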


For 1b) we tested all 8 combinations (2 x 2 x 2) of the following parameter values:

    'max_depth': [3, 10],
    'n_estimators': [50, 100],
    'learning_rate': [0.01, 0.1]
    
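The grid search did not improve on the defaults. The best combination (learning_rate=0.1, max_depth=3, n_estimators=100, mean R^2 of 0.680) is the default setting of GradientBoostingRegressor. Lowering the learning rate to 0.01 was very detrimental (mean R^2 between 0.334 and 0.561), most likely because 50 to 100 trees are too few for such a small step size. Increasing max_depth from 3 to 10 helped at the low learning rate, but at learning_rate=0.1 it made no difference with 50 estimators and slightly hurt with 100 (0.680 down to 0.647).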
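Rather than reading the best setting off the printed table, the fitted search object exposes it directly through `best_params_` and `best_score_`. The sketch below shows this on small synthetic data; the dataset, the reduced grid values, and the name `search` are assumptions chosen so the example runs quickly, not the values used above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption: not the California housing set).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Same three parameters as in 1b, with smaller values to keep the run short.
param_grid = {'max_depth': [2, 3],
              'n_estimators': [10, 20],
              'learning_rate': [0.01, 0.1]}

# random_state is fixed so repeated runs give identical scores.
search = GridSearchCV(GradientBoostingRegressor(random_state=0), param_grid, cv=5)
search.fit(X, y)

print('Best parameters:', search.best_params_)
print('Best mean R^2:', search.best_score_)
```

Fixing `random_state` also removes the small run-to-run differences that otherwise make scores hard to compare across cells.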


# Task 2

In [5]:
# binary labels: True where the median house value exceeds 2 (in units of $100,000)
new_targets = targets > 2

In [6]:
# 2a
clf = LogisticRegression()
scores = cross_val_score(clf, data, new_targets, cv=5)
print('Logistic Regression Accuracy', scores)
print('Logistic Regression Mean Accuracy', np.mean(scores))
print()

clf = ensemble.GradientBoostingClassifier()
scores = cross_val_score(clf, data, new_targets, cv=5)
print('Boosting Classifier Accuracy', scores)
print('Boosting Classifier Mean Accuracy', np.mean(scores))



Logistic Regression Accuracy [0.80988133 0.79796512 0.77616279 0.74612403 0.82481221]
Logistic Regression Mean Accuracy 0.7909890954886174

Boosting Classifier Accuracy [0.79099055 0.75436047 0.80741279 0.75339147 0.82650836]
Boosting Classifier Mean Accuracy 0.7865327285758221


In [None]:
# 2b
tuned_parameters = [{'max_depth': [3, 5],
                     'n_estimators': [100, 200],
                     'learning_rate': [0.1, 0.5]}]
clf = ensemble.GradientBoostingClassifier()
clf = GridSearchCV(clf, tuned_parameters)
clf.fit(data, new_targets)

print("Scores for parameter grid search:")
print()
means = clf.cv_results_['mean_test_score']
stds = clf.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, clf.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r"
          % (mean, std * 2, params))



In [None]:
# 2c
clf = LogisticRegression()
scores = cross_val_score(clf, data, new_targets, cv=5, scoring='roc_auc')
print('Logistic Regression ROC AUC scores', scores)
print('Logistic Regression ROC AUC Mean score', np.mean(scores))
print()

clf = ensemble.GradientBoostingClassifier()
scores = cross_val_score(clf, data, new_targets, cv=5, scoring='roc_auc')
print('Boosting Classifier ROC AUC scores', scores)
print('Boosting Classifier ROC AUC Mean score', np.mean(scores))

**2d) Briefly discuss the performance and summarize your findings. Are they good classifiers? Compare the result with a trivial classifier. Compare the results when using accuracy and ROC_AUC.**

The logistic regression and gradient boosting classifiers both performed reasonably well. Their mean accuracies under 5-fold cross-validation were:

    Logistic Regression Mean Accuracy = 0.7909890954886174
    Boosting Classifier Mean Accuracy = 0.7865327285758221

The accuracies were roughly the same, with logistic regression ahead by about 0.004, which is well within the fold-to-fold spread.
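The logistic regression above is fit on the raw features, which sit on very different scales. A common variant, sketched here on synthetic data, standardizes the features inside a pipeline so that scaling is re-fit on each training fold. Whether this changes the accuracy on the housing data is not tested here; the data and the name `pipe` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (assumption, for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2)) * np.array([1.0, 1000.0])
y = (X[:, 0] + X[:, 1] / 1000.0 > 0).astype(int)

# The scaler is fit only on the training part of each fold,
# so no information leaks from the held-out fold.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(pipe, X, y, cv=5)

print('Scaled Logistic Regression Accuracy', scores)
print('Scaled Logistic Regression Mean Accuracy', np.mean(scores))
```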

The results of ROC_AUC were:

    Logistic Regression ROC AUC Mean score 0.8701644975389575
    Boosting Classifier ROC AUC Mean score 0.8898095631008707
    
ROC AUC takes values in [0, 1] and measures how well a classifier ranks positive examples above negative ones. Values close to 1 indicate good discrimination, and 0.5 corresponds to random ranking. Given the results above, both models separate the two classes relatively well. A trivial classifier that always predicts the majority class would achieve a ROC AUC of 0.5 and an accuracy equal to the majority-class proportion, which is about 0.5 only if the two classes are evenly balanced.
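The trivial baseline can be measured rather than assumed. The sketch below uses scikit-learn's `DummyClassifier` on synthetic labels with a 60/40 split; the data are an assumption for illustration, and in this notebook the same two calls could be run with `data` and `new_targets` instead.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Synthetic labels with a 60/40 class split (assumption, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = np.array([0] * 600 + [1] * 400)

# Always predicts the most frequent class seen during training.
dummy = DummyClassifier(strategy='most_frequent')
acc = cross_val_score(dummy, X, y, cv=5, scoring='accuracy')
auc = cross_val_score(dummy, X, y, cv=5, scoring='roc_auc')

print('Trivial classifier mean accuracy', acc.mean())
print('Trivial classifier mean ROC AUC', auc.mean())
```

The baseline accuracy equals the majority-class share, so on imbalanced labels an accuracy well above 0.5 can still be close to trivial, while the ROC AUC baseline stays at 0.5.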

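The two classifiers performed similarly, although the boosting classifier has a slightly higher mean ROC AUC (0.890 vs. 0.870). The ROC AUC values are also higher than the accuracy scores, but the two metrics are not directly comparable: accuracy depends on a single decision threshold (0.5 by default), whereas ROC AUC summarizes ranking quality across all thresholds.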
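A small hand-made example shows how the two metrics can disagree. The labels and predicted probabilities below are invented for illustration: the scores rank every positive above every negative, yet none of them crosses the 0.5 threshold.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Hypothetical labels and predicted probabilities for the positive class.
y_true = np.array([0, 0, 1, 1])
proba = np.array([0.1, 0.2, 0.3, 0.4])

# Accuracy needs hard predictions, obtained here with a 0.5 threshold.
y_pred = (proba >= 0.5).astype(int)
acc = accuracy_score(y_true, y_pred)

# ROC AUC only looks at how the scores order the examples.
auc = roc_auc_score(y_true, proba)

print('Accuracy', acc)
print('ROC AUC', auc)
```

Here the ranking is perfect but every example is predicted negative, so half the labels are wrong. A gap between ROC AUC and accuracy therefore says more about the threshold than about which metric is "better".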