# Homework 5: PCA and Boosting


This assignment is due on Moodle by **11:59pm on Friday November 16**. 
Your solutions to theoretical questions should be done in Markdown/MathJax directly below the associated question.
Your solutions to computational questions should include any specified Python code and results 
as well as written commentary on your conclusions.
Remember that you are encouraged to discuss the problems with your instructors and classmates, 
but **you must write all code and solutions on your own**. For a refresher on the course **Collaboration Policy** click [here](https://github.com/BoulderDS/CSCI-4622-Machine-Learning-18fa/blob/master/info/syllabus.md#collaboration-policy).

**NOTES**: 

- Do **NOT** load or use any Python packages that are not available in Anaconda 3.6. 
- Some problems with code may be autograded.  If we provide a function API **do not** change it.  If we do not provide a function API then you're free to structure your code however you like. 
- Submit only this Jupyter notebook to Moodle.  Do not compress it using tar, rar, zip, etc.
- **Unzip the files in the data folder**.


Name:

In [None]:
import math
import pickle
import gzip
import numpy as np
import pandas
import matplotlib.pyplot as plt
%matplotlib inline

[40 points] Problem 1 - Principal Component Analysis
---

In this problem you'll be implementing dimensionality reduction using the Principal Component Analysis technique. 

The gist of the PCA algorithm for computing principal components is as follows:
- Calculate the covariance matrix of the data points X.
- Calculate the eigenvectors and their corresponding eigenvalues.
- Sort the eigenvectors by their eigenvalues in decreasing order.
- Choose the first k eigenvectors that satisfy the target explained variance.
- Transform the original n-dimensional data points into k dimensions.
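
The steps above can be sketched end to end on toy data. This is only an illustrative sketch, not the required class API: the function name `pca_sketch`, the toy matrix `X_toy`, and the use of `np.linalg.eigh` and `np.searchsorted` are assumptions made for this example, and the explained variance of a component is taken to be its eigenvalue divided by the sum of all eigenvalues.

```python
import numpy as np

def pca_sketch(X, target_explained_variance=0.95):
    """Hypothetical helper: project X (m x n) onto its top-k principal components."""
    # Standardize each feature; constant features get a divisor of 1 to avoid dividing by zero.
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0
    X_std = (X - mean) / std

    # Covariance matrix (n x n) of the standardized data, computed from its definition.
    centered = X_std - X_std.mean(axis=0)
    cov = centered.T @ centered / (X.shape[0] - 1)

    # eigh is meant for symmetric matrices and returns eigenvalues in ascending order,
    # so reorder eigenvalues and eigenvector columns into decreasing order.
    eig_vals, eig_vecs = np.linalg.eigh(cov)
    order = np.argsort(eig_vals)[::-1]
    eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]

    # Smallest k whose cumulative explained-variance ratio reaches the target.
    cum_var = np.cumsum(eig_vals) / np.sum(eig_vals)
    k = int(np.searchsorted(cum_var, target_explained_variance)) + 1
    k = min(k, X.shape[1])

    return X_std @ eig_vecs[:, :k], k

rng = np.random.default_rng(0)
# Toy data: 200 samples, 5 features, where the last two are near-copies of the first two.
base = rng.normal(size=(200, 3))
X_toy = np.hstack([base, base[:, :2] + 0.01 * rng.normal(size=(200, 2))])
X_reduced, k = pca_sketch(X_toy, target_explained_variance=0.99)
print(X_reduced.shape, k)
```

Because two of the five toy features are nearly redundant, almost all of the variance lives in three directions, which is what the selected `k` reflects.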


The skeleton for the *PCA* class is below. Scroll down to find more information about your tasks as well as unit tests.

In [None]:
class PCA:
    def __init__(self, target_explained_variance=None):
        """
        target_explained_variance: float, the target level of explained variance
        """
        self.target_explained_variance = target_explained_variance
        self.feature_size = -1

    def standardize(self, X):
        """
        Standardize the features using a standard scaler.
        :param X: m x n feature data
        :return: standardized features
        """
        # YOUR CODE HERE
        raise NotImplementedError()

    def compute_mean_vector(self, X_std):
        """
        Compute the mean vector.
        :param X_std: standardized data
        :return: n x 1 matrix, the mean vector
        """
        # YOUR CODE HERE
        raise NotImplementedError()

    def compute_cov(self, X_std, mean_vec):
        """
        Compute the covariance matrix using the mean vector (do not use numpy.cov).
        :param X_std: standardized data
        :param mean_vec: mean vector
        :return: n x n covariance matrix
        """
        # YOUR CODE HERE
        raise NotImplementedError()

    def compute_eigen_vector(self, cov_mat):
        """
        Compute the eigenvectors and eigenvalues using numpy.
        :param cov_mat: covariance matrix
        :return: (eigen_vector, eigen_values)
        """
        # YOUR CODE HERE
        raise NotImplementedError()

    def compute_explained_variance(self, eigen_vals):
        """
        Sort the eigenvalues and compute the explained variance.
        The explained variance indicates how much information (variance)
        can be attributed to each of the principal components.
        :param eigen_vals: eigenvalues
        :return: explained variance
        """
        # YOUR CODE HERE
        raise NotImplementedError()

    def cumulative_sum(self, var_exp):
        """
        return cumulative sum of explained variance.
        :param var_exp: explained variance
        :return: cumulative explained variance
        """
        return np.cumsum(var_exp)

    def compute_weight_matrix(self, eig_pairs, cum_var_exp):
        """
        Compute the weight matrix of the top principal components, conditioned on the
        target explained variance.
        (Hint: use the cumulative explained variance and target_explained_variance to find
        the top components.)

        :param eig_pairs: list of tuples containing eigenvectors and eigenvalues
        :param cum_var_exp: cumulative explained variance by features
        :return: weight matrix
        """
        # YOUR CODE HERE
        raise NotImplementedError()

    def transform_data(self, X_std, matrix_w):
        """
        transform data to subspace using weight matrix
        :param X_std: standardized data
        :param matrix_w: weight matrix
        :return: data in the subspace
        """
        return X_std.dot(matrix_w)

    def fit(self, X):
        """
        Entry point for transforming the data to k dimensions:
        standardize the data and compute the weight matrix used to transform it.
        :param X: m x n training samples
        :return: m x k data in the subspace
        """
    
        self.feature_size = X.shape[1]
        
        # YOUR CODE HERE
        raise NotImplementedError()
        return self.transform_data(X_std=X_std, matrix_w=matrix_w)


**[PART A - 25 Points]** Your task is to implement the helper functions that compute the *mean, 
covariance, eigenvectors and weights*.

Complete *fit()* using all the helper functions to compute the reduced-dimension data.

In [None]:
%run -i tests/tests.py

**[PART A - Continued]** Run PCA on the *Fashion-MNIST dataset* to reduce the dimension of the data.

The Fashion-MNIST data consists of samples with *784 dimensions*.

Report the reduced dimension $k$ for a target explained variance of **0.99**.

In [None]:
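# Load the pickled images and labels, closing each file after reading.
with open('./data/mnist/train_images.pkl', 'rb') as f:
    X_train = pickle.load(f)
with open('./data/mnist/train_image_labels.pkl', 'rb') as f:
    y_train = pickle.load(f)
# Keep only the first 15000 samples.
X_train = X_train[:15000]
y_train = y_train[:15000]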

In [None]:
pca_handler = PCA(target_explained_variance=0.99)
X_train_updated = pca_handler.fit(X_train)

**[PART B]** Run an *SVM classifier* (refer to HW4) on the reduced-dimension data with a linear kernel and C = 0.01.

Report the accuracy on the test dataset.
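
As a rough illustration of this setup, the sketch below trains a linear-kernel SVM with C = 0.01 on synthetic data and reports held-out accuracy. It assumes scikit-learn's `SVC` is an acceptable stand-in for the HW4 classifier; the arrays `X_demo` and `y_demo` are made up for the example and are not the reduced Fashion-MNIST data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Synthetic stand-in for reduced data: two well-separated Gaussian blobs in 10 dimensions.
X_demo = np.vstack([rng.normal(-2.0, 1.0, size=(100, 10)),
                    rng.normal(2.0, 1.0, size=(100, 10))])
y_demo = np.array([0] * 100 + [1] * 100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

# Linear kernel with a small C, i.e. strong regularization.
clf = SVC(kernel="linear", C=0.01)
clf.fit(X_tr, y_tr)
accuracy = clf.score(X_te, y_te)
print(f"test accuracy: {accuracy:.3f}")
```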

In [None]:
from sklearn.model_selection import train_test_split
X_t, X_test, y_t, y_test = train_test_split(X_train_updated, y_train, test_size=0.2, random_state=42)

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

**[PART C 10 Points]** Repeat the same experiment for different values of target explained variance in the range **[0.8-1.0]** with increments of 0.04, and report the reduced dimension size for each.

- Plot accuracy vs. target explained variance.
- Plot accuracy vs. number of components.

Discuss your observations.

In [None]:
for target_variance in np.arange(0.5,1.02,.04):
    # YOUR CODE HERE
    raise NotImplementedError()
    print(X_t.shape, target_variance, clf.score(X_test, y_test))

[20 points] Problem 2 - Statistical PCA for non-zero mean random variables
---
Following the notation used in class, we define the *principal components* of $x$ as $v_i$. Assume that the eigenvalues of the covariance matrix $C$ are distinct.

Show that

1) $v_1$ is the eigenvector of $C$ corresponding to its largest eigenvalue.

2) $v_2^T v_1 = 0$ and $v_2$ is the eigenvector of $C$ corresponding to its second largest eigenvalue.

YOUR ANSWER HERE

[40 points] Problem 3 - Decision Tree Ensembles: Bagging and Boosting
---

**[PART A 20 Points] *House price prediction* using Decision tree ensembles**

In this regression problem, we compare decision trees and their ensembles (bagging and boosting) on the House Prices prediction dataset: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data

Use the standard regression APIs for decision tree ensembles from sklearn to predict the house price: http://scikit-learn.org/stable/modules/classes.html#module-sklearn.ensemble

Complete the ensemble_test class so that it fits the model received as a parameter and stores the test performance.

Use the **ensemble_test** class to plot the score and the time taken to fit the data.
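
One way to picture the measurement that ensemble_test records is a helper that times the fit and then scores predictions on held-out data. The sketch below is an assumption-laden illustration: the helper name `timed_fit_and_score`, the synthetic data, and the choice of `explained_variance_score` as the regression metric are made up for this example and are not prescribed by the assignment.

```python
from time import time

import numpy as np
from sklearn.metrics import explained_variance_score
from sklearn.tree import DecisionTreeRegressor

def timed_fit_and_score(model, X_train, y_train, X_test, y_test):
    """Hypothetical helper: return (test score, seconds spent in fit)."""
    start = time()
    model.fit(X_train, y_train)
    elapsed = time() - start
    score = explained_variance_score(y_test, model.predict(X_test))
    return score, elapsed

rng = np.random.default_rng(0)
# Synthetic regression data: a smooth target plus a little noise.
X_demo = rng.uniform(-3, 3, size=(300, 2))
y_demo = X_demo[:, 0] ** 2 + X_demo[:, 1] + rng.normal(scale=0.1, size=300)

score, seconds = timed_fit_and_score(
    DecisionTreeRegressor(random_state=0),
    X_demo[:240], y_demo[:240], X_demo[240:], y_demo[240:])
print(f"explained variance: {score:.3f}, fit time: {seconds:.4f}s")
```

Storing both numbers under the model's name in a dictionary, as the `scores` and `execution_time` attributes suggest, makes the later comparison a simple lookup.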

In [None]:
from time import time
from sklearn.metrics import explained_variance_score, precision_score
import pandas as pd


class ensemble_test:
    """
        Test multiple model performance
    """

    def __init__(self, X_train, y_train, X_test, y_test):
        """
        Initialize the data and the containers for scores and execution times.
        :param X_train: training features
        :param y_train: training targets
        :param X_test: test features
        :param y_test: test targets
        """
        self.scores = {}
        self.execution_time = {}
        self.X_train = X_train
        self.y_train = y_train
        self.X_test = X_test
        self.y_test = y_test
        
    def fit_model(self, model, name):
        """
        Fit the model on the training data, predict on the test data, and
        store the score and execution time for each fit.
        :param model: model
        :param name: name of model
        """
        # YOUR CODE HERE
        raise NotImplementedError()
        

    def print_result(self):
        """
            print results for all models trained and tested.
        """

        models_cross = pd.DataFrame({
            'Model'         : list(self.scores.keys()),
            'Score'         : list(self.scores.values()),
            'Execution time': list(self.execution_time.values())})
        print(models_cross.sort_values(by='Score', ascending=False))

    def plot_metric(self):
        """
         plot each metric : time, accuracy
        """

        # YOUR CODE HERE
        raise NotImplementedError()

In [None]:
with open('./data/house_predictions/test_train.pkl', 'rb') as f:
    X_train, X_test, y_train, y_test = pickle.load(f)

In [None]:
# Create a handler for ensemble_test and use it to fit different models.
ensemble_handler = ensemble_test(X_train,y_train,X_test,y_test)
from sklearn.tree  import DecisionTreeRegressor
decision = DecisionTreeRegressor()
ensemble_handler.fit_model(decision, 'decision_tree')

Complete the cells below to fit the dataset using *RandomForestRegressor* and *AdaBoostRegressor* with appropriate parameters (use n_estimators=500 for both).

Compare the accuracy and time taken, and report the results.

In [None]:
from sklearn.ensemble import RandomForestRegressor

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
from sklearn.ensemble import AdaBoostRegressor
# YOUR CODE HERE
raise NotImplementedError()

#### Report accuracy and execution time for each model:

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
ensemble_handler.plot_metric()

**[PART B 15 points]** This is an extension of the HW4 problem on sentiment classification of reviews.

Here we use decision tree ensembles to **classify** each review as positive or negative.

In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.model_selection import StratifiedKFold

In [None]:
reviews  = pd.read_csv('./data/reviews.csv')
train, test = train_test_split(reviews, test_size=0.2, random_state=4622)
X_train = train['reviews'].values
X_test = test['reviews'].values
y_train = train['sentiment']
y_test = test['sentiment']

Finish the following:

* Create pipelines for *RandomForestClassifier* and *AdaBoostClassifier*, as shown for *DecisionTreeClassifier* below. Refer to: http://scikit-learn.org/stable/modules/classes.html#module-sklearn.ensemble
* Fit the models above on the reviews dataset and report the results (tune the classifier parameters for optimal performance).
* Use n_estimators = 500 for both.
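
To make the pipeline idea concrete, here is a small sketch that chains a `CountVectorizer` with each ensemble classifier via `make_pipeline`. Everything in it is illustrative: the six-sentence corpus, the default vectorizer settings, and the reduced `n_estimators=20` (chosen only so the sketch runs quickly) are assumptions for this example, not the assignment's configuration.

```python
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# Tiny hand-written corpus, purely illustrative (not the reviews dataset).
texts = ["great movie, loved it", "terrible plot and bad acting",
         "loved the acting", "bad movie, terrible pacing",
         "great pacing and great plot", "bad, just bad"]
labels = [1, 0, 1, 0, 1, 0]
new_texts = ["loved the plot", "terrible acting"]

# Each pipeline vectorizes raw text and then fits the classifier on the counts.
pipelines = {
    "random forest": make_pipeline(
        CountVectorizer(), RandomForestClassifier(n_estimators=20, random_state=0)),
    "adaboost": make_pipeline(
        CountVectorizer(), AdaBoostClassifier(n_estimators=20, random_state=0)),
}

predictions = {}
for name, pipe in pipelines.items():
    pipe.fit(texts, labels)
    predictions[name] = pipe.predict(new_texts)
    print(name, predictions[name])
```

Wrapping the vectorizer and the classifier in one pipeline means raw strings can be passed straight to `fit` and `predict`, which is why the handler can be given text arrays directly.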

In [None]:
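# Define the tokenizer and the vectorizer used by each pipeline.
def tokenize(text):
    tknzr = TweetTokenizer()
    return tknzr.tokenize(text)

# Requires the NLTK stopwords corpus (nltk.download('stopwords')).
# A list is used because recent scikit-learn versions may reject a set for stop_words.
en_stopwords = list(stopwords.words("english"))
vectorizer = CountVectorizer(
    analyzer='word',
    tokenizer=tokenize,
    lowercase=True,
    ngram_range=(1, 2),
    stop_words=en_stopwords,
    min_df=10)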

In [None]:
# Create a handler for ensemble_test and use it to fit different models.
ensemble_classifier_handler = ensemble_test(X_train, y_train, X_test, y_test)

In [None]:
from sklearn.tree import DecisionTreeClassifier
pipeline_decision_tree = make_pipeline(vectorizer, DecisionTreeClassifier())
ensemble_classifier_handler.fit_model(pipeline_decision_tree, 'decision tree classifier')

In [None]:
ensemble_classifier_handler.print_result()

In [None]:
from sklearn.ensemble import RandomForestClassifier
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
from sklearn.ensemble import AdaBoostClassifier
# YOUR CODE HERE
raise NotImplementedError()

####  RESULTS:

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
ensemble_classifier_handler.plot_metric()

**[PART C 5 points]** Discuss at least one advantage and one disadvantage of *Random Forest*
and *AdaBoost* in the space below.

YOUR ANSWER HERE