# Lab1 - Scikit-learn
Author: *Steven Duong (30022492)*

## 1. Introduction

The goal of this lab is to become familiar with the scikit-learn library.

You will practice loading example datasets, performing classification and regression with linear scikit-learn models, and investigating the effects of reducing the number of features (columns in X) and the number of samples (rows in X and y).


In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## 2. Classification

Using yellowbrick spam - classification  
https://www.scikit-yb.org/en/latest/api/datasets/spam.html

The goal is to investigate `LogisticRegression(max_iter=2000)` and the effects of reducing the number of features and the number of samples on classification performance.

### 2.1 Implement convenience function

In [4]:
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def get_classifier_accuracy(model, X, y):
    '''Calculate train and validation accuracy of classifier (model)
        
        Splits feature matrix X and target vector y 
        with sklearn train_test_split() and random_state=956.
        
        model (sklearn classifier): Classifier to train and evaluate
        X (numpy.array or pandas.DataFrame): Feature matrix
        y (numpy.array or pandas.Series): Target vector
        
        returns: training accuracy, validation accuracy
    '''
    
    # Splitting X and y into training and validation sets
    # X is the feature matrix, while y is the target vector
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=956)
    
    # Building the model on the training set 
    model.fit(X_train, y_train)

    # Predicting the outcome with the training set
    y_train_pred = model.predict(X_train)

    # Computing the accuracy of the prediction
    acc_train = accuracy_score(y_train, y_train_pred)

    # Predict the outcome with the validation set
    y_test_pred = model.predict(X_test)
    
    # Computing the accuracy of the prediction
    acc_test = accuracy_score(y_test, y_test_pred)

    return (acc_train, acc_test)
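
As a sanity check on what `accuracy_score` reports: accuracy is the fraction of predictions that match the true labels. The sketch below recomputes it by hand in plain Python. The `manual_accuracy` helper and the label lists are made up for illustration and are not part of the lab.

```python
def manual_accuracy(y_true, y_pred):
    """Fraction of predictions equal to the true labels (hypothetical helper)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)

# Made-up labels for illustration: 1 = spam, 0 = not spam
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# 6 of the 8 predictions match the true labels
acc = manual_accuracy(y_true, y_pred)
print(f"accuracy = {acc:.3f}")
```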

### 2.2 Load data

Using the yellowbrick function `load_spam()`, load the spam data set into feature matrix `X` and target vector `y`.

Print size and type of `X` and `y`.


In [5]:
from yellowbrick.datasets.loaders import load_spam

# loading in spam data set into feature matrix X and target vector y
X, y = load_spam()

# Printing size and type of X and y
print(f"Size of X: {X.shape}\nType of X: \n{X.dtypes}")
print(f"\nSize of y: {y.size}\nType of y: {y.dtype}")


Size of X: (4600, 57)
Type of X: 
word_freq_make                float64
word_freq_address             float64
word_freq_all                 float64
word_freq_3d                  float64
word_freq_our                 float64
word_freq_over                float64
word_freq_remove              float64
word_freq_internet            float64
word_freq_order               float64
word_freq_mail                float64
word_freq_receive             float64
word_freq_will                float64
word_freq_people              float64
word_freq_report              float64
word_freq_addresses           float64
word_freq_free                float64
word_freq_business            float64
word_freq_email               float64
word_freq_you                 float64
word_freq_credit              float64
word_freq_your                float64
word_freq_font                float64
word_freq_000                 float64
word_freq_money               float64
word_freq_hp                  float64
word_freq_hpl   

Using the sklearn function `train_test_split()` prepare a feature matrix `X_small` and target vector `y_small` that contain only **1%** of the rows. Use `random_state=174`.

Print size and type of `X_small` and `y_small`.

In [6]:
# Creating a feature matrix and target vector that contain only 1% of the rows
X_small, X_rest, y_small, y_rest = train_test_split(X, y, random_state=174, train_size=0.01)

# Printing the size and type of X_small and y_small
print(f"Size of X_small: {X_small.size}\nType of X_small:\n{X_small.dtypes}")
print(f"\nSize of y_small: {y_small.size}\nType of y_small: {y_small.dtype}")


Size of X_small: 2622
Type of X_small:
word_freq_make                float64
word_freq_address             float64
word_freq_all                 float64
word_freq_3d                  float64
word_freq_our                 float64
word_freq_over                float64
word_freq_remove              float64
word_freq_internet            float64
word_freq_order               float64
word_freq_mail                float64
word_freq_receive             float64
word_freq_will                float64
word_freq_people              float64
word_freq_report              float64
word_freq_addresses           float64
word_freq_free                float64
word_freq_business            float64
word_freq_email               float64
word_freq_you                 float64
word_freq_credit              float64
word_freq_your                float64
word_freq_font                float64
word_freq_000                 float64
word_freq_money               float64
word_freq_hp                  float64
word_freq_h
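
The value 2622 printed for `X_small` is the element count reported by `.size` (rows times columns), not the number of rows. Below is a rough sketch of the arithmetic. It assumes `train_test_split` converts a fractional `train_size` to the floor of `train_size * n_samples`, and the variable names are illustrative.

```python
import math

n_rows, n_cols = 4600, 57      # shape of X printed for the full spam data
train_size = 0.01              # fraction passed to train_test_split

# Assumption: a fractional train_size becomes floor(train_size * n_rows)
n_small = math.floor(train_size * n_rows)

print(f"rows in X_small: {n_small}")
print(f"elements in X_small (.size): {n_small * n_cols}")
```

Printing `X_small.shape` instead of `X_small.size` would show the row and column counts directly.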

### 2.3 Train and evaluate models

1. Import `LogisticRegression` from sklearn
2. Instantiate model `LogisticRegression(max_iter=2000)`.
3. Create a pandas DataFrame `results` with columns: Data size, training accuracy, validation accuracy
4. Call your convenience function `get_classifier_accuracy()` using 
    - `X` and `y`
    - Only first two columns of `X` and `y`
    - `X_small` and `y_small`
5. Add the data size, training and validation accuracy for each call to the `results` DataFrame
6. Print `results`

In [7]:
from sklearn.linear_model import LogisticRegression

# Instantiating model Logistic Regression
model = LogisticRegression(max_iter=2000)

# Creating a Pandas DataFrame
results = pd.DataFrame(columns=['Data size', 'training accuracy', 'validation accuracy'])
    
# Calculate the accuracy using X and y
train_acc, val_acc = get_classifier_accuracy(model, X, y)
print("Results for X and y:")
print("Training accuracy = {0:.5f}".format(train_acc))
print("Validation accuracy = {0:.5f}".format(val_acc))

# Calculate the accuracy using only the first two columns of X
train_acc, val_acc = get_classifier_accuracy(model, X.iloc[:,:2], y)
print("\nResults for first two columns of X:")
print("Training accuracy = {0:.5f}".format(train_acc))
print("Validation accuracy = {0:.5f}".format(val_acc))


# Calculate the accuracy using X_small and y_small
train_acc, val_acc = get_classifier_accuracy(model, X_small, y_small)
print("\nResults for X_small and y_small:")
print("Training accuracy = {0:.5f}".format(train_acc))
print("Validation accuracy = {0:.5f}".format(val_acc))


Results for X and y:
Training accuracy = 0.93449
Validation accuracy = 0.91826

Results for first two columns of X:
Training accuracy = 0.60899
Validation accuracy = 0.61304

Results for X_small and y_small:
Training accuracy = 0.94118
Validation accuracy = 0.75000
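
Steps 5 and 6 ask for the values to be collected in the `results` DataFrame. A minimal sketch of one way to do that is shown here, with the accuracies copied from the printed output. The "Data size" strings are an assumed format, the 46 x 57 entry is inferred from the 2622 elements printed for `X_small`, and the `gap` column is an extra that the lab does not request.

```python
import pandas as pd

# Accuracies copied from the printed output of the three runs
rows = [
    {"Data size": "4600 x 57", "training accuracy": 0.93449, "validation accuracy": 0.91826},
    {"Data size": "4600 x 2", "training accuracy": 0.60899, "validation accuracy": 0.61304},
    {"Data size": "46 x 57", "training accuracy": 0.94118, "validation accuracy": 0.75000},
]
results = pd.DataFrame(
    rows, columns=["Data size", "training accuracy", "validation accuracy"]
)

# Extra column (not requested by the lab): training minus validation accuracy
results["gap"] = results["training accuracy"] - results["validation accuracy"]

print(results)
```

In the notebook itself, the same rows could be built from the values returned by `get_classifier_accuracy()` rather than typed in by hand.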


### 2.4 Questions
1. What is the validation accuracy using all data? What is the difference between training and validation accuracy?
1. How does the validation accuracy and difference between training and validation change when only two columns are used? Provide values.
1. How does the validation accuracy and difference between training and validation change when only 1% of the rows are used? Provide values.

*YOUR ANSWERS HERE*
1. With all data, the validation accuracy is 0.91826 and the training accuracy is 0.93449, a difference of 0.01623. Training accuracy measures how well a supervised learning model fits the data it was trained on, while validation accuracy estimates how well it performs on held-out data it has not seen. The small gap suggests the model generalizes well, so roughly 92% accuracy can be expected on new data drawn from the same distribution.
2. When only two columns are used, both accuracies drop sharply: the training accuracy falls to 0.60899 and the validation accuracy to 0.61304 (from 0.93449 and 0.91826 with all columns). The difference shrinks to 0.00405, with the validation score slightly above the training score. Two features do not carry enough information to separate spam from non-spam, so the model is too simple for the task, a situation known as underfitting. A validation score marginally higher than the training score does not indicate an error; a difference this small is within the variation expected from a single random split.
3. When only 1% of the rows are used, the training accuracy rises slightly to 0.94118 while the validation accuracy drops to 0.75000, so the difference grows from 0.01623 to 0.19118. With so few samples, the model fits the peculiarities of the training rows rather than patterns that generalize, which is overfitting. The validation score is also computed on only a handful of rows, so it is a noisy estimate.
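
How few rows sit behind the 1% result can be made concrete. The sketch below assumes `get_classifier_accuracy()` leaves `train_test_split` at its default 25% test fraction and that the test count is rounded up. That is an assumption about library behaviour, not something printed in this notebook.

```python
import math

n_small = 46                     # rows in X_small (2622 elements / 57 columns)

# Assumption: default test_size=0.25, with the test count rounded up
n_val = math.ceil(0.25 * n_small)
n_train = n_small - n_val

# Each misclassified validation row changes the accuracy by this much
step = 1 / n_val

print(f"training rows: {n_train}, validation rows: {n_val}")
print(f"one validation error shifts accuracy by {step:.3f}")
```

Under these assumptions, the reported 0.75000 would correspond to 9 of 12 validation rows classified correctly and 0.94118 to 32 of 34 training rows.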



## 3. Regression

Using yellowbrick energy - regression  
https://www.scikit-yb.org/en/latest/api/datasets/energy.html

The goal is to investigate `LinearRegression()` and the effects of reducing the number of features and the number of samples on regression performance.

### 3.1 Implement convenience function

In [8]:
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def get_regressor_mse(model, X, y):
    '''Calculate train and validation mean-squared error (mse) of regressor (model)
        
        Splits feature matrix X and target vector y 
        with sklearn train_test_split() and random_state=956.
        
        model (sklearn regressor): Regressor to train and evaluate
        X (numpy.array or pandas.DataFrame): Feature matrix
        y (numpy.array or pandas.Series): Target vector
        
        returns: training mse, validation mse
    
    '''

    # Partitioning the data into training and testing
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=956)

    # Fitting the training model
    model.fit(X_train, y_train)

    # Predicting the outcome with the training data
    y_train_pred = model.predict(X_train)

    # Calculating the mse of the training data
    train_mse = mean_squared_error(y_train, y_train_pred)

    # Predicting the outcome with the validation data
    y_test_pred = model.predict(X_test)

    # Calculating the mse of the validation data
    val_mse = mean_squared_error(y_test, y_test_pred)

    return (train_mse, val_mse)
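
For reference, the mean-squared error returned by `mean_squared_error` is the average of the squared differences between targets and predictions. The sketch below works through it by hand in plain Python. The `manual_mse` helper and the numbers are made up for illustration.

```python
def manual_mse(y_true, y_pred):
    """Average of squared differences between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(squared_errors) / len(squared_errors)

# Made-up targets and predictions for illustration
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 40.0]

# Squared errors are 4, 4, 9 and 0, so the mean is 17 / 4
mse = manual_mse(y_true, y_pred)
print(f"MSE = {mse}")
```

Unlike accuracy, lower is better, and the value is in squared units of the target.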


### 3.2 Load data

Using the yellowbrick function `load_energy()`, load the energy data set into feature matrix `X` and target vector `y`.

Print dimensions and type of `X` and `y`.

In [10]:
from yellowbrick.datasets import load_energy

# Loading the energy data set
X, y = load_energy()

print(f"Dimension of X: {X.shape}\nType of X:\n{X.dtypes}")
print(f"\nDimension of y: {y.shape}\nType of y: {y.dtype}")


Dimension of X: (768, 8)
Type of X:
relative compactness         float64
surface area                 float64
wall area                    float64
roof area                    float64
overall height               float64
orientation                    int64
glazing area                 float64
glazing area distribution      int64
dtype: object

Dimension of y: (768,)
Type of y: float64


Using the sklearn function `train_test_split()` prepare a feature matrix `X_small` and target vector `y_small` that contain only **1%** of the rows. Use `random_state=174`.

Print size and type of `X_small` and `y_small`.

In [None]:
# TODO: ADD YOUR CODE HERE
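
One possible shape for this cell is sketched below, following the same pattern as the spam data. To keep the example self-contained it uses randomly generated stand-in data with the same dimensions as the energy data set (768 rows, 8 columns). `X_demo`, `y_demo`, and the column names are hypothetical, and in the notebook the `X` and `y` from `load_energy()` would be used instead.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in data with the same shape as the energy data set (768 rows, 8 columns).
# In the notebook, use the X and y returned by load_energy() instead.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(768, 8)), columns=[f"f{i}" for i in range(8)])
y_demo = pd.Series(rng.normal(size=768), name="target")

# Keep 1% of the rows; the remaining 99% are discarded
X_small, _, y_small, _ = train_test_split(
    X_demo, y_demo, train_size=0.01, random_state=174
)

print(f"Shape of X_small: {X_small.shape}\nType of X_small:\n{X_small.dtypes}")
print(f"\nShape of y_small: {y_small.shape}\nType of y_small: {y_small.dtype}")
```

With 768 rows, 1% rounds down to 7 rows, which leaves very little data for the later train/validation split.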


### 3.3 Train and evaluate models

1. Import `LinearRegression` from sklearn
2. Instantiate model `LinearRegression()`.
3. Create a pandas DataFrame `results` with columns: Data size, training MSE, validation MSE
4. Call your convenience function `get_regressor_mse()` using 
    - `X` and `y`
    - Only first two columns of `X` and `y`
    - `X_small` and `y_small`
5. Add the data size, training and validation MSE for each call to the `results` DataFrame
6. Print `results`

In [None]:
# TODO: ADD YOUR CODE HERE
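
The steps above can be sketched as follows. This is an illustration on synthetic stand-in data, not the lab's actual results: `regressor_mse` is a hypothetical helper that mirrors the described behaviour of `get_regressor_mse()`, and `X_demo` and `y_demo` are randomly generated with the same shape as the energy data. The MSE values it prints therefore say nothing about the real data set.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def regressor_mse(model, X, y):
    """Train/validation MSE using the split described in this lab (random_state=956)."""
    X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=956)
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    val_mse = mean_squared_error(y_val, model.predict(X_val))
    return train_mse, val_mse

# Synthetic stand-in for the energy data (768 rows, 8 columns); NOT the real data
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(768, 8)), columns=[f"f{i}" for i in range(8)])
y_demo = pd.Series(
    X_demo.to_numpy() @ rng.normal(size=8) + rng.normal(scale=0.5, size=768)
)

# 1% subset, as in section 3.2
X_small, _, y_small, _ = train_test_split(
    X_demo, y_demo, train_size=0.01, random_state=174
)

model = LinearRegression()
cases = {
    "all rows, all columns": (X_demo, y_demo),
    "all rows, first two columns": (X_demo.iloc[:, :2], y_demo),
    "1% of rows, all columns": (X_small, y_small),
}

# One row per case: data size plus training and validation MSE
rows = []
for X_case, y_case in cases.values():
    train_mse, val_mse = regressor_mse(model, X_case, y_case)
    rows.append(
        {"Data size": X_case.shape, "training MSE": train_mse, "validation MSE": val_mse}
    )

results = pd.DataFrame(rows, columns=["Data size", "training MSE", "validation MSE"])
results.index = list(cases)
print(results)
```

Collecting the rows in a list and building the DataFrame once keeps the three calls uniform and avoids repeated row-by-row appends.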


### 3.4 Questions
1. What is the validation MSE using all data? What is the difference between training and validation MSE?
1. How does the validation MSE and difference between training and validation change when only two columns are used? Provide values.
1. How does the validation MSE and difference between training and validation change when only 1% of the rows are used? Provide values.

*YOUR ANSWERS HERE*



## 4. Observations/Interpretation

Describe any pattern you see in the results. Relate your findings to what we discussed during lectures. Include data to justify your findings.


*ADD YOUR FINDINGS HERE*



## 5. Reflection
Include a sentence or two about:
- what you liked or disliked,
- found interesting, confusing, challenging, motivating
while working on this assignment.


*ADD YOUR THOUGHTS HERE*

