# Lab1 - Scikit-learn
Author: Sam Rainbow

## 1. Introduction

The goal of this lab is to become familiar with the scikit-learn library.

You will practice loading example datasets, perform classification and regression with linear scikit-learn models, and investigate the effects of reducing the number of features (columns in X) and the number of samples (rows in X and y).


In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## 2. Classification

Using yellowbrick spam - classification  
https://www.scikit-yb.org/en/latest/api/datasets/spam.html

The goal is to investigate `LogisticRegression(max_iter=2000)` and the effects of reducing the number of features and the number of samples on classification performance.

### 2.1 Implement convenience function

In [2]:
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def get_classifier_accuracy(model, X, y):
    '''Calculate train and validation accuracy of classifier (model)
        
        Splits feature matrix X and target vector y 
        with sklearn train_test_split() and random_state=956.
        
        model (sklearn classifier): Classifier to train and evaluate
        X (numpy.array or pandas.DataFrame): Feature matrix
        y (numpy.array or pandas.Series): Target vector
        
        returns: training accuracy, validation accuracy
    
    '''
    
    #TODO: IMPLEMENT FUNCTION BODY
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=956)
    model.fit(X_train, y_train)
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)
    train_acc = accuracy_score(y_train, y_train_pred)
    val_acc = accuracy_score(y_test, y_test_pred)
    return train_acc, val_acc   

### 2.2 Load data

Using the yellowbrick function `load_spam()`, load the spam data set into feature matrix `X` and target vector `y`.

Print size and type of `X` and `y`.


In [26]:
# TODO: ADD YOUR CODE HERE
from yellowbrick.datasets import load_spam

X, y = load_spam()
print("Size of X: ", X.shape, " type: ", type(X))
print("Size of y: ", y.shape, " type: ", type(y))

Size of X:  (4600, 57)  type:  <class 'pandas.core.frame.DataFrame'>
Size of y:  (4600,)  type:  <class 'pandas.core.series.Series'>


Using the sklearn function `train_test_split()` prepare a feature matrix `X_small` and target vector `y_small` that contain only **1%** of the rows. Use `random_state=174`.

Print size and type of `X_small` and `y_small`.

In [12]:
# TODO: ADD YOUR CODE HERE
X_small, _, y_small, _ = train_test_split(X,y, train_size = 0.01, random_state=174)
print("Size of X: ", X_small.shape, " type: ", type(X_small))
print("Size of y: ", y_small.shape, " type: ", type(y_small))


Size of X:  (46, 57)  type:  <class 'pandas.core.frame.DataFrame'>
Size of y:  (46,)  type:  <class 'pandas.core.series.Series'>
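
The row counts above follow from how `train_test_split()` rounds fractional sizes. The sketch below uses placeholder zero arrays with the spam shapes (an assumption for illustration, not the real data; the `_demo` names are chosen here) to show how many rows each stage keeps. It includes the second split that `get_classifier_accuracy()` applies, which relies on the default 25% validation share.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shape as the spam data (values are irrelevant here).
X_demo = np.zeros((4600, 57))
y_demo = np.zeros(4600)

# Stage 1: keep 1% of the rows, as in the cell above.
X_small_demo, _, y_small_demo, _ = train_test_split(
    X_demo, y_demo, train_size=0.01, random_state=174
)

# Stage 2: the convenience function splits again with the default test_size (0.25).
X_tr, X_val, y_tr, y_val = train_test_split(
    X_small_demo, y_small_demo, random_state=956
)

print(X_small_demo.shape[0], X_tr.shape[0], X_val.shape[0])  # 46 34 12
```

With the 1% subset, the model is therefore trained on 34 rows and validated on only 12, so each validation row shifts the accuracy by roughly 8 percentage points.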


### 2.3 Train and evaluate models

1. Import `LogisticRegression` from sklearn
2. Instantiate model `LogisticRegression(max_iter=2000)`.
3. Create a pandas DataFrame `results` with columns: Data size, training accuracy, validation accuracy
4. Call your convenience function `get_classifier_accuracy()` using 
    - `X` and `y`
    - Only first two columns of `X` and `y`
    - `X_small` and `y_small`
5. Add the data size, training and validation accuracy for each call to the `results` DataFrame
6. Print `results`

In [25]:
# TODO: ADD YOUR CODE HERE
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter =2000)

col = ["Data Size", "Training Accuracy", "Validation Accuracy"]
results = pd.DataFrame(columns = col)

t_acc, v_acc = get_classifier_accuracy(model, X, y)
results.loc[0] = ["X and y",t_acc, v_acc]
t_acc, v_acc = get_classifier_accuracy(model, X.iloc[:,:2], y)
results.loc[1] = ["First 2 columns",t_acc, v_acc]
t_acc, v_acc = get_classifier_accuracy(model, X_small, y_small)
results.loc[2] = ["X and y small",t_acc, v_acc]

print(results)

         Data Size  Training Accuracy  Validation Accuracy
0          X and y           0.935072             0.917391
1  First 2 columns           0.608986             0.613043
2    X and y small           0.941176             0.750000


### 2.4 Questions
1. What is the validation accuracy using all data? What is the difference between training and validation accuracy?
2. How do the validation accuracy and the difference between training and validation accuracy change when only two columns are used? Provide values.
3. How do the validation accuracy and the difference between training and validation accuracy change when only 1% of the rows are used? Provide values.

1. The validation accuracy using all of the data is 0.917, which is 0.018 below the training accuracy (0.935). The high validation accuracy and small gap indicate that the model predicts unseen data accurately.
2. When only two columns are used, the validation accuracy decreases substantially, to 0.613. The difference between the training accuracy (0.609) and validation accuracy (0.613) is only 0.004, which is even smaller than for the full set, and the validation accuracy is slightly higher than the training accuracy. Using only two columns likely results in a less complex model because it has less information to learn from.
3. When the small data set is used, the validation accuracy decreases substantially, to 0.750, while the training accuracy increases slightly, to 0.941. The difference between the two is 0.191, which is much larger than in the other two scenarios. This indicates overfitting.
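
The differences quoted in these answers can be recomputed directly from the printed `results` table. The sketch below re-enters the three reported rows by hand (the `reported` DataFrame and the `Gap` column are names introduced here for illustration) and subtracts validation from training accuracy.

```python
import pandas as pd

# Values copied from the printed results table in section 2.3.
reported = pd.DataFrame(
    {
        "Data Size": ["X and y", "First 2 columns", "X and y small"],
        "Training Accuracy": [0.935072, 0.608986, 0.941176],
        "Validation Accuracy": [0.917391, 0.613043, 0.750000],
    }
)

# Positive gap: training accuracy exceeds validation accuracy.
reported["Gap"] = reported["Training Accuracy"] - reported["Validation Accuracy"]
print(reported.round(3))
```

The gaps come out at about 0.018, -0.004, and 0.191. The negative value for the two-column case reflects the validation accuracy being marginally higher than the training accuracy.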




## 3. Regression

Using yellowbrick energy - regression  
https://www.scikit-yb.org/en/latest/api/datasets/energy.html

The goal is to investigate `LinearRegression()` and the effects of reducing the number of features and the number of samples on regression performance.

### 3.1 Implement convenience function

In [47]:
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def get_regressor_mse(model, X, y):
    '''Calculate train and validation mean-squared error (mse) of regressor (model)
        
        Splits feature matrix X and target vector y 
        with sklearn train_test_split() and random_state=956.
        
        model (sklearn regressor): Regressor to train and evaluate
        X (numpy.array or pandas.DataFrame): Feature matrix
        y (numpy.array or pandas.Series): Target vector
        
        returns: training mse, validation mse
    
    '''
   
    #TODO: IMPLEMENT FUNCTION BODY
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=956)
    model.fit(X_train, y_train)
    y_train_predict = model.predict(X_train)
    y_test_predict = model.predict(X_test)
    train_mse = mean_squared_error(y_train,y_train_predict)
    val_mse = mean_squared_error(y_test, y_test_predict)
    return train_mse, val_mse   
    

### 3.2 Load data

Using the yellowbrick function `load_energy()`, load the energy data set into feature matrix `X` and target vector `y`.

Print dimensions and type of `X` and `y`.

In [66]:
# TODO: ADD YOUR CODE HERE
from yellowbrick.datasets import load_energy


X, y = load_energy()
print("Size of X: ", X.shape, " type: ", type(X))
print("Size of y: ", y.shape, " type: ", type(y))


Size of X:  (768, 8)  type:  <class 'pandas.core.frame.DataFrame'>
Size of y:  (768,)  type:  <class 'pandas.core.series.Series'>


Using the sklearn function `train_test_split()` prepare a feature matrix `X_small` and target vector `y_small` that contain only **1%** of the rows. Use `random_state=174`.

Print size and type of `X_small` and `y_small`.

In [67]:
# TODO: ADD YOUR CODE HERE
X_small, _, y_small, _ = train_test_split(X,y, train_size = 0.01, random_state=174)
print("Size of X: ", X_small.shape, " type: ", type(X_small))
print("Size of y: ", y_small.shape, " type: ", type(y_small))


Size of X:  (7, 8)  type:  <class 'pandas.core.frame.DataFrame'>
Size of y:  (7,)  type:  <class 'pandas.core.series.Series'>


### 3.3 Train and evaluate models

1. Import `LinearRegression` from sklearn
2. Instantiate model `LinearRegression()`.
3. Create a pandas DataFrame `results` with columns: Data size, training MSE, validation MSE
4. Call your convenience function `get_regressor_mse()` using 
    - `X` and `y`
    - Only first two columns of `X` and `y`
    - `X_small` and `y_small`
5. Add the data size, training and validation MSE for each call to the `results` DataFrame
6. Print `results`

In [69]:
# TODO: ADD YOUR CODE HERE
from sklearn.linear_model import LinearRegression
model = LinearRegression()

col = ["Data Size", "Training MSE", "Validation MSE"]
results2 = pd.DataFrame(columns = col)

t_mse, v_mse = get_regressor_mse(model, X, y)
results2.loc[0] = ["X and y",t_mse, v_mse]
t_mse, v_mse = get_regressor_mse(model, X.iloc[:,:2], y)
results2.loc[1] = ["First 2 columns",t_mse, v_mse]
t_mse, v_mse = get_regressor_mse(model, X_small, y_small)
results2.loc[2] = ["X and y small",t_mse, v_mse]

print(results2)


         Data Size  Training MSE  Validation MSE
0          X and y  7.981975e+00       10.292306
1  First 2 columns  5.360043e+01       46.410426
2    X and y small  2.284541e-28       69.977449


### 3.4 Questions
1. What is the validation MSE using all data? What is the difference between training and validation MSE?
2. How do the validation MSE and the difference between training and validation MSE change when only two columns are used? Provide values.
3. How do the validation MSE and the difference between training and validation MSE change when only 1% of the rows are used? Provide values.

1. The validation MSE using all of the data is 10.29, which is 2.31 higher than the training MSE (7.98). The difference is small, which indicates that the model predicts unseen data accurately.
2. When only two columns are used, the validation MSE increases substantially, to 46.41. The difference between the training MSE (53.60) and validation MSE (46.41) is 7.19, which is larger than for the full set; here the training MSE is the higher of the two. Using only two columns likely results in a less complex model because it has less information to learn from.
3. When the small data set is used, the validation MSE increases substantially, to 69.98, while the training MSE drops to effectively 0 (2.28e-28). The difference between the training MSE and validation MSE is therefore 69.98, which is larger than in both other scenarios. This indicates overfitting.
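
A training MSE on the order of 1e-28 is what ordinary least squares produces when it has more coefficients than training rows. Of the 7 rows in `X_small`, the default 75/25 split inside `get_regressor_mse()` leaves 5 for training, against 8 features plus an intercept. The sketch below reproduces the effect on random synthetic data (an assumption for illustration; it does not use the energy data set, and the `_demo` names are introduced here).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# 5 training rows and 8 features: fewer rows than coefficients.
X_train_demo = rng.normal(size=(5, 8))
y_train_demo = rng.normal(size=5)

# Fresh rows from the same distribution stand in for validation data.
X_val_demo = rng.normal(size=(2, 8))
y_val_demo = rng.normal(size=2)

reg = LinearRegression().fit(X_train_demo, y_train_demo)
train_mse_demo = mean_squared_error(y_train_demo, reg.predict(X_train_demo))
val_mse_demo = mean_squared_error(y_val_demo, reg.predict(X_val_demo))

print(f"train MSE: {train_mse_demo:.2e}")  # effectively zero (floating-point noise)
print(f"validation MSE: {val_mse_demo:.2f}")
```

Because the model can pass exactly through every training point, the training error says nothing about how well it generalizes; only the validation MSE is informative in this regime.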


## 4. Observations/Interpretation

Describe any pattern you see in the results. Relate your findings to what we discussed during lectures. Include data to justify your findings.

Across both the classification and regression data sets:
1. With the full data sets, the training and validation metrics both indicate a good fit. In classification, the validation accuracy (0.917) and training accuracy (0.935) were both high and close together, indicating a good model fit (appropriate complexity, low bias and low variance). In regression, the training MSE (7.98) and validation MSE (10.29) were both low and close together. The model is neither overfitting nor underfitting the data.

2. In the second scenario, reducing the number of features to 2 left both the training accuracy (0.609) and validation accuracy (0.613) low, and both the training MSE (53.6) and validation MSE (46.4) far higher than with all features, with a larger gap (7.19) than for the full set (2.31). This is likely due to the reduced data and complexity of the model, since there are only 2 features. The model is likely underfitting the data, leading to high bias and low variance: it makes consistently inaccurate predictions.

3. In the third scenario we use only 1% of the rows. The training accuracy (0.941) and validation accuracy (0.750) differ by 0.191, and the training MSE (effectively 0) and validation MSE (69.98) differ even more sharply. In both cases the model is likely overfitting the training data, resulting in high variance and low bias, which happens when a model is too complex for the amount of data available.
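
The sample-size effect described in these observations can be explored with the same tools used in the lab. The sketch below is only an illustration under stated assumptions: it uses synthetic data from `make_classification` rather than the spam data set, the row counts (40, 400, 4000) are arbitrary, and `scores_by_rows` is a name introduced here. It repeats the train/validation split for growing subsets and prints the accuracy gap for each.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the spam data set).
X_syn, y_syn = make_classification(
    n_samples=4000, n_features=50, n_informative=10, random_state=0
)

scores_by_rows = {}
for n_rows in (40, 400, 4000):
    # Use the first n_rows samples as the available data.
    X_sub, y_sub = X_syn[:n_rows], y_syn[:n_rows]
    X_tr, X_val, y_tr, y_val = train_test_split(X_sub, y_sub, random_state=956)

    clf = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)
    train_acc = accuracy_score(y_tr, clf.predict(X_tr))
    val_acc = accuracy_score(y_val, clf.predict(X_val))
    scores_by_rows[n_rows] = (train_acc, val_acc)

    print(f"{n_rows:>5} rows: train={train_acc:.3f}  val={val_acc:.3f}  "
          f"gap={train_acc - val_acc:.3f}")
```

The exact numbers depend on the synthetic data and the random seeds, so they should be read as a qualitative check rather than a reproduction of the lab results.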




## 5. Reflection
Include a sentence or two about:
- what you liked or disliked,
- found interesting, confusing, challenging, motivating
while working on this assignment.

It was interesting to see appropriately fitted, underfit, and overfit models just by manipulating the amount of data or the number of features. We have learned the theory of this in class, but seeing the three fits side by side is a great way to learn how the data leads to good or poor models.



