# Lab1 - Scikit-learn
Author: *Steven Duong (30022492)*

## 1. Introduction

The goal of this lab is to become familiar with the scikit-learn library.

You will practice loading example datasets, performing classification and regression with linear scikit-learn models, and investigating the effects of reducing the number of features (columns in X) and the number of samples (rows in X and y).


In [22]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## 2. Classification

Using yellowbrick spam - classification  
https://www.scikit-yb.org/en/latest/api/datasets/spam.html

The goal is to investigate `LogisticRegression(max_iter=2000)` and the effects of reducing the number of features and the number of samples on classification performance.

### 2.1 Implement convenience function

In [23]:
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def get_classifier_accuracy(model, X, y):
    '''Calculate train and validation accuracy of classifier (model)
        
        Splits feature matrix X and target vector y 
        with sklearn train_test_split() and random_state=956.
        
        model (sklearn classifier): Classifier to train and evaluate
        X (numpy.array or pandas.DataFrame): Feature matrix
        y (numpy.array or pandas.Series): Target vector
        
        returns: training accuracy, validation accuracy
    '''
    
    # Split the feature matrix X and target vector y
    # into training and validation sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=956)
    
    # Building the model on the training set 
    model.fit(X_train, y_train)

    # Predicting the outcome with the training set
    y_train_pred = model.predict(X_train)

    # Computing the accuracy of the prediction
    acc_train = accuracy_score(y_train, y_train_pred)

    # Predict the outcome with the validation set
    y_test_pred = model.predict(X_test)
    
    # Computing the accuracy of the prediction
    acc_test = accuracy_score(y_test, y_test_pred)

    return (acc_train, acc_test)
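
For reference, the accuracy that `accuracy_score` reports in the function above is the fraction of predictions that match the true labels. A minimal NumPy sketch of that computation follows. The helper name `manual_accuracy` and the toy labels are illustrative assumptions, not part of the lab.

```python
import numpy as np

def manual_accuracy(y_true, y_pred):
    """Fraction of predictions equal to the true labels (hypothetical helper)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

# Toy labels made up for illustration: 3 of the 4 predictions are correct.
print(manual_accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75
```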

### 2.2 Load data

Using the yellowbrick function `load_spam()`, load the spam data set into feature matrix `X` and target vector `y`.

Print size and type of `X` and `y`.


In [24]:
from yellowbrick.datasets.loaders import load_spam

# loading in spam data set into feature matrix X and target vector y
X, y = load_spam()

# Printing size and type of X and y
print(f"Dimensions of X: {X.shape}\nType of X: \n{X.dtypes}")
print(f"\nDimensions of y: {y.shape}\nType of y: {y.dtype}")


Dimensions of X: (4600, 57)
Type of X: 
word_freq_make                float64
word_freq_address             float64
word_freq_all                 float64
word_freq_3d                  float64
word_freq_our                 float64
word_freq_over                float64
word_freq_remove              float64
word_freq_internet            float64
word_freq_order               float64
word_freq_mail                float64
word_freq_receive             float64
word_freq_will                float64
word_freq_people              float64
word_freq_report              float64
word_freq_addresses           float64
word_freq_free                float64
word_freq_business            float64
word_freq_email               float64
word_freq_you                 float64
word_freq_credit              float64
word_freq_your                float64
word_freq_font                float64
word_freq_000                 float64
word_freq_money               float64
word_freq_hp                  float64
word_freq_

Using the sklearn function `train_test_split()` prepare a feature matrix `X_small` and target vector `y_small` that contain only **1%** of the rows. Use `random_state=174`.

Print size and type of `X_small` and `y_small`.

In [25]:
# Creating a feature matrix that contain only 1% of the rows
X_small, X_test, y_small, y_test = train_test_split(X, y, random_state=174, train_size = 0.01)

# Printing the size and type of X_small and y_small
print(f"Dimensions of X_small: {X_small.shape}\nType of X_small:\n{X_small.dtypes}")
print(f"\nDimensions of y_small: {y_small.shape}\nType of y_small: {y_small.dtype}")


Dimensions of X_small: (46, 57)
Type of X_small:
word_freq_make                float64
word_freq_address             float64
word_freq_all                 float64
word_freq_3d                  float64
word_freq_our                 float64
word_freq_over                float64
word_freq_remove              float64
word_freq_internet            float64
word_freq_order               float64
word_freq_mail                float64
word_freq_receive             float64
word_freq_will                float64
word_freq_people              float64
word_freq_report              float64
word_freq_addresses           float64
word_freq_free                float64
word_freq_business            float64
word_freq_email               float64
word_freq_you                 float64
word_freq_credit              float64
word_freq_your                float64
word_freq_font                float64
word_freq_000                 float64
word_freq_money               float64
word_freq_hp                  float64
w

### 2.3 Train and evaluate models

1. Import `LogisticRegression` from sklearn
2. Instantiate model `LogisticRegression(max_iter=2000)`.
3. Create a pandas DataFrame `results` with columns: Data size, training accuracy, validation accuracy
4. Call your convenience function `get_classifier_accuracy()` using 
    - `X` and `y`
    - Only first two columns of `X` and `y`
    - `X_small` and `y_small`
5. Add the data size, training and validation accuracy for each call to the `results` DataFrame
6. Print `results`

In [26]:
from sklearn.linear_model import LogisticRegression

# Instantiating model Logistic Regression
model = LogisticRegression(max_iter=2000)

# Collect one row of results per experiment
# (DataFrame.append was removed in pandas 2.0, so rows are gathered in a list first)
rows = []

# Calculate the accuracy using X and y
train_acc, val_acc = get_classifier_accuracy(model, X, y)
rows.append({'Data size': X.shape,
             'Training accuracy': train_acc,
             'Validation accuracy': val_acc})

# Calculate the accuracy using only the first 2 features of X and y
X_two = X.iloc[:, :2]
train_acc, val_acc = get_classifier_accuracy(model, X_two, y)
rows.append({'Data size': X_two.shape,
             'Training accuracy': train_acc,
             'Validation accuracy': val_acc})

# Calculate the accuracy using X_small and y_small
train_acc, val_acc = get_classifier_accuracy(model, X_small, y_small)
rows.append({'Data size': X_small.shape,
             'Training accuracy': train_acc,
             'Validation accuracy': val_acc})

# Creating a Pandas DataFrame from the collected rows
results = pd.DataFrame(rows,
                       columns=['Data size', 'Training accuracy', 'Validation accuracy'],
                       index=['X and y', 'First two columns of X and y', 'X_small and y_small'])

print("\nResults DataFrame:")
results


Results DataFrame:


| | Data size | Training accuracy | Validation accuracy |
|---|---|---|---|
| X and y | (4600, 57) | 0.934493 | 0.918261 |
| First two columns of X and y | (4600, 2) | 0.608986 | 0.613043 |
| X_small and y_small | (46, 57) | 0.941176 | 0.75 |


### 2.4 Questions
1. What is the validation accuracy using all data? What is the difference between training and validation accuracy?
1. How do the validation accuracy and the difference between training and validation accuracy change when only two columns are used? Provide values.
1. How do the validation accuracy and the difference between training and validation accuracy change when only 1% of the rows are used? Provide values.

*YOUR ANSWERS HERE*
1. The validation accuracy using all the data is 0.918261 and the training accuracy is 0.934493, a difference of 0.016232. Training accuracy measures how well a supervised learning model performs on the data it was fitted to, while validation accuracy measures how well it performs on data it has not seen. For example, a validation accuracy of 97% suggests the model can be expected to classify about 97% of new, similar data correctly.
2. When only two columns are used, both accuracies drop sharply: the training accuracy falls to 0.608986 and the validation accuracy to 0.613043 (down from 0.918261). The difference between them shrinks to 0.004057, with validation slightly above training. Two features carry too little information for the model to separate the classes, so it performs poorly on both sets, a pattern known as underfitting. The validation score being marginally higher than the training score does not indicate an error; a gap this small is within what a random split can produce.
3. When only 1% of the rows are used, the training accuracy rises slightly to 0.941176 while the validation accuracy drops to 0.750000, so the difference grows to 0.191176. With so few samples the model fits the training rows closely but has not seen enough variety to generalize, which is overfitting. The scores are also noisy, because the convenience function splits the 46 rows again and only a handful remain for validation.
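
The small-sample result in answer 3 depends on how few rows remain after `get_classifier_accuracy()` splits the data a second time. The sketch below reproduces the sample counts, assuming scikit-learn's default `test_size=0.25` with the validation count rounded up. `split_sizes` is a hypothetical helper written only for this illustration.

```python
import math

def split_sizes(n_samples, test_size=0.25):
    """Approximate train/validation counts for a fractional test_size.

    Assumption: the validation count is rounded up and the
    remaining rows go to training.
    """
    n_test = math.ceil(n_samples * test_size)
    return n_samples - n_test, n_test

for n in (4600, 46):
    n_train, n_test = split_sizes(n)
    print(f"{n} rows -> {n_train} training, {n_test} validation")
```

With 34 training and 12 validation rows, the reported 0.941176 and 0.75 are consistent with 32/34 and 9/12 correct predictions, so a single misclassified validation sample moves the validation accuracy by about 0.083.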



## 3. Regression

Using yellowbrick energy - regression  
https://www.scikit-yb.org/en/latest/api/datasets/energy.html

The goal is to investigate `LinearRegression()` and the effects of reducing the number of features and the number of samples on regression performance.

### 3.1 Implement convenience function

In [27]:
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def get_regressor_mse(model, X, y):
    '''Calculate train and validation mean-squared error (mse) of regressor (model)
        
        Splits feature matrix X and target vector y 
        with sklearn train_test_split() and random_state=956.
        
        model (sklearn regressor): Regressor to train and evaluate
        X (numpy.array or pandas.DataFrame): Feature matrix
        y (numpy.array or pandas.Series): Target vector
        
        returns: training mse, validation mse
    
    '''

    # Partitioning the data into training and testing
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=956)

    # Fitting the training model
    model.fit(X_train, y_train)

    # Predicting the outcome with the training data
    y_train_pred = model.predict(X_train)

    # Calculating the mse of the training data
    train_mse = mean_squared_error(y_train, y_train_pred)

    # Predicting the outcome with the validation data
    y_test_pred = model.predict(X_test)

    # Calculating the mse of the validation data
    val_mse = mean_squared_error(y_test, y_test_pred)

    return (train_mse, val_mse)
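
For reference, the mean-squared error that `mean_squared_error` reports in the function above is the average of the squared differences between targets and predictions. A minimal NumPy sketch follows. The helper name `manual_mse` and the toy values are illustrative assumptions, not part of the lab.

```python
import numpy as np

def manual_mse(y_true, y_pred):
    """Average squared difference between targets and predictions (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# Toy values made up for illustration: errors of -1 and 2 give (1 + 4) / 2.
print(manual_mse([3.0, 5.0], [2.0, 7.0]))  # 2.5
```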


### 3.2 Load data

Using the yellowbrick function `load_energy()`, load the energy data set into feature matrix `X` and target vector `y`.

Print dimensions and type of `X` and `y`.

In [28]:
from yellowbrick.datasets import load_energy

# Loading the energy data set
X, y = load_energy()

print(f"Dimensions of X: {X.shape}\nType of X:\n{X.dtypes}")
print(f"\nDimensions of y: {y.shape}\nType of y: {y.dtype}")


Dimensions of X: (768, 8)
Type of X:
relative compactness         float64
surface area                 float64
wall area                    float64
roof area                    float64
overall height               float64
orientation                    int64
glazing area                 float64
glazing area distribution      int64
dtype: object

Dimensions of y: (768,)
Type of y: float64


Using the sklearn function `train_test_split()` prepare a feature matrix `X_small` and target vector `y_small` that contain only **1%** of the rows. Use `random_state=174`.

Print size and type of `X_small` and `y_small`.

In [29]:
# Creating a feature matrix that contain only 1% of the rows
X_small, X_test, y_small, y_test = train_test_split(X, y, random_state=174, train_size = 0.01)

# Printing the size and type of X_small and y_small
print(f"Dimensions of X_small: {X_small.shape}\nType of X_small:\n{X_small.dtypes}")
print(f"\nDimensions of y_small: {y_small.shape}\nType of y_small: {y_small.dtype}")


Dimensions of X_small: (7, 8)
Type of X_small:
relative compactness         float64
surface area                 float64
wall area                    float64
roof area                    float64
overall height               float64
orientation                    int64
glazing area                 float64
glazing area distribution      int64
dtype: object

Dimensions of y_small: (7,)
Type of y_small: float64


### 3.3 Train and evaluate models

1. Import `LinearRegression` from sklearn
2. Instantiate model `LinearRegression()`.
3. Create a pandas DataFrame `results` with columns: Data size, training MSE, validation MSE
4. Call your convenience function `get_regressor_mse()` using 
    - `X` and `y`
    - Only first two columns of `X` and `y`
    - `X_small` and `y_small`
5. Add the data size, training and validation MSE for each call to the `results` DataFrame
6. Print `results`

In [30]:
from sklearn.linear_model import LinearRegression

# Instantiating model Linear Regression
model = LinearRegression()

# Collect one row of results per experiment
# (DataFrame.append was removed in pandas 2.0, so rows are gathered in a list first)
rows = []

# Calculate the mean squared error using X and y
train_mse, val_mse = get_regressor_mse(model, X, y)
rows.append({'Data size': X.shape,
             'Training MSE': train_mse,
             'Validation MSE': val_mse})

# Calculate the mean squared error using only the first 2 features of X and y
X_two = X.iloc[:, :2]
train_mse, val_mse = get_regressor_mse(model, X_two, y)
rows.append({'Data size': X_two.shape,
             'Training MSE': train_mse,
             'Validation MSE': val_mse})

# Calculate the mean squared error using X_small and y_small
train_mse, val_mse = get_regressor_mse(model, X_small, y_small)
rows.append({'Data size': X_small.shape,
             'Training MSE': train_mse,
             'Validation MSE': val_mse})

# Creating a Pandas DataFrame from the collected rows
results = pd.DataFrame(rows,
                       columns=['Data size', 'Training MSE', 'Validation MSE'],
                       index=['X and y', 'First two columns of X and y', 'X_small and y_small'])

print("\nResults DataFrame:")
results


Results DataFrame:


| | Data size | Training MSE | Validation MSE |
|---|---|---|---|
| X and y | (768, 8) | 8.012691 | 10.366349 |
| First two columns of X and y | (768, 2) | 53.60043 | 46.410426 |
| X_small and y_small | (7, 8) | 2.145702e-29 | 69.977449 |


### 3.4 Questions
1. What is the validation MSE using all data? What is the difference between training and validation MSE?
1. How do the validation MSE and the difference between training and validation MSE change when only two columns are used? Provide values.
1. How do the validation MSE and the difference between training and validation MSE change when only 1% of the rows are used? Provide values.

*YOUR ANSWERS HERE*
1. The validation MSE using all the data is 10.366349 and the training MSE is 8.012691, a difference of 2.353658. The training MSE measures the average squared error between the model's predictions and the actual targets on the data the model was fitted to; a low value indicates that the model has captured the relationships in the training data. The validation MSE measures the same error on held-out data that the model did not see during training, so it indicates how well the model generalizes to new data.

2. When only the first two columns are used, both errors increase: the training MSE rises to 53.600430 and the validation MSE to 46.410426. The validation MSE is now 7.190004 lower than the training MSE. Because both errors are far higher than with all eight features, this indicates underfitting: two features do not give the model enough information to capture the relationships in the data.

3. When only 1% of the rows are used, the training MSE drops to 2.145702e-29, which is effectively zero, while the validation MSE rises to 69.977449, so the difference is about 69.977449. A near-zero training error combined with a high validation error indicates overfitting: the model has more coefficients than training samples, so it reproduces the few training rows exactly but does not generalize to new data.
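
The near-zero training MSE in answer 3 is what ordinary least squares produces whenever it has more free parameters than training rows. The sketch below illustrates this with NumPy on synthetic random data, not the energy data set. The 5 training rows assume the default 75/25 split of the 7 rows in `X_small`, and the seed and distributions are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for repeatability

# Synthetic stand-in data (NOT the energy data set): 5 training rows, 8 features.
n_train, n_features = 5, 8
X_train = rng.normal(size=(n_train, n_features))
y_train = rng.normal(loc=20.0, scale=10.0, size=n_train)

# Append a column of ones for the intercept, then solve ordinary least squares.
A = np.hstack([X_train, np.ones((n_train, 1))])
coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)

# With 9 parameters and only 5 rows, the fit passes through every training point.
train_mse = float(np.mean((A @ coef - y_train) ** 2))
print(f"parameters: {A.shape[1]}, samples: {n_train}, training MSE: {train_mse:.3e}")
```

The printed training MSE is at the level of floating-point round-off, even though the targets here are pure noise, which is why a training error this small says nothing about how the model will perform on new data.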





## 4. Observations/Interpretation

Describe any pattern you see in the results. Relate your findings to what we discussed during lectures. Include data to justify your findings.


*ADD YOUR FINDINGS HERE*

When all available data are used, both models generalize well: the classifier reaches 0.934493 training and 0.918261 validation accuracy, and the regressor has a training MSE of 8.012691 and a validation MSE of 10.366349, so the gap between training and validation is small in both cases. Using only the first two columns lowers both accuracies to about 0.61 and raises the MSEs to 53.600430 (training) and 46.410426 (validation). Poor performance on both sets is the pattern of underfitting, or high bias: the model does not have enough information to capture the relationship. Finally, when only 1% of the rows are used, training performance looks excellent (accuracy 0.941176, MSE 2.145702e-29) but validation performance degrades (accuracy 0.750000, MSE 69.977449). This is the pattern of overfitting, or high variance: the model fits the few training samples closely but cannot generalize because it has too few samples to learn from.
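
The train/validation gaps behind these patterns can be recomputed directly from the two results tables. The sketch below copies the reported values by hand; the dictionary layout and the `gaps` helper are illustrative assumptions.

```python
# Values copied from the two results tables in this notebook: (training, validation).
classification = {
    "all data": (0.934493, 0.918261),
    "first two columns": (0.608986, 0.613043),
    "1% of rows": (0.941176, 0.75),
}
regression = {
    "all data": (8.012691, 10.366349),
    "first two columns": (53.60043, 46.410426),
    "1% of rows": (2.145702e-29, 69.977449),
}

def gaps(table, higher_is_better):
    """Gap per setting; a positive value means worse performance on validation."""
    out = {}
    for name, (train, val) in table.items():
        out[name] = round(train - val if higher_is_better else val - train, 6)
    return out

acc_gaps = gaps(classification, higher_is_better=True)
mse_gaps = gaps(regression, higher_is_better=False)
for name in classification:
    print(f"{name}: accuracy gap {acc_gaps[name]}, MSE gap {mse_gaps[name]}")
```

A small gap with good scores matches the full-data case, a small or negative gap with poor scores matches underfitting, and a large positive gap matches overfitting.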


## 5. Reflection
Include a sentence or two about:
- what you liked or disliked,
- found interesting, confusing, challenging, motivating
while working on this assignment.


*ADD YOUR THOUGHTS HERE*

1. I liked finally being able to build models of my own using scikit-learn and seeing how a model behaves with different training data. This exercise helped me deepen my knowledge of machine learning, which I thoroughly enjoyed.

2. Machine learning can be applied to a variety of tasks, such as categorizing items into classes or estimating numerical values, through classification and regression models. What makes this area interesting is the range of techniques and models available, each with its own strengths and weaknesses; this was confusing at first because I had no prior experience with these models. The subject can also be challenging because it involves many mathematical concepts and requires a good understanding of statistics and linear algebra. However, once these concepts are understood, being able to build models that make accurate predictions is very motivating.