## In-Class Assignment 1

**Instructions:**

- For questions that require coding, you need to write the relevant code and display its output. Your output should either be the direct answer to the question or clearly display the answer in it.
- For questions that require a written answer (sometimes along with the code), you need to put your answer in a Markdown cell. Writing the answer as a comment or as a print line is not acceptable.
- You need to render this file as HTML using Quarto and submit the HTML file. **Please note that this is a requirement and not optional.** A submission cannot be graded until it is properly rendered.

Import all the libraries and tools you need below.

In [3]:
# Import all libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression, LogisticRegression, Lasso, Ridge
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler, Normalizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score, accuracy_score, recall_score, confusion_matrix

### 1) 

This question serves as a warm-up for using sklearn objects for different Machine Learning tasks.

You need to use the **BankNote_Authentication.csv** dataset. Each observation is a banknote. The response variable, named `class`, represents whether the banknote is forged (1) or authentic (0). All other variables are the predictors and they represent a number of statistical measures extracted from the images of the banknotes.

### a)

Read the data as a DataFrame. **(5 points)**

In [4]:
banknote = pd.read_csv('BankNote_Authentication.csv') # read the data; the file has no index column, so there is nothing to drop
banknote.head() # show first 5 rows

| | variance | skewness | curtosis | entropy | class |
|---|---|---|---|---|---|
| 0 | 3.6216 | 8.6661 | -2.8073 | -0.44699 | 0 |
| 1 | 4.5459 | 8.1674 | -2.4586 | -1.4621 | 0 |
| 2 | 3.866 | -2.6383 | 1.9242 | 0.10645 | 0 |
| 3 | 3.4566 | 9.5228 | -4.0112 | -3.5944 | 0 |
| 4 | 0.32924 | -4.4552 | 4.5718 | -0.9888 | 0 |


### b)

Explore the data with Python tools to answer the following questions:

- Is this a binary or a multi-class classification task?
- How many observations and predictors do you have? (The response is not a predictor!)
- Is there any missing data?

**(10 points)**

In [5]:
# Check the distribution of target variable
print(banknote['class'].value_counts())

# Check the number of unique classes
print(banknote['class'].nunique())

class
0    762
1    610
Name: count, dtype: int64
2


This is a binary classification task, since the response `class` takes only two values (0 and 1).

In [6]:
# Check the number of observations and predictors
observation = banknote.shape[0]
predictors = banknote.shape[1] - 1 # exclude target variable
print(f'Number of observations: {observation}')
print(f'Number of predictors: {predictors}')

Number of observations: 1372
Number of predictors: 4


There are 1372 observations and 4 predictors.

In [7]:
# Check the missing data
banknote.isnull().sum()

variance    0
skewness    0
curtosis    0
entropy     0
class       0
dtype: int64

There are no missing values in this dataset.

### c)

Split the data into training and test sets with a 70%-30% ratio. Use `random_state=42` for reproducible results. 

Note that in order to create the training and test sets, you first need to separate the predictors and the response into two variables, so you can use them as proper inputs.

**(10 points)**

In [8]:
# Use 70 - 30 for train-test split
X = banknote.drop(['class'], axis = 1) # Predictors: every column except the response
Y = banknote['class'] # Response variable

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.3, random_state = 42)
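
As a quick sanity check on the split sizes, the sketch below applies the same call to a stand-in array with 1372 rows (the array `rows` is hypothetical and only replaces the real predictors for illustration). With `test_size=0.3`, scikit-learn rounds the test size up, so we would expect 412 test rows and 960 training rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the 1372 banknote observations
rows = np.arange(1372)

train_rows, test_rows = train_test_split(rows, test_size=0.3, random_state=42)

# Number of rows in the training and test sets
print(len(train_rows), len(test_rows))
```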


### d)

(Standard) scale the training and test sets. **(15 points)**

In [9]:
# Standardize the predictors
scaler = StandardScaler()
# Fit the scaler on the training set only, so no test information leaks in
scaler.fit(X_train)
# Transform the training set
X_train = scaler.transform(X_train)
# Transform the test set with the training mean and standard deviation
X_test = scaler.transform(X_test)
# The response variable is left unscaled
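
To illustrate why the scaler is fit on the training set only, here is a minimal sketch on synthetic data (the arrays `train` and `test` and the name `demo_scaler` are made up for illustration). The training columns end up with mean 0 and standard deviation 1, while the test columns are transformed with the training statistics and so are only approximately standardized.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-ins for the predictor matrices (illustration only)
train = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
test = rng.normal(loc=5.0, scale=2.0, size=(50, 3))

demo_scaler = StandardScaler()
train_scaled = demo_scaler.fit_transform(train)  # learn mean/std from the training data
test_scaled = demo_scaler.transform(test)        # reuse the training mean/std

print(train_scaled.mean(axis=0).round(6))  # approximately 0 in every column
print(train_scaled.std(axis=0).round(6))   # approximately 1 in every column
print(test_scaled.mean(axis=0).round(3))   # near 0, but generally not exactly 0
```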

### e)

Create and train a Logistic Regression model. (You are expected to know which set to use to train the model.) Do **not** use any Lasso or Ridge penalty.

**(15 points)**

In [10]:
# Logistic Regression setup, no penalty
logistic = LogisticRegression(penalty = None)
# Fit the model on the training set
logistic.fit(X_train, Y_train)
# Predict the classes of the test set
Y_pred = logistic.predict(X_test)

### f)

Evaluate the model by printing its accuracy, recall and confusion matrix. (You are expected to know which set to use to evaluate the model.) You can assume that the probability threshold is 0.5, which is the default value.

**(20 points)**

In [11]:
# Getting the model accuracy, threshold default = 0.5
accuracy = accuracy_score(Y_test, Y_pred)
print(f'Accuracy: {accuracy}')
# recall score
recall = recall_score(Y_test, Y_pred)
print(f'Recall: {recall}')
# confusion matrix
cm = confusion_matrix(Y_test, Y_pred) # use a new name so the imported function is not overwritten
print(f'Confusion Matrix: {cm}')

Accuracy: 0.9902912621359223
Recall: 0.9890710382513661
Confusion Matrix: [[227   2]
 [  2 181]]


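The accuracy is 99.03% and the recall is 98.91%. The confusion matrix shows that only 4 of the 412 test banknotes are misclassified: 2 authentic notes flagged as forged and 2 forged notes classified as authentic.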
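
The reported metrics follow directly from the confusion matrix in the output above. The short sketch below recomputes them by hand, assuming scikit-learn's layout in which rows are true classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix from the output above: rows = true class, columns = predicted class
cm = np.array([[227, 2],
               [2, 181]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()  # correct predictions / all predictions
recall = tp / (tp + fn)          # detected forgeries / all true forgeries

print(round(accuracy, 4), round(recall, 4))
```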

### g)
Did you use any hyperparameters in the classifier? Was there anything to cross-validate in the classifier? **(15 points)**


We set `penalty=None` and left every other setting at its default, so the model has no hyperparameter to tune. Because there is no hyperparameter tuning, this unpenalized logistic regression model has nothing to cross-validate.
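
For contrast, a penalized model does have a hyperparameter to tune. The sketch below is not part of the assignment: it assumes scikit-learn's default L2 (Ridge) penalty, uses synthetic data from `make_classification`, and cross-validates the inverse regularization strength `C` with `GridSearchCV`. The names `X_demo`, `y_demo` and the grid values are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for a real training set (illustration only)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=42
)

# By default LogisticRegression applies an L2 (Ridge) penalty;
# C is the inverse of its regularization strength
param_grid = {"C": [0.01, 0.1, 1, 10, 100]}
search = GridSearchCV(
    LogisticRegression(max_iter=1000), param_grid, cv=5, scoring="accuracy"
)
search.fit(X_demo, y_demo)

print(search.best_params_)           # value of C with the best mean CV accuracy
print(round(search.best_score_, 3))  # that mean 5-fold CV accuracy
```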

### 2)

In this question, you need to create a user-defined function. It will be part of the Gradient Descent code you will put together next week.

Define a function called `initialize`. It should take one scalar numeric input, called `dim`, and return a **(dim + 1) x 1** vector of random numbers. **Note that the output should be a numpy vector.** You do not need to check or account for any invalid inputs.

You can use the given test cases to check your function. You need to sample the random numbers from a uniform distribution between 0 and 1 to match the given answers.

**(10 points)**

In [12]:
np.random.seed(123)

def initialize(dim):
    # Create the vector of random numbers v, from uniform dist
    v = np.random.rand(dim + 1)
    return v

# test cases
print(initialize(4)) # Should produce [0.69646919 0.28613933 0.22685145 0.55131477 0.71946897]
print(initialize(2)) # Should produce [0.42310646 0.9807642  0.68482974]
print(initialize(7)) # Should produce [0.4809319  0.39211752 0.34317802 0.72904971 0.43857224 0.0596779 0.39804426 0.73799541]

[0.69646919 0.28613933 0.22685145 0.55131477 0.71946897]
[0.42310646 0.9807642  0.68482974]
[0.4809319  0.39211752 0.34317802 0.72904971 0.43857224 0.0596779
 0.39804426 0.73799541]
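
The function above returns a 1-D array of shape `(dim + 1,)`, which matches the printed test cases. If a literal (dim + 1) x 1 column is needed later for Gradient Descent, one possible variant reshapes the same draws. The name `initialize_column` is hypothetical and not part of the assignment.

```python
import numpy as np

def initialize_column(dim):
    # Draw dim + 1 values from Uniform[0, 1) and return them as a column vector
    return np.random.rand(dim + 1).reshape(-1, 1)

np.random.seed(123)
w = initialize_column(4)
print(w.shape)  # (5, 1)
```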
