## Prepare Python environment


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

%matplotlib inline

In [2]:
random_state = 5 # use this to control randomness across runs e.g., dataset partitioning

## Preparing the Glass Dataset (2 points)

We will use the Glass Identification dataset from the UCI Machine Learning Repository. Details can be found [here](https://archive.ics.uci.edu/ml/datasets/glass+identification). The objective is to predict the type of glass (attribute 10 below) from the nine features listed before it:

1.  RI: refractive index
2.  Na: Sodium
3.  Mg: Magnesium
4.  Al: Aluminum
5.  Si: Silica
6.  K: Potassium
7.  Ca: Calcium
8.  Ba: Barium
9.  Fe: Iron
10. Type of glass (Target label)

The classes of glass are:

1. building_windows_float_processed
2. building_windows_non_float_processed
3. vehicle_windows_float_processed
4. vehicle_windows_non_float_processed (no examples in this dataset)
5. containers
6. tableware
7. headlamps

Identification of glass from its content can be used for forensic analysis.

### Loading the dataset

In [3]:
# Download and load the dataset
import os
if not os.path.exists('glass.csv'): 
    !wget https://raw.githubusercontent.com/JHA-Lab/ece364_2022/master/dataset/glass.csv
df = pd.read_csv('glass.csv')
# Display the first five instances in the dataset
df.head(5)

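|   | RI | Na | Mg | Al | Si | K | Ca | Ba | Fe | Type |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.52101 | 13.64 | 4.49 | 1.1 | 71.78 | 0.06 | 8.75 | 0.0 | 0.0 | 1 |
| 1 | 1.51761 | 13.89 | 3.6 | 1.36 | 72.73 | 0.48 | 7.83 | 0.0 | 0.0 | 1 |
| 2 | 1.51618 | 13.53 | 3.55 | 1.54 | 72.99 | 0.39 | 7.78 | 0.0 | 0.0 | 1 |
| 3 | 1.51766 | 13.21 | 3.69 | 1.29 | 72.61 | 0.57 | 8.22 | 0.0 | 0.0 | 1 |
| 4 | 1.51742 | 13.27 | 3.62 | 1.24 | 73.08 | 0.55 | 8.07 | 0.0 | 0.0 | 1 |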


In [4]:
# Additional features to be added to the data
df['Ca_Na'] = df.Ca*df.Na
df['Al_Mg'] = df.Al*df.Mg
df['Ca_Mg'] = df.Ca*df.Mg
df['Ca_RI'] = df.Ca*df.RI

### Extract target and descriptive features (0.5 points)

In [5]:
# Store all the features from the data in X
X = df.drop('Type', axis=1)
# Store all the labels in y
y = df['Type']

In [6]:
# Convert data to numpy array
X = X.to_numpy()
y = y.to_numpy()
print(X.shape)
print(y.shape)

(214, 13)
(214,)


### Create training and validation datasets (0.5 points)


Split the data into training and validation sets using `train_test_split` (see [here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) for details). To get consistent results across runs, set `random_state` to the value defined earlier. We use 80% of the data for training and 20% for validation.

In [7]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=random_state)  # 80% training and 20% validation
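
As a quick sanity check of the split described above, the following sketch uses hypothetical random stand-in arrays (`X_demo`, `y_demo`) with the same shape as the notebook's data. It only illustrates the resulting set sizes and the reproducibility that a fixed `random_state` provides; it does not use the glass data itself.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data with the same shape as the notebook's arrays.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(214, 13))
y_demo = rng.integers(1, 8, size=214)

# Two splits with the same random_state should be identical.
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=5)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=5)

X_tr, X_va, y_tr, y_va = split_a
print(X_tr.shape, X_va.shape)  # (171, 13) (43, 13)

same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print("identical splits:", same)
```

With 214 examples, a 20% validation fraction is rounded up to 43 examples, which leaves 171 for training. `train_test_split` also accepts a `stratify` argument to keep class proportions similar across the two sets; the notebook does not use it.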

### Preprocess the dataset (1 point)

#### Preprocess the data by standardizing each feature to have zero mean and unit standard deviation. This can be done with the `StandardScaler` class: fit it on the training set only, then apply the same transformation to the validation set. See [here](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) for more details.


In [8]:
# Define the scaler for standardizing the data
scaler = StandardScaler()

# Fit the scaler on the training data and standardize it
X_train = scaler.fit_transform(X_train)

# Apply the same transformation (training mean and std) to the validation data
X_val = scaler.transform(X_val)
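
To make the standardization step concrete, here is a minimal NumPy sketch of the same computation on hypothetical random data (`train_demo` and `val_demo` are illustrative names, not notebook variables). It assumes the per-feature mean and population standard deviation, which is what `StandardScaler` uses by default.

```python
import numpy as np

# Hypothetical stand-in data: 3 features with non-zero mean and non-unit spread.
rng = np.random.default_rng(0)
train_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
val_demo = rng.normal(loc=5.0, scale=2.0, size=(25, 3))

# Statistics are estimated from the training set only.
mu = train_demo.mean(axis=0)
sigma = train_demo.std(axis=0)  # population std (ddof=0)

# The same shift and scale are applied to both sets.
train_scaled = (train_demo - mu) / sigma
val_scaled = (val_demo - mu) / sigma

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(val_scaled.mean(axis=0), val_scaled.std(axis=0))
```

The training features come out with mean 0 and standard deviation 1 up to floating-point error. The validation features are only approximately standardized, because they are scaled with the training statistics; this keeps information about the validation set out of the preprocessing step.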

## Training error-based models (18 points)


#### We will use the `sklearn` library to train a Multinomial Logistic Regression classifier and Support Vector Machines. 


### Exercise 1:  Learning a Multinomial Logistic Regression classifier (4 points)

#### Use `sklearn`'s `SGDClassifier` to train a multi-class logistic regression classifier with Stochastic Gradient Descent (`SGDClassifier` combines binary classifiers in a one-versus-rest scheme). Review ch.7 and see [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html#sklearn.linear_model.SGDClassifier) for more details.

#### Set `random_state` to the value defined above, and increase `n_iter_no_change` to 100 and `max_iter` to 1000 to facilitate convergence.

#### Report the model's accuracy over the training and validation sets.
 

In [9]:
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score 

In [10]:
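# Create a logistic regression classifier trained with SGD
# (scikit-learn >= 1.1 names this loss 'log_loss'; older releases used 'log')
clf = SGDClassifier(loss='log_loss', random_state=random_state, n_iter_no_change=100, max_iter=1000)

# Train the classifier on the training set
clf = clf.fit(X_train, y_train)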

# Predict the response for train dataset
y_pred_train = clf.predict(X_train)

# Predict the response for validation dataset
y_pred_val = clf.predict(X_val)

print("train accuracy: %.5f" % accuracy_score(y_train, y_pred_train))
print("validation accuracy: %.5f" % accuracy_score(y_val, y_pred_val))

train accuracy: 0.71930
validation accuracy: 0.69767


#### Explain any performance difference observed between the training and validation datasets.

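Training accuracy (0.719) is about two percentage points higher than validation accuracy (0.698), which suggests that the classifier slightly overfits the training set. The gap is small, though: with an 80/20 split of 214 examples, the validation set holds only 43 examples, so the difference corresponds to roughly one validation example.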
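
The one-versus-rest scheme used in Exercise 1 can be sketched in a few lines of NumPy. The weights, biases, class labels, and inputs below are hypothetical values chosen for illustration, not parameters learned from the glass data: each class gets its own linear scorer, and the predicted class is the one with the highest score.

```python
import numpy as np

def sigmoid(z):
    """Logistic function mapping a score to a 'this class vs. rest' probability."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters: one weight vector and bias per class (3 classes, 2 features).
W = np.array([[ 2.0, -1.0],
              [-1.0,  2.0],
              [-1.0, -1.0]])
b = np.array([0.0, 0.0, 0.5])
class_labels = np.array([10, 20, 30])  # arbitrary illustrative labels

# Three hypothetical input points.
x = np.array([[ 1.5,  0.2],
              [-0.3,  1.8],
              [-1.0, -1.2]])

scores = x @ W.T + b      # shape (n_points, n_classes)
probs = sigmoid(scores)   # per-class binary probabilities; rows need not sum to 1
pred = class_labels[scores.argmax(axis=1)]

print(pred)  # [10 20 30]
```

Because each binary classifier is trained independently, the per-class probabilities are not normalized across classes, which is one way this scheme differs from a single softmax (multinomial) model.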

### Exercise 2: Learning a Support Vector Machine (SVM) (14 points)

#### Use `sklearn`'s `SVC` class to train an SVM (`SVC` handles multiple classes with a [one-versus-one scheme](https://en.wikipedia.org/wiki/Multiclass_classification#One-vs.-one)). Review ch.7 and see [here](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) for more details.
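
To give a sense of what a one-versus-one scheme implies, the sketch below counts the binary classifiers needed for a given number of classes. It assumes six classes are present in the training data, as listed in the dataset description; `n_classes` and `pairs` are illustrative names.

```python
from itertools import combinations

n_classes = 6  # assumption: six glass types appear in the training data

# One binary classifier is trained per unordered pair of classes.
pairs = list(combinations(range(n_classes), 2))

print(len(pairs))                      # 15
print(n_classes * (n_classes - 1) // 2)  # same count from the closed-form K(K-1)/2
```

At prediction time, each pairwise classifier votes for one of its two classes and the class with the most votes wins, so the number of classifiers grows quadratically with the number of classes.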
 

In [11]:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

#### Exercise 2a: Warm up (2 points)

#### Train an SVM with a linear kernel. Set `random_state` to the value defined above. Keep all other parameters at their defaults.

#### Report the model's accuracy over the training and validation sets.

In [12]:
clf = SVC(kernel='linear', random_state=random_state)
clf.fit(X_train,y_train)

print("training acc: %.5f" % accuracy_score(y_train, clf.predict(X_train)))
print("validation acc: %.5f" % accuracy_score(y_val, clf.predict(X_val)))


training acc: 0.74269
validation acc: 0.69767


#### Exercise 2b: Evaluate a polynomial kernel function (4 points)

#### Try fitting an SVM with a polynomial kernel function and vary the degree among {1, 2, 3, 4}. Note that degree=1 yields a linear kernel. 

#### For each fitted classifier, report its accuracy over the training and validation sets. 

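#### As before, set `random_state` to the value defined above. Set the regularization parameter `C=3`. In `SVC`, the strength of regularization is inversely proportional to `C`, so a larger `C` penalizes margin violations more heavily; when the data is not linearly separable, this encourages the model to fit the training data. Keep all other parameters at their default values.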
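
For reference, the role of `C` can be read off the standard soft-margin SVM objective for the binary case. The symbols below are assumptions introduced for illustration and do not appear elsewhere in this notebook: labels $y_i \in \{-1, +1\}$, feature map $\phi$ induced by the kernel, and slack variables $\xi_i$ that measure margin violations.

$$
\min_{w,\, b,\, \xi} \;\; \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
y_i \left( w^\top \phi(x_i) + b \right) \ge 1 - \xi_i, \qquad \xi_i \ge 0 .
$$

The first term favors a wide margin and the second penalizes training examples that fall inside the margin or on the wrong side of it. Increasing $C$ shifts the balance toward the second term, so the model tolerates fewer violations on the training set.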

In [13]:
for degree in [1,2,3,4]:
    clf = SVC(kernel='poly', random_state=random_state, degree=degree, C=3)
    clf.fit(X_train,y_train)
    
    print("Poly kernel, degree: %d" %degree)
    print("training acc: %.5f" % accuracy_score(y_train, clf.predict(X_train)))
    print("validation acc: %.5f" % accuracy_score(y_val, clf.predict(X_val)))

Poly kernel, degree: 1
training acc: 0.70760
validation acc: 0.72093
Poly kernel, degree: 2
training acc: 0.73684
validation acc: 0.74419
Poly kernel, degree: 3
training acc: 0.73099
validation acc: 0.62791
Poly kernel, degree: 4
training acc: 0.73684
validation acc: 0.62791


#### Explain the effect of increasing the degree of the polynomial.

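Increasing the degree of the polynomial kernel makes the decision boundary more flexible, which allows the model to fit the training data more closely: training accuracy rises from 0.708 at degree 1 to between 0.73 and 0.74 at degrees 2 to 4. Validation accuracy improves slightly at degree 2 (0.744) but drops to 0.628 at degrees 3 and 4, which indicates that the higher-degree models overfit the training data and generalize worse.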
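
The effect of the degree can be seen directly in the kernel value. The sketch below assumes the polynomial kernel form documented for scikit-learn, $(\gamma \langle x, z \rangle + r)^d$ with `coef0` $r = 0$ by default; the function `poly_kernel` and the two vectors are hypothetical and chosen only for illustration.

```python
import numpy as np

def poly_kernel(x, z, degree, gamma=1.0, coef0=0.0):
    """Polynomial kernel (gamma * <x, z> + coef0) ** degree."""
    return (gamma * np.dot(x, z) + coef0) ** degree

# Hypothetical input vectors with inner product 1*3 + 2*0.5 = 4.
x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

values = [poly_kernel(x, z, degree=d, gamma=0.5) for d in [1, 2, 3, 4]]
print(values)  # [2.0, 4.0, 8.0, 16.0]

# With degree 1 and coef0 = 0, the kernel is the linear kernel scaled by gamma.
print(values[0] == 0.5 * np.dot(x, z))  # True
```

Higher degrees amplify differences in the inner product, which corresponds to including higher-order feature interactions and therefore a more flexible decision boundary.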

#### Exercise 2c: Evaluate the radial basis kernel function (6 points)

#### Try fitting an SVM with a radial basis kernel function and vary the kernel coefficient $\gamma$, which controls the length scale of the kernel (a larger $\gamma$ means a shorter length scale), among {0.01, 0.1, 1, 10, 100}.

#### For each fitted classifier, report its accuracy over the training and validation sets. 

#### As before, set `random_state` to the value defined above. Set the regularization parameter `C=10`. In `SVC`, the strength of regularization is inversely proportional to `C`, so a larger `C` penalizes margin violations more heavily; when the data is not linearly separable, this encourages the model to fit the training data (read more [here](https://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html)). Keep all other parameters at their default values.

In [14]:
for gamma in [0.01,0.1,1,10,100]:
    clf = SVC(kernel='rbf', random_state=random_state, gamma=gamma, C=10)
    clf.fit(X_train,y_train)
    
    print("RBF kernel, gamma: ", gamma)
    print("training acc: %.2f" % accuracy_score(y_train, clf.predict(X_train)))
    print("validation acc: %.2f" % accuracy_score(y_val, clf.predict(X_val)))

RBF kernel, gamma:  0.01
training acc: 0.75
validation acc: 0.84
RBF kernel, gamma:  0.1
training acc: 0.90
validation acc: 0.72
RBF kernel, gamma:  1
training acc: 0.99
validation acc: 0.60
RBF kernel, gamma:  10
training acc: 1.00
validation acc: 0.53
RBF kernel, gamma:  100
training acc: 1.00
validation acc: 0.35


#### Comment on the effect of increasing/reducing the kernel coefficient $\gamma$. Also, compare the performance of the classifiers trained with the RBF kernel function against those trained with the polynomial and linear kernel functions (i.e., Ex. 2a and 2b).

$\gamma$ determines how quickly the similarity between two points decays with distance, that is, when points are deemed close or far. A larger $\gamma$ shrinks the radius of influence of each support vector, so a prediction depends only on a few nearby support vectors. A smaller $\gamma$ enlarges the radius of influence, so more distant support vectors also contribute to a prediction.

In these results, training accuracy rises from 0.75 at $\gamma = 0.01$ to 1.00 at $\gamma \ge 10$, while validation accuracy falls steadily from 0.84 to 0.35. Increasing $\gamma$ therefore degrades generalization: the classifier effectively relies on a few training examples (support vectors) close to the query point and ends up memorizing the training set. Learn more [here](https://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html).

For $\gamma \ge 1$, classifiers trained with the RBF kernel overfit the training data more than those trained with the polynomial and linear kernels and reach lower validation accuracy (0.60 or less, versus 0.63 to 0.74). The RBF kernel corresponds to an infinite-dimensional feature space, which makes this degree of overfitting possible when $\gamma$ is large. With a small $\gamma$, however, the RBF classifier generalizes best: at $\gamma = 0.01$ it reaches the highest validation accuracy of all models tried (0.84). Note that the comparison is not fully controlled, because `C` differs across the experiments (default, 3, and 10).
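
The "radius of influence" described above can be illustrated numerically. The sketch below assumes the usual RBF kernel form $\exp(-\gamma \lVert x - z \rVert^2)$ and evaluates it for one hypothetical pair of points at the same $\gamma$ values used in the exercise; `rbf_kernel_value` is an illustrative helper, not a scikit-learn function.

```python
import numpy as np

def rbf_kernel_value(x, z, gamma):
    """RBF kernel exp(-gamma * ||x - z||^2) between two points."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

# Hypothetical pair of standardized points at squared distance 2.
x = np.zeros(2)
z = np.array([1.0, 1.0])

gammas = [0.01, 0.1, 1, 10, 100]
similarities = [rbf_kernel_value(x, z, g) for g in gammas]

for g, s in zip(gammas, similarities):
    print(f"gamma={g}: similarity={s:.3e}")
```

For small $\gamma$ the two points are still considered similar (a value close to 1), while for large $\gamma$ the similarity is numerically almost zero, so a support vector at this distance would have practically no influence on the prediction.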

#### Exercise 2d: Briefly state the main difference between the logistic regression classifier and the SVM. (2 points)

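SVMs are explicitly trained to learn decision boundaries with large margins separating the classes: they minimize a hinge loss, so the solution depends only on the training examples on or inside the margin (the support vectors). Logistic regression instead minimizes the log loss over all training examples, which yields class-probability estimates but does not explicitly maximize the margin.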
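
The difference in training objectives can be made concrete by comparing the two per-example losses as a function of the signed margin $m = y \cdot f(x)$, with labels $y \in \{-1, +1\}$. This is a minimal sketch for the binary case with hypothetical margin values; `hinge` and `logistic` are illustrative names.

```python
import numpy as np

# Hypothetical signed margins: negative = misclassified, >= 1 = outside the SVM margin.
margins = np.array([-1.0, 0.0, 1.0, 2.0])

hinge = np.maximum(0.0, 1.0 - margins)   # SVM loss
logistic = np.log1p(np.exp(-margins))    # logistic regression (log) loss

print(hinge)     # [2. 1. 0. 0.]
print(logistic)  # strictly positive for every margin
```

The hinge loss is exactly zero once an example is beyond the margin, so such examples stop affecting the SVM solution. The log loss never reaches zero, so every training example keeps contributing to the logistic regression fit.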