## Prepare Python environment


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

%matplotlib inline

In [None]:
random_state=5 # use this to control randomness across runs e.g., dataset partitioning

## Preparing the Glass Dataset (1 point)

We will use the Glass Identification dataset from the UCI Machine Learning Repository. Details for this data can be found [here](https://archive.ics.uci.edu/ml/datasets/glass+identification). The objective is to identify the type of glass (attribute 10, the target label) from the nine features listed first:

1.  RI: refractive index
2.  Na: Sodium
3.  Mg: Magnesium
4.  Al: Aluminum
5.  Si: Silicon
6.  K: Potassium
7.  Ca: Calcium
8.  Ba: Barium
9.  Fe: Iron
10. Type of glass (Target label)

The classes of glass are:

1. building_windows_float_processed
2. building_windows_non_float_processed
3. vehicle_windows_float_processed
5. containers
6. tableware
7. headlamps

Class 4 (vehicle_windows_non_float_processed) has no instances in this dataset.

Identification of glass from its content can be used for forensic analysis.


### Loading the dataset

In [None]:
# Download and load the dataset
import os
if not os.path.exists('glass.csv'): 
    !wget https://raw.githubusercontent.com/JHA-Lab/ece364/main/dataset/glass.csv 
data = pd.read_csv('glass.csv')
# Display the first five instances in the dataset
data.head(5)

#### Look at some statistics of the data using the `describe` function in pandas.

In [None]:
data.describe()

In [None]:
# Check type of data in each column
data.info()

### Visualize the Data

#### Check how many samples of each type of glass are in the data. This has been done for you.

In [None]:
sns.set(style="whitegrid", font_scale=1.8)
plt.subplots(figsize = (15,8))
sns.countplot(x='Type',data=data).set_title('Count of Glass Types')

#### Calculate the `mean` material content for each kind of glass. This has been done for you.

In [None]:
# Compute mean material content for each kind of glass
data.groupby('Type', as_index=False).mean()

#### Create box plots to see the distribution of each material's content across the glass types. See [here](https://seaborn.pydata.org/generated/seaborn.boxplot.html) for further details. This has been done for you.

In [None]:
sns.set(style="whitegrid", font_scale=1.2)
# One box plot per feature, arranged in a 3x3 grid
features = ['RI', 'Na', 'Mg', 'Al', 'Si', 'K', 'Ca', 'Ba', 'Fe']
fig, axes = plt.subplots(3, 3, figsize=(20, 15))
for ax, feature in zip(axes.ravel(), features):
    sns.boxplot(x='Type', y=feature, data=data, ax=ax)
plt.show()

#### Create a pairplot to display pairwise relationships between a subset of the features. See [here](https://seaborn.pydata.org/generated/seaborn.pairplot.html) for further details. This has been done for you.

In [None]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
sns.pairplot(data[['RI','Na','Mg','Al','Si','Ca','Type']], hue='Type')

In [None]:
# Plot heatmap showing correlation between different features
plt.subplots(figsize=(15,10))
sns.heatmap(data.corr(), cmap='YlGnBu', annot=True, linewidths=.5)

### Extract target and descriptive features (0.5 points)

#### Add the following features to the dataset to model interactions between the pairs of glass materials. (See [here](https://cmdlinetips.com/2019/01/3-ways-to-add-new-columns-to-pandas-dataframe/) for an example.) 

    - Ca*Na
    - Al*Mg 
    - Ca*Mg
    - Ca*RI



In [None]:
# Additional features to be added to the data
data['Ca_Na'] = # TODO
data['Al_Mg'] = # TODO
data['Ca_Mg'] = # TODO
data['Ca_RI'] = # TODO
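As a hedged illustration of how an interaction feature can be built, the sketch below multiplies two columns of a toy DataFrame element-wise and stores the result as a new column. The frame `toy` and its columns `a`, `b`, and `a_b` are hypothetical names chosen for this example; they are not part of the glass data.

```python
import pandas as pd

# Toy frame with hypothetical columns 'a' and 'b' (not the glass data)
toy = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [10.0, 20.0, 30.0]})

# Element-wise product of two columns, stored as a new column
toy['a_b'] = toy['a'] * toy['b']
print(toy)
```

Assigning to a new column name with `toy['a_b'] = ...` adds the column in place, so later steps see it alongside the original columns.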

#### Separate the target and features from the data.

In [None]:
# Store all the features from the data in X
X= # TODO
# Store all the labels in y
y= # TODO

In [None]:
# Convert data to numpy array
X = # TODO
y = # TODO
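One possible way to separate features from labels and convert both to NumPy arrays is sketched below on a toy DataFrame. The names `toy`, `f1`, `f2`, `label`, `X_toy`, and `y_toy` are assumptions made for this illustration only.

```python
import pandas as pd

# Toy frame: two hypothetical feature columns and one hypothetical label column
toy = pd.DataFrame({'f1': [0.1, 0.2, 0.3],
                    'f2': [1.0, 2.0, 3.0],
                    'label': [1, 2, 1]})

# Everything except the label column is treated as a feature
X_toy = toy.drop(columns='label').to_numpy()
# The label column alone is the target
y_toy = toy['label'].to_numpy()

print(X_toy.shape, y_toy.shape)
```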

### Create training and validation datasets (0.5 points)


We will split the dataset into a training set and a validation set. In machine learning, data is generally split into training, validation, and test sets (this will be covered in later chapters). The model with the best performance on the validation set is then evaluated on the test set, which is unseen data. In this assignment, we will use the training set for training and evaluate performance on the held-out set (called the test set in the code and exercises below) for various model configurations to determine the best hyperparameters (the parameter settings yielding the best performance).

Split the data into training and validation sets using `train_test_split`. See [here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) for details. To get consistent results when splitting, set `random_state` to the value defined earlier. Use 80% of the data for training and 20% of the data for validation.

In [None]:
X_train,X_test,y_train,y_test = # TODO
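The 80/20 split described above might look like the following sketch, shown here on a small synthetic array rather than the glass data. The names `X_toy`, `y_toy`, `seed`, `X_tr`, `X_val`, `y_tr`, and `y_val` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 10 samples, 2 features (not the glass data)
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1] * 5)
seed = 5  # a fixed seed makes the split reproducible

# test_size=0.2 holds out 20% of the samples
X_tr, X_val, y_tr, y_val = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=seed)

print(X_tr.shape, X_val.shape)
```

Running the split again with the same `random_state` returns the same partition, which is what makes results comparable across runs.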

#### Preprocess the dataset by normalizing each feature to have zero mean and unit standard deviation. This can be done using the `StandardScaler` class. See [here](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) for more details.

In [None]:
# Define the scaler for scaling the data
scaler = # TODO

# Normalize the training data
X_train = # TODO

# Apply the same transformation (fitted on the training data) to the validation data
X_test = # TODO
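The sketch below illustrates the fit-on-train, transform-on-validation pattern with a tiny hand-made array. The names `X_tr`, `X_val`, `toy_scaler`, `X_tr_scaled`, and `X_val_scaled` are assumptions for this example. The key point is that the mean and standard deviation are estimated from the training data only and then reused for the held-out data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hand-made stand-in data (not the glass data)
X_tr = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_val = np.array([[2.0, 40.0]])

toy_scaler = StandardScaler()
# fit_transform learns per-feature mean and std from the training data
X_tr_scaled = toy_scaler.fit_transform(X_tr)
# transform reuses those training statistics; it does not refit
X_val_scaled = toy_scaler.transform(X_val)

print(X_tr_scaled.mean(axis=0), X_tr_scaled.std(axis=0))
print(X_val_scaled)
```

Because the held-out data is scaled with training statistics, its features will generally not have exactly zero mean and unit standard deviation.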


## Training K-nearest neighbor models (9 points)

#### We will use the `sklearn` library to train a K-nearest neighbors (kNN) classifier. Review ch.5 and see [here](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier) for more details. 

### Exercise 1:  Learning a kNN classifier (9 points)

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score 

#### Exercise 1a: Evaluate the effect of the number of neighbors (3 points)

#### Train kNN classifiers with the number of neighbors varied among {1, 5, 25, 100, len(X_train)}.

#### Keep all other parameters at their default values.  

#### Report the model's accuracy over the training and test sets.
 

In [None]:
# TODO
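A minimal sketch of such a sweep over the number of neighbors is shown below. It runs on synthetic data from `make_classification`, not on the glass data, and the names `X_demo`, `y_demo`, `X_tr`, `X_te`, `y_tr`, `y_te`, and `results` are hypothetical. It is one possible structure, not a reference solution.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption for illustration, not the glass data)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, n_informative=3,
                                     n_redundant=0, n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

results = {}  # maps k -> (train accuracy, test accuracy)
for k in [1, 5, 25, 100, len(X_tr)]:
    knn = KNeighborsClassifier(n_neighbors=k)  # all other parameters at defaults
    knn.fit(X_tr, y_tr)
    train_acc = accuracy_score(y_tr, knn.predict(X_tr))
    test_acc = accuracy_score(y_te, knn.predict(X_te))
    results[k] = (train_acc, test_acc)
    print(f"k={k:>3}  train={train_acc:.3f}  test={test_acc:.3f}")
```

With `k=1`, each training point is its own nearest neighbor, so training accuracy is perfect whenever no two identical points carry different labels.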

#### Explain the effect of increasing the number of neighbors on the performance observed on training and test sets. 

TO DO

#### Exercise 1b: Evaluate the effect of a weighted kNN (3 points)

#### Train kNN classifiers with distance weighting, varying the number of neighbors among {1, 5, 25, 100, len(X_train)}.

#### Keep all other parameters at their default values.  

#### Report the model's accuracy over the training and test sets.
 

In [None]:
# TODO
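To illustrate how the `weights` parameter of `KNeighborsClassifier` changes behavior, the sketch below compares `weights="uniform"` and `weights="distance"` on synthetic data. The data and the names `X_demo`, `y_demo`, `X_tr`, `X_te`, `y_tr`, `y_te`, and `results_w` are assumptions for this example, not the glass data or a reference solution.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption for illustration, not the glass data)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, n_informative=3,
                                     n_redundant=0, n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

results_w = {}  # maps (weights, k) -> (train accuracy, test accuracy)
for weights in ("uniform", "distance"):
    for k in [1, 25, len(X_tr)]:
        knn = KNeighborsClassifier(n_neighbors=k, weights=weights)
        knn.fit(X_tr, y_tr)
        train_acc = accuracy_score(y_tr, knn.predict(X_tr))
        test_acc = accuracy_score(y_te, knn.predict(X_te))
        results_w[(weights, k)] = (train_acc, test_acc)
        print(f"{weights:>8}  k={k:>3}  train={train_acc:.3f}  test={test_acc:.3f}")
```

Under distance weighting, a training point queried against the training set matches itself at distance zero, and scikit-learn lets such zero-distance neighbors decide the vote. Training accuracy therefore stays perfect for every `k` in this sketch, while uniform weighting at `k = len(X_tr)` predicts a single class for every sample.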

#### Compare the effect of the number of neighbors on model performance (train and test) under the distance-weighted kNN against the uniformly weighted kNN. Explain any differences observed.

TO DO

#### Exercise 1c: Evaluate the effect of the power parameter in the Minkowski distance metric (3 points)

#### Train kNN classifiers with different distance functions by varying the power parameter for the Minkowski distance among {1,2,10,100}.

#### Fix the number of neighbors to be 25, and use the uniformly-weighted kNN. Keep all other parameters at their default values.  
#### Report the model's accuracy over the train and test sets.

In [None]:
# TODO
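The power parameter `p` of `KNeighborsClassifier` is the exponent in the Minkowski distance, $d_p(u, v) = \left(\sum_i |u_i - v_i|^p\right)^{1/p}$, where `p=1` gives the Manhattan distance and `p=2` the Euclidean distance. The sketch below computes this distance directly with NumPy for two hand-picked points to show how it behaves as `p` grows. The helper `minkowski_distance` and the points `u` and `v` are hypothetical names introduced for this illustration.

```python
import numpy as np

def minkowski_distance(u, v, p):
    """Minkowski distance of order p between two 1-D arrays."""
    return float(np.sum(np.abs(u - v) ** p) ** (1.0 / p))

# Two hand-picked points whose coordinate differences are 3 and 4
u = np.array([0.0, 0.0])
v = np.array([3.0, 4.0])

distances = {p: minkowski_distance(u, v, p) for p in (1, 2, 10, 100)}
largest_diff = float(np.max(np.abs(u - v)))  # limit of the distance as p grows

for p, d in distances.items():
    print(f"p={p:>3}: {d:.4f}")
print(f"largest coordinate difference: {largest_diff:.4f}")
```

As `p` increases, the largest single coordinate difference dominates the sum, so the distance approaches the maximum absolute difference between the two points.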

#### Explain any effect observed on the train and test performance upon increasing the power parameter. 

TO DO