## Prepare Python environment


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split

%matplotlib inline

In [None]:
random_state=5 # use this to control randomness across runs e.g., dataset partitioning

## Preparing the Glass Dataset (2 points)

---

We will use the Glass Identification dataset from the UCI Machine Learning Repository. Details about this dataset can be found [here](https://archive.ics.uci.edu/ml/datasets/glass+identification). The objective is to identify the class of glass based on the following features:

1.  RI: refractive index
2.  Na: Sodium
3.  Mg: Magnesium
4.  Al: Aluminum
5.  Si: Silicon
6.  K: Potassium
7.  Ca: Calcium
8.  Ba: Barium
9.  Fe: Iron
10. Type of glass (Target label)

The classes of glass are:

1. building_windows_float_processed 
2. building_windows_non_float_processed 
3. vehicle_windows_float_processed 
5. containers
6. tableware 
7. headlamps

Identification of glass from its content can be used for forensic analysis.

### Loading the dataset

In [None]:
# Download and load the dataset
import os
if not os.path.exists('glass.csv'): 
    !wget https://raw.githubusercontent.com/JHA-Lab/ece364_2022/master/dataset/glass.csv 
data = pd.read_csv('glass.csv')
# Display the first five instances in the dataset
data.head(5)

### Check the data type for each column

In [None]:
data.info()

#### Look at some statistics of the data using the `describe` function in pandas.

In [None]:
data.describe()

1. `count` is the number of non-missing values in a feature.

2. `mean` is the average value of that feature.

3. `std` is the standard deviation of that feature.

4. `min` is the minimum value of that feature.

5. `25%`, `50%`, and `75%` are the 25th, 50th (median), and 75th percentiles, i.e., the quartiles, of that feature.

6. `max` is the maximum value of that feature.
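As a small illustration of these statistics, the sketch below runs `describe` on a toy column. The column name `x` and its values are made up for this example and are not taken from the glass data. Note that the missing value is excluded from `count`.

```python
import numpy as np
import pandas as pd

# Toy column with one missing value (hypothetical data, for illustration only)
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, np.nan]})
summary = toy.describe()

print(summary)
# count ignores the NaN entry, so it covers 4 values rather than 5
print(summary.loc["count", "x"])
# the 50% row is the median of the non-missing values
print(summary.loc["50%", "x"])
```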

### Visualize the Data

#### Check how many instances of each type of glass there are in the data. This has been done for you.

In [None]:
sns.set(style="whitegrid", font_scale=1.8)
plt.subplots(figsize = (15,8))
sns.countplot(x='Type',data=data).set_title('Count of Glass Types')

#### Calculate `mean` material content for each kind of glass. This has been done for you.

In [None]:
# Compute mean material content for each kind of glass
data.groupby('Type', as_index=False).mean()

#### Create box plots to see the distribution of each material across the glass types. See [here](https://seaborn.pydata.org/generated/seaborn.boxplot.html) for further details. This has been done for you.

In [None]:
sns.set(style="whitegrid", font_scale=1.2)
# One box plot per feature, laid out on a 3x3 grid
features = ['RI', 'Na', 'Mg', 'Al', 'Si', 'K', 'Ca', 'Ba', 'Fe']
fig, axes = plt.subplots(3, 3, figsize=(20, 15))
for ax, feature in zip(axes.flat, features):
    sns.boxplot(x='Type', y=feature, data=data, ax=ax)
plt.show()

#### Create a pairplot to display pairwise relationships between features. See [here](https://seaborn.pydata.org/generated/seaborn.pairplot.html) for further details. This has been done for you.

In [None]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
sns.pairplot(data[['RI','Na','Mg','Al','Si','Ca','Type']], hue='Type')

In [None]:
# Plot heatmap showing correlation between different features
plt.subplots(figsize=(15,10))
sns.heatmap(data.corr(),cmap='YlGnBu',annot=True, linewidth=.5)

### Extract target and descriptive features (1 point)

#### Add the following features to the dataset to model interactions between the pairs of glass materials. (See [here](https://cmdlinetips.com/2019/01/3-ways-to-add-new-columns-to-pandas-dataframe/) for an example.) 

    - Ca*Na
    - Al*Mg 
    - Ca*Mg
    - Ca*RI
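An interaction feature of this kind is the element-wise product of two existing columns. The sketch below shows the pattern on a toy DataFrame; the column names `a`, `b`, and `a_b` are hypothetical and stand in for the glass columns.

```python
import pandas as pd

# Toy frame with two made-up columns (not the glass data)
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})

# Element-wise product of two columns, stored as a new column
toy["a_b"] = toy["a"] * toy["b"]
print(toy)
```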

In [None]:
# Additional features to be added to the data
data['Ca_Na'] = # insert your code here
data['Al_Mg'] = # insert your code here
data['Ca_Mg'] = # insert your code here
data['Ca_RI'] = # insert your code here

In [None]:
data.columns.values

#### Separate the target and features from the data.
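One common way to do this, sketched on a toy DataFrame, is to drop the target column to obtain the features and select it on its own to obtain the labels. The names `toy`, `f1`, `f2`, and `label` are assumptions made for this example.

```python
import pandas as pd

# Toy frame: two feature columns and one target column (illustration only)
toy = pd.DataFrame({
    "f1": [0.1, 0.2, 0.3, 0.4],
    "f2": [1.0, 2.0, 3.0, 4.0],
    "label": [1, 1, 2, 2],
})

# Features: every column except the target
X_toy = toy.drop("label", axis=1)
# Target: the label column alone
y_toy = toy["label"]

# Convert both to NumPy arrays
X_arr = X_toy.to_numpy()
y_arr = y_toy.to_numpy()
print(X_arr.shape, y_arr.shape)
```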

In [None]:
# Store all the features from the data in X
X= # insert your code here
print(X)
# Store all the labels in y
y= # insert your code here
print(y)

In [None]:
# Convert data to numpy array
X = # insert your code here
y = # insert your code here

### Create training and validation datasets (1 point)


We will split the dataset into a training set and a validation set. In machine learning, the data is generally split into training, validation, and test sets (this will be covered in later chapters). The model with the best performance on the validation set is then evaluated on the test set, which is unseen data. In this assignment, we will use the training set for training and evaluate performance on the validation set for various model configurations to determine the best hyperparameters (the parameter settings yielding the best performance).

Split the data into training and validation sets using `train_test_split`. See [here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) for details. To get consistent results when splitting, set `random_state` to the value defined earlier. Use 80% of the data for training and 20% for validation.
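The following is a minimal sketch of an 80/20 split on synthetic data, not on the glass dataset. The array names and the seed used to generate the data are assumptions for this example; `test_size=0.2` holds out 20% of the samples.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 50 samples, 3 features, binary labels (illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = np.repeat([0, 1], 25)

# test_size=0.2 holds out 20% of the samples; random_state fixes the shuffle
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=5
)
print(X_tr.shape, X_va.shape)
```

Rerunning the split with the same `random_state` returns the same partition, which is what makes results comparable across runs.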

In [None]:
X_train,X_val,y_train,y_val = # insert your code here

## Training Decision Tree-based Classifiers (18 points)


### Exercise 1: Learning a Decision Tree (10 points)

#### We will use the `sklearn` library to train a Decision Tree classifier. Review ch.4 and see [here](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) for more details. 

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

In [None]:
# Tree visualization helper function
from io import StringIO
from sklearn.tree import export_graphviz
from IPython.display import Image
import pydotplus

def get_tree_image(clf):
    """
    clf: fitted DecisionTreeClassifier

    Returns a bytes object representing the PNG image of the tree.
    """
    dot_data = StringIO()
    # Feature names come from the global `data` DataFrame (all columns except the target).
    # Converted to a list because some scikit-learn versions reject a pandas Index here.
    feature_names = list(data.drop('Type', axis=1).columns)
    class_names = ["building_windows_float_processed", "building_windows_non_float_processed",
                   "vehicle_windows_float_processed", "containers", "tableware", "headlamps"]
    export_graphviz(clf, out_file=dot_data,
                    filled=True, rounded=True,
                    special_characters=True,
                    feature_names=feature_names,
                    class_names=class_names)
    graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
    return graph.create_png()

#### Exercise 1a: Fit and interpret a decision tree. (6 points)

#### Fit decision trees using the Gini index and the entropy-based impurity measure.

#### Set the random_state to the value defined above. Keep all other parameters at their default values. 

#### Report the training and validation set accuracies for each classifier.
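The general pattern can be sketched as follows. This example uses the Iris dataset bundled with scikit-learn as a stand-in for the glass data, so the variable names and the numbers it prints are assumptions for illustration and say nothing about the results expected in this exercise.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Iris stands in for the glass data here (assumption for illustration)
X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_iris, y_iris, test_size=0.2, random_state=5
)

# Fit one tree per impurity criterion and record (train, validation) accuracy
criterion_scores = {}
for criterion in ("gini", "entropy"):
    clf = DecisionTreeClassifier(criterion=criterion, random_state=5)
    clf.fit(X_tr, y_tr)
    criterion_scores[criterion] = (
        accuracy_score(y_tr, clf.predict(X_tr)),
        accuracy_score(y_va, clf.predict(X_va)),
    )
    print(criterion, criterion_scores[criterion])
```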

In [None]:
# insert your code here

#### Visualize the Decision Tree with the best validation performance.

In [None]:
best_clf=# insert your code here
tree_image=get_tree_image(best_clf)
Image(tree_image)

#### Indicate the most informative descriptive feature (with the threshold) and briefly explain why this is the most informative (from an algorithmic viewpoint). 

**ANS**:

#### Briefly comment on the tree's depth and what factors may contribute to the shallowness/complexity of the tree. 


**ANS**:

#### Show how one can interpret the tree by specifying the rule from its leftmost branch.

**ANS**:

#### Exercise 1b: Prune a decision tree. (4 points)

#### Next, let's try pruning the tree to see if we can improve the classifier's generalization performance.

#### Pre-prune a decision tree by varying `max_depth` among {None (no depth control), 1, 3, 5, 7}.

#### Set the criterion to entropy and the random_state to the value defined above. Keep all other parameters at their default values. 

#### Report the training and validation set accuracies for each classifier.
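A depth sweep of this kind might look like the sketch below. It again uses the bundled Iris dataset as a stand-in (an assumption for illustration), so the accuracies it prints are not the ones expected for the glass data.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Iris stands in for the glass data here (assumption for illustration)
X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_iris, y_iris, test_size=0.2, random_state=5
)

# max_depth=None lets the tree grow until its leaves are pure
depth_scores = {}
for max_depth in (None, 1, 3, 5, 7):
    clf = DecisionTreeClassifier(
        criterion="entropy", max_depth=max_depth, random_state=5
    )
    clf.fit(X_tr, y_tr)
    depth_scores[max_depth] = (
        accuracy_score(y_tr, clf.predict(X_tr)),
        accuracy_score(y_va, clf.predict(X_va)),
    )
    print(max_depth, depth_scores[max_depth])
```

Storing the training and validation accuracies side by side makes it easier to see where the two start to diverge as depth grows.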

In [None]:
# insert your code here

#### Analyze the effect of increasing tree depth on training and validation performance.

**ANS**:

### Exercise 2: Learning an Ensemble of Decision Trees (8 points)

#### We will use the `sklearn` library to implement bagging and boosting. Review ch.4 and read more on [bagging](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) and [boosting](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html). 

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier

#### Exercise 2a: Fit a Random Forest. (4 points)

#### Fit different Random Forest classifiers by varying the number of trees among {10, 50, 100, 400, 1000}. 

#### Set the `criterion` to entropy and set the random_state to the value defined above. Keep all other parameters at their default values. 

#### Report the validation set accuracies for each classifier.
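The number of trees in a Random Forest is set through the `n_estimators` parameter. The sketch below sweeps a few smaller values on the bundled Iris dataset, which stands in for the glass data here; the dataset, the chosen values, and the variable names are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Iris stands in for the glass data here (assumption for illustration)
X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_iris, y_iris, test_size=0.2, random_state=5
)

# Record validation accuracy for each forest size
forest_scores = {}
for n_trees in (10, 50, 100):
    forest = RandomForestClassifier(
        n_estimators=n_trees, criterion="entropy", random_state=5
    )
    forest.fit(X_tr, y_tr)
    forest_scores[n_trees] = accuracy_score(y_va, forest.predict(X_va))
    print(n_trees, forest_scores[n_trees])
```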

In [None]:
# insert your code here

#### Comment on the effect of increasing the number of trees on validation performance. Compare the performance of the best performing Random Forest classifier against the Decision Tree Classifier trained with entropy (Ex. 1a) and explain any difference. 

**ANS**:

#### Exercise 2b: Fit a Gradient Boosted Decision Tree (GBDT). (4 points)

#### Fit different GBDTs by varying the number of boosting steps/trees added among {5, 10, 20, 50, 100, 200}. 

#### Set `n_iter_no_change` to 100, `validation_fraction` to 0.2, and `random_state` to the value defined above. Keep all other parameters at their default values.

#### Report the training and validation set accuracies for each classifier.
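The sketch below shows how these parameters fit together in `GradientBoostingClassifier`. It uses the bundled Iris dataset and a shorter list of tree counts (both assumptions for illustration), so its output does not reflect the glass data.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Iris stands in for the glass data here (assumption for illustration)
X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_iris, y_iris, test_size=0.2, random_state=5
)

gbdt_scores = {}
fitted_stages = {}
for n_trees in (5, 20, 50):
    gbdt = GradientBoostingClassifier(
        n_estimators=n_trees,
        n_iter_no_change=100,
        validation_fraction=0.2,
        random_state=5,
    )
    gbdt.fit(X_tr, y_tr)
    # n_estimators_ is the number of boosting stages actually fitted
    fitted_stages[n_trees] = gbdt.n_estimators_
    gbdt_scores[n_trees] = (
        accuracy_score(y_tr, gbdt.predict(X_tr)),
        accuracy_score(y_va, gbdt.predict(X_va)),
    )
    print(n_trees, fitted_stages[n_trees], gbdt_scores[n_trees])
```

When `n_iter_no_change` is set, the classifier holds out `validation_fraction` of the data passed to `fit` for its internal early-stopping check, so the trees themselves are fitted on the remaining portion.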

In [None]:
# insert your code here

#### Comment on the effect of increasing the number of trees on validation performance. Compare the performance of the best performing GBDT against that of the best performing Random Forest classifier (Ex. 2a) and Decision Tree classifier trained with entropy (Ex. 1a). 

**ANS**: