<a href="https://colab.research.google.com/github/Pataweepr/applyML_vistec_2019/blob/master/hw4.5_visualization_and_evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Visualization and Evaluation

In this lab we will visualize our data with t-SNE to gain more insight into it. We will also look at another important aspect of machine learning: evaluation.

We will use the same data from the virus classification lab.

## The data preparation

The data can be found [here](https://drive.google.com/file/d/1tb1pvtUNqx3r4FVbNsRKum2hwaLOIYmO/view?usp=sharing). Click "Add to Drive" so that you can mount it later. This section only repeats the data preparation from the previous lab, so just keep running the code until the next section.

In [0]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import itertools
import collections
import csv

from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, auc
from sklearn.metrics import confusion_matrix
from sklearn.utils import shuffle
import seaborn as sns
from IPython.display import display
from scipy.stats import mode

from sklearn.svm import SVC

seed = 976
np.random.seed(seed)

import time

In [0]:
from google.colab import drive
drive.mount('/content/gdrive/')

In [0]:
df = pd.read_csv("/content/gdrive/My Drive/Chosen_Data_clearToN.csv")

In [0]:
df = df[["Sequences", "Label"]]
key_virus = df["Label"].value_counts().index
print(key_virus)

In [0]:
def splitTrainTest(data):
  keyDatas = data["Label"].value_counts().keys()
  train, valid, test = [], [], []
  for k in keyDatas:
    tmp = data[data["Label"]==k]
    # Per class: 40% test, then 1/6 of the remaining 60% for validation (roughly a 50/10/40 split)
    tmp_train, tmp_test = train_test_split(tmp, test_size=2/5, random_state=seed)
    tmp_train, tmp_valid = train_test_split(tmp_train, test_size=1/6, random_state=seed)
    train.append(tmp_train)
    valid.append(tmp_valid)
    test.append(tmp_test)
  # DataFrame.append was removed in pandas 2.0; concatenate the per-class pieces instead
  return pd.concat(train), pd.concat(valid), pd.concat(test)

In [0]:
df_train, df_valid, df_test = splitTrainTest(df)
df_train = df_train.reset_index(drop=True)
df_valid = df_valid.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

In [0]:
def createGram(sequence, gram=5):
  # sequence is the DNA sequence
  # returns: np array where each entry is an n-gram count
  nGram = dict.fromkeys(["".join(x) for x in itertools.product("ACTG",repeat=gram) ],0)
  for i in range(len(sequence)-gram+1):
    kmer = sequence[i:i+gram]
    # Count only n-grams made of A, C, T, G; anything containing N is skipped
    if kmer in nGram:
      nGram[kmer]+=1
  return np.array(list(nGram.values()))

In [0]:
def createGramDataset(data, gram=5, length=10000, train=False):
  datangram = []
  label = []

  if(train):
    # Upsampling if training data
    
    # Assume first class has maximum amount
    max_sam = data["Label"].value_counts().iloc[0]
    # For each label
    for i in data["Label"].value_counts().keys():
      tmp_virus = data[data["Label"]==i]
      tmp_virus = tmp_virus.reset_index(drop=True)
      # Upsample by j times
      for j in range(0, max_sam//len(tmp_virus)):
        # For data point
        for k in range(len(tmp_virus)):
          # Select location of the sub-sequence
          if(len(tmp_virus["Sequences"][k])-length == 0):
            rand_int = 0
          else:
            rand_int = np.random.randint(len(tmp_virus["Sequences"][k])-length)
          selected_sequence = tmp_virus["Sequences"][k][rand_int:rand_int+length]
          datangram.append(createGram(selected_sequence, gram))
          label.append(i)
  else:
    # For data point
    for k in range(len(data)):
      if (len(data["Sequences"][k])-length == 0) :
        rand_int = 0
      else:
        rand_int = np.random.randint(len(data["Sequences"][k])-length)
      selected_sequence = data["Sequences"][k][rand_int:rand_int+length]
      datangram.append(createGram(selected_sequence, gram))
      label.append(data["Label"][k])

  return np.array(datangram), label

In [0]:
X_train, y_train = createGramDataset(df_train, 5, 20000, True)
X_valid, y_valid = createGramDataset(df_valid, 5, 20000, False)
X_test, y_test = createGramDataset(df_test, 5, 20000, False)

In [0]:
def buildLabel(y_str_label,key_virus_label):
  nplabel = []
  for lab in y_str_label:
    for i in np.arange(len(key_virus_label)):
      if(lab == key_virus_label[i]):
        nplabel.append(i+1)
        break
  return np.array(nplabel)

In [0]:
from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score

def eval(y_pred,y_test):
  acc = accuracy_score(y_test, y_pred)
  f1 = f1_score(y_test, y_pred, average='macro') 
  prec = precision_score(y_test, y_pred, average='macro') 
  recall = recall_score(y_test, y_pred, average='macro') 
  return acc, f1, prec, recall

In [0]:
y_train = buildLabel(y_train,key_virus)
y_valid = buildLabel(y_valid,key_virus)
y_test = buildLabel(y_test,key_virus)
# Shuffle the training data so that classes will not appear together.
X_train, y_train = shuffle(X_train, y_train, random_state=seed)

## Visualization with t-SNE

Since our training features are very high dimensional (how many dimensions?), it is hard to make sense of what our model is doing. It is also hard to get a sense of which classes are similar to each other, especially for our data.

One way is to try to visualize the data using dimensionality reduction techniques. The goal is to map a high dimensional feature vector into 2 or 3 dimensions so that we can plot and see what is going on.

[t-SNE](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html) is a visualization technique that maps high-dimensional data to a low-dimensional embedding. The goal is for points that are close together in the high-dimensional space to remain close together in the low-dimensional space.
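
As a minimal sketch of this idea, the snippet below embeds synthetic data rather than the virus features. The three Gaussian blobs, the names `toy_X`, `toy_y`, and `toy_embedded`, and the perplexity value are all assumptions made for illustration.

```python
import numpy as np
from sklearn.manifold import TSNE

# Toy data (an assumption for illustration): three well-separated blobs in 50-D
rng = np.random.default_rng(0)
centers = rng.normal(scale=10.0, size=(3, 50))
toy_X = np.vstack([c + rng.normal(size=(20, 50)) for c in centers])
toy_y = np.repeat([0, 1, 2], 20)

# Map the 50-D points to 2-D; perplexity must be smaller than the number of samples
toy_embedded = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(toy_X)
print(toy_embedded.shape)  # (60, 2)
```

Each row of `toy_embedded` is the 2-D position of the corresponding row of `toy_X`, so it can be passed to `plt.scatter` and colored by `toy_y`.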

Use t-SNE to generate `X_embedded` with two components, 1000 iterations, and `random_state = seed`. Then plot the low-dimensional data using the code provided.

In [0]:
from sklearn.manifold import TSNE

## TODO#1 ##
# Code to do TSNE here. One line.


In [0]:
# Draw the data
colors = ['r','g','b','c','y','k']
for i in range(1,7):
  label_i = (y_train == i) 
  plt.scatter(X_embedded[label_i,0],X_embedded[label_i,1],c = colors[i-1],label=key_virus[i-1])
plt.legend()
plt.show()

According to the plot, which class should be the hardest to classify?

**Ans:**

One of the main things to note about t-SNE is that the visualization changes according to the initialization. Try a different random seed and observe the difference in the embedding plot.

In [0]:
## TODO#2 ##


The embedding depends not only on the initialization. Since t-SNE is a gradient-based optimization, we need to specify the number of iterations as well.

Visualize the embeddings (using the provided random seed) for iteration = \[250, 260, 270, ..., 330, 340, 350\].

In [0]:
## TODO#3 ##


Unlike the hyperparameter tuning we did previously, picking the number of iterations and the initialization needs to be done manually (that is, by looking at the visualizations). This is the annoying part about t-SNE. However, the same conclusions should usually emerge for most configurations.

### 3 dimensional t-SNE

We can also visualize t-SNE using 3 dimensions. However, to plot it on a screen, we need to specify the view angle when we plot using matplotlib. Study the following code which shows how to do so.

In [0]:
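from mpl_toolkits.mplot3d import Axes3D

# Note: scikit-learn 1.5 renamed n_iter to max_iter; on newer versions pass max_iter=15000 instead.
X_embedded_2 = TSNE(n_components=3,n_iter=15000,random_state = seed).fit_transform(X_train)
colors = ['r','g','b','c','y','k']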

In [0]:
# This code views the same 3-D embedding from two different azimuth angles.
sns.set_style("whitegrid", {'axes.grid' : False})

for ii in (120, 300):
  fig = plt.figure(figsize=(8,8))
  # add_subplot(projection='3d') attaches the 3-D axes to the figure;
  # Axes3D(fig) no longer does so automatically in newer matplotlib versions.
  ax = fig.add_subplot(projection='3d')

  print('print ', ii)

  ax.view_init(elev=10., azim=ii)
  for i in range(1,7):
    label_i = (y_train == i)
    ax.scatter(X_embedded_2[label_i,0],X_embedded_2[label_i,1],X_embedded_2[label_i,2],c = colors[i-1])
  plt.show()
  print('-' * 95)

[Tensorboard projector](https://projector.tensorflow.org/) is a service provided by Google that lets you interactively explore t-SNE and PCA visualizations of your data. You can try it by downloading the files `virusngrams.tsv` and `viruslabels.tsv` and uploading them using the Load data button.

In [0]:
np.savetxt('/content/gdrive/My Drive/virusngrams.tsv',X_train,delimiter='\t')
np.savetxt('/content/gdrive/My Drive/viruslabels.tsv',y_train,delimiter='\t',fmt='%d')

## Evaluation

In the next sections we will look into how we can get more insight from our evaluation than just accuracy scores via two tools:

1. confusion matrix
2. RoC curve

### Confusion matrix

In this section, we will look at confusion matrices. A confusion matrix describes how the instances of each class are classified. The diagonal entries count correct classifications, while the off-diagonal entries count misclassifications.
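
To make the definition concrete, here is a minimal sketch that builds a confusion matrix by hand. The helper `toy_confusion_matrix` and the toy labels are hypothetical, and the labels are 0-based here, unlike the 1-based class labels used in this lab.

```python
import numpy as np

def toy_confusion_matrix(y_true, y_pred, n_classes):
    # Rows index the true class, columns index the predicted class
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

toy_true = [0, 0, 1, 1, 2, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0, 2]
cm = toy_confusion_matrix(toy_true, toy_pred, 3)
print(cm)
# [[1 1 0]
#  [0 2 0]
#  [1 0 2]]
print(np.trace(cm), "of", cm.sum(), "correct")  # 5 of 7 correct
```

The entry in row 2, column 0 is 1 because one instance of true class 2 was predicted as class 0.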

The code below creates a linear SVM classifier (just like in the previous lab).

In [0]:
clf_Linear = SVC(kernel='linear')
clf_Linear.fit(X_train, y_train)
y_pred_linear = clf_Linear.predict(X_test)

acc, f1, precision, recall = eval(y_pred_linear,y_test)

Use the function [`confusion_matrix()`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html) to compute the classification results for each category. Each row corresponds to a true class, while each column corresponds to a predicted label.

In [0]:
## TODO#4 ##


<details>
    <summary>SOLUTION HERE!</summary>
      <pre>
        <code>
confuse_test_linear = confusion_matrix(y_test, y_pred_linear)
        </code>
      </pre>
</details>

What does the number in row 2, column 1 (0-indexed) signify?

**Ans:**

From the confusion matrix, which class is the hardest to classify? Does it agree with your observation from the t-SNE visualization?

**Ans:**

We can also normalize the confusion matrix so that each row sums to one. This turns each row into a conditional probability distribution: given the true class, the probability of predicting each class.

In [0]:
## TODO#5 ##


<details>
    <summary>SOLUTION HERE!</summary>
      <pre>
        <code>
confuse_test_linear_norm = confuse_test_linear/confuse_test_linear.sum(axis = 1, keepdims = True)
        </code>
      </pre>
</details>
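
Row normalization relies on NumPy broadcasting, so the shape of the row sums matters. The sketch below uses a made-up 2x2 matrix (`toy_cm` is an assumption) to show the difference between dividing by a column vector of row sums and dividing by a flat vector.

```python
import numpy as np

toy_cm = np.array([[8, 2],
                   [1, 4]])

# keepdims=True keeps the row sums as a column vector of shape (2, 1),
# so each row is divided by its own total
row_norm = toy_cm / toy_cm.sum(axis=1, keepdims=True)
print(row_norm.sum(axis=1))  # every row sums to 1

# Without keepdims the sums have shape (2,) and are broadcast across columns,
# so column j is divided by the total of row j
col_broadcast = toy_cm / toy_cm.sum(axis=1)
print(col_broadcast.sum(axis=1))  # rows no longer sum to 1 in general
```

Checking that every row of the normalized matrix sums to one is a quick way to confirm the normalization did what was intended.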

You can visualize the confusion matrices using sns.heatmap() via a dataframe as shown below:

In [0]:
df_linear = pd.DataFrame(data=confuse_test_linear, index=key_virus, columns = key_virus)
ax = sns.heatmap(df_linear, annot=True, fmt="g")
plt.show()

df_linear_norm = pd.DataFrame(data=confuse_test_linear_norm, index=key_virus, columns = key_virus)
ax = sns.heatmap(df_linear_norm, annot=True, fmt="g")
plt.show()

Between the normalized and the unnormalized confusion matrices, which one do you prefer? Why?

**Ans:**

Compare the confusion matrix between a linear kernel and the RBF kernel (gamma = 10**-5).

In [0]:
## TODO#6 ##


If all types of errors are of equal importance, which model is better?

**Ans:**

Are there any situations where the linear model would be the preferred model? Come up with a scenario.

**Ans:**

In general, most machine learning models treat every kind of error equally. However, in real usage scenarios, some kinds of errors might be more serious than others. This can have a large effect on the choice of classifier. To introduce different weights for different kinds of errors, we usually need specialized code or loss functions, which is beyond the scope of this lab.
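
Training with weighted errors is out of scope, but an already trained model can still be scored under unequal costs. The sketch below is one simple way to do that, assuming a hypothetical cost matrix `toy_cost` whose entry (i, j) is the cost of predicting class j when the true class is i. All numbers are made up.

```python
import numpy as np

# Toy confusion matrix (rows: true class, columns: predicted class)
toy_cm = np.array([[50, 5],
                   [10, 35]])

# Hypothetical costs: missing a true class 1 costs 5, a false alarm costs 1,
# and correct predictions cost nothing
toy_cost = np.array([[0, 1],
                     [5, 0]])

# Total cost is the element-wise product summed over all cells
total_cost = (toy_cm * toy_cost).sum()
print(total_cost)  # 5*1 + 10*5 = 55
```

Two models with the same accuracy can have very different total costs, which is why the choice of classifier can change once errors are weighted.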

### RoC curve for binary classification

Given a binary classifier with score function $h(x)$, we can perform classification by comparing $h(x)$ against a threshold $t$, for example predicting the positive class when $h(x) > t$.

But where does this $t$ come from? One way to select $t$ is to use the RoC curve, which plots the TPR against the FPR as $t$ varies. In this section we will create an RoC curve for the binary classification of whether a virus is HKU or not.
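
The effect of the threshold can be sketched with a few made-up scores. The helper `rates_at_threshold`, the toy arrays, and the convention of predicting positive when the score exceeds $t$ are assumptions for illustration.

```python
import numpy as np

def rates_at_threshold(scores, labels, t):
    # Predict positive when the score h(x) exceeds the threshold t
    pred = scores > t
    pos = labels == 1
    neg = ~pos
    tpr = (pred & pos).sum() / pos.sum()  # fraction of positives caught
    fpr = (pred & neg).sum() / neg.sum()  # fraction of negatives flagged
    return fpr, tpr

toy_scores = np.array([-2.0, -1.0, -0.5, 0.3, 0.8, 1.5])
toy_labels = np.array([0, 0, 1, 0, 1, 1])

for t in (-1.5, 0.0, 1.0):
    fpr, tpr = rates_at_threshold(toy_scores, toy_labels, t)
    print(f"t={t:+.1f}  FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Raising $t$ lowers both the FPR and the TPR. Each threshold gives one point of the RoC curve, and sweeping $t$ traces out the whole curve.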

In [0]:
#create the new labels
y_test_new = np.zeros(len(y_test),dtype ='int') 
y_train_new = np.zeros(len(y_train),dtype ='int')
y_valid_new = np.zeros(len(y_valid),dtype ='int')

y_test_new[y_test == 3] = 1
y_train_new[y_train == 3] = 1
y_valid_new[y_valid == 3] = 1

key_virus_new = ['others','hku']

Train a simple linear SVC. Then use its `decision_function` method to get the value of the classifier $h(x)$ on the test set, and store it in a variable named `pred_func_linear`.

In [0]:
## TODO#7 ##


What value should $h(x)$ be for the HKU class?

**Ans:**

Use the function [roc_curve()](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html) to get the points on the RoC curve. The function will compare the predicted values against multiple threshold values and evaluate the FPR and TPR. You should get 3 arrays.

1. FPR - the false positive rate at each threshold
2. TPR - the true positive rate at each threshold
3. THR - the threshold used

In [0]:
## TODO#8 ##


<details>
    <summary>SOLUTION HERE!</summary>
      <pre>
        <code>
fpr_linear, tpr_linear, thresholds_linear = roc_curve(y_test_new,pred_func_linear)
plt.plot(fpr_linear,tpr_linear,label="linear")
plt.legend()
plt.show()        
        </code>
      </pre>
</details>
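
Note that `roc_curve()` needs continuous scores such as the output of `decision_function`, not hard class predictions. The sketch below calls it on made-up scores and labels (`toy_scores` and `toy_labels` are assumptions) to show the three returned arrays side by side.

```python
import numpy as np
from sklearn.metrics import roc_curve

toy_labels = np.array([0, 0, 1, 0, 1, 1])
toy_scores = np.array([-2.0, -1.0, -0.5, 0.3, 0.8, 1.5])

# One (FPR, TPR) pair per threshold; thresholds are returned in decreasing order
toy_fpr, toy_tpr, toy_thr = roc_curve(toy_labels, toy_scores)
for f, t, th in zip(toy_fpr, toy_tpr, toy_thr):
    print(f"threshold={th:.2f}  FPR={f:.2f}  TPR={t:.2f}")
```

The curve always starts at (0, 0), where nothing is predicted positive, and ends at (1, 1), where everything is.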

#### Equal Error Rate (EER)

The equal error rate is the point on the RoC where the two kinds of error, the FPR and the miss rate (1-TPR), are equal.
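
Since an RoC from real data is a finite set of points, the two error rates are rarely exactly equal. One common approximation, sketched below, takes the point where they are closest. The helper `approx_eer` and the toy arrays are assumptions for illustration.

```python
import numpy as np

def approx_eer(fpr, tpr, thresholds):
    fpr = np.asarray(fpr)
    # Miss rate (false negative rate) is 1 - TPR
    fnr = 1.0 - np.asarray(tpr)
    # Pick the RoC point where the two error rates are closest
    idx = int(np.argmin(np.abs(fpr - fnr)))
    # Report the average of the two rates at that point, and its threshold
    return (fpr[idx] + fnr[idx]) / 2.0, thresholds[idx]

toy_fpr = np.array([0.0, 0.0, 0.2, 0.4, 1.0])
toy_tpr = np.array([0.0, 0.5, 0.8, 0.9, 1.0])
toy_thr = np.array([2.0, 1.0, 0.5, 0.0, -1.0])

eer, thr = approx_eer(toy_fpr, toy_tpr, toy_thr)
print(eer, thr)
```

On these toy values the FPR and the miss rate are both about 0.2 at threshold 0.5. A finer estimate could interpolate between neighboring points.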

What is the EER of this RoC? At what threshold?

**Ans:**

If you want at least 95% recall, which threshold would you pick?

**Ans:**

Given that a false alarm is twice as costly as a miss, which threshold would you pick?

**Ans:**
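
Both kinds of requirement can be turned into a search over the RoC points. The sketch below uses made-up arrays and assumes equal class priors, so that the weighted error can be written directly in terms of the FPR and the miss rate. The variable names and cost weights are hypothetical.

```python
import numpy as np

toy_fpr = np.array([0.0, 0.0, 0.2, 0.4, 1.0])
toy_tpr = np.array([0.0, 0.5, 0.8, 0.96, 1.0])
toy_thr = np.array([2.0, 1.0, 0.5, 0.0, -1.0])

# (a) Highest threshold whose recall (TPR) reaches at least 95%.
# Thresholds are in decreasing order, so the first hit is the highest one.
idx_recall = int(np.argmax(toy_tpr >= 0.95))
print(toy_thr[idx_recall])  # 0.0

# (b) Threshold minimizing a weighted error where a false alarm costs
# twice as much as a miss (equal class priors assumed)
weighted_error = 2.0 * toy_fpr + 1.0 * (1.0 - toy_tpr)
idx_cost = int(np.argmin(weighted_error))
print(toy_thr[idx_cost])  # 1.0
```

With imbalanced classes, the two terms would also need to be weighted by the number of negatives and positives respectively.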

In [0]:
## TODO#9 ##


#### Comparing RoCs

The main use of the RoC is to compare the performance of different models. As we have seen previously, the best model can depend on how we weight the different kinds of errors.

Create three different SVC models of your choosing (they can use different kernels or just different hyperparameters). Plot the RoCs of the three models on the same plot.
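
Besides inspecting the plot, RoCs are often summarized by the area under the curve (AUC). The sketch below computes it with the trapezoidal rule on two made-up curves. The helper `trapezoid_auc` and the curve values are assumptions for illustration. `sklearn.metrics.auc`, imported at the top of this lab, performs the same kind of computation.

```python
import numpy as np

def trapezoid_auc(fpr, tpr):
    # Area under the RoC via the trapezoidal rule; fpr must be non-decreasing
    fpr, tpr = np.asarray(fpr), np.asarray(tpr)
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

# Two made-up RoC curves given as (fpr, tpr) pairs
roc_a = (np.array([0.0, 0.1, 0.5, 1.0]), np.array([0.0, 0.6, 0.9, 1.0]))
roc_b = (np.array([0.0, 0.3, 0.6, 1.0]), np.array([0.0, 0.5, 0.8, 1.0]))

auc_a = trapezoid_auc(*roc_a)
auc_b = trapezoid_auc(*roc_b)
print(auc_a, auc_b)
```

A larger AUC means better ranking on average, but when two curves cross, the better model still depends on the operating point, that is, on how the different errors are weighted.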

In [0]:
## TODO#10 ##


Which model would you use? Why? Be sure to state your assumptions.

**Ans:**