## Diagnostic Metrics

### week 10

### Classification

### Confusion matrix

The most common model diagnostic tool is the confusion matrix, because classification tasks are so prevalent. We discussed the confusion matrix thoroughly in ICE4. Once you have the confusion matrix, you can derive many other metrics from it, such as:

![Screen%20Shot%202021-12-20%20at%207.25.08%20PM.png](attachment:Screen%20Shot%202021-12-20%20at%207.25.08%20PM.png)

scikit-learn implements all of them in the sklearn.metrics module (see here: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics).

Here is another tutorial just in case you are a little bit confused: https://vitalflux.com/accuracy-precision-recall-f1-score-python-example/
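
As a minimal sketch of how these metrics relate to the confusion matrix, the example below uses made-up labels and predictions (`y_true` and `y_pred` are hypothetical, not data from the course) and the corresponding functions in sklearn.metrics.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical true labels and predictions for a binary task (1 = positive).
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

# For binary labels 0/1, ravel() returns the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Each metric below can also be computed by hand from the four cells.
accuracy = accuracy_score(y_true, y_pred)    # (tp + tn) / total
precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

With your own model, you would replace `y_true` with the test labels and `y_pred` with the model's predictions on the test set.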



### ROC curves and AUC

Rather than predicting classes directly, it can be more flexible to predict the probability that an observation belongs to each class. This flexibility also creates a trade-off between false positives and false negatives (or precision vs. recall), because the predicted class depends on the probability threshold you choose. ROC curves and precision-recall curves are two useful tools for us to (a) decide on a threshold for the decision, and (b) evaluate the classifier more comprehensively.

The main functions we are going to use are roc_curve and precision_recall_curve from sklearn.metrics.

This tutorial has a very comprehensive walk-through of how to plot a ROC curve and compute the AUC (area under the curve) with Python: https://machinelearningmastery.com/roc-curves-and-precision-recall-curves-for-classification-in-python/

Note: ROC and AUC are usually computed on the testing dataset to validate the result.
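
Here is a rough sketch of that workflow. The synthetic data from `make_classification`, the logistic regression model, and the "maximize TPR minus FPR" rule for picking a threshold are all assumptions made for illustration. They are not the only reasonable choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary data stands in for a real dataset (assumption).
X, y = make_classification(n_samples=500, n_features=10, random_state=123)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123, stratify=y
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Predicted probability of the positive class on the held-out test set.
y_score = model.predict_proba(X_test)[:, 1]

# Points of the ROC curve and the area under it.
fpr, tpr, roc_thresholds = roc_curve(y_test, y_score)
auc = roc_auc_score(y_test, y_score)

# Points of the precision-recall curve.
precision, recall, pr_thresholds = precision_recall_curve(y_test, y_score)

# One simple heuristic (an assumption): the threshold that maximizes TPR - FPR.
best = (tpr - fpr).argmax()
print(f"AUC = {auc:.3f}")
print(f"threshold maximizing TPR - FPR: {roc_thresholds[best]:.3f}")
```

To draw the curves, you can pass `fpr` and `tpr` (or `recall` and `precision`) to `plt.plot` from matplotlib.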

### K-fold cross validation


In ICE4, we saw how to split a dataset into training and testing datasets as a way to prevent overfitting and evaluate the model performance fairly. However, this random split can still cause problems (e.g., what if we just get lucky and end up with a perfect testing dataset?). In other cases, we may not have the luxury of splitting the dataset into two sufficiently large datasets for both training and testing (say we just have 100 data points). This is where k-fold cross validation comes in.

The idea is actually pretty straightforward.

The training set is split into k smaller sets. The following procedure is followed for each of the k “folds”:

- A model is trained using k-1 of the folds as training data;
- the resulting model is validated on the remaining part of the data (i.e., it is used as a test set to compute a performance measure such as accuracy).

The performance measure reported by k-fold cross-validation is then the average of the values computed in the loop. Common values are k=3, k=5, and k=10, and by far the most popular value used in applied machine learning to evaluate models is k=10. The reason is that empirical studies found k=10 to provide a good trade-off between computational cost and bias in the estimate of model performance. However, it is worth noting that this approach can be computationally expensive, because the model is trained k times. But in the EDM field, we don't often deal with humongous datasets like some other fields (e.g., computer vision), so k-fold CV is a very good tool.

![Screen%20Shot%202021-12-20%20at%207.30.27%20PM.png](attachment:Screen%20Shot%202021-12-20%20at%207.30.27%20PM.png)

In terms of the implementation, it is very well encapsulated in sklearn. All you need are cross_val_score and KFold in sklearn.model_selection. See the section "Cross Validation Using cross_val_score()" in this tutorial: https://www.askpython.com/python/examples/k-fold-cross-validation (The first part of this tutorial is a walk-through of a manual implementation of k-fold. You can safely skip that).
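
A minimal sketch of what this looks like is below. The 100-point synthetic dataset and the logistic regression classifier are placeholders chosen for illustration. Swap in your own `X`, `y`, and model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# A small synthetic dataset (assumption), similar in size to the
# "100 data points" scenario described above.
X, y = make_classification(n_samples=100, n_features=8, random_state=123)

# 10 folds; shuffle so that the folds do not depend on the row order.
cv = KFold(n_splits=10, shuffle=True, random_state=123)

model = LogisticRegression(max_iter=1000)

# Trains and evaluates the model 10 times, once per held-out fold.
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

print("accuracy per fold:", scores.round(2))
print(f"mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```

The mean of `scores` is the cross-validated performance estimate. The standard deviation gives a rough sense of how much the result depends on the particular split.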

If you want to dive deeper into K-fold, you may want to check out these two tutorials:

- scikit-learn documentation on cross validation: https://scikit-learn.org/stable/modules/cross_validation.html
- Machine Learning Mastery: How to configure k-Fold Cross-Validation: https://machinelearningmastery.com/how-to-configure-k-fold-cross-validation/


### Clustering

Algorithms for structure discovery, on the other hand, are evaluated differently. Because they are unsupervised, there is no ground truth to compare the results with. As a result, we need another way to quantify the performance.



### Silhouette coefficient and silhouette plot


One good example is the silhouette coefficient we have discussed in ICE5. It uses the mean intra-cluster distance and mean nearest-cluster distance to quantify how well the algorithm has clustered the data points.

Similar to the silhouette coefficient, a simpler criterion is inertia, or the within-cluster sum of squares, which is the sum of squared distances from each data point to its assigned cluster centroid.
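
To make the two criteria concrete, here is a small sketch that computes both for a single k-means fit. The data come from `make_blobs` and are purely illustrative. The manual inertia calculation is included only to show what `inertia_` measures.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three blobs (assumption, for illustration only).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=123)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=123).fit(X)

# Silhouette coefficient: mean over all samples, between -1 and 1 (higher is better).
sil = silhouette_score(X, kmeans.labels_)

# Inertia as reported by scikit-learn (lower is better for a fixed k).
inertia = kmeans.inertia_

# The same quantity by hand: squared distance of each point to its assigned centroid.
assigned_centroids = kmeans.cluster_centers_[kmeans.labels_]
manual_inertia = ((X - assigned_centroids) ** 2).sum()

print(f"silhouette = {sil:.3f}")
print(f"inertia = {inertia:.1f}, computed by hand = {manual_inertia:.1f}")
```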

### Elbow method

Previously, we plotted multiple silhouette plots to pick the appropriate number of clusters. This can take time and involve a lot of repetition. We can simply automate the process with a loop.

One common implementation of the elbow method is based on the inertia, or the sum of squared distances.

In [1]:
#### BELOW IS JUST EXAMPLE CODE. DON'T RUN. ####

# from sklearn.cluster import KMeans
# import matplotlib.pyplot as plt
# # X is assumed to be your feature matrix
# cluster_range = [1, 2, 3, 4, 5, 6]
# avg_distance = []
# for n in cluster_range:
  # clusterer = KMeans(n_clusters = n, random_state = 123).fit(X)
  # avg_distance.append(clusterer.inertia_)  # within-cluster sum of squares

# plt.plot(cluster_range, avg_distance)
# plt.xlabel("Number of Clusters (k)")
# plt.ylabel("Inertia (within-cluster sum of squares)")
# plt.show()

It is clear that the more clusters we have, the smaller the within-cluster sum of squares will be, because every data point will be closer to its centroid. Therefore, instead of simply minimizing it, we pick the value of k at which the curve bends: before this point the distance falls sharply, and after it adding more clusters brings only small improvements. This is the elbow point (see the figure below). It is worth noting that the elbow method is just an empirical heuristic for decision making. There is really no right or wrong number of clusters to choose.

![Screen%20Shot%202021-12-20%20at%207.33.55%20PM.png](attachment:Screen%20Shot%202021-12-20%20at%207.33.55%20PM.png)

You can also choose to implement this method with the silhouette coefficient. See more here: https://medium.com/analytics-vidhya/how-to-determine-the-optimal-k-for-k-means-708505d204eb

In addition, yellowbrick has a nice implementation of the elbow method. Read more here: https://www.scikit-yb.org/en/latest/api/cluster/elbow.html
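
A sketch of the silhouette-based variant might look like the following. The synthetic data and the range of k values are assumptions. Note that with the silhouette coefficient you look for the highest score rather than an elbow, and k must be at least 2.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data (assumption); replace X with your own feature matrix.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=123)

# The silhouette coefficient is undefined for a single cluster, so start at k = 2.
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=123).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Candidate k: the one with the highest mean silhouette coefficient.
best_k = max(scores, key=scores.get)

for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print("k with the highest silhouette:", best_k)
```

As with the elbow method, treat `best_k` as a suggestion to check against the data and the research question, not as a definitive answer.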



### Scree plot for PCA

Following the same logic, we can evaluate the performance of PCA. We know that we can obtain the variance explained for each dimension. Then we can plot them all on a scree plot to show how much variance each component explains.

Suppose we would like to keep the principal components that cumulatively explain at least 70% of the variance. In the figure below, the cumulative proportion of variance explained surpasses 70% with the first two principal components, so we would consider keeping two principal components. If a higher threshold were used, then additional principal components would have to be retained.

Alternatively, we can also draw the scree plot cumulatively.

![Screen%20Shot%202021-12-20%20at%207.35.37%20PM.png](attachment:Screen%20Shot%202021-12-20%20at%207.35.37%20PM.png)

See this simple tutorial here: https://www.statology.org/scree-plot-python/

Here are more details of scree plots: https://www.sciencedirect.com/topics/mathematics/scree-plot
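
The numbers behind a scree plot can be obtained as in the sketch below. The iris dataset, the standardization step, and the 70% threshold are used only for illustration. They are assumptions, not the data behind the figure above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Example data (assumption): the iris measurements, standardized before PCA.
X = StandardScaler().fit_transform(load_iris().data)

pca = PCA().fit(X)

# Proportion of variance explained by each component (the scree plot values).
explained = pca.explained_variance_ratio_

# Running total, for the cumulative version of the plot.
cumulative = np.cumsum(explained)

# Smallest number of components whose cumulative variance reaches the threshold.
threshold = 0.70
n_keep = int(np.argmax(cumulative >= threshold)) + 1

for i, (e, c) in enumerate(zip(explained, cumulative), start=1):
    print(f"PC{i}: explained={e:.3f} cumulative={c:.3f}")
print(f"components needed for {threshold:.0%} of the variance: {n_keep}")
```

Plotting `explained` (or `cumulative`) against the component number with matplotlib gives the scree plot itself.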

### Regression

Finally, there are diagnostic metrics for regressors. I am not going to expand on them because our models in ACA2 and ACA3 are not really regression models. But if you want to dig deeper, here are a couple of videos you can watch. I recommend everyone watch these videos because, although you won't use these diagnostic metrics at this moment, you will come across them later.

- StatQuest: R-squared, Clearly Explained: https://youtu.be/2AQKmw14mHM
- Vinsloev Academy: MAE and RMSE: https://youtu.be/lHAEPyWNgyY
- Brandon Foltz: Stats 101: Multiple Regression, AIC, AICc, and BIC Basics: https://youtu.be/-BR4WElPIXg
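
For reference, three of the metrics named in these videos can be computed with sklearn.metrics as in this short sketch. The values in `y_true` and `y_pred` are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical observed values and model predictions.
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 3.0, 8.0])

mae = mean_absolute_error(y_true, y_pred)           # mean of |error|
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # square root of mean squared error
r2 = r2_score(y_true, y_pred)                       # 1 - SS_res / SS_tot

print(f"MAE={mae:.3f} RMSE={rmse:.3f} R^2={r2:.3f}")
```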

### ACA2

![ICE7%20Accuracy.png](attachment:ICE7%20Accuracy.png)

### ACA3 

I would like to use a scree plot to evaluate PCA.

![Screen%20Shot%202021-12-20%20at%209.02.22%20PM.png](attachment:Screen%20Shot%202021-12-20%20at%209.02.22%20PM.png)

We can see the variance explained for each dimension.