12. Classification
Topic: Classification
Course: GMLC
Date: 26 February 2019
Professor: Not specified
Resources
- https://developers.google.com/machine-learning/crash-course/classification/video-lecture
- https://developers.google.com/machine-learning/crash-course/classification/thresholding
- https://developers.google.com/machine-learning/crash-course/classification/accuracy
- https://developers.google.com/machine-learning/crash-course/classification/precision-and-recall
- https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc
- https://developers.google.com/machine-learning/crash-course/classification/prediction-bias
- https://developers.google.com/machine-learning/crash-course/classification/programming-exercise
Key Points
- Thresholding
  - Decision threshold
    - Used to map a logistic regression value to a binary category
    - Is problem-dependent
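The threshold mapping above can be sketched in a few lines of Python (the function name and the 0.5 default are illustrative, not from the course — the notes stress the threshold is problem-dependent):

```python
# Minimal sketch: mapping a logistic regression output (a probability)
# to a binary class with a decision threshold. The 0.5 default is only
# illustrative; in practice the threshold is problem-dependent.

def classify(probability: float, threshold: float = 0.5) -> int:
    """Return 1 (positive class) if probability >= threshold, else 0."""
    return 1 if probability >= threshold else 0

print(classify(0.8))                  # above the default threshold -> 1
print(classify(0.8, threshold=0.9))   # stricter threshold -> 0
```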
- Binary classification
  - Outputs 2 mutually exclusive classes
  - Either spam or not spam, for example
- Classification model
  - A machine learning model which distinguishes between 2 or more classes (discrete values), as opposed to a regression model, which outputs floating-point values
- Confusion matrix
  - An N×N table that summarises how successful a classification model is (true positives, false positives, false negatives, true negatives)
- Positive class
  - In binary classification, the class we are trying to detect (even though the model predicts both), for example spam or tumor
- Negative class
  - The other class in binary classification, for example not spam or not tumor
- True positive
  - Reality: there is a tumor
  - Prediction: there is a tumor
- False positive
  - Reality: there is no tumor
  - Prediction: there is a tumor
- False negative
  - Reality: there is a tumor
  - Prediction: there is no tumor
- True negative
  - Reality: there is no tumor
  - Prediction: there is no tumor
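The four cases above can be tallied directly from labels and predictions — a minimal sketch (1 = tumor, 0 = no tumor; the helper name is made up for illustration):

```python
# Hedged sketch: counting the four confusion-matrix cells for a binary
# classifier from parallel lists of labels and predictions.

def confusion_counts(labels, predictions):
    tp = fp = fn = tn = 0
    for y, p in zip(labels, predictions):
        if y == 1 and p == 1:
            tp += 1   # true positive: tumor predicted, tumor present
        elif y == 0 and p == 1:
            fp += 1   # false positive: tumor predicted, none present
        elif y == 1 and p == 0:
            fn += 1   # false negative: tumor missed
        else:
            tn += 1   # true negative: correctly cleared
    return tp, fp, fn, tn

labels      = [1, 0, 1, 0, 0, 1]
predictions = [1, 1, 0, 0, 0, 1]
print(confusion_counts(labels, predictions))  # (2, 1, 1, 2)
```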
- Accuracy
  - Metric for evaluating classification models
  - The fraction of predictions our model got right
  - Accuracy = (total correct predictions) / (total number of predictions)
  - Binary classification
    - Accuracy = (TP + TN) / (TP + TN + FP + FN)
  - Not a good metric for a class-imbalanced data set
    - A class-imbalanced data set is a binary classification problem where the two classes have significantly different frequencies (e.g. 0.00001 vs. 0.99999)
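The binary accuracy formula, and why it misleads on an imbalanced set, can be sketched as follows (the counts are made up: a model that always predicts the majority class still scores near-perfect accuracy):

```python
# Sketch of Accuracy = (TP + TN) / (TP + TN + FP + FN), and the
# class-imbalance pitfall the notes warn about.

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

# Imbalanced example: 1 positive among 1000 examples; a model that
# always predicts "negative" gives TP=0, TN=999, FP=0, FN=1.
print(accuracy(tp=0, tn=999, fp=0, fn=1))  # 0.999, yet it finds no positives
```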
- Precision
  - TP / (TP + FP)
  - The fraction of positive identifications that were actually correct
- Recall
  - TP / (TP + FN)
  - The fraction of actual positives that were correctly identified
- You must examine both precision and recall
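Both formulas above reduce to one-liners; the counts below are made up for illustration (note that in the imbalanced example earlier, the always-negative model would score recall = 0 despite 0.999 accuracy, which is why both metrics matter):

```python
# Sketch of precision and recall from confusion-matrix counts.

def precision(tp, fp):
    # Of everything flagged positive, how much was truly positive?
    return tp / (tp + fp)

def recall(tp, fn):
    # Of all actual positives, how many were found?
    return tp / (tp + fn)

print(precision(tp=6, fp=2))  # 0.75
print(recall(tp=6, fn=4))     # 0.6
```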
- ROC curve
  - Receiver operating characteristic curve
  - Graph showing the performance of a classification model at all classification thresholds
  - True positive rate
    - TPR = TP / (TP + FN)
  - False positive rate
    - FPR = FP / (FP + TN)
  - Plots TPR vs. FPR at different classification thresholds
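One (FPR, TPR) point of the curve per threshold can be computed directly from the definitions above; the labels and scores here are made up for illustration:

```python
# Sketch: computing ROC points by sweeping the decision threshold
# over a model's scores.

def roc_point(labels, scores, threshold):
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    return fp / (fp + tn), tp / (tp + fn)   # (FPR, TPR)

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
for t in (0.0, 0.3, 0.5, 0.9):
    print(t, roc_point(labels, scores, t))
# Threshold 0.0 classifies everything positive -> (FPR, TPR) = (1.0, 1.0);
# threshold 0.9 classifies everything negative -> (0.0, 0.0).
```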
- AUC
  - Area under the ROC curve
  - Measures the entire 2-D area underneath the entire ROC curve
  - Provides an aggregate measure of performance across all possible classification thresholds
  - Scale-invariant
    - Measures how well the predictions are ranked, rather than their absolute values
  - Classification-threshold-invariant
    - Measures the quality of the model's predictions irrespective of which classification threshold is chosen
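A sketch of AUC via its ranking interpretation — the probability that a randomly chosen positive example is scored above a randomly chosen negative one. Because only the ordering of scores matters, the result illustrates the scale-invariance noted above (data made up):

```python
# AUC as a pairwise ranking probability: count, over all
# (positive, negative) pairs, how often the positive scores higher
# (ties count as half).

def auc(labels, scores):
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc(labels, scores))                     # 0.75
print(auc(labels, [s * 100 for s in scores]))  # 0.75 again: scale-invariant
```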
- Prediction bias
  - Measures how far apart the average of predictions and the average of observations are
    - Prediction bias = (average of predictions) − (average of labels in the data set)
  - Bucketing
    - Converting a feature into multiple binary features called buckets or bins (covered in the Representation topic)
  - Calibration layer
    - A post-prediction adjustment that accounts for prediction bias
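The prediction-bias definition above is a one-line computation; the numbers here are made up (a well-calibrated model has bias near zero):

```python
# Sketch: prediction bias = mean of predicted probabilities minus
# mean of observed labels.

def prediction_bias(predictions, labels):
    return sum(predictions) / len(predictions) - sum(labels) / len(labels)

predictions = [0.9, 0.8, 0.7, 0.6]   # mean 0.75
labels      = [1, 1, 0, 0]           # mean 0.5
print(round(prediction_bias(predictions, labels), 6))  # 0.25: over-predicting
```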
- Understand the confusion matrix
- Know the differences between recall & precision
- Describe ROC & AUC
- The accuracy and precision of a logistic regression model are evaluated from the confusion matrix built from its predictions
- The ROC curve gives an aggregate view of model performance, independent of how the classification threshold is tuned