12. Classification


Topic: Classification

Course: GMLC

Date: 26 February 2019 

Professor: Not specified


Resources


Key Points


  • Thresholding

    • Decision threshold

    • Used to map a logistic regression output value to a binary category

    • Is problem-dependent (see the sketch below)
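
A minimal sketch of thresholding, assuming NumPy; the probabilities, threshold value, and variable names are invented for illustration, not taken from the course:

```python
import numpy as np

# Hypothetical sigmoid outputs from a logistic regression model.
probabilities = np.array([0.05, 0.40, 0.62, 0.91])

# The decision threshold is problem-dependent; 0.5 is only a common default.
threshold = 0.5

# Map each probability to a binary category.
predictions = (probabilities >= threshold).astype(int)
print(predictions)  # [0 0 1 1]
```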

  • Binary classification

    • Outputs 2 mutually exclusive classes

    • For example, either spam or not spam

  • Classification model

    • Machine learning model which distinguishes between 2 or more classes (discrete values), as opposed to a regression model, which outputs floating-point values

  • Confusion matrix

    • An N×N table that summarises how successful a classification model is (true positives, false positives, false negatives, true negatives); see the sketch below

    • Positive class

      • In binary classification, the class we are looking for, for example spam or tumor

    • Negative class

      • The other possible class in binary classification, for example not spam or not tumor

    • True positive

      • Reality: there is a tumor

      • Prediction: there is a tumor

    • False positive

      • Reality: there is no tumor

      • Prediction: there is a tumor

    • False negative

      • Reality: there is a tumor

      • Prediction: there is no tumor

    • True negative

      • Reality: there is no tumor

      • Prediction: there is no tumor
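
A small sketch of deriving the four confusion matrix counts with NumPy; the labels and predictions below are invented for illustration:

```python
import numpy as np

# Hypothetical ground truth (1 = tumor, 0 = no tumor) and model predictions.
labels      = np.array([1, 0, 1, 1, 0, 0, 1, 0])
predictions = np.array([1, 0, 0, 1, 1, 0, 1, 0])

tp = np.sum((predictions == 1) & (labels == 1))  # predicted tumor, tumor present
fp = np.sum((predictions == 1) & (labels == 0))  # predicted tumor, no tumor
fn = np.sum((predictions == 0) & (labels == 1))  # predicted no tumor, tumor present
tn = np.sum((predictions == 0) & (labels == 0))  # predicted no tumor, no tumor

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")  # TP=3 FP=1 FN=1 TN=3
```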

  • Accuracy

    • Metric for evaluating classification models

    • Fraction of predictions our model got right

    • Accuracy = (Total correct predictions) / (Total number of predictions)

    • Binary classification

      • Accuracy = (TP + TN) / (TP + TN + FP + FN)

    • Not a good metric for a class-imbalanced data set (see the sketch below)

      • A class-imbalanced data set is one where the two classes have significantly different frequencies (for example 0.00001 vs. 0.99999)
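
A sketch of why accuracy misleads on a class-imbalanced set: a hypothetical model that always predicts the negative class still scores 99.9% on a set with 1 positive in 1000 examples (the counts are made up):

```python
# Hypothetical always-negative model on a set with 1 positive per 1000 examples.
tp, fp, fn, tn = 0, 0, 1, 999

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 0.999 -- looks great, yet the model never finds a positive
```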

  • Precision

    • TP / (TP + FP)

    • The fraction of positive predictions that were actually correct

  • Recall

    • TP / (TP + FN)

    • The fraction of actual positives that were correctly identified

  • You must examine both Precision and Recall; they are often in tension, so improving one typically reduces the other (see the sketch below)
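
A sketch computing both metrics from the hypothetical counts in the confusion matrix example above, to show they answer different questions:

```python
# Counts from the confusion matrix sketch above (hypothetical).
tp, fp, fn = 3, 1, 1

precision = tp / (tp + fp)  # of all positive predictions, how many were right?
recall    = tp / (tp + fn)  # of all actual positives, how many were found?

print(f"precision={precision:.2f} recall={recall:.2f}")  # both 0.75 here
```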

  • ROC Curve

    • Receiver operating characteristic curve

    • Graph showing the performance of a model at all classification thresholds

    • True positive rate

      • TPR = TP / (TP + FN)

    • False positive rate

      • FPR = FP / (FP + TN)

    • Plots TPR vs. FPR at different classification thresholds

    • AUC

      • Area under the ROC curve

      • Measures the entire two-dimensional area underneath the ROC curve, from (0,0) to (1,1)

      • Provides an aggregate measure across all possible classification thresholds

      • Scale-invariant

        • Measures how well the predictions are ranked, rather than their absolute values

      • Classification-threshold-invariant

        • Measures the quality of the model's predictions regardless of how the classification threshold is set (see the sketch below)
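
A sketch of building an ROC curve by hand, assuming NumPy and invented scores: sweep the classification threshold, compute TPR and FPR at each setting, then approximate AUC with the trapezoidal rule.

```python
import numpy as np

# Hypothetical labels and model scores.
labels = np.array([0, 0, 1, 1, 0, 1, 1, 0])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.7])

# Sweep thresholds from high to low; each one yields an (FPR, TPR) point.
thresholds = np.linspace(1.0, 0.0, 101)
tpr = np.array([np.mean(scores[labels == 1] >= t) for t in thresholds])  # TP / (TP + FN)
fpr = np.array([np.mean(scores[labels == 0] >= t) for t in thresholds])  # FP / (FP + TN)

# AUC: trapezoidal area under the (FPR, TPR) curve.
auc = np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2)
print(f"AUC ~= {auc:.2f}")  # ~0.81 for these made-up scores
```
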
  • Prediction bias

    • Measures how far apart the average of predictions and the average of observations are (see the sketch below)

    • Bucketing

      • Converting a feature into multiple binary features called buckets or bins (covered in the Representation topic)

    • Calibration layer

      • Post-prediction adjustment, accounting for prediction bias
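
A sketch of measuring prediction bias, with invented numbers: compare the mean of the predicted probabilities against the mean of the observed labels.

```python
import numpy as np

# Hypothetical predicted probabilities and observed outcomes.
predictions = np.array([0.9, 0.8, 0.2, 0.1, 0.7])
labels      = np.array([1,   1,   0,   0,   0])

# Prediction bias = average of predictions - average of observations.
# A well-calibrated model should have a bias close to zero.
bias = predictions.mean() - labels.mean()
print(f"prediction bias = {bias:+.2f}")  # +0.14
```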

Check your understanding


  • Understand confusion matrix

  • Know differences between Recall & Precision

  • Describe ROC & AUC

Summary of Notes


  • The accuracy and precision of a logistic regression model are evaluated using the prediction counts in the confusion matrix (TP, TN, FP, FN)

  • The ROC curve, summarised by AUC, measures model performance independently of how the classification threshold is set