12. Classification


Topic: Classification

Course: GMLC

Date: 26 February 2019 

Professor: Not specified


Resources


Key Points


  • Thresholding

    • Decision threshold

    • Used to map a logistic regression output value to a binary category

    • Is problem-dependent (see the sketch below)
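
A minimal sketch of thresholding, assuming NumPy; the probabilities, threshold value, and variable names are invented for illustration, not taken from the course:

```python
import numpy as np

# Hypothetical sigmoid outputs from a logistic regression model.
probabilities = np.array([0.05, 0.40, 0.62, 0.91])

# The decision threshold is problem-dependent; 0.5 is only a common default.
threshold = 0.5

# Map each probability to a binary category.
predictions = (probabilities >= threshold).astype(int)
print(predictions)  # [0 0 1 1]
```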

  • Binary classification

    • Outputs 2 mutually exclusive classes

    • For example, either spam or not spam

  • Classification model

    • Machine learning model which distinguishes between 2 or more classes (discrete values), as opposed to a regression model, which outputs floating-point values

  • Confusion matrix

    • An N×N table that summarises how successful a classification model is (true positives, false positives, false negatives, true negatives); see the sketch below

    • Positive class

      • In binary classification, the class we are looking for, for example spam or tumor

    • Negative class

      • The other possible class in binary classification, for example not spam or not tumor

    • True positive

      • Reality: there is a tumor

      • Prediction: there is a tumor

    • False positive

      • Reality: there is no tumor

      • Prediction: there is a tumor

    • False negative

      • Reality: there is a tumor

      • Prediction: there is no tumor

    • True negative

      • Reality: there is no tumor

      • Prediction: there is no tumor
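
A small sketch of deriving the four confusion matrix counts with NumPy; the labels and predictions below are invented for illustration:

```python
import numpy as np

# Hypothetical ground truth (1 = tumor, 0 = no tumor) and model predictions.
labels      = np.array([1, 0, 1, 1, 0, 0, 1, 0])
predictions = np.array([1, 0, 0, 1, 1, 0, 1, 0])

tp = np.sum((predictions == 1) & (labels == 1))  # predicted tumor, tumor present
fp = np.sum((predictions == 1) & (labels == 0))  # predicted tumor, no tumor
fn = np.sum((predictions == 0) & (labels == 1))  # predicted no tumor, tumor present
tn = np.sum((predictions == 0) & (labels == 0))  # predicted no tumor, no tumor

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")  # TP=3 FP=1 FN=1 TN=3
```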

  • Accuracy

    • Metric for evaluating classification models

    • Fraction of predictions our model got right

    • Accuracy = (Total correct predictions) / (Total number of predictions)

    • Binary classification

      • Accuracy = (TP + TN) / (TP + TN + FP + FN)

    • Not a good metric for a class-imbalanced data set (see the sketch below)

      • A class-imbalanced data set is one where the two classes have significantly different frequencies (for example 0.00001 vs. 0.99999)
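
A sketch of why accuracy misleads on a class-imbalanced set: a hypothetical model that always predicts the negative class still scores 99.9% on a set with 1 positive in 1000 examples (the counts are made up):

```python
# Hypothetical always-negative model on a set with 1 positive per 1000 examples.
tp, fp, fn, tn = 0, 0, 1, 999

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 0.999 -- looks great, yet the model never finds a positive
```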

  • Precision

    • TP / (TP + FP)

    • The fraction of positive predictions that were actually correct

  • Recall

    • TP / (TP + FN)

    • The fraction of actual positives that were correctly identified

  • You must examine both Precision and Recall; they are often in tension, so improving one typically reduces the other (see the sketch below)
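
A sketch computing both metrics from the hypothetical counts in the confusion matrix example above, to show they answer different questions:

```python
# Counts from the confusion matrix sketch above (hypothetical).
tp, fp, fn = 3, 1, 1

precision = tp / (tp + fp)  # of all positive predictions, how many were right?
recall    = tp / (tp + fn)  # of all actual positives, how many were found?

print(f"precision={precision:.2f} recall={recall:.2f}")  # both 0.75 here
```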

  • ROC Curve

    • Receiver operating characteristic curve

    • Graph showing the performance of a model at all classification thresholds

    • True positive rate

      • TPR = TP / (TP + FN)

    • False positive rate

      • FPR = FP / (FP + TN)

    • Plots TPR vs. FPR at different classification thresholds

    • AUC

      • Area under the ROC curve

      • Measures the entire two-dimensional area underneath the ROC curve, from (0,0) to (1,1)

      • Provides an aggregate measure across all possible classification thresholds

      • Scale-invariant

        • Measures how well the predictions are ranked, rather than their absolute values

      • Classification-threshold-invariant

        • Measures the quality of the model's predictions regardless of how the classification threshold is set (see the sketch below)
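
A sketch of building an ROC curve by hand, assuming NumPy and invented scores: sweep the classification threshold, compute TPR and FPR at each setting, then approximate AUC with the trapezoidal rule.

```python
import numpy as np

# Hypothetical labels and model scores.
labels = np.array([0, 0, 1, 1, 0, 1, 1, 0])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.7])

# Sweep thresholds from high to low; each one yields an (FPR, TPR) point.
thresholds = np.linspace(1.0, 0.0, 101)
tpr = np.array([np.mean(scores[labels == 1] >= t) for t in thresholds])  # TP / (TP + FN)
fpr = np.array([np.mean(scores[labels == 0] >= t) for t in thresholds])  # FP / (FP + TN)

# AUC: trapezoidal area under the (FPR, TPR) curve.
auc = np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2)
print(f"AUC ~= {auc:.2f}")  # ~0.81 for these made-up scores
```
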
  • Prediction bias

    • Measures how far apart the average of predictions and the average of observations are (see the sketch below)

    • Bucketing

      • Converting a feature into multiple binary features called buckets or bins (covered in the Representation topic)

    • Calibration layer

      • Post-prediction adjustment, accounting for prediction bias
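
A sketch of measuring prediction bias, with invented numbers: compare the mean of the predicted probabilities against the mean of the observed labels.

```python
import numpy as np

# Hypothetical predicted probabilities and observed outcomes.
predictions = np.array([0.9, 0.8, 0.2, 0.1, 0.7])
labels      = np.array([1,   1,   0,   0,   0])

# Prediction bias = average of predictions - average of observations.
# A well-calibrated model should have a bias close to zero.
bias = predictions.mean() - labels.mean()
print(f"prediction bias = {bias:+.2f}")  # +0.14
```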

Check your understanding


  • Understand confusion matrix

  • Know differences between Recall & Precision

  • Describe ROC & AUC

Summary of Notes


  • The accuracy and precision of a logistic regression model are evaluated using the prediction counts in the confusion matrix (TP, TN, FP, FN)

  • The ROC curve, summarised by AUC, measures model performance independently of how the classification threshold is set