# Is my datapoint an outlier?

In this notebook, we show how to use the [Mahalanobis distance](https://huggingface.co/metrics/mahalanobis) to decide whether a datapoint belongs to a reference distribution of other datapoints or is an outlier.
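For intuition, the squared Mahalanobis distance of a point `x` from a reference sample with mean `mu` and covariance `S` is `(x - mu)^T S^-1 (x - mu)`. The snippet below is a minimal NumPy sketch of that formula on toy 2-D data. The helper name `squared_mahalanobis` and the random reference sample are assumptions made for illustration; this is not the implementation behind the metric used later in the notebook.

```python
import numpy as np


def squared_mahalanobis(x: np.ndarray, reference: np.ndarray) -> float:
    """Squared Mahalanobis distance from point x to the rows of reference."""
    mean = reference.mean(axis=0)          # per-feature mean of the reference sample
    cov = np.cov(reference, rowvar=False)  # feature-by-feature covariance matrix
    inv_cov = np.linalg.pinv(cov)          # pseudo-inverse also handles a singular covariance
    diff = x - mean
    return float(diff @ inv_cov @ diff)


# Toy 2-D reference data (illustrative values, not taken from AG News)
rng = np.random.default_rng(0)
reference = rng.normal(loc=0.0, scale=1.0, size=(200, 2))

near = squared_mahalanobis(np.array([0.1, -0.2]), reference)  # close to the centre
far = squared_mahalanobis(np.array([6.0, 6.0]), reference)    # far from the centre
print(near < far)
```

A point near the centre of the reference sample gets a small distance, while a point far away gets a large one. That gap is what the outlier test relies on.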

We sampled some Sports and Business news from the [AG News dataset](https://huggingface.co/datasets/ag_news) for our analysis.

In [1]:
# Reference distribution - list of sports news
sports_news = [  
    "Today's schedule College soccer: MEN -- Curry at Emerson, 4 p.m.; WOMEN -- Mount Ida at Curry, 3:30 p.m.",
    "Today's schedule Pro baseball: AL Division Series -- Anaheim vs. Red Sox at Fenway Park (Game 3), 4 p.m.",
    'Rivera was a corner stone for Ruiz NEW YORK -- For a time Saturday night, John Ruiz was ready to give up.',
    'Wisdom was main course WALTHAM -- He is an 87-year-old man with a cane and a cigar, and the clout of a king.',
    "Friday Focus: Running in the rain Rain is forecast for Saturday in Spa. Here's what the team will do to cope...",
    'Red Sox Stumble and Fumble Their Way to Series Lead  ST LOUIS (Reuters) - So much for the "Curse of the  Bambino."',
    'Georgia Tech Looks to Virginia Tech (AP) AP - Georgia Tech wants to avoid being embarrassed by another ACC rookie.',
    'Ali gives Iraq fighting chance The back of his shirt told the story last night at the Peristeri Olympic Boxing Hall.',
    'Starting for Cardinals Has Privileges St. Louis is a collection of superstar position players and anonymous pitchers.',
    'Hawks soar over Blue Stars The Division 4 Super Bowl last night featured two teams coming off 180-degree turnarounds.',
    "Italian GP, Race Fernando spun out of third position while Jarno finished tenth in this afternoon's Italian Grand Prix",
    "Emotional rescue for Jones Jacque Jones sprinted all the way around the bases, as if he couldn't wait to share the moment.",
    'Bulldogs, Gators Remember Last Miss. Game (AP) AP - Mississippi State is looking for another landmark win against Florida.',
    'Cavanagh, Crimson roll by Union Tom Cavanagh scored two goals, leading Harvard to a 4-1 win over visiting Union last night.'
]

In [2]:
# Candidate datapoint 1 - business news
datapoint_1 = 'UTC buys Kidde for 1.4bn UK fire equipment manufacturer Kidde agrees a 1.4bn takeover by US manufacturer United Technologies.'

# Candidate datapoint 2 - sports news
datapoint_2 = "Serena Ends Mauresmo's Year-End No. 1 Bid (AP) AP - Serena Williams is in love  #151; with her new attacking game and herself."

**Before using the Mahalanobis distance, we just need to convert our texts into features. We will use the [paraphrase-MiniLM-L6-v2 model from Sentence Transformers](https://huggingface.co/sentence-transformers/paraphrase-MiniLM-L6-v2) to do that.**

In [3]:
from sentence_transformers import SentenceTransformer

# load model
model = SentenceTransformer('sentence-transformers/paraphrase-MiniLM-L6-v2')

In [4]:
# Encode our sports news into features
encoded_sports_news = model.encode(sports_news)
encoded_sports_news.shape

(14, 384)

In [5]:
# Encode our candidate news into features
encoded_datapoint_1 = model.encode([datapoint_1])
encoded_datapoint_2 = model.encode([datapoint_2])
encoded_datapoint_1.shape, encoded_datapoint_2.shape

((1, 384), (1, 384))

**We have now encoded each text into a vector of 384 features. We are ready to use the Mahalanobis distance!**

All we need to do is calculate a critical value for a given `significance_level`:
- `chi2.ppf((1 - significance_level), num_features - 1)`

If the Mahalanobis distance of our datapoint is higher than this critical value, we have an outlier!
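The threshold described above can be computed on its own with `scipy.stats.chi2`. This is a small sketch that assumes the 384-dimensional embeddings produced earlier and reuses the same `num_features - 1` degrees of freedom; the variable names are chosen for illustration.

```python
from scipy.stats import chi2

num_features = 384         # size of each sentence embedding in this notebook
significance_level = 0.01

# Same formula as in the text: chi-square quantile with num_features - 1 degrees of freedom
critical_value = chi2.ppf(1 - significance_level, df=num_features - 1)

# A smaller significance level gives a higher threshold, so fewer points are flagged
stricter_value = chi2.ppf(1 - 0.001, df=num_features - 1)

print(critical_value, stricter_value)
```

Lowering `significance_level` raises the critical value, which makes the test more conservative about calling a datapoint an outlier.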

In [6]:
import numpy as np
from scipy.stats import chi2
from datasets import load_metric


mahalanobis_metric = load_metric("mahalanobis")


def is_outlier(x: np.ndarray, reference_distribution: np.ndarray, significance_level: float) -> bool:

    # Critical value: chi-square quantile at (1 - significance_level)
    # with (number of features - 1) degrees of freedom
    critical_value = chi2.ppf(1 - significance_level, df=x.shape[1] - 1)
    
    # Calculate the mahalanobis distance
    results = mahalanobis_metric.compute(
        reference_distribution=reference_distribution, 
        X=x
    )
    
    # Return True if the Mahalanobis distance is higher than the critical value, False otherwise
    return results['mahalanobis'][0] > critical_value


In [7]:
is_outlier(
    x=encoded_datapoint_1, 
    reference_distribution=encoded_sports_news, 
    significance_level=0.01  # strict threshold: 1% significance level
)

True

In [8]:
is_outlier(
    x=encoded_datapoint_2, 
    reference_distribution=encoded_sports_news, 
    significance_level=0.01  # strict threshold: 1% significance level
)

False