<img src="https://i.imgur.com/RFR6UZX.jpg" width="100%"/>

# 3. The metric (`Jaccard`)
### [chaii - Hindi and Tamil Question Answering](https://www.kaggle.com/c/chaii-hindi-and-tamil-question-answering) - A quick overview for QA noobs

Hi and welcome! This is the third kernel of the series `chaii - Hindi and Tamil Question Answering - A quick overview for QA noobs`.

**In this short kernel we will go over the Jaccard metric**.


---

The full series consists of the following notebooks:
1. [The competition](https://www.kaggle.com/julian3833/1-the-competition-qa-for-qa-noobs)
2. [The dataset](https://www.kaggle.com/julian3833/2-the-dataset-qa-for-qa-noobs)
3. _[The metric (Jaccard)](https://www.kaggle.com/julian3833/3-the-metric-jaccard-qa-for-qa-noobs) (This notebook)_
4. [Exploring Public Models](https://www.kaggle.com/julian3833/4-exploring-public-models-qa-for-qa-noobs/)
5. [🥇 XLM-Roberta + Torch's extra data [LB: 0.749]](https://www.kaggle.com/julian3833/5-xlm-roberta-torch-s-extra-data-lb-0-749)
6. [🤗 Pre & post processing](https://www.kaggle.com/julian3833/6-pre-post-processing-qa-for-qa-noobs/)

This is an ongoing project, so expect more notebooks to be added to the series soon. Actually, we are currently working on the following ones:
* Exploring Public Models Revisited
* Reviewing `squad2`, `mlqa` and others
* About `xlm-roberta-large-squad2`
* Own improvements



---


# Evaluation

This is copied literally from the [evaluation](https://www.kaggle.com/c/chaii-hindi-and-tamil-question-answering/overview/evaluation) tab of the competition:

>The metric in this competition is the [word-level Jaccard score](https://en.wikipedia.org/wiki/Jaccard_index). A good description of Jaccard similarity for strings is [here](https://towardsdatascience.com/overview-of-text-similarity-metrics-3397c4601f50). 

> A Python implementation based on the links above, and matched with the output of the C# implementation on the back end, is provided below.

```python
def jaccard(str1, str2): 
    a = set(str1.lower().split()) 
    b = set(str2.lower().split())
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))
```
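The denominator in the snippet above, `len(a) + len(b) - len(c)`, is just the size of the union of the two word sets: adding the two sizes counts the shared words twice, so the intersection is subtracted once. Here is a minimal sketch to check that, assuming two helper names (`jaccard_competition` and `jaccard_union`) and sample strings that are made up for this illustration:

```python
def jaccard_competition(str1, str2):
    # Same logic as the competition snippet, renamed for this comparison
    a = set(str1.lower().split())
    b = set(str2.lower().split())
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))


def jaccard_union(str1, str2):
    # Same score, written explicitly as intersection size over union size
    a = set(str1.lower().split())
    b = set(str2.lower().split())
    return len(a & b) / len(a | b)


# Made-up string pairs, only for illustration
pairs = [("the brown dog", "dog"), ("brown dog", "brown"), ("a b c", "d e")]
for s1, s2 in pairs:
    # Both formulations should give exactly the same value
    assert jaccard_competition(s1, s2) == jaccard_union(s1, s2)
```

Both versions divide the same two integers, so they should agree on any pair of non-empty strings.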


# Jaccard for sets


Before applying the measure to texts, let's use it for sets of elements.

The Jaccard coefficient measures **the intersection over the union** of two sets:

$$
J(A,B) = {{|A \cap B|}\over{|A \cup B|}}
$$
 
So, for example, consider the sets `a={1, 2}` and `b={2}`. Their Jaccard coefficient is `0.5`, because the intersection is `{2}`, which has length `1`, while the union is `{1, 2}`, which has length `2`, leading to the division `1/2` (intersection length / union length).


In [None]:
def jaccard(a, b): 
    intersection = a.intersection(b)
    union = a.union(b)
    jaccard = len(intersection) / len(union)
    return float(jaccard)

In [None]:
jaccard({1, 2}, {2})

In [None]:
jaccard({1, 2, 3}, {2})

In [None]:
jaccard({1, 2, 3}, {3, 4})

Jaccard is a measure of similarity considering the overlap of elements. It goes from `0` (no overlap of elements) to `1` (all elements overlap).

Note that the order doesn't matter, since it works with sets.

In [None]:
# No overlap
jaccard({1, 2, 3}, {4, 5, 6})

In [None]:
# Full overlap
jaccard({1, 2, 3}, {1, 2, 3})

In [None]:
# Order doesn't matter
jaccard({1, 2, 3}, {3, 2, 1})

# Jaccard for texts

Now that we understand Jaccard applied to bare sets, we can extrapolate its behaviour to texts. See the two functions below:

In [None]:
# The function we were using, renamed to jaccard_set
def jaccard_set(a, b): 
    intersection = a.intersection(b)
    union = a.union(b)
    jaccard = len(intersection) / len(union)
    return float(jaccard)

# The metric used in the competition
# I edited it to make it more readable
def jaccard(str1, str2): 
    a = set(str1.lower().split()) 
    b = set(str2.lower().split())
    intersection = a.intersection(b)
    union = a.union(b)
    jaccard = len(intersection) / len(union)
    return float(jaccard)
    

The only difference is that the text version takes strings as input and turns them into sets of lowercased words by splitting them on whitespace.

## Examples:

In [None]:
# 1/2 overlap
jaccard("brown dog", "brown")

In [None]:
# 1/3 overlap
jaccard("the brown dog", "dog")

In [None]:
# Full overlap, order doesn't matter
jaccard("the brown dog", "brown dog the")

In [None]:
# No overlap
jaccard("the brown dog", "a white cat jumps")
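A few consequences of the lowercase-and-split step are worth seeing in action. The sketch below redefines the competition function so it runs on its own; the example strings are made up for illustration:

```python
def jaccard(str1, str2):
    # Competition metric: word-level Jaccard on lowercased, whitespace-split tokens
    a = set(str1.lower().split())
    b = set(str2.lower().split())
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))


# Case is ignored, because both strings are lowercased first
print(jaccard("Brown Dog", "brown dog"))            # 1.0

# Repeated words count only once, because the tokens are stored in a set
print(jaccard("dog dog dog", "dog"))                # 1.0

# Punctuation stays attached to its word, so "dog." and "dog" are different tokens
print(jaccard("the brown dog.", "the brown dog"))   # 0.5
```

The last case suggests that stray punctuation at the edges of a predicted answer can lower the score even when the words themselves are right.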

The final score for this competition is computed as follows:
* Calculate the Jaccard coefficient between the real answer and your prediction for each test sample
* Average all those coefficients

Therefore, it's still a value between 0 and 1.
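The two steps above can be sketched in a few lines. This is only an illustration: the `answers` and `predictions` lists are invented, and the actual scoring runs on Kaggle's back end.

```python
def jaccard(str1, str2):
    # Competition metric: word-level Jaccard on lowercased, whitespace-split tokens
    a = set(str1.lower().split())
    b = set(str2.lower().split())
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))


# Hypothetical ground-truth answers and model predictions (made up)
answers = ["the brown dog", "new delhi", "1947"]
predictions = ["brown dog", "new delhi", "in 1947"]

# Step 1: one Jaccard coefficient per sample
scores = [jaccard(truth, pred) for truth, pred in zip(answers, predictions)]

# Step 2: average the coefficients
mean_score = sum(scores) / len(scores)

print(scores)
print(round(mean_score, 4))
```

Here the per-sample scores are 2/3, 1 and 1/2, so a single perfect prediction does not compensate much for partial matches elsewhere.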

The current leaderboard score of ~`0.75` means that the predictions reach, on average, a Jaccard coefficient of 0.75 with the actual answers, i.e. roughly a 75% word overlap.



## What's next?

We have now understood the problem, taken a look at the dataset, and analyzed the metric.

Let's go on to [4. Exploring Public Models](https://www.kaggle.com/julian3833/4-exploring-public-models-qa-for-qa-noobs/)!


If you want to dig deeper into the Jaccard metric, I recommend the notebook [Jaccard Similarity Tamil & Hindi](https://www.kaggle.com/mpwolke/jaccard-similarity-tamil-hindi) by [mpwolke](https://www.kaggle.com/mpwolke/).



## Remember to upvote the notebook if you found it useful! 🤗