## <p style="background-color:skyblue; font-family:newtimeroman; font-size:180%; text-align:center">Coleridge Initiative : Evaluation Process </p>

#### <p style="background-color:lightgrey; font-family:newtimeroman; font-size:150%; text-align:left"> There have been discussions regarding the evaluation process. It seems there were some issues with the leaderboard (LB) or the metric, because of which the competition saw a 0.992 score on the LB within 12 hours of launch. I followed the discussions and concluded that initially the metric wasn't considering recall. The issue has now been fixed and the competition uses the FBeta (0.5) metric. In this notebook, I demonstrate the evaluation metric in detail.  </p>

### ⚠️ Before moving forward, I recommend looking at the competition details and data, if you haven't already. You can follow this notebook - [Starter 🌟: Competition, Data, EDA and Modelling🚀](https://www.kaggle.com/pashupatigupta/starter-competition-data-eda-and-modelling)

#### A brief overview of competition before going ahead

> The objective of the competition is to identify the mention of datasets within scientific publications. Your predictions will be short excerpts from the publications that appear to note a dataset.

So, our predictions are the dataset names used in a particular publication. A publication can use multiple datasets, so we will be predicting all the datasets used in the publication. Multiple predictions are delineated with a pipe (|) character in the submission file.

So, our prediction will look like - ABC|PQR|XYZ

And the ground truth will look like - PQR|BNM|XYZ|DEF
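
As a rough preview, the two example strings above can be compared as sets. The sketch below assumes exact string matching, which is a simplification: the competition metric matches predictions to ground truths with a token-based Jaccard score instead. The variable names are illustrative only.

```python
# Illustrative only: exact string matching, not the competition's Jaccard matching
prediction = "ABC|PQR|XYZ"
ground_truth = "PQR|BNM|XYZ|DEF"

pred_set = set(prediction.split("|"))
gt_set = set(ground_truth.split("|"))

true_positives = sorted(pred_set & gt_set)   # predicted and present in the ground truth
false_positives = sorted(pred_set - gt_set)  # predicted but absent from the ground truth
false_negatives = sorted(gt_set - pred_set)  # in the ground truth but never predicted

print(true_positives)   # ['PQR', 'XYZ']
print(false_positives)  # ['ABC']
print(false_negatives)  # ['BNM', 'DEF']
```

Under this simplified view there are 2 true positives, 1 false positive and 2 false negatives.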

Now, how do we evaluate such a prediction against the given ground truth? The competition says it uses a Jaccard-based FBeta score. What's this? Let's find out. 

In [None]:
import os
import re
import json
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set(style="whitegrid")
from wordcloud import WordCloud, STOPWORDS
import plotly.graph_objects as go

import warnings
warnings.filterwarnings('ignore')

Reading the training dataset

In [None]:
df = pd.read_csv('../input/coleridgeinitiative-show-us-the-data/train.csv')
df.sample(10)

Now, the content of every publication is provided in a json file. We'll read the text of a publication from the json file and put it in the train dataframe

In [None]:
def get_text(filename, test=False):
    if test:
        df = pd.read_json('../input/coleridgeinitiative-show-us-the-data/test/{}.json'.format(filename))
    else:
        df = pd.read_json('../input/coleridgeinitiative-show-us-the-data/train/{}.json'.format(filename))
    text = " ".join(list(df['text']))
    return text

In [None]:
df['text'] = df['Id'].apply(get_text)
df.sample(5)

### Let's split this dataframe into a development set and a validation set.

In [None]:
from sklearn.model_selection import train_test_split
dev, val = train_test_split(df, test_size=0.1, random_state=42)
print("Development Shape : ", dev.shape)
print("Validation Shape : ", val.shape)

Cool! Here "cleaned_label" is our target variable. (As per my understanding)

> The "cleaned_label" column in the validation set will work as ground truth. We'll make predictions on the validation set. And then, we'll understand the evaluation metric. (Since we'll have both predictions and ground truth)

### Simple Baseline Model

I'm directly copying the baseline model from my previous notebook. You can look into this [Notebook](https://www.kaggle.com/pashupatigupta/starter-competition-data-eda-and-modelling) if you want the explanation

> Note that ALL ground truth texts have been cleaned for matching purposes using the following code:

So we'll be using this function to clean our texts wherever needed.

In [None]:
def clean_text(txt):
    return re.sub('[^A-Za-z0-9]+', ' ', str(txt).lower())
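
To see what this cleaning does in practice, here is a small sketch. The label string is a hypothetical example picked only to show how punctuation is handled; it is not taken from the competition data.

```python
import re

def clean_text(txt):
    # Same rule as above: lowercase, then collapse every run of
    # non-alphanumeric characters into a single space
    return re.sub('[^A-Za-z0-9]+', ' ', str(txt).lower())

# Hypothetical label, used only for illustration
example = "National Education Longitudinal Study (NELS:88)"
cleaned = clean_text(example)
print(repr(cleaned))  # 'national education longitudinal study nels 88 '
```

Note the trailing space: the closing parenthesis becomes a space. This does not matter for the token-based Jaccard score used later (it splits on whitespace), but it can matter for exact string comparisons.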

In [None]:
def baseline_model(dev, val):
    
    print("Running...")
    datasets_titles = [x.lower() for x in set(dev['dataset_title'].unique()).union(set(dev['dataset_label'].unique()))]

    print("Preparing Validation set...")
    val_id = val['Id'].unique()
    ids = []
    texts = []
    gt = []

    for ed in val_id:
        tdf = val[val['Id'] == ed]
        gt_label = "|".join(tdf['cleaned_label'].tolist())
        text = tdf['text'].tolist()[0]
        ids.append(ed)
        gt.append(gt_label)
        texts.append(text)

    pval = pd.DataFrame({'Id':ids, 'text':texts, 'ground_truth':gt})

    #print(pval.shape)

    print("Generating predictions...")
    labels = []
    for index in pval['Id']:
        publication_text = pval[pval['Id'] == index].text.str.cat(sep='\n').lower()
        #print(publication_text)
        label = []
        for dataset_title in datasets_titles:
            if dataset_title in publication_text:
                label.append(clean_text(dataset_title))
        labels.append('|'.join(label))

    pval['prediction'] = labels
    print("Done!")
    
    return pval

In [None]:
output = baseline_model(dev, val)
output.head()

In [None]:
### Making another copy of predictions for future use
sub_df = baseline_model(dev, val)

#### Superb! We have got the predictions and ground truths. Let's understand the Evaluation metric which is Jaccard similarity based FBeta (0.5). 

Before this we'll first understand Jaccard Similarity and FBeta separately. The steps will be like -
- Jaccard Similarity
- FBeta Score
- Jaccard similarity based FBeta score

## <p style="background-color:skyblue; font-family:newtimeroman; font-size:180%; text-align:center">1. Jaccard Similarity</p>

Definition - The Jaccard similarity measures similarity between finite sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets.

![img](https://miro.medium.com/max/744/1*XiLRKr_Bo-VdgqVI-SvSQg.png)

The definition is pretty simple, and we can apply it whenever we have well-defined sets. But how is it applied in our case?

> In our case we have sentences (collections of words), and we can split a sentence on whitespace to make a Python set. Then we can apply Jaccard similarity. Simple! 

Let's implement it now!

In [None]:
def jaccard(str1, str2): 
    a = set(str1.lower().split()) 
    b = set(str2.lower().split())
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))

Let's check it on some examples -

#### 1.1 On Similar sentences

In [None]:
jaccard("The cat is on the mat", "The cat is on the table")

#### 1.2 On exact same sentence

In [None]:
jaccard("This is a pen", "This is a pen")

#### 1.3 On totally different sentence

In [None]:
jaccard("I am going home", "India is a beautiful country")

### Cool, we can see jaccard is working fine.
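
For reference, the three checks above can be worked out by hand. The sketch below repeats the same `jaccard` function so it runs on its own, and the comments show the values obtained from the set sizes.

```python
def jaccard(str1, str2):
    # Token-level Jaccard: |A ∩ B| / |A ∪ B| on lowercased whitespace tokens
    a = set(str1.lower().split())
    b = set(str2.lower().split())
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))

# 5 unique tokens on each side, 4 shared -> 4 / (5 + 5 - 4)
similar = jaccard("The cat is on the mat", "The cat is on the table")
# identical token sets -> 4 / 4
same = jaccard("This is a pen", "This is a pen")
# no shared tokens -> 0 / 9
different = jaccard("I am going home", "India is a beautiful country")

print(round(similar, 4))  # 0.6667
print(same)               # 1.0
print(different)          # 0.0
```

Repeated words such as "the" count only once because the tokens are stored in a set. Also note that this function raises a `ZeroDivisionError` if both strings are empty.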

## <p style="background-color:skyblue; font-family:newtimeroman; font-size:180%; text-align:center">2. FBeta Score</p>

> Assuming that you are already familiar with true positive, true negative, false positive and false negative.

#### **Definition** - The F-score is a measure of a model's performance. It is calculated from the precision and recall of the test, where the precision is the number of true positive results divided by the number of all positive results, including those not identified correctly, and the recall is the number of true positive results divided by the number of all samples that should have been identified as positive.

Below is the formula for Precision and Recall - 

![pr](https://miro.medium.com/max/1872/1*pOtBHai4jFd-ujaNXPilRg.png)

And below is the formula for F-Score

![f1](https://github.com/pashupati98/kaggle-archives/blob/main/img/f1.PNG?raw=true)

Note : The F-Score is often referred to as the F1-Score. They are the same.

#### Notice that the definition above is F-Score not FBeta score. Now, Let's look into FBeta

**FBeta Score** :  It is an F-Score that uses a positive real factor β, where β is chosen such that recall is considered β times as important as precision. The formula for FBeta is given as - 

![fb](https://github.com/pashupati98/kaggle-archives/blob/main/img/fb.PNG?raw=true)
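
In case the images above do not render, the formulas can be written out as follows, with $P$ for precision and $R$ for recall. These are the standard definitions, and they match the implementation in this notebook.

```latex
P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}

F_1 = \frac{2 \, P \, R}{P + R}, \qquad
F_\beta = (1 + \beta^2) \, \frac{P \, R}{\beta^2 P + R}
```

Setting $\beta = 1$ in $F_\beta$ gives back $F_1$.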

Let's implement it!!

In [None]:
def get_precision_recall(tp, fp, fn):
    precision = tp / (tp+fp)
    recall = tp / (tp + fn)
    return precision, recall

def fbeta_score(precision, recall, beta):
    fbeta = (1+(beta*beta))*((precision*recall)/( (beta*beta*precision) + recall))
    return fbeta

Let's check it on some examples (We are passing beta = 0.5 because this competition uses beta = 0.5)

#### 2.1 When precision equals recall

In [None]:
fbeta_score(0.8, 0.8, 0.5)

#### 2.2 When precision is greater than recall

In [None]:
fbeta_score(0.8, 0.5, 0.5)

#### 2.3 When precision is less than recall

In [None]:
fbeta_score(0.5, 0.8, 0.5)

Well well well! We can see that beta=0.5 gives more importance to precision: for the same pair of values, the score is higher when the larger one is the precision. So this competition's evaluation is biased towards precision.
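
The three cases above can be checked numerically. This is a minimal, self-contained sketch that repeats the same formula; the variable names are illustrative.

```python
def fbeta_score(precision, recall, beta):
    # F-beta: weighted harmonic mean of precision and recall
    return (1 + beta**2) * (precision * recall) / (beta**2 * precision + recall)

equal = fbeta_score(0.8, 0.8, 0.5)           # precision == recall
high_precision = fbeta_score(0.8, 0.5, 0.5)  # precision > recall
high_recall = fbeta_score(0.5, 0.8, 0.5)     # precision < recall

print(round(equal, 4))           # 0.8
print(round(high_precision, 4))  # 0.7143
print(round(high_recall, 4))     # 0.5405
```

Swapping precision and recall changes the score from about 0.71 to about 0.54, which shows how strongly beta=0.5 favours precision.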

## <p style="background-color:skyblue; font-family:newtimeroman; font-size:180%; text-align:center">3. Jaccard Similarity based FBeta Score</p>

### Now, the main question is "How FBeta Score is calculated in our case?"

The answer is pretty simple - All we need is true positives, false positives and false negatives. And the competition provides the rules to find these.

#### Rules mentioned in the Evaluation section of the competition 

For each publication's set of predictions, a token-based Jaccard score is calculated for each potential prediction / ground truth pair. The prediction with the highest score for a given ground truth is matched with that ground truth.

- Predicted strings for each publication are sorted alphabetically and processed in that order. Any scoring ties are resolved on the basis of that sort.
- Any matched predictions where the Jaccard score meets or exceeds the threshold of 0.5 are counted as true positives (TP), the remainder as false positives (FP).
- Any unmatched predictions are counted as false positives (FP).
- Any ground truths with no nearest predictions are counted as false negatives (FN).

All TP, FP and FN across all samples are used to calculate a final micro F0.5 score. (Note that a micro F score does precisely this, creating one pool of TP, FP and FN that is used to calculate a score for the entire set of predictions.)
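
To make the micro-averaging step concrete, here is a minimal sketch. The function name `micro_fbeta` and the per-publication counts are hypothetical, chosen only for illustration, and the sketch assumes at least one true positive overall (otherwise the divisions fail).

```python
def micro_fbeta(counts, beta=0.5):
    # counts: list of (tp, fp, fn) tuples, one tuple per publication
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    # Precision and recall are computed once, from the pooled counts
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

# Hypothetical counts for two publications
per_publication = [(2, 1, 0), (0, 1, 2)]
score = micro_fbeta(per_publication)
print(score)  # 0.5
```

Here the pooled counts are TP=2, FP=2, FN=2, so precision and recall are both 0.5. The second publication has no true positives at all, yet it still contributes its FP and FN to the single pooled score rather than getting a score of its own.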

<p style="background-color:skyblue; font-family:newtimeroman; font-size:150%; text-align:left">Disclaimer : This implementation is solely based on my understanding. If your understanding is different, let me know - we can discuss. This implementation is NOT official.</p>

In [None]:
def coleridge_initiative_jaccard(ground_truth, prediction, verbose=True):
    gts = ground_truth.split('|')
    pds = sorted(prediction.split('|'))
    if verbose:
        print("Ground truth : " , gts)
        print("Prediction : ", pds)
        
    js_scores = []
    cf_matrix = []
    
    #### Counting True Positives (TP) and False Positives (FP)
    
    for pred in pds:  # named `pred` so the pandas alias `pd` is not shadowed
        score = -1
        for gt in gts:
            js = jaccard(pred, gt)
            if js > score:
                score = js
        if score >= 0.5:
            js_scores.append(score)
            cf_matrix.append("TP")
        else:
            js_scores.append(score)
            cf_matrix.append("FP")
    
    
    #### Counting False Negatives (FN)
    
    for gt in gts:
        score = -1
        for pred in pds:
            js = jaccard(gt, pred)
            if js > score:
                score = js
        # A ground truth is missed when no prediction reaches the 0.5 threshold
        if score < 0.5:
            js_scores.append(score)
            cf_matrix.append("FN")
            
    return js_scores, " ".join(cf_matrix)
            

Let's check the function

In [None]:
coleridge_initiative_jaccard("this data|that dataset|xyz", "which data|no dataset|that dataset")
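
To follow what happens inside this call, it helps to look at the pairwise Jaccard scores for the same example. The sketch below is self-contained and only prints the score of each sorted prediction against each ground truth; it does not reproduce the TP/FP/FN bookkeeping.

```python
def jaccard(str1, str2):
    # Token-level Jaccard on lowercased whitespace tokens
    a = set(str1.lower().split())
    b = set(str2.lower().split())
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))

ground_truths = "this data|that dataset|xyz".split("|")
predictions = sorted("which data|no dataset|that dataset".split("|"))

pairwise = {}
for pred in predictions:
    pairwise[pred] = {gt: round(jaccard(pred, gt), 2) for gt in ground_truths}
    print(pred, pairwise[pred])

# no dataset {'this data': 0.0, 'that dataset': 0.33, 'xyz': 0.0}
# that dataset {'this data': 0.0, 'that dataset': 1.0, 'xyz': 0.0}
# which data {'this data': 0.33, 'that dataset': 0.0, 'xyz': 0.0}
```

Only "that dataset" reaches the 0.5 threshold, so it is the single true positive. The other two predictions have a best score of 0.33 and are counted as false positives.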

Let's apply it on our dataset

In [None]:
output['evaluation'] = output.apply(lambda x: coleridge_initiative_jaccard(x['ground_truth'], x['prediction'], verbose=False), axis=1)
output['js_scores'] = output['evaluation'].apply(lambda x : x[0])
output['pred_type'] = output['evaluation'].apply(lambda x : x[1])

In [None]:
output.head()

### Let's write a function to count TP, FP, FN

In [None]:
def get_count_tp_fp_fn(prediction, verbose=True):
    preds = prediction.split(" ")
    if verbose:
        print(preds)
    tpc = 0
    fpc = 0
    fnc = 0
    for pred in preds:
        if pred == "TP":
            tpc = tpc + 1
        elif pred == "FP":
            fpc = fpc + 1
        elif pred == "FN":
            fnc = fnc + 1
    return [tpc, fpc, fnc]

def make_col_tp_fp_fn(df, col):
    df['TP'] = df[col].apply(lambda x : x[0])
    df['FP'] = df[col].apply(lambda x : x[1])
    df['FN'] = df[col].apply(lambda x : x[2])
    return df

Let's check this function

In [None]:
get_count_tp_fp_fn("TP TP FP FN")

Let's apply it on our dataset

In [None]:
output['tp_fp_fn'] = output['pred_type'].apply(lambda x : get_count_tp_fp_fn(x, verbose=False))
output = make_col_tp_fp_fn(output, 'tp_fp_fn')
output.head()

In [None]:
tp = sum(output['TP'])
fp = sum(output['FP'])
fn = sum(output['FN'])

print("True Positives (TP) : ", tp)
print("False Positives (FP) : ", fp)
print("False Negatives (FN) : ", fn)

#### Finally we have got the TP, FP, FN to calculate FBeta score !!

In [None]:
precision, recall = get_precision_recall(tp, fp, fn)
print("Precision : ", precision)
print("Recall : ", recall)

In [None]:
fbeta = fbeta_score(precision, recall, 0.5)
print("FBeta Score : ", fbeta)

### Yay! We got the metric value.

This is the overall process of evaluation for this competition.

#### <p style="background-color:lightcoral; font-family:newtimeroman; font-size:150%; text-align:left">I have made this Notebook public so that fellow kagglers can understand the evaluation process better. Thanks for viewing this notebook. If you found this helpful consider UPVOTING it. Also, please correct me if you think I'm wrong anywhere.</p>

## EDIT 1 (31/03/2021)

Adding this score_df_coleridge_initiative function so you can score your predictions dataframe with a single call. You can also use it as a validation metric while model building (it is not differentiable, so it cannot be used directly as a gradient-based loss function).

In [None]:
def score_df_coleridge_initiative(output, gt_col, pred_col, beta=0.5, verbose=True):
    
    '''
    This function will calculate the FBeta score for Coleridge Initiative competition 
    if given appropriate arguments
    
    Arguments - 
    output - Your submission dataframe that has both ground truth and prediction columns.
    gt_col - This is the column name of ground truth column.
    pred_col - This is the column name of predictions column.
    beta - Beta value to calculate FBeta score.
    
    Returns - 
    This function will return the FBeta score for the given beta (default 0.5).
    
    ## Set verbose = True to print logs    
    '''
    
    ### Jaccard Similarity
    output['evaluation'] = output.apply(lambda x: coleridge_initiative_jaccard(x[gt_col], x[pred_col], verbose=False), axis=1)
    output['js_scores'] = output['evaluation'].apply(lambda x : x[0])
    output['pred_type'] = output['evaluation'].apply(lambda x : x[1])
    
    ### TP, FP and FN 
    output['tp_fp_fn'] = output['pred_type'].apply(lambda x : get_count_tp_fp_fn(x, verbose=False))
    output = make_col_tp_fp_fn(output, 'tp_fp_fn')
    
    tp = sum(output['TP'])
    fp = sum(output['FP'])
    fn = sum(output['FN'])
    precision, recall = get_precision_recall(tp, fp, fn)
    fbeta = fbeta_score(precision, recall, beta)  # use the beta argument, not a hard-coded 0.5
    
    if verbose:
        print("True Positives (TP) : ", tp)
        print("False Positives (FP) : ", fp)
        print("False Negatives (FN) : ", fn)
        print("Precision : ", precision)
        print("Recall : ", recall)
        print("FBeta Score : ", fbeta)
        display(output.head())

    return fbeta

In [None]:
score_df_coleridge_initiative(sub_df, "ground_truth", "prediction", beta=0.5, verbose=True)