<a href="https://colab.research.google.com/github/shmuhammadd/semantic_relatedness/blob/main/Simple_English_Baseline_v2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Simple Co-occurrence Baseline for Semantic Relatedness -- English Example

Authors: Krishnapriya Vishnubhotla, Mohamed Abdalla

Introduction:

In this starter notebook, we walk through estimating semantic relatedness with a simple co-occurrence baseline. The notebook was adapted from a notebook for SemEval 2023 Shared Task 12: AfriSenti (Task A).

### Package Imports

In [4]:
import re
import pandas as pd
import numpy as np
import os
from scipy.stats import spearmanr, pearsonr
import matplotlib.pyplot as plt
plt.style.use('ggplot')

### Data Import

The training data contains a real-valued semantic textual relatedness score (between 0 and 1) for each pair of English sentences.

The data is structured as a CSV file with the following fields:
- PairID: a unique identifier for the sentence pair
- Text: two sentences separated by a newline ('\n') character
- Score: the semantic textual relatedness score for the two sentences

Below we will show you how to load and re-format the provided data file.
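
Before loading the real file, the layout described above can be illustrated with a tiny in-memory example. This is only a sketch: the single row reuses a sentence pair shown later in this notebook, and the column names `Sent1` and `Sent2` are hypothetical names chosen for illustration.

```python
import io
import pandas as pd

# One row in the described layout. The Text field is quoted because it
# holds both sentences, separated by a newline character.
csv_text = (
    "PairID,Text,Score\n"
    'STS_237,"A black dog running through water.\n'
    'A black dog is running through some water.",1.0\n'
)

toy = pd.read_csv(io.StringIO(csv_text))

# Split on the first newline only, giving one column per sentence.
# Sent1 and Sent2 are illustrative names, not part of the provided data.
parts = toy["Text"].str.split("\n", n=1, expand=True)
toy["Sent1"] = parts[0]
toy["Sent2"] = parts[1]

print(toy[["PairID", "Sent1", "Sent2", "Score"]])
```

Keeping the two sentences in separate columns is optional; the cells below work directly on the combined `Text` field.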

In [5]:
# Load the File
PATH = os.path.join("..", "data", "raw")

df_str_rel = pd.read_csv(os.path.join(PATH, 'sem_text_rel_ranked.csv'), usecols=[3,4,5])
df_str_rel.head()

Unnamed: 0,PairID,Text,Score
0,Formality_pp_222,"It that happens, just pull the plug.\nif that ...",1.0
1,STS_237,A black dog running through water.\nA black do...,1.0
2,ParaNMT_pp_204,I've been searchingthe entire abbey for you.\n...,1.0
3,Formality_pp_119,If he is good looking and has a good personali...,1.0
4,Formality_pp_174,"She does not hate you, she is just annoyed wit...",1.0


In [None]:
df_str_rel['Text'].values

array(['It that happens, just pull the plug.\nif that ever happens, just pull the plug.',
       'A black dog running through water.\nA black dog is running through some water.',
       "I've been searchingthe entire abbey for you.\nI'm looking for you all over the abbey.",
       ...,
       "I actually read a chapter or two beyond that point, but my heart wasn't in it any more.\nLets say she's a blend of two types of beings.",
       'A boy gives being in the snow two thumbs up.\nA satisfied cat is perched beside a crystal lamp.',
       'Perhaps it is strange to think about sex constantly these days.\nFew people know how to shoot pool these days.'],
      dtype=object)

In [None]:
# Creating a column "Split_Text" which is a list of two sentences.
df_str_rel['Split_Text'] = df_str_rel['Text'].apply(lambda x: x.split("\n"))
df_str_rel.head()

Unnamed: 0,PairID,Text,Score,Split_Text
0,Formality_pp_222,"It that happens, just pull the plug.\nif that ...",1.0,"[It that happens, just pull the plug., if that..."
1,STS_237,A black dog running through water.\nA black do...,1.0,"[A black dog running through water., A black d..."
2,ParaNMT_pp_204,I've been searchingthe entire abbey for you.\n...,1.0,"[I've been searchingthe entire abbey for you.,..."
3,Formality_pp_119,If he is good looking and has a good personali...,1.0,[If he is good looking and has a good personal...
4,Formality_pp_174,"She does not hate you, she is just annoyed wit...",1.0,"[She does not hate you, she is just annoyed wi..."


# Dice Score (Overlap Score)

A simple baseline for estimating the semantic relatedness of two sentences is the proportion of words they share.
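
For reference, the Dice coefficient is conventionally defined as follows. The symbols $A$ and $B$ are introduced here for illustration and stand for the sets of unique tokens in the first and second sentence.

```latex
\[
\mathrm{Dice}(A, B) = \frac{2\,|A \cap B|}{|A| + |B|}
\]
```

The `dice_score` function in this notebook divides $|A \cap B|$ by $|A| + |B|$ without the factor of 2, so before rounding its values are half the conventional coefficient and lie between 0 and 0.5.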

There are many ways to change the score below. Consider:
1. Removing stop words and/or punctuation
2. Counting duplicate words (currently not counted)
3. Weighting rarer words differently
4. Splitting tokens differently
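
As a rough sketch of the first two options, the function below drops punctuation and stop words and can optionally count duplicate tokens. The names `STOP_WORDS`, `tokenize`, and `overlap_score_variant` are hypothetical, the stop-word list is deliberately tiny, and the normalization (shared tokens divided by the summed token counts) is kept the same as in this notebook's baseline so the scores stay comparable.

```python
import re
from collections import Counter

# Hypothetical, deliberately small stop-word list; replace with a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "in"}

def tokenize(sentence, drop_punct=True, stop_words=STOP_WORDS):
    """Lowercase and tokenize, optionally dropping punctuation and stop words."""
    tokens = re.findall(r"\w+|[^\w\s]", sentence.lower())
    if drop_punct:
        # Keep only tokens that start with a word character
        tokens = [t for t in tokens if re.match(r"\w", t)]
    return [t for t in tokens if t not in stop_words]

def overlap_score_variant(s1, s2, count_duplicates=False):
    """Shared tokens divided by the summed token counts of both sentences."""
    t1, t2 = tokenize(s1), tokenize(s2)
    if count_duplicates:
        # Multiset intersection: a repeated word counts once per matched repeat
        c1, c2 = Counter(t1), Counter(t2)
        shared = sum((c1 & c2).values())
        total = len(t1) + len(t2)
    else:
        shared = len(set(t1) & set(t2))
        total = len(set(t1)) + len(set(t2))
    return round(shared / total, 2) if total else 0.0

print(overlap_score_variant("A black dog running through water.",
                            "A black dog is running through some water."))  # 0.45
print(overlap_score_variant("dog dog cat", "dog dog"))                         # 0.33
print(overlap_score_variant("dog dog cat", "dog dog", count_duplicates=True))  # 0.4
```

Whether any of these changes helps is an empirical question; compare the resulting correlation with the baseline before adopting one.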

In [None]:
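def dice_score(s1,s2):
  """Overlap score between two sentences, computed over their sets of unique tokens."""
  # Lowercase, then split into word tokens and single punctuation marks
  s1_split = set(re.findall(r"\w+|[^\w\s]", s1.lower()))
  s2_split = set(re.findall(r"\w+|[^\w\s]", s2.lower()))

  total = len(s1_split) + len(s2_split)
  if total == 0:
    return 0.0  # both sentences are empty; avoid dividing by zero

  # Note: the standard Dice coefficient is 2*|A∩B| / (|A|+|B|). This baseline omits
  # the factor of 2, so scores range from 0 to 0.5. A constant rescaling does not
  # change the Pearson correlation, apart from rounding effects.
  dice_coef = len(s1_split & s2_split) / total
  return round(dice_coef, 2)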
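
To make the computation concrete, here is a small worked example on the first sentence pair in the data. It repeats the same tokenization in a self-contained form; the helper name `unique_tokens` is hypothetical.

```python
import re

def unique_tokens(sentence):
    """Set of lowercase word tokens and punctuation marks in a sentence."""
    return set(re.findall(r"\w+|[^\w\s]", sentence.lower()))

s1 = "It that happens, just pull the plug."
s2 = "if that ever happens, just pull the plug."

a, b = unique_tokens(s1), unique_tokens(s2)
shared = a & b

print(sorted(shared))                             # the 8 shared tokens, including ',' and '.'
print(len(shared), len(a), len(b))                # 8 9 10
print(round(len(shared) / (len(a) + len(b)), 2))  # 8 / 19 -> 0.42
```

Only `it`, `if`, and `ever` are unshared, yet the score is 0.42 for a pair with a gold score of 1.0, which shows how the missing factor of 2 and the counted punctuation shape the raw values.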

## Calculate Dice Score

In [None]:
true_scores = df_str_rel['Score'].values
pred_scores = []

for text in df_str_rel['Text']:
  # Split on the first newline only, so each row yields exactly two sentences
  s1,s2 = text.split("\n", 1)

  # Overlap score
  pred_scores.append(dice_score(s1,s2))

In [None]:
# How well does the baseline correlate with human judgments?
print("Pearson Correlation:", round(pearsonr(true_scores,pred_scores)[0],2))

Pearson Correlation: 0.58
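
The imports at the top of this notebook also include `spearmanr`, which measures rank agreement rather than linear agreement. The sketch below computes both correlations side by side; the `gold` and `pred` values are hypothetical toy numbers, not results from the dataset.

```python
from scipy.stats import pearsonr, spearmanr

# Hypothetical toy values for illustration only
gold = [1.0, 0.8, 0.6, 0.3, 0.1]
pred = [0.42, 0.44, 0.29, 0.20, 0.05]

# Both functions return a (statistic, p-value) pair; index 0 is the correlation
pearson = pearsonr(gold, pred)[0]
spearman = spearmanr(gold, pred)[0]

print("Pearson Correlation:", round(pearson, 2))
print("Spearman Correlation:", round(spearman, 2))  # 0.9: only the top two items swap ranks
```

Spearman depends only on the ordering of the predictions, so it can be a useful complement when the baseline's raw values are on a different scale from the gold scores.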


# Generate submission file

### Append prediction to dataframe

In [None]:
df_str_rel['Pred_Score'] = pred_scores
df_str_rel.head()

Unnamed: 0,PairID,Text,Score,Pred_Score
0,Formality_pp_222,"It that happens, just pull the plug.\nif that ...",1.0,0.42
1,STS_237,A black dog running through water.\nA black do...,1.0,0.44
2,ParaNMT_pp_204,I've been searchingthe entire abbey for you.\n...,1.0,0.29
3,Formality_pp_119,If he is good looking and has a good personali...,1.0,0.41
4,Formality_pp_174,"She does not hate you, she is just annoyed wit...",1.0,0.36


### Generate submission file

The submission file has two columns: '**PairID**' and '**Pred_Score**'.

In [None]:
df_str_rel[['PairID', 'Pred_Score']].to_csv('pred_eng.csv', index=False)
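
Before submitting, it can help to read the file back and check its shape. The sketch below does this on a one-row toy frame written to a temporary directory. The checks themselves (exact column order, unique IDs, scores between 0 and 1) are assumptions based on the description above, not validation rules stated by the task.

```python
import os
import tempfile
import pandas as pd

# Toy frame standing in for df_str_rel; values are taken from the rows shown above
toy = pd.DataFrame({
    "PairID": ["STS_237"],
    "Score": [1.0],
    "Pred_Score": [0.44],
})

# Write to a temporary directory so the real pred_eng.csv is not overwritten
path = os.path.join(tempfile.mkdtemp(), "pred_eng.csv")
toy[["PairID", "Pred_Score"]].to_csv(path, index=False)

# Read the file back and run basic sanity checks (assumed, not official, rules)
check = pd.read_csv(path)
assert list(check.columns) == ["PairID", "Pred_Score"]
assert check["PairID"].is_unique
assert check["Pred_Score"].between(0, 1).all()
print(check)
```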