<a href="https://colab.research.google.com/github/shmuhammadd/semantic_relatedness/blob/main/Simple_English_Baseline_v2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Simple Co-Occurrence Baseline for Semantic Relatedness -- English Example

Authors: Krishnapriya Vishnubhotla, Mohamed Abdalla

Introduction:

In this starter notebook, we will take you through the process of estimating semantic relatedness using simple co-occurrence baselines. The notebook was adapted from a notebook for SemEval 2023 Shared Task 12: AfriSenti (Task A).

### Package Imports

In [1]:
import re
import pandas as pd
import numpy as np
from scipy.stats import spearmanr, pearsonr
import matplotlib.pyplot as plt
import io
from gensim.models import FastText
plt.style.use('ggplot')

### Data Import

The training data contains a real-valued semantic textual relatedness score (between 0 and 1) for each pair of English-language sentences.

The data is structured as a CSV file with the following fields:
- PairID: a unique identifier for the sentence pair
- Text: two sentences separated by a newline ('\n') character
- Score: the semantic textual relatedness score for the two sentences
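
To make this layout concrete, here is a minimal sketch that parses two rows in the format described above using only the standard library. The sentences and scores are copied from the cell outputs shown in this notebook; the variable names (`raw_csv`, `pairs`) are illustrative assumptions, and the real file may contain additional columns.

```python
import csv
import io

# Two rows in the described layout. Each Text field is quoted and contains
# a real newline character between the two sentences.
raw_csv = (
    'PairID,Text,Score\n'
    'ENG-train-0000,"It that happens, just pull the plug.\n'
    'if that ever happens, just pull the plug.",1.0\n'
    'ENG-train-0001,"A black dog running through water.\n'
    'A black dog is running through some water.",1.0\n'
)

pairs = []
for row in csv.DictReader(io.StringIO(raw_csv)):
    # Split the Text field into its two sentences.
    first, second = row["Text"].split("\n")
    pairs.append((row["PairID"], first, second, float(row["Score"])))

for pair in pairs:
    print(pair)
```

The CSV reader keeps the embedded newline inside the quoted field, so each record still yields exactly one `Text` value holding both sentences.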

Below we will show you how to load and re-format the provided data file.

In [9]:
# Load the File
df_str_rel = pd.read_csv('/Users/lemarx/Documents/01_projects/SentencesRelatedness24/data/raw/eng_train.csv')
df_str_rel.head()

| | PairID | Text | Score |
|---|---|---|---|
| 0 | ENG-train-0000 | It that happens, just pull the plug.\nif that ... | 1.0 |
| 1 | ENG-train-0001 | A black dog running through water.\nA black do... | 1.0 |
| 2 | ENG-train-0002 | I've been searchingthe entire abbey for you.\n... | 1.0 |
| 3 | ENG-train-0003 | If he is good looking and has a good personali... | 1.0 |
| 4 | ENG-train-0004 | She does not hate you, she is just annoyed wit... | 1.0 |


In [10]:
df_str_rel['Text'].values

array(['It that happens, just pull the plug.\nif that ever happens, just pull the plug.',
       'A black dog running through water.\nA black dog is running through some water.',
       "I've been searchingthe entire abbey for you.\nI'm looking for you all over the abbey.",
       ...,
       "I actually read a chapter or two beyond that point, but my heart wasn't in it any more.\nLets say she's a blend of two types of beings.",
       'A boy gives being in the snow two thumbs up.\nA satisfied cat is perched beside a crystal lamp.',
       'Perhaps it is strange to think about sex constantly these days.\nFew people know how to shoot pool these days.'],
      dtype=object)

In [45]:
len(df_str_rel)

5500

In [32]:
# Creating a column "Split_Text" which is a list of two sentences.
df_str_rel['Split_Text'] = df_str_rel['Text'].apply(lambda x: x.split("\n"))
df_str_rel['Split_Text'].loc[0]

['It that happens, just pull the plug.',
 'if that ever happens, just pull the plug.']

# Dice Score (Overlap Score)

A simple baseline for estimating semantic relatedness between two sentences is to measure the proportion of words that they share.

There are many ways to change the score below. Consider:
1. Removing stop words and/or punctuation
2. Counting duplicate words (currently not counted)
3. Weighting rarer words differently
4. Splitting tokens differently

In [8]:
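def dice_score(s1, s2):
  # Tokenize each lowercased sentence into words and individual punctuation marks.
  s1_split = re.findall(r"\w+|[^\w\s]", s1.lower(), re.UNICODE)
  s2_split = re.findall(r"\w+|[^\w\s]", s2.lower(), re.UNICODE)

  # Compare unique tokens only, so duplicate words are not counted.
  s1_set, s2_set = set(s1_split), set(s2_split)
  total = len(s1_set) + len(s2_set)
  if total == 0:
    return 0.0  # both sentences are empty

  # Dice coefficient: 2 * |A ∩ B| / (|A| + |B|)
  dice_coef = 2 * len(s1_set & s2_set) / total
  return round(dice_coef, 2)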
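
The modifications listed above can be sketched as follows. This is one possible variant, not the baseline used in the rest of the notebook: it drops punctuation and a small hand-picked stop-word list, and it can optionally count duplicate words by comparing multisets. The names `STOP_WORDS`, `tokenize`, and `dice_variant` are hypothetical, and the score uses the standard Dice form 2|A∩B| / (|A| + |B|).

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; swap in a fuller list as needed).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "in"}

def tokenize(sentence, drop_stop_words=True):
    # \w+ keeps runs of word characters only, so punctuation is dropped.
    tokens = re.findall(r"\w+", sentence.lower())
    if drop_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens

def dice_variant(s1, s2, count_duplicates=False, drop_stop_words=True):
    t1 = tokenize(s1, drop_stop_words)
    t2 = tokenize(s2, drop_stop_words)
    if count_duplicates:
        # Multiset intersection: a word repeated in both sentences counts repeatedly.
        overlap = sum((Counter(t1) & Counter(t2)).values())
        total = len(t1) + len(t2)
    else:
        overlap = len(set(t1) & set(t2))
        total = len(set(t1)) + len(set(t2))
    # Two empty token lists have no measurable overlap; avoid dividing by zero.
    return 2 * overlap / total if total else 0.0

print(dice_variant("A black dog running through water.",
                   "A black dog is running through some water."))
```

Weighting rarer words differently (option 3) would need word frequencies from a corpus, so it is left out of this sketch.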

## Calculate Dice Score
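
As a sketch of this step, the snippet below scores three sentence pairs taken from the data preview earlier in the notebook. It declares a Dice function locally (named `dice` here, an illustrative stand-in that uses the standard 2|A∩B| / (|A| + |B|) form and the same tokenization pattern as the baseline) so that it runs on its own. With the training data loaded, the same function could be applied to each entry of `Split_Text`.

```python
import re

def dice(s1, s2):
    # Words and individual punctuation marks, lowercased, as unique tokens.
    t1 = set(re.findall(r"\w+|[^\w\s]", s1.lower()))
    t2 = set(re.findall(r"\w+|[^\w\s]", s2.lower()))
    total = len(t1) + len(t2)
    return 2 * len(t1 & t2) / total if total else 0.0

# Sentence pairs copied from the cell outputs shown earlier in this notebook.
sample_pairs = [
    ("It that happens, just pull the plug.",
     "if that ever happens, just pull the plug."),
    ("A black dog running through water.",
     "A black dog is running through some water."),
    ("A boy gives being in the snow two thumbs up.",
     "A satisfied cat is perched beside a crystal lamp."),
]

dice_scores = [dice(a, b) for a, b in sample_pairs]
for (a, b), score in zip(sample_pairs, dice_scores):
    print(f"{score:.3f}  {a} | {b}")
```

The third pair, whose sentences are unrelated, scores far lower than the first two, although the shared token "a" and the final period still contribute to its overlap.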

In [4]:
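# FastText.load_fasttext_format is deprecated in gensim; load_facebook_model is its replacement
# and also returns a model whose word vectors are available under .wv.
from gensim.models.fasttext import load_facebook_model
fasttext_model = load_facebook_model('/Users/lemarx/Documents/01_projects/SentencesRelatedness24/data/embeddings/cc.en.300.bin')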


In [7]:
word_vector = fasttext_model.wv['\n']
word_vector

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0.

In [30]:
def to_sent_emb(sentence):
    # Sentence embedding: the mean of the FastText vectors of its whitespace-separated tokens.
    word_vectors = [fasttext_model.wv[word] for word in sentence.split() if word in fasttext_model.wv]
    return np.mean(word_vectors, axis=0)

In [37]:
def cosine_similarity(sentence_a, sentence_b):
    # Embed both sentences, then compare the embeddings by the cosine of the angle between them.
    vector_a = to_sent_emb(sentence_a)
    vector_b = to_sent_emb(sentence_b)
    dot_product = np.dot(vector_a, vector_b)
    norm_a = np.linalg.norm(vector_a)
    norm_b = np.linalg.norm(vector_b)

    similarity = dot_product / (norm_a * norm_b)
    return similarity
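
The two cells above can be illustrated without loading the FastText model. The sketch below uses a tiny hand-made embedding table (`toy_vectors`, with purely hypothetical values) in place of `fasttext_model.wv`, averages word vectors into a sentence vector, and guards against the zero-norm case. That guard matters because some tokens, such as `'\n'` above, map to an all-zero vector. The helper names `mean_vector` and `cosine` are assumptions, and the code uses plain Python lists rather than NumPy so that it has no dependencies.

```python
import math

# Hypothetical 3-dimensional word vectors standing in for fasttext_model.wv.
toy_vectors = {
    "black": [1.0, 0.0, 0.0],
    "dog":   [0.0, 1.0, 0.0],
    "cat":   [0.0, 0.8, 0.6],
}

def mean_vector(sentence, vectors, dim=3):
    # Average the vectors of known words; unknown words are skipped.
    known = [vectors[w] for w in sentence.lower().split() if w in vectors]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]

def cosine(u, v):
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here: similarity with a zero vector is 0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)

same = cosine(mean_vector("black dog", toy_vectors), mean_vector("dog black", toy_vectors))
close = cosine(mean_vector("black dog", toy_vectors), mean_vector("black cat", toy_vectors))
empty = cosine(mean_vector("black dog", toy_vectors), mean_vector("???", toy_vectors))
print(same, close, empty)
```

Because the sentence vector is a plain average, "black dog" and "dog black" receive identical embeddings: this baseline ignores word order entirely.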

In [40]:
df_str_rel['cos_sim'] = df_str_rel.apply(lambda row: cosine_similarity(row['Split_Text'][0], row['Split_Text'][1]), axis=1)

In [38]:
sent_pair = df_str_rel['Split_Text'].loc[0]
sim = cosine_similarity(sent_pair[0],sent_pair[1])
print(sim)

0.8440153


In [46]:
true_scores = df_str_rel['Score'].values
pred_scores = df_str_rel['cos_sim'].values

In [47]:
# How well does the baseline correlate with human judgments?
print("Spearman Correlation:", round(spearmanr(true_scores,pred_scores)[0],2))

Spearman Correlation: 0.35
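
Spearman correlation compares the rankings of the two score lists rather than their raw values, so a baseline is rewarded for ordering sentence pairs correctly even when its scores sit on a different scale from the human judgments. A minimal from-scratch sketch, assuming made-up scores (`gold` and `pred` are hypothetical and not taken from the dataset): rank both lists, averaging ranks for ties, then take the Pearson correlation of the ranks. In practice `scipy.stats.spearmanr`, as used above, is the better choice.

```python
import math

def average_ranks(values):
    # Ranks run from 1 to n; tied values share the mean of their positions.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))

gold = [1.0, 0.8, 0.5, 0.3, 0.1]       # hypothetical human scores
pred = [0.95, 0.90, 0.60, 0.70, 0.20]  # hypothetical model scores
print(round(spearman(gold, pred), 2))
```

Only the third and fourth items are ordered differently in the two lists, which is why the correlation for this toy example stays high.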
