In [1]:
#!/usr/bin/env python
# -*- coding:utf-8 -*-
__author__ = 'Author'
__email__ = 'Email'

# SemEval 2024 Task 1: Semantic Textual Relatedness (STR)
(These instructions are adapted from the [official ones](https://github.com/semantic-textual-relatedness/Semantic_Relatedness_SemEval2024/blob/main/STR_Baseline.ipynb).) \
[Dataset](https://github.com/semantic-textual-relatedness/Semantic_Relatedness_SemEval2024#dataset) | 
[Languages](https://github.com/semantic-textual-relatedness/Semantic_Relatedness_SemEval2024#languages) | 
[Shared Task Starter Kit](https://github.com/semantic-textual-relatedness/Semantic_Relatedness_SemEval2024#shared-task-starter-kit) | 
[Citing This Work](https://github.com/semantic-textual-relatedness/Semantic_Relatedness_SemEval2024#citing-this-work)

---

## Introduction
Welcome to the instructional Jupyter Notebook for SemEval 2024 Task 1: Semantic Textual Relatedness (STR). \
This guide walks through the task setup, data loading, a baseline method, evaluation, and the submission process.

### Tracks
SemEval 2024 Task 1 offers three distinct tracks, each focusing on a unique aspect of Semantic Textual Relatedness. \
Understanding these tracks will help you decide where your submission fits best:
+ **Track A: Supervised** \
  This track is for submissions that utilize labeled data in the target language. \
  It's ideal if you're leveraging datasets with predefined semantic relationships for model training.
+ **Track B: Unsupervised** \
  Choose this track if your submission does not use labeled data in any language. \
  This track is suitable for approaches that rely on unsupervised learning techniques or intrinsic textual features.
+ **Track C: Cross-lingual** \
  If your submission involves using labeled data from a language other than the target language, this is the track for you. \
  It’s designed for exploring semantic relationships across different languages.

### Choosing Your Track
When deciding which track to submit to, consider the type of data and methods you are using:
+ Opt for Track A if your approach is built on labeled data specifically in the target language.
+ Choose Track B if your method operates without any labeled data, regardless of the language.
+ Select Track C if you are employing labeled data from a language different from the target language, focusing on cross-lingual semantic understanding.
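
The selection rules above can be sketched as a small helper. The function name `choose_track` and its arguments are hypothetical, and treating the mixed case (labeled data in both the target language and other languages) as Track A is an assumption; check the official task description for such cases.

```python
from typing import Iterable


def choose_track(labeled_languages: Iterable[str], target_language: str) -> str:
    """Return 'A', 'B', or 'C' following the rules listed above (illustrative only)."""
    labeled = set(labeled_languages)
    if not labeled:
        # No labeled data in any language: unsupervised track.
        return "B"
    if target_language in labeled:
        # Labeled data in the target language: supervised track.
        # Assumption: this also covers the mixed case.
        return "A"
    # Labeled data only from other languages: cross-lingual track.
    return "C"


print(choose_track(["eng"], "eng"))
print(choose_track([], "eng"))
print(choose_track(["esp"], "eng"))
```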

### Languages

The STR task focuses on the following 14 languages:


1. Afrikaans (_afr_ released)
2. Algerian Arabic (_arq_ released)
3. Amharic (_amh_ released)
4. English (_eng_ released)
5. Hausa (_hau_ released)
6. Indonesian
7. Hindi
8. Kinyarwanda
9. Marathi (_mar_ released)
10. Modern Standard Arabic (_arb_ released)
11. Moroccan Arabic (_ary_ released)
12. Punjabi
13. Spanish (_esp_ released)
14. Telugu (_tel_ released)

### Datasets
The STR dataset is available in the data folder or can be downloaded from Hugging Face (coming soon).
+ [TrackA Folder](https://github.com/semantic-textual-relatedness/Semantic_Relatedness_SemEval2024/tree/main/Track%20A)
+ [TrackB Folder](https://github.com/semantic-textual-relatedness/Semantic_Relatedness_SemEval2024/tree/main/Track%20B)
+ [TrackC Folder](https://github.com/semantic-textual-relatedness/Semantic_Relatedness_SemEval2024/tree/main/Track%20C)

### Please Join
+ [Join Google Group](https://groups.google.com/forum/#!forum/semrel-semeval-participants/join)
+ [Follow us on Twitter](https://twitter.com/SemRel2024)
+ [Join Task Slack Channel](https://join.slack.com/t/semrelsemeval2024/shared_invite/zt-2446ppar5-62koodIDFC9bCRMlR0ATkA)

### Reference
+ https://semeval.github.io/SemEval2024/tasks
+ https://semantic-textual-relatedness.github.io/
+ https://codalab.lisn.upsaclay.fr/competitions/15715
+ https://github.com/semantic-textual-relatedness/Semantic_Relatedness_SemEval2024

---

# Implementation

---
## Getting Started

### Virtual Environment
A virtual environment allows you to manage dependencies and isolate your project to prevent any conflicts with other work you may be doing. \
For this project, we highly recommend using a virtual environment to ensure a smooth and consistent development experience. \
There are several tools available for creating virtual environments. \
Two popular options are: [pyenv](https://github.com/pyenv/pyenv) and [conda](https://medium.com/@mrshininnnnn/virtual-environments-for-python-6ab3802fe87e).

### Required Libraries
+ numpy >= 1.26.2
+ pandas >= 2.1.4

Simply run:
```
pip install -r requirements.txt
```

---
## Libraries

In [25]:
# dependency
# built-in
import os, random
# public
import numpy as np
import pandas as pd
# private
from config import Config
from src.utils import helper

%load_ext autoreload
%autoreload 2
%config Completer.use_jedi = False

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


---
## Configurations

In [26]:
# initialize the config class
config = Config()
random.seed(config.seed)
np.random.seed(config.seed)
for k, v in config.__dict__.items():
    print(f'{k}: {v}')

seed: 0
track: a
tgt_lan: eng
method: base
CURR_PATH: ./
RESOURCE_PATH: ./res
DATA_PATH: ./res/data
TRACK_PATH: ./res/data/a
LAN_PATH: ./res/data/a/eng
TRAIN_CSV: ./res/data/a/eng/eng_train.csv
DEV_CSV: ./res/data/a/eng/eng_dev.csv
LOG_PATH: ./res/log/a/eng/base/0
LOG_TXT: ./res/log/a/eng/base/0/console_log.txt
RESULTS_PATH: ./res/results/a/eng/base
RESULTS_CSV: ./res/results/a/eng/base/0.csv


---
## Datasets

Each training example pairs two English-language sentences with a real-valued semantic textual relatedness score between 0 and 1.

The data is structured as a CSV file with the following fields:
- PairID: a unique identifier for the sentence pair
- Text: two sentences separated by a newline ('\n') character
- Score: the semantic textual relatedness score for the two sentences

Below we will show you how to load and re-format the provided data file.

In [29]:
# path from config
config.TRAIN_CSV, config.DEV_CSV

('./res/data/a/eng/eng_train.csv', './res/data/a/eng/eng_dev.csv')

#### Train

In [5]:
# read train csv
raw_train_df = pd.read_csv(config.TRAIN_CSV)

In [6]:
raw_train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5500 entries, 0 to 5499
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   PairID  5500 non-null   object 
 1   Text    5500 non-null   object 
 2   Score   5500 non-null   float64
dtypes: float64(1), object(2)
memory usage: 129.0+ KB


In [7]:
raw_train_df.head()

Unnamed: 0,PairID,Text,Score
0,ENG-train-0000,"It that happens, just pull the plug.\nif that ...",1.0
1,ENG-train-0001,A black dog running through water.\nA black do...,1.0
2,ENG-train-0002,I've been searchingthe entire abbey for you.\n...,1.0
3,ENG-train-0003,If he is good looking and has a good personali...,1.0
4,ENG-train-0004,"She does not hate you, she is just annoyed wit...",1.0


In [8]:
train_xs1, train_xs2 = map(list, zip(*[tuple(row['Text'].split('\n')) for idx, row in raw_train_df.iterrows()]))

In [9]:
train_ys = raw_train_df.Score.tolist()
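
The row-by-row split of the `Text` field can also be written with vectorized pandas string operations. The sketch below is an alternative under stated assumptions, not the notebook's own code: the helper name `split_pairs` and the toy DataFrame are hypothetical, and splitting on only the first newline is a choice made here so that a stray newline inside the second sentence does not break the unpacking.

```python
import pandas as pd


def split_pairs(df: pd.DataFrame, text_col: str = "Text"):
    """Split a newline-separated sentence pair column into two lists."""
    # Split on the first newline only; expand=True returns one column per part.
    parts = df[text_col].str.split("\n", n=1, expand=True)
    # Fail loudly if any row does not contain exactly two parts.
    if parts.shape[1] != 2 or parts[1].isna().any():
        raise ValueError(f"Every '{text_col}' value must contain two sentences")
    return parts[0].tolist(), parts[1].tolist()


# Toy data in the same layout as the task CSV (values are made up).
toy_df = pd.DataFrame({
    "PairID": ["toy-0000", "toy-0001"],
    "Text": ["A dog runs.\nA dog is running.", "I like tea.\nThe sky is blue."],
})
toy_xs1, toy_xs2 = split_pairs(toy_df)
print(toy_xs1)
print(toy_xs2)
```

With the notebook's data, the equivalent call would be `split_pairs(raw_train_df)`.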

#### Dev

In [10]:
# read dev csv
raw_dev_df = pd.read_csv(config.DEV_CSV)

In [11]:
raw_dev_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250 entries, 0 to 249
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   PairID  250 non-null    object
 1   Text    250 non-null    object
dtypes: object(2)
memory usage: 4.0+ KB


In [12]:
raw_dev_df.head()

Unnamed: 0,PairID,Text
0,ENG-dev-0000,The story is gripping and interesting.\nIt's a...
1,ENG-dev-0001,The majority of Southeast Alaska 's area is pa...
2,ENG-dev-0002,and from your post i think you are to young to...
3,ENG-dev-0003,The film 's success also made Dreamworks Anima...
4,ENG-dev-0004,I am still confused about how I feel about thi...


In [13]:
dev_xs1, dev_xs2 = map(list, zip(*[tuple(row['Text'].split('\n')) for idx, row in raw_dev_df.iterrows()]))

---
## Method

In [14]:
model = helper.get_model(config)

In [23]:
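# predict on the training set
train_ys_ = model.predict(train_xs1, train_xs2)
# evaluate
# How well does the baseline correlate with human judgments?
# Note: this reports Pearson correlation; the official metric is Spearman (see Evaluation).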
print("Pearson Correlation:", round(np.corrcoef(train_ys, train_ys_)[0][1], 2))

Pearson Correlation: 0.58


In [16]:
# dev
dev_ys_ = model.predict(dev_xs1, dev_xs2)
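
The implementation behind `helper.get_model` is not shown in this notebook. As a point of reference only, a minimal lexical-overlap model with the same `predict(xs1, xs2)` interface might look like the sketch below. The class name `DiceOverlapModel`, the lowercasing, and the whitespace tokenization are assumptions for illustration, not the author's method.

```python
class DiceOverlapModel:
    """Scores a sentence pair by the Dice coefficient of their token sets."""

    @staticmethod
    def _dice(sentence1: str, sentence2: str) -> float:
        # Assumption: lowercase and split on whitespace as a crude tokenizer.
        tokens1 = set(sentence1.lower().split())
        tokens2 = set(sentence2.lower().split())
        if not tokens1 and not tokens2:
            return 0.0
        # Dice coefficient: 2 * |A ∩ B| / (|A| + |B|), always in [0, 1].
        return 2 * len(tokens1 & tokens2) / (len(tokens1) + len(tokens2))

    def predict(self, xs1, xs2):
        # One score per aligned pair of sentences.
        return [self._dice(a, b) for a, b in zip(xs1, xs2)]


toy_model = DiceOverlapModel()
toy_scores = toy_model.predict(
    ["a black dog", "hello world"],
    ["a black cat", "goodbye moon"],
)
print(toy_scores)
```

Such a model needs no training data, so it only illustrates the interface; it is not a claim about what the `base` method computes.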

---
## Output
The submission file has two columns: `PairID` and `Pred_Score`.

In [28]:
raw_dev_df['Pred_Score'] = dev_ys_

In [18]:
raw_dev_df.head()

Unnamed: 0,PairID,Text,Pred_Score
0,ENG-dev-0000,The story is gripping and interesting.\nIt's a...,0.17
1,ENG-dev-0001,The majority of Southeast Alaska 's area is pa...,0.31
2,ENG-dev-0002,and from your post i think you are to young to...,0.14
3,ENG-dev-0003,The film 's success also made Dreamworks Anima...,0.22
4,ENG-dev-0004,I am still confused about how I feel about thi...,0.12


In [30]:
raw_dev_df[['PairID', 'Pred_Score']].head()

Unnamed: 0,PairID,Pred_Score
0,ENG-dev-0000,0.17
1,ENG-dev-0001,0.31
2,ENG-dev-0002,0.14
3,ENG-dev-0003,0.22
4,ENG-dev-0004,0.12


In [32]:
config.RESULTS_CSV

'./res/results/a/eng/base/0.csv'

In [19]:
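# make sure the results directory exists before writing (harmless if it already does)
os.makedirs(config.RESULTS_PATH, exist_ok=True)
raw_dev_df[['PairID', 'Pred_Score']].to_csv(config.RESULTS_CSV, index=False)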

---
## Evaluation
In SemEval 2024 Task 1, the effectiveness of an approach in Semantic Textual Relatedness will be rigorously evaluated through a set of established procedures and metrics on [CodaLab](https://codalab.lisn.upsaclay.fr/competitions/15715). \
This section outlines the evaluation process, detailing the specific metrics used, along with guidelines for assessing performance on various datasets throughout the competition.

### Metric
The official evaluation metric is the Spearman correlation between the predicted relatedness scores and the human-annotated gold scores. \
Because Spearman correlation compares rankings rather than raw values, it measures how well the ordering of the predicted scores agrees with human judgments. \
The evaluation script is available in the following GitHub repository: [Semantic Relatedness SemEval 2024 GitHub Repository](https://github.com/semantic-textual-relatedness/Semantic_Relatedness_SemEval2024).
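
For a quick local check, Spearman correlation can be computed as the Pearson correlation of the rank-transformed scores, using only the libraries already required here. This is a minimal sketch and not the official evaluation script; the function name `spearman_correlation` and the toy scores are hypothetical.

```python
import numpy as np
import pandas as pd


def spearman_correlation(gold, pred) -> float:
    """Spearman correlation as the Pearson correlation of the ranks."""
    # rank() assigns tied values their average rank by default.
    gold_ranks = pd.Series(gold, dtype=float).rank()
    pred_ranks = pd.Series(pred, dtype=float).rank()
    return float(np.corrcoef(gold_ranks, pred_ranks)[0, 1])


# Toy scores (made up): the two lists differ in value but not in ordering.
gold_scores = [0.0, 0.25, 0.5, 1.0]
pred_scores = [0.1, 0.2, 0.9, 0.95]
print("Spearman Correlation:", round(spearman_correlation(gold_scores, pred_scores), 2))
```

In this notebook, the same function could be applied to `train_ys` and `train_ys_` to complement the Pearson value reported in the Method section.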

### Train

For participants in Track A, which focuses on supervised methods, the provided dataset includes both a training set and a dev set. \
The training set, complete with labels, is crucial for building and refining your models. \
While evaluation on the training set is useful for checking the implementation, the true measure of a model's performance lies in its generalization to unseen data, namely the dev and test sets.

### Dev
The development sets across all tracks come without labels. \
To evaluate the performance on the dev set, we need to submit results to the official evaluation hosted on CodaLab.

### Test
The final assessment will be based on the performance on the test set. \
Similar to the dev set evaluation, we are required to upload predictions for the test set to the official evaluation on CodaLab within the following time window:
+ Evaluation Start: 10 January 2024
+ Evaluation End: 31 January 2024

---
# Submission
Please follow the steps below carefully to ensure that your work is properly evaluated and considered in the competition.

## 1. Create a CodaLab Account
Begin by setting up an account on CodaLab, a platform widely used for academic competitions in machine learning and computational linguistics. \
We can create an account at [CodaLab's official website](https://codalab.lisn.upsaclay.fr/).

## 2. Register for the Task
Once the account is active, navigate to the [SemEval 2024 Task 1](https://codalab.lisn.upsaclay.fr/competitions/15715) on CodaLab and complete the registration process. \
This step is required to participate and to gain access to result submission.

## 3. Submit Results
For developing, refining, and testing our methods, we will need to submit results for both the dev set and the test set for official evaluation. \
Ensure that the submissions adhere to the format specified in the [task guidelines](https://codalab.lisn.upsaclay.fr/competitions/15715#participate). \
Submissions that do not meet the required format may not be evaluated.
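
Before uploading, it can help to run a quick local sanity check on the prediction table. The sketch below only checks what this notebook states about the format (two columns, `PairID` and `Pred_Score`) plus a few common-sense conditions; the function name `check_submission` and the extra checks are assumptions, and the task guidelines remain the authoritative specification.

```python
import pandas as pd


def check_submission(df: pd.DataFrame, expected_ids=None) -> list:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    if list(df.columns) != ["PairID", "Pred_Score"]:
        problems.append(f"unexpected columns: {list(df.columns)}")
        return problems  # the remaining checks need both columns
    if df["PairID"].duplicated().any():
        problems.append("duplicate PairID values")
    if not pd.api.types.is_numeric_dtype(df["Pred_Score"]):
        problems.append("Pred_Score is not numeric")
    if df["Pred_Score"].isna().any():
        problems.append("missing Pred_Score values")
    # Optional: compare against the IDs of the set being predicted.
    if expected_ids is not None and set(df["PairID"]) != set(expected_ids):
        problems.append("PairID values do not match the expected IDs")
    return problems


# Toy submission (values are made up).
toy_submission = pd.DataFrame({
    "PairID": ["toy-0000", "toy-0001"],
    "Pred_Score": [0.17, 0.31],
})
print(check_submission(toy_submission, expected_ids=["toy-0000", "toy-0001"]))
```

With the notebook's data, the call would be `check_submission(raw_dev_df[['PairID', 'Pred_Score']], expected_ids=raw_dev_df['PairID'])`.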