# MAFAT 2025 Semantic Retrieval Competition - Exploratory Data Analysis

This notebook performs exploratory data analysis on the Israeli semantic retrieval competition dataset.

## Import Libraries and Setup

First, we'll import all necessary libraries for data analysis and visualization.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.colors import LinearSegmentedColormap

import warnings
warnings.filterwarnings('ignore')


## Download training data

## Data Loading

Load the JSONL dataset containing query-paragraph pairs with relevance scores for the semantic retrieval competition.


In [None]:
import json

file_path = './hsrc/hsrc_train.jsonl'
rows = []
with open(file_path, 'r', encoding='utf-8') as f:
    for line in f:
        rows.append(json.loads(line))

df = pd.DataFrame(rows)


Select the data for analysis and display it.
Note that running the notebook on the full dataset is not feasible in a standard Colab environment.

## EDA

### Overview

The competition data consists of the following columns:
1. query_uuid - unique identifier of the query
2. query - query text
3. paragraphs - the candidate paragraphs for the query, each with a uuid and a passage
4. target_actions - the target action (relevance label) for each paragraph of the query
5. case_name - the source corpus of the data


## Initial Data Exploration

Let's first examine the structure of our dataset by looking at the column names and basic information.


In [None]:
df.columns

### Dataset Preview

Display the first few rows to understand the data structure and content.


In [None]:
data_head = df.head()
display(data_head)

### Dataset Composition

Examine the different case names (corpuses) present in the dataset to understand the variety of content.


In [None]:
# Display the unique corpuses in the dataset
unique_corpuses = df['case_name'].unique()
print("Unique corpuses:", unique_corpuses)

## Data Preprocessing

Create a working copy of the dataset for transformation and analysis.


In [None]:
df = df.copy()

### Data Structure Transformation

The dataset contains nested dictionaries for paragraphs and target actions. We need to unwrap these structures to make the data more accessible for analysis.

These functions will:
- Extract paragraph UUIDs and text content from the nested paragraph structure
- Extract relevance scores from the target actions structure
- Maintain proper ordering based on paragraph indices


In [None]:
import pandas as pd

def unwrap_paragraphs(par_dict):
    # sort items by the numeric suffix of "paragraph_i"
    items = sorted(
        par_dict.items(),
        key=lambda kv: int(kv[0].split('_')[-1])
    )
    uuids   = [v['uuid']    for _, v in items]
    texts   = [v['passage'] for _, v in items]
    return uuids, texts

def unwrap_targets(tgt_dict):
    # sort items by the numeric suffix of "target_action_i"
    items = sorted(
        tgt_dict.items(),
        key=lambda kv: int(kv[0].split('_')[-1])
    )
    rels = [v for _, v in items]
    return rels

# apply to each row
df['paragraph_uuids'], df['paragraph_texts'] = \
    zip(*df['paragraphs'].map(unwrap_paragraphs))

df['relevances'] = df['target_actions'].map(unwrap_targets)

df.rename(columns={'case_name': 'source'}, inplace=True)

# sanity checks
assert all(len(x)==20 for x in df['paragraph_uuids'])
assert all(len(x)==20 for x in df['relevances'])
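
The numeric-suffix sort used above matters once a key index reaches two digits. The following is a minimal sketch on a made-up record; the keys, UUIDs and passages are hypothetical and only mimic the `paragraph_i` layout the functions assume. A plain string sort would put `paragraph_10` before `paragraph_2`.

```python
# Toy record with made-up values, mimicking the assumed "paragraph_i" key layout
toy_paragraphs = {
    "paragraph_10": {"uuid": "p10", "passage": "tenth"},
    "paragraph_2": {"uuid": "p2", "passage": "second"},
    "paragraph_1": {"uuid": "p1", "passage": "first"},
}

def numeric_suffix(key):
    """Return the integer after the last underscore, e.g. 'paragraph_10' -> 10."""
    return int(key.rsplit("_", 1)[-1])

# Default string ordering vs. ordering by the numeric suffix
lexicographic = sorted(toy_paragraphs)
numeric = sorted(toy_paragraphs, key=numeric_suffix)

print(lexicographic)  # ['paragraph_1', 'paragraph_10', 'paragraph_2']
print(numeric)        # ['paragraph_1', 'paragraph_2', 'paragraph_10']
```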

### Data Reshaping - Creating Long Format

Transform the data from wide format (one row per query with 20 paragraphs) to long format (one row per query-paragraph pair). This makes it easier to analyze relationships between queries, paragraphs, and relevance scores.


In [None]:
from IPython.display import display, HTML

# 0..19 positions
df['position'] = df['paragraph_uuids'].map(lambda lst: list(range(len(lst))))

# explode all in lock‐step
df_long = (
    df
    .explode(['paragraph_uuids','paragraph_texts','relevances','position'])
    .rename(columns={
        'paragraph_uuids':'paragraph_uuid',
        'paragraph_texts':'paragraph_text',
        'relevances':'relevance'
    })
    .reset_index(drop=True)
)

# keep only the columns you need
df_long = df_long[[
    'query_uuid','query','source',
    'position','paragraph_uuid','paragraph_text','relevance'
]]

print(df_long.shape)   # should be n_queries * 20 rows
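
To see what the lock-step explode does, here is a small sketch on a toy frame with two queries and two paragraphs each. The values are invented for illustration, and multi-column `explode` assumes pandas 1.3 or newer.

```python
import pandas as pd

# Toy wide-format frame: one row per query, list-valued columns of equal length
toy = pd.DataFrame({
    "query_uuid": ["q1", "q2"],
    "paragraph_uuids": [["a", "b"], ["c", "d"]],
    "relevances": [["0", "3"], ["1", "0"]],
})
toy["position"] = toy["paragraph_uuids"].map(lambda lst: list(range(len(lst))))

# Exploding several columns together keeps their elements aligned row by row
toy_long = (
    toy.explode(["paragraph_uuids", "relevances", "position"])
    .reset_index(drop=True)
)

print(toy_long.shape)  # (4, 4): 2 queries * 2 paragraphs, 4 columns
```

Each list element becomes its own row, and the scalar `query_uuid` is repeated for every paragraph of that query.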

### Sample Data by Source

Display one example from each source to understand the content and structure differences across different corpuses.


In [None]:
# Display one example per source: the first row of each source group
examples_per_source = df_long.groupby('source').head(1)
# Render as an HTML table inside a right-to-left container;
# DataFrame.to_html supports index=False (Styler.to_html does not)
table_html = examples_per_source.to_html(index=False)
display(HTML(f'<div dir="rtl">{table_html}</div>'))

## Dataset Statistics

Generate summary statistics to understand the scale and composition of our dataset.


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="whitegrid")

print("Rows:",      len(df_long))
print("Queries:",   df_long['query_uuid'].nunique())
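print("Unique paragraphs:", df_long['paragraph_uuid'].nunique())
# key=str keeps the sort well-defined even if label types are mixed
print("Relevance labels:", sorted(df_long['relevance'].unique(), key=str))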
# df_long.info()

### Data Quality Check

Identify any problematic relevance labels that need to be cleaned or handled specially.


In [None]:
wrong_relevance_labels = df_long[df_long['relevance'].isin(['', '-', '9'])]
wrong_relevance_labels[['query_uuid','paragraph_uuid','relevance']]

### Data Cleaning

Clean the relevance scores by:
- Converting to numeric format
- Removing invalid entries
- Filtering to valid relevance range (0-4)


In [None]:
df_long['relevance_num'] = pd.to_numeric(df_long['relevance'],
                                         errors='coerce')
df_long = df_long.dropna(subset=['relevance_num'])
# Keep only labels inside the valid 0-4 range (both bounds inclusive)
df_long = df_long[df_long['relevance_num'].between(0, 4)]
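
The cleaning steps can be illustrated on a handful of labels. This is a sketch with invented values that resemble the problematic labels checked earlier, not the actual data.

```python
import pandas as pd

# Made-up labels: valid scores plus an empty string, a dash and an out-of-range value
toy_labels = pd.Series(["0", "3", "", "-", "9", "4"])

# errors="coerce" turns anything that is not a number into NaN
numeric = pd.to_numeric(toy_labels, errors="coerce")

# between() is False for NaN, so invalid and out-of-range labels drop out together
valid = numeric[numeric.between(0, 4)]

print(valid.tolist())  # [0.0, 3.0, 4.0]
```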

## Relevance Distribution Analysis

### Overall Relevance Distribution

Visualize the distribution of relevance scores across all query-paragraph pairs to understand label balance.


In [None]:
order = sorted(df_long['relevance_num'].unique(), key=int)
plt.figure(figsize=(6,4))
sns.countplot(x='relevance_num',
              data=df_long,
              order=order,
              palette="Blues_d")
plt.title("Overall Relevance Label Counts")
plt.xlabel("Relevance (0–4)")
plt.ylabel("Count")
plt.show()

### Relevant Paragraphs per Query by Case

Analyze how many relevant paragraphs (relevance > 0) each query has, broken down by case name. This helps understand the difficulty and characteristics of different corpuses.


In [None]:
# Count relevant (relevance > 0) paragraphs per query; summing the boolean mask
# keeps queries that have no relevant paragraph (count 0)
tmp = (
    (df_long['relevance_num'] > 0)
    .groupby([df_long['source'], df_long['query_uuid']])
    .sum()
    .rename('n_relevant')
    .reset_index()
)
# Create a plot for each source, one color per source
sources = df_long['source'].unique()
palette = sns.color_palette("husl", len(sources))
for case, color in zip(sources, palette):
    case_data = tmp[tmp['source'] == case]
    # Explicit category order: every count in the range gets a slot,
    # and tick labels stay aligned with the bars
    order = list(range(case_data['n_relevant'].min(), case_data['n_relevant'].max() + 1))
    plt.figure(figsize=(6, 4))
    sns.countplot(x='n_relevant', data=case_data, color=color, order=order)
    plt.title(f"Number of Paragraphs with Relevance>0 per Query ({case})")
    plt.xlabel("Count of paras with relevance>0")
    plt.ylabel("Number of Queries")
    plt.show()
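
One detail to keep in mind when counting relevant paragraphs per query: filtering to `relevance > 0` before grouping silently drops queries that have no relevant paragraph at all, whereas summing a boolean mask keeps them with a count of 0. A minimal sketch on invented data:

```python
import pandas as pd

# Toy long-format data: q1 has one relevant paragraph, q2 has none
toy = pd.DataFrame({
    "query_uuid": ["q1", "q1", "q2", "q2"],
    "relevance_num": [0, 3, 0, 0],
})

# Filter first, then count: q2 disappears from the result
filtered = toy[toy["relevance_num"] > 0].groupby("query_uuid").size()

# Sum the boolean mask per query: q2 is kept with a count of 0
complete = (toy["relevance_num"] > 0).groupby(toy["query_uuid"]).sum()

print(filtered.to_dict())  # {'q1': 1}
print(complete.to_dict())  # {'q1': 1, 'q2': 0}
```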

## Distribution of paragraph length

In [None]:
from collections import Counter

# Choose: "chars" for characters, "words" for words
length_type = "words"

lengths = []

# Read the JSONL file
with open(file_path, "r", encoding="utf-8") as f:
    for line in f:
        if line.strip():
            data = json.loads(line)
            paragraphs = data.get("paragraphs", {})
            for para in paragraphs.values():
                passage = para.get("passage", "")
                if length_type == "chars":
                    lengths.append(len(passage))
                elif length_type == "words":
                    lengths.append(len(passage.split()))

# Stats
print(f"Total passages: {len(lengths)}")
print(f"Min length: {min(lengths)}")
print(f"Max length: {max(lengths)}")
print(f"Average length: {sum(lengths)/len(lengths):.2f}")

# Distribution table
distribution = Counter(lengths)
print("\nSample distribution (length: count):")
for length, count in sorted(distribution.items())[:20]:
    print(f"{length}: {count}")

# Plot histogram
plt.figure(figsize=(10, 6))
plt.hist(lengths, bins=30, color='skyblue', edgecolor='black')
plt.title(f"Distribution of Passage Lengths ({length_type})")
plt.xlabel(f"Length in {length_type}")
plt.ylabel("Number of passages")
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
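
The two length measures used above behave differently on whitespace. The sketch below uses invented passages: `str.split()` with no argument collapses runs of whitespace and returns an empty list for an empty string, while `len()` counts every character.

```python
# Made-up passages: normal text, text with a double space, and an empty passage
passages = ["one two three", "four  five", ""]

# Word count: split() with no argument ignores repeated whitespace
word_lengths = [len(p.split()) for p in passages]

# Character count: every character, including spaces
char_lengths = [len(p) for p in passages]

print(word_lengths)  # [3, 2, 0]
print(char_lengths)  # [13, 10, 0]
```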

## Advanced Feature Analysis

### Cosine Similarity Analysis

Calculate cosine similarity between queries and paragraphs using sentence embeddings from the multilingual-e5-base (ME5) model, to understand the relationship between semantic similarity and relevance scores.


In [None]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def calculate_cosine_similarity_me5_optimized(df, sample_size=500):
    """Calculate query-paragraph cosine similarity with multilingual-e5-base embeddings.

    Returns a Series of similarities indexed like the sampled rows,
    and the sampled DataFrame.
    """
    # Take the first `sample_size` queries (not a random sample) for faster processing
    print(f"Sampling {sample_size} queries from {df['query_uuid'].nunique()} total queries")
    sampled_queries = df['query_uuid'].unique()[:sample_size]
    df_sample = df[df['query_uuid'].isin(sampled_queries)].copy()

    # Store results by row index so they stay aligned with df_sample,
    # even though groupby iterates in sorted query_uuid order
    similarities = pd.Series(np.nan, index=df_sample.index, dtype=float)

    # Use smaller, faster model
    print("Loading multilingual-e5-base (faster than large)...")
    model = SentenceTransformer('intfloat/multilingual-e5-base')
    print("Model loaded successfully!")

    # Process in batches for efficiency
    batch_size = 32

    for i, (query_uuid, group) in enumerate(df_sample.groupby('query_uuid')):
        if i % 50 == 0:  # More frequent progress updates
            print(f"Processing query {i+1}/{len(sampled_queries)}")

        # Truncate to 512 characters (not tokens) to speed up processing
        query_text = group['query'].iloc[0][:512]
        paragraph_texts = [text[:512] for text in group['paragraph_text']]

        try:
            # E5 models expect "query: " / "passage: " prefixes on their inputs
            query_with_prefix = f"query: {query_text}"
            paragraphs_with_prefix = [f"passage: {text}" for text in paragraph_texts]

            # Encode query
            query_embedding = model.encode([query_with_prefix], batch_size=1)

            # Encode paragraphs in batch
            paragraph_embeddings = model.encode(paragraphs_with_prefix, batch_size=batch_size)

            # Calculate cosine similarity and write it to this query's rows
            cosine_sims = cosine_similarity(query_embedding, paragraph_embeddings).flatten()
            similarities.loc[group.index] = cosine_sims

        except Exception as e:
            print(f"Error processing query {query_uuid}: {e}")
            # Fall back to 0.0 for this query's rows
            similarities.loc[group.index] = 0.0

    return similarities, df_sample

print("Calculating cosine similarities using optimized ME5...")
similarities, df_sample = calculate_cosine_similarity_me5_optimized(df_long, sample_size=500)
# Index-aligned assignment: each similarity lands on its own query-paragraph row
df_sample['cosine_similarity_me5'] = similarities
print("ME5 cosine similarity calculation completed!")

# Use the sampled data for analysis
df_long = df_sample.copy()
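
For reference, the cosine similarity computed above is the dot product of two vectors divided by the product of their norms. The following is a minimal NumPy sketch on hand-picked 2-D vectors standing in for embeddings; the helper name `cosine_sim` and the vectors are illustrative assumptions, not part of the notebook's pipeline.

```python
import numpy as np

def cosine_sim(query_vec, passage_matrix):
    """Cosine similarity between one query vector and each row of a passage matrix."""
    q = query_vec / np.linalg.norm(query_vec)
    p = passage_matrix / np.linalg.norm(passage_matrix, axis=1, keepdims=True)
    return p @ q

# Hand-picked stand-ins for embeddings: same direction, orthogonal, and 45 degrees apart
query_vec = np.array([1.0, 0.0])
passage_matrix = np.array([[2.0, 0.0], [0.0, 5.0], [1.0, 1.0]])

sims = cosine_sim(query_vec, passage_matrix)
print([round(float(s), 3) for s in sims])  # [1.0, 0.0, 0.707]
```

Because both sides are normalized, vector length has no effect: only the angle between the query and each passage vector determines the score.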

### Correlation Analysis with Cosine Similarity

Examine the relationship between ME5 cosine similarity and relevance scores.


In [None]:
# Update the correlation analysis to include ME5 cosine similarity
num_cols = ['relevance_num', 'cosine_similarity_me5']

# Compute the correlation matrix
corr = df_long[num_cols].corr()

# Plot a larger heatmap
plt.figure(figsize=(8,6))
sns.heatmap(corr, annot=True, fmt=".3f", cmap="vlag", center=0, linewidths=.5,
            square=True, cbar_kws={"shrink": .8})
plt.title("Correlation Matrix with ME5 Cosine Similarity")
plt.tight_layout()
plt.show()

# Print features sorted by absolute correlation with relevance
print("Features sorted by |correlation| with relevance:")
relevance_corr = corr['relevance_num'].drop('relevance_num').abs().sort_values(ascending=False)
for feature, correlation in relevance_corr.items():
    print(f"{feature}: {correlation:.4f}")

### Cosine Similarity Distribution by Relevance

Visualize how cosine similarity values are distributed across different relevance scores.


In [None]:
# Create box plot showing cosine similarity distribution by relevance score
plt.figure(figsize=(10, 6))
sns.boxplot(x='relevance_num', y='cosine_similarity_me5', data=df_long, palette="viridis")
plt.title("ME5 Cosine Similarity Distribution by Relevance Score")
plt.xlabel("Relevance Score")
plt.ylabel("ME5 Cosine Similarity")
plt.show()

# Create violin plot for more detailed distribution
plt.figure(figsize=(10, 6))
sns.violinplot(x='relevance_num', y='cosine_similarity_me5', data=df_long, palette="viridis")
plt.title("ME5 Cosine Similarity Distribution by Relevance Score (Violin Plot)")
plt.xlabel("Relevance Score")
plt.ylabel("ME5 Cosine Similarity")
plt.show()

# Print summary statistics
print("ME5 Cosine Similarity Statistics by Relevance Score:")
print(df_long.groupby('relevance_num')['cosine_similarity_me5'].describe())

### Scatter Plot Analysis

Examine the direct relationship between cosine similarity and relevance scores.


In [None]:
# Scatter plot with trend line
plt.figure(figsize=(10, 6))
sns.scatterplot(x='cosine_similarity_me5', y='relevance_num', data=df_long, alpha=0.6)
sns.regplot(x='cosine_similarity_me5', y='relevance_num', data=df_long, scatter=False, color='red')
plt.title("Relationship between ME5 Cosine Similarity and Relevance Score")
plt.xlabel("ME5 Cosine Similarity")
plt.ylabel("Relevance Score")
plt.show()

# Calculate and display Pearson correlation coefficient
from scipy.stats import pearsonr
correlation, p_value = pearsonr(df_long['cosine_similarity_me5'], df_long['relevance_num'])
print(f"Pearson correlation between ME5 cosine similarity and relevance: {correlation:.4f}")
print(f"P-value: {p_value:.4e}")

# Calculate Spearman correlation (rank-based)
from scipy.stats import spearmanr
spearman_corr, spearman_p = spearmanr(df_long['cosine_similarity_me5'], df_long['relevance_num'])
print(f"Spearman correlation between ME5 cosine similarity and relevance: {spearman_corr:.4f}")
print(f"P-value: {spearman_p:.4e}")
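
The two coefficients answer slightly different questions: Pearson measures linear association, while Spearman measures monotonic association through ranks. Since relevance is an ordinal label, the rank-based view is often the more natural one. A small sketch on invented numbers (a perfectly monotonic but non-linear relationship) shows the difference:

```python
from scipy.stats import pearsonr, spearmanr

# Invented data: y grows monotonically with x, but not linearly
x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]

pearson_r, _ = pearsonr(x, y)
spearman_r, _ = spearmanr(x, y)

# Spearman is exactly 1 for any strictly increasing relationship;
# Pearson falls below 1 because the relationship is not a straight line
print(f"Pearson:  {pearson_r:.4f}")
print(f"Spearman: {spearman_r:.4f}")
```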