<a href="https://colab.research.google.com/github/Jagoda222/LoLa---group-8/blob/main/dataset_triplets_1000.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sampling 10,000 Triplets from the SNLI Dataset

This notebook preprocesses the SNLI training set and samples **10,000 triplets** (30,000 rows). Each triplet consists of:
- One **premise** shared by three rows.
- Three **hypotheses**, each labelled `0` (entailment), `1` (neutral), or `2` (contradiction). A small number of rows instead carry the label `-1`; triplets containing such a row are sampled separately.

**Steps:**
1. Load the SNLI dataset.
2. Assign a unique **triplet number** to each valid triplet (a group of three consecutive rows with the same premise).
3. Randomly sample **100 triplets** that contain a row labelled `-1`.
4. Divide the triplets into **10 equal-sized blocks** and randomly sample **990 triplets** without a `-1` label from each block.
5. Combine the sampled triplets and save the final dataset as `sampled_snli_10000.csv`.

Sampling per block spreads the sample evenly across the dataset's row order, and the triplet numbers keep each premise's three rows together for further analysis.

In [4]:
!pip install datasets


In [5]:
import pandas as pd
import numpy as np
from datasets import load_dataset

### Loading the dataset

In [105]:
dataset = load_dataset("snli")
train_data = dataset['train'].to_pandas()

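# Rename columns for clarity (this has no effect if the columns are already
# named 'premise', 'hypothesis', and 'label', as in the output below)
train_data = train_data.rename(columns={
    'sentence1': 'premise',
    'sentence2': 'hypothesis',
    'gold_label': 'label'
})

# Drop rows with a missing label; rows labelled -1 are not null, so they are kept
train_data = train_data[train_data['label'].notnull()]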
print(train_data.head())
print(train_data.describe())

                                             premise  \
0  A person on a horse jumps over a broken down a...   
1  A person on a horse jumps over a broken down a...   
2  A person on a horse jumps over a broken down a...   
3              Children smiling and waving at camera   
4              Children smiling and waving at camera   

                                          hypothesis  label  
0  A person is training his horse for a competition.      1  
1      A person is at a diner, ordering an omelette.      2  
2                  A person is outdoors, on a horse.      0  
3                  They are smiling at their parents      1  
4                         There are children present      0  
               label
count  550152.000000
mean        0.996730
std         0.819796
min        -1.000000
25%         0.000000
50%         1.000000
75%         2.000000
max         2.000000


### Checking the label distribution

In [106]:
unique_labels = train_data['label'].unique()
print("Unique label types:", unique_labels)

label_counts = train_data['label'].value_counts()
print("\nLabel counts:")
print(label_counts)


Unique label types: [ 1  2  0 -1]

Label counts:
label
 0    183416
 2    183187
 1    182764
-1       785
Name: count, dtype: int64
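The counts above are easier to judge as shares of the whole training set. The following is a minimal sketch that re-enters the printed counts by hand instead of loading SNLI. The `label_names` mapping follows the label meanings given at the top of this notebook; the name used for `-1` is an assumption, based on the Hugging Face `snli` dataset marking rows without a gold label this way.

```python
import pandas as pd

# Counts copied from the output above (not recomputed from the dataset)
label_counts = pd.Series({0: 183416, 2: 183187, 1: 182764, -1: 785}, name="count")

# Assumed human-readable names; -1 is taken to mean "no gold label"
label_names = {0: "entailment", 1: "neutral", 2: "contradiction", -1: "no gold label"}

summary = pd.DataFrame(
    {
        "name": [label_names[k] for k in label_counts.index],
        "count": label_counts.values,
        "share": label_counts.values / label_counts.sum(),
    },
    index=label_counts.index,
)

print(summary)
```

The three regular labels are close to balanced, while `-1` rows make up well under one percent of the data, which is why they are handled as a separate group later on.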


### Checking triplets with -1 label

In [113]:
# Filter rows with label -1 in train_data
rows_with_neg1 = train_data[train_data['label'] == -1]

# Get unique premises that have at least one label -1
premises_with_neg1 = rows_with_neg1['premise'].unique()

# Filter train_data to include all rows for these premises
all_rows_for_neg1_premises = train_data[train_data['premise'].isin(premises_with_neg1)]

# Print the premises and their hypotheses
print("Premises with label -1 and their hypotheses:")
print(all_rows_for_neg1_premises.head(9))


Premises with label -1 and their hypotheses:
                                                premise  \
144   A small group of church-goers watch a choir pr...   
145   A small group of church-goers watch a choir pr...   
146   A small group of church-goers watch a choir pr...   
783   A woman wearing a pink hat is looking at a pin...   
784   A woman wearing a pink hat is looking at a pin...   
785   A woman wearing a pink hat is looking at a pin...   
1560  man in red canada shirt standing with three me...   
1561  man in red canada shirt standing with three me...   
1562  man in red canada shirt standing with three me...   

                                             hypothesis  label  
144                         A group watches a practice.      0  
145          A choir performs in front of packed crowd.     -1  
146   The pastor and elders watch the choir to make ...      1  
783                       The woman is wearing clothes.     -1  
784    The woman is wondering if she le

### Assigning a number to each triplet

In [114]:
triplet_nr = 1
triplet_numbers = []

# Iterate through the dataset in groups of three rows
for i in range(0, len(train_data), 3):
    group = train_data.iloc[i:i+3]

    # Check if all three rows share the same premise
    if len(group['premise'].unique()) == 1 and len(group) == 3:
        triplet_numbers.extend([triplet_nr] * 3)
        triplet_nr += 1
    else:
        triplet_numbers.extend([None] * len(group))

# Assign triplet numbers to the train_data
train_data['triplet_nr'] = triplet_numbers

# Drop rows without a valid triplet assignment; .copy() avoids a
# SettingWithCopyWarning on the column assignment below
train_data = train_data.dropna(subset=['triplet_nr']).copy()

# Convert 'triplet_nr' to integer
train_data['triplet_nr'] = train_data['triplet_nr'].astype(int)

# Print the first few rows
print(train_data.head(9))


                                             premise  \
0  A person on a horse jumps over a broken down a...   
1  A person on a horse jumps over a broken down a...   
2  A person on a horse jumps over a broken down a...   
3              Children smiling and waving at camera   
4              Children smiling and waving at camera   
5              Children smiling and waving at camera   
6  A boy is jumping on skateboard in the middle o...   
7  A boy is jumping on skateboard in the middle o...   
8  A boy is jumping on skateboard in the middle o...   

                                          hypothesis  label  triplet_nr  
0  A person is training his horse for a competition.      1           1  
1      A person is at a diner, ordering an omelette.      2           1  
2                  A person is outdoors, on a horse.      0           1  
3                  They are smiling at their parents      1           2  
4                         There are children present      0          


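The loop above steps through the rows in fixed strides of three, so it relies on every premise occupying exactly three rows that start at positions 0, 3, 6, and so on. As an alternative sketch (not the method used in this notebook), consecutive runs of identical premises can be detected directly, which stays aligned even if one premise has fewer than three rows. The function name `number_triplets` and the toy data are hypothetical.

```python
import pandas as pd

def number_triplets(df: pd.DataFrame) -> pd.DataFrame:
    """Keep runs of exactly three consecutive rows with the same premise
    and number them 1, 2, 3, ... in order of appearance."""
    # A new run starts wherever the premise differs from the previous row
    run_id = (df["premise"] != df["premise"].shift()).cumsum()
    # Length of the run each row belongs to
    run_size = run_id.map(run_id.value_counts())
    keep = run_size == 3
    out = df[keep].copy()
    # factorize assigns 0, 1, 2, ... to run ids in order of first appearance
    out["triplet_nr"] = pd.factorize(run_id[keep])[0] + 1
    return out

# Toy data: premise "b" has only two rows, so it is not a valid triplet
toy = pd.DataFrame({
    "premise": ["a", "a", "a", "b", "b", "c", "c", "c"],
    "hypothesis": [f"h{i}" for i in range(8)],
    "label": [1, 2, 0, 1, 0, 2, 0, 1],
})
numbered = number_triplets(toy)
print(numbered)
```

On this toy data the fixed-stride loop would lose premise "c" as well, because the short "b" group shifts every later group out of alignment. One caveat of the run-based version: if the same premise appeared in two adjacent triplets, the six rows would form a single run and be dropped.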

### Checking that triplet numbers were assigned correctly to triplets containing label -1

In [129]:
# Step 1: Verify that train_data contains rows with label = -1
print("Unique labels in train_data:", train_data['label'].unique())

# Step 2: Extract triplets with label = -1
triplets_with_neg1 = train_data[train_data['label'] == -1]['triplet_nr'].unique()

print(f"Total number of triplets containing label -1: {len(triplets_with_neg1)}")

# Step 3: Display the first 3 triplets containing label -1
triplets_with_neg1_data = train_data[train_data['triplet_nr'].isin(triplets_with_neg1)]

print("First 3 triplets containing at least one row with label -1:")
for triplet_nr in triplets_with_neg1[:3]:
    print(f"\nTriplet Number: {triplet_nr}")
    print(triplets_with_neg1_data[triplets_with_neg1_data['triplet_nr'] == triplet_nr])


Unique labels in train_data: [ 1  2  0 -1]
Total number of triplets containing label -1: 785
First 3 triplets containing at least one row with label -1:

Triplet Number: 49
                                               premise  \
144  A small group of church-goers watch a choir pr...   
145  A small group of church-goers watch a choir pr...   
146  A small group of church-goers watch a choir pr...   

                                            hypothesis  label  triplet_nr  
144                        A group watches a practice.      0          49  
145         A choir performs in front of packed crowd.     -1          49  
146  The pastor and elders watch the choir to make ...      1          49  

Triplet Number: 262
                                               premise  \
783  A woman wearing a pink hat is looking at a pin...   
784  A woman wearing a pink hat is looking at a pin...   
785  A woman wearing a pink hat is looking at a pin...   

                                    

### Choosing 100 random triplets containing a -1 label


In [134]:
# Randomly sample 100 triplets that contain a row with label -1
# (np.random.choice is unseeded here, so the selection differs between runs)
sampled_triplets_with_neg1 = train_data[train_data['triplet_nr'].isin(
    np.random.choice(triplets_with_neg1, 100, replace=False)
)]
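Because the draw above uses the global NumPy random state without a seed, rerunning the cell selects a different set of 100 triplets. A minimal sketch of a reproducible draw, assuming a seeded `numpy.random.Generator` is acceptable here; `triplet_ids` is a hypothetical stand-in for `triplets_with_neg1`, and `sample_ids` is an illustrative helper, not part of the notebook.

```python
import numpy as np

# Hypothetical stand-in for `triplets_with_neg1`: 785 triplet numbers
triplet_ids = np.arange(1, 786)

def sample_ids(ids: np.ndarray, n: int, seed: int) -> np.ndarray:
    """Draw n distinct ids; the same seed always gives the same draw."""
    rng = np.random.default_rng(seed)
    return rng.choice(ids, size=n, replace=False)

first = sample_ids(triplet_ids, n=100, seed=42)
second = sample_ids(triplet_ids, n=100, seed=42)

# Both calls use the same seed, so they select the same ids
print(np.array_equal(first, second))
```

The resulting array can be passed to `isin` in the same way as the output of `np.random.choice`.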


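### Sampling 990 triplets without a -1 label from each of 10 blocks

The triplets are divided into 10 blocks, and 990 triplets that contain no `-1` label are sampled at random from each block. The result is then combined with the 100 triplets sampled above that do contain a `-1` label.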

In [138]:
num_blocks = 10
triplets_per_block = 990

unique_triplets = train_data['triplet_nr'].unique()
triplets_without_neg1 = np.setdiff1d(unique_triplets, triplets_with_neg1)

# Calculate the size of each block
num_triplets = len(unique_triplets)
block_size = num_triplets // num_blocks

sampled_triplets = []

for i in range(num_blocks):
    start_idx = i * block_size
    end_idx = (i + 1) * block_size if i < num_blocks - 1 else num_triplets

    block_triplets = unique_triplets[start_idx:end_idx]

    block_triplets_without_neg1 = np.intersect1d(block_triplets, triplets_without_neg1)

    sampled_without_neg1 = train_data[train_data['triplet_nr'].isin(block_triplets_without_neg1)]
    sampled_without_neg1 = sampled_without_neg1['triplet_nr'].drop_duplicates().sample(
        n=triplets_per_block, random_state=42
    )
    sampled_without_neg1 = train_data[train_data['triplet_nr'].isin(sampled_without_neg1)]
    sampled_triplets.append(sampled_without_neg1)

# Combine sampled triplets without label -1
sampled_triplets_without_neg1 = pd.concat(sampled_triplets)

# Merge the two sampled datasets
final_sampled_data = pd.concat([sampled_triplets_with_neg1, sampled_triplets_without_neg1]).reset_index(drop=True)

print(f"Total number of triplets chosen: {final_sampled_data['triplet_nr'].nunique()}")
print("\nLabel distribution in the sampled data:")
print(final_sampled_data['label'].value_counts())

Total number of triplets chosen: 10000

Label distribution in the sampled data:
label
 2    10008
 0     9958
 1     9934
-1      100
Name: count, dtype: int64
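The block-wise sampling can also be expressed on the triplet numbers alone, without filtering the full DataFrame inside the loop. The sketch below is an illustration under stated assumptions, not the notebook's implementation: `sample_per_block` is a hypothetical helper, the toy ids replace the real triplet numbers, and `np.array_split` spreads any remainder over the first blocks, whereas the cell above adds it to the last block. It also shares one seeded generator across blocks, while the cell above passes `random_state=42` in every block, so each block's draw there starts from the same random state.

```python
import numpy as np

def sample_per_block(ids, excluded, num_blocks, per_block, seed):
    """Split ids into contiguous blocks and draw per_block ids from each,
    skipping any id listed in excluded."""
    rng = np.random.default_rng(seed)  # one generator shared by all blocks
    chosen = []
    for block in np.array_split(np.asarray(ids), num_blocks):
        candidates = np.setdiff1d(block, excluded)
        chosen.append(rng.choice(candidates, size=per_block, replace=False))
    return np.concatenate(chosen)

# Toy setup: 1,000 triplet numbers, every 50th one excluded
all_ids = np.arange(1, 1001)
excluded_ids = all_ids[::50]
picked = sample_per_block(all_ids, excluded_ids, num_blocks=10, per_block=5, seed=42)

print(len(picked), len(np.unique(picked)))
```

The selected ids can then be used once with `train_data['triplet_nr'].isin(...)` to pull all three rows of each chosen triplet.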


In [139]:
final_sampled_data.to_csv('sampled_snli_10000.csv', index=False)
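Before relying on the saved file, it may be worth confirming that the sample still has the intended structure. The following is a minimal sketch of such a check, run here on toy data; `check_triplets` is a hypothetical helper, and the column names `triplet_nr`, `premise`, and `label` are the ones used in this notebook.

```python
import pandas as pd

def check_triplets(df: pd.DataFrame) -> dict:
    """Summarise basic structural properties of a triplet sample."""
    grouped = df.groupby("triplet_nr")
    return {
        "n_triplets": int(grouped.ngroups),
        "all_size_three": bool((grouped.size() == 3).all()),
        "one_premise_each": bool((grouped["premise"].nunique() == 1).all()),
        "triplets_with_neg1": int(df.loc[df["label"] == -1, "triplet_nr"].nunique()),
    }

# Toy sample with two triplets, one of which contains a -1 label
toy_sample = pd.DataFrame({
    "premise": ["p1"] * 3 + ["p2"] * 3,
    "hypothesis": ["h1", "h2", "h3", "h4", "h5", "h6"],
    "label": [0, -1, 1, 2, 0, 1],
    "triplet_nr": [49, 49, 49, 50, 50, 50],
})
report = check_triplets(toy_sample)
print(report)
```

Applied to `final_sampled_data`, the same function would be expected to report 10,000 triplets, of which 100 contain a `-1` label, if the sampling above worked as described.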
