
# This code processes an SNLI dataset for curriculum learning: it samples triplets in proportion to the distribution of a complexity measure. It works in the following steps:

1. **Load and Prepare Data**:
   - Reads a CSV file, cleans missing values, and converts text columns (`premise` and `hypothesis`) to strings.

2. **Calculate Complexity Measure**:
   - Computes `measure_1` for each row. The example measure is the sum of the character lengths of `premise` and `hypothesis`; it can be replaced with any other per-row measure.

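3. **Bin and Sample the Data**:
   - Divides the range of `measure_1` values into `num_bins` equal-width bins (default: 7) and assigns each row to a bin. Samples `sample_size` triplets (default: 700) in proportion to the number of distinct triplets in each bin. Because rows are binned individually, a triplet whose rows fall into different bins is counted in each of those bins.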

4. **Calculate the Triplet Average**:
   - Averages `measure_1` over the sentence pairs of each sampled triplet.

5. **Sort and Return**:
   - Sorts the sample by the triplet average and creates two DataFrames:
     - One ordered by increasing complexity.
     - One ordered by decreasing complexity.

The sample follows the bin distribution of the full dataset rather than drawing equally from every bin, and the function returns two ordered subsets of the data.
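
The proportional split in step 3 has to deal with rounding: flooring each bin's share usually leaves a few triplets unassigned. Below is a minimal sketch of one way to do it (largest-remainder rounding with integer arithmetic). The helper name `allocate_per_bin` and the toy bin counts are made up for illustration and are not part of the notebook's function.

```python
import pandas as pd


def allocate_per_bin(bin_counts: pd.Series, sample_size: int) -> pd.Series:
    """Split sample_size across bins in proportion to bin_counts.

    Hypothetical helper: floors each bin's share, then gives the leftover
    units to the bins with the largest remainders.
    """
    total = int(bin_counts.sum())
    scaled = bin_counts * sample_size      # exact integer products
    quota = scaled // total                # floored share per bin
    remainder = scaled % total             # what flooring threw away
    shortfall = sample_size - int(quota.sum())
    # nlargest keeps the first occurrence on ties, so the result is deterministic
    top_up = remainder.nlargest(shortfall).index
    quota.loc[top_up] += 1
    return quota


# Toy counts of triplets per bin (illustrative values only)
counts = pd.Series([50, 30, 14, 6], index=[0, 1, 2, 3])
quota = allocate_per_bin(counts, sample_size=10)
print(quota.tolist())  # [5, 3, 1, 1]
```

Here the floored shares are 5, 3, 1 and 0, which sum to 9, so the single leftover unit goes to bin 3, whose remainder is the largest.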

In [1]:
import pandas as pd
import numpy as np

In [9]:
def process_snli_dataset_with_measures(file_path, sample_size=700, num_bins=7):
    """
    Process an SNLI dataset for curriculum learning with nested measures and proportional sampling.

    Args:
        file_path (str): Path to the CSV file containing the dataset.
        sample_size (int): The number of triplets to sample from the dataset. Default is 700.
        num_bins (int): Number of bins for dividing measure_1 values. Default is 7.

    Returns:
        tuple: Two DataFrames - one ordered by increasing average measure_1 and one by decreasing average measure_1.
    """
    # Step 1: Load the data and make sure the text columns are strings
    data = pd.read_csv(file_path)
    print(f"Dataset loaded: {len(data)} rows")

    data['premise'] = data['premise'].fillna("").astype(str)
    data['hypothesis'] = data['hypothesis'].fillna("").astype(str)

    # Step 2: Compute the complexity measure (edit measure_1 to use a different one)
    def measure_1(row):
        """Example complexity measure: sum of lengths of premise and hypothesis."""
        return len(row['premise']) + len(row['hypothesis'])

    data['measure_1'] = data.apply(measure_1, axis=1)

    # Step 3: Bin rows by measure_1 and sample triplets proportionally
    def sample_triplets(data, sample_size, num_bins):
        # Step 3.1: Divide measure_1 into ranges (bins)
        bin_edges = np.linspace(data['measure_1'].min(), data['measure_1'].max(), num_bins + 1)
        data['range_bin'] = pd.cut(data['measure_1'], bins=bin_edges, labels=False, include_lowest=True)

        # Step 3.2: Calculate bin distributions
        bin_distribution = data.groupby('range_bin')['triplet_nr'].nunique()
        print(f"Distribution of triplets across bins:\n{bin_distribution}")

        # Step 3.3: Determine how many triplets to sample from each bin
        expected = bin_distribution / bin_distribution.sum() * sample_size
        triplets_per_bin = expected.astype(int)

        # Adjust for rounding so the per-bin counts sum to exactly sample_size
        while triplets_per_bin.sum() < sample_size:
            residuals = expected - triplets_per_bin
            triplets_per_bin[residuals.idxmax()] += 1

        while triplets_per_bin.sum() > sample_size:
            residuals = expected - triplets_per_bin
            triplets_per_bin[residuals.idxmin()] -= 1

        # Step 3.4: Sample triplets proportionally from each bin
        sampled_triplets = []
        for bin_id, sample_count in triplets_per_bin.items():
            if sample_count > 0:
                triplets_in_bin = data[data['range_bin'] == bin_id]['triplet_nr'].unique()
                sampled_triplet_ids = np.random.choice(triplets_in_bin, size=min(sample_count, len(triplets_in_bin)), replace=False)
                sampled_triplets.append(data[data['triplet_nr'].isin(sampled_triplet_ids)])

        # Combine sampled triplets
        sampled_data = pd.concat(sampled_triplets).reset_index(drop=True)

        return sampled_data

    final_sample = sample_triplets(data, sample_size, num_bins)

    # Step 4: Average measure_1 over the rows of each sampled triplet
    triplet_avg = final_sample.groupby('triplet_nr')['measure_1'].mean().reset_index(name='triplet_avg_measure_1')

    # Merge back so every row carries its triplet-level average
    final_sample = final_sample.merge(triplet_avg, on='triplet_nr')

    # Step 5: Sort by the triplet average in both directions
    final_sample_increasing = final_sample.sort_values(by='triplet_avg_measure_1').reset_index(drop=True)
    final_sample_decreasing = final_sample.sort_values(by='triplet_avg_measure_1', ascending=False).reset_index(drop=True)

    # Print bin distributions for the final samples
    increasing_bins = final_sample_increasing.groupby('range_bin')['triplet_nr'].nunique()
    print(f"Distribution of bins in increasing order sample:\n{increasing_bins}")
    print(f"Sampled {sample_size} triplets and returned two ordered DataFrames.")

    return final_sample_increasing, final_sample_decreasing
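
One thing to watch for: `range_bin` is assigned per row, so the rows of one triplet can land in different bins, and the triplet is then counted (and can be drawn) in each of them. The per-bin counts printed in the run further down add up to 13,439 for 30,000 rows, which is consistent with that. A minimal sketch of an alternative, not the notebook's method, is to bin each triplet once by its mean `measure_1`. The helper name `bin_triplets_by_mean` and the toy data are made up for illustration.

```python
import numpy as np
import pandas as pd


def bin_triplets_by_mean(data: pd.DataFrame, num_bins: int = 7) -> pd.Series:
    """Return one bin label per triplet_nr, based on the triplet's mean measure_1.

    Hypothetical helper: assumes the columns 'triplet_nr' and 'measure_1'
    exist and that the triplet means are not all identical.
    """
    triplet_mean = data.groupby("triplet_nr")["measure_1"].mean()
    # Equal-width bin edges over the range of the triplet means
    edges = np.linspace(triplet_mean.min(), triplet_mean.max(), num_bins + 1)
    return pd.cut(triplet_mean, bins=edges, labels=False, include_lowest=True)


# Toy data: three triplets with three rows each (illustrative values only)
toy = pd.DataFrame({
    "triplet_nr": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "measure_1": [10, 12, 14, 40, 50, 60, 90, 95, 100],
})
triplet_bin = bin_triplets_by_mean(toy, num_bins=3)
print(triplet_bin.tolist())  # [0, 1, 2]
```

The result is indexed by `triplet_nr`, so it can be mapped back onto the rows with `data['triplet_nr'].map(triplet_bin)`. Every triplet then belongs to exactly one bin, and the per-bin counts sum to the number of triplets.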


In [10]:
# Test the function with the basic measure
sample_increasing, sample_decreasing = process_snli_dataset_with_measures(file_path='/content/sampled_snli_10000.csv')

# Display the outputs
print("Sample ordered by increasing triplet average:")
print(sample_increasing.head())

#print("\nSample ordered by decreasing triplet average:")
#print(sample_decreasing.head())
sample_increasing.to_csv('sample_increasing.csv', index=False)
#sample_decreasing.to_csv('sample_decreasing.csv', index=False)

Dataset loaded: 30000 rows
Distribution of triplets across bins:
range_bin
0    5495
1    6368
2    1362
3     179
4      26
5       5
6       4
Name: triplet_nr, dtype: int64
Distribution of bins in increasing order sample:
range_bin
0    407
1    484
2    117
3     18
4      3
Name: triplet_nr, dtype: int64
Sampled 700 triplets and returned two ordered DataFrames.
Sample ordered by increasing triplet average:
                         premise               hypothesis  label  triplet_nr  \
0    A man surfing in the ocean.         a man is surfing      0        6583   
1    A man surfing in the ocean.        the man is thirty      1        6583   
2    A man surfing in the ocean.        the man is seated      2        6583   
3  People riding on motorcycles.       A bunch of bikers.      0       85871   
4  People riding on motorcycles.  People standing around.      2       85871   

   measure_1  range_bin  triplet_avg_measure_1  
0         43          0              43.666667  
1     