# Loading and Exploring the XNLI Dataset

In this notebook, we load the XNLI dev and test sets from local TSV files, explore their basic structure, filter them to five languages, and save new validation and test splits. XNLI is a crowd-sourced collection of sentence pairs in 15 languages, each annotated with a textual entailment label (entailment, neutral, or contradiction).

## Steps:
1. Load the dataset from the specified path.
2. Display the first few rows of the dataset to understand its structure.
3. Get basic information about the dataset, such as the number of rows, columns, and data types.
4. Check for any missing values in the dataset.

In [2]:
import pandas as pd

# Step 1: Load the dataset from the specified path
path = '../data/xnli/xnli.dev.tsv'
df = pd.read_csv(path, sep='\t')

# Step 2: Display the first few rows of the dataset to understand its structure
print(df.head())

# Step 3: Get basic information about the dataset, such as the number of rows, columns, and data types
df.info()  # info() prints its report directly and returns None, so no print() is needed

# Step 4: Check for any missing values in the dataset
print(df.isnull().sum())

  language     gold_label  sentence1_binary_parse  sentence2_binary_parse  \
0       ar        neutral                     NaN                     NaN   
1       ar  contradiction                     NaN                     NaN   
2       ar     entailment                     NaN                     NaN   
3       ar        neutral                     NaN                     NaN   
4       ar  contradiction                     NaN                     NaN   

   sentence1_parse  sentence2_parse  \
0              NaN              NaN   
1              NaN              NaN   
2              NaN              NaN   
3              NaN              NaN   
4              NaN              NaN   

                                           sentence1  \
0                        وقال، ماما، لقد عدت للمنزل.   
1                        وقال، ماما، لقد عدت للمنزل.   
2                        وقال، ماما، لقد عدت للمنزل.   
3  لم أعرف من أجل ماذا أنا ذاهب أو أي شىْ ، لذلك ...   
4  لم أعرف من أجل ماذا

In [2]:
languages = df['language'].unique()
print(languages)

['ar' 'bg' 'de' 'el' 'en' 'es' 'fr' 'hi' 'ru' 'sw' 'th' 'tr' 'ur' 'vi'
 'zh']


In [3]:
selected_languages = ['en', 'es', 'fr', 'de', 'zh']
df_filtered = df[df['language'].isin(selected_languages)]
print(df_filtered)
print(f"Number of samples remaining: {len(df_filtered)}")

      language     gold_label  sentence1_binary_parse  sentence2_binary_parse  \
4980        de        neutral                     NaN                     NaN   
4981        de  contradiction                     NaN                     NaN   
4982        de     entailment                     NaN                     NaN   
4983        de        neutral                     NaN                     NaN   
4984        de  contradiction                     NaN                     NaN   
...        ...            ...                     ...                     ...   
37345       zh        neutral                     NaN                     NaN   
37346       zh  contradiction                     NaN                     NaN   
37347       zh        neutral                     NaN                     NaN   
37348       zh  contradiction                     NaN                     NaN   
37349       zh     entailment                     NaN                     NaN   

       sentence1_parse  sen

In [4]:
samples_per_language = df_filtered['language'].value_counts()
print(samples_per_language)

language
de    2490
en    2490
es    2490
fr    2490
zh    2490
Name: count, dtype: int64
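
The equal per-language counts above are consistent with a parallel corpus. A natural follow-up is to check whether the gold labels are balanced within each language. Below is a minimal sketch using `pd.crosstab`; the helper name `label_counts_by_language` and the toy frame are illustrative assumptions and not part of the original notebook.

```python
import pandas as pd

def label_counts_by_language(frame: pd.DataFrame) -> pd.DataFrame:
    """Count gold labels per language (rows: language, columns: gold_label)."""
    return pd.crosstab(frame["language"], frame["gold_label"])

# Toy frame with made-up rows, only to show the shape of the result.
toy = pd.DataFrame({
    "language": ["en", "en", "en", "de", "de", "de"],
    "gold_label": ["neutral", "contradiction", "entailment",
                   "neutral", "contradiction", "entailment"],
})
toy_counts = label_counts_by_language(toy)
print(toy_counts)
```

On the real data this would be called as `label_counts_by_language(df_filtered)`.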


In [6]:
# Load the test dataset
test_path = '../data/xnli/xnli.test.tsv'
df_test = pd.read_csv(test_path, sep='\t')

# Filter the test dataset for the selected languages
df_test_filtered = df_test[df_test['language'].isin(selected_languages)]

# Get the number of samples for each language
samples_per_language_test = df_test_filtered['language'].value_counts()

print(samples_per_language_test)


language
de    5010
en    5010
es    5010
fr    5010
zh    5010
Name: count, dtype: int64


In [7]:
# Save the filtered dev set to a CSV file
df_filtered.to_csv('../data/xnli/xnli_filtered_dev.csv', index=False)

In [8]:
df_filtered.info()

<class 'pandas.core.frame.DataFrame'>
Index: 12450 entries, 4980 to 37349
Data columns (total 19 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   language                12450 non-null  object 
 1   gold_label              12450 non-null  object 
 2   sentence1_binary_parse  0 non-null      float64
 3   sentence2_binary_parse  0 non-null      float64
 4   sentence1_parse         0 non-null      float64
 5   sentence2_parse         0 non-null      float64
 6   sentence1               12450 non-null  object 
 7   sentence2               12450 non-null  object 
 8   promptID                12450 non-null  int64  
 9   pairID                  12450 non-null  int64  
 10  genre                   12450 non-null  object 
 11  label1                  12450 non-null  object 
 12  label2                  12450 non-null  object 
 13  label3                  12450 non-null  object 
 14  label4                  12450 non-null  
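
The report above shows that the four `*_parse` columns contain no non-null values in the filtered dev set. If they are not needed downstream, they could be dropped before saving. This is a sketch under that assumption; `drop_empty_columns` is a hypothetical helper, and the toy frame stands in for `df_filtered`.

```python
import pandas as pd

def drop_empty_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of the frame without columns whose values are all missing."""
    return frame.dropna(axis=1, how="all")

# Toy frame with made-up rows; sentence1_parse mimics an all-NaN column.
toy = pd.DataFrame({
    "language": ["en", "de"],
    "sentence1_parse": [float("nan"), float("nan")],
    "sentence1": ["a toy premise", "eine Beispielpraemisse"],
})
toy_clean = drop_empty_columns(toy)
print(list(toy_clean.columns))
```

Using `how="all"` removes only columns that are entirely empty, so a column with even one value is kept.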

In [23]:
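# Get one example by position: iloc counts rows from 0 in the filtered frame
# (it ignores the original index labels), so position 2502 falls in the 'en' block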
example = df_filtered.iloc[2502]
print("Language:", example['language'])
print("Gold Label:", example['gold_label'])
print("\nSentence 1:", example['sentence1'])
print("\nSentence 2:", example['sentence2'])

Language: en
Gold Label: contradiction

Sentence 1: I was just there just trying to figure it out.

Sentence 2: I understood it well from the beginning.


In [10]:
example = df_filtered.iloc[1]
print("Language:", example['language'])
print("Gold Label:", example['gold_label'])
print("\nSentence 1:", example['sentence1'])
print("\nSentence 2:", example['sentence2'])

Language: de
Gold Label: contradiction

Sentence 1: und er hat gesagt, Mama ich bin daheim.

Sentence 2: Er sagte kein Wort.
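
Because XNLI is parallel, it can be helpful to view the same sentence pair in several languages side by side instead of picking rows by position. The sketch below assumes that `pairID` takes the same value for a given pair in every language, which is worth verifying on the real data. `same_pair_across_languages` is a hypothetical helper, and the toy rows and IDs are made up.

```python
import pandas as pd

def same_pair_across_languages(frame: pd.DataFrame, pair_id: int) -> pd.DataFrame:
    """Return the rows that share one pairID, keeping only the readable columns."""
    cols = ["language", "gold_label", "sentence1", "sentence2"]
    return frame.loc[frame["pairID"] == pair_id, cols]

# Toy frame with placeholder text and made-up pairID values.
toy = pd.DataFrame({
    "language": ["en", "de", "en", "de"],
    "gold_label": ["contradiction", "contradiction", "neutral", "neutral"],
    "sentence1": ["premise A (en)", "premise A (de)", "premise B (en)", "premise B (de)"],
    "sentence2": ["hypothesis A (en)", "hypothesis A (de)",
                  "hypothesis B (en)", "hypothesis B (de)"],
    "pairID": [1, 1, 2, 2],
})
pair_rows = same_pair_across_languages(toy, pair_id=1)
print(pair_rows)
```

On the notebook's data the call would look like `same_pair_across_languages(df_filtered, example['pairID'])`.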


In [11]:
# Split the filtered test set per language: the first 1000 rows become
# the validation set and the remaining rows stay in the test set
validation_dfs = []
test_dfs = []

for lang in selected_languages:
    lang_data = df_test_filtered[df_test_filtered['language'] == lang]
    
    # Get 1000 samples for validation
    validation_data = lang_data.head(1000)
    # Get remaining samples for test
    test_data = lang_data.iloc[1000:]
    
    validation_dfs.append(validation_data)
    test_dfs.append(test_data)

# Combine all languages
df_validation = pd.concat(validation_dfs, axis=0)
df_test_final = pd.concat(test_dfs, axis=0)

# Print the shapes to verify
print("Validation set shape:", df_validation.shape)
print("Test set shape:", df_test_final.shape)

# Save the splits
df_validation.to_csv('../data/xnli/xnli_validation.csv', index=False)
df_test_final.to_csv('../data/xnli/xnli_test.csv', index=False)

Validation set shape: (5000, 19)
Test set shape: (20050, 19)
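
The split above cuts each language at row 1000. The dev output earlier shows consecutive rows sharing the same `sentence1`, which suggests that rows are grouped by premise. If the test file follows the same layout, a fixed row cut can place hypotheses for one premise on both sides of the split. The sketch below is an alternative that splits on `promptID` instead, assuming `promptID` identifies the premise. It is not the notebook's method, and `split_by_prompt` and the toy frame are hypothetical.

```python
import pandas as pd

def split_by_prompt(lang_data: pd.DataFrame, n_prompts: int):
    """Split one language's rows so that no promptID appears in both parts."""
    # Keep the first n_prompts distinct promptIDs, in order of appearance.
    first_ids = lang_data["promptID"].drop_duplicates().head(n_prompts)
    in_validation = lang_data["promptID"].isin(first_ids)
    return lang_data[in_validation], lang_data[~in_validation]

# Toy frame with made-up IDs: two premises with three hypotheses each.
toy = pd.DataFrame({
    "language": ["en"] * 6,
    "promptID": [1, 1, 1, 2, 2, 2],
    "pairID": [10, 11, 12, 20, 21, 22],
})
toy_val, toy_test = split_by_prompt(toy, n_prompts=1)
print(len(toy_val), len(toy_test))
```

Inside the loop above, this would replace the `head(1000)` and `iloc[1000:]` lines with a single call on `lang_data`. Note that the validation size is then counted in premises, not rows.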
