## Week 36
Familiarize yourself with the dataset card, download the dataset and explore its
columns. Summarize data statistics (size, word count, etc.) for training and
validation data in the languages Arabic (ar), Korean (ko) and Telugu (te).

For each of the languages Arabic, Korean and Telugu, report the 5 most
common words in the questions from the training set and their count, as well
as their English translation. What kind of words are they?

Implement a rule-based classifier that predicts whether a question is
answerable or impossible, only using the document (context) and question.
You may use machine translation as a component. Use the answerable field to
evaluate it on the validation set. What is the performance of your classifier for
each of the languages Arabic, Korean and Telugu?
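One possible shape for such a rule-based classifier is a lexical-overlap rule: call a question answerable when enough of its content words also occur in the context. The sketch below is only an illustration under several assumptions: the question has already been machine-translated into English (the `question_translated` column), the context is in English, and `STOPWORDS`, `threshold=0.5`, and the toy rows are hypothetical choices, not values taken from the dataset.

```python
import string

import pandas as pd

# Hypothetical, deliberately tiny stop-word list; a real one would be larger.
STOPWORDS = {
    "the", "a", "an", "of", "in", "is", "was", "what", "who", "when",
    "where", "which", "how", "did", "does", "do", "to", "and", "for", "on",
}


def content_words(text):
    """Lower-case, strip ASCII punctuation, and drop stop words."""
    tokens = (t.strip(string.punctuation).lower() for t in text.split())
    return {t for t in tokens if t and t not in STOPWORDS}


def predict_answerable(question_en, context_en, threshold=0.5):
    """Rule: answerable if at least `threshold` of the question's
    content words also appear in the context."""
    q_words = content_words(question_en)
    if not q_words:
        return False
    overlap = len(q_words & content_words(context_en)) / len(q_words)
    return overlap >= threshold


def accuracy_by_language(df):
    """Accuracy of the rule against the `answerable` field, per language."""
    preds = [
        predict_answerable(q, c)
        for q, c in zip(df["question_translated"], df["context"])
    ]
    correct = pd.Series(preds, index=df.index) == df["answerable"]
    return correct.groupby(df["lang"]).mean()


# Toy rows (made up for illustration, not taken from the dataset).
toy = pd.DataFrame({
    "lang": ["ar", "ar", "ko"],
    "question_translated": [
        "When was the Eiffel Tower built?",
        "Who invented the telephone?",
        "Where is Seoul located?",
    ],
    "context": [
        "The Eiffel Tower was built in 1889 in Paris.",
        "The Eiffel Tower was built in 1889 in Paris.",
        "Seoul is located in South Korea.",
    ],
    "answerable": [True, False, True],
})
scores = accuracy_by_language(toy)
print(scores)
```

On real data the threshold would need tuning on the training set, and accuracy alone can be misleading when most questions in a language are answerable.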

In [None]:
from datasets import load_dataset
dataset = load_dataset("coastalcph/tydi_xor_rc")
train_set = dataset["train"]
validation_set = dataset["validation"]

In [1]:
import pandas as pd
import string
import numpy as np

In [2]:
validation_set = pd.read_parquet("data/validation.parquet")
train_set = pd.read_parquet("data/train.parquet")

In [9]:
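# Tokenize on single spaces; strip ASCII punctuation and the Arabic question mark from each token.
strip_punctuation = lambda line: [word.strip(string.punctuation + "؟") for word in line.split(" ")]
# Takes the token list produced above and returns its length.
count_words = lambda tokens: len(tokens)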
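A quick check of what this tokenization does on small inputs can help when reading the word counts later. The sketch below re-defines the helper as a plain function so it runs on its own; the example strings are made up. It shows that splitting on a single space leaves empty tokens behind (from double spaces or tokens that are pure punctuation), so a raw `len` over the token list may overcount words.

```python
import string

ARABIC_QUESTION_MARK = "؟"


def strip_punctuation(line):
    """Split on single spaces and strip punctuation from both ends of each token."""
    return [
        w.strip(string.punctuation + ARABIC_QUESTION_MARK)
        for w in line.split(" ")
    ]


# A double space and a free-standing "?" both produce empty tokens.
tokens = strip_punctuation("Who  is it ?")
non_empty = [t for t in tokens if t]
print(tokens)
print(len(tokens), len(non_empty))

# The Arabic question mark is removed from the end of the last word.
arabic = strip_punctuation("من هو؟")
print(arabic)
```

Filtering out empty strings before counting gives the number of actual words.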

In [4]:
!pip install torch






In [18]:
from transformers import pipeline
import torch

LANG_CODE = {"ko": "kor_Hang", "ar": "arb_Arab", "te": "tel_Telu"}

device = 0 if torch.cuda.is_available() else -1
translators = {
    l: pipeline(
        "translation",
        model="facebook/nllb-200-distilled-600M",
        tokenizer="facebook/nllb-200-distilled-600M",
        src_lang=LANG_CODE[l],
        tgt_lang="eng_Latn",
        device=device
    )
    for l in LANG_CODE
}

def translate_top_words(counts, l):
    translate = translators[l]
    tx = [translate(w, max_length=64)[0]["translation_text"] for w in counts.index]
    counts = counts.reset_index()
    counts["translation"] = tx
    return counts



Device set to use cpu
Device set to use cpu
Device set to use cpu


In [27]:
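def translate_questions(df, l):
    translate = translators[l]
    df = df.copy()  # avoid SettingWithCopyWarning when df is a slice
    # The pipeline returns a list of dicts; keep only the translated string.
    df["question_translated"] = df["question"].apply(
        lambda q: translate(q)[0]["translation_text"])
    return df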
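Translating one question per call is slow on CPU, so it may be worth translating each distinct string only once and mapping the results back. The sketch below is a hedged illustration: `translate_unique` and `fake_translate` are hypothetical names, and `fake_translate` merely stands in for a call into one of the translation pipelines. It only pays off if the column actually contains repeated strings.

```python
import pandas as pd


def translate_unique(texts, translate_one):
    """Translate each distinct string once, then map back to every row."""
    mapping = {t: translate_one(t) for t in texts.drop_duplicates()}
    return texts.map(mapping)


# Stand-in for a real translation call; it records how often it is invoked.
calls = []


def fake_translate(text):
    calls.append(text)
    return text.upper()


questions = pd.Series(["who?", "when?", "who?"])
translated = translate_unique(questions, fake_translate)
print(translated.tolist())
print(len(calls))
```

The returned Series keeps the index of the input, so it can be assigned directly to a new column of the same frame.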

In [28]:
def statistics(df, col=("question", "context")):
    langs = ["ko", "ar", "te"]
    df = df[df['lang'].isin(langs)].copy()
    for c in col:
        df[f"{c}_stripped"] = df[c].apply(strip_punctuation)
        df[f"{c}_wordcount"] = df[f"{c}_stripped"].apply(count_words)

    print(
        df.groupby(["lang", "answerable"]).agg(
            question_wordcount_mean=("question_wordcount", "mean"),
            context_wordcount_mean=("context_wordcount", "mean"),
            question_wordcount_sum=("question_wordcount", "sum"),
            context_wordcount_sum=("context_wordcount", "sum"),
            question_wordcount_max=("question_wordcount", "max"),
            question_wordcount_min=("question_wordcount", "min"),
            count=("question_wordcount", "count")
        ).round(0)
    )

    for l in langs:
        lang_df = df[df["lang"] == l]
        # Flatten the per-question token lists and drop empty tokens.
        words = [w for tokens in lang_df["question_stripped"] for w in tokens if w]
        counts = pd.Series(words).value_counts().head()
        counts = translate_top_words(counts, l)
        print(f"\nTop tokens for {l} -> en:")
        print(counts)

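        # Write only the translated column back, aligned on the row index.
        df.loc[df["lang"] == l, "question_translated"] = translate_questions(lang_df, l)["question_translated"]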

    return df
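The per-language write-back inside the loop is the kind of assignment where pandas' view-versus-copy behaviour matters. A minimal sketch on a toy frame, with `fake_translate` as a hypothetical stand-in for the translation step: the helper works on an explicit copy, and the result is written into the original frame through `.loc` with a boolean mask, which aligns on the row index.

```python
import pandas as pd

df = pd.DataFrame({
    "lang": ["ar", "ko", "ar"],
    "question": ["q1", "q2", "q3"],
})


def fake_translate(sub_df):
    """Stand-in for translation: returns a copy with one extra column."""
    out = sub_df.copy()
    out["question_translated"] = out["question"].str.upper()
    return out


# Create the target column first so every row has a value slot.
df["question_translated"] = None

for lang in ["ar", "ko"]:
    mask = df["lang"] == lang
    translated = fake_translate(df[mask])
    # Assign a single column through .loc; rows are matched by index label.
    df.loc[mask, "question_translated"] = translated["question_translated"]

print(df)
```

Because the helper never modifies the slice it receives, this pattern should not trigger the `SettingWithCopyWarning`.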

In [None]:
val_for_stat = statistics(validation_set)
train_for_stat = statistics(train_set)

                 question_wordcount_mean  context_wordcount_mean  \
lang answerable                                                    
ar   False                           8.0                   112.0   
     True                            7.0                   103.0   
ko   False                           5.0                   108.0   
     True                            5.0                    95.0   
te   False                           6.0                   112.0   
     True                            6.0                   105.0   

                 question_wordcount_sum  context_wordcount_sum  \
lang answerable                                                  
ar   False                          406                   5813   
     True                          2436                  37285   
ko   False                           95                   2051   
     True                          1636                  32124   
te   False                          575                  10

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["question_translated"] = df["question"].apply(translate)



Top tokens for ar -> en:
  index  count  translation
0    من    113        Who ?
1    في     90         In .
2    ما     81       What ?
3    هو     66  It 's him .
4   متى     65       When ?





Top tokens for te -> en:
        index  count  translation
0           ఏ     92  There is no
1         ఏది     76  What is it?
2        ఎవరు     74   Who is it?
3  భారతదేశంలో     45     In India
4         ఎంత     40     How much




                 question_wordcount_mean  context_wordcount_mean  \
lang answerable                                                    
ar   False                           8.0                   124.0   
     True                            7.0                   102.0   
ko   False                           5.0                   104.0   
     True                            5.0                    97.0   
te   False                           6.0                   118.0   
     True                            6.0                    87.0   

                 question_wordcount_sum  context_wordcount_sum  \
lang answerable                                                  
ar   False                         1948                  31580   
     True                         15509                 234189   
ko   False                          325                   6574   
     True                         11524                 228653   
te   False                          277                   5

I wanted to strip punctuation, but some languages that we do not include in the analysis use special punctuation marks that would have to be added to the set of characters being stripped.
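One way to avoid maintaining a per-language list of punctuation marks is to rely on Unicode character categories: every category starting with `P` is some form of punctuation. The sketch below is an assumption about how this could be done, not what the notebook does; `is_punctuation` and `strip_unicode_punctuation` are hypothetical helper names.

```python
import unicodedata


def is_punctuation(ch):
    """True for any character whose Unicode category is a punctuation class."""
    return unicodedata.category(ch).startswith("P")


def strip_unicode_punctuation(token):
    """Remove punctuation from both ends of a token, in any script."""
    start, end = 0, len(token)
    while start < end and is_punctuation(token[start]):
        start += 1
    while end > start and is_punctuation(token[end - 1]):
        end -= 1
    return token[start:end]


examples = ["هو؟", "무엇입니까?", "«quoi»", "what?!", "。"]
for token in examples:
    print(repr(strip_unicode_punctuation(token)))
```

Note that this is not identical to `string.punctuation`: characters such as `$` or `+` are classified as symbols rather than punctuation in Unicode, so they would be kept.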

In [None]:
train_for_stat

In [None]:
val_for_stat.groupby(["lang", "answerable"]).agg(
    question_wordcount_mean=("question_wordcount", "mean"),
    context_wordcount_mean=("context_wordcount", "mean"),
    count=("question_wordcount", "count")
).round(0)


In [None]:
train_set

In [None]:
train_ds = load_dataset(
    "parquet",
    data_files={"train": "data/train.parquet"}
)["train"]

val_ds = load_dataset(
    "parquet",
    data_files={"validation": "data/validation.parquet"}
)["validation"]

test_ds = load_dataset(
    "json",
    data_files={"test": "data/test.json"}
)["test"]

In [None]:
train_ds