<img src="https://i.imgur.com/RFR6UZX.jpg" width="100%"/>

# 2. The Dataset
### [chaii - Hindi and Tamil Question Answering](https://www.kaggle.com/c/chaii-hindi-and-tamil-question-answering) - A quick overview for QA noobs

Hi and welcome! This is the second kernel of the series `chaii - Hindi and Tamil Question Answering - A quick overview for QA noobs`.

**In this short kernel, we will briefly go over the competition dataset and provide a transliteration table.**


---

The entire series consists of the following notebooks:
1. [The competition](https://www.kaggle.com/julian3833/1-the-competition-qa-for-qa-noobs)
2. _[The dataset](https://www.kaggle.com/julian3833/2-the-dataset-qa-for-qa-noobs) (This notebook)_
3. [The metric (Jaccard)](https://www.kaggle.com/julian3833/3-the-metric-jaccard-qa-for-qa-noobs) 
4. [Exploring Public Models](https://www.kaggle.com/julian3833/4-exploring-public-models-qa-for-qa-noobs/)
5. [🥇 XLM-Roberta + Torch's extra data [LB: 0.749]](https://www.kaggle.com/julian3833/5-xlm-roberta-torch-s-extra-data-lb-0-749)
6. [🤗 Pre & post processing](https://www.kaggle.com/julian3833/6-pre-post-processing-qa-for-qa-noobs/)

This is an ongoing project, so expect more notebooks to be added to the series soon. Actually, we are currently working on the following ones:
* Exploring Public Models Revisited
* Reviewing `squad2`, `mlqa` and others
* About `xlm-roberta-large-squad2`
* Own improvements

---

In [None]:
BASE_PATH = "../input/chaii-hindi-and-tamil-question-answering/"
!ls -l $BASE_PATH

# Small-data regime

The training dataset is tiny! It looks like adding external datasets might become an important aspect of this competition as it progresses.

Regarding the size of the test and submission: these are just placeholders, as explained in [this section](https://www.kaggle.com/julian3833/1-the-competition-qa-for-qa-noobs#Code-requirements) of the [first notebook](https://www.kaggle.com/julian3833/1-the-competition-qa-for-qa-noobs). It is a common practice in `Kernel-only` competitions like this one.

In [None]:
import pandas as pd
df_train = pd.read_csv(BASE_PATH + "train.csv")
df_test = pd.read_csv(BASE_PATH + "test.csv")
df_sub = pd.read_csv(BASE_PATH + "sample_submission.csv")

# How many training and test samples have been provided?
print(f"Training shape  : {df_train.shape}")
print(f"Test shape      : {df_test.shape}")
print(f"Submission shape: {df_sub.shape}")

In [None]:
df_train.head()

In [None]:
# This is the full df_test, not only the head
df_test

In [None]:
# Same here
df_sub

# Let's take a look at `df_train`

It has a `question` and a `context` (the inputs), an `answer_text` (the output), and the `answer_start` position indicator, which is common practice, as we mentioned in the [first notebook](https://www.kaggle.com/julian3833/1-the-competition-qa-for-qa-noobs#Question-Answering).

Note that the submission only requires the `PredictionString`, not its position.

In [None]:
df_train.head()

As we explained in the first notebook, the answer is always a substring of the context:

In [None]:
for _, row in df_train.iterrows():
    assert row.answer_text in row.context
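
A related sanity check, sketched here on a tiny made-up DataFrame rather than the real `df_train`: assuming `answer_start` is a character offset into `context`, slicing the context at that offset should give back `answer_text`. The helper name `answer_start_matches` and the toy rows are assumptions for illustration only.

```python
import pandas as pd

def answer_start_matches(row):
    """Return True if slicing `context` at `answer_start` reproduces `answer_text`."""
    start = row["answer_start"]
    end = start + len(row["answer_text"])
    return row["context"][start:end] == row["answer_text"]

# Toy rows (made up for illustration, not taken from the competition data)
df_toy = pd.DataFrame({
    "context": ["The capital of France is Paris.", "Water boils at 100 degrees."],
    "answer_text": ["Paris", "100"],
    "answer_start": [25, 0],  # the second offset is deliberately wrong
})

mask = df_toy.apply(answer_start_matches, axis=1)
print(mask.tolist())
```

Applied to the real data, something like `df_train.apply(answer_start_matches, axis=1).mean()` would give the fraction of rows whose offset lines up with the answer text, which may be a quick way to spot noisy annotations.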

## Language percentages
`67% hindi`, 
`33% tamil`

In [None]:
display(df_train['language'].value_counts())
print()
df_train['language'].value_counts(normalize=True).round(2)

In [None]:
df_train['language'].value_counts(normalize=True).round(2).plot.bar(alpha=0.5, rot=0, color=['red', 'green'], figsize=(10, 5));

# Length of text columns (number of words)

| Field | Average | Min | Max |
| -- | -- | -- | -- |
| question|  7 | 3 | 22| 
| answer|  7 | 1 | 51| 
| context|  1694 | 24 | 10259| 

The context is huge! I don't know how well models work in this sequence-length regime. We will see...
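
One common way to handle contexts longer than a model's input limit is to split them into overlapping windows and run the model on each window. The sketch below does this on whitespace-separated words only. Real pipelines usually window over tokenizer output instead, and the names `split_into_windows`, `max_len` and `stride` are illustrative assumptions, not something taken from this dataset or from a specific library.

```python
def split_into_windows(words, max_len=384, stride=128):
    """Split a list of words into windows of at most `max_len` words.

    Consecutive windows start `max_len - stride` words apart, so they
    overlap by `stride` words (requires 0 <= stride < max_len).
    """
    if not 0 <= stride < max_len:
        raise ValueError("stride must satisfy 0 <= stride < max_len")
    step = max_len - stride
    windows = []
    for start in range(0, len(words), step):
        windows.append(words[start:start + max_len])
        if start + max_len >= len(words):
            break  # this window already reaches the end of the text
    return windows

# Tiny example: 10 words, windows of 4 words overlapping by 2
words = [f"w{i}" for i in range(10)]
windows = split_into_windows(words, max_len=4, stride=2)
for w in windows:
    print(w)
```

The overlap matters because an answer that straddles a window boundary would otherwise be cut in half and could not be extracted from any single window.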

In [None]:
df_train['question'].str.split().str.len().hist(figsize=(10, 5), alpha=0.5)
pd.DataFrame(df_train['question'].str.split().str.len().describe().round(2)).T

In [None]:
df_train['answer_text'].str.split().str.len().hist(figsize=(10, 5), alpha=0.5)
pd.DataFrame(df_train['answer_text'].str.split().str.len().describe().round(2)).T

In [None]:
df_train['context'].str.split().str.len().hist(figsize=(10, 5), alpha=0.5)
pd.DataFrame(df_train['context'].str.split().str.len().describe().round(2)).T

In [None]:
# You can uncomment this line to see the size of the largest context:
# df_train.loc[df_train['context'].str.split().str.len() == 10259, 'context'].iloc[0]

# Some quick and dirty transliterations

I saw some beautiful EDAs with [wordclouds](https://www.kaggle.com/hoshi7/chaii-the-beginning-eda-wordclouds) in `Hindi` and `Tamil` and thought immediately: a transliteration could be something good to do.


# What is transliteration?  `अक्तूबर` -> `aktūbr` (October)

Transliteration replaces the characters of one script with phonetically similar characters of another. It enables or improves phonetic readability and, sometimes, interpretability too.

See this example:

This is how you write `police` in Russian: `полиция`.

And this is how it looks when you transliterate Cyrillic to Latin: `politsiya`

It's still Russian, but much more familiar, isn't it?
Transliteration is a simple phonetic mapping from one alphabet to another. Here, the mapping was:
```python
{'п': 'p', 'о': 'o', 'л': 'l', 'и': 'i', 'ц': 'ts', 'я': 'ya'}
```
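
Applying such a mapping is just a character-by-character lookup. Here is a minimal sketch, using a toy table that covers only the letters of `полиция`. The names `CYRILLIC_TO_LATIN` and `transliterate_char_by_char` are made up for this example.

```python
# Toy table: only the letters needed for the example word
CYRILLIC_TO_LATIN = {'п': 'p', 'о': 'o', 'л': 'l', 'и': 'i', 'ц': 'ts', 'я': 'ya'}

def transliterate_char_by_char(text, mapping):
    """Replace each character using `mapping`; keep unmapped characters unchanged."""
    return ''.join(mapping.get(ch, ch) for ch in text)

result = transliterate_char_by_char('полиция', CYRILLIC_TO_LATIN)
print(result)  # politsiya
```

Note that one source character can map to several target characters (`ц` becomes `ts`), which is why a dictionary lookup is used here instead of a one-to-one character swap.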


# Origin of the tables

I couldn't find well-established Python packages for that, at least not quickly. But I did find the following tables:

For Hindi:
* https://pandey.github.io/posts/transliterate-devanagari-to-latin.html

For Tamil:
* https://www.loc.gov/catdir/cpso/romanization/tamil.pdf


Note that a few characters are dropped (this is actually quick and dirty).
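
To get a feel for which characters a table misses, one option is to count every character of a text that has no entry in the mapping. This is only a sketch on a toy mapping. `unmapped_chars` is a hypothetical helper, not part of the transliteration code in this notebook.

```python
from collections import Counter

def unmapped_chars(text, mapping):
    """Count the characters of `text` that have no entry in `mapping`."""
    return Counter(ch for ch in text if ch not in mapping)

# Toy mapping and text, for illustration only
toy_map = {'a': 'a', 'b': 'b', ' ': ' '}
missing = unmapped_chars('abba cab!', toy_map)
print(missing.most_common())
```

Running the same kind of count over a whole text column would show the most frequent characters that a table does not cover, which are the first candidates to add.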


# Usage

The usage is quite straightforward. See examples below for some good surprises!
```python
df_trans = transliterate(df_train)
```

In [None]:
import string

def transliterate_hindi(st):
    HINDI_MAP = { 'ॐ' : 'oṁ', 'ऀ' : 'ṁ', 'ँ' : 'ṃ', 'ं' : 'ṃ', 'ः' : 'ḥ', 'अ' : 'a', 'आ' : 'ā', 'इ' : 'i', 'ई' : 'ī', 'उ' : 'u', 'ऊ' : 'ū', 'ऋ' : 'r̥', 'ॠ' : ' r̥̄', 'ऌ' : 'l̥', 'ॡ' : ' l̥̄', 'ऍ' : 'ê', 'ऎ' : 'e', 'ए' : 'e', 'ऐ' : 'ai', 'ऑ' : 'ô', 'ऒ' : 'o', 'ओ' : 'o', 'औ' : 'au', 'ा' : 'ā', 'ि' : 'i', 'ी' : 'ī', 'ु' : 'u', 'ू' : 'ū', 'ृ' : 'r̥', 'ॄ' : ' r̥̄', 'ॢ' : 'l̥', 'ॣ' : ' l̥̄', 'ॅ' : 'ê', 'े' : 'e', 'ै' : 'ai', 'ॉ' : 'ô', 'ो' : 'o', 'ौ' : 'au', 'क़' : 'q', 'क' : 'k', 'ख़' : 'x', 'ख' : 'kh', 'ग़' : 'ġ', 'ग' : 'g', 'ॻ' : 'g', 'घ' : 'gh', 'ङ' : 'ṅ', 'च' : 'c', 'छ' : 'ch', 'ज़' : 'z', 'ज' : 'j', 'ॼ' : 'j', 'झ' : 'jh', 'ञ' : 'ñ', 'ट' : 'ṭ', 'ठ' : 'ṭh', 'ड़' : 'ṛ', 'ड' : 'ḍ', 'ॸ' : 'ḍ', 'ॾ' : 'd', 'ढ़' : 'ṛh', 'ढ' : 'ḍh', 'ण' : 'ṇ', 'त' : 't', 'थ' : 'th', 'द' : 'd', 'ध' : 'dh', 'न' : 'n', 'प' : 'p', 'फ़' : 'f', 'फ' : 'ph', 'ब' : 'b', 'ॿ' : 'b', 'भ' : 'bh', 'म' : 'm', 'य' : 'y', 'र' : 'r', 'ल' : 'l', 'ळ' : 'ḷ', 'व' : 'v', 'श' : 'ś', 'ष' : 'ṣ', 'स' : 's', 'ह' : 'h', 'ऽ' : '\'', '्' : '', '़' : '', '०' : '0', '१' : '1', '२' : '2', '३' : '3', '४' : '4', '५' : '5', '६' : '6', '७' : '7', '८' : '8', '९' : '9', 'ꣳ' : 'ṁ', '।' : '.', '॥' : '..', ' ' : ' '}
    return ''.join(HINDI_MAP.get(c, c)  for c in st)

def transliterate_tamil(st):
    text = """அ a எ e ஆ ā ஏ ē இ i ஐ ai ஈ ī ஒ o உ u ஓ ō ஊ ū ஔ au ஃ ka ம ma க ka ய ya ங ṅa ர ra ச ca ல la ஞ ña வ va ட ṭa ழ la ண ṇa ள ḷa த ta ற ra ந na ன na ப pa ஜ ja ஸ sa ஶ śa ஹ ha ஷ ṣa""".split()
    TAMIL_MAP = dict(zip(text[0::2], text[1::2]))
    TAMIL_MAP.update({t: t for t in ' ?.1234567890'+string.ascii_lowercase})
    return ''.join(TAMIL_MAP.get(c.lower(), '') for c in st)

def transliterate(df_in, columns=('question', 'context', 'answer_text')):
    df = df_in.copy()
    for c in columns:
        df.loc[df['language'] == 'hindi', c] = df.loc[df['language'] == 'hindi', c].apply(transliterate_hindi)
        df.loc[df['language'] == 'tamil', c] = df.loc[df['language'] == 'tamil', c].apply(transliterate_tamil)        
    return df

In [None]:
df_trans = transliterate(df_train)

In [None]:
df_train.head(5)

In [None]:
df_trans.head(5)

In [None]:
df_train[df_train['language'] == 'hindi'].head(5)

In [None]:
df_trans[df_trans['language'] == 'hindi'].head(5)

It improves readability a little. See for example:

In [None]:
# This is a name. Adolph Meyr or something
df_trans[df_trans['language'] == 'hindi']['answer_text'].iloc[0]

In [None]:
df_train[df_train['language'] == 'hindi']['answer_text'].iloc[0]

And this is a date (October 27, 1605):

In [None]:
df_train.iloc[1112]['answer_text']

In [None]:
df_trans.iloc[1112]['answer_text']

I created a short notebook with the transliteration code so that it's easy to copy and paste: [Quick and Dirty Transliteration Tables](https://www.kaggle.com/julian3833/quick-and-dirty-transliteration-tables).

## What's next?

Enough of the data! Let's check the `Jaccard metric` in the [next notebook](https://www.kaggle.com/julian3833/3-the-metric-jaccard-qa-for-qa-noobs) so we can move to the Public Models.

If you want to see more EDA, there are some incredible notebooks around. These are the ones I liked the most, but there are many more!
* [EDA Chaii Gogogo 😅](https://www.kaggle.com/vaby667/eda-chaii-gogogo)
* [chaii-explore_the_data](https://www.kaggle.com/aakashnain/chaii-explore-the-data)
* [ChAii: The Beginning: EDA, Wordclouds](https://www.kaggle.com/hoshi7/chaii-the-beginning-eda-wordclouds)

&nbsp;
&nbsp;
&nbsp;
&nbsp;
&nbsp;
## Remember to upvote the notebook if you found it useful! 🤗
