# Exploratory Data Analysis 🕵️
## Looking at number of tokens, characters, unknown tokens, what languages are in text, and more

#### There might be some useful post-processing approaches that could be explored based on this work

#### Note: some figures might be hard to read in viewer mode, but you can use the plotly zoom features to see what the label is for each tick. Alternatively you can copy and edit the notebook to do your own investigating.

In [None]:
# needed for RemBERT
!pip install -U --no-build-isolation --no-deps ../input/transformers-master/ -qq

In [None]:
import pandas as pd
from collections import Counter
from transformers import AutoTokenizer
import plotly.express as px

df = pd.read_csv("../input/chaii-hindi-and-tamil-question-answering/train.csv")

# Tokenizing
I'm using XLM-R (the `deepset/xlm-roberta-base-squad2` checkpoint).

In [None]:
tokenizer = AutoTokenizer.from_pretrained("deepset/xlm-roberta-base-squad2")

context_tokens = [tokenizer(txt)["input_ids"] for txt in df["context"]]
question_tokens = [tokenizer(txt)["input_ids"] for txt in df["question"]]
answer_tokens = [tokenizer(txt)["input_ids"] for txt in df["answer_text"]]

# Token counts come from the tokenizer output; character counts come from the raw strings
df["num_tokens_context"] = [len(tok) for tok in context_tokens]
df["num_chars_context"] = df["context"].str.len()
df["num_tokens_question"] = [len(tok) for tok in question_tokens]
df["num_chars_question"] = df["question"].str.len()
df["num_tokens_answer"] = [len(tok) for tok in answer_tokens]
df["num_chars_answer"] = df["answer_text"].str.len()

# Checking for unknown tokens

Edit: September 8, 2021

For XLM-R, it looks like there are unknowns in the context tokens!

Thank you @harveenchadha for finding the mistake in my code!!

In [None]:
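from itertools import chain

# Flatten the per-row token lists; chain avoids the quadratic cost of sum(list_of_lists, [])
context_tokens_flat = list(chain.from_iterable(context_tokens))
question_tokens_flat = list(chain.from_iterable(question_tokens))
answer_tokens_flat = list(chain.from_iterable(answer_tokens))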

unk = tokenizer.unk_token_id

unk in context_tokens_flat, unk in question_tokens_flat, unk in answer_tokens_flat

There are actually 212 unknown tokens across all contexts, which is a tiny fraction of all tokens. I don't think it would be worth adding new tokens to the tokenizer because the associated embeddings wouldn't be trained very well. If you disagree, please comment!
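
Before deciding whether new tokens are worth it, it might help to see which words actually produce the unknowns. Below is a minimal sketch, not part of the original analysis. `words_with_unk` and `toy_encode` are hypothetical names, and the toy encoder only stands in for a real tokenizer so the example runs on its own.

```python
from collections import Counter


def words_with_unk(texts, encode, unk_id):
    """Count whitespace-separated words whose encoding contains the unknown id."""
    counts = Counter()
    for text in texts:
        for word in text.split():
            if unk_id in encode(word):
                counts[word] += 1
    return counts


# Toy stand-in for a tokenizer: one id per character, 0 for anything unseen.
vocab = {"a": 1, "b": 2}


def toy_encode(word):
    return [vocab.get(ch, 0) for ch in word]


print(words_with_unk(["ab aXb", "aXb b"], toy_encode, unk_id=0))
```

In the notebook you could pass something like `lambda w: tokenizer(w, add_special_tokens=False)["input_ids"]` together with `tokenizer.unk_token_id`. Keep in mind that tokenizing words in isolation may differ slightly from tokenizing them in context, so treat the result as a rough guide.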

In [None]:
num_unk_tokens = sum([tok == unk for tok in context_tokens_flat])
num_unk_tokens, num_unk_tokens/len(context_tokens_flat)

# MuRIL unknown tokens

It looks like MuRIL has even more unknowns.

In [None]:
tokenizer_muril = AutoTokenizer.from_pretrained("google/muril-base-cased")

context_tokens_muril = [tokenizer_muril(txt)["input_ids"] for txt in df["context"]]
question_tokens_muril = [tokenizer_muril(txt)["input_ids"] for txt in df["question"]]
answer_tokens_muril = [tokenizer_muril(txt)["input_ids"] for txt in df["answer_text"]]

context_tokens_flat_muril = sum(context_tokens_muril, [])
question_tokens_flat_muril = sum(question_tokens_muril, [])
answer_tokens_flat_muril =  sum(answer_tokens_muril, [])

unk_muril = tokenizer_muril.unk_token_id

print("Unk token in context, question, answer")
print(unk_muril in context_tokens_flat_muril, unk_muril in question_tokens_flat_muril, unk_muril in answer_tokens_flat_muril)

print("Num unk tokens in context, question, answer")
sum([tok == unk_muril for tok in context_tokens_flat_muril]), sum([tok == unk_muril for tok in question_tokens_flat_muril]), sum([tok == unk_muril for tok in answer_tokens_flat_muril])

# Now with RemBERT

RemBERT has 119 unknown tokens in the contexts.

In [None]:
tokenizer_rembert = AutoTokenizer.from_pretrained("google/rembert")

context_tokens_rembert = [tokenizer_rembert(txt)["input_ids"] for txt in df["context"]]
question_tokens_rembert = [tokenizer_rembert(txt)["input_ids"] for txt in df["question"]]
answer_tokens_rembert = [tokenizer_rembert(txt)["input_ids"] for txt in df["answer_text"]]

context_tokens_flat_rembert = sum(context_tokens_rembert, [])
question_tokens_flat_rembert = sum(question_tokens_rembert, [])
answer_tokens_flat_rembert =  sum(answer_tokens_rembert, [])

unk_rembert = tokenizer_rembert.unk_token_id

print("Unk token in context, question, answer")
print(unk_rembert in context_tokens_flat_rembert, unk_rembert in question_tokens_flat_rembert, unk_rembert in answer_tokens_flat_rembert)

print("Num unk tokens in context, question, answer")
sum([tok == unk_rembert for tok in context_tokens_flat_rembert]), sum([tok == unk_rembert for tok in question_tokens_flat_rembert]), sum([tok == unk_rembert for tok in answer_tokens_flat_rembert])

# Looking at character level

In [None]:
contexts = df["context"]
answers = df["answer_text"]

all_chars_ctx = "".join(contexts)
all_chars_ans = "".join(answers)

unq_chars_ctx = sorted(list(set(all_chars_ctx)))
unq_chars_ans = sorted(list(set(all_chars_ans)))

# About 190 contexts and 120 answers are duplicates

In [None]:
print("Contexts: ", len(contexts), contexts.nunique())
print("Answers: ", len(answers), answers.nunique())

# Looking at what types of characters are in the context

It turns out there are many languages in addition to Hindi and Tamil. I see:
- English
- Latin
- Greek
- Japanese
- Chinese
- Arabic
- Nepali
- and many more...

A multilingual model will be *very* important.
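
Eyeballing a long string of characters is error-prone, so here is a rough sketch of how the scripts could be tallied instead. It uses the first word of each character's Unicode name as a proxy for its script, which is my own shortcut and only approximate (digits show up as `DIGIT`, punctuation under names like `COMMA`, and so on). `script_counts` is a hypothetical helper name.

```python
import unicodedata
from collections import Counter


def script_counts(text):
    """Tally characters by the first word of their Unicode name (a rough script proxy)."""
    counts = Counter()
    for ch in text:
        # Characters without a Unicode name (e.g. control characters) fall under UNKNOWN
        name = unicodedata.name(ch, "UNKNOWN")
        counts[name.split()[0]] += 1
    return counts


print(script_counts("aक த"))
```

Applied to `unq_chars_ctx`, this would give one count per script-like prefix, which is easier to scan than the raw character dump.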

In [None]:
"".join(unq_chars_ctx)

# Answer characters are less varied
I noticed that one of my predictions had a stray parenthesis, which made the Jaccard score 0 (`(1990` compared to `1990`). It might be a good idea to strip unbalanced punctuation and non-letters (commas, periods) from the beginning and end of answers. I might be wrong, but it looks like the only languages in the answers are English, Hindi, and Tamil.
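
To make the clean-up idea concrete, here is a minimal sketch. The `jaccard` function is a word-level Jaccard score that I am assuming behaves like the competition metric (with an added guard for two empty strings), and `clean_answer` is a hypothetical helper. Trailing periods are deliberately not stripped by default, since abbreviations such as கி.மு. can legitimately end in one.

```python
def jaccard(pred, truth):
    """Word-level Jaccard similarity between two strings."""
    a = set(pred.lower().split())
    b = set(truth.lower().split())
    union = a | b
    if not union:
        return 0.0  # both strings empty; avoid dividing by zero
    return len(a & b) / len(union)


OPENERS = {"(": ")", "[": "]"}
CLOSERS = {")": "(", "]": "["}


def clean_answer(text, strip_chars=" ,;:"):
    """Strip edge punctuation and drop an unmatched bracket at either end."""
    text = text.strip(strip_chars)
    # Leading opener with no matching closer anywhere in the text
    if text and text[0] in OPENERS and OPENERS[text[0]] not in text:
        text = text[1:]
    # Trailing closer with no matching opener anywhere in the text
    if text and text[-1] in CLOSERS and CLOSERS[text[-1]] not in text:
        text = text[:-1]
    return text.strip(strip_chars)


print(jaccard("(1990", "1990"))
print(jaccard(clean_answer("(1990"), "1990"))
```

Balanced brackets such as `(abc)` are left alone, so the clean-up should only touch predictions that look truncated.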

In [None]:
"".join(unq_chars_ans)

# Answers with periods are often for numbers or dates.

கி.மு means BC and கி.பி means AD  
ई.पू. means BC and ई means AD

I would guess that answers ending in `...` or starting with `.` are annotator mistakes. The AD and BC abbreviations are inconsistent about ending in a period (263 and 288), though that might come from the source text. Might be worth probing.

In [None]:
answers[answers.str.contains(r"\.")]

# Is it more common for dates to be in Arabic Numerals (0123456789), Devanagari Numerals (०१२३४५६७८९), or Tamil Numerals (௦௧௨௩௪௫௬௭௮௯௰)?
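
Whichever system turns out to be most common, a prediction and a label written in different numerals can never match under a word-level metric, so normalizing digits might be a useful post-processing step. A minimal sketch using only the standard library follows. `to_ascii_digits` is a hypothetical helper name, and whether to normalize at all depends on what the labels look like.

```python
import unicodedata


def to_ascii_digits(text):
    """Replace any Unicode decimal digit with its ASCII equivalent."""
    out = []
    for ch in text:
        # decimal() returns the digit value, or the default (None) for non-digits
        value = unicodedata.decimal(ch, None)
        out.append(str(value) if value is not None else ch)
    return "".join(out)


print(to_ascii_digits("१९९०"))
print(to_ascii_digits("௧௯௯௦"))
```

Note that ௰ is the Tamil sign for the number ten rather than a decimal digit, so this sketch leaves it unchanged.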

In [None]:
# Arabic
results = answers[answers.str.contains(r"[0123456789]{4}")]
print(len(results))
results.tolist()

In [None]:
# Devanagari
results = answers[answers.str.contains(r"[०१२३४५६७८९]{4}")]
print(len(results))
results.tolist()

In [None]:
# Tamil
results = answers[answers.str.contains(r"[௦௧௨௩௪௫௬௭௮௯௰]{4}")]
print(len(results))
results.tolist()

# Nothing starts or ends with dashes

In [None]:
answers[answers.str.contains("-", regex=False)]  # I don't understand where the '-' is in 648

# Again, I think ending in a comma is an annotator mistake

In [None]:
answers[answers.str.startswith(r",")].tolist(), answers[answers.str.endswith(r",")].tolist()

In [None]:
answers[answers.str.contains(r",")].tolist()

# Looking at characters by count

#### Note: some figures might be hard to read, but you can use the zoom features to see what the label is for each bar

In [None]:
most_common = Counter(all_chars_ctx).most_common(50)
"".join([x[0] for x in most_common])

# Context characters
The top 50 characters look mostly Hindi and Tamil.

In [None]:
px.bar(x=[x[0] for x in most_common], y=[x[1] for x in most_common], labels={"x": "character", "y": "count"})

# Answer characters

Some digits show up here.

In [None]:
most_common_ans = Counter(all_chars_ans).most_common(50)
px.bar(x=[x[0] for x in most_common_ans], y=[x[1] for x in most_common_ans], labels={"x": "character", "y": "count"})

# Number of tokens in context

Long contexts are probably harder because the model cannot look at everything at once. Even Big Bird can't handle 14k tokens. One approach might be to first identify the chunk of 500 or 1,000 tokens most likely to contain the answer, and then pass only that smaller chunk to the QA model.
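
As a rough illustration of the chunk-selection idea, the sketch below splits a context into overlapping windows and keeps the window that shares the most distinct words with the question. Everything here is an assumption for illustration: the helper names are hypothetical, the windows are measured in whitespace-separated words rather than tokens, and plain word overlap is a very naive stand-in for a real retriever.

```python
def sliding_windows(items, max_len, overlap):
    """Return (start, window) pairs that cover `items` with overlapping windows."""
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must be non-negative and smaller than max_len")
    step = max_len - overlap
    windows = []
    start = 0
    while True:
        windows.append((start, items[start:start + max_len]))
        if start + max_len >= len(items):
            break  # this window already reaches the end
        start += step
    return windows


def best_window(context, question, max_len=500, overlap=100):
    """Pick the window of context words sharing the most distinct words with the question."""
    q_words = set(question.rstrip("?").split())
    windows = sliding_windows(context.split(), max_len, overlap)
    # max() keeps the earliest window when several windows tie
    start, words = max(windows, key=lambda sw: len(q_words & set(sw[1])))
    return start, " ".join(words)


print(best_window("a b c d e f g h i j", "g h?", max_len=4, overlap=2))
```

For token-level windows, Hugging Face tokenizers can produce overlapping chunks directly when called with `truncation=True`, `max_length`, `stride`, and `return_overflowing_tokens=True`, which is probably the more practical route for the QA model itself.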

In [None]:
px.histogram(df, x="num_tokens_context", color="language")

In [None]:
# What fraction of the contexts are below a certain length?

only_hindi = df[df["language"]=="hindi"]
only_tamil = df[df["language"]=="tamil"]
num_hindi = len(only_hindi)
num_tamil = len(only_tamil)

# Go one step past the maximum so the curve reaches 1.0 at the longest context
lengths = list(range(0, df["num_tokens_context"].max() + 25, 25))
hindi_counts = []
tamil_counts = []
for length in lengths:
    hindi_counts.append((only_hindi["num_tokens_context"] <= length).sum() / num_hindi)
    tamil_counts.append((only_tamil["num_tokens_context"] <= length).sum() / num_tamil)

counts_df = pd.DataFrame(data={"count": hindi_counts+tamil_counts, "length": lengths*2, "language": ["hindi"]*len(hindi_counts)+["tamil"]*len(tamil_counts)})
    
px.line(counts_df, x="length", y="count", color="language", labels={"count": "fraction below length"})

In [None]:
px.histogram(df, x="num_chars_context", color="language")

In [None]:
px.histogram(df, x="num_tokens_answer", color="language")

In [None]:
px.histogram(df, x="num_chars_answer", color="language")

In [None]:
px.histogram(df, x="num_tokens_question", color="language")

In [None]:
px.histogram(df, x="num_chars_question", color="language")

## Answer start is probably concentrated at low values (right-skewed) because data creators just took the first index of the answer string

In [None]:
px.histogram(df, x="answer_start", color="language")

# Does the answer start correlate with the length of the answer?

In [None]:
px.scatter(df, x="num_tokens_answer", y="answer_start", color="language")

# Does the answer start correlate with the length of the context?

In [None]:
px.scatter(df, x="num_tokens_context", y="answer_start", color="language")

# Does the length of the question correlate with the length of the context?

In [None]:
px.scatter(df, x="num_tokens_context", y="num_tokens_question", color="language")

# Does the length of the answer correlate with the length of the context?

In [None]:
px.scatter(df, x="num_tokens_context", y="num_tokens_answer", color="language")

# Does the length of the answer correlate with the length of the question?

In [None]:
px.scatter(df, x="num_tokens_question", y="num_tokens_answer", color="language")

# How many words in the question are in the answer?

In the scenario where your model doesn't produce anything, your best bet might be to take a word from the question. Putting nothing is definitely wrong, but using a word from the question has a very small chance of getting points.

In [None]:
def count_word_overlap(question, answer):
    if question.endswith("?"):
        question = question[:-1]
    q_splits = question.split()
    a_splits = answer.split()
    return sum([a in q_splits for a in a_splits])
    
df["num_overlap"] = [count_word_overlap(question, answer) for question, answer in df[["question", "answer_text"]].to_numpy()]

px.histogram(df, x="num_overlap", color="language")

# Of those words in the questions with no overlap, what are the most common words?

These are probably stopwords that you should not put in your answer if you are randomly choosing a word from the question.
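
Here is a small sketch of what such a fallback could look like. `fallback_answer` and the example stopword set are hypothetical. Instead of choosing at random, this version picks the longest non-stopword so the result is reproducible, which is an arbitrary choice on my part.

```python
def fallback_answer(question, stopwords):
    """Return a question word to use when the model produces no answer."""
    words = question.rstrip("?").split()
    candidates = [w for w in words if w not in stopwords]
    pool = candidates or words  # fall back to all words if everything is a stopword
    # Longest word wins; max() keeps the earliest one on ties
    return max(pool, key=len) if pool else ""


print(fallback_answer("who wrote the Ramayana?", {"who", "the"}))
```

In practice the stopword set could be seeded from the most common no-overlap words plotted in this section, with separate sets for Hindi and Tamil.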

#### Note: some figures might be hard to read, but you can use the zoom features to see what the label is for each bar

In [None]:
def breakup_question(question):
    if question.endswith("?"):
        question = question[:-1]
    return question.split()

hindi_df = df[df["language"]=="hindi"]

no_overlap_words = Counter(sum([breakup_question(q) for q in hindi_df[hindi_df["num_overlap"]==0]["question"]], []))

most_common_no_overlap_words = no_overlap_words.most_common(50)
px.bar(x=[x[0] for x in most_common_no_overlap_words], y=[x[1] for x in most_common_no_overlap_words], labels={"x": "word", "y": "count"}, title="Hindi words in question with 0 overlap with answer words")

In [None]:
tamil_df = df[df["language"]=="tamil"]

no_overlap_words = Counter(sum([breakup_question(q) for q in tamil_df[tamil_df["num_overlap"]==0]["question"]], []))

most_common_no_overlap_words = no_overlap_words.most_common(50)
px.bar(x=[x[0] for x in most_common_no_overlap_words], y=[x[1] for x in most_common_no_overlap_words], labels={"x": "word", "y": "count"}, title="Tamil words in question with 0 overlap with answer words")

# How bad could the training set be?

Given that we know the annotations for the training set are noisy and the `answer_start` values are just the first occurrence of `answer_text`, what is the worst-case scenario? The following cell checks how many times the answer is mentioned in each context. If the answer is mentioned more than once, it has a chance of being incorrectly labelled. If it is incorrectly labelled, your model will be penalized even when it may have gotten the right answer, because the loss is based on `answer_start`.

Assuming that all the answers are correct (which we know is wrong, [see thread here](https://www.kaggle.com/c/chaii-hindi-and-tamil-question-answering/discussion/264395)), it looks like at least 58% of `answer_start` values are correct, but about 41% of the `answer_start` values could be wrong.
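
To inspect individual cases, it can help to list every position where the answer string occurs rather than just counting them. A minimal sketch follows. `candidate_starts` is a hypothetical helper, and it escapes the answer so that characters like `(` or `.` are matched literally rather than being treated as regex syntax.

```python
import re


def candidate_starts(context, answer):
    """Return the character offsets of every non-overlapping occurrence of answer."""
    if not answer:
        return []  # an empty pattern would match at every position
    return [m.start() for m in re.finditer(re.escape(answer), context)]


print(candidate_starts("in 1990 and 1990.", "1990"))
```

Comparing this list against the labelled `answer_start` shows, for each row, how many alternative positions the true answer span could have been at.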

In [None]:
import re

# Escape the answer so regex metacharacters such as "(" or "." are matched literally
df["num_occurrences"] = [len(re.findall(re.escape(ans), ctx)) for ans, ctx in df[["answer_text", "context"]].values]

multiple_answers_df = df[df["num_occurrences"] > 1]
percent_mult_ans = len(multiple_answers_df)/len(df)

one_answer = df[df["num_occurrences"] == 1]
percent_one_ans = len(one_answer)/len(df)


print("Num multiple answers in context:", len(multiple_answers_df))
print(f"As a percentage of all samples: {percent_mult_ans:.1%}")
print(f"Percentage with one answer: {percent_one_ans:.1%}")
print("Sanity check: number with no answer:", len(df[df["num_occurrences"] == 0]))