<div class="alert alert-success">
    <h1 align='center'>1. Introduction and Imports 📔</h1>
</div>
<center>
Let's get started with this new text competition! 
<br>    
    With nearly 1.4 billion people, India is the second-most populous country in the world. Yet Indian languages, like Hindi and Tamil, are underrepresented on the web. 
<br>
    Popular Natural Language Understanding (NLU) models perform worse on Indian languages than on English, which leads to subpar experiences in downstream web applications for Indian users.
</center>

<center>
    <strong>If you found this notebook useful, you can leave an upvote!</strong>
</center>

In [None]:
import sys
sys.path.append('../input/rich-text-formatting')

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
import random

from transformers import pipeline

import plotly.graph_objs as go
import plotly.figure_factory as ff
import plotly.express as px
from plotly.subplots import make_subplots
from plotly.offline import iplot
from wordcloud import WordCloud

from rich import print as _pprint

In [None]:
def cprint(string):
    """
    Utility function for beautiful colored printing.
    """
    _pprint(f"[black]{string}[/black]")

<div class="alert alert-success">
    <h3 align='center'>1.1 What is our task? 🎯</h3>
</div>

In this competition we will predict answers to questions in Hindi and Tamil. Each answer is drawn directly from a limited context passage.

During inference, we will be given a hidden test set that is about the same size as our training set.

This is a Research Code-Competition.

The submission file should consist of 2 columns:
* `id`: Unique ID
* `PredictionString`: Predicted String
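
The format described above can be sanity-checked before submitting. The snippet below is a minimal sketch: `check_submission` is a hypothetical helper, and the extra checks for unique ids and non-missing predictions are assumptions of mine rather than stated competition rules.

```python
import pandas as pd

# Column names taken from the submission format described above
EXPECTED_COLUMNS = ["id", "PredictionString"]


def check_submission(sub: pd.DataFrame) -> None:
    """Raise ValueError if the frame does not look like a valid submission."""
    if list(sub.columns) != EXPECTED_COLUMNS:
        raise ValueError(f"Expected columns {EXPECTED_COLUMNS}, got {list(sub.columns)}")
    # Assumption: each test id should appear exactly once
    if sub["id"].duplicated().any():
        raise ValueError("Duplicate ids found")
    # Assumption: every row should carry a prediction
    if sub["PredictionString"].isna().any():
        raise ValueError("Missing predictions found")


# Toy data for illustration only
toy_sub = pd.DataFrame({
    "id": ["a1", "b2"],
    "PredictionString": ["answer one", "answer two"],
})
check_submission(toy_sub)  # passes silently when the format is as expected
```

The same call can be made on the real submission frame just before writing it to disk.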

<div class="alert alert-success">
    <h3 align='center'>1.2 What does the Data look like? 🗃</h3>
</div>

The data provided to us in this competition consists mainly of 2 `.csv` files (`train.csv` and `test.csv`), plus a sample submission file.

Below is the breakdown of the `.csv` files:

* 📄 `train.csv` - The training set, containing context, questions, and answers. Also includes the start character of the answer for disambiguation.


* 📄 `test.csv` - The test set, containing context and questions.


* 📄 `sample_submission.csv` - A sample submission file in the format we are expected to follow.
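
To make the role of the answer start character concrete, here is a small sketch that checks whether slicing the context at that offset reproduces the answer. The column names `answer_text` and `answer_start` are assumptions about `train.csv`, `answer_span_matches` is a hypothetical helper, and the rows are toy English data rather than real competition examples.

```python
import pandas as pd


def answer_span_matches(row) -> bool:
    """Return True if the context sliced at answer_start equals the answer text."""
    start = int(row["answer_start"])
    end = start + len(row["answer_text"])
    return row["context"][start:end] == row["answer_text"]


# Toy rows for illustration; the second offset is deliberately off by one
toy = pd.DataFrame({
    "context": ["Chennai is the capital of Tamil Nadu.", "Delhi is in India."],
    "question": ["What is the capital of Tamil Nadu?", "Where is Delhi?"],
    "answer_text": ["Chennai", "India"],
    "answer_start": [0, 11],
})

toy["span_ok"] = toy.apply(answer_span_matches, axis=1)
print(toy[["answer_text", "answer_start", "span_ok"]])
```

Running the same check on the training frame is a quick way to spot rows where the offset and the answer text disagree.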

<div class="alert alert-success">
    <h3 align='center'>1.3 Evaluation Metric ✒</h3>
</div>

<center>
In this competition, our submissions will be judged on the Jaccard Score metric.
</center>

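$$
\text{score} = \frac{1}{n} \sum_{i=1}^{n} \text{jaccard}(gt_i, dt_i)
$$

Here $n$ is the number of test examples, $gt_i$ is the $i$-th ground-truth answer, and $dt_i$ is the $i$-th predicted answer.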
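
The averaged score above can be sketched in a few lines of Python. This is a minimal illustration that assumes a word-level Jaccard on lowercased, whitespace-split tokens; the official scorer may tokenize or handle edge cases differently, and treating two empty strings as a perfect match is my own assumption.

```python
def jaccard(str1: str, str2: str) -> float:
    """Word-level Jaccard similarity between two strings."""
    a = set(str1.lower().split())
    b = set(str2.lower().split())
    if not a and not b:
        return 1.0  # assumption: two empty strings count as a perfect match
    common = a & b
    return len(common) / (len(a) + len(b) - len(common))


def mean_jaccard(ground_truths, predictions) -> float:
    """Average the per-example Jaccard score, as in the formula above."""
    scores = [jaccard(gt, dt) for gt, dt in zip(ground_truths, predictions)]
    return sum(scores) / len(scores)


# Toy example: first pair matches exactly, second pair shares no words
print(mean_jaccard(["a b", "c"], ["a b", "d"]))
```

A helper like this is handy for scoring predictions on a held-out slice of the training data before submitting.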

<div class="alert alert-warning">
    <h1 align='center'>2. Data Loading and EDA 💹</h1>
</div>

In [None]:
train_file = pd.read_csv("../input/chaii-hindi-and-tamil-question-answering/train.csv")
test_file = pd.read_csv("../input/chaii-hindi-and-tamil-question-answering/test.csv")
sample_sub = pd.read_csv("../input/chaii-hindi-and-tamil-question-answering/sample_submission.csv")

In [None]:
train_file.head()

In [None]:
train_file.info()

In [None]:
train_file.describe()

In [None]:
test_file.head()

In [None]:
test_file.info()

In [None]:
test_file.describe()

In [None]:
sample_sub.head()

In [None]:
cprint("Total Training Examples: [green]{}[/green]".format(train_file.shape[0]))
cprint("Total Testing Examples: [green]{}[/green]".format(test_file.shape[0]))

<div class="alert alert-warning">
    <h3 align='center'>2.1 Training Samples by Language</h3>
</div>

In [None]:
train_file['language'].value_counts()

In [None]:
language_name = train_file['language'].value_counts().index.tolist()
language_val = train_file['language'].value_counts().tolist()

fig = px.bar(
    x=language_name,
    y=language_val,
    title="Training Samples by Language",
    labels={
        'x': 'Language',
        'y': 'Sample count'
    },
    color=language_val
)
fig.show()

In [None]:
fig = px.pie(
    names=language_name,
    values=language_val,
    title="Training Samples by Language - Pie Chart",
    color_discrete_sequence=px.colors.sequential.RdBu_r,
)
fig.show()

<div class="alert alert-warning">
    <h3 align='center'>2.2 EDA on Context</h3>
</div>

In [None]:
hindi = train_file[train_file['language']=='hindi']['context'].str.len()
tamil = train_file[train_file['language']=='tamil']['context'].str.len()

fig = make_subplots(rows=1, cols=2)

fig.add_trace(
    go.Histogram(x=list(hindi), name='Hindi Context'),
    row=1, 
    col=1
)

fig.add_trace(
    go.Histogram(x=list(tamil), name='Tamil Context'),
    row=1, 
    col=2,
)

fig.update_layout(height=400, width=800, title_text="Character Count by Language")
iplot(fig)

In [None]:
# Number of whitespace-separated words in each context, per language
hindi = train_file[train_file['language']=='hindi']['context'].str.split().str.len()
tamil = train_file[train_file['language']=='tamil']['context'].str.split().str.len()

fig = make_subplots(rows=1, cols=2)

fig.add_trace(
    go.Histogram(x=list(hindi), name='Hindi Context'),
    row=1, 
    col=1
)

fig.add_trace(
    go.Histogram(x=list(tamil), name='Tamil Context'),
    row=1, 
    col=2,
)

fig.update_layout(height=400, width=800, title_text="Word Count Distribution by Language")
iplot(fig)

In [None]:
# Mean word length (in characters) of each context, per language
def mean_word_length(text):
    return np.mean([len(word) for word in text.split()])

hindi = train_file[train_file['language']=='hindi']['context'].map(mean_word_length).to_list()
tamil = train_file[train_file['language']=='tamil']['context'].map(mean_word_length).to_list()

fig = ff.create_distplot([hindi, tamil], ['Hindi', 'Tamil'])
fig.update_layout(height=500, width=800, title_text="Average Word Length Distribution by Language")
iplot(fig)

In [None]:
hindi = train_file[train_file['language']=='hindi']['context'].apply(lambda x: len(set(str(x).split()))).to_list()
tamil = train_file[train_file['language']=='tamil']['context'].apply(lambda x: len(set(str(x).split()))).to_list()

fig = ff.create_distplot([hindi, tamil], ['Hindi', 'Tamil'])
fig.update_layout(height=500, width=800, title_text="Unique Word Count Distribution by Language")
iplot(fig)

<div class="alert alert-info">
    <h1 align='center'>3. Baseline Model using 🤗</h1>
</div>

In [None]:
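# Local copy of a multilingual BERT checkpoint fine-tuned on SQuAD
model = "../input/bbmcfs/bert-base-multilingual-cased-finetuned-squad"
# device=0 runs on the first GPU; pass device=-1 to run on CPU instead
qna = pipeline('question-answering', model=model, tokenizer=model, device=0)

predictions = []

# Predict one answer span per (question, context) pair in the test set
for question, context in test_file[["question", "context"]].to_numpy():
    result = qna(context=context, question=question)
    predictions.append(result["answer"])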

In [None]:
submission = pd.DataFrame()
submission['id'] = test_file['id']
submission['PredictionString'] = predictions
submission.to_csv("submission.csv", index=False)

submission.head()

In [None]:
cprint("[red]Work in progress! More coming soon[/red] ⚠")