# Brief Introduction

### Tamil

- Tamil is a Dravidian language spoken by Tamils in southern India, Sri Lanka, and elsewhere
- Tamil descends from Proto-Dravidian, with its emergence as a distinct language often placed around 450 BCE
- Tamil belongs to the Dravidian language family and is written in the Tamil script. It is one of the four major Dravidian languages, along with Telugu, Malayalam, and Kannada
- It is often regarded as the oldest of the Dravidian languages in terms of attested literature
- Tamil has been in continuous use for more than 2000 years, making it one of the longest-surviving classical languages in the world
- Beyond India and Sri Lanka, Tamil is widely spoken in Malaysia, Singapore, South Africa, and Mauritius

### Hindi

- Hindi is an Indo-Aryan (Indic) language of northern India that ultimately descends from Vedic Sanskrit
- Hindi is written in the Devanagari script
- Hindi belongs to the Indo-Aryan language family; its modern form is often dated to around the 17th century CE
- Hindi is one of the official languages of India's central government, and Tamil is likewise among India's constitutionally recognized languages

# Competition Overview:

In this competition, the goal is to predict answers to real questions about Wikipedia articles. You will use chaii-1, a new question answering dataset with question-answer pairs. The dataset covers Hindi and Tamil, collected without the use of translation. It provides a realistic information-seeking task with questions written by native-speaking expert data annotators. 


# Competition Rules:
- CPU Notebook <= 5 hours run-time
- GPU Notebook <= 5 hours run-time
- Internet access disabled
- Freely & publicly available external data is allowed, including pre-trained models
- Submission file must be named submission.csv
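
To make the file-naming rule concrete, here is a minimal sketch of writing a submission file. The column names `id` and `PredictionString` are an assumption about the layout of `sample_submission.csv` and should be checked against the actual file; the ids and answers below are made up for illustration.

```python
import csv

# Hypothetical (id, predicted answer) pairs; real values would come from a model.
rows = [
    ("example_id_1", "नई दिल्ली"),
    ("example_id_2", "1947"),
]

# The rules require this exact file name.
with open("submission.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    # Assumed header, to be verified against sample_submission.csv.
    writer.writerow(["id", "PredictionString"])
    writer.writerows(rows)
```

If the predictions already live in a pandas DataFrame, `DataFrame.to_csv("submission.csv", index=False)` would likely be the more natural choice in this notebook.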

# Competition Metrics:
The metric in this competition is the word-level Jaccard score:

```python
def jaccard(str1, str2):
    a = set(str1.lower().split())
    b = set(str2.lower().split())
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))
```
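
As a quick illustration of how this function scores predictions, the sketch below re-declares `jaccard` so it runs on its own and applies it to a few made-up answer pairs. The example strings and the `mean_jaccard` helper are hypothetical, and averaging the per-row scores is an assumption about how the final score is aggregated.

```python
def jaccard(str1, str2):
    # Same word-level Jaccard function as given for the competition metric.
    a = set(str1.lower().split())
    b = set(str2.lower().split())
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))


def mean_jaccard(truths, preds):
    # Hypothetical helper: average of the per-row Jaccard scores.
    scores = [jaccard(t, p) for t, p in zip(truths, preds)]
    return sum(scores) / len(scores)


# Made-up ground-truth answers and predictions.
truths = ["new delhi india", "1947", "नई दिल्ली"]
preds = ["new delhi", "1947", "दिल्ली"]

scores = [jaccard(t, p) for t, p in zip(truths, preds)]
mean_score = mean_jaccard(truths, preds)

for t, p, s in zip(truths, preds, scores):
    print(f"truth={t!r} pred={p!r} jaccard={s:.3f}")
print(f"mean jaccard = {mean_score:.3f}")
```

Because words are split on whitespace, a prediction that drops or adds a single word is penalized in proportion to the size of the combined word set. Note also that the function as written divides by zero when both strings contain no words.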

In [None]:
%matplotlib inline

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

from tqdm import tqdm
tqdm.pandas()

import gc

from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display

import plotly.express as px

from nltk import FreqDist
from nltk.corpus import stopwords
from nltk import ngrams

import os

import json

plt.rcParams["figure.figsize"] = (12, 8)
plt.rcParams['axes.titlesize'] = 16
sns.set_palette('Set3_r')

pd.set_option("display.max_rows", 20, "display.max_columns", None)

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        
from time import time, strftime, gmtime
start = time()
import datetime
print(str(datetime.datetime.now()))

import warnings
warnings.simplefilter(action = 'ignore', category = Warning)

In [None]:
train = pd.read_csv('/kaggle/input/chaii-hindi-and-tamil-question-answering/train.csv')
print(train.shape)
train.head()

In [None]:
test = pd.read_csv('/kaggle/input/chaii-hindi-and-tamil-question-answering/test.csv')
print(test.shape)
test.head()

In [None]:
sub = pd.read_csv('/kaggle/input/chaii-hindi-and-tamil-question-answering/sample_submission.csv')
print(sub.shape)
sub.head()

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2)
sns.countplot(x = 'language', data = train, ax = ax1).set_title('Train Language Counts')
for p in ax1.patches:
    ax1.annotate(str(p.get_height()), (p.get_x() * 1.005, p.get_height() * 1.005))
sns.countplot(x = 'language', data = test, ax = ax2).set_title('Test Language Counts')
for p in ax2.patches:
    ax2.annotate(str(p.get_height()), (p.get_x() * 1.005, p.get_height() * 1.005))

In [None]:
# Download and extract Devanagari (Hindi) and Tamil fonts (requires internet access in this notebook)
!wget -q http://www.lipikaar.com/sites/www.lipikaar.com/themes/million/images/support/fonts/Devanagari.zip
!wget -q http://www.lipikaar.com/sites/www.lipikaar.com/themes/million/images/support/fonts/Tamil.zip

!unzip -qq Devanagari.zip
!unzip -qq Tamil.zip

In [None]:
def length_dist(data, text = None):
    # Plot character-length and word-count distributions of the text Series passed in as `data`
    length = data.apply(len)
    words_len = data.apply(lambda x: len(x.split(' ')))
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize = (16, 10))
    sns.distplot(length, ax = ax1).set_title(f'{text} Length')
    sns.distplot(words_len, ax = ax2).set_title(f'{text} Word Count')

In [None]:
length_dist(train['context'], text = 'Train Context')

In [None]:
length_dist(train['question'], text = 'Train Question')

In [None]:
freq_dist = pd.Series(' '.join(train[train['language'] == 'tamil']['context']).split()).value_counts()
fig = px.line(freq_dist, 
             title = 'Train Context (Tamil) - Word Frequency')
fig.update_layout(showlegend = False)

In [None]:
freq_dist = pd.Series(' '.join(train[train['language'] == 'tamil']['question']).split()).value_counts()
fig = px.line(freq_dist, 
             title = 'Train Question (Tamil) - Word Frequency')
fig.update_layout(showlegend = False)

In [None]:
freq_dist = pd.Series(' '.join(train[train['language'] == 'tamil']['answer_text']).split()).value_counts()
fig = px.line(freq_dist, 
             title = 'Train Answer (Tamil) - Word Frequency')
fig.update_layout(showlegend = False)

In [None]:
freq_dist = pd.Series(' '.join(train[train['language'] == 'hindi']['context']).split()).value_counts()
fig = px.line(freq_dist, 
             title = 'Train Context (Hindi) - Word Frequency')
fig.update_layout(showlegend = False)

# WIP