
In this notebook, we will take a first glance at the competition and perform preliminary EDA to gain some insights.

**Aim of the competition:**
The task is to annotate parts of student essays with one of 7 discourse categories, with the goal of helping students improve their writing. The categories are as follows:
* Lead
* Position
* Evidence
* Claim
* Concluding Statement
* Counterclaim
* Rebuttal

**Evaluation Metric:**

> Submissions are evaluated on the overlap between ground truth and predicted word indices.
> 1. For each sample, all ground truths and predictions for a given class are compared.
> 2. If the overlap between the ground truth and prediction is >= 0.5, and the overlap between the prediction and the ground truth >= 0.5, the prediction is a match and considered a true positive. If multiple matches exist, the match with the highest pair of overlaps is taken.
> 3. Any unmatched ground truths are false negatives and any unmatched predictions are false positives.

The final score is arrived at by calculating a micro F1 score for each class, then taking the mean across all classes.
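
The matching rule above can be sketched in a few lines. This is a minimal illustration and not the official scoring code: the helper names (`overlaps`, `count_matches`, `f1_score`) are made up for this notebook, spans are assumed to be collections of word indices for a single class in a single essay, and ties between multiple candidate matches are resolved greedily by the sum of both overlaps, which is a simplification.

```python
def overlaps(gt, pred):
    """Return (share of gt covered by pred, share of pred covered by gt)."""
    gt, pred = set(gt), set(pred)
    if not gt or not pred:
        return 0.0, 0.0
    inter = len(gt & pred)
    return inter / len(gt), inter / len(pred)


def count_matches(gt_spans, pred_spans, threshold=0.5):
    """Count (TP, FP, FN) for one class; each span is a collection of word indices."""
    candidates = []
    for gi, gt in enumerate(gt_spans):
        for pi, pred in enumerate(pred_spans):
            o_gt, o_pred = overlaps(gt, pred)
            # both overlaps must reach the threshold for a candidate match
            if o_gt >= threshold and o_pred >= threshold:
                candidates.append((o_gt + o_pred, gi, pi))

    # greedy assignment (assumption): best-scoring pairs first, each span used once
    used_gt, used_pred = set(), set()
    for _, gi, pi in sorted(candidates, reverse=True):
        if gi not in used_gt and pi not in used_pred:
            used_gt.add(gi)
            used_pred.add(pi)

    tp = len(used_gt)
    return tp, len(pred_spans) - tp, len(gt_spans) - tp


def f1_score(tp, fp, fn):
    """F1 from raw counts; returns 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy example: two ground-truth spans and two predictions for the same class
gt_spans = [range(0, 10), range(20, 30)]
pred_spans = [range(2, 10), range(40, 45)]
tp, fp, fn = count_matches(gt_spans, pred_spans)
print(tp, fp, fn, f1_score(tp, fp, fn))
```

In the toy example, the first prediction covers 80% of the first ground truth and lies entirely inside it, so it counts as a true positive. The second ground truth is missed and the second prediction matches nothing.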

<img src="https://media.giphy.com/media/3oxOCtREZeCtTGExEc/giphy.gif">

## Time to gear up! Let's import

In [None]:
import pandas as pd
import os
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()

import warnings
warnings.filterwarnings('ignore')

## The Actual First glance 👀
Loading the training data

In [None]:
df_train = pd.read_csv('../input/feedback-prize-2021/train.csv')
df_train.head()

### Label Distribution

In [None]:
# fetching all the labels
list(df_train['discourse_type'].unique())

In [None]:
# checking count of all the labels
df_train['discourse_type'].value_counts()

In [None]:
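# The Hated Pie Chart
counts = df_train['discourse_type'].value_counts()
# one pastel color per class
colors = sns.color_palette('pastel', n_colors=len(counts))
plt.figure(figsize=(16,9))

# labels come from counts.index so that each slice is paired with its own class
plt.pie(x=counts, labels=counts.index, colors=colors, autopct='%.0f%%')
plt.show()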

In [None]:
# the classic bar plot
plt.figure(figsize=(12,6))
df_train['discourse_type'].value_counts().plot(kind='bar')
plt.show()

**To Note:**
- Class Imbalance 
- Low Support for the classes: Lead, Counterclaim and Rebuttal   
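
To put numbers on the imbalance, relative class frequencies are often easier to read than raw counts. The sketch below uses a small made-up series as a stand-in for `df_train['discourse_type']`; the values are for illustration only and do not reflect the real label distribution.

```python
import pandas as pd

# Toy stand-in for df_train['discourse_type'] (made-up values)
toy_labels = pd.Series(
    ['Claim', 'Claim', 'Claim', 'Evidence', 'Evidence', 'Lead'],
    name='discourse_type',
)

# normalize=True returns the share of each class instead of the raw count
shares = toy_labels.value_counts(normalize=True)
print(shares.round(2))
```

On the real data, the same call would be `df_train['discourse_type'].value_counts(normalize=True)`.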

### Training Data Stats

In [None]:
# total number of annotated discourse segments (one row per segment)
print(f"The total number of text segments is {len(df_train)}.")

In [None]:
# number of text files (essays) available for training
print(f"There are {df_train['id'].nunique()} essays.")


In [None]:
# number of whitespace-separated words in each discourse segment
df_train['len_word'] = df_train['discourse_text'].apply(lambda x: len(x.split()))

In [None]:
df_train.head()

### Label-wise Stats

In [None]:
# Category wise Dataframes
df_lead = df_train.loc[df_train["discourse_type"]=='Lead']
df_claim = df_train.loc[df_train["discourse_type"]=='Claim']
df_evidence = df_train.loc[df_train["discourse_type"]=='Evidence']
df_position = df_train.loc[df_train["discourse_type"]=='Position']
df_concluding = df_train.loc[df_train["discourse_type"]=='Concluding Statement']
df_rebuttal = df_train.loc[df_train["discourse_type"]=='Rebuttal']
df_counterclaim = df_train.loc[df_train["discourse_type"]=='Counterclaim']



In [None]:
# Lead
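plt.figure(figsize=(16,9))
print("Mean number of words per Lead segment:", df_lead.len_word.mean())
# histplot with a KDE overlay replaces the deprecated sns.distplot
sns.histplot(df_lead['len_word'], kde=True, stat='density', color="b")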
plt.show()

In [None]:
# Claim
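plt.figure(figsize=(16,9))
print("Mean number of words per Claim segment:", df_claim.len_word.mean())
# histplot with a KDE overlay replaces the deprecated sns.distplot
sns.histplot(df_claim['len_word'], kde=True, stat='density', color="b")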
plt.show()

In [None]:
# Evidence
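plt.figure(figsize=(16,9))
print("Mean number of words per Evidence segment:", df_evidence.len_word.mean())
# histplot with a KDE overlay replaces the deprecated sns.distplot
sns.histplot(df_evidence['len_word'], kde=True, stat='density', color="b")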
plt.show()

In [None]:
# Position
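plt.figure(figsize=(16,9))
print("Mean number of words per Position segment:", df_position.len_word.mean())
# histplot with a KDE overlay replaces the deprecated sns.distplot
sns.histplot(df_position['len_word'], kde=True, stat='density', color="b")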
plt.show()

In [None]:
# Concluding Statement
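plt.figure(figsize=(16,9))
print("Mean number of words per Concluding Statement segment:", df_concluding.len_word.mean())
# histplot with a KDE overlay replaces the deprecated sns.distplot
sns.histplot(df_concluding['len_word'], kde=True, stat='density', color="b")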
plt.show()

In [None]:
# Rebuttal
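plt.figure(figsize=(16,9))
print("Mean number of words per Rebuttal segment:", df_rebuttal.len_word.mean())
# histplot with a KDE overlay replaces the deprecated sns.distplot
sns.histplot(df_rebuttal['len_word'], kde=True, stat='density', color="b")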
plt.show()

In [None]:
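# Counterclaim
plt.figure(figsize=(16,9))
print("Mean number of words per Counterclaim segment:", df_counterclaim.len_word.mean())
# histplot with a KDE overlay replaces the deprecated sns.distplot
sns.histplot(df_counterclaim['len_word'], kde=True, stat='density', color="b")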
plt.show()
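
The per-class means printed above can also be collected into a single table with a `groupby`, which makes the classes easier to compare side by side. The sketch below runs on a tiny made-up frame with the same column names as `df_train`; the texts are placeholders, not real essay content.

```python
import pandas as pd

# Toy stand-in for df_train (made-up rows)
toy = pd.DataFrame({
    'discourse_type': ['Lead', 'Lead', 'Claim', 'Claim', 'Evidence'],
    'discourse_text': [
        'one two three',
        'one two three four five',
        'one two',
        'one two three four',
        'a b c d e f',
    ],
})

# number of whitespace-separated words per segment
toy['len_word'] = toy['discourse_text'].str.split().str.len()

# one row per class with segment count and word-count statistics
summary = toy.groupby('discourse_type')['len_word'].agg(['count', 'mean', 'median', 'max'])
print(summary)
```

On the real data, replacing `toy` with `df_train` would give the same statistics for all 7 classes at once.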

## Looking at the text files

In [None]:
# loading the text files into a dataframe
data = []
src_path = '/kaggle/input/feedback-prize-2021/train'
for elem in os.listdir(src_path):
    # read the whole essay in one call
    with open(os.path.join(src_path, elem)) as f:
        text = f.read()
    # the essay id is the file name without its extension
    data.append({'id': os.path.splitext(elem)[0], 'text': text})

df_text = pd.DataFrame(data)


### Word counts for each text file/essay

In [None]:
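len_words = []
for elem in list(df_text['text']):
    # split() with no argument also splits on newlines and ignores repeated spaces,
    # so the count is consistent with the per-segment len_word column
    len_words.append(len(elem.split()))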

In [None]:
df_text['len_words'] = len_words

In [None]:
df_text.head()

### Distribution of essay word lengths 

In [None]:
# distribution of the number of words per essay
plt.figure(figsize=(16,9))
df_text['text'].apply(lambda x: len(x.split())).hist(bins=30)
plt.show()

In [None]:
print("Average number of words per essay: ", df_text.len_words.mean())

### Up Next:
- Advanced EDA
- Baseline training code
- Baseline inference code
- Beating the Baseline

**If you like it so far, consider upvoting 😄** 

<img src="https://media.giphy.com/media/eunrMjB8lBUKeL1fqD/giphy-downsized.gif">