<h1 style="color:blue"><center>CommonLit Readability Prize Competition</center></h1>
<h3><center>What's it all about?</center></h3>

**If you found this notebook helpful, you can leave a vote!**

📌**PyTorch BERT Multi-Model Trainer + KFolds🎯 (Training Notebook) - https://www.kaggle.com/heyytanay/pytorch-bert-multi-model-trainer-kfolds**

📌**Vanilla PyTorch BERT Starter (Submission Notebook)🎯 - https://www.kaggle.com/heyytanay/submission-nb-vanilla-pytorch-bert-starter**

## What is this Competition about? 🤷‍♂️

* In this competition, we are required to build algorithms to rate the complexity of reading passages for grade 3-12 classroom use.

* To accomplish this, we'll have to pair our machine learning skills with a dataset that includes readers from a wide variety of age groups and a large collection of texts taken from various domains. Winning models will be sure to incorporate text cohesion and semantics.

* If successful, students will benefit from feedback on the complexity and readability of their work, making it far easier to improve essential reading skills.

<hr>

## And what is our task? 🎯

In technical terms,

* Given a `train.csv` file that has (among others) two columns, `excerpt` and `target`, we have to train machine learning model(s) that approximate the relationship between the excerpt and the target.

* In simple words, we have to train a model that predicts the target value for a given text excerpt.

* This can be formulated as a regression problem on text.

<hr>

## Files and what they contain 📂

In this competition, we are provided with 3 files:

* 📄 `train.csv` :  This is the main training file. It consists of 6 columns: `id`, `excerpt`, `license`, `url_legal`, `target`, `standard_error`.

* 📄 `test.csv` :  This is the testing file. It consists of 4 columns: `id`, `url_legal`, `license`, `excerpt`.

* 📄 `sample_submission.csv` :  This is a sample submission file that shows how to format our submission file at inference time.


<hr>

<h2>Evaluation Metrics 🖊</h2>

Submissions are scored on the root mean squared error (RMSE), defined as:

$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}{\big(y_i - \hat{y}_i\big)^2}}$$

where $\hat{y}_i$ is the predicted value, $y_i$ is the original value, and $n$ is the number of rows in the test data.
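
The metric is easy to compute by hand, which helps when checking validation scores locally. A minimal sketch, assuming the plain unweighted RMSE; the function name `rmse` is an illustrative choice and not part of the competition code.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Square the residuals, average them, then take the square root
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy values, not taken from the competition data
print(rmse([0.0, -1.0, -2.0], [0.0, -1.0, -2.0]))
print(rmse([0.0, 0.0], [3.0, 4.0]))
```

Lower is better, and a score of 0 means every prediction matched its target exactly.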

<hr>
<h2> EDA time! 📊 </h2>

Enough talking, let's do some light EDA!

In [None]:
# Some imports :)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
import re
from tqdm.notebook import tqdm

import nltk
from nltk.corpus import stopwords

import plotly.graph_objs as go
import plotly.figure_factory as ff
import plotly.express as px
from plotly.subplots import make_subplots
from plotly.offline import iplot
from wordcloud import WordCloud

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import Ridge, LinearRegression
from sklearn.metrics import mean_squared_error

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
plt.style.use('classic')
sns.set_palette(sns.color_palette('winter_r'))

In [None]:
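# Assumption: `_pprint` is rich's print, since the [black]...[/black] markup below is rich syntax
from rich import print as _pprint

def cprint(string: str, end="\n"):
    """
    Print `string` in black using rich console markup.
    """
    _pprint(f"[black]{string}[/black]", end=end)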

In [None]:
# Importing data
training_file = pd.read_csv("../input/commonlitreadabilityprize/train.csv")
test_file = pd.read_csv("../input/commonlitreadabilityprize/test.csv")
submission = pd.read_csv("../input/commonlitreadabilityprize/sample_submission.csv")

In [None]:
training_file.head()

In [None]:
test_file.head()

In [None]:
submission.head()

## Peeking at the `target` column 👀

Let's take a look at the target column to see how it is distributed.

In [None]:
plt.figure(figsize=(8, 8))
plt.title("Target Column Distribution")
sns.histplot(training_file['target'], stat='density')
sns.kdeplot(training_file['target'], color='blue')
plt.axvline(training_file['target'].mean(), color='red', linestyle='--', linewidth=0.8)
min_ylim, max_ylim = plt.ylim()
plt.text(training_file['target'].mean()*1.05, max_ylim*0.96, 'Mean (μ): {:.2f}'.format(training_file['target'].mean()))
plt.xlabel("Value")
plt.ylabel("Density")
plt.show()

The `target` column looks approximately normally distributed.
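
A visual impression of normality can be backed up with two summary numbers: skewness and excess kurtosis, both of which are close to 0 for a normal distribution. The sketch below is only an illustration; `shape_summary` is a hypothetical helper, and the data is synthetic so that the snippet runs without the competition files.

```python
import numpy as np
import pandas as pd

def shape_summary(values):
    """Return skewness and excess kurtosis; both are near 0 for normal data."""
    series = pd.Series(values, dtype=float)
    return {"skew": float(series.skew()), "excess_kurtosis": float(series.kurt())}

# Synthetic stand-in for training_file['target'] (assumed values, not real data)
rng = np.random.default_rng(0)
fake_target = rng.normal(loc=-1.0, scale=1.0, size=5000)
print(shape_summary(fake_target))
```

On the real data, the same call would be `shape_summary(training_file['target'])`.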

## Peeking at the `standard_error` column 👀

Let's take a look at the `standard_error` column to see what it looks like.

In [None]:
plt.figure(figsize=(8, 8))
plt.title("Standard Error Column Distribution")
sns.histplot(training_file['standard_error'], stat='density')
sns.kdeplot(training_file['standard_error'], color='magenta')
plt.axvline(training_file['standard_error'].mean(), color='red', linestyle='--', linewidth=0.8)
min_ylim, max_ylim = plt.ylim()
plt.text(training_file['standard_error'].mean()*1.05, max_ylim*0.96, 'Mean (μ): {:.2f}'.format(training_file['standard_error'].mean()))
plt.xlabel("Value")
plt.ylabel("Density")
plt.show()

## Analysis of the `excerpt` Column 🎯

Let's now do some text-related analysis on the `excerpt` column and see how its text is structured.

### Character Count Distribution

In [None]:
# Number of characters in each excerpt
char_freq_count = training_file['excerpt'].str.len()

plt.figure(figsize=(8, 8))
plt.title("Character Count Distribution")
sns.histplot(char_freq_count, stat='density', color="#00a9ff")
sns.kdeplot(char_freq_count, color='purple')
plt.axvline(np.mean(char_freq_count), color='red', linestyle='--', linewidth=0.8)
min_ylim, max_ylim = plt.ylim()
plt.text(np.mean(char_freq_count)*0.64, max_ylim*0.96, 'Mean (μ): {:.2f}'.format(np.mean(char_freq_count)))
plt.xlabel("Count")
plt.ylabel("Density")
plt.show()

### Word Count Distribution



In [None]:
# Number of whitespace-separated words in each excerpt
word_count = training_file['excerpt'].str.split().str.len()

plt.figure(figsize=(8, 8))
plt.title("Word Count Distribution")
sns.histplot(word_count, stat='density', color="magenta")
sns.kdeplot(word_count, color='purple')
plt.axvline(np.mean(word_count), color='red', linestyle='--', linewidth=0.8)
min_ylim, max_ylim = plt.ylim()
plt.text(np.mean(word_count)*0.86, max_ylim*0.93, 'Mean (μ): {:.2f}'.format(np.mean(word_count)))
plt.xlabel("Word Count")
plt.ylabel("Density")
plt.show()

*The above word count distribution can help us decide what `max_len` we should use when training our models.*
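
One way to turn the distribution into a concrete number is to look at its upper percentiles and pick a length that covers most excerpts. This is a rough sketch under stated assumptions: `length_percentiles` is a hypothetical helper, the texts are toy examples, and whitespace word counts only approximate the subword token counts a BERT-style tokenizer would produce, so some headroom is advisable.

```python
import numpy as np
import pandas as pd

def length_percentiles(texts, percentiles=(50, 90, 95, 99)):
    """Map each requested percentile to the whitespace word count at that percentile."""
    counts = pd.Series(list(texts)).str.split().str.len()
    return {p: float(np.percentile(counts, p)) for p in percentiles}

# Toy excerpts standing in for training_file['excerpt']
toy_texts = [
    "a short sentence",
    "a slightly longer sentence with more words in it",
    "tiny",
    "this one sits somewhere in the middle of the pack",
]
print(length_percentiles(toy_texts))
```

On the real data, the call would be `length_percentiles(training_file['excerpt'])`.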

### Unique Word Count

In [None]:
unq_word_count = training_file['excerpt'].apply(lambda x: len(set(str(x).split()))).to_list()

fig = ff.create_distplot([unq_word_count], ['Excerpt Text'])
fig.update_layout(title_text="Unique Word Count Distribution")
iplot(fig)

### WordCloud

In [None]:
# Generate WordCloud
words = " ".join(training_file['excerpt'].tolist())
wc = WordCloud(width=5000, height=4000, background_color='black', min_font_size=10).generate(words)

# Plot it
plt.figure(figsize=(12, 12), facecolor='k', edgecolor='k')
plt.imshow(wc)
plt.axis("off")
plt.tight_layout(pad=0)

plt.show()

## Modelling

Let's try some basic modelling and then make a submission using that!

In [None]:
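# Split the data roughly: shuffle the rows, train on the first 2750, validate on the rest
data = training_file[['excerpt', 'target']]
# random_state is fixed here only so the split is reproducible
data = data.sample(frac=1, random_state=42).reset_index(drop=True)
# Take the arrays from the shuffled frame so the shuffle actually affects the split
excerpt, targets = data['excerpt'].values, data['target'].values

t_X, v_X = excerpt[:2750], excerpt[2750:]
t_Y, v_Y = targets[:2750], targets[2750:]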

print(t_X.shape, v_X.shape)
print(t_Y.shape, v_Y.shape)

In [None]:
# Make an Sklearn pipeline for this Ridge Regression
backbone_ridge = Ridge(fit_intercept=True)
pipeline_ridge = make_pipeline(
    TfidfVectorizer(binary=True, ngram_range=(1, 1)),
    backbone_ridge
)

# Do training
pipeline_ridge.fit(t_X, t_Y)

# Evaluate the performance on validation set
preds = pipeline_ridge.predict(v_X)
mse_loss = mean_squared_error(v_Y, preds)

print(f"MSE Loss using Ridge and TfIdfVectorizer: {mse_loss}")
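
A single hold-out split leaves only a small validation slice, so the score it reports can be noisy. K-fold cross-validation is one common way to get a steadier estimate. The sketch below is an illustration rather than part of the original workflow: `cv_rmse` is a hypothetical helper, the texts and targets are toy values, and the pipeline simply mirrors the TF-IDF + Ridge setup used here.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline

def cv_rmse(texts, targets, n_splits=5, seed=42):
    """Return the per-fold RMSE of a TF-IDF + Ridge pipeline."""
    pipeline = make_pipeline(TfidfVectorizer(binary=True, ngram_range=(1, 1)), Ridge())
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = cross_val_score(
        pipeline, texts, targets, cv=folds, scoring="neg_root_mean_squared_error"
    )
    return -scores  # scikit-learn reports errors negated, so flip the sign

# Toy data standing in for the excerpts and targets (assumed values)
toy_texts = np.array([
    "the cat sat on the mat",
    "a dog ran in the park",
    "birds fly over the quiet lake",
    "the sun is warm today",
    "quantum entanglement defies classical intuition",
    "photosynthesis converts light into chemical energy",
    "macroeconomic policy influences aggregate demand",
    "the jurisprudence of constitutional interpretation evolves",
    "children play games in the yard",
    "thermodynamic equilibrium minimizes free energy",
], dtype=object)
toy_targets = np.array([0.5, 0.4, 0.1, 0.6, -2.5, -2.0, -2.2, -2.8, 0.3, -2.4])

scores = cv_rmse(toy_texts, toy_targets, n_splits=5)
print(scores.mean())
```

Because the vectorizer sits inside the pipeline, it is refit on each training fold, so no vocabulary or IDF statistics leak from the validation fold.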

In [None]:
# Make an Sklearn pipeline for this Linear Regression
backbone_linear = LinearRegression(fit_intercept=True)
pipeline_linear = make_pipeline(
    TfidfVectorizer(binary=True, ngram_range=(1, 1)),
    backbone_linear
)

# Do training
pipeline_linear.fit(t_X, t_Y)

# Evaluate the performance on validation set
preds = pipeline_linear.predict(v_X)
mse_loss = mean_squared_error(v_Y, preds)

print(f"MSE Loss using Linear Regression and TfIdfVectorizer: {mse_loss}")

In [None]:
# Weights for blending later
lin_wgt = 0.2
rig_wgt = 0.8

In [None]:
# Get the testing file
test = test_file[['id', 'excerpt']]
test_ids = test['id'].tolist()
test_text = test['excerpt'].values

# Do Predictions on testing set
test_preds_ridge = pipeline_ridge.predict(test_text)
test_preds_linear = pipeline_linear.predict(test_text)

# Form a submission file, blending both models with the weights defined earlier, and save it
submission = pd.DataFrame()
submission['id'] = test_ids
submission['target'] = rig_wgt * test_preds_ridge + lin_wgt * test_preds_linear
submission.to_csv("submission.csv", index=False)