In [None]:
# --- CSS STYLE ---
from IPython.display import HTML
def css_styling():
    # Read the stylesheet and make sure the file handle gets closed
    with open("../input/2020-cost-of-living/alerts.css", "r") as f:
        styles = f.read()
    return HTML("<style>"+styles+"</style>")
css_styling()

<img src="https://i.imgur.com/GvOB7o2.png">
<center><h1>📖 CommonLit: Target Understanding and Text FE</h1></center>
<center><h2>The problem is more complex than you think</h2></center>

# 1. Introduction

Yet another amazing competition brought by Kaggle! 😎 My mother is a teacher, and I know the struggle of keeping kids involved and interested in reading, so I can say this competition is a bit closer to my heart.

This one looks *simple* in terms of understanding the problem, goal and competition metric. However, *don't rush to throw it into a model*. On a second look, there's more to it than shows on the surface.

This competition differentiates itself from others because there is **only 1 feature to be used**: the text, which can be highly subjective. The target might be **misleading** as well, as it might behave differently in the `test` data than what we see in the training data.

We don't want to **turn into a smoothie after the shakeup**, as [Laura Fink](https://www.kaggle.com/allunia) says. 👀

<div class="alert simple-alert">
📌 <b>Competition Goal</b>: Rating the complexity of literary passages from grades 3 to 12.
</div>

### ⬇️ Libraries:
* RAPIDS info here: https://rapids.ai/
* Link to my W&B Dashboard here: https://wandb.ai/andrada/commonlit?workspace=user-andrada
* Learn more on why and how to use W&B here: [Experiment Tracking using W&B](https://www.kaggle.com/ayuraj/experiment-tracking-with-weights-and-biases).

In [None]:
!pip install sentence-transformers
!pip install pandarallel

In [None]:
# Libraries
import os
import re
import wandb
import tqdm
import ast
import pickle
import string
import warnings
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib as mpl
import matplotlib.patches as patches
import matplotlib.pyplot as plt
from scipy.stats import pearsonr
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')  # needed by the WordNetLemmatizer used later
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sentence_transformers import SentenceTransformer
from pandarallel import pandarallel
pandarallel.initialize(progress_bar=True)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

# GPU Libraries
import cudf
import cupy
import cuml

# Environment check
warnings.filterwarnings("ignore")
os.environ["WANDB_SILENT"] = "true"
CONFIG = {'competition': 'common-lit', '_wandb_kernel': 'aot'}

# Secrets 🤫
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
secret_value_0 = user_secrets.get_secret("wandb")

# Custom colors
my_colors = ["#E4916C", "#E36A67", "#FFB2C8", "#BCE6EF", "#1E5656"]
sns.palplot(sns.color_palette(my_colors))

# Set Style
sns.set_style("white")
mpl.rcParams['xtick.labelsize'] = 16
mpl.rcParams['ytick.labelsize'] = 16
mpl.rcParams['axes.spines.left'] = False
mpl.rcParams['axes.spines.right'] = False
mpl.rcParams['axes.spines.top'] = False
plt.rcParams.update({'font.size': 22})

class color:
    BOLD = '\033[1m' + '\033[93m'
    END = '\033[0m'

> 📌 **Note**: If this line throws an error, try using `wandb.login()` instead. It will ask for the API key to login, which you can get from your W&B profile (click on Profile -> Settings -> scroll to API keys).

In [None]:
! wandb login $secret_value_0

### ⬇️😉 Custom Functions Below 

In [None]:
def show_values_on_bars(axs, h_v="v", space=0.4):
    '''Plots the value at the end of each bar of a seaborn barplot.
    axs: the ax of the plot
    h_v: whether the barplot is vertical ("v") or horizontal ("h")'''
    
    def _show_on_single_plot(ax):
        if h_v == "v":
            for p in ax.patches:
                _x = p.get_x() + p.get_width() / 2
                _y = p.get_y() + p.get_height()
                value = int(p.get_height())
                ax.text(round(_x, 5), round(_y, 5), format(round(value, 5), ','), ha="center") 
        elif h_v == "h":
            for p in ax.patches:
                _x = p.get_x() + p.get_width() + float(space)
                _y = p.get_y() + p.get_height()
                value = int(p.get_width())
                ax.text(_x, _y, format(value, ','), ha="left")

    if isinstance(axs, np.ndarray):
        for idx, ax in np.ndenumerate(axs):
            _show_on_single_plot(ax)
    else:
        _show_on_single_plot(axs)
        

def create_wandb_plot(x_data=None, y_data=None, x_name=None, y_name=None, title=None, log=None, plot="line"):
    '''Create and save lineplot/barplot in W&B Environment.
    x_data & y_data: Pandas Series containing x & y data
    x_name & y_name: strings containing axis names
    title: title of the graph
    log: string containing name of log'''
    
    data = [[label, val] for (label, val) in zip(x_data, y_data)]
    table = wandb.Table(data=data, columns = [x_name, y_name])
    
    if plot == "line":
        wandb.log({log : wandb.plot.line(table, x_name, y_name, title=title)})
    elif plot == "bar":
        wandb.log({log : wandb.plot.bar(table, x_name, y_name, title=title)})
    elif plot == "scatter":
        wandb.log({log : wandb.plot.scatter(table, x_name, y_name, title=title)})
        
        
def create_wandb_hist(x_data=None, x_name=None, title=None, log=None):
    '''Create and save histogram in W&B Environment.
    x_data: Pandas Series containing x values
    x_name: strings containing axis name
    title: title of the graph
    log: string containing name of log'''
    
    data = [[x] for x in x_data]
    table = wandb.Table(data=data, columns=[x_name])
    wandb.log({log : wandb.plot.histogram(table, x_name, title=title)})
    
    
# Cosine Similarity
def cosine_similarity(u, v):
    # Get similarity between 2 vectors.
    # To test how effective is our Embedding method
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))


def save_dataset_artifact(run_name, artifact_name, path):
    '''Saves dataset to W&B Artifactory.
    run_name: name of the experiment
    artifact_name: under what name should the dataset be stored
    path: path to the dataset'''
    
    run = wandb.init(project='commonlit', 
                     name=run_name, 
                     config=CONFIG,
                     anonymous="allow")
    artifact = wandb.Artifact(name=artifact_name, 
                              type='dataset')
    artifact.add_file(path)

    wandb.log_artifact(artifact)
    wandb.finish()
    print("Artifact has been saved successfully.")

# 2. The Data

> Let's observe the structure of the data first:

<center><img src="https://i.imgur.com/WpEGQkd.png" width=800></center>

In [None]:
# Read in data
train = pd.read_csv("../input/commonlitreadabilityprize/train.csv")
test = pd.read_csv("../input/commonlitreadabilityprize/test.csv")

print(color.BOLD + "Train Shape:" + color.END, train.shape, "\n" +
      color.BOLD + "Test Shape:" + color.END, test.shape)

# Plot
plt.figure(figsize = (25, 11))
sns.heatmap(train.isna(), cmap=[my_colors[3], my_colors[4]],
            xticklabels=train.columns)
plt.title("Missing values in Train Data", size=20);

In [None]:
# Save data to W&B Dashboard
save_dataset_artifact(run_name='save-original-data',
                      artifact_name='original-data', 
                      path="../input/commonlitreadabilityprize/train.csv")

# 3. The Target

Our target variable starts at **-3.67**, the highest possible difficulty (text I can't understand myself 😅) and stops at **1.71**, which is the lowest difficulty (dinosaurs and pretty things ❤️).

> 📌 **Note**: The target in our case has a distribution very close to normal. However ... is the test set following **the same pattern**? The CV technique in this competition will prove very valuable.

In [None]:
run = wandb.init(project='commonlit', name='target-exploration', config=CONFIG, anonymous="allow")

In [None]:
print(color.BOLD + "Min Target:" + color.END ,train["target"].min(), "\n" +
      color.BOLD + "Text:" + color.END, train[train["target"] == train["target"].min()]["excerpt"][1705], "\n" +
      "\n" +
      color.BOLD + "Max Target:" + color.END ,train["target"].max(), "\n" +
      color.BOLD + "Text:" + color.END, train[train["target"] == train["target"].max()]["excerpt"][2829])

# Plot
plt.figure(figsize = (25, 11))
plt.hist(train["target"], bins=50, histtype='step', lw=1, facecolor=my_colors[3], 
         hatch='/', edgecolor=my_colors[4], fill=True)
plt.title("Target Distribution", size=25)
plt.xlabel("Value", size=20)
plt.ylabel("Frequency", size=20);

In [None]:
# Log plot into W&B
create_wandb_hist(x_data=train["target"], x_name="Target", title="Target Distribution", log="target-distribution")

Let's look at a few more examples of text and the **target difficulty** that it was given:

In [None]:
print(color.BOLD + "Target Value" + color.END ,train.iloc[2718]["target"], "\n" +
      color.BOLD + "Text:" + color.END, train[train["target"] == train.iloc[2718]["target"]]["excerpt"][2718], "\n" +
      "\n" +
      color.BOLD + "Target Value:" + color.END ,train.iloc[47]["target"], "\n" +
      color.BOLD + "Text:" + color.END, train[train["target"] == train.iloc[47]["target"]]["excerpt"][47], "\n" +
      "\n" +
      color.BOLD + "Target Value:" + color.END ,train.iloc[106]["target"], "\n" +
      color.BOLD + "Text:" + color.END, train[train["target"] == train.iloc[106]["target"]]["excerpt"][106], "\n" +
     "\n" +
      color.BOLD + "Target Value:" + color.END ,train.iloc[2216]["target"], "\n" +
      color.BOLD + "Text:" + color.END, train[train["target"] == train.iloc[2216]["target"]]["excerpt"][2216])

# 4. The Error

### 📖 Story Time
Let's forget for a second about the regression problem and the fact that we want to train an AI to distinguish between a more complex text and a rather easy one.

Let's assume that WE (as people, not as Data Scientists), need to rate these texts by hand. One by one. Try to give a *score* to one of the texts. Now try for another one. Is the second one simpler, or more complex? And how big is the difference between the 2?

For picking up emotion in images (expressions like anger, fear, sadness), **coders** are trained to score the images. **Coders** are people that are qualified to assess images and follow a strict set of rules to classify if, for example, in an image the person is happy or sad.

But problems are NEVER that easy. Usually a face can express many more feelings, like happiness, love, disgust and a little bit of surprise at the same time. Faced with such a picture, even the best rules and the most skilled coders can fail ... **as the answer becomes subjective**.

<div class="alert success-alert">
<b>There is subjectivity in our text too.</b> Some could find a paragraph a bit easier than others do. In this case, we'll have <b>an error</b>.
</div>

## ! Understanding the Standard Error

Our error is **very skewed to the left**.

Meaning that we have MANY texts with more than 0.5 **standard error**. So, the coders disagreed. If humans disagreed by *half a point* on so many paragraphs, how is our AI going to perform?

In [None]:
print(color.BOLD + "Min error:" + color.END ,train["standard_error"].min(), "\n" +
      color.BOLD + "Target Value:" + color.END, train[train["standard_error"] == train["standard_error"].min()]["target"][106], "\n" +
      color.BOLD + "Text:" + color.END, train[train["standard_error"] == train["standard_error"].min()]["excerpt"][106], "\n" +
      "\n" +
      color.BOLD + "Max error:" + color.END ,train["standard_error"].max(), "\n" +
      color.BOLD + "Target Value:" + color.END, train[train["standard_error"] == train["standard_error"].max()]["target"][2235], "\n" +
      color.BOLD + "Text:" + color.END, train[train["standard_error"] == train["standard_error"].max()]["excerpt"][2235])

# Plot
plt.figure(figsize = (25, 11))
# sns.kdeplot(train["standard_error"], fill=my_colors[0], color=my_colors[0], lw=0.1, alpha=0.55)
plt.hist(train["standard_error"], bins=50, histtype='step', lw=1, facecolor=my_colors[2], 
         hatch='/', edgecolor=my_colors[1],fill=True)
plt.title("Standard Error Distribution", size=25)
plt.xlabel("Value", size=20)
plt.ylabel("Frequency", size=20);

In [None]:
# Log plot into W&B
create_wandb_hist(x_data=train["standard_error"], x_name="Std Error", title="Standard Error Distribution", 
                  log="error-distribution")

## Target vs Error Comparison

> 📌 **Note**: How do we interpret this plot? When the target is ~ -1 (so the complexity is quite neutral), the error decreases a little. However, at the ends of the distribution, the standard error increases slightly, meaning that there is more disagreement between coders in these cases.

<div class="alert success-alert">
What does this mean? It means that if the <code>test</code> set has more examples with target values towards the ends of the histogram, then our Machine Learning model might have even more trouble scoring them. And who can blame the machine, if even the humans are in such disagreement?
</div>

We also have a little guy, completely off the charts. It is an outlier, with both the complexity and the error set to 0.0.

*There is **no correlation** between the `target` and `standard_error`.*
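
A near-zero Pearson value and "the error grows at both ends" are not in conflict: Pearson only measures *linear* association, so a U-shaped relationship can score close to 0. Here is a minimal sketch on made-up numbers (the `targets` and `errors` lists are hypothetical, not the competition data):

```python
import math

def pearson(xs, ys):
    """Plain-Python Pearson correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical targets, evenly spread and centred on -1
targets = [-3.5 + 0.01 * i for i in range(501)]
# Hypothetical errors: smallest at -1, growing towards both ends
errors = [0.45 + 0.02 * (t + 1.0) ** 2 for t in targets]

r = pearson(targets, errors)
print(f"Pearson r on a U-shaped relationship: {r:.6f}")
```

So a low Pearson score tells us there is no straight-line trend, not that the two columns are unrelated.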

In [None]:
corr = round(pearsonr(train["target"], train["standard_error"])[0], 4)

# Plot
plt.figure(figsize = (25, 11))
sns.scatterplot(x=train["target"], y=train["standard_error"], color=my_colors[4], 
                size=train["standard_error"], sizes=(50, 400))
plt.title(f"Target vs Error | Pearson: {corr}", size=25)
plt.xlabel("Target", size=20)
plt.ylabel("Standard Error", size=20)
plt.legend(fontsize=15, loc="lower right");

# Arrow
style = "Simple, tail_width=3, head_width=16, head_length=16"
kw = dict(arrowstyle=style, color=my_colors[0])
arrow = patches.FancyArrowPatch((0.3, 0.1), (0.05, 0.01),
                             connectionstyle="arc3,rad=-.15", **kw)
plt.gca().add_patch(arrow);

In [None]:
wandb.log({"pearsonr" : corr})

# Log plot into W&B
create_wandb_plot(x_data=train["target"], y_data=train["standard_error"], 
                  x_name="Target", y_name="Error", title="Target vs Error", log="scatterplot", plot="scatter")

Oooook, so we looked at the `target` and its `error` to observe what we're dealing with in terms of text: how hard it is and how much disagreement there is between the coders (or raters).

> After our first experiment, the W&B Dashboard looks like this:

<center><img src="https://i.imgur.com/BFz866C.gif" width=800></center>

## Target Segmentation
Now, let's look a bit at the `standard_error` in terms of segmentation.

### We'll segment the target into 3 groups:
* high (complexity)
* medium (complexity)
* low (complexity)

We'll follow the natural distribution of the histogram and we'll segment the data into 3 thirds.
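
As a side note, the same kind of split into thirds can be sketched with `pd.qcut`. This is an alternative illustration on synthetic scores (the `toy` frame is an assumption, not the competition data), and not the exact code used in this notebook:

```python
import random
import pandas as pd

rng = random.Random(0)
# Synthetic stand-in for train["target"] (assumption: roughly normal scores)
toy = pd.DataFrame({"target": [rng.gauss(-1.0, 1.0) for _ in range(2834)]})

# The lowest scores are the hardest texts, so the first tercile is "high" complexity
toy["target_segment"] = pd.qcut(toy["target"], q=3,
                                labels=["high", "medium", "low"])

counts = toy["target_segment"].value_counts()
segment_means = toy.groupby("target_segment", observed=True)["target"].mean()
print(counts)
print(segment_means)
```

`qcut` picks the cut points from the data, so each segment ends up with (almost) the same number of rows.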

In [None]:
# Create segments
segm1 = int(len(train)/3)
segm2 = segm1 * 2

train = train.sort_values("target", ascending=True).reset_index(drop=True)
train["target_segment"] = 0
train.loc[0:segm1, "target_segment"] = "high"
train.loc[segm1:segm2, "target_segment"] = "medium"
train.loc[segm2:, "target_segment"] = "low"

So now, our distribution would look like this:

In [None]:
# Plot
plt.figure(figsize = (25, 11))
sns.kdeplot(train["target"], hue=train["target_segment"], fill=my_colors, color=my_colors, palette=my_colors[1:4], lw=0.1, alpha=0.65)
plt.title("Target Distribution (segmented)", size=25)
plt.xlabel("Value", size=20)
plt.ylabel("Frequency", size=20);

Ok, now let's look at the `standard_error` in terms of segmentation.

You can observe that indeed the error decreases a little for medium complexity. However, we can state that usually **we'll encounter a `standard_error` of 0.5 on a normal rating**.

In [None]:
data = train.groupby("target_segment")[["target", "standard_error"]].mean().reset_index()

# Plot
plt.figure(figsize = (25, 11))
ax = sns.barplot(data=data, x="target_segment", y="standard_error", palette=my_colors[1:4], order=["high", "medium", "low"])
# show_values_on_bars(ax, h_v="v", space=1)
plt.title(f"Standard Error (segmented)", size=25)
plt.xlabel("", size=20)
plt.ylabel("Standard Error", size=20);

In [None]:
# Log into W&B
create_wandb_plot(x_data=data["target_segment"], y_data=data["standard_error"], 
                  x_name="Segment", y_name="Std Error", title="Standard Error (segmented)", 
                  log="error_segment", plot="bar")

In [None]:
wandb.finish()

# 5. The Word Embeddings

**What is a Word Embedding**?
Embedding words is the process of vectorizing text; meaning that we're changing the characters *we understand* to numbers that the *computer can understand*.
<center><img src="https://i.imgur.com/zTIGT0u.png" width=550></center>
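
To make the idea concrete, here is a tiny sketch of the simplest possible "embedding": counting words over a shared vocabulary, then comparing the resulting vectors with cosine similarity. The sentences are made up, and word counts are only a stand-in for the learned embeddings used later in this notebook:

```python
import math
from collections import Counter

def bag_of_words(text, vocabulary):
    # One number per vocabulary word: how often it appears in the text
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

sentences = ["the dinosaur ate the leaf",
             "the dinosaur ate the fish",
             "my mother teaches reading"]
vocabulary = sorted({w for s in sentences for w in s.lower().split()})
vectors = [bag_of_words(s, vocabulary) for s in sentences]

print(vocabulary)
print("similar pair:  ", round(cosine(vectors[0], vectors[1]), 3))
print("unrelated pair:", round(cosine(vectors[0], vectors[2]), 3))
```

The two dinosaur sentences share most of their words and score high, while the third sentence shares none and scores 0.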

Let's first see how many words we usually have in a paragraph. This will later help us understand how long we'll need to make the embeddings. If we choose to create embeddings of 100 numbers per vector and we have 200 words in a paragraph, and we have 2,834 paragraphs, then we'll end up with an object of size 100 x 200 x 2,834 = 56,680,000.

Which is A LOT.
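
The arithmetic above, plus a rough memory estimate, as a quick sketch (the 4 bytes per value is an assumption that the numbers are stored as `float32`):

```python
embedding_dim = 100          # numbers per word vector
words_per_paragraph = 200
n_paragraphs = 2834

n_values = embedding_dim * words_per_paragraph * n_paragraphs
bytes_float32 = n_values * 4  # assumption: 4 bytes per value (float32)

print(f"{n_values:,} values, about {bytes_float32 / 1e6:.0f} MB as float32")
```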

> 📌 **Note**: It seems that our paragraphs have between ~140 and ~210 words. Quite long I would say.

In [None]:
run = wandb.init(project='commonlit', name='embeddings_exploration', config=CONFIG, anonymous="allow")

In [None]:
paragraphs_len = train["excerpt"].apply(lambda x: len(x.split(" ")))

# Plot
plt.figure(figsize = (25, 11))
sns.kdeplot(paragraphs_len, fill=my_colors[0], color=my_colors[0], lw=0.1, alpha=0.65)
plt.title("Count of Words in a Paragraph", size=25)
plt.xlabel("Value", size=20)
plt.ylabel("Frequency", size=20);

In [None]:
# Log plot into W&B
create_wandb_hist(x_data=paragraphs_len, x_name="Count of Words", title="Frequency of Words in Paragraphs", 
                  log="word_count")

### What does tokenization mean?

> **Tokenization** is a fancy word for saying "splitting sequences into words". 

<center><img src="https://i.imgur.com/JwLBUnN.png" width=700></center>
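
A minimal sketch of the idea, using a regular expression as a simplified stand-in for `word_tokenize` (the real NLTK tokenizer handles contractions and other edge cases differently; `simple_tokenize` is a hypothetical helper):

```python
import re

def simple_tokenize(text):
    # Each word becomes one token; each punctuation mark becomes its own token
    return re.findall(r"\w+|[^\w\s]", text.lower())

tokens = simple_tokenize("The cat sat on the mat.")
print(tokens)
# ['the', 'cat', 'sat', 'on', 'the', 'mat', '.']
```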

In [None]:
paragraphs = train["excerpt"]

# Tokenize each paragraph
tokenized_paragraphs = [word_tokenize(p.lower()) for p in paragraphs]

Now let's take *2 embedding methods* one by one as our possible methods for the models.
> 📌 **Note**: My inspiration comes from [this article](https://www.analyticsvidhya.com/blog/2020/08/top-4-sentence-embedding-techniques-using-python/) and [this one](https://medium.com/analytics-vidhya/text-classification-using-word-embeddings-and-deep-learning-in-python-classifying-tweets-from-6fe644fcfc81).

## I. Doc2Vec

In [None]:
# Represents a document along with a tag
tagged_paragraphs = [TaggedDocument(d, [i]) for i, d in enumerate(tokenized_paragraphs)]

# Train model
doc2vec_model = Doc2Vec(tagged_paragraphs,    ### the tagged documents
                        vector_size = 50,     ### how big are the feature vectors
                        window = 2,           ### max distance between current & predicted word
                        min_count = 1,        ### keeps every word (drops words with count < min_count)
                        epochs = 50)          ### no. epochs to train

# Example of new paragraph
test_paragraph = word_tokenize("Do you think that dinosaurs actually had furr instead of lizard skin?".lower())
test_paragraph = doc2vec_model.infer_vector(test_paragraph)

print(color.BOLD + "Doc2Vec Embedding" + color.END, "\n", test_paragraph)

## II. SentenceBERT

In [None]:
# Encode the paragraphs
bert_model = SentenceTransformer('bert-base-nli-mean-tokens')
paragraph_embeddings = bert_model.encode(paragraphs)

# Test with an example
test_paragraph = "Do you think that dinosaurs actually had furr instead of lizard skin? Do you think they were like giant furry toys?"
example_BERT = bert_model.encode([test_paragraph])[0]

# Print the similarity between our example and two training paragraphs
print(color.BOLD + "SentenceBERT Embedding" + color.END)
for paragr in train.loc[2832:2833]["excerpt"].values:
    # Compare the example vector with this paragraph's vector (vector vs vector)
    similarity = cosine_similarity(example_BERT, bert_model.encode([paragr])[0])
    print(color.BOLD + "Sentence = " + color.END, paragr, color.BOLD + "| similarity = " + color.END, similarity, "\n")

In [None]:
import gc
del bert_model, paragraph_embeddings
gc.collect()

# 6. Text Preprocessing

Another thing to be done (besides words embeddings) is preprocessing our text in a manner that it would be much cleaner and easier for the models to "digest".

Luckily, we don't have to do much in this case. The texts are very clean, as they're paragraphs from academia. However, we can still make some adjustments:
* **lower casing** all words
* removing **punctuation**
* filtering **stopwords**
* **lemmatization** of the tokens - bringing each word to its dictionary form (*e.g.: stories → story, feet → foot*)

In [None]:
def clean_paragraph(paragraph, verbose=False):
    '''Cleans paragraph before tokenization'''
    
    # Tokenize & convert to lower case
    tokens = word_tokenize(paragraph)
    tokens = [t.lower() for t in tokens]

    # Remove punctuation & non alphabetic characters from each word
    table = str.maketrans('', '', string.punctuation)
    tokens = [t.translate(table) for t in tokens]
    tokens = [t for t in tokens if t.isalpha()]

    # Filter out stopwords
    stop_words = set(stopwords.words('english'))  # set gives fast membership checks
    tokens = [t for t in tokens if t not in stop_words]

    # Lemmatizer
    lemmatizer = WordNetLemmatizer()
    tokens_lemm = [lemmatizer.lemmatize(t) for t in tokens]

    if verbose:
        print(color.BOLD + "Show difference between original and lemmatized token:" + color.END)
        for a, b, in zip(tokens, tokens_lemm):
            if a != b: print(a, " | ", b)
                
    return " ".join(tokens_lemm)

In [None]:
# Example
cleaned_paragraph = clean_paragraph(paragraph=train["excerpt"][1], verbose=True)

print("\n" + 
      color.BOLD + "Original Text:" + color.END, "\n" +
      train["excerpt"][1], "\n" +
      color.BOLD + "After Cleaning:" + color.END, "\n" +
      cleaned_paragraph)

# Apply to the entire text
train["text"] = train["excerpt"].apply(lambda x: clean_paragraph(x))
wandb.finish()

# 7. English Word Frequency Model & Submission

I was super curious: if I created a SUPER simple model, using only some basic features from the text & the word frequencies ([dataset here](https://www.kaggle.com/rtatman/english-word-frequency)), what would the RMSE score be?

Using embeddings and more advanced techniques will MOST DEFINITELY render better results. However, I wanted to see how a very simple baseline would perform. Can't be that bad ... right?

## I. Data Preprocessing

We'll use **parallelization** to append to each word the frequency from the `english_word_frequency` dataset.

<div class="alert success-alert">
📌 <b>Note:</b> I also chose to use a <code>TfidfVectorizer</code> for the <code>text</code> feature, to add more information to the models.
</div>
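
Before the real thing, here is the lookup idea on a toy scale. The counts in `toy_word_freq` are invented for illustration (they are NOT the real values from `unigram_freq.csv`), and `frequency_features` is a hypothetical helper:

```python
import statistics

# Hypothetical counts, NOT the real values from unigram_freq.csv
toy_word_freq = {"the": 1000, "cat": 50, "sat": 20, "mat": 5}

def frequency_features(text, word_freq):
    # Keep only the words we have a frequency for, as in the cell below
    freqs = [word_freq[w] for w in text.split(" ") if w in word_freq]
    return {"freq_sum": sum(freqs),
            "freq_mean": statistics.mean(freqs),
            "freq_min": min(freqs),
            "freq_max": max(freqs)}

features = frequency_features("the cat sat on the mat", toy_word_freq)
print(features)
```

The working assumption behind these features is that rarer words (lower counts) tend to show up in harder texts.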

In [None]:
#! This cell used to take 3 hours to run (because of how I wrote the code - very poorly)
#! But @adityaecdrid came to the rescue with this amazing script
#! Now it runs in less than a second ❤

# English Word Frequencies Dataset
word_freq = pd.read_csv("../input/english-word-frequency/unigram_freq.csv")
# Convert it into a dict (i.e. hashmap)
word_freq = dict(zip(word_freq["word"], word_freq["count"])) #### change - 1
available_words = set(word_freq.keys()) #### change - 2

# Tokenize full text
train["split_text"] = train["text"].apply(lambda x: [word for word in x.split(" ")])

# Get word count for each word
train["freq_text"] = train["split_text"].parallel_apply(lambda x: [word_freq.get(word, 0) for word in x 
                                                                   if word in available_words]) #### change - 3


# Save data to W&B Dashboard
save_dataset_artifact(run_name='save-original-data',
                      artifact_name='original-data', 
                      path="../input/commonlitreadabilityprize/train.csv")

## II. Create More Features

Now we can create some features from the newly created dataset.

In [None]:
# Get sum, mean, std etc. from the text frequencies
train["freq_sum"] = train["freq_text"].apply(lambda x: np.sum(x))
train["freq_mean"] = train["freq_text"].apply(lambda x: np.mean(x))
train["freq_std"] = train["freq_text"].apply(lambda x: np.std(x))
train["freq_min"] = train["freq_text"].apply(lambda x: np.min(x))
train["freq_max"] = train["freq_text"].apply(lambda x: np.max(x))

# Get more info from text itself
train["no_words"] = train["text"].apply(lambda x: len(x.split(" ")))
train["no_words_paragraph"] = train["excerpt"].apply(lambda x: len(x.split(" ")))

# Scale these features (as they are HUGE)
X = train[['freq_sum', 'freq_mean', 'freq_std', 'freq_min', 
           'freq_max', 'no_words', 'no_words_paragraph']]
y = cudf.Series(train["target"])

scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X))
X_scaled.columns = X.columns


# === TFIDF Vectorizer ===
### parameters from here: https://www.kaggle.com/abhishek/approaching-almost-any-nlp-problem-on-kaggle
tfv = TfidfVectorizer(min_df=3,  max_features=None,
                      strip_accents='unicode', analyzer='word',
                      token_pattern=r'\w{1,}', ngram_range=(1, 3), 
                      use_idf=True, smooth_idf=True, sublinear_tf=True,  # booleans, as newer sklearn versions expect
                      stop_words = 'english')
tfv.fit(train["text"])
train_tf_matrix = pd.DataFrame.sparse.from_spmatrix(tfv.transform(train["text"]))
pickle.dump(tfv.vocabulary_, open("tfidfvectorizer.pkl","wb"))


# Create final X variable, containing all info
X_cpu = pd.concat([X_scaled, train_tf_matrix], axis=1)
X_gpu = cudf.DataFrame(X_cpu)

In [None]:
# This is how our data looks now :)
X_cpu.head(2)

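> 📌 **Bonus**: We can't refit the `TfidfVectorizer` on the training data when we submit, so I saved its fitted vocabulary into a `pickle`. To use it for inference, follow the code below. Keep in mind that only the vocabulary is restored this way: the IDF weights are recomputed on the test data, so the values will not exactly match the ones seen during training.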

`transformer = TfidfTransformer()
saved = CountVectorizer(decode_error = "replace", vocabulary = pickle.load(open("tfidfvectorizer.pkl", "rb")))
X_test = transformer.fit_transform(saved.fit_transform(X_test))`
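
A possible alternative (a sketch, not what this notebook does): pickle the whole fitted vectorizer instead of only its vocabulary, so the IDF weights learned on the training texts travel with it and inference only needs `transform`. The toy texts below are made up:

```python
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up texts standing in for train["text"] and the test texts
train_texts = ["the cat sat on the mat",
               "the dog chased the cat",
               "dinosaurs had feathers"]
test_texts = ["the cat chased dinosaurs"]

vectorizer = TfidfVectorizer()
train_matrix = vectorizer.fit_transform(train_texts)

# Serialize the fitted object (vocabulary AND idf weights) ...
blob = pickle.dumps(vectorizer)

# ... and restore it at inference time: transform only, no refitting
restored = pickle.loads(blob)
test_matrix = restored.transform(test_texts)

print(train_matrix.shape, test_matrix.shape)
```

The test matrix has the same columns as the training matrix, and unseen words are simply ignored.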

## III. RAPIDS XGBoost

I will try just a basic XGBoost, as it is usually one of the best performing models.

In [None]:
# Libraries for models ;)
from wandb.xgboost import wandb_callback
from cuml.metrics import mean_squared_error
from cuml.preprocessing.model_selection import train_test_split
import xgboost

# Basic Data Validation
X_train, X_test, y_train, y_test = train_test_split(X_gpu, y, 
                                                    test_size=0.3, shuffle=False)

In [None]:
def train_xgb_model(X_train, X_test, y_train, y_test, params, 
                    details="default", prints=True, step=1):
    
    run = wandb.init(project='commonlit', name=f'xgboost_{step}', 
                     config=CONFIG, anonymous="allow")
    wandb.log(params)
    
    # Create DMatrix - is optimized for both memory efficiency and training speed.
    train_matrix = xgboost.DMatrix(data = X_train, label = y_train)
    
    # Create & Train the model
    model = xgboost.train(params, dtrain = train_matrix, callbacks=[wandb_callback()])

    # Make prediction
    predicts = model.predict(xgboost.DMatrix(X_test))
    # mean_squared_error returns the MSE, so take the square root to get the RMSE
    rmse = float(mean_squared_error(y_test.astype('float32'), predicts)) ** 0.5
    wandb.log({'rmse':rmse}, step=step)

    if prints:
        print(color.BOLD + details + color.END + " | RMSE: {}".format(rmse))
    
    wandb.finish()
    return model, rmse

In [None]:
# === TRAIN ===
params1 = {'max_depth' : 4,
           'max_leaves' : 2**4,
           'tree_method' : 'gpu_hist',
           'objective' : 'reg:squarederror',
           'grow_policy' : 'lossguide',
           'colsample_bynode': 0.8,}

model1, rmse1 = train_xgb_model(X_train, X_test, y_train, y_test, 
                                params1, details="Baseline Model", step=2)

# Save the model
### you can access the saved models in my commonlit-dataset
### or in the output of this notebook
pickle.dump(model1, open("xgb_word_freq.sav", "wb"))

> 📌 **Note**: Well ... **not good**. The **RMSE is huge**, considering the fact that we have numbers between -3 and 2. I will try to improve this one a bit, but I was just curious to see if the dataset would help ... so you don't have to :)
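
A quick reminder of how the two error measures relate, since they are easy to mix up: RMSE is the square root of MSE, so it lives on the same scale as the target. A minimal sketch on made-up numbers (`mse_score` and `rmse_score` are hypothetical helpers):

```python
import math

def mse_score(y_true, y_pred):
    # Mean of the squared differences
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse_score(y_true, y_pred):
    # Square root of the MSE, back on the scale of the target
    return math.sqrt(mse_score(y_true, y_pred))

toy_true = [0.0, -1.0, 1.0]
toy_pred = [0.5, -1.0, 0.0]

toy_mse = mse_score(toy_true, toy_pred)
toy_rmse = rmse_score(toy_true, toy_pred)
print(f"MSE = {toy_mse:.4f} | RMSE = {toy_rmse:.4f}")
```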

## IV. XGBRF + Cross Validation

Let's try a different approach. I'll try to improve the model by choosing a different method + making a simple `RepeatedKFold` on the data.

Let's see how this fares!
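
For a feel of what `RepeatedKFold` hands back, here is a small sketch on 14 dummy rows (the `toy_X` array is made up): with 7 splits and 2 repeats we get 14 train/validation pairs, and each repeat reshuffles the rows before splitting.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

toy_X = np.arange(14).reshape(-1, 1)   # 14 dummy rows
toy_folds = RepeatedKFold(n_splits=7, n_repeats=2, random_state=33)

splits = list(toy_folds.split(toy_X))
print("Number of train/validation pairs:", len(splits))

for k, (train_idx, valid_idx) in enumerate(splits[:3]):
    print(f"fold {k}: validation rows = {valid_idx.tolist()}")
```

Within one repeat, every row lands in a validation fold exactly once.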

In [None]:
# Libraries for models ;)
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split, RepeatedKFold
from xgboost import XGBRegressor, XGBRFRegressor

# Convert data to CPU
y_cpu = pd.Series(cupy.asnumpy(y.values))

# Create Folds
folds = RepeatedKFold(n_splits=7, n_repeats=2, random_state=33)

In [None]:
# === TRAIN ===
step = 1
run = wandb.init(project='commonlit', name=f'xgbrf_model_{step}', 
                 config=CONFIG, anonymous="allow")

model2 = XGBRFRegressor(n_estimators=120)

rmses = []
for k, (train_index, val_index) in enumerate(folds.split(X_cpu, y_cpu)):
    X_train, y_train = X_cpu.iloc[train_index], y_cpu[train_index]
    X_valid, y_valid = X_cpu.iloc[val_index], y_cpu[val_index]
    
    model2.fit(X_train, y_train, callbacks=[wandb_callback()])
    preds = model2.predict(X_valid)
    
    # mean_squared_error returns the MSE; take the square root for the RMSE
    rmse = np.sqrt(mean_squared_error(y_valid, preds))
    rmses.append(rmse)
    wandb.log({'rmse':float(rmse)}, step=k+1)
    
    print(color.BOLD + f"{k}. RMSE :" + color.END, round(rmse, 5))
    
print("\n", color.BOLD + "Mean Fold RMSE:" + color.END, round(np.mean(rmses), 5))
wandb.log({'mean_rmse' : float(np.mean(rmses))})
wandb.finish()


# Save the model
### you can access the saved models in my commonlit-dataset
### or in the output of this notebook
pickle.dump(model2, open("xgbrf_word_freq.sav", "wb"))

**We can also look at feature importance**, to see which features (out of the ones we've already created) are the most important. This way, we can choose afterwards which one to choose when creating the more complex models (more details are coming in my second notebook).

<div class="alert success-alert">
📌 <b>Note:</b> We can see that freq_sum, freq_min and freq_mean are the most important features, even though we have more than 11,700 columns for the words in our texts. It suggests that <b>the word_frequencies dataset is actually helpful!</b>
</div>

In [None]:
fe_dict = model2.get_booster().get_score(importance_type='weight')
fe_dict = pd.DataFrame({"feature":fe_dict.keys(), "weight":fe_dict.values()}).\
                        sort_values("weight", ascending=False).head(10)

# Plot
plt.figure(figsize = (25, 11))
ax = sns.barplot(data=fe_dict, x="feature", y="weight", palette="ocean")
show_values_on_bars(ax, h_v="v", space=1)
plt.title(f"XGBRF: Feature Importance", size=25)
plt.xlabel("", size=20)
plt.ylabel("Weight", size=20)
plt.yticks([]);

<div class="alert simple-alert">
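📌 <b>Yay!</b> This is a big improvement! The error dropped from a value of 2.31 to around 0.82! Keep in mind that <code>mean_squared_error</code> returns the <i>squared</i> error by default, so these two figures are MSE values; the corresponding RMSE values are roughly 1.52 and 0.91. I would call this a win, especially because we didn't really do much to our dataset.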
</div>

<center><img src="https://i.imgur.com/k6xGQHx.gif" width=850></center>

<center><img src="https://i.imgur.com/cUQXtS7.png"></center>

# ⌨️🎨 My Specs

* **Z8 G4** Workstation 🖥
* 2 CPUs & 96GB Memory 💾
* NVIDIA **Quadro RTX 8000** 🎮
* **RAPIDS** version 0.17 🏃🏾‍♀️


> 📌 **Leaderboard**: And the leaderboard score for the XGBRF Model using Repeated Folds is **0.93** (if you have any questions on how to submit, don't hesitate to ask - don't forget to name your submission `submission.csv` so you won't get an error!)
<center><img src="https://i.imgur.com/V9faI1T.png"></center>