# CommonLit Readability Challenge - Word2Vec + UMAP 🗺️
## Introduction 

In this notebook we apply a pretrained [Word2Vec](https://jalammar.github.io/illustrated-word2vec/) model (loaded through gensim) to a text corpus of our choice. After that, we check whether the resulting embedding has captured readability, using a dimensionality reduction algorithm named [UMAP](https://umap-learn.readthedocs.io/en/latest/how_umap_works.html).

**Note:** if you want to install UMAP locally, the package name is ```umap-learn``` (```pip install umap-learn```), while the module is imported with ```import umap```.


In [1]:
import pandas as pd 
import numpy as np
import re 

import nltk
#nltk.download('punkt')

# model imports
from gensim.models.word2vec import Word2Vec
import gensim.downloader as api # we need an internet connection for this one 

from sklearn.model_selection import train_test_split

# visualization imports
import umap # nonlinear dimensionality reduction
import matplotlib.pyplot as plt
%matplotlib inline

### Utilities
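Here are some utility functions:
* ```remove_ascii_words``` replaces every word that contains a non-ASCII character with a placeholder (```our_special_word```) and returns the replaced words. It might be unnecessary in this case, but it's quite useful for general-purpose preprocessing, where we don't know whether there are non-ASCII characters around.

* ```get_good_tokens``` strips every character that is not a letter, a digit, ```!``` or ```?``` from each token and drops the tokens that end up empty.

* ```w2v_preprocessing``` performs all the preprocessing needed before extracting Word2Vec features: it lowercases each document, splits it into individual sentences on ```.```, tokenizes every sentence, cleans the tokens with ```get_good_tokens``` and removes empty lists.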


In [None]:
our_special_word = 'qwerty'

def remove_ascii_words(df):
    """ Replaces every word containing non-ascii characters in the 'excerpt'
    column of df with `our_special_word`, modifying df in place.
    It returns the list of replaced words.
    """
    non_ascii_words = []
    for i in range(len(df)):
        for word in df.loc[i, 'excerpt'].split(' '):
            if any([ord(character) >= 128 for character in word]):
                non_ascii_words.append(word)
                df.loc[i, 'excerpt'] = df.loc[i, 'excerpt'].replace(word, our_special_word)
    return non_ascii_words
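To make the non-ASCII check concrete, here is a minimal pure-Python sketch of the same test applied to a single string instead of a dataframe. The helper name `mask_non_ascii` and the sample sentence are made up for illustration. Unlike the function above, it replaces word by word rather than calling `str.replace` on the whole excerpt.

```python
PLACEHOLDER = "qwerty"  # same value as `our_special_word` in the notebook

def mask_non_ascii(text, placeholder=PLACEHOLDER):
    """Replace every space-separated word that contains a non-ASCII character."""
    masked = []
    for word in text.split(" "):
        # A character is non-ASCII when its code point is 128 or above.
        has_non_ascii = any(ord(ch) >= 128 for ch in word)
        masked.append(placeholder if has_non_ascii else word)
    return " ".join(masked)

sample = "a naïve café owner"  # hypothetical input
masked_sample = mask_non_ascii(sample)
print(masked_sample)
```

Replacing whole words (rather than single characters) keeps the token count of the excerpt unchanged.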

In [None]:
def get_good_tokens(sentence):
    replaced_punctation = list(map(lambda token: re.sub('[^0-9A-Za-z!?]+', '', token), sentence))
    removed_punctation = list(filter(lambda token: token, replaced_punctation))
    return removed_punctation

In [None]:
def w2v_preprocessing(df):
    """ All the preprocessing steps for word2vec are done in this function.
    All mutations are done on the dataframe itself. So this function returns
    nothing.
    """
    df['excerpt'] = df.excerpt.str.lower()
    df['document_sentences'] = df.excerpt.str.split('.')  # split texts into individual sentences
    df['tokenized_sentences'] = list(map(lambda sentences:
                                         list(map(nltk.word_tokenize, sentences)),
                                         df.document_sentences))  # tokenize sentences
   # df['tokenized_sentences'] = list(map(lambda sentences: list(map( ,sentences)), df.tokenized_sentences))
    df['tokenized_sentences'] = list(map(lambda sentences:
                                         list(map(get_good_tokens, sentences)),
                                         df.tokenized_sentences))  # remove unwanted characters
    df['tokenized_sentences'] = list(map(lambda sentences:
                                         list(filter(lambda lst: lst, sentences)),
                                         df.tokenized_sentences))  # remove empty lists
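The steps above can be traced on a single toy string. This is only a sketch: whitespace splitting stands in for `nltk.word_tokenize` (an assumption made to keep the example dependency-free), and the names `clean_tokens` and `preprocess` are hypothetical.

```python
import re

def clean_tokens(tokens):
    """Strip everything except letters, digits, '!' and '?', then drop empty tokens."""
    stripped = [re.sub(r"[^0-9A-Za-z!?]+", "", tok) for tok in tokens]
    return [tok for tok in stripped if tok]

def preprocess(text):
    """Toy pipeline: lowercase, split on '.', tokenize, clean, drop empty sentences."""
    sentences = text.lower().split(".")
    # Assumption: str.split() replaces nltk.word_tokenize in this sketch.
    tokenized = [sentence.split() for sentence in sentences]
    cleaned = [clean_tokens(tokens) for tokens in tokenized]
    return [tokens for tokens in cleaned if tokens]

result = preprocess("Hello, world. Is it late? Done.")
print(result)
```

Note that only `.` ends a sentence here, so a question followed by another sentence stays in the same token list, and the trailing empty piece produced by the final `.` is filtered out.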

### Data Import
Let's load training data as a ```pandas``` dataframe

In [None]:
train_data =  pd.read_csv("../input/commonlitreadabilityprize/train.csv")
train_data.head()

In [None]:
train_data.excerpt = train_data['excerpt'].apply(str)
non_ascii_words = remove_ascii_words(train_data)

print("Replaced {} words with characters with an ordinal >= 128 in the training data.".format(
    len(non_ascii_words)))

### Pretrained Word2Vec
Here we load a pretrained version of Word2Vec (trained on Google News) through ```Gensim```'s downloader API. Each vector has 300 components.

In [None]:
W2Vmodel = api.load('word2vec-google-news-300')

## Preprocessing

Here we apply our utility functions for preprocessing

In [None]:
w2v_preprocessing(train_data)

In [None]:
train_data.drop(train_data[train_data.tokenized_sentences.str.len() == 0].index, inplace= True) 

In [None]:
# collect all sentences in a single list
sentences = []
for sentence_group in train_data.tokenized_sentences:
    sentences.extend(sentence_group)

print("Number of sentences: {}.".format(len(sentences)))
print("Number of texts: {}.".format(len(train_data)))

## Feature Extraction

The following function, ```get_w2v_features```, transforms each text into a feature vector by averaging the vectors of all its in-vocabulary words. Averaging gives a fixed-size vector whatever the length of the text; on the other hand, it discards word order and blurs word-specific information.
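A tiny example of what averaging loses, using invented 3-dimensional vectors (the dictionary `toy_vectors` and the helper `average_vector` are hypothetical and unrelated to the real Google News model): two texts with the same words in a different order get exactly the same feature vector, and unknown words are simply skipped.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings, invented for this example only.
toy_vectors = {
    "dog": np.array([1.0, 0.0, 0.0]),
    "bites": np.array([0.0, 1.0, 0.0]),
    "man": np.array([0.0, 0.0, 1.0]),
}

def average_vector(words, vectors, dim=3):
    """Average the vectors of known words; unknown words are skipped."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

a = average_vector(["dog", "bites", "man"], toy_vectors)
b = average_vector(["man", "bites", "dog", "zzz"], toy_vectors)  # "zzz" is out of vocabulary
print(a, b, np.allclose(a, b))
```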

In [None]:
def get_w2v_features(w2v_model, sentence_group):
    """ Transform a sentence_group (containing multiple lists
    of words) into a feature vector. It averages out all the
    word vectors of the sentence_group.
    """
    words = np.concatenate(sentence_group)  # all the words in the text
    vocab = w2v_model.key_to_index  # words known to the model (dict, so no set is rebuilt at every call)
    
    featureVec = np.zeros(w2v_model.vector_size, dtype="float32")
    
    # Initialize a counter for the number of in-vocabulary words in the text
    nwords = 0
    # Loop over each word in the text and, if it is in the model's vocabulary, add its feature vector to the total
    for word in words:
        if word in vocab: 
            featureVec = np.add(featureVec, w2v_model[word])
            nwords += 1

    # Divide the result by the number of words to get the average
    if nwords > 0:
        featureVec = np.divide(featureVec, nwords)
    return featureVec

Extracting Word2Vec features as a dataframe column. This cell can take quite a lot of time (~1 hr in our run), so it might be useful to parallelize or vectorize the feature extraction process: nested ```for``` loops are always problematic in Python 🐍
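One possible way to reduce the Python-level work is to look up row indices first and then average with a single NumPy indexing call. The sketch below assumes the embeddings are available as a 2-D array plus a word-to-row dictionary (to my knowledge gensim 4 `KeyedVectors` exposes a comparable pair as `vectors` and `key_to_index`); the toy data and the name `mean_embedding` are hypothetical.

```python
import numpy as np

# Toy stand-ins: 4 words with 3-dimensional vectors (values are arbitrary).
word_to_row = {"the": 0, "cat": 1, "sat": 2, "mat": 3}
embedding_matrix = np.arange(12, dtype="float32").reshape(4, 3)

def mean_embedding(sentence_group, word_to_row, embedding_matrix):
    """Average the vectors of all in-vocabulary words with one fancy-indexing call."""
    rows = [word_to_row[w] for sentence in sentence_group for w in sentence
            if w in word_to_row]
    if not rows:
        # No known word: fall back to a zero vector, like get_w2v_features does.
        return np.zeros(embedding_matrix.shape[1], dtype=embedding_matrix.dtype)
    return embedding_matrix[rows].mean(axis=0)

doc = [["the", "cat", "sat"], ["on", "the", "mat"]]  # "on" is out of vocabulary
features = mean_embedding(doc, word_to_row, embedding_matrix)
print(features)
```

Whether this is faster on the real model depends on where the time is actually spent, so it is worth timing on a handful of texts first.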

In [None]:
train_data['w2v_features'] = list(map(lambda sen_group:
                                     get_w2v_features(W2Vmodel, sen_group),
                                     train_data.tokenized_sentences))

train_data["w2v_resh_features"] = train_data["w2v_features"].apply(lambda x : x.reshape(1,-1) )

In [None]:
# save w2v features
#train_data.to_csv("w2v_features.csv")

Converting Word2Vec features to ```numpy.ndarray``` for visualization purposes with ```UMAP```


In [None]:
# stack the per-text vectors into a single (n_texts, 300) array;
# stacking the column values avoids label-based lookups, which fail if rows were dropped
arr_w2v = np.vstack(train_data.w2v_features.tolist())

# Dimensionality Reduction and Visualization via UMAP

```UMAP``` is a nonlinear dimensionality reduction technique: something like ```PCA```, but fancier!
It can be used in both *unsupervised* and *supervised* settings, and the supervised mode handles both *regression* and *classification* targets.

Since we are facing a regression problem, i.e. our target variable is continuous, we have to set ```target_metric = 'l1'``` (thanks to Leland McInnes; see this [GitHub issue](https://github.com/lmcinnes/umap/issues/257) for the motivation).
We opted for a 2-dimensional representation (```n_components = 2```).
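To see why an ```l1``` metric suits a continuous target, compare it with a categorical comparison on a few made-up readability scores. This only illustrates the idea of measuring distances between target values; it is not a reproduction of what ```UMAP``` does internally.

```python
import numpy as np

targets = np.array([-2.0, -1.9, 0.5])  # made-up readability scores

# 'l1' view: pairwise absolute differences between target values.
l1_dist = np.abs(targets[:, None] - targets[None, :])

# Categorical view: 0 if two targets are identical, 1 otherwise.
categorical_dist = (targets[:, None] != targets[None, :]).astype(float)

print(l1_dist)
print(categorical_dist)
```

With the ```l1``` view the first two texts are close (0.1 apart) and the third one is far from both, whereas the categorical view treats all three as equally different.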

In [None]:
umap_emb = umap.UMAP(n_neighbors= 15, n_components = 2, target_metric = 'l1' , n_epochs = 500).fit_transform(arr_w2v, y=train_data.target)

Let's take a look at what we get by plotting this new 2-d space.

In [None]:
fig, ax = plt.subplots(1, figsize=(14, 10))
plt.scatter(*umap_emb.T, s=3, c=train_data.target, cmap='Spectral', alpha=1.0)
plt.setp(ax, xticks=[], yticks=[])
ax.patch.set_facecolor('black')
fg_color = 'black'
cbar = plt.colorbar()
plt.setp(plt.getp(cbar.ax.axes, 'yticklabels'), color=fg_color)

Okay, in a supervised setting ```UMAP``` is able to isolate this structure, i.e. it seems that there exists a specific subspace which is roughly parametrized by the **readability target**. Keep in mind that the targets were passed to ```UMAP``` to build this embedding. Is it possible to find a similar structure in a semi-supervised or unsupervised setting? Let's dive in 🏊🤿

## Validation Set

Let's use ```scikit-learn```'s ```train_test_split``` function to extract a validation set and see whether a semi-supervised ```UMAP``` (fitted with targets on the training split, then applied to the validation split without targets) can capture some structure.

In [None]:
X_train, X_val, y_train, y_val = train_test_split(
    arr_w2v, train_data[["target", "standard_error"]].values, test_size=0.10, random_state=42)

In [None]:
mapper = umap.UMAP(n_neighbors= 15, n_components = 2, target_metric = 'l1' , n_epochs = 1000).fit(X_train, y=y_train[:,0])

In [None]:
val_embedding = mapper.transform(X_val)

In [None]:
fig, ax = plt.subplots(1, figsize=(14, 10))
plt.scatter(*mapper.embedding_.T, s=3, c=y_train[:,0], cmap='Spectral', alpha=1.0)
plt.setp(ax, xticks=[], yticks=[])
ax.patch.set_facecolor('black')
fg_color = 'black'
cbar = plt.colorbar()
plt.setp(plt.getp(cbar.ax.axes, 'yticklabels'), color=fg_color)

In [None]:
fig, ax = plt.subplots(1, figsize=(14, 10))
plt.scatter(*val_embedding.T, s=3, c=y_val[:,0], cmap='Spectral', alpha=1.0)
plt.setp(ax, xticks=[], yticks=[])
ax.patch.set_facecolor('black')
fg_color = 'black'
cbar = plt.colorbar()
plt.setp(plt.getp(cbar.ax.axes, 'yticklabels'), color=fg_color)

Unfortunately the result is not what we wanted it to be. Still, the supervised representation extracted by ```UMAP``` deserves further investigation, for instance as a basis for augmentation strategies.

# XGBoost on W2V features

Here we fit an ```XGBoost``` regressor on our Word2Vec features, in order to be ready for submission.

In [None]:
import xgboost as xgb
from sklearn.metrics import mean_squared_error as mse

In [None]:
xg_reg = xgb.XGBRegressor(objective ='reg:squarederror', colsample_bytree = 0.5, learning_rate = 0.1,
                max_depth = 5, alpha = 10, n_estimators = 1000, verbosity = 1)

xg_reg.fit(X_train,y_train[:,0])

In [None]:
preds = xg_reg.predict(X_val)

The validation ```rmse``` computed in the next cell is ~0.69: not bad, let's see on test :)

In [None]:
np.sqrt(mse(y_val[:,0], preds)) 
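To put an RMSE value into context, it helps to compare it with a trivial baseline that always predicts the mean target (its RMSE equals the standard deviation of the targets). The sketch below uses made-up numbers, not the competition data, and the helper name `rmse` is hypothetical.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

y_true = np.array([-1.0, 0.0, 1.0, 2.0])             # made-up targets
model_pred = np.array([-0.5, 0.0, 0.5, 2.0])         # made-up predictions
baseline_pred = np.full_like(y_true, y_true.mean())  # always predict the mean

model_rmse = rmse(y_true, model_pred)
baseline_rmse = rmse(y_true, baseline_pred)
print(model_rmse, baseline_rmse)
```

On the real data the same comparison would be `rmse(y_val[:, 0], preds)` against the baseline built from the mean of `y_train[:, 0]`.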

# W2V Naive Augmentation 

In this section we try a naive augmentation pipeline for Word2Vec features. The rationale behind the following strategy is that W2V features are essentially vectors, so we can augment our "feature set" by perturbing these vectors with small Gaussian noise, hoping to make learning easier for our regressor of choice. Moreover, we'd like to see whether augmenting our "feature set" in this way can **break** the structure extracted by the supervised ```UMAP```.

Of course, this naive augmentation doesn't make a lot of sense from a word/sentence perspective,
since the perturbed vectors might not correspond to any actual word/sentence.

The following function, ```augment_train_w2v```, augments the training set by adding small Gaussian noise to the W2V features while slightly shifting the target of each augmented sample (by ±5% of that target's **standard error**).

In [None]:
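def augment_train_w2v(X_train, y_train, y_std, times = 5):
    """ Returns X_train and y_train followed by (times - 1) noisy copies.
    Each copy adds gaussian noise to the features and shifts the target
    by +/- 5% of its standard error (y_std).
    """
    new_X, new_y = [], []
    
    for j in range(0, times - 1):
        for i in range(0, X_train.shape[0]):
            # noise scale drawn uniformly in [1e-4, 8e-4); alternative scale: np.random.uniform(1,8)*1e-3
            noise = np.random.uniform(1,8)*1e-4*np.random.randn(X_train.shape[1])
            new_X.append(X_train[i,:] + noise)
            new_y.append(y_train[i] + np.random.choice([-1, 1])*y_std[i]*0.05)
    
    # stack once at the end instead of growing the arrays inside the loop
    augmented_w2v_X = np.vstack([X_train] + new_X)
    augmented_w2v_y = np.append(y_train, new_y)
    return augmented_w2v_X, augmented_w2v_y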
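The same augmentation idea can also be written without Python loops over the rows. The following is a sketch under stated assumptions, not a drop-in replacement: `augment_vectorized` is a hypothetical name, and it draws from `np.random.default_rng`, so it will not reproduce the loop version sample by sample.

```python
import numpy as np

def augment_vectorized(X, y, y_err, times=5, seed=0):
    """Return X, y plus (times - 1) noisy copies, built without looping over rows."""
    rng = np.random.default_rng(seed)
    n, dim = X.shape
    copies = times - 1
    # One noise scale in [1e-4, 8e-4) per augmented sample.
    scale = rng.uniform(1, 8, size=(copies * n, 1)) * 1e-4
    X_new = np.tile(X, (copies, 1)) + scale * rng.standard_normal((copies * n, dim))
    # Shift each target by +/- 5% of its standard error.
    sign = rng.choice([-1, 1], size=copies * n)
    y_new = np.tile(y, copies) + sign * np.tile(y_err, copies) * 0.05
    return np.vstack([X, X_new]), np.concatenate([y, y_new])

# Toy data: 3 samples with 4 features each.
X_toy = np.zeros((3, 4))
y_toy = np.array([0.0, 1.0, 2.0])
err_toy = np.array([0.5, 0.5, 0.5])
X_aug, y_aug = augment_vectorized(X_toy, y_toy, err_toy, times=5)
print(X_aug.shape, y_aug.shape)
```

The first `len(X)` rows of the output are the untouched originals, so the validation split is never involved in the augmentation.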

In [None]:
aug_X_train, aug_Y_train = augment_train_w2v(X_train, y_train[:,0], y_train[:,1], times = 5)

We decided to shuffle the new augmented training set again while keeping each sample paired with its target. That's what the following function, ```shuffle_couples```, does.

In [None]:
def shuffle_couples(a, b):
    assert len(a) == len(b)
    p = np.random.permutation(len(a))
    return a[p], b[p]
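A quick way to convince ourselves that the pairing survives the shuffle is to tie each toy target to its row and check the relation afterwards. The function `shuffle_pairs` below is a hypothetical seeded variant of ```shuffle_couples```, written only for this check.

```python
import numpy as np

def shuffle_pairs(a, b, seed=0):
    """Apply one shared permutation to both arrays so a[i] still matches b[i]."""
    assert len(a) == len(b)
    p = np.random.default_rng(seed).permutation(len(a))
    return a[p], b[p]

X_toy = np.arange(10).reshape(5, 2)  # rows: [0, 1], [2, 3], ...
y_toy = X_toy[:, 0] * 10             # each target is tied to its own row
X_shuf, y_shuf = shuffle_pairs(X_toy, y_toy)
print(np.array_equal(y_shuf, X_shuf[:, 0] * 10))
```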

In [None]:
aug_X_train, aug_Y_train = shuffle_couples(aug_X_train, aug_Y_train)

Now let's see if the previously seen structure still exists (you can try changing the ```times``` parameter of ```augment_train_w2v``` to see whether it affects anything). Note that this time ```UMAP``` is fitted without the targets.

In [None]:
mapper_2 = umap.UMAP(n_neighbors= 15, n_components = 2, min_dist = 0.1, target_metric = 'l1' , n_epochs = 1000).fit(aug_X_train)#, y=aug_Y_train)

In [None]:
val_embedding_2 = mapper_2.transform(X_val)

In [None]:
fig, ax = plt.subplots(1, figsize=(14, 10))
plt.scatter(*mapper_2.embedding_.T, s=3, c=aug_Y_train, cmap='Spectral', alpha=1.0)
plt.setp(ax, xticks=[], yticks=[])
ax.patch.set_facecolor('black')
fg_color = 'black'
cbar = plt.colorbar()
plt.setp(plt.getp(cbar.ax.axes, 'yticklabels'), color=fg_color)

In [None]:
fig, ax = plt.subplots(1, figsize=(14, 10))
plt.scatter(*val_embedding_2.T, s=3, c=y_val[:,0], cmap='Spectral', alpha=1.0)
plt.setp(ax, xticks=[], yticks=[])
ax.patch.set_facecolor('black')
fg_color = 'black'
cbar = plt.colorbar()
plt.setp(plt.getp(cbar.ax.axes, 'yticklabels'), color=fg_color)

Apparently the structure is affected by this augmentation strategy. Next we can take a look at the ```rmse``` score: maybe we can get a better result.

In [None]:
xg_reg = xgb.XGBRegressor(objective ='reg:squarederror', colsample_bytree = 0.5, learning_rate = 0.1,
                max_depth = 5, alpha = 10, n_estimators = 1000, verbosity = 1)

xg_reg.fit(aug_X_train, aug_Y_train)

In [None]:
preds = xg_reg.predict(X_val)

In [None]:
np.sqrt(mse(y_val[:,0], preds))

The result is pretty much the same, so we need to devise a better augmentation strategy, maybe by exploiting the information gained with ```UMAP```.


# Test Data

Preparing submission with test data by basically re-doing the previously seen operations.

In [None]:
test_data =  pd.read_csv("/kaggle/input/commonlitreadabilityprize/test.csv")
test_data.head()

In [None]:
test_data.excerpt = test_data['excerpt'].apply(str)
non_ascii_words = remove_ascii_words(test_data)

print("Replaced {} words with characters with an ordinal >= 128 in the test data.".format(
    len(non_ascii_words)))

In [None]:
w2v_preprocessing(test_data)

In [None]:
test_data.drop(test_data[test_data.tokenized_sentences.str.len() == 0].index, inplace= True) 

In [None]:
# collect all test sentences in a single list
sentences_test = []
for sentence_group in test_data.tokenized_sentences:
    sentences_test.extend(sentence_group)

print("Number of sentences: {}.".format(len(sentences_test)))
print("Number of texts: {}.".format(len(test_data)))

In [None]:
test_data['w2v_features'] = list(map(lambda sen_group:
                                     get_w2v_features(W2Vmodel, sen_group),
                                     test_data.tokenized_sentences))

In [None]:
test_data["w2v_resh_features"] = test_data["w2v_features"].apply(lambda x : x.reshape(1,-1) )

In [None]:
# stack the per-text vectors into a single (n_texts, 300) array;
# stacking the column values avoids label-based lookups, which fail if rows were dropped
arr_w2v_test = np.vstack(test_data.w2v_features.tolist())