## Introduction
I was interested in seeing how separable sentence-level representations are in our data set.  
In this notebook, we'll go through a brief EDA and then move into some unsupervised learning.  
**Sentence transformers** are used to obtain the sentence embeddings, and **textstat** is used for feature engineering.

In [None]:
!pip install sentence_transformers
!pip install textstat

In [None]:
from sentence_transformers import SentenceTransformer
import numpy as np
import pandas as pd
import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.manifold import TSNE
from sklearn.mixture import GaussianMixture
import umap
import textstat
plt.style.use('fivethirtyeight')

In [None]:
train_df = pd.read_csv('../input/commonlitreadabilityprize/train.csv')
test_df = pd.read_csv('../input/commonlitreadabilityprize/test.csv')

In [None]:
train_df.head()

## EDA and Data Preparation

In [None]:
plt.figure(figsize = (8,5))
sns.histplot(train_df.target)

In [None]:
plt.figure(figsize = (8,5))
sns.histplot(train_df.standard_error)

The readability target appears to be approximately normally distributed.  
The standard error is right-skewed and seems to have some outlier values on the left.
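
To back up the visual read of the histograms, we could also compute the sample skewness. The sketch below uses synthetic data rather than the competition data, and `sample_skewness` is a hypothetical helper written for illustration (values near 0 suggest symmetry, positive values a longer right tail).

```python
import numpy as np

def sample_skewness(values):
    """Third standardized moment (population form, no bias correction)."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    return np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5

# Synthetic stand-ins for a roughly symmetric and a right-skewed column
rng = np.random.default_rng(0)
symmetric = rng.normal(size=5000)
right_skewed = rng.lognormal(size=5000)

print(sample_skewness(symmetric), sample_skewness(right_skewed))
```

On the real data this would be called as `sample_skewness(train_df.target)` and `sample_skewness(train_df.standard_error)`.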

In [None]:
ind = np.where(train_df.standard_error == train_df.standard_error.min())[0]
train_df.loc[ind]

This row has a target that looks like an integer and a standard error of 0.  
We'll remove it for now, as its standard error is far outside the rest of the distribution, which could affect dimensionality reduction.

In [None]:
train_df.drop(ind, inplace = True)
train_df.reset_index(inplace = True,drop = True)
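
The cell above mixes positional indices from `np.where` with label-based `drop`, which works here because the frame still has its default index. A boolean mask avoids relying on that. This is a minimal sketch on a toy frame with made-up values, not the competition data.

```python
import pandas as pd

# Toy frame with one zero-standard-error row, mimicking the outlier above
toy = pd.DataFrame({
    'target': [-0.3, 0.0, 1.1],
    'standard_error': [0.46, 0.0, 0.51],
})

# Keep only rows with a strictly positive standard error, then renumber
cleaned = toy[toy['standard_error'] > 0].reset_index(drop=True)
print(cleaned)
```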

## Obtaining Sentence Representations
Here we'll use a RoBERTa base model to encode our excerpts.

In [None]:
# bert = SentenceTransformer('bert-base-uncased')
roberta = SentenceTransformer('roberta-base')
vects = roberta.encode(train_df.excerpt.tolist())  # pass a plain list of strings
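
As far as I know, when `SentenceTransformer` is given a plain transformer checkpoint like `roberta-base` (rather than a model trained for sentence embeddings), it falls back to mean pooling over the token embeddings. The sketch below shows that pooling step in NumPy on made-up numbers. `masked_mean_pool` is a hypothetical helper, not a library function.

```python
import numpy as np

def masked_mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sequence, ignoring padded positions."""
    mask = attention_mask[..., None].astype(float)   # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)   # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)   # avoid division by zero
    return summed / counts

# One sequence of three tokens; the last token is padding and must be ignored
token_embeddings = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
attention_mask = np.array([[1, 1, 0]])

pooled = masked_mean_pool(token_embeddings, attention_mask)
print(pooled)
```

The result is one fixed-size vector per excerpt, which is what the dimensionality reduction and clustering steps operate on.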

In [None]:
# Probably isn't necessary to scale these vectors
scaler = StandardScaler()
scaled = scaler.fit_transform(vects)

## TSNE Dimensionality Reduction

In [None]:
tsne_embedding = TSNE(n_components = 2, random_state = 123).fit_transform(scaled)  # seeded for reproducibility

In [None]:
px.scatter(train_df, x = tsne_embedding[:, 0], y = tsne_embedding[:, 1], color = 'target',
                 labels = {'x' : 'Dimension 1', 'y' : 'Dimension 2'},
                 title = 'TSNE Projection of Roberta Sentence Representations')

## UMAP Dimensionality Reduction

In [None]:
reducer = umap.UMAP(random_state = 123)
umap_embedding = reducer.fit_transform(scaled)

In [None]:
px.scatter(train_df, x = umap_embedding[:, 0], y = umap_embedding[:, 1], color = 'target',
                 labels = {'x' : 'Dimension 1', 'y' : 'Dimension 2'},
                 title = 'UMAP Projection of Roberta Sentence Representations')

Looking at the latent representations, we see some interesting findings:
* We can see some separation across scores - the finely grouped 'cluster' of good observations in the bottom left is noticeable 
* This plot looks like some kind of flipped Australia as well (to me at least)

In [None]:
px.scatter(train_df, x = umap_embedding[:, 0], y = umap_embedding[:, 1], color = 'standard_error',
                 labels = {'x' : 'Dimension 1', 'y' : 'Dimension 2'},
                 title = 'UMAP Projection of Roberta Sentence Representations (Colored by Standard Error)')

## Clustering
We'll fit a Gaussian mixture model (GMM) to the original (unscaled) sentence embeddings to cluster our data points.

In [None]:
gmm = GaussianMixture(n_components = 6, random_state = 123)
clusters = gmm.fit_predict(vects)

In [None]:
px.scatter(train_df, x = umap_embedding[:, 0], y = umap_embedding[:, 1],
                 color = clusters.astype(str),  # strings give discrete colors per cluster
                 labels = {'x' : 'Dimension 1', 'y' : 'Dimension 2', 'color' : 'Cluster'},
                 title = 'UMAP Projection of Roberta Sentence Representations (Colored by GMM Cluster)')

These clusters seem to be relatively well-defined.  
We can perform some feature engineering to compare across clusters.
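
The choice of 6 components above was not tuned. One common way to sanity-check a component count is to compare the BIC across several values and prefer the lowest. The sketch below does this on synthetic 2-D blobs rather than the RoBERTa embeddings, so the numbers say nothing about our data. All variable names are illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Three well-separated synthetic blobs (an assumption for illustration only)
rng = np.random.default_rng(123)
centers = np.array([[0.0, 0.0], [8.0, 8.0], [-8.0, 8.0]])
X = np.vstack([center + rng.normal(size=(200, 2)) for center in centers])

# Fit one GMM per candidate component count and record its BIC
bic_by_k = {}
for k in range(1, 7):
    gmm_k = GaussianMixture(n_components=k, random_state=123).fit(X)
    bic_by_k[k] = gmm_k.bic(X)

# Lower BIC is better
best_k = min(bic_by_k, key=bic_by_k.get)
print(best_k)
```

On the real embeddings, the same loop would run over `vects`. With 768 dimensions and full covariance matrices it may be slow, so a diagonal covariance (`covariance_type='diag'`) could be worth trying.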

## Feature Engineering
I used the same textstat augmentations from this excellent EDA notebook https://www.kaggle.com/gunesevitan/commonlit-readability-prize-eda  
For reference, the augmentations are defined below:
* `character_count` - number of characters in the text
* `digit_count` - number of whitespace-separated tokens in the text that consist only of digits
* `word_count` - number of words in the text
* `unique_word_count` - number of unique words in the text
* `mean_word_length` - average number of characters per word in the text
* `syllable_count` - number of syllables in the text
* `sentence_count` - number of sentences in the text
* `flesch_reading_ease` - [Flesch reading ease score](https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests#Flesch_reading_ease) of the text
* `flesch_kincaid_grade` - [Flesch–Kincaid grade level](https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests#Flesch%E2%80%93Kincaid_grade_level) of the text
* `smog_index` - [SMOG index](https://en.wikipedia.org/wiki/SMOG) of the text
* `automated_readability_index` - [automated readability index](https://en.wikipedia.org/wiki/Automated_readability_index) of the text
* `coleman_liau_index` - [Coleman–Liau index](https://en.wikipedia.org/wiki/Coleman%E2%80%93Liau_index) of the text
* `linsear_write_formula` - [Linsear Write grade](https://en.wikipedia.org/wiki/Linsear_Write) of the text
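
To give a feel for what a few of these scores measure, here is a minimal sketch of the standard published formulas computed from raw counts. The `*_from_counts` helpers are hypothetical names written for this illustration. textstat's own results may differ somewhat, since it does its own syllable and sentence counting and rounding.

```python
def flesch_reading_ease_from_counts(words, sentences, syllables):
    # Higher scores mean easier text
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade_from_counts(words, sentences, syllables):
    # Approximate US school grade level
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

def automated_readability_index_from_counts(characters, words, sentences):
    # Uses characters per word instead of syllables per word
    return 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43

# Made-up counts for a single passage
words, sentences, syllables, characters = 100, 5, 150, 450

print(flesch_reading_ease_from_counts(words, sentences, syllables))
print(flesch_kincaid_grade_from_counts(words, sentences, syllables))
print(automated_readability_index_from_counts(characters, words, sentences))
```

All three depend on words per sentence, so excerpts with many short sentences will tend to look easier under each of them.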

In [None]:
train_df['is_licensed'] = train_df.license.notna()*1 # might be interesting to look at?

train_df['character_count'] = train_df['excerpt'].apply(lambda x: len(str(x)))
train_df['digit_count'] = train_df['excerpt'].apply(lambda x: np.sum(([int(word.isdigit()) for word in str(x).split()])))
train_df['word_count'] = train_df['excerpt'].apply(textstat.lexicon_count)
train_df['unique_word_count'] = train_df['excerpt'].apply(lambda x: len(set(str(x).split())))
train_df['mean_word_length'] = train_df['excerpt'].apply(lambda x: np.mean([len(word) for word in str(x).split()]))
train_df['syllable_count'] = train_df['excerpt'].apply(textstat.syllable_count)
train_df['sentence_count'] = train_df['excerpt'].apply(textstat.sentence_count)
train_df['flesch_reading_ease'] = train_df['excerpt'].apply(textstat.flesch_reading_ease)
train_df['flesch_kincaid_grade'] = train_df['excerpt'].apply(textstat.flesch_kincaid_grade)
train_df['smog_index'] = train_df['excerpt'].apply(textstat.smog_index)
train_df['automated_readability_index'] = train_df['excerpt'].apply(textstat.automated_readability_index)
train_df['coleman_liau_index'] = train_df['excerpt'].apply(textstat.coleman_liau_index)
train_df['linsear_write_formula'] = train_df['excerpt'].apply(textstat.linsear_write_formula)

In [None]:
numeric_df = train_df.select_dtypes(include=np.number)
scaled_df = pd.DataFrame(MinMaxScaler(feature_range=(0, 1)).fit_transform(numeric_df), 
                         index = numeric_df.index, 
                         columns = numeric_df.columns)

scaled_df['cluster'] = clusters

In [None]:
agg = scaled_df.groupby('cluster').mean()
agg.reset_index(inplace = True)
agg
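
The features are min-max scaled before averaging so that every column lies in [0, 1] and the per-cluster means can share one radial axis. A small sketch of the same two steps on a toy frame (made-up values, illustrative names), with the scaling written out by hand instead of using `MinMaxScaler`:

```python
import pandas as pd

toy = pd.DataFrame({
    'word_count': [100, 150, 200, 250],
    'sentence_count': [5, 10, 15, 20],
})

# Per-column min-max scaling: (x - min) / (max - min)
toy_scaled = (toy - toy.min()) / (toy.max() - toy.min())

# Attach cluster labels after scaling so they are not rescaled themselves
toy_scaled['cluster'] = [0, 0, 1, 1]
toy_means = toy_scaled.groupby('cluster').mean()
print(toy_means)
```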

In [None]:
import plotly.graph_objects as go

categories = ['target', 'standard_error', 'sentence_count', 'mean_word_length',
              'automated_readability_index', 'character_count', 'unique_word_count']

fig = go.Figure()

for row in agg.itertuples():
    fig.add_trace(go.Scatterpolar(
        r = [getattr(row, i) for i in categories],
        theta = categories,
        fill = 'toself',
        name = f'Cluster {row.cluster}'  # trace names should be strings
    ))
fig.show()

Cluster `1` appears to have higher sentence counts, lower character counts, and a noticeably lower average automated readability index.  
90% of records in cluster `1` also have a URL/license associated with them, which is much higher than in most other clusters.
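
The license share per cluster is just the group mean of the 0/1 `is_licensed` flag. Since min-max scaling leaves a 0/1 column unchanged, it can also be read directly from the `is_licensed` column of `agg`. A minimal sketch on a toy frame with made-up values:

```python
import pandas as pd

toy = pd.DataFrame({
    'cluster': [0, 0, 1, 1, 1],
    'is_licensed': [0, 1, 1, 1, 0],
})

# The mean of a 0/1 flag within each group is the fraction of flagged rows
license_share = toy.groupby('cluster')['is_licensed'].mean()
print(license_share)
```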
