### ⚙ Basic text pre-processing and new dataset generation
In this notebook, we will use the most common methods of text pre-processing to reduce the number of features across the train and test sets and homogenise the input. 

Boilerplate follows.

In [None]:
!pip install -q ../input/ftfywhl602/ftfy-6.0.2-py2.py3-none-any.whl

In [None]:
!pip install -q ../input/docoptwhl062/docopt-0.6.2-py2.py3-none-any.whl

In [None]:
!pip install -q ../input/num2wordswhl0510/num2words-0.5.10-py2.py3-none-any.whl

In [None]:
import re
import nltk
import ftfy
import pandas as pd

from string import punctuation
from num2words import num2words
from urllib.parse import unquote
from nltk.stem import WordNetLemmatizer
from gensim import corpora, models, similarities
from nltk import word_tokenize, sent_tokenize
from wordsegment import load, segment

lemmatizer = WordNetLemmatizer()
load()

In [None]:
train_df = pd.read_csv("../input/commonlitreadabilityprize/train.csv")
test_df = pd.read_csv("../input/commonlitreadabilityprize/test.csv")

In [None]:
# For convenience, combine the two
train_df['is_train'] = True
test_df['is_train'] = False

df = pd.concat([train_df, test_df], axis=0)

### Transformations

Convert to lower case. This is the most common transformation and is largely self-explanatory. The same step also strips backslashes and replaces underscores with spaces.

In [None]:
df['excerpt'] = df['excerpt'].apply(lambda x: str(x).lower().replace('\\', '').replace('_', ' '))

In case there is unicode text in the input, this will fix inconsistencies and glitches in it, such as mojibake (text that was decoded in the wrong encoding).

In [None]:
df['excerpt'] = df['excerpt'].apply(lambda x: ftfy.fix_text(x))
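
To make "mojibake" concrete, here is a minimal standard-library sketch of how it arises and how it can be reversed by hand. It does not use ftfy (which detects and undoes this kind of mis-decoding automatically); the sample word is just an illustration. It also hints that encoding repair is safest before lower-casing, since the simple round trip stops working afterwards. Whether ftfy can still repair lower-cased text is not checked here.

```python
original = "café"

# Mojibake arises when UTF-8 bytes are decoded with the wrong codec
garbled = original.encode("utf-8").decode("latin-1")
print(garbled)  # cafÃ©

# Reversing the wrong step recovers the text
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)  # café

# After lower-casing, the same round trip fails: the bytes are no longer valid UTF-8
try:
    garbled.lower().encode("latin-1").decode("utf-8")
    lowercase_round_trip_ok = True
except UnicodeDecodeError:
    lowercase_round_trip_ok = False
print(lowercase_round_trip_ok)  # False
```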

Double spaces are erroneous in our context. First inspect the affected excerpts, then collapse every run of whitespace into a single space.

In [None]:
df['excerpt'][df['excerpt'].str.contains("  ")]

In [None]:
def remove_multiple_spaces(text):
    text = re.sub(r'\s+', ' ', text)
    return text

df['excerpt'] = df['excerpt'].apply(lambda x: remove_multiple_spaces(x))
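
Note that `\s+` matches more than double spaces: tabs and newlines are collapsed too, so paragraph breaks disappear. A small sketch on a made-up sample string shows the difference from a pattern that only targets repeated spaces:

```python
import re

sample = "a  double space,\ta tab\nand a newline"

# \s+ treats every run of whitespace (spaces, tabs, newlines) the same way
all_whitespace = re.sub(r"\s+", " ", sample)
print(all_whitespace)  # a double space, a tab and a newline

# ' {2,}' only collapses runs of two or more plain spaces
spaces_only = re.sub(r" {2,}", " ", sample)
print(repr(spaces_only))  # 'a double space,\ta tab\nand a newline'
```

For readability modelling with a flat token sequence, the first variant is probably what we want.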

Contractions. Expanding contractions such as "don't" into "do not" and "we've" into "we have" exposes tokens/features that would otherwise stay hidden inside the contracted form. Normally, I would use the [pycontractions](https://pypi.org/project/pycontractions/) library, but it does not support Python 3.7. For reference, it is used as follows:

```
from pycontractions import Contractions

cont = Contractions(api_key="text8")
df['excerpt'].apply(lambda x: list(cont.expand_texts([x]))[0])
```

For the purposes of this demo, we can write our own function like so:

In [None]:
def decontraction(phrase):
    # specific
    phrase = re.sub(r"won\'t", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)

    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase

df['excerpt'] = df['excerpt'].apply(lambda x: decontraction(x))
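
Two properties of this regex approach are worth seeing on small inputs. The sketch below uses a trimmed copy of the rules (the name `expand` is only for illustration): the irregular forms must run before the generic `n't` rule, and the `'s` rule cannot tell a possessive from "is".

```python
import re

def expand(phrase):
    # Irregular forms first, generic suffix rules afterwards
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can't", "can not", phrase)
    phrase = re.sub(r"n't", " not", phrase)
    phrase = re.sub(r"'s", " is", phrase)
    return phrase

print(expand("she won't say it's late"))  # she will not say it is late

# Applying the generic rule on its own mangles the irregular form
print(re.sub(r"n't", " not", "won't"))    # wo not

# Possessives are expanded as if they were "is"
print(expand("the dog's bone"))           # the dog is bone
```

A dictionary-backed library would handle the possessive case better, at the cost of an extra dependency.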

Punctuation should also be irrelevant for modelling reading difficulty, so remove it. Exclamation and question marks are kept.

In [None]:
def remove_punct(text):
    # Delete all ASCII punctuation except '!' and '?'
    new_punct = re.sub(r'[ !?]', '', punctuation)
    table=str.maketrans('', '', new_punct)
    return text.translate(table)

df['excerpt'] = df['excerpt'].apply(lambda x: remove_punct(x))
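
The `str.maketrans` / `translate` mechanism can be sketched on a single made-up string. The example also shows two side effects: apostrophes are deleted (which is why contractions are expanded first) and decimal points vanish from numbers.

```python
from string import punctuation

# Keep sentence-ending marks, drop every other ASCII punctuation character
KEEP = "!?"
to_delete = "".join(ch for ch in punctuation if ch not in KEEP)

# The third argument of maketrans lists characters to delete
table = str.maketrans("", "", to_delete)

cleaned = "wait, what?! (really) it's 3.5...".translate(table)
print(cleaned)  # wait what?! really its 35
```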

Words are typically inflected (for example by adding a suffix or prefix) to express grammatical forms such as plural or tense. Dog -> dogs is an example of inflection. Words usually need to be compared in their base form for effective text matching.

Stemming and lemmatization are two methods for converting a word to a non-inflected form, i.e. reducing it to its base form.

Stemming uses a simple mechanism that removes or modifies inflections to form the root word, but the root word may not be a valid word in the language. Lemmatization also removes or modifies inflections to form the root word, but the root word needs to be a valid word in the language. We use lemmatization here, treating every token as a verb (`pos='v'`).

In [None]:
def lemma(text):
    words = word_tokenize(text)
    return ' '.join([lemmatizer.lemmatize(w, pos='v') for w in words])

df['excerpt'] = df['excerpt'].apply(lambda x: lemma(x))
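
To illustrate the difference between blind suffix stripping (stemming) and dictionary-backed lemmatization, here is a toy sketch. Neither function is NLTK's actual algorithm: the suffix list and the miniature lookup table are hypothetical stand-ins chosen only to show the contrast.

```python
# Toy illustration only: not the NLTK implementation
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word):
    # Strip the first matching suffix without checking that the result is a word
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hypothetical miniature lexicon standing in for WordNet
LEMMA_LOOKUP = {"studies": "study", "running": "run", "was": "be", "mice": "mouse"}

def toy_lemmatize(word):
    # Look the word up; unknown words are returned unchanged
    return LEMMA_LOOKUP.get(word, word)

for w in ["studies", "running", "was", "mice"]:
    print(w, "->", toy_stem(w), "|", toy_lemmatize(w))
# studies -> studi | study
# running -> runn | run
# was -> was | be
# mice -> mice | mouse
```

The stemmer produces non-words such as "studi", while the lookup always returns a real word but only for entries it knows about.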

Finally, any digits should be replaced with their spelled-out equivalents.

In [None]:
df['excerpt'] = df['excerpt'].apply(lambda x: re.sub(r"(\d+)", lambda s: num2words(int(s.group(0))), x))
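
The key mechanism here is `re.sub` with a function as the replacement: it is called once per match and its return value is inserted. The sketch below runs without the `num2words` package by using a hypothetical stand-in, `spell_digits`, which spells each digit separately rather than producing a full number name. It also shows an ordering effect of the pipeline: because punctuation was removed earlier, "3.5" has already become "35" by the time numbers are replaced.

```python
import re
from string import punctuation

# Hypothetical stand-in for num2words so the sketch has no dependencies
DIGIT_NAMES = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]

def spell_digits(number_text):
    return " ".join(DIGIT_NAMES[int(ch)] for ch in number_text)

def replace_numbers(text, to_words=spell_digits):
    # re.sub calls the lambda for every run of digits it finds
    return re.sub(r"\d+", lambda m: to_words(m.group(0)), text)

print(replace_numbers("chapter 12 has 3 parts"))
# chapter one two has three parts

# Punctuation removal before this step merges "3.5" into "35"
strip_punct = str.maketrans("", "", punctuation)
print(replace_numbers("it cost 3.5 dollars".translate(strip_punct)))
# it cost three five dollars
```

Passing `lambda s: num2words(int(s))` as `to_words` should give the notebook's behaviour.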

Save the processed datasets back to disk.

In [None]:
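# Train: drop the trailing is_train helper column
df[df['is_train']].iloc[:, :-1].to_csv("train.csv", index=False)
# Test: drop the last three columns (expected to be target, standard_error and is_train)
df[~df['is_train']].iloc[:, :-3].to_csv("test.csv", index=False)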

### Latent semantic analysis

Let's look at how well our dataset separates along abstract, high-level topics using latent semantic analysis (LSA, implemented in gensim as `LsiModel`).

**Workflow**
Create a dictionary and corpus
  * Dictionary: built from the tokens of all excerpts, keeping only tokens that occur more than once
  * Corpus: the bag-of-words representation of every excerpt, weighted with TF-IDF

In [None]:
# Takes as input the excerpt dataframe, dictionary, TF-IDF corpus and number of dimensions and returns
# a new dataframe with each excerpt characterized by the new lower-dimensional features.
# Also returns the topics and the fitted model if return_topics is True
def latent_semantic_analysis(df, dictionary, corpus_tfidf, dimensions, return_topics = False, n_topics = 10, n_words = 10):
    # Fit an LSI model on top of the TF-IDF corpus
    lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=dimensions, power_iters=10)
    corpus_lsi = lsi[corpus_tfidf]
    
    #create the features for a new dataframe
    features = []
    for doc in corpus_lsi:
        features.append(remove_doc_label(doc))
        
    # Create a new dataframe with the features
    df_features = pd.DataFrame(data = features)
    
    #return the new features dataframe devoid of columns that contain nothing
    if return_topics:
        return (df_features.fillna(0), lsi.print_topics(n_topics, num_words = n_words), lsi)
    else:
        return df_features.fillna(0)

In [None]:
# Makes the gensim dictionary and corpus
def make_dictionary_and_corpus(df):
    # The tokenized and stemmed data form our texts database 
    texts = df.copy()
    
    # Check how frequently a given word appears and remove it if only one occurrence
    frequency = defaultdict(int)
    for text in texts:
        for token in text:
            frequency[token] += 1
    texts = [[token for token in text if frequency[token] > 1] for text in texts]
    
    # Create a gensim dictionary
    dictionary = corpora.Dictionary(texts)
    
    # Create a new texts of only the ones I will analyze
    texts = df.copy()    
    
    # Create the bag of words corpus
    corpus = [dictionary.doc2bow(text) for text in texts]
    #corpus = [token_word2vec_map(text, frequency) for text in texts]
    
    # Create a tfidf wrapper and convert the corpus to a tfidf format
    tfidf = models.TfidfModel(corpus)
    corpus_tfidf = tfidf[corpus]
    
    # Return a tuple with the dictionary and corpus
    return (dictionary, corpus_tfidf, corpus, tfidf)

#clean the features for use in dataframe
def remove_doc_label(doc):
    cleaned_doc = []
    for element in doc:
        cleaned_doc.append(element[1])
    return cleaned_doc
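
To show what the TF-IDF step feeds into LSI, here is a pure-Python sketch of the weighting. It assumes what I understand to be gensim's default scheme (raw term count times log2(N / document frequency), followed by L2 normalisation, with zero-weight terms dropped). Treat that as an assumption, not a re-implementation of `TfidfModel`. The three tiny documents are made up.

```python
import math
from collections import Counter

def tfidf(docs):
    n_docs = len(docs)
    # Document frequency: in how many documents each token occurs
    doc_freq = Counter(token for doc in docs for token in set(doc))
    weighted = []
    for doc in docs:
        counts = Counter(doc)
        raw = {t: c * math.log2(n_docs / doc_freq[t]) for t, c in counts.items()}
        # Tokens present in every document get weight 0 and are dropped
        raw = {t: w for t, w in raw.items() if w > 0}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        weighted.append({t: w / norm for t, w in raw.items()} if norm else {})
    return weighted

docs = [["old", "man", "mother"],
        ["old", "bacteria", "species"],
        ["old", "man", "dna"]]
weights = tfidf(docs)
print(weights[0])
```

"old" appears in every document and drops out, while the rarer "mother" outweighs "man". LSI then factorises the resulting document-term matrix into a small number of dimensions.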

In [None]:
from nltk.corpus import stopwords
from collections import defaultdict

In [None]:
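# Build the stop-word set once; calling stopwords.words() for every token is very slow
stop_words = set(stopwords.words('english'))
df['excerpt_tokenized'] = df['excerpt'].apply(
    lambda excerpt: [word for word in word_tokenize(excerpt) if word not in stop_words]
)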

In [None]:
# Generate the dictionary and the corpus for our excerpts
text_col = "excerpt_tokenized"
dictionary, corpus_tfidf, corpus_bow, tfidf = make_dictionary_and_corpus(df[text_col])

In [None]:
dimensions = 43
df_lsi_features, topics, lsi = latent_semantic_analysis(df, dictionary, corpus_tfidf, dimensions, True, 15, 20)

**Check out the "topics"**

  1. Print out the top 15 topics with the top 20 tokens they are composed of
  2. Plot some topics against each other with colors to indicate class

In [None]:
for topic in topics:
    print("Topic %d:" % topic[0])
    print(topic[1] + "\n")

**Plot topics**

Topics 0 and 1 seem easily readable, as they contain simpler words like old, man and mother, while topic 9 seems much less so, as it includes words such as bacteria, DNA and species. Let's verify this with a plot.

In [None]:
import matplotlib.pyplot as plt

In [None]:
# Select the three topics to compare
feature_0 = 0
feature_1 = 1
feature_2 = 9

# Extract the data for plotting
feature_0 = df_lsi_features[feature_0]
feature_1 = df_lsi_features[feature_1]
feature_2 = df_lsi_features[feature_2]

# Things to plot
plt.scatter(feature_0, feature_1, c="b", s=40, alpha=0.3, linewidths=0.0, label = "More readable")
plt.scatter(feature_0, feature_2, c="r", s=40, alpha=0.3, linewidths=0.0, label = "Less readable")

# Background grid details (the `b` keyword is gone in newer Matplotlib, so pass True positionally)
axes = plt.gca()
axes.grid(True, which = 'both', axis = 'both', color = 'gray', linestyle = '-', alpha = 0.5, linewidth = 0.5) 
# axes.set_axis_bgcolor('white')  

# Font specifications
title_font = {'family' : 'arial', 'color'  : 'black', 'weight' : 'heavy','size': 20}
axis_label_font = {'family' : 'arial', 'color'  : 'black', 'weight' : 'normal','size': 20}

#figure size and tick style
plt.rcParams["figure.figsize"] = [6,6]
plt.rc('axes',edgecolor='black',linewidth=1)
plt.tick_params(which='both', axis='both', color='black', length=4, width=0.5)
plt.rcParams['xtick.direction'] = 'in'
plt.rcParams['ytick.direction'] = 'in'

#axis range and labels (also specify if log or not)
plt.xlim(0.0, 0.4)
#plt.xscale('log')
plt.ylim(-0.3, 0.3)
plt.xlabel(r'Origin topic', y=3, fontsize=20, fontdict = axis_label_font)
plt.ylabel(r'Target topic', fontsize=20, fontdict = axis_label_font)

#title and axis labels
plt.tick_params(axis='both', labelsize=20)
plt.title('Features', y=1.05, fontdict = title_font)

#legend details
legend = plt.legend(shadow = True, frameon = True, fancybox = False, ncol = 1, fontsize = 15, loc = 'lower left')
frame = legend.get_frame()
#frame.set_width(100)
frame.set_facecolor('white')
frame.set_edgecolor('black')
    
plt.show()

Indeed, the topics match our intuition based on the tokens they contain! The readable topics are more spread out along both axes than the less readable topic, which remains "clumped" near the origin. With more topics and text augmentations applied to enrich the input dataset, good separation is possible.

### Modelling

The following were copied in from https://www.kaggle.com/hengzheng/simpletransformers-regression-starter-less-code. Many thanks @hengzheng 

In [None]:
!pip install -q /kaggle/input/coleridge-packages/seqeval-1.2.2-py3-none-any.whl
!pip install -q simpletransformers==0.51.0 --no-index --find-links=file:///kaggle/input/simpletransformers/simpletransformers-0.51.0

from sklearn.model_selection import train_test_split
from simpletransformers.classification import ClassificationModel, ClassificationArgs

In [None]:
train = df[df['is_train']][['excerpt', 'target']].copy()
train.columns = ['text', 'labels']
train_df, valid_df = train_test_split(train, test_size=0.01, random_state=42)

In [None]:
model_args = ClassificationArgs()
model_args.max_seq_length = 300
model_args.num_train_epochs = 5
model_args.regression = True
model_args.no_save = True
model_args.save_model_every_epoch = False
model_args.save_steps = -1

model = ClassificationModel(
    "roberta",
    "../input/robertalarge",
    num_labels=1,
    args=model_args
)

model.train_model(train_df)

result, model_outputs, wrong_predictions = model.eval_model(valid_df)
print(result)

In [None]:
test = df[~df['is_train']][['id', 'excerpt']].copy()
test.columns = ['id', 'text']

predictions, _ = model.predict(test['text'].values)
test['target'] = predictions

test[['id', 'target']].to_csv('submission.csv', index=False)

### Upvote if it was remotely helpful! More to come :)