# CommonLit Readability

### Description

Can machine learning identify the appropriate reading level of a passage of text, and help inspire learning? Reading is an essential skill for academic success. When students have access to engaging passages offering the right level of challenge, they naturally develop reading skills.

Currently, most educational texts are matched to readers using traditional readability methods or commercially available formulas. However, each has its issues. Tools like Flesch-Kincaid Grade Level are based on weak proxies of text decoding (i.e., characters or syllables per word) and syntactic complexity (i.e., number of words per sentence). As a result, they lack construct and theoretical validity. At the same time, commercially available formulas, such as Lexile, can be cost-prohibitive, lack suitable validation studies, and suffer from transparency issues when the formula's features aren't publicly available.
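
To make those proxies concrete, here is a minimal sketch of the Flesch-Kincaid Grade Level formula, 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The vowel-group syllable counter and the function names `count_syllables` and `flesch_kincaid_grade` are assumptions made for illustration; they are not part of the competition data or of any particular library.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (assumption): one syllable per group of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid Grade Level from sentence length and syllables per word."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


simple = "The cat sat on the mat. The dog ran to the park."
dense = ("Photosynthesis converts electromagnetic radiation into chemical "
         "energy within specialized organelles.")

print("simple:", round(flesch_kincaid_grade(simple), 2))
print("dense: ", round(flesch_kincaid_grade(dense), 2))
```

The score depends only on counts of words, sentences, and syllables, so two passages with the same surface statistics get the same grade regardless of meaning or cohesion.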

CommonLit, Inc., is a nonprofit education technology organization serving over 20 million teachers and students with free digital reading and writing lessons for grades 3-12. Together with Georgia State University, an R1 public research university in Atlanta, they are challenging Kagglers to improve readability rating methods.

In this competition, you’ll build algorithms to rate the complexity of reading passages for grade 3-12 classroom use. To accomplish this, you'll pair your machine learning skills with a dataset that includes readers from a wide variety of age groups and a large collection of texts taken from various domains. Winning models will be sure to incorporate text cohesion and semantics.

If successful, you'll aid administrators, teachers, and students. Literacy curriculum developers and teachers who choose passages will be able to quickly and accurately evaluate works for their classrooms. Plus, these formulas will become more accessible for all. Perhaps most importantly, students will benefit from feedback on the complexity and readability of their work, making it far easier to improve essential reading skills.

### Content

* id - unique ID for excerpt
* url_legal - URL of source - this is blank in the test set.
* license - license of source material - this is blank in the test set.
* excerpt - text to predict reading ease of
* target - reading ease
* standard_error - measure of spread of scores among multiple raters for each excerpt. Not included for test data.

### Acknowledgements

CommonLit would like to extend a special thanks to Professor Scott Crossley's research team at the Georgia State University Departments of Applied Linguistics and Learning Sciences for their partnership on this project.

The organizers would like to thank Schmidt Futures for their advice and support for making this work possible.

## Imports

In [None]:
import numpy as np
import pandas as pd
from tensorflow.keras import layers
import matplotlib.pyplot as plt
from tensorflow.keras.layers import Embedding
from sklearn.model_selection import train_test_split
import tensorflow as tf
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
from tensorflow.keras.metrics import RootMeanSquaredError
from gensim.models import KeyedVectors
from sklearn.linear_model import LinearRegression
from collections import Counter

import string
import re

## Download Models

## Read Dataset

In [None]:
df_train = pd.read_csv('../input/commonlitreadabilityprize/train.csv')
df_test = pd.read_csv('../input/commonlitreadabilityprize/test.csv')
df_submission = pd.read_csv('../input/commonlitreadabilityprize/sample_submission.csv')

print('Training shape : {}'.format(df_train.shape))
print('Testing shape : {}'.format(df_test.shape))

In [None]:
df_train.head()

In [None]:
df_test.head()

In [None]:
df_submission.head()

In [None]:
df_train.info()

In [None]:
df_train.describe()

## Data Analysis

In [None]:
col = df_train.columns       # .columns gives columns names in data 
print(col)

### Missing data

In [None]:
df_train.isnull().sum()

In [None]:
df_test.isnull().sum()

In [None]:
count = df_train['excerpt'].str.split().str.len()
print("Number of words in excerpts:\n",count)
print("Max word count from excerpt: ", max(count))

In [None]:
results = Counter()
df_train['excerpt'].str.lower().str.split().apply(results.update)
print(len(results.keys()))

In [None]:
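# Take the longest token directly from the Counter keys
# (str(results.keys()) would include the "dict_keys([...])" wrapper and quotes).
longest = max(results.keys(), key=len)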
print(longest)
print(len(longest))

## Clean Dataset

### Duplicate data

In [None]:
print("duplicated =>", df_train.duplicated(keep = "first").sum())

In [None]:
print("duplicated =>", df_test.duplicated(keep = "first").sum())

In [None]:
def removePunctuations(text):
    return text.translate(str.maketrans('', '', string.punctuation))

In [None]:
def removeLinks(text):
    return re.sub(r'https?://\S+|www\.\S+', '', text)

In [None]:
def removeNumbers(text):
    return re.sub(r'\d+', '', text)

In [None]:
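def clean(text):
    text = text.lower()
    # Strip links before punctuation; otherwise "://" and "." are removed
    # and the URL pattern no longer matches.
    text = removeLinks(text)
    text = removePunctuations(text)
    text = removeNumbers(text)
    return text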
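
As a quick sanity check of the cleaning steps, the following self-contained sketch applies the same three operations (links, punctuation, digits) to one sample sentence. The names `clean_text` and `punctuation_first` and the sample string are hypothetical. The second function strips punctuation before links to show why the order matters.

```python
import re
import string

URL_PATTERN = r"https?://\S+|www\.\S+"
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(text: str) -> str:
    """Lowercase, then remove links, punctuation, and digits, in that order."""
    text = text.lower()
    text = re.sub(URL_PATTERN, "", text)
    text = text.translate(PUNCT_TABLE)
    text = re.sub(r"\d+", "", text)
    return text


def punctuation_first(text: str) -> str:
    """Same steps, but punctuation is stripped before links (for comparison)."""
    text = text.lower()
    text = text.translate(PUNCT_TABLE)
    text = re.sub(URL_PATTERN, "", text)
    text = re.sub(r"\d+", "", text)
    return text


sample = "Visit https://example.com today! It opened in 1999."
print(clean_text(sample).split())
print(punctuation_first(sample).split())
```

With punctuation removed first, the URL collapses into a single token (`httpsexamplecom`) that the link pattern can no longer match, so it stays in the text.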

## Data Preprocessing

In [None]:
df_train['excerpt_clean'] = df_train['excerpt'].apply(clean)
df_train.head()

In [None]:
X = df_train['excerpt_clean'].copy()
y = df_train['target'].copy()

print(len(X), len(y))

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
print(len(X_train), len(y_train))
print(len(X_test), len(y_test))

## Model

In [None]:
def plot_loss(history):
    plt.plot(history.history['loss'], label='loss')
    plt.plot(history.history['val_loss'], label='val_loss')
    plt.xlabel('Epoch')
    plt.ylabel('Error')
    plt.legend()
    plt.grid(True)

In [None]:
def plot_rmse(history):
    plt.plot(history.history['root_mean_squared_error'], label='root_mean_squared_error')
    plt.plot(history.history['val_root_mean_squared_error'], label='val_root_mean_squared_error')
    plt.xlabel('Epoch')
    plt.ylabel('root mean squared error')
    plt.legend()
    plt.grid(True)

In [None]:
def get_predict(model, X_train, y_train, X_test, y_test):
    print("\nThe model performance for training set")
    print("--------------------------------------")
    print(model.score(X_train, y_train))
    print("\nThe model performance for testing set")
    print("--------------------------------------")
    print(model.score(X_test, y_test))
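
For scikit-learn regressors such as `LinearRegression`, `model.score` returns the coefficient of determination R², whereas the Keras models in this notebook report RMSE. The sketch below computes both on a small made-up example so the two numbers can be related; the helper names `rmse` and `r_squared` and the toy values are assumptions for illustration.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: lower is better, 0 is a perfect fit."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 is a perfect fit, 0 matches predicting the mean."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy targets and predictions (made up for this example).
y_true = [-1.0, 0.0, 1.0, 2.0]
y_pred = [-0.5, 0.0, 0.5, 2.0]

print("RMSE:", rmse(y_true, y_pred))
print("R^2: ", r_squared(y_true, y_pred))
```

Because the two metrics point in opposite directions (higher R² is better, lower RMSE is better), scores from `get_predict` are not directly comparable with the RMSE curves plotted for the neural models.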

### Word Embeddings: Continuous Bag of Words

In [None]:
#google_model = KeyedVectors.load_word2vec_format('../input/googlenewsvectorsnegative300/GoogleNews-vectors-negative300.bin', binary=True)

In [None]:
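class MeanEmbeddingVectorizer(object):
    """Represent each excerpt as the mean of its word vectors."""
    def __init__(self, word2vec):
        self.word2vec = word2vec
        # Size the fallback vector from the model (300 for the GoogleNews vectors).
        self.dim = word2vec.vector_size

    def fit(self, X, y=None):
        return self

    def fit_transform(self, X, y=None):
        return self.transform(X)

    def transform(self, X):
        # X holds raw strings, so split each excerpt into words first;
        # key_to_index is a dict, so membership tests are fast.
        return np.array([
            np.mean([self.word2vec.get_vector(w) for w in words.split()
                     if w in self.word2vec.key_to_index] or [np.zeros(self.dim)], axis=0)
            for words in X
        ])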
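
The idea behind the mean-embedding vectorizer can be sketched without gensim: look up a vector for every known word in an excerpt, average them component by component, and fall back to a zero vector when no word is known. The function `mean_embedding` and the two-dimensional `toy_vectors` below are hypothetical stand-ins for the real word2vec model.

```python
def mean_embedding(tokens, vectors, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    # zip(*known) groups the i-th component of every vector together.
    return [sum(component) / len(known) for component in zip(*known)]


# Toy 2-dimensional "embeddings" (made up for this example).
toy_vectors = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}

print(mean_embedding("the cat and the dog".split(), toy_vectors, dim=2))
print(mean_embedding("completely unknown words".split(), toy_vectors, dim=2))
```

Averaging discards word order, which is one reason a sequence model such as an LSTM may capture more of a passage's structure.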

In [None]:
#cbow_model = LinearRegression()

#embedding_vectorizer = MeanEmbeddingVectorizer(google_model)
#X_train_vectorizer = embedding_vectorizer.transform(X_train)
#X_test_vectorizer = embedding_vectorizer.transform(X_test)

#cbow_model.fit(X_train_vectorizer, y_train)

In [None]:
#get_predict(cbow_model, X_train_vectorizer, y_train, X_test_vectorizer, y_test)

### Embeddings

In [None]:
vectorizer = TextVectorization(max_tokens=5000, output_sequence_length=200)
ds = tf.data.Dataset.from_tensor_slices(X_train).batch(128)
vectorizer.adapt(ds)

In [None]:
voc = vectorizer.get_vocabulary()
word_index = dict(zip(voc, range(len(voc))))

In [None]:
num_tokens = len(voc) + 2
embedding_dim = 100
hits = 0
misses = 0

In [None]:
filepath = '../input/glove6b/glove.6B.100d.txt'

embeddings_index = {}
with open(filepath, encoding="utf8") as f:
    for line in f:
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, "f", sep=" ")
        embeddings_index[word] = coefs

print("Total vectors found: %i." % len(embeddings_index))

In [None]:
embedding_matrix = np.zeros((num_tokens, embedding_dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
        hits += 1
    else:
        misses += 1
print("Words found: %d, words missing: %d" % (hits, misses))

In [None]:
embedding_layer = Embedding(
    num_tokens,
    embedding_dim,
    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
    trainable=False,
)

In [None]:
embedding_model = tf.keras.Sequential([
          tf.keras.Input(shape=(1,), dtype="string"),  
          vectorizer,
          embedding_layer,
          layers.GlobalMaxPool1D(),
          layers.Dense(10, activation='relu'),
          layers.Dense(1)
])

embedding_model.compile(optimizer='adam', loss='mean_squared_error', metrics=[RootMeanSquaredError()])


embedding_model.summary()

In [None]:
# Validate on the held-out split rather than on the training data.
history = embedding_model.fit(X_train, y_train, batch_size=128, epochs=100, validation_data=(X_test, y_test), verbose=0)

In [None]:
plot_loss(history)

In [None]:
plot_rmse(history)

### Long Short Term Memory (LSTM)

In [None]:
vectorizer = TextVectorization(max_tokens=5000, output_sequence_length=200)
ds = tf.data.Dataset.from_tensor_slices(X_train).batch(128)
vectorizer.adapt(ds)

In [None]:
voc = vectorizer.get_vocabulary()
word_index = dict(zip(voc, range(len(voc))))

In [None]:
num_tokens = len(voc) + 2
embedding_dim = 100
hits = 0
misses = 0

In [None]:
filepath = '../input/glove6b/glove.6B.100d.txt'

embeddings_index = {}
with open(filepath, encoding="utf8") as f:
    for line in f:
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, "f", sep=" ")
        embeddings_index[word] = coefs

print("Total vectors found: %i." % len(embeddings_index))

In [None]:
embedding_matrix = np.zeros((num_tokens, embedding_dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
        hits += 1
    else:
        misses += 1
print("Words found: %d, words missing: %d" % (hits, misses))

In [None]:
embedding_layer = Embedding(
    num_tokens,
    embedding_dim,
    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
    trainable=False,
)

In [None]:
lstm_model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,), dtype="string"),  
    vectorizer,
    embedding_layer,
    layers.LSTM(64, return_sequences=True),
    layers.GlobalMaxPool1D(),
    layers.Dense(10, activation='relu'),
    layers.Dense(1)
])

lstm_model.compile(optimizer='adam', loss='mean_squared_error', metrics=[RootMeanSquaredError()])

lstm_model.summary()

In [None]:
history = lstm_model.fit(X_train, y_train, batch_size=128, epochs=10, validation_data=(X_test, y_test), verbose=0)

In [None]:
plot_loss(history)

In [None]:
plot_rmse(history)

## Best Model

In [None]:
best_model = lstm_model

## Submission

In [None]:
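def submission(submission_file_path, model, excerpt):
    # The model is a regressor, so these are predicted scores, not classes.
    predictions = model.predict(excerpt)
    sample_submission = pd.read_csv(submission_file_path)
    # model.predict returns shape (n, 1); flatten it to one value per row.
    sample_submission["target"] = predictions.ravel()
    sample_submission.to_csv("./submission.csv", index=False)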

In [None]:
submission_file_path = '../input/commonlitreadabilityprize/sample_submission.csv'

# Apply the same cleaning used for the training excerpts before predicting.
submission(submission_file_path, best_model, df_test['excerpt'].apply(clean))