## Libraries

In [None]:
import pandas as pd
import numpy as np
from tensorflow.keras import regularizers, optimizers
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
from tensorflow.keras.layers import Embedding, Dense, Dropout, Input, LSTM, GlobalMaxPool1D
from tensorflow.keras.models import Sequential
from tensorflow.keras.initializers import Constant
import tensorflow as tf
import spacy


In [None]:
!python -m spacy download en_core_web_lg
import en_core_web_lg



## Weights & Biases

Weights & Biases (W&B) is a tool for tracking the training logs of our model on its website. You can log in with your GitHub account.

In [None]:
!pip install wandb -qqq
import wandb
wandb.login()
from wandb.keras import WandbCallback

## Data

We use the data from the [Assess Student Writing Level](https://github.com/jnels13/Screening-Childrens-Writing-Level-With-NLP) task.

In [None]:
!wget https://github.com/Violet-Spiral/assessing-childrens-writing/raw/main/data/samples_no_title.csv
 
text = pd.read_csv('samples_no_title.csv').dropna()

In [None]:
len(text)

In [None]:
text["Grade"].unique()

In [None]:
text.iloc[23].Text

In [None]:
text.iloc[23].Grade

## Preprocessing

We define a vectorization layer that builds a vocabulary from the texts and maps every token (word) to an integer index.

In [None]:
nlp = en_core_web_lg.load()
Vectorizer = TextVectorization()

Vectorizer.adapt(text.Text.to_numpy())
vocab = Vectorizer.get_vocabulary()
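
To make the index assignment concrete, here is a minimal plain-Python sketch of what a vectorizer does conceptually. It is not the Keras implementation: `tokenize`, `build_vocab`, and `encode` are hypothetical helpers, the punctuation regex is a simplification, and the alphabetical tie-break is an assumption. It does mirror the `TextVectorization` defaults of lowercasing, stripping punctuation, splitting on whitespace, and reserving index 0 for padding and index 1 for unknown tokens (`[UNK]`).

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase, drop punctuation (simplified regex), split on whitespace.
    return re.sub(r"[^\w\s]", "", text.lower()).split()


def build_vocab(texts):
    counts = Counter()
    for t in texts:
        counts.update(tokenize(t))
    # Most frequent tokens first; the alphabetical tie-break is an assumption.
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    # Index 0 is reserved for padding, index 1 for out-of-vocabulary tokens.
    return ["", "[UNK]"] + ordered


def encode(text, vocab):
    index = {tok: i for i, tok in enumerate(vocab)}
    # Unknown tokens fall back to index 1 ([UNK]).
    return [index.get(tok, 1) for tok in tokenize(text)]


toy_vocab = build_vocab(["The cat sat.", "The dog sat!"])
toy_ids = encode("The bird sat", toy_vocab)
```

In this toy example "bird" never appeared in the texts used to build the vocabulary, so it is mapped to the unknown-token index.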


## Building model

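Here we define the first layer: the embeddings. Each row of the embedding matrix holds the spaCy vector of the corresponding vocabulary token.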

In [None]:
num_tokens = len(vocab)
embedding_dim = len(nlp('The').vector)
embedding_matrix = np.zeros((num_tokens, embedding_dim))
for i, word in enumerate(vocab):
    embedding_matrix[i] = nlp(str(word)).vector

In [None]:
Embedding_layer=Embedding(
    num_tokens,
    embedding_dim,
    embeddings_initializer=Constant(embedding_matrix),
    trainable=False)
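
Conceptually, a frozen embedding layer is a table lookup: each integer token index selects one row of the matrix. The sketch below uses a tiny made-up matrix (`toy_matrix`) and a hypothetical `lookup` helper in plain Python, purely for illustration.

```python
# Toy 4-token vocabulary with 3-dimensional vectors (all values are made up).
toy_matrix = [
    [0.0, 0.0, 0.0],  # index 0: padding
    [0.0, 0.0, 0.0],  # index 1: unknown token
    [0.1, 0.2, 0.3],  # index 2
    [0.4, 0.5, 0.6],  # index 3
]


def lookup(token_ids, matrix):
    # Replace every token index by its row of the embedding matrix.
    return [matrix[i] for i in token_ids]


toy_embedded = lookup([3, 1, 2], toy_matrix)
```

With `trainable=False`, these rows stay fixed during training, so only the layers after the embedding are updated.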

Here we define the learning rate and the number of epochs (try changing them later to see how they affect training).

In [None]:
lr = .01
epochs = 50
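
The learning rate set here is only the starting value, because the optimizer defined later in this notebook is created with `decay=1e-2`. Assuming the legacy Keras meaning of that argument (inverse-time decay applied at every optimizer step, that is, every batch), the effective rate can be sketched as follows. `effective_lr` is a hypothetical helper, not part of Keras.

```python
def effective_lr(initial_lr, decay, step):
    # Inverse-time decay: lr_t = lr_0 / (1 + decay * step),
    # where `step` counts batches, not epochs (assumed legacy Keras behaviour).
    return initial_lr / (1.0 + decay * step)


# Starting rate, then the rate after 100 and 1000 batches.
schedule = [effective_lr(0.01, 1e-2, s) for s in (0, 100, 1000)]
```

Under this assumption the rate has already halved after 100 batches, so when you experiment with `lr` it is worth keeping the decay in mind as well.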

We put the model together:

In [None]:
model = Sequential()
model.add(Input(shape=(1,), dtype=tf.string))
model.add(Vectorizer)
model.add(Embedding_layer)
model.add(LSTM(25, return_sequences=True))
model.add(GlobalMaxPool1D())
model.add(Dropout(0.5))
model.add(Dense(32, activation='tanh', 
                kernel_regularizer = regularizers.l1_l2(l1=1e-5, l2=1e-4)))
model.add(Dropout(0.5))
model.add(Dense(32, activation='tanh', 
                kernel_regularizer = regularizers.l1_l2(l1=1e-5, l2=1e-4)))    
model.add(Dense(1))

adam = optimizers.Adam(learning_rate=lr, decay=1e-2)
model.compile(optimizer = adam, loss = 'mean_absolute_error', metrics = ["mean_squared_error"])

# summary() prints the table itself and returns None, so no print() is needed
model.summary()
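
The model is optimized for mean absolute error, while mean squared error is only reported as a metric. A small plain-Python sketch with made-up grades shows how the two quantities differ; `mae` and `mse` are illustrative helpers, not the Keras functions.

```python
def mae(y_true, y_pred):
    # Mean of the absolute differences.
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    # Mean of the squared differences.
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


true_grades = [1, 4, 8, 10]   # made-up targets
pred_grades = [2, 4, 6, 14]   # made-up predictions

toy_mae = mae(true_grades, pred_grades)
toy_mse = mse(true_grades, pred_grades)
```

The single prediction that is off by four grades contributes 16 of the 21 squared-error units but only 4 of the 7 absolute-error units, which is why the two curves can look quite different on W&B.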



We must initialize W&B so that the run is easy to find on the website and its training parameters are recorded.

In [None]:
wandb.init(
    project="EmbeddingLayer",
    name="with_fixed_embeddings",
    config={
        "learning_rate": lr,
        "architecture": "LSTM",  # the model above is LSTM-based, not a plain MLP
        "dataset": "Children texts",
        "epochs": epochs})

config = wandb.config
logging_callback = WandbCallback(log_evaluation=True)

## Training

We don't define a separate validation set; instead, we set a validation split, so 20% of the data is used for validation.

We run the training:

In [None]:
history = model.fit(text.Text,
          text.Grade,
          batch_size = 10,
          epochs = epochs,
          validation_split=.2,
          callbacks=[logging_callback])
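
Keras takes the validation samples from the end of the arrays passed to `fit`, before any shuffling. The plain-Python sketch below mimics that slicing; `split_like_keras` is a hypothetical helper for illustration only.

```python
import math


def split_like_keras(samples, validation_split):
    # The first part is used for training, the trailing part for validation.
    split_at = int(math.floor(len(samples) * (1.0 - validation_split)))
    return samples[:split_at], samples[split_at:]


train_part, val_part = split_like_keras(list(range(10)), 0.2)
```

If the CSV happens to be ordered, for example by grade, the trailing 20% would not be representative. Shuffling the DataFrame first, e.g. with `text.sample(frac=1, random_state=0)`, is one way to guard against that.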

We plot the learning curve; check whether it looks similar on your W&B site.

In [None]:
import matplotlib.pyplot as plt


def plot_graphs(history, metric):
  plt.plot(history.history[metric])
  plt.plot(history.history['val_'+metric], '')
  plt.xlabel("Epochs")
  plt.ylabel(metric)
  plt.legend([metric, 'val_'+metric])

In [None]:
plt.figure(figsize=(16, 8))
plot_graphs(history, 'loss')
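
The `history.history` dictionary behind the plot can also be inspected directly, for example to find the epoch with the lowest validation loss. The sketch uses a made-up dictionary (`toy_history`) in place of the real one, and `best_epoch` is a hypothetical helper.

```python
# Made-up numbers standing in for history.history.
toy_history = {
    "loss": [3.0, 2.1, 1.6, 1.3, 1.1],
    "val_loss": [3.2, 2.4, 2.0, 2.1, 2.3],
}


def best_epoch(history_dict, metric="val_loss"):
    values = history_dict[metric]
    # Keras reports epochs 1-based, while list indices are 0-based.
    best_index = min(range(len(values)), key=values.__getitem__)
    return best_index + 1, values[best_index]


epoch, value = best_epoch(toy_history)
```

A validation loss that starts rising while the training loss keeps falling, as in the toy numbers, is the usual sign of overfitting. On the real run, `best_epoch(history.history)` would give the corresponding epoch.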
