# Goal

<h3 style="color:blue">Assess the quality of summaries written by students</h3>
<h3 style="color:indigo">Evaluate how well a student represents the main idea and details of a source text, as well as the clarity, precision, and fluency of the language used in the summary</h3>
<h3 style="color:red">Freely &amp; publicly available external data is <b>allowed</b>, including pre-trained models</h3>
<h3>This is a multi-output problem: each summary gets a <code>content</code> score and a <code>wording</code> score</h3>

### Use Hugging Face Library
### Use NLTK
### Use TensorFlow

In [42]:
import warnings
warnings.filterwarnings("ignore")

In [43]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import re
import math
import subprocess
from tqdm import tqdm
import pickle

In [44]:
import tensorflow as tf

In [45]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error, explained_variance_score, median_absolute_error

In [46]:
import transformers
from transformers import AutoTokenizer, TFBertModel

In [47]:
import keras_tuner 

In [48]:
prompts_train = pd.read_csv('/kaggle/input/commonlit-evaluate-student-summaries/prompts_train.csv')
summaries_train = pd.read_csv('/kaggle/input/commonlit-evaluate-student-summaries/summaries_train.csv')
prompts_test = pd.read_csv('/kaggle/input/commonlit-evaluate-student-summaries/prompts_test.csv')
summaries_test = pd.read_csv('/kaggle/input/commonlit-evaluate-student-summaries/summaries_test.csv')

In [49]:
train = pd.merge(prompts_train, summaries_train, on='prompt_id')
test = pd.merge(prompts_test, summaries_test, on='prompt_id')
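
The merge above attaches each prompt's text to every summary written for it. A small sketch with made-up rows shows the one-to-many behaviour; the toy values and the `validate` check are illustrative additions, not part of the original cells.

```python
import pandas as pd

# Toy stand-ins for prompts_train / summaries_train (values are made up)
prompts = pd.DataFrame({
    "prompt_id": ["p1", "p2"],
    "prompt_text": ["source text one", "source text two"],
})
summaries = pd.DataFrame({
    "student_id": ["s1", "s2", "s3"],
    "prompt_id": ["p1", "p1", "p2"],
    "text": ["summary a", "summary b", "summary c"],
})

# validate="one_to_many" raises if a prompt_id appears twice in `prompts`
merged = pd.merge(prompts, summaries, on="prompt_id", validate="one_to_many")

# One row per summary; the prompt text is repeated for each of its summaries
print(merged.shape)
print(merged["prompt_text"].tolist())
```

The result has one row per summary, so any per-prompt work done row by row later is repeated for every summary of that prompt.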

In [50]:
train.rename(columns = {'text' : 'summary'}, inplace=True)
test.rename(columns = {'text' : 'summary'}, inplace=True)

In [51]:
train.head(2)

| | prompt_id | prompt_question | prompt_title | prompt_text | student_id | summary | content | wording |
|---|---|---|---|---|---|---|---|---|
| 0 | 39c16e | Summarize at least 3 elements of an ideal trag... | On Tragedy | Chapter 13 \r\nAs the sequel to what has alrea... | 00791789cc1f | 1 element of an ideal tragedy is that it shoul... | -0.210614 | -0.471415 |
| 1 | 39c16e | Summarize at least 3 elements of an ideal trag... | On Tragedy | Chapter 13 \r\nAs the sequel to what has alrea... | 0086ef22de8f | The three elements of an ideal tragedy are: H... | -0.970237 | -0.417058 |


In [52]:
train['summary'][0]

'1 element of an ideal tragedy is that it should be arranged on a complex plan.  Another element of an ideal tragedy is that it should only have one main issue. The last element of an ideal tragedy is that it should have a double thread plot and an opposite catastrophe for both good and bad.'

In [53]:
columns_needed = ["prompt_text", "summary"]

In [54]:
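# Copy so that adding embedding columns later does not write into a slice of train/test
train_data = train[columns_needed].copy()
test_data = test[columns_needed].copy()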

In [55]:
#from transformers import XLNetTokenizer, TFXLNetModel
#tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')
#model = TFXLNetModel.from_pretrained('xlnet-base-cased', return_dict=True)

#from transformers import RobertaTokenizer, TFRobertaModel
#tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
#model = TFRobertaModel.from_pretrained('roberta-base', return_dict=True)

from transformers import AutoTokenizer, TFBertModel
model = TFBertModel.from_pretrained('/kaggle/input/huggingface-bert-variants/bert-base-uncased/bert-base-uncased')
tokenizer = AutoTokenizer.from_pretrained('/kaggle/input/huggingface-bert-variants/bert-base-uncased/bert-base-uncased')

Some layers from the model checkpoint at /kaggle/input/huggingface-bert-variants/bert-base-uncased/bert-base-uncased were not used when initializing TFBertModel: ['mlm___cls', 'nsp___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at /kaggle/input/huggingface-bert-variants/bert-base-uncased/bert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


### Next time, use prepare_tf_dataset, which collates (pads) an already tokenized Hugging Face dataset and
### wraps it as a tf.data.Dataset that is compatible with TensorFlow
#### https://huggingface.co/docs/transformers/v4.31.0/en/main_classes/model#transformers.TFPreTrainedModel.prepare_tf_dataset

In [56]:

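def vectorize_dataframe(dataframe, col):
    """Embed every text in dataframe[col] with BERT; returns one pooled vector per row."""
    vectors = []
    for text in tqdm(dataframe[col].tolist()):
        # Pad or truncate each text to BERT's 512-token limit
        text_tokens = tokenizer(text, return_tensors="tf", max_length=512, padding='max_length', truncation=True)

        output = model(text_tokens)

        # pooler_output: the [CLS] hidden state passed through a dense + tanh layer
        pooler_output = output.pooler_output

        vectors.append(pooler_output)
    return vectors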
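
Because the merged frame repeats the same prompt text for every summary, the loop above embeds identical strings many times. One possible shortcut, sketched here with a hypothetical `embed_unique` helper and a stand-in `fake_embed` function in place of the tokenizer and BERT call, is to embed each distinct text once and reuse the result.

```python
import numpy as np

def embed_unique(texts, embed_fn):
    """Embed each distinct text once and reuse the vector for repeats."""
    cache = {}
    vectors = []
    for text in texts:
        if text not in cache:
            cache[text] = embed_fn(text)
        vectors.append(cache[text])
    return vectors

calls = []

def fake_embed(text):
    # Stand-in for tokenizer + model; records how often it is called
    calls.append(text)
    return np.full((1, 4), float(len(text)), dtype="float32")

texts = ["prompt A", "prompt A", "prompt B", "prompt A"]
vectors = embed_unique(texts, fake_embed)

# Four vectors are returned, but the embedding function ran only twice
print(len(vectors), len(calls))
```

In the notebook this would amount to passing a function that wraps the tokenizer and model calls as `embed_fn`; whether the saving matters depends on how many distinct prompts the data contains.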
    

In [57]:
test_data['prompt_text_embedded'] = vectorize_dataframe(test_data, 'prompt_text')
test_data['summary_embedded'] = vectorize_dataframe(test_data, 'summary')

100%|██████████| 4/4 [00:00<00:00,  5.10it/s]
100%|██████████| 4/4 [00:00<00:00,  4.90it/s]


In [58]:
with open("/kaggle/input/embeddings/BERT_prompt_text_embeddings.pkl", "rb") as file:
    train_data['prompt_text_embedded'] = pickle.load(file)
    
with open("/kaggle/input/embeddings/BERT_summary_embeddings.pkl", "rb") as file:
    train_data['summary_embedded'] = pickle.load(file)

In [59]:
traning_set = train_data[['prompt_text_embedded', 'summary_embedded']]
testing_set = test_data[['prompt_text_embedded', 'summary_embedded']]

### Take average of embeddings  [Not required, just checking]

In [60]:
target1 = np.array(train['content'])
target1 = target1.astype('float32')

target2 = np.array(train['wording'])
target2 = target2.astype('float32')

#target = (target1, target2)
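
The commented-out tuple hints at treating both scores as one target. A minimal sketch, assuming a single model with a two-unit output layer were used instead of the two separate models trained later, would stack the scores column-wise. The two rows reuse the values shown by `train.head(2)`.

```python
import numpy as np

content = np.array([-0.210614, -0.970237], dtype="float32")
wording = np.array([-0.471415, -0.417058], dtype="float32")

# Shape (n_samples, 2): column 0 is content, column 1 is wording
targets = np.column_stack([content, wording])

print(targets.shape)
```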

In [61]:
def convert_tensor_to_numpy(tensor):
    return np.array(tensor, dtype='float32')

traning_set = traning_set.applymap(convert_tensor_to_numpy)
testing_set = testing_set.applymap(convert_tensor_to_numpy)

In [62]:
def prepare_dataset(dataset):
    # Flatten the nested arrays in the DataFrame
    dataset['prompt_text_embedded'] = dataset['prompt_text_embedded'].apply(lambda x: x.flatten())
    dataset['summary_embedded'] = dataset['summary_embedded'].apply(lambda x: x.flatten())
    
    feature1 = np.array(dataset['prompt_text_embedded'].tolist())
    feature2 = np.array(dataset['summary_embedded'].tolist())
    
    features = np.concatenate((feature1, feature2), axis=1)
    
    return features
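
To make the resulting shape concrete, here is a small sketch of the same flatten-and-concatenate steps on random stand-in data. It assumes each cell holds a `(1, 768)` array, as `pooler_output` from `bert-base-uncased` does; the `toy` frame is made up.

```python
import numpy as np
import pandas as pd

# Three fake rows; each cell holds a (1, 768) array like BERT's pooler_output
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "prompt_text_embedded": [rng.normal(size=(1, 768)).astype("float32") for _ in range(3)],
    "summary_embedded": [rng.normal(size=(1, 768)).astype("float32") for _ in range(3)],
})

# Same steps as prepare_dataset: flatten each cell, stack rows, join side by side
prompt_part = np.array([x.flatten() for x in toy["prompt_text_embedded"]])
summary_part = np.array([x.flatten() for x in toy["summary_embedded"]])
toy_features = np.concatenate((prompt_part, summary_part), axis=1)

print(toy_features.shape)
```

Under that assumption each row becomes a 1536-value vector: the prompt embedding followed by the summary embedding.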

In [63]:
features = prepare_dataset(traning_set)

In [64]:
features_for_test = prepare_dataset(testing_set)

In [65]:
from tensorflow.keras.layers import Dense, Input, Flatten, Bidirectional, LSTM, Dropout
from tensorflow.keras.models import Sequential

In [66]:
def build_model_content(hp):   
   
    #optimizer = hp.Choice('optimizer', values=['adam', 'rmsprop', 'sgd'])
    
    model_content = Sequential()
    model_content.add(Bidirectional(LSTM(units=hp.Int("units1", min_value=56, max_value=412, step=32)), input_shape=(len(features[0]) , 1)))
    model_content.add(Dropout(0.2))
    model_content.add(Dense(1,  activation='linear') )
    
    model_content.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.0001), loss='mean_squared_error', metrics=['mae', 'mse'])
    
    return model_content


In [67]:
def build_model_wording(hp):   
   
    #optimizer = hp.Choice('optimizer', values=['adam', 'rmsprop', 'sgd'])
    
    model_wording =  Sequential()
    model_wording.add(Bidirectional(LSTM(units=hp.Int("units1", min_value=56, max_value=412, step=32)), input_shape=(len(features[0]) , 1)))
    model_wording.add(Dropout(0.2))
    model_wording.add(Dense(1,  activation='linear') )
    
    model_wording.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.0001), loss='mean_squared_error', metrics=['mae', 'mse'])
    
    return model_wording
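
`input_shape=(len(features[0]), 1)` tells the LSTM to read each feature vector as a sequence with one value per time step. A small NumPy sketch of the equivalent explicit reshape follows; the zero array is a placeholder for `features`, and the width of 1536 assumes both embeddings are 768-dimensional.

```python
import numpy as np

# Placeholder for `features`: 5 samples, 768 prompt + 768 summary values each
features_2d = np.zeros((5, 1536), dtype="float32")

# (samples, timesteps, values_per_step), matching input_shape=(1536, 1)
features_3d = features_2d[..., np.newaxis]

print(features_3d.shape)
```

Read this way, the position of a value inside the embedding is treated as a time index, which is a modelling choice rather than a property of the embeddings themselves.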

### Hyperband search for build_model_content and build_model_wording

In [68]:

# Note: 'mse' is the training MSE; 'val_mse' would rank trials on the validation_split instead
objective = keras_tuner.Objective('mse', 'min')

content_tuner = keras_tuner.Hyperband(
    hypermodel=build_model_content,
    objective=objective,
    max_epochs=5,
    factor=3
)

wording_tuner = keras_tuner.Hyperband(
    hypermodel=build_model_wording,
    objective=objective,
    max_epochs=5,
    factor=3
)
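
Hyperband trains many configurations briefly and promotes the best ones to longer runs. The sketch below works through that arithmetic for `max_epochs=5, factor=3`. The sizing formulas are an assumption about how KerasTuner lays out its brackets, not documented behaviour, and `hyperband_schedule` is a hypothetical helper.

```python
import math

def hyperband_schedule(max_epochs, factor, min_epochs=1):
    """Return (bracket, round, n_models, epochs) tuples for one Hyperband pass."""
    # Number of brackets: how many times max_epochs can be divided by factor
    n_brackets = 0
    epochs = max_epochs
    while epochs >= min_epochs:
        epochs /= factor
        n_brackets += 1

    end_size = math.ceil(1 + math.log(max_epochs, factor))
    schedule = []
    for bracket in reversed(range(n_brackets)):
        for rnd in range(bracket + 1):
            n_models = math.ceil(end_size / (bracket + 1) * factor ** (bracket - rnd))
            n_epochs = math.ceil(max_epochs / factor ** (bracket - rnd))
            schedule.append((bracket, rnd, n_models, n_epochs))
    return schedule

schedule = hyperband_schedule(max_epochs=5, factor=3)
for row in schedule:
    print(row)
print("total trials:", sum(n for _, _, n, _ in schedule))
```

Under these assumptions one pass contains 10 trials, which is consistent with the `Trial 10 Complete` lines in the search output.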

In [69]:
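# Hyperband sets the epoch count of each trial itself (see 'tuner/epochs' in the best hyperparameters), so epochs=10 is overridden
content_tuner.search(features, target1, epochs=10, validation_split=0.2)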

Trial 10 Complete [00h 03m 26s]
mse: 0.69724440574646

Best mse So Far: 0.594480574131012
Total elapsed time: 00h 21m 34s


In [70]:
wording_tuner.search(features, target2, epochs=10, validation_split=0.2)

Trial 10 Complete [00h 03m 26s]
mse: 0.7264451384544373

Best mse So Far: 0.7264451384544373
Total elapsed time: 00h 20m 06s


In [71]:
# Get the optimal hyperparameters
best_content_tuner_hps=content_tuner.get_best_hyperparameters(num_trials=1)[0]

# Get the optimal hyperparameters
best_wording_tuner_hps=wording_tuner.get_best_hyperparameters(num_trials=1)[0]

In [72]:
best_content_tuner_hps.values

{'units1': 248,
 'tuner/epochs': 5,
 'tuner/initial_epoch': 0,
 'tuner/bracket': 0,
 'tuner/round': 0}

In [73]:
best_wording_tuner_hps.values

{'units1': 152,
 'tuner/epochs': 5,
 'tuner/initial_epoch': 0,
 'tuner/bracket': 0,
 'tuner/round': 0}

In [75]:
content_hp_model = content_tuner.hypermodel.build(best_content_tuner_hps)
history__1 = content_hp_model.fit(features, target1, epochs=20)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [76]:
wording_hp_model = wording_tuner.hypermodel.build(best_wording_tuner_hps)
history__2 = wording_hp_model.fit(features, target2, epochs=20)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [77]:
evaluate_on_train_content = content_hp_model.evaluate(features, target1)
evaluate_on_train_wording = wording_hp_model.evaluate(features, target2)
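
The evaluation calls above use the same rows the models were trained on, so they say little about unseen prompts. One option, sketched here with NumPy only and made-up prompt ids, is to hold out all summaries of one prompt. `holdout_by_group` is a hypothetical helper, not something this notebook defines.

```python
import numpy as np

def holdout_by_group(groups, held_out):
    """Return (train_idx, valid_idx), putting every row of `held_out` in validation."""
    groups = np.asarray(groups)
    valid_mask = groups == held_out
    return np.flatnonzero(~valid_mask), np.flatnonzero(valid_mask)

# Made-up prompt ids, one per summary row
prompt_ids = ["p1", "p1", "p2", "p3", "p2", "p1"]
train_idx, valid_idx = holdout_by_group(prompt_ids, held_out="p2")

print(train_idx, valid_idx)
```

The index arrays could then select rows of `features` and the targets, for example `features[train_idx]` for fitting and `features[valid_idx]` for evaluation.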



In [78]:
# evaluate() returns [loss, mae, mse]; the loss is MSE, so the first and last values match
print('evaluate_on_train_content', evaluate_on_train_content)
print('evaluate_on_train_wording', evaluate_on_train_wording)

evaluate_on_train_content [0.5187978744506836, 0.5582633018493652, 0.5187978744506836]
evaluate_on_train_wording [0.6495986580848694, 0.6342442631721497, 0.6495986580848694]
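
The two MSE values can be folded into a single number by taking the root of each and averaging, i.e. a mean columnwise RMSE (MCRMSE). This assumes MCRMSE is the metric of interest for the two-score setup. The inputs are the training MSEs printed above, so the result describes the fit on training data only.

```python
import math

# Training MSEs copied from the evaluate() output above
mse_content = 0.5187978744506836
mse_wording = 0.6495986580848694

rmse_content = math.sqrt(mse_content)
rmse_wording = math.sqrt(mse_wording)

# Mean columnwise RMSE over the two target columns
mcrmse = (rmse_content + rmse_wording) / 2

print(round(rmse_content, 4), round(rmse_wording, 4), round(mcrmse, 4))
```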


In [79]:
content_prediction = content_hp_model.predict(features)
wording_prediction = wording_hp_model.predict(features)



### Predict on test

In [80]:
test_pred_content = content_hp_model.predict(features_for_test)
test_pred_wording = wording_hp_model.predict(features_for_test)



## Submission

In [81]:
test_pred_content = test_pred_content.reshape(-1)
test_pred_wording = test_pred_wording.reshape(-1)

In [82]:
submission = pd.DataFrame({
    'student_id' : test['student_id'],
    'content' : test_pred_content,
    'wording' : test_pred_wording
})

In [83]:
submission.to_csv('submission.csv', index=False)
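
Before uploading, a few cheap checks on the frame can catch common mistakes. This is a sketch with assumed checks (column order, unique ids, no missing values), rebuilt from the four rows in this notebook's submission preview so that it runs on its own. `submission_check` is a stand-in for `submission`.

```python
import pandas as pd

# Values copied from the submission preview in this notebook
submission_check = pd.DataFrame({
    "student_id": ["000000ffffff", "222222cccccc", "111111eeeeee", "333333dddddd"],
    "content": [-2.383166, -2.326527, -2.355732, -2.333451],
    "wording": [-0.919307, -0.904968, -0.901406, -0.888816],
})

expected_columns = ["student_id", "content", "wording"]
assert list(submission_check.columns) == expected_columns
assert submission_check["student_id"].is_unique
assert not submission_check[["content", "wording"]].isna().any().any()

# A very small spread suggests near-constant predictions
spread = submission_check[["content", "wording"]].std()
print(spread)
```

The spread of both columns is small for these four rows, which may indicate that the models output almost the same score regardless of input. With so few rows this is only a hint.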

In [84]:
submission.head()

| | student_id | content | wording |
|---|---|---|---|
| 0 | 000000ffffff | -2.383166 | -0.919307 |
| 1 | 222222cccccc | -2.326527 | -0.904968 |
| 2 | 111111eeeeee | -2.355732 | -0.901406 |
| 3 | 333333dddddd | -2.333451 | -0.888816 |
