# Goal

<h3 style="color:blue">Assess the quality of summaries written by students</h3>
<h3 style="color:indigo">Evaluate how well a student represents the main idea and details of a source text, as well as the clarity, precision, and fluency of the language used in the summary</h3>
<h3 style="color:red">Freely & publicly available external data is <b>allowed</b>, including pre-trained models</h3>
<h3>This is a multi-output problem: each summary gets a <code>content</code> score and a <code>wording</code> score</h3>
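Because there are two targets, a single score has to combine the errors on both. The sketch below assumes the competition's mean columnwise RMSE (MCRMSE), a metric this notebook does not state. The function name `mcrmse` and the toy values are illustrative only.

```python
import math

def mcrmse(y_true, y_pred):
    """Mean columnwise RMSE for rows of (content, wording) scores."""
    n_rows = len(y_true)
    n_cols = len(y_true[0])
    rmses = []
    for col in range(n_cols):
        squared_errors = [
            (y_true[row][col] - y_pred[row][col]) ** 2 for row in range(n_rows)
        ]
        rmses.append(math.sqrt(sum(squared_errors) / n_rows))
    # Average the per-column RMSE values into one score
    return sum(rmses) / n_cols

# Toy values, not taken from the competition data.
y_true = [[0.0, 1.0], [2.0, 3.0]]
y_pred = [[0.0, 0.0], [0.0, 3.0]]
score = mcrmse(y_true, y_pred)
print(score)
```

Each column contributes its own RMSE, so a large error on one target cannot be hidden by a small error on the other.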

### Use Hugging Face Library
### Use NLTK
### Use TensorFlow

In [1]:
import warnings
warnings.filterwarnings("ignore")

In [2]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import re
import math
import subprocess
from tqdm import tqdm
import pickle

In [3]:
import tensorflow as tf

In [4]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error, explained_variance_score, median_absolute_error

In [5]:
import transformers
from transformers import AutoTokenizer, TFBertModel

In [6]:
prompts_train = pd.read_csv('/kaggle/input/commonlit-evaluate-student-summaries/prompts_train.csv')
summaries_train = pd.read_csv('/kaggle/input/commonlit-evaluate-student-summaries/summaries_train.csv')
prompts_test = pd.read_csv('/kaggle/input/commonlit-evaluate-student-summaries/prompts_test.csv')
summaries_test = pd.read_csv('/kaggle/input/commonlit-evaluate-student-summaries/summaries_test.csv')

In [7]:
train = pd.merge(prompts_train, summaries_train, on='prompt_id')
test = pd.merge(prompts_test, summaries_test, on='prompt_id')

In [8]:
train.rename(columns = {'text' : 'summary'}, inplace=True)
test.rename(columns = {'text' : 'summary'}, inplace=True)

In [9]:
train.head(2)

Unnamed: 0,prompt_id,prompt_question,prompt_title,prompt_text,student_id,summary,content,wording
0,39c16e,Summarize at least 3 elements of an ideal trag...,On Tragedy,Chapter 13 \r\nAs the sequel to what has alrea...,00791789cc1f,1 element of an ideal tragedy is that it shoul...,-0.210614,-0.471415
1,39c16e,Summarize at least 3 elements of an ideal trag...,On Tragedy,Chapter 13 \r\nAs the sequel to what has alrea...,0086ef22de8f,The three elements of an ideal tragedy are: H...,-0.970237,-0.417058


In [10]:
train['summary'][0]

'1 element of an ideal tragedy is that it should be arranged on a complex plan.  Another element of an ideal tragedy is that it should only have one main issue. The last element of an ideal tragedy is that it should have a double thread plot and an opposite catastrophe for both good and bad.'

In [11]:
columns_needed = ["prompt_text", "summary"]

In [12]:
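# Copy so that new columns are added to independent DataFrames, not views of train/test
train_data = train[columns_needed].copy()
test_data = test[columns_needed].copy()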

In [13]:
#from transformers import XLNetTokenizer, TFXLNetModel
#tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')
#model = TFXLNetModel.from_pretrained('xlnet-base-cased', return_dict=True)

#from transformers import RobertaTokenizer, TFRobertaModel
#tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
#model = TFRobertaModel.from_pretrained('roberta-base', return_dict=True)

from transformers import AutoTokenizer, TFBertModel
model = TFBertModel.from_pretrained('/kaggle/input/huggingface-bert-variants/bert-base-uncased/bert-base-uncased')
tokenizer = AutoTokenizer.from_pretrained('/kaggle/input/huggingface-bert-variants/bert-base-uncased/bert-base-uncased')

Some layers from the model checkpoint at /kaggle/input/huggingface-bert-variants/bert-base-uncased/bert-base-uncased were not used when initializing TFBertModel: ['mlm___cls', 'nsp___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at /kaggle/input/huggingface-bert-variants/bert-base-uncased/bert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


### Next time, use prepare_tf_dataset, which handles padding, collation, and batching
### and makes a Hugging Face dataset compatible with TensorFlow
####       https://huggingface.co/docs/transformers/v4.31.0/en/main_classes/model#transformers.TFPreTrainedModel.prepare_tf_dataset

In [14]:

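def vectorize_dataframe(dataframe, col):
    """Tokenize every text in `col` and return a list of input-ID tensors."""
    vectors = []
    for text in tqdm(dataframe[col].tolist()):
        # Pad or truncate each text to exactly 4000 token IDs.
        # Note: bert-base-uncased itself accepts at most 512 positions.
        text_tokens = tokenizer(text, return_tensors="tf", max_length=4000,
                                padding='max_length', truncation=True)
        # Keep only the integer token IDs; these are not embeddings.
        vectors.append(text_tokens['input_ids'])
    return vectors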
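
The effect of `padding='max_length'` and `truncation=True` can be sketched in plain Python. This is a simplified illustration, not the tokenizer's real implementation: the helper `pad_or_truncate` is hypothetical, and the actual tokenizer also keeps the closing `[SEP]` token when it truncates. In `bert-base-uncased`, ID 0 is `[PAD]`, 101 is `[CLS]`, and 102 is `[SEP]`; the middle IDs below are arbitrary.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return exactly max_length IDs plus a matching attention mask."""
    kept = token_ids[:max_length]                    # truncation
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad              # right-side padding
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask

ids, mask = pad_or_truncate([101, 7592, 2088, 102], max_length=6)
# ids  -> [101, 7592, 2088, 102, 0, 0]
# mask -> [1, 1, 1, 1, 0, 0]
```

The function above stores only `input_ids`, so the information about which positions are padding is discarded.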
    

In [15]:
train_data['prompt_text_embedded'] = vectorize_dataframe(train_data, 'prompt_text')
train_data['summary_embedded'] = vectorize_dataframe(train_data, 'summary')

100%|██████████| 7165/7165 [00:28<00:00, 254.29it/s]
100%|██████████| 7165/7165 [00:12<00:00, 553.95it/s]


In [16]:
test_data['prompt_text_embedded'] = vectorize_dataframe(test_data, 'prompt_text')
test_data['summary_embedded'] = vectorize_dataframe(test_data, 'summary')

100%|██████████| 4/4 [00:00<00:00, 513.39it/s]
100%|██████████| 4/4 [00:00<00:00, 646.35it/s]


In [17]:
traning_set = train_data[['prompt_text_embedded', 'summary_embedded']]
testing_set = test_data[['prompt_text_embedded', 'summary_embedded']]

### Take average of embeddings  [Not required, just checking]
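
The notebook shows no code for this step. One common way to average embeddings is masked mean pooling, sketched below with NumPy. The function name `mean_pool` and the toy arrays are assumptions; in a real run `token_embeddings` would be something like the `last_hidden_state` output of the BERT model.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sequence, ignoring padded positions."""
    # (batch, seq) -> (batch, seq, 1) so the mask broadcasts over the hidden size
    mask = attention_mask[..., np.newaxis].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)     # (batch, hidden)
    counts = np.clip(mask.sum(axis=1), 1.0, None)      # avoid division by zero
    return summed / counts

# Toy batch: 1 sequence, 3 positions (the last one is padding), hidden size 2.
embeddings = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])
pooled = mean_pool(embeddings, mask)   # shape (1, 2)
print(pooled)
```

The padded position is excluded, so its large values do not affect the average.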

In [18]:
target1 = np.array(train['content'])
target1 = target1.astype('float32')

target2 = np.array(train['wording'])
target2 = target2.astype('float32')

#target = (target1, target2)

In [19]:
def convert_tensor_to_numpy(tensor):
    # Convert a TensorFlow tensor of token IDs to an int64 NumPy array
    return np.array(tensor, dtype='int64')

traning_set = traning_set.applymap(convert_tensor_to_numpy)
testing_set = testing_set.applymap(convert_tensor_to_numpy)

In [20]:
def prepare_dataset(dataset):
    # Flatten the nested arrays in the DataFrame
    dataset['prompt_text_embedded'] = dataset['prompt_text_embedded'].apply(lambda x: x.flatten())
    dataset['summary_embedded'] = dataset['summary_embedded'].apply(lambda x: x.flatten())
    
    feature1 = np.array(dataset['prompt_text_embedded'].tolist())
    feature2 = np.array(dataset['summary_embedded'].tolist())
    
    features = np.concatenate((feature1, feature2), axis=1)
    
    return features
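
The shapes produced by `prepare_dataset` can be traced with toy arrays. This sketch uses a small `max_length` as a stand-in for the 4000 used above; the variable names mirror the function but the data is made up.

```python
import numpy as np

max_length = 4   # stands in for the 4000 used in vectorize_dataframe
n_rows = 3

# Each cell holds a (1, max_length) array: the tokenizer returns one row per text.
prompt_ids = [np.ones((1, max_length), dtype="int64") for _ in range(n_rows)]
summary_ids = [np.zeros((1, max_length), dtype="int64") for _ in range(n_rows)]

feature1 = np.array([a.flatten() for a in prompt_ids])    # (n_rows, max_length)
feature2 = np.array([a.flatten() for a in summary_ids])   # (n_rows, max_length)
features = np.concatenate((feature1, feature2), axis=1)   # (n_rows, 2 * max_length)

print(features.shape)
```

With `max_length=4000`, every row therefore carries 8000 integers: the prompt IDs followed by the summary IDs.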

In [21]:
features = prepare_dataset(traning_set)

In [22]:
features_for_test = prepare_dataset(testing_set)

In [23]:
#X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)

In [24]:
from tensorflow.keras.layers import Dense, Input, Flatten


In [25]:
# Define the model architecture
model_content = tf.keras.Sequential([
      tf.keras.layers.Dense(256, activation='linear'),
      tf.keras.layers.Dense(128, activation='linear'),
      tf.keras.layers.Dense(64, activation='linear'),
      tf.keras.layers.Dense(32, activation='linear'),
      tf.keras.layers.Dense(1, activation='linear')
])

model_wording = tf.keras.Sequential([
      tf.keras.layers.Dense(256, activation='linear'),
      tf.keras.layers.Dense(128, activation='linear'),
      tf.keras.layers.Dense(64, activation='linear'),
      tf.keras.layers.Dense(32, activation='linear'),
      tf.keras.layers.Dense(1, activation='linear')
])
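
With `activation='linear'` in every layer, the five `Dense` layers in each model compose into a single affine map, so the extra depth adds no expressive power. A small NumPy check of that property, using toy shapes and random weights rather than the notebook's trained weights:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))   # 5 samples, 8 features

# Two stacked linear layers: h = x @ W1 + b1, y = h @ W2 + b2
W1, b1 = rng.normal(size=(8, 4)), rng.normal(size=4)
W2, b2 = rng.normal(size=(4, 1)), rng.normal(size=1)
stacked = (x @ W1 + b1) @ W2 + b2

# The same mapping written as one linear layer
W, b = W1 @ W2, b1 @ W2 + b2
single = x @ W + b

print(np.allclose(stacked, single))
```

Adding a nonlinearity such as `relu` to the hidden layers would break this equivalence.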

In [26]:
model_content.compile(optimizer='adam', loss='mean_squared_error', metrics=['mae', 'mse'])
model_wording.compile(optimizer='adam', loss='mean_squared_error', metrics=['mae', 'mse'])

In [37]:
# Note: `model` is the TFBertModel loaded in cell 13, not model_content or model_wording
model.compile(optimizer='adam', loss='mean_squared_error', metrics=['mae', 'mse'])


In [40]:
model.fit(features, target1, epochs=1)

ResourceExhaustedError: Graph execution error:

Detected at node 'tf_bert_model/bert/encoder/layer_._0/attention/self/MatMul' defined at (most recent call last):
    File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
      return _run_code(code, main_globals, None,
    File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
      exec(code, run_globals)
    File "/opt/conda/lib/python3.10/site-packages/ipykernel_launcher.py", line 17, in <module>
      app.launch_new_instance()
    File "/opt/conda/lib/python3.10/site-packages/traitlets/config/application.py", line 1043, in launch_instance
      app.start()
    File "/opt/conda/lib/python3.10/site-packages/ipykernel/kernelapp.py", line 728, in start
      self.io_loop.start()
    File "/opt/conda/lib/python3.10/site-packages/tornado/platform/asyncio.py", line 195, in start
      self.asyncio_loop.run_forever()
    File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
      self._run_once()
    File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
      handle._run()
    File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
      self._context.run(self._callback, *self._args)
    File "/opt/conda/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 513, in dispatch_queue
      await self.process_one()
    File "/opt/conda/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 502, in process_one
      await dispatch(*args)
    File "/opt/conda/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 409, in dispatch_shell
      await result
    File "/opt/conda/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 729, in execute_request
      reply_content = await reply_content
    File "/opt/conda/lib/python3.10/site-packages/ipykernel/ipkernel.py", line 422, in do_execute
      res = shell.run_cell(
    File "/opt/conda/lib/python3.10/site-packages/ipykernel/zmqshell.py", line 540, in run_cell
      return super().run_cell(*args, **kwargs)
    File "/opt/conda/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3009, in run_cell
      result = self._run_cell(
    File "/opt/conda/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3064, in _run_cell
      result = runner(coro)
    File "/opt/conda/lib/python3.10/site-packages/IPython/core/async_helpers.py", line 129, in _pseudo_sync_runner
      coro.send(None)
    File "/opt/conda/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3269, in run_cell_async
      has_raised = await self.run_ast_nodes(code_ast.body, cell_name,
    File "/opt/conda/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3448, in run_ast_nodes
      if await self.run_code(code, result, async_=asy):
    File "/opt/conda/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3508, in run_code
      exec(code_obj, self.user_global_ns, self.user_ns)
    File "/tmp/ipykernel_923/2014875107.py", line 1, in <module>
      model.fit(features, target1, epochs=56)
    File "/opt/conda/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/opt/conda/lib/python3.10/site-packages/keras/engine/training.py", line 1685, in fit
      tmp_logs = self.train_function(iterator)
    File "/opt/conda/lib/python3.10/site-packages/keras/engine/training.py", line 1284, in train_function
      return step_function(self, iterator)
    File "/opt/conda/lib/python3.10/site-packages/keras/engine/training.py", line 1268, in step_function
      outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "/opt/conda/lib/python3.10/site-packages/keras/engine/training.py", line 1249, in run_step
      outputs = model.train_step(data)
    File "/opt/conda/lib/python3.10/site-packages/transformers/modeling_tf_utils.py", line 1658, in train_step
      y_pred = self(x, training=True)
    File "/opt/conda/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/opt/conda/lib/python3.10/site-packages/keras/engine/training.py", line 558, in __call__
      return super().__call__(*args, **kwargs)
    File "/opt/conda/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/opt/conda/lib/python3.10/site-packages/keras/engine/base_layer.py", line 1145, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "/opt/conda/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 96, in error_handler
      return fn(*args, **kwargs)
    File "/opt/conda/lib/python3.10/site-packages/transformers/modeling_tf_utils.py", line 1061, in run_call_with_unpacked_inputs
      # if the new size is greater than the old one, we extend the current embeddings with a padding until getting new size
    File "/opt/conda/lib/python3.10/site-packages/transformers/models/bert/modeling_tf_bert.py", line 1088, in call
      outputs = self.bert(
    File "/opt/conda/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/opt/conda/lib/python3.10/site-packages/keras/engine/base_layer.py", line 1145, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "/opt/conda/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 96, in error_handler
      return fn(*args, **kwargs)
    File "/opt/conda/lib/python3.10/site-packages/transformers/modeling_tf_utils.py", line 1061, in run_call_with_unpacked_inputs
      # if the new size is greater than the old one, we extend the current embeddings with a padding until getting new size
    File "/opt/conda/lib/python3.10/site-packages/transformers/models/bert/modeling_tf_bert.py", line 862, in call
      encoder_outputs = self.encoder(
    File "/opt/conda/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/opt/conda/lib/python3.10/site-packages/keras/engine/base_layer.py", line 1145, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "/opt/conda/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 96, in error_handler
      return fn(*args, **kwargs)
    File "/opt/conda/lib/python3.10/site-packages/transformers/models/bert/modeling_tf_bert.py", line 548, in call
      for i, layer_module in enumerate(self.layer):
    File "/opt/conda/lib/python3.10/site-packages/transformers/models/bert/modeling_tf_bert.py", line 554, in call
      layer_outputs = layer_module(
    File "/opt/conda/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/opt/conda/lib/python3.10/site-packages/keras/engine/base_layer.py", line 1145, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "/opt/conda/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 96, in error_handler
      return fn(*args, **kwargs)
    File "/opt/conda/lib/python3.10/site-packages/transformers/models/bert/modeling_tf_bert.py", line 464, in call
      self_attention_outputs = self.attention(
    File "/opt/conda/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/opt/conda/lib/python3.10/site-packages/keras/engine/base_layer.py", line 1145, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "/opt/conda/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 96, in error_handler
      return fn(*args, **kwargs)
    File "/opt/conda/lib/python3.10/site-packages/transformers/models/bert/modeling_tf_bert.py", line 380, in call
      self_outputs = self.self_attention(
    File "/opt/conda/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/opt/conda/lib/python3.10/site-packages/keras/engine/base_layer.py", line 1145, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "/opt/conda/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 96, in error_handler
      return fn(*args, **kwargs)
    File "/opt/conda/lib/python3.10/site-packages/transformers/models/bert/modeling_tf_bert.py", line 310, in call
      attention_scores = tf.matmul(query_layer, key_layer, transpose_b=True)
Node: 'tf_bert_model/bert/encoder/layer_._0/attention/self/MatMul'
OOM when allocating tensor with shape[32,12,8000,8000] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[{{node tf_bert_model/bert/encoder/layer_._0/attention/self/MatMul}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
 [Op:__inference_train_function_170644]
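
The size of the tensor named in the OOM message can be worked out directly. In `shape[32,12,8000,8000]`, 32 is the default Keras batch size, 12 is the number of attention heads in `bert-base`, and 8000 is the concatenated sequence length (4000 prompt IDs plus 4000 summary IDs). A back-of-the-envelope sketch for one layer's attention scores in float32:

```python
batch_size, num_heads, seq_len = 32, 12, 8000   # from shape[32,12,8000,8000]
bytes_per_float32 = 4

n_elements = batch_size * num_heads * seq_len * seq_len
n_bytes = n_elements * bytes_per_float32
n_gib = n_bytes / 2**30

print(f"{n_elements:,} elements, {n_gib:.1f} GiB")
# prints: 24,576,000,000 elements, 91.6 GiB
```

Attention memory grows with the square of the sequence length, so halving the length cuts this tensor to a quarter of its size.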

In [27]:
# Train your model using model.fit()
history1 = model_content.fit(features, target1, epochs=56, batch_size=50, validation_split=0.2)

Epoch 1/56
Epoch 2/56
Epoch 3/56
Epoch 4/56
Epoch 5/56
Epoch 6/56
Epoch 7/56
Epoch 8/56
Epoch 9/56
Epoch 10/56
Epoch 11/56
Epoch 12/56
Epoch 13/56
Epoch 14/56
Epoch 15/56
Epoch 16/56
Epoch 17/56
Epoch 18/56
Epoch 19/56
Epoch 20/56
Epoch 21/56
Epoch 22/56
Epoch 23/56
Epoch 24/56
Epoch 25/56
Epoch 26/56
Epoch 27/56
Epoch 28/56
Epoch 29/56
Epoch 30/56
Epoch 31/56
Epoch 32/56
Epoch 33/56
Epoch 34/56
Epoch 35/56
Epoch 36/56
Epoch 37/56
Epoch 38/56
Epoch 39/56
Epoch 40/56
Epoch 41/56
Epoch 42/56
Epoch 43/56
Epoch 44/56
Epoch 45/56
Epoch 46/56
Epoch 47/56
Epoch 48/56
Epoch 49/56
Epoch 50/56
Epoch 51/56
Epoch 52/56
Epoch 53/56
Epoch 54/56
Epoch 55/56
Epoch 56/56


In [28]:
history2 = model_wording.fit(features, target2, epochs=56, batch_size=50, validation_split=0.2)

Epoch 1/56
Epoch 2/56
Epoch 3/56
Epoch 4/56
Epoch 5/56
Epoch 6/56
Epoch 7/56
Epoch 8/56
Epoch 9/56
Epoch 10/56
Epoch 11/56
Epoch 12/56
Epoch 13/56
Epoch 14/56
Epoch 15/56
Epoch 16/56
Epoch 17/56
Epoch 18/56
Epoch 19/56
Epoch 20/56
Epoch 21/56
Epoch 22/56
Epoch 23/56
Epoch 24/56
Epoch 25/56
Epoch 26/56
Epoch 27/56
Epoch 28/56
Epoch 29/56
Epoch 30/56
Epoch 31/56
Epoch 32/56
Epoch 33/56
Epoch 34/56
Epoch 35/56
Epoch 36/56
Epoch 37/56
Epoch 38/56
Epoch 39/56
Epoch 40/56
Epoch 41/56
Epoch 42/56
Epoch 43/56
Epoch 44/56
Epoch 45/56
Epoch 46/56
Epoch 47/56
Epoch 48/56
Epoch 49/56
Epoch 50/56
Epoch 51/56
Epoch 52/56
Epoch 53/56
Epoch 54/56
Epoch 55/56
Epoch 56/56


In [29]:
evaluate_on_train_content = model_content.evaluate(features, target1)
evaluate_on_train_wording = model_wording.evaluate(features, target2)



In [30]:
print('evaluate_on_train_content', evaluate_on_train_content)
print('evaluate_on_train_wording', evaluate_on_train_wording)

evaluate_on_train_content [23611.341796875, 68.01499938964844, 23611.341796875]
evaluate_on_train_wording [1003.4942016601562, 12.359598159790039, 1003.4942016601562]
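
Each list is ordered as `[loss, mae, mse]`, matching `metrics=['mae', 'mse']` in the compile call. Taking the square root of the MSE gives an error in the same units as the targets. A short sketch using the printed values:

```python
import math

# [loss, mae, mse] as printed above; loss equals mse because the loss is mean squared error.
evaluate_on_train_content = [23611.341796875, 68.01499938964844, 23611.341796875]
evaluate_on_train_wording = [1003.4942016601562, 12.359598159790039, 1003.4942016601562]

rmse_content = math.sqrt(evaluate_on_train_content[2])
rmse_wording = math.sqrt(evaluate_on_train_wording[2])

print(f"content RMSE: {rmse_content:.1f}, wording RMSE: {rmse_wording:.1f}")
# prints: content RMSE: 153.7, wording RMSE: 31.7
```

The two target values shown in `train.head(2)` are below 1 in magnitude, so errors of this size suggest the models have not learned a useful mapping from raw token IDs.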


In [31]:
content_prediction = model_content.predict(features)
wording_prediction = model_wording.predict(features)



### Predict on test

In [32]:
test_pred_content = model_content.predict(features_for_test)
test_pred_wording = model_wording.predict(features_for_test)



## Submission

In [33]:
test_pred_content = test_pred_content.reshape(-1)
test_pred_wording = test_pred_wording.reshape(-1)

In [34]:
submission = pd.DataFrame({
    'student_id' : test['student_id'],
    'content' : test_pred_content,
    'wording' : test_pred_wording
})

In [35]:
submission.to_csv('submission.csv', index=False)

In [36]:
submission.head()

Unnamed: 0,student_id,content,wording
0,000000ffffff,7.110975,-2.385606
1,222222cccccc,7.110075,-2.385587
2,111111eeeeee,7.110529,-2.385564
3,333333dddddd,7.109636,-2.385571
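
The four predictions are almost identical, which suggests the models output nearly the same value regardless of the input. A quick spread check, using the values copied from the table above:

```python
# Predictions copied from submission.head() above.
content = [7.110975, 7.110075, 7.110529, 7.109636]
wording = [-2.385606, -2.385587, -2.385564, -2.385571]

# Range (max minus min) of each prediction column
content_spread = max(content) - min(content)
wording_spread = max(wording) - min(wording)

print(content_spread, wording_spread)
```

Both ranges are tiny compared with the gap between the two `content` targets shown in `train.head(2)` (-0.210614 and -0.970237), so a check like this is worth running before submitting.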
