# Goal

<h3 style="color:blue">Assess the quality of summaries written by students</h3>
<h3 style="color:indigo">Evaluate how well a student represents the main idea and details of a source text, as well as the clarity, precision, and fluency of the language used in the summary</h3>
<h3 style="color:red">Freely & publicly available external data is <b>allowed</b>, including pre-trained models</h3>
<h3>This is a multi-output problem: each summary gets a <code>content</code> score and a <code>wording</code> score</h3>

### Use Hugging Face Library
### Use NLTK
### Use TensorFlow

In [3]:
import warnings
warnings.filterwarnings("ignore")

In [4]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import re
import math
import subprocess
from tqdm import tqdm


In [5]:
import tensorflow as tf

In [6]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error, explained_variance_score, median_absolute_error

In [7]:
import transformers
from transformers import AutoTokenizer, TFBertModel

In [8]:
prompts_train = pd.read_csv('/kaggle/input/commonlit-evaluate-student-summaries/prompts_train.csv')
summaries_train = pd.read_csv('/kaggle/input/commonlit-evaluate-student-summaries/summaries_train.csv')
prompts_test = pd.read_csv('/kaggle/input/commonlit-evaluate-student-summaries/prompts_test.csv')
summaries_test = pd.read_csv('/kaggle/input/commonlit-evaluate-student-summaries/summaries_test.csv')

In [9]:
train = pd.merge(prompts_train, summaries_train, on='prompt_id')
test = pd.merge(prompts_test, summaries_test, on='prompt_id')

In [10]:
train.rename(columns = {'text' : 'summary'}, inplace=True)
test.rename(columns = {'text' : 'summary'}, inplace=True)

In [11]:
train.head(2)

| | prompt_id | prompt_question | prompt_title | prompt_text | student_id | summary | content | wording |
|---|---|---|---|---|---|---|---|---|
| 0 | 39c16e | Summarize at least 3 elements of an ideal trag... | On Tragedy | Chapter 13 \r\nAs the sequel to what has alrea... | 00791789cc1f | 1 element of an ideal tragedy is that it shoul... | -0.210614 | -0.471415 |
| 1 | 39c16e | Summarize at least 3 elements of an ideal trag... | On Tragedy | Chapter 13 \r\nAs the sequel to what has alrea... | 0086ef22de8f | The three elements of an ideal tragedy are: H... | -0.970237 | -0.417058 |


In [12]:
train['summary'][0]

'1 element of an ideal tragedy is that it should be arranged on a complex plan.  Another element of an ideal tragedy is that it should only have one main issue. The last element of an ideal tragedy is that it should have a double thread plot and an opposite catastrophe for both good and bad.'

In [13]:
columns_needed = ["prompt_text", "summary"]

In [14]:
# Take an explicit copy so that adding columns later does not write into a view of `train`.
train_data = train[columns_needed].copy()

In [15]:
train_data

| | prompt_text | summary |
|---|---|---|
| 0 | Chapter 13 \r\nAs the sequel to what has alrea... | 1 element of an ideal tragedy is that it shoul... |
| 1 | Chapter 13 \r\nAs the sequel to what has alrea... | The three elements of an ideal tragedy are: H... |
| 2 | Chapter 13 \r\nAs the sequel to what has alrea... | Aristotle states that an ideal tragedy should ... |
| 3 | Chapter 13 \r\nAs the sequel to what has alrea... | One element of an Ideal tragedy is having a co... |
| 4 | Chapter 13 \r\nAs the sequel to what has alrea... | The 3 ideal of tragedy is how complex you need... |
| ... | ... | ... |
| 7160 | With one member trimming beef in a cannery, an... | In paragraph two, they would use pickle meat a... |
| 7161 | With one member trimming beef in a cannery, an... | in the first paragraph it says "either can it... |
| 7162 | With one member trimming beef in a cannery, an... | They would have piles of filthy meat on the fl... |
| 7163 | With one member trimming beef in a cannery, an... | They used all sorts of chemical concoctions to... |


In [16]:
#from transformers import XLNetTokenizer, TFXLNetModel
#tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')
#model = TFXLNetModel.from_pretrained('xlnet-base-cased', return_dict=True)

#from transformers import RobertaTokenizer, TFRobertaModel
#tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
#model = TFRobertaModel.from_pretrained('roberta-base', return_dict=True)

from transformers import AutoTokenizer, TFBertModel
model = TFBertModel.from_pretrained('bert-base-uncased')
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

Downloading (…)lve/main/config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions w

Downloading (…)okenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

### Next time, use `prepare_tf_dataset`, which wraps an already tokenized Hugging Face dataset as a `tf.data.Dataset`
### and handles batching and collation (padding), so the data is directly compatible with TensorFlow
#### https://huggingface.co/docs/transformers/v4.31.0/en/main_classes/model#transformers.TFPreTrainedModel.prepare_tf_dataset

In [19]:

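def vectorize_dataframe(dataframe, col):
    """Return one BERT pooled-output vector of shape (768,) per row of dataframe[col]."""
    vectors = []
    for text in tqdm(dataframe[col].tolist()):
        # Pad or truncate every text to BERT's 512-token limit.
        text_tokens = tokenizer(text, return_tensors="tf", padding='max_length',
                                truncation=True, max_length=512)
        # Inference only: training=False keeps dropout disabled.
        output = model(text_tokens, training=False)
        # pooler_output has shape (1, 768); store it as a flat NumPy array
        # so the column can be pickled and stacked into a feature matrix later.
        vectors.append(output.pooler_output.numpy()[0])
    return vectors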
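
The function above keeps BERT's `pooler_output`. A common alternative is to average the token vectors in `last_hidden_state` while ignoring padding positions, which matters here because every text is padded to the maximum length. The following is a minimal NumPy sketch of that masked mean, not the method used in this notebook; `masked_mean_pool` and the toy arrays are hypothetical names introduced only for illustration.

```python
import numpy as np

def masked_mean_pool(last_hidden_state, attention_mask):
    """Average token vectors over real (non-padding) positions.

    last_hidden_state: array-like of shape (batch, seq_len, hidden)
    attention_mask:    array-like of shape (batch, seq_len), 1 = token, 0 = padding
    """
    hidden = np.asarray(last_hidden_state, dtype=np.float32)
    mask = np.asarray(attention_mask, dtype=np.float32)[..., None]  # (batch, seq_len, 1)
    summed = (hidden * mask).sum(axis=1)              # padding rows contribute zero
    counts = np.clip(mask.sum(axis=1), 1e-9, None)    # avoid division by zero
    return summed / counts

# Toy input: one sequence with two real tokens and one padding position.
toy_hidden = [[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]]
toy_mask = [[1, 1, 0]]
toy_pooled = masked_mean_pool(toy_hidden, toy_mask)
```

With the real model, the inputs would be `output.last_hidden_state` and `text_tokens["attention_mask"]`, both of which convert to NumPy arrays.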
    

In [20]:
train_data['prompt_text_embeddings'] = vectorize_dataframe(train_data, 'prompt_text')

100%|██████████| 7165/7165 [23:17<00:00,  5.13it/s]
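
The run above calls the model once per row, although the merged frame repeats the same `prompt_text` for every summary that belongs to a prompt. One way to cut the runtime is to embed each distinct text once and reuse the result. The sketch below assumes a hypothetical `embed_fn` that maps one string to one 1-D vector (for example, a wrapper around the tokenizer and model calls above); `embed_unique`, `fake_embed`, and the toy frame are illustrative names only.

```python
import numpy as np
import pandas as pd

def embed_unique(dataframe, col, embed_fn):
    """Embed each distinct text in `col` once and return an (n_rows, dim) matrix."""
    unique_texts = dataframe[col].drop_duplicates().tolist()
    # One model call per distinct text instead of one per row.
    cache = {text: np.asarray(embed_fn(text)).reshape(-1) for text in unique_texts}
    # Look up the cached vector for every row, in the original row order.
    return np.vstack([cache[text] for text in dataframe[col]])

# Toy stand-in for the model: records its calls and returns a constant vector.
calls = []
def fake_embed(text):
    calls.append(text)
    return np.full(3, float(len(text)))

toy = pd.DataFrame({"prompt_text": ["aa", "aa", "bbbb", "aa"]})
toy_embeddings = embed_unique(toy, "prompt_text", fake_embed)
```

In the toy example the stand-in model is called twice for four rows; on the real data the saving depends on how many distinct prompts the frame contains.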


In [21]:
import pickle
with open("BERT_prompt_text_embeddings.pkl", "wb") as file:
    pickle.dump(train_data['prompt_text_embeddings'], file)

In [22]:
train_data['prompt_text_embeddings'].to_csv('BERT_prompt_text_embeddings.csv', index=False)
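
Writing a column of vectors to CSV stores only the string representation of each vector, which is awkward to parse back into numbers. A more reliable option is to stack the vectors into one matrix and save it in NumPy's binary format. The sketch below assumes every vector has the same length; `save_embeddings` and the toy file name are hypothetical.

```python
import os
import tempfile
import numpy as np

def save_embeddings(vectors, path):
    """Stack per-row vectors into an (n_rows, dim) matrix and save it as .npy."""
    # reshape(1, -1) accepts both flat (dim,) vectors and (1, dim) model outputs.
    matrix = np.vstack([np.asarray(v).reshape(1, -1) for v in vectors])
    np.save(path, matrix)
    return matrix.shape

# Toy vectors in the two shapes the notebook might produce.
toy_vectors = [np.ones((1, 4)), np.zeros(4)]
toy_path = os.path.join(tempfile.mkdtemp(), "toy_embeddings.npy")
saved_shape = save_embeddings(toy_vectors, toy_path)
restored = np.load(toy_path)
```

The same call could take `train_data['prompt_text_embeddings']` as `vectors`; `np.load` then returns the full feature matrix in one step.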

In [None]:
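train_sentence_embeddings = np.vstack([np.asarray(v).reshape(1, -1) for v in train_data['prompt_text_embeddings']])  # assumption: this name was never defined above, so the prompt-text embeddings are used as features
X_train, X_test, y_train, y_test = train_test_split(train_sentence_embeddings, train['content'], test_size=0.2, random_state=42)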
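
Because the task has two targets (`content` and `wording`), a held-out evaluation needs a score that covers both columns. The sketch below computes the mean columnwise RMSE (MCRMSE), on the assumption that this is the metric the competition uses; `mcrmse` and the toy arrays are hypothetical names, and no model from this notebook is involved.

```python
import numpy as np

def mcrmse(y_true, y_pred):
    """Mean columnwise RMSE: RMSE per target column, averaged over columns."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    col_rmse = np.sqrt(np.mean((y_true - y_pred) ** 2, axis=0))  # one RMSE per column
    return float(np.mean(col_rmse))

# Toy example with two targets (columns) and two samples (rows).
toy_true = [[0.0, 0.0], [0.0, 0.0]]
toy_pred = [[3.0, 0.0], [3.0, 4.0]]
toy_score = mcrmse(toy_true, toy_pred)
```

To score both targets, `y_true` would be the held-out `train[['content', 'wording']]` values and `y_pred` a two-column prediction array.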
