# Project 15: Metaphor detection in poetry

**Toivo Xiong, Joonas Tapaninaho**

This project explores the detection of metaphors in poetry using natural language processing, aiming to
distinguish figurative and non-figurative language.

We shall consider the metaphor using FrameBERT whose implementation is available in GitHub -
liyucheng09/MetaphorFrame: FrameBERT: Conceptual Metaphor Detection with Frame Embedding
Learning. Presented at EACL 2023. as one of the state-of-the-art models in this field, and few dataset in the
field. https://github.com/liyucheng09/MetaphorFrame

Initially, we consider the verb metaphor as a classification task.

In [1]:
# Imports
import nltk
import spacy
import pandas as pd
from bs4 import BeautifulSoup

import numpy
import requests
import sklearn
import scipy
import torch
import torchvision
import tqdm
import transformers

from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords


In [2]:
!pip install boto3
!pip install datasets

Collecting boto3
  Downloading boto3-1.35.56-py3-none-any.whl.metadata (6.7 kB)
Collecting botocore<1.36.0,>=1.35.56 (from boto3)
  Downloading botocore-1.35.56-py3-none-any.whl.metadata (5.7 kB)
Collecting jmespath<2.0.0,>=0.7.1 (from boto3)
  Downloading jmespath-1.0.1-py3-none-any.whl.metadata (7.6 kB)
Collecting s3transfer<0.11.0,>=0.10.0 (from boto3)
  Downloading s3transfer-0.10.3-py3-none-any.whl.metadata (1.7 kB)
Downloading boto3-1.35.56-py3-none-any.whl (139 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m139.2/139.2 kB[0m [31m4.4 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading botocore-1.35.56-py3-none-any.whl (12.7 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m12.7/12.7 MB[0m [31m63.5 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading jmespath-1.0.1-py3-none-any.whl (20 kB)
Downloading s3transfer-0.10.3-py3-none-any.whl (82 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m82.6/82.6 kB[0m [31m5.4 MB/s[0m eta [36m0:0

**1. Consider the verb metaphor classification highlighted in metaphor/verbs/README.md at master ·
EducationalTestingService/metaphor · GitHub, utilizing the Amsterdam Metaphor corpus. Suggest a
script that uses NLTK library to tokenize the dataset, extract the individual tokens, sentences,
vocabulary, total number of tokens, average number of token per sentence. Then use space Spacy
named-entity tagger to identify the named-entities in the corpus and the average number of named-
entity per tag. Summarize these statistical data in a table.**


In [3]:
# Downloading data set

from zipfile import ZipFile
from io import BytesIO

VUA_corpus_url = 'https://web.archive.org/web/20151023150541/http://ota.ox.ac.uk/text/2541.zip'

r = requests.get(VUA_corpus_url)

with ZipFile(BytesIO(r.content), 'r') as zip:
  zip.printdir()
  file = zip.extract('2541/VUAMC.xml','r')

with open(file) as f:
  content = f.read()

File Name                                             Modified             Size
2541/VUAMC.odd                                 2013-10-04 18:18:46        34535
2541/VUAMC.rnc                                 2013-10-04 18:18:46        54689
2541/VUAMC.rng                                 2013-10-04 18:18:46       132900
2541/VUAMC.xml                                 2013-10-04 18:18:46     16820946
2541.xml                                       2015-04-08 15:12:20         8092


In [4]:
soup = BeautifulSoup(content, 'xml')

# Extract all sentences in <s> tags
sentences = [" ".join(sentence.get_text().split()) for sentence in soup.find_all('s')]

print(sentences[0:10])

['Latest corporate unbundler reveals laid-back approach : Roland Franklin , who is leading a 697m pound break-up bid for DRG , talks to Frank Kane', 'By FRANK KANE', 'IT SEEMS that Roland Franklin , the latest unbundler to appear in the UK , has made a fatal error in the preparation of his £697m break-up bid for stationery and packaging group DRG .', "He has not properly investigated the target 's dining facilities .", 'The 63-year-old head of Pembridge Investments , through which the bid is being mounted says , ‘ rule number one in this business is : the more luxurious the luncheon rooms at headquarters , the more inefficient the business ’ .', 'If he had taken his own rule seriously , he would have found out that DRG has a very modest self-service canteen at its Bristol head office .', 'There are other things he has , on his own admission , not fully investigated , like the value of the DRG properties , or which part of the DRG business he would keep after the break up .', 'When the 

In [5]:
merged_sentences = ','.join(sentences)

counts_sentence = len(sentences)

counts_sentence

16202

**Tokenizing:**

In [6]:
# Downloads
nltk.download('punkt')
nlp = spacy.load("en_core_web_sm")

stopwords = nltk.download('stopwords')
stopwords = nltk.corpus.stopwords.words('english')
chars = [',', '.', '´', '`', ')', '(', '-', ',', ':', '’', '‘', '—', ';']

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [7]:
all_tokens = []
tokenize_sentences = []
tokens_sum = 0
for s in sentences:
  tokens = word_tokenize(s)
  tokens = [token.lower() for token in tokens if token.lower() not in stopwords and token not in chars]
  tokens_sum += len(tokens)
  for token in tokens:
    if token not in all_tokens:
      all_tokens.append(token)
  tokenize_sentences.append(tokens)


In [8]:
avg_tokens_per_sentence = tokens_sum / len(tokenize_sentences)
unique_tokens = len(all_tokens)

**Extracting Entities:**

In [9]:
entities = []
unique_entities = []
tag_count = {}

for s in tokenize_sentences:
  s_nlp = nlp(' '.join(s))
  for ent in s_nlp.ents:
    string = str(ent.text) + ' ' + str(ent.label_)
    entities.append(string)
    if ent.text not in unique_entities:
      unique_entities.append(ent.text)
    if tag_count.get(ent.label_):
      tag_count[ent.label_] = tag_count.get(ent.label_) + 1
    else:
      tag_count[ent.label_] = 1

In [10]:
total_num_entities = len(entities)
num_unique_entities = len(unique_entities)

In [11]:
# Calculating diff entity popularity

for tag in tag_count.keys():
  val = tag_count.get(tag)
  print(f'{tag} --> {(val/len(entities)) * 100} %')

ORG --> 7.478376580172988 %
QUANTITY --> 1.9693945442448437 %
PERSON --> 21.144377910844977 %
CARDINAL --> 23.446440452428476 %
DATE --> 21.35728542914172 %
MONEY --> 2.1290751829673984 %
GPE --> 7.411842980705257 %
ORDINAL --> 4.07185628742515 %
NORP --> 5.349301397205589 %
LAW --> 0.3060545575515636 %
LOC --> 0.7052561543579507 %
TIME --> 3.992015968063872 %
LANGUAGE --> 0.2528276779773786 %
PRODUCT --> 0.10645375914836992 %
EVENT --> 0.05322687957418496 %
FAC --> 0.11976047904191617 %
PERCENT --> 0.09314703925482369 %
WORK_OF_ART --> 0.01330671989354624 %


In [12]:
from prettytable import PrettyTable

In [13]:
myTable = PrettyTable(["Measurement", "Result"])

# Add rows
myTable.add_row(["Sentences count", counts_sentence])
myTable.add_row(["Total number of tokens", tokens_sum])
myTable.add_row(["Unique tokens", unique_tokens])
myTable.add_row(["Avg token per sentence", f"{avg_tokens_per_sentence:.2f}"])
myTable.add_row(["Entity measurement", "----"])
myTable.add_row(["Total number of Enitities", total_num_entities])
myTable.add_row(["Unique Entities", num_unique_entities])
myTable.add_row(["Tag % share of all tags", "----"])

for tag in tag_count.keys():
  val = tag_count.get(tag)
  myTable.add_row([tag, f"{(val/len(entities)):.2%}"])

print(myTable)

+---------------------------+--------+
|        Measurement        | Result |
+---------------------------+--------+
|      Sentences count      | 16202  |
|   Total number of tokens  | 113765 |
|       Unique tokens       | 17883  |
|   Avg token per sentence  |  7.02  |
|     Entity measurement    |  ----  |
| Total number of Enitities |  7515  |
|      Unique Entities      |  3095  |
|  Tag % share of all tags  |  ----  |
|            ORG            | 7.48%  |
|          QUANTITY         | 1.97%  |
|           PERSON          | 21.14% |
|          CARDINAL         | 23.45% |
|            DATE           | 21.36% |
|           MONEY           | 2.13%  |
|            GPE            | 7.41%  |
|          ORDINAL          | 4.07%  |
|            NORP           | 5.35%  |
|            LAW            | 0.31%  |
|            LOC            | 0.71%  |
|            TIME           | 3.99%  |
|          LANGUAGE         | 0.25%  |
|          PRODUCT          | 0.11%  |
|           EVENT        

**Extracting methapore data**

In [14]:
sentences_2 = [sentence for sentence in soup.find_all('s')]
methaporas = [[m.get_text().lower() for m in metaphor.find_all('seg')] for metaphor in sentences_2]
methaporas[:2]


[['reveals', 'approach', 'leading', 'to'], []]

In [15]:
# Calculating count of sentences with methapora

sentence_with_methapor_count = 0

for sent in methaporas:
  if sent:
    sentence_with_methapor_count += 1

print(sentence_with_methapor_count)
print(len(methaporas))

8326
16202


In [16]:
# Extracting unique metaphoras

methaporas_2 = []

for s in sentences_2:
  for m in s.find_all('seg'):
    if m.get_text() not in methaporas_2:
      methaporas_2.append(m.get_text())

methaporas_2[:10]

['reveals',
 'approach',
 'leading',
 'to',
 'made',
 'fatal',
 'in',
 'target',
 'head',
 'through']

In [17]:
# Extractin methaporas in sentences

methaporas_in_sentence = []

for sent in tokenize_sentences:
  methaphoras = []
  for token in sent:
    if token in methaporas_2:
      methaphoras.append(token)
  methaporas_in_sentence.append(methaphoras)

In [18]:
data = {
  "sentence": [i for i in range(1,(len(tokenize_sentences) + 1))],
  "metahporas": [methaporas_in_sentence[i] for i in range(0, len(methaporas_in_sentence) )],
  "count": [len(methaporas_in_sentence[i]) for i in range(0, len(methaporas_in_sentence) )],
  "y": [1 if len(methaporas_in_sentence[i]) > 0 else 0 for i in range(0, len(methaporas_in_sentence) )],
}

df = pd.DataFrame(data)

print(df)


       sentence                                         metahporas  count  y
0             1  [corporate, reveals, approach, leading, pound,...      6  1
1             2                                                 []      0  0
2             3       [appear, made, fatal, bid, packaging, group]      6  1
3             4             [investigated, target, 's, facilities]      4  1
4             5  [head, bid, mounted, says, rule, number, one, ...     10  1
...         ...                                                ...    ... ..
16197     16198  [come, back, 've, got, know, 've, got, 's, wor...     10  1
16198     16199                                             [know]      1  1
16199     16200                                      ['s, sitting]      2  1
16200     16201                     [well, know, know, kind, work]      5  1
16201     16202                                  [well, 's, right]      3  1

[16202 rows x 4 columns]


**2. Now we want to study the detection of the metaphor as a multi-class classification problem. You may
inspire from the existing implementations for this task available on the same github account, see also
some original papers, e,g., A Report on the 2018 VUA Metaphor Detection Shared Task
(aclanthology.org). Consider a simple machine learning classifier, e.,g., SVM, or NaivesBayes with
80-20 data split for training-testing, and use tf-idf features, testing various lengths of total feature
sets, e.g., 500, 1000, 3000, to report the detection accuracy of the models.**


In [19]:
# 2

import nltk
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, mean_squared_error

In [20]:
sentences_for_tf_idf = []
for i in range(0,len(tokenize_sentences)):
  tokens = tokenize_sentences[i]
  sentence = ""
  for t in range(0,len(tokens)):
    sentence += tokens[t] + " "
  sentences_for_tf_idf.append(str(sentence))

In [21]:
feature_lengths = [250, 500, 1000, 3000]

results = {'Dataset': [], 'Feature Length': [], 'Model': [], 'Accuracy': [], 'Precision': [], 'Recall': [], 'F1-Score': []}


In [22]:
for length in feature_lengths:

    vectorizer = TfidfVectorizer(max_features=length)
    X = vectorizer.fit_transform(sentences_for_tf_idf)
    names = vectorizer.get_feature_names_out()
    y = df['y']

    train_sentences, test_sentences, labels_train, labels_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Train and eval Naive Bayes
    nb_model = MultinomialNB()
    nb_model.fit(train_sentences, labels_train)
    nb_predictions = nb_model.predict(test_sentences)
    nb_accuracy = accuracy_score(labels_test, nb_predictions)
    nb_precision = precision_score(labels_test, nb_predictions)
    nb_recall = recall_score(labels_test, nb_predictions)
    nb_f1 = f1_score(labels_test, nb_predictions)


    results['Feature Length'].append(length)
    results['Model'].append('Naive Bayes')
    results['Accuracy'].append(nb_accuracy)
    results['Precision'].append(nb_precision)
    results['Recall'].append(nb_recall)
    results['F1-Score'].append(nb_f1)
    results['Dataset'].append('VUA')
    highestacc = nb_accuracy
    bestmodel = 'NB'

    # Train and eval SVM
    svm_model = SVC(kernel='linear', random_state=42)
    svm_model.fit(train_sentences, labels_train)
    svm_predictions = svm_model.predict(test_sentences)
    svm_accuracy = accuracy_score(labels_test, svm_predictions)
    svm_precision = precision_score(labels_test, svm_predictions)
    svm_recall = recall_score(labels_test, svm_predictions)
    svm_f1 = f1_score(labels_test, svm_predictions)

    results['Feature Length'].append(length)
    results['Model'].append('SVM')
    results['Accuracy'].append(svm_accuracy)
    results['Precision'].append(svm_precision)
    results['Recall'].append(svm_recall)
    results['F1-Score'].append(svm_f1)
    results['Dataset'].append('VUA')
    if svm_accuracy > highestacc:
        highestacc = svm_accuracy
        bestmodel = 'SVM'

    # Train and eval Logistic Regression
    logreg = LogisticRegression(random_state=16)
    logreg.fit(train_sentences, labels_train)
    logreg_pred = logreg.predict(test_sentences)
    logreg_acc = accuracy_score(labels_test, logreg_pred)
    logreg_precision = precision_score(labels_test, logreg_pred)
    logreg_recall = recall_score(labels_test, logreg_pred)
    logreg_f1 = f1_score(labels_test, logreg_pred)

    results['Feature Length'].append(length)
    results['Model'].append('Logistic Reg.')
    results['Accuracy'].append(logreg_acc)
    results['Precision'].append(logreg_precision)
    results['Recall'].append(logreg_recall)
    results['F1-Score'].append(logreg_f1)
    results['Dataset'].append('VUA')
    if logreg_acc > highestacc:
        highestacc = logreg_acc
        bestmodel = 'LogReg'

    # Train and eval Multi-Layer-Perceptron classifier
    clf = MLPClassifier(solver='lbfgs', alpha=1e-5,
                        hidden_layer_sizes=(5, 2), max_iter=100, random_state=1)
    clf.fit(train_sentences, labels_train)
    mlp_pred = clf.predict(test_sentences)
    mlp_acc = accuracy_score(labels_test, mlp_pred)
    mlp_precision = precision_score(labels_test, mlp_pred)
    mlp_recall = recall_score(labels_test, mlp_pred)
    mlp_f1 = f1_score(labels_test, mlp_pred)

    results['Feature Length'].append(length)
    results['Model'].append('MLP')
    results['Accuracy'].append(mlp_acc)
    results['Precision'].append(mlp_precision)
    results['Recall'].append(mlp_recall)
    results['F1-Score'].append(mlp_f1)
    results['Dataset'].append('VUA')
    if mlp_acc > highestacc:
        highestacc = mlp_acc
        bestmodel = 'MLP'

    average = (nb_predictions + svm_predictions + logreg_pred + mlp_pred) / 4
    averageacc = (nb_accuracy + svm_accuracy + logreg_acc + mlp_acc) / 4
    print('Average accuracy', round(averageacc, 5))
    print('RMSE: ', round(mean_squared_error(labels_test, average), 5))

    if averageacc > highestacc:
        highestacc = averageacc
        bestmodel = 'Ensemble'

    print('Best model: ', bestmodel + '    Model Accuracy: ', round(highestacc, 5))
    print('------')


Average accuracy 0.82019
RMSE:  0.16363
Best model:  LogReg    Model Accuracy:  0.83678
------
Average accuracy 0.83778
RMSE:  0.10662
Best model:  SVM    Model Accuracy:  0.88954
------
Average accuracy 0.87496
RMSE:  0.07276
Best model:  SVM    Model Accuracy:  0.91515
------
Average accuracy 0.88453
RMSE:  0.07099
Best model:  SVM    Model Accuracy:  0.93397
------


In [26]:
true_sum_labels = 0
false_sum_labels = 0
for num in labels_test:
  if num == 1:
    true_sum_labels += 1
  else:
    false_sum_labels += 1

print(true_sum_labels)
print(false_sum_labels)

2514
727


In [27]:
results_df = pd.DataFrame(results)
print(results_df)

   Dataset  Feature Length          Model  Accuracy  Precision    Recall  \
0      VUA             250    Naive Bayes  0.835545   0.832494  0.986476   
1      VUA             250            SVM  0.832768   0.830429  0.985680   
2      VUA             250  Logistic Reg.  0.836779   0.831829  0.989658   
3      VUA             250            MLP  0.775687   0.775687  1.000000   
4      VUA             500    Naive Bayes  0.842333   0.839147  0.985680   
5      VUA             500            SVM  0.889540   0.966263  0.888624   
6      VUA             500  Logistic Reg.  0.843567   0.839824  0.986476   
7      VUA             500            MLP  0.775687   0.775687  1.000000   
8      VUA            1000    Naive Bayes  0.845418   0.842231  0.985282   
9      VUA            1000            SVM  0.915150   0.966264  0.922832   
10     VUA            1000  Logistic Reg.  0.852823   0.847255  0.988465   
11     VUA            1000            MLP  0.886455   0.971441  0.879475   
12     VUA  

**3. From the best performing algorithms that you may have scrutinized for the same application, suggest
an ensemble model that combines three models, see example of implementations in Ensemble
Methods in Python - GeeksforGeeks, and test the performance of the model.**


In [28]:
!pip install vecstack

Collecting vecstack
  Downloading vecstack-0.4.0.tar.gz (18 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: vecstack
  Building wheel for vecstack (setup.py) ... [?25l[?25hdone
  Created wheel for vecstack: filename=vecstack-0.4.0-py3-none-any.whl size=19861 sha256=32599e14344099f97b19d548f959b75e5bb08334bd478081a10784cdc8552197
  Stored in directory: /root/.cache/pip/wheels/b8/d8/51/3cf39adf22c522b0a91dc2208db4e9de4d2d9d171683596220
Successfully built vecstack
Installing collected packages: vecstack
Successfully installed vecstack-0.4.0


In [29]:
# Training Ensemble Model

from vecstack import stacking

vectorizer = TfidfVectorizer(max_features=3000)
X = vectorizer.fit_transform(sentences_for_tf_idf)
names = vectorizer.get_feature_names_out()

y = df['y']

train_sentences_2, test_sentences_2, labels_train_2, labels_test_2 = train_test_split(X, y, test_size=0.2, random_state=42)

# initializing all the base model objects with default parameters

model_1 = LogisticRegression(random_state=16)
model_2 = MultinomialNB()
model_3 = SVC(kernel='linear', random_state=42)

# putting all base model objects in one list
all_models = [model_1, model_2, model_3]

# computing the stack features
s_train, s_test = stacking(all_models, train_sentences_2, labels_train_2, test_sentences_2, regression=True, shuffle=True, n_folds=4)

# initializing the second-level model
final_model = model_1

# fitting the second level model with stack features
final_model = final_model.fit(s_train, labels_train_2)

# predicting the final output using stacking
pred_final = final_model.predict(s_test)


In [30]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, mean_squared_error

In [31]:
# Calculate evaluation metrics for Ensemble Model
accuracy = accuracy_score(labels_test_2, pred_final)
precision = precision_score(labels_test_2, pred_final)
recall = recall_score(labels_test_2, pred_final)
f1 = f1_score(labels_test_2, pred_final)
mean_sqr_error = mean_squared_error(labels_test_2, pred_final)

# Print the results
print("Evaluation Metrics for Metaphor Detection:")
print(f"Accuracy: {accuracy*100:.4f} %")
print(f"Precision: {precision*100:.4f} %")
print(f"Recall: {recall*100:.4f} %")
print(f"F1-Score: {f1*100:.4f} %")
print(f"Mean squared error: {mean_sqr_error:.4f}")

model_3_new_row = {'Dataset': 'VUA', 'Feature Length': 3000, 'Model': 'Ensemble', 'Accuracy': accuracy, 'Precision': precision, 'Recall': recall, 'F1-Score': f1}

Evaluation Metrics for Metaphor Detection:
Accuracy: 92.5332 %
Precision: 94.2712 %
Recall: 96.2212 %
F1-Score: 95.2362 %
Mean squared error: 0.0747


In [32]:
labels_true = 0

for num in labels_test_2:
  if num == 1:
    labels_true += 1

In [33]:
vua_data = {'Dataset': 'VUA',
            'Measurement' : ['Sent. count', 'Tokens', 'Unique tokens', 'Avg token per sent', 'sent with methapor', 'Testing sent count', 'How many with metaphor'],
            'Values' : [counts_sentence, tokens_sum, unique_tokens, avg_tokens_per_sentence, sentence_with_methapor_count, len(labels_test_2), labels_true ]
            }

**4. Study the use of FrameBERT pointed out earlier on the same dataset and report its performances in terms precision, recall and F1-measure and compare this with results in 3.**



In [34]:
!pip install datasets



In [35]:
import nltk
import wandb
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [36]:
!git clone https://github.com/liyucheng09/MetaphorFrame.git

Cloning into 'MetaphorFrame'...
remote: Enumerating objects: 287, done.[K
remote: Counting objects: 100% (287/287), done.[K
remote: Compressing objects: 100% (143/143), done.[K
remote: Total 287 (delta 157), reused 257 (delta 132), pack-reused 0 (from 0)[K
Receiving objects: 100% (287/287), 18.59 MiB | 18.92 MiB/s, done.
Resolving deltas: 100% (157/157), done.


In [37]:
cd MetaphorFrame

/content/MetaphorFrame


In [38]:
y_b = df['y']

train_sentences_b, test_sentences_b, labels_train_b, labels_test_b = train_test_split(tokenize_sentences, y_b, test_size=0.2, random_state=42)

print(len(test_sentences_b))
print(len(labels_test_b))
print(test_sentences_b[0])

3241
3241
['glenys', 'ever', 'since', 'han']


In [39]:
string_b = ''

for sentence in test_sentences_b:
  for index, s in enumerate(sentence):
    if index == len(sentence) - 1:
      string_b += s + "\n"
    else:
      string_b += s + " "

import json

temp = { "articles": string_b }

json_sentences = json.dumps(temp)

with open('sentences.json', 'w') as writefile:
    writefile.write(json_sentences)

In [40]:
# moves modified config file into correct location so it can be run
!mv /content/wandb_config.py /usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_config.py

In [41]:
# Run FrameBert
!python inference.py sentences.json


2024-11-08 09:07:06.233659: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-11-08 09:07:06.292140: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-11-08 09:07:06.306763: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-11-08 09:07:06.350587: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
config.json: 100% 713/713 [00:00<00:00, 3.58MB/s]
pyt

In [42]:
bert_predictions_df = pd.read_csv("predictions.tsv", sep="\t")
predicted_metaphors_bert = bert_predictions_df["Real_metaphors"]
print(bert_predictions_df)

           Tokens  Borderline_metaphor  Real_metaphors         Frame_label
0          glenys                    0               0                   _
1            ever                    0               0                   _
2           since                    0               0         Time_vector
3             han                    0               0                   _
4           vicki                    0               0                   _
...           ...                  ...             ...                 ...
21988        fell                    1               0  Motion_directional
21989      victim                    1               1         Catastrophe
21990        late                    0               0  Temporal_subregion
21991       1960s                    0               0                   _
21992  iconoclasm                    0               0                   _

[21993 rows x 4 columns]


In [43]:
bert_predictions_df.head()

Unnamed: 0,Tokens,Borderline_metaphor,Real_metaphors,Frame_label
0,glenys,0,0,_
1,ever,0,0,_
2,since,0,0,Time_vector
3,han,0,0,_
4,vicki,0,0,_


In [44]:
bert_metaphor_df = bert_predictions_df.loc[bert_predictions_df['Real_metaphors'] == 1]
bert_metaphoras = bert_metaphor_df['Tokens']
bert_metaphoras = bert_metaphoras.values.tolist()

# FrameBert Result to predictions

bert_predictions = []

for test_sentence in test_sentences_b:
  metaphora = 0
  for token in test_sentence:
    if token in bert_metaphoras:
      metaphora = 1
  bert_predictions.append(metaphora)

print(len(bert_predictions))
print(len(labels_test_b))

3241
3241


In [45]:
# Calculate evaluation metrics for FrameBERT
accuracy = accuracy_score(labels_test_b, bert_predictions)
precision = precision_score(labels_test_b, bert_predictions)
recall = recall_score(labels_test_b, bert_predictions)
f1 = f1_score(labels_test_b, bert_predictions)

# Print the results
print("Evaluation Metrics for Metaphor Detection:")
print(f"Accuracy: {accuracy*100:.4f} %")
print(f"Precision: {precision*100:.4f} %")
print(f"Recall: {recall*100:.4f} %")
print(f"F1-Score: {f1*100:.4f} %")

model_bert_new_row = {'Dataset': 'VUA', 'Feature Length': 0, 'Model': 'BERT', 'Accuracy': accuracy, 'Precision': precision, 'Recall': recall, 'F1-Score': f1}

Evaluation Metrics for Metaphor Detection:
Accuracy: 68.2197 %
Precision: 99.2042 %
Recall: 59.5068 %
F1-Score: 74.3909 %


**5. We want to test the model on subsequent dataset -content word metaphor metaphor/content-
words/README.md at master · EducationalTestingService/metaphor · GitHub available on the same
repository. Test the performance of the machine learning model in 2), ensemble model and
FrameBERT models.**


**Dataset TOEFL Pfeature.jsonlines** toefl_skll_train_features.zip https://github.com/EducationalTestingService/metaphor/releases/download/v1.0/toefl_skll_train_features.zip


In [46]:
TOELF_corpus_url = 'https://github.com/EducationalTestingService/metaphor/releases/download/v1.0/toefl_skll_train_features.zip'

r = requests.get(TOELF_corpus_url)

with ZipFile(BytesIO(r.content), 'r') as zip:
  zip.printdir()
  file = zip.extract('features/all_pos/P.jsonlines','r')


with open(file) as f:
  content = f.readlines()

File Name                                             Modified             Size
features/                                      2020-01-11 22:48:22            0
features/verbs/                                2020-01-11 22:48:22            0
features/verbs/C-BiasDown.jsonlines            2020-01-11 22:48:22      5379532
features/verbs/C-BiasUp.jsonlines              2020-01-11 22:48:22      5151596
features/verbs/CCDB-BiasUpDown.jsonlines       2020-01-11 22:48:22      9880892
features/verbs/P.jsonlines                     2020-01-11 22:48:22       490389
features/verbs/T.jsonlines                     2020-01-11 22:48:22     22348021
features/verbs/U.jsonlines                     2020-01-11 22:48:22       442972
features/verbs/UL.jsonlines                    2020-01-11 22:48:22       445024
features/verbs/WordNet.jsonlines               2020-01-11 22:48:22       755941
features/all_pos/                              2020-01-11 22:47:28            0
features/all_pos/C-BiasDown.jsonlines   

In [47]:
sentences_toelf = []
labels_toelf = []
unique_tokens_toelf = []
sent_count_toelf = 0
sent_lenght_toelf_sum = 0
token_sum_toelf = 0
start_idx = 1
sentence_temp = []
containing_metaphor = 0
sentence_with_methapor_count_toelf = 0

for c in content:

  json_token = json.loads(c.strip())
  methapor = json_token['y']
  word_data = json_token['id']
  parsing_word = word_data.split('_')
  sentence_num = parsing_word[1]
  word = parsing_word[3]
  token_sum_toelf += 1

  if word not in unique_tokens_toelf:
    unique_tokens_toelf.append(word)

  if start_idx != int(sentence_num):
    sentences_toelf.append(sentence_temp)
    labels_toelf.append(containing_metaphor)
    if containing_metaphor > 0:
      sentence_with_methapor_count_toelf += 1

    containing_metaphor = 0
    sent_lenght_toelf_sum += len(sentence_temp)
    sentence_temp = []
    start_idx = int(sentence_num)
    sentence_temp.append(word)
    sent_count_toelf += 1

  else:
    sentence_temp.append(word)
    if methapor == 1:
      containing_metaphor = 1

avg_tokens_per_sent_toelf = sent_lenght_toelf_sum / sent_count_toelf


In [48]:
sentences_for_tf_idf_toelf = []
for i in range(0,len(sentences_toelf)):
  tokens = sentences_toelf[i]
  sentence = ""
  for t in range(0,len(tokens)):
    sentence += tokens[t] + " "
  sentences_for_tf_idf_toelf.append(str(sentence))

In [49]:
sentences_for_tf_idf_toelf

['believe specializing subject better important world live today broad knowledge many academic subjects ',
 'broad knowledge many academic subjects good socially allows person communicate people many subjects ',
 'However enough society live today decent job merely broad knowledge things ',
 'Jobs require specific thorough knowledge specific subject subjects ',
 'Working lawyer example requires thorough indepth knowledge law general knowledge enough become lawyer ',
 'Specializing subject means person going become really good subject therefore pursue future ',
 'ensures person able find job specialized world ',
 'Someone specializes finance example work bank learn many things help everyday life gain knowledge job ',
 'Therefore specialization leads broader knowledge subjects ',
 'Specialization discovered agreed economists including father economics Adam Smith leads greater efficiency ',
 'someone specializes something time discovers ways improve therefore becomes expert field ',
 'led

In [50]:
feature_lengths = [250, 500, 1000, 3000]

results_toelf = {'Dataset': [], 'Feature Length': [], 'Model': [], 'Accuracy': [], 'Precision': [], 'Recall': [], 'F1-Score': []}

for length in feature_lengths:

    vectorizer = TfidfVectorizer(max_features=length)
    X = vectorizer.fit_transform(sentences_for_tf_idf_toelf)
    names = vectorizer.get_feature_names_out()

    train_sentences, test_sentences, labels_train, labels_test = train_test_split(X, labels_toelf, test_size=0.2, random_state=42)

    # Train and eval Naive Bayes
    nb_model = MultinomialNB()
    nb_model.fit(train_sentences, labels_train)
    nb_predictions = nb_model.predict(test_sentences)
    nb_accuracy = accuracy_score(labels_test, nb_predictions)
    nb_precision = precision_score(labels_test, nb_predictions)
    nb_recall = recall_score(labels_test, nb_predictions)
    nb_f1 = f1_score(labels_test, nb_predictions)

    results_toelf['Feature Length'].append(length)
    results_toelf['Model'].append('Naive Bayes')
    results_toelf['Accuracy'].append(nb_accuracy)
    results_toelf['Precision'].append(nb_precision)
    results_toelf['Recall'].append(nb_recall)
    results_toelf['F1-Score'].append(nb_f1)
    results_toelf['Dataset'].append('TOEFL')
    highestacc = nb_accuracy
    bestmodel = 'NB'

    # Train and eval SVM
    svm_model = SVC(kernel='linear', random_state=42)
    svm_model.fit(train_sentences, labels_train)
    svm_predictions = svm_model.predict(test_sentences)
    svm_accuracy = accuracy_score(labels_test, svm_predictions)
    svm_precision = precision_score(labels_test, svm_predictions)
    svm_recall = recall_score(labels_test, svm_predictions)
    svm_f1 = f1_score(labels_test, svm_predictions)

    results_toelf['Feature Length'].append(length)
    results_toelf['Model'].append('SVM')
    results_toelf['Accuracy'].append(svm_accuracy)
    results_toelf['Precision'].append(svm_precision)
    results_toelf['Recall'].append(svm_recall)
    results_toelf['F1-Score'].append(svm_f1)
    results_toelf['Dataset'].append('TOEFL')
    if svm_accuracy > highestacc:
        highestacc = svm_accuracy
        bestmodel = 'SVM'

    # Train and eval Logistic Regression
    logreg = LogisticRegression(random_state=16)
    logreg.fit(train_sentences, labels_train)
    logreg_pred = logreg.predict(test_sentences)
    logreg_acc = accuracy_score(labels_test, logreg_pred)
    logreg_precision = precision_score(labels_test, logreg_pred)
    logreg_recall = recall_score(labels_test, logreg_pred)
    logreg_f1 = f1_score(labels_test, logreg_pred)

    results_toelf['Feature Length'].append(length)
    results_toelf['Model'].append('Logistic Reg.')
    results_toelf['Accuracy'].append(logreg_acc)
    results_toelf['Precision'].append(logreg_precision)
    results_toelf['Recall'].append(logreg_recall)
    results_toelf['F1-Score'].append(logreg_f1)
    results_toelf['Dataset'].append('TOEFL')
    if logreg_acc > highestacc:
        highestacc = logreg_acc
        bestmodel = 'LogReg'

    # Train and eval Multi-Layer-Perceptron classifier
    clf = MLPClassifier(solver='lbfgs', alpha=1e-5,
                        hidden_layer_sizes=(5, 2), max_iter=4000, random_state=1)
    clf.fit(train_sentences, labels_train)
    mlp_pred = clf.predict(test_sentences)
    mlp_acc = accuracy_score(labels_test, mlp_pred)
    mlp_precision = precision_score(labels_test, mlp_pred)
    mlp_recall = recall_score(labels_test, mlp_pred)
    mlp_f1 = f1_score(labels_test, mlp_pred)

    results_toelf['Feature Length'].append(length)
    results_toelf['Model'].append('MLP')
    results_toelf['Accuracy'].append(mlp_acc)
    results_toelf['Precision'].append(mlp_precision)
    results_toelf['Recall'].append(mlp_recall)
    results_toelf['F1-Score'].append(mlp_f1)
    results_toelf['Dataset'].append('TOEFL')
    if mlp_acc > highestacc:
        highestacc = mlp_acc
        bestmodel = 'MLP'

    average = (nb_predictions + svm_predictions + logreg_pred + mlp_pred) / 4
    averageacc = (nb_accuracy + svm_accuracy + logreg_acc + mlp_acc) / 4
    print('Average accuracy', round(averageacc, 5))
    print('RMSE: ', round(mean_squared_error(labels_test, average), 5))

    if averageacc > highestacc:
        highestacc = averageacc
        bestmodel = 'Ensemble'

    print('Best model: ', bestmodel + '    Model Accuracy: ', round(highestacc, 5))
    print('------')


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)


Average accuracy 0.63686
RMSE:  0.29927
Best model:  SVM    Model Accuracy:  0.65693
------
Average accuracy 0.66286
RMSE:  0.27977
Best model:  NB    Model Accuracy:  0.66971
------
Average accuracy 0.6615
RMSE:  0.28513
Best model:  SVM    Model Accuracy:  0.68431
------
Average accuracy 0.64599
RMSE:  0.28878
Best model:  SVM    Model Accuracy:  0.67701
------


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


In [51]:
results_df_toelf = pd.DataFrame(results_toelf)
print(results_df_toelf)

   Dataset  Feature Length          Model  Accuracy  Precision    Recall  \
0    TOEFL             250    Naive Bayes  0.631387   0.626016  0.330472   
1    TOEFL             250            SVM  0.656934   0.639752  0.442060   
2    TOEFL             250  Logistic Reg.  0.656934   0.643312  0.433476   
3    TOEFL             250            MLP  0.602190   0.532468  0.527897   
4    TOEFL             500    Naive Bayes  0.669708   0.683099  0.416309   
5    TOEFL             500            SVM  0.664234   0.652174  0.450644   
6    TOEFL             500  Logistic Reg.  0.655109   0.648649  0.412017   
7    TOEFL             500            MLP  0.662409   0.613208  0.557940   
8    TOEFL            1000    Naive Bayes  0.660584   0.669065  0.399142   
9    TOEFL            1000            SVM  0.684307   0.678571  0.489270   
10   TOEFL            1000  Logistic Reg.  0.658759   0.655405  0.416309   
11   TOEFL            1000            MLP  0.642336   0.586047  0.540773   
12   TOEFL  

In [52]:
print(len(sentences_for_tf_idf_toelf))
print(len(labels_toelf))

2737
2737


**Ensemble model**

In [53]:
from vecstack import stacking

vectorizer = TfidfVectorizer(max_features=3000)
X = vectorizer.fit_transform(sentences_for_tf_idf_toelf)
names = vectorizer.get_feature_names_out()

train_sentences_toelf_3, test_sentences_toelf_3, labels_train_toelf_3, labels_test_toelf_3 = train_test_split(X, labels_toelf, test_size=0.2, random_state=42)

# initializing all the base model objects with default parameters

model_1_toelf = LogisticRegression(random_state=16)
model_2_toelf = MultinomialNB()
model_3_toelf = SVC(kernel='linear', random_state=42)

# putting all base model objects in one list
all_models_toelf = [model_1_toelf, model_2_toelf, model_3_toelf]

# computing the stack features
s_train_toelf, s_test_toelf = stacking(all_models_toelf, train_sentences_toelf_3, labels_train_toelf_3, test_sentences_toelf_3, regression=True, shuffle=True, n_folds=4)

# initializing the second-level model
final_model_toelf = model_1_toelf

# fitting the second level model with stack features
final_model_toelf = final_model_toelf.fit(s_train_toelf, labels_train_toelf_3)

# predicting the final output using stacking
pred_final_toelf = final_model_toelf.predict(s_test_toelf)


In [54]:
# Calculate evaluation metrics for Ensemble Model
accuracy = accuracy_score(labels_test_toelf_3,pred_final_toelf)
precision = precision_score(labels_test_toelf_3, pred_final_toelf)
recall = recall_score(labels_test_toelf_3, pred_final_toelf)
f1 = f1_score(labels_test_toelf_3, pred_final_toelf)
mean_sqr_error = mean_squared_error(labels_test_toelf_3, pred_final_toelf)

# Print the results
print("Evaluation Metrics for Metaphor Detection:")
print(f"Accuracy: {accuracy*100:.4f} %")
print(f"Precision: {precision*100:.4f} %")
print(f"Recall: {recall*100:.4f} %")
print(f"F1-Score: {f1*100:.4f} %")
print(f"Mean squared error: {mean_sqr_error:.4f}")

model_3_new_row_toelf = {'Dataset': 'TOEFL', 'Feature Length': 3000, 'Model': 'Ensemble', 'Accuracy': accuracy, 'Precision': precision, 'Recall': recall, 'F1-Score': f1}

Evaluation Metrics for Metaphor Detection:
Accuracy: 67.8832 %
Precision: 71.7557 %
Recall: 40.3433 %
F1-Score: 51.6484 %
Mean squared error: 0.3212


In [55]:
labels_true_toelf = 0

for num in labels_test_toelf_3:
  if num == 1:
    labels_true_toelf += 1

In [56]:
toelf_data = {'Dataset': 'toelf',
            'Measurement' : ['Sent. count', 'Tokens', 'Unique tokens', 'Avg token per sent', 'sent with methapor', 'Testing sent count', 'How many with metaphor'],
            'Values' : [sent_count_toelf, token_sum_toelf, len(unique_tokens_toelf), avg_tokens_per_sent_toelf, sentence_with_methapor_count_toelf, len(labels_test_toelf_3), labels_true_toelf]
            }

**FrameBERT**

In [57]:
# Modifying Ottawa data for FrameBert testing

train_sentences_toelf_b, test_sentences_toelf_b, labels_train_toelf_b, labels_test_toelf_b = train_test_split(sentences_toelf, labels_toelf, test_size=0.2, random_state=42)

string_toelf = ''

for sentence in test_sentences_toelf_b:
  for index, s in enumerate(sentence):
    if index == len(sentence) - 1:
      string_toelf += s + "\n"
    else:
      string_toelf += s + " "

temp = { "articles": string_toelf }

json_sentences_toelf = json.dumps(temp)

with open('sentences_toelf.json', 'w') as writefile:
    writefile.write(json_sentences_toelf)

In [58]:
!python inference.py sentences_toelf.json

2024-11-08 09:28:31.647845: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-11-08 09:28:31.683060: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-11-08 09:28:31.693633: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-11-08 09:28:31.722119: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Map: 100% 1/1 [00:00<00:00,  8.84 examples/s]
{'token

In [59]:
bert_predictions_df = pd.read_csv("predictions.tsv", sep="\t")
predicted_metaphors_bert = bert_predictions_df["Real_metaphors"]
print(bert_predictions_df)

bert_metaphor_df = bert_predictions_df.loc[bert_predictions_df['Real_metaphors'] == 1]
bert_metaphoras = bert_metaphor_df['Tokens']
bert_metaphoras = bert_metaphoras.values.tolist()

# FrameBert Result to predictions

bert_predictions = []

for test_sentence in test_sentences_toelf_b:
  metaphora = 0
  for token in test_sentence:
    if token in bert_metaphoras:
      metaphora = 1
  bert_predictions.append(metaphora)

print(len(bert_predictions))
print(len(labels_test_b))



          Tokens  Borderline_metaphor  Real_metaphors       Frame_label
0        general                    0               0                 _
1         scheme                    0               0                 _
2         normal                    0               0                 _
3        italian                    0               0                 _
4         family                    0               0           Kinship
..           ...                  ...             ...               ...
224        older                    0               0               Age
225       people                    0               0            People
226  necessarely                    0               0                 _
227     maintain                    1               1  Activity_ongoing
228         sort                    0               0              Type

[229 rows x 4 columns]
548
3241


In [60]:
# Calculate evaluation metrics for FrameBERT
accuracy = accuracy_score(labels_test_toelf_b, bert_predictions)
precision = precision_score(labels_test_toelf_b, bert_predictions)
recall = recall_score(labels_test_toelf_b, bert_predictions)
f1 = f1_score(labels_test_toelf_b, bert_predictions)

# Print the results
print("Evaluation Metrics for Metaphor Detection:")
print(f"Accuracy: {accuracy*100:.4f} %")
print(f"Precision: {precision*100:.4f} %")
print(f"Recall: {recall*100:.4f} %")
print(f"F1-Score: {f1*100:.4f} %")

model_bert_new_row_toelf = {'Dataset': 'TOEFL', 'Feature Length': 0, 'Model': 'BERT', 'Accuracy': accuracy, 'Precision': precision, 'Recall': recall, 'F1-Score': f1}

Evaluation Metrics for Metaphor Detection:
Accuracy: 58.2117 %
Precision: 70.0000 %
Recall: 3.0043 %
F1-Score: 5.7613 %


**6. We want to test the model on other datasets,
https://www.eecs.uottawa.ca/~diana/resources/metaphor/type1_metaphor_annotated.txt. In the
above, the annotation at the end of the sentence i.e., @1@y indicates whether it is a metaphor (y) or
not (n). Here the presence of ‘y’ indicates that it is a metaphor, whereas “1” indicates the first head
word of the sentence, which is “poise”, in the part of speech tag sequence. Write a script that translates
the dataset into a format that can be utilized by the classifier / machine learning model.**

In [61]:
import re

r_otw = requests.get('https://www.eecs.uottawa.ca/~diana/resources/metaphor/type1_metaphor_annotated.txt')

decoded_content_otw = r_otw.content.decode("utf-8").split('\r')
decoded_content_otw = [c.replace('\n', ' ') for c in decoded_content_otw]

strings_otw = []
nums_otw = []
labels_otw = []

pattern_otw = r'@\d+@[a-zA-Z]'

for c in decoded_content_otw:
  matches = re.findall(pattern_otw, c)
  num = re.findall('\d+', matches[0])
  lab = re.findall('[a-zA-Z]', matches[0])
  string = c.replace(matches[0], '')
  strings_otw.append(string)
  nums_otw.append(num)
  labels_otw.append(lab)

data_otw = {
  "strings": strings_otw,
  "numbers": nums_otw,
  "labes" : labels_otw
}

ottawa_df = pd.DataFrame(data_otw)

In [62]:
print(ottawa_df)

                                               strings numbers labes
0                                   poise is a club .      [1]   [y]
1         destroying alexandria . sunlight is silence      [4]   [y]
2      feet are no anchor . gravity sucks at the mind      [1]   [y]
3         on the day 's horizon is a gesture of earth      [5]   [y]
4         he said good-by as if good-by is a number .      [6]   [y]
..                                                 ...     ...   ...
714   as the season of cold is the season of darkness      [5]   [n]
715                     else all beasts were tigers ,      [3]   [y]
716                       without which earth is sand      [3]   [n]
717                         the sky is cloud on cloud      [2]   [n]
718                                 the sky is cloudy      [2]   [n]

[719 rows x 3 columns]


In [63]:
all_tokens_otw = []
tokenize_sentences_otw = []
tokens_sum_otw = 0

for s in strings_otw:
  tokens = word_tokenize(s)
  tokens = [token.lower() for token in tokens if token.lower() not in stopwords and token not in chars]
  tokens_sum_otw += len(tokens)
  for token in tokens:
    if token not in all_tokens:
      all_tokens_otw.append(token)
  tokenize_sentences_otw.append(tokens)

In [64]:
otw_sent_count = len(tokenize_sentences_otw)
otw_avg = tokens_sum_otw / otw_sent_count

In [65]:
ottawa_df_2 = ottawa_df.assign(Tokenized_sentences=tokenize_sentences_otw)

In [66]:
otw_sentences = ottawa_df_2['Tokenized_sentences'].values.tolist()
otw_labes = ottawa_df_2['labes'].values.tolist()
otw_labes = [1 if l[0] == 'y' else 0 for l in otw_labes]

In [67]:
otw_sent_with_methapor = 0

for l in otw_labes:
  if l == 1:
    otw_sent_with_methapor += 1

print(otw_sent_with_methapor)
print(len(otw_labes))

358
719


In [68]:
sentences_for_tf_idf_otw = []
for i in range(0,len(otw_sentences)):
  tokens = tokenize_sentences_otw[i]
  sentence = ""
  for t in range(0,len(tokens)):
    sentence += tokens[t] + " "
  sentences_for_tf_idf_otw.append(str(sentence))

**7. Accommodate the programs in FrameBERT and machine learning model in 2) and 3) to test the
performance of the models on uottawa dataset.**


**Task 2 models with Ottawa data**

In [69]:
feature_lengths = [250, 500, 1000, 3000]

results_2_otw = {'Dataset': [], 'Feature Length': [], 'Model': [], 'Accuracy': [], 'Precision': [], 'Recall': [], 'F1-Score': []}

for length in feature_lengths:

    vectorizer = TfidfVectorizer(max_features=length)
    X = vectorizer.fit_transform(sentences_for_tf_idf_otw)
    names = vectorizer.get_feature_names_out()

    train_sentences, test_sentences, labels_train, labels_test = train_test_split(X, otw_labes, test_size=0.2, random_state=42)

    # Train and eval Naive Bayes
    nb_model = MultinomialNB()
    nb_model.fit(train_sentences, labels_train)
    nb_predictions = nb_model.predict(test_sentences)
    nb_accuracy = accuracy_score(labels_test, nb_predictions)
    nb_precision = precision_score(labels_test, nb_predictions)
    nb_recall = recall_score(labels_test, nb_predictions)
    nb_f1 = f1_score(labels_test, nb_predictions)


    results_2_otw['Feature Length'].append(length)
    results_2_otw['Model'].append('Naive Bayes')
    results_2_otw['Accuracy'].append(nb_accuracy)
    results_2_otw['Precision'].append(nb_precision)
    results_2_otw['Recall'].append(nb_recall)
    results_2_otw['F1-Score'].append(nb_f1)
    results_2_otw['Dataset'].append('Ottawa')
    highestacc = nb_accuracy
    bestmodel = 'NB'

    # Train and eval SVM
    svm_model = SVC(kernel='linear', random_state=42)
    svm_model.fit(train_sentences, labels_train)
    svm_predictions = svm_model.predict(test_sentences)
    svm_accuracy = accuracy_score(labels_test, svm_predictions)
    svm_precision = precision_score(labels_test, svm_predictions)
    svm_recall = recall_score(labels_test, svm_predictions)
    svm_f1 = f1_score(labels_test, svm_predictions)

    results_2_otw['Feature Length'].append(length)
    results_2_otw['Model'].append('SVM')
    results_2_otw['Accuracy'].append(svm_accuracy)
    results_2_otw['Precision'].append(svm_precision)
    results_2_otw['Recall'].append(svm_recall)
    results_2_otw['F1-Score'].append(svm_f1)
    results_2_otw['Dataset'].append('Ottawa')
    if svm_accuracy > highestacc:
        highestacc = svm_accuracy
        bestmodel = 'SVM'

    # Train and eval Logistic Regression
    logreg = LogisticRegression(random_state=16)
    logreg.fit(train_sentences, labels_train)
    logreg_pred = logreg.predict(test_sentences)
    logreg_acc = accuracy_score(labels_test, logreg_pred)
    logreg_precision = precision_score(labels_test, logreg_pred)
    logreg_recall = recall_score(labels_test, logreg_pred)
    logreg_f1 = f1_score(labels_test, logreg_pred)

    results_2_otw['Feature Length'].append(length)
    results_2_otw['Model'].append('Logistic Reg.')
    results_2_otw['Accuracy'].append(logreg_acc)
    results_2_otw['Precision'].append(logreg_precision)
    results_2_otw['Recall'].append(logreg_recall)
    results_2_otw['F1-Score'].append(logreg_f1)
    results_2_otw['Dataset'].append('Ottawa')
    if logreg_acc > highestacc:
        highestacc = logreg_acc
        bestmodel = 'LogReg'

    # Train and eval Multi-Layer-Perceptron classifier
    clf = MLPClassifier(solver='lbfgs', alpha=1e-5,
                        hidden_layer_sizes=(5, 2), max_iter=100, random_state=1)
    clf.fit(train_sentences, labels_train)
    mlp_pred = clf.predict(test_sentences)
    mlp_acc = accuracy_score(labels_test, mlp_pred)
    mlp_precision = precision_score(labels_test, mlp_pred)
    mlp_recall = recall_score(labels_test, mlp_pred)
    mlp_f1 = f1_score(labels_test, mlp_pred)

    results_2_otw['Feature Length'].append(length)
    results_2_otw['Model'].append('MLP')
    results_2_otw['Accuracy'].append(mlp_acc)
    results_2_otw['Precision'].append(mlp_precision)
    results_2_otw['Recall'].append(mlp_recall)
    results_2_otw['F1-Score'].append(mlp_f1)
    results_2_otw['Dataset'].append('Ottawa')
    if mlp_acc > highestacc:
        highestacc = mlp_acc
        bestmodel = 'MLP'

    average = (nb_predictions + svm_predictions + logreg_pred + mlp_pred) / 4
    averageacc = (nb_accuracy + svm_accuracy + logreg_acc + mlp_acc) / 4
    print('Average accuracy', round(averageacc, 5))
    print('RMSE: ', round(mean_squared_error(labels_test, average), 5))

    if averageacc > highestacc:
        highestacc = averageacc
        bestmodel = 'Ensemble'

    print('Best model: ', bestmodel + '    Model Accuracy: ', round(highestacc, 5))
    print('------')

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)


Average accuracy 0.58507
RMSE:  0.31033
Best model:  MLP    Model Accuracy:  0.60417
------
Average accuracy 0.55903
RMSE:  0.36198
Best model:  LogReg    Model Accuracy:  0.57639
------


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)


Average accuracy 0.56076
RMSE:  0.37023
Best model:  NB    Model Accuracy:  0.58333
------
Average accuracy 0.53299
RMSE:  0.34852
Best model:  NB    Model Accuracy:  0.5625
------


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


In [70]:
results_df_otw = pd.DataFrame(results_2_otw)
print(results_df_otw)

   Dataset  Feature Length          Model  Accuracy  Precision    Recall  \
0   Ottawa             250    Naive Bayes  0.576389   0.625000  0.410959   
1   Ottawa             250            SVM  0.576389   0.563830  0.726027   
2   Ottawa             250  Logistic Reg.  0.583333   0.573034  0.698630   
3   Ottawa             250            MLP  0.604167   0.700000  0.383562   
4   Ottawa             500    Naive Bayes  0.555556   0.571429  0.493151   
5   Ottawa             500            SVM  0.569444   0.560440  0.698630   
6   Ottawa             500  Logistic Reg.  0.576389   0.569767  0.671233   
7   Ottawa             500            MLP  0.534722   0.537500  0.589041   
8   Ottawa            1000    Naive Bayes  0.583333   0.600000  0.534247   
9   Ottawa            1000            SVM  0.541667   0.541176  0.630137   
10  Ottawa            1000  Logistic Reg.  0.541667   0.542169  0.616438   
11  Ottawa            1000            MLP  0.576389   0.583333  0.575342   
12  Ottawa  

**Task 3 ensemble model with Ottawa data**

In [71]:
# note this time max_features = 1000, because it performed best

vectorizer = TfidfVectorizer(max_features=1000)
X = vectorizer.fit_transform(sentences_for_tf_idf_otw)
names = vectorizer.get_feature_names_out()

train_sentences_otw_3, test_sentences_otw_3, labels_train_otw_3, labels_test_otw_3 = train_test_split(X, otw_labes, test_size=0.2, random_state=42)

# initializing all the base model objects with default parameters

model_1_otw_3 = LogisticRegression(random_state=16)
model_2_otw_3 = MultinomialNB()
model_3_otw_3 = SVC(kernel='linear', random_state=42)

# putting all base model objects in one list
all_models_otw_3 = [model_1_otw_3, model_2_otw_3, model_3_otw_3]

# computing the stack features
s_train_otw_3, s_test_otw_3 = stacking(all_models_otw_3, train_sentences_otw_3, labels_train_otw_3, test_sentences_otw_3, regression=True, shuffle=True, n_folds=4)

# initializing the second-level model
final_model_otw_3 = model_1_otw_3

# fitting the second level model with stack features
final_model_otw_3 = final_model.fit(s_train_otw_3, labels_train_otw_3)

# predicting the final output using stacking
pred_final_otw_3 = final_model_otw_3.predict(s_test_otw_3)

In [72]:
# Calculate evaluation metrics
accuracy_otw_3 = accuracy_score(labels_test_otw_3, pred_final_otw_3)
precision_otw_3 = precision_score(labels_test_otw_3, pred_final_otw_3)
recall_otw_3 = recall_score(labels_test_otw_3, pred_final_otw_3)
f1_otw_3 = f1_score(labels_test_otw_3, pred_final_otw_3)

# Print the results
print("Evaluation Metrics for Metaphor Detection:")
print(f"Accuracy: {accuracy_otw_3*100:.4f} %")
print(f"Precision: {precision_otw_3*100:.4f} %")
print(f"Recall: {recall_otw_3*100:.4f} %")
print(f"F1-Score: {f1_otw_3*100:.4f} %")

model_3_new_row_otw = {'Dataset': 'Ottawa', 'Feature Length': 3000, 'Model': 'Ensemble', 'Accuracy': accuracy, 'Precision': precision, 'Recall': recall, 'F1-Score': f1}

Evaluation Metrics for Metaphor Detection:
Accuracy: 54.8611 %
Precision: 55.8824 %
Recall: 52.0548 %
F1-Score: 53.9007 %


In [73]:
labels_true_otw = 0

for num in labels_test_otw_3:
  if num == 1:
    labels_true_otw += 1

labels_true_otw

73

In [74]:
otw_data = {'Dataset': 'Ottawa',
            'Measurement' : ['Sent. count', 'Tokens', 'Unique tokens', 'Avg token per sent', 'sent with methapor', 'Testing sent count', 'How many with metaphor'],
            'Values' : [otw_sent_count, tokens_sum_otw, len(all_tokens_otw), otw_avg, otw_sent_with_methapor, len(labels_test_otw_3), labels_true_otw]
            }

**FrameBERT ensemble model with Ottawa data**

In [75]:
# Modifying Ottawa data for FrameBert testing

train_sentences_otw_b, test_sentences_otw_b, labels_train_otw_b, labels_test_otw_b = train_test_split(otw_sentences, otw_labes, test_size=0.2, random_state=42)


string_otw = ''

for sentence in test_sentences_otw_b:
  for index, s in enumerate(sentence):
    if index == len(sentence) - 1:
      string_otw += s + "\n"
    else:
      string_otw += s + " "

temp = { "articles": string_otw }

json_sentences_otw = json.dumps(temp)

with open('sentences_otw.json', 'w') as writefile:
    writefile.write(json_sentences_otw)


In [76]:
# run Bert with ottawa data
!python inference.py sentences_otw.json

2024-11-08 09:30:16.560637: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-11-08 09:30:16.600762: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-11-08 09:30:16.613988: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-11-08 09:30:16.641190: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Map: 100% 49/49 [00:00<00:00, 1343.99 examples/s]
{'t

In [77]:
bert_otw_predictions = pd.read_csv("predictions.tsv", sep="\t")
predicted_metaphors_Bert = bert_otw_predictions["Real_metaphors"]

bert_metaphor_otw_df = bert_otw_predictions.loc[bert_otw_predictions['Real_metaphors'] == 1]
bert_metaphoras_otw = bert_metaphor_otw_df['Tokens'].values.tolist()


# FrameBert Result to predictions

bert_predictions_otw = []

for test_sentence in test_sentences_otw_b:
  metaphora = 0
  for token in test_sentence:
    if token in bert_metaphoras_otw:
      metaphora = 1
  bert_predictions_otw.append(metaphora)

print(len(bert_predictions_otw))
print(len(labels_test_otw_b))

144
144


In [78]:
# Calculate evaluation metrics for FrameBert in Ottawa data
accuracy_otw_b = accuracy_score(labels_test_otw_b, bert_predictions_otw)
precision_otw_b = precision_score(labels_test_otw_b, bert_predictions_otw)
recall_otw_b = recall_score(labels_test_otw_b, bert_predictions_otw)
f1_otw_b = f1_score(labels_test_otw_b, bert_predictions_otw)

# Print the results
print("Evaluation Metrics for Metaphor Detection:")
print(f"Accuracy: {accuracy_otw_b*100:.4f} %")
print(f"Precision: {precision_otw_b*100:.4f} %")
print(f"Recall: {recall_otw_b*100:.4f} %")
print(f"F1-Score: {f1_otw_b*100:.4f} %")

model_bert_new_row_otw = {'Dataset': 'Ottawa', 'Feature Length': 0, 'Model': 'BERT', 'Accuracy': accuracy, 'Precision': precision, 'Recall': recall, 'F1-Score': f1}

Evaluation Metrics for Metaphor Detection:
Accuracy: 47.9167 %
Precision: 48.0769 %
Recall: 34.2466 %
F1-Score: 40.0000 %


8. Use appropriate literature to comment on the findings. Also, identify any additional input that would
allow you to further elucidate any of the preceding, and use appropriate literature of corpus linguistic
literature to justify your findings and comment on the obtained results. Finally, comment on the
limitations and structural weakness of the data processing pipeline.

**FrameBERT and Machine Learning Models:**
FrameBERT uses frame embeddings to capture complex metaphors by understanding context with frame embedding and performing well in precision, recall, and F1-score especially for simple broad metaphor expressions.
Traditional Models (e.g SVM, Naive Bayes): They are effective at detecting frequent and commmon metaphors.
Ensemble Model: Combines multiple algorithms, enhancing generalization and accuracy for diverse poetic language.

**Datasets:**
VU Amsterdam Metaphor Corpus: Offers varied texts, a common dataset in this field and these tasks, helping models generalize across metaphor types.
Uottawa Dataset: Labeling with annotations allows more detailed analysis, aiding interpretability and cross-domain performance.

Adding more diverse metaphor datasets and text with plenty of figurative expressions (e.g. poetry and lyrics) and using data augmentation can improve generalization and capture less common metaphor types.

**More discussion, analysis, related work and insights on the report paper!**

**Result and Data Conlusion**

In [79]:

# data set VUA result combination
results_df.loc[len(results_df.index)] = model_3_new_row
results_df.loc[len(results_df.index)] = model_bert_new_row

# data set TOELF result combination
results_df_toelf.loc[len(results_df_toelf.index)] = model_3_new_row_toelf
results_df_toelf.loc[len(results_df_toelf.index)] = model_bert_new_row_toelf

# data set Ottawa result combination
results_df_otw.loc[len(results_df_otw.index)] = model_3_new_row_otw
results_df_otw.loc[len(results_df_otw.index)] = model_bert_new_row_otw

result_combined = pd.concat([results_df, results_df_toelf, results_df_otw])

result_combined = result_combined.drop_duplicates()

print(result_combined)


   Dataset  Feature Length          Model  Accuracy  Precision    Recall  \
0      VUA             250    Naive Bayes  0.835545   0.832494  0.986476   
1      VUA             250            SVM  0.832768   0.830429  0.985680   
2      VUA             250  Logistic Reg.  0.836779   0.831829  0.989658   
3      VUA             250            MLP  0.775687   0.775687  1.000000   
4      VUA             500    Naive Bayes  0.842333   0.839147  0.985680   
5      VUA             500            SVM  0.889540   0.966263  0.888624   
6      VUA             500  Logistic Reg.  0.843567   0.839824  0.986476   
7      VUA             500            MLP  0.775687   0.775687  1.000000   
8      VUA            1000    Naive Bayes  0.845418   0.842231  0.985282   
9      VUA            1000            SVM  0.915150   0.966264  0.922832   
10     VUA            1000  Logistic Reg.  0.852823   0.847255  0.988465   
11     VUA            1000            MLP  0.886455   0.971441  0.879475   
12     VUA  

In [80]:
# vua_data toelf_data otw_data

# Dataset information
result_dataset_combined = pd.concat([
    pd.DataFrame(vua_data),
    pd.DataFrame(toelf_data),
    pd.DataFrame(otw_data)
    ])

print(result_dataset_combined.round(2))


  Dataset             Measurement     Values
0     VUA             Sent. count   16202.00
1     VUA                  Tokens  113765.00
2     VUA           Unique tokens   17883.00
3     VUA      Avg token per sent       7.02
4     VUA      sent with methapor    8326.00
5     VUA      Testing sent count    3241.00
6     VUA  How many with metaphor    2514.00
0   toelf             Sent. count    2737.00
1   toelf                  Tokens   26737.00
2   toelf           Unique tokens    5224.00
3   toelf      Avg token per sent       9.76
4   toelf      sent with methapor    1124.00
5   toelf      Testing sent count     548.00
6   toelf  How many with metaphor     233.00
0  Ottawa             Sent. count     719.00
1  Ottawa                  Tokens   10319.00
2  Ottawa           Unique tokens    2124.00
3  Ottawa      Avg token per sent      14.35
4  Ottawa      sent with methapor     358.00
5  Ottawa      Testing sent count     144.00
6  Ottawa  How many with metaphor      73.00
