<a href="https://colab.research.google.com/github/Northwind01/metaphors/blob/master/2_Extracting_metaphors.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Metaphor extraction

## 0. Set-up

### Get the spaCy model

In [0]:
# Get the spaCy model for embeddings
!python -m spacy download en_core_web_lg

Collecting en_core_web_lg==2.1.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.1.0/en_core_web_lg-2.1.0.tar.gz (826.9MB)
Building wheels for collected packages: en-core-web-lg
  Building wheel for en-core-web-lg (setup.py) ... done
  Created wheel for en-core-web-lg: filename=en_core_web_lg-2.1.0-cp36-none-any.whl size=828255076 sha256=73294702c4782ac1a9c50cfc95503687af5a7a9cf745d1d48c14993cbef8e94c
  Stored in directory: /tmp/pip-ephem-wheel-cache-aosayh3z/wheels/b4/d7/70/426d313a459f82ed5e06cc36a50e2bb2f0ec5cb31d8e0bdf09
Successfully built en-core-web-lg
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-2.1.0
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_lg')


In [0]:
# Get more vectors
!python -m spacy download en_vectors_web_lg

Collecting en_vectors_web_lg==2.1.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_vectors_web_lg-2.1.0/en_vectors_web_lg-2.1.0.tar.gz (661.8MB)
Building wheels for collected packages: en-vectors-web-lg
  Building wheel for en-vectors-web-lg (setup.py) ... done
  Created wheel for en-vectors-web-lg: filename=en_vectors_web_lg-2.1.0-cp36-none-any.whl size=663461749 sha256=5c4d62404352d8d3c0379f5461bb1d26ea8e5916a43da37c842b048525a22598
  Stored in directory: /tmp/pip-ephem-wheel-cache-xnkiwjlj/wheels/ce/3e/83/59647d0b4584003cce18fb68ecda2866e7c7b2722c3ecaddaf
Successfully built en-vectors-web-lg
Installing collected packages: en-vectors-web-lg
Successfully installed en-vectors-web-lg-2.1.0
✔ Download and installation successful
You can now load the model via spacy.load('en_vectors_web_lg')


When both downloads are done, restart the runtime so that the newly installed models can be loaded.

### Imports

In [0]:
import sys, os
import numpy as np
import pandas as pd
import spacy
from spacy import displacy
import nltk
from nltk import Tree
from nltk.corpus import wordnet as wn
from nltk.stem import WordNetLemmatizer 
lemmatizer = WordNetLemmatizer()
import _pickle as cPickle

### Load the spaCy model

In [0]:
# Load the spaCy models
nlp = spacy.load('en_core_web_lg')
nlp_vec = spacy.load('en_vectors_web_lg')

### Load Wordnet

In [0]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

### Get Google drive access

In [0]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


### Set the paths

In [0]:
root_path = 'gdrive/My Drive/metaphors/'
pickle_dir = root_path + 'data/pickles/'
preprocessing_dir = pickle_dir + 'pre_processing/'
metaphors_dir = pickle_dir + 'extracted_metaphors/'

## 1. Exploring dependency information (INFO)

In [0]:
# Get some data for checks
path = os.path.join(preprocessing_dir, 'nlp_articles1000.pickle')
with open(path, "rb") as input_file:
  articles = cPickle.load(input_file)

In [0]:
sent = list(nlp('I am afraid this spells trouble.').sents)[0]
#sent = list(articles[0].sents)[157]

In [0]:
displacy.render(sent, style='ent', jupyter=True)


In [0]:
displacy.render(sent, style='dep', jupyter=True)

In [0]:
def tok_format(tok):
    return "_".join([tok.orth_, tok.dep_])

def to_nltk_tree(node):
    if node.n_lefts + node.n_rights > 0:
        return Tree(tok_format(node), [to_nltk_tree(child) for child in node.children])
    else:
        return tok_format(node)

In [0]:
to_nltk_tree(sent.root).pretty_print()

        am_ROOT                                     
    _______|____________________                     
   |       |               afraid_acomp             
   |       |                    |                    
   |       |               spells_ccomp             
   |       |         ___________|____________        
I_nsubj ._punct this_nsubj              trouble_dobj



In [0]:
verb = sent.root
verb

am

## 2. Helper functions

### Getting synonyms and hypernyms

In [0]:
def get_candidates(verb):
  '''Gets all synonyms and hypernyms of a verb.
        Synonym: "a word or phrase that means exactly or nearly the same"
        Hypernym: "a word with a broad meaning constituting a category into which words with more specific meanings fall; a superordinate"
  
  Args:
    verb (str): the target verb
  
  Returns:
    candidates ([str]): all synonyms and hypernyms of a verb
  '''
  candidates = []
  hyper = lambda s: s.hypernyms()

  # Creating list of candidates
  for syn in wn.synsets(verb, wn.VERB): # Only verbs considered
    #print('The synset: ' + str(syn))

    # Get synonyms
    candidates.extend(syn.lemma_names())
    #print('Synonyms: ' + str(candidates))

    # Get hypernyms
    hypernyms = list(syn.closure(hyper, depth=1)) # We consider only direct hypernyms
    hyper_lemmas = [hyp.lemma_names() for hyp in hypernyms]
    flattened = [lemma for hyp in hyper_lemmas for lemma in hyp]
    candidates.extend(flattened)
    #print('Hypernyms: ' + str(flattened))
  
  # Excluding duplicates
  candidates = set(candidates)

  # Discarding the target verb from the set in base form
  #candidates.discard(lemmatizer.lemmatize(verb, 'v')) # It does not have to be discarded

  return list(candidates)

In [0]:
# Just checking
candidates = get_candidates(verb.text)
len(candidates)

22

In [0]:
print('Target verb: ' + verb.text)
print('Candidates: ' + str(candidates))

Target verb: am
Candidates: ['represent', 'constitute', 'symbolise', 'live', 'rest', 'cost', 'personify', 'be', 'embody', 'follow', 'stand_for', 'occupy', 'symbolize', 'make_up', 'exist', 'use_up', 'typify', 'take', 'stay', 'remain', 'comprise', 'equal']


### Getting all inflections of a verb

Inflections are generated with LemmInflect: https://lemminflect.readthedocs.io/en/latest/inflections/

In [0]:
!pip3 install lemminflect

Collecting lemminflect
  Downloading https://files.pythonhosted.org/packages/cf/0c/cce8e1831b53c2d40cd36c87ce77d9ea7bae9bba17d0b01a6cece129e6a7/lemminflect-0.2.0-py3-none-any.whl (769kB)

In [0]:
from lemminflect import getAllInflections

In [0]:
def get_inflections(verbs):
  '''Gets all inflections of a list of verbs.
        
  Args:
    verbs ([str]): the verbs to be inflected
  
  Returns:
    inflections ([str]): a list of all inflections of all the verbs
  '''
  inflections = []

  for verb in verbs:
    infl_dict = getAllInflections(verb, upos='VERB') # verb[1]._.inflect('VERB') is a spaCy extension, but would not work
    infl_list_of_tuples = list(infl_dict.values()) # dict => list of tuples 
    for t in infl_list_of_tuples: # list of tuples => list of inflections
      for infl in t:
        inflections.append(infl)
  
  return list(set(inflections))

In [0]:
# Just checking
candidates = list(set(get_inflections(candidates)))
len(candidates)

85
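
The flattening step in `get_inflections` can be tried without LemmInflect installed. In the sketch below, `toy_inflections` is a hand-written stand-in shaped like the tag-to-forms mapping that `getAllInflections` returns; its contents and the helper name `flatten_inflections` are assumptions for illustration, not actual library output.

```python
from itertools import chain

# Hypothetical stand-in for a getAllInflections(...) result:
# a dict mapping a POS tag to a tuple of inflected forms.
toy_inflections = {
    'VB': ('be',),
    'VBD': ('was', 'were'),
    'VBG': ('being',),
    'VBN': ('been',),
    'VBP': ('am', 'are'),
    'VBZ': ('is',),
}

def flatten_inflections(infl_dict):
    """Flatten a tag -> forms mapping into a sorted list of unique forms."""
    return sorted(set(chain.from_iterable(infl_dict.values())))

forms = flatten_inflections(toy_inflections)
print(forms)
```

Sorting is not needed for the extraction itself; it only makes the result reproducible, since the order of a plain `set` is not guaranteed.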

### Similarity function

In [0]:
from scipy.spatial import distance

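def dist(vector1, vector2, dist_type='cosine'):
  '''Distance between two vectors; returns a large sentinel (100) if it cannot be computed.'''
  try:
    return distance.cdist([vector1], [vector2], dist_type)[0][0]
  except Exception: # e.g. mismatched or malformed vectors
    return 100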
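
For reference, the cosine distance used by `dist` is one minus the cosine of the angle between the two vectors. The pure-Python sketch below is not SciPy's implementation; the name `cosine_distance` and the choice to raise on a zero vector are assumptions made for illustration.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cos(u, v) for two equally long, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Assumption of this sketch: treat zero vectors as an error.
        raise ValueError('cosine distance is undefined for zero vectors')
    return 1.0 - dot / (norm_u * norm_v)

print(cosine_distance([1.0, 0.0], [2.0, 0.0]))   # same direction
print(cosine_distance([1.0, 0.0], [0.0, 3.0]))   # orthogonal
print(cosine_distance([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

Because the cosine lies between -1 and 1, the distance lies between 0 and 2, so the fallback value of 100 in `dist` always ranks a failed comparison last.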

In [0]:
def best_fit(context_vector, candidates):
  '''Finds most likely candidate for the context.
        
  Args:
    context_vector (np.ndarray): average nlp.vector of the words in the sentence, excl. target verb
    candidates ([str]): target verb + synonyms + direct hypernym (in all inflections)
  
  Returns:
    best_fit_verb (str): candidate most similar to the context vector
  '''
  df = pd.DataFrame(candidates, columns=['word'])

  # Get vectors of all the candidates and drop zero-vectors
  df['vector'] = df['word'].apply(lambda w: nlp_vec(w).vector)
  df = df[df['vector'].map(lambda v: v.any())]

  # Get distances between all the candidates and the context vector
  df['dist'] = df['vector'].apply(lambda v: dist(context_vector, v))
  
  # Get index of the minimum-distance verb
  min_index = df['dist'].idxmin()
  #print(df[['word', 'dist']].sort_values('dist')[:])
  
  # Return best fit verb
  return df['word'][min_index]

In [0]:
# Just checking
context_vector = np.mean([w.vector for w in sent if (w != verb) and (w.vector.any())], axis=0)
best_candidate = nlp_vec(best_fit(context_vector, candidates))
print('Similarity: ' + str(best_candidate.similarity(verb)))
print('between "' + verb.text + '" and "' + best_candidate.text + '"')
print('in the sentence: ' + str(sent))
print('Context: ' + str([w for w in sent if w != verb]))
print('Context vector: ' + str(context_vector)[:100] + '...')

Similarity: 0.4092633399087371
between "am" and "be"
in the sentence: I am afraid this spells trouble.
Context: [I, afraid, this, spells, trouble, .]
Context vector: [-8.98140073e-02  1.71537519e-01 -2.80563682e-01 -1.72855005e-01
 -1.30782321e-01 -3.76488306e-02  8...
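
The selection logic in `best_fit` (drop zero vectors, then take the candidate with the smallest cosine distance to the context) can be reproduced on toy data. The two-dimensional vectors, the word `unknownword`, and the helper names below are hypothetical; real vectors come from `nlp_vec`.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cos(u, v) for two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Hypothetical toy "embeddings" for three candidates.
toy_vectors = {
    'be': (0.9, 0.1),
    'cost': (0.1, 0.9),
    'unknownword': (0.0, 0.0),  # zero vector: no embedding available
}

def best_fit_toy(context_vector, vectors):
    """Return the word whose non-zero vector is closest to the context."""
    scored = {
        word: cosine_distance(context_vector, vec)
        for word, vec in vectors.items()
        if any(vec)  # drop zero vectors, as best_fit does
    }
    return min(scored, key=scored.get)

print(best_fit_toy((1.0, 0.2), toy_vectors))
```

A context pointing mostly along the first axis selects `be`, while a context such as `(0.2, 1.0)` would select `cost`.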


## 3. Metaphor extraction

In [0]:
# Initialize the variable: list of tuples (metaphorical verb, target verb)
metaphors = []
threshold = 0.6**3
threshold

0.21599999999999997

In [0]:
def extract_met(doc): # input is nlp doc
  '''Extracts list of verb pairs from an article:
        1. target verb: metaphorically used verb
        2. fit verb: most popular alternative / literally used verb

  Args:
    doc (spaCy doc object): text to be used for extraction
  
  Returns:
    metaphors_in_the_article ([tuple]): all valid metaphor pairs
  '''
  metaphors_in_the_article = []
  # Get to each verb in each sentence
  sents = doc.sents
  i = -1
  
  for sent in sents:
    verbs = [token for token in sent if (token.pos_ == 'VERB') and (nlp_vec(token.text).vector.any())] # only get verbs which have non-zero vectors
    i = i + 1
    for verb in verbs:
      # Get the candidates and their inflections
      candidates = get_inflections(get_candidates(verb.text))

      if len(candidates) > 0:
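        # Extract the context vector as an average of non-zero vectors of the words in the sentence (excl. punctuation; lower/uppercase words have same vectors)
        context_words = [w.vector for w in sent if (w != verb) and (w.vector.any()) and (w.pos_ not in ['PUNCT', 'PART'])]
        # Averaging an empty list would yield NaN (with a warning), so fall back to an empty vector that fails the check below
        context_vector = np.mean(context_words, axis=0) if context_words else np.zeros(0)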

        if context_vector.any():
          # Find best fit from the candidates for the context
          best_candidate = nlp_vec(best_fit(context_vector, candidates))[0]

          # Check the similarity threshold (compare the texts, since the two tokens belong to different docs)
          if (verb.text != best_candidate.text) and (best_candidate.similarity(verb) < threshold):

            # Take the pair if dissimilar enough
            pair = (verb, best_candidate)
            print('Pair: '+ verb.text  + ' => ' + best_candidate.text + '       from: ' + str(i) + ' ' + str(sent))
            metaphors_in_the_article.append(pair)

  return metaphors_in_the_article

In [0]:
# Just checking:
print([w for w in sent if (w != verb) and (w.vector.any()) and (w.pos_ not in ['PUNCT', 'PART'])])

[I, afraid, this, spells, trouble]
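
The context filtering checked above can be sketched in plain Python. The `(word, part-of-speech, vector)` triples, the word `xyzzy`, and both helper names are hypothetical; the sketch also compares words by their text, whereas the notebook compares spaCy tokens.

```python
def mean_vector(vectors):
    """Element-wise mean of equally long vectors; None if there are none."""
    vectors = list(vectors)
    if not vectors:
        return None
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

# Hypothetical toy sentence with 2-d vectors.
toy_sentence = [
    ('I', 'PRON', (1.0, 0.0)),
    ('am', 'VERB', (5.0, 5.0)),     # target verb: excluded
    ('afraid', 'ADJ', (0.0, 1.0)),
    ('.', 'PUNCT', (9.0, 9.0)),     # punctuation: excluded
    ('xyzzy', 'NOUN', (0.0, 0.0)),  # zero vector: excluded
]

def context_vector(sentence, target):
    """Average the vectors of all usable words except the target verb."""
    kept = [
        vec for word, pos, vec in sentence
        if word != target and any(vec) and pos not in ('PUNCT', 'PART')
    ]
    return mean_vector(kept)

print(context_vector(toy_sentence, 'am'))
```

Returning `None` for an empty context makes the "nothing to compare against" case explicit, so a caller can skip the verb instead of averaging an empty list.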


In [0]:
# Just checking
extract_met(articles[0])

Pair: combines => given       from: 7 The Diagnostic and Statistical Manual of Mental Disorders (DSM-5), combines autism and less severe forms of the condition, including Asperger syndrome and pervasive developmental disorder not otherwise specified (PDD-NOS) into the diagnosis of autism spectrum disorder (ASD).
Pair: aged => change       from: 52 aged 8–15 performed equally well as, and as adults better than, individually matched controls at basic language tasks involving vocabulary and spelling.
Pair: categorizes => reasons       from: 59 Autistic individuals can display many forms of repetitive or restricted behavior, which the Repetitive Behavior Scale-Revised (RBS-R) categorizes as follows.
Pair: occurs => came       from: 79 Unusual eating behavior occurs in about three-quarters of children with ASD, to the extent that it was formerly a diagnostic indicator.
Pair: sequencing => found       from: 95 Many genes have been associated with autism through sequencing the genomes of affe

[(combines, given),
 (aged, change),
 (categorizes, reasons),
 (occurs, came),
 (sequencing, found),
 (imprinted, work),
 (aggravating, changes),
 (brominated, treating),
 (controlled, seen),
 (originated, make),
 (performs, did),
 (warranted, support),
 (coexisting, be),
 (precede, going),
 (meets, have),
 (diagnosed, names),
 (gesturing, motion),
 (screened, take),
 (precede, going),
 (licensed, clear),
 (substantiated, be),
 (utilizes, change),
 (recommends, change),
 (integrating, turn),
 (modulating, changes),
 (associated, think),
 (outweigh, rules),
 (documented, entering),
 (ranged, be),
 (excludes, lack),
 (annul, avoid),
 (gain, made),
 (coined, striking),
 (defining, was),
 (labeled, told),
 (multiply, making),
 (withdrawn, going),
 (emphasizes, shown),
 (cured, change)]

In [0]:
# Loop through the files and extract info
n_metaphors = 0 # running total of extracted pairs across all files
for filename in os.listdir(preprocessing_dir):
  if filename.endswith(".pickle"):
    path = os.path.join(preprocessing_dir, filename)
    with open(path, "rb") as input_file:
      docs = pd.Series(cPickle.load(input_file))
      metaphors.extend(docs.apply(extract_met))
    
    # Save the extracted pairs
    pickle_file = os.path.join(metaphors_dir, filename + '.pickle')
    with open(pickle_file, "wb") as output_file:
      cPickle.dump(metaphors, output_file)

    # Count the pairs before resetting the per-file list (one list of pairs per article)
    n_metaphors += sum(len(pairs) for pairs in metaphors)
    metaphors = []

    print('Processed '+ filename)

print('Number of metaphors: ' + str(n_metaphors))

print('Processing complete!')

Pair: combines => given       from: 7 The Diagnostic and Statistical Manual of Mental Disorders (DSM-5), combines autism and less severe forms of the condition, including Asperger syndrome and pervasive developmental disorder not otherwise specified (PDD-NOS) into the diagnosis of autism spectrum disorder (ASD).
Pair: aged => change       from: 52 aged 8–15 performed equally well as, and as adults better than, individually matched controls at basic language tasks involving vocabulary and spelling.
Pair: categorizes => reasons       from: 59 Autistic individuals can display many forms of repetitive or restricted behavior, which the Repetitive Behavior Scale-Revised (RBS-R) categorizes as follows.
Pair: occurs => came       from: 79 Unusual eating behavior occurs in about three-quarters of children with ASD, to the extent that it was formerly a diagnostic indicator.
Pair: sequencing => found       from: 95 Many genes have been associated with autism through sequencing the genomes of affe


Pair: See => regarded       from: 171 See also==
Pair: derives => make       from: 2 It is similar in shape to the Ancient Greek letter alpha, from which it derives.
Pair: corresponded => be       from: 22 Its name is thought to have corresponded closely to the Paleo-Hebrew or Arabic aleph.
Pair: denoted => meant       from: 27 When the ancient Greeks adopted the alphabet, they had no use for a letter to represent the glottal stop—the consonant sound that the letter denoted in Phoenician and other Semitic languages, and that was the first phoneme of the Phoenician pronunciation of the letter
Pair: resembles => check       from: 29 In the earliest Greek inscriptions after the Greek Dark Ages, dating to the 8th century BC, the letter rests upon its side, but in the Greek alphabet of later times it generally resembles the modern capital letter, although many local varieties can be distinguished by the shortening of one leg, or by the angle at which the cross line is set.
Pair: inscribing 

KeyboardInterrupt: ignored