Metaphor Extraction using BERT Sentence Transformer
> Given the lack of clean-language Q/A data for T5 model training as well as computationally expensive training of a large-language model, such as T5, the following approach is proposed:

1- Find the closest “Symbolic Images” to a user utterance (top 5 based on CSS)

- Tools used
    - BERT Sentence Transformer
    - Cosine Similarity Score (CSS)
    - 2 Datasets: image_dictionary (18,857 rows), Symbolic_Image_Dictionary_and_Questions (12,115 rows)
2- Use THE most frequent Symbolic Image, that is also present in the user utterance, as the “metaphor”

- Tools used
    - NLTK Lemmatizing &amp; Stemming
    - Spacy tokenization
    - Logic loop  </span></span></span></span></span></span></span></span></span>

In [1]:
!pip install transformers
!pip install sentence_transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.23.1-py3-none-any.whl (5.3 MB)
[K     |████████████████████████████████| 5.3 MB 5.3 MB/s 
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
[K     |████████████████████████████████| 7.6 MB 36.3 MB/s 
Collecting huggingface-hub<1.0,>=0.10.0
  Downloading huggingface_hub-0.10.1-py3-none-any.whl (163 kB)
[K     |████████████████████████████████| 163 kB 60.0 MB/s 
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.10.1 tokenizers-0.13.1 transformers-4.23.1
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sentence_transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
[K     |████████████████████████████████|

In [2]:
import numpy as np
import pandas as pd
pd.set_option('display.max_colwidth', None)   # displays the full text of a DF cell
import re
import random

from sentence_transformers import SentenceTransformer, util

import nltk
from nltk.corpus import wordnet
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktLanguageVars

import spacy
# nlp = spacy.load("en_core_web_trf")  # accuracy
nlp = spacy.load("en_core_web_sm")   # effeciency, works great for this project

import warnings
warnings.filterwarnings('ignore')

In [5]:
# Import and Clean "Linguistic Image Data"
df = pd.read_excel('/content/Symbolic_Image_Dictionary_and_Questions.xlsx')
df.drop(df.iloc[:, 1:].columns, axis = 1, inplace = True)

df.columns = ['ling_image']

# remove punctuation 
df['ling_image'] = df['ling_image'].apply(lambda x: ' '.join(re.findall('[^!.? ]+', str(x))))   # find anything that is NOT !, ., ?, or white space

# remove white space at both ends
df['ling_image'] = df['ling_image'].str.strip()

# lowercase "ling_image" col
df['ling_image'] = df['ling_image'].str.lower()

# delete duplicate rows
df.drop_duplicates(inplace=True)
df.reset_index(drop=True, inplace=True)

In [6]:
# Import and Clean "Symbolic_Image_Dictionary_and_Questions"
df_symb_dict = pd.read_excel("Symbolic_Image_Dictionary_and_Questions.xlsx")
df_symb_dict = df_symb_dict[["Entity", "Naming", "Reflecting", "Expanding"]]
df_symb_dict.columns = [
    "symbolic_image",
    "symb_naming_qustn",
    "symb_reflecting_qustn",
    "symb_expanding_qustn",
]

# remove punctuation
df_symb_dict["symbolic_image"] = df_symb_dict["symbolic_image"].apply(
    lambda x: " ".join(re.findall("[^!.? ]+", str(x)))
)  # find anything that is NOT !, ., ?, or white space

# remove white space at both ends
df_symb_dict["symbolic_image"] = df_symb_dict["symbolic_image"].str.strip()

# lowercase "Entity"
df_symb_dict["symbolic_image"] = df_symb_dict["symbolic_image"].str.lower()

# delete duplicate rows
df_symb_dict["symbolic_image"].drop_duplicates(inplace=True)
df_symb_dict["symbolic_image"].reset_index(drop=True, inplace=True)

In [7]:
# initiate BERT Sentence Transformer
model = SentenceTransformer("all-MiniLM-L6-v2")

# Compute df["ling_image"] embedding
sents_embeddings = model.encode(
    list(df["ling_image"]), 
    convert_to_tensor=True, 
    normalize_embeddings=True
)

# Compute df_symb_dict["symbolic_image"] embedding
symbol_embedding = model.encode(
    list(df_symb_dict["symbolic_image"]),
    convert_to_tensor=True,
    normalize_embeddings=True,
)

Downloading:   0%|          | 0.00/1.18k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/190 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/10.6k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/612 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/116 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/39.3k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/112 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/466k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/350 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/13.2k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/349 [00:00<?, ?B/s]

In [8]:
# https://github.com/aswintechguy/Deep-Learning-Projects/blob/main/Abstractive%20Text%20Summarization%20Transformer%20Model%20-%20NLP/Abstractive_Text_Summarization_Transformer_Model.ipynb

# OPTIONAL: t5 summarization (could be used in future iterations)
import torch
from transformers import T5Config, T5ForConditionalGeneration, T5Tokenizer

In [9]:
def t5_summary(utterance):

    # initialize the pretrained model
    model = T5ForConditionalGeneration.from_pretrained("t5-small")
    tokenizer = T5Tokenizer.from_pretrained("t5-small")
    torch_device = "cuda" if torch.cuda.is_available() else "cpu"

    # preprocess the input text
    preprocessed_text = utterance.strip().replace("\n", "")
    t5_input_text = "summarize: " + preprocessed_text

    # tokenize
    ## padding=True    ---> pad the shorter sequences (sentences) in the batch to match the longest sequence
    ## truncation=True ---> truncate a sequence to the maximum length accepted by the model
    tokenized_text = tokenizer.encode(t5_input_text, 
                                      return_tensors="pt", 
                                      padding=True,          # or = "max_length", then provide max_length=512 for example
                                      truncation=True).to(torch_device)

    # summarize
    summary_ids = model.generate(tokenized_text, min_length=30, max_length=150)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary

In [11]:
# # OPTIONAL: sentence segmentation (could be used in future iterations ---> more advanced would be sentence segmentation to its clauses)
def sent_segm(utterance):
    """
    create a list of sentences from our utterance by separating sentences no matter what the punctuation (e.g. '.' or ';')

    """
    
    class EndLangVars(PunktLanguageVars):
        sent_end_chars = ('.', '?', '!', ';')
    tokenizer = PunktSentenceTokenizer(lang_vars = EndLangVars())
    
    utterance_punc_separated = []
    for sent in tokenizer.tokenize(utterance):
        # remove punctuation
        utterance_punc_separated.append(' '.join(re.findall('[^!.?; ]+', sent)))   # find anything that is NOT !, ., ?, or white space
    # remove any empty string sentences if present
    utterance_punc_separated = [utterance for utterance in utterance_punc_separated if len(utterance)>0]


    # further segmentation of "utterance_punc_separated" to its individual sentences
    utterance_sents = []
    for utterance in utterance_punc_separated:
        doc = nlp(utterance)
        seen = set()  # keep track of covered words
        chunks = []

        for sent in doc.sents:
            heads = [cc for cc in sent.root.children if cc.dep_ == "conj"]

            for head in heads:
                words = [ww for ww in head.subtree]
                for word in words:
                    seen.add(word)
                chunk = " ".join([ww.text for ww in words])
                chunks.append((head.i, chunk))

            unseen = [ww for ww in sent if ww not in seen]
            chunk = " ".join([ww.text for ww in unseen])
            chunks.append((sent.root.i, chunk))

        chunks = sorted(chunks, key=lambda x: x[0])

        # only grab the text (chunk) without the index
        utterance_chunks = " ".join([chunk for ii, chunk in chunks])

        utterance_sents.append(utterance_chunks)

    return utterance_sents


In [12]:
def find_symbolic_images(phrase):
    
    # Compute embedding for a given phrase (--> we will apply to "top_n_sents["ling_image"]")
    ling_image_embedding = model.encode(
        phrase, convert_to_tensor=True, normalize_embeddings=True
    )
    
    # Compute cosine-similarities between our symbol_embedding (df_symb_dict["symbolic_image"] embedding) and ling_image_embedding
    cosine_sim_scores = util.cos_sim(symbol_embedding, ling_image_embedding)
    cosine_sim_df = pd.DataFrame(cosine_sim_scores)
    cosine_sim_df.columns = ["cosine_score"]
    
    # get index of max "cosine_score"
    index_max_cosine_score = cosine_sim_df[cosine_sim_df['cosine_score'] == cosine_sim_df['cosine_score'].max()].index
    
    # output the DF as a dictionary (easier to retrieve for later use)
    return df_symb_dict.iloc[index_max_cosine_score].to_dict('records')   # 'records' return all cells except index values

In [13]:
def metaphor_detection(utterance, sents_embeddings, n):
    """
    utterance: user input text
    sents_embeddings: BERT Sentence Transformer embedding matrix obtained from our linguistic image database
    n: top n linguistic image phrases that we want returned with the highest cosine similarity score (between sents_embeddings & utterance)
    """

    global cosine_sim_df
    
    # Compute utterance embedding
    utterance_embedding = model.encode(
        utterance, convert_to_tensor=True, normalize_embeddings=True
    )

    # Compute cosine-similarities between the utterance and our df sentences
    cosine_sim_scores = util.cos_sim(sents_embeddings, utterance_embedding)
    cosine_sim_df = pd.DataFrame(cosine_sim_scores)
    cosine_sim_df.columns = ["cosine_score"]

    # get top n ling. images based on cos. sim. score
    top_n_cosine_scores = pd.DataFrame(
        cosine_sim_df.sort_values("cosine_score", ascending=False)[:n]
    )
    top_n_cosine_sents = df.iloc[top_n_cosine_scores.index]
    top_n_sents = top_n_cosine_sents.join(top_n_cosine_scores)


    # get the most similar symbolic_image to the ling_image using our custom fn
    top_n_sents['symbolic_image'] = top_n_sents['ling_image'].apply(lambda x: find_symbolic_images(x)[0]['symbolic_image']) 
    top_n_sents['symb_naming_qustn'] = top_n_sents['ling_image'].apply(lambda x: find_symbolic_images(x)[0]['symb_naming_qustn'])
    top_n_sents['symb_reflecting_qustn'] = top_n_sents['ling_image'].apply(lambda x: find_symbolic_images(x)[0]['symb_reflecting_qustn'])
    top_n_sents['symb_expanding_qustn'] = top_n_sents['ling_image'].apply(lambda x: find_symbolic_images(x)[0]['symb_expanding_qustn'])

    return top_n_sents

In [14]:
# metaphor detection
# utterance = "I am trying to hack my way through a dense forest; I also feel like I am living under a dark cloud!!!"
# utterance = "I'm at the edge of a cliff paralized with fear"
utterance =  "I am at the edge of a cliff paralyzed with fear"


top_n_sents = metaphor_detection(utterance, sents_embeddings, 5)
top_n_sents

Unnamed: 0,ling_image,cosine_score,symbolic_image,symb_naming_qustn,symb_reflecting_qustn,symb_expanding_qustn
1992,climbing up cliff,tensor(0.6355),climbing up cliff,When might you confidently make the sheer effort required to deal with a sudden transformation in your continuing progress?,When might you take a different approach to confidently make the sheer effort required to deal with a sudden transformation in your continuing progress?,What other opportunities do you have to confidently make the sheer effort required to deal with a sudden transformation in your continuing progress?
1982,cliff,tensor(0.5657),cliff,Where might there be a looming awareness of a sudden transformation in the sheer effort required to deal with a challenge?,How can you make the most of a sudden transformation in your awareness so that you can make the sheer effort required to deal with an approaching challenge?,Where else might there be a looming awareness of a sudden transformation in the sheer effort required to deal with a challenge?
1983,cliff edge,tensor(0.5518),cliff edge,Where might you be on the threshold of an abrupt change in your circumstances and need to be very confident in your next steps?,Where might you use a different way to be on the threshold of an abrupt change in your circumstances and need to be very confident in your next steps?,Where else might you have the opportunity to be on the threshold of an abrupt change in your circumstances and need to be very confident in your next steps?
1984,cliff face,tensor(0.5274),cliff face,Where might you have to face up to the sheer effort required to reach your objective even though it can make yourself feel insecure?,Where might you use a different way to have to face up to the sheer effort required to reach your objective even though it can make yourself feel insecure?,Where else might you have the opportunity to have to face up to the sheer effort required to reach your objective even though it can make yourself feel insecure?
5871,ledge,tensor(0.4848),ledge,Where might you relax in the knowledge that your sheer effort has helped yourself to attain a much higher level of achievement?,Where might you use a different way to relax in the knowledge that your sheer effort has helped yourself to attain a much higher level of achievement?,Where else might you have the opportunity to relax in the knowledge that your sheer effort has helped yourself to attain a much higher level of achievement?


In [15]:
# custom fn for stemming, lemmatization, and removing stopwords
import gensim
from gensim.utils import simple_preprocess  # Convert a document into a "list of lowercase tokens", ignoring tokens that are too short or too long.
from gensim.parsing.preprocessing import STOPWORDS
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
stemmer = SnowballStemmer('english')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [16]:
# our pre-processing custom functions
def lemmatize_stemming(text):
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))  # pos = "v" => only lemmatize verbs and leave nouns 'n' and adjectives 'a' alone

def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS:
            result.append(lemmatize_stemming(token))
    return str(" ".join(result))

In [17]:
# apply preprocess fn to "symbolic_image" col
top_n_sents['symbolic_image'].apply(preprocess)

1992    climb cliff
1982          cliff
1983      cliff edg
1984     cliff face
5871           ledg
Name: symbolic_image, dtype: object

In [18]:
# count instances of each symbolic_image, exclyding empty strings '' for cases where preprocess() returns '', and sort in descending order
symbolic_image_series = top_n_sents[top_n_sents['symbolic_image'].apply(preprocess) != '']['symbolic_image'].apply(preprocess).value_counts()
symbolic_image_series

climb cliff    1
cliff          1
cliff edg      1
cliff face     1
ledg           1
Name: symbolic_image, dtype: int64

In [19]:
def metaphor_extractor(utterance):
    
    # get top_n_sents DF using "metaphor_detection" fn
    top_n_sents = metaphor_detection(utterance, sents_embeddings, 5)
    
    # symbolic_image_series: count of each symbolic_image lemma, exclyding empty strings '' for cases where preprocess() returns '', sorted in descending order
    symbolic_image_series = top_n_sents[top_n_sents['symbolic_image'].apply(preprocess) != '']['symbolic_image'].apply(preprocess).value_counts()

    # using spacy tokenization of the utterance, return the words (tokens) whose lemma is within symbolic_image_series index words
    doc = nlp(utterance)
    metaphors = []
    for token in doc:
        ## check for two conditions:
           # 1- lemma of token (i.e. each utterance word) should be in the index of symbolic_image_series
           # 2- the "metaphors" output list should have unique lemmas (e.g. cannot have metaphors = ['climbing', 'climbed'])
        if (preprocess(token.text) in symbolic_image_series.index) and (preprocess(token.text) not in list(map(preprocess, metaphors))):
            metaphors.append(token.text)

    return metaphors

In [20]:
metaphor_extractor(utterance)

['cliff']

In [22]:
# test on a DF
df_data = pd.read_csv("/content/clean_language_BERT.csv", encoding="utf-8")

# grab only the user utterances
df_data = df_data[['input_text']]

# apply answer_generator
df_data['metaphors'] = df_data['input_text'].apply(metaphor_extractor)
df_data

Unnamed: 0,input_text,metaphors
0,"I feel like I am living under a dark, heavy cloud!","[dark, cloud]"
1,I am at the edge of a cliff paralyzed with fear,[cliff]
2,I am hacking my way through a dense forest.,[forest]
3,Lately I have been feeling like I'm invisible.,[invisible]
4,I feel like I am a ghost.,[ghost]
5,I feel like I am a spirit made of mist. You can walk right through me.,"[spirit, mist]"
6,I would like to come alive.,[alive]
7,I'm at the edge of a cliff paralysed with fear.,"[cliff, fear]"
8,I can’t get away from the chasing dragon.,[dragon]
9,In my dreams I am always falling.,"[dreams, falling]"


Clean Language Question Generation
> Algorithm

- Using the metaphors detected in the previous step and clean-language rule-based questions, extract the question with the highest CSS to the utterance

- Tools used
    - BERT Sentence Transformer
    - Cosine Similarity Score (CSS)
    - T5 Grammar Correction 
        - And what kind of clouds `is` that? ---&gt; And what kind of clouds `are` that?
        - And where `is that` ghosts from? ---&gt; And where `are those` ghosts from?
    - Clean-Language Question structures from CleanCoach GitHub

In [23]:
# list of "Clean-Language Questions" to ask the user: https://github.com/phughesmcr/CleanCoach/blob/master/lib/questions.js

clean_lang_questions = {

    "generic_qs": [
        "And what would you like to have happen?",
        "And is there anything else?",
        "And what happens next?",
        "And then what happens?",
        "And what needs to happen?",
        "And is there anything else that needs to happen?",
        "And how might you know?",
        "And how will you know?"   # nones        
        "And is there anything else about that?",
        "And whereabouts would you feel them?",
        "And what are they like?",
        "And what is that like?",
        "And what happens just before that?",
        "And where could they come from?",
        "And where could that come from?",
        "And if they happen, what would you like to happen now?",
        "And if that happens, what would you like to happen now?",
        "And what needs to happen for that?",
        "And what needs to happen for those?",
        "And those are like what?",
        "And that is like what?",
    ],
    
    "x_qs": [
        "And can you tell me more about X?",
        # "And what kind of X are they?",
        "And what kind of X is that?",
        "And what kind of X?",
        "And is there anything else about X?",
        # "And where are X?",
        "And where is X?",
        # "And whereabouts are X?",
        "And whereabouts is X?",
        # "And X are like what?",
        "And X is like what?",
        "And that's X like what?",
        "And when X happens, you're like what?",
        # "And when X happen, you're like what?",
        # "And when X happens, what happens next?",
        "And what happens after X?",
        "And what happens just before X?",
        "And where could X come from?",
        "And what needs to happen for X?",
        "And could X happen?",
        # "And when X, those are like what?",
        "And when X, that is like what?",
        "And does X have a size or a shape?",
        # "And how many X’s could there be?",
        "And how old could X be?",
        "And what could X be wearing?",
        "And is X on the inside or outside?",
        "And where is that X from?",
        # "And what kind of X was that X before it was X?",
        # "And what would X like to have happen?"
    ],
    
    "xy_qs": [
        "And how might you describe X and Y?",
        "And is there a relationship between X and Y?",
        "And what is the relationship between X and Y?",
        # "And when X happens, what happens to Y?",
        # "And when X happen, what happens to Y?",
        # "And when Y happens, what happens to X?",
        # "And when Y happen, what happens to X?",
        # "And when X, what happens to Y?",
        "And is the X the same as or different from Y?",
        # "And what’s between X and Y?",
        # "And would X be interested in going to Y?"
    ]
}

In [24]:
clean_quest_df = pd.DataFrame.from_dict(clean_lang_questions, orient='index').transpose()
clean_quest_df

Unnamed: 0,generic_qs,x_qs,xy_qs
0,And what would you like to have happen?,And can you tell me more about X?,And how might you describe X and Y?
1,And is there anything else?,And what kind of X is that?,And is there a relationship between X and Y?
2,And what happens next?,And what kind of X?,And what is the relationship between X and Y?
3,And then what happens?,And is there anything else about X?,And is the X the same as or different from Y?
4,And what needs to happen?,And where is X?,
5,And is there anything else that needs to happen?,And whereabouts is X?,
6,And how might you know?,And X is like what?,
7,And how will you know?And is there anything else about that?,And that's X like what?,
8,And whereabouts would you feel them?,"And when X happens, you're like what?",
9,And what are they like?,And what happens after X?,


In [25]:
!pip install happytransformer


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting happytransformer
  Downloading happytransformer-2.4.1-py3-none-any.whl (45 kB)
[K     |████████████████████████████████| 45 kB 2.2 MB/s 
Collecting datasets>=1.6.0
  Downloading datasets-2.6.1-py3-none-any.whl (441 kB)
[K     |████████████████████████████████| 441 kB 9.0 MB/s 
Collecting responses<0.19
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting xxhash
  Downloading xxhash-3.1.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
[K     |████████████████████████████████| 212 kB 65.1 MB/s 
[?25hCollecting multiprocess
  Downloading multiprocess-0.70.13-py37-none-any.whl (115 kB)
[K     |████████████████████████████████| 115 kB 58.8 MB/s 
Collecting urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1
  Downloading urllib3-1.25.11-py2.py3-none-any.whl (127 kB)
[K     |████████████████████████████████| 127 kB 48.7 MB/s 
Installing collected packages:

In [26]:
# initiate BERT Sentence Transformer
model = SentenceTransformer("all-MiniLM-L6-v2")

# initiate T5 grammer correction model
from happytransformer import HappyTextToText, TTSettings
happy_tt = HappyTextToText("T5", "vennify/t5-base-grammar-correction")
args = TTSettings(num_beams=5, min_length=1)


def question_detection(utterance, metaphors):
    """
    utterance: user input text
    qs_embeddings: BERT Sentence Transformer embedding matrix obtained from our clean-language questions DF
    metaphors: list of metaphors
    """

    global cosine_sim_df
    clean_quest_df = pd.DataFrame.from_dict(clean_lang_questions, orient='index').transpose()

    # Compute utterance embedding
    utterance_embedding = model.encode(
        utterance, convert_to_tensor=True, normalize_embeddings=True
    )
    
    n=len(metaphors)
    if n == 0:   # no metaphors
        # embed clean_quest_df["generic_qs"]
        generic_qs_embeddings = model.encode(
            list(clean_quest_df[~clean_quest_df["generic_qs"].isna()]["xy_qs"]),
            convert_to_tensor=True,
            normalize_embeddings=True
        )        
        
        cosine_sim_scores = util.cos_sim(generic_qs_embeddings, utterance_embedding)
        cosine_sim_df = pd.DataFrame(cosine_sim_scores)
        cosine_sim_df.columns = ["cosine_score"]

        top_n_cosine_scores = pd.DataFrame(
                cosine_sim_df.sort_values("cosine_score", ascending=False)
            )
        # display(top_n_cosine_scores)

        top_cosine_sent = clean_quest_df.loc[top_n_cosine_scores[:1].index, 'generic_qs'].iloc[0]    
    
    elif n == 1:
        # replace X with our metaphor in clean_quest_df
        clean_quest_df['x_qs'].replace("X", metaphors[0], regex=True, inplace=True)
        
        # embed clean_quest_df["x_qs"]
        x_qs_embeddings = model.encode(
            list(clean_quest_df[~clean_quest_df["x_qs"].isna()]["xy_qs"]),
            convert_to_tensor=True,
            normalize_embeddings=True
        )

        cosine_sim_scores = util.cos_sim(x_qs_embeddings, utterance_embedding)
        cosine_sim_df = pd.DataFrame(cosine_sim_scores)
        cosine_sim_df.columns = ["cosine_score"]

        top_n_cosine_scores = pd.DataFrame(
                cosine_sim_df.sort_values("cosine_score", ascending=False)
            )
        # display(top_n_cosine_scores)

        top_cosine_sent = clean_quest_df.loc[top_n_cosine_scores[:1].index, 'x_qs'].iloc[0]
    
    
    else:   # n >=2
        metaphors = metaphors[0:2]   # take the first two metaphors
        
        # replace X & Y with our 2 metaphors in clean_quest_df ---> X = metaphors[0], Y = metaphors[1]
        place_holders = ["X", "Y"]
        for i, j in zip(place_holders, metaphors):
            clean_quest_df["xy_qs"].replace(i, j, regex=True, inplace=True)
        
        xy_qs_embeddings = model.encode(
            list(clean_quest_df[~clean_quest_df["xy_qs"].isna()]["xy_qs"]),
            convert_to_tensor=True,
            normalize_embeddings=True
        )        
        
        cosine_sim_scores = util.cos_sim(xy_qs_embeddings, utterance_embedding)
        cosine_sim_df = pd.DataFrame(cosine_sim_scores)
        cosine_sim_df.columns = ["cosine_score"]

        top_n_cosine_scores = pd.DataFrame(
                cosine_sim_df.sort_values("cosine_score", ascending=False)
            )
        # display(top_n_cosine_scores)

        top_cosine_sent = clean_quest_df.loc[top_n_cosine_scores[:1].index, 'xy_qs'].iloc[0]
    
    
    # T5 grammer correction: Add the prefix "grammar: " before each input 
    top_cosine_sent = happy_tt.generate_text(f"grammar: {top_cosine_sent}", args=args)
    
    return top_cosine_sent.text   # , top_n_cosine_scores[:1].iloc[0][0]

Downloading:   0%|          | 0.00/1.42k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/892M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.92k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/792k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.39M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.79k [00:00<?, ?B/s]

In [27]:
utterance = "I am hacking my way through a dense forest"
metaphors = ['hacking', 'forest']
question_detection(utterance, metaphors)

'And how might you describe hacking and forest?'

In [28]:
# test
df_data['clean_lang_q'] = df_data.apply(lambda x: question_detection(x['input_text'], x['metaphors']), axis=1)
df_data

Unnamed: 0,input_text,metaphors,clean_lang_q
0,"I feel like I am living under a dark, heavy cloud!","[dark, cloud]",And how might you describe dark and cloud?
1,I am at the edge of a cliff paralyzed with fear,[cliff],And can you tell me more about cliff?
2,I am hacking my way through a dense forest.,[forest],And can you tell me more about the forest?
3,Lately I have been feeling like I'm invisible.,[invisible],And what happens just before invisible?
4,I feel like I am a ghost.,[ghost],And what happens just before the ghost?
5,I feel like I am a spirit made of mist. You can walk right through me.,"[spirit, mist]",And how might you describe spirit and mist?
6,I would like to come alive.,[alive],And what happens just before you are alive?
7,I'm at the edge of a cliff paralysed with fear.,"[cliff, fear]",And how might you describe cliff and fear?
8,I can’t get away from the chasing dragon.,[dragon],And what happens just before dragon?
9,In my dreams I am always falling.,"[dreams, falling]",And what is the relationship between dreams and falling?


###**Softening**

Softening means adding words/qualifiers to a sentence so we can make it less commanding and more polite, ambiguous, so we offer a space for the user to response using his/her agency rather than commanding him/her to respond a certain way. For example, instead of saying "And what will happen now?", we add qualifiers and soften by saying "And what MIGHT happen now PERHAPS?".

To achieve softening, the workflow is as follows:

Create a dictionary of templated responses (e.g.) And what you like to have happen ? where those in <> are the words we fill in as we generate questions.
Generate questions using Sam Faar's code.
Replace the bracketed (<>) terms with the list of qualifiers. As of October 5, 2022, the conditional qualifiers are might and would and the ending_qualifiers are perhaps. Feel free to add on to this list.

In [29]:

df_data_no_metaphors = df_data[df_data["metaphors"].str.len() == 0]

In [30]:
# Update clean language questions with softening qualifiers

clean_lang_questions = {

    "generic_qs": [
        "And what would you like to have happen?",
        "And what might you like to have happen?",
        "And would there be anything else?"
        "And might there be anything else?",
        "And what happens next?",
        "And what might happen next perhaps?",
        "And what could happen next perhaps?"
        "And then what happens?",
        "And what needs to happen?",
        "And is there anything else that needs to happen?",
        "And would there be anything else that needs to happen?",
        "And might there be anything else that needs to happen?",
        "And how might you know?",
        "And how will you know?",   # nones
        "And how could you know perhaps?"        
        "And is there anything else about that?",
        "And might there be anything else about that?",
        "And might there be anything else about that perhaps?",
        "And could there be anything else about that?",
        "And could there be anything else about that perhaps?",
        "And what are they like?",
        "And what might they be like?",
        "And what could they be like?",
        "And what might they be like perhaps?",
        "And what could they be like perhaps?",
        "And what might it be like?",
        "And what could it be like?",
        "And what might it be like perhaps?",
        "And what could it be like perhaps?",
        "And what is that like?",
        "And what happens just before that?",
        "And where could they come from?",
        "And where could that come from?",
        "And where might they come from?",
        "And where might that come from?",
        "And if they happen, what would you like to happen now?",
        "And if that happens, what would you like to happen now?",
        "And what needs to happen for that?",
        "And what needs to happen for those?",
        "And those are like what?",
        "And that is like what?",
    ],
    
    "x_qs": [
        "And can you tell me more about X?",
        # "And what kind of X are they?",
        "And what kind of X is that?",
        "And what kind of X?",
        "And is there anything else about X?",
        # "And where are X?",
        "And where is X?",
        # "And whereabouts are X?",
        "And whereabouts is X?",
        # "And X are like what?",
        "And X is like what?",
        "And that's X like what?",
        "And when X happens, you're like what?",
        # "And when X happen, you're like what?",
        # "And when X happens, what happens next?",
        "And what happens after X?",
        "And what happens just before X?",
        "And where could X come from?",
        "And what needs to happen for X?",
        "And could X happen?",
        # "And when X, those are like what?",
        "And when X, that is like what?",
        "And does X have a size or a shape?",
        # "And how many X’s could there be?",
        "And how old could X be?",
        "And what could X be wearing?",
        "And is X on the inside or outside?",
        "And where is that X from?",
        # "And what kind of X was that X before it was X?",
        # "And what would X like to have happen?"
    ],
    
    "xy_qs": [
        "And how might you describe X and Y?",
        "And is there a relationship between X and Y?",
        "And what is the relationship between X and Y?",
        # "And when X happens, what happens to Y?",
        # "And when X happen, what happens to Y?",
        # "And when Y happens, what happens to X?",
        # "And when Y happen, what happens to X?",
        # "And when X, what happens to Y?",
        "And is the X the same as or different from Y?",
        # "And what’s between X and Y?",
        # "And would X be interested in going to Y?"
    ]
}

In [31]:
# Clean language question bank with <qualifier tags>

clean_lang_questions = {

    "generic_qs": [
        "And what <conditional_qualifier> you like to have happen?",
        "And <conditional_qualifier> there be anything else?",
        "And what <conditional_qualifier> happen next?",
        "And what <conditional_qualifier> happen next <ending_qualifier>?",
        "And <conditional_qualifier> there be anything else that needs to happen?",
        "And how <conditional_qualifier> you know?",
        "And how <conditional_qualifier> you know <ending_qualifier>?",        
        "And <conditional_qualifier> there be anything else about that?",
        "And <conditional_qualifier> there be anything else about that <ending_qualifier>?"
        "And what <conditional_qualifier> they be like?",
        "And what <conditional_qualifier> they be like <ending_qualifier>?",
        "And what <conditional_qualifier> it be like?",
        "And what <conditional_qualifier> it be like <ending_qualifier>?",
        "And what <conditional_qualifier> happen just before that?",
        "And where <conditional_qualifier> they come from?",
        "And where <conditional_qualifier> that come from?",
        "And if they happen, what <conditional_qualifier> you like to happen now?",
        "And if that happens, what <conditional_qualifier> you like to happen now?",
        "And what <conditional_qualifier> you need to happen for that?",
        "And what <conditional_qualifier> you need to happen for that <ending_qualifier>?"
    ],
    
    "x_qs": [
        "And can you tell me more about X?",
        # "And what kind of X are they?",
        "And what kind of X is that?",
        "And what kind of X?",
        "And is there anything else about X?",
        # "And where are X?",
        "And where is X?",
        # "And whereabouts are X?",
        "And whereabouts is X?",
        # "And X are like what?",
        "And X is like what?",
        "And that's X like what?",
        "And when X happens, you're like what?",
        # "And when X happen, you're like what?",
        # "And when X happens, what happens next?",
        "And what happens after X?",
        "And what happens just before X?",
        "And where could X come from?",
        "And what needs to happen for X?",
        "And could X happen?",
        # "And when X, those are like what?",
        "And when X, that is like what?",
        "And does X have a size or a shape?",
        # "And how many X’s could there be?",
        "And how old could X be?",
        "And what could X be wearing?",
        "And is X on the inside or outside?",
        "And where is that X from?",
        # "And what kind of X was that X before it was X?",
        # "And what would X like to have happen?"
    ],
    
    "xy_qs": [
        "And how might you describe X and Y?",
        "And is there a relationship between X and Y?",
        "And what is the relationship between X and Y?",
        # "And when X happens, what happens to Y?",
        # "And when X happen, what happens to Y?",
        # "And when Y happens, what happens to X?",
        # "And when Y happen, what happens to X?",
        # "And when X, what happens to Y?",
        "And is the X the same as or different from Y?",
        # "And what’s between X and Y?",
        # "And would X be interested in going to Y?"
    ]
}

In [32]:
# Update question_detection function with the additional softening parts

def question_detection(utterance, metaphors):
    """
    utterance: user input text
    qs_embeddings: BERT Sentence Transformer embedding matrix obtained from our clean-language questions DF
    metaphors: list of metaphors
    """

    conditional_qualifiers = ["might", "would"]
    ending_qualifiers = ["perhaps"]

    global cosine_sim_df
    clean_quest_df = pd.DataFrame.from_dict(clean_lang_questions, orient='index').transpose()

    # Compute utterance embedding
    utterance_embedding = model.encode(
        utterance, convert_to_tensor=True, normalize_embeddings=True
    )
    
    n=len(metaphors)
    if n == 0:   # no metaphors
        # embed clean_quest_df["generic_qs"]

        # Create model embeddings based on generic_qs list
        generic_q_list = list(clean_quest_df[~clean_quest_df["generic_qs"].isna()]["generic_qs"])
        generic_q_list = list(map(lambda x: x.replace("<conditional_qualifier>",conditional_qualifiers[random.randrange(len(conditional_qualifiers))]),generic_q_list))
        generic_q_list = list(map(lambda x: x.replace("<ending_qualifier>",ending_qualifiers[random.randrange(len(ending_qualifiers))]),generic_q_list))

        generic_qs_embeddings = model.encode(
            generic_q_list,
            convert_to_tensor=True,
            normalize_embeddings=True
        )        
        
        cosine_sim_scores = util.cos_sim(generic_qs_embeddings, utterance_embedding)
        cosine_sim_df = pd.DataFrame(cosine_sim_scores)
        cosine_sim_df.columns = ["cosine_score"]

        top_n_cosine_scores = pd.DataFrame(
                cosine_sim_df.sort_values("cosine_score", ascending=False)
            )
        # display(top_n_cosine_scores)

        top_cosine_sent = clean_quest_df.loc[top_n_cosine_scores[:1].index, 'generic_qs'].iloc[0]

        # Replace the qualifiers in the most similar question generated
        top_cosine_sent = top_cosine_sent.replace("<conditional_qualifier>", conditional_qualifiers[random.randrange(len(conditional_qualifiers))])
        top_cosine_sent = top_cosine_sent.replace("<ending_qualifier>", ending_qualifiers[random.randrange(len(ending_qualifiers))])
    
    elif n == 1:
        # replace X with our metaphor in clean_quest_df
        clean_quest_df['x_qs'].replace("X", metaphors[0], regex=True, inplace=True)
        
        # embed clean_quest_df["x_qs"]
        x_qs_embeddings = model.encode(
            list(clean_quest_df[~clean_quest_df["x_qs"].isna()]["xy_qs"]),
            convert_to_tensor=True,
            normalize_embeddings=True
        )

        cosine_sim_scores = util.cos_sim(x_qs_embeddings, utterance_embedding)
        cosine_sim_df = pd.DataFrame(cosine_sim_scores)
        cosine_sim_df.columns = ["cosine_score"]

        top_n_cosine_scores = pd.DataFrame(
                cosine_sim_df.sort_values("cosine_score", ascending=False)
            )
        # display(top_n_cosine_scores)

        top_cosine_sent = clean_quest_df.loc[top_n_cosine_scores[:1].index, 'x_qs'].iloc[0]
    
    
    else:   # n >=2
        metaphors = metaphors[0:2]   # take the first two metaphors
        
        # replace X & Y with our 2 metaphors in clean_quest_df ---> X = metaphors[0], Y = metaphors[1]
        place_holders = ["X", "Y"]
        for i, j in zip(place_holders, metaphors):
            clean_quest_df["xy_qs"].replace(i, j, regex=True, inplace=True)
        
        xy_qs_embeddings = model.encode(
            list(clean_quest_df[~clean_quest_df["xy_qs"].isna()]["xy_qs"]),
            convert_to_tensor=True,
            normalize_embeddings=True
        )        
        
        cosine_sim_scores = util.cos_sim(xy_qs_embeddings, utterance_embedding)
        cosine_sim_df = pd.DataFrame(cosine_sim_scores)
        cosine_sim_df.columns = ["cosine_score"]

        top_n_cosine_scores = pd.DataFrame(
                cosine_sim_df.sort_values("cosine_score", ascending=False)
            )
        # display(top_n_cosine_scores)

        top_cosine_sent = clean_quest_df.loc[top_n_cosine_scores[:1].index, 'xy_qs'].iloc[0]
    
    
    # T5 grammer correction: Add the prefix "grammar: " before each input 
    top_cosine_sent = happy_tt.generate_text(f"grammar: {top_cosine_sent}", args=args)
    
    return top_cosine_sent.text   # , top_n_cosine_scores[:1].iloc[0][0]

In [33]:
df_data_no_metaphors.head()


Unnamed: 0,input_text,metaphors,clean_lang_q
16,I feel very anxious. I hate this feeling.,[],And what is that like?
26,I feel that something is wrong with me.,[],And what is that like?
29,It’s difficult to explain.,[],And what happens next?
30,It is like a tightening feeling.,[],And is there anything else?
31,A feeling of helplessness,[],And what is that like?


In [34]:
%%time 
df_data_no_metaphors['clean_lang_q_softened'] = df_data_no_metaphors.apply(lambda x: question_detection(x['input_text'], x['metaphors']), axis=1)

CPU times: user 21.9 s, sys: 133 ms, total: 22 s
Wall time: 25.2 s


In [35]:
df_data_no_metaphors


Unnamed: 0,input_text,metaphors,clean_lang_q,clean_lang_q_softened
16,I feel very anxious. I hate this feeling.,[],And what is that like?,"And if that happens, what would you like to happen now?"
26,I feel that something is wrong with me.,[],And what is that like?,And what might it be like?
29,It’s difficult to explain.,[],And what happens next?,And how might you know?
30,It is like a tightening feeling.,[],And is there anything else?,And what might it be like perhaps?
31,A feeling of helplessness,[],And what is that like?,And what might you like to have happen?
40,It traveled down the umbilical cord into me.,[],And what would you like to have happen?,And what might it be like perhaps?
44,I need to go back down the hole and pick up the dead bodies there.,[],And what is that like?,And what would you need to happen for that?
46,He wanted to spend time with me as much as we both could.,[],And what happens next?,And what might you like to have happen?
47,"I like caring. I like it. I like caring for myself, and not giving away every part of my, just because some Tom, Dick, or Harry wants it.",[],And what would you like to have happen?,"And if that happens, what might you like to happen now?"
48,"The other thing was that I was about ready to leave my children, you know, very young, beautiful children . . . children that I'd brought into the world to do the same kind of, kind of thing that I was doing, you know . . . to take the world on their back and change things. You know, and now that helped me now with my kids in the sense of I don't think I want to raise my children like that.",[],And what would you like to have happen?,And what might you like to have happen?


In [36]:
# Please add template questions here. 

clean_lang_questions = {

    "generic_qs": [
        "And what <conditional_qualifier> you like to have happen?",
        "And <conditional_qualifier> there be anything else?",
        "And what <conditional_qualifier> happen next?",
        "And what <conditional_qualifier> happen next <ending_qualifier>?",
        "And <conditional_qualifier> there be anything else that needs to happen?",
        "And how <conditional_qualifier> you know?",
        "And how <conditional_qualifier> you know <ending_qualifier>?",        
        "And <conditional_qualifier> there be anything else about that?",
        "And <conditional_qualifier> there be anything else about that <ending_qualifier>?"
        "And what <conditional_qualifier> they be like?",
        "And what <conditional_qualifier> they be like <ending_qualifier>?",
        "And what <conditional_qualifier> it be like?",
        "And what <conditional_qualifier> it be like <ending_qualifier>?",
        "And what <conditional_qualifier> happen just before that?",
        "And where <conditional_qualifier> they come from?",
        "And where <conditional_qualifier> that come from?",
        "And if they happen, what <conditional_qualifier> you like to happen now?",
        "And if that happens, what <conditional_qualifier> you like to happen now?",
        "And what <conditional_qualifier> you need to happen for that?",
        "And what <conditional_qualifier> you need to happen for that <ending_qualifier>?"
    ],
    
    # Please add the templates below in the x_qs and xy_qs section

    "x_qs": [
        "And can you tell me more about X?",
        # "And what kind of X are they?",
        "And what kind of X is that?",
        "And what kind of X?",
        "And is there anything else about X?",
        # "And where are X?",
        "And where is X?",
        "And where would I find X?",
        "And where might X be located?",
        "And where could I find X?",
        # "And whereabouts are X?",
        "And whereabouts is X?",
        # "And X are like what?",
        "And X is like what?",
        "And that's X like what?",
        "And when X happens, you're like what?",
        "And what is X like?",
        "And how might you describe X?",
        "And how would you describe X?",
        "And when X occurs, what is it like?",
        # "And when X happen, you're like what?",
        # "And when X happens, what happens next?",
        "And what happens after X?",
        "And what happens just before X?",
        "And where could X come from?",
        "And what needs to happen for X?",
        "And could X happen?",
        "And what might happen at the end of X?",
        "And what could happen right before X?",
        "And when does X happen?",
        # "And when X, those are like what?",
        "And when X, that is like what?",
        "And does X have a size or a shape?",
        "And how would you describe X?",
        "And how might you look at X?",
        "And how could X be described?",
        "And could you describe X?",
        # "And how many X’s could there be?",
        "And how old could X be?",
        "And what could X be wearing?",
        "And is X on the inside or outside?",
        "And where is that X from?",
        "And is X looking at you?",
        "And when X talks to you, what does he say?",
        "And when X talks to you, what does she say?",
        "And how do you feel when you see X?",
        "And where is X going?", 
        # "And what kind of X was that X before it was X?",
        # "And what would X like to have happen?"
    ],
    
    "xy_qs": [
        "And how might you describe X and Y?",
        "And is there a relationship between X and Y?",
        "And what is the relationship between X and Y?",
        "And where is X when Y talks to you?",
        "And when you meet Y, what happens to X?",
        "And when Y happens, where is X?", 
        "And does X know Y?",
        "And what kind of relationship does X and Y have?",
        "And why does X show up when Y is around?", 
        # "And when X happens, what happens to Y?",
        # "And when X happen, what happens to Y?",
        # "And when Y happens, what happens to X?",
        # "And when Y happen, what happens to X?",
        # "And when X, what happens to Y?",
        "And is the X the same as or different from Y?",
        # "And what’s between X and Y?",
        # "And would X be interested in going to Y?"
    ]
}