# Question Answering for Book Title Extraction

Question Answering can be used as a sort of zero-shot NER by picking an appropriate question.

This works surprisingly well for posts that actually contain a book title, almost always picking it out (and potentially the author).
It does less well when a post mentions several books, but it is a very strong starting point.
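
To make the "zero-shot NER" framing concrete, here is a minimal sketch that maps entity labels to questions and turns each answer into a labelled span. The `QUESTIONS` mapping and the `extract_entities` helper are hypothetical names introduced for illustration; `qa` is assumed to be any callable that takes `context` and `question` and returns a single dict with `answer`, `start`, `end` and `score`, like the top answer in the outputs below.

```python
from typing import Callable, Dict, List

# Hypothetical label -> question mapping; both questions are the ones used in this notebook.
QUESTIONS = {
    "BOOK": "What book is this about?",
    "AUTHOR": "Who is the author?",
}


def extract_entities(qa: Callable[..., dict], text: str,
                     questions: Dict[str, str] = QUESTIONS) -> List[dict]:
    """Ask one question per label and return NER-style labelled spans."""
    entities = []
    for label, question in questions.items():
        result = qa(context=text, question=question)
        if result["answer"]:  # an empty answer is treated as "no entity found"
            entities.append({
                "label": label,
                "text": result["answer"],
                "start": result["start"],
                "end": result["end"],
                "score": result["score"],
            })
    return entities
```

This only keeps one span per label, so it inherits the weakness with multiple books noted above.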

In [1]:
import numpy as np
import pandas as pd

import html

from pathlib import Path

from transformers import pipeline



In [2]:
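import re

def clean(text):
    """Convert comment HTML to plain text."""
    text = html.unescape(text)
    # Italics usually mark titles, so render them as quotes
    text = text.replace('<i>', '"')
    text = text.replace('</i>', '"')
    # Paragraph tags become blank lines
    text = text.replace('<p>', '\n\n')
    # Replace each link with its target URL
    text = re.sub(r'<a href="(.*?)".*?>.*?</a>', r'\1', text)
    return text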

In [3]:
book_recs = pd.read_csv('../data/02_intermediate/hn_ask_book_recommendations.csv')

In [4]:
books = book_recs.text.head(30).map(clean).to_list()

In [5]:
pipe = pipeline("question-answering")

No model was supplied, defaulted to distilbert-base-cased-distilled-squad (https://huggingface.co/distilbert-base-cased-distilled-squad)


## First Post

There is a list of four books here.

In [6]:
print(books[0])

- Existential Rationalism: Handling Hume's Fork (second edition)
- Living with the Himalayan Masters
- The Outsider
- Hirohito: Behind the Myth


It handles newlines poorly: the top answer runs two titles together across a line break. Some preprocessing might help.

In [7]:
pipe(context=books[0], question="What book is this about?", topk=5, handle_impossible_answer=True)



[{'score': 0.2605375349521637,
  'start': 103,
  'end': 143,
  'answer': 'The Outsider\n- Hirohito: Behind the Myth'},
 {'score': 0.09077351540327072,
  'start': 103,
  'end': 126,
  'answer': 'The Outsider\n- Hirohito'},
 {'score': 0.04595860093832016,
  'start': 2,
  'end': 47,
  'answer': "Existential Rationalism: Handling Hume's Fork"},
 {'score': 0.03626202791929245,
  'start': 128,
  'end': 143,
  'answer': 'Behind the Myth'},
 {'score': 0.016868628561496735,
  'start': 27,
  'end': 47,
  'answer': "Handling Hume's Fork"}]
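
One possible form of that preprocessing is to split a post into one item per line and query each item on its own, so that no answer can span a line break. The sketch below only does the splitting; `split_items` is a hypothetical helper, and whether the model answers better on such short contexts has not been checked here.

```python
import re
from typing import List


def split_items(text: str) -> List[str]:
    """Split a post into one candidate item per non-empty line,
    dropping leading list markers such as '- ' or '* '."""
    items = []
    for line in text.splitlines():
        line = re.sub(r"^\s*[-*•]\s+", "", line).strip()
        if line:
            items.append(line)
    return items


post = (
    "- Existential Rationalism: Handling Hume's Fork (second edition)\n"
    "- Living with the Himalayan Masters\n"
    "- The Outsider\n"
    "- Hirohito: Behind the Myth"
)
for item in split_items(post):
    print(item)
```

Each item could then be passed to the pipeline as its own `context`, for example by looping over `split_items(books[0])`.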

The top answer, "Hume", is wrong (the post names no authors), and its score is too high.

In [8]:
pipe(context=books[0], question="Who is the author?", topk=5, handle_impossible_answer=True)

[{'score': 0.39432305097579956, 'start': 36, 'end': 40, 'answer': 'Hume'},
 {'score': 0.17219921946525574,
  'start': 36,
  'end': 47,
  'answer': "Hume's Fork"},
 {'score': 0.09345115721225739,
  'start': 27,
  'end': 40,
  'answer': 'Handling Hume'},
 {'score': 0.048475462943315506,
  'start': 36,
  'end': 64,
  'answer': "Hume's Fork (second edition)"},
 {'score': 0.04080972820520401,
  'start': 27,
  'end': 47,
  'answer': "Handling Hume's Fork"}]

## Second Post

In [9]:
print(books[1])

The Coming of Neo-Feudalism by Joel Kotkin


We need a way of merging overlapping recommendations, but the top answer is right.

In [10]:
pipe(context=books[1], question="What book is this about?", topk=5, handle_impossible_answer=True)

[{'score': 0.8426161408424377,
  'start': 0,
  'end': 27,
  'answer': 'The Coming of Neo-Feudalism'},
 {'score': 0.15182335674762726,
  'start': 4,
  'end': 27,
  'answer': 'Coming of Neo-Feudalism'},
 {'score': 0.004171550273895264,
  'start': 14,
  'end': 27,
  'answer': 'Neo-Feudalism'},
 {'score': 0.0004384214407764375,
  'start': 0,
  'end': 42,
  'answer': 'The Coming of Neo-Feudalism by Joel Kotkin'},
 {'score': 0.0002944305306300521,
  'start': 0,
  'end': 10,
  'answer': 'The Coming'}]
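
A minimal sketch of one way to merge overlapping answers: group every answer with the highest-scoring span it overlaps, keep that span's text, and sum the scores. The helper names are hypothetical, and summing the scores is a design choice made for this sketch rather than something the pipeline does.

```python
from typing import List


def overlaps(a: dict, b: dict) -> bool:
    """True when two answer spans share at least one character."""
    return a["start"] < b["end"] and b["start"] < a["end"]


def merge_overlapping(answers: List[dict]) -> List[dict]:
    """Greedily group answers around the highest-scoring spans.

    Each group keeps the text and offsets of its best answer and
    accumulates the scores of every answer that overlaps it.
    """
    merged: List[dict] = []
    for ans in sorted(answers, key=lambda a: a["score"], reverse=True):
        for group in merged:
            if overlaps(group, ans):
                group["score"] += ans["score"]
                break
        else:
            merged.append(dict(ans))  # copy so the input is not mutated
    return sorted(merged, key=lambda g: g["score"], reverse=True)


# The five answers shown above for the second post.
answers = [
    {"score": 0.8426161408424377, "start": 0, "end": 27,
     "answer": "The Coming of Neo-Feudalism"},
    {"score": 0.15182335674762726, "start": 4, "end": 27,
     "answer": "Coming of Neo-Feudalism"},
    {"score": 0.004171550273895264, "start": 14, "end": 27,
     "answer": "Neo-Feudalism"},
    {"score": 0.0004384214407764375, "start": 0, "end": 42,
     "answer": "The Coming of Neo-Feudalism by Joel Kotkin"},
    {"score": 0.0002944305306300521, "start": 0, "end": 10,
     "answer": "The Coming"},
]
print(merge_overlapping(answers))
```

On these five answers everything collapses into a single group whose text is the top answer. On the first post the same function would still keep the answer that runs two titles together, so it complements the newline preprocessing rather than replacing it.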

This is right.

In [11]:
pipe(context=books[1], question="Who is the author?", topk=5, handle_impossible_answer=True)

[{'score': 0.9972344636917114,
  'start': 31,
  'end': 42,
  'answer': 'Joel Kotkin'},
 {'score': 0.0010390320094302297, 'start': 31, 'end': 35, 'answer': 'Joel'},
 {'score': 0.0005376354674808681, 'start': 36, 'end': 42, 'answer': 'Kotkin'},
 {'score': 0.00041213215445168316,
  'start': 31,
  'end': 42,
  'answer': 'Joel Kotkin'},
 {'score': 0.0001502289087511599,
  'start': 0,
  'end': 42,
  'answer': 'The Coming of Neo-Feudalism by Joel Kotkin'}]

## Third Post

Gets it exactly right

In [12]:
print(books[2])

Probably "Reaper", by Will Wight. It’s not an insightful nonfiction book or a piece of high literature, but the whole Cradle series is very, very fun.


In [13]:
pipe(context=books[2], question="What book is this about?", topk=5, handle_impossible_answer=True)

[{'score': 0.3536214530467987, 'start': 10, 'end': 16, 'answer': 'Reaper'},
 {'score': 0.05046669766306877, 'start': 118, 'end': 124, 'answer': 'Cradle'},
 {'score': 0.04372008517384529,
  'start': 118,
  'end': 131,
  'answer': 'Cradle series'},
 {'score': 0.036290064454078674, 'start': 10, 'end': 17, 'answer': 'Reaper"'},
 {'score': 0.019747283309698105, 'start': 9, 'end': 16, 'answer': '"Reaper'}]

In [14]:
pipe(context=books[2], question="Who is the author?", topk=5, handle_impossible_answer=True)

[{'score': 0.9716078639030457, 'start': 22, 'end': 32, 'answer': 'Will Wight'},
 {'score': 0.02592596597969532,
  'start': 22,
  'end': 33,
  'answer': 'Will Wight.'},
 {'score': 0.001337842084467411, 'start': 27, 'end': 32, 'answer': 'Wight'},
 {'score': 0.0003502972249407321, 'start': 22, 'end': 26, 'answer': 'Will'},
 {'score': 0.00034414735273458064,
  'start': 19,
  'end': 32,
  'answer': 'by Will Wight'}]

## Fourth Post

Right again

In [15]:
print(books[3])

A Gentleman in Moscow by Amor Towles. I spent a lot of the year in isolation, only seeing a few people and this book felt like an appropriate analogy. It was also very heartwarming when I really needed something to lift me up.


In [16]:
pipe(context=books[3], question="What book is this about?", topk=5, handle_impossible_answer=True)

[{'score': 0.7443851232528687,
  'start': 0,
  'end': 21,
  'answer': 'A Gentleman in Moscow'},
 {'score': 0.08203072100877762,
  'start': 0,
  'end': 36,
  'answer': 'A Gentleman in Moscow by Amor Towles'},
 {'score': 0.07942397147417068,
  'start': 0,
  'end': 37,
  'answer': 'A Gentleman in Moscow by Amor Towles.'},
 {'score': 0.044820066541433334,
  'start': 0,
  'end': 11,
  'answer': 'A Gentleman'},
 {'score': 0.021551057696342468,
  'start': 2,
  'end': 21,
  'answer': 'Gentleman in Moscow'}]

In [17]:
pipe(context=books[3], question="Who is the author?", topk=5, handle_impossible_answer=True)

[{'score': 0.9685162305831909,
  'start': 25,
  'end': 36,
  'answer': 'Amor Towles'},
 {'score': 0.02845042571425438,
  'start': 25,
  'end': 37,
  'answer': 'Amor Towles.'},
 {'score': 0.0012753085466101766,
  'start': 0,
  'end': 36,
  'answer': 'A Gentleman in Moscow by Amor Towles'},
 {'score': 0.0006127895903773606, 'start': 25, 'end': 29, 'answer': 'Amor'},
 {'score': 0.00034427616628818214,
  'start': 22,
  'end': 36,
  'answer': 'by Amor Towles'}]

## Fifth Post

Right again

In [18]:
print(books[4])

"In Cold Blood" by Truman Capote. It's a masterpiece.


In [19]:
pipe(context=books[4], question="What book is this about?", topk=5, handle_impossible_answer=True)

[{'score': 0.5808616876602173,
  'start': 1,
  'end': 14,
  'answer': 'In Cold Blood'},
 {'score': 0.23709535598754883,
  'start': 0,
  'end': 14,
  'answer': '"In Cold Blood'},
 {'score': 0.11051516979932785,
  'start': 1,
  'end': 15,
  'answer': 'In Cold Blood"'},
 {'score': 0.04510993883013725,
  'start': 0,
  'end': 15,
  'answer': '"In Cold Blood"'},
 {'score': 0.0039156838320195675,
  'start': 1,
  'end': 52,
  'answer': 'In Cold Blood" by Truman Capote. It\'s a masterpiece'}]

In [20]:
pipe(context=books[4], question="Who is the author?", topk=5, handle_impossible_answer=True)

[{'score': 0.9937615394592285,
  'start': 19,
  'end': 32,
  'answer': 'Truman Capote'},
 {'score': 0.005086508113890886,
  'start': 19,
  'end': 33,
  'answer': 'Truman Capote.'},
 {'score': 0.00057207205099985, 'start': 19, 'end': 25, 'answer': 'Truman'},
 {'score': 0.00022857873409520835, 'start': 26, 'end': 32, 'answer': 'Capote'},
 {'score': 7.831704715499654e-05,
  'start': 16,
  'end': 32,
  'answer': 'by Truman Capote'}]

## Sixth Post

This is a difficult one: these are actually all papers.

In [21]:
print(books[5])

A small selection of papers that I find useful (also check the Wikipedia articles for a quick overview):

Communicating Sequential Processes "CSP" by Tony Hoare[0] has a strong influence on Go and Clojure. He also published/contributed to other interesting and influential books and papers.

Making reliable distributed systems in the presence of software errors by Joe Armstrong[1] (Erlang, BEAM). An implementation of the actor model and functional programming to optimize for reliability.

Conflict-free Replicated Data Types by Marc Shapiro, Nuno Preguiça, Carlos Baquero, Marek Zawirsk, "CRDTs" [2]. Enable strong eventual consistency, which is typically useful (and implemented) for databases, p2p (chat) applications and other distributed systems.

[0] https://www.cs.cmu.edu/~crary/819-f09/Hoare78.pdf

[1] https://www.cs.otago.ac.nz/coursework/cosc461/armstrong_thesis_2003.pdf

[2] https://hal.inria.fr/hal-00932836/file/CRDTs_SSS-2011.pdf


In [22]:
pipe(context=books[5], question="What book is this about?", topk=5, handle_impossible_answer=True)

[{'score': 0.0959417074918747,
  'start': 106,
  'end': 140,
  'answer': 'Communicating Sequential Processes'},
 {'score': 0.08643828332424164,
  'start': 106,
  'end': 145,
  'answer': 'Communicating Sequential Processes "CSP'},
 {'score': 0.06057683750987053, 'start': 142, 'end': 145, 'answer': 'CSP'},
 {'score': 0.05559767037630081,
  'start': 106,
  'end': 146,
  'answer': 'Communicating Sequential Processes "CSP"'},
 {'score': 0.04726332053542137, 'start': 141, 'end': 145, 'answer': '"CSP'}]

In [23]:
pipe(context=books[5], question="Who is the author?", topk=5, handle_impossible_answer=True)

[{'score': 0.4444877803325653,
  'start': 366,
  'end': 379,
  'answer': 'Joe Armstrong'},
 {'score': 0.07175072282552719,
  'start': 150,
  'end': 160,
  'answer': 'Tony Hoare'},
 {'score': 0.008594379760324955,
  'start': 366,
  'end': 397,
  'answer': 'Joe Armstrong[1] (Erlang, BEAM)'},
 {'score': 0.002378952456638217,
  'start': 366,
  'end': 382,
  'answer': 'Joe Armstrong[1]'},
 {'score': 0.0008865283452905715, 'start': 0, 'end': 0, 'answer': ''}]

## Seventh Post

In [24]:
print(books[6])

"The Unwomanly Face of War: An Oral History of Women in World War II". Non-fiction. Harrowing. Was inspired to read it after seeing "Beanpole".

Also really enjoyed "Klara and the Sun".


It only gets half of the first title and completely misses the other books. The scores are very low too.

In [25]:
pipe(context=books[6], question="What book is this about?", topk=5, handle_impossible_answer=True)

[{'score': 0.06026965379714966,
  'start': 1,
  'end': 26,
  'answer': 'The Unwomanly Face of War'},
 {'score': 0.033767107874155045,
  'start': 28,
  'end': 68,
  'answer': 'An Oral History of Women in World War II'},
 {'score': 0.011089831590652466,
  'start': 1,
  'end': 52,
  'answer': 'The Unwomanly Face of War: An Oral History of Women'},
 {'score': 0.010547981597483158,
  'start': 0,
  'end': 26,
  'answer': '"The Unwomanly Face of War'},
 {'score': 0.010149270296096802,
  'start': 31,
  'end': 68,
  'answer': 'Oral History of Women in World War II'}]

This is too confidently wrong.

In [26]:
pipe(context=books[6], question="Who is the author?", topk=5, handle_impossible_answer=True)

[{'score': 0.2859797179698944, 'start': 84, 'end': 93, 'answer': 'Harrowing'},
 {'score': 0.18127445876598358,
  'start': 71,
  'end': 93,
  'answer': 'Non-fiction. Harrowing'},
 {'score': 0.12296736240386963,
  'start': 71,
  'end': 82,
  'answer': 'Non-fiction'},
 {'score': 0.03354998677968979,
  'start': 84,
  'end': 94,
  'answer': 'Harrowing.'},
 {'score': 0.021266387775540352,
  'start': 71,
  'end': 94,
  'answer': 'Non-fiction. Harrowing.'}]

## Eighth Post

In [27]:
print(books[7])

programming pearls jon bentley


The top answer is right.

In [28]:
pipe(context=books[7], question="What book is this about?", topk=5, handle_impossible_answer=True)

[{'score': 0.5504440069198608,
  'start': 0,
  'end': 18,
  'answer': 'programming pearls'},
 {'score': 0.30278968811035156,
  'start': 0,
  'end': 30,
  'answer': 'programming pearls jon bentley'},
 {'score': 0.06683448702096939, 'start': 12, 'end': 18, 'answer': 'pearls'},
 {'score': 0.036764491349458694,
  'start': 12,
  'end': 30,
  'answer': 'pearls jon bentley'},
 {'score': 0.019708285108208656,
  'start': 0,
  'end': 11,
  'answer': 'programming'}]

But it can't really tell it's a book...

In [29]:
pipe(context=books[7], question="What song is this about?", topk=5, handle_impossible_answer=True)

[{'score': 0.38840705156326294,
  'start': 0,
  'end': 18,
  'answer': 'programming pearls'},
 {'score': 0.3662135601043701,
  'start': 0,
  'end': 30,
  'answer': 'programming pearls jon bentley'},
 {'score': 0.08963220566511154, 'start': 12, 'end': 18, 'answer': 'pearls'},
 {'score': 0.08451064676046371,
  'start': 12,
  'end': 30,
  'answer': 'pearls jon bentley'},
 {'score': 0.012788637541234493,
  'start': 0,
  'end': 11,
  'answer': 'programming'}]

This is too confidently wrong.

In [30]:
pipe(context=books[7], question="Who is the author?", topk=5, handle_impossible_answer=True)

[{'score': 0.5670811533927917,
  'start': 0,
  'end': 30,
  'answer': 'programming pearls jon bentley'},
 {'score': 0.19033314287662506,
  'start': 12,
  'end': 30,
  'answer': 'pearls jon bentley'},
 {'score': 0.09922166913747787,
  'start': 19,
  'end': 30,
  'answer': 'jon bentley'},
 {'score': 0.05318687856197357, 'start': 23, 'end': 30, 'answer': 'bentley'},
 {'score': 0.03328821435570717,
  'start': 0,
  'end': 18,
  'answer': 'programming pearls'}]

If we add the title to the question, it does much better.

In [31]:
pipe(context=books[7], question="Who is the author of programming pearls?", topk=5, handle_impossible_answer=True)

[{'score': 0.9242533445358276,
  'start': 19,
  'end': 30,
  'answer': 'jon bentley'},
 {'score': 0.05638393387198448, 'start': 23, 'end': 30, 'answer': 'bentley'},
 {'score': 0.008598157204687595,
  'start': 0,
  'end': 30,
  'answer': 'programming pearls jon bentley'},
 {'score': 0.006067426409572363, 'start': 23, 'end': 30, 'answer': 'bentley'},
 {'score': 0.0018185155931860209, 'start': 19, 'end': 22, 'answer': 'jon'}]
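
This suggests chaining the two questions: extract the title first, then put it into the author question. Below is a minimal sketch; `title_then_author` is a hypothetical helper, and `qa` is assumed to be a callable that returns only the top answer as a single dict (an assumption about how the pipeline is called, not something shown in this notebook).

```python
from typing import Callable, Optional, Tuple


def title_then_author(qa: Callable[..., dict],
                      text: str) -> Tuple[Optional[str], Optional[str]]:
    """Ask for the title first, then reuse it in the author question."""
    title = qa(context=text, question="What book is this about?")["answer"]
    if not title:
        return None, None
    author = qa(context=text, question=f"Who is the author of {title}?")["answer"]
    return title, author or None
```

This only helps when the author is actually in the text; it cannot invent one that is missing.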

## Ninth Post

Exactly right

In [32]:
print(books[8])

try aristotle's treatment on the subject in The Nichomachean Ethics. very relatable by modern standards


In [33]:
pipe(context=books[8], question="What book is this about?", topk=5, handle_impossible_answer=True)

[{'score': 0.90522301197052,
  'start': 44,
  'end': 67,
  'answer': 'The Nichomachean Ethics'},
 {'score': 0.05711742490530014,
  'start': 48,
  'end': 67,
  'answer': 'Nichomachean Ethics'},
 {'score': 0.028559990227222443,
  'start': 44,
  'end': 68,
  'answer': 'The Nichomachean Ethics.'},
 {'score': 0.004761051386594772,
  'start': 44,
  'end': 60,
  'answer': 'The Nichomachean'},
 {'score': 0.0018020677380263805,
  'start': 48,
  'end': 68,
  'answer': 'Nichomachean Ethics.'}]

In [34]:
pipe(context=books[8], question="Who is the author?", topk=5, handle_impossible_answer=True)

[{'score': 0.9859960675239563, 'start': 4, 'end': 13, 'answer': 'aristotle'},
 {'score': 0.002930512884631753, 'start': 4, 'end': 13, 'answer': 'aristotle'},
 {'score': 0.001233691698871553,
  'start': 4,
  'end': 15,
  'answer': "aristotle's"},
 {'score': 0.0003458717546891421,
  'start': 0,
  'end': 13,
  'answer': 'try aristotle'},
 {'score': 0.00030797257204540074,
  'start': 4,
  'end': 13,
  'answer': 'aristotle'}]

## Tenth Post

The title is exactly right, but the post names no author, so the author questions fail.

In [35]:
print(books[9])

Can you send a reference for. “Statistical Inference in Computer Age?” There are several books with similar titles.


Gets the title right.

In [36]:
pipe(context=books[9], question="What book is this about?", topk=5, handle_impossible_answer=True)

[{'score': 0.7364696264266968,
  'start': 31,
  'end': 68,
  'answer': 'Statistical Inference in Computer Age'},
 {'score': 0.11681927740573883,
  'start': 56,
  'end': 68,
  'answer': 'Computer Age'},
 {'score': 0.04142001271247864,
  'start': 30,
  'end': 68,
  'answer': '“Statistical Inference in Computer Age'},
 {'score': 0.040733471512794495,
  'start': 31,
  'end': 52,
  'answer': 'Statistical Inference'},
 {'score': 0.03188558295369148,
  'start': 31,
  'end': 70,
  'answer': 'Statistical Inference in Computer Age?”'}]

Too confidently wrong: the post names no author, yet the model returns the title.

In [37]:
pipe(context=books[9], question="Who is the author?", topk=5, handle_impossible_answer=True)

[{'score': 0.21951982378959656,
  'start': 31,
  'end': 68,
  'answer': 'Statistical Inference in Computer Age'},
 {'score': 0.17552605271339417,
  'start': 31,
  'end': 70,
  'answer': 'Statistical Inference in Computer Age?”'},
 {'score': 0.10378080606460571,
  'start': 30,
  'end': 68,
  'answer': '“Statistical Inference in Computer Age'},
 {'score': 0.08298218995332718,
  'start': 30,
  'end': 70,
  'answer': '“Statistical Inference in Computer Age?”'},
 {'score': 0.04622333124279976,
  'start': 31,
  'end': 69,
  'answer': 'Statistical Inference in Computer Age?'}]

In [38]:
pipe(context=books[9], question="Who is the author of Statistical Inference in Computer Age?", topk=5, handle_impossible_answer=True)

[{'score': 0.10568806529045105,
  'start': 0,
  'end': 24,
  'answer': 'Can you send a reference'},
 {'score': 0.04337441548705101,
  'start': 0,
  'end': 68,
  'answer': 'Can you send a reference for. “Statistical Inference in Computer Age'},
 {'score': 0.042221132665872574,
  'start': 0,
  'end': 28,
  'answer': 'Can you send a reference for'},
 {'score': 0.03146642819046974,
  'start': 0,
  'end': 69,
  'answer': 'Can you send a reference for. “Statistical Inference in Computer Age?'},
 {'score': 0.014582433737814426,
  'start': 30,
  'end': 70,
  'answer': '“Statistical Inference in Computer Age?”'}]

## Example with no results

Maybe we should set a score threshold of around 0.5 (50%).

In [39]:
print(books[13])

Seek out distributed systems research papers from real-world practitioners. A quick search lead me to this nice collection: https://dancres.github.io/Pages/


In [40]:
pipe(context=books[13], question="What book is this about?", topk=5, handle_impossible_answer=True)

[{'score': 0.3960203230381012,
  'start': 132,
  'end': 155,
  'answer': 'dancres.github.io/Pages'},
 {'score': 0.13216771185398102,
  'start': 132,
  'end': 146,
  'answer': 'dancres.github'},
 {'score': 0.09443812817335129, 'start': 132, 'end': 139, 'answer': 'dancres'},
 {'score': 0.0851767286658287,
  'start': 132,
  'end': 149,
  'answer': 'dancres.github.io'},
 {'score': 0.028707168996334076,
  'start': 140,
  'end': 155,
  'answer': 'github.io/Pages'}]

In [41]:
pipe(context=books[13], question="Who is the author?", topk=5, handle_impossible_answer=True)

[{'score': 0.37659919261932373,
  'start': 132,
  'end': 146,
  'answer': 'dancres.github'},
 {'score': 0.3447030484676361, 'start': 132, 'end': 139, 'answer': 'dancres'},
 {'score': 0.097689189016819,
  'start': 132,
  'end': 155,
  'answer': 'dancres.github.io/Pages'},
 {'score': 0.019532062113285065,
  'start': 132,
  'end': 149,
  'answer': 'dancres.github.io'},
 {'score': 0.01495211198925972,
  'start': 124,
  'end': 146,
  'answer': 'https://dancres.github'}]
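
A minimal sketch of such a threshold, applied to the list of answer dicts the pipeline returns. `best_answer` is a hypothetical helper, and the default of 0.5 is only the tentative value suggested above.

```python
from typing import List, Optional


def best_answer(answers: List[dict], threshold: float = 0.5) -> Optional[dict]:
    """Return the top answer if it is non-empty and scores at least
    `threshold`; otherwise return None."""
    top = max(answers, key=lambda a: a["score"], default=None)
    if top is None or not top["answer"] or top["score"] < threshold:
        return None
    return top


# Top answers copied from the outputs above.
no_book = [
    {"score": 0.3960203230381012, "start": 132, "end": 155,
     "answer": "dancres.github.io/Pages"},
    {"score": 0.13216771185398102, "start": 132, "end": 146,
     "answer": "dancres.github"},
]
has_book = [
    {"score": 0.8426161408424377, "start": 0, "end": 27,
     "answer": "The Coming of Neo-Feudalism"},
    {"score": 0.15182335674762726, "start": 4, "end": 27,
     "answer": "Coming of Neo-Feudalism"},
]
print(best_answer(no_book))
print(best_answer(has_book))
```

A fixed cut-off is blunt. At 0.5 it would also drop the correct "Reaper" answer (0.35) from the third post and keep the wrong author answer (0.57) from the eighth, so the value would need tuning on labelled examples.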