# Part 1 - Question Answering

For the first part, use the Hugging Face question-answering pipeline and feed it with the five 300-word long sections from the book of your choice that you analyzed in Project 1.

These sections should be selected so they are: introducing the protagonist(s), the antagonist, the crime and crime scene, any significant evidence, and the resolution of the crime/a narrative that presents the case against the perpetrator.

Implement a simple prompt interface that takes in your question, runs it against the model, and returns the answer. Nothing special is needed here: a simple console I/O interface without complicated error handling is enough. It is up to you how you supply the context to the model (pre-loaded into your program, on-demand, etc.).

The questions you should ask are about the identity and characteristics of the protagonist, antagonist/perpetrator, the nature and the setting of the crime or crime scene, the evidence, and the case against the perpetrator.

Document the questions, ask the questions, and document the specificity and accuracy of the results.

Part 1.2 - Use two different HF QA models: run the default question-answering pipeline, then run other models of your choice, and discuss the differences in the results.
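To make the differences between models easier to discuss, the per-model results can be laid out side by side. The following is a minimal sketch, not part of the assignment: `comparison_table` is a hypothetical helper that assumes each result is a dict with `answer` and `score` keys, which is the shape the question-answering pipeline returns for a single answer.

```python
def comparison_table(question, results):
    """Render per-model QA results as a Markdown table.

    `results` maps a model name to a dict with "answer" and "score" keys
    (assumed shape: one answer per model).
    """
    lines = [
        f"Question: {question}",
        "",
        "| model | answer | score |",
        "| --- | --- | --- |",
    ]
    for model, res in results.items():
        lines.append(f"| {model} | {res['answer']} | {res['score']:.3f} |")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hand-written example values, used only to show the layout.
    print(
        comparison_table(
            "Who is the detective?",
            {
                "model-a": {"answer": "Sherlock Holmes", "score": 0.9},
                "model-b": {"answer": "Holmes", "score": 0.25},
            },
        )
    )
```

A table like this keeps the answer text and the confidence score next to each other, which makes it easier to spot cases where two models agree on the answer but differ widely in score.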

https://huggingface.co/docs/transformers/main_classes/pipelines

https://huggingface.co/docs/transformers/v4.35.0/en/main_classes/pipelines#transformers.QuestionAnsweringPipeline


In [1]:
!pip install transformers







In [91]:
from typing import Optional
from transformers import pipeline
import torch
from pprint import pprint
from IPython.display import display, Markdown, Latex


In [51]:
with open('../dataset/the_sign_of_the_four_proc.txt', 'r') as f:
  text = f.read()

chunks = []
chunk_size = 300
text = text.split()
for i in range(0, len(text), chunk_size):
  chunks.append(' '.join(text[i:i+chunk_size]))
  # chunks.append(text)

for chunk in chunks[:3]:
  print(chunk)
  print('-----------------------')

cover The Sign of the Four by Arthur Conan Doyle Contents Chapter I. The Science of Deduction Chapter II. The Statement of the Case Chapter III. In Quest of a Solution Chapter IV. The Story of the Bald-Headed Man Chapter V. The Tragedy of Pondicherry Lodge Chapter VI. Sherlock Holmes Gives a Demonstration Chapter VII. The Episode of the Barrel Chapter VIII. The Baker Street Irregulars Chapter IX. A Break in the Chain Chapter X. The End of the Islander Chapter XI. The Great Agra Treasure Chapter XII. The Strange Story of Jonathan Small Chapter I The Science of Deduction Sherlock Holmes took his bottle from the corner of the mantel-piece and his hypodermic syringe from its neat morocco case. With his long, white, nervous fingers he adjusted the delicate needle, and rolled back his left shirt-cuff. For some little time his eyes rested thoughtfully upon the sinewy forearm and wrist all dotted and scarred with innumerable puncture-marks. Finally he thrust the sharp point home, pressed down 
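The word-based chunking above can be wrapped in a reusable function. The sketch below is one possible variant, not the notebook's code: `chunk_words` and its `overlap` parameter are assumptions added here, the idea being that a few shared words between neighbouring chunks reduce the chance that an answer is cut in half at a chunk boundary.

```python
def chunk_words(text, chunk_size=300, overlap=0):
    """Split text into chunks of at most `chunk_size` whitespace-separated words.

    Consecutive chunks share `overlap` words. With overlap=0 this matches
    the plain fixed-size split used in the cell above.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), step)
    ]


if __name__ == "__main__":
    for chunk in chunk_words("one two three four five", chunk_size=3, overlap=1):
        print(chunk)
```

Note that with a non-zero overlap the last chunk can be a short tail that is already contained in the previous chunk; whether to drop it is a design choice left open here.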

In [52]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

In [63]:
# for chunk in chunks[:3]:
chunk_vec = tokenizer(chunks, padding=True, return_tensors="pt")['input_ids']
print(chunk_vec.shape)

# chunks_vecs.append(chunk_vec)
print(type(chunk_vec))
for ch in chunk_vec:
  print(len(ch))
print('-----------------------')

torch.Size([144, 468])
<class 'torch.Tensor'>
468
[... 468 printed once for each of the 144 rows ...]
-----------------------


In [96]:
question = """
crime scene
"""

question = question.strip()
question_vec = tokenizer(question, padding=True, return_tensors="pt")['input_ids']

vec_len = len(chunk_vec[0])
# print(question_vec.shape)
# print(vec_len)
# pad 0 to question_vec
question_vec = torch.cat((question_vec, torch.zeros((1, vec_len - len(question_vec[0])), dtype=chunk_vec.dtype)), dim=1)

# print(question_vec.shape)
chunk_vec = chunk_vec.float()
question_vec = question_vec.float()
cs = torch.nn.functional.cosine_similarity(chunk_vec, question_vec, dim=1)
print(cs.argmax(dim=0))
# print the 3 chunks with the highest cosine similarity
topk = cs.topk(3)
for i in topk.indices:
  print(i)
  display(Markdown(chunks[i]))  # display() returns None, so it is not wrapped in print()
  print(cs[i])
  print('-'*100)

# print the index of the best-matching chunk, then all similarity scores
print(cs.argmax(dim=0))
print(cs)

tensor(48)
tensor(48)


pungent a smell as this? It sounds like a sum in the rule of three. The answer should give us the-But halloa! here are the accredited representatives of the law." Heavy steps and the clamour of loud voices were audible from below, and the hall door shut with a loud crash. "Before they come," said Holmes, "just put your hand here on this poor fellow's arm, and here on his leg. What do you feel?" "The muscles are as hard as a board," I answered. "Quite so. They are in a state of extreme contraction, far exceeding the usual _rigor mortis_. Coupled with this distortion of the face, this Hippocratic smile, or '_risus sardonicus_,' as the old writers called it, what conclusion would it suggest to your mind?" "Death from some powerful vegetable alkaloid," I answered,-"some strychnine-like substance which would produce tetanus." "That was the idea which occurred to me the instant I saw the drawn muscles of the face. On getting into the room I at once looked for the means by which the poison had entered the system. As you saw, I discovered a thorn which had been driven or shot with no great force into the scalp. You observe that the part struck was that which would be turned towards the hole in the ceiling if the man were erect in his chair. Now examine the thorn." I took it up gingerly and held it in the light of the lantern. It was long, sharp, and black, with a glazed look near the point as though some gummy substance had dried upon it. The blunt end had been trimmed and rounded off with a knife. "Is that an English thorn?" he asked. "No, it certainly is not." "With all these data you should be able to draw

None
tensor(0.2100)
----------------------------------------------------------------------------------------------------
tensor(68)


fortunately, we have no distance to go. Evidently what puzzled the dog at the corner of Knight's Place was that there were two different trails running in opposite directions. We took the wrong one. It only remains to follow the other." There was no difficulty about this. On leading Toby to the place where he had committed his fault, he cast about in a wide circle and finally dashed off in a fresh direction. "We must take care that he does not now bring us to the place where the creasote-barrel came from," I observed. "I had thought of that. But you notice that he keeps on the pavement, whereas the barrel passed down the roadway. No, we are on the true scent now." It tended down towards the river-side, running through Belmont Place and Prince's Street. At the end of Broad Street it ran right down to the water's edge, where there was a small wooden wharf. Toby led us to the very edge of this, and there stood whining, looking out on the dark current beyond. "We are out of luck," said Holmes. "They have taken to a boat here." Several small punts and skiffs were lying about in the water and on the edge of the wharf. We took Toby round to each in turn, but, though he sniffed earnestly, he made no sign. Close to the rude landing-stage was a small brick house, with a wooden placard slung out through the second window. "Mordecai Smith" was printed across it in large letters, and, underneath, "Boats to hire by the hour or day." A second inscription above the door informed us that a steam launch was kept,-a statement which was confirmed by a great pile of coke upon the jetty. Sherlock Holmes looked slowly round, and his

None
tensor(0.1874)
----------------------------------------------------------------------------------------------------
tensor(59)


Sherlock Holmes was on the roof, and I could see him like an enormous glow-worm crawling very slowly along the ridge. I lost sight of him behind a stack of chimneys, but he presently reappeared, and then vanished once more upon the opposite side. When I made my way round there I found him seated at one of the corner eaves. "That you, Watson?" he cried. "Yes." "This is the place. What is that black thing down there?" "A water-barrel." "Top on it?" "Yes." "No sign of a ladder?" "No." "Confound the fellow! It's a most break-neck place. I ought to be able to come down where he could climb up. The water-pipe feels pretty firm. Here goes, anyhow." There was a scuffling of feet, and the lantern began to come steadily down the side of the wall. Then with a light spring he came on to the barrel, and from there to the earth. "It was easy to follow him," he said, drawing on his stockings and boots. "Tiles were loosened the whole way along, and in his hurry he had dropped this. It confirms my diagnosis, as you doctors express it." The object which he held up to me was a small pocket or pouch woven out of coloured grasses and with a few tawdry beads strung round it. In shape and size it was not unlike a cigarette-case. Inside were half a dozen spines of dark wood, sharp at one end and rounded at the other, like that which had struck Bartholomew Sholto. "They are hellish things," said he. "Look out that you don't prick yourself. I'm delighted to have them, for the chances are that they are all he has. There is the less fear of you or me finding one in our skin before long.

None
tensor(0.1860)
----------------------------------------------------------------------------------------------------
tensor(48)
tensor([0.0198, 0.1365, 0.0083, 0.0295, 0.0875, 0.0198, 0.0173, 0.0285, 0.1534,
        0.0127, 0.0066, 0.0265, 0.0171, 0.0265, 0.1281, 0.1009, 0.0932, 0.0331,
        0.0151, 0.0559, 0.0175, 0.0152, 0.1116, 0.1615, 0.0185, 0.0417, 0.0199,
        0.0174, 0.0170, 0.0395, 0.0186, 0.0716, 0.0154, 0.0173, 0.0168, 0.0213,
        0.0551, 0.1393, 0.0168, 0.1239, 0.0173, 0.0173, 0.0283, 0.0147, 0.0195,
        0.0310, 0.0188, 0.0152, 0.2100, 0.0133, 0.0118, 0.0107, 0.0176, 0.0166,
        0.0140, 0.0604, 0.0161, 0.0336, 0.0140, 0.1860, 0.0076, 0.0191, 0.0834,
        0.0175, 0.0214, 0.0484, 0.0530, 0.0178, 0.1874, 0.0340, 0.0277, 0.0192,
        0.0224, 0.1310, 0.0136, 0.0161, 0.0157, 0.0895, 0.0196, 0.0084, 0.0176,
        0.0573, 0.0995, 0.0164, 0.1016, 0.0450, 0.0621, 0.1394, 0.0179, 0.0216,
        0.0164, 0.0658, 0.0260, 0.0440, 0.0137, 0.0361, 0.0808, 0.06
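The cell above compares padded token-ID vectors. Token IDs are vocabulary indices, so cosine similarity over them mostly reflects which IDs happen to line up by position rather than shared topic. One simple alternative, sketched here as an assumption and not taken from the notebook, is to compare term-count vectors instead. The helper names `bag_of_words`, `cosine`, and `top_chunks` are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count lowercase word occurrences in text."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_chunks(query, chunks, k=3):
    """Return (chunk_index, score) pairs for the k best-matching chunks."""
    q = bag_of_words(query)
    scored = [(cosine(q, bag_of_words(c)), i) for i, c in enumerate(chunks)]
    # Highest score first; ties are broken by the earlier chunk index.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [(i, score) for score, i in scored[:k]]


if __name__ == "__main__":
    demo = ["the crime scene was dark", "a dog ran by the river", "scene of the crime"]
    for idx, score in top_chunks("crime scene", demo):
        print(idx, round(score, 3))
```

Term counts only match exact words, so a query such as "crime scene" will miss passages that describe the scene without using those words; sentence embeddings would be a stronger option but are outside the scope of this sketch.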


## Prompt Interface

Implement a simple prompt interface that takes in your question, runs it against the model, and returns the answer.

Nothing special is needed here: a simple console I/O interface without complicated error handling is enough.

It is up to you how you want to upload the context to the model (pre-loaded into your program, on-demand, etc.).
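A console interface of the kind described above could look like the following minimal sketch. It is an illustration under assumptions, not the notebook's implementation: `prompt_loop` is a hypothetical helper, and `answer_fn` stands in for whatever function actually queries the model. The `read` and `write` parameters default to `input` and `print` and exist only so the loop can be exercised without a real console.

```python
def prompt_loop(answer_fn, context, read=input, write=print):
    """Read questions until a blank line or EOF and answer each one.

    `answer_fn(question, context)` must return the answer text; it is a
    placeholder for the real model call.
    """
    while True:
        try:
            question = read("Q: ").strip()
        except EOFError:
            break
        if not question:
            break  # a blank line ends the session
        write(f"A: {answer_fn(question, context)}")


if __name__ == "__main__":
    # Stand-in "model" that just echoes the first word of the context.
    prompt_loop(lambda q, c: c.split()[0], "Sherlock Holmes took his bottle")
```

The context is passed in once and reused for every question, which corresponds to the "pre-loaded" option mentioned above.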


In [4]:
def run(
    question: str,
    context: str,
    model: Optional[str] = None,
    **kwargs,
):
    print("=" * 100)
    print(f"model: {model}")
    for k, v in kwargs.items():
        print(f"{k}: {v}")
    print("~" * 80)

    # Construct Pipeline

    device = "cuda" if torch.cuda.is_available() else "cpu"
    pipe = pipeline(
        "question-answering",
        model=model,
        # model="deepset/roberta-base-squad2",
        device=device,
    )

    # Run Pipeline

    question = question.strip()
    context = context.strip()

    print(f"C: {context}")
    print(f"Q: {question}")
    # display(Markdown(f"**Q:** {question}"))
    # display(Markdown(f"**C:** {context}"))

    res = pipe(
        question=question,
        context=context,
        **kwargs,
    )
    # pprint(res)

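    # Fall back to a placeholder when the pipeline does not return a single
    # answer dict (for example, a list of candidates when top_k > 1 is passed).
    answer, score = "idk", 0.0

    # Get the result
    if res and isinstance(res, dict):
        answer = res.get("answer", "idk")
        score = res.get("score", 0.0)

    answer = answer.strip()
    score = round(score, 3)

    print(f"A: {answer} (score: {score})")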
    # display(Markdown(f"**A:** {answer} (score: {score})"))

    return res


def run_models(question: str, context: str, models: list[str], **kwargs):
    for model in models:
        run(question, context, model=model, **kwargs)
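Because `run` constructs a new pipeline on every call, asking many questions reloads each model repeatedly. One way to avoid that is to cache pipelines by model name. The sketch below is an assumption, not the notebook's code: `PipelineCache` is a hypothetical class, and its `loader` argument is injectable so the caching logic can be checked without downloading a model.

```python
class PipelineCache:
    """Build each question-answering pipeline once and reuse it afterwards."""

    def __init__(self, loader=None):
        # `loader(model_name)` builds a pipeline; the default uses transformers.
        self._loader = loader or self._default_loader
        self._pipes = {}

    @staticmethod
    def _default_loader(model):
        # Imported lazily so the class can be used with a custom loader
        # even when transformers is not installed.
        from transformers import pipeline

        return pipeline("question-answering", model=model)

    def get(self, model=None):
        """Return the cached pipeline for `model`, building it on first use."""
        if model not in self._pipes:
            self._pipes[model] = self._loader(model)
        return self._pipes[model]


if __name__ == "__main__":
    # Fake loader for demonstration; a real run would omit `loader`.
    cache = PipelineCache(loader=lambda name: f"pipeline for {name}")
    print(cache.get("some-model"))
    print(cache.get("some-model") is cache.get("some-model"))
```

With a cache like this, `run` could call `cache.get(model)` instead of `pipeline(...)`, at the cost of keeping every loaded model in memory for the lifetime of the session.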


### Testing Prompt Interface

Models: https://huggingface.co/models?pipeline_tag=question-answering&sort=trending


In [5]:
test_ctx = """
Sherlock Holmes took his bottle from the corner of the mantel-piece and
his hypodermic syringe from its neat morocco case. With his long,
white, nervous fingers he adjusted the delicate needle, and rolled back
his left shirt-cuff. For some little time his eyes rested thoughtfully
upon the sinewy forearm and wrist all dotted and scarred with
innumerable puncture-marks. Finally he thrust the sharp point home,
pressed down the tiny piston, and sank back into the velvet-lined
arm-chair with a long sigh of satisfaction.
"""

test_q = """
What's my name?
"""


In [6]:
# Default - "distilbert-base-cased-distilled-squad"
_ = run(test_q, test_ctx)


No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


model: None
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
C: Sherlock Holmes took his bottle from the corner of the mantel-piece and
his hypodermic syringe from its neat morocco case. With his long,
white, nervous fingers he adjusted the delicate needle, and rolled back
his left shirt-cuff. For some little time his eyes rested thoughtfully
upon the sinewy forearm and wrist all dotted and scarred with
innumerable puncture-marks. Finally he thrust the sharp point home,
pressed down the tiny piston, and sank back into the velvet-lined
arm-chair with a long sigh of satisfaction.
Q: What's my name?
A: Sherlock Holmes (score: 0.761)


In [7]:
_ = run(test_q, test_ctx, model="deepset/roberta-base-squad2")


model: deepset/roberta-base-squad2
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~




C: Sherlock Holmes took his bottle from the corner of the mantel-piece and
his hypodermic syringe from its neat morocco case. With his long,
white, nervous fingers he adjusted the delicate needle, and rolled back
his left shirt-cuff. For some little time his eyes rested thoughtfully
upon the sinewy forearm and wrist all dotted and scarred with
innumerable puncture-marks. Finally he thrust the sharp point home,
pressed down the tiny piston, and sank back into the velvet-lined
arm-chair with a long sigh of satisfaction.
Q: What's my name?
A: Sherlock Holmes (score: 0.114)


In [8]:
run_models(
    test_q,
    test_ctx,
    models=[
        "distilbert-base-uncased-distilled-squad",
        "deepset/roberta-base-squad2",
    ],
)


model: distilbert-base-uncased-distilled-squad
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~




C: Sherlock Holmes took his bottle from the corner of the mantel-piece and
his hypodermic syringe from its neat morocco case. With his long,
white, nervous fingers he adjusted the delicate needle, and rolled back
his left shirt-cuff. For some little time his eyes rested thoughtfully
upon the sinewy forearm and wrist all dotted and scarred with
innumerable puncture-marks. Finally he thrust the sharp point home,
pressed down the tiny piston, and sank back into the velvet-lined
arm-chair with a long sigh of satisfaction.
Q: What's my name?
A: Sherlock Holmes (score: 0.99)
model: deepset/roberta-base-squad2
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
C: Sherlock Holmes took his bottle from the corner of the mantel-piece and
his hypodermic syringe from its neat morocco case. With his long,
white, nervous fingers he adjusted the delicate needle, and rolled back
his left shirt-cuff. For some little time his eyes rested thoughtfully
upon the sinewy forearm 

---
---

## Experiments & Results

For the first part, use the Hugging Face question-answering pipeline and feed it with the five 300-word long sections from the book of your choice that you analyzed in Project 1.

These sections should be selected so they are: **introducing the protagonist(s), the antagonist, the crime and crime scene, any significant evidence, and the resolution of the crime/a narrative that presents the case against the perpetrator.**

The questions you should ask are about the identity and characteristics of the protagonist, antagonist/perpetrator, the nature and the setting of the crime or crime scene, the evidence, and the case against the perpetrator.

Document the questions, ask the questions, and document the specificity and accuracy of the results.
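One way to document accuracy in a repeatable form is to write a reference answer for each question by hand and score the model's answer against it. The sketch below follows the SQuAD-style convention of normalizing both strings and then computing exact match and token-level F1. The function names are hypothetical, and the reference answers are an assumption: the assignment does not supply them.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """True when the normalized strings are identical."""
    return normalize(prediction) == normalize(reference)


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall after normalization."""
    pred = normalize(prediction).split()
    ref = normalize(reference).split()
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(exact_match("Sherlock Holmes.", "sherlock holmes"))
    print(round(token_f1("Holmes", "Sherlock Holmes"), 3))
```

Exact match speaks to accuracy, while F1 gives partial credit when the model returns a shorter or longer span than the reference, which is a rough proxy for specificity.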


In [29]:
# TODO: Try out a good selection of models and keep some interesting ones
models = [
    "distilbert-base-uncased-distilled-squad",
    "deepset/roberta-base-squad2",
]


---

### Section 1


In [30]:
s1 = """
My name Jeff.
"""


In [31]:
s1q1 = """
What's my name?
"""

run_models(s1q1, s1, models=models)


model: distilbert-base-uncased-distilled-squad
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
C: My name Jeff.
Q: What's my name?
A: Jeff (score: 0.987)
model: deepset/roberta-base-squad2
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
C: My name Jeff.
Q: What's my name?
A: Jeff (score: 0.641)


In [32]:
# TODO: Add more cells and ask more questions on Section 1.


#### TODO: document the specificity and accuracy of the results


---

### Section 2


In [33]:
s2 = """
My name Jeff.
"""


In [34]:
s2q1 = """
What's my name?
"""

run_models(s2q1, s2, models=models)


model: distilbert-base-uncased-distilled-squad
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
C: My name Jeff.
Q: What's my name?
A: Jeff (score: 0.987)
model: deepset/roberta-base-squad2
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
C: My name Jeff.
Q: What's my name?
A: Jeff (score: 0.641)


In [35]:
# TODO: Add more cells and ask more questions on Section 2.


#### TODO: document the specificity and accuracy of the results


---

### Section 3


In [36]:
s3 = """
My name Jeff.
"""


In [37]:
s3q1 = """
What's my name?
"""


run_models(s3q1, s3, models=models)


model: distilbert-base-uncased-distilled-squad
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
C: My name Jeff.
Q: What's my name?
A: Jeff (score: 0.987)
model: deepset/roberta-base-squad2
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
C: My name Jeff.
Q: What's my name?
A: Jeff (score: 0.641)


In [38]:
# TODO: Add more cells and ask more questions on Section 3.


#### TODO: document the specificity and accuracy of the results


---

### Section 4


In [39]:
s4 = """
My name Jeff.
"""


In [40]:
s4q1 = """
What's my name?
"""


run_models(s4q1, s4, models=models)


model: distilbert-base-uncased-distilled-squad
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
C: My name Jeff.
Q: What's my name?
A: Jeff (score: 0.987)
model: deepset/roberta-base-squad2
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
C: My name Jeff.
Q: What's my name?
A: Jeff (score: 0.641)


In [41]:
# TODO: Add more cells and ask more questions on Section 4.


#### TODO: document the specificity and accuracy of the results


---

### Section 5


In [42]:
s5 = """
My name Jeff.
"""


In [43]:
s5q1 = """
What's my name?
"""


run_models(s5q1, s5, models=models)


model: distilbert-base-uncased-distilled-squad
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
C: My name Jeff.
Q: What's my name?
A: Jeff (score: 0.987)
model: deepset/roberta-base-squad2
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
C: My name Jeff.
Q: What's my name?
A: Jeff (score: 0.641)


In [44]:
# TODO: Add more cells and ask more questions on Section 5.


#### TODO: document the specificity and accuracy of the results
