<a href="https://colab.research.google.com/github/Angel-Castro-RC/Final_NLP/blob/main/F2_2_SummarizationTranslationQuestionAnswering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CS 195: Natural Language Processing
## Summarization, Translation, and Question Answering

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ericmanley/f23-CS195NLP/blob/main/F2_2_SummarizationTranslationQuestionAnswering.ipynb)


## References

*Two minutes NLP — Learn the ROUGE metric by examples* by Fabio Chiusano: https://medium.com/nlplanet/two-minutes-nlp-learn-the-rouge-metric-by-examples-f179cc285499

Google's implementation of rouge_score: https://github.com/google-research/google-research/tree/master/rouge

Hugging Face's wrapper for Google's implementation: https://huggingface.co/spaces/evaluate-metric/rouge

Hugging Face Task Guide on Summarization: https://huggingface.co/docs/transformers/tasks/summarization

Hugging Face Task Guide on Translation: https://huggingface.co/docs/transformers/tasks/translation

Hugging Face Task Guide on Question Answering: https://huggingface.co/docs/transformers/tasks/question_answering


## Installing necessary modules

In [None]:
import sys
!{sys.executable} -m pip install transformers datasets evaluate rouge_score sentencepiece sacremoses


## Review: Sequence-to-Sequence Models

NLP models that take one sequence as input and produce another sequence as output are called **Seq2seq** models. Example tasks:
* summarization
* translation
* conversation

**A Challenge:** unlike classification, there is usually more than one acceptable output, so there's no way to tell for sure whether the prediction is right!

**Partial Solutions:**
* Qualitative evaluation: humans judge how closely the prediction matches a reference
* ROUGE metrics: statistics that measure the similarity between two sequences



## Review: Using Hugging Face's wrapper for ROUGE

**ROUGE:** Recall-Oriented Understudy for Gisting Evaluation

Suppose we have a **reference** sequence, which is one known possible *correct* sequence
* E.g., a translation or a summarization that a trustworthy human has produced

**Example reference:** "A broody hen sat in a nesting box all day."

**Example machine-generated prediction:** "A hen sat in every nesting box that long sunny day."



In [None]:
import evaluate

rouge = evaluate.load("rouge")

reference_sentence = "a broody hen sat in a nesting box all day"
predicted_sentence = "a hen sat in every nesting box that long sunny day"

rouge.compute(predictions=[predicted_sentence],references=[reference_sentence])

{'rouge1': 0.6666666666666666,
 'rouge2': 0.3157894736842105,
 'rougeL': 0.6666666666666666,
 'rougeLsum': 0.6666666666666666}

## Interpreting ROUGE

All of these are reported as F1 scores, which balance precision (overlap relative to the *prediction*) and recall (overlap relative to the *reference*).

`rouge1` - overlap of individual words (1-grams) between prediction and reference

`rouge2` - overlap of *bigrams* (2-grams, pairs of consecutive words)

`rougeL` - the *longest common subsequence* between the prediction and reference. The subsequence must be in *order* but not necessarily *consecutive*

`rougeLsum` - do `rougeL` for each newline/sentence and aggregate the results
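
To make these definitions concrete, here is a minimal sketch that computes ROUGE-1, ROUGE-2, and ROUGE-L F1 scores by hand for the hen example. It assumes simple whitespace tokenization and no stemming, so it is not a reimplementation of Google's `rouge_score`; the helper names (`rouge_n_f1`, `rouge_l_f1`, and so on) are made up for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams (as tuples) in a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def f1(overlap, pred_total, ref_total):
    """F1 from an overlap count and the sizes of prediction and reference."""
    if overlap == 0:
        return 0.0
    precision = overlap / pred_total  # overlap relative to the prediction
    recall = overlap / ref_total      # overlap relative to the reference
    return 2 * precision * recall / (precision + recall)


def rouge_n_f1(prediction, reference, n):
    """ROUGE-N F1 using whitespace tokens (an assumption of this sketch)."""
    pred = Counter(ngrams(prediction.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    # Counter intersection keeps the minimum count of each shared n-gram
    overlap = sum((pred & ref).values())
    return f1(overlap, sum(pred.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction, reference):
    """ROUGE-L F1 based on the longest common subsequence."""
    pred, ref = prediction.split(), reference.split()
    return f1(lcs_length(pred, ref), len(pred), len(ref))


reference_sentence = "a broody hen sat in a nesting box all day"
predicted_sentence = "a hen sat in every nesting box that long sunny day"

print(rouge_n_f1(predicted_sentence, reference_sentence, 1))  # 7 shared words
print(rouge_n_f1(predicted_sentence, reference_sentence, 2))  # 3 shared bigrams
print(rouge_l_f1(predicted_sentence, reference_sentence))     # LCS of length 7
```

For this pair, 7 of the 11 predicted words and 7 of the 10 reference words overlap, so ROUGE-1 is 2·(7/11)·(7/10) / (7/11 + 7/10) = 14/21 ≈ 0.667, which matches the `rouge1` value reported by the Hugging Face wrapper above.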

## Summarization in Hugging Face

Hugging Face hosts many summarization models. Here's one called BART (https://huggingface.co/facebook/bart-large-cnn) that was fine-tuned on CNN/Daily Mail news articles (https://huggingface.co/datasets/cnn_dailymail), which include **reference** summaries written by the authors of the original article.

We'll try it out on a Times-Delphic article I found here: https://timesdelphic.com/2023/09/the-answer-has-little-to-do-with-affirmative-action-over-the-summer-the-supreme-court-ruled-against-the-admissions-programs-of-harvard-university-and-the-university-of-north-carolina-in-an-affirmat/

In [None]:
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn") #could also try google/pegasus-xsum

In [None]:
times_delphic_story = """
How does the Supreme Court ruling on affirmative action affect Drake?
The answer has little to do with affirmative action.
Over the summer, the Supreme Court ruled against the admissions programs of Harvard University and the University of North Carolina in an affirmative action decision. Before the decision, race already wasn’t a factor in Drake University admissions, according to Provost Sue Mattison.
“Affirmative action, with regards to admissions, only impacts those really highly selective institutions that limit the number of incoming students,” Mattison said. “So that doesn’t apply to Drake and most institutions across the country.”
She said schools like Harvard and UNC have enough applicants that they can pick and choose which applicants fill a certain number of spots.
Drake’s admissions team found that the university has “admitted all students who have a 3.0 high school GPA or [higher],” Mattison said. “Even though we’ve asked for a person’s race on the admissions form, it does not have an impact on the admissions decision, and it doesn’t displace anybody.”
Possible effects of the court’s ruling
Mark Kende, director of Drake’s Constitutional Law Center, said the Supreme Court “basically has embraced an idea that it calls colorblindness.”
“If you take their principle of colorblindness and extend it beyond universities, to other places, it could raise some problems,” Kende said. “But we don’t know yet.”
Financial aid programs that prioritize applicants of a particular race over another are more vulnerable after the court’s decision, according to Kende. He said it’s not clear what impact the decision might have on university hiring practices that consider an employee’s race, as well as corporations’ diversity programs.
Following the Supreme Court’s decision, Missouri Attorney General Andrew Bailey said Missouri institutions subject to the U.S. Constitution or Title VI must stop using race-based standards “to make decisions about things like admissions, scholarships, programs and employment.”
The University of Missouri System said that “a small number of our programs and scholarships have used race/ethnicity as a factor for admissions and scholarships,” and that “these practices will be discontinued.”
Drake is taking a different approach in the wake of the affirmative action decision. The university is monitoring maybe about forty to fifty scholarships, according to Ryan Zantingh, Drake’s director of financial aid. This is more in anticipation of a comparable case on financial aid that considers race, rather than a reaction to the affirmative action ruling.
Mattison said she thinks Drake is still trying to determine how the Supreme Court decision will impact Drake’s Crew Scholars program, which is for incoming students of color.
“There are ways that we can ensure that we continue Crew Scholars while still being compliant,” Mattison said.
Donors for some Drake scholarships specified that they wanted to support a student of color or a woman in a STEM field, Mattison said.
“And so we’re still working through what that actually means, and what we have to do to continue to achieve the values that we expect,” Mattison said. “There are ways that we can change the wording of some of the scholarships.”
Like all students, students of color may qualify for scholarships for first-generation students or students with financial need.
“There’s a lot of overlap between students of color and other areas where financial aid is directed,” Zantingh said. “Scholarship resources can be directed [to financial need or first generation status] and still reach the same students.”
Even if there is a ruling on financial aid that’s comparable to the affirmative action decision, Zantingh doesn’t expect a large impact on Drake financial aid from either decision.
“There may be some implications, but I think the overall general effect on students will be little to none,” Zantingh said.
Zantingh gave an example of scholarship language offered by legal counsel. If a scholarship is for only minority students, it might become a scholarship that gives preference to students who demonstrate a commitment to Drake’s vision for diversity on campus.
“If a white student is actively involved in anti-racist leadership here on campus, certainly they would fit that description then, wouldn’t they?” Zantingh said. “Basically, the language would not seek to exclude any particular protected class categorically.”
In some cases, a donor might be unwilling to change the scholarship’s language or be deceased, Zantingh said. If a donor is deceased, a judge might approve changes. He said he doesn’t expect Drake to cut any of the scholarships it is monitoring.
“The scholarship criteria would have to change, or the dollars would have to be repurposed in another way. Per either the donor or a court’s approval,” Zantingh said.
Race can still play a role in college admissions
The Supreme Court left at least one legal path open for race to play a role in college admissions.
When admitting students, universities are allowed to consider “an applicant’s discussion of how race affected his or her life, be it through discrimination, inspiration or otherwise,” Chief Justice John Roberts wrote in the Court’s decision. However, “the student must be treated based on his or her experiences as an individual — not on the basis of race.”
A student’s story can emerge without Drake asking for it, according to Dean of Admissions Joel Johnson.
“Especially if they’ve overcome a lot, or it’s so key to their identity… it’ll come out on its own,” Johnson said. “I don’t know if I could say the Supreme Court protected it. They couldn’t have stopped it, honestly.”
Johnson said that caring about diversity also means intentionally recruiting a diverse group of students. He said students can’t join Drake if they never apply in the first place.
In the wake of the Supreme Court’s decision on affirmative action, The Times-Delphic is publishing a series. Check next week’s paper for an article about legacy admissions and legacy financial aid with a Drake focus.

"""

In [None]:
len(times_delphic_story) #let's check how long this string is

6089

In [None]:
print(summarizer(times_delphic_story[:4631],max_length=100,min_length=50))

[{'summary_text': 'The Supreme Court ruled against the admissions programs of Harvard University and the University of North Carolina in an affirmative action decision. Before the decision, race already wasn’t a factor in Drake University admissions. Financial aid programs that prioritize applicants of a particular race over another are more vulnerable.'}]


### Group Exercise

In this example, I only use the first 4631 characters from the article.

Try using more. Why do you think I did that?

The model has a maximum input length (1,024 tokens for BART). For this article, that limit is reached at about 4631 characters.

What strategies can you think of for getting summaries of longer articles?

Break the article into parts, summarize each part, and then combine the partial summaries.
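
As a rough sketch of the "break it into parts" idea, the code below splits a long text at line boundaries into chunks under a character budget and summarizes each chunk separately. The function names `chunk_text` and `summarize_long`, and the 4000-character default, are assumptions for illustration. The summarizer is passed in as a plain function so the chunking logic can be tried without downloading a model.

```python
def chunk_text(text, max_chars=4000):
    """Split text into chunks of at most max_chars, breaking at line boundaries."""
    chunks, current = [], ""
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # A single line longer than the budget is cut into fixed-size pieces
        while len(line) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(line[:max_chars])
            line = line[max_chars:]
        # Start a new chunk if adding this line would exceed the budget
        if current and len(current) + 1 + len(line) > max_chars:
            chunks.append(current)
            current = line
        else:
            current = f"{current}\n{line}" if current else line
    if current:
        chunks.append(current)
    return chunks


def summarize_long(text, summarize_fn, max_chars=4000):
    """Summarize each chunk with summarize_fn and join the partial summaries."""
    partial_summaries = [summarize_fn(chunk) for chunk in chunk_text(text, max_chars)]
    return " ".join(partial_summaries)


# With the pipeline from this notebook, summarize_fn could be (not run here):
#   lambda chunk: summarizer(chunk, max_length=100, min_length=50)[0]["summary_text"]
demo = summarize_long("aaaa\nbbbb\ncccc", lambda chunk: chunk[:2].upper(), max_chars=9)
print(demo)
```

A character budget is only a stand-in for the model's real limit, which is measured in tokens, so a chunk that fits the budget may still be too long for some texts. The joined partial summaries could also be fed back through the summarizer to get a single shorter summary.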


## Let's try it on a different summarization dataset

The *BillSum* dataset contains the text of legislative bills and their summaries from both the US Federal and California State legislatures.

See more here: https://huggingface.co/datasets/billsum

This dataset has `train`, `test`, and `ca_test` splits. We can load just one of them - let's try `ca_test`, which is the smaller test set.


In [None]:
from datasets import load_dataset

billsum = load_dataset("billsum", split="ca_test")

## Let's explore the dataset

What does it look like when printed/displayed?

In [None]:
print(billsum)

Dataset({
    features: ['text', 'summary', 'title'],
    num_rows: 1237
})


What does one of the items look like?

In [None]:
billsum[0]

{'text': 'The people of the State of California do enact as follows:\n\n\nSECTION 1.\nThe Legislature finds and declares all of the following:\n(a) (1) Since 1899 congressionally chartered veterans’ organizations have provided a valuable service to our nation’s returning service members. These organizations help preserve the memories and incidents of the great hostilities fought by our nation, and preserve and strengthen comradeship among members.\n(2) These veterans’ organizations also own and manage various properties including lodges, posts, and fraternal halls. These properties act as a safe haven where veterans of all ages and their families can gather together to find camaraderie and fellowship, share stories, and seek support from people who understand their unique experiences. This aids in the healing process for these returning veterans, and ensures their health and happiness.\n(b) As a result of congressional chartering of these veterans’ organizations, the United States Inte

Let's get a summary of the first bill (first 4000 characters of the text only) using the news-article summarizer.

In [None]:
print(len(billsum[0]["text"]))
summarizer(billsum[0]["text"][:4000])

8203


[{'summary_text': 'Since 1899 congressionally chartered veterans’ organizations have provided a valuable service to our nation’s returning service members. The U.S. Internal Revenue Service created a special tax exemption for these organizations under Section 501(c)(19) of the Internal Revenue Code.'}]

## Now let's do a batch of 5 bills

First, we need to prepare a list that contains the texts of the first 5 bills, truncated to the first 4000 characters.

In [None]:
truncated_bill_texts = []
for idx in range(5):
    curr_truncated_text = billsum[idx]["text"][:4000]
    truncated_bill_texts.append( curr_truncated_text )

Now let's get a summary of each of those texts. This might take a while.

In [None]:
prediction_summaries = summarizer(truncated_bill_texts)
actual_references = billsum["summary"][0:5]

print(prediction_summaries)
print(actual_references)


[{'summary_text': 'Since 1899 congressionally chartered veterans’ organizations have provided a valuable service to our nation’s returning service members. The U.S. Internal Revenue Service created a special tax exemption for these organizations under Section 501(c)(19) of the Internal Revenue Code.'}, {'summary_text': 'A prisoner is not eligible for resentence or recall pursuant to subdivision (e) of Section 1170 if he or she was convicted of first-degree murder if the victim was a peace officer. A prisoner sentenced to death or life in prison without possibility of parole cannot be granted medical parole.'}, {'summary_text': 'California has long been known as the land of opportunity, the republic of the future. But for too many of its residents the future is receding. Inequality continues to rise, even though California has one of the most progressive tax structures in the nation. Small businesses, like plumbing contractors, auto repair shops, and restaurants that account for over 90

Notice that summarizer returns a list of dictionaries with one key each: `'summary_text'`. If we want to evaluate these with ROUGE, we will need to get a flat list of all these texts - not contained inside a dictionary.

In [None]:
predictions_flat = []

for result in prediction_summaries:
    predictions_flat.append(result["summary_text"])

print(predictions_flat)

['Since 1899 congressionally chartered veterans’ organizations have provided a valuable service to our nation’s returning service members. The U.S. Internal Revenue Service created a special tax exemption for these organizations under Section 501(c)(19) of the Internal Revenue Code.', 'A prisoner is not eligible for resentence or recall pursuant to subdivision (e) of Section 1170 if he or she was convicted of first-degree murder if the victim was a peace officer. A prisoner sentenced to death or life in prison without possibility of parole cannot be granted medical parole.', 'California has long been known as the land of opportunity, the republic of the future. But for too many of its residents the future is receding. Inequality continues to rise, even though California has one of the most progressive tax structures in the nation. Small businesses, like plumbing contractors, auto repair shops, and restaurants that account for over 90 percent of the state’s businesses are a key rung on 

Now let's compute the ROUGE metrics.

In [None]:
import evaluate

rouge = evaluate.load("rouge")

rouge.compute(predictions=predictions_flat,references=actual_references)

{'rouge1': 0.17429965660581964,
 'rouge2': 0.0758329573249116,
 'rougeL': 0.12923816949402567,
 'rougeLsum': 0.1495535060752452}

These seem to indicate there isn't a lot of overlap between the reference summaries and the predictions.

Keep in mind:
* the model was trained on a different kind of dataset
* we are only using the first part of each bill

## A translation example

Here is a model that translates from Spanish (ES) to English (EN): https://huggingface.co/Helsinki-NLP/opus-mt-es-en

In [None]:
# Use a pipeline as a high-level helper
from transformers import pipeline
import evaluate

spanish_sentence = "una gallina melancólica se sentó en un nido todo el día"
reference_english_sentence = "a broody hen sat in a nesting box all day"

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-es-en")

# The pipeline returns a list with one dictionary per input sentence
predicted_sentence = translator(spanish_sentence)

# ROUGE needs flat lists of strings, so pull the text out of the dictionary
predictions = [predicted_sentence[0]["translation_text"]]
references = [reference_english_sentence]

# Note the keyword names: predictions= and references= (both plural)
rouge = evaluate.load("rouge")
scores = rouge.compute(predictions=predictions, references=references)

print(predicted_sentence)

[{'translation_text': 'a melancholy hen sat in a nest all day'}]

In [None]:
from transformers import pipeline
from datasets import load_dataset
import evaluate

# Load the TLDR News dataset
tldr_news = load_dataset("JulesBelveze/tldr_news")

# Load summarization model
summarizer = pipeline("summarization", model="google/pegasus-xsum")

# Slicing a Dataset returns a dictionary of columns, each a list of 5 values
first_five = tldr_news["train"][:5]

# Truncate each article to its first 150 characters and summarize
truncated_tldr_text = [content[:150] for content in first_five["content"]]
predic_summaries = summarizer(truncated_tldr_text)

# Actual references from the dataset: the headline of each example
actual_references = first_five["headline"]

# Extract generated summaries
predictions_flat = [result["summary_text"] for result in predic_summaries]

# Print generated summaries and actual references
print("Generated Summaries:")
print(predictions_flat)
print("Actual References:")
print(actual_references)

# Compute ROUGE scores
rouge = evaluate.load("rouge")
rouge_scores = rouge.compute(predictions=predictions_flat, references=actual_references)

# Print ROUGE scores
print("ROUGE Scores for TLDR News Dataset:", rouge_scores)

## Applied Exploration

Go to the Hugging Face models page: https://huggingface.co/models
* Use the same model, but find two different news datasets (https://huggingface.co/datasets), and evaluate them using ROUGE metrics
* For each dataset, record
    - where did it come from?

      TLDR News Dataset: a daily tech newsletter.

      CNN/Daily Mail Dataset: articles from CNN and the Daily Mail.

    - how big is it?

      CNN/Daily Mail: 287,113 training examples.

      TLDR News: not specified.

    - how big are the texts? Did you have to truncate them?

      TLDR News: about a paragraph each.

      CNN/Daily Mail: varies, but these are news articles, potentially longer than the TLDR News texts.

      The texts from both datasets were truncated.

* Evaluate the performance
    - use the ROUGE metrics
    - describe in your own words how it performed
    - how did they compare to each other?

      TLDR News vs. CNN/Daily Mail: TLDR News scores slightly lower than CNN/Daily Mail across all ROUGE metrics. This difference may be attributed to the nature of the datasets and the summarization model's training data, as different datasets have distinct characteristics.

    - how did they compare to the bills dataset?

      CNN/Daily Mail vs. BillSum (Assumed): CNN/Daily Mail performs better than BillSum in terms of ROUGE-1 and ROUGE-L, but slightly worse in terms of ROUGE-2 and ROUGE-Lsum.

      The differences can be attributed to variations in the content, style, and length of news articles versus legislative bills.

    - what do you think is the reason for the difference in performance that you noticed?

      Training data: the summarization model may have been trained on a diverse set of data, but the distribution might favor one type of content over another.

      Truncation: truncating content may result in information loss, affecting the model's ability to generate accurate summaries.

In [None]:
from transformers import pipeline
from datasets import load_dataset
import evaluate

# Load summarization model
summarizer = pipeline("summarization", model="google/pegasus-xsum")

# Load CNN/Daily Mail news dataset
cnn_dailymail = load_dataset("cnn_dailymail","3.0.0", split="test[:5]")  # Adjust split and limit for your needs

# Generate summaries for CNN/Daily Mail dataset
truncated_cnn_dailymail_text = [example["article"][:150] for example in cnn_dailymail]
predic_summaries_cnn_dailymail = summarizer(truncated_cnn_dailymail_text)
actual_references_cnn_dailymail = [example["highlights"] for example in cnn_dailymail]

# Load XSum dataset
xsum = load_dataset("xsum", split="test[:5]")  # Adjust split and limit for your needs

# Generate summaries for XSum dataset
truncated_xsum_text = [example["document"][:150] for example in xsum]
predic_summaries_xsum = summarizer(truncated_xsum_text)
actual_references_xsum = [example["summary"] for example in xsum]

# Extract generated summaries
predictions_flat_cnn_dailymail = [result["summary_text"] for result in predic_summaries_cnn_dailymail]
predictions_flat_xsum = [result["summary_text"] for result in predic_summaries_xsum]

# Print generated summaries and actual references for CNN/Daily Mail
print("Generated Summaries - CNN/Daily Mail:")
print(predictions_flat_cnn_dailymail)
print("Actual References - CNN/Daily Mail:")
print(actual_references_cnn_dailymail)

# Print generated summaries and actual references for XSum
print("Generated Summaries - XSum:")
print(predictions_flat_xsum)
print("Actual References - XSum:")
print(actual_references_xsum)

# Compute ROUGE scores for CNN/Daily Mail
rouge_cnn_dailymail = evaluate.load("rouge")
rouge_scores_cnn_dailymail = rouge_cnn_dailymail.compute(predictions=predictions_flat_cnn_dailymail, references=actual_references_cnn_dailymail)
print("ROUGE Scores for CNN/Daily Mail Dataset:", rouge_scores_cnn_dailymail)

# Compute ROUGE scores for XSum
rouge_xsum = evaluate.load("rouge")
rouge_scores_xsum = rouge_xsum.compute(predictions=predictions_flat_xsum, references=actual_references_xsum)
print("ROUGE Scores for XSum Dataset:", rouge_scores_xsum)

## An Idea for Creative Synthesis

Write some code that lets the user type in a web address (like a Wikipedia article) and generates a summary for the whole page.
* you will have to experiment with different ideas of how to get summaries for longer texts
    - come up with your own ideas
    - research how others handle it and try those
    - you might find that combining more than one kind of model can be helpful

Record your results and discuss them at the demo!
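
One possible starting point for the web-page part, sketched with only the standard library: pull the text out of the `<p>` tags of an HTML page so it can be passed to a summarizer. The class and function names here are made up for illustration, and real pages may need more careful cleaning (navigation text, tables, footnotes) than this does.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text found inside <p> tags, ignoring everything else."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False
        self.paragraphs = []
        self._buffer = []

    def _flush(self):
        # Join the collected pieces and collapse runs of whitespace
        text = " ".join("".join(self._buffer).split())
        if text:
            self.paragraphs.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # also handles pages that never close their <p> tags
            self.in_paragraph = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()
            self.in_paragraph = False

    def handle_data(self, data):
        if self.in_paragraph:
            self._buffer.append(data)


def extract_paragraph_text(html):
    """Return the paragraph text of an HTML document, one paragraph per line."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()
    return "\n".join(parser.paragraphs)


# To fetch a real page (not run here), something like this could supply the HTML:
#   from urllib.request import urlopen
#   html = urlopen(url).read().decode("utf-8")
sample_html = (
    "<html><body><h1>Title</h1><p>First <b>bold</b> paragraph.</p>"
    "<script>var x = 1;</script><p>Second\n one.</p></body></html>"
)
print(extract_paragraph_text(sample_html))
```

The extracted text will usually be far longer than the summarizer's input limit, so it still needs a strategy for long inputs.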

## Question Answering

Here is a [RoBERTa-based model](https://huggingface.co/deepset/roberta-base-squad2) trained on the [SQuAD2.0](https://huggingface.co/datasets/squad_v2) question answering dataset.

Requires two inputs
* a question
* context - where to find the answer

Returns
* an answer
* a location where you can find the answer in the context (`start` and `end` character positions)
* a confidence score

In [None]:
from transformers import pipeline

model_name = "deepset/roberta-base-squad2"

# a) Get predictions
nlp = pipeline('question-answering', model=model_name, tokenizer=model_name)
QA_input = {
    'question': 'Can colleges take race into account when making admissions decisions?',
    'context': times_delphic_story
}
res = nlp(QA_input)
print(res)

{'score': 0.1444220393896103, 'start': 1416, 'end': 1433, 'answer': 'we don’t know yet'}


In [None]:
print( times_delphic_story[1416:1433] )
print( times_delphic_story[1200:1500] )

we don’t know yet
Court “basically has embraced an idea that it calls colorblindness.”
“If you take their principle of colorblindness and extend it beyond universities, to other places, it could raise some problems,” Kende said. “But we don’t know yet.”
Financial aid programs that prioritize applicants of a particula
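
Since `start` and `end` are character positions in the context, a small helper can show any answer together with its surroundings instead of hand-picking slice bounds. This is a sketch: the function name `answer_with_context` and the example result dictionary are made up for illustration, not real model output.

```python
def answer_with_context(context, result, window=50):
    """Return the answer in [brackets] with up to `window` characters on each side."""
    start, end = result["start"], result["end"]
    left = max(0, start - window)
    right = min(len(context), end + window)
    return context[left:start] + "[" + context[start:end] + "]" + context[end:right]


# Hypothetical context and result, shaped like the pipeline output above
example_context = "The hen sat in a nesting box all day."
example_result = {"score": 0.9, "start": 4, "end": 7, "answer": "hen"}

# The answer string is exactly the slice of the context from start to end
assert example_context[example_result["start"]:example_result["end"]] == example_result["answer"]

print(answer_with_context(example_context, example_result, window=4))
```

With the notebook's variables, the call would be `answer_with_context(times_delphic_story, res)`.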


### Let's try another question

In [None]:
QA_input2 = {
    'question' : "Which kinds of schools are most affected by the Supreme Court's affirmative action ruling?",
    'context': times_delphic_story
}
res = nlp(QA_input2)
print(res)

{'score': 0.035478729754686356, 'start': 671, 'end': 686, 'answer': 'Harvard and UNC'}


In [None]:
print( times_delphic_story[671:686] )
print( times_delphic_story[500:800] )

Harvard and UNC
 institutions that limit the number of incoming students,” Mattison said. “So that doesn’t apply to Drake and most institutions across the country.”
She said schools like Harvard and UNC have enough applicants that they can pick and choose which applicants fill a certain number of spots.
Drake’s adm


The answer I was hoping for was `"highly selective institutions"`.

### How you ask the question seems to have an impact on the answer it finds

In [None]:
QA_input3 = {
    'question' : "Does Drake consider race when deciding to admit a student?",
    'context': times_delphic_story
}
res = nlp(QA_input3)
print(res)

{'score': 0.1436648666858673, 'start': 1416, 'end': 1433, 'answer': 'we don’t know yet'}


In [None]:
QA_input4 = {
    'question' : "At Drake, does race have an impact on the admissions decision?",
    'context': times_delphic_story
}
res = nlp(QA_input4)
print(res)

{'score': 0.10744316130876541, 'start': 995, 'end': 1048, 'answer': 'it does not have an impact on the admissions decision'}


### Discussion question:

What are some ways you can think of for evaluating question answering models?
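
One common approach, used for SQuAD-style datasets, is to compare the predicted answer to a reference answer with *exact match* and a token-level *F1* score. The sketch below is a simplified version of that idea, not the official evaluation script: the normalization (lowercasing, removing ASCII punctuation and the articles a/an/the) and the function names are assumptions for illustration.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop ASCII punctuation and articles, and split into tokens."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction, reference):
    """True if the two answers are identical after normalization."""
    return normalize(prediction) == normalize(reference)


def token_f1(prediction, reference):
    """F1 over the tokens shared by the predicted and reference answers."""
    pred, ref = normalize(prediction), normalize(reference)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# The answer the model gave earlier vs. the answer that was hoped for
print(exact_match("Harvard and UNC", "highly selective institutions"))
print(token_f1("Harvard and UNC", "highly selective institutions"))
```

Like ROUGE, these scores only measure word overlap with a reference, so an answer such as "Harvard and UNC" gets no credit even though it is related to the hoped-for answer.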