In [1]:
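# read in the doc and cap it at max_words whitespace-separated words
def read_preprocess_doc(doc_path, max_words=4096):
    with open(doc_path, 'r', encoding='utf-8') as file:
        content = file.read()
    # the cap counts words, not tokens; split/join also collapses all whitespace to single spaces
    content = content.split()[:max_words]
    content = ' '.join(content)
    return content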
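
The cap in `read_preprocess_doc` is applied to whitespace-separated words, so the resulting token count can differ from 4096. A minimal sketch of the same truncation on a throwaway file, assuming a hypothetical helper name `truncate_words` and made-up sample words (neither is part of the notebook):

```python
import os
import tempfile

def truncate_words(text, max_words=4096):
    """Keep at most max_words whitespace-separated words, joined by single spaces."""
    return ' '.join(text.split()[:max_words])

# Hypothetical sample: 5000 short words spread over ten lines
sample = '\n'.join(
    ' '.join(f'w{i}' for i in range(start, start + 500))
    for start in range(0, 5000, 500)
)

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, 'sample.txt')
    with open(sample_path, 'w', encoding='utf-8') as f:
        f.write(sample)
    with open(sample_path, 'r', encoding='utf-8') as f:
        truncated = truncate_words(f.read())

# Word counts before and after truncation
print(len(sample.split()), len(truncated.split()))
```

Line breaks in the source file do not survive, because `split()` treats every run of whitespace as a single separator.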

In [2]:
doc_list = [
    '../data/crawled/crawled_text_data/3.txt',
    '../data/crawled/crawled_text_data/5.txt',
    '../data/crawled/crawled_text_data/137.txt'
]

In [3]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
# load the model
# model_id = "meta-llama/Llama-2-7b-chat-hf"
model_id = "meta-llama/Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")


In [4]:
# load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"
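
The two tokenizer settings above only come into play when several prompts are padded into one batch; the loop later in this notebook sends one prompt at a time. As a toy illustration in plain Python (not the tokenizer's actual implementation), left padding keeps the real tokens at the end of each row, which is where a decoder-only model continues generating. The function name `pad_batch` and the token ids are hypothetical:

```python
def pad_batch(sequences, pad_id, side="left"):
    """Pad lists of token ids to a common length; return (input_ids, attention_mask)."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        pad = [pad_id] * (max_len - len(seq))
        mask_pad = [0] * len(pad)
        if side == "left":
            # padding goes in front, so the last position is always a real token
            input_ids.append(pad + seq)
            attention_mask.append(mask_pad + [1] * len(seq))
        else:
            input_ids.append(seq + pad)
            attention_mask.append([1] * len(seq) + mask_pad)
    return input_ids, attention_mask

# Hypothetical token ids; 0 stands in for the EOS id reused as the pad id
batch = [[11, 12, 13], [21]]
ids, mask = pad_batch(batch, pad_id=0, side="left")
print(ids)   # [[11, 12, 13], [0, 0, 21]]
print(mask)  # [[1, 1, 1], [0, 0, 1]]
```

With `side="right"` the shorter row would end in padding, and new tokens would be appended after the pad positions instead of after the prompt.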

In [5]:
from transformers import pipeline
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, torch_dtype=torch.float16)

In [6]:
INSTRUCTION = """
I have provided a document above that may be relevant to one or more topics such as 
general information, history, economy, music, culture, sports, or upcoming events 
related to Pittsburgh or Carnegie Mellon University. Based on this document, 
generate 10 factual question and answer pairs that cover different types of inquiries, 
such as time, events, people, locations, or numerical data.
If there are fewer than 10 pairs available, only provide the number you can find.
The questions should make sense independently of the document by including enough context from the document.
For example, the question "Q: What is the event in 2000?" is not acceptable, 
because it is meaningless without a given context.
Please provide concise answers without repeating the question or using complete sentences.
For example, given the question "Q: When was Carnegie Mellon University founded?",
you should only answer "A: 1900".
\n\nExamples: \n
Q: Who is Pittsburgh named after? A: William Pitt \n
Q: What famous machine learning venue had its first conference in Pittsburgh in 1980? A: ICML \n
Q: What musical artist is performing at PPG Arena on October 13? A: Billie Eilish\n\n
Before you start, please read the document above and provide the number of question and answer pairs you can find."""
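
The instruction asks for replies in a `Q: ... A: ...` shape, which makes the generated text straightforward to post-process. A minimal sketch of one way to extract the pairs, assuming the reply may number the questions and may put the answer either on the same line or on the next one. The name `parse_qa_pairs`, the regular expression, and the rule that drops answers starting with "(Not mentioned" are illustrative assumptions, not part of the notebook:

```python
import re

# "Q: ... A: ..." with an optional "1." prefix; the answer may sit on the next line
QA_PATTERN = re.compile(
    r"^\s*(?:\d+\.\s*)?Q:\s*(?P<q>[^\n]+?)\s*A:[ \t]*(?P<a>[^\n]*)",
    re.MULTILINE,
)

def parse_qa_pairs(generated_text):
    """Return (question, answer) tuples, skipping empty or placeholder answers."""
    pairs = []
    for match in QA_PATTERN.finditer(generated_text):
        question = match.group("q").strip()
        answer = match.group("a").strip()
        # Assumption: answers like "(Not mentioned in the text)" are not usable
        if not answer or answer.startswith("(Not mentioned"):
            continue
        pairs.append((question, answer))
    return pairs

sample_reply = """I was able to find 3 pairs:

1. Q: Who is Pittsburgh named after? 
A: William Pitt

2. Q: What was the name of the steel mill where Bessemer converters were in operation in 1886? 
A: (Not mentioned in the text)

Q: What famous machine learning venue had its first conference in Pittsburgh in 1980? A: ICML 
"""

for question, answer in parse_qa_pairs(sample_reply):
    print(f"{question} -> {answer}")
```

A question that itself contains the text `A:` would be split at the wrong place, so the pattern is only a starting point.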

In [7]:
for doc_path in doc_list:
    # read the doc and build the prompt: document first, then the instruction
    content = read_preprocess_doc(doc_path)
    formatted_input = content + "\n\n" + INSTRUCTION + "\n\n" + "Your answer:"
    # tokenize only to report the prompt length in tokens
    tokenized_prompt = tokenizer(formatted_input, return_tensors="pt").to(model.device)
    print(tokenized_prompt.input_ids.size(1))

    # generate the response
    messages = [
        {"role": "user", "content": formatted_input},
    ]
    with torch.no_grad():
        result = pipe(messages, max_new_tokens=512)
    # generated_text holds the chat messages; index 1 is the assistant's reply
    print(result[0]['generated_text'][1]['content'])

2726


I was able to find 9 factual question and answer pairs that cover different types of inquiries related to Pittsburgh or Carnegie Mellon University. Here are the pairs:

1. Q: Who is Pittsburgh named after? 
A: William Pitt

2. Q: When was the Pennsylvania Canal and the Portage Railroad completed? 
A: 1834

3. Q: What was the population of Pittsburgh in 1900? 
A: 321,616

4. Q: What year did the American Federation of Labor emerge in Pittsburgh? 
A: 1881

5. Q: What was the name of the blast furnace erected by George Anschutz? 
A: (Not mentioned in the text, but implied to be a blast furnace)

6. Q: What was the name of the steel mill where Bessemer converters were in operation in 1886? 
A: (Not mentioned in the text)

7. Q: When did the city of Pittsburgh generate electricity by nuclear power? 
A: 1957

8. Q: What is the name of the convention center in Pittsburgh? 
A: David L. Lawrence Convention Center

9. Q: What is the name of the planetarium in Pittsburgh? 
A: Henry Buhl, Jr., Pla