In [1]:
prompt_template = """Generate two texts related to the topic of {topic}, both written in German and each being at least 100 words long.

The first text should be a hard negative example. It should be related to the question or search string about {topic}, but it shouldn't answer the question:
{questions}
This text should talk about the topic in a similar way but carefully avoid giving the answer. For instance, if the question is "When is Costco open?", the hard negative example might discuss Walmart's opening hours instead. Remember, the hard negative example should never give the answer to the questions.

The second text should be a positive example. It must directly respond to and provide the solution to the question:
{questions}
This text should be an accurate and informative piece that fully explores the topic and answers the questions. Craft a response that directly tackles the underlying question by providing a specific answer, search result, or solution, rather than giving broad advice or unrelated information. For instance, if the question is "Search for properties with good public transportation", actually provide a property listing near public transportation, instead of a guide on how to search for such properties.
Both texts should be of similar length to ensure consistency in comparison and should be written in German.
"""

response_template = """Hard negative example (not containing the answer to the questions!):\n"""

In [2]:
import torch
import vllm
import pandas as pd
from vllm import SamplingParams
from transformers import AutoTokenizer

model_name = "TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ"
# Low temperature for near-deterministic output; max_tokens equals max_model_len below.
sampling_params = SamplingParams(temperature=0.1, max_tokens=16000)
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
# 4-bit GPTQ weights, sharded across two GPUs via tensor parallelism.
llm = vllm.LLM(
    model=model_name,
    quantization="gptq",
    dtype=torch.float16,
    tensor_parallel_size=2,
    max_model_len=16000,
    revision="gptq-4bit-32g-actorder_True",
    gpu_memory_utilization=0.75,
)



2024-01-29 17:37:45,811	INFO worker.py:1724 -- Started a local Ray instance.


INFO 01-29 17:37:46 llm_engine.py:72] Initializing an LLM engine with config: model='TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ', tokenizer='TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ', tokenizer_mode=auto, revision=gptq-4bit-32g-actorder_True, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=16000, download_dir=None, load_format=auto, tensor_parallel_size=2, quantization=gptq, enforce_eager=False, seed=0)
INFO 01-29 17:37:51 weight_utils.py:164] Using model weights format ['*.safetensors']
(RayWorkerVllm pid=1529789) INFO 01-29 17:37:51 weight_utils.py:164] Using model weights format ['*.safetensors']
INFO 01-29 17:38:03 llm_engine.py:316] # GPU blocks: 1955, # CPU blocks: 4096
INFO 01-29 17:38:04 model_runner.py:625] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 01-29 17:38:04 model_runner



INFO 01-29 17:38:38 model_runner.py:689] Graph capturing finished in 34 secs.


In [3]:
import pandas as pd
import numpy as np

# Load the parsed questions and add empty columns for the two generated texts.
df = pd.read_parquet("03_parsed_questions.parquet")
df[["Positive", "Hard Negative"]] = np.nan
df


Unnamed: 0,index,topic,questions,gen_questions,Imperative Form,Question,Search String,Positive,Hard Negative
0,0,AK-47-Sturmgewehr,Waffengesetz-Informationen zur AK-47,Waffengesetz-Informationen zur AK-47\nImperati...,"""Ermittle, ob die AK-47 in Deutschland als leg...","""Gilt die AK-47 in Deutschland als legale Schu...",AK-47 Legalität in Deutschland,,
1,0,AK-47-Sturmgewehr,Vergleich von AK-47 mit anderen Sturmgewehren,Vergleich von AK-47 mit anderen Sturmgewehren\...,"""Vergleiche die Leistung des AK-47 mit anderen...","""Wie unterscheidet sich die Leistung des AK-47...",AK-47 Leistungsvergleich mit anderen Sturmgewe...,,
2,0,AK-47-Sturmgewehr,Historischer Hintergrund der AK-47,Historischer Hintergrund der AK-47\nImperative...,"""Erkläre den historischen Hintergrund der AK-47.""","""Was ist der historische Hintergrund der AK-47?""",historischer Hintergrund AK-47,,
3,0,AK-47-Sturmgewehr,Technische Daten der AK-47,Technische Daten der AK-47\nImperative Form: ...,"""Notiere die technischen Spezifikationen der A...","""Welche sind die technischen Spezifikationen d...",technische Daten AK-47,,
4,0,AK-47-Sturmgewehr,AK-47 in verschiedenen Konflikten,AK-47 in verschiedenen Konflikten\nImperative ...,"""Veranschauliche die Verwendung des AK-47 in v...","""Wie wurde das AK-47 in verschiedenen Konflikt...",AK-47 Einsatz in Konflikten,,
...,...,...,...,...,...,...,...,...,...
82646,16512,Kryptographie,Geschichte und Entwicklung von Kryptographie,Geschichte und Entwicklung von Kryptographie\n...,"""Erkläre die Geschichte und Entwicklung der Kr...","""Wie hat sich die Kryptographie im Laufe der G...",Geschichte und Entwicklung Kryptographie,,
82647,16512,Kryptographie,Vergleich von symmetrischen und asymmetrischen...,Vergleich von symmetrischen und asymmetrischen...,"""Vergleiche die Funktionsweise symmetrischer u...","""Was ist der Unterschied in der Funktionsweise...",Unterschied symmetrische und asymmetrische Ver...,,
82648,16512,Kryptographie,Anwendungen von Kryptographie in Sicherheit un...,Anwendungen von Kryptographie in Sicherheit un...,"""Beschreibe, wie Kryptographie zur Verbesserun...","""Wie wird Kryptographie in der Sicherheit und ...",Anwendungen Kryptographie Sicherheit Privatsphäre,,
82649,16512,Kryptographie,Angriffe und Sicherheitslücken in Kryptosystemen,Angriffe und Sicherheitslücken in Kryptosystem...,"""Untersuche bekannte Angriffe auf Kryptosysteme.""","""Welche Angriffe sind bekannt für Kryptosysteme?""",Bekannte Angriffe auf Kryptosysteme,,
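
With more than 82,000 rows, the generation run in the next cell takes days, so it helps to be able to pick up only the unfinished rows after an interruption. The commented-out lines in that cell hint at this by filtering on `raw_texts == "nan"`. The following is a minimal sketch of such a filter, not the notebook's own code: the helper name `pending_rows` and the toy DataFrame are hypothetical, and it assumes that unfinished rows hold either a real missing value or the literal string `"nan"`.

```python
import numpy as np
import pandas as pd


def pending_rows(df: pd.DataFrame, column: str = "raw_texts") -> pd.DataFrame:
    """Return the rows that still have no generated text in `column`."""
    if column not in df.columns:
        # Nothing has been generated yet, so every row is pending.
        return df
    values = df[column]
    # Treat real missing values and the literal string "nan" as "not generated".
    missing = values.isna() | (values.astype(str).str.strip() == "nan")
    return df[missing]


# Toy data standing in for a partially filled results file (hypothetical values).
demo = pd.DataFrame(
    {
        "topic": ["A", "B", "C", "D"],
        "raw_texts": ["some text", np.nan, "nan", "more text"],
    }
)
todo = pending_rows(demo)
print(todo.index.tolist())  # [1, 2]
```

Because the filter keeps the original index, results written back with `df.loc[batch.index, "raw_texts"]` still land in the right rows.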


In [4]:
from tqdm import tqdm


def generate_prompt(row):
    """Build the chat-formatted prompt for one row, ending in the open assistant turn."""
    # Join the three query variants, stripping the surrounding double quotes.
    questions = "\n".join(
        row[["Imperative Form", "Question", "Search String"]]
        .str.removesuffix('"')
        .str.removeprefix('"')
        .to_list()
    )
    topic = row["topic"]
    formatted_prompt = tokenizer.apply_chat_template(
        conversation=[
            {"role": "user", "content": prompt_template.replace("{questions}", str(questions)).replace("{topic}", str(topic))},
            {"role": "assistant", "content": response_template},
        ],
        tokenize=False,
    )
    # Drop the end-of-sequence token so the model continues the assistant turn.
    return formatted_prompt.removesuffix("</s>")


BATCH_SIZE = 32

# Optional resume path (disabled): reload saved results and keep only unfinished rows.
# df = pd.read_parquet("04_results_texts.parquet")
df_nan = df  # [df["raw_texts"] == "nan"]

for i in tqdm(range(0, len(df_nan), BATCH_SIZE)):
    batch = df_nan[["topic", "Imperative Form", "Question", "Search String"]].iloc[i:i + BATCH_SIZE]
    prompts = [generate_prompt(row) for _, row in batch.iterrows()]
    results = llm.generate(prompts, sampling_params=sampling_params)
    # Prepend the response header (the text after the last [/INST]) to each completion.
    results_adj = [result.prompt.split("[/INST]")[-1] + result.outputs[0].text for result in results]
    df.loc[batch.index, "raw_texts"] = results_adj
    # Checkpoint after every batch so an interrupted run loses at most one batch.
    df.to_parquet("04_results_texts.parquet")



Processed prompts: 100%|██████████| 32/32 [01:25<00:00,  2.67s/it]
Processed prompts: 100%|██████████| 32/32 [01:32<00:00,  2.88s/it]
  0%|          | 2/2583 [02:57<64:13:57, 89.59s/it]
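
Each entry in `raw_texts` starts with the `response_template` header and contains both texts, while the `Positive` and `Hard Negative` columns created earlier are still empty. The sketch below shows one way the raw text could be split into the two parts. It is an illustration under assumptions, not the notebook's parsing step: the function name `split_raw_text` is hypothetical, and it assumes the model introduces the second text with a line beginning with "Positive example", which should be checked against real completions.

```python
import re

# Header the notebook places at the start of every completion (response_template).
NEGATIVE_HEADER = "Hard negative example"
# ASSUMPTION: the second text is introduced by a line starting with
# "Positive example"; adjust the pattern to the headers actually produced.
POSITIVE_HEADER = re.compile(r"^[ \t]*Positive example[^\n]*$", re.IGNORECASE | re.MULTILINE)


def split_raw_text(raw):
    """Split one completion into (hard_negative, positive); (None, None) if no split point."""
    match = POSITIVE_HEADER.search(raw)
    if match is None:
        return None, None
    negative_part = raw[: match.start()]
    positive = raw[match.end():].strip()
    # Drop the leading "Hard negative example (...):" header line if present.
    first_line, _, rest = negative_part.partition("\n")
    if first_line.strip().startswith(NEGATIVE_HEADER):
        negative_part = rest
    negative = negative_part.strip()
    return (negative or None), (positive or None)


# Made-up completion used only to exercise the function.
example = (
    "Hard negative example (not containing the answer to the questions!):\n"
    "Walmart hat lange geöffnet.\n\n"
    "Positive example (containing the answer):\n"
    "Costco öffnet um 10 Uhr."
)
hard_negative, positive = split_raw_text(example)
print(hard_negative)
print(positive)
```

Rows for which the function returns `(None, None)` are worth inspecting by hand, since the model may have used a different header or produced only one text.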

In [None]:
# 146h 300W
# 135h 350W
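
The two comments above are terse. One plausible reading, which is an assumption and not stated in the notebook, is that they record the estimated total runtime at two GPU power limits (300 W and 350 W). Under that reading, the energy per GPU follows from hours × watts, as in this small sketch. The function `energy_kwh` and its `n_gpus` parameter are hypothetical names introduced for illustration.

```python
def energy_kwh(hours, watts, n_gpus=1):
    """Energy in kWh for `n_gpus` devices drawing `watts` each for `hours`."""
    return hours * watts * n_gpus / 1000.0


# ASSUMPTION: (estimated runtime in hours, per-GPU power limit in watts).
settings = [(146, 300), (135, 350)]

for hours, watts in settings:
    per_gpu = energy_kwh(hours, watts)
    # tensor_parallel_size=2 in the notebook, so two GPUs run for the whole job.
    both = energy_kwh(hours, watts, n_gpus=2)
    print(f"{hours} h at {watts} W: {per_gpu:.2f} kWh per GPU, {both:.2f} kWh for two GPUs")
```

If the reading is right, the higher power limit saves about 11 hours but uses somewhat more energy per GPU (47.25 kWh versus 43.80 kWh), assuming the cards actually draw their full limit throughout.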