<div align="center">
    <span style="font-size: 2em; font-weight: bold; font-style: italic;">Medical Dataset Expansion</span><br>
    <span style="font-size: 1.5em; font-weight: bold;">Reza Dalir - 610300050</span><br>
    <span style="font-size: 1.5em; font-weight: bold;">Modern Information Retrieval course</span>
    <hr style="width: 50%; border: 1px solid white;">
</div>


## ***Step 1: Load and Inspect the Dataset***

In [None]:
! pip install openpyxl



In [2]:
import pandas as pd
df = pd.read_excel("MeQSum_ACL2019_BenAbacha_Demner-Fushman.xlsx")

#### **Examining the dataset**

In [3]:
print("dataset information: ")
print(df.info(), '\n')

# number of null cells in each column
print("number of null cells: ")
print(df.isnull().sum(), '\n')

# name of the columns
print("columns name: ")
print(df.columns, '\n')

# count unique cells
print("number of unique cells in each column: ")
print(df.nunique(), '\n')

# Check the length of questions and summaries
print("Question and Summary length informations: ")
df["question_length"] = df["CHQ"].apply(lambda x: len(str(x).split()))
df["summary_length"] = df["Summary"].apply(lambda x: len(str(x).split()))
print(df[["question_length", "summary_length"]].describe())

dataset information: 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   File     1000 non-null   object
 1   CHQ      1000 non-null   object
 2   Summary  1000 non-null   object
dtypes: object(3)
memory usage: 23.6+ KB
None 

number of null cells: 
File       0
CHQ        0
Summary    0
dtype: int64 

columns name: 
Index(['File', 'CHQ', 'Summary'], dtype='object') 

number of unique cells in each column: 
File       1000
CHQ        1000
Summary     994
dtype: int64 

Question and Summary length informations: 
       question_length  summary_length
count      1000.000000     1000.000000
mean         60.776000       10.043000
std          46.616767        3.645057
min           5.000000        3.000000
25%          30.000000        7.000000
50%          47.000000        9.000000
75%          75.250000       12.000000
max         378.000000       26.000000

#### ***First few rows of the dataset:***

In [4]:
df.head()

Unnamed: 0,File,CHQ,Summary,question_length,summary_length
0,1-131188152.xml.txt,SUBJECT: who and where to get cetirizine - D\n...,Who manufactures cetirizine?,31,3
1,14348.txt,who makes bromocriptine\ni am wondering what c...,Who manufactures bromocriptine?,97,3
2,1-131985747.xml.txt,SUBJECT: nulytely\nMESSAGE: Hello can you tell...,"Who makes nulytely, and where can I buy it?",25,9
3,15410.txt,Williams' syndrome\nI would like to have my da...,Where can I get genetic testing for william's ...,31,9
4,35.txt,ClinicalTrials.gov - Question - general inform...,Where can I get genetic testing for multiple m...,69,14


***
## ***Step 2: Preprocessing***

#### I use Beautiful Soup to remove HTML tags and regular expressions to remove every character that is not a letter, digit, or whitespace. If a CHQ cell contains both SUBJECT and MESSAGE, I drop the SUBJECT and keep only the MESSAGE.

In [5]:
import re
from bs4 import BeautifulSoup

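def clean_text(text):
    # Strip HTML tags and keep only the visible text.
    text = BeautifulSoup(text, "html.parser").get_text()
    # If the cell has a MESSAGE field, keep only what follows "MESSAGE:".
    messageSearch = re.search("MESSAGE", text)
    if messageSearch:
        text = text[messageSearch.end() + 1:]
    # Remove everything except letters, digits, and whitespace.
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)
    # Collapse runs of whitespace into a single space.
    text = re.sub(r"\s+", " ", text).strip()
    return text.lower()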

df["CHQ"] = df["CHQ"].apply(clean_text)
df["Summary"] = df["Summary"].apply(clean_text)
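To make the effect of each cleaning step concrete, here is a minimal sketch that applies the same MESSAGE cut and the same two regular expressions to a made-up input. It skips the Beautiful Soup step so that it runs with the standard library alone; `clean_text_sketch` and the sample string are illustrative and not part of the notebook.

```python
import re

def clean_text_sketch(text):
    # Keep only what follows the first "MESSAGE" marker, skipping one extra
    # character for the colon, as the slicing in clean_text does.
    match = re.search("MESSAGE", text)
    if match:
        text = text[match.end() + 1:]
    # Drop every character that is not a letter, digit, or whitespace.
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)
    # Collapse runs of whitespace into single spaces.
    text = re.sub(r"\s+", " ", text).strip()
    return text.lower()

sample = "SUBJECT: cetirizine\nMESSAGE: Who makes it?  Thanks!"
cleaned = clean_text_sketch(sample)
print(cleaned)

# Punctuation is deleted, not replaced by a space, so tokens can merge.
merged = clean_text_sketch("need/want")
print(merged)
```

One side effect worth knowing: because punctuation is removed without inserting a space, a token such as `need/want` becomes `needwant`, which is what row 0 of the cleaned CHQ column shows.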


In [7]:
df["CHQ"]

0      i needwant to know who manufscturs cetirizine ...
1      who makes bromocriptine i am wondering what co...
2      hello can you tell me where do i order the nul...
3      williams syndrome i would like to have my daug...
4      clinicaltrialsgov question general information...
                             ...                        
995    i got surgery for hole in my ear drumhole was ...
996    looking for help for my nephew with glycogen s...
997    i have numbnesstingling in my lower right arm ...
998    i was diagnosed with sleep apnea prolly had it...
999    what specific resources are available for an e...
Name: CHQ, Length: 1000, dtype: object

***
## ***Step 3: Handling Missing or Irregular Data***

In [8]:
# number of null cells in each column
print("number of null cells: ")
print(df.isnull().sum(), '\n')

number of null cells: 
File               0
CHQ                0
Summary            0
question_length    0
summary_length     0
dtype: int64 



#### As we can see, there are no null or empty cells. In Step 1 we also saw that the minimum lengths in the CHQ and Summary columns are 5 and 3 words respectively, which is further evidence that neither column contains an empty cell.

In [9]:
print(df["CHQ"].apply(len).describe(), '\n')
print(df["Summary"].apply(len).describe(), '\n')

q1Q = df["CHQ"].apply(len).quantile(0.1)
q2Q = df["CHQ"].apply(len).quantile(0.9)
q1S = df["Summary"].apply(len).quantile(0.1)
q2S = df["Summary"].apply(len).quantile(0.9)
print(f"In CHQ length of cells smaller than {q1Q} or bigger than {q2Q} are considered outliers")
print(f"In Summary length of cells smaller than {q1S} or bigger than {q2S} are considered outliers")


count    1000.000000
mean      299.459000
std       242.843714
min        27.000000
25%       141.000000
50%       231.000000
75%       379.250000
max      1940.000000
Name: CHQ, dtype: float64 

count    1000.000000
mean       61.081000
std        22.470856
min        21.000000
25%        44.000000
50%        57.000000
75%        73.000000
max       158.000000
Name: Summary, dtype: float64 

In CHQ length of cells smaller than 91.0 or bigger than 590.1 are considered outliers
In Summary length of cells smaller than 37.900000000000006 or bigger than 90.0 are considered outliers


#### In CHQ, the mean question length is about 300 characters, while the minimum and maximum are 27 and 1940 respectively. We therefore truncate the longest questions and pad the shortest ones with these two functions:

In [10]:
def pad_text(text, min_length):
    if len(text) < min_length:
        return text + "<PAD>"*(int((min_length-len(text))/5)+1)
    return text

def truncate(text, max_length):
    if len(text) > max_length:
        return text[:max_length]
    return text

df.loc[df["CHQ"].apply(len) < q1Q, "CHQ"] = df["CHQ"].apply(lambda x: pad_text(x, int(q1Q)))
df.loc[df["CHQ"].apply(len) > q2Q, "CHQ"] = df["CHQ"].apply(lambda x: truncate(x, int(q2Q)))

# df.loc[df["Summary"].apply(len) < q1S, "Summary"] = df["Summary"].apply(lambda x: pad_text(x, int(q1S)))
# df.loc[df["Summary"].apply(len) > q2S, "Summary"] = df["Summary"].apply(lambda x: truncate(x, int(q2S)))
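A short usage sketch of the two helpers may help here. It repeats their definitions so that it runs on its own and calls them on toy strings of my own choosing. Note that each `<PAD>` marker counts as five characters, so a padded text can end up slightly longer than `min_length`.

```python
def pad_text(text, min_length):
    # Append enough "<PAD>" markers (5 characters each) to reach min_length.
    if len(text) < min_length:
        return text + "<PAD>" * (int((min_length - len(text)) / 5) + 1)
    return text

def truncate(text, max_length):
    # Cut the text at max_length characters.
    if len(text) > max_length:
        return text[:max_length]
    return text

padded = pad_text("abc", 10)            # 7 characters short of the minimum
untouched = pad_text("abcdefghijkl", 10)
shortened = truncate("abcdefghij", 4)
print(padded, len(padded))
print(shortened)
```

The truncation is character-based, so it can cut a question in the middle of a word.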

#### q1Q and q2Q are the 10th and 90th percentiles of the CHQ length in characters, i.e. the bounds that cut off the shortest and the longest 10 percent of the questions. q1S and q2S are the same bounds for Summary.
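The fractional bounds printed earlier (590.1 and 37.9) come from interpolation between neighbouring values, which is the default behaviour of pandas' `quantile`. The sketch below reproduces that linear interpolation in plain Python on toy lengths; `quantile_linear` and the sample list are hypothetical names used only for illustration.

```python
def quantile_linear(values, q):
    """Quantile with linear interpolation between the two closest ranks."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)      # fractional index into the sorted list
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction

# Toy character lengths, not taken from the dataset.
lengths = [27, 60, 91, 141, 231, 300, 379, 590, 800, 1200, 1940]
low = quantile_linear(lengths, 0.1)
high = quantile_linear(lengths, 0.9)
median_of_four = quantile_linear([1, 2, 3, 4], 0.5)
print(low, high, median_of_four)
```

Texts shorter than `low` would be padded and texts longer than `high` truncated, which mirrors how `q1Q` and `q2Q` are used above.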

In [11]:
print(df["CHQ"].apply(len).describe(), '\n')
print(df["Summary"].apply(len).describe(), '\n')

count    1000.000000
mean      275.607000
std       161.409211
min        91.000000
25%       141.000000
50%       231.000000
75%       379.250000
max       590.000000
Name: CHQ, dtype: float64 

count    1000.000000
mean       61.081000
std        22.470856
min        21.000000
25%        44.000000
50%        57.000000
75%        73.000000
max       158.000000
Name: Summary, dtype: float64 



#### After applying these functions, every CHQ length falls between 91 and 590 characters, so no question is too long or too short. The Summary lengths are unchanged because the Summary lines above are commented out.

***
## ***Step 4: Translate Questions to a Pivot Language***

### **Translating using MarianMTModel:**

In [None]:
! pip install transformers torch sentencepiece


In [None]:
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [None]:
from transformers import MarianMTModel, MarianTokenizer

def translate(text):
    inputs = tokenizer(text, return_tensors="pt").to(device)
    with torch.no_grad(), torch.cuda.amp.autocast():
        translated_tokens = model.generate(**inputs)

    translated_text = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)
    return translated_text[0]

In [None]:
tmp = pd.DataFrame()

model_name = "Helsinki-NLP/opus-mt-en-es"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name).to(device)
model.eval()

tmp["spanish"] = df["CHQ"].apply(translate)

In [None]:
model_name = "Helsinki-NLP/opus-mt-en-fr"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name).to(device)
model.eval()

tmp["french"] = df["CHQ"].apply(translate)

In [None]:
model_name = "Helsinki-NLP/opus-mt-en-de"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name).to(device)
model.eval()
tmp["german"] = df["CHQ"].apply(translate)
print("German translation Finished.")

model_name = "Helsinki-NLP/opus-mt-en-zh"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name).to(device)
model.eval()
tmp["chineese"] = df["CHQ"].apply(translate)
print("Chineese translation Finished.")

model_name = "Helsinki-NLP/opus-mt-en-it"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name).to(device)
model.eval()
tmp["italian"] = df["CHQ"].apply(translate)
print("Italian translation Finished.")

In [None]:
! pip install deep-translator



#### **Using Google Translate for translation (Bonus)**

In [None]:
from deep_translator import GoogleTranslator

tmp2 = pd.DataFrame()
for i in df["CHQ"]:
    print(i)
    tmp2.loc[len(tmp2), "spanish"] = GoogleTranslator(source="en", target="es").translate(i)

In [None]:
for i in df["CHQ"]:
    tmp2.loc[len(tmp2), "chineese"] = GoogleTranslator(source="en", target="zh-CN").translate(i)
print("Chineese Translation finished.")

for i in df["CHQ"]:
    tmp2.loc[len(tmp2), "italian"] = GoogleTranslator(source="en", target="it").translate(i)
print("Italian Translation finished.")

for i in df["CHQ"]:
    tmp2.loc[len(tmp2), "german"] = GoogleTranslator(source="en", target="de").translate(i)
print("German Translation finished.")

Chineese Translation finished.
Italian Translation finished.
German Translation finished.


In [None]:
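# Each loop above appended 1000 new rows, so every language sits in its own
# block of rows. Copy each block back onto rows 0-999 (.loc slices are inclusive).
tmp2.loc[0:999, "french"] = tmp2.loc[1000:1999, "french"].values
tmp2.loc[0:999, "chineese"] = tmp2.loc[2000:2999, "chineese"].values
tmp2.loc[0:999, "italian"] = tmp2.loc[3000:3999, "italian"].values
tmp2.loc[0:999, "german"] = tmp2.loc[4000:4999, "german"].values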

In [None]:
tmp2 = tmp2.drop(index=range(1000, 5000)).reset_index(drop=True)
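The shifting and dropping is needed only because each loop appends new rows with `tmp2.loc[len(tmp2), ...]`. As an alternative sketch, one list per language can be built directly, so every translation stays aligned with its source question. The translators below are offline stand-ins that I made up so the example runs without network access; in the notebook each would wrap a real translator call.

```python
def translate_columns(texts, translators):
    """Return {language: [translation, ...]}, each list aligned with `texts`."""
    columns = {}
    for language, translate_fn in translators.items():
        columns[language] = [translate_fn(text) for text in texts]
    return columns

# Hypothetical stand-in translators (they only add a language prefix).
fake_translators = {
    "spanish": lambda text: "es:" + text,
    "german": lambda text: "de:" + text,
}

questions = ["who makes cetirizine", "where can i order nulytely"]
translated = translate_columns(questions, fake_translators)
print(translated["spanish"])
```

Passing the resulting dictionary to `pd.DataFrame(...)` would give one row per question and one column per language, with no staging rows to remove afterwards.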

In [None]:
tmp2

Unnamed: 0,spanish,french,chineese,italian,german
0,Necesito saber quién manufscturs cetirizine mi...,j'ai besoin de savoir qui fabrique la cétirizi...,我需要知道谁制造西替利嗪我的沃尔玛正在寻找新的供应，但没有得到最近的,ho bisogno di sapere chi produce la cetirizina...,"ich muss wissen, wer Cetirizin herstellt. Mein..."
1,¿Quién hace bromocriptina? Me pregunto qué com...,qui fabrique la bromocriptine je me demande qu...,谁制造了溴隐亭，我想知道哪家公司制造了溴隐亭这种药物，我需要它来治疗我脑垂体上的肿块，而且价...,chi produce la bromocriptina mi chiedo quale a...,"wer stellt Bromocriptin her? Ich frage mich, w..."
2,"Hola, ¿puedes decirme dónde puedo pedir el Nul...",bonjour pouvez vous me dire où puis je command...,你好，你能告诉我在哪里可以订购 nulytely 制造商是谁我可以拨打什么电话号码谢谢,ciao puoi dirmi dove posso ordinare il nulytel...,"hallo, können Sie mir sagen, wo ich das Nulyte..."
3,Síndrome de Williams Me gustaría que le hicier...,syndrome de williams j'aimerais que ma fille s...,威廉斯综合征 我想让我的女儿做威廉斯综合征的检查 你能告诉我去哪里做检查吗 或者在我所在地区...,sindrome di Williams vorrei far fare il test p...,Williams-Syndrom. Ich möchte meine Tochter auf...
4,ClinicalTrialSgov Pregunta Información general...,Question sur ClinicalTrialsgov Informations gé...,clinicaltrialsgov 问题一般信息我的父母都死于德克萨斯州，父亲 70 多岁，...,clinicaltrialsgov domanda informazioni general...,clinicaltrialsgov Frage Allgemeine Information...
...,...,...,...,...,...
995,"Me operaron de un agujero en el tímpano, que t...","j'ai été opéré pour un trou dans mon tympan, u...",我做了耳洞手术，耳朵上有鼓膜洞，从 5 或 6 只耳朵开始就有，但我当时不知道，但当我知道时...,ho subito un intervento chirurgico per un foro...,Ich wurde wegen eines Lochs in meinem Ohr oper...
996,Estoy buscando ayuda para mi sobrino que tiene...,je recherche de l'aide pour mon neveu atteint ...,寻求帮助，帮助我的侄子，他患有糖原累积症，他住在弗吉尼亚州，病情严重，今年到目前为止，他已经...,"cerco aiuto per mio nipote con la glicogenosi,...","Ich suche Hilfe für meinen Neffen, der an eine..."
997,Tengo entumecimiento y hormigueo en la parte i...,j'ai des engourdissements et des picotements d...,我的右下臂从肘部到手指有麻木刺痛感，肌电图显示没有异常，我已经有这种症状很长时间了，我需要帮助,ho intorpidimento e formicolio nella parte inf...,ich habe ein Taubheitsgefühl und Kribbeln in m...
998,"Me diagnosticaron apnea del sueño, probablemen...","on m'a diagnostiqué une apnée du sommeil, je l...",我被诊断患有睡眠呼吸暂停，可能已经 5 年了，并且我有由此引起的肿胀问题，已经排除了其他所有...,"Mi è stata diagnosticata l'apnea notturna, pro...","Bei mir wurde Schlafapnoe diagnostiziert, ich ..."


In [None]:
tmp2.to_csv("translated.csv")

In [12]:
tmp2 = pd.read_csv("checkpoints/translated.csv")

***
## ***Step 5: Translate Back to English***

#### **Translate back using google translate**

In [None]:
back_to_english = pd.DataFrame()

for i in tmp2["italian"]:
    print(i)
    back_to_english.loc[len(back_to_english), "italian"] = GoogleTranslator(source="it", target="en").translate(i)
print("Translation from Italian finished.")

for i in tmp2["chineese"]:
    back_to_english.loc[len(back_to_english), "chineese"] = GoogleTranslator(source="zh-CN", target="en").translate(i)
print("Translation from Chinese finished.")

for i in tmp2["german"]:
    back_to_english.loc[len(back_to_english), "german"] = GoogleTranslator(source="de", target="en").translate(i)
print("Translation from German finished.")

for i in tmp2["spanish"]:
    back_to_english.loc[len(back_to_english), "spanish"] = GoogleTranslator(source="es", target="en").translate(i)
print("Translation from Spanish finished.")

for i in tmp2["french"]:
    back_to_english.loc[len(back_to_english), "french"] = GoogleTranslator(source="fr", target="en").translate(i)
print("Translation from French finished.")

ho bisogno di sapere chi produce la cetirizina, il mio Walmart sta cercando una nuova fornitura e non sta ricevendo quella recente
chi produce la bromocriptina mi chiedo quale azienda produce il farmaco bromocriptina mi serve per una massa che ho sulla ghiandola pituitaria e il costo continua ad aumentare non posso mai comprare una prescrizione completa a causa del prezzo e mi è stato detto che se metto le mani sul produttore del farmaco a volte offrono buoni sconto o qualcosa per aiutarmi a permettermi il medicinale se compro 10 pillole che devo prendere 2 volte al giorno mi costano 7800 ed è così che devo comprarle grazie
ciao puoi dirmi dove posso ordinare il nulytely chi è il produttore che numero di telefono posso chiamare grazie
sindrome di Williams vorrei far fare il test per la sindrome di Williams a mia figlia, potresti dirmi dove potrei andare o chi lo fa nella mia zona? Grazie.
clinicaltrialsgov domanda informazioni generali entrambi i miei genitori sono morti in località tx

In [None]:
back_to_english.loc[0:999, "chineese"] = back_to_english.loc[1000:1999, "chineese"].values
back_to_english.loc[0:999, "german"] = back_to_english.loc[2000:2999, "german"].values
back_to_english.loc[0:999, "spanish"] = back_to_english.loc[3000:3999, "spanish"].values
back_to_english.loc[0:999, "french"] = back_to_english.loc[4000:4999, "french"].values

In [None]:
back_to_english = back_to_english.drop(index=range(1000, 5000)).reset_index(drop=True)

In [None]:
back_to_english

Unnamed: 0.1,Unnamed: 0,italian,chineese,german,spanish,french
0,0,"I need to know who makes cetirizine, my Walmar...",I need to know who makes Cetirizine My Walmart...,I need to know who makes Cetirizine. My Walmar...,I need to know who manufactures cetirizine my ...,"I need to know who makes Cetirizine, my Walmar..."
1,1,who makes bromocriptine i wonder what company ...,"Who makes bromocriptine, I would like to know ...",who makes bromocriptine? I am wondering what c...,Who makes bromocriptine? I'm wondering what co...,who makes bromocriptine i wonder what company ...
2,2,hi can you tell me where i can order nulytely ...,Hi can you tell me where can i order nulytely ...,"hello, can you tell me where I can order the N...","Hello, can you tell me where I can order Nulyt...",hello can you tell me where can i order the nu...
3,3,Williams Syndrome I would like to have my daug...,Williams syndrome I would like to have my daug...,Williams Syndrome. I want to get my daughter t...,Williams Syndrome I would like to have my daug...,williams syndrome i would like my daughter to ...
4,4,clinicaltrialsgov general information question...,clinicaltrialsgov Question General Information...,clinicaltrialsgov Question General Information...,ClinicalTrialSgov Question General Information...,ClinicalTrialsgov Question General Information...
...,...,...,...,...,...,...
995,995,i had surgery for eardrum hole the hole was in...,I had my ears pierced and I have holes in my e...,I had surgery for a hole in my ear. The eardru...,I had an operation for a hole in my eardrum wh...,"i was operated for a hole in my eardrum, a hol..."
996,996,I am looking for help for my nephew with glyco...,Looking for help for my nephew who has Glycoge...,I am looking for help for my nephew who has gl...,I am looking for help for my nephew who has gl...,I am looking for help for my nephew who has gl...
997,997,I have numbness and tingling in my lower right...,I have numbness and tingling in my right lower...,I have numbness and tingling in my right forea...,I have numbness and tingling in my lower right...,I have numbness and tingling in my lower right...
998,998,"I have been diagnosed with sleep apnea, probab...",I have been diagnosed with sleep apnea and hav...,"I have been diagnosed with sleep apnea, I have...","I was diagnosed with sleep apnea, I've probabl...","I was diagnosed with sleep apnea, I've probabl..."


In [None]:
back_to_english.to_csv("backed_to_english.csv")

In [14]:
back_to_english = pd.read_csv("checkpoints/backed_to_english.csv")

***
## ***Step 6: Use FQD to select a subset of the new dataset***

In [None]:
import torch
import numpy as np
from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

In [None]:
def FQD(q, q_hat):
    q = embedding_model.encode(q)
    q_hat = embedding_model.encode(q_hat)
    return 1- (np.dot(q, q_hat) / (np.linalg.norm(q) * np.linalg.norm(q_hat)))
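As written, `FQD` is one minus the cosine similarity of the two sentence embeddings, so it is 0 for vectors that point in the same direction and 1 for orthogonal ones. The following sketch shows the same arithmetic on hand-picked 2-dimensional vectors in plain Python. The vectors are toy values and not real embeddings.

```python
import math

def cosine_distance(u, v):
    # 1 - (u . v) / (|u| * |v|), the same formula as FQD applied to raw vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1 - dot / (norm_u * norm_v)

same_direction = cosine_distance([1.0, 2.0], [2.0, 4.0])   # parallel vectors
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])       # unrelated vectors
print(same_direction, orthogonal)
```

For near-identical inputs, floating-point rounding can push the result a hair away from zero in either direction, which is a likely reason the scores are clipped at zero further down.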

In [None]:
fqd = pd.DataFrame(columns=["FQD_score_german", "FQD_score_italian", "FQD_score_french", "FQD_score_chineese", "FQD_score_spanish"])

fqd["FQD_score_german"] = df.apply(lambda row: FQD(row["CHQ"], back_to_english.loc[row.name, "german"]), axis=1)
fqd["FQD_score_italian"] = df.apply(lambda row: FQD(row["CHQ"], back_to_english.loc[row.name, "italian"]), axis=1)
fqd["FQD_score_french"] = df.apply(lambda row: FQD(row["CHQ"], back_to_english.loc[row.name, "french"]), axis=1)
fqd["FQD_score_chineese"] = df.apply(lambda row: FQD(row["CHQ"], back_to_english.loc[row.name, "chineese"]), axis=1)
fqd["FQD_score_spanish"] = df.apply(lambda row: FQD(row["CHQ"], back_to_english.loc[row.name, "spanish"]), axis=1)

fqd

Unnamed: 0,FQD_score_german,FQD_score_italian,FQD_score_french,FQD_score_chineese,FQD_score_spanish
0,0.135942,0.114213,0.116181,0.114409,0.101983
1,0.027566,0.020286,0.018962,0.017540,0.006332
2,0.064951,0.056367,0.038141,0.047548,0.097447
3,0.023154,0.007376,0.014031,0.031443,0.017445
4,0.050489,0.011894,0.058016,0.120378,0.089953
...,...,...,...,...,...
995,0.077045,0.073356,0.155814,0.129439,0.146475
996,0.018133,0.026385,0.034623,0.086100,0.010995
997,0.106177,0.069920,0.074217,0.079659,0.076506
998,0.015998,0.016833,0.016017,0.040064,0.018615


Clip negative values to zero (the cosine distance can come out slightly below zero because of floating-point rounding).

In [None]:
fqd = fqd.clip(lower=0)
print(fqd["FQD_score_german"].min())
print(fqd["FQD_score_german"].max())

0.0
0.5501136779785156


In [None]:
def normalize_FQD(column):
    minimum = column.min()
    maximum = column.max()
    return (column - minimum) / (maximum - minimum)

fqd["normalized_FQD_score_german"] = normalize_FQD(fqd["FQD_score_german"])
fqd["normalized_FQD_score_italian"] = normalize_FQD(fqd["FQD_score_italian"])
fqd["normalized_FQD_score_french"] = normalize_FQD(fqd["FQD_score_french"])
fqd["normalized_FQD_score_chineese"] = normalize_FQD(fqd["FQD_score_chineese"])
fqd["normalized_FQD_score_spanish"] = normalize_FQD(fqd["FQD_score_spanish"])

In [17]:
fqd[fqd["normalized_FQD_score_french"] < 0.02].count()

Unnamed: 0                       132
FQD_score_german                 132
FQD_score_italian                132
FQD_score_french                 132
FQD_score_chineese               132
FQD_score_spanish                132
normalized_FQD_score_german      132
normalized_FQD_score_italian     132
normalized_FQD_score_french      132
normalized_FQD_score_chineese    132
normalized_FQD_score_spanish     132
dtype: int64

In [16]:
fqd[fqd["normalized_FQD_score_french"] > 0.20].count()

Unnamed: 0                       84
FQD_score_german                 84
FQD_score_italian                84
FQD_score_french                 84
FQD_score_chineese               84
FQD_score_spanish                84
normalized_FQD_score_german      84
normalized_FQD_score_italian     84
normalized_FQD_score_french      84
normalized_FQD_score_chineese    84
normalized_FQD_score_spanish     84
dtype: int64

In [None]:
fqd.to_csv("fqd.csv")

In [None]:
new_questions = pd.DataFrame()

mu1 = 0.02
mu2 = 0.2

new_questions.loc[:, "german"] = back_to_english["german"][(fqd["normalized_FQD_score_german"] < mu2) & (fqd["normalized_FQD_score_german"] > mu1)]
new_questions.loc[:, "italian"] = back_to_english["italian"][(fqd["normalized_FQD_score_italian"] < mu2) & (fqd["normalized_FQD_score_italian"] > mu1)]
new_questions.loc[:, "chineese"] = back_to_english["chineese"][(fqd["normalized_FQD_score_chineese"] < mu2) & (fqd["normalized_FQD_score_chineese"] > mu1)]
new_questions.loc[:, "spanish"] = back_to_english["spanish"][(fqd["normalized_FQD_score_spanish"] < mu2) & (fqd["normalized_FQD_score_spanish"] > mu1)]
new_questions.loc[:, "french"] = back_to_english["french"][(fqd["normalized_FQD_score_french"] < mu2) & (fqd["normalized_FQD_score_french"] > mu1)]
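The selection keeps a back-translation only when its normalized score lies strictly between `mu1` and `mu2`. My reading, which the notebook does not state explicitly, is that scores below `mu1` are near-copies of the original question and scores above `mu2` have drifted too far from it. A plain-Python sketch of min-max normalization followed by this band filter, on toy scores:

```python
def min_max(scores):
    # Rescale the scores linearly so the smallest becomes 0 and the largest 1.
    lowest, highest = min(scores), max(scores)
    return [(s - lowest) / (highest - lowest) for s in scores]

def in_band(normalized, mu1, mu2):
    # Positions whose normalized score lies strictly between the thresholds.
    return [i for i, s in enumerate(normalized) if mu1 < s < mu2]

raw_scores = [0.0, 0.005, 0.03, 0.06, 0.2, 0.5]   # toy values, not from fqd
normalized = min_max(raw_scores)
kept = in_band(normalized, mu1=0.02, mu2=0.2)
print(kept)
```

Because the normalization depends on the single largest score in each column, one extreme outlier compresses all other scores toward zero, so the thresholds are not directly comparable across languages.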

In [None]:
new_questions["german"]

Unnamed: 0,german
1,who makes bromocriptine? I am wondering what c...
2,"hello, can you tell me where I can order the N..."
3,Williams Syndrome. I want to get my daughter t...
4,clinicaltrialsgov Question General Information...
5,Genetic Testing for IHHS Heart Disease Is ther...
...,...
995,I had surgery for a hole in my ear. The eardru...
996,I am looking for help for my nephew who has gl...
997,I have numbness and tingling in my right forea...
998,"I have been diagnosed with sleep apnea, I have..."


In [None]:
new_questions.to_csv("new_questions.csv")

In [15]:
new_questions = pd.read_csv("checkpoints/new_questions.csv")
fqd = pd.read_csv("checkpoints/fqd.csv")
new_questions = new_questions.set_index("Unnamed: 0")

***
## ***Step 7: Use PRQD to select a subset of the new dataset***

In [None]:
import torch
import numpy as np
from sentence_transformers import SentenceTransformer
from scipy.spatial.distance import cosine

embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

def compute_prqd(q, q_hat):
    q = embedding_model.encode(q)
    q_hat = embedding_model.encode(q_hat)
    precision = 1 - cosine(q, q_hat)
    recall = 1 / (1 + np.linalg.norm(q - q_hat))
    prqd_score = (2 * precision * recall) / (precision + recall + 1e-8)
    return prqd_score
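In `compute_prqd`, the value called `precision` is the cosine similarity of the two embeddings, `recall` is a Euclidean distance mapped into (0, 1], and the score is their harmonic mean. The sketch below repeats that arithmetic on toy 2-dimensional vectors with the standard library only; `prqd_from_vectors` is a hypothetical helper that takes vectors directly and skips the embedding model.

```python
import math

def prqd_from_vectors(q, q_hat):
    dot = sum(a * b for a, b in zip(q, q_hat))
    norm_q = math.sqrt(sum(a * a for a in q))
    norm_hat = math.sqrt(sum(b * b for b in q_hat))
    precision = dot / (norm_q * norm_hat)        # cosine similarity
    recall = 1 / (1 + math.dist(q, q_hat))       # 1 when the vectors coincide
    # Harmonic mean; the small constant guards against division by zero.
    return (2 * precision * recall) / (precision + recall + 1e-8)

identical = prqd_from_vectors([0.6, 0.8], [0.6, 0.8])
shifted = prqd_from_vectors([1.0, 0.0], [0.8, 0.6])
print(identical, shifted)
```

Unlike FQD, a higher score here means the back-translation is closer to the original, so the thresholds used for selection sit near the top of the range.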


In [None]:
prqd = pd.DataFrame()

prqd["prqd_score_german"] = df.apply(lambda row: compute_prqd(row["CHQ"], back_to_english.loc[row.name, "german"]), axis=1)
prqd["prqd_score_italian"] = df.apply(lambda row: compute_prqd(row["CHQ"], back_to_english.loc[row.name, "italian"]), axis=1)
prqd["prqd_score_french"] = df.apply(lambda row: compute_prqd(row["CHQ"], back_to_english.loc[row.name, "french"]), axis=1)
prqd["prqd_score_chineese"] = df.apply(lambda row: compute_prqd(row["CHQ"], back_to_english.loc[row.name, "chineese"]), axis=1)
prqd["prqd_score_spanish"] = df.apply(lambda row: compute_prqd(row["CHQ"], back_to_english.loc[row.name, "spanish"]), axis=1)


In [None]:
prqd

Unnamed: 0,prqd_score_german,prqd_score_italian,prqd_score_french,prqd_score_chineese,prqd_score_spanish
0,0.746615,0.767201,0.765259,0.767007,0.779669
1,0.883724,0.900038,0.903313,0.906969,0.943883
2,0.823085,0.834902,0.863616,0.848069,0.784487
3,0.893297,0.939459,0.916696,0.875951,0.907218
4,0.843545,0.923245,0.832561,0.761174,0.792707
...,...,...,...,...,...
995,0.807748,0.812284,0.729179,0.752591,0.737230
996,0.905426,0.886204,0.869939,0.797073,0.926180
997,0.775312,0.816616,0.811215,0.804602,0.808404
998,0.911106,0.908842,0.911054,0.860287,0.904193


In [None]:
prqd[prqd["prqd_score_italian"] < 0.73].count()

Unnamed: 0,0
prqd_score_german,88
prqd_score_italian,88
prqd_score_french,88
prqd_score_chineese,88
prqd_score_spanish,88


In [None]:
prqd_new_questions = pd.DataFrame()

mu1 = 0.73
mu2 = 0.90

prqd_new_questions.loc[:, "german"] = back_to_english["german"][(prqd["prqd_score_german"] < mu2) & (prqd["prqd_score_german"] > mu1)]
prqd_new_questions.loc[:, "italian"] = back_to_english["italian"][(prqd["prqd_score_italian"] < mu2) & (prqd["prqd_score_italian"] > mu1)]
prqd_new_questions.loc[:, "chineese"] = back_to_english["chineese"][(prqd["prqd_score_chineese"] < mu2) & (prqd["prqd_score_chineese"] > mu1)]
prqd_new_questions.loc[:, "spanish"] = back_to_english["spanish"][(prqd["prqd_score_spanish"] < mu2) & (prqd["prqd_score_spanish"] > mu1)]
prqd_new_questions.loc[:, "french"] = back_to_english["french"][(prqd["prqd_score_french"] < mu2) & (prqd["prqd_score_french"] > mu1)]

In [None]:
prqd_new_questions

Unnamed: 0,german,italian,chineese,spanish,french
0,I need to know who makes Cetirizine. My Walmar...,"I need to know who makes cetirizine, my Walmar...",I need to know who makes Cetirizine My Walmart...,I need to know who manufactures cetirizine my ...,"I need to know who makes Cetirizine, my Walmar..."
1,who makes bromocriptine? I am wondering what c...,,,,
2,"hello, can you tell me where I can order the N...",hi can you tell me where i can order nulytely ...,Hi can you tell me where can i order nulytely ...,"Hello, can you tell me where I can order Nulyt...",hello can you tell me where can i order the nu...
3,Williams Syndrome. I want to get my daughter t...,,Williams syndrome I would like to have my daug...,,
4,clinicaltrialsgov Question General Information...,,clinicaltrialsgov Question General Information...,ClinicalTrialSgov Question General Information...,ClinicalTrialsgov Question General Information...
...,...,...,...,...,...
991,Please email me a list with 100 all ingredient...,please send me a list of 100 all ingredients i...,,Please email me a list of all 100 ingredients ...,please email me a list of 100 all ingredients ...
994,clinicaltrialsgov Question Specific Study Do y...,clinicaltrialsgov specific study question do y...,clinicaltrialsgov question specific research d...,Question from clinicaltrialsgov: Specific stud...,Question from clinicaltrialsgov on a specific ...
995,I had surgery for a hole in my ear. The eardru...,i had surgery for eardrum hole the hole was in...,I had my ears pierced and I have holes in my e...,I had an operation for a hole in my eardrum wh...,
997,I have numbness and tingling in my right forea...,I have numbness and tingling in my lower right...,I have numbness and tingling in my right lower...,I have numbness and tingling in my lower right...,I have numbness and tingling in my lower right...


In [None]:
prqd.to_csv("prqd.csv")
prqd_new_questions.to_csv("prqd_new_questions.csv")

In [21]:
prqd_new_questions = pd.read_csv("checkpoints/prqd_new_questions.csv")
prqd = pd.read_csv("checkpoints/prqd.csv")

***
## ***Step 8 (Bonus): Use QSV to select a subset of the new dataset***

In [None]:
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

def compute_covariance_matrix(questions):
    embeddings = np.array([embedding_model.encode(q) for q in questions])
    return np.cov(embeddings)*1000

def compute_semantic_volume(questions):
    cov_matrix = compute_covariance_matrix(questions)
    det = np.linalg.det(cov_matrix)
    return  det
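With two questions, `np.cov` treats each embedding as one variable, so the covariance matrix is 2×2 and its determinant is `var(q) * var(q_hat) - cov(q, q_hat)**2`. The plain-Python sketch below computes that quantity on toy vectors, using the sample covariance (division by n - 1) as `np.cov` does by default. It leaves out the factor of 1000, which multiplies the determinant of a 2×2 matrix by 1,000,000; `pair_volume` is a name I chose for illustration.

```python
def covariance(u, v):
    # Sample covariance of two equally long vectors (divides by n - 1).
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    return sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v)) / (n - 1)

def pair_volume(q, q_hat):
    # Determinant of the 2x2 covariance matrix [[var_q, cov], [cov, var_hat]].
    return covariance(q, q) * covariance(q_hat, q_hat) - covariance(q, q_hat) ** 2

identical = pair_volume([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
different = pair_volume([1.0, 2.0, 3.0], [1.0, 3.0, 2.0])
print(identical, different)
```

The determinant is zero when one embedding is a perfectly correlated copy of the other and grows as the two diverge, so a larger QSV indicates a more varied pair.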


In [None]:
qsv = pd.DataFrame()

qsv["QSV_score_german"] = df.apply(lambda row: compute_semantic_volume([row["CHQ"], back_to_english.loc[row.name, "german"]]), axis=1)
qsv["QSV_score_italian"] = df.apply(lambda row: compute_semantic_volume([row["CHQ"], back_to_english.loc[row.name, "italian"]]), axis=1)
qsv["QSV_score_french"] = df.apply(lambda row: compute_semantic_volume([row["CHQ"], back_to_english.loc[row.name, "french"]]), axis=1)
qsv["QSV_score_chinese"] = df.apply(lambda row: compute_semantic_volume([row["CHQ"], back_to_english.loc[row.name, "chineese"]]), axis=1)
qsv["QSV_score_spanish"] = df.apply(lambda row: compute_semantic_volume([row["CHQ"], back_to_english.loc[row.name, "spanish"]]), axis=1)


In [None]:
qsv

Unnamed: 0,QSV_score_german,QSV_score_italian,QSV_score_french,QSV_score_chinese,QSV_score_spanish
0,1.723590,1.466341,1.490167,1.470178,1.317554
1,0.370629,0.273567,0.255991,0.237022,0.086014
2,0.854871,0.746592,0.509846,0.632652,1.262807
3,0.311979,0.099835,0.189894,0.421878,0.235647
4,0.670720,0.160948,0.767563,1.542182,1.171236
...,...,...,...,...,...
995,1.008644,0.962788,1.956579,1.648548,1.849553
996,0.242790,0.354041,0.462193,1.122387,0.147804
997,1.370225,0.919939,0.974000,1.042753,1.002566
998,0.216098,0.227313,0.216335,0.534564,0.251114


In [None]:
def normalize_qsv(column):
    minimum = column.min()
    maximum = column.max()
    return (column - minimum) / (maximum - minimum)



qsv["normalized_QSV_score_german"] = normalize_qsv(qsv["QSV_score_german"])
qsv["normalized_QSV_score_italian"] = normalize_qsv(qsv["QSV_score_italian"])
qsv["normalized_QSV_score_french"] = normalize_qsv(qsv["QSV_score_french"])
qsv["normalized_QSV_score_chinese"] = normalize_qsv(qsv["QSV_score_chinese"])
qsv["normalized_QSV_score_spanish"] = normalize_qsv(qsv["QSV_score_spanish"])

In [None]:
qsv

Unnamed: 0,QSV_score_german,QSV_score_italian,QSV_score_french,QSV_score_chinese,QSV_score_spanish,normalized_QSV_score_german,normalized_QSV_score_italian,normalized_QSV_score_french,normalized_QSV_score_chinese,normalized_QSV_score_spanish
0,1.723590,1.466341,1.490167,1.470178,1.317554,0.317110,0.303909,0.223859,0.251556,0.199815
1,0.370629,0.273567,0.255991,0.237022,0.086014,0.068189,0.056699,0.038456,0.040556,0.013045
2,0.854871,0.746592,0.509846,0.632652,1.262807,0.157281,0.154736,0.076591,0.108251,0.191512
3,0.311979,0.099835,0.189894,0.421878,0.235647,0.057399,0.020691,0.028527,0.072186,0.035737
4,0.670720,0.160948,0.767563,1.542182,1.171236,0.123401,0.033358,0.115307,0.263877,0.177625
...,...,...,...,...,...,...,...,...,...,...
995,1.008644,0.962788,1.956579,1.648548,1.849553,0.185573,0.199544,0.293926,0.282077,0.280496
996,0.242790,0.354041,0.462193,1.122387,0.147804,0.044669,0.073377,0.069433,0.192047,0.022415
997,1.370225,0.919939,0.974000,1.042753,1.002566,0.252097,0.190664,0.146319,0.178421,0.152045
998,0.216098,0.227313,0.216335,0.534564,0.251114,0.039758,0.047112,0.032499,0.091467,0.038083


In [None]:
qsv[qsv["normalized_QSV_score_chinese"] < 0.05].count()

Unnamed: 0,0
QSV_score_german,62
QSV_score_italian,62
QSV_score_french,62
QSV_score_chinese,62
QSV_score_spanish,62
normalized_QSV_score_german,62
normalized_QSV_score_italian,62
normalized_QSV_score_french,62
normalized_QSV_score_chinese,62
normalized_QSV_score_spanish,62


In [None]:
qsv_new_questions = pd.DataFrame()

mu1 = 0.05
mu2 = 0.35

qsv_new_questions.loc[:, "german"] = back_to_english["german"][(qsv["normalized_QSV_score_german"] < mu2) & (qsv["normalized_QSV_score_german"] > mu1)]
qsv_new_questions.loc[:, "italian"] = back_to_english["italian"][(qsv["normalized_QSV_score_italian"] < mu2) & (qsv["normalized_QSV_score_italian"] > mu1)]
qsv_new_questions.loc[:, "chineese"] = back_to_english["chineese"][(qsv["normalized_QSV_score_chinese"] < mu2) & (qsv["normalized_QSV_score_chinese"] > mu1)]
qsv_new_questions.loc[:, "spanish"] = back_to_english["spanish"][(qsv["normalized_QSV_score_spanish"] < mu2) & (qsv["normalized_QSV_score_spanish"] > mu1)]
qsv_new_questions.loc[:, "french"] = back_to_english["french"][(qsv["normalized_QSV_score_french"] < mu2) & (qsv["normalized_QSV_score_french"] > mu1)]
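The printed `qsv_new_questions` frame skips row 996 even though its normalized Italian score (0.073) lies inside the band. This suggests that assigning into an initially empty frame fixes the index to the rows that survive the first (German) filter. The sketch below is one possible alternative under that assumption: it masks each column with `where`, so every language keeps its own in-band rows. The toy frames and the names `texts`, `scores`, and `band_filtered` are hypothetical. In the real notebook the score column for `"chineese"` is spelled `normalized_QSV_score_chinese`, so a name mapping would be needed.

```python
import pandas as pd

mu1, mu2 = 0.05, 0.35  # same thresholds as the cell above

# Toy stand-ins for `back_to_english` and the normalized scores.
texts = pd.DataFrame({
    "german": ["g0", "g1", "g2"],
    "italian": ["i0", "i1", "i2"],
})
scores = pd.DataFrame({
    "normalized_QSV_score_german": [0.01, 0.20, 0.50],
    "normalized_QSV_score_italian": [0.10, 0.40, 0.02],
})

kept = {}
for lang in texts.columns:
    column = scores[f"normalized_QSV_score_{lang}"]
    in_band = column.between(mu1, mu2, inclusive="neither")
    # `where` keeps the original index and replaces rejected rows with NaN.
    kept[lang] = texts[lang].where(in_band)

# Drop only the rows where no language produced an in-band candidate.
band_filtered = pd.DataFrame(kept).dropna(how="all")
print(band_filtered)
```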

In [None]:
qsv_new_questions

Unnamed: 0,german,italian,chineese,spanish,french
0,I need to know who makes Cetirizine. My Walmar...,"I need to know who makes cetirizine, my Walmar...",I need to know who makes Cetirizine My Walmart...,I need to know who manufactures cetirizine my ...,"I need to know who makes Cetirizine, my Walmar..."
1,who makes bromocriptine? I am wondering what c...,who makes bromocriptine i wonder what company ...,,,
2,"hello, can you tell me where I can order the N...",hi can you tell me where i can order nulytely ...,Hi can you tell me where can i order nulytely ...,"Hello, can you tell me where I can order Nulyt...",hello can you tell me where can i order the nu...
3,Williams Syndrome. I want to get my daughter t...,,Williams syndrome I would like to have my daug...,,
4,clinicaltrialsgov Question General Information...,,clinicaltrialsgov Question General Information...,ClinicalTrialSgov Question General Information...,ClinicalTrialsgov Question General Information...
...,...,...,...,...,...
991,Please email me a list with 100 all ingredient...,please send me a list of 100 all ingredients i...,Please send me a list of 100 ingredients iperi...,Please email me a list of all 100 ingredients ...,please email me a list of 100 all ingredients ...
994,clinicaltrialsgov Question Specific Study Do y...,clinicaltrialsgov specific study question do y...,clinicaltrialsgov question specific research d...,Question from clinicaltrialsgov: Specific stud...,Question from clinicaltrialsgov on a specific ...
995,I had surgery for a hole in my ear. The eardru...,i had surgery for eardrum hole the hole was in...,I had my ears pierced and I have holes in my e...,I had an operation for a hole in my eardrum wh...,"i was operated for a hole in my eardrum, a hol..."
997,I have numbness and tingling in my right forea...,I have numbness and tingling in my lower right...,I have numbness and tingling in my right lower...,I have numbness and tingling in my lower right...,I have numbness and tingling in my lower right...


In [None]:
qsv.to_csv("qsv.csv")
qsv_new_questions.to_csv("qsv_new_questions.csv")

In [8]:
qsv_new_questions = pd.read_csv("/content/qsv_new_questions.csv")
qsv = pd.read_csv("/content/qsv.csv")

***
## ***Step 9: Use pre-trained models to summarize questions***

In [9]:
pip install transformers torch



In [10]:
from transformers import pipeline

summarizer_bart = pipeline("summarization", model="facebook/bart-large-cnn", device=0)
summarizer_t5 = pipeline("summarization", model="t5-small", device=0)



Device set to use cuda:0



Device set to use cuda:0


In [20]:
questions = df["CHQ"].tolist()

summaries_bart = summarizer_bart(questions, max_length=15, min_length=5, do_sample=False, truncation=True)
summaries_t5 = summarizer_t5(questions, max_length=15, min_length=5, do_sample=False, truncation=True)

bart_summaries = [summary["summary_text"] for summary in summaries_bart]
t5_summaries = [summary["summary_text"] for summary in summaries_t5]

summaries_BART = pd.DataFrame({"gold_questions": bart_summaries})
summaries_T5 = pd.DataFrame({"gold_questions": t5_summaries})


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Your max_length is set to 15, but your input_length is only 14. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=7)
Your max_length is set to 15, but your input_length is only 11. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=5)
Your max_length is set to 15, but your input_length is only 13. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=6)
Your max_length is set to 15, but your input_length is only 14. Since this is a summarization task, where outputs shor
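The warnings above fire because several questions are shorter than the fixed `max_length=15`. One way to quiet them is to derive the length limits from each input, as in this rough sketch. The helper `length_budget` is hypothetical, and it uses a whitespace word count as a stand-in for the model's token count, which is only an approximation.

```python
def length_budget(text: str, cap: int = 15, floor: int = 5) -> tuple[int, int]:
    """Return (min_length, max_length) for one input.

    Word count is a rough proxy for the model's token count (an assumption;
    the model's tokenizer would give the exact number).
    """
    n_words = len(text.split())
    # Stay below the input length when possible, but never below `floor`.
    max_length = max(floor, min(cap, n_words - 1))
    min_length = min(floor, max_length)
    return min_length, max_length

print(length_budget("who makes bromocriptine and where can i get coupons for it"))
print(length_budget("short question here"))
```

Per-input limits mean calling the summarizer one question at a time (or in groups of similar length), which trades some throughput for fewer padded or echoed summaries.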

In [22]:
summaries_T5

Unnamed: 0,gold_questions
0,manufscturs cetirizine my walmart is
1,bromocriptine i need it for a mass
2,can you tell me where do i order the nulytely
3,williams syndrome i would like to have my daug...
4,my parents died in location tx with multiple m...
...,...
995,i got surgery for hole in my ear drumhole was in
996,he has been hospitalized for severe cramping a...
997,a emg has shown nothing abnormal . i
998,i was diagnosed with sleep apnea prolly


In [49]:
languages = ["german", "italian", "french", "chineese", "spanish"]
for lang in languages:
    print(f"Summarizing {lang} questions...")
    valid_questions = new_questions[lang].dropna()
    questions = valid_questions.tolist()
    summaries_bart = summarizer_bart(questions, max_length=15, min_length=5, do_sample=False, truncation=True)
    summaries_t5 = summarizer_t5(questions, max_length=15, min_length=5, do_sample=False, truncation=True)
    bart_summaries = [summary["summary_text"] for summary in summaries_bart]
    t5_summaries = [summary["summary_text"] for summary in summaries_t5]
    summaries_BART.loc[valid_questions.index, f"fqd_{lang}"] = bart_summaries
    summaries_T5.loc[valid_questions.index, f"fqd_{lang}"] = t5_summaries


Summarizing german questions...
Summarizing italian questions...
Summarizing french questions...
Summarizing chineese questions...
Summarizing spanish questions...
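The loop above summarizes only the non-missing candidates and writes the results back by row label, so rows without a candidate stay `NaN`. The following self-contained sketch shows that alignment on toy data. `fake_summarizer` is a stand-in that mimics only the list-of-dicts output shape of the pipeline, and the frames are made up.

```python
import pandas as pd

def fake_summarizer(texts):
    # Stand-in for a Hugging Face pipeline: returns the same list-of-dicts shape.
    return [{"summary_text": t.upper()} for t in texts]

candidates = pd.DataFrame({"german": ["a", None, "c"], "french": [None, "y", None]})
summaries = pd.DataFrame({"gold_questions": ["g0", "g1", "g2"]})

for lang in candidates.columns:
    valid = candidates[lang].dropna()
    outputs = [item["summary_text"] for item in fake_summarizer(valid.tolist())]
    # Wrapping the list in a Series lets pandas align on the original row labels.
    summaries[f"fqd_{lang}"] = pd.Series(outputs, index=valid.index, dtype=object)

print(summaries)
```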


In [71]:
summaries_BART

Unnamed: 0,gold_questions,fqd_german,fqd_italian,fqd_french,fqd_chineese,fqd_spanish
0,i need to know who manufscturs cet,I am wondering what company makes the drug bro...,,,,
1,i need bromocriptine for a mass i have,I am wondering what company makes the drug bro...,i need bromocriptine for a lump i have,i need bromocriptine for a lump i have,I need bromocriptine for a mass on my,
2,"""I need to order a nulytely"" says",Nulytely is a Russian brand of ice cream.,Hi can you tell me where i can order nulyte,"""I want to order a nulytely"" """,Hi can you tell me where can i order nulyte,Nulytely is a brand of Nul
3,i would like to have my daughter tested for wi...,I want to get my daughter tested for Williams ...,,,I would like to have my daughter tested for Wi...,I would like to have my daughter tested for Wi...
4,Both my parents died in location tx with multi...,Both of my parents died of multiple myeloma in...,Both my parents died in tx with multiple myelo...,Both of my parents died in Texas with multiple...,Where can I get genetic testing for multiple m...,Both parents died in TX location with father h...
...,...,...,...,...,...,...
995,hole was in my ear from 5 0r 6 ears but,The eardrum hole was in 5th or 6th,i had surgery for eardrum hole the hole was in,"i was operated for a hole in my eardrum,",,I had an operation for a hole in my eardrum
996,looking for help for my nephew with glycogen ...,I am looking for help for my nephew who has gl...,I am looking for help for my nephew with glyco...,I am looking for help for my nephew who has gl...,Looking for help for my nephew who has Glycoge...,
997,i have numbnesstingling in my lower right arm,I have numbness and tingling in my right forearm,I have numbness and tingling in my lower right,I have numbness and tingling in my lower right,I have numbness and tingling in my right lower,I have numbness and tingling in my lower right
998,i was diagnosed with sleep apnea prolly had it...,"I have been diagnosed with sleep apnea, I have...","I have been diagnosed with sleep apnea, probab...",,I have been diagnosed with sleep apnea and hav...,"I was diagnosed with sleep apnea, I've probabl..."


In [72]:
summaries_T5

Unnamed: 0,gold_questions,fqd_german,fqd_italian,fqd_french,fqd_chineese,fqd_spanish
0,manufscturs cetirizine my walmart is,,,,,
1,bromocriptine i need it for a mass,who makes bromocriptine? I need it for a,bromocriptine i need it for a lump,bromocriptine i need it for a lump,bromocriptine is a drug made by a company,
2,can you tell me where do i order the nulytely,"can you tell me where I can order the Nulytely,",hi can you tell me where i can order nulytely,can you tell me where can i order the nulytely,can you tell me where can i order nulytely who,can you tell me where I can order Nulytely? Who
3,williams syndrome i would like to have my daug...,can you please tell me where I can go or who d...,,,Williams syndrome I would like to have my daug...,Williams syndrome I would like to have my daug...
4,my parents died in location tx with multiple m...,clinicaltrialsgov Question General Information...,clinicaltrialsgov general information question...,my father was 70 and my mother was 84 . my tre...,clinicaltrialsgov Question General Information...,both parents died in TX location with father h...
...,...,...,...,...,...,...
995,i got surgery for hole in my ear drumhole was in,the eardrum hole was in 5th or 6th ear,i had surgery for eardrum hole the hole was in my,"i was operated for a hole in my eardrum,",,the ringing in both ears started 3 years ago .
996,he has been hospitalized for severe cramping a...,my nephew has glycogen storage disease . he ha...,he has been hospitalized for severe cramps abo...,my nephew has glycogen storage disease . he ha...,nephew has glycogen storage disease and is ver...,
997,a emg has shown nothing abnormal . i,an EMG showed nothing unusual in my right fore...,"an EMG showed nothing abnormal, I have had thi...",an EMG showed nothing abnormal. I have had thi...,"the EMG shows nothing abnormal, I have had thi...",an EMG hasn't shown anything abnormal .
998,i was diagnosed with sleep apnea prolly,I have had sleep apnea for probably 5 years,sleep apnea is a sleep-ap,,sleep apnea sufferer has had it for maybe,I was diagnosed with sleep apnea for 5 years


In [73]:
summaries_BART.to_csv("summaries_bart.csv")
summaries_T5.to_csv("summaries_t5.csv")

***

In [None]:
! pip install vllm


In [None]:
from vllm import LLM
llm = LLM(model="deepseek-ai/deepseek-llm-7b-chat", dtype="float16", gpu_memory_utilization=0.999, max_model_len=1024)

INFO 02-12 17:34:53 __init__.py:190] Automatically detected platform cuda.



INFO 02-12 17:35:22 cuda.py:179] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 02-12 17:35:22 cuda.py:227] Using XFormers backend.
INFO 02-12 17:35:23 model_runner.py:1110] Starting to load model deepseek-ai/deepseek-llm-7b-chat...
INFO 02-12 17:35:24 weight_utils.py:252] Using model weights format ['*.bin']




INFO 02-12 17:41:02 model_runner.py:1115] Loading model weights took 12.8726 GB
INFO 02-12 17:41:05 worker.py:267] Memory profiling takes 2.68 seconds
INFO 02-12 17:41:05 worker.py:267] the current vLLM instance can use total_gpu_memory (14.74GiB) x gpu_memory_utilization (1.00) = 14.73GiB
INFO 02-12 17:41:05 worker.py:267] model weights take 12.87GiB; non_torch_memory takes 0.03GiB; PyTorch activation peak memory takes 0.95GiB; the rest of the memory reserved for KV Cache is 0.88GiB.
INFO 02-12 17:41:05 executor_base.py:110] # CUDA blocks: 119, # CPU blocks: 546
INFO 02-12 17:41:05 executor_base.py:115] Maximum concurrency for 1024 tokens per request: 1.86x

INFO 02-12 17:41:45 model_runner.py:1562] Graph capturing finished in 34 secs, took 0.24 GiB
INFO 02-12 17:41:45 llm_engine.py:431] init engine (profile, create kv cache, warmup model) took 43.04 seconds





In [None]:
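def summarize_question(question):
    """Ask the DeepSeek chat model for a 5 to 10 word rewrite of `question`."""
    prompt = f"Rewrite the following question in a very short between 5 to 10 words and more concise way, keeping the same meaning:\n\n{question}\n\nSummary:"
    # llm.generate returns one result per prompt; take the first completion's text.
    output = llm.generate(prompt)
    return output[0].outputs[0].text.strip()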
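Some entries in the summary table further down read like `No summary provided as "nan" means "not a number"`, which suggests that missing cells were formatted into the prompt. The sketch below shows one way to skip them and to send all prompts in a single batch. `summarize_column` and `generate_fn` are hypothetical names, and the demo uses a stub generator so it runs without a GPU. The cell above also leaves vLLM's sampling settings at their defaults, which (as far as I know) cap completions at a small number of tokens. Passing an explicit `SamplingParams` would make that limit visible.

```python
import pandas as pd

PROMPT = (
    "Rewrite the following question in a very short between 5 to 10 words "
    "and more concise way, keeping the same meaning:\n\n{question}\n\nSummary:"
)

def summarize_column(column: pd.Series, generate_fn) -> pd.Series:
    """Summarize non-missing cells in one batch; missing cells stay NaN.

    `generate_fn` is a hypothetical callable taking a list of prompts and
    returning a list of strings. With vLLM it could wrap something like
    `llm.generate(prompts, SamplingParams(temperature=0.0, max_tokens=32))`
    and read `out.outputs[0].text` from each result.
    """
    valid = column.dropna()
    prompts = [PROMPT.format(question=q) for q in valid]
    texts = [t.strip() for t in generate_fn(prompts)]
    summaries = pd.Series(texts, index=valid.index, dtype=object)
    # Reindex so the result has one entry per original row.
    return summaries.reindex(column.index)

# Demo with a stub generator (not a real model call).
demo = pd.Series(["who makes cetirizine", float("nan"), "where to order nulytely"])
result = summarize_column(
    demo, lambda prompts: [f" summary {i} " for i, _ in enumerate(prompts)]
)
print(result)
```

Batching also avoids one progress bar and one engine call per question, which is what the per-row `apply` calls produce.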


In [None]:
summeries = pd.DataFrame(columns=["gold_questions"])
summeries["gold_questions"] = df["CHQ"].apply(lambda x: summarize_question(x))



In [None]:
summeries.to_csv("summaries.csv")

In [None]:
summeries["fqd_german"] = new_questions["german"].apply(lambda x: summarize_question(x))


In [None]:
summeries["fqd_italian"] = new_questions["italian"].apply(lambda x: summarize_question(x))


In [None]:
summeries["fqd_chineese"] = new_questions["chineese"].apply(lambda x: summarize_question(x))


In [None]:
summeries["fqd_spanish"] = new_questions["spanish"].apply(lambda x: summarize_question(x))


In [None]:
summeries["fqd_french"] = new_questions["french"].apply(lambda x: summarize_question(x))


In [None]:
summeries = pd.read_csv("/content/summaries (3).csv")

In [None]:
prqd_new_questions = prqd_new_questions.set_index("Unnamed: 0")
qsv_new_questions = qsv_new_questions.set_index("Unnamed: 0")
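The `set_index("Unnamed: 0")` calls above undo a side effect of saving with `to_csv` and reloading with `read_csv`: the saved row labels come back as an ordinary column. A minimal sketch of the alternative, reading with `index_col=0`, is shown below on a toy frame. The in-memory buffer is used only to keep the example self-contained.

```python
import io
import pandas as pd

frame = pd.DataFrame({"german": ["a", "b"]}, index=[0, 3])

# Default round trip: the saved index comes back as a column named "Unnamed: 0".
buffer = io.StringIO()
frame.to_csv(buffer)
buffer.seek(0)
default_read = pd.read_csv(buffer)

# Passing index_col=0 restores the row labels directly.
buffer.seek(0)
indexed_read = pd.read_csv(buffer, index_col=0)

print(default_read.columns.tolist())
print(indexed_read.index.tolist())
```

Restoring the labels matters here because the later column assignments align on the index, and the filtered frames have gaps in their row numbers.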

In [None]:
summeries.to_csv("summaries.csv")

In [None]:
summeries["prqd_german"] = prqd_new_questions["german"].apply(lambda x: summarize_question(x))


In [None]:
summeries["prqd_italian"] = prqd_new_questions["italian"].apply(lambda x: summarize_question(x))
summeries["prqd_chineese"] = prqd_new_questions["chineese"].apply(lambda x: summarize_question(x))
summeries["prqd_french"] = prqd_new_questions["french"].apply(lambda x: summarize_question(x))
summeries["prqd_spanish"] = prqd_new_questions["spanish"].apply(lambda x: summarize_question(x))


In [None]:
summeries.to_csv("summaries.csv")

In [None]:
summeries

Unnamed: 0.1,Unnamed: 0,gold_questions,fqd_german,fqd_italian,fqd_chineese,fqd_spanish,fqd_french,prqd_german,prqd_italian,prqd_chineese,prqd_french,prqd_spanish
0,0,Who manufactures Cetirizine for Walmart's new ...,,,,,,Who makes Cetirizine? Walmart needs a new supp...,Who makes cetirizine for Walmart's supply?,Who makes Cetirizine for Walmart?,Who makes Cetirizine? Walmart looking for new ...,Cetirizine supplier needed for Walmart's new s...
1,1,Who makes bromocriptine for high cost?,Who produces Bromocriptine?,Who makes Bromocriptine? Can get coupons if pu...,Which company makes bromocriptine? Help with d...,Create a concise summary of the provided quest...,Who manufactures Bromocriptine? Company offeri...,Who makes Bromocriptine?,Make the question concise.,There is no need to rewrite the given question...,"No summary provided as ""nan"" means ""not a numb...",Provide a concise summary of the text given. I...
2,2,"Who makes Nulavé, and what's the phone number ...","Where can I order Nulytely, manufacturer, phon...",Where can I order NUlytely? Who is the manufac...,Where can I order Nulvytely? Who is the manufa...,"Where to order Nulytely, manufacturer, phone n...",Where can I order NuLeafy? Who is the maker? W...,"Where can I order Nulytely, manufacturer, phon...",Where can I order Nulytely? Who is the manufac...,"Can you tell me about Nulutymy, manufacturer, ...",Where can I order Nulytely? Who is the manufac...,Where can I order Nulytely? and Who is the man...
3,3,Can you tell me where I can test my daughter f...,Where can I test my daughter for Williams Synd...,"The original question is lacking any content, ...",Can you suggest a location or provider for a W...,Can you recommend a medical facility/professio...,Explain the revised question and provide the a...,Where can I get my daughter tested for William...,,Can you help me find a place/person to get my ...,,None required.
4,4,"Genetic test for MM, location TX, 63-year-old ...","Genetic testing for MM, cost, and location.","Get genetic test, cost, MM genetic test.",Need info on cost & location of genetic testin...,"General Info about Parents, MM, location, and ...",Where can I get MM genetic test for $?,"Where can I get genetic MM test, cost?\n\nWrit...",,"Genetic testing for multiple myeloma, cost?","Genetic test for MM, cost, and location.",Genetic Testing for Multiple Myeloma in TX for...
...,...,...,...,...,...,...,...,...,...,...,...,...
995,995,Surgery needed; still has problem; help me.\nS...,"Surgery for ear hole. Same problems listening,...",Surgery for eardrum hole didn't help; still ha...,No summary available for this question.,Had operation for hole in eardrum; still has t...,"operated for hole, hole in eardrum, buzzing/ri...","Surgery for eardrum hole, constant buzzing aft...",Surgery didn't fix hearing problem in right ea...,Surgery didn't fix hearing problem. Buzzing fo...,"Provide shorter, more concise version of the q...","Hole in eardrum, hearing problems after surger..."
996,996,Seeking guidance for nephew in Virginia with g...,Seeking help for nephew with severe cramps in ...,Seeking advice & help for nephew suffering fro...,Seeking guidance for nephew in Virginia with s...,Please provide a concise rewording of the foll...,Help for Virginia nephew suffering from glycog...,,,,,
997,997,Need help with numb tingling feeling in lower ...,Need help with numbness and tingling in right ...,"Assistance needed, numbness and tingling in lo...","Painful numbness in arm, EMG normal, long-term...","Lower right arm numbness, long duration, need ...",Help needed for numbness and tingling in lower...,I have numbness and tingling in my arm and nee...,"Lower right arm numbness, tingling, EMG shows ...","Need help for numbness, tingling in right lowe...","Need help for numbness, tingling arm; EMG norm...","I have numbness tingling in arm, need help."
998,998,"Sleep apnea and swelling issues, how long for ...",My doctor thinks a CPAP machine will help with...,How long will swollen ankles from sleep apnea ...,How long for CPAP to reduce sleep apnea-relate...,"Sleep apnea, swelling, CPAP machine, how long?",Q: What is the meaning of life?\nRewritten: Q:,,,,,


In [None]:
summeries["qsv_german"] = qsv_new_questions["german"].apply(lambda x: summarize_question(x))
summeries["qsv_italian"] = qsv_new_questions["italian"].apply(lambda x: summarize_question(x))
summeries["qsv_chineese"] = qsv_new_questions["chineese"].apply(lambda x: summarize_question(x))
summeries["qsv_french"] = qsv_new_questions["french"].apply(lambda x: summarize_question(x))
summeries["qsv_spanish"] = qsv_new_questions["spanish"].apply(lambda x: summarize_question(x))


In [None]:
summeries

row,Unnamed: 0,gold_questions,fqd_german,fqd_italian,fqd_chineese,fqd_spanish,fqd_french,prqd_german,prqd_italian,prqd_chineese,prqd_french,prqd_spanish,qsv_german,qsv_italian,qsv_chineese,qsv_french,qsv_spanish
0,0,Who manufactures Cetirizine for Walmart's new ...,,,,,,Who makes Cetirizine? Walmart needs a new supp...,Who makes cetirizine for Walmart's supply?,Who makes Cetirizine for Walmart?,Who makes Cetirizine? Walmart looking for new ...,Cetirizine supplier needed for Walmart's new s...,Who makes Cetirizine? New supply needed at Wal...,Who makes cetirizine for Walmart?\n\nExplanati...,Who makes Cetirizine for My Walmart?\nExplanat...,Who makes Cetirizine for Walmart?,Who manufactures cetirizine that Walmart is lo...
1,1,Who makes bromocriptine for high cost?,Who produces Bromocriptine?,Who makes Bromocriptine? Can get coupons if pu...,Which company makes bromocriptine? Help with d...,Create a concise summary of the provided quest...,Who manufactures Bromocriptine? Company offeri...,Who makes Bromocriptine?,Make the question concise.,There is no need to rewrite the given question...,"No summary provided as ""nan"" means ""not a numb...",Provide a concise summary of the text given. I...,Who makes Bromocriptine and can you offer a co...,Who makes bromocriptine and offers coupons?,Create an abbreviated version of the given sta...,There is no question to rephrase.,None.
2,2,"Who makes Nulavé, and what's the phone number ...","Where can I order Nulytely, manufacturer, phon...",Where can I order NUlytely? Who is the manufac...,Where can I order Nulvytely? Who is the manufa...,"Where to order Nulytely, manufacturer, phone n...",Where can I order NuLeafy? Who is the maker? W...,"Where can I order Nulytely, manufacturer, phon...",Where can I order Nulytely? Who is the manufac...,"Can you tell me about Nulutymy, manufacturer, ...",Where can I order Nulytely? Who is the manufac...,Where can I order Nulytely? and Who is the man...,Can you tell me the location and contact info ...,Can you help me find a retailer and contact in...,Where can I order noni juice? Who makes it? Wh...,Order information for a specific product and i...,"Can you help me order Nulytely, manufacturer's..."
3,3,Can you tell me where I can test my daughter f...,Where can I test my daughter for Williams Synd...,"The original question is lacking any content, ...",Can you suggest a location or provider for a W...,Can you recommend a medical facility/professio...,Explain the revised question and provide the a...,Where can I get my daughter tested for William...,,Can you help me find a place/person to get my ...,,None required.,Can you tell me where I can test my daughter f...,,Can you tell me where or whom I can get tested...,,(no summary given)
4,4,"Genetic test for MM, location TX, 63-year-old ...","Genetic testing for MM, cost, and location.","Get genetic test, cost, MM genetic test.",Need info on cost & location of genetic testin...,"General Info about Parents, MM, location, and ...",Where can I get MM genetic test for $?,"Where can I get genetic MM test, cost?\n\nWrit...",,"Genetic testing for multiple myeloma, cost?","Genetic test for MM, cost, and location.",Genetic Testing for Multiple Myeloma in TX for...,"Where can I get MM genetic test, cost?",Create a shorter version of the following ques...,"Genetic test for multiple myeloma, cost, where...",Where can I get MM genetic test and cost?,"Genetic testing for MM in TX, cost and location?"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,995,Surgery needed; still has problem; help me.\nS...,"Surgery for ear hole. Same problems listening,...",Surgery for eardrum hole didn't help; still ha...,No summary available for this question.,Had operation for hole in eardrum; still has t...,"operated for hole, hole in eardrum, buzzing/ri...","Surgery for eardrum hole, constant buzzing aft...",Surgery didn't fix hearing problem in right ea...,Surgery didn't fix hearing problem. Buzzing fo...,"Provide shorter, more concise version of the q...","Hole in eardrum, hearing problems after surger...","Surgery for eardrum hole, still have problems ...",Surgery didn't fix hearing problem 2 years aft...,"I had my ear pierced years ago, but constant b...",I was operated for hole in eardrum; issue stil...,Operation for hole in eardrum; issue persists;...
996,996,Seeking guidance for nephew in Virginia with g...,Seeking help for nephew with severe cramps in ...,Seeking advice & help for nephew suffering fro...,Seeking guidance for nephew in Virginia with s...,Please provide a concise rewording of the foll...,Help for Virginia nephew suffering from glycog...,,,,,,,,,,
997,997,Need help with numb tingling feeling in lower ...,Need help with numbness and tingling in right ...,"Assistance needed, numbness and tingling in lo...","Painful numbness in arm, EMG normal, long-term...","Lower right arm numbness, long duration, need ...",Help needed for numbness and tingling in lower...,I have numbness and tingling in my arm and nee...,"Lower right arm numbness, tingling, EMG shows ...","Need help for numbness, tingling in right lowe...","Need help for numbness, tingling arm; EMG norm...","I have numbness tingling in arm, need help.",I have numbness and tingling in my right forea...,"Lower right arm numbness, tingling, long term ...",Numbness and tingling question,"Lower right arm numbness, tingling, EMG normal...",Lower right arm numbness and tingling for year...
998,998,"Sleep apnea and swelling issues, how long for ...",My doctor thinks a CPAP machine will help with...,How long will swollen ankles from sleep apnea ...,How long for CPAP to reduce sleep apnea-relate...,"Sleep apnea, swelling, CPAP machine, how long?",Q: What is the meaning of life?\nRewritten: Q:,,,,,,,,,,
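
Several cells in the table above contain text such as `No summary provided as "nan" means ...` or `There is no question to rephrase.`, which suggests that missing questions may have reached the summarizer as the literal value `nan`. A minimal sketch of one way to guard against this is shown below. The wrapper `summarize_if_present` and the stand-in `fake_summarize` are hypothetical names used only for illustration; in the notebook the callable would be `summarize_question`.

```python
import math


def is_missing(value):
    """Return True for None, float NaN, or blank strings."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and not value.strip()


def summarize_if_present(value, summarize):
    """Call `summarize` only on real text; missing inputs become None."""
    if is_missing(value):
        return None
    return summarize(value)


# Stand-in for the notebook's summarize_question (hypothetical, illustration only):
# it simply keeps the first question of the text.
def fake_summarize(text):
    return text.split("?")[0] + "?"


questions = ["Where can I order Nulytely? Who makes it?", float("nan"), "  "]
summaries = [summarize_if_present(q, fake_summarize) for q in questions]
print(summaries)
```

With pandas, the same wrapper could be used as `column.apply(lambda x: summarize_if_present(x, summarize_question))`, so that missing questions stay missing and are later removed by `dropna` instead of being scored as text.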


In [None]:
# Skip the index so that reloading the file does not add another "Unnamed: 0" column.
summeries.to_csv("summaries.csv", index=False)

In [None]:
x = summarize_question(qsv_new_questions["german"][2])
print("\ncreated summary:     ", x)
print("the original summary:", df["Summary"][2])



created summary:      Where can I order Nulytely, manufacturer, phone number?
the original summary: Who makes nulytely, and where can I buy it?





In [61]:
summeries = pd.read_csv("checkpoints/summaries.csv")

***
## ***Step 10: Use evaluation metrics and compare the results***

In [23]:
! pip install rouge-score sacrebleu nltk

Collecting rouge-score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Collecting sacrebleu
  Downloading sacrebleu-2.5.1-py3-none-any.whl.metadata (51 kB)
Collecting portalocker (from sacrebleu)
  Downloading portalocker-3.1.1-py3-none-any.whl.metadata (8.6 kB)
Collecting colorama (from sacrebleu)
  Downloading colorama-0.4.6-py2.py3-none-any.whl.metadata (17 kB)
Downloading sacrebleu-2.5.1-py3-none-any.whl (104 kB)
Downloading colorama-0.4.6-py2.py3-none-any.whl (25 kB)
Downloading portalocker-3.1.1-py3-none-any.whl (19 kB)
Building wheels for collected packages: rouge-score
  Building wheel for rouge-score (setup.py) ... done
  Created wheel for rouge-score: filename=rouge_score

In [24]:
from rouge_score import rouge_scorer
import sacrebleu
import nltk
from nltk.translate.meteor_score import meteor_score

nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [25]:
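def compute_rouge(reference, generated):
    """Return the ROUGE-1, ROUGE-2 and ROUGE-L F-measures (each in [0, 1])."""
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    scores = scorer.score(reference, generated)
    return scores["rouge1"].fmeasure, scores["rouge2"].fmeasure, scores["rougeL"].fmeasure

def compute_bleu(reference, generated):
    """Return the sentence-level BLEU score (sacreBLEU scale, 0 to 100)."""
    # sacrebleu expects the hypothesis first, then a list of reference strings.
    return sacrebleu.sentence_bleu(generated, [reference]).score

def compute_meteor(reference, generated):
    """Return the METEOR score (in [0, 1]) on whitespace-tokenized text."""
    # NLTK expects pre-tokenized input: a list of reference token lists and one hypothesis token list.
    return meteor_score([reference.split()], generated.split())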
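
To make the ROUGE numbers in the following cells easier to read, here is a simplified, dependency-free sketch of ROUGE-1. It is not the `rouge_score` implementation: it skips stemming and uses a naive tokenizer, and the helper names `tokens` and `rouge1` are illustrative. It also shows that the F-measure does not change when the two arguments are swapped, while precision and recall trade places. That matters because the cells below pass the generated summary as the first argument and the gold `Summary` as the second; BLEU and METEOR are not symmetric in this way, so their values depend on the argument order.

```python
import re
from collections import Counter


def tokens(text):
    """Lowercase the text and keep alphanumeric word tokens (no stemming)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge1(reference, candidate):
    """Return (precision, recall, f1) based on clipped unigram overlap."""
    ref, cand = Counter(tokens(reference)), Counter(tokens(candidate))
    overlap = sum((ref & cand).values())  # per-token minimum of the two counts
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# The pair printed earlier in the notebook.
gold = "Who makes nulytely, and where can I buy it?"
created = "Where can I order Nulytely, manufacturer, phone number?"

p, r, f = rouge1(gold, created)
p_swapped, r_swapped, f_swapped = rouge1(created, gold)

print(round(p, 3), round(r, 3), round(f, 6))  # 0.5 0.444 0.470588
print(abs(f - f_swapped) < 1e-12)             # True: F1 is symmetric
```

Four of the eight tokens in the created summary also occur in the nine-token gold summary, giving precision 4/8, recall 4/9 and F1 = 8/17.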

***

In [26]:
metrics_gold_questions_bart = pd.DataFrame()

rouge_scores = summaries_BART.apply(lambda row: compute_rouge(row["gold_questions"], df.loc[row.name, "Summary"]), axis=1)
metrics_gold_questions_bart["rouge-1"] = rouge_scores.apply(lambda x: float(x[0]))
metrics_gold_questions_bart["rouge-2"] = rouge_scores.apply(lambda x: float(x[1]))
metrics_gold_questions_bart["rouge-L"] = rouge_scores.apply(lambda x: float(x[2]))

metrics_gold_questions_bart["bleu"] = summaries_BART.apply(lambda row: compute_bleu(row["gold_questions"], df.loc[row.name, "Summary"]), axis=1)
metrics_gold_questions_bart["meteor"] = summaries_BART.apply(lambda row: compute_meteor(row["gold_questions"], df.loc[row.name, "Summary"]), axis=1)


In [27]:
metrics_gold_questions_bart

row,rouge-1,rouge-2,rouge-L,bleu,meteor
0,0.200000,0.000000,0.200000,7.253155,0.075758
1,0.181818,0.000000,0.181818,5.197112,0.066667
2,0.250000,0.000000,0.125000,4.767707,0.000000
3,0.421053,0.235294,0.421053,9.442944,0.450505
4,0.166667,0.090909,0.166667,5.679677,0.180288
...,...,...,...,...,...
995,0.000000,0.000000,0.000000,0.000000,0.000000
996,0.444444,0.250000,0.444444,15.181939,0.260771
997,0.434783,0.190476,0.434783,11.114925,0.512644
998,0.200000,0.111111,0.200000,8.392230,0.187500


In [28]:
avg_metrics_gold_questions_bart = metrics_gold_questions_bart.mean().to_frame(name="Average Score")
avg_metrics_gold_questions_bart


metric,Average Score
rouge-1,0.210705
rouge-2,0.071545
rouge-L,0.185902
bleu,6.231903
meteor,0.156663


In [68]:
metrics_gold_questions_t5 = pd.DataFrame()

rouge_scores = summaries_T5.apply(lambda row: compute_rouge(row["gold_questions"], df.loc[row.name, "Summary"]), axis=1)
metrics_gold_questions_t5["rouge-1"] = rouge_scores.apply(lambda x: float(x[0]))
metrics_gold_questions_t5["rouge-2"] = rouge_scores.apply(lambda x: float(x[1]))
metrics_gold_questions_t5["rouge-L"] = rouge_scores.apply(lambda x: float(x[2]))

metrics_gold_questions_t5["bleu"] = summaries_T5.apply(lambda row: compute_bleu(row["gold_questions"], df.loc[row.name, "Summary"]), axis=1)
metrics_gold_questions_t5["meteor"] = summaries_T5.apply(lambda row: compute_meteor(row["gold_questions"], df.loc[row.name, "Summary"]), axis=1)


In [69]:
metrics_gold_questions_t5

row,rouge-1,rouge-2,rouge-L,bleu,meteor
0,0.200000,0.000000,0.200000,7.253155,0.075758
1,0.181818,0.000000,0.181818,5.197112,0.066667
2,0.250000,0.000000,0.125000,4.767707,0.000000
3,0.421053,0.235294,0.421053,9.442944,0.450505
4,0.166667,0.090909,0.166667,5.679677,0.180288
...,...,...,...,...,...
995,0.000000,0.000000,0.000000,0.000000,0.000000
996,0.444444,0.250000,0.444444,15.181939,0.260771
997,0.434783,0.190476,0.434783,11.114925,0.512644
998,0.200000,0.111111,0.200000,8.392230,0.187500


In [70]:
avg_metrics_gold_questions_t5 = metrics_gold_questions_t5.mean().to_frame(name="Average Score")
avg_metrics_gold_questions_t5


metric,Average Score
rouge-1,0.210705
rouge-2,0.071545
rouge-L,0.185902
bleu,6.231903
meteor,0.156663


***

In [52]:
metrics_fqd_bart = pd.DataFrame()
languages = ["italian", "german", "french", "chineese", "spanish"]

for language in languages:
    filtered = summaries_BART.dropna(subset=[f"fqd_{language}"])

    rouge_scores = filtered.apply(lambda row: compute_rouge(row[f"fqd_{language}"], df.loc[row.name, "Summary"]), axis=1)
    metrics_fqd_bart[(language, "ROUGE-1")] = rouge_scores.apply(lambda x: float(x[0]))
    metrics_fqd_bart[(language, "ROUGE-2")] = rouge_scores.apply(lambda x: float(x[1]))
    metrics_fqd_bart[(language, "ROUGE-L")] = rouge_scores.apply(lambda x: float(x[2]))

    metrics_fqd_bart[(language, "bleu")] = filtered.apply(lambda row: compute_bleu(row[f"fqd_{language}"], df.loc[row.name, "Summary"]), axis=1)
    metrics_fqd_bart[(language, "meteor")] = filtered.apply(lambda row: compute_meteor(row[f"fqd_{language}"], df.loc[row.name, "Summary"]), axis=1)

metrics_fqd_bart.columns = pd.MultiIndex.from_tuples(metrics_fqd_bart.columns)
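
One detail of this loop pattern is worth checking. When columns are assigned one by one to an initially empty `DataFrame`, pandas adopts the index of the first assigned Series and aligns every later assignment to it. If the per-language `dropna` filters keep different rows, rows that are missing for the first language would be silently dropped for the other languages as well. The toy example below (made-up scores, illustrative names) shows the effect and one possible alternative based on `pd.concat`; it is a sketch and not the notebook's method.

```python
import pandas as pd

# Toy scores: the first "language" is missing row 0, the second has all rows.
italian = pd.Series([0.2, 0.4], index=[1, 2])
german = pd.Series([0.9, 0.1, 0.3], index=[0, 1, 2])

# Pattern used in the loop: assign columns one by one to an empty frame.
assigned = pd.DataFrame()
assigned["italian"] = italian  # the frame adopts the index [1, 2]
assigned["german"] = german    # aligned to [1, 2], so row 0 is dropped

# Alternative: concatenate the columns, which keeps the union of the indexes.
combined = pd.concat({"italian": italian, "german": german}, axis=1).sort_index()

print(assigned["german"].mean())  # mean over rows 1 and 2 only
print(combined["german"].mean())  # mean over rows 0, 1 and 2
```

Whether this changes the reported averages depends on how much the sets of missing rows differ between languages.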


In [67]:
metrics_fqd_bart

,italian,italian,italian,italian,italian,german,german,german,german,german,...,chineese,chineese,chineese,chineese,chineese,spanish,spanish,spanish,spanish,spanish
row,ROUGE-1,ROUGE-2,ROUGE-L,bleu,meteor,ROUGE-1,ROUGE-2,ROUGE-L,bleu,meteor,...,ROUGE-1,ROUGE-2,ROUGE-L,bleu,meteor,ROUGE-1,ROUGE-2,ROUGE-L,bleu,meteor
0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
1,0.181818,0.000000,0.181818,5.197112,0.066667,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.181818,0.000000,0.181818,5.197112,0.066667,0.000000,0.000000,0.000000,0.000000,0.000000
2,0.421053,0.000000,0.210526,5.614808,0.202020,0.117647,0.000000,0.117647,0.000000,0.123457,...,0.421053,0.235294,0.315789,15.881076,0.378788,0.133333,0.000000,0.133333,0.000000,0.158730
3,0.000000,0.000000,0.000000,0.000000,0.000000,0.700000,0.333333,0.600000,4.062583,0.473251,...,0.500000,0.333333,0.500000,4.062583,0.412963,0.500000,0.333333,0.500000,4.062583,0.412963
4,0.166667,0.090909,0.166667,5.679677,0.245726,0.166667,0.090909,0.166667,5.679677,0.180288,...,0.833333,0.818182,0.833333,44.710186,0.961058,0.080000,0.000000,0.080000,2.839839,0.088496
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0.352941,0.000000,0.235294,5.630401,0.154639,0.266667,0.000000,0.266667,5.693025,0.126582,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.235294,0.000000,0.235294,5.087641,0.103093
996,0.315789,0.117647,0.315789,8.400789,0.238837,0.210526,0.000000,0.210526,4.513618,0.093458,...,0.333333,0.125000,0.333333,4.300848,0.260771,0.000000,0.000000,0.000000,0.000000,0.000000
997,0.583333,0.363636,0.583333,19.564751,0.700468,0.500000,0.272727,0.500000,17.395797,0.585938,...,0.583333,0.272727,0.500000,18.207053,0.661139,0.583333,0.363636,0.583333,19.564751,0.700468
998,0.200000,0.111111,0.200000,7.593604,0.100000,0.200000,0.111111,0.200000,7.593604,0.100000,...,0.190476,0.105263,0.190476,7.593604,0.234455,0.200000,0.111111,0.200000,8.392230,0.054945


In [53]:
avg_metrics_fqd_bart = metrics_fqd_bart.mean()
avg_metrics_fqd_bart = avg_metrics_fqd_bart.unstack(level=1)
avg_metrics_fqd_bart


language,ROUGE-1,ROUGE-2,ROUGE-L,bleu,meteor
chineese,0.120529,0.038993,0.107006,2.84995,0.084898
french,0.120785,0.03736,0.107497,2.885483,0.085473
german,0.175546,0.049065,0.155837,3.831917,0.119602
italian,0.127325,0.040392,0.111915,3.055146,0.092092
spanish,0.121179,0.039369,0.106658,2.783649,0.08751


In [55]:
avg_metrics_fqd_bart.mean().to_frame(name="Average Score")

metric,Average Score
ROUGE-1,0.133073
ROUGE-2,0.041036
ROUGE-L,0.117782
bleu,3.081229
meteor,0.093915
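
The overall score above is a mean of the five per-language means (a macro average), so every language counts equally regardless of how many valid rows it has. The sketch below reproduces the ROUGE-1 value from the per-language table and contrasts it with a row-weighted mean. The row counts are hypothetical and serve only to illustrate the difference; `weighted_mean` is an illustrative helper.

```python
from statistics import mean

# Per-language ROUGE-1 averages copied from the table above.
rouge1_by_language = {
    "chineese": 0.120529,
    "french": 0.120785,
    "german": 0.175546,
    "italian": 0.127325,
    "spanish": 0.121179,
}

# Macro average: each language contributes equally.
macro = mean(list(rouge1_by_language.values()))
print(round(macro, 6))  # 0.133073


def weighted_mean(means, counts):
    """Average the per-language means, weighted by the number of rows."""
    total = sum(counts[name] for name in means)
    return sum(means[name] * counts[name] for name in means) / total


# Hypothetical row counts (NOT taken from the notebook).
counts = {"chineese": 900, "french": 900, "german": 500, "italian": 900, "spanish": 900}
weighted = weighted_mean(rouge1_by_language, counts)
print(round(weighted, 6))
```

The two averages coincide only when every language has the same number of scored rows.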


***

In [56]:
metrics_fqd_t5 = pd.DataFrame()
languages = ["italian", "german", "french", "chineese", "spanish"]

for language in languages:
    filtered = summaries_T5.dropna(subset=[f"fqd_{language}"])

    rouge_scores = filtered.apply(lambda row: compute_rouge(row[f"fqd_{language}"], df.loc[row.name, "Summary"]), axis=1)
    metrics_fqd_t5[(language, "ROUGE-1")] = rouge_scores.apply(lambda x: float(x[0]))
    metrics_fqd_t5[(language, "ROUGE-2")] = rouge_scores.apply(lambda x: float(x[1]))
    metrics_fqd_t5[(language, "ROUGE-L")] = rouge_scores.apply(lambda x: float(x[2]))

    metrics_fqd_t5[(language, "bleu")] = filtered.apply(lambda row: compute_bleu(row[f"fqd_{language}"], df.loc[row.name, "Summary"]), axis=1)
    metrics_fqd_t5[(language, "meteor")] = filtered.apply(lambda row: compute_meteor(row[f"fqd_{language}"], df.loc[row.name, "Summary"]), axis=1)

metrics_fqd_t5.columns = pd.MultiIndex.from_tuples(metrics_fqd_t5.columns)


In [66]:
metrics_fqd_t5

,italian,italian,italian,italian,italian,german,german,german,german,german,...,chineese,chineese,chineese,chineese,chineese,spanish,spanish,spanish,spanish,spanish
row,ROUGE-1,ROUGE-2,ROUGE-L,bleu,meteor,ROUGE-1,ROUGE-2,ROUGE-L,bleu,meteor,...,ROUGE-1,ROUGE-2,ROUGE-L,bleu,meteor,ROUGE-1,ROUGE-2,ROUGE-L,bleu,meteor
0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
1,0.200000,0.000000,0.200000,7.253155,0.075758,0.363636,0.000000,0.363636,4.691812,0.066667,...,0.181818,0.000000,0.181818,5.197112,0.066667,0.000000,0.000000,0.000000,0.000000,0.000000
2,0.421053,0.000000,0.210526,6.033504,0.202020,0.421053,0.000000,0.210526,4.540014,0.151515,...,0.526316,0.235294,0.315789,18.044386,0.450505,0.526316,0.000000,0.210526,4.540014,0.202020
3,0.000000,0.000000,0.000000,0.000000,0.000000,0.260870,0.000000,0.173913,3.253062,0.148148,...,0.454545,0.300000,0.454545,3.635359,0.446429,0.454545,0.300000,0.454545,3.635359,0.446429
4,0.000000,0.000000,0.000000,0.000000,0.000000,0.083333,0.000000,0.083333,2.839839,0.048077,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.080000,0.000000,0.080000,2.839839,0.088496
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0.333333,0.000000,0.222222,4.880870,0.141509,0.250000,0.000000,0.250000,5.868925,0.113636,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.125000,0.000000,0.125000,4.278179,0.051546
996,0.105263,0.000000,0.105263,3.795485,0.046729,0.333333,0.250000,0.333333,14.128386,0.275182,...,0.352941,0.266667,0.352941,18.141207,0.330836,0.000000,0.000000,0.000000,0.000000,0.000000
997,0.153846,0.000000,0.153846,2.627962,0.087719,0.166667,0.000000,0.166667,3.125191,0.095238,...,0.153846,0.000000,0.153846,2.627962,0.087719,0.000000,0.000000,0.000000,0.000000,0.000000
998,0.250000,0.142857,0.250000,8.392230,0.340909,0.210526,0.117647,0.210526,8.392230,0.280830,...,0.222222,0.125000,0.222222,8.392230,0.228659,0.210526,0.117647,0.210526,8.392230,0.206044


In [57]:
avg_metrics_fqd_t5 = metrics_fqd_t5.mean()
avg_metrics_fqd_t5 = avg_metrics_fqd_t5.unstack(level=1)
avg_metrics_fqd_t5


language,ROUGE-1,ROUGE-2,ROUGE-L,bleu,meteor
chineese,0.118882,0.036324,0.102251,2.918356,0.084751
french,0.119119,0.035331,0.104487,2.982174,0.087407
german,0.147429,0.042175,0.128889,3.394218,0.104914
italian,0.124149,0.036519,0.106543,3.143936,0.091962
spanish,0.122452,0.037106,0.104454,3.007475,0.090344


In [58]:
avg_metrics_fqd_t5.mean().to_frame(name="Average Score")

metric,Average Score
ROUGE-1,0.126406
ROUGE-2,0.037491
ROUGE-L,0.109325
bleu,3.089232
meteor,0.091876


***

In [None]:
metrics_fqd = pd.DataFrame()

In [None]:
languages = ["italian", "german", "french", "chineese", "spanish"]

for language in languages:
    filtered = summeries.dropna(subset=[f"fqd_{language}"])

    rouge_scores = filtered.apply(lambda row: compute_rouge(row[f"fqd_{language}"], df.loc[row.name, "Summary"]), axis=1)
    metrics_fqd[(language, "ROUGE-1")] = rouge_scores.apply(lambda x: float(x[0]))  # Extract ROUGE-1
    metrics_fqd[(language, "ROUGE-2")] = rouge_scores.apply(lambda x: float(x[1]))  # Extract ROUGE-2
    metrics_fqd[(language, "ROUGE-L")] = rouge_scores.apply(lambda x: float(x[2]))  # Extract ROUGE-L

    metrics_fqd[(language, "bleu")] = filtered.apply(lambda row: compute_bleu(row[f"fqd_{language}"], df.loc[row.name, "Summary"]), axis=1)
    metrics_fqd[(language, "meteor")] = filtered.apply(lambda row: compute_meteor(row[f"fqd_{language}"], df.loc[row.name, "Summary"]), axis=1)

metrics_fqd.columns = pd.MultiIndex.from_tuples(metrics_fqd.columns)


In [None]:
metrics_fqd

,italian,italian,italian,italian,italian,german,german,german,german,german,...,chineese,chineese,chineese,chineese,chineese,spanish,spanish,spanish,spanish,spanish
row,ROUGE-1,ROUGE-2,ROUGE-L,bleu,meteor,ROUGE-1,ROUGE-2,ROUGE-L,bleu,meteor,...,ROUGE-1,ROUGE-2,ROUGE-L,bleu,meteor,ROUGE-1,ROUGE-2,ROUGE-L,bleu,meteor
1,0.307692,0.000000,0.307692,3.300991,0.107527,0.666667,0.000000,0.666667,18.995892,0.333333,...,0.181818,0.000000,0.181818,7.128374,0.066667,0.000000,0.000000,0.000000,0.000000,0.000000
2,0.476190,0.210526,0.285714,6.786053,0.320513,0.470588,0.266667,0.352941,8.913766,0.462963,...,0.400000,0.222222,0.300000,7.431878,0.347222,0.250000,0.000000,0.125000,4.456883,0.138889
3,0.000000,0.000000,0.000000,0.000000,0.042735,0.636364,0.300000,0.636364,13.147601,0.381197,...,0.400000,0.000000,0.320000,3.030756,0.104167,0.476190,0.105263,0.380952,3.696720,0.101010
4,0.380952,0.210526,0.380952,3.218583,0.129870,0.476190,0.210526,0.380952,7.314032,0.487013,...,0.461538,0.333333,0.384615,31.314225,0.449534,0.260870,0.000000,0.173913,5.412989,0.105263
5,0.476190,0.315789,0.476190,14.473711,0.444037,0.434783,0.285714,0.260870,12.067499,0.381102,...,0.476190,0.315789,0.476190,14.473711,0.444037,0.500000,0.333333,0.500000,15.851166,0.484000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0.210526,0.000000,0.210526,3.983253,0.047170,0.200000,0.000000,0.100000,2.415965,0.080645,...,0.153846,0.000000,0.153846,5.522398,0.081967,0.200000,0.000000,0.200000,3.102161,0.040323
996,0.200000,0.000000,0.200000,3.056960,0.080000,0.117647,0.000000,0.117647,4.266332,0.056180,...,0.363636,0.200000,0.363636,3.253062,0.190713,0.105263,0.000000,0.105263,3.056960,0.102041
997,0.538462,0.500000,0.538462,37.239099,0.525097,0.416667,0.272727,0.416667,16.188614,0.504167,...,0.250000,0.000000,0.250000,3.218583,0.114943,0.347826,0.190476,0.260870,5.412989,0.215517
998,0.434783,0.285714,0.434783,13.977689,0.381102,0.260870,0.095238,0.173913,4.069583,0.084746,...,0.600000,0.222222,0.400000,9.864703,0.346841,0.588235,0.266667,0.235294,5.300157,0.136986


In [None]:
avg_metrics_fqd = metrics_fqd.mean()
avg_metrics_fqd = avg_metrics_fqd.unstack(level=1)
avg_metrics_fqd


language,ROUGE-1,ROUGE-2,ROUGE-L,bleu,meteor
chineese,0.252149,0.088497,0.221494,5.719521,0.142994
french,0.258126,0.092993,0.227049,6.487052,0.150425
german,0.295847,0.107279,0.257683,6.979391,0.173467
italian,0.258471,0.090343,0.22612,6.105964,0.149357
spanish,0.246438,0.082258,0.21546,5.521652,0.142717


In [None]:
avg_metrics_fqd.mean().to_frame(name="Average Score")

metric,Average Score
ROUGE-1,0.262206
ROUGE-2,0.092274
ROUGE-L,0.229561
bleu,6.162716
meteor,0.151792


***

In [None]:
metrics_prqd = pd.DataFrame()
languages = ["italian", "german", "french", "chineese", "spanish"]

for language in languages:
    filtered = summeries.dropna(subset=[f"prqd_{language}"])

    rouge_scores = filtered.apply(lambda row: compute_rouge(row[f"prqd_{language}"], df.loc[row.name, "Summary"]), axis=1)
    metrics_prqd[(language, "ROUGE-1")] = rouge_scores.apply(lambda x: float(x[0]))
    metrics_prqd[(language, "ROUGE-2")] = rouge_scores.apply(lambda x: float(x[1]))
    metrics_prqd[(language, "ROUGE-L")] = rouge_scores.apply(lambda x: float(x[2]))

    metrics_prqd[(language, "bleu")] = filtered.apply(lambda row: compute_bleu(row[f"prqd_{language}"], df.loc[row.name, "Summary"]), axis=1)
    metrics_prqd[(language, "meteor")] = filtered.apply(lambda row: compute_meteor(row[f"prqd_{language}"], df.loc[row.name, "Summary"]), axis=1)

metrics_prqd.columns = pd.MultiIndex.from_tuples(metrics_prqd.columns)


In [None]:
metrics_prqd

,italian,italian,italian,italian,italian,german,german,german,german,german,...,chineese,chineese,chineese,chineese,chineese,spanish,spanish,spanish,spanish,spanish
row,ROUGE-1,ROUGE-2,ROUGE-L,bleu,meteor,ROUGE-1,ROUGE-2,ROUGE-L,bleu,meteor,...,ROUGE-1,ROUGE-2,ROUGE-L,bleu,meteor,ROUGE-1,ROUGE-2,ROUGE-L,bleu,meteor
0,0.400000,0.000000,0.400000,9.930284,0.087719,0.285714,0.000000,0.285714,2.570814,0.098039,...,0.500000,0.000000,0.500000,11.521591,0.104167,0.181818,0.000000,0.181818,0.000000,0.000000
1,0.000000,0.000000,0.000000,0.000000,0.000000,0.666667,0.000000,0.666667,18.995892,0.333333,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
2,0.500000,0.222222,0.300000,8.139166,0.378788,0.470588,0.266667,0.352941,8.913766,0.462963,...,0.105263,0.000000,0.105263,3.422098,0.050505,0.631579,0.352941,0.315789,8.606120,0.450505
3,0.000000,0.000000,0.000000,0.000000,0.000000,0.695652,0.476190,0.695652,19.487234,0.467372,...,0.400000,0.173913,0.400000,2.738597,0.233796,0.000000,0.000000,0.000000,0.000000,0.000000
4,0.000000,0.000000,0.000000,0.000000,0.000000,0.538462,0.333333,0.538462,27.225894,0.482696,...,0.600000,0.444444,0.600000,26.332019,0.866013,0.384615,0.333333,0.384615,2.445594,0.258709
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
991,0.454545,0.200000,0.363636,4.368584,0.286932,0.315789,0.000000,0.315789,4.065425,0.060241,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.190476,0.000000,0.095238,3.673527,0.099010
994,0.250000,0.142857,0.250000,4.456883,0.256849,0.200000,0.111111,0.200000,7.495553,0.187500,...,0.210526,0.117647,0.210526,3.747777,0.187500,0.235294,0.133333,0.235294,3.747777,0.256849
995,0.000000,0.000000,0.000000,0.000000,0.000000,0.333333,0.000000,0.222222,3.890218,0.094340,...,0.222222,0.000000,0.111111,3.983253,0.103093,0.111111,0.000000,0.111111,2.608596,0.000000
997,0.400000,0.173913,0.240000,5.816635,0.178571,0.538462,0.250000,0.461538,17.609283,0.374025,...,0.518519,0.160000,0.444444,7.955892,0.302439,0.434783,0.095238,0.434783,6.150343,0.362787


In [None]:
avg_metrics_prqd = metrics_prqd.mean()
avg_metrics_prqd = avg_metrics_prqd.unstack(level=1)
avg_metrics_prqd


language,ROUGE-1,ROUGE-2,ROUGE-L,bleu,meteor
chineese,0.250239,0.085318,0.221812,5.838075,0.144736
french,0.237092,0.078468,0.207524,5.654251,0.138687
german,0.279595,0.09791,0.244278,6.280108,0.15893
italian,0.23851,0.07881,0.205584,5.467073,0.132074
spanish,0.233923,0.075426,0.203687,5.396213,0.135073


In [None]:
avg_metrics_prqd.mean().to_frame(name="Average Score")

metric,Average Score
ROUGE-1,0.247872
ROUGE-2,0.083186
ROUGE-L,0.216577
bleu,5.727144
meteor,0.1419


***

In [62]:
metrics_qsv = pd.DataFrame()
languages = ["italian", "german", "french", "chineese", "spanish"]

for language in languages:
    filtered = summeries.dropna(subset=[f"qsv_{language}"])

    rouge_scores = filtered.apply(lambda row: compute_rouge(row[f"qsv_{language}"], df.loc[row.name, "Summary"]), axis=1)
    metrics_qsv[(language, "ROUGE-1")] = rouge_scores.apply(lambda x: float(x[0]))
    metrics_qsv[(language, "ROUGE-2")] = rouge_scores.apply(lambda x: float(x[1]))
    metrics_qsv[(language, "ROUGE-L")] = rouge_scores.apply(lambda x: float(x[2]))

    metrics_qsv[(language, "bleu")] = filtered.apply(lambda row: compute_bleu(row[f"qsv_{language}"], df.loc[row.name, "Summary"]), axis=1)
    metrics_qsv[(language, "meteor")] = filtered.apply(lambda row: compute_meteor(row[f"qsv_{language}"], df.loc[row.name, "Summary"]), axis=1)

metrics_qsv.columns = pd.MultiIndex.from_tuples(metrics_qsv.columns)


In [63]:
metrics_qsv

,italian,italian,italian,italian,italian,german,german,german,german,german,...,chineese,chineese,chineese,chineese,chineese,spanish,spanish,spanish,spanish,spanish
row,ROUGE-1,ROUGE-2,ROUGE-L,bleu,meteor,ROUGE-1,ROUGE-2,ROUGE-L,bleu,meteor,...,ROUGE-1,ROUGE-2,ROUGE-L,bleu,meteor,ROUGE-1,ROUGE-2,ROUGE-L,bleu,meteor
0,0.333333,0.000000,0.333333,1.911911,0.119048,0.363636,0.000000,0.363636,0.000000,0.066667,...,0.333333,0.000000,0.333333,0.000000,0.119048,0.500000,0.400000,0.500000,5.336573,0.350529
1,0.444444,0.000000,0.444444,7.253155,0.175439,0.333333,0.000000,0.333333,0.000000,0.119048,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
2,0.173913,0.000000,0.086957,2.735488,0.074074,0.272727,0.000000,0.090909,3.416211,0.085470,...,0.571429,0.315789,0.380952,4.023186,0.448148,0.300000,0.000000,0.200000,3.416211,0.101010
4,0.076923,0.000000,0.076923,2.839839,0.081967,0.636364,0.400000,0.636364,4.016138,0.562791,...,0.800000,0.608696,0.480000,14.458925,0.702434,0.434783,0.190476,0.347826,6.754313,0.469474
5,0.476190,0.315789,0.476190,4.521357,0.344037,0.526316,0.352941,0.526316,4.990050,0.412088,...,0.526316,0.352941,0.526316,4.515184,0.412088,0.608696,0.476190,0.347826,3.977815,0.544753
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
991,0.190476,0.000000,0.095238,4.456883,0.099010,0.190476,0.000000,0.095238,4.456883,0.099010,...,0.200000,0.000000,0.200000,4.456883,0.108696,0.476190,0.210526,0.380952,8.913766,0.441584
994,0.444444,0.250000,0.444444,16.784460,0.311653,0.300000,0.222222,0.300000,14.133289,0.270133,...,0.222222,0.125000,0.222222,4.196115,0.228659,0.285714,0.166667,0.285714,4.196115,0.340909
995,0.000000,0.000000,0.000000,0.000000,0.000000,0.222222,0.000000,0.222222,3.314288,0.094340,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.222222,0.000000,0.222222,2.873083,0.047170
997,0.400000,0.173913,0.240000,6.250382,0.280423,0.428571,0.230769,0.428571,17.395797,0.426136,...,0.315789,0.235294,0.315789,5.255923,0.577342,0.518519,0.320000,0.222222,12.874331,0.546706


In [64]:
avg_metrics_qsv = metrics_qsv.mean()
avg_metrics_qsv = avg_metrics_qsv.unstack(level=1)
avg_metrics_qsv


language,ROUGE-1,ROUGE-2,ROUGE-L,bleu,meteor
chineese,0.243706,0.080095,0.211939,4.394625,0.156385
french,0.230054,0.069549,0.200502,4.340036,0.142792
german,0.278045,0.095346,0.243765,5.32775,0.179086
italian,0.242372,0.077016,0.209199,4.686008,0.151896
spanish,0.23301,0.077947,0.205836,4.426099,0.153246


In [65]:
avg_metrics_qsv.mean().to_frame(name="Average Score")

metric,Average Score
ROUGE-1,0.245437
ROUGE-2,0.079991
ROUGE-L,0.214248
bleu,4.634904
meteor,0.156681


***

In [None]:
metrics_gold_questions = pd.DataFrame()

rouge_scores = summeries.apply(lambda row: compute_rouge(row["gold_questions"], df.loc[row.name, "Summary"]), axis=1)
metrics_gold_questions["rouge-1"] = rouge_scores.apply(lambda x: float(x[0]))
metrics_gold_questions["rouge-2"] = rouge_scores.apply(lambda x: float(x[1]))
metrics_gold_questions["rouge-L"] = rouge_scores.apply(lambda x: float(x[2]))

metrics_gold_questions["bleu"] = summeries.apply(lambda row: compute_bleu(row["gold_questions"], df.loc[row.name, "Summary"]), axis=1)
metrics_gold_questions["meteor"] = summeries.apply(lambda row: compute_meteor(row["gold_questions"], df.loc[row.name, "Summary"]), axis=1)


In [None]:
metrics_gold_questions

row,rouge-1,rouge-2,rouge-L,bleu,meteor
0,0.545455,0.444444,0.545455,13.006502,0.284091
1,0.444444,0.000000,0.444444,9.930284,0.087719
2,0.300000,0.111111,0.300000,10.600313,0.258137
3,0.538462,0.083333,0.461538,3.030756,0.163399
4,0.250000,0.181818,0.250000,2.908318,0.342377
...,...,...,...,...,...
995,0.000000,0.000000,0.000000,0.000000,0.000000
996,0.400000,0.222222,0.400000,6.766165,0.220307
997,0.480000,0.260870,0.480000,16.188614,0.486772
998,0.545455,0.200000,0.272727,4.266505,0.433145


In [None]:
avg_metrics_gold_questions = metrics_gold_questions.mean().to_frame(name="Average Score")
avg_metrics_gold_questions


metric,Average Score
rouge-1,0.295107
rouge-2,0.111745
rouge-L,0.259083
bleu,7.060116
meteor,0.173031
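
To compare the strategies side by side, the averages reported in this step for the `summeries` tables can be collected in one place. The sketch below uses only numbers printed above. sacreBLEU reports scores on a 0 to 100 scale while ROUGE and METEOR lie between 0 and 1, so BLEU is rescaled for display. The names `averages`, `rescaled` and `ranking` are illustrative.

```python
# Average scores copied from the "Average Score" tables of this step.
averages = {
    "gold_questions": {"rouge-1": 0.295107, "rouge-2": 0.111745,
                       "rouge-L": 0.259083, "bleu": 7.060116, "meteor": 0.173031},
    "fqd": {"rouge-1": 0.262206, "rouge-2": 0.092274,
            "rouge-L": 0.229561, "bleu": 6.162716, "meteor": 0.151792},
    "prqd": {"rouge-1": 0.247872, "rouge-2": 0.083186,
             "rouge-L": 0.216577, "bleu": 5.727144, "meteor": 0.1419},
    "qsv": {"rouge-1": 0.245437, "rouge-2": 0.079991,
            "rouge-L": 0.214248, "bleu": 4.634904, "meteor": 0.156681},
}


def rescaled(scores):
    """Return a copy with BLEU divided by 100 so all metrics share a 0-1 scale."""
    out = dict(scores)
    out["bleu"] = scores["bleu"] / 100.0
    return out


# Rank the strategies by average ROUGE-1, best first.
ranking = sorted(averages, key=lambda name: averages[name]["rouge-1"], reverse=True)

metrics = ["rouge-1", "rouge-2", "rouge-L", "bleu", "meteor"]
print(f"{'strategy':<16}" + "".join(f"{m:>10}" for m in metrics))
for name in ranking:
    row = rescaled(averages[name])
    print(f"{name:<16}" + "".join(f"{row[m]:>10.4f}" for m in metrics))
```

On these numbers, summaries of the original questions score highest on every metric. Among the augmented sets, `fqd` leads on ROUGE and BLEU, while `qsv` has the highest METEOR.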
