# ST Baseline experiment using Whisper and Europarl-ST (Spanish-English)


In this notebook, we learn how to use the OpenAI pre-trained model [Whisper](https://openai.com/index/whisper/) for speech translation on the [Europarl-ST dataset](https://huggingface.co/datasets/tj-solergibert/Europarl-ST), using Spanish-English speech data.

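First, we import the OpenAI Whisper library together with some additional ones (e.g. `jiwer` for WER and `pandas` for handling the results), and load the `base` Whisper model. The BLEU and COMET metrics are loaded later from the Evaluate library.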

In [None]:
import whisper
import jiwer
from whisper.normalizers import BasicTextNormalizer

from tqdm.notebook import tqdm
import pandas as pd

model = whisper.load_model("base")

Load the list of audio samples in the Europarl-ST Spanish-English test set.

In [2]:
audios = []

with open(r"./Europarl-ST/es/en/test/Europarl-ST.v2.es.en.test.200.lst", "r", encoding="utf-8") as lista_audios:
    for linea in lista_audios:
        audios.append(str(linea).strip())

len(audios)

86

Translate all the audio data into English using the Whisper (base) model. The automatic translations are stored in `translations`; the reference translations and the source transcriptions are stored in `references` and `sources`, respectively.

In [3]:
translations = []
references = []
sources = []

for audio in tqdm(audios):
    with open(r"./Europarl-ST/es/en/test/%s/translation_en" % audio, "r", encoding="utf-8") as reference:
        references.append(reference.read())
    with open(r"./Europarl-ST/es/en/test/%s/transcription.tok" % audio, encoding="utf-8") as source:
        sources.append(source.read())

    # Translate the Spanish audio directly into English text
    result = model.transcribe(
        r"./Europarl-ST/es/en/test/%s/audio_clip_diarization.m4a" % audio,
        language="Spanish",
        task="translate",
    )
    translations.append(result["text"])

  0%|          | 0/86 [00:00<?, ?it/s]
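The loop above builds three file paths per sample by string formatting. As a minimal sketch, assuming the directory layout implied by those paths, the lookups can be grouped in a small helper so that missing files are detected before the slow translation loop starts. The names `sample_paths` and `missing_files` are hypothetical helpers introduced here for illustration; they are not part of Whisper or Europarl-ST.

```python
from pathlib import Path

# Assumed layout (taken from the paths used in the loop above):
#   <root>/<sample_id>/audio_clip_diarization.m4a
#   <root>/<sample_id>/translation_en
#   <root>/<sample_id>/transcription.tok


def sample_paths(root, sample_id):
    """Return the audio, reference and source paths for one sample."""
    base = Path(root) / sample_id
    return {
        "audio": base / "audio_clip_diarization.m4a",
        "reference": base / "translation_en",
        "source": base / "transcription.tok",
    }


def missing_files(root, sample_ids):
    """List every expected file that does not exist on disk."""
    return [
        str(path)
        for sample_id in sample_ids
        for path in sample_paths(root, sample_id).values()
        if not path.is_file()
    ]
```

For instance, `missing_files("./Europarl-ST/es/en/test", audios)` should return an empty list when the dataset has been unpacked completely.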

Automatic translations, references and sources are stored in a Pandas dataframe. We show the first two rows.

In [4]:
data = pd.DataFrame(dict(translation=translations, reference=references, source=sources))
pd.set_option('display.max_colwidth', None)
data.head(2)

Unnamed: 0,translation,reference,source
0,"But the Madrid Athletic, the officials and even the Spanish police, are being mistreated by the European Federation of Fútbol, but in the case of the particular one. These initiatives aggravate the sanctions that occur to the ordinary justice. This medieval conception, this law of the law of the law, is incompatible with the law, with the European institutions, since we have to react and we will end up doing that, because these medieval gentlemen, arbitraries of Orca and Cuchillo, they have to put the line with respect to the law and the great ordinary processes of our European Union, just a lot of times.","Atlético Madrid, its fans and even the Spanish police are being mistreated by the Union of European Football Associations. However, the problem is wider than this as these federative bodies tend to increase sanctions when people resort to the ordinary courts.\nThis mediaeval concept of one law for me and another for you is contrary to our law and the European institutions. We must therefore react. In fact, we will end up having to react as these arbitrary mediaeval tyrants must abide by the law and the ordinary procedural guarantees of our Europe.","El Atlético de Madrid , los aficionados e incluso la policía española están siendo maltratados por la Federación Europea de Fútbol . Pero el caso transciende lo particular , pues esos órganos federativos agravan las sanciones a quienes recurren a la justicia ordinaria .\nEsta concepción medieval , esta ley del embudo es incompatible con el Derecho y con las instituciones europeas , desde las que hemos de reaccionar . Lo acabaremos haciendo , pues esos señores medievales arbitrarios de horca y cuchillo han de ponerse en línea con el respeto al Derecho y las garantías procesales ordinarias de nuestra Europa .\n"
1,"Thank you, Mr. President, Mr. Commissioner. Terrorism is a huge global phenomenon and the act of the serious danger that has been carried out too. Therefore, all the media have to be proportional and have to fight for their effectiveness. I have taken good note of the answers that were left to the questions that were the opportunities. It is true that there are guaranteees, it is true that it is a delicate issue, but it is not true that it is absolutely inexcusable to form a globalized and harmonized response. For some who are terrorism a little far away, they worry more about the serious and habitual worries, they worry the habitual and the collective. And that is absolutely necessary that we start where we can. If we start through the transport area where the companies are in those data, we start there. If we look at the ancient, we see which is the habit of application, we start through relationships with international sports and we have to follow them through the interiors because the terrorists, many times they do not come from outside and they do not come from inside, that they ask in the United States and they ask others what that is and that we will have to plan.","Mr President, Commissioner, terrorism and serious organised crime are global phenomena. The means for fighting these must therefore be proportional and effective.\nI took due note of the answers given to the questions. These answers were quite correct: it is true that guarantees must be demanded and that this is a delicate issue. However, it is also true that it is absolutely inexcusable to provide a globalised and harmonised response.\nThose people who are somewhat detached from terrorism are more concerned about individual guarantees. My concern is for both individual and collective guarantees. It is absolutely vital that we start where we can. 
If we have to start with air transport, given that air carriers already have this data, then that is where we must start.\nWe will demand guarantees, we will assess the scope and we will start with international transport. However, it should be noted that we will then move on to domestic transport because terrorists very often do not come from outside, but are home-grown. Ask the United States and everyone else, because that is how it is and that is how we will have to address it in the future.","Señor Presidente , señor Comisario , el terrorismo es un fenómeno global y la actuación de la delincuencia grave organizada también , y por tanto los medios tienen que ser proporcionales y hay que luchar con eficacia .\nHe tomado buena nota de las respuestas que ha dado a las preguntas , y eran oportunas : es verdad que hay que exigir garantías , es verdad que es un tema delicado , pero no es menos cierto que es absolutamente inexcusable montar , formar una respuesta globalizada y armonizada .\nA algunos , a los que el terrorismo les queda un poco lejos , les preocupan más las garantías individuales . A mí me preocupan las individuales y las colectivas , y es absolutamente necesario que empecemos por donde podamos . Si hemos de empezar por el transporte aéreo , donde ya las compañías tienen esos datos , empecemos por ahí .\nExijamos garantías , veamos cuál es el ámbito de aplicación , empecemos por las relaciones de los transportes internacionales , y ¡ ojo ! tendremos que seguir por las interiores porque los terroristas muchas veces no vienen de fuera , sino que vienen de dentro . Que se lo pregunten a Estados Unidos , que se lo pregunten a los demás , que así es y así tendremos que plantearlo en el futuro .\n"


Automatic translations, references and sources are normalized using Whisper's basic text normalization module (`BasicTextNormalizer`).

In [6]:
normalizer = BasicTextNormalizer()

data["translation_clean"] = [normalizer(text) for text in data["translation"]]
data["reference_clean"] = [normalizer(text) for text in data["reference"]]
data["source_clean"] = [normalizer(text) for text in data["source"]]
data.head(2)

Unnamed: 0,translation,reference,source,translation_clean,reference_clean,source_clean
0,"But the Madrid Athletic, the officials and even the Spanish police, are being mistreated by the European Federation of Fútbol, but in the case of the particular one. These initiatives aggravate the sanctions that occur to the ordinary justice. This medieval conception, this law of the law of the law, is incompatible with the law, with the European institutions, since we have to react and we will end up doing that, because these medieval gentlemen, arbitraries of Orca and Cuchillo, they have to put the line with respect to the law and the great ordinary processes of our European Union, just a lot of times.","Atlético Madrid, its fans and even the Spanish police are being mistreated by the Union of European Football Associations. However, the problem is wider than this as these federative bodies tend to increase sanctions when people resort to the ordinary courts.\nThis mediaeval concept of one law for me and another for you is contrary to our law and the European institutions. We must therefore react. In fact, we will end up having to react as these arbitrary mediaeval tyrants must abide by the law and the ordinary procedural guarantees of our Europe.","El Atlético de Madrid , los aficionados e incluso la policía española están siendo maltratados por la Federación Europea de Fútbol . Pero el caso transciende lo particular , pues esos órganos federativos agravan las sanciones a quienes recurren a la justicia ordinaria .\nEsta concepción medieval , esta ley del embudo es incompatible con el Derecho y con las instituciones europeas , desde las que hemos de reaccionar . 
Lo acabaremos haciendo , pues esos señores medievales arbitrarios de horca y cuchillo han de ponerse en línea con el respeto al Derecho y las garantías procesales ordinarias de nuestra Europa .\n",but the madrid athletic the officials and even the spanish police are being mistreated by the european federation of fútbol but in the case of the particular one these initiatives aggravate the sanctions that occur to the ordinary justice this medieval conception this law of the law of the law is incompatible with the law with the european institutions since we have to react and we will end up doing that because these medieval gentlemen arbitraries of orca and cuchillo they have to put the line with respect to the law and the great ordinary processes of our european union just a lot of times,atlético madrid its fans and even the spanish police are being mistreated by the union of european football associations however the problem is wider than this as these federative bodies tend to increase sanctions when people resort to the ordinary courts this mediaeval concept of one law for me and another for you is contrary to our law and the european institutions we must therefore react in fact we will end up having to react as these arbitrary mediaeval tyrants must abide by the law and the ordinary procedural guarantees of our europe,el atlético de madrid los aficionados e incluso la policía española están siendo maltratados por la federación europea de fútbol pero el caso transciende lo particular pues esos órganos federativos agravan las sanciones a quienes recurren a la justicia ordinaria esta concepción medieval esta ley del embudo es incompatible con el derecho y con las instituciones europeas desde las que hemos de reaccionar lo acabaremos haciendo pues esos señores medievales arbitrarios de horca y cuchillo han de ponerse en línea con el respeto al derecho y las garantías procesales ordinarias de nuestra europa
1,"Thank you, Mr. President, Mr. Commissioner. Terrorism is a huge global phenomenon and the act of the serious danger that has been carried out too. Therefore, all the media have to be proportional and have to fight for their effectiveness. I have taken good note of the answers that were left to the questions that were the opportunities. It is true that there are guaranteees, it is true that it is a delicate issue, but it is not true that it is absolutely inexcusable to form a globalized and harmonized response. For some who are terrorism a little far away, they worry more about the serious and habitual worries, they worry the habitual and the collective. And that is absolutely necessary that we start where we can. If we start through the transport area where the companies are in those data, we start there. If we look at the ancient, we see which is the habit of application, we start through relationships with international sports and we have to follow them through the interiors because the terrorists, many times they do not come from outside and they do not come from inside, that they ask in the United States and they ask others what that is and that we will have to plan.","Mr President, Commissioner, terrorism and serious organised crime are global phenomena. The means for fighting these must therefore be proportional and effective.\nI took due note of the answers given to the questions. These answers were quite correct: it is true that guarantees must be demanded and that this is a delicate issue. However, it is also true that it is absolutely inexcusable to provide a globalised and harmonised response.\nThose people who are somewhat detached from terrorism are more concerned about individual guarantees. My concern is for both individual and collective guarantees. It is absolutely vital that we start where we can. 
If we have to start with air transport, given that air carriers already have this data, then that is where we must start.\nWe will demand guarantees, we will assess the scope and we will start with international transport. However, it should be noted that we will then move on to domestic transport because terrorists very often do not come from outside, but are home-grown. Ask the United States and everyone else, because that is how it is and that is how we will have to address it in the future.","Señor Presidente , señor Comisario , el terrorismo es un fenómeno global y la actuación de la delincuencia grave organizada también , y por tanto los medios tienen que ser proporcionales y hay que luchar con eficacia .\nHe tomado buena nota de las respuestas que ha dado a las preguntas , y eran oportunas : es verdad que hay que exigir garantías , es verdad que es un tema delicado , pero no es menos cierto que es absolutamente inexcusable montar , formar una respuesta globalizada y armonizada .\nA algunos , a los que el terrorismo les queda un poco lejos , les preocupan más las garantías individuales . A mí me preocupan las individuales y las colectivas , y es absolutamente necesario que empecemos por donde podamos . Si hemos de empezar por el transporte aéreo , donde ya las compañías tienen esos datos , empecemos por ahí .\nExijamos garantías , veamos cuál es el ámbito de aplicación , empecemos por las relaciones de los transportes internacionales , y ¡ ojo ! tendremos que seguir por las interiores porque los terroristas muchas veces no vienen de fuera , sino que vienen de dentro . 
Que se lo pregunten a Estados Unidos , que se lo pregunten a los demás , que así es y así tendremos que plantearlo en el futuro .\n",thank you mr president mr commissioner terrorism is a huge global phenomenon and the act of the serious danger that has been carried out too therefore all the media have to be proportional and have to fight for their effectiveness i have taken good note of the answers that were left to the questions that were the opportunities it is true that there are guaranteees it is true that it is a delicate issue but it is not true that it is absolutely inexcusable to form a globalized and harmonized response for some who are terrorism a little far away they worry more about the serious and habitual worries they worry the habitual and the collective and that is absolutely necessary that we start where we can if we start through the transport area where the companies are in those data we start there if we look at the ancient we see which is the habit of application we start through relationships with international sports and we have to follow them through the interiors because the terrorists many times they do not come from outside and they do not come from inside that they ask in the united states and they ask others what that is and that we will have to plan,mr president commissioner terrorism and serious organised crime are global phenomena the means for fighting these must therefore be proportional and effective i took due note of the answers given to the questions these answers were quite correct it is true that guarantees must be demanded and that this is a delicate issue however it is also true that it is absolutely inexcusable to provide a globalised and harmonised response those people who are somewhat detached from terrorism are more concerned about individual guarantees my concern is for both individual and collective guarantees it is absolutely vital that we start where we can if we have to start with air transport given that air 
carriers already have this data then that is where we must start we will demand guarantees we will assess the scope and we will start with international transport however it should be noted that we will then move on to domestic transport because terrorists very often do not come from outside but are home grown ask the united states and everyone else because that is how it is and that is how we will have to address it in the future,señor presidente señor comisario el terrorismo es un fenómeno global y la actuación de la delincuencia grave organizada también y por tanto los medios tienen que ser proporcionales y hay que luchar con eficacia he tomado buena nota de las respuestas que ha dado a las preguntas y eran oportunas es verdad que hay que exigir garantías es verdad que es un tema delicado pero no es menos cierto que es absolutamente inexcusable montar formar una respuesta globalizada y armonizada a algunos a los que el terrorismo les queda un poco lejos les preocupan más las garantías individuales a mí me preocupan las individuales y las colectivas y es absolutamente necesario que empecemos por donde podamos si hemos de empezar por el transporte aéreo donde ya las compañías tienen esos datos empecemos por ahí exijamos garantías veamos cuál es el ámbito de aplicación empecemos por las relaciones de los transportes internacionales y ojo tendremos que seguir por las interiores porque los terroristas muchas veces no vienen de fuera sino que vienen de dentro que se lo pregunten a estados unidos que se lo pregunten a los demás que así es y así tendremos que plantearlo en el futuro
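To make the effect visible in the `*_clean` columns above easier to follow (lowercasing, punctuation replaced by spaces, whitespace collapsed, accented letters kept), here is a rough pure-Python approximation. It is an illustrative sketch and not the Whisper implementation: `simple_normalize` is a hypothetical name, and details such as the final `strip()` are assumptions made for tidiness.

```python
import re
import unicodedata


def simple_normalize(text):
    """Rough approximation of a basic text normalizer (illustrative only)."""
    text = text.lower()
    # Drop bracketed or parenthesised annotations, e.g. "[noise]" or "(applause)"
    text = re.sub(r"[<\[][^>\]]*[>\]]", "", text)
    text = re.sub(r"\(([^)]+?)\)", "", text)
    # Replace marks, symbols and punctuation (Unicode categories M*, S*, P*)
    # with a space; NFKC keeps accented letters as single letter characters
    text = "".join(
        " " if unicodedata.category(ch)[0] in "MSP" else ch
        for ch in unicodedata.normalize("NFKC", text)
    )
    # Collapse runs of whitespace (assumption: also trim both ends)
    return re.sub(r"\s+", " ", text).strip()
```

With this sketch, `simple_normalize("Thank you, Mr. President!")` gives `"thank you mr president"`, and a hyphenated word such as `home-grown` becomes `home grown`, matching what the cleaned reference column shows.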


For evaluation, we use the [Evaluate library](https://huggingface.co/docs/evaluate) which includes the definition of generic and task-specific metrics. In our case, we use the [BLEU metric](https://huggingface.co/spaces/evaluate-metric/bleu), or to be more precise, [sacreBLEU](https://huggingface.co/spaces/evaluate-metric/sacrebleu).

In [None]:
from evaluate import load

metric = load("sacrebleu")

In [8]:
result = metric.compute(predictions=data["translation_clean"], references=data["reference_clean"])
print(f'BLEU score: {result["score"]:.1f}')

BLEU score: 21.1
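To give an intuition of what the BLEU score measures, the following is a simplified corpus-level BLEU sketch: the geometric mean of clipped n-gram precisions (n = 1 to 4) multiplied by a brevity penalty. It assumes whitespace tokenization, a single reference per segment and no smoothing, so its values will generally not match those of sacreBLEU, which applies its own tokenization and smoothing. `simple_corpus_bleu` is a hypothetical name used only for this illustration.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus BLEU in [0, 100]; one reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngram_counts(hyp_tokens, n)
            ref_counts = ngram_counts(ref_tokens, n)
            # Clipped matches: each n-gram counts at most as often as in the reference
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    # Without smoothing, any order with zero matches gives a score of 0
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalise hypotheses shorter than the references
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_precision)
```

A hypothesis identical to its reference scores 100, while dropping the last word of a six-word reference keeps all precisions at 1 but lowers the score through the brevity penalty.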


Next, we compute the COMET score, also through the [Evaluate library](https://huggingface.co/docs/evaluate). Unlike BLEU, COMET takes the source sentences as input in addition to the translations and the references.

In [None]:
from evaluate import load
comet_metric = load('comet')

In [None]:
comet_score = comet_metric.compute(predictions=data["translation_clean"], references=data["reference_clean"], sources=data["source_clean"])

In [11]:
print(f"COMET: {comet_score['mean_score'] * 100:.2f} %")

COMET: 63.26 %


Finally, all the data is stored in a CSV file.

In [None]:
data.to_csv('L4.2_ST_Whisper_Baseline_Europarl-ST.csv', encoding='utf-8')

# Exercise

Perform a similar experiment using the CoVoST 2 source-English setup previously used in L4.1. Evaluate the performance of different Whisper models.