# Contents
## Whisper Model -> Transcription on an audio file
## Evaluatioin on Transcription -> WER, CER metrics
## Using OpenAI chatprompts -> Demonstraion of Summarization and Translation using openAI's chatbot prompt templates
## RAG using OpenAI llm -> Embedding, database creation, using flant t5 model, making qa_chains
## RAG using Hugging Face llm -> Embedding, database creation, using flant t5 model, making qa_chains 
## Evaluation on Translation -> BlEU, ROUGE metrics
## Evaluation on Retrieval -> BlEU, ROUGE metrics
## Gradio Library -> Interface for RAG Model



In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/validation/original transcript.txt
/kaggle/input/english-audio-1hr/sample.mp3
/kaggle/input/translation-test/german_text.txt
/kaggle/input/translation-test/english_text.txt


# Whisper Model 

In [3]:
import librosa
def load_audio_set_sample_rate(file_path):
    '''
     Code to resample the audio to 16000 if the sample rate is different
    '''
    waveform, sample_rate = librosa.load(file_path, sr=None, mono=True)
    if not sample_rate == 16000:
        tensor_waveform = librosa.resample(waveform, orig_sr=sample_rate, target_sr=16000)
    else:
        tensor_waveform = waveform

    return tensor_waveform, 16000

**Creating path to save the transcription text, using pretrained whisper-tiny model to transcribe the given audio and process each chunk to transcribe**

In [5]:
import os
from transformers import WhisperProcessor, WhisperForConditionalGeneration
from tqdm.auto import tqdm

audio_file_path = "/kaggle/input/english-audio-1hr/sample.mp3"
model_name = "openai/whisper-tiny"
text_dir = "/kaggle/working/text"
if os.path.exists(text_dir):
    print("exists")
else:
    os.mkdir(text_dir)
    print("just made")
print(bool(os.path.exists(text_dir)))



processor = WhisperProcessor.from_pretrained(model_name)
model = WhisperForConditionalGeneration.from_pretrained(model_name)
audio, sample_rate = load_audio_set_sample_rate(audio_file_path)
chunk_size = sample_rate * 20  
chunks = [audio[i:i+chunk_size] for i in range(0, len(audio), chunk_size)]
transcriptions = []
for index, chunk in enumerate(tqdm(chunks)):
    input_features = processor(chunk, sampling_rate=16000, return_tensors="pt").input_features
    predicted_ids = model.generate(input_features)
    transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
    transcriptions.append(transcription)
full_transcription = ' '.join(transcriptions)

basefile_name = os.path.splitext(os.path.basename(audio_file_path))[0]
text_file_path = os.path.join(text_dir, basefile_name + ".txt")
with open(text_file_path, 'w') as text_file:
    text_file.write(full_transcription)
print(f"Transcription saved to {text_file_path}")

just made
True


preprocessor_config.json:   0%|          | 0.00/185k [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/283k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/836k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.48M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/494k [00:00<?, ?B/s]

normalizer.json:   0%|          | 0.00/52.7k [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/34.6k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/2.19k [00:00<?, ?B/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


config.json:   0%|          | 0.00/1.98k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/151M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/3.75k [00:00<?, ?B/s]

  0%|          | 0/195 [00:00<?, ?it/s]

2024-04-21 02:39:00.670411: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-21 02:39:00.670528: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-21 02:39:00.841761: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
Due to a bug fix in https://github.com/huggingface/transformers/pull/28687 transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English.This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass `language='en'`.


Transcription saved to /kaggle/working/text/sample.txt


In [8]:
import gc
gc.collect()

0

In [9]:
content = ""
with open(text_file_path, 'r') as f:
    content = f.read()
print(content[:100])
    

 Hello, Sarah. Today, let's discuss an important aspect of business English, customer service and co


# Evaluation On Transcription

In [10]:
len(content)

57243

In [None]:
pip install --upgrade evaluate jiwer

In [12]:
from evaluate import load

wer_metric = load("wer")


Downloading builder script:   0%|          | 0.00/4.49k [00:00<?, ?B/s]

In [13]:
from transformers.models.whisper.english_normalizer import BasicTextNormalizer

normalizer = BasicTextNormalizer()

prediction = content
normalized_prediction = normalizer(prediction)

normalized_prediction[:100]

' hello sarah today let s discuss an important aspect of business english customer service and commun'

In [14]:
validation_path = "/kaggle/input/validation/original transcript.txt"
validation_content = ""
with open(validation_path, 'r') as f:
    validation_content = f.read()

    

In [15]:
len(validation_content)

57529

In [16]:
from transformers.models.whisper.english_normalizer import BasicTextNormalizer

normalizer = BasicTextNormalizer()

reference = content
normalized_reference = normalizer(reference)

normalized_reference[:100]

' hello sarah today let s discuss an important aspect of business english customer service and commun'

In [17]:

normalized_referece = normalizer(validation_content)

wer_normalized = wer_metric.compute(
    references=[normalized_referece], predictions=[normalized_prediction]
)
print(wer_normalized)

0.051586095064355936


In [18]:
wer = wer_metric.compute(references = [validation_content], predictions = [content])
print(wer)

0.27565733672603904


In [19]:
pip install python-Levenshtein

Collecting python-Levenshtein
  Downloading python_Levenshtein-0.25.1-py3-none-any.whl.metadata (3.7 kB)
Collecting Levenshtein==0.25.1 (from python-Levenshtein)
  Downloading Levenshtein-0.25.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.3 kB)
Downloading python_Levenshtein-0.25.1-py3-none-any.whl (9.4 kB)
Downloading Levenshtein-0.25.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (177 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m177.4/177.4 kB[0m [31m1.3 MB/s[0m eta [36m0:00:00[0ma [36m0:00:01[0mm
[?25hInstalling collected packages: Levenshtein, python-Levenshtein
Successfully installed Levenshtein-0.25.1 python-Levenshtein-0.25.1
Note: you may need to restart the kernel to use updated packages.


In [22]:
import Levenshtein as lev

def calculate_cer(reference, prediction):
    distance = lev.distance(reference, prediction)
    length = len(reference)
    return distance / length

cer = calculate_cer(reference = validation_content, prediction = content)
print("Character Error Rate: {:.4f}%".format(cer * 100))

Character Error Rate: 8.4966%


In [23]:

cer_normalized = calculate_cer(reference = normalized_reference, prediction = normalized_prediction)
print("Character Error Rate: {:.4f}%".format(cer_normalized * 100))

Character Error Rate: 0.0000%


# Using OpenAI Chatprompts (Just for Demonstration)

In [None]:
pip install langchain

In [32]:
# creating langchain document using the transcribed text
from langchain.document_loaders import TextLoader
docs = []
loader = TextLoader(text_file_path)
loaded_documents = loader.load()
if loaded_documents:
    docs.extend(loaded_documents)

In [34]:
#splitting the text documents in chunks using a text splitter 
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(docs)

In [35]:
texts[0].page_content

"Hello, Sarah. Today, let's discuss an important aspect of business English, customer service and communication. In many jobs, providing excellent customer service is crucial, and effective communication plays a significant role. That sounds important, Mr. Davis. I want to be better at  communicating with customers and ensuring they have a positive experience. Excellent, Sarah. Customer service and communication skills are key to building positive relationships with customers. To start, what do you think comes to mind when you hear customer service and communication? I think it's about being friendly.  listening to customers' needs and providing helpful and clear information. You've hit the nail on the head, Sarah. Customer service involves friendlyness, active listening and clear communication. Now let's practice some sentences. Finish this one. In customer service and communication, it's important to  In customer service and communication, it's important to be friendly, actively"

**Feel free to skip this section and go to "RAG using Hugging Face", this section involves using OpenAI's chatbots to summarize and translate for demonstration purpose**

In [39]:

openai_api_key = os.getenv("OPENAI_API_KEY")

In [None]:
pip install openai

In [43]:
from langchain.chat_models import ChatOpenAI
from langchain.schema import(
    AIMessage,
    HumanMessage,
    SystemMessage
)


In [46]:
speech = content

In [47]:


def work(task , speech , language, num_words ):
    '''
      Defining prompt templates for the chatbot
    '''
    chat_messages = [
    SystemMessage(content = "You are an expert in summarization and translation of langugages"),
    HumanMessage(content = f'please provide {task} of the text {speech} in {language} in {num_words} words')
                ]
    llm = ChatOpenAI(model_name = "gpt-3.5-turbo")
    result = llm(chat_messages).content
    return result


In [48]:
hindi_summary = work(task = "summarize", speech = speech, language = "hindi", num_words = 300)

  warn_deprecated(
  warn_deprecated(


In [49]:
telugu_summary = work(task = "summarize", speech = speech, language = "telugu",num_words =  200)

In [50]:
english_summary = work(task = "summarize", speech = speech, language ="english", num_words = 300 )

In [51]:

print(hindi_summary)

सारा, आपने व्यापारिक अंग्रेजी कौशल में सुधार के प्रति अपनी समर्पण देखी है। आज, हम उन विभिन्न कॉल सेंटर स्थितियों की जांच करने जा रहे हैं जो उन्हें हल करने कौशल से संबंधित हैं। एक स्थिति की कल्पना करें जहां एक ग्राहक उसके कंप्यूटर में समस्या होने के कारण कॉल कर रहा है। आप क्या कहेंगी? आप कहेंगी, "नमस्ते। आपकी कंप्यूटर की समस्या में मैं आपकी कैसे मदद कर सकता हूँ?" अद्भुत, सारा। आप स्थितियों को अच्छी तरह संभाल रही हैं। अब हम एक और करते हैं। इस बार ग्राहक फोन कर रहा है क्योंकि उसे नए फोन खरीदना है। आप क्या कहेंगी? आप कहेंगी, "नमस्ते? क्या आप नए फोन खरीदने के इच्छुक हैं?" मैं आपकी मदद कर सकता हूँ।" पूर्णतः, सारा। आप इसमें सक्षम हो रही हैं। अब एक नया विशेषांक के लिए आगे बढ़ें। ग्राहक ने धारक करने वाली उत्पाद की गलत साइज प्राप्त की है। आप क्या कहेंगी? आप कहेंगी, "नमस्ते, मुझे गलत साइज प्राप्त करने के लिए खेद है। मैं समझता हूँ कि सही उत्पाद प्राप्त करना कितना महत्वपूर्ण होता है। क्या आप अपनी आदेश संख्या प्रदान कर सकते हैं और मैं सुनिश्चित करूँगा कि सही आकार आपको भेजा जाता है?" अद्भुत, सारा। ग्

In [52]:
print(telugu_summary)

సారా, మీరు కాల్ సెంటర్ లో వేయించే విభిన్న సన్నివేశాలను అభ్యాసించడం మరియు అవి సులభంగా పరిగెత్తడం గురించి తెలుసుకోవాలని ఊహించుకుంటున్నాను. మీరు కొంత గహనమైన సన్నివేశాలను నేర్చడం ఆధునిక సంవాద నైపుణ్యాలను అభ్యాసించడం వల్ల సాధించాలి. పెద్ద దుష్ప్రభావం ఉండిన సందర్భంలో గర్విష్టుడు కాల్ చేసిన సేవను ప్రశంసిస్తున్నాడు. మీరు అదిని నేర్చడానికి సహాయం చేసినంత చాలా సంతోషం ఉంది. ఈ సందర్భంలో చాలా కఠినమైన సన్నివేశాలను నేర్చడం వల్ల, మీరు వివిధ సంవాద సన్నవేశాలలో నిపుణతను పొందడం అవసరం. అభ్యాసం చేస్తే మీ వ్యవసాయ ఇంగ్లీష్ సంవాద నైపుణ్యం మరింత పెరగడం అనుకూలం. మీకు మరికొక అభ్యాసాలు లేదా ప్రశ్నలు ఉంటే, స్పష్టంగా చెప్పండి. మీరు చాలా బాగా చేస్తున్నారు.


In [53]:
print(english_summary)

Sarah is receiving guidance from Mr. Davis to improve her communication skills in various office scenarios. They practice handling customer calls in a call center setting, addressing issues such as damaged products, double payments, and technical difficulties. Sarah learns to express empathy, provide solutions, and ask for necessary details to assist customers effectively. As they progress, the scenarios become more challenging, involving dissatisfied customers, financial difficulties, and policy changes. Sarah demonstrates advanced communication skills by acknowledging concerns, offering alternative solutions, and exploring exceptions within company policies. She learns to handle delicate situations with empathy and professionalism, ensuring customer satisfaction and fostering positive relationships. Through practice and guidance, Sarah gains confidence and proficiency in navigating complex customer interactions in a call center environment. Mr. Davis encourages her to continue practi

In [56]:
eng_dir = "/kaggle/working/summaries"
os.mkdir(eng_dir)
os.path.exists("/kaggle/working/summaries")

True

In [57]:
eng_path = os.path.join(eng_dir, "english.txt")
tel_path = os.path.join(eng_dir, "telugu.txt")
hin_path = os.path.join(eng_dir, "hindi.txt")


In [58]:

with open(eng_path, 'w') as f:
    f.write(english_summary)
with open(tel_path, 'w') as f:
    f.write(telugu_summary)
with open(hin_path, 'w') as f:
    f.write(hindi_summary)
    
    

# RAG using OpenAI gpt-3.5 llm( Hugging face RAG is in the next section)

In [None]:
pip install sentence-transformers

In [None]:
pip install chromadb


In [64]:
# checking the working of flan-t5-large model on translations

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-large")
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-large")

input_text = "Translate English to German: How old are you?"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))

tokenizer_config.json:   0%|          | 0.00/2.54k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/2.20k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.42M [00:00<?, ?B/s]

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


config.json:   0%|          | 0.00/662 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/3.13G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]



<pad> Wie alte sind Sie?</s>


In [65]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM,pipeline
from langchain import HuggingFacePipeline

#Loading the model and tokenizer from flan-t5-small model 
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")
model_translate = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")

#Creating and initializing a text2textgeneration pipeline for translation purposes
pipe = pipeline("text2text-generation", model=model_translate, tokenizer=tokenizer)
llm = HuggingFacePipeline(
    pipeline = pipe,
    model_kwargs={"temperature": 1, "max_length": 12984},
)

tokenizer_config.json:   0%|          | 0.00/2.54k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.42M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/2.20k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.40k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/308M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

In [67]:
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma

def make_embedder():
    '''
     Function to create a Hugging Face embedder and Chroma vector database
    '''
    model_name = "sentence-transformers/all-mpnet-base-v2"
    model_kwargs = {'device': 'cpu'}
    encode_kwargs = {'normalize_embeddings': False}
    return HuggingFaceEmbeddings(
        model_name=model_name,
        model_kwargs=model_kwargs,
        encode_kwargs=encode_kwargs
    )


# Initializing embedder, database for the transcribed text document
hf = make_embedder()
db = Chroma.from_documents(texts, hf)

In [72]:
from langchain.chains import RetrievalQA

def make_qa_chain():
    '''
      Initializes a Question Answer pipeline using the llm created by using OpenAI's "gpt-3.5-turbo"
      The retriever find the maximum 3 relevant documents to the query and returns them as source along with the answer
    '''
    llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
    return RetrievalQA.from_chain_type(
        llm,
        retriever=db.as_retriever(search_type="mmr", search_kwargs={'fetch_k': 3}),
        return_source_documents=True
    )
    

In [76]:
qa_chain = make_qa_chain()

def ask_question(q):
    result = qa_chain({"query": q})
    print(f"Q: {result['query'].strip()}")
    print(f"A: {result['result'].strip()}\n")
    print('\n')

In [77]:
q = "summary of the text "
ask_question(q)
while True:
    print('\nEnter `e` to exit')
    q = input('enter your question: ')
    if q == 'e':
        break
    ask_question(q)

  warn_deprecated(


Q: summary of the text
A: The text provides guidance on how to ensure active participation in diverse team meetings by encouraging everyone to speak, asking open-ended questions, and maintaining a well-structured meeting with clear goals and agendas. It also discusses the importance of analyzing financial statements by looking at key figures, making comparisons, and identifying trends. Additionally, it touches on creating a realistic budget for a project by estimating expenses, prioritizing fund allocation, and regularly reviewing and adjusting the budget.




Enter `e` to exit


enter your question:  what is the effect of communication skills


Q: what is the effect of communication skills
A: Effective communication skills have a significant impact in various aspects of professional and personal life. In a business context, strong communication skills can lead to improved relationships with customers, colleagues, and superiors. It can enhance teamwork, productivity, and overall job satisfaction. Clear and concise communication can prevent misunderstandings, conflicts, and errors, leading to more efficient operations. Additionally, good communication skills can boost confidence, credibility, and career advancement opportunities.




Enter `e` to exit


enter your question:  e


In [79]:

def rag_openai(text, task , language):
    if text == "summarize":
        result = qa_chain("Summarize this : \n" + text)
        return result['result']
    else:
        result = qa_chain("Translate English to  " + language + ": " + text)
        return result['result']


In [80]:
rag_openai(text = "My name is Pravalika, and I like color violet", task = "translate", language = "German")

'Mein Name ist Pravalika, und ich mag die Farbe Violett.'

# RAG using Hugging Face llm 

In [84]:
'''
 Repeating the same process as above but with Hugging Face llm
 Uses flan-t5-large pretrained translation model
'''

def make_hf_qa_chain():
    llm = HuggingFacePipeline(
        pipeline=pipeline("text2text-generation", model=model_translate, tokenizer=tokenizer),
        model_kwargs={"temperature": 1, "max_length": 12984}
    )
    db = Chroma.from_documents(texts, make_embedder())
    qa_chain_hf = RetrievalQA.from_chain_type(llm, retriever=db.as_retriever(search_type="mmr", search_kwargs={'fetch_k': 5}), return_source_documents=True )
    return qa_chain_hf


tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
model_translate = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")
qa_chain_hf= make_hf_qa_chain()



def ask_questions(q):
    result = qa_chain_hf({"query": q})
    print(f"Q: {result['query'].strip()}")
    print(f"A: {result['result'].strip()}\n")
    
    
    

def rag_hf(text, task , language):
    if text == "summarize":
        result = qa_chain_hf("Summarize this : \n" + text)
        return result['result']
    else:
        result = qa_chain_hf("Translate English to  " + language + ": " + text)
        return result['result']




In [85]:
text = content[:300]
print("Summarized Text:", rag_hf(text, "summarize", "English"))


Token indices sequence length is longer than the specified maximum sequence length for this model (867 > 512). Running this sequence through the model will result in indexing errors


Summarized Text: Customer service and communication skills are key to building positive relationships with customers. To start, what do


In [87]:
print(rag_hf(text, "translate", "German"))



Ich möchte besser bei Kommunikation mit Kunden und ensuring they have a positive experience


In [96]:
# Comparing results of RAG from OpenAI and Hugging Face for translation and summarization
dory_text = "I am Dory, I suffer from short term memory loss"
print(f' OpenAI Summarization: {rag_openai(text = dory_text, task = "summarize", language = "English")}\n Hugging Face Summarization: {rag_hf(dory_text, "summarize", "English")}\n\n OpenAI Translation: {rag_openai(text = dory_text, task = "translate", language = "German")}\n Hugging Face Translation: {rag_hf(dory_text, "translate", "German")}')




 OpenAI Summarization: The phrase "I am Dory, I suffer from short term memory loss" is already in English. It seems like a reference to the character Dory from the movie "Finding Nemo" who has short-term memory loss.
 Hugging Face Summarization: I am Dory, I suffer from short term memory loss

 OpenAI Translation: Ich bin Dory, ich leide an Gedächtnisverlust auf kurze Sicht.
 Hugging Face Translation: Ich bin Dory, ich besitzt eine kurze Zeitvermeidung


# Evaluation on Translation

In [100]:
from datasets import load_dataset
dataset = load_dataset('bentrevett/multi30k')

Downloading readme:   0%|          | 0.00/1.15k [00:00<?, ?B/s]

Downloading data: 100%|██████████| 4.60M/4.60M [00:00<00:00, 19.3MB/s]
Downloading data: 100%|██████████| 164k/164k [00:00<00:00, 909kB/s]
Downloading data: 100%|██████████| 156k/156k [00:00<00:00, 2.60MB/s]


Generating train split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

In [101]:
english_list = dataset['train']['en'][:50]
german_list = dataset['train']['de'][:50]


In [102]:
english_list[:5]

['Two young, White males are outside near many bushes.',
 'Several men in hard hats are operating a giant pulley system.',
 'A little girl climbing into a wooden playhouse.',
 'A man in a blue shirt is standing on a ladder cleaning a window.',
 'Two men are at the stove preparing food.']

In [103]:
english_content = ""
english_content = ''.join(english_list)
german_content = ""
german_content = ''.join(german_list)



In [104]:
english_content

"Two young, White males are outside near many bushes.Several men in hard hats are operating a giant pulley system.A little girl climbing into a wooden playhouse.A man in a blue shirt is standing on a ladder cleaning a window.Two men are at the stove preparing food.A man in green holds a guitar while the other man observes his shirt.A man is smiling at a stuffed lionA trendy girl talking on her cellphone while gliding slowly down the street.A woman with a large purse is walking by a gate.Boys dancing on poles in the middle of the night.A ballet class of five girls jumping in sequence.Four guys three wearing hats one not are jumping at the top of a staircase.A black dog and a spotted dog are fightingA man in a neon green and orange uniform is driving on a green tractor.Several women wait outside in a city.A lady in a black top with glasses is sprinkling powdered sugar on a bundt cake.A little girl is sitting in front of a large painted rainbow.A man lays on the bench to which a white dog

In [None]:
german_generated_new = ""
for sentence in english_list:
    translated_sentence = rag_hf(text = sentence, task = "translate", language = "German")
    german_generated_new += translated_sentence

In [121]:
german_generated_new

"Zwei junge, Weißer Männer sind outside nahe viele Berge.Einige Männer in Hardhats betrieben einen großen Pulleysystem.Ein kleines Mädchen steigt in ein Holzspielhaus.Ein Mann in einem blauen T-Shirt ist auf einem Leifen wachtZwei Männer sind auf dem Stove preparing food.Ein man in grüner Farbe hält ein Geiger, während der anderen man seineEin Mann ist schnappend an einem stöttiger LionA trendy girl talking on her cellphone while gliding slowly down the street.A woman with a large purse is walking by a gate.Boys dancing on poles in the middle of the night.Ein Balletklasse von fünf Mädchen, die in Sequenz schrecken.Four guys three wearing hats one not are jumping at the top of a staircase.Ein schwarzes Dog und ein störtes Dog kämpfenEin Mann in einem neongrün-Orange-Uniform fährt auf einem grünmehrere Frauen warten outside in einer Stadt.A lady in a black top with glasses is sprinkling powdered sugarEin kleines Mädchen ist in front einer großen gespaltenen Rainbow sitzen.Ein Mensch legt

In [122]:
german_content

"Zwei junge weiße Männer sind im Freien in der Nähe vieler Büsche.Mehrere Männer mit Schutzhelmen bedienen ein Antriebsradsystem.Ein kleines Mädchen klettert in ein Spielhaus aus Holz.Ein Mann in einem blauen Hemd steht auf einer Leiter und putzt ein Fenster.Zwei Männer stehen am Herd und bereiten Essen zu.Ein Mann in grün hält eine Gitarre, während der andere Mann sein Hemd ansieht.Ein Mann lächelt einen ausgestopften Löwen an.Ein schickes Mädchen spricht mit dem Handy während sie langsam die Straße entlangschwebt.Eine Frau mit einer großen Geldbörse geht an einem Tor vorbei.Jungen tanzen mitten in der Nacht auf Pfosten.Eine Ballettklasse mit fünf Mädchen, die nacheinander springen.Vier Typen, von denen drei Hüte tragen und einer nicht, springen oben in einem Treppenhaus.Ein schwarzer Hund und ein gefleckter Hund kämpfen.Ein Mann in einer neongrünen und orangefarbenen Uniform fährt auf einem grünen Traktor.Mehrere Frauen warten in einer Stadt im Freien.Eine Frau mit schwarzem Oberteil

In [124]:
bleu = evaluate.load("bleu")
translated_bleu_results = bleu.compute(predictions= [german_generated_new], references= [german_content])
print(translated_bleu_results)

{'bleu': 0.1294950915271905, 'precisions': [0.49612403100775193, 0.21941747572815534, 0.10116731517509728, 0.05458089668615984], 'brevity_penalty': 0.8270232417873509, 'length_ratio': 0.8403908794788274, 'translation_length': 516, 'reference_length': 614}


In [None]:
pip install rouge_score

In [125]:
rouge = evaluate.load("rouge")
translated_rouge_results = rouge.compute(predictions= [german_generated_new], references= [german_content])
print(translated_rouge_results)

{'rouge1': 0.47256637168141596, 'rouge2': 0.2127659574468085, 'rougeL': 0.34690265486725663, 'rougeLsum': 0.34690265486725663}


# Evaluation on Retrieval

In [126]:
from datasets import load_dataset
dataset = load_dataset('cnn_dailymail', '2.0.0')

Downloading readme:   0%|          | 0.00/15.6k [00:00<?, ?B/s]

Downloading data: 100%|██████████| 257M/257M [00:01<00:00, 224MB/s]  
Downloading data: 100%|██████████| 257M/257M [00:01<00:00, 253MB/s]  
Downloading data: 100%|██████████| 259M/259M [00:00<00:00, 266MB/s] 
Downloading data: 100%|██████████| 34.7M/34.7M [00:00<00:00, 131MB/s] 
Downloading data: 100%|██████████| 30.0M/30.0M [00:00<00:00, 94.6MB/s]


Generating train split:   0%|          | 0/287113 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/13368 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/11490 [00:00<?, ? examples/s]

In [130]:
dataset['train'][0]['article']

'LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won\'t cast a spell on him. Daniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties. "I don\'t plan to be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something similar," he told an Australian interviewer earlier this month. "I don\'t think I\'ll be particularly extravagant. "The things I like buying are things that cost about 10 pounds -- books and CDs and DVDs." At 18, Radcliffe will be able to gamble in a casino, buy a drink in a pub or see the horror film "Hostel: Part II," currently six places below his number one movie on the UK box office chart. Details o

In [131]:
dataset['train'][0]['highlights']

"Harry Potter star Daniel Radcliffe gets £20M fortune as he turns 18 Monday .\nYoung actor says he has no plans to fritter his cash away .\nRadcliffe's earnings from first five Potter films have been held in trust fund ."

In [133]:
qa_chain_hf(dataset['train'][0]['article'])['result']



'Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune'

In [137]:
actual_highlights = ""
for highlight in dataset['train'][:15]['highlights']:
    actual_highlights += highlight

In [None]:
generated_highlights = ""
for art in dataset['train'][:15]['article']:
    generated_highlights+=(qa_chain_hf(art))['result'] + "." + "\n"

In [156]:
print(generated_highlights)

Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune.
Soledad O'Brien tours jail where many mentally ill inmates are housed.
The whole bridge from one side of the Mississippi to the other just completely gave way, fell all.
Doctors removed five small polyps from President Bush's colon on Saturday.
NFL Commissioner Roger Goodell: "Your admitted conduct was not only illegal, but also cruel and.
Youssif's family will be able to travel to the United States.
Women are too afraid and ashamed to show their faces or have their real names used.
Tomas Medina Caracas, known popularly as "El Negro Ac.
White House press secretary Tony Snow, who is undergoing treatment for cancer, will step down from.
The nearest military base, Fort Dix, is more than 70 miles from Jersey City..
Bush will try to put a twist on comparisons of the war to Vietnam by invo.
Perfect. You're really mastering these leadership and management skills..
Carlos Alberto.
Bush's last colonoscopy was in Ju

In [157]:
print(actual_highlights)


Harry Potter star Daniel Radcliffe gets £20M fortune as he turns 18 Monday .
Young actor says he has no plans to fritter his cash away .
Radcliffe's earnings from first five Potter films have been held in trust fund .Mentally ill inmates in Miami are housed on the "forgotten floor"
Judge Steven Leifman says most are there as a result of "avoidable felonies"
While CNN tours facility, patient shouts: "I am the son of the president"
Leifman says the system is unjust and he's fighting for change .NEW: "I thought I was going to die," driver says .
Man says pickup truck was folded in half; he just has cut on face .
Driver: "I probably had a 30-, 35-foot free fall"
Minnesota bridge collapsed during rush hour Wednesday .Five small polyps found during procedure; "none worrisome," spokesman says .
President reclaims powers transferred to vice president .
Bush undergoes routine colonoscopy at Camp David .NEW: NFL chief, Atlanta Falcons owner critical of Michael Vick's conduct .
NFL suspends Falco

In [158]:
bleu = evaluate.load("bleu")
summarization_bleu_results = bleu.compute(predictions= [generated_highlights], references= [actual_highlights])
print(summarization_bleu_results)

{'bleu': 0.006426211445378057, 'precisions': [0.5545454545454546, 0.1324200913242009, 0.01834862385321101, 0.004608294930875576], 'brevity_penalty': 0.12873490358780418, 'length_ratio': 0.32786885245901637, 'translation_length': 220, 'reference_length': 671}


In [160]:
rouge = evaluate.load("rouge")
summarization_rouge_results = rouge.compute(predictions= [generated_highlights], references= [actual_highlights])
print(summarization_rouge_results)

{'rouge1': 0.2588235294117647, 'rouge2': 0.05766710353866318, 'rougeL': 0.13594771241830067, 'rougeLsum': 0.25359477124183005}


In [None]:
pip install gradio

In [None]:
import gradio as gr


def call_rag(text, task, language):
    if task == "summarize":
        result = qa_chain_hf(f'Summarize {text}')
        return result['result']
    else:
        result = qa_chain_hf(f"Translate to {language}: {text}")
        return result['result']
rag_demo = gr.Interface(
    call_rag,
    ["text",gr.Radio(["Summarize", "Translate"]), gr.Radio(["German", "Spanish", "Russian"])],
    "text",
    title="RAG Based Summarization and Translation Inetrface",
    live= True
)

rag_demo.launch()