# AI Transcription Summarization and Translation Agent using Open Source LLM from HuggingFace

This is the fourth notebook in a series of experiments in which I build different AI agents using open-source LLMs from HuggingFace. In this notebook, I combine several LLM-based functions and call them from a Gradio interface:
* Generate transcript of an audio file
* Create a summarization of that transcript
* Translate that summary into a different language

### Google Colab
I will use Google Colab to write and run the Python code that builds the AI agents using open-source LLMs from HuggingFace. Why did I choose Google Colab instead of my local computer?
* Free access to powerful T4 GPUs needed to run most of the LLMs efficiently.
* Easy ability to share code and collaborate.

### Hugging Face
I need to connect my Colab notebook to HuggingFace in order to use the appropriate open-source LLMs for the AI application. Here are the steps:
* Create a free HuggingFace account at https://huggingface.co
* Navigate to Settings from the user menu on the top right.
* Create a new API token with **write** permissions.
* Back in this Colab notebook:
  * Press the "key" icon on the side panel to the left
  * Click on add a new secret
  * In the name field put HF_TOKEN
  * In the value field put your actual token: hf_...
  * Ensure the notebook access switch is turned ON.

This way I can use my confidential API keys for HuggingFace or other services without typing them into my Colab notebook, which I will be sharing with others.

In [None]:
# Check GPU availability and specifications, such as its memory usage, temperature, and clock speed.
# The same information is available in detail under Runtime (top menu) > View Resources
!nvidia-smi
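As a complement to `nvidia-smi`, the same check can be made from Python so that later cells can fall back to the CPU when no GPU is attached. This is a minimal sketch, assuming PyTorch is installed (as it normally is on Colab); `pick_device` is a hypothetical helper name, not part of any library.

```python
def pick_device():
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch  # guarded so the helper still works where PyTorch is missing
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"

device = pick_device()
print(f"Using device: {device}")
```

The returned string could be passed as the `device` argument of a pipeline instead of a hard-coded value.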

In [5]:
# To use open-source models, I need to connect this Colab notebook to HuggingFace by validating the token.
# The huggingface_hub library lets us interact with the HuggingFace Hub, a platform that hosts open-source LLMs and datasets.

import os
from IPython.display import Markdown, display, update_display
from huggingface_hub import login
from google.colab import userdata
import gradio as gr
from transformers import pipeline
import torch

hf_token = userdata.get('HF_TOKEN')
login(hf_token, add_to_git_credential=True)

### Mounting Google Drive in Google Colab

Google Colab allows us to access files stored in our Google Drive, making it easy to work with datasets and other resources. Here are the steps to mount Google Drive in Google Colab:

In [6]:
# Step 1 - Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Step 2 - Authorize access through the prompts in the browser

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Model Selection

I will select a model from the HuggingFace model library based on the specific application. Here are the steps:

* Go to https://huggingface.co/models.
* For the Speech Recognition / Transcription model -
  * Click on **Automatic Speech Recognition** under Audio.
  * Choose any model and review its specification.
  * I am choosing the **whisper-medium.en** model from **OpenAI**
* For the Text Summarization model -
  * Click on **Summarization** under NLP.
  * Choose any model and review its specification.
  * I am choosing the **bart-large-cnn** model from **Facebook**
* For the Translation model -
  * Click on **Translation** under NLP.
  * Choose any model and review its specification.
  * I am choosing the **nllb-200-distilled-600M** model from **Facebook**

Note: We should select a model based on various criteria, such as the specific use case, available infrastructure, latency, and performance. I will cover those in detail later.

### HuggingFace Pipeline Library

The Hugging Face pipeline API offers a much simpler approach: it provides a high-level, task-specific interface for running inference with pretrained models without manually handling tokenization, preprocessing, or postprocessing.

This approach is ideal for quick experimentation or prototyping, when we don't need granular control over the model's behavior.

In [None]:
# Instantiate the pipeline for audio transcription
transcriber = pipeline(
    task="automatic-speech-recognition",
    model="openai/whisper-medium.en",
    dtype=torch.bfloat16,
    device='cuda',
    return_timestamps=True
)

In [None]:
# Instantiate the pipeline for text summarization
summarizer = pipeline(
    task="summarization",
    model="facebook/bart-large-cnn"
)

In [None]:
# Instantiate the pipeline for text translation
translator = pipeline(
    task="translation",
    model="facebook/nllb-200-distilled-600M",
    dtype=torch.bfloat16
)

### Build Custom Function to Call LLM

We will build custom functions to transcribe an audio file, summarize the transcription, and translate the summary into a different language. Each function calls the corresponding pipeline we instantiated earlier.

In [10]:
# Create a function to generate the transcription
def transcribe(audio_filename):
  if audio_filename is None:
    return "No audio file provided"
  # Run inference - Generate the transcript text by calling the model through the pipeline API
  transcript = transcriber(audio_filename)["text"]
  # Return the transcript text
  return transcript
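Because the transcriber was created with `return_timestamps=True`, the pipeline result should also carry a `chunks` list alongside `text`, where each chunk holds a `timestamp` pair (start and end, in seconds) and its `text`. The sketch below formats such chunks as timestamped lines. `format_timestamp`, `format_chunks`, and the sample data are hypothetical, and the output shape is an assumption worth checking against the installed transformers version.

```python
def format_timestamp(seconds):
    """Convert seconds to MM:SS; a missing (None) timestamp is shown as '??:??'."""
    if seconds is None:
        return "??:??"
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"

def format_chunks(chunks):
    """Render {'timestamp': (start, end), 'text': str} dicts as one line per chunk."""
    lines = []
    for chunk in chunks:
        start, end = chunk["timestamp"]
        span = f"[{format_timestamp(start)} - {format_timestamp(end)}]"
        lines.append(f"{span} {chunk['text'].strip()}")
    return "\n".join(lines)

# Hypothetical sample shaped like the pipeline's "chunks" output
sample_chunks = [
    {"timestamp": (0.0, 4.5), "text": " Hello there."},
    {"timestamp": (4.5, 65.0), "text": " This is a test."},
]
print(format_chunks(sample_chunks))
```

With the real pipeline, this would be called on `transcriber(audio_filename)["chunks"]`.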

In [11]:
# Create a function to summarize text
def summarize(text_to_summarize):
  if text_to_summarize is None:
    return "No text provided"
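  # Run inference - Generate the text summarization by calling the model through the pipeline API
  # The pipeline returns a list of dicts, so extract the summary string from the first result
  summary = summarizer(text_to_summarize)[0]["summary_text"]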
  # Return the summary text
  return summary
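Summarization models accept a limited input length (`facebook/bart-large-cnn` handles roughly 1024 tokens), so a long transcript such as a 15-minute recording may need to be split before summarizing. This is a minimal sketch of word-based chunking, assuming word count is an acceptable rough proxy for tokens. `chunk_text`, `summarize_long`, and the 400-word default are hypothetical choices, and `summarize_fn` stands in for any function that maps text to a summary string.

```python
def chunk_text(text, max_words=400):
    """Split text into consecutive pieces of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def summarize_long(text, summarize_fn, max_words=400):
    """Summarize each chunk separately and join the partial summaries."""
    partial_summaries = [summarize_fn(chunk) for chunk in chunk_text(text, max_words)]
    return " ".join(partial_summaries)

# Stand-in summarizer for illustration only: keeps the first three words of each chunk
def fake_summarizer(chunk):
    return " ".join(chunk.split()[:3])

long_text = " ".join(f"word{i}" for i in range(10))
print(chunk_text(long_text, max_words=4))
print(summarize_long(long_text, fake_summarizer, max_words=4))
```

In practice `summarize_fn` would be a wrapper around the summarizer pipeline that returns a plain string. Splitting on words can cut a sentence in half, so a sentence-aware splitter may give better summaries.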

In [12]:
# Create a function to translate text
def translate(text_to_translate, source_language, target_language):
  if text_to_translate is None:
    return "No text provided"
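  # Run inference - Generate the text translation by calling the model through the pipeline API
  # The pipeline returns a list of dicts, so extract the translated string from the first result
  translation = translator(text_to_translate, src_lang=source_language, tgt_lang=target_language)[0]["translation_text"]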
  # Return the translated text
  return translation
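The NLLB model identifies languages with FLORES-200 codes such as `eng_Latn` rather than plain names. A small lookup table can keep those codes out of the calling code. The sketch below covers only a handful of languages, and `LANGUAGE_CODES` and `to_nllb_code` are hypothetical names.

```python
# Human-readable names mapped to FLORES-200 codes (a small illustrative subset)
LANGUAGE_CODES = {
    "English": "eng_Latn",
    "Bengali": "ben_Beng",
    "French": "fra_Latn",
    "German": "deu_Latn",
    "Hindi": "hin_Deva",
    "Spanish": "spa_Latn",
}

def to_nllb_code(language_name):
    """Look up the FLORES-200 code for a language name, ignoring case and outer spaces."""
    name = language_name.strip().title()
    if name not in LANGUAGE_CODES:
        supported = ", ".join(sorted(LANGUAGE_CODES))
        raise ValueError(f"Unsupported language '{language_name}'. Supported: {supported}")
    return LANGUAGE_CODES[name]

print(to_nllb_code("bengali"))
```

The keys of `LANGUAGE_CODES` could also serve as the choices of a dropdown in a user interface.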

In [None]:
# The audio files are stored in the root of the 'My Drive' folder.
audio_filename_1 = "/content/drive/MyDrive/denver_extract.mp3" # Audio with 15 minutes of conversation
audio_filename_2 = "/content/drive/MyDrive/little-girl.mp3" # Short 15-second audio clip

result = transcribe(audio_filename_2)
display(Markdown(result))

In [None]:
summary = summarize(result)
print(summary)

In [None]:
translation = translate(summary, "eng_Latn", "ben_Beng")
print(translation)
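The three calls above can be chained into one function, which is also a convenient shape for a Gradio callback with several outputs. This is a minimal sketch: `run_pipeline` is a hypothetical name, and the three steps are passed in as plain callables that each return a string, so the example runs without loading any model.

```python
def run_pipeline(audio_filename, transcribe_fn, summarize_fn, translate_fn,
                 source_language="eng_Latn", target_language="ben_Beng"):
    """Run transcription, summarization, and translation in order; return all three texts."""
    transcript = transcribe_fn(audio_filename)
    summary = summarize_fn(transcript)
    translation = translate_fn(summary, source_language, target_language)
    return transcript, summary, translation

# Stand-in steps for illustration only; real code would pass model-backed functions
def fake_transcribe(path):
    return f"transcript of {path}"

def fake_summarize(text):
    return f"summary of ({text})"

def fake_translate(text, source_language, target_language):
    return f"{source_language}->{target_language}: {text}"

transcript, summary, translation = run_pipeline(
    "clip.mp3", fake_transcribe, fake_summarize, fake_translate
)
print(translation)
```

Passing the steps in as arguments keeps the chaining logic independent of any particular model, which makes it easy to test or to swap one model for another.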

### Application 2 - Display the audio transcript in Gradio UI

In [None]:
# Build the UI with Gradio

ui = gr.Interface(
    fn=transcribe,
    inputs=gr.Audio(
        label="Upload audio file",
        sources=['upload'], # Accept file uploads only; 'microphone' is the other option
        type='filepath' # Pass the file path to the function; the default 'numpy' passes a (sample rate, data) tuple
    ),
    outputs=gr.Textbox(label="Transcript"),
    title="Audio Transcription using Whisper from HuggingFace",
    description="Upload an audio file and get the transcript."
)

# Launch the UI
ui.launch()