# Summarization of transcripts with Langchain

In this example, we intend to create a summarizer for long transcripts. The main goal is to break the original transcript into different chunks based on context - i.e. using an unsupervised approach to identify the different topics throughout the transcript (somehow similarly to Topic Modelling) - and summarize each of these chunks. in the end, the different summaries are returned to the user.

## Step 0: Configuring the environment

Most of the libraries that are necessary for the development of this example are built-in on the GenAI workspace, available in AI Studio. More specific libraries to handle the type of input will be added here. In this case, we are giving support to transcripts in the webvtt format, used to store transcripts, which require the webvtt-py library.

In [1]:
!pip install webvtt-py
!pip install pandas



### Configuration of Hugging face caches

In the next cell, we configure HuggingFace cache, so that all the models downloaded from them are persisted locally, even after the workspace is closed. This is a future desired feature for AI Studio and the GenAI addon.

In [2]:
import os
os.environ["HF_HOME"] = "/home/jovyan/local/hugging_face"
os.environ["HF_HUB_CACHE"] = "/home/jovyan/local/hugging_face/hub"

## Step 1: Loading the data from the transcript

At first, we need to read the data from the transcript. As our transcript is in the .vtt format, we use a library called webvtt-py to read the content. As the text is a trancript of audio/video, it is organized in small chunks of conversation, each containing a sequential id, the time of the start and end of the chunk, and the text content (often in the form speaker:content).

From this data, we expect to extract the actual content,  while keeping reference to the other metadata - for this reason, we are loading all the data into a Pandas dataset. 

In [3]:
import webvtt
import pandas as pd

data = {
    "id": [],
    "speaker": [],
    "content": [],
    "start": [],
    "end": []
}

for caption in webvtt.read('data/I_have_a_dream.vtt'):
    line = caption.text.split(":")
    while len(line) < 2:
        line = [''] + line
    data["id"].append(caption.identifier)
    data["speaker"].append(line[0].strip())
    data["content"].append(line[1].strip())
    data["start"].append(caption.start)
    data["end"].append(caption.end)
    
df = pd.DataFrame(data)

df.head()
    
    

Unnamed: 0,id,speaker,content,start,end
0,1,,I am happy to join with you today,00:00:00.880,00:00:03.920
1,2,,in what will go down in history,00:00:06.500,00:00:09.360
2,3,,as the greatest demonstration for freedom in t...,00:00:11.720,00:00:16.460
3,4,,nation.,00:00:16.460,00:00:17.293
4,5,,"Five score years ago,",00:00:26.410,00:00:28.740


As a second option, we provide here a code to load the same structure from a plain text document, which only contains the actual content of the speech/conversation, without extra metadata. For the sake of simplicity and reuse of code, we keep the same Data Frame structure as the previous version, by filling the remaining fields with empty strings.

In [4]:
import pandas as pd

with open("data/I_have_a_dream.txt") as file:
    lines = file.read()

data = {
    "id": [],
    "speaker": [],
    "content": [],
    "start": [],
    "end": []
}

for line in lines.split("\n"):
    if line.strip() != "":
        data["id"].append("")
        data["speaker"].append("")
        data["content"].append(line.strip())
        data["start"].append("")
        data["end"].append("")        
        
df = pd.DataFrame(data)

df.head()

Unnamed: 0,id,speaker,content,start,end
0,,,﻿I am happy to join with you today in what wil...,,
1,,,"Five score years ago, a great American, in who...",,
2,,,"One hundred years later, the colored American ...",,
3,,,In a sense we have come to our Nation’s Capita...,,
4,,,"This note was a promise that all men, yes, bla...",,


## Step 2: Semantic chunking of the transcript
Having the information content loaded according to the transcription format - with the text split into audio blocks, or into paragraphs, we now want to group these small blocks into relevant topics - so we can summarize each topic individually. Here, we are using a very simple approach for that, by using a semantic embedding of each sentence (using an embedding model from Hugging Face Sentence Transformers), and identifying the "breaks" among chunks as the ones with higher semantic distance. Notice that this method can be parameterized, to inform the number of topics or the best method to identify the breaks.

In [6]:
import numpy as np
from sentence_transformers import SentenceTransformer
from scipy.spatial.distance import cosine

embedding_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
embeddings = embedding_model.encode(df.content)


  from tqdm.autonotebook import tqdm, trange


In [7]:
class SemanticSplitter():
    def __init__ (self, content, embedding_model, method="number", partition_count = 10, quantile = 0.9):
        self.content = content
        self.embedding_model = embedding_model
        self.partition_count = partition_count
        self.quantile = quantile
        self.embeddings = embedding_model.encode(content)
        self.distances = [cosine(embeddings[i - 1], embeddings[i]) for i in range(1, len(embeddings))]
        self.breaks = []
        self.centroids = []
        self.load_breaks(method=method)

    def centroid_distance(self, embedding_id, centroid_id):
        return cosine(self.embeddings[embedding], self.centroid[centroid])

    def adjust_neighbors(self):
        self.breaks = []

    def load_breaks(self, method = 'number'):
        if method == 'number':
            self.breaks = np.sort(np.argpartition(self.distances, self.partition_count - 1)[0:self.partition_count - 1])
        elif method == 'quantiles':
            threshold = np.quantile(self.distances, self.quantile)
            self.breaks = [i for i, v in enumerate(self.distances) if v >= threshold]
        else:
            self.breaks = []

    def get_centroid(self, beginning, end):
        return embedding_model.encode('\n'.join(self.content[beginning : end]))
    
    def load_centroids(self):
        if len(self.breaks) == 0:
            self.centroids = [self.get_centroid(0, len(self.content))]
        else:
            self.centroids = []
            beginning = 0
            for break_position in self.breaks:
                self.centroids += [self.get_centroid(beginning, break_position + 1)]
                beginning = break_position + 1
            self.centroids += [self.get_centroid(beginning, len(self.content))]

    def get_chunk(self, beginning, end):
        return '\n'.join(self.content[beginning : end])
    
    def get_chunks(self):
        if len(self.breaks) == 0:
            return [self.get_chunk(0, len(self.content))]
        else:
            chunks = []
            beginning = 0
            for break_position in self.breaks:
                chunks += [self.get_chunk(beginning, break_position + 1)]
                beginning = break_position + 1
            chunks += [self.get_chunk(beginning, len(self.content))]
        return chunks
        
    

In [8]:
chunk_separator = "\n *-* \n"

splitter = SemanticSplitter(df.content, embedding_model, method="number", partition_count=6)
chunks = chunk_separator.join(splitter.get_chunks())

## Step 3: Using a LLM model to Summarize each chunk
In our example, we are going to summarize each individual chunk separately. This solution might be advantageous in different situations:
 * When the original text is too big , or the loaded model works with a context that is too small. In this scenario, breaking information into chunks are necessary to allow the model to be applied
 * When the user wants to make sure that all the separate topics of a conversation are covered into the summarized version. An extra step could be added to allow some verification or manual configuration of the chunks to allow the user to customize the output

To achieve this goal, we load a LLM model and use a summarization prompt. For the model, we illustrate four different options here:
* Calling an cloud API (e.g. openAI API): This would require an API key from the desired service. In our example, we reccomend saving your API keys into a secrets.yaml file, and not shared together with the code, for security issues. An example with empty keys is provided with our code
* Connecting to a Hugging Face rest API: This also requires an API key
* Loading the model locally using Hugging Face repo: This would require to download the model in the first time you run your code. This might take several minutes (depending on your internet connection), and the model will be saved in local HF cache (set to be persisted in the beginning of this notebook)
* Loading the model from a file available as a project asset.

In [9]:
### Code to access model through OpenAI service

import os
from langchain_openai import OpenAI

import yaml
with open('secrets.yaml') as file:
    secrets = yaml.safe_load(file)
os.environ["OPENAI_API_KEY"] = secrets["OpenAI"]
llm = OpenAI(model_name="gpt-3.5-turbo-instruct")


In [None]:
### Alternate code to connect to Hugging Face models
#from langchain_huggingface import HuggingFaceEndpoint

#import yaml
#with open('secrets.yaml') as file:
#    secrets = yaml.safe_load(file)
#huggingfacehub_api_token = secrets["HuggingFace"]

#repo_id = "mistralai/Mistral-7B-Instruct-v0.2"
#llm = HuggingFaceEndpoint(
#   huggingfacehub_api_token=huggingfacehub_api_token,
#   repo_id=repo_id,
#)


In [None]:
#from langchain_huggingface import HuggingFacePipeline
#from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

#model_id = "mistralai/Mistral-7B-v0.1"
#tokenizer = AutoTokenizer.from_pretrained(model_id)
#model = AutoModelForCausalLM.from_pretrained(model_id)
#pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=100, device=0)
#llm = HuggingFacePipeline(pipeline=pipe)

In [None]:
### Alternate code to load local models

#from langchain_core.callbacks import CallbackManager, StreamingStdOutCallbackHandler
#from langchain_community.llms import LlamaCpp

#callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])

#llm = LlamaCpp(
            #model_path="/home/jovyan/datafabric/Llama7b/ggml-model-f16-Q5_K_M.gguf",
            #n_gpu_layers=64,
            #n_batch=512,
            #n_ctx=4096,
            #max_tokens=1024,
            #f16_kv=True,  
            #callback_manager=callback_manager,
            #verbose=False,
            #stop=[],
            #streaming=False,
            #temperature=0.4,
        #)
    

In [10]:
prompt_template = '''
The following text is an excerpt of a transcription:

### 
{context} 
###

Please, produce a single paragraph summarizing the given excerpt.
'''

## Step 4: Create parallel chain to summarize the transcript

In the following cell, we create a chain that will receive a single string with multiple chunks (separated by the declared separator), than:
  * Break the input into separated chains - using the break_chunks function embedded in a RunnableLambda to be used in LangChain
  * Run a Parallel Chain with the following elements for each chunk:
    * Get an individual element
    * Personalize the prompt template to create an individual prompt for each chunk
    * Use the LLM inference to summarize the chunk
  * Merge the individual summaries into a single one




In [11]:
from operator import itemgetter
from langchain.prompts import ChatPromptTemplate
from langchain.schema import StrOutputParser
from langchain_core.runnables import RunnableLambda, RunnablePassthrough


def join_summaries(summaries):
    return "\n".join(list(summaries.values()))

def break_chunks(chunks):
    return chunks.split(chunk_separator)

lambda_join = RunnableLambda(join_summaries)
lambda_break = RunnableLambda(break_chunks)

prompt = ChatPromptTemplate.from_template(prompt_template)

chain = lambda_break | {f"summary_{i}" : itemgetter(i) | prompt | llm  for (i, _) in enumerate(RunnablePassthrough())} | lambda_join | StrOutputParser()



## Step 5: Connect to Galileo
Through the Galileo library called Prompt Quality, we connect our API generated in the Galileo console to log in. To get your ApiKey, use this link: https://console.hp.galileocloud.io/api-keys

In [12]:
import promptquality as pq

import yaml
with open('secrets.yaml') as file:
    secrets = yaml.safe_load(file)
os.environ['GALILEO_API_KEY'] = secrets["Galileo"]
galileo_url = "https://console.hp.galileocloud.io/"
pq.login(galileo_url)

👋 You have logged into 🔭 Galileo (https://console.hp.galileocloud.io/) as rafael.borges@hp.com.


Config(console_url=Url('https://console.hp.galileocloud.io/'), username=None, password=None, api_key=SecretStr('**********'), token=SecretStr('**********'), current_user='rafael.borges@hp.com', current_project_id=None, current_project_name=None, current_run_id=None, current_run_name=None, current_run_url=None, current_run_task_type=None, current_template_id=None, current_template_name=None, current_template_version_id=None, current_template_version=None, current_template=None, current_dataset_id=None, current_job_id=None, current_prompt_optimization_job_id=None, api_url=Url('https://api.hp.galileocloud.io/'))

## Step 6: Run the chain and connect the metrics to Galileo

In this session, we call the created chain and create the mechanisms to ingest the quality metrics into Galileo. For this example, we create a personalized metric (scorer), that will be running locally to measure the quality of the summarization. For this reason, we use HuggingFace implementation of ROUGE (using evaluate library), and implement into a CustomScorer from Galileo (next cell).

Below, we illustrate two alternative ways to connect to Galileo:
  * Using a customized run, which calculates the scores and logs into Galileo
  * Using the langchain callback (currently unavailable due to compatibility issues)

In [20]:
import evaluate
import time
import json

def rouge_executor(row) -> float:
    rouge = evaluate.load("rouge")
    reference = json.loads(row.node_input)["content"]
    prediction =  json.loads(row.node_output)["content"]
    rouge_values = rouge.compute(predictions =[prediction], references = [reference])
    return rouge_values["rougeL"]

def rouge_aggregator(scores, indices) -> dict:
    return {'Average RougeL': sum(scores)/len(scores)}

rouge_scorer = pq.CustomScorer(name='RougeL', executor=rouge_executor, aggregator=rouge_aggregator)

In [22]:
summaries = chain.invoke(chunks)

partitioned_run =  pq.EvaluateRun(
    project_name = "AIStudio_summarization_template",
    run_name = "Partitioned transcript",
    scorers=[pq.Scorers.toxicity, pq.Scorers.sexist, rouge_scorer]
)

start_time = time.time()
response = chain.invoke(chunks)
total_time = int((time.time() - start_time) * 1000000)
partitioned_run.add_workflow(input=chunks, output=response, duration_ns= total_time) 
partitioned_run.add_llm_step(input=chunks, output=response, duration_ns= total_time, model='local')

partitioned_run.finish()



Processing chain run...:   0%|          | 0/5 [00:00<?, ?it/s]

Initial job complete, executing scorers asynchronously. Current status:
cost: Done ✅
toxicity: Done ✅
sexist: Done ✅
pii: Done ✅
protect_status: Done ✅
latency: Done ✅
🔭 View your prompt run on the Galileo console at: https://console.hp.galileocloud.io/prompt/chains/1f038438-8246-4e4b-a4cf-36a7be970b97/7a9e4404-7057-4782-a391-b640d511b2cd?taskType=12


Failed to score response: None at index 0, exception:Unexpected UTF-8 BOM (decode using utf-8-sig): line 1 column 1 (char 0). Skipping row 0.


In [None]:
### THIS CODE IS NOT WORKING YET, AS GALILEO DOES NOT SUPPORT LISTS AS THE OUTPUT OF CHAIN NODES 

#summarization_callback =  pq.GalileoPromptCallback(
#    project_name = "AIStudio_summarization_template",
#    run_name = "Partitioned transcript",
#    scorers=[pq.Scorers.toxicity, pq.Scorers.sexist, rouge_scorer]
#)

#summaries = chain.invoke(chunks, config={"callbacks": [summarization_callback]})

