# Video Indexer Transcript Analysis with LangChain + Azure OpenAI + Azure AI Search

## Contributors

- Korkrid Kyle Akepanidtaworn, AI Specialized CSA, Global Customer Success
- Serge Retkowsky, AI GBB, Microsoft France

In [1]:
# Install the required packages
%pip install langchain
%pip install langchain_community
%pip install -qU langchain-openai
%pip install --upgrade --quiet azure-search-documents
%pip install --upgrade --quiet azure-identity
# python-dotenv provides load_dotenv, which the next cell imports
%pip install python-dotenv
%pip install tqdm
%pip install tenacity

Note: you may need to restart the kernel to use updated packages.


In [2]:
# Import libraries
import time  # For execution time tracking
import json # For JSON handling
import re # For regular expressions
import requests # For making HTTP requests
import sys # To interact with the Python runtime environment
from dotenv import dotenv_values # For loading environment variables from .env file
from dotenv import load_dotenv # For loading environment variables from .env file
from pprint import pprint # For pretty-printing JSON
import getpass # For secure password input
from tqdm import tqdm # For progress bar in loops

# Import the required libraries for Azure AI Video Indexer
from VideoIndexerClient.Consts import Consts
from VideoIndexerClient.VideoIndexerClient import VideoIndexerClient

# Import standard-library helpers and the Azure OpenAI client
import os
import base64
from openai import AzureOpenAI  # Client for Azure-hosted OpenAI services

# Import LangChain components for building conversational AI and working with documents
from langchain.chains import RetrievalQA  # For building a QA pipeline with retrieval capabilities
from langchain.prompts import PromptTemplate  # Template for formatting prompts to AI models
from langchain.text_splitter import CharacterTextSplitter  # For splitting text into smaller chunks for processing
from langchain_community.document_loaders import TextLoader  # For loading documents (e.g., text files)
from langchain_community.vectorstores.azuresearch import AzureSearch  # For storing and searching embeddings in Azure AI Search
from langchain_community.retrievers import AzureAISearchRetriever  # Retriever backed by Azure AI Search
from langchain_openai import AzureChatOpenAI  # Azure OpenAI chat models
from langchain_openai import AzureOpenAIEmbeddings  # Azure OpenAI embedding models

# Load environment variables from the specified .env file
# This allows sensitive keys and endpoints to be managed securely outside the code
load_dotenv(".env")

# Print Python version to confirm environment compatibility
print(f"Python version: {sys.version}")
print(f"Python executable: {sys.executable}")

Python version: 3.12.10 (tags/v3.12.10:0cc8128, Apr  8 2025, 12:21:36) [MSC v.1943 64 bit (AMD64)]
Python executable: c:\Users\koakepan\Downloads\Azure-AI-Video-Indexer-Samples\.venv\Scripts\python.exe


In [None]:
# # Validate the environment variables for Azure OpenAI (Optional)
# print(os.getenv("AZURE_OPENAI_DEPLOYMENT_NAME").split(","))  # Print the Azure OpenAI deployment name for confirmation
# print(os.getenv("AZURE_OPENAI_ENDPOINT").split(","))  # Print the Azure OpenAI endpoint for confirmation
# print(os.getenv("AZURE_OPENAI_KEY").split(","))  # Print the Azure OpenAI key for confirmation
# print(os.getenv("AZURE_OPENAI_API_VERSION").split(","))  # print the Azure OpenAI API version for confirmation
# print(os.getenv("AZURE_COGNITIVE_SEARCH_ENDPOINT"))  # Print the Azure Cognitive Search endpoint for confirmation
# print(os.getenv("AZURE_COGNITIVE_SEARCH_API_KEY"))  # Print the Azure Cognitive Search key for confirmation
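Printing keys into a notebook output makes them easy to leak when the notebook is shared. A minimal sketch of an alternative check that only reports which variables are missing follows. The helper name `missing_env_vars` is hypothetical, and the list of names simply mirrors the variables referenced in the commented cell above.

```python
import os

# Variable names taken from the commented validation cell in this notebook
REQUIRED_VARS = [
    "AZURE_OPENAI_DEPLOYMENT_NAME",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_KEY",
    "AZURE_OPENAI_API_VERSION",
    "AZURE_COGNITIVE_SEARCH_ENDPOINT",
    "AZURE_COGNITIVE_SEARCH_API_KEY",
]


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty, without exposing any values."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
print("Missing variables:", missing if missing else "none")
```

Passing a plain dictionary as `env` makes the helper easy to try without touching the real environment.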

# Document Ingestion

In [3]:
# Document directory
DOCS_DIR = "transcripts"

# Loop through the folders
docs = []
for dirpath, dirnames, filenames in os.walk(DOCS_DIR):
    for file in filenames:
        print(file)
        try:
            loader = TextLoader(os.path.join(dirpath, file), encoding="utf-8")
            docs.extend(loader.load_and_split())
        except Exception as e:
            # Report the failure instead of silently skipping the file
            print(f"Could not load {file}: {e}")

# Split the documents into chunks of text
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(docs)
print(f"Total number of documents: {len(texts)}")  # Print the total number of chunks
print(f"First document: {texts[0]}")  # Print the first chunk for inspection

Saint_Gobain_2023.csv
St_Gobain_materials.csv
Total number of documents: 4
First document: page_content='﻿0,8493325710296631;00:00:02.080;00:00:06.295;2022 aura été décidément une année pleine de bouleversements, entre
0,8493325710296631;00:00:06.358;00:00:10.258;le dérèglement climatique, la flambée des prix de l'énergie et
0,8493325710296631;00:00:10.321;00:00:14.788;le retour de l'inflation dans cet environnement chahuté, Saint-Gobain à
0,8493325710296631;00:00:14.851;00:00:18.248;garder le cap et signé une nouvelle fois des résultats
0,8493325710296631;00:00:18.311;00:00:21.834;records ou tous les indicateurs de performance sont à la
0,8493325710296631;00:00:21.897;00:00:26.678;hausse. Des résultats qui valident l'efficacité de notre modèle opérationnel
0,8493325710296631;00:00:26.741;00:00:30.830;par pays pour s'adapter rapidement aux évolutions de nos marchés.
0,8493325710296631;00:00:30.893;00:00:34.920;Cette année 2022 aura été marquée par des réalisations majeures.
0,732153415
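The output above suggests that each transcript row holds a confidence score, a start time, an end time, and the spoken text, separated by semicolons. A minimal sketch of a row parser under that assumption follows. The `TranscriptLine` class and `parse_transcript_line` function are hypothetical names, and the decimal-comma handling is inferred only from the sample rows shown.

```python
from dataclasses import dataclass


@dataclass
class TranscriptLine:
    confidence: float
    start: str
    end: str
    text: str


def parse_transcript_line(line: str) -> TranscriptLine:
    """Parse one 'confidence;start;end;text' row."""
    # Drop a possible UTF-8 byte order mark, then surrounding whitespace
    line = line.lstrip("\ufeff").strip()
    # maxsplit=3 keeps any semicolons inside the spoken text intact
    confidence, start, end, text = line.split(";", 3)
    # Assumption: the confidence uses a decimal comma (e.g. "0,849")
    return TranscriptLine(float(confidence.replace(",", ".")), start, end, text)


row = parse_transcript_line(
    "0,8493325710296631;00:00:14.851;00:00:18.248;garder le cap et signé une nouvelle fois des résultats"
)
print(row.start, row.end, row.confidence)
```

Structured rows like these could be attached to each chunk as metadata, which may make timestamp questions easier to answer than relying on raw CSV text alone.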

## Embedding the Documents and Loading Them into Azure AI Search

In [5]:
from langchain_openai import AzureOpenAIEmbeddings

# Initialize the OpenAI embeddings model using Azure settings
embeddings = AzureOpenAIEmbeddings(
    deployment=os.getenv("AZURE_ADA_EMBEDDING_DEPLOYMENT_NAME"),
    model=os.getenv("AZURE_ADA_EMBEDDING_MODEL_NAME"),
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_key=os.getenv("AZURE_OPENAI_KEY"),
    chunk_size=1,
)

print(f"Embedding model: {embeddings}")  # Print the embedding model details for verification
print(f"Embedding model type: {type(embeddings)}")  # Print the type of the embedding model for verification

Embedding model: client=<openai.resources.embeddings.Embeddings object at 0x0000028837875940> async_client=<openai.resources.embeddings.AsyncEmbeddings object at 0x000002883787F800> model='text-embedding-ada-002' dimensions=None deployment='text-embedding-ada-002' openai_api_version='2023-05-15' openai_api_base=None openai_api_type='azure' openai_proxy=None embedding_ctx_length=8191 openai_api_key=SecretStr('**********') openai_organization=None allowed_special=None disallowed_special=None chunk_size=1 max_retries=2 request_timeout=None headers=None tiktoken_enabled=True tiktoken_model_name=None show_progress_bar=False model_kwargs={} skip_empty=False default_headers=None default_query=None retry_min_seconds=4 retry_max_seconds=20 http_client=None http_async_client=None check_embedding_ctx_length=True azure_endpoint='https://devaoai1234.openai.azure.com/' azure_ad_token=None azure_ad_token_provider=None azure_ad_async_token_provider=None validate_base_url=True
Embedding model type: <cl

In [6]:
# Specify the index name for Azure AI Search
index_name = os.getenv("AZURE_COGNITIVE_SEARCH_INDEX_NAME")
print(f"Index name: {index_name}")  # Print the index name for reference

# Initialize our Azure AI Search instance
vector_store: AzureSearch = AzureSearch(
    azure_search_endpoint=os.getenv("AZURE_COGNITIVE_SEARCH_ENDPOINT"),
    azure_search_key=os.getenv("AZURE_COGNITIVE_SEARCH_API_KEY"),
    index_name=index_name,
    embedding_function=embeddings.embed_query,
)
print("Azure AI Search instance initialized.")  # Print confirmation of Azure Search initialization

Index name: videoindexer-transcripts
Azure AI Search instance initialized.


In [7]:
# Add documents to Azure AI Search with tqdm progress tracking
failed = 0
for text in tqdm(texts, desc="Adding documents to Azure AI Search"):
    try:
        vector_store.add_documents(documents=[text])
    except Exception as e:
        failed += 1
        print(f"Error adding document: {e}")
if failed == 0:
    print("All documents added to Azure AI Search.")
else:
    print(f"{failed} of {len(texts)} documents could not be added to Azure AI Search.")

Adding documents to Azure AI Search: 100%|██████████| 4/4 [00:02<00:00,  1.67it/s]

All documents added to Azure AI Search.
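Uploading one document per call is fine for four chunks, but larger transcript sets may benefit from batching and from retrying transient failures. The sketch below uses only the standard library (`tenacity` is installed in the first cell and could play the same role). The function name `add_in_batches` and its default values are illustrative assumptions; the only method it relies on is `add_documents(documents=...)`, as called in the cell above.

```python
import time


def add_in_batches(store, documents, batch_size=50, max_attempts=3, base_delay=1.0):
    """Upload documents in batches and return the batches that still failed."""
    failed = []
    for i in range(0, len(documents), batch_size):
        batch = documents[i:i + batch_size]
        for attempt in range(1, max_attempts + 1):
            try:
                store.add_documents(documents=batch)
                break
            except Exception as exc:
                if attempt == max_attempts:
                    print(f"Giving up on batch starting at index {i}: {exc}")
                    failed.append(batch)
                else:
                    # Exponential backoff: base_delay, then 2x, 4x, ...
                    time.sleep(base_delay * 2 ** (attempt - 1))
    return failed


# Example (requires the live vector store and chunks from this notebook):
# failed_batches = add_in_batches(vector_store, texts)
```

Returning the failed batches, rather than only printing errors, lets a later cell retry or inspect them.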





In [8]:
# Initialize the Azure AI Search retriever for document retrieval
retriever = AzureAISearchRetriever(
    content_key="content",  # Field in the search index that holds the document content
    top_k=10,  # Number of top-matching documents to retrieve per query
    index_name=index_name,  # Name of the index to query
    service_name=os.getenv("AZURE_AI_SEARCH_SERVICE_NAME"),  # Name of the Azure AI Search service
    api_key=os.getenv("AZURE_COGNITIVE_SEARCH_API_KEY"),  # API key for authenticating with the service
)

# Preview the retriever's configuration
# print(f"Retriever configuration: {retriever.__dict__}")  # Print the retriever's configuration for debugging
print("Retriever initialized.")  # Print confirmation of retriever initialization

Retriever initialized.
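Before wiring the retriever into a chain, it can help to look at what it returns for a sample query. A minimal sketch follows, assuming the retriever exposes the standard LangChain `invoke(query)` method and returns objects with a `page_content` attribute. The helper name `preview_hits` and the sample query are hypothetical.

```python
def preview_hits(retriever, query, max_chars=120):
    """Return one short, single-line snippet per retrieved document."""
    docs = retriever.invoke(query)  # LangChain retrievers return a list of Document objects
    return [doc.page_content[:max_chars].replace("\n", " ") for doc in docs]


# Example (requires the live retriever defined above):
# for snippet in preview_hits(retriever, "résultats records"):
#     print(snippet)
```

If the snippets look unrelated to the query, the index content or the `content_key` setting is a reasonable first place to check.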


In [10]:
# Azure OpenAI offers several chat models; see the Azure docs for current models, costs, context windows, and supported input types.
# API Reference: https://python.langchain.com/docs/integrations/chat/azure_chat_openai/
from langchain_openai import AzureChatOpenAI

# The endpoint is not passed explicitly; it is read from the AZURE_OPENAI_ENDPOINT environment variable
llm = AzureChatOpenAI(
    azure_deployment=os.getenv("AZURE_OPENAI_DEPLOYMENT_NAME"),
    api_version=os.getenv("AZURE_OPENAI_API_VERSION"),
    api_key=os.getenv("AZURE_OPENAI_KEY"),
    temperature=0.7,
)

# Preview the LLM's configuration
print(f"LLM configuration: {llm.__dict__}")  # Print the LLM's configuration for debugging
print("LLM initialized.")  # Print confirmation of LLM initialization

LLM configuration: {'name': None, 'cache': None, 'verbose': False, 'callbacks': None, 'tags': None, 'metadata': None, 'custom_get_token_ids': None, 'callback_manager': None, 'rate_limiter': None, 'disable_streaming': False, 'client': <openai.resources.chat.completions.completions.Completions object at 0x00000288388ABE30>, 'async_client': <openai.resources.chat.completions.completions.AsyncCompletions object at 0x000002883898C3B0>, 'root_client': <openai.lib.azure.AzureOpenAI object at 0x00000288388A86B0>, 'root_async_client': <openai.lib.azure.AsyncAzureOpenAI object at 0x00000288388ABDA0>, 'model_name': None, 'temperature': 0.7, 'model_kwargs': {}, 'openai_api_key': SecretStr('**********'), 'openai_api_base': None, 'openai_organization': None, 'openai_proxy': None, 'request_timeout': None, 'stream_usage': False, 'max_retries': None, 'presence_penalty': None, 'frequency_penalty': None, 'seed': None, 'logprobs': None, 'top_logprobs': None, 'logit_bias': None, 'streaming': False, 'n': No

# Video Transcript Summarization

In [11]:
# Define the prompt template
template = """You are analyzing a transcript text file that contains the speech-to-text results from a video file.
{context}
Question: {question}
Helpful Answer:"""

QA_CHAIN_PROMPT = PromptTemplate.from_template(template)

# Set the Retrieval QA chain
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=retriever,
    chain_type_kwargs={"prompt": QA_CHAIN_PROMPT},
)
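`RetrievalQA.from_chain_type` uses the `stuff` chain type unless another one is requested, which means the retrieved chunks are concatenated and placed into the `{context}` slot of the prompt. The sketch below imitates that step in plain Python so the final prompt can be inspected. The `build_prompt` helper is hypothetical, the `TEMPLATE` string is a copy shaped like the template above, and the double-newline separator is an assumption about the default behaviour.

```python
TEMPLATE = """You are analyzing a transcript text file that contains the speech-to-text results from a video file.
{context}
Question: {question}
Helpful Answer:"""


def build_prompt(chunks, question, template=TEMPLATE):
    """Imitate the 'stuff' strategy: join all retrieved chunks and fill the template."""
    context = "\n\n".join(chunks)
    return template.format(context=context, question=question)


print(build_prompt(["chunk one", "chunk two"], "What is discussed?"))
```

Because every retrieved chunk lands in a single prompt, the `top_k` setting of the retriever directly affects prompt length and token cost.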

In [12]:
questions = ["Could you summarize in a couple of lines the document Saint_Gobain_2023.csv?"]

for question in questions:
    # RetrievalQA only needs the "query" key; invoke() replaces the deprecated direct call
    result = qa_chain.invoke({"query": question})
    print("\033[1;31;34m")
    print(f"Question: {question}")
    print("\033[1;31;32m")
    print(f"Answer: {result['result']}")


Question: Could you summarize in a couple of lines the document Saint_Gobain_2023.csv?
Answer: The document "Saint_Gobain_2023.csv" is a transcript of speeches highlighting Saint-Gobain's achievements and innovations in 2022 and plans for 2023. It emphasizes the company's advancements in manufacturing, environmental sustainability, and social responsibility, including the development of low-carbon products and global recognition for gender equality and top employer certification.
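The question cells in this section all repeat the same loop with a different question. A small helper could factor that pattern out. This is a sketch only: the name `ask` is hypothetical, the colour codes are simplified relative to the ones used in the cells, and it assumes the chain accepts `invoke({"query": ...})` and returns a dictionary with a `result` key, as `RetrievalQA` does.

```python
BLUE, GREEN, RESET = "\033[1;34m", "\033[1;32m", "\033[0m"


def ask(chain, question):
    """Run one question through the chain, print it in colour, and return the answer text."""
    result = chain.invoke({"query": question})
    answer = result["result"]
    print(f"{BLUE}Question: {question}{RESET}")
    print(f"{GREEN}Answer: {answer}{RESET}")
    return answer


# Example (requires the qa_chain defined above):
# ask(qa_chain, "Can you generate 10 keywords from Saint_Gobain_2023.csv?")
```

Resetting the colour at the end of each line keeps the escape codes from affecting later output.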


In [13]:
questions = ["Can you generate 10 keywords from Saint_Gobain_2023.csv?"]

for question in questions:
    # RetrievalQA only needs the "query" key; invoke() replaces the deprecated direct call
    result = qa_chain.invoke({"query": question})
    print("\033[1;31;34m")
    print(f"Question: {question}")
    print("\033[1;31;32m")
    print(f"Answer: {result['result']}")

Question: Can you generate 10 keywords from Saint_Gobain_2023.csv?
Answer: To generate 10 keywords from the provided transcript, we should look for recurring themes, significant terms, and core ideas presented in the text. Here are 10 keywords based on the content:

1. Saint-Gobain
2. Innovation
3. Carbon Neutrality
4. Sustainability
5. Global Expansion
6. Performance
7. Decarbonization
8. Manufacturing
9. Leadership
10. Environmental Responsibility

These keywords reflect the main topics and themes discussed in the transcript, highlighting Saint-Gobain's focus on innovation, sustainability, and global market leadership.


In [14]:
questions = ["You are a twitter redactor. Write a tweeter post about the content of Saint_Gobain_materials.csv?\
Use some smileys"]

for question in questions:
    # RetrievalQA only needs the "query" key; invoke() replaces the deprecated direct call
    result = qa_chain.invoke({"query": question})
    print("\033[1;31;34m")
    print(f"Question: {question}")
    print("\033[1;31;32m")
    print(f"Answer: {result['result']}")

Question: You are a twitter redactor. Write a tweeter post about the content of Saint_Gobain_materials.csv?Use some smileys
Answer: 🌟 Exciting developments at Saint-Gobain! 🌟 In 2022, we saw groundbreaking innovations, including the launch of low-carbon glass and eco-friendly plasterboard made from recycled materials. 🌍 Our continuous commitment to sustainability and excellence has set new records, with 18 new factories and production lines opened worldwide. Congrats to the team for earning the Global Top Employer certification! 💼👏 Let's make 2023 another year of success and innovation in the construction industry! 🚀 #Sustainability #Innovation #SaintGobain #TopEmployer


In [15]:
questions = ["Display the timeframe of the curing of glass process in the St_Gobain.csv file? Just print the values \
             like a json file"]

for question in questions:
    # RetrievalQA only needs the "query" key; invoke() replaces the deprecated direct call
    result = qa_chain.invoke({"query": question})
    print("\033[1;31;34m")
    print(f"Question: {question}")
    print("\033[1;31;32m")
    print(f"Answer: {result['result']}")

Question: Display the timeframe of the curing of glass process in the St_Gobain.csv file? Just print the values              like a json file
Answer: ```json
{
  "start_time": "00:02:52.826",
  "end_time": "00:02:56.400"
}
```
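When the model returns JSON inside a code fence, as in the answer above, the text needs a little cleanup before it can be used programmatically. A minimal sketch follows. The helper names `extract_json` and `to_seconds` are hypothetical, and the approach assumes the answer contains a single JSON object with `HH:MM:SS.mmm` timestamps.

```python
import json
import re


def extract_json(answer: str) -> dict:
    """Pull the first {...} object out of an LLM answer, with or without a code fence."""
    match = re.search(r"\{.*\}", answer, flags=re.DOTALL)
    if match is None:
        raise ValueError("No JSON object found in the answer")
    return json.loads(match.group(0))


def to_seconds(timestamp: str) -> float:
    """Convert an 'HH:MM:SS.mmm' timestamp to seconds."""
    hours, minutes, seconds = timestamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


FENCE = "`" * 3  # Built at runtime to avoid a literal fence inside this example
answer = (
    f"{FENCE}json\n"
    '{\n  "start_time": "00:02:52.826",\n  "end_time": "00:02:56.400"\n}\n'
    f"{FENCE}"
)
span = extract_json(answer)
duration = to_seconds(span["end_time"]) - to_seconds(span["start_time"])
print(span, round(duration, 3))
```

Since the model's output format is not guaranteed, wrapping `extract_json` in a `try`/`except` block is a sensible precaution in real use.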
