# Bhagavad Gita and Patanjali Yoga Sutras Analysis System

This notebook implements a system that retrieves and summarizes verses from the Bhagavad Gita and the Patanjali Yoga Sutras (PYS) in response to user queries. It uses TF-IDF vectorization with cosine similarity for text matching and the T5-base transformer model for summary generation.

## Key Features
- Query-based verse retrieval from sacred texts
- Verse matching using TF-IDF vectors and cosine similarity
- Automatic summary generation
- Bilingual support (Sanskrit and English)
- Performance evaluation metrics

## Prerequisites
- Python 3.10+
- Transformers library
- scikit-learn
- pandas
- torch
- numpy

In [11]:
# Install Transformers library for T5-base model
!pip install transformers

# Install scikit-learn for TF-IDF vectorization and similarity calculations
!pip install scikit-learn

# Install pandas for data handling
!pip install pandas

# Install other dependencies (if not already available)
!pip install numpy
!pip install torch



In [12]:
# Import necessary libraries
import os
import json
import pandas as pd
from transformers import T5Tokenizer, T5ForConditionalGeneration
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer

# Define directories for the Gita and PYS datasets on Kaggle
DATA_DIRECTORIES = {
    "gita": "/kaggle/input/bhagwad-gita-dataset",  # Directory containing Gita-related CSV files
    "pys": "/kaggle/input/pys-dataset"    # Directory containing PYS-related CSV files
}

## Data Loading and Preprocessing

This section handles the loading and initial processing of our source texts:
- Loads verses from Bhagavad Gita and Patanjali Yoga Sutras
- Combines datasets while preserving source information
- Handles missing values and standardizes format
- Prepares text for vectorization

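Note: Each dataset directory is expected to contain CSV files, and the combined data must include at least one column named `text`, `verse`, `content`, or `translation`, since the preprocessing step looks for these names.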
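
The column lookup can be made more forgiving. The sketch below is a hypothetical helper (`pick_text_column` and its default candidate list are assumptions, not part of the notebook) that ignores case and stray spaces in header names and lists verse-text candidates first, so a verse-number column is less likely to be picked by accident.

```python
def pick_text_column(columns, candidates=("translation", "english", "text", "content")):
    """Return the first column matching a candidate, ignoring case and surrounding spaces."""
    # Map the normalized header name back to the original header
    normalized = {str(col).strip().lower(): col for col in columns}
    for name in candidates:
        if name in normalized:
            return normalized[name]
    raise ValueError(f"No text column found; available columns: {list(columns)}")


# Hypothetical header resembling the Gita preview
print(pick_text_column(["Chapter", "Verse", "Sanskrit", "English"]))
```

With a DataFrame, the call would be `pick_text_column(df.columns)`.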

In [13]:
def load_and_combine_csv(directory):
    """Read every CSV file in `directory` and stack them into one DataFrame."""
    frames = [
        pd.read_csv(os.path.join(directory, file_name))
        for file_name in os.listdir(directory)
        if file_name.endswith(".csv")
    ]
    # Concatenate once; return an empty DataFrame if no CSV files were found
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()

In [14]:
# Load Gita and PYS datasets
def load_datasets():
    gita_df = load_and_combine_csv(DATA_DIRECTORIES["gita"])
    pys_df = load_and_combine_csv(DATA_DIRECTORIES["pys"])
    return gita_df, pys_df

gita_df, pys_df = load_datasets()

# Display the first few rows of each dataset for verification
print("Gita Data Preview:")
print(gita_df.head())
print("\nPYS Data Preview:")
print(pys_df.head())

Gita Data Preview:
   Chapter  Verse                            Concept         Keyword  \
0      2.0   13.0             Transmigration of Soul  Transmigration   
1      2.0   14.0  Impermanance of pleasure and pain      Transience   
2      2.0   22.0             Transmigration of Soul  Transmigration   
3      2.0   25.0            Characteristics of Soul            Soul   
4      2.0   27.0           Cycle of Birth and Death  Transmigration   

                                            Sanskrit  \
0  देहिनोऽस्मिन्यथा देहे कौमारं यौवनं जरा| तथा दे...   
1  मात्रास्पर्शास्तु कौन्तेय शीतोष्णसुखदुःखदाः| आ...   
2  वासांसि जीर्णानि यथा विहाय नवानि गृह्णाति नरोऽ...   
3  अव्यक्तोऽयमचिन्त्योऽयमविकार्योऽयमुच्यते| तस्मा...   
4  जातस्य हि ध्रुवो मृत्युर्ध्रुवं जन्म मृतस्य च|...   

                                             English Speaker Sanskrit   \
0  Just as childhood, youth, and old age are natu...     NaN       NaN   
1  The sensations of cold and heat, pleasure and ...     NaN   

In [16]:
# Identify the text column dynamically
def find_text_column(df, possible_columns):
    # Candidates are checked in order; the first one present in the DataFrame wins
    for col in possible_columns:
        if col in df.columns:
            return col
    raise ValueError(f"No valid text column found in DataFrame. Available columns: {list(df.columns)}")

def preprocess_data(gita_df, pys_df):
    # Identify the text column for both datasets.
    # Caution: 'verse' may match a verse-number column rather than verse text
    # (the 'text' previews printed further down contain values such as '1.0').
    gita_text_column = find_text_column(gita_df, ['text', 'verse', 'content', 'translation'])
    pys_text_column = find_text_column(pys_df, ['text', 'verse', 'content', 'translation'])
    
    # Fill missing values with empty strings
    gita_df[gita_text_column] = gita_df[gita_text_column].fillna('').astype(str)
    pys_df[pys_text_column] = pys_df[pys_text_column].fillna('').astype(str)

    # Add a source column for identification
    gita_df['source'] = 'Gita'
    pys_df['source'] = 'PYS'

    # Normalize the column names
    gita_df = gita_df.rename(columns={gita_text_column: 'text'})
    pys_df = pys_df.rename(columns={pys_text_column: 'text'})

    # Combine datasets
    combined_df = pd.concat([gita_df, pys_df], ignore_index=True)
    return combined_df

# Preprocess data
data = preprocess_data(gita_df, pys_df)

# Display the combined dataset preview
print("Combined Data Preview:")
print(data.head())

Combined Data Preview:
   Chapter  Verse                            Concept         Keyword  \
0      2.0   13.0             Transmigration of Soul  Transmigration   
1      2.0   14.0  Impermanance of pleasure and pain      Transience   
2      2.0   22.0             Transmigration of Soul  Transmigration   
3      2.0   25.0            Characteristics of Soul            Soul   
4      2.0   27.0           Cycle of Birth and Death  Transmigration   

                                            Sanskrit  \
0  देहिनोऽस्मिन्यथा देहे कौमारं यौवनं जरा| तथा दे...   
1  मात्रास्पर्शास्तु कौन्तेय शीतोष्णसुखदुःखदाः| आ...   
2  वासांसि जीर्णानि यथा विहाय नवानि गृह्णाति नरोऽ...   
3  अव्यक्तोऽयमचिन्त्योऽयमविकार्योऽयमुच्यते| तस्मा...   
4  जातस्य हि ध्रुवो मृत्युर्ध्रुवं जन्म मृतस्य च|...   

                                             English Speaker Sanskrit   \
0  Just as childhood, youth, and old age are natu...     NaN       NaN   
1  The sensations of cold and heat, pleasure and ...     Na

In [17]:
# Clean the 'text' column
data['text'] = data['text'].fillna('').astype(str)

# Remove rows where the text is empty (optional)
data = data[data['text'].str.strip() != '']

In [18]:
print(data[['text']].head())

      text
11624  1.0
11625  2.0
11626  3.0
11627  4.0
11628  5.0


In [19]:
print(data['text'].apply(type).value_counts())

text
<class 'str'>    895
Name: count, dtype: int64


## Text Vectorization and Similarity Calculation

This section implements the core text matching functionality:
- Converts text to TF-IDF vectors for efficient comparison
- Implements cosine similarity for finding relevant verses
- Filters English stop words (no Sanskrit-specific tokenization is applied)

The vectorization process helps us match user queries with the most relevant verses from our corpus.
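
To make the matching step concrete, here is a minimal pure-Python sketch of TF-IDF weighting followed by cosine similarity. It is illustrative only: the toy corpus, the whitespace tokenization, and the helper names (`tfidf_vectors`, `query_vector`, `cosine`) are assumptions, and the notebook itself relies on scikit-learn's `TfidfVectorizer` rather than this code.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build L2-normalized TF-IDF vectors (term -> weight dicts) for tokenized docs."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}
    return [_normalize(Counter(doc), idf) for doc in docs], idf


def query_vector(tokens, idf):
    """Vectorize a query with the corpus IDF; unseen terms are dropped."""
    return _normalize(Counter(t for t in tokens if t in idf), idf)


def _normalize(counts, idf):
    weights = {term: count * idf[term] for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
    return {term: w / norm for term, w in weights.items()}


def cosine(a, b):
    """Dot product of two sparse unit vectors, which equals their cosine similarity."""
    return sum(weight * b.get(term, 0.0) for term, weight in a.items())


# Toy corpus (made up for illustration, not taken from the datasets)
corpus = [
    "the mind is restless and hard to control".split(),
    "practice and detachment restrain the mind".split(),
    "the soul is never born and never dies".split(),
]
vectors, idf = tfidf_vectors(corpus)
query = query_vector("control mind".split(), idf)
scores = [cosine(query, vec) for vec in vectors]
print(scores)
```

The first document shares both query terms and scores highest, while the third shares none and scores zero. This also shows that TF-IDF matching is lexical: a verse that discusses the same idea in different words gets no credit.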

In [21]:
# TF-IDF vectorization (English stop words removed)
def vectorize_texts(data):
    vectorizer = TfidfVectorizer(stop_words='english')
    tfidf_matrix = vectorizer.fit_transform(data['text'])
    return vectorizer, tfidf_matrix

# Apply vectorization
vectorizer, tfidf_matrix = vectorize_texts(data)

## Verse Retrieval System

The retrieval system matches user queries with relevant verses using the following process:
1. Query preprocessing
2. Vector space matching
3. Similarity ranking
4. Top-N selection

The system returns the verses whose TF-IDF vectors have the highest cosine similarity to the query vector. This is a lexical match on shared terms rather than a semantic one.
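
Steps 3 and 4 can be sketched in isolation. The helper below is hypothetical (`top_n_indices` and its `min_score` cutoff are assumptions, not part of the notebook); it ranks similarity scores and drops candidates with no overlap at all, so a query that matches nothing yields an empty result instead of arbitrary rows.

```python
import heapq


def top_n_indices(scores, n=5, min_score=0.0):
    """Return indices of the n highest scores, best first, skipping scores <= min_score."""
    candidates = [(score, idx) for idx, score in enumerate(scores) if score > min_score]
    return [idx for _score, idx in heapq.nlargest(n, candidates)]


print(top_n_indices([0.0, 0.42, 0.0, 0.91, 0.13], n=2))  # → [3, 1]
print(top_n_indices([0.0, 0.0, 0.0]))                    # → []
```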

In [22]:
def retrieve_shlokas(query, vectorizer, tfidf_matrix, data, top_n=5):
    query_vector = vectorizer.transform([query])
    # TfidfVectorizer L2-normalizes rows by default, so this dot product equals cosine similarity
    cosine_similarities = (tfidf_matrix @ query_vector.T).toarray().ravel()
    
    # Get the top N results
    top_indices = cosine_similarities.argsort()[-top_n:][::-1]
    retrieved_shlokas = data.iloc[top_indices].copy()
    retrieved_shlokas['similarity_score'] = cosine_similarities[top_indices]
    
    return retrieved_shlokas, cosine_similarities[top_indices]

# Example usage
query = "How to control the mind?"
retrieved_shlokas, scores = retrieve_shlokas(query, vectorizer, tfidf_matrix, data)
print(retrieved_shlokas)

       Chapter  Verse Concept Keyword Sanskrit English Speaker Sanskrit   \
24084      NaN    NaN     NaN     NaN      NaN     NaN     NaN       NaN   
11917      NaN    NaN     NaN     NaN      NaN     NaN     NaN       NaN   
11928      NaN    NaN     NaN     NaN      NaN     NaN     NaN       NaN   
11927      NaN    NaN     NaN     NaN      NaN     NaN     NaN       NaN   
11926      NaN    NaN     NaN     NaN      NaN     NaN     NaN       NaN   

      Swami Adidevananda Swami Gambirananda  ...  text speaker  \
24084                NaN                NaN  ...  34.0     NaN   
11917                NaN                NaN  ...  14.0   भगवान   
11928                NaN                NaN  ...  25.0   भगवान   
11927                NaN                NaN  ...  24.0   भगवान   
11926                NaN                NaN  ...  23.0   भगवान   

                                                sanskrit  \
24084  पुरुषार्थशून्यानां गुणानां प्रतिप्रसवः कैवल्यं...   
11917  दैवी ह्येषा गुणमयी 

## Summary Generation

This section implements automatic summary generation using the T5 transformer model:
- Uses T5-base model for text generation
- Implements custom prompting for spiritual context
- Generates concise, contextual summaries of retrieved verses

Note: The summary generation process is designed to maintain the spiritual context and meaning of the original texts.
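
One point worth keeping in mind: the original T5 checkpoints such as `t5-base` were trained with short task prefixes (for example `summarize: `) rather than free-form instructions, so a long instruction-style prompt may not yield a useful summary. The sketch below shows one possible way to build a prefix-style input; `build_t5_input` and its `max_chars` cap are hypothetical, and the tokenizer's own truncation would still apply afterwards.

```python
def build_t5_input(verses, max_chars=2000):
    """Join verse translations and prepend T5's summarization task prefix."""
    text = " ".join(v.strip() for v in verses if v and v.strip())
    if not text:
        return None  # nothing to summarize
    # max_chars is an illustrative cap, not a model requirement
    return "summarize: " + text[:max_chars]


print(build_t5_input(["The mind is restless. ", " It is restrained by practice."]))
```

The returned string would be passed to the tokenizer in place of the instruction-style prompt.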

In [23]:
# Load T5-base model
def load_t5_model():
    tokenizer = T5Tokenizer.from_pretrained("t5-base")
    model = T5ForConditionalGeneration.from_pretrained("t5-base")
    return tokenizer, model

tokenizer, model = load_t5_model()
    
# Prompt engineering and summarization
def generate_summary(query, shlokas, tokenizer, model):
    # Nothing to summarize if no verse text was retrieved
    if not shlokas.strip():
        return "No relevant shlokas were found for the query."

    # Build the prompt from the query and the retrieved verse text
    prompt = f"""
    You are an AI trained on Indian scriptures like the Bhagavad Gita and Patanjali Yoga Sutras.
    Your job is to answer queries by retrieving relevant verses and summarizing their essence.

    Query: {query}

    Relevant Verses:
    {shlokas}

    Based on the above verses, provide a concise, accurate, and spiritually aligned summary. Avoid adding any information not present in the verses.
    """
    # Tokenize (truncated to 512 tokens) and generate with beam search
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
    output = model.generate(**inputs, max_length=150, num_beams=5, early_stopping=True)
    return tokenizer.decode(output[0], skip_special_tokens=True)

# Generate summary
shlokas_text = " ".join(retrieved_shlokas['text'].tolist())
summary = generate_summary(query, shlokas_text, tokenizer, model)
print("Generated Summary:", summary)

Generated Summary: True


In [24]:
# Check for inappropriate queries
def is_query_valid(query):
    inappropriate_keywords = ["abuse", "hate", "violence", "unethical", "illegal"]
    for keyword in inappropriate_keywords:
        if keyword in query.lower():
            return False
    return True

# Example usage
if not is_query_valid(query):
    print("Query is inappropriate.")
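
The check above uses substring matching, so a harmless query containing "whatever" is rejected because it contains "hate". A possible refinement, sketched here under the assumption that whole-word matching is the intent (`is_query_valid_strict` is a hypothetical name), uses word boundaries:

```python
import re

# Same keyword list as the cell above
INAPPROPRIATE_KEYWORDS = ["abuse", "hate", "violence", "unethical", "illegal"]
_PATTERN = re.compile(
    r"\b(?:" + "|".join(map(re.escape, INAPPROPRIATE_KEYWORDS)) + r")\b",
    re.IGNORECASE,
)


def is_query_valid_strict(query):
    """Return False only when a listed keyword appears as a whole word."""
    return _PATTERN.search(query) is None


print(is_query_valid_strict("Whatever happens, how should I act?"))  # → True
print(is_query_valid_strict("Why do people hate?"))                  # → False
```

Whole-word matching misses inflected forms such as "hateful", so which behavior is preferable depends on how strict the filter should be.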

## Output Formatting

The system generates structured output containing:
- Original query
- Retrieved verses (with both Sanskrit and English translations)
- Generated summary
- Source information and metadata

All output is formatted in JSON for easy integration with other systems.
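
Because the records contain Devanagari text, the serialization options matter. The sketch below uses a made-up record with placeholder values to show that `json.dumps` escapes non-ASCII characters by default, while `ensure_ascii=False` keeps the Sanskrit readable.

```python
import json

# Illustrative record; the summary is a placeholder, not real model output
record = {
    "query": "How to control the mind?",
    "retrieved_shlokas": [
        {"Chapter": 7.0, "Sanskrit": "दैवी ह्येषा गुणमयी", "Source": "Gita"}
    ],
    "summary": "placeholder summary",
}

escaped_json = json.dumps(record)
readable_json = json.dumps(record, indent=4, ensure_ascii=False)

print("\\u" in escaped_json)     # → True: Devanagari is written as \uXXXX escapes
print("दैवी" in readable_json)   # → True: characters are kept as-is
```

Both forms parse back to the same data with `json.loads`, so the choice only affects readability and file encoding.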

In [29]:
def format_output(query, retrieved_shlokas, summary):
    # Map available columns to the expected keys
    column_mapping = {
        'chapter': 'Chapter',
        'verse': 'Verse',
        'text': 'Text',
        'sanskrit': 'Sanskrit',
        'translation': 'Translation',
        'source': 'Source'
    }
    
    # Check which columns exist in the DataFrame
    available_columns = {key: value for key, value in column_mapping.items() if key in retrieved_shlokas.columns}
    
    # Rename columns for uniformity
    retrieved_shlokas = retrieved_shlokas.rename(columns=available_columns)
    
    # Select only available columns
    retrieved_shlokas = retrieved_shlokas[list(available_columns.values())].fillna('Not specified').to_dict(orient='records')
    
    # Format the final output
    return {
        "query": query,
        "retrieved_shlokas": retrieved_shlokas,
        "summary": summary
    }

In [30]:
# Query example
query = "How to control the mind?"

# Retrieve relevant shlokas
retrieved_shlokas, scores = retrieve_shlokas(query, vectorizer, tfidf_matrix, data)

# Generate summary
shlokas_text = " ".join(retrieved_shlokas['text'].tolist())
summary = generate_summary(query, shlokas_text, tokenizer, model)

# Format output as JSON
output = format_output(query, retrieved_shlokas, summary)
print(output)

{'query': 'How to control the mind?', 'retrieved_shlokas': [{'Chapter': 4.0, 'Text': '34.0', 'Sanskrit': 'पुरुषार्थशून्यानां गुणानां प्रतिप्रसवः कैवल्यं स्वरूपप्रतिष्ठा वा चितिशक्तिरिति', 'Translation': 'Ultimate liberation is when the gunas, devoid of any purpose for the purusa, return to their original (latent) state; in other words, when the power of consciousness is situated in its own essential nature.', 'Source': 'PYS'}, {'Chapter': 7.0, 'Text': '14.0', 'Sanskrit': 'दैवी ह्येषा गुणमयी मम माया दुरत्यया| मामेव ये प्रपद्यन्ते मायामेतां तरन्ति ते || 7.14 || ', 'Translation': 'Verily, this divine illusion of Mine, composed of the three qualities, is difficult to cross over; those who take refuge in Me alone, can cross over this illusion.', 'Source': 'Gita'}, {'Chapter': 7.0, 'Text': '25.0', 'Sanskrit': 'नाहं प्रकाशः सर्वस्य योगमायासमावृतः| मूढोऽयं नाभिजानाति लोको मामजमव्ययम् || 7.25 || ', 'Translation': 'I am not manifest to all, veiled as I am by the Yoga-Maya. This deluded world doe



In [44]:
import json
import pandas as pd

def refine_output_with_dynamic_summary(query, retrieved_shlokas):
    # Map available columns to the expected keys
    column_mapping = {
        'chapter': 'Chapter',
        'verse': 'Verse',
        'text': 'Text',
        'sanskrit': 'Sanskrit',
        'translation': 'Translation',
        'source': 'Source'
    }

    # Check which columns exist in the DataFrame
    available_columns = {key: value for key, value in column_mapping.items() if key in retrieved_shlokas.columns}

    # Rename columns for uniformity
    retrieved_shlokas = retrieved_shlokas.rename(columns=available_columns)

    # Select only available columns
    shlokas_list = (
        retrieved_shlokas[list(available_columns.values())]
        .fillna('Not specified')
        .to_dict(orient='records')
    )

    # Build an extractive summary by concatenating the translations of the retrieved shlokas
    summary_lines = [
        shloka.get("Translation", "No translation available")
        for shloka in shlokas_list
    ]
    generated_summary = " ".join(summary_lines)

    # Refine the output
    output = {
        "query": query,
        "retrieved_shlokas": shlokas_list,
        "summary": generated_summary
    }

    # Convert to JSON format with proper indentation
    return json.dumps(output, indent=4, ensure_ascii=False)

In [45]:
# Example usage
query = 'How to control the mind?'
refined_output = refine_output_with_dynamic_summary(query, retrieved_shlokas)
print(refined_output)

{
    "query": "How to control the mind?",
    "retrieved_shlokas": [
        {
            "Chapter": 4.0,
            "Text": "34.0",
            "Sanskrit": "पुरुषार्थशून्यानां गुणानां प्रतिप्रसवः कैवल्यं स्वरूपप्रतिष्ठा वा चितिशक्तिरिति",
            "Translation": "Ultimate liberation is when the gunas, devoid of any purpose for the purusa, return to their original (latent) state; in other words, when the power of consciousness is situated in its own essential nature.",
            "Source": "PYS"
        },
        {
            "Chapter": 7.0,
            "Text": "14.0",
            "Sanskrit": "दैवी ह्येषा गुणमयी मम माया दुरत्यया| मामेव ये प्रपद्यन्ते मायामेतां तरन्ति ते || 7.14 || ",
            "Translation": "Verily, this divine illusion of Mine, composed of the three qualities, is difficult to cross over; those who take refuge in Me alone, can cross over this illusion.",
            "Source": "Gita"
        },
        {
            "Chapter": 7.0,
            "Text": "25.0"



## Evaluation System

The evaluation system uses weighted criteria to assess the quality of results:

In [46]:
# Define evaluation criteria with weights
evaluation_criteria = {
    "Accuracy of Retrieved Verse": 0.3,
    "Contextual Relevance": 0.2,
    "Quality of Prompt and Summary": 0.15,
    "Cost Efficiency": 0.1,
    "Depth and Quality of Analysis": 0.1,
    "Scalability": 0.05,
    "User Experience": 0.05,
    "Error Handling": 0.05
}

### Scoring Guide

Each criterion is scored from 1 to 5. Because the weights sum to 1.0, the total weighted score also falls on a scale of 1 to 5:
- 5.0: Exceptional performance
- 4.0-4.9: Very good performance
- 3.0-3.9: Acceptable but may need improvement
- Below 3.0: Significant issues that need attention

Each criterion is weighted based on its importance to the overall system performance.
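
As a worked example, the total is the sum of each criterion's score multiplied by its weight. The ratings below are hypothetical (every criterion scored 4), and `weighted_total` and `rating_label` are illustrative helpers rather than notebook functions.

```python
import math

# Weights as defined in evaluation_criteria
weights = {
    "Accuracy of Retrieved Verse": 0.3,
    "Contextual Relevance": 0.2,
    "Quality of Prompt and Summary": 0.15,
    "Cost Efficiency": 0.1,
    "Depth and Quality of Analysis": 0.1,
    "Scalability": 0.05,
    "User Experience": 0.05,
    "Error Handling": 0.05,
}
assert math.isclose(sum(weights.values()), 1.0)  # keeps the total on the 1-5 scale


def weighted_total(scores, weights):
    """Sum score * weight over all criteria, matching each score by name."""
    if set(scores) != set(weights):
        raise ValueError("Scores must include all evaluation criteria.")
    return round(sum(scores[name] * weight for name, weight in weights.items()), 2)


def rating_label(total):
    """Map a total to the bands of the scoring guide."""
    if total >= 5.0:
        return "Exceptional performance"
    if total >= 4.0:
        return "Very good performance"
    if total >= 3.0:
        return "Acceptable but may need improvement"
    return "Significant issues that need attention"


# Hypothetical ratings: every criterion scored 4
example_scores = {name: 4 for name in weights}
total = weighted_total(example_scores, weights)
print(total, rating_label(total))
```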

In [55]:
# Function to evaluate a query result
def evaluate_pipeline(query, retrieved_shlokas, summary, cost_per_query, scores):
    """
    Evaluate the pipeline performance for a specific query.
    
    Args:
    - query (str): The input query.
    - retrieved_shlokas (pd.DataFrame): The retrieved verses.
    - summary (str): The generated summary.
    - cost_per_query (float): Cost incurred for processing the query.
    - scores (dict): Scores for each criterion (1–5).

    Returns:
    - dict: Evaluation results with total weighted score.
    """
    # Validate scores
    if set(scores.keys()) != set(evaluation_criteria.keys()):
        raise ValueError("Scores must include all evaluation criteria.")
    
    # Calculate weighted scores, looking each score up by criterion name
    # so the result does not depend on the key order of `scores`
    weighted_scores = {
        criterion: scores[criterion] * weight
        for criterion, weight in evaluation_criteria.items()
    }
    
    # Total weighted score
    total_score = sum(weighted_scores.values())
    
    # Structure the evaluation results
    evaluation_results = {
        "Query": query,
        "Cost per Query": cost_per_query,
        "Evaluation Scores": scores,
        "Weighted Scores": weighted_scores,
        "Total Weighted Score": round(total_score, 2)
    }
    return evaluation_results

## Usage Examples and Results

This section demonstrates the system's functionality with example queries and analyzes the results. The example uses the query "How to control the mind?" to show:
- Verse retrieval accuracy
- Summary quality
- Performance metrics
- Cost efficiency

In [56]:
# Example usage
query = "How to control the mind?"
cost_per_query = 0.000 

In [57]:
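# Evaluate. `scores` must be a dict mapping every key of evaluation_criteria to a
# rating from 1 to 5, not the similarity array returned by retrieve_shlokas.
evaluation_results = evaluate_pipeline(query, retrieved_shlokas, summary, cost_per_query, scores)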

# Display evaluation results as a DataFrame
df_results = pd.DataFrame.from_dict(evaluation_results, orient='index', columns=['Value'])
print(df_results)

                                                                  Value
Query                                          How to control the mind?
Cost per Query                                                      0.0
Evaluation Scores     {'Accuracy of Retrieved Verse': 4, 'Contextual...
Weighted Scores       {'Accuracy of Retrieved Verse': 1.2, 'Contextu...
Total Weighted Score                                                4.4


### Future Improvements
- Enhanced summary generation with broader context
- More sophisticated verse matching algorithms
- Automated evaluation metrics
- Extended multilingual support
- User feedback integration