# Documentation for backend/citations.py

### Overview of citations.py

- Generates citation links: Creates formatted links to specific pages in the textbook PDF.

- Extracts answers: Retrieves answers and associated source references from RAG responses.

- Handles page adjustments: Converts zero-indexed document pages to PDF page numbers and offsets them so that numbering starts at Chapter 1.

### Description of code:

#### Importing libraries 

`os` library: Built-in library for interacting with the operating system. It is used here to read the `CORPUS_SOURCE` environment variable when building the PDF path.

`roman` library: Third-party package that converts integers to Roman numerals. It is used here to label pages that come before Chapter 1.

In [2]:
# Importing libraries for file management and Roman numeral conversion
import os
from roman import toRoman

# Output message indicating action taken
print("Modules imported: 'os' and 'toRoman' from the 'roman' package.")

Modules imported: 'os' and 'toRoman' from the 'roman' package.


#### Description:

### Function 1: default_textbook 

- Purpose of the function: Generate hyperlinks for referencing pages in the SWEBOK textbook.

- Zero-indexing adjustment: Adds 1 to the zero-indexed document page so that the link opens the matching PDF page.

- Custom page handling: Converts pages before Chapter 1 to Roman numerals for clarity.

- Integration with chatbot: Provides accurate citations for user queries within the chatbot.

- Enhancing user experience: Improves accessibility to textbook content through direct hyperlink navigation.

In [1]:
def default_textbook(page, pdf_path):
    # The documents are zero-indexed, so page 1 of the PDF is page 0 in docs
    # Chapter 1 starts at doc 33 (page 34 of the PDF)
    adjusted_page = page - 33 # Zero-based offset from the first page of Chapter 1
    if adjusted_page >= 0:  # For pages starting from Chapter 1
        link = f'<a href="/team3/?view=pdf&file={pdf_path}&page={page + 1}" target="_blank">[{adjusted_page + 1}]</a>'
    else:  # For pages before Chapter 1
        adjusted_page = "Cover" if page == 0 else toRoman(page)  # Convert to Roman numerals for pages before Chapter 1
        link = f'<a href="/team3/?view=pdf&file={pdf_path}&page={page + 1}" target="_blank">[{adjusted_page}]</a>'
    
    # Output message indicating action taken
    print("Citation link created.")
    return link

# function call
citation_link = default_textbook(35, "sample.pdf") # build a citation for zero-indexed document page 35
print(citation_link) # print the formatted citation link

Citation link created.
<a href="/team3/?view=pdf&file=sample.pdf&page=36" target="_blank">[3]</a>


#### Link breakdown:

<a href="/team3/?view=pdf&file=sample.pdf&page=36" target="_blank">[3]</a>

If `page = 35` and `pdf_path = "sample.pdf"`, the function returns:

- URL: `/team3/?view=pdf&file=sample.pdf&page=36`. The link opens `sample.pdf` at PDF page 36, which is the zero-indexed document page 35 plus 1.

- Displayed text: `[3]`. This is the textbook's own page number: Chapter 1 starts at document page 33, so document page 35 is shown as 35 - 33 + 1 = 3.

- The result is a clickable HTML link that users can use to access a specific page in the PDF.
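The page labels described above can be checked with a small sketch. It assumes, as the function's comments state, that Chapter 1 starts at document page 33. The helper names `to_roman` and `page_label` are hypothetical and not part of `citations.py`; `to_roman` is a pure-Python stand-in for `roman.toRoman` so the example runs without the third-party package.

```python
def to_roman(number: int) -> str:
    """Convert a positive integer to a Roman numeral (stand-in for roman.toRoman)."""
    numerals = [
        (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
        (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
        (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
    ]
    result = []
    for value, symbol in numerals:
        # Take as many copies of this symbol as fit, then continue with the remainder
        count, number = divmod(number, value)
        result.append(symbol * count)
    return "".join(result)


def page_label(page: int, first_chapter_doc: int = 33) -> str:
    """Return the text shown in brackets for a zero-indexed document page."""
    if page >= first_chapter_doc:
        # Chapter pages are numbered from 1, starting at the first page of Chapter 1
        return str(page - first_chapter_doc + 1)
    # Front matter: the first page is the cover, the rest use Roman numerals
    return "Cover" if page == 0 else to_roman(page)


for page in (0, 4, 32, 33, 35):
    print(page, "->", page_label(page))
# 0 -> Cover
# 4 -> IV
# 32 -> XXXII
# 33 -> 1
# 35 -> 3
```

The boundary case is document page 33, which is the first page of Chapter 1 and is therefore labeled `1`.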

#### Description:

### Answer extraction using RAG

- Function purpose: Extracts answers and source references from the RAG chain's response.

- PDF path management: Builds the textbook PDF path from the `CORPUS_SOURCE` environment variable.

- Source link generation: Creates hyperlinks for source references based on page numbers and context.

- Special case handling: Delegates to `default_textbook` when the PDF is stored in a directory named `default`.

- Formats final output: Compiles the answer and sources into a structured string for display.

In [3]:
def get_answer_with_source(response):
    """
    Extract the answer and relevant source information from the response.
    
    This function processes the response from the RAG (Retrieval-Augmented Generation) chain,
    extracting the answer and up to 5 source references (page numbers) from the context documents.

    Args:
      response (dict): The response dictionary from the RAG chain,
                       containing 'answer' and 'context' keys.
    Returns:
      str: A formatted string containing the answer followed by source information.
    """
    
    # Retrieve the path to the PDF file from environment variables
    pdf_path = f"{os.getenv('CORPUS_SOURCE')}/textbook.pdf"
    
    # Extract the answer from the response; default to 'No answer found.'
    answer = response.get('answer', 'No answer found.')  
    
    # Initialize a list to hold source references
    sources = []  
    
    # Iterate over the context documents and get the top 5 sources
    for doc in response['context'][:5]:
        # Get the zero-indexed page number from metadata (None if it is missing)
        page = doc['metadata'].get('page')
        if page is None:
            # Without a page number no link can be built, so record a plain label
            sources.append('[Unknown page]')
            continue
        
        # Create a hyperlink for the source reference (1-based indexing)
        link = f'<a href="/team3/?view=pdf&file={pdf_path}&page={page + 1}" target="_blank">[{page + 1}]</a>'
        
        # Special case handling for the specific textbook format
        if pdf_path.split("/")[-2] == "default":  # Check if the PDF is the default textbook
            link = default_textbook(page, pdf_path)  # Use the default_textbook function for citation

        # Append the link to the sources list
        sources.append(link)

    # Concatenate the source links (no separator) after a "Sources:" label
    sources_info = "\nSources: " + "".join(sources)
    
    # Format the final output string with the answer and sources
    final_answer = f"{answer}\n\n{sources_info}"

    # Output message indicating successful extraction of the answer
    print("Answer extracted successfully.")  
    
    # Return the final formatted answer string
    return final_answer

# response input simulating the output from the RAG chain
response = {
    'answer': "This is the answer extracted from the RAG chain.",
    'context': [{'metadata': {'page': 34}}, {'metadata': {'page': 50}}]
}

# function call to get the answer with source information
output = get_answer_with_source(response)

# Print the output from the function call
print(output)

Answer extracted successfully.
This is the answer extracted from the RAG chain.


Sources: <a href="/team3/?view=pdf&file=/app/data/swebok/textbook.pdf&page=35" target="_blank">[35]</a><a href="/team3/?view=pdf&file=/app/data/swebok/textbook.pdf&page=51" target="_blank">[51]</a>
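The two blank lines between the answer and the sources in the output above follow from how the final string is assembled. A minimal sketch, assuming a hypothetical helper `format_answer` that mirrors the last lines of `get_answer_with_source`, with short bracket labels standing in for the full HTML links:

```python
def format_answer(answer: str, links: list[str]) -> str:
    """Combine an answer and its source links the way get_answer_with_source does."""
    # The label starts with a newline and the links are joined without a separator
    sources_info = "\nSources: " + "".join(links)
    # Two more newlines separate the answer from the sources block
    return f"{answer}\n\n{sources_info}"


text = format_answer("Example answer.", ["[35]", "[51]"])
print(repr(text))
# 'Example answer.\n\n\nSources: [35][51]'
```

Because the links are joined with an empty string, they appear back to back on one line. A separator such as a space could be passed to `join` if the rendered links should be spaced apart.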


#### Link breakdown:

href attribute:

- URL: The base URL is `/team3/?view=pdf&file=`, followed by the path to the PDF file and a `page` parameter holding the 1-based PDF page number.
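To inspect the parts of a generated link, the `href` can be split with the standard library. This is only a sketch: `describe_link` is a hypothetical helper, and the regular expression assumes the exact link layout produced by the functions in this document.

```python
import re
from urllib.parse import urlparse, parse_qs

# Matches the exact layout used in this document: <a href="..." target="...">[label]</a>
LINK_PATTERN = re.compile(r'<a href="([^"]*)" target="([^"]*)">\[([^\]]*)\]</a>')


def describe_link(link: str) -> dict:
    """Split a citation link into its path, query parameters, target and label."""
    match = LINK_PATTERN.fullmatch(link)
    if match is None:
        raise ValueError("Link does not have the expected layout")
    href, target, label = match.groups()
    url = urlparse(href)
    # parse_qs returns a list per key; each key appears once here, so take the first value
    query = {key: values[0] for key, values in parse_qs(url.query).items()}
    return {"path": url.path, "query": query, "target": target, "label": label}


parts = describe_link(
    '<a href="/team3/?view=pdf&file=sample.pdf&page=36" target="_blank">[3]</a>'
)
print(parts["path"])   # /team3/
print(parts["query"])  # {'view': 'pdf', 'file': 'sample.pdf', 'page': '36'}
print(parts["label"])  # 3
```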

Target attribute:

- `target="_blank"` ensures that the link opens in a new tab or window, preserving the user's current session.

Displayed text:

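- Each link displays the 1-based PDF page number in brackets. For example, for `page=35` the displayed text is `[35]`. No Chapter 1 offset is applied here because the PDF is not in a `default` directory, so `default_textbook` is not called.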
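Which label a source receives depends on the directory that holds the PDF. The sketch below isolates that check. `uses_default_textbook` and `displayed_label` are hypothetical helpers, the path `/app/data/default/textbook.pdf` is an assumed example, and the length guard is an addition that the original expression `pdf_path.split("/")[-2]` does not have. Only chapter pages (document page 33 and later) are covered.

```python
def uses_default_textbook(pdf_path: str) -> bool:
    """Return True when the PDF's parent directory is named 'default'."""
    parts = pdf_path.split("/")
    # Guard against paths without a parent directory (an addition for safety)
    return len(parts) >= 2 and parts[-2] == "default"


def displayed_label(page: int, pdf_path: str) -> str:
    """Label for a zero-indexed chapter page (page >= 33) under each corpus type."""
    if uses_default_textbook(pdf_path):
        # Default textbook: number pages from the start of Chapter 1 (doc page 33)
        return str(page - 33 + 1)
    # Any other corpus: show the 1-based PDF page number
    return str(page + 1)


print(displayed_label(34, "/app/data/swebok/textbook.pdf"))   # 35
print(displayed_label(34, "/app/data/default/textbook.pdf"))  # 2
```

In both cases the `page` parameter in the URL stays the same (35 for document page 34); only the bracketed text differs.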