<a href="https://colab.research.google.com/github/OwlSaver/GWU/blob/main/Praxis/Notebooks/Praxis_Test_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install python-docx
import docx
import os
import google.generativeai as genai
from google.colab import userdata
import json
import re
import textwrap



Collecting python-docx
  Downloading python_docx-1.1.2-py3-none-any.whl.metadata (2.0 kB)
Downloading python_docx-1.1.2-py3-none-any.whl (244 kB)
Installing collected packages: python-docx
Successfully installed python-docx-1.1.2


In [None]:
class GeminiInitializationError(Exception):
    pass

class GeminiQueryError(Exception):
    pass

class GeminiResponseBlockedError(Exception):
    pass

class ParagraphsTooLong(Exception):
    pass
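
The exception classes above are independent of one another, so a caller that wants to handle any Gemini failure has to catch `Exception`. A minimal sketch of one alternative follows, assuming a shared base class named `GeminiError` (hypothetical, not part of this notebook) and a helper named `run_query` (also hypothetical). It shows how `raise ... from e` keeps the original error attached while already-typed errors pass through unchanged.

```python
class GeminiError(Exception):
    """Hypothetical common base class; not defined in the notebook."""


class GeminiQueryError(GeminiError):
    pass


class GeminiResponseBlockedError(GeminiError):
    pass


def run_query(call):
    """Run `call` and translate untyped failures into GeminiQueryError.

    `run_query` is an illustrative helper, not a function from the notebook.
    """
    try:
        return call()
    except GeminiError:
        # Already one of our typed errors; do not re-wrap it.
        raise
    except Exception as e:
        # Chain the original exception so the root cause stays visible.
        raise GeminiQueryError(f"Error querying Gemini API: {e}") from e
```

With this layout a single `except GeminiError` clause covers every typed failure, and `err.__cause__` still exposes the underlying exception.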

In [None]:
class Document:
  """
  Represents a Word document (.docx) and provides methods to analyze its content
  using the Gemini API.

  The Document class facilitates interactions with a Word document, enabling
  users to extract content, estimate paragraph and word counts, identify
  frequent words, and convert the document into a knowledge graph using
  the power of the Gemini API.

  Attributes:
      INDENT (int): The number of spaces for indenting JSON submitted to Gemini.
      MAX_VARIABLE_LENGTH (int): The maximum allowed length of a variable's
                                 string representation before truncation.
      PARAGRAPH_SET (int): The maximum number of paragraphs to process in a set.
      ITERATIONS (int): The default number of passes to make over the document.
      Q_KG (str): The prompt for Gemini to create a knowledge graph.
      Q_PC (str): The prompt for Gemini to count paragraphs.
      Q_MPC (str): The prompt for Gemini to merge paragraph counts.
      Q_WC (str): The prompt for Gemini to count words.
      Q_MWC (str): The prompt for Gemini to merge word counts.
      Q_TW (str): The prompt for Gemini to identify top words.
      Q_MTW (str): The prompt for Gemini to merge top word lists.

  Methods:
      __init__(self, path): Initializes a Document object with a file path.
      next_iteration(self): Advances to the next pass and sets its starting
                            paragraph.
      get_iteration(self): Returns the current pass number.
      get_next_paragraphs(self, num_para): Retrieves the next set of paragraphs.
      ask_gemini(self, variable, question, ...): Queries Gemini about a
                                                 Python variable.
      process_document(self, query_key, merge_key, ...): Runs a query prompt
                       over each paragraph set and optionally merges the
                       responses.
      convert_to_knowledge_graph(self, set_limit=-1): Converts the document
                                                      to a knowledge graph.
      get_paragraph_count(self, set_limit=-1): Estimates the total number
                                               of paragraphs.
      get_word_count(self, set_limit=-1): Estimates the total number of words.
      get_top_words(self, set_limit=-1): Identifies the most frequent words.
  """
  INDENT = 4                    # The number of characters to indent the JSON submitted to Gemini.
  MAX_VARIABLE_LENGTH = 2 ** 16 # Adjust as needed.  Consider the context window of your chosen model.
  PARAGRAPH_SET = 50            # The maximum number of paragraphs to process in a set.
  ITERATIONS = 3                # The number of times to process the data
  Q_KG = "Create a separate knowledge graph for each named entity you find in the content."
  Q_KG += " The knowledge graph must be in json format. The json structure should include entity,"
  Q_KG += " type, properties, and relationships. Do not include explanations of what you did."
  Q_KG += " Just include the json in your response."
  Q_PC = "How many paragraphs are there in this content?"
  Q_MPC = "Review the content and assume that the estimates are correct. Then add them all together for a total count of paragraphs."
  Q_WC = "How many words are there in this content?"
  Q_MWC = "Review the content and assume that the estimates are correct. Then add them all together for a total count of words."
  Q_TW = "What are the ten most used words in this content and how often does each occur? Ignore function words for this count."
  Q_MTW = "Review the content and assume that the estimates are correct. Then merge them all together for a summarized list of words sorted by frequency."


  def __init__(self, path):
    """
    Initialize the Document object with the given file path.

    Args:
        path: The path to a document that will be analyzed. Must be a
              Word document in docx format.

    Returns:
        Nothing
    """
    self.doc_path = path
    self.doc = docx.Document(path)
    self.num_paragraphs = len(self.doc.paragraphs)
    # Calculate the starting paragraph for iteration 0.
    self.iteration = 0
    self.current_paragraph = 0

  # Calculate the starting paragraph for each iteration. This is based on the number of
  # iterations and the paragraph set size. For the first
  # iteration, it will start at paragraph 0 and process each group of
  # paragraph sets. Each future iteration will start at ((iteration/iterations) *
  # paragraph set size)
  def next_iteration(self):
    self.iteration += 1
    self.current_paragraph = int((self.iteration / self.ITERATIONS) * self.PARAGRAPH_SET)

  def get_iteration(self):
    return self.iteration

  def get_next_paragraphs(self, num_para):
    """
    Retrieve the next 'num_para' paragraphs from the document.

    Args:
        num_para: The number of paragraphs to get.

    Returns:
        Up to the number of paragraphs requested, joined with newlines
        into a single string. If there are fewer, all of the available
        paragraphs are returned. If none are available, None is returned.
    """
    if self.current_paragraph >= self.num_paragraphs:
      retval = None
    else:
      end_paragraph = min(self.current_paragraph + num_para, self.num_paragraphs)
      extracted_text = [self.doc.paragraphs[i].text for i in range(self.current_paragraph, end_paragraph)]
      retval = '\n'.join(extracted_text)
      if len(retval) > self.MAX_VARIABLE_LENGTH:
        raise ParagraphsTooLong(f"Combined paragraphs were {len(retval)} long which exceeds limit of {self.MAX_VARIABLE_LENGTH}.")

      self.current_paragraph = end_paragraph  # Update the current paragraph position

    return retval

  def ask_gemini(self, variable, question, api_key=None, model_name="gemini-1.5-flash", max_output_tokens=None, temperature=None,  additional_context=None):
    """
    Asks Gemini a question about the contents of a Python variable.

    Args:
        variable: The Python variable to analyze.
        question: The question to ask Gemini about the variable.
        api_key: (Optional) Your Gemini API key.  If not provided, it should be
                 available via genai.configure().
        model_name: (Optional) The Gemini model to use (e.g., "gemini-1.5-pro", "gemini-1.0-pro").  Defaults to "gemini-1.5-flash".
        max_output_tokens: (Optional) Maximum number of tokens in the response.
        temperature: (Optional) Controls the randomness of the response (0.0 - 1.0).  Lower values are more deterministic.
        additional_context: (Optional) A string containing additional information relevant to the variable or question.

    Returns:
        The Gemini API's response as a string. Failures are raised as the
        exceptions listed below rather than returned.

    Raises:
        GeminiInitializationError: If there's an issue initializing the Gemini model.
        GeminiQueryError: If there's an error querying the Gemini API.
        GeminiResponseBlockedError: If the Gemini response is blocked due to content policy.
    """

    if api_key:
        genai.configure(api_key=api_key)


    try:
        model = genai.GenerativeModel(model_name)
    except Exception as e:
      raise GeminiInitializationError(f"Error initializing Gemini model: {str(e)}. Ensure your API key is configured and you have access to the specified model.")

    # --- Improved Prompt Construction ---
    prompt_parts = []
    prompt_parts.append("You are a helpful assistant that can analyze Python variables that contain paragraphs of text and answer questions about them.")

    if additional_context:
        prompt_parts.append(f"Additional Context:\n{additional_context}\n")

    prompt_parts.append(f"Analyze the following Python variable and answer the question below:")

    # --- Improved Variable Handling ---

    # Use a safer representation:  json.dumps is generally better for complex objects.
    try:
        variable_str = json.dumps(variable, indent=self.INDENT, default=str)  # Use default=str to handle non-serializable objects
    except TypeError:
        variable_str = repr(variable) # Fallback to repr if json fails

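    # Limit variable size to prevent long prompts. Slice rather than use
    # textwrap.shorten, which also collapses whitespace and so flattens the
    # JSON indentation.
    if len(variable_str) > self.MAX_VARIABLE_LENGTH:
        variable_str = variable_str[:self.MAX_VARIABLE_LENGTH] + "... (truncated)"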

    prompt_parts.append(f"Variable (Python):\n`python\n{variable_str}\n`")

    prompt_parts.append(f"Question:\n{question}")

    # ---  Construct the full prompt  ---
    prompt = "\n\n".join(prompt_parts)

    # --- Generation Configuration ---
    generation_config = {}
    if max_output_tokens is not None:
        generation_config["max_output_tokens"] = max_output_tokens
    if temperature is not None:
        generation_config["temperature"] = temperature

    # Safety settings: block content rated medium or above in each harm category.
    safety_settings = [
      {
        "category": "HARM_CATEGORY_HARASSMENT",
        "threshold": "BLOCK_MEDIUM_AND_ABOVE"
      },
      {
        "category": "HARM_CATEGORY_HATE_SPEECH",
        "threshold": "BLOCK_MEDIUM_AND_ABOVE"
      },
      {
        "category": "HARM_CATEGORY_SEXUALLY_EXPLICIT",
        "threshold": "BLOCK_MEDIUM_AND_ABOVE"
      },
      {
        "category": "HARM_CATEGORY_DANGEROUS_CONTENT",
        "threshold": "BLOCK_MEDIUM_AND_ABOVE"
      },
    ]

    # ---  Generate Response  ---
    try:
        response = model.generate_content(prompt, generation_config=generation_config, safety_settings=safety_settings)
        # prompt_feedback.block_reason is set only when the prompt itself was blocked.
        if response.prompt_feedback.block_reason:
          raise GeminiResponseBlockedError(f"Prompt was blocked: {response.prompt_feedback}")
        return response.text
    except GeminiResponseBlockedError:
        raise  # Already the right type; do not re-wrap it as a query error.
    except Exception as e:
        # More specific error handling:  Catch common errors.
        if "Response was blocked" in str(e):  # Check for content filtering.
            raise GeminiResponseBlockedError(f"Error: Gemini response was blocked due to content policy. Error: {e}")
        raise GeminiQueryError(f"Error querying Gemini API: {str(e)}")

  def process_document(self, query_key: str, merge_key: str, set_limit: int = -1, iterations: int = None, paragraph_set: int = None) -> str:
    """
    Runs a Gemini query over the document, one paragraph set at a time.

    The document is read in sets of paragraphs. Each set is sent to Gemini
    with the query prompt and the responses are concatenated. If a merge
    prompt is given, the concatenated responses are sent back to Gemini to
    be combined into a single answer.

    Args:
      query_key: The prompt applied to each paragraph set (for example Q_KG).
      merge_key: The prompt used to merge the per-set responses, or None to
                return the concatenated responses unmerged.
      set_limit: The maximum number of document segments to process.
                Zero or a negative value indicates no limit. Defaults to -1.
      iterations: The number of passes over the document. Defaults to
                ITERATIONS.
      paragraph_set: The number of paragraphs per set. Defaults to
                PARAGRAPH_SET.

    Returns:
      str: The merged response from the final pass, or the concatenated
          per-set responses when merge_key is None.

    Raises:
      GeminiResponseBlockedError: If a Gemini query is blocked due to
                                  content policy violations.
    """
    inputChars = 0
    outputChars = 0
    set_count = 0
    gemini_response = ""
    final_response = ""  # Returned unchanged if no pass runs.
    if iterations is None:
      iterations = self.ITERATIONS
    if paragraph_set is None:
      paragraph_set = self.PARAGRAPH_SET
    while self.get_iteration() < iterations:

      fileContent = self.get_next_paragraphs(paragraph_set)
      while fileContent is not None:
        # Count only the sets that are actually sent to Gemini. Calling
        # len() on the None returned at the end of the document would fail.
        inputChars += len(fileContent)
        # ask_gemini raises typed Gemini* exceptions; let them propagate
        # instead of relabeling every failure as a blocked response.
        gemini_response += self.ask_gemini(fileContent, query_key, api_key=userdata.get('GeminiAPIKey'), model_name="gemini-1.5-flash")
        set_count += 1
        if set_limit > 0 and set_count >= set_limit:
          break
        fileContent = self.get_next_paragraphs(paragraph_set)
      if merge_key is not None:
        final_response = self.ask_gemini(gemini_response, merge_key, api_key=userdata.get('GeminiAPIKey'), model_name="gemini-1.5-flash")
      else:
        final_response = gemini_response
      self.next_iteration()
      outputChars += len(final_response)
      print(f"The input was {inputChars} which resulted in output of {outputChars}.")

    return final_response

  def convert_to_knowledge_graph(self, set_limit=-1):
    return self.process_document(self.Q_KG, None, set_limit, 1, 50)

  def get_paragraph_count(self, set_limit=-1):
    return self.process_document(self.Q_PC, self.Q_MPC, set_limit, 1, 50)

  def get_word_count(self, set_limit=-1):
    return self.process_document(self.Q_WC, self.Q_MWC, set_limit, 1, 50)

  def get_top_words(self, set_limit=-1):
    return self.process_document(self.Q_TW, self.Q_MTW, set_limit, 1, 50)
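
The starting offset computed in `next_iteration` shifts each pass so that its paragraph sets begin at a different position than those of the previous pass. The sketch below reproduces that arithmetic outside the class so the resulting windows can be inspected without a document or an API key. The function names `iteration_start` and `paragraph_windows` and the 120-paragraph document are assumptions made for illustration only.

```python
def iteration_start(iteration, iterations=3, paragraph_set=50):
    """Starting paragraph index for a pass, mirroring Document.next_iteration."""
    return int((iteration / iterations) * paragraph_set)


def paragraph_windows(num_paragraphs, iteration, iterations=3, paragraph_set=50):
    """List the (start, end) index pairs one pass would read.

    `end` is exclusive, matching the range() used in get_next_paragraphs.
    """
    windows = []
    start = iteration_start(iteration, iterations, paragraph_set)
    while start < num_paragraphs:
        end = min(start + paragraph_set, num_paragraphs)
        windows.append((start, end))
        start = end
    return windows


# Hypothetical 120-paragraph document, three passes of 50-paragraph sets.
for i in range(3):
    print(i, paragraph_windows(120, i))
```

Note that passes after the first skip the paragraphs before their starting offset, so only the first pass covers the whole document.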


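The merge prompts (`Q_MPC`, `Q_MWC`, `Q_MTW`) ask Gemini to add or combine the per-set estimates. Because that step is plain arithmetic, it could also be done locally. The following is a rough sketch under stated assumptions, not the notebook's method: it assumes each per-set response contains its count as the first run of digits, and that top-word tables have already been parsed into dictionaries. Both helper names are hypothetical.

```python
import re
from collections import Counter


def sum_first_integers(responses):
    """Add up the first integer found in each per-set response.

    Assumes plain digits without thousands separators.
    """
    total = 0
    for text in responses:
        match = re.search(r"\d+", text)
        if match is None:
            raise ValueError(f"No integer found in response: {text!r}")
        total += int(match.group())
    return total


def merge_word_counts(tables):
    """Merge per-set {word: count} mappings into one frequency-sorted list."""
    merged = Counter()
    for table in tables:
        merged.update(table)  # Counter.update adds counts for repeated keys.
    return merged.most_common()


# Illustrative inputs only; these strings are not real Gemini responses.
print(sum_first_integers(["14 paragraphs", "14", "About 16.", "16 paragraphs"]))
print(merge_word_counts([{"Authority": 20, "meeting": 9}, {"Authority": 5, "members": 7}]))
```

A local merge avoids a second API call and gives the same result on every run, at the cost of needing the per-set responses in a predictable shape.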

In [None]:
file_path = "/content/drive/MyDrive/Praxis/Conewago Township Sewer Authority, PA.docx"
aDocument = Document(file_path)
try:
  short_term_memory = aDocument.convert_to_knowledge_graph(set_limit=2)
except Exception as e:
  print(f"An error occurred: {e}")
else:
  print(short_term_memory)
finally:
  print("Done")


The input was 16543 which resulted in otput of 6676.
```json
[
  {
    "entity": "Conewago Township Sewer Authority",
    "type": "Organization",
    "properties": {
      "name": "Conewago Township Sewer Authority",
      "abbreviation": "CTSA",
      "established": "8-22-1989"
    },
    "relationships": []
  },
  {
    "entity": "Conewago Township",
    "type": "Location",
    "properties": {
      "name": "Conewago Township"
    },
    "relationships": [
      {
        "relationship": "located_in",
        "target": "Conewago Township Sewer Authority"
      }
    ]
  },
  {
    "entity": "Articles of Incorporation",
    "type": "Document",
    "properties": {
      "name": "Articles of Incorporation"
    },
    "relationships": [
      {
        "relationship": "mentions",
        "target": "Conewago Township Sewer Authority"
      }
    ]
  },

  {
    "entity": "Res. No. 2021-1",
    "type": "Resolution",
    "properties": {
      "resolution_number": "2021-1",
      "date": "9-

In [None]:
file_path = "/content/drive/MyDrive/Praxis/Conewago Township Sewer Authority, PA.docx"
aDocument = Document(file_path)
try:
  result = aDocument.get_paragraph_count(set_limit=4)
except Exception as e:
  print(f"An error occurred: {e}")
else:
  print(result)
finally:
  print("Done")

The input was 27901 which resulted in otput of 141.
The provided text gives four different paragraph counts: 14, 14, 16, and 16.  Adding these together gives a total of 14 + 14 + 16 + 16 = 60.

Done


In [None]:
file_path = "/content/drive/MyDrive/Praxis/Conewago Township Sewer Authority, PA.docx"
aDocument = Document(file_path)
try:
  result = aDocument.get_top_words(set_limit=4)
except Exception as e:
  print(f"An error occurred: {e}")
else:
  print(result)
finally:
  print("Done")

Okay, I'm ready to analyze the text and provide the word counts, excluding common function words.

Here's the breakdown of the 10 most frequent words in the provided text, after excluding function words (like "the," "a," "is," "of," "and," "shall," "be," "to," "in," "at," "by," "as," "for"):

Word | Count
------- | --------
Authority | 20
meeting | 9
members | 7
Township | 4
Conewago | 4
meetings | 4
member | 3
regular | 3
purpose | 2
Articles | 2
Okay, I can do that for you. I need to first define which words to exclude as "function words" and then I can tally the occurrence of each word in the provided text.

Here's the breakdown:

**1. Defining "Function Words" (Exclusion List):**

For this analysis, I will exclude the following types of words, as they are generally considered function words:

*   **Articles:** a, an, the
*   **Prepositions:** of, to, in, for, on, at, by, with, from, as, into, upon
*   **Pronouns:** he, she, it, him, her, his, its, they, them, their, i, me, my, mine