# Deriving Insights from Lengthy Documents with LLMs

This tutorial shows how to use large language models (LLMs) to extract structured information from very long documents. The challenge is getting the model to perform a detailed analysis while keeping a consistent output format, covering all relevant details, and avoiding off-topic or fabricated answers.

We will explore prompting techniques that keep the model on track so that it returns accurate and comprehensive insights from lengthy documents.

In [None]:
# download the data from GitHub with wget and unzip it

# use the raw file URL; the Colab viewer URL returns an HTML page, not the archive
!wget https://github.com/rjuro/unistra-nlp2024/raw/main/data/interviews.zip
# unzip into a folder called interviews
!mkdir -p interviews
!mv interviews.zip interviews
!cd interviews && unzip interviews.zip

In [None]:
!pip install openai -q

In [2]:
from openai import OpenAI

In [3]:
# get an API key from together.ai, save it as a Colab secret, and load it here
from google.colab import userdata
TOGETHER_API_KEY = userdata.get('TOGETHER_API_KEY')

In [9]:
# load the interview transcript into a text variable
with open("interviews/Sonos_ John MacFarlane-transcript.txt", "r") as f:
    text = f.read()

In [4]:
system_prompt = """You are a hyperintelligent information extraction engine with strong attention to detail."""

In [5]:
instruction = """
Extract information about the entrepreneurial journey from the interview below. Pay attention to all details.
Answer the following questions based on the provided interview, then provide the answers as JSON output.
Always provide an example from the interview in the "example" field.
"possible_categories" are not exhaustive. You can add more categories if you find any that are relevant.
Include a short summary of the interview and entrepreneurial journey in the "summary" field.
Disregard the commercials in the interview.

Here are the categories and questions:

Means at Hand:
What initial resources were mentioned as being available at the start of the venture, such as personal expertise, networks, or funding?
Which resources were emphasized as most relied upon in the early stages of the venture, and how did they contribute to its initial development?
Affordable Loss:
How is risk approached in business decisions, particularly regarding potential losses, as described in the interview?
Is there a situation detailed where potential loss was weighed against a business opportunity? What level of loss was deemed acceptable (low, medium, high)?
Strategic Partnerships:
Are there mentions of strategic partnerships, collaborations, alliances, or mentorships formed?
How were these partnerships said to contribute to the venture's development and success?
Leveraging Contingencies:
Were there instances where unexpected challenges were turned into opportunities for the business, as per the interview?
Can an example be found in the interview of adapting to a significant challenge and the outcome of this pivot?
Goal Reassessment and Iteration:
How frequently does the interview indicate the business goals or models are reassessed in response to new information or market feedback?
Is there a specific instance mentioned where feedback led to a change in business strategy or goals? What was the impact of this change?
Market Research and Analysis:
How was the market for the product or service understood, according to the interview?
What methods for market research and analysis are described, and what were the key outcomes of this process?
Funding and Investment:
What strategies for securing funding for the venture are discussed in the interview?
Are types of investments pursued mentioned, along with the strategic considerations behind these choices?
Crisis Management and Adaptability:
How is managing crises or significant challenges in the business described?
Is an example provided of a crisis faced and the strategies used to overcome it? What was the effectiveness of these approaches?


INTERVIEW:
\n\n
"""

In [6]:
json_template_instruction = """

INSTRUCTION:
Follow the template below and output valid JSON:


{
  "summary": "...",
  "categories": [
    {
      "name": "Means at Hand",
      "quantification_method": "categorization",
      "possible_categories": ["strong network", "industry expertise", "initial capital"],
      "selected_categories": [...],
      "description": "The resources and assets available at the start, including personal skills, networks, and financial resources.",
      "example": "..."
    },
    {
      "name": "Affordable Loss",
      "quantification_method": "level",
      "possible_levels": ["low", "medium", "high"],
      "selected_level": "...",
      "description": "The level of loss the entrepreneur is prepared to risk, focusing on manageable losses rather than expected gains.",
      "example": "..."
    },
    {
      "name": "Strategic Partnerships",
      "quantification_method": "count_and_type",
      "possible_types": ["co-founders", "mentors", "alliances"],
      "selected_types": [...],
      "description": "The formation of strategic partnerships, including co-founders, mentors, and alliances, and their contributions to development.",
      "example": "..."
    },
    {
      "name": "Leveraging Contingencies",
      "quantification_method": "examples_and_outcomes",
      "description": "How unexpected events were turned into opportunities, detailing the pivot or adaptation made.",
      "example": "..."
    },
    {
      "name": "Goal Reassessment and Iteration",
      "quantification_method": "frequency_and_impact",
      "description": "The frequency and impact of reassessing goals and business models in response to new information or feedback.",
      "example": "..."
    },
    {
      "name": "Market Research and Analysis",
      "quantification_method": "descriptive",
      "description": "Approaches to market research and analysis to identify opportunities, validate business ideas, and adapt to market changes.",
      "example": "..."
    },
    {
      "name": "Funding and Investment",
      "quantification_method": "categorization",
      "possible_categories": ["bootstrapping", "angel investment", "venture capital"],
      "selected_categories": [...],
      "description": "Strategies for securing financial resources, including the sources of funding and strategic considerations behind funding choices.",
      "example": "..."
    },
    {
      "name": "Crisis Management and Adaptability",
      "quantification_method": "examples_and_strategies",
      "description": "The entrepreneur's ability to manage crises and adapt to challenges, including strategies for navigating obstacles.",
      "example": "..."
    }
  ]
}


Output valid JSON only.

"""


In [10]:
PROMPT = instruction + text + json_template_instruction
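
Because the whole transcript is pasted into a single prompt, it can help to check that the prompt is likely to fit in the model's context window before sending it. The sketch below is only a rough estimate: it assumes about 4 characters per token (a rule of thumb for English text, not the model's real tokenizer) and a 32,768-token context window (an assumption, so check the model card). `estimate_tokens` and `fits_context` are hypothetical helpers, not part of any library.

```python
def estimate_tokens(prompt: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; the 4-chars-per-token ratio is an assumption."""
    return int(len(prompt) / chars_per_token)


def fits_context(prompt: str, context_window: int = 32_768,
                 reserve_for_output: int = 2_000) -> bool:
    """True if the estimated prompt size leaves room for the model's answer."""
    return estimate_tokens(prompt) + reserve_for_output <= context_window


sample_prompt = "word " * 1000            # 5,000 characters
print(estimate_tokens(sample_prompt))     # 1250
print(fits_context(sample_prompt))        # True
```

In the notebook, the same check could be run as `fits_context(PROMPT)`. If it returns `False`, the transcript probably needs to be shortened or split before the request is made.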

In [15]:
# Point the OpenAI client at the Together API (OpenAI-compatible endpoint)
client = OpenAI(base_url="https://api.together.xyz/v1", api_key=TOGETHER_API_KEY)

completion = client.chat.completions.create(
    model="NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": PROMPT},
    ],
    temperature=0.2,  # low temperature for more consistent output
)

print(completion.choices[0].message)

ChatCompletionMessage(content=' {\n  "summary": "John MacFarlane, co-founder and former CEO of Sonos, shares the story of building a successful wireless speaker company in a highly competitive market. MacFarlane and his partners faced skepticism and challenges but managed to create a product that filled a niche for high-quality, wireless home audio. The company\'s success can be attributed to a strong network, industry expertise, and strategic partnerships. MacFarlane emphasizes the importance of leveraging contingencies, reassessing goals, and effective crisis management in the entrepreneurial journey.",\n  "categories": [\n    {\n      "name": "Means at Hand",\n      "quantification_method": "categorization",\n      "possible_categories": ["strong network", "industry expertise", "initial capital"],\n      "selected_categories": ["strong network", "industry expertise"],\n      "description": "The resources and assets available at the start, including personal skills, networks, and fin

In [16]:
import json

In [18]:
json.loads(completion.choices[0].message.content)

{'summary': "John MacFarlane, co-founder and former CEO of Sonos, shares the story of building a successful wireless speaker company in a highly competitive market. MacFarlane and his partners faced skepticism and challenges but managed to create a product that filled a niche for high-quality, wireless home audio. The company's success can be attributed to a strong network, industry expertise, and strategic partnerships. MacFarlane emphasizes the importance of leveraging contingencies, reassessing goals, and effective crisis management in the entrepreneurial journey.",
 'categories': [{'name': 'Means at Hand',
   'quantification_method': 'categorization',
   'possible_categories': ['strong network',
    'industry expertise',
    'initial capital'],
   'selected_categories': ['strong network', 'industry expertise'],
   'description': 'The resources and assets available at the start, including personal skills, networks, and financial resources.',
   'example': 'MacFarlane and his partner
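
`json.loads` works here because the model returned nothing but JSON. Models sometimes wrap the object in extra text, in which case parsing the raw string fails. A minimal sketch of a more forgiving parser, assuming the response contains exactly one top-level JSON object; `extract_json` is a hypothetical helper name:

```python
import json


def extract_json(raw: str) -> dict:
    """Parse the span from the first '{' to the last '}' in a model response."""
    start = raw.find("{")
    end = raw.rfind("}")
    if start == -1 or end == -1 or end < start:
        raise ValueError("no JSON object found in response")
    return json.loads(raw[start:end + 1])


raw_reply = 'Sure, here is the JSON:\n{"summary": "demo", "categories": []}\nLet me know if you need more.'
print(extract_json(raw_reply))  # {'summary': 'demo', 'categories': []}
```

This still raises `json.JSONDecodeError` if the extracted span is not valid JSON, for example when the response was cut off by the output token limit.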

## Automation

In [11]:
import os
import json
from tqdm.auto import tqdm
from openai import OpenAI
from dotenv import load_dotenv

# Load the environment variables
load_dotenv()
TOGETHER_API_KEY = os.getenv("TOGETHER_API_KEY")

# initialize the together client as above
client = OpenAI(base_url="https://api.together.xyz/v1", api_key=TOGETHER_API_KEY)


# Directory containing the text files
directory = "interviews"



In [12]:
# List to store the responses
responses = []

# Iterate over the files in the directory
for filename in tqdm(os.listdir(directory)):
    if filename.endswith(".txt"):
        try:
            with open(os.path.join(directory, filename), "r") as file:
                text = file.read()

            # Construct the prompt
            PROMPT = instruction + text + json_template_instruction

            completion = client.chat.completions.create(
                model="NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO",
                messages=[
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": PROMPT},
                ],
                temperature=0.2,
            )
            responses.append(completion.choices[0].message.content)

        except Exception as e:
            print(f"Error processing file {filename}: {e}")



  0%|          | 0/4 [00:00<?, ?it/s]
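
The loop above prints an error and moves on when a request fails, so a transient API problem silently drops that interview. One possible remedy is to retry with exponential backoff. The sketch below is generic and does not call the API itself; `call_with_retries` is a hypothetical helper, and the attempt count and delays are arbitrary assumptions.

```python
import time


def call_with_retries(fn, max_attempts=3, base_delay=1.0):
    """Call fn(); on an exception, wait base_delay * 2**attempt and try again."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * 2 ** attempt)


# Demo with a function that fails twice before succeeding.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"


print(call_with_retries(flaky, base_delay=0.01))  # ok
```

In the loop, the request could be wrapped as `call_with_retries(lambda: client.chat.completions.create(...))`, keeping the arguments unchanged.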

In [None]:
# pickle the responses
import pickle
with open("responses.pkl", "wb") as f:
    pickle.dump(responses, f)
responses

In [16]:
import pandas as pd

In [18]:
# Transforming the data for Excel
# Note: categories are read by position, so this assumes the model kept the template's order
data_for_excel = []
for response in responses:
    summary = json.loads(response)
    row = {
        "Summary": summary["summary"],
        "Means at Hand": ", ".join(summary["categories"][0]["selected_categories"]),
        "Affordable Loss": summary["categories"][1]["selected_level"],
        "Strategic Partnerships": ", ".join(summary["categories"][2]["selected_types"]),
        "Leveraging Contingencies": summary["categories"][3]["example"],
        "Goal Reassessment and Iteration": summary["categories"][4]["example"],
        "Market Research and Analysis": summary["categories"][5]["example"],
        "Funding and Investment": ", ".join(summary["categories"][6]["selected_categories"]),
        "Crisis Management and Adaptability": summary["categories"][7]["example"],
    }
    data_for_excel.append(row)

# Create a DataFrame
df = pd.DataFrame(data_for_excel)

# Save to Excel (pandas needs the openpyxl package installed to write .xlsx files)
excel_path = "summary_excel_demo.xlsx"
df.to_excel(excel_path, index=False)

excel_path

'summary_excel_demo.xlsx'
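
Reading categories by list position breaks if the model reorders, omits, or adds categories, which the prompt explicitly allows. A hedged alternative is to look each category up by its `"name"` field and fall back to an empty string when it is missing. `to_row` is a hypothetical helper that covers only some of the columns, to keep the sketch short.

```python
def to_row(parsed: dict) -> dict:
    """Build one spreadsheet row, looking categories up by name instead of index."""
    cats = {c.get("name"): c for c in parsed.get("categories", [])}

    def field(name, key):
        value = cats.get(name, {}).get(key, "")
        # List-valued fields (e.g. selected_categories) become comma-separated text.
        return ", ".join(value) if isinstance(value, list) else value

    return {
        "Summary": parsed.get("summary", ""),
        "Means at Hand": field("Means at Hand", "selected_categories"),
        "Affordable Loss": field("Affordable Loss", "selected_level"),
        "Strategic Partnerships": field("Strategic Partnerships", "selected_types"),
        "Funding and Investment": field("Funding and Investment", "selected_categories"),
    }


sample = {
    "summary": "demo",
    "categories": [
        {"name": "Affordable Loss", "selected_level": "low"},
        {"name": "Means at Hand", "selected_categories": ["strong network", "initial capital"]},
    ],
}
row = to_row(sample)
print(row["Means at Hand"])    # strong network, initial capital
print(row["Affordable Loss"])  # low
```

Missing categories yield empty cells rather than an `IndexError` or `KeyError`, so one malformed response does not stop the export.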