# Automated Company Research 🧊

##### 💡 **Research Areas:** Rapid Prototyping, Generative AI, Information Retrieval, Langchain Basics, AI-assisted Market Research, Entrepreneurial Hacking.

<div style="display:flex; align-items:center; padding: 50px;">
<p style="margin-right:10px;">
    <img height="300px" style="width:auto;" width="200px" src="https://avatars.githubusercontent.com/u/192148546?s=400&u=95d76fbb02e6c09671d87c9155f17ca1e4ef8f21&v=4"> 
</p>
</div>

#### This is a simple prototype for automated company research. We wanted a tool to boost our client outreach, either by saving time or by surfacing better ideas. The prototype gathers publicly available information about companies from the internet to help us understand their business and develop better solutions for them.

## 🔥 Powered by LangChain, Jina AI & Google Gemini!




## Step 1: Installing Requirements

The `install_requirements()` function handles the installation of dependencies from a requirements.txt file.


- It checks if the requirements are already installed and skips the installation if true.

- If not, it tries to install the requirements using `pip install -r requirements.txt`.

- If the installation fails, it retries up to the specified maximum number of retries (`max_retries`).

- The process exits with an error if the installation fails after the maximum retries.

In [8]:
import os

requirements_installed = False
max_retries = 3
retries = 0


def install_requirements():
    """Installs the requirements from requirements.txt file"""
    global requirements_installed
    global retries
    global max_retries
    if requirements_installed:
        print("Requirements already installed.")
        return

    print("Installing requirements...")
    install_status = os.system("pip install -r requirements.txt")
    if install_status == 0:
        print("Requirements installed successfully.")
        requirements_installed = True
    else:
        print("Failed to install requirements.")
        if retries < max_retries:
            print("Retrying...")
            retries += 1
            return install_requirements()
        exit(1)
    return


The `install_requirements()` function checks if the required dependencies are already installed by examining the global variable `requirements_installed`. If they are not installed, the function attempts to install them using `pip` from the `requirements.txt` file.

In [None]:
install_requirements()
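
The retry behaviour described in this step can also be written as a loop instead of recursion. The following is a minimal sketch under stated assumptions, not the notebook's implementation: `run_with_retries` and `pip_install_command` are hypothetical helpers, and invoking pip as `sys.executable -m pip` is an assumption made so that packages are installed into the interpreter that runs the notebook.

```python
import subprocess
import sys


def run_with_retries(command, max_retries=3):
    """Run `command` until it exits with status 0 or the retries run out.

    Returns True on success and False once every attempt has failed.
    `run_with_retries` is a hypothetical helper, not part of the notebook.
    """
    attempts = max_retries + 1  # one initial attempt plus `max_retries` retries
    for attempt in range(1, attempts + 1):
        completed = subprocess.run(command)
        if completed.returncode == 0:
            return True
        print(f"Attempt {attempt} of {attempts} failed.")
    return False


def pip_install_command(requirements_file="requirements.txt"):
    """Build a pip command that targets the current interpreter (hypothetical helper)."""
    return [sys.executable, "-m", "pip", "install", "-r", requirements_file]
```

A call such as `run_with_retries(pip_install_command())` returns a boolean, so the caller can decide whether to stop the notebook or continue.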

## Step 2: Setting Up Environment Variables

The `setup_env()` function ensures that required environment variables are loaded and available for the application. It checks if the environment variables (`OPENAI_API_KEY`, `GEMINI_API_KEY`, `SERPER_API_KEY`) are set, and if not, it exits the program. The function uses `load_dotenv()` to load variables from a `.env` file into the environment.

In [10]:
from dotenv import load_dotenv
import os


def setup_env():
    """Sets up the environment variables"""

    def check_env(env_var):
        value = os.getenv(env_var)
        if value is None:
            print(f"Please set the {env_var} environment variable.")
            exit(1)
        else:
            print(f"{env_var} is set.")

    load_dotenv()

    variables_to_check = ["OPENAI_API_KEY", "GEMINI_API_KEY", "SERPER_API_KEY"]

    for var in variables_to_check:
        check_env(var)

The `setup_env()` function is invoked to load the variables from the `.env` file and verify that the required API keys are present. It calls `load_dotenv()` first and then checks each key; if any key is missing, the program exits.

In [None]:
setup_env()
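
When several keys are missing, it can be more convenient to report all of them at once instead of exiting on the first one. The sketch below illustrates that idea; `find_missing_env_vars` is a hypothetical helper, and treating an empty value as missing is an assumption that differs slightly from the `None` check in `setup_env()`.

```python
import os


def find_missing_env_vars(names, env=None):
    """Return the names from `names` that are unset or empty in `env`.

    `env` defaults to the process environment; passing a dict makes the
    function easy to try without touching real settings.
    """
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


required = ["OPENAI_API_KEY", "GEMINI_API_KEY", "SERPER_API_KEY"]

# Illustrative values only; these are not real keys.
example_env = {"OPENAI_API_KEY": "sk-example", "GEMINI_API_KEY": ""}
print(find_missing_env_vars(required, env=example_env))
# prints ['GEMINI_API_KEY', 'SERPER_API_KEY']
```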

## Step 3: Using Google Serper API and OpenAI for Search and Summarization

The `simple_serper_openai_search()` function performs a Google search through the Serper API and summarizes the results with the OpenAI API. It first retrieves the search results for a query, then serializes them to a JSON string that is embedded in the summarization prompt. If the `verbose` flag is set to `True`, the function prints the raw results and the summary. Finally, it returns a human-readable summary of the search results.

In [12]:
from langchain_community.utilities import GoogleSerperAPIWrapper
from openai import OpenAI
import json


def simple_serper_openai_search(
    query="What is the current latest news in New York?", verbose=False
):
    """Simple function to search for a query using Google Serper API and summarize the results using OpenAI API"""
    search = GoogleSerperAPIWrapper()
    openai = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
    search_results = search.results(query)
    search_results_json = json.dumps(search_results, indent=4)
    if verbose:
        print("Search Results Found:\n", search_results_json)
    search_summary = openai.chat.completions.create(
        messages=[
            {"role": "system", "content": "You are a summarizer of search results."},
            {
                "role": "user",
                "content": f"""Given a JSON of search results, neatly format the data into a summary. 
             The search results are as follows: {search_results_json}. 
             Don't miss out any details, make sure everything in the JSON is included in a human readable format.
             Don't hallucinate or make up factual information.""",
            },
        ],
        model="gpt-4o",
    )
    result = search_summary.choices[0].message.content
    if verbose:
        print("Summary of Search Results:\n", result)
    return result
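
The full JSON can be large, so one optional variation is to trim each result before it goes into the prompt. The sketch below is an illustration rather than part of the notebook: `compact_search_results` is a hypothetical helper, and it assumes each entry under `organic` may carry `title`, `link` and `snippet` fields. Trimming trades completeness for a shorter prompt, which conflicts with the "include everything" instruction in the prompt above, so it is only worth considering when prompt size becomes a problem.

```python
import json


def compact_search_results(results, max_items=5):
    """Keep only the title, link and snippet of the first `max_items` organic hits."""
    compact = []
    for item in results.get("organic", [])[:max_items]:
        compact.append(
            {
                "title": item.get("title", ""),
                "link": item.get("link", ""),
                "snippet": item.get("snippet", ""),
            }
        )
    return compact


# Hand-written sample data for illustration; not real API output.
sample_results = {
    "searchParameters": {"q": "example"},
    "organic": [
        {"title": "A", "link": "https://a.example", "snippet": "first", "position": 1},
        {"title": "B", "link": "https://b.example", "position": 2},
    ],
}

print(json.dumps(compact_search_results(sample_results, max_items=1), indent=4))
```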

## Step 4: Displaying the Query and Its Response

In this step, the code executes a search query, retrieves the summarized result using the `simple_serper_openai_search()` function, and displays it in a formatted Markdown format within the Jupyter notebook.


- `query`: The search query to be sent to the Google Serper API.

- `response`: The result of the search query after being processed and summarized by the OpenAI API.

- `Markdown(f"### {query}\n{response}")`: Formats the query and its response as a Markdown text, displaying it with a heading.

- `clear_output()`: Clears the previous output in the notebook to ensure the display is clean before rendering the new content.

- `display(markdown)`: Displays the formatted query and response.

In [None]:
from IPython.display import Markdown, display, clear_output

query = "What is the latest news in California?"
response = simple_serper_openai_search(query=query, verbose=True)
markdown = Markdown(f"### {query}\n{response}")
clear_output()
display(markdown)

## Step 5: Search Functions for Handling and Processing Results

In this step, several functions are created to perform searches and handle the results efficiently, ensuring the data is accurate and relevant.

### Functions Overview:

- **`jina_search(query)`**:

  - **Purpose**: Performs a search using the **Jina AI API**.
  
  - **Details**: Sends the query to the API and returns the response as text.

- **`regular_search(query)`**:

  - **Purpose**: Uses the **Google Serper API** to perform a regular search.
  
  - **Details**: Returns the search results in a structured format, providing organized insights.

- **`dedupe_links(links)`**:

  - **Purpose**: Removes duplicate links from the given list.
  
  - **Details**: Converts the list of links to a set and back to a list to ensure uniqueness.

- **`collect_links(response)`**:

  - **Purpose**: Extracts and collects links from the search results.
  
  - **Details**: Analyzes the response object, gathers sitelinks and main links, and removes duplicates by using the `dedupe_links` function.

These functions work together to perform intelligent searches, collect the necessary links, and process them for further use, ensuring the results are both relevant and non-redundant.


In [15]:
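import os
import requests
import json
from urllib.parse import quote


def jina_search(query):
    """Searches the query using Jina AI API"""
    # URL-encode the query so characters such as "?" and spaces stay in the path.
    url = f"https://s.jina.ai/{quote(query)}"

    headers = {
        # Read the key from the environment rather than hard-coding it in the notebook.
        # JINA_API_KEY is an assumed variable name; set it in your .env file.
        "Authorization": f"Bearer {os.getenv('JINA_API_KEY')}",
        "X-Retain-Images": "none",
        "X-Return-Format": "markdown",
    }

    response = requests.get(url, headers=headers)
    return response.text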


def regular_search(query):
    """Searches the query using Google Serper API"""
    search = GoogleSerperAPIWrapper()
    response = search.results(query)
    return response


def dedupe_links(links):
    """Dedupes the links"""
    return list(set(links))


def collect_links(response: dict):
    """Collects the links from the response"""
    organic = response.get("organic", [])
    links = []
    for item in organic:
        # `sitelinks` is optional; treat a missing or empty value as no sitelinks.
        for link in item.get("sitelinks") or []:
            links.append(link["link"])
        links.append(item["link"])
    return dedupe_links(links)
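
Converting to a set removes duplicates but does not keep the original order of the links. If the order matters, for example to crawl the top-ranked pages first, an order-preserving variant is possible. The sketch below uses a hypothetical `dedupe_links_ordered` helper and a hand-written response shaped like the `organic` list that `collect_links` reads.

```python
def dedupe_links_ordered(links):
    """Remove duplicates while keeping the first occurrence of each link."""
    # dict keys are unique and remember insertion order.
    return list(dict.fromkeys(links))


# Hand-written sample for illustration; not real API output.
sample_response = {
    "organic": [
        {
            "link": "https://example.com",
            "sitelinks": [
                {"link": "https://example.com/about"},
                {"link": "https://example.com/pricing"},
            ],
        },
        {"link": "https://example.com/about"},
    ]
}

# Same traversal as collect_links: sitelinks first, then the main link.
links = []
for item in sample_response["organic"]:
    for sitelink in item.get("sitelinks") or []:
        links.append(sitelink["link"])
    links.append(item["link"])

print(dedupe_links_ordered(links))
# prints ['https://example.com/about', 'https://example.com/pricing', 'https://example.com']
```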

## Step 6: Link Caching for API Credit Efficiency

This step implements caching to reduce repeated API calls and save credits by storing search results for future use.

### Functions Overview:

- **`load_links_cache()`**:

  - **Purpose**: Loads the cached links from a JSON file (`links_cache.json`).
  
  - **Details**: Returns the links stored in the cache, allowing for faster access without repeated API calls.

- **`save_links_cache()`**:

  - **Purpose**: Saves the current state of the `linksCache` dictionary to the `links_cache.json` file.
  
  - **Details**: Ensures that the cache is updated with the latest search results to be used in future queries.

- **`add_links_to_cache(query, links)`**:

  - **Purpose**: Adds newly fetched links for a query to the cache.
  
  - **Details**: Calls `save_links_cache()` to store the new links, ensuring the cache remains up-to-date with fresh data.

- **`get_links_from_regular_search(query)`**:

  - **Purpose**: Checks if the links for a query already exist in the cache.
  
  - **Details**: If cached links are available, they are returned directly. If not, a regular search is performed, links are collected, added to the cache, and then returned for use.

By utilizing link caching, this process helps minimize unnecessary API calls, saving credits and improving efficiency by reusing previously fetched search results.


In [24]:
## Link caching to save API credits
linksCache = {}
links_cache_file = "outputs/links_cache.json"


def load_links_cache():
    """Loads the links cache from the file"""
    with open(links_cache_file, "r") as f:
        return json.load(f)


def save_links_cache():
    """Saves the links cache to the file"""
    # Create the output directory on first use so open() does not fail.
    os.makedirs(os.path.dirname(links_cache_file), exist_ok=True)
    with open(links_cache_file, "w") as f:
        json.dump(linksCache, f)


def add_links_to_cache(query, links):
    """Adds the links to the cache"""
    linksCache[query] = links
    save_links_cache()


def get_links_from_regular_search(query):
    """Gets the links from the regular search"""
    # Merge the on-disk cache into the module-level dict instead of shadowing it
    # with a local variable, so that a later save does not drop older entries.
    try:
        linksCache.update(load_links_cache())
    except (FileNotFoundError, json.JSONDecodeError):
        pass  # no usable cache file yet

    cached_links = linksCache.get(query)
    if cached_links:
        return cached_links
    response = regular_search(query)
    links = collect_links(response)
    add_links_to_cache(query, links)
    return links
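
The caching idea can be tried offline with a stand-in for the paid search call. The sketch below is self-contained and separate from the notebook's functions: `cached_lookup` and `fake_search` are hypothetical names, and the cache is written to a temporary directory so nothing in `outputs/` is touched.

```python
import json
import os
import tempfile


def cached_lookup(key, compute, cache_path):
    """Return the cached value for `key`, calling `compute(key)` only on a miss."""
    try:
        with open(cache_path, "r") as f:
            cache = json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        cache = {}  # start fresh when the file is missing or unreadable

    if key in cache:
        return cache[key]

    value = compute(key)
    cache[key] = value
    os.makedirs(os.path.dirname(cache_path) or ".", exist_ok=True)
    with open(cache_path, "w") as f:
        json.dump(cache, f)
    return value


calls = []


def fake_search(query):
    """Stand-in for a paid API call; records how often it is invoked."""
    calls.append(query)
    return [f"https://example.com/{len(calls)}"]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "outputs", "links_cache.json")
    first = cached_lookup("What is Vercel?", fake_search, path)
    second = cached_lookup("What is Vercel?", fake_search, path)

print(first == second, len(calls))
# prints True 1
```

The second lookup is served from the JSON file, so the stand-in search runs only once.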

## Step 7: Integrating Gemini API for Content Generation

This step integrates the Google Gemini API to generate responses to queries using the Gemini model.

### Functions Overview:

- **`get_gemini_client(model)`**:

  - **Purpose**: Configures the Gemini API client with the provided API key and the specified model.
  
  - **Details**: Returns the client for the given Gemini model, allowing interaction with the Gemini API for content generation.

- **`gemini(query, model)`**:

  - **Purpose**: Generates a response to a query with the given Gemini model.
  
  - **Details**: Calls `get_gemini_client()` to obtain the model, sends the query to it, and returns the generated response text.

By integrating the Gemini API, this step enables content generation based on queries, allowing for more dynamic and interactive responses from the model.


In [None]:
import google.generativeai as genai
import os

gemini_api_key = os.getenv("GEMINI_API_KEY")
DEFAULT_GEMINI_MODEL = "gemini-1.5-flash"


def get_gemini_client(model=DEFAULT_GEMINI_MODEL):
    genai.configure(api_key=gemini_api_key)
    model = genai.GenerativeModel(model_name=model)
    return model


def gemini(query, model=DEFAULT_GEMINI_MODEL):
    model = get_gemini_client(model)
    response = model.generate_content(query)
    return response.text
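
Because every call to the model consumes API credits, it can help to exercise the surrounding code with a stub first. The sketch below is one way to do that under stated assumptions: `FakeModel`, `FakeResponse` and `generate_text` are hypothetical names, and the stub only mimics the two things the notebook relies on, a `generate_content` method and a `.text` attribute on its result.

```python
class FakeResponse:
    """Mimics the single attribute the notebook reads from a model response."""

    def __init__(self, text):
        self.text = text


class FakeModel:
    """Offline stand-in exposing a `generate_content` method."""

    def __init__(self):
        self.prompts = []  # remembers every prompt it was given

    def generate_content(self, query):
        self.prompts.append(query)
        return FakeResponse(f"[stub reply to {len(query)} characters]")


def generate_text(query, model):
    """Same call pattern as `gemini()`, but with the model passed in explicitly."""
    response = model.generate_content(query)
    return response.text


fake = FakeModel()
print(generate_text("Summarize Vercel.", fake))
# prints [stub reply to 17 characters]
```

Passing the model in as an argument is a design choice of this sketch; it lets the same function run against either the stub or a real client.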

## Step 8: Running Basic Research on a Company

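This step defines `run_basic_reserach(company)` (the identifier is spelled this way in the code), which performs basic research on a company by running a fixed set of queries about what the company is, its goals, founders, competitors, and latest news.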

### Functions Overview:

- **`queries`**:

  - **Purpose**: A list of queries is defined to gather general information about the company. These may include details about the company's mission, founders, competitors, and recent news.

- **`raw_markdown`**:

  - **Purpose**: Initializes an empty string to store the raw Markdown content of search results, allowing for structured aggregation of the information.

- **For each query**:

  - **Purpose**: The function `jina_search(query)` is called to perform the search using the Jina AI API.

  - **Details**: The search results are appended to the `raw_markdown` string, with each query’s results labeled with a title. Progress is printed as the results are updated.

- **Return**:

  - **Purpose**: The function returns the complete raw Markdown content with all the research results, providing a comprehensive overview of the company.

This step automates the process of researching a company by gathering structured data on key aspects, returning the results in Markdown format for easy analysis and review.


In [18]:
def run_basic_reserach(company):
    """Runs basic research on the company."""
    queries = [
        f"What is {company}?",
        f"What are the top goals of {company}?",
        f"Who are the founders of {company}?",
        f"Who are the competitors of {company}?",
        f"What is the latest news about {company}?",
    ]

    raw_markdown = ""

    for query in queries:
        print(f"Searching for: {query}")
        search_results = jina_search(query)
        title = f"### Search results for {query}"
        raw_markdown += title + "\n" + search_results + "\n"
        print("Updated results.")
    return raw_markdown
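
The aggregation pattern can be checked without any network access by passing the search function in as a parameter. This is a minimal sketch, not the notebook's function: `build_queries`, `aggregate_research` and `fake_search` are hypothetical names, and the section layout copies the `### Search results for ...` title used above.

```python
def build_queries(company):
    """Build the five starter questions used for basic research."""
    return [
        f"What is {company}?",
        f"What are the top goals of {company}?",
        f"Who are the founders of {company}?",
        f"Who are the competitors of {company}?",
        f"What is the latest news about {company}?",
    ]


def aggregate_research(company, search_fn):
    """Run `search_fn` on each query and join the results under Markdown titles."""
    sections = []
    for query in build_queries(company):
        sections.append(f"### Search results for {query}\n{search_fn(query)}\n")
    return "".join(sections)


def fake_search(query):
    """Stand-in for a real search call; makes no network request."""
    return f"(no network call made for: {query})"


notes = aggregate_research("Vercel", fake_search)
print(notes.count("### Search results for"))
# prints 5
```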

## Step 9: Creating and Generating a Research Report

This step creates a structured research report for a company based on basic research and generates a professional markdown document.

### Functions Overview:

- **`RESEARCH_REPORT_CREATION_PROMPT`**:

  - **Purpose**: A predefined prompt is used to guide the generation of the research report. This prompt ensures that the output is professional, factual, and properly structured.

- **`create_reserach_report`** (spelled this way in the code):

  - **Purpose**: Accepts the company name and basic research notes as inputs and formats the prompt for generating the report.
  
  - **Steps**:
  
    1. Calls the Gemini API (`gemini`) to generate the research report based on the provided details.
    
    2. Saves the generated result to a file.
    
    3. Returns the final report, ensuring it is in Markdown format.
    

- **`generate_research_report`**:

  - **Purpose**: 
  
    1. First, it performs the basic research by calling `run_basic_reserach` (spelled this way in the code).
    
    2. Then, it generates the research report by calling `create_reserach_report`.
    
    3. The function returns the generated report as a Markdown object for display.

This step automates the creation of a professional research report by leveraging the basic research data and generating a well-structured Markdown document using the Gemini API.


In [19]:
from IPython.display import Markdown, clear_output
from datetime import datetime
import math

RESEARCH_REPORT_CREATION_PROMPT = """
    I'm giving you the raw research notes of {company}.
    Your job is to generate a compact, 5 page research report in markdown. 
    Label the page numbers, mark the sections well. 
    Make sure it is professional, factual and well structured.
    Don't hallucinate or make up information.
    Make sure each page has at least 4 sections with 2 paragraphs each or at least 10 bullet points.
    Filter only the information relevant to {company} and discard the rest.
    This is a "Company Research Report" for Responsible AI stakeholders and Global Hacker Groups that are Legal & Ethical.

    Raw Research Notes: '{basic_research}'
    """


def create_reserach_report(company: str, basic_research: str, output_file=None):
    """Creates a research report for the company"""
    prompt = RESEARCH_REPORT_CREATION_PROMPT.format(
        company=company, basic_research=basic_research
    )
    print("Generating research report...")
    research_report = gemini(prompt)
    print("Research report generated.")
    clear_output()

    # Write to disk only when a path is given; open(None) would raise a TypeError.
    if output_file:
        # Create the parent directory if the path includes one.
        os.makedirs(os.path.dirname(output_file) or ".", exist_ok=True)
        with open(output_file, "w") as f:
            f.write(research_report)
    return research_report


def generate_research_report(company: str, output_file=None):
    """Generates a research report for the company"""
    if not output_file:
        unix_timestamp = math.floor(datetime.now().timestamp())
        # No "$" before the braces: that is JavaScript template syntax, and in a
        # Python f-string it would put a literal "$" into the file name.
        output_file = f"{company}_research_report_{unix_timestamp}.md"
    print(f"Running basic research on {company}...")
    basic_research = run_basic_reserach(company=company)
    print(f"Basic research completed. {len(basic_research)} characters found.")
    return Markdown(create_reserach_report(company, basic_research, output_file))
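
Company names such as "Deloitte India" contain spaces, so it may be worth normalizing the name before using it in a file path. The sketch below shows one possible approach; `report_filename` is a hypothetical helper, and the slug rule (lowercase letters and digits joined by underscores) is an assumption rather than a convention taken from the notebook.

```python
import re
from datetime import datetime, timezone


def report_filename(company, when=None, directory="data/company_research"):
    """Build a report path from a slugified company name and a Unix timestamp."""
    when = when or datetime.now(timezone.utc)
    # Replace every run of characters other than a-z and 0-9 with one underscore.
    slug = re.sub(r"[^a-z0-9]+", "_", company.lower()).strip("_")
    return f"{directory}/{slug}_research_report_{int(when.timestamp())}.md"


# A fixed, timezone-aware date keeps the example reproducible.
fixed = datetime(2025, 1, 1, tzinfo=timezone.utc)
print(report_filename("Deloitte India", when=fixed))
```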

## Step 10: Generating Research Reports for Multiple Companies

This step focuses on generating research reports for one or more companies, tracking the time taken for each and handling any errors.

### Functions Overview:

- **`generate_one_company_report`**:

  - **Purpose**: This function generates a research report for a single company.
  
  - **Steps**:
  
    1. Tracks the start and end time to calculate the time taken for report generation.
    
    2. If successful, it returns the generated report.
    
    3. If an error occurs, it prints the exception stack trace for debugging.
  
- **`generate_batch_company_reports`**:

  - **Purpose**: Handles generating reports for a list of companies.
  
  - **Steps**:
  
    1. Iterates through each company in the list.
    
    2. Calls `generate_research_report` for each company.
    
    3. Prints the time taken for each individual report generation.
    
    4. At the end, it provides the total time taken for generating all reports in the batch.
    
    5. Handles errors gracefully and displays appropriate messages when issues arise.


These functions streamline the process of generating research reports for multiple companies, ensuring that the time taken is tracked, errors are handled, and progress is monitored effectively.


In [20]:
from datetime import datetime
import traceback


def generate_one_company_report(company: str):
    """Generates a research report for one company"""
    try:
        company_start_time = datetime.now()
        print(f"Generating research report for {company}...")
        report = generate_research_report(
            company,
            # Plain {company}: a "$" prefix would end up literally in the file name.
            output_file=f"data/company_research/{company}_research_report_090825.md",
        )
        company_end_time = datetime.now()
        # total_seconds() covers the full duration, unlike the .seconds component.
        company_total_time_seconds = int(
            (company_end_time - company_start_time).total_seconds()
        )
        print(f"Research report for {company} generated.")
        print(f"Time taken: {company_total_time_seconds} seconds")
        return report
    except Exception:
        print(f"Failed to generate research report for {company}.")
        traceback.print_exc()
        return None


def generate_batch_company_reports(companies: list):
    """Generates research reports for a batch of companies"""
    print(f"Generating research reports for {len(companies)} companies.")
    total_time = 0
    for company in companies:
        try:
            company_start_time = datetime.now()
            print(f"Generating research report for {company}...")
            generate_research_report(
                company,
                # Plain {company}: a "$" prefix would end up literally in the file name.
                output_file=f"data/company_research/{company}_research_report_090825.md",
            )
            company_end_time = datetime.now()
            # total_seconds() covers the full duration, unlike the .seconds component.
            company_total_time_seconds = int(
                (company_end_time - company_start_time).total_seconds()
            )
            total_time += company_total_time_seconds
            print(f"Research report for {company} generated.")
            print(f"Time taken: {company_total_time_seconds} seconds")
        except Exception as e:
            # Report the failure and continue with the next company.
            print(f"Failed to generate research report for {company}.")
            print(f"Error: {e}")
    print(f"Total time taken: {total_time} seconds.")
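
The timing and error-handling pattern can be separated from report generation and tried with a stand-in worker. The following is a sketch under stated assumptions: `run_batch` and `fake_report` are hypothetical names, and `time.perf_counter()` is used here in place of `datetime.now()` because it is designed for measuring elapsed time.

```python
import time


def run_batch(items, worker):
    """Call `worker(item)` for every item, timing each call and collecting failures."""
    durations = {}
    failures = {}
    for item in items:
        start = time.perf_counter()
        try:
            worker(item)
        except Exception as error:  # keep going so one failure does not stop the batch
            failures[item] = str(error)
        finally:
            durations[item] = time.perf_counter() - start
    return durations, failures


def fake_report(company):
    """Stand-in worker that fails for one made-up company name."""
    if company == "Unknown Co":
        raise ValueError("no search results")


durations, failures = run_batch(["Vercel", "Unknown Co"], fake_report)
print(sorted(durations), failures)
# prints ['Unknown Co', 'Vercel'] {'Unknown Co': 'no search results'}
```

Returning the durations and failures, rather than only printing them, makes it possible to retry just the companies that failed.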

## Step 11: Example of Generating a Research Report for One Company

In this step, a research report is generated for a single company (Vercel) using the `generate_one_company_report` function.

### Steps:

- **`company = "Vercel"`**:

  - Sets the company name to "Vercel" for which the report will be generated.

- **`report = generate_one_company_report(company)`**:

  - Calls the `generate_one_company_report` function, which generates a research report for "Vercel" and stores the result in the `report` variable.

- **`display(report)`**:

  - Displays the generated research report as Markdown in the notebook.
  
  - This shows the results of the generated research for the specified company.

This example demonstrates how to generate a research report for a single company (Vercel) using the functions defined, providing a clean and structured output in Markdown format.


In [None]:
## Example: One Company Report

from IPython.display import Markdown, display, clear_output

company = "Vercel"

report = generate_one_company_report(company)

# clear_output()
display(report)

## Step 12: Example of Running Reports on a Batch of Companies (Commented to Save API Credits)

In this step, a batch of companies is used to generate research reports.

### Steps:

- **`companies list:`**

  - A list of company names is defined, including "Google", "Deloitte India", "Zepto", and "Dyte".

- **`generate_batch_company_reports(companies):`**

  - This function is called with the `companies` list. It generates a research report for each company in the list and writes each report to its own file. The call is commented out to avoid consuming API credits.

- **`batch_output_report:`**

  - This variable holds a path intended for a combined batch report. It is not passed to `generate_batch_company_reports`, and it is also commented out.

This example demonstrates how to run reports for multiple companies at once using the `generate_batch_company_reports` function, which processes the companies one after another and saves one report file per company.


In [17]:
## Example: Run reports on a batch of companies (commented to save API credits)

# batch_output_report = "outputs/batch_company_research_report_01.md"

# companies = [
#     "Google",
#     "Deloitte India",
#     "Zepto",
#     "Dyte",
# ]

# generate_batch_company_reports(companies)

## Step 13: Company Deep Research

The `company_deep_research` function conducts thorough research on a company by exploring multiple queries related to it, such as its founders, competitors, and latest news. 

It collects relevant links from search results, fetches the page content, converts it to Markdown format, and compiles the information into a research report. 

This report is then saved in a specified file for further use.

In [22]:
import requests
import traceback
from markdownify import markdownify as md
from uuid import uuid4

DEFAULT_REQUEST_TIMEOUT = 30


def fetch_page_markdown(url, timeout=DEFAULT_REQUEST_TIMEOUT):
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()
        html = response.text
        markdown = md(html)
        return markdown
    except Exception as e:
        print(f"Failed to fetch page text for {url}.")
        traceback.print_exc()
        return ""


def check_file_exists(file):
    """Checks if the file exists"""
    return os.path.exists(file)


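def update_temp_file(temp_file, markdown):
    """Updates the temp file with the markdown"""
    if not check_file_exists(temp_file):
        print(f"Temp file {temp_file} does not exist.")
        print(f"Creating temp file: {temp_file}")
        os.makedirs(os.path.dirname(temp_file) or ".", exist_ok=True)
    # Append mode creates the file when it is missing, so a single write is enough;
    # writing in "w" mode and then appending would store the first chunk twice.
    with open(temp_file, "a") as f:
        f.write(markdown)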


def crawl_links_and_get_markdown(links):
    """Crawls the links and gets the markdown"""
    temp_file = f"outputs/temp_{uuid4()}.md"
    print(f"Rolling write to temp file: {temp_file}")
    markdown = ""
    for link in links:
        print(f"Crawling link: {link}")
        response = fetch_page_markdown(link) + "\n"
        title = f"### PAGE: {link}\n"
        current_markdown = title + response
        update_temp_file(temp_file, current_markdown)
        print(f"Temp file {temp_file} updated.")
        markdown += title + response
    return markdown


def company_deep_research(company: str) -> Markdown:
    """Does a deep research on the company."""
    link_exploration_queries = [
        f"What is {company}?",
        f"Who are the founders of {company}?",
        f"Who are the competitors of {company}?",
        f"What is the latest news about {company}?",
        f"What are the top goals of {company}?",
        f"What is {company}'s business model?",
        f"What is {company} known for?",
        f"Who are some of {company}'s customers?",
        f"What are some employee reviews of {company}?",
        f"Where is the headquarters of {company}?",
        f"What industry does {company} operate in?",
        f"What are some of the products of {company}?",
    ]

    all_links = []

    for query in link_exploration_queries:
        print(f"Exploring links for: {query}")
        links = get_links_from_regular_search(query)
        all_links.extend(links)

    raw_markdown = crawl_links_and_get_markdown(all_links)
    unix_timestamp = math.floor(datetime.now().timestamp())
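    research_report = create_reserach_report(
        company,
        raw_markdown,
        # Plain {unix_timestamp}: a "$" prefix would end up literally in the file name.
        output_file=f"data/company_research/{company}_research_report_{unix_timestamp}_naive.md",
    )
    # Return the report so the caller has something to display.
    return Markdown(research_report)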
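
Crawling every link for a dozen queries can produce far more text than is practical to send in one prompt. One way to keep the notes bounded is to clip each page and stop at an overall budget. The sketch below illustrates that idea; `build_corpus` is a hypothetical helper, and the default limits are arbitrary character counts, not values taken from any model's documentation.

```python
def build_corpus(pages, per_page_limit=4000, total_limit=20000):
    """Join page texts under `### PAGE:` headings, clipping to the given budgets.

    `pages` maps a URL to the Markdown text fetched from it.
    """
    parts = []
    used = 0
    for url, text in pages.items():
        section = f"### PAGE: {url}\n{text[:per_page_limit]}\n"
        if used + len(section) > total_limit:
            break  # stop before the combined notes exceed the overall budget
        parts.append(section)
        used += len(section)
    return "".join(parts)


# Hand-written sample pages for illustration.
pages = {
    "https://example.com/a": "x" * 50,
    "https://example.com/b": "y" * 50,
}

# Each clipped section is 43 characters long, so only the first fits into 60.
corpus = build_corpus(pages, per_page_limit=10, total_limit=60)
print(corpus)
```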

## Step 14: Deep Research on Company

This code performs deep research on the company "Databricks" by exploring various queries related to the company, retrieving relevant links, and then fetching their content as Markdown.

The collected data is compiled into a Markdown research report. Finally, the report is displayed in the Jupyter notebook using the `display_markdown` function.

In [None]:
# Deep Research: 🚀

from IPython.display import display_markdown

company = "Databricks"

display_markdown(company_deep_research(company))

## Conclusion:

This prototype generates research reports on companies by combining AI-driven search (Google Serper, Jina AI) with summarization by OpenAI and Gemini models. Caching search links reduces repeated API calls and saves credits.

The notebook can generate individual and batch reports, format them as Markdown, and crawl external links for deeper research. It automates much of the research process, while the prompts ask the model to keep reports professional, factual, and structured.

---

# Thank You for visiting The Hackers Playbook! 🌐

If you liked this research material:

- [Subscribe to our newsletter.](https://thehackersplaybook.substack.com)

- [Follow us on LinkedIn.](https://www.linkedin.com/company/the-hackers-playbook/)

- [Leave a star on our GitHub.](https://www.github.com/thehackersplaybook)

<div style="display:flex; align-items:center; padding: 50px;">
<p style="margin-right:10px;">
    <img height="200px" style="width:auto;" width="200px" src="https://avatars.githubusercontent.com/u/192148546?s=400&u=95d76fbb02e6c09671d87c9155f17ca1e4ef8f21&v=4"> 
</p>
</div>
