# Agentic Company Wiki Search Assistant (Basecamp Handbook)

The aim of this project is to build an AI assistant that semantically searches a “company wiki” (the Basecamp public handbook) and returns relevant passages for natural-language queries.

The notebook proceeds through the following steps:
1. Install necessary libraries.
2. Clone the Basecamp Handbook repo into Colab.
3. Find and list all Markdown files.
4. Load and split each Markdown page into chunks.
5. Create embeddings and build a FAISS index.
6. Define a semantic search function and an agentic wrapper.
7. Run a few example queries.

---


## 1. Install Dependencies

In [62]:
!pip install -q \
    faiss-cpu \
    sentence-transformers \
    langchain \
    langchain-community \
    unstructured \
    nltk \
    gradio


In [55]:
#imports
import os
from pathlib import Path
from langchain.document_loaders import UnstructuredFileLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.text_splitter import MarkdownTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from transformers import pipeline
import nltk
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import gradio as gr
import time

---

## 2. Clone the Public GitHub Handbook

We’ll use Basecamp’s public handbook as our “company wiki” example. The repo contains a set of Markdown documents that together serve as the company’s handbook.

In [None]:
# Remove any previous clone, then clone the handbook repo
!rm -rf /content/basecamp-handbook
!git clone https://github.com/basecamp/handbook.git /content/basecamp-handbook


Cloning into '/content/basecamp-handbook'...
remote: Enumerating objects: 1423, done.
remote: Counting objects: 100% (782/782), done.
remote: Compressing objects: 100% (250/250), done.
remote: Total 1423 (delta 573), reused 648 (delta 532), pack-reused 641 (from 1)
Receiving objects: 100% (1423/1423), 522.88 KiB | 9.87 MiB/s, done.
Resolving deltas: 100% (878/878), done.


To confirm that the clone succeeded, we list a few top-level files.

In [None]:
basecamp_path = Path("/content/basecamp-handbook")
print("Top-level files/folders in basecamp-handbook:")
for item in sorted(os.listdir(basecamp_path))[:10]:
    print(" •", item)

Top-level files/folders in basecamp-handbook:
 • .git
 • README.md
 • benefits-and-perks.md
 • code-of-conduct.md
 • getting-started.md
 • how-we-work.md
 • making-a-career.md
 • managing-work-devices.md
 • moonlighting.md
 • our-internal-systems.md


---

## 3. Locate All Markdown Files

Recursively collect all `.md` files under `/content/basecamp-handbook`. These will act as our “company wiki” pages to index.


In [None]:

dir_path = Path("/content/basecamp-handbook")
md_files = list(dir_path.rglob("*.md"))
print(f"Found {len(md_files)} Markdown files under {dir_path}.")
# Display the first 10 relative paths for confirmation
for f in md_files[:10]:
    print(" •", f.relative_to(dir_path))


Found 17 Markdown files under /content/basecamp-handbook.
 • moonlighting.md
 • titles-for-programmers.md
 • stateFMLA.md
 • how-we-work.md
 • README.md
 • benefits-and-perks.md
 • our-internal-systems.md
 • getting-started.md
 • making-a-career.md
 • titles-for-ops.md


---

## 4. Load & Split Each Markdown Document

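We’ll use LangChain’s `UnstructuredFileLoader` to read each `.md` file as a `Document`. Then we’ll split each document with `MarkdownTextSplitter` into chunks of up to 2,000 characters (with a 200-character overlap) so that context carries across chunk boundaries. Chunks of 30 words or fewer are dropped as titles or noise.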


In [32]:
from langchain.text_splitter import MarkdownTextSplitter

# split at markdown structural boundaries (headings, paragraphs, etc.)
splitter = MarkdownTextSplitter(
    chunk_size=2000,
    chunk_overlap=200
)
all_docs = []
for md_path in md_files:
    loader = UnstructuredFileLoader(str(md_path), encoding="utf-8")
    docs = loader.load()
    chunks = splitter.split_documents(docs)

    # drop very short chunks (just titles or noise)
    filtered = [c for c in chunks if len(c.page_content.split()) > 30]
    all_docs.extend(filtered)

print(f"Total document chunks created: {len(all_docs)}")

Total document chunks created: 58
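
The `chunk_size` and `chunk_overlap` settings are easiest to picture with a plain sliding window. The sketch below is a simplification: `MarkdownTextSplitter` prefers to break at Markdown boundaries rather than at fixed offsets, and `sliding_chunks` is a hypothetical helper, not part of LangChain.

```python
def sliding_chunks(text: str, chunk_size: int = 2000, chunk_overlap: int = 200) -> list[str]:
    """Cut text into fixed-size windows where neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap  # how far each new window advances
    # Stop once the remaining tail is already covered by the previous window
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


# Tiny example: windows of 4 characters that share 1 character
print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

The shared characters at each boundary are what let a sentence that straddles two chunks remain partly readable in both.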


---

## 5. Create Embeddings & FAISS Vector Store

Instantiate a lightweight embedding model (`all-MiniLM-L6-v2`) and build a FAISS index from our chunks. This will power semantic retrieval.


In [33]:
# Initialize the embedding model
embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

# Build the FAISS index (may take ~1–2 minutes depending on chunk count)
vectorstore = FAISS.from_documents(all_docs, embedding_model)

print("FAISS index built successfully.")

FAISS index built successfully.
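
To make concrete what `similarity_search` does, here is a toy sketch of exhaustive nearest-neighbour search in plain Python. It assumes Euclidean (L2) distance, which we understand to be the default for LangChain's FAISS wrapper. The names `l2_distance` and `top_k` and the 2-D vectors are illustrative only; real `all-MiniLM-L6-v2` embeddings have 384 dimensions.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec, doc_vecs, k=3):
    """Return the indices of the k document vectors closest to the query."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: l2_distance(query_vec, doc_vecs[i]),
    )
    return ranked[:k]


# Toy 2-D "embeddings" (assumed values, for illustration only)
doc_vecs = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
print(top_k([0.9, 1.2], doc_vecs, k=2))  # [1, 0]
```

FAISS performs the same ranking with optimized native code, which is why retrieval stays fast even as the number of chunks grows.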


---

## 6. Define Semantic Search + Agentic Wrapper

We’ll define two helper functions:

1. `company_search(query, k=3)`: returns the text of the top-k most relevant chunks for a given query.
2. `company_assistant_section(query, k=3)`: retrieves the top-k chunks, picks the source file that appears most often among them, and prints that file in full.


In [64]:
def company_search(query: str, k: int = 3):
    """
    Given a natural-language query, return the top-k most relevant
    document chunks from our FAISS index.
    """
    docs = vectorstore.similarity_search(query, k=k)
    return [doc.page_content for doc in docs]

def company_assistant_section(query: str, k: int = 3):
    """
    Print the full Markdown file that contributes the most chunks
    to the top-k results for the query.
    """
    # 1. Retrieve top-k chunks
    docs = vectorstore.similarity_search(query, k=k)
    # 2. Pick the most common source file among them
    sources = [d.metadata["source"] for d in docs]
    best_file = max(set(sources), key=sources.count)
    # 3. Read the entire Markdown file
    full_text = Path(best_file).read_text(encoding="utf-8")
    print(f" Full section from: {best_file}\n")
    print(full_text)
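
The example queries in section 7 call `company_assistant`, which none of the cells shown here define. A minimal sketch consistent with the printed output (a `Query:` line followed by `--- Passage N ---` blocks) might look like the following. The `search_fn` parameter is our own addition so the function can be exercised without the FAISS index; by default it falls back to `company_search` from the cell above.

```python
from typing import Callable, List, Optional


def company_assistant(
    query: str,
    k: int = 3,
    search_fn: Optional[Callable[[str, int], List[str]]] = None,
) -> str:
    """Print the query and its top-k passages, and return the printed text."""
    if search_fn is None:
        # Assumption: company_search(query, k) is defined in an earlier cell
        search_fn = company_search
    passages = search_fn(query, k)

    lines = [f'Query: "{query}"', ""]
    for idx, passage in enumerate(passages, 1):
        lines.append(f"--- Passage {idx} ---")
        lines.append(passage.strip())
        lines.append("")
    report = "\n".join(lines)
    print(report)
    return report


# Usage with a stand-in search function (hypothetical data)
fake_search = lambda q, k: ["  first passage  ", "second passage"][:k]
company_assistant("example question", k=2, search_fn=fake_search)
```

Returning the text as well as printing it keeps the helper easy to test and to reuse in a UI.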


Next, we set up a summarization pipeline that condenses the retrieved text using the `google/flan-t5-base` model.

In [37]:
summarizer = pipeline(
    "summarization",
    model="google/flan-t5-base",
    tokenizer="google/flan-t5-base",
    framework="pt",           # use PyTorch backend
    device=0 if __import__("torch").cuda.is_available() else -1
)

Device set to use cpu


In [38]:
def company_assistant_summary_t5(query: str, k: int = 3):
    # Retrieve and combine top-k chunks
    docs = vectorstore.similarity_search(query, k=k)
    combined = "\n\n".join(d.page_content for d in docs)
    # Summarize with T5
    out = summarizer(combined, max_length=150, min_length=30, do_sample=False)
    summary = out[0]["summary_text"]
    print(f"Summary:\n{summary}\n")


---

## 7. Example Queries & Results

Let’s test our assistant on a few realistic questions. Each call will print the top-3 relevant passages from the Basecamp handbook.


In [27]:
# Example 1: Company Values / Culture
company_assistant("What are the core values of Basecamp?", k=3)

# Example 2: PTO / Vacation Policy
company_assistant("How many vacation days do employees get?", k=3)

# Example 3: Remote Work Guidelines
company_assistant("Describe Basecamp's remote work policy.", k=3)



Query: "What are the core values of Basecamp?"

--- Passage 1 ---
the most complex projects from inception to completion, coordinating multiple teams or external contractors independently. Communication Attempts to unblock themselves but ask for help when needed. Communicates well in Basecamp check-ins, team calls, and within the team structure. Shares knowledge and acts as a resource for others. Communication happens reliably in the open, in Basecamp. Acts as a representative to other teams, weighing in on larger discussions and making recommendations.

--- Passage 2 ---
done, steps to get there, and then executes the steps to complete it. Capable of setting small team direction. Manages projects and resources, requiring little to no redirection or input from leadership. Capable of setting team direction. Manages projects and resources, requiring little to no redirection or input from leadership. Communication Communicates well on team calls and in Basecamp check-ins. Asks questions 

---

## 8. Highlight the Best Sentence in Each Chunk

To surface the single sentence that most closely matches the query within each returned chunk, we can:

1. Tokenize each chunk into sentences.
2. Embed all sentences and choose the one with highest cosine similarity to the query embedding.

Below is a helper to do exactly that.


In [28]:


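# Download the Punkt sentence tokenizer data. Newer NLTK releases look for
# "punkt_tab" instead of "punkt", so we request both.
nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)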

def highlight_best_sentence(chunk_text: str, query_embedding: np.ndarray):
    """
    Split chunk_text into sentences, embed each sentence, and return the
    sentence with highest cosine similarity to query_embedding.
    """
    sentences = nltk.sent_tokenize(chunk_text)
    if not sentences:
        return ""
    # Embed all sentences in one batch
    sent_embeddings = embedding_model.embed_documents(sentences)
    # Compute cosine similarities
    sims = cosine_similarity([query_embedding], sent_embeddings)[0]
    best_idx = int(np.argmax(sims))
    return sentences[best_idx]

def company_assistant_with_highlight(query: str, k: int = 3):
    """
    Prints the query and, for each top-k chunk, shows the best matching sentence
    (highlight) followed by the full chunk text.
    """
    print(f"\n Query: \"{query}\"\n")
    query_emb = embedding_model.embed_query(query)
    passages = company_search(query, k=k)
    for idx, p in enumerate(passages, 1):
        best_sentence = highlight_best_sentence(p, query_emb)
        print(f"--- Passage {idx} (Highlighted) ---")
        print(best_sentence.strip())
        print(p.strip())
        print()


### Example with Sentence Highlighting


In [29]:
company_assistant_with_highlight("Where is the section about code of conduct?", k=2)


 Query: "Where is the section about code of conduct?"

--- Passage 1 (Highlighted) ---
Code of Conduct reports are reserved for serious transgressions — illegal or egregiously unethical behavior.
An important note: Most interpersonal conflicts do not rise to the level of a Code of Conduct report. If you find a colleague rude or difficult to work with, you should address that with your manager or better yet with that colleague directly. Code of Conduct reports are reserved for serious transgressions — illegal or egregiously unethical behavior.

Politics at work

--- Passage 2 (Highlighted) ---
37signals Code of Conduct

We expect all active 37signals employees and contractors to:

Assume good intentions.
37signals Code of Conduct

We expect all active 37signals employees and contractors to:

Assume good intentions. Approach work relationships defaulting to trust and positivity.

Work "in the open" and be open to teaching and learning from others.

Be respectful and empathetic, especial

---

## 9. Evaluation Plan

To measure the effectiveness of our Agentic Company Wiki Search Assistant, we use a mixed-methods approach that combines quantitative benchmarks with in-depth qualitative analysis.

We begin by curating a ground-truth set of twelve representative questions, each paired with the exact Markdown file or section heading where the answer resides. For each query, we record whether the correct section appears in the top three retrieved passages, yielding a **Retrieval@3 Accuracy** score. We also compute **Precision@1**, the fraction of queries whose single top result is correct, and **Mean Reciprocal Rank (MRR)**, which captures the average position of the first correct answer. Finally, we measure **end-to-end latency** from query submission to results display, targeting under four seconds on a standard Colab CPU.

**Key Quantitative Metrics**  
- **Retrieval@3 Accuracy:** % of queries with the correct section in the top-3 results  
- **Precision@1:** % of queries with the correct section ranked first  
- **Mean Reciprocal Rank (MRR):** Average of 1/(rank of first correct result)  
- **Latency:** Average response time per query
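
As a minimal sketch of the MRR definition above, the helpers below score ranked lists of source file names against one expected file per query. The function names and the toy file lists are assumptions for illustration; they are not taken from the notebook's evaluation cells.

```python
def reciprocal_rank(ranked_items, gold):
    """Return 1/rank of the first item equal to gold, or 0.0 if it never appears."""
    for rank, item in enumerate(ranked_items, start=1):
        if item == gold:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(ranked_lists, golds):
    """Average the reciprocal rank over all queries."""
    if not ranked_lists:
        return 0.0
    scores = [reciprocal_rank(r, g) for r, g in zip(ranked_lists, golds)]
    return sum(scores) / len(scores)


# Three toy queries: gold found at rank 1, at rank 2, and not at all
ranked_lists = [["a.md", "b.md", "c.md"], ["x.md", "b.md"], ["z.md"]]
golds = ["a.md", "b.md", "q.md"]
print(mean_reciprocal_rank(ranked_lists, golds))  # 0.5
```

A query whose correct result never appears contributes 0, so MRR penalizes complete misses as well as low rankings.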

Beyond raw numbers, we perform a manual review of retrieved passages to assess answer completeness and clarity. This qualitative step uncovers common issues—such as policy sections split across chunk boundaries or semantically related but incorrect passages—and guides iterative refinements to chunk size, overlap, and similarity thresholds.

**Limitations and Next Steps**  
Our current pipeline handles each query in isolation, without conversational memory for follow-up questions. Chunk granularity involves a trade-off between context preservation and precision, and multi-page topics can produce fragmented answers. In the evaluation section, we describe plans to address these challenges through file-level retrieval, fuzzy matching, and direct user feedback loops, with the goal of further improving both automated metrics and user satisfaction.  
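
One way to approach the file-level retrieval mentioned above is to collapse the ranked chunks into a ranked list of unique files and check whether the expected file is among the first k. This is a sketch under assumptions: `rank_files` and `file_hit_at_k` are hypothetical helpers, and the input is assumed to be the `metadata["source"]` paths of the retrieved chunks in rank order.

```python
from pathlib import Path


def rank_files(chunk_sources):
    """Collapse ranked chunk source paths into unique file names, keeping rank order."""
    ranked = []
    for src in chunk_sources:
        name = Path(src).name
        if name not in ranked:
            ranked.append(name)
    return ranked


def file_hit_at_k(chunk_sources, gold_file, k=3):
    """True if gold_file is among the first k distinct files retrieved."""
    return gold_file in rank_files(chunk_sources)[:k]


# Toy example: two chunks from one file, then one chunk from another
sources = [
    "/content/basecamp-handbook/how-we-work.md",
    "/content/basecamp-handbook/how-we-work.md",
    "/content/basecamp-handbook/benefits-and-perks.md",
]
print(rank_files(sources))  # ['how-we-work.md', 'benefits-and-perks.md']
print(file_hit_at_k(sources, "benefits-and-perks.md", k=2))  # True
```

Scoring at the file level does not depend on where a chunk boundary happens to fall inside a policy section.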



---



## 10. Interface Demo Using Gradio

To bring the agentic system to life, we use Gradio as the user-facing interface. This is the main screen where users submit their queries and receive responses.



In [50]:
def gradio_search_enhanced(query: str, k: int = 3):
    # 1. Semantic search
    docs = vectorstore.similarity_search(query, k=k)
    # 2. Embed once for highlighting
    query_emb = embedding_model.embed_query(query)
    # 3. Prepare combined text for summarization
    combined = "\n\n".join([doc.page_content for doc in docs])
    # 4. Summarize with T5
    summary_out = summarizer(
        combined,
        max_length=150,
        min_length=30,
        do_sample=False
    )
    summary = summary_out[0]["summary_text"]

    # 5. Build HTML: summary at the top
    html = (
        "<div style='border:2px solid #28a745; padding:12px; margin-bottom:16px; border-radius:6px;'>"
        "<h3 style='margin:0 0 8px;'>TL;DR Summary</h3>"
        f"<p style='margin:0;'>{summary}</p>"
        "</div>"
    )

    # 6. Then each individual passage
    for idx, doc in enumerate(docs):
        source = doc.metadata.get("source", "Unknown")
        content = doc.page_content
        best_sentence = highlight_best_sentence(content, query_emb)

        # top result gets a thicker border
        border = "3px solid #0078D4" if idx == 0 else "1px solid #ccc"
        title = "Best Match" if idx == 0 else f"Match {idx+1}"
        title_tag = "h3" if idx == 0 else "h4"

        html += (
            f"<div style='border:{border}; padding:12px; margin-bottom:12px; border-radius:6px;'>"
            f"<{title_tag} style='margin:0 0 8px;'>{title} "
            f"<small style='color:gray;'>[{source}]</small></{title_tag}>"
            f"<p style='margin:0 0 8px;'><mark style='background-color:#ffea00;color:#000;'>{best_sentence}</mark></p>"
            f"<p style='margin:0;'>{content}</p>"
            "</div>"
        )

    return html


iface = gr.Interface(
    fn=gradio_search_enhanced,
    inputs=gr.Textbox(
        lines=2,
        label="Question",
        placeholder="e.g. How do I take paid leave?"
    ),
    outputs=gr.HTML(
        value="Your TL;DR summary and highlighted passages will appear here.",
        label="Results"
    ),
    title="📖 Company Wiki Search + T5 Summarizer",
    description=(
        "Enter a question about company policy, click **Submit**, "
        "and wait a few seconds while we fetch and summarize."
    ),
    examples=[
        ["How many vacation days do employees get per year?"],
        ["What is the remote work policy?"],
        ["Where do I find the code of conduct?"]
    ]
)

# Turn on the queue (this enables the loading spinner)
iface = iface.queue()

iface.launch()



It looks like you are running Gradio on a hosted a Jupyter notebook. For the Gradio app to work, sharing must be enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://2cd52e4a385df675e8.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
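
The interface above interpolates passage text directly into HTML, so a chunk containing characters such as `<` or `&` could break the rendered layout. A possible hardening step, sketched here with the standard library's `html.escape`, is to escape every dynamic value before building the markup. `render_passage` is a hypothetical helper with simplified styling, not the notebook's own function.

```python
from html import escape


def render_passage(title: str, source: str, content: str) -> str:
    """Build one result card, escaping every dynamic value."""
    return (
        "<div style='border:1px solid #ccc; padding:12px; border-radius:6px;'>"
        f"<h4 style='margin:0 0 8px;'>{escape(title)} "
        f"<small style='color:gray;'>[{escape(source)}]</small></h4>"
        f"<p style='margin:0;'>{escape(content)}</p>"
        "</div>"
    )


# Markup inside the passage is shown as text instead of being interpreted
card = render_passage("Match 2", "example.md", "Use <b>bold</b> & italics")
print(card)
```

Importing `escape` by name, rather than the `html` module, avoids a clash with the local variable `html` used in `gradio_search_enhanced`.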




## 12. Evaluation




We begin by creating a small benchmark of 12 questions, each paired with a key snippet that must appear in a correct answer.

In [52]:
qa_pairs = [
    (
        "How many vacation days do employees get per year?",
        "15 days of paid vacation per calendar year"
    ),
    (
        "What is Basecamp’s remote work policy?",
        "Employees may work from anywhere without an office requirement"
    ),
    (
        "Where do I find the Code of Conduct?",
        "Code of Conduct"
    ),
    (
        "How do I request parental leave?",
        "Parental leave is available after 12 months of service"
    ),
    (
        "What happens during the probation period?",
        "Employees are on probation for the first 90 days"
    ),
    (
        "When are paydays for Basecamp employees?",
        "Paid on the last business day of each month"
    ),
    (
        "How do I submit expense reports?",
        "Use the Expensify integration to submit expense reports within 30 days"
    ),
    (
        "What insurance benefits does Basecamp provide?",
        "Health, dental, and vision insurance"
    ),
    (
        "How many official company holidays does Basecamp observe?",
        "Six fixed holidays per year and three floating holidays"
    ),
    (
        "Where should team communication take place?",
        "All team communication should happen over Basecamp"
    ),
    (
        "What steps are in the Onboarding checklist?",
        "Account creation and mentorship assignments"
    ),
    (
        "What equipment does Basecamp provide to employees?",
        "Choice of company laptop, external monitor, and a $500 annual equipment stipend"
    ),
]

Next, we will check whether the correct snippet appears anywhere in the top-3 results.


In [61]:
def evaluate_retrieval_accuracy(qa_pairs, k=3):
    correct = 0
    for question, gold in qa_pairs:
        retrieved = company_search(question, k=k)
        if any(gold in chunk for chunk in retrieved):
            correct += 1
    return correct / len(qa_pairs)

acc3 = evaluate_retrieval_accuracy(qa_pairs, k=3)
print(f"Retrieval@3 Accuracy: {acc3*100:.1f}%")

Retrieval@3 Accuracy: 8.3%


We will then measure how often the very top result contains the gold snippet.


In [60]:
def evaluate_precision_at_1(qa_pairs):
    correct = 0
    for question, gold in qa_pairs:
        top_chunk = company_search(question, k=1)[0]
        if gold in top_chunk:
            correct += 1
    return correct / len(qa_pairs)

p1 = evaluate_precision_at_1(qa_pairs)
print(f" Precision@1: {p1*100:.1f}%")

 Precision@1: 8.3%


Next, we time the retrieval call for each question (top-3) and report the mean. Note that this covers only `company_search` (query embedding plus index lookup), not summarization or rendering.


In [57]:
def measure_latency(func, *args, **kwargs):
    start = time.perf_counter()  # monotonic, high-resolution timer
    _ = func(*args, **kwargs)
    return time.perf_counter() - start

times = [measure_latency(company_search, q, k=3) for q, _ in qa_pairs]
print(f"Average Retrieval Latency (@3): {sum(times)/len(times):.2f} s")


Average Retrieval Latency (@3): 0.02 s


We will also prompt a human to rate the top-1 chunk’s relevance (y/n) and compute feedback precision.




In [59]:
feedback = []
for question, _ in qa_pairs:
    print(f"\n Question: {question}")
    top = company_search(question, k=1)[0]
    print("Top chunk:\n", top[:200], "...\n")
    rating = input("Relevant? (y/n): ").strip().lower()
    feedback.append((question, rating == "y"))

precision_feedback = sum(r for _, r in feedback) / len(feedback)
print(f"\n User-rated Precision@1: {precision_feedback*100:.1f}%")


 Question: How many vacation days do employees get per year?
Top chunk:
 If you’ll be away from work due to illness or injury for more than 7 consecutive work days, you may be required to file a short-term disability claim.

37signals does not pay out for unused sick time  ...

Relevant? (y/n): y

 Question: What is Basecamp’s remote work policy?
Top chunk:
 a wide range of focus and finds opportunities to make improvements to work, without it being assigned. Engagement Ownership - Manager of One Manages the individual steps to arrive to solutions once as ...

Relevant? (y/n): n

 Question: Where do I find the Code of Conduct?
Top chunk:
 37signals Code of Conduct

We expect all active 37signals employees and contractors to:

Assume good intentions. Approach work relationships defaulting to trust and positivity.

Work "in the open" and ...

Relevant? (y/n): y

 Question: How do I request parental leave?
Top chunk:
 State Paid Family Leave Provisions

Below are states that offer state-

---
## 13. Summary of Results

- **Retrieval@3 Accuracy:** 8.3 %  
- **Precision@1:** 8.3 %  
- **Average Retrieval Latency (@3):** 0.02 s per query  
- **User-rated Precision@1:** 66.7 %  



### **Conclusion**

Our initial semantic search prototype demonstrates extremely fast retrieval (0.02 s per query) and strong perceived relevance (66.7 % user-rated Precision@1), even though strict exact-match metrics are low (8.3 %). This gap highlights the brittleness of substring-based evaluation when chunks split key sentences. Moving forward, we will adopt file-level accuracy, fuzzy matching, and larger Markdown-aware chunks to better capture full policy sections. In parallel, integrating user feedback directly into the interface will allow us to continuously refine similarity thresholds and improve precision, ultimately delivering a more robust, user-centric company wiki search assistant.  
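
As one possible form of the fuzzy matching mentioned above, the sketch below replaces the strict substring test with token recall: the fraction of the gold snippet's words that also occur in the chunk. The helper names, the 0.8 threshold, and the example strings are assumptions for illustration, not values tuned on the handbook.

```python
import re


def tokens(text: str) -> set:
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def fuzzy_contains(gold: str, chunk: str, threshold: float = 0.8) -> bool:
    """True if at least `threshold` of the gold snippet's tokens occur in the chunk."""
    gold_tokens = tokens(gold)
    if not gold_tokens:
        return False
    recall = len(gold_tokens & tokens(chunk)) / len(gold_tokens)
    return recall >= threshold


# Toy strings: same facts, different wording
gold = "Health, dental, and vision insurance"
chunk = "We offer health insurance, plus dental and vision coverage."
print(gold in chunk)                # False: the exact substring is absent
print(fuzzy_contains(gold, chunk))  # True: every gold token appears in the chunk
```

Swapping this check into `evaluate_retrieval_accuracy` in place of `gold in chunk` would tolerate rewording and punctuation differences, at the cost of occasional false positives on short snippets.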

---
