# **Building a Code Search System**
We will build a code search system based on the Langchain codebase. Our system will be able to answer queries such as:
* get relevant documents in arxiv retriever
* base class for retriever that does not use vector store
* bm25 retriever test

## **Clone the Langchain Repository**

In [1]:
!git clone https://github.com/langchain-ai/langchain.git

# And install the thirdai package
%pip install thirdai -U

Cloning into 'langchain'...
remote: Enumerating objects: 171026, done.[K
remote: Counting objects: 100% (1/1), done.[K
remote: Total 171026 (delta 0), reused 0 (delta 0), pack-reused 171025[K
Receiving objects: 100% (171026/171026), 231.80 MiB | 37.50 MiB/s, done.
Resolving deltas: 100% (127816/127816), done.
Updating files: 100% (7421/7421), done.
Defaulting to user installation because normal site-packages is not writeable
[33mDEPRECATION: textract 1.6.5 has a non-standard dependency specifier extract-msg<=0.29.*. pip 24.1 will enforce this behaviour change. A possible replacement is to upgrade to a newer version of textract or contact the author to suggest that they release a version with a conforming dependency specifiers. Discussion can be found at https://github.com/pypa/pip/issues/12063[0m[33m
[0m[33mDEPRECATION: distro-info 0.23ubuntu1 has a non-standard version number. pip 24.1 will enforce this behaviour change. A possible replacement is to upgrade to a newer version 

## **Chunking**
To ensure that each chunk is semantically coherent, we will split it along class and function boundaries. In addition, for our use case, it's important to know which file, class, and/or function a snippet is taken from. This kind of information is perfect for utilizing NeuralDB's notion of strong columns.

In [2]:
import ast
from typing import List
import pandas as pd
import glob
import warnings

def apply_to_codebase(path_to_codebase, chunking_strategy):
    """Traverses entire codebase and applies chunking strategy to all files.
    Returns a dataframe with 5 columns: id, chunk, path_to_file, lineno, end_lineno
    `lineno` is the line number (not line index) that the chunk starts on.
    `end_lineno` is the chunk's last line number (again, not line index).
    For example, 
    """
    dfs = []
    warn = False
    for path_to_file in glob.iglob(f"{path_to_codebase}/**/*.*", recursive = True):
        if path_to_file.endswith(".py"):
            try:
                script = open(path_to_file).read()
                ast_body = ast.parse(script).body
                script_lines = script.splitlines(keepends=True)
                df = chunking_strategy(ast_body, script_lines)
                df["path_to_file"] = [path_to_file for _ in range(len(df))]
                dfs.append(df)
            except Exception as e:
                print("Failed to open", path_to_file)
                print("Reason:", e)
                print("Skipping...")
        else:
            warn = True
        
    if warn:
        warnings.warn("Found non-Python files in the codebase. This script only snippets python code.", RuntimeWarning)

    df = pd.concat(dfs)
    df["id"] = range(len(df))
    df.index = range(len(df))
    return df

def split_by_function(ast_body: List[ast.AST], script_lines: List[str]) -> pd.DataFrame:
    """
    ast_body: List of elements in the Python script as returned by `ast.parse(script).body`
    script_lines: List of lines in the Python script

    The script will be split into snippets according to these rules:
    - Each function is a chunk
    - Each method of a class is a chunk
    - Expressions between functions or classes are clubbed together.
    - Comments (not docstrings) are clubbed with the next chunk

    This chunking method produces a dataframe with four columns:
    - snippet: A snippet from the codebase
    - trace: The stack trace of the snippet; the class and/or function from which
      the snippet is taken
    - lineno: The line number where the snippet starts. Note that it is 1-indexed,
      which is consistent with the lineno returned by the AST module.
    - end_lineno: The line number of the last line of the snippet. Like lineno, 
      it is 1-indexed.
    """

    start_linenos, end_linenos, traces = _split_by_function(
        ast_body=ast_body,
        start_lineno=1,
        end_lineno=len(script_lines),
    )
    
    # Only keep non-empty lines.

    for i in range(len(start_linenos)):
        while start_linenos[i] <= end_linenos[i] and not script_lines[start_linenos[i] - 1].strip():
            start_linenos[i] += 1
        while start_linenos[i] <= end_linenos[i] and not script_lines[end_linenos[i] - 1].strip():
            end_linenos[i] -= 1
    
    snippets = []
    final_traces = []
    final_linenos = []
    final_end_linenos = []

    for lineno, end_lineno, snippet_trace in zip(start_linenos, end_linenos, traces):
        if lineno <= end_lineno:
            snippets.append("".join(script_lines[lineno - 1: end_lineno]))
            final_traces.append(snippet_trace)
            final_linenos.append(lineno)
            final_end_linenos.append(end_lineno)
    
    return pd.DataFrame({
        "snippet": snippets,
        "trace": final_traces,
        "lineno": final_linenos,
        "end_lineno": final_end_linenos,
    })

def _split_by_function(ast_body: List[ast.AST], start_lineno: int, end_lineno: int):
    """This helper function allows us to reuse chunking logic within a scope
    such as a class.
    """
    start_linenos = []
    end_linenos = []
    traces = []
    can_connect = False

    # start_lineno is always the previous end_lineno + 1
    # This is because comments are not captured by the AST parser.
    # Thus, to keep comments, we must keep all lines between the previous
    # element and the current element.

    for elem in ast_body:
        if isinstance(elem, ast.FunctionDef):
            # A function block is always its own chunk
            start_linenos.append(start_lineno)
            end_linenos.append(elem.end_lineno)
            # Add function name to trace.
            traces.append(f"function name: {elem.name}")
            can_connect = False
            start_lineno = elem.end_lineno + 1
        elif isinstance(elem, ast.ClassDef):
            # A class is treated as a mini-script;
            # functions/methods inside a class are their own snippets.
            class_start_linenos, class_end_linenos, class_trace = _split_by_function(
                ast_body=elem.body,
                start_lineno=start_lineno,
                end_lineno=elem.end_lineno,
            )
            start_linenos.extend(class_start_linenos)
            end_linenos.extend(class_end_linenos)
            # Prepend class name to the trace of every snippet in the class.
            traces.extend([f"class name: {elem.name}. {trace}" for trace in class_trace])
            can_connect = False
            start_lineno = elem.end_lineno + 1
        else:
            # Group expressions in the global scope that are neither functions
            # nor classes.
            if can_connect:
                end_linenos[-1] = elem.end_lineno
                start_lineno = elem.end_lineno + 1
            else:
                start_linenos.append(start_lineno)
                end_linenos.append(elem.end_lineno)
                # Append an empty string so `traces`` is always the same length
                # as start_linenos and end_linenos
                traces.append("")
                can_connect = True
                start_lineno = elem.end_lineno + 1
    start_linenos.append(start_lineno)
    end_linenos.append(end_lineno)

    return start_linenos, end_linenos, traces

In [3]:
langchain_snippets = apply_to_codebase("./langchain", split_by_function)
langchain_snippets.to_csv("langchain.csv", index=None)

Failed to open ./langchain/libs/langchain/tests/integration_tests/examples/non-utf8-encoding.py
Reason: 'utf-8' codec can't decode byte 0xb1 in position 23: invalid start byte
Skipping...
Failed to open ./langchain/libs/community/tests/examples/non-utf8-encoding.py
Reason: 'utf-8' codec can't decode byte 0xb1 in position 23: invalid start byte
Skipping...
Failed to open ./langchain/libs/community/tests/integration_tests/examples/non-utf8-encoding.py
Reason: 'utf-8' codec can't decode byte 0xb1 in position 23: invalid start byte
Skipping...




## **Build NeuralDB**
As previously mentioned, NeuralDB has a notion of strong and weak columns.

`strong_columns` are columns in your CSV file that contains “strong” signals; words or strings that you want exact matches with, such as keywords, brands, categories, or a stack trace.

`weak_columns` contain “weak” signals; phrases or passages that you want rough or semantic matches with, such as product descriptions, chunks of an essay, or code snippets.


In [15]:
from thirdai import neural_db as ndb, licensing

# TODO: Your ThirdAI key goes here
licensing.activate("YOUR-THIRDAI-KEY")

doc = ndb.CSV(
    "langchain.csv",
    # Path to file and stack trace are strong signals.
    strong_columns=["path_to_file", "trace"],
    # Code snippets contain weak signals
    weak_columns=["snippet"],
    reference_columns=["snippet"],
)

db = ndb.NeuralDB()

db.insert([doc])

['36b2f067e847f4e87999080101509b3b69f8cde0']

## **Let's test it!**

In [5]:
for res in db.search("get relevant documents in arxiv retriever", top_k=2):
    print("file:", res.metadata["path_to_file"], "\n")
    print("trace:", res.metadata["trace"], "\n")
    print("snippet:", res.metadata["snippet"], "\n")
    print("=" * 100)

file: ./langchain/libs/community/langchain_community/retrievers/arxiv.py 

trace: class name: ArxivRetriever.  

snippet: class ArxivRetriever(BaseRetriever, ArxivAPIWrapper):
    """`Arxiv` retriever.

    It wraps load() to get_relevant_documents().
    It uses all ArxivAPIWrapper arguments without any change.
    """

    get_full_documents: bool = False
 

file: ./langchain/templates/retrieval-agent/retrieval_agent/chain.py 

trace: class name: ArxivRetriever. function name: _get_relevant_documents 

snippet:     def _get_relevant_documents(
        self, query: str, *, run_manager: CallbackManagerForRetrieverRun
    ) -> List[Document]:
        try:
            if self.is_arxiv_identifier(query):
                results = self.arxiv_search(
                    id_list=query.split(),
                    max_results=self.top_k_results,
                ).results()
            else:
                results = self.arxiv_search(  # type: ignore
                    query[: self.ARXIV_MAX

In [6]:
for res in db.search("base class for retriever that does not use vector store", top_k=2):
    print("file:", res.metadata["path_to_file"], "\n")
    print("trace:", res.metadata["trace"], "\n")
    print("snippet:", res.metadata["snippet"], "\n")
    print("=" * 100)

file: ./langchain/libs/experimental/langchain_experimental/retrievers/__init__.py 

trace:  

snippet: """**Retriever** class returns Documents given a text **query**.

It is more general than a vector store. A retriever does not need to be able to
store documents, only to return (or retrieve) it.
"""
 

file: ./langchain/libs/community/langchain_community/vectorstores/neo4j_vector.py 

trace: class name: Neo4jVector. function name: from_documents 

snippet:     @classmethod
    def from_existing_index(
        cls: Type[Neo4jVector],
        embedding: Embeddings,
        index_name: str,
        search_type: SearchType = DEFAULT_SEARCH_TYPE,
        keyword_index_name: Optional[str] = None,
        **kwargs: Any,
    ) -> Neo4jVector:
        """
        Get instance of an existing Neo4j vector index. This method will
        return the instance of the store without inserting any new
        embeddings.
        Neo4j credentials are required in the form of `url`, `username`,
        

In [7]:
for res in db.search("bm25 retriever test", top_k=2):
    print("file:", res.metadata["path_to_file"], "\n")
    print("trace:", res.metadata["trace"], "\n")
    print("snippet:", res.metadata["snippet"], "\n")
    print("=" * 100)

file: ./langchain/libs/community/tests/unit_tests/retrievers/test_bm25.py 

trace: function name: test_from_texts_with_bm25_params 

snippet: @pytest.mark.requires("rank_bm25")
def test_from_texts_with_bm25_params() -> None:
    input_texts = ["I have a pen.", "Do you have a pen?", "I have a bag."]
    bm25_retriever = BM25Retriever.from_texts(
        texts=input_texts, bm25_params={"epsilon": 10}
    )
    # should count only multiple words (have, pan)
    assert bm25_retriever.vectorizer.epsilon == 10
 

file: ./langchain/libs/community/tests/unit_tests/retrievers/test_bm25.py 

trace: function name: test_from_texts 

snippet: @pytest.mark.requires("rank_bm25")
def test_from_texts() -> None:
    input_texts = ["I have a pen.", "Do you have a pen?", "I have a bag."]
    bm25_retriever = BM25Retriever.from_texts(texts=input_texts)
    assert len(bm25_retriever.docs) == 3
    assert bm25_retriever.vectorizer.doc_len == [4, 5, 4]
 



In [8]:
db.save("langchain.ndb")

'langchain.ndb'

## **A full copilot system with _Chain of Thought_**

We will build a copilot system that is powerful enough to answer a question like this one:

In [9]:
task = (
    "I want to integrate a retriever called MyRetriever with this open source codebase. "
    "It does not use a vector store. "
    "Create a skeleton for a class that wraps MyRetriever and inherits the right interface. "
    "(Don't implement the methods, just write #TODOs)"
)

### Language model query script

In [10]:
import os
from openai import OpenAI

# TODO: Your OpenAI key goes here
os.environ['OPENAI_API_KEY'] = "YOUR-OPENAI-KEY"
openai_client = OpenAI() # defaults to os.environ['OPENAI_API_KEY']

def query_gpt(query=""):
    messages = [{"role": "user", "content": f"{query}"}]
    response = openai_client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=messages,
        temperature=0,
    )
    return response.choices[0].message.content

### Search and retreive the relevant code snippet(s)

In [20]:
def get_references(query, radius=None, print_metadata=False):
    search_results = db.search(query, top_k=3)
    references = []
    for i, result in enumerate(search_results):
        if (print_metadata):
            print(f"Reference {i + 1} \n{result.text}")
        if radius:
            references.append(f"```{result.context(radius=radius)}```")
        else:
            references.append(f"```{result.text}```")
    return references

def get_context(query, radius=None, print_metadata=False):
    references = get_references(str(query), radius=radius, print_metadata=print_metadata)
    context = "\n\n".join(references[:5])
    return context
    

### Initial thoughts
Action items required to accomplish the above given task

In [21]:
def initial_thoughts(task):
    prompt = (
        "Act as a software engineer who is the expert in an unnamed open source codebase. "
        f"You are asked to do the following:\n\n{task}\n\n"
        "You have access to an oracle that can give you snippets and examples from "
        "this open source codebase, and only from this open source codebase. "
        "What pieces of information would you want to get from the oracle to complete the task? "
        "List them in separate lines."
    )
    # Only return non-empty lines.
    return [query for query in query_gpt(prompt).split("\n") if query]

for query in initial_thoughts(task):
    print(query)

1. The interface that the open source codebase uses for retrievers.
2. Any existing classes that wrap external retrievers in the codebase.
3. Examples of how other retrievers are integrated into the codebase.
4. Any specific requirements or conventions for integrating external retrievers.
5. Any dependencies or configurations needed for integrating external retrievers.


### Perform actions

In [22]:
def refine_thoughts(task, context, previous_answer=""):
    prompt = task
    prompt += (
        f"Act as an experienced software engineer:\n\n"
        f"Answer the query ```{task}``` , given your previous answers : ```{previous_answer}```\n\n"
        f"modify your answer based on this new information (do not construct "
        f"your answer from outside the context provided ): ```{context}```"
    )
    response = query_gpt(prompt)
    return response

def copilot(task, radius=None, verbose=False):
    queries = initial_thoughts(task)
    if verbose:
        print(len(queries), "queries:")
        for query in queries:
            print(query)
        print("\n")

    draft_answer = ""

    for query in queries:
        if verbose:
            print("Query:", query)
            print("Retrieved references:")
        retrieved_info = get_context(query, radius=radius, print_metadata=verbose) # retrieve neural db response for current thought
        # LLM modifies answer based on the previous answer and current ndb results
        draft_answer = refine_thoughts(
            task,
            context=f"Answers to the query '{query}':\n\n{retrieved_info}",
            previous_answer=draft_answer,
        )
        if verbose:
            print("Draft Answer:")
            print(draft_answer)
            print("=" * 100)
    return draft_answer

## Let's get this task done!

In [23]:
answer = copilot(task=task, verbose=True)

5 queries:
1. The interface that the open source codebase uses for retrievers
2. Any existing classes or modules that handle retrievers in the codebase
3. Any specific requirements or constraints for integrating external retrievers
4. Examples of how other retrievers are integrated in the codebase
5. Any relevant documentation or guidelines for extending the codebase with new retrievers


Query: 1. The interface that the open source codebase uses for retrievers
Retrieved references:
Reference 1 
snippet: import glob
import os
import re
import shutil
import sys
from pathlib import Path

if __name__ == "__main__":
    intermediate_dir = Path(sys.argv[1])

    templates_source_dir = Path(os.path.abspath(__file__)).parents[2] / "templates"
    templates_intermediate_dir = intermediate_dir / "templates"

    readmes = list(glob.glob(str(templates_source_dir) + "/*/README.md"))
    destinations = [
        readme[len(str(templates_source_dir)) + 1 : -10] + ".md" for readme in readmes
    ]
 