# Exploratory Data Analysis

This notebook walks through extracting content from the GitHub repository of a Python library and scraping the documentation of Python libraries, then loading the results into a vector store (Chroma for the repository code, Pinecone for the documentation) to answer context-specific questions.

## LangGraph GitHub Repo

This section extracts content from the LangGraph GitHub repository: functions from .py files, text from .md files, and cell data from .ipynb files. In addition, we scrape the LangGraph docs and build a proof of concept (POC) by loading the embeddings into Pinecone.

Set up the OpenAI API key:

In [2]:
import os
from getpass import getpass

# Prompt the user to enter the OpenAI API key securely
api_key = getpass("Enter your OpenAI API key: ")

# Set the environment variable
os.environ["OPENAI_API_KEY"] = api_key

Enter your OpenAI API key:  ········
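
Re-running the cell above prompts for the key every time. A minimal sketch of a helper that prompts only when the variable is not already set is shown here; the name `ensure_env` and its `prompt` parameter are hypothetical and not part of the original notebook.

```python
import os
from getpass import getpass


def ensure_env(name, prompt=getpass):
    """Set os.environ[name] from a hidden prompt unless it is already set."""
    if not os.environ.get(name):
        # getpass hides the typed value; `prompt` is injectable for testing
        os.environ[name] = prompt(f"Enter your {name}: ")
    return os.environ[name]
```

Calling `ensure_env("OPENAI_API_KEY")` would then behave like the cell above on the first run and do nothing on later runs.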


### Extracting all functions from Python files

In [12]:
import pandas as pd
from pathlib import Path

DEF_PREFIXES = ['def ', 'async def ']
NEWLINE = '\n'

def get_function_name(code):
    """
    Extract function name from a line beginning with 'def' or 'async def'.
    """
    for prefix in DEF_PREFIXES:
        if code.startswith(prefix):
            return code[len(prefix): code.index('(')]


def get_until_no_space(all_lines, i):
    """
    Get all lines until a line outside the function definition is found.
    """
    ret = [all_lines[i]]
    for j in range(i + 1, len(all_lines)):
        if len(all_lines[j]) == 0 or all_lines[j][0] in [' ', '\t', ')']:
            ret.append(all_lines[j])
        else:
            break
    return NEWLINE.join(ret)


def get_functions(filepath):
    """
    Get all functions in a Python file.
    """
    with open(filepath, 'r', encoding='utf-8') as file:
        all_lines = file.read().replace('\r', NEWLINE).split(NEWLINE)
        for i, l in enumerate(all_lines):
            for prefix in DEF_PREFIXES:
                if l.startswith(prefix):
                    code = get_until_no_space(all_lines, i)
                    function_name = get_function_name(code)
                    yield {
                        'code': code,
                        'function_name': function_name,
                        'filepath': filepath,
                    }
                    break


def extract_functions_from_repo(code_root):
    """
    Extract all .py functions from the repository.
    """
    code_files = list(code_root.glob('**/*.py'))

    num_files = len(code_files)
    print(f'Total number of .py files: {num_files}')

    if num_files == 0:
        print('Verify langgraph-python repo exists and code_root is set correctly.')
        return None

    all_funcs = [
        func
        for code_file in code_files
        for func in get_functions(str(code_file))
    ]

    num_funcs = len(all_funcs)
    print(f'Total number of functions extracted: {num_funcs}')

    return all_funcs
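
The line-based scan above only matches `def` and `async def` at column 0, so class methods and nested functions are skipped. A minimal alternative sketch, assuming every file parses as valid Python, uses the standard-library `ast` module instead. The helper name `extract_functions_with_ast` and the sample source are illustrative assumptions, not part of the notebook.

```python
import ast


def extract_functions_with_ast(source):
    """Yield name, start line, and source text for every function in `source`."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        # FunctionDef covers `def`, AsyncFunctionDef covers `async def`
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            yield {
                "function_name": node.name,
                "lineno": node.lineno,
                "code": ast.get_source_segment(source, node),
            }


sample = '''
class Greeter:
    def hello(self):
        return "hi"

async def fetch():
    return 1
'''

functions = list(extract_functions_with_ast(sample))
names = sorted(f["function_name"] for f in functions)
print(names)
```

Unlike the prefix scan, this also picks up the `hello` method, at the cost of failing on files that contain syntax errors.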

For testing purposes, we clone the repo locally and process the files from disk.

In [16]:
# Define the root directory for the cloned repository
root_dir = Path("/Users/pragneshanekal/Documents/Local/Northeastern-Fall-2024/Big-Data")

# Adjust the path to the specific directory within the cloned repository
code_root = root_dir / 'langgraph'

# Extract all functions from the repository
all_funcs = extract_functions_from_repo(code_root)

Total number of .py files: 189
Total number of functions extracted: 620


In [18]:
df = pd.DataFrame(all_funcs)
data = df.to_dict('records')
df.head()

| | code | function_name | filepath |
|---|---|---|---|
| 0 | `def _make_regular_expression(pkg_prefix: str) ...` | `_make_regular_expression` | `/Users/pragneshanekal/Documents/Local/Northeas...` |
| 1 | `def _get_full_module_name(module_path, class_n...` | `_get_full_module_name` | `/Users/pragneshanekal/Documents/Local/Northeas...` |
| 2 | `def _get_doc_title(data: str, file_name: str) ...` | `_get_doc_title` | `/Users/pragneshanekal/Documents/Local/Northeas...` |
| 3 | `def _get_imports(\n code: str, doc_title: s...` | `_get_imports` | `/Users/pragneshanekal/Documents/Local/Northeas...` |
| 4 | `def comment_install_cells(notebook: nbformat.N...` | `comment_install_cells` | `/Users/pragneshanekal/Documents/Local/Northeas...` |


In [20]:
from langchain.schema import Document

documents = [
    Document(
        page_content=item['code'],
        metadata={
            'function_name': item['function_name'],
            'file_path': item['filepath']
        }
    ) for item in data
]

In [22]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

### Create embeddings and store them in Chroma

In [24]:
from langchain_community.vectorstores import Chroma

vectorstore = Chroma.from_documents(documents, embeddings)
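
To make the retrieval step concrete, here is a toy sketch of what a similarity search with `k` results does conceptually: embed the query, score every stored document against it, and return the top `k`. It assumes a bag-of-words "embedding" and cosine similarity purely for illustration; this is not how OpenAI embeddings or Chroma work internally, and all names below are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lowercase token counts (a stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, docs, k=2):
    """Return the k documents most similar to the query."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]


docs = [
    "def add_node(graph, node)",
    "class State TypedDict",
    "print hello world",
]
best = top_k("State TypedDict", docs, k=1)
print(best)
```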

In [25]:
from langchain_openai import ChatOpenAI
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

llm = ChatOpenAI(temperature=0)

system_prompt = """
You are an AI assistant specialized in generating code based on requirements. 
Your task is to analyze the given Python code snippets and generate new Python code that meets the specified requirements.
The response MUST only include Python code and a one line description of the Python code.
"""

compressor = LLMChainExtractor.from_llm(llm)
retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 5})
)

In [30]:
from langchain.chains import RetrievalQA

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    chain_type="stuff",
    return_source_documents=True
)

question = "How do I define a new State in Langgraph?"
result = qa_chain.invoke({"query": question})

# Display the generated code
print("Generated Code:")
print(result['result'])

Generated Code:
To define a new State in Langgraph, you can create a new class that inherits from TypedDict. Inside this class, you can define the fields that represent the state variables you want to use. Here is an example of how you can define a new State in Langgraph:

```python
from typing import TypedDict

class MyState(TypedDict):
    foo: int
    bar: str
    # Add more fields as needed
```

In this example, `MyState` is a new State class that has two fields: `foo` of type `int` and `bar` of type `str`. You can add more fields as needed for your specific use case.


In [32]:
# Display the top 5 similar code snippets
print("\nTop 5 Similar Code Snippets:")
for doc in result['source_documents']:
    print(f"Function: {doc.metadata['function_name']}")
    print(f"File: {doc.metadata['file_path']}")
    print(f"Code:\n{doc.page_content}\n")


Top 5 Similar Code Snippets:
Function: rt_graph
File: /Users/pragneshanekal/Documents/Local/Northeastern-Fall-2024/Big-Data/langgraph/libs/langgraph/tests/test_utils.py
Code:
def rt_graph() -> CompiledGraph:
    class State(TypedDict):
        foo: int
        node_run_id: int

    graph = StateGraph(State)
    graph.add_node(node)
    graph.set_entry_point("node")
    graph.add_edge("node", END)
    return graph.compile()

Function: test_debug_subgraphs
File: /Users/pragneshanekal/Documents/Local/Northeastern-Fall-2024/Big-Data/langgraph/libs/langgraph/tests/test_pregel.py
Code:
State(TypedDict): messages: Annotated[list[str], operator.add]



### Extract .md files from GitHub repo

In [34]:
import re
import pandas as pd
from pathlib import Path

# ATX heading: 1-6 '#' characters, then whitespace, then the heading text.
# Requiring the whitespace avoids treating lines such as '#!/bin/bash' as headings.
HEADING_PATTERN = re.compile(r'^(#{1,6})\s+(.*)$')
NEWLINE = '\n'

def get_markdown_elements(filepath):
    """
    Extract structured elements (e.g., headings, paragraphs) from a Markdown (.md) file.
    """
    with open(filepath, 'r', encoding='utf-8') as file:
        all_lines = file.read().split(NEWLINE)
        for line in all_lines:
            line = line.strip()
            if line:  # Skip empty lines
                heading_match = HEADING_PATTERN.match(line)
                if heading_match:
                    # Extract heading level and text
                    heading_level = len(heading_match.group(1))
                    heading_text = heading_match.group(2).strip()
                    yield {
                        'type': 'heading',
                        'level': heading_level,
                        'content': heading_text,
                        'filepath': filepath,
                    }
                else:
                    # Treat non-heading lines as paragraph content
                    yield {
                        'type': 'paragraph',
                        'content': line,
                        'filepath': filepath,
                    }

def extract_markdown_from_repo(md_root):
    """
    Extract all elements from .md files in the repository.
    """
    md_files = list(md_root.glob('**/*.md'))

    num_files = len(md_files)
    print(f'Total number of .md files: {num_files}')

    if num_files == 0:
        print('Verify the repository exists and md_root is set correctly.')
        return None

    all_elements = [
        element
        for md_file in md_files
        for element in get_markdown_elements(md_file)
    ]

    num_elements = len(all_elements)
    print(f'Total number of elements extracted: {num_elements}')

    return all_elements
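
The extractor above emits one element per line, which tends to be a small unit for retrieval. One possible follow-up, sketched here under the assumption that the elements keep the `type`, `level`, and `content` keys used above, is to group paragraphs under their nearest preceding heading. The function name `group_by_heading` and the sample data are hypothetical.

```python
def group_by_heading(elements):
    """Group paragraph elements under the most recent heading element."""
    sections = []
    current = {"heading": None, "level": None, "paragraphs": []}
    for el in elements:
        if el["type"] == "heading":
            # Close the previous section if it holds anything
            if current["heading"] is not None or current["paragraphs"]:
                sections.append(current)
            current = {
                "heading": el["content"],
                "level": el["level"],
                "paragraphs": [],
            }
        else:
            current["paragraphs"].append(el["content"])
    if current["heading"] is not None or current["paragraphs"]:
        sections.append(current)
    return sections


sample_elements = [
    {"type": "paragraph", "content": "intro"},
    {"type": "heading", "level": 1, "content": "Title"},
    {"type": "paragraph", "content": "a"},
    {"type": "paragraph", "content": "b"},
    {"type": "heading", "level": 2, "content": "Sub"},
]
sections = group_by_heading(sample_elements)
print(len(sections))
```

Text that appears before the first heading ends up in a section whose `heading` is `None`.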


In [36]:
# Extract all Markdown elements from the repository
all_funcs = extract_markdown_from_repo(code_root)

Total number of .md files: 101
Total number of elements extracted: 10848


In [38]:
df = pd.DataFrame(all_funcs)
data = df.to_dict('records')
df.head()

| | type | level | content | filepath |
|---|---|---|---|---|
| 0 | heading | 1.0 | 🦜🕸️LangGraph | `/Users/pragneshanekal/Documents/Local/Northeas...` |
| 1 | paragraph | | `![Version](https://img.shields.io/pypi/v/langg...` | `/Users/pragneshanekal/Documents/Local/Northeas...` |
| 2 | paragraph | | `[![Downloads](https://static.pepy.tech/badge/l...` | `/Users/pragneshanekal/Documents/Local/Northeas...` |
| 3 | paragraph | | `[![Open Issues](https://img.shields.io/github/...` | `/Users/pragneshanekal/Documents/Local/Northeas...` |
| 4 | paragraph | | `[![Docs](https://img.shields.io/badge/docs-lat...` | `/Users/pragneshanekal/Documents/Local/Northeas...` |


### Extract information from .ipynb notebooks

In [47]:
import json
from bs4 import BeautifulSoup

def convert_html_to_markdown(html_content):
    """
    Converts HTML content in markdown cells to plain markdown and extracts links.
    """
    soup = BeautifulSoup(html_content, "html.parser")
    plain_text = soup.get_text()  # Extract text without HTML tags

    # Extract hyperlinks and format them
    links = []
    for a_tag in soup.find_all('a', href=True):
        links.append(f"[{a_tag.text}]({a_tag['href']})")

    # Combine text and hyperlinks into a markdown-like format
    markdown_content = plain_text.strip() + "\n" + "\n".join(links)
    return markdown_content

def extract_and_format_with_full_markdown_context(file_path):
    """
    Process a single .ipynb file, extracting code cells with surrounding markdown context.
    """
    with open(file_path, 'r', encoding='utf-8') as f:
        notebook_data = json.load(f)

    processed_cells = []  
    markdown_buffer = []  

    # Iterate through cells and process them
    for index, cell in enumerate(notebook_data.get('cells', [])):
        cell_type = cell.get('cell_type')
        source_content = ''.join(cell.get('source', []))

        if cell_type == 'markdown':
            markdown_text = convert_html_to_markdown(source_content)
            markdown_buffer.append(markdown_text)  # Add to the markdown buffer
        elif cell_type == 'code':
            # Determine markdown above and below the code cell
            markdown_above = '\n'.join(markdown_buffer) if markdown_buffer else None

            # Look ahead to find markdown below the code cell
            markdown_below = None
            for next_index in range(index + 1, len(notebook_data['cells'])):
                next_cell = notebook_data['cells'][next_index]
                if next_cell['cell_type'] == 'markdown':
                    markdown_below = convert_html_to_markdown(''.join(next_cell.get('source', [])))
                    break
                elif next_cell['cell_type'] == 'code':
                    break

            # Add the processed cell to the list
            processed_cells.append({
                "file_path": file_path,
                "cell_number": index + 1,  # Cell number (1-indexed)
                "code": source_content,
                "markdown_above": markdown_above,
                "markdown_below": markdown_below
            })

            # Clear the markdown buffer after processing a code cell
            markdown_buffer = []

    return processed_cells

def extract_notebooks_from_repo(repo_path):
    """
    Process all .ipynb files in a GitHub repository.
    """
    repo_path = Path(repo_path)
    notebook_files = list(repo_path.glob('**/*.ipynb'))

    print(f"Found {len(notebook_files)} Jupyter notebooks in the repository.")

    all_cells = []
    for notebook_file in notebook_files:
        processed_cells = extract_and_format_with_full_markdown_context(notebook_file)
        all_cells.extend(processed_cells)

    print(f"Extracted {len(all_cells)} cells from notebooks.")

    # Convert the list of dictionaries into a DataFrame
    df = pd.DataFrame(all_cells)
    return df
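
Each record produced above keeps a code cell together with its surrounding markdown. If these records are later embedded, one option is to flatten each into a single text block first. The sketch below assumes plain dictionaries with the `markdown_above`, `code`, and `markdown_below` keys used above; the helper `cell_to_text` and the `Code:` label are hypothetical choices, not something the notebook defines.

```python
def cell_to_text(cell):
    """Join a code cell and its markdown context into one text block."""
    parts = []
    if cell.get("markdown_above"):
        parts.append(cell["markdown_above"].strip())
    # Label the code so it stays distinguishable from the prose around it
    parts.append("Code:\n" + cell["code"].strip())
    if cell.get("markdown_below"):
        parts.append(cell["markdown_below"].strip())
    return "\n\n".join(parts)


example_cell = {
    "markdown_above": "Next, set your API keys:\n",
    "code": "import os\n",
    "markdown_below": None,
}
print(cell_to_text(example_cell))
```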

In [49]:
# Extract all code cells (with their markdown context) from the notebooks in the repository
all_funcs = extract_notebooks_from_repo(code_root)

Found 165 Jupyter notebooks in the repository.
Extracted 1191 cells from notebooks.




In [51]:
all_funcs.head()

| | file_path | cell_number | code | markdown_above | markdown_below |
|---|---|---|---|---|---|
| 0 | `/Users/pragneshanekal/Documents/Local/Northeas...` | 2 | `%%capture --no-stderr\n%pip install -U langgra...` | `# LangGraph Quick Start\n\nIn this comprehensi...` | `Next, set your API keys:\n` |
| 1 | `/Users/pragneshanekal/Documents/Local/Northeas...` | 4 | `import getpass\nimport os\n\n\ndef _set_env(va...` | `Next, set your API keys:\n` | `Set up LangSmith for LangGraph development\n\n...` |
| 2 | `/Users/pragneshanekal/Documents/Local/Northeas...` | 7 | `from typing import Annotated\n\nfrom typing_ex...` | `Set up LangSmith for LangGraph development\n\n...` | `Note\n\n The first thing you do when you de...` |
| 3 | `/Users/pragneshanekal/Documents/Local/Northeas...` | 10 | `from langchain_anthropic import ChatAnthropic\...` | `Note\n\n The first thing you do when you de...` | `` **Notice** how the `chatbot` node function tak... `` |
| 4 | `/Users/pragneshanekal/Documents/Local/Northeas...` | 12 | `graph_builder.add_edge(START, "chatbot")` | `` **Notice** how the `chatbot` node function tak... `` | `` Similarly, set a `finish` point. This instruct... `` |


## Airflow Documentation

This section scrapes the Airflow documentation, stores it in a Pinecone vector store, and retrieves relevant context for further querying and response generation.


In [59]:
from bs4 import BeautifulSoup as Soup
from langchain_community.document_loaders.recursive_url_loader import RecursiveUrlLoader

### Scraping the documentation

In [61]:
loader = RecursiveUrlLoader(
    "https://airflow.apache.org/docs/apache-airflow/stable/index.html",
    max_depth=5,
    prevent_outside=True,
    extractor=lambda x: Soup(x, "html.parser").text,
    base_url="https://airflow.apache.org/docs/apache-airflow/stable/"
)
docs = loader.load()

# Sort the list based on the URLs and get the text
d_sorted = sorted(docs, key=lambda x: x.metadata["source"])
d_reversed = list(reversed(d_sorted))
concatenated_content = "\n\n\n --- \n\n\n".join(
    [doc.page_content for doc in d_reversed]
)
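
Concatenating every page into one string discards each page's `metadata["source"]`, which is why the documents retrieved later in this notebook show `metadata={}`. A minimal sketch of per-page chunking that carries the URL along is given below. It uses plain dictionaries as a stand-in for LangChain `Document` objects, and the helper name `chunk_pages`, the fixed-size split, and the sample pages are all assumptions for illustration.

```python
def chunk_pages(pages, chunk_size=1000):
    """Split each page separately so every chunk keeps its source URL."""
    chunks = []
    for page in pages:
        text = page["page_content"]
        for start in range(0, len(text), chunk_size):
            chunks.append({
                "page_content": text[start:start + chunk_size],
                "metadata": {"source": page["source"], "start": start},
            })
    return chunks


sample_pages = [
    {"source": "ui.html", "page_content": "x" * 25},
    {"source": "index.html", "page_content": "y" * 5},
]
page_chunks = chunk_pages(sample_pages, chunk_size=10)
print([c["metadata"]["source"] for c in page_chunks])
```

With the source kept on every chunk, an answer can cite the page it came from.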

In [71]:
docs[30].metadata

{'source': 'https://airflow.apache.org/docs/apache-airflow/stable/ui.html',
 'content_type': 'text/html',
 'title': 'UI / Screenshots — Airflow Documentation',
 'language': 'en'}

In [81]:
from pinecone import Pinecone, ServerlessSpec
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore

In [85]:
import os
from getpass import getpass

# Prompt the user to enter the Pinecone API key securely
api_key = getpass("Enter your Pinecone API key: ")

# Set the environment variable
os.environ["PINECONE_API_KEY"] = api_key

Enter your Pinecone API key:  ········


In [93]:
OPENAI_API_KEY=os.getenv("OPENAI_API_KEY")
PINECONE_API_KEY=os.getenv("PINECONE_API_KEY")

### Creating embeddings and storing them in Pinecone

In [95]:
embeddings = OpenAIEmbeddings(model="text-embedding-3-large", api_key=OPENAI_API_KEY)
pinecone_api_key = PINECONE_API_KEY
pc = Pinecone(api_key=pinecone_api_key)

In [97]:
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(
    chunk_size=10000,
    chunk_overlap=200,
)
text = text_splitter.create_documents([concatenated_content])

Created a chunk of size 43021, which is longer than the specified 10000
Created a chunk of size 25000, which is longer than the specified 10000
Created a chunk of size 11084, which is longer than the specified 10000


In [99]:
text1 = [chunk for chunk in text if len(chunk.page_content) <= 10000]
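
The filter above drops the three oversized chunks reported in the warnings, so their content never reaches the index. One possible alternative, sketched here with the standard library only, is to hard-split any oversized text into overlapping windows instead of discarding it. The function `split_with_overlap` is hypothetical and is not the behaviour of `CharacterTextSplitter`.

```python
def split_with_overlap(text, chunk_size=10000, chunk_overlap=200):
    """Split text into windows of at most chunk_size that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop once the remaining tail is already covered by the previous window
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


demo_text = "abcdefghijklmnopqrstuvwxy"  # 25 characters
demo_chunks = split_with_overlap(demo_text, chunk_size=10, chunk_overlap=2)
print(demo_chunks)
```

Splitting at fixed character offsets can cut a sentence in half, so this trades cleaner boundaries for complete coverage.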

In [103]:
pc.create_index(
    name='airflow-docs-2',
    dimension=3072,
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)
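
The index dimension has to match the size of the vectors being stored: 3072 is the default output size of `text-embedding-3-large`, while `text-embedding-3-small` (used earlier with Chroma) defaults to 1536. A small lookup like the following sketch can keep the two in sync; the dictionary and the helper `index_dimension_for` are hypothetical and only cover the default sizes of the models named in this notebook.

```python
# Default output sizes of the OpenAI embedding models used in this notebook
EMBEDDING_DIMENSIONS = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
}


def index_dimension_for(model):
    """Return the default vector size for a known embedding model."""
    try:
        return EMBEDDING_DIMENSIONS[model]
    except KeyError:
        raise ValueError(f"Unknown embedding model: {model}") from None


print(index_dimension_for("text-embedding-3-large"))
```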

In [105]:
index_name = 'airflow-docs-2'

In [107]:
index = pc.Index(index_name)

In [109]:
vector_store = PineconeVectorStore(index=index, embedding=embeddings)

In [111]:
vector_store.add_documents(text1)

['3832f592-0f2f-4af9-95fb-8de0517c078a',
 'f8dd91d4-3802-47c8-b6c3-27c7ad173cb7',
 'a0bf628d-f80e-453c-8956-ad13e9ce23b4',
 'ace800f6-9be5-4bbe-a61e-b997a5a7bd0d',
 '4050627a-7fda-4513-97ca-e6380f8560ea',
 'f19d8c38-f295-4c16-be50-59b84d6197e2',
 'a090e9f4-19b0-4ab4-8cb9-48514af6873e',
 'b2d0f997-2876-4753-826b-e8714dea5ecc',
 '2752bfff-2da7-45f3-add2-b639d9099680',
 'c719d4ac-7ee4-4011-ad61-c3a921fbb7b3',
 'cf4e8d4d-d1c9-42a7-b2f9-5527d0f2bb32',
 'eafa962f-0507-42b2-a5ac-61016303ab95',
 'e6dc935a-d9fd-44b4-a34e-ce6f2028a956',
 '18d8ca63-9211-432b-bf0d-ee4cf79f2eab',
 'e54fcf65-34db-4bae-8d21-217d6eea70c7',
 '744ca851-7401-4db7-ab7d-0a29a5ddb7a1',
 'c29dcb38-b7eb-41c6-a491-fb9fd0da55c3',
 '385bdc75-f5bf-4234-bdc8-6041eea82d0a',
 'd7916a5d-264f-4972-972c-cc11ea723c12',
 '0a3d54c5-ffd9-4988-8153-48e3d99d905d',
 'd7991e33-f5d9-42a7-b3f9-8f9efeef69b2',
 '67c9af2c-e260-4160-bc2b-520be0fd22d4',
 'ba6160b1-259c-493a-82d5-0b24ecc0b9b3',
 '924c2849-2ca1-4781-90a5-238c8e338e4a',
 '08f9ed36-d4e5-

In [117]:
retriever = vector_store.as_retriever(search_type="similarity", search_kwargs={"k": 3})
context = retriever.invoke("What is the latest Version in Airflow?")

In [119]:
context

[Document(id='ad63a3da-aaa3-4e30-9ae6-7e347702e80d', metadata={}, page_content='Generating Cluster Config\nairflow.providers.google.cloud.operators.bigquery.BigQueryGetDatasetTablesOperator\n\n\nChanges in amazon provider package\nMigration of AWS components\nairflow.providers.amazon.aws.hooks.emr.EmrHook\nairflow.providers.amazon.aws.operators.emr_add_steps.EmrAddStepsOperator\nairflow.providers.amazon.aws.operators.emr_create_job_flow.EmrCreateJobFlowOperator\nairflow.providers.amazon.aws.operators.emr_terminate_job_flow.EmrTerminateJobFlowOperator\nairflow.providers.amazon.aws.operators.batch.AwsBatchOperator\nairflow.providers.amazon.aws.sensors.athena.AthenaSensor\nairflow.providers.amazon.aws.hooks.s3.S3Hook\n\n\nChanges in other provider packages\nChanged return type of list_prefixes and list_keys methods in S3Hook\nRemoved HipChat integration\nairflow.providers.salesforce.hooks.salesforce.SalesforceHook\nairflow.providers.apache.pinot.hooks.pinot.PinotAdminHook.create_segment\n

### Querying the Pinecone index

In [121]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from pydantic import BaseModel, Field

### OpenAI

# Code generation prompt
code_gen_prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            """You are a coding assistant with expertise in Apache Airflow. \n 
    Here is a full set of Airflow documentation:  \n ------- \n  {context} \n ------- \n Answer the user 
    question based only on the above provided documentation. Ensure any code you provide can be executed \n 
    with all required imports and variables defined. Structure your answer with a description of the code solution. \n
    Then list the imports. And finally list the functioning code block. If you are unable to answer from the context, reply "I don't know". Here is the user question:""",
        ),
        ("placeholder", "{messages}"),
    ]
)


# Data model
class code(BaseModel):
    """Schema for code solutions to questions about Airflow."""

    prefix: str = Field(description="Description of the problem and approach")
    imports: str = Field(description="Code block import statements")
    code: str = Field(description="Code block not including import statements")


expt_llm = "gpt-4o"
llm = ChatOpenAI(temperature=0, model=expt_llm, api_key=OPENAI_API_KEY)
code_gen_chain_oai = code_gen_prompt | llm.with_structured_output(code)
question = "What is the latest Version in Airflow?"
solution = code_gen_chain_oai.invoke(
    {"context": context, "messages": [("user", question)]}
)
solution

code(prefix='The latest version of Airflow mentioned in the provided documentation is 2.10.3, released on 2024-11-04.', imports='', code="latest_version = '2.10.3'")
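
Because the structured output separates `imports` from `code`, the two fields can be checked before anything is executed. The sketch below is one possible check and not part of the original notebook: it uses the built-in `compile` to verify only that the combined source is syntactically valid, without running it. The helper name `check_generated_code` is hypothetical.

```python
def check_generated_code(imports, code):
    """Return None if imports + code compile, otherwise a short error message."""
    source = f"{imports}\n{code}"
    try:
        # compile() parses the source but does not execute it
        compile(source, "<generated>", "exec")
    except SyntaxError as exc:
        return f"{exc.msg} (line {exc.lineno})"
    return None


print(check_generated_code("", "latest_version = '2.10.3'"))
```

For the solution above, this would be called as `check_generated_code(solution.imports, solution.code)`.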