## Step 0: Configuring the environment

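In this step, we install the libraries the project requires: LangChain, its integrations with Hugging Face models and OpenAI, and ChromaDB for storing embeddings. The cell below installs only GitPython, which the alternative clone-based download in Step 1 uses; the other packages are assumed to be available in the environment already.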


In [1]:
%pip install --upgrade --quiet  GitPython


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.1.2[0m[39;49m -> [0m[32;49m24.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


## Step 1: Downloading Notebooks and Extracting Code from Jupyter Notebooks

In this step, we download Jupyter Notebook files (*.ipynb*) from a GitHub repository directly using the GitHub API, and extract both code and context from these notebooks. The code performs the following operations:

- **Downloading Notebooks from a GitHub Repository**:  
  We begin by downloading only the Jupyter Notebooks from the desired GitHub repository using the function `download_and_extract_notebooks`.

  - **Function** `download_and_extract_notebooks(repo_owner, repo_name, save_dir='./notebooks')`:  
    - **Objective**: This function uses the GitHub API to download all `.ipynb` files from a given repository and saves them in a specified directory.
    - **Process**:
        - First, it checks if the target directory already exists. If it does, the directory is deleted to ensure a fresh copy of the notebooks.
        - The function calls the GitHub API to list the contents of the repository and recursively navigates through subdirectories, downloading only the files that end with `.ipynb`.
        - For each notebook downloaded, the function extracts the code and context (explained below).
    - **Input**:
        - **repo_owner**: The owner of the GitHub repository (e.g., `passarel`).
        - **repo_name**: The name of the repository (e.g., `crawler_data_source`).
        - **save_dir**: The directory where the notebooks will be saved (default is `./notebooks`).

- **Extracting Code and Context From Notebooks**:  
  After downloading the notebooks, the next step is to extract both the code and any markdown context from each notebook.

  - **Function** `extract_code_and_context(notebook_path)`:  
    - **Objective**: This function reads a notebook and extracts the code cells and any corresponding markdown context.

    - **Process**:
        - The notebook is opened using the `nbformat.read` function.
        - The function iterates through each cell of the notebook:
            - If the cell is of type markdown, it extracts the content of the markdown cell as context.
            - If the cell is of type code, it creates a dictionary with the following fields:
                - **ID**: A unique identifier for the code snippet, generated using `uuid.uuid4()`.
                - **Embedding**: Initially set to `None` (embeddings will be generated later).
                - **Code**: The code content of the cell.
                - **Filename**: The name of the notebook file.
                - **Context**: The markdown context associated with the code (if any).
        - The extracted code and context are appended to a list, which the function returns.
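
The pairing of markdown context with code cells can be sketched without `nbformat`, since an `.ipynb` file is plain JSON. This is a minimal illustration rather than the notebook's implementation: `extract_pairs` and `sample_notebook` are hypothetical names, and the sketch assumes the nbformat v4 layout in which `cells` is a top-level list.

```python
import json
import uuid


def extract_pairs(notebook_json: str, filename: str) -> list:
    """Pair each code cell with the most recent markdown cell before it."""
    notebook = json.loads(notebook_json)
    context = ""  # no markdown cell seen yet
    records = []
    for cell in notebook.get("cells", []):
        source = cell.get("source", "")
        # On disk, `source` may be a single string or a list of lines.
        text = source if isinstance(source, str) else "".join(source)
        if cell.get("cell_type") == "markdown":
            context = text
        elif cell.get("cell_type") == "code":
            records.append({
                "id": str(uuid.uuid4()),
                "embedding": None,  # generated in a later step
                "code": text,
                "filename": filename,
                "context": context,
            })
    return records


# Hypothetical two-cell notebook, used only for illustration.
sample_notebook = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Load data"]},
        {"cell_type": "code", "metadata": {}, "outputs": [], "execution_count": None,
         "source": ["import pandas as pd\n", "df = pd.DataFrame()"]},
    ],
})

records = extract_pairs(sample_notebook, "example.ipynb")
print(records[0]["context"], "->", records[0]["code"].splitlines()[0])
```

Because `context` is only overwritten when a new markdown cell appears, several consecutive code cells share the same context.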

### Key Changes from the Original Approach:
1. **No Cloning of the Entire Repository**:  
   Instead of cloning the entire repository, we now directly interact with the GitHub API to download only the relevant `.ipynb` files, saving time and space.
2. **Recursion into Subdirectories**:  
   The code automatically handles subdirectories within the repository, ensuring that all notebooks, regardless of their location, are processed.
3. **Cleaner Data Handling**:  
   Each notebook is processed immediately after being downloaded, simplifying the workflow and ensuring that extracted data is directly available for further use.
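
The recursion into subdirectories can be sketched offline. In the example below, `FAKE_REPO` and `find_notebooks` are hypothetical: the dictionary stands in for the responses of the GitHub contents API and keeps only the `type`, `name`, and `path` fields that the traversal reads, so no network access is needed.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for the GitHub contents API: maps a directory path
# to the entries it contains.
FAKE_REPO: Dict[str, List[dict]] = {
    "": [
        {"type": "file", "name": "README.md", "path": "README.md"},
        {"type": "dir", "name": "nlp", "path": "nlp"},
        {"type": "file", "name": "intro.ipynb", "path": "intro.ipynb"},
    ],
    "nlp": [
        {"type": "file", "name": "train.ipynb", "path": "nlp/train.ipynb"},
        {"type": "dir", "name": "deploy", "path": "nlp/deploy"},
    ],
    "nlp/deploy": [
        {"type": "file", "name": "serve.ipynb", "path": "nlp/deploy/serve.ipynb"},
    ],
}


def find_notebooks(list_dir: Callable[[str], List[dict]], dir_path: str = "") -> List[str]:
    """Return the repository paths of all .ipynb files under dir_path."""
    found = []
    for item in list_dir(dir_path):
        if item["type"] == "file" and item["name"].endswith(".ipynb"):
            found.append(item["path"])
        elif item["type"] == "dir":
            # Recurse using the subdirectory's own path.
            found.extend(find_notebooks(list_dir, item["path"]))
    return found


notebook_paths = find_notebooks(lambda path: FAKE_REPO[path])
print(notebook_paths)
```

The sketch collects full repository paths. The notebook code saves each file under its base name only, so two notebooks with the same name in different directories would overwrite each other in `save_dir`.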


In [8]:
import os
import requests
import shutil
import nbformat
import uuid

# Function to download a specific file from GitHub
def download_file(file_url, save_path):
    response = requests.get(file_url)
    if response.status_code == 200:
        with open(save_path, 'wb') as file:
            file.write(response.content)
        print(f"Downloaded: {save_path}")
    else:
        print(f"Failed to download {file_url}, Status Code: {response.status_code}")

# Function to get all .ipynb files from a GitHub repo and extract data
def download_and_extract_notebooks(repo_owner, repo_name, save_dir='./notebooks'):
    # Check if the directory exists and remove it if it does
    if os.path.exists(save_dir):
        shutil.rmtree(save_dir)  # Removes the entire directory
        print(f"Existing directory {save_dir} removed.")
    
    # Create the directory to save the notebooks
    os.makedirs(save_dir)

    # GitHub API URL to list repo contents
    api_url = f"https://api.github.com/repos/{repo_owner}/{repo_name}/contents"

    response = requests.get(api_url)
    
    if response.status_code != 200:
        print(f"Error fetching repository contents: {response.status_code}")
        return []  # Return an empty list so callers can still iterate over the result

    # Parse the JSON response
    repo_contents = response.json()

    # Extracted data list
    all_extracted_data = []

    # Filter for .ipynb files and download them
    for item in repo_contents:
        if item['type'] == 'file' and item['name'].endswith('.ipynb'):
            notebook_path = os.path.join(save_dir, item['name'])
            download_file(item['download_url'], notebook_path)
            # Extract code and context
            extracted_data = extract_code_and_context(notebook_path)
            all_extracted_data.extend(extracted_data)
        elif item['type'] == 'dir':  # If it's a directory, fetch the contents of the directory
            download_notebooks_from_repo_dir(repo_owner, repo_name, item['path'], save_dir, all_extracted_data)

    return all_extracted_data

# Recursive function to list contents from a specific directory in a GitHub repo and extract data
def download_notebooks_from_repo_dir(repo_owner, repo_name, dir_path, save_dir, all_extracted_data):
    api_url = f"https://api.github.com/repos/{repo_owner}/{repo_name}/contents/{dir_path}"
    response = requests.get(api_url)
    
    if response.status_code != 200:
        print(f"Error fetching directory contents: {response.status_code}")
        return

    repo_contents = response.json()

    for item in repo_contents:
        if item['type'] == 'file' and item['name'].endswith('.ipynb'):
            notebook_path = os.path.join(save_dir, os.path.basename(item['path']))
            download_file(item['download_url'], notebook_path)
            # Extract code and context
            extracted_data = extract_code_and_context(notebook_path)
            all_extracted_data.extend(extracted_data)
        elif item['type'] == 'dir':  # Recurse into subdirectories
            download_notebooks_from_repo_dir(repo_owner, repo_name, item['path'], save_dir, all_extracted_data)

# Function to extract code and context from notebooks
def extract_code_and_context(notebook_path):
    with open(notebook_path, 'r', encoding='utf-8') as f:
        notebook = nbformat.read(f, as_version=4)

    extracted_data = []
    context = ''  # Most recent markdown cell; empty until one is seen
    for cell in notebook['cells']:
        if cell['cell_type'] == 'markdown':
            context = ''.join(cell['source'])
        elif cell['cell_type'] == 'code':
            cell_data = {
                "id": str(uuid.uuid4()),  # Unique identifier for the snippet
                "embedding": None,        # Filled in later, in Step 3
                "code": ''.join(cell['source']),
                "filename": os.path.basename(notebook_path),
                "context": context
            }
            extracted_data.append(cell_data)

    return extracted_data

# Example usage for the repository:
repo_owner = "passarel"
repo_name = "crawler_data_source"

# This will download all .ipynb files from the repo, extract the data and return it
extracted_notebooks_data = download_and_extract_notebooks(repo_owner, repo_name)

### Alternative code to download the repository in case of a connection error

In [20]:
#import os
#import git
#import nbformat
#import uuid
#import shutil

# Function to clone GitHub repository with validation
#def clone_repo(repo_url, clone_dir="./notebooks"):
    # Check if the directory exists
#    if os.path.exists(clone_dir):
        # Remove the existing directory
#        shutil.rmtree(clone_dir)
#        print(f"Existing directory {clone_dir} removed.")
        
    # Clone the repo into the specified directory
#    git.Repo.clone_from(repo_url, clone_dir)
#    print(f"Repository cloned in: {clone_dir}")

# Function to find all .ipynb notebooks in a directory
#def find_all_notebooks(directory):
#    notebooks = []
#    for root, dirs, files in os.walk(directory):
#        for file in files:
#            if file.endswith(".ipynb"):
#                notebooks.append(os.path.join(root, file))
#    return notebooks

# Function to extract code and context from notebooks
#def extract_code_and_context(notebook_path):
#    with open(notebook_path, 'r', encoding='utf-8') as f:
#        notebook = nbformat.read(f, as_version=4)

#    extracted_data = []
#    for cell in notebook['cells']:
#        if cell['cell_type'] == 'markdown':
#            context = ''.join(cell['source'])
#        elif cell['cell_type'] == 'code':
#            cell_data = {
#                "id": str(uuid.uuid4()),  
#                "embedding": None,        
#                "code": ''.join(cell['source']),
#                "filename": os.path.basename(notebook_path),
#                "context": context if 'context' in locals() else ''
#            }
#            extracted_data.append(cell_data)

#    return extracted_data

# Main function to clone and process notebooks
#def process_repo(repo_url, clone_dir="./notebooks"):
    # Clone the repository (if exists, remove and clone again)
#    clone_repo(repo_url, clone_dir)
    
    # Find all notebooks in the cloned repo
#    notebooks = find_all_notebooks(clone_dir)
    
#    all_extracted_data = []
    
    # Process each notebook to extract code and context
#    for notebook in notebooks:
#        print(f"Extracting data from: {notebook}")
#        extracted_data = extract_code_and_context(notebook)
#        all_extracted_data.extend(extracted_data)
    
#    print(f"Extraction completed. Total notebooks processed: {len(notebooks)}")
#    return all_extracted_data

# Example usage:
#repo_url = "https://github.com/passarel/crawler_data_source"
#extracted_notebooks_data = process_repo(repo_url)

Existing directory ./notebooks removed.
Repository cloned in: ./notebooks
Extracting data from: ./notebooks/LLM_experiments/GemmaSummarization/fine-tuning-4bits.ipynb
Extracting data from: ./notebooks/LLM_experiments/GemmaSummarization/fine-tuning-fullprec.ipynb
Extracting data from: ./notebooks/LLM_experiments/GemmaSummarization/fine-tuning-8bits.ipynb
Extracting data from: ./notebooks/LLM_experiments/Galileo/summarization-with-langchain.ipynb
Extracting data from: ./notebooks/LLM_experiments/Galileo/text-generation-with-langchain.ipynb
Extracting data from: ./notebooks/LLM_experiments/Galileo/code-generation-with-langchain.ipynb
Extracting data from: ./notebooks/LLM_experiments/Galileo/chatbot-with-langchain.ipynb
Extracting data from: ./notebooks/Natural_Language/bert_qa/Training.ipynb
Extracting data from: ./notebooks/Natural_Language/bert_qa/Deployment.ipynb
Extracting data from: ./notebooks/Natural_Language/bert_qa/Testing Mlflow Server.ipynb
Extracting data from: ./notebooks/Nat

## Step 2: Generate Metadata with an LLM 🔢

In this step, we use a language model (LLM) to generate descriptions and explanatory metadata for each extracted code snippet. The code performs the following operations:

- We define a prompt template with placeholders for the code snippet, the file name, and an optional context. The model is asked for a clear, concise explanation of what the code does, based on these three inputs.

- A `PromptTemplate` object is created from this template so that it can be composed with the language model.

- We use the OpenAI LLM, authenticated with an API key, to generate the responses.

- The function `update_context_with_llm` iterates through the extracted data, invokes the chain for each item, and replaces the original `context` field with the generated explanation. If a call fails, the item keeps its original context.

- The result is an enriched data structure in which every code snippet carries a clear explanation, making the snippets easier to understand and to retrieve later.
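
The enrichment loop can be sketched without calling a real model. In this illustration, `str.format` stands in for `PromptTemplate`, and `stub_llm` is a hypothetical model that returns a canned answer and fails on empty code; `TEMPLATE` and `enrich` are likewise names invented for the sketch, which mirrors the idea behind `update_context_with_llm` rather than reproducing it.

```python
from typing import Callable, List

# Shortened, hypothetical version of the prompt; str.format stands in for PromptTemplate.
TEMPLATE = (
    "Code:\n{code}\n\n"
    "File name:\n{filename}\n\n"
    "Context:\n{context}\n\n"
    "Describe what the code above does."
)


def enrich(items: List[dict], llm: Callable[[str], str]) -> List[dict]:
    """Replace each item's context with the model's explanation."""
    for item in items:
        prompt_text = TEMPLATE.format(
            code=item["code"], filename=item["filename"], context=item["context"]
        )
        try:
            item["context"] = llm(prompt_text).strip()
        except Exception as exc:
            # On any failure the original context is left untouched.
            print(f"Could not describe {item['filename']}: {exc}")
    return items


def stub_llm(prompt_text: str) -> str:
    """Hypothetical model: fails on empty code, otherwise returns a canned answer."""
    if "Code:\n\n" in prompt_text:
        raise ValueError("empty code cell")
    return "  Installs the project dependencies.  "


snippets = [
    {"code": "!pip install peft", "filename": "a.ipynb", "context": "Setup"},
    {"code": "", "filename": "b.ipynb", "context": "Original note"},
]
enriched = enrich(snippets, stub_llm)
print([item["context"] for item in enriched])
```

Keeping the original context on failure means one bad API call does not discard the markdown that was extracted from the notebook.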

In [21]:
from langchain_core.prompts import PromptTemplate
from langchain_openai import OpenAI

In [22]:
os.environ["OPENAI_API_KEY"] = "" #your api key

In [23]:
template = """
You will receive three pieces of information: a code snippet, a file name, and an optional context. Based on this information, explain in a clear, summarized and concise way what the code snippet is doing.

Code:
{code}

File name:
{filename}

Context:
{context}

Describe what the code above does.
"""

prompt = PromptTemplate.from_template(template)



In [24]:
llm = OpenAI()

llm_chain = prompt | llm


### Generate Metadata with a Local LLM

If you are using a local model through LlamaCpp instead of OpenAI, uncomment the cell below to build the chain for metadata generation.

In [None]:
### Alternative code to load a local model.
### This specific example requires the project to have an asset called Llama7b, associated with the cloud S3 URI s3://dsp-demo-bucket/LLMs (public bucket)

# from langchain_core.callbacks import CallbackManager, StreamingStdOutCallbackHandler
# from langchain_community.llms import LlamaCpp

# callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])

# llm_local = LlamaCpp(
            # model_path="/home/jovyan/datafabric/Llama7b/ggml-model-f16-Q5_K_M.gguf",
            # n_gpu_layers=64,
            # n_batch=512,
            # n_ctx=4096,
            # max_tokens=1024,
            # f16_kv=True,  
            # callback_manager=callback_manager,
            # verbose=False,
            # stop=[],
            # streaming=False,
            # temperature=0.4,
        # )

# llm_chain = prompt | llm_local


In [25]:
import httpcore

def update_context_with_llm(data_structure):
    updated_structure = []
    
    for item in data_structure:
        code = item['code']
        filename = item['filename']
        context = item['context']
        
        try:
            # Try calling an LLM to generate code explanation
            response = llm_chain.invoke({
                "code": code, 
                "filename": filename, 
                "context": context
            })
            
            # Update item with LLM response
            item['context'] = response.strip()
            
            # Print message indicating context was updated
            #print(f"Context generated for file {filename}: {item['context']}")

        except httpcore.ConnectError as e:
            # API or model connection specific error
            print(f"Connection error processing file {filename}: the connection to the API or model failed. Details: {str(e)}")
            # Keep the original context in case of error
            item['context'] = context
        
        except httpcore.ProtocolError as e:
            # Protocol error, similar to the original error mentioned
            print(f"Protocol error when processing the file {filename}: {str(e)}")
            # Keep the original context
            item['context'] = context
        
        except Exception as e:
            # Other general errors
            print(f"Error processing the file {filename}: {str(e)}")
            # Keep the original context
            item['context'] = context
        
        # Add the updated item (or not) to the structure
        updated_structure.append(item)
    
    return updated_structure


In [26]:
updated_data = update_context_with_llm(extracted_notebooks_data)

In [27]:
updated_data

[{'id': '97f7b0f9-7b8c-4fa5-8869-867a35c4534a',
  'embedding': None,
  'code': '!pip install datasets # This one is for downloading our samsum dataset direclty from Hugging Face\n!pip install peft # Both peft and trl are the libs that help us \n!pip install trl # to configure our training methods and params\n!pip install bitsandbytes # This one will help us to quantize the model\n!pip install mlflow==2.11.0',
  'filename': 'fine-tuning-4bits.ipynb',
  'context': 'The code snippet installs several libraries needed for fine-tuning a model. Specifically, it installs datasets, peft, trl, bitsandbytes, and mlflow version 2.11.0. These libraries will be used to configure the training methods and parameters, quantize the model, and download the samsum dataset directly from Hugging Face. The file name fine-tuning-4bits.ipynb suggests that this code is used for fine-tuning a model with 4-bit precision.'},
 {'id': '88ae9f75-54c4-48b2-ac96-cef2535bb452',
  'embedding': None,
  'code': 'from datas

## Step 3: Generate Embeddings and Structure Data

In this step, we use an embeddings model to generate embedding vectors for the context extracted from each code snippet. The code performs the following operations:

**Hugging Face Embeddings**: We use the `all-MiniLM-L6-v2` model, loaded through `HuggingFaceEmbeddings`, to generate vectors that semantically represent the context of each code snippet.

**Function** *update_embeddings*: This function iterates through the data structure produced in Step 2. For each item, it:

- Generates an embedding vector from the `context` field using the `embed_query` method of the embeddings model.
- Stores the new vector in the item's `embedding` field.

**Conversion to DataFrame**: After the embeddings are added, the `to_dataframe_row` function converts each item into a dictionary containing:

- **ID**: A unique identifier for the code snippet.
- **Embeddings**: The embedding vector generated for the context.
- **Code**: The extracted code.
- **Metadata**: Additional metadata: the filename and the updated context.

Pandas then turns this list of dictionaries into a DataFrame, which makes the data easy to inspect and to pass on to the next step.
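
The data flow of this step can be sketched with a toy embedding function. `toy_embed` below is a hypothetical stand-in for `embeddings.embed_query` (a normalised bag-of-characters vector with no semantic meaning), and `add_embeddings` and `to_rows` are illustrative counterparts of `update_embeddings` and `to_dataframe_row`.

```python
from typing import Callable, List


def toy_embed(text: str, dim: int = 8) -> List[float]:
    """Hypothetical stand-in for embed_query: a normalised bag-of-characters vector."""
    vector = [0.0] * dim
    for ch in text.lower():
        vector[ord(ch) % dim] += 1.0
    norm = sum(v * v for v in vector) ** 0.5
    return [v / norm for v in vector] if norm else vector


def add_embeddings(items: List[dict], embed: Callable[[str], List[float]]) -> List[dict]:
    """Embed each item's context and store the vector on the item."""
    for item in items:
        item["embedding"] = embed(item["context"])
    return items


def to_rows(items: List[dict]) -> List[dict]:
    """Reshape items into the column layout used for the DataFrame."""
    return [
        {
            "ids": item["id"],
            "embeddings": item["embedding"],
            "code": item["code"],
            "metadatas": {"filenames": item["filename"], "context": item["context"]},
        }
        for item in items
    ]


items = [{"id": "snippet-1", "embedding": None, "code": "import os",
          "filename": "demo.ipynb", "context": "Imports the os module."}]
rows = to_rows(add_embeddings(items, toy_embed))
print(sorted(rows[0].keys()))
```

A list shaped like `rows` can be passed directly to `pd.DataFrame(rows)`, giving one column per top-level key.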

In [32]:
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")


In [None]:
def update_embeddings(data_structure):
    updated_structure = []
    for item in data_structure:
        context = item['context']

        # Generate the embedding for the context
        embedding_vector = embeddings.embed_query(context)

        # Update the item with the new embedding
        item['embedding'] = embedding_vector
        updated_structure.append(item)
    
    return updated_structure


In [None]:
updated_structure = update_embeddings(updated_data)

In [None]:
import pandas as pd
def to_dataframe_row(embedded_snippets: list):
    """
    Helper function to convert a list of embedded snippets into a dataframe row
    in dictionary format.

    Args:
        embedded_snippets: List of dictionaries containing Snippets to be converted

    Returns:
        List of Dictionaries suitable for conversion to a DataFrame
    """
    outputs = []
    for snippet in embedded_snippets:
        output = {
            "ids": snippet['id'],
            "embeddings": snippet['embedding'],
            "code": snippet['code'],
            "metadatas": {
                "filenames": snippet['filename'],
                "context": snippet['context'],
            },
        }
        outputs.append(output)
    return outputs




In [None]:
rows = to_dataframe_row(updated_structure)
df = pd.DataFrame(rows)

In [None]:
df

In [None]:
# Accessing the 'context' field within dictionaries in the 'metadatas' column
contexts = df['metadatas'].apply(lambda x: x.get('context', None))

# Display the contexts
print(contexts)


## Step 4: Store and Query Documents in ChromaDB 🔗🏦

In this step, we use ChromaDB, a vector database system, to store code snippets and their respective metadata. We also implement a function to retrieve documents based on queries. The code performs the following operations:

#### Connection and Collection Creation
- **ChromaDB Client**: A ChromaDB client is initialized to interact with the database.
- **Collection Creation or Retrieval**: The collection named "my_collection" is created (or retrieved, if it already exists) within the ChromaDB database. Collections are used to store documents and their corresponding embeddings.

#### Inserting Documents
- **Data Extraction**: The following fields are extracted from the DataFrame and converted into lists:
   - **ids**: A list of unique identifiers for each document (code snippet).
   - **documents**: A list of code snippets.
   - **metadatas**: A list of metadata associated with each document, such as the filename and context.
   - **embeddings_list**: A list of embedding vectors previously generated for the context of each code snippet.
- **Inserting into ChromaDB**: The `upsert` method is used to insert or update the documents, ids, metadata, and embeddings in the created collection.

#### Querying Documents
- **Query**: After adding the documents to the collection, a query is performed. The code searches for documents related to the query text "!pip install", returning the 5 most relevant results.

#### Function *retriever*
- **Document Retrieval**: The retriever function is implemented to query the collection. It takes a query string, the collection, and the number of results to return (`top_n`) as parameters.
  - **Query in ChromaDB**: The function executes a query in the collection using the provided string.
  - **Creating Document Objects**: For each result returned, the function creates a Document instance containing the page content (code snippet) and its metadata.
  - **Returning Documents**: The function returns a list of Document objects that contain the page content and metadata for easy retrieval and future analysis.
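
A ChromaDB query result holds one inner list per query text, so the results for a single query sit at index 0 of each field. The sketch below shows how such a result can be flattened into document objects. `mock_results` is a hand-written example of that shape, `Doc` is a hypothetical stand-in for LangChain's `Document`, and treating a metadata entry as possibly missing is a defensive assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Doc:
    """Hypothetical stand-in for LangChain's Document."""
    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


# Hand-written example of the nested result shape: one inner list per query text.
mock_results = {
    "ids": [["id-1", "id-2"]],
    "documents": [["!pip install pandas", "%pip install --quiet GitPython"]],
    "metadatas": [[
        {"filenames": "a.ipynb", "context": "Installs pandas."},
        None,  # assumed case: no metadata stored for this entry
    ]],
}


def results_to_docs(results: dict, query_index: int = 0) -> List[Doc]:
    """Build one Doc per hit for the query at query_index."""
    contents = results["documents"][query_index]
    metadatas = results["metadatas"][query_index]
    return [Doc(page_content=c, metadata=m or {}) for c, m in zip(contents, metadatas)]


docs = results_to_docs(mock_results)
print(len(docs), docs[0].metadata["filenames"])
```

Iterating over the outer list instead would yield one object per query, not one per retrieved snippet.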


In [None]:
import chromadb

chroma_client = chromadb.Client()
collection = chroma_client.get_or_create_collection(name="my_collection")

ids = df["ids"].tolist()
documents = df["code"].tolist()
metadatas = df["metadatas"].tolist()
embeddings_list = df["embeddings"].tolist()

data_to_insert = {
    "ids": ids,
    "documents": documents,
    "metadatas": metadatas,
    "embeddings": embeddings_list
}

for i in range(len(ids)):
    print(f"ID: {ids[i]}")
    print(f"Document: {documents[i]}")
    print(f"Metadata: {metadatas[i]}")
    print(f"Embedding: {embeddings_list[i]}\n")

collection.upsert(
    documents=documents,
    ids=ids,
    metadatas=metadatas,
    embeddings=embeddings_list  
)

print("Documents added successfully!!")


In [None]:
document_count = collection.count()
print(f"Total documents in the collection: {document_count}")

In [None]:
results = collection.query(
    query_texts=["!pip install"],
    n_results=5,  
)

In [34]:
results

{'ids': [['e174ba3d-d46f-467c-bdd5-5169bcab1eb3',
   '2894eeb6-a424-4821-9af1-a4e49bee5a0d',
   '453ca5ea-fea7-4261-9c2e-df7e334ac2f4',
   '83135cda-a56a-4936-af8a-76fbbdba8cad',
   '73f1d4c7-94b3-4dcc-811a-df379b4db2ea']],
 'embeddings': None,
 'documents': [['!pip install webvtt-py\n!pip install pandas',
   '%pip install --upgrade --quiet  GitPython',
   '"""\nChain with Local model\n"""\n\n#from typing import List\n#from langchain.prompts import ChatPromptTemplate\n#from langchain.schema.runnable import RunnablePassthrough\n#from langchain.schema import StrOutputParser\n#import uuid\n\n#def format_docs(docs: List[Document]) -> str:\n#    return "\\n\\n".join([doc.page_content for doc in docs])\n\n#template = """You are a Python wizard tasked with generating code for a Jupyter Notebook (.ipynb) based on the given context.\n#Your answer should consist of just the Python code, without any additional text or explanation.\n\n#Context:\n#{context}\n\n#Question: {question}\n#"""\n\n#prompt

In [35]:
from langchain_core.documents import Document
from typing import List


def retriever(query: str, collection, top_n: int = 10) -> List[Document]:
    results = collection.query(
        query_texts=[query],
        n_results=top_n
    )

    # Chroma returns one inner list per query text; a single query was sent, so use index 0
    contents = results['documents'][0]
    metadatas = results['metadatas'][0]

    documents = [
        Document(
            page_content=content,
            metadata=metadata or {}  # Fall back to empty metadata if none was stored
        )
        for content, metadata in zip(contents, metadatas)
    ]

    return documents


## Step 5: Chain 🦜⛓️

In this step, we use a flow to automatically generate Python code based on a provided context and question. The code performs the following:

#### Function *format_docs(docs: List[Document]) -> str:*
- **Purpose**: This function formats a list of documents docs into a single string by concatenating the content of each document (doc.page_content) with two line breaks (\n\n) between them. This ensures that the context used in code generation is organized and readable.

#### Language Model and Processing Chain:
- **ChatOpenAI**: A language model from OpenAI is used to generate responses based on the provided prompt.
- The **chain** processes data using the following components:
  - **Context**: The *retriever* function fetches the snippets most relevant to the query from the collection, and *format_docs* joins them into a single context string.
  - **Question**: The question is taken from the input and inserted into the prompt.
  - **Model**: The model generates the code based on the template and the provided data.
  - **Output Parser**: The output is processed with `StrOutputParser` to ensure the return is a clean string.
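
The first stage of the chain, where the prompt variables are assembled, can be sketched in plain Python. To our understanding, a dictionary at the head of an LCEL chain runs every entry on the same input, which is what `run_mapping` imitates. `run_mapping`, `fake_retrieve_and_format`, and `PROMPT` are hypothetical names, and no retrieval or model call takes place.

```python
from typing import Callable, Dict


def run_mapping(steps: Dict[str, Callable[[dict], str]], inputs: dict) -> Dict[str, str]:
    """Call every step with the same full input dict and collect the results by key."""
    return {key: step(inputs) for key, step in steps.items()}


def fake_retrieve_and_format(query: str) -> str:
    """Hypothetical stand-in for format_docs(retriever(query, collection))."""
    return f"<snippets matching '{query}'>"


PROMPT = "Context:\n{context}\n\nQuestion: {question}\n"

steps = {
    # Each entry receives the whole input and selects the field it needs.
    "context": lambda inputs: fake_retrieve_and_format(inputs["query"]),
    "question": lambda inputs: inputs["question"],
}

inputs = {"query": "chromadb collection", "question": "Create a ChromaDB collection."}
prompt_text = PROMPT.format(**run_mapping(steps, inputs))
print(prompt_text)
```

Selecting `inputs["question"]` explicitly keeps the retrieval query out of the question slot of the prompt.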

#### Function *clean_and_print_code(result: str)*:
- **Purpose**: This function takes the generated code string from the model and removes any formatting markers (e.g., ```python). After cleaning, the code is printed in a clean format, ready for execution.
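
As a variation on this cleaning step, the sketch below extracts the contents of the first fenced block with a regular expression. `strip_code_fences` is a hypothetical helper, not the notebook's function, and the fence marker is built programmatically so the example itself contains no literal fence.

```python
import re

FENCE = "`" * 3  # three backticks, built here to avoid a literal fence in the example


def strip_code_fences(text: str) -> str:
    """Return the contents of the first fenced block, or the stripped text if there is none."""
    pattern = re.compile(FENCE + r"[\w+-]*[ \t]*\n(.*?)" + FENCE, re.DOTALL)
    match = pattern.search(text)
    return (match.group(1) if match else text).strip()


raw = "Here is the code:\n" + FENCE + "python\nprint('hello')\n" + FENCE + "\nHope it helps!"
cleaned = strip_code_fences(raw)
print(cleaned)
```

Unlike a plain `str.replace` on the markers, this variant also drops any prose the model wrote around the block, which matters when the result is executed as code.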

#### Interaction with Galileo:
- The *promptquality* library is used to evaluate the quality of the prompts and of the generated responses.
- **Galileo Callback**: A callback is configured with the following scorers, and the Galileo API key is used for authentication:
   - **Context Adherence**: Evaluates whether the generated code aligns with the provided context.
   - **Correctness**: Checks the factual accuracy of the generated code.
   - **Prompt Perplexity**: Measures how predictable the prompt is to a language model; lower values generally indicate a clearer prompt.
 
#### Chain Execution:
- A list of inputs, each containing a query and a question, is provided to run the chain. The system generates code for questions such as "How can I use audio in RAG?" and "create code audio with RAG", using snippets retrieved from the vector database.

#### Results Publishing:
- The Galileo callback finalizes and publishes the results, recording the evaluation of each run of the code generation chain.

#### Function *create_new_code_cell_from_output(output)*:
- **Purpose**: This function dynamically creates a new code cell in the Jupyter Notebook from the generated output. It handles different output formats such as strings or dictionaries (if the output contains JSON) and inserts the resulting code into the next code cell in the notebook.


#### Processing the results: 
- After the chain execution, the function iterates over each generated result, attempts to parse it as JSON, and creates a new code cell in the notebook from the output. If the result is not JSON, it treats the output as a code string.
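
The JSON-or-string handling can be sketched independently of IPython. `output_to_code` below is a hypothetical helper that only normalises a model reply into a code string; the notebook-style JSON layout (`cells[0].source`) is assumed from the description above.

```python
import json


def output_to_code(result: str) -> str:
    """Return source code from a reply that is either notebook-style JSON or plain code."""
    try:
        parsed = json.loads(result)
    except json.JSONDecodeError:
        # Not JSON: treat the reply as code.
        return result.strip()
    if isinstance(parsed, dict):
        cells = parsed.get("cells") or [{}]
        source = cells[0].get("source", "")
        # `source` may be a single string or a list of lines.
        return (source if isinstance(source, str) else "".join(source)).strip()
    # Valid JSON that is not an object (for example a bare number): keep the raw text.
    return result.strip()


plain = output_to_code("import chromadb\nclient = chromadb.Client()\n")
from_json = output_to_code(json.dumps({"cells": [{"source": ["x = 1\n", "print(x)"]}]}))
print(plain)
print(from_json)
```

In a notebook, the returned string could then be passed to `get_ipython().set_next_input(...)` to populate the next cell.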

In [36]:
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI


def format_docs(docs: List[Document]) -> str:
    return "\n\n".join([doc.page_content for doc in docs])

template = """You are a Python wizard tasked with generating code for a Jupyter Notebook (.ipynb) based on the given context.
Your answer should consist of just the Python code, without any additional text or explanation.

Context:
{context}

Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)

model = ChatOpenAI() 

chain = {
    # Each entry receives the full input dict, so it selects the field it needs
    "context": lambda inputs: format_docs(retriever(inputs['query'], collection)),
    "question": lambda inputs: inputs['question']
} | prompt | model | StrOutputParser()




### Local Model
Uncomment the cell below to run the chain with a local model through LlamaCpp.

In [None]:
# def format_docs(docs: List[Document]) -> str:
    # return "\n\n".join([doc.page_content for doc in docs])

# template = """You are a Python wizard tasked with generating code for a Jupyter Notebook (.ipynb) based on the given context.
# Your answer should consist of just the Python code, without any additional text or explanation.

# Context:
# {context}

# Question: {question}
# """

# prompt = ChatPromptTemplate.from_template(template)
# model = llm_local  # llm_local is already a model instance, so it is not called

# chain = {
    # "context": lambda inputs: format_docs(retriever(inputs['query'], collection)), 
    # "question": lambda inputs: inputs['question']
# } | prompt | model | StrOutputParser()

In [37]:
def clean_and_print_code(result: str):
    clean_code = result.replace("```python", "").replace("```", "").strip()
    
    print(clean_code)

In [42]:
import promptquality as pq

os.environ['GALILEO_API_KEY'] = "" #your api key
galileo_url = "https://console.hp.galileocloud.io/"
pq.login(galileo_url)


👋 You have logged into 🔭 Galileo (https://console.hp.galileocloud.io/) as diogo.vieira@hp.com.


Config(console_url=Url('https://console.hp.galileocloud.io/'), username=None, password=None, api_key=SecretStr('**********'), token=SecretStr('**********'), current_user='diogo.vieira@hp.com', current_project_id=None, current_project_name=None, current_run_id=None, current_run_name=None, current_run_url=None, current_run_task_type=None, current_template_id=None, current_template_name=None, current_template_version_id=None, current_template_version=None, current_template=None, current_dataset_id=None, current_job_id=None, current_prompt_optimization_job_id=None, api_url=Url('https://api.hp.galileocloud.io/'))

### Input Parameters 💡

**Query**: The query is the text used to retrieve information, here code snippets, from the vector database. It is passed to the retriever, which returns the stored snippets most similar to it, for example snippets that create an LLM and an embedding model.

**Question**: The question is the task you ask the language model to perform. It is sent to the LLM together with the context retrieved by the query, and the model generates code in response.

In [40]:
from IPython.display import display, Markdown, Code
from IPython import get_ipython


prompt_handler = pq.GalileoPromptCallback(
    scorers=[
        pq.Scorers.context_adherence_plus,  # groundedness
        pq.Scorers.correctness,             # factuality
        pq.Scorers.prompt_perplexity        # perplexity 
    ]
)

# Example of inputs to run the chain
inputs = [
    {"query": "instantiate a model with llama cpp local", "question": "create code local llm model with llamacpp"},

]
# Other example inputs to try:
# How to create a vector database?
# create code for a ChromaDB vector database

results = chain.batch(inputs, config=dict(callbacks=[prompt_handler]))

# Publish run results
prompt_handler.finish()


Processing chain run...:   0%|          | 0/5 [00:00<?, ?it/s]

Initial job complete, executing scorers asynchronously. Current status:
cost: Done ✅
toxicity: Done ✅
pii: Done ✅
protect_status: Done ✅
prompt_perplexity: Failed ❌, error was: Executing this metric requires credentials for OpenAI or Azure OpenAI service to be set.
latency: Done ✅
groundedness: Failed ❌, error was: Executing this metric requires credentials for OpenAI, Azure OpenAI or Vertex to be set.
factuality: Failed ❌, error was: Executing this metric requires credentials for OpenAI, Azure OpenAI or Vertex to be set.
🔭 View your prompt run on the Galileo console at: https://console.hp.galileocloud.io/prompt/chains/ac3e990f-623d-4616-ade7-fb2b9d3cefd3/e7ded322-462f-45f4-b751-468180879c48?taskType=12


In [41]:
import json
from IPython.core.getipython import get_ipython

def create_new_code_cell_from_output(output):
    """
    Creates a new code cell in Jupyter Notebook from an output,
    dealing with different output formats.

    Args:
        output: The output to be inserted into the new cell. It can be a string, a dictionary
                or another type of object.
    """

    shell = get_ipython()

    if isinstance(output, dict):
        code = output['cells'][0]['source']
        code = ''.join(code)
    else:
        code = str(output)

    clean_code = code.strip()

    shell.set_next_input(clean_code, replace=False)

for result in results:
    try:
        output = json.loads(result)
        create_new_code_cell_from_output(output)
    except json.JSONDecodeError:
        # If it's not JSON, just treat it as a string of code
        create_new_code_cell_from_output(result)


### LLM-Generated Code

In [None]:
from langchain_core.callbacks import CallbackManager, StreamingStdOutCallbackHandler
from langchain_community.llms import LlamaCpp

callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])

llm_local = LlamaCpp(
            model_path="/home/jovyan/datafabric/Llama7b/ggml-model-f16-Q5_K_M.gguf",
            n_gpu_layers=64,
            n_batch=512,
            n_ctx=4096,
            max_tokens=1024,
            f16_kv=True,  
            callback_manager=callback_manager,
            verbose=False,
            stop=[],
            streaming=False,
            temperature=0.4,
        )

llm_chain = prompt | llm_local