### Loading the resume (as a directory of text files) into Pinecone

**1. Why did I break it down from a PDF?**

a. Good question! The PDF came out a bit messy when I loaded the data, and it wasn't as clear for sourcing purposes. This is an attempt to see whether breaking the text out into separate files produces more accurate responses. So we'll find out: straight-up PDF vs. folder/file structure. Let the games begin!

**2. Chunk Size set to 128 (with 64 overlap). Why so small?**

a. Default settings tend to use a chunk size of 512, which is about a paragraph. I found that 128 tokens more accurately represents each entry in my resume, and it is closer to the size of the questions.

**3. Why Pinecone and not ChromaDB?**

a. This app is deployed on Streamlit's free cloud, and there were dependency issues with ChromaDB that I was too lazy to resolve, so I went with Pinecone.
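
To make the chunk size and overlap from question 2 concrete, here is a minimal sketch of a fixed-size sliding window over a token list. It is a simplification and not how `RecursiveCharacterTextSplitter` works internally (that splitter also respects separators); the helper name `sliding_chunks` and the placeholder tokens are assumptions made up for illustration.

```python
def sliding_chunks(tokens, chunk_size=128, chunk_overlap=64):
    """Split a token list into fixed-size windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks

# Hypothetical stand-in for a resume entry: 300 placeholder tokens.
tokens = [f"w{i}" for i in range(300)]
chunks = sliding_chunks(tokens)
print([len(c) for c in chunks])  # [128, 128, 128, 108]
```

With a 64-token overlap, the second half of each chunk reappears as the first half of the next one, so a short resume entry that straddles a boundary still shows up whole in at least one chunk.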


In [3]:
import os
os.chdir("..")

In [105]:
from pinecone import Pinecone
import streamlit as st 

In [107]:
# temp variables
pc_api = st.secrets.pinecone.api_key
pc_index = st.secrets.pinecone.index

In [90]:
import tiktoken

tokenizer = tiktoken.get_encoding('cl100k_base')

# create the length function
def tiktoken_len(text):
    tokens = tokenizer.encode(
        text,
        disallowed_special=()
    )
    return len(tokens)

In [108]:
from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

def fetch_load_split(directory="resume/", chunk_size=128, chunk_overlap=64):
    loader = DirectoryLoader(directory)
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, 
        chunk_overlap=chunk_overlap, 
        length_function=tiktoken_len,
        separators=["\n\n", "\n", " ", ""]
        )
    documents = loader.load_and_split(text_splitter=text_splitter)
    return documents

def load_to_pinecone(formatted_documents, namespace="v1", batch_size=100):
    """Upsert the formatted vectors into the Pinecone index in batches."""
    pc = Pinecone(api_key=pc_api)
    index = pc.Index(pc_index)
    # Send batch_size vectors per request instead of one large upsert
    for i in range(0, len(formatted_documents), batch_size):
        index.upsert(vectors=formatted_documents[i:i+batch_size], namespace=namespace)
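
The batching loop in `load_to_pinecone` can be checked without touching Pinecone. The sketch below pulls the slicing into a hypothetical helper (`batched` is not part of the notebook) and runs it over placeholder records, so you can see how many requests a given document count would produce.

```python
def batched(items, batch_size=100):
    """Yield consecutive slices of at most batch_size items."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

# Hypothetical payload: 250 placeholder records standing in for formatted vectors.
records = [{"id": str(n)} for n in range(250)]
sizes = [len(batch) for batch in batched(records)]
print(sizes)  # [100, 100, 50]
```

The last batch is simply shorter than the others, so no padding or special casing is needed.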

In [92]:
documents = fetch_load_split()

In [102]:
import uuid
from langchain.embeddings.openai import OpenAIEmbeddings
from dotenv import load_dotenv

load_dotenv()

def get_embeddings(documents, model_name='text-embedding-ada-002'):
    embed = OpenAIEmbeddings(model=model_name)
    texts = [document.page_content for document in documents]
    return embed.embed_documents(texts)

def get_section(source):
    section = source.split("/")[-2:]
    section[1] = section[1].split(".")[0]
    return f"{section[0].title()} - {section[1].replace('_', ' ').title()}"

def pull_source(document):
    return document.metadata['source']
    
def create_metadata(document):
    source = pull_source(document)
    section = get_section(source)
    document.metadata.update({"section": section})
    return document.metadata

def format_to_json(documents):
    new_list = []
    embeddings = get_embeddings(documents)
    for i, document in enumerate(documents):
        temp = {"id": str(uuid.uuid4()),
                "values": embeddings[i],
                "metadata": create_metadata(document)
                }
        new_list.append(temp)
    return new_list
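
As a quick sanity check of the record shape built above, this sketch runs the same `get_section` logic on a made-up path and assembles one record by hand. The path `resume/experience/data_scientist.txt` and the three-value vector are assumptions for illustration only; real embeddings are much longer and come from `get_embeddings`.

```python
import uuid

def get_section(source):
    # Same logic as the notebook's get_section, repeated so this runs standalone.
    section = source.split("/")[-2:]
    section[1] = section[1].split(".")[0]
    return f"{section[0].title()} - {section[1].replace('_', ' ').title()}"

# Hypothetical source path and dummy vector, not taken from the real resume folder.
source = "resume/experience/data_scientist.txt"
record = {
    "id": str(uuid.uuid4()),
    "values": [0.1, 0.2, 0.3],
    "metadata": {"source": source, "section": get_section(source)},
}
print(record["metadata"]["section"])  # Experience - Data Scientist
```

The section label is derived from the parent folder and the file name, which is why the folder/file layout matters for sourcing.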

In [103]:
new_docs = format_to_json(documents)
new_docs[0]

{'id': '7f4fbe09-cb86-4c2d-8b3e-f4c8fd95322a',
 'values': [-0.02534717737418346,
  -0.005624209161496576,
  0.0054434557448072044,
  0.002487096046864358,
  0.010567116732423283,
  0.0024471218186307916,
  -0.011102425125013896,
  0.018881770174844045,
  -0.016239989827584983,
  -0.020828344285255634,
  0.01396666956816606,
  0.013069855273519294,
  0.001622434849750657,
  -0.005224465947838318,
  -0.0028312225380573977,
  -0.005252274309070927,
  -0.005874483015606178,
  -0.005419123545143995,
  0.00669134885378581,
  -0.03225751649059945,
  -0.020550262535574716,
  -0.004630065602535677,
  0.0009897982405838257,
  -0.013869340769513221,
  -0.018005810986968497,
  -0.0067782491677304485,
  0.020758822916512813,
  -0.028169708809825417,
  -0.010316843343974978,
  0.003392600745434624,
  0.011901910621007823,
  -0.022469027353431106,
  -0.01078958269096157,
  -0.021245466909777005,
  -0.03617847027868747,
  -0.0019656923005513466,
  0.013021190874192875,
  0.01585067463297363,
  0.02812

In [104]:
# load_to_pinecone(new_docs)

### Loading on Use


In [None]:
import zipfile as z

def unzip() -> None:
    """Unzip resume.zip into the current directory as resume/."""
    with z.ZipFile('./resume.zip', 'r') as zip_ref:
        zip_ref.extractall()
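
If you want to try the unzip step without the real `resume.zip`, the sketch below builds a tiny archive in a temporary directory and extracts it into an explicit target folder. The helper `unzip_to` and the file `resume/skills/python.txt` are hypothetical; the notebook's `unzip` extracts into the current directory instead.

```python
import tempfile
import zipfile
from pathlib import Path

def unzip_to(archive, target):
    """Extract archive into target and return the archived file names."""
    with zipfile.ZipFile(archive, "r") as zip_ref:
        zip_ref.extractall(target)
        return zip_ref.namelist()

with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    archive = tmp / "resume.zip"
    # Build a one-file archive that mimics the resume/ folder layout.
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr("resume/skills/python.txt", "Python, SQL")
    names = unzip_to(archive, tmp / "out")
    extracted_text = (tmp / "out" / "resume" / "skills" / "python.txt").read_text()

print(names, extracted_text)
```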


    

