<a href="https://colab.research.google.com/github/CodeAlchemyAI/AI-Notebooks/blob/main/LangChain/chat_with_apple_jobs_post.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This example demonstrates how to create a Q&A system using job postings from Apple's official jobs site: https://jobs.apple.com.

In [3]:
import requests
from bs4 import BeautifulSoup
from langchain.indexes import VectorstoreIndexCreator
from langchain.document_loaders import UnstructuredURLLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Pinecone
from langchain import OpenAI
from langchain.chains import RetrievalQAWithSourcesChain
import pinecone 
import os



In [None]:
# Install the required libraries
!pip install langchain
!pip install openai
!pip install beautifulsoup4
!pip install requests
!pip install unstructured
!pip install pdf2image
!pip install chromadb
!pip install pinecone-client

First, set the OpenAI API key, create the embeddings object, and collect the job-posting URLs from the jobs.apple.com search results.

In [None]:


os.environ["OPENAI_API_KEY"] = "sk-OPENAI-KEY"

embeddings = OpenAIEmbeddings()


# Base URL
base_url = "https://jobs.apple.com"

# Store all the links
all_links = []

# Loop through the search result pages (range(1, 2) fetches only page 1; raise the upper bound to crawl more)
for i in range(1, 2):
    # Get page content
    url = f"{base_url}/en-us/search?page={i}"
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find all 'a' elements with the specific class
    a_elements = soup.find_all('a', class_='table--advanced-search__title')

    # Extract and append the href attribute of each 'a' element to the list
    for a in a_elements:
        link = base_url + a['href']
        all_links.append(link)
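
The link-extraction step above can be checked offline before hitting the live site. The following is a minimal sketch, assuming the search page exposes each job title as an `<a>` element with the class `table--advanced-search__title` (taken from the cell above). It swaps BeautifulSoup for the standard library's `html.parser` so it runs without extra installs, and it uses `urljoin` instead of string concatenation so that absolute hrefs are not prefixed twice. The `TitleLinkParser` class and the sample markup are hypothetical and exist only for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class TitleLinkParser(HTMLParser):
    """Collects href values from <a> tags that carry a given CSS class."""

    def __init__(self, base_url, css_class):
        super().__init__()
        self.base_url = base_url
        self.css_class = css_class
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        href = attr_map.get("href")
        if href and self.css_class in classes:
            # urljoin resolves both relative and absolute hrefs against the base URL
            self.links.append(urljoin(self.base_url, href))


# Hypothetical markup, made up for illustration only
SAMPLE_HTML = """
<a class="table--advanced-search__title" href="/en-us/details/1/example-role">Example role</a>
<a class="other" href="/en-us/ignored">Ignored</a>
<a class="table--advanced-search__title" href="https://jobs.apple.com/en-us/details/2/another-role">Another role</a>
"""

parser = TitleLinkParser("https://jobs.apple.com", "table--advanced-search__title")
parser.feed(SAMPLE_HTML)
sample_links = parser.links
```

With the sample markup, `sample_links` holds two absolute URLs and skips the anchor whose class does not match.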


Initialize the Pinecone client and choose the index name.

In [4]:
PINECONE_API_KEY = "PINECONE-API-KEY"
PINECONE_ENV = "asia-northeast1-gcp"
pinecone.init(
    api_key=PINECONE_API_KEY,
    environment=PINECONE_ENV
)
index_name = "apple"
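
The keys in this notebook are placeholders typed into the cells. One possible alternative, sketched here, is to read them from environment variables so they never end up in a shared notebook. The helper name `require_env` is hypothetical, and the sketch assumes the variables are set outside the notebook before it runs.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Set the {name} environment variable before running this cell")
    return value


# Example usage (assumes the variable has been exported beforehand):
# PINECONE_API_KEY = require_env("PINECONE_API_KEY")
```

Failing early with the variable name in the message is easier to diagnose than an authentication error raised later by a client library.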


Iterate through all the links, load each page's text, split it into chunks, embed the chunks, and save the resulting vectors to Pinecone. The links are processed in batches of up to 100 so that each load-and-upload call handles a bounded number of pages.

In [None]:
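def process_batch(batch):
    # Download and parse the pages for this batch of URLs
    loader = UnstructuredURLLoader(urls=batch)
    documents = loader.load()
    # Split each page into chunks of roughly 1000 characters with no overlap
    text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    docs = text_splitter.split_documents(documents)
    # Embed the chunks and store them in the Pinecone index
    Pinecone.from_documents(docs, embeddings, index_name=index_name)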

batch = []
batch_limit = 100

for link in all_links:
    batch.append(link)
    if len(batch) >= batch_limit:
        process_batch(batch)
        batch = []
        
# Processing the remaining batch (if it's not empty)
if batch:
    process_batch(batch)
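
The accumulate-and-flush loop above can also be written as a small generator, which removes the need for the separate "remaining batch" step. This is a minimal sketch using only the standard library. The name `batched` and the example URLs are hypothetical stand-ins, not part of the original notebook.

```python
from itertools import islice


def batched(items, size):
    """Yield consecutive lists of at most `size` items from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Hypothetical stand-in URLs, used only to show the grouping
demo_links = [f"https://example.com/job/{n}" for n in range(5)]
demo_batches = list(batched(demo_links, 2))
```

In the notebook, the same helper would be used as `for batch in batched(all_links, batch_limit): process_batch(batch)`. The final short batch is yielded like any other.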

We can now interact with the data using the RetrievalQAWithSourcesChain.

In [9]:
text_field = "text"

index = pinecone.Index(index_name)
embed = OpenAIEmbeddings()
vectorstore = Pinecone(
    index, embed.embed_query, text_field
)
chain = RetrievalQAWithSourcesChain.from_chain_type(
    OpenAI(temperature=0),
    chain_type="stuff",
    retriever=vectorstore.as_retriever(),
)
chain({"question": "Is Apple working on a large language model?"}, return_only_outputs=True)


{'answer': ' Yes, Apple is working on large language models.\n',
 'sources': 'https://jobs.apple.com/en-us/details/200478486/natural-language-generation-research-engineer-input-experience?team=MLAI, https://jobs.apple.com/en-us/details/200482795/aiml-software-engineer-machine-learning-platform-intelligence?team=MLAI'}
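
The result shown above is a dictionary whose `sources` value is a single comma-separated string. A minimal sketch for turning it into a list of URLs follows. It assumes the URLs themselves contain no commas, and the helper name `split_sources` is hypothetical. The example dictionary is copied from the output above.

```python
def split_sources(result):
    """Turn the comma-separated 'sources' string into a list of URLs."""
    raw = result.get("sources", "")
    return [part.strip() for part in raw.split(",") if part.strip()]


# Copied from the notebook output above
example_result = {
    "answer": " Yes, Apple is working on large language models.\n",
    "sources": (
        "https://jobs.apple.com/en-us/details/200478486/"
        "natural-language-generation-research-engineer-input-experience?team=MLAI, "
        "https://jobs.apple.com/en-us/details/200482795/"
        "aiml-software-engineer-machine-learning-platform-intelligence?team=MLAI"
    ),
}

answer = example_result["answer"].strip()
source_urls = split_sources(example_result)
```

Stripping the answer removes the leading space and trailing newline that the model output carries.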