Dependencies

In [2]:
import os
from dotenv import load_dotenv
load_dotenv()

True

In [3]:
from langchain_community.document_loaders import WebBaseLoader, SitemapLoader
from bs4 import BeautifulSoup

USER_AGENT environment variable not set, consider setting it to identify your requests.
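
The warning above appears when no USER_AGENT environment variable is set at the time langchain_community is imported. A minimal sketch of one way to avoid it, assuming the variable is set before that import runs; the identifier string is a hypothetical placeholder, not a value from this notebook:

```python
import os

# Hypothetical identifier (an assumption); replace it with a string describing your project.
# setdefault keeps any value already supplied by the shell or the .env file.
os.environ.setdefault("USER_AGENT", "brandeis-rag-notebook/0.1")
```

Adding USER_AGENT to the .env file read by load_dotenv() should have the same effect, as long as load_dotenv() runs before the loaders are imported.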


In [4]:
import nest_asyncio
nest_asyncio.apply()

Loading Documents

In [5]:
# Alternative: FireCrawl could crawl the bulletin directly; API cost still to be checked.
# from langchain_community.document_loaders import FireCrawlLoader
# crawl_params = {
#     'crawlerOptions': {
#         'includes': ['#skip-content'],
#         'limit': 5,
#     }
# }
# loader = FireCrawlLoader(
#     api_key=os.environ["FIRE_CRAWL_API"],
#     url="https://www.brandeis.edu/registrar/bulletin/provisional/courses/subjects/1400.html",
#     mode="crawl",
#     params=crawl_params,
# )

In [6]:
# pages = []
# for doc in loader.lazy_load():
#     pages.append(doc)
#     if len(pages) >= 10:
#         pages = []

In [7]:
def get_plain_text_with_header(content: BeautifulSoup) -> str:
    """Return the text inside the page's #skip-content element, or "" if it is absent."""
    skip_content = content.find(id="skip-content")
    return skip_content.get_text(strip=True) if skip_content else ""
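
With get_text(strip=True) and no separator, BeautifulSoup joins neighbouring text nodes directly, which is why the loaded text contains runs such as "theGraduate School of Arts and Sciencesand". A small sketch of a possible variant that passes separator=" "; the function name get_spaced_text and the sample HTML are made up for illustration and are not part of the notebook:

```python
from bs4 import BeautifulSoup


def get_spaced_text(content: BeautifulSoup) -> str:
    """Hypothetical variant of the parsing function that keeps a space between text nodes."""
    skip_content = content.find(id="skip-content")
    return skip_content.get_text(separator=" ", strip=True) if skip_content else ""


# Made-up sample page, used only to compare the two behaviours.
sample_html = (
    '<div id="skip-content"><h1>Title</h1>'
    '<p>See the <a href="#">Bulletin</a> for details.</p></div>'
)
soup = BeautifulSoup(sample_html, "html.parser")

joined = soup.find(id="skip-content").get_text(strip=True)  # text nodes glued together
spaced = get_spaced_text(soup)                              # text nodes separated by a space

print(joined)
print(spaced)
```

Passing a function like this as parsing_function would change the text that gets chunked and embedded, so the outputs recorded further down would no longer match exactly.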

In [8]:
loader = SitemapLoader(
    "https://www.brandeis.edu/sitemap.xml",
    filter_urls=["https://www.brandeis.edu/computer-science/undergraduate/"],
    parsing_function=get_plain_text_with_header,
    continue_on_failure=True
)

In [9]:
docs = loader.load()

Fetching pages: 100%|##########| 19/19 [00:01<00:00, 13.38it/s]


Chunking

In [10]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

In [14]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    add_start_index=True
)
splits = text_splitter.split_documents(docs)

In [16]:
len(splits)

83
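
To build intuition for chunk_size=1000 and chunk_overlap=200, here is a deliberately simplified fixed-window sketch. It is not the algorithm RecursiveCharacterTextSplitter uses (that splitter prefers to break on separators such as paragraph and line breaks); it only shows how an overlap makes consecutive chunks share text and how a start index can be recorded per chunk. The function name and sample text are assumptions.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size windows that overlap by chunk_overlap characters.

    Simplified illustration only; returns a list of (start_index, chunk) pairs.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


# Made-up 2500-character sample text.
sample = "".join(str(i % 10) for i in range(2500))
chunks = sliding_window_chunks(sample)
for start, chunk in chunks:
    print(start, len(chunk))
```

Each window begins 800 characters after the previous one, so the last 200 characters of one chunk reappear at the start of the next.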

In [15]:
splits[0:10]

[Document(metadata={'source': 'https://www.brandeis.edu/computer-science/undergraduate/bachelor-master.html', 'loc': 'https://www.brandeis.edu/computer-science/undergraduate/bachelor-master.html', 'lastmod': '2022-11-21\n', 'start_index': 0}, page_content="5-year Bachelor's/Master's ProgramThe BA/MS and BS/MS are designed for Brandeis undergraduates who are interested in taking additional computer science courses their senior year and completing an MS the year after they obtain their undergraduate degree.RequirementsAvailable only to Brandeis students who have completed all requirements for the undergraduate Bachelors degree and have performed well in the computer science major and have completed three 100-level COSI electives in addition to those required for their undergraduate degree. Students should apply for the program through theGraduate School of Arts and Sciencesand in consultation with their Undergraduate Advising Head in their senior year, at which time they should propose a

Embedding

In [12]:
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-small", api_key=os.getenv("OPENAI_API_KEY"))
url = "9cb98d40-7c3b-441b-9b6d-4f1ac13eb1fa.europe-west3-0.gcp.cloud.qdrant.io"

In [13]:
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams

In [17]:
qdrant = QdrantVectorStore.from_documents(
    splits,
    embeddings,
    url=url,
    prefer_grpc=True,
    api_key=os.getenv("QDRANT_CLUSTER_KEY"),
    collection_name="brandeis.edu",
)

In [15]:
results = qdrant.similarity_search(
    "What courses do I need to get the cosi major?", k=5
)
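
Conceptually, a similarity search embeds the query and returns the k stored chunks whose vectors are closest to it. The sketch below shows that ranking step in plain Python on made-up three-dimensional vectors, assuming cosine similarity as the distance metric. A real Qdrant collection uses an approximate index over much higher-dimensional embeddings, so treat this as an illustration rather than what the service does internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=5):
    """Return the names of the k vectors most similar to query_vec, best first."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in doc_vecs.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]


# Toy vectors invented for illustration; real embeddings have many more dimensions.
toy_vectors = {
    "major requirements": [0.9, 0.1, 0.0],
    "minor requirements": [0.7, 0.3, 0.1],
    "campus parking": [0.0, 0.1, 0.9],
}
query = [1.0, 0.2, 0.0]

ranked = top_k(query, toy_vectors, k=2)
print(ranked)
```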

In [16]:
for r in results:
    print(r.page_content)

or BIOL 51a or PSYC 51aCOSI 131a: Operating SystemsCOSI 121b: Structure and Interpretation of Computer ProgramsCOSI 130a: Intro. to Theory of ComputationSix ElectivesFull details and recommendations are provided in theUniversity Bulletin.Additional RequirementsMath 8a: Introduction to Probability and Statistics or MATH 36a, ECON 83a, BIOL 51a, or PSYC 51aMATH 10a: Techniques of CalculusAdditional Requirements for Degree with Departmental HonorsGraduation with honors in computer science requires completion and defense of a senior honors thesis. Students interested in senior thesis should contact prospective mentors by the spring of their junior year and should take note of the prerequisites for enrollment in COSI 99d (Senior Research).MinorCore CoursesCOSI 12b: Advanced Programming Techniques in JavaCOSI 21a: Data Structures and the Fundamentals of ComputingFour ElectivesForBusiness Majors,we suggest the following electives:COSI 102a: Software EntrepreneurshipCOSI 125a: Human Computer
o