<a href="https://colab.research.google.com/github/SomewhatJustin/ut-course-search/blob/main/ut_austin_course_semantic_search.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# First, install a compatible version of protobuf
!pip install "protobuf<4.21"

# Then, install the other packages
!pip install -qU \
  "pinecone-client[grpc]"==2.2.1 \
  datasets==2.12.0 \
  sentence-transformers==2.2.2





In [None]:
from datasets import load_dataset

dataset = load_dataset("ganoot/ut-courses", split='train')
dataset



Dataset({
    features: ['Course Code', 'Course Name', 'Description'],
    num_rows: 10675
})

In [None]:
dataset[:5]

{'Course Code': ['A I 388', 'A I 388U', 'A I 389L', 'A I 391L', 'A I 394D'],
 'Course Name': ['Natural Language Processing.',
  'Planning, Search, and Reasoning Under Uncertainty.',
  'Automated Logical Reasoning.',
  'Machine Learning.',
  'Deep Learning.'],
 'Description': ['Explore computational methods for syntactic and semantic analysis of structures representing meanings of natural language; study of current natural language processing systems; methods for computing outlines and discourse structures of descriptive text. Three lecture hours a week for one semester. Artificial Intelligence 388 and Computer Science 388 may not both be counted. Prerequisite: Graduate standing, and a course in artificial intelligence or consent of instructor.',
  'Introduction to three key foundational problems in AI: planning, search, and reasoning under uncertainty. Investigate how to define planning domains, including representations for world states and actions, covering both symbolic and path pla

In [None]:
# Initialize an empty list to store the concatenated course information
courses = []

def truncate_to_tokens(text, max_tokens):
    """
    Truncate the text to at most `max_tokens` whitespace-separated words.

    Note: "tokens" here means words produced by str.split(), not the
    subword tokens produced by the embedding model's tokenizer.
    :param text: Input text string.
    :param max_tokens: Maximum number of words to keep.
    :return: Truncated text.
    """
    tokens = text.split()  # Split on whitespace into words
    truncated_tokens = tokens[:max_tokens]  # Keep only the first 'max_tokens' words
    return ' '.join(truncated_tokens)  # Join the words back into a single string


# Iterate through each record in the dataset
for record in dataset:
    # Concatenate the course code, course name, and description
    course_info = f"{record['Course Code']} - {record['Course Name']} - {record['Description']}"
    course_info = truncate_to_tokens(course_info, 250)

    # Add the concatenated string to the courses list
    courses.append(course_info)

# 'courses' now holds one string per course;
# print the first 5 as a sanity check
for course in courses[:5]:
    print(course)

A I 388 - Natural Language Processing. - Explore computational methods for syntactic and semantic analysis of structures representing meanings of natural language; study of current natural language processing systems; methods for computing outlines and discourse structures of descriptive text. Three lecture hours a week for one semester. Artificial Intelligence 388 and Computer Science 388 may not both be counted. Prerequisite: Graduate standing, and a course in artificial intelligence or consent of instructor.
A I 388U - Planning, Search, and Reasoning Under Uncertainty. - Introduction to three key foundational problems in AI: planning, search, and reasoning under uncertainty. Investigate how to define planning domains, including representations for world states and actions, covering both symbolic and path planning. Study algorithms to efficiently find valid plans with or without optimality, and partially ordered, or fully specified solutions. Three lecture hours a week for one semest
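The truncation helper above counts whitespace-separated words, which is only a rough proxy for the subword tokens an embedding model sees. The following is a minimal, standalone sketch of the same word-based approach; the name `truncate_to_words` and the sample string are illustrative and not part of the notebook.

```python
def truncate_to_words(text: str, max_words: int) -> str:
    """Keep at most `max_words` whitespace-separated words (not model tokens)."""
    return " ".join(text.split()[:max_words])


# Illustrative sample built from the first course's code and name.
sample = "A I 388 - Natural Language Processing. - Explore computational methods"

short = truncate_to_words(sample, 5)
print(short)               # A I 388 - Natural
print(len(short.split()))  # 5

# Text that is already short enough is returned with its words unchanged.
print(truncate_to_words("Deep Learning.", 250))  # Deep Learning.
```

Note that `str.split()` also collapses runs of whitespace, so the returned string may differ from the input in spacing even when nothing is cut.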

In [None]:
from sentence_transformers import SentenceTransformer
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
if device != 'cuda':
    print(f"You are using {device}. This is much slower than using "
          "a CUDA-enabled GPU. If on Colab you can change this by "
          "clicking Runtime > Change runtime type > GPU.")

model = SentenceTransformer('all-MiniLM-L6-v2', device=device)
model

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Normalize()
)

In [None]:
query = 'python for data science'

xq = model.encode(query)
xq.shape

(384,)

In [None]:
_id = '0'
metadata = {'text': query}

vectors = [(_id, xq, metadata)]

In [None]:
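import os
from getpass import getpass

# Prompt only when the variables are not already set.
# getpass avoids echoing the API key into the notebook output.
if 'PINECONE_API_KEY' not in os.environ:
    os.environ['PINECONE_API_KEY'] = getpass("Enter Pinecone API Key: ")
if 'PINECONE_ENVIRONMENT' not in os.environ:
    os.environ['PINECONE_ENVIRONMENT'] = input("Enter Pinecone Environment: ")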


In [None]:
import os
import pinecone

# get api key from app.pinecone.io
api_key = os.environ.get('PINECONE_API_KEY') or 'PINECONE_API_KEY'
# find your environment next to the api key in pinecone console
env = os.environ.get('PINECONE_ENVIRONMENT') or 'PINECONE_ENVIRONMENT'

pinecone.init(
    api_key=api_key,
    environment=env
)

In [None]:
index_name = 'semantic-search'

In [None]:
# only create index if it doesn't exist
if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        name=index_name,
        dimension=model.get_sentence_embedding_dimension(),
        metric='cosine'
    )

# now connect to the index
index = pinecone.GRPCIndex(index_name)
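The index above is created with `metric='cosine'`. As a rough illustration of what that metric measures, here is a pure-Python sketch of cosine similarity on toy vectors. The function name and the 2-D inputs are assumptions for illustration; the real embeddings have 384 dimensions, and this is not how Pinecone computes scores internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 2-D vectors, illustrative only.
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # same direction: 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal: 0.0
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite: -1.0
```

Because the model pipeline printed earlier ends with a `Normalize()` step, its embeddings are unit length, in which case cosine similarity reduces to a plain dot product.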

In [None]:
from tqdm.auto import tqdm

batch_size = 128
vector_limit = 100000

courses = courses[:vector_limit]

for i in tqdm(range(0, len(courses), batch_size)):
    # find end of batch
    i_end = min(i+batch_size, len(courses))
    # create IDs batch
    ids = [str(x) for x in range(i, i_end)]
    # create metadata batch
    metadatas = [{'text': text} for text in courses[i:i_end]]
    # create embeddings
    xc = model.encode(courses[i:i_end])
    # create records list for upsert
    records = zip(ids, xc, metadatas)
    # upsert to Pinecone
    index.upsert(vectors=records)

# check number of records in the index
index.describe_index_stats()

{'dimension': 384,
 'index_fullness': 0.10496,
 'namespaces': {'': {'vector_count': 10496}},
 'total_vector_count': 10496}
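The upsert loop above walks the course list in fixed-size slices. The slicing arithmetic can be checked in isolation with a small sketch; `batch_ranges` is a hypothetical helper, and the numbers 10675 (dataset rows) and 128 (batch size) are taken from the cells above.

```python
def batch_ranges(n_items, batch_size):
    """Yield (start, end) index pairs covering range(n_items) in order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, n_items, batch_size):
        yield start, min(start + batch_size, n_items)


ranges = list(batch_ranges(10675, 128))
print(len(ranges))   # 84
print(ranges[0])     # (0, 128)
print(ranges[-1])    # (10624, 10675)

# The string IDs for a batch are the row indices in that range.
start, end = ranges[-1]
ids = [str(i) for i in range(start, end)]
print(len(ids))      # 51
```

The final batch is shorter than the others (51 items), which is why the loop clamps the end index with `min`.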

In [None]:
query = "history of weapons of mass destruction"

# create the query vector
xq = model.encode(query).tolist()

# now query
results = index.query(xq, top_k=5, include_metadata=True)

# each match carries a similarity score and the stored course text
for result in results['matches']:
    print(f"{round(result['score'], 2)}: {result['metadata']['text']}")

0.42: M S 304 - American Military History: 1775 to Present. - Covers development of American Profession of Arms from a "dual military tradition" evaluating military leadership at the tactical, operational, and strategic levels of war. Explores ways in which Industrial Revolution transformed the United States and other societies organized armed violence. Three lecture hours a week for one semester.
0.41: N S 326 - Evolution of Warfare. - Explores the forms of warfare employed by great leaders in history as they relate to the evolution of warfare. Three lecture hours a week for one semester. Prerequisite: Consent of instructor.
0.39: HIS 361J - Medieval Warfare. - Examine the development of warfare between the late Roman Empire and the early modern world (c. 400-1500), including a brief retrospective on war in the ancient world. Three lecture hours a week for one semester. Only one of the following may be counted: History 361J, 362K (Topic: Medieval Warfare), 362K (Topic 1). Prerequisite
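The top scores here are fairly low (around 0.4), so it can help to drop weak matches before displaying them. Below is a minimal sketch that works on a plain dictionary shaped like the response used above (`matches`, `score`, `metadata['text']`). The helper `format_matches`, the `min_score` cutoff of 0.40, and the shortened sample texts are assumptions for illustration, not part of the Pinecone client.

```python
def format_matches(response, min_score=0.0):
    """Return 'score: text' lines for matches scoring at least `min_score`."""
    lines = []
    for match in response["matches"]:
        if match["score"] >= min_score:
            lines.append(f"{match['score']:.2f}: {match['metadata']['text']}")
    return lines


# Hand-written stand-in for a query response; texts are shortened to code and name.
sample_response = {
    "matches": [
        {"score": 0.42, "metadata": {"text": "M S 304 - American Military History: 1775 to Present."}},
        {"score": 0.41, "metadata": {"text": "N S 326 - Evolution of Warfare."}},
        {"score": 0.39, "metadata": {"text": "HIS 361J - Medieval Warfare."}},
    ]
}

for line in format_matches(sample_response, min_score=0.40):
    print(line)
```

With this cutoff only the first two sample matches are printed. A suitable threshold depends on the model and data, so any fixed value should be treated as a starting point.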