In [2]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained('BeIR/query-gen-msmarco-t5-large-v1')
model = T5ForConditionalGeneration.from_pretrained('BeIR/query-gen-msmarco-t5-large-v1')

docs = ["Learn the alignment foundations of yoga while building strength, flexibility, and vitality! This class is for those who are new to yoga or those looking to fine tune their practice. Students will learn the \"ins and outs\" of a variety of standing poses, back bends, and maybe even a few inversions or arm balances! Modifications will be given to accommodate all levels. Each class will also include breath instruction and poses for stress reduction.",
        "An introduction to the actor's technique and performance skills, exploring the elements necessary to begin training as an actor, i.e., observation, concentration, and imagination. Focus is on physical and vocal exercises, improvisation, and text and character. There is required play reading, play attendance, and some scene study.",
        "Generative models are a class of machine learning algorithms that define probability distributions over complex, high-dimensional objects such as images, sequences, and graphs. Recent advances in deep neural networks and optimization algorithms have significantly enhanced the capabilities of these models and renewed research interest in them. This course explores the foundational probabilistic principles of deep generative models, their learning algorithms, and popular model families, which include variational autoencoders, generative adversarial networks, autoregressive models, and normalizing flows. The course also covers applications in domains such as computer vision, natural language processing, and biomedicine, and draws connections to the field of reinforcement learning."
       ]


for para in docs:
    input_ids = tokenizer.encode(para, return_tensors='pt')
    outputs = model.generate(
        input_ids=input_ids,
        max_length=64,
        do_sample=True,
        top_p=0.95,
        num_return_sequences=5)

    print("Paragraph:")
    print(para)

    print("\nGenerated Queries:")
    for i in range(len(outputs)):
        query = tokenizer.decode(outputs[i], skip_special_tokens=True)
        print(f'{i + 1}: {query}')

Paragraph:
Learn the alignment foundations of yoga while building strength, flexibility, and vitality! This class is for those who are new to yoga or those looking to fine tune their practice. Students will learn the "ins and outs" of a variety of standing poses, back bends, and maybe even a few inversions or arm balances! Modifications will be given to accommodate all levels. Each class will also include breath instruction and poses for stress reduction.

Generated Queries:
1: can you have a regular yoga class and do inversions
2: how yoga helps
3: what poses can you do during yoga class
4: where to learn posing and breathing
5: are yoga classes good?
Paragraph:
An introduction to the actor's technique and performance skills, exploring the elements necessary to begin training as an actor, i.e., observation, concentration, and imagination. Focus is on physical and vocal exercises, improvisation, and text and character. There is required play reading, play attendance, and some scene stu

In [1]:
from transformers import T5Tokenizer, T5ForConditionalGeneration
import os.path

tokenizer = T5Tokenizer.from_pretrained('BeIR/query-gen-msmarco-t5-large-v1')
model = T5ForConditionalGeneration.from_pretrained('BeIR/query-gen-msmarco-t5-large-v1')
SCRAPE_SEM = "FA23"

from tqdm import tqdm
import json

descriptions = []
gen_queries = []

# Load the scraped courses; index 4 of each record holds the course description.
with open(f'./json/{SCRAPE_SEM}/parsed_courses.json', 'r') as f:
    parsed_data = json.load(f)
for course_data in tqdm(parsed_data):
    descriptions.append(course_data[4])

synthetic_data_path = f"./json/{SCRAPE_SEM}/synthetic_queries.json"

if not os.path.isfile(synthetic_data_path):
    for para in tqdm(descriptions):
        input_ids = tokenizer.encode(para, return_tensors='pt')
        outputs = model.generate(
            input_ids=input_ids,
            max_length=64,
            do_sample=True,
            top_p=0.95,
            num_return_sequences=5)

        for i in range(len(outputs)):
            query = tokenizer.decode(outputs[i], skip_special_tokens=True)
            gen_queries.append([query, para])

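    # Cache the generated pairs so the multi-hour generation step runs only once.
    with open(synthetic_data_path, 'w') as f:
        json.dump(gen_queries, f)
else:
    with open(synthetic_data_path, 'r') as f:
        gen_queries = json.load(f)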

100%|██████████████████████████████████| 3626/3626 [00:00<00:00, 2296322.86it/s]
100%|█████████████████████████████████████| 3626/3626 [4:02:51<00:00,  4.02s/it]


In [2]:
from sentence_transformers import SentenceTransformer, models

word_emb = models.Transformer('sentence-transformers/msmarco-distilbert-base-dot-prod-v3')
pooling = models.Pooling(word_emb.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_emb, pooling])

In [3]:
from sentence_transformers import SentenceTransformer, InputExample, losses, models, datasets

# Each training example pairs a synthetic query with the description it was generated from.
train_data = [InputExample(texts=[query, para]) for query, para in gen_queries]
train_dataloader = datasets.NoDuplicatesDataLoader(train_data, batch_size=4)
train_loss = losses.MultipleNegativesRankingLoss(model)
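
The cell above trains on (query, description) pairs with an in-batch negatives objective: within a batch, each query should score its own description higher than the other descriptions. Five queries share every description, which is presumably why `NoDuplicatesDataLoader` is used, since a repeated description in a batch would act as a false negative. The following is a minimal pure-Python sketch of that idea, not the library's implementation. The helper names and toy vectors are hypothetical, and the scale of 20 with cosine similarity is assumed to match the library defaults.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def in_batch_ranking_loss(query_vecs, doc_vecs, scale=20.0):
    """Mean cross-entropy where the correct 'class' for query i is document i.

    Every other document in the batch serves as a negative for query i.
    """
    total = 0.0
    for i, q in enumerate(query_vecs):
        logits = [scale * cosine(q, d) for d in doc_vecs]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[i]
    return total / len(query_vecs)

# Toy 2-D "embeddings", made up purely for illustration.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_docs = [[0.9, 0.1], [0.1, 0.9]]   # doc i is close to query i
swapped_docs = [[0.1, 0.9], [0.9, 0.1]]   # doc i is close to the other query

loss_aligned = in_batch_ranking_loss(queries, aligned_docs)
loss_swapped = in_batch_ranking_loss(queries, swapped_docs)
print(loss_aligned < loss_swapped)
```

The loss is near zero when each query sits closest to its own description and grows when a different description in the batch scores higher.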

In [5]:
num_epochs = 5
warmup_steps = int(len(train_dataloader) * num_epochs * 0.1)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=num_epochs, warmup_steps=warmup_steps, show_progress_bar=True)
os.makedirs(f'models_{SCRAPE_SEM}', exist_ok=True)
model.save(f'models_{SCRAPE_SEM}/course-embeddings')

Epoch:   0%|          | 0/5 [00:00<?, ?it/s]

Iteration:   0%|          | 0/4532 [00:00<?, ?it/s]

Iteration:   0%|          | 0/4532 [00:00<?, ?it/s]

Iteration:   0%|          | 0/4532 [00:00<?, ?it/s]

Iteration:   0%|          | 0/4532 [00:00<?, ?it/s]

Iteration:   0%|          | 0/4532 [00:00<?, ?it/s]

In [6]:
from sentence_transformers import SentenceTransformer
import numpy as np
import os.path
import faiss
import json

model = SentenceTransformer(f'models_{SCRAPE_SEM}/course-embeddings')
with open(f'./json/{SCRAPE_SEM}/parsed_courses.json', 'r') as f:
    courses = json.load(f)
course_descs = [course[4] for course in courses]

index_path = f'models_{SCRAPE_SEM}/courses-normalized.index'

if not os.path.isfile(index_path):
    # Embed every description once, then L2-normalize each row so that
    # the inner product used by the index equals cosine similarity.
    desc_embeddings = model.encode(course_descs).astype('float32')
    norms = np.linalg.norm(desc_embeddings, axis=1, keepdims=True)
    norm_embeddings = desc_embeddings / norms

    # Take the dimension from the data instead of hard-coding 768.
    index = faiss.IndexIDMap(faiss.IndexFlatIP(norm_embeddings.shape[1]))
    index.add_with_ids(norm_embeddings, np.arange(len(norm_embeddings), dtype=np.int64))
    faiss.write_index(index, index_path)
else:
    # Read back the same file that the branch above writes.
    index = faiss.read_index(index_path)

In [7]:
def course_rec(query, topn=10):
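    # Embed the query and L2-normalize it to match the normalized index vectors.
    emb = model.encode([query]).astype('float32')
    norm_emb = emb / np.linalg.norm(emb)
    # search returns (scores, ids); keep the ids for the single query row.
    _, ids = index.search(norm_emb, topn)
    top_res = ids[0]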

    for doc in top_res:
        print(f"{courses[doc][2]} ({courses[doc][0]} {courses[doc][1]})")
        print(course_descs[doc])
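
Both the indexed descriptions and the query are L2-normalized before the inner-product search, so the ranking is by cosine similarity and does not depend on vector length. The sketch below shows the same ranking done by brute force in pure Python. The function names and toy vectors are hypothetical and stand in for the real embeddings and the FAISS index.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def inner_product(u, v):
    return sum(a * b for a, b in zip(u, v))

def top_n(query, doc_vectors, n):
    """Return the indices of the n documents with the highest cosine similarity."""
    q = l2_normalize(query)
    scored = [(inner_product(q, l2_normalize(d)), i) for i, d in enumerate(doc_vectors)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:n]]

# Toy 2-D vectors, made up purely for illustration.
toy_docs = [[3.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
ranking = top_n([10.0, 1.0], toy_docs, n=2)
print(ranking)  # [0, 1]

# Rescaling the query leaves the ranking unchanged, because only direction matters.
assert top_n([1.0, 0.1], toy_docs, n=2) == ranking
```

A flat inner-product index performs this exhaustive comparison over all stored vectors, which is workable for a few thousand course descriptions.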

In [8]:
course_rec("fascinating computer science theory")

Discrete Structures (CS 2800)
Covers the mathematics that underlies most of computer science. Topics include mathematical induction; logical proof; propositional and predicate calculus; combinatorics and discrete mathematics; some basic elements of basic probability theory; basic number theory; sets, functions, and relations; graphs; and finite state machines. These topics are discussed in the context of applications to many areas of computer science, such as the RSA cryptosystem and web searching.
Computer System Organization (CS 3410)
Introduction to computer organization, systems programming and the hardware/ software interface. Topics include instruction sets, computer arithmetic, datapath design, data formats, addressing modes, memory hierarchies including caches and virtual memory, I/O devices, bus based I/O systems, and multicore architectures. Students learn assembly language programming and design a pipelined RISC processor.
Business Computing (HADM 1740)
Provides a foundation

In [9]:
course_rec("advanced natural language processing and robot learning", 20)

Natural Lang Process (LING 4474)
This course constitutes an introduction to natural language processing (NLP), the goal of which is to enable computers to use human languages as input, output, or both. NLP is at the heart of many of todays most exciting technological achievements, including machine translation, automatic conversational assistants and Internet search. Possible topics include methods for handling underlying linguistic phenomena (e.g., syntactic analysis, word sense disambiguation and discourse analysis) and vital emerging applications (e.g., machine translation, sentiment analysis, summarization and information extraction). 
Natural Lang Process (CS 4740)
This course constitutes an introduction to natural language processing (NLP), the goal of which is to enable computers to use human languages as input, output, or both. NLP is at the heart of many of todays most exciting technological achievements, including machine translation, automatic conversational assistants and I

In [10]:
course_rec("math courses with applications to machine learning and computer science")

Machine Learning Engineering (CS 5781)
Machine learning is increasingly driven by advances in the underlying hardware and software systems. This course will focus on the challenges inherent to engineering machine learning systems to be correct, robust, and fast. The course walks through the development of a software library for machine learning from scratch, with each assignment requiring students to build models in their own library. Topics will include: tensor languages and auto differentiation; model debugging, testing, and visualization; fundamentals of GPUs; compression and low power inference. Guest lectures will cover current topics from ML engineers.
 
Advanced Machine Learning (CS 6780)
Gives a graduate level introduction to machine learning and in depth coverage of new and advanced methods in machine learning, as well as their underlying theory. Emphasizes approaches with practical relevance and discusses a number of recent applications of machine learning in areas like infor

In [11]:
course_rec("metaphysics and epistemology")

Epistemology (PHIL 3610)
This course will be an advanced introduction to some contemporary debates in epistemology. We will start by considering skeptical arguments that we cannot really know whether the world is the way it appears to us. We will look at different strategies to respond to such skeptical arguments, in particular contextualism, and explore questions concerning the nature of knowledge and the relation between knowledge and other epistemologically significant concepts, such as certainty, justification, and evidence. We will also look at Bayesian epistemology and its theoretical underpinnings, at knowledge first approaches to epistemology, at the relation between knowledge and action, and at the compatibility of traditional epistemology with formal epistemology. Also will explore the notion of common knowledge, and issues in social epistemology.
Syntax I (LING 6403)
An advanced introduction to syntactic theory within the principles and parameters/minimalist frameworks. Topi

In [12]:
course_rec("learn to play tennis")

Introduction to Squash (PE 1465)
A beginners course. Rules of the game and basic strokes are taught.  All necessary equipment is furnished. Please note: no black soled shoes are allowed on the courts and safety glasses are required to be worn at all times.
Outdoor Advanced Tennis (PE 1447)
For players with high school team or tournament experience. Skills emphasized are spins, serve and return of serve, volley, overhead smash, court positioning, and playing strategies. Class will meet twice a week for 1 1/2 hours.
Outdoor Intermediate Tennis (PE 1446)
Review and further instruction in strokes: forehand, backhand, serve, volley, and lob. Topspin and underspin are covered along with doubles strategy. All equipment is furnished.
Intermediate Squash (PE 1466)
More advanced techniques than beginning squash. All necessary equipment is furnished. Please note: no black soled shoes are allowed on the courts and safety glasses are required to be worn at all times.
Outdoor Beginning Tennis (PE 14

In [13]:
course_rec("adventurous outdoor PE classes")

Intermediate Figure Skating (PE 1546)
Students in this class should have previous experience (ability to skate forward and backward and safely stop). The course will review the skills taught in the PE1545 and PE1540, as a foundation, and move into learning more advanced turns, stops, and basic jumps and spins. Course will also cover skate safety, proper fitting, and body mechanics of skating. An assessment will be done the first day of class and students grouped accordingly.
OADI Programs (PE 1661)
Outdoors sampler course for PSP students through the PSP program. Activities include events like hiking, canoeing, and trail running.
Advanced Figure Skating (PE 1547)
This course is for the advanced figure skater. Permission of instructor if havent previously taken PE1546 or PE1547 in 2021 2022. Skaters will be learning advanced skills including jumps, spins, turns, edge moves, step sequences, and ice dance patterns. Students will be working independently on ice at times, as well as, group 

In [14]:
course_rec("advanced data structures and object-oriented programming")

Data Struct & Functional Progr (CS 3110)
Advanced programming course that emphasizes functional programming techniques and data structures. Programming topics include recursive and higher order procedures, models of programming language evaluation and compilation, type systems, and polymorphism. Data structures and algorithms covered include graph algorithms, balanced trees, memory heaps, and garbage collection. Also covers techniques for analyzing program performance and correctness.
Visual Data Analytics for Web (INFO 5100)
This course will introduce students to working with data in the context of modern web applications. These include data representation with relational and non relational databases, data mining to find patterns and make predictions, and graphical presentation for visualization.
Data-Driven Web Applications (INFO 3300)
This course will introduce students to working with data in the context of modern web applications. These include data representation with relational 

In [15]:
course_rec("how to take compelling and fascinating photos")

Photography for Non-Majors (ART 1601)
This class is an aesthetic and practical education within the realm of images. Students will become capable in the processes of photography and delve into the history and thinking surrounding the medium. They will learn to relate their images to other images which they have made, as well as to contemporary and historical images. The class includes technical lessons and aesthetic explorations. The class will advance via frequent group critiques. This class is for students who are excited about using photography as a creative and inquiring medium, and concurrently gaining technical knowledge to make this happen.
Photography: Intro Photography (ART 2601)
This course explores camera and lens as devices that frame and translate three dimensional space to a two dimensional surface. Through assignments and individual investigation, students acquire a deeper understanding of visual perception and photography as medium for personal expression. This course i

In [29]:
course_rec("creating a logical and mathematical framework for truth")

Problems in Semantics (PHIL 6700)
In this class we will discuss the properties of truth conditional semantics, with a focus on those phenomena that have been used to question the adequacy of such systems. The course starts of by discussing the fundamental (formal) properties of truth conditional semantics, and the notion of interpretation relative to a model. Then, we will explore different aspects of the grammar of natural languages that have been invoked against such semantic systems, such as vagueness and degree expressions, presuppositional content, indexicals and lexical semantics, a.o.
Problems in Semantics (PHIL 3700)
In this class we will discuss the properties of truth conditional semantics, with a focus on those phenomena that have been used to question the adequacy of such systems. The course starts of by discussing the fundamental (formal) properties of truth conditional semantics, and the notion of interpretation relative to a model. Then, we will explore different aspects