In [2]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained('BeIR/query-gen-msmarco-t5-large-v1')
model = T5ForConditionalGeneration.from_pretrained('BeIR/query-gen-msmarco-t5-large-v1')

docs = ["Learn the alignment foundations of yoga while building strength, flexibility, and vitality! This class is for those who are new to yoga or those looking to fine tune their practice. Students will learn the \"ins and outs\" of a variety of standing poses, back bends, and maybe even a few inversions or arm balances! Modifications will be given to accommodate all levels. Each class will also include breath instruction and poses for stress reduction.",
        "An introduction to the actor's technique and performance skills, exploring the elements necessary to begin training as an actor, i.e., observation, concentration, and imagination. Focus is on physical and vocal exercises, improvisation, and text and character. There is required play reading, play attendance, and some scene study.",
        "Generative models are a class of machine learning algorithms that define probability distributions over complex, high-dimensional objects such as images, sequences, and graphs. Recent advances in deep neural networks and optimization algorithms have significantly enhanced the capabilities of these models and renewed research interest in them. This course explores the foundational probabilistic principles of deep generative models, their learning algorithms, and popular model families, which include variational autoencoders, generative adversarial networks, autoregressive models, and normalizing flows. The course also covers applications in domains such as computer vision, natural language processing, and biomedicine, and draws connections to the field of reinforcement learning."
       ]


for para in docs:
    input_ids = tokenizer.encode(para, return_tensors='pt')
    # Sample five candidate queries per paragraph with nucleus (top-p) sampling.
    outputs = model.generate(
        input_ids=input_ids,
        max_length=64,
        do_sample=True,
        top_p=0.95,
        num_return_sequences=5)

    print("Paragraph:")
    print(para)

    print("\nGenerated Queries:")
    for i, output in enumerate(outputs, start=1):
        query = tokenizer.decode(output, skip_special_tokens=True)
        print(f'{i}: {query}')

Paragraph:
Learn the alignment foundations of yoga while building strength, flexibility, and vitality! This class is for those who are new to yoga or those looking to fine tune their practice. Students will learn the "ins and outs" of a variety of standing poses, back bends, and maybe even a few inversions or arm balances! Modifications will be given to accommodate all levels. Each class will also include breath instruction and poses for stress reduction.

Generated Queries:
1: can you have a regular yoga class and do inversions
2: how yoga helps
3: what poses can you do during yoga class
4: where to learn posing and breathing
5: are yoga classes good?
Paragraph:
An introduction to the actor's technique and performance skills, exploring the elements necessary to begin training as an actor, i.e., observation, concentration, and imagination. Focus is on physical and vocal exercises, improvisation, and text and character. There is required play reading, play attendance, and some scene stu
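
The generation call above samples with `do_sample=True` and `top_p=0.95`. As a rough illustration of what nucleus (top-p) sampling does, here is a minimal NumPy sketch. It is not the `transformers` implementation; the helper name `top_p_filter` and the toy distribution are made up for this example, and the toy call uses `top_p=0.9` only to keep the arithmetic easy to follow.

```python
import numpy as np

def top_p_filter(probs, top_p=0.95):
    """Keep the smallest set of most-probable tokens whose cumulative
    probability reaches top_p, zero out the rest, and renormalize."""
    probs = np.asarray(probs, dtype=np.float64)
    order = np.argsort(probs)[::-1]            # token ids, most probable first
    cumulative = np.cumsum(probs[order])
    # Index of the first token at which the running total reaches top_p.
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1
    cutoff = min(cutoff, len(probs))           # guard against rounding at the tail
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

# Hypothetical next-token distribution over a five-token vocabulary.
toy_probs = [0.5, 0.3, 0.15, 0.04, 0.01]
nucleus = top_p_filter(toy_probs, top_p=0.9)

# Sampling then draws only from the surviving tokens.
rng = np.random.default_rng(0)
next_token = rng.choice(len(nucleus), p=nucleus)
print(nucleus, next_token)
```

With a threshold this close to 1, only the unlikely tail of the distribution is cut, which is one reason the five sampled queries per paragraph differ from one another while staying on topic.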

In [2]:
import json
import os.path

from tqdm import tqdm
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained('BeIR/query-gen-msmarco-t5-large-v1')
model = T5ForConditionalGeneration.from_pretrained('BeIR/query-gen-msmarco-t5-large-v1')

descriptions = []
gen_queries = []

# Each parsed course record keeps its description at index 3.
with open('./parsed_courses.json', 'r') as f:
    parsed_data = json.load(f)
for course_data in tqdm(parsed_data):
    descriptions.append(course_data[3])

synthetic_data_path = "synthetic_queries.json"

# Generation is slow, so reuse the cached queries when the file already exists.
if not os.path.isfile(synthetic_data_path):
    for para in tqdm(descriptions):
        input_ids = tokenizer.encode(para, return_tensors='pt')
        outputs = model.generate(
            input_ids=input_ids,
            max_length=64,
            do_sample=True,
            top_p=0.95,
            num_return_sequences=5)

        # Store each sampled query together with its source description.
        for output in outputs:
            query = tokenizer.decode(output, skip_special_tokens=True)
            gen_queries.append([query, para])

    with open(synthetic_data_path, 'w') as f:
        json.dump(gen_queries, f)
else:
    with open(synthetic_data_path, 'r') as f:
        gen_queries = json.load(f)

100%|██████████████████████████████████| 4119/4119 [00:00<00:00, 5934846.50it/s]


In [3]:
from sentence_transformers import SentenceTransformer, models

word_emb = models.Transformer('sentence-transformers/msmarco-distilbert-base-dot-prod-v3')
pooling = models.Pooling(word_emb.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_emb, pooling])

In [8]:
from sentence_transformers import SentenceTransformer, InputExample, losses, models, datasets

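# Each training pair is (generated query, source description).
train_data = [InputExample(texts=[query, para]) for query, para in gen_queries]
# Every description appears in five pairs, so keep repeated texts out of a batch.
train_dataloader = datasets.NoDuplicatesDataLoader(train_data, batch_size=4)
train_loss = losses.MultipleNegativesRankingLoss(model)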
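
To make the training objective in the cell above concrete, the following is a minimal NumPy sketch of an in-batch-negatives ranking loss of the kind `MultipleNegativesRankingLoss` computes: for each query, its own description is the positive and every other description in the batch acts as a negative. The function name `mnr_loss` and the toy embeddings are hypothetical, and the cosine similarity with a scale of 20 is my understanding of the library defaults rather than something stated in this notebook.

```python
import numpy as np

def mnr_loss(query_emb, doc_emb, scale=20.0):
    """Cross-entropy over scaled cosine similarities, where row i of
    query_emb should rank row i of doc_emb above all other rows."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    scores = scale * (q @ d.T)                          # shape (batch, batch)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    # The correct "class" for query i is document i, i.e. the diagonal.
    return float(-np.mean(np.diag(log_probs)))

# Toy batch of four pairs: perfectly aligned vs. mismatched pairings.
toy = np.eye(4)
aligned_loss = mnr_loss(toy, toy)
shuffled_loss = mnr_loss(toy, toy[[1, 0, 3, 2]])
print(aligned_loss, shuffled_loss)
```

Because other items in the batch are treated as negatives, two pairs sharing the same description in one batch would count a true match as a negative, which is why a loader that avoids duplicate texts within a batch fits this loss.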

In [9]:
num_epochs = 5
# Warm up the learning rate over the first 10% of all training steps.
warmup_steps = int(len(train_dataloader) * num_epochs * 0.1)
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=num_epochs,
    warmup_steps=warmup_steps,
    show_progress_bar=True)
os.makedirs('models', exist_ok=True)
model.save('models/course-embeddings')

Epoch:   0%|          | 0/5 [00:00<?, ?it/s]

Iteration:   0%|          | 0/5148 [00:00<?, ?it/s]
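
As a sanity check on the 5148 iterations per epoch shown in the progress bar, the arithmetic can be reproduced from numbers that appear earlier in the notebook. This is a sketch under two assumptions: that `gen_queries` holds exactly five queries for each of the 4119 descriptions, and that the data loader drops the incomplete final batch.

```python
num_descriptions = 4119       # length of parsed_data, from the tqdm output
queries_per_description = 5   # num_return_sequences in the generation cell
batch_size = 4
num_epochs = 5

num_pairs = num_descriptions * queries_per_description
# Assumption: the incomplete final batch is dropped, hence floor division.
steps_per_epoch = num_pairs // batch_size
total_steps = steps_per_epoch * num_epochs
warmup_steps = int(total_steps * 0.1)

print(num_pairs, steps_per_epoch, total_steps, warmup_steps)
```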


In [15]:
from sentence_transformers import SentenceTransformer
import numpy as np
import os.path
import faiss
import json

model = SentenceTransformer('models/course-embeddings')
with open('parsed_courses.json', 'r') as f:
    courses = json.load(f)
course_descs = [course[3] for course in courses]

# Embed every description and L2-normalize each row so that the inner
# product between two rows equals their cosine similarity.
desc_embeddings = model.encode(course_descs).astype('float32')
norms = np.linalg.norm(desc_embeddings, axis=1, keepdims=True)
norm_embeddings = (desc_embeddings / norms).astype('float32')

index_path = 'models/courses-normalized.index'
if not os.path.isfile(index_path):
    # Exact inner-product index; ids are row positions in `courses`.
    index = faiss.IndexIDMap(faiss.IndexFlatIP(norm_embeddings.shape[1]))
    index.add_with_ids(norm_embeddings, np.arange(len(norm_embeddings), dtype='int64'))
    faiss.write_index(index, index_path)
else:
    # Load the same file that the existence check above looks for.
    index = faiss.read_index(index_path)

[0.99999994 1.         0.99999994 ... 1.         1.         1.        ]
(4119,)
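
The index cell relies on the fact that, once vectors are L2-normalized, an inner-product search ranks by cosine similarity. The following brute-force NumPy sketch illustrates that equivalence on random toy vectors; it does not use FAISS, and all names and sizes in it are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
docs = rng.normal(size=(6, 8)).astype('float32')    # six toy "descriptions"
query = rng.normal(size=(1, 8)).astype('float32')   # one toy query

# L2-normalize rows, as the notebook does before indexing and searching.
docs_n = docs / np.linalg.norm(docs, axis=1, keepdims=True)
query_n = query / np.linalg.norm(query, axis=1, keepdims=True)

# Inner product of normalized vectors ...
ip_scores = (query_n @ docs_n.T)[0]
# ... matches cosine similarity of the raw vectors.
cosine = (query @ docs.T)[0] / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))

# Exhaustive top-3 by inner product, the ranking a flat index would give.
top3 = np.argsort(-ip_scores)[:3]
print(top3, ip_scores[top3])
```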



In [5]:
def course_rec(query, topn=10):
    # Embed the query and L2-normalize it to match the normalized index.
    emb = model.encode([query])
    norm_emb = emb / np.linalg.norm(emb)
    # index.search returns (scores, ids), each of shape (1, topn).
    _, ids = index.search(norm_emb, topn)

    for doc in ids[0]:
        print(f"{courses[doc][2]} ({courses[doc][0]} {courses[doc][1]})")
        print(course_descs[doc])

In [6]:
course_rec("fascinating computer science theory")

Discrete Structures - Honors (CS 2802)
Covers the mathematics that underlies most of computer science. Topics include mathematical induction; logical proof; propositional and predicate calculus; combinatorics and discrete mathematics; some basic elements of basic probability theory; basic number theory; sets, functions, and relations; graphs; and finite-state machines. These topics are discussed in the context of applications to many areas of computer science, such as the RSA cryptosystem and web searching.   This course is an honors version of CS 2800.  It will cover essentially the same material, but go into more depth.
Discrete Structures (CS 2800)
Covers the mathematics that underlies most of computer science. Topics include mathematical induction; logical proof; propositional and predicate calculus; combinatorics and discrete mathematics; some basic elements of basic probability theory; basic number theory; sets, functions, and relations; graphs; and finite-state machines. These t

In [7]:
course_rec("advanced natural language processing and robot learning", 20)

Natural Lang Processing & ML (CS 6741)
Robust language understanding has the potential to transform how we interact with computers, extract information from text and study language on large scale. This research-oriented course examines machine learning and inference methods for recovering language structure and meaning. Possible topics include structured prediction and deep learning, methods for situated language understanding, language grounding, and learning to generate text.
Advanced Language Technologies (INFO 6300)
Graduate-level introduction to technologies for the computational treatment of information in human-language form, covering modern natural-language processing (NLP) and/or information retrieval (IR). Possible topics include language modeling, word embeddings, text categorization and clustering, information extraction, computational syntactic and semantic formalisms, grammar induction, machine translation, latent semantic analysis (LSI), and clickthrough data for web sea

In [8]:
course_rec("math courses with applications to machine learning and computer science")

Discrete Structures - Honors (CS 2802)
Covers the mathematics that underlies most of computer science. Topics include mathematical induction; logical proof; propositional and predicate calculus; combinatorics and discrete mathematics; some basic elements of basic probability theory; basic number theory; sets, functions, and relations; graphs; and finite-state machines. These topics are discussed in the context of applications to many areas of computer science, such as the RSA cryptosystem and web searching.   This course is an honors version of CS 2800.  It will cover essentially the same material, but go into more depth.
Big Data Technologies (ORIE 5270)
This course offers a broad overview of computational techniques and mathematical skills useful for data scientists. Topics include: unix shell, regular expressions, version control: (git), data structures and algorithms, working with databases, data analysis using Python and related libraries (Pandas, NumPy/Scipy, scikit-learn), paral

In [9]:
course_rec("metaphysics and epistemology")

Knowledge&Reality (PHIL 2610)
An introduction to some central philosophical questions about knowledge and reality. Questions to be addressed may include: What, if anything, do we know? What is it for a belief to be reasonable? What is it for one event to cause another event? What makes the person reading the beginning of this sentence the same as the person reading the end of this sentence? Readings are typically drawn from recent sources.
Introduction To Philosophy (PHIL 1100)
A general introduction to some of the main topics, texts, and methods of philosophy. Topics may include the existence of God, the nature of mind and its relation to the body, causation, free will, knowledge and skepticism, and justice and moral obligation. Readings may be drawn from the history of philosophy and contemporary philosophical literature.
Modern Philosophy (PHIL 2220)
A survey of Western philosophy in the 17th and 18th centuries: Descartes, Locke, Spinoza, Leibniz, Berkeley, Hume, and Kant. We focus 

In [10]:
course_rec("learn to play tennis")

Outdoor Advanced Tennis (PE 1447)
For players with high school team or tournament experience. Skills emphasized are spins, serve and return of serve, volley, overhead smash, court positioning, and playing strategies. Class will meet twice a week for 1 1/2 hours.
Outdoor Intermediate Tennis (PE 1446)
Review and further instruction in strokes: forehand, backhand, serve, volley, and lob. Topspin and underspin are covered along with doubles strategy. All equipment is furnished.
Outdoor Beginning Tennis (PE 1445)
Instruction and practice in the basic skills of the game. Grip, serve, forehand, backhand, and lob are areas covered along with scoring systems. All equipment is furnished.
Introduction to Squash (PE 1465)
A beginner's course. Rules of the game and basic strokes are taught.  All necessary equipment is furnished. Please note: no black soled shoes are allowed on the courts and safety glasses are required to be worn at all times.
Advanced Volleyball (PE 1571)
For the experienced playe

In [11]:
course_rec("adventurous outdoor PE classes")

Open Gym (PE 1246)
This Open Gym PE course offers an exclusive time and space to students who have previous experience with exercise. The Appel Fitness Center offers a wide variety of free weight and cardio equipment, and experience with this equipment is mandatory for this course. A CFC Personal Trainer will be on staff to supervise and assist students with their training, but independent exercise should be expected.
OADI Programs (PE 1661)
Outdoors sampler course for PSP students through the PSP program. Activities include events like hiking, canoeing, and trail running.
Intro to Camping (PE 1616)
This class is designed for anyone wanting to learn the introductory skills to spend a night in the woods.  As an Introduction to camping this class seeks to teach students about tarps, water purification and campsite selection as well as some basic wilderness knowledge and awareness of gear needed for a one night outing.  Students will end the class with a one night outing in the woods off 

In [12]:
course_rec("advanced data structures and object-oriented programming")

Data Struct & Functional Progr (CS 3110)
Advanced programming course that emphasizes functional programming techniques and data structures. Programming topics include recursive and higher-order procedures, models of programming language evaluation and compilation, type systems, and polymorphism. Data structures and algorithms covered include graph algorithms, balanced trees, memory heaps, and garbage collection. Also covers techniques for analyzing program performance and correctness.
Obj-Oriented Prog & Data Struc (ENGRD 2110)
Intermediate programming in a high-level language and introduction to computer science. Topics include object-oriented programming (classes, objects, subclasses, types), graphical user interfaces, algorithm analysis (asymptotic complexity, big "O" notation), recursion, testing, program correctness (loop invariants), searching/sorting, data structures (lists, trees, stacks, queues, heaps, search trees, hash tables, graphs), graph algorithms. Java is the principal

In [13]:
course_rec("how to take compelling and fascinating photos")

Photography for Non-Majors (ART 1601)
This class is an aesthetic and practical education within the realm of images. Students will become capable in the processes of photography and delve into the history and thinking surrounding the medium. They will learn to relate their images to other images which they have made, as well as to contemporary and historical images. The class includes technical lessons and aesthetic explorations. The class will advance via frequent group critiques. This class is for students who are excited about using photography as a creative and inquiring medium, and concurrently gaining technical knowledge to make this happen.
Photography: Intro Photography (ART 2601)
This course explores camera and lens as devices that frame and translate three-dimensional space to a two-dimensional surface. Through assignments and individual investigation, students acquire a deeper understanding of visual perception and photography as medium for personal expression. This course i