# Advanced Recommendation!

# Part 1. Non-personalized Recommendations with User Ratings

In this first part, we're going to build a non-personalized recommender based on user ratings.  In many online platforms, such as Amazon, IMDb, and MovieLens, users are able to express their preference to items by explicit ratings (like by assigning a 1-5 star rating to a movie). We're going to use those ratings to generate a recommendation. For this part, we're focusing on **non-personalized** recommendations (that is, everyone gets the same recommendation).

For this part, we will:

* load and process the MovieLens 1M dataset,
* build the non-personalized recommender, and
* evaluate the recommender.

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
import pandas as pd
import numpy as np
from scipy.sparse import coo_matrix

path = "/content/drive/MyDrive/ML_633/ratings-1.dat"
data_df = pd.read_csv(path, sep='::', names=["UserID", "MovieID", "Rating", "Timestamp"], engine='python')

# First, generate dictionaries for mapping old id to new id for users and movies
unique_MovieID = data_df['MovieID'].unique()
unique_UserID = data_df['UserID'].unique()
j = 0
user_old2new_id_dict = dict()
for u in unique_UserID:
    user_old2new_id_dict[u] = j
    j += 1
j = 0
movie_old2new_id_dict = dict()
for i in unique_MovieID:
    movie_old2new_id_dict[i] = j
    j += 1

# Then, use the generated dictionaries to reindex UserID and MovieID in the data_df
user_list = data_df['UserID'].values
movie_list = data_df['MovieID'].values
for j in range(len(data_df)):
    user_list[j] = user_old2new_id_dict[user_list[j]]
    movie_list[j] = movie_old2new_id_dict[movie_list[j]]
data_df['UserID'] = user_list
data_df['movieID'] = movie_list

# generate train_df with 70% samples and test_df with 30% samples, and there should have no overlap between them.
train_index = np.random.random(len(data_df)) <= 0.7
train_df = data_df[train_index]
test_df = data_df[~train_index]

# generate train_mat and test_mat
num_user = len(data_df['UserID'].unique())
num_movie = len(data_df['MovieID'].unique())

train_mat = coo_matrix((train_df['Rating'].values, (train_df['UserID'].values, train_df['MovieID'].values)), shape=(num_user, num_movie)).astype(float).toarray()
test_mat = coo_matrix((test_df['Rating'].values, (test_df['UserID'].values, test_df['MovieID'].values)), shape=(num_user, num_movie)).astype(float).toarray()

## Part 1a: Build the non-personalized recommender

In [None]:
import numpy as np
movie_avgs = np.divide(np.sum(train_mat, axis=0), np.count_nonzero(train_mat, axis=0), where=np.count_nonzero(train_mat, axis=0) != 0)
total_avgs = np.sum(np.sum(train_mat, axis=0)) / np.sum(np.count_nonzero(train_mat, axis=0))
prediction_mat = np.full((num_user, num_movie), total_avgs)

for i in range(num_movie):
    if movie_avgs[i] > 0:
        prediction_mat[:, i] = movie_avgs[i]

Please print out the id of the top-5 movies with largest predicted ratings and their predicted ratings.

In [None]:
user1 = prediction_mat[0, :]
top_5_idx = np.argsort(user1)[-5:][::-1]
top_5_ratings = user1[top_5_idx]
print("Top-5 Movies with Largest Predicted Ratings and their predicted ratings:")
for idx, movie_idx in enumerate(top_5_idx):
    # Convert new movie index back to the original movie ID
    original_movie_id = list(movie_old2new_id_dict.keys())[list(movie_old2new_id_dict.values()).index(movie_idx)]
    print(f"Movie ID: {original_movie_id}, Predicted Rating: {top_5_ratings[idx]:.2f}")

Top-5 Movies with Largest Predicted Ratings and their predicted ratings:
Movie ID: 3382, Predicted Rating: 5.00
Movie ID: 439, Predicted Rating: 5.00
Movie ID: 3607, Predicted Rating: 5.00
Movie ID: 3881, Predicted Rating: 5.00
Movie ID: 2930, Predicted Rating: 5.00


## Part 1b: Evaluate the non-personalized recommender

In [None]:
# calculate and print out the RMSE for your prediction_df and the test_df
sq_error = (prediction_mat[test_mat.nonzero()] - test_mat[test_mat.nonzero()]) ** 2
rmse = np.sqrt(np.mean(sq_error))
print("RMSE:", rmse)

RMSE: 0.9793182255274792


# Part 2. Our own Netflix Prize (sort of)

* Try item-item collaborative filtering instead of user-user CF
* Try to include the baseline estimation model in your collaborative filtering model
* Build an MF model
* Add bias factors to your MF model and learn them
* Add CF to your MF model and learn the CF weights
* Incorporate an LLM (?) into your model
* ...

In [None]:
#item-item CF with KNN
from sklearn.metrics.pairwise import cosine_similarity

# cosine similarity between items
item_sim = cosine_similarity(train_mat.T)
k = 7
knn_ii_pred = np.zeros((num_user, num_movie))

for user in range(num_user):
    for movie in range(num_movie):
        sim_scores = item_sim[movie]
        rated_movies = np.where(train_mat[user] > 0)[0]
        if len(rated_movies) == 0:
            continue
        rated_sim = sim_scores[rated_movies]
        if len(rated_sim) > k:
            top_k_idx = np.argsort(rated_sim)[-k:]
        else:
            top_k_idx = np.argsort(rated_sim)

        top_k_movies = rated_movies[top_k_idx]
        top_k_sims = rated_sim[top_k_idx]
        top_k_ratings = train_mat[user, top_k_movies]

        if np.sum(np.abs(top_k_sims)) > 0:
            knn_ii_pred[user, movie] = np.dot(top_k_sims, top_k_ratings) / np.sum(np.abs(top_k_sims))
        else:
            knn_ii_pred[user, movie] = 0


In [7]:
# print your best RMSE for the test set
sq_error = (knn_ii_pred[test_mat.nonzero()] - test_mat[test_mat.nonzero()]) ** 2
rmse = np.sqrt(np.mean(sq_error))
print(f"RMSE of item-item CF with KNN: {rmse:.4f}")

RMSE of item-item CF with KNN: 0.9441


In [8]:
# Baseline estimation in CF (bias)
mu = np.mean(train_mat[train_mat > 0])
bu = np.zeros(train_mat.shape[0])
for user in range(train_mat.shape[0]):
    user_ratings = train_mat[user, :]
    rated_indices = user_ratings > 0
    if rated_indices.sum() > 0:
        bu[user] = np.mean(user_ratings[rated_indices]) - mu

bi = np.zeros(train_mat.shape[1])
for movie in range(train_mat.shape[1]):
    movie_ratings = train_mat[:, movie]
    rated_indices = movie_ratings > 0
    if rated_indices.sum() > 0:
        bi[movie] = np.mean(movie_ratings[rated_indices]) - mu

# Predict the rating: b_ui = mu + bi + bu
baseline_pred = mu + bu[:, np.newaxis] + bi
print(baseline_pred)

[[4.92649175 3.99390152 4.71326001 ... 1.54431024 5.54431024 4.54431024]
 [4.46464965 3.53205941 4.2514179  ... 1.08246813 5.08246813 4.08246813]
 [4.71815842 3.78556818 4.50492667 ... 1.3359769  5.3359769  4.3359769 ]
 ...
 [4.61399175 3.68140152 4.40076001 ... 1.23181024 5.23181024 4.23181024]
 [4.57072252 3.63813229 4.35749078 ... 1.18854101 5.18854101 4.18854101]
 [4.35468324 3.42209301 4.1414515  ... 0.97250173 4.97250173 3.97250173]]


In [9]:
sq_error = (baseline_pred[test_mat.nonzero()] - test_mat[test_mat.nonzero()]) ** 2
rmse = np.sqrt(np.mean(sq_error))
print(f"RMSE of the Baseline: {rmse:.4f}")

RMSE of the Baseline: 0.9340


In [None]:
# MF
def MF(train_mat, num_factors=9, lr=0.01, reg=0.001, epochs=20):
    num_users, num_items = train_mat.shape
    # Initialize latent factors
    User = np.random.normal(scale=0.1, size=(num_users, num_factors))
    Item = np.random.normal(scale=0.1, size=(num_items, num_factors))
    rows, cols = train_mat.nonzero()
    ratings = train_mat[rows, cols]

    for epoch in range(epochs):
        for i in range(len(ratings)):
            u = rows[i]
            m = cols[i]
            rating = ratings[i]
            pred = np.dot(User[u], Item[m])
            error = rating - pred
            # Update latent vectors using SGD
            User[u] += lr * (error * Item[m] - reg * User[u])
            Item[m] += lr * (error * User[u] - reg * Item[m])

        MF_pred = User @ Item.T
        train_loss = np.sqrt(np.mean((train_mat[rows, cols] - MF_pred[rows, cols])**2))
        print(f"Epoch {epoch+1}/{epochs}, RMSE: {train_loss:.4f}")

    return User @ Item.T

mf_pred = MF(train_mat, num_factors=9, lr=0.01, reg=0.1, epochs=20)

Epoch 1/20, RMSE: 2.6768
Epoch 2/20, RMSE: 1.1050
Epoch 3/20, RMSE: 0.9691
Epoch 4/20, RMSE: 0.9477
Epoch 5/20, RMSE: 0.9399
Epoch 6/20, RMSE: 0.9346
Epoch 7/20, RMSE: 0.9294
Epoch 8/20, RMSE: 0.9238
Epoch 9/20, RMSE: 0.9183
Epoch 10/20, RMSE: 0.9132
Epoch 11/20, RMSE: 0.9087
Epoch 12/20, RMSE: 0.9047
Epoch 13/20, RMSE: 0.9012
Epoch 14/20, RMSE: 0.8979
Epoch 15/20, RMSE: 0.8948
Epoch 16/20, RMSE: 0.8919
Epoch 17/20, RMSE: 0.8890
Epoch 18/20, RMSE: 0.8863
Epoch 19/20, RMSE: 0.8837
Epoch 20/20, RMSE: 0.8812


In [None]:
sq_error = (mf_pred[test_mat.nonzero()] - test_mat[test_mat.nonzero()]) ** 2
rmse = np.sqrt(np.mean(sq_error))
print(f"RMSE of the MF: {rmse:.4f}")

RMSE of the MF: 0.9103


In [45]:
# MF with Bias
def MF_bias(train_mat, num_factors=9, lr=0.01, reg=0.01, epochs=20):
    num_users, num_items = train_mat.shape
    # Initialize latent factors
    User = np.random.normal(scale=0.1, size=(num_users, num_factors))
    Item = np.random.normal(scale=0.1, size=(num_items, num_factors))
    #Initialize bias
    ub = np.zeros(num_users)
    ib = np.zeros(num_items)
    mu = np.mean(train_mat[train_mat != 0])
    rows, cols = train_mat.nonzero()
    ratings = train_mat[rows, cols]

    for epoch in range(epochs):
        for i in range(len(ratings)):
            u = rows[i]
            m = cols[i]
            rating = ratings[i]
            pred = mu + ub[u] + ib[m] + np.dot(User[u], Item[m])
            error = rating - pred
            # Update latent vectors using SGD
            ub[u] += lr * (error - reg * ub[u])
            ib[m] += lr * (error - reg * ib[m])
            User[u] += lr * (error * Item[m] - reg * User[u])
            Item[m] += lr * (error * User[u] - reg * Item[m])

        MF_bias_pred = mu + ub[:, np.newaxis] + ib[np.newaxis,:] + User @ Item.T
        train_loss = np.sqrt(np.mean((train_mat[rows, cols] - MF_bias_pred[rows, cols])**2))
        print(f"Epoch {epoch+1}/{epochs}, RMSE: {train_loss:.4f}")

    return mu + ub[:, np.newaxis] + ib[np.newaxis,:] + User @ Item.T

mf_bias_pred = MF_bias(train_mat, num_factors=25, lr=0.01, reg=0.1, epochs=50)

Epoch 1/50, RMSE: 0.9362
Epoch 2/50, RMSE: 0.9141
Epoch 3/50, RMSE: 0.9079
Epoch 4/50, RMSE: 0.9050
Epoch 5/50, RMSE: 0.9031
Epoch 6/50, RMSE: 0.9015
Epoch 7/50, RMSE: 0.8998
Epoch 8/50, RMSE: 0.8975
Epoch 9/50, RMSE: 0.8945
Epoch 10/50, RMSE: 0.8908
Epoch 11/50, RMSE: 0.8869
Epoch 12/50, RMSE: 0.8833
Epoch 13/50, RMSE: 0.8800
Epoch 14/50, RMSE: 0.8771
Epoch 15/50, RMSE: 0.8745
Epoch 16/50, RMSE: 0.8719
Epoch 17/50, RMSE: 0.8694
Epoch 18/50, RMSE: 0.8669
Epoch 19/50, RMSE: 0.8644
Epoch 20/50, RMSE: 0.8619
Epoch 21/50, RMSE: 0.8595
Epoch 22/50, RMSE: 0.8571
Epoch 23/50, RMSE: 0.8548
Epoch 24/50, RMSE: 0.8526
Epoch 25/50, RMSE: 0.8505
Epoch 26/50, RMSE: 0.8485
Epoch 27/50, RMSE: 0.8466
Epoch 28/50, RMSE: 0.8447
Epoch 29/50, RMSE: 0.8429
Epoch 30/50, RMSE: 0.8411
Epoch 31/50, RMSE: 0.8394
Epoch 32/50, RMSE: 0.8378
Epoch 33/50, RMSE: 0.8362
Epoch 34/50, RMSE: 0.8347
Epoch 35/50, RMSE: 0.8332
Epoch 36/50, RMSE: 0.8317
Epoch 37/50, RMSE: 0.8303
Epoch 38/50, RMSE: 0.8289
Epoch 39/50, RMSE: 0.

In [46]:
sq_error = (mf_bias_pred[test_mat.nonzero()] - test_mat[test_mat.nonzero()]) ** 2
rmse = np.sqrt(np.mean(sq_error))
print(f"RMSE of the MF: {rmse:.4f}")

RMSE of the MF: 0.8714


In [6]:
# MF + Bias + CF
from sklearn.metrics.pairwise import cosine_similarity
def MF_bias_ii(train_mat, num_factors=20, lr=0.005, reg=0.01, epochs=20, k=10, alpha=0.5, verbose=True):
    num_users, num_items = train_mat.shape
    # Initialize latent factors
    User = np.random.normal(scale=0.1, size=(num_users, num_factors))
    Item = np.random.normal(scale=0.1, size=(num_items, num_factors))
    #Initialize bias
    ub = np.zeros(num_users)
    ib = np.zeros(num_items)
    mu = np.mean(train_mat[train_mat != 0])
    rows, cols = train_mat.nonzero()
    ratings = train_mat[rows, cols]

    for epoch in range(epochs):
        for i in range(len(ratings)):
            u = rows[i]
            m = cols[i]
            rating = ratings[i]
            pred = mu + ub[u] + ib[m] + np.dot(User[u], Item[m])
            error = rating - pred
            # Update latent vectors using SGD
            ub[u] += lr * (error - reg * ub[u])
            ib[m] += lr * (error - reg * ib[m])
            User[u] += lr * (error * Item[m] - reg * User[u])
            Item[m] += lr * (error * User[u] - reg * Item[m])

        if verbose:
            MF_bias_pred = mu + ub[:, None] + ib[None,:] + User @ Item.T
            rmse = np.sqrt(np.mean((train_mat[rows, cols] - MF_bias_pred[rows, cols])**2))
            print(f"Epoch {epoch+1}/{epochs}, RMSE: {rmse:.4f}")

    MF_bias_pred = mu + ub[:, None] + ib[None,:] + User @ Item.T

    item_sim = cosine_similarity(train_mat.T)
    MF_bias_ii_pred = MF_bias_pred.copy()
    for user in range(num_users):
        rated_items = np.where(train_mat[u]>0)[0]
        for movie in range(num_items):
            sim_scores = item_sim[movie][rated_items]
            ratings = train_mat[user][rated_items]

            if len(sim_scores) == 0 or np.sum(np.abs(sim_scores)) == 0:
                cf_adjust = 0
            else:
                top_k_idx = np.argsort(sim_scores)[-k:]
                top_k_sims = sim_scores[top_k_idx]
                top_k_ratings = ratings[top_k_idx]
                cf_adjust = np.dot(top_k_sims, top_k_ratings) / np.sum(np.abs(top_k_sims) + 1e-8)

            MF_bias_ii_pred[user, movie] = alpha * MF_bias_pred[user, movie] + (1 - alpha) * cf_adjust

    return MF_bias_ii_pred

# mf_bias_ii_pred = MF_bias_ii(train_mat, num_factors=20, k=10, alpha=0.7)
mf_bias_ii_pred = MF_bias_ii(
    train_mat,
    num_factors=40,
    lr=0.005,
    reg=0.05,
    epochs=25,
    k=10,
    alpha=0.9
)

Epoch 1/25, RMSE: 0.9428
Epoch 2/25, RMSE: 0.9176
Epoch 3/25, RMSE: 0.9069
Epoch 4/25, RMSE: 0.9008
Epoch 5/25, RMSE: 0.8966
Epoch 6/25, RMSE: 0.8933
Epoch 7/25, RMSE: 0.8903
Epoch 8/25, RMSE: 0.8874
Epoch 9/25, RMSE: 0.8842
Epoch 10/25, RMSE: 0.8805
Epoch 11/25, RMSE: 0.8761
Epoch 12/25, RMSE: 0.8712
Epoch 13/25, RMSE: 0.8658
Epoch 14/25, RMSE: 0.8602
Epoch 15/25, RMSE: 0.8545
Epoch 16/25, RMSE: 0.8489
Epoch 17/25, RMSE: 0.8433
Epoch 18/25, RMSE: 0.8377
Epoch 19/25, RMSE: 0.8321
Epoch 20/25, RMSE: 0.8267
Epoch 21/25, RMSE: 0.8213
Epoch 22/25, RMSE: 0.8159
Epoch 23/25, RMSE: 0.8106
Epoch 24/25, RMSE: 0.8055
Epoch 25/25, RMSE: 0.8004


In [7]:
sq_error = (mf_bias_ii_pred[test_mat.nonzero()] - test_mat[test_mat.nonzero()]) ** 2
rmse = np.sqrt(np.mean(sq_error))
print(f"RMSE of the MF: {rmse:.4f}")

RMSE of the MF: 0.9029



From the above results, we can observe that Matrix Factorization with bias achieved the lowest RMSE (0.8714), making it the most accurate model. Slightly behind it was Matrix Factorization with bias and Collaborative Filtering (0.9029), followed by standard Matrix Factorization (0.9103), which still performed well but lacked the additional bias terms. The baseline Collaborative Filtering approach with bias (0.9340) showed moderate success, while Item-Item KNN (0.9441) was less accurate, potentially due to sparsity or insufficient overlap in user ratings. Finally, the non-personalized method resulted in the highest RMSE (0.979318), reflecting its inability to tailor recommendations to individual user preferences.

*   Non-personalzied recommendation algorithm -  0.979318
*   Item-item collaboration with KNN - 0.9441
*   Baseline model (CF with bias) - 0.9340
*   Matrix Factorization - 0.9103
*   Matrix Factorization with bias and Collaborative filtering - 0.9029
*   Matrix Factorization with bias - 0.8714

MF with biases gave the best result, perhaps because it effectively blends latent factor modeling with user and item bias corrections, capturing both deeper relationships and consistent rating tendencies. By contrast, adding CF signals could help but wasn’t fine-tuned enough, plain MF lacks bias terms, baseline CF with bias mainly adjusts global averages, item-item KNN struggles with sparse data, and the non-personalized approach ignores individual preferences.

# Part 3. Dual Embedding Space Model Implementation

In [1]:
!pip install numpy==1.24.3



In [2]:
!pip install --upgrade gensim

Collecting gensim
  Downloading gensim-4.3.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (8.1 kB)
Collecting scipy<1.14.0,>=1.7.0 (from gensim)
  Downloading scipy-1.13.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (60 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m60.6/60.6 kB[0m [31m1.7 MB/s[0m eta [36m0:00:00[0m
Downloading gensim-4.3.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (26.7 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m26.7/26.7 MB[0m [31m54.5 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading scipy-1.13.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (38.6 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m38.6/38.6 MB[0m [31m15.9 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: scipy, gensim
  Attempting uninstall: scipy
    Found existing installation: scipy 1.14.1
    Uninstalling scipy-1.14.1:
      Successfully 

In [3]:
import os
import numpy as np
from gensim.models import Word2Vec
from sklearn.metrics.pairwise import cosine_similarity

filepath = "/content/drive/MyDrive/ML_633/enron_814"
docs = []

for filename in sorted(os.listdir(filepath)):
    path = os.path.join(filepath, filename)
    with open(path, 'r') as file:
        raw_text = file.read().lower()
        parts = raw_text.split()
        doc_id_raw = parts[1]
        doc_id = doc_id_raw.strip("<>").split('.')
        doc_id = f"{doc_id[0]}.{doc_id[1]}"
        lines = raw_text.splitlines()
        content_lines = []
        found_start = False

        for line in lines:
            if 'jarnold.nsf' in line:
                found_start = True
                continue
            if found_start:
                content_lines.append(line.strip())
        if not found_start:
            content_lines = [line.strip() for line in lines if line.strip()]

        content = ' '.join(content_lines).split()

        docs.append({
            'Document-ID': doc_id,
            'content': content
        })

sentences = [doc['content'] for doc in docs]
model = Word2Vec(sentences=sentences, vector_size=100, window=5, min_count=1, workers=4)

def get_average_embedding(words):
    vecs = [model.wv[word] for word in words if word in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

doc_embeddings = np.array([get_average_embedding(doc['content']) for doc in docs])

Now show the results for the query: `buyer`

In [None]:
query = 'buyer'
query_embedding = get_average_embedding([query])
scores = cosine_similarity([query_embedding], doc_embeddings)[0]
top_5_idx = scores.argsort()[::-1][:5]

print(f"\nQuery: {query}")
print("Rank\tScore\t\tDocumentID\t\tDocument")
for rank, idx in enumerate(top_5_idx, start=1):
    doc_id = docs[idx]['Document-ID']
    text = ' '.join(docs[idx]['content'])
    print(f"{rank}\t{scores[idx]:.4f}\t{doc_id}\t{text}")


Query: buyer
Rank	Score		DocumentID		Document
1	0.9869	190536.1075857652345	got tix for tonight
2	0.9868	2799094.1075857601603	tickets requisitioned for england/germany. $1500!!!!!!
3	0.9867	12216704.1075857601669	please use this vol curve for a dry run to figure out var for my book, ng price, and jim's book, storage, and communicate the results. thanks,john
4	0.9867	23808285.1075857598702	i'm here... "cooper, sean" <coopers@epenergy.com> on 08/29/2000 01:27:32 pm to: cc: subject: pab deleted my pab file, or for the non technical among you, my outlook personal address book was accidently deleted this week in an upgrade to windows 2000 nt. i have restored an old one, but it is several months, if not a whole year out of date. this is the first message to confirm the current address i have for you is still active. please reply confirming you recieved it. a second message will follow to try and replace some of the address's i know i have lost. thanks for your help sean. ******************

Now show the results for the query: `margins`

In [None]:
query = 'margins'
query_embedding = get_average_embedding([query])
scores = cosine_similarity([query_embedding], doc_embeddings)[0]
top_5_idx = scores.argsort()[::-1][:5]

print(f"\nQuery: {query}")
print("Rank\tScore\t\tDocumentID\t\tDocument")
for rank, idx in enumerate(top_5_idx, start=1):
    doc_id = docs[idx]['Document-ID']
    text = ' '.join(docs[idx]['content'])
    print(f"{rank}\t{scores[idx]:.4f}\t{doc_id}\t{text}")


Query: margins
Rank	Score		DocumentID		Document
1	0.9818	25081271.1075857600086	hello, just checking to see if things are progressing as scheduled. thanks, john from: vladimir gorny 07/31/2000 06:35 pm to: john j lavorato/corp/enron@enron, jeffrey a shankman/hou/ect@ect, john arnold/hou/ect@ect, debbie r brackett/hou/ect@ect, frank hayden/corp/enron@enron, stephen stock/hou/ect@ect cc: subject: forward-forward vol implementation plan plan of action for implementation of the var methodology change related to forward-forward volatilities: 1. finalize the methodology proposed (research/market risk) - done 2. testing of the new methodology for the natural gas desk in excel (market risk) - done 3. get approval for the methodology change from rick buy (see draft of the memo attached) - john lavorato and john sherriff - by 8/7/00 - john lavorato, any comments on the memo? - would you like to run this by john sherriff or should i do it? 4. develop and implement the new methodology in a stage 

Now show the results for the query: `winter`

In [None]:
query = 'winter'
query_embedding = get_average_embedding([query])
scores = cosine_similarity([query_embedding], doc_embeddings)[0]
top_5_idx = scores.argsort()[::-1][:5]

print(f"\nQuery: {query}")
print("Rank\tScore\t\tDocumentID\t\tDocument")
for rank, idx in enumerate(top_5_idx, start=1):
    doc_id = docs[idx]['Document-ID']
    text = ' '.join(docs[idx]['content'])
    print(f"{rank}\t{scores[idx]:.4f}\t{doc_id}\t{text}")


Query: winter
Rank	Score		DocumentID		Document
1	0.9995	7036870.1075857652237	what's your view of crude from here over next 1-4 weeks?
2	0.9994	21805512.1075857596240	rajib: the following are my bids for the asian option: gq 1 : .41 gq 2 : .63 gq 3 : .57
3	0.9994	33025919.1075857594206	saw a lot of the bulls sell summer against length in front to mitigate margins/absolute position limits/var. as these guys are taking off the front, they are also buying back summer. el paso large buyer of next winter today taking off spreads. certainly a reason why the spreads were so strong on the way up and such a piece now. really the only one left with any risk premium built in is h/j now. it was trading equivalent of 180 on access, down 40+ from this morning. certainly if we are entering a period of bearish to neutral trade, h/j will get whacked. certainly understand the arguments for h/j. if h settles $20, that spread is probably worth $10. h 20 call was trading for 55 on monday. today it was 10/

Now show the results for the query: `risk`

In [None]:
query = 'risk'
query_embedding = get_average_embedding([query])
scores = cosine_similarity([query_embedding], doc_embeddings)[0]
top_5_idx = scores.argsort()[::-1][:5]

print(f"\nQuery: {query}")
print("Rank\tScore\t\tDocumentID\t\tDocument")
for rank, idx in enumerate(top_5_idx, start=1):
    doc_id = docs[idx]['Document-ID']
    text = ' '.join(docs[idx]['content'])
    print(f"{rank}\t{scores[idx]:.4f}\t{doc_id}\t{text}")


Query: risk
Rank	Score		DocumentID		Document
1	0.9996	32572339.1075857602061	---------------------- forwarded by john arnold/hou/ect on 04/11/2000 04:57 pm --------------------------- from: rudi zipter 04/08/2000 09:03 am to: john arnold/hou/ect@ect cc: vladimir gorny/hou/ect@ect, minal dalia/hou/ect@ect, sunil dalal/corp/enron@enron subject: option analysis on ng price book john, several months ago we talked about the development of an option analysis tool that could be used to stress test positions under various scenarios as a supplement to our v@r analysis. we have recently completed the project and would like to solicit your feedback on the report results. we have selected your ng price position for april 4, 2000 (post-id 753650) for the initial analysis. attached in the excel file below you will find: analysis across the various forward months in your position underlying vs. greeks, theoretical p&l volatility vs. greeks, theoretical p&l time change vs. greeks, theoretical p&l sum

Now show the results for the query: `never`

In [None]:
query = 'never'
query_embedding = get_average_embedding([query])
scores = cosine_similarity([query_embedding], doc_embeddings)[0]
top_5_idx = scores.argsort()[::-1][:5]

print(f"\nQuery: {query}")
print("Rank\tScore\t\tDocumentID\t\tDocument")
for rank, idx in enumerate(top_5_idx, start=1):
    doc_id = docs[idx]['Document-ID']
    text = ' '.join(docs[idx]['content'])
    print(f"{rank}\t{scores[idx]:.4f}\t{doc_id}\t{text}")


Query: never
Rank	Score		DocumentID		Document
1	0.9994	3233370.1075857658589	that info is correct. from: edie leschber 12/29/2000 12:30 pm to: john arnold/hou/ect@ect cc: subject: gas team - reorg john, my name is edie leschber and i will be your business analysis and reporting contact effective immediately. i am currently in the process of verifying team members under your section of the gas team. attached is a file with the current list. please confirm that your list is complete and/or send me changes to it at your earliest convenience. new cost centers have been set up due to the reorganization and we would like to begin using these as soon as possible. i look forward to meeting you and working with you very soon. thank you for your assistance. edie leschber x30669
2	0.9994	18917089.1075857653085	as long as i own enron stock, the desks are my colleagues. feel free to share the info with hunter and chris. clayton vernon @ enron 03/26/2001 03:45 pm to: john arnold/hou/ect@ect cc: sub