**DATA 612 Project 3**

**Gullit Navarrete**

**6/22/25**

**Introduction**

**Introduction: Recommender System**

The recommender system I use for this project is the item-item collaborative filtering model from my previous project assignment (based on jokedataset3 from https://eigentaste.berkeley.edu/dataset/). Item-item collaborative filtering captures shared user rating patterns, but its accuracy depends heavily on having a sufficient number of user ratings; I would argue this dataset has enough for the purposes of this project. Building on that baseline, I implement truncated singular value decomposition (SVD).

In [7]:
import pandas as pd
import numpy as np

# Importing
url = "https://raw.githubusercontent.com/GullitNa/DATA612-Project2/main/Dataset3JokeSet.csv"
jokes_df = pd.read_csv(url, header=None, encoding='ISO-8859-1')
jokes_df.columns = ['Joke']

url1 = "https://raw.githubusercontent.com/GullitNa/DATA612-Project2/main/jester_part1.csv"
url2 = "https://raw.githubusercontent.com/GullitNa/DATA612-Project2/main/jester_part2.csv"
url3 = "https://raw.githubusercontent.com/GullitNa/DATA612-Project2/main/jester_part3.csv"

df1 = pd.read_csv(url1, header=None)
df2 = pd.read_csv(url2, header=None)
df3 = pd.read_csv(url3, header=None)

# Combining
ratings_df = pd.concat([df1, df2, df3], ignore_index=True)
ratings_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,141,142,143,144,145,146,147,148,149,150
0,62,99,99,99,99,0.21875,99,-9.28125,-9.28125,99,...,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0
1,34,99,99,99,99,-9.6875,99,9.9375,9.53125,99,...,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0
2,18,99,99,99,99,-9.84375,99,-9.84375,-7.21875,99,...,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0
3,82,99,99,99,99,6.90625,99,4.75,-5.90625,99,...,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0
4,27,99,99,99,99,-0.03125,99,-9.09375,-0.40625,99,...,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0


In [8]:
# Data Cleaning
ratings_df = ratings_df.drop(columns=0)
ratings_df.replace(99.0, np.nan, inplace=True)

**Item-Item Collaborative Filtering**

For this approach, I use item-item collaborative filtering to identify jokes that users rate similarly. By comparing user rating patterns across jokes, I can find jokes with similar audience reception regardless of their content.

In [9]:
from sklearn.metrics.pairwise import cosine_similarity

# Transpose so each row is a joke and each column is a user
item_user_matrix = ratings_df.T

# Item-item similarity based on user ratings (unrated entries are filled with 0)
item_similarity = cosine_similarity(item_user_matrix.fillna(0))

def recommend_similar_items(joke_id, top_n=5):
    # joke_id is a 0-based row position in both jokes_df and item_similarity
    print(f"\nOriginal Joke [{joke_id}]:\n{jokes_df.iloc[joke_id]['Joke']}\n")
    sim_scores = list(enumerate(item_similarity[joke_id]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:top_n+1]  # skip the first entry (the joke itself)
    print(f"Top {top_n} similar jokes:")
    for idx, score in sim_scores:
        print(f"\n[Joke {idx} | Similarity Score: {score:.3f}]")
        print(jokes_df.iloc[idx]['Joke'])

recommend_similar_items(joke_id=53, top_n=3)


Original Joke [53]:
The Pope dies and, naturally, goes to heaven. He's met by the reception committee, and after a whirlwind tour he is told that he can enjoy any of the myriad of recreations available. He decides that he wants to read all of the ancient original text of the Holy Scriptures, so he spends the next eon or so learning languages. After becoming a linguistic master, he sits down in the library and begins to pour over every version of the Bible, working back from most recent "Easy Reading" to the original script. All of a sudden there is a scream in the library. The Angels come running in only to find the Pope huddled in his chair, crying to himself and muttering, "An 'R'! The scribes left out the 'R'."  A particularly concerned Angel takes him aside, offering comfort, asks him what the problem is and what does he mean.  After collecting his wits, the Pope sobs again, "It's the letter 'R'. They left out the 'R'. The word was supposed to be CELEBRATE!"

Top 3 similar jokes:
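
To make the similarity step concrete, here is a minimal sketch on a toy matrix. The ratings, the `item_cosine` helper, and the NumPy-only implementation are assumptions for illustration and are not part of the project code; the sketch mirrors the fill-with-zero approach used in the cell above.

```python
import numpy as np

# Toy ratings: 4 users (rows) x 3 jokes (columns); NaN marks an unrated joke.
# These values are made up for illustration.
R = np.array([
    [5.0,    4.0,  np.nan],
    [3.0,    2.5,  -6.0],
    [-7.0,  -6.0,   8.0],
    [np.nan, 1.0,  -2.0],
])

def item_cosine(R):
    """Cosine similarity between joke columns, with NaN filled as 0."""
    X = np.nan_to_num(R, nan=0.0).T              # one row per joke
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                      # guard against all-zero jokes
    Xn = X / norms
    return Xn @ Xn.T

sim = item_cosine(R)

# Rank the other jokes by similarity to joke 0, excluding joke 0 by index.
order = [int(j) for j in np.argsort(-sim[0]) if j != 0]
print(np.round(sim, 3))
print("Most similar to joke 0:", order)
```

Excluding the query joke by index, rather than dropping the first sorted entry, avoids relying on the joke itself always sorting first when scores tie.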


**Matrix Factorization Method**

For the matrix factorization method, I use truncated SVD. This method suits my recommender system because it extracts only the top k latent factors, which capture the most important patterns in user preferences and item relationships. Computing only the first k factors is also far more efficient than computing a full SVD, which makes it better suited to a dataset and recommender system of this scale.
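
As a minimal sketch of what "keeping the top k factors" means, the snippet below truncates a full SVD by hand with NumPy. The random matrix `A` and the choice `k = 3` are hypothetical stand-ins, and `np.linalg.svd` is used here in place of scikit-learn's `TruncatedSVD` purely to show the algebra.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dense matrix standing in for a mean-centered ratings matrix.
A = rng.normal(size=(50, 12))

# Thin SVD: A = U @ diag(s) @ Vt, with singular values sorted descending.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k largest singular values and their vectors.
k = 3
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The Frobenius error of the rank-k approximation equals the root sum of
# squares of the discarded singular values.
err = np.linalg.norm(A - A_k, ord="fro")
discarded = np.sqrt(np.sum(s[k:] ** 2))
print(f"rank-{k} reconstruction error: {err:.4f}")
print(f"norm of discarded singular values: {discarded:.4f}")
```

The two printed numbers agree, which is why a small k can still capture most of the structure when the trailing singular values are small.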



In [16]:
from sklearn.decomposition import TruncatedSVD

# Impute missing values
user_means = ratings_df.mean(axis=1)
R_filled = ratings_df.apply(lambda row: row.fillna(row.mean()), axis=1)
R_demeaned = R_filled.sub(user_means, axis=0)
# Extract k (latent) factors
k = 20   # number of latent factors
svd = TruncatedSVD(n_components=k, random_state=42)
user_factors = svd.fit_transform(R_demeaned)
item_factors = svd.components_

# Reconstruct the demeaned ratings (U Σ) · Vᵀ
R_demeaned_hat = user_factors.dot(item_factors)
R_hat = (
    pd.DataFrame(R_demeaned_hat,
                 index=ratings_df.index,
                 columns=ratings_df.columns)
      .add(user_means, axis=0)
)
user_id = ratings_df.index[0]
top_jokes = R_hat.loc[user_id]\
               .sort_values(ascending=False)\
               .head(10)

print(f"Top 10 joke recommendations for {user_id}")
print(top_jokes)

Top 10 joke recommendations for 0
53     9.178906
89     8.674358
35     8.642234
127    8.288770
119    7.690467
32     6.821426
117    6.442474
105    6.368761
72     6.243819
50     6.223025
Name: 0, dtype: float64


**Evaluation/Further Testing**

After the initial truncated SVD implementation, I run a simulation to put the result in context: I hold out one rating per user and measure how well the SVD model predicts those held-out ratings.

The split uses two DataFrames, 'train' and 'test'. 'train' starts as a copy of the full user×item matrix, while 'test' starts with all null values and receives exactly one held-out rating per user.

For each user, I find the jokes they originally rated (.dropna().index), randomly pick one of those jokes with (np.random.choice), copy that rating into 'test' at (user, item), and remove it from 'train'.

The evaluation reports the root mean squared error (RMSE), which is the square root of the average squared difference between true and predicted ratings. As shown below, the SVD Test RMSE is 2.3404.

In [13]:
from sklearn.metrics import mean_squared_error

# Hold-out split: 'train' starts as the full matrix, 'test' starts empty
train = ratings_df.copy()
test  = pd.DataFrame(index=train.index, columns=train.columns, dtype=float)

np.random.seed(123)
for user in train.index:
    rated = train.loc[user].dropna().index
    if len(rated) == 0:
        continue
    hold = np.random.choice(rated, size=1, replace=False)  # picks one rated joke
    for item in hold:
        test.at[user, item]  = train.at[user, item]
        train.at[user, item] = np.nan

# Evaluation
# Note: R_hat was fit on the full ratings_df, not on 'train', so the model
# saw the held-out ratings; refitting on 'train' gives a leak-free estimate.
actual = test.stack()
predicted = R_hat.stack()[actual.index]
rmse_svd = np.sqrt(mean_squared_error(actual, predicted))
print(f"SVD Test RMSE: {rmse_svd:.4f}")

SVD Test RMSE: 2.3404
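
The cell above scores `R_hat`, which was fit on the full `ratings_df` rather than on `train`, so the held-out ratings were visible to the model. The sketch below shows one way the order could be changed so that the split happens before fitting. It runs on synthetic data (not the Jester matrix), uses `np.linalg.svd` truncated to k in place of `TruncatedSVD`, and every name and parameter in it is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(123)

# Synthetic low-rank ratings (hypothetical data, not the Jester matrix).
n_users, n_items, k = 200, 30, 5
true = rng.normal(size=(n_users, k)) @ rng.normal(size=(k, n_items))
R = true + 0.1 * rng.normal(size=true.shape)
R[rng.random(R.shape) < 0.3] = np.nan        # hide about 30% as "unrated"

# 1) Hold out one observed rating per user BEFORE fitting anything.
train = R.copy()
held = []                                     # (user, item, true rating)
for u in range(n_users):
    rated = np.flatnonzero(~np.isnan(train[u]))
    if rated.size < 2:                        # keep at least one training rating
        continue
    i = rng.choice(rated)
    held.append((u, i, train[u, i]))
    train[u, i] = np.nan

# 2) Impute and center using training data only.
user_means = np.nanmean(train, axis=1, keepdims=True)
filled = np.where(np.isnan(train), user_means, train)
centered = filled - user_means

# 3) Rank-k reconstruction from the training matrix, then add the means back.
U, s, Vt = np.linalg.svd(centered, full_matrices=False)
R_hat_train = (U[:, :k] * s[:k]) @ Vt[:k] + user_means

# 4) Score only the held-out cells.
users, items, actual = map(np.array, zip(*held))
rmse = np.sqrt(np.mean((R_hat_train[users, items] - actual) ** 2))
print(f"Leak-free hold-out RMSE: {rmse:.4f}")
```

Applied to the project data, the same ordering would mean building `train` first and then computing `user_means`, the imputed matrix, and the SVD from `train` instead of `ratings_df`.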


**Conclusion**

I extended my existing item-item collaborative filtering recommender (built on JokeDataset3) by implementing truncated SVD as a matrix-factorization method in a recommendation context. After loading and cleaning the Jester jokes ratings matrix, I imputed missing values with each user's mean, mean-centered the data, and applied TruncatedSVD to extract the top k latent factors (k = 20). I then reconstructed the ratings and added back the user means to produce a dense prediction matrix, which I evaluated on a held-out test set using RMSE. The Medium tutorial “Recommender System — singular value decomposition (SVD) & truncated SVD” served as a guideline: it clarified the difference between full and truncated SVD, showed code examples for performing the decomposition and reconstructing predictions, and guided my choice to integrate truncated SVD rather than full SVD into my recommender system.

**Sources**

https://medium.com/data-science/recommender-system-singular-value-decomposition-svd-truncated-svd-97096338f361

