## <font color='#475468'> Joke Recommendations:</font>


## Initialize

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Load Data

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
# Joke metadata
dfMvs = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/JokeText.csv')

# User ratings for each joke
dfMvsRtg = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/UserRatings1.csv')

Let us first try to build a recommender using joke content, also known as **Content Based Filtering**

### 1. Content Based Filtering

The idea here is to determine how similar the descriptions are based on the terms used in the descriptions - while ignoring commonly used words.  Then recommend other items with similar descriptions.  In order to do this, **TF-IDF Vectorization** is used.

#### Prepare data

In [4]:
dfMvs.head()

Unnamed: 0,JokeId,JokeText
0,0,"A man visits the doctor. The doctor says ""I ha..."
1,1,This couple had an excellent relationship goin...
2,2,Q. What's 200 feet long and has 4 teeth? \n\nA...
3,3,Q. What's the difference between a man and a t...
4,4,Q.\tWhat's O. J. Simpson's Internet address? \...


  plt.savefig(


  plt.savefig(



Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `y` variable to `hue` and set `legend=False` for the same effect.

  plt.savefig(


In [5]:
dfMvs.shape

(100, 2)

In [8]:
# Remove duplicates
dfMvs.drop_duplicates(subset ='JokeText', keep = 'first', inplace = True)
dfMvs.shape

(100, 2)

There are no duplicates in the data. We do not need to create a new column named "description" because there is no tagline or whatsoever that describes the jokes other than JokeText.

#### Build Model

In [10]:
# Generate a matrix of common terms that show up in each joke

from sklearn.feature_extraction.text import TfidfVectorizer
mdlTfvMvs = TfidfVectorizer(analyzer='word',ngram_range=(1, 2),min_df=1, stop_words='english')
tfidf_matrix = mdlTfvMvs.fit_transform(dfMvs['JokeText'])
tfidf_matrix.shape

(100, 3774)

The similarity between any two movies (x) and (y) is defined as the **Cosine Similarity**:
cosine(x,y)=x.y⊺||x||.||y||

Since we have used the TF-IDF Vectorizer, calculating the Dot Product will directly give us the Cosine Similarity Score.

In [11]:
# Calculate cosine similarity between each pair of movies as a function of the similarity of the common terms

from sklearn.metrics.pairwise import cosine_similarity
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
cosine_sim.shape

(100, 100)

#### Predict

In [26]:
# Prepare recommendation function (build code from scratch and then package as function for ease of understanding)
dfMvs['JokeId'] = dfMvs['JokeId'].astype(str)
titles = dfMvs['JokeId']
indices = pd.Series(dfMvs.index, index=dfMvs['JokeId'])

def get_recommendations(JokeText):
    idx = indices[JokeText]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:31]
    movie_indices = [i[0] for i in sim_scores]
    return titles.iloc[movie_indices]

In [27]:
get_recommendations("1").head(10)

45    45
52    52
37    37
92    92
67    67
22    22
64    64
44    44
75    75
13    13
Name: JokeId, dtype: object

In [28]:
get_recommendations('7').head(10)

50    50
63    63
59    59
14    14
67    67
90    90
0      0
1      1
2      2
3      3
Name: JokeId, dtype: object

These recommendations suggest jokes that are close in description. Anyone querying our engine for recommendations based on a joke will receive the same recommendations for that joke, regardless of who s/he is.  This is a good way of providing recommendations especially when no further data is available.  

What if we also have data on personal tastes?  Can we make recommendations that capture these tastes and recommend jokes that are more personalized?  For this, we use a technique called **Collaborative Filtering** which is based on the idea that users similar to me can be used to predict how much I will like a particular product or service those users have used/experienced but I have not.

utexas_ds_orie_divider_gray.png

### 2. Collaborative Filtering

The idea here is that a user x joke matrix is decomposed into a product of user x concept . concept x concept . concept x joke matrices.  These can then be used to derive similarities between users.  This process is known as **Singular Value Decomposition**.

#### Prepare data

In [29]:
dfMvsRtg.head(10)

Unnamed: 0,JokeId,User1,User2,User3,User4,User5,User6,User7,User8,User9,...,User36701,User36702,User36703,User36704,User36705,User36706,User36707,User36708,User36709,User36710
0,0,5.1,-8.79,-3.5,7.14,-8.79,9.22,-4.03,3.11,-3.64,...,,,,,,,,,2.91,
1,1,4.9,-0.87,-2.91,-3.88,-0.58,9.37,-1.55,0.92,-3.35,...,,,,-5.63,,-6.07,,-1.6,-4.56,
2,2,1.75,1.99,-2.18,-3.06,-0.58,-3.93,-3.64,7.52,-6.46,...,,,,,,4.08,,,8.98,
3,3,-4.17,-4.61,-0.1,0.05,8.98,9.27,-6.99,0.49,-3.4,...,,,,,,,,,,
4,4,5.15,5.39,7.52,6.26,7.67,3.45,5.44,-0.58,1.26,...,2.28,-0.49,5.1,-0.29,-3.54,-1.36,7.48,-5.78,0.73,2.62
5,5,1.75,-0.78,1.26,6.65,8.25,-8.11,-6.75,2.14,0.34,...,,-3.4,-0.92,-4.27,,-2.57,9.32,7.96,-9.13,3.3
6,6,4.76,1.6,-5.39,-7.52,4.08,4.42,-0.15,-0.24,-3.01,...,-9.95,-4.42,0.97,-3.54,6.36,3.01,3.74,5.19,-9.42,0.53
7,7,3.3,1.07,1.5,7.28,2.52,2.72,-5.87,8.06,-6.65,...,4.32,-1.07,0.49,-2.14,2.57,-5.73,-2.33,2.67,8.69,-2.62
8,8,-2.57,-8.69,-8.4,-5.15,-9.66,9.08,-3.54,2.82,-3.4,...,,,,,,,,,,
9,9,-1.41,-4.66,4.37,-7.14,2.48,9.13,-5.19,7.52,1.36,...,-8.4,-6.26,-1.17,0.44,7.52,8.59,8.88,6.07,8.35,3.06


#### Build Model

In [32]:
# Prepare data into Surprise library format

!pip3 install scikit-surprise #or !conda install -c conda-forge scikit-surprise
from surprise import Dataset
from surprise import Reader
from surprise.model_selection import train_test_split




In [64]:

# Reshape the data into long format
df_long = dfMvsRtg.melt(id_vars=['JokeId'], var_name='User', value_name='Rating')

# Filter out rows with missing ratings (if any)
df_long = df_long.dropna(subset=['Rating'])

# Convert JokeId and User columns to strings (if necessary)
df_long['JokeId'] = df_long['JokeId'].astype(str)
df_long['User'] = df_long['User'].astype(str)

# Prepare data into Surprise library format
reader = Reader(rating_scale=(0, 5))
X = Dataset.load_from_df(df_long[['User', 'JokeId', 'Rating']], reader)

# Split the data into train and test sets
X_train, X_test = train_test_split(X, test_size=.25)

# Example output to verify the structure
print(df_long)


        JokeId       User  Rating
0            0      User1    5.10
1            1      User1    4.90
2            2      User1    1.75
3            3      User1   -4.17
4            4      User1    5.15
...        ...        ...     ...
3670967     67  User36710    3.59
3670968     68  User36710    5.39
3670969     69  User36710    4.71
3670980     80  User36710    0.97
3670984     84  User36710    1.26

[3012090 rows x 3 columns]


In [34]:
# Define SVD model

from surprise import SVD

mdlSvdMvsRtg = SVD()

In [35]:
# Fit SVD model

mdlSvdMvsRtg.fit(X_train)
test_pred = mdlSvdMvsRtg.test(X_test)

In [36]:
# Evalute SVD accuracy

from surprise import accuracy

accuracy.rmse(test_pred)

RMSE: 4.6286


4.628557674587144

In [37]:
# Tune hyperparameters

from surprise.model_selection import GridSearchCV

param_grid = {'n_epochs': [5, 10, 15], 'lr_all': [0.002, 0.005],
              'reg_all': [0.4, 0.6]}
gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3)

gs.fit(X)

# best RMSE score
print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

4.694591111470256
{'n_epochs': 15, 'lr_all': 0.005, 'reg_all': 0.4}


In [38]:
# Cross-validate

from surprise.model_selection import cross_validate

cross_validate(mdlSvdMvsRtg, X, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    4.6277  4.6307  4.6321  4.6236  4.6271  4.6282  0.0030  
MAE (testset)     3.6822  3.6826  3.6819  3.6759  3.6782  3.6801  0.0026  
Fit time          45.74   47.31   47.00   48.52   48.18   47.35   0.98    
Test time         6.69    9.16    8.42    8.00    6.58    7.77    1.00    


{'test_rmse': array([4.62770836, 4.63067434, 4.63206976, 4.62357937, 4.62712245]),
 'test_mae': array([3.68216462, 3.68256365, 3.68186069, 3.67588626, 3.67820868]),
 'fit_time': (45.74053406715393,
  47.309288024902344,
  46.99779677391052,
  48.51509737968445,
  48.17904353141785),
 'test_time': (6.692754030227661,
  9.16285753250122,
  8.424071073532104,
  7.9973344802856445,
  6.581785202026367)}

Let us now use the trained model to arrive at predictions.

#### Predict

Let's first see which jokes user # 18569 has already viewed.

In [69]:
df_long[df_long['User'] == 'User18569']

Unnamed: 0,JokeId,User,Rating
1856800,0,User18569,7.23
1856801,1,User18569,7.23
1856802,2,User18569,6.89
1856803,3,User18569,8.25
1856804,4,User18569,0.68
...,...,...,...
1856885,85,User18569,0.92
1856896,96,User18569,7.82
1856897,97,User18569,3.50
1856898,98,User18569,5.97


Now, let's predict what rating user # 18569 would give to jokeId # 86 (since he/she hasn't seen it yet)

In [70]:
mdlSvdMvsRtg.predict(86, 18569)

Prediction(uid=86, iid=18569, r_ui=None, est=0.7819483606067552, details={'was_impossible': False})

One startling feature of this recommender system is that it doesn't care what the joke is (or what it contains). It works purely on the basis of an assigned joke ID and tries to predict ratings based on how the other users have predicted the joke.

An extension to this could be to create a hybrid model that uses content filtering in the initial phase when user preferences are not available, and then gradually shift to a collaborative filtering model blended with some content filtering.

## Takeaways

* Introduced content-based filtering to recommend items based on their descriptions using *TF-IDF Vectorization*
* In the event that user preference data is available, collaborative filtering is leveraged to recommend items based on other similar users using *Singular Value Decomposition*

utexas_ds_orie_divider_gray.png