The following assignment again consists of a theoretical part (learning portfolio) and a practical part (assignment). The goal is to train a neural model for a recommender system.

The plan is to discuss what you learned from the theory part in the first week, so you are relatively free in how you fill your Learning Portfolio on this topic. In the following week we will discuss your solutions to the practical part.

# Theory part (filling your Learning Portfolio, June 7)

In preparation for the practical part, please familiarize yourself with the following video sources over the next week:

1) Please watch the following videos:

https://www.youtube.com/watch?v=Fmtorg_dmM0&ab_channel=ritvikmath (optional, for an overview only)

https://course.fast.ai/Lessons/lesson7.html (the second part of the lesson, starting with collaborative filtering, is mandatory)

Note: The first part of the video mainly covers tips on training neural networks for a Kaggle competition submission. To follow it fully, you would need to watch the end of lesson 6, but this is not mandatory.

2) Please download the following notebook and edit it in Google Colab. Try to answer a few of the questions asked at the end. Take notes and update your Learning Portfolio.

https://www.kaggle.com/code/jhoward/collaborative-filtering-deep-dive/notebook


# Practical part (Assignment, June 14)

Find a data set that can be used for a recommender system, then train and validate a neural network on it.

To do so, please download a data set from one of the lists below and use it in your program.

https://gist.github.com/entaroadun/1653794

https://github.com/caserec/Datasets-for-Recommender-Systems

https://grouplens.org/datasets/movielens/

https://eigentaste.berkeley.edu/dataset/

# Using the Jokes Dataset

In [12]:
import os
import re

import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from google.colab import drive
from google.colab import data_table
import matplotlib.pyplot as plt
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import FunctionTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer
data_table.enable_dataframe_formatter()

from fastai.collab import *
from fastai.tabular.all import *

drive.mount('/content/drive')

# Directory containing the joke HTML files, and the Excel file with the ratings
jokes_file = '/content/drive/MyDrive/Colab Notebooks/Homework7/jokes'
ratings_file = '/content/drive/MyDrive/Colab Notebooks/Homework7/jester-data-3.xls'




Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [13]:

# Function to extract jokes from an HTML file
def extract_joke(file_path):
    with open(file_path, 'r') as file:
        html_content = file.read()
    pattern = r"<!--begin of joke -->(.*?)<!--end of joke -->"

    matches = re.findall(pattern, html_content, re.DOTALL)

    cleaned_jokes = []
    for joke in matches:
        soup = BeautifulSoup(joke, "html.parser")
        cleaned_joke = soup.get_text()
        cleaned_jokes.append(cleaned_joke)

    return cleaned_jokes



# Iterate through all HTML files in numeric order and extract the jokes
html_files = [file_name for file_name in os.listdir(jokes_file) if file_name.endswith(".html")]
html_files.sort(key=lambda x: int(re.search(r"\d+", x).group()))
jokes = []
for file_name in html_files:
    file_path = os.path.join(jokes_file, file_name)
    # extract_joke returns a (possibly empty) list, so extend handles both cases
    jokes.extend(extract_joke(file_path))




joke_text_df = pd.DataFrame({"joke_text": jokes})
joke_text_df = joke_text_df.reset_index().rename(columns={'index': 'joke_id'})
joke_text_df.head(5)

Unnamed: 0,joke_id,joke_text
0,0,"\nA man visits the doctor. The doctor says ""I have bad news for you.You have\ncancer and Alzheimer's disease"". \nThe man replies ""Well,thank God I don't have cancer!""\n"
1,1,"\nThis couple had an excellent relationship going until one day he came home\nfrom work to find his girlfriend packing. He asked her why she was leaving him\nand she told him that she had heard awful things about him. \n\n""What could they possibly have said to make you move out?"" \n\n""They told me that you were a pedophile."" \n\nHe replied, ""That's an awfully big word for a ten year old."" \n"
2,2,\nQ. What's 200 feet long and has 4 teeth? \n\nA. The front row at a Willie Nelson Concert.\n
3,3,\nQ. What's the difference between a man and a toilet? \n\nA. A toilet doesn't follow you around after you use it.\n
4,4,"\nQ.\tWhat's O. J. Simpson's Internet address? \nA.\tSlash, slash, backslash, slash, slash, escape.\n"


In [35]:
# Import the ratings file (it has no header row)
ratings = pd.read_excel(ratings_file, header=None)

# Replace 99, the dataset's marker for "not rated", with NaN
ratings.replace(99, np.nan, inplace=True)

# Name the columns '0'..'100', then drop column '0': it only holds the number
# of jokes each user rated, which is not needed here
ratings.columns = [str(i) for i in range(101)]
ratings = ratings.drop("0", axis=1)


ratings = ratings.reset_index().rename(columns={'index': 'user_id'})

# Rescale the ratings linearly from the original range [-10, 10] to [0, 10]
min_val = -10
max_val = 10
user_id = ratings['user_id']
df_without_id = ratings.drop('user_id', axis=1)
# Apply min-max scaling to every rating column (NaN entries stay NaN)
df_scaled = (df_without_id - min_val) / (max_val - min_val) * 10
df_scaled.insert(0, 'user_id', user_id)
df_scaled.head(10)







Unnamed: 0,user_id,1,2,3,4,5,6,7,8,9,...,91,92,93,94,95,96,97,98,99,100
0,0,,,,,4.175,,4.61,8.445,,...,,,,,,,,,,
1,1,,,,,0.365,,0.415,0.705,,...,,,3.615,,,,,,,
2,2,,,,,1.94,,1.26,1.115,,...,,,,,,,,,,
3,3,,5.025,,,3.59,,2.575,4.565,,...,,,,,5.92,,,,2.96,
4,4,,,,,2.525,,8.105,6.36,,...,,,,,,,,,,
5,5,,,,,6.555,,7.21,5.705,,...,,,,,,7.235,,,,
6,6,,,,,4.975,,0.945,1.31,,...,,,3.81,,,4.08,,,,
7,7,,,,,8.4,,3.81,1.09,,...,,,,,,,,,,
8,8,9.1,9.175,,,6.53,,1.555,0.12,,...,,,,,,,,,0.05,
9,9,,,,,6.53,,5.075,9.49,,...,,,,,,,,,,
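
The rescaling above is plain min-max scaling from [-10, 10] onto [0, 10]. As a quick sanity check, the sketch below applies the same formula to a few hand-picked values and inverts it again. The helper names `rescale` and `unscale` are hypothetical and not part of the notebook.

```python
def rescale(x, lo=-10.0, hi=10.0, new_max=10.0):
    """Map x from [lo, hi] linearly onto [0, new_max]."""
    return (x - lo) / (hi - lo) * new_max


def unscale(y, lo=-10.0, hi=10.0, new_max=10.0):
    """Inverse of rescale: map y from [0, new_max] back onto [lo, hi]."""
    return y / new_max * (hi - lo) + lo


# The two ends and the midpoint of the original rating range
for raw in (-10.0, 0.0, 10.0):
    print(raw, "->", rescale(raw))

# Recover the raw rating behind a scaled value from the table above
print(4.175, "->", round(unscale(4.175), 3))
```

Because the mapping is linear, differences between ratings shrink by a factor of two, which matters when an error on the rescaled axis is compared with the original -10 to 10 scale.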


In [38]:
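# Prepare the DataFrame for the DataLoaders:
# melt reshapes the wide table (one column per joke) into long format
# with one row per (user, joke) pair
melted_df = df_scaled.melt(id_vars='user_id', var_name='joke_number', value_name='rating')
# Drop the pairs for which the user gave no rating
melted_df = melted_df.dropna(subset=['rating'])
# Show the first rows of the melted DataFrame
melted_df.head()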

Unnamed: 0,user_id,joke_number,rating
8,8,1,9.1
28,28,1,8.18
42,42,1,0.585
60,60,1,2.04
82,82,1,7.72
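
To make the reshaping step concrete, here is a minimal sketch of the same `melt` plus `dropna` combination on a toy table. The values and the name `long_df` are invented for illustration; only the column names follow the notebook.

```python
import numpy as np
import pandas as pd

# Toy wide table: one row per user, one column per joke, NaN = not rated
wide = pd.DataFrame({
    "user_id": [0, 1],
    "1": [9.1, np.nan],
    "2": [np.nan, 5.0],
    "3": [2.5, 7.5],
})

# One row per (user, joke) pair, then drop the pairs without a rating
long_df = wide.melt(id_vars="user_id", var_name="joke_number", value_name="rating")
long_df = long_df.dropna(subset=["rating"])
print(long_df)
```

Two details carry over to the real data: `dropna` keeps the original row index, which explains the gaps in the index column above, and `joke_number` holds the former column names, so its values are strings rather than integers.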


#### DataLoader

In [40]:
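# Build the DataLoaders; the column names are passed explicitly, and by default
# fastai holds out a random 20% of the rows for validation (valid_pct=0.2)
dls = CollabDataLoaders.from_df(melted_df, user_name='user_id', item_name='joke_number',
                                rating_name='rating', bs=64)
dls.show_batch()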

Unnamed: 0,user_id,joke_number,rating
0,23750,69,8.4
1,21225,53,6.75
2,17187,63,6.41
3,10684,95,0.535
4,18897,35,8.325
5,13584,13,5.39
6,873,34,3.13
7,10421,19,6.455
8,6856,32,8.64
9,20412,68,9.465


#### Train using collab_learner (as shown in the chapter "Using fastai.collab")



In [41]:
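# Dot-product model with the default 50 latent factors; y_range is set slightly
# above the maximum rating of 10 so that the sigmoid output can actually reach it
learn = collab_learner(dls, y_range=(0, 10.5))
# Train for 5 epochs with the 1cycle policy, max learning rate 5e-3, weight decay 0.1
learn.fit_one_cycle(5, 5e-3, wd=0.1)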

epoch,train_loss,valid_loss,time
0,5.449389,5.520938,01:22
1,5.442591,5.423838,01:17
2,5.063986,5.193975,01:17
3,4.560149,4.913106,01:15
4,3.563958,4.870349,01:15
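
To put these numbers into rating units, the sketch below converts the final validation loss into a root mean squared error. It assumes the reported losses are mean squared errors (fastai's `collab_learner` uses `MSELossFlat` unless another loss function is passed); the variable names are illustrative.

```python
import math

valid_mse = 4.870349                  # valid_loss of the last epoch in the table above
rmse_scaled = math.sqrt(valid_mse)    # typical error on the rescaled 0-10 axis
rmse_original = rmse_scaled * 2       # the original -10..10 axis is twice as wide

print(f"RMSE on the 0-10 scale:    {rmse_scaled:.3f}")
print(f"RMSE on the -10..10 scale: {rmse_original:.3f}")
```

Under that assumption the model is off by a bit more than two points on the 0-10 scale. The training loss also falls faster than the validation loss in the last epochs, which may indicate the onset of overfitting.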


In [42]:
learn.model

EmbeddingDotBias(
  (u_weight): Embedding(24939, 50)
  (i_weight): Embedding(101, 50)
  (u_bias): Embedding(24939, 1)
  (i_bias): Embedding(101, 1)
)
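
The printed model holds one 50-dimensional embedding and one bias per user and per joke. The following is a minimal sketch of how a dot-product-with-bias model of this kind turns them into a prediction, assuming the raw score is squashed into `y_range` with a scaled sigmoid. The function `predict` and the tiny 3-factor vectors are made up for illustration and are not taken from the trained model.

```python
import math


def predict(user_vec, item_vec, user_bias, item_bias, y_range=(0.0, 10.5)):
    """Dot product of the factor vectors plus both biases, squashed into y_range."""
    raw = sum(u * i for u, i in zip(user_vec, item_vec)) + user_bias + item_bias
    lo, hi = y_range
    return lo + (hi - lo) / (1.0 + math.exp(-raw))


# Hypothetical 3-factor embeddings (the trained model uses 50 factors)
user = [0.2, -0.1, 0.4]
joke = [0.5, 0.3, -0.2]

print(predict(user, joke, user_bias=0.1, item_bias=0.3))
# A larger item bias raises the prediction for every user
print(predict(user, joke, user_bias=0.1, item_bias=1.0))
```

Since the item bias shifts the prediction for all users alike, sorting jokes by `i_bias` is a reasonable way to look for jokes that are liked regardless of individual taste.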

In [60]:
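# Item (joke) biases: a high bias means the joke is rated well across users
joke_bias = learn.model.i_bias.weight.squeeze()
# Note: argsort returns row indices of the embedding, i.e. positions in the
# DataLoaders vocabulary, which are not necessarily the original joke numbers
id_best_joke = joke_bias.argsort(descending=True)[:5]
print("5 BestJokes:" , id_best_joke)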

5 BestJokes: tensor([47, 21, 31, 27,  2])
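
One caveat when reading these indices: they are row positions in the item embedding. The embedding has 101 rows for 100 jokes, which suggests one extra placeholder row, and `joke_number` holds strings, so if the vocabulary is built by sorting the unique labels, its order would be lexicographic ('1', '10', '100', '11', ...). The sketch below illustrates the translation under that assumption with a hand-built vocabulary; in the notebook, the actual vocabulary should be available as `dls.classes['joke_number']` and ought to be checked first. The name `vocab` and the `'#na#'` placeholder mimic fastai's convention, and the `- 1` offset assumes that `joke_id` in `joke_text_df` is zero-based while joke numbers start at 1.

```python
# Hand-built stand-in for dls.classes['joke_number'] (assumption: placeholder
# first, then the string labels in sorted, i.e. lexicographic, order)
labels = [str(n) for n in range(1, 101)]
vocab = ["#na#"] + sorted(labels)

top_rows = [47, 21, 31, 27, 2]  # embedding row indices printed above

# Translate each row index to its label, then to a zero-based joke_id
top_joke_numbers = [int(vocab[row]) for row in top_rows]
top_joke_ids = [number - 1 for number in top_joke_numbers]

for rank, (row, number) in enumerate(zip(top_rows, top_joke_numbers), start=1):
    print(f"rank {rank}: embedding row {row} -> joke number {number}")
```

Iterating over `top_joke_ids` in this order, rather than filtering the DataFrame with `isin`, would also keep the jokes in ranking order when their texts are looked up.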


#### Workaround: I could not convert the tensor to a list, so I convert it to a NumPy array in order to match the joke IDs from the tensor with the joke texts in the original DataFrame

In [61]:
import numpy as np
tensor_as_list = id_best_joke.numpy()
print(tensor_as_list)

[47 21 31 27  2]


In [76]:
# Filter the DataFrame based on the best joke IDs
df_filtered = joke_text_df[joke_text_df['joke_id'].isin(tensor_as_list)]

# Print the filtered DataFrame.
# Note: isin() keeps the DataFrame's own row order (ascending joke_id), so the
# counter below numbers the rows as printed and does not reflect the bias ranking.
for counter, (index, row) in enumerate(df_filtered.iterrows()):
    joke_id = row['joke_id']
    joke_text = row['joke_text']
    print(f"Joke rank: {counter+1}.\nJoke ID: {joke_id}\nJoke:{joke_text}")

Joke rank: 1.
Joke ID: 2
Joke:
Q. What's 200 feet long and has 4 teeth? 

A. The front row at a Willie Nelson Concert.

Joke rank: 2.
Joke ID: 21
Joke:
A duck walks into a pharmacy and asks for a condom. The pharmacist says
"Would you like me to stick that on your bill?"
The duck says: 
"What kind of duck do you think I am!"

Joke rank: 3.
Joke ID: 27
Joke:
A mechanical, electrical and a software engineer from Microsoft were
driving through the desert when the car broke down. The mechanical
engineer said "It seems to be a problem with the fuel injection system,
why don't we pop the hood and I'll take a look at it." To which the
electrical engineer replied, "No I think it's just a loose ground wire,
I'll get out and take a look." Then, the Microsoft engineer jumps in.
"No, no, no. If we just close up all the windows, get out, wait a few
minutes, get back in, and then reopen the windows everything will work
fine."

Joke rank: 4.
Joke ID: 31
Joke:
A man arrives at the gates of heaven. St.