# Assignment 3

Welcome to the third assignment! Here you will implement a simple Content-Based Recommender. We will use part of the MovieLens 20M dataset.

You will write and execute your code in Python using this Jupyter Notebook.

**TASK:** Your job is to **fill in the missing code only** (!!!). The place to enter your code is clearly marked with comments.

**SUBMISSION:** You will submit this Notebook via the Interface of JupyterHub.

- Submissions are possible until **25.04.2023 23:59 CEST**.
- Do **NOT** rename the file. It must be named "assignment3.ipynb" (also if you work on the Jupyter Notebook offline).
- Please **save** ("File" -> "Save and Checkpoint") and **close** ("File" -> "Close and Halt") your Jupyter Notebook before you hand in your solution.
- Before handing in, check that your Jupyter Notebook **validates**!
- Please use the CLI to validate your solution. The button in the web interface does not work if the Jupyter Notebook runs for more than a few seconds.

**GRADING:** We will test whether your code produces the expected output. Hidden tests will compare your results with those of the reference solution. They run on the whole dataset with multiple, randomly selected inputs, and your results must match to within two decimal places. Note that the visible test cells are only an indicator of the correctness of your solution. They **do not** guarantee that your solution is correct.

**Late submissions are not possible. We will automatically collect all submissions at the end of the deadline!**

We reserve the right to carry out automatic plagiarism checks. Please do not exploit the submission system. We will look at all submissions, and any submission that exploits the system will be scored 0 points.

## Preparation
Importing necessary modules.

In [None]:
import csv
import pandas as pd
import numpy as np
from scipy import sparse as sp
import sklearn.preprocessing as pp
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

#if you wish to disable warnings, uncomment the following two lines
#import warnings
#warnings.filterwarnings("ignore")

In [None]:
np.set_printoptions(threshold=500, precision=4)
pd.options.display.max_seq_items = 100
%precision 4

Make sure to enter the correct location of your data.

In [None]:
data_directory = '~/shared/data/assignment3/'

In [None]:
# Hidden

## Create the movies DataFrame

In [None]:
links = pd.read_csv(data_directory + 'links.csv')
movies_plain = pd.read_csv(data_directory + 'movies.csv')
metadata = pd.read_csv(data_directory + 'movies_metadata.csv', low_memory=False)
metadata.drop(metadata.columns[[0,1,2,4,6,7,8,10,11,12,13,14,15,16,17,18,19,20,21,22,23]], axis=1, inplace=True)
keywords = pd.read_csv(data_directory + 'keywords.csv', low_memory=False)
credits = pd.read_csv(data_directory + 'credits.csv', low_memory=False)

keywords['id'] = keywords['id'].astype('int')
# Keep only rows with a tmdbId; copy so the column assignment below does not act on a view
links = links[links['tmdbId'].notnull()].copy()
links['tmdbId'] = links['tmdbId'].astype('int')
metadata = metadata.drop([19730, 29503, 35587])
metadata['id'] = metadata['id'].astype('int')
credits['id'] = credits['id'].astype('int')

movies = metadata.merge(links, how='inner', left_on='id', right_on='tmdbId')
movies = movies.merge(movies_plain, how='inner', left_on='movieId', right_on='movieId')
movies = movies.merge(keywords, how='inner', left_on='id', right_on='id')
movies = movies.merge(credits, how='inner', left_on='id', right_on='id')
movies = movies.drop(columns=['tmdbId','genres_y'])
movies.rename(columns={'genres_x': 'genres'}, inplace=True)

movies = movies[movies['overview'].notnull()]

# Copy so that the column assignments below do not act on a view of the merged frame
movies = movies[movies['movieId'] < 1000].copy()

from ast import literal_eval

features = ['cast', 'crew', 'keywords', 'genres']
for feature in features:
    movies[feature] = movies[feature].apply(literal_eval)
    

# Get the director's name from the crew feature. If director is not listed, return NaN
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

# Returns the first 3 elements of the list, or the entire list if it has fewer than 3.
def get_list(x):
    if isinstance(x, list):
        names = [i['name'] for i in x]
        #Check if more than 3 elements exist. If yes, return only first three. If no, return entire list.
        if len(names) > 3:
            names = names[:3]
        return names

    #Return empty list in case of missing/malformed data
    return []

# Define new director, cast, genres and keywords features that are in a suitable form.
movies['director'] = movies['crew'].apply(get_director)

features = ['cast', 'keywords', 'genres']
for feature in features:
    movies[feature] = movies[feature].apply(get_list)

    
# Function to convert all strings to lower case and strip names of spaces
def clean_data(x):
    if isinstance(x, list):
        return [str.lower(i.replace(" ", "")) for i in x]
    else:
        # Check if string exists. If not, return empty string
        if isinstance(x, str):
            return str.lower(x.replace(" ", ""))
        else:
            return ''

# Apply clean_data function to your features.
features = ['cast', 'keywords', 'director', 'genres']

for feature in features:
    movies[feature] = movies[feature].apply(clean_data)

    
# Drop duplicate movies   
import collections
movie_ids = movies['movieId'].tolist()
movie_ids_dup = [x for  x, y in collections.Counter(movie_ids).items() if y > 1]

for movie_id in movie_ids_dup:
    to_drop = movies.index[movies.movieId == movie_id].tolist()[1:]
    movies.drop(to_drop, inplace=True)

movies.drop(columns='crew', inplace=True)


movies.rename(columns={'overview':'plot'}, inplace=True)

def create_metadata(x):
    return ' '.join(x['keywords']) + ' ' + ' '.join(x['cast']) + ' ' + x['director'] + ' ' + ' '.join(x['genres'])

# Create a new metadata feature
movies['metadata'] = movies.apply(create_metadata, axis=1)

display(movies.head())

## Create the ratings DataFrame

In [None]:
ratings = pd.read_csv(data_directory + 'ratings.csv')
ratings = ratings.drop(columns=['timestamp'])
ratings = ratings[(ratings['userId'] < 1000) & (ratings['movieId'] < 100) ]

ratings = ratings[ratings['movieId'].isin(movies['movieId'])]

## keep users with at least 2 ratings
ratings_count = ratings.groupby(['userId', 'movieId']).size().groupby('userId').size()
ratings_ok = ratings_count[ratings_count >= 2].reset_index()[['userId']]
ratings = ratings.merge(ratings_ok, 
               how = 'right',
               left_on = 'userId',
               right_on = 'userId')


ratings.columns = ['user', 'item', 'rating']

item_ids = ratings['item'].unique()
item_ids.sort()

display(ratings.head())

In [None]:
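## intended: trim movies dataframe to contain only movies in item_ids
## NOTE: the filtered result is not assigned, so `movies` stays untrimmed;
## the expected outputs below (958 items) rely on this.

movies[movies['movieId'].isin(item_ids)] # TODO change to movies = movies[...], further change expected output below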

movies.rename(columns={'movieId': 'item_id'}, inplace=True)

## The `Recommender_CB` class

In the following, we will build functionality into the `Recommender_CB` class. The constructor stores the profile type, `build_model` stores the data sources, and a helper function returns the titles of movies.
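
The later cells define functions at module level and then assign them to the class. The following is a minimal sketch of that pattern; the `Greeter` class and the `greet` function are hypothetical and only serve as an illustration.

```python
class Greeter:
    def __init__(self, name):
        self.name = name


# A plain function whose first parameter plays the role of `self`.
def greet(self):
    return "hello " + self.name


# Assigning the function to the class makes it available as a method
# on all instances, including ones created earlier.
Greeter.greet = greet

print(Greeter("ada").greet())
```

Because the function becomes a regular method, other methods can call it as `self.greet()` rather than `greet(self)`.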

In [None]:
class Recommender_CB:
    
    def __init__(self, profile_type='plot'):
        self.profile_type = profile_type
    
    def build_model(self, ratings, items_meta):
        self.ratings = ratings
        self.items_meta = items_meta
        
        ## user_id and item_id are external ids; i_id is internal id
        self.item_ids = self.ratings.item.unique()
        self.item_ids.sort()
        
        self.user_ids = self.ratings.user.unique()
        self.user_ids.sort()
        
        self.i_id_to_item_id = self.items_meta['item_id'].tolist()
    
    
    def get_item_titles(self, item_ids):
        # Look up the title of each external item id
        return [self.items_meta[self.items_meta['item_id'] == item_id]['title'].item() for item_id in item_ids]


## Build the content of Items --- TO EDIT


For the purpose of this assignment we consider two types of content.

The first one, called *plot* content, is based on the movie's plot description, contained in the attribute `plot`.

The second one, called *meta* content, is based on the director, the actors, genres, and keywords of the movies, which are combined in the attribute `metadata`.

We will build TF-IDF vectors for the movies based on the two types of content. For this purpose, we will use the `TfidfVectorizer` class from `scikit-learn`.

Steps to implement:

1. Apply the `vectorizer` to the `plot` column of `self.items_meta` to obtain a tf-idf vector for each movie. Store the result in `self.plot_tfidf`.
2. Get the feature names from the `vectorizer` and store them in `self.plot_tfidf_tokens`.
3. Apply the `vectorizer` to the `metadata` column of `self.items_meta` to obtain a tf-idf vector for each movie. Store the result in `self.meta_tfidf`.
4. Get the feature names from the `vectorizer` and store them in `self.meta_tfidf_tokens`.


In [None]:
def build_item_contents(self):
    """
    This function creates a TF-IDF vector representation of the plot and the meta content.
    
    :var plot_tfidf: Tf-idf-weighted document-term matrix (shape: (n_items, n_features))
    :type plot_tfidf: scipy.sparse.csr_matrix
    :var plot_tfidf_tokens: feature names, one per matrix column
    :type plot_tfidf_tokens: numpy.ndarray
    :var meta_tfidf: Tf-idf-weighted document-term matrix (shape: (n_items, n_features))
    :type meta_tfidf: scipy.sparse.csr_matrix
    :var meta_tfidf_tokens: feature names, one per matrix column
    :type meta_tfidf_tokens: numpy.ndarray
    """
    vectorizer = TfidfVectorizer(stop_words='english') # Define a TF-IDF Vectorizer that removes all english stop words (e.g., 'the', 'a')
    
    # YOUR CODE HERE
    # Plot content: fit on the plot texts and read the vocabulary before refitting.
    self.plot_tfidf = vectorizer.fit_transform(self.items_meta['plot'])
    self.plot_tfidf_tokens = vectorizer.get_feature_names_out()

    # Meta content: refitting replaces the vocabulary learned from the plots.
    self.meta_tfidf = vectorizer.fit_transform(self.items_meta['metadata'])
    self.meta_tfidf_tokens = vectorizer.get_feature_names_out()
    
    self.plot_tfidf.sort_indices()
    self.meta_tfidf.sort_indices()
    
    # Use the profile type chosen in the constructor (defaults to 'plot')
    self.set_content_type(self.profile_type)

def set_content_type(self, profile_type='plot'):
    if profile_type == 'plot':
        self.tfidf = self.plot_tfidf
        self.tfidf_tokens = self.plot_tfidf_tokens
    else:
        self.tfidf = self.meta_tfidf
        self.tfidf_tokens = self.meta_tfidf_tokens

Add the functions to the class.

In [None]:
Recommender_CB.build_item_contents = build_item_contents
Recommender_CB.set_content_type = set_content_type

Test the function. (Shows only the nonzero vector coordinates.)

In [None]:
cbr = Recommender_CB()
cbr.build_model(ratings, movies)

cbr.build_item_contents()

print(cbr.plot_tfidf.shape)
print(cbr.plot_tfidf.data)
# print(cbr.plot_tfidf)

print(cbr.meta_tfidf.shape)
print(cbr.meta_tfidf.data)

**EXPECTED OUTPUT:**

```
(958, 9241)
[0.1611 0.3988 0.1611 ... 0.124  0.0878 0.1253]
(958, 3792)
[0.2485 0.3649 0.1075 ... 0.2854 0.3717 0.4114]
```

In [None]:
# Hidden

The following function returns the vector representations for specified items. Vectors are stacked vertically.
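
The lookup can be pictured on a small example: external item ids are translated to row positions, and indexing the sparse matrix with a list of positions returns the rows stacked in the requested order. The matrix and the id list below are made up for illustration.

```python
import numpy as np
from scipy import sparse as sp

# Hypothetical 4-item, 3-token matrix; rows are items in internal (i_id) order.
toy_matrix = sp.csr_matrix(np.array([
    [1.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
    [0.0, 0.0, 3.0],
    [4.0, 0.0, 4.0],
]))

# Position in the list = internal id, value = external item id.
toy_i_id_to_item_id = [10, 20, 30, 40]

wanted_item_ids = [40, 10]
rows = [toy_i_id_to_item_id.index(item_id) for item_id in wanted_item_ids]

# Indexing with a list of row positions returns those rows, in that order.
stacked = toy_matrix[rows]

print(stacked.toarray())
```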

In [None]:
def get_item_vectors(self, item_ids):
    i_ids = [self.i_id_to_item_id.index(item_id) for item_id in item_ids]
    item_vector = self.tfidf[i_ids]
    return item_vector 

Add the function to the class.

In [None]:
Recommender_CB.get_item_vectors = get_item_vectors

## Build the profiles of Users --- TO EDIT

The following function computes the user profile as a vector that averages the tf-idf vectors of all items the user has rated, weighted by the ratings of the user. 

Steps to implement:

1. Get the tf-idf vectors corresponding to the items rated by the user.
2. Compute a weighted average of these vectors, where each vector is weighted by the rating the user gave to the item. Store the output in the `user_profile` vector. Tips: You may want to use `scipy.sparse.csr_matrix.multiply` to multiply the sparse tf-idf vectors with the user ratings.
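
A minimal sketch of the weighted average on toy data, assuming two items with one-hot vectors and ratings of 4 and 2. All `toy_*` names are made up for this example.

```python
import numpy as np
from scipy import sparse as sp
from sklearn.preprocessing import normalize

# Hypothetical tf-idf rows of the two items a user has rated.
toy_vectors = sp.csr_matrix(np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
]))
toy_ratings = np.array([4.0, 2.0])

# Scale row k by rating k; the (n, 1) column broadcasts across the columns.
weighted = toy_vectors.multiply(toy_ratings[:, np.newaxis])

# Sum the scaled rows and divide by the number of ratings.
toy_profile = np.asarray(weighted.sum(axis=0)).reshape(1, -1) / len(toy_ratings)

# Rescale to unit L2 length, as done at the end of get_user_profile.
toy_profile = normalize(toy_profile)

print(toy_profile)
```

Because the profile is normalized afterwards, dividing by the number of ratings or by the sum of the ratings leads to the same final vector.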


In [None]:
def get_user_profile(self, user_id, ratings):
    """
    This function takes a user ID and a ratings DataFrame as input.
    
    :return: user profile as the L2-normalized weighted average of the tf-idf vectors of
             the items rated by the user, with shape (1, self.tfidf.shape[1])
    :rtype: numpy.ndarray
    """
    
    item_ids_rated_by_user_id = np.array( ratings.loc[ ratings['user'] == user_id ]['item'] )
    user_ratings = np.array( ratings.loc[ ratings['user'] == user_id ]['rating'] )

    # YOUR CODE HERE
    item_vectors = self.get_item_vectors(item_ids_rated_by_user_id)
    # Scale each item's tf-idf row by the rating the user gave that item.
    weighted = item_vectors.multiply(user_ratings[:, np.newaxis])
    # Average the weighted rows into a dense array of shape (1, n_features).
    user_profile = np.asarray(weighted.sum(axis=0)).reshape(1, -1) / len(user_ratings)

    user_profile = pp.normalize(user_profile)
    #print(user_profile.shape)
    #print(type(user_profile))

    return user_profile

Add the function to the class.

In [None]:
Recommender_CB.get_user_profile = get_user_profile

Test the function. (Shows only the nonzero vector coordinates.)

In [None]:
user_profile = cbr.get_user_profile(1, ratings)
print(user_profile[user_profile.nonzero()])

**EXPECTED OUTPUT:**

```
[0.052  0.0574 0.0574 0.0658 0.0789 0.0789 0.144  0.1336 0.0501 0.0557
 0.0538 0.0546 0.0668 0.0596 0.1071 0.1241 0.0658 0.0574 0.066  0.0713
 0.0957 0.0741 0.0619 0.1084 0.0638 0.0768 0.0741 0.1027 0.0594 0.0757
 0.0612 0.0493 0.0714 0.1132 0.0467 0.0741 0.0741 0.0713 0.066  0.0511
 0.0757 0.052  0.066  0.0442 0.0714 0.041  0.0713 0.0434 0.0651 0.1775
 0.0688 0.0658 0.0398 0.0714 0.0741 0.0611 0.0587 0.0557 0.0768 0.0601
 0.1286 0.0543 0.0741 0.066  0.1388 0.0501 0.0757 0.0757 0.0789 0.051
 0.0658 0.0744 0.0658 0.0574 0.0757 0.139  0.1555 0.0757 0.0641 0.0499
 0.0594 0.1174 0.0757 0.0636 0.0379 0.051  0.0587 0.085  0.0741 0.0557
 0.1365 0.0684 0.0413 0.0757 0.0531 0.085  0.0688 0.0658 0.0611 0.0684
 0.0658 0.0684 0.0583 0.0621 0.1316 0.0668 0.0714 0.0543 0.0658 0.0557
 0.072  0.085  0.1168 0.0789 0.0744 0.0594 0.1142 0.0612 0.0587 0.1196
 0.085  0.0802 0.0893 0.0659 0.0686 0.0744 0.085  0.085  0.144  0.1178
 0.085  0.0363 0.0757 0.1236 0.0361 0.066  0.1594 0.1278 0.0459 0.0789
 0.0574 0.0344 0.0802 0.0688 0.0531 0.0658 0.0768 0.0688 0.1514 0.0686
 0.0768 0.1188 0.0621 0.0594 0.0395 0.0349 0.0408]
```

In [None]:
# Hidden

Build the profiles of all users. Use only the positive ratings (ratings greater than 3) to determine the weights.
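
A small sketch of what the filter implies, using a made-up ratings table: users without any rating above 3 do not appear in the filtered data, so no profile is built for them.

```python
import pandas as pd

# Hypothetical ratings; user 3 has no rating above 3.
toy_ratings = pd.DataFrame({
    'user':   [1, 1, 2, 2, 3],
    'item':   [10, 20, 10, 30, 10],
    'rating': [5.0, 2.0, 3.0, 4.0, 2.0],
})

# Keep only positive feedback, using the same threshold as the notebook.
toy_positive = toy_ratings[toy_ratings['rating'] > 3]

# Only these users would receive a profile.
users_with_profile = toy_positive['user'].unique().tolist()

print(users_with_profile)
```

Looking up a profile for a user who is missing from this list would raise a `KeyError` in a dictionary keyed by user id.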

In [None]:
def build_user_profiles(self):
    # Filter on the ratings stored in the model, not on the global `ratings` variable
    positive_ratings = self.ratings[self.ratings.rating > 3]
    self.user_profiles = {}
    for user_id in positive_ratings['user'].unique():
        self.user_profiles[user_id] = self.get_user_profile(user_id, positive_ratings)
    

In [None]:
Recommender_CB.build_user_profiles = build_user_profiles

In [None]:
cbr.build_user_profiles()

## Make Recommendations --- TO EDIT


The following function recommends the topN items to the user based on their profile. The recommendations should exclude items already rated by the user.

Steps to implement:

1. Retrieve the user profile.
2. Compute the cosine similarity between the user profile and each tf-idf vector, and store it into array `sims`. Tips: Use `linear_kernel` from scikit-learn to take the inner product, since all vectors are normalized. Also, flatten the output at the end.
3. Identify the indices in `sims` that have the largest similarities. Tips: `a[::-1]` returns the reverse of list `a`. You may want to use the `numpy.argsort` method.
4. Retrieve the item_ids from `self.i_id_to_item_id` that correspond to the indices found.
5. Include in the recommendation list only items from `from_item_ids`, and exclude those in `item_ids_rated_by_user_id`.
6. Return only the topN. Recommended items should be sorted from most to least similar to user profile.
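
The ranking part of these steps can be sketched on toy data. The item vectors, ids, and the user profile below are made up, and all vectors have unit length so that the inner product equals the cosine similarity.

```python
import numpy as np
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical unit-length item vectors and their external ids.
toy_item_vectors = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [0.6, 0.8],
    [0.8, 0.6],
])
toy_item_ids = [10, 20, 30, 40]

toy_user_profile = np.array([[1.0, 0.0]])
already_rated = {10}

# Inner products between the profile and every item, flattened to 1-D.
sims = linear_kernel(toy_user_profile, toy_item_vectors).flatten()

# argsort is ascending, so reverse it to get the most similar items first.
order = np.argsort(sims)[::-1]
ranked = [toy_item_ids[i] for i in order]

# Drop items the user has already rated, then keep the top 2.
top2 = [item_id for item_id in ranked if item_id not in already_rated][:2]

print(top2)  # [40, 30]
```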

In [None]:
def recommend(self, user_id, from_item_ids=None, topN=20):
    """
    This function takes a user ID and, optionally, an array of item IDs from which the topN items should be recommended.
    Recommendations are based on the cosine similarity between the user profile and the tf-idf vectors of the items.
    
    :return: the topN recommended item IDs, sorted from most to least similar (fewer if not enough candidates remain)
    :rtype: list
    """
    
    item_ids_rated_by_user_id = self.ratings.loc[ self.ratings['user'] == user_id ]['item'].tolist()

    if from_item_ids is None:
        from_item_ids = self.item_ids

    # YOUR CODE HERE
    user_profile = self.user_profiles[user_id]

    # All vectors are L2-normalized, so the inner product equals the cosine similarity.
    sims = linear_kernel(user_profile, self.tfidf).flatten()

    # Internal item indices ordered from most to least similar.
    ranked_i_ids = np.argsort(sims)[::-1]
    ranked_item_ids = [self.i_id_to_item_id[i] for i in ranked_i_ids]

    # Keep candidates from from_item_ids that the user has not rated yet.
    allowed = set(from_item_ids)
    rated = set(item_ids_rated_by_user_id)
    filtered = [item_id for item_id in ranked_item_ids
                if item_id in allowed and item_id not in rated]

    recommendations = filtered[:topN]
    return recommendations
    

Add the function to the class.

In [None]:
Recommender_CB.recommend = recommend

Test the function.

In [None]:
recs = cbr.recommend(1)
print(recs)
print(cbr.recommend(2))
print(cbr.recommend(3))
print(cbr.recommend(4))
print(cbr.recommend(5))
print(cbr.recommend(6))

**EXPECTED OUTPUT:**

```
[22, 9, 65, 40, 20, 61, 6, 86, 94, 88, 44, 13, 3, 18, 92, 15, 85, 81, 78, 90]
[31, 5, 71, 58, 49, 57, 54, 64, 90, 94, 72, 88, 84, 76, 78, 97, 2, 37, 44, 52]
[40, 18, 2, 65, 60, 92, 61, 30, 84, 21, 17, 81, 90, 78, 10, 87, 58, 47, 77, 42]
[13, 92, 24, 46, 76, 87, 86, 85, 84, 83, 82, 81, 80, 79, 78, 77, 75, 89, 74, 73]
[31, 72, 64, 55, 54, 76, 14, 45, 23, 75, 67, 44, 66, 46, 78, 39, 43, 84, 34, 73]
[31, 5, 64, 57, 60, 46, 45, 92, 18, 39, 38, 82, 84, 50, 73, 34, 99, 23, 48, 76]
```

In [None]:
# Hidden

Show the movie titles of the recommendations.

In [None]:
display(cbr.get_item_titles(recs))

In [None]:
# feel free to use this field for additional tests
