# Part 2 - Implementation 

In your report, describe the system and motivate the choices you made. Comment on the strengths and weaknesses of the system.

Suggest 5 movies for each of the first 5 users, Vincent, Edgar, Addilyn, Marlee, and Javier, that they haven't rated already.

Good sources: https://towardsdatascience.com/cosine-similarity-explained-using-python-machine-learning-pyshark-5c5d6b9c18fa 

https://www.kaggle.com/code/ibtesama/getting-started-with-a-movie-recommendation-system#Content-Based-Filtering

https://pianalytix.com/content-based-recommender-system/

In [None]:
import pandas as  pd
from sklearn.metrics.pairwise import cosine_similarity

In code, we need to:

- Get a sample of the user's 5 highest-rated movies.

- Find the 5 most similar movies for each of those 5 top-rated movies, then randomly pick k of the up to 25 candidates as recommendations.

- Return the top 5 recommendations for each of the first 5 users, restricted to movies they haven't already rated.

Cons: What if a user has no reviews? What if even the user's highest-rated movies have low scores (1-2)? There is also a risk of creating an echo chamber.

In [None]:
# data is already in OHE format
movies_df = pd.read_csv('movie_genres.csv')
reviews_df = pd.read_csv('user_reviews.csv')

# Get a sample of the top k highly (3+) rated movies for a user.

def get_toprated(user_name, k):
    row = reviews_df.loc[reviews_df["User"] == str(user_name)]
    if row.empty:
        return pd.Series(dtype=float)  # unknown user: nothing to return
    # drop the non-rating columns ('User' and any unnamed index column)
    non_rating = row.columns[row.columns.str.contains('user|unnamed', case=False)]
    # select the single matching row by position; a hard-coded label such as
    # row.loc[1] only works for one user
    ratings = pd.to_numeric(row.drop(columns=non_rating).iloc[0], errors='coerce')
    # keep ratings of 3 or higher, best first, and return at most k of them
    return ratings[ratings >= 3].sort_values(ascending=False).head(k)

get_toprated("Edgar", 3)


# compute similarity matrix
#corr_mat = cosine_similarity(reviews_df)

In [None]:
import gc

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def top_k_items(item_id, top_k, corr_mat, map_name):
    
    # argsort is ascending: keep the last top_k indices and reverse them,
    # so the most similar items come first (the item itself is usually among them)
    top_items = corr_mat[item_id,:].argsort()[-top_k:][::-1] 
    top_items = [map_name[e] for e in top_items] 

    return top_items

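# preprocessing
# NOTE: `items`, `ratings` and ITEM_COL are not defined in this notebook
# and must be loaded before this cell can run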
rated_items = items.loc[items[ITEM_COL].isin(ratings[ITEM_COL])].copy()

# extract the genre
genre = rated_items['genres'].str.split(",", expand=True)

# get all possible genre
all_genre = set()
for c in genre.columns:
    distinct_genre = genre[c].str.lower().str.strip().unique()
    all_genre.update(distinct_genre)
# drop missing values (None/NaN) left over from the split
all_genre = {g for g in all_genre if isinstance(g, str)}

# create item-genre matrix
item_genre_mat = rated_items[[ITEM_COL, 'genres']].copy()
item_genre_mat['genres'] = item_genre_mat['genres'].str.lower().str.strip()

# OHE the genres column
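for g in all_genre:  # `g` avoids shadowing the `genre` dataframe above
    # plain substring match; rows with a missing genre string count as 0
    item_genre_mat[g] = np.where(item_genre_mat['genres'].str.contains(g, regex=False, na=False), 1, 0)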
item_genre_mat = item_genre_mat.drop(['genres'], axis=1)
item_genre_mat = item_genre_mat.set_index(ITEM_COL)

# compute similarity matrix
corr_mat = cosine_similarity(item_genre_mat)

# get top-k similar items
ind2name = {ind:name for ind,name in enumerate(item_genre_mat.index)}
name2ind = {v:k for k,v in ind2name.items()}
similar_items = top_k_items(name2ind['99'],
                            top_k = 10,
                            corr_mat = corr_mat,
                            map_name = ind2name)

# display result
print("The top-k similar movies to item_id 99")
display(items.loc[items[ITEM_COL].isin(similar_items)])

del corr_mat
gc.collect();

Text description of the content filtering solution.

One way to build the recommender is a content-based filter, which uses the features of an item to compare items and find similar ones. The recommendation is then based on how similar other movies (not yet watched) are to movies the user has rated highly. Recommending a similar movie requires an initial example, which can be chosen from the user's highly rated (4-5) movies in the user_reviews data file. Similar movies can then be found with a similarity matrix computed with, for example, `cosine_similarity` from `sklearn.metrics.pairwise`. This requires a one-hot-encoded dataframe, because sklearn only operates on numerical values, whereas genres are categorical.
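As a minimal sketch of this idea, the cell below computes cosine similarity on a tiny one-hot genre table. The movie titles and genre columns are made up for illustration and are not taken from movie_genres.csv; only `cosine_similarity` itself is the function named in the text.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy data: rows are movies, columns are one-hot genre flags.
toy_genres = pd.DataFrame(
    {"action": [1, 1, 0], "comedy": [0, 0, 1], "sci-fi": [1, 0, 0]},
    index=["Movie A", "Movie B", "Movie C"],
)

# Pairwise cosine similarity between all movies (n_movies x n_movies).
toy_sim = pd.DataFrame(
    cosine_similarity(toy_genres),
    index=toy_genres.index,
    columns=toy_genres.index,
)

# Movies most similar to "Movie A", excluding the movie itself.
most_similar_to_a = toy_sim["Movie A"].drop("Movie A").sort_values(ascending=False)
print(most_similar_to_a)
```

Movie B shares one of Movie A's two genres and ranks first, while Movie C shares none and gets a similarity of 0.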

To give five recommendations to each of the first five users, one or several top-rated movies are chosen from the user's reviews. The most similar movies are then found by sorting the similarity scores in descending order, and a list of the most similar items is stored for the user and used in the recommendation. If this list is longer than the number of recommendations presented, the movies can be randomly sampled from it, which makes the recommendation more dynamic. Otherwise the recommendation is based on a single rated review and, with only 5 recommendations, stays the same until that review changes. To keep the recommendations fairly similar to the user's top-rated movies, a tolerance level (a minimum similarity score) should be set.
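The procedure described above could be sketched roughly as follows. The function name `recommend`, its parameters, and the toy data are hypothetical, and the sketch assumes that a rating of 0 means "not rated" and that a movie-by-movie similarity dataframe is already available.

```python
import random

import pandas as pd


def recommend(user_ratings, sim, k=5, n_seeds=5, n_similar=5, min_rating=4, seed=None):
    """Sample k unrated movies from the neighbours of a user's top-rated movies.

    user_ratings: Series indexed by movie title, 0 meaning "not rated" (assumption).
    sim: square DataFrame of movie-movie similarities, titles on both axes.
    """
    rated = user_ratings[user_ratings > 0].index
    top = user_ratings[user_ratings >= min_rating].sort_values(ascending=False)
    candidates = set()
    for movie in top.head(n_seeds).index:
        if movie not in sim.columns:
            continue  # no similarity information for this movie
        # most similar movies that the user has not rated yet
        neighbours = sim[movie].drop(labels=rated, errors="ignore")
        candidates.update(neighbours.sort_values(ascending=False).head(n_similar).index)
    pool = sorted(candidates)  # fixed order so a given seed is reproducible
    return random.Random(seed).sample(pool, min(k, len(pool)))


# Hypothetical toy data: four movies, one user who rated A and C.
titles = ["A", "B", "C", "D"]
toy_sim = pd.DataFrame(
    [[1.0, 0.9, 0.1, 0.8],
     [0.9, 1.0, 0.2, 0.7],
     [0.1, 0.2, 1.0, 0.3],
     [0.8, 0.7, 0.3, 1.0]],
    index=titles, columns=titles,
)
toy_ratings = pd.Series([5, 0, 2, 0], index=titles)
toy_recs = recommend(toy_ratings, toy_sim, k=2, seed=0)
print(toy_recs)
```

A tolerance level could be added by filtering `neighbours` on a minimum similarity before taking the head; the sketch leaves that out to stay short.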

One weakness of this solution is that, without a review and a top-rated movie to base the similarity search on, it needs a fallback, such as recommending the most popular movies to those users. The same applies when the only reviews we have for a user are low-scoring. For users with many high-scoring reviews, the problem is instead which reviews to base the recommendation on. The answer is not obvious, and several highly rated movies could be used.
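A possible fallback for users without any highly rated movie is sketched below. Here "most popular" is interpreted as the highest mean rating among movies with a minimum number of votes, which is an assumption, as are the function name `popular_unrated`, the data layout (users as index, movies as columns, 0 meaning "not rated"), and the toy data.

```python
import pandas as pd


def popular_unrated(ratings_matrix, user, k=5, min_votes=1):
    """Hypothetical fallback: best mean-rated movies the user has not rated."""
    observed = ratings_matrix.where(ratings_matrix > 0)  # 0 -> NaN so it is ignored
    mean_rating = observed.mean()   # mean over users who rated the movie
    votes = observed.count()        # number of users who rated the movie
    ranked = mean_rating[votes >= min_votes].sort_values(ascending=False)
    already_rated = observed.loc[user].dropna().index
    return list(ranked.drop(labels=already_rated, errors="ignore").head(k).index)


# Hypothetical toy data: u3 has not rated anything yet.
toy_matrix = pd.DataFrame(
    {"A": [5, 4, 0], "B": [0, 2, 0], "C": [3, 0, 0]},
    index=["u1", "u2", "u3"],
)
fallback_recs = popular_unrated(toy_matrix, "u3", k=2)
print(fallback_recs)
```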

# Part 3 - Discussion 

Assessing the quality of a recommendation system before deploying it to users is difficult. Why? In a few paragraphs, discuss fundamental challenges in the evaluation of recommendation systems and how they may lead to problems in practice.

- Before deployment, only historical, sparse data is available. Even if the system predicts the historical data well, will it predict the future well?

- Choosing the right metric and connecting it to the objective. Choosing the right assumptions (e.g. collaborative filtering vs. content filtering).

- The behaviour to predict is complex and not always rational. Complex social interactions drive what becomes successful.

- Sparse data is used (hard to separate signal from noise).

- Exploit vs. explore? And choosing the right mechanism for recommendations (pull/push).

Kai's suggestion: Before the system goes online, it is very hard to predict whether it will accomplish the chosen objective. There are many challenges in building successful recommenders, one of which is choosing what to measure and how to link the metrics to the end effect you seek. The difficulty is finding something that you can actually measure and that is still a good measure of your objective. Long chains of proxies between the measurable quantity and the objective lead to many loose connections, and if the system does not work in practice it is hard to tell where the chain failed.
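One commonly used offline proxy is precision at k: the share of the top k recommendations that appear in a held-out set of items the user liked. The sketch below is only an illustration of such a proxy, with made-up item names. A high value on historical data does not by itself show that the objective is met.

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the first k recommended items that are in the relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k


# Hypothetical example: ranked recommendations vs. held-out liked items.
example_recommended = ["A", "B", "C", "D"]
example_relevant = {"B", "D", "E"}
example_precision = precision_at_k(example_recommended, example_relevant, k=3)
print(example_precision)
```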

Another challenge for recommenders that use ratings or reviews is the usually very low number of rated item/user pairs. Most users have not rated most items, which makes the data extremely sparse. Data that is both sparse and not missing at random is harder to generalize from. Even with a lot of data to test on, how well the system really generalizes only shows after it is launched in practice. There are several ways to cope with sparse data, but it still limits the development phase, where the system has to find weak signals among the noise.
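How sparse a rating matrix is can be checked directly. The sketch below computes the share of user/movie pairs that have a rating, assuming a layout with users as rows, movies as columns, and 0 meaning "not rated"; the function name and toy matrix are hypothetical.

```python
import pandas as pd


def rating_density(ratings_matrix):
    """Share of user/movie cells that contain a rating (0 means not rated)."""
    observed = int((ratings_matrix > 0).to_numpy().sum())
    return observed / ratings_matrix.size


# Hypothetical toy matrix: 4 of the 9 user/movie pairs are rated.
sparse_toy = pd.DataFrame(
    {"A": [5, 4, 0], "B": [0, 2, 0], "C": [3, 0, 0]},
    index=["u1", "u2", "u3"],
)
density = rating_density(sparse_toy)
print(f"density: {density:.2f}, sparsity: {1 - density:.2f}")
```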

A third challenge is to determine future behavior from historical data. It is made harder by the fact that users' choices are complex, not necessarily rational, and do not simply follow historical trends, but combine trending themes with personal taste. Complex social trends and individual choices at large scale are intrinsically hard to predict, and their effect only shows after the system goes online.

Finally, the recommendation must also be presented at the right time and in the right place. A system can use both push and pull approaches to recommendations. Even when the recommendation itself is right for the user, delivering it as yet another mail in an already full inbox might not be the best way to reach them. Similarly, the content inventory is crucial. Not even the best recommender system will do any good if the product-to-market fit is poor.
