# User-Based Collaborative Filtering
## Datasets Provided
The MovieLens-20M dataset is split across six .csv files:
1. genome-tags:    Maps each tag in the tag genome to a numeric ID
2. tags:           The tags each user assigned to each movie, with a timestamp
3. genome-scores:  Each tag's relevance to each movie covered by the tag genome, as a float between 0 and 1
4. ratings:        20M ratings users gave to movies they watched (0.5-5, in steps of 0.5)
5. movies:         Each movie's ID number and title, as well as its genres
6. links:          Each movie's entry numbers in IMDb and TMDb (seemingly irrelevant here)

**Movie IDs are consistent across `ratings`, `tags`, `movies`, and `links`.**

## Methodology
Because this filtering method is user-based, it relies on user-provided ratings and tags.
A user's 'likes' can be represented as a bag of words (or tag IDs) built from the tags associated with the movies they have watched, together with the ratings they gave those movies.
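
As a rough illustration of the bag-of-tags idea, the sketch below sums ratings per tag for a single user. The `movie_tags` mapping, the tag IDs, and the function name `build_tag_profile` are hypothetical. In the real data the tags would come from `tags` or `genome-scores`, and rating-weighted counting is only one possible way to combine tags with ratings.

```python
from collections import Counter

# Toy ratings for one active user as (movieId, rating) pairs; illustrative values only.
user_ratings = [(456, 2.5), (983, 3.5), (234, 5.0)]

# Hypothetical movie -> tag ID mapping (assumption: not taken from the dataset).
movie_tags = {456: [12, 40], 983: [40, 77], 234: [12, 77, 90]}


def build_tag_profile(user_ratings, movie_tags):
    """Sum the user's ratings over every tag attached to the movies they rated."""
    profile = Counter()
    for movie_id, rating in user_ratings:
        for tag_id in movie_tags.get(movie_id, []):
            profile[tag_id] += rating
    return profile


profile = build_tag_profile(user_ratings, movie_tags)
print(profile.most_common())
```

Tags that appear on several highly rated movies end up with the largest weights, which is the sense in which the bag describes what the user likes.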

The active user (the one whose interests we are predicting) has ratings structured as follows, for example:

| User | MovieID | Rating |
| --- | --- | --- |
| 123 | 456 | 2.5 |
| 123 | 983 | 3.5 |
| 123 | 234 | 5 |



In [2]:
# Import block
import pandas as pd
import numpy as np
import sklearn    # Comes with kNN method, may be useful

In [3]:
# Data loading block
genome_tags = pd.read_csv("genome-tags.csv") # get tags
user_tags = pd.read_csv("tags.csv", usecols=[0, 1, 2]) # get user-given tags, but not timestamp
genome_scores = pd.read_csv("genome-scores.csv")
ratings = pd.read_csv("ratings.csv", usecols=[0, 1, 2]) # get user-provided ratings
movies = pd.read_csv("movies.csv")
# links = pd.read_csv("links.csv")

############# MEMORY SAVING STEP ######################
# Set movie IDs to int32 from int64
movies["movieId"] = movies["movieId"].astype('int32')

# Change ratings from floats (0.5 to 5 in 0.5 increments) to integers (1-10)
# Change user ID and movie ID to int32 from int64
ratings["userId"] = ratings["userId"].astype('int32')
ratings["movieId"] = ratings["movieId"].astype('int32')
ratings["rating"] = (ratings["rating"] * 2).astype('int8') # Keep this in mind (1-10)

# Change movieId to int32 and tagId to int16
# Change relevance from float64 to float32
genome_scores["movieId"] = genome_scores["movieId"].astype('int32')
genome_scores["tagId"] = genome_scores["tagId"].astype('int16')
genome_scores["relevance"] = genome_scores["relevance"].astype('float32')
                     
# Change userId and movieId to int32
user_tags["userId"] = user_tags["userId"].astype('int32')
user_tags["movieId"] = user_tags["movieId"].astype('int32')
                     
# Change genome tag IDs to int16
genome_tags["tagId"] = genome_tags["tagId"].astype('int16')
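
The effect of the downcasting step can be checked on a small frame before applying it to 20M rows. This is a minimal sketch under stated assumptions: the toy values and the helper name `downcast_ratings` are made up for illustration, and the range check is an extra safeguard that the cell above does not perform.

```python
import numpy as np
import pandas as pd

# Toy stand-in for ratings.csv (assumption: same column names, made-up rows).
toy = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [2, 131260, 29],
    "rating": [0.5, 3.5, 5.0],
})


def downcast_ratings(df):
    """Return a copy with int32 IDs and ratings stored as int8 on a 1-10 scale."""
    out = df.copy()
    for col in ("userId", "movieId"):
        # Guard against silent overflow before narrowing the integer type.
        assert out[col].max() <= np.iinfo(np.int32).max
        out[col] = out[col].astype("int32")
    # 0.5-5 in steps of 0.5 becomes 1-10; round() protects against float noise.
    out["rating"] = (out["rating"] * 2).round().astype("int8")
    return out


small = downcast_ratings(toy)
before = toy.memory_usage(index=False, deep=True).sum()
after = small.memory_usage(index=False, deep=True).sum()
print(before, after)
```

Dividing the stored rating by 2 recovers the original half-star value, so no information is lost by the integer encoding.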

In [4]:
print("Items in genome-tags:\t\t" + str(len(genome_tags.index)))
print("Items in user-tags:\t\t" + str(len(user_tags.index)))
print("Items in genome-scores:\t\t" + str(len(genome_scores.index)))
print("Items in ratings:\t\t" + str(len(ratings.index)))
print("Items in movies:\t\t" + str(len(movies.index)))

Items in genome-tags:		1128
Items in user-tags:		465564
Items in genome-scores:		11709768
Items in ratings:		20000263
Items in movies:		27278


In [5]:
display(genome_tags)
#display(user_tags)
# Get only the user tags that appear within the genome
user_tags = user_tags[user_tags["tag"].isin(genome_tags["tag"])].reset_index(drop=True)
#display(user_tags)
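# Convert tags in user_tags to their respective tag ID.
# A dict lookup avoids the linear scan that list.index() performs for every row,
# and uses the tagId column directly instead of assuming tagId == position + 1.
tag_to_id = dict(zip(genome_tags["tag"], genome_tags["tagId"]))
user_tags["tag"] = user_tags["tag"].map(tag_to_id).astype('int16')
#display(user_tags)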
display(ratings)
#display(user_tags["tag"].apply(lambda tag_id : tags_list.index(tag_id)+1)

Unnamed: 0,tagId,tag
0,1,007
1,2,007 (series)
2,3,18th century
3,4,1920s
4,5,1930s
...,...,...
1123,1124,writing
1124,1125,wuxia
1125,1126,wwii
1126,1127,zombie


Unnamed: 0,userId,movieId,rating
0,1,2,7
1,1,29,7
2,1,32,7
3,1,47,7
4,1,50,7
...,...,...,...
20000258,138493,68954,9
20000259,138493,69526,9
20000260,138493,69644,6
20000261,138493,70286,10


In [6]:
#ratings["userId"][4235]
display(movies)

q = movies[["movieId", "genres"]]
display(q)
display(q.loc[q.movieId==3])

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
27273,131254,Kein Bund für's Leben (2007),Comedy
27274,131256,"Feuer, Eis & Dosenbier (2002)",Comedy
27275,131258,The Pirates (2014),Adventure
27276,131260,Rentun Ruusu (2001),(no genres listed)


Unnamed: 0,movieId,genres
0,1,Adventure|Animation|Children|Comedy|Fantasy
1,2,Adventure|Children|Fantasy
2,3,Comedy|Romance
3,4,Comedy|Drama|Romance
4,5,Comedy
...,...,...
27273,131254,Comedy
27274,131256,Comedy
27275,131258,Adventure
27276,131260,(no genres listed)


Unnamed: 0,movieId,genres
2,3,Comedy|Romance


In [41]:
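# Get number of users from the largest user ID; unlike reading the final row,
# max() does not depend on the ratings being sorted by userId
num_users = int(ratings["userId"].max())
# who_watched_what array: one row per user, one column per genre
# There are 138493 total users
WWW = np.zeros((num_users, 20))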
# Genres are a pipe-separated list, and are selected from the following:
genres = ["Action", "Adventure", "Animation", "Children", "Comedy", "Crime", "Documentary",
          "Drama", "Fantasy", "Film-Noir", "Horror", "Musical", "Mystery", "Romance",
          "Sci-Fi", "Thriller", "War", "Western", "IMAX", "(no genres listed)"]

# 1-based genre numbers in both directions, derived from the list above
num_to_genre = {num: genre for num, genre in enumerate(genres, start=1)}
genre_to_num = {genre: num for num, genre in num_to_genre.items()}

WWW = pd.DataFrame(data=WWW, columns=genres)
display(WWW)

Unnamed: 0,Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western,IMAX,(no genres listed)
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
138488,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
138489,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
138490,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
138491,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
WWW2 = np.zeros((num_users, 20))
display(WWW2.shape)
for rating_num in range(ratings.shape[0]):     # For every rating available
    r = ratings.iloc[rating_num] # Row holding the 'userId', 'movieId', and 'rating' given
    # Genre numbers (1-20) of the rated movie
    r_genres = [genre_to_num[g] for g in movies.loc[movies.movieId==r.movieId].genres.values[0].split("|")]
    for i in r_genres:
        # Ratings were doubled on load (1-10), so halve them back to 0.5-5.
        # User IDs and genre numbers are 1-based; array indices are 0-based.
        WWW2[r.userId-1, i-1] += r.rating / 2
display(pd.DataFrame(WWW2))
    

#for rating_num in range(ratings.index[-1]+1):
#    which_user = ratings["userId"][rating_num]
#    which_rating = ratings["rating"][rating_num]
#    movie_id = ratings["movieId"][rating_num]
#    movie_genres = movies.loc[movies.movieId==movie_id].genres
#    movie_genres = movie_genres.values[0].split("|") # Get list of all genres
#    #print(movie_genres)
#    for genre in movie_genres:
#        WWW.loc[WWW.index==which_user-1][genre] += which_rating
#display(WWW)

(138493, 20)
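
The row-by-row loop above performs one `movies` lookup per rating, which is very slow for 20M ratings. One possible alternative, sketched here on toy data, builds a movie-by-genre indicator table once and then accumulates each genre column with `np.bincount`. The function name `genre_scores` and the toy frames are assumptions for illustration; the sketch assumes user IDs are 1-based and ratings are already on the doubled 1-10 scale, as in the notebook.

```python
import numpy as np
import pandas as pd

# Toy data (assumption: same column layout as the notebook, made-up rows).
genres = ["Adventure", "Comedy", "Drama", "(no genres listed)"]
movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "genres": ["Adventure|Comedy", "Comedy", "(no genres listed)"],
})
ratings = pd.DataFrame({
    "userId": [1, 1, 2, 3],
    "movieId": [1, 2, 1, 3],
    "rating": [7, 10, 4, 6],  # doubled scale, 1-10
})


def genre_scores(ratings, movies, genres):
    """Sum each user's (halved) ratings per genre without a Python loop over ratings."""
    # One row per movie, one 0/1 column per genre.
    indicators = (movies.set_index("movieId")["genres"]
                  .str.get_dummies(sep="|")
                  .reindex(columns=genres, fill_value=0))
    num_users = int(ratings["userId"].max())
    user_idx = ratings["userId"].to_numpy() - 1   # 0-based row index
    values = ratings["rating"].to_numpy() / 2     # back to the 0.5-5 scale
    scores = np.zeros((num_users, len(genres)))
    for col, genre in enumerate(genres):          # only len(genres) iterations
        has_genre = (ratings["movieId"].map(indicators[genre])
                     .fillna(0).to_numpy().astype(bool))
        scores[:, col] = np.bincount(user_idx[has_genre],
                                     weights=values[has_genre],
                                     minlength=num_users)
    return pd.DataFrame(scores, columns=genres)


scores = genre_scores(ratings, movies, genres)
print(scores)
```

The only Python-level loop left runs once per genre, so the cost is dominated by a handful of vectorized passes over the ratings rather than millions of individual lookups.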

In [61]:
display(WWW2[1,1])
r = ratings.iloc[34]
display(r)

test_movie = movies.loc[movies.movieId==4]
display(test_movie)

q = [genre_to_num[g] for g in movies.loc[movies.movieId==r.movieId].genres.values[0].split("|")]
display(q)

0.0

userId        1
movieId    1208
rating        7
Name: 34, dtype: int32

Unnamed: 0,movieId,title,genres
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance


[1, 8, 17]

In [68]:
# Save the results of the extremely time-consuming loop into a .csv for future use
np.savetxt('who_watched_what.csv', WWW2, delimiter=',')

(20000263, 3)
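
`np.savetxt` writes bare numbers, so the genre names are lost in the file. If the column labels are wanted on reload, one option is to go through a DataFrame instead. The sketch below uses a toy array and a temporary directory; the three-genre subset and the values are made up for illustration.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Toy stand-in for the user-by-genre matrix (assumption: made-up values).
genres = ["Action", "Comedy", "Drama"]
scores = np.array([[3.5, 8.5, 0.0],
                   [2.0, 2.0, 0.0]])

# Write with a header row of genre names, then read the file back.
path = os.path.join(tempfile.mkdtemp(), "who_watched_what.csv")
pd.DataFrame(scores, columns=genres).to_csv(path, index=False)
reloaded = pd.read_csv(path)
print(reloaded)
```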

In [70]:
WWW2 = np.zeros((num_users, 20))
display(WWW2[0,0])

0.0
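
Once the user-by-genre matrix is filled in, a user-based method needs some way to find users similar to the active one. The following is a minimal sketch using cosine similarity in plain NumPy. It is an assumption about how the matrix might be used, not the notebook's chosen method; the function name `most_similar_users` and the toy profiles are hypothetical.

```python
import numpy as np


def most_similar_users(profiles, active_row, k=2):
    """Return the k rows most similar to `active_row` by cosine similarity."""
    norms = np.linalg.norm(profiles, axis=1)
    norms[norms == 0] = 1.0              # users with no ratings keep a zero vector
    unit = profiles / norms[:, None]
    sims = unit @ unit[active_row]
    sims[active_row] = -np.inf           # never return the active user itself
    order = np.argsort(-sims)[:k]
    return [(int(i), float(sims[i])) for i in order]


# Toy genre profiles, one row per user (assumption: made-up values).
profiles = np.array([
    [8.0, 1.0, 0.0],   # active user
    [4.0, 0.5, 0.0],   # same direction as the active user
    [0.0, 0.0, 5.0],   # no genres in common
    [1.0, 8.0, 0.0],   # overlapping genres, different emphasis
])

neighbours = most_similar_users(profiles, active_row=0, k=2)
print(neighbours)
```

Cosine similarity ignores how many movies a user has rated and compares only the shape of their genre preferences, which may or may not be desirable depending on how the scores are later combined.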