# User-Based Collaborative Filtering
## Datasets Provided
The MovieLens-20M dataset is split over six .csv files:
1. genome-tags:    Associates each movie tag with a tag ID number
2. tags:           The tags each user assigned to each movie, with a timestamp
3. genome-scores:  Each tag's relevance to a movie, as a float between 0 and 1
4. ratings:        20M ratings users gave to movies they watched (0.5-5, in steps of 0.5)
5. movies:         Each movie's ID number and name, as well as its genres
6. links:          Each movie's entry numbers in IMDb and TMDb (seemingly irrelevant here)

**Movie IDs are consistent across `ratings`, `tags`, `movies`, and `links`.**

## Methodology
As this filtering method is user-based, it relies on user-provided ratings and tags.
A user's 'likes' can be represented by a bag of words (or tag IDs) built from the tags associated with the movies they watched, together with the ratings they gave those movies.

The active user (the one whose interests we are predicting) has ratings structured as follows, for example:
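
One possible way to turn this idea into numbers is sketched below. It is only an illustration under stated assumptions: the toy tables, the helper name `tag_profile`, and the choice of a rating-weighted mean of tag relevance are all made up here and are not taken from the notebook's later cells.

```python
import pandas as pd

# Toy stand-ins for the real tables (all values are invented for illustration).
ratings_toy = pd.DataFrame({
    "userId":  [123, 123, 123],
    "movieId": [456, 983, 234],
    "rating":  [2.5, 3.5, 5.0],
})
genome_scores_toy = pd.DataFrame({
    "movieId":   [456, 456, 983, 983, 234, 234],
    "tagId":     [1, 2, 1, 2, 1, 2],
    "relevance": [0.9, 0.1, 0.2, 0.8, 0.5, 0.5],
})

def tag_profile(ratings_df, scores_df):
    """Return one row per user and one column per tag.

    Each entry is the rating-weighted mean relevance of that tag over
    the movies the user rated (an assumed weighting scheme).
    """
    merged = ratings_df.merge(scores_df, on="movieId")
    merged["weighted"] = merged["rating"] * merged["relevance"]
    grouped = merged.groupby(["userId", "tagId"])
    profile = grouped["weighted"].sum() / grouped["rating"].sum()
    return profile.unstack("tagId")

profile = tag_profile(ratings_toy, genome_scores_toy)
print(profile)
```

Highly rated movies pull the profile towards their tags, so two users who watched the same films but rated them differently end up with different profiles.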

| User | MovieID | Rating |
| --- | --- | --- |
| 123 | 456 | 2.5 |
| 123 | 983 | 3.5 |
| 123 | 234 | 5 |
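
As a rough sketch of how ratings in this long format could feed a user-based prediction, the example below pivots them into a user-by-movie matrix and predicts a rating as a similarity-weighted average of other users' ratings. Users 7 and 8, movie 777, and the helper names `cosine_similarity` and `predict` are assumptions made for illustration. Treating unrated movies as 0 inside the cosine is a simplification, not necessarily the approach the final model should take.

```python
import numpy as np
import pandas as pd

# Toy long-format ratings; user 123 matches the table above,
# users 7 and 8 and movie 777 are invented.
toy = pd.DataFrame({
    "userId":  [123, 123, 123, 7, 7, 7, 8, 8],
    "movieId": [456, 983, 234, 456, 983, 777, 234, 777],
    "rating":  [2.5, 3.5, 5.0, 2.0, 4.0, 4.5, 1.0, 2.0],
})

# Rows = users, columns = movies, 0 = not rated.
matrix = toy.pivot(index="userId", columns="movieId", values="rating").fillna(0.0)

def cosine_similarity(a, b):
    """Cosine of the angle between two rating vectors (0 if either is all zeros)."""
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / norm) if norm else 0.0

def predict(matrix, active_user, movie_id):
    """Similarity-weighted average of other users' ratings for one movie."""
    target = matrix.loc[active_user].to_numpy()
    num, den = 0.0, 0.0
    for _, row in matrix.drop(index=active_user).iterrows():
        rating = row[movie_id]
        if rating == 0:  # this neighbour has not rated the movie
            continue
        sim = cosine_similarity(target, row.to_numpy())
        num += sim * rating
        den += abs(sim)
    return num / den if den else float("nan")

prediction = predict(matrix, active_user=123, movie_id=777)
print(prediction)
```

Because user 7 shares more highly rated movies with user 123 than user 8 does, user 7's rating of movie 777 carries more weight in the prediction.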



In [3]:
# Import block
import pandas as pd
import numpy as np
import sklearn    # Provides k-nearest-neighbours (kNN) estimators, which may be useful

In [4]:
# Data loading block
genome_tags = pd.read_csv("genome-tags.csv") # get tags
user_tags = pd.read_csv("tags.csv", usecols=[0, 1, 2]) # get user-given tags, but not timestamp
genome_scores = pd.read_csv("genome-scores.csv")
ratings = pd.read_csv("ratings.csv", usecols=[0, 1, 2]) # get user-provided ratings
movies = pd.read_csv("movies.csv")
# links = pd.read_csv("links.csv")

############# MEMORY SAVING STEP ######################
# Set movie IDs to int32 from int64
movies["movieId"] = movies["movieId"].astype('int32')

# Change user ID and movie ID to int32 from int64
ratings["userId"] = ratings["userId"].astype('int32')
ratings["movieId"] = ratings["movieId"].astype('int32')
# Change ratings from floats (0.5 to 5 in 0.5 increments) to integers (1-10)
ratings["rating"] = (ratings["rating"] * 2).astype('int8') # Keep in mind: ratings are now on a 1-10 scale

# Change movieId to int32 and tagId to int16
# Change relevance from float64 to float32
genome_scores["movieId"] = genome_scores["movieId"].astype('int32')
genome_scores["tagId"] = genome_scores["tagId"].astype('int16')
genome_scores["relevance"] = genome_scores["relevance"].astype('float32')
                     
# Change userId and movieId to int32
user_tags["userId"] = user_tags["userId"].astype('int32')
user_tags["movieId"] = user_tags["movieId"].astype('int32')
                     
# Change genome tag IDs to int16
genome_tags["tagId"] = genome_tags["tagId"].astype('int16')

In [5]:
print("Items in genome-tags:\t\t" + str(len(genome_tags.index)))
print("Items in user-tags:\t\t" + str(len(user_tags.index)))
print("Items in genome-scores:\t\t" + str(len(genome_scores.index)))
print("Items in ratings:\t\t" + str(len(ratings.index)))
print("Items in movies:\t\t" + str(len(movies.index)))

Items in genome-tags:		1128
Items in user-tags:		465564
Items in genome-scores:		11709768
Items in ratings:		20000263
Items in movies:		27278


In [52]:
display(genome_tags)
# Keep only the user tags that also appear in the genome
user_tags = user_tags[user_tags["tag"].isin(genome_tags["tag"])].reset_index(drop=True)
tags_list = genome_tags["tag"].values.tolist()   # still used by a later cell
# Convert each tag string in user_tags to its genome tagId.
# A dict lookup avoids a linear list search for every row;
# the result matches tags_list.index(tag)+1 because tagId is the list position + 1.
tag_to_id = dict(zip(genome_tags["tag"], genome_tags["tagId"]))
user_tags["tag"] = user_tags["tag"].map(tag_to_id).astype('int16')
display(user_tags)

|  | tagId | tag |
| --- | --- | --- |
| 0 | 1 | 007 |
| 1 | 2 | 007 (series) |
| 2 | 3 | 18th century |
| 3 | 4 | 1920s |
| 4 | 5 | 1930s |
| ... | ... | ... |
| 1123 | 1124 | writing |
| 1124 | 1125 | wuxia |
| 1125 | 1126 | wwii |
| 1126 | 1127 | zombie |


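|  | userId | movieId | tag |
| --- | --- | --- | --- |
| 0 | 65 | 208 | 288 |
| 1 | 65 | 353 | 288 |
| 2 | 65 | 521 | 712 |
| 3 | 65 | 592 | 288 |
| 4 | 65 | 668 | 149 |
| ... | ... | ... | ... |
| 217566 | 138446 | 3086 | 882 |
| 217567 | 138446 | 3489 | 1091 |
| 217568 | 138446 | 7164 | 1091 |
| 217569 | 138446 | 55999 | 829 |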


In [21]:
num_users = ratings["userId"].max() # Largest user ID; equals the number of users only if IDs run 1..N without gaps
# Create a user-movie matrix where rows=users and columns=movie ratings
#user_ratings_matrix = pd.DataFrame(index=range(0, num_users+1), columns=movies["movieId"])

#user_ratings_matrix.shape

# Notes: user tags are almost useless - many users invent their own tags, which fall outside the genome tags

# RATINGS appear important - other users' ratings of movies in the same GENRE would affect recommendations
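
A dense frame with one row per user and one column per movie would be almost entirely empty: only about 20M ratings exist across 27,278 movies and user IDs that run past 138,000. One alternative, sketched here under assumptions, is a SciPy sparse matrix. The toy data and the variable names (`sparse_ratings`, `user_pos`, `movie_pos`) are invented for illustration, and `scipy` is not imported anywhere else in this notebook.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Toy ratings on the notebook's 1-10 integer scale (values invented for illustration).
toy = pd.DataFrame({
    "userId":  [1, 1, 2, 3, 3],
    "movieId": [10, 50000, 10, 7, 50000],
    "rating":  [7, 10, 4, 9, 2],
})

# factorize maps arbitrary IDs to dense positions 0..n-1, so gaps in
# movieId do not create empty columns.
user_pos, user_ids = pd.factorize(toy["userId"], sort=True)
movie_pos, movie_ids = pd.factorize(toy["movieId"], sort=True)

# Only the rated (user, movie) cells are stored; everything else is an implicit 0.
sparse_ratings = csr_matrix(
    (toy["rating"].to_numpy(dtype=np.int8), (user_pos, movie_pos)),
    shape=(len(user_ids), len(movie_ids)),
)

print(sparse_ratings.shape)  # (number of users, number of distinct movies)
print(sparse_ratings.nnz)    # number of stored ratings
```

`user_ids` and `movie_ids` keep the mapping from matrix positions back to the original IDs, which is needed to translate any neighbour or recommendation found in the matrix back into real users and movies.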


In [41]:
tags_list.index('wwii')

1125