<a href="https://colab.research.google.com/github/DianaMoyano1/RecommenderSystems_Examples/blob/main/01_MovieLens_to_small_df.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
from google.colab import drive
drive.mount('drive')
%cd 'drive/My Drive/Courses/Udemy/Recommender Systems/Colab Examples'

Drive already mounted at drive; to attempt to forcibly remount, call drive.mount("drive", force_remount=True).
/content/drive/My Drive/Courses/Udemy/Recommender Systems/Colab Examples


In [None]:
df= pd.read_csv("rating.csv")
df=df.drop(['timestamp'],axis=1)

In [None]:
df.head()

Unnamed: 0,userId,movieId,rating
0,1,2,3.5
1,1,29,3.5
2,1,32,3.5
3,1,47,3.5
4,1,50,3.5


# First Preprocessing

### User ID 
Count from 0 instead of 1, since the IDs will be used to index a NumPy array. Subtract 1 from each ID.

In [None]:
df.userId = df.userId-1

In [None]:
df.head()

Unnamed: 0,userId,movieId,rating
0,0,2,3.5
1,0,29,3.5
2,0,32,3.5
3,0,47,3.5
4,0,50,3.5


### Movie IDs
There are only about 20K distinct movies, but their IDs are not sequential and many values are missing.
Create a new mapping from 0 to ~20K.

In [None]:
unique_movie_ids=set(df.movieId.values) #set() keeps only the distinct movie IDs

The code below loops through each unique movie ID exactly once.
Movie IDs go from 1 to ~100K, but THEY ARE NOT SEQUENTIAL. To avoid wasting array space on IDs that never occur, we assign each old ID a new, consecutive one.
User IDs do not have this issue because they cover every number in their range (no gaps).
The dictionary maps old IDs to new IDs.
The key is the old movie ID and the value is the running count, which becomes the new index.
The loop only needs the old ID from the set; the count supplies the new one.
`count += 1` advances the counter for the next iteration.
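
The claim that one ID column has no gaps while another does can be checked directly. The sketch below is only an illustration on made-up data: `is_contiguous` and the `toy` frame are hypothetical names that do not appear in the notebook.

```python
import pandas as pd

def is_contiguous(ids: pd.Series) -> bool:
    """Return True if the distinct values cover min..max with no gaps."""
    n_unique = ids.nunique()
    return bool(n_unique == ids.max() - ids.min() + 1)

# Toy data (hypothetical values, not read from rating.csv)
toy = pd.DataFrame({
    "userId":  [0, 0, 1, 2, 2, 3],
    "movieId": [2, 29, 32, 2, 50, 29],
})

print(is_contiguous(toy.userId))   # user IDs 0..3 have no gaps
print(is_contiguous(toy.movieId))  # movie IDs 2, 29, 32, 50 have gaps
```

Running the same helper on `df.userId` and `df.movieId` would show whether the real columns behave the same way.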

In [None]:
#Build a dictionary that maps each old movie ID to a new, consecutive ID
#(data mapping is the process of matching fields from one dataset to another)
#The first new ID is 0
movie2idx={}
count=0
for movie_id in unique_movie_ids: #Visit each unique ID once
  movie2idx[movie_id]=count
  count+=1 #Next new ID
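
As an alternative to the hand-written loop, pandas offers `pd.factorize`, which assigns consecutive integer codes to distinct values in one call. This is a sketch on made-up IDs, not the notebook's method; `raw_movie_ids` and `toy_movie2idx` are hypothetical names. With `sort=True` the codes follow the sorted order of the old IDs, so the mapping does not depend on set iteration order.

```python
import pandas as pd

# Hypothetical raw movie IDs with gaps and repeats
raw_movie_ids = pd.Series([50, 2, 47, 2, 29, 50])

# codes[i] is the new index of raw_movie_ids[i]; uniques holds the old IDs
codes, uniques = pd.factorize(raw_movie_ids, sort=True)

# Same shape of dictionary as movie2idx: old ID -> new index
toy_movie2idx = {int(old): new for new, old in enumerate(uniques)}

print(codes.tolist())   # [3, 0, 2, 0, 1, 3]
print(toy_movie2idx)    # {2: 0, 29: 1, 47: 2, 50: 3}
```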


In [None]:
#Add the new IDs to the dataframe
#Series.map looks up each movieId in the dictionary (same result as a row-wise apply, but much faster)
df['movie_idx']=df.movieId.map(movie2idx)

In [None]:
df.to_csv('edited_rating.csv', index=False) #index=False avoids an extra "Unnamed: 0" column on reload

# Shrinking

A dense user-by-movie array would be too large (over 100K users and roughly 20K movies). Instead, we keep a subset of users (those who rated the most movies) and a subset of movies (those with the most ratings).
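
To see why the full array is impractical, a rough memory estimate helps. The sketch below assumes a dense array with 4 bytes per entry (float32) and uses round figures of 100,000 users and 20,000 movies purely for illustration; `dense_matrix_gb` is a hypothetical helper. The shrunk size uses the 10,000 users and 2,000 movies kept later in this notebook.

```python
def dense_matrix_gb(n_users: int, n_movies: int, bytes_per_entry: int = 4) -> float:
    """Approximate size in GB of a dense n_users x n_movies array."""
    return n_users * n_movies * bytes_per_entry / 1e9

full = dense_matrix_gb(100_000, 20_000)   # illustrative round numbers
small = dense_matrix_gb(10_000, 2_000)    # subset sizes used in this notebook

print(f"full: {full:.2f} GB, shrunk: {small:.2f} GB")
# full: 8.00 GB, shrunk: 0.08 GB
```

Under these assumptions, keeping a tenth of the users and a tenth of the movies reduces the array to a hundredth of its size.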

In [None]:
import pickle
from collections import Counter

In [None]:
print("original dataframe size:", len(df))

original dataframe size: 20000263


In [None]:
N = df.userId.max() + 1 #Number of users
M = df.movie_idx.max() + 1 #Number of movies

In [None]:
#Counts how many times a user/movie appears
user_ids_count=Counter(df.userId)
movie_ids_count=Counter(df.movie_idx)

In [None]:
#Number of users and movies we would like to keep
n = 10000 #users
m = 2000 #movies

In [None]:
#Select the most common user and movie ids
#most_common returns (id, count) tuples
#We only keep the id itself
user_ids= [user for user, _ in user_ids_count.most_common(n)]
movie_ids= [movie for movie, _ in movie_ids_count.most_common(m)] #loop variable renamed so it no longer shadows m
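
A small example may make the `Counter.most_common` step easier to follow. The IDs below are made up and the variable names are hypothetical; only the `Counter` calls match what the notebook uses.

```python
from collections import Counter

# Hypothetical column of user IDs: user 5 rated three times, 7 twice, 9 once
toy_user_ids = [5, 5, 5, 7, 7, 9]

counts = Counter(toy_user_ids)

# most_common(k) returns the k most frequent items as (id, count) tuples
top_pairs = counts.most_common(2)

# Keep only the ids, discarding the counts
top_ids = [uid for uid, _ in top_pairs]

print(top_pairs)  # [(5, 3), (7, 2)]
print(top_ids)    # [5, 7]
```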



In [None]:
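#Keep only rows whose user and movie are both in the selected subsets
#Make a copy so the ID columns are overwritten on df_small itself, not on a slice of df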
df_small=df[df.userId.isin(user_ids) & df.movie_idx.isin(movie_ids)].copy()

In [None]:
#Remake user ids and movie ids since they're no longer sequential
new_user_id_map={}
i=0
for old in user_ids:
  new_user_id_map[old] = i
  i+=1
print("i:",i)


i: 10000


In [None]:
new_movie_id_map={}
j=0
for old in movie_ids:
  new_movie_id_map[old] = j
  j+=1
print("j:",j)

j: 2000


Set the new IDs:

In [None]:
#Series.map replaces each old ID with its new one (same result as a row-wise apply, but much faster)
df_small['userId'] = df_small.userId.map(new_user_id_map)
df_small['movie_idx'] = df_small.movie_idx.map(new_movie_id_map)

In [None]:
print("max user id:", df_small.userId.max())
print("max movie id:", df_small.movie_idx.max())

max user id: 9999
max movie id: 1999
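
The whole shrinking procedure (count, select, filter, renumber) can be traced on a toy frame small enough to check by hand. This is a sketch under assumptions: the data is invented, and `toy`, `keep_users`, `keep_movies`, `user_map` and `movie_map` are hypothetical names standing in for `df`, `n`, `m` and the two ID maps.

```python
import pandas as pd
from collections import Counter

# Invented ratings: user 0 rates 4 movies, user 2 rates 3, user 1 rates 2, user 3 rates 1
toy = pd.DataFrame({
    "userId":    [0, 0, 0, 0, 1, 1, 2, 2, 2, 3],
    "movie_idx": [0, 1, 2, 3, 0, 3, 0, 3, 1, 3],
    "rating":    [3.5, 4.0, 5.0, 2.0, 4.5, 3.0, 4.0, 1.0, 5.0, 2.5],
})
keep_users, keep_movies = 2, 2

# Most active users and most rated movies
top_users = [u for u, _ in Counter(toy.userId).most_common(keep_users)]
top_movies = [mv for mv, _ in Counter(toy.movie_idx).most_common(keep_movies)]

# Keep rows where both the user and the movie survive
small = toy[toy.userId.isin(top_users) & toy.movie_idx.isin(top_movies)].copy()

# Renumber the surviving ids from 0
user_map = {old: new for new, old in enumerate(top_users)}
movie_map = {old: new for new, old in enumerate(top_movies)}
small["userId"] = small.userId.map(user_map)
small["movie_idx"] = small.movie_idx.map(movie_map)

print(top_users, top_movies)                         # [0, 2] [3, 0]
print(small.userId.max(), small.movie_idx.max())     # 1 1
```

As in the notebook, the maximum IDs after renumbering are one less than the number of users and movies kept.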


In [None]:
df_small.to_csv('very_small_rating.csv', index=False) #index=False avoids an extra "Unnamed: 0" column on reload