# Recommendation - Data Preparation 🎬

---

<img src="https://cdn-images-1.medium.com/max/1200/0*ePGWILY6GyplT-nn" />

---

In the next few challenges, you will build a powerful **movie recommender**.

We will use the open-source library [LightFM](https://github.com/lyst/lightfm), which provides an easy-to-use Python implementation of **hybrid** recommendation engines.

In this first part, we will prepare the data so that the model can be trained efficiently.

Start by loading the `movies` and `ratings` data, downloaded from the **small** [movielens dataset](https://grouplens.org/datasets/movielens/).



In [1]:
!pip install lightfm

Collecting lightfm
  Downloading lightfm-1.17.tar.gz (316 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: lightfm
  Building wheel for lightfm (setup.py) ... done
  Created wheel for lightfm: filename=lightfm-1.17-cp310-cp310-linux_x86_64.whl size=808329 sha256=5d4cb70ff551feafab7d2edbbc5d7b2b9e1ecbd127077f82de94e6d8fb2e369e
  Stored in directory: /root/.cache/pip/wheels/4f/9b/7e/0b256f2168511d8fa4dae4fae0200fdbd729eb424a912ad636
Successfully built lightfm
Installing collected packages: lightfm
Successfully installed lightfm-1.17


In [10]:
import numpy as np
import pandas as pd
from google.colab import files

# Upload files
uploaded_movies = files.upload()
uploaded_ratings = files.upload()

# Read uploaded files with specified encoding
movies = pd.read_csv(next(iter(uploaded_movies)), encoding='latin1')
ratings = pd.read_csv(next(iter(uploaded_ratings)), encoding='latin1')


Saving movies.csv to movies (4).csv


Saving ratings.csv to ratings (2).csv


**Q1**. What are the different types of recommendation models? Explain briefly with your own words the differences between them.

Recommendation systems are essential tools in various applications, from e-commerce to streaming services, guiding users toward products or content that align with their preferences. There are several types of recommendation models, each with distinct approaches and methodologies. Here’s a detailed explanation of the primary types:

1) Content-Based Filtering:
Content-based filtering relies on the characteristics and features of items to make recommendations. It analyzes the properties of the items a user has shown interest in and suggests similar items based on those attributes. This approach uses techniques like cosine similarity to measure the similarity between items. For example, if a user likes a particular book, the system will recommend other books with similar genres, authors, or keywords. The key advantage of content-based filtering is its ability to recommend items that match the user’s specific tastes, but it can be limited by the need for detailed item descriptions and may struggle with suggesting diverse or unexpected items.

2) Collaborative Filtering:
Collaborative filtering leverages the interactions and preferences of a large group of users to generate recommendations. There are two main types of collaborative filtering:

- User-Based Collaborative Filtering: This approach finds users who have similar preferences to the target user and recommends items that these similar users have liked. For example, if two users have rated several movies similarly, one user's positive rating for a new movie can lead to a recommendation for the other user.
- Item-Based Collaborative Filtering: Instead of focusing on users, this method finds items that are similar based on user ratings. If a user likes a particular movie, the system will recommend other movies that have been liked by users who also liked that movie.

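Collaborative filtering is effective because it can uncover hidden patterns and relationships in user behavior, making it possible to recommend items that are unexpected but relevant. However, it requires a large amount of user interaction data and can suffer from the "cold start" problem, where new users or items have insufficient data for making accurate recommendations.

3) Hybrid Models:
Hybrid models combine the two approaches: they learn from user-item interactions, as collaborative filtering does, while also using user and item features, as content-based filtering does. This helps with the cold start problem, since a new item can be recommended from its features before it has collected any ratings. LightFM, the library used in these challenges, implements this kind of model.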
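The cosine-similarity idea mentioned for content-based filtering can be sketched on a toy example. The genre vectors below are a simplified, hand-made encoding of three titles from `movies` (only four genres are kept), and the helper `cosine_similarity_matrix` is a hypothetical name introduced for illustration.

```python
import numpy as np

# Hypothetical toy catalogue: one row per movie, one column per genre
# (columns: Adventure, Children, Comedy, Romance). Simplified on purpose.
titles = ["Toy Story", "Jumanji", "Grumpier Old Men"]
genre_vectors = np.array([
    [1, 1, 1, 0],  # Toy Story
    [1, 1, 0, 0],  # Jumanji
    [0, 0, 1, 1],  # Grumpier Old Men
], dtype=float)


def cosine_similarity_matrix(features):
    """Return pairwise cosine similarities between the rows of `features`."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    normalized = features / norms
    return normalized @ normalized.T


similarities = cosine_similarity_matrix(genre_vectors)

# Rank the other movies by decreasing similarity to "Toy Story" (row 0).
ranking = [i for i in np.argsort(-similarities[0]) if i != 0]
most_similar_to_toy_story = titles[ranking[0]]
print(most_similar_to_toy_story)
```

In this toy setting, a user who liked "Toy Story" would be pointed to the title sharing the most genres with it. A real content-based system would use richer features than four genre flags.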

**Q1bis**. What data does the LightFM `fit` method expect? In particular, how should the training data be organized, and what type should the training dataset be?

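The LightFM `fit` method expects the interactions as a SciPy sparse matrix (the documentation specifies a COO matrix) of shape `[n_users, n_items]`: one row per user, one column per item, and each stored value is an interaction such as a rating. Optional `user_features` and `item_features` sparse matrices can be passed alongside it, which is what makes the model hybrid. Here, the interactions matrix is built from `ratings`, while metadata such as the genres in `movies` could later serve as item features.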
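As a rough illustration of that layout, the sketch below builds a tiny users-by-items matrix in COO format with SciPy. The indices and ratings are made up, and LightFM itself is deliberately not called here.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical toy interactions: (user index, item index, rating) triples.
user_indices = np.array([0, 0, 1, 2])
item_indices = np.array([0, 2, 1, 2])
rating_values = np.array([4.0, 5.0, 3.0, 1.0], dtype=np.float32)

# Rows are users, columns are items; unobserved pairs are simply not stored.
interactions = coo_matrix(
    (rating_values, (user_indices, item_indices)),
    shape=(3, 3),
)

print(interactions.shape)  # (n_users, n_items)
print(interactions.nnz)    # number of stored interactions
```

A matrix of this kind is what would be passed as the first argument of `fit`, for example `model.fit(interactions, epochs=10)`, where `model` and the epoch count are placeholders.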

**Q2**. Explore `movies` and `ratings`, what do those datasets contain? How are they organized?

In [12]:
from IPython.display import display

# Display first few rows of movies DataFrame
print("Movies DataFrame:")
display(movies.head())

Movies DataFrame:


Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [13]:
# Display first few rows of ratings DataFrame
print("\nRatings DataFrame:")
display(ratings.head())


Ratings DataFrame:


Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


---

### Q3 & Q4 are optional
> You can come back to them if you have time after finishing the project of the day.

We created a few utility functions for you in the `utils.py` script. In particular:
- `threshold_interactions_df`:
> Limit interactions df to minimum row and column interactions

**Q3**. Open `src/utils.py` file, and have a look at the documentation of this function to understand its goal and how it works.

Have a look at the code to fully understand how it works. You should be familiar with everything.

What does the variable `sparsity` represent? What range of values can it take?
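`src/utils.py` is not reproduced here, so the following is only a guess at what such a value could measure. One common convention, assumed in this sketch, is the percentage of all possible user-movie pairs that actually have a rating. The function name `interaction_density` and the toy data are hypothetical.

```python
import pandas as pd


def interaction_density(df, user_col="userId", item_col="movieId"):
    """Percentage of (user, item) pairs that have a recorded interaction.

    Assumes each (user, item) pair appears at most once in `df`.
    """
    n_users = df[user_col].nunique()
    n_items = df[item_col].nunique()
    return 100.0 * len(df) / (n_users * n_items)


# Toy data: 3 users, 3 movies, 4 ratings out of 9 possible pairs.
toy_ratings = pd.DataFrame({
    "userId": [1, 1, 2, 3],
    "movieId": [10, 20, 10, 30],
    "rating": [4.0, 3.5, 5.0, 2.0],
})

density = interaction_density(toy_ratings)
print(f"{density:.1f}% of the user-movie pairs are filled")
```

Under this assumed definition the value lies between 0 (excluded) and 100 (every user rated every movie). Check the docstring in `utils.py` to see whether it follows the same convention.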

**Q4**. Create a new DataFrame `ratings_thresh` that filters `ratings` to keep only:
- users that rated strictly more than 4 movies
- movies that have been rated at least 10 times

How many users/movies remain in this new dataset?
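For intuition, here is a minimal pandas sketch of this kind of filtering, written independently of `threshold_interactions_df` (whose exact behavior may differ). The function name `filter_by_counts` and the toy data are hypothetical. Note that "strictly more than 4" corresponds to a minimum of 5.

```python
import pandas as pd


def filter_by_counts(df, min_user_ratings, min_movie_ratings):
    """Keep rows whose movie, then whose user, reaches the given rating counts.

    This is a single pass: removing users can push a movie back under its
    threshold, so a stricter version would repeat until nothing changes.
    """
    movie_counts = df.groupby("movieId")["userId"].transform("count")
    df = df[movie_counts >= min_movie_ratings]
    user_counts = df.groupby("userId")["movieId"].transform("count")
    return df[user_counts >= min_user_ratings]


toy_ratings = pd.DataFrame({
    "userId": [1, 1, 2, 2, 3, 3],
    "movieId": [10, 20, 10, 20, 10, 30],
    "rating": [4.0, 3.0, 5.0, 2.0, 4.0, 1.0],
})

# Small thresholds so that the toy data shows the effect.
filtered = filter_by_counts(toy_ratings, min_user_ratings=2, min_movie_ratings=2)
print(filtered["userId"].nunique(), "users,", filtered["movieId"].nunique(), "movies")
```

On the real data the call would look like `filter_by_counts(ratings, 5, 10)`, and `nunique()` on each id column answers the "how many remain" question.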

**Q5**. In order to fit a [LightFM](https://lyst.github.io/lightfm/docs/home.html) model, we need to transform our DataFrame into a sparse matrix (cf. below). This is not straightforward, so we included the function `df_to_matrix` in `utils.py`.

> 🔦 **Hint**:  Sparse matrices are just **big matrices with a lot of zeros or empty values**.
>
> Existing tools (Pandas DataFrame, Numpy arrays for example) are not suitable for manipulating this kind of data. So we will use [Scipy sparse matrices](https://docs.scipy.org/doc/scipy-0.14.0/reference/sparse.html).
>
> There are many different "types" of sparse matrices (CSC, CSR, COO, DIA, etc.). You don't need to know them all. Just know that they are different storage formats, each with its own methods for manipulation, slicing, indexing, etc.

> 🔦 **Hint 2**:  By going from a DataFrame to a sparse matrix, you lose the ids (userId and movieId) and only deal with indices (row number and column number). Therefore, the `df_to_matrix` function also returns dictionaries mapping between ids and indices (e.g. `uid_to_idx` maps a userId to its row index in the matrix).


Have a look at the utility function's documentation, and use it to create 5 new variables:
- a final sparse matrix `ratings_matrix` (this will be the data used to train the model)
- the following utils mappers:
    - `uid_to_idx`
    - `idx_to_uid`
    - `mid_to_idx`
    - `idx_to_mid`
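Before building the real matrix, a toy example may help show why the storage format matters. The sketch below, with made-up values, builds a matrix in COO format (convenient for construction from triples) and converts it to CSR, which supports direct `[row, column]` lookups.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical toy matrix: 2 users x 3 movies, with 3 stored ratings.
rows = np.array([0, 0, 1])
cols = np.array([0, 2, 1])
values = np.array([4.0, 5.0, 3.0])
toy_coo = coo_matrix((values, (rows, cols)), shape=(2, 3))

# COO is convenient for building; CSR supports row slicing and [i, j] lookups.
toy_csr = toy_coo.tocsr()
stored = toy_csr[0, 2]   # a pair with a stored rating
missing = toy_csr[1, 0]  # a pair with nothing stored
print(stored, missing)
```

Converting once with `tocsr()` is usually cheaper than turning a whole row into a dense array each time a single value is needed.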

In [16]:
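import numpy as np
from scipy.sparse import coo_matrix

def df_to_sparse_matrix(df, rows_key, cols_key, values_key):
    """Build a COO matrix from a long-format DataFrame, plus id <-> index mappers."""
    # Map each raw id to a consecutive index, in order of first appearance
    row_idx = {key: idx for idx, key in enumerate(df[rows_key].unique())}
    col_idx = {key: idx for idx, key in enumerate(df[cols_key].unique())}
    # Inverse mappers: matrix index -> raw id
    idx_row = {idx: key for key, idx in row_idx.items()}
    idx_col = {idx: key for key, idx in col_idx.items()}

    rows = df[rows_key].map(row_idx)
    cols = df[cols_key].map(col_idx)
    values = df[values_key]

    # Explicit shape so the dimensions always match the mappers
    matrix = coo_matrix((values, (rows, cols)), shape=(len(row_idx), len(col_idx)))
    return matrix, row_idx, idx_row, col_idx, idx_col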

# Convert ratings DataFrame to sparse matrix and create mapper dictionaries
ratings_matrix, uid_to_idx, idx_to_uid, mid_to_idx, idx_to_mid = df_to_sparse_matrix(ratings, 'userId', 'movieId', 'rating')

# Display information about the sparse matrix and mappers
print("Shape of ratings_matrix:", ratings_matrix.shape)
print("Number of unique users:", len(uid_to_idx))
print("Number of unique movies:", len(mid_to_idx))

Shape of ratings_matrix: (610, 9724)
Number of unique users: 610
Number of unique movies: 9724


**Q6**.
- First, find which movies userId 4 rated.

- Then, look up the value of `ratings_matrix` for:
    - userId = 4 and movieId=1
    - userId = 4 and movieId=2
    - userId = 4 and movieId=21
    - userId = 4 and movieId=32
    - userId = 4 and movieId=126

What can you conclude about the meaning of the values in `ratings_matrix`?

In [20]:
# Find movies rated by userId 4
movies_rated_by_user_4 = ratings[ratings['userId'] == 4]['movieId'].unique()

# Display movies rated by userId 4
print("Movies rated by userId 4:")
print(movies_rated_by_user_4)

# Get indices for userId 4 and specific movieIds
user_4_idx = uid_to_idx[4]
movie_1_idx = mid_to_idx[1]
movie_2_idx = mid_to_idx[2]
movie_21_idx = mid_to_idx[21]
movie_32_idx = mid_to_idx[32]
movie_126_idx = mid_to_idx[126]

# Find ratings for userId 4 and specific movieIds (densify the user's row only once)
user_4_row = ratings_matrix.getrow(user_4_idx).toarray()[0]
rating_4_1 = user_4_row[movie_1_idx]
rating_4_2 = user_4_row[movie_2_idx]
rating_4_21 = user_4_row[movie_21_idx]
rating_4_32 = user_4_row[movie_32_idx]
rating_4_126 = user_4_row[movie_126_idx]

# Display ratings for userId 4 and specific movieIds
print("\nRating for userId 4 and movieId=1:", rating_4_1)
print("Rating for userId 4 and movieId=2:", rating_4_2)
print("Rating for userId 4 and movieId=21:", rating_4_21)
print("Rating for userId 4 and movieId=32:", rating_4_32)
print("Rating for userId 4 and movieId=126:", rating_4_126)


Movies rated by userId 4:
[  21   32   45   47   52   58  106  125  126  162  171  176  190  215
  222  232  235  247  260  265  296  319  342  345  348  351  357  368
  417  441  450  457  475  492  509  538  539  553  588  593  595  599
  608  648  708  759  800  892  898  899  902  904  908  910  912  914
  919  920  930  937 1025 1046 1057 1060 1073 1077 1079 1080 1084 1086
 1094 1103 1136 1179 1183 1188 1196 1197 1198 1199 1203 1211 1213 1219
 1225 1250 1259 1265 1266 1279 1282 1283 1288 1291 1304 1391 1449 1466
 1500 1517 1580 1597 1617 1641 1704 1719 1732 1733 1734 1834 1860 1883
 1885 1892 1895 1907 1914 1916 1923 1947 1966 1967 1968 2019 2076 2078
 2109 2145 2150 2174 2186 2203 2204 2282 2324 2336 2351 2359 2390 2395
 2406 2467 2571 2583 2599 2628 2683 2692 2712 2762 2763 2770 2791 2843
 2858 2874 2921 2926 2959 2973 2997 3033 3044 3060 3079 3083 3160 3175
 3176 3204 3255 3317 3358 3365 3386 3408 3481 3489 3508 3538 3591 3788
 3809 3851 3897 3911 3967 3996 4002 4014 4020 4021 

In [22]:
# Convert NumPy array to DataFrame
movies_rated_by_user_4_df = pd.DataFrame(movies_rated_by_user_4, columns=['movieId'])

# Merging on movieId
merged_df = pd.merge(movies_rated_by_user_4_df, movies, on='movieId')

# Print titles of movies rated by userId 4
print("Titles of movies rated by userId 4:")
print(merged_df['title'])


Titles of movies rated by userId 4:
0                                      Get Shorty (1995)
1              Twelve Monkeys (a.k.a. 12 Monkeys) (1995)
2                                      To Die For (1995)
3                            Seven (a.k.a. Se7en) (1995)
4                                Mighty Aphrodite (1995)
                             ...                        
211                                        L.I.E. (2001)
212                     Man Who Wasn't There, The (2001)
213    Harry Potter and the Sorcerer's Stone (a.k.a. ...
214    Devil's Backbone, The (Espinazo del diablo, El...
215                                 No Man's Land (2001)
Name: title, Length: 216, dtype: object


In [23]:
user_id = 4
movie_id_list = [1, 2, 21, 32, 126]

user_idx = uid_to_idx[user_id]
for movie_id in movie_id_list:
    movie_idx = mid_to_idx[movie_id]
    rating = ratings_matrix.getrow(user_idx).toarray()[0, movie_idx]
    print(f"Rating for userId {user_id} and movieId {movie_id}: {rating}")


Rating for userId 4 and movieId 1: 0.0
Rating for userId 4 and movieId 2: 0.0
Rating for userId 4 and movieId 21: 3.0
Rating for userId 4 and movieId 32: 2.0
Rating for userId 4 and movieId 126: 1.0
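The output above suggests an interpretation: movies 1 and 2 are absent from the list of movies rated by userId 4 and read back as 0.0, while movies 21, 32 and 126 return the ratings the user gave. In other words, a 0.0 stands for "no rating stored", not for a rating of zero. The sketch below checks this on a small hypothetical matrix; all `toy_` names are introduced for illustration only.

```python
import pandas as pd
from scipy.sparse import coo_matrix

toy_ratings = pd.DataFrame({
    "userId": [4, 4, 7],
    "movieId": [21, 32, 1],
    "rating": [3.0, 2.0, 4.5],
})

# Map raw ids to consecutive row/column indices.
toy_uid_to_idx = {uid: i for i, uid in enumerate(toy_ratings["userId"].unique())}
toy_mid_to_idx = {mid: j for j, mid in enumerate(toy_ratings["movieId"].unique())}

toy_matrix = coo_matrix(
    (
        toy_ratings["rating"].to_numpy(),
        (
            toy_ratings["userId"].map(toy_uid_to_idx).to_numpy(),
            toy_ratings["movieId"].map(toy_mid_to_idx).to_numpy(),
        ),
    ),
    shape=(len(toy_uid_to_idx), len(toy_mid_to_idx)),
).tocsr()

# Stored entries in user 4's row versus the ratings this user actually gave.
n_stored = toy_matrix[toy_uid_to_idx[4]].nnz
n_rated = int((toy_ratings["userId"] == 4).sum())

# User 4 never rated movie 1, so nothing is stored and the lookup gives 0.0.
unrated_value = toy_matrix[toy_uid_to_idx[4], toy_mid_to_idx[1]]
print(n_stored, n_rated, unrated_value)
```

The same comparison on the real data would set the number of stored entries in a user's row against that user's number of rows in `ratings`.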


**Q7**. Now that you have a `ratings_matrix` in the correct format, let's save it in pickle format:
- Create a variable `dst_dir` corresponding to the path of the folder `data/netflix` located at the root of the repository
- **Verify that this is the correct path**
- Save the ratings_matrix in pickle (as `ratings_matrix.pkl`) in this corresponding directory

In [24]:
import os
import pickle

# Define the destination directory
dst_dir = "/content/data/netflix"  # Update this with the correct path

# Create the directory if it doesn't exist
os.makedirs(dst_dir, exist_ok=True)

# Save the ratings_matrix in pickle format
ratings_matrix_file = os.path.join(dst_dir, "ratings_matrix.pkl")
with open(ratings_matrix_file, "wb") as file:
    pickle.dump(ratings_matrix, file)

# Verify the path
print("Ratings matrix saved at:", ratings_matrix_file)


Ratings matrix saved at: /content/data/netflix/ratings_matrix.pkl
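Printing the path does not prove that the file is usable. One way to check, sketched here on a toy matrix written to a temporary directory (so nothing under `data/netflix` is touched), is to reload the pickle and compare it with the original.

```python
import os
import pickle
import tempfile

import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical toy matrix standing in for `ratings_matrix`.
toy_matrix = coo_matrix(
    (np.array([4.0, 5.0]), (np.array([0, 1]), np.array([1, 0]))),
    shape=(2, 2),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "ratings_matrix.pkl")
    with open(path, "wb") as f:
        pickle.dump(toy_matrix, f)

    file_exists = os.path.isfile(path)
    with open(path, "rb") as f:
        reloaded = pickle.load(f)

# The reloaded matrix should have the same shape and the same values.
same_shape = reloaded.shape == toy_matrix.shape
same_values = np.array_equal(reloaded.toarray(), toy_matrix.toarray())
print(file_exists, same_shape, same_values)
```

For the real matrix, comparing dense arrays can be memory-hungry; comparing `shape` and `nnz` is a lighter check.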


**Q8**. Also save all the mapping objects as pickle files (`idx_to_mid`, `mid_to_idx`, `uid_to_idx`, `idx_to_uid`), as they will be useful later.

In [25]:
import os
import pickle

# Define the destination directory (same folder as ratings_matrix.pkl)
dst_dir = "/content/data/netflix"

# Create the directory if it doesn't exist
os.makedirs(dst_dir, exist_ok=True)

# Save mappers
with open(os.path.join(dst_dir, "uid_to_idx.pkl"), 'wb') as f:
    pickle.dump(uid_to_idx, f)

with open(os.path.join(dst_dir, "idx_to_uid.pkl"), 'wb') as f:
    pickle.dump(idx_to_uid, f)

with open(os.path.join(dst_dir, "mid_to_idx.pkl"), 'wb') as f:
    pickle.dump(mid_to_idx, f)

with open(os.path.join(dst_dir, "idx_to_mid.pkl"), 'wb') as f:
    pickle.dump(idx_to_mid, f)

print("Mappers saved successfully!")


Mappers saved successfully!
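Since the mappers come in inverse pairs, a quick sanity check after saving is to reload one and confirm that it still round-trips through its inverse. This sketch uses small hypothetical dictionaries and a temporary directory rather than the real files.

```python
import os
import pickle
import tempfile

# Hypothetical toy mappers: movieId -> column index, and the inverse.
toy_mid_to_idx = {1: 0, 3: 1, 6: 2}
toy_idx_to_mid = {idx: mid for mid, idx in toy_mid_to_idx.items()}

with tempfile.TemporaryDirectory() as tmp_dir:
    # Save each mapper under its own file name.
    for name, mapper in [("mid_to_idx", toy_mid_to_idx), ("idx_to_mid", toy_idx_to_mid)]:
        with open(os.path.join(tmp_dir, f"{name}.pkl"), "wb") as f:
            pickle.dump(mapper, f)

    # Reload one of them, as a later challenge would.
    with open(os.path.join(tmp_dir, "mid_to_idx.pkl"), "rb") as f:
        reloaded_mid_to_idx = pickle.load(f)

# Every id should map to an index that maps back to the same id.
round_trip_ok = all(
    toy_idx_to_mid[idx] == mid for mid, idx in reloaded_mid_to_idx.items()
)
print(round_trip_ok)
```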


On to the next challenge! 🍿