# Recommendation - Data Preparation 🎬

---

<img src="https://cdn-images-1.medium.com/max/1200/0*ePGWILY6GyplT-nn" />

---

In the next few challenges, you will build a powerful **movie recommender**.

We will use the open-source library [LightFM](https://github.com/lyst/lightfm), which provides an easy-to-use Python implementation of **hybrid** recommendation engines.

In this first part, we will prepare the data so that the model can be trained efficiently.

We let you load the `movies` and `ratings` data, downloaded from the **small** [MovieLens dataset](https://grouplens.org/datasets/movielens/).



In [2]:
### TODO: Load the movies and ratings datasets
import pandas as pd
#from lightfm import LightFM
#from lightfm.data import Dataset
# Load the datasets
movies = pd.read_csv('../../ml-latest-small/ml-latest-small/movies.csv')
ratings = pd.read_csv('../../ml-latest-small/ml-latest-small/ratings.csv')
# Display the first few rows of each DataFrame
print(movies.head())
print(ratings.head())

   movieId                               title  \
0        1                    Toy Story (1995)   
1        2                      Jumanji (1995)   
2        3             Grumpier Old Men (1995)   
3        4            Waiting to Exhale (1995)   
4        5  Father of the Bride Part II (1995)   

                                        genres  
0  Adventure|Animation|Children|Comedy|Fantasy  
1                   Adventure|Children|Fantasy  
2                               Comedy|Romance  
3                         Comedy|Drama|Romance  
4                                       Comedy  
   userId  movieId  rating  timestamp
0       1        1     4.0  964982703
1       1        3     4.0  964981247
2       1        6     4.0  964982224
3       1       47     5.0  964983815
4       1       50     5.0  964982931


**Q1**. What are the different types of recommendation models? Briefly explain, in your own words, the differences between them.

Content-based filtering: This approach recommends items similar to those the user has already liked or interacted with. It relies on the attributes of the items (e.g., movie genre, director, actors).

For example, if a user enjoys a certain film, the system may suggest other films from the same director or genre.

Collaborative filtering: This approach recommends items based on the preferences and behavior of similar users. It finds users with similar tastes and recommends the items they enjoyed.

For example, if two users rate many of the same movies highly, a movie that one of them enjoyed may be suggested to the other.

Hybrid models: These combine two or more recommendation approaches to benefit from their strengths while limiting their drawbacks.
For instance, a system may combine collaborative and content-based filtering to increase accuracy.
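
To make the difference concrete, here is a minimal sketch on made-up toy data (the matrices and the `cosine_similarity` helper are illustrative assumptions, not part of LightFM). Collaborative filtering compares users through their ratings, while content-based filtering compares items through their attributes.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D vectors (0.0 if either is all zeros)."""
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / norm) if norm else 0.0


# Toy ratings (made up): 3 users x 4 movies, 0 means "not rated".
toy_ratings = np.array([
    [5, 4, 0, 0],  # user 0
    [4, 5, 0, 1],  # user 1: tastes close to user 0
    [0, 1, 5, 4],  # user 2: very different tastes
], dtype=float)

# Toy item attributes (made up): 4 movies x 2 genres (comedy, thriller).
toy_genres = np.array([
    [1, 0],  # movie 0: comedy
    [1, 0],  # movie 1: comedy
    [0, 1],  # movie 2: thriller
    [0, 1],  # movie 3: thriller
], dtype=float)

# Collaborative filtering: compare users through their rating rows.
sim_users_01 = cosine_similarity(toy_ratings[0], toy_ratings[1])
sim_users_02 = cosine_similarity(toy_ratings[0], toy_ratings[2])

# Content-based filtering: compare movies through their attributes.
sim_movies_01 = cosine_similarity(toy_genres[0], toy_genres[1])
sim_movies_02 = cosine_similarity(toy_genres[0], toy_genres[2])

print(f"user 0 vs user 1: {sim_users_01:.2f}")
print(f"user 0 vs user 2: {sim_users_02:.2f}")
print(f"movie 0 vs movie 1: {sim_movies_01:.2f}")
print(f"movie 0 vs movie 2: {sim_movies_02:.2f}")
```

A hybrid model would use both sources of information at once, which is what LightFM is designed for.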

**Q1bis**. What data does the LightFM `fit` method expect? In particular, how should the training data be organized, and what type should the training dataset be?

In [None]:
LightFM is a hybrid recommendation library that can combine collaborative and content-based filtering. To train a model, its `fit` method expects the interactions between users and items as a sparse matrix: according to the LightFM documentation, a SciPy COO matrix of `np.float32` values with shape `[n_users, n_items]`. Each row is a user, each column is an item, and a non-zero entry marks an interaction between that user and that item. This interaction matrix is the main input; user and item feature matrices are optional.
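
As a minimal sketch of such an interaction matrix, the toy example below builds one with SciPy from made-up (user index, item index) pairs. The COO format and `np.float32` dtype follow the LightFM documentation; the data itself is an assumption for illustration, and the LightFM calls are left as comments because they are not run here.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Toy interactions (made up): user index, item index, and a value of 1.0.
user_idx = np.array([0, 0, 1, 2, 2])
item_idx = np.array([0, 2, 1, 0, 3])
values = np.ones(len(user_idx), dtype=np.float32)

# Rows are users, columns are items; cells that are not listed are implicit zeros.
interactions = coo_matrix((values, (user_idx, item_idx)), shape=(3, 4))

print(interactions.shape)  # (3, 4)
print(interactions.nnz)    # 5
print(interactions.toarray())

# With LightFM installed, training would then look like this (not run here):
# from lightfm import LightFM
# model = LightFM(loss="warp")
# model.fit(interactions, epochs=10)
```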

**Q2**. Explore `movies` and `ratings`. What do those datasets contain? How are they organized?

The `movies` dataset contains one row per movie, with:
- `movieId`: a unique identifier for each movie.
- `title`: the name of the movie, followed by its release year in parentheses.
- `genres`: the genres the movie belongs to, separated by `|` (e.g. Comedy, Fantasy, Romance, Thriller, Crime, Adventure, Action).

The `ratings` dataset contains one row per rating given by a user to a movie, with:
- `userId`: a unique identifier for each user.
- `movieId`: the identifier of the movie being rated.
- `rating`: the rating given by the user.
- `timestamp`: the time when the rating was given.
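
A few pandas calls are enough to check this organization. The sketch below rebuilds the first rows shown in the `head()` output above so that it runs on its own; in the notebook you would apply the same calls to the full `movies` and `ratings` DataFrames. The names `sample_movies` and `sample_ratings` are only used here for illustration.

```python
import pandas as pd

# First rows copied from the head() output shown earlier.
sample_movies = pd.DataFrame({
    "movieId": [1, 2, 3, 4, 5],
    "title": [
        "Toy Story (1995)",
        "Jumanji (1995)",
        "Grumpier Old Men (1995)",
        "Waiting to Exhale (1995)",
        "Father of the Bride Part II (1995)",
    ],
    "genres": [
        "Adventure|Animation|Children|Comedy|Fantasy",
        "Adventure|Children|Fantasy",
        "Comedy|Romance",
        "Comedy|Drama|Romance",
        "Comedy",
    ],
})
sample_ratings = pd.DataFrame({
    "userId": [1, 1, 1, 1, 1],
    "movieId": [1, 3, 6, 47, 50],
    "rating": [4.0, 4.0, 4.0, 5.0, 5.0],
    "timestamp": [964982703, 964981247, 964982224, 964983815, 964982931],
})

# Genres are pipe-separated strings: split them to count movies per genre.
genre_counts = sample_movies["genres"].str.split("|").explode().value_counts()
print(genre_counts)

# Timestamps are Unix times in seconds: convert them to readable dates.
sample_ratings["rated_at"] = pd.to_datetime(sample_ratings["timestamp"], unit="s")
print(sample_ratings[["userId", "movieId", "rating", "rated_at"]])

# Number of distinct users and rated movies, and the distribution of ratings.
n_users = sample_ratings["userId"].nunique()
n_rated_movies = sample_ratings["movieId"].nunique()
print(n_users, n_rated_movies)
print(sample_ratings["rating"].value_counts().sort_index())
```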

---

### Q3 & Q4 are optional
> You can come back to them if you have time after finishing the whole project of the day.

We created a few utility functions for you in the `utils.py` script. In particular:
- `threshold_interactions_df`:
> Limit interactions df to minimum row and column interactions

**Q3**. Open the `src/utils.py` file and read the documentation of this function to understand its goal and how it works.

Have a look at the code to fully understand how it works. You should be familiar with everything in it.

What does the variable `sparsity` represent? What range of values can it take?
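
Since `utils.py` is not shown here, the sketch below only illustrates one common definition, which is an assumption: the percentage of (user, movie) cells that actually contain an interaction. The helper name `sparsity_percent` is hypothetical. Check the code in `utils.py` to see whether it uses this definition or its complement (the share of empty cells).

```python
def sparsity_percent(n_interactions: int, n_rows: int, n_cols: int) -> float:
    """Share of (row, column) cells that hold an interaction, in percent."""
    return 100.0 * n_interactions / (n_rows * n_cols)


# Toy case: 3 users, 4 movies, 6 ratings -> half of the 12 cells are filled.
print(sparsity_percent(6, 3, 4))  # 50.0
```

Under this definition, and as long as each user rates a given movie at most once, the value stays between 0 (no interaction at all) and 100 (every user rated every movie).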

**Q4**. Create a new DataFrame `ratings_thresh` that keeps only the rows of `ratings` involving:
- users that rated strictly more than 4 movies
- movies that have been rated at least 10 times

How many users/movies remain in this new dataset?
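
The exact signature of `threshold_interactions_df` is not shown in this notebook, so the sketch below uses a hypothetical stand-in, `filter_min_interactions`, written with plain pandas on made-up data. It assumes inclusive thresholds and repeats the filtering until nothing changes, because removing rare movies can push some users below their own threshold.

```python
import pandas as pd


def filter_min_interactions(df, row_name, col_name, row_min, col_min):
    """Keep interactions whose user has >= row_min rows and whose item has >= col_min rows.

    Hypothetical helper for illustration; repeats until the DataFrame is stable.
    """
    while True:
        row_counts = df.groupby(row_name)[col_name].transform("count")
        col_counts = df.groupby(col_name)[row_name].transform("count")
        kept = df[(row_counts >= row_min) & (col_counts >= col_min)]
        if len(kept) == len(df):
            return kept
        df = kept


# Toy ratings (made up): movies 3 and 4 are rated only once.
toy_ratings = pd.DataFrame({
    "userId": [1, 1, 1, 2, 2, 3, 3],
    "movieId": [1, 2, 3, 1, 2, 1, 4],
})

toy_thresh = filter_min_interactions(
    toy_ratings, "userId", "movieId", row_min=2, col_min=2
)
print(toy_thresh)
print(toy_thresh["userId"].nunique(), "users,", toy_thresh["movieId"].nunique(), "movies")
```

With this inclusive convention, "strictly more than 4 movies" would translate to `row_min=5` and "at least 10 times" to `col_min=10`.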

**Q5**. In order to fit a [LightFM](https://lyst.github.io/lightfm/docs/home.html) model, we need to transform our DataFrame into a sparse matrix (see below). This is not straightforward, so we included the function `df_to_matrix` in `utils.py`.

> 🔦 **Hint**:  Sparse matrices are just **big matrices with a lot of zeros or empty values**.
> 
> Dense tools (pandas DataFrames or NumPy arrays, for example) are not well suited to this kind of data, because they store every zero explicitly. So we will use [SciPy sparse matrices](https://docs.scipy.org/doc/scipy-0.14.0/reference/sparse.html).
>
> There are many different "types" of sparse matrices (CSC, CSR, COO, DIA, etc.). You don't need to know them all. Just know that they are different storage formats, each with its own methods for manipulation, slicing, indexing, etc.

> 🔦 **Hint 2**:  By going from a DataFrame to a sparse matrix, you lose the ids (userId and movieId) and only deal with indices (row number and column number). Therefore, the `df_to_matrix` function also returns dictionaries mapping ids to indices and back (e.g. `uid_to_idx` maps each userId to its row index in the matrix).


Have a look at the documentation of this utility function, and use it to create 5 new variables:
- a final sparse matrix `ratings_matrix` (this will be the data used to train the model)
- the following mapping dictionaries:
    - `uid_to_idx`
    - `idx_to_uid`
    - `mid_to_idx`
    - `idx_to_mid`
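
To see what such a conversion involves, here is a minimal sketch of a `df_to_matrix`-like function. It is not the code from `utils.py`: `toy_df_to_matrix` is a hypothetical name, and the choices made here (indices assigned in order of first appearance, a value of 1.0 for every interaction, CSR output) are assumptions.

```python
import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix


def toy_df_to_matrix(df, row_name, col_name):
    """Turn a long interactions DataFrame into a sparse matrix plus id/index mappings."""
    # Ids in order of first appearance get consecutive indices 0, 1, 2, ...
    row_ids = df[row_name].unique().tolist()
    col_ids = df[col_name].unique().tolist()
    rid_to_idx = {rid: idx for idx, rid in enumerate(row_ids)}
    idx_to_rid = dict(enumerate(row_ids))
    cid_to_idx = {cid: idx for idx, cid in enumerate(col_ids)}
    idx_to_cid = dict(enumerate(col_ids))

    # Translate every (row id, column id) pair into matrix coordinates.
    rows = df[row_name].map(rid_to_idx).to_numpy()
    cols = df[col_name].map(cid_to_idx).to_numpy()
    data = np.ones(len(df), dtype=np.float32)
    matrix = coo_matrix((data, (rows, cols)), shape=(len(row_ids), len(col_ids)))
    return matrix.tocsr(), rid_to_idx, idx_to_rid, cid_to_idx, idx_to_cid


# Toy ratings (made up): ids are deliberately not 0-based or contiguous.
toy_ratings = pd.DataFrame({"userId": [10, 10, 20], "movieId": [7, 9, 7]})
toy_matrix, toy_uid_to_idx, toy_idx_to_uid, toy_mid_to_idx, toy_idx_to_mid = (
    toy_df_to_matrix(toy_ratings, "userId", "movieId")
)

print(toy_matrix.toarray())
print(toy_uid_to_idx, toy_mid_to_idx)
```

The mappings are what let you go from a matrix position back to a real userId or movieId later on.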

In [4]:
from utils import df_to_matrix
ratings_matrix, uid_to_idx, idx_to_uid, mid_to_idx, idx_to_mid = df_to_matrix(ratings, row_name='userId', col_name='movieId')

print("Shape of ratings matrix:", ratings_matrix.shape)

Shape of ratings matrix: (610, 9724)


**Q6**.
- On the one hand, find which movies userId 4 rated.

- On the other hand, what is the value of `ratings_matrix` for:
    - userId = 4 and movieId=1
    - userId = 4 and movieId=2
    - userId = 4 and movieId=21
    - userId = 4 and movieId=32
    - userId = 4 and movieId=126

Conclude on the meaning of the values in `ratings_matrix`.

In [5]:
user_idx = uid_to_idx[4]  # matrix row of userId 4 (index 3 here), looked up instead of hard-coded
rated_movies_indices = ratings_matrix[user_idx].nonzero()[1]
rated_movies_ids = [idx_to_mid[idx] for idx in rated_movies_indices]

print("Movies rated by userId 4:")
print(rated_movies_ids)

Movies rated by userId 4:
[47, 235, 260, 296, 441, 457, 553, 593, 608, 648, 919, 1025, 1060, 1073, 1080, 1136, 1196, 1197, 1198, 1213, 1219, 1265, 1282, 1291, 1500, 1517, 1580, 1617, 1732, 1967, 2078, 2174, 2395, 2406, 2571, 2628, 2692, 2858, 2959, 2997, 3033, 3176, 3386, 3489, 3809, 1704, 914, 21, 32, 45, 52, 58, 106, 125, 126, 162, 171, 176, 190, 215, 222, 232, 247, 265, 319, 342, 345, 348, 351, 357, 368, 417, 450, 475, 492, 509, 538, 539, 588, 595, 599, 708, 759, 800, 892, 898, 899, 902, 904, 908, 910, 912, 920, 930, 937, 1046, 1057, 1077, 1079, 1084, 1086, 1094, 1103, 1179, 1183, 1188, 1199, 1203, 1211, 1225, 1250, 1259, 1266, 1279, 1283, 1288, 1304, 1391, 1449, 1466, 1597, 1641, 1719, 1733, 1734, 1834, 1860, 1883, 1885, 1892, 1895, 1907, 1914, 1916, 1923, 1947, 1966, 1968, 2019, 2076, 2109, 2145, 2150, 2186, 2203, 2204, 2282, 2324, 2336, 2351, 2359, 2390, 2467, 2583, 2599, 2683, 2712, 2762, 2763, 2770, 2791, 2843, 2874, 2921, 2926, 2973, 3044, 3060, 3079, 3083, 3160, 3175, 3204, 3

In [6]:
movie_ids_to_check = [1, 2, 21, 32, 126]  # Movie IDs to check
user_idx = uid_to_idx[4]  # matrix row of userId 4 (index 3 here)
for movie_id in movie_ids_to_check:
    idx = mid_to_idx.get(movie_id)  # None if the movie is not in the matrix
    if idx is not None:
        rating_value = ratings_matrix[user_idx, idx]
        print(f"Rating for userId 4 and movieId {movie_id}: {rating_value}")
    else:
        print(f"MovieId {movie_id} not found in the dataset.")

Rating for userId 4 and movieId 1: 0.0
Rating for userId 4 and movieId 2: 0.0
Rating for userId 4 and movieId 21: 1.0
Rating for userId 4 and movieId 32: 1.0
Rating for userId 4 and movieId 126: 1.0
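
The output above shows 1.0 for movies that appear in userId 4's list (21, 32, 126) and 0.0 for the others (1, 2), which suggests that the matrix stores whether a user rated a movie rather than the rating itself. One way to check this is to list the distinct values stored in the matrix. The sketch below does so on a made-up toy matrix; `stored_values` is a hypothetical helper.

```python
import numpy as np
from scipy.sparse import csr_matrix


def stored_values(matrix):
    """Return the sorted distinct values explicitly stored in a sparse matrix."""
    return np.unique(matrix.tocoo().data).tolist()


# Toy binary interaction matrix (made up): 2 users x 3 movies.
toy_matrix = csr_matrix(np.array([[0, 1, 1], [1, 0, 0]], dtype=np.float32))
print(stored_values(toy_matrix))  # [1.0]
```

In the notebook, calling `stored_values(ratings_matrix)` would show which values the real matrix contains; a result of `[1.0]` would confirm that it is binary.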


**Q7**. Now that you have a `ratings_matrix` in the correct format, let's save it in pickle format:
- Create a variable `dst_dir` corresponding to the path of the folder `data/netflix` located at the root of the repository
- **Verify that this is the correct path**
- Save the ratings_matrix in pickle (as `ratings_matrix.pkl`) in this corresponding directory

In [7]:
dst_dir = "data/netflix"
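
A relative path such as `data/netflix` is resolved against the current working directory, not against the root of the repository. Since the CSV files above were loaded with a `../../` prefix, the notebook may not be running from the repository root, so it is worth printing where the path really points. A minimal check, using only the standard library:

```python
import os

dst_dir = "data/netflix"  # same relative path as in the cell above

# Resolve the relative path against the current working directory.
abs_dst_dir = os.path.abspath(dst_dir)
print("Working directory:", os.getcwd())
print("Resolved dst_dir :", abs_dst_dir)
print("Already exists   :", os.path.isdir(abs_dst_dir))
```

If the resolved path is not the `data/netflix` folder at the root of the repository, adjust `dst_dir` before saving anything.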

In [8]:
import os
import pickle



# Create the directory if it does not exist yet
os.makedirs(dst_dir, exist_ok=True)

# Save the ratings_matrix in pickle format
with open(os.path.join(dst_dir, "ratings_matrix.pkl"), "wb") as f:
    pickle.dump(ratings_matrix, f)

print("ratings_matrix has been saved in pickle format.")

ratings_matrix has been saved in pickle format.


**Q8**. Also save all the mapping objects as pickle files (`idx_to_mid`, `mid_to_idx`, `uid_to_idx`, `idx_to_uid`), as they will be useful later.

In [9]:
# Save each mapping under its own name, e.g. idx_to_mid.pkl
mappings = {
    "idx_to_mid": idx_to_mid,
    "mid_to_idx": mid_to_idx,
    "uid_to_idx": uid_to_idx,
    "idx_to_uid": idx_to_uid,
}
for name, mapping in mappings.items():
    with open(os.path.join(dst_dir, f"{name}.pkl"), "wb") as f:
        pickle.dump(mapping, f)
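
Before moving on, it can be reassuring to check that pickled objects load back unchanged. The sketch below does a round trip in a temporary directory with made-up stand-ins (`toy_matrix`, `toy_mapping`) so that it runs on its own; in the notebook you would read the files back from `dst_dir` instead.

```python
import os
import pickle
import tempfile

import numpy as np
from scipy.sparse import csr_matrix

# Toy stand-ins (made up) for ratings_matrix and one of the mappings.
toy_matrix = csr_matrix(np.array([[0, 1], [1, 0]], dtype=np.float32))
toy_mapping = {4: 3, 7: 0}
objects = {"ratings_matrix": toy_matrix, "uid_to_idx": toy_mapping}

with tempfile.TemporaryDirectory() as tmp_dir:
    # Save every object as <name>.pkl
    for name, obj in objects.items():
        with open(os.path.join(tmp_dir, f"{name}.pkl"), "wb") as f:
            pickle.dump(obj, f)

    # Load everything back from disk
    loaded = {}
    for name in objects:
        with open(os.path.join(tmp_dir, f"{name}.pkl"), "rb") as f:
            loaded[name] = pickle.load(f)

# Two sparse matrices are equal when no cell differs.
same_matrix = (loaded["ratings_matrix"] != toy_matrix).nnz == 0
same_mapping = loaded["uid_to_idx"] == toy_mapping
print(same_matrix, same_mapping)
```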

On to the next challenge now! 🍿