# Recommendation - Data Preparation 🎬

---

<img src="https://cdn-images-1.medium.com/max/1200/0*ePGWILY6GyplT-nn" />

---

In the next few challenges, you will build a powerful **movie recommender**.

We will use the open-source library [LightFM](https://github.com/lyst/lightfm), which provides an easy-to-use Python implementation of **hybrid** recommendation engines.

In this first part, we will prepare the data so that the model can be trained efficiently.

Start by loading the `movies` and `ratings` data, downloaded from the **small** [movielens dataset](https://grouplens.org/datasets/movielens/).



In [33]:
import pickle
import pandas as pd
import numpy as np
import scipy.sparse as sparse
from utils import df_to_matrix, threshold_interactions_df

In [7]:
### TODO: Load the movies and ratings datasets
moviesDf = pd.read_csv('ml-latest-small/movies.csv')
ratingsDf = pd.read_csv('ml-latest-small/ratings.csv')
moviesDf.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [6]:
ratingsDf.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


**Q1**. What are the different types of recommendation models? Explain briefly with your own words the differences between them.

**Content-Based Recommendation**

Content-based recommendation systems suggest items to users based on the similarity between user features and item features. These systems analyze the characteristics of items that a user has shown interest in and recommend items that share similar attributes.
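
As a rough illustration of the content-based idea, the sketch below one-hot encodes the genres of the five movies shown in `moviesDf.head()` and ranks movies by cosine similarity of their genre vectors. The helper name `most_similar` is hypothetical, and using genres alone as item features is a simplifying assumption.

```python
import numpy as np
import pandas as pd

# The five movies displayed by moviesDf.head() above
movies = pd.DataFrame({
    "title": [
        "Toy Story (1995)",
        "Jumanji (1995)",
        "Grumpier Old Men (1995)",
        "Waiting to Exhale (1995)",
        "Father of the Bride Part II (1995)",
    ],
    "genres": [
        "Adventure|Animation|Children|Comedy|Fantasy",
        "Adventure|Children|Fantasy",
        "Comedy|Romance",
        "Comedy|Drama|Romance",
        "Comedy",
    ],
})

# One-hot encode the pipe-separated genres: one row per movie, one column per genre
genre_matrix = movies["genres"].str.get_dummies(sep="|").to_numpy(dtype=float)

# Cosine similarity between every pair of movies
unit = genre_matrix / np.linalg.norm(genre_matrix, axis=1, keepdims=True)
similarity = unit @ unit.T


def most_similar(title):
    """Hypothetical helper: title of the movie whose genres are closest to `title`."""
    idx = movies.index[movies["title"] == title][0]
    scores = similarity[idx].copy()
    scores[idx] = -1.0  # exclude the movie itself
    return movies.loc[int(np.argmax(scores)), "title"]


print(most_similar("Toy Story (1995)"))
```

A user who liked Toy Story would be pointed to the movie sharing the most genres with it; a real system would usually combine richer item features with a profile built from everything the user liked.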

**Rating-Based Recommendation (Collaborative Filtering)**

Rating-based recommendation systems make suggestions based on the rating matrix, which captures user preferences for items: rows correspond to users, columns to items, and each entry holds the rating a user gave to an item. Collaborative filtering identifies patterns and relationships within this matrix to recommend items. For example, if User A and User B have similar tastes and User A likes a particular book, the system might recommend that book to User B.
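
The User A / User B example can be sketched with a tiny, made-up rating matrix. This is a minimal user-based variant that assumes cosine similarity between users and a similarity-weighted sum of ratings; the function names are hypothetical and the numbers are invented for illustration.

```python
import numpy as np

# Toy rating matrix (rows = users, columns = items, 0 = not rated)
ratings = np.array([
    [5, 4, 0, 0],  # user 0
    [5, 5, 4, 0],  # user 1: similar tastes to user 0
    [0, 0, 5, 4],  # user 2: different tastes
], dtype=float)


def cosine_similarity(matrix):
    """Pairwise cosine similarity between the rows of `matrix`."""
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    return unit @ unit.T


def recommend(user, ratings):
    """Index of the unrated item with the highest similarity-weighted score."""
    sim = cosine_similarity(ratings)[user].copy()
    sim[user] = 0.0                       # ignore the user's own row
    scores = sim @ ratings                # weight other users' ratings by similarity
    scores[ratings[user] > 0] = -np.inf   # never recommend already-rated items
    return int(np.argmax(scores))


print(recommend(0, ratings))
```

User 0 is steered toward the item that the like-minded user 1 rated highly, without any knowledge of what the items actually are.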

**Clustering-Based Recommendation**

Clustering-based recommendation systems group users or items into clusters based on similarities within the rating matrix. By identifying these clusters, the system can recommend items that are popular within a user's cluster or suggest items from similar clusters. For instance, users with similar movie-watching habits might be grouped together, and popular movies within these clusters can be recommended to other users in the same group. 
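
A small sketch of the clustering idea, under several assumptions: the interaction matrix is invented, the clusters come from SciPy's `kmeans2`, and the two starting centroids are fixed by hand so the result is deterministic. The helper `recommend_from_cluster` is hypothetical.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

# Toy binary interaction matrix (rows = users, columns = movies)
interactions = np.array([
    [1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 1, 1, 1],
], dtype=float)

# Assumption: start k-means from users 0 and 2 to keep the example reproducible
initial_centroids = interactions[[0, 2]]
_, labels = kmeans2(interactions, initial_centroids, minit="matrix")


def recommend_from_cluster(user):
    """Most popular movie within the user's cluster that the user has not seen."""
    peers = labels == labels[user]
    popularity = interactions[peers].sum(axis=0)
    popularity[interactions[user] > 0] = -1  # mask movies the user already saw
    return int(np.argmax(popularity))


print(labels, recommend_from_cluster(0))
```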

**Q1bis**. What data is expected by the LightFM `fit` method? In particular, how should the training data be organized, and what should the type of the training dataset be?

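The primary input to the LightFM `fit` method is an interaction matrix of shape (n_users, n_items): one row per user, one column per item, with non-zero entries marking observed interactions. It should be a sparse matrix from the `scipy.sparse` module, such as a `coo_matrix` or `csr_matrix`.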
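
For illustration, here is a minimal sketch of what such an interaction matrix could look like, built directly with `scipy.sparse`. The indices and shape are made up, and LightFM itself is deliberately not imported so the snippet runs with SciPy alone.

```python
import numpy as np
import scipy.sparse as sparse

# Made-up interactions: (user index, item index) pairs
user_idx = np.array([0, 0, 1, 2])
item_idx = np.array([0, 2, 1, 2])
values = np.ones(len(user_idx), dtype=np.float32)  # 1.0 = "user interacted with item"

# Rows are users, columns are items
interactions = sparse.coo_matrix((values, (user_idx, item_idx)), shape=(3, 4))

print(interactions.shape, interactions.dtype, interactions.nnz)
# A matrix like this is what would be passed as the first argument of a
# LightFM model's fit method.
```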

**Q2**. Explore `movies` and `ratings`, what do those datasets contain? How are they organized?

The `movies` dataset has one row per movie, with a unique `movieId`, the `title`, and the associated `genres` as a pipe-separated string. The `ratings` dataset has one row per user interaction, with the `userId`, the `movieId`, the `rating` given, and the `timestamp` of the rating. The two datasets are linked through `movieId`, which connects each rating to a specific movie and its attributes.

In [21]:
ratingsThresholdDf = threshold_interactions_df(ratingsDf, 'userId', 'movieId', 5, 10)

Starting interactions info
Number of rows: 610
Number of cols: 9724
Sparsity: 1.700%
Ending interactions info
Number of rows: 610
Number of columns: 3650
Sparsity: 4.055%


---

### Q3 & Q4 are optional
> You can come back to them if you have time after finishing the whole project of the day.

We created a few utility functions for you in the `utils.py` script. In particular:
- `threshold_interactions_df`:
> Limit interactions df to minimum row and column interactions

**Q3**. Open `src/utils.py` file, and have a look at the documentation of this function to understand its goal and how it works.

Have a look at the code to fully understand how it works. You should be familiar with everything in it.

What does the variable `sparsity` represent? What range of values can it take?
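
The log above reports `Sparsity: 1.700%` for 610 rows and 9724 columns, and a higher value after filtering. One reading consistent with that (an assumption; the authoritative definition is in `utils.py`) is that the value is the percentage of filled cells in the user-by-movie matrix, which would keep it between 0 and 100. The function name below is hypothetical.

```python
import scipy.sparse as sparse


def filled_share_percent(matrix):
    """Percentage of cells in `matrix` that hold a stored (non-zero) value."""
    n_rows, n_cols = matrix.shape
    return 100.0 * matrix.nnz / (n_rows * n_cols)


# Toy matrix: 2 users x 5 movies, 3 interactions -> 3 of 10 cells are filled
toy = sparse.csr_matrix([
    [1.0, 0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 0.0, 0.0],
])
print(f"{filled_share_percent(toy):.3f}%")
```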

**Q4**. Create a new DataFrame `ratings_thresh`, that filters `ratings` with only:
- users that rated strictly more than 4 movies
- movies that have been rated at least 10 times

How many users/movies remain in this new dataset?
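
To make the filtering logic concrete, here is a sketch of one way to apply minimum-interaction thresholds with pandas. It is not the code from `utils.py`: the function name is hypothetical, and repeating the filter until nothing changes is a design choice of this sketch (dropping movies can push a user below the threshold, and vice versa).

```python
import pandas as pd


def filter_min_interactions(df, user_col, item_col, user_min, item_min):
    """Keep users with >= user_min rows and items with >= item_min rows."""
    while True:
        n_before = len(df)
        user_counts = df.groupby(user_col)[item_col].transform("count")
        df = df[user_counts >= user_min]
        item_counts = df.groupby(item_col)[user_col].transform("count")
        df = df[item_counts >= item_min]
        if len(df) == n_before:  # nothing was removed in this pass
            return df


# Toy data: movie 3 is rated once, so it goes; user 3 then falls below 2 ratings
toy = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3, 3],
    "movieId": [1, 2, 1, 2, 3, 1],
})
filtered = filter_min_interactions(toy, "userId", "movieId", user_min=2, item_min=2)
print(filtered["userId"].nunique(), filtered["movieId"].nunique())
```

With the thresholds from Q4, "strictly more than 4 movies" translates to `user_min=5` and "at least 10 times" to `item_min=10`.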

**Q5**. In order to fit a [LightFM](https://lyst.github.io/lightfm/docs/home.html) model, we need to transform our DataFrame into a sparse matrix (cf. below). This is not straightforward, so we included the function `df_to_matrix` in `utils.py`.

> 🔦 **Hint**:  Sparse matrices are just **big matrices with a lot of zeros or empty values**.
> 
> Existing tools (Pandas DataFrame, Numpy arrays for example) are not suitable for manipulating this kind of data. So we will use [Scipy sparse matrices](https://docs.scipy.org/doc/scipy-0.14.0/reference/sparse.html).
>
> There are many different "types" of sparse matrices (CSC, CSR, COO, DIA, etc.). You don't need to know them in detail. Just know that they are different storage formats, each with its own methods for manipulation, slicing, indexing, etc.

> 🔦 **Hint 2**:  When going from a DataFrame to a sparse matrix, you lose the ids (`userId` and `movieId`) and only deal with indices (row number and column number). Therefore, the `df_to_matrix` function also returns dictionaries mapping between ids and indices (e.g. `uid_to_idx` maps a `userId` to its row index in the matrix).


Have a look at the util function documentation, and use it to create 5 new variables:
- a final sparse matrix `ratings_matrix` (this will be the data used to train the model)
- the following utils mappers:
    - `uid_to_idx`
    - `idx_to_uid`
    - `mid_to_idx`
    - `idx_to_mid`
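
To show how ids, indices, and the sparse matrix relate, here is a hypothetical stand-in for `df_to_matrix`. It is a sketch and not the function from `utils.py`: sorting the ids, storing 1.0 for every interaction, and returning CSR format are all assumptions made for this example.

```python
import numpy as np
import pandas as pd
import scipy.sparse as sparse


def to_sparse_with_mappings(df, user_col, item_col):
    """Hypothetical sketch: binary user-by-item matrix plus id <-> index dicts."""
    user_ids = sorted(df[user_col].unique())
    item_ids = sorted(df[item_col].unique())

    uid_to_idx = {uid: i for i, uid in enumerate(user_ids)}
    idx_to_uid = {i: uid for uid, i in uid_to_idx.items()}
    mid_to_idx = {mid: j for j, mid in enumerate(item_ids)}
    idx_to_mid = {j: mid for mid, j in mid_to_idx.items()}

    rows = df[user_col].map(uid_to_idx).to_numpy()
    cols = df[item_col].map(mid_to_idx).to_numpy()
    data = np.ones(len(df), dtype=np.float32)  # 1.0 marks "user rated movie"

    # Build in COO format, then convert to CSR so single cells can be indexed
    matrix = sparse.coo_matrix(
        (data, (rows, cols)), shape=(len(user_ids), len(item_ids))
    ).tocsr()
    return matrix, uid_to_idx, idx_to_uid, mid_to_idx, idx_to_mid


toy = pd.DataFrame({"userId": [1, 1, 4, 4], "movieId": [1, 3, 21, 32]})
matrix, uid_to_idx, idx_to_uid, mid_to_idx, idx_to_mid = to_sparse_with_mappings(
    toy, "userId", "movieId"
)
print(matrix[uid_to_idx[4], mid_to_idx[21]], matrix[uid_to_idx[4], mid_to_idx[1]])
```

Looking up a cell always goes through the mappings: the id is translated to an index first, and only the index is used on the matrix.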

In [22]:
ratings_matrix, uid_to_idx, idx_to_uid, mid_to_idx, idx_to_mid = df_to_matrix(ratingsThresholdDf, 'userId', 'movieId')

**Q6**.
- On the one hand, which movies did userId 4 rate?

- On the other hand, what is the value of `ratings_matrix` for:
    - userId = 4 and movieId=1
    - userId = 4 and movieId=2
    - userId = 4 and movieId=21
    - userId = 4 and movieId=32
    - userId = 4 and movieId=126

What do you conclude about the meaning of the values in `ratings_matrix`?

In [26]:
ratingsDf[ratingsDf.userId == 4]

Unnamed: 0,userId,movieId,rating,timestamp
300,4,21,3.0,986935199
301,4,32,2.0,945173447
302,4,45,3.0,986935047
303,4,47,2.0,945173425
304,4,52,3.0,964622786
...,...,...,...,...
511,4,4765,5.0,1007569445
512,4,4881,3.0,1007569445
513,4,4896,4.0,1007574532
514,4,4902,4.0,1007569465


In [37]:
# Look up the matrix entries for user 4 and the selected movies via the id-to-index mappings
user_id = 4
movie_ids = [1, 2, 21, 32, 126]
movie_rates = [ratings_matrix[uid_to_idx[user_id], mid_to_idx[x]] for x in movie_ids]
for movie_id, rating in zip(movie_ids, movie_rates):
    print(f'Movie Id: {movie_id}, Rating: {rating}')

Movie Id: 1, Rating: 0.0
Movie Id: 2, Rating: 0.0
Movie Id: 21, Rating: 1.0
Movie Id: 32, Rating: 1.0
Movie Id: 126, Rating: 1.0


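All values in the sparse matrix are binary: an entry is 1.0 when the user rated the movie and 0.0 otherwise. The rating itself is not stored. For example, userId 4 gave movieId 32 a rating of 2.0, yet the corresponding matrix entry is 1.0.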

**Q7**. Now that you have a `ratings_matrix` in the correct format, let's save it in pickle format:
- Create a variable `dst_dir` corresponding to the path of the folder `data/netflix` located at the root of the repository
- **Verify that this is the correct path**
- Save the ratings_matrix in pickle (as `ratings_matrix.pkl`) in this corresponding directory
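
The three steps can be sketched as follows. To keep the snippet self-contained, a temporary directory stands in for the repository root and a tiny matrix stands in for `ratings_matrix`; both are assumptions of this example, and the real root path has to be adapted to your setup.

```python
import pickle
import tempfile
from pathlib import Path

import scipy.sparse as sparse

# Toy stand-in for the real ratings_matrix
ratings_matrix = sparse.csr_matrix([[1.0, 0.0], [0.0, 1.0]])

# Assumption: a temporary directory plays the role of the repository root
repo_root = Path(tempfile.mkdtemp())
dst_dir = repo_root / "data" / "netflix"
dst_dir.mkdir(parents=True, exist_ok=True)

# Verify the path before writing anything into it
print(dst_dir.resolve(), dst_dir.is_dir())

target = dst_dir / "ratings_matrix.pkl"
with target.open("wb") as f:
    pickle.dump(ratings_matrix, f)

# Round trip: load the file back to confirm it was written correctly
with target.open("rb") as f:
    restored = pickle.load(f)
print(restored.shape)
```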

In [40]:
directory = './data'

In [41]:
with open(directory + '/ratings_matrix.pkl', 'wb') as f:
    pickle.dump(ratings_matrix, f)

**Q8**. Also save all mapping objects as pickle files (`idx_to_mid`, `mid_to_idx`, `uid_to_idx`, `idx_to_uid`), as they will be useful later.

In [42]:
# Save each mapping under its own name; the context manager closes each file
mappings = {'idx_to_mid': idx_to_mid, 'mid_to_idx': mid_to_idx,
            'uid_to_idx': uid_to_idx, 'idx_to_uid': idx_to_uid}
for name, mapping in mappings.items():
    with open(f'{directory}/{name}.pkl', 'wb') as f:
        pickle.dump(mapping, f)

In [43]:
with open(directory + '/moviesDf.pkl', 'wb') as f:
    pickle.dump(moviesDf, f)

On to the next challenge now! 🍿