# Recommendation - Data Preparation 🎬

---

<img src="https://cdn-images-1.medium.com/max/1200/0*ePGWILY6GyplT-nn" />

---

In the next few challenges, you will build a powerful **movie recommender**.

We will use the open-source library [LightFM](https://github.com/lyst/lightfm), which provides an easy-to-use Python implementation of **hybrid** recommendation engines.

In this first part, we will prepare the data so that the model can be trained efficiently.

It is up to you to load the `movies` and `ratings` data, downloaded from the **small** [MovieLens dataset](https://grouplens.org/datasets/movielens/).



In [1]:
### TODO: Load the movies and ratings datasets
import pandas as pd

movies = pd.read_csv("ml-latest-small/movies.csv")
ratings = pd.read_csv("ml-latest-small/ratings.csv")

print(movies.head())
print(ratings.head())

   movieId                               title  \
0        1                    Toy Story (1995)   
1        2                      Jumanji (1995)   
2        3             Grumpier Old Men (1995)   
3        4            Waiting to Exhale (1995)   
4        5  Father of the Bride Part II (1995)   

                                        genres  
0  Adventure|Animation|Children|Comedy|Fantasy  
1                   Adventure|Children|Fantasy  
2                               Comedy|Romance  
3                         Comedy|Drama|Romance  
4                                       Comedy  
   userId  movieId  rating  timestamp
0       1        1     4.0  964982703
1       1        3     4.0  964981247
2       1        6     4.0  964982224
3       1       47     5.0  964983815
4       1       50     5.0  964982931


**Q1**. What are the different types of recommendation models? Briefly explain, in your own words, the differences between them.

Different types of recommendation models:
1. Content-based recommendation: recommends items whose descriptions are similar to the user's interests.

2. Rating-based recommendation: uses a nearest-neighbour approach on a rating matrix and gives either user-based or item-based recommendations.

- User-based: recommends items that similar users rated highly.
- Item-based: recommends items similar to those the user already likes.

3. Clustering-based recommendation: groups users into clusters based on similar ratings of items, then recommends the same items to all users in a cluster.

DIFFERENCE:
Content-based recommendation does not depend on other users, unlike the other two systems. This limits its ability to expand the user's existing interests, but it is not affected by a lack of ratings from other users, since it relies only on the user's own profile.

Rating-based recommendation uses other users' ratings of items. It helps users find new interests. Its accuracy suffers when items have too few ratings.

Clustering-based recommendation creates clusters of people and hence depends on other users. It can find new interesting items for users better than a rating-based system, but may lack the user-specific relevance of the other two systems.
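
To make the user-based idea more concrete, here is a minimal sketch in plain Python. The toy ratings, the `cosine` helper and the `recommend_user_based` function are all made up for illustration; this is not how LightFM works internally, just one simple way to recommend from the nearest neighbour.

```python
from math import sqrt

# Toy ratings: user -> {item: rating}. All values are made up for illustration.
ratings = {
    "alice": {"Toy Story": 5.0, "Jumanji": 4.0},
    "bob": {"Toy Story": 5.0, "Jumanji": 4.0, "Fargo": 5.0},
    "carol": {"Fargo": 2.0, "Pulp Fiction": 5.0},
}

def cosine(a, b):
    """Cosine similarity between two sparse rating vectors stored as dicts."""
    common = set(a) & set(b)
    dot = sum(a[i] * b[i] for i in common)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def recommend_user_based(user, ratings):
    """Recommend the unseen items rated by the most similar other user."""
    others = [u for u in ratings if u != user]
    nearest = max(others, key=lambda u: cosine(ratings[user], ratings[u]))
    return sorted(set(ratings[nearest]) - set(ratings[user]))

print(recommend_user_based("alice", ratings))  # ['Fargo']
```

Alice shares no rated movie with Carol, so Bob is her nearest neighbour and his extra movie is recommended. An item-based variant would compute the same similarity between item columns instead of user rows.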



**Q1bis**. What data does the LightFM `fit` method expect? In particular, how should the training data be organized, and what type should the training dataset be?

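1. The `fit` method expects the interactions as a sparse matrix of shape `[n_users, n_items]`, with one row per user and one column per item. User and item features are separate, optional arguments. With the logistic loss, entries are 1 for positive and -1 for negative interactions; the other losses only use positive interactions.
 
2. The training data should therefore be organised as a sparse user-by-movie matrix built from the ratings. According to the LightFM documentation, the expected type is a `coo_matrix` of `np.float32` values rather than an integer matrix.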
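
As an illustration of the expected layout, the sketch below builds a small user-by-item interactions matrix with SciPy. The index arrays are hypothetical toy data, and the `np.float32` COO format follows what the LightFM documentation describes for `fit`; the model call itself is only mentioned in a comment.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical toy interactions: parallel arrays of (user index, item index) pairs.
user_idx = np.array([0, 0, 1, 2])
item_idx = np.array([1, 3, 0, 2])
values = np.ones(len(user_idx), dtype=np.float32)  # 1.0 marks a positive interaction

# 3 users (rows) by 4 items (columns); cells that are not listed stay empty.
interactions = coo_matrix((values, (user_idx, item_idx)), shape=(3, 4), dtype=np.float32)
print(interactions.shape, interactions.dtype, interactions.nnz)

# A LightFM model would then be trained with something like model.fit(interactions, epochs=10).
```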

**Q2**. Explore `movies` and `ratings`. What do these datasets contain, and how are they organized?

In [2]:
print(movies.shape)
print(movies.info())
movies.head()

(9742, 3)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB
None


   movieId                               title                                       genres
0        1                    Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
1        2                      Jumanji (1995)                   Adventure|Children|Fantasy
2        3             Grumpier Old Men (1995)                               Comedy|Romance
3        4            Waiting to Exhale (1995)                         Comedy|Drama|Romance
4        5  Father of the Bride Part II (1995)                                       Comedy


In [3]:
ratings["date_time"] = pd.to_datetime(ratings["timestamp"], unit='s')
print(ratings.shape)
print(ratings.info())
ratings.head()

(100836, 5)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 5 columns):
 #   Column     Non-Null Count   Dtype         
---  ------     --------------   -----         
 0   userId     100836 non-null  int64         
 1   movieId    100836 non-null  int64         
 2   rating     100836 non-null  float64       
 3   timestamp  100836 non-null  int64         
 4   date_time  100836 non-null  datetime64[ns]
dtypes: datetime64[ns](1), float64(1), int64(3)
memory usage: 3.8 MB
None


   userId  movieId  rating  timestamp           date_time
0       1        1     4.0  964982703 2000-07-30 18:45:03
1       1        3     4.0  964981247 2000-07-30 18:20:47
2       1        6     4.0  964982224 2000-07-30 18:37:04
3       1       47     5.0  964983815 2000-07-30 19:03:35
4       1       50     5.0  964982931 2000-07-30 18:48:51


---

### Q3 & Q4 are optional
> You can come back to them if you have time after finishing the whole project of the day.

We created a few utility functions for you in the `utils.py` script. In particular:
- `threshold_interactions_df`:
> Limit interactions df to minimum row and column interactions

**Q3**. Open the `src/utils.py` file and have a look at the documentation of this function to understand its goal and how it works.

Have a look at the code to fully understand how it works. You should be familiar with everything.

What does the variable `sparsity` represent? What range of values can it take?

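`threshold_interactions_df` takes the interactions DataFrame and filters out any users or movies that do not have enough ratings.
Sparsity normally describes how empty a matrix is (the proportion of 0s). Judging by the values the function prints, however, the `sparsity` variable here is the percentage of filled cells: 100836 ratings over 610 × 9724 possible (user, movie) pairs gives about 1.7%.
It ranges from 0%, when no user has rated any movie, to 100%, when every user has rated every movie.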
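
The figure printed as `Sparsity` can be checked by hand. The sketch below assumes it is computed as the share of filled cells; `fill_percentage` is a hypothetical helper, not the code from `utils.py`, and the counts are the ones reported for this dataset in the notebook outputs.

```python
def fill_percentage(n_interactions, n_rows, n_cols):
    """Percentage of (row, column) cells that hold an interaction."""
    return 100 * n_interactions / (n_rows * n_cols)

# Figures reported in this notebook: 100836 ratings, 610 users, 9724 movies.
print(f"Sparsity: {fill_percentage(100836, 610, 9724):.3f}%")  # Sparsity: 1.700%
```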

**Q4**. Create a new DataFrame `ratings_thresh`, that filters `ratings` with only:
- users that rated strictly more than 4 movies
- movies that have been rated at least 10 times

How many users/movies remain in this new dataset?

In [4]:
from utils import threshold_interactions_df

# Assuming the thresholds are inclusive minimums:
# row_min=5 keeps users with strictly more than 4 ratings,
# col_min=10 keeps movies rated at least 10 times
ratings_thresh = threshold_interactions_df(
    df=ratings,
    row_name='userId',
    col_name='movieId',
    row_min=5,
    col_min=10
)
print("There remains",
      ratings_thresh.userId.nunique(),
      "users and",
      ratings_thresh.movieId.nunique(),
      "movies after performing threshold.")

Starting interactions info
Number of rows: 610
Number of cols: 9724
Sparsity: 1.700%
Ending interactions info
Number of rows: 610
Number of columns: 3650
Sparsity: 4.055%
There remains 610 users and 3650 movies after performing threshold.
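
For intuition, the filtering can be sketched as a single pass with pandas. `threshold_once` is a hypothetical stand-in written for this example, with thresholds treated as inclusive minimums; the real function in `utils.py` may work differently, for instance by repeating the pass.

```python
import pandas as pd

def threshold_once(df, row_name, col_name, row_min, col_min):
    """Single filtering pass; thresholds are inclusive minimums (an assumption)."""
    # Keep columns (movies) with at least col_min interactions
    col_counts = df.groupby(col_name)[row_name].transform("count")
    df = df[col_counts >= col_min]
    # Then keep rows (users) with at least row_min interactions among what is left
    row_counts = df.groupby(row_name)[col_name].transform("count")
    return df[row_counts >= row_min]

# Toy data: movie 30 is rated once, so it is dropped, which leaves user 3 with one rating.
toy = pd.DataFrame({
    "userId": [1, 1, 2, 2, 3, 3],
    "movieId": [10, 20, 10, 20, 10, 30],
})
filtered = threshold_once(toy, "userId", "movieId", row_min=2, col_min=2)
print(filtered["userId"].nunique(), "users and", filtered["movieId"].nunique(), "movies remain")
```

Removing users can in turn push a movie below its threshold, so one pass does not always guarantee that both conditions hold at the end.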


**Q5**. In order to fit a [LightFM](https://lyst.github.io/lightfm/docs/home.html) model, we need to transform our DataFrame into a sparse matrix (cf. below). This is not straightforward, so we included the function `df_to_matrix` in `utils.py`.

> 🔦 **Hint**:  Sparse matrices are just **big matrices with a lot of zeros or empty values**.
> 
> General-purpose tools (Pandas DataFrames or NumPy arrays, for example) are not well suited to this kind of data, so we will use [Scipy sparse matrices](https://docs.scipy.org/doc/scipy-0.14.0/reference/sparse.html).
>
> There are many different "types" of sparse matrices (CSC, CSR, COO, DIA, etc.). You don't need to know them all. Just know that they are different storage formats, each with its own methods for manipulation, slicing, indexing, etc.

> 🔦 **Hint 2**:  When going from a DataFrame to a sparse matrix, you lose the ids (userId and movieId) and only deal with indices (row number and column number). Therefore, the `df_to_matrix` function also returns dictionaries that map between ids and indices (e.g. `uid_to_idx` maps a userId to its row index in the matrix).


Have a look at the utility function's documentation, and use it to create 5 new variables:
- a final sparse matrix `ratings_matrix` (this will be the data used to train the model)
- the following utils mappers:
    - `uid_to_idx`
    - `idx_to_uid`
    - `mid_to_idx`
    - `idx_to_mid`
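
To see how the ids, the indices and the matrix fit together, here is a minimal sketch of such a conversion. `df_to_binary_matrix` is a hypothetical stand-in for `df_to_matrix`: it numbers ids in order of first appearance and stores 1.0 for every interaction, which are assumptions made for this example and may not match `utils.py`.

```python
import pandas as pd
from scipy.sparse import csr_matrix

def df_to_binary_matrix(df, row_name, col_name):
    """Hypothetical stand-in for df_to_matrix: 1.0 wherever an interaction exists."""
    row_ids = df[row_name].unique().tolist()  # ids in order of first appearance
    col_ids = df[col_name].unique().tolist()
    rid_to_idx = {rid: i for i, rid in enumerate(row_ids)}
    cid_to_idx = {cid: j for j, cid in enumerate(col_ids)}
    idx_to_rid = {i: rid for rid, i in rid_to_idx.items()}
    idx_to_cid = {j: cid for cid, j in cid_to_idx.items()}
    # Translate each id into its matrix index (assumes one row per (user, movie) pair)
    rows = df[row_name].map(rid_to_idx).to_numpy()
    cols = df[col_name].map(cid_to_idx).to_numpy()
    matrix = csr_matrix(([1.0] * len(df), (rows, cols)),
                        shape=(len(row_ids), len(col_ids)))
    return matrix, rid_to_idx, idx_to_rid, cid_to_idx, idx_to_cid

toy = pd.DataFrame({"userId": [1, 1, 4], "movieId": [1, 47, 47]})
matrix, uid_to_idx, idx_to_uid, mid_to_idx, idx_to_mid = df_to_binary_matrix(
    toy, "userId", "movieId"
)
print(matrix.shape)                   # (2, 2)
print(uid_to_idx[4], mid_to_idx[47])  # 1 1
```

The `idx_to_*` dictionaries are simply the inverse of the `*_to_idx` ones, which is what lets you get back from a matrix position to a userId or movieId.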

In [5]:
from utils import df_to_matrix

ratings_matrix, uid_to_idx, idx_to_uid, mid_to_idx, idx_to_mid = df_to_matrix(
    df=ratings,
    row_name='userId',
    col_name='movieId'
)
ratings_matrix

<610x9724 sparse matrix of type '<class 'numpy.float64'>'
	with 100836 stored elements in Compressed Sparse Row format>

**Q6**.
- First, find which movies userId 4 rated.

- Then, find the value of `ratings_matrix` for:
    - userId = 4 and movieId=1
    - userId = 4 and movieId=2
    - userId = 4 and movieId=21
    - userId = 4 and movieId=32
    - userId = 4 and movieId=126

Conclude on the meaning of the values in `ratings_matrix`.

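In [6]:
# Simpler alternative using the DataFrame directly:
# ratings[ratings.userId == 4]

# Index movies by movieId so that titles can be looked up with .loc
movies.index = movies["movieId"]
# Row of user 4 in COO format, which exposes the column indices of its stored entries
uid_4_ratings = ratings_matrix[uid_to_idx[4]].tocoo()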

print("Movies User 4 rated:")
print()

for movie_idx in uid_4_ratings.col:
    mid = idx_to_mid[movie_idx]
    print("Movie ID:", mid)
    print("Movie Title:",movies.loc[mid]["title"])
    print()


Movies User 4 rated:

Movie ID: 47
Movie Title: Seven (a.k.a. Se7en) (1995)

Movie ID: 235
Movie Title: Ed Wood (1994)

Movie ID: 260
Movie Title: Star Wars: Episode IV - A New Hope (1977)

Movie ID: 296
Movie Title: Pulp Fiction (1994)

Movie ID: 441
Movie Title: Dazed and Confused (1993)

Movie ID: 457
Movie Title: Fugitive, The (1993)

Movie ID: 553
Movie Title: Tombstone (1993)

Movie ID: 593
Movie Title: Silence of the Lambs, The (1991)

Movie ID: 608
Movie Title: Fargo (1996)

Movie ID: 648
Movie Title: Mission: Impossible (1996)

Movie ID: 919
Movie Title: Wizard of Oz, The (1939)

Movie ID: 1025
Movie Title: Sword in the Stone, The (1963)

Movie ID: 1060
Movie Title: Swingers (1996)

Movie ID: 1073
Movie Title: Willy Wonka & the Chocolate Factory (1971)

Movie ID: 1080
Movie Title: Monty Python's Life of Brian (1979)

Movie ID: 1136
Movie Title: Monty Python and the Holy Grail (1975)

Movie ID: 1196
Movie Title: Star Wars: Episode V - The Empire Strikes Back (1980)

Movie ID: 1

In [7]:
for mid in [1,2,21,32,126]:
    print("User Index:", uid_to_idx[4])
    print("Movie Index:", mid_to_idx[mid])
    print("Value of matrix:", ratings_matrix[uid_to_idx[4],mid_to_idx[mid]])
    print()

User Index: 3
Movie Index: 0
Value of matrix: 0.0

User Index: 3
Movie Index: 481
Value of matrix: 0.0

User Index: 3
Movie Index: 291
Value of matrix: 1.0

User Index: 3
Movie Index: 292
Value of matrix: 1.0

User Index: 3
Movie Index: 298
Value of matrix: 1.0
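
The values above are 1.0 for movies 21, 32 and 126 and 0.0 for movies 1 and 2, which suggests that the matrix flags whether a rating exists rather than storing the rating itself. The sketch below reproduces this behaviour on a tiny hand-built matrix; the mappings and entries are made up for illustration.

```python
from scipy.sparse import csr_matrix

# Hypothetical mappings for 2 users and 3 movies (made up for illustration).
uid_to_idx = {1: 0, 4: 1}
mid_to_idx = {1: 0, 21: 1, 32: 2}
idx_to_mid = {idx: mid for mid, idx in mid_to_idx.items()}

# User 1 rated movie 1; user 4 rated movies 21 and 32. Stored values are all 1.0.
rows = [0, 1, 1]
cols = [0, 1, 2]
matrix = csr_matrix(([1.0, 1.0, 1.0], (rows, cols)), shape=(2, 3))

# Column indices of the stored entries in user 4's row, mapped back to movie ids.
user_row = matrix[uid_to_idx[4]].tocoo()
rated_by_4 = sorted(idx_to_mid[j] for j in user_row.col)
print(rated_by_4)  # [21, 32]

# A stored entry reads 1.0, an absent one reads 0.0.
print(matrix[uid_to_idx[4], mid_to_idx[21]], matrix[uid_to_idx[4], mid_to_idx[1]])
```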



**Q7**. Now that you have a `ratings_matrix` in the correct format, let's save it in pickle format:
- Create a variable `dst_dir` corresponding to the path of the folder `data/netflix` located at the root of the repository
- **Verify that this is the correct path**
- Save `ratings_matrix` as a pickle file (`ratings_matrix.pkl`) in this directory

In [9]:
dst_dir = "./data/netflix"
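
One possible way to verify the path is to resolve it to an absolute path and check that it is an existing directory. `resolve_dir` is a hypothetical helper written for this example; the result depends on the folder from which the notebook is run.

```python
from pathlib import Path

def resolve_dir(path_str):
    """Return the absolute path and whether it is an existing directory."""
    path = Path(path_str).resolve()
    return path, path.is_dir()

dst_path, exists = resolve_dir("./data/netflix")
print(dst_path, "exists" if exists else "is missing")
# If it is missing, it could be created with dst_path.mkdir(parents=True, exist_ok=True)
```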

In [10]:
import pickle

# Use a context manager so that the file is closed once the matrix is written
with open(dst_dir + "/ratings_matrix.pkl", "wb") as f:
    pickle.dump(ratings_matrix, f)

**Q8**. Also save all the mapping objects as pickle files (`idx_to_mid`, `mid_to_idx`, `uid_to_idx`, `idx_to_uid`), as they will be useful later.

In [11]:
# Save each mapping under its own name; the context manager closes each file
mappings = {"idx_to_mid": idx_to_mid, "mid_to_idx": mid_to_idx,
            "uid_to_idx": uid_to_idx, "idx_to_uid": idx_to_uid}
for name, mapping in mappings.items():
    with open(f"{dst_dir}/{name}.pkl", "wb") as f:
        pickle.dump(mapping, f)
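
Before moving on, it can be worth checking that a saved object loads back unchanged. The sketch below does a round trip with a stand-in mapping and a temporary folder instead of `data/netflix`, so it can be run anywhere.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in mapping; in the notebook this would be e.g. uid_to_idx.
mapping = {1: 0, 4: 3}

with tempfile.TemporaryDirectory() as tmp_dir:  # temporary folder instead of data/netflix
    path = Path(tmp_dir) / "uid_to_idx.pkl"
    with open(path, "wb") as f:
        pickle.dump(mapping, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == mapping)  # True
```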

On to the next challenge now! 🍿