# Recommendation - Data Preparation 🎬

---

<img src="https://cdn-images-1.medium.com/max/1200/0*ePGWILY6GyplT-nn" />

---

In the next few challenges, you will build a powerful **movie recommender**.

We will use the open-source library [LightFM](https://github.com/lyst/lightfm), which provides an easy-to-use Python implementation of **hybrid** recommendation engines.

In this first part, we will prepare the data so that the model can be trained efficiently.

Start by loading the `movies` and `ratings` data, downloaded from the **small** [movielens dataset](https://grouplens.org/datasets/movielens/).



In [105]:
import os
import pandas as pd
import scipy

In [106]:
### TODO: Load the movies and ratings datasets
path = os.path.join("/","home","guillaume","code","GGIML","vivadata-student","data","ml-latest-small")

In [107]:
movies = pd.read_csv(os.path.join(path, "movies.csv"))
ratings = pd.read_csv(os.path.join(path, "ratings.csv"))

In [108]:
ratings

| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 1 | 4.0 | 964982703 |
| 1 | 1 | 3 | 4.0 | 964981247 |
| 2 | 1 | 6 | 4.0 | 964982224 |
| 3 | 1 | 47 | 5.0 | 964983815 |
| 4 | 1 | 50 | 5.0 | 964982931 |
| ... | ... | ... | ... | ... |
| 100831 | 610 | 166534 | 4.0 | 1493848402 |
| 100832 | 610 | 168248 | 5.0 | 1493850091 |
| 100833 | 610 | 168250 | 5.0 | 1494273047 |
| 100834 | 610 | 168252 | 5.0 | 1493846352 |


**Q1**. What are the different types of recommendation models? Explain briefly with your own words the differences between them.

There are two main types of recommendation models:
- Content-based filtering, which relies on the characteristics of the items and on how much importance a user gives to each of these characteristics.
- Collaborative filtering, where recommendations are based on what similar users have liked. The items to be recommended do not need any features.

Hybrid models, such as the ones LightFM provides, combine both approaches.
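
To make the distinction concrete, here is a minimal sketch on made-up data. The users, movies and genre features are hypothetical, and this is not how LightFM computes recommendations internally. Content-based scoring compares item features with a user's taste profile, while collaborative scoring only looks at which users liked the same items.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Content-based: items are described by features (here: [comedy, drama]).
item_features = {"movie_a": [1, 0], "movie_b": [0, 1]}
user_taste = [0.9, 0.1]  # hypothetical profile: this user mostly likes comedies
content_scores = {m: cosine(user_taste, f) for m, f in item_features.items()}
best_content = max(content_scores, key=content_scores.get)

# Collaborative: no item features, only who liked what (1 = liked).
likes = {
    "alice": [1, 1, 0],
    "bob":   [1, 1, 1],
    "carol": [0, 0, 1],
}
# Find the user whose likes are closest to alice's.
similarities = {u: cosine(likes["alice"], v) for u, v in likes.items() if u != "alice"}
nearest = max(similarities, key=similarities.get)

print(best_content)  # movie_a
print(nearest)       # bob
```

In the collaborative case, the third movie would be suggested to alice because bob, her nearest neighbour, liked it.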

**Q1bis**. What data does the LightFM `fit` method expect? In particular, how should the training data be organized, and what type should the training dataset be?

In [109]:
from lightfm import LightFM

In [110]:
LightFM.fit()

TypeError: fit() missing 2 required positional arguments: 'self' and 'interactions'

The LightFM `fit` method expects the following:
- `interactions`: the matrix containing user-item interactions, of shape `[n_users, n_items]`.
- `user_features` (optional): each row contains that user's weights over features (used for the content-based part of the model).
- `item_features` (optional): each row contains that item's weights over features (used for the content-based part of the model).
- `sample_weight` (optional): a matrix whose entries express the weights of individual interactions from the interactions matrix. Its row and col arrays must be the same as those of the interactions matrix.

These matrices must be SciPy sparse matrices holding float values, and `interactions` must have the users as rows and the items as columns.
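
As an illustration, a small interactions matrix with that layout could be built with SciPy as follows. The four interactions are made up, and the commented-out training call at the end is only indicative of how such a matrix would be passed to LightFM.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Made-up interactions, given as parallel arrays of user and item indices
user_idx = np.array([0, 0, 1, 2])
item_idx = np.array([0, 2, 1, 2])
values = np.ones(len(user_idx), dtype=np.float32)

n_users, n_items = 3, 4
interactions = coo_matrix((values, (user_idx, item_idx)), shape=(n_users, n_items))

print(interactions.shape)  # (3, 4): users as rows, items as columns
print(interactions.dtype)  # float32
print(interactions.nnz)    # 4 stored interactions

# With LightFM installed, training would then look like:
# from lightfm import LightFM
# model = LightFM(loss="warp")
# model.fit(interactions, epochs=10)
```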

**Q2**. Explore `movies` and `ratings`, what do those datasets contain? How are they organized?

In [111]:
movies

| | movieId | title | genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |
| ... | ... | ... | ... |
| 9737 | 193581 | Black Butler: Book of the Atlantic (2017) | Action\|Animation\|Comedy\|Fantasy |
| 9738 | 193583 | No Game No Life: Zero (2017) | Animation\|Comedy\|Fantasy |
| 9739 | 193585 | Flint (2017) | Drama |
| 9740 | 193587 | Bungo Stray Dogs: Dead Apple (2018) | Action\|Animation |


The `movies` table has one row per movie, with its `movieId`, `title` and `genres`. The `genres` column packs several genres into one string separated by `|`. Splitting it into one column per genre would give a movie-by-feature matrix that, once converted to a sparse matrix, is suitable for the `item_features` argument expected by LightFM.
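
A minimal sketch of that split, using three rows copied from the preview above and the pandas `str.get_dummies` method. The variable names are illustrative, and in a real pipeline the rows would also need to follow the same movie order as the columns of the interactions matrix.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Three rows copied from the `movies` preview above
movies_sample = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
    "genres": [
        "Adventure|Animation|Children|Comedy|Fantasy",
        "Adventure|Children|Fantasy",
        "Comedy|Romance",
    ],
})

# One 0/1 column per genre, split on the "|" separator
genre_dummies = movies_sample["genres"].str.get_dummies(sep="|")

# Sparse float matrix of shape [n_items, n_item_features]
item_features = csr_matrix(genre_dummies.to_numpy(dtype="float32"))

print(list(genre_dummies.columns))
print(item_features.shape)  # (3, 6)
```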

In [112]:
ratings

| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 1 | 4.0 | 964982703 |
| 1 | 1 | 3 | 4.0 | 964981247 |
| 2 | 1 | 6 | 4.0 | 964982224 |
| 3 | 1 | 47 | 5.0 | 964983815 |
| 4 | 1 | 50 | 5.0 | 964982931 |
| ... | ... | ... | ... | ... |
| 100831 | 610 | 166534 | 4.0 | 1493848402 |
| 100832 | 610 | 168248 | 5.0 | 1493850091 |
| 100833 | 610 | 168250 | 5.0 | 1494273047 |
| 100834 | 610 | 168252 | 5.0 | 1493846352 |


The `ratings` table holds our interactions in long format: one row per (user, movie) rating, with its timestamp. It must be transformed to obtain the interactions matrix, with one row per unique `userId`, one column per `movieId`, and float values as entries.

---

### Q3 & Q4 are optional
> You can come back to them if you have time after finishing the day's project.

We created a few utility functions for you in the `utils.py` script, in particular:
- `threshold_interactions_df`:
> Limit interactions df to minimum row and column interactions

**Q3**. Open the `src/utils.py` file and read the documentation of this function to understand its goal and how it works.

Then read the code itself until you fully understand it. You should be familiar with everything in it.

What does the variable `sparsity` represent? What range of values can it take?
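
As a point of comparison, one common way to quantify sparsity is the share of the user-item grid that holds no interaction. The sketch below uses that convention with a hypothetical helper named `sparsity_ratio`. The code in `utils.py` may follow a different convention (for example the share of filled cells, or a percentage), so check it before drawing conclusions.

```python
def sparsity_ratio(n_interactions, n_users, n_items):
    """Fraction of empty cells in an n_users x n_items interaction grid."""
    total_cells = n_users * n_items
    return 1 - n_interactions / total_cells

# Toy case: 4 interactions in a 3 x 4 grid leaves 8 of 12 cells empty
ratio = sparsity_ratio(4, 3, 4)
print(round(ratio, 4))  # 0.6667
```

Under this convention the value always lies between 0 (every user rated every item) and 1 (no interaction at all).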

**Q4**. Create a new DataFrame `ratings_thresh` that keeps only the rows of `ratings` for:
- users that rated strictly more than 4 movies
- movies that have been rated at least 10 times

How many users/movies remain in this new dataset?
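
Independently of the helper in `utils.py`, the two conditions can be sketched in plain pandas. The function name `filter_once` is hypothetical, the data is made up, and the thresholds are lowered so that the toy example stays readable.

```python
import pandas as pd

def filter_once(df, min_user_ratings, min_movie_ratings):
    """Keep rows whose user and movie both meet their minimum rating counts."""
    user_counts = df.groupby("userId")["movieId"].transform("count")
    movie_counts = df.groupby("movieId")["userId"].transform("count")
    return df[(user_counts >= min_user_ratings) & (movie_counts >= min_movie_ratings)]

toy = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 3],
    "movieId": [10, 20, 30, 10, 20, 10],
    "rating":  [4.0, 3.0, 5.0, 2.0, 4.5, 1.0],
})

# Keep users with at least 2 ratings and movies with at least 2 ratings
filtered = filter_once(toy, min_user_ratings=2, min_movie_ratings=2)
print(filtered["userId"].nunique(), filtered["movieId"].nunique())  # 2 2
```

With the thresholds of Q4, "strictly more than 4" corresponds to `min_user_ratings=5`. A single pass can also leave some users below their threshold once rarely rated movies are dropped, so this kind of filter is sometimes repeated until the row count stops changing.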

**Q5**. In order to fit a [LightFM](https://lyst.github.io/lightfm/docs/home.html) model, we need to transform our DataFrame into a sparse matrix (see the hints below). This is not straightforward, so we included the function `df_to_matrix` in `utils.py`.

> 🔦 **Hint**:  Sparse matrices are just **big matrices with a lot of zeros or empty values**.
> 
> Dense containers (pandas DataFrames or NumPy arrays, for example) store every zero explicitly, so they are not suited to this kind of data. Instead, we will use [Scipy sparse matrices](https://docs.scipy.org/doc/scipy-0.14.0/reference/sparse.html).
>
> There are many different "types" of sparse matrices (CSC, CSR, COO, DIA, etc.). You don't need to know them all. Just know that they are different storage formats, each with its own methods for manipulation, slicing, indexing, etc.

> 🔦 **Hint 2**:  By going from a DataFrame to a sparse matrix, you lose the ids (userId and movieId) and only deal with indices (row number and column number). Therefore, the `df_to_matrix` function also returns dictionaries mapping ids to indices and back (e.g. `uid_to_idx` maps a userId to its row index in the matrix, and `idx_to_uid` does the reverse).
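
The idea behind those mappings can be sketched on a few made-up (userId, movieId) pairs. This is not the actual `df_to_matrix` code, only an illustration of how ids might be turned into contiguous row and column indices before building a SciPy matrix.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Made-up (userId, movieId) pairs; the ids are deliberately not contiguous
pairs = [(1, 47), (1, 50), (4, 47), (610, 168252)]

# Give each distinct id a row/column index, in order of first appearance
uid_to_idx, mid_to_idx = {}, {}
for uid, mid in pairs:
    uid_to_idx.setdefault(uid, len(uid_to_idx))
    mid_to_idx.setdefault(mid, len(mid_to_idx))

# Reverse mappings, to go from matrix positions back to ids
idx_to_uid = {idx: uid for uid, idx in uid_to_idx.items()}
idx_to_mid = {idx: mid for mid, idx in mid_to_idx.items()}

rows = [uid_to_idx[uid] for uid, _ in pairs]
cols = [mid_to_idx[mid] for _, mid in pairs]
data = np.ones(len(pairs))  # 1.0 marks "this user rated this movie"
matrix = csr_matrix((data, (rows, cols)), shape=(len(uid_to_idx), len(mid_to_idx)))

print(uid_to_idx)                              # {1: 0, 4: 1, 610: 2}
print(matrix[uid_to_idx[4], mid_to_idx[47]])   # 1.0
print(matrix[uid_to_idx[4], mid_to_idx[50]])   # 0.0
```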


Have a look at the util function documentation, and use it to create 5 new variables:
- a final sparse matrix `ratings_matrix` (this will be the data used to train the model)
- the following utils mappers:
    - `uid_to_idx`
    - `idx_to_uid`
    - `mid_to_idx`
    - `idx_to_mid`

In [113]:
import utils

In [114]:
ratings_matrix, uid_to_idx, idx_to_uid, mid_to_idx, idx_to_mid = utils.df_to_matrix(ratings, 'userId', 'movieId')

In [115]:
idx_to_mid

{0: 1,
 1: 3,
 2: 6,
 3: 47,
 4: 50,
 5: 70,
 6: 101,
 7: 110,
 8: 151,
 9: 157,
 10: 163,
 11: 216,
 12: 223,
 13: 231,
 14: 235,
 15: 260,
 16: 296,
 17: 316,
 18: 333,
 19: 349,
 20: 356,
 21: 362,
 22: 367,
 23: 423,
 24: 441,
 25: 457,
 26: 480,
 27: 500,
 28: 527,
 29: 543,
 30: 552,
 31: 553,
 32: 590,
 33: 592,
 34: 593,
 35: 596,
 36: 608,
 37: 648,
 38: 661,
 39: 673,
 40: 733,
 41: 736,
 42: 780,
 43: 804,
 44: 919,
 45: 923,
 46: 940,
 47: 943,
 48: 954,
 49: 1009,
 50: 1023,
 51: 1024,
 52: 1025,
 53: 1029,
 54: 1030,
 55: 1031,
 56: 1032,
 57: 1042,
 58: 1049,
 59: 1060,
 60: 1073,
 61: 1080,
 62: 1089,
 63: 1090,
 64: 1092,
 65: 1097,
 66: 1127,
 67: 1136,
 68: 1196,
 69: 1197,
 70: 1198,
 71: 1206,
 72: 1208,
 73: 1210,
 74: 1213,
 75: 1214,
 76: 1219,
 77: 1220,
 78: 1222,
 79: 1224,
 80: 1226,
 81: 1240,
 82: 1256,
 83: 1258,
 84: 1265,
 85: 1270,
 86: 1275,
 87: 1278,
 88: 1282,
 89: 1291,
 90: 1298,
 91: 1348,
 92: 1377,
 93: 1396,
 94: 1408,
 95: 1445,
 96: 1473,
 

In [116]:
ratings_matrix.todense()

matrix([[1., 1., 1., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [1., 1., 0., ..., 0., 0., 0.],
        [1., 0., 0., ..., 0., 0., 0.],
        [1., 0., 1., ..., 1., 1., 1.]])

In [117]:
print(ratings_matrix)

  (0, 0)	1.0
  (0, 1)	1.0
  (0, 2)	1.0
  (0, 3)	1.0
  (0, 4)	1.0
  (0, 5)	1.0
  (0, 6)	1.0
  (0, 7)	1.0
  (0, 8)	1.0
  (0, 9)	1.0
  (0, 10)	1.0
  (0, 11)	1.0
  (0, 12)	1.0
  (0, 13)	1.0
  (0, 14)	1.0
  (0, 15)	1.0
  (0, 16)	1.0
  (0, 17)	1.0
  (0, 18)	1.0
  (0, 19)	1.0
  (0, 20)	1.0
  (0, 21)	1.0
  (0, 22)	1.0
  (0, 23)	1.0
  (0, 24)	1.0
  :	:
  (609, 9699)	1.0
  (609, 9700)	1.0
  (609, 9701)	1.0
  (609, 9702)	1.0
  (609, 9703)	1.0
  (609, 9704)	1.0
  (609, 9705)	1.0
  (609, 9706)	1.0
  (609, 9707)	1.0
  (609, 9708)	1.0
  (609, 9709)	1.0
  (609, 9710)	1.0
  (609, 9711)	1.0
  (609, 9712)	1.0
  (609, 9713)	1.0
  (609, 9714)	1.0
  (609, 9715)	1.0
  (609, 9716)	1.0
  (609, 9717)	1.0
  (609, 9718)	1.0
  (609, 9719)	1.0
  (609, 9720)	1.0
  (609, 9721)	1.0
  (609, 9722)	1.0
  (609, 9723)	1.0


**Q6**.
- First, find which movies userId 4 rated.

- Then, find the value of `ratings_matrix` for:
    - userId = 4 and movieId=1
    - userId = 4 and movieId=2
    - userId = 4 and movieId=21
    - userId = 4 and movieId=32
    - userId = 4 and movieId=126

What can you conclude about the meaning of the values in `ratings_matrix`?

In [118]:
ratings_matrix[uid_to_idx[1],:]

<1x9724 sparse matrix of type '<class 'numpy.float64'>'
	with 232 stored elements in Compressed Sparse Row format>

In [119]:
for index in ratings_matrix[uid_to_idx[4],:].indices:
    print(movies[movies['movieId']==idx_to_mid[index]]['title'].values)

['Seven (a.k.a. Se7en) (1995)']
['Ed Wood (1994)']
['Star Wars: Episode IV - A New Hope (1977)']
['Pulp Fiction (1994)']
['Dazed and Confused (1993)']
['Fugitive, The (1993)']
['Tombstone (1993)']
['Silence of the Lambs, The (1991)']
['Fargo (1996)']
['Mission: Impossible (1996)']
['Wizard of Oz, The (1939)']
['Sword in the Stone, The (1963)']
['Swingers (1996)']
['Willy Wonka & the Chocolate Factory (1971)']
["Monty Python's Life of Brian (1979)"]
['Monty Python and the Holy Grail (1975)']
['Star Wars: Episode V - The Empire Strikes Back (1980)']
['Princess Bride, The (1987)']
['Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981)']
['Goodfellas (1990)']
['Psycho (1960)']
['Groundhog Day (1993)']
['Fantasia (1940)']
['Indiana Jones and the Last Crusade (1989)']
['Grosse Pointe Blank (1997)']
['Austin Powers: International Man of Mystery (1997)']
['Men in Black (a.k.a. MIB) (1997)']
['L.A. Confidential (1997)']
['Big Lebowski, The (1998)']
['Labyrinth (1986)']

In [120]:
print(ratings_matrix[uid_to_idx[4],mid_to_idx[1]])

0.0


In [121]:
print(ratings_matrix[uid_to_idx[4],mid_to_idx[2]])

0.0


In [122]:
print(ratings_matrix[uid_to_idx[4],mid_to_idx[21]])

1.0


In [123]:
print(ratings_matrix[uid_to_idx[4],mid_to_idx[32]])

1.0


In [124]:
print(ratings_matrix[uid_to_idx[4],mid_to_idx[126]])

1.0


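This matrix contains 1.0 where the user has rated the movie, whatever the rating, and stores nothing otherwise (such entries read back as 0.0). The rating values themselves are not kept.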
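
The sketch below reproduces that behaviour on made-up ratings for 2 users and 3 movies. It only illustrates the binarization and is not taken from `utils.py`.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Made-up ratings, given as parallel arrays of row, column and rating value
rows = np.array([0, 0, 1])
cols = np.array([0, 2, 1])
rating_values = np.array([4.0, 0.5, 5.0])

explicit = csr_matrix((rating_values, (rows, cols)), shape=(2, 3))

# Binarize: every stored rating becomes 1.0, whatever its value
implicit = explicit.copy()
implicit.data[:] = 1.0

print(implicit.toarray())
print(implicit.nnz)    # 3 stored entries
print(implicit[1, 0])  # 0.0: entries that are not stored read back as zero
```

A rating of 0.5 and a rating of 5.0 end up with the same value, so the matrix only records that an interaction took place.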

**Q7**. Now that you have a `ratings_matrix` in the correct format, let's save it in pickle format:
- Create a variable `dst_dir` corresponding to the path of the folder `data/netflix` located at the root of the repository
- **Verify that this is the correct path**
- Save the ratings_matrix in pickle (as `ratings_matrix.pkl`) in this corresponding directory

In [125]:
dst_dir = os.path.join("/","home","guillaume","code","GGIML","vivadata-student","data","netflix")

In [126]:
import pickle as pkl

In [127]:
with open(os.path.join(dst_dir, 'ratings_matrix.pkl'), 'wb') as pklfile:
    pkl.dump(ratings_matrix, pklfile)
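
One possible way to check both the destination folder and the saved file is sketched below. A temporary folder and a small placeholder object stand in for `dst_dir` and `ratings_matrix`, so the snippet runs anywhere. The names `dst_dir_example` and `payload` are illustrative only.

```python
import os
import pickle
import tempfile

# A temporary folder stands in for `dst_dir`
with tempfile.TemporaryDirectory() as dst_dir_example:
    # Fail early if the destination folder does not exist
    assert os.path.isdir(dst_dir_example), "destination folder does not exist"

    target = os.path.join(dst_dir_example, "ratings_matrix.pkl")
    payload = {"shape": (610, 9724)}  # placeholder for `ratings_matrix`

    with open(target, "wb") as f:
        pickle.dump(payload, f)

    # Reload the file to confirm it is readable and unchanged
    with open(target, "rb") as f:
        reloaded = pickle.load(f)

print(reloaded == payload)  # True
```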

**Q8**. Also save all the mapping objects as pickle files (`idx_to_mid`, `mid_to_idx`, `uid_to_idx`, `idx_to_uid`), as they will be useful later.

In [133]:
with open(os.path.join(dst_dir, 'idx_to_mid.pkl'), 'wb') as pklfile:
    pkl.dump(idx_to_mid, pklfile)

In [134]:
with open(os.path.join(dst_dir, 'mid_to_idx.pkl'), 'wb') as pklfile:
    pkl.dump(mid_to_idx, pklfile)

In [135]:
with open(os.path.join(dst_dir, 'idx_to_uid.pkl'), 'wb') as pklfile:
    pkl.dump(idx_to_uid, pklfile)

In [136]:
with open(os.path.join(dst_dir, 'uid_to_idx.pkl'), 'wb') as pklfile:
    pkl.dump(uid_to_idx, pklfile)
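
Since the four cells only differ by the object name, the same result could be obtained with a loop over a dictionary of names and objects. The sketch below uses tiny stand-in dictionaries and a temporary folder in place of `dst_dir`.

```python
import os
import pickle
import tempfile

# Small stand-ins for the four mapping dictionaries
mappings = {
    "uid_to_idx": {1: 0, 4: 1},
    "idx_to_uid": {0: 1, 1: 4},
    "mid_to_idx": {47: 0, 50: 1},
    "idx_to_mid": {0: 47, 1: 50},
}

with tempfile.TemporaryDirectory() as out_dir:  # stands in for `dst_dir`
    # Write one "<name>.pkl" file per mapping
    for name, obj in mappings.items():
        with open(os.path.join(out_dir, f"{name}.pkl"), "wb") as f:
            pickle.dump(obj, f)
    saved_files = sorted(os.listdir(out_dir))

print(saved_files)
```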

On to the next challenge! 🍿