# Recommendation - Data Preparation 🎬

---

<img src="https://cdn-images-1.medium.com/max/1200/0*ePGWILY6GyplT-nn" />

---

The goal of this project is to build a recommendation engine that could run on a platform such as Allociné.

The idea is simple: suggest movies to users every week. To do so, we have access to the ratings they have given to movies.

We have two files: a list of movies (movies) and a dataset of ratings (ratings). Let's start by bringing the two together.

In [15]:
import pandas as pd
movies = pd.read_csv("movies.csv")
ratings = pd.read_csv("ratings.csv")

print("Number of ratings: " + str(ratings.shape[0]))

Number of ratings: 100836


In [2]:
ratings.head()

,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [3]:
movies.head()

,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [19]:
movies.shape

(9742, 3)

In [4]:
movies.genres.values[0].split('|')

['Adventure', 'Animation', 'Children', 'Comedy', 'Fantasy']
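Since each `genres` value is a `|`-separated string, one possible way to turn it into usable features is to expand it into indicator columns. The sketch below works on a small hand-made frame (three rows copied from the `head()` output above, not the real `movies.csv`) and uses pandas' `Series.str.get_dummies`. Whether genre features are needed later is an assumption; the rest of this notebook does not rely on them.

```python
import pandas as pd

# Toy stand-in for movies.csv (rows copied from the head() output above)
toy_movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "genres": [
        "Adventure|Animation|Children|Comedy|Fantasy",
        "Adventure|Children|Fantasy",
        "Comedy|Romance",
    ],
})

# One 0/1 column per genre, splitting each string on the "|" separator
genre_flags = toy_movies["genres"].str.get_dummies(sep="|")
genre_flags.insert(0, "movieId", toy_movies["movieId"])
print(genre_flags)
```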

Let's see what kind of ratings users give. They range from 0.5 to 5, in half-point steps.

In [6]:
ratings.rating.unique()

array([4. , 5. , 3. , 2. , 1. , 4.5, 3.5, 2.5, 0.5, 1.5])

In [7]:
n_movies = ratings["movieId"].nunique()
print(n_movies, "movies rated")

n_users = ratings.userId.nunique()
print(n_users, "users that rated at least 1 movie")

9724 movies rated
610 users that rated at least 1 movie
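These three numbers are enough to estimate how empty a user × movie matrix will be. The short calculation below simply reuses the figures printed above (it does not reload the files) and shows that only about 1.7% of the possible (user, movie) pairs carry a rating.

```python
n_ratings = 100836   # rows in ratings.csv, printed above
n_users = 610        # distinct userId values
n_movies = 9724      # distinct movieId values that received a rating

# Every user could in principle rate every movie
n_cells = n_users * n_movies
density = n_ratings / n_cells

print(f"{n_cells} possible (user, movie) pairs")
print(f"density: {density:.2%}")
```

Such a low density is the reason a sparse representation is a natural fit for this data.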


In our ratings table, each user is identified by a userId. Some people have rated far more movies than others: three users have rated more than 2,000 movies, while the least active ones have rated 20.

In [8]:
ratings.userId.value_counts()

414    2698
599    2478
474    2108
448    1864
274    1346
610    1302
68     1260
380    1218
606    1115
288    1055
249    1046
387    1027
182     977
307     975
603     943
298     939
177     904
318     879
232     862
480     836
608     831
600     763
590     728
483     728
105     722
19      703
305     677
489     648
111     646
438     635
       ... 
467      22
245      21
293      21
37       21
439      21
324      21
507      21
364      21
598      21
49       21
157      21
547      21
87       21
549      21
26       21
281      21
320      20
207      20
576      20
194      20
189      20
257      20
147      20
53       20
278      20
406      20
595      20
569      20
431      20
442      20
Name: userId, Length: 610, dtype: int64
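Rather than reading all 610 lines, the same counts can be summarised. The sketch below uses a tiny made-up ratings frame (the values are illustrative only) to show the pattern; the threshold of 3 is an arbitrary choice for the toy data.

```python
import pandas as pd

# Made-up ratings: user 1 rated 4 movies, user 2 rated 2, user 3 rated 1
toy_ratings = pd.DataFrame({
    "userId": [1, 1, 1, 1, 2, 2, 3],
    "movieId": [10, 11, 12, 13, 10, 11, 10],
})

counts = toy_ratings["userId"].value_counts()

# Min, max, and quartiles of the number of ratings per user
print(counts.describe())

# Users at or above an (arbitrary) activity threshold
heavy_users = counts[counts >= 3]
print(heavy_users)
```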

Let's now build our sparse matrix. The code below uses scipy.sparse, whose matrices are also the input format that LightFM works with.

---

In [11]:
def get_df_mappings(df, row_name, col_name):
    """Map entities in interactions df to row and column indices
    Parameters
    ----------
    df : DataFrame
        Interactions DataFrame.
    row_name : str
        Name of column in df which contains row entities.
    col_name : str
        Name of column in df which contains column entities.
    Returns
    -------
    rid_to_idx : dict
        Maps row ID's to the row index in the eventual interactions matrix.
    idx_to_rid : dict
        Reverse of rid_to_idx. Maps row index to row ID.
    cid_to_idx : dict
        Same as rid_to_idx but for column ID's
    idx_to_cid : dict
        Reverse of cid_to_idx. Maps column index to column ID.
    """


    # Create mappings
    rid_to_idx = {}
    idx_to_rid = {}
    for (idx, rid) in enumerate(df[row_name].unique().tolist()):
        rid_to_idx[rid] = idx
        idx_to_rid[idx] = rid

    cid_to_idx = {}
    idx_to_cid = {}
    for (idx, cid) in enumerate(df[col_name].unique().tolist()):
        cid_to_idx[cid] = idx
        idx_to_cid[idx] = cid

    return rid_to_idx, idx_to_rid, cid_to_idx, idx_to_cid
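To make the mapping concrete, here is a compact re-statement of the same idea on a made-up frame. It is written with dict comprehensions rather than by calling `get_df_mappings`, so that the snippet runs on its own; the variable names are illustrative.

```python
import pandas as pd

# Made-up interactions: raw IDs are arbitrary and not contiguous
toy = pd.DataFrame({"userId": [7, 7, 42, 3], "movieId": [50, 10, 50, 99]})

# IDs are numbered in order of first appearance, as in get_df_mappings
uid_to_row = {uid: idx for idx, uid in enumerate(toy["userId"].unique().tolist())}
row_to_uid = {idx: uid for uid, idx in uid_to_row.items()}

mid_to_col = {mid: idx for idx, mid in enumerate(toy["movieId"].unique().tolist())}
col_to_mid = {idx: mid for mid, idx in mid_to_col.items()}

print(uid_to_row)
print(mid_to_col)
```

Note that the indices depend on row order: sorting the frame differently before building the mappings would give different indices, so the mappings must be kept together with the matrix they were built for.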

In [9]:
import numpy as np
import scipy.sparse as sparse

def df_to_matrix(df, row_name, col_name):
    """Take interactions dataframe and convert to a sparse matrix
    Parameters
    ----------
    df : DataFrame
    row_name : str
    col_name : str
    Returns
    -------
    interactions : sparse csr matrix
    rid_to_idx : dict
    idx_to_rid : dict
    cid_to_idx : dict
    idx_to_cid : dict
    """
    rid_to_idx, idx_to_rid, cid_to_idx, idx_to_cid = get_df_mappings(df, row_name, col_name)
    
    # Translate raw IDs into matrix coordinates (row index, column index)
    I = df[row_name].map(rid_to_idx).values
    J = df[col_name].map(cid_to_idx).values
    # Use the explicit rating as the value stored in each cell
    V = df["rating"].values

    # Build in COO format from (value, (row, col)) triplets, then convert to CSR
    interactions = sparse.coo_matrix((V, (I, J)), dtype=np.float64)
    interactions = interactions.tocsr()
    return interactions, rid_to_idx, idx_to_rid, cid_to_idx, idx_to_cid


In [12]:
interactions, rid_to_idx, idx_to_rid, cid_to_idx, idx_to_cid = df_to_matrix(ratings, "userId", "movieId")

We end up with a sparse matrix that has one row per user and one column per rated movie.
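The construction boils down to handing scipy three aligned arrays: values, row indices, and column indices. The toy example below (made-up numbers) shows that step in isolation. Passing `shape` explicitly is optional; it is included here only to make the dimensions obvious.

```python
import numpy as np
import scipy.sparse as sparse

rows = np.array([0, 0, 1, 2])           # user indices
cols = np.array([0, 1, 0, 2])           # movie indices
vals = np.array([4.0, 5.0, 3.5, 1.0])   # ratings

# COO is convenient to build from triplets; CSR is efficient for row slicing
toy_matrix = sparse.coo_matrix((vals, (rows, cols)), shape=(3, 3)).tocsr()

print(toy_matrix.toarray())
```

Cells with no triplet read as 0. Because the lowest rating in this dataset is 0.5, a 0 can safely be interpreted as "not rated".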

In [13]:
interactions.toarray()

array([[4. , 4. , 4. , ..., 0. , 0. , 0. ],
       [0. , 0. , 0. , ..., 0. , 0. , 0. ],
       [0. , 0. , 0. , ..., 0. , 0. , 0. ],
       ...,
       [2.5, 2. , 0. , ..., 0. , 0. , 0. ],
       [3. , 0. , 0. , ..., 0. , 0. , 0. ],
       [5. , 0. , 5. , ..., 3. , 3.5, 3.5]])
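Calling `toarray()` materialises every cell, zeros included. For a 610 × 9724 float64 matrix that is roughly 47 MB, which is still manageable here but grows quickly with more users or movies. The sketch below, on a made-up matrix, shows a few ways to inspect a sparse matrix without densifying it.

```python
import numpy as np
import scipy.sparse as sparse

# Made-up 3 users x 4 movies matrix
m = sparse.csr_matrix(np.array([
    [4.0, 0.0, 0.0, 5.0],
    [0.0, 0.0, 0.0, 0.0],
    [2.5, 2.0, 0.0, 0.0],
]))

print(m.shape, m.nnz)                  # dimensions and number of stored ratings

ratings_per_user = m.getnnz(axis=1)    # stored entries in each row
print(ratings_per_user)

# Memory a dense copy would need, in bytes
dense_bytes = m.shape[0] * m.shape[1] * m.dtype.itemsize
print(dense_bytes)
```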

In [26]:
print(interactions[0])
print('----------------------------')
print("Number of movies: " + str(movies.shape[0]))

  (0, 0)	4.0
  (0, 1)	4.0
  (0, 2)	4.0
  (0, 3)	5.0
  (0, 4)	5.0
  (0, 5)	3.0
  (0, 6)	5.0
  (0, 7)	4.0
  (0, 8)	5.0
  (0, 9)	5.0
  (0, 10)	5.0
  (0, 11)	5.0
  (0, 12)	3.0
  (0, 13)	5.0
  (0, 14)	4.0
  (0, 15)	5.0
  (0, 16)	3.0
  (0, 17)	3.0
  (0, 18)	5.0
  (0, 19)	4.0
  (0, 20)	4.0
  (0, 21)	5.0
  (0, 22)	4.0
  (0, 23)	3.0
  (0, 24)	4.0
  :	:
  (0, 207)	3.0
  (0, 208)	5.0
  (0, 209)	5.0
  (0, 210)	5.0
  (0, 211)	4.0
  (0, 212)	4.0
  (0, 213)	5.0
  (0, 214)	5.0
  (0, 215)	5.0
  (0, 216)	4.0
  (0, 217)	4.0
  (0, 218)	4.0
  (0, 219)	5.0
  (0, 220)	4.0
  (0, 221)	4.0
  (0, 222)	5.0
  (0, 223)	5.0
  (0, 224)	5.0
  (0, 225)	5.0
  (0, 226)	4.0
  (0, 227)	4.0
  (0, 228)	5.0
  (0, 229)	4.0
  (0, 230)	4.0
  (0, 231)	5.0
----------------------------
Number of movies: 9742


To get a clearer picture, we can join our two datasets and bring the information together.

In [39]:
df = movies.set_index('movieId').join(ratings.set_index('movieId')).drop(columns=["genres", "timestamp"])
df.head()

movieId,title,userId,rating
1,Toy Story (1995),1.0,4.0
1,Toy Story (1995),5.0,4.0
1,Toy Story (1995),7.0,4.5
1,Toy Story (1995),15.0,2.5
1,Toy Story (1995),17.0,4.5
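The `userId` values appear as floats (`1.0`, `5.0`, ...) because `join` defaults to a left join: a movie without any rating is kept with missing values, and pandas promotes the integer column to float to hold them. With 9742 movies listed and 9724 rated, this presumably concerns 18 movies (assuming every rated `movieId` appears in `movies.csv`). The toy example below, with made-up rows, reproduces the effect and contrasts it with an inner merge.

```python
import pandas as pd

toy_movies = pd.DataFrame({
    "movieId": [1, 2],
    "title": ["Toy Story (1995)", "Jumanji (1995)"],
})
# Movie 2 has no rating in this toy data
toy_ratings = pd.DataFrame({
    "userId": [1, 5],
    "movieId": [1, 1],
    "rating": [4.0, 4.0],
})

# Left join (the default): the unrated movie is kept with NaN values
joined = toy_movies.set_index("movieId").join(toy_ratings.set_index("movieId"))
print(joined)
print(joined.dtypes)

# Inner merge: only rated movies survive, and userId stays an integer
rated_only = toy_movies.merge(toy_ratings, on="movieId", how="inner")
print(rated_only.dtypes)
```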


In [40]:
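# Note: despite the uid_/mid_ prefixes, this call indexes movie titles
# ("title") and rating values ("rating"), not users and movies
uid_to_idx, idx_to_uid, mid_to_idx, idx_to_mid = get_df_mappings(df, "title", "rating")

def map_ids(row, mapper):
    return mapper[row]

# Replace each title by its index
df["title"].apply(map_ids, args=[uid_to_idx]).values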

array([   0,    0,    0, ..., 9734, 9735, 9736])

In [86]:
uid_to_idx

{'Toy Story (1995)': 0,
 'Jumanji (1995)': 1,
 'Grumpier Old Men (1995)': 2,
 'Waiting to Exhale (1995)': 3,
 'Father of the Bride Part II (1995)': 4,
 'Heat (1995)': 5,
 'Sabrina (1995)': 6,
 'Tom and Huck (1995)': 7,
 'Sudden Death (1995)': 8,
 'GoldenEye (1995)': 9,
 'American President, The (1995)': 10,
 'Dracula: Dead and Loving It (1995)': 11,
 'Balto (1995)': 12,
 'Nixon (1995)': 13,
 'Cutthroat Island (1995)': 14,
 'Casino (1995)': 15,
 'Sense and Sensibility (1995)': 16,
 'Four Rooms (1995)': 17,
 'Ace Ventura: When Nature Calls (1995)': 18,
 'Money Train (1995)': 19,
 'Get Shorty (1995)': 20,
 'Copycat (1995)': 21,
 'Assassins (1995)': 22,
 'Powder (1995)': 23,
 'Leaving Las Vegas (1995)': 24,
 'Othello (1995)': 25,
 'Now and Then (1995)': 26,
 'Persuasion (1995)': 27,
 'City of Lost Children, The (Cité des enfants perdus, La) (1995)': 28,
 'Shanghai Triad (Yao a yao yao dao waipo qiao) (1995)': 29,
 'Dangerous Minds (1995)': 30,
 'Twelve Monkeys (a.k.a. 12 Monkeys) (1995)': 

Let's run a quick test to check that our sparse matrix is valid: let's look at the rating that user 4 gave to movie 126.

In [51]:
print(interactions.toarray()[rid_to_idx[4]][cid_to_idx[126]])
df[df['userId']==4].loc[126]['rating']

1.0


1.0
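A single spot check is reassuring, but the same comparison can be run over every rating at once. The sketch below does this on made-up data; it uses `pd.factorize` as a stand-in for the ID-to-index mappings (it also numbers IDs in order of first appearance), which is an assumption made for brevity rather than the notebook's own helper.

```python
import numpy as np
import pandas as pd
import scipy.sparse as sparse

# Made-up ratings, including the (user 4, movie 126) pair
toy = pd.DataFrame({
    "userId": [4, 4, 9],
    "movieId": [126, 7, 126],
    "rating": [1.0, 3.5, 5.0],
})

# codes are the matrix coordinates of each row of the frame
row_codes, row_ids = pd.factorize(toy["userId"])
col_codes, col_ids = pd.factorize(toy["movieId"])

matrix = sparse.coo_matrix(
    (toy["rating"].to_numpy(), (row_codes, col_codes))
).tocsr()

# Read back every stored cell and compare with the source column
looked_up = np.asarray(matrix[row_codes, col_codes]).ravel()
assert np.array_equal(looked_up, toy["rating"].to_numpy())
print("all ratings match")
```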

In [52]:
ratings_matrix = interactions

Let's now save this in pickle format so that we can use it later in our similarity computations.

In [None]:
import pickle

with open("../../../../../../data/netflix/ratings_matrix.pkl", "wb") as file:
    pickle.dump(ratings_matrix, file)
    
# Save the ID <-> index mappings returned by df_to_matrix: they are the ones
# that describe the rows (users) and columns (movies) of ratings_matrix
with open("../../../../../../data/netflix/idx_to_mid.pkl", "wb") as file:
    pickle.dump(idx_to_cid, file)
with open("../../../../../../data/netflix/mid_to_idx.pkl", "wb") as file:
    pickle.dump(cid_to_idx, file)
with open("../../../../../../data/netflix/uid_to_idx.pkl", "wb") as file:
    pickle.dump(rid_to_idx, file)
with open("../../../../../../data/netflix/idx_to_uid.pkl", "wb") as file:
    pickle.dump(idx_to_rid, file)
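Before relying on the saved files, it can be worth checking that they load back unchanged. The sketch below does a round trip on a made-up matrix in a temporary directory (the file names are illustrative, not the paths used above). It also shows `scipy.sparse.save_npz` / `load_npz` as a possible alternative to pickle for the matrix itself; this is a suggestion, not what the notebook does.

```python
import os
import pickle
import tempfile

import numpy as np
import scipy.sparse as sparse

toy_matrix = sparse.csr_matrix(np.array([[4.0, 0.0], [0.0, 2.5]]))

with tempfile.TemporaryDirectory() as tmp_dir:
    # Pickle round trip
    pkl_path = os.path.join(tmp_dir, "ratings_matrix.pkl")
    with open(pkl_path, "wb") as fh:
        pickle.dump(toy_matrix, fh)
    with open(pkl_path, "rb") as fh:
        restored = pickle.load(fh)

    # Alternative: scipy's own compressed format for sparse matrices
    npz_path = os.path.join(tmp_dir, "ratings_matrix.npz")
    sparse.save_npz(npz_path, toy_matrix)
    restored_npz = sparse.load_npz(npz_path)

# Two sparse matrices are equal when no cell differs
print((restored != toy_matrix).nnz == 0)
print((restored_npz != toy_matrix).nnz == 0)
```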