# Formative workshop (b): Recommender systems
## Description

This workshop covers the user-user and item-item approaches. It also serves as an introduction to TP1.

The data for the workshop and for TP1 come from the MovieLens repository, a collection of film ratings:

### Data
- votes.csv: 100,000 ratings from 943 users for 1682 films.
- u.csv: profiles of the 943 users.
- items.csv: titles and other information about the films.

### Questions
1. Suppose I watch the film "Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb" (film number 474). Using an item-item approach, recommend 5 films on the basis of cosine similarity alone, i.e. using only the weight w from formula (1), p. 20 of the notes on the u-u and i-i approaches.

1. The large proportion of missing values creates difficulties when estimating ratings with the user-user and item-item approaches. Describe two of them.

1. Will the inverse user frequency (IUF) correction tend to favor or to hinder serendipity? Explain.
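
As a point of reference for question 1, here is a minimal sketch of an item-item ranking by cosine similarity. It assumes that the weight w of formula (1) in the notes is the plain cosine between two item columns (the notes are not reproduced here, so this is an assumption), and that missing ratings are filled with 0. The function name `top_n_similar_items` and the toy matrix are hypothetical and used only for illustration.

```python
import numpy as np
import pandas as pd


def top_n_similar_items(ratings: pd.DataFrame, item_id, n: int = 5) -> pd.Series:
    """Rank items by cosine similarity to `item_id`.

    `ratings` has users as rows and items as columns, with missing
    ratings already replaced by 0 (an assumption of this sketch).
    """
    matrix = ratings.to_numpy(dtype=float)
    norms = np.linalg.norm(matrix, axis=0)
    norms[norms == 0] = 1.0  # items with no ratings get similarity 0 instead of NaN
    col = ratings.columns.get_loc(item_id)
    sims = (matrix.T @ matrix[:, col]) / (norms * norms[col])
    result = pd.Series(sims, index=ratings.columns).drop(item_id)
    return result.sort_values(ascending=False).head(n)


# Toy user x item matrix (hypothetical data, not MovieLens)
toy = pd.DataFrame(
    {10: [5, 4, 0, 1], 20: [5, 5, 0, 0], 30: [0, 0, 4, 5], 40: [1, 0, 5, 4]},
    index=["u1", "u2", "u3", "u4"],
)
top = top_n_similar_items(toy, item_id=10, n=2)
print(top)
```

With the notebook's own matrix, the analogous call would presumably be `top_n_similar_items(users_items, 474, n=5)`, with the resulting film ids looked up in `items` to get the titles.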


In [98]:
import pandas as pd

# 1. Load the data and strip stray whitespace from the column names
items = pd.read_csv("data/items.csv")
items.columns = items.columns.str.strip()

ratings = pd.read_csv("data/votes.csv")
ratings.columns = ratings.columns.str.strip()

users = pd.read_csv("data/u.csv")
users.columns = users.columns.str.strip()

# Row describing film 474 (Dr. Strangelove)
dr_strange = items[items["movie id"] == 474]

In [99]:
columns = ["user.id", "item.id", "rating"]
# Inner joins keep only the ratings whose user and film appear in the profile files
users_items = pd.merge(ratings, users, left_on="user.id", right_on="id")[columns]
users_items = pd.merge(users_items, items, left_on="item.id", right_on="movie id")[
    columns
]
# User x film matrix; films a user has not rated are filled with 0
users_items = users_items.pivot(
    index="user.id", columns="item.id", values="rating"
).fillna(0)
items = items.set_index("movie id")

In [100]:
users_items

| user.id / item.id | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 1673 | 1674 | 1675 | 1676 | 1677 | 1678 | 1679 | 1680 | 1681 | 1682 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 5.0 | 3.0 | 4.0 | 3.0 | 3.0 | 5.0 | 4.0 | 1.0 | 5.0 | 3.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | 4.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 939 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 940 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 4.0 | 5.0 | 3.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 941 | 5.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 942 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
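
Most entries of this matrix are zeros that stand in for missing ratings. The sketch below quantifies that share; it pivots without `fillna(0)` so that unrated films stay `NaN`. The helper `missing_share` and the toy ratings are hypothetical, and the last line only reuses the counts stated in the Data section (100,000 ratings, 943 users, 1682 films).

```python
import pandas as pd


def missing_share(
    votes: pd.DataFrame,
    user_col: str = "user.id",
    item_col: str = "item.id",
    value_col: str = "rating",
) -> float:
    """Fraction of (user, film) pairs that have no rating."""
    wide = votes.pivot(index=user_col, columns=item_col, values=value_col)
    return float(wide.isna().to_numpy().mean())


# Toy ratings (hypothetical): 3 users x 3 films, 4 ratings, so 5 of 9 cells are empty
toy_votes = pd.DataFrame(
    {"user.id": [1, 1, 2, 3], "item.id": [10, 20, 10, 30], "rating": [5, 3, 4, 2]}
)
toy_share = missing_share(toy_votes)

# Share implied by the counts given in the Data section
stated_share = 1 - 100_000 / (943 * 1682)
print(f"toy: {toy_share:.3f}, MovieLens counts: {stated_share:.3f}")
```

With the stated counts, roughly 94% of the cells are empty, which is why replacing them with 0 weighs heavily on any mean, standard deviation, or covariance computed from the filled matrix.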


In [111]:
# 2. Compute per-film statistics and the covariance between films
#    (computed on the 0-filled matrix, so unrated films count as 0)
items = items.assign(
    mean=users_items.mean(axis=0),
    std=users_items.std(axis=0),
)[["mean", "std"]]
users_items.cov()

| item.id | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 1673 | 1674 | 1675 | 1676 | 1677 | 1678 | 1679 | 1680 | 1681 | 1682 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 4.170108 | 0.558637 | 0.381639 | 0.713005 | 0.261205 | 0.019232 | 1.402343 | 0.912324 | 0.815616 | 0.248508 | ... | 0.003634 | -0.007894 | -0.005920 | -0.003947 | 0.003634 | -0.001973 | -0.005920 | -0.003947 | 0.006819 | 0.006819 |
| 2 | 0.558637 | 1.359806 | 0.214872 | 0.736224 | 0.278673 | 0.021846 | 0.507238 | 0.421748 | 0.173168 | 0.098857 | ... | -0.001418 | -0.001891 | -0.001418 | -0.000946 | -0.001418 | -0.000473 | -0.001418 | -0.000946 | 0.008136 | 0.008136 |
| 3 | 0.381639 | 0.214872 | 0.934147 | 0.339904 | 0.135649 | 0.039359 | 0.494620 | 0.133425 | 0.269353 | 0.089469 | ... | -0.000922 | -0.001229 | -0.000922 | -0.000615 | 0.002263 | -0.000307 | -0.000922 | -0.000615 | -0.000922 | 0.008632 |
| 4 | 0.713005 | 0.736224 | 0.339904 | 2.382332 | 0.363625 | 0.021043 | 0.901677 | 0.955951 | 0.676560 | 0.248072 | ... | -0.002506 | -0.003341 | 0.013418 | 0.008945 | 0.003864 | -0.000835 | -0.002506 | -0.001671 | 0.007048 | 0.010233 |
| 5 | 0.261205 | 0.278673 | 0.135649 | 0.363625 | 0.985636 | -0.005317 | 0.402933 | 0.254224 | 0.269789 | -0.038957 | ... | -0.000959 | -0.001279 | -0.000959 | -0.000639 | -0.000959 | -0.000320 | -0.000959 | -0.000639 | -0.000959 | 0.008595 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1678 | -0.001973 | -0.000473 | -0.000307 | -0.000835 | -0.000320 | -0.000105 | -0.001676 | -0.000985 | -0.001311 | -0.000384 | ... | -0.000003 | -0.000005 | -0.000003 | -0.000002 | -0.000003 | 0.001060 | 0.003181 | 0.002121 | -0.000003 | -0.000003 |
| 1679 | -0.005920 | -0.001418 | -0.000922 | -0.002506 | -0.000959 | -0.000314 | -0.005029 | -0.002955 | -0.003934 | -0.001152 | ... | -0.000010 | -0.000014 | -0.000010 | -0.000007 | -0.000010 | 0.003181 | 0.009544 | 0.006363 | -0.000010 | -0.000010 |
| 1680 | -0.003947 | -0.000946 | -0.000615 | -0.001671 | -0.000639 | -0.000209 | -0.003352 | -0.001970 | -0.002623 | -0.000768 | ... | -0.000007 | -0.000009 | -0.000007 | -0.000005 | -0.000007 | 0.002121 | 0.006363 | 0.004242 | -0.000007 | -0.000007 |
| 1681 | 0.006819 | 0.008136 | -0.000922 | 0.007048 | -0.000959 | -0.000314 | 0.007710 | 0.012969 | 0.008804 | -0.001152 | ... | -0.000010 | -0.000014 | -0.000010 | -0.000007 | -0.000010 | -0.000003 | -0.000010 | -0.000007 | 0.009544 | -0.000010 |
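
The cell above is labelled as a correlation step but displays a covariance matrix. As a minimal sketch of the remaining step, a covariance matrix can be turned into a Pearson correlation matrix by dividing each entry by the product of the two standard deviations taken from the diagonal. The helper `cov_to_corr` and the toy matrix are hypothetical; on the notebook's 0-filled matrix the result would still treat missing ratings as zeros.

```python
import numpy as np
import pandas as pd


def cov_to_corr(cov: pd.DataFrame) -> pd.DataFrame:
    """Convert a covariance matrix into a Pearson correlation matrix."""
    values = cov.to_numpy(dtype=float)
    std = np.sqrt(np.diag(values))  # per-item standard deviations
    with np.errstate(divide="ignore", invalid="ignore"):
        corr = values / np.outer(std, std)  # zero-variance items yield NaN
    return pd.DataFrame(corr, index=cov.index, columns=cov.columns)


# Toy user x item matrix (hypothetical data, not MovieLens)
toy = pd.DataFrame({1: [5.0, 3.0, 0.0, 1.0], 2: [4.0, 0.0, 0.0, 1.0], 3: [1.0, 1.0, 0.0, 5.0]})
corr = cov_to_corr(toy.cov())
print(corr.round(3))
```

On the toy data this agrees with `toy.corr()`, which pandas provides directly; the explicit version is shown only to make the link between the two matrices visible.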
