# Recommender system

The task here is to build a recommender system for movies. The data consist of ratings and movie data, and are the same data that were used in lab 5.


## Data preparation

Go through the steps of the data preparation and try to understand what happens.

In [95]:
import numpy as np
import pandas as pd
from datetime import datetime
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import NearestNeighbors

In [96]:
# read in the data
rangering = pd.read_csv('./data/ratings.dat', sep='::', 
                        names = ['BrukerID', 'FilmID', 'Rangering', 'Tidstempel'], 
                        engine='python')

film = pd.read_csv('./data/movies.dat', sep='::', 
                   names=['FilmID', 'Tittel', 'Sjanger'], 
                   encoding='latin-1', 
                   engine='python')

In [97]:
# first rows
print(rangering.shape)
rangering.head(10)

(1000209, 4)


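| | BrukerID | FilmID | Rangering | Tidstempel |
|---|---|---|---|---|
| 0 | 1 | 1193 | 5 | 978300760 |
| 1 | 1 | 661 | 3 | 978302109 |
| 2 | 1 | 914 | 3 | 978301968 |
| 3 | 1 | 3408 | 4 | 978300275 |
| 4 | 1 | 2355 | 5 | 978824291 |
| 5 | 1 | 1197 | 3 | 978302268 |
| 6 | 1 | 1287 | 5 | 978302039 |
| 7 | 1 | 2804 | 5 | 978300719 |
| 8 | 1 | 594 | 4 | 978302268 |
| 9 | 1 | 919 | 4 | 978301368 |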


In [98]:
# first rows
print(film.shape)
film.head(10)

(3883, 3)


| | FilmID | Tittel | Sjanger |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Animation\|Children's\|Comedy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children's\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |
| 5 | 6 | Heat (1995) | Action\|Crime\|Thriller |
| 6 | 7 | Sabrina (1995) | Comedy\|Romance |
| 7 | 8 | Tom and Huck (1995) | Adventure\|Children's |
| 8 | 9 | Sudden Death (1995) | Action |
| 9 | 10 | GoldenEye (1995) | Action\|Adventure\|Thriller |


In [99]:
# create genre dummy variables
mlb = MultiLabelBinarizer()

genres_df = pd.DataFrame(mlb.fit_transform(film['Sjanger'].str.split('|')),
                         columns=mlb.classes_, 
                         index=film.FilmID)
genres_df.head()

| FilmID | Action | Adventure | Animation | Children's | Comedy | Crime | Documentary | Drama | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
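To see what the genre encoding does, here is a minimal sketch on three made-up movies. The names `toy_film`, `toy_mlb`, and `toy_genres` and the data are hypothetical; only the `MultiLabelBinarizer` usage mirrors the cell above.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy data: three movies with pipe-separated genres
toy_film = pd.DataFrame({
    'FilmID': [1, 2, 3],
    'Sjanger': ['Animation|Comedy', 'Action', 'Action|Comedy'],
})

toy_mlb = MultiLabelBinarizer()
toy_genres = pd.DataFrame(
    # split each genre string into a list, then one-hot encode the lists
    toy_mlb.fit_transform(toy_film['Sjanger'].str.split('|')),
    columns=toy_mlb.classes_,   # genre names in sorted order
    index=toy_film['FilmID'],
)
print(toy_genres)
```

Each row has a 1 in every genre column the movie belongs to, so a movie with several genres gets several 1s.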


In [100]:
# train, val, test split
train_df, validation_df = train_test_split(rangering,
                                           stratify=rangering['BrukerID'], 
                                           test_size=0.2,
                                           random_state=42)
validation_df, test_df = train_test_split(validation_df,
                                          stratify=validation_df['BrukerID'], 
                                          test_size=0.5,
                                          random_state=42)
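The two calls above give an 80/10/10 split in which every user appears in each part. A small sketch on made-up data (the frame `toy` and the `toy_*` names are hypothetical) makes the proportions easy to check:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 3 users with 20 ratings each
toy = pd.DataFrame({
    'BrukerID': [u for u in (1, 2, 3) for _ in range(20)],
    'Rangering': [(i % 5) + 1 for i in range(60)],
})

# First split: 80 % train, 20 % held out, stratified on the user
toy_train, toy_rest = train_test_split(
    toy, stratify=toy['BrukerID'], test_size=0.2, random_state=42)
# Second split: halve the held-out part into validation and test
toy_val, toy_test = train_test_split(
    toy_rest, stratify=toy_rest['BrukerID'], test_size=0.5, random_state=42)

print(len(toy_train), len(toy_val), len(toy_test))
print(toy_train['BrukerID'].value_counts().sort_index().tolist())
```

Stratifying on `BrukerID` keeps each user's share of ratings roughly equal across the three parts, so every user has training data to learn from.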

In [101]:
# features
train_movie_features = train_df.pivot(
    index='FilmID',
    columns='BrukerID',
    values='Rangering'
)
validation_movie_features = validation_df.pivot(
    index='FilmID',
    columns='BrukerID',
    values='Rangering'
)
test_movie_features = test_df.pivot(
    index='FilmID',
    columns='BrukerID',
    values='Rangering'
)

In [102]:
train_movie_features.head()

| FilmID \ BrukerID | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 6031 | 6032 | 6033 | 6034 | 6035 | 6036 | 6037 | 6038 | 6039 | 6040 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | NaN | NaN | NaN | NaN | NaN | 4.0 | NaN | NaN | NaN | 5.0 | ... | NaN | 4.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 3.0 |
| 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN |
| 4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 3.0 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | 2.0 | NaN | NaN | NaN | NaN |
| 5 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN |
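The pivot turns the long list of ratings into a movie-by-user matrix, with `NaN` wherever a user has not rated a movie. A minimal sketch on three made-up ratings (`toy_ratings` and `toy_features` are hypothetical names):

```python
import pandas as pd

# Hypothetical toy data: user 1 rated movies 10 and 20, user 2 rated movie 10
toy_ratings = pd.DataFrame({
    'BrukerID': [1, 1, 2],
    'FilmID': [10, 20, 10],
    'Rangering': [5, 3, 4],
})

# Movies become rows, users become columns, ratings become cell values
toy_features = toy_ratings.pivot(
    index='FilmID', columns='BrukerID', values='Rangering')
print(toy_features)
```

User 2 never rated movie 20, so that cell is `NaN` and the column is stored as floats.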


In [103]:
# function for computing RMSE
def rmse(prediction, actual=validation_movie_features):
    return np.sqrt(np.nanmean(((prediction - actual)**2).values))
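The subtraction inside `rmse` aligns the two frames on their row and column labels, and `np.nanmean` skips every cell where either side is missing. The sketch below repeats the formula under the hypothetical name `toy_rmse` and applies it to a made-up 2x2 example:

```python
import numpy as np
import pandas as pd

def toy_rmse(prediction, actual):
    # Same formula as rmse above; NaN differences are ignored by nanmean
    return np.sqrt(np.nanmean(((prediction - actual) ** 2).values))

# Hypothetical example: rows are movies, columns are users
toy_actual = pd.DataFrame([[4.0, np.nan], [np.nan, 2.0]],
                          index=[10, 20], columns=[1, 2])
# A constant prediction of 3.0, including a movie (30) that has no actual ratings
toy_pred = pd.DataFrame(3.0, index=[10, 20, 30], columns=[1, 2])

print(toy_rmse(toy_pred, toy_actual))
```

Only the two observed ratings contribute, each with a squared error of 1, so the result is 1.0. The extra movie in the prediction does not affect the score.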

In [104]:
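# the 3706 most-rated movies, ordered by number of ratings
selected_films = (
    rangering['FilmID']
    .value_counts()
    .head(3706)
    .index
)

# filter on the FilmID column; film.index only holds the row position
film = film[film['FilmID'].isin(selected_films)]
genres_df = genres_df.loc[selected_films]
filtered_rangering = rangering[rangering['FilmID'].isin(selected_films)]

# movies as rows, users as columns, built from all ratings in `rangering`
user_movie_matrix = filtered_rangering.pivot(
    index='FilmID', columns='BrukerID', values='Rangering'
)

# fix the row order and make sure every BrukerID 1..6040 has a column
user_movie_matrix = user_movie_matrix.reindex(
    index=selected_films,
    columns=range(1, 6041)
)

# unrated (movie, user) pairs become 0
user_movie_matrix = user_movie_matrix.fillna(0)
print(user_movie_matrix.shape)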

(3706, 6040)


## Baseline models

1. Make a prediction using the mean rating across all users and movies, and compute the RMSE on the validation data.
2. Make a prediction using the mean rating per movie for all users, and compute the RMSE on the validation data. Tip: Use [`sklearn.impute.SimpleImputer`](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html).
3. Make a prediction using the mean rating per user for all movies, and compute the RMSE on the validation data.

In [105]:
mean_all_ratings = train_df['Rangering'].mean()
base_prediction = pd.DataFrame(
    mean_all_ratings,
    index=genres_df.index, 
    columns=train_df['BrukerID'].unique()
)

In [106]:
print(rmse(base_prediction))
print(base_prediction.shape)

1.114456972706777
(3706, 6040)


In [107]:
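# movies as rows, users as columns; NaN where a user has not rated a movie
pivot_table = train_df.pivot(index='FilmID', columns='BrukerID', values='Rangering')

# strategy='mean' replaces each NaN with the mean of its column (one column per BrukerID here)
imputer = SimpleImputer(strategy='mean')
imputed_values = imputer.fit_transform(pivot_table)

base_film_prediction = pd.DataFrame(
    imputed_values,
    index=pivot_table.index,
    columns=pivot_table.columns
)

# align rows and columns with user_movie_matrix
base_film_prediction = base_film_prediction.reindex(index=user_movie_matrix.index, columns=user_movie_matrix.columns)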

In [108]:
print(rmse(base_film_prediction))
print(base_film_prediction.shape)

1.0336980938636404
(3706, 6040)
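`SimpleImputer(strategy='mean')` fills each missing value with the mean of its column. A minimal sketch on a made-up 3x2 matrix (the names `toy`, `by_column`, and `by_row` are hypothetical) shows the difference between imputing the matrix as it is, which gives per-user means when users are the columns, and imputing its transpose, which gives per-movie means:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical matrix: rows are movies (10, 20, 30), columns are users (1, 2)
toy = pd.DataFrame([[5.0, 1.0], [np.nan, 3.0], [3.0, np.nan]],
                   index=[10, 20, 30], columns=[1, 2])

imp = SimpleImputer(strategy='mean')

# Imputing as-is: NaN is replaced by the column mean, i.e. the user's mean
by_column = pd.DataFrame(imp.fit_transform(toy),
                         index=toy.index, columns=toy.columns)

# Imputing the transpose: NaN is replaced by the row mean, i.e. the movie's mean
by_row = pd.DataFrame(imp.fit_transform(toy.T).T,
                      index=toy.index, columns=toy.columns)

print(by_column)
print(by_row)
```

In `by_column`, the missing rating of user 1 for movie 20 becomes 4.0 (the mean of user 1's ratings 5 and 3). In `by_row` it becomes 3.0 (the only rating movie 20 has).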


In [109]:
# mean training rating per user
user_means = train_df.groupby('BrukerID')['Rangering'].mean()
unique_users = train_df['BrukerID'].unique()
user_mean_dict = dict(zip(unique_users, user_means[unique_users]))

# one column per user, with that user's mean repeated for every movie
base_user_prediction = pd.DataFrame(user_mean_dict, index=genres_df.index)

In [110]:
print(rmse(base_user_prediction))
print(base_user_prediction.shape)

1.0336809238126035
(3706, 6040)


## Content-based model

1. Combine the training data `train_df` with the genre data `genres_df` in a new `DataFrame` called `user_df`. Tip: Use [`pandas.DataFrame.merge`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html).
2. For each user in the combined data set (Tip: you can use [`pandas.groupby.apply`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.apply.html)):
   - Fit a linear regression model to the data.
   - Predict the rating for all movies.
3. Compute the RMSE on the validation data.
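The steps above can be sketched on a tiny made-up data set. Everything prefixed with `toy_` is hypothetical, and selecting the columns explicitly after `groupby` is just one way to keep the grouping column out of the function:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical genre indicators for four movies
toy_genres = pd.DataFrame(
    {'Action': [1, 0, 1, 0], 'Comedy': [0, 1, 1, 0]},
    index=pd.Index([10, 20, 30, 40], name='FilmID'))

# Hypothetical training ratings: both users rated movies 10, 20 and 40
toy_train = pd.DataFrame({
    'BrukerID': [1, 1, 1, 2, 2, 2],
    'FilmID': [10, 20, 40, 10, 20, 40],
    'Rangering': [5, 2, 3, 1, 4, 2],
})

# Step 1: attach the genre indicators to every rating
toy_user_df = toy_train.merge(toy_genres, left_on='FilmID', right_index=True)

def toy_fit_predict(user_data):
    # Step 2: one regression per user, then predictions for all movies
    model = LinearRegression()
    model.fit(user_data[toy_genres.columns], user_data['Rangering'])
    return pd.Series(model.predict(toy_genres), index=toy_genres.index)

toy_pred = (
    toy_user_df
    .groupby('BrukerID')[['Rangering', *toy_genres.columns]]
    .apply(toy_fit_predict)
    .T                        # movies as rows, users as columns
    .clip(lower=1, upper=5)   # keep predictions on the 1-5 rating scale
)
print(toy_pred)
```

Movie 30 was rated by neither user, but it still gets a prediction from each user's genre coefficients: user 1 likes action, user 2 likes comedy, and movie 30 is both.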

In [111]:
user_df = train_df.merge(genres_df, left_on='FilmID', right_index=True)

In [112]:
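def train_and_predict(user_data):
    # fit one linear regression per user on the genre indicators of the movies they rated
    genre_columns = genres_df.columns

    X_train = user_data[genre_columns]
    y_train = user_data['Rangering']

    model = LinearRegression()
    model.fit(X_train, y_train)
    # predict a rating for every movie in genres_df
    y_pred = model.predict(genres_df[genre_columns])

    return pd.Series(y_pred, index=genres_df.index)

# transpose to movies x users and clip the predictions to the rating range 1-5
content_prediction = user_df.groupby('BrukerID').apply(train_and_predict, include_groups=False).T.clip(lower=1, upper=5)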

In [113]:
print(rmse(content_prediction))
print(content_prediction.shape)

1.0561310214947643
(3706, 6040)


## Collaborative model

1. Subtract the mean rating per user from the training data.
2. Set the rating to 0 for all movies that have no rating.
3. Compute the similarity (correlation) between the movies. Tip: Use [`numpy.corrcoef`](https://numpy.org/doc/stable/reference/generated/numpy.corrcoef.html).
4. Find the 10 movies that are most similar to each movie. Tip: Use [`sklearn.neighbors.NearestNeighbors`](https://ogrisel.github.io/scikit-learn.org/sklearn-tutorial/modules/generated/sklearn.neighbors.NearestNeighbors.html).
5. Compute the mean of the ratings of the 10 nearest neighbours.
6. Compute the RMSE on the validation data.
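The steps above can be sketched on a made-up 4x4 rating matrix. This is one possible reading of the exercise, not a reference solution. It rests on three assumptions: `1 - correlation` is used as a precomputed distance, a single neighbour is used because the toy data has only four movies (the exercise asks for 10), and the user mean is added back at the end to return to the rating scale. All names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Hypothetical ratings: rows are movies, columns are users, NaN = not rated
toy = pd.DataFrame(
    [[5, 4, np.nan, 1],
     [4, 5, 1, np.nan],
     [1, np.nan, 5, 4],
     [np.nan, 1, 4, 5]],
    index=pd.Index([10, 20, 30, 40], name='FilmID'),
    columns=pd.Index([1, 2, 3, 4], name='BrukerID'),
    dtype=float,
)

# Steps 1-2: centre each user's ratings on their mean, then use 0 for missing
user_mean = toy.mean(axis=0)                    # NaN is skipped
centred = toy.sub(user_mean, axis=1).fillna(0)

# Step 3: movie-movie correlation (each row is one movie)
sim = np.corrcoef(centred.values)

# Step 4: nearest neighbours on the distance 1 - correlation (assumption)
k = 1  # 10 in the exercise
dist = np.clip(1 - sim, 0, None)                # guard against tiny negatives
nn = NearestNeighbors(n_neighbors=k, metric='precomputed').fit(dist)
# Called without a query, kneighbors leaves each movie out of its own neighbours
neigh = nn.kneighbors(return_distance=False)

# Step 5: average the neighbours' centred ratings and add the user mean back
pred_centred = np.stack([centred.values[idx].mean(axis=0) for idx in neigh])
toy_cf = pd.DataFrame(pred_centred, index=toy.index,
                      columns=toy.columns).add(user_mean, axis=1)
print(toy_cf.round(2))
```

Movies 10 and 20 end up as each other's neighbour, as do movies 30 and 40. User 3 never rated movie 10, so the prediction for that cell comes from user 3's rating of movie 20.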

In [114]:
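# user_movie_matrix has 0 for unrated movies, so these per-user means include the zeros
mean_user_rating = user_movie_matrix.mean(axis=0)
# subtract each user's mean from that user's column
normalized_ratings = user_movie_matrix.sub(mean_user_rating, axis=1)
# no NaN remain at this point, so fillna(0) leaves the values unchanged
means = normalized_ratings.fillna(0)

In [115]:
# movie-movie correlation of the centred ratings, shape (3706, 3706)
correlation_matrix = np.corrcoef(means)
# similarity-weighted sum of the centred ratings, shape (3706, 6040)
numerator = np.dot(correlation_matrix, means)
# vector of length 6040: summed absolute correlations per movie, dotted with the centred ratings
denominator = np.dot(np.abs(correlation_matrix).sum(axis=1), means)
# the division broadcasts the denominator across the user columns
collaborative_prediction = numerator / denominator

In [116]:
collaborative_prediction = pd.DataFrame(collaborative_prediction, index=means.index, columns=means.columns)

In [117]:
# movie-movie correlation of the weighted predictions
similarity = np.corrcoef(collaborative_prediction)

In [118]:
nn = NearestNeighbors(n_neighbors=10, metric='cosine')
# each row of `similarity` (one per movie) is treated as a sample
nn.fit(similarity)
# the queries are the rows of the transposed prediction matrix, so the result has shape (6040, 10)
neighbours = nn.kneighbors(collaborative_prediction.T, return_distance=False)

In [119]:
# replace each movie's row with the mean of the rows selected by neighbours[i];
# rows are overwritten in place, so later iterations read already-updated rows
for i, ix in enumerate(collaborative_prediction.index):
    collaborative_prediction.loc[ix] = collaborative_prediction.iloc[neighbours[i]].mean(axis=0)

In [120]:
# add the centred ratings stored in `means` back onto the predictions
collaborative_prediction = collaborative_prediction + means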

In [121]:
print(rmse(collaborative_prediction))

0.47022934851262144


## Combine the predictions

1. Compute the mean of the predictions from the content-based and the collaborative model.
2. Compute the RMSE on the validation data.
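Averaging two prediction frames with `+` aligns them on their row and column labels, so the column order of the two frames does not have to match. A minimal sketch with made-up numbers (`toy_a`, `toy_b`, and `toy_combined` are hypothetical names):

```python
import pandas as pd

# Hypothetical predictions for one movie (10) and two users (1, 2)
toy_a = pd.DataFrame([[4.0, 2.0]], index=[10], columns=[1, 2])
# Same movie and users, but with the user columns in a different order
toy_b = pd.DataFrame([[3.0, 5.0]], index=[10], columns=[2, 1])

# Element-wise mean; pandas matches cells by label, not by position
toy_combined = (toy_a + toy_b) / 2
print(toy_combined)
```

For user 1 the two predictions are 4.0 and 5.0, which average to 4.5. For user 2 they are 2.0 and 3.0, which average to 2.5.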

In [122]:
combined_prediction = (collaborative_prediction + content_prediction) / 2

In [123]:
print(rmse(combined_prediction))

0.5779216048801534


## Generalization

Select the best model and compute the RMSE on the test data.

In [124]:
print(rmse(collaborative_prediction, actual=test_movie_features))

0.47024795870999014
