# Hybrid recommender systems
A hybrid recommender combines several classical ones. During the previous session, you developed several approaches. We will now try to combine them.

## The big everything
In order to combine existing models, you need to import them into this notebook. Use the cell below to copy/paste all your previous models.

In [1]:
import numpy as np
from sklearn.base import BaseEstimator


class RatingMean(BaseEstimator):
    
    def __init__(self):
        self._mean = None
    
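    def fit(self, X, y, **fit_params):
        # Global mean of all ratings; the user/item tables in fit_params are ignored
        self._mean = np.mean(X['rating'])
        return self  # scikit-learn convention: fit returns the estimator
        
    def predict(self, X, **fit_params):
        # Placeholder: self._mean is a scalar, so this does not yet return 10 movie ids
        return [np.argsort(self._mean)[:10] for x in X]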

## A first hybrid system
Code a hybrid model that combines your previous models. You are free to use any technique but, since you like to be directed, you can go with a mixed system with fixed weights.
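One way to read "mixed system with fixed weights" is a weighted sum of the scores produced by each model. The sketch below is only an illustration under stated assumptions: the `score(user_id)` method, the `WeightedHybrid` and `FixedScores` classes, and the min-max normalisation are all hypothetical and not part of the previous sessions. Items are identified here by their position in the score vector.

```python
import numpy as np


def _minmax(values):
    """Rescale scores to [0, 1] so that the weights of different models are comparable."""
    span = values.max() - values.min()
    return (values - values.min()) / span if span > 0 else np.zeros_like(values)


class FixedScores:
    """Toy stand-in for a real model: returns the same item scores for every user."""

    def __init__(self, scores):
        self.scores = np.asarray(scores, dtype=float)

    def fit(self, X, y=None, **fit_params):
        return self

    def score(self, user_id):
        # Assumption: each model exposes one score per item for a given user
        return self.scores


class WeightedHybrid:
    """Blend the scores of several models with fixed weights (hypothetical sketch)."""

    def __init__(self, models, weights, k=10):
        self.models = models
        self.weights = np.asarray(weights, dtype=float)
        self.k = k

    def fit(self, X, y=None, **fit_params):
        for model in self.models:
            model.fit(X, y, **fit_params)
        return self

    def predict(self, X, **predict_params):
        recommendations = []
        for user_id in X:
            # One row per model, one column per item
            scores = np.stack([_minmax(m.score(user_id)) for m in self.models])
            blended = self.weights @ scores
            # Positions of the k highest blended scores, best first
            top = np.argsort(-blended)[: self.k]
            recommendations.append(top.tolist())
        return recommendations


popularity = FixedScores([5, 1, 3, 0])
taste = FixedScores([0, 4, 4, 1])
hybrid = WeightedHybrid([popularity, taste], weights=[0.3, 0.7], k=2).fit(None)
print(hybrid.predict([42]))  # [[2, 1]]
```

The weights are hand-picked here; tuning them on a validation split is a natural next step.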

# MovieLens 100k: The challenge
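Your challenge now is to achieve the best possible score on ML100k. The evaluation will be done using the framework proposed in the first practical session. Your goal is to use every trick available to optimize your predictions. The metric is referred to as F1-score@10, meaning an even mix of precision@10 and recall@10. Note that the evaluation code further down computes this mix as the arithmetic mean of the two, not the harmonic mean of the usual F1 definition.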

In the previous session, you had to code the recall metric. Copy it here so it can be used for evaluation.

In [2]:
def recall(Y_predicted, Y_true):
    return 0.
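For reference, here is a minimal sketch of what a recall function could look like, together with the way the evaluation combines it with precision. The names `recall_at_k` and `precision_at_k` and the toy lists are illustrative assumptions; your own implementation from the previous session may differ.

```python
def recall_at_k(predicted, relevant):
    """Fraction of the relevant items that appear among the predictions."""
    relevant = set(relevant)
    if not relevant:
        return 0.0  # convention chosen here: no relevant item means recall 0
    return len(set(predicted) & relevant) / len(relevant)


def precision_at_k(predicted, relevant, k=10):
    """Fraction of the k recommended slots that hold a relevant item."""
    return len(set(predicted) & set(relevant)) / k


predicted = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  # 10 recommended movie ids
relevant = [2, 4, 50, 60]                    # movies the user actually saw

precision = precision_at_k(predicted, relevant)  # 2 hits out of 10 slots
recall = recall_at_k(predicted, relevant)        # 2 hits out of 4 relevant movies
score = (precision + recall) / 2                 # even mix used by the evaluation loop
print(precision, recall, score)
```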

**Edit the following code at your own risk**

This is the code that will be used for evaluation. You can edit it, but do so at your own risk, since I will use my own copy to evaluate your work. You just need a variable holding a sklearn-compatible model that follows the specifications described below.

## Loading data
Download and load ml100k

In [3]:
# Download the ml100k dataset
import requests
import os
import zipfile
import pandas as pd


DATA_HEADER = "user id | item id | rating | timestamp"
ITEM_HEADER = "movie id | movie title | release date | video release date | IMDb URL | unknown | Action | Adventure | Animation | Children's | Comedy | Crime | Documentary | Drama | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western"
USER_HEADER = "user id | age | gender | occupation | zip code"

print('Loading ml100k')
if not os.path.exists('ml-100k'):
    url = 'http://files.grouplens.org/datasets/movielens/ml-100k.zip'
    r = requests.get(url)

    # Stop on failure instead of writing an error page to disk
    if r.status_code != 200:
        raise RuntimeError('Error: could not download ml100k')

    with open('ml-100k.zip', 'wb') as f:
        f.write(r.content)
    # The context manager closes the archive even if extraction fails
    with zipfile.ZipFile('ml-100k.zip', 'r') as fzip:
        fzip.extractall('.')
    print('ml100k downloaded')

    
def convert_header_to_camel_case(headers):
    """Convert a header line from the ML 100k doc to a list of snake_case names (despite the function name)
    
    Example:
      convert "user id | item id | rating | timestamp"
      to ['user_id', 'item_id', 'rating', 'timestamp']
    """
    return headers.replace(' ', '_').split('_|_')

data = pd.read_csv(
    'ml-100k/u.data',
    delimiter='\t',
    names=convert_header_to_camel_case(DATA_HEADER),
    encoding='latin-1'
)

item = pd.read_csv(
    'ml-100k/u.item',
    delimiter='|',
    names=convert_header_to_camel_case(ITEM_HEADER),
    encoding='latin-1'
)

user = pd.read_csv(
    'ml-100k/u.user',
    delimiter='|',
    names=convert_header_to_camel_case(USER_HEADER),
    encoding='latin-1'
)

print('Peeking data')
print(data[:3])
print('Peeking item')
print(item[:3])
print('Peeking user')
print(user[:3])


Loading ml100k
Peeking data
   user_id  item_id  rating  timestamp
0      196      242       3  881250949
1      186      302       3  891717742
2       22      377       1  878887116
Peeking item
   movie_id        movie_title release_date  video_release_date  \
0         1   Toy Story (1995)  01-Jan-1995                 NaN   
1         2   GoldenEye (1995)  01-Jan-1995                 NaN   
2         3  Four Rooms (1995)  01-Jan-1995                 NaN   

                                            IMDb_URL  unknown  Action  \
0  http://us.imdb.com/M/title-exact?Toy%20Story%2...        0       0   
1  http://us.imdb.com/M/title-exact?GoldenEye%20(...        0       1   
2  http://us.imdb.com/M/title-exact?Four%20Rooms%...        0       0   

   Adventure  Animation  Children's  ...  Fantasy  Film-Noir  Horror  Musical  \
0          0          1           1  ...        0          0       0        0   
1          1          0           0  ...        0          0       0        0  

## Learn and evaluate
Run the algorithm and output a score.
Note that, in order to be able to process all the data available, item- and user-specific data are passed through keyword arguments.
We consider a recommendation successful if it recommends movies that have been seen by the user regardless of the rating.
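To make the expected interface concrete, here is a hedged sketch of a baseline that follows it: `fit` receives the ratings plus `user` and `item` keyword arguments, and `predict` receives a list of user ids and returns one list of movie ids per user. The `MostPopular` class and the toy ratings are hypothetical. In the notebook the class would also subclass `BaseEstimator`, as `RatingMean` does; that is left out here to keep the sketch dependency-free.

```python
from collections import Counter


class MostPopular:
    """Recommend the k most-rated movies to every user (hypothetical baseline)."""

    def __init__(self, k=10):
        self.k = k

    def fit(self, X, y=None, user=None, item=None):
        # The user and item tables arrive as keyword arguments; this baseline ignores them
        counts = Counter(X['item_id'])
        # A movie counts as "seen" whatever its rating, so only occurrences matter
        self.top_items_ = [item_id for item_id, _ in counts.most_common(self.k)]
        return self

    def predict(self, X, **predict_params):
        # Same recommendation list for every user id in X
        return [list(self.top_items_) for _ in X]


# Toy ratings with the same columns as u.data (a dict of lists stands in for the DataFrame)
ratings = {
    'user_id': [1, 1, 2, 2, 3, 3],
    'item_id': [10, 20, 10, 30, 10, 20],
    'rating': [5, 3, 4, 2, 1, 5],
}
model = MostPopular(k=2).fit(ratings, None, user=None, item=None)
print(model.predict([1, 2]))  # [[10, 20], [10, 20]]
```

Since `fit` only reads `X['item_id']`, the same code should work when `X` is the pandas DataFrame loaded above.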


Grade: This exercise gives a bonus grade, awarded according to your rank among the other students.

In [4]:
from tabulate import tabulate
from sklearn.model_selection import KFold


# Cross validation
# No shuffling, so random_state is omitted (recent scikit-learn raises a ValueError if it is set)
kf = KFold(n_splits=5)

for name, estimator in [
        ('GlobalMean', RatingMean),
    ]:
    scores = []
    for train, test in kf.split(data):
        est = estimator()
        est.fit(data.iloc[train], None, user=user, item=item)
        
        
        for user_id, row in data.iloc[test].groupby('user_id').agg(tuple).iterrows():
            # The test input is the id of a user in the test set
            # Your estimator must return a list of 10 movies
            pred = est.predict([user_id])[0]
        
            # True recommendations
            true = row['item_id']

            prec = len(set(pred).intersection(true)) / 10
            rec = recall(pred, true)
            
            scores.append((prec + rec) / 2.)

    new_line = [name, np.mean(scores)]
    print(tabulate([new_line], tablefmt="pipe"))  # print current algo perf
    #table.append(new_line)

|:-----------|--:|
| GlobalMean | 0 |
