# Hybrid Recommendation System using LightFM
<p align="center">
  <img src="https://user-images.githubusercontent.com/64508435/167173215-75c3565f-fb12-496a-a434-c67213dae170.png" width="600" />
</p>
- There are two main types of recommendation systems: 
    - Collaborative filtering
        - Pros: 
            - **Personalized** Recommendations
            - **Scalability**: Collaborative filtering can be used for a wide range of items
            - **Serendipity**: Collaborative filtering can introduce users to new items they may not have discovered otherwise, by recommending items that are popular among similar users.

        - Cons:
    - Content-based filtering
- **Hybrid** recommendation can solve some problems of both Content Based Filtering and Collaborative Filtering.
- [Medium Reference](https://medium.com/@dikosaktiprabowo/hybrid-recommendation-system-using-lightfm-e10dd6b42923)

In [41]:
from pathlib import Path

import pandas as pd
import numpy as np

from scipy.sparse import csr_matrix

from lightfm.data import Dataset
from lightfm.cross_validation import random_train_test_split

from lightfm import LightFM # model
from lightfm.evaluation import precision_at_k, recall_at_k, auc_score # evaluation

In [2]:
base_path = Path("__file__").resolve().parents[2]
data_path = base_path / "data"

In [3]:
df_list = []

## Load data (users, movies, rating)

In [4]:
u_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']
m_cols = ['movie_id', 'title', 'release_date', 'video_release_date', 'imdb_url', 'action', 'adventure', 'animation', 'children','comedy', 'crime','documentary', 'drama', 'fantasy', 'film_noir','horror', 'musical', 'mystery', 'romance',' scifi', 'thriller', 'war', 'western', 'no_genre']
r_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
df_list = []
for i,c,s in zip(['user', 'item', 'data'], [u_cols,m_cols,r_cols], ['|','|','\t']):
    filename = 'u.'+i
    file_path = data_path / "movielens" / "ml-100k" / filename
    temp = pd.read_csv(file_path, sep=s, names=c,
                    encoding='latin-1')
    df_list.append(temp)
user, item, rating = df_list[0].copy(),df_list[1].copy(),df_list[2].copy()


## Data Preprocessing

### User

#### Create binning for Age
- Check quantiles to create four groups proportionally

In [5]:
pd.qcut(user['age'],4).head()

0    (6.999, 25.0]
1     (43.0, 73.0]
2    (6.999, 25.0]
3    (6.999, 25.0]
4     (31.0, 43.0]
Name: age, dtype: category
Categories (4, interval[float64, right]): [(6.999, 25.0] < (25.0, 31.0] < (31.0, 43.0] < (43.0, 73.0]]

- Create adjusted binning

In [6]:
user['age_bin'] = pd.cut(user['age'], bins=[0,25,30,45,np.inf], labels= ['<= 25', '26 - 30', '31 - 45', '>= 45'])

In [7]:
user.head()

Unnamed: 0,user_id,age,sex,occupation,zip_code,age_bin
0,1,24,M,technician,85711,<= 25
1,2,53,F,other,94043,>= 45
2,3,23,M,writer,32067,<= 25
3,4,24,M,technician,43537,<= 25
4,5,33,F,other,15213,31 - 45


In [8]:
user_features_df = pd.get_dummies(user.drop(columns = ['age','zip_code']), dtype=np.int32)

In [9]:
user_features_df.head()

Unnamed: 0,user_id,sex_F,sex_M,occupation_administrator,occupation_artist,occupation_doctor,occupation_educator,occupation_engineer,occupation_entertainment,occupation_executive,...,occupation_retired,occupation_salesman,occupation_scientist,occupation_student,occupation_technician,occupation_writer,age_bin_<= 25,age_bin_26 - 30,age_bin_31 - 45,age_bin_>= 45
0,1,0,1,0,0,0,0,0,0,0,...,0,0,0,0,1,0,1,0,0,0
1,2,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,3,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,1,1,0,0,0
3,4,0,1,0,0,0,0,0,0,0,...,0,0,0,0,1,0,1,0,0,0
4,5,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0


In [10]:
user_features_col = user_features_df.drop(columns=['user_id']).columns.values
user_feat = user_features_df.drop(columns=['user_id']).to_dict(orient='records')

In [11]:
print(user_features_col)

['sex_F' 'sex_M' 'occupation_administrator' 'occupation_artist'
 'occupation_doctor' 'occupation_educator' 'occupation_engineer'
 'occupation_entertainment' 'occupation_executive' 'occupation_healthcare'
 'occupation_homemaker' 'occupation_lawyer' 'occupation_librarian'
 'occupation_marketing' 'occupation_none' 'occupation_other'
 'occupation_programmer' 'occupation_retired' 'occupation_salesman'
 'occupation_scientist' 'occupation_student' 'occupation_technician'
 'occupation_writer' 'age_bin_<= 25' 'age_bin_26 - 30' 'age_bin_31 - 45'
 'age_bin_>= 45']


In [12]:
user.shape

(943, 6)

### Movie
- For movie, only  "genre" feature needed and it is already one hot encoded.

In [13]:
item.head()

Unnamed: 0,movie_id,title,release_date,video_release_date,imdb_url,action,adventure,animation,children,comedy,...,film_noir,horror,musical,mystery,romance,scifi,thriller,war,western,no_genre
0,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
1,2,GoldenEye (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?GoldenEye%20(...,0,1,1,0,0,...,0,0,0,0,0,0,0,1,0,0
2,3,Four Rooms (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Four%20Rooms%...,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,4,Get Shorty (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Get%20Shorty%...,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,5,Copycat (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Copycat%20(1995),0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


In [14]:
item_features = item.drop(columns=['title', 'release_date', 'video_release_date', 'imdb_url'])
item_features_col = item_features.drop(columns=['movie_id']).columns.values
item_feat = item_features.drop(columns =['movie_id']).to_dict(orient='records')

In [15]:
print(item_features_col)

['action' 'adventure' 'animation' 'children' 'comedy' 'crime'
 'documentary' 'drama' 'fantasy' 'film_noir' 'horror' 'musical' 'mystery'
 'romance' ' scifi' 'thriller' 'war' 'western' 'no_genre']


In [16]:
item.shape

(1682, 24)

### user-item interaction

In [17]:
rating.head()

Unnamed: 0,user_id,movie_id,rating,unix_timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596


- We created a new feature with the assumption that rating of 3 and higher classify as liked by user

In [18]:
rating['liked'] = np.where(rating['rating'] >=3,1,0)

In [19]:
rating.shape

(100000, 5)

## LightFM model

### Dataset
- Fit `users`, `items`, `user features`, `item features` into lightFM's `Dataset()` object to create mappings

In [20]:
dataset = Dataset()
dataset.fit(
    users=[x for x in user['user_id']], 
    items=[x for x in item['movie_id']], 
    item_features=item_features_col, 
    user_features=user_features_col
)

In [21]:
num_users, num_items = dataset.interactions_shape()

In [22]:
print(f"Number of users: {num_users}")
print(f"Number of items: {num_items}")

Number of users: 943
Number of items: 1682


#### Build `item_features` to be fitted into the model

In [23]:
item_features = dataset.build_item_features((x,y) for x,y in zip(item_features['movie_id'],item_feat))

#### Build `user_features` to be fitted into the model

In [24]:
user_features = dataset.build_user_features((x,y) for x,y in zip(user['user_id'], user_feat))

#### Build interactions (user — item) and its respective weights (in this case each user’s movie rating score)

In [25]:
(interactions, weights) = dataset.build_interactions((x, y) for x,y in zip(rating['user_id'], rating['movie_id']))

### Model Training

#### Split train test

In [26]:
# split interactions
train, test = random_train_test_split(interactions,test_percentage=0.2, random_state=779)
# split weights
train_w, test_w = random_train_test_split(weights, test_percentage=0.2, random_state=779)

### Create Model

In [27]:
n_components = 30
loss = 'warp'
epoch = 30
num_thread = 4
model = LightFM(no_components= n_components, loss=loss, random_state = 1616)
model.fit(train,  user_features= user_features, item_features= item_features, epochs=epoch,num_threads = num_thread, sample_weight = train_w)

<lightfm.lightfm.LightFM at 0x3049e4ed0>

### Model Evaluation
- Precision and Recall will be calculated by k number of top recommendations. 

In [28]:
k = 10
train_precision = precision_at_k(model, train, k=k,item_features=item_features, user_features=user_features).mean()
test_precision = precision_at_k(model, test, train_interactions=train, k=k,item_features=item_features, user_features=user_features).mean()

train_recall = recall_at_k(model, train, k=k,item_features=item_features, user_features=user_features).mean()
test_recall = recall_at_k(model, test,train_interactions=train, k=k,item_features=item_features, user_features=user_features).mean()

train_auc = auc_score(model, train,item_features=item_features, user_features=user_features).mean()
test_auc = auc_score(model, test, train_interactions=train,item_features=item_features, user_features=user_features).mean()

In [29]:
print(f"Precision@{k} - Train: {train_precision:.2f}")
print(f"Precision@{k} - Test: {test_precision:.2f}")

print(f"Recall@{k} - Train: {train_recall:.2f}")
print(f"Recall@{k} - Test: {test_recall:.2f}")

print(f"AUC - Train: {train_auc:.2f}")
print(f"AUC - Test: {train_auc:.2f}")

Precision@10 - Train: 0.49
Precision@10 - Test: 0.25
Recall@10 - Train: 0.09
Recall@10 - Test: 0.13
AUC - Train: 0.90
AUC - Test: 0.90


### Recommendation for Single User

- Predict scores for sample user (lightFM index = 3, user_id = 4)
- Note: LightFM create it’s own index that may be different to the ids in your dataframe.

In [30]:
scores = model.predict(3, np.arange(1682))

In [31]:
scores[:10]

array([-3.570756 , -3.3724377, -2.935549 , -4.2918496, -2.4583693,
       -1.4785393, -2.9252636, -3.901453 , -1.8755553, -2.057639 ],
      dtype=float32)

In [32]:
# get top items based on the scores
top_items = item.iloc[np.argsort(-scores)]

In [33]:
top_items[0:10][['title','movie_id']]

Unnamed: 0,title,movie_id
1292,Star Kid (1997),1293
330,"Edge, The (1997)",331
896,Time Tracers (1995),897
988,Cats Don't Dance (1997),989
343,"Apostle, The (1997)",344
318,Everyone Says I Love You (1996),319
352,Deep Rising (1998),353
270,Starship Troopers (1997),271
902,Afterglow (1997),903
354,Sphere (1998),355


In [34]:
# compare with ground truth
known_positives = item.iloc[interactions.tocsr()[3].indices]
known_positives_rating = rating[(rating['user_id']==user['user_id'][3])][['movie_id','rating']].merge(item[['movie_id','title']], on = 'movie_id')
known_positives_rating[known_positives_rating['movie_id'].isin(top_items['movie_id'][0:10])]

Unnamed: 0,movie_id,rating,title
10,271,4,Starship Troopers (1997)


- Among top 10 recommended movies, there are 1 movie "Starship Troopers" that is rated by the user with the rating=4

In [35]:
# Known ratings by sample user sorted descending from highest rating
known_positives_rating.sort_values(by=['rating'], ascending = False)

Unnamed: 0,movie_id,rating,title
23,301,5,In & Out (1997)
19,359,5,"Assignment, The (1997)"
17,327,5,Cop Land (1997)
15,329,5,Desperate Measures (1998)
20,362,5,Blues Brothers 2000 (1998)
13,258,5,Contact (1997)
1,303,5,Ulee's Gold (1997)
11,300,5,Air Force One (1997)
9,354,5,"Wedding Singer, The (1998)"
8,50,5,Star Wars (1977)


#### Similar item calculation from `item_features`
- It is also possible in LightFM to calculate item and user similarity similar to content based filtering.


In [36]:
def similar_items(item_id, model, N=10, norm = True):
    item_bias , item_representations = model.get_item_representations(features=item_features)

    # Cosine similarity
    scores = item_representations.dot(item_representations[item_id, :])
    item_norms = np.linalg.norm(item_representations, axis=1)

    if norm == True:
        scores /= item_norms
        best = np.argpartition(scores, -N)[-N:]
        similar = sorted(zip(best, scores[best]/ item_norms[item_id] ), key=lambda x: -x[1])
    else:
        best = np.argpartition(scores, -N)[-N:]
        similar = sorted(zip(best, scores[best] ), key=lambda x: -x[1])
    return similar

- For example, if we have a user rated Ulee’s Gold (movie_id=32) a 5 star and we would like to recommend similar drama movie. These are some of them.

In [37]:
similar_item_list = similar_items(302, model)
similar_idx = [x[0] for x in similar_item_list ]
item.iloc[similar_idx][['title']]

Unnamed: 0,title
302,Ulee's Gold (1997)
886,Eve's Bayou (1997)
304,"Ice Storm, The (1997)"
873,Career Girls (1997)
339,Boogie Nights (1997)
123,Lone Star (1996)
1100,Six Degrees of Separation (1993)
899,Kundun (1997)
895,"Sweet Hereafter, The (1997)"
261,In the Company of Men (1997)


- This content based recommendation may not be personalized enough. We can see how hybrid recommendation could give a more sophisticated result.

#### Similar users calculation from `user_features`

In [38]:
def similar_users(user_id, model, N=10, norm = True):
    user_bias ,user_representations = model.get_user_representations(features= user_features)

    # Cosine similarity
    scores = user_representations.dot(user_representations[user_id, :])
    item_norms = np.linalg.norm(user_representations, axis=1)
    
    if norm == True:
        scores /= item_norms
        best = np.argpartition(scores, -N)[-N:]
        similar = sorted(zip(best, scores[best] / item_norms[user_id]), 
                    key=lambda x: -x[1])
    else:
        best = np.argpartition(scores, -N)[-N:]
        similar = sorted(zip(best, scores[best]), 
                    key=lambda x: -x[1])
    return similar

In [39]:
similar_item_list = similar_users(3,model)
similar_idx = [x[0] for x in similar_item_list]
cols = ['user_id', 'sex_M', 'occupation_writer', 'age_bin_<= 25']
user_features_df[user_features_df['user_id'].isin(similar_idx)].loc[:,cols]

Unnamed: 0,user_id,sex_M,occupation_writer,age_bin_<= 25
2,3,1,1,1
151,152,0,0,0
175,176,1,0,0
246,247,1,0,0
247,248,1,0,1
412,413,1,0,0
549,550,0,0,1
715,716,0,0,0
810,811,0,0,0
830,831,1,0,1


- Based on the similarity between users, It turns out the user (id=3) has similarity with other male user with age younger than 25 years old.

### Recommendation for new user (cold start problem)

- **Cold start** problem is a situation where there is a new user, hence no historical data at all. 
- We could give recommendation based on the user similarity with other user (possible through information given when creating new account).

In [42]:
new_user = pd.DataFrame(np.zeros(len(user_features_col))).T
new_user.columns = user_features_col
new_user['sex_M'] = 1
new_user['occupation_lawyer'] = 1
new_user['age_bin_<= 25'] = 1
new_user = csr_matrix(new_user)
scores_new_user = model.predict(user_ids = 0,item_ids = np.arange(interactions.shape[1]), user_features=new_user)
top_items_new_user = item.iloc[np.argsort(-scores_new_user)]
top_items_new_user[0:10][['title']]

Unnamed: 0,title
49,Star Wars (1977)
0,Toy Story (1995)
99,Fargo (1996)
180,Return of the Jedi (1983)
422,E.T. the Extra-Terrestrial (1982)
150,Willy Wonka and the Chocolate Factory (1971)
126,"Godfather, The (1972)"
312,Titanic (1997)
287,Scream (1996)
171,"Empire Strikes Back, The (1980)"


- Even though we don’t have historical data for this new user, we can still recommend some movie that are liked by similar other user. We can do recommendation in this way until there are some historical data available.