To install the required libraries, install Anaconda and then install LightFM using `pip install lightfm`.

In [1]:
import sqlite3
import numpy as np
import pandas as pd
from lightfm import LightFM
from lightfm.datasets import fetch_movielens

# LightFM approach

Import the necessary libraries.

In [2]:
import sqlite3
import numpy as np
import pandas as pd
from lightfm import LightFM
from lightfm.data import Dataset

Connect to the database (from a file this time) and retrieve the data we need.
- A podcast can have several categories, so I fetch them in a separate query and join them into one comma-separated string per podcast. I don't know if this is the best way to do it.
*Running this takes up a fair amount of memory, as it loads a 500 MB file into memory (possibly twice).*

In [3]:
conn = sqlite3.connect("./database.sqlite")
data = pd.read_sql("""SELECT rating, author_id, created_at, podcasts.podcast_id, podcasts.title 
                        FROM podcasts 
                        INNER JOIN reviews ON reviews.podcast_id = podcasts.podcast_id""", 
                   conn)
category_data = pd.read_sql('''
                            SELECT podcast_id, category FROM categories
                            ''', conn)
category_data = category_data.groupby('podcast_id')['category'].apply(', '.join).reset_index()
data = pd.merge(data, category_data, on='podcast_id', how='left')
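
On the question of whether there is a better way to collect the categories: SQLite can also do the grouping itself with `GROUP_CONCAT`. The following is a minimal sketch on an in-memory toy table (the rows are made up for illustration and are not taken from the real database). It is not a claim that this is faster on the full file.

```python
import sqlite3

# Toy stand-in for the real `categories` table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE categories (podcast_id TEXT, category TEXT);
    INSERT INTO categories VALUES
        ('p1', 'arts'), ('p1', 'music'), ('p2', 'education');
""")

# Let SQLite build one comma-separated string per podcast.
rows = conn.execute("""
    SELECT podcast_id, GROUP_CONCAT(category, ', ') AS category
    FROM categories
    GROUP BY podcast_id
    ORDER BY podcast_id
""").fetchall()
conn.close()

# GROUP_CONCAT does not promise an order, so sort after splitting.
category_by_podcast = {pid: sorted(cats.split(", ")) for pid, cats in rows}
print(category_by_podcast)
```

With the real file, the same `SELECT` could be passed to `pd.read_sql` in place of the separate query plus `groupby`.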

Close the connection and take a quick look at the data. (I will add more exploratory data analysis in the future, as it gave some genuinely good insights that helped with decisions.)

In [4]:
conn.close()
print(data['category'])
data.head()

0         arts, arts-performing-arts, music
1         arts, arts-performing-arts, music
2              arts, arts-design, education
3              arts, arts-design, education
4              arts, arts-design, education
                        ...                
984400                      society-culture
984401                      society-culture
984402                      society-culture
984403                      society-culture
984404                      society-culture
Name: category, Length: 984405, dtype: object


Unnamed: 0,rating,author_id,created_at,podcast_id,title,category
0,5,F7E5A318989779D,2018-04-24T12:05:16-07:00,c61aa81c9b929a66f0c1db6cbe5d8548,Backstage at Tilles Center,"arts, arts-performing-arts, music"
1,5,F6BF5472689BD12,2018-05-09T18:14:32-07:00,c61aa81c9b929a66f0c1db6cbe5d8548,Backstage at Tilles Center,"arts, arts-performing-arts, music"
2,1,1AB95B8E6E1309E,2019-06-11T14:53:39-07:00,ad4f2bf69c72b8db75978423c25f379e,TED Talks Daily,"arts, arts-design, education"
3,5,11BB760AA5DEBD1,2018-05-31T13:08:09-07:00,ad4f2bf69c72b8db75978423c25f379e,TED Talks Daily,"arts, arts-design, education"
4,5,D86032C8E57D15A,2019-06-19T13:56:05-07:00,ad4f2bf69c72b8db75978423c25f379e,TED Talks Daily,"arts, arts-design, education"


Time to create the dataset in the format that the LightFM library wants. The `fit` call takes the user ids and the item ids and assigns an internal integer index to each distinct id (that's how it works internally, but there's no need for us to care).

In [5]:
dataset = Dataset()
dataset.fit(data['author_id'], data['podcast_id'])
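
To make the "assigns integers" idea concrete, here is a small sketch of such an id-to-index mapping. It only illustrates the concept and is not LightFM's actual code; `build_index` is a made-up helper name.

```python
def build_index(ids):
    """Map each distinct id to the next free integer, in first-seen order."""
    mapping = {}
    for raw_id in ids:
        # setdefault leaves ids that are already known untouched.
        mapping.setdefault(raw_id, len(mapping))
    return mapping


# The first author appears twice but gets a single index.
author_ids = ["F7E5A318989779D", "F6BF5472689BD12", "F7E5A318989779D"]
user_index = build_index(author_ids)
print(user_index)
```

The sparse matrices built later are indexed by these integers rather than by the raw id strings.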

Register the category column as a feature of a podcast.

In [6]:
dataset.fit_partial(items=data['podcast_id'],
                    item_features=data['category'])
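
Note that `fit_partial` receives the joined strings here, so each distinct combination of categories is what gets registered, while the later `build_item_features` call passes the individual categories. One way to make the two agree would be to register the split tokens instead. This is a minimal sketch on toy strings; the variable names are illustrative.

```python
# Toy joined strings in the same format as data['category'].
joined_categories = [
    "arts, arts-performing-arts, music",
    "arts, arts-design, education",
    "society-culture",
]

# Collect every individual category exactly once.
unique_categories = sorted(
    {token for joined in joined_categories for token in joined.split(", ")}
)
print(unique_categories)

# With the real objects this list could then be registered, for example:
# dataset.fit_partial(item_features=unique_categories)
```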

A quick sanity check that nothing looks too far off:

In [7]:
num_users, num_items = dataset.interactions_shape()
print('Num users: {}, num_items {}.'.format(num_users, num_items))

Num users: 755438, num_items 46693.


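`build_interactions` takes (user, podcast, rating) triples and returns two sparse matrices in a structure that LightFM can use: `interactions`, which records which user rated which podcast, and `weights`, which holds the rating for each of those pairs (shifted by 3 here, so ratings 1 to 5 become -2 to 2).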

In [8]:
(interactions, weights) = dataset.build_interactions(zip(data['author_id'], data['podcast_id'], data['rating']-3))
print(repr(interactions))

<755438x46693 sparse matrix of type '<class 'numpy.int32'>'
	with 984405 stored elements in COOrdinate format>
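
A rough sketch of the coordinate (COO) layout behind those two matrices, using toy triples and hand-written index mappings. All names here are illustrative, and this is not LightFM's actual code.

```python
ratings = [  # toy (author_id, podcast_id, rating) triples
    ("u1", "p1", 5),
    ("u2", "p1", 1),
    ("u1", "p2", 3),
]
user_index = {"u1": 0, "u2": 1}
item_index = {"p1": 0, "p2": 1}

rows, cols, interaction_values, weight_values = [], [], [], []
for user, item, rating in ratings:
    rows.append(user_index[user])
    cols.append(item_index[item])
    interaction_values.append(1)       # interactions: the pair exists
    weight_values.append(rating - 3)   # weights: rating shifted to -2..2

print(list(zip(rows, cols, interaction_values, weight_values)))
```

Each position in the lists is one stored element, which is why the matrix above reports as many stored elements as there are reviews.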


Build the item features in a form that the library can use when building models. Each podcast is paired with the list of its individual categories.

In [9]:
# Pair every podcast id with the list of its individual categories.
item_features = dataset.build_item_features(
    (podcast_id, categories.split(', '))
    for podcast_id, categories in zip(data['podcast_id'], data['category']))
print(repr(item_features))

<46693x47101 sparse matrix of type '<class 'numpy.float32'>'
	with 117684 stored elements in Compressed Sparse Row format>


Split the interactions into training (80%) and test (20%) sets for later testing, and fit the model to the training set.

In [10]:
from lightfm.cross_validation import random_train_test_split
train, test = random_train_test_split(interactions)
model = LightFM(loss='warp', no_components=4)
model.fit(train, item_features=item_features, num_threads=4)

<lightfm.lightfm.LightFM at 0x7efc60014130>

**Be warned: this will take a lot of CPU time to run.**
Test the accuracy of the model using a recall metric. Basically, recall at k asks: given the top k recommendations from the model, what fraction of the podcasts I rated are among those k recommendations? For example, if I rated 10 podcasts and only 2 of them appear in the 100 recommendations, then the recall is 0.2. In this case we have 755438 users, 46693 items and 984405 ratings. This means that the average user gives only about 1.3 ratings while we have a lot of items, so we don't have a lot of data to work with here.

With this dataset and this test (we've set k = 100), the recall of a random recommender would be `0.0021416486411239373`, that is, k divided by the number of items.
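
The recall definition and the random baseline can be sketched for a single user as follows. This is a simplified illustration, not LightFM's `recall_at_k` implementation, and `recall_for_user` is a made-up name.

```python
def recall_for_user(recommended, relevant, k):
    """Fraction of the relevant items that appear in the top-k recommendations."""
    if not relevant:
        return 0.0
    hits = len(set(recommended[:k]) & set(relevant))
    return hits / len(relevant)


# Toy version of the example in the text:
# 10 rated podcasts, 2 of them among the 100 recommendations.
recommended = list(range(100))                 # item ids 0..99
relevant = [0, 1] + list(range(1000, 1008))    # 2 hits, 8 misses
example_recall = recall_for_user(recommended, relevant, k=100)

# Expected recall when k distinct items are recommended at random
# out of num_items items.
num_items = 46693
random_recall = 100 / num_items

print(example_recall, random_recall)
```

Any test recall clearly above `random_recall` means the model is doing better than chance.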

In [11]:
from lightfm.evaluation import recall_at_k
print("Train recall: %.4f" % recall_at_k(model, train, item_features=item_features, num_threads=4, k=100).mean())
print("Test recall: %.4f" % recall_at_k(model, test, item_features=item_features, num_threads=4, k=100).mean())


Train recall: 0.1937
Test recall: 0.0942


In [12]:
from lightfm.evaluation import precision_at_k
print("Train precision: %.4f" % precision_at_k(model, train, item_features=item_features, num_threads=4, k=100).mean())
print("Test precision: %.4f" % precision_at_k(model, test, item_features=item_features, num_threads=4, k=100).mean())

Train precision: 0.0023
Test precision: 0.0011


# Using Surprise

In [13]:
import pandas as pd
import sqlite3

from surprise import NormalPredictor
from surprise import SVD
from surprise import Dataset
from surprise import Reader
from surprise.model_selection import cross_validate
from surprise.prediction_algorithms.knns import KNNWithZScore

In [14]:
conn = sqlite3.connect("./database.sqlite")
data = pd.read_sql("""SELECT rating, author_id, created_at, categories.podcast_id, podcasts.title, category 
                        FROM podcasts 
                        INNER JOIN categories ON categories.podcast_id = podcasts.podcast_id
                        INNER JOIN reviews ON reviews.podcast_id = podcasts.podcast_id""", 
                   conn)

In [15]:
conn.close()
print(data.info())
data.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1566431 entries, 0 to 1566430
Data columns (total 6 columns):
 #   Column      Non-Null Count    Dtype 
---  ------      --------------    ----- 
 0   rating      1566431 non-null  int64 
 1   author_id   1566431 non-null  object
 2   created_at  1566431 non-null  object
 3   podcast_id  1566431 non-null  object
 4   title       1566431 non-null  object
 5   category    1566431 non-null  object
dtypes: int64(1), object(5)
memory usage: 71.7+ MB
None


Unnamed: 0,rating,author_id,created_at,podcast_id,title,category
0,5,F7E5A318989779D,2018-04-24T12:05:16-07:00,c61aa81c9b929a66f0c1db6cbe5d8548,Backstage at Tilles Center,arts
1,5,F7E5A318989779D,2018-04-24T12:05:16-07:00,c61aa81c9b929a66f0c1db6cbe5d8548,Backstage at Tilles Center,arts-performing-arts
2,5,F7E5A318989779D,2018-04-24T12:05:16-07:00,c61aa81c9b929a66f0c1db6cbe5d8548,Backstage at Tilles Center,music
3,5,F6BF5472689BD12,2018-05-09T18:14:32-07:00,c61aa81c9b929a66f0c1db6cbe5d8548,Backstage at Tilles Center,arts
4,5,F6BF5472689BD12,2018-05-09T18:14:32-07:00,c61aa81c9b929a66f0c1db6cbe5d8548,Backstage at Tilles Center,arts-performing-arts


In [16]:
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(data[['author_id', 'podcast_id', 'rating']], reader)
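
The `head()` output above shows each review once per category, so the frame passed to `load_from_df` carries repeated (author_id, podcast_id, rating) rows, and copies of one review can end up in both the training and the test fold. Below is a minimal sketch, on toy rows with shortened ids, of collapsing them first. On the real frame the pandas equivalent would presumably be `drop_duplicates()` on the three selected columns. The numbers reported in this notebook were produced without this step.

```python
rows = [  # toy (author_id, podcast_id, rating, category), one row per category
    ("F7E5A318989779D", "c61aa81c", 5, "arts"),
    ("F7E5A318989779D", "c61aa81c", 5, "arts-performing-arts"),
    ("F7E5A318989779D", "c61aa81c", 5, "music"),
    ("F6BF5472689BD12", "c61aa81c", 5, "arts"),
]

# Drop the category and keep each (author, podcast, rating) triple once,
# preserving first-seen order.
unique_ratings = list(dict.fromkeys((a, p, r) for a, p, r, _ in rows))
print(len(rows), "->", len(unique_ratings))
```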

In [17]:
algo = SVD()

# Run 5-fold cross-validation and print results
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.6741  0.6790  0.6769  0.6761  0.6786  0.6769  0.0017  
MAE (testset)     0.3262  0.3286  0.3273  0.3274  0.3283  0.3276  0.0009  
Fit time          60.74   62.00   59.99   59.54   63.08   61.07   1.31    
Test time         2.43    2.44    2.43    2.16    2.40    2.37    0.11    


{'test_rmse': array([0.67414536, 0.67896713, 0.67685313, 0.67612192, 0.67856277]),
 'test_mae': array([0.3261711 , 0.32858719, 0.32727697, 0.32742935, 0.32834908]),
 'fit_time': (60.740195989608765,
  61.99877452850342,
  59.9943573474884,
  59.538230419158936,
  63.08291935920715),
 'test_time': (2.432588577270508,
  2.4396536350250244,
  2.4349913597106934,
  2.1642627716064453,
  2.3961892127990723)}
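
For reference, the two measures reported above can be sketched in plain Python. This is a toy illustration with made-up predictions, not Surprise's implementation.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error: penalises large misses more heavily."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


def mae(predicted, actual):
    """Mean absolute error: the average size of a miss."""
    absolute = [abs(p - a) for p, a in zip(predicted, actual)]
    return sum(absolute) / len(absolute)


actual = [5, 5, 1, 4]               # true ratings
predicted = [4.5, 5.0, 2.0, 4.5]    # hypothetical model estimates

toy_rmse = rmse(predicted, actual)
toy_mae = mae(predicted, actual)
print(round(toy_rmse, 4), toy_mae)
```

Both are on the same scale as the ratings, so lower is better and RMSE is never smaller than MAE.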

In [18]:
algo = NormalPredictor()

# Run 5-fold cross-validation and print results
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm NormalPredictor on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.1933  1.1934  1.1899  1.1936  1.1958  1.1932  0.0019  
MAE (testset)     0.7494  0.7509  0.7482  0.7508  0.7529  0.7504  0.0016  
Fit time          1.40    2.03    1.99    1.88    1.92    1.84    0.23    
Test time         2.59    3.10    3.18    2.35    2.18    2.68    0.40    


{'test_rmse': array([1.1932917 , 1.19337409, 1.18986904, 1.19364162, 1.19576625]),
 'test_mae': array([0.7493828 , 0.75091567, 0.74820614, 0.75076317, 0.75290268]),
 'fit_time': (1.3983306884765625,
  2.0343613624572754,
  1.9909543991088867,
  1.8780488967895508,
  1.9157841205596924),
 'test_time': (2.593496322631836,
  3.10198712348938,
  3.184617280960083,
  2.3515849113464355,
  2.183678388595581)}

In [19]:
from surprise.prediction_algorithms.baseline_only import BaselineOnly
algo = BaselineOnly()

# Run 5-fold cross-validation and print results
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Evaluating RMSE, MAE of algorithm BaselineOnly on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8159  0.8202  0.8227  0.8211  0.8203  0.8200  0.0022  
MAE (testset)     0.4364  0.4390  0.4409  0.4399  0.4394  0.4391  0.0015  
Fit time          6.99    7.15    7.15    7.12    7.29    7.14    0.10    
Test time         1.97    2.09    1.93    2.23    1.89    2.02    0.12    


{'test_rmse': array([0.81590572, 0.82015449, 0.82265914, 0.82108715, 0.82031133]),
 'test_mae': array([0.43640922, 0.43898752, 0.44091987, 0.43994509, 0.43943167]),
 'fit_time': (6.987760782241821,
  7.147241830825806,
  7.149007797241211,
  7.115860939025879,
  7.286087512969971),
 'test_time': (1.9670774936676025,
  2.088625907897949,
  1.9327120780944824,
  2.232301712036133,
  1.8892900943756104)}