<a href="https://colab.research.google.com/github/Emaperidol/AI_ML_Healthcare/blob/main/EM_Assignment_12_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task: Use LightFM to Build a Movie Recommender System

## LightFM

\
LightFM is a Python implementation of a number of popular recommendation algorithms for both implicit and explicit feedback.

\
You should use LightFM to complete this assignment.

\
You should read and learn from the following two tutorials regarding how you can bring customized dataset into LightFM and build a recommmender model:

* https://making.lyst.com/lightfm/docs/examples/dataset.html
* https://making.lyst.com/lightfm/docs/quickstart.html

## MovieLens 100k Small Dataset

Data Downloading Source: https://www.kaggle.com/datasets/fuzzywizard/movielens-100k-small-dataset?resource=download

\
This dataset describes 5-star rating and free-text tagging activity from [MovieLens](http://movielens.org), a movie recommendation service. It contains 100836 ratings and 3683 tag applications across 9742 movies. These data were created by 610 users between March 29, 1996 and September 24, 2018. This dataset was generated on September 26, 2018.

\
Users were selected at random for inclusion. All selected users had rated at least 20 movies. No demographic information is included. Each user is represented by an id, and no other information is provided.

\
The data are contained in the files `links.csv`, `movies.csv`, `ratings.csv` and `tags.csv`.

\
For this homework assignment, you will only use two datasets, namely `movies.csv`, `ratings.csv`.


###Q1.
Read the two datasets: "ratings.csv" and "movies.csv" into two DataFrames, and name them "ratings_data" and "movies_data" resepctively.

In [4]:
import pandas as pd
from google.colab import files

uploaded = files.upload()
ratings_data = pd.read_csv('ratings.csv')
movies_data = pd.read_csv('movies.csv')


Saving movies.csv to movies.csv
Saving ratings.csv to ratings.csv


###Q2.

We’ll use LightFM’s built-in Dataset class to build an interaction dataset from raw data. The goal is to demonstrate how to go from raw data (lists of interactions and perhaps item and user features) to scipy.sparse matrices that can be used to fit a LightFM model.

\
First, merge the ratings dataset with movies dataset, and then create unique user ids and movie ids that start from 0.

\
Hint: some sample code:
```
# Merge ratings data with movies data
data = ratings_data.merge(movies_data)

# Create unique user and movie ids from 0
data['user_id'] = data.groupby(['userId']).ngroup()
data['movie_id'] = data.groupby(['movieId']).ngroup()

# Convert ids from int64 to int32 required by model
data['user_id'] = data['user_id'].astype(np.int32)
data['movie_id'] = data['movie_id'].astype(np.int32)
```

In [5]:
import numpy as np

# Merging Data
data = ratings_data.merge(movies_data, on='movieId')

# Unique IDs
data['user_id'] = data.groupby(['userId']).ngroup()
data['movie_id'] = data.groupby(['movieId']).ngroup()

# Converting from int64 to int32
data['user_id'] = data['user_id'].astype(np.int32)
data['movie_id'] = data['movie_id'].astype(np.int32)


###Q3.
Build the ratings dataset in dictionary format.

\
Hint: Some sample code:
```
ratings = data[['userId', 'movieId', 'rating']].to_dict('records')
```

In [6]:
ratings = data[['userId', 'movieId', 'rating']].to_dict('records')


###Q4.
Now let's build the features dataset in dictionary format.

\
Hint: The feature dataset contains three variables, namely `movieId`, `title`, and `genres`. Note that the length of the feasture dataset should be the same as the number of unqiue movies.

In [7]:
features = data[['movieId', 'title', 'genres']].drop_duplicates().to_dict('records')

###Q5.

Since you have the features dataset and the ratings dataset, now you are ready to build a LightFM Dataset. The LightFM Dataset creates a mapping between the user and item ids from our input data to indices that will be used internally by our model.

\
Hint: First create a Dataset and then call its fit method. The first argument is an iterable of all user ids in our data, and the second is an iterable of all item ids. In this case, we use generator expressions to lazily iterate over our data and yield user and item ids. This call will assign an internal numerical id to every user and item id we pass in. These will be contiguous (from 0 to however many users and items we have), and will also determine the dimensions of the resulting LightFM model.

In [10]:
!pip install lightfm
from lightfm.data import Dataset

dataset = Dataset()
dataset.fit((x['userId'] for x in ratings),
            (x['movieId'] for x in ratings))




###Q6.


We can check that the mappings have been created by querying the Dataset on how many users and books it knows about.

\
Hint: Some sample code:
```
num_users, num_items = dataset.interactions_shape()
print('Num users: {}, num_items {}.'.format(num_users, num_items))
```

In [11]:
num_users, num_items = dataset.interactions_shape()
print('Num users: {}, num_items {}.'.format(num_users, num_items))


Num users: 610, num_items 9724.


###Q7.

Having created the mapping, we can build the interaction matrix.

In [12]:
(interactions, weights) = dataset.build_interactions(((x['userId'], x['movieId'])
                                                      for x in ratings))

###Q8.

Note that if we don’t have all user and items ids at once, we can repeatedly call `fit_partial` to supply additional ids. Now, `fit_partial` to add some item feature mappings. In particular, add movie titles to the item feature mappings.

In [13]:
dataset.fit_partial(items=(x['movieId'] for x in features),
                    item_features=(x['title'] for x in features))


###Q9.

Since we have item features, we can create the item features matrix.

In [14]:
item_features = dataset.build_item_features(((x['movieId'], [x['title']])
                                             for x in features))


###Q10.

Randomly split the dataset into train and test set so that the train set contains 80% of the data.

\
Hint: Use the `random_train_test_split` function.

In [15]:
from lightfm.cross_validation import random_train_test_split

train, test = random_train_test_split(interactions, test_percentage=0.2)


###Q11.

We’re going to use the WARP (Weighted Approximate-Rank Pairwise) model. WARP is an implicit feedback model: all interactions in the training matrix are treated as positive signals, and products that users did not interact with they implicitly do not like. The goal of the model is to score these implicit positives highly while assigining low scores to implicit negatives.

\
Model training is accomplished via SGD (stochastic gradient descent). This means that for every pass through the data — an epoch — the model learns to fit the data more and more closely. We’ll run it for 30 epochs in this example.

In [19]:
from lightfm import LightFM

model = LightFM(loss='warp')
model.fit(train, item_features=item_features, epochs=30)



<lightfm.lightfm.LightFM at 0x7af4cf7a9f00>

###Q12.

We should now evaluate the model to see how well it’s doing. We’re most interested in how good the ranking produced by the model is.

\
Precision@k is one suitable metric, expressing the percentage of top k items in the ranking the user has actually interacted with.

\
Measure Precision@k in both the train and the test set.

In [20]:
from lightfm.evaluation import precision_at_k

print("Train precision: %.2f" % precision_at_k(model, train, item_features=item_features, k=5).mean())
print("Test precision: %.2f" % precision_at_k(model, test, item_features=item_features, k=5).mean())



Train precision: 0.46
Test precision: 0.11


###Q13.

Unsurprisingly, the model fits the train set better than the test set.

\
For an alternative way of judging the model, we can sample a couple of users and get their recommendations.

\
To make predictions for given user, we pass the id of that user and the ids of all products we want predictions for into the predict method.

Now, please write a function that can show: (1) five of the movies that a particular user had rated; and (2) the top five movies the recommender model recommend.

For example, your function can show the following:
```
User ID:  12

Five of the movies the user has watched:
                         Toy Story (1995)
                         Tommy Boy (1995)
          Clear and Present Danger (1994)
                          Saturn 3 (1980)
Twelve Monkeys (a.k.a. 12 Monkeys) (1995)


Top five movies the model recommends:
            Crimson Tide (1995)
              Disclosure (1994)
  Star Trek: Generations (1994)
Clear and Present Danger (1994)
               Quiz Show (1994)
```

In [21]:
def user_recommendations(user_id, model, data, item_features):
    # Movies watched by user
    known_positives = data[data['userId'] == user_id]['title'].head(5).tolist()

    # Movies predicted to be liked by user
    scores = model.predict(user_id, np.arange(num_items), item_features=item_features)
    top_items = data['title'][np.argsort(-scores)].head(5)

    return user_id, known_positives, top_items.tolist()

user_id = 13
user_id, known_positives, top_items = user_recommendations(user_id, model, data, item_features)
print(f"User ID:  {user_id}\n")
print(f"Five of the movies the user has watched: \n{known_positives}\n")
print(f"Top five movies the model recommends: \n{top_items}\n")


User ID:  13

Five of the movies the user has watched: 
['Seven (a.k.a. Se7en) (1995)', 'Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981)', 'Beetlejuice (1988)', 'Matrix, The (1999)', 'Gladiator (2000)']

Top five movies the model recommends: 
['Heat (1995)', 'Usual Suspects, The (1995)', 'Usual Suspects, The (1995)', 'Seven (a.k.a. Se7en) (1995)', 'Seven (a.k.a. Se7en) (1995)']

