In [1]:
import pandas as pd
import numpy as np

# Recommender Systems

As the name indicates,
this is a workshop on how a machine can recommend items to users.
In general, recommender systems (RS) focus on
problems involving *users* and *items*, such as consumers and
products (Amazon and other shopping platforms) or
audiences and media content (YouTube, iTunes, and
other media platforms), among many other businesses.

There are two main approaches:
* Collaborative Filtering (CF)
* Content-based (CB)

## Collaborative Filtering

CF methods focus on the relationships between users, as expressed through the ratings they give to items.
To represent this relationship as data,
we can construct a **rating matrix**.

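Let $m$ denote the number of users and $n$ denote the number of items. Then our rating matrix is
$$
M \in \mathbb{R}^{m \times n}
$$
where entry $M_{ij}$ is the rating that user $i$ gave to item $j$, so rows are users and columns are items, as in the pivot table below.

If we want the ratings to be whole numbers (i.e. 1, 2, 3, 4, 5),
then $\mathbb{R}$ becomes $\mathbb{Z}$.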

In [2]:
ml100k = pd.read_csv("../data/ml100k.data", sep="\t", names=["user", "item", "rating", "timestamp"])
ml100k.head()

,user,item,rating,timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596


In [3]:
ml100k[ml100k["user"] == 22]

,user,item,rating,timestamp
2,22,377,1,878887116
607,22,376,3,878887112
680,22,128,5,878887983
705,22,80,4,878887227
1184,22,241,3,878888025
...,...,...,...,...
95843,22,385,4,878887869
96516,22,265,3,878888066
96606,22,233,3,878888066
98485,22,792,4,878886647


In [4]:
subset_ml100k = ml100k.head(10)
subset_M = pd.pivot_table(subset_ml100k, values=["rating"], index=["user"], columns=["item"])
subset_M

,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating
item,51,86,242,265,302,346,377,451,465,474
user,,,,,,,,,,
6,,3.0,,,,,,,,
22,,,,,,,1.0,,,
115,,,,2.0,,,,,,
166,,,,,,1.0,,,,
186,,,,,3.0,,,,,
196,,,3.0,,,,,,,
244,2.0,,,,,,,,,
253,,,,,,,,,5.0,
298,,,,,,,,,,4.0
305,,,,,,,,3.0,,
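The pivot above nests the item ids under a `rating` column level because `values` is passed as a list. As a minimal sketch, assuming we only want the plain user-by-item matrix, the same ten ratings can be pivoted with `values="rating"`, and the share of missing entries gives a feel for how sparse a rating matrix is. The `toy` frame and the `sparsity` name are illustrative and not part of the original notebook.

```python
import pandas as pd

# The ten (user, item, rating) rows that appear in the pivot table above.
toy = pd.DataFrame({
    "user":   [196, 186, 22, 244, 166, 298, 115, 253, 305, 6],
    "item":   [242, 302, 377, 51, 346, 474, 265, 465, 451, 86],
    "rating": [3, 3, 1, 2, 1, 4, 2, 5, 3, 3],
})

# Passing a string (not a list) for `values` keeps the columns flat:
# rows are users, columns are items, and unrated pairs become NaN.
M = toy.pivot_table(values="rating", index="user", columns="item")

# Fraction of user-item pairs that have no rating.
sparsity = M.isna().to_numpy().mean()
print(M.shape, sparsity)
```

With ten users who each rated one distinct item, only 10 of the 100 cells are filled. Real rating matrices tend to look similar: most entries are missing because each user rates only a small share of the items.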


## Content-based

CB methods make recommendations based on the information in the content of the items themselves.

### Cosine Similarity

Recall the dot product of two vectors $a$ and $b$,
$$
a \cdot b = \|a\| \|b\| \cos(\theta)
$$
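where $\theta$ is the angle between $a$ and $b$.

Then, given the dot product and the norms of the vectors,
how can we measure how similar they are? We solve for $\cos(\theta)$, which is 1 when the vectors point in the same direction and 0 when they are orthogonal: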

$$
\cos(\theta) = \frac{a \cdot b}{\|a\| \|b\|}
$$
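As a small sketch of this formula, the function below computes $\cos(\theta)$ with NumPy. The name `cosine_similarity` and the example vectors are illustrative choices, and the function assumes neither vector is all zeros.

```python
import numpy as np

def cosine_similarity(a, b):
    """Return cos(theta) between two non-zero 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Dot product divided by the product of the Euclidean norms.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

same_direction = cosine_similarity([1, 2], [2, 4])  # parallel vectors
orthogonal = cosine_similarity([1, 0], [0, 1])      # perpendicular vectors
print(same_direction, orthogonal)
```

Parallel vectors give a value of 1 (up to floating-point rounding) and perpendicular vectors give 0, regardless of how long the vectors are.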

Then what are $a$ and $b$ in our case?
We want to compare the similarity of two course descriptions, each of which is an English paragraph.
More advanced NLP techniques could be used to tokenize and represent the paragraphs,
but for simplicity, we will just count how many times each word appears in each paragraph.

In [5]:
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
vectorizer.get_feature_names_out()

array(['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third',
       'this'], dtype=object)

In [6]:
X.toarray()

array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 2, 0, 1, 0, 1, 1, 0, 1],
       [1, 0, 0, 1, 1, 0, 1, 1, 1],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]])
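To connect the count vectors back to cosine similarity, the sketch below compares every pair of documents. It copies the count matrix printed above into a NumPy array so that it runs on its own; the names `counts`, `unit_rows`, and `S` are illustrative.

```python
import numpy as np

# Count matrix from the output above: one row per document, one column per word.
counts = np.array([
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
    [0, 2, 0, 1, 0, 1, 1, 0, 1],
    [1, 0, 0, 1, 1, 0, 1, 1, 1],
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
], dtype=float)

# Scale each row to unit length; the dot product of two unit vectors is cos(theta).
unit_rows = counts / np.linalg.norm(counts, axis=1, keepdims=True)

# S[i, j] is the cosine similarity between document i and document j.
S = unit_rows @ unit_rows.T
print(S.round(3))
```

The first and fourth documents contain the same words in a different order, so their count vectors are identical and their similarity is 1. This also shows a limitation of plain word counts: word order is ignored. scikit-learn provides `sklearn.metrics.pairwise.cosine_similarity`, which could be applied to `X` directly instead of the manual normalization used here.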


### Links to look at

* [Build a Recommendation Engine With Collaborative Filtering](https://realpython.com/build-recommendation-engine-collaborative-filtering/)
* [Cosine Similarity](https://towardsdatascience.com/using-cosine-similarity-to-build-a-movie-recommendation-system-ae7f20842599)
* [Count Vectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)
* [Building a Recommender System from Scratch](https://www.jillcates.com/pydata-workshop/html/tutorial.html)