Logs
- [2024/04/28]   
  You do not need to restart this notebook when updating the scratch library

In [1]:
import numpy as np
import matplotlib.pyplot as plt

from collections import Counter, defaultdict
from typing import List, Tuple, Dict
from scratch.natural_language_processing import NaturalLanguageProcessing as nlp

In [2]:
plt.rcParams.update(plt.rcParamsDefault)
plt.rcParams.update({
  'font.size': 16,
  'grid.alpha': 0.25})

In [3]:
%load_ext autoreload
%autoreload 2 

Some real-world examples of recommendation systems:
- Netflix recommends movies you might want to watch
- Amazon recommends products you might want to buy
- X (Twitter) recommends users you might want to follow

In [4]:
users_interests = [
  ["Hadoop", "Big Data", "HBase", "Java", "Spark", "Storm", "Cassandra"],
  ["NoSQL", "MongoDB", "Cassandra", "HBase", "Postgres"],
  ["Python", "scikit-learn", "scipy", "numpy", "statsmodels", "pandas"],
  ["R", "Python", "statistics", "regression", "probability"],
  ["machine learning", "regression", "decision trees", "libsvm"],
  ["Python", "R", "Java", "C++", "Haskell", "programming languages"],
  ["statistics", "probability", "mathematics", "theory"],
  ["machine learning", "scikit-learn", "Mahout", "neural networks"],
  ["neural networks", "deep learning", "Big Data", "artificial intelligence"],
  ["Hadoop", "Java", "MapReduce", "Big Data"],
  ["statistics", "R", "statsmodels"],
  ["C++", "deep learning", "artificial intelligence", "probability"],
  ["pandas", "R", "Python"],
  ["databases", "HBase", "Postgres", "MySQL", "MongoDB"],
  ["libsvm", "regression", "support vector machines"]
]

## Manual Curation

Before the digital age, most recommendations came from the knowledge of   
curators. In a library, for example, a librarian would recommend which books   
to read on a specific topic, or readers would consult the card catalog.

## Recommending What's Popular

In [5]:
popular_interests = Counter(interest for user_interests in users_interests
                            for interest in user_interests)
popular_interests

Counter({'Python': 4,
         'R': 4,
         'Big Data': 3,
         'HBase': 3,
         'Java': 3,
         'statistics': 3,
         'regression': 3,
         'probability': 3,
         'Hadoop': 2,
         'Cassandra': 2,
         'MongoDB': 2,
         'Postgres': 2,
         'scikit-learn': 2,
         'statsmodels': 2,
         'pandas': 2,
         'machine learning': 2,
         'libsvm': 2,
         'C++': 2,
         'neural networks': 2,
         'deep learning': 2,
         'artificial intelligence': 2,
         'Spark': 1,
         'Storm': 1,
         'NoSQL': 1,
         'scipy': 1,
         'numpy': 1,
         'decision trees': 1,
         'Haskell': 1,
         'programming languages': 1,
         'mathematics': 1,
         'theory': 1,
         'Mahout': 1,
         'MapReduce': 1,
         'databases': 1,
         'MySQL': 1,
         'support vector machines': 1})

With `popular_interests`, we can simply suggest to a user the most popular   
interests that he's not already interested in:

In [6]:
def most_popular_new_interests(user_interests: List[str], 
                                max_results: int = 5) -> List[Tuple[str, int]]:
  suggestions = [(interest, frequency)
                  for interest, frequency in popular_interests.most_common()
                    if interest not in user_interests]
  
  return suggestions[:max_results]

In [7]:
# popular recommendations that are new for user 1 (users_interests[1])
print(users_interests[1])
most_popular_new_interests(users_interests[1])

['NoSQL', 'MongoDB', 'Cassandra', 'HBase', 'Postgres']


[('Python', 4), ('R', 4), ('Big Data', 3), ('Java', 3), ('statistics', 3)]

In [8]:
# popular recommendations that are new for user 3 (users_interests[3])
print(users_interests[3])
most_popular_new_interests(users_interests[3])

['R', 'Python', 'statistics', 'regression', 'probability']


[('Big Data', 3), ('HBase', 3), ('Java', 3), ('Hadoop', 2), ('Cassandra', 2)]

## User-Based Collaborative Filtering

One way of taking a user's interests into account is to look for users who are   
somehow _similar_ to her, and then suggest the things that those users are    
interested in.

In [9]:
unique_interests = sorted({interest for user_interests in users_interests 
                            for interest in user_interests})

assert unique_interests[:6] == [
  "Big Data", "C++", "Cassandra", "HBase", "Hadoop", "Haskell",
  # ...
] 

In [10]:
print(len(unique_interests))
print(unique_interests)

36
['Big Data', 'C++', 'Cassandra', 'HBase', 'Hadoop', 'Haskell', 'Java', 'Mahout', 'MapReduce', 'MongoDB', 'MySQL', 'NoSQL', 'Postgres', 'Python', 'R', 'Spark', 'Storm', 'artificial intelligence', 'databases', 'decision trees', 'deep learning', 'libsvm', 'machine learning', 'mathematics', 'neural networks', 'numpy', 'pandas', 'probability', 'programming languages', 'regression', 'scikit-learn', 'scipy', 'statistics', 'statsmodels', 'support vector machines', 'theory']


Next we want to produce an "interest" vector of 0s and 1s for each user.

In [11]:
def make_user_interest_vector(user_interests: List[str]) -> List[int]:
  """Given a list of interests, produce a vector whose ith element is 1 
  if unique_interests[i] is in the list, 0 otherwise"""
  return [1 if interest in user_interests else 0 
          for interest in unique_interests]

Transform each user's `user_interests` into a vector whose length equals the  
length of `unique_interests`.

In [12]:
user_interest_vectors = [make_user_interest_vector(user_interests)
                          for user_interests in users_interests]

# user_interest_vectors[i][j] equals 1 if user i specified interest j, and 0 otherwise
print(user_interest_vectors[0])

[1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


The similarity between two users' interest vectors is calculated using `cosine_similarity`.

In [13]:
user_similarities = [[nlp.cosine_similarity(interest_vector_i, interest_vector_j)
                      for interest_vector_j in user_interest_vectors]
                        for interest_vector_i in user_interest_vectors]

In [14]:
for user_similarity in user_similarities:
  print([f"{col:.3f}" for col in user_similarity])

['1.000', '0.338', '0.000', '0.000', '0.000', '0.154', '0.000', '0.000', '0.189', '0.567', '0.000', '0.000', '0.000', '0.169', '0.000']
['0.338', '1.000', '0.000', '0.000', '0.000', '0.000', '0.000', '0.000', '0.000', '0.000', '0.000', '0.000', '0.000', '0.600', '0.000']
['0.000', '0.000', '1.000', '0.183', '0.000', '0.167', '0.000', '0.204', '0.000', '0.000', '0.236', '0.000', '0.471', '0.000', '0.000']
['0.000', '0.000', '0.183', '1.000', '0.224', '0.365', '0.447', '0.000', '0.000', '0.000', '0.516', '0.224', '0.516', '0.000', '0.258']
['0.000', '0.000', '0.000', '0.224', '1.000', '0.000', '0.000', '0.250', '0.000', '0.000', '0.000', '0.000', '0.000', '0.000', '0.577']
['0.154', '0.000', '0.167', '0.365', '0.000', '1.000', '0.000', '0.000', '0.000', '0.204', '0.236', '0.204', '0.471', '0.000', '0.000']
['0.000', '0.000', '0.000', '0.447', '0.000', '0.000', '1.000', '0.000', '0.000', '0.000', '0.289', '0.250', '0.000', '0.000', '0.000']
['0.000', '0.000', '0.204', '0.000', '0.250', '0

In [15]:
# Users 0 and 9 share interests in Hadoop, Java, and Big Data
assert 0.56 < user_similarities[0][9] < 0.58, "several shared interests"

# Users 0 and 8 share only one interest: Big Data
assert 0.18 < user_similarities[0][8] < 0.20, "only one shared interest"
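
`nlp.cosine_similarity` comes from the scratch library and is not shown in this notebook. As a minimal sketch, assuming it implements the standard definition (dot product divided by the product of the vector lengths), the two asserted values can be reproduced by hand. The name `cosine_similarity_sketch` and the toy vectors below are illustrative only; the vectors mimic users 0 and 9 (7 and 4 interests, 3 shared) and users 0 and 8 (7 and 4 interests, 1 shared).

```python
import math
from typing import List

def cosine_similarity_sketch(v: List[float], w: List[float]) -> float:
  """Hypothetical stand-in for nlp.cosine_similarity: dot(v, w) / (|v| * |w|)."""
  dot = sum(v_i * w_i for v_i, w_i in zip(v, w))
  norm_v = math.sqrt(sum(v_i * v_i for v_i in v))
  norm_w = math.sqrt(sum(w_i * w_i for w_i in w))
  if norm_v == 0 or norm_w == 0:
    return 0.0  # assumption: a zero vector is treated as similar to nothing
  return dot / (norm_v * norm_w)

# Toy binary vectors over 10 made-up interest slots
seven_interests = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
four_sharing_three = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0]
four_sharing_one = [1, 0, 0, 0, 0, 0, 0, 1, 1, 1]

sim_three_shared = cosine_similarity_sketch(seven_interests, four_sharing_three)
sim_one_shared = cosine_similarity_sketch(seven_interests, four_sharing_one)
print(f"{sim_three_shared:.3f}", f"{sim_one_shared:.3f}")
```

For 0/1 vectors the formula reduces to (number of shared interests) / sqrt(count_a * count_b), so the two values are 3 / sqrt(28) and 1 / sqrt(28).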

Using `user_similarities`, we can write a function that finds the most similar  
users to a given user. We'll make sure not to include the user herself, nor any  
users with zero similarity.

In [16]:
def most_similar_users_to(user_id: int) -> List[Tuple[int, float]]:
  pairs = [(other_user_id, similarity)                        # Find other
            for other_user_id, similarity in                  # users with
              enumerate(user_similarities[user_id])           # nonzero
            if user_id != other_user_id and similarity > 0]   # similarity.
  
  return sorted(pairs,                      # Sort them
                key=lambda pair: pair[-1],  # most similar
                reverse=True)               # first.

In [17]:
most_similar_users_to(0)

[(9, 0.5669467095138409),
 (1, 0.3380617018914066),
 (8, 0.1889822365046136),
 (13, 0.1690308509457033),
 (5, 0.1543033499620919)]

To get new recommendations, we attach each similar user's similarity score   
to every one of that user's interests. Then, for each interest, we sum the   
scores contributed by all the similar users.
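
As a small worked example of that weighted sum, the sketch below uses only two of user 0's neighbours (users 9 and 8) with their similarities rounded to three decimals. The variable names are illustrative and not part of the notebook.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (neighbour id, rounded similarity to user 0), taken from the similarity table
similar_users: List[Tuple[int, float]] = [(9, 0.567), (8, 0.189)]

interests_of: Dict[int, List[str]] = {
  9: ["Hadoop", "Java", "MapReduce", "Big Data"],
  8: ["neural networks", "deep learning", "Big Data", "artificial intelligence"],
}

# Every interest of a neighbour receives that neighbour's similarity as weight
scores: Dict[str, float] = defaultdict(float)
for neighbour_id, similarity in similar_users:
  for interest in interests_of[neighbour_id]:
    scores[interest] += similarity

print(round(scores["Big Data"], 3))   # both neighbours contribute
print(round(scores["MapReduce"], 3))  # only user 9 contributes
```

Big Data is listed by both neighbours, so it collects 0.567 + 0.189, while MapReduce only receives user 9's weight.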

<img src="./img-resources/recommender-sys-user-based-suggestion.png" width=800>

In [18]:
def user_based_suggestion(user_id: int, include_current_interests: bool = False):
  # Sum up the similarities
  suggestions: Dict[str, float] = defaultdict(float)
  for other_user_id, similarity in most_similar_users_to(user_id):
    for interest in users_interests[other_user_id]:
      suggestions[interest] += similarity
  
  # Convert them to a sorted list
  suggestions = sorted(suggestions.items(),
                        key=lambda pair: pair[-1],    # weight
                        reverse=True)

  # And (maybe) exclude interests the user already has
  if include_current_interests:
    return suggestions
  else:
    return [(suggestion, weight)
            for suggestion, weight in suggestions
              if suggestion not in users_interests[user_id]]

In [19]:
user_based_suggestion(0)

[('MapReduce', 0.5669467095138409),
 ('MongoDB', 0.50709255283711),
 ('Postgres', 0.50709255283711),
 ('NoSQL', 0.3380617018914066),
 ('neural networks', 0.1889822365046136),
 ('deep learning', 0.1889822365046136),
 ('artificial intelligence', 0.1889822365046136),
 ('databases', 0.1690308509457033),
 ('MySQL', 0.1690308509457033),
 ('Python', 0.1543033499620919),
 ('R', 0.1543033499620919),
 ('C++', 0.1543033499620919),
 ('Haskell', 0.1543033499620919),
 ('programming languages', 0.1543033499620919)]

This approach runs into a problem when the number of items (unique interests)   
is very large (remember the curse of dimensionality).  
In high-dimensional spaces most vectors are very far apart, so even the users   
"most similar" to a given user may not be similar at all.
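
A rough way to see this effect is to simulate users who each pick a few interests at random, then compare the average pairwise cosine similarity for a small and a large catalog. This is only an illustrative sketch: `mean_pairwise_cosine`, the catalog sizes, and the random users are assumptions and not part of the notebook's data.

```python
import math
import random
from itertools import combinations

def mean_pairwise_cosine(num_items: int,
                         num_users: int = 30,
                         interests_per_user: int = 5,
                         seed: int = 0) -> float:
  """Average cosine similarity between random users' 0/1 interest vectors."""
  rng = random.Random(seed)
  # Each user is the set of item ids they are interested in
  users = [set(rng.sample(range(num_items), interests_per_user))
           for _ in range(num_users)]
  # For 0/1 vectors, cosine = |shared| / sqrt(|a| * |b|)
  sims = [len(a & b) / math.sqrt(len(a) * len(b))
          for a, b in combinations(users, 2)]
  return sum(sims) / len(sims)

small_catalog = mean_pairwise_cosine(num_items=36)
large_catalog = mean_pairwise_cosine(num_items=10_000)
print(f"36 items:     {small_catalog:.4f}")
print(f"10,000 items: {large_catalog:.4f}")
```

With the same number of interests per user, the larger catalog leaves users with almost no overlap, so nearly every similarity is zero and there is little signal to rank neighbours by.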

## Item-Based Collaborative Filtering

In [20]:
interest_user_matrix = [[user_interest_vector[j]
                          for user_interest_vector in user_interest_vectors]
                            for j, _ in enumerate(unique_interests)]

interest_user_matrix

[[1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0],
 [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
 [1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
 [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
 [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
 [0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0],
 [0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0],
 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
 [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0,

In the above result, row `j` of `interest_user_matrix` is column `j` of   
`user_interest_vectors`.
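
In other words, the matrix is a transpose. A minimal sketch on a toy 2-user, 3-interest matrix (the data is made up for illustration) shows that the nested comprehension gives the same result as the built-in `zip(*rows)` idiom:

```python
from typing import List

# Rows are users, columns are interests (toy data)
user_rows: List[List[int]] = [
  [1, 0, 1],
  [0, 1, 1],
]

# Same pattern as the notebook: pick column j from every user row
by_comprehension = [[row[j] for row in user_rows]
                    for j in range(len(user_rows[0]))]

# Equivalent transpose using zip
by_zip = [list(column) for column in zip(*user_rows)]

print(by_comprehension)
print(by_zip)
```

Either form yields one row per interest, each holding that interest's 0/1 flag for every user.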

In [21]:
interest_user_matrix[0]

[1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0]

The above result shows that users 0, 8, and 9 have indicated an interest  
in Big Data (Big Data is `unique_interests[0]`).

Now we use cosine similarity again, this time comparing pairs of rows in   
`interest_user_matrix`:

In [22]:
interest_similarities = [[nlp.cosine_similarity(user_vector_i, user_vector_j)
                            for user_vector_j in interest_user_matrix]
                          for user_vector_i in interest_user_matrix]

for interest_similarity in interest_similarities:
  print([f"{interest:.3f}" for interest in interest_similarity])

['1.000', '0.000', '0.408', '0.333', '0.816', '0.000', '0.667', '0.000', '0.577', '0.000', '0.000', '0.000', '0.000', '0.000', '0.000', '0.577', '0.577', '0.408', '0.000', '0.000', '0.408', '0.000', '0.000', '0.000', '0.408', '0.000', '0.000', '0.000', '0.000', '0.000', '0.000', '0.000', '0.000', '0.000', '0.000', '0.000']
['0.000', '1.000', '0.000', '0.000', '0.000', '0.707', '0.408', '0.000', '0.000', '0.000', '0.000', '0.000', '0.000', '0.354', '0.354', '0.000', '0.000', '0.500', '0.000', '0.000', '0.500', '0.000', '0.000', '0.000', '0.000', '0.000', '0.000', '0.408', '0.707', '0.000', '0.000', '0.000', '0.000', '0.000', '0.000', '0.000']
['0.408', '0.000', '1.000', '0.816', '0.500', '0.000', '0.408', '0.000', '0.000', '0.500', '0.000', '0.707', '0.500', '0.000', '0.000', '0.707', '0.707', '0.000', '0.000', '0.000', '0.000', '0.000', '0.000', '0.000', '0.000', '0.000', '0.000', '0.000', '0.000', '0.000', '0.000', '0.000', '0.000', '0.000', '0.000', '0.000']
['0.333', '0.000', '0.816

We can find the interests most similar to a given interest with the following  
function. The call below uses Storm (`unique_interests[16]`) as the example.

In [23]:
def most_similar_interests_to(interest_id: int):
  similarities = interest_similarities[interest_id]
  pairs = [(unique_interests[other_interest_id], similarity)
            for other_interest_id, similarity in enumerate(similarities)
              if interest_id != other_interest_id and similarity > 0]
  return sorted(pairs,
                key=lambda pair: pair[-1],
                reverse=True)

In [24]:
most_similar_interests_to(16)

[('Spark', 1.0),
 ('Cassandra', 0.7071067811865475),
 ('Hadoop', 0.7071067811865475),
 ('Big Data', 0.5773502691896258),
 ('HBase', 0.5773502691896258),
 ('Java', 0.5773502691896258)]

Now we can create recommendations for a user by summing up the similarities of  
the interests similar to his:

In [25]:
def item_based_suggestions(user_id: int,
                            include_current_interests: bool = False):
  # Add up the similar interests
  suggestions = defaultdict(float)
  user_interest_vector = user_interest_vectors[user_id]
  for interest_id, is_interested in enumerate(user_interest_vector):
    if is_interested == 1:
      similar_interests = most_similar_interests_to(interest_id)
      for interest, similarity in similar_interests:
        suggestions[interest] += similarity

  # Sort them by weight
  suggestions = sorted(suggestions.items(),
                        key=lambda pair: pair[-1],
                        reverse=True)
  
  if include_current_interests:
    return suggestions
  else:
    return [(suggestion, weight)
            for suggestion, weight in suggestions
              if suggestion not in users_interests[user_id]]

In [26]:
item_based_suggestions(0)

[('MapReduce', 1.861807319565799),
 ('MongoDB', 1.3164965809277263),
 ('Postgres', 1.3164965809277263),
 ('NoSQL', 1.2844570503761732),
 ('MySQL', 0.5773502691896258),
 ('databases', 0.5773502691896258),
 ('Haskell', 0.5773502691896258),
 ('programming languages', 0.5773502691896258),
 ('artificial intelligence', 0.4082482904638631),
 ('deep learning', 0.4082482904638631),
 ('neural networks', 0.4082482904638631),
 ('C++', 0.4082482904638631),
 ('Python', 0.2886751345948129),
 ('R', 0.2886751345948129)]
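
To see where the top score comes from, the sketch below recomputes the MapReduce weight for user 0 by hand. It relies on the fact that, for 0/1 vectors, cosine similarity equals the number of shared users divided by the square root of the product of the two user counts. The `users_with` mapping is read off `users_interests`; `set_cosine` is an illustrative helper, not part of the scratch library.

```python
import math
from typing import Dict, Set

# Which users listed each interest (from users_interests)
users_with: Dict[str, Set[int]] = {
  "Hadoop": {0, 9},
  "Java": {0, 5, 9},
  "Big Data": {0, 8, 9},
  "MapReduce": {9},
}

def set_cosine(a: Set[int], b: Set[int]) -> float:
  """Cosine similarity of two 0/1 vectors, given as sets of user ids."""
  return len(a & b) / math.sqrt(len(a) * len(b))

# Only these three of user 0's interests share a user with MapReduce
contributions = {interest: set_cosine(users_with[interest], users_with["MapReduce"])
                 for interest in ["Hadoop", "Java", "Big Data"]}
mapreduce_score = sum(contributions.values())

print(contributions)
print(mapreduce_score)
```

User 0's remaining interests (HBase, Spark, Storm, Cassandra) have no user in common with MapReduce, so they contribute nothing, and the three terms add up to the 1.8618 shown in the output.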