## Implicit Model in Machine Learning

An implicit model in machine learning is one in which the relationship between inputs and outputs is not written down as an explicit rule. Instead, the model learns a representation of the data (for example, latent vectors) and makes its predictions from that representation.

One common example of an implicit model is the collaborative filtering technique used in recommendation systems. Below is a simplified formula that illustrates how collaborative filtering might be represented:

$$
\hat{r}_{ui} = \sigma(q_u \cdot p_i + b_u + b_i)
$$

where:
- $\hat{r}_{ui}$ is the predicted rating for user $u$ and item $i$,
- $q_u$ is a vector representing the preferences of user $u$,
- $p_i$ is a vector representing the features of item $i$,
- $b_u$ is a bias term for user $u$,
- $b_i$ is a bias term for item $i$,
- $\sigma$ is the sigmoid function, which maps the dot product plus biases to a value between 0 and 1.

This formula shows how collaborative filtering learns user preferences and item features from interaction data alone: the vectors $q_u$ and $p_i$ and the biases are fitted during training, not specified by hand. Implicit models like this are widely used in recommendation systems, where the interactions between users and items are too complex or ambiguous to describe with explicit rules.
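To make the formula concrete, here is a minimal NumPy sketch that evaluates $\hat{r}_{ui}$ for a single user-item pair. The vectors and biases are hypothetical values chosen by hand, not parameters learned from the dataset used in this notebook, and `predict_score` is an illustrative helper name.

```python
import numpy as np

def sigmoid(x):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def predict_score(q_u, p_i, b_u, b_i):
    """Return sigma(q_u . p_i + b_u + b_i) for one user-item pair."""
    return sigmoid(np.dot(q_u, p_i) + b_u + b_i)

# Hypothetical 3-dimensional embeddings and biases (illustration only)
q_u = np.array([0.5, -1.0, 0.25])
p_i = np.array([1.0, 0.5, 2.0])
b_u, b_i = 0.1, -0.1

score = predict_score(q_u, p_i, b_u, b_i)
print(round(float(score), 4))  # dot product is 0.5 and the biases cancel, so this prints 0.6225
```

In a trained model, `q_u`, `p_i`, `b_u`, and `b_i` would come from the fitted parameters instead.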


In [None]:
pip install lightfm

Collecting lightfm
  Downloading lightfm-1.17.tar.gz (316 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: lightfm
  Building wheel for lightfm (setup.py) ... done
  Created wheel for lightfm: filename=lightfm-1.17-cp310-cp310-linux_x86_64.whl size=808333 sha256=f9c5c46ef251fcabf03478a7512aae4400941a8368c07b8a3f1cb7b01da947aa
  Stored in directory: /root/.cache/pip/wheels/4f/9b/7e/0b256f2168511d8fa4dae4fae0200fdbd729eb424a912ad636
Successfully built lightfm
Installing collected packages: lightfm
Successfully installed lightfm-1.17


In [None]:
import pandas as pd
import numpy as np
from scipy.sparse import coo_matrix
import requests
from collections import deque
from lightfm.data import Dataset

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Unnamed: 0,Username,track_name,artist_name,rank,playcount,country,log_total_track_count,log_total_artist_count,activeness_most inactive,activeness_inactive,activeness_medium,activeness_active,activeness_most active
0,emosoup,Higher,Sleep Token,1,1321,United States,0.740725,0.686149,False,False,True,False,False
1,maiconslavieiro,Higher,Sleep Token,21,151,Brazil,0.766376,0.741361,False,False,False,True,False
2,velenious,Higher,Sleep Token,19,1259,United States,0.808702,0.76686,False,False,False,False,True
3,Antimemetic,Higher,Sleep Token,32,29,United States,0.646481,0.554484,True,False,False,False,False
4,frankcreature,Higher,Sleep Token,43,43,Czech Republic,0.610178,0.553889,True,False,False,False,False


In [None]:
df = pd.read_csv('/content/drive/MyDrive/Googlecolab/user_songs_filtered.csv')
df_top_tracks = df

# Group by 'Username' and aggregate the other columns into lists
aggregated_data = df_top_tracks.groupby('Username').agg({
    'track_name': list,
    'artist_name': list,
    'playcount': list
}).reset_index()


# Rename the columns for the aggregated per-user view
aggregated_data.rename(columns={
    'Username': 'User',
    'track_name': 'Tracks',
    'artist_name': 'Artists',
    'playcount': 'Playcounts'
}, inplace=True)

# The resulting DataFrame should be in the desired format
aggregated_data.head()


Unnamed: 0,User,Tracks,Artists,Playcounts
0,-Dolorosa-,"[I Will Follow You Into The Dark, Destruction ...","[Death Cab for Cutie, Chelsea Wolfe, Have a Ni...","[283, 225, 187, 190, 263, 302, 174, 327, 173, ..."
1,-itssoeasy,"[Drive, Dig, Holding Someone's Hair Back, Anna...","[Incubus, Incubus, Circa Survive, Incubus, Kil...","[62, 42, 23, 45, 23, 17, 44, 20, 23, 25, 55, 1..."
2,0-172,"[Head Hunter, Pachuca Sunrise, Throwin' Shapes...","[Dance Gavin Dance, Minus the Bear, Minus the ...","[154, 185, 166, 144, 195, 165, 193, 185, 180, ..."
3,23linear,"[8000, Crystalline, Neurosomatic Circuit, Proc...","[Extrawelt, Younger Brother, Androcell, Androc...","[68, 97, 73, 77, 324, 262, 142, 120, 118, 112,..."
4,40belowsummer,"[My Own Summer (Shove It), One Step Closer, In...","[Deftones, Linkin Park, Linkin Park, Faith No ...","[166, 130, 155, 134, 152, 169, 131, 150, 142, ..."


In [None]:
records = []
for i, row in aggregated_data.iterrows():
    user = row['User']
    for track, artist, playcount in zip(row['Tracks'], row['Artists'], row['Playcounts']):
        track_artist = f"{track} - {artist}"
        records.append((user, track_artist, playcount))

df_flat = pd.DataFrame(records, columns=['User', 'Track_Artist', 'Playcount'])
print(df_flat)

#(interactions, weights) = dataset.build_interactions([(x['User'], x['Track_Artist'], float(x['Playcount'])) for index, x in df_flat.iterrows()])

               User                                       Track_Artist  \
0        -Dolorosa-  I Will Follow You Into The Dark - Death Cab fo...   
1        -Dolorosa-  Destruction Makes the World Burn Brighter - Ch...   
2        -Dolorosa-                       Bloodhail - Have a Nice Life   
3        -Dolorosa-                      Such Small Hands - La Dispute   
4        -Dolorosa-                          Teardrop - Massive Attack   
...             ...                                                ...   
393115  zzakkkkkkkk                             Anne - John Frusciante   
393116  zzakkkkkkkk            Passover - 2007 Remaster - Joy Division   
393117  zzakkkkkkkk                             Get the Dutch - Duster   
393118  zzakkkkkkkk                             Hope - John Frusciante   
393119  zzakkkkkkkk                         The Real - John Frusciante   

        Playcount  
0             283  
1             225  
2             187  
3             190  
4          

In [None]:
pip install implicit

Collecting implicit
  Downloading implicit-0.7.2-cp310-cp310-manylinux2014_x86_64.whl (8.9 MB)
Installing collected packages: implicit
Successfully installed implicit-0.7.2


In [None]:
from scipy.sparse import coo_matrix
from implicit.als import AlternatingLeastSquares

df_flat['itemID'] = df_flat.groupby('Track_Artist').ngroup() + 1
df_flat['userID'] = df_flat.groupby('User').ngroup() + 1

user_item_matrix = coo_matrix((df_flat['Playcount'].astype(np.float32),
                                (df_flat['userID'], df_flat['itemID'])))

user_item_matrix_csr = user_item_matrix.tocsr()

item_user_matrix = user_item_matrix.T.tocsr()
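Because the IDs above start at 1 and no `shape` is passed, the inferred matrix has an unused row 0 and column 0. As a sketch of a zero-based alternative, the toy example below uses `pd.factorize` to get contiguous codes together with the labels needed to map results back. The `toy` frame is hypothetical; only its column names mirror `df_flat`.

```python
import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix

# Hypothetical toy interactions; the column names mirror df_flat
toy = pd.DataFrame({
    "User": ["ann", "bob", "ann", "cat"],
    "Track_Artist": ["t1 - a", "t1 - a", "t2 - b", "t3 - c"],
    "Playcount": [5, 2, 7, 1],
})

# pd.factorize returns zero-based integer codes plus the array of unique labels
user_codes, user_labels = pd.factorize(toy["User"])
item_codes, item_labels = pd.factorize(toy["Track_Artist"])

# Passing shape explicitly gives exactly one row per user and one column per item
toy_matrix = coo_matrix(
    (toy["Playcount"].to_numpy(dtype=np.float32), (user_codes, item_codes)),
    shape=(len(user_labels), len(item_labels)),
).tocsr()

print(toy_matrix.shape)
print(toy_matrix.toarray())
print(user_labels[0], item_labels[1])  # code -> label lookup
```

With this layout, `user_labels[code]` and `item_labels[code]` recover the original names, so no separate dictionaries are needed.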

In [None]:
# This is only necessary for colab since it only supports python 3.10, but the library we are using only supports <= 3.9.
# Comment this section if you are running it on your local machine

!sudo rm -rf /usr/local/lib/python3.8/dist-packages/OpenSSL
!sudo rm -rf /usr/local/lib/python3.8/dist-packages/pyOpenSSL-22.1.0.dist-info/

!wget https://repo.anaconda.com/miniconda/Miniconda3-py39_23.5.2-0-Linux-x86_64.sh
!chmod +x Miniconda3-py39_23.5.2-0-Linux-x86_64.sh

!bash ./Miniconda3-py39_23.5.2-0-Linux-x86_64.sh -b -f -p /usr/local
import sys
sys.path.append('/usr/local/lib/python3.9/site-packages/')
!pip3 install pyOpenSSL==22.0.0

# Installing the recommenders library.
# Ensure that you have python version <=3.9 when installing this.
!pip install recommenders[examples]

--2024-04-16 08:58:38--  https://repo.anaconda.com/miniconda/Miniconda3-py39_23.5.2-0-Linux-x86_64.sh
Resolving repo.anaconda.com (repo.anaconda.com)... 104.16.191.158, 104.16.32.241, 2606:4700::6810:20f1, ...
Connecting to repo.anaconda.com (repo.anaconda.com)|104.16.191.158|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 93409434 (89M) [application/x-sh]
Saving to: ‘Miniconda3-py39_23.5.2-0-Linux-x86_64.sh’


2024-04-16 08:58:40 (75.5 MB/s) - ‘Miniconda3-py39_23.5.2-0-Linux-x86_64.sh’ saved [93409434/93409434]

PREFIX=/usr/local
Unpacking payload ...

Installing base environment...


Downloading and Extracting Packages


Preparing transaction: - \ | / - \ done
Executing transaction: / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - done
installation finished.
    You currently have a PYTHONPATH environment variable set. This 

In [None]:
!pip install lightfm
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.5.1.tar.gz (317.0 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.1-py2.py3-none-any.whl size=317488496 sha256=fc3985aacbeef7b934027c5bd8b4b30c33502c01bddc64b4b0deb00d3796e3d6
  Stored in directory: /root/.cache/pip/wheels/92/09/11/aa01d01a7f005fda8a66ad71d2be7f8aa341bddafb27eee3c7
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.1

In [None]:
from recommenders.utils.timer import Timer
from recommenders.evaluation.python_evaluation import (
    precision_at_k as p_at_k,  # alias for the recommenders precision metric
    rmse, mae, rsquared, exp_var,
    map_at_k, ndcg_at_k, precision_at_k, recall_at_k, get_top_k_items,
    catalog_coverage, distributional_coverage, novelty, diversity, serendipity,
)
from recommenders.utils.constants import SEED as DEFAULT_SEED
import pyspark
import pyspark.sql.functions as F
from pyspark.sql.window import Window
from pyspark.sql.types import FloatType, IntegerType, LongType, StructType, StructField
from pyspark.ml.feature import Tokenizer, StopWordsRemover
from pyspark.ml.feature import HashingTF, CountVectorizer, VectorAssembler
from pyspark.ml.recommendation import ALS
from recommenders.datasets.spark_splitters import spark_random_split
from recommenders.datasets.python_splitters import python_chrono_split, python_stratified_split
from recommenders.evaluation.spark_evaluation import SparkRankingEvaluation, SparkDiversityEvaluation
from recommenders.utils.spark_utils import start_or_get_spark

In [None]:
df_full = df_flat.copy()
# Select and rename in one step so the result is a new DataFrame, not a view of a slice
df_flat = df_flat[['Playcount', 'userID', 'itemID']].rename(columns={'Playcount': 'rating'})
df_flat.head()

Unnamed: 0,rating,userID,itemID
0,283,1,58390
1,225,1,30858
2,187,1,16060
3,190,1,112448
4,263,1,115772


In [None]:
dataset = Dataset()
dataset.fit(users=df_flat['userID'].unique(),
            items=df_flat['itemID'].unique())

In [None]:
train, test = python_stratified_split(
    df_flat, ratio=0.8, seed=42
)
(test_interactions, test_weights) = dataset.build_interactions([(x['userID'], x['itemID'], float(x['rating'])) for index, x in test.iterrows()])
(train_interactions, train_weights) = dataset.build_interactions([(x['userID'], x['itemID'], float(x['rating'])) for index, x in train.iterrows()])

In [None]:
from implicit.evaluation import train_test_split

# Random 80/20 split of the user-item play-count matrix built above
train_interaction_matrix, test_interaction_matrix = train_test_split(user_item_matrix_csr, train_percentage=0.8)

## AlternatingLeastSquares
The `AlternatingLeastSquares` model from the `implicit` library is a matrix factorization model that works directly on user-item interaction data. Unlike hybrid recommendation models such as LightFM, which can use both interaction data and metadata (e.g., genre, user demographics), it does not natively incorporate user or item features into the factorization.

The Alternating Least Squares (ALS) model is a computational technique used primarily for recommendation systems, particularly within the context of collaborative filtering. ALS alternates between fixing the user factors and solving for the item factors, and then fixing the item factors to solve for the user factors. This approach simplifies the optimization problem into a series of linear equations that can be solved independently.

The ALS method aims to find the best factorization of a given matrix in terms of a lower-dimensional representation of users and items. Essentially, it minimizes the squared differences between the observed ratings and the product of two matrices representing the latent (hidden) factors for users and items.

The objective that ALS minimizes can be written as:

$$
\min_{q_*, p_*} \sum_{(u, i) \in \text{Ratings}} (r_{ui} - q_u^T p_i)^2 + \lambda (\| q_u \|^2 + \| p_i \|^2)
$$

where:
- $r_{ui}$ is the known rating of user $u$ for item $i$,
- $q_u$ is the latent factor vector for user $u$,
- $p_i$ is the latent factor vector for item $i$,
- $\lambda$ is the regularization parameter to prevent overfitting,
- The first term minimizes the error between predicted and actual ratings,
- The second term (regularization) penalizes the complexity of the model.

The ALS model iteratively updates the user and item factors while keeping the other constant, hence the term "alternating". This method is highly scalable and can handle large datasets effectively.
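
As a sketch of a single alternating step, assume the penalty is counted once per user vector and write $I_u$ (notation introduced here) for the set of items rated by user $u$. With every $p_i$ held fixed, the objective restricted to $q_u$ is a ridge regression, and setting its gradient to zero gives the closed-form update

$$
q_u = \left( \sum_{i \in I_u} p_i p_i^T + \lambda I \right)^{-1} \sum_{i \in I_u} r_{ui} \, p_i
$$

The item step is symmetric: swap the roles of $q_u$ and $p_i$ and sum over the users who rated item $i$. If the penalty is instead counted once per rating, $\lambda$ is multiplied by $|I_u|$ in the update.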

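ALS is especially popular for implicit feedback data (e.g., clicks, purchase history, or play counts like the ones used here), where user preferences have to be inferred indirectly. In that setting, the observed counts are typically treated as confidence weights on a binary preference, not as ratings to be reproduced exactly.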
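
The alternating procedure can be sketched in a few lines of NumPy. This is a toy version of the explicit-rating objective shown above, with one penalty term per user and per item vector. It is not the algorithm the `implicit` library runs internally, and the function names, rating matrix, and hyperparameters are hypothetical.

```python
import numpy as np

def als_explicit(R, mask, n_factors=2, reg=0.1, n_iters=20, seed=0):
    """Toy ALS on a dense rating matrix R; mask marks the observed entries."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    Q = rng.normal(scale=0.1, size=(n_users, n_factors))  # user factors q_u
    P = rng.normal(scale=0.1, size=(n_items, n_factors))  # item factors p_i
    reg_eye = reg * np.eye(n_factors)
    for _ in range(n_iters):
        # Fix P and solve a small ridge regression for every user
        for u in range(n_users):
            obs = mask[u]
            if obs.any():
                P_u = P[obs]
                Q[u] = np.linalg.solve(P_u.T @ P_u + reg_eye, P_u.T @ R[u, obs])
        # Fix Q and solve the symmetric problem for every item
        for i in range(n_items):
            obs = mask[:, i]
            if obs.any():
                Q_i = Q[obs]
                P[i] = np.linalg.solve(Q_i.T @ Q_i + reg_eye, Q_i.T @ R[obs, i])
    return Q, P

def objective(R, mask, Q, P, reg):
    """Squared error on observed entries plus the L2 penalty on all factors."""
    err = (R - Q @ P.T)[mask]
    return float(np.sum(err ** 2) + reg * (np.sum(Q ** 2) + np.sum(P ** 2)))

# Hypothetical 3x3 rating matrix; zeros stand for unobserved entries
R = np.array([[5.0, 3.0, 0.0], [4.0, 0.0, 1.0], [0.0, 1.0, 5.0]])
mask = R > 0

Q0, P0 = als_explicit(R, mask, n_iters=0)   # random initialization only
Q, P = als_explicit(R, mask, n_iters=20)    # same initialization, 20 sweeps
loss_before = objective(R, mask, Q0, P0, 0.1)
loss_after = objective(R, mask, Q, P, 0.1)
print(loss_before, loss_after)
```

Each inner solve minimizes the objective exactly with the other side held fixed, which is why the loss after the sweeps is lower than at initialization.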


In [None]:
model_implicit = AlternatingLeastSquares(factors=20, iterations=50, calculate_training_loss=True, regularization=0.05)
model_implicit.fit(train_interactions)

from implicit.evaluation import precision_at_k, AUC_at_k, ndcg_at_k, mean_average_precision_at_k

# Ranking metrics on the held-out interactions. Each result gets its own name so that it
# does not overwrite the imported metric functions or the `p_at_k` alias from recommenders.
precision_implicit = precision_at_k(model_implicit, train_interactions, test_interactions, K=30)
auc_implicit = AUC_at_k(model_implicit, train_interactions, test_interactions, K=30)
map_implicit = mean_average_precision_at_k(model_implicit, train_interactions, test_interactions, K=30)
ndcg_implicit = ndcg_at_k(model_implicit, train_interactions, test_interactions, K=30)

print(f'Precision at k: {precision_implicit}')
print(f'AUC at K: {auc_implicit}')
# The saved output below repeats the precision value on this line because it was
# produced by printing p_at_k here instead of the MAP result.
print(f'MAP at k: {map_implicit}')
print(f'NDCG at K: {ndcg_implicit}')

import pickle

# Persist the trained model (update the path to your desired Google Drive folder)
model_filename = '/content/drive/MyDrive/Googlecolab/model_implicit.pkl'

with open(model_filename, 'wb') as model_file:
    pickle.dump(model_implicit, model_file)



Precision at k: 0.05574324324324324
AUC at K: 0.5266927924999275
MAP at k: 0.05574324324324324
NDCG at K: 0.04382238775352431
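
To clarify what a precision-at-k figure measures, here is a minimal sketch of the textbook definition for one user. The helper name and the item IDs are hypothetical, and library implementations may normalize differently (for example, by capping the denominator at the number of held-out items), so this is not expected to reproduce the numbers above.

```python
def precision_at_k_simple(recommended, relevant, k):
    """Fraction of the first k recommended items that appear in `relevant`."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k

# Hypothetical ranked recommendations and held-out items for one user
recommended = [10, 42, 7, 3, 99]
relevant = {42, 3, 8}

p = precision_at_k_simple(recommended, relevant, k=5)
print(p)  # 2 of the 5 recommended items are relevant, so this prints 0.4
```

Averaging this value over all users with held-out items gives a dataset-level score.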


In [None]:
user_id_map = dict(zip(df_full['User'].astype("category"), df_full['userID']))
item_id_map = dict(zip(df_full['Track_Artist'].astype("category"), df_full['itemID']))

user_code_to_id_map = {v: k for k, v in user_id_map.items()}
item_code_to_id_map = {v: k for k, v in item_id_map.items()}

def recommend_implicit(user_id, model, user_item_matrix_csr, user_id_map, item_code_to_id_map, n_items=10):

    user_code = user_id_map.get(user_id)
    if user_code is None:
        raise ValueError(f"User ID {user_id} not found.")

    recommended, _ = model.recommend(user_code, user_item_matrix_csr[user_code], N=n_items)

    return [item_code_to_id_map.get(item_index, 'Unknown Item') for item_index in recommended]


user_id = df['Username'][0]
recommended_tracks = recommend_implicit(user_id, model_implicit, user_item_matrix_csr, user_id_map, item_code_to_id_map)
print(f"Recommended tracks for user {user_id} using Implicit: {recommended_tracks}")

Recommended tracks for user emosoup using Implicit: ['!? - lungskull', '...baby one more time - The Marías', '25-8 - Gridiron', 'A Ravenous Oblivion - Austere', 'A Rash Decision - Ice Nine Kills', 'Alien - Structures', 'Auto-Mobile - Duster', 'Alien - Mommy Long Legs', '1992 - Blur', 'Alien - Lebanon Hanover']


In [None]:
# Uses the trained ALS model `model_implicit` and the `test` DataFrame
# (with 'userID' and 'itemID' columns) defined in the cells above.

# Extract the user and item factors from the model
user_factors = model_implicit.user_factors
item_factors = model_implicit.item_factors

# Make sure the userID and itemID columns are integers (the IDs start at 1)
test['userID'] = test['userID'].astype(int)
test['itemID'] = test['itemID'].astype(int)

# Dot product between one user's and one item's latent factors.
# This assumes that row k-1 of each factor matrix belongs to ID k.
def calculate_score(user_id, item_id, user_factors, item_factors):
    user_factor = user_factors[user_id-1]
    item_factor = item_factors[item_id-1]
    score = np.dot(user_factor, item_factor)
    return score

# Apply the function row by row to compute the predictions
test['prediction'] = test.apply(lambda x: calculate_score(x['userID'], x['itemID'], user_factors, item_factors), axis=1)

# `test` now has an additional 'prediction' column with the predicted scores
print(test.head())


    rating  userID  itemID  prediction
10     339       1   41908   -0.000017
22     230       1  139076    0.000000
18     194       1  140975    0.000000
20     218       1   83514    0.000000
7      327       1   27319   -0.000504
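
The row-by-row `apply` above can be slow on several hundred thousand rows. As a sketch, the same dot products can be computed for all pairs at once with NumPy fancy indexing. The factor matrices below are small hypothetical stand-ins, and the example keeps the same assumption as `calculate_score`: row `k-1` of each matrix belongs to ID `k`.

```python
import numpy as np

# Hypothetical factor matrices: 3 users and 4 items with 2 latent factors each
toy_user_factors = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]])
toy_item_factors = np.array([[1.0, 1.0], [2.0, 0.0], [0.0, 1.0], [1.0, -1.0]])

# One-based IDs of the (user, item) pairs to score
user_ids = np.array([1, 2, 3, 3])
item_ids = np.array([1, 2, 3, 4])

# Pick the matching rows, then take all row-wise dot products in one call
scores = np.einsum(
    "ij,ij->i",
    toy_user_factors[user_ids - 1],
    toy_item_factors[item_ids - 1],
)
print(scores)  # [ 1.  1.  2. -2.]
```

On the real data, `user_ids` and `item_ids` would be the `userID` and `itemID` columns converted with `.to_numpy()`.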


In [None]:
train['prediction'] = train.apply(lambda x: calculate_score(x['userID'], x['itemID'], user_factors, item_factors), axis=1)
print(train.head())

    rating  userID  itemID  prediction
19     185       1   83828    0.002689
16     203       1   92664    0.000000
15     258       1  143073    0.000323
26     243       1   18136    0.001428
4      263       1  115772   -0.000381


In [None]:
df_flat['prediction'] = df_flat.apply(lambda x: calculate_score(x['userID'], x['itemID'], user_factors, item_factors), axis=1)
print(df_flat.head())

   rating  userID  itemID  prediction
0     283       1   58390    0.000165
1     225       1   30858   -0.007035
2     187       1   16060    0.002023
3     190       1  112448    0.000000
4     263       1  115772   -0.000381


In [None]:
all_predictions = df_flat[['userID', 'itemID', 'prediction']]
all_predictions

Unnamed: 0,userID,itemID,prediction
0,1,58390,0.000165
1,1,30858,-0.007035
2,1,16060,0.002023
3,1,112448,0.000000
4,1,115772,-0.000381
...,...,...,...
393115,9483,8089,0.004814
393116,9483,90074,0.001798
393117,9483,45945,-0.001466
393118,9483,54532,-0.000544


In [None]:
test_pure = test[['userID', 'itemID', 'rating']]
train_pure = train[['userID', 'itemID', 'rating']]
train_prediction = train[['userID', 'itemID', 'prediction']]
test_prediction = test[['userID', 'itemID', 'prediction']]

In [None]:
eval_recall = recall_at_k(test_pure, all_predictions, col_prediction='prediction', k=30)
print(f" Recall: {eval_recall}")
eval_precision = p_at_k(test_pure, all_predictions, col_prediction='prediction', k=30)
print(f" Precision: {eval_precision}")

 Recall: 0.7358563109591265
 Precision: 0.19761678793630708


In [None]:
model_implicit.recommend(1, user_item_matrix_csr[1], N=30)

(array([ 7623,    50,    42,    53, 18390,  7621, 15808,  3819,  3823,
         2293,    55,    52,    51, 21667, 15804, 18394, 21034,  1120,
        72350,  3822,  2096, 20465,  7632, 72356,  6374, 18393, 57209,
        22241, 56378, 57210], dtype=int32),
 array([0.27160427, 0.23607565, 0.1810439 , 0.16270314, 0.15963456,
        0.15868631, 0.14831728, 0.1463519 , 0.13877231, 0.12674192,
        0.12324326, 0.12243912, 0.11909252, 0.11830647, 0.11668863,
        0.10809422, 0.10778655, 0.10283604, 0.10180348, 0.09972437,
        0.09486857, 0.09182658, 0.08720981, 0.08143379, 0.08060978,
        0.07942408, 0.07935429, 0.07904494, 0.07771206, 0.07770023],
       dtype=float32))

In [None]:
# Load the model from the file
with open('/content/drive/MyDrive/Googlecolab/model_implicit.pkl', 'rb') as f:
    model = pickle.load(f)

## Evaluate with Recommenders

In [None]:
# Prepare for diversity-based evaluations
K = 30

# Sort the training predictions by score within each user, highest first
train_prediction_sorted = train_prediction.sort_values(by=['userID', 'prediction'], ascending=[True, False])

# Keep the top K items for each user
top_k_reco = train_prediction_sorted.groupby('userID').head(K)
print(top_k_reco.shape[0])

271902
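
The sort-then-`head` pattern used above is easiest to see on a tiny frame. The data below is hypothetical; only the column names match the notebook.

```python
import pandas as pd

# Hypothetical predictions for two users
toy_pred = pd.DataFrame({
    "userID": [1, 1, 1, 2, 2],
    "itemID": [10, 11, 12, 10, 13],
    "prediction": [0.2, 0.9, 0.5, 0.1, 0.7],
})

k = 2
# Sort by user, then by descending score, and keep the first k rows of each user
toy_top_k = (
    toy_pred.sort_values(by=["userID", "prediction"], ascending=[True, False])
    .groupby("userID")
    .head(k)
)
print(toy_top_k["itemID"].tolist())  # [11, 12, 13, 10]
```

Users with fewer than `k` rows simply contribute all of their rows, which is why the row count printed above can be smaller than `K` times the number of users.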


In [None]:
eval_diversity = diversity(test_pure, top_k_reco, col_user='userID', col_item='itemID')
print(f"Diversity: {eval_diversity}")

eval_novelty = novelty(test_pure, top_k_reco, col_user='userID', col_item='itemID')
print(f"Novelty: {eval_novelty}")

eval_distributional_coverage = distributional_coverage(test_pure, top_k_reco, col_user='userID', col_item='itemID')
print(f"distributional_coverage: {eval_distributional_coverage}")

eval_catalog_coverage = catalog_coverage(test_pure, top_k_reco, col_user='userID', col_item='itemID')
print(f"catalog_coverage: {eval_catalog_coverage}")

eval_serendipity = serendipity(test_pure, top_k_reco, col_user='userID', col_item='itemID')
print(f"serendipity: {eval_serendipity}")

Diversity: 0.9966214340467417
Novelty: 7.625275339624368
distributional_coverage: 15.757265316286
catalog_coverage: 2.4455546147332767
serendipity: 0.9965969218269587
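
As a rough guide to reading the coverage figure, catalog coverage is usually defined as the number of distinct recommended items divided by the number of distinct items in a reference set. The sketch below computes that ratio by hand on hypothetical frames. It assumes this common definition and is not a reimplementation of the `recommenders` function.

```python
import pandas as pd

# Hypothetical reference interactions and recommendations
toy_reference = pd.DataFrame({"userID": [1, 1, 2, 2, 3], "itemID": [10, 11, 10, 12, 13]})
toy_reco = pd.DataFrame({"userID": [1, 2, 3], "itemID": [12, 11, 12]})

# Distinct recommended items over distinct items in the reference frame
coverage = toy_reco["itemID"].nunique() / toy_reference["itemID"].nunique()
print(coverage)  # 2 distinct recommended items out of 4 reference items: 0.5
```

Under this definition, a value above 1 would mean that the recommendation frame contains more distinct items than the frame passed as the reference.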
