# Recommendation Model

The objective of a Recommendation Model or Recommendation System (RecSys) is to recommend relevant items to users.
We will code a RecSys based on the article [Recommender Systems in Python 101](https://www.kaggle.com/code/gspmoreira/recommender-systems-in-python-101) by Gabriel Moreira. We will reuse its variable names and some of its code.

## Introduction

In [None]:
# basic packages
import pandas as pd # dataframe handling
import math
import random

# modeling packages
from sklearn.model_selection import train_test_split

We will use the 
[Deskdrop Dataset](https://www.kaggle.com/datasets/gspmoreira/articles-sharing-reading-from-cit-deskdrop)
to follow the same steps as in the article. 
The Deskdrop Dataset contains a real sample of 12 months of logs (March 2016 - February 2017) from CI&T's Internal Communication platform (DeskDrop). 
It consists of the shared articles dataset and the user interactions dataset.
The shared articles dataset has information about the articles available on the website.
The user interactions dataset records how users interacted with the different articles.
<br>
We will only consider available articles, to avoid recommending unavailable ones.

In [None]:
# datasets

# we read the datasets
articles_df = pd.read_csv('datasets/shared_articles.csv')
interactions_df = pd.read_csv('datasets/users_interactions.csv')

# we make sure the analyzed articles are available for users
articles_df = articles_df[articles_df['eventType'] == 'CONTENT SHARED']

This left us with 3,047 available articles.

## Data cleansing

We will assign a weight to each interaction type, because 
some interactions signal a higher interest from the user than others.
We will use the same weights proposed in the article.

In [None]:
# event weights
event_type_strength = {
   'VIEW': 1.0,
   'LIKE': 2.0, 
   'BOOKMARK': 2.5, 
   'FOLLOW': 3.0,
   'COMMENT CREATED': 4.0,  
}

# add a column of weight to interactions dataframe
interactions_df['eventStrength'] = interactions_df['eventType'].apply(lambda x: event_type_strength[x])
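As a small illustration of the weighting step, the sketch below applies the same weights to a toy interaction log. The toy rows and the name `toy_interactions_df` are hypothetical and are not taken from the Deskdrop data. It uses `Series.map` instead of `apply` with a lambda, which is an alternative and not what the notebook does: `map` leaves `NaN` for an event type missing from the dictionary, whereas the lambda raises a `KeyError`.

```python
import pandas as pd

# Same weights as in the cell above
event_type_strength = {
    'VIEW': 1.0,
    'LIKE': 2.0,
    'BOOKMARK': 2.5,
    'FOLLOW': 3.0,
    'COMMENT CREATED': 4.0,
}

# Hypothetical toy log (illustration only, not Deskdrop rows)
toy_interactions_df = pd.DataFrame({
    'personId': [1, 1, 2],
    'contentId': [10, 10, 20],
    'eventType': ['VIEW', 'LIKE', 'COMMENT CREATED'],
})

# Look up the weight of each event type in the dictionary
toy_interactions_df['eventStrength'] = toy_interactions_df['eventType'].map(event_type_strength)
print(toy_interactions_df)
```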

To avoid the **cold start** problem (recommending to users with no previous interactions), we keep only users who have interacted with at least 5 different articles.

In [None]:
# number of distinct articles each user has interacted with
users_interactions_count_df = interactions_df.groupby(['personId', 'contentId']).size().groupby('personId').size()

# df of users with 5+ interactions
users_with_enough_interactions_df = users_interactions_count_df[users_interactions_count_df >= 5].reset_index()[['personId']]

# we recreate the interactions df with selected users
interactions_from_selected_users_df = interactions_df.merge(users_with_enough_interactions_df, 
                                                            how = 'right',
                                                            on = 'personId')
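The count of distinct articles per user can also be sketched on toy data. The example below is only an illustration under assumptions: the toy rows, the threshold `MIN_DISTINCT_ITEMS = 2` and the use of `nunique` plus `isin` are hypothetical alternatives to the double `groupby` and the merge used above.

```python
import pandas as pd

# Hypothetical toy log: user 'a' touched 2 distinct articles, user 'b' only 1
toy_df = pd.DataFrame({
    'personId':  ['a', 'a', 'a', 'b', 'b'],
    'contentId': [1, 1, 2, 3, 3],
})

# Number of distinct articles per user (repeated interactions count once)
distinct_items_per_user = toy_df.groupby('personId')['contentId'].nunique()

# Hypothetical threshold for the toy data (the notebook uses 5)
MIN_DISTINCT_ITEMS = 2
selected_users = distinct_items_per_user[distinct_items_per_user >= MIN_DISTINCT_ITEMS].index

# Keep only the interactions of the selected users
filtered_df = toy_df[toy_df['personId'].isin(selected_users)]
print(filtered_df)
```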

Filtering the Deskdrop interactions this way left us with 1895 users.
<br>
When a user interacts with the same article several times, we sum the weights of all those interactions and apply a log transform to dampen the effect of outliers.

In [None]:
# we define the log function
def smooth_user_preference(x):
    return math.log(1+x, 2)

# we do the sum and apply the log function
interactions_full_df = interactions_from_selected_users_df \
                       .groupby(['personId', 'contentId'])['eventStrength'].sum() \
                       .apply(smooth_user_preference).reset_index()
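A short worked example may help to see the effect of the log transform. The two users below are hypothetical: one viewed, liked and commented on an article once each, the other viewed the same article 100 times.

```python
import math

def smooth_user_preference(x):
    # same transform as above: log base 2 of (1 + x)
    return math.log(1 + x, 2)

# Hypothetical user 1: VIEW + LIKE + COMMENT CREATED = 1.0 + 2.0 + 4.0
moderate_sum = 1.0 + 2.0 + 4.0
moderate_strength = smooth_user_preference(moderate_sum)  # log2(8), about 3

# Hypothetical user 2: 100 VIEW events on the same article
heavy_sum = 100 * 1.0
heavy_strength = smooth_user_preference(heavy_sum)  # log2(101), between 6 and 7

print(moderate_strength, heavy_strength)
```

The raw sums differ by a factor of about 14, while the smoothed strengths differ by a factor of little more than 2, so a single very active user weighs less on the final scores.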

## Modeling

To begin the modeling, we do an 80-20 train-test split. 
Some users are far more active than others, so we stratify by user: each user's interactions are then split in roughly the same 80-20 proportion.

In [None]:
# train-test split
interactions_train_df, interactions_test_df = train_test_split(interactions_full_df,
                                   stratify=interactions_full_df['personId'], 
                                   test_size=0.20,
                                   random_state=42)
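The effect of `stratify` can be checked on a toy frame. This is a minimal sketch with hypothetical data (two users with 5 interactions each); it only illustrates that every user ends up represented in the test set.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: two users, 5 interactions each
toy_df = pd.DataFrame({
    'personId': ['a'] * 5 + ['b'] * 5,
    'contentId': list(range(10)),
})

# Stratifying by personId keeps the 80-20 proportion inside each user
toy_train_df, toy_test_df = train_test_split(toy_df,
                                             stratify=toy_df['personId'],
                                             test_size=0.20,
                                             random_state=42)

# Number of test rows per user
test_counts = toy_test_df['personId'].value_counts().to_dict()
print(test_counts)
```

Note that stratification needs at least 2 rows per user, which the 5+ interactions filter applied earlier guarantees.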

We will use the **Top-N accuracy** strategy to evaluate the model. 
We will use N=5 and N=10 and evaluate each item the user has interacted with in the test set.
To shorten computation time, each of those items is ranked against 100 randomly selected articles the user has not interacted with.
Here we code the evaluation object.

In [None]:
# random sample size
EVAL_RANDOM_SAMPLE_NON_INTERACTED_ITEMS = 100

#Indexing by personId to speed up the searches during evaluation
interactions_full_indexed_df = interactions_full_df.set_index('personId')
interactions_train_indexed_df = interactions_train_df.set_index('personId')
interactions_test_indexed_df = interactions_test_df.set_index('personId')

# Get the set of items the user has interacted with
def get_items_interacted(person_id, interactions_df):
    interacted_items = interactions_df.loc[person_id]['contentId'] # select only this user's interactions
    # the result is a Series for several rows and a scalar for a single row; wrap the scalar in a list
    return set(interacted_items if isinstance(interacted_items, pd.Series) else [interacted_items])

In [None]:
# we will use evaluator as an object
class ModelEvaluator:

    # method to select items the user hasn't interacted with
    def get_not_interacted_items_sample(self, person_id, sample_size, seed=42):
        interacted_items = get_items_interacted(person_id, interactions_full_indexed_df) # get interacted items
        all_items = set(articles_df['contentId']) # get all items
        non_interacted_items = all_items - interacted_items # get non-interacted items
        random.seed(seed) # set random seed
        non_interacted_items_sample = random.sample(list(non_interacted_items), sample_size) # random.sample needs a sequence, so the set is converted to a list
        return set(non_interacted_items_sample) # return as set

    # check whether the target item is in the top n
    def _verify_hit_top_n(self, item_id, recommended_items, topn):
        try: # look for the target item
            # next() on a generator stops at the first match (lazy evaluation)
            index = next(i for i, c in enumerate(recommended_items) if c == item_id) # position of the item
        except StopIteration: # the item is not in the list
            index = -1
        hit = int(index in range(0, topn)) # 1 if found in the top n, 0 otherwise
        return hit, index

    #evaluate model for each user
    def evaluate_model_for_user(self, model, person_id):
        interacted_values_testset = interactions_test_indexed_df.loc[person_id] # locate the user's interacted articles in the test set
        if isinstance(interacted_values_testset['contentId'], pd.Series): # a Series when the user has several test rows
            person_interacted_items_testset = set(interacted_values_testset['contentId']) # transform into a set
        else: # a single scalar when the user has exactly one test row
            person_interacted_items_testset = set([int(interacted_values_testset['contentId'])])
        interacted_items_count_testset = len(person_interacted_items_testset) 

        # use the model to create a ranked list of recommended items
        person_recs_df = model.recommend_items(person_id, 
                                               items_to_ignore=get_items_interacted(person_id, interactions_train_indexed_df), 
                                               topn=10000000000) # very large topn so that the full ranking is returned

        # test with top5 and top10 benchmarks
        hits_at_5_count = 0 # to count success rate
        hits_at_10_count = 0 # to count success rate
        
        #For each item the user has interacted in test set
        for item_id in person_interacted_items_testset:

            # select items the user hasn't interacted with
            non_interacted_items_sample = self.get_not_interacted_items_sample(person_id, 
                                                                               sample_size=EVAL_RANDOM_SAMPLE_NON_INTERACTED_ITEMS, 
                                                                               seed=item_id%(2**32))

            #Combine the current interacted item with the 100 random items
            items_to_filter_recs = non_interacted_items_sample.union(set([item_id]))

            #filter recommendations list to not interacted items + current item only
            valid_recs_df = person_recs_df[person_recs_df['contentId'].isin(items_to_filter_recs)]                    
            valid_recs = valid_recs_df['contentId'].values
            
            #Verifying if the current interacted item is among the Top-N recommended items
            hit_at_5, index_at_5 = self._verify_hit_top_n(item_id, valid_recs, 5)
            hits_at_5_count += hit_at_5 # +1 if its in the top 5
            hit_at_10, index_at_10 = self._verify_hit_top_n(item_id, valid_recs, 10)
            hits_at_10_count += hit_at_10 # +1 if its in the top 10

        # for each person we calculate recall = # of positives/# of trials
        recall_at_5 = hits_at_5_count / float(interacted_items_count_testset)
        recall_at_10 = hits_at_10_count / float(interacted_items_count_testset)

        # generate a dict as answer
        person_metrics = {'hits@5_count':hits_at_5_count, 
                          'hits@10_count':hits_at_10_count, 
                          'interacted_count': interacted_items_count_testset,
                          'recall@5': recall_at_5,
                          'recall@10': recall_at_10}
        
        return person_metrics

    # evaluate the model for every user in the test set
    def evaluate_model(self, model):
        people_metrics = [] # here we will save all users outcomes

        # for each user in test set
        for idx, person_id in enumerate(list(interactions_test_indexed_df.index.unique().values)):
            person_metrics = self.evaluate_model_for_user(model, person_id)  # evaluate user
            person_metrics['_person_id'] = person_id # add an element to response dict
            people_metrics.append(person_metrics) # append each user results to a general list

        # generate a df from people metrics
        detailed_results_df = pd.DataFrame(people_metrics).sort_values('interacted_count', ascending=False)

        # calculate global recall = sum of global positives / sum of global trials
        global_recall_at_5 = detailed_results_df['hits@5_count'].sum() / float(detailed_results_df['interacted_count'].sum())
        global_recall_at_10 = detailed_results_df['hits@10_count'].sum() / float(detailed_results_df['interacted_count'].sum())

        # generate dict as answer
        global_metrics = {'modelName': model.get_model_name(),
                          'recall@5': global_recall_at_5,
                          'recall@10': global_recall_at_10}    
        return global_metrics, detailed_results_df

# create model evaluator object
model_evaluator = ModelEvaluator()
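The Top-N hit check at the core of the evaluator can be illustrated on a plain Python list. The function name `hit_at_n`, the ranked list and the test items below are hypothetical and only mirror the logic of `_verify_hit_top_n` and of the per-user recall.

```python
def hit_at_n(item_id, ranked_items, topn):
    """Return (hit, index): hit is 1 if item_id is among the first topn items."""
    try:
        index = ranked_items.index(item_id)  # position in the ranking
    except ValueError:  # the item was not recommended at all
        index = -1
    return int(0 <= index < topn), index

# Hypothetical ranking returned by a model, best item first
ranked = [42, 7, 13, 99, 5, 21, 8]

# Hypothetical test items of one user
test_items = [5, 21, 1000]

hits_at_5 = sum(hit_at_n(item, ranked, 5)[0] for item in test_items)
recall_at_5 = hits_at_5 / len(test_items)  # hits divided by number of test items

print(hit_at_n(5, ranked, 5))     # item at position 4: a hit
print(hit_at_n(21, ranked, 5))    # item at position 5: outside the top 5
print(hit_at_n(1000, ranked, 5))  # item not in the ranking
print(recall_at_5)
```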

### Popularity Model

The popularity model recommends the most popular items in the dataset that the user has not yet interacted with.

In [None]:
#Select the most popular articles
item_popularity_df = (interactions_full_df.groupby('contentId')['eventStrength'].sum() # we sum weight of each interaction
                                          .sort_values(ascending=False).reset_index()) # order so top1 = most popular

In [None]:
class PopularityRecommender:

    # fix object property name
    MODEL_NAME = 'Popularity'

    # init method
    def __init__(self, popularity_df, items_df=None):
        self.popularity_df = popularity_df
        self.items_df = items_df

    # get method
    def get_model_name(self):
        return self.MODEL_NAME

    # recommend the most popular items the user hasn't seen
    def recommend_items(self, user_id, items_to_ignore=[], topn=10):
        recommendations_df = (self.popularity_df[~self.popularity_df['contentId'].isin(items_to_ignore)] # not ignored items
                               .sort_values('eventStrength', ascending = False) # ordered by event strength (redundant)
                               .head(topn)) # return only top n elements
        return recommendations_df
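To see what `recommend_items` returns, the filtering and ranking can be reproduced on a toy popularity table. The table and the ignored items below are hypothetical values chosen for illustration.

```python
import pandas as pd

# Hypothetical popularity table, already sorted from most to least popular
toy_popularity_df = pd.DataFrame({
    'contentId': [10, 20, 30, 40],
    'eventStrength': [9.0, 7.5, 3.0, 1.0],
})

# Hypothetical items the user has already interacted with
items_to_ignore = {10, 30}

# Drop the ignored items, then keep the top 2 by summed event strength
toy_recs_df = (toy_popularity_df[~toy_popularity_df['contentId'].isin(items_to_ignore)]
               .sort_values('eventStrength', ascending=False)
               .head(2))
print(toy_recs_df)
```

Every user gets the same ranking, minus the items they have already seen, so this model is mainly useful as a baseline.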

Here we evaluate the popularity model using Top-N accuracy with N=5 and N=10.

In [None]:
# create the model from the popularity ranking; articles_df is stored but not used by recommend_items
popularity_model = PopularityRecommender(item_popularity_df, articles_df)

# evaluate the model on the test set
pop_global_metrics, pop_detailed_results_df = model_evaluator.evaluate_model(popularity_model)

In [None]:
pop_global_metrics