# Content-Based Filtering

Content-based filtering is a type of filtering in which the decision to select a document or not is based solely on the content of that document. Content-based filtering techniques work by characterizing the content of the item (document) to be filtered.

The data comes from the dataset [Kaggle - News Portal User Interactions by Globo.com](https://www.kaggle.com/gspmoreira/news-portal-user-interactions-by-globocom#clicks_sample.csv).


# Loading the libraries

In [2]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import pickle
import os
import glob
import random
from sklearn.metrics.pairwise import cosine_similarity

# Loading the dataset

In [3]:
!wget "https://s3-eu-west-1.amazonaws.com/static.oc-static.com/prod/courses/files/AI+Engineer/Project+9+-+Réalisez+une+application+mobile+de+recommandation+de+contenu/news-portal-user-interactions-by-globocom.zip"
!unzip -q news-portal-user-interactions-by-globocom.zip
!unzip -q clicks.zip

--2021-10-10 09:01:45--  https://s3-eu-west-1.amazonaws.com/static.oc-static.com/prod/courses/files/AI+Engineer/Project+9+-+R%C3%A9alisez+une+application+mobile+de+recommandation+de+contenu/news-portal-user-interactions-by-globocom.zip
Resolving s3-eu-west-1.amazonaws.com (s3-eu-west-1.amazonaws.com)... 52.218.108.59
Connecting to s3-eu-west-1.amazonaws.com (s3-eu-west-1.amazonaws.com)|52.218.108.59|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 376587710 (359M) [application/zip]
Saving to: ‘news-portal-user-interactions-by-globocom.zip’


2021-10-10 09:02:00 (24.8 MB/s) - ‘news-portal-user-interactions-by-globocom.zip’ saved [376587710/376587710]

--2021-10-10 09:02:00--  http://data.zip/
Resolving data.zip (data.zip)... failed: Name or service not known.
wget: unable to resolve host address ‘data.zip’
FINISHED --2021-10-10 09:02:00--
Total wall clock time: 15s
Downloaded: 1 files, 359M in 14s (24.8 MB/s)


In [4]:
articles_metadata = pd.read_csv('./articles_metadata.csv')
articles_metadata['datetime'] = pd.to_datetime(articles_metadata['created_at_ts'] / 1000, unit='s')
print(f"Articles from {articles_metadata['datetime'].min()} to {articles_metadata['datetime'].max()}")
articles_metadata.head()

Articles from 2006-09-27 11:14:35 to 2018-03-13 12:12:30


|  | article_id | category_id | created_at_ts | publisher_id | words_count | datetime |
|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1513144419000 | 0 | 168 | 2017-12-13 05:53:39 |
| 1 | 1 | 1 | 1405341936000 | 0 | 189 | 2014-07-14 12:45:36 |
| 2 | 2 | 1 | 1408667706000 | 0 | 250 | 2014-08-22 00:35:06 |
| 3 | 3 | 1 | 1408468313000 | 0 | 230 | 2014-08-19 17:11:53 |
| 4 | 4 | 1 | 1407071171000 | 0 | 162 | 2014-08-03 13:06:11 |
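
The `created_at_ts` values are milliseconds since the Unix epoch. As a minimal sketch, assuming a toy frame (`toy_metadata` is a hypothetical name) that reuses the first two timestamps from the table, the conversion can also be written with `unit="ms"`, which works on the integers directly instead of dividing by 1000 first:

```python
import pandas as pd

# Toy frame reusing the first two created_at_ts values from the table (milliseconds since epoch).
# toy_metadata is a hypothetical name used only for this illustration.
toy_metadata = pd.DataFrame({"created_at_ts": [1513144419000, 1405341936000]})

# unit="ms" converts the integer column directly, with no intermediate float division.
toy_metadata["datetime"] = pd.to_datetime(toy_metadata["created_at_ts"], unit="ms")
print(toy_metadata)
```

Staying in integer arithmetic is probably also what avoids fractional artifacts such as `.884999990` when the same conversion is applied to click timestamps.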


## Articles

In [5]:
# Open the pickle in a context manager so the file handle is closed after loading
with open("./articles_embeddings.pickle", "rb") as f:
    articles = pickle.load(f)
articles.shape

(364047, 250)

In [6]:
def get_cosine_similarity(a, b):
    """Returns the cosine similarity of 2 vectors
    @params
        a vector
        b vector
    @return
        cosine similarity
    """
    dot_product = np.dot(a, b)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    return dot_product / (norm_a * norm_b)
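
As a quick sanity check of the formula (dot product divided by the product of the two norms), the sketch below applies the same computation to hand-made 2-D vectors. The function name `cosine` and the vectors are invented for illustration. The result depends only on direction, not on length.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity of two 1-D vectors (same formula as get_cosine_similarity)."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hand-made vectors, chosen so the expected values are obvious.
u = np.array([1.0, 0.0])
same_direction = np.array([3.0, 0.0])   # longer, but pointing the same way
orthogonal = np.array([0.0, 2.0])
opposite = np.array([-1.0, 0.0])

print(cosine(u, same_direction))  # 1.0
print(cosine(u, orthogonal))      # 0.0
print(cosine(u, opposite))        # -1.0
```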

## User interactions

In [7]:
all_files = glob.glob("./clicks/*.csv")
data = []

for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    data.append(df)

clicks = pd.concat(data, axis=0, ignore_index=True)

In [8]:
sorted_clicks = clicks.sort_values(by=['click_timestamp'], ascending=True)

# Predictions
Example prediction for user **5**.
We select the last article read during an 8-day period, then compare with the articles read over the following 8 days.

This user's reading history runs from 2017-10-01 to 2017-10-16, with 87 article clicks over that period.

In [9]:
user_id = 5
clicks['datetime'] = pd.to_datetime(clicks['click_timestamp'] / 1000, unit='s')
user_click = clicks[clicks['user_id'] == user_id].sort_values('click_timestamp', ascending=False)
print(f"Articles from {user_click['datetime'].min()} to {user_click['datetime'].max()}")
print(f"This user has read {user_click.shape[0]} articles")


Articles from 2017-10-01 03:01:24.884999990 to 2017-10-16 22:19:20.851999998
This user has read 87 articles


Selection of the articles read during the period from 2017-10-01 to 2017-10-08.

The user read 33 articles during this period.

In [10]:
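# Reference window. The date strings are parsed as midnight, so the mask keeps clicks
# after 2017-10-01 00:00 (exclusive) up to 2017-10-08 00:00 (inclusive).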
ref_start_date = '2017-10-01'
ref_end_date = '2017-10-08'
mask = (user_click['datetime'] > ref_start_date) & (user_click['datetime'] <= ref_end_date)
ref_period = user_click.loc[mask]

print(f"This user had read {ref_period.shape[0]} articles during 8 days")

This user had read 33 articles during 8 days


Selection of the articles read during the period from 2017-10-09 to 2017-10-16.

The user read 42 articles during this period.

In [11]:
# Prediction window for the next 8 days (date strings are parsed as midnight)
pred_start_date = '2017-10-09'
pred_end_date = '2017-10-16'
mask = (user_click['datetime'] > pred_start_date) & (user_click['datetime'] <= pred_end_date)
pred_period = user_click.loc[mask]

print(f"This user has read {pred_period.shape[0]} articles during 8 days")

This user has read 42 articles during 8 days
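
The two masks compare a datetime column against plain date strings, which pandas parses as midnight. The sketch below illustrates the effect on a hypothetical click log (`toy_clicks` and `clicks_between` are invented names, and the timestamps are made up). It uses a half-open interval, which is one possible convention, not necessarily the one intended in this notebook.

```python
import pandas as pd

# Hypothetical click log for a single user (timestamps invented for illustration).
toy_clicks = pd.DataFrame({
    "click_article_id": [10, 11, 12, 13],
    "datetime": pd.to_datetime([
        "2017-10-01 09:00", "2017-10-07 23:30",
        "2017-10-08 10:00", "2017-10-09 08:00",
    ]),
})

def clicks_between(df, start, end):
    """Keep rows with start <= datetime < end (half-open interval)."""
    start, end = pd.Timestamp(start), pd.Timestamp(end)
    return df[(df["datetime"] >= start) & (df["datetime"] < end)]

# pd.Timestamp("2017-10-08") means 2017-10-08 00:00:00, so this stops before 8 October.
up_to_oct_8 = clicks_between(toy_clicks, "2017-10-01", "2017-10-08")
# Moving the upper bound to the next midnight includes the whole of 8 October.
through_oct_8 = clicks_between(toy_clicks, "2017-10-01", "2017-10-09")

print(up_to_oct_8["click_article_id"].tolist())    # [10, 11]
print(through_oct_8["click_article_id"].tolist())  # [10, 11, 12]
```

With a half-open interval, consecutive windows can share a boundary date without dropping or double-counting any click.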


The last article read by the user in the reference period, on **2017-10-07**, is article **202763**; it was published on **2017-10-06**.

In [12]:
# user_click is sorted by click_timestamp in descending order, so row 0 is the most recent click
last_article = ref_period['click_article_id'].iloc[0]
article_date = articles_metadata[articles_metadata['article_id'] == last_article]['datetime'].iloc[0]
last_article_date = ref_period['datetime'].iloc[0]
print(f"On {last_article_date} the user read his last article #{last_article}, the article was published on {article_date}")

On 2017-10-07 14:52:53.525000095 the user read his last article #202763, the article was published on 2017-10-06 22:00:40


We can assume that the user tends to read articles published within the same week. As a first step, the cell below looks up the metadata of the articles the user read during the prediction period (2017-10-09 to 2017-10-16).

In [13]:
articles_read_list = pred_period['click_article_id'].tolist()
pred_articles = articles_metadata[articles_metadata['article_id'].isin(articles_read_list)]
print(f"{pred_articles.shape[0]} distinct articles read during this period")


40 distinct articles read during this period


The previous lookup returns only the 40 distinct articles behind the user's 42 clicks, which is too narrow a candidate set, so it becomes necessary to extend the search to a larger one: we will use all articles published over the two-week interval (2017-10-01 to 2017-10-16).

This represents 11637 articles.

In [14]:
# Get the articles published during the two-week interval
mask = (articles_metadata['datetime'] > ref_start_date) & (articles_metadata['datetime'] <= pred_end_date)
pred_period_articles = articles_metadata.loc[mask]
print(f"During this period {pred_period_articles.shape[0]} articles have been published")

During this period 11637 articles have been published


# Recommendations

In [15]:
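def sortBySim(elem):
    # elem is [user_id, article_id, sim, cbf]: index 1 is the article_id and index 2 the similarity,
    # so this key orders the list by article_id
    return elem[1]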

In [16]:
score = []
list_articles = pred_period_articles['article_id'].tolist()
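for idx in list_articles:
    # Similarity between the last article read and each candidate article
    sim = get_cosine_similarity(np.array(articles[last_article]), np.array(articles[idx]))
    # Bucket the similarity into a coarse score from 0 to 3
    if sim < 0.25:
        cbf = 0
    elif sim < 0.50:
        cbf = 1
    elif sim < 0.75:
        cbf = 2
    else:
        cbf = 3
    score.append([user_id, idx, sim, cbf])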

score.sort(key=sortBySim, reverse=True)
score_df = pd.DataFrame(score, columns=['user_id', 'article_id', 'sim', 'CBF'])
score_df['article_id'][:5]

0    363967
1    363952
2    363947
3    363910
4    363297
Name: article_id, dtype: int64
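
Scoring every candidate against one reference article can also be done in a single vectorised step. The sketch below is not the notebook's code: it uses a small random matrix as a hypothetical stand-in for `articles`, and the names `toy_embeddings`, `reference_id` and `candidate_ids` are invented. It reproduces the same 0 to 3 buckets with `np.digitize` and orders the candidates on the similarity values themselves.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for the (n_articles, 250) embedding matrix: 6 articles, 4 dimensions.
toy_embeddings = rng.normal(size=(6, 4))
reference_id = 2                            # plays the role of last_article
candidate_ids = np.array([0, 1, 3, 4, 5])   # plays the role of list_articles

# L2-normalise the rows so that a dot product equals the cosine similarity.
unit = toy_embeddings / np.linalg.norm(toy_embeddings, axis=1, keepdims=True)
sims = unit[candidate_ids] @ unit[reference_id]

# Same buckets as the if/elif chain: <0.25 -> 0, <0.50 -> 1, <0.75 -> 2, otherwise 3.
cbf = np.digitize(sims, [0.25, 0.50, 0.75])

# Sort the candidates by similarity, highest first.
order = np.argsort(-sims)
ranked_ids = candidate_ids[order]
print(list(zip(ranked_ids.tolist(), np.round(sims[order], 3).tolist(), cbf[order].tolist())))
```

`sklearn.metrics.pairwise.cosine_similarity`, already imported at the top of the notebook, computes the same quantity for whole matrices and would be a reasonable alternative to the manual normalisation.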

## Evaluation

In [18]:
read = 0
not_read = 0

user_read = pred_period['click_article_id'].tolist()
for (idx, pred) in enumerate(score[:100]):
    # idx is the position in the top-100 list (0 to 99); the recommended article id is pred[1]
    if idx in user_read:
        read += 1
    else:
        not_read += 1

print("-----")
print(f"Recommendations (already read): {read}")
print(f"Recommendations (not yet read): {not_read}")

-----
Recommendations (already read): 0
Recommendations (not yet read): 100
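
A hit count of this kind compares the recommended article ids with the ids the user actually read. As a minimal sketch with invented ids (`hits_at_k`, `ranked` and `actually_read` are hypothetical names, not part of the notebook), the check can be written as:

```python
def hits_at_k(ranked_article_ids, read_article_ids, k):
    """Count how many of the first k recommended ids appear in the read set."""
    read = set(read_article_ids)          # set membership is O(1) per lookup
    top_k = ranked_article_ids[:k]
    hits = sum(1 for article_id in top_k if article_id in read)
    return hits, len(top_k) - hits

# Hypothetical ids, invented for illustration.
ranked = [501, 502, 503, 504, 505]        # recommendations, best first
actually_read = [503, 505, 999]           # articles clicked in the prediction window

hits, misses = hits_at_k(ranked, actually_read, k=4)
print(hits, misses)   # 1 3
print(hits / 4)       # 0.25, i.e. precision at k=4
```

Dividing the hits by `k` gives precision at k, which makes results comparable across users and list lengths.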


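As evaluated here, content-based filtering does not seem to give good results for this user. The figure should be read with caution, however: the score list is sorted on `elem[1]` (the article id) and the loop tests the enumeration index rather than the recommended article id, so the 0 out of 100 result does not yet measure the similarity ranking itself. An A/B test would certainly help to better verify the relevance of this algorithm.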