# Memory-based Collaborative filtering with GA4 Dallas Free Press dataset

* Dataset: GA4 DFP
* CF-type: Memory
* Implicit/Explicit: Implicit
* User-User/Item-Item: Item-item


# Step 1: Preprocess user-article interaction matrix

Our goal here is to turn raw GA4 event data into a shape compatible with memory-based collaborative filtering (CF). 

There are a few pieces of implicit feedback that I believe we _could_ incorporate into the CF:
* Whether the user visited the article page
* Time spent on a given article page
* Whether the user scrolled n-amount of the article page
* Whether the user clicked onto another article page after reading this article

As we note [here](https://docs.google.com/document/d/1mO94wAjdVFDRKSQo7JYVkd_NsiBDkHxhfyj7u_uZY9Q/edit?tab=t.0#heading=h.yevokmfzxk7), you _could_ combine these 4 signals into a single collaborative filter. However, memory-based collaborative filters are generally not extended in this way.

Because we don't know which user interaction will produce the best recommendations, let's demonstrate how to create a few different user-article interaction matrices.

Recommendations will need to optimize for _something_ and ultimately, we'll need a matrix that supports it. For example, if you want to optimize for recirculation (i.e. reads article, clicks next article), then a matrix that tracks "2nd clicks" should support that goal. 

In [1]:
import glob
import os

# Define the path to the folder containing the GA4 event files
folder_path = "../data/ga4-dfp/raw/"

# Use glob to get all .json files in the folder
file_paths = glob.glob(os.path.join(folder_path, "*.json"))
print(file_paths[:5])

['../data/ga4-dfp/raw/20250310.json', '../data/ga4-dfp/raw/20250306.json', '../data/ga4-dfp/raw/20250214.json', '../data/ga4-dfp/raw/20250222.json', '../data/ga4-dfp/raw/20250218.json']


### Matrix: page view

In this section, we're going to preprocess GA4 data to create a matrix of users and whether a user "visited" an article page.

As will often be the case when working with raw GA4 events, we need to make a call on how an article "visit" is represented in the data. For now, we will use the `page_view` event described [here](https://support.google.com/analytics/answer/9234069?sjid=14081786719704974415-NA) as our page view indicator. 

Within every `page_view` event, there is a `page_location` parameter that contains the page URL; that will be the "article" the interaction is assigned to.
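
The code cell below applies a small URL cleanup to each `page_location`. As a minimal sketch of what that expression does, the snippet wraps it in a hypothetical helper (`strip_query_and_fragment` is not part of the notebook) and runs it on a made-up URL with a tracking parameter and a fragment.

```python
from urllib.parse import urljoin, urlparse


def strip_query_and_fragment(url):
    # Hypothetical helper wrapping the same expression the notebook uses inline
    return urljoin(url, urlparse(url).path)


# Made-up URL for illustration only
raw_url = "https://dallasfreepress.com/about-us/?utm_source=newsletter#team"
cleaned_url = strip_query_and_fragment(raw_url)
print(cleaned_url)  # https://dallasfreepress.com/about-us/
```

One caveat worth knowing: when a URL has no path at all (for example `https://dallasfreepress.com?x=1`), `urlparse(url).path` is empty and `urljoin` returns the base URL unchanged, so the query string survives the cleanup.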


In [2]:
import json
import pandas as pd

from urllib.parse import urljoin, urlparse


page_view_records = []
# Iterate through all GA4 data files
for file_path in file_paths:
    with open(file_path) as f:
        data = json.load(f)

    # for each file, iterate through the events
    for event in data:
        user_id = event.get("user_pseudo_id")
        event_name = event.get("event_name")
        if event_name == "page_view":
            # Reset per event so a page_view without a page_location param
            # doesn't reuse the URL from a previous event
            page_location = None
            # Extract page_location from page_view event_params
            event_params = event.get("event_params", [])
            for param in event_params:
                if param["key"] == "page_location":
                    page_location = param["value"].get("string_value")
                    # Perform a minor amount of url cleanup (drop query string and fragment)
                    page_location = urljoin(page_location, urlparse(page_location).path)
                    break
            if user_id and page_location:
                page_view_records.append(
                    {"user_pseudo_id": user_id, "page_location": page_location}
                )

# Create DataFrame and pivot to get visited pages per user
df = pd.DataFrame(page_view_records)
user_page_visits_matrix = (
    df.drop_duplicates()
    .assign(visited=1)
    .pivot_table(
        index="user_pseudo_id", columns="page_location", values="visited", fill_value=0
    )
)

user_page_visits_matrix.head()

page_location,http://b46.cleverjumper.com/123.php,https://dallasfreepdev.wpengine.com/,https://dallasfreepress-com.translate.goog/es/south-dallas/los-estudiantes-regresan-a-la-escuela-virtual-pero-las-familias-de-dallas-no-tienen-acceso-al-internet/,https://dallasfreepress.com/,https://dallasfreepress.com/about-us/,https://dallasfreepress.com/author/Fatima-Syed/,https://dallasfreepress.com/author/Jeffrey-Ruiz/,https://dallasfreepress.com/author/Michaela-Rush/,https://dallasfreepress.com/author/Sona-Chaudhary/,https://dallasfreepress.com/author/Sujata-Dand/,...,https://dallasfreepress.com/west-dallas/west-dallas-residents-dont-trust-city-plans-for-their-land/,https://dallasfreepress.com/west-dallas/west-dallas-residents-resist-air-permit-renewal-for-gaf-shingle-plant/,https://dallasfreepress.com/west-dallas/west-dallas-residents-step-toward-health-community-and-connection/,https://dallasfreepress.com/west-dallas/west-dallas-residents-struggle-to-qualify-for-citys-home-repair-program/,https://dallasfreepress.com/west-dallas/west-dallas-stem-school-has-a-new-principal/,https://dallasfreepress.com/west-dallas/west-dallas-strong-west-dallas-1-website-breaks-new-ground-and-hopes-to-inspire-new-leadership/,https://dallasfreepress.com/west-dallas/who-gets-to-name-the-west-dallas-stem-school/,https://dallasfreepress.com/west-dallas/with-325-million-invested-into-a-trinity-river-park-west-dallas-residents-want-investment-in-people/,https://dallasfreepress.com/whats-a-news-desert/,https://dallasfreepress.com/who-owns-south-dallas/
user_pseudo_id
1000102388.1739116,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1000496648.1739036,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1001135239.1739064,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1002132338.1739897,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1002238659.1738966,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
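
To make the pivot step concrete, here is a minimal sketch on toy data (the users `u1`/`u2` and pages `/a`/`/b` are made up). It shows how repeated views collapse to a single 1 and how pages a user never visited are filled with 0.

```python
import pandas as pd

# Toy page_view records (made-up users and pages); u1 viewed /a twice
toy_records = pd.DataFrame(
    {
        "user_pseudo_id": ["u1", "u1", "u1", "u2"],
        "page_location": ["/a", "/a", "/b", "/b"],
    }
)

toy_matrix = (
    toy_records.drop_duplicates()  # repeated views of the same page count once
    .assign(visited=1)
    .pivot_table(
        index="user_pseudo_id", columns="page_location", values="visited", fill_value=0
    )
)
print(toy_matrix)
```

Here `u1` gets a 1 for both pages, while `u2` gets a 0 for `/a` and a 1 for `/b`.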


### Matrix: Dwell time

In this section, we're going to preprocess GA4 data to create a matrix of users and how long they "dwelled" on a given page.

Again, this variable isn't directly provided within the raw GA4 event data, so we need to make a call on how to calculate it. 

Thankfully, Google Analytics provides some helpful [documentation](https://support.google.com/analytics/answer/11109416?hl=en) on how to use the critical `engagement_time_msec` event parameter. 

For our purposes, `engagement_time_msec` ticks up during a session as long as the user is actively focused on the webpage (see documentation on what "focused" means). Any time a user "navigates to next page" or "leaves the website", a `user_engagement` event is fired with an `engagement_time_msec` parameter. We can use this `engagement_time_msec` to calculate our dwell time. 
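
As a rough sketch of that calculation, the snippet below sums `engagement_time_msec` per user and page over a few toy events. The event shape mirrors the export structure used in this notebook, but the users, URLs, and values are made up, and storing `int_value` as a string is an assumption about the export format.

```python
from collections import defaultdict

# Toy events (made-up values); int_value as a string is an assumption
toy_events = [
    {
        "event_name": "user_engagement",
        "user_pseudo_id": "u1",
        "event_params": [
            {"key": "page_location", "value": {"string_value": "https://example.com/a/"}},
            {"key": "engagement_time_msec", "value": {"int_value": "4000"}},
        ],
    },
    {
        "event_name": "user_engagement",
        "user_pseudo_id": "u1",
        "event_params": [
            {"key": "page_location", "value": {"string_value": "https://example.com/a/"}},
            {"key": "engagement_time_msec", "value": {"int_value": "2500"}},
        ],
    },
    {
        "event_name": "page_view",  # ignored: not a user_engagement event
        "user_pseudo_id": "u1",
        "event_params": [
            {"key": "page_location", "value": {"string_value": "https://example.com/b/"}},
        ],
    },
]

dwell_msec = defaultdict(int)
for event in toy_events:
    if event.get("event_name") != "user_engagement":
        continue
    # Flatten the list of {"key", "value"} params into a dict for easy lookup
    params = {p["key"]: p["value"] for p in event.get("event_params", [])}
    page = params.get("page_location", {}).get("string_value")
    msec = params.get("engagement_time_msec", {}).get("int_value")
    if page and msec is not None:
        dwell_msec[(event["user_pseudo_id"], page)] += int(msec)

print(dict(dwell_msec))
```

The two `user_engagement` events for the same user and page add up to a single dwell time of 6500 ms.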


In [None]:
page_time_records = []

# Iterate through all GA4 data files
for file_path in file_paths:
    with open(file_path) as f:
        data = json.load(f)

    for event in data:
        user_id = event.get("user_pseudo_id")
        event_name = event.get("event_name")
        if event_name == "user_engagement":
            # Reset per event so missing params don't reuse values from a previous event
            page_location = None
            engagement_time_msec = None
            # Extract engagement_time_msec and page_location from user_engagement event_params
            event_params = event.get("event_params", [])
            for param in event_params:
                if param["key"] == "engagement_time_msec":
                    engagement_time_msec = param["value"].get("int_value")
                if param["key"] == "page_location":
                    page_location = param["value"].get("string_value")
                    # Perform a minor amount of url cleanup
                    page_location = urljoin(page_location, urlparse(page_location).path)
            if user_id and page_location and engagement_time_msec:
                page_time_records.append(
                    {
                        "user_pseudo_id": user_id,
                        "page_location": page_location,
                        "engagement_time_msec": engagement_time_msec,
                    }
                )

# Create DataFrame of engagement records per user and page
df = pd.DataFrame(page_time_records)
df["engagement_time_msec"] = df["engagement_time_msec"].astype(int)
# Group by user and page_location to sum engagement time
aggregated_df = df.groupby(["user_pseudo_id", "page_location"], as_index=False)[
    "engagement_time_msec"
].sum()
# Pivot to get engagement time per page per user
user_page_engagement_time_matrix = aggregated_df.pivot_table(
    index="user_pseudo_id",
    columns="page_location",
    values="engagement_time_msec",
    fill_value=0,  # fills in 0 for page visits that didn't happen
)
user_page_engagement_time_matrix.head()

page_location,https://dallasfreepress-com.translate.goog/es/south-dallas/los-estudiantes-regresan-a-la-escuela-virtual-pero-las-familias-de-dallas-no-tienen-acceso-al-internet/,https://dallasfreepress.com/,https://dallasfreepress.com/about-us/,https://dallasfreepress.com/author/Jeffrey-Ruiz/,https://dallasfreepress.com/author/Michaela-Rush/,https://dallasfreepress.com/author/Sona-Chaudhary/,https://dallasfreepress.com/author/Sujata-Dand/,https://dallasfreepress.com/author/admin/,https://dallasfreepress.com/author/amber-sims/,https://dallasfreepress.com/author/bekah-s-mcneel/,...,https://dallasfreepress.com/west-dallas/west-dallas-remembers-isabel-chavela-lozada-tavera/,https://dallasfreepress.com/west-dallas/west-dallas-residents-dont-trust-city-plans-for-their-land/,https://dallasfreepress.com/west-dallas/west-dallas-residents-resist-air-permit-renewal-for-gaf-shingle-plant/,https://dallasfreepress.com/west-dallas/west-dallas-residents-step-toward-health-community-and-connection/,https://dallasfreepress.com/west-dallas/west-dallas-stem-school-has-a-new-principal/,https://dallasfreepress.com/west-dallas/west-dallas-strong-west-dallas-1-website-breaks-new-ground-and-hopes-to-inspire-new-leadership/,https://dallasfreepress.com/west-dallas/who-gets-to-name-the-west-dallas-stem-school/,https://dallasfreepress.com/west-dallas/with-325-million-invested-into-a-trinity-river-park-west-dallas-residents-want-investment-in-people/,https://dallasfreepress.com/whats-a-news-desert/,https://dallasfreepress.com/who-owns-south-dallas/
user_pseudo_id
1000496648.1739036,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1002132338.1739897,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1002238659.1738966,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1002314562.1740338,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1002547819.1739042,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Matrix: Scroll Depth

In this section, we're going to preprocess GA4 data to create a matrix of users and how far they scrolled down a given page.

This data is not provided by GA4 by default. However, a previous developer set up a "(LNL) GA4 Scroll Event" on the Dallas Free Press' [GTM account](https://tagmanager.google.com/?authuser=2#/container/accounts/6002468403/containers/34578105/workspaces/15/tags) to record it.

From here, calculating the max scroll depth of a user on an article is fairly straightforward.
1. Scan the dataset for `scroll_custom` events
2. Record the user, page url and scroll depth of each scroll event
3. Create the user-article matrix by aggregating by user+article and taking the maximum scroll depth  
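
The aggregation in step 3 can be sketched on toy data as follows (users, pages, and scroll percentages are made up). Several scroll events for the same user and page collapse to the deepest one.

```python
import pandas as pd

# Toy scroll records (made-up values): u1 fired three scroll events on /a
toy_scrolls = pd.DataFrame(
    {
        "user_pseudo_id": ["u1", "u1", "u1", "u2"],
        "page_location": ["/a", "/a", "/a", "/b"],
        "percent_scrolled": [25, 75, 50, 90],
    }
)

toy_scroll_matrix = toy_scrolls.pivot_table(
    index="user_pseudo_id",
    columns="page_location",
    values="percent_scrolled",
    aggfunc="max",  # keep the deepest scroll per user and page
    fill_value=0,  # 0 for pages the user never scrolled
)
print(toy_scroll_matrix)
```

For `u1` on `/a`, the matrix keeps 75, the maximum of the three recorded depths.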

In [None]:
scroll_depth_records = []

for file_path in file_paths:
    with open(file_path) as f:
        data = json.load(f)

    for event in data:
        if event.get("event_name") == "scroll_custom":
            user_id = event.get("user_pseudo_id")
            # Cast event_params to a dictionary for easier access to page location and scroll depth
            params = {p["key"]: p["value"] for p in event.get("event_params", [])}

            # Extract page_location and perform a little url cleanup
            page_location = params.get("page_location", {}).get("string_value")
            page_location = urljoin(page_location, urlparse(page_location).path)
            # Extract percent_scrolled
            percent_scrolled = params.get("percent_scrolled", {}).get("int_value")

            if user_id and page_location and percent_scrolled is not None:
                scroll_depth_records.append(
                    {
                        "user_pseudo_id": user_id,
                        "page_location": page_location,
                        "percent_scrolled": int(percent_scrolled),
                    }
                )


# Convert to DataFrame and find max scroll per user per page
user_page_scroll_matrix = pd.DataFrame(scroll_depth_records)
user_page_scroll_matrix = user_page_scroll_matrix.pivot_table(
    index="user_pseudo_id",
    columns="page_location",
    values="percent_scrolled",
    aggfunc="max",
    fill_value=0,
)
user_page_scroll_matrix.head()

page_location,https://dallasfreepdev.wpengine.com/,https://dallasfreepress-com.translate.goog/es/south-dallas/los-estudiantes-regresan-a-la-escuela-virtual-pero-las-familias-de-dallas-no-tienen-acceso-al-internet/,https://dallasfreepress.com/,https://dallasfreepress.com/about-us/,https://dallasfreepress.com/author/Fatima-Syed/,https://dallasfreepress.com/author/Jeffrey-Ruiz/,https://dallasfreepress.com/author/Michaela-Rush/,https://dallasfreepress.com/author/Sona-Chaudhary/,https://dallasfreepress.com/author/Sujata-Dand/,https://dallasfreepress.com/author/admin/,...,https://dallasfreepress.com/west-dallas/west-dallas-residents-dont-trust-city-plans-for-their-land/,https://dallasfreepress.com/west-dallas/west-dallas-residents-resist-air-permit-renewal-for-gaf-shingle-plant/,https://dallasfreepress.com/west-dallas/west-dallas-residents-step-toward-health-community-and-connection/,https://dallasfreepress.com/west-dallas/west-dallas-residents-struggle-to-qualify-for-citys-home-repair-program/,https://dallasfreepress.com/west-dallas/west-dallas-stem-school-has-a-new-principal/,https://dallasfreepress.com/west-dallas/west-dallas-strong-west-dallas-1-website-breaks-new-ground-and-hopes-to-inspire-new-leadership/,https://dallasfreepress.com/west-dallas/who-gets-to-name-the-west-dallas-stem-school/,https://dallasfreepress.com/west-dallas/with-325-million-invested-into-a-trinity-river-park-west-dallas-residents-want-investment-in-people/,https://dallasfreepress.com/whats-a-news-desert/,https://dallasfreepress.com/who-owns-south-dallas/
user_pseudo_id
1000102388.1739116,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1000496648.1739036,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1001135239.1739064,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1002132338.1739897,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1002238659.1738966,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Matrix: 2nd page

In this section, we're going to preprocess GA4 data to create a matrix of users and whether or not a page resulted in a user continuing on to a subsequent page.

The `page_view` event is our friend here. We'll assume if a `page_view` event is fired in the same user session as a previous `page_view` event, then that first `page_view` can be considered responsible for the second `page_view`. 

Currently, it's not clear whether BigQuery event exports are sorted by session, time, user, etc. As a result, we'll take an order-agnostic approach: we sort the page views ourselves before deciding which ones led to another.
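
To illustrate the assumption on toy data, the sketch below takes deliberately shuffled page views, sorts them by user, session, and timestamp, and flags every view except the last one in each session. It uses `cumcount(ascending=False)` as one possible way to express this; the users, sessions, and timestamps are made up.

```python
import pandas as pd

# Toy page views in arbitrary order (made-up values)
toy_views = pd.DataFrame(
    {
        "user_pseudo_id": ["u1", "u2", "u1", "u1"],
        "ga_session_id": [1, 5, 1, 1],
        "event_timestamp": [30, 5, 10, 20],
        "page_location": ["/c", "/a", "/a", "/b"],
    }
)

# Sort so each session's views are in chronological order
toy_views = toy_views.sort_values(
    by=["user_pseudo_id", "ga_session_id", "event_timestamp"]
).reset_index(drop=True)

# Number of later views remaining in the same session (0 for the last view)
remaining = toy_views.groupby(["user_pseudo_id", "ga_session_id"]).cumcount(
    ascending=False
)
toy_views["leads_to_another"] = remaining > 0
print(toy_views)
```

In `u1`'s session, `/a` and `/b` are flagged because a later view follows them, while `/c` (the last view) and `u2`'s single-view session are not.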


In [None]:
page_views = []

# Step 1: Create a list of all the page_view events within our dataset
for file_path in file_paths:
    with open(file_path) as f:
        data = json.load(f)

    for event in data:
        if event.get("event_name") == "page_view":
            user_id = event.get("user_pseudo_id")
            timestamp = int(event.get("event_timestamp", 0))
            params = {p["key"]: p["value"] for p in event.get("event_params", [])}

            page_location = params.get("page_location", {}).get("string_value")
            page_location = urljoin(page_location, urlparse(page_location).path)

            session_id = params.get("ga_session_id", {}).get("int_value")

            if user_id and page_location and session_id:
                page_views.append(
                    {
                        "user_pseudo_id": user_id,
                        "ga_session_id": session_id,
                        "event_timestamp": timestamp,
                        "page_location": page_location,
                    }
                )

# Step 2: Create DataFrame and sort by user, session, timestamp
page_view_df = pd.DataFrame(page_views)
page_view_df.sort_values(
    by=["user_pseudo_id", "ga_session_id", "event_timestamp"], inplace=True
)

# Step 3: For each user-session, mark if a page view led to another
page_view_df["leads_to_another"] = False
# If a user-session "group" has more than one element, then we know all but the last page view led to another page view
for (user, session), group in page_view_df.groupby(["user_pseudo_id", "ga_session_id"]):
    if len(group) > 1:
        page_view_df.loc[group.index[:-1], "leads_to_another"] = True


# Step 4: Create matrix: user x page, value = 1 if it led to another, else 0
# leads_df is created by taking the rows of page_view_df where leads_to_another is True
leads_df = page_view_df[page_view_df["leads_to_another"]].copy()
leads_df["led"] = 1

user_page_leads_matrix = leads_df.pivot_table(
    index="user_pseudo_id",
    columns="page_location",
    values="led",
    aggfunc="max",
    fill_value=0,
)

user_page_leads_matrix.head()

page_location,https://dallasfreepress.com/,https://dallasfreepress.com/about-us/,https://dallasfreepress.com/author/Michaela-Rush/,https://dallasfreepress.com/author/Sona-Chaudhary/,https://dallasfreepress.com/author/admin/,https://dallasfreepress.com/author/brenda-hernandez/,https://dallasfreepress.com/author/christina-hughes-babb/,https://dallasfreepress.com/author/david-silva-ramirez/,https://dallasfreepress.com/author/dfp-student-journalists/,https://dallasfreepress.com/author/documenternotes/,...,https://dallasfreepress.com/west-dallas/west-dallas-oldest-black-owned-grocery-store-thrives-while-others-fizzle-out-in-food-desert/,https://dallasfreepress.com/west-dallas/west-dallas-remembers-isabel-chavela-lozada-tavera/,https://dallasfreepress.com/west-dallas/west-dallas-residents-dont-trust-city-plans-for-their-land/,https://dallasfreepress.com/west-dallas/west-dallas-residents-resist-air-permit-renewal-for-gaf-shingle-plant/,https://dallasfreepress.com/west-dallas/west-dallas-residents-step-toward-health-community-and-connection/,https://dallasfreepress.com/west-dallas/west-dallas-stem-school-has-a-new-principal/,https://dallasfreepress.com/west-dallas/west-dallas-strong-west-dallas-1-website-breaks-new-ground-and-hopes-to-inspire-new-leadership/,https://dallasfreepress.com/west-dallas/who-gets-to-name-the-west-dallas-stem-school/,https://dallasfreepress.com/whats-a-news-desert/,https://dallasfreepress.com/who-owns-south-dallas/
user_pseudo_id
1002732508.174148,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1003313917.1742518,1,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1008308890.1738085,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
100938759.17367911,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1010095652.174226,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


# Step 2: Sparse matrix

Okay! We've calculated our user-item interaction matrices. They're going to be pretty sparse, so let's convert them into sparse matrices.
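
As a small sketch of what the conversion buys us, the snippet below converts a toy dense matrix (made-up values) to CSR format and reports how many entries are actually stored.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy 3-user x 4-page interaction matrix with mostly zeros (made-up values)
toy_dense = np.array(
    [
        [1, 0, 0, 0],
        [0, 0, 1, 0],
        [0, 0, 0, 0],
    ]
)

toy_sparse = csr_matrix(toy_dense)

# CSR only stores the non-zero entries
n_cells = toy_sparse.shape[0] * toy_sparse.shape[1]
density = toy_sparse.nnz / n_cells
print(f"stored entries: {toy_sparse.nnz} of {n_cells} (density {density:.2f})")
```

Converting back with `toy_sparse.toarray()` recovers the original dense values, so nothing is lost in the conversion.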

In [None]:
from scipy.sparse import csr_matrix

user_page_visits_sparse_matrix = csr_matrix(user_page_visits_matrix.values)
user_page_engagement_time_sparse_matrix = csr_matrix(
    user_page_engagement_time_matrix.values
)
user_page_scroll_sparse_matrix = csr_matrix(user_page_scroll_matrix.values)
user_page_leads_sparse_matrix = csr_matrix(user_page_leads_matrix.values)

# Step 3: Webpage-Webpage Cosine Similarity  

We transpose the sparse matrix because `cosine_similarity(X)` computes similarity between the rows of `X`. So, if we want webpage-webpage similarity, we need the rows of `X` to represent webpages.

Calling `cosine_similarity(sparse_matrix.T)` therefore computes the similarity between webpages based on user interaction patterns, which is exactly what we want for item-item collaborative filtering.
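
To show what the transpose accomplishes, here is a hand-written NumPy sketch of cosine similarity between the columns (pages) of a toy user-page matrix. The values are made up, and the sketch assumes every page has at least one interaction (otherwise the norm would be zero).

```python
import numpy as np

# Toy matrix: rows are users, columns are pages p0, p1, p2 (made-up values)
X = np.array(
    [
        [1, 1, 0],
        [1, 0, 1],
        [0, 0, 1],
    ],
    dtype=float,
)

# Transpose so that each row is a page vector over users
page_vectors = X.T

# Cosine similarity: dot product divided by the product of vector norms
norms = np.linalg.norm(page_vectors, axis=1, keepdims=True)
page_similarity = (page_vectors @ page_vectors.T) / (norms @ norms.T)
print(page_similarity.round(3))
```

The result is a pages-by-pages matrix: `p0` and `p2` share one user out of two each, giving a similarity of 0.5, while `p1` and `p2` share no users and score 0.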

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

user_page_visits_similarity = cosine_similarity(user_page_visits_sparse_matrix.T)
user_page_visits_similarity_df = pd.DataFrame(
    user_page_visits_similarity,
    index=user_page_visits_matrix.columns,
    columns=user_page_visits_matrix.columns,
)

user_page_engagement_time_similarity = cosine_similarity(
    user_page_engagement_time_sparse_matrix.T
)
user_page_engagement_time_similarity_df = pd.DataFrame(
    user_page_engagement_time_similarity,
    index=user_page_engagement_time_matrix.columns,
    columns=user_page_engagement_time_matrix.columns,
)

user_page_scroll_similarity = cosine_similarity(user_page_scroll_sparse_matrix.T)
user_page_scroll_similarity_df = pd.DataFrame(
    user_page_scroll_similarity,
    index=user_page_scroll_matrix.columns,
    columns=user_page_scroll_matrix.columns,
)

user_page_leads_similarity = cosine_similarity(user_page_leads_sparse_matrix.T)
user_page_leads_similarity_df = pd.DataFrame(
    user_page_leads_similarity,
    index=user_page_leads_matrix.columns,
    columns=user_page_leads_matrix.columns,
)

# Step 4: Recommendation function

In [None]:
def recommend_pages(interaction_matrix, similarity_df, user_id, top_n=5):
    # Grab all pages the user interacted with
    user_ratings = interaction_matrix.loc[user_id]
    # We need to filter user_ratings, b/c interaction_matrix included 0 as a fill value for missing interactions
    rated_items = user_ratings[user_ratings > 0].index.tolist()

    # For each page the user interacted with, get the similarity scores of all other pages to that page
    scores = pd.Series(dtype=float)
    for item in rated_items:
        similar_items = similarity_df[item]
        scores = scores.add(similar_items, fill_value=0)

    # Remove pages the user has already interacted with
    scores = scores.drop(rated_items, errors="ignore")
    return scores.sort_values(ascending=False).head(top_n)
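
As a toy walk-through of the scoring logic, the sketch below restates the function so it runs on its own and applies it to a three-page example. The users, pages, and similarity values are made up for illustration.

```python
import pandas as pd


def recommend_pages(interaction_matrix, similarity_df, user_id, top_n=5):
    # Restated from the cell above so this sketch is self-contained
    user_ratings = interaction_matrix.loc[user_id]
    rated_items = user_ratings[user_ratings > 0].index.tolist()
    scores = pd.Series(dtype=float)
    for item in rated_items:
        # Accumulate each candidate's similarity to the pages the user touched
        scores = scores.add(similarity_df[item], fill_value=0)
    scores = scores.drop(rated_items, errors="ignore")
    return scores.sort_values(ascending=False).head(top_n)


pages = ["/a", "/b", "/c"]
# Toy interactions: u3 has only interacted with /c
toy_interactions = pd.DataFrame(
    [[1, 1, 0], [1, 0, 1], [0, 0, 1]], index=["u1", "u2", "u3"], columns=pages
)
# Hand-picked page-page similarities (symmetric, 1.0 on the diagonal)
toy_similarity = pd.DataFrame(
    [[1.0, 0.7, 0.5], [0.7, 1.0, 0.0], [0.5, 0.0, 1.0]], index=pages, columns=pages
)

toy_recs = recommend_pages(toy_interactions, toy_similarity, user_id="u3", top_n=2)
print(toy_recs)
```

Because `u3` only interacted with `/c`, each candidate's score is simply its similarity to `/c`: `/a` ranks first with 0.5, `/b` follows with 0.0, and `/c` itself is dropped.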

# Demo:

In [None]:
user = "1448845853.1742950447"

recommendations = recommend_pages(
    user_page_visits_matrix, user_page_visits_similarity_df, user_id=user, top_n=5
)
print(f"Top 5 recommended articles for User {user} based on visits:")
print(recommendations)

recommendations = recommend_pages(
    user_page_engagement_time_matrix,
    user_page_engagement_time_similarity_df,
    user_id=user,
    top_n=5,
)
print(f"Top 5 recommended articles for User {user} based on engagement time:")
print(recommendations)


recommendations = recommend_pages(
    user_page_scroll_matrix, user_page_scroll_similarity_df, user_id=user, top_n=5
)
print(f"Top 5 recommended articles for User {user} based on scroll depth:")
print(recommendations)

recommendations = recommend_pages(
    user_page_leads_matrix, user_page_leads_similarity_df, user_id=user, top_n=5
)
print(f"Top 5 recommended articles for User {user} based on leads:")
print(recommendations)

Top 5 recommended articles for User 1448845853.1742950447 based on visits:
page_location
https://dallasfreepress.com/dallas-news/concerns-of-gentrification-halt-duplex-construction-in-owenwood-neighborhood/    0.160128
https://dallasfreepress.com/tag/emmanuel-glover/                                                                         0.160128
https://dallasfreepress.com/event-location/resource-center-west/                                                         0.160128
https://dallasfreepress.com/events/heirs-property-workshop/                                                              0.113228
https://dallasfreepress.com/event-location/career-institute-east/                                                        0.113228
dtype: float64
Top 5 recommended articles for User 1448845853.1742950447 based on engagement time:
page_location
https://dallasfreepress.com/tag/emmanuel-glover/                                                                  0.460011
https://dallasfreepress.c