# Data Analysis

This notebook defines functions that extract valuable insights from the clean dataset prepared during the preprocessing stage. The functions allow users to filter data by subscription type. The following functions are included:

**Page count** : it counts how many times each page can be found in all user journeys.

**Page presence** : is similar to ‘page count’ but counts each page only once if it exists in a journey; it shows how many times each page is part of a journey

**Page destination**: is a metric that shows the most frequent follow-ups after every page. It looks at every page and counts which pages follow next. If one is interested in what the users do after visiting page X, they can consult this metric.

**Page sequences** : look at what the most popular run of N pages is. I will consult this metric if I’m interested in the sequence of three (or any other number) pages that most often shows up. Count each sequence only once per journey.

**Journey length** : is a straightforward metric that considers the average length of a user journey in terms of pages.

**Entry Page Frequency** : Identifies which pages are most often the starting point of user journeys, useful for understanding how users typically enter the platform or website.

**Exit Page Frequency** : Tracks the most common last pages users visit before leaving, Helps detect potential drop-off points or pages where user engagement ends.

**Click Depth** : Measures the average number of steps it takes for users to reach key pages (e.g., pricing, contact), Helps assess site structure and accessibility of important content.

In [49]:
import pandas as pd
import numpy as np
from collections import Counter

In [50]:
data = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/user-jurney-analysis/clean_data.csv')

In [51]:
data.head()

Unnamed: 0,user_id,subscription_type,full_user_journey
0,1516,Annual,Homepage-Log in-Sign up-Log in
1,3395,Annual,Pricing-Sign up-Log in-Homepage-Pricing
2,10107,Annual,Homepage-Career tracks-Homepage-Career tracks-...
3,11145,Monthly,Homepage-Log in-Homepage-Log in-Homepage-Log in
4,12400,Monthly,Homepage-Career tracks-Sign up-Log in-Career t...


In [52]:
data.isnull().sum()

Unnamed: 0,0
user_id,0
subscription_type,0
full_user_journey,96


we have 96 messing values form the full_user_journey column because in the phase of preprocessing we removed the duplicate pages as it's not important for the insights, and we may have users that just have the same page in all of the sessions.We gonna remove these messing values

In [53]:
data.dropna(subset=['full_user_journey'], inplace=True)

define a function that counts how times a page can be found in all user journeys


In [54]:
def page_count(df, subtype = 'All', target_column = 'full_user_journey'):
    df_copy = df.copy()
    # drop missing values
    df_copy = df_copy.dropna()

    # choose membership type
    if subtype != 'All':
        df_copy = df_copy[df_copy['subscription_type'].str.contains(subtype)]


    # split the pages at the hyphen
    pages_list = df_copy[target_column].str.split('-')
    # flatten the rows of list
    pages = [page for sublist in pages_list for page in sublist]
    # create a series to count the values
    pages_count = pd.Series(pages).value_counts()



    return pages_count

In [55]:
page_count(data, subtype = 'Annual')

Unnamed: 0,count
Homepage,787
Sign up,459
Log in,454
Pricing,393
Courses,393
Career tracks,388
Coupon,362
Checkout,218
Career track certificate,194
Resources center,142


define a function that counts if a page is found in user's journeys

In [56]:


def page_presence(df, subtype = 'All', target_column = 'full_user_journey'):
    df_copy = df.copy()
    # drop missing values
    df_copy = df_copy.dropna()

    # choose membership type
    if subtype != 'All':
        df_copy = df_copy[df_copy['subscription_type'].str.contains(subtype)]

    # split the pages at the hyphen
    pages_list = df_copy[target_column].str.split('-')

    # convert each list to a set to remove duplicates
    pages_list = pages_list.apply(lambda x: list(set(x)) if isinstance(x, list) else x)

    # flatten the rows of list
    pages = [page for sublist in pages_list for page in sublist]

    # create a series to count the values
    pages_count = pd.Series(pages).value_counts()


    return pages_count

In [57]:
page_presence(data, subtype = 'Annual')

Unnamed: 0,count
Homepage,459
Log in,353
Coupon,350
Sign up,349
Pricing,253
Courses,240
Career tracks,201
Checkout,188
Career track certificate,123
Resources center,94


define a function that shows the most frequent follow-ups after every page

In [58]:


def page_destination(df, subtype = 'All', target_column = 'full_user_journey'):
    df_copy = df.copy()
    # drop missing values
    df_copy = df_copy.dropna()

    # choose membership type
    if subtype != 'All':
        df_copy = df_copy[df_copy['subscription_type'].str.contains(subtype)]

    # split each row at the hyphen
    pages_list = df_copy[target_column].str.split('-')


    # count the following page after every page
    page_pair_count = Counter()

    for pair in pages_list:
        if isinstance(pair, list):
            for i in range(len(pair)-1):
                page_pair = (pair[i], pair[i+1])
                page_pair_count[page_pair] += 1


In [59]:
page_destination(data, subtype = 'All')

 define a function returns the most popular sequence given the number of pages in a sequence


In [60]:

def page_sequence(df, sequence_size, subtype = 'All', target_column = 'full_user_journey'):
    df_copy = df.copy()
    # drop na
    df_copy = df_copy.dropna()

    # filter by subscription type
    if subtype != 'All':
        df_copy = df_copy[df_copy['subcription_type'].str.contains(subtype)]

    # split strings at hyphen
    pages_list = df_copy[target_column].str.split('-')

    # count the occurance of page sequence once per row
    sequence_count = Counter()

    for sequence in pages_list:
        if len(sequence) >= sequence_size:
            for i in range(len(sequence)-sequence_size +1):
                sequence_group = tuple(sequence[i:i + sequence_size])
                sequence_count[sequence_group] += 1

    top_sequence, count = sequence_count.most_common(1)[0]

    return top_sequence, count

In [61]:
page_sequence(data, sequence_size = 4)

(('Career tracks', 'Courses', 'Career tracks', 'Courses'), 56)

define function to check the average number of user page visits


In [62]:


def avg_journey(df, subtype = 'All', target_column = 'full_user_journey'):
    df_copy = df.copy()

    # drop missing values
    df_copy = df_copy.dropna()

    # filter by subscription type
    if subtype != 'All':
        df_copy = df_copy[df_copy['subscription_type'].str.contains(subtype)]

    # seperate strings at hyphen
    pages_list = df_copy[target_column].str.split('-')

    # count the length of every row
    journey_length = pages_list.apply(len)

    # Average journey
    avg_visits = pd.DataFrame(journey_length).mean().round(1)

    return print("Average journey for", subtype, "subscribers:", avg_visits)

In [63]:
avg_journey(data, subtype = 'All')

Average journey for All subscribers: full_user_journey    4.9
dtype: float64


define a function

In [64]:


def exit_page_frequency(df, journey_col='full_user_journey'):
    """
    Calcule la fréquence des pages de sortie dans les parcours utilisateurs.

    Paramètres :
        df (pd.DataFrame) : DataFrame contenant les parcours.
        journey_col (str) : Nom de la colonne contenant les parcours (listes ou chaînes).

    Retour :
        pd.Series : Pages de sortie triées par fréquence décroissante.
    """
    # Supprimer les lignes manquantes
    df = df.dropna(subset=[journey_col])

    # Extraire la dernière page de chaque parcours
    exit_pages = []
    for journey in df[journey_col]:
        if isinstance(journey, str):
            pages = journey.split(',')  # ou split(' > ') selon ton format
        elif isinstance(journey, list):
            pages = journey
        else:
            continue  # ignorer les formats inattendus

        if pages:
            exit_pages.append(pages[-1].strip())

    # Compter la fréquence des pages de sortie
    return pd.Series(Counter(exit_pages)).sort_values(ascending=False)


In [65]:
exit_page_frequency(data)

Unnamed: 0,0
Coupon,183
Log in,78
Homepage-Sign up-Checkout,21
Courses-Sign up-Checkout,21
Log in-Checkout,18
...,...
Homepage-Courses-Sign up-Courses-Career tracks-Sign up-Homepage-Pricing-Sign up-Coupon,1
Homepage-Courses-Pricing-Sign up-Log in,1
Homepage-Pricing-Career track certificate-Sign up-Career track certificate-Course certificate-Sign up-Course certificate-Sign up,1
Homepage-Course certificate-Homepage-Career tracks-Sign up-Homepage-Career tracks-Sign up,1


define a function that measures the average number of steps it takes for users to reach key pages (pricing, contact)

In [66]:
def click_depth(df, journey_col='full_user_journey', key_pages=['Pricing', 'Contact']):
    """
    Calcule la profondeur moyenne des clics pour atteindre des pages clés.

    Paramètres :
        df (pd.DataFrame) : DataFrame contenant les parcours.
        journey_col (str) : Nom de la colonne contenant les parcours (listes ou chaînes).
        key_pages (list) : Liste des pages clés à rechercher dans les parcours utilisateurs.

    Retour :
        float : Profondeur moyenne des clics pour atteindre les pages clés.
    """
    # Supprimer les lignes avec des parcours manquants
    df = df.dropna(subset=[journey_col])

    click_depths = []

    for journey in df[journey_col]:
        if isinstance(journey, str):
            pages = journey.split(',')  # ou split(' > ') selon ton format
        elif isinstance(journey, list):
            pages = journey
        else:
            continue  # ignorer les formats inattendus

        # Trouver la première occurrence de chaque page clé dans le parcours
        for page in key_pages:
            if page in pages:
                click_depths.append(pages.index(page) + 1)  # index + 1 pour la position réelle
                break  # On s'arrête dès qu'on trouve une page clé

    # Calculer la profondeur moyenne
    if click_depths:
        return sum(click_depths) / len(click_depths)
    else:
        return 0  # Si aucune page clé n'est trouvée, retour de 0


In [67]:
click_depth(data, key_pages=['Pricing', 'Contact'])


0

It means that, on average, users reach the key pages with their first click in their journey. In other words, the key page (such as "Pricing" or "Contact") is typically visited first in their user journey.