# User Journey Analysis

## Preprocessing the Data

In this project, we are provided with data containing the user journeys of people who bought our product. We need to create Python programs to analyze the sequence of visited pages with the objective of improving the front page flow and identifying which pages are important.

But before analyzing this data, we must first clean it and prepare it for the next step.

For this part of the process, we must create a Python program with three functions to help transform the data into a more analysis-ready state and then export this new data to a CSV.

To begin, inspect the CSV itself and see if there is any need to clean the data. After inspection, we should notice that some user journey strings have multiple duplicate pages, one after another. While the Homepage reference in journeys like Homepage-Pricing-Homepage might be helpful for the analysis, the repeating reference in Homepage-Homepage-Homepage-Pricing is not.

The first function we need to create removes runs of repeating pages, leaving a single occurrence in place of each run. It should apply only where the same page is repeated consecutively. So, it should leave the first example (Homepage-Pricing-Homepage) unchanged while replacing the second (Homepage-Homepage-Homepage-Pricing) with Homepage-Pricing. This operation should be applied to every row of data.
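
Before running anything on the full dataset, the intended behaviour can be checked on the two example journeys. The following is a minimal sketch that uses `itertools.groupby` from the standard library as an alternative to an index-based comparison; the helper name `collapse_repeats` is hypothetical and not part of the project code.

```python
from itertools import groupby

def collapse_repeats(journey, sep='-'):
    """Collapse runs of identical consecutive pages into a single page."""
    pages = journey.split(sep)
    # groupby yields one (key, run) pair per run of equal neighbouring items
    return sep.join(page for page, _run in groupby(pages))

print(collapse_repeats('Homepage-Pricing-Homepage'))           # Homepage-Pricing-Homepage
print(collapse_repeats('Homepage-Homepage-Homepage-Pricing'))  # Homepage-Pricing
```

Only adjacent repeats are merged, so the second Homepage in the first journey survives.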

In [3]:
# Import pandas for data manipulation and analysis.
import pandas as pd

In [4]:
# Load the dataset
data = pd.read_csv('user_journey_raw.csv')

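# Summarize the columns and their data types, then preview the first few rows.
# Note: data.info() prints its summary and returns None, which is why None
# appears as the first element of the tuple displayed below.
data_info = data.info()
data_head = data.head()

data_info, data_head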

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9935 entries, 0 to 9934
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   user_id            9935 non-null   int64 
 1   session_id         9935 non-null   int64 
 2   subscription_type  9935 non-null   object
 3   user_journey       9935 non-null   object
dtypes: int64(2), object(2)
memory usage: 310.6+ KB


(None,
    user_id  session_id subscription_type  \
 0     1516     2980231            Annual   
 1     1516     2980248            Annual   
 2     1516     2992252            Annual   
 3     1516     3070491            Annual   
 4     1516     3709807            Annual   
 
                                         user_journey  
 0  Homepage-Log in-Log in-Log in-Log in-Log in-Lo...  
 1  Other-Sign up-Sign up-Sign up-Sign up-Sign up-...  
 2          Log in-Log in-Log in-Log in-Log in-Log in  
 3  Homepage-Log in-Log in-Log in-Log in-Log in-Lo...  
 4  Log in-Log in-Log in-Log in-Log in-Log in-Log ...  )

In [5]:
def remove_page_duplicates(data, target_column='user_journey'):
    # Create a copy of the dataframe to ensure the original dataframe is not modified
    data_cleaned = data.copy()
    
    # Function to remove consecutive duplicates from a string of journey pages
    def remove_consecutive_duplicates(journey):
        # Split the journey string into a list of pages
        pages = journey.split('-')
        # Use a list comprehension to keep a page if it's not the same as the next one
        cleaned_pages = [pages[i] for i in range(len(pages) - 1) if pages[i] != pages[i+1]] + [pages[-1]]
        # Join the cleaned list back into a string
        return '-'.join(cleaned_pages)
    
    # Apply the function to the target column of the dataframe
    data_cleaned[target_column] = data_cleaned[target_column].apply(remove_consecutive_duplicates)
    
    return data_cleaned

# Apply the function to the loaded data
cleaned_data = remove_page_duplicates(data)

# Display the first few rows of the cleaned dataframe to verify the changes
cleaned_data.head()


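|   | user_id | session_id | subscription_type | user_journey |
|---|---------|------------|-------------------|--------------|
| 0 | 1516 | 2980231 | Annual | Homepage-Log in-Other |
| 1 | 1516 | 2980248 | Annual | Other-Sign up-Log in |
| 2 | 1516 | 2992252 | Annual | Log in |
| 3 | 1516 | 3070491 | Annual | Homepage-Log in |
| 4 | 1516 | 3709807 | Annual | Log in |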


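Next, we will look at the structure of the data. Currently, there is one row for every session of a user. But when considering a user’s journey, we’re interested in the full sequence of pages rather than in individual sessions. To prepare the data for the analysis, we'll need to combine all of a user's journey strings into one long string, which is what the second function does. Its sessions parameter limits how many sessions are combined ('All' or an integer), and count_from chooses whether that number is counted from the user's first or last session.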
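
As a rough illustration of the grouping step, the sketch below joins session journeys per user on a small made-up DataFrame (the values are invented for illustration, not taken from the dataset). It also shows a side effect worth keeping in mind: when one session ends on the page the next session starts with, the joined string contains a consecutive duplicate again.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'user_id':      [1, 1, 2],
    'session_id':   [10, 11, 20],
    'user_journey': ['Homepage-Pricing', 'Pricing-Checkout', 'Homepage'],
})

# Sort so sessions are joined in session_id order, then join per user
grouped = (
    toy.sort_values(['user_id', 'session_id'])
       .groupby('user_id')['user_journey']
       .agg('-'.join)
       .reset_index()
)
print(grouped)
```

User 1 ends up with Homepage-Pricing-Pricing-Checkout. Whether to collapse such boundary repeats again after grouping is an analysis decision.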

In [8]:
def group_by(data, group_column='user_id', target_column='user_journey', sessions='All', count_from='last'):
    # Sort by user and session so journeys are joined in session order
    # (assumption: session_id increases over time)
    data_sorted = data.sort_values(by=[group_column, 'session_id'])

    def group_journeys(journeys):
        # journeys is the Series of journey strings belonging to one user
        if sessions != 'All':
            # Keep only the first or last `sessions` sessions of the user
            journeys = journeys.head(sessions) if count_from == 'first' else journeys.tail(sessions)
        return '-'.join(journeys)

    # Select the target column before apply so the grouping column is not passed to the function
    grouped_data = data_sorted.groupby(group_column)[target_column].apply(group_journeys).reset_index(name=target_column)
    return grouped_data

# Group all sessions for each user
grouped_data_all = group_by(cleaned_data)

# Display the first few rows of the grouped dataframe
grouped_data_all.head()

|   | user_id | user_journey |
|---|---------|--------------|
| 0 | 1516 | Homepage-Log in-Other-Other-Sign up-Log in-Log... |
| 1 | 3395 | Other-Pricing-Sign up-Log in-Homepage-Pricing-... |
| 2 | 10107 | Homepage-Homepage-Career tracks-Homepage-Caree... |
| 3 | 11145 | Homepage-Log in-Homepage-Log in-Homepage-Log i... |
| 4 | 12400 | Homepage-Career tracks-Sign up-Log in-Other-Ca... |


The final function removes unnecessary pages from the data, since not every page is essential in a user journey analysis. Pages like ‘Log in’ might be candidates for removal. But this is not something we should hardcode into the preprocessing, because it’s a decision the data scientist should be able to make and experiment with. That’s why we create a function that takes the list of pages to remove and can be called later if needed.
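
Two edge cases of page removal are easy to overlook, so here is a small sketch on single journey strings. The helper name `drop_pages` and the example journeys are hypothetical.

```python
def drop_pages(journey, pages, sep='-'):
    """Remove every occurrence of the given pages from a journey string."""
    unwanted = set(pages)
    return sep.join(p for p in journey.split(sep) if p not in unwanted)

# Removing a page can place two identical pages next to each other
print(drop_pages('Homepage-Log in-Homepage-Pricing', ['Log in']))  # Homepage-Homepage-Pricing

# A journey made up only of removed pages becomes an empty string
print(repr(drop_pages('Log in', ['Log in'])))  # ''
```

Depending on the analysis, it may make sense to collapse repeats again after removal and to drop rows whose journey is empty.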

In [9]:
def remove_pages(data, pages, target_column='user_journey'):
    # Create a copy of the dataframe to ensure the original dataframe is not modified
    data_cleaned = data.copy()
    
    # Function to remove specified pages from a journey string
    def remove_specified_pages(journey):
        # Split the journey string into a list of pages
        journey_list = journey.split('-')
        # Remove the specified pages
        cleaned_journey_list = [page for page in journey_list if page not in pages]
        # Join the cleaned list back into a string
        return '-'.join(cleaned_journey_list)
    
    # Apply the function to remove specified pages from the target column
    data_cleaned[target_column] = data_cleaned[target_column].apply(remove_specified_pages)
    
    return data_cleaned

# Example usage: Remove "Log in" pages from the user journey
pages_to_remove = ["Log in"]
data_without_login_pages = remove_pages(grouped_data_all, pages_to_remove)

# Display the first few rows of the dataframe with removed pages
data_without_login_pages.head()


|   | user_id | user_journey |
|---|---------|--------------|
| 0 | 1516 | Homepage-Other-Other-Sign up-Homepage-Checkout... |
| 1 | 3395 | Other-Pricing-Sign up-Homepage-Pricing-Pricing... |
| 2 | 10107 | Homepage-Homepage-Career tracks-Homepage-Caree... |
| 3 | 11145 | Homepage-Homepage-Homepage-Homepage-Homepage-H... |
| 4 | 12400 | Homepage-Career tracks-Sign up-Other-Career tr... |
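
The preprocessing step is meant to end with an export to CSV. A minimal sketch of that export follows, assuming a stand-in dataframe with invented values and a hypothetical output filename; it writes to an in-memory buffer so it can be run without touching the disk.

```python
import io
import pandas as pd

# Stand-in for the preprocessed dataframe; values invented for illustration
processed = pd.DataFrame({
    'user_id': [1, 2],
    'user_journey': ['Homepage-Pricing-Checkout', 'Homepage'],
})

# index=False keeps the row index out of the file, so reading it back
# does not produce an extra "Unnamed: 0" column
buffer = io.StringIO()
processed.to_csv(buffer, index=False)
print(buffer.getvalue())

# With a real file, the call would look like this (hypothetical filename):
# processed.to_csv('user_journey_processed.csv', index=False)
```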


## Analyzing the Data

Given the preprocessed data, you can begin your analysis in a new notebook to keep things clean. Now is the time to think about which metrics we can generate to obtain valuable statistics about the behavior of purchasing customers. Take your time and think of as many such metrics as possible.

In [10]:
# Define the function for Page Count
def page_count(data, target_column='user_journey', subscription_type_column='subscription_type', plan=None):
    if plan:
        data_filtered = data[data[subscription_type_column] == plan]
    else:
        data_filtered = data
    
    page_counts = {}
    
    for journey in data_filtered[target_column]:
        pages = journey.split('-')
        for page in pages:
            if page not in page_counts:
                page_counts[page] = 1
            else:
                page_counts[page] += 1
    
    sorted_counts = sorted(page_counts.items(), key=lambda x: x[1], reverse=True)
    
    return sorted_counts



The function page_count counts the occurrences of each page across all user journeys, with the option to filter by subscription plan. It returns a list of (page, count) tuples sorted from the most to the least frequent page.
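
The same counting idea can be sketched more compactly with `collections.Counter` from the standard library. The journeys below are invented for illustration, and the snippet is only an alternative formulation, not the project's function.

```python
from collections import Counter

# Toy journeys, invented for illustration
journeys = ['Homepage-Pricing-Homepage', 'Homepage-Checkout']

counts = Counter()
for journey in journeys:
    # update() adds one count per page in the split list
    counts.update(journey.split('-'))

# most_common() sorts by count, highest first
print(counts.most_common())
```

Here Homepage is counted three times because every visit counts, including repeat visits within one journey.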

In [13]:
# Define the function for Page Presence
def page_presence(data, target_column='user_journey', subscription_type_column='subscription_type', plan=None):
    if plan:
        data_filtered = data[data[subscription_type_column] == plan]
    else:
        data_filtered = data
    
    page_presence_dict = {}
    
    for journey in data_filtered[target_column]:
        # Using a set to ensure each page in a journey is counted only once
        pages = set(journey.split('-'))
        for page in pages:
            if page not in page_presence_dict:
                page_presence_dict[page] = 1
            else:
                page_presence_dict[page] += 1
    
    sorted_presence = sorted(page_presence_dict.items(), key=lambda x: x[1], reverse=True)
    
    return sorted_presence


The function page_presence counts how many journeys include each page at least once, regardless of how many times the page appears within a single journey.
The process for this function involves filtering the dataset based on a specified subscription plan (if any), iterating through each user journey to identify unique pages, and then counting the presence of each unique page across all journeys. The results are sorted by the frequency of page presence to highlight the most common pages that appear in user journeys.
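
To make the difference between presence and raw counts concrete, here is a small sketch on invented journeys. It uses `collections.Counter` and is meant as an illustration rather than a replacement for the function above.

```python
from collections import Counter

# Toy journeys, invented for illustration
journeys = ['Homepage-Pricing-Homepage', 'Homepage-Checkout']

presence = Counter()
for journey in journeys:
    # set() keeps each page once per journey
    presence.update(set(journey.split('-')))

print(presence['Homepage'])  # 2: present in both journeys, although visited 3 times
```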

In [14]:
def page_destination(data, target_column='user_journey', subscription_type_column='subscription_type', plan=None):
    # Filter data by subscription plan if specified
    if plan:
        data_filtered = data[data[subscription_type_column] == plan]
    else:
        data_filtered = data

    # Initialize a dictionary to hold the follow-up page counts
    destination_counts = {}

    # Iterate through each journey in the filtered data
    for journey in data_filtered[target_column]:
        pages = journey.split('-')
        for i in range(len(pages) - 1):
            current_page = pages[i]
            next_page = pages[i + 1]
            if current_page not in destination_counts:
                destination_counts[current_page] = {next_page: 1}
            else:
                if next_page not in destination_counts[current_page]:
                    destination_counts[current_page][next_page] = 1
                else:
                    destination_counts[current_page][next_page] += 1

    # For analysis, convert counts to a more readable format or sort them
    for page, destinations in destination_counts.items():
        sorted_destinations = sorted(destinations.items(), key=lambda x: x[1], reverse=True)
        destination_counts[page] = sorted_destinations

    return destination_counts


The function page_destination calculates how often each page is followed by another, providing insights into user navigation patterns.
It filters the journeys based on a specific subscription plan if provided.
The function iterates through each user journey, tracking the sequence of pages and counting the transitions from one page to the next.
It organizes the counts in a nested dictionary structure, where each key is a page and its value is another dictionary of pages that follow it, along with the counts of those transitions.
The output is adjusted to provide sorted lists of follow-up pages for each page, making it easier to identify the most common next steps users take after visiting a specific page.
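
Raw transition counts are hard to compare between pages with very different traffic. One possible extension, which is an addition of this sketch and not part of the function above, is to express each follow-up page as a share of all transitions leaving the current page. The journeys and the helper name `destination_shares` are hypothetical.

```python
from collections import Counter, defaultdict

# Toy journeys, invented for illustration
journeys = ['Homepage-Pricing-Checkout', 'Homepage-Pricing-Homepage', 'Homepage-Log in']

transitions = defaultdict(Counter)
for journey in journeys:
    pages = journey.split('-')
    # zip pairs each page with its successor; the last page has none
    for current_page, next_page in zip(pages, pages[1:]):
        transitions[current_page][next_page] += 1

def destination_shares(page):
    """Return (next_page, share) pairs for one page, most common first."""
    total = sum(transitions[page].values())
    return [(nxt, n / total) for nxt, n in transitions[page].most_common()]

print(destination_shares('Homepage'))
```

In this toy data, two of the three transitions leaving Homepage go to Pricing and one goes to Log in.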

In [15]:
def page_sequences(data, sequence_length, target_column='user_journey', subscription_type_column='subscription_type', plan=None):
    # Filter data by subscription plan if specified
    if plan:
        data_filtered = data[data[subscription_type_column] == plan]
    else:
        data_filtered = data

    # Initialize a dictionary to hold counts of each page sequence
    sequence_counts = {}

    # Iterate through each journey in the filtered data
    for journey in data_filtered[target_column]:
        pages = journey.split('-')
        # Generate sequences of the specified length
        for i in range(len(pages) - sequence_length + 1):
            sequence = '-'.join(pages[i:i + sequence_length])
            if sequence not in sequence_counts:
                sequence_counts[sequence] = 1
            else:
                sequence_counts[sequence] += 1

    # Sort sequences by their counts in descending order
    sorted_sequences = sorted(sequence_counts.items(), key=lambda x: x[1], reverse=True)

    return sorted_sequences


This function, page_sequences, is designed to uncover patterns in user navigation by identifying and counting the occurrences of sequences of a specified length (N pages).
It filters journeys based on a specific subscription plan if provided, allowing for targeted analysis.
The function iterates through each user journey, extracting sequences of N consecutive pages and counting their occurrences across all journeys.
The output is a sorted list of these sequences, with the most common sequences appearing first. This sorting makes it easier to identify the most frequently traveled paths.
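
The sliding-window logic can be tried out on a few invented journeys. The sketch below assumes a hypothetical helper `count_sequences` that works on a plain list of journey strings rather than on a dataframe.

```python
from collections import Counter

def count_sequences(journeys, n, sep='-'):
    """Count every run of n consecutive pages across the given journeys."""
    counts = Counter()
    for journey in journeys:
        pages = journey.split(sep)
        # A journey with fewer than n pages yields an empty range and contributes nothing
        for i in range(len(pages) - n + 1):
            counts[sep.join(pages[i:i + n])] += 1
    return counts.most_common()

# Toy journeys, invented for illustration
journeys = ['Homepage-Pricing-Checkout', 'Homepage-Pricing', 'Checkout']
print(count_sequences(journeys, 2))
# [('Homepage-Pricing', 2), ('Pricing-Checkout', 1)]
```

The single-page journey is skipped for n = 2, so short journeys silently drop out as the sequence length grows.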

Use Cases

Identifying Common Paths: Understand common navigation paths users take, which can be useful for optimizing user flow or enhancing popular content.

Content Strategy: By recognizing the sequences that lead to conversions or high engagement, you can tailor your content strategy to reinforce these paths.

User Experience Optimization: Identifying popular sequences can help in streamlining navigation, removing unnecessary steps, or highlighting key pathways to improve user experience.

In [16]:
def journey_length(data, target_column='user_journey', subscription_type_column='subscription_type', plan=None):
    # Filter data by subscription plan if specified
    if plan:
        data_filtered = data[data[subscription_type_column] == plan]
    else:
        data_filtered = data

    # Initialize a list to hold the lengths of all journeys
    journey_lengths = []

    # Iterate through each journey in the filtered data
    for journey in data_filtered[target_column]:
        # Count the number of pages in the journey
        # (computed inline so no local variable shadows the function name)
        journey_lengths.append(len(journey.split('-')))

    # Calculate the average journey length
    if journey_lengths:
        avg_journey_length = sum(journey_lengths) / len(journey_lengths)
    else:
        avg_journey_length = 0

    return avg_journey_length


The function journey_length calculates the average number of pages visited in user journeys, providing an aggregate measure of engagement and navigation depth.
It supports filtering by subscription plan, enabling comparisons of journey complexity across different user segments.
By iterating through each journey, the function counts pages and computes the average length, offering insights into the overall user experience and the efficiency of the website's structure.
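
An average can be pulled upward by a few very long journeys. As a hedged illustration, the sketch below compares the mean with the median on invented journeys; reporting the median is a suggestion of this example, not something the function above does.

```python
from statistics import mean, median

# Toy journeys, invented for illustration; the third one is deliberately long
journeys = [
    'Homepage',
    'Homepage-Pricing',
    'Homepage-Pricing-Checkout-Coupon-Checkout-Coupon-Checkout-Coupon-Checkout',
]

lengths = [len(journey.split('-')) for journey in journeys]
print(lengths)          # [1, 2, 9]
print(mean(lengths))    # 4
print(median(lengths))  # 2
```

The one long journey doubles the mean relative to the median, which is worth checking before reading the average as typical behavior.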

Use Cases

Evaluating User Engagement: Longer journeys might indicate higher engagement but can also suggest a convoluted path to conversion.

Optimizing Navigation Paths: Understanding the average journey length helps in optimizing the site layout to either streamline paths to important actions or enhance user exploration, depending on the strategic goals.

Segment Analysis: Comparing journey lengths across different subscription plans can reveal how different segments interact with the site, highlighting opportunities for tailored content or features.

In [11]:
page_count(data)

[('Checkout', 17896),
 ('Log in', 17265),
 ('Coupon', 11855),
 ('Courses', 7149),
 ('Sign up', 6824),
 ('Other', 6820),
 ('Career tracks', 4910),
 ('Homepage', 3808),
 ('Career track certificate', 3044),
 ('Resources center', 2266),
 ('Pricing', 2262),
 ('Course certificate', 1114),
 ('Success stories', 604),
 ('Upcoming courses', 188),
 ('Instructors', 76),
 ('Blog', 36),
 ('About us', 33)]

In [17]:
page_presence(data)

[('Log in', 3798),
 ('Homepage', 2396),
 ('Checkout', 2021),
 ('Other', 1535),
 ('Sign up', 1210),
 ('Coupon', 1041),
 ('Pricing', 929),
 ('Courses', 908),
 ('Career tracks', 747),
 ('Career track certificate', 355),
 ('Resources center', 339),
 ('Course certificate', 191),
 ('Upcoming courses', 101),
 ('Success stories', 49),
 ('Instructors', 26),
 ('About us', 22),
 ('Blog', 15)]

In [18]:
page_destination(data)

{'Homepage': [('Homepage', 1070),
  ('Log in', 947),
  ('Pricing', 442),
  ('Career tracks', 349),
  ('Sign up', 340),
  ('Courses', 244),
  ('Career track certificate', 116),
  ('Course certificate', 65),
  ('Resources center', 50),
  ('Instructors', 25),
  ('Other', 23),
  ('Upcoming courses', 10),
  ('About us', 5),
  ('Success stories', 4),
  ('Blog', 4),
  ('Checkout', 1),
  ('Coupon', 1)],
 'Log in': [('Log in', 13389),
  ('Sign up', 107),
  ('Other', 90),
  ('Homepage', 28),
  ('Courses', 5),
  ('Pricing', 5),
  ('Career tracks', 4),
  ('Career track certificate', 3),
  ('Resources center', 1)],
 'Other': [('Other', 5057),
  ('Resources center', 248),
  ('Log in', 79),
  ('Sign up', 45),
  ('Pricing', 45),
  ('Courses', 29),
  ('Coupon', 26),
  ('Career track certificate', 22),
  ('Career tracks', 19),
  ('Homepage', 16),
  ('Success stories', 14),
  ('Course certificate', 13),
  ('Blog', 6),
  ('Upcoming courses', 4),
  ('About us', 4)],
 'Sign up': [('Sign up', 5521),
  ...

In [20]:
page_sequences(data,2)

[('Checkout-Checkout', 15832),
 ('Log in-Log in', 13389),
 ('Coupon-Coupon', 10814),
 ('Courses-Courses', 5962),
 ('Sign up-Sign up', 5521),
 ('Other-Other', 5057),
 ('Career tracks-Career tracks', 3775),
 ('Career track certificate-Career track certificate', 2563),
 ('Resources center-Resources center', 1686),
 ('Pricing-Pricing', 1168),
 ('Homepage-Homepage', 1070),
 ('Homepage-Log in', 947),
 ('Course certificate-Course certificate', 897),
 ('Success stories-Success stories', 551),
 ('Homepage-Pricing', 442),
 ('Homepage-Career tracks', 349),
 ('Career tracks-Courses', 341),
 ('Homepage-Sign up', 340),
 ('Resources center-Other', 337),
 ('Pricing-Checkout', 286),
 ('Courses-Career tracks', 260),
 ('Other-Resources center', 248),
 ('Homepage-Courses', 244),
 ('Sign up-Log in', 238),
 ('Courses-Sign up', 209),
 ('Career tracks-Sign up', 194),
 ('Career track certificate-Career tracks', 164),
 ('Pricing-Sign up', 128),
 ('Homepage-Career track certificate', 116),
 ...

In [21]:
page_sequences(data,3)

[('Checkout-Checkout-Checkout', 13939),
 ('Coupon-Coupon-Coupon', 9774),
 ('Log in-Log in-Log in', 9757),
 ('Courses-Courses-Courses', 5040),
 ('Sign up-Sign up-Sign up', 4568),
 ('Other-Other-Other', 3870),
 ('Career tracks-Career tracks-Career tracks', 2894),
 ('Career track certificate-Career track certificate-Career track certificate',
  2163),
 ('Resources center-Resources center-Resources center', 1271),
 ('Homepage-Log in-Log in', 827),
 ('Homepage-Homepage-Homepage', 711),
 ('Course certificate-Course certificate-Course certificate', 705),
 ('Pricing-Pricing-Pricing', 557),
 ('Success stories-Success stories-Success stories', 505),
 ('Homepage-Pricing-Pricing', 327),
 ('Homepage-Career tracks-Career tracks', 300),
 ('Career tracks-Courses-Courses', 299),
 ('Career tracks-Career tracks-Courses', 270),
 ('Homepage-Sign up-Sign up', 250),
 ('Resources center-Resources center-Other', 246),
 ('Other-Resources center-Resources center', 234),
 ('Homepage-Courses-Courses', 232),
 ...

In [22]:
page_sequences(data,4)

[('Checkout-Checkout-Checkout-Checkout', 12560),
 ('Coupon-Coupon-Coupon-Coupon', 8960),
 ('Log in-Log in-Log in-Log in', 7534),
 ('Courses-Courses-Courses-Courses', 4245),
 ('Sign up-Sign up-Sign up-Sign up', 3770),
 ('Other-Other-Other-Other', 3319),
 ('Career tracks-Career tracks-Career tracks-Career tracks', 2161),
 ('Career track certificate-Career track certificate-Career track certificate-Career track certificate',
  1820),
 ('Resources center-Resources center-Resources center-Resources center', 912),
 ('Homepage-Log in-Log in-Log in', 818),
 ('Course certificate-Course certificate-Course certificate-Course certificate',
  552),
 ('Success stories-Success stories-Success stories-Success stories', 464),
 ('Homepage-Homepage-Homepage-Homepage', 446),
 ('Pricing-Pricing-Pricing-Pricing', 348),
 ('Career tracks-Courses-Courses-Courses', 274),
 ('Homepage-Career tracks-Career tracks-Career tracks', 264),
 ('Career tracks-Career tracks-Career tracks-Courses', 250),
 ...

In [23]:
page_sequences(data,5)

[('Checkout-Checkout-Checkout-Checkout-Checkout', 11209),
 ('Coupon-Coupon-Coupon-Coupon-Coupon', 8146),
 ('Log in-Log in-Log in-Log in-Log in', 5911),
 ('Courses-Courses-Courses-Courses-Courses', 3599),
 ('Sign up-Sign up-Sign up-Sign up-Sign up', 3104),
 ('Other-Other-Other-Other-Other', 2774),
 ('Career tracks-Career tracks-Career tracks-Career tracks-Career tracks',
  1652),
 ('Career track certificate-Career track certificate-Career track certificate-Career track certificate-Career track certificate',
  1517),
 ('Resources center-Resources center-Resources center-Resources center-Resources center',
  735),
 ('Success stories-Success stories-Success stories-Success stories-Success stories',
  424),
 ('Course certificate-Course certificate-Course certificate-Course certificate-Course certificate',
  413),
 ('Homepage-Log in-Log in-Log in-Log in', 386),
 ('Homepage-Homepage-Homepage-Homepage-Homepage', 307),
 ('Career tracks-Career tracks-Career tracks-Courses-Courses', 225),
 ...

In [24]:
journey_length(data)

8.671363865123302