Users By Average Session Time

Calculate each user's average session time. A session is defined as the time difference between a page_load and a page_exit. For simplicity, assume a user has only one session per day; if there are multiple events of the same type on that day, consider only the latest page_load and the earliest page_exit, with the obvious restriction that the load event must happen before the exit event. Output the user_id and their average session time.

In [1]:
import pandas as pd 
import datetime as dt
import numpy as np

In [3]:
facebook_web_log = pd.read_csv("../CSV/facebook_web_log.csv")
columns_to_keep = ["user_id", "timestamp", "action"]
facebook_web_log = facebook_web_log[columns_to_keep]
facebook_web_log.head(3)

Unnamed: 0,user_id,timestamp,action
0,0,2019-04-25 13:30:15,page_load
1,0,2019-04-25 13:30:18,page_load
2,0,2019-04-25 13:30:40,scroll_down


In [4]:
df = pd.merge(facebook_web_log.loc[facebook_web_log['action'] == 'page_load', ['user_id', 'timestamp']],
            facebook_web_log.loc[facebook_web_log['action'] == 'page_exit', ['user_id', 'timestamp']],
            how='left', on='user_id', suffixes=['_load', '_exit']).dropna()
df

Unnamed: 0,user_id,timestamp_load,timestamp_exit
0,0,2019-04-25 13:30:15,2019-04-25 13:31:40
1,0,2019-04-25 13:30:15,2019-04-28 15:31:40
2,0,2019-04-25 13:30:18,2019-04-25 13:31:40
3,0,2019-04-25 13:30:18,2019-04-28 15:31:40
4,1,2019-04-25 13:40:00,2019-04-25 13:40:35
5,1,2019-04-25 13:40:00,2019-04-26 11:15:35
7,1,2019-04-26 11:15:00,2019-04-25 13:40:35
8,1,2019-04-26 11:15:00,2019-04-26 11:15:35
9,0,2019-04-28 14:30:15,2019-04-25 13:31:40
10,0,2019-04-28 14:30:15,2019-04-28 15:31:40


Let's go through the code step by step:

1. `facebook_web_log.loc[facebook_web_log['action'] == 'page_load', ['user_id', 'timestamp']]`: Selects from the `facebook_web_log` DataFrame only the rows where the action ('action') is 'page_load', and keeps only the 'user_id' and 'timestamp' columns.

2. `facebook_web_log.loc[facebook_web_log['action'] == 'page_exit', ['user_id', 'timestamp']]`: Same as the previous step, but for the 'page_exit' action.

3. `pd.merge(...)`: Joins the two DataFrames created in the previous steps on the 'user_id' column using a LEFT JOIN.

4. `how='left'`: Specifies the join type. Here, LEFT JOIN means that all rows from 'page_load' are included in the result, and the matching rows from 'page_exit' are attached where they exist.

5. `on='user_id'`: Specifies the column to join on.

6. `suffixes=['_load', '_exit']`: Adds suffixes to the column names to tell apart the 'timestamp' columns coming from 'page_load' and 'page_exit'.

7. `.dropna()`: Removes rows containing at least one missing (NaN) value. This ensures that only rows with both 'timestamp_load' and 'timestamp_exit' remain.

As a result, the DataFrame `df` holds a page load time ('timestamp_load') and a page exit time ('timestamp_exit') for each user ('user_id'). Because the join key is 'user_id' alone, every page_load of a user is paired with every page_exit of the same user, including exits on other days.
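The pairing behaviour described above can be checked on a tiny frame. This is a minimal sketch: the `log` frame and its rows are a hypothetical sample shaped like `facebook_web_log`, not the real CSV.

```python
import pandas as pd

# Hypothetical sample shaped like facebook_web_log (not the real data).
log = pd.DataFrame({
    "user_id": [0, 0, 0, 0, 1],
    "timestamp": [
        "2019-04-25 13:30:15",
        "2019-04-25 13:30:18",
        "2019-04-25 13:31:40",
        "2019-04-28 15:31:40",
        "2019-04-25 13:40:00",
    ],
    "action": ["page_load", "page_load", "page_exit", "page_exit", "page_load"],
})

loads = log.loc[log["action"] == "page_load", ["user_id", "timestamp"]]
exits = log.loc[log["action"] == "page_exit", ["user_id", "timestamp"]]

pairs = pd.merge(loads, exits, how="left", on="user_id", suffixes=("_load", "_exit"))

# User 0 has 2 loads and 2 exits, so the join yields 2 * 2 = 4 pairs.
# User 1 has a load but no exit, so the left join keeps one row with a
# missing timestamp_exit, which dropna() then removes.
print(len(pairs))           # 5
print(len(pairs.dropna()))  # 4
```

The row count grows with the product of loads and exits per user, which is why the later steps have to filter and aggregate the pairs.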

In [5]:
df['date_load'] = pd.to_datetime(df['timestamp_load']).dt.date
df

Unnamed: 0,user_id,timestamp_load,timestamp_exit,date_load
0,0,2019-04-25 13:30:15,2019-04-25 13:31:40,2019-04-25
1,0,2019-04-25 13:30:15,2019-04-28 15:31:40,2019-04-25
2,0,2019-04-25 13:30:18,2019-04-25 13:31:40,2019-04-25
3,0,2019-04-25 13:30:18,2019-04-28 15:31:40,2019-04-25
4,1,2019-04-25 13:40:00,2019-04-25 13:40:35,2019-04-25
5,1,2019-04-25 13:40:00,2019-04-26 11:15:35,2019-04-25
7,1,2019-04-26 11:15:00,2019-04-25 13:40:35,2019-04-26
8,1,2019-04-26 11:15:00,2019-04-26 11:15:35,2019-04-26
9,0,2019-04-28 14:30:15,2019-04-25 13:31:40,2019-04-28
10,0,2019-04-28 14:30:15,2019-04-28 15:31:40,2019-04-28


In [6]:
df = df[df['timestamp_load'] < df['timestamp_exit']]
df

Unnamed: 0,user_id,timestamp_load,timestamp_exit,date_load
0,0,2019-04-25 13:30:15,2019-04-25 13:31:40,2019-04-25
1,0,2019-04-25 13:30:15,2019-04-28 15:31:40,2019-04-25
2,0,2019-04-25 13:30:18,2019-04-25 13:31:40,2019-04-25
3,0,2019-04-25 13:30:18,2019-04-28 15:31:40,2019-04-25
4,1,2019-04-25 13:40:00,2019-04-25 13:40:35,2019-04-25
5,1,2019-04-25 13:40:00,2019-04-26 11:15:35,2019-04-25
8,1,2019-04-26 11:15:00,2019-04-26 11:15:35,2019-04-26
10,0,2019-04-28 14:30:15,2019-04-28 15:31:40,2019-04-28
12,0,2019-04-28 14:30:10,2019-04-28 15:31:40,2019-04-28


In [7]:
df = df.groupby(['user_id', 'date_load']).agg({'timestamp_load': 'max', 'timestamp_exit': 'min'}).reset_index()
df

Unnamed: 0,user_id,date_load,timestamp_load,timestamp_exit
0,0,2019-04-25,2019-04-25 13:30:18,2019-04-25 13:31:40
1,0,2019-04-28,2019-04-28 14:30:15,2019-04-28 15:31:40
2,1,2019-04-25,2019-04-25 13:40:00,2019-04-25 13:40:35
3,1,2019-04-26,2019-04-26 11:15:00,2019-04-26 11:15:35


Let's go through the code step by step:

1. `df.groupby(['user_id', 'date_load'])`: Groups the DataFrame `df` by the unique values of the 'user_id' and 'date_load' columns. Each group contains all rows for one user on one page load date.

2. `.agg({'timestamp_load': 'max', 'timestamp_exit': 'min'})`: Applies an aggregation to each group. The 'max' function is used for 'timestamp_load' to get the latest load time within the group, and the 'min' function is used for 'timestamp_exit' to get the earliest exit time within the group.

3. `.reset_index()`: Resets the index after grouping and aggregating, producing a new DataFrame with a plain integer index.

As a result, the DataFrame `df` contains the unique combinations of 'user_id' and 'date_load', together with the latest load time ('timestamp_load') and the earliest exit time ('timestamp_exit') for each user on each load date.
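As a small illustration of the aggregation, the sketch below applies the same `groupby`/`agg` call to the four pairs shown earlier for user 0 on 2019-04-25. The frame name `pairs` is only illustrative.

```python
import pandas as pd

# The four load/exit pairs of user 0 on 2019-04-25, copied from the output above.
pairs = pd.DataFrame({
    "user_id": [0, 0, 0, 0],
    "date_load": ["2019-04-25"] * 4,
    "timestamp_load": pd.to_datetime([
        "2019-04-25 13:30:15", "2019-04-25 13:30:15",
        "2019-04-25 13:30:18", "2019-04-25 13:30:18",
    ]),
    "timestamp_exit": pd.to_datetime([
        "2019-04-25 13:31:40", "2019-04-28 15:31:40",
        "2019-04-25 13:31:40", "2019-04-28 15:31:40",
    ]),
})

# One row per (user, day): latest load and earliest exit.
daily = (
    pairs.groupby(["user_id", "date_load"])
    .agg({"timestamp_load": "max", "timestamp_exit": "min"})
    .reset_index()
)
print(daily)
```

The four rows collapse into one, with load 13:30:18 and exit 13:31:40, which matches the first row of the aggregated output.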

In [9]:
df['timestamp_exit'] = pd.to_datetime(df['timestamp_exit'])
df['timestamp_load'] = pd.to_datetime(df['timestamp_load'])

In [10]:
df['duration'] = df['timestamp_exit'] - df['timestamp_load']
df

Unnamed: 0,user_id,date_load,timestamp_load,timestamp_exit,duration
0,0,2019-04-25,2019-04-25 13:30:18,2019-04-25 13:31:40,0 days 00:01:22
1,0,2019-04-28,2019-04-28 14:30:15,2019-04-28 15:31:40,0 days 01:01:25
2,1,2019-04-25,2019-04-25 13:40:00,2019-04-25 13:40:35,0 days 00:00:35
3,1,2019-04-26,2019-04-26 11:15:00,2019-04-26 11:15:35,0 days 00:00:35


In [11]:
df = df[df['duration'] > '0 days']
df

Unnamed: 0,user_id,date_load,timestamp_load,timestamp_exit,duration
0,0,2019-04-25,2019-04-25 13:30:18,2019-04-25 13:31:40,0 days 00:01:22
1,0,2019-04-28,2019-04-28 14:30:15,2019-04-28 15:31:40,0 days 01:01:25
2,1,2019-04-25,2019-04-25 13:40:00,2019-04-25 13:40:35,0 days 00:00:35
3,1,2019-04-26,2019-04-26 11:15:00,2019-04-26 11:15:35,0 days 00:00:35


In [12]:
result = df.groupby('user_id')['duration'].agg(lambda x: np.mean(x)).reset_index()


In [13]:
result

Unnamed: 0,user_id,duration
0,0,0 days 00:31:23.500000
1,1,0 days 00:00:35


Solution Walkthrough
This solution aims to calculate each user's average session time. A session is defined as the time difference between a page load and page exit. The code takes in a dataframe of web logs and performs several data manipulations and aggregations to calculate the average session time for each user.

Understanding The Data
First, let's understand the data that we are working with. The dataframe facebook_web_log contains web logs of user activities on a Facebook website. It has columns such as user_id, timestamp, and action that indicate the user ID, timestamp of the event, and the action performed (e.g., page load, page exit).

The Problem Statement
We want to calculate the average session time for each user. A session is defined as the time difference between a page_load and page_exit event. We assume that a user has only 1 session per day and if there are multiple events of the same type on a given day, we consider only the latest page_load and earliest page_exit, with the restriction that the load time event should happen before the exit time event. The desired output is a table that contains the user_id and their average session time.
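To make the definition concrete, here is a short worked example of the averaging step for user 0. It uses the two per-day durations from the notebook output (1 min 22 s and 1 h 1 min 25 s).

```python
import pandas as pd

# Per-day session durations for user 0, copied from the notebook output.
durations = pd.Series(pd.to_timedelta(["00:01:22", "01:01:25"]))

# (82 s + 3685 s) / 2 = 1883.5 s, i.e. 31 minutes 23.5 seconds.
average = durations.mean()
print(average.total_seconds())  # 1883.5
```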

Breaking Down The Code
Let's break down the code step by step to understand how it solves the problem.

Step 1: Merging and Filtering the Data
The first step is to merge and filter the original dataframe (facebook_web_log). We use the merge function to combine two subsets of the dataframe:

df = pd.merge(
    facebook_web_log.loc[
        facebook_web_log["action"] == "page_load",
        ["user_id", "timestamp"],
    ],
    facebook_web_log.loc[
        facebook_web_log["action"] == "page_exit",
        ["user_id", "timestamp"],
    ],
    how="left",
    on="user_id",
    suffixes=["_load", "_exit"],
).dropna()

In this code, we select the 'page_load' and 'page_exit' events from the original dataframe using boolean indexing (facebook_web_log['action'] == 'page_load', facebook_web_log['action'] == 'page_exit'). We then merge the two subsets on the 'user_id' column with a left join (how='left'), which keeps every page_load row and pairs it with each page_exit row of the same user; suffixes=['_load', '_exit'] distinguishes the two timestamp columns. Finally, dropna() removes the page_load rows that found no matching page_exit, so only rows with both events remain.
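A left join followed by `dropna()` should keep the same rows as an inner join, provided the input columns contain no missing values of their own. The sketch below checks this on a hypothetical sample; the inner join is an alternative shown for comparison, not the method used in the solution.

```python
import pandas as pd

# Hypothetical sample (not the real data); user 1 has a load but no exit.
log = pd.DataFrame({
    "user_id": [0, 0, 0, 1],
    "timestamp": [
        "2019-04-25 13:30:15",
        "2019-04-25 13:31:40",
        "2019-04-28 15:31:40",
        "2019-04-25 13:40:00",
    ],
    "action": ["page_load", "page_exit", "page_exit", "page_load"],
})

loads = log.loc[log["action"] == "page_load", ["user_id", "timestamp"]]
exits = log.loc[log["action"] == "page_exit", ["user_id", "timestamp"]]
cols = ["user_id", "timestamp_load", "timestamp_exit"]

# Row order is not the point here, so sort both results before comparing.
left_then_drop = (
    pd.merge(loads, exits, how="left", on="user_id", suffixes=("_load", "_exit"))
    .dropna()
    .sort_values(cols)
    .reset_index(drop=True)
)
inner = (
    pd.merge(loads, exits, how="inner", on="user_id", suffixes=("_load", "_exit"))
    .sort_values(cols)
    .reset_index(drop=True)
)

same = left_then_drop.equals(inner)
print(same)
```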

Step 2: Manipulating Timestamps and Filtering
Next, we manipulate the timestamp column and further filter the data:

df["date_load"] = pd.to_datetime(df["timestamp_load"]).dt.date
df = df[df["timestamp_load"] < df["timestamp_exit"]]
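
In the first line, we convert the 'timestamp_load' column to datetime format using the to_datetime function from pandas, extract only the date portion with .dt.date, and store it in a new 'date_load' column. The 'timestamp_load' column itself is left unchanged.

In the second line, we filter the dataframe to keep only the rows where the load time is earlier than the exit time (df['timestamp_load'] < df['timestamp_exit']). At this point both columns still hold the timestamp strings read from the CSV, so the comparison is a string comparison.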
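The filter can be applied either to the raw timestamp strings or to parsed datetimes. The sketch below, using two pairs of user 1 from the merged output, suggests both give the same answer for this fixed-width, zero-padded format. Parsing first is the safer choice if the format could vary, which is an assumption about data not shown here.

```python
import pandas as pd

# Two load/exit pairs of user 1, copied from the merged output above.
pairs = pd.DataFrame({
    "timestamp_load": ["2019-04-26 11:15:00", "2019-04-25 13:40:00"],
    "timestamp_exit": ["2019-04-25 13:40:35", "2019-04-25 13:40:35"],
})

# Lexicographic comparison of the strings.
as_text = pairs["timestamp_load"] < pairs["timestamp_exit"]

# Chronological comparison after parsing.
load = pd.to_datetime(pairs["timestamp_load"])
exit_ = pd.to_datetime(pairs["timestamp_exit"])
as_time = load < exit_

print(as_text.tolist())  # [False, True]
print(as_time.tolist())  # [False, True]

# The date used for grouping comes from the load timestamp only.
print(load.dt.date.tolist())
```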

Step 3: Grouping and Aggregating Data
After filtering the data, we need to group and aggregate it to get the desired output:

df = (
    df.groupby(["user_id", "date_load"])
    .agg({"timestamp_load": "max", "timestamp_exit": "min"})
    .reset_index()
)

In this code, we group the dataframe by user_id and date_load using the groupby function. We then apply aggregations to the grouped data: we take the maximum load time and minimum exit time within each group using the agg function. Finally, we reset the index of the resulting dataframe.
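The same aggregation can also be written with pandas named aggregation, which lets the output columns say what they hold. This is an optional variant, not the form used in the solution; the column names `latest_load` and `earliest_exit` are made up for illustration.

```python
import pandas as pd

# Two load/exit pairs of user 1 on 2019-04-25, copied from the output above.
pairs = pd.DataFrame({
    "user_id": [1, 1],
    "date_load": ["2019-04-25", "2019-04-25"],
    "timestamp_load": pd.to_datetime(["2019-04-25 13:40:00", "2019-04-25 13:40:00"]),
    "timestamp_exit": pd.to_datetime(["2019-04-25 13:40:35", "2019-04-26 11:15:35"]),
})

# Named aggregation: new_column=(source_column, function).
daily = (
    pairs.groupby(["user_id", "date_load"])
    .agg(
        latest_load=("timestamp_load", "max"),
        earliest_exit=("timestamp_exit", "min"),
    )
    .reset_index()
)
print(daily)
```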

Step 4: Calculating Duration and Filtering
Next, we calculate the duration of each session and further filter the data:

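df["timestamp_load"] = pd.to_datetime(df["timestamp_load"])
df["timestamp_exit"] = pd.to_datetime(df["timestamp_exit"])
df["duration"] = df["timestamp_exit"] - df["timestamp_load"]
df = df[df["duration"] > "0 days"]

In the first two lines, we convert both timestamp columns to datetime. They were read from the CSV as strings, and strings cannot be subtracted.

In the third line, we subtract the load time from the exit time to calculate the duration of each session (df['timestamp_exit'] - df['timestamp_load']).

In the last line, we filter the dataframe to keep only the rows where the duration is greater than zero days (df['duration'] > '0 days').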
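The duration filter is not redundant with the earlier load-before-exit filter. After aggregation, the latest load of a day can still fall after the earliest exit of that day. The sketch below uses a hypothetical day with two loads and two exits (illustrative values, not from the data set) to show such a case.

```python
import pandas as pd

# Hypothetical day: loads at 10:00 and 12:00, exits at 11:00 and 13:00.
loads = pd.to_datetime(["2019-04-25 10:00:00", "2019-04-25 12:00:00"])
exits = pd.to_datetime(["2019-04-25 11:00:00", "2019-04-25 13:00:00"])

# All load/exit combinations, as the merge on user_id would produce.
pairs = pd.DataFrame(
    [(load, exit_) for load in loads for exit_ in exits],
    columns=["timestamp_load", "timestamp_exit"],
)

# The row-wise filter drops only the (12:00, 11:00) pair.
pairs = pairs[pairs["timestamp_load"] < pairs["timestamp_exit"]]

# Aggregating still combines the 12:00 load with the 11:00 exit.
latest_load = pairs["timestamp_load"].max()
earliest_exit = pairs["timestamp_exit"].min()
duration = earliest_exit - latest_load

print(duration)                    # negative: minus one hour
print(duration > pd.Timedelta(0))  # False, so this day would be dropped
```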

Step 5: Calculating Average Session Time
Finally, we calculate the average session time for each user:

result = (
    df.groupby("user_id")["duration"]
    .agg(lambda x: np.mean(x))
    .reset_index()
)

In this code, we group the dataframe by user_id and calculate the mean of the duration column within each group using the agg function and the lambda function lambda x: np.mean(x). The resulting dataframe contains the user_id and their average session time.
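If the average is needed as a plain number, the timedelta result can be converted to seconds. The sketch below runs the same aggregation on the four per-day durations from the notebook output; the `seconds` column is an addition for illustration.

```python
import numpy as np
import pandas as pd

# Per-day durations copied from the notebook output above.
sessions = pd.DataFrame({
    "user_id": [0, 0, 1, 1],
    "duration": pd.to_timedelta(["00:01:22", "01:01:25", "00:00:35", "00:00:35"]),
})

result = (
    sessions.groupby("user_id")["duration"]
    .agg(lambda x: np.mean(x))
    .reset_index()
)

# pd.to_timedelta guards against the aggregated column coming back as
# generic objects rather than a timedelta column.
result["seconds"] = pd.to_timedelta(result["duration"]).dt.total_seconds()
print(result)
```

User 0 averages 1883.5 seconds (31 min 23.5 s) and user 1 averages 35 seconds, in line with the notebook result.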

Bringing It All Together
The complete code for calculating the average session time for each user is as follows:

import pandas as pd
import numpy as np

# ... assume we have defined the dataframe `facebook_web_log` containing the web logs

df = pd.merge(
    facebook_web_log.loc[
        facebook_web_log["action"] == "page_load",
        ["user_id", "timestamp"],
    ],
    facebook_web_log.loc[
        facebook_web_log["action"] == "page_exit",
        ["user_id", "timestamp"],
    ],
    how="left",
    on="user_id",
    suffixes=["_load", "_exit"],
).dropna()

df["date_load"] = pd.to_datetime(df["timestamp_load"]).dt.date
df = df[df["timestamp_load"] < df["timestamp_exit"]]

df = (
    df.groupby(["user_id", "date_load"])
    .agg({"timestamp_load": "max", "timestamp_exit": "min"})
    .reset_index()
)

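# The timestamps were read from the CSV as strings; convert before subtracting.
df["timestamp_load"] = pd.to_datetime(df["timestamp_load"])
df["timestamp_exit"] = pd.to_datetime(df["timestamp_exit"])
df["duration"] = df["timestamp_exit"] - df["timestamp_load"]
df = df[df["duration"] > "0 days"]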

result = (
    df.groupby("user_id")["duration"]
    .agg(lambda x: np.mean(x))
    .reset_index()
)

Conclusion
The code successfully calculates each user's average session time by performing several data manipulations and aggregations. It uses pandas functions such as merge, groupby, and agg to achieve the desired result.
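As a quick end-to-end check, the steps can be wrapped in one function and run on a small in-memory sample. This is a sketch under stated assumptions: the function name `average_session_time` is hypothetical, and the sample rows are reconstructed from the outputs shown in the notebook, so the real CSV may contain more rows.

```python
import pandas as pd


def average_session_time(log: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical wrapper around the walkthrough steps (name not from the source)."""
    # Parse timestamps once, up front, so all later steps work on datetimes.
    log = log.assign(timestamp=pd.to_datetime(log["timestamp"]))
    loads = log.loc[log["action"] == "page_load", ["user_id", "timestamp"]]
    exits = log.loc[log["action"] == "page_exit", ["user_id", "timestamp"]]

    # Pair every load with every exit of the same user; drop loads without exits.
    pairs = pd.merge(
        loads, exits, how="left", on="user_id", suffixes=("_load", "_exit")
    ).dropna()
    pairs = pairs.assign(date_load=pairs["timestamp_load"].dt.date)
    pairs = pairs[pairs["timestamp_load"] < pairs["timestamp_exit"]]

    # Latest load and earliest exit per user and day.
    daily = (
        pairs.groupby(["user_id", "date_load"])
        .agg({"timestamp_load": "max", "timestamp_exit": "min"})
        .reset_index()
    )
    daily["duration"] = daily["timestamp_exit"] - daily["timestamp_load"]
    daily = daily[daily["duration"] > pd.Timedelta(0)]

    return daily.groupby("user_id")["duration"].agg(lambda x: x.mean()).reset_index()


# Sample reconstructed from the rows visible in the notebook outputs.
sample = pd.DataFrame(
    [
        (0, "2019-04-25 13:30:15", "page_load"),
        (0, "2019-04-25 13:30:18", "page_load"),
        (0, "2019-04-25 13:30:40", "scroll_down"),
        (0, "2019-04-25 13:31:40", "page_exit"),
        (1, "2019-04-25 13:40:00", "page_load"),
        (1, "2019-04-25 13:40:35", "page_exit"),
        (1, "2019-04-26 11:15:00", "page_load"),
        (1, "2019-04-26 11:15:35", "page_exit"),
        (0, "2019-04-28 14:30:15", "page_load"),
        (0, "2019-04-28 14:30:10", "page_load"),
        (0, "2019-04-28 15:31:40", "page_exit"),
    ],
    columns=["user_id", "timestamp", "action"],
)

result = average_session_time(sample)
print(result)
```

On this sample the function gives 31 min 23.5 s for user 0 and 35 s for user 1, the same values as the notebook result.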