Retention Rate

Find the monthly retention rate of users for each account separately for Dec 2020 and Jan 2021. Retention rate is the percentage of active users an account retains over a given period of time. In this case, assume the user is retained if they stay with the app in any future month. For example, if a user was active in Dec 2020 and has activity in any future month, consider them retained for Dec. You can assume all accounts are present in Dec 2020 and Jan 2021. Your output should have the account ID and the Jan 2021 retention rate divided by the Dec 2020 retention rate.

In [1]:
import pandas as pd

In [4]:
sf_events = pd.read_excel("../CSV/sf_events.xlsx", header=1)
sf_events.head()

Unnamed: 0,date,account_id,user_id
0,2021-01-01,A1,U1
1,2021-01-01,A1,U2
2,2021-01-06,A1,U3
3,2021-01-02,A1,U1
4,2020-12-24,A1,U2


In [5]:
sf_events['date'] = pd.to_datetime(sf_events['date'],format='%Y-%m-%d')
sf_events.head()

Unnamed: 0,date,account_id,user_id
0,2021-01-01,A1,U1
1,2021-01-01,A1,U2
2,2021-01-06,A1,U3
3,2021-01-02,A1,U1
4,2020-12-24,A1,U2


In [7]:
dec_2020 = sf_events[(sf_events['date'].dt.year == 2020) & (sf_events['date'].dt.month == 12)].drop_duplicates()
dec_2020

Unnamed: 0,date,account_id,user_id
4,2020-12-24,A1,U2
5,2020-12-08,A1,U1
6,2020-12-09,A1,U1
11,2020-12-17,A2,U4
12,2020-12-25,A3,U6
15,2020-12-06,A3,U7
16,2020-12-06,A3,U6
22,2020-12-05,A1,U8


In [8]:
jan_2021 = sf_events[(sf_events['date'].dt.year == 2021) & (sf_events['date'].dt.month == 1)].drop_duplicates()
jan_2021

Unnamed: 0,date,account_id,user_id
0,2021-01-01,A1,U1
1,2021-01-01,A1,U2
2,2021-01-06,A1,U3
3,2021-01-02,A1,U1
7,2021-01-10,A2,U4
8,2021-01-11,A2,U4
9,2021-01-12,A2,U4
10,2021-01-15,A2,U5
17,2021-01-14,A3,U6


In [9]:
max_date = sf_events.groupby('user_id')['date'].max().to_frame('max_date').reset_index()
max_date

Unnamed: 0,user_id,max_date
0,U1,2021-02-07
1,U2,2021-02-10
2,U3,2021-01-06
3,U4,2021-02-01
4,U5,2021-02-01
5,U6,2021-01-14
6,U7,2020-12-06
7,U8,2020-12-05


In [11]:
dec_2020 = dec_2020.drop(columns='date')
dec_2020

Unnamed: 0,account_id,user_id
4,A1,U2
5,A1,U1
6,A1,U1
11,A2,U4
12,A3,U6
15,A3,U7
16,A3,U6
22,A1,U8


In [13]:
jan_2021 = jan_2021.drop(columns='date')
jan_2021

Unnamed: 0,account_id,user_id
0,A1,U1
1,A1,U2
2,A1,U3
3,A1,U1
7,A2,U4
8,A2,U4
9,A2,U4
10,A2,U5
17,A3,U6


In [14]:
dec_2020 = dec_2020.merge(max_date, on='user_id')
dec_2020

Unnamed: 0,account_id,user_id,max_date
0,A1,U2,2021-02-10
1,A1,U1,2021-02-07
2,A1,U1,2021-02-07
3,A2,U4,2021-02-01
4,A3,U6,2021-01-14
5,A3,U7,2020-12-06
6,A3,U6,2021-01-14
7,A1,U8,2020-12-05


In [15]:
jan_2021 = jan_2021.merge(max_date, on='user_id')
jan_2021

Unnamed: 0,account_id,user_id,max_date
0,A1,U1,2021-02-07
1,A1,U2,2021-02-10
2,A1,U3,2021-01-06
3,A1,U1,2021-02-07
4,A2,U4,2021-02-01
5,A2,U4,2021-02-01
6,A2,U4,2021-02-01
7,A2,U5,2021-02-01
8,A3,U6,2021-01-14


In [16]:
dec_2020['retention'] = 0
dec_2020

Unnamed: 0,account_id,user_id,max_date,retention
0,A1,U2,2021-02-10,0
1,A1,U1,2021-02-07,0
2,A1,U1,2021-02-07,0
3,A2,U4,2021-02-01,0
4,A3,U6,2021-01-14,0
5,A3,U7,2020-12-06,0
6,A3,U6,2021-01-14,0
7,A1,U8,2020-12-05,0


In [17]:
jan_2021['retention'] = 0
jan_2021

Unnamed: 0,account_id,user_id,max_date,retention
0,A1,U1,2021-02-07,0
1,A1,U2,2021-02-10,0
2,A1,U3,2021-01-06,0
3,A1,U1,2021-02-07,0
4,A2,U4,2021-02-01,0
5,A2,U4,2021-02-01,0
6,A2,U4,2021-02-01,0
7,A2,U5,2021-02-01,0
8,A3,U6,2021-01-14,0


In [18]:
dec_2020.loc[dec_2020['max_date'] > '2020-12-31', 'retention'] = 1
dec_2020

Unnamed: 0,account_id,user_id,max_date,retention
0,A1,U2,2021-02-10,1
1,A1,U1,2021-02-07,1
2,A1,U1,2021-02-07,1
3,A2,U4,2021-02-01,1
4,A3,U6,2021-01-14,1
5,A3,U7,2020-12-06,0
6,A3,U6,2021-01-14,1
7,A1,U8,2020-12-05,0


In [19]:
jan_2021.loc[jan_2021['max_date'] > '2021-01-31', 'retention'] = 1
jan_2021

Unnamed: 0,account_id,user_id,max_date,retention
0,A1,U1,2021-02-07,1
1,A1,U2,2021-02-10,1
2,A1,U3,2021-01-06,0
3,A1,U1,2021-02-07,1
4,A2,U4,2021-02-01,1
5,A2,U4,2021-02-01,1
6,A2,U4,2021-02-01,1
7,A2,U5,2021-02-01,1
8,A3,U6,2021-01-14,0


In [20]:
retention_dec = dec_2020.groupby('account_id')['retention'].mean().to_frame('dec_retention').reset_index()
retention_dec

Unnamed: 0,account_id,dec_retention
0,A1,0.75
1,A2,1.0
2,A3,0.666667


In [21]:
retention_jan = jan_2021.groupby('account_id')['retention'].mean().to_frame('jan_retention').reset_index()
retention_jan

Unnamed: 0,account_id,jan_retention
0,A1,0.75
1,A2,1.0
2,A3,0.0


In [22]:
merged = retention_dec.merge(retention_jan, on='account_id')
merged

Unnamed: 0,account_id,dec_retention,jan_retention
0,A1,0.75,0.75
1,A2,1.0,1.0
2,A3,0.666667,0.0


In [23]:
merged['retention'] = merged['jan_retention'] / merged['dec_retention']
merged

Unnamed: 0,account_id,dec_retention,jan_retention,retention
0,A1,0.75,0.75,1.0
1,A2,1.0,1.0,1.0
2,A3,0.666667,0.0,0.0


In [24]:
result = merged[['account_id', 'retention']]
result

Unnamed: 0,account_id,retention
0,A1,1.0
1,A2,1.0
2,A3,0.0


Solution Walkthrough
This question asks us to find the monthly retention rate of users for each account separately for December 2020 and January 2021. The retention rate is the percentage of active users an account retains over a given period of time. We need to assume that a user is retained if they stay with the app in any future months. The output should consist of the account ID and the January 2021 retention rate divided by the December 2020 retention rate.

Let's break down the solution into smaller steps and understand the code and logic behind it.

Understanding The Data
The notebook loads the event log into a dataframe called sf_events. Each row is one activity event with three columns: date (the day the event occurred), account_id (the account the event belongs to), and user_id (the user who generated it). A user can have many rows, including several within the same month.

The Problem Statement
We need a retention rate per account for December 2020 and another for January 2021. For a given month, the rate is the number of users active in that month who also have activity in any later month, divided by the number of users active in that month. The output should contain the account ID and the January 2021 rate divided by the December 2020 rate.

Breaking Down The Code
Let's break down the code step by step:

import pandas as pd
First, we import the pandas library as pd to work with data in a structured manner using dataframes.

sf_events["date"] = pd.to_datetime(
    sf_events["date"], format="%Y-%m-%d"
)
In this line, we convert the date column of the sf_events dataframe into a datetime format using the pd.to_datetime() function. We specify the date format as %Y-%m-%d.
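
As a quick check of what the conversion buys us, here is a minimal sketch on a hypothetical three-row sample (the values are made up, not read from the Excel file). Once the column is a datetime type, the .dt accessor used in the following steps becomes available.

```python
import pandas as pd

# Hypothetical sample; the real sf_events is loaded from the Excel file.
sample = pd.DataFrame({"date": ["2020-12-24", "2021-01-01", "2021-02-10"]})
sample["date"] = pd.to_datetime(sample["date"], format="%Y-%m-%d")

# With a datetime column, year and month can be read through .dt.
years = sample["date"].dt.year.tolist()
months = sample["date"].dt.month.tolist()
print(years, months)
```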

dec_2020 = sf_events[
    (sf_events["date"].dt.year == 2020)
    & (sf_events["date"].dt.month == 12)
][["account_id", "user_id"]].drop_duplicates()
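Here we keep only the events from December 2020, select the account_id and user_id columns, and remove duplicate rows with drop_duplicates(), so dec_2020 holds one row per active user per account. The order matters: the notebook cells above call drop_duplicates() while the date column is still present, so a user active on several days stays duplicated there (U1 appears twice under A1), which is why the notebook shows a December rate of 0.75 for A1 where this version gives 2/3. For this data the final ratio is the same either way.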

jan_2021 = sf_events[
    (sf_events["date"].dt.year == 2021)
    & (sf_events["date"].dt.month == 1)
][["account_id", "user_id"]].drop_duplicates()
The same filter with year 2021 and month 1 gives jan_2021: one row per user per account with activity in January 2021.
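
Since the two filters differ only in the month, they could be folded into one helper. The sketch below is one possible way to do that, using a hypothetical function name active_users and made-up sample rows; it compares monthly periods instead of testing year and month separately.

```python
import pandas as pd

# Hypothetical sample with the same columns as sf_events.
events = pd.DataFrame({
    "date": pd.to_datetime(["2020-12-08", "2020-12-09", "2020-12-24", "2021-01-01"]),
    "account_id": ["A1", "A1", "A1", "A1"],
    "user_id": ["U1", "U1", "U2", "U1"],
})

def active_users(df, month):
    """Unique (account_id, user_id) pairs with an event in `month`, e.g. "2020-12"."""
    in_month = df["date"].dt.to_period("M") == pd.Period(month, freq="M")
    # Select the two key columns first so that deduplication ignores the date.
    return df.loc[in_month, ["account_id", "user_id"]].drop_duplicates()

dec = active_users(events, "2020-12")
jan = active_users(events, "2021-01")
print(dec)
```

U1 has two December events in the sample but appears once in dec, which is what the retention average needs.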

max_date = (
    sf_events.groupby("user_id")["date"]
    .max()
    .to_frame("max_date")
    .reset_index()
)
Here we group sf_events by user_id and take the latest date for each user with max(), which gives the user's most recent activity across the whole dataset. to_frame() turns the result into a dataframe with a column named max_date, and reset_index() moves user_id from the index back into an ordinary column. The result is stored in a dataframe called max_date. Note that the grouping is by user only, so it assumes each user belongs to a single account.
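
If a user could appear under more than one account, grouping by user_id alone would let activity in one account count as retention in another. Whether that is desirable depends on how the question is read; the sketch below only illustrates the difference, on hypothetical rows in which U8 shows up in two accounts.

```python
import pandas as pd

# Hypothetical rows: U8 is active under A1 in December and under A2 in January.
events = pd.DataFrame({
    "date": pd.to_datetime(["2020-12-05", "2021-01-10", "2021-02-01", "2020-12-17"]),
    "account_id": ["A1", "A2", "A2", "A2"],
    "user_id": ["U8", "U8", "U4", "U4"],
})

# Per-user maximum, as in the walkthrough.
user_max = events.groupby("user_id")["date"].max()

# Per (account, user) maximum: activity elsewhere no longer counts.
last_seen = (
    events.groupby(["account_id", "user_id"])["date"]
    .max()
    .to_frame("max_date")
    .reset_index()
)
print(last_seen)
```

With the per-pair version, the later merge would need on=["account_id", "user_id"] instead of on="user_id".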

dec_2020 = dec_2020.merge(max_date, on="user_id")
jan_2021 = jan_2021.merge(max_date, on="user_id")
In these two lines, we merge the dec_2020 and jan_2021 dataframes with the max_date dataframe based on the common column user_id. This allows us to add the max_date column to each of the dataframes.
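
The merge relies on max_date having exactly one row per user. A small sketch of how that assumption could be checked at merge time is shown below; the validate argument is an addition for illustration and is not part of the original solution, and the rows are a subset of the values shown in the notebook output.

```python
import pandas as pd

active = pd.DataFrame({
    "account_id": ["A1", "A1", "A3"],
    "user_id": ["U1", "U2", "U7"],
})
max_date = pd.DataFrame({
    "user_id": ["U1", "U2", "U7"],
    "max_date": pd.to_datetime(["2021-02-07", "2021-02-10", "2020-12-06"]),
})

# validate="many_to_one" raises MergeError if max_date has duplicate user_id values.
with_last = active.merge(max_date, on="user_id", how="left", validate="many_to_one")
print(with_last)
```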

dec_2020["retention"] = 0
jan_2021["retention"] = 0
In these two lines, we create a new column called retention in the dec_2020 and jan_2021 dataframes and initialize its values to 0. This column will be used to track whether a user is retained or not.

dec_2020.loc[dec_2020["max_date"] > "2020-12-31", "retention"] = 1
jan_2021.loc[jan_2021["max_date"] > "2021-01-31", "retention"] = 1
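These two statements set retention to 1 where the user's max_date falls after the end of the month in question: after '2020-12-31' for dec_2020 and after '2021-01-31' for jan_2021. pandas parses the string on the right-hand side as a timestamp (midnight on that day), and the dates here carry no time of day, so the comparison is true for any activity from the first day of the following month onward. Rows that fail the test keep the initial 0.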
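
The initialise-then-update pattern can also be written as a single boolean comparison cast to int. The sketch below shows that variant and, as an assumption beyond the original data, what happens if timestamps carried a time of day: the hypothetical user U9 is last seen at 10:00 on 31 December, which a midnight cutoff would wrongly treat as a future month.

```python
import pandas as pd

dec = pd.DataFrame({
    "user_id": ["U2", "U8", "U9"],  # U9 is hypothetical
    "max_date": pd.to_datetime(
        ["2021-02-10 00:00", "2020-12-05 00:00", "2020-12-31 10:00"]
    ),
})

# Midnight cutoff, equivalent to comparing against "2020-12-31".
midnight_flag = (dec["max_date"] > pd.Timestamp("2020-12-31")).astype(int)

# Cutoff at the last instant of the month, robust to time-of-day values.
month_end = pd.Period("2020-12", freq="M").end_time
dec["retention"] = (dec["max_date"] > month_end).astype(int)
print(dec)
```

For date-only data such as sf_events the two cutoffs give the same flags.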

retention_dec = (
    dec_2020.groupby("account_id")["retention"]
    .mean()
    .to_frame("dec_retention")
    .reset_index()
)
retention_jan = (
    jan_2021.groupby("account_id")["retention"]
    .mean()
    .to_frame("jan_retention")
    .reset_index()
)
These two statements compute the retention rate per account for December 2020 and January 2021. We group dec_2020 and jan_2021 by account_id and take the mean of the retention column in each group. Because retention holds only 0 or 1, its mean is the fraction of rows flagged as retained, which is the retention rate as long as each user appears once per account. to_frame() names the resulting columns dec_retention and jan_retention, and the results are stored in retention_dec and retention_jan.
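
To see why the mean of a 0/1 column is a rate, the sketch below compares it with an explicit retained-over-active count on hypothetical flags.

```python
import pandas as pd

# Hypothetical flags, one row per user per account.
flags = pd.DataFrame({
    "account_id": ["A1", "A1", "A1", "A2"],
    "user_id": ["U1", "U2", "U3", "U4"],
    "retention": [1, 1, 0, 1],
})

by_mean = flags.groupby("account_id")["retention"].mean()

# Same quantity written out: retained users divided by active users.
grouped = flags.groupby("account_id")["retention"]
by_count = grouped.sum() / grouped.count()
print(by_mean)
```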

merged = retention_dec.merge(retention_jan, on="account_id")
In this line, we merge the retention_dec and retention_jan dataframes based on the common column account_id. This allows us to combine the retention rates for December 2020 and January 2021 for each account into a single dataframe called merged.

merged["retention"] = (
    merged["jan_retention"] / merged["dec_retention"]
)
Here we add a retention column to merged by dividing jan_retention by dec_retention. Despite its name, this column is not itself a retention rate but the ratio of the January 2021 rate to the December 2020 rate that the question asks for.
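
The division is undefined when an account retained nobody in December. The original data has no such account, but as a sketch of one way to handle it, the example below masks zero denominators so the ratio comes out as NaN rather than inf. The first three rows use the rates from the notebook output; account A4 is hypothetical.

```python
import pandas as pd

merged = pd.DataFrame({
    "account_id": ["A1", "A2", "A3", "A4"],  # A4 is hypothetical
    "dec_retention": [0.75, 1.0, 2 / 3, 0.0],
    "jan_retention": [0.75, 1.0, 0.0, 0.5],
})

# Replace zero denominators with NaN so the ratio is reported as missing.
safe_dec = merged["dec_retention"].where(merged["dec_retention"] != 0)
merged["retention"] = merged["jan_retention"] / safe_dec
print(merged)
```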

result = merged[["account_id", "retention"]]
Finally, we select only the account_id and retention columns from merged and store them in result. This dataframe is the final output: the account ID and the ratio of the January rate to the December rate.

Bringing It All Together
To summarize, the code filters the events for each month down to the users active in that month, finds each user's latest activity date, flags a user as retained when that date falls after the end of the month, and averages the flags per account to get the December 2020 and January 2021 retention rates. Finally, it divides the January rate by the December rate and returns a dataframe with the account ID and that ratio.
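
The steps can also be gathered into a single function. The version below is a sketch rather than the notebook's code: the function name retention_ratio is made up, the December and January rows are copied from the notebook output, and the February rows are reconstructed from the max_date table, so their account_id values are an assumption.

```python
import pandas as pd

def retention_ratio(events):
    """Return account_id and Jan 2021 retention divided by Dec 2020 retention."""
    events = events.assign(date=pd.to_datetime(events["date"]))
    # Latest activity per user, as in the walkthrough.
    last_seen = events.groupby("user_id")["date"].max().rename("max_date").reset_index()

    def monthly_rate(month, name):
        period = pd.Period(month, freq="M")
        in_month = events["date"].dt.to_period("M") == period
        active = events.loc[in_month, ["account_id", "user_id"]].drop_duplicates()
        active = active.merge(last_seen, on="user_id")
        # Retained means any activity after the month has ended.
        active["retention"] = (active["max_date"] > period.end_time).astype(int)
        return active.groupby("account_id")["retention"].mean().rename(name)

    rates = pd.concat(
        [monthly_rate("2020-12", "dec_retention"),
         monthly_rate("2021-01", "jan_retention")],
        axis=1,
    )
    rates["retention"] = rates["jan_retention"] / rates["dec_retention"]
    return rates["retention"].reset_index()

rows = [
    ("2021-01-01", "A1", "U1"), ("2021-01-01", "A1", "U2"), ("2021-01-06", "A1", "U3"),
    ("2021-01-02", "A1", "U1"), ("2020-12-24", "A1", "U2"), ("2020-12-08", "A1", "U1"),
    ("2020-12-09", "A1", "U1"), ("2021-01-10", "A2", "U4"), ("2021-01-11", "A2", "U4"),
    ("2021-01-12", "A2", "U4"), ("2021-01-15", "A2", "U5"), ("2020-12-17", "A2", "U4"),
    ("2020-12-25", "A3", "U6"), ("2020-12-06", "A3", "U7"), ("2020-12-06", "A3", "U6"),
    ("2021-01-14", "A3", "U6"), ("2020-12-05", "A1", "U8"),
    # February rows: dates from the max_date table, account_id assumed.
    ("2021-02-07", "A1", "U1"), ("2021-02-10", "A1", "U2"),
    ("2021-02-01", "A2", "U4"), ("2021-02-01", "A2", "U5"),
]
events = pd.DataFrame(rows, columns=["date", "account_id", "user_id"])
result = retention_ratio(events)
print(result)
```

Because this version deduplicates users before averaging, its intermediate rates are 2/3 and 1/2 for A1 and A3 in December, while the ratios match the notebook's final result.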

Conclusion
In this walkthrough, we have explained the code that calculates the monthly retention rate for each account separately for December 2020 and January 2021. The code involves manipulating dataframes using pandas and performing calculations to determine the retention rate.