**Purpose:** This notebook cleans `ongouser_activity_stream.csv` so that it can be aggregated (Part C, Notebook 2) and a subscription column can be added. After that I will build a machine learning model to understand which user behaviors drive subscription (that is, what we can learn from a user's behavior to predict whether they will subscribe).

In [1]:
import numpy as np
import pandas as pd
# ignore warnings
import warnings
warnings.filterwarnings('ignore')

In [2]:
df_z = pd.read_csv('ongouser_activity_stream.csv',low_memory=False)
df_z.drop(columns=['Unnamed: 0'],inplace=True)
df_z.head(2)

Unnamed: 0,user_id,source,activity_datetime,activity,activity_occurrence,feature_1,feature_2,feature_3
0,9ba427e7-34ce-4c0b-914d-4f34432864a1,,2019-11-30 20:07:51.459000,answered_question,1,I’m new to running and have pain in ITBand and...,I may not have gotten to the couch stretch sec...,The Run Experience
1,87836262-e847-48a4-99c5-bfb8dbae86c8,,2019-12-09 02:10:30.015000,answered_question,1,"Hi, where would I find the marathon training p...",Hi. I would like that as well 🙂 please.,The Run Experience


In [3]:
# ongouser_modeling_ds-2.csv contains the subscription information
df_x = pd.read_csv('ongouser_modeling_ds-2.csv', parse_dates=['joined_community_at','converted_to_started_subscription_at'])
df_x.head(2)

Unnamed: 0.1,Unnamed: 0,joined_community_at,community_type,goal,community,metric_started_app_session_week1,metric_started_app_session_week2,metric_started_app_session_week3,metric_started_app_session_week4,metric_complete_session_gettingstarted_week4,...,metric_session_start_day_0,metric_session_start_day_1_30,metric_session_start_day_30_60,metric_session_start_day_60_up,_id,flag_startsession_b4_subs,flag_starttrial_b4_subs,flag_companysession_b4_subs,flag_compquickstart_b4_subs,flag_enrolprgm_b4_subs
0,1,2020-01-17 15:50:00,running,running.health,The Run Experience,1.0,1.0,1.0,1.0,1.0,...,1,0,0,0,bc03395a-222e-42d3-be3f-d88878b57640,,,,,
1,2,2020-01-17 15:43:00,running,running.race,The Run Experience,1.0,1.0,1.0,1.0,0.0,...,0,0,0,0,c4b763d9-2733-4d89-b889-3b2d92e8e2a1,,,,,


In [4]:
print(f'Shape of Ongouser_modeling_ds rows: {df_x.shape[0]}, columns: {df_x.shape[1]}')

# Label the unique ids of ongouser with easy-to-read integer ids
dicti = {uid: i for i, uid in enumerate(df_x['_id'].unique())}
df_x['easy_id_4_subscription'] = df_x['_id'].map(dicti)

# Map each easy id to the date that user joined the community
dicti2 = dict(zip(df_x['easy_id_4_subscription'], df_x['joined_community_at']))

Shape of Ongouser_modeling_ds rows: 19778, columns: 45
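
The id relabelling above can also be done with `pd.factorize`, which assigns consecutive integer codes in order of first appearance. The following is a minimal sketch on made-up ids; the `toy` frame, the `easy_id` column and the `id_to_easy` dict are illustrative names, not part of the notebook's data.

```python
import pandas as pd

# Toy stand-in for df_x; the ids below are invented for illustration
toy = pd.DataFrame({'_id': ['u-b', 'u-a', 'u-b', 'u-c']})

# factorize returns integer codes (order of first appearance) and the unique values
codes, uniques = pd.factorize(toy['_id'])
toy['easy_id'] = codes

# Rebuild the same kind of lookup the loop produces: original id -> easy id
id_to_easy = {uid: i for i, uid in enumerate(uniques)}
print(toy)
print(id_to_easy)
```

One difference to keep in mind: by default `pd.factorize` codes missing ids as -1 instead of giving them their own label.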


In [5]:
# Create a new date mapped specific to each user (2 months after joining the community)
# 2 months because we are trying to predict subscription 2 months after joining the community (Free-tier)
df_x['2month_4m_joining'] = df_x['joined_community_at'] + pd.DateOffset(months=2)

# Mapping the user_ids of OngouserActivity (dataframe To be aggregated) with same ids from other .csv
df_z['easy_id']=df_z['user_id'].map(dicti)
df_z[['user_id','easy_id']].nunique()

user_id    762
easy_id    591
dtype: int64

Some users (171 of the 762) in activity_stream are not present in the Ongo-modeling data, so we have to remove them.
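
The gap between the two counts comes from `map` returning `NaN` for ids that are missing from the lookup dict, and `nunique` ignoring `NaN`. A small sketch of how the unmatched users could be listed explicitly, using invented ids (`activity` and `id_map` are hypothetical stand-ins for `df_z` and `dicti`):

```python
import pandas as pd

# Toy stand-ins: ids are invented for illustration
activity = pd.DataFrame({'user_id': ['a', 'a', 'b', 'c', 'd']})
id_map = {'a': 0, 'b': 1}          # plays the role of dicti

activity['easy_id'] = activity['user_id'].map(id_map)

# Users whose id has no entry in the mapping end up with NaN
unmatched = activity.loc[activity['easy_id'].isna(), 'user_id'].unique()
print(len(unmatched), sorted(unmatched))
```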

In [6]:
df_z = df_z.dropna(subset=['easy_id'])
df_z[['user_id','easy_id']].nunique()

user_id    591
easy_id    591
dtype: int64

In [7]:
# activity_datetime needs to be a datetime column
df_z['activity_datetime'] = pd.to_datetime(df_z['activity_datetime'], errors='coerce')
# Timezone info has to be removed before comparing with tz-naive datetime columns (left disabled here)
# df_z['activity_datetime'] = df_z['activity_datetime'].dt.tz_localize(None)

# finding common ids between Ongouser_modeling and OngoUser_Activity
common_id_finding2 = df_z['easy_id'].to_list()
filt = df_x['easy_id_4_subscription'].isin(common_id_finding2)
print(f'Before selecting only common users, Ongouser_modeling data had rows: {df_x.shape[0]} and columns: {df_x.shape[1]}')
common_agg_df_2 = df_x[filt]
print(f'After selecting only common users, OngoModelling data had rows: {common_agg_df_2.shape[0]} and columns: {common_agg_df_2.shape[1]}')

Before selecting only common users, Ongouser_modeling data had rows: 19778 and columns: 47
After selecting only common users, OngoModelling data had rows: 591 and columns: 47
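
The commented-out line in the cell above hints at timezone handling. As a hedged illustration of why that matters: pandas typically refuses to order tz-aware values against tz-naive ones, and `dt.tz_localize(None)` drops the timezone while keeping the wall-clock time. The timestamps and the `naive_cutoff` value below are invented.

```python
import pandas as pd

# Toy timestamps carrying a UTC offset (invented values)
aware = pd.Series(pd.to_datetime(['2019-11-30 20:07:51+00:00',
                                  '2019-12-09 02:10:30+00:00'], utc=True))
naive_cutoff = pd.Timestamp('2019-12-01')

# Ordering tz-aware values against a tz-naive timestamp is expected to raise TypeError
try:
    aware <= naive_cutoff
    comparison_failed = False
except TypeError:
    comparison_failed = True

# tz_localize(None) drops the timezone while keeping the wall-clock time
naive = aware.dt.tz_localize(None)
mask = naive <= naive_cutoff
print(comparison_failed, mask.tolist())
```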


For users who have subscribed (my main target in the Ongouser_modeling data), we know both the date they joined the Ongo community (`joined_community_at`) and the date they subscribed (`converted_to_started_subscription_at`). We need the activity of subscribers between these two dates, and the activity of non-subscribers over whatever timeframe is available for them, capped at 2 months after joining, since they never subscribed. So I need to filter out any subscriber activity recorded after the subscription date.

In [8]:
# Only considering the users that subscribed
common_agg_df_3 = common_agg_df_2[common_agg_df_2['converted_to_started_subscription_at'].notna()]
print(f'Total people who subscribed out of {common_agg_df_2.shape[0]} Ongo users : {common_agg_df_3.shape[0]}')

Total people who subscribed out of 591 Ongo users : 312


Selecting the subscribed users from **Ongouser_activity.csv** (only rows that fall between the `joined_community_at` and `converted_to_started_subscription_at` dates, and no later than 2 months after joining)

In [9]:
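# tqdm_notebook is deprecated at the top level of tqdm; import the notebook bar under the same name
from tqdm.notebook import tqdm as tqdm_notebook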
agg3 = [] 
for index1,row1 in tqdm_notebook(df_z.iterrows()):
    for index2,row2 in common_agg_df_3.iterrows():
        if row1['easy_id'] == row2['easy_id_4_subscription']:
            if (row1['activity_datetime']<=row2['converted_to_started_subscription_at']):
                if (row1['activity_datetime']<=row2['2month_4m_joining']):
                    agg3.append(row1)
                    
# Subscribed user info from Ongouser_activity between joined_community_at & converted_to_started_subscription_at dates
agg3_df = pd.DataFrame(agg3)
agg3_df.shape

(90524, 9)
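
The nested loop above compares every activity row with every subscriber row, which gets slow as the tables grow. A merge followed by a boolean mask applies the same two conditions column-wise. This is a sketch on invented data, assuming one row per user in the subscriber table; it has not been checked against the row count above.

```python
import pandas as pd

# Toy stand-ins with invented values; column names follow the notebook
activity = pd.DataFrame({
    'easy_id': [0, 0, 0, 1],
    'activity_datetime': pd.to_datetime(
        ['2020-01-05', '2020-02-20', '2020-04-01', '2020-01-10']),
})
subscribers = pd.DataFrame({
    'easy_id_4_subscription': [0],
    'converted_to_started_subscription_at': pd.to_datetime(['2020-03-01']),
    '2month_4m_joining': pd.to_datetime(['2020-03-17']),
})

# Inner merge keeps only activity rows that belong to a subscriber
merged = activity.merge(subscribers, left_on='easy_id',
                        right_on='easy_id_4_subscription', how='inner')

# Same two conditions as the nested loop, evaluated column-wise
keep = (
    (merged['activity_datetime'] <= merged['converted_to_started_subscription_at'])
    & (merged['activity_datetime'] <= merged['2month_4m_joining'])
)
filtered = merged.loc[keep, activity.columns]
print(filtered)
```

Selecting `activity.columns` at the end drops the helper columns brought in by the merge, so the result has the same shape as the rows collected by the loop.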

In [10]:
# Only considering the users that DID NOT subscribe
common_agg_df_4 = common_agg_df_2[common_agg_df_2['converted_to_started_subscription_at'].isna()]
print(f'Total people who did NOT subscribe out of {common_agg_df_2.shape[0]} Ongo users : {common_agg_df_4.shape[0]}')

Total people who did NOT subscribe out of 591 Ongo users : 279


Selecting the non-subscribed users from **Ongouser_activity.csv**. For them, the timeframe is the first 2 months after joining the community.

In [11]:
agg4 = [] 
for index1,row1 in tqdm_notebook(df_z.iterrows()):
    for index2,row2 in common_agg_df_4.iterrows():
        if row1['easy_id'] == row2['easy_id_4_subscription']:
            if (row1['activity_datetime']<=row2['2month_4m_joining']):
                agg4.append(row1)
                
# NON-Subscribed user info from Ongouser Activity
agg4_df = pd.DataFrame(agg4)
agg4_df.shape

(106655, 9)
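
Subscribers and non-subscribers differ only in their cutoff date: the earlier of the subscription date and the 2-month date for subscribers, and the 2-month date alone for non-subscribers. One possible way to handle both groups in a single pass is to compute one cutoff per user and look it up for each activity row. The data below is invented, and `cutoff_by_id` is a hypothetical helper name.

```python
import pandas as pd

# Toy stand-in for the per-user table (invented values); user 1 never subscribed
users = pd.DataFrame({
    'easy_id_4_subscription': [0, 1],
    'converted_to_started_subscription_at': pd.to_datetime(['2020-03-01', None]),
    '2month_4m_joining': pd.to_datetime(['2020-03-17', '2020-03-10']),
})
activity = pd.DataFrame({
    'easy_id': [0, 0, 1, 1],
    'activity_datetime': pd.to_datetime(
        ['2020-02-20', '2020-03-05', '2020-03-09', '2020-03-11']),
})

converted = users['converted_to_started_subscription_at']
two_month = users['2month_4m_joining']

# Non-subscribers (NaT) fall back to the 2-month date ...
cutoff = converted.fillna(two_month)
# ... and subscribers are capped at the 2-month date as well
cutoff = cutoff.where(cutoff <= two_month, two_month)
cutoff_by_id = dict(zip(users['easy_id_4_subscription'], cutoff))

# Look up each activity row's cutoff and keep rows on or before it
keep = activity['activity_datetime'] <= activity['easy_id'].map(cutoff_by_id)
filtered = activity[keep]
print(filtered)
```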

Joining the rows from **Ongouser_activity.csv** for subscribers (between the `joined_community_at` and `converted_to_started_subscription_at` dates) and non-subscribers

In [12]:
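# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
final_df2_2be_agg = pd.concat([agg3_df, agg4_df], ignore_index=True)
# Sanity check: number of unique users left after filtering (remove this line later)
final_df2_2be_agg['easy_id'].nunique()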

553

In [13]:
final_df2_2be_agg['joined_community_at'] = final_df2_2be_agg['easy_id'].map(dicti2)
print(f'Overall shape of data that I need to aggregate, rows: {final_df2_2be_agg.shape[0]} and columns: {final_df2_2be_agg.shape[1]}')

final_df2_2be_agg.to_csv("clean_NewData.csv", index=True)
df2 = pd.read_csv('clean_NewData.csv')
df2.head(2)

Overall shape of data that I need to aggregate, rows: 197179 and columns: 10


Unnamed: 0.1,Unnamed: 0,user_id,source,activity_datetime,activity,activity_occurrence,feature_1,feature_2,feature_3,easy_id,joined_community_at
0,0,0f69074a-3822-4e9a-8424-20c28419528a,segment_tre,2019-08-30 01:32:16.491,became_adopted_user,1.0,30,,,16999.0,2019-08-16 16:18:00
1,1,ee75a716-67af-47c9-93de-618817bae00d,segment_tre,2019-09-06 16:51:47.062,became_adopted_user,1.0,30,,,14904.0,2019-09-02 17:23:00
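
The extra `Unnamed: 0` column in the output above appears because the row index is written to the CSV and read back as an ordinary column. If that is not wanted, a round trip along these lines avoids it and also restores the datetime dtype. The sketch uses an in-memory buffer and invented values rather than the notebook's file.

```python
import io
import pandas as pd

# Toy frame with invented values
df = pd.DataFrame({
    'easy_id': [16999, 14904],
    'joined_community_at': pd.to_datetime(['2019-08-16 16:18:00',
                                           '2019-09-02 17:23:00']),
})

buffer = io.StringIO()            # in-memory stand-in for a file on disk
df.to_csv(buffer, index=False)    # do not write the row index as a column
buffer.seek(0)

# parse_dates restores the datetime dtype that CSV text does not preserve
restored = pd.read_csv(buffer, parse_dates=['joined_community_at'])
print(restored.columns.tolist())
print(restored.dtypes)
```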


**Next step: Aggregate this DataFrame, add a binary feature (whether the user synced their historic data from a third-party integration) and other important features that separate the clusters to the aggregated data, and then predict subscription**

In [14]:
#### Next: Part C. 2. Aggregating_Ongo_UserActivity-Stream_with_AppleHealthKit_G.ipynb ####