# Task Functionality User Sample

To help the research team find Gtmhub users who use the task functionality and are candidates for interviews, the data science and analytics team gathered, transformed, and extracted data on users who met the research criteria. The research plan is available [here](https://dovetailapp.com/projects/2IUhqbkGJ73oTG1mfsTIRq/readme).

Steps:
1. Get User and Account Data from Azure Data Lake.
    - This data comes from the Gtmhub Raw data set in Gtmhub.
    - There are two sets of data, one from EU and one from US, so we combine them.
2. Clean up and filter the user and account data.
    - Remove unnecessary fields.
    - Keep users who are English speakers, active, and created more than 6 months ago.
    - Filter accounts for only active accounts.
3. Get Backend Users from Redshift backend schema (in Azure Data Lake).
    - These users contain additional information that the raw set of users does not contain (e.g., email, name, etc.).
4. Clean up Backend Users and merge with Gtmhub Raw users.
5. Join user and account data.
6. Get task related data from Azure Data Lake (Redshift backend schema).
    - Three different event tables: task_created, task_modified, task_deleted.
7. Group and combine task related data by user.
8. Join task data with user data.
9. Get HubSpot contacts and companies.
    - Contacts come from a separate script, `contacts.py` in the hubspot_tap repository.
    - Companies come from a separate script, `companies.py` in the hubspot_tap repository.
10. Clean up and combine the HubSpot contacts and companies.
11. Get Chargebee subscriptions.
    - This data comes from the chargebee_rest_subscriptions_all table from data sources in Gtmhub insights.
    - SQL query is below.
12. Join subscription data with user data.
13. Join HubSpot data with user data.
14. Output file to CSV for delivery to research team.

In [1]:
# Imports
from azure.storage.blob import BlobServiceClient
from dotenv import load_dotenv
import pandas as pd
import datetime
import json
import os

In [2]:
# Environment vars
load_dotenv()
connect_str = os.getenv('AZURE_STORAGE_CONNECTION_STRING')

In [3]:
# Instantiate blob service client
try:
    blob_service_client = BlobServiceClient.from_connection_string(connect_str)
except Exception as e:
    print(f'Unable to connect to BlobServiceClient: {e}')
    # Re-raise so later cells do not fail on an undefined blob_service_client
    raise

In [4]:
def get_json_blob_as_df(blob_client, container, blob_path):
    """
    Get a JSON blob from Azure and read it into a pandas dataframe.
    Params:
        :blob_client (BlobServiceClient object): Azure blob service client object
        :container (str): Name of the Azure storage container
        :blob_path (str): Name of the Azure blob
    """
    blob_client_instance = blob_client.get_blob_client(container, blob_path)
    streamdownloader = blob_client_instance.download_blob()
    file_reader = json.loads(streamdownloader.readall())
    df = pd.DataFrame(file_reader)
    return df
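
The parsing step in `get_json_blob_as_df` can be exercised without an Azure connection. The sketch below assumes each blob holds a JSON array of records, which is what the `pd.DataFrame(...)` call implies. `FakeServiceClient`, `FakeBlobClient`, and `FakeDownloader` are hypothetical stand-ins written for this example and are not part of the Azure SDK; they only mimic the three calls the function makes.

```python
import json
import pandas as pd

class FakeDownloader:
    """Hypothetical stand-in for the object returned by download_blob()."""
    def __init__(self, payload: bytes):
        self._payload = payload

    def readall(self) -> bytes:
        return self._payload

class FakeBlobClient:
    """Hypothetical stand-in for a single-blob client."""
    def __init__(self, payload: bytes):
        self._payload = payload

    def download_blob(self):
        return FakeDownloader(self._payload)

class FakeServiceClient:
    """Hypothetical stand-in for the service client, backed by a dict."""
    def __init__(self, blobs):
        self._blobs = blobs

    def get_blob_client(self, container, blob_path):
        return FakeBlobClient(self._blobs[(container, blob_path)])

def get_json_blob_as_df(blob_client, container, blob_path):
    # Same three steps as the notebook function: locate, download, parse
    blob = blob_client.get_blob_client(container, blob_path)
    records = json.loads(blob.download_blob().readall())
    return pd.DataFrame(records)

# Made-up records and names, used only to drive the function
payload = json.dumps([
    {"id": "a1", "isactive": True},
    {"id": "b2", "isactive": False},
]).encode("utf-8")
fake_client = FakeServiceClient({("demo-container", "users.json"): payload})
demo_df = get_json_blob_as_df(fake_client, "demo-container", "users.json")
```

Passing the client in as a parameter, as the notebook already does, is what makes this kind of substitution possible.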

In [5]:
# Get users and accounts
eu_users = get_json_blob_as_df(blob_service_client, "researchanalyticsinsights", "Unprocessed/Gtmhub/2022/03/02/gtmhubrawuserseu.json")
us_users = get_json_blob_as_df(blob_service_client, "researchanalyticsinsights", "Unprocessed/Gtmhub/2022/03/02/gtmhubrawusersus.json")
eu_accounts = get_json_blob_as_df(blob_service_client, "researchanalyticsinsights", "Unprocessed/Gtmhub/2022/03/02/gtmhubrawaccountseu.json")
us_accounts = get_json_blob_as_df(blob_service_client, "researchanalyticsinsights", "Unprocessed/Gtmhub/2022/03/02/gtmhubrawaccountsus.json")

In [6]:
# Combine user & account dfs
users_df = pd.concat([eu_users, us_users])
accounts_df = pd.concat([eu_accounts, us_accounts])
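
When combining the EU and US extracts, it can help to record which region each row came from and to reset the index so row labels are not repeated. The following is a minimal sketch on made-up frames; the `region` column and the `eu_demo`/`us_demo` names are assumptions for illustration and do not exist in the source data.

```python
import pandas as pd

# Toy stand-ins for the regional extracts
eu_demo = pd.DataFrame({"id": ["eu1", "eu2"], "isactive": [True, False]})
us_demo = pd.DataFrame({"id": ["us1"], "isactive": [True]})

# Tag each frame with its region, then stack them with a fresh 0..n-1 index
combined = pd.concat(
    [eu_demo.assign(region="eu"), us_demo.assign(region="us")],
    ignore_index=True,
)

# Quick check that no id appears in both regions
has_duplicate_ids = combined["id"].duplicated().any()
```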

In [7]:
users_df.head()

Unnamed: 0,id,clientid,language,accountid,datecreated,additionalinvitationsleft,isactive,data_source_id,sync_date
0,573d9359ed915d00052efb10,auth0|5645157bf1fe5dfc60f9bdee,english,573d9359ed915d00052efb0f,2016-05-19T00:00:00,-1,True,account_573dbb12ed915d0005cc2c46,2022-02-09T00:00:00
1,573d93d9ed915d00052efb6a,auth0|573d93d07fc909a622626057,english,573d93d9ed915d00052efb69,2016-05-19T00:00:00,-1,True,account_573dbb12ed915d0005cc2c46,2022-02-09T00:00:00
2,573db6aeed915d0005cc2bc5,auth0|573db6a20a999f9358843d40,english,573db6aeed915d0005cc2bc4,2016-05-19T00:00:00,-1,True,account_573dbb12ed915d0005cc2c46,2022-02-09T00:00:00
3,573dbb12ed915d0005cc2c47,waad|rHo1ZMfofQ4UGF5b4rLcQtJ7E1V5ZExBzGaISTKZjHA,english,573dbb12ed915d0005cc2c46,2016-05-19T00:00:00,-1,True,account_573dbb12ed915d0005cc2c46,2022-02-09T00:00:00
4,573dbb61ed915d0005cc2c4c,auth0|573dbafc7fc909a6226264f0,english,573dbb12ed915d0005cc2c46,2016-05-19T00:00:00,-1,True,account_573dbb12ed915d0005cc2c46,2022-02-09T00:00:00


In [8]:
# Remove unneeded columns
users_df = users_df.drop(['clientid', 'additionalinvitationsleft', 'data_source_id', 'sync_date'], axis=1)
# Remove non-english and inactive users
users_df = users_df[(users_df['language'] == 'english') & (users_df['isactive'] == True)]
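# Remove users created less than 60 days ago
# NOTE: the steps list says 6 months and the variable is named six_months_ago, but the cutoff used here is 60 days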
users_df['datecreated'] = pd.to_datetime(users_df['datecreated'], format='%Y-%m-%dT%H:%M:%S', errors='coerce')
six_months_ago = datetime.datetime.today() - datetime.timedelta(days=60)
users_df = users_df[users_df['datecreated'] < six_months_ago]
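
The difference between a 60-day cutoff and a calendar six-month cutoff can be checked on a few rows. This sketch uses a fixed, hypothetical reference date in place of `datetime.datetime.today()` so the result is reproducible, and `pd.DateOffset(months=6)` as one possible way to express six calendar months; the notebook itself does not use it.

```python
import pandas as pd

# Hypothetical fixed "today" so the example is reproducible
reference_date = pd.Timestamp("2022-03-02")

cutoff_60_days = reference_date - pd.Timedelta(days=60)     # 2022-01-01
cutoff_6_months = reference_date - pd.DateOffset(months=6)  # 2021-09-02

demo_users = pd.DataFrame({
    "id": ["old", "mid", "new"],
    "datecreated": pd.to_datetime(["2021-01-15", "2021-11-20", "2022-02-01"]),
})

# Keep only users created before each cutoff
kept_60_days = demo_users.loc[demo_users["datecreated"] < cutoff_60_days, "id"].tolist()
kept_6_months = demo_users.loc[demo_users["datecreated"] < cutoff_6_months, "id"].tolist()
```

Here the user created on 2021-11-20 passes the 60-day filter but not the six-month one, so the choice of cutoff changes who ends up in the sample.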

In [9]:
accounts_df.head()

Unnamed: 0,id,language,isactive,type,trialends,datecreated,ownerid,edition,subscriptionid,planid,hasslackintegration,settings,data_source_id,sync_date
0,573d9359ed915d00052efb0f,english,True,InternalAccount,2016-06-18T00:00:00,2016-05-19T00:00:00,573d9359ed915d00052efb10,enterprise,,,False,"{""coloring"":{""defaultColor"":""#5e35b1"",""ranges""...",account_573dbb12ed915d0005cc2c46,2022-02-09T00:00:00
1,573d93d9ed915d00052efb69,english,False,TrialAccount,2016-06-18T00:00:00,2016-05-19T00:00:00,573d93d9ed915d00052efb6a,,,,False,"{""coloring"":{""defaultColor"":""#603fa5"",""ranges""...",account_573dbb12ed915d0005cc2c46,2022-02-09T00:00:00
2,573db6aeed915d0005cc2bc4,english,False,TrialAccount,2016-06-18T00:00:00,2016-05-19T00:00:00,573db6aeed915d0005cc2bc5,,,,False,"{""coloring"":{""defaultColor"":""#603fa5"",""ranges""...",account_573dbb12ed915d0005cc2c46,2022-02-09T00:00:00
3,573dbb12ed915d0005cc2c46,english,True,ClientAccount,2016-06-18T00:00:00,2016-05-19T00:00:00,573dbb12ed915d0005cc2c47,gtmhub-enterprise-v2,AzyWN0S24v9LZCGz,gtmhub-live-v2-annual,True,"{""aggregation"":true,""autosave"":false,""branding...",account_573dbb12ed915d0005cc2c46,2022-02-09T00:00:00
4,573dc4cced915d0005cc2c4f,english,False,TrialAccount,2016-06-18T00:00:00,2016-05-19T00:00:00,573dc4cced915d0005cc2c50,,,,False,"{""coloring"":{""defaultColor"":""#603fa5"",""ranges""...",account_573dbb12ed915d0005cc2c46,2022-02-09T00:00:00


In [10]:
# Remove unnecessary columns
accounts_df = accounts_df.drop(['type', 'trialends', 'ownerid', 'edition', 'planid', 'hasslackintegration', 'settings', 'data_source_id', 'sync_date'], axis=1)
# Keep active accounts
accounts_df = accounts_df[accounts_df['isactive'] == True]
# Fix datetime
accounts_df['datecreated'] = pd.to_datetime(accounts_df['datecreated'], format='%Y-%m-%dT%H:%M:%S', errors='coerce')

In [11]:
# Add prefixes for table clarity
users_df = users_df.add_prefix('user_')
accounts_df = accounts_df.add_prefix('account_')

In [12]:
users_df.head()

Unnamed: 0,user_id,user_language,user_accountid,user_datecreated,user_isactive
0,573d9359ed915d00052efb10,english,573d9359ed915d00052efb0f,2016-05-19,True
1,573d93d9ed915d00052efb6a,english,573d93d9ed915d00052efb69,2016-05-19,True
2,573db6aeed915d0005cc2bc5,english,573db6aeed915d0005cc2bc4,2016-05-19,True
3,573dbb12ed915d0005cc2c47,english,573dbb12ed915d0005cc2c46,2016-05-19,True
4,573dbb61ed915d0005cc2c4c,english,573dbb12ed915d0005cc2c46,2016-05-19,True


In [13]:
# Get number of users per account
account_sum = accounts_df.merge(users_df, how='inner', left_on='account_id', right_on='user_accountid')
account_sum = account_sum.groupby('account_id')['user_id'].count().reset_index().rename(columns={'user_id': 'user_count'})

In [14]:
account_sum.head()

Unnamed: 0,account_id,user_count
0,573d9359ed915d00052efb0f,1
1,573dbb12ed915d0005cc2c46,248
2,57fb5f7bed915d0006582898,71
3,57fde284ed915d0006582b21,1
4,582466b3ed915d00078d039a,4
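
The per-account user count can also be computed by grouping the users first and joining the result to the accounts. The sketch below shows that alternative on made-up data; it is not the notebook's method, only an equivalent formulation that avoids materializing the full user-account join before counting.

```python
import pandas as pd

demo_accounts = pd.DataFrame({"account_id": ["a", "b", "c"]})
demo_users = pd.DataFrame({
    "user_id": ["u1", "u2", "u3"],
    "user_accountid": ["a", "a", "b"],
})

# Count users per account id, then align the column name with the accounts table
user_counts = (
    demo_users.groupby("user_accountid")
    .size()
    .rename("user_count")
    .reset_index()
    .rename(columns={"user_accountid": "account_id"})
)

# Inner join: accounts with no users (here "c") are dropped, as in the notebook
account_sum_demo = demo_accounts.merge(user_counts, how="inner", on="account_id")
```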


In [15]:
# Get backend users
backendusers_df = get_json_blob_as_df(blob_service_client, "researchanalyticsinsights", "Unprocessed/Redshift/2022/03/02/backendusers.json")

In [16]:
backendusers_df.head()

Unnamed: 0,id,received_at,uuid,company_name,editionname,email,account_id,account_name,accountstatus,avatar,...,context_library_name,created_at,editionplanid,context_group_id,deleted,company_status,status,company_edition,experiments,is_primary
0,5e15b5611f2fb20001f71d22,2020-01-08T10:56:39,1,Sngular,gtmhub-enterprise,comunicacion@sngular.com,5d3599596fa5cb0001feb5eb,Sngular,1,https://lh3.googleusercontent.com/a-/AAuE7mAp9...,...,analytics-go,2020-01-08T10:56:33.798,gtmhub-enterprise-10m-70,,,,,,,
1,5e15ba561f2fb20001f71d93,2020-01-08T11:17:44,3,mihail-staging-13,gtmhub-start-v2,mihail-staging-13@okrs.tech,5e15ba561f2fb20001f71d92,mihail-staging-13,1,https://s.gravatar.com/avatar/cf168d408662fae6...,...,analytics-go,2020-01-08T11:17:42.733,gtmhub-start-v2-monthly,,,,,,,
2,5da85c3f76023f0001dbe2b2,2020-01-08T11:56:22,9,Receipt Bank,gtmhub-enterprise,emma.pegg@receipt-bank.com,5d3711f3429b5e00017f0424,Receipt Bank,1,https://s.gravatar.com/avatar/b5c38bfac1be5aaf...,...,analytics-go,2019-10-17T12:19:11.407,enterprise-annual-400,,,,,,,
3,5e15ec9fbd8a480001b20ce6,2020-01-08T14:56:09,27,Beat,growth-engine,g.papageorgiou@thebeat.co,5d35a36e6fa5cb0001feb678,Beat,1,https://s.gravatar.com/avatar/f76a976177c77953...,...,analytics-go,2020-01-08T14:52:15.553,gtmhub-okrs-fixed-17m--400,,,,,,,
4,5e15eeb7bd8a480001b20d28,2020-01-08T15:01:14,29,KIWI.com,growth-engine,gabriela.korosiova@kiwi.com,5b7131683e55df0007694ade,KIWI.com,1,https://s.gravatar.com/avatar/19f3a20f3dcf49ee...,...,analytics-go,2020-01-08T15:01:11.267,gtmhub-okrs-platform-kiwi-new,,,,,,,


In [17]:
backendusers_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 167649 entries, 0 to 167648
Data columns (total 32 columns):
 #   Column                   Non-Null Count   Dtype 
---  ------                   --------------   ----- 
 0   id                       167649 non-null  object
 1   received_at              167649 non-null  object
 2   uuid                     167649 non-null  int64 
 3   company_name             167649 non-null  object
 4   editionname              167647 non-null  object
 5   email                    167642 non-null  object
 6   account_id               167649 non-null  object
 7   account_name             167649 non-null  object
 8   accountstatus            77461 non-null   object
 9   avatar                   167648 non-null  object
 10  company_account_status   77461 non-null   object
 11  last_name                130162 non-null  object
 12  uuid_ts                  167649 non-null  object
 13  accountcreated           167649 non-null  object
 14  company_id          

In [18]:
# Remove unnecessary columns
backendusers_df = backendusers_df.drop(['received_at', 'uuid', 'editionname', 'account_id', 'account_name', 'accountstatus', 'avatar', 'context_library_name', 'company_account_status', 'uuid_ts', 'accountcreated', 'company_id', 'context_library_version', 'trialends', 'company_plan', 'created_at', 'editionplanid', 'context_group_id', 'deleted', 'company_status', 'status', 'company_edition', 'experiments', 'is_primary'], axis=1)

In [19]:
# Add prefix
backendusers_df = backendusers_df.add_prefix('backenduser_')

In [20]:
# Join backend with users
user_df = users_df.merge(backendusers_df, how='inner', left_on='user_id', right_on='backenduser_id')

In [21]:
# Join sum with accounts
account_df = accounts_df.merge(account_sum, how='inner', left_on='account_id', right_on='account_id')

In [22]:
# Drop additional user columns
user_df = user_df.drop(['user_isactive', 'backenduser_id', 'backenduser_company_created_at', 'backenduser_last_name', 'backenduser_first_name', 'backenduser_roles', 'user_language'], axis=1)
# Drop additional account columns
account_df = account_df.drop(['account_language', 'account_isactive'], axis=1)

In [23]:
# Join users and accounts
df = user_df.merge(account_df, how='inner', left_on='user_accountid', right_on='account_id')

In [24]:
# Drop duplicate column
df = df.drop(['user_accountid'], axis=1)
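
The joins in this notebook use `how='inner'`, which silently drops rows that have no match on the other side. One way to see what would be dropped, sketched here on made-up data, is to run a left join with pandas' `indicator` and `validate` options first. The frame contents below are assumptions for illustration.

```python
import pandas as pd

demo_users = pd.DataFrame({
    "user_id": ["u1", "u2", "u3"],
    "user_accountid": ["a", "a", "z"],  # "z" has no matching account
})
demo_accounts = pd.DataFrame({"account_id": ["a", "b"], "user_count": [2, 0]})

checked = demo_users.merge(
    demo_accounts,
    how="left",
    left_on="user_accountid",
    right_on="account_id",
    validate="many_to_one",  # raises MergeError if account_id is not unique
    indicator=True,          # adds a _merge column: both / left_only
)

# Users that an inner join would have discarded
unmatched_users = checked.loc[checked["_merge"] == "left_only", "user_id"].tolist()

# Equivalent of the inner join, after inspection
joined = checked[checked["_merge"] == "both"].drop(columns="_merge")
```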

In [25]:
df.head()

Unnamed: 0,user_id,user_datecreated,backenduser_company_name,backenduser_email,backenduser_name,account_id,account_datecreated,account_subscriptionid,user_count
0,573dbb12ed915d0005cc2c47,2016-05-19,live,ivan@gtmhub.com,Ivan Osmak,573dbb12ed915d0005cc2c46,2016-05-19,AzyWN0S24v9LZCGz,248
1,573dbb61ed915d0005cc2c4c,2016-05-19,live,radoslav@gtmhub.com,radoslav@gtmhub.com,573dbb12ed915d0005cc2c46,2016-05-19,AzyWN0S24v9LZCGz,248
2,57fe3116ed915d0005d06ffb,2016-10-12,live,jordan@gtmhub.com,Jordan Angelov,573dbb12ed915d0005cc2c46,2016-05-19,AzyWN0S24v9LZCGz,248
3,581077f5ed915d0007940cad,2017-01-13,live,bo@gtmhub.com,bo@gtmhub.com,573dbb12ed915d0005cc2c46,2016-05-19,AzyWN0S24v9LZCGz,248
4,5832b55eed915d0006ae0d98,2016-11-21,live,momchil@gtmhub.com,momchil@gtmhub.com,573dbb12ed915d0005cc2c46,2016-05-19,AzyWN0S24v9LZCGz,248


In [26]:
# Get metric_modified events (source of the confidence data used below)
metric_modified = get_json_blob_as_df(blob_service_client, "researchanalyticsinsights", "Unprocessed/Redshift/2022/03/02/backendmetric_modified.json")

In [29]:
modded_metric_modified = metric_modified[
    metric_modified["confidence"].notna()
]

modded_metric_modified["confidence"].head()

4790    0.5
4793    0.5
4797    0.5
4798    0.5
4803    0.5
Name: confidence, dtype: float64

In [30]:
# Create metric_modified by user df
metric_modified_group = modded_metric_modified.groupby('user_id')['id'].count().reset_index().rename(columns={'id': 'confidence_modified'})
metric_modified_group.head()

Unnamed: 0,user_id,confidence_modified
0,573dbb12ed915d0005cc2c47,27629
1,573dbb61ed915d0005cc2c4c,192
2,57fb6d2bed915d00065828c3,7
3,57fe3116ed915d0005d06ffb,20781
4,581077f5ed915d0007940cad,2300
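
The filter-then-count pattern used for `confidence_modified` can be reproduced on a handful of made-up events. This is a sketch under the assumption that each row of `metric_modified` is one event with an `id`, a `user_id`, and an optional `confidence` value, as the cells above suggest.

```python
import pandas as pd

demo_events = pd.DataFrame({
    "id": ["e1", "e2", "e3", "e4", "e5"],
    "user_id": ["u1", "u1", "u2", "u2", "u2"],
    "confidence": [0.5, None, None, 0.8, 0.3],
})

# Keep only events that carry a confidence value
with_confidence = demo_events[demo_events["confidence"].notna()]

# One row per user with the number of such events
confidence_counts = (
    with_confidence.groupby("user_id")["id"]
    .count()
    .reset_index(name="confidence_modified")
)
```

A user with no confidence events does not appear in the grouped result at all, so an inner merge against it keeps only users with at least one such event.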


In [31]:
# NaN to 0
metric_modified_df = metric_modified_group.fillna(0)

In [32]:
# Merge confidence-change counts with users
df = df.merge(metric_modified_df, how='inner', left_on='user_id', right_on='user_id')

In [33]:
# Remove duplicate rows
df = df.drop_duplicates(keep='last')

In [34]:
# Get HubSpot Contacts
hs_contacts = pd.read_json('hubspot_contacts.json')
# Explode properties
hs_contacts = hs_contacts.join(hs_contacts.properties.apply(pd.Series))
# Keep necessary columns
hs_contacts = hs_contacts[['associatedcompanyid', 'email', 'hs_object_id', 'jobtitle']]

In [35]:
# Get HubSpot Companies
hs_companies = pd.read_json('hubspot_companies.json')
# Explode properties
hs_companies = hs_companies.join(hs_companies.properties.apply(pd.Series))
# Keep necessary columns
hs_companies = hs_companies[['hs_object_id', 'annualrevenue', 'industry', 'numberofemployees', 'website']]
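
Expanding the `properties` column with `.apply(pd.Series)` works but can be slow on large exports. A possible alternative is `pd.json_normalize`, sketched below on made-up records. The shape of the HubSpot export (a `properties` column holding one dict per row) is inferred from the cells above and is an assumption here.

```python
import pandas as pd

# Made-up records shaped like the assumed HubSpot export
demo_contacts = pd.DataFrame({
    "id": ["1", "2"],
    "properties": [
        {"email": "a@example.com", "jobtitle": "CEO", "hs_object_id": "1"},
        {"email": "b@example.com", "jobtitle": None, "hs_object_id": "2"},
    ],
})

# Turn the list of dicts into columns in one pass
props = pd.json_normalize(demo_contacts["properties"].tolist())
props.index = demo_contacts.index  # keep row alignment for the join

flat_contacts = demo_contacts.drop(columns="properties").join(props)
```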

In [36]:
# Merge HubSpot contacts with companies
hubspot = hs_contacts.merge(hs_companies, how='left', left_on='associatedcompanyid', right_on='hs_object_id')
hubspot = hubspot.drop(['associatedcompanyid', 'hs_object_id_x', 'hs_object_id_y'], axis=1)

In [37]:
hubspot.head()

Unnamed: 0,email,jobtitle,annualrevenue,industry,numberofemployees,website
0,coolrobot@hubspot.com,Robot,250000000.0,COMPUTER_SOFTWARE,3000.0,hubspot.com
1,barbara.soltysinska@indahash.com,"Co-founder, CEO",10000000.0,,122.0,indahash.com
2,kanakof@shoproyal.net,,,,,okrs.tech
3,rebeccawood@spotahome.com,City Manager Berlin,50000000.0,Electronics,500.0,spotahome.com
4,marta.zarosa@indahash.com,Chief Business Development Officer,10000000.0,,122.0,indahash.com


Subscription SQL
```
SELECT
    subscription_id,
    subscription_mrr
FROM chargebee_rest_subscriptions_all
WHERE subscription_id IN (<list-of-subscription-ids-from-df>)
ORDER BY subscription_id
```
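
The `<list-of-subscription-ids-from-df>` placeholder has to be filled from the `account_subscriptionid` column. The helper below is a hypothetical sketch of one way to build that list as quoted SQL literals; `build_subscription_query` is not part of the notebook, and how the query is actually submitted to Gtmhub insights is not covered here. String interpolation like this is only reasonable for trusted, internally generated IDs.

```python
def build_subscription_query(subscription_ids):
    """Return the subscription query with a deduplicated, quoted IN list.

    Hypothetical helper: drops empty values, deduplicates, and escapes
    single quotes. Drop NaN values (e.g. with pandas .dropna()) before calling.
    """
    cleaned = sorted({str(sid) for sid in subscription_ids if sid})
    if not cleaned:
        raise ValueError("no subscription ids to query")
    quoted = ", ".join("'" + sid.replace("'", "''") + "'" for sid in cleaned)
    return (
        "SELECT\n"
        "    subscription_id,\n"
        "    subscription_mrr\n"
        "FROM chargebee_rest_subscriptions_all\n"
        f"WHERE subscription_id IN ({quoted})\n"
        "ORDER BY subscription_id"
    )

# IDs taken from the sample output shown in this notebook
query = build_subscription_query(
    ["AzyWN0S24v9LZCGz", None, "169la0SaV8PVZ51RT", "AzyWN0S24v9LZCGz"]
)
```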

In [38]:
# Get subscriptions
subscriptions = pd.read_csv('subscriptions.csv')
# Remove unnecessary columns
subscriptions = subscriptions.drop(['Unnamed: 0'], axis=1)

In [39]:
subscriptions.head()

Unnamed: 0,subscription_id,subscription_mrr
0,169la0SaV8PVZ51RT,18000.0
1,169laESJSfNno1mFA,6600.0
2,169lamSiQBPDj5sN,1700.0
3,169laPSf7P6yAlGx,500.0
4,169lawSKfwvpX3rVd,100.0


In [40]:
# Merge subscriptions with users
df = df.merge(subscriptions, how='inner', left_on='account_subscriptionid', right_on='subscription_id')
# Remove $0 subscriptions
df = df[df.subscription_mrr > 0]
# Remove gtmhub & primeholding users (na=False keeps the mask boolean when an email is missing)
df = df[~df.backenduser_email.str.contains('primeholding', na=False)]
df = df[~df.backenduser_email.str.contains('gtmhub', na=False)]
# Remove unnecessary columns
df = df.drop(['account_subscriptionid', 'subscription_id', 'subscription_mrr', 'user_id'], axis=1)
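
Internal users can also be excluded with a single mask that matches on the email domain. The sketch below is a stricter variant than the substring checks in the cell above, since it only matches text after the `@`; the regular expression and the sample addresses are assumptions for illustration.

```python
import pandas as pd

demo = pd.DataFrame({
    "backenduser_email": [
        "ann@customer.com",
        "ivan@gtmhub.com",
        None,  # missing email
        "dev@primeholding.com",
    ]
})

# True where the domain starts with gtmhub or primeholding; missing emails count as False
is_internal = demo["backenduser_email"].str.contains(
    r"@(?:gtmhub|primeholding)\.", case=False, na=False
)

external = demo[~is_internal]
```

With `na=False`, rows without an email stay in the result; whether they should be kept in the interview sample is a separate decision.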

In [41]:
# Merge users with HubSpot data
df = df.merge(hubspot, how='left', left_on='backenduser_email', right_on='email')

In [42]:
# Write user sample to CSV
df.to_csv('confidence_levels_2022-03-04.csv', index=False)