title: Analyse impact of bad video verification experience on subsequent user journey  
author: Claudia Dai   
date: 2020-10-14   
region: EU   
tags: acquire, kyc, call quality, user journey, video verification, logistic regression
summary: There is an internal KYC QA team that reviews video KYC verification calls and rates them for quality on several dimensions. This work analyzes whether bad KYC call quality affects the subsequent user journey (card activation and first-time MAU). We observe the effect of KYC call quality on 1271 users who initiated KYC between 2020-05-05 and 2020-08-05. We conducted exploratory data analysis, statistical tests (t-test, chi-square test), and logistic regression to analyze the relationships between the variables and quantify the strength of association.

[DATA-8971]
# Analyse impact of bad video verification experience on subsequent user journey

There is an internal KYC QA team that reviews video KYC verification calls and rates them for quality, on several dimensions. Our hypothesis is that users who have a bad experience with video KYC will be much less likely to become loyal N26 customers. Could you analyse the data to see if there is support for this hypothesis? Any other interesting findings you turn up from the data would also be useful to share.

# KYC QA data

The QA team collect their findings in this spreadsheet: https://docs.google.com/spreadsheets/d/1snu_SAm5Cj0rYTQHPsML_wLTRE2eu77Lyi_Z9Y1-2_0/edit?usp=sharing

Jasmin Altmann is the QA team lead, and can provide additional context on the data or the video KYC process.

**Data sampling**: 2% of all calls received are sampled at random.

**Error classification**:
- **minor** = watch, for internal reporting, no exposure to user
- **serious** = error, for internal reporting, no exposure to user
- **reset** = for BAFIN/GDPR reasons the call has to be reset, meaning the user has to do it again.

**Duration of call**: 7-8 minutes is acceptable; a longer call usually means something went wrong, typically on the user's side. On a longer call the agent might get irritated, which affects the user's experience of the KYC call.
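
The duration rule above could be turned into a feature roughly as follows. This is a minimal sketch on toy data: only the `call_initiated` and `call_completed` column names come from the QA data, while `duration_min`, `is_long_call` and the 8-minute threshold constant are illustrative assumptions.

```python
import pandas as pd

# Toy stand-in for the QA data; only the two timestamp column names are real.
calls = pd.DataFrame(
    {
        "call_initiated": pd.to_datetime(
            ["2020-06-01 10:00:00", "2020-06-01 11:00:00"]
        ),
        "call_completed": pd.to_datetime(
            ["2020-06-01 10:07:30", "2020-06-01 11:12:00"]
        ),
    }
)

# Assumption: 8 minutes is the upper bound of an "acceptable" call.
MAX_ACCEPTABLE_MINUTES = 8

# Call duration in minutes, then a flag for calls that ran long.
calls["duration_min"] = (
    calls["call_completed"] - calls["call_initiated"]
).dt.total_seconds() / 60
calls["is_long_call"] = calls["duration_min"] > MAX_ACCEPTABLE_MINUTES
```

A binary flag like this is easier to feed into a chi-square test, while the raw duration suits a t-test or a regression term.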

**Audio and video quality**: measured on the agent's side, so we have no direct insight into the quality on the user's side. That could be inferred from the duration of the call. If the quality on the agent's side is poor, the agent might get irritated too.

**Error scoring**: was done by one of the working students. It is currently not working, but Jasmin can get it running if it proves valuable for the analysis.

**Sheet tabs**: Until August 2020, each month's two batches are saved in two separate sheets. The batches run from the 5th to the 20th of a month, and from the 20th to the 5th of the following month. As of September 2020, batches are weekly and are all entered into a single sheet per month.
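
If the sheets ever need to be matched back to call dates, the half-month batching could be sketched like this. The function name and label format are hypothetical, and the boundary handling is an assumption: the description does not say whether the 5th and 20th belong to the batch that ends or the one that starts on that day, so this sketch assigns them to the batch that starts.

```python
from datetime import date


def batch_label(day: date) -> str:
    """Return a label for the half-month QA batch containing `day`.

    Assumption: batches cover the 5th..19th and the 20th..4th of the next
    month, i.e. a boundary day belongs to the batch that starts on it.
    """
    if 5 <= day.day < 20:
        return f"{day:%Y-%m}_05-20"
    if day.day < 5:
        # Days 1-4 belong to the batch that started on the 20th of the
        # previous month (rolling over the year in January).
        if day.month > 1:
            year, month = day.year, day.month - 1
        else:
            year, month = day.year - 1, 12
    else:
        year, month = day.year, day.month
    return f"{year:04d}-{month:02d}_20-05"
```

This only applies to data up to August 2020; the weekly batches introduced afterwards would need a different rule.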

In [2]:
%%capture
!pip install imbalanced-learn

In [3]:
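from functions import *  # project-local helpers (cols_as_datetime, generate_filename, get_data, save_data); pd and timedelta are assumed to come in via this star import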

---

# Get data

In [22]:
df_kyc = pd.read_csv("kyc_qa.csv", index_col=0)
cols_as_datetime(df_kyc, ["call_initiated", "call_completed"])
df_kyc.shape

(1732, 11)

In [23]:
date_start = df_kyc.call_initiated.min().date()
date_end = df_kyc.call_initiated.max().date() - timedelta(days=30)

In [24]:
df_kyc = df_kyc[
    df_kyc.call_completed <= pd.Timestamp(date_end)
]  # drop the most recent 30 days so every remaining user has had 30 days to act
df_kyc.shape

(1272, 11)
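
The 30-day holdout can be illustrated on toy data. This is a sketch, not the notebook's exact code: the timestamps are invented, and it uses `Timestamp.normalize()` rather than `.date()` so that the cutoff stays a pandas `Timestamp` and compares cleanly against a datetime column.

```python
from datetime import timedelta

import pandas as pd

# Invented call times; the real data comes from kyc_qa.csv.
calls = pd.DataFrame(
    {
        "call_initiated": pd.to_datetime(
            ["2020-05-05 09:00", "2020-08-01 14:00", "2020-09-04 16:30"]
        )
    }
)
calls["call_completed"] = calls["call_initiated"] + pd.Timedelta(minutes=8)

# Cutoff = midnight of the latest initiation date, minus the 30-day window
# in which card activation and first activity are measured.
cutoff = calls["call_initiated"].max().normalize() - timedelta(days=30)

# Keep only calls old enough for the 30-day outcome to be fully observed.
observed = calls[calls["call_completed"] <= cutoff]
```

Without this filter, recent users would be counted as "not activated" simply because their 30 days have not elapsed yet, which would bias the outcome rates downwards.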

In [7]:
user_ids = str(set(df_kyc.user_id)).replace("{", "(").replace("}", ")")
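
The string replacement above relies on how Python happens to print a set. A slightly more explicit way to build the `IN (...)` list might look like the sketch below. The helper name `sql_in_list` is hypothetical, and it assumes `user_id` values are strings (for example UUIDs) that should be single-quoted; numeric IDs would need the quoting removed.

```python
def sql_in_list(values) -> str:
    """Render unique values as a SQL tuple literal, e.g. ('a', 'b').

    Assumption: all values should be treated as quoted strings.
    """
    unique = sorted({str(v) for v in values})
    if not unique:
        raise ValueError("IN () is not valid SQL; need at least one value")
    # Double any single quote so a value cannot terminate the string literal.
    quoted = ", ".join("'" + v.replace("'", "''") + "'" for v in unique)
    return f"({quoted})"
```

For example, `sql_in_list(df_kyc.user_id)` would produce a deduplicated, deterministically ordered list. Where the database driver supports bound parameters, those are generally preferable to string formatting.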

In [8]:
query = """
SELECT
    user_id,
    us.user_created,
    kyc_finished,
    closed_at,
    product_id,
    is_premium,
    is_card_f30d,
    CASE WHEN DATEDIFF('days', fa.user_created::date, fa.first_active::date) <= 30 THEN 1 ELSE 0 END AS is_ftmau_f30d
    /* CASE WHEN txn_ts BETWEEN kyc_finished::date AND DATEADD('day',30,kyc_finished::date) THEN COUNT(txn_ts) END AS count_txn_f30d */
FROM
    dbt.zrh_users as us
    
LEFT JOIN
    dbt.stg_cohort_first_active as fa
ON
    us.user_created = fa.user_created

LEFT JOIN
    dbt.zrh_transactions as txn
ON
    txn.user_created = us.user_created

LEFT JOIN
    (
        SELECT
            step1.user_created,
            step1.created AS kyc_finished,
            CASE WHEN datediff('days', step1.created::date, step2.created::date) <= 30 THEN 1 ELSE 0 END AS is_card_f30d
        FROM (
            SELECT user_created,step,created
            FROM dbt.zrh_lower_funnel
            WHERE step LIKE 'kyc_finished') step1
        INNER JOIN (
            SELECT user_created,step,created
            FROM dbt.zrh_lower_funnel
            WHERE step LIKE 'card_activated') step2
        ON step1.user_created = step2.user_created
    ) ca
ON
    us.user_created = ca.user_created

WHERE 
    user_id IN {}
""".format(
    user_ids
)
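
One thing to watch in the query above: the `ca` subquery INNER JOINs `kyc_finished` to `card_activated`, so users who never activated a card have no row in `ca`. After the LEFT JOIN their `is_card_f30d` comes back as NULL rather than 0 (unlike `is_ftmau_f30d`, where the `CASE ... ELSE 0` is evaluated after the join). The toy example below mimics this in pandas; the frames are invented, and filling with 0 assumes that "no activation on record" should count as "not activated within 30 days".

```python
import pandas as pd

users = pd.DataFrame({"user_created": [1, 2, 3]})

# Stand-in for the `ca` subquery: user 3 never activated a card, so the
# inner join inside the subquery leaves no row for them.
ca = pd.DataFrame({"user_created": [1, 2], "is_card_f30d": [1, 0]})

merged = users.merge(ca, on="user_created", how="left")
n_missing = int(merged["is_card_f30d"].isna().sum())  # users absent from `ca`

# Assumption: a missing flag means the card was not activated in time.
merged["is_card_f30d"] = merged["is_card_f30d"].fillna(0).astype(int)
```

Checking `n_missing` before filling helps to tell "did not activate" apart from a data problem such as a missing `kyc_finished` step.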

In [9]:
filename = generate_filename("data/", "DATA-8971", "users", date_start, date_end)
# get_data(query, filename)

In [10]:
df_users = pd.read_csv(filename)
df_users.shape

(141201, 8)

In [13]:
df_users.duplicated().value_counts()

True     139930
False      1271
dtype: int64

In [14]:
df_users.drop_duplicates(inplace=True, keep="first")
df_users.shape

(1271, 8)
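
The jump from 141201 rows to 1271 after deduplication is most likely caused by the LEFT JOIN on `dbt.zrh_transactions`: none of its columns are selected, so each user row is simply repeated once per matching transaction. A toy reproduction, with invented frames and placeholder `product_id` values, and assuming the transactions table has one row per transaction:

```python
import pandas as pd

users = pd.DataFrame({"user_created": [1, 2], "product_id": ["A", "B"]})

# One row per transaction; only the join key is kept, as in the query.
txn = pd.DataFrame({"user_created": [1, 1, 1, 2, 2]})

joined = users.merge(txn, on="user_created", how="left")
n_duplicates = int(joined.duplicated().sum())

# Dropping exact duplicates restores one row per user.
deduped = joined.drop_duplicates()
```

Dropping duplicates afterwards gives the right result here, but removing the unused join (or aggregating transactions per user before joining) would avoid pulling the extra rows in the first place.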

In [19]:
data = pd.merge(left=df_kyc, right=df_users, how="inner", on="user_id", sort=True)

In [20]:
df_kyc[~df_kyc.isin(data)].shape

(1272, 11)
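
Note that the check above does not actually show which calls failed to match. `DataFrame.isin` with another DataFrame compares cell by cell on aligned index and column labels, and indexing with a boolean frame keeps the original shape (non-selected cells become NaN), so the result has the shape of `df_kyc` regardless of the merge outcome. A sketch of an alternative using a merge indicator, on invented frames where `call_score` is a hypothetical column:

```python
import pandas as pd

kyc = pd.DataFrame({"user_id": ["a", "b", "c"], "call_score": [1, 2, 3]})
users = pd.DataFrame({"user_id": ["a", "b"], "is_premium": [0, 1]})

# Frame-to-frame isin plus boolean-frame indexing: the shape never changes.
masked = kyc[~kyc.isin(users)]

# Anti-join: left merge with an indicator, then keep rows found only on the left.
check = kyc.merge(
    users[["user_id"]].drop_duplicates(),
    on="user_id",
    how="left",
    indicator=True,
)
unmatched = check[check["_merge"] == "left_only"]
```

Here `unmatched` contains only user `c`, the call with no user record. A shorter variant is `kyc[~kyc["user_id"].isin(users["user_id"])]`.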

In [13]:
data.duplicated().value_counts()

False    1272
dtype: int64

In [14]:
data.shape

(1272, 18)

In [15]:
filename = generate_filename("data/", "DATA-8971", "merged", date_start, date_end)
# save_data(data, filename)