title: Analyse impact of bad video verification experience on subsequent user journey  
author: Claudia Dai   
date: 2020-10-14   
region: EU   
tags: acquire, kyc, call quality, user journey, video verification, logistic regression
summary: There is an internal KYC QA team that reviews video KYC verification calls and rates them for quality, on several dimensions. This work aims to analyze whether bad KYC call quality has an effect on the subsequent user journey (card activation and first-time MAU). We are observing the effect of KYC call quality on 1271 users who initiated KYC in the time period from 2020-05-05 until 2020-08-05. We conducted exploratory data analysis, statistical tests (t-test, chi-square test), and logistic regression to analyze the variables’ relationship and provide a quantified value for the strength of association.

[DATA-8971]
# Analyse impact of bad video verification experience on subsequent user journey

There is an internal KYC QA team that reviews video KYC verification calls and rates them for quality, on several dimensions. Our hypothesis is that users who have a bad experience with video KYC will be much less likely to become loyal N26 customers. Could you analyse the data to see if there is support for this hypothesis? Any other interesting findings you turn up from the data would also be useful to share.

# KYC QA data

The QA team collect their findings in this spreadsheet: https://docs.google.com/spreadsheets/d/1snu_SAm5Cj0rYTQHPsML_wLTRE2eu77Lyi_Z9Y1-2_0/edit?usp=sharing

Jasmin Altmann is the QA team lead, and can provide additional context on the data or the video KYC process.

**Data sampling**: 2% of all calls received are sampled at random.

**Error classification**:
- **minor** = watch, for internal reporting, no exposure to user
- **serious** = error, for internal reporting, no exposure to user
- **reset** = required for BaFin/GDPR compliance; the call has to be reset, meaning the user has to repeat it.

**Duration of call**: 7-8 minutes is acceptable. A longer call usually means something went wrong, typically on the user side. During a long call the agent might get irritated, which affects the user's experience of the KYC call.

**Audio and video quality**: measured on the agent's side, so we have no direct insight into the quality on the user's side. That could be inferred from the duration of the call. If the quality on the agent's side is poor, the agent might get irritated too.
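
Since user-side quality is not recorded, call duration could serve as a rough proxy, as noted above. The following is a minimal sketch of that idea, not part of the original analysis. It assumes the `call_initiated` and `call_completed` timestamps from the merged data, and it treats 8 minutes as the upper end of the acceptable range. The two calls are taken from the `data.head()` preview in this notebook. The timestamps may include waiting time, so the flag need not agree with the QA team's own `call_duration` rating.

```python
import pandas as pd

# Two calls copied from the data.head() preview in this notebook.
calls = pd.DataFrame(
    {
        "call_initiated": ["2020-05-25 14:55:36.930", "2020-06-24 10:25:19.735"],
        "call_completed": ["2020-05-25 15:16:28.466", "2020-06-24 10:32:27.857"],
    }
)
for col in ["call_initiated", "call_completed"]:
    calls[col] = pd.to_datetime(calls[col])

# Assumption: 8 minutes is the upper end of the "acceptable" range.
MAX_ACCEPTABLE_MINUTES = 8

# Elapsed time between the two timestamps, in minutes.
calls["duration_min"] = (
    calls["call_completed"] - calls["call_initiated"]
).dt.total_seconds() / 60

# Flag calls that ran longer than the assumed acceptable duration.
calls["is_long_call"] = calls["duration_min"] > MAX_ACCEPTABLE_MINUTES

print(calls[["duration_min", "is_long_call"]])
```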

**Error scoring**: was done by one of the working students. It is currently not working, but Jasmin can get it working again if it proves valuable for the analysis.

**Sheet tabs**: Until August 2020, each month's two batches are saved in two separate sheets. The batches run from the 5th to the 20th, and from the 20th to the 5th of the following month. As of September 2020, batches are weekly and are all entered into a single sheet per month.

In [1]:
%%capture
!pip install imbalanced-learn

In [2]:
from functions import *  # project helpers; the cells below rely on it to expose pd, np and timedelta as well

---

# Prep data

In [3]:
df_kyc = pd.read_csv("kyc_qa.csv", index_col=0)
cols_as_datetime(df_kyc, ["call_initiated", "call_completed"])
date_start = df_kyc.call_initiated.min().date()
date_end = df_kyc.call_initiated.max().date() - timedelta(days=30)
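
The 30-day shift in `date_end` presumably ensures that every user in the sample has a full 30-day observation window for the `is_card_f30d` and `is_ftmau_f30d` outcomes. A minimal sketch of that cutoff logic follows. The timestamps are made up for illustration, chosen so that the cutoff lands on the 2020-08-05 end date given in the summary. `OBSERVATION_DAYS` and `has_full_window` are hypothetical names.

```python
from datetime import date, timedelta

import pandas as pd

# Made-up call timestamps; only the 30-day logic mirrors the notebook.
call_initiated = pd.to_datetime(
    pd.Series(["2020-05-05 09:00", "2020-07-20 12:00", "2020-09-04 18:30"])
)

OBSERVATION_DAYS = 30  # length of the follow-up window for the outcome flags

date_start = call_initiated.min().date()
date_end = call_initiated.max().date() - timedelta(days=OBSERVATION_DAYS)

# A call has a complete follow-up window if it started on or before date_end.
has_full_window = call_initiated.dt.normalize() <= pd.Timestamp(date_end)

print(date_start, date_end)
print(has_full_window.tolist())
```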

In [4]:
filename = generate_filename("data/", "DATA-8971", "merged", date_start, date_end)
data = pd.read_csv(filename)

In [5]:
data.shape

(1272, 18)

In [6]:
data.head()

Unnamed: 0,user_id,call_initiated,call_completed,call_duration,audio_video_quality,competence,feedback,language,guidance,error_classification,error_scoring_total,user_created,kyc_finished,closed_at,product_id,is_premium,is_card_f30d,is_ftmau_f30d
0,0019a7d1-328b-41bf-9715-6cd2d2d34d4b,2020-05-25 14:55:36.930,2020-05-25 15:16:28.466,OK,OK,good,good,good,good,,4,2020-05-25 13:22:20.995211,2020-05-25 15:16:28.466000+00:00,,STANDARD,False,1.0,1
1,002e7dcd-52f2-4974-943e-d3a74f944c96,2020-06-24 10:25:19.735,2020-06-24 10:32:27.857,efficient,acceptable,good,good,good,good,,4,2020-03-21 12:58:35.118166,2020-06-24 10:32:27.857000+00:00,,STANDARD,False,1.0,0
2,008183a4-de4e-4ca9-a9e2-b6bd1940da09,2020-07-28 16:27:50.054,2020-07-28 16:49:25.515,long,acceptable,good,good,good,good,,4,2020-07-28 16:04:51.754697,2020-07-28 16:49:25.515000+00:00,,STANDARD,False,1.0,1
3,00bf352f-302a-4c57-9cc5-7fe49dec6c5e,2020-05-25 10:50:13.983,2020-05-25 11:00:41.469,efficient,very good,good,good,good,good,,4,2020-05-25 08:29:02.788190,2020-05-25 11:00:41.469000+00:00,,STANDARD,False,1.0,1
4,00ed6466-ca09-486d-b4f8-c15f0bbe077d,2020-07-28 06:54:46.561,2020-07-28 07:26:36.848,OK,acceptable,good,good,good,good,,4,2019-10-12 21:49:54.057021,2020-07-28 07:26:36.848000+00:00,,BUSINESS_CARD,False,1.0,0


In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1272 entries, 0 to 1271
Data columns (total 18 columns):
user_id                 1272 non-null object
call_initiated          1272 non-null object
call_completed          1272 non-null object
call_duration           1236 non-null object
audio_video_quality     1237 non-null object
competence              1237 non-null object
feedback                1237 non-null object
language                1237 non-null object
guidance                1236 non-null object
error_classification    1272 non-null object
error_scoring_total     1272 non-null object
user_created            1272 non-null object
kyc_finished            1111 non-null object
closed_at               68 non-null object
product_id              1271 non-null object
is_premium              1272 non-null bool
is_card_f30d            1111 non-null float64
is_ftmau_f30d           1272 non-null int64
dtypes: bool(1), float64(1), int64(1), object(15)
memory usage: 170.3+ KB


In [8]:
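# is_card_f30d is missing for the same number of rows as kyc_finished (1111 non-null each);
# presumably these users never finished KYC, so count them as not activated
data.is_card_f30d = data.is_card_f30d.fillna(0)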

In [9]:
cols_obj = []

cols_int = ["is_card_f30d", "is_ftmau_f30d"]

cols_dt = [
    "call_initiated",
    "call_completed",
    "user_created",
    "kyc_finished",
    "closed_at",
]

correct_dtypes(data, cols_obj, cols_int, cols_dt)
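
`correct_dtypes` comes from the project's `functions` module, whose source is not shown here. The sketch below is a hypothetical stand-in, assuming the helper casts the listed columns in place to object, integer, and datetime types. The name `correct_dtypes_sketch` and the toy values are illustrative only. If the real helper works this way, the integer cast would fail on missing values, which would explain why `is_card_f30d` is filled with 0 beforehand.

```python
import pandas as pd


def correct_dtypes_sketch(df, cols_obj, cols_int, cols_dt):
    """Hypothetical stand-in for correct_dtypes: cast columns in place."""
    for col in cols_obj:
        df[col] = df[col].astype(object)
    for col in cols_int:
        # Raises on NaN, so missing values must be filled first.
        df[col] = df[col].astype(int)
    for col in cols_dt:
        # Missing values become NaT.
        df[col] = pd.to_datetime(df[col])


# Toy frame with made-up values and column names from the notebook.
toy = pd.DataFrame(
    {
        "is_card_f30d": [1.0, 0.0],
        "is_ftmau_f30d": [1, 0],
        "user_created": ["2020-05-25 13:22:20", "2020-03-21 12:58:35"],
        "closed_at": ["2020-08-01 10:00:00", None],
    }
)

correct_dtypes_sketch(
    toy,
    cols_obj=[],
    cols_int=["is_card_f30d", "is_ftmau_f30d"],
    cols_dt=["user_created", "closed_at"],
)
print(toy.dtypes)
```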

In [10]:
data["is_closed"] = pd.notnull(
    data.closed_at
)  # returns True if account closed, False if not
data["is_closed"] = data["is_closed"].astype(int)

data["is_premium"] = data["is_premium"].astype(int)  # cast the bool flag to 0/1

data = data[
    data.error_classification != "Fraud"
]  # drop cases that are fraud as these wouldn't become loyal customers anyway

data = data.replace(
    {
        "error_classification": {
            "Serious": np.nan,
            "Minor": np.nan,
            "Please Specify": np.nan,
            "please specify": np.nan,
            "None": np.nan,
        }
    }
)

# have to re-bin audio_video_quality because the rating scale changed a few months ago
data["audio_video_quality"] = (
    data["audio_video_quality"]
    .replace("very good", "acceptable")
    .replace("OK", "acceptable")
    .replace("improvable", "not acceptable")
)

data["kyc_reset"] = pd.notnull(
    data.error_classification
)  # returns True if the call had to be reset, False if not
data["kyc_reset"] = data["kyc_reset"].astype(str)

data["su_to_kyc_days"] = (data["call_initiated"] - data["user_created"]).dt.days
# data is highly skewed towards zero value
data["su_to_kyc_days_log"] = np.log(data["su_to_kyc_days"] + 1)
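
The feature derivations in the cell above can be tried on a toy frame. The sketch below uses made-up values. Only the column names and the quality labels come from the notebook, and the `"Reset"` label is an assumption about how the remaining error class is spelled in the data. It uses a single mapping dict for the re-binning and `np.log1p`, which is equivalent to `np.log(x + 1)`. Unlike the notebook, it keeps `kyc_reset` as a boolean rather than casting it to a string.

```python
import numpy as np
import pandas as pd

# Made-up rows; "Reset" is an assumed spelling of the reset error class.
toy = pd.DataFrame(
    {
        "audio_video_quality": ["very good", "OK", "acceptable", "improvable"],
        "error_classification": [np.nan, "Reset", np.nan, np.nan],
        "su_to_kyc_days": [0, 0, 95, 289],
    }
)

# Collapse the old and new rating scales into two bins.
quality_map = {
    "very good": "acceptable",
    "OK": "acceptable",
    "improvable": "not acceptable",
}
toy["audio_video_quality"] = toy["audio_video_quality"].replace(quality_map)

# After non-reset classes are set to NaN, any remaining value marks a reset.
toy["kyc_reset"] = toy["error_classification"].notna()

# log1p(x) == log(x + 1); keeps zero-day values finite.
toy["su_to_kyc_days_log"] = np.log1p(toy["su_to_kyc_days"])

print(toy)
```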

In [11]:
rm_cols = [
    "closed_at",  # converted to is_closed
    "product_id",  # we have is_premium
    "error_classification",  # we have kyc_reset
    "user_created",  # we use user_id
    "kyc_finished",
    "call_initiated",
    "call_completed",
    "error_scoring_total",  # for now remove because not working
]

data.drop(rm_cols, axis=1, inplace=True)

In [12]:
filename = generate_filename("data/", "DATA-8971", "cleaned", date_start, date_end)
# save_data(data, filename)