# Finnq Data Generation

## Chris Wilkinson | Alexander Bricken | Samad Twemlow-Carter (CAS)

---

**"We are looking to understand how using a user persona model will allow for greater optimisation and a better, personalised user experience for incoming FinnQ users."**

### Research

We searched for existing datasets that could emulate the FinnQ dataset, but found none that matched it closely enough.

To generate the dataset required to train the model, we instead took inspiration from other datasets that share one or two similar columns.

Here are the datasets we have used for inspiration:
- https://users.nber.org/~rdehejia/data/.nswdata2.html (NSW Data - Lalonde)
- http://hdr.undp.org/sites/all/themes/hdr_theme/country-notes/KOR.pdf (Korean HDI, 2018)
- Personal experience.

---

### Data Generation

In [110]:
# IMPORT LIBRARIES
import pandas as pd
import numpy as np
import math
import statistics as stat
import random
from scipy.stats import skewnorm
import pickle
from sklearn.model_selection import train_test_split

#### Sign Up (Pre-)Data

This is the data that will be acquired from users answering questions when they first sign up to the app. It will allow for initial classification.

**Categorical:** 
- Age
- Gender
- GNI per capita, PPP ($)
- Level of education
- Reason for downloading Finnq (this can be split into 7 different categories):
    - Saving
    - Cash transfers
    - Mobile banking
    - Insurance
    - P2P investing
    - Everything
    - Not really sure

This is the data that will be acquired once users rack up more hours on the platform. We can use this data to tailor the experience for future new users.

**App Use variables:**
- Time spent on individual screens
- Type of account linked
- Use of particular features

We want enough data to draw valuable insights. Thus, we will be generating a dataset of 5000 users.

In [100]:
### Generation
random.seed(100)

# set amount of users
user_num = 5000

# generate data frame for users to be filled
# we have 5 columns
columns = ['Age (years)', 'Gender', 'GNI per capita (PPP $)', 'Education (years)', 'Reason']
signup_df = pd.DataFrame(index=range(user_num), columns=columns)
signup_df

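|  | Age (years) | Gender | GNI per capita (PPP $) | Education (years) | Reason |
|---|---|---|---|---|---|
| 0 | NaN | NaN | NaN | NaN | NaN |
| 1 | NaN | NaN | NaN | NaN | NaN |
| 2 | NaN | NaN | NaN | NaN | NaN |
| 3 | NaN | NaN | NaN | NaN | NaN |
| 4 | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... |
| 4995 | NaN | NaN | NaN | NaN | NaN |
| 4996 | NaN | NaN | NaN | NaN | NaN |
| 4997 | NaN | NaN | NaN | NaN | NaN |
| 4998 | NaN | NaN | NaN | NaN | NaN |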


In [107]:
random.seed(100)
np.random.seed(100)
# ages from 15 to 79 (randrange excludes the upper bound)
ages = [random.randrange(15, 80, 1) for i in range(user_num)]

# gender roughly 50/50 split
# 1 = male
# 0 = female
genders = [random.randrange(0, 2, 1) for i in range(user_num)]

# income distribution
# generate with a right skew
# $35,945 = GNI per capita
# draw skew-normal multipliers (a=2 gives a longer right tail)
c = skewnorm.rvs(a=2, size=user_num)
# scale the GNI figure by each multiplier; abs() folds negative draws to positive incomes
incomes = np.abs(35945 * c).tolist()

# level of education (years), slight left skew
# range 0 - 20+
# expected years of education = 16
# binomial(25, 0.6) takes values 0-25 with mean 25 * 0.6 = 15, close to that figure
educ = np.random.binomial(25, 0.6, user_num)


# reason for downloading Finnq
reason_list = ['saving', 'cash_transfers', 'mobile_banking', 'insurance', 'p2p_investing', 'everything', 'not_sure']
reasons = [random.choice(reason_list) for i in range(user_num)]
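
The income step in this cell can be checked in isolation. The following is a minimal sketch, assuming the same GNI figure and `a=2` as above; the local generator `rng`, its seed, and the names `factors` and `sample_skewness` are hypothetical and not part of the notebook.

```python
import numpy as np
from scipy.stats import skew, skewnorm

GNI_PER_CAPITA = 35945            # figure quoted in the cell above
rng = np.random.default_rng(100)  # hypothetical local generator, for repeatability

# Skew-normal multipliers; a > 0 gives a longer right tail.
factors = skewnorm.rvs(a=2, size=5000, random_state=rng)

# Scale the GNI figure and fold negative draws to positive incomes.
incomes = np.abs(GNI_PER_CAPITA * factors)

# Positive sample skewness indicates a right-skewed distribution.
sample_skewness = skew(incomes)
print(f"mean={incomes.mean():.0f}  median={np.median(incomes):.0f}  skewness={sample_skewness:.2f}")
```

One thing this makes visible: the multipliers are not centred on 1, so the average generated income sits below the 35,945 figure rather than at it. If the mean should match the GNI figure, the multipliers would need rescaling.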



In [108]:
# add values
signup_df.loc[:, 'Age (years)'] = ages
signup_df.loc[:, 'Gender'] = genders
signup_df.loc[:, 'GNI per capita (PPP $)'] = incomes
signup_df.loc[:, 'Education (years)'] = educ
signup_df.loc[:, 'Reason'] = reasons

In [109]:
signup_df

|  | Age (years) | Gender | GNI per capita (PPP $) | Education (years) | Reason |
|---|---|---|---|---|---|
| 0 | 33 | 1 | 79504.948110 | 11 | everything |
| 1 | 73 | 1 | 7387.088847 | 15 | saving |
| 2 | 73 | 1 | 67719.029937 | 14 | cash_transfers |
| 3 | 37 | 0 | 16049.154734 | 17 | p2p_investing |
| 4 | 65 | 1 | 10989.764805 | 17 | cash_transfers |
| ... | ... | ... | ... | ... | ... |
| 4995 | 53 | 1 | 24636.781609 | 16 | not_sure |
| 4996 | 29 | 0 | 37684.565267 | 19 | not_sure |
| 4997 | 32 | 1 | 5177.408313 | 15 | everything |
| 4998 | 62 | 1 | 31019.805448 | 15 | cash_transfers |


In [112]:
# split data into train and test
train_df, test_df = train_test_split(signup_df, test_size=0.2)

In [115]:
# output data as pickle
train_df.to_pickle("./data/train.pkl")
test_df.to_pickle("./data/test.pkl")
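
`to_pickle` does not create missing folders, so the `./data/` directory has to exist before the cell above runs. Below is a minimal sketch of a save-and-reload round trip; it uses a temporary directory and a two-row `demo_df` as stand-ins (both hypothetical) so that it runs without touching the real output files.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row stand-in for train_df.
demo_df = pd.DataFrame({
    "Age (years)": [33, 73],
    "Reason": ["everything", "saving"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    data_dir = os.path.join(tmp_dir, "data")
    os.makedirs(data_dir, exist_ok=True)  # create the folder before writing

    path = os.path.join(data_dir, "train.pkl")
    demo_df.to_pickle(path)

    # Read the file back and confirm nothing changed.
    restored_df = pd.read_pickle(path)

print(restored_df.equals(demo_df))
```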

In [98]:
# Optional: categorise the numeric columns into brackets

# income brackets
# 0-25000 ; 25001-50000 ; 50001-75000 ; 75001-100000 ; 100001+
# low     ; low-medium  ; medium      ; medium-high  ; high

# age ranges
# 0-15   ; 16-25      ; 26-35           ; 36-60    ; 61+
# infant ; millennial ; post-millennial ; mid-life ; retiree

# education (years)
# 0-5 ; 6-10       ; 11-15  ; 16-20       ; 21+
# low ; low-medium ; medium ; medium-high ; high
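
The brackets listed in the comments above could be applied with `pd.cut`. This is a sketch only, not the notebook's own implementation: the five-row `demo_df` and the new column names (`Income bracket`, `Age bracket`, `Education bracket`) are hypothetical, while the bin edges and labels are taken from the comments.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows using the sign-up column names.
demo_df = pd.DataFrame({
    "Age (years)": [15, 22, 33, 47, 73],
    "GNI per capita (PPP $)": [7387.09, 25000.0, 67719.03, 79504.95, 120000.0],
    "Education (years)": [4, 9, 15, 17, 22],
})

five_levels = ["low", "low-medium", "medium", "medium-high", "high"]
age_labels = ["infant", "millennial", "post-millennial", "mid-life", "retiree"]

# Bins are right-inclusive, e.g. (25000, 50000]; include_lowest keeps a value of 0.
income_bins = [0, 25000, 50000, 75000, 100000, np.inf]
age_bins = [0, 15, 25, 35, 60, np.inf]
educ_bins = [0, 5, 10, 15, 20, np.inf]

demo_df["Income bracket"] = pd.cut(
    demo_df["GNI per capita (PPP $)"], bins=income_bins, labels=five_levels, include_lowest=True
)
demo_df["Age bracket"] = pd.cut(
    demo_df["Age (years)"], bins=age_bins, labels=age_labels, include_lowest=True
)
demo_df["Education bracket"] = pd.cut(
    demo_df["Education (years)"], bins=educ_bins, labels=five_levels, include_lowest=True
)

print(demo_df[["Income bracket", "Age bracket", "Education bracket"]])
```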



#### App Use (Post-)Data

We need to generate data on how each user uses the app. Depending on whether this usage changes or reaffirms a user's initial classification, new users who log on to the app can be classified further.

We only look at users with over 10 hours spent using the app.
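
The 10-hour cutoff could be applied as a simple filter. This is a minimal sketch, assuming the `Total Time` column is measured in hours (the notebook does not state the unit); `demo_use_df`, `MIN_HOURS`, and the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical app-use rows; 'Total Time' is assumed to be in hours.
demo_use_df = pd.DataFrame({
    "Total Time": [2.5, 10.0, 10.5, 48.0],
    "Most Used Feature": ["saving", "insurance", "mobile_banking", "p2p_investing"],
})

MIN_HOURS = 10  # cutoff stated in the text above

# Keep only users with strictly more than MIN_HOURS on the app.
active_df = demo_use_df[demo_use_df["Total Time"] > MIN_HOURS]
print(active_df)
```

A strict `>` follows the wording "over 10 hours", so a user with exactly 10 hours is excluded.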

**App Use variables:**
- Time spent on individual screens
- Type of account linked
- Use of particular features

In [None]:
use_columns = ['Total Time', 'Initial Classification', 'Most Used Feature', 'Least Used Feature']
appuse_df = pd.DataFrame(index=range(user_num), columns=use_columns)
appuse_df