# FinnQ Data Generation

## Chris | Alexander | Samad (CAS)

---

**"We are looking to understand how using a user persona model will allow for greater optimisation and a better, personalised user experience for incoming FinnQ users."**

### Research

We searched for public datasets that could stand in for the FinnQ dataset, but found none that matched it closely enough.

To generate the dataset required to train the model, we instead take inspiration from other datasets that share one or two similar columns.

Here are the datasets we have used for inspiration:
- https://users.nber.org/~rdehejia/data/.nswdata2.html (NSW Data - Lalonde)
- http://hdr.undp.org/sites/all/themes/hdr_theme/country-notes/KOR.pdf (Korean HDI, 2018)
- Personal experience.

---

### Data Generation

In [47]:
# IMPORT LIBRARIES
import pandas as pd
import numpy as np
import math
import statistics as stat
import random
from scipy.stats import skewnorm

#### Sign Up (Pre-)Data

This is the data that will be acquired from users answering questions when they first sign up to the app. It will allow for initial classification.

**Categorical:** 
- Age
- Gender
- GNI per capita, PPP`($)`
- Level of education
- Reason for downloading FinnQ (this can be split into the following categories):
    - Saving
    - Cash transfers
    - Mobile banking
    - Insurance
    - P2P investing
    - Gift cards
    - Everything
    - Not really sure

This is the data that will be acquired once users rack up more hours on the platform. We can use it to tailor the experience for future new users.

**App Use variables:**
- Time spent on individual screens
- Type of account linked
- Use of particular features

We want enough data to draw meaningful insights, so we will generate a dataset of 1000 users.

In [44]:
### Generation
random.seed(100)

# set amount of users
user_num = 1000

# generate data frame for users to be filled
# we have 5 columns
columns = ['Age (years)', 'Gender', 'GNI per capita (PPP $)', 'Education (years)', 'Reason']
signup_df = pd.DataFrame(index=range(user_num), columns=columns)
signup_df

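| | Age (years) | Gender | GNI per capita (PPP $) | Education (years) | Reason |
|---|---|---|---|---|---|
| 0 | | | | | |
| 1 | | | | | |
| 2 | | | | | |
| 3 | | | | | |
| 4 | | | | | |
| ... | ... | ... | ... | ... | ... |
| 995 | | | | | |
| 996 | | | | | |
| 997 | | | | | |
| 998 | | | | | |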


In [112]:
random.seed(100)
np.random.seed(100)

# ages: uniform integers from 15 to 79 (randrange excludes the upper bound)
ages = [random.randrange(15, 80) for _ in range(user_num)]

# gender: roughly 50/50 split
# 1 = male
# 0 = female
gender = [random.randrange(0, 2) for _ in range(user_num)]

# income distribution
# right-skewed, with mean $35,945 = GNI per capita
# NOTE: the spread (income_scale) is an assumed value, not taken from a source
gni_mean = 35945
income_scale = 15000
# shift the distribution so that its mean lands on gni_mean
income_loc = gni_mean - skewnorm.mean(2, loc=0, scale=income_scale)
income = skewnorm.rvs(a=2, loc=income_loc, scale=income_scale, size=user_num)
income = np.clip(income, 0, None).tolist()  # incomes cannot be negative

# level of education (years), left skew (negative shape parameter)
# range 0 - 20+
# expected years of education = 16
# NOTE: the spread (educ_scale) and the cap of 25 years are assumed values
educ_mean = 16
educ_scale = 3
educ_loc = educ_mean - skewnorm.mean(-0.5, loc=0, scale=educ_scale)
educ = skewnorm.rvs(a=-0.5, loc=educ_loc, scale=educ_scale, size=user_num)
# round to whole years and clip so that no user has negative years of education
educ = np.clip(np.round(educ), 0, 25).astype(int).tolist()


# reason for downloading FinnQ

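The `Reason` column is listed among the sign-up variables but is not generated in the cell above. A minimal sketch of one way to fill it, assuming every category is equally likely unless weights are supplied (the helper name `sample_reasons`, the uniform default, and the seed are illustrative assumptions, not part of the original notebook):

```python
import random

# The eight reasons listed in the sign-up section.
REASONS = [
    "Saving",
    "Cash transfers",
    "Mobile banking",
    "Insurance",
    "P2P investing",
    "Gift cards",
    "Everything",
    "Not really sure",
]


def sample_reasons(n, weights=None, seed=100):
    """Draw n reasons with replacement.

    weights=None means uniform sampling, which is an assumption;
    real sign-up data would likely favour some reasons over others.
    """
    rng = random.Random(seed)  # local generator, leaves the global seed untouched
    return rng.choices(REASONS, weights=weights, k=n)


reason = sample_reasons(1000)
print(reason[:5])
```

Passing a list of eight weights to `sample_reasons` would let the synthetic users lean towards particular reasons once there is evidence for how common each one is.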

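The income and education columns depend on `skewnorm` producing the intended skew direction. As a quick sanity check, the sketch below draws from `skewnorm` with a positive and a negative shape parameter `a` and measures the sample skewness with `scipy.stats.skew`. The sample size and seed are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import skew, skewnorm

rng = np.random.default_rng(100)

# A positive shape parameter gives a right skew, a negative one a left skew.
right = skewnorm.rvs(a=2, size=10_000, random_state=rng)
left = skewnorm.rvs(a=-2, size=10_000, random_state=rng)

print("sample skewness, a=2: ", skew(right))
print("sample skewness, a=-2:", skew(left))

# With the default loc=0 and scale=1 the draws straddle zero,
# so a share of them is negative even when the skew is to the right.
share_negative = np.mean(right < 0)
print("share of negative draws, a=2:", share_negative)
```

Because unshifted draws include negative values, multiplying them by a constant such as 16 or 35945 also yields negative values. Setting `loc` and `scale` explicitly is one way to keep the generated values in a plausible range.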
In [None]:
# Categorising

# income brackets (PPP $):
# 0-25000; 25000-50000; 50000-75000; 75000-100000; 100000-125000
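The brackets above could be applied with `pandas.cut`. This is a sketch under assumptions: the helper name `income_bracket` and the label strings are illustrative, and each bracket is treated as including its lower edge and excluding its upper edge, which the original comment does not specify.

```python
import pandas as pd

# Bracket edges taken from the comment above.
BINS = [0, 25000, 50000, 75000, 100000, 125000]
LABELS = ["0-25000", "25000-50000", "50000-75000", "75000-100000", "100000-125000"]


def income_bracket(income):
    """Map numeric incomes to bracket labels.

    right=False makes each bracket closed on the left, e.g. [25000, 50000).
    Incomes outside 0-125000 fall in no bracket and become NaN.
    """
    return pd.cut(pd.Series(income), bins=BINS, labels=LABELS, right=False)


example = income_bracket([12000.0, 25000.0, 60500.5, 124999.0, 130000.0])
print(example.tolist())
```

If the generated incomes can exceed 125000, an extra open-ended bracket (for example with `float("inf")` as the last edge) would avoid missing values in the categorised column.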

#### App Use (Post-)Data