# Finnq Data Classification

## Chris Wilkinson | Alexander Bricken | Samad Twemlow-Carter (CAS)

---

In [155]:
# IMPORT LIBRARIES
import pandas as pd
import numpy as np
import math
import statistics as stat
import random
import sklearn
import matplotlib.pyplot as plt

---

## App Collected Data

### Categorical Intensity Approach

In [156]:
df2 = pd.read_csv('data/use_data.csv')

By using a categorical intensity equation similar to Airbnb's, we can track how intensely a user uses a given feature.

If we set a threshold on categorical intensity, above which the model treats a feature as used enough to appear on the dashboard, then we can update each user's list of included features accordingly.

In [157]:
# we can delete the click columns (they are calculated randomly as nested lists below)
df2 = df2.drop(['savings', 'p2p', 'market_pur', 'credit_checks'], axis=1)

#### Sample data generation: Filling the model with clicks

The number of clicks should correlate with the persona initially assigned to each user. However, some error should also be introduced to represent users who were classified into a persona that does not match their actual use, so that the app can be updated for them.

Columns to fill:
- savings
- cash_transfer
- p2p
- market_pur
- credit_checks

To reference the nested click lists in the categorical intensity calculation, each row's index is used to look up that user's list of daily clicks in the corresponding list of lists.
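The persona-correlated clicks and the classification error described above could be sketched as follows. This is only an illustration: `simulate_clicks`, `MAIN_FEATURE`, and the `error_rate` parameter are hypothetical names that do not appear in the notebook, and the persona-to-feature mapping is taken from the loop later in this notebook (persona 0 has no main feature). The click ranges reuse the notebook's low (50 to 150), high (150 to 250), and market (0 to 200) bounds.

```python
import random

FEATURES = ["savings", "p2p", "market_pur", "credit_checks"]
# Assumed mapping from persona id to its "main" feature (persona 0 has none)
MAIN_FEATURE = {1: "p2p", 2: "savings", 3: "credit_checks"}


def simulate_clicks(persona, rng, error_rate=0.1, days=7):
    """Return {feature: [clicks per day]} for one user.

    With probability error_rate the user's heavy feature is swapped for a
    different one, imitating a user whose persona does not match their use.
    """
    main = MAIN_FEATURE.get(persona)
    if main is not None and rng.random() < error_rate:
        others = [f for f in MAIN_FEATURE.values() if f != main]
        main = rng.choice(others)

    clicks = {}
    for feature in FEATURES:
        if feature == "market_pur":
            low, high = 0, 200      # not tied to any persona
        elif feature == main:
            low, high = 150, 250    # heavily used feature
        else:
            low, high = 50, 150     # background usage
        clicks[feature] = [rng.randint(low, high) for _ in range(days)]
    return clicks


rng = random.Random(0)
# One nested list per user, addressed by the same index as the DataFrame row
users = [simulate_clicks(persona, rng) for persona in (3, 3, 2, 1)]
print(users[2]["savings"])
```

Keeping one dictionary per user at the same position as the DataFrame row preserves the index-based lookup described above.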

In [158]:
low_min = 50
low_max = 150
high_min = 150
high_max = 250

In [159]:
# generate nested lists of click data

# this function makes random nested lists for each category outside the "main" one
def random_avg_list(low_min, low_max, stor_len):
    rand_click_list1 = [[random.randint(low_min, low_max) for i in range(7)] for j in range(stor_len)]
    rand_click_list2 = [[random.randint(low_min, low_max) for i in range(7)] for j in range(stor_len)]
    rand_click_list3 = [[random.randint(low_min, low_max) for i in range(7)] for j in range(stor_len)]
    return rand_click_list1, rand_click_list2, rand_click_list3

# main nested lists for respective calling
saving_click_list = [[random.randint(high_min, high_max) for i in range(7)] for k in range(len(df2))]
p2p_click_list = [[random.randint(high_min, high_max) for i in range(7)] for k in range(len(df2))]
credit_click_list = [[random.randint(high_min, high_max) for i in range(7)] for k in range(len(df2))]
market_click_list = [[random.randint(0, 200) for i in range(7)] for k in range(len(df2))]
# market_pur is kind of a random experiment that doesn't relate to any given persona.
# will be generating random values surrounding it.

In [160]:
df2

Unnamed: 0,persona,cat_savings,cat_p2p,cat_market_pur,cat_credit_checks,threshold,included features
0,3,,,,,,
1,3,,,,,,
2,3,,,,,,
3,2,,,,,,
4,3,,,,,,
...,...,...,...,...,...,...,...
245,1,,,,,,
246,2,,,,,,
247,3,,,,,,
248,1,,,,,,


#### Categorical Intensity

In [161]:
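# input: list of clicks per day (alpha is read from the module-level constant)
# output: category score for the period of days
alpha = 0.3
def category_intensity(A_d):
    n_days = len(A_d)
    cat_score = 0
    for d in range(n_days):
        # the weight for position d is alpha**(d - n_days); with alpha < 1 the
        # exponent is negative, so entries near the start of the list weigh most
        cat_score += alpha ** (d - n_days) * A_d[d]
    return cat_score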
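To make the weighting in this sum concrete, here is a small self-contained sketch. It reimplements the same formula with `alpha` as an explicit argument; the name `intensity` and the explicit argument are illustrative choices, not part of the notebook.

```python
def intensity(clicks, alpha=0.3):
    """Sum of alpha**(d - n) * clicks[d], mirroring category_intensity."""
    n = len(clicks)
    return sum(alpha ** (d - n) * c for d, c in enumerate(clicks))


# Weights applied to each position of a 7-entry list when alpha = 0.3
weights = [0.3 ** (d - 7) for d in range(7)]
print([round(w, 1) for w in weights])

# A single click placed at the first versus the last position of the list
first = intensity([1, 0, 0, 0, 0, 0, 0])
last = intensity([0, 0, 0, 0, 0, 0, 1])
print(first, last)
```

Because alpha is below 1 and the exponent `d - n` is negative, each step along the list multiplies the weight by alpha. Which days dominate the score therefore depends on which end of the list holds the most recent day.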

In [162]:
# calculate the categorical intensity of each feature for every user

rand_list1, rand_list2, rand_list3 = random_avg_list(low_min, low_max, len(df2))

for index in df2.index:
    persona = df2.loc[index, 'persona']

    # .loc[row, column] writes to df2 directly; chained indexing such as
    # df2['col'][index] = ... triggers pandas' SettingWithCopyWarning

    # for defaulters: no "main" feature, so every list is low-range
    if persona == 0:
        df2.loc[index, 'cat_savings'] = category_intensity(rand_list1[index])
        df2.loc[index, 'cat_p2p'] = category_intensity(rand_list2[index])
        df2.loc[index, 'cat_market_pur'] = category_intensity(market_click_list[index])
        df2.loc[index, 'cat_credit_checks'] = category_intensity(rand_list3[index])

    # for p2pers: p2p uses the high-range list
    elif persona == 1:
        df2.loc[index, 'cat_savings'] = category_intensity(rand_list1[index])
        df2.loc[index, 'cat_p2p'] = category_intensity(p2p_click_list[index])
        df2.loc[index, 'cat_market_pur'] = category_intensity(market_click_list[index])
        df2.loc[index, 'cat_credit_checks'] = category_intensity(rand_list2[index])

    # for savers (budgeters): savings uses the high-range list
    elif persona == 2:
        df2.loc[index, 'cat_savings'] = category_intensity(saving_click_list[index])
        df2.loc[index, 'cat_p2p'] = category_intensity(rand_list1[index])
        df2.loc[index, 'cat_market_pur'] = category_intensity(market_click_list[index])
        df2.loc[index, 'cat_credit_checks'] = category_intensity(rand_list2[index])

    # for mobile bankers: credit checks use the high-range list
    elif persona == 3:
        df2.loc[index, 'cat_savings'] = category_intensity(rand_list1[index])
        df2.loc[index, 'cat_p2p'] = category_intensity(rand_list2[index])
        df2.loc[index, 'cat_market_pur'] = category_intensity(market_click_list[index])
        df2.loc[index, 'cat_credit_checks'] = category_intensity(credit_click_list[index])

In [163]:
df2

Unnamed: 0,persona,cat_savings,cat_p2p,cat_market_pur,cat_credit_checks,threshold,included features
0,3,6.928609e+05,4.228759e+05,5.587448e+05,1.090058e+06,,
1,3,3.783766e+05,4.511774e+05,1.043638e+06,1.072466e+06,,
2,3,9.256673e+05,7.192506e+05,3.189865e+05,1.073527e+06,,
3,2,1.071845e+06,7.354322e+05,5.408165e+05,6.774263e+05,,
4,3,5.804185e+05,4.747248e+05,9.100111e+05,1.446357e+06,,
...,...,...,...,...,...,...,...
245,1,6.981903e+05,1.148068e+06,6.116457e+05,8.959548e+05,,
246,2,1.511399e+06,6.987100e+05,1.427405e+05,7.370665e+05,,
247,3,8.042252e+05,5.366507e+05,4.911962e+05,1.234756e+06,,
248,1,6.442118e+05,1.504719e+06,7.849644e+05,6.394082e+05,,


#### Threshold Calculation

The threshold needs to be a function of the highest categorical intensity that a given user has across all features. If a user has a lot of clicks across many features, the top ones should be selected for dashboard display. Similarly, if the user has very few clicks (i.e. hasn't used the app a lot recently) except on one feature, then that single most-used feature should be the only one to exceed the threshold.

$$ \text{Threshold} = 0.8\,x_{max} $$
where $x_{max}$ is the categorical intensity of the top feature used.

In [164]:
def threshold_calc(max_cat):
    bound = 0.8 * max_cat
    return bound
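As a worked check of this rule, the sketch below applies the 0.8 threshold to the four intensities shown for row 1 of the table above. The helper name `select_features` and the dictionary layout are illustrative and not part of the notebook.

```python
def threshold_calc(max_cat):
    return 0.8 * max_cat


def select_features(scores):
    """Return the features whose intensity exceeds 0.8 times the user's maximum."""
    bound = threshold_calc(max(scores.values()))
    return [name for name, score in scores.items() if score > bound]


# Intensities copied from row 1 of the displayed DataFrame
row_1 = {
    "cat_savings": 3.783766e05,
    "cat_p2p": 4.511774e05,
    "cat_market_pur": 1.043638e06,
    "cat_credit_checks": 1.072466e06,
}

selected = select_features(row_1)
print(selected)
```

Here the bound is 0.8 times the credit-check intensity, roughly 8.58e5, so only the market-purchase and credit-check intensities exceed it.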

In [147]:
df2

Unnamed: 0,persona,cat_savings,cat_p2p,cat_market_pur,cat_credit_checks,threshold,included features
0,3,5.254208e+05,4.542298e+05,9.585329e+05,1.503272e+06,,
1,3,4.158050e+05,5.752036e+05,3.918490e+05,1.253129e+06,,
2,3,4.052533e+05,8.135669e+05,1.139420e+06,1.476940e+06,,
3,2,1.208285e+06,5.058355e+05,6.336193e+05,6.671689e+05,,
4,3,6.828084e+05,7.008593e+05,1.011514e+06,1.574282e+06,,
...,...,...,...,...,...,...,...
245,1,8.922328e+05,1.137437e+06,8.215461e+05,5.969087e+05,,
246,2,1.205054e+06,3.861146e+05,6.040519e+05,6.503133e+05,,
247,3,4.233219e+05,8.073730e+05,8.567114e+05,1.168098e+06,,
248,1,8.019577e+05,1.239318e+06,1.075111e+06,8.092244e+05,,


In [168]:
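# calculate thresholds from the four intensity columns only, so that the
# index, persona, and text columns can never influence the row maximum
cat_cols = ['cat_savings', 'cat_p2p', 'cat_market_pur', 'cat_credit_checks']
max_list = df2[cat_cols].max(axis=1)

df2['threshold'] = threshold_calc(max_list)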

In [171]:
df2['threshold'][1]

857973.0406950162

In [172]:
# check cat intensity values against thresholds
included = []
for index in df2.index:

    threshold = df2.loc[index, 'threshold']
    features = ""

    # add budget saver to list of features if greater than threshold
    if df2.loc[index, 'cat_savings'] > threshold:
        features += "Budget Saver; "

    # add p2p to list of features if greater than threshold
    if df2.loc[index, 'cat_p2p'] > threshold:
        features += "P2P; "

    # add market purchaser to list of features if greater than threshold
    if df2.loc[index, 'cat_market_pur'] > threshold:
        features += "Market Purchaser; "

    # add credit user to list of features if greater than threshold
    if df2.loc[index, 'cat_credit_checks'] > threshold:
        features += "Credit User; "

    included.append(features)

# assign the whole column at once instead of using chained indexing,
# which avoids pandas' SettingWithCopyWarning
df2['included features'] = included


In [175]:
df2['included features']

0                         Credit User; 
1       Market Purchaser; Credit User; 
2           Budget Saver; Credit User; 
3                        Budget Saver; 
4                         Credit User; 
                     ...               
245                               P2P; 
246                      Budget Saver; 
247                       Credit User; 
248                               P2P; 
249    Budget Saver; Market Purchaser; 
Name: included features, Length: 250, dtype: object


In [176]:
for val in df2['included features']: 
    print(val) 

Credit User; 
Market Purchaser; Credit User; 
Budget Saver; Credit User; 
Budget Saver; 
Credit User; 
Budget Saver; 
P2P; 
Credit User; 
Credit User; 
Credit User; 
Budget Saver; 
P2P; 
Credit User; 
P2P; 
Credit User; 
Budget Saver; 
P2P; 
Credit User; 
Credit User; 
Credit User; 
Budget Saver; 
P2P; 
Budget Saver; 
Budget Saver; 
Budget Saver; 
Market Purchaser; Credit User; 
Budget Saver; P2P; 
Budget Saver; 
Market Purchaser; Credit User; 
Budget Saver; 
Credit User; 
Budget Saver; 
P2P; Market Purchaser; 
Credit User; 
Credit User; 
Credit User; 
P2P; 
Market Purchaser; Credit User; 
Budget Saver; 
Credit User; 
Budget Saver; 
Credit User; 
Budget Saver; Market Purchaser; 
Budget Saver; 
Credit User; 
Budget Saver; P2P; Credit User; 
P2P; Market Purchaser; 
Budget Saver; 
Credit User; 
Credit User; 
Credit User; 
Budget Saver; 
P2P; 
Credit User; 
Credit User; 
P2P; 
Budget Saver; 
Credit User; 
Budget Saver; 
Credit User; 
Credit User; 
Budget Saver; 
Credit User; 
P2P; Market P

In [177]:
df2

Unnamed: 0,persona,cat_savings,cat_p2p,cat_market_pur,cat_credit_checks,threshold,included features
0,3,6.928609e+05,4.228759e+05,5.587448e+05,1.090058e+06,8.720462e+05,Credit User;
1,3,3.783766e+05,4.511774e+05,1.043638e+06,1.072466e+06,8.579730e+05,Market Purchaser; Credit User;
2,3,9.256673e+05,7.192506e+05,3.189865e+05,1.073527e+06,8.588220e+05,Budget Saver; Credit User;
3,2,1.071845e+06,7.354322e+05,5.408165e+05,6.774263e+05,8.574760e+05,Budget Saver;
4,3,5.804185e+05,4.747248e+05,9.100111e+05,1.446357e+06,1.157086e+06,Credit User;
...,...,...,...,...,...,...,...
245,1,6.981903e+05,1.148068e+06,6.116457e+05,8.959548e+05,9.184547e+05,P2P;
246,2,1.511399e+06,6.987100e+05,1.427405e+05,7.370665e+05,1.209119e+06,Budget Saver;
247,3,8.042252e+05,5.366507e+05,4.911962e+05,1.234756e+06,9.878048e+05,Credit User;
248,1,6.442118e+05,1.504719e+06,7.849644e+05,6.394082e+05,1.203775e+06,P2P;


In [178]:
# output as csv file 
df2.to_csv('resultant_CI_features.csv')

The final output lists more than one feature for users whose secondary features exceed 80% of the intensity of their top feature.

We could further refine this model and its arbitrary constants, alpha and the threshold constant (currently set at 0.3 and 0.8, respectively), by testing different values and gathering satisfaction data from users through surveys such as the Net Promoter Score.
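One way to begin testing different values is a small grid sweep over both constants. The sketch below runs on synthetic clicks for a single made-up user; the candidate values, the seed, and the helper names `intensity` and `selected_features` are assumptions for illustration only. It does not include any satisfaction data, which would have to come from real users.

```python
import random


def intensity(clicks, alpha):
    """Same weighted sum as category_intensity, with alpha passed explicitly."""
    n = len(clicks)
    return sum(alpha ** (d - n) * c for d, c in enumerate(clicks))


def selected_features(clicks_by_feature, alpha, fraction):
    """Features whose intensity exceeds fraction times the user's maximum."""
    scores = {f: intensity(c, alpha) for f, c in clicks_by_feature.items()}
    bound = fraction * max(scores.values())
    return [f for f, s in scores.items() if s > bound]


random.seed(0)
# Synthetic 7-day click history for one hypothetical user
user = {
    feature: [random.randint(50, 250) for _ in range(7)]
    for feature in ("savings", "p2p", "market_pur", "credit_checks")
}

results = {}
for alpha in (0.3, 0.5, 0.9):            # candidate decay constants
    for fraction in (0.6, 0.8, 0.95):    # candidate threshold constants
        results[(alpha, fraction)] = selected_features(user, alpha, fraction)
        print(alpha, fraction, results[(alpha, fraction)])
```

For a fixed alpha, raising the threshold constant can only shrink the list of selected features, while the top feature is always kept. The sweep therefore shows how many dashboard features each setting would produce before any survey data is collected.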

---