## Investment customer quality and look-alike analysis

#### What is look-alike
Look-alike modeling identifies people who share features with, or look and act like, a target audience. It is commonly used in marketing to find prospects whose behavior patterns resemble those of target clients.

#### How to perform look-alike

1. Always understand the goal first: "best customers" may be defined differently depending on the needs and the business model. For example, in this case the best customers could be either those who purchased multiple products and engage more deeply with the entity, or simply those who signed a contract with the business and paid the fixed service fee.
2. Understand the "best customer" segment and define its commonalities. These traits may include anything from household income and balance with the firm to engagement with the business or activity on different channels and platforms.
3. Once we know what a "best customer" looks like, we can identify all customers in the wider database who share the same characteristics, develop relationships with them, and try to turn them into real "best customers".

Look-alike modeling can be framed as traditional binary classification if a "not-alike" group is obtainable, or as anomaly detection (one-class classification) when the not-alike minority differs significantly from the look-alike group.
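As a rough illustration of steps 2 and 3, the sketch below summarizes a "best customer" segment by its feature centroid and ranks the remaining customers by Euclidean distance to it. The customer ids, the feature values, and the centroid-distance rule are all hypothetical choices made for illustration; the analysis in this notebook relies on classification instead.

```python
import math

# Hypothetical, already-scaled features per customer: (balance, engagement)
best_customers = {"A": (0.9, 0.8), "B": (0.8, 0.9)}
prospects = {"C": (0.85, 0.8), "D": (0.2, 0.1), "E": (0.6, 0.7)}

def centroid(rows):
    """Column-wise mean of a collection of equal-length feature tuples."""
    rows = list(rows)
    return tuple(sum(col) / len(rows) for col in zip(*rows))

def rank_lookalikes(best, candidates):
    """Return candidate ids sorted from most to least similar to the best segment."""
    center = centroid(best.values())
    return sorted(candidates, key=lambda cid: math.dist(candidates[cid], center))

ranking = rank_lookalikes(best_customers, prospects)
print(ranking)
```

A distance rule like this only makes sense when the features share a common scale, which is one reason scaling appears in the pre-processing step.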

### Pre-processing
Besides cleaning and imputing missing data, random undersampling (at this stage only to reduce the data size) and scaling are performed.

In [None]:
# remove missing value
def mis_val_10_del(df):
    """Drop rows with missing values in sparsely-missing columns (modifies df in place)."""
    mis_val = df.columns[df.isnull().any()].tolist()
    print('No. of variables with missing values: ', len(mis_val))
    print(df.shape)
    mis = pd.DataFrame(pd.Series(mis_val), columns=['Var'])
    # percentage of missing values per variable
    mis['PCT'] = [100 * df[i].isnull().mean() for i in mis_val]
    print(mis)
    threshold = float(input('Threshold for deleting records with missing values (input 0-100; recommend <10%):  '))
    for i in mis_val:
        # drop rows only for variables whose current missing rate is below the threshold
        if 100 * df[i].isnull().mean() < threshold:
            df.dropna(inplace=True, subset=[i])
    print('Variables that still have missing data (missing rate of at least', threshold, 'percent): ',
          df.columns[df.isnull().any()].tolist(), 'current df shape: ', df.shape)
    
# alternatively, fill the missing values (fillna returns a new object, so assign the result back)
# numerical
# df.fillna(df.mean(numeric_only=True)) # df.fillna(df.median(numeric_only=True))
# categorical: fill with the most frequent value of each column
# df.select_dtypes(include=['object']).apply(lambda x: x.fillna(x.value_counts().index[0]))

In [None]:
# Undersampling
import imblearn
from imblearn.under_sampling import RandomUnderSampler
X = dfpo.drop(columns = ['Accept'])
y = dfpo['Accept']
y.value_counts() 
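# The data is imbalanced and very large; undersample the majority class to a PC-friendly size
# sampling_strategy=0.5 targets a minority:majority ratio of 1:2 after resampling
undersample = RandomUnderSampler(sampling_strategy=0.5)
# fit and apply the transform (the result is named *_over here although it is undersampled)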
X_over, y_over = undersample.fit_resample(X, y)
# summarize class distribution
y_over.value_counts()

# Scale dataset
import pandas as pd
import numpy as np
Num_features=X_over.select_dtypes(include=[np.number]).columns
Num_features
# Scale only the numerical columns (scaling binary columns brings no benefit)
from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler() 
X_over[Num_features]=scaler.fit_transform(X_over[Num_features])


#### For confidentiality reasons, the raw data has already been scaled, so normalization is not repeated in the following analysis.
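For reference, min-max scaling maps each numeric column to the range [0, 1] via (x - min) / (max - min). The sketch below applies that formula by hand to a hypothetical list of balances; the helper name `min_max_scale`, the sample values, and the choice to map a constant column to 0.0 are assumptions made for illustration.

```python
def min_max_scale(values):
    """Scale a list of numbers to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map it to 0.0 to avoid dividing by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

balances = [100.0, 250.0, 400.0]   # hypothetical raw balances
scaled = min_max_scale(balances)
print(scaled)
```

Because the minimum and maximum come from the data, a single extreme value compresses every other value toward 0, which is worth checking before relying on the scaled columns.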

## Preparing

- Feature Selection and Correlation/Association analysis
- Encode

In [None]:
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
df = pd.read_csv('/kaggle/input/invst-target/referral_categ_undersamp_scaled.csv')

## Measure of Association
Technically, association refers to any relationship between two variables, whereas correlation usually refers only to a linear relationship. I found that computing correlations after one-hot encoding the categorical data expands the data size significantly and still does not resolve the scale and order issues, so I decided to use measures of association from the dython package instead.

In the [Dython](http://shakedzy.xyz/dython/modules/nominal/) package, each pairing of variable types is measured with a suitable statistic:
* Pearson's R for continuous-continuous cases 
* Correlation Ratio for categorical-continuous cases 
* Cramer's V or Theil's U for categorical-categorical cases
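To make the categorical-continuous case concrete, here is a minimal pure-Python sketch of the correlation ratio (eta): the square root of the between-group sum of squares divided by the total sum of squares. The function and the sample data are illustrative assumptions, not dython's implementation.

```python
import math
from collections import defaultdict

def correlation_ratio(categories, values):
    """Correlation ratio (eta) between a categorical and a continuous variable."""
    groups = defaultdict(list)
    for cat, val in zip(categories, values):
        groups[cat].append(val)
    grand_mean = sum(values) / len(values)
    # Between-group sum of squares, weighted by group size
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values())
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    return math.sqrt(ss_between / ss_total) if ss_total > 0 else 0.0

channel = ["web", "web", "branch", "branch"]   # hypothetical categorical feature
balance = [1.0, 3.0, 5.0, 7.0]                 # hypothetical continuous feature
eta = correlation_ratio(channel, balance)
print(round(eta, 3))
```

A value near 1 means the category almost determines the numeric value; a value near 0 means the group means are nearly identical.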

Cramér's V: a measure of association between two nominal variables, based on Pearson's chi-squared statistic, that takes values between 0 and 1 (inclusive).
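As a worked example, the uncorrected statistic is V = sqrt(chi² / (n · min(r − 1, c − 1))), where n is the number of observations and r and c are the numbers of distinct levels. The sketch below computes it from scratch on hypothetical labels; library implementations may apply a bias correction, so their values can differ slightly from this plain version.

```python
import math
from collections import Counter

def cramers_v(x, y):
    """Uncorrected Cramér's V between two equal-length sequences of labels."""
    n = len(x)
    joint = Counter(zip(x, y))
    row_totals, col_totals = Counter(x), Counter(y)
    chi2 = 0.0
    for r, r_count in row_totals.items():
        for c, c_count in col_totals.items():
            expected = r_count * c_count / n
            observed = joint.get((r, c), 0)
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(row_totals), len(col_totals)) - 1
    return math.sqrt(chi2 / (n * k)) if k > 0 else 0.0

accept = ["Y", "Y", "N", "N"]                   # hypothetical response
channel = ["web", "web", "branch", "branch"]    # perfectly associated with accept
mixed = ["web", "branch", "web", "branch"]      # independent of accept
v_perfect = cramers_v(accept, channel)
v_none = cramers_v(accept, mixed)
print(v_perfect, v_none)
```

Unlike Theil's U, Cramér's V is symmetric: swapping the two variables gives the same value.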

In [None]:
!pip install dython
import numpy as np
import dython
from dython.nominal import associations
import matplotlib.pyplot as plt
def associations_example():
    # Treat every object-dtype column as nominal (np.object is no longer available in recent NumPy)
    associations(df, nominal_columns=list(df.select_dtypes(include=['object']).columns))

import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"]=20,20
associations_example()

# ID is just the row number, so exclude it from the association matrix
assoc = associations(df.drop(columns = ['ID']), nominal_columns=list(df.select_dtypes(include=['object']).columns))
ass_corr = assoc['corr']
# Variables most strongly associated with the response, ACCEPT
pd.DataFrame(ass_corr['ACCEPT'].sort_values(ascending=False)).head(10)

As the heatmap shows, estimated total assets, estimated total deposits, estimated stock/fund holdings, and so on all share a similar color, meaning the estimated variables are strongly correlated (they may come from one source). This raises the question of whether those variables form a cluster and, if so, which other variables cluster together. Below, hierarchical clustering is used to determine which columns belong to which cluster.

In [None]:
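# A clustered correlation matrix
import numpy as np
import scipy
import scipy.cluster.hierarchy as sch

# Correlation is only defined for numeric columns, so restrict to them explicitly
corr_num = df.select_dtypes(include=[np.number]).corr()
X = corr_num.values
d = sch.distance.pdist(X)   # condensed vector of pairwise distances between columns
L = sch.linkage(d, method='complete')
ind = sch.fcluster(L, 0.5*d.max(), 'distance')
# Reorder the columns so that members of the same cluster sit next to each other
columns = [corr_num.columns[i] for i in np.argsort(ind)]
df_rein = df.reindex(columns, axis=1)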
def plot_corr(df,size=10):
    '''Plot a graphical correlation matrix for a dataframe.

    Input:
        df: pandas DataFrame
        size: vertical and horizontal size of the plot'''
    
    %matplotlib inline
    import matplotlib.pyplot as plt

    # Compute the correlation matrix for the received dataframe
    corr = df.corr()
    
    # Plot the correlation matrix
    fig, ax = plt.subplots(figsize=(size, size))
    cax = ax.matshow(corr, cmap='RdYlGn')
    plt.xticks(range(len(corr.columns)), corr.columns, rotation=90);
    plt.yticks(range(len(corr.columns)), corr.columns);
    
    # Add the colorbar legend
    cbar = fig.colorbar(cax, ticks=[-1, 0, 1], aspect=40, shrink=.8)
plot_corr(df_rein, size=18)

# Apart from the estimated variables, the loan variables appear to form one cluster, and the remaining balance variables belong to another.

In [None]:
pd.DataFrame(ass_corr['ACCEPT'].sort_values(ascending=False)).tail(10)
# With 0.05 as the threshold, the variables below may have no impact on acceptance of the investment product,
# in other words no impact on the definition of a 'best customer'

#### Feature Selection
Knowing that some variables are strongly associated, some show no significant association, and some interact strongly, it is worth selecting only the relevant features (and looking into dimension reduction in the analysis that follows).

I highly recommend spending some time on this article from Dr. Brownlee:
https://machinelearningmastery.com/feature-selection-with-real-and-categorical-data/
Because the inputs here are a mix of numerical and categorical variables and the output is categorical, ANOVA (`f_classif()`), chi-squared, and mutual information (`mutual_info_classif()`) are the more suitable statistical measures for filter-based feature selection.
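To show what the ANOVA score measures, the sketch below computes a one-way F statistic by hand for a single feature split by class: the between-class mean square divided by the within-class mean square. This is the same kind of statistic `f_classif()` reports per feature; the helper name `anova_f` and the sample values are hypothetical.

```python
def anova_f(groups):
    """One-way ANOVA F statistic for a list of numeric groups (one group per class)."""
    all_values = [v for g in groups for v in g]
    n, k = len(all_values), len(groups)
    grand_mean = sum(all_values) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    # Mean squares: between-class variation relative to within-class variation
    return (ss_between / (k - 1)) / (ss_within / (n - k))

accepted = [4.0, 5.0, 6.0]   # hypothetical feature values where ACCEPT == 1
declined = [1.0, 2.0, 3.0]   # hypothetical feature values where ACCEPT == 0
f_stat = anova_f([accepted, declined])
print(f_stat)
```

A large F means the class means are far apart relative to the spread inside each class, so the feature separates the classes well on its own.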

In [None]:
dfc = df.copy()
# encode categorical variables
df.replace({"Y":1, "N":0}, inplace = True)
df.select_dtypes(include=['object']).columns

In [None]:
df['FP_CATG'].unique()

In [None]:
fstprd = pd.get_dummies(df['FP_CATG']) 
df = df.merge(fstprd[['SVNG', 'EQTY', 'RET','INVST', 'CD', 'MM']], left_index=True, right_index=True)
df.drop(columns=['ID', 'CHANNEL', 'FP_CATG'], inplace = True)
# check if still object 
print(df.select_dtypes(include=['object']).columns)
print(df.shape)

In [None]:
# Feature Selection
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif

X=df.drop(columns = ['ACCEPT'])
X_indices = np.arange(X.shape[-1])
y=df['ACCEPT']
plt.figure(1)
plt.clf()

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.8, stratify=y, random_state=10)
selector = SelectKBest(f_classif, k=X.shape[-1])
selector.fit(X_train, y_train)
scores = -np.log10(selector.pvalues_)
scores /= scores.max()

plt.bar(X_indices - .45, scores, width=.2,
        label=r'Univariate score ($-Log(p_{value})$)')

# Compare to the weights of an SVM
clf = make_pipeline(MinMaxScaler(), LinearSVC())
clf.fit(X_train, y_train)
print('Classification accuracy without selecting features: {:.3f}'
      .format(clf.score(X_test, y_test)))

svm_weights = np.abs(clf[-1].coef_).sum(axis=0)
svm_weights /= svm_weights.sum()

plt.bar(X_indices - .25, svm_weights, width=.2, label='SVM weight')

clf_selected = make_pipeline(
        SelectKBest(f_classif, k='all'), MinMaxScaler(), LinearSVC()
)
clf_selected.fit(X_train, y_train)
print('Classification accuracy after univariate feature selection: {:.3f}'
      .format(clf_selected.score(X_test, y_test)))

svm_weights_selected = np.abs(clf_selected[-1].coef_).sum(axis=0)
svm_weights_selected /= svm_weights_selected.sum()

plt.bar(X_indices[selector.get_support()] - .05, svm_weights_selected,
        width=.2, label='SVM weights after selection')


plt.title("Comparing feature selection")
plt.xlabel('Feature number')
plt.yticks(())
plt.axis('tight')
plt.legend(loc='upper right')
plt.show()
# the result is the same for every k from 35 up to the total number of features

In [None]:
fs= pd.concat([pd.Series(X.columns), round(pd.Series(selector.scores_),0), round(pd.Series(selector.pvalues_),5)], axis=1)
fs.columns = ["Val","F_score","P-value"]
fs.sort_values(by=['F_score'], ascending = False)
# Features with F value < 50 or p-value > 0.05 are considered unimportant
# The result is similar to the correlation (association) analysis

Remove unimportant features,
either by threshold or by business insight.
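A threshold-based filter could look like the sketch below, which keeps only features with an F score of at least 50 and a p-value of at most 0.05, mirroring the cut-offs noted in the previous cell. The frame `fs_demo` and its feature names and scores are hypothetical stand-ins for the real `fs` table.

```python
import pandas as pd

# Hypothetical scores in the same layout as the `fs` table (Val, F_score, P-value)
fs_demo = pd.DataFrame({
    "Val": ["feat_a", "feat_b", "feat_c", "feat_d"],
    "F_score": [820.0, 12.0, 64.0, 0.5],
    "P-value": [0.0, 0.0004, 0.001, 0.48],
})

# Keep a feature only if it passes both cut-offs
mask = (fs_demo["F_score"] >= 50) & (fs_demo["P-value"] <= 0.05)
keep = fs_demo.loc[mask, "Val"].tolist()
print(keep)
```

The resulting list can then be used to subset the feature columns, optionally adding back any feature that business insight says should stay.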

## One-class classification
Outlier detection with imbalanced data
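The cells in this section report two F1 scores, marked (n) and (p). Under the one-class convention, predictions are 1 for inliers and -1 for outliers, so F1 can be computed with either label as the positive class. The pure-Python sketch below shows the calculation on hypothetical labels; the helper `f1_for_label` is an illustrative stand-in for `f1_score(..., pos_label=...)`.

```python
def f1_for_label(y_true, y_pred, pos_label):
    """F1 score treating `pos_label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == pos_label and p == pos_label for t, p in pairs)
    fp = sum(t != pos_label and p == pos_label for t, p in pairs)
    fn = sum(t == pos_label and p != pos_label for t, p in pairs)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# One-class convention: 1 = inlier, -1 = outlier (hypothetical labels)
y_true = [1, 1, 1, 1, -1, -1]
y_pred = [1, 1, 1, -1, -1, 1]
f1_outlier = f1_for_label(y_true, y_pred, pos_label=-1)   # the "(n)" score
f1_inlier = f1_for_label(y_true, y_pred, pos_label=1)     # the "(p)" score
print(f1_outlier, f1_inlier)
```

With imbalanced data the two scores can differ widely, which is why both are printed.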

In [None]:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.svm import OneClassSVM

In [None]:
X = df.drop(columns = ['ACCEPT'])
y = df['ACCEPT']
print(y.value_counts())
# imbalanced data

# split into train/test sets
trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2, stratify=y)

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
trainX = scaler.fit_transform(trainX)
testX = scaler.transform(testX)

In [None]:
# Treat referred customers (minority, ACCEPT == 1) as outliers
print(trainy.value_counts())
# define outlier detection model
model = OneClassSVM(gamma='scale', nu=0.01)
# fit on the majority class only; keep trainX itself intact because later cells reuse it
trainX_major = trainX[(trainy == 0).to_numpy()]
model.fit(trainX_major)
# detect outliers in the test set
yhat = model.predict(testX)
# relabel a copy of the test labels: majority (0) -> inlier 1, minority (1) -> outlier -1
testy_out = testy.map({0: 1, 1: -1})
# F1 with the outliers (n) and with the inliers (p) as the positive class
score_u_n = f1_score(testy_out, yhat, pos_label=-1)
score_u_p = f1_score(testy_out, yhat, pos_label=1)
print('F1 Score:',round(score_u_n,4),'(n)',round(score_u_p,3),'(p)')
# We care about the negative (outlier) class; its score of 0.305 (n) is not as high as expected
# Non-investment customers span a wider range and contain outliers themselves, so next treat the non-referred (majority) as outliers

In [None]:
# Treat non-referred customers (majority, ACCEPT == 0) as outliers
import imblearn
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from numpy import mean

# define the oversampling strategy: SMOTE the minority class up to a 1:1 ratio
over = SMOTE(sampling_strategy = 1)
steps = [('o', over)]
pipeline = Pipeline(steps = steps)
# fit and apply the transform (requires trainX and trainy from the train/test split, unfiltered)
X_over, y_over = pipeline.fit_resample(trainX, trainy)
# summarize class distribution
print(y_over.value_counts())
# define outlier detection model
model_u = OneClassSVM(gamma='scale', nu=0.01)
# fit on class 1 only (the 0s were undersampled; the 1s are clean and treated as the normal class)
trainX_u = X_over[y_over==1]
model_u.fit(trainX_u)
# detect outliers in the test set
yhat = model_u.predict(testX)
# relabel a copy of the test labels: referred (1) stays inlier 1, everything else becomes outlier -1
testy_in = testy.where(testy == 1, -1)
# F1 with the outliers (n) and with the inliers (p) as the positive class
score_u_n = f1_score(testy_in, yhat, pos_label=-1)
score_u_p = f1_score(testy_in, yhat, pos_label=1)
print('F1 Score:',round(score_u_n,4),'(n)',round(score_u_p,3),'(p)')
# With oversampling, the score for the positive (referral) class rose to 0.489 (p), yet precision dropped significantly

One-class classification is meant for detecting outliers, but here the majority (customers who were not referred or have not yet accepted a referral) is not a clean set.
Resampling changed the referral/rest proportion, so the model could not perform well on the test set.
Next, check whether there is any boundary between referrals and the rest (a combination of future referrals and true non-referrals).

In [None]:
from sklearn.decomposition import PCA
import numpy as np
pca = PCA()
pca.fit(trainX)
np.cumsum(pca.explained_variance_ratio_)
# PCA does not show a clear separation between the classes in a 3D plot
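One way to read the cumulative explained-variance output above is to find the smallest number of leading components that reaches a target share of the variance. The sketch below does this for a hypothetical list of ratios; the helper name, the sample ratios, and the 0.85 target are assumptions made for illustration.

```python
from itertools import accumulate

def components_for_variance(ratios, target=0.95):
    """Smallest number of leading components whose cumulative ratio reaches `target`."""
    for k, total in enumerate(accumulate(ratios), start=1):
        # Small tolerance guards against floating-point rounding in the running sum
        if total >= target - 1e-12:
            return k
    return len(ratios)

ratios = [0.5, 0.25, 0.125, 0.125]   # hypothetical explained_variance_ratio_
n_keep = components_for_variance(ratios, target=0.85)
print(n_keep)
```

If many components are needed to reach the target, the first two or three carry only a small share of the variance, which is consistent with a 3D plot showing no clear separation.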