# Descriptive Task
The goal of the descriptive task is to identify customer segments by clustering the dataset.
We ran three algorithms, KMeans, KMedoids and DBSCAN; KMeans and KMedoids produced very similar clusters.

First, we import the train and test datasets generated in the Data Preparation stage, in both the pre-aggregation and the pre-scaling versions, and concatenate train and test for each version.

In [1]:
import pandas as pd

_train_unagg_df = pd.read_csv('./train_unagg.csv')
_test_unagg_df = pd.read_csv('./test_unagg.csv')
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows the same way
unagg_df = pd.concat([_train_unagg_df, _test_unagg_df])

_train_unsca_df = pd.read_csv('./train_unsca.csv')
_test_unsca_df = pd.read_csv('./test_unsca.csv')
unsca_df = pd.concat([_train_unsca_df, _test_unsca_df])
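
As a small illustration of the concatenation step, the sketch below stacks two toy frames with `pd.concat`. The frames and their values are made up. `ignore_index=True` is an optional choice that the notebook does not make: it gives the combined frame a fresh 0..n-1 index instead of repeating each file's row labels.

```python
import pandas as pd

# Toy stand-ins for the train and test frames (values are made up).
train_df = pd.DataFrame({"balance": [100.0, 250.0], "loan_amount": [5000, 8000]})
test_df = pd.DataFrame({"balance": [75.0], "loan_amount": [3000]})

# Stack the rows; ignore_index=True renumbers them 0..n-1 so that
# row labels from the two files do not collide.
combined_df = pd.concat([train_df, test_df], ignore_index=True)

print(combined_df)
```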

## Loan Type Segmentation
In this section we cluster the data by characteristics of the loans that were taken, using the ``balance``, ``loan_amount``, ``duration`` and ``payments`` columns.
The clusters separate mainly by loan size: larger loans come with longer durations and higher ``payments`` values, and the clusters with larger loans also have a somewhat higher mean balance.

In [2]:
_loan_df = unagg_df[['balance', 'loan_amount', 'duration', 'payments']]

### Using KMeans

In [3]:
from sklearn.cluster import KMeans

loan_df = _loan_df.copy()
clusters = KMeans(n_clusters=3, random_state=42)
clusters.fit(loan_df)

loan_df['cluster'] = clusters.labels_
loan_df = loan_df.groupby('cluster').agg('mean')
loan_df.head()

cluster,balance,loan_amount,duration,payments
0,42970.494761,67463.264272,27.404936,3103.66526
1,47085.75674,370291.83725,54.926045,6814.78135
2,46002.569553,196916.037834,43.592874,4857.1189
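
In the table above, ``loan_amount`` ranges over hundreds of thousands while ``duration`` ranges over tens, so Euclidean distance in KMeans is dominated by ``loan_amount``. One option, which the notebook does not use, is to standardize the columns before clustering and report the means in the original units. The following is a minimal sketch on synthetic data; the frame `toy_loan_df` and its values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in for the loan features (not the real data).
toy_loan_df = pd.DataFrame({
    "balance": rng.normal(45_000, 10_000, 300),
    "loan_amount": rng.normal(150_000, 100_000, 300),
    "duration": rng.choice([12, 24, 36, 48, 60], 300),
    "payments": rng.normal(4_000, 1_500, 300),
})

# Standardize each column to zero mean and unit variance so that no
# single column dominates the distance computation.
scaled = StandardScaler().fit_transform(toy_loan_df)

model = KMeans(n_clusters=3, n_init=10, random_state=42)
toy_loan_df["cluster"] = model.fit_predict(scaled)

# Report the cluster means in the original, unscaled units.
cluster_means = toy_loan_df.groupby("cluster").mean()
print(cluster_means)
```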


### Using KMedoids


In [4]:
# KMedoids raised a MemoryError on this dataset (not enough memory could be
# allocated), so the cell is left commented out.
# from sklearn_extra.cluster import KMedoids

# loan_df = _loan_df.copy()
# clusters = KMedoids(n_clusters=3, random_state=42)
# clusters.fit(loan_df)

# loan_df['cluster'] = clusters.labels_
# loan_df = loan_df.groupby('cluster').agg('mean')
# loan_df.head()
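
A likely cause of the MemoryError is that KMedoids works from a dense matrix of pairwise distances between all rows, which grows with the square of the row count. The sketch below only does the arithmetic; the row counts are hypothetical, since the real size of `unagg_df` is not shown in this notebook.

```python
def pairwise_matrix_gib(n_rows: int, bytes_per_value: int = 8) -> float:
    """Size in GiB of a dense n_rows x n_rows matrix of float64 distances."""
    return n_rows * n_rows * bytes_per_value / 2**30


# Hypothetical row counts, chosen only to show how fast the size grows.
for n_rows in (5_000, 50_000):
    print(f"{n_rows:>6} rows -> {pairwise_matrix_gib(n_rows):.2f} GiB")
```

If memory is the limit, one workaround is to fit KMedoids on a random subset, for example `_loan_df.sample(n=5000, random_state=42)`, at the cost of clustering only part of the data.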

### Using DBSCAN

In [5]:
from sklearn.cluster import DBSCAN

loan_df = _loan_df.copy()
clusters = DBSCAN(eps=100)
clusters.fit(loan_df)

loan_df['cluster'] = clusters.labels_
loan_df = loan_df.groupby('cluster').agg('mean')
loan_df.head()

cluster,balance,loan_amount,duration,payments
-1,45459.076389,154296.455538,36.51048,4250.054209
0,31808.54,80952.0,24.0,3373.0
1,27810.925,80952.0,24.0,3373.0
2,25151.677778,87216.0,48.0,1817.0
3,33466.133333,87216.0,48.0,1817.0
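
In the DBSCAN output above, the ``-1`` row is the label DBSCAN gives to points it treats as noise, and how many points end up there depends on `eps`. A common heuristic for picking `eps` is to sort every point's distance to its `min_samples`-th nearest neighbour and look for a sharp bend in the curve. The sketch below shows that computation on synthetic data; it is an illustration of the heuristic, not how `eps=100` was chosen in this notebook.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Synthetic 2-D data: two dense blobs plus a few scattered points.
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(100, 2)),
    rng.normal(loc=(10, 10), scale=0.5, size=(100, 2)),
    rng.uniform(low=-5, high=15, size=(10, 2)),
])

min_samples = 5  # DBSCAN's default

# Distance from every point to its min_samples-th nearest neighbour.
# Querying with the training set returns each point as its own first
# neighbour, which matches DBSCAN counting the point itself.
neighbors = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = neighbors.kneighbors(X)
k_distances = np.sort(distances[:, -1])

# A sharp bend ("knee") in this sorted curve is a common eps candidate.
print(np.percentile(k_distances, [50, 90, 99]))
```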


## Economic Power Segmentation
In this section we cluster the data by indicators of each client's economic power, taken from the account balance statistics. Specifically, we look at the ``balance_mean``, ``balance_min``, ``balance_max``, ``balance_std``, ``balance_bal_range`` and ``bal_per_month`` columns.
We can conclude that clients with a higher mean balance spend more than those with a lower one: their minimum balance is lower and their balance range is wider.

In [6]:
_econ_df = unsca_df[['balance_mean', 'balance_min', 'balance_max', 'balance_std', 'balance_bal_range', 'bal_per_month']]

### Using KMeans

In [7]:
from sklearn.cluster import KMeans

econ_df = _econ_df.copy()
clusters = KMeans(n_clusters=3, random_state=42)
clusters.fit(econ_df)

econ_df['cluster'] = clusters.labels_
econ_df = econ_df.groupby('cluster').agg('mean')
econ_df.head()

cluster,balance_mean,balance_min,balance_max,balance_std,balance_bal_range,bal_per_month
0,54737.717097,555.18007,123958.541608,25508.184108,123403.361538,10675.4287
1,42165.398607,714.550259,79007.036788,16573.209927,78292.486528,7782.239479
2,29295.179272,742.885714,49303.312315,9972.741015,48560.426601,5170.127952
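
The notebook fixes `n_clusters=3` throughout. One way to check that choice is to compare silhouette scores over a range of cluster counts and keep the highest. The sketch below runs on synthetic data built with three well-separated groups; the data and the range of `k` are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (not the real data).
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=42,
)

# Silhouette score for each candidate number of clusters.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```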


### Using KMedoids

In [8]:
from sklearn_extra.cluster import KMedoids

econ_df = _econ_df.copy()
clusters = KMedoids(n_clusters=3, random_state=42)
clusters.fit(econ_df)

econ_df['cluster'] = clusters.labels_
econ_df = econ_df.groupby('cluster').agg('mean')
econ_df.head()

cluster,balance_mean,balance_min,balance_max,balance_std,balance_bal_range,bal_per_month
0,54821.293726,557.078397,123865.781882,25476.355219,123308.703484,10657.466433
1,42226.701854,717.979348,79532.282609,16720.94756,78814.303261,7832.978601
2,29556.42641,736.994313,49884.616114,10095.110416,49147.621801,5235.639208
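
The KMeans and KMedoids tables above are close in their cluster means. To compare the two results row by row rather than by averages, one option is the adjusted Rand index, which equals 1.0 when two labelings form the same groups regardless of how the cluster ids are numbered. The label lists below are hypothetical, made up to show the behaviour.

```python
from sklearn.metrics import adjusted_rand_score

# Two hypothetical labelings of the same six rows. The groups are
# identical; only the cluster ids are permuted.
kmeans_labels = [0, 0, 1, 1, 2, 2]
kmedoids_labels = [2, 2, 0, 0, 1, 1]

# A third labeling in which one row moved to a different group.
shifted_labels = [2, 2, 0, 1, 1, 1]

same = adjusted_rand_score(kmeans_labels, kmedoids_labels)
partial = adjusted_rand_score(kmeans_labels, shifted_labels)
print(same, partial)
```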


### Using DBSCAN

In [9]:
from sklearn.cluster import DBSCAN

econ_df = _econ_df.copy()
clusters = DBSCAN(eps=1000000)
clusters.fit(econ_df)

econ_df['cluster'] = clusters.labels_
econ_df = econ_df.groupby('cluster').agg('mean')
econ_df.head()

cluster,balance_mean,balance_min,balance_max,balance_std,balance_bal_range,bal_per_month
0,43606.789462,656.15176,89016.236657,18355.478881,88360.084897,8218.007041
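
The single row above means DBSCAN placed every point in one cluster, which is what happens when `eps` is larger than the spread of the data. The sketch below reproduces that effect on synthetic data with two clearly separate groups; the data and the `eps` values are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)

# Synthetic data: two blobs about 14 units apart (not the real data).
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(100, 2)),
    rng.normal(loc=(10, 10), scale=0.5, size=(100, 2)),
])


def count_clusters(eps: float) -> int:
    """Number of DBSCAN clusters for a given eps, ignoring noise (-1)."""
    labels = DBSCAN(eps=eps).fit(X).labels_
    return len(set(labels) - {-1})


# A small eps keeps the blobs apart; a huge eps merges everything.
for eps in (1.0, 1_000.0):
    print(f"eps={eps}: {count_clusters(eps)} cluster(s)")
```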


## Operation Type Segmentation
The purpose of this clustering is to segment clients according to their most frequent operation types.
We conclude that a higher number of credit card withdrawal operations goes with a higher loan amount, and the same holds for collection operations. A high number of interest credit and credit in cash operations, in contrast, seems to correlate with a lower loan amount. The counts in the tables below are absolute, so clusters of different sizes are not directly comparable.


In [10]:
from sklearn.preprocessing import LabelEncoder
from agg import *  # project module; expected to provide the ccount_*_op aggregators used below

# Take an explicit copy so the assignment below does not raise SettingWithCopyWarning
_op_type_df = unagg_df[['operation', 'loan_amount']].copy()

# Encode the operation names as integer codes
le = LabelEncoder()
_op_type_df['operation'] = le.fit_transform(_op_type_df['operation'])
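
`LabelEncoder` maps each operation name to an integer, and distance-based methods such as KMeans then treat those integers as if they were ordered quantities. One alternative, not used in this notebook, is one-hot encoding with `pd.get_dummies`, which gives every operation type its own 0/1 column. The frame and operation names below are made up for illustration.

```python
import pandas as pd

# Toy frame; the operation names here are made up for illustration.
toy_df = pd.DataFrame({
    "operation": ["withdrawal", "credit", "withdrawal", "remittance"],
    "loan_amount": [80_000, 120_000, 80_000, 200_000],
})

# One 0/1 column per operation type instead of a single integer code.
# get_dummies returns a new frame, so no slice of toy_df is modified.
encoded_df = pd.get_dummies(toy_df, columns=["operation"], dtype=int)

print(encoded_df.columns.tolist())
```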


### Using KMeans

In [11]:
from sklearn.cluster import KMeans

op_type_df = _op_type_df.copy()
clusters = KMeans(n_clusters=3, random_state=42)
clusters.fit(op_type_df)

op_type_df['cluster'] = clusters.labels_
op_type_df = op_type_df.groupby('cluster').agg({
    'operation': [ccount_collection_op, ccount_remittance_op, ccount_ccw_op, ccount_interest_op, ccount_credit_op, ccount_withdrawal_op],
    'loan_amount': 'mean'
})
op_type_df.head()

cluster,ccount_collection_op,ccount_remittance_op,ccount_ccw_op,ccount_interest_op,ccount_credit_op,ccount_withdrawal_op,loan_amount_mean
0,1183,4335,53,5103,5401,13302,67467.553392
1,370,912,11,1364,1534,3895,370291.83725
2,1034,2161,13,2994,2875,8154,196923.750914
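
The counts above are absolute, and the clusters contain different numbers of operations, so a larger count may simply reflect a larger cluster. A way to make clusters comparable is to turn the counts into shares within each cluster, for instance with `pd.crosstab` and `normalize="index"`. The sketch below uses a made-up frame; it does not rely on the `ccount_*_op` helpers from `agg`.

```python
import pandas as pd

# Toy cluster assignments and operation names (made up for illustration).
toy_df = pd.DataFrame({
    "cluster": [0, 0, 0, 0, 1, 1],
    "operation": ["withdrawal", "withdrawal", "credit", "collection",
                  "withdrawal", "credit"],
})

# normalize="index" divides each row by its total, giving the share
# of each operation type within a cluster.
shares = pd.crosstab(toy_df["cluster"], toy_df["operation"], normalize="index")
print(shares.round(2))
```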


### Using KMedoids

In [12]:
from sklearn_extra.cluster import KMedoids

op_type_df = _op_type_df.copy()
clusters = KMedoids(n_clusters=3, random_state=42)
clusters.fit(op_type_df)

op_type_df['cluster'] = clusters.labels_
op_type_df = op_type_df.groupby('cluster').agg({
    'operation': [ccount_collection_op, ccount_remittance_op, ccount_ccw_op, ccount_interest_op, ccount_credit_op, ccount_withdrawal_op],
    'loan_amount': 'mean'
})
op_type_df.head()

### Using DBSCAN

In [None]:
from sklearn.cluster import DBSCAN

op_type_df = _op_type_df.copy()
clusters = DBSCAN(eps=1000)
clusters.fit(op_type_df)

op_type_df['cluster'] = clusters.labels_
op_type_df = op_type_df.groupby('cluster').agg({
    'operation': [ccount_collection_op, ccount_remittance_op, ccount_ccw_op, ccount_interest_op, ccount_credit_op, ccount_withdrawal_op],
    'loan_amount': 'mean'
})
op_type_df.head()