# Clustering Exercise: Anomaly Detection

One use of clustering algorithms is anomaly detection, that is, identifying observations that do not follow normal behavior. If the goal of clustering is to find groups of similar elements, then elements that are not similar to any group can be considered anomalous.

For this exercise we will use a [credit card transactions dataset](https://www.kaggle.com/arjunbhasin2013/ccdata), in which each observation is a different customer.

Our goal is to implement a model that groups the observations appropriately and finds the potential outliers, that is, the ones suspected of being fraud or an error. Solving this exercise properly requires some investigation rather than simply following the course material to the letter.

**Hints:**

- We have covered a clustering algorithm that not only assigns elements to valid clusters but also classifies elements as outliers.

- For the hyperparameter search, a good place to look is [ParameterSampler](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.ParameterSampler.html).
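
As a minimal illustration of the first hint, the sketch below runs DBSCAN (the algorithm this notebook goes on to use) on made-up data. The toy points, `eps=1.0`, and `min_samples=5` are assumptions chosen only so that the behavior is easy to see: points that fall in no dense region receive the label `-1`.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy data (hypothetical): two tight groups plus one isolated point.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=0.1, size=(20, 2))
group_b = rng.normal(loc=5.0, scale=0.1, size=(20, 2))
isolated_point = np.array([[20.0, 20.0]])
X = np.vstack([group_a, group_b, isolated_point])

# DBSCAN assigns the label -1 to points that belong to no dense region.
labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)
outlier_indices = np.where(labels == -1)[0]
print(outlier_indices)
```

Here only the last row (the isolated point) should be flagged, while the two groups receive ordinary cluster labels.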

In [2]:
import pandas as pd
import numpy as np

df = pd.read_csv("./data/CC GENERAL.csv")

In [3]:
df.head()

Unnamed: 0,CUST_ID,BALANCE,BALANCE_FREQUENCY,PURCHASES,ONEOFF_PURCHASES,INSTALLMENTS_PURCHASES,CASH_ADVANCE,PURCHASES_FREQUENCY,ONEOFF_PURCHASES_FREQUENCY,PURCHASES_INSTALLMENTS_FREQUENCY,CASH_ADVANCE_FREQUENCY,CASH_ADVANCE_TRX,PURCHASES_TRX,CREDIT_LIMIT,PAYMENTS,MINIMUM_PAYMENTS,PRC_FULL_PAYMENT,TENURE
0,C10001,40.900749,0.818182,95.4,0.0,95.4,0.0,0.166667,0.0,0.083333,0.0,0,2,1000.0,201.802084,139.509787,0.0,12
1,C10002,3202.467416,0.909091,0.0,0.0,0.0,6442.945483,0.0,0.0,0.0,0.25,4,0,7000.0,4103.032597,1072.340217,0.222222,12
2,C10003,2495.148862,1.0,773.17,773.17,0.0,0.0,1.0,1.0,0.0,0.0,0,12,7500.0,622.066742,627.284787,0.0,12
3,C10004,1666.670542,0.636364,1499.0,1499.0,0.0,205.788017,0.083333,0.083333,0.0,0.083333,1,1,7500.0,0.0,,0.0,12
4,C10005,817.714335,1.0,16.0,16.0,0.0,0.0,0.083333,0.083333,0.0,0.0,0,1,1200.0,678.334763,244.791237,0.0,12


Each observation (row) holds aggregated information about a different customer: the balance of their credit card, the number of purchases made, the number of times they withdraw cash from an ATM, and so on.

In [4]:
df.dtypes

CUST_ID                              object
BALANCE                             float64
BALANCE_FREQUENCY                   float64
PURCHASES                           float64
ONEOFF_PURCHASES                    float64
INSTALLMENTS_PURCHASES              float64
CASH_ADVANCE                        float64
PURCHASES_FREQUENCY                 float64
ONEOFF_PURCHASES_FREQUENCY          float64
PURCHASES_INSTALLMENTS_FREQUENCY    float64
CASH_ADVANCE_FREQUENCY              float64
CASH_ADVANCE_TRX                      int64
PURCHASES_TRX                         int64
CREDIT_LIMIT                        float64
PAYMENTS                            float64
MINIMUM_PAYMENTS                    float64
PRC_FULL_PAYMENT                    float64
TENURE                                int64
dtype: object

In [5]:
customer_ids = df.CUST_ID
df = df.drop(columns="CUST_ID")

In [13]:
df.head()

Unnamed: 0,BALANCE,BALANCE_FREQUENCY,PURCHASES,ONEOFF_PURCHASES,INSTALLMENTS_PURCHASES,CASH_ADVANCE,PURCHASES_FREQUENCY,ONEOFF_PURCHASES_FREQUENCY,PURCHASES_INSTALLMENTS_FREQUENCY,CASH_ADVANCE_FREQUENCY,CASH_ADVANCE_TRX,PURCHASES_TRX,CREDIT_LIMIT,PAYMENTS,MINIMUM_PAYMENTS,PRC_FULL_PAYMENT,TENURE
0,40.900749,0.818182,95.4,0.0,95.4,0.0,0.166667,0.0,0.083333,0.0,0,2,1000.0,201.802084,139.509787,0.0,12
1,3202.467416,0.909091,0.0,0.0,0.0,6442.945483,0.0,0.0,0.0,0.25,4,0,7000.0,4103.032597,1072.340217,0.222222,12
2,2495.148862,1.0,773.17,773.17,0.0,0.0,1.0,1.0,0.0,0.0,0,12,7500.0,622.066742,627.284787,0.0,12
3,1666.670542,0.636364,1499.0,1499.0,0.0,205.788017,0.083333,0.083333,0.0,0.083333,1,1,7500.0,0.0,0.0,0.0,12
4,817.714335,1.0,16.0,16.0,0.0,0.0,0.083333,0.083333,0.0,0.0,0,1,1200.0,678.334763,244.791237,0.0,12


In [15]:
df = df.fillna(0)
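
The cell above replaces every missing value with 0 (the first `df.head()` output shows an empty `MINIMUM_PAYMENTS` for customer C10004). Whether 0 is a sensible fill depends on the column, so it can help to count the missing values first. The sketch below uses a hypothetical miniature DataFrame, not the real file, and shows median imputation as one possible alternative.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the dataset with one missing value.
toy = pd.DataFrame({
    "PAYMENTS": [201.8, 4103.0, 622.1, 0.0],
    "MINIMUM_PAYMENTS": [139.5, 1072.3, 627.3, np.nan],
})

# Count missing values per column before deciding how to fill them.
missing_per_column = toy.isna().sum()

# Alternative to fillna(0): fill each column with its own median.
toy_median = toy.fillna(toy.median())
```

On the real data the same two calls would be applied to `df` before scaling.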

In [16]:
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

In [17]:
df_normalizado = pd.DataFrame(StandardScaler().fit_transform(df))

In [19]:
df_normalizado.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
0,-0.731989,-0.249434,-0.4249,-0.356934,-0.349079,-0.466786,-0.80649,-0.678661,-0.707313,-0.675349,-0.47607,-0.511333,-0.960213,-0.528979,-0.29731,-0.525551,0.36068
1,0.786961,0.134325,-0.469552,-0.356934,-0.454576,2.605605,-1.221758,-0.678661,-0.916995,0.573963,0.110074,-0.591796,0.688718,0.818642,0.102042,0.234227,0.36068
2,0.447135,0.518084,-0.107668,0.108889,-0.454576,-0.466786,1.269843,2.673451,-0.916995,-0.675349,-0.47607,-0.10902,0.826129,-0.383805,-0.088489,-0.525551,0.36068
3,0.049099,-1.016953,0.232058,0.546189,-0.454576,-0.368653,-1.014125,-0.399319,-0.916995,-0.258913,-0.329534,-0.551565,0.826129,-0.598688,-0.357035,-0.525551,0.36068
4,-0.358775,0.518084,-0.462063,-0.347294,-0.454576,-0.466786,-1.014125,-0.399319,-0.916995,-0.675349,-0.47607,-0.551565,-0.905249,-0.364368,-0.252238,-0.525551,0.36068
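
Wrapping the scaler output in a bare `pd.DataFrame(...)` replaces the column names with the integers 0 to 16, as the output above shows. A small sketch, on hypothetical toy values, of how the names could be kept and how to check that each column ends up with mean 0 and unit standard deviation:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical two-column sample; the real DataFrame has 17 columns.
toy = pd.DataFrame({
    "BALANCE": [40.9, 3202.5, 2495.1, 1666.7, 817.7],
    "PURCHASES": [95.4, 0.0, 773.17, 1499.0, 16.0],
})

# Pass columns and index through so the scaled frame stays readable.
scaled = pd.DataFrame(
    StandardScaler().fit_transform(toy),
    columns=toy.columns,
    index=toy.index,
)

# StandardScaler uses the population standard deviation, hence ddof=0.
means = scaled.mean()
stds = scaled.std(ddof=0)
```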


In [21]:
cluster = DBSCAN()
cluster_labels = cluster.fit_predict(df_normalizado)
pd.Series(cluster_labels).value_counts()

-1     6627
 0     1948
 10      60
 2       34
 15      30
 7       23
 14      23
 8       14
 6       13
 3       11
 29      10
 21       9
 5        9
 9        8
 12       8
 1        8
 26       8
 17       7
 11       7
 27       7
 19       7
 4        6
 13       6
 23       6
 28       5
 30       5
 16       5
 24       5
 22       5
 25       5
 20       5
 18       5
 35       5
 31       5
 32       4
 34       4
 33       3
dtype: int64
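
With the default `eps=0.5` and `min_samples=5`, 6627 of the 8950 rows are labeled as noise, which suggests that `eps` is too small for 17 standardized dimensions. One common heuristic for choosing `eps` (not part of the original exercise) is to look at the distance from each point to its k-th nearest neighbor. The helper `k_distances` and the random data below are hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def k_distances(X, k):
    """Return the sorted distance from each point to its k-th nearest neighbor."""
    # n_neighbors=k + 1 because each point is returned as its own nearest neighbor.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    distances, _ = nn.kneighbors(X)
    return np.sort(distances[:, -1])

# Hypothetical stand-in for df_normalizado: 200 points in 17 dimensions.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 17))
kd = k_distances(X_toy, k=5)

# Typical distances are well above 0.5 in this many dimensions.
print(np.percentile(kd, [50, 90, 99]))
```

Plotting `kd` and looking for the point where the curve bends sharply upward gives a candidate range for `eps`.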

In [51]:
from sklearn.metrics import silhouette_score

In [52]:
silhouette_score(df_normalizado,cluster_labels)

-0.46596190778573116
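
Note that `silhouette_score` has no notion of noise: the label `-1` is treated like any other cluster label, so the scattered noise points are scored as if they formed one cluster. A possible variant, sketched here under that observation, scores only the points DBSCAN actually assigned to a cluster. The helper name `silhouette_without_noise` and the toy data are hypothetical.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

def silhouette_without_noise(X, labels):
    """Silhouette over non-noise points; NaN when it cannot be computed."""
    X = np.asarray(X)
    labels = np.asarray(labels)
    mask = labels != -1
    n_clusters = len(set(labels[mask].tolist()))
    # silhouette_score needs at least 2 clusters and fewer clusters than points.
    if n_clusters < 2 or n_clusters >= mask.sum():
        return float("nan")
    return silhouette_score(X[mask], labels[mask])

# Toy data (hypothetical): two tight groups plus one isolated point.
rng = np.random.default_rng(0)
X_toy = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(20, 2)),
    rng.normal(loc=5.0, scale=0.1, size=(20, 2)),
    [[20.0, 20.0]],
])
toy_labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X_toy)
score = silhouette_without_noise(X_toy, toy_labels)
```

This measures how well separated the real clusters are, but it says nothing about how many points were discarded as noise, so the noise fraction should be reported alongside it.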

In [26]:
from scipy.stats import randint as sp_randinit
from scipy.stats import uniform

In [27]:
DBSCAN().get_params()

{'algorithm': 'auto',
 'eps': 0.5,
 'leaf_size': 30,
 'metric': 'euclidean',
 'metric_params': None,
 'min_samples': 5,
 'n_jobs': 1,
 'p': None}

In [28]:
distribucion_parametros = {
    "eps": uniform(0,5),
    "min_samples": sp_randinit(2,20),
    "p": sp_randinit(1,3)
}
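
To see what such a dictionary produces, the sketch below draws a few combinations with `ParameterSampler`. In `scipy.stats`, `uniform(loc, scale)` covers the interval from `loc` to `loc + scale`, and `randint(low, high)` excludes `high`. The `random_state=0` argument is an addition for reproducibility, and `p` is left out here because whether it has any effect with the default `metric='euclidean'` is worth checking in the DBSCAN documentation.

```python
from scipy.stats import randint, uniform
from sklearn.model_selection import ParameterSampler

distributions = {
    "eps": uniform(0, 5),           # continuous values between 0 and 5
    "min_samples": randint(2, 20),  # integers from 2 to 19 (upper bound excluded)
}

# Each sample is a plain dict that can be unpacked into DBSCAN(**sample).
samples = list(ParameterSampler(distributions, n_iter=5, random_state=0))
for sample in samples:
    print(sample)
```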

In [29]:
from sklearn.model_selection import ParameterSampler

## Manual hyperparameter search

In [36]:
import numpy as np
from sklearn.model_selection import ParameterSampler

n_muestras = 30  # number of hyperparameter combinations to try
n_iteraciones = 3  # for validation, fit each combination on 3 different samples
pct_muestra = 0.7  # use 70% of the data to fit the model in each iteration
resultados_busqueda = []
lista_parametros = list(ParameterSampler(distribucion_parametros, n_iter=n_muestras))

for param in lista_parametros:
    param_resultados = []  # reset once per combination so all iterations are averaged
    for iteration in range(n_iteraciones):
        muestra = df_normalizado.sample(frac=pct_muestra)
        etiquetas_clusters = DBSCAN(n_jobs=-1, **param).fit_predict(muestra)
        try:
            param_resultados.append(silhouette_score(muestra, etiquetas_clusters))
        except ValueError:  # silhouette_score fails when only one label is present
            pass
    # NaN marks combinations for which no iteration produced a valid score
    puntuacion_media = np.mean(param_resultados) if param_resultados else np.nan
    resultados_busqueda.append([puntuacion_media, param])



In [46]:
sorted(resultados_busqueda,key=lambda x: x[0],reverse=True)[:5]

[[nan, {'eps': 0.058059507655000564, 'min_samples': 13, 'p': 1}],
 [0.7354497515129891, {'eps': 4.8767765685899365, 'min_samples': 9, 'p': 2}],
 [0.7275051528373792, {'eps': 4.750465916818753, 'min_samples': 8, 'p': 2}],
 [0.7227157799477482, {'eps': 4.552434308323022, 'min_samples': 8, 'p': 1}],
 [0.707876841162, {'eps': 4.5333978273008295, 'min_samples': 14, 'p': 1}]]
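
The first entry in the output above has a score of `nan`: comparisons with NaN are always false, so `sorted` cannot place it reliably. A small sketch of one way to rank only the valid scores, using hypothetical results in the same `[score, params]` shape as `resultados_busqueda`:

```python
import math

# Hypothetical results in the same [score, params] shape as resultados_busqueda.
results = [
    [float("nan"), {"eps": 0.06, "min_samples": 13}],
    [0.72, {"eps": 4.55, "min_samples": 8}],
    [0.74, {"eps": 4.88, "min_samples": 9}],
]

# Drop combinations without a valid score, then sort from best to worst.
valid = [r for r in results if not math.isnan(r[0])]
ranked = sorted(valid, key=lambda r: r[0], reverse=True)
best_score, best_params = ranked[0]
```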

In [53]:
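# eps is taken from the top-scoring combination above, but min_samples=13 and p=1
# match the row whose score was NaN; the top-scoring row used min_samples=9 and p=2.
mejores_parametros = {'eps': 4.8767765685899365,'min_samples':13,'p':1}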
cluster = DBSCAN(n_jobs=-1,**mejores_parametros)
etiquetas_clusters = cluster.fit_predict(df_normalizado)

In [54]:
pd.Series(etiquetas_clusters).value_counts()

 0    8878
-1      72
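
The final model labels 72 customers as noise. Since `customer_ids` was saved before dropping the `CUST_ID` column and the row order has not changed, the labels can be mapped back to customer identifiers by position. A sketch with hypothetical stand-ins for `customer_ids` and `etiquetas_clusters`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for customer_ids and etiquetas_clusters.
customer_ids_toy = pd.Series(["C10001", "C10002", "C10003", "C10004"])
labels_toy = np.array([0, -1, 0, -1])

# A boolean mask selects the customers that DBSCAN marked as noise (-1).
outlier_ids = customer_ids_toy[labels_toy == -1].tolist()
print(outlier_ids)
```

On the real data the equivalent expression would be `customer_ids[etiquetas_clusters == -1]`, which gives the list of customers to review as potential anomalies.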
dtype: int64