autor: @LuisFalva

### SMOTE es una técnica para balancear datos. Normalmente, a la hora de entrenar un modelo tenemos que generar nuestra variable *target* [0,1] con el cual podremos calcular una predicción apartir de los registros observados, ¿pero que pasa cuando el 'target' que nos interesa es la clase minoritaria? Esto es un problema típico que muchos modelos sufren, dado que nuestra clase de interés será, en la mayoría de los casos, la clase minoritaria, tenemos que buscar una técnica para implementar un sobremuestreo sin perder información.

<img src="src/smote.gif" width="750" align="center">

### Dentro de este notebook, están las notas de estudio respecto a la técnica Synthetic Minority Oversampling Technique [SMOTE] la cual hace uso del algoritmo de KNN para encontrar los vecinos más cercanos a la clase minoritaria, i.e. la clase de los positivos '1'.

img link: [The main issue with identifying Financial Fraud using Machine Learning (and how to address it)](https://towardsdatascience.com/the-main-issue-with-identifying-financial-fraud-using-machine-learning-and-how-to-address-it-3b1bf8fa1e0c)

In [2]:
import random
import numpy as np

from sklearn import neighbors
from pyspark.sql.functions import when, col
from pyspark.sql import SparkSession, Row
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.linalg import DenseVector

In [3]:
spark = SparkSession.builder.appName("SMOTE").getOrCreate()
sc = spark.sparkContext

### Para la construcción de la función que nos ayudará a generar nuestras muestras sintéticas, vamos a cargar la tabla 'smote_class', la cual contiene una cantidad de variables que describen las caracteristicas principales de un cliente por cada renglón. El dataframe que usaremos para mantendrá únicamente las variables numéricas:
- **[age, child, saving, insight, backup]**

In [4]:
arr_col = ["age", "child", "saving", "insight", "backup", "marital"]
smote_test = spark.read.parquet("src/data/").select(*arr_col)
smote_test.show(5, False)

+---+-----+------+-------+------+-------+
|age|child|saving|insight|backup|marital|
+---+-----+------+-------+------+-------+
|59 |1    |0     |1      |1     |married|
|56 |0    |1     |0      |1     |married|
|41 |1    |1     |0      |0     |married|
|55 |1    |0     |0      |1     |married|
|54 |1    |0     |0      |1     |married|
+---+-----+------+-------+------+-------+
only showing top 5 rows



In [5]:
test = smote_test.select("*", (when(col("marital") == "divorced", 1).otherwise(0)).alias("target")).drop("marital")
test.groupBy("target").count().show()
test.where(col("target") == 1).show(5)

+------+-----+
|target|count|
+------+-----+
|     1| 1293|
|     0| 9869|
+------+-----+

+---+-----+------+-------+------+------+
|age|child|saving|insight|backup|target|
+---+-----+------+-------+------+------+
| 60|    0|     1|      0|     0|     1|
| 35|    0|     1|      1|     1|     1|
| 49|    1|     1|      1|     0|     1|
| 28|    0|     0|      0|     0|     1|
| 43|    1|     1|      0|     1|     1|
+---+-----+------+-------+------+------+
only showing top 5 rows



### Lo que nosotros buscamos para entrenar nuestro modelo de vecinos cercanos [i.e. KNN] es un objeto de tipo numpy array con los valores de cada registro, algo similar a esto:

In [6]:
np.array(test.where(col("target") == 1).drop("target").collect())

array([[60,  0,  1,  0,  0],
       [35,  0,  1,  1,  1],
       [49,  1,  1,  1,  0],
       ...,
       [52,  0,  0,  0,  0],
       [38,  0,  1,  0,  1],
       [60,  1,  1,  0,  1]])

**NOTA: Sin embargo, para convertir de un Spark Dataframe a un objeto de tipo numpy.array es conveniente antes transformarlo a RDD, por lo que los métodos de las clase SparkSMOTE realizarán internamente esos parseos.**

### Para entrenar el modelo de KNN necesitaremos convertir nuestro spark Dataframe a un objeto de tipo numpy array, y para ello debemos bajar nuestra estructura dataframe a rdd's para que las estructura de datos al ser transformada ésta sea de manera distribuida.

In [7]:
def vector_assembling(data_input, target_name):
    """
    Vectorizer function will create a vector filled with features for each row
    
    :param data_input: df, spark Dataframe with target label
    :param target_name: str, string name from target label
    :return: Dataframe, table that includes the feature vector and label
    """
    
    if data_input.select(target_name).distinct().count() != 2:
        raise ValueError("Target field must have only 2 distinct classes")
    
    column_names = list(data_input.drop(target_name).columns)
    vector_assembler = VectorAssembler(inputCols = column_names, outputCol = 'features')
    vector_transform = vector_assembler.transform(data_input)
    vector_feature = vector_transform.select('features', (vector_transform[target_name]).alias("label"))
    
    return vector_feature

def split_target(df, field, minor=1, major=0):
    """
    Split target will split in two distinct Dataframe from label 1 and 0
    
    :param df: Dataframe, spark Dataframe with target label
    :param field: str, string name from taget label
    :param minor: int, integer number for minority class
    :param major: int, integer number for majority class
    :return: dict, python dictionary with separated Dataframe
    """
    minor = df[df[field] == minor]
    major = df[df[field] == major]
    return {"minor": minor, "major": major}

def spkdf_to_nparr(df, feature):
    """
    Spkdf to nparr function will help to parse from spark Dataframe to numpy array
    in a distributed way
    
    :param df: Dataframe, spark Dataframe with features column
    :param feature: str, string name of column features name
    :return: np.array, numpy array object with features
    """
    feature_df = df.select(feature)
    return np.asarray(feature_df.rdd.map(lambda x: x[0]).collect())

def nparr_to_spkdf(arr, feat="features", label="label"):
    """
    Nparr to spkdf function will help to parse from numpy array to spark Dataframe
    in a distributed way
    
    :param df: Dataframe, spark Dataframe with features column
    :param feat: str, string name of column features name; 'features' set as default
    :param label: str, string name of column label name; 'label' set as default
    :return: Dataframe, with feautures and label
    """
    data_set = sc.parallelize(arr)
    data_rdd = data_set.map(lambda x: (Row(fatures=DenseVector(x), label=1)))
    return data_rdd.toDF()

def smote_sampling(df, k=2, algrth="auto", minority_class=1, majority_class=0, pct_over_min=100, pct_under_max=100):
    """
    Smote sampling function will create an oversampling with SMOTE technique
    
    :param df: Dataframe, spark Dataframe with features column
    :param k: int, integer k folds for KNN's groups; '2' set as default
    :param algrth: str, string name for KNN's algorithm choice; 'auto' set as default
    :param minority_class: int, value related to minority class; '1' set as default
    :param majority_class: int, value related to majority class; '0' set as default
    :param pct_over_min: int, integer number for sampling minority class; '100' set as default
    :param pct_under_max: int, integer number for sampling majority class; '100' set as default
    :return: Dataframe, with new SMOTE features sampled
    """
    def k_neighbor(k, algrth, feature):
        """
        k neighbor will compute Nearest Neighbors sklearn algorithm

        :param k: int, integer number for k nearest neighbors groups
        :param feature: str, string name of column features name
        :return: list, python list with numpy array object for each neighbor
        """
        neighbor_list = neighbors.NearestNeighbors(n_neighbors=k, algorithm=algrth).fit(feature)
        return neighbor_list.kneighbors(feature)
    
    def compute_smo(neighbor_list, pct, min_arr):
        """
        Compute smo function will compute the SMOTE oversampling technique

        :param neighbor_list: list, python list with numpy array object for each neighbor
        :param pct: int, integer pct for over min
        :param min_arr: list, python list with minority class rows
        :param k: int, integer number for k nearest neighbors groups
        :return: list, python list with sm class oversampled
        """
        if pct < 100:
            raise ValueError("Percentage Over Min must be in at least >= 100")
        
        smo = []
        counter = 0
        pct_over = int(pct / 100)
        
        while len(min_arr) > counter:
            for i in range(pct_over):
                random_neighbor = random.randint(0, len(neighbor)-1)
                diff = neighbor_list[random_neighbor][0] - min_arr[i][0]
                new_record = (min_arr[i][0] + random.random() * diff)
                smo.insert(0, (new_record))
            counter+=1
        
        return smo
    
    data_input_min = split_target(df=df, field="label")["minor"]
    data_input_max = split_target(df=df, field="label")["major"]
    
    feature_mat = spkdf_to_nparr(data_input_min, "features")
    neighbor = k_neighbor(k=k, algrth=algrth, feature=feature_mat)[1]
    
    min_array = data_input_min.drop("label").rdd.map(lambda x : list(x)).collect()
    new_row = compute_smo(neighbor, pct_over_min, min_array)
    smo_data_df = nparr_to_spkdf(new_row)
    smo_data_minor = data_input_min.unionAll(smo_data_df)
    new_data_major = data_input_max.sample(False, (float(pct_under_max / 100)))
    
    return new_data_major.unionAll(smo_data_minor)

### Para computar nuestras muestras sintéticas debemos antes vectorizar los atributos que tengamos en nuestra tabla de datos, esto significa que debemos tomar los valores de cada columna y crear vectores de longitud **$p$**. Este método asume tres principales puntos:

- Normalización y estandarización de variables
- Mapeo de cada valor por columna a codificaciones binarias (StringIndexer, OneHotEncoder)
- Spark Dataframe vectorizado, i.e. con columna de vectores densos y escasos (features), y columna dicotómica (label)

In [8]:
vector_assemble = vector_assembling(test, "target")
vector_assemble.show(5, False)

+----------------------+-----+
|features              |label|
+----------------------+-----+
|[59.0,1.0,0.0,1.0,1.0]|0    |
|[56.0,0.0,1.0,0.0,1.0]|0    |
|[41.0,1.0,1.0,0.0,0.0]|0    |
|[55.0,1.0,0.0,0.0,1.0]|0    |
|[54.0,1.0,0.0,0.0,1.0]|0    |
+----------------------+-----+
only showing top 5 rows



### Como muestra de su funcionamiento, para aplicar el método 'smote_sampling' requerimos de la tabla anterior con variables previamente standarizados, codificados y vectorizados. Como se puede ver, el método recibe los argumentos 'pct_over_min' y 'pct_under_max' configurados por default en [100, 100] respectivamente, cada uno de esos argumentos ayudarán a manipular el submuestreo o sobremuestreo de ambas clases que se ven en la siguiente tabla.

**pct_over_min; modificará la cantidad de registros que existe para la clase minoritaria sobremuestreando los registros con valores sintéticos, en este caso, la clase '1'**

**pct_under_max; modificará la cantidad de registros que existe para la clase mayoritaria submuestreando los registros, en este caso, la clase '0'**

In [16]:
smote_sample = smote_sampling(vector_assemble, pct_over_min=600, pct_under_max=100)
smote_sample.groupBy("label").count().show()
smote_sample.where(col("label") == 1).orderBy(col("features").desc()).drop("label").show(5, False)

+-----+-----+
|label|count|
+-----+-----+
|    0| 9869|
|    1| 9051|
+-----+-----+

+------------------------------------------------------------------------------------------------+
|features                                                                                        |
+------------------------------------------------------------------------------------------------+
|[1269.8325927780104,1269.1202281118856,1269.1202281118856,1269.1032670484065,1269.1202281118856]|
|[1257.4522132395064,1256.649695937225,1256.649695937225,1256.649695937225,1256.6329768267608]   |
|[1254.1720678340141,1253.6823037725862,1253.6904665069433,1253.6823037725862,1253.6823037725862]|
|[1245.6175873494612,1244.6114847728857,1244.6114847728857,1244.6114847728857,1244.6114847728857]|
|[1244.442370594635,1244.243975072919,1244.2377752128655,1244.2377752128655,1244.243975072919]   |
+------------------------------------------------------------------------------------------------+
only showing top 5 rows
