# ANOVOS - Data Transformer
This notebook lists the functions in the "data transformer" module of the ANOVOS package and shows how to invoke each one.
- [Attribute Binning](#Attribute-Binning)
- [Monotonic Binning](#Monotonic-Binning)
- [Categorical Attribute to Numerical Attribute Conversion](#Categorical-Attribute-to-Numerical-Attribute-Conversion)
    - [Categorical to Numerical - Unsupervised](#Categorical-to-Numerical---Unsupervised)
    - [Categorical to Numerical - Supervised](#Categorical-to-Numerical---Supervised)
- [Attribute Rescaling](#Attribute-Rescaling)
    - [Z Standardization](#Z-Standardization)
    - [IQR Standardization](#IQR-Standardization)
    - [Normalization](#Normalization)
- [Missing Value Imputation](#Missing-Value-Imputation)
    - [Imputation MMM](#Imputation-MMM)
    - [Imputation Sklearn](#Imputation-Sklearn)
    - [Imputation Matrix Factorization](#Imputation-Matrix-Factorization)
    - [Auto Imputation](#Auto-Imputation)
- [Latent Features Generation](#Latent-Features-Generation)
    - [Autoencoder Latent Features](#Autoencoder-Latent-Features)
    - [PCA Latent Features](#PCA-Latent-Features)
- [Feature Transformation](#Feature-Transformation)
- [Box Cox Transformation](#Box-Cox-Transformation)
- [Outlier Categories Treatment](#Outlier-Categories-Treatment)
- [Expression Parser](#Expression-Parser)

**Setting Spark Session**

In [1]:
#set run type variable
run_type = "local" # "local", "emr", "databricks", "ak8s"

In [2]:
#For run_type Azure Kubernetes, run the following block 
import os
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

if run_type == "ak8s":
    auth_key="<insert value of sas_token here>"
    master_url="<insert conf spark.hadoop.fs master url here> ex: spark.hadoop.fs.azure.sas.<container>.<account_name>.blob.core.windows.net"
    docker_image="<insert name docker image here>"
    kubernetes_namespace ="<insert kubernetes namespace here>"

    # Create Spark config for our Kubernetes based cluster manager
    sparkConf = SparkConf()
    sparkConf.setMaster(master_url)
    sparkConf.setAppName("Anovos_pipeline")
    sparkConf.set("spark.submit.deployMode","client")
    sparkConf.set("spark.kubernetes.container.image", docker_image)
    sparkConf.set("spark.kubernetes.namespace", kubernetes_namespace)
    sparkConf.set("spark.executor.instances", "4")
    sparkConf.set("spark.executor.cores", "4")
    sparkConf.set("spark.executor.memory", "16g")
    sparkConf.set("spark.kubernetes.pyspark.pythonVersion", "3")
    sparkConf.set("spark.kubernetes.authenticate.driver.serviceAccountName", "spark")
    sparkConf.set(master_url,auth_key)
    sparkConf.set("spark.kubernetes.authenticate.serviceAccountName", "spark")
    sparkConf.set("spark.jars.packages", "org.apache.hadoop:hadoop-azure:3.2.0,com.microsoft.azure:azure-storage:8.6.3,io.github.histogrammar:histogrammar_2.12:1.0.20,io.github.histogrammar:histogrammar-sparksql_2.12:1.0.20,org.apache.spark:spark-avro_2.12:3.2.1")

    # Initialize our Spark cluster, this will actually
    # generate the worker nodes.
    spark = SparkSession.builder.config(conf=sparkConf).getOrCreate()
    sc = spark.sparkContext

#For other run types import from anovos.shared.
else:
    from anovos.shared.spark import *
    auth_key = "NA"

2022-06-06 21:21:27.526 | INFO     | anovos.shared.spark:init_spark:54 - Getting spark session, context and sql context app_name: Anovos_pipeline


:: loading settings :: url = jar:file:/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pyspark/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /Users/mobilewalla/.ivy2/cache
The jars for the packages stored in: /Users/mobilewalla/.ivy2/jars
io.github.histogrammar#histogrammar_2.12 added as a dependency
io.github.histogrammar#histogrammar-sparksql_2.12 added as a dependency
org.apache.spark#spark-avro_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-7e0615f0-1b54-4f15-bb71-9303c72f0d1a;1.0
	confs: [default]
	found io.github.histogrammar#histogrammar_2.12;1.0.20 in central
	found io.github.histogrammar#histogrammar-sparksql_2.12;1.0.20 in central
	found org.apache.spark#spark-avro_2.12;3.2.1 in central
	found org.tukaani#xz;1.8 in central
	found org.spark-project.spark#unused;1.0.0 in central
:: resolution report :: resolve 266ms :: artifacts dl 13ms
	:: modules in use:
	io.github.histogrammar#histogrammar-sparksql_2.12;1.0.20 from central in [default]
	io.github.histogrammar#histogrammar_2.12;1.0.20 from central in [default]
	org.apache.spark#spark-avro_2.12

In [3]:
sc.setLogLevel("ERROR")
import warnings
warnings.filterwarnings('ignore')

**Input/Output Path** 

In [4]:
inputPath = "../data/income_dataset/csv"
outputPath = "../output/income_dataset/data_transformer"

**Read Input Data** 

In [5]:
from anovos.data_ingest.data_ingest import read_dataset
from pyspark.sql import functions as F
df = read_dataset(spark, file_path = inputPath, file_type = "csv",
                  file_configs = {"header": "True", "delimiter": "," , "inferSchema": "True"})
df = df.drop("dt_1", "dt_2")
df.toPandas().head(5)

Unnamed: 0,ifa,age,workclass,fnlwgt,logfnl,empty,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,1a,,State-gov,77516.0,4.889391,,Bachelors,13.0,Never-married,Adm-clerical,Not-in-family,White,Male,2174.0,0.0,40.0,UnitedStates,<=50K
1,2a,,Self-emp-not-inc,83311.0,4.920702,,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,13.0,UnitedStates,<=50K
2,3a,38.0,Private,215646.0,5.333741,,HS-grad,9.0,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.0,0.0,40.0,UnitedStates,<=50K
3,4a,53.0,Private,234721.0,5.370552,,11th,7.0,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.0,0.0,40.0,UnitedStates,<=50K
4,5a,,Private,338409.0,5.529442,,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0.0,0.0,40.0,Cuba,<=50K


# Attribute Binning
- API specification of function **attribute_binning** can be found <a href="https://docs.anovos.ai/api/data_transformer/transformers.html">here</a>
- Supports numerical attributes only
- Two binning options: equal range binning (every bin spans the same width) and equal frequency binning (every bin holds roughly the same number of rows)
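
To make the difference between the two options concrete, here is a minimal plain-Python sketch. It is not the Anovos implementation: the helper names (`equal_range_cutoffs`, `equal_frequency_cutoffs`, `assign_bin`), the toy `hours` list, and the rule that a value equal to a cutoff falls in the lower bin are all assumptions made for illustration.

```python
# Plain-Python sketch of the two binning strategies (illustrative only).

def equal_range_cutoffs(values, bin_size):
    """Cutoffs that split [min, max] into bin_size intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bin_size
    return [lo + width * i for i in range(1, bin_size)]


def equal_frequency_cutoffs(values, bin_size):
    """Cutoffs taken at evenly spaced positions of the sorted values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(n * i) // bin_size - 1] for i in range(1, bin_size)]


def assign_bin(value, cutoffs):
    """1-based bin index; a value equal to a cutoff goes to the lower bin (assumption)."""
    for index, cutoff in enumerate(cutoffs, start=1):
        if value <= cutoff:
            return index
    return len(cutoffs) + 1


# Hypothetical data with one heavily repeated value.
hours = [10, 20, 40, 40, 40, 40, 40, 40, 50, 60]

range_cutoffs = equal_range_cutoffs(hours, bin_size=5)
freq_cutoffs = equal_frequency_cutoffs(hours, bin_size=5)
range_bins = [assign_bin(h, range_cutoffs) for h in hours]
freq_bins = [assign_bin(h, freq_cutoffs) for h in hours]

print(range_bins)  # [1, 1, 3, 3, 3, 3, 3, 3, 4, 5]
print(freq_bins)   # [1, 1, 2, 2, 2, 2, 2, 2, 5, 5]
```

In this toy data the value 40 is so frequent that three of the four quantile cutoffs coincide, so equal frequency binning fills only bins 1, 2 and 5. Heavily repeated values could similarly be one reason an equal frequency run reports fewer distinct bins than the requested `bin_size`.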

In [6]:
from anovos.data_transformer.transformers import attribute_binning

In [7]:
# Example 1 - Equal range binning + append transformed columns at the end
odf = attribute_binning(spark, idf=df, list_of_cols=["education-num", "hours-per-week"], method_type="equal_range", 
                        bin_size=5, output_mode="append", print_impact=True)

odf.toPandas().head(5)

+---------------------+-------------+
|attribute            |unique_values|
+---------------------+-------------+
|education-num_binned |5            |
|hours-per-week_binned|5            |
+---------------------+-------------+



Unnamed: 0,ifa,age,workclass,fnlwgt,logfnl,empty,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income,hours-per-week_binned,education-num_binned
0,1a,,State-gov,77516.0,4.889391,,Bachelors,13.0,Never-married,Adm-clerical,Not-in-family,White,Male,2174.0,0.0,40.0,UnitedStates,<=50K,3.0,4.0
1,2a,,Self-emp-not-inc,83311.0,4.920702,,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,13.0,UnitedStates,<=50K,1.0,4.0
2,3a,38.0,Private,215646.0,5.333741,,HS-grad,9.0,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.0,0.0,40.0,UnitedStates,<=50K,3.0,3.0
3,4a,53.0,Private,234721.0,5.370552,,11th,7.0,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.0,0.0,40.0,UnitedStates,<=50K,3.0,2.0
4,5a,,Private,338409.0,5.529442,,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0.0,0.0,40.0,Cuba,<=50K,3.0,4.0


In [8]:
# Distinct values after binning
odf.select('hours-per-week_binned').distinct().orderBy('hours-per-week_binned').toPandas().head(10)

Unnamed: 0,hours-per-week_binned
0,
1,1.0
2,2.0
3,3.0
4,4.0
5,5.0


In [9]:
# Example 2 - Equal frequency binning + replace original columns by transformed ones (default)
odf = attribute_binning(spark, df, list_of_cols=["education-num", "hours-per-week"], method_type="equal_frequency", 
                        bin_size=5, print_impact=True)

odf.toPandas().head(5)

+--------------+-------------+
|attribute     |unique_values|
+--------------+-------------+
|hours-per-week|4            |
|education-num |4            |
+--------------+-------------+



Unnamed: 0,ifa,age,workclass,fnlwgt,logfnl,empty,education,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,native-country,income,hours-per-week,education-num
0,1a,,State-gov,77516.0,4.889391,,Bachelors,Never-married,Adm-clerical,Not-in-family,White,Male,2174.0,0.0,UnitedStates,<=50K,2.0,4.0
1,2a,,Self-emp-not-inc,83311.0,4.920702,,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,UnitedStates,<=50K,1.0,4.0
2,3a,38.0,Private,215646.0,5.333741,,HS-grad,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.0,0.0,UnitedStates,<=50K,2.0,1.0
3,4a,53.0,Private,234721.0,5.370552,,11th,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.0,0.0,UnitedStates,<=50K,2.0,1.0
4,5a,,Private,338409.0,5.529442,,Bachelors,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0.0,0.0,Cuba,<=50K,2.0,4.0


In [10]:
# Distinct values after binning
odf.select('hours-per-week').distinct().orderBy('hours-per-week').toPandas().head(10)

Unnamed: 0,hours-per-week
0,
1,1.0
2,2.0
3,4.0
4,5.0


In [11]:
# Example 3 - Equal frequency binning + save binning model
odf = attribute_binning(spark, df, list_of_cols=["education-num", "hours-per-week"], method_type="equal_frequency", 
                        bin_size=5, pre_existing_model=False, model_path=outputPath + "/attribute_binning")

odf.toPandas().head(5)



Unnamed: 0,ifa,age,workclass,fnlwgt,logfnl,empty,education,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,native-country,income,hours-per-week,education-num
0,1a,,State-gov,77516.0,4.889391,,Bachelors,Never-married,Adm-clerical,Not-in-family,White,Male,2174.0,0.0,UnitedStates,<=50K,2.0,4.0
1,2a,,Self-emp-not-inc,83311.0,4.920702,,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,UnitedStates,<=50K,1.0,4.0
2,3a,38.0,Private,215646.0,5.333741,,HS-grad,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.0,0.0,UnitedStates,<=50K,2.0,1.0
3,4a,53.0,Private,234721.0,5.370552,,11th,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.0,0.0,UnitedStates,<=50K,2.0,1.0
4,5a,,Private,338409.0,5.529442,,Bachelors,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0.0,0.0,Cuba,<=50K,2.0,4.0


In [12]:
# Example 4 - Equal frequency binning + use pre-saved model
odf = attribute_binning(spark, df, list_of_cols=["education-num", "hours-per-week"], 
                        pre_existing_model=True, model_path=outputPath + "/attribute_binning")
odf.toPandas().head(5)

Unnamed: 0,ifa,age,workclass,fnlwgt,logfnl,empty,education,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,native-country,income,hours-per-week,education-num
0,1a,,State-gov,77516.0,4.889391,,Bachelors,Never-married,Adm-clerical,Not-in-family,White,Male,2174.0,0.0,UnitedStates,<=50K,2.0,4.0
1,2a,,Self-emp-not-inc,83311.0,4.920702,,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,UnitedStates,<=50K,1.0,4.0
2,3a,38.0,Private,215646.0,5.333741,,HS-grad,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.0,0.0,UnitedStates,<=50K,2.0,1.0
3,4a,53.0,Private,234721.0,5.370552,,11th,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.0,0.0,UnitedStates,<=50K,2.0,1.0
4,5a,,Private,338409.0,5.529442,,Bachelors,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0.0,0.0,Cuba,<=50K,2.0,4.0


# Monotonic Binning
- API specification of function **monotonic_binning** can be found <a href="https://docs.anovos.ai/api/data_transformer/transformers.html">here</a>
- The number of bins is not fixed in advance; it is computed dynamically so that the event rate moves monotonically across the bins
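
The idea of choosing the bin count dynamically can be sketched in plain Python. This is only an illustration under stated assumptions, not the library's code: the helper names, the descending search from `max_bins`, the `fallback` value, and the strict increase/decrease test are all hypothetical stand-ins for whatever criterion `monotonic_binning` applies internally (the API documentation linked above is the authority).

```python
# Sketch: pick the largest bin count whose per-bin event rate is strictly monotonic.

def equal_range_bins(values, n_bins):
    """Assign each value a 1-based equal-width bin index."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # Clamp so the maximum value lands in the last bin.
    return [min(int((v - lo) / width) + 1, n_bins) for v in values]


def event_rates(bins, labels):
    """Share of positive labels per bin, ordered by bin index (empty bins skipped)."""
    totals, events = {}, {}
    for b, y in zip(bins, labels):
        totals[b] = totals.get(b, 0) + 1
        events[b] = events.get(b, 0) + y
    return [events[b] / totals[b] for b in sorted(totals)]


def is_monotonic(rates):
    """True if the rates strictly increase or strictly decrease."""
    pairs = list(zip(rates, rates[1:]))
    return all(a < b for a, b in pairs) or all(a > b for a, b in pairs)


def monotonic_bin_count(values, labels, max_bins=10, fallback=3):
    """Try max_bins, max_bins - 1, ..., 2 and return the first monotonic choice."""
    for n_bins in range(max_bins, 1, -1):
        rates = event_rates(equal_range_bins(values, n_bins), labels)
        if len(rates) > 1 and is_monotonic(rates):
            return n_bins
    return fallback  # assumption: fall back to a user-chosen bin count


# Hypothetical data: 1 marks the event (for example income ">50K").
education_num = [1, 2, 3, 4, 5, 6, 7, 8]
is_event = [0, 0, 0, 0, 1, 1, 1, 1]

chosen = monotonic_bin_count(education_num, is_event)
rates = event_rates(equal_range_bins(education_num, chosen), is_event)
print(chosen, rates)  # 3 [0.0, 0.5, 1.0]
```

With this toy data every bin count above 3 produces two neighbouring bins with the same event rate, so the search settles on 3 bins.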

In [13]:
from anovos.data_transformer.transformers import monotonic_binning

In [14]:
# Example 1 - Equal Range Binning + append transformed columns at the end
odf = monotonic_binning(spark, df, list_of_cols=["education-num", "hours-per-week"], label_col="income", 
                        event_label=">50K", bin_method="equal_range", output_mode="append")
odf.toPandas().head(5)

Unnamed: 0,ifa,age,workclass,fnlwgt,logfnl,empty,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income,hours-per-week_binned,education-num_binned
0,1a,,State-gov,77516.0,4.889391,,Bachelors,13.0,Never-married,Adm-clerical,Not-in-family,White,Male,2174.0,0.0,40.0,UnitedStates,<=50K,2.0,6.0
1,2a,,Self-emp-not-inc,83311.0,4.920702,,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,13.0,UnitedStates,<=50K,1.0,6.0
2,3a,38.0,Private,215646.0,5.333741,,HS-grad,9.0,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.0,0.0,40.0,UnitedStates,<=50K,2.0,4.0
3,4a,53.0,Private,234721.0,5.370552,,11th,7.0,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.0,0.0,40.0,UnitedStates,<=50K,2.0,3.0
4,5a,,Private,338409.0,5.529442,,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0.0,0.0,40.0,Cuba,<=50K,2.0,6.0


In [15]:
# Distinct values for hours-per-week after binning 
odf.select("hours-per-week_binned").distinct().orderBy('hours-per-week_binned').toPandas()

Unnamed: 0,hours-per-week_binned
0,
1,1.0
2,2.0
3,3.0


In [16]:
# Example 2 - Equal Frequency Binning + replace original columns by transformed ones (default)
odf = monotonic_binning(spark, df, list_of_cols=["education-num", "hours-per-week"], label_col="income", 
                        event_label=">50K", bin_method="equal_frequency")
odf.toPandas().head(5)

Unnamed: 0,ifa,age,workclass,fnlwgt,logfnl,empty,education,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,native-country,income,hours-per-week,education-num
0,1a,,State-gov,77516.0,4.889391,,Bachelors,Never-married,Adm-clerical,Not-in-family,White,Male,2174.0,0.0,UnitedStates,<=50K,2.0,12.0
1,2a,,Self-emp-not-inc,83311.0,4.920702,,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,UnitedStates,<=50K,1.0,12.0
2,3a,38.0,Private,215646.0,5.333741,,HS-grad,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.0,0.0,UnitedStates,<=50K,2.0,3.0
3,4a,53.0,Private,234721.0,5.370552,,11th,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.0,0.0,UnitedStates,<=50K,2.0,2.0
4,5a,,Private,338409.0,5.529442,,Bachelors,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0.0,0.0,Cuba,<=50K,2.0,12.0


In [17]:
# Distinct values for hours-per-week after binning
odf.select("hours-per-week").distinct().orderBy('hours-per-week').toPandas()

Unnamed: 0,hours-per-week
0,
1,1.0
2,2.0
3,5.0
4,6.0


# Categorical Attribute to Numerical Attribute Conversion

## Categorical to Numerical - Unsupervised
- API specification of function **cat_to_num_unsupervised** can be found <a href="https://docs.anovos.ai/api/data_transformer/transformers.html">here</a>
- Supports label encoding and one-hot encoding
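
As a rough illustration of the two encodings, the following plain-Python sketch encodes a small hypothetical `workclass` list. The helper names are made up, and ordering labels by descending frequency with alphabetical tie-breaking is an assumption, not a statement about how Anovos assigns indices.

```python
from collections import Counter


def label_encode(values):
    """Map each category to an integer; the most frequent category gets 0."""
    counts = Counter(values)
    # Ties are broken alphabetically (an assumption for this sketch).
    ordered = sorted(counts, key=lambda c: (-counts[c], c))
    mapping = {category: index for index, category in enumerate(ordered)}
    return [mapping[v] for v in values], mapping


def one_hot_encode(values):
    """Turn each value into a 0/1 vector with one slot per category."""
    categories = sorted(set(values))
    vectors = [[1 if v == c else 0 for c in categories] for v in values]
    return vectors, categories


# Hypothetical sample values.
workclass = ["Private", "Private", "State-gov", "Private", "Self-emp", "State-gov"]

labels, mapping = label_encode(workclass)
vectors, categories = one_hot_encode(workclass)

print(labels)      # [0, 0, 1, 0, 2, 1]
print(categories)  # ['Private', 'Self-emp', 'State-gov']
print(vectors[2])  # [0, 0, 1]
```

Label encoding keeps a single column but imposes an arbitrary order on the categories, while one-hot encoding avoids that order at the cost of one column per category.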

In [18]:
from anovos.data_transformer.transformers import cat_to_num_unsupervised

In [19]:
# Example 1 - with mandatory arguments (Label Encoding)
odf = cat_to_num_unsupervised(spark, df)
odf.toPandas().head(5)

Unnamed: 0,ifa,age,workclass,fnlwgt,logfnl,empty,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,11110,,10.0,77516.0,4.889391,,2.0,13.0,1.0,3.0,1.0,0.0,0.0,2174.0,0.0,40.0,42,0
1,22221,,1.0,83311.0,4.920702,,2.0,13.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,13.0,42,0
2,25894,38.0,0.0,215646.0,5.333741,,0.0,9.0,2.0,9.0,1.0,0.0,0.0,0.0,0.0,40.0,42,0
3,27005,53.0,0.0,234721.0,5.370552,,5.0,7.0,0.0,9.0,0.0,1.0,0.0,0.0,0.0,40.0,42,0
4,28116,,0.0,338409.0,5.529442,,2.0,13.0,0.0,0.0,4.0,1.0,1.0,0.0,0.0,40.0,10,0


In [20]:
# Example 2 - 'all' columns (excluding drop_cols) + print impact
odf = cat_to_num_unsupervised(spark, df, list_of_cols='all', drop_cols=['ifa'], print_impact=True)
odf.toPandas().head(5)

Before

+-------+------+-----+-----------+-------+-----------+-----+------------+-------------+--------------+----------------+------------+-------+-----+------------+------------+--------------+--------------+------+
|summary|ifa   |age  |workclass  |fnlwgt |logfnl     |empty|education   |education-num|marital-status|occupation      |relationship|race   |sex  |capital-gain|capital-loss|hours-per-week|native-country|income|
+-------+------+-----+-----------+-------+-----------+-----+------------+-------------+--------------+----------------+------------+-------+-----+------------+------------+--------------+--------------+------+
|count  |32561 |32500|32558      |32546  |12168      |0    |32040       |32530        |32135         |32549           |32557       |32247  |32557|32548       |32549       |32452         |32561         |32561 |
|min    |10000a|17   | Private   |12285  |4.283617786|null |10th        |1            |?             |?               |*           |*      |?    |0           |0

                                                                                

+-------+------+-----+---------+-------+-----------+-----+---------+-------------+--------------+----------+------------+-----+-----+------------+------------+--------------+--------------+------+
|summary|ifa   |age  |workclass|fnlwgt |logfnl     |empty|education|education-num|marital-status|occupation|relationship|race |sex  |capital-gain|capital-loss|hours-per-week|native-country|income|
+-------+------+-----+---------+-------+-----------+-----+---------+-------------+--------------+----------+------------+-----+-----+------------+------------+--------------+--------------+------+
|count  |32561 |32500|32558    |32546  |12168      |0    |32040    |32530        |32135         |32549     |32557       |32247|32557|32548       |32549       |32452         |32561         |32561 |
|min    |10000a|17   |0        |12285  |4.283617786|null |0        |1            |0             |0         |0           |0    |0    |0           |0           |1             |0             |0     |
|max    |9a    

Unnamed: 0,ifa,age,workclass,fnlwgt,logfnl,empty,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,1a,,10.0,77516.0,4.889391,,2.0,13.0,1.0,3.0,1.0,0.0,0.0,2174.0,0.0,40.0,42,0
1,2a,,1.0,83311.0,4.920702,,2.0,13.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,13.0,42,0
2,3a,38.0,0.0,215646.0,5.333741,,0.0,9.0,2.0,9.0,1.0,0.0,0.0,0.0,0.0,40.0,42,0
3,4a,53.0,0.0,234721.0,5.370552,,5.0,7.0,0.0,9.0,0.0,1.0,0.0,0.0,0.0,40.0,42,0
4,5a,,0.0,338409.0,5.529442,,2.0,13.0,0.0,0.0,4.0,1.0,1.0,0.0,0.0,40.0,10,0


In [21]:
# Example 3 - 'all' columns (excluding drop_cols) + assign unique integers based on alphabetical order (asc)
odf = cat_to_num_unsupervised(spark, df, list_of_cols='all', drop_cols=['ifa'], index_order='alphabetAsc')
odf.toPandas().head(5)

Unnamed: 0,ifa,age,workclass,fnlwgt,logfnl,empty,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,1a,,1.0,77516.0,4.889391,,9.0,13.0,4.0,1.0,3.0,6.0,2.0,2174.0,0.0,40.0,41,0
1,2a,,8.0,83311.0,4.920702,,9.0,13.0,3.0,4.0,2.0,6.0,2.0,0.0,0.0,13.0,41,0
2,3a,38.0,6.0,215646.0,5.333741,,11.0,9.0,1.0,6.0,3.0,6.0,2.0,0.0,0.0,40.0,41,0
3,4a,53.0,6.0,234721.0,5.370552,,1.0,7.0,3.0,6.0,2.0,4.0,2.0,0.0,0.0,40.0,41,0
4,5a,,6.0,338409.0,5.529442,,9.0,13.0,3.0,10.0,7.0,4.0,1.0,0.0,0.0,40.0,6,0


In [22]:
# Example 4 - selected categorical columns + one hot encoding (method_type=0) + print impact
#odf = cat_to_num_unsupervised(spark, df, list_of_cols=['race'], method_type=0, print_impact=True)
#odf.show(1, False)

In [24]:
# Example 5 - one hot encoding + save model
#odf = cat_to_num_unsupervised(spark, df, list_of_cols=['race'], method_type=0, 
#                              pre_existing_model=False, model_path=outputPath)
#odf.show(1, False)

In [26]:
# Example 6 - one hot encoding + use pre-saved model
#odf = cat_to_num_unsupervised(spark, df, list_of_cols=['race'], method_type=0, 
#                              pre_existing_model=True, model_path=outputPath)
#odf.show(1, False)

## Categorical to Numerical - Supervised
- API specification of function **cat_to_num_supervised** can be found <a href="https://docs.anovos.ai/api/data_transformer/transformers.html">here</a>
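
One common supervised encoding replaces each category with the share of rows in that category that carry the event label. The encoded values in the summaries of the examples in this section lie between 0 and 1, which is consistent with that approach, but the plain-Python sketch below is an assumption-laden illustration rather than the Anovos implementation; the helper name, the rounding to four decimals, and the sample data are hypothetical.

```python
def event_rate_encoding(categories, labels, event_label):
    """Map each category to the fraction of its rows whose label equals event_label."""
    totals, events = {}, {}
    for category, label in zip(categories, labels):
        totals[category] = totals.get(category, 0) + 1
        events[category] = events.get(category, 0) + (1 if label == event_label else 0)
    return {c: round(events[c] / totals[c], 4) for c in totals}


# Hypothetical sample rows.
relationship = ["Husband", "Husband", "Wife", "Own-child",
                "Husband", "Wife", "Own-child", "Husband"]
income = [">50K", "<=50K", ">50K", "<=50K",
          ">50K", ">50K", "<=50K", "<=50K"]

encoding = event_rate_encoding(relationship, income, event_label=">50K")
encoded = [encoding[c] for c in relationship]

print(encoding)  # {'Husband': 0.5, 'Wife': 1.0, 'Own-child': 0.0}
```

Because the mapping is learned from the label column, it should be fitted on training data and then reused on new data, which is what saving and reloading a model achieves.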

In [27]:
from anovos.data_transformer.transformers import cat_to_num_supervised

In [28]:
# Example 1 - 'all' columns (excluding drop_cols) + print impact 
odf = cat_to_num_supervised(spark, idf=df, list_of_cols="all", drop_cols="ifa", 
                            label_col="income", event_label=">50K", run_type=run_type, auth_key=auth_key, print_impact=True)

Before:

+-------+-----+--------------+--------------+-------+------------+-----+------------+----------------+-----------+
|summary|sex  |marital-status|native-country|race   |relationship|empty|education   |occupation      |workclass  |
+-------+-----+--------------+--------------+-------+------------+-----+------------+----------------+-----------+
|count  |32557|32135         |32561         |32247  |32557       |0    |32040       |32549           |32558      |
|min    |?    |?             |*             |*      |*           |null |10th        |?               | Private   |
|max    |Male |Widowed       |Yugoslavia    |Whitess|Wife        |null |Some-college|Transport-moving|Without-pay|
+-------+-----+--------------+--------------+-------+------------+-----+------------+----------------+-----------+

After: 
+-------+------+--------------+--------------+-----+------------+------+---------+----------+---------+
|summary|sex   |marital-status|native-country|race |relationship|empty |education|

In [29]:
# Example 2 - selected categorical columns + append generated columns + print impact
odf = cat_to_num_supervised(spark, idf=df, list_of_cols=['relationship', 'marital-status'],
                            label_col="income", event_label=">50K", output_mode="append", run_type=run_type, auth_key=auth_key, print_impact=True)

Before: 
+-------+--------------+------------+
|summary|marital-status|relationship|
+-------+--------------+------------+
|count  |32135         |32557       |
|min    |?             |*           |
|max    |Widowed       |Wife        |
+-------+--------------+------------+

After: 
+-------+----------------------+--------------------+
|summary|marital-status_encoded|relationship_encoded|
+-------+----------------------+--------------------+
|count  |32135                 |32557               |
|min    |0.0458                |0.0                 |
|max    |0.4471                |0.4748              |
+-------+----------------------+--------------------+



In [30]:
# Example 3 - selected categorical columns + append generated column + save model
odf = cat_to_num_supervised(spark, idf=df, list_of_cols=['relationship', 'marital-status', 'workclass'], 
                            label_col="income", event_label=">50K", model_path=outputPath, output_mode="append", run_type=run_type, auth_key=auth_key)

In [31]:
# Example 4 - selected categorical columns + use pre-saved model
odf = cat_to_num_supervised(spark, idf=df, list_of_cols=['relationship', 'marital-status'], 
                            label_col="income", event_label=">50K", pre_existing_model=True, 
                            model_path=outputPath, run_type=run_type, auth_key=auth_key, print_impact=True)

Before: 
+-------+--------------+------------+
|summary|marital-status|relationship|
+-------+--------------+------------+
|count  |32135         |32557       |
|min    |?             |*           |
|max    |Widowed       |Wife        |
+-------+--------------+------------+

After: 
+-------+--------------+------------+
|summary|marital-status|relationship|
+-------+--------------+------------+
|count  |32135         |32557       |
|min    |0.0458        |0.0         |
|max    |0.4471        |0.4748      |
+-------+--------------+------------+



# Attribute Rescaling

## Z Standardization
- API specification of function **z_standardization** can be found <a href="https://docs.anovos.ai/api/data_transformer/transformers.html">here</a>
- Supports numerical attributes only
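
For reference, z standardization rescales a column as `z = (x - mean) / stddev`, so the result has mean 0 and standard deviation 1. The sketch below shows the arithmetic in plain Python; the helper name and sample values are hypothetical, and using the sample standard deviation and leaving missing values untouched are assumptions of this sketch.

```python
from statistics import mean, stdev


def z_standardize(values):
    """Rescale non-missing values to mean 0 and (sample) standard deviation 1."""
    present = [v for v in values if v is not None]
    mu, sigma = mean(present), stdev(present)
    # Missing values stay missing (assumption for this sketch).
    return [None if v is None else (v - mu) / sigma for v in values]


# Hypothetical sample with one missing value.
age = [38.0, 53.0, None, 28.0, 37.0]
age_scaled = z_standardize(age)

scaled_present = [v for v in age_scaled if v is not None]
print(round(mean(scaled_present), 10), round(stdev(scaled_present), 10))
```

The means in the `After` summaries of the examples (on the order of 1e-15 or smaller) are what floating-point arithmetic typically leaves of an exact zero.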

In [32]:
from anovos.data_transformer.transformers import z_standardization

In [33]:
# Example 1 - with mandatory arguments
odf = z_standardization(spark, idf=df)

In [34]:
# Example 2 - selected columns + print impact
odf = z_standardization(spark, idf=df, list_of_cols=['fnlwgt', 'age', 'hours-per-week'], print_impact=True)

Before: 
+-------+------------------+------------------+------------------+
|summary|hours-per-week    |age               |fnlwgt            |
+-------+------------------+------------------+------------------+
|count  |32452             |32500             |32546             |
|mean   |40.24972266732405 |38.506492307692305|189781.83180728814|
|stddev |11.914337669272227|13.508497735339288|105563.06445057005|
|min    |1                 |17                |12285             |
|max    |94                |85                |1484705           |
+-------+------------------+------------------+------------------+

After: 
+-------+-----------------------+----------------------+---------------------+
|summary|hours-per-week         |age                   |fnlwgt               |
+-------+-----------------------+----------------------+---------------------+
|count  |32452                  |32500                 |32546                |
|mean   |-1.2064250197331911E-15|1.1018878117633553E-16|6.30943

In [35]:
# Example 3 - 'all' columns + save model
odf = z_standardization(spark, idf=df, list_of_cols='all', model_path=outputPath)

In [36]:
# Example 4 - selected columns + append new columns + use pre-saved model + print impact
odf = z_standardization(spark, idf=df, list_of_cols=['fnlwgt', 'age', 'hours-per-week'], 
                        pre_existing_model=True, model_path=outputPath, output_mode='append', print_impact=True)

Before: 
+-------+------------------+------------------+------------------+
|summary|hours-per-week    |age               |fnlwgt            |
+-------+------------------+------------------+------------------+
|count  |32452             |32500             |32546             |
|mean   |40.24972266732405 |38.506492307692305|189781.83180728814|
|stddev |11.914337669272227|13.508497735339288|105563.06445057005|
|min    |1                 |17                |12285             |
|max    |94                |85                |1484705           |
+-------+------------------+------------------+------------------+

After: 
+-------+-----------------------+----------------------+---------------------+
|summary|hours-per-week_scaled  |age_scaled            |fnlwgt_scaled        |
+-------+-----------------------+----------------------+---------------------+
|count  |32452                  |32500                 |32546                |
|mean   |-1.2064250197331911E-15|1.1018878117633553E-16|6.30943

## IQR Standardization
- API specification of function **IQR_standardization** can be found <a href="https://docs.anovos.ai/api/data_transformer/transformers.html">here</a>
- Supports numerical attributes only
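
IQR standardization is commonly defined as `(x - median) / (Q3 - Q1)`, which makes the scaling less sensitive to outliers than the mean and standard deviation. The plain-Python sketch below assumes that definition; the helper name and sample values are hypothetical, and the quartile method used here may differ from the (possibly approximate) quantiles computed on a Spark dataframe.

```python
from statistics import quantiles


def iqr_standardize(values):
    """Centre on the median and divide by the interquartile range."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return [(v - median) / iqr for v in values]


# Hypothetical sample values.
hours = [20, 30, 35, 40, 40, 40, 45, 50, 60]
hours_scaled = iqr_standardize(hours)

print(hours_scaled)  # [-2.0, -1.0, -0.5, 0.0, 0.0, 0.0, 0.5, 1.0, 2.0]
```

Here the quartiles are 35, 40 and 45, so every value is shifted by 40 and divided by 10.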

In [37]:
from anovos.data_transformer.transformers import IQR_standardization

In [38]:
# Example 1 - with mandatory arguments
odf = IQR_standardization(spark, idf=df)

In [39]:
# Example 2 - selected columns + print impact
odf = IQR_standardization(spark, idf=df, list_of_cols=['fnlwgt', 'age', 'hours-per-week'], print_impact=True)

Before: 
+-------+------------------+------------------+------------------+
|summary|hours-per-week    |age               |fnlwgt            |
+-------+------------------+------------------+------------------+
|count  |32452             |32500             |32546             |
|mean   |40.24972266732405 |38.506492307692305|189781.83180728814|
|stddev |11.914337669272227|13.508497735339288|105563.06445057005|
|min    |1                 |17                |12285             |
|max    |94                |85                |1484705           |
+-------+------------------+------------------+------------------+

After: 
+-------+-------------------+-------------------+-------------------+
|summary|hours-per-week     |age                |fnlwgt             |
+-------+-------------------+-------------------+-------------------+
|count  |32452              |32500              |32546              |
|mean   |0.04994453346480968|0.07532461538461487|0.10898186729548563|
|stddev |2.38286753385445   |

In [40]:
# Example 3 - 'all' columns + save model
odf = IQR_standardization(spark, idf=df, list_of_cols='all', model_path=outputPath)

In [41]:
# Example 4 - selected columns + append new columns + use pre-saved model + print impact
odf = IQR_standardization(spark, idf=df, list_of_cols=['fnlwgt', 'age', 'hours-per-week'], 
                          pre_existing_model=True, model_path=outputPath, output_mode='append', print_impact=True)

Before: 
+-------+------------------+------------------+------------------+
|summary|hours-per-week    |age               |fnlwgt            |
+-------+------------------+------------------+------------------+
|count  |32452             |32500             |32546             |
|mean   |40.24972266732405 |38.506492307692305|189781.83180728814|
|stddev |11.914337669272227|13.508497735339288|105563.06445057005|
|min    |1                 |17                |12285             |
|max    |94                |85                |1484705           |
+-------+------------------+------------------+------------------+

After: 
+-------+---------------------+-------------------+-------------------+
|summary|hours-per-week_scaled|age_scaled         |fnlwgt_scaled      |
+-------+---------------------+-------------------+-------------------+
|count  |32452                |32500              |32546              |
|mean   |0.04994453346480968  |0.07532461538461487|0.10898186729548563|
|stddev |2.38286753

## Normalization
- API specification of function **normalization** can be found <a href="https://docs.anovos.ai/api/data_transformer/transformers.html">here</a>
- Supports numerical attributes only
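
The `After` summary in Example 2 (minimum 0.0 and maximum 1.0 for every column) is what min-max scaling, `(x - min) / (max - min)`, produces. The plain-Python sketch below shows that arithmetic; the helper name and sample values are hypothetical, and passing missing values through unchanged is an assumption of this sketch.

```python
def min_max_normalize(values):
    """Rescale non-missing values linearly so the minimum maps to 0 and the maximum to 1."""
    present = [v for v in values if v is not None]
    lo, hi = min(present), max(present)
    if hi == lo:
        raise ValueError("cannot rescale a constant column")
    # Missing values stay missing (assumption for this sketch).
    return [None if v is None else (v - lo) / (hi - lo) for v in values]


# Hypothetical sample with one missing value.
hours = [1.0, 40.0, None, 13.0, 94.0]
hours_scaled = min_max_normalize(hours)

print(hours_scaled[0], hours_scaled[-1])  # 0.0 1.0
```

Because the scaling depends only on the observed minimum and maximum, a single extreme value compresses all other values into a narrow range.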

In [42]:
from anovos.data_transformer.transformers import normalization

In [43]:
# Example 1 - with mandatory arguments
odf = normalization(idf=df)

In [44]:
# Example 2 - selected columns + print impact
odf = normalization(idf=df, list_of_cols=['fnlwgt', 'age', 'hours-per-week'], print_impact=True)

Before: 
+-------+------------------+------------------+------------------+
|summary|hours-per-week    |age               |fnlwgt            |
+-------+------------------+------------------+------------------+
|count  |32452             |32500             |32546             |
|mean   |40.24972266732405 |38.506492307692305|189781.83180728814|
|stddev |11.914337669272227|13.508497735339288|105563.06445057005|
|min    |1                 |17                |12285             |
|max    |94                |85                |1484705           |
+-------+------------------+------------------+------------------+

After:

+-------+-------------------+-------------------+-------------------+
|summary|hours-per-week     |age                |fnlwgt             |
+-------+-------------------+-------------------+-------------------+
|count  |32452              |32500              |32546              |
|mean   |0.42204001986123296|0.3162719473860012 |0.12054769144314768|
|stddev |0.12811115499749562|0.19865437746694414|0.07169358238758682|
|min    |0.0                |0.0                |0.0                |
|max    |1.0                |1.0                |1.0                |
+-------+-------------------+-------------------+-------------------+



                                                                                

In [45]:
# Example 3 - 'all' columns + save model
odf = normalization(idf=df, list_of_cols='all', model_path=outputPath)

In [None]:
# Example 4 - selected columns + append new columns + use pre-saved model + print impact
odf = normalization(idf=df, list_of_cols=['fnlwgt', 'age', 'hours-per-week'], 
                    pre_existing_model=True, model_path=outputPath, output_mode='append', print_impact=True)

# Missing Value Imputation

## Imputation MMM
- API specification of function **imputation_MMM** can be found <a href="https://docs.anovos.ai/api/data_transformer/transformers.html">here</a>
- 2 options for numerical attributes: median and mean
- Mode is the only option for categorical attributes
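
The three fill strategies listed above can be sketched on plain Python lists as follows. This is a minimal illustration, not the Anovos code: the helper `impute_mmm` is hypothetical, and using median as the default is an assumption (Example 2 in this section asks for mean explicitly).

```python
from statistics import mean, median, mode
from typing import List, Optional, Union

Value = Union[float, str]


def impute_mmm(values: List[Optional[Value]], method_type: str = "median") -> List[Value]:
    """Fill None entries with the mean or median (numbers) or the mode (strings)."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    if isinstance(observed[0], str):
        fill = mode(observed)  # categorical: most frequent value
    elif method_type == "mean":
        fill = mean(observed)
    elif method_type == "median":
        fill = median(observed)
    else:
        raise ValueError(f"unknown method_type: {method_type}")
    return [fill if v is None else v for v in values]


age = [38.0, None, 53.0, 37.0]
sex = ["Male", None, "Male", "Female"]
print(impute_mmm(age, "median"))
print(impute_mmm(age, "mean"))
print(impute_mmm(sex))
```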

In [47]:
from anovos.data_transformer.transformers import imputation_MMM

In [48]:
# Example 1 - with mandatory arguments + print impact
odf = imputation_MMM(spark, df, print_impact=True)

+--------------+-------------------+------------------+
|attribute     |missingCount_before|missingCount_after|
+--------------+-------------------+------------------+
|age           |61                 |0                 |
|capital-gain  |13                 |0                 |
|capital-loss  |12                 |0                 |
|education     |521                |0                 |
|education-num |31                 |0                 |
|empty         |32561              |0                 |
|fnlwgt        |15                 |0                 |
|hours-per-week|109                |0                 |
|logfnl        |20393              |0                 |
|marital-status|426                |0                 |
|occupation    |12                 |0                 |
|race          |314                |0                 |
|relationship  |4                  |0                 |
|sex           |4                  |0                 |
|workclass     |3                  |0           

In [49]:
# Example 2 - use mean for numerical columns + append transformed columns at the end
odf = imputation_MMM(spark, df, list_of_cols='all', method_type="mean", output_mode="append")
odf.toPandas().head(5)

Unnamed: 0,ifa,age,workclass,fnlwgt,logfnl,empty,education,education-num,marital-status,occupation,...,hours-per-week_imputed,logfnl_imputed,sex_imputed,marital-status_imputed,race_imputed,relationship_imputed,empty_imputed,education_imputed,occupation_imputed,workclass_imputed
0,1a,,State-gov,77516.0,4.889391,,Bachelors,13.0,Never-married,Adm-clerical,...,40,4.889391,Male,Never-married,White,Not-in-family,,Bachelors,Adm-clerical,State-gov
1,2a,,Self-emp-not-inc,83311.0,4.920702,,Bachelors,13.0,Married-civ-spouse,Exec-managerial,...,13,4.920702,Male,Married-civ-spouse,White,Husband,,Bachelors,Exec-managerial,Self-emp-not-inc
2,3a,38.0,Private,215646.0,5.333741,,HS-grad,9.0,Divorced,Handlers-cleaners,...,40,5.333741,Male,Divorced,White,Not-in-family,,HS-grad,Handlers-cleaners,Private
3,4a,53.0,Private,234721.0,5.370552,,11th,7.0,Married-civ-spouse,Handlers-cleaners,...,40,5.370552,Male,Married-civ-spouse,Black,Husband,,11th,Handlers-cleaners,Private
4,5a,,Private,338409.0,5.529442,,Bachelors,13.0,Married-civ-spouse,Prof-specialty,...,40,5.529442,Female,Married-civ-spouse,Black,Wife,,Bachelors,Prof-specialty,Private


In [50]:
odf.select('education-num', 'education-num_imputed').where(F.col("education-num").isNull()).distinct().toPandas().head(5)

Unnamed: 0,education-num,education-num_imputed
0,,10


In [51]:
# Example 3 - save model
odf = imputation_MMM(spark, df, pre_existing_model=False, model_path=outputPath)

In [52]:
# Example 4 - use pre-saved model
odf = imputation_MMM(spark, df, pre_existing_model=True, model_path=outputPath)
odf.toPandas().head(5)

Unnamed: 0,ifa,native-country,income,capital-loss,age,fnlwgt,education-num,capital-gain,hours-per-week,logfnl,sex,marital-status,race,relationship,empty,education,occupation,workclass
0,1a,UnitedStates,<=50K,0,37,77516,13,2174,40,4.889391,Male,Never-married,White,Not-in-family,,Bachelors,Adm-clerical,State-gov
1,2a,UnitedStates,<=50K,0,37,83311,13,0,13,4.920702,Male,Married-civ-spouse,White,Husband,,Bachelors,Exec-managerial,Self-emp-not-inc
2,3a,UnitedStates,<=50K,0,38,215646,9,0,40,5.333741,Male,Divorced,White,Not-in-family,,HS-grad,Handlers-cleaners,Private
3,4a,UnitedStates,<=50K,0,53,234721,7,0,40,5.370552,Male,Married-civ-spouse,Black,Husband,,11th,Handlers-cleaners,Private
4,5a,Cuba,<=50K,0,37,338409,13,0,40,5.529442,Female,Married-civ-spouse,Black,Wife,,Bachelors,Prof-specialty,Private
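
Examples 3 and 4 split imputation into a step that saves a model under `model_path` and a later step that reuses it. The sketch below mimics that pattern with a JSON file holding one median per column. The file name `imputation_stats.json` and both helper names are hypothetical; the on-disk format Anovos uses is not shown in this notebook.

```python
import json
import os
import tempfile
from statistics import median
from typing import Dict, List, Optional

Columns = Dict[str, List[Optional[float]]]


def fit_and_save(columns: Columns, model_path: str) -> Dict[str, float]:
    """Compute one median per column and persist it under model_path."""
    fills = {
        name: median(v for v in values if v is not None)
        for name, values in columns.items()
    }
    os.makedirs(model_path, exist_ok=True)
    with open(os.path.join(model_path, "imputation_stats.json"), "w") as fh:
        json.dump(fills, fh)
    return fills


def load_and_apply(columns: Columns, model_path: str) -> Columns:
    """Fill missing values with the saved medians, without recomputing them."""
    with open(os.path.join(model_path, "imputation_stats.json")) as fh:
        fills = json.load(fh)
    return {
        name: [fills[name] if v is None else v for v in values]
        for name, values in columns.items()
    }


train = {"age": [38.0, None, 53.0, 37.0]}
new_batch = {"age": [None, 60.0]}

with tempfile.TemporaryDirectory() as model_dir:
    fit_and_save(train, model_dir)
    imputed = load_and_apply(new_batch, model_dir)

print(imputed)
```

Reusing saved statistics means a new batch is filled with the values learned from the original data, so results stay consistent between runs.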


In [53]:
# Example 5 - selected columns + use pre-saved stats
from anovos.data_analyzer.stats_generator import measures_of_counts, measures_of_centralTendency
from anovos.data_ingest.data_ingest import write_dataset

write_dataset(measures_of_counts(spark, df), outputPath+"/missing", "parquet", file_configs={"mode": "overwrite"})
write_dataset(measures_of_centralTendency(spark, df), outputPath+"/mode", "parquet", file_configs={"mode": "overwrite"})

odf = imputation_MMM(spark, df, list_of_cols=['marital-status', 'sex', 'occupation', 'age'], 
                     stats_missing={"file_path":outputPath+"/missing", "file_type": "parquet"}, 
                     stats_mode={"file_path":outputPath+"/mode", "file_type": "parquet"}, print_impact=True)
odf.toPandas().head(5)

                                                                                

+--------------+-------------------+------------------+
|attribute     |missingCount_before|missingCount_after|
+--------------+-------------------+------------------+
|age           |61                 |0                 |
|marital-status|426                |0                 |
|occupation    |12                 |0                 |
|sex           |4                  |0                 |
+--------------+-------------------+------------------+



Unnamed: 0,ifa,workclass,fnlwgt,logfnl,empty,education,education-num,relationship,race,capital-gain,capital-loss,hours-per-week,native-country,income,age,sex,marital-status,occupation
0,1a,State-gov,77516.0,4.889391,,Bachelors,13.0,Not-in-family,White,2174.0,0.0,40.0,UnitedStates,<=50K,37,Male,Never-married,Adm-clerical
1,2a,Self-emp-not-inc,83311.0,4.920702,,Bachelors,13.0,Husband,White,0.0,0.0,13.0,UnitedStates,<=50K,37,Male,Married-civ-spouse,Exec-managerial
2,3a,Private,215646.0,5.333741,,HS-grad,9.0,Not-in-family,White,0.0,0.0,40.0,UnitedStates,<=50K,38,Male,Divorced,Handlers-cleaners
3,4a,Private,234721.0,5.370552,,11th,7.0,Husband,Black,0.0,0.0,40.0,UnitedStates,<=50K,53,Male,Married-civ-spouse,Handlers-cleaners
4,5a,Private,338409.0,5.529442,,Bachelors,13.0,Wife,Black,0.0,0.0,40.0,Cuba,<=50K,37,Female,Married-civ-spouse,Prof-specialty


## Imputation Sklearn
- API specification of function **imputation_sklearn** can be found <a href="https://docs.anovos.ai/api/data_transformer/transformers.html">here</a>
- Supports numerical attributes only
- 2 options supported: KNN and regression
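
To give an intuition for the KNN option, the sketch below fills each missing cell with the average of that column in the nearest rows. It is a plain-Python illustration of the general nearest-neighbour idea, not the code path Anovos runs; the helper names and the distance definition are assumptions made for this example.

```python
import math
from typing import List, Optional

Row = List[Optional[float]]


def _distance(a: Row, b: Row) -> float:
    """Root-mean-square difference over the features observed in both rows."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf
    return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))


def knn_impute(rows: List[Row], k: int = 2) -> List[Row]:
    """Fill each missing cell with the mean of that column in the k nearest rows."""
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other rows where column j is observed.
            donors = [
                (_distance(row, other), other[j])
                for n, other in enumerate(rows)
                if n != i and other[j] is not None
            ]
            donors.sort(key=lambda pair: pair[0])
            nearest = [v for _, v in donors[:k]]
            if nearest:
                filled[i][j] = sum(nearest) / len(nearest)
    return filled


rows = [[1.0, 10.0], [2.0, 20.0], [9.0, 90.0], [1.5, None]]
result = knn_impute(rows, k=2)
print(result[3])
```

In this toy data the last row is closest to the first two rows, so its missing value becomes the average of their second column.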

In [54]:
from anovos.data_transformer.transformers import imputation_sklearn

In [55]:
df = df.drop('empty')

In [56]:
print(df.count())
print(df.dropna().count())

32561
11641


In [57]:
# Example 1 - with mandatory arguments + KNN method  + print impact
odf = imputation_sklearn(spark, idf=df, run_type=run_type, auth_key=auth_key, print_impact=True)

Feature names unseen at fit time:
- _0
- _1
- _2
- _3
- _4
- ...
Feature names seen at fit time, yet now missing:
- age
- capital-gain
- capital-loss
- education-num
- fnlwgt
- ...

                                                                                

+--------------+-------------------+------------------+
|attribute     |missingCount_before|missingCount_after|
+--------------+-------------------+------------------+
|age           |61                 |0                 |
|capital-gain  |13                 |0                 |
|capital-loss  |12                 |0                 |
|education-num |31                 |0                 |
|fnlwgt        |15                 |0                 |
|hours-per-week|109                |0                 |
|logfnl        |20393              |0                 |
+--------------+-------------------+------------------+



In [58]:
# Example 2 - selected columns + regression method + print impact
odf = imputation_sklearn(spark, idf=df, list_of_cols=['education-num', 'capital-gain', 'capital-loss', 'age'], 
                         method_type='regression', run_type=run_type, auth_key=auth_key, print_impact=True)

Feature names unseen at fit time:
- _0
- _1
- _2
- _3
Feature names seen at fit time, yet now missing:
- age
- capital-gain
- capital-loss
- education-num

                                                                                

+-------------+-------------------+------------------+
|attribute    |missingCount_before|missingCount_after|
+-------------+-------------------+------------------+
|age          |61                 |0                 |
|capital-gain |13                 |0                 |
|capital-loss |12                 |0                 |
|education-num|31                 |0                 |
+-------------+-------------------+------------------+



In [59]:
# Example 3 - KNN method + smaller sample_size + save model
odf = imputation_sklearn(spark, idf=df, sample_size=1000, model_path=outputPath+'/KNN', run_type=run_type, auth_key=auth_key)

Feature names unseen at fit time:
- _0
- _1
- _2
- _3
- _4
- ...
Feature names seen at fit time, yet now missing:
- age
- capital-gain
- capital-loss
- education-num
- fnlwgt
- ...

                                                                                

In [60]:
# Confirm that no missing values remain in the imputed numerical columns
from anovos.data_analyzer.stats_generator import measures_of_counts
x = measures_of_counts(spark, odf)

# Visualization
x.orderBy('missing_count').toPandas() 

                                                                                

Unnamed: 0,attribute,fill_count,fill_pct,missing_count,missing_pct,nonzero_count,nonzero_pct
0,age,32561,1.0,0,0.0,32561.0,1.0
1,capital-gain,32561,1.0,0,0.0,2713.0,0.0833
2,capital-loss,32561,1.0,0,0.0,1525.0,0.0468
3,education-num,32561,1.0,0,0.0,32561.0,1.0
4,fnlwgt,32561,1.0,0,0.0,32561.0,1.0
5,hours-per-week,32561,1.0,0,0.0,32561.0,1.0
6,ifa,32561,1.0,0,0.0,,
7,income,32561,1.0,0,0.0,,
8,logfnl,32561,1.0,0,0.0,32561.0,1.0
9,native-country,32561,1.0,0,0.0,,


In [61]:
# Example 4 - KNN method + pre-saved model + append new columns + print impact
odf = imputation_sklearn(spark, idf=df, pre_existing_model=True, model_path=outputPath+'/KNN', 
                         output_mode='append', run_type=run_type, auth_key=auth_key, print_impact=True)

                                                                                

+--------------+-------------------+----------------------+-------------+
|attribute     |missingCount_before|attribute_after       |missing_count|
+--------------+-------------------+----------------------+-------------+
|age           |61                 |age_imputed           |0            |
|capital-gain  |13                 |capital-gain_imputed  |0            |
|capital-loss  |12                 |capital-loss_imputed  |0            |
|education-num |31                 |education-num_imputed |0            |
|fnlwgt        |15                 |fnlwgt_imputed        |0            |
|hours-per-week|109                |hours-per-week_imputed|0            |
|logfnl        |20393              |logfnl_imputed        |0            |
+--------------+-------------------+----------------------+-------------+



In [62]:
# Example 5 - regression method + smaller sample_size + save model
odf = imputation_sklearn(spark, idf=df, method_type='regression', sample_size=1000, 
                         model_path=outputPath+'/regression', run_type=run_type, auth_key=auth_key)

In [63]:
# Example 6 - regression method + pre-saved model + append new columns + print impact
odf = imputation_sklearn(spark, idf=df, pre_existing_model=True, model_path=outputPath+'/regression', 
                         output_mode='append', run_type=run_type, auth_key=auth_key, print_impact=True)

                                                                                

+--------------+-------------------+----------------------+-------------+
|attribute     |missingCount_before|attribute_after       |missing_count|
+--------------+-------------------+----------------------+-------------+
|age           |61                 |age_imputed           |0            |
|capital-gain  |13                 |capital-gain_imputed  |0            |
|capital-loss  |12                 |capital-loss_imputed  |0            |
|education-num |31                 |education-num_imputed |0            |
|fnlwgt        |15                 |fnlwgt_imputed        |0            |
|hours-per-week|109                |hours-per-week_imputed|0            |
|logfnl        |20393              |logfnl_imputed        |0            |
+--------------+-------------------+----------------------+-------------+



In [64]:
# Example 7 - use pre-saved stats
from anovos.data_analyzer.stats_generator import measures_of_counts
from anovos.data_ingest.data_ingest import write_dataset

write_dataset(measures_of_counts(spark, df), outputPath+"/missing","parquet", file_configs={"mode":"overwrite"})

odf = imputation_sklearn(spark, df, stats_missing={"file_path":outputPath+"/missing", "file_type": "parquet"}, 
                         run_type=run_type, auth_key=auth_key, 
                         print_impact=True)

Feature names unseen at fit time:
- _0
- _1
- _2
- _3
- _4
- ...
Feature names seen at fit time, yet now missing:
- age
- capital-gain
- capital-loss
- education-num
- fnlwgt
- ...

                                                                                

+--------------+-------------------+------------------+
|attribute     |missingCount_before|missingCount_after|
+--------------+-------------------+------------------+
|age           |61                 |0                 |
|capital-gain  |13                 |0                 |
|capital-loss  |12                 |0                 |
|education-num |31                 |0                 |
|fnlwgt        |15                 |0                 |
|hours-per-week|109                |0                 |
|logfnl        |20393              |0                 |
+--------------+-------------------+------------------+



## Imputation Matrix Factorization
- API specification of function **imputation_matrixFactorization** can be found <a href="https://docs.anovos.ai/api/data_transformer/transformers.html">here</a>
- Supports numerical attributes only
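
As a rough intuition for this technique, the sketch below approximates a small table by the product of two low-rank factor matrices fitted on the observed cells only, then uses that product to predict the missing cells. It is a generic stochastic-gradient illustration with hypothetical names and hyperparameters, not the algorithm or settings Anovos uses internally.

```python
import random
from typing import List, Optional

Matrix = List[List[Optional[float]]]


def mf_impute(matrix: Matrix, rank: int = 2, steps: int = 2000,
              lr: float = 0.01, reg: float = 0.02, seed: int = 0) -> List[List[float]]:
    """Fill None cells using a low-rank factorization fitted on observed cells."""
    rng = random.Random(seed)
    n_rows, n_cols = len(matrix), len(matrix[0])
    # Row factors P and column factors Q, so that matrix[i][j] ~ dot(P[i], Q[j]).
    P = [[rng.uniform(0.1, 1.0) for _ in range(rank)] for _ in range(n_rows)]
    Q = [[rng.uniform(0.1, 1.0) for _ in range(rank)] for _ in range(n_cols)]
    observed = [(i, j, v) for i, row in enumerate(matrix)
                for j, v in enumerate(row) if v is not None]

    for _ in range(steps):
        for i, j, v in observed:
            err = v - sum(p * q for p, q in zip(P[i], Q[j]))
            for f in range(rank):
                p, q = P[i][f], Q[j][f]
                # Gradient step on squared error with L2 regularization.
                P[i][f] += lr * (err * q - reg * p)
                Q[j][f] += lr * (err * p - reg * q)

    # Keep observed values; predict only the missing cells.
    return [
        [v if v is not None else sum(p * q for p, q in zip(P[i], Q[j]))
         for j, v in enumerate(row)]
        for i, row in enumerate(matrix)
    ]


table = [
    [5.0, 3.0, None],
    [4.0, None, 1.0],
    [1.0, 1.0, 5.0],
]
completed = mf_impute(table)
print(completed)
```

Observed values are returned unchanged; only the `None` cells are replaced by predictions.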

In [65]:
from anovos.data_transformer.transformers import imputation_matrixFactorization

In [66]:
# Example 1 - all columns with missing values + print impact
odf = imputation_matrixFactorization(spark, idf=df, id_col='ifa', print_impact=True)

                                                                                

+--------------+-------------------+------------------+
|attribute     |missingCount_before|missingCount_after|
+--------------+-------------------+------------------+
|age           |61                 |0                 |
|capital-gain  |13                 |0                 |
|capital-loss  |12                 |0                 |
|education-num |31                 |0                 |
|fnlwgt        |15                 |0                 |
|hours-per-week|109                |0                 |
|logfnl        |20393              |0                 |
+--------------+-------------------+------------------+



In [67]:
# Example 2 - selected columns + print impact
odf = imputation_matrixFactorization(spark, idf=df, 
                                     list_of_cols=['education-num', 'capital-gain', 'capital-loss', 'age'], 
                                     id_col='ifa', print_impact=True)

                                                                                

+-------------+-------------------+------------------+
|attribute    |missingCount_before|missingCount_after|
+-------------+-------------------+------------------+
|age          |61                 |0                 |
|capital-gain |13                 |0                 |
|capital-loss |12                 |0                 |
|education-num|31                 |0                 |
+-------------+-------------------+------------------+



In [68]:
# Example 3 - use pre-saved stats
from anovos.data_analyzer.stats_generator import measures_of_counts
from anovos.data_ingest.data_ingest import write_dataset

write_dataset(measures_of_counts(spark, df), outputPath+"/missing","parquet", file_configs={"mode":"overwrite"})

odf = imputation_matrixFactorization(spark, df, 
                                     stats_missing={"file_path":outputPath+"/missing", "file_type": "parquet"}, 
                                     print_impact=True)

                                                                                

+--------------+-------------------+------------------+
|attribute     |missingCount_before|missingCount_after|
+--------------+-------------------+------------------+
|age           |61                 |0                 |
|capital-gain  |13                 |0                 |
|capital-loss  |12                 |0                 |
|education-num |31                 |0                 |
|fnlwgt        |15                 |0                 |
|hours-per-week|109                |0                 |
|logfnl        |20393              |0                 |
+--------------+-------------------+------------------+



## Auto Imputation
- API specification of function **auto_imputation** can be found <a href="https://docs.anovos.ai/api/data_transformer/transformers.html">here</a>

In [69]:
from anovos.data_transformer.transformers import auto_imputation

In [70]:
# Example 1 - all columns with missing values + print impact
auto_imputation(spark, df, id_col='ifa', run_type=run_type, auth_key=auth_key, print_impact=True)

Feature names unseen at fit time:
- _0
- _1
- _2
- _3
- _4
- ...
Feature names seen at fit time, yet now missing:
- age
- capital-gain
- capital-loss
- education-num
- fnlwgt
- ...

                                                                                

[('MMM-mean', 4.308067563800022), ('MMM-median', 4.794413527739969), ('KNN', 4.574602930729491), ('regression', 4.277230543802758), ('matrix_factorization', 5.861640927018089)]
Best Imputation Method:  regression


DataFrame[ifa: string, age: float, fnlwgt: float, logfnl: float, education-num: float, capital-gain: float, capital-loss: float, hours-per-week: float, native-country: string, income: string, sex: string, marital-status: string, race: string, relationship: string, education: string, occupation: string, workclass: string, index: int]
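
The printed list pairs each candidate method with a score, and the method reported as best has the smallest score. The sketch below reproduces that selection step from the printed numbers. Treating the score as an error measure to be minimized is an assumption drawn from this output, not a statement about how Anovos computes it.

```python
# Scores copied from the printed output of Example 1.
scores = [
    ("MMM-mean", 4.308067563800022),
    ("MMM-median", 4.794413527739969),
    ("KNN", 4.574602930729491),
    ("regression", 4.277230543802758),
    ("matrix_factorization", 5.861640927018089),
]

# Assumption: the score is an error measure, so the smallest value wins.
best_method, best_score = min(scores, key=lambda pair: pair[1])
print("Best Imputation Method: ", best_method)
```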

In [None]:
# Example 2 - selected columns + customized null_pct + print impact
odf = auto_imputation(spark, df, list_of_cols=['education-num', 'capital-gain', 'capital-loss', 'age'],
                      id_col='ifa', null_pct=0.5, run_type=run_type, auth_key=auth_key, print_impact=True)

In [None]:
# Example 3 - selected columns + use pre-saved stats + print impact
from anovos.data_analyzer.stats_generator import measures_of_counts
from anovos.data_ingest.data_ingest import write_dataset

write_dataset(measures_of_counts(spark, df), outputPath+"/missing","parquet", file_configs={"mode":"overwrite"})

odf = auto_imputation(spark, df, list_of_cols=['education-num', 'capital-gain', 'capital-loss', 'age'], 
                      id_col='ifa', stats_missing={"file_path":outputPath+"/missing", "file_type": "parquet"},
                      run_type=run_type, auth_key=auth_key,
                      print_impact=True)

# Latent Features Generation

## Autoencoder Latent Features
- API specification of function **autoencoder_latentFeatures** can be found <a href="https://docs.anovos.ai/api/data_transformer/transformers.html">here</a>
- Supports numerical attributes only

In [72]:
from anovos.data_transformer.transformers import autoencoder_latentFeatures

In [73]:
# Example 1 - with mandatory arguments + print impact
odf = autoencoder_latentFeatures(spark, df, run_type=run_type, auth_key=auth_key, print_impact=True)
odf.limit(5).toPandas()

Epoch 1/100


2022-06-06 21:37:12.996539: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


38/38 - 1s - loss: 1.0850 - val_loss: 0.9007 - 1s/epoch - 29ms/step
Epoch 2/100
38/38 - 0s - loss: 0.7602 - val_loss: 0.8230 - 84ms/epoch - 2ms/step
Epoch 3/100
38/38 - 0s - loss: 0.6549 - val_loss: 0.7478 - 81ms/epoch - 2ms/step
Epoch 4/100
38/38 - 0s - loss: 0.5791 - val_loss: 0.6609 - 81ms/epoch - 2ms/step
Epoch 5/100
38/38 - 0s - loss: 0.5133 - val_loss: 0.5683 - 82ms/epoch - 2ms/step
Epoch 6/100
38/38 - 0s - loss: 0.4607 - val_loss: 0.4883 - 76ms/epoch - 2ms/step
Epoch 7/100
38/38 - 0s - loss: 0.4177 - val_loss: 0.4253 - 85ms/epoch - 2ms/step
Epoch 8/100
38/38 - 0s - loss: 0.3807 - val_loss: 0.3795 - 92ms/epoch - 2ms/step
Epoch 9/100
38/38 - 0s - loss: 0.3475 - val_loss: 0.3456 - 77ms/epoch - 2ms/step
Epoch 10/100
38/38 - 0s - loss: 0.3185 - val_loss: 0.3208 - 82ms/epoch - 2ms/step
Epoch 11/100
38/38 - 0s - loss: 0.2919 - val_loss: 0.2998 - 75ms/epoch - 2ms/step
Epoch 12/100
38/38 - 0s - loss: 0.2718 - val_loss: 0.2856 - 79ms/epoch - 2ms/step
Epoch 13/100
38/38 - 0s - loss: 0.2542



+-------+-------------------+-------------------+------------------+
|summary|latent_0           |latent_1           |latent_2          |
+-------+-------------------+-------------------+------------------+
|count  |12085              |12085              |12085             |
|mean   |-0.4632727453802203|0.12079895275269655|0.6199142043967952|
|stddev |1.2692273308680315 |0.7166688923572926 |1.114573671901397 |
|min    |-9.905482          |-1.9535779         |-10.268689        |
|max    |7.6794047          |8.529205           |2.9026842         |
+-------+-------------------+-------------------+------------------+



Unnamed: 0,ifa,workclass,education,marital-status,occupation,relationship,race,sex,native-country,income,capital-loss_scaled,age_scaled,fnlwgt_scaled,education-num_scaled,capital-gain_scaled,hours-per-week_scaled,logfnl_scaled,latent_0,latent_1,latent_2
0,3a,Private,HS-grad,Divorced,Handlers-cleaners,Not-in-family,White,Male,UnitedStates,<=50K,-0.216698,-0.037494,0.245012,-0.420201,-0.145898,-0.02096,0.467746,-0.29241,-0.305584,0.933934
1,4a,Private,11th,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,UnitedStates,<=50K,-0.216698,1.072918,0.425709,-1.197652,-0.145898,-0.02096,0.601973,-0.507136,-0.526128,0.76163
2,6a,Private,Masters,Married-civ-spouse,Exec-managerial,Wife,White,Female,United-States,<=50K,-0.216698,-0.111522,0.898043,1.523426,-0.145898,-0.02096,0.907015,0.22948,-0.058834,1.054093
3,7a,Private,,,Other-service,Not-in-family,Black,Female,Jamaica,<=50K,-0.216698,0.776808,-0.280352,-1.975102,-0.145898,-2.03534,-0.003056,-1.20917,-0.377125,1.086815
4,8a,Self-emp-not-inc,HS-grad,Married-civ-spouse,Exec-managerial,Husband,White,Male,United-States,>50K,-0.216698,0.99889,0.188136,-0.420201,-0.145898,0.398703,0.42303,-0.458912,-0.309971,0.547693


In [74]:
# Example 2 - selected columns + fewer epochs + larger batch size + print impact
odf = autoencoder_latentFeatures(spark, df, list_of_cols=['education-num', 'capital-gain', 'capital-loss', 'age'],
                                 epochs=50, batch_size=528, run_type=run_type, auth_key=auth_key, print_impact=True)
odf.limit(5).toPandas()

Epoch 1/50
50/50 - 1s - loss: 1.2696 - val_loss: 4.0839 - 1s/epoch - 22ms/step
Epoch 2/50
50/50 - 0s - loss: 0.7229 - val_loss: 3.7447 - 89ms/epoch - 2ms/step
Epoch 3/50
50/50 - 0s - loss: 0.5350 - val_loss: 3.3727 - 86ms/epoch - 2ms/step
Epoch 4/50
50/50 - 0s - loss: 0.4644 - val_loss: 2.9583 - 88ms/epoch - 2ms/step
Epoch 5/50
50/50 - 0s - loss: 0.4258 - val_loss: 2.6219 - 105ms/epoch - 2ms/step
Epoch 6/50
50/50 - 0s - loss: 0.4086 - val_loss: 2.3719 - 98ms/epoch - 2ms/step
Epoch 7/50
50/50 - 0s - loss: 0.3853 - val_loss: 2.1977 - 87ms/epoch - 2ms/step
Epoch 8/50
50/50 - 0s - loss: 0.3672 - val_loss: 2.0930 - 93ms/epoch - 2ms/step
Epoch 9/50
50/50 - 0s - loss: 0.3583 - val_loss: 2.0373 - 88ms/epoch - 2ms/step
Epoch 10/50
50/50 - 0s - loss: 0.3508 - val_loss: 1.9708 - 87ms/epoch - 2ms/step
Epoch 11/50
50/50 - 0s - loss: 0.3394 - val_loss: 1.9212 - 91ms/epoch - 2ms/step
Epoch 12/50
50/50 - 0s - loss: 0.3281 - val_loss: 1.8608 - 122ms/epoch - 2ms/step
Epoch 13/50
50/50 - 0s - loss: 0.314

2022-06-06 21:37:39.379437: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


+-------+------------------+------------------+
|summary|latent_0          |latent_1          |
+-------+------------------+------------------+
|count  |32466             |32466             |
|mean   |0.7169663959309821|0.3399314356356994|
|stddev |0.9066275103021408|0.5671097762470548|
|min    |-4.508261         |-4.961321         |
|max    |8.697022          |1.3688395         |
+-------+------------------+------------------+



Unnamed: 0,ifa,workclass,fnlwgt,logfnl,education,marital-status,occupation,relationship,race,sex,hours-per-week,native-country,income,capital-loss_scaled,education-num_scaled,age_scaled,capital-gain_scaled,latent_0,latent_1
0,3a,Private,215646,5.333741,HS-grad,Divorced,Handlers-cleaners,Not-in-family,White,Male,40,UnitedStates,<=50K,-0.216698,-0.420201,-0.037494,-0.145898,0.560212,0.56553
1,4a,Private,234721,5.370552,11th,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,40,UnitedStates,<=50K,-0.216698,-1.197652,1.072918,-0.145898,-0.036318,0.323467
2,6a,Private,284582,5.454207,Masters,Married-civ-spouse,Exec-managerial,Wife,White,Female,40,United-States,<=50K,-0.216698,1.523426,-0.111522,-0.145898,1.49924,0.054981
3,7a,Private,160187,5.204627,,,Other-service,Not-in-family,Black,Female,16,Jamaica,<=50K,-0.216698,-1.975102,0.776808,-0.145898,-0.454653,0.48879
4,8a,Self-emp-not-inc,209642,5.321478,HS-grad,Married-civ-spouse,Exec-managerial,Husband,White,Male,45,United-States,>50K,-0.216698,-0.420201,0.99889,-0.145898,0.409022,0.223609


In [75]:
# Example 3 - selected columns + smaller sample_size used for training + save model
odf = autoencoder_latentFeatures(spark, df, list_of_cols=['education-num', 'capital-gain', 'capital-loss', 'age'],
                                 sample_size=20000, model_path=outputPath, run_type=run_type, auth_key=auth_key)
odf.limit(5).toPandas()

Epoch 1/100
63/63 - 1s - loss: 0.8790 - val_loss: 3.2559 - 1s/epoch - 16ms/step
Epoch 2/100
63/63 - 0s - loss: 0.6147 - val_loss: 2.9312 - 104ms/epoch - 2ms/step
Epoch 3/100
63/63 - 0s - loss: 0.5475 - val_loss: 2.6809 - 100ms/epoch - 2ms/step
Epoch 4/100
63/63 - 0s - loss: 0.5067 - val_loss: 2.4634 - 106ms/epoch - 2ms/step
Epoch 5/100
63/63 - 0s - loss: 0.4530 - val_loss: 2.0339 - 107ms/epoch - 2ms/step
Epoch 6/100
63/63 - 0s - loss: 0.3792 - val_loss: 1.4039 - 95ms/epoch - 2ms/step
Epoch 7/100
63/63 - 0s - loss: 0.2942 - val_loss: 1.0420 - 97ms/epoch - 2ms/step
Epoch 8/100
63/63 - 0s - loss: 0.2311 - val_loss: 0.8354 - 95ms/epoch - 2ms/step
Epoch 9/100
63/63 - 0s - loss: 0.1836 - val_loss: 0.7035 - 96ms/epoch - 2ms/step
Epoch 10/100
63/63 - 0s - loss: 0.1534 - val_loss: 0.6123 - 96ms/epoch - 2ms/step
Epoch 11/100
63/63 - 0s - loss: 0.1359 - val_loss: 0.5451 - 96ms/epoch - 2ms/step
Epoch 12/100
63/63 - 0s - loss: 0.1270 - val_loss: 0.5074 - 93ms/epoch - 1ms/step
Epoch 13/100
63/63 - 0

Unnamed: 0,ifa,workclass,fnlwgt,logfnl,education,marital-status,occupation,relationship,race,sex,hours-per-week,native-country,income,capital-loss_scaled,education-num_scaled,age_scaled,capital-gain_scaled,latent_0,latent_1
0,3a,Private,215646,5.333741,HS-grad,Divorced,Handlers-cleaners,Not-in-family,White,Male,40,UnitedStates,<=50K,-0.216698,-0.420201,-0.037494,-0.145898,-0.386735,-0.024991
1,4a,Private,234721,5.370552,11th,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,40,UnitedStates,<=50K,-0.216698,-1.197652,1.072918,-0.145898,-0.669264,-0.365342
2,6a,Private,284582,5.454207,Masters,Married-civ-spouse,Exec-managerial,Wife,White,Female,40,United-States,<=50K,-0.216698,1.523426,-0.111522,-0.145898,0.189998,-0.109535
3,7a,Private,160187,5.204627,,,Other-service,Not-in-family,Black,Female,16,Jamaica,<=50K,-0.216698,-1.975102,0.776808,-0.145898,-0.828042,-0.195159
4,8a,Self-emp-not-inc,209642,5.321478,HS-grad,Married-civ-spouse,Exec-managerial,Husband,White,Male,45,United-States,>50K,-0.216698,-0.420201,0.99889,-0.145898,-0.501179,-0.345677
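
The training log for `sample_size=20000` reports 63 steps per epoch, while the runs in Examples 5 to 7 report 102. One reading that reproduces both numbers is an 80/20 train/validation split combined with a batch size of 256. Neither value is stated in this notebook, so both are assumptions, and `steps_per_epoch` is a hypothetical helper rather than part of ANOVOS.

```python
import math


def steps_per_epoch(n_rows: int, batch_size: int = 256) -> int:
    """Batches per epoch under an ASSUMED 80/20 train/validation split."""
    # Integer arithmetic avoids floating-point rounding in the split.
    n_train = n_rows * 4 // 5
    return math.ceil(n_train / batch_size)


print(steps_per_epoch(20000))  # 63
print(steps_per_epoch(32466))  # 102
```

If the real split or batch size differs, the agreement with the log is coincidental, so treat this only as a way to sanity-check the output.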


In [76]:
# Example 4 - use pre-saved model
odf = autoencoder_latentFeatures(spark, df, list_of_cols=['education-num', 'capital-gain', 'capital-loss', 'age'], 
                                 pre_existing_model=True, model_path=outputPath, run_type=run_type, auth_key=auth_key, print_impact=True)







2022-06-06 21:37:59.397717: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.

+-------+--------------------+--------------------+
|summary|latent_0            |latent_1            |
+-------+--------------------+--------------------+
|count  |32466               |32466               |
|mean   |-0.05653757447076532|-0.11215917623360833|
|stddev |1.2736202417470284  |0.6219842310068949  |
|min    |-1.4399135          |-6.1083527          |
|max    |12.846301           |2.8634453           |
+-------+--------------------+--------------------+



                                                                                

In [77]:
# Example 5 - selected columns + use pre-saved stats + print impact
from anovos.data_analyzer.stats_generator import measures_of_counts
from anovos.data_ingest.data_ingest import write_dataset

write_dataset(measures_of_counts(spark, df), outputPath+"/missing","parquet", file_configs={"mode":"overwrite"})

odf = autoencoder_latentFeatures(spark, df, list_of_cols=['education-num', 'capital-gain', 'capital-loss', 'age'], 
                                 stats_missing={"file_path":outputPath+"/missing", "file_type": "parquet"}, 
                                 run_type=run_type, auth_key=auth_key,
                                 print_impact=True)



Epoch 1/100
102/102 - 1s - loss: 1.1466 - val_loss: 4.1183 - 1s/epoch - 11ms/step
Epoch 2/100
102/102 - 0s - loss: 0.6184 - val_loss: 3.9010 - 130ms/epoch - 1ms/step
Epoch 3/100
102/102 - 0s - loss: 0.5301 - val_loss: 3.2086 - 140ms/epoch - 1ms/step
Epoch 4/100
102/102 - 0s - loss: 0.4924 - val_loss: 2.7718 - 128ms/epoch - 1ms/step
Epoch 5/100
102/102 - 0s - loss: 0.4562 - val_loss: 2.6261 - 125ms/epoch - 1ms/step
Epoch 6/100
102/102 - 0s - loss: 0.4138 - val_loss: 1.9632 - 151ms/epoch - 1ms/step
Epoch 7/100
102/102 - 0s - loss: 0.3636 - val_loss: 1.4049 - 132ms/epoch - 1ms/step
Epoch 8/100
102/102 - 0s - loss: 0.3223 - val_loss: 0.9999 - 123ms/epoch - 1ms/step
Epoch 9/100
102/102 - 0s - loss: 0.2938 - val_loss: 0.9420 - 129ms/epoch - 1ms/step
Epoch 10/100
102/102 - 0s - loss: 0.2589 - val_loss: 0.7618 - 142ms/epoch - 1ms/step
Epoch 11/100
102/102 - 0s - loss: 0.2300 - val_loss: 0.8268 - 130ms/epoch - 1ms/step
Epoch 12/100
102/102 - 0s - loss: 0.1950 - val_loss: 0.8579 - 144ms/epoch - 

Epoch 98/100
102/102 - 0s - loss: 0.0633 - val_loss: 0.3496 - 134ms/epoch - 1ms/step
Epoch 99/100
102/102 - 0s - loss: 0.0593 - val_loss: 0.3458 - 154ms/epoch - 2ms/step
Epoch 100/100
102/102 - 0s - loss: 0.0603 - val_loss: 0.3457 - 141ms/epoch - 1ms/step




+-------+--------------------+--------------------+
|summary|latent_0            |latent_1            |
+-------+--------------------+--------------------+
|count  |32466               |32466               |
|mean   |-0.31583618377285744|-0.14931815564615777|
|stddev |0.43429596746537863 |0.9673047256929036  |
|min    |-1.0152589          |-0.9238313          |
|max    |3.506217            |10.180667           |
+-------+--------------------+--------------------+



                                                                                

In [78]:
# Example 6 - use pre-saved standardization model
odf = autoencoder_latentFeatures(spark, df, list_of_cols=['education-num', 'capital-gain', 'capital-loss', 'age'], 
                                 standardization_configs={"pre_existing_model": True, "model_path": outputPath}, 
                                 run_type=run_type, auth_key=auth_key, print_impact=True)

Epoch 1/100
102/102 - 1s - loss: 1.3065 - val_loss: 4.0110 - 1s/epoch - 14ms/step
Epoch 2/100
102/102 - 0s - loss: 0.7568 - val_loss: 3.7825 - 157ms/epoch - 2ms/step
Epoch 3/100
102/102 - 0s - loss: 0.6408 - val_loss: 3.4865 - 158ms/epoch - 2ms/step
Epoch 4/100
102/102 - 0s - loss: 0.5929 - val_loss: 3.2574 - 156ms/epoch - 2ms/step
Epoch 5/100
102/102 - 0s - loss: 0.5620 - val_loss: 3.0935 - 154ms/epoch - 2ms/step
Epoch 6/100
102/102 - 0s - loss: 0.5361 - val_loss: 2.8638 - 160ms/epoch - 2ms/step
Epoch 7/100
102/102 - 0s - loss: 0.5096 - val_loss: 2.6723 - 159ms/epoch - 2ms/step
Epoch 8/100
102/102 - 0s - loss: 0.4872 - val_loss: 2.3571 - 162ms/epoch - 2ms/step
Epoch 9/100
102/102 - 0s - loss: 0.4559 - val_loss: 2.2463 - 162ms/epoch - 2ms/step
Epoch 10/100
102/102 - 0s - loss: 0.4169 - val_loss: 1.9714 - 160ms/epoch - 2ms/step
Epoch 11/100
102/102 - 0s - loss: 0.3765 - val_loss: 1.6331 - 162ms/epoch - 2ms/step
Epoch 12/100
102/102 - 0s - loss: 0.3342 - val_loss: 1.3105 - 171ms/epoch - 

Epoch 98/100
102/102 - 0s - loss: 0.0881 - val_loss: 0.4654 - 212ms/epoch - 2ms/step
Epoch 99/100
102/102 - 0s - loss: 0.0893 - val_loss: 0.4455 - 211ms/epoch - 2ms/step
Epoch 100/100
102/102 - 0s - loss: 0.0852 - val_loss: 0.4418 - 203ms/epoch - 2ms/step




+-------+-------------------+------------------+
|summary|latent_0           |latent_1          |
+-------+-------------------+------------------+
|count  |32466              |32466             |
|mean   |0.25219409280514377|0.2867071796988021|
|stddev |0.9106507070747841 |0.8443175675621726|
|min    |-2.6947098         |-3.72358          |
|max    |5.2920003          |1.726284          |
+-------+-------------------+------------------+



                                                                                

In [79]:
# Example 7 - impute missing values before calculation
odf = autoencoder_latentFeatures(spark, df, list_of_cols=['education-num', 'capital-gain', 'capital-loss', 'age'], 
                                 imputation=True, run_type=run_type, auth_key=auth_key, print_impact=True)

Epoch 1/100
102/102 - 2s - loss: 1.4308 - val_loss: 3.2197 - 2s/epoch - 19ms/step
Epoch 2/100
102/102 - 0s - loss: 0.6766 - val_loss: 2.4030 - 215ms/epoch - 2ms/step
Epoch 3/100
102/102 - 0s - loss: 0.5033 - val_loss: 1.7942 - 204ms/epoch - 2ms/step
Epoch 4/100
102/102 - 0s - loss: 0.4155 - val_loss: 1.4679 - 210ms/epoch - 2ms/step
Epoch 5/100
102/102 - 0s - loss: 0.3622 - val_loss: 1.2798 - 201ms/epoch - 2ms/step
Epoch 6/100
102/102 - 0s - loss: 0.3130 - val_loss: 1.1359 - 197ms/epoch - 2ms/step
Epoch 7/100
102/102 - 0s - loss: 0.2720 - val_loss: 0.9359 - 212ms/epoch - 2ms/step
Epoch 8/100
102/102 - 0s - loss: 0.2263 - val_loss: 0.7537 - 213ms/epoch - 2ms/step
Epoch 9/100
102/102 - 0s - loss: 0.1856 - val_loss: 0.6328 - 208ms/epoch - 2ms/step
Epoch 10/100
102/102 - 0s - loss: 0.1550 - val_loss: 0.4908 - 200ms/epoch - 2ms/step
Epoch 11/100
102/102 - 0s - loss: 0.1312 - val_loss: 0.4184 - 193ms/epoch - 2ms/step
Epoch 12/100
102/102 - 0s - loss: 0.1154 - val_loss: 0.3786 - 188ms/epoch - 

Epoch 98/100
102/102 - 0s - loss: 0.0532 - val_loss: 0.3223 - 173ms/epoch - 2ms/step
Epoch 99/100
102/102 - 0s - loss: 0.0525 - val_loss: 0.2691 - 182ms/epoch - 2ms/step
Epoch 100/100
102/102 - 0s - loss: 0.0517 - val_loss: 0.2137 - 173ms/epoch - 2ms/step




+-------+-------------------+--------------------+
|summary|latent_0           |latent_1            |
+-------+-------------------+--------------------+
|count  |32561              |32561               |
|mean   |-0.2211945572867589|-0.05426675022083278|
|stddev |0.30760011620338656|0.3587129319697645  |
|min    |-3.558371          |-3.9723237          |
|max    |1.7913073          |0.7051387           |
+-------+-------------------+--------------------+
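
With `imputation=True` the summary covers 32561 rows instead of the 32466 seen in the earlier examples. This is consistent with rows that contain nulls in the selected columns being excluded when imputation is off. The pure-Python sketch below contrasts the two behaviours. The toy rows, the helper names, and the use of the column mean as the fill value are illustrative assumptions, not the ANOVOS implementation.

```python
from statistics import mean

# Toy data: two of the four rows have a missing value.
rows = [
    {"age": 38.0, "education-num": 9.0},
    {"age": None, "education-num": 13.0},
    {"age": 53.0, "education-num": None},
    {"age": 45.0, "education-num": 9.0},
]
cols = ["age", "education-num"]


def drop_incomplete(rows, cols):
    """Keep only rows where every selected column is present."""
    return [r for r in rows if all(r[c] is not None for c in cols)]


def impute_mean(rows, cols):
    """Fill missing values with the mean of the observed values per column."""
    fill = {c: mean(r[c] for r in rows if r[c] is not None) for c in cols}
    return [{c: (r[c] if r[c] is not None else fill[c]) for c in cols} for r in rows]


print(len(drop_incomplete(rows, cols)))  # 2
print(len(impute_mean(rows, cols)))      # 4
```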



                                                                                

## PCA Latent Features
- API specification of function **PCA_latentFeatures** can be found <a href="https://docs.anovos.ai/api/data_transformer/transformers.html">here</a>
- Supports numerical attributes only

In [80]:
from anovos.data_transformer.transformers import PCA_latentFeatures

In [81]:
# Example 1 - with mandatory arguments + print impact
odf = PCA_latentFeatures(spark, df, standardization=True, run_type=run_type, auth_key=auth_key, print_impact=True)
odf.limit(5).toPandas()



Explained Variance:  0.9866


                                                                                

+-------+--------------------+----------------------+--------------------+---------------------+---------------------+---------------------+
|summary|latent_0            |latent_1              |latent_2            |latent_3             |latent_4             |latent_5             |
+-------+--------------------+----------------------+--------------------+---------------------+---------------------+---------------------+
|count  |12085               |12085                 |12085               |12085                |12085                |12085                |
|mean   |0.006990473625180995|-0.0016755667182939693|0.005319318090569765|-0.009834625663188677|-0.007587318127886196|-0.008249612368145467|
|stddev |1.381797632041396   |1.1228193008371192    |1.0150365936036754  |0.9838047392914232   |0.9419796818084403   |0.8971511772848002   |
|min    |-3.6521552          |-8.73415              |-8.272692           |-2.186532            |-4.0134373           |-7.4833627           |
|max    |9.02

Unnamed: 0,ifa,workclass,education,marital-status,occupation,relationship,race,sex,native-country,income,latent_0,latent_1,latent_2,latent_3,latent_4,latent_5
0,3a,Private,HS-grad,Divorced,Handlers-cleaners,Not-in-family,White,Male,UnitedStates,<=50K,0.54468,0.323292,-0.042623,0.146981,-0.248483,-0.113231
1,4a,Private,11th,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,UnitedStates,<=50K,0.70714,0.398122,0.210681,1.422995,-0.590206,-0.326371
2,6a,Private,Masters,Married-civ-spouse,Exec-managerial,Wife,White,Female,United-States,<=50K,1.169402,-0.817424,-0.208944,-0.57067,0.319456,1.234487
3,7a,Private,,,Other-service,Not-in-family,Black,Female,Jamaica,<=50K,-0.014027,2.157362,0.256786,1.788715,0.760998,-0.507208
4,8a,Self-emp-not-inc,HS-grad,Married-civ-spouse,Exec-managerial,Husband,White,Male,United-States,>50K,0.342476,-0.209419,0.112829,0.979675,-0.671754,0.07136


In [82]:
# Example 2 - selected columns + customized explained_variance_cutoff + print impact
odf = PCA_latentFeatures(spark, df, list_of_cols=['education-num', 'capital-gain', 'capital-loss', 'age'], 
                         explained_variance_cutoff=0.6, standardization=True, run_type=run_type, auth_key=auth_key, print_impact=True)
odf.limit(5).toPandas()

Explained Variance:  0.7943


                                                                                

+-------+---------------------+--------------------+----------------------+
|summary|latent_0             |latent_1            |latent_2              |
+-------+---------------------+--------------------+----------------------+
|count  |32466                |32466               |32466                 |
|mean   |1.6552674503466662E-4|6.358924743322154E-6|-2.8812043635684855E-4|
|stddev |1.0870763482758585   |1.014307711010815   |0.9826621520658786    |
|min    |-9.827796            |-7.914514           |-4.3574286            |
|max    |2.9694445            |9.095775            |2.3905334             |
+-------+---------------------+--------------------+----------------------+



Unnamed: 0,ifa,workclass,fnlwgt,logfnl,education,marital-status,occupation,relationship,race,sex,hours-per-week,native-country,income,latent_0,latent_1,latent_2
0,3a,Private,215646,5.333741,HS-grad,Divorced,Handlers-cleaners,Not-in-family,White,Male,40,UnitedStates,<=50K,0.423203,-0.10286,-0.229906
1,4a,Private,234721,5.370552,11th,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,40,UnitedStates,<=50K,0.380385,0.019731,-1.574839
2,6a,Private,284582,5.454207,Masters,Married-civ-spouse,Exec-managerial,Wife,White,Female,40,United-States,<=50K,-0.725312,-0.091507,0.90499
3,7a,Private,160187,5.204627,,,Other-service,Not-in-family,Black,Female,16,Jamaica,<=50K,0.991146,-0.023124,-1.760285
4,8a,Self-emp-not-inc,209642,5.321478,HS-grad,Married-civ-spouse,Exec-managerial,Husband,White,Male,45,United-States,>50K,-0.058382,0.019048,-1.084266
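
In Example 2, `explained_variance_cutoff=0.6` yields three latent columns with a reported explained variance of 0.7943. That is consistent with keeping the smallest number of leading components whose cumulative explained-variance ratio reaches the cutoff. The sketch below shows that selection rule with made-up per-component ratios and a hypothetical helper `n_components_for_cutoff`; the actual ANOVOS logic may differ.

```python
from itertools import accumulate


def n_components_for_cutoff(ratios, cutoff):
    """Return (k, cumulative_ratio) for the smallest k reaching the cutoff.

    `ratios` are per-component explained-variance ratios in descending order.
    If the cutoff is never reached, all components are kept.
    """
    for k, total in enumerate(accumulate(ratios), start=1):
        if total >= cutoff:
            return k, total
    return len(ratios), sum(ratios)


# Illustrative ratios for four weakly correlated, standardized columns.
ratios = [0.30, 0.26, 0.24, 0.20]
k, total = n_components_for_cutoff(ratios, cutoff=0.6)
print(k, round(total, 2))  # 3 0.8
```

Two components would cover only 0.56 here, so a third is needed, and the reported explained variance ends up above the cutoff rather than equal to it.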


In [83]:
# Example 3 - all numerical columns (default) + save model
odf = PCA_latentFeatures(spark, df, model_path=outputPath, standardization=True, run_type=run_type, auth_key=auth_key)

In [84]:
# Example 4 - all numerical columns (default) + use pre-saved model + print impact
odf = PCA_latentFeatures(spark, df, pre_existing_model=True, model_path=outputPath, standardization=True, 
                         run_type=run_type, auth_key=auth_key, print_impact=True)
odf.limit(5).toPandas()

Explained Variance:  0.9866




+-------+--------------------+----------------------+--------------------+---------------------+---------------------+---------------------+
|summary|latent_0            |latent_1              |latent_2            |latent_3             |latent_4             |latent_5             |
+-------+--------------------+----------------------+--------------------+---------------------+---------------------+---------------------+
|count  |12085               |12085                 |12085               |12085                |12085                |12085                |
|mean   |0.006990473625180995|-0.0016755667182939693|0.005319318090569765|-0.009834625663188677|-0.007587318127886196|-0.008249612368145467|
|stddev |1.381797632041396   |1.1228193008371192    |1.0150365936036754  |0.9838047392914232   |0.9419796818084403   |0.8971511772848002   |
|min    |-3.6521552          |-8.73415              |-8.272692           |-2.186532            |-4.0134373           |-7.4833627           |
|max    |9.02

Unnamed: 0,ifa,workclass,education,marital-status,occupation,relationship,race,sex,native-country,income,latent_0,latent_1,latent_2,latent_3,latent_4,latent_5
0,3a,Private,HS-grad,Divorced,Handlers-cleaners,Not-in-family,White,Male,UnitedStates,<=50K,0.54468,0.323292,-0.042623,0.146981,-0.248483,-0.113231
1,4a,Private,11th,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,UnitedStates,<=50K,0.70714,0.398122,0.210681,1.422995,-0.590206,-0.326371
2,6a,Private,Masters,Married-civ-spouse,Exec-managerial,Wife,White,Female,United-States,<=50K,1.169402,-0.817424,-0.208944,-0.57067,0.319456,1.234487
3,7a,Private,,,Other-service,Not-in-family,Black,Female,Jamaica,<=50K,-0.014027,2.157362,0.256786,1.788715,0.760998,-0.507208
4,8a,Self-emp-not-inc,HS-grad,Married-civ-spouse,Exec-managerial,Husband,White,Male,United-States,>50K,0.342476,-0.209419,0.112829,0.979675,-0.671754,0.07136


In [85]:
# Example 5 - all numerical columns (default) + use pre-saved stats + print impact
from anovos.data_analyzer.stats_generator import measures_of_counts
from anovos.data_ingest.data_ingest import write_dataset

write_dataset(measures_of_counts(spark, df), outputPath+"/missing","parquet", file_configs={"mode":"overwrite"})

odf = PCA_latentFeatures(spark, df, standardization=True, 
                         stats_missing={"file_path":outputPath+"/missing", "file_type": "parquet"}, 
                         run_type=run_type, auth_key=auth_key,
                         print_impact=True)

Explained Variance:  0.9866
+-------+--------------------+----------------------+--------------------+---------------------+---------------------+---------------------+
|summary|latent_0            |latent_1              |latent_2            |latent_3             |latent_4             |latent_5             |
+-------+--------------------+----------------------+--------------------+---------------------+---------------------+---------------------+
|count  |12085               |12085                 |12085               |12085                |12085                |12085                |
|mean   |0.006990473625180995|-0.0016755667182939693|0.005319318090569765|-0.009834625663188677|-0.007587318127886196|-0.008249612368145467|
|stddev |1.381797632041396   |1.1228193008371192    |1.0150365936036754  |0.9838047392914232   |0.9419796818084403   |0.8971511772848002   |
|min    |-3.6521552          |-8.73415              |-8.272692           |-2.186532            |-4.0134373           |-7.48336

In [86]:
# Example 6 - use pre-saved standardization model
odf = PCA_latentFeatures(spark, df, standardization=True,
                         standardization_configs={"pre_existing_model": True, "model_path": outputPath}, 
                         run_type=run_type, auth_key=auth_key,
                         print_impact=True)

Explained Variance:  0.9866
+-------+--------------------+----------------------+--------------------+---------------------+---------------------+---------------------+
|summary|latent_0            |latent_1              |latent_2            |latent_3             |latent_4             |latent_5             |
+-------+--------------------+----------------------+--------------------+---------------------+---------------------+---------------------+
|count  |12085               |12085                 |12085               |12085                |12085                |12085                |
|mean   |0.006990473625180995|-0.0016755667182939693|0.005319318090569765|-0.009834625663188677|-0.007587318127886196|-0.008249612368145467|
|stddev |1.381797632041396   |1.1228193008371192    |1.0150365936036754  |0.9838047392914232   |0.9419796818084403   |0.8971511772848002   |
|min    |-3.6521552          |-8.73415              |-8.272692           |-2.186532            |-4.0134373           |-7.48336

In [87]:
# Example 7 - impute missing values before calculation
odf = PCA_latentFeatures(spark, df, standardization=True, imputation=True, run_type=run_type, auth_key=auth_key, print_impact=True)

Explained Variance:  0.9635




+-------+--------------------+--------------------+--------------------+---------------------+---------------------+--------------------+
|summary|latent_0            |latent_1            |latent_2            |latent_3             |latent_4             |latent_5            |
+-------+--------------------+--------------------+--------------------+---------------------+---------------------+--------------------+
|count  |32561               |32561               |32561               |32561                |32561                |32561               |
|mean   |0.014843092981726182|-0.03754070391241653|0.004521131346880388|-0.010756595790759266|0.0016000469652099017|0.006853246225921223|
|stddev |1.1508209719334939  |1.0680661402803904  |1.0138910937403953  |0.9808178226286148   |0.9387913783440285   |0.8967908034058005  |
|min    |-8.268952           |-9.891339           |-7.9534183          |-4.725431            |-7.624943            |-7.0489144          |
|max    |5.2560806           |3.94

                                                                                

# Feature Transformation
- API specification of function **feature_transformation** can be found <a href="https://docs.anovos.ai/api/data_transformer/transformers.html">here</a>
- Supports numerical attributes only

In [88]:
from anovos.data_transformer.transformers import feature_transformation

In [89]:
# Example 1: sqrt 
odf = feature_transformation(idf=df, list_of_cols=['education-num', 'capital-gain', 'capital-loss', 'age'], 
                             method_type='sqrt', print_impact=True)

Before:
+-------+-----------------+------------------+------------------+------------------+
|summary|capital-loss     |education-num     |age               |capital-gain      |
+-------+-----------------+------------------+------------------+------------------+
|count  |32549            |32530             |32500             |32548             |
|mean   |87.3360164674798 |10.080971411005226|38.506492307692305|1077.6959567408135|
|stddev |403.0310072565714|2.5725103263986946|13.508497735339288|7386.624857802765 |
|min    |0                |1                 |17                |0                 |
|max    |4356             |16                |85                |99999             |
+-------+-----------------+------------------+------------------+------------------+

After:
+-------+------------------+-------------------+------------------+-----------------+
|summary|capital-loss      |education-num      |age               |capital-gain     |
+-------+------------------+-------------------

In [90]:
# Example 2: natural log (ln) + append generated columns
odf = feature_transformation(idf=df, list_of_cols=['education-num', 'capital-gain', 'capital-loss', 'age'], 
                             method_type='ln', output_mode='append', print_impact=True)

Before:
+-------+-----------------+------------------+------------------+------------------+
|summary|capital-loss     |education-num     |age               |capital-gain      |
+-------+-----------------+------------------+------------------+------------------+
|count  |32549            |32530             |32500             |32548             |
|mean   |87.3360164674798 |10.080971411005226|38.506492307692305|1077.6959567408135|
|stddev |403.0310072565714|2.5725103263986946|13.508497735339288|7386.624857802765 |
|min    |0                |1                 |17                |0                 |
|max    |4356             |16                |85                |99999             |
+-------+-----------------+------------------+------------------+------------------+

After:
+-------+------------------+------------------+-------------------+------------------+
|summary|capital-loss_ln   |education-num_ln  |age_ln             |capital-gain_ln   |
+-------+------------------+-----------------

In [91]:
# Example 3: round to 1 decimal place
odf = feature_transformation(idf=odf, 
                             list_of_cols=['education-num_ln', 'capital-gain_ln', 'capital-loss_ln', 'age_ln'], 
                             method_type='roundN', N=1, print_impact=True)

Before:
+-------+-------------------+------------------+------------------+------------------+
|summary|age_ln             |education-num_ln  |capital-gain_ln   |capital-loss_ln   |
+-------+-------------------+------------------+------------------+------------------+
|count  |32500              |32530             |2710              |1519              |
|mean   |3.5880271478056627 |2.2689316480632513|8.819883472603587 |7.508497766226039 |
|stddev |0.35895718528658827|0.3168442727686075|1.0158964531089263|0.2567566832369081|
|min    |2.833213344056216  |0.0               |4.736198448394496 |5.043425116919247 |
|max    |4.442651256490317  |2.772588722239781 |11.512915464920228|8.37930948405285  |
+-------+-------------------+------------------+------------------+------------------+

After:
+-------+-------------------+------------------+------------------+------------------+
|summary|age_ln             |education-num_ln  |capital-gain_ln   |capital-loss_ln   |
+-------+------------------
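
In the "Before" summary of Example 3, `capital-gain_ln` and `capital-loss_ln` have only 2710 and 1519 non-null values, although their source columns had more than 32,500. Both source columns have a minimum of 0 and the natural log is undefined for non-positive inputs, so those rows most likely became null. This is an inference from the counts, not something the notebook states. The sketch below reproduces the effect in plain Python with toy values; `safe_ln` is a hypothetical helper.

```python
import math


def safe_ln(x):
    """Natural log that yields None for missing or non-positive input."""
    if x is None or x <= 0:
        return None
    return math.log(x)


capital_gain = [0, 0, 2174, 0, 99999]  # toy values, mostly zero
logged = [safe_ln(v) for v in capital_gain]
non_null = [v for v in logged if v is not None]

print(len(capital_gain), len(non_null))  # 5 2
print(round(max(non_null), 6))           # 11.512915
```

When zeros carry meaning, a shifted transform such as `math.log1p(x)` keeps those rows, at the cost of changing the scale.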

In [92]:
# Example 4: square
odf = feature_transformation(idf=df, list_of_cols='age', method_type='sq', print_impact=True)

Before:
+-------+------------------+
|summary|age               |
+-------+------------------+
|count  |32500             |
|mean   |38.506492307692305|
|stddev |13.508497735339288|
|min    |17                |
|max    |85                |
+-------+------------------+

After:
+-------+------------------+
|summary|age               |
+-------+------------------+
|count  |32500             |
|mean   |1665.2238461538461|
|stddev |1154.2085383349036|
|min    |289.0             |
|max    |7225.0            |
+-------+------------------+
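
The `method_type` values used in this section correspond to elementary operations. The pure-Python sketch below mirrors the five methods shown here. The function `transform` is a hypothetical stand-in for illustration and covers only these cases, not the full list supported by `feature_transformation`.

```python
import math


def transform(x, method_type, N=None):
    """Apply one of the transformations demonstrated in this section."""
    if method_type == "sqrt":
        return math.sqrt(x)
    if method_type == "ln":
        return math.log(x)
    if method_type == "sq":
        return x ** 2
    if method_type == "roundN":
        # Python rounds exact ties to even; Spark's rounding mode may differ.
        return round(x, N)
    if method_type == "remainderDivByN":
        return x % N
    raise ValueError(f"unsupported method_type: {method_type}")


# Matches the min and max of `age` after squaring in Example 4.
print(transform(17, "sq"), transform(85, "sq"))  # 289 7225
print(transform(85, "remainderDivByN", N=10))    # 5
```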



In [93]:
# Example 5: remainder after division by 10
odf = feature_transformation(idf=df, list_of_cols=['education-num', 'capital-gain', 'capital-loss', 'age'], 
                             method_type='remainderDivByN', N=10, print_impact=True)

Before:
+-------+-----------------+------------------+------------------+------------------+
|summary|capital-loss     |education-num     |age               |capital-gain      |
+-------+-----------------+------------------+------------------+------------------+
|count  |32549            |32530             |32500             |32548             |
|mean   |87.3360164674798 |10.080971411005226|38.506492307692305|1077.6959567408135|
|stddev |403.0310072565714|2.5725103263986946|13.508497735339288|7386.624857802765 |
|min    |0                |1                 |17                |0                 |
|max    |4356             |16                |85                |99999             |
+-------+-----------------+------------------+------------------+------------------+

After:
+-------+------------------+------------------+-----------------+-------------------+
|summary|capital-loss      |education-num     |age              |capital-gain       |
+-------+------------------+------------------+

# Box Cox Transformation
- API specification of function **boxcox_transformation** can be found <a href="https://docs.anovos.ai/api/data_transformer/transformers.html">here</a>
- Supports numerical attributes only

In [94]:
from anovos.data_transformer.transformers import boxcox_transformation

In [95]:
# Example 1 - all numerical columns except drop_cols + print impact
odf = boxcox_transformation(df, drop_cols=['capital-loss', 'capital-gain'], print_impact=True)

Transformed Columns:  ['fnlwgt', 'age', 'education-num', 'hours-per-week', 'logfnl']
Best BoxCox Parameter(s):  [1, 0, 3, 3, 1]
Before:
+--------+------------------+------------------+-------------------+--------------------+-------------------+
|summary |fnlwgt            |age               |education-num      |hours-per-week      |logfnl             |
+--------+------------------+------------------+-------------------+--------------------+-------------------+
|count   |32546             |32500             |32530              |32452               |12168              |
|mean    |189781.83180728814|38.506492307692305|10.080971411005226 |40.24972266732405   |5.2054654851899365 |
|stddev  |105563.06445057005|13.508497735339288|2.5725103263986946 |11.914337669272227  |0.27424241727170395|
|min     |12285             |17                |1                  |1                   |4.283617786        |
|max     |1484705           |85                |16                 |94                  |6.088

In [96]:
# Example 2 - selected columns + existing lambda value + print impact
odf = boxcox_transformation(df, list_of_cols='age', boxcox_lambda=0, output_mode='append', print_impact=True)

Transformed Columns:  ['age']
Best BoxCox Parameter(s):  [0]
Before:
+--------+------------------+
|summary |age               |
+--------+------------------+
|count   |32500             |
|mean    |38.506492307692305|
|stddev  |13.508497735339288|
|min     |17                |
|max     |85                |
|skewness|0.5127993362812416|
+--------+------------------+

After:
+--------+--------------------+
|summary |age_bxcx_0          |
+--------+--------------------+
|count   |32500               |
|mean    |3.5880271478056627  |
|stddev  |0.35895718528658827 |
|min     |2.833213344056216   |
|max     |4.442651256490317   |
|skewness|-0.14607838263666648|
+--------+--------------------+
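
For reference, the textbook Box-Cox transform is `(x**lam - 1) / lam` for a nonzero parameter `lam` and `ln(x)` for `lam = 0`. The `boxcox_lambda=0` output above matches the log case: the minimum 2.8332 equals ln(17) and the maximum 4.4427 equals ln(85). The sketch below implements the textbook form. Whether ANOVOS applies exactly this expression for nonzero values is not confirmed by this notebook, and `boxcox` is a hypothetical helper.

```python
import math


def boxcox(x, lam):
    """Textbook Box-Cox transform of a strictly positive value."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive input")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1) / lam


print(round(boxcox(17, 0), 6))  # 2.833213
print(round(boxcox(85, 0), 6))  # 4.442651
print(boxcox(4, 0.5))           # 2.0
```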



# Outlier Categories Treatment
- API specification of function **outlier_categories** can be found <a href="https://docs.anovos.ai/api/data_transformer/transformers.html">here</a>
- Supports two ways of detecting outlier categories: by maximum number of categories and by coverage (%)

In [97]:
from anovos.data_transformer.transformers import outlier_categories

In [98]:
# Example 1 - 'all' columns (excluding drop_cols) + max 15 categories + append transformed columns at the end
odf = outlier_categories(spark, df, drop_cols=['ifa'], max_category=15, output_mode='append')
odf.limit(5).toPandas()

                                                                                

Unnamed: 0,ifa,age,workclass,fnlwgt,logfnl,education,education-num,marital-status,occupation,relationship,...,income,sex_outliered,income_outliered,marital-status_outliered,native-country_outliered,race_outliered,relationship_outliered,education_outliered,occupation_outliered,workclass_outliered
0,1a,,State-gov,77516.0,4.889391,Bachelors,13.0,Never-married,Adm-clerical,Not-in-family,...,<=50K,Male,<=50K,Never-married,others,White,Not-in-family,Bachelors,Adm-clerical,State-gov
1,2a,,Self-emp-not-inc,83311.0,4.920702,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,...,<=50K,Male,<=50K,Married-civ-spouse,others,White,Husband,Bachelors,Exec-managerial,Self-emp-not-inc
2,3a,38.0,Private,215646.0,5.333741,HS-grad,9.0,Divorced,Handlers-cleaners,Not-in-family,...,<=50K,Male,<=50K,Divorced,others,White,Not-in-family,HS-grad,Handlers-cleaners,Private
3,4a,53.0,Private,234721.0,5.370552,11th,7.0,Married-civ-spouse,Handlers-cleaners,Husband,...,<=50K,Male,<=50K,Married-civ-spouse,others,Black,Husband,11th,Handlers-cleaners,Private
4,5a,,Private,338409.0,5.529442,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Wife,...,<=50K,Female,<=50K,Married-civ-spouse,Cuba,Black,Wife,Bachelors,Prof-specialty,Private


In [99]:
# Example 2 - selected columns + max 10 categories
odf = outlier_categories(spark, df, list_of_cols=['education', 'occupation', 'native-country'], 
                         max_category=10, print_impact=True)

+--------------+-------------------+
|attribute     |uniqueValues_before|
+--------------+-------------------+
|education     |16                 |
|native-country|44                 |
|occupation    |15                 |
+--------------+-------------------+

+--------------+------------------+
|attribute     |uniqueValues_after|
+--------------+------------------+
|education     |10                |
|native-country|10                |
|occupation    |10                |
+--------------+------------------+
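
With `max_category=10`, every column above ends up with exactly 10 distinct values. This suggests that the most frequent categories are kept and the remainder is merged into a single bucket (labelled `others` in the Example 1 output) that itself counts toward the limit. The sketch below applies that rule with `collections.Counter`. The helper name, the toy data, and the tie-breaking are assumptions.

```python
from collections import Counter


def cap_categories(values, max_category, other_label="others"):
    """Keep the top (max_category - 1) categories and merge the rest."""
    counts = Counter(values)
    if len(counts) <= max_category:
        return list(values)  # already within the limit, nothing to merge
    keep = {cat for cat, _ in counts.most_common(max_category - 1)}
    return [v if v in keep else other_label for v in values]


values = ["a"] * 5 + ["b"] * 4 + ["c"] * 3 + ["d"] * 2 + ["e"]
capped = cap_categories(values, max_category=3)
print(sorted(set(capped)))  # ['a', 'b', 'others']
```

A column that already has no more than `max_category` distinct values is returned unchanged, which matches `occupation` staying at 15 when the limit is 15 in the later examples.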



In [100]:
# Example 3 - selected columns + cover 90% values
odf = outlier_categories(spark, df, list_of_cols=['education', 'occupation', 'native-country'], 
                         coverage=0.9, print_impact=True)

+--------------+-------------------+
|attribute     |uniqueValues_before|
+--------------+-------------------+
|education     |16                 |
|native-country|44                 |
|occupation    |15                 |
+--------------+-------------------+

+--------------+------------------+
|attribute     |uniqueValues_after|
+--------------+------------------+
|education     |9                 |
|native-country|3                 |
|occupation    |11                |
+--------------+------------------+
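
With `coverage=0.9`, the number of categories kept varies by column (9, 3, and 11 above). That is the expected pattern if categories are retained in order of frequency until they jointly cover at least 90% of the rows and the rest are merged into one bucket. The sketch below works under that assumption; `keep_by_coverage` and the toy data are hypothetical.

```python
from collections import Counter


def keep_by_coverage(values, coverage, other_label="others"):
    """Keep the most frequent categories until `coverage` of rows is reached."""
    counts = Counter(values)
    target = coverage * len(values)
    keep, covered = set(), 0
    for cat, n in counts.most_common():
        if covered >= target:
            break
        keep.add(cat)
        covered += n
    return [v if v in keep else other_label for v in values]


# One dominant category, similar in shape to a skewed country column.
values = ["US"] * 88 + ["MX"] * 4 + ["PH"] * 3 + ["DE"] * 3 + ["CA"] * 2
out = keep_by_coverage(values, 0.9)
print(sorted(set(out)))  # ['MX', 'US', 'others']
```

A heavily skewed column collapses to very few categories under this rule, while a column with evenly spread categories keeps most of them.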



In [101]:
# Example 4 - max 15 categories + save model
odf = outlier_categories(spark, df, drop_cols=['ifa'], max_category=15, 
                         pre_existing_model=False, model_path=outputPath, print_impact=True)

+--------------+-------------------+
|attribute     |uniqueValues_before|
+--------------+-------------------+
|sex           |3                  |
|income        |2                  |
|marital-status|7                  |
|native-country|44                 |
|race          |9                  |
|relationship  |8                  |
|education     |16                 |
|occupation    |15                 |
|workclass     |11                 |
+--------------+-------------------+

+--------------+------------------+
|attribute     |uniqueValues_after|
+--------------+------------------+
|sex           |3                 |
|income        |2                 |
|marital-status|7                 |
|native-country|15                |
|race          |9                 |
|relationship  |8                 |
|education     |15                |
|occupation    |15                |
|workclass     |11                |
+--------------+------------------+



In [102]:
# Example 5 - use pre-saved model
odf = outlier_categories(spark, df, drop_cols=['ifa'], pre_existing_model=True, 
                         model_path=outputPath, print_impact=True)

+--------------+-------------------+
|attribute     |uniqueValues_before|
+--------------+-------------------+
|sex           |3                  |
|income        |2                  |
|marital-status|7                  |
|native-country|44                 |
|race          |9                  |
|relationship  |8                  |
|education     |16                 |
|occupation    |15                 |
|workclass     |11                 |
+--------------+-------------------+

+--------------+------------------+
|attribute     |uniqueValues_after|
+--------------+------------------+
|sex           |3                 |
|income        |2                 |
|marital-status|7                 |
|native-country|15                |
|race          |9                 |
|relationship  |8                 |
|education     |15                |
|occupation    |15                |
|workclass     |10                |
+--------------+------------------+
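Examples 4 and 5 save the fitted categories and reuse them later. The idea can be sketched in plain Python as below. The file format (JSON), the helper names, and the `n_keep` parameter are assumptions made for illustration; the format ANOVOS actually writes to `model_path` is not shown in this notebook.

```python
import json
import os
import tempfile
from collections import Counter


def fit_kept_categories(rows, columns, n_keep):
    """Record, per column, the n_keep most frequent labels (hypothetical rule)."""
    model = {}
    for col in columns:
        counts = Counter(r[col] for r in rows if r.get(col) is not None)
        model[col] = [label for label, _ in counts.most_common(n_keep)]
    return model


def save_model(model, path):
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(model, fh)


def load_model(path):
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def apply_model(rows, model, other_label="others"):
    """Map any label not recorded in the model to the catch-all label."""
    treated = []
    for r in rows:
        treated.append({
            col: val if col not in model or val is None or val in model[col]
            else other_label
            for col, val in r.items()
        })
    return treated


# Toy training data, invented for illustration only
train = [{"workclass": w} for w in
         ["Private"] * 3 + ["State-gov"] * 2 + ["Never-worked"]]

path = os.path.join(tempfile.mkdtemp(), "outlier_categories.json")
save_model(fit_kept_categories(train, ["workclass"], n_keep=2), path)

# Later run: reuse the saved categories on new data without refitting
new_rows = [{"workclass": "Never-worked"},
            {"workclass": "Local-gov"},
            {"workclass": "Private"}]
print(apply_model(new_rows, load_model(path)))
```

Reusing a saved mapping keeps the category set stable between training and scoring data, and labels never seen at fit time fall into the catch-all group as well.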



# Expression Parser
- The API specification of the **expression_parser** function can be found <a href="https://docs.anovos.ai/api/data_transformer/transformers.html">here</a>

In [103]:
from anovos.data_transformer.transformers import expression_parser

In [104]:
# Example 1 - 2 generated columns + print impact
odf = expression_parser(df, ['age + hours-per-week', 'capital-gain-capital-loss'], print_impact=True)

Columns Added:  ['f0', 'f1']
+-------+------------------+------------------+
|summary|f0                |f1                |
+-------+------------------+------------------+
|count  |32392             |32548             |
|mean   |78.75373549024451 |990.3572569743149 |
|stddev |18.619824518135385|7410.3252594090245|
|min    |20                |-4356             |
|max    |158               |99999             |
+-------+------------------+------------------+



In [105]:
# Example 1b - 2 generated columns (sum and ratio) + print impact
odf = expression_parser(df, ['age + hours-per-week', 'capital-gain/capital-loss'], print_impact=True)

Columns Added:  ['f0', 'f1']
+-------+------------------+----+
|summary|f0                |f1  |
+-------+------------------+----+
|count  |32392             |1519|
|mean   |78.75373549024451 |0.0 |
|stddev |18.619824518135385|0.0 |
|min    |20                |0.0 |
|max    |158               |0.0 |
+-------+------------------+----+
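The `count` of 1519 for `f1` is much lower than the 32548 seen for the subtraction. This is consistent with Spark SQL in non-ANSI mode (the default in Spark 3.x) returning null when dividing by zero, combined with summary statistics that count only non-null values. The sketch below mimics that behaviour in plain Python; `safe_divide` is a hypothetical helper and the rows are made-up values, not taken from the dataset.

```python
def safe_divide(numerator, denominator):
    """Return None (null) for missing inputs or a zero denominator,
    mirroring non-ANSI Spark SQL division semantics."""
    if numerator is None or denominator is None or denominator == 0:
        return None
    return numerator / denominator


# (capital-gain, capital-loss) pairs, invented for illustration only
rows = [(2174.0, 0.0), (0.0, 0.0), (0.0, 1902.0), (None, 1500.0), (0.0, 2042.0)]

ratios = [safe_divide(gain, loss) for gain, loss in rows]
non_null = [r for r in ratios if r is not None]

# Only rows with a non-zero capital-loss contribute to the summary
print(len(non_null), non_null)
```

If a ratio feature is needed for every row, the zero-denominator case has to be handled explicitly in the expression or imputed afterwards.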



In [106]:
# Example 2 - 2 generated columns + customized postfix + print impact
odf = expression_parser(df, ['age + hours-per-week', 'capital-gain - capital-loss'], postfix="_new", print_impact=True)

Columns Added:  ['f0_new', 'f1_new']
+-------+------------------+------------------+
|summary|f0_new            |f1_new            |
+-------+------------------+------------------+
|count  |32392             |32548             |
|mean   |78.75373549024451 |990.3572569743149 |
|stddev |18.619824518135385|7410.3252594090245|
|min    |20                |-4356             |
|max    |158               |99999             |
+-------+------------------+------------------+
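Hyphenated column names make expressions such as `capital-gain-capital-loss` ambiguous, since `-` is also the subtraction operator. Judging by the identical summaries in Examples 1 and 2, the parser resolves it as `capital-gain` minus `capital-loss`. One way to do this, sketched below, is to wrap known column names in backticks (the identifier quoting that Spark SQL accepts), matching longer names first. The helpers `quote_columns` and `output_names` are hypothetical and are not ANOVOS functions; the `f<i><postfix>` naming simply follows the "Columns Added" output above.

```python
import re


def quote_columns(expression, columns):
    """Wrap every known column name in backticks, preferring longer names."""
    if not columns:
        return expression
    ordered = sorted(columns, key=len, reverse=True)
    pattern = re.compile(
        r"(?<![\w`])(" + "|".join(re.escape(c) for c in ordered) + r")(?![\w`])"
    )
    return pattern.sub(lambda m: "`" + m.group(1) + "`", expression)


def output_names(n_expressions, postfix=""):
    """Generate f0, f1, ... with an optional postfix."""
    return [f"f{i}{postfix}" for i in range(n_expressions)]


columns = ["age", "hours-per-week", "capital-gain", "capital-loss"]
expressions = ["age + hours-per-week", "capital-gain-capital-loss"]

for name, expr in zip(output_names(len(expressions), "_new"), expressions):
    print(name, "=", quote_columns(expr, columns))
```

Writing expressions with spaces around operators, as in Example 2, avoids relying on this kind of disambiguation.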

