RDT Quickstart
==========

This short tutorial walks you through the steps needed to get started using RDT to transform columns, tables and datasets.

# Load the demo data

In [9]:
from rdt import get_demo

customers = get_demo()

customers.head()

Unnamed: 0,last_login,email_optin,credit_card,age,dollars_spent
0,2021-06-26,False,VISA,29,99.99
1,2021-02-10,False,VISA,18,
2,NaT,False,AMEX,21,2.5
3,2020-09-26,True,,45,25.0
4,2020-12-22,,DISCOVER,32,19.99


This dataset contains randomly generated values that describe the customers of an online marketplace.

Let's transform this data so that every column is converted to complete, numerical data ready for data science.


# Creating the HyperTransformer & config

The `HyperTransformer` is capable of transforming multi-column datasets.

In [10]:
from rdt import HyperTransformer

ht = HyperTransformer()

The `HyperTransformer` needs to know about the columns in your dataset and which transformers to apply to each. These are described by a config. We can ask the HyperTransformer to automatically detect it based on the data we plan to use.

In [23]:
config = ht.detect_initial_config(data=customers)


Detecting a new config from the data ... SUCCESS
Setting the new config ... SUCCESS
Config:
{
    "sdtypes": {
        "last_login": "datetime",
        "email_optin": "boolean",
        "credit_card": "categorical",
        "age": "numerical",
        "dollars_spent": "numerical"
    },
    "transformers": {
        "last_login": UnixTimestampEncoder(missing_value_replacement='mean'),
        "email_optin": BinaryEncoder(missing_value_replacement='mode'),
        "credit_card": FrequencyEncoder(),
        "age": FloatFormatter(missing_value_replacement='mean'),
        "dollars_spent": FloatFormatter(missing_value_replacement='mean')
    }
}


The `sdtypes` dictionary describes the semantic data types of each of your columns and the `transformers` dictionary describes which transformer to use for each column.
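
Since the two dictionaries are keyed by the same column names, a quick consistency check can catch a column that has an sdtype but no transformer, or the reverse. The sketch below uses a plain-dict stand-in for the config, with transformer names written as strings so that it runs without RDT installed. The helper `unmatched_columns` is hypothetical and not part of RDT.

```python
# Plain-dict stand-in for the config printed above. Transformer names are
# strings here (an assumption for illustration), not real RDT objects.
config = {
    "sdtypes": {
        "last_login": "datetime",
        "email_optin": "boolean",
        "credit_card": "categorical",
        "age": "numerical",
        "dollars_spent": "numerical",
    },
    "transformers": {
        "last_login": "UnixTimestampEncoder",
        "email_optin": "BinaryEncoder",
        "credit_card": "FrequencyEncoder",
        "age": "FloatFormatter",
        "dollars_spent": "FloatFormatter",
    },
}


def unmatched_columns(config):
    """Return the columns that appear in only one of the two dictionaries."""
    sdtype_columns = set(config["sdtypes"])
    transformer_columns = set(config["transformers"])
    # Symmetric difference: columns present in exactly one of the two sets.
    return sorted(sdtype_columns ^ transformer_columns)


missing = unmatched_columns(config)
print(missing)  # []
```

An empty list means both dictionaries cover exactly the same columns.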

# Fitting & using the HyperTransformer

The `HyperTransformer` references the config while learning the data during the fit stage.

In [13]:
ht.fit(customers)

Once the transformer is fit, it's ready to use. Use the `transform` method to transform all columns of your dataset at once.

In [15]:
transformed_data = ht.transform(customers)

transformed_data.head()

Unnamed: 0,last_login.value,email_optin.value,credit_card.value,age.value,dollars_spent.value
0,1.624666e+18,0.0,0.2,29.0,99.99
1,1.612915e+18,0.0,0.2,18.0,36.87
2,1.611814e+18,0.0,0.5,21.0,2.5
3,1.601078e+18,1.0,0.7,45.0,25.0
4,1.608595e+18,0.0,0.9,32.0,19.99


The `HyperTransformer` applied the assigned transformer to each individual column. Each column now contains fully numerical data that you can use for your project!
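
The filled-in values in the table above can be checked by hand. The sketch below assumes that `missing_value_replacement='mean'` fills a gap with the mean of the observed values and that `'mode'` fills it with the most frequent value, and it applies both to the five demo rows in plain Python. The helper `fill_missing` is hypothetical and only illustrates the arithmetic; it is not RDT's implementation.

```python
from statistics import mean, mode

# The five demo rows, with None standing in for a missing value.
dollars_spent = [99.99, None, 2.5, 25.0, 19.99]
email_optin = [False, False, False, True, None]


def fill_missing(values, strategy):
    """Replace None with the mean or the mode of the observed values."""
    observed = [v for v in values if v is not None]
    replacement = mean(observed) if strategy == "mean" else mode(observed)
    return [replacement if v is None else v for v in values]


# Round to two decimals to match the precision shown in the table.
filled_dollars = [round(v, 2) for v in fill_missing(dollars_spent, "mean")]
# Booleans become 0.0 / 1.0 once the gap has been filled.
filled_optin = [float(v) for v in fill_missing(email_optin, "mode")]

print(filled_dollars)  # [99.99, 36.87, 2.5, 25.0, 19.99]
print(filled_optin)    # [0.0, 0.0, 0.0, 1.0, 0.0]
```

Both lists agree with the `dollars_spent.value` and `email_optin.value` columns shown above.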

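When you're done with your project, you can also transform the data back to the original format using the `reverse_transform` method. Because missing values were replaced during the transform, their original positions are not stored in the transformed data, so the reversed data may show missing values in different rows than the original.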

In [16]:
original_format_data = ht.reverse_transform(transformed_data)

original_format_data.head()

Unnamed: 0,last_login,email_optin,credit_card,age,dollars_spent
0,2021-06-26,,VISA,29,99.99
1,NaT,False,VISA,18,36.87
2,2021-01-28,False,AMEX,21,2.5
3,2020-09-26,True,,45,25.0
4,2020-12-22,False,DISCOVER,32,19.99


# Transforming a single column

It is also possible to transform a single column of a `pandas.DataFrame`. To do so, follow the steps below.

## Load the transformer

In this example we will use the datetime column, so let's load a `UnixTimestampEncoder`.

In [17]:
from rdt.transformers import UnixTimestampEncoder

transformer = UnixTimestampEncoder()

## Fit the Transformer

Before being able to transform the data, we need the transformer to learn from it.

We will do this by calling its `fit` method, passing the data and the name of the column that we want to transform.



In [18]:
transformer.fit(customers, column='last_login')

## Transform the data

Once the transformer is fitted, we can pass the data again to its `transform` method in order to get the transformed version of the data.

In [19]:
transformed = transformer.transform(customers)

The output will be a `pandas.DataFrame` similar to the input data, except with the original datetime column replaced with `last_login.value`.

In [20]:
transformed.head()

Unnamed: 0,email_optin,credit_card,age,dollars_spent,last_login.value
0,False,VISA,29,99.99,1.624666e+18
1,False,VISA,18,,1.612915e+18
2,False,AMEX,21,2.5,
3,True,,45,25.0,1.601078e+18
4,,DISCOVER,32,19.99,1.608595e+18
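
The values in `last_login.value` are consistent with each date being expressed as nanoseconds elapsed since 1970-01-01 (the Unix epoch). The sketch below reproduces that arithmetic with the standard library only, assuming each date is taken at midnight UTC. The helper `to_unix_nanoseconds` is hypothetical and is not part of RDT.

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_unix_nanoseconds(date_string):
    """Convert a YYYY-MM-DD date (assumed midnight UTC) to nanoseconds since the epoch."""
    moment = datetime.strptime(date_string, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    delta = moment - EPOCH
    # Integer arithmetic avoids floating-point rounding at this magnitude.
    return (delta.days * 86_400 + delta.seconds) * 1_000_000_000


encoded = to_unix_nanoseconds("2021-06-26")
print(encoded)           # 1624665600000000000
print(f"{encoded:.6e}")  # 1.624666e+18
```

The second line of output matches the first row of the `last_login.value` column above.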


## Revert the column transformation

In order to revert the previous transformation, the transformed data can be passed to the `reverse_transform` method of the transformer:

In [21]:
reversed_data = transformer.reverse_transform(transformed)

The output will be a `pandas.DataFrame` containing the reverted values, which should be exactly like the original ones, except for the order of the columns.



In [22]:
reversed_data.head()

Unnamed: 0,email_optin,credit_card,age,dollars_spent,last_login
0,False,VISA,29,99.99,2021-06-26
1,False,VISA,18,,2021-02-10
2,False,AMEX,21,2.5,NaT
3,True,,45,25.0,2020-09-26
4,,DISCOVER,32,19.99,2020-12-22
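
Conceptually, reverting a timestamp column means converting nanoseconds since the Unix epoch back into dates while leaving missing entries missing. The following sketch shows that conversion with the standard library only, assuming UTC and using `None` as a stand-in for `NaT`. The helper `from_unix_nanoseconds` is hypothetical and is not RDT's implementation.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def from_unix_nanoseconds(nanoseconds):
    """Convert nanoseconds since the epoch to a YYYY-MM-DD string (UTC)."""
    if nanoseconds is None:
        return None  # a missing timestamp stays missing
    # timedelta has microsecond resolution, so drop the last three digits.
    moment = EPOCH + timedelta(microseconds=nanoseconds // 1_000)
    return moment.strftime("%Y-%m-%d")


timestamps = [1624665600000000000, None, 1601078400000000000]
decoded = [from_unix_nanoseconds(value) for value in timestamps]
print(decoded)  # ['2021-06-26', None, '2020-09-26']
```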


# Changing the HyperTransformer's config

We first retrieve the default config.

In [28]:
config = ht.get_config()

config

{
    "sdtypes": {
        "last_login": "datetime",
        "email_optin": "boolean",
        "credit_card": "categorical",
        "age": "numerical",
        "dollars_spent": "numerical"
    },
    "transformers": {
        "last_login": UnixTimestampEncoder(missing_value_replacement='mean'),
        "email_optin": BinaryEncoder(missing_value_replacement='mode'),
        "credit_card": FrequencyEncoder(),
        "age": FloatFormatter(missing_value_replacement='mean'),
        "dollars_spent": FloatFormatter(missing_value_replacement='mean')
    }
}

## Change the `credit_card` encoder to a one-hot encoder


In [30]:
from rdt.transformers import OneHotEncoder

config["transformers"]["credit_card"] = OneHotEncoder()

ht.set_config(config)

We then refit the `HyperTransformer`.


In [34]:
ht.fit(customers)

transformed_data_one_hot = ht.transform(customers)

transformed_data.head()


Unnamed: 0,last_login.value,email_optin.value,credit_card.value,age.value,dollars_spent.value
0,1.624666e+18,0.0,0.2,29.0,99.99
1,1.612915e+18,0.0,0.2,18.0,36.87
2,1.611814e+18,0.0,0.5,21.0,2.5
3,1.601078e+18,1.0,0.7,45.0,25.0
4,1.608595e+18,0.0,0.9,32.0,19.99
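
As a rough illustration of how a frequency-based encoding can produce values like those in the `credit_card.value` column above, the sketch below gives each category a slice of the interval [0, 1] proportional to how often it appears, and represents the category by the midpoint of its slice. The helper `frequency_encode` is hypothetical and written in plain Python. It reproduces the five demo rows but is not taken from RDT's source, whose ordering and tie-breaking rules may differ.

```python
from collections import Counter


def frequency_encode(values):
    """Map each category to the midpoint of an interval sized by its frequency."""
    counts = Counter(values)
    total = len(values)
    # Most frequent first; ties keep first-appearance order (sorted is stable).
    ordered = sorted(counts, key=lambda category: -counts[category])
    midpoints = {}
    start = 0.0
    for category in ordered:
        width = counts[category] / total
        # Rounding hides floating-point noise in the running sum.
        midpoints[category] = round(start + width / 2, 4)
        start += width
    return [midpoints[value] for value in values]


# None stands in for the missing credit card value in row 3.
credit_card = ["VISA", "VISA", "AMEX", None, "DISCOVER"]
encoded = frequency_encode(credit_card)
print(encoded)  # [0.2, 0.2, 0.5, 0.7, 0.9]
```

Here `VISA` covers two of the five rows, so it owns the interval [0, 0.4] and is encoded as 0.2. Each remaining category owns a slice of width 0.2.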


In [35]:
transformed_data_one_hot.head()

Unnamed: 0,last_login.value,email_optin.value,credit_card.value3,credit_card.value2,credit_card.value1,credit_card.value0,age.value,dollars_spent.value
0,1.624666e+18,0.0,0,0,0,1,29.0,99.99
1,1.612915e+18,0.0,0,0,0,1,18.0,36.87
2,1.611814e+18,0.0,0,0,1,0,21.0,2.5
3,1.601078e+18,1.0,1,0,0,0,45.0,25.0
4,1.608595e+18,0.0,0,1,0,0,32.0,19.99
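
One-hot encoding replaces a single categorical column with one 0/1 indicator column per category. The sketch below shows the idea in plain Python, treating a missing value as its own category and placing it last, which is how the table above appears to be laid out (`value0` for `VISA` through `value3` for missing, with the columns printed in reverse order). The helper `one_hot_encode` is hypothetical, and RDT's actual column naming and ordering may differ.

```python
def one_hot_encode(values):
    """Return the category order and one 0/1 indicator row per input value."""
    # Non-missing categories in order of first appearance.
    categories = [c for c in dict.fromkeys(values) if c is not None]
    if None in values:
        categories.append(None)  # the missing value gets its own indicator
    rows = [
        [1 if value == category else 0 for category in categories]
        for value in values
    ]
    return categories, rows


credit_card = ["VISA", "VISA", "AMEX", None, "DISCOVER"]
categories, rows = one_hot_encode(credit_card)
print(categories)  # ['VISA', 'AMEX', 'DISCOVER', None]
for row in rows:
    print(row)
# [1, 0, 0, 0]
# [1, 0, 0, 0]
# [0, 1, 0, 0]
# [0, 0, 0, 1]
# [0, 0, 1, 0]
```

Every row contains exactly one 1, so the original category can be recovered by finding the position of that 1.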


Let's check the column types and the reversibility of the operation.

In [36]:
reverse_transform = ht.reverse_transform(transformed_data_one_hot)
reverse_transform.head()

Unnamed: 0,last_login,email_optin,credit_card,age,dollars_spent
0,NaT,,VISA,29,99.99
1,2021-02-10,False,VISA,18,36.87
2,2021-01-28,,AMEX,21,
3,NaT,True,,45,
4,NaT,False,DISCOVER,32,


In [37]:
transformed_data_one_hot.dtypes

last_login.value       float64
email_optin.value      float64
credit_card.value3       int64
credit_card.value2       int64
credit_card.value1       int64
credit_card.value0       int64
age.value              float64
dollars_spent.value    float64
dtype: object