# Data Exploration Module Test

# TODO:
    - head (/)
    - info (/)
    - matrix (missing values) (/)
    - bar (missing values) (/)
    - imports Jupyter Notebook (/)
    - fix plot resolution (/)

    - data dict (types, e.g. nominal, categorical)
    - box (numeric, deviation)
    - bar/mosaic/ (categorical, deviation)
    - predictor/feature correlation (heatmap/scatter)
    - histogram (skewed/deviation)


In [2]:
import idstools.data_explorer as idsde

In [3]:
test_data = "data/BikeRentalDaily_test.csv"
train_data = "data/BikeRentalDaily_train.csv"

In [8]:
data_explorer = idsde.DataExplorer(input_path=train_data, input_delimiter=";")

2024-02-03 14:55:21,307 [data_explorer] [INFO] - Initializing DataExplorer
2024-02-03 14:55:21,308 [data_explorer] [INFO] - No label provided.
2024-02-03 14:55:21,308 [data_explorer] [INFO] - Output path not provided.
Using default path: /home/davidrmn/Studies/introduction-data-science/results
2024-02-03 14:55:21,309 [data_explorer] [INFO] - Please provide a pipeline configuration.
2024-02-03 14:55:21,309 [_helpers] [INFO] - Reading data from:
/home/davidrmn/Studies/introduction-data-science/data/BikeRentalDaily_train.csv


In [9]:
data_explorer.descriptive_analysis()

In [10]:
data_explorer.head

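| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| instant | 154 | 685 | 368 | 472 | 442 |
| dteday | 03.06.2011 | 15.11.2012 | 03.01.2012 | 16.04.2012 | 17.03.2012 |
| season | 2.0 | 4.0 | 1.0 | 2.0 | 1.0 |
| yr | 0 | 1 | 1 | 1 | 1 |
| mnth | 6 | 11 | 1 | 4 | 3 |
| holiday | 0 | 0 | 0 | 1 | 0 |
| weekday | 5 | 4 | 2 | 1 | -1 |
| workingday | 1 | 1 | 1 | 0 | 0 |
| weathersit | 1 | 2 | 1 | 1 | 2 |
| temp | 24.8 | 12.87 | 6.0 | 26.57 | 20.57 |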


In [11]:
data_explorer.correlation_analysis()

![Correlation Heatmap](../results/BikeRentalDaily_train_correlation_heatmap.png)

# Data Preparation Module Test

In [12]:
import idstools.data_preparation as dp

In [None]:
test_data = "data/BikeRentalDaily_test.csv"
train_data = "data/BikeRentalDaily_train.csv"

In [None]:
data_preparation = dp.DataPreparation(input_path=train_data, output_path="results")


In [None]:
import pandas as pd

def get_wday_by_date(df, date_column, weekday_column):
    # pandas counts Monday=0 ... Sunday=6; shift so that Sunday=0 ... Saturday=6
    weekday_shift = {
        6: 0,
        0: 1,
        1: 2,
        2: 3,
        3: 4,
        4: 5,
        5: 6
    }

    # Convert the date column to datetime
    df[date_column] = pd.to_datetime(df[date_column], format="%d.%m.%Y")

    # Calculate the weekday and map it
    df[weekday_column] = df[date_column].dt.dayofweek.map(weekday_shift)

    return df
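
As a sanity check on the mapping above, the following standard-library sketch reproduces the same shift without pandas. It assumes that `datetime.weekday()` follows the same Monday=0 ... Sunday=6 convention as pandas' `dt.dayofweek`, and that the dataset counts Sunday as 0. The helper name `shifted_weekday` is hypothetical; the dates are the five `dteday` values from the `head` output.

```python
from datetime import datetime

def shifted_weekday(date_string, date_format="%d.%m.%Y"):
    """Return the weekday with Sunday=0 ... Saturday=6 for a date string."""
    # weekday() counts Monday=0 ... Sunday=6, so adding one and wrapping at 7
    # gives the same result as the weekday_shift dictionary.
    return (datetime.strptime(date_string, date_format).weekday() + 1) % 7

sample_dates = ["03.06.2011", "15.11.2012", "03.01.2012", "16.04.2012", "17.03.2012"]
recomputed = [shifted_weekday(d) for d in sample_dates]
print(recomputed)  # [5, 4, 2, 1, 6]
```

The first four values agree with the `weekday` row of the `head` output, and the fifth replaces the `-1` recorded there, which is the kind of entry this transformer is meant to repair.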

In [None]:
pipeline_config = {
    "_SimpleImputer": 
    [
        {
            "target": "hum",
            "config": {
                "strategy": "mean"
            }
        }
    ],
    "_CustomTransformer": 
    [
        {
            "func": get_wday_by_date,
            "config": {
                "date_column": "dteday",
                "weekday_column": "weekday"
            }
        }
    ]
}
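
To make the shape of this configuration easier to read, here is a minimal sketch of how a dictionary like it could be consumed. It is not the idstools implementation: `apply_steps`, `impute_mean`, and `double_column` are hypothetical names, and a plain dictionary of column lists stands in for the DataFrame so that the sketch runs without pandas.

```python
from statistics import mean

def impute_mean(data, target):
    """Replace None entries in data[target] with the mean of the present values."""
    present = [v for v in data[target] if v is not None]
    fill = mean(present)
    data[target] = [fill if v is None else v for v in data[target]]
    return data

def apply_steps(data, pipeline_config):
    """Walk the config in order and apply every configured step to the data."""
    for step_name, entries in pipeline_config.items():
        for entry in entries:
            if step_name == "_SimpleImputer":
                # Only the "mean" strategy is sketched here.
                if entry["config"]["strategy"] != "mean":
                    raise ValueError("unsupported strategy in this sketch")
                data = impute_mean(data, entry["target"])
            elif step_name == "_CustomTransformer":
                # The callable receives the data plus its configured keyword arguments.
                data = entry["func"](data, **entry["config"])
            else:
                raise KeyError(f"unknown step: {step_name}")
    return data

def double_column(data, column):
    """Hypothetical custom step used only for this demonstration."""
    data[column] = [v * 2 for v in data[column]]
    return data

demo_config = {
    "_SimpleImputer": [{"target": "hum", "config": {"strategy": "mean"}}],
    "_CustomTransformer": [{"func": double_column, "config": {"column": "cnt"}}],
}
demo_data = {"hum": [0.25, None, 0.75], "cnt": [1, 2, 3]}
result = apply_steps(demo_data, demo_config)
print(result)
```

The point of the layout is that each step name maps to a list of entries, so the same step type can be configured several times for different columns.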

In [None]:
pipeline = data_preparation.build_pipeline(config=pipeline_config)
pipeline

In [None]:
processed_data = data_preparation.run_pipeline(config=pipeline_config)

In [None]:
processed_data.head(5).T

In [None]:
processed_data.describe().T

In [None]:
from idstools._config import _idstools

In [None]:
_idstools["default"]["data_explorer"]["DataExplorer"]["input_path"]

In [None]:
_idstools.default.data_explorer.DataExplorer.input_path
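
The two cells above read the same setting, once with dictionary-style keys and once with attribute access. The sketch below illustrates how a nested mapping can support both styles. `AttrDict` is a hypothetical stand-in written for this example, not how the configuration object is actually implemented, and the path value is a placeholder.

```python
class AttrDict(dict):
    """Hypothetical dict subclass that also exposes keys as attributes."""

    def __getattr__(self, name):
        try:
            value = self[name]
        except KeyError:
            raise AttributeError(name) from None
        # Wrap nested dictionaries so attribute access can be chained.
        return AttrDict(value) if isinstance(value, dict) else value

settings = AttrDict(
    {"default": {"data_explorer": {"DataExplorer": {"input_path": "data/example.csv"}}}}
)

by_key = settings["default"]["data_explorer"]["DataExplorer"]["input_path"]
by_attribute = settings.default.data_explorer.DataExplorer.input_path
print(by_key == by_attribute)
```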

## Module Configuration

In [None]:
from idstools.data_explorer import DataExplorer
from idstools._config import _idstools, pprint_dynaconf

There are two ways to configure the DataExplorer for analyzing the BikeRentalDaily_train.csv data:

- Load the default set of parameters and adjust them to our needs. In this case, all available parameters are initialized and can be set according to the exploration steps to be performed.

- Initialize the class with a configuration defined directly in a cell.
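
The two options can be sketched as follows. `ExplorerStandIn` and `DEFAULTS` are hypothetical placeholders used so the example runs on its own; the real defaults come from the package configuration, and the real class may accept more parameters than the two shown here.

```python
import copy

class ExplorerStandIn:
    """Hypothetical stand-in for a class configured through keyword arguments."""

    def __init__(self, input_path, input_delimiter=","):
        self.input_path = input_path
        self.input_delimiter = input_delimiter

# Assumed default parameter set for this sketch only.
DEFAULTS = {"input_path": None, "input_delimiter": ","}

# Option 1: copy the defaults, then override only what this run needs.
adjusted = copy.deepcopy(DEFAULTS)
adjusted["input_path"] = "data/BikeRentalDaily_train.csv"
adjusted["input_delimiter"] = ";"
from_defaults = ExplorerStandIn(**adjusted)

# Option 2: define the whole configuration directly in the cell.
inline = ExplorerStandIn(input_path="data/BikeRentalDaily_train.csv", input_delimiter=";")

print(vars(from_defaults) == vars(inline))
```

Copying the defaults before editing keeps the shared default set untouched, which matters when several cells reuse it.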

In [None]:
pprint_dynaconf(_idstools, notebook=True)

In [None]:
config = _idstools.default.data_explorer.DataExplorer

In [None]:
pprint_dynaconf(config, notebook=True)

In [None]:
pprint_dynaconf(_idstools.custom.data_explorer.DataExplorer, notebook=True)

In [None]:
data_explorer_config = config

In [None]:
data_explorer_config.input_path = "/home/davidrmn/Studies/introduction-data-science/data/BikeRentalDaily_train.csv"

In [None]:
my_data_explorer = DataExplorer(**data_explorer_config)

In [None]:
result = my_data_explorer.descriptive_analysis()

## Another example of a custom transformer

Start by importing the configuration. As always, you can provide your own configuration or load and edit the default one to see which parameters are available.

In [None]:
from idstools._config import _idstools, pprint_dynaconf

Load the transformer config of the _CustomTransformer defined in the custom pipeline as an example.

In [None]:
transformer_config = _idstools.custom.data_preparation.DataPreparation.pipeline._CustomTransformer

Now show what is configured to execute in this step of the data preparation. As you can see, it references a function that is part of the _custom_piplines module of the idstools package.

In [None]:
pprint_dynaconf(transformer_config, notebook=True)

Here you can see how the function is implemented in the module. 

IMPORTANT: Each function for the _CustomTransformer takes the DataFrame with which the DataPreparation class of the data_preparation module was initialized and applies its operation to it, based on the arguments provided. In this case there is only one argument, "target", which references a column in the DataFrame; all dots "." in that column's values are replaced with hyphens "-".
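
To illustrate the contract described above, here is a minimal sketch of such a function. It is not the package's implementation: the name `replace_dots_with_hyphens` is hypothetical, and a dictionary of column lists stands in for the DataFrame so the sketch has no dependencies.

```python
def replace_dots_with_hyphens(data, target):
    """Replace every "." with "-" in the string values of data[target]."""
    data[target] = [value.replace(".", "-") for value in data[target]]
    return data

# Two records taken from the head output shown earlier.
table = {"dteday": ["03.06.2011", "15.11.2012"], "instant": [154, 685]}
converted = replace_dots_with_hyphens(table, target="dteday")
print(converted["dteday"])  # ['03-06-2011', '15-11-2012']
```

The function receives the data as its first argument and the configured keyword arguments after it, and it returns the modified data so the next step can continue from there.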

In [None]:
!cat /home/davidrmn/Studies/introduction-data-science/src/idstools/_custom_transformer.py

Now we can again show the whole DataPreparation config:

In [None]:
pprint_dynaconf(_idstools.custom.data_preparation.DataPreparation, notebook=True)

I don't want to drop instant, so I set the pipeline to execute only the _CustomTransformer:

In [None]:
config = _idstools.custom.data_preparation.DataPreparation

In [None]:
config.pipeline = {"_CustomTransformer" : transformer_config}

In [None]:
pprint_dynaconf(config, notebook=True)

In [None]:
from idstools.data_preparation import DataPreparation
my_data_preparation = DataPreparation(**config) 

As you can see, it almost worked! Because I am in Jupyter, the code resolves to the wrong path; this still needs to be fixed. As a workaround, we can set the path as an absolute path.

At the moment, the path is resolved relative to the package root. Since the notebook that executes the code is not located there, we need to adjust the path.
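
The effect can be reproduced with a small `pathlib` sketch. The helper `resolve_against` and the directories `/project` and `/project/notebooks` are hypothetical; they only stand in for the package root and the notebook folder.

```python
from pathlib import PurePosixPath

def resolve_against(base_dir, path):
    """Return path unchanged if absolute, otherwise interpret it relative to base_dir."""
    path = PurePosixPath(path)
    return path if path.is_absolute() else PurePosixPath(base_dir) / path

relative = "data/BikeRentalDaily_train.csv"
# The same relative path points to different files depending on the base directory.
from_package_root = resolve_against("/project", relative)
from_notebook_dir = resolve_against("/project/notebooks", relative)
# An absolute path is independent of where the code runs.
already_absolute = resolve_against(
    "/project/notebooks", "/project/data/BikeRentalDaily_train.csv"
)

print(from_package_root)
print(from_notebook_dir)
print(already_absolute)
```

This is why setting an absolute `input_path` works as a workaround: the result no longer depends on the directory the notebook is started from.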

In [None]:
config.input_path

In [None]:
config.input_path = "/home/davidrmn/Studies/introduction-data-science/data/BikeRentalDaily_train.csv"

Now let's try again.

In [None]:
my_data_preparation = DataPreparation(**config) 

It worked, so let's run the preconfigured pipeline of the DataPreparation instance we just created.

In [None]:
my_data_preparation.build_pipeline(config.pipeline)

In [None]:
my_data_preparation.run_pipeline(config.pipeline)

The two steps above can also be performed by the run method of the class. By the way, each class of the package has a run method to orchestrate the class.

In [None]:
my_data_preparation.run()

As you can see, this also automatically writes the results to the output_path instead of returning them. Of course, you can also save them manually after build and run ;)
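
The orchestration pattern can be sketched as below. `PreparationStandIn` and `hyphenate_date` are hypothetical and only mirror the method names used in this notebook; the sketch writes to an in-memory buffer instead of an output path, and the real class may behave differently in detail.

```python
import csv
import io

class PreparationStandIn:
    """Hypothetical class whose run() chains build, execute, and write steps."""

    def __init__(self, rows, steps, output):
        self.rows = rows        # list of dictionaries, one per record
        self.steps = steps      # list of callables applied in order
        self.output = output    # any writable text stream
        self.pipeline = None

    def build_pipeline(self):
        self.pipeline = list(self.steps)
        return self.pipeline

    def run_pipeline(self):
        for step in self.pipeline:
            self.rows = [step(dict(row)) for row in self.rows]
        return self.rows

    def write_data(self):
        writer = csv.DictWriter(self.output, fieldnames=list(self.rows[0]), delimiter=";")
        writer.writeheader()
        writer.writerows(self.rows)

    def run(self):
        # One call performs all three steps and writes instead of returning.
        self.build_pipeline()
        self.run_pipeline()
        self.write_data()

def hyphenate_date(row):
    row["dteday"] = row["dteday"].replace(".", "-")
    return row

buffer = io.StringIO()
PreparationStandIn([{"instant": 154, "dteday": "03.06.2011"}], [hyphenate_date], buffer).run()
print(buffer.getvalue())
```

Keeping the three steps as separate public methods is what allows both styles shown in this notebook: calling them one by one to inspect intermediate results, or calling `run` once.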

In [None]:
my_data_preparation.write_data()

This project is still in development, but imagine what can be automated and built with it :)

Cheers