# Data Exploration Module Test

# TODO:
    - head (/)
    - info (/)
    - matrix (missing values) (/)
    - bar (missing values) (/)
    - imports Jupyter Notebook (/)
    - fix plot resolution (/)
    - box (numeric, deviation) (/)
    - histogram (skewed/deviation) (/)
    - predictor/feature correlation (heatmap/scatter) (/)

    - data dict (types, e.g. nominal, categorical)
    - bar/mosaic (categorical, deviation)
    -> These require some mapping based on domain knowledge.


In [1]:
import idstools.data_explorer as idsde

In [2]:
test_data = "data/BikeRentalDaily_test.csv"
train_data = "data/BikeRentalDaily_train.csv"

In [3]:
data_explorer = idsde.DataExplorer(input_path=train_data, input_delimiter=";")

2024-02-03 16:22:12,175 [data_explorer] [INFO] - Initializing DataExplorer
2024-02-03 16:22:12,176 [data_explorer] [INFO] - No label provided.
2024-02-03 16:22:12,177 [data_explorer] [INFO] - Output path not provided.
Using default path: /home/davidrmn/Studies/introduction-data-science/results
2024-02-03 16:22:12,177 [data_explorer] [INFO] - Please provide a pipeline configuration.
2024-02-03 16:22:12,178 [_helpers] [INFO] - Reading data from:
/home/davidrmn/Studies/introduction-data-science/data/BikeRentalDaily_train.csv


In [4]:
data_explorer.descriptive_analysis()

In [5]:
data_explorer.head

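| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| instant | 154 | 685 | 368 | 472 | 442 |
| dteday | 03.06.2011 | 15.11.2012 | 03.01.2012 | 16.04.2012 | 17.03.2012 |
| season | 2.0 | 4.0 | 1.0 | 2.0 | 1.0 |
| yr | 0 | 1 | 1 | 1 | 1 |
| mnth | 6 | 11 | 1 | 4 | 3 |
| holiday | 0 | 0 | 0 | 1 | 0 |
| weekday | 5 | 4 | 2 | 1 | -1 |
| workingday | 1 | 1 | 1 | 0 | 0 |
| weathersit | 1 | 2 | 1 | 1 | 2 |
| temp | 24.8 | 12.87 | 6.0 | 26.57 | 20.57 |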


In [6]:
data_explorer.correlation_analysis()

![CorrelationAnalysis](../results/BikeRentalDaily_train_correlation_heatmap.png)

In [7]:
data_explorer.distribution_analysis()

Distribution Plots: 100%|██████████| 17/17 [00:00<00:00, 11783.70it/s]


![DistributionAnalysis](../results/BikeRentalDaily_train_distribution_plots.png)

# Data Preparation Module Test

In [8]:
import idstools.data_preparation as dp

In [9]:
test_data = "data/BikeRentalDaily_test.csv"
train_data = "data/BikeRentalDaily_train.csv"

In [10]:
data_preparation = dp.DataPreparation(input_path=train_data, input_delimiter=";", output_path="results")


2024-02-03 16:22:15,829 [data_preparation] [INFO] - Initializing DataPreparation
2024-02-03 16:22:15,830 [data_preparation] [INFO] - Using output path: /home/davidrmn/Studies/introduction-data-science/results
2024-02-03 16:22:15,830 [data_preparation] [INFO] - Please provide a pipeline configuration.
2024-02-03 16:22:15,831 [_helpers] [INFO] - Reading data from:
/home/davidrmn/Studies/introduction-data-science/data/BikeRentalDaily_train.csv


In [11]:
import pandas as pd

def get_wday_by_date(df, date_column, weekday_column):
    # Map pandas' Monday-based dayofweek (Monday=0 ... Sunday=6)
    # to a Sunday-based index (Sunday=0 ... Saturday=6)
    weekday_shift = {
        6: 0,
        0: 1,
        1: 2,
        2: 3,
        3: 4,
        4: 5,
        5: 6
    }

    # Convert the date column to datetime
    df[date_column] = pd.to_datetime(df[date_column], format="%d.%m.%Y")

    # Calculate the weekday and map it
    df[weekday_column] = df[date_column].dt.dayofweek.map(weekday_shift)

    return df
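
The `weekday_shift` dictionary above is equivalent to `(dayofweek + 1) % 7`. As a quick sanity check, here is a minimal standard-library sketch; `sunday_based_weekday` is a hypothetical helper for illustration and is not part of idstools:

```python
from datetime import datetime


def sunday_based_weekday(date_string: str, fmt: str = "%d.%m.%Y") -> int:
    """Return the weekday of a date string with Sunday=0 ... Saturday=6."""
    # datetime.weekday() is Monday-based (Monday=0 ... Sunday=6), like pandas' dayofweek
    monday_based = datetime.strptime(date_string, fmt).weekday()
    return (monday_based + 1) % 7


# Two dates taken from the first rows of the training data
for date_string in ("03.06.2011", "17.03.2012"):
    print(date_string, sunday_based_weekday(date_string))
```

The first date is a Friday and the second a Saturday, so the helper yields 5 and 6, which is what the processed `weekday` column shows for these rows further down.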

In [12]:
pipeline_config = {
    "_SimpleImputer": 
    [
        {
            "target": "hum",
            "config": {
                "strategy": "mean"
            }
        }
    ],
    "_CustomTransformer": 
    [
        {
            "func": get_wday_by_date,
            "config": {
                "date_column": "dteday",
                "weekday_column": "weekday"
            }
        }
    ]
}
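
To make the structure of `pipeline_config` concrete: each top-level key names a step type, and each list entry configures one application of that step. The following is only a sketch of how such a dictionary could be consumed. `apply_steps` and `add_one` are hypothetical helpers, the imputation uses plain pandas `fillna` as a stand-in, and none of this is meant to describe how idstools implements `build_pipeline` or `run_pipeline`.

```python
import pandas as pd


def apply_steps(df: pd.DataFrame, config: dict) -> pd.DataFrame:
    """Hypothetical stand-in for a config-driven pipeline runner."""
    df = df.copy()  # leave the caller's DataFrame untouched
    # Imputation steps: fill missing values of the target column
    for step in config.get("_SimpleImputer", []):
        column = step["target"]
        if step["config"]["strategy"] == "mean":
            df[column] = df[column].fillna(df[column].mean())
    # Custom steps: call the function with the DataFrame and its keyword arguments
    for step in config.get("_CustomTransformer", []):
        df = step["func"](df, **step["config"])
    return df


def add_one(df: pd.DataFrame, target: str) -> pd.DataFrame:
    """Toy custom function following the (df, **config) -> df convention."""
    df[target] = df[target] + 1
    return df


toy = pd.DataFrame({"hum": [60.0, None, 80.0]})
toy_config = {
    "_SimpleImputer": [{"target": "hum", "config": {"strategy": "mean"}}],
    "_CustomTransformer": [{"func": add_one, "config": {"target": "hum"}}],
}
result = apply_steps(toy, toy_config)
print(result["hum"].tolist())
```

The missing value is first replaced by the column mean, then the custom function is applied to the whole column.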

In [13]:
pipeline = data_preparation.build_pipeline(config=pipeline_config)
pipeline

2024-02-03 16:22:15,852 [data_preparation] [INFO] - Pipeline created.


In [14]:
processed_data = data_preparation.run_pipeline(config=pipeline_config)

2024-02-03 16:22:15,867 [data_preparation] [INFO] - Pipeline step _SimpleImputer has been processed.
2024-02-03 16:22:15,872 [data_preparation] [INFO] - Pipeline step _CustomTransformer has been processed.


In [15]:
data_preparation.data.head(5).T

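| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| instant | 154 | 685 | 368 | 472 | 442 |
| dteday | 03.06.2011 | 15.11.2012 | 03.01.2012 | 16.04.2012 | 17.03.2012 |
| season | 2.0 | 4.0 | 1.0 | 2.0 | 1.0 |
| yr | 0 | 1 | 1 | 1 | 1 |
| mnth | 6 | 11 | 1 | 4 | 3 |
| holiday | 0 | 0 | 0 | 1 | 0 |
| weekday | 5 | 4 | 2 | 1 | -1 |
| workingday | 1 | 1 | 1 | 0 | 0 |
| weathersit | 1 | 2 | 1 | 1 | 2 |
| temp | 24.8 | 12.87 | 6.0 | 26.57 | 20.57 |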


In [16]:
processed_data.head(5).T

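| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| instant | 154 | 685 | 368 | 472 | 442 |
| dteday | 2011-06-03 00:00:00 | 2012-11-15 00:00:00 | 2012-01-03 00:00:00 | 2012-04-16 00:00:00 | 2012-03-17 00:00:00 |
| season | 2.0 | 4.0 | 1.0 | 2.0 | 1.0 |
| yr | 0 | 1 | 1 | 1 | 1 |
| mnth | 6 | 11 | 1 | 4 | 3 |
| holiday | 0 | 0 | 0 | 1 | 0 |
| weekday | 5 | 4 | 2 | 1 | 6 |
| workingday | 1 | 1 | 1 | 0 | 0 |
| weathersit | 1 | 2 | 1 | 1 | 2 |
| temp | 24.8 | 12.87 | 6.0 | 26.57 | 20.57 |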


In [17]:
processed_data.describe().T

| | count | mean | min | 25% | 50% | 75% | max | std |
|---|---|---|---|---|---|---|---|---|
| instant | 600.0 | 363.12 | 1.0 | 181.25 | 362.5 | 538.25 | 731.0 | 208.71 |
| dteday | 600.0 | 2011-12-29 02:48:00 | 2011-01-01 00:00:00 | 2011-06-30 06:00:00 | 2011-12-28 12:00:00 | 2012-06-21 06:00:00 | 2012-12-31 00:00:00 | |
| season | 538.0 | 2.44 | 1.0 | 1.0 | 2.0 | 3.0 | 4.0 | 1.11 |
| yr | 600.0 | 0.5 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.5 |
| mnth | 600.0 | 6.47 | 1.0 | 4.0 | 6.0 | 9.0 | 12.0 | 3.44 |
| holiday | 600.0 | 0.03 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.17 |
| weekday | 600.0 | 3.03 | 0.0 | 1.0 | 3.0 | 5.0 | 6.0 | 2.01 |
| workingday | 600.0 | 0.68 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.47 |
| weathersit | 600.0 | 1.4 | 1.0 | 1.0 | 1.0 | 2.0 | 3.0 | 0.54 |
| temp | 600.0 | 19.81 | 2.37 | 13.57 | 20.1 | 26.06 | 34.47 | 7.21 |


In [18]:
from idstools._config import _idstools

In [19]:
_idstools["default"]["data_explorer"]["DataExplorer"]["input_path"] = "data/BikeRentalDaily_train.csv"

In [20]:
_idstools.default.data_explorer.DataExplorer.input_path

'data/BikeRentalDaily_train.csv'

## Module Configuration

In [21]:
from idstools.data_explorer import DataExplorer
from idstools._config import _idstools, pprint_dynaconf

We have multiple options to configure the DataExplorer to analyze the BikeRentalDaily_train.csv data.

- Load the default set of parameters and adjust them to our needs. In this case all possible parameters are initialized and can be set according to the exploration steps that should be run.

- Initialize the class with a configuration defined directly in a cell.
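
The two options can be sketched with plain dictionaries. This is only an illustration: the dictionaries below are hypothetical stand-ins for the Dynaconf settings object, and only the key names are taken from the DataExplorer configuration.

```python
from copy import deepcopy

# Hypothetical defaults: every parameter exists but is unset
default_config = {
    "input_path": None,
    "input_delimiter": None,
    "label": None,
    "output_path": None,
    "pipeline": {"descriptive_analysis": None, "correlation_analysis": None},
}

# Option 1: copy the defaults and override only what is needed
option_one = deepcopy(default_config)
option_one["input_path"] = "data/BikeRentalDaily_train.csv"
option_one["input_delimiter"] = ";"
option_one["pipeline"]["descriptive_analysis"] = True

# Option 2: define just the required parameters directly in the cell
option_two = {
    "input_path": "data/BikeRentalDaily_train.csv",
    "input_delimiter": ";",
    "label": "cnt",
}

print(sorted(option_one))
print(sorted(option_two))
```

Starting from the defaults has the advantage that every available parameter is visible, while an in-cell configuration stays short.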

In [22]:
pprint_dynaconf(_idstools, notebook=True)

```yaml
DEFAULT:
  data_explorer:
    DataExplorer:
      output_path: null
      input_path: data/BikeRentalDaily_train.csv
      input_delimiter: null
      label: null
      pipeline:
        descriptive_analysis: null
        missing_value_analysis: null
        correlation_analysis: null
        outlier_analysis: null
        distribution_analysis: null
        scatter_analysis: null
  data_preparation:
    DataPreparation:
      output_path: null
      input_path: null
      input_delimiter: null
      pipeline:
        _SimpleImputer:
        - target: null
          config:
            strategy: null
        _OneHotEncoder:
        - target: null
          config:
            prefix: null
            dtype: null
        _FeatureDropper:
        - target: null
          config:
            axis: null
            errors: null
  model_optimization:
    ModelOptimization:
      output_path: null
      evaluation:
        metric: null
        cv: null
CUSTOM:
  data_explorer:
    DataExplorer:
      output_path: null
      input_path: data/BikeRentalDaily_train.csv
      input_delimiter: ;
      label: cnt
      pipeline:
        descriptive_analysis: true
        missing_value_analysis: true
        correlation_analysis: true
        outlier_analysis: true
        distribution_analysis: true
        scatter_analysis: true
  data_preparation:
    DataPreparation:
      output_path: null
      input_path: data/BikeRentalDaily_test.csv
      input_delimiter: ;
      pipeline:
        _FeatureDropper:
        - target: instant
          config:
            axis: 1
            errors: ignore
        _CustomTransformer:
        - func: replace_dot_with_hyphen
          module: idstools._custom_transformer
          config:
            target: dteday
  model_optimization:
    ModelOptimization:
      output_path: null
      evaluation:
        metric: mse
        cv: 10

```

In [23]:
config = _idstools.custom.data_explorer.DataExplorer

In [24]:
pprint_dynaconf(config, notebook=True)

```yaml
output_path: null
input_path: data/BikeRentalDaily_train.csv
input_delimiter: ;
label: cnt
pipeline:
  descriptive_analysis: true
  missing_value_analysis: true
  correlation_analysis: true
  outlier_analysis: true
  distribution_analysis: true
  scatter_analysis: true

```

In [25]:
pprint_dynaconf(_idstools.custom.data_explorer.DataExplorer, notebook=True)

```yaml
output_path: null
input_path: data/BikeRentalDaily_train.csv
input_delimiter: ;
label: cnt
pipeline:
  descriptive_analysis: true
  missing_value_analysis: true
  correlation_analysis: true
  outlier_analysis: true
  distribution_analysis: true
  scatter_analysis: true

```

In [26]:
data_explorer_config = config

In [27]:
my_data_explorer = DataExplorer(**data_explorer_config)

2024-02-03 16:22:16,064 [data_explorer] [INFO] - Initializing DataExplorer
2024-02-03 16:22:16,065 [data_explorer] [INFO] - Using label: cnt
2024-02-03 16:22:16,065 [data_explorer] [INFO] - Output path not provided.
Using default path: /home/davidrmn/Studies/introduction-data-science/results
2024-02-03 16:22:16,066 [data_explorer] [INFO] - Pipeline configuration:
descriptive_analysis: true
missing_value_analysis: true
correlation_analysis: true
outlier_analysis: true
distribution_analysis: true
scatter_analysis: true

2024-02-03 16:22:16,066 [_helpers] [INFO] - Reading data from:
/home/davidrmn/Studies/introduction-data-science/data/BikeRentalDaily_train.csv


In [28]:
result = my_data_explorer.descriptive_analysis()

## Another Example of a Custom Transformer

Simply start by importing the configuration. As always, you can provide your own configuration or load and edit the default one to see which parameters can be set.

In [29]:
from idstools._config import _idstools, pprint_dynaconf

Load the transformer config of the _CustomTransformer defined in the custom pipeline as an example.

In [30]:
transformer_config = _idstools.custom.data_preparation.DataPreparation.pipeline._CustomTransformer

Now show what is configured to execute in this step of the data preparation. As you can see, the step references a function that is part of the _custom_transformer module of the idstools package.

In [31]:
pprint_dynaconf(transformer_config, notebook=True)

```yaml
- func: replace_dot_with_hyphen
  module: idstools._custom_transformer
  config:
    target: dteday

```
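
The configuration stores `func` and `module` as plain strings. How idstools turns them into a callable is not shown in this notebook; one common way to do it in Python is `importlib`. The sketch below is an assumption for illustration only: `resolve_callable` is a hypothetical helper, and the standard-library function `calendar.weekday` stands in for a real transformer.

```python
import importlib
from typing import Callable


def resolve_callable(module_name: str, func_name: str) -> Callable:
    """Import a module by its dotted name and return one of its attributes."""
    module = importlib.import_module(module_name)
    return getattr(module, func_name)


# Same shape as a pipeline step: function name, module name, keyword arguments
step = {
    "func": "weekday",
    "module": "calendar",
    "config": {"year": 2011, "month": 6, "day": 3},
}

func = resolve_callable(step["module"], step["func"])
weekday_index = int(func(**step["config"]))  # Monday-based index (Monday=0)
print(weekday_index)
```

Keeping the reference as strings is what allows the whole pipeline to live in a YAML file.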

Here you can see how the function is implemented in the module.

IMPORTANT: Each function for the _CustomTransformer takes the DataFrame with which the DataPreparation class of the data_preparation module was initialized and performs its operation on it. The remaining arguments come from the step's config. In this case there is only one argument, "target", which references the DataFrame column in which every dot "." is replaced with a hyphen "-".

In [32]:
!cat /home/davidrmn/Studies/introduction-data-science/src/idstools/_custom_transformer.py

import pandas as pd
from idstools._helpers import setup_logging

logger = setup_logging(__name__)

def replace_dot_with_hyphen(df: pd.DataFrame, target: str) -> pd.DataFrame:
    if target in df.columns:
        df[target] = df[target].str.replace('.', '-', regex=False)
    else:
        logger.error(f"Column '{target}' not found in DataFrame.")
    return df
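
To try the function outside the package, here is a self-contained variant run on a toy DataFrame. It is a sketch under one assumption: the standard `logging` module replaces idstools' `setup_logging`, whose behavior is not shown here.

```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)


def replace_dot_with_hyphen(df: pd.DataFrame, target: str) -> pd.DataFrame:
    """Replace every '.' with '-' in the string column `target`."""
    if target in df.columns:
        # regex=False treats '.' as a literal dot, not as "any character"
        df[target] = df[target].str.replace(".", "-", regex=False)
    else:
        logger.error("Column '%s' not found in DataFrame.", target)
    return df


toy = pd.DataFrame({"dteday": ["26.10.2011", "02.04.2012"]})
toy = replace_dot_with_hyphen(toy, target="dteday")
print(toy["dteday"].tolist())

# A missing column is only logged; the DataFrame is returned unchanged
same = replace_dot_with_hyphen(toy, target="missing")
```

Note that an unknown column does not raise an exception, so a typo in `target` only shows up in the log.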

Now we can show the whole DataPreparation config again:

In [33]:
pprint_dynaconf(_idstools.custom.data_preparation.DataPreparation, notebook=True)

```yaml
output_path: null
input_path: data/BikeRentalDaily_test.csv
input_delimiter: ;
pipeline:
  _FeatureDropper:
  - target: instant
    config:
      axis: 1
      errors: ignore
  _CustomTransformer:
  - func: replace_dot_with_hyphen
    module: idstools._custom_transformer
    config:
      target: dteday

```

I don't want to drop instant, so I set the pipeline to execute only the _CustomTransformer:

In [34]:
config = _idstools.custom.data_preparation.DataPreparation

In [35]:
config.pipeline = {"_CustomTransformer" : transformer_config}

In [36]:
pprint_dynaconf(config, notebook=True)

```yaml
output_path: null
input_path: data/BikeRentalDaily_test.csv
input_delimiter: ;
pipeline:
  _CustomTransformer:
  - func: replace_dot_with_hyphen
    module: idstools._custom_transformer
    config:
      target: dteday

```

In [37]:
from idstools.data_preparation import DataPreparation
my_data_preparation = DataPreparation(**config) 

2024-02-03 16:22:16,285 [data_preparation] [INFO] - Initializing DataPreparation
2024-02-03 16:22:16,286 [data_preparation] [INFO] - Output path not provided.
Using default path: /home/davidrmn/Studies/introduction-data-science/results
2024-02-03 16:22:16,287 [data_preparation] [INFO] - Pipeline configuration:
_CustomTransformer:
- func: replace_dot_with_hyphen
  module: idstools._custom_transformer
  config:
    target: dteday

2024-02-03 16:22:16,288 [_helpers] [INFO] - Reading data from:
/home/davidrmn/Studies/introduction-data-science/data/BikeRentalDaily_test.csv


It worked, so let's run the preconfigured pipeline of the DataPreparation instance we just created.

In [38]:
my_data_preparation.build_pipeline(config.pipeline)

2024-02-03 16:22:16,297 [data_preparation] [INFO] - Pipeline created.


In [39]:
my_data_preparation.run_pipeline(config.pipeline)

2024-02-03 16:22:16,315 [data_preparation] [INFO] - Pipeline step _CustomTransformer has been processed.


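| | instant | dteday | season | yr | mnth | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | leaflets | price reduction | casual | registered | cnt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 299 | 26-10-2011 | 4.0 | 0 | 10 | 0 | 3 | 1 | 2 | 19.37 | 0.47 | 108.06 | 0.15 | 605 | 0 | 404 | 3490 | 3894 |
| 1 | 458 | 02-04-2012 | 2.0 | 1 | 4 | 0 | 1 | 1 | 1 | 17.36 | 0.43 | 75.65 | 0.31 | 518 | 0 | 1208 | 4728 | 5936 |
| 2 | 687 | 17-11-2012 | 4.0 | 1 | 11 | 0 | 6 | 0 | 1 | 13.00 | 0.33 | 81.81 | 0.18 | 766 | 0 | 1313 | 4316 | 5629 |
| 3 | 346 | 12-12-2011 | 4.0 | 0 | 12 | 0 | -1 | 1 | 1 | 9.53 | 0.27 | | 0.06 | 739 | 0 | 143 | 3167 | 3310 |
| 4 | 291 | 18-10-2011 | 4.0 | 0 | 10 | 0 | 2 | 1 | 2 | 21.30 | 0.52 | 105.25 | 0.11 | 463 | 0 | 637 | 4111 | 4748 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 127 | 426 | 01-03-2012 | 1.0 | 1 | 3 | 0 | 4 | 1 | 1 | 19.43 | 0.48 | 92.31 | 0.23 | 777 | 0 | 325 | 4665 | 4990 |
| 128 | 547 | 30-06-2012 | 3.0 | 1 | 6 | 0 | 6 | 0 | 1 | 30.60 | 0.69 | 90.19 | 0.16 | 981 | 0 | 1455 | 4232 | 5687 |
| 129 | 271 | 28-09-2011 | 4.0 | 0 | 9 | 0 | -1 | 1 | 2 | 25.40 | 0.58 | 127.31 | 0.15 | 577 | 0 | 480 | 3427 | 3907 |
| 130 | 180 | 29-06-2011 | 3.0 | 0 | 6 | 0 | 3 | 1 | 1 | 29.13 | 0.65 | 74.69 | 0.26 | 585 | 0 | 848 | 4377 | 5225 |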


The two steps above can also be performed by the run method of the class. By the way, each class of the package has a run method that orchestrates its steps.

In [40]:
my_data_preparation.run()

2024-02-03 16:22:16,328 [data_preparation] [INFO] - Pipeline created.
2024-02-03 16:22:16,330 [data_preparation] [INFO] - Pipeline step _CustomTransformer has been processed.
2024-02-03 16:22:16,331 [_helpers] [INFO] - Writing data to:
/home/davidrmn/Studies/introduction-data-science/results/BikeRentalDaily_test_processed.csv


As you can see, run also writes the results to the output_path automatically instead of returning them. Of course, you can also save them manually after building and running the pipeline ;)

In [41]:
my_data_preparation.write_data()

2024-02-03 16:22:16,338 [_helpers] [INFO] - Writing data to:
/home/davidrmn/Studies/introduction-data-science/results/BikeRentalDaily_test_processed.csv
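
The relationship between `build_pipeline`, `run_pipeline`, `write_data`, and `run` described above can be sketched as follows. `MiniPreparation` is a hypothetical toy class that works on a list of strings and only mirrors the method names; it is not the idstools implementation.

```python
class MiniPreparation:
    """Hypothetical stand-in that mirrors the method names used above."""

    def __init__(self, rows, steps):
        self.rows = list(rows)
        self.steps = list(steps)
        self.pipeline = None
        self.written = None  # stands in for the file written to output_path

    def build_pipeline(self):
        # Collect the configured steps in execution order
        self.pipeline = list(self.steps)
        return self.pipeline

    def run_pipeline(self):
        # Apply each step to every row and return the processed data
        for step in self.pipeline:
            self.rows = [step(row) for row in self.rows]
        return self.rows

    def write_data(self):
        # Persist the current state of the data
        self.written = list(self.rows)

    def run(self):
        # Orchestrate all steps; the result is written, not returned
        self.build_pipeline()
        self.run_pipeline()
        self.write_data()


prep = MiniPreparation(["26.10.2011"], [lambda value: value.replace(".", "-")])
returned = prep.run()
print(returned, prep.written)
```

Calling the steps individually gives access to intermediate results, whereas `run` is convenient for unattended execution.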


This project is still in development, but imagine what can be automated and built with it :)

Cheers