# Data Exploration Module Test

# TODO:
    - head (/)
    - info (/)
    - matrix (missing values) (/)
    - bar (missing values) (/)
    - imports Jupyter Notebook (/)
    - fix plot resolution (/)

    - data dict (types, e.g. nominal, categorical)
    - box (numeric, deviation)
    - bar/mosaic/ (categorical, deviation)
    - predictor/feature correlation (heatmap/scatter)
    - histogram (skewed/deviation)


In [1]:
import idstools.data_explorer as idsde

In [2]:
test_data = "../data/BikeRentalDaily_test.csv"
train_data = "../data/BikeRentalDaily_train.csv"

In [3]:
data_explorer_config = {"path": train_data, "type": "csv", "separator": ";"}

In [4]:
data_explorer = idsde.DataExplorer(input_path=train_data, output_path="results")

2024-02-02 02:30:20,346 [data_explorer] [INFO] - Initializing DataExplorer
2024-02-02 02:30:20,348 [_helpers] [INFO] - Reading csv file:
../data/BikeRentalDaily_train.csv
2024-02-02 02:30:20,352 [data_explorer] [INFO] - Using output path: results
2024-02-02 02:30:20,352 [data_explorer] [INFO] - Pipeline configuration:
{}



In [5]:
data_explorer.descriptive_analysis()
data_explorer.data.info()

2024-02-02 02:30:20,362 [data_explorer] [INFO] - Head of BikeRentalDaily_train
                          0           1           2           3           4
instant                 154         685         368         472         442
dteday           03.06.2011  15.11.2012  03.01.2012  16.04.2012  17.03.2012
season                  2.0         4.0         1.0         2.0         1.0
yr                        0           1           1           1           1
mnth                      6          11           1           4           3
holiday                   0           0           0           1           0
weekday                   5           4           2           1          -1
workingday                1           1           1           0           0
weathersit                1           2           1           1           2
temp                   24.8       12.87         6.0       26.57       20.57
atemp                  0.59        0.32        0.13        0.61        0.51
hum      

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 600 entries, 0 to 599
Data columns (total 18 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   instant          600 non-null    int64  
 1   dteday           600 non-null    object 
 2   season           538 non-null    float64
 3   yr               600 non-null    int64  
 4   mnth             600 non-null    int64  
 5   holiday          600 non-null    int64  
 6   weekday          600 non-null    int64  
 7   workingday       600 non-null    int64  
 8   weathersit       600 non-null    int64  
 9   temp             600 non-null    float64
 10  atemp            600 non-null    float64
 11  hum              566 non-null    float64
 12  windspeed        600 non-null    float64
 13  leaflets         600 non-null    int64  
 14  price reduction  600 non-null    int64  
 15  casual           600 non-null    int64  
 16  registered       600 non-null    int64  
 17  cnt             

# Data Preparation Module Test

In [6]:
import idstools.data_preparation as dp

In [7]:
test_data = "../data/BikeRentalDaily_test.csv"
train_data = "../data/BikeRentalDaily_train.csv"

In [8]:
data_preparation = dp.DataPreparation(input_path=train_data, output_path="results")


2024-02-02 02:30:20,596 [data_preparation] [INFO] - Initializing DataPreparation
2024-02-02 02:30:20,598 [_helpers] [INFO] - Reading csv file:
../data/BikeRentalDaily_train.csv
2024-02-02 02:30:20,601 [data_preparation] [INFO] - Using output path: results
2024-02-02 02:30:20,601 [data_preparation] [INFO] - Pipeline configuration:
{}



In [9]:
import pandas as pd

def get_wday_by_date(df, date_column, weekday_column):
    # Map pandas' dayofweek (Monday = 0 ... Sunday = 6) to Sunday = 0 ... Saturday = 6
    weekday_shift = {
        6: 0,
        0: 1,
        1: 2,
        2: 3,
        3: 4,
        4: 5,
        5: 6
    }

    # Convert the date column to datetime
    df[date_column] = pd.to_datetime(df[date_column], format="%d.%m.%Y")

    # Calculate the weekday and map it
    df[weekday_column] = df[date_column].dt.dayofweek.map(weekday_shift)

    return df
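
The lookup table in `get_wday_by_date` shifts the Monday = 0 convention of `dayofweek` to one where Sunday = 0 and Saturday = 6, which is the same as computing `(dayofweek + 1) % 7`. As a quick sanity check, here is a minimal standard-library sketch. `shifted_weekday` is a hypothetical helper, not part of idstools, and it relies on `datetime.weekday()` also counting Monday as 0.

```python
from datetime import datetime


def shifted_weekday(date_string: str, fmt: str = "%d.%m.%Y") -> int:
    """Return the weekday with Sunday = 0 ... Saturday = 6."""
    # datetime.weekday() counts Monday = 0 ... Sunday = 6
    monday_based = datetime.strptime(date_string, fmt).weekday()
    return (monday_based + 1) % 7


# The five dates shown in the head of the training data
dates = ["03.06.2011", "15.11.2012", "03.01.2012", "16.04.2012", "17.03.2012"]
weekdays = [shifted_weekday(d) for d in dates]
print(weekdays)  # [5, 4, 2, 1, 6]
```

The last date is the interesting one: the raw data lists its weekday as -1, while the date itself falls on a Saturday (6).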

In [10]:
pipeline_config = {
    "_SimpleImputer": 
    [
        {
            "target": "hum",
            "config": {
                "strategy": "mean"
            }
        }
    ],
    "_CustomTransformer": 
    [
        {
            "func": get_wday_by_date,
            "config": {
                "date_column": "dteday",
                "weekday_column": "weekday"
            }
        }
    ]
}

In [11]:
pipeline = data_preparation.build_pipeline(config=pipeline_config)
pipeline

2024-02-02 02:30:20,621 [data_preparation] [INFO] - Pipeline created.


In [12]:
processed_data = data_preparation.run_pipeline(config=pipeline_config)

2024-02-02 02:30:20,636 [data_preparation] [INFO] - Pipeline step _SimpleImputer has been processed.
2024-02-02 02:30:20,640 [data_preparation] [INFO] - Pipeline step _CustomTransformer has been processed.
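
The `_SimpleImputer` step is configured with `strategy: mean` for the `hum` column. What mean imputation does can be sketched without any library. `impute_mean` below is a hypothetical helper for illustration, not the idstools implementation, and the sample values are made up.

```python
from statistics import mean


def impute_mean(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        # Nothing to compute a mean from; return the input unchanged
        return list(values)
    fill_value = mean(observed)
    return [fill_value if v is None else v for v in values]


# Made-up humidity readings with two missing entries
hum = [50.0, None, 70.0, 90.0, None]
imputed = impute_mean(hum)
print(imputed)  # [50.0, 70.0, 70.0, 90.0, 70.0]
```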


In [13]:
processed_data.head(5).T

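| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| instant | 154 | 685 | 368 | 472 | 442 |
| dteday | 2011-06-03 00:00:00 | 2012-11-15 00:00:00 | 2012-01-03 00:00:00 | 2012-04-16 00:00:00 | 2012-03-17 00:00:00 |
| season | 2.0 | 4.0 | 1.0 | 2.0 | 1.0 |
| yr | 0 | 1 | 1 | 1 | 1 |
| mnth | 6 | 11 | 1 | 4 | 3 |
| holiday | 0 | 0 | 0 | 1 | 0 |
| weekday | 5 | 4 | 2 | 1 | 6 |
| workingday | 1 | 1 | 1 | 0 | 0 |
| weathersit | 1 | 2 | 1 | 1 | 2 |
| temp | 24.8 | 12.87 | 6.0 | 26.57 | 20.57 |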


In [14]:
processed_data.describe().T

| | count | mean | min | 25% | 50% | 75% | max | std |
|---|---|---|---|---|---|---|---|---|
| instant | 600.0 | 363.12 | 1.0 | 181.25 | 362.5 | 538.25 | 731.0 | 208.71 |
| dteday | 600.0 | 2011-12-29 02:48:00 | 2011-01-01 00:00:00 | 2011-06-30 06:00:00 | 2011-12-28 12:00:00 | 2012-06-21 06:00:00 | 2012-12-31 00:00:00 | |
| season | 538.0 | 2.44 | 1.0 | 1.0 | 2.0 | 3.0 | 4.0 | 1.11 |
| yr | 600.0 | 0.5 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.5 |
| mnth | 600.0 | 6.47 | 1.0 | 4.0 | 6.0 | 9.0 | 12.0 | 3.44 |
| holiday | 600.0 | 0.03 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.17 |
| weekday | 600.0 | 3.03 | 0.0 | 1.0 | 3.0 | 5.0 | 6.0 | 2.01 |
| workingday | 600.0 | 0.68 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.47 |
| weathersit | 600.0 | 1.4 | 1.0 | 1.0 | 1.0 | 2.0 | 3.0 | 0.54 |
| temp | 600.0 | 19.81 | 2.37 | 13.57 | 20.1 | 26.06 | 34.47 | 7.21 |


In [15]:
from idstools._config import _idstools

In [16]:
_idstools["default"]["data_explorer"]["DataExplorer"]["input_path"]

In [17]:
_idstools.default.data_explorer.DataExplorer.input_path

## Module Configuration

In [18]:
from idstools.data_explorer import DataExplorer
from idstools._config import _idstools, pprint_dynaconf

There are several ways to configure the DataExplorer to analyze the BikeRentalDaily_train.csv data:

- Load the default set of parameters and adjust them to our needs. In this case all available parameters are initialized and can be set according to the exploration steps that should be run.

- Initialize the class with a configuration defined directly in a notebook cell.
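
Both options come down to the same idea: start from a full set of parameters and override only what you need. Here is a minimal sketch with plain dictionaries. The `deep_merge` helper is hypothetical and only mimics the idea; the package itself reads its settings through `_idstools`. The parameter names are taken from the DataExplorer defaults printed in this notebook.

```python
import copy


def deep_merge(defaults: dict, overrides: dict) -> dict:
    """Return a copy of defaults with overrides applied recursively."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Merge nested sections instead of replacing them wholesale
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


defaults = {
    "input_path": None,
    "input_type": None,
    "input_delimiter": None,
    "pipeline": {"descriptive_analysis": False, "missing_value_bar_plot": False},
}
overrides = {
    "input_path": "data/BikeRentalDaily_train.csv",
    "input_type": "csv",
    "input_delimiter": ";",
    "pipeline": {"descriptive_analysis": True},
}

explorer_config = deep_merge(defaults, overrides)
print(explorer_config)
```

Pipeline switches that are not overridden keep their default value, and the `defaults` dictionary itself stays untouched.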

In [19]:
pprint_dynaconf(_idstools, notebook=True)

```yaml
DEFAULT:
  data_explorer:
    DataExplorer:
      output_path: null
      input_path: null
      input_type: null
      input_delimiter: null
      pipeline:
        descriptive_analysis: false
        missing_value_matrix_plot: false
        missing_value_bar_plot: false
        correlation_heatmap_plot: false
  data_preparation:
    DataPreparation:
      output_path: null
      input_path: null
      input_type: null
      input_delimiter: null
      pipeline:
        _SimpleImputer:
        - target: null
          config:
            strategy: null
        _OneHotEncoder:
        - target: null
          config:
            prefix: null
            dtype: null
        - target: null
          config:
            prefix: null
            dtype: null
        _FeatureDropper:
        - target: null
          config:
            axis: null
            errors: null
  model_optimization:
    ModelOptimization:
      output_path: null
      evaluation:
        metric: null
        cv: null
CUSTOM:
  data_explorer:
    DataExplorer:
      output_path: null
      input_path: data/BikeRentalDaily_train.csv
      input_type: csv
      input_delimiter: ;
      pipeline:
        descriptive_analysis: true
        missing_value_matrix_plot: true
        missing_value_bar_plot: true
        correlation_heatmap_plot: true
  data_preparation:
    DataPreparation:
      output_path: null
      input_path: data/BikeRentalDaily_test.csv
      input_type: csv
      input_delimiter: ;
      pipeline:
        _FeatureDropper:
        - target: instant
          config:
            axis: 1
            errors: ignore
        _CustomTransformer:
        - func: replace_dot_with_hyphen
          module: idstools._custom_pipelines
          config:
            target: dteday
  model_optimization:
    ModelOptimization:
      output_path: results
      evaluation:
        metric: mse
        cv: 10

```

In [20]:
config = _idstools.default.data_explorer.DataExplorer

In [21]:
pprint_dynaconf(config, notebook=True)

```yaml
output_path: null
input_path: null
input_type: null
input_delimiter: null
pipeline:
  descriptive_analysis: false
  missing_value_matrix_plot: false
  missing_value_bar_plot: false
  correlation_heatmap_plot: false

```

In [22]:
pprint_dynaconf(_idstools.custom.data_explorer.DataExplorer, notebook=True)

```yaml
output_path: null
input_path: data/BikeRentalDaily_train.csv
input_type: csv
input_delimiter: ;
pipeline:
  descriptive_analysis: true
  missing_value_matrix_plot: true
  missing_value_bar_plot: true
  correlation_heatmap_plot: true

```

In [23]:
data_explorer_config = config

In [24]:
data_explorer_config.input_path = "/home/davidrmn/Studies/introduction-data-science/data/BikeRentalDaily_train.csv"

In [25]:
my_data_explorer = DataExplorer(**data_explorer_config)

2024-02-02 02:30:20,831 [data_explorer] [INFO] - Initializing DataExplorer
2024-02-02 02:30:20,832 [data_explorer] [INFO] - No output path specified.
Using default output path:/home/davidrmn/Studies/introduction-data-science/project/results
2024-02-02 02:30:20,833 [data_explorer] [INFO] - Pipeline configuration:
descriptive_analysis: false
missing_value_matrix_plot: false
missing_value_bar_plot: false
correlation_heatmap_plot: false



In [26]:
result = my_data_explorer.descriptive_analysis()

2024-02-02 02:30:20,837 [data_explorer] [INFO] - Head of BikeRentalDaily_train
Empty DataFrame
Columns: []
Index: []

2024-02-02 02:30:20,838 [data_explorer] [INFO] - Info of BikeRentalDaily_train
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 0 entries
Empty DataFrame

2024-02-02 02:30:20,839 [data_explorer] [INFO] - Types of BikeRentalDaily_train
Series([], dtype: object)

2024-02-02 02:30:20,840 [data_explorer] [ERROR] - Error in descriptive_analysis: Cannot describe a DataFrame without columns


## Another example of custom transformer

Start by importing the configuration. As always, you can provide your own configuration, or load the default one and edit it to see which parameters are available.

In [27]:
from idstools._config import _idstools, pprint_dynaconf

Load the transformer config of the _CustomTransformer defined in the custom pipeline as an example.

In [28]:
transformer_config = _idstools.custom.data_preparation.DataPreparation.pipeline._CustomTransformer

Now show what is configured to run in this step of the data preparation. As you can see, it references a function that is part of the _custom_pipelines module of the idstools package.

In [31]:
pprint_dynaconf(transformer_config, notebook=True)

```yaml
- func: replace_dot_with_hyphen
  module: idstools._custom_pipelines
  config:
    target: dteday

```
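
A step that names its function as a string plus a module path can be resolved at runtime with `importlib`. The sketch below shows one way this could work. `resolve_step` is a hypothetical helper and not necessarily how idstools does it. To stay self-contained, it resolves `json.dumps` from the standard library instead of a function from `idstools._custom_pipelines`.

```python
import importlib


def resolve_step(step: dict):
    """Look up step['func'] inside step['module'] and return the callable
    together with its keyword arguments."""
    module = importlib.import_module(step["module"])
    func = getattr(module, step["func"])
    return func, step.get("config", {})


# Same shape as the _CustomTransformer entry, but pointing at the stdlib
step = {"func": "dumps", "module": "json", "config": {"sort_keys": True}}

func, kwargs = resolve_step(step)
result = func({"b": 1, "a": 2}, **kwargs)
print(result)  # {"a": 2, "b": 1}
```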

Here you can see how replace_dot_with_hyphen is implemented in the module.

IMPORTANT: Each function for the _CustomTransformer takes the DataFrame with which the DataPreparation class of the data_preparation module was initialized and applies its transformation to it, based on the arguments provided. In this case there is only one argument, "target", which names the column of the DataFrame in which all dots "." are replaced with hyphens "-".

In [32]:
!cat /home/davidrmn/Studies/introduction-data-science/src/idstools/_custom_pipelines.py

import pandas as pd
from idstools._helpers import setup_logging

logger = setup_logging(__name__)

def replace_dot_with_hyphen(df: pd.DataFrame, target: str) -> pd.DataFrame:
    if target in df.columns:
        df[target] = df[target].str.replace('.', '-', regex=False)
    else:
        logger.error(f"Column '{target}' not found in DataFrame.")
    return df
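
Passing `regex=False` makes the intent explicit: in a regular expression a bare `.` matches any character, so an unescaped dot used as a pattern would match far more than the dots. The difference can be seen with the standard library alone. This sketch uses `str.replace` and `re.sub` rather than the pandas string accessor.

```python
import re

date = "03.06.2011"

# Literal replacement: only the dots are replaced
literal = date.replace(".", "-")

# Regex replacement with an unescaped dot: every character matches
regex_unescaped = re.sub(".", "-", date)

# Regex replacement with an escaped dot behaves like the literal version
regex_escaped = re.sub(r"\.", "-", date)

print(literal)          # 03-06-2011
print(regex_unescaped)  # ----------
print(regex_escaped)    # 03-06-2011
```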

Now we can again show the whole DataPreparation config:

In [36]:
pprint_dynaconf(_idstools.custom.data_preparation.DataPreparation, notebook=True)

```yaml
output_path: null
input_path: data/BikeRentalDaily_test.csv
input_type: csv
input_delimiter: ;
pipeline:
  _FeatureDropper:
  - target: instant
    config:
      axis: 1
      errors: ignore
  _CustomTransformer:
  - func: replace_dot_with_hyphen
    module: idstools._custom_pipelines
    config:
      target: dteday

```

I don't want to drop instant, so I set the pipeline to only execute the _CustomTransformer:

In [34]:
config = _idstools.custom.data_preparation.DataPreparation

In [37]:
config.pipeline = {"_CustomTransformer" : transformer_config}

In [38]:
pprint_dynaconf(config, notebook=True)

```yaml
output_path: null
input_path: data/BikeRentalDaily_test.csv
input_type: csv
input_delimiter: ;
pipeline:
  _CustomTransformer:
  - func: replace_dot_with_hyphen
    module: idstools._custom_pipelines
    config:
      target: dteday

```

In [39]:
from idstools.data_preparation import DataPreparation
my_data_preparation = DataPreparation(**config) 

2024-02-02 02:43:00,371 [data_preparation] [INFO] - Initializing DataPreparation
2024-02-02 02:43:00,372 [_helpers] [INFO] - Reading csv file:
data/BikeRentalDaily_test.csv
2024-02-02 02:43:00,374 [_helpers] [ERROR] - Error in read_data: [Errno 2] No such file or directory: '/home/davidrmn/Studies/introduction-data-science/project/data/BikeRentalDaily_test.csv'
2024-02-02 02:43:00,374 [data_preparation] [INFO] - No output path specified.
Using default output path:/home/davidrmn/Studies/introduction-data-science/results
2024-02-02 02:43:00,375 [data_preparation] [INFO] - Pipeline configuration:
_CustomTransformer:
- func: replace_dot_with_hyphen
  module: idstools._custom_pipelines
  config:
    target: dteday



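As you can see, it almost worked! Because the code runs in Jupyter, the relative path resolves to the wrong location; this still needs to be fixed. As a workaround, we can set input_path to an absolute path.

At the moment the path is relative to the package root. Since the notebook that executes the code does not run from there, we need to adjust it. Note that the absolute path set below points to BikeRentalDaily_train.csv rather than the test file.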
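
One way to make such a relative path independent of the directory the notebook runs in is to resolve it against an explicit root. Here is a minimal sketch with `pathlib`. `resolve_input_path` and the example root are hypothetical, not part of idstools. `PurePosixPath` is used so that the result is the same on every operating system.

```python
from pathlib import PurePosixPath


def resolve_input_path(path: str, root: str) -> str:
    """Return path unchanged if it is absolute, otherwise join it onto root."""
    candidate = PurePosixPath(path)
    if candidate.is_absolute():
        return str(candidate)
    return str(PurePosixPath(root) / candidate)


# Hypothetical repository root, for illustration only
root = "/path/to/repo"

relative = resolve_input_path("data/BikeRentalDaily_test.csv", root)
absolute = resolve_input_path("/tmp/BikeRentalDaily_test.csv", root)
print(relative)  # /path/to/repo/data/BikeRentalDaily_test.csv
print(absolute)  # /tmp/BikeRentalDaily_test.csv
```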

In [40]:
config.input_path

'data/BikeRentalDaily_test.csv'

In [41]:
config.input_path = "/home/davidrmn/Studies/introduction-data-science/data/BikeRentalDaily_train.csv"

Now let's try again.

In [42]:
my_data_preparation = DataPreparation(**config) 

2024-02-02 02:46:33,681 [data_preparation] [INFO] - Initializing DataPreparation
2024-02-02 02:46:33,682 [_helpers] [INFO] - Reading csv file:
/home/davidrmn/Studies/introduction-data-science/data/BikeRentalDaily_train.csv
2024-02-02 02:46:33,686 [data_preparation] [INFO] - No output path specified.
Using default output path:/home/davidrmn/Studies/introduction-data-science/results
2024-02-02 02:46:33,687 [data_preparation] [INFO] - Pipeline configuration:
_CustomTransformer:
- func: replace_dot_with_hyphen
  module: idstools._custom_pipelines
  config:
    target: dteday



It worked, so let's run the preconfigured pipeline of the DataPreparation instance we just created.

In [45]:
my_data_preparation.build_pipeline(config.pipeline)

2024-02-02 02:48:06,521 [data_preparation] [INFO] - Pipeline created.


In [47]:
my_data_preparation.run_pipeline(config.pipeline)

2024-02-02 02:48:27,643 [data_preparation] [INFO] - Pipeline step _CustomTransformer has been processed.


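| | instant | dteday | season | yr | mnth | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | leaflets | price reduction | casual | registered | cnt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 154 | 03-06-2011 | 2.0 | 0 | 6 | 0 | 5 | 1 | 1 | 24.80 | 0.59 | 53.13 | 0.25 | 991 | 0 | 898 | 4414 | 5312 |
| 1 | 685 | 15-11-2012 | 4.0 | 1 | 11 | 0 | 4 | 1 | 2 | 12.87 | 0.32 | 93.06 | 0.15 | 601 | 0 | 320 | 5125 | 5445 |
| 2 | 368 | 03-01-2012 | 1.0 | 1 | 1 | 0 | 2 | 1 | 1 | 6.00 | 0.13 | 66.19 | 0.37 | 549 | 0 | 89 | 2147 | 2236 |
| 3 | 472 | 16-04-2012 | 2.0 | 1 | 4 | 1 | 1 | 0 | 1 | 26.57 | 0.61 | 84.25 | 0.28 | 740 | 0 | 1198 | 5172 | 6370 |
| 4 | 442 | 17-03-2012 | 1.0 | 1 | 3 | 0 | -1 | 0 | 2 | 20.57 | 0.51 | 113.37 | 0.11 | 773 | 1 | 3155 | 4681 | 7836 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 595 | 78 | 19-03-2011 | 1.0 | 0 | 3 | 0 | 6 | 0 | 1 | 18.90 | 0.47 | 56.88 | 0.37 | 1022 | 1 | 1424 | 1693 | 3117 |
| 596 | 81 | 22-03-2011 | | 0 | 3 | 0 | 2 | 1 | 1 | 17.67 | 0.44 | 93.69 | 0.23 | 551 | 0 | 460 | 2243 | 2703 |
| 597 | 377 | 12-01-2012 | 1.0 | 1 | 1 | 0 | 4 | 1 | 2 | 15.30 | 0.38 | 120.44 | 0.18 | 520 | 0 | 269 | 3828 | 4097 |
| 598 | 299 | 26-10-2011 | 4.0 | 0 | 10 | 0 | 3 | 1 | 2 | 19.37 | 0.47 | 108.06 | 0.15 | 605 | 0 | 404 | 3490 | 3894 |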


The two steps above can also be triggered through the run method of the class. By the way, each class of the package has a run method that orchestrates its steps.

In [48]:
my_data_preparation.run()

2024-02-02 02:50:23,606 [data_preparation] [INFO] - Pipeline created.
2024-02-02 02:50:23,608 [data_preparation] [INFO] - Pipeline step _CustomTransformer has been processed.
2024-02-02 02:50:23,609 [_helpers] [INFO] - Writing data to:
/home/davidrmn/Studies/introduction-data-science/results/BikeRentalDaily_train_processed.csv


As you can see, this also writes the results to the output_path automatically instead of returning them. Of course, you can also save them manually after build and run ;)
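
Conceptually, `run` chains the steps that were called by hand before: build, run, write. The toy class below sketches that orchestration pattern. `ToyPreparation` and its internals are hypothetical and much simpler than the real `DataPreparation` class; it works on a list of dictionaries and keeps the "written" result in memory instead of writing a file.

```python
class ToyPreparation:
    """Toy stand-in that applies a list of functions to a list of rows."""

    def __init__(self, data, steps):
        self.data = data
        self.steps = steps
        self.pipeline = None
        self.written = None

    def build_pipeline(self):
        # Freeze the configured steps into an ordered pipeline
        self.pipeline = list(self.steps)

    def run_pipeline(self):
        # Apply each (function, kwargs) pair in order
        for func, kwargs in self.pipeline:
            self.data = func(self.data, **kwargs)
        return self.data

    def write_data(self):
        # A real implementation would write to output_path;
        # here the result is only kept in memory
        self.written = list(self.data)

    def run(self):
        # Orchestrate all three steps in one call
        self.build_pipeline()
        self.run_pipeline()
        self.write_data()


def replace_dots(rows, target):
    """Replace dots with hyphens in the target field of every row."""
    return [{**row, target: row[target].replace(".", "-")} for row in rows]


prep = ToyPreparation(
    data=[{"dteday": "03.06.2011"}, {"dteday": "15.11.2012"}],
    steps=[(replace_dots, {"target": "dteday"})],
)
prep.run()
print(prep.written)  # [{'dteday': '03-06-2011'}, {'dteday': '15-11-2012'}]
```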

In [49]:
my_data_preparation.write_data()

2024-02-02 02:52:41,461 [_helpers] [INFO] - Writing data to:
/home/davidrmn/Studies/introduction-data-science/results/BikeRentalDaily_train_processed.csv


This project is still in development, but imagine what can be automated and built with it :)

Cheers