# Data Exploration Module Test

# TODO:
    - head (/)
    - info (/)
    - matrix (missing values) (/)
    - bar (missing values) (/)
    - imports Jupyter Notebook (/)
    - fix plot resolution (/)

    - data dict (types, e.g. nominal, categorical)
    - box (numeric, deviation)
    - bar/mosaic (categorical, deviation)
    - predictor/feature correlation (heatmap/scatter)
    - histogram (skewed/deviation)


In [None]:
import idstools.data_explorer as idsde

In [None]:
test_data = "../data/BikeRentalDaily_test.csv"
train_data = "../data/BikeRentalDaily_train.csv"

In [None]:
data_explorer_config = {"path": train_data, "type": "csv", "separator": ";"}

In [None]:
data_explorer = idsde.DataExplorer(input_path=train_data, output_path="results")

In [None]:
data_explorer.descriptive_analysis()
data_explorer.data.info()

# Data Preparation Module Test

In [None]:
import idstools.data_preparation as dp

In [None]:
test_data = "../data/BikeRentalDaily_test.csv"
train_data = "../data/BikeRentalDaily_train.csv"

In [None]:
data_preparation = dp.DataPreparation(input_path=train_data, output_path="results")


In [None]:
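import pandas as pd

def get_wday_by_date(df, date_column, weekday_column):
    """Recompute the weekday column from the date column (Sunday=0 ... Saturday=6).

    Note: the passed DataFrame is modified in place and also returned.
    """
    # pandas counts weekdays as Monday=0 ... Sunday=6;
    # shift them so that Sunday=0 ... Saturday=6
    weekday_shift = {
        6: 0,
        0: 1,
        1: 2,
        2: 3,
        3: 4,
        4: 5,
        5: 6
    }

    # Convert the date column (e.g. "03.06.2011") to datetime
    df[date_column] = pd.to_datetime(df[date_column], format="%d.%m.%Y")

    # Derive the pandas weekday and map it to the shifted numbering
    df[weekday_column] = df[date_column].dt.dayofweek.map(weekday_shift)

    return df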
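
As a quick sanity check on the weekday mapping, the lookup table is equivalent to `(dayofweek + 1) % 7`. The sketch below uses only the standard library (`datetime.weekday()` follows the same Monday=0 convention as pandas' `dt.dayofweek`). The helper name `sunday_based_weekday` is made up for illustration and is not part of idstools; the dates are ones that appear in the training data.

```python
from datetime import datetime

# Same lookup table as in get_wday_by_date:
# Monday=0 ... Sunday=6  ->  Sunday=0 ... Saturday=6
WEEKDAY_SHIFT = {6: 0, 0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6}


def sunday_based_weekday(date_string, date_format="%d.%m.%Y"):
    """Return the weekday of a date string with Sunday=0 ... Saturday=6 (hypothetical helper)."""
    monday_based = datetime.strptime(date_string, date_format).weekday()
    return WEEKDAY_SHIFT[monday_based]


# The lookup table is the same as a modular shift by one day
assert all(WEEKDAY_SHIFT[d] == (d + 1) % 7 for d in range(7))

print(sunday_based_weekday("03.06.2011"))  # a Friday   -> 5
print(sunday_based_weekday("15.11.2012"))  # a Thursday -> 4
print(sunday_based_weekday("17.03.2012"))  # a Saturday -> 6
```

The first two results agree with the `weekday` values stored in the raw data (5 and 4). The third row stores `-1` in the raw data, which is the kind of entry the recomputation replaces.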

In [None]:
pipeline_config = {
        "_SimpleImputer": [
            {
                "target": "hum",
                "config": {
                    "strategy": "mean"
                }
            }
        ],
        "_GenericDataFrameTransformer": [
        {
            "transform_func": get_wday_by_date,
            "config": {
                "date_column": "dteday",
                "weekday_column": "weekday"
            }
        }
    ]
}
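
For intuition about the `strategy: mean` setting, here is a minimal standard-library sketch of mean imputation: missing entries are replaced by the mean of the observed values. How `_SimpleImputer` is implemented inside idstools is not shown in this notebook, so the function `impute_mean` and the sample humidity values are purely illustrative assumptions.

```python
from statistics import fmean


def impute_mean(values):
    """Replace None entries with the mean of the non-missing values (illustrative only)."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column without observed values")
    mean = fmean(observed)
    return [mean if v is None else v for v in values]


# Made-up humidity readings with two missing entries
hum = [0.5, None, 0.75, 1.0, None]
print(impute_mean(hum))  # both gaps are filled with the mean of 0.5, 0.75 and 1.0
```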

In [None]:
pipeline = data_preparation.build_pipeline(config=pipeline_config)
pipeline

In [None]:
processed_data = data_preparation.run_pipeline(config=pipeline_config)

In [None]:
processed_data.head(5).T

In [None]:
processed_data.describe().T

In [None]:
from idstools._config import _idstools

In [None]:
_idstools["default"]["data_explorer"]["DataExplorer"]["input_path"]

In [None]:
_idstools.default.data_explorer.DataExplorer.input_path

## Module Configuration

In [1]:
from idstools._config import _idstools, pprint_dynaconf
from idstools.data_explorer import DataExplorer

There are multiple ways to configure the `DataExplorer` for analyzing the `BikeRentalDaily_train.csv` data:

- Load the default set of parameters and adjust them as needed. In this case, all available parameters are initialized and can be set according to the exploration steps that should be run.

- Initialize the class with a configuration defined directly in a notebook cell.
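
The two options can be sketched with plain dictionaries. The keys mirror the `DataExplorer` defaults printed in this notebook; `DEFAULT_CONFIG` and `with_overrides` are hypothetical stand-ins (not idstools names), and plain dicts are used instead of the Dynaconf settings object so the snippet runs without idstools. Deep-copying the defaults before overriding is a choice of this sketch, made so that the shared defaults stay untouched.

```python
from copy import deepcopy

# Stand-in for the default DataExplorer parameters (values copied from the defaults)
DEFAULT_CONFIG = {
    "output_path": "results",
    "input_path": "data/BikeRentalDaily_train.csv",
    "input_type": "csv",
    "input_delimiter": ";",
    "pipeline": {
        "descriptive_analysis": True,
        "missing_value_matrix_plot": True,
        "missing_value_bar_plot": True,
        "correlation_heatmap_plot": True,
    },
}


def with_overrides(defaults, **overrides):
    """Return a deep copy of the defaults with top-level keys replaced (hypothetical helper)."""
    config = deepcopy(defaults)
    config.update(overrides)
    return config


# Option 1: start from the defaults and adjust only what differs
adjusted = with_overrides(DEFAULT_CONFIG, input_path="../data/BikeRentalDaily_train.csv")

# Option 2: define the full configuration directly in the cell
in_cell = {
    "output_path": "results",
    "input_path": "../data/BikeRentalDaily_train.csv",
    "input_type": "csv",
    "input_delimiter": ";",
    "pipeline": {
        "descriptive_analysis": True,
        "missing_value_matrix_plot": True,
        "missing_value_bar_plot": True,
        "correlation_heatmap_plot": True,
    },
}

# Both routes yield the same configuration; the shared defaults are unchanged
print(adjusted == in_cell)
print(DEFAULT_CONFIG["input_path"])
```

Either dictionary could then be unpacked into the class in the same way the notebook does it, e.g. `DataExplorer(**adjusted)`.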

In [8]:
pprint_dynaconf(_idstools, notebook=True)

```yaml
DEFAULT:
  data_explorer:
    DataExplorer:
      output_path: results
      input_path: data/BikeRentalDaily_train.csv
      input_type: csv
      input_delimiter: ;
      pipeline:
        descriptive_analysis: true
        missing_value_matrix_plot: true
        missing_value_bar_plot: true
        correlation_heatmap_plot: true
  data_preparation:
    DataPreparation:
      output_path: null
      input_path: data/BikeRentalDaily_train.csv
      input_type: csv
      input_delimiter: ;
      pipeline:
        _SimpleImputer:
        - target: hum
          config:
            strategy: mean
        _OneHotEncoder:
        - target: season
          config:
            prefix: season
            dtype: int
        - target: mnth
          config:
            prefix: month
            dtype: int
        _FeatureDropper:
        - target: instant
          config:
            axis: 1
            errors: ignore
  model_optimization:
    ModelOptimization:
      output_path: results
      evaluation:
        metric: rmse
        cv: 5
CUSTOM:
  data_explorer:
    DataExplorer:
      output_path: results
      input_path: data/BikeRentalDaily_test.csv
      input_type: csv
      input_delimiter: ;
      pipeline:
        descriptive_analysis: true
        missing_value_matrix_plot: true
        missing_value_bar_plot: true
        correlation_heatmap_plot: true
  data_preparation:
    DataPreparation:
      output_path: null
      input_path: data/BikeRentalDaily_test.csv
      input_type: csv
      input_delimiter: ;
      pipeline:
        _FeatureDropper:
        - target: instant
          config:
            axis: 1
            errors: ignore
        - target: hum
          config:
            axis: 1
            errors: ignore
        - target: windspeed
          config:
            axis: 1
            errors: ignore
  model_optimization:
    ModelOptimization:
      output_path: results
      evaluation:
        metric: mse
        cv: 10

```

In [3]:
config = _idstools.default.data_explorer.DataExplorer

In [6]:
pprint_dynaconf(config, notebook=True)

```yaml
output_path: results
input_path: data/BikeRentalDaily_train.csv
input_type: csv
input_delimiter: ;
pipeline:
  descriptive_analysis: true
  missing_value_matrix_plot: true
  missing_value_bar_plot: true
  correlation_heatmap_plot: true

```

In [10]:
data_explorer_config = config

In [11]:
data_explorer_config.input_path = "/home/davidrmn/Studies/introduction-data-science/data/BikeRentalDaily_train.csv"

In [12]:
my_data_explorer = DataExplorer(**data_explorer_config)

2024-02-01 20:53:24,117 [data_explorer] [INFO] - Initializing DataExplorer
2024-02-01 20:53:24,119 [_helpers] [INFO] - Reading csv file:
/home/davidrmn/Studies/introduction-data-science/data/BikeRentalDaily_train.csv
2024-02-01 20:53:24,122 [data_explorer] [INFO] - Using output path: results
2024-02-01 20:53:24,123 [data_explorer] [INFO] - Using pipeline:
descriptive_analysis: true
missing_value_matrix_plot: true
missing_value_bar_plot: true
correlation_heatmap_plot: true



In [13]:
result = my_data_explorer.descriptive_analysis()

2024-02-01 20:53:27,129 [data_explorer] [INFO] - Head of BikeRentalDaily_train
                          0           1           2           3           4
instant                 154         685         368         472         442
dteday           03.06.2011  15.11.2012  03.01.2012  16.04.2012  17.03.2012
season                  2.0         4.0         1.0         2.0         1.0
yr                        0           1           1           1           1
mnth                      6          11           1           4           3
holiday                   0           0           0           1           0
weekday                   5           4           2           1          -1
workingday                1           1           1           0           0
weathersit                1           2           1           1           2
temp                   24.8       12.87         6.0       26.57       20.57
atemp                  0.59        0.32        0.13        0.61        0.51
hum      