# EDA example
This notebook is an example of how to use our ConceptDriftsFinder tool as part of the EDA process.

## Step 1 - Technical initialization
We start with a few technical steps: configuring the notebook and loading the dataset.

The dataset used in this example is the housing dataset. For more about the available datasets, see the README.md file, the accompanying PDF, or the `datasets_config.py` file.

### Install necessary requirements

In [None]:
%pip install -r ../requirements.txt

### Change working directory and enable Jupyter autoreload

In [None]:
# Change working directory to root
import os
if os.getcwd().endswith("notebooks"):
    %cd ..
    print(os.getcwd())

# Automatically reload changes in code
%load_ext autoreload
%autoreload 2

### Imports, logging and pandas configuration

In [None]:
import logging
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

from association_finder.concept_drifts_finder import ConceptDriftsFinder, convert_df_to_transactions
from association_finder.concept_engineering import ConceptEngineering
from association_finder.datasets_config import datasets_config
from association_finder.models import Transaction, ConceptDriftResult
from association_finder.one_vs_rest_classifier import OneVsRestClassifier, label_to_concept_transform_wrapper
from association_finder.preprocessing import preprocess_dataset, split_X_y

# Logs config
logging.basicConfig(level=logging.INFO)

# Pandas config
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", 100)

### Read, split and preprocess data

In [None]:
np.random.seed(0)

# Dataset can be changed to any of the following: ["housing", "rain", "sales", "netflix"]
dataset = "housing"

# load dataset config
dataset_config = datasets_config[dataset]

# Read file
df = pd.read_csv(dataset_config["train_dataset_path"], index_col=dataset_config['index_col'])
target_column = dataset_config["target_column"]

# Rain fix
if dataset == "rain":
    # Turn Yes/No columns into 1/0 columns, respectively.
    for column in ["RainToday", "RainTomorrow"]:
        df[column] = df[column].map(dict(Yes=1, No=0))

# Drop rows with NaN values in the target column.
df = df.dropna(subset=[target_column])

# Split
df_train, df_val = train_test_split(df.sort_index(), test_size=0.3, random_state=0)

# Preprocess the training split; train_params records what was done (e.g. dropped columns)
df_train_prep, train_params = preprocess_dataset(df_train)

# Keep only the configured columns that were not dropped during preprocessing
good_columns = [column for column in dataset_config["good_columns"] if column not in train_params.dropped_columns]
one_hot_columns = [column for column in dataset_config["one_hot_columns"] if column not in train_params.dropped_columns]
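
The cell above reads several keys from `datasets_config`. As a rough guide to what one entry might look like, here is a minimal sketch. Only the key names come from this notebook; every value (the path, the index column, the column lists and the thresholds) is a made-up placeholder, so check `datasets_config.py` for the real ones.

```python
# Hypothetical shape of one datasets_config entry.
# Key names are the ones this notebook reads; ALL values are placeholders.
example_datasets_config = {
    "housing": {
        "train_dataset_path": "path/to/housing/train.csv",  # placeholder path
        "index_col": "Id",                                   # placeholder index column
        "target_column": "SalePrice",
        "good_columns": ["OverallQual", "BldgType", "SalePrice"],  # placeholder list
        "one_hot_columns": ["BldgType"],                            # placeholder list
        "min_confidence": 0.5,   # placeholder threshold
        "min_support": 0.1,      # placeholder threshold
        "diff_threshold": 0.2,   # placeholder threshold
    }
}

# Same filtering pattern as in the cell above, with a pretend set of dropped columns.
dropped_columns = {"BldgType"}
entry = example_datasets_config["housing"]
kept_columns = [c for c in entry["good_columns"] if c not in dropped_columns]
print(kept_columns)
```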

## Step 2 - Using ConceptDriftsFinder
We are now ready to start using `ConceptDriftsFinder`. You can choose any column as a potential source of concept drift.

For example, if you choose `OverallQual`, you can see that `confidence_before` (when `OverallQual` < 2.8) is higher than `confidence_after`. This means that the lower the quality of the house, the stronger the influence of `BldgType: 1Fam` on `SalePrice=1`.
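
To make the before/after comparison concrete, here is a minimal sketch on made-up toy data. It assumes that confidence means the usual association-rule confidence, P(target | antecedent), computed separately on each side of a cutoff. The helper `rule_confidence` and the toy rows are hypothetical and are not part of `association_finder`, whose internal computation may differ.

```python
import pandas as pd

# Toy, made-up rows: NOT the housing dataset.
toy = pd.DataFrame({
    "OverallQual": [1, 2, 2, 3, 4, 5, 5, 6],
    "BldgType": ["1Fam", "1Fam", "1Fam", "1Fam", "1Fam", "1Fam", "Duplex", "1Fam"],
    "SalePrice": [1, 1, 1, 0, 1, 0, 0, 0],
})


def rule_confidence(df, antecedent_col, antecedent_val, target_col, target_val):
    """Hypothetical helper: share of rows matching the antecedent that also match the target."""
    matches = df[df[antecedent_col] == antecedent_val]
    if matches.empty:
        return float("nan")
    return float((matches[target_col] == target_val).mean())


cutoff = 2.8  # same cutoff as in the text, applied here to the toy data
before = toy[toy["OverallQual"] < cutoff]
after = toy[toy["OverallQual"] >= cutoff]

confidence_before = rule_confidence(before, "BldgType", "1Fam", "SalePrice", 1)
confidence_after = rule_confidence(after, "BldgType", "1Fam", "SalePrice", 1)
print(confidence_before, confidence_after)
```

On this toy data the rule `BldgType: 1Fam` → `SalePrice=1` holds for every matching row below the cutoff and for one in four at or above it, which is the kind of gap the finder is meant to surface.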

In [None]:
transactions = convert_df_to_transactions(df_train_prep[good_columns])
concepts = ConceptDriftsFinder().find_concept_drifts(
    transactions,
    concept_column="OverallQual",
    target_column=target_column,
    min_confidence=dataset_config["min_confidence"],
    min_support=dataset_config["min_support"],
    diff_threshold=dataset_config["diff_threshold"],
)
pd.DataFrame(concepts)
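
When the resulting table is long, it can help to rank rows by how much the confidence changes. The sketch below uses made-up rows and assumes only that the table has the `confidence_before` and `confidence_after` columns mentioned above. The `rule` label column and the `confidence_diff` column are hypothetical names introduced for illustration.

```python
import pandas as pd

# Made-up rows that only mimic the two confidence columns; not real output.
results = pd.DataFrame({
    "rule": ["A", "B", "C"],  # hypothetical label column
    "confidence_before": [0.90, 0.55, 0.40],
    "confidence_after": [0.60, 0.50, 0.80],
})

# Signed change in confidence across the cutoff.
results["confidence_diff"] = results["confidence_after"] - results["confidence_before"]

# Order rows by the absolute size of the change, largest first.
order = results["confidence_diff"].abs().sort_values(ascending=False).index
ranked = results.loc[order]
print(ranked)
```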