# Automatic Feature Engineering
This notebook uses a simple heuristic to automatically apply our `ConceptDriftsFinder` tool to feature engineering.

## Step 1 - Technical initialization
We start with a few technical steps: notebook configuration and dataset loading.

The notebook can run on any one of four datasets: ["housing", "rain", "sales", "netflix"]. See their configuration in `datasets_config.json`.
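For orientation, a single configuration entry might look roughly like the sketch below. The keys are the ones this notebook reads later on; every value (path, column names, thresholds) is a made-up placeholder and not the actual content of the configuration file.

```python
# Hypothetical configuration entry. The keys mirror those read by this
# notebook; all values are illustrative placeholders.
example_dataset_config = {
    "train_dataset_path": "data/example/train.csv",  # assumed path
    "index_col": "Id",                                # assumed index column
    "target_column": "target",                        # assumed target name
    "good_columns": ["feature_a", "feature_b", "target"],
    "one_hot_columns": ["feature_b"],
    "min_confidence": 0.1,    # placeholder threshold
    "min_support": 0.05,      # placeholder threshold
    "diff_threshold": 0.2,    # placeholder threshold
}

# Check that the entry provides every key the notebook looks up.
REQUIRED_KEYS = {
    "train_dataset_path", "index_col", "target_column", "good_columns",
    "one_hot_columns", "min_confidence", "min_support", "diff_threshold",
}
missing_keys = REQUIRED_KEYS - example_dataset_config.keys()
print(sorted(missing_keys))
```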

### Install necessary requirements

In [None]:
%pip install -r ../requirements.txt

### Change working directory and add jupyter reload

In [None]:
# Change working directory to root
import os
if os.getcwd().endswith("notebooks"):
    %cd ..
    print(os.getcwd())

# Automatically reload changes in code
%load_ext autoreload
%autoreload 2

### Imports, logging and pandas configuration

In [None]:
import logging
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

from association_finder.concept_drifts_finder import ConceptDriftsFinder
from association_finder.concept_engineering import ConceptEngineering
from association_finder.datasets_config import datasets_config
from association_finder.models import Transaction, ConceptDriftResult
from association_finder.one_vs_rest_classifier import OneVsRestClassifier, label_to_concept_transform_wrapper
from association_finder.preprocessing import preprocess_dataset, split_X_y

# Logs config
logging.basicConfig(level=logging.INFO)

# Pandas config
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", 100)

### Read, split and preprocess data

In [None]:
np.random.seed(0)

# Dataset can be changed to any of the following: ["housing", "rain", "sales", "netflix"]
dataset = "housing"

# load dataset config
dataset_config = datasets_config[dataset]

# Read file
df = pd.read_csv(dataset_config["train_dataset_path"], index_col=dataset_config['index_col'])
target_column = dataset_config["target_column"]

# Drop rows with NaN values in the target column.
df = df.dropna(subset=[target_column])

# Rain hotfix
if dataset == "rain":
    # Turn Yes/No columns into 1/0 columns, respectively.
    for column in ["RainToday", "RainTomorrow"]:
        df[column] = df[column].map(dict(Yes=1, No=0))

# Split
df_train, df_val = train_test_split(df, test_size=0.3, random_state=0)

# Preprocess    
df_train_prep, train_params = preprocess_dataset(df_train)

# Focusing on prominent columns:
good_columns = [column for column in dataset_config["good_columns"] if column not in train_params.dropped_columns]
one_hot_columns = [column for column in dataset_config["one_hot_columns"] if column not in train_params.dropped_columns]


In [None]:
# Prepare data for training
X_train, y_train = split_X_y(df_train_prep, good_columns, train_params, one_hot_columns, target_column)
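# Preprocess the validation set with the parameters learned on the training set
df_val_prep, _ = preprocess_dataset(df_val, train_params)
X_val, y_val = split_X_y(df_val_prep, good_columns, train_params, one_hot_columns, target_column, list(X_train.columns))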

## Step 2 - Find rules

In this step, we automatically run `ConceptDriftsFinder` with every feature as a candidate concept. This functionality is already bundled in the `ConceptEngineering` object.

The output is a dataframe of all the concepts found. This may take a while, especially on the housing dataset.
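To make the terms concrete, the sketch below computes support, confidence and lift for a single rule on a toy dataframe, then compares the lift on the two sides of a concept cutoff. It uses the textbook definitions of these metrics; the helper `rule_metrics`, the toy columns and the strict `<` cutoff are assumptions for illustration, and the library's internals may differ.

```python
import pandas as pd

def rule_metrics(df, lhs_column, lhs_value, rhs_column, rhs_value):
    """Support, confidence and lift of the rule (lhs -> rhs) within df."""
    lhs_mask = df[lhs_column] == lhs_value
    rhs_mask = df[rhs_column] == rhs_value
    support = (lhs_mask & rhs_mask).mean()   # P(lhs and rhs)
    lhs_support = lhs_mask.mean()            # P(lhs)
    rhs_support = rhs_mask.mean()            # P(rhs)
    confidence = support / lhs_support if lhs_support > 0 else 0.0
    lift = confidence / rhs_support if rhs_support > 0 else 0.0
    return {"support": support, "confidence": confidence, "lift": lift}

# Toy data: "age" plays the role of the concept column.
toy_df = pd.DataFrame({
    "age":    [10, 20, 30, 40, 50, 60, 70, 80],
    "garage": [1, 1, 0, 0, 1, 1, 0, 0],
    "label":  [1, 1, 0, 0, 1, 0, 0, 1],
})

# Slice the data by the concept cutoff and evaluate the same rule in each slice.
cutoff = 45
below = toy_df[toy_df["age"] < cutoff]
above = toy_df[toy_df["age"] >= cutoff]
metrics_below = rule_metrics(below, "garage", 1, "label", 1)
metrics_above = rule_metrics(above, "garage", 1, "label", 1)

# A large gap between the two lifts suggests the rule behaves differently across the concept.
lift_diff = metrics_below["lift"] - metrics_above["lift"]
print(metrics_below["lift"], metrics_above["lift"], lift_diff)
```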

In [None]:
# Find association rules
concept_engineering = ConceptEngineering(
    min_confidence=dataset_config['min_confidence'],
    min_support=dataset_config['min_support'],
    diff_threshold=dataset_config['diff_threshold'],
)
concept_engineering.fit(X_train, df_train_prep[good_columns], target_column, one_hot_columns)
concept_engineering.concepts_df

## Step 3 - Build models
We could review the concepts dataframe manually, but the goal of this notebook is to evaluate our tool automatically. We therefore build two models: (1) a baseline model and (2) a model that uses our tool.

### Baseline model
The baseline model is a very simple scikit-learn one-vs-rest classifier. Accuracy is printed for both the training and validation sets.
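As a reminder of what one-vs-rest means, here is a minimal sketch that fits one binary `LogisticRegression` per label and predicts the label whose classifier is most confident. The helper names and the toy data are hypothetical; this is not the project's `OneVsRestClassifier`, only an illustration of the general idea.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_one_vs_rest(X, y):
    """Fit one binary classifier per label (that label vs. all the others)."""
    classifiers = {}
    for label in np.unique(y):
        clf = LogisticRegression(max_iter=1000)
        clf.fit(X, (y == label).astype(int))
        classifiers[label] = clf
    return classifiers

def predict_one_vs_rest(classifiers, X):
    """Pick, for each row, the label with the highest positive-class probability."""
    labels = list(classifiers)
    scores = np.column_stack(
        [classifiers[label].predict_proba(X)[:, 1] for label in labels]
    )
    return np.array(labels)[scores.argmax(axis=1)]

# Toy data: three well-separated clusters, one per label.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]])
X_toy = np.vstack([center + rng.normal(scale=0.5, size=(20, 2)) for center in centers])
y_toy = np.repeat([0, 1, 2], 20)

toy_classifiers = fit_one_vs_rest(X_toy, y_toy)
toy_predictions = predict_one_vs_rest(toy_classifiers, X_toy)
print(f"Toy train accuracy: {(toy_predictions == y_toy).mean()}")
```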

In [None]:
# Simple one vs rest classifier for baseline
one_vs_rest_classifier = OneVsRestClassifier()

In [None]:
y_train_pred = one_vs_rest_classifier.fit_transform(X_train, y_train)
y_val_pred = one_vs_rest_classifier.transform(X_val)

print(f"Train accuracy: {accuracy_score(y_train, y_train_pred)}")
print(f"Validation accuracy: {accuracy_score(y_val, y_val_pred)}")

### Model using rules
We now build a model that uses our rules.
Again, we use the same scikit-learn one-vs-rest classifier.
This time, however, each label's classifier applies our `ConceptEngineering` utility.

The `ConceptEngineering` utility does the following (see the accompanying PDF for more details):
1) Find which slice of the dataset has higher lift values for the given rule (remember that we slice the dataset using the concept column and the concept cutoff).
2) Go over all the datapoints in that slice and increase the values of the features in the left_hand_side of the rule by the difference between the lift values.
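The two steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the utility's actual code: the function `boost_left_hand_side`, its parameters, the strict `<` cutoff and the toy data are all hypothetical, and the two lift values are passed in rather than computed.

```python
import pandas as pd

def boost_left_hand_side(X, concept_column, concept_cutoff, lhs_column,
                         lift_below, lift_above):
    """Add the lift difference to lhs_column for rows in the higher-lift slice."""
    X_boosted = X.copy()
    # Step 1: choose the slice (below vs. at-or-above the cutoff) with the higher lift.
    below_mask = X_boosted[concept_column] < concept_cutoff
    slice_mask = below_mask if lift_below > lift_above else ~below_mask
    # Step 2: raise the left-hand-side feature for every row in that slice.
    X_boosted.loc[slice_mask, lhs_column] += abs(lift_below - lift_above)
    return X_boosted

X_toy = pd.DataFrame({
    "age":    [10, 20, 50, 60],
    "garage": [1.0, 0.0, 1.0, 0.0],
})
X_boosted = boost_left_hand_side(
    X_toy, "age", 45, "garage", lift_below=2.0, lift_above=1.0
)
print(X_boosted)
```

Only the rows with `age` below the cutoff are changed here, because that slice has the higher lift in this toy setup; the input dataframe is left untouched.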


Accuracy is again printed for both the training and validation sets.

In [None]:
# One vs rest classifier that uses rules (each label classifier uses its own rules)
label_to_transformation = {label: label_to_concept_transform_wrapper(concept_engineering, target_column, label) for label in y_train.unique()}
rules_one_vs_rest_classifier = OneVsRestClassifier(label_to_transformation)

In [None]:
rules_y_train_pred = rules_one_vs_rest_classifier.fit_transform(X_train, y_train)
rules_y_val_pred = rules_one_vs_rest_classifier.transform(X_val)

print(f"Train accuracy: {accuracy_score(y_train, rules_y_train_pred)}")
print(f"Validation accuracy: {accuracy_score(y_val, rules_y_val_pred)}")

## Step 4 - Analyze model
We used the following steps to monitor the models' behavior.

### Error analysis
Shows the training examples that the baseline model misclassified (first 50 rows).

In [None]:
pred_df = pd.DataFrame(zip(y_train_pred, y_train), columns=['predict', 'actual'], index=X_train.index)
pred_df = pd.merge(pred_df, X_train, left_index=True, right_index=True)
errors_df = pred_df[pred_df['predict'] != pred_df['actual']]
errors_df[:50]

### Model coefficients analysis
Prints the coefficients of both models. A one-vs-rest classifier holds one model per label, so you need to specify which label's coefficients to inspect.

This is useful for verifying that the rules-based model actually changes the weights of the features on the left-hand side of the concepts.
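A side-by-side comparison can be easier to read than two printed lists. The sketch below pairs each feature with its coefficient in both models and sorts by the size of the change. The helper `coefficient_changes` and the toy numbers are hypothetical; it assumes each model exposes a scikit-learn style `coef_` array.

```python
import numpy as np

def coefficient_changes(baseline_coef, rules_coef, feature_names):
    """Return (feature, baseline, rules, delta) tuples, largest absolute change first."""
    # np.ravel handles both 1-D arrays and 2-D arrays of shape (1, n_features).
    baseline = np.ravel(baseline_coef)
    rules = np.ravel(rules_coef)
    rows = [
        (name, float(b), float(r), float(r - b))
        for name, b, r in zip(feature_names, baseline, rules)
    ]
    return sorted(rows, key=lambda row: abs(row[3]), reverse=True)

# Toy coefficients standing in for two fitted models.
changes = coefficient_changes(
    baseline_coef=np.array([[0.5, -1.0, 0.25]]),
    rules_coef=np.array([[0.5, -0.25, 0.5]]),
    feature_names=["feature_a", "feature_b", "feature_c"],
)
for name, baseline_value, rules_value, delta in changes:
    print(f"{name}: {baseline_value} -> {rules_value} (delta {delta})")
```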

In [None]:
model_label = 3
# coef_ may be 2-D with shape (1, n_features); flatten it so each coefficient pairs with its column
for classifier in (one_vs_rest_classifier, rules_one_vs_rest_classifier):
    coefficients = np.ravel(classifier.classifiers[model_label].coef_)
    print(list(enumerate(sorted(zip(coefficients, X_train.columns)))))

### Scatterplots
Plots the target against every left-hand-side feature: all training points are drawn in blue, and the points that fall inside the concept's slice are drawn on top in orange.
This helps to better understand the effect of each concept on the dataset.

In [None]:
from collections import Counter
import matplotlib.pyplot as plt

combined_df = pd.merge(X_train, y_train, left_index=True, right_index=True)


for _, concept_row in list(concept_engineering.concepts_df.iterrows()):
    print()
    print(f"{concept_row.concept_column} {concept_row.concept_cutoff} {concept_row.right_hand_side}")
    x = concept_engineering._filter_X_by_concept(combined_df, concept_row)
    left_hand_side_column = list(concept_row.left_hand_side.keys())[0]
    left_hand_side_column_value = list(concept_row.left_hand_side.values())[0]    
    
    if left_hand_side_column in concept_engineering.one_hot_columns:
        left_hand_side_column = f'{left_hand_side_column}_{left_hand_side_column_value}'
    
    x_filtered = x[left_hand_side_column]
    y_filtered = x[target_column]
    
    x_all = combined_df[left_hand_side_column]
    y_all = combined_df[target_column]
    
    # count the occurrences of each point
    c = Counter(zip(x_all,y_all))
    # create a list of the sizes, here multiplied by 10 for scale
    s = [10*c[(xx,yy)] for xx,yy in zip(x_all,y_all)]
    plt.scatter(x_all, y_all, s=s, color='blue')


    # count the occurrences of each point
    c = Counter(zip(x_filtered,y_filtered))
    # create a list of the sizes, here multiplied by 10 for scale
    s = [10*c[(xx,yy)] for xx,yy in zip(x_filtered,y_filtered)]
    plt.scatter(x_filtered, y_filtered, s=s, color='orange')
    
    plt.xlabel(left_hand_side_column)
    plt.ylabel(target_column)

    plt.show()