In [1]:
import pandas as pd
from tpot import TPOTRegressor
import sklearn.model_selection
import sklearn.datasets
import sklearn.metrics

# Process Data with TPOT
This demo shows how to use DFS-generated features with [TPOT](https://rhiever.github.io/tpot/).

Hint: Because the features were exported to a CSV file, there are small differences from the [TPOT example](https://rhiever.github.io/tpot/examples/).

In [2]:
feature_matrix = pd.read_csv('./example_data.csv', index_col=0)

## Split X and y
The imported feature matrix contains all the features we can use.

We are going to predict the mean unit price per invoice, averaged over all invoices, which is the feature `MEAN(invoices.MEAN(item_purchases.UnitPrice))`.

In [3]:
X = feature_matrix.drop('MEAN(invoices.MEAN(item_purchases.UnitPrice))', axis=1)
y = feature_matrix['MEAN(invoices.MEAN(item_purchases.UnitPrice))']
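Before training, it can help to confirm that the target column is gone from `X` and to see whether any feature contains missing values. The following is a minimal sketch on a toy frame; the column names `feature_a`, `feature_b`, and `target`, as well as the `customer_id` index, are hypothetical stand-ins and not part of `example_data.csv`. Depending on the TPOT version, missing values may have to be imputed before calling `fit`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the feature matrix; all names here are hypothetical.
toy = pd.DataFrame(
    {
        "feature_a": [1.0, 2.0, np.nan, 4.0],
        "feature_b": [10.0, 20.0, 30.0, 40.0],
        "target": [0.5, 1.5, 2.5, 3.5],
    },
    index=pd.Index([101, 102, 103, 104], name="customer_id"),
)

target_name = "target"
X_toy = toy.drop(columns=[target_name])  # predictors only
y_toy = toy[target_name]                 # value to predict

# The target must not remain among the predictors, or the model
# would simply learn to copy it.
assert target_name not in X_toy.columns

# Count missing values per predictor column and show the affected ones.
missing_per_column = X_toy.isna().sum()
print(missing_per_column[missing_per_column > 0])
```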

## Learn model
Train the model as shown in the [TPOT example](https://rhiever.github.io/tpot/examples/).

In [4]:
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, random_state=1)

tpot = TPOTRegressor(generations=5, population_size=50, verbosity=2, random_state=42)
tpot.fit(X_train, y_train)

y_hat = tpot.predict(X_test)
print("R2 score:", sklearn.metrics.r2_score(y_test, y_hat))

Optimization Progress:  32%|███▏      | 95/300 [02:14<02:50,  1.20pipeline/s] 

Generation 1 - Current best internal CV score: 1117.6338186510259


Optimization Progress:  47%|████▋     | 141/300 [03:44<05:26,  2.05s/pipeline]

Generation 2 - Current best internal CV score: 1117.6338186510259


Optimization Progress:  61%|██████    | 183/300 [05:22<04:04,  2.09s/pipeline]

Generation 3 - Current best internal CV score: 1117.6338186510259


Optimization Progress:  75%|███████▍  | 224/300 [07:56<04:56,  3.91s/pipeline]

Generation 4 - Current best internal CV score: 1117.6338186510259


                                                                              

Generation 5 - Current best internal CV score: 1091.2792884506362

Best pipeline: LinearSVR(RobustScaler(input_matrix), LinearSVR__C=15.0, LinearSVR__dual=False, LinearSVR__epsilon=1.0, LinearSVR__loss=squared_epsilon_insensitive, LinearSVR__tol=0.0001)
R2 score: 0.976035708232
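The R2 score printed last is computed on the held-out test set, while the "internal CV score" lines come from TPOT's cross-validation on the training set. As far as the TPOT documentation describes, the regressor's default scoring is based on mean squared error, so the two numbers are on different scales and should not be compared directly. To make the R2 value concrete, here is a small sketch that computes it by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The helper name `r2_by_hand` and the toy values are made up for illustration.

```python
import numpy as np

def r2_by_hand(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Hypothetical toy values, not taken from the notebook's data.
y_true_toy = [3.0, 5.0, 7.0, 9.0]
y_pred_toy = [2.5, 5.5, 7.0, 9.5]

r2_toy = r2_by_hand(y_true_toy, y_pred_toy)
print(r2_toy)
```

A value close to 1 means the predictions explain almost all of the variation in the target; a model that always predicts the mean of `y_true` would score 0.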




## TPOT-Pipeline
TPOT can export the best pipeline it found as a Python script. The following script contains the pipeline TPOT reported as best in the run above.

In [5]:
tpot.export('./tpot_pipeline.py')

In [None]:
# %load ./tpot_pipeline.py
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.svm import LinearSVR

# NOTE: Make sure that the class is labeled 'class' in the data file
tpot_data = np.recfromcsv('PATH/TO/DATA/FILE', delimiter='COLUMN_SEPARATOR', dtype=np.float64)
features = np.delete(tpot_data.view(np.float64).reshape(tpot_data.size, -1), tpot_data.dtype.names.index('class'), axis=1)
training_features, testing_features, training_target, testing_target = \
    train_test_split(features, tpot_data['class'], random_state=42)

exported_pipeline = make_pipeline(
    RobustScaler(),
    LinearSVR(C=15.0, dual=False, epsilon=1.0, loss="squared_epsilon_insensitive", tol=0.0001)
)

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)


## Analysing Prediction Result
Because the prediction result is nearly perfect, we want to know which features are important.

In [9]:
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.svm import LinearSVR

training_features, testing_features, training_target, testing_target = \
    train_test_split(X_train, y_train, random_state=42)

exported_pipeline = make_pipeline(
    RobustScaler(),
    LinearSVR(C=15.0, dual=False, epsilon=1.0, loss="squared_epsilon_insensitive", tol=0.0001)
)

exported_pipeline.fit(training_features, training_target)

Pipeline(steps=[('robustscaler', RobustScaler(copy=True, quantile_range=(25.0, 75.0), with_centering=True,
       with_scaling=True)), ('linearsvr', LinearSVR(C=15.0, dual=False, epsilon=1.0, fit_intercept=True,
     intercept_scaling=1.0, loss='squared_epsilon_insensitive',
     max_iter=1000, random_state=None, tol=0.0001, verbose=0))])

Show each feature together with its coefficient:

In [10]:
important_coefs = pd.Series(data=exported_pipeline.steps[1][1].coef_, index=X.columns)
sorted_coef = important_coefs.sort_values(ascending=False)

sorted_coef

MEAN(invoices.SUM(item_purchases.UnitPrice))                   41.589479
Country = France                                               22.695646
Country = United Kingdom                                       13.322339
Country = Germany                                               8.935684
MEAN(item_purchases.items.SUM(item_purchases.Quantity))         5.995957
Country = Portugal                                              5.833740
SUM(item_purchases.UnitPrice)                                   2.159102
MEAN(item_purchases.items.SUM(item_purchases.UnitPrice))        1.395468
MEAN(item_purchases.items.MEAN(item_purchases.Quantity))        1.236994
MEAN(invoices.SUM(item_purchases.Quantity))                     0.582053
MEAN(item_purchases.UnitPrice)                                  0.410761
Country = Belgium                                               0.369748
AVG_TIME_BETWEEN(item_purchases)                                0.064857
SUM(item_purchases.Quantity)                       
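The coefficient lookup above addresses the regressor by position (`steps[1][1]`). A pipeline built with `make_pipeline` also exposes its steps by name through `named_steps`, where each name is the lowercased class name. The sketch below rebuilds the same pipeline on synthetic data to show this; the columns `f0`, `f1`, `f2` and the generated target are assumptions made purely for illustration. Note that the coefficients refer to the features after `RobustScaler` has been applied, not to the raw columns.

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.svm import LinearSVR

# Synthetic data (hypothetical): the target depends strongly on f0,
# negatively on f1, and not at all on f2.
rng = np.random.RandomState(0)
X_toy = pd.DataFrame(rng.normal(size=(200, 3)), columns=["f0", "f1", "f2"])
y_toy = 5.0 * X_toy["f0"] - 2.0 * X_toy["f1"] + rng.normal(scale=0.1, size=200)

# Same pipeline structure and hyperparameters as the exported script.
pipe = make_pipeline(
    RobustScaler(),
    LinearSVR(C=15.0, dual=False, epsilon=1.0,
              loss="squared_epsilon_insensitive", tol=0.0001),
)
pipe.fit(X_toy, y_toy)

# Look the regressor up by name instead of by position.
svr = pipe.named_steps["linearsvr"]
coefs = pd.Series(svr.coef_, index=X_toy.columns)
print(coefs.sort_values(ascending=False))
```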

It seems that a few features carry most of the weight. Let us focus on the six that stand out, positive or negative:

In [11]:
sorted_coef[(sorted_coef > 6) | (sorted_coef < -10)]

MEAN(invoices.SUM(item_purchases.UnitPrice))    41.589479
Country = France                                22.695646
Country = United Kingdom                        13.322339
Country = Germany                                8.935684
Country = Italy                                -27.377826
MEAN(invoices.COUNT(item_purchases))           -33.242062
dtype: float64
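The filter above uses two hand-picked thresholds. An alternative that needs no thresholds is to rank the coefficients by absolute value and take the first few. This is a minimal sketch on a made-up Series; the feature names are placeholders and the numbers are only loosely modeled on the output above.

```python
import pandas as pd

# Hypothetical coefficients for illustration only.
coefs = pd.Series(
    {
        "feat_pos_large": 41.6,
        "feat_pos_small": 0.4,
        "feat_neg_large": -33.2,
        "feat_neg_mid": -27.4,
        "feat_pos_mid": 8.9,
    }
)

# Order the labels by magnitude, then reindex so the signs are kept.
order = coefs.abs().sort_values(ascending=False).index
top3 = coefs.reindex(order).head(3)
print(top3)
```

Because the regressor is fit on scaled features, comparing magnitudes this way ranks features by their effect per unit of the scaled value, not per unit of the original column.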

There appear to be some interesting relationships. The strong influence of the country features suggests that the dataset may be biased. For now we won't go deeper into the analysis, but if you're interested, have fun! :)
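One simple starting point for such an analysis is to check how many rows fall into each country, since a coefficient for a country with very few rows is estimated from little data. The sketch below assumes the one-hot columns follow the `Country = ...` naming seen in the output above; the toy values and the `other_feature` column are invented for illustration.

```python
import pandas as pd

# Toy one-hot encoded frame; the values are hypothetical.
toy = pd.DataFrame(
    {
        "Country = France": [1, 0, 0, 0, 0, 0],
        "Country = United Kingdom": [0, 1, 1, 1, 1, 0],
        "Country = Italy": [0, 0, 0, 0, 0, 1],
        "other_feature": [0.2, 1.3, 0.7, 0.9, 1.1, 0.4],
    }
)

# Select the one-hot country columns by their shared prefix.
country_cols = [c for c in toy.columns if c.startswith("Country = ")]

# Row count and share of rows per country.
counts = toy[country_cols].sum().sort_values(ascending=False)
shares = counts / len(toy)
print(shares)
```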