# Custom Transformer

In [3]:
import seaborn as sns
import pandas as pd
import numpy as np

from sklearn import set_config; set_config(display='diagram')

👇 Consider the following dataset:

In [None]:
import pandas as pd

data = pd.read_csv("https://wagon-public-datasets.s3.amazonaws.com/05-Machine-Learning/08-Workflow/custom_transformer_data.csv")
data.head()

- Each observation of the dataset represents an item being delivered from a `seller_state` to a `customer_state`.
- Other columns describe the packaging properties of each item.

🎯 The target, `days_until_delivery`, is the number of days between the order and the delivery.

In [None]:
# Check target
sns.histplot(data.days_until_delivery)

## 1. Pipeline

👇 Create a scikit-learn pipeline named `pipe`:

- Engineer a `volume` feature from the product dimension features
- Keep the original product dimension features for training
- Scale all numerical features
- Encode the categorical features
- Add a default `Ridge` regression estimator

**Note:** for this challenge, ignore the holdout method, so no need to `train_test_split`!

<details><summary>Hints</summary>

- There are many ways to create your preprocessed matrix (using `ColumnTransformer` and/or `FeatureUnion`). 
    
- If your transformed feature matrix looks odd, it is probably stored as a sparse matrix, which is the default output of `OneHotEncoder` (`sparse=True`, renamed to `sparse_output=True` in scikit-learn 1.2). Use `.toarray()` (or `.todense()`) to convert it to a dense matrix.

</details>
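As a minimal sketch of the sparse-output hint above, the snippet below encodes a tiny made-up DataFrame (the state values are illustrative and not taken from the dataset) and converts the result to a dense array. It relies only on the default behavior of `OneHotEncoder`, so it should not depend on whether the parameter is called `sparse` or `sparse_output` in your scikit-learn version.

```python
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# Made-up categorical data, for illustration only
toy_states = pd.DataFrame({'seller_state': ['SP', 'RJ', 'SP', 'MG']})

# By default, OneHotEncoder returns a SciPy sparse matrix
encoded = OneHotEncoder().fit_transform(toy_states)
is_sparse = sparse.issparse(encoded)

# Convert to a regular NumPy array to inspect the values
dense = encoded.toarray()
print(is_sparse, dense.shape)
```

Each row of `dense` contains a single 1 in the column matching its category.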

In [None]:
import seaborn as sns
import pandas as pd
import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import Ridge


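# Engineer the `volume` feature (cm³) from the three dimension columns.
# Caveat: this adds the column to `data` itself, outside the pipeline, so any
# new data passed to `pipe.predict` needs the same `volume` column.
data['volume'] = data['product_length_cm'] * data['product_height_cm'] * data['product_width_cm']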

numerical_features = ['product_weight_g', 'product_length_cm', 'product_height_cm', 'product_width_cm', 'volume']
categorical_features = ['customer_state', 'seller_state']

numerical_transformer = Pipeline([('scaler', StandardScaler())])

categorical_transformer = Pipeline([('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer([
    ('num', numerical_transformer, numerical_features),
    ('cat', categorical_transformer, categorical_features),
])

pipe = Pipeline([
    ('preprocess', preprocessor),
    ('ridge', Ridge()),
])

X = data.drop('days_until_delivery', axis=1)
y = data['days_until_delivery']

pipe.fit(X, y)
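The cell above computes `volume` on the DataFrame before the pipeline runs. One possible alternative, sketched here on synthetic data, is to compute it inside the pipeline with a `FunctionTransformer`, so that raw new orders can be passed straight to `predict`. The helper `compute_volume`, the `*_sketch` names, and all toy values are assumptions made for illustration; this is not necessarily the solution the challenge expects.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

dimension_cols = ['product_length_cm', 'product_height_cm', 'product_width_cm']

def compute_volume(df):
    """Hypothetical helper: return a one-column DataFrame with length * height * width."""
    return df[dimension_cols].prod(axis=1).to_frame('volume')

# Synthetic stand-in for the real dataset (all values are made up)
rng = np.random.default_rng(0)
n = 50
toy = pd.DataFrame({
    'product_weight_g': rng.uniform(100, 5000, n),
    'product_length_cm': rng.uniform(10, 100, n),
    'product_height_cm': rng.uniform(5, 50, n),
    'product_width_cm': rng.uniform(5, 50, n),
    'customer_state': rng.choice(['SP', 'RJ', 'MG'], n),
    'seller_state': rng.choice(['SP', 'PR'], n),
})
toy_y = rng.uniform(1, 30, n)

preprocessor_sketch = ColumnTransformer([
    # Original numerical features are kept and scaled
    ('num', StandardScaler(), ['product_weight_g'] + dimension_cols),
    # Volume is derived from the dimensions, then scaled
    ('volume', make_pipeline(FunctionTransformer(compute_volume), StandardScaler()), dimension_cols),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['customer_state', 'seller_state']),
])

pipe_sketch = Pipeline([('preprocess', preprocessor_sketch), ('ridge', Ridge())])
pipe_sketch.fit(toy, toy_y)

# Preprocessed matrix: 4 scaled numerical columns + 1 volume column + one-hot columns
features = pipe_sketch[:-1].transform(toy)
print(features.shape)
```

Because the volume computation lives inside `pipe_sketch`, a raw row without a `volume` column can be passed to `pipe_sketch.predict` directly.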


#### 🧪 Test your pipe

In [None]:
from nbresult import ChallengeResult

pipe_test = pipe

# Check that it doesn't crash
assert pipe_test.fit(X, y)

result = ChallengeResult(
    'pipe', 
    shape = pipe_test[:-1].fit_transform(X).shape
)

result.write()
print(result.check())

## 2. Train and Predict

👇 Let's imagine `data` is your entire training set.

- `cross_validate` your pipeline on this dataset (❗️ low R² scores are expected)
- Now imagine you just received a new order, `new_data`: predict its delivery duration and store it in a variable named `prediction`
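The two steps above can be sketched on synthetic data as follows. The toy columns, values, and names (`toy_pipe`, `toy_new_order`, `toy_prediction`) are assumptions made for illustration and deliberately differ from the names the challenge expects. Note that `cross_validate` fits clones of the estimator, so the pipeline still has to be fitted on the full training set before predicting.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic training set (all values are made up)
rng = np.random.default_rng(42)
n = 100
toy_X = pd.DataFrame({
    'product_weight_g': rng.uniform(100, 5000, n),
    'seller_state': rng.choice(['SP', 'PR', 'MG'], n),
})
toy_y = 5 + 0.002 * toy_X['product_weight_g'] + rng.normal(0, 3, n)

toy_pipe = Pipeline([
    ('preprocess', ColumnTransformer([
        ('num', StandardScaler(), ['product_weight_g']),
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['seller_state']),
    ])),
    ('ridge', Ridge()),
])

# 5-fold cross-validation scored with R²
cv_results = cross_validate(toy_pipe, toy_X, toy_y, cv=5, scoring='r2')
mean_r2 = cv_results['test_score'].mean()

# cross_validate leaves toy_pipe unfitted, so fit on the whole training set first
toy_pipe.fit(toy_X, toy_y)

# A single made-up new order, with the same columns as the training features
toy_new_order = pd.DataFrame({'product_weight_g': [1200.0], 'seller_state': ['SP']})
toy_prediction = toy_pipe.predict(toy_new_order)[0]
print(mean_r2, toy_prediction)
```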

In [None]:
new_data = pd.read_csv("https://wagon-public-datasets.s3.amazonaws.com/05-Machine-Learning/08-Workflow/custom_transformer_new_order.csv")
new_data

In [None]:
# YOUR CODE HERE

In [None]:
from nbresult import ChallengeResult

result = ChallengeResult(
    'prediction',
    prediction = prediction
)

result.write()
print(result.check())


🏁 Congratulations! Don't forget to add, commit, and push your notebook.