# Expected Delivery

👇 Run the code below

In [3]:
import pandas as pd

data = pd.read_csv('data.csv')

data.head()

Unnamed: 0,customer_state,seller_state,product_weight_g,product_length_cm,product_height_cm,product_width_cm,order_purchase_timestamp,order_delivered_customer_date
0,RJ,SP,1825,53,10,40,18/09/2017 20:11,28/09/2017 19:19
1,RJ,SP,700,65,18,28,16/10/2017 14:12,25/10/2017 16:43
2,RJ,SP,1825,53,10,40,14/04/2018 00:04,25/04/2018 23:10
3,RJ,SP,1825,53,10,40,10/10/2017 15:32,23/10/2017 12:09
4,RJ,SP,1825,53,10,40,26/09/2017 11:51,10/10/2017 18:05


Each observation in the dataset represents an item delivered from a `seller_state` to a `customer_state`. The columns describe the size and weight of each item. Two columns record when the order was placed (`order_purchase_timestamp`) and when it was delivered (`order_delivered_customer_date`).

The task is to tell customers the **number of days until delivery** at the moment the order is placed. Because customers would rather have a delivery arrive early than late, you should favor a model that **overshoots its predictions**.

## 1. Preprocessing

👇 Drop duplicates

In [1]:
# Drop duplicates
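
A minimal sketch of deduplication on a made-up frame (the `toy` data and the variable names are illustrative, not part of the exercise). One thing to watch for: if the saved index column `Unnamed: 0` is left in place, no two rows compare equal. Whether the exercise expects you to ignore that column is an assumption here.

```python
import pandas as pd

# Hypothetical toy frame that mimics a few columns of the dataset
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "customer_state": ["RJ", "RJ", "SP"],
    "seller_state": ["SP", "SP", "SP"],
    "product_weight_g": [1825, 1825, 700],
})

# With the saved index column present, every row is unique
naive = toy.drop_duplicates()

# Ignoring that column lets identical items compare equal
deduplicated = (
    toy.drop(columns="Unnamed: 0")
    .drop_duplicates()
    .reset_index(drop=True)
)

print(len(naive), len(deduplicated))  # 3 2
```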


👇 Using `ColumnTransformer`, create a preprocessing pipeline to:
- Encode customer and seller state information
- Scale product details

Save the output of the transformer as `X`.

[`ColumnTransformer` documentation](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html)

In [5]:
# YOUR CODE HERE
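
One possible shape for the transformer, shown on a made-up frame. The choice of `OneHotEncoder` and `StandardScaler` is an assumption (the task only says "encode" and "scale"), and `toy`, `state_columns`, `product_columns` and `X_toy` are illustrative names.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame with the same feature columns as the dataset
toy = pd.DataFrame({
    "customer_state": ["RJ", "SP", "MG", "RJ"],
    "seller_state": ["SP", "SP", "RJ", "SP"],
    "product_weight_g": [1825, 700, 300, 1200],
    "product_length_cm": [53, 65, 20, 40],
    "product_height_cm": [10, 18, 5, 12],
    "product_width_cm": [40, 28, 15, 30],
})

state_columns = ["customer_state", "seller_state"]
product_columns = [
    "product_weight_g",
    "product_length_cm",
    "product_height_cm",
    "product_width_cm",
]

# Columns not listed in any transformer are dropped by default
preprocessor = ColumnTransformer([
    ("states", OneHotEncoder(handle_unknown="ignore"), state_columns),
    ("product", StandardScaler(), product_columns),
])

X_toy = preprocessor.fit_transform(toy)
print(X_toy.shape)  # (4, 9): 3 + 2 one-hot columns, 4 scaled columns
```

Setting `handle_unknown="ignore"` is a design choice: a state that never appeared during fitting is encoded as all zeros instead of raising an error at prediction time.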

👇 Create the target by computing the time difference between `order_purchase_timestamp` and `order_delivered_customer_date`. Round it up to days. Save as `y`.

<details>
<summary>💡 Hint</summary>
    
Convert each column to datetime and compute the difference.
    
[`to_datetime` documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html)
    
[`dt.days` documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.dt.days.html)
</details>

In [6]:
# To datetime

# Days until delivery
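
A sketch of the conversion on the first two rows shown in `data.head()`. The explicit day-first format string is inferred from those sample rows. Note that `dt.days` truncates to whole days; taking the ceiling instead is one reading of "round it up" and is an assumption here.

```python
import numpy as np
import pandas as pd

# First two rows of the two timestamp columns, copied from the preview
toy = pd.DataFrame({
    "order_purchase_timestamp": ["18/09/2017 20:11", "16/10/2017 14:12"],
    "order_delivered_customer_date": ["28/09/2017 19:19", "25/10/2017 16:43"],
})

# The sample rows are day-first, so the format is spelled out explicitly
date_format = "%d/%m/%Y %H:%M"
purchased = pd.to_datetime(toy["order_purchase_timestamp"], format=date_format)
delivered = pd.to_datetime(toy["order_delivered_customer_date"], format=date_format)

elapsed = delivered - purchased

# dt.days drops the partial day; the ceiling counts it as a full day
y_toy = np.ceil(elapsed.dt.total_seconds() / 86400).astype(int)

print(elapsed.dt.days.tolist(), y_toy.tolist())  # [9, 9] [10, 10]
```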


## 2. Linear Regression

👇 Instantiate a `LinearRegression` model and make cross-validated predictions.

[`cross_val_predict` documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_predict.html)

In [7]:
# Instantiate model

# Make cross-validated predictions 
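
A sketch of the pattern on synthetic data (`X_toy` and `y_toy` are generated here only for illustration, and `cv=5` is an arbitrary choice). `cross_val_predict` returns one out-of-fold prediction per row and works on clones of the estimator, so the `model` object itself stays unfitted.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

# Synthetic regression data, for illustration only
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 3))
y_toy = 10 + X_toy @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

model = LinearRegression()

# Each row is predicted by a clone trained on the other folds
y_pred_toy = cross_val_predict(model, X_toy, y_toy, cv=5)

print(y_pred_toy.shape)         # (50,)
print(hasattr(model, "coef_"))  # False: the original estimator is not fitted
```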


👇 Score the model using a known metric that penalizes underestimated predictions.

In [2]:
# YOUR CODE HERE
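
The Model Selection section below names the mean squared log error (MSLE) as such a metric. The sketch below, on made-up numbers, shows its asymmetry: missing by 3 days below the target costs more than missing by 3 days above it. Be aware that `mean_squared_log_error` rejects negative values, which a linear model can produce; how to handle that case is left open here.

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error

# Made-up targets and two sets of predictions with the same absolute error
y_true = np.array([10.0, 10.0])
under = np.array([7.0, 7.0])   # 3 days too short
over = np.array([13.0, 13.0])  # 3 days too long

msle_under = mean_squared_log_error(y_true, under)
msle_over = mean_squared_log_error(y_true, over)

print(msle_under > msle_over)  # True: undershooting is penalized more
```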

👇 Engineer a scoring metric that preserves the magnitude of the target and the direction of the errors made.

<details>
<summary>💡 Hint</summary>
    
Computing the mean difference between predicted and true values preserves the sign of the errors.
    
</details>

In [3]:
# YOUR CODE HERE

👇 Encapsulate the metric you just engineered in a reusable function

In [4]:
def directed_error(y, y_pred):
    """TODO: returns the error"""
    pass
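
One way to fill in the stub, following the hint above: the mean of `y_pred - y`. This is a sketch rather than the expected solution. The result is in days, positive when the model overshoots on average and negative when it undershoots.

```python
import numpy as np

def directed_error(y, y_pred):
    """Return the mean signed error (predicted minus true), in target units."""
    return float(np.mean(np.asarray(y_pred) - np.asarray(y)))

# Made-up values to check the sign convention
print(directed_error([10, 12], [12, 13]))  # 1.5  (overshoots on average)
print(directed_error([10, 12], [9, 10]))   # -1.5 (undershoots on average)
```

Keep in mind that positive and negative errors cancel out in a mean, so a value near zero does not imply small errors.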

👇 Test the function out on the predictions of the Linear Regression model

In [5]:
# YOUR CODE HERE

## 4. KNN Regressor

👇 Instantiate a `KNeighborsRegressor` model with `n_neighbors=3` and make cross-validated predictions.

In [14]:
# Instantiate model

# Make cross-validated predictions 
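
A sketch on synthetic data (`X_toy`, `y_toy` and `cv=5` are illustrative choices). With the default uniform weights, each prediction is the average of the three nearest training targets, so predictions stay within the range of the observed targets.

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsRegressor

# Synthetic features and positive integer targets, for illustration only
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 3))
y_toy = rng.integers(low=1, high=30, size=60)

knn = KNeighborsRegressor(n_neighbors=3)
y_pred_toy = cross_val_predict(knn, X_toy, y_toy, cv=5)

print(y_pred_toy.shape)  # (60,)
```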


👇 Score the model using a known metric that penalizes underestimated predictions.

In [6]:
# YOUR CODE HERE

👇 Score with your engineered metric function

In [7]:
# YOUR CODE HERE

## 5. Model Selection

Considering that it is better to overshoot the predicted delivery date...
 
❓ Which of the two models would you choose based on the mean squared log error score?

<details>
<summary>Answer</summary>

Since the MSLE penalises underestimated predictions more heavily than overestimated ones, the model best suited to the task is the one with the lowest MSLE score.
    
</details>

❓ Which of the two models would you choose based on your metric's score?

<details>
<summary>Answer</summary>

The model best suited to the task is the one with a positive mean error (or, failing that, the least negative one). If the average error is positive, the model tends to overshoot the delivery time.
</details>

Do the two metrics point to different models, leaving you unable to pick one?

👇 Decide on a tie-breaking scoring metric of your choice. You could, for instance, pick a metric that heavily penalizes large outlier errors.

In [8]:
# YOUR CODE HERE

In [9]:
# YOUR CODE HERE
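
As one illustration of an outlier-sensitive tie-breaker, the sketch below compares two made-up sets of predictions that share the same mean absolute error but differ in how the error is spread. RMSE and `max_error` are example choices, not the required answer.

```python
import numpy as np
from sklearn.metrics import max_error, mean_absolute_error, mean_squared_error

# Made-up targets and predictions
y_true = np.array([10.0, 10.0, 10.0, 10.0])
steady = np.array([12.0, 12.0, 12.0, 12.0])  # always 2 days off
spiky = np.array([10.0, 10.0, 10.0, 18.0])   # exact, except one 8-day miss

for name, y_pred in [("steady", steady), ("spiky", spiky)]:
    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    worst = max_error(y_true, y_pred)
    print(name, mae, rmse, worst)
# steady 2.0 2.0 2.0
# spiky 2.0 4.0 8.0
```

Both sets have the same average miss, but squaring the errors (RMSE) or taking the worst case (`max_error`) singles out the model with the large miss.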

## 6. Predictions

👇 Pick a model and use it to tell a customer how many days the following order, which she just placed, will take to arrive. Don't forget to reuse the original `ColumnTransformer` to preprocess the new data.


<details>
<summary>💡 Hint</summary>

The model has been cross-validated, but not fitted. You need to fit it before making predictions.

</details>

In [None]:
new_data = pd.read_csv('new_data.csv')

new_data

In [11]:
# Preprocess new data

# Fit the model

# Predict new data
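
A self-contained sketch of the fit-then-predict flow on made-up data. The frames, the reduced column set and the choice of `KNeighborsRegressor` are all illustrative assumptions. The key point is that the new order goes through `transform`, not `fit_transform`, so it is encoded and scaled with the statistics learned from the training data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical training data with a reduced set of columns
train = pd.DataFrame({
    "customer_state": ["RJ", "SP", "MG", "RJ", "SP", "MG"],
    "seller_state": ["SP", "SP", "RJ", "SP", "RJ", "SP"],
    "product_weight_g": [1825, 700, 300, 1200, 950, 400],
})
y_train = np.array([10, 4, 12, 11, 6, 9])  # made-up days until delivery

preprocessor = ColumnTransformer([
    ("states", OneHotEncoder(handle_unknown="ignore"), ["customer_state", "seller_state"]),
    ("product", StandardScaler(), ["product_weight_g"]),
])

# Fit the transformer and the model on the training data
X_train = preprocessor.fit_transform(train)
model = KNeighborsRegressor(n_neighbors=3).fit(X_train, y_train)

# Hypothetical new order: reuse the fitted transformer with transform()
new_order = pd.DataFrame({
    "customer_state": ["RJ"],
    "seller_state": ["SP"],
    "product_weight_g": [1500],
})
X_new = preprocessor.transform(new_order)
predicted_days = model.predict(X_new)

print(predicted_days.shape)  # (1,)
```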


# 🏁