In [None]:
import os
import pandas as pd
import json
import joblib
import pickle
import requests
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.model_selection import cross_val_score
from category_encoders import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
import warnings
warnings.filterwarnings('ignore')

The Portuguese bank “Velho Banco” wants to attract wealthier clients for one of its products. To do so, it wants to predict each person's annual income from the information that person provided to the bank.

Your company has been hired to help, and you were assigned to create a service that predicts whether a person's annual income is above 50k. The service should answer Yes or No.

These exercises will guide you on your task, step by step.

### 1. Meet the data

Start by getting familiar with the dataset, in file `salaries.csv`, adapted from [here](https://www.kaggle.com/wenruliu/adult-income-dataset).
Each row in the dataset describes one client and has 5 fields: 4 features and the target.

4 features:

- **age**: what is the age of the client
- **education**: the school education of the client
- **hours-per-week**: number of hours the person works per week
- **native-country**: where the person is from

And the target:
- **salary**: whether it is above or below 50k

We'll consider someone a potential client for the product when their salary is above 50k.

In [None]:
df = pd.read_csv(os.path.join("data", "salaries.csv"))
df.head()

In [None]:
# Create a DataFrame with the 4 features: age, education, hours-per-week, native-country
# Keep them in this order
# X = ...

# YOUR CODE HERE
raise NotImplementedError()


# Create a series with the target: salary
# y = ...
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert isinstance(X, pd.DataFrame)
assert X.columns.tolist() == ["age", "education", "hours-per-week", "native-country"]

assert isinstance(y, pd.Series)

### 2. Build a model

Build a scikit-learn model that predicts whether a person earns more than 50k, based on the features that you have available. Your model should be delivered as a scikit-learn [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline.predict_proba).

Don't worry too much about the model's performance, anything better than random works! We'll focus on model performance in the next BLUs.
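As a rough illustration of the shape such a pipeline can take, here is a minimal sketch on made-up toy data. It assumes scikit-learn's own `OneHotEncoder` (rather than the `category_encoders` one imported above) and a `LogisticRegression`; the `toy_*` names and the data are invented for this example and are not the expected solution.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up toy data with the same column names as the exercise (not the real dataset)
rows = []
for i in range(20):
    high = i % 2  # alternate between the two classes
    rows.append({
        "age": 30 + i + 10 * high,
        "education": "Masters" if high else "HS-grad",
        "hours-per-week": 35 + 10 * high,
        "native-country": "Portugal" if i % 4 < 2 else "United-States",
        "salary": high,
    })
toy = pd.DataFrame(rows)
X_toy = toy[["age", "education", "hours-per-week", "native-country"]]
y_toy = toy["salary"]

# One-hot encode the categorical columns, pass the numeric ones through unchanged
encoder = ColumnTransformer(
    [("categorical", OneHotEncoder(handle_unknown="ignore"),
      ["education", "native-country"])],
    remainder="passthrough",
)
toy_pipeline = make_pipeline(encoder, LogisticRegression(max_iter=1000))

# 5-fold cross validation scored with ROC AUC, then a fit on all the toy data
toy_scores = cross_val_score(toy_pipeline, X_toy, y_toy, cv=5, scoring="roc_auc")
toy_pipeline.fit(X_toy, y_toy)
toy_proba = toy_pipeline.predict_proba(X_toy.head(1))
```

Setting `handle_unknown="ignore"` is one way to keep the service from failing when it receives a category that was not present in the training data.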

In [None]:
# Create the pipeline with your model
# pipeline = ...
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert isinstance(pipeline, Pipeline)

In [None]:
# Use cross validation with 5 folds and ROC_AUC as metric, to check your model's performance
# roc_aucs = cross_val_score(...)
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert roc_aucs.mean() > 0.5

In [None]:
# Now fit the pipeline to all the training data
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert pipeline.predict_proba(X.head(1)).shape == (1, 2)

### 3. Serialize all the things!

Now we need to serialize three things:

1. The column names in the correct order
1. The fitted pipeline
1. The dtypes of the columns of the training set
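The round trip for the first and third items can be sketched with the standard library alone. This sketch uses plain Python stand-ins (a list in place of `X.columns.tolist()`, a dict with illustrative dtype names in place of `X.dtypes`) and a throwaway directory instead of `TMP_DIR`; all of these are assumptions made for the example.

```python
import json
import os
import pickle
import tempfile

# Stand-ins for the real objects; in the notebook these would come from X
columns = ["age", "education", "hours-per-week", "native-country"]
dtypes = {"age": "int64", "education": "object",
          "hours-per-week": "int64", "native-country": "object"}

# A throwaway directory, used here instead of TMP_DIR
out_dir = tempfile.mkdtemp()

# JSON suits the column names: it is plain text and human readable
with open(os.path.join(out_dir, "columns.json"), "w") as fh:
    json.dump(columns, fh)

# pickle handles arbitrary Python objects; note the binary mode
with open(os.path.join(out_dir, "dtypes.pickle"), "wb") as fh:
    pickle.dump(dtypes, fh)

# Read both files back to check that nothing was lost
with open(os.path.join(out_dir, "columns.json")) as fh:
    columns_loaded = json.load(fh)
with open(os.path.join(out_dir, "dtypes.pickle"), "rb") as fh:
    dtypes_loaded = pickle.load(fh)
```

The same `json.dump` and `pickle.dump` calls work on the real column list and on a pandas dtypes Series.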

In [None]:
# This is a temporary directory where your serialized files will be saved
# You can change it while working on the exercises locally,
# but change it back to TMP_DIR = '/tmp' before submitting the exercises,
# otherwise grading will fail
TMP_DIR = '/tmp'

In [None]:
# Serialize the column names from the X DataFrame into a file named columns.json
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
with open(os.path.join(TMP_DIR, "columns.json"), 'r') as fh:
    columns = json.load(fh)
    
assert columns == X.columns.tolist()

In [None]:
# Pickle the dtypes of the columns from the X DataFrame into a file named dtypes.pickle
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
with open(os.path.join(TMP_DIR, "dtypes.pickle"), 'rb') as fh:
    dtypes = pickle.load(fh)
    
assert dtypes.equals(X.dtypes)

In [None]:
# Pickle the fitted pipeline into a file named pipeline.pickle (the check below loads it with joblib)
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
pipeline_recovered = joblib.load(os.path.join(TMP_DIR, "pipeline.pickle"))

assert isinstance(pipeline_recovered, Pipeline)
assert pipeline_recovered.predict_proba(X.head(1)).shape == (1, 2)

### 4. Create a new repo for your service

Now it's time to create a new repo for your service. As you learned in the README of the [heroku-model-deploy repository](https://github.com/LDSSA/heroku-model-deploy), duplicate the heroku-model-deploy repo.

From this point on, you should code on the new repo. The remaining exercises in this notebook are questions meant to check if your service is working as expected.

After you've set up your new repo, copy the following things over there:
- `columns.json` file
- `dtypes.pickle` file
- `pipeline.pickle` file
- the package containing custom code in your model (only if you've used it, of course!).

### 5. Build your flask app

#### /predict

At this point, you can either edit the `app.py` file that's in the repo, or start a new file from scratch.
My advice is that you start one from scratch, as it will probably be a better learning experience.

Start by creating a `predict` endpoint that receives POST requests with a JSON payload containing:
- id
- observation, which has 4 fields: age, education, hours-per-week, native-country.

This endpoint should return the proba returned by your model for this observation.
Make sure that each field is in the correct format before passing it to the scikit-learn model. If you receive an observation with an invalid value, return an appropriate error message.
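One possible way to check the format is sketched below. `validate_observation` is a hypothetical helper, and the type rules (integers for `age` and `hours-per-week`, strings for the other two fields) are an assumption about what counts as valid; adapt them to your own data.

```python
def validate_observation(observation):
    """Return (is_valid, error_message) for a single observation dict.

    Hypothetical helper: the expected types below are an assumption.
    """
    expected_types = {
        "age": int,
        "education": str,
        "hours-per-week": int,
        "native-country": str,
    }
    error = "Observation is invalid!"
    if not isinstance(observation, dict):
        return False, error
    for field, expected_type in expected_types.items():
        if field not in observation:
            return False, error
        value = observation[field]
        # bool is a subclass of int, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, expected_type):
            return False, error
    return True, ""


good = {"age": 45, "education": "Bachelors",
        "hours-per-week": 45, "native-country": "United-States"}
bad = dict(good, **{"hours-per-week": "45 hours"})

good_result = validate_observation(good)
bad_result = validate_observation(bad)
```

Keeping the check in a plain function, separate from the endpoint, makes it easy to test without starting the app.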

When a request is received, you should update your local sqlite database with the following:
- id
- observation
- proba
- true_class (which is null for now)

If your app receives an observation with an id that it has seen before, it should return an error message together with the corresponding proba, and it should not store anything.
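The storage and duplicate-id behaviour can be sketched with the standard `sqlite3` module. The table layout, the `prediction` table name, and the `create_table` / `save_prediction` helpers are assumptions made for this example; the template repo may organise its database access differently.

```python
import json
import sqlite3


def create_table(conn):
    """Create the (hypothetical) prediction table if it is missing."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS prediction (
               id INTEGER PRIMARY KEY,
               observation_id INTEGER UNIQUE,
               observation TEXT,
               proba REAL,
               true_class INTEGER
           )"""
    )


def save_prediction(conn, observation_id, observation, proba):
    """Store a new prediction; on a repeated id, return the stored proba."""
    try:
        # The connection context manager commits on success, rolls back on error
        with conn:
            conn.execute(
                "INSERT INTO prediction (observation_id, observation, proba) "
                "VALUES (?, ?, ?)",
                (observation_id, json.dumps(observation), proba),
            )
        return {"proba": proba}
    except sqlite3.IntegrityError:
        # The UNIQUE constraint fired: this id was seen before, nothing is stored
        row = conn.execute(
            "SELECT proba FROM prediction WHERE observation_id = ?",
            (observation_id,),
        ).fetchone()
        return {
            "error": f'Observation ID: "{observation_id}" already exists',
            "proba": row[0],
        }


conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
create_table(conn)
first = save_prediction(conn, 0, {"age": 45}, 0.3)
second = save_prediction(conn, 0, {"age": 45}, 0.9)
```

Leaning on a `UNIQUE` constraint lets the database reject duplicates, so the code does not need a separate lookup before every insert.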

Try the following commands to check that everything is working as expected.

**Command**

```bash
~ > curl -X POST http://localhost:5000/predict -d '{"id": 0, "observation": {"age": 45, "education": "Bachelors", "hours-per-week": 45, "native-country": "United-States"}}' -H "Content-Type:application/json"
```


**Expected output**

```json
{
  "proba": 0.3
}
```

(any proba value works, it depends on your model, of course!)

**Command**

```bash
~ > curl -X POST http://localhost:5000/predict -d '{"id": 0, "observation": {"age": 45, "education": "Bachelors", "hours-per-week": 45, "native-country": "United-States"}}' -H "Content-Type:application/json"
```


**Expected output**

```json
{
  "error": "Observation ID: \"0\" already exists",
  "proba": 0.3
}
```

**Command**

```bash
curl -X POST http://localhost:5000/predict -d '{"id": 1, "observation": {"age": 45, "education": "Bachelors", "hours-per-week": "45 hours", "native-country": "United-States"}}' -H "Content-Type:application/json"
```

**Expected output**
```json
{
  "error": "Observation is invalid!"
}
```

In [None]:
# When the predict endpoint of your flask app is working as expected,
# set variable predict_endpoint_working_fine to True
predict_endpoint_working_fine = False

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert predict_endpoint_working_fine

#### /update

The `update` endpoint should receive POST requests with a JSON payload containing:
- id
- true_class

If there is an observation with `id` in your database, you should update the `true_class` value with the value in the request. The response should be the observation, with the updated true_class value.

Otherwise, you should return an appropriate error message.
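The update logic can be sketched in the same spirit with the standard `sqlite3` module. The `prediction` table, its columns, and the `update_true_class` helper are hypothetical names chosen for this example, not the layout required by the template repo.

```python
import json
import sqlite3


def update_true_class(conn, observation_id, true_class):
    """Set true_class for a stored observation, or report that it is missing."""
    with conn:
        cursor = conn.execute(
            "UPDATE prediction SET true_class = ? WHERE observation_id = ?",
            (true_class, observation_id),
        )
    if cursor.rowcount == 0:
        # No row matched the given id
        return {"error": f'Observation ID: "{observation_id}" does not exist'}
    keys = ["id", "observation", "observation_id", "proba", "true_class"]
    row = conn.execute(
        "SELECT id, observation, observation_id, proba, true_class "
        "FROM prediction WHERE observation_id = ?",
        (observation_id,),
    ).fetchone()
    return dict(zip(keys, row))


# In-memory database with one stored prediction, for illustration only
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE prediction (id INTEGER PRIMARY KEY, observation_id INTEGER UNIQUE, "
    "observation TEXT, proba REAL, true_class INTEGER)"
)
conn.execute(
    "INSERT INTO prediction (observation_id, observation, proba) VALUES (?, ?, ?)",
    (0, json.dumps({"age": 45}), 0.3),
)

updated = update_true_class(conn, 0, 1)
missing = update_true_class(conn, 3, 1)
```

Because the observation is stored as a JSON string, it comes back as a string in the response unless you decode it with `json.loads` first.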

Try the following commands to check that everything is working as expected.

**Command**

```bash
~ > curl -X POST http://localhost:5000/update -d '{"id": 0, "true_class": 1}'  -H "Content-Type:application/json"
```


**Expected output**

```json
{
  "id": 1,
  "observation": "{\"age\": 45, \"education\": \"Bachelors\", \"hours-per-week\": 45, \"native-country\": \"United-States\"}",
  "observation_id": 0,
  "proba": 0.3,
  "true_class": 1
}
```

**Command**

```bash
~ > curl -X POST http://localhost:5000/update -d '{"id": 3, "true_class": 1}'  -H "Content-Type:application/json"
```


**Expected output**

```json
{
  "error": "Observation ID: \"3\" does not exist"
}
```

In [None]:
# When the update endpoint of your flask app is working as expected,
# set variable update_endpoint_working_fine to True
update_endpoint_working_fine = False

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert update_endpoint_working_fine

### 6. Deploy your app to heroku

Follow the instructions on the Learning part of this BLU to deploy your app to heroku.

To check that your app is working correctly on heroku, re-run the previous commands, replacing `http://localhost:5000` with the URL of your heroku app (like `https://<your-app-name>.herokuapp.com`). For instance, the first command would be:


**Command**

```bash
~ > curl -X POST https://<your-app-name>.herokuapp.com/predict -d '{"id": 0, "observation": {"age": 45, "education": "Bachelors", "hours-per-week": 45, "native-country": "United-States"}}' -H "Content-Type:application/json"
```


**Expected output**

```json
{
  "proba": 0.3
}
```

In [None]:
# In this test, we will call your app to check if it's working as expected
# Assign the variable APP_NAME to the name of your heroku app
# APP_NAME = ...

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# Testing the /predict endpoint

url = f"http://{APP_NAME}.herokuapp.com/predict"
payload = {
    "id": 0,
    "observation": {"age": 45,
                    "education": "Bachelors",
                    "hours-per-week": 45,
                    "native-country": "United-States"}
}

r = requests.post(url, json=payload)

assert isinstance(r, requests.Response)
assert r.ok
assert "proba" in r.json()
assert isinstance(r.json()["proba"], float)
assert 0 <= r.json()["proba"] <= 1

In [None]:
# Testing the /update endpoint

url = f"http://{APP_NAME}.herokuapp.com/update"
payload = {
    "id": 0,
    "true_class": 1
}

r = requests.post(url, json=payload)

assert isinstance(r, requests.Response)
assert r.ok
assert "observation" in r.json()
assert "proba" in r.json()
assert "true_class" in r.json()
assert r.json()["true_class"] == 1