In [None]:
import os
import pandas as pd
import json
import joblib
import pickle
import requests
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

The Portuguese bank “Velho Banco” wants to attract wealthier clients for one of its products. To do so, it wants to predict each person's annual income from the information that person provided to the bank.

Your company has been hired to help with this situation, and you were assigned to create a service that should predict if a person's annual income is above 50k. The service should answer with a Yes or No answer.

These exercises will guide you in your task, step by step.

Start by getting familiar with the dataset in the file `salaries.csv`, adapted from [here](https://www.kaggle.com/wenruliu/adult-income-dataset).
Each row in the dataset is about one client and has 4 features:

- **age**: the client's age
- **education**: the client's level of schooling
- **hours-per-week**: the number of hours the client works per week
- **native-country**: the country the client is from

And the target:
- **salary**: whether the client's annual income is above or below 50k

We want to select clients with an annual income above 50k to offer them a financial product.

In [None]:
df = pd.read_csv(os.path.join("data", "salaries.csv"))
df.head()

## Exercise 1 - Meet the data
Separate the data into features and target. Features should be stored in a dataframe named `X` with the columns age, education, hours-per-week, native-country, in this order. The target should be stored in a series named `y`.
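
As a rough illustration of the split described above, here is a minimal sketch on a made-up two-row dataframe. The names `toy_df`, `toy_X`, `toy_y` and `feature_columns` are hypothetical, and the rows are invented; only the column names come from the dataset description.

```python
import pandas as pd

# Toy stand-in for salaries.csv; the rows are made up for illustration.
toy_df = pd.DataFrame({
    "age": [39, 50],
    "education": ["Bachelors", "HS-grad"],
    "hours-per-week": [40, 13],
    "native-country": ["United-States", "Portugal"],
    "salary": [" <=50K", " >50K"],
})

feature_columns = ["age", "education", "hours-per-week", "native-country"]

# Selecting with a list keeps the columns in the requested order
# and returns a DataFrame.
toy_X = toy_df[feature_columns]

# Selecting a single column name returns a Series.
toy_y = toy_df["salary"]
```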

In [None]:
# X = ...
# y = ...

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert isinstance(X, pd.DataFrame)
assert X.columns.tolist() == ["age", "education", "hours-per-week", "native-country"]
assert isinstance(y, pd.Series)
assert y[0]==' <=50K'

## Exercise 2 - Model

### Exercise 2.1 - Set up the model

Build a scikit-learn model that predicts whether a person earns more than 50k, based on the features that you have available. Your model should be delivered as a scikit-learn [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html). Include any necessary preprocessing steps in the pipeline.

Don't worry too much about the model's performance, anything better than random works! We'll focus on model performance in the next BLUs.
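
One possible shape for such a pipeline is sketched below, assuming the two text columns are one-hot encoded and the numeric columns are passed through unchanged. The choice of `LogisticRegression`, the step names, and the toy rows are assumptions for illustration, not the expected solution.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

categorical = ["education", "native-country"]

# One-hot encode the text columns; unseen categories at prediction time
# are encoded as all zeros instead of raising an error.
preprocess = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(handle_unknown="ignore"), categorical)],
    remainder="passthrough",  # keep age and hours-per-week as they are
)

sketch_pipeline = Pipeline([
    ("preprocess", preprocess),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# Made-up rows, only to show that the pipeline fits and predicts.
toy_X = pd.DataFrame({
    "age": [25, 52, 31, 47],
    "education": ["HS-grad", "Masters", "HS-grad", "Bachelors"],
    "hours-per-week": [30, 50, 40, 45],
    "native-country": ["Portugal", "United-States", "Portugal", "United-States"],
})
toy_y = pd.Series([" <=50K", " >50K", " <=50K", " >50K"])

sketch_pipeline.fit(toy_X, toy_y)
proba = sketch_pipeline.predict_proba(toy_X.head(1))
```

Keeping the encoder inside the pipeline means the serialized object carries its own preprocessing, so the service can pass raw observations straight to it.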

In [None]:
# pipeline = ...
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert isinstance(pipeline, Pipeline)

### Exercise 2.2 - Cross-validation
Use cross-validation with 5 folds and ROC AUC as the metric (`scoring="roc_auc"` in scikit-learn) to check your model's performance. Any mean score better than 0.5 is acceptable. If you don't reach 0.5, go back to the previous exercise and redefine your model.
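
The call itself might look like the sketch below, shown on a made-up, single-feature dataset with a plain `LogisticRegression` rather than the full pipeline. The data and variable names are hypothetical.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Made-up data: 20 samples, one numeric feature, balanced binary target.
toy_X = [[value] for value in range(20)]
toy_y = [0] * 10 + [1] * 10

# cv=5 requests five folds; for a classifier the folds are stratified,
# so each test fold contains both classes and ROC AUC is defined.
scores = cross_val_score(
    LogisticRegression(), toy_X, toy_y, cv=5, scoring="roc_auc"
)

print(scores.mean())
```

The result is an array with one score per fold, which is why the check above looks at both its length and its mean.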

In [None]:
# roc_aucs = cross_val_score(...)
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert roc_aucs.mean() > 0.5
assert len(roc_aucs)==5

### Exercise 2.3 - Fit the model
Fit the pipeline to all the training data.

In [None]:
# pipeline ...
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert pipeline.predict_proba(X.head(1)).shape == (1, 2)

## Exercise 3 - Serialize everything

Now we need to serialize the model to be able to use it in the app. You need to serialize three things:

1. The column names in the correct order
1. The fitted pipeline
1. The dtypes of the columns of the training set

In [None]:
# This is a temporary directory where your serialized files will be saved
# You can change it while working on the exercises locally,
# but change it back to TMP_DIR = '/tmp' before submitting the exercises,
# otherwise the grading will fail
TMP_DIR = '/tmp'

### Exercise 3.1 - Serialize the column names
Serialize the column names of the `X` dataframe into a file named `columns.json`, saved in `TMP_DIR`.
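
A minimal sketch of the round trip with the standard `json` module follows. It writes to a throwaway temporary directory instead of `TMP_DIR`, and `feature_columns` is a hypothetical stand-in for `X.columns.tolist()`.

```python
import json
import os
import tempfile

# Stand-in for X.columns.tolist(); a plain list is JSON-serializable,
# whereas a pandas Index is not.
feature_columns = ["age", "education", "hours-per-week", "native-country"]

out_dir = tempfile.mkdtemp()  # throwaway directory for this sketch
columns_path = os.path.join(out_dir, "columns.json")

with open(columns_path, "w") as fh:
    json.dump(feature_columns, fh)

# Reading the file back gives the same list, in the same order.
with open(columns_path) as fh:
    recovered_columns = json.load(fh)
```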

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
with open(os.path.join(TMP_DIR, "columns.json"), 'r') as fh:
    columns = json.load(fh)
    
assert columns == X.columns.tolist()

### Exercise 3.2 - Pickle the dtypes
Pickle the dtypes of the columns from the `X` dataframe into a file named `dtypes.pickle`, saved in `TMP_DIR`.
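
The same round trip with `pickle` could be sketched as follows, again on a made-up dataframe and a throwaway directory. `toy_X` and `dtypes_path` are hypothetical names.

```python
import os
import pickle
import tempfile

import pandas as pd

# Made-up one-row dataframe; only its dtypes matter here.
toy_X = pd.DataFrame({"age": [39], "education": ["Bachelors"]})

out_dir = tempfile.mkdtemp()  # throwaway directory for this sketch
dtypes_path = os.path.join(out_dir, "dtypes.pickle")

# DataFrame.dtypes is a Series mapping column name -> dtype.
# Pickle files must be opened in binary mode.
with open(dtypes_path, "wb") as fh:
    pickle.dump(toy_X.dtypes, fh)

with open(dtypes_path, "rb") as fh:
    recovered_dtypes = pickle.load(fh)
```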

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
with open(os.path.join(TMP_DIR, "dtypes.pickle"), 'rb') as fh:
    dtypes = pickle.load(fh)
    
assert dtypes.equals(X.dtypes)

### Exercise 3.3 - Pickle the model
Pickle the fitted pipeline into a file named `pipeline.pickle`, saved in `TMP_DIR`. The check below loads the file with `joblib.load`.
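
One way to do this is with `joblib.dump`, sketched below on a tiny, made-up pipeline written to a throwaway directory. The pipeline contents and names are hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Tiny made-up pipeline, fitted on four samples with one feature.
toy_pipeline = Pipeline([("classifier", LogisticRegression())])
toy_pipeline.fit([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])

out_dir = tempfile.mkdtemp()  # throwaway directory for this sketch
pipeline_path = os.path.join(out_dir, "pipeline.pickle")

# joblib.dump writes the fitted object; joblib.load restores it.
joblib.dump(toy_pipeline, pipeline_path)
recovered_pipeline = joblib.load(pipeline_path)

recovered_proba = recovered_pipeline.predict_proba([[1.5]])
```

Serialized pipelines are generally only safe to load with the same scikit-learn version that produced them, so it helps to pin that version in the service's requirements.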

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
pipeline_recovered = joblib.load(os.path.join(TMP_DIR, "pipeline.pickle"))

assert isinstance(pipeline_recovered, Pipeline)
assert pipeline_recovered.predict_proba(X.head(1)).shape == (1, 2)

## Exercise 4 - Create a new repo for your service

Now it's time to create a new repo for your service. Duplicate the [railway-model-deploy repository](https://github.com/LDSSA/railway-model-deploy) if you haven't done so yet.

From this point on, you should code on the new repo. The remaining exercises in this notebook are questions meant to check if your service is working as expected.

After you've set up your new repo, copy the following things over there:
- `columns.json` file
- `dtypes.pickle` file
- `pipeline.pickle` file
- any packages with custom code that your model depends on (only if you've used any, of course!).

## Exercise 5 - Build your flask app

### Exercise 5.1 - /predict endpoint

At this point, you can either edit the `app.py` file that's in the repo, or start a new file from scratch.
My advice is that you start one from scratch, as it will probably be a better learning experience.

Start by creating a `predict` endpoint that receives POST requests with a JSON payload containing:
- id
- observation, which has 4 fields: age, education, hours-per-week, native-country.

This endpoint should return the probability returned by your model for this observation.
Make sure that each field is in the correct format before passing it to the scikit model. If you receive an observation with an invalid value, return an appropriate error message.
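
A minimal sketch of such a check, using only the standard library, is shown below. The `EXPECTED_FIELDS` schema and the `validate_observation` helper are hypothetical; the expected types are inferred from the example commands in this exercise, and a real service may want stricter rules (value ranges, known categories).

```python
# Hypothetical schema: field name -> expected Python type.
EXPECTED_FIELDS = {
    "age": int,
    "education": str,
    "hours-per-week": int,
    "native-country": str,
}


def validate_observation(observation):
    """Return True if the observation has exactly the expected fields and types."""
    if not isinstance(observation, dict):
        return False
    if set(observation) != set(EXPECTED_FIELDS):
        return False
    for field, expected_type in EXPECTED_FIELDS.items():
        value = observation[field]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            return False
    return True


good = {"age": 45, "education": "Bachelors",
        "hours-per-week": 45, "native-country": "United-States"}
bad = dict(good, **{"hours-per-week": "45 hours"})

print(validate_observation(good), validate_observation(bad))
```

When the helper returns `False`, the endpoint can answer with an error payload such as `{"error": "Observation is invalid!"}` instead of calling the model.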

When a request is received, you should update your local sqlite database with the following:
- id
- observation
- proba
- true_class (which is null for now)

If your app receives an observation with an id that it has seen before, it should return an error message together with the corresponding stored probability, and not store anything.
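
The storage rule above can be sketched with the standard `sqlite3` module, as below. This is only an illustration under assumptions: the table name, the schema, and the `store_prediction` helper are hypothetical, the database is in memory, and your repo may use an ORM instead of raw SQL.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database for this sketch
conn.execute(
    """
    CREATE TABLE prediction (
        id INTEGER PRIMARY KEY,
        observation_id INTEGER UNIQUE,
        observation TEXT,
        proba REAL,
        true_class INTEGER
    )
    """
)


def store_prediction(conn, observation_id, observation, proba):
    """Store a new prediction; for a repeated id, return an error and the stored proba."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO prediction (observation_id, observation, proba) "
                "VALUES (?, ?, ?)",
                (observation_id, json.dumps(observation), proba),
            )
    except sqlite3.IntegrityError:
        # The UNIQUE constraint rejected the id, so nothing was stored.
        row = conn.execute(
            "SELECT proba FROM prediction WHERE observation_id = ?",
            (observation_id,),
        ).fetchone()
        return {
            "error": f'Observation ID: "{observation_id}" already exists',
            "proba": row[0],
        }
    return {"proba": proba}


first = store_prediction(conn, 0, {"age": 45}, 0.11)
second = store_prediction(conn, 0, {"age": 45}, 0.5)
```

Relying on the `UNIQUE` constraint, rather than checking for the id first, keeps the duplicate check and the insert in a single step.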

Use the following three commands to check that everything is working as expected.

**Command 1**

```bash
curl -X POST http://localhost:5000/predict -d '{"id": 0, "observation": {"age": 45, "education": "Bachelors", "hours-per-week": 45, "native-country": "United-States"}}' -H "Content-Type:application/json"
```


**Expected output 1**

```json
{
  "proba": 0.11
}
```

(any proba value works, it depends on your model, of course!)

**Command 2**

```bash
curl -X POST http://localhost:5000/predict -d '{"id": 0, "observation": {"age": 45, "education": "Bachelors", "hours-per-week": 45, "native-country": "United-States"}}' -H "Content-Type:application/json"
```


**Expected output 2**

```json
{
  "error": "Observation ID: \"0\" already exists",
  "proba": 0.11
}
```

**Command 3**

```bash
curl -X POST http://localhost:5000/predict -d '{"id": 1, "observation": {"age": 45, "education": "Bachelors", "hours-per-week": "45 hours", "native-country": "United-States"}}' -H "Content-Type:application/json"
```

**Expected output 3**
```json
{
  "error": "Observation is invalid!"
}
```

This exercise notebook will not evaluate if your flask app is working as expected, so we will trust your judgement.
To pass this exercise, simply set `predict_endpoint_working_fine = True`.

In [None]:
predict_endpoint_working_fine = False

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert predict_endpoint_working_fine

### Exercise 5.2 - /update endpoint

The update endpoint is for adding the true class to existing observations. It should receive POST requests and a JSON payload with:
- id
- true_class

If there is an observation with `id` in your database, you should update the `true_class` value with the value in the request. The response should be the observation, with the updated `true_class` value.

Otherwise, you should return an appropriate error message.
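
The update logic can be sketched in the same spirit with the standard `sqlite3` module. The table layout and the `update_true_class` helper are hypothetical, and the single stored row is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database for this sketch
conn.row_factory = sqlite3.Row      # rows can be converted to dicts
conn.execute(
    """
    CREATE TABLE prediction (
        id INTEGER PRIMARY KEY,
        observation_id INTEGER UNIQUE,
        observation TEXT,
        proba REAL,
        true_class INTEGER
    )
    """
)
# Made-up stored prediction for observation id 0.
conn.execute(
    "INSERT INTO prediction (observation_id, observation, proba) "
    "VALUES (0, '{}', 0.3)"
)


def update_true_class(conn, observation_id, true_class):
    """Set true_class for an existing observation, or return an error dict."""
    with conn:  # commits on success, rolls back on error
        cursor = conn.execute(
            "UPDATE prediction SET true_class = ? WHERE observation_id = ?",
            (true_class, observation_id),
        )
    if cursor.rowcount == 0:
        # No row matched, so the id is unknown.
        return {"error": f'Observation ID: "{observation_id}" does not exist'}
    row = conn.execute(
        "SELECT * FROM prediction WHERE observation_id = ?",
        (observation_id,),
    ).fetchone()
    return dict(row)


updated = update_true_class(conn, 0, 1)
missing = update_true_class(conn, 3, 1)
```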

Use the following two commands to check that everything is working as expected.

**Command 1**

```bash
curl -X POST http://localhost:5000/update -d '{"id": 0, "true_class": 1}'  -H "Content-Type:application/json"
```


**Expected output 1**

```json
{
  "id": 1,
  "observation": "{\"age\": 45, \"education\": \"Bachelors\", \"hours-per-week\": 45, \"native-country\": \"United-States\"}",
  "observation_id": 0,
  "proba": 0.3,
  "true_class": 1
}
```

**Command 2**

```bash
curl -X POST http://localhost:5000/update -d '{"id": 3, "true_class": 1}'  -H "Content-Type:application/json"
```


**Expected output 2**

```json
{
  "error": "Observation ID: \"3\" does not exist"
}
```

When the update endpoint of your flask app is working as expected, set the variable `update_endpoint_working_fine` to `True`.

In [None]:
update_endpoint_working_fine = False

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert update_endpoint_working_fine

## Exercise 6 - Deploy your app to railway

Follow the instructions in the learning part of this BLU to deploy your app to railway.

To check that your app is working correctly in railway, re-run the commands from the previous two exercises, but replace `http://localhost:5000` with the URL of your railway app (like `https://<your-app-name>.up.railway.app`). For instance, the first command would be:


**Command**

```bash
curl -X POST https://<your-app-name>.up.railway.app/predict -d '{"id": 0, "observation": {"age": 45, "education": "Bachelors", "hours-per-week": 45, "native-country": "United-States"}}' -H "Content-Type:application/json"
```


**Expected output**

```json
{
  "proba": 0.3
}
```

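In the following test, we will call your app to check if it's working as expected. Set the variable `APP_NAME` to the hostname of your railway app, without the `https://` prefix (like `<your-app-name>.up.railway.app`), because the test builds the URL as `https://{APP_NAME}/predict`.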

In [None]:
# APP_NAME = ...

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# Testing the /predict endpoint

url = f"https://{APP_NAME}/predict"
payload = {
    "id": 0,
    "observation": {"age": 45,
                    "education": "Bachelors",
                    "hours-per-week": 45,
                    "native-country": "United-States"}
}

r = requests.post(url, json=payload)

assert isinstance(r, requests.Response)
assert r.ok
assert "proba" in r.json()
assert isinstance(r.json()["proba"], float)
assert 0 <= r.json()["proba"] <= 1

In [None]:
# Testing the /update endpoint

url = f"https://{APP_NAME}/update"
payload = {
    "id": 0,
    "true_class": 1
}

r = requests.post(url, json=payload)

assert isinstance(r, requests.Response)
assert r.ok
assert "observation" in r.json()
assert "proba" in r.json()
assert "true_class" in r.json()
assert r.json()["true_class"] == 1