## Summary and Answers

### 1) Train a classifier to predict if a transaction is a fraudulent transaction or not?

From Notebooks:

**1.- Split dataset Notebook**

- Because `Time` is measured relative to the first transaction in the dataset, it was transformed with a cyclical encoding so that the feature has the same meaning in a production environment. The dataset was transformed first and then split into train and test sets.


- The data were split in a stratified manner to maintain the class proportions across the train and test samples.
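
The two steps above can be sketched as follows. This is a minimal, standard-library illustration and not the notebook's actual code: the unit of `Time` (seconds), the 24-hour period, the 80/20 ratio and the function names `encode_time_cyclical` and `stratified_split` are all assumptions made for the example.

```python
import math
import random
from collections import defaultdict

SECONDS_PER_DAY = 24 * 60 * 60  # assumed period: one day, with Time in seconds


def encode_time_cyclical(seconds, period=SECONDS_PER_DAY):
    """Map elapsed seconds onto a circle so 23:59 and 00:00 end up close."""
    angle = 2 * math.pi * (seconds % period) / period
    return math.sin(angle), math.cos(angle)


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx), keeping each class's share in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels: 10 fraud cases among 1000 transactions (1% positive rate).
labels = [1] * 10 + [0] * 990
train_idx, test_idx = stratified_split(labels)
train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
```

Both parts keep the 1% fraud rate of the toy data. With scikit-learn, `train_test_split(X, y, stratify=y)` performs the same kind of split.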


**2.- Model Experimentation Notebook**

This notebook experiments with many models using the PyCaret AutoML framework. Because the features' meanings are confidential, we cannot bring domain intuition to the data directly. The process covered three main data scenarios: almost raw data, fixing the class imbalance with SMOTE, and automated feature engineering. The final model was chosen based on these experiments.

The main PyCaret results for the three chosen models were:

| Method/Model | ExtraTrees | XGBoost | LDA |
| --- | --- | --- | --- |
| Normalize | 86% | 85.64% | 81.55% |
| SMOTE | 86% | 80% | 16% |
| Auto Feat. Eng | 80.5% | 82% | 81% |

<center>Values are F1-scores</center>

Later, the experiments focused on the **AUC-PR** metric. It was chosen because the target class is imbalanced and because it gives a better picture of how the model handles the precision-recall tradeoff.
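
To make the metric concrete, here is a minimal sketch of average precision, one common estimator of the area under the precision-recall curve. The notebooks do not state which estimator was used, so treat the function `average_precision` and the toy scores as illustrative assumptions.

```python
def average_precision(y_true, scores):
    """Area under the precision-recall curve as a step-wise sum.

    AP = sum over ranked predictions of (recall_k - recall_{k-1}) * precision_k.
    Tied scores are processed one at a time, a simplification for brevity.
    """
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("average precision is undefined without positive samples")

    # Rank predictions from the highest to the lowest score.
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    tp = 0
    ap = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            tp += 1
            # Precision at this rank times the recall step (1 / n_pos).
            ap += (tp / rank) / n_pos
    return ap


# Toy example: two fraud cases (label 1) among four transactions.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
ap = average_precision(y_true, scores)
```

Only the ranking of the positive cases matters here, which is why the metric stays informative when negatives vastly outnumber positives.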

The final trained model was a VotingClassifier made up of XGBoost, LDA and ExtraTrees. The models were tuned on the train dataset and then re-trained.
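
As a rough illustration of what a voting ensemble does at prediction time, the sketch below averages the fraud probabilities of three models and applies a threshold. It assumes soft voting (the text does not say whether soft or hard voting was used), and the function `soft_vote` and all probabilities are made up for the example.

```python
def soft_vote(probabilities_per_model, threshold=0.5):
    """Average each model's fraud probability and apply a decision threshold."""
    n_models = len(probabilities_per_model)
    n_samples = len(probabilities_per_model[0])
    averaged = [
        sum(model[i] for model in probabilities_per_model) / n_models
        for i in range(n_samples)
    ]
    return [int(p >= threshold) for p in averaged], averaged


# Made-up fraud probabilities for three transactions from three hypothetical models.
xgboost_probs = [0.90, 0.20, 0.55]
lda_probs = [0.70, 0.10, 0.30]
extra_trees_probs = [0.80, 0.30, 0.50]

labels, averaged = soft_vote([xgboost_probs, lda_probs, extra_trees_probs])
```

The third transaction shows the effect of averaging: one model alone would flag it (0.55), but the ensemble average falls below the threshold. In scikit-learn, this behaviour corresponds to `VotingClassifier(estimators=..., voting="soft")`.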

*Note: The functions were built to be reusable. They were also used for quick trial-and-error on the models, but are not necessarily used in the final notebook version. Nevertheless, they can be used in training environments.*

**3.- Model Delivering Notebook**

The main purpose of this notebook is to create the models and artifacts needed for the API pipeline. Two relevant functions are built here: ``build_model`` and ``tuning_job``, both self-explanatory.

Final Score against test dataset: 

| Metric | XGBoost |
| --- | --- |
| AUC-PR | 81.75% |
| F1-score | 80% |
| Recall | 73.4% |
| Precision | 90% |
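
As a quick sanity check, the F1-score can be recomputed from the reported precision and recall, since F1 is their harmonic mean. The snippet below only uses the rounded values from the table, so a small discrepancy is expected.

```python
precision = 0.90  # reported precision (rounded)
recall = 0.734    # reported recall

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 = {f1:.3f}")
```

The result is roughly 0.81, in line with the reported F1-score of about 80% given the rounded inputs.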


<br>


----

<br>

### 2) How would you deploy this model in production?

I would expose the model as a REST API endpoint using a backend framework (Flask, FastAPI, Django, etc.) and Docker. Docker would be used to deploy and expose the service, together with the supporting features of a cloud platform. Starting from an MVP model and a complete end-to-end pipeline, I would keep iterating over model versions to fulfill client requirements.

#### a) Create an API to deploy this model and use it in a dev environment

API repo environment here: [GitHub repo link](https://github.com/PBenavides/credit-fraud/blob/main/dev/app/utils.py)

#### b) API response should be clear and should handle missing data and exceptions.

You can see the API responses and basic error handling in this project's scripts: [routes.py](https://github.com/PBenavides/credit-fraud/blob/main/dev/app/routes.py) and [utils.py](https://github.com/PBenavides/credit-fraud/blob/main/dev/app/utils.py).
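
For illustration, handling missing data before prediction could look like the sketch below. It is not the code in `routes.py` or `utils.py`: the helper `validate_payload` is hypothetical, and the feature list assumes the inputs are named `V1` through `V28` plus `Amount`.

```python
# Assumed feature names: V1..V28 plus Amount (Time is not sent to the API).
EXPECTED_FEATURES = [f"V{i}" for i in range(1, 29)] + ["Amount"]


def validate_payload(payload):
    """Return (clean_features, errors) for a JSON-decoded request body."""
    if not isinstance(payload, dict):
        return None, ["request body must be a JSON object"]

    errors = []
    clean = {}
    for name in EXPECTED_FEATURES:
        value = payload.get(name)
        if value is None:
            errors.append(f"missing value for '{name}'")
        elif isinstance(value, bool) or not isinstance(value, (int, float)):
            errors.append(f"'{name}' must be a number")
        else:
            clean[name] = float(value)
    return (clean, []) if not errors else (None, errors)


# A payload with one missing and one malformed feature.
payload = {name: 0.0 for name in EXPECTED_FEATURES}
del payload["V3"]
payload["Amount"] = "88.67"
features, errors = validate_payload(payload)
```

An endpoint could return the `errors` list with a 400 status code so the caller sees exactly which fields to fix, instead of failing inside the model.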

#### c) You are expected to share how your model could be called. (Postman, Request script)

The API can be called either way once the app is running. There is a script for basic error testing in [tests.py](https://github.com/PBenavides/credit-fraud/tree/main/dev)

````Python
# Request script
import requests

# Input features without 'Time'
sample_data = {
    "V1": -0.365234375,
    "V2": 0.1234415820,
    # ... (features between V2 and V28 omitted here)
    "V28": -0.00994110107421875,
    "Amount": 88.67,
}

api_endpoint = "http://127.0.0.1:5000/predict"  # If running locally

response = requests.post(api_endpoint, json=sample_data)  # POST request

# Prints <Response [200]> if everything is okay; the data is in response.content
print(response)
````

#### d) How could you deploy this model for a production setting? A paragraph only please.

I would build the production environment separately from the training environment, using a framework that lets me expose the model as an API and keeping the model under version control so I can update it when necessary. Then I would package everything in a Docker container so the application can be started on demand and the load distributed according to the number of requests received. This would run on a cloud service, with Kubernetes as the container orchestrator. Finally, a tracking service would monitor performance in production.


#### e) How do you scale your application? A paragraph only please.

With a containerized API, I could use a cloud service to serve containers according to the load on demand. It is also important to monitor the application's resource consumption and memory metrics, in order to know which service can be used and when scaling will be required.


#### f) How do you do performance testing for your API? A paragraph only please.

Testing can be approached from two sides: the API and the model. For the former, I would run automated tests and stress tests against the API, making many requests to find cases where the system fails and tracking the response metrics. On the model side, I would track the input and output distributions of the data to see whether they change and whether the model needs to be retrained. If so, I would first run an A/B test and then replace the production model.
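
The API side of this can be sketched with a small load-test helper. This is a minimal, standard-library illustration under stated assumptions: `load_test`, `timed_call` and `fake_send` are hypothetical names, and `fake_send` stands in for a real HTTP call so the sketch runs without a server.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def timed_call(send):
    """Call send() once and return (succeeded, latency_in_seconds)."""
    start = time.perf_counter()
    try:
        send()
        ok = True
    except Exception:
        ok = False
    return ok, time.perf_counter() - start


def load_test(send, n_requests=100, concurrency=10):
    """Fire n_requests calls through a thread pool and summarize the results."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(lambda _: timed_call(send), range(n_requests)))

    latencies = sorted(latency for _, latency in results)
    failures = sum(1 for ok, _ in results if not ok)
    p95_index = min(len(latencies) - 1, int(0.95 * len(latencies)))
    return {
        "requests": n_requests,
        "failures": failures,
        "median_s": statistics.median(latencies),
        "p95_s": latencies[p95_index],
        "max_s": latencies[-1],
    }


def fake_send():
    """Stand-in for a real HTTP call so the sketch runs without a server."""
    time.sleep(0.001)


report = load_test(fake_send, n_requests=50, concurrency=5)
```

Against the running app, `send` could be something like `lambda: requests.post(api_endpoint, json=sample_data).raise_for_status()`, so that non-2xx responses are counted as failures.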



**Pablo Benavides**