# My Project

In addition to being a place to experiment, this project is structured to build and serve your model in a Flask application.  The goal is to let data science exploration transition easily into deployed services and applications on the OpenShift platform.  Once the project is saved to git, it can be built on OpenShift to serve models.

Your dependencies will live in `requirements.txt` and your prediction function will live in `prediction.py`.  As a Python-based s2i application, this project can be configured and extended to fit your needs.

### Project Organization
```
.
├── README.md
├── LICENSE
├── requirements.txt        <- Used to install packages for s2i application
├── 0_start_here.ipynb      <- Instructional notebook
├── 1_run_flask.ipynb       <- Notebook for running flask locally to test
├── 2_test_flask.ipynb      <- Notebook for testing flask requests
├── .gitignore              <- standard python gitignore
├── .s2i                    <- hidden folder for advanced s2i configuration
│   └── environment         <- s2i environment settings
├── gunicorn_config.py      <- configuration for gunicorn when run in OpenShift
├── prediction.py           <- the predict function called from Flask
└── wsgi.py                 <- basic Flask application
```

### Basic Flow
1. Install and manage dependencies in `requirements.txt`.
1. Experiment as usual.
1. Extract your prediction into the `prediction.py` file.
1. Update any dependencies.
1. Run and test your application locally.
1. Save to git.

For a complete overview, please read the [README.md](./README.md).

## Install Dependencies

In [98]:
import sys
!{sys.executable} -m pip install -r requirements.txt



In [99]:
#Packages related to general operating system & warnings
import os 
import warnings
warnings.filterwarnings('ignore')
#Packages related to data importing, manipulation, exploratory data analysis, data understanding
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
from termcolor import colored as cl # text customization
#Packages related to data visualization
import seaborn as sns
import tensorflow as tf
import matplotlib.pyplot as plt
%matplotlib inline
#Setting plot sizes and type of plot
plt.rc("font", size=14)
plt.rcParams['axes.grid'] = True
plt.figure(figsize=(6,3))
plt.gray()
from matplotlib.backends.backend_pdf import PdfPages
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn import metrics
from sklearn.impute import MissingIndicator, SimpleImputer
from sklearn.preprocessing import  PolynomialFeatures, KBinsDiscretizer, FunctionTransformer
from sklearn.preprocessing import StandardScaler, MinMaxScaler, MaxAbsScaler
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, LabelBinarizer, OrdinalEncoder
import statsmodels.formula.api as smf
import statsmodels.tsa as tsa
from sklearn.linear_model import LogisticRegression, LinearRegression, ElasticNet, Lasso, Ridge
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor #export_graphviz, export
from sklearn.ensemble import BaggingClassifier, BaggingRegressor,RandomForestClassifier,RandomForestRegressor
from sklearn.ensemble import GradientBoostingClassifier,GradientBoostingRegressor, AdaBoostClassifier, AdaBoostRegressor 
from sklearn.svm import LinearSVC, LinearSVR, SVC, SVR
from xgboost import XGBClassifier
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

<Figure size 432x216 with 0 Axes>

## Experiment

Experiment with data and create your prediction function.  Create any serialized models needed.

### Importing Dataset

In [24]:
data = pd.read_csv("dataset/creditcard.csv")

### Data Processing & Understanding

In [28]:
# Let’s check the transaction distribution.

Total_transactions = len(data)
normal = len(data[data.Class == 0])
fraudulent = len(data[data.Class == 1])
# Share of fraud among all transactions (not relative to normal ones only)
fraud_percentage = round(fraudulent / Total_transactions * 100, 2)
print(cl('Total number of transactions is {}'.format(Total_transactions), attrs = ['bold']))
print(cl('Number of normal transactions is {}'.format(normal), attrs = ['bold']))
print(cl('Number of fraudulent transactions is {}'.format(fraudulent), attrs = ['bold']))
print(cl('Percentage of fraudulent transactions is {}'.format(fraud_percentage), attrs = ['bold']))

# Only 0.17% of transactions are fraudulent.

Total number of transactions is 284807
Number of normal transactions is 284315
Number of fraudulent transactions is 492
Percentage of fraudulent transactions is 0.17


In [102]:
# We can also check for null values using the following line of code.
data.info()
data
# We have no null values

<class 'pandas.core.frame.DataFrame'>
Int64Index: 275663 entries, 0 to 284806
Data columns (total 30 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   V1      275663 non-null  float64
 1   V2      275663 non-null  float64
 2   V3      275663 non-null  float64
 3   V4      275663 non-null  float64
 4   V5      275663 non-null  float64
 5   V6      275663 non-null  float64
 6   V7      275663 non-null  float64
 7   V8      275663 non-null  float64
 8   V9      275663 non-null  float64
 9   V10     275663 non-null  float64
 10  V11     275663 non-null  float64
 11  V12     275663 non-null  float64
 12  V13     275663 non-null  float64
 13  V14     275663 non-null  float64
 14  V15     275663 non-null  float64
 15  V16     275663 non-null  float64
 16  V17     275663 non-null  float64
 17  V18     275663 non-null  float64
 18  V19     275663 non-null  float64
 19  V20     275663 non-null  float64
 20  V21     275663 non-null  float64
 21  V22     27

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,0.090794,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,0.244964,0
1,1.191857,0.266151,0.166480,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,-0.166974,...,-0.225775,-0.638672,0.101288,-0.339846,0.167170,0.125895,-0.008983,0.014724,-0.342475,0
2,-1.358354,-1.340163,1.773209,0.379780,-0.503198,1.800499,0.791461,0.247676,-1.514654,0.207643,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,1.160686,0
3,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,-0.054952,...,-0.108300,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,0.140534,0
4,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,0.753074,...,-0.009431,0.798278,-0.137458,0.141267,-0.206010,0.502292,0.219422,0.215153,-0.073403,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
284802,-11.881118,10.071785,-9.834783,-2.066656,-5.364473,-2.606837,-4.918215,7.305334,1.914428,4.356170,...,0.213454,0.111864,1.014480,-0.509348,1.436807,0.250034,0.943651,0.823731,-0.350151,0
284803,-0.732789,-0.055080,2.035030,-0.738589,0.868229,1.058415,0.024330,0.294869,0.584800,-0.975926,...,0.214205,0.924384,0.012463,-1.016226,-0.606624,-0.395255,0.068472,-0.053527,-0.254117,0
284804,1.919565,-0.301254,-3.249640,-0.557828,2.630515,3.031260,-0.296827,0.708417,0.432454,-0.484782,...,0.232045,0.578229,-0.037501,0.640134,0.265745,-0.087371,0.004455,-0.026561,-0.081839,0
284805,-0.240440,0.530483,0.702510,0.689799,-0.377961,0.623708,-0.686180,0.679145,0.392087,-0.399126,...,0.265245,0.800049,-0.163298,0.123205,-0.569159,0.546668,0.108821,0.104533,-0.313249,0


In [39]:
# Checking the minimum and maximum of Amount shows a huge range, which can skew the result.
min(data.Amount), max(data.Amount)
# (0.0, 25691.16)


# We scale the Amount variable with a standard scaler to put it on a comparable scale.
sc = StandardScaler()
amount = data['Amount'].values
data['Amount'] = sc.fit_transform(amount.reshape(-1, 1))
min(data.Amount), max(data.Amount)

(-0.3532293929668236, 102.36224270928423)
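
To make the scaling step concrete, here is a minimal sketch on a few made-up amounts (not values from the dataset). It computes the z-score by hand, which is what `StandardScaler` does by default: subtract the mean and divide by the population standard deviation.

```python
import numpy as np

# Hypothetical amounts for illustration only; not taken from creditcard.csv.
amounts = np.array([0.0, 10.0, 20.0, 30.0, 40.0])

# Standardization: subtract the mean, divide by the population standard deviation.
mean = amounts.mean()
std = amounts.std()  # NumPy's default is ddof=0, the same convention StandardScaler uses
scaled = (amounts - mean) / std

print(scaled)         # the middle value (equal to the mean) maps to 0
print(scaled.mean())  # approximately 0
print(scaled.std())   # approximately 1
```

Scaling changes the units, not the shape of the distribution: the maximum of about 102 in the output above means the largest amount sits roughly 102 standard deviations above the mean.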

In [None]:
# For our modelling process we can drop the Time variable
data.drop(['Time'], axis=1, inplace=True)

In [49]:
# We can remove duplicate transactions
data.drop_duplicates(inplace=True)

In [51]:
data.shape
# We removed around 9000 duplicate transactions

(275663, 30)

### Train & Test Split

Before splitting into train and test sets, we need to define the independent and dependent variables. The independent variables (the features) are conventionally called X, and the dependent variable (the target, here `Class`) is called y.

In [56]:
X = data.drop('Class', axis = 1).values
y = data['Class'].values
print("X:", X)
print("y:", y)

X: [[-1.35980713e+00 -7.27811733e-02  2.53634674e+00 ...  1.33558377e-01
  -2.10530535e-02  2.44964263e-01]
 [ 1.19185711e+00  2.66150712e-01  1.66480113e-01 ... -8.98309914e-03
   1.47241692e-02 -3.42474541e-01]
 [-1.35835406e+00 -1.34016307e+00  1.77320934e+00 ... -5.53527940e-02
  -5.97518406e-02  1.16068593e+00]
 ...
 [ 1.91956501e+00 -3.01253846e-01 -3.24963981e+00 ...  4.45477214e-03
  -2.65608286e-02 -8.18393021e-02]
 [-2.40440050e-01  5.30482513e-01  7.02510230e-01 ...  1.08820735e-01
   1.04532821e-01 -3.13248531e-01]
 [-5.33412522e-01 -1.89733337e-01  7.03337367e-01 ... -2.41530880e-03
   1.36489143e-02  5.14355311e-01]]
y: [0 0 0 ... 0 0 0]


In [57]:
# Now, let's split our train and test data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 1)
print("X_train:", X_train)
print("X_test:", X_test)
print("y_train:", y_train)
print("y_test:", y_test)

X_train: [[-1.18941481  0.39725908  2.12690708 ... -0.01839888  0.07187018
  -0.3172866 ]
 [-0.66764845  0.81572911  2.13129343 ...  0.07571651  0.07869564
  -0.2533572 ]
 [-1.34311641  1.76561272 -2.64943878 ...  0.4547342   0.36914912
  -0.34723226]
 ...
 [-0.67069743  0.55494015  1.66960402 ...  0.27661825  0.18856012
  -0.27326767]
 [ 1.39923244 -0.99462959 -3.81313109 ... -0.02407438  0.03905452
   0.9332348 ]
 [-1.40052078  1.26222445  1.05598404 ... -0.02403393  0.03464592
  -0.3496711 ]]
X_test: [[-1.70571074e+00  1.24622041e+00  1.03899393e+00 ... -2.12988295e-01
   7.13281473e-02 -1.74634883e-01]
 [ 9.49651077e-01 -1.68556670e-01  2.53948905e-01 ...  8.42498850e-03
   4.03484724e-02  2.46483536e-01]
 [-2.88652724e-01  7.78265027e-01  1.69907868e+00 ... -1.15284991e-01
  -1.93754911e-01 -2.33326788e-01]
 ...
 [-1.19573911e+00  2.22014672e-01  2.43145573e+00 ... -6.19290073e-01
  -5.91946583e-01 -1.75763566e-03]
 [-3.53341694e-01  2.11668055e-01  1.50379440e+00 ... -4.19168852e
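
Because fraud is so rare, a purely random split can leave the test set with a somewhat different fraud rate than the training set. One common option, shown here as a sketch on synthetic labels rather than the notebook's data, is to pass `stratify=` to `train_test_split` so both splits keep the same class ratio. The notebook's own split above does not use stratification; the names `X_demo` and `y_demo` are made up for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data for illustration: 1000 rows, 2% positives.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.zeros(1000, dtype=int)
y_demo[:20] = 1

# stratify=y_demo preserves the positive rate in both the train and test splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=1, stratify=y_demo
)

# Compare the positive rate in each split.
print(y_tr.mean(), y_te.mean())
```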

### Model Building

#### Model Training

##### Using XGBoost

In [69]:
# Defining the XGBoost classifier (named xgb so it matches the cells below)
xgb = XGBClassifier(max_depth = 4)

# Fitting the model to the training dataset
xgb.fit(X_train, y_train)



XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
              gamma=0, gpu_id=-1, importance_type=None,
              interaction_constraints='', learning_rate=0.300000012,
              max_delta_step=0, max_depth=4, min_child_weight=1, missing=nan,
              monotone_constraints='()', n_estimators=100, n_jobs=16,
              num_parallel_tree=1, predictor='auto', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)

#### Model Evaluation

In [70]:
xgb_yhat = xgb.predict(X_test)

__Accuracy__

Accuracy is a metric for classification models that measures the number of correct predictions as a percentage of all predictions made. It is a useful metric only when the classes are roughly evenly distributed. If you observe many more data points of one class than of another, as with the fraud data here, accuracy is no longer a useful metric. See: https://towardsdatascience.com/the-f1-score-bec2bbc38aa6
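
To see why accuracy misleads on this kind of data, consider a baseline that labels every transaction as normal. The sketch below builds labels from the class counts printed earlier in this notebook (284315 normal and 492 fraudulent, before duplicates were dropped) rather than from the actual test split.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Labels built from the class counts printed earlier (before de-duplication).
y_true = np.concatenate([np.zeros(284315, dtype=int), np.ones(492, dtype=int)])

# A "model" that never flags fraud.
y_pred = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_pred)
# zero_division=0 returns 0 instead of warning when nothing is predicted positive.
baseline_f1 = f1_score(y_true, y_pred, zero_division=0)

print(baseline_accuracy)  # above 0.998 even though no fraud is caught
print(baseline_f1)        # 0.0
```

A classifier that catches no fraud at all still scores above 99.8% accuracy, so a high accuracy figure alone says little about fraud detection.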

In [71]:
# Let’s check the accuracy of our XGBoost model.
print('Accuracy score of the XGBoost model is {}'.format(accuracy_score(y_test, xgb_yhat)))

Accuracy score of the XGBoost model is 0.999506645771664


__F1 Score__

Precision and Recall are the two most common metrics that take into account class imbalance. They are also the foundation of the F1 score. 

* *Precision*: within everything that has been predicted as a positive, precision counts the percentage that is correct.
* *Recall*: within everything that actually is positive, recall counts the percentage that the model managed to find.

Precision and Recall are the two building blocks of the F1 score. The goal of the F1 score is to combine the precision and recall metrics into a single metric that works well on imbalanced data. It is computed as the harmonic mean of precision and recall: `F1 = 2 * Precision * Recall / (Precision + Recall)`.

Because the F1 score is a harmonic mean, it gives equal weight to Precision and Recall and is pulled toward the lower of the two:
* A model will obtain a high F1 score if both Precision and Recall are high
* A model will obtain a low F1 score if both Precision and Recall are low
* A model will obtain a medium-to-low F1 score if one of Precision and Recall is low and the other is high
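
As a worked example with made-up confusion-matrix counts (not the notebook's results), precision, recall, and F1 can be computed by hand. F1 is the harmonic mean of precision and recall, so it never exceeds their arithmetic mean.

```python
# Hypothetical counts for illustration only.
tp, fp, fn = 80, 10, 30  # true positives, false positives, false negatives

precision = tp / (tp + fp)  # 80 / 90
recall = tp / (tp + fn)     # 80 / 110

# Harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# Plain average, shown only for comparison.
arithmetic_mean = (precision + recall) / 2

print(round(precision, 3), round(recall, 3))
print(round(f1, 3), round(arithmetic_mean, 3))
```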

In [72]:
# Checking F1-Score for the XGBoost model.
print('F1 score of the XGBoost model is {}'.format(f1_score(y_test, xgb_yhat)))

F1 score of the XGBoost model is 0.8495575221238937


We can now save the trained model.

In [88]:
xgb.save_model("models/fraudmodel.json")

## Create a Predict Function

Extract the prediction logic into a `predict` function in a standalone Python file, `prediction.py`.  Also, make sure `requirements.txt` lists any additional packages the prediction needs.

In [129]:
def predict(args_dict):
    return {'prediction': 'not implemented'}


#def predict(sample_transaction):
#    #data['Class'].values

#    single_prediction = xgb.predict(sample_transaction)
#    return {'prediction': single_prediction}

# get sample as numpy array
#array = data.to_numpy()
#sample_transaction = array[:,0]

# get sample as pandas dataframe
#sample_transaction = data.iloc[0]


#predict(sample_transaction)
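
One possible shape for `prediction.py`, offered as a sketch rather than the project's implementation. It assumes the request body carries a hypothetical `features` key holding the 29 model inputs (V1 to V28 plus the scaled Amount) in training order, and it loads the model saved above from `models/fraudmodel.json`. The optional `model` argument and the `load_model` helper are also assumptions, added so the function can be exercised without the model file.

```python
import numpy as np

MODEL_PATH = "models/fraudmodel.json"  # file written by xgb.save_model above
_model = None  # loaded once and reused across requests


def load_model():
    """Load the saved XGBoost classifier on first use (hypothetical helper)."""
    global _model
    if _model is None:
        from xgboost import XGBClassifier  # imported lazily; only needed here

        _model = XGBClassifier()
        _model.load_model(MODEL_PATH)
    return _model


def predict(args_dict, model=None):
    """Return the predicted class for one transaction.

    'features' is an assumed key name, not one defined by the project.
    """
    model = model if model is not None else load_model()
    # Shape (1, n_features): a single row, as the classifier expects.
    features = np.asarray(args_dict["features"], dtype=float).reshape(1, -1)
    # Cast to a plain int so the result can be serialized to JSON.
    return {"prediction": int(model.predict(features)[0])}
```

Note that the model was trained on a standardized Amount, so a deployed service would need to apply the same fitted scaler to incoming amounts before calling the model.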

## Test Predict Function

In [130]:
from prediction import predict

predict({'keys': 'values'})

#saved_model = XGBClassifier(); saved_model.load_model('models/fraudmodel.json')  # load the XGBoost model before calling the predict function
#predict(sample_transaction)

{'prediction': 'not implemented'}

### Run Flask

Run flask in a separate notebook ([1_run_flask.ipynb](./1_run_flask.ipynb)) to create a local service to try it out.  You must run the application in a separate notebook since it will use the kernel until stopped.

```
!FLASK_ENV=development FLASK_APP=wsgi.py flask run
```
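
For orientation, a minimal Flask app that would answer the `/predictions` requests used in the next section might look like the following. This is a sketch under assumptions, not the contents of the project's `wsgi.py`: the names `application` and `create_prediction` and the inline `predict` stand-in are illustrative.

```python
from flask import Flask, jsonify, request

application = Flask(__name__)  # assumed name for the WSGI app object


def predict(args_dict):
    # Stand-in for `from prediction import predict`.
    return {"prediction": "not implemented"}


@application.route("/predictions", methods=["POST"])
def create_prediction():
    # force=True parses the body even without a JSON Content-Type header;
    # silent=True returns None instead of raising on a missing or invalid body.
    body = request.get_json(force=True, silent=True) or {}
    return jsonify(predict(body))
```

Flask's built-in test client can exercise such a route without starting a server, for example `application.test_client().post("/predictions", data='{"hello": "world"}')`.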

### Test the Flask Endpoint

Test your new service endpoint in this notebook or from a separate notebook ([2_test_flask.ipynb](./2_test_flask.ipynb)) to try it out.


In [131]:
!curl -X POST -H "Content-Type: application/json" --data '{"data": "hello world"}' http://localhost:5000/predictions


{
  "prediction": "not implemented"
}


In [132]:
import requests
# json= serializes the payload and sets the Content-Type header to application/json
response = requests.post('http://127.0.0.1:5000/predictions', json={"hello": "world"})
response.json()

{'prediction': 'not implemented'}

### Save Your Project to Git (and Build)

Now that you've created and tested your prediction and service endpoint, push the code up to git.  This can be built as an s2i application on OpenShift.


