# Introduction 
    
Due to rapid growth in cashless and digital transactions, credit cards are
now widely used all around the world. Credit card providers issue thousands
of cards to their customers and have to ensure that every card user is
genuine. Any mistake in issuing a card can be a source of financial loss.
As cashless transactions grow, the number of fraudulent transactions is
likely to increase as well.
A fraudulent transaction can be identified by analyzing the behavior of
credit card customers in historical transaction datasets: if spending
behavior deviates from the established patterns, the transaction is
possibly fraudulent.
Data mining and machine learning techniques are widely used in credit card
fraud detection. In this article we review data mining and machine learning
methods that are widely used for credit card fraud detection, and we complete
the project end to end, from data understanding to deploying the model via an API.
    
    
 

<a id=0></a>
<h3 class="list-group-item list-group-item-action active" data-toggle="list" style='color:white; background:#1777C4; border:0' role="tab" aria-controls="home">

<center>CRISP-DM Methodology</center></h3>

* [Business Understanding](#1)
* [Data Understanding](#2)
* [Data Preparation](#3)
* [Data Modeling](#4)   
* [Data Evaluation](#5)
    

In this section we give an overview of the method we selected for engineering our solution. CRISP-DM stands for Cross-Industry Standard Process for Data Mining. It is an open standard guide that describes common approaches used by data mining experts. CRISP-DM describes the typical phases of a project and the tasks in each phase, and provides an overview of the data mining lifecycle. The lifecycle model consists of six phases, with arrows indicating the most important and frequent dependencies between phases. The sequence of the phases is not strict. In fact, most projects move back and forth between phases as necessary. It starts with business understanding, and then moves to data understanding, data preparation, modelling, evaluation, and deployment. The CRISP-DM model is flexible and can be customized easily.

## Business Understanding

    Tasks:

    1.Determine business objectives

    2.Assess situation

    3.Determine data mining goals

    4.Produce project plan

## Data Understanding
     Tasks:

    1.Collect data

    2.Describe data

    3.Explore data    

## Data Preparation
    
    Tasks:
    
    1.Data selection

    2.Data preprocessing

    3.Feature engineering

    4.Dimensionality reduction

            Steps:

            Data cleaning

            Data integration

            Data sampling

            Data dimensionality reduction

            Data formatting

            Data transformation

            Scaling

            Aggregation

            Decomposition

## Data Modeling :

Modeling is the part of the Cross-Industry Standard Process for Data Mining (CRISP-DM) process model that I like best. Our data is already in good shape, and now we can search for useful patterns in it.

   Tasks:
    
    1. Select modeling technique

    2. Generate test design

    3. Build model

    4. Assess model

## Data Evaluation :
    
    Tasks:

    1.Evaluate Result

    2.Review Process

    3.Determine next steps

<a id=1></a>
<h3 class="list-group-item list-group-item-action active" data-toggle="list" style='color:white; background:#1777C4; border:0' role="tab" aria-controls="home">
<center>Business Understanding</center></h3>


There may be two types of questions:

**A.Technical Questions:**
  
Can ML be a solution to the problem?

    
                Do we have THE data?
                Do we have all necessary related data?
                Is there enough amount of data to develop algorithm?
                Is data collected in the right way?
                Is data saved in the right format?
                Is the access to information guaranteed?

Can we satisfy all the Business Questions by means of ML?

**B.Business Questions:**
    
What are the organization's business goals?
    
                To reduce cost and increase revenue? 
                To increase efficiencies?
                To avoid risks? To improve quality?
    
Is it worth developing ML?
    
                In short term? In long term?
                What are the success metrics?
                Can we handle the risk if the project is unsuccessful?
    
Do we have the resources?
    
                Do we have enough time to develop ML?
                Do we have a team with the right talent?


    
We are provided with a synthetic dataset for a mobile payments application. The dataset gives the sender and recipient of each transaction, as well as whether the transaction is tagged as fraud or not. The task is to build a fraud detection API that can be called to predict whether or not a transaction is fraudulent.
You can download the dataset here: https://www.kaggle.com/bannourchaker/frauddetection
    
You are expected to build a REST API that predicts whether a given transaction is fraudulent or not. You are also to assume that the previous API calls are to be stored in order to engineer
features relevant to finding fraud. The API calls will include the time step of the transaction, so you can assume that a transaction happens sequentially within the same time step.
For example, if I make the following transactions in the same time step:  
    
![image.png](attachment:c6d681b2-1201-43b2-a5cf-587bd46f7839.png)
    
The first transaction is unlikely to be fraudulent, since anon is initiating a normal transfer.
However, multiple successive transfers of the same amount in the same hour are potentially fraudulent, since anon's account might have been taken over by a fraudster. On the first API call, your model is unlikely to classify the transaction as fraudulent. However, on the fifth call, it is likely to be tagged as fraudulent.
The REST API has only one endpoint, /is-fraud, which takes a POST request.
    
The body is expected to contain the following fields (which are also the fields found in your dataset).
The following is a sample body for a POST request to your API:
    
    
    {
        "step": 1,
        "type": "PAYMENT",
        "amount": 9839.64,
        "nameOrig": "C1231006815",
        "oldbalanceOrig": 170136.0,
        "newbalanceOrig": 160296.36,
        "nameDest": "M1979787155",
        "oldbalanceDest": 0.0,
        "newbalanceDest": 0.0
    }
    
    
Your API is expected to return a JSON object with a boolean field isFraud. You may find a
sample response below:
    
    {"isFraud": true}
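
The request/response contract above can be sketched without any web framework. The following is a minimal illustration only: the names `REQUIRED_FIELDS`, `history`, `validate`, `is_fraud` and the `max_repeats` threshold are hypothetical, and the decision rule is a placeholder standing in for the trained model. It assumes previous calls are kept in memory and that repeated transfers of the same amount by the same sender in the same step are suspicious, as in the example above.

```python
from collections import Counter

# Expected fields of the POST body and their accepted Python types.
REQUIRED_FIELDS = {
    "step": int,
    "type": str,
    "amount": (int, float),
    "nameOrig": str,
    "oldbalanceOrig": (int, float),
    "newbalanceOrig": (int, float),
    "nameDest": str,
    "oldbalanceDest": (int, float),
    "newbalanceDest": (int, float),
}

# Hypothetical in-memory store of previous calls:
# (sender, step, amount) -> number of times seen so far.
history = Counter()


def validate(body):
    """Return a list of problems found in the request body (empty if valid)."""
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in body:
            problems.append(f"missing field: {name}")
        elif isinstance(body[name], bool) or not isinstance(body[name], expected):
            problems.append(f"wrong type for field: {name}")
    return problems


def is_fraud(body, max_repeats=4):
    """Placeholder rule (NOT the trained model): flag a transaction once the
    same sender has repeated the same amount more than `max_repeats` times
    within one step."""
    problems = validate(body)
    if problems:
        raise ValueError("; ".join(problems))
    key = (body["nameOrig"], body["step"], body["amount"])
    history[key] += 1
    return {"isFraud": history[key] > max_repeats}


sample_body = {
    "step": 1, "type": "PAYMENT", "amount": 9839.64,
    "nameOrig": "C1231006815", "oldbalanceOrig": 170136.0,
    "newbalanceOrig": 160296.36, "nameDest": "M1979787155",
    "oldbalanceDest": 0.0, "newbalanceDest": 0.0,
}
# Five identical calls in the same step: only the last one is flagged.
responses = [is_fraud(sample_body) for _ in range(5)]
print(responses)
```

In a deployed service the same function would sit behind the /is-fraud route, and the counter would be replaced by persistent storage seeded with the training transactions.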
    
**Summary:**
We are expecting the following:
    
- 1. Deployed REST API:
    
    a. As mentioned above, we would need an API that takes in a POST request for the
    /is-fraud url and returns a prediction on whether or not a transaction is
    fraudulent.
    
    b. Your REST API should be public for us to call the API and evaluate the accuracy
    of your model
    
    c. Given the nature of the data, your REST API will likely need to take into account
    previous transactions, so make sure it is able to take note of transactions from
    your training dataset as well as previous API calls.

- 2. Model
    
    a. We are expecting a machine learning model that can correctly classify whether or
    not a transaction is fraudulent.

**What is the objective of the machine learning model?**

We aim to predict whether each transaction is fraudulent, and we compare the fraud flagged by our model against the actual fraud labels. We will evaluate model performance with the following metrics:

   - F beta score
    
   - ROC AUC score
    
   - PR AUC score | Average precision
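
As a quick illustration of the three metrics, here is a small sketch on made-up labels and scores (the arrays, the 0.5 threshold and `beta=2` are assumptions for the example, not values chosen by this project). The F-beta score needs hard predictions, while ROC AUC and average precision are computed from the predicted scores.

```python
import numpy as np
from sklearn.metrics import average_precision_score, fbeta_score, roc_auc_score

# Toy ground truth (2 frauds out of 8) and toy model scores.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1])
y_score = np.array([0.1, 0.2, 0.15, 0.3, 0.05, 0.6, 0.7, 0.4])

# Hard predictions at an assumed threshold of 0.5.
y_pred = (y_score >= 0.5).astype(int)

# beta > 1 weights recall more heavily than precision.
f2 = fbeta_score(y_true, y_pred, beta=2)
roc_auc = roc_auc_score(y_true, y_score)
pr_auc = average_precision_score(y_true, y_score)

print(f"F2: {f2:.3f}  ROC AUC: {roc_auc:.3f}  Average precision: {pr_auc:.3f}")
```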
    
    
## Step 1: Import helpful libraries

In [1]:
# Load the libraries
import pandas as pd #To work with dataset
import numpy as np #Math library
import matplotlib.gridspec as gridspec
import seaborn as sns #Graph library that use matplot in background
import matplotlib.pyplot as plt #to plot some parameters in seaborn
import warnings
# Preparation  
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import PowerTransformer, StandardScaler,Normalizer,RobustScaler,MaxAbsScaler,MinMaxScaler,QuantileTransformer
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import PolynomialFeatures
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import KBinsDiscretizer
# Import StandardScaler from scikit-learn

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import SimpleImputer
from sklearn.impute import KNNImputer,IterativeImputer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.compose import make_column_transformer,ColumnTransformer
from sklearn.pipeline import make_pipeline, Pipeline,FeatureUnion
from sklearn.manifold import TSNE
# Import train_test_split()
# Metrics
from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.metrics import make_scorer
from sklearn.metrics import mean_squared_error
from sklearn.metrics import roc_curve,confusion_matrix
from datetime import datetime, date
from sklearn.linear_model import ElasticNet, Lasso,  BayesianRidge, LassoLarsIC
from sklearn.linear_model import LinearRegression, RidgeCV
from sklearn.linear_model import LogisticRegression

#import tensorflow as tf 
#from tensorflow.keras import layers
#from tensorflow.keras.callbacks import EarlyStopping
#from tensorflow.keras.callbacks import LearningRateScheduler
#import smogn
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone
from sklearn.kernel_ridge import KernelRidge
from sklearn.ensemble import GradientBoostingRegressor,RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import HistGradientBoostingClassifier
# For training gradient boosting models (LightGBM)
import lightgbm as lgb
from scipy import sparse
from sklearn.neighbors import KNeighborsRegressor 
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans 
# Model selection
from sklearn.model_selection import StratifiedKFold,TimeSeriesSplit
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold, GroupKFold
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
# Feature Selection 
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression,f_classif,chi2
from sklearn.feature_selection import mutual_info_regression
from sklearn.feature_selection import SelectPercentile
from sklearn.feature_selection import mutual_info_classif,VarianceThreshold
# Stat :
from scipy import stats, optimize, interpolate
# Models 

from lightgbm import LGBMClassifier
import lightgbm as lgbm
#from catboost import CatBoostRegressor, CatBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn import svm
#from xgboost import XGBClassifier
from sklearn import set_config
from itertools import combinations
# Cluster :
from sklearn.cluster import MiniBatchKMeans
#from yellowbrick.cluster import KElbowVisualizer
#import smong 
import category_encoders as ce
import warnings
#import optuna 
from joblib import Parallel, delayed
import joblib 
from sklearn import set_config
from typing import List, Optional, Union
set_config(display='diagram')
warnings.filterwarnings('ignore')


## Step 2: Load the data
Next, we'll load the training data.

In [2]:
%%time
# import lux
# Load the training data
#train = pd.read_csv("../input/frauddetection/transactions_train.csv")
train = pd.read_csv("C:/Users/rzouga/Desktop/ALLINHERE/ALLINHERE/FraudDetection/transactions_train.csv")

# Preview the data
train.head(3)

Wall time: 49.3 s


|   | step | type | amount | nameOrig | oldbalanceOrig | newbalanceOrig | nameDest | oldbalanceDest | newbalanceDest | isFraud |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | PAYMENT | 9839.64 | C1231006815 | 170136.0 | 160296.36 | M1979787155 | 0.0 | 0.0 | 0 |
| 1 | 1 | PAYMENT | 1864.28 | C1666544295 | 21249.0 | 19384.72 | M2044282225 | 0.0 | 0.0 | 0 |
| 2 | 1 | TRANSFER | 181.0 | C1305486145 | 181.0 | 0.0 | C553264065 | 0.0 | 0.0 | 1 |
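
With several million rows, loading time and memory can often be reduced by telling pandas the column types up front. The sketch below is an assumption-laden illustration, not the loading code used in this notebook: it reads the three preview rows from an in-memory string, and the chosen dtypes (`int16` for `step`, `int8` for `isFraud`, `category` for `type`) assume the values fit in those ranges.

```python
import io

import pandas as pd

# The three preview rows shown above, as an in-memory CSV.
sample_csv = io.StringIO(
    "step,type,amount,nameOrig,oldbalanceOrig,newbalanceOrig,"
    "nameDest,oldbalanceDest,newbalanceDest,isFraud\n"
    "1,PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,0\n"
    "1,PAYMENT,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.0,0.0,0\n"
    "1,TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,1\n"
)

# Assumed compact dtypes; columns not listed keep pandas' defaults.
dtypes = {"step": "int16", "type": "category", "isFraud": "int8"}

sample = pd.read_csv(sample_csv, dtype=dtypes)
print(sample.dtypes)
print(sample.memory_usage(deep=True).sum(), "bytes")
```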



<a id=2></a>
<h3 class="list-group-item list-group-item-action active" data-toggle="list" style='color:white; background:#1777C4; border:0' role="tab" aria-controls="home">
<center>Data Understanding</center></h3>


### Explore the data/Analysis 

We will analyse the following:

    The target variable
    
    Variable types (categorical and numerical)
    
    Numerical variables
        Discrete
        Continuous
        Distributions
        Transformations

    Categorical variables
        Cardinality
        Rare Labels
        Special mappings

    Null Data

    Text data 
    
    Which columns will we use?
    
    Are there outliers that can destroy our algorithm?
    
    Are there different ranges of data?
    
    Curse of dimensionality
    
This step is done here: [https://www.kaggle.com/bannourchaker/frauddetection-part1-eda/edit](https://www.kaggle.com/bannourchaker/frauddetection-part1-eda/edit)

# Convert Dtypes :

In [3]:
# Convert Dtypes :
train[train.select_dtypes(['int64','int16','float32','float64','int8']).columns] = train[train.select_dtypes(['int64','int16','float32','float64','int8']).columns].apply(pd.to_numeric)
train[train.select_dtypes(['object','category']).columns] = train.select_dtypes(['object','category']).apply(lambda x: x.astype('category'))

## Define the model features and target

### Extract X and y 

In [4]:
# For the train/test split
target= "isFraud"
X = train.drop(target, axis='columns')# axis=1
y = train[target].to_numpy()

In [5]:
train.isFraud.value_counts()

0    6343476
1       7717
Name: isFraud, dtype: int64
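
The counts above show a heavily imbalanced target. A short sketch of the arithmetic, using only the two counts printed above (the variable names are illustrative):

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({0: 6343476, 1: 7717}, name="isFraud")

fraud_rate = counts[1] / counts.sum()   # share of fraudulent transactions
neg_to_pos = counts[0] / counts[1]      # legitimate transactions per fraud

print(f"fraud rate: {fraud_rate:.4%}")
print(f"legitimate transactions per fraud: {neg_to_pos:.0f}")
```

With a fraud rate this low, a model that always predicts "not fraud" would still be right almost all of the time, which is why accuracy is not among the metrics listed earlier. The negative-to-positive ratio is also one possible starting value for class weighting, for example LightGBM's `scale_pos_weight` parameter.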

# What should we do for each column

**Separate features by dtype**

Next we'll separate the features in the dataframe by their datatype. There are a few different ways to achieve this. I've used the select_dtypes() function, passing a list of numeric dtypes to include= to obtain the numeric columns and the same list to exclude= to obtain the categorical ones. Appending .columns to the end returns an Index containing the column names. The target column isFraud was already dropped from X, so it appears in neither list.

**Cat Features**





In [6]:
# select non-numeric columns
cat_columns = X.select_dtypes(exclude=['int64','int16','float32','float64','int8']).columns

**Num Features**



In [7]:
# select the numeric columns
num_columns = X.select_dtypes(include=['int64','int16','float32','float64','int8']).columns

In [8]:
all_columns = (num_columns.append(cat_columns))
print(cat_columns)
print(num_columns)
print(all_columns)

Index(['type', 'nameOrig', 'nameDest'], dtype='object')
Index(['step', 'amount', 'oldbalanceOrig', 'newbalanceOrig', 'oldbalanceDest',
       'newbalanceDest'],
      dtype='object')
Index(['step', 'amount', 'oldbalanceOrig', 'newbalanceOrig', 'oldbalanceDest',
       'newbalanceDest', 'type', 'nameOrig', 'nameDest'],
      dtype='object')
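
An equivalent and slightly shorter way to split columns by type is to pass `np.number` instead of an explicit dtype list. A minimal sketch on a toy frame (the frame and variable names are made up for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "step": [1, 1],
    "amount": [9839.64, 181.0],
    "type": pd.Categorical(["PAYMENT", "TRANSFER"]),
    "nameOrig": ["C1231006815", "C1305486145"],
})

# np.number matches every integer and float dtype.
toy_num_columns = toy.select_dtypes(include=np.number).columns
toy_cat_columns = toy.select_dtypes(exclude=np.number).columns

print(list(toy_num_columns))
print(list(toy_cat_columns))
```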


# Check that we have all columns

In [9]:
if set(all_columns) == set(X.columns):
    print('Ok')
else:
    # Let's see the difference 
    print('in all_columns but not in  train  :', set(all_columns) - set(X.columns))
    print('in X.columns   but not all_columns :', set(X.columns) - set(all_columns))

Ok


<a id=3></a>
<h3 class="list-group-item list-group-item-action active" data-toggle="list" style='color:white; background:#1777C4; border:0' role="tab" aria-controls="home">
<center>Data Preparation</center></h3>


## Data preprocessing

Data preprocessing comes after you've cleaned up your data and after you've done some exploratory analysis to understand your dataset. Once you understand your dataset, you'll probably have some idea about how you want to model your data. Most machine learning models in Python require numerical input, so if your dataset has categorical variables, you'll need to transform them. Think of data preprocessing as a prerequisite for modeling.
This step is done here:
[https://www.kaggle.com/bannourchaker/frauddetection-part2-preparation/edit](https://www.kaggle.com/bannourchaker/frauddetection-part2-preparation/edit)



In [10]:
class ColumnsSelector(BaseEstimator, TransformerMixin):
    def __init__(self, positions):
        self.positions = positions

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        #return np.array(X)[:, self.positions]
        return X.loc[:, self.positions] 
########################################################################
class CustomLogTransformer(BaseEstimator, TransformerMixin):
    # https://towardsdatascience.com/how-to-write-powerful-code-others-admire-with-custom-sklearn-transformers-34bc9087fdd
    def __init__(self):
        self._estimator = PowerTransformer()

    def fit(self, X, y=None):
        X_copy = np.copy(X) + 1
        self._estimator.fit(X_copy)

        return self

    def transform(self, X):
        X_copy = np.copy(X) + 1

        return self._estimator.transform(X_copy)

    def inverse_transform(self, X):
        X_reversed = self._estimator.inverse_transform(np.copy(X))

        return X_reversed - 1  

class TemporalVariableTransformer(BaseEstimator, TransformerMixin):
    # Temporal elapsed time transformer

    def __init__(self, variables, reference_variable):
        
        if not isinstance(variables, list):
            raise ValueError('variables should be a list')
        
        self.variables = variables
        self.reference_variable = reference_variable

    def fit(self, X, y=None):
        # we need this step to fit the sklearn pipeline
        return self

    def transform(self, X):

       # so that we do not over-write the original dataframe
        X = X.copy()
        
        for feature in self.variables:
            X[feature] = X[self.reference_variable] - X[feature]

        return X
class CustomImputer(BaseEstimator, TransformerMixin):
    # Group-wise mean imputer: fills missing values of `variable`
    # with the mean of `variable` within each group of `by`.
    def __init__(self, variable, by):
        # Store the parameters as attributes so other methods can use them
        self.variable = variable
        self.by = by

    def fit(self, X, y=None):
        # self.map_ is a lookup table (group -> mean of `variable`)
        self.map_ = X.groupby(self.by)[self.variable].mean()
        return self

    def transform(self, X, y=None):
        # Work on a copy so the caller's dataframe is not modified
        X = X.copy()
        # Replace each missing value with the mean learned for its group
        X[self.variable] = X[self.variable].fillna(X[self.by].map(self.map_))
        return X


# categorical value mapper
class Mapper(BaseEstimator, TransformerMixin):

    def __init__(self, variables, mappings):

        if not isinstance(variables, list):
            raise ValueError('variables should be a list')

        self.variables = variables
        self.mappings = mappings

    def fit(self, X, y=None):
        # we need the fit statement to accomodate the sklearn pipeline
        return self

    def transform(self, X):
        X = X.copy()
        for feature in self.variables:
            X[feature] = X[feature].map(self.mappings)

        return X  
    
##########################################################################
class CountFrequencyEncoder(BaseEstimator, TransformerMixin):
    #temp = df['card1'].value_counts().to_dict()
    #df['card1_counts'] = df['card1'].map(temp)
    def __init__(
        self,
        encoding_method: str = "count",
        variables: Union[None, int, str, List[Union[str, int]]] = None,
        keep_variable=True,
                  ) -> None:

        self.encoding_method = encoding_method
        self.variables = variables
        self.keep_variable=keep_variable

    def fit(self, X: pd.DataFrame, y: Optional[pd.Series] = None):
        """
        Learn the counts or frequencies which will be used to replace the categories.
        Parameters
        ----------
        X: pandas dataframe of shape = [n_samples, n_features]
            The training dataset. Can be the entire dataframe, not just the
            variables to be transformed.
        y: pandas Series, default = None
            y is not needed in this encoder. You can pass y or None.
        """
        self.encoder_dict_ = {}

        # learn encoding maps
        for var in self.variables:
            if self.encoding_method == "count":
                self.encoder_dict_[var] = X[var].value_counts().to_dict()

            elif self.encoding_method == "frequency":
                n_obs = float(len(X))
                self.encoder_dict_[var] = (X[var].value_counts() / n_obs).to_dict()
        return self

    def transform(self, X: pd.DataFrame) -> np.ndarray:
        # replace categories by the learned parameters
        X = X.copy()
        output_columns = []
        for feature in self.encoder_dict_.keys():
            # keep_variable=True writes to a new column, otherwise overwrite in place
            new_name = feature + '_fq_enc' if self.keep_variable else feature
            X[new_name] = X[feature].map(self.encoder_dict_[feature])
            output_columns.append(new_name)
        # return only the encoded columns
        return X[output_columns].to_numpy()
#################################################   
class FeaturesEngineerGroup(BaseEstimator, TransformerMixin):
    def __init__(self,groupping_method ="mean",
                   variables=  "amount",
                   groupby_variables = "nameOrig"                         
                 ) :
        self.groupping_method = groupping_method
        self.variables=variables
        self.groupby_variables=groupby_variables
        
    def fit(self, X, y=None):
        """
        Learn the mean or median amount of each client, which will be used to create a new feature for each unique client in order to understand their behavior.
        Parameters
        ----------
        X: pandas dataframe of shape = [n_samples, n_features]
        The training dataset. Can be the entire dataframe, not just the
        variables to be transformed.
        y: pandas Series, default = None
        y is not needed in this encoder. You can pass y or None.
        """
        self.group_amount_dict_ = {}
        #df.groupby('card1')['TransactionAmt'].agg(['mean']).to_dict()
        #temp = df.groupby('card1')['TransactionAmt'].agg(['mean']).rename({'mean':'TransactionAmt_card1_mean'},axis=1)
        #df = pd.merge(df,temp,on='card1',how='left')
        #target_mean = df_train.groupby(['id1', 'id2'])['target'].mean().rename('avg')
        #df_test = df_test.join(target_mean, on=['id1', 'id2'])
        #lifeExp_per_continent = gapminder.groupby('continent').lifeExp.mean()
        # learn mean/medain 
        #for groupby in self.groupby_variables:
         #   for var in self.variables:
        if self.groupping_method not in ("mean", "median"):
            raise ValueError("groupping_method must be 'mean' or 'median'")
        # Result has the form {method: {client: aggregated amount}}
        self.group_amount_dict_[self.variables] = (
            X.groupby([self.groupby_variables])[self.variables]
            .agg([self.groupping_method])
            .to_dict()
        )
        return self
    
    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        X = X.copy()
        #for col in self.variables:
         #   for agg_type in self.groupping_method:
        new_col_name =  self.variables+'_Transaction_'+ self.groupping_method
        X[new_col_name] = X[self.groupby_variables].map(self.group_amount_dict_[ self.variables][self.groupping_method])
        return X[new_col_name].to_numpy().reshape(-1,1)    
    
################################################   
class FeaturesEngineerGroup2(BaseEstimator, TransformerMixin):
    def __init__(self,groupping_method ="mean",
                   variables=  "amount",
                   groupby_variables = "nameOrig"                         
                 ) :
        self.groupping_method = groupping_method
        self.variables=variables
        self.groupby_variables=groupby_variables
        
    def fit(self, X, y=None):
        """
        Learn the mean or median amount of each client, which will be used to create a new feature for each unique client in order to understand their behavior.
        Parameters
        ----------
        X: pandas dataframe of shape = [n_samples, n_features]
        The training dataset. Can be the entire dataframe, not just the
        variables to be transformed.
        y: pandas Series, default = None
        y is not needed in this encoder. You can pass y or None.
        """
        X = X.copy()
        self.group_amount_dict_ = {}
        #df.groupby('card1')['TransactionAmt'].agg(['mean']).to_dict()
        #temp = df.groupby('card1')['TransactionAmt'].agg(['mean']).rename({'mean':'TransactionAmt_card1_mean'},axis=1)
        #df = pd.merge(df,temp,on='card1',how='left')
        #target_mean = df_train.groupby(['id1', 'id2'])['target'].mean().rename('avg')
        #df_test = df_test.join(target_mean, on=['id1', 'id2'])
        #lifeExp_per_continent = gapminder.groupby('continent').lifeExp.mean()
        # learn mean/medain 
        #for groupby in self.groupby_variables:
         #   for var in self.variables:

        print('we have {} unique clients'.format(X[self.groupby_variables].nunique()))
        new_col_name =  self.variables+'_Transaction_'+ self.groupping_method    
        X[new_col_name] = X.groupby([self.groupby_variables])[[self.variables]].transform(self.groupping_method)
        X = X.drop_duplicates([self.groupby_variables])
    
        self.group_amount_dict_ = dict(zip(X[self.groupby_variables], X[new_col_name]))
        del X
        print('we have {} unique mean amount : one for each client'.format(len(self.group_amount_dict_)))
        return self
    
    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        X = X.copy()
        #for col in self.variables:
         #   for agg_type in self.groupping_method:
        new_col_name =  self.variables+'_Transaction_'+ self.groupping_method
        X[new_col_name] = X[self.groupby_variables].map(self.group_amount_dict_)
        return X[new_col_name].to_numpy().reshape(-1,1)   
    
############################################  
class FeaturesEngineerCumCount(BaseEstimator, TransformerMixin):
    def __init__(self,group_one ="step",
                   group_two=  "nameOrig"                       
                 ) :
        self.group_one =group_one
        self.group_two=group_two
        
    def fit(self, X, y=None):
        """
        Nothing to learn: the transaction counts are computed at transform time.
        """
        return self
    
    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        X = X.copy()
        new_col_name =  self.group_two+'_Transaction_count'
        X[new_col_name] = X.groupby([self.group_one, self.group_two])[[self.group_two]].transform('count')
        return X[new_col_name].to_numpy().reshape(-1,1)
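
To see how transformers of this kind behave, here is a compact, self-contained sketch. `GroupMeanFeature` is a hypothetical simplified variant written for this illustration (it is not one of the classes above), and the toy frames are made up. It learns one mean amount per client in `fit` and looks it up in `transform`, so clients that were not seen during fitting come out as NaN.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class GroupMeanFeature(BaseEstimator, TransformerMixin):
    """Hypothetical compact variant of the group-aggregation transformers."""

    def __init__(self, variable="amount", by="nameOrig"):
        self.variable = variable
        self.by = by

    def fit(self, X, y=None):
        # Learn one mean per group from the training data only.
        self.group_mean_ = X.groupby(self.by)[self.variable].mean().to_dict()
        return self

    def transform(self, X):
        # Groups unseen during fit are mapped to NaN.
        return X[self.by].map(self.group_mean_).to_numpy().reshape(-1, 1)


train_toy = pd.DataFrame({"nameOrig": ["C1", "C1", "C2"],
                          "amount": [100.0, 300.0, 50.0]})
new_toy = pd.DataFrame({"nameOrig": ["C1", "C3"],
                        "amount": [10.0, 20.0]})

features = GroupMeanFeature().fit(train_toy).transform(new_toy)
print(features)
```

Because the means come from the fitted data only, a downstream step (an imputer, or a model that tolerates missing values) has to handle the NaN produced for new clients.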

# Baseline Pipe :
This is the first round, used to find the best preprocessing steps.

In [11]:
# Cat columns: 
cat_pipe = Pipeline([
                     ('Encoder',ce.target_encoder.TargetEncoder())
                     
                    ])
#Num_columns:
num_pipe = Pipeline([('imputer', SimpleImputer(strategy='median',add_indicator=False)),
                     ('scaler', QuantileTransformer())
                    ])
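# FeatureUnion passes every input column to each branch, so with mixed dtypes
# prefer the ColumnTransformer (data_cleaning) defined below: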
preprocessor = FeatureUnion(transformer_list=[('cat', cat_pipe),
                                              ('num', num_pipe)])
# Using ColumnTransformer:
data_cleaning = ColumnTransformer([
    ('cat_columns',  cat_pipe, cat_columns ),
    ('num_columns', num_pipe , num_columns)
])
# preprocessor.fit(X_train)
#############################
# Complete Pipe 
def create_pipeline(model,preprocessor,FeaturesEngineer=None):
    pipeline = Pipeline([ 
        ('pre', preprocessor),
        ('lgbm', model)
    ])
    return pipeline
preprocessor 
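
The following sketch shows the same baseline structure fitted end to end on synthetic data. To stay within scikit-learn only, it swaps in `OneHotEncoder`, `StandardScaler` and `LogisticRegression` for the target encoder, quantile transformer and LightGBM model used in this notebook; the toy data, column choice and names such as `toy_model` are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the transactions table.
rng = np.random.default_rng(0)
n = 200
toy_X = pd.DataFrame({
    "type": rng.choice(["PAYMENT", "TRANSFER", "CASH_OUT"], size=n),
    "amount": rng.exponential(1000.0, size=n),
})
# Made-up target: large transfers are labelled as fraud.
toy_y = ((toy_X["type"] == "TRANSFER") & (toy_X["amount"] > 500)).astype(int).to_numpy()

# Each branch only receives the columns listed for it.
toy_prep = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["type"]),
    ("num", Pipeline([("imputer", SimpleImputer(strategy="median")),
                      ("scaler", StandardScaler())]), ["amount"]),
])

toy_model = Pipeline([("pre", toy_prep),
                      ("clf", LogisticRegression(max_iter=1000))])
toy_model.fit(toy_X, toy_y)

# Probability of the positive (fraud) class for each row.
proba = toy_model.predict_proba(toy_X)[:, 1]
print(proba[:5])
```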

# Advanced Pipe :
This pipe includes feature engineering plus some more advanced preprocessing steps for each column.

In [12]:
# complete pipe :
# select the float/cat columns
#cat_feautres = X.select_dtypes(include=['object','category']).columns
#num_features = X.select_dtypes(exclude=['object','category']).columns
#Define vcat pipeline
features_cum_count=['step','nameOrig']
features_groupby_amount=['amount','nameOrig']
features_frequency_orig_dest=['nameOrig','nameDest']
features_cum_count_pipe = Pipeline([
                     ('transformer_Encoder', FeaturesEngineerCumCount())
                    ])
features_groupby_pipe = Pipeline([
                     ('transformer_group_amount_mean', FeaturesEngineerGroup2()),
                     ('transformer_group_scaler', PowerTransformer())
                    ])
features_frequency_pipe = Pipeline([
                     ('Encoder', CountFrequencyEncoder(variables=['nameOrig','nameDest'],encoding_method ="frequency", keep_variable=False))
                    ])
type_pipe= Pipeline([
                     ('transformer_Encoder', ce.cat_boost.CatBoostEncoder())
                    ])
num_features0=[  'amount',  'oldbalanceOrig', 'newbalanceOrig' ,'oldbalanceDest', 'newbalanceDest']
#Define vnum pipeline
num_pipe = Pipeline([
                     ('scaler', PowerTransformer()),
                    ])
#Featureunion fitting training data
preprocessor = FeatureUnion(transformer_list=[('cum_count', features_cum_count_pipe),
                                              ('mean_amount', features_groupby_pipe),
                                              ('frequency_dest_orig', features_frequency_pipe),
                                              ('trans_type', type_pipe),
                                              ('num', num_pipe)])
data_preparing= ColumnTransformer([
    ('cum_count', features_cum_count_pipe, features_cum_count ),
    ('mean_amount', features_groupby_pipe, features_groupby_amount ),
    ('frequency_dest_orig', features_frequency_pipe, features_frequency_orig_dest ),
    ('trans_type', type_pipe, ['type'] ),
    ('num', num_pipe, num_features0)
], remainder='drop')
data_preparing
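
The `cum_count` branch counts transactions per sender within a time step. The sketch below reproduces that idea in plain pandas on a made-up frame and, as an assumption about what an online API could use, contrasts it with a running count that only looks at transactions seen so far.

```python
import pandas as pd

calls = pd.DataFrame({
    "step": [1, 1, 1, 1, 2],
    "nameOrig": ["anon", "anon", "bob", "anon", "anon"],
    "amount": [100.0, 100.0, 20.0, 100.0, 100.0],
})

# Total number of transactions per (step, sender): every row of a group
# gets the same value, as in FeaturesEngineerCumCount.
calls["count_in_step"] = (
    calls.groupby(["step", "nameOrig"])["amount"].transform("count")
)

# Running count: how many transactions the sender has made so far in the step.
calls["running_count"] = calls.groupby(["step", "nameOrig"]).cumcount() + 1

print(calls)
```

The total count uses every row of the step, including later ones, whereas the running count only depends on earlier rows, which is the information available when an API call arrives.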

<a id=4></a>
<h3 class="list-group-item list-group-item-action active" data-toggle="list" style='color:white; background:#1777C4; border:0' role="tab" aria-controls="home">
<center>Modeling</center></h3>


Modeling is the part of the Cross-Industry Standard Process for Data Mining (CRISP-DM) process model that I like best. Our data is already in good shape, and now we can search for useful patterns in it.


Tasks

1. Select modeling technique

2. Generate test design

3. Build model

4. Assess model


# Define Cross-Validation Strategy :



**Steps for K-fold cross-validation**

    Split the dataset into K equal partitions (or "folds")
        So if k = 5 and dataset has 150 observations
        Each of the 5 folds would have 30 observations
    Use fold 1 as the testing set and the union of the other folds as the training set
        Testing set = 30 observations (fold 1)
        Training set = 120 observations (folds 2-5)
    Calculate testing accuracy
    Repeat steps 2 and 3 K times, using a different fold as the testing set each time
        We will repeat the process 5 times
        2nd iteration
            fold 2 would be the testing set
            union of fold 1, 3, 4, and 5 would be the training set
        3rd iteration
            fold 3 would be the testing set
            union of fold 1, 2, 4, and 5 would be the training set
        And so on...
    Use the average testing accuracy as the estimate of out-of-sample accuracy
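
The worked example above (150 observations, K = 5) can be checked with scikit-learn's `KFold`. This is a small sketch with placeholder data; `X_demo` and the random seed are arbitrary choices for the illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

# 150 placeholder observations.
X_demo = np.arange(150).reshape(-1, 1)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)

fold_sizes = []
test_indices = []
for train_idx, test_idx in kfold.split(X_demo):
    fold_sizes.append((len(train_idx), len(test_idx)))
    test_indices.extend(test_idx)

# Each iteration trains on 120 observations and tests on 30.
print(fold_sizes)
```

Across the five iterations every observation lands in a test fold exactly once.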

Diagram of 5-fold cross-validation

![image.png](attachment:3c2b3638-be43-4a17-af54-84aae9a27418.png)





**Comparing cross-validation to train/test split**

Advantages of cross-validation:

- More accurate estimate of out-of-sample accuracy
- More "efficient" use of data, because every observation is used for both training and testing

Advantages of train/test split:

- Runs K times faster than K-fold cross-validation, because K-fold cross-validation repeats the train/test split K times
- Simpler to examine the detailed results of the testing process

**Cross-validation recommendations**

- K can be any number, but K = 10 is a common recommendation; in practice it tends to give a good trade-off between the bias and the variance of the out-of-sample estimate.
- For classification problems, stratified sampling is recommended for creating the folds.
- Each response class should appear in each of the K folds in about the same proportion as in the full dataset:
    - Suppose the dataset has 2 response classes (spam/ham) and 20% of the observations are ham.
    - Then each cross-validation fold should consist of approximately 20% ham.
    - scikit-learn's `cross_val_score` function does this by default when the estimator is a classifier and `cv` is an integer.
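
To make the stratification point concrete, here is a small sketch on made-up labels (`y_toy` with 20% positives is an assumption for illustration). It uses `StratifiedKFold` explicitly and checks the class share in every test fold:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels: 100 observations, 20% of them in the positive class (1).
y_toy = np.array([1] * 20 + [0] * 80)
X_toy = np.zeros((100, 1))  # the features do not matter for the split itself

positive_share = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X_toy, y_toy):
    # Share of positives in this test fold.
    positive_share.append(float(y_toy[test_idx].mean()))

print(positive_share)
```

Each test fold keeps the 20% share of the full dataset. Note that stratification alone ignores time order, which matters for our data later on.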


**Cross-Validation in Deep Learning**

Cross-validation in deep learning (DL) can be a little tricky because most CV techniques require training the model several times.

In deep learning, you would normally be tempted to avoid CV because of the cost of training k different models. Instead of k-fold or another CV technique, you might use a random subset of your training data as a hold-out set for validation.

For example, the Keras deep learning library lets you pass one of two parameters to the fit function that performs training:

- `validation_split`: fraction of the training data that should be held out for validation
- `validation_data`: a tuple of (X, y) which should be used for validation. This parameter overrides `validation_split`, which means only one of the two takes effect at a time.

The same approach is used in official tutorials of other DL frameworks such as PyTorch and MXNet. They also suggest splitting the dataset into three parts: training, validation, and testing.

- Training: a part of the dataset to train on
- Validation: a part of the dataset to validate on while training
- Testing: a part of the dataset for the final evaluation of the model
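
A three-way split can be sketched with two calls to `train_test_split`. The data here is synthetic and the 60/20/20 proportions are an arbitrary choice for illustration; for chronologically ordered data like ours, a random split of this kind would not be appropriate.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 1,000 observations with a binary target (the values are arbitrary).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = rng.integers(0, 2, size=1000)

# First carve out 20% for the final test, then 25% of the rest for validation,
# which gives a 60/20/20 split overall.
X_rest, X_test_toy, y_rest, y_test_toy = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0)
X_train_toy, X_val_toy, y_train_toy, y_val_toy = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train_toy), len(X_val_toy), len(X_test_toy))
```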

Still, you can use cross-validation in DL tasks if the dataset is tiny (contains hundreds of samples). In this case, learning a complex model might be an irrelevant task so make sure that you don’t complicate the task further.

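**How to choose the best k for cross-validation:**

Note:
The value of K mainly affects the quality of the performance estimate, not whether the model itself over-fits or under-fits. A very large K (leave-one-out in the extreme case) gives an estimate with low bias but potentially high variance. A very small K trains each model on much less data and gives a pessimistically biased estimate.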

If the training set is relatively small, would it be better to use a lower k?

No, it's the opposite. If the training set is small, it is better to use a larger k. This simulates having a larger training set. In the extreme case you get leave-one-out cross-validation: train on all but one of the samples, and test on that sample. Repeat for each sample.

The choice of k involves a tradeoff between efficiency and the accuracy of the error prediction. More folds reduce the bias of the estimate of the generalization error, but they take more time to compute.

It depends on how much compute you are willing to spend. A lower K means less variance but more bias in the estimate, while a higher K means more variance but lower bias.

Also, one should keep in mind the computational costs for the different values. High K means more folds, thus higher computational time and vice versa. So, one needs to find a sweet spot between those by doing a hyper tuning analysis.

Also, you need to keep the size of your data in mind. If you have very little data, even k-fold cross-validation with a small k may not make sense, so you might want to use leave-one-out CV (LOOCV) instead.
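
The trade-off can be made tangible by looking at how many rows each model trains on and how many fits are needed. A small sketch on a hypothetical dataset of 20 observations:

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X_toy = np.arange(20).reshape(-1, 1)  # tiny toy dataset of 20 observations

train_sizes = {}
for k in (2, 5, 10):
    # Size of the training part in the first fold for this k.
    first_train, _ = next(KFold(n_splits=k).split(X_toy))
    train_sizes[k] = len(first_train)

# Leave-one-out needs one model fit per observation.
n_loo_fits = LeaveOneOut().get_n_splits(X_toy)
print(train_sizes, n_loo_fits)
```

A larger k leaves more data for training in each fold, at the price of more model fits.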


**Feature engineering and selection within cross-validation iterations**

- Normally, feature engineering and selection occur before cross-validation.
- Instead, perform all feature engineering and selection within each cross-validation iteration.
- This gives a more reliable estimate of out-of-sample performance, since it better mimics the application of the model to out-of-sample data.
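
One common way to do this in scikit-learn is to wrap the preparation steps and the model in a single pipeline and cross-validate the pipeline as a whole. The sketch below uses synthetic data and an arbitrary choice of steps (scaling, `SelectKBest`, logistic regression), which are assumptions for illustration and not the pipeline of this notebook:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; a real run would pass its own X and y.
X_toy, y_toy = make_classification(n_samples=300, n_features=20,
                                   n_informative=5, random_state=0)

# Scaling and feature selection live inside the pipeline, so they are
# re-fitted on the training part of every fold and never see the test part.
model = make_pipeline(StandardScaler(),
                      SelectKBest(f_classif, k=5),
                      LogisticRegression(max_iter=1000))

scores = cross_val_score(model, X_toy, y_toy, cv=5)
print(scores.mean())
```

A transformer such as the `data_preparing` object built earlier could presumably take the place of the first two steps in the same way.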
      



Let's go into more depth in order to choose our cross-validation strategy.


**How to validate a model on chronologically ordered data which also contains groups?**

Since it takes quite some time to get a utility score back for our model, it would be nice to be able to calculate an indication of a model's performance locally, independent of that time-expensive and limited process. This would allow for much better tuning of hyper-parameters or other aspects of the model's training process.

In this notebook I want to lay out a couple of techniques that can be used to do this. For each one we will see that there is a problem with using it on this particular data. Fortunately, we will also provide a solution!

Credit: based on the great notebook https://www.kaggle.com/jorijnsmit/found-the-holy-grail-grouptimeseriessplit
    

# Create test and train groups

Now that we’ve got our dataframe ready, we can split it into the train and test datasets for our model to use. We’ll use the scikit-learn train_test_split() function for this. By passing in the X dataframe of raw features, the y series containing the target, and the size of the test group (e.g. 0.1 for 10%), we get back the X_train, X_test, y_train and y_test data to use in the model.


   

In [13]:
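# Split the dataset and labels into training and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.1,
                                                    random_state=0,  # has no effect while shuffle=False
                                                    # stratify=y,    # not supported with shuffle=False
                                                    shuffle=False    # keep the chronological order
                                                   )
print("{} rows in test set vs. {} in training set. {} Features.".format(X_test.shape[0], X_train.shape[0], X_test.shape[1]))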


635120 rows in test set vs. 5716073 in training set. 9 Features.


In [14]:
print(X_train.index)
print(X_train.step.tail(3))
print(X_test.step.head(3))

Int64Index([      0,       1,       2,       3,       4,       5,       6,
                  7,       8,       9,
            ...
            5716063, 5716064, 5716065, 5716066, 5716067, 5716068, 5716069,
            5716070, 5716071, 5716072],
           dtype='int64', length=5716073)
5716070    398
5716071    398
5716072    398
Name: step, dtype: int64
5716073    398
5716074    398
5716075    398
Name: step, dtype: int64


Use of **shuffle=False** is key here; since otherwise we would lose all chronological order.

However, this approach is problematic: by repeatedly validating on the same data, we slowly start to overfit on the test set ("leakage"). Splitting off a further validation set could solve this: the model's hyper-parameters are tuned and verified on the validation set, and once that is completely finished we test the model (only once!) on the test set. The problem then becomes that either the test and validation sets are too small to be useful, or so much data is used to validate and test that not enough remains to train on.

We also have other problems here:
- We cannot use stratify to keep the same class proportions in train and test, because it is not available with shuffle=False.
- The same step is divided between train and test (step 398 in the output above). This breaks our groups and the features computed within each group, such as the cumulative ones.
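
The second problem can be detected, and avoided for a single split, with plain NumPy. This is a minimal sketch on a hypothetical `step` array; moving the cut back to the start of a step is just one possible fix, not the approach this notebook ends up with:

```python
import numpy as np

# Hypothetical toy 'step' column: chronologically ordered, several rows per step.
step = np.array([1, 1, 1, 2, 2, 3, 3, 3, 3, 4])

# A plain positional 60/40 cut, similar in spirit to shuffle=False.
cut = len(step) * 6 // 10
shared_steps = np.intersect1d(step[:cut], step[cut:])
print("steps on both sides of a positional cut:", shared_steps)

# Alternative: move the cut to the first row of the step it falls in,
# so every step lands entirely in train or entirely in test.
boundary_step = step[cut]
group_cut = int(np.searchsorted(step, boundary_step, side="left"))
train_steps, test_steps = step[:group_cut], step[group_cut:]
print("train steps:", np.unique(train_steps), "test steps:", np.unique(test_steps))
```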

**Normal Cross-validation**

In cross-validation (CV), multiple validation sets are derived from the training set. In every fold, a new part of the training set is used as the validation set, and the data previously used for validation becomes part of the training set again:


![image.png](attachment:7856e24e-b95e-4c6e-8458-eddffc215e4d.png)





In [15]:
from sklearn.model_selection import KFold

for train_idx, test_idx in KFold().split(X, y):
    # Show the steps covered by the first training and test fold only
    print(X.loc[train_idx, 'step'])
    print(X.loc[test_idx, 'step'])
    break

1270239    135
1270240    135
1270241    135
1270242    135
1270243    135
          ... 
6351188    699
6351189    699
6351190    699
6351191    699
6351192    699
Name: step, Length: 5080954, dtype: int64
0            1
1            1
2            1
3            1
4            1
          ... 
1270234    135
1270235    135
1270236    135
1270237    135
1270238    135
Name: step, Length: 1270239, dtype: int64


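Schematically, the resulting folds look like this (block numbers are in chronological order):

    fold 1 : training [1 2 3 4 5], test [6]
    fold 2 : training [1 2 3 4 6], test [5]
    fold 3 : training [1 2 3 5 6], test [4]
    fold 4 : training [1 2 4 5 6], test [3]
    fold 5 : training [1 3 4 5 6], test [2]
    fold 6 : training [2 3 4 5 6], test [1]

As we can see, this is a big problem: in every fold except the first, the training set contains blocks that come after the test block.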

In other words, our model has been trained using information which wasn't yet available at the time of the validation set. **This is clear leakage**; we are predicting the past with knowledge from the future. But our aim is to predict data in the future! This problem has already been addressed by scikit-learn in the form of TimeSeriesSplit. 

When dealing with time series data, traditional cross-validation (like k-fold) should not be used for two reasons:

1. Temporal Dependencies.

2. Arbitrary Choice of Test Set.


**TimeSeriesSplit:**

An approach that's sometimes more principled for time series is forward chaining, where your procedure would be something like this:

    fold 1 : training [1], test [2]
    fold 2 : training [1 2], test [3]
    fold 3 : training [1 2 3], test [4]
    fold 4 : training [1 2 3 4], test [5]
    fold 5 : training [1 2 3 4 5], test [6]


The method I use for cross-validating my time-series model is cross-validation on a rolling basis: start with a small subset of the data for training, forecast the later data points, and then check the accuracy of those forecasts. The forecasted data points are then included in the next training set, and the subsequent data points are forecasted.

To make things intuitive, here is an image of the procedure:

![image.png](attachment:0255116a-7dd5-4171-8436-edab7989b742.png)


In [16]:
from sklearn.model_selection import TimeSeriesSplit

for train_idx, test_idx in TimeSeriesSplit().split(X):
    print(X.loc[train_idx, 'step'].unique())
    print(X.loc[train_idx, 'step'].tail())
    print(X.loc[test_idx, 'step'].unique())
    print(X.loc[test_idx, 'step'].head())
    break

[ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96]
1058528    96
1058529    96
1058530    96
1058531    96
1058532    96
Name: step, dtype: int64
[ 96  97  98  99 100 101 102 103 104 105 106 107 108 109 110 111 112 113
 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131
 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149
 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167
 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183]
1058533    96
1058534    96
1058535    96
1058536    96
1058537    96
Name: step, dtype: int64


But what is the problem this time? TimeSeriesSplit does not respect the groups available in the data. Although not clearly visible in this plot, we can imagine that a group can partially fall in the training set and partially in the test set.



Already in the first split we can see data from step 96 present in both the training and test set. That would mean that we are training on part of the transactions of a certain step, only to validate on the remaining transactions of that same step. What we of course want is to train on all transactions of a set of steps and to validate on the steps that follow. **Otherwise leakage will occur again**.
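
The overlap is easy to reproduce on a small scale. The sketch below uses a hypothetical `step` array with groups of unequal size and lists, for every `TimeSeriesSplit` fold, the steps that appear on both sides:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical toy 'step' column with groups of unequal size.
step = np.array([1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3])

overlaps = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(step):
    # Steps that occur in both the training and the test indices of this fold.
    shared = np.intersect1d(step[train_idx], step[test_idx])
    overlaps.append(shared.tolist())
    print("train steps:", np.unique(step[train_idx]),
          "test steps:", np.unique(step[test_idx]),
          "shared:", shared)
```

In this toy case every fold shares one step between training and test, because the splitter cuts purely by row position.
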

# GroupKFold



In [17]:
from sklearn.model_selection import GroupKFold

for train_idx, test_idx in GroupKFold().split(X, groups=X['step']):
    print(X.loc[train_idx, 'step'].unique())
    print(X.loc[test_idx, 'step'].unique())
    break

[  2   3   4   5   6   7   8   9  12  13  14  15  16  17  18  20  21  22
  23  24  25  26  27  28  29  30  31  32  33  34  35  37  38  39  41  42
  43  45  46  47  48  49  50  54  55  56  57  58  59  60  62  63  65  67
  68  70  71  72  73  75  76  77  79  80  81  82  84  85  87  88  89  90
  91  92  93  95  96  97  98  99 100 101 102 103 104 105 107 108 109 110
 111 112 113 115 116 118 119 123 124 126 127 128 129 130 132 133 134 136
 138 139 140 141 142 143 144 145 146 148 149 150 151 152 153 154 155 156
 157 158 159 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175
 176 177 178 179 181 182 183 184 185 186 187 188 189 190 191 192 193 194
 195 197 198 199 200 201 203 204 205 207 210 213 214 216 217 218 219 220
 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238
 239 240 241 242 245 247 248 249 250 251 252 253 254 255 256 257 258 260
 262 263 265 266 267 269 270 271 272 273 274 277 278 280 281 282 283 284
 285 286 287 288 289 291 293 295 296 297 298 299 30

The GroupKFold iterator does respect groupings: no group will ever be part of two folds. Unfortunately, it is also clear that it mixes up the order completely and thus loses the temporal dimension again. What we need is a crossover between GroupKFold and TimeSeriesSplit: GroupTimeSeriesSplit.


# GroupTimeSeriesSplit

OK, so this iterator does not exist yet in scikit-learn. However, a request for it has been documented on GitHub over a year ago (Feature request: Group aware Time-based cross validation #14257) and is almost ready for release. Thanks to open source we can take a sneak peek already!

Do note that this is not fully reviewed yet! This might be the final code that makes it into sklearn's version 0.24 as a major feature, but there's also a chance of bugs still being present.

I did not write any of this but it did take me a good day of research and trying to write it myself. All credits go to @getgaurav2.

Here are some more attempts at grouped cross-validation I encountered in my research:

https://stackoverflow.com/questions/51963713/cross-validation-for-grouped-time-series-panel-data
    
https://datascience.stackexchange.com/questions/77684/time-series-grouped-cross-validation

https://nander.cc/writing-custom-cross-validation-methods-grid-search


In [18]:
import numpy as np
from sklearn.model_selection._split import _BaseKFold, indexable, _num_samples
# Private helper: its location may change between scikit-learn versions
from sklearn.utils.validation import _deprecate_positional_args

# https://github.com/getgaurav2/scikit-learn/blob/d4a3af5cc9da3a76f0266932644b884c99724c57/sklearn/model_selection/_split.py#L2243
class GroupTimeSeriesSplit(_BaseKFold):
    """Time Series cross-validator variant with non-overlapping groups.
    Provides train/test indices to split time series data samples
    that are observed at fixed time intervals according to a
    third-party provided group.
    In each split, test indices must be higher than before, and thus shuffling
    in cross validator is inappropriate.
    This cross-validation object is a variation of :class:`KFold`.
    In the kth split, it returns first k folds as train set and the
    (k+1)th fold as test set.
    The same group will not appear in two different folds (the number of
    distinct groups has to be at least equal to the number of folds).
    Note that unlike standard cross-validation methods, successive
    training sets are supersets of those that come before them.
    Read more in the :ref:`User Guide <cross_validation>`.
    Parameters
    ----------
    n_splits : int, default=5
        Number of splits. Must be at least 2.
    max_train_size : int, default=None
        Maximum size for a single training set.
    Examples
    --------
    >>> import numpy as np
    >>> from sklearn.model_selection import GroupTimeSeriesSplit
    >>> groups = np.array(['a', 'a', 'a', 'a', 'a', 'a',\
                           'b', 'b', 'b', 'b', 'b',\
                           'c', 'c', 'c', 'c',\
                           'd', 'd', 'd'])
    >>> gtss = GroupTimeSeriesSplit(n_splits=3)
    >>> for train_idx, test_idx in gtss.split(groups, groups=groups):
    ...     print("TRAIN:", train_idx, "TEST:", test_idx)
    ...     print("TRAIN GROUP:", groups[train_idx],\
                  "TEST GROUP:", groups[test_idx])
    TRAIN: [0, 1, 2, 3, 4, 5] TEST: [6, 7, 8, 9, 10]
    TRAIN GROUP: ['a' 'a' 'a' 'a' 'a' 'a']\
    TEST GROUP: ['b' 'b' 'b' 'b' 'b']
    TRAIN: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10] TEST: [11, 12, 13, 14]
    TRAIN GROUP: ['a' 'a' 'a' 'a' 'a' 'a' 'b' 'b' 'b' 'b' 'b']\
    TEST GROUP: ['c' 'c' 'c' 'c']
    TRAIN: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]\
    TEST: [15, 16, 17]
    TRAIN GROUP: ['a' 'a' 'a' 'a' 'a' 'a' 'b' 'b' 'b' 'b' 'b' 'c' 'c' 'c' 'c']\
    TEST GROUP: ['d' 'd' 'd']
    """
    @_deprecate_positional_args
    def __init__(self,
                 n_splits=5,
                 *,
                 max_train_size=None
                 ):
        super().__init__(n_splits, shuffle=False, random_state=None)
        self.max_train_size = max_train_size

    def split(self, X, y=None, groups=None):
        """Generate indices to split data into training and test set.
        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            Training data, where n_samples is the number of samples
            and n_features is the number of features.
        y : array-like of shape (n_samples,)
            Always ignored, exists for compatibility.
        groups : array-like of shape (n_samples,)
            Group labels for the samples used while splitting the dataset into
            train/test set.
        Yields
        ------
        train : ndarray
            The training set indices for that split.
        test : ndarray
            The testing set indices for that split.
        """
        if groups is None:
            raise ValueError(
                "The 'groups' parameter should not be None")
        X, y, groups = indexable(X, y, groups)
        n_samples = _num_samples(X)
        n_splits = self.n_splits
        n_folds = n_splits + 1
        group_dict = {}
        u, ind = np.unique(groups, return_index=True)
        unique_groups = u[np.argsort(ind)]
        n_samples = _num_samples(X)
        n_groups = _num_samples(unique_groups)
        for idx in np.arange(n_samples):
            if (groups[idx] in group_dict):
                group_dict[groups[idx]].append(idx)
            else:
                group_dict[groups[idx]] = [idx]
        if n_folds > n_groups:
            raise ValueError(
                ("Cannot have number of folds={0} greater than"
                 " the number of groups={1}").format(n_folds,
                                                     n_groups))
        group_test_size = n_groups // n_folds
        group_test_starts = range(n_groups - n_splits * group_test_size,
                                  n_groups, group_test_size)
        for group_test_start in group_test_starts:
            train_array = []
            test_array = []
            for train_group_idx in unique_groups[:group_test_start]:
                train_array_tmp = group_dict[train_group_idx]
                train_array = np.sort(np.unique(
                                      np.concatenate((train_array,
                                                      train_array_tmp)),
                                      axis=None), axis=None)
            train_end = train_array.size
            if self.max_train_size and self.max_train_size < train_end:
                train_array = train_array[train_end -
                                          self.max_train_size:train_end]
            for test_group_idx in unique_groups[group_test_start:
                                                group_test_start +
                                                group_test_size]:
                test_array_tmp = group_dict[test_group_idx]
                test_array = np.sort(np.unique(
                                              np.concatenate((test_array,
                                                              test_array_tmp)),
                                     axis=None), axis=None)
            yield [int(i) for i in train_array], [int(i) for i in test_array]

In [19]:
for idx, (train_idx, test_idx) in enumerate(GroupTimeSeriesSplit().split(X, groups=X['step'])):
    print('-' * 80)
    print('Fold: ', idx)
    print(X.loc[train_idx, 'step'].unique())
    print(X.loc[test_idx, 'step'].unique())
    print('-' * 80)

--------------------------------------------------------------------------------
Fold:  0
[  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107 108
 109 110 111 112 113 114 115 116 117 118 119]
[120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137
 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155
 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173
 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191
 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209
 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 

 Chronological order is maintained and group separation is respected!

<a id=5></a>
<h3 class="list-group-item list-group-item-action active" data-toggle="list" style='color:white; background:#1777C4; border:0' role="tab" aria-controls="home">
<center>Evaluation</center></h3>

# Model accuracy scoring

The easiest way to analyze performance is with accuracy. 
It measures how many observations, both positive and negative, were correctly classified.


You shouldn’t use accuracy on imbalanced problems: it is then easy to get a high accuracy score by simply classifying all observations as the majority class. For example, in our case, by classifying all transactions as non-fraudulent we can get an accuracy of over 0.9.

**When to use it:**

- When your problem is balanced, using accuracy is usually a good start. An additional benefit is that it is really easy to explain to non-technical stakeholders in your project.
- When every class is equally important to you.
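
The accuracy trap on imbalanced data can be shown in a few lines. The labels below are made up (10 frauds among 1,000 transactions is an assumption, not the ratio of our dataset):

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 1,000 transactions, 10 of them fraudulent (1).
y_true = np.array([1] * 10 + [0] * 990)

# A "model" that flags nothing at all.
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
print(f"accuracy={acc:.2f} recall={rec:.2f}")
```

The accuracy looks excellent although not a single fraud is caught.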

# Confusion Matrix

**How to compute:**

It is a common way of presenting true positive (tp), true negative (tn), false positive (fp) and false negative (fn) predictions. Those values are presented in the form of a matrix where the Y-axis shows the true classes while the X-axis shows the predicted classes.

It is calculated on class predictions, which means the outputs from your model need to be thresholded first.

**When to use it:**

- Pretty much always. I like to see the nominal values rather than normalized ones to get a feeling for how the model is doing on different, often imbalanced, classes.
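
As a small illustration with made-up labels and scores, the matrix can be computed with scikit-learn's `confusion_matrix` once the scores are thresholded (the 0.5 threshold is an arbitrary choice here):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predicted fraud probabilities.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])
y_score = np.array([0.1, 0.2, 0.7, 0.3, 0.1, 0.4, 0.9, 0.6, 0.2, 0.05])

# Threshold the scores first; 0.5 is an arbitrary choice for illustration.
y_pred = (y_score >= 0.5).astype(int)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)
print(f"tn={tn} fp={fp} fn={fn} tp={tp}")
```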



# ROC Curve


It is a chart that visualizes the tradeoff between true positive rate (TPR) and false positive rate (FPR). Basically, for every threshold, we calculate TPR and FPR and plot it on one chart.

Of course, the higher the TPR and the lower the FPR at each threshold, the better, so classifiers whose curves lie closer to the top-left corner are better.

Since we have an imbalanced dataset, Receiver Operating Characteristic curves are not that useful, although they are an expected output of most binary classifiers.
With so many non-fraud cases, the false positive rate stays low even when the model raises many false alerts, so the curve can look good while the model performs poorly on the fraud class.

**When to use it:**

- You should use it when you ultimately care about ranking predictions and not necessarily about outputting well-calibrated probabilities (read this article by Jason Brownlee if you want to learn about probability calibration).
- You should not use it when your data is heavily imbalanced. It was discussed extensively in this article by Takaya Saito and Marc Rehmsmeier. The intuition is the following: false positive rate for highly imbalanced datasets is pulled down due to a large number of true negatives.
- You should use it when you care equally about positive and negative classes. If we care about true negatives as much as we care about true positives then it totally makes sense to use ROC AUC.
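
The points of the curve can be obtained with scikit-learn's `roc_curve`. A minimal sketch on four made-up observations:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical labels and scores, just large enough to see the mechanics.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# One (FPR, TPR) point per candidate threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold={th:.2f} FPR={f:.2f} TPR={t:.2f}")
```

Plotting `tpr` against `fpr` gives the ROC curve.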
    
# ROC AUC score

The ROC AUC score is a performance measure for classification problems across threshold settings. ROC is a probability curve and AUC represents the degree of separability: it tells how well the model can distinguish between classes. The higher the AUC, the better the model is at predicting class 0 as 0 and class 1 as 1. By analogy, the higher the AUC, the better a model is at distinguishing between patients with and without a disease. The ROC curve is plotted with TPR on the y-axis against FPR on the x-axis.
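
The separability reading can be checked by hand on a tiny made-up example: the AUC equals the share of (fraud, non-fraud) pairs in which the fraud case gets the higher score. The pairwise loop below is only for illustration and ignores ties:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

auc = roc_auc_score(y_true, y_score)

# Share of (positive, negative) pairs that are ranked correctly.
pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
pairwise = float(np.mean([p > n for p in pos for n in neg]))

print(auc, pairwise)
```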


# Recall

It measures how many of all positive observations we classified as positive. It tells us how many fraudulent transactions we recalled from all fraudulent transactions. It is also known as the true positive rate:

$$\text{recall} = \frac{tp}{tp + fn}$$

When you are optimizing recall, you want to put all the guilty in prison.

**When to use it:**

- Usually, you will not use it alone but rather coupled with other metrics like precision.
- That being said, recall is a go-to metric when you really care about catching all fraudulent transactions, even at the cost of false alerts. Potentially it is cheap for you to process those alerts and very expensive when a fraudulent transaction goes unseen.
    
# Precision

It measures how many observations predicted as positive are in fact positive. Taking our fraud detection example, it tells us what share of the transactions flagged as fraudulent really are fraudulent. It is also known as the positive predictive value:

$$\text{precision} = \frac{tp}{tp + fp}$$

When you are optimizing precision, you want to make sure that the people you put in prison are guilty.

**When to use it:**

- Again, it usually doesn’t make sense to use it alone but rather coupled with other metrics like recall.
- When raising false alerts is costly and you want all the positive predictions to be worth looking at, you should optimize for precision.
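
Both metrics are available in scikit-learn. The sketch below uses made-up predictions (3 frauds among 10 transactions) where the counts are easy to verify by hand:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels: 3 fraudulent transactions (1) among 10.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 0])

# Here tp = 2, fp = 2 and fn = 1.
precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)
print(f"precision={precision:.2f} recall={recall:.2f}")
```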
    


**Precision vs. Recall for Imbalanced Classification:**

You may decide to use precision or recall on your imbalanced classification problem.

Maximizing precision will minimize the number of false positives, whereas maximizing recall will minimize the number of false negatives.

- Precision: appropriate when minimizing false positives is the focus.
- Recall: appropriate when minimizing false negatives is the focus.

Sometimes, we want excellent predictions of the positive class. We want high precision and high recall.

This can be challenging, as increases in recall often come at the expense of decreases in precision.

    In imbalanced datasets, the goal is to improve recall without hurting precision. These goals, however, are often conflicting, since in order to increase the TP for the minority class, the number of FP is also often increased, resulting in reduced precision.
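
The conflict shows up as soon as the decision threshold is moved. A small sketch with made-up scores (the three thresholds are arbitrary):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels and fraud scores, sorted by score for readability.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])
y_score = np.array([0.05, 0.1, 0.2, 0.3, 0.35, 0.45, 0.5, 0.6, 0.7, 0.9])

results = {}
for threshold in (0.8, 0.55, 0.4):
    # Lowering the threshold flags more transactions as fraud.
    y_pred = (y_score >= threshold).astype(int)
    results[threshold] = (precision_score(y_true, y_pred),
                          recall_score(y_true, y_pred))
    print(threshold, results[threshold])
```

In this toy case, lowering the threshold raises recall at every step while precision drops.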
    
    
# PR AUC score | Average precision

Similarly to ROC AUC score you can calculate the Area Under the Precision-Recall Curve to get one number that describes model performance.

You can also think about PR AUC as the average of precision scores calculated for each recall threshold [0.0, 1.0]. You can also adjust this definition to suit your business needs by choosing/clipping recall thresholds if needed.

**When to use it:**

- When you want to communicate the precision/recall decision to other stakeholders.
- When you want to choose the threshold that fits the business problem.
- When your data is heavily imbalanced. As mentioned before, it was discussed extensively in this article by Takaya Saito and Marc Rehmsmeier. The intuition is the following: since PR AUC focuses mainly on the positive class (PPV and TPR) it cares less about the frequent negative class.
- When you care more about the positive than the negative class. If you care more about the positive class and hence PPV and TPR, you should go with the Precision-Recall curve and PR AUC (average precision).
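
In scikit-learn the curve and its summary are available as `precision_recall_curve` and `average_precision_score`. A minimal sketch on four made-up observations:

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# Points of the precision-recall curve, one per candidate threshold.
precision, recall, thresholds = precision_recall_curve(y_true, y_score)
for p, r in zip(precision, recall):
    print(f"precision={p:.2f} recall={r:.2f}")

# Average precision: precision weighted by the increase in recall.
ap = average_precision_score(y_true, y_score)
print(ap)
```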
    
# F beta score

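Simply put, it combines precision and recall into one metric. The higher the score, the better our model is. You can calculate it in the following way:

$$F_\beta = (1 + \beta^2) \cdot \frac{\text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}}$$
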
When choosing beta in your F-beta score, the more you care about recall over precision, the higher the beta you should choose. For example, with the F1 score we care equally about recall and precision; with the F2 score, recall is twice as important to us.

With 0<beta<1 we care more about precision, and so the higher the threshold, the higher the F beta score. When beta>1 our optimal threshold moves toward lower thresholds, and with beta=1 it is somewhere in the middle.

**When to use it:**

- Pretty much in every binary classification problem. It is my go-to metric when working on those problems. It can be easily explained to business stakeholders.
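
As a quick check, the sketch below computes the score from the standard F-beta definition and compares it with scikit-learn's `fbeta_score`. The labels are made up, and `f_beta` is a hypothetical helper written only for this comparison:

```python
import numpy as np
from sklearn.metrics import fbeta_score, precision_score, recall_score

y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 0])

p = precision_score(y_true, y_pred)
r = recall_score(y_true, y_pred)

def f_beta(precision, recall, beta):
    """F-beta computed directly from precision and recall."""
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

scores = {}
for beta in (0.5, 1, 2):
    scores[beta] = f_beta(p, r, beta)
    library = fbeta_score(y_true, y_pred, beta=beta)
    print(f"beta={beta}: manual={scores[beta]:.4f} sklearn={library:.4f}")
```

Recall is higher than precision in this toy case, so the score grows as beta puts more weight on recall.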
    
For more details, see this article: [https://neptune.ai/blog/evaluation-metrics-binary-classification](https://neptune.ai/blog/evaluation-metrics-binary-classification)
 
The complete evaluation will be done once we train the best tuned model on all the data we have.

<a id=6></a>
<h3 class="list-group-item list-group-item-action active" data-toggle="list" style='color:white; background:#1777C4; border:0' role="tab" aria-controls="home">
<center>Deploy</center></h3>

The deployment of machine learning models is the process for making models available in production environments, where they can provide predictions to other software systems.

- One of the last stages in the machine learning lifecycle.
- Potentially the most challenging stage.
- Challenges of traditional software:
    - Reliability
    - Reusability
    - Maintainability
    - Flexibility
- Additional challenges specific to machine learning:
    - Reproducibility
- Needs coordination of data scientists, IT teams, software developers and business professionals to:
    - Ensure the model works reliably
    - Ensure the model delivers the intended result
- Potential discrepancy between the programming language in which the model is developed and the production system language:
    - Re-coding the model extends the project timeline and risks lack of reproducibility

**Why is model deployment important?**

- To start using a machine learning model, it needs to be effectively deployed into production, so that it can provide predictions to other software systems.
- To maximize the value of the machine learning model, we need to be able to reliably extract the predictions and share them with other systems.


**Research Environment**

- The research environment is a setting with tools, programs and software suitable for data analysis and the development of machine learning models.
- Here, we develop the machine learning models and identify their value.

This is done by a data scientist; I prefer to work in Jupyter for this phase.

**Production Environment**

* The Production Environment is a real-time setting with running programs and hardware setups that allow the organization’s daily operations.
* It’s the place where the machine learning models are actually available for business use.
* It allows organisations to show clients a “live” service.

This job is done by a solid team of software engineers, ML engineers and DevOps engineers.



There are four ways to deploy models, corresponding to four ML system architectures:

1. Model embedded in the application
2. Served via a dedicated service
3. Model published as data (streaming)
4. Batch prediction (offline process)
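As a rough illustration of architectures 2 and 4, the sketch below scores transactions either through a small HTTP service or in an offline batch. It deliberately uses only the Python standard library so that it runs anywhere; it is not the FastAPI application from the linked repository. `toy_model`, the `/predict` route and the `amount` field are assumptions made for this example, and the scoring rule is a placeholder rather than the trained classifier.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def toy_model(transaction):
    """Hypothetical stand-in for the trained model: flags large amounts."""
    return 1 if transaction.get("amount", 0.0) > 1000.0 else 0


def predict_batch(transactions):
    """Architecture 4: score a whole batch offline, e.g. in a nightly job."""
    return [toy_model(t) for t in transactions]


class PredictHandler(BaseHTTPRequestHandler):
    """Architecture 2: a dedicated service answering one request at a time."""

    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        transaction = json.loads(self.rfile.read(length) or b"{}")
        body = json.dumps({"is_fraud": toy_model(transaction)}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(host="127.0.0.1", port=0):
    """Build the HTTP server; port=0 lets the OS pick a free port."""
    return HTTPServer((host, port), PredictHandler)


if __name__ == "__main__":
    # Batch mode: both transactions are scored in one call.
    print(predict_batch([{"amount": 25.0}, {"amount": 4200.0}]))
    # Service mode (blocking): make_server(port=8000).serve_forever()
```

Both modes call the same scoring function, which keeps the model logic in one place whichever architecture is chosen.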


I developed a baseline showing how to deploy the model using FastAPI and Docker on Heroku:

https://github.com/DeepSparkChaker/FraudDetection_Fastapi


Complete deployment of our model is done here : 
<a id=7></a>
<h3 class="list-group-item list-group-item-action active" data-toggle="list" style='color:white; background:#1777C4; border:0' role="tab" aria-controls="home">
<center>Summary</center></h3> 

We have developed an end-to-end machine learning project using the CRISP-DM methodology. The work is still in progress. Always keep in mind that a data science / ML project must be done as a team and iteratively in order to properly exploit our data and add value to our business. Also keep in mind that AI helps you make the decision by using the added value extracted from the data, but it does not take on the accountability. So we should always use composite AI when making the final decision.

Don't forget to upvote if you find it useful.

https://www.kaggle.com/bannourchaker/frauddetection-part3-modeling2-selectbestmodel?scriptVersionId=81276811

For the complete deployment baseline, see:

https://github.com/DeepSparkChaker/FraudDetection_Fastapi

References:

https://developer.nvidia.com/blog/leveraging-machine-learning-to-detect-fraud-tips-to-developing-a-winning-kaggle-solution/

Python guidelines:

https://gist.github.com/sloria/7001839

Feature selection:

https://www.kaggle.com/sz8416/6-ways-for-feature-selection

https://pub.towardsai.net/feature-selection-and-removing-in-machine-learning-dd3726f5865c

https://www.kaggle.com/bannourchaker/1-featuresengineer-selectionpart1?scriptVersionId=72906910

CRISP-DM:

https://www.kaggle.com/bannourchaker/4-featureengineer-featuresselectionpart4?scriptVersionId=73374083

Quantile transformer:

https://machinelearningmastery.com/quantile-transforms-for-machine-learning/

Best link for all : 

https://neptune.ai/blog/tabular-data-binary-classification-tips-and-tricks-from-5-kaggle-competitions

Complete guide to stacking:

https://www.analyticsvidhya.com/blog/2021/08/ensemble-stacking-for-machine-learning-and-deep-learning/

https://neptune.ai/blog/ensemble-learning-guide

https://www.kaggle.com/prashant111/adaboost-classifier-tutorial


Missing values:

https://www.kaggle.com/dansbecker/handling-missing-values

Binning : 

https://heartbeat.fritz.ai/hands-on-with-feature-engineering-techniques-variable-discretization-7deb6a5c6e27

https://www.analyticsvidhya.com/blog/2020/10/getting-started-with-feature-engineering/

Categorical encoding:

https://innovation.alteryx.com/encode-smarter/

https://github.com/alteryx/categorical_encoding/blob/main/guides/notebooks/categorical-encoding-guide.ipynb

https://www.analyticsvidhya.com/blog/2020/08/types-of-categorical-data-encoding/

https://maxhalford.github.io/blog/target-encoding/


Choice of k for k-means:

https://www.analyticsvidhya.com/blog/2021/05/k-mean-getting-the-optimal-number-of-clusters/

Imputation : 

https://machinelearningmastery.com/knn-imputation-for-missing-values-in-machine-learning/

https://machinelearningmastery.com/iterative-imputation-for-missing-values-in-machine-learning/

Choice of ROC vs. precision-recall curves:

https://machinelearningmastery.com/roc-curves-and-precision-recall-curves-for-classification-in-python/


How to tune, for future work:

https://www.kaggle.com/hamidrezabakhtaki/xgboost-catboost-lighgbm-optuna-final-submission

https://www.kaggle.com/bextuychiev/lgbm-optuna-hyperparameter-tuning-w-understanding



Deploy:

https://towardsdatascience.com/from-jupyter-notebook-to-deployment-a-straightforward-example-1838c203a437

https://github.com/DeepSparkChaker/Titanic_Deep_Spark/blob/main/app.py

https://github.com/Kunal-Varma/Deployment-of-ML-model-using-FASTAPI/tree/2cc0319abbec469010a5139f460004f2a75a7482

https://realpython.com/fastapi-python-web-apis/

https://github.com/tiangolo/fastapi/issues/3373

https://www.freecodecamp.org/news/data-science-and-machine-learning-project-house-prices/

https://github.com/tiangolo/fastapi/issues/1616

https://stackoverflow.com/questions/68244582/display-dataframe-as-fastapi-output

https://www.kaggle.com/sakshigoyal7/credit-card-customers

https://github.com/renanmouraf/data-science-house-prices

https://towardsdatascience.com/data-science-quick-tips-012-creating-a-machine-learning-inference-api-with-fastapi-bb6bcd0e6b01

https://towardsdatascience.com/how-to-build-and-deploy-a-machine-learning-model-with-fastapi-64c505213857

https://analyticsindiamag.com/complete-hands-on-guide-to-fastapi-with-machine-learning-deployment/

https://github.com/shaz13/katana/blob/develop/Dockerfile


https://github.com/TripathiAshutosh/FastAPI/blob/main/main.py

Best practices:

https://theaisummer.com/best-practices-deep-learning-code/

https://github.com/The-AI-Summer/Deep-Learning-In-Production/tree/master/2.%20Writing%20Deep%20Learning%20code:%20Best%20Practises

Docker:
 
 https://towardsdatascience.com/docker-in-pieces-353525ec39b0?fbclid=IwAR102sks2L0vRTde2qz1g4I4NhqXxnoqfV4IFzmZke4DvGcuiuYhj25eVSY
 
https://github.com/dkhundley/ds-quick-tips/blob/master/012_dockerizing_fastapi/Dockerfile


Deploy + scaling:

https://towardsdatascience.com/deploying-ml-models-in-production-with-fastapi-and-celery-7063e539a5db
https://github.com/jonathanreadshaw/ServingMLFastCelery

https://github.com/trainindata/deploying-machine-learning-models/blob/aaeb3e65d0a58ad583289aaa39b089f11d06a4eb/section-04-research-and-development/07-feature-engineering-pipeline.ipynb

MLOps:

https://www.linkedin.com/posts/vipulppatel_getting-started-with-mlops-21-page-tutorial-activity-6863895411837415424-dWMh/?fbclid=IwAR3Y4clbzujS_s2FFWg3tTYMKaGhh3vo25NUyoVdKHAJ7zynmCTNtzlHQ4M

https://towardsai.net/p/machine-learning/mlops-demystified?utm_source=twitter&utm_medium=social&utm_campaign=rop-content-recycle&fbclid=IwAR3MimsSXCFq3GqiLKoaQqXbeb3bkSwKhSkfQSKT_c1gsHDMGSBAv63s7Po
https://www.youtube.com/watch?v=9I8X-3HIErc

https://pub.towardsai.net/deployment-ml-ops-guide-series-2-69d4a13b0dcf

Publish to Medium:

https://towardsai.net/p/data-science/how-to-publish-a-jupyter-notebook-as-a-medium-blogpost?utm_source=twitter&utm_medium=social&utm_campaign=rop-content-recycle&fbclid=IwAR2-an7kknO3bsI5xjRdjL3jiwuPy7MBN5lVBc6fzx15mGY2iLS5KndCYWc


