# Project : 

### Loading datasets

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Read the data
train_data = pd.read_csv('data/training.csv')
test_data = pd.read_csv('data/test.csv')


# Separate target from predictors
y = train_data.FraudResult
X = train_data.drop(['FraudResult'], axis=1)


X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                           random_state=0)


# train_data.head(5)


## Exploring data

### Looking at the dimensions of our dataset and checking for missing values

In [2]:
# shape is (number of rows, number of columns)
n_rows, n_cols = train_data.shape
print(f"{n_rows} observations")
print(f"{n_cols - 1} features")  # exclude the FraudResult target column


95662 observations
15 features


In [3]:
# Get names of columns with missing values
cols_with_missing = [col for col in X_train.columns
                     if X_train[col].isnull().any()]
print(cols_with_missing)

# No column has missing values, so dropping incomplete rows would not change the dataset
# train_data = train_data.dropna(axis=0) 

[]


### Looking at data types

Categorical data needs to be converted to numerical data before a model can use it, so the categorical variables need some form of encoding into numeric values. In general, we know that removing data is rarely a good option, since it prevents the model from learning from the full training set. More specifically for our Xente dataset, we can say that keeping only the numerical ...

In [4]:
# Select categorical columns (object dtype); no cardinality filter is applied here
categorical_cols = [cname for cname in X_train.columns if X_train[cname].dtype == "object"]

# Select numerical columns
numerical_cols = [cname for cname in X_train.columns if X_train[cname].dtype in ['int64', 'float64']]

# Keep selected columns only
my_cols = categorical_cols + numerical_cols
X_train = X_train[my_cols].copy()
X_valid = X_valid[my_cols].copy()

print(categorical_cols)
print(numerical_cols)



['TransactionId', 'BatchId', 'AccountId', 'SubscriptionId', 'CustomerId', 'CurrencyCode', 'ProviderId', 'ProductId', 'ProductCategory', 'ChannelId', 'TransactionStartTime']
['CountryCode', 'Amount', 'Value', 'PricingStrategy']


## Building a base pipeline

In [5]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

"""
As said earlier, numerical data can be used directly by a model, and in our case we don't even need imputation since no field is missing.
Our single preprocessing step therefore only concerns the categorical variables.
"""

# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))

])

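# Bundle the preprocessing steps (only the categorical columns are listed here)
# Note: ColumnTransformer defaults to remainder='drop', so columns that are not listed
# (here the numerical ones) are not passed on to the model
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, categorical_cols)
    ])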
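
To make the encoding step concrete, here is a minimal pure-Python sketch of what one-hot encoding does, including the all-zeros row that `handle_unknown='ignore'` produces for a category that was not seen during fitting. The helper names `fit_categories` and `one_hot` and the sample values are hypothetical; they are not part of the notebook or of scikit-learn.

```python
def fit_categories(values):
    """Learn the sorted list of distinct categories from the training values."""
    return sorted(set(values))


def one_hot(value, categories):
    """Return one 0/1 indicator per known category; an unknown value gives all zeros."""
    return [1 if value == category else 0 for category in categories]


# Hypothetical training values for a single categorical column
train_values = ["ChannelId_3", "ChannelId_2", "ChannelId_3", "ChannelId_1"]
categories = fit_categories(train_values)

encoded_known = one_hot("ChannelId_2", categories)
encoded_unknown = one_hot("ChannelId_5", categories)  # never seen in training

print(categories)
print(encoded_known)
print(encoded_unknown)
```

Each distinct category becomes its own column, so an identifier column with one distinct value per row would produce as many columns as there are rows.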

### Baseline model

In [6]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

# model = RandomForestRegressor(n_estimators=10, random_state=0)
model = DecisionTreeRegressor()

In [7]:
from sklearn.metrics import mean_absolute_error

# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model)
                             ])

# Preprocessing of training data, fit model 
my_pipeline.fit(X_train, y_train)

# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_valid)

# Evaluate the model
score = mean_absolute_error(y_valid, preds)
print('MAE:', score)



MAE: 0.0013066429728740918


### Need to write a conclusion 
This value suggests that our model is very accurate, but the model is probably overfitting.
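
As a rough sketch of how to read this number: when both the labels and the predictions are 0 or 1, the mean absolute error is simply the fraction of misclassified rows. The helper names and the toy lists below are hypothetical. The figure of 19133 validation rows is not printed in the notebook; it is derived from the row counts shown in other cells (95662 rows in total, 76529 of them in the training split), and the assumption that the regressor outputs are essentially 0 or 1 is mine.

```python
def mean_absolute_error_binary(y_true, y_pred):
    """Mean of |true - predicted| over all rows."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def error_rate(y_true, y_pred):
    """Fraction of rows where the prediction differs from the label."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical toy data: one fraud case out of two is missed
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

mae = mean_absolute_error_binary(y_true, y_pred)
err = error_rate(y_true, y_pred)
print(mae, err)

# Same arithmetic at the assumed scale of the validation set:
# about 25 wrong rows out of 19133 gives an MAE close to the one reported above
implied_mae = 25 / 19133
print(implied_mae)
```

Under these assumptions, an MAE of about 0.0013 says nothing about how many of the rare fraud cases were caught, only that few rows overall were wrong.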

### Checking class data proportions

In [8]:
full = pd.concat([X_train,y_train],axis=1) 
not_fraud = full[full.FraudResult==0]
fraud = full[full.FraudResult==1]
# len() counts rows, whereas .size counts rows * columns
print(f"{(len(not_fraud)/(len(fraud)+len(not_fraud)))*100} % of the transactions are non fraudulent")


99.79354231729148 % of the transactions are non fraudulent


Almost all the transactions registered in our dataset are non-fraudulent: we have an imbalanced dataset. Any model will tend to overfit on this type of dataset, because what it learns is dominated by the non-fraudulent transactions and it can simply ignore the minority class. Even when testing our model, we give it a test set that also contains a very large majority of non-fraudulent cases. This matters because accuracy is directly tied to that fact: if the test set contained only fraudulent transactions, for example, the accuracy would be very low. That is why better metrics should be used for imbalanced datasets.

Learning algorithms all share the same general goal: adapt their parameters to reduce their error and get as accurate a model as possible. In this imbalanced scenario, we will therefore have to consider different approaches.
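
The point about accuracy can be illustrated with a small sketch. It uses hypothetical toy labels (998 non-fraudulent, 2 fraudulent) and a "model" that always predicts non-fraud; the helper functions are written by hand here and are only assumptions for illustration, not the notebook's code.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred):
    """Fraction of true fraud cases (label 1) that were predicted as fraud."""
    positives = sum(y_true)
    true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return true_positives / positives if positives else 0.0


# Hypothetical imbalanced test set: 998 non-fraud, 2 fraud
y_imbalanced = [0] * 998 + [1] * 2
always_non_fraud = [0] * len(y_imbalanced)

acc_imbalanced = accuracy(y_imbalanced, always_non_fraud)
rec_imbalanced = recall(y_imbalanced, always_non_fraud)

# Hypothetical test set that contains only fraudulent transactions
y_fraud_only = [1] * 10
acc_fraud_only = accuracy(y_fraud_only, [0] * len(y_fraud_only))

print(acc_imbalanced, rec_imbalanced, acc_fraud_only)
```

The same useless predictor scores very high accuracy on the imbalanced set and zero accuracy on the fraud-only set, while its recall is zero in both cases.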


## Better ways to evaluate a model's performance on an imbalanced dataset

In [9]:
from sklearn.metrics import f1_score, recall_score,accuracy_score


# Compare the model predictions with the true validation labels

# Compute the accuracy, F1 and recall scores (each is a fraction between 0 and 1)
def compute_imbalanced_scores(model_preds):
    f1 = f1_score(y_valid, model_preds)
    recall = recall_score(y_valid, model_preds)
    accuracy = accuracy_score(y_valid, model_preds)
    # F1 is the harmonic mean of precision and recall:
    # f1 = (2 * precision * recall) / (precision + recall)
    return f" Accuracy score: {accuracy} % \n F1 score: {f1} % \n Recall score: {recall} "
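
The three scores above all derive from the four confusion-matrix counts. The sketch below computes them by hand. The counts are an assumption: the notebook never prints a confusion matrix, so they were chosen to be consistent with the decision tree scores reported on the original dataset and with a validation set of 19133 rows. The function name `scores_from_counts` is hypothetical.

```python
def scores_from_counts(tp, fp, fn, tn):
    """Compute precision, recall, F1 and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Assumed counts (inferred, not printed by the notebook):
# 10 frauds caught, 1 false alarm, 25 frauds missed, 19097 correct non-fraud rows
scores = scores_from_counts(tp=10, fp=1, fn=25, tn=19097)
for name, value in scores.items():
    print(f"{name}: {value}")
```

With these counts, accuracy stays above 0.998 even though 25 of the 35 fraud cases are missed, which is why recall and F1 are the more informative scores here.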



## Trying different models

In [10]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier


decision_tree = DecisionTreeClassifier(random_state=0)
random_forest = RandomForestClassifier(n_estimators=10, random_state=0)
xgb = XGBClassifier()

models = {"decision tree":decision_tree,"random forest":random_forest,"xgb":xgb}


In [11]:
from enum import Enum

class Models(Enum):
    DecisionTree = "decision tree"
    RandomForest = "random forest"  # same spelling as the key used in the models dict
    XGB = "xgb"

In [12]:
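def compare_models_predictions(trainX, trainY, models=None):
    # Avoid a mutable default argument: build the default model mapping at call time
    if models is None:
        models = {Models.DecisionTree: decision_tree, Models.RandomForest: random_forest, Models.XGB: xgb}
    model_predictions = {}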
    for model_name in models:
        print(f"Trainning {model_name} ...")
        pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                                    ('model',models[model_name])
                                    ])
        pipeline.fit(trainX,trainY)
        model_preds = pipeline.predict(X_valid)
        model_predictions[model_name] = model_preds
    return model_predictions

In [13]:

models_predictions = compare_models_predictions(X_train,y_train)
#models_predictions["test"] = pd.DataFrame(np.zeros(y_train.size))

print()
print("Original dataset")
for model in models_predictions:
    print(model)
    scores = compute_imbalanced_scores(models_predictions[model])
    print(scores)


Trainning Models.DecisionTree ...
Trainning Models.RandomForest ...
Trainning Models.XGB ...

Original dataset
Models.DecisionTree
 Accuracy score: 0.998641091308211 % 
 F1 score: 0.43478260869565216 % 
 Recall score: 0.2857142857142857 
Models.RandomForest
 Accuracy score: 0.998484294151466 % 
 F1 score: 0.32558139534883723 % 
 Recall score: 0.2 
Models.XGB
 Accuracy score: 0.998588825589296 % 
 F1 score: 0.39999999999999997 % 
 Recall score: 0.2571428571428571 


## Several options: oversampling vs. downsampling

In [14]:

Xnew = pd.concat([X_train, y_train], axis=1)

# separate minority and majority classes
not_fraud = Xnew[Xnew.FraudResult==0]
fraud = Xnew[Xnew.FraudResult==1]

### Undersample majority class: train the model with a smaller number of non-fraudulent transactions

In [15]:
from sklearn.utils import resample

not_fraud_downsampled = resample(not_fraud,
                                replace = False, # sample without replacement
                                n_samples = len(fraud), # match minority n
                                random_state = 27) # reproducible results

# combine minority and downsampled majority
downsampled = pd.concat([not_fraud_downsampled, fraud])

In [16]:
downsampled.FraudResult.value_counts()

FraudResult
0    158
1    158
Name: count, dtype: int64

In [17]:
# Train and evaluate the models on the undersampled training set
down_y_train = downsampled.FraudResult
down_X_train = downsampled.drop('FraudResult', axis=1)

downsampled_predictions = compare_models_predictions(down_X_train,down_y_train)
print()
print("Downsampled the non fraudulent transactions")
for model in downsampled_predictions:
    print(model)
    scores = compute_imbalanced_scores(downsampled_predictions[model])
    print(scores)

Trainning Models.DecisionTree ...
Trainning Models.RandomForest ...
Trainning Models.XGB ...

Downsampled the non fraudulent transactions
Models.DecisionTree
 Accuracy score: 0.9491977212146553 % 
 F1 score: 0.06538461538461537 % 
 Recall score: 0.9714285714285714 
Models.RandomForest
 Accuracy score: 0.8715308629070193 % 
 F1 score: 0.027689873417721517 % 
 Recall score: 1.0 
Models.XGB
 Accuracy score: 0.8497883238383944 % 
 F1 score: 0.023777173913043476 % 
 Recall score: 1.0 


### Oversample minority class

In [18]:
fraud_upsampled = resample(fraud,
                          replace=True, # sample with replacement
                          n_samples=len(not_fraud), # match number in majority class
                          random_state=27) # reproducible results

In [19]:
upsampled = pd.concat([not_fraud, fraud_upsampled])
upsampled.FraudResult.value_counts()

FraudResult
0    76371
1    76371
Name: count, dtype: int64
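
To see what sampling with replacement implies here, the following sketch redraws 76371 items from only 158 distinct ones, using the standard library instead of `sklearn.utils.resample`. The integer ids are hypothetical stand-ins for the fraud rows, and the exact draws will not match the notebook's.

```python
import random
from collections import Counter

rng = random.Random(27)  # fixed seed so the sketch is reproducible

fraud_ids = list(range(158))                     # stand-ins for the 158 fraud rows
upsampled_ids = rng.choices(fraud_ids, k=76371)  # draw with replacement

counts = Counter(upsampled_ids)
distinct_rows = len(counts)
average_copies = len(upsampled_ids) / distinct_rows

print(f"{len(upsampled_ids)} rows drawn from {distinct_rows} distinct fraud rows")
print(f"each distinct row appears about {average_copies:.0f} times on average")
```

The upsampled class is balanced in size but carries no new information: every fraud row is an exact duplicate of one of the original 158.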

In [20]:
up_y_train = upsampled.FraudResult
up_X_train = upsampled.drop('FraudResult', axis=1)

In [21]:
# Train and evaluate the models on the training set with oversampled fraudulent transactions
oversampled_predictions = compare_models_predictions(up_X_train,up_y_train)
print()
print("Upsampled the fraudulent transactions")
for model in oversampled_predictions:
    print(model)
    scores = compute_imbalanced_scores(oversampled_predictions[model])
    print(scores)

Trainning Models.DecisionTree ...
Trainning Models.RandomForest ...
Trainning Models.XGB ...

Upsampled the fraudulent transactions
Models.DecisionTree
 Accuracy score: 0.9926827993519051 % 
 F1 score: 0.29292929292929293 % 
 Recall score: 0.8285714285714286 
Models.RandomForest
 Accuracy score: 0.998536559870381 % 
 F1 score: 0.44 % 
 Recall score: 0.3142857142857143 
Models.XGB
 Accuracy score: 0.9889719333089426 % 
 F1 score: 0.19771863117870722 % 
 Recall score: 0.7428571428571429 


## Comparing sampling methods

In [22]:
model_name = Models.XGB
scores = compute_imbalanced_scores(models_predictions[model_name])
print(f" {model_name}")
print()
print("original dataset")
print(scores)
print()
print("upsampled")
scores = compute_imbalanced_scores(oversampled_predictions[model_name])
print(scores)
print()
print("downsampled")
scores = compute_imbalanced_scores(downsampled_predictions[model_name])
print(scores)

 Models.XGB

original dataset
 Accuracy score: 0.998588825589296 % 
 F1 score: 0.39999999999999997 % 
 Recall score: 0.2571428571428571 

upsampled
 Accuracy score: 0.9889719333089426 % 
 F1 score: 0.19771863117870722 % 
 Recall score: 0.7428571428571429 

downsampled
 Accuracy score: 0.8497883238383944 % 
 F1 score: 0.023777173913043476 % 
 Recall score: 1.0 


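Looking at the results for our 3 base models, downsampling seems to be the best approach for our dataset in terms of recall (0.97 to 1.0 on the validation set), although it also gives by far the lowest F1 scores (below 0.07). I will then train a model on the downsampled data.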
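
The gap between recall and F1 can be made explicit by recovering the precision that each pair of scores implies. Rearranging F1 = 2PR / (P + R) gives P = F1 * R / (2R - F1). The sketch below applies this to the XGB scores reported above; the function name is hypothetical and the precision values are derived here, not printed by the notebook.

```python
def precision_from_f1_and_recall(f1, recall):
    """Invert F1 = 2PR / (P + R) to recover the precision P."""
    return f1 * recall / (2 * recall - f1)


# (F1, recall) pairs reported for XGB in the comparison above
reported = {
    "original": (0.39999999999999997, 0.2571428571428571),
    "upsampled": (0.19771863117870722, 0.7428571428571429),
    "downsampled": (0.023777173913043476, 1.0),
}

implied_precision = {
    name: precision_from_f1_and_recall(f1, recall)
    for name, (f1, recall) in reported.items()
}

for name, precision in implied_precision.items():
    print(f"{name}: implied precision = {precision:.4f}")
```

Read this way, the downsampled model catches every fraud case in the validation set, but only about 1 in 80 of its fraud alerts is a real fraud, compared with about 9 in 10 for the model trained on the original data.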

In [23]:
opt_model = DecisionTreeClassifier(random_state=0)


opt_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                            ('model',opt_model)
                            ])
opt_pipeline.fit(down_X_train,down_y_train)
opt_model_preds = opt_pipeline.predict(X_valid)

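# Note: this return value is not displayed because it is not the last expression of the cell;
# wrap the call in print() to see the scores
compute_imbalanced_scores(opt_model_preds)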

# submission 
#opt_preds = pd.DataFrame(opt_pipeline.predict(test_data),columns=['FraudResult']).to_csv('prediction.csv')

opt_preds = opt_pipeline.predict(test_data)
print(opt_preds)

[0 0 0 ... 0 0 0]


## Using the model with the best F1 score on the original dataset

In [25]:
# print(models.keys())
tree_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                            ('model',models["decision tree"])
                            ])
tree_pipeline.fit(X_train,y_train)
tree_preds = tree_pipeline.predict(X_valid)

compute_imbalanced_scores(tree_preds)

' Accuracy score: 0.998641091308211 % \n F1 score: 0.43478260869565216 % \n Recall score: 0.2857142857142857 '