# Machine Learning for Fraud Detection

##### **Notebook Author**: A.D. Téllez

You can find me on GitHub for collaborations.

## 1. Model Development

In [1]:
import pandas as pd

##### This data set is in the public domain.

In [2]:
card_df = pd.read_csv('card_transdata.csv')

# Quick look at our data.
card_df.head() 

,distance_from_home,distance_from_last_transaction,ratio_to_median_purchase_price,repeat_retailer,used_chip,used_pin_number,online_order,fraud
0,57.877857,0.31114,1.94594,1.0,1.0,0.0,0.0,0.0
1,10.829943,0.175592,1.294219,1.0,0.0,0.0,0.0,0.0
2,5.091079,0.805153,0.427715,1.0,0.0,0.0,1.0,0.0
3,2.247564,5.600044,0.362663,1.0,1.0,0.0,1.0,0.0
4,44.190936,0.566486,2.222767,1.0,1.0,0.0,1.0,0.0


#### Number of observations in our data set

In [3]:
print(card_df.shape[0],"observations.")

1000000 observations.


#### Looking for null values.

Data sets sometimes contain null values. Because they can compromise the performance of our model, we need to identify and treat them properly. Simply deleting the affected rows is not always the best solution.

In [4]:
card_df.isnull().sum()

distance_from_home                0
distance_from_last_transaction    0
ratio_to_median_purchase_price    0
repeat_retailer                   0
used_chip                         0
used_pin_number                   0
online_order                      0
fraud                             0
dtype: int64
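
The check above reports no nulls, so nothing needs treating in this data set. For data that does contain gaps, one common alternative to deleting rows is imputation. The sketch below uses a small made-up frame (`toy_df` and its values are hypothetical, not taken from `card_transdata.csv`) to compare dropping rows with filling each gap with the column median:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps; NOT taken from card_transdata.csv.
toy_df = pd.DataFrame({
    "distance_from_home": [5.0, np.nan, 12.0, 40.0],
    "online_order": [1.0, 0.0, np.nan, 1.0],
})

# Option 1: drop every row that has at least one null (loses two of four rows here).
dropped_df = toy_df.dropna()

# Option 2: keep all rows and fill each null with its column's median.
imputed_df = toy_df.fillna(toy_df.median())

print(len(dropped_df), "rows after dropna;", len(imputed_df), "rows after median imputation")
```

Which option is appropriate depends on how many values are missing and why; the median is only one possible fill value.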

#### Fraud cases recorded in the data set

In [5]:
# Count how many transactions belong to each value of 'fraud'.
fraud_df = card_df.value_counts(card_df['fraud']).rename(index = 'Number of cases').to_frame()
fraud_df

fraud,Number of cases
0.0,912597
1.0,87403
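
The two counts above can be turned into a fraud rate with a line of arithmetic. The sketch below hard-codes the counts from the output rather than reading the CSV; on the real frame, `card_df['fraud'].value_counts(normalize=True)` should give the same proportions.

```python
# Class counts copied from the output above.
n_legit = 912_597
n_fraud = 87_403

fraud_rate = n_fraud / (n_legit + n_fraud)
print(f"Fraud rate: {fraud_rate:.2%}")  # Fraud rate: 8.74%
```

With fewer than one transaction in ten labeled as fraud, the classes are imbalanced, which is worth keeping in mind when choosing an evaluation metric later on.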


#### Standardization

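This process rescales each feature so that its mean is zero and its standard deviation is one. The scaled values are not confined to a fixed range such as [-1, 1], but putting every feature on a comparable scale prevents features with large raw magnitudes from weighing more in our model.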

In [6]:
# Provides the tools to standardize our data.
from sklearn import preprocessing 
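
To make the effect of standardization concrete: each value x becomes z = (x - mean) / std, computed per column. The sketch below applies `StandardScaler` to a single hypothetical feature with one extreme value (the numbers are made up for illustration) and checks the resulting mean, standard deviation, and maximum:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single feature with one extreme value.
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaled = StandardScaler().fit_transform(values)

print(scaled.mean())  # approximately 0
print(scaled.std())   # approximately 1
print(scaled.max())   # roughly 2.0, so the result is not limited to [-1, 1]
```

The extreme value is still the largest after scaling; standardization changes the units, not the shape of the distribution.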

#### Understanding the process

To build our model, we need input data, called the independent variables (or features). We also need to define the expected outcome: the variable we want to predict or classify is called the dependent variable (or target). In this notebook the target is the `fraud` column.

In [7]:
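transform = preprocessing.StandardScaler()  # Standardizes each column to zero mean and unit variance.

# Independent variables (the feature matrix).
# NOTE: card_df still includes the 'fraud' column, so the target is scaled into X as well.
# Drop it first, e.g. card_df.drop(columns='fraud'), before fitting a model to avoid target leakage.
X = transform.fit_transform(card_df)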
X

array([[ 0.47788202, -0.18284892,  0.04349141, ..., -0.33445812,
        -1.36442519, -0.30947363],
       [-0.24160679, -0.18809398, -0.18930045, ..., -0.33445812,
        -1.36442519, -0.30947363],
       [-0.3293694 , -0.16373307, -0.49881185, ..., -0.33445812,
         0.73290937, -0.30947363],
       ...,
       [-0.36264968, -0.13790278, -0.57369398, ..., -0.33445812,
         0.73290937, -0.30947363],
       [-0.34209827, -0.1855234 , -0.48162807, ..., -0.33445812,
         0.73290937, -0.30947363],
       [ 0.48140344, -0.18257921, -0.51338354, ..., -0.33445812,
         0.73290937, -0.30947363]])

In [8]:
Y = card_df['fraud'].to_numpy()
Y

array([0., 0., 0., ..., 0., 0., 0.])
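
Because `fit_transform(card_df)` in cell 7 scales every column, the `fraud` target ends up inside `X` (it is the last column of the array shown above). A minimal sketch of keeping features and target separate, using a small hypothetical frame in place of `card_df` (the names `toy_df`, `features`, `target`, and `X_toy` are illustrative):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for card_df with the same kind of columns.
toy_df = pd.DataFrame({
    "distance_from_home": [57.9, 10.8, 5.1, 2.2, 44.2, 30.0],
    "online_order": [0.0, 0.0, 1.0, 1.0, 1.0, 0.0],
    "fraud": [0.0, 0.0, 0.0, 0.0, 1.0, 1.0],
})

features = toy_df.drop(columns="fraud")  # independent variables only
target = toy_df["fraud"].to_numpy()      # dependent variable, left unscaled

X_toy = StandardScaler().fit_transform(features)
print(X_toy.shape, target.shape)  # (6, 2) (6,)
```

Applied to the real data, the same `drop(columns='fraud')` step would leave seven feature columns in `X`.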

### Split the data -- train vs test

In [9]:
# Allows us to split our data into training and testing data
from sklearn.model_selection import train_test_split

# Allows us to try different hyperparameter values for a classification algorithm and find the best combination
from sklearn.model_selection import GridSearchCV
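
As a rough illustration of what `GridSearchCV` does, the sketch below tunes one hyperparameter on synthetic data. The classifier (`LogisticRegression`), the parameter grid, and the `recall` scoring choice are all assumptions made for the example; the notebook has not selected a model at this point.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in data (about 10% positives); not the card data.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=7, weights=[0.9, 0.1], random_state=2
)

# Hypothetical grid: candidate values for the regularization strength C.
param_grid = {"C": [0.01, 0.1, 1.0]}

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    cv=5,              # 5-fold cross-validation for every candidate
    scoring="recall",  # assumption: catching positive cases matters most
)
search.fit(X_demo, y_demo)

print(search.best_params_)
```

After fitting, `search.best_estimator_` holds the model refitted on all of `X_demo` with the best-scoring parameters.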

In [10]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=2)
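
With `test_size = 0.2`, the 1,000,000 observations are divided into 800,000 training rows and 200,000 test rows. One possible variation, sketched below on random stand-in data (all arrays here are hypothetical), is to split first with `stratify` so both parts keep the same class ratio, and then fit the scaler on the training rows only so that no test-set statistics reach the model:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X_raw = rng.normal(loc=10.0, scale=5.0, size=(1000, 3))  # hypothetical raw features
y = (rng.random(1000) < 0.1).astype(float)               # hypothetical labels, ~10% positives

X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y, test_size=0.2, random_state=2, stratify=y
)

scaler = StandardScaler().fit(X_tr)   # learn mean and std from training rows only
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)  # reuse the training statistics

print(X_tr.shape, X_te.shape)  # (800, 3) (200, 3)
```

This ordering differs from the notebook, which scales the full data set before splitting; it is shown as an option, not as the author's method.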