# Predict Credit Card Fraud

Your task is to use logistic regression to build a predictive model that determines whether a transaction is fraudulent, based on a subset of the "synthetic financial dataset" available on Kaggle.

## Load the Data

1. Let’s begin by loading the data into a pandas DataFrame named transactions. How many transactions are fraudulent?

In [60]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the data
transactions = pd.read_csv('transactions_modified.csv')
print(transactions.head())
transactions.info() # info() prints its report directly; it shows that there are no null values.

# Fraudulent transactions are marked with 1 in the "isFraud" column and all others with 0,
# so summing the column counts the fraudulent transactions.
print("The number of fraudulent transactions is: {}".format(transactions.isFraud.sum()))

   step      type     amount     nameOrig  oldbalanceOrg  newbalanceOrig  \
0   206  CASH_OUT   62927.08   C473782114           0.00            0.00   
1   380   PAYMENT   32851.57  C1915112886           0.00            0.00   
2   570  CASH_OUT 1131750.38  C1396198422     1131750.38            0.00   
3   184  CASH_OUT   60519.74   C982551468       60519.74            0.00   
4   162   CASH_IN   46716.01  C1759889425     7668050.60      7714766.61   

      nameDest  oldbalanceDest  newbalanceDest  isFraud  isPayment  \
0  C2096898696       649420.67       712347.75        0          0   
1   M916879292            0.00            0.00        0          1   
2  C1612235515       313070.53      1444820.92        1          0   
3  C1378644910        54295.32       182654.50        1          0   
4  C2059152908      2125468.75      2078752.75        0          0   

   isMovement  accountDiff  
0           1    649420.67  
1           0         0.00  
2           1    818679.85  
3     
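
Beyond the raw count, the share of fraudulent rows is worth a look, because it affects how an accuracy score should be read later. The following is a minimal sketch on a made-up frame; `toy` and its values are hypothetical and only stand in for `transactions`.

```python
import pandas as pd

# Hypothetical stand-in for `transactions`; only the isFraud column matters here.
toy = pd.DataFrame({"isFraud": [0, 0, 1, 0, 1, 0, 0, 0, 0, 0]})

fraud_count = int(toy["isFraud"].sum())       # number of rows marked 1
fraud_share = float(toy["isFraud"].mean())    # mean of a 0/1 column = share of 1s
class_counts = toy["isFraud"].value_counts()  # rows per class

print("fraud rows:", fraud_count)
print("fraud share:", fraud_share)
print(class_counts)
```

The same three expressions should work on the real DataFrame by replacing `toy` with `transactions`.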

## EDA and Clean the Data

2. We know that the amount of a given transaction is going to be important. Calculate summary statistics for this column. What does the distribution look like?

In [61]:
# Set pandas display option to avoid scientific notation
pd.set_option('display.float_format', '{:.2f}'.format)

print("Summary statistics on amount column")
print(transactions.amount.describe())

Summary statistics on amount column
count       1000.00
mean      537307.96
std      1423692.48
min            0.00
25%        29337.05
50%       126530.51
75%       301037.78
max     10000000.00
Name: amount, dtype: float64
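
The summary above already hints at the shape of the distribution: the mean is several times the median and the standard deviation exceeds the mean, which is what a strongly right-skewed column with a few very large transactions looks like. As a rough sketch, the numbers printed above can be checked directly. The 1.5 × IQR fence is a common rule of thumb chosen here for illustration, not something the task prescribes.

```python
# Summary statistics copied from the describe() output above.
stats = {
    "mean": 537307.96,
    "std": 1423692.48,
    "q1": 29337.05,
    "median": 126530.51,
    "q3": 301037.78,
    "max": 10000000.00,
}

# A mean far above the median suggests a long right tail.
mean_to_median = stats["mean"] / stats["median"]

# Tukey-style upper fence: values above it are often treated as outliers.
iqr = stats["q3"] - stats["q1"]
upper_fence = stats["q3"] + 1.5 * iqr

print("mean / median:", round(mean_to_median, 2))
print("upper fence:", round(upper_fence, 2))
print("max exceeds fence:", stats["max"] > upper_fence)
```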


3. Let’s create a new column called isPayment that assigns a 1 when type is “PAYMENT” or “DEBIT”, and a 0 otherwise.

In [62]:
# A payment made with a debit card is processed immediately, so we can count it as a payment.
# The dataset already has an isPayment column, so we create a duplicate under a new name to complete the task
# and compare it with the original afterwards.
transactions['isPaymentPlusDebitCard'] = transactions['type'].isin(['PAYMENT', 'DEBIT']).astype(int)

4. Similarly, create a column called isMovement, which will capture if money moved out of the origin account. This column will have a value of 1 when type is either “CASH_OUT” or “TRANSFER”, and a 0 otherwise.

In [63]:
transactions['isMovementPlusTransfer'] = transactions['type'].isin(['CASH_OUT', 'TRANSFER']).astype(int)

5. With financial fraud, another key factor to investigate is the difference in balance between the origin and destination accounts. Our theory is that destination accounts with a significantly different balance could be suspected of fraud. Let’s create a column called accountDiff with the absolute difference of the oldbalanceOrg and oldbalanceDest columns.

In [64]:
# The dataset already has an accountDiff column, so we create a duplicate under a new name to complete the task.
transactions["accountDiffN"] = abs(transactions.oldbalanceDest - transactions.oldbalanceOrg)

print(transactions.head()) # Review the newly added columns

# Sanity check: compare the sums of the new columns with the sums of the existing ones.
# Differences of 0 are consistent with identical columns, although equal sums do not prove row-by-row equality.
print("Differences: {}; {}; {};".format(
    transactions['isPaymentPlusDebitCard'].sum() - transactions['isPayment'].sum(),
    transactions['isMovementPlusTransfer'].sum() - transactions['isMovement'].sum(),
    transactions["accountDiffN"].sum() - transactions["accountDiff"].sum()))

   step      type     amount     nameOrig  oldbalanceOrg  newbalanceOrig  \
0   206  CASH_OUT   62927.08   C473782114           0.00            0.00   
1   380   PAYMENT   32851.57  C1915112886           0.00            0.00   
2   570  CASH_OUT 1131750.38  C1396198422     1131750.38            0.00   
3   184  CASH_OUT   60519.74   C982551468       60519.74            0.00   
4   162   CASH_IN   46716.01  C1759889425     7668050.60      7714766.61   

      nameDest  oldbalanceDest  newbalanceDest  isFraud  isPayment  \
0  C2096898696       649420.67       712347.75        0          0   
1   M916879292            0.00            0.00        0          1   
2  C1612235515       313070.53      1444820.92        1          0   
3  C1378644910        54295.32       182654.50        1          0   
4  C2059152908      2125468.75      2078752.75        0          0   

   isMovement  accountDiff  isPaymentPlusDebitCard  isMovementPlusTransfer  \
0           1    649420.67                  
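
Comparing column sums is a quick check, but two columns can share a sum and still differ row by row. A stricter comparison counts mismatching rows instead. The sketch below uses a small made-up frame (`toy` and all of its values are hypothetical) and `np.isclose` for the float column, so that rounding noise is not reported as a difference.

```python
import numpy as np
import pandas as pd

# Equal sums do not imply equal columns.
a = pd.Series([1, 0])
b = pd.Series([0, 1])
same_sum = bool(a.sum() == b.sum())
rows_that_differ = int((a != b).sum())

# Hypothetical stand-in for `transactions`.
toy = pd.DataFrame({
    "type": ["CASH_OUT", "PAYMENT", "TRANSFER", "DEBIT", "CASH_IN"],
    "isPayment": [0, 1, 0, 1, 0],
    "oldbalanceOrg": [0.0, 100.0, 500.0, 20.0, 10.0],
    "oldbalanceDest": [50.0, 0.0, 125.5, 20.0, 40.0],
    "accountDiff": [50.0, 100.0, 374.5, 0.0, 30.0],
})

# Rebuild the derived columns and count rows where they disagree.
rebuilt_payment = toy["type"].isin(["PAYMENT", "DEBIT"]).astype(int)
rebuilt_diff = (toy["oldbalanceOrg"] - toy["oldbalanceDest"]).abs()

payment_mismatches = int((rebuilt_payment != toy["isPayment"]).sum())
diff_mismatches = int((~np.isclose(rebuilt_diff, toy["accountDiff"])).sum())

print("same sum:", same_sum, "| rows that differ:", rows_that_differ)
print("payment mismatches:", payment_mismatches)
print("accountDiff mismatches:", diff_mismatches)
```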

## Select and Split the Data

6. Before we can start training our model, we need to define our features and label columns. Our label column in this dataset is the isFraud field. Create a variable called features which will be an array consisting of the following fields: amount, isPayment, isMovement, accountDiff. Also create a variable called label with the column isFraud.

In [65]:
features = transactions[["amount", "isPayment", "isMovement", "accountDiff"]]
label = transactions.isFraud

7. Split the data into training and test sets using sklearn’s train_test_split() method. We’ll use the training set to train the model and the test set to evaluate the model. Use a test_size value of 0.3.

In [66]:
X_train, X_test, y_train, y_test = train_test_split(features, label, test_size=0.3)
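
Without a fixed seed, every run produces a different split, so the scores reported further down will vary from run to run. As an optional sketch, `train_test_split` also accepts `random_state` (for a reproducible split) and `stratify` (to keep the class ratio the same in both sets). The toy arrays and the seed value 42 below are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 20 of them labelled as fraud (1).
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 80 + [1] * 20)

# random_state fixes the shuffle; stratify keeps the 80/20 class ratio in both sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

# Repeating the call with the same seed gives the same split.
_, X_te_again, _, _ = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

print("test rows:", len(y_te), "| fraud rows in test:", int(y_te.sum()))
print("train rows:", len(y_tr), "| fraud rows in train:", int(y_tr.sum()))
print("reproducible:", np.array_equal(X_te, X_te_again))
```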

## Normalize the Data

8. Since sklearn’s Logistic Regression implementation uses regularization, we need to scale our feature data. Create a StandardScaler object, .fit_transform() it on the training features, and .transform() the test features.

In [67]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
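
Regularization penalizes large coefficients, and the size of a coefficient depends on the units of its feature, so features on very different scales would be penalized unevenly. The sketch below, on made-up numbers, shows what the scaler does: it learns the mean and standard deviation from the training data only, then applies those same values to the test data. The array names are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training and test features (two columns on different scales).
X_train_toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
X_test_toy = np.array([[4.0, 30.0]])

scaler_toy = StandardScaler()
X_train_scaled = scaler_toy.fit_transform(X_train_toy)  # learns mean/std, then scales
X_test_scaled = scaler_toy.transform(X_test_toy)        # reuses the training mean/std

# The same computation by hand: subtract the training mean, divide by the training std.
manual_test = (X_test_toy - X_train_toy.mean(axis=0)) / X_train_toy.std(axis=0)

print("training column means after scaling:", X_train_scaled.mean(axis=0))
print("matches manual computation:", np.allclose(X_test_scaled, manual_test))
```

Fitting the scaler on the training set alone keeps information about the test set out of the model.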

## Create and Evaluate the Model

9. Create a LogisticRegression model with sklearn and .fit() it on the training data. Fitting the model finds the best coefficients for our selected features so it can more accurately predict our label. We will start with the default threshold of 0.5.

In [68]:
model = LogisticRegression()
model.fit(X_train, y_train)
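
To make the 0.5 threshold concrete: logistic regression computes a weighted sum of the features, passes it through the sigmoid function to get a probability, and labels the row 1 when that probability reaches the threshold. The sketch below is a plain-Python illustration; the weights, bias and inputs are made-up values, not the fitted model's.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(x, weights, bias, threshold=0.5):
    """Return 1 if the predicted probability reaches the threshold, else 0."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return int(sigmoid(z) >= threshold)

# Hypothetical coefficients for two scaled features.
weights = [1.5, -0.5]
bias = -0.25

print(sigmoid(0.0))                              # a score of 0 maps to probability 0.5
print(predict_label([1.0, 1.0], weights, bias))  # score 0.75, above 0, so label 1
print(predict_label([0.0, 1.0], weights, bias))  # score -0.75, below 0, so label 0
```

With a threshold of 0.5, the label only depends on whether the weighted sum is positive or negative.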

10. Run the model’s .score() method on the training data and print the training score. Scoring runs the training data through the trained model, predicts which transactions are fraudulent, and compares the predictions with the true labels. The score returned is the proportion of correct classifications, i.e. the accuracy.

In [69]:
print("Model score (train data):")
print(model.score(X_train, y_train)) # About 0.86 in this run; the exact value changes with the random split

Model score (train data):
0.8571428571428571


11. Run the model’s .score() method on the test data and print the test score. Scoring runs the test data through the trained model, predicts which transactions are fraudulent, and compares the predictions with the true labels. The score returned is the proportion of correct classifications, i.e. the accuracy, and is an indicator of the success of your model.

How did your model perform?

In [70]:
print("Model score (test data):")
print(model.score(X_test, y_test))

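# The test score (about 0.82) is close to the training score (about 0.86), so there is no strong sign of overfitting.
# Accuracy alone does not show how many fraudulent transactions were actually caught.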

Model score (test data):
0.8166666666666667
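
Accuracy can look good even when a model misses fraud, particularly if one class is much larger than the other. The sketch below uses made-up labels (not results from this notebook) to show a model that never flags fraud and still reaches 0.8 accuracy, while its recall for the fraud class is 0.

```python
def confusion_counts(y_true, y_pred):
    """Count true positives, false positives, false negatives and true negatives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

# Hypothetical labels: 2 of 10 transactions are fraud, and the model predicts 0 for all.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn) if (tp + fn) else 0.0  # share of real frauds that were caught

print("accuracy:", accuracy)
print("recall:", recall)
```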


12. Print the coefficients for our model to see how important each feature column was for prediction. Which feature was most important? Least important?

In [71]:
print("model coefficients: {}".format(abs(model.coef_)))

# In this case the most important feature was amount, because it has the largest coefficient magnitude.
# The least important feature was isPayment, because it has the smallest coefficient magnitude.

model coefficients: [[3.1348221  0.62141579 2.03149605 1.55698273]]
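
Pairing each coefficient with its feature name makes the ranking easier to read. The sketch below reuses the absolute values printed above and the feature order from the `features` definition. Note that taking `abs()` discards the sign, so it shows how strongly each feature influences the prediction but not in which direction.

```python
# Feature order as used when `features` was defined.
feature_names = ["amount", "isPayment", "isMovement", "accountDiff"]

# Absolute coefficient values copied from the output above.
abs_coefs = [3.1348221, 0.62141579, 2.03149605, 1.55698273]

# Sort (name, magnitude) pairs from most to least influential.
ranked = sorted(zip(feature_names, abs_coefs), key=lambda pair: pair[1], reverse=True)

for name, magnitude in ranked:
    print("{:<12} {:.4f}".format(name, magnitude))
```

Comparing magnitudes like this is only meaningful because the features were standardized before fitting.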


## Predict With the Model

13. Let’s use our model to process more transactions that have gone through our systems, using the "new transaction data" below. Create a fourth array with a transaction of your own.

In [72]:
# New transaction data
transaction1 = np.array([123456.78, 0.0, 1.0, 54670.1])
transaction2 = np.array([98765.43, 1.0, 0.0, 8524.75])
transaction3 = np.array([543678.31, 1.0, 0.0, 510025.5])

transaction4 = np.array([78925.14, 0.0, 0.0, 25052.67])

14. Combine the new transactions and your own transaction (transaction4 here) into a single numpy array called sample_transactions.

In [73]:
sample_transactions = np.stack((transaction1, transaction2, transaction3, transaction4))
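
Stacking matters because the scaler and the model expect a 2D array with one row per transaction and one column per feature. A small sketch of the resulting shapes follows; it also shows how a single transaction could be reshaped into one row. The variable names are hypothetical.

```python
import numpy as np

t1 = np.array([123456.78, 0.0, 1.0, 54670.1])
t2 = np.array([98765.43, 1.0, 0.0, 8524.75])

# np.stack joins the 1D arrays along a new first axis: one row per transaction.
batch = np.stack((t1, t2))

# A single transaction needs an explicit row dimension before it can be scaled or scored.
single = t1.reshape(1, -1)

print(batch.shape)   # (2, 4)
print(single.shape)  # (1, 4)
```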

15. Since our Logistic Regression model was trained on scaled feature data, we must also scale the feature data we are making predictions on. Using the StandardScaler object created earlier, apply its .transform() method to sample_transactions and save the result to sample_transactions.

In [74]:
sample_transactions = scaler.transform(sample_transactions)



16. Which transactions are fraudulent? Use your model’s .predict() method on sample_transactions and print the result to find out. Also call your model’s .predict_proba() method on sample_transactions and print the result.

In [75]:
print(model.predict(sample_transactions))

print(model.predict_proba(sample_transactions)) # The 1st column is the probability of a transaction not being fraudulent,
                                                # and the 2nd column is the probability of it being fraudulent.

# All four transactions are predicted as not fraudulent.
# The first transaction has the highest fraud probability (about 0.39); the other three,
# including the fourth array with arbitrary values, have a fraud probability below 0.01.

[0 0 0 0]
[[0.60854125 0.39145875]
 [0.99787853 0.00212147]
 [0.99497029 0.00502971]
 [0.99100329 0.00899671]]
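
The labels from `.predict()` correspond to a 0.5 cut-off on the second column of `.predict_proba()`. If missing a fraud is more costly than a false alarm, a lower cut-off can be applied to the probabilities directly. The sketch below reuses the probabilities printed above; the 0.3 threshold is an arbitrary value chosen for illustration.

```python
import numpy as np

# Probabilities copied from the predict_proba output above.
proba = np.array([
    [0.60854125, 0.39145875],
    [0.99787853, 0.00212147],
    [0.99497029, 0.00502971],
    [0.99100329, 0.00899671],
])

fraud_proba = proba[:, 1]  # second column: probability of fraud

default_labels = (fraud_proba >= 0.5).astype(int)   # same cut-off as predict()
stricter_labels = (fraud_proba >= 0.3).astype(int)  # hypothetical lower threshold

print(default_labels)   # [0 0 0 0]
print(stricter_labels)  # [1 0 0 0]
```

With the lower threshold, only the first transaction is flagged, which matches its noticeably higher fraud probability.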
