In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler



In [21]:
# Load the data
transactions = pd.read_csv('transactions_modified.csv')
print(transactions.head())
print(transactions.info())



   step      type      amount     nameOrig  oldbalanceOrg  newbalanceOrig  \
0   206  CASH_OUT    62927.08   C473782114           0.00            0.00   
1   380   PAYMENT    32851.57  C1915112886           0.00            0.00   
2   570  CASH_OUT  1131750.38  C1396198422     1131750.38            0.00   
3   184  CASH_OUT    60519.74   C982551468       60519.74            0.00   
4   162   CASH_IN    46716.01  C1759889425     7668050.60      7714766.61   

      nameDest  oldbalanceDest  newbalanceDest  isFraud  isPayment  \
0  C2096898696       649420.67       712347.75        0          0   
1   M916879292            0.00            0.00        0          1   
2  C1612235515       313070.53      1444820.92        1          0   
3  C1378644910        54295.32       182654.50        1          0   
4  C2059152908      2125468.75      2078752.75        0          0   

   isMovement  accountDiff  
0           1    649420.67  
1           0         0.00  
2           1    818679.85  


The file transactions_modified.csv contains data on 1000 simulated credit card transactions. Let’s begin by loading the data into a pandas DataFrame named transactions. Take a peek at the dataset using .head(), and use .info() to examine how many rows there are and what datatypes the columns have. How many transactions are fraudulent? Print your answer.

In [23]:
# How many fraudulent transactions?

fraud = transactions.isFraud.sum()
print(fraud)


282
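The raw count is easier to interpret as a share of all rows. The following is a minimal sketch on a made-up 0/1 column (the toy values are an assumption, not rows from transactions_modified.csv); it relies on the fact that the mean of a 0/1 column is the fraction of 1s.

```python
import pandas as pd

# Toy stand-in for transactions['isFraud'] (hypothetical values)
is_fraud = pd.Series([0, 1, 0, 0, 1, 0, 0, 0, 1, 0], name="isFraud")

fraud_count = int(is_fraud.sum())        # number of rows flagged 1
fraud_share = is_fraud.mean()            # mean of a 0/1 column = fraction of 1s
class_counts = is_fraud.value_counts()   # rows per class

print(fraud_count, fraud_share)
print(class_counts)
```

Applied to the real column, 282 fraudulent rows out of 1000 is a share of 28.2%.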


Looking at the dataset, combined with our knowledge of credit card transactions in general, we can see that there are a few interesting columns to look at. We know that the amount of a given transaction is going to be important. Calculate summary statistics for this column. What does the distribution look like?

In [24]:
# Summary statistics on amount column

transactions['amount'].describe()



count    1.000000e+03
mean     5.373080e+05
std      1.423692e+06
min      0.000000e+00
25%      2.933705e+04
50%      1.265305e+05
75%      3.010378e+05
max      1.000000e+07
Name: amount, dtype: float64
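One way to read the shape of the distribution from these numbers alone is to compare the mean with the median and to check the maximum against a rule-of-thumb outlier cutoff. The sketch below only reuses the values printed by .describe(); the 1.5 × IQR fence is a common convention chosen here for illustration, not something the project prescribes.

```python
# Values copied from the .describe() output above
mean_amount = 5.373080e+05
median_amount = 1.265305e+05
q1, q3 = 2.933705e+04, 3.010378e+05
max_amount = 1.000000e+07

mean_to_median = mean_amount / median_amount  # a ratio well above 1 suggests a long right tail
iqr = q3 - q1                                 # spread of the middle 50% of amounts
upper_fence = q3 + 1.5 * iqr                  # rule-of-thumb cutoff for unusually large values

print(mean_to_median, iqr, upper_fence)
print("max exceeds fence:", max_amount > upper_fence)
```

A mean several times larger than the median, together with a maximum far beyond the fence, points to a strongly right-skewed distribution: most transactions are small and a few are very large.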

We have a lot of information about the type of transaction we are looking at. Let’s create a new column called isPayment that assigns a 1 when type is “PAYMENT” or “DEBIT”, and a 0 otherwise.

In [34]:
# Create isPayment field

transactions['isPayment'] = 0
transactions.loc[transactions['type'].isin(['PAYMENT','DEBIT']),'isPayment'] = 1


Similarly, create a column called isMovement, which will capture if money moved out of the origin account. This column will have a value of 1 when type is either “CASH_OUT” or “TRANSFER”, and a 0 otherwise.

In [33]:
# Create isMovement field

transactions['isMovement'] = 0
transactions.loc[transactions['type'].isin(['CASH_OUT','TRANSFER']), 'isMovement'] = 1
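Both flag columns can also be built in a single step by casting the boolean mask from .isin() to integers. This is an equivalent sketch on a small hypothetical DataFrame (the toy rows are an assumption), not a replacement for the cells above.

```python
import pandas as pd

# Toy stand-in for the 'type' column (hypothetical rows)
toy = pd.DataFrame({"type": ["PAYMENT", "CASH_OUT", "DEBIT", "TRANSFER", "CASH_IN"]})

# .isin() returns True/False per row; .astype(int) turns that into 1/0
toy["isPayment"] = toy["type"].isin(["PAYMENT", "DEBIT"]).astype(int)
toy["isMovement"] = toy["type"].isin(["CASH_OUT", "TRANSFER"]).astype(int)

print(toy)
```

Note that a type such as CASH_IN ends up with 0 in both columns, so the two flags are not simply opposites of each other.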


With financial fraud, another key factor to investigate is the difference in value between the origin and destination accounts. Our theory, in this case, is that destination accounts with a significantly different balance could be suspected of fraud. Let’s create a column called accountDiff with the absolute difference of the oldbalanceOrg and oldbalanceDest columns.

In [35]:
# Create accountDiff field

transactions['accountDiff'] = np.abs(transactions['oldbalanceOrg']-transactions['oldbalanceDest'])
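As a quick sanity check, the same calculation can be reproduced on the first three rows shown in the .head() output. The balances below are copied from that output; the small DataFrame named rows is only an illustration.

```python
import numpy as np
import pandas as pd

# Balances of rows 0, 1 and 2 from the .head() output above
rows = pd.DataFrame({
    "oldbalanceOrg": [0.00, 0.00, 1131750.38],
    "oldbalanceDest": [649420.67, 0.00, 313070.53],
})

# Absolute difference between origin and destination balances
rows["accountDiff"] = (rows["oldbalanceOrg"] - rows["oldbalanceDest"]).abs()

print(rows)
```

The results agree with the accountDiff values printed for those rows (649420.67, 0.00 and 818679.85).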


Before we can start training our model, we need to define our features and label columns. Our label column in this dataset is the isFraud field. Create a variable called features which will be an array consisting of the following fields:

amount
isPayment
isMovement
accountDiff
Also create a variable called label with the column isFraud.

In [42]:
# Create features and label variables
features = transactions[['amount','isPayment','isMovement','accountDiff']]
label = transactions['isFraud']



Split the data into training and test sets using sklearn’s train_test_split() function. We’ll use the training set to train the model and the test set to evaluate the model. Use a test_size value of 0.3.

In [45]:
# Split dataset
XTrain,XTest,YTrain,YTest = train_test_split(features,label,test_size=.3,random_state=42)
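With 1000 rows and test_size=0.3, the split yields 700 training rows and 300 test rows. The sketch below checks this on synthetic data with the same size and fraud count; the stratify argument is an optional extra (an assumption, not used in the cell above) that keeps the fraud share similar in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: 1000 rows, 282 of them labelled 1 (hypothetical features)
X = np.arange(2000, dtype=float).reshape(1000, 2)
y = np.array([1] * 282 + [0] * 718)

# stratify=y is optional; it preserves the class proportions in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())
```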




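Since sklearn’s Logistic Regression implementation uses regularization, we need to scale our feature data: the penalty acts on the size of the coefficients, so features on very different scales would be penalized unevenly. Create a StandardScaler object, .fit_transform() it on the training features, and .transform() the test features.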

In [59]:
# Normalize the features variables
sclr = StandardScaler()
XTrain = sclr.fit_transform(XTrain)
XTest = sclr.transform(XTest)
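The reason the test features only get .transform() is that the scaler reuses the mean and standard deviation learned from the training features. A minimal sketch on made-up numbers (the arrays are assumptions) shows that the result matches the manual formula (x - mean) / std.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training and test features (two columns)
train = np.array([[100.0, 0.0], [200.0, 1.0], [600.0, 1.0]])
test = np.array([[300.0, 0.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean/std from train, then scales it
test_scaled = scaler.transform(test)        # reuses the training mean/std

# Manual equivalent: subtract the training mean, divide by the training std
manual = (test - train.mean(axis=0)) / train.std(axis=0)

print(test_scaled)
print(manual)
```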





Create a LogisticRegression model with sklearn and .fit() it on the training data.

Fitting the model finds the coefficients for our selected features that best predict our label. We will start with the default classification threshold of 0.5.

In [48]:
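# Fit the model to the training data
# Note: sample_weight=0.5 gives every sample the same weight; it does not set
# the decision threshold. .predict() applies the 0.5 threshold by default.
model = LogisticRegression()
model.fit(XTrain,YTrain, sample_weight=0.5)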
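The 0.5 threshold is not an argument passed to .fit(); it is applied when predicting. The sketch below, on synthetic data (the data and the name clf are assumptions), shows that .predict() agrees with thresholding the class-1 probability at 0.5.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem (hypothetical data)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

clf = LogisticRegression().fit(X, y)

# Probability of class 1, thresholded by hand at 0.5
proba_class1 = clf.predict_proba(X)[:, 1]
manual_pred = (proba_class1 > 0.5).astype(int)

print(np.array_equal(manual_pred, clf.predict(X)))
```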




Run the model’s .score() method on the training data and print the training score.

Scoring runs the training data through the trained model, predicts which transactions are fraudulent, and compares the predictions with the true labels. The score returned is the fraction of correct classifications, i.e. the accuracy.

In [49]:
# Score the model on the training data
model.score(XTrain,YTrain)



0.8385714285714285
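Accuracy is simply the fraction of predictions that match the true labels. The sketch below computes it by hand on a made-up pair of arrays (assumptions for illustration), then converts the recorded training score back into a count, assuming the training set holds 700 rows (70% of 1000).

```python
import numpy as np

# Hypothetical true labels and predictions
y_true = np.array([0, 1, 1, 0, 1, 0, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1, 0, 1, 1])

# Fraction of positions where prediction and label agree
accuracy = np.mean(y_true == y_pred)

# Recorded training score times the assumed 700 training rows
train_correct = round(0.8385714285714285 * 700)

print(accuracy, train_correct)
```

Under that assumption, the recorded score corresponds to 587 correctly classified training rows.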

Run the model’s .score() method on the test data and print the test score.

Scoring runs the test data through the trained model, predicts which transactions are fraudulent, and compares the predictions with the true labels. The score returned is the fraction of correct classifications, i.e. the accuracy, and is an indicator of the success of your model.

How did your model perform?

In [50]:
# Score the model on the test data
model.score(XTest,YTest)



0.85
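One way to judge the test score is against a trivial baseline that always predicts "not fraud". The sketch below uses the fraud count from earlier (282 of 1000); note that this share is for the whole dataset, so the exact baseline on the 300 test rows may differ slightly.

```python
# Counts taken from earlier in the notebook
n_total = 1000
n_fraud = 282

# Accuracy of a model that labels every transaction as non-fraudulent
baseline_accuracy = (n_total - n_fraud) / n_total

# Recorded test score from the cell above
test_accuracy = 0.85
improvement = test_accuracy - baseline_accuracy

print(baseline_accuracy, improvement)
```

The model clears the always-negative baseline, but accuracy alone does not say how many fraudulent transactions were actually caught.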

Print the coefficients for our model to see how important each feature column was for prediction. Which feature was most important? Least important?

In [51]:
# Print the model coefficients
print(model.coef_)



[[ 2.14147832 -0.58136792  1.83617384 -0.8717413 ]]
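The printed array is easier to read when each coefficient is paired with its feature name and sorted by magnitude. The sketch below reuses the coefficients printed above and the feature order from the features variable; because the features were standardized, each coefficient refers to a change of one standard deviation.

```python
import numpy as np

# Feature order matches the columns selected for `features`
feature_names = ["amount", "isPayment", "isMovement", "accountDiff"]
# Coefficients copied from the printed model.coef_ output
coef = np.array([2.14147832, -0.58136792, 1.83617384, -0.8717413])

# Rank features by absolute coefficient size, largest first
order = np.argsort(-np.abs(coef))
ranked = [(feature_names[i], float(coef[i])) for i in order]

# exp(coef) is the multiplicative change in the odds of fraud per unit of the scaled feature
odds_ratios = np.exp(coef)

for name, value in ranked:
    print(f"{name:12s} {value:+.3f}")
print(odds_ratios)
```

By magnitude, amount is the most important feature and isPayment the least; the negative signs on isPayment and accountDiff indicate lower predicted odds of fraud as those features increase.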


Let’s use our model to process more transactions that have gone through our systems. There are three numpy arrays pre-loaded in the workspace with information on new sample transactions, under “New transaction data”.

Create a fourth array, your_transaction, and add any transaction information you’d like. Make sure to enter all values as floats, with a decimal point!

In [55]:
# New transaction data
transaction1 = np.array([123456.78, 0.0, 1.0, 54670.1])
transaction2 = np.array([98765.43, 1.0, 0.0, 8524.75])
transaction3 = np.array([543678.31, 1.0, 0.0, 510025.5])



In [56]:
# Create a new transaction (the instructions call it your_transaction; it is named transaction4 here)

transaction4 = np.array([696969.42, 0.0, 1.0, 420.69])


Combine the new transactions and your_transaction into a single numpy array called sample_transactions.

In [57]:
# Combine new transactions into a single array

sample_transactions = np.stack((transaction1, transaction2, transaction3, transaction4))
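The model expects a 2D array with one row per transaction and one column per feature. The sketch below checks the shape produced by np.stack and shows how a single transaction could be reshaped into one row; the short names t1 to t4 are only illustrative.

```python
import numpy as np

# Same values as the transactions defined above
t1 = np.array([123456.78, 0.0, 1.0, 54670.1])
t2 = np.array([98765.43, 1.0, 0.0, 8524.75])
t3 = np.array([543678.31, 1.0, 0.0, 510025.5])
t4 = np.array([696969.42, 0.0, 1.0, 420.69])

stacked = np.stack((t1, t2, t3, t4))  # stacks along a new first axis: one row per transaction
same = np.array([t1, t2, t3, t4])     # equivalent for equal-length 1D arrays
single = t1.reshape(1, -1)            # one transaction as a 2D array with a single row

print(stacked.shape, single.shape)
```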


Since our Logistic Regression model was trained on scaled feature data, we must also scale the feature data we are making predictions on. Using the StandardScaler object created earlier, apply its .transform() method to sample_transactions and save the result to sample_transactions.

In [60]:
# Normalize the new transactions

sample_transactions = sclr.transform(sample_transactions)


Which transactions are fraudulent? Use your model’s .predict() method on sample_transactions and print the result to find out.

Want to see the probabilities that led to these predictions? Call your model’s .predict_proba() method on sample_transactions and print the result. The 1st column is the probability of a transaction not being fraudulent, and the 2nd column is the probability of a transaction being fraudulent (which was calculated by our model to make the final classification decision).

In [63]:
# Predict fraud on the new transactions
model.predict(sample_transactions)




array([0, 0, 0, 1], dtype=int64)

In [62]:
# Show probabilities on the new transactions
model.predict_proba(sample_transactions)

array([[0.61281199, 0.38718801],
       [0.99641474, 0.00358526],
       [0.9938802 , 0.0061198 ],
       [0.40031781, 0.59968219]])
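The predictions follow directly from the second column of these probabilities. The sketch below reuses the printed values and applies the default 0.5 threshold by hand, then a lower threshold of 0.35 (an arbitrary value chosen for illustration) to show how the flagged set changes.

```python
import numpy as np

# Probabilities copied from the predict_proba output above
proba = np.array([
    [0.61281199, 0.38718801],
    [0.99641474, 0.00358526],
    [0.9938802, 0.0061198],
    [0.40031781, 0.59968219],
])

fraud_proba = proba[:, 1]  # second column: probability of fraud

default_pred = (fraud_proba > 0.5).astype(int)           # default decision rule
lower_threshold_pred = (fraud_proba > 0.35).astype(int)  # hypothetical, more cautious rule

print(default_pred)
print(lower_threshold_pred)
```

With the default threshold only the fourth transaction is flagged, matching the .predict() output; lowering the threshold to 0.35 would also flag the first one.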

Congratulations on completing the project!

Note that we used a modified version of the dataset. You can now try to re-run the project using the original dataset, transactions.csv. Examine how the results change. If you notice something weird, you’re totally on to something! That “something” is what is known as an imbalanced class classification problem.
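To see why class imbalance makes accuracy misleading, consider a model that never predicts fraud. The sketch below uses an assumed fraud rate of 1% on made-up labels (not the actual rate in transactions.csv) and compares accuracy with recall, the fraction of fraudulent transactions that are caught.

```python
import numpy as np

# Hypothetical imbalanced labels: 10 fraudulent, 990 legitimate
y_true = np.array([1] * 10 + [0] * 990)

# A "model" that always predicts not-fraud
y_pred = np.zeros_like(y_true)

accuracy = np.mean(y_true == y_pred)
# Recall: correctly flagged fraud cases divided by all fraud cases
recall = np.sum((y_pred == 1) & (y_true == 1)) / np.sum(y_true == 1)

print(accuracy, recall)
```

The accuracy looks excellent even though not a single fraudulent transaction is detected.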

We will cover this very relevant topic (among many other things) in the Logistic Regression II module!