This cell imports the libraries used for data manipulation, visualization, and machine learning. NumPy handles numerical computation and arrays, pandas handles tabular data, and seaborn and matplotlib create plots. From scikit-learn, the notebook imports the evaluation helpers classification_report and accuracy_score and the IsolationForest ensemble model.

In [179]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report, accuracy_score
from sklearn.ensemble import IsolationForest

Isolation Forest, also known as iForest, is an unsupervised machine learning algorithm used for anomaly detection. It is an ensemble of randomized trees that identifies observations that differ from the majority of the data, known as outliers or anomalies.
The algorithm works by randomly selecting a feature and a random split value between that feature's minimum and maximum, and repeatedly splitting the data into subsets until each observation is isolated. Anomalies tend to be isolated after fewer splits than normal observations, so a short average path length across the trees indicates an outlier.

One of the main advantages of Isolation Forest for fraud detection is that it scales well to data with many features. Credit card transactions, for example, can have many features such as purchase amount, location, and time of purchase. Isolation Forest remains efficient on such data because each split considers only one randomly chosen feature.
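
The isolation idea can be illustrated with a minimal pure-Python sketch. It is not scikit-learn's implementation: the function names `isolation_depth` and `mean_depth`, the synthetic points, and the number of trials are all assumptions made for illustration. It counts how many random splits are needed to isolate one point, averaged over many trials.

```python
import random

def isolation_depth(point, rows, rng, depth=0):
    """Number of random splits needed before `point` is alone in its partition."""
    if len(rows) <= 1:
        return depth
    feature = rng.randrange(len(point))      # pick a random feature
    values = [row[feature] for row in rows]
    low, high = min(values), max(values)
    if low == high:                          # this feature cannot separate the rows
        return depth
    split = rng.uniform(low, high)           # pick a random split value
    # Keep only the rows that fall on the same side of the split as `point`.
    if point[feature] < split:
        rows = [row for row in rows if row[feature] < split]
    else:
        rows = [row for row in rows if row[feature] >= split]
    return isolation_depth(point, rows, rng, depth + 1)

def mean_depth(point, rows, trials=200, seed=0):
    """Average isolation depth of `point` over several random trials."""
    rng = random.Random(seed)
    return sum(isolation_depth(point, rows, rng) for _ in range(trials)) / trials

# Synthetic data: 200 points around the origin, one central point, one far point.
data_rng = random.Random(42)
inlier = (0.0, 0.0)
outlier = (8.0, 8.0)
rows = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(200)]
rows += [inlier, outlier]

inlier_depth = mean_depth(inlier, rows)
outlier_depth = mean_depth(outlier, rows)
print(f"inlier: {inlier_depth:.1f} splits, outlier: {outlier_depth:.1f} splits")
```

The far point should need clearly fewer splits on average than the central one, which is the signal an isolation forest turns into an anomaly score.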

This line of code is using the read_csv function from the pandas library to read in a csv file named 'creditcard.csv' and store it in a variable called 'data'. The read_csv function reads in the csv file and converts it into a pandas dataframe, which is a tabular data structure with rows and columns.

In [180]:
data = pd.read_csv('creditcard.csv')

This cell reads the shape attribute of the 'data' DataFrame, which returns a tuple (number of rows, number of columns). Rows are observations and columns are features or variables. Here the shape is (284807, 31), which means 284,807 transactions and 31 columns.

In [181]:
data.shape # Shape of 'data' as (rows, columns)

(284807, 31)

This DataFrame is large, and fitting a model on all of it is computationally expensive. The data is therefore downsampled further down to one-tenth of its original size by randomly dropping rows. Training on more data generally improves results but requires more time and computing resources. Before sampling, the class counts show how imbalanced the data is: only 492 of 284,807 transactions are fraudulent.

In [182]:
data['Class'].value_counts()

0    284315
1       492
Name: Class, dtype: int64

This line calls the sample() method on the 'data' DataFrame to randomly sample a fraction of its rows. The 'frac' parameter is the fraction of rows to select. Here the value is 0.1, so the method randomly selects 10% of the rows and returns a new DataFrame containing only those rows. This is useful for working with a smaller subset when the original dataset is too large to process efficiently. Because no random_state is passed, each run selects different rows.

In [183]:
data = data.sample(frac=0.1) # Size of data is reduced
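
If reproducible results are wanted, a seed can be passed to sample(). The following is a minimal sketch on a synthetic table; the DataFrame `toy` and the seed value 42 are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Synthetic stand-in for the transactions table (hypothetical values).
toy = pd.DataFrame({"Amount": range(1000), "Class": [0] * 990 + [1] * 10})

# The same random_state yields the same rows on every run.
sample_a = toy.sample(frac=0.1, random_state=42)
sample_b = toy.sample(frac=0.1, random_state=42)

print(len(sample_a))               # 100 rows: 10% of 1000
print(sample_a.equals(sample_b))   # True: same seed, same rows
```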

The 'creditcard.csv' dataset contains credit card transactions. The "Time" column holds the seconds elapsed between each transaction and the first transaction in the dataset, "Amount" holds the transaction value, and "Class" indicates whether the transaction was fraudulent. The remaining columns, "V1" through "V28", contain transformed transaction features. The original features are not disclosed in order to protect users' confidential information.

To achieve this, the dataset uses a technique called Principal Component Analysis (PCA). PCA is a statistical method that is used to reduce the dimensionality of data by identifying patterns in the data and projecting the data onto a lower dimensional space while minimizing information loss. By reducing the dimensionality of the data, PCA makes it possible to protect users' confidential information while preserving the important features of the data that are needed for analysis.
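
As an illustration of what PCA does, here is a minimal sketch on synthetic data. The transformation actually applied to the credit card dataset is not published, so the data, the mixing matrix, and the choice of two components below are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: three correlated features driven by two hidden factors.
factors = rng.normal(size=(500, 2))
mixing = np.array([[1.0, 0.5, 0.2],
                   [0.0, 1.0, 0.8]])
features = factors @ mixing + 0.01 * rng.normal(size=(500, 3))

# Project the three features onto two principal components.
pca = PCA(n_components=2)
components = pca.fit_transform(features)

print(components.shape)                      # (500, 2)
print(pca.explained_variance_ratio_.sum())   # close to 1: little information lost
```

The projected columns no longer correspond to any single original feature, which is why columns such as V1 through V28 cannot be read as raw transaction attributes.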

In [184]:
data.columns # Prints columns of data 

Index(['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10',
       'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20',
       'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount',
       'Class'],
      dtype='object')

In [185]:
data.describe() # Displays details of each column

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
count,28481.0,28481.0,28481.0,28481.0,28481.0,28481.0,28481.0,28481.0,28481.0,28481.0,...,28481.0,28481.0,28481.0,28481.0,28481.0,28481.0,28481.0,28481.0,28481.0,28481.0
mean,95295.985499,-0.010532,-0.00011,-0.006228,0.002754,-1e-06,-0.007255,0.000543,8.5e-05,-0.000772,...,0.003658,0.001926,-0.002652,0.002423,-0.003759,-0.003134,0.000546,-0.001457,90.284655,0.00165
std,47323.732833,1.99555,1.740045,1.539786,1.430485,1.401678,1.341342,1.305598,1.172236,1.108579,...,0.735382,0.728189,0.655268,0.604635,0.528314,0.482991,0.402391,0.358585,287.494886,0.04059
min,1.0,-37.558067,-60.464618,-30.177317,-5.266509,-35.18212,-16.172614,-43.557242,-41.044261,-13.320155,...,-22.797604,-9.499423,-20.794422,-2.747197,-7.081325,-2.604551,-8.878665,-8.65657,0.0,0.0
25%,54713.0,-0.916815,-0.597056,-0.885661,-0.860663,-0.695281,-0.770277,-0.549596,-0.207153,-0.639373,...,-0.225256,-0.542682,-0.162406,-0.352907,-0.32168,-0.32864,-0.071363,-0.053559,5.49,0.0
50%,85624.0,0.002131,0.077831,0.174565,-0.026535,-0.049867,-0.280451,0.048954,0.024746,-0.052152,...,-0.026904,0.007661,-0.011546,0.041211,0.01517,-0.057042,0.000557,0.011217,22.0,0.0
75%,139566.0,1.315111,0.80755,1.021396,0.765812,0.614446,0.391271,0.573656,0.328023,0.600744,...,0.189371,0.53187,0.144919,0.440892,0.34988,0.234303,0.09178,0.078799,75.07,0.0
max,172787.0,2.446505,22.057729,4.040465,16.875344,24.34531,21.550496,36.877368,15.794136,9.234623,...,27.202839,8.316275,19.228169,3.926604,5.525093,3.517346,5.899507,15.632689,18910.0,1.0


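This cell separates the fraudulent and valid transactions and estimates the share of outliers (anomalies) in the data. In the "Class" column, 0 denotes a non-fraudulent transaction and 1 denotes a fraudulent one. The variable "outlier_fraction" is computed as the number of fraudulent rows divided by the number of valid rows, not by the total number of rows; with so few fraud cases the two values are nearly identical.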

In [186]:
# separating the data for analysis
fraud = data[data['Class'] == 1] # Rows for fraudulent transactions
valid = data[data['Class'] == 0] # Rows for valid transactions
outlier_fraction = len(fraud)/float(len(valid)) # Ratio of fraud rows to valid rows
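
The difference between the ratio computed above and the share of fraud among all rows can be checked with simple arithmetic. This sketch uses the row counts reported by the describe() outputs of the sampled data (47 fraudulent, 28434 valid); the variable names are assumptions.

```python
# Row counts taken from the describe() outputs of the sampled data.
n_fraud = 47
n_valid = 28434

ratio_to_valid = n_fraud / n_valid               # what the cell above computes
share_of_total = n_fraud / (n_fraud + n_valid)   # fraud share of all rows

print(f"{ratio_to_valid:.6f} vs {share_of_total:.6f}")
```

The two values differ only in the sixth decimal place here, so either works as a contamination estimate for this sample.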

In [187]:
# statistical measures of the data

In [188]:
fraud.Amount.describe()

count      47.000000
mean      138.351277
std       257.534978
min         0.000000
25%         1.000000
50%        19.730000
75%       125.375000
max      1218.890000
Name: Amount, dtype: float64

In [189]:
valid.Amount.describe()

count    28434.000000
mean        90.205203
std        287.539229
min          0.000000
25%          5.490000
50%         22.000000
75%         75.000000
max      18910.000000
Name: Amount, dtype: float64

In [190]:
# compare the values for both transactions
data.groupby('Class').mean()

Class,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
0,95322.653373,0.000723,-0.00757,0.008243,-0.005397,0.007302,-0.005384,0.013649,0.001215,0.004387,...,0.008862,0.003003,0.001296,-0.002426,0.002611,-0.003692,-0.003154,0.000989,-0.001408,90.205203
1,79162.489362,-6.819613,4.512929,-8.760726,4.934398,-4.418304,-1.13913,-7.928285,-0.683475,-3.121723,...,0.359731,0.399848,0.383473,-0.139352,-0.111357,-0.044343,0.008745,-0.267595,-0.03079,138.351277


This cell creates two variables for the model: X, a DataFrame with the 'Class' column removed, and y, the 'Class' column itself. X is used as the input and y as the output (the true labels).

In [191]:
X = data.drop('Class',axis = 1) # X is input
y = data['Class'] # y is output

The first line creates an IsolationForest instance and fits it. The max_samples parameter is set to the number of rows in the input data (X), and the contamination parameter is set to the outlier_fraction calculated earlier. The fit() method then fits the model to X.values. The labels in y are not used for fitting; they are used only to evaluate the predictions.
The next line uses the predict() method to make predictions on the same input data with the trained model. The predictions are stored in y_prediction2.

The following two lines convert the predictions, which are -1 for anomalous observations and 1 for non-anomalous ones, into the labels used by the 'Class' column: every 1 in y_prediction2 is replaced with 0, and then every -1 is replaced with 1.
The next line counts the errors by comparing y_prediction2 with the true labels (y) and summing the instances where they differ.
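
The label conversion can also be written as a single vectorized expression. This is a small sketch with a made-up prediction array; `raw` stands in for the output of predict() and is an assumption.

```python
import numpy as np

raw = np.array([1, -1, 1, 1, -1])     # hypothetical predict() output
labels = np.where(raw == -1, 1, 0)    # -1 (anomaly) -> 1, 1 (normal) -> 0

print(labels.tolist())                # [0, 1, 0, 0, 1]
```

Unlike two in-place assignments, this form does not depend on the order in which the replacements are applied.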

In [192]:
b = IsolationForest(max_samples = len(X),contamination = outlier_fraction).fit(X.values) # Fitting the model.
y_prediction2 = b.predict(X.values) # Prediction using trained model.
y_prediction2[y_prediction2 == 1] = 0 # Valid transactions are labelled as 0.
y_prediction2[y_prediction2 == -1] = 1 # Fraudulent transactions are labelled as 1.
errors2 = (y_prediction2 != y).sum() # Total number of errors is calculated.
print(errors2)
print(accuracy_score(y_prediction2,y))
print(classification_report(y_prediction2,y))

61
0.9978582212703205
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     28433
           1       0.36      0.35      0.36        48

    accuracy                           1.00     28481
   macro avg       0.68      0.68      0.68     28481
weighted avg       1.00      1.00      1.00     28481



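The accuracy_score() and classification_report() functions from sklearn.metrics are used to evaluate the model. accuracy_score() computes the fraction of correct predictions, and classification_report() reports precision, recall, and f1-score for each class. With 61 errors in 28,481 rows the accuracy is about 99.8%, but this mostly reflects the class imbalance: for class 1, only about a third of the fraud cases are detected and only about a third of the flagged transactions are truly fraudulent. Both functions expect the true labels first and the predictions second. The cells here pass the predictions first, so the precision and recall columns in the printed report are swapped and the support column counts predicted rather than true labels; accuracy is unaffected.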
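
The effect of argument order on these metrics can be seen on a toy example. The label lists below are made up for illustration; scikit-learn's convention is (y_true, y_pred).

```python
from sklearn.metrics import precision_score, recall_score

y_true = [0, 0, 0, 1, 1, 0, 0, 0]   # hypothetical true labels
y_pred = [0, 0, 1, 1, 0, 0, 0, 1]   # hypothetical predictions

# Correct order: true labels first, predictions second.
precision = precision_score(y_true, y_pred)   # 1 of 3 flagged rows is truly positive
recall = recall_score(y_true, y_pred)         # 1 of 2 true positives is found

# Swapping the arguments swaps the two metrics.
swapped_precision = precision_score(y_pred, y_true)
swapped_recall = recall_score(y_pred, y_true)

print(precision, recall)
print(swapped_precision, swapped_recall)
```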

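A sample drawn without replacement cannot be larger than the original data, so the number of valid transactions requested below (550) must not exceed the number of rows in 'valid'. The next cells undersample the valid transactions and combine them with all fraudulent ones to form a smaller, less imbalanced dataset.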

In [193]:
valid_sample = valid.sample(n=550) # Undersample valid transactions; the f1-score drops when this sample is small

In [194]:
new_dataset = pd.concat([valid_sample, fraud], axis=0)
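
The undersampling step can be sketched on a toy table as follows. The DataFrame `toy`, the target of 20 rows, and the seed are assumptions for illustration; the guard with min() avoids requesting more rows than exist.

```python
import pandas as pd

# Synthetic stand-in: 95 valid rows and 5 fraudulent rows.
toy = pd.DataFrame({"Amount": range(100), "Class": [0] * 95 + [1] * 5})
minority = toy[toy["Class"] == 1]
majority = toy[toy["Class"] == 0]

# n may not exceed the number of rows available when sampling without replacement.
n_keep = min(20, len(majority))
majority_sample = majority.sample(n=n_keep, random_state=0)

# Stack the undersampled majority on top of all minority rows.
balanced = pd.concat([majority_sample, minority], axis=0)

print(balanced["Class"].value_counts().to_dict())   # {0: 20, 1: 5}
```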

In [195]:
new_dataset.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
278900,168511.0,-1.076119,0.886882,0.503935,0.222523,1.111138,-0.831998,1.307208,-0.21627,-0.12603,...,-0.101284,0.057317,-0.661175,-0.038831,1.28672,-0.37703,0.016451,-0.192935,24.78,0
260313,159494.0,-2.657389,2.94897,-1.430304,-1.648488,1.009991,-0.38823,1.486252,-0.612245,3.132351,...,-0.965362,-1.086562,0.052306,0.00769,0.197308,0.074475,0.98824,0.133745,8.95,0
129473,79100.0,-1.451348,0.7347,0.654889,-3.615261,-0.254103,-1.681033,0.308082,-0.610166,1.410541,...,0.674827,-0.321999,-0.166804,0.411268,0.379246,-1.224412,0.342805,0.059571,6.05,0
23080,32578.0,-1.324167,-0.499988,1.769284,-2.561669,-1.131916,0.375047,-1.029441,0.880367,-2.528083,...,0.134854,0.533036,-0.088409,-0.325176,0.262679,-0.127938,0.203363,0.006226,53.0,0
22381,32217.0,1.193696,0.167419,0.609451,0.653714,-0.506251,-0.723992,-0.049388,-0.054022,0.109317,...,-0.215205,-0.618754,0.21354,0.378624,0.08084,0.106042,-0.016047,0.017638,1.79,0


In [196]:
new_dataset.groupby('Class').mean()

Class,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
0,95219.418182,-0.054361,0.013658,0.022526,-0.020625,0.010162,0.118429,-0.010299,-0.008035,0.029401,...,-0.011799,0.027019,0.043447,0.005794,0.005801,0.018672,0.063411,0.016892,-0.004364,87.287327
1,79162.489362,-6.819613,4.512929,-8.760726,4.934398,-4.418304,-1.13913,-7.928285,-0.683475,-3.121723,...,0.359731,0.399848,0.383473,-0.139352,-0.111357,-0.044343,0.008745,-0.267595,-0.03079,138.351277


In [197]:
new_dataset['Class'].value_counts()

0    550
1     47
Name: Class, dtype: int64

In [177]:
X2 = new_dataset.drop('Class',axis = 1) # X2 is input
y2 = new_dataset['Class'] # y2 is output

In [178]:
b2 = IsolationForest(max_samples = len(X2),contamination = outlier_fraction).fit(X2.values) # Fitting the model.
y_prediction3 = b2.predict(X2.values) # Prediction using trained model.
y_prediction3[y_prediction3 == 1] = 0 # Valid transactions are labelled as 0.
y_prediction3[y_prediction3 == -1] = 1 # Fraudulent transactions are labelled as 1.
errors3 = (y_prediction3 != y2).sum() # Total number of errors is calculated.
print(errors3)
print(accuracy_score(y_prediction3,y2))
print(classification_report(y_prediction3,y2))

50
0.9169435215946844
              precision    recall  f1-score   support

           0       1.00      0.92      0.96       600
           1       0.04      1.00      0.07         2

    accuracy                           0.92       602
   macro avg       0.52      0.96      0.52       602
weighted avg       1.00      0.92      0.95       602
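
The second model reuses the outlier_fraction computed on the 10% sample, although the undersampled dataset has a much higher fraud share (47 of 597 rows according to the value_counts() output above). With such a low contamination value the model flags very few rows. One possible adjustment, shown here as a hedged sketch on synthetic data rather than as the notebook's method, is to derive contamination from the data being fitted. All names and values below are assumptions.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic data: 190 normal points and 10 points far from them.
normal = rng.normal(0.0, 1.0, size=(190, 2))
anomalies = rng.normal(8.0, 1.0, size=(10, 2))
features = np.vstack([normal, anomalies])
labels = np.array([0] * 190 + [1] * 10)

# Use the anomaly share of the data actually being fitted (here 0.05).
contamination = float(labels.mean())
model = IsolationForest(contamination=contamination, random_state=0).fit(features)

# Map predict() output to 0/1 labels and count the flagged rows.
predicted = np.where(model.predict(features) == -1, 1, 0)
print(predicted.sum(), "rows flagged out of", len(predicted))
```

The number of flagged rows follows the contamination value closely, so a value taken from a different dataset changes how many transactions the model reports as fraud.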

