### This case study develops a model for predicting fraudulent transactions for a financial company and draws insights from it. The data is available in CSV format with 6362620 rows and 10 columns. I tried several ML classification models to find the best fit.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

<a id='EDA'></a>
## 1] Exploratory Data Analysis - Getting familiar with the dataset

In [None]:
df = pd.read_csv("../input/fraud-transaction-detection/Fraud (1).csv")
df.head()

In [None]:
df.columns

In [None]:
df.describe()

In [None]:
df.info()

### 1.1] There are 5 unique transaction types that we need to check for fraud

In [None]:
df.type.unique()

### 1.2] There are no null values

In [None]:
df.isnull().values.any()

### 1.3] Visualisation to see which transaction types are used most

In [None]:
# Count how many transactions there are of each type (fraud and non-fraud together)
word1, count1 = np.unique(df["type"].values, return_counts=True)
plt.figure(figsize=[15, 15])
plt.bar(word1, count1, width=0.7, edgecolor='blueviolet',
        color=['yellow','green','purple','blue','orange'], linewidth=2)
plt.title('Client Transactions by Type', fontsize=15)
plt.xlabel('Transaction type', fontsize=15)
plt.ylabel('Number of transactions', fontsize=15)
plt.show()

In [None]:
# counting fraud cases in each transaction type
plt.figure(figsize=(11,7))
sns.countplot(x='type', hue='isFraud', data=df, palette='Set1')

### 1.4] The visualisation above shows which transaction types have isFraud equal to 1 or 0. Let's confirm the same thing numerically below


In [None]:
# number of transfer transactions where isfraud = 1
df[(df['type']=='TRANSFER')&(df['isFraud']==1)]['isFraud'].sum()

In [None]:
# number of cash out transactions where isfraud = 1
df[(df['type']=='CASH_OUT')&(df['isFraud']==1)]['isFraud'].sum()

In [None]:
# number of cash in transactions where isfraud = 1
df[(df['type']=='CASH_IN')&(df['isFraud']==1)]['isFraud'].sum()

In [None]:
# number of debit transactions where isfraud = 1
df[(df['type']=='DEBIT')&(df['isFraud']==1)]['isFraud'].sum()

In [None]:
# number of payment transactions where isfraud = 1
df[(df['type']=='PAYMENT')&(df['isFraud']==1)]['isFraud'].sum()

### 1.5] We will do the same for the isFlaggedFraud column for each transaction type

In [None]:
df[(df['type']=='TRANSFER')&(df['isFlaggedFraud']==1)]['isFlaggedFraud'].sum()

In [None]:
df[(df['type']=='CASH_OUT')&(df['isFlaggedFraud']==1)]['isFlaggedFraud'].sum()

In [None]:
df[(df['type']=='CASH_IN')&(df['isFlaggedFraud']==1)]['isFlaggedFraud'].sum()

In [None]:
df[(df['type']=='PAYMENT')&(df['isFlaggedFraud']==1)]['isFlaggedFraud'].sum()

In [None]:
df[(df['type']=='DEBIT')&(df['isFlaggedFraud']==1)]['isFlaggedFraud'].sum()
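
The per-type checks in 1.4 and 1.5 can also be collapsed into one grouped sum. A minimal sketch on a tiny hand-made frame (`sample` is a hypothetical stand-in for `df`, with made-up values) so that it runs on its own:

```python
import pandas as pd

# Hypothetical toy data standing in for the real `df`
sample = pd.DataFrame({
    "type": ["TRANSFER", "CASH_OUT", "PAYMENT", "TRANSFER", "DEBIT", "CASH_IN"],
    "isFraud": [1, 1, 0, 0, 0, 0],
    "isFlaggedFraud": [1, 0, 0, 0, 0, 0],
})

# Both columns hold 0/1 values, so summing per type counts the positive rows
fraud_by_type = sample.groupby("type")[["isFraud", "isFlaggedFraud"]].sum()
print(fraud_by_type)
```

On the real data, replacing `sample` with `df` should give the same numbers as the individual cells above, in one table.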

### 1.6] Checking whether isFlaggedFraud depends on a customer transacting more than once

In [None]:
# Number of distinct origin customers among the flagged TRANSFERs; if it equals the 16 flagged rows, every sender is unique
df[(df['type']=='TRANSFER')&(df['isFlaggedFraud']==1)]['nameOrig'].nunique()

In [None]:
# Number of distinct destination customers among the flagged TRANSFERs; if it equals the 16 flagged rows, every receiver is unique
df[(df['type']=='TRANSFER')&(df['isFlaggedFraud']==1)]['nameDest'].nunique()

### 1.7] isFlaggedFraud - In this dataset an illegal attempt is defined as an attempt to transfer more than 200000 in a single transaction. Let's see if this holds true

In [None]:
df[(df['type']=='TRANSFER')&(df['isFlaggedFraud']==1)]['amount'].max()

In [None]:
df[(df['type']=='TRANSFER')&(df['isFlaggedFraud']==1)]['amount'].min()

In [None]:
df[(df['type']=='TRANSFER')&(df['isFlaggedFraud']==0)]['amount'].min()

In [None]:
df[(df['type']=='TRANSFER')&(df['isFlaggedFraud']==0)]['amount'].max()
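
Besides comparing the min and max amounts, the rule can be tested directly by counting rows that contradict it. A sketch on hypothetical toy data (`sample` stands in for `df`; the threshold of 200000 is taken from the column definition quoted above, everything else is made up):

```python
import pandas as pd

# Hypothetical toy data standing in for the real `df`
sample = pd.DataFrame({
    "type": ["TRANSFER", "TRANSFER", "TRANSFER", "CASH_OUT"],
    "amount": [150000.0, 250000.0, 900000.0, 300000.0],
    "isFlaggedFraud": [0, 0, 1, 0],
})

THRESHOLD = 200000  # from the stated definition of isFlaggedFraud
transfers = sample[sample["type"] == "TRANSFER"]
over = transfers["amount"] > THRESHOLD
flagged = transfers["isFlaggedFraud"] == 1

# If the rule held exactly, both of these counts would be zero
unflagged_over = int((over & ~flagged).sum())   # above the threshold but not flagged
flagged_under = int((~over & flagged).sum())    # flagged although not above the threshold
print(unflagged_over, flagged_under)
```

A non-zero `unflagged_over` on the real data would mean the flag does not follow the stated threshold.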

<a id='EDAConc'></a>
## 2] Conclusion from EDA
### 2.1] About isFraud: 
*  Only CASH_OUT and TRANSFER transactions have fraud cases.
*  So we will drop the DEBIT, CASH_IN and PAYMENT transactions from our dataset, since they contain no fraud cases; this also shortens the table.

### 2.2] About isFlaggedFraud:
*  isFlaggedFraud is set to 1 just 16 times, with no apparent pattern.
*  isFlaggedFraud being set to 1 cannot be explained by a threshold on the amount transferred, since the corresponding range of amounts overlaps with that of TRANSFERs where isFlaggedFraud is 0.
*  It also does not match the definition of the column (an illegal attempt in this dataset is an attempt to transfer more than 200000 in a single transaction).
*  isFlaggedFraud is set to 1 only for unique customers. No customer is flagged more than once.
*  Thus we can treat isFlaggedFraud as insignificant and discard it from the dataset without losing information.

<a id='cleaning'></a>
## 3] Data cleaning
### 3.1] Dropping unused data - PAYMENT, DEBIT and CASH_IN transactions and the isFlaggedFraud column

In [None]:
dt = pd.read_csv("../input/fraud-transaction-detection/Fraud (1).csv")
# Remove the transaction types that contain no fraud cases, then the isFlaggedFraud column
dt = dt[~dt['type'].isin(['PAYMENT', 'DEBIT', 'CASH_IN'])]
dt = dt.drop(columns='isFlaggedFraud')
dt.head()

<a id='percent'></a>
### 3.2] Finding the number and percent of fraud and legit transactions in the cleaned data

In [None]:
#no. of transactions in the cleaned data
len(dt)

In [None]:
legit = len(dt[dt.isFraud == 0])
fraud = len(dt[dt.isFraud == 1])
legit_percent = (legit / (fraud + legit)) * 100
fraud_percent = (fraud / (fraud + legit)) * 100

print("Number of Legit transactions: ", legit)
print("Number of Fraud transactions: ", fraud)
print("Percentage of Legit transactions: {:.4f} %".format(legit_percent))
print("Percentage of Fraud transactions: {:.4f} %".format(fraud_percent))

### 3.3] Making the data suitable for a ML model

In [None]:
# Make a new dataframe without the customer name columns
df2=dt.copy()
df2.drop(['nameOrig','nameDest'], axis=1, inplace=True)
df2.head()

In [None]:
# Encode the type column as binary: TRANSFER = 1, CASH_OUT = 0
df3 = df2.copy()
df3['type'] = df3['type'].map({'TRANSFER': 1, 'CASH_OUT': 0})
df3.head()

<a id='colinearity'></a>
### 3.4] Checking for collinearity

In [None]:
corr=df3.corr()
plt.figure(figsize=(10,6))
sns.heatmap(corr,annot=True)

In [None]:
# Import the VIF (variance inflation factor) helper
from statsmodels.stats.outliers_influence import variance_inflation_factor

def calc_vif(df):
    # Compute the VIF of every column in df and return the values as a table
    vif = pd.DataFrame()
    vif["variables"] = df.columns
    vif["VIF"] = [variance_inflation_factor(df.values, i) for i in range(df.shape[1])]
    return vif

calc_vif(df3)
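
For reference, the textbook VIF of a column is 1 / (1 - R²), where R² comes from regressing that column on all the others. The sketch below implements that definition with NumPy on synthetic data. `vif_numpy` is a hypothetical helper, not part of statsmodels, and it fits an intercept, so its numbers need not match `calc_vif` exactly:

```python
import numpy as np

def vif_numpy(X):
    """Textbook VIF per column: 1 / (1 - R^2) from regressing it on the other columns."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = []
    for j in range(p):
        y = X[:, j]
        # Design matrix: intercept plus every column except column j
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ coef
        r2 = 1.0 - resid.var() / y.var()
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Synthetic example: b is almost a copy of a, c is independent of both
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a + rng.normal(scale=0.05, size=500)
c = rng.normal(size=500)
vifs = vif_numpy(np.column_stack([a, b, c]))
print(vifs)
```

The two nearly identical columns get a large VIF while the independent one stays close to 1, which is the pattern to look for in the table above.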

<a id='3.5'></a>
### 3.5] The heatmap and the VIF values show that oldbalanceOrg, newbalanceOrig, newbalanceDest and oldbalanceDest are highly correlated.
### We will replace these columns with derived ones that are usable for our model.

In [None]:
# Amount that actually left the origin account and arrived at the destination account
df3['Actual_amount_orig'] = df3['oldbalanceOrg'] - df3['newbalanceOrig']
df3['Actual_amount_dest'] = df3['newbalanceDest'] - df3['oldbalanceDest']

# Drop the original balance columns and the step column
new_df = df3.drop(['oldbalanceOrg','newbalanceOrig','oldbalanceDest','newbalanceDest', 'step'],axis=1)

calc_vif(new_df)

In [None]:
corr=new_df.corr()

plt.figure(figsize=(10,6))
sns.heatmap(corr,annot=True)

In [None]:
new_df2=new_df.copy()

Y = new_df2["isFraud"]
X = new_df2.drop(["isFraud"], axis= 1)
new_df2.head()

<a id='model'></a>
## 4] Model Building

### We will try out different classification models to see which gives the best results

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
import sklearn.metrics as metrics
from sklearn.metrics import classification_report, confusion_matrix

In [None]:
# Split the data
(X_train, X_test, Y_train, Y_test) = train_test_split(X, Y, test_size= 0.3, random_state= 42)

print("Shape of X_train: ", X_train.shape)
print("Shape of X_test: ", X_test.shape)
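
Because fraud cases are so rare (see 3.2), a plain random split can leave the train and test sets with slightly different fraud rates. `train_test_split` accepts a `stratify` argument that keeps the class ratio the same in both parts. A sketch on synthetic data (the arrays `X_demo` and `y_demo` are made up; the split above does not stratify):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, heavily imbalanced data: 10 positives out of 1000 rows
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.zeros(1000, dtype=int)
y_demo[:10] = 1

# stratify=y_demo keeps the share of positives equal in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)
print("Positives in train:", y_tr.sum(), "in test:", y_te.sum())
```

In the notebook this would amount to passing `stratify=Y` to the existing call.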

<a id='LR'></a>
### 4.1] Logistic Regression

In [None]:
lm = LogisticRegression()
lm.fit(X_train, Y_train)
pr = lm.predict(X_test)
print(classification_report(Y_test,pr))
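
`StandardScaler` is imported above but not applied. Logistic regression is sensitive to features on very different scales, and with rare positives it can help to weight the classes. The sketch below shows one possible setup on synthetic data. The pipeline, `class_weight="balanced"` and `max_iter=1000` are assumptions for illustration, not the configuration used in this notebook:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: two features on very different scales and a rare positive class
rng = np.random.default_rng(0)
n = 2000
X_demo = rng.normal(size=(n, 2)) * np.array([1e5, 1.0])
y_demo = (X_demo[:, 0] > 2e5).astype(int)

# Scale the features, then fit a class-weighted logistic regression
scaled_lr = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
scaled_lr.fit(X_demo, y_demo)
demo_pred = scaled_lr.predict(X_demo)

# Recall on the rare class: share of true positives that were found
demo_recall = ((demo_pred == 1) & (y_demo == 1)).sum() / y_demo.sum()
print("Recall on the rare class:", demo_recall)
```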

In [None]:
print("Logistic Regression")
tn, fp, fn, tp = confusion_matrix(Y_test, pr).ravel()
print(f'True Positives: {tp}')
print(f'False Positives: {fp}')
print(f'True Negatives: {tn}')
print(f'False Negatives: {fn}')

<a id='DT'></a>
### 4.2] Decision Tree

In [None]:
decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, Y_train)

Y_pred_dt = decision_tree.predict(X_test)
# score() returns the mean accuracy; multiply by 100 to show it as a percentage
decision_tree_score = decision_tree.score(X_test, Y_test) * 100
print("Decision Tree Accuracy (%): ", decision_tree_score)

In [None]:
print("Decision Tree")
tn, fp, fn, tp = confusion_matrix(Y_test, Y_pred_dt).ravel()
print(f'True Positives: {tp}')
print(f'False Positives: {fp}')
print(f'True Negatives: {tn}')
print(f'False Negatives: {fn}')

In [None]:
classification_report_dt = classification_report(Y_test, Y_pred_dt)
print("Classification Report - Decision Tree")
print(classification_report_dt)

## 5] Conclusion
-  Comparing the two confusion matrices, the decision tree performs better than logistic regression on every count:

1.  TP(Decision Tree) > TP(logistic regression) 
2.  FP(Decision Tree) < FP(logistic regression) 
3.  TN(Decision Tree) > TN(logistic regression) 
4.  FN(Decision Tree) < FN(logistic regression)

-  The two classification reports also show that the decision tree has a better F1 score and precision.
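
The precision, recall and F1 values in the classification reports follow directly from the confusion-matrix counts. A small sketch of the arithmetic, using made-up counts rather than this notebook's results (`prf` is a hypothetical helper name):

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 for the positive class from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Made-up counts for illustration only
p, r, f = prf(tp=80, fp=20, fn=40)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

More true positives with fewer false positives raises precision, and fewer false negatives raises recall, which is why the four comparisons above translate into a better F1 score.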

### Note - I was not able to test a random forest classifier because it took too long to run, and I couldn't figure out an alternative
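
One possible way to make a random forest cheaper to train is to use fewer and shallower trees, train each tree on a subsample, and build the trees in parallel. The sketch below shows these settings on synthetic data. The parameter values are assumptions picked for illustration, and it has not been run on the full dataset, so it makes no claim about run time or scores there:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic imbalanced data standing in for X_train / Y_train
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(5000, 4))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 2.5).astype(int)

rf = RandomForestClassifier(
    n_estimators=50,   # fewer trees than the default of 100
    max_depth=10,      # cap the depth of each tree
    max_samples=0.2,   # each tree sees a 20% bootstrap sample of the rows
    n_jobs=-1,         # build trees on all available CPU cores
    random_state=42,
)
rf.fit(X_demo, y_demo)
rf_pred = rf.predict(X_demo)
print("Trees built:", len(rf.estimators_))
```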

# Insights

1. Data cleaning including missing values, outliers and multi-collinearity.
   -  Dropped the rows that were not going to be used in the model building. See section 3.
   -  Dropped the columns that showed collinearity and replaced them with new columns. See sections 3.4 and 3.5.
2. Describe your fraud detection model in elaboration.
   - The results in section 3.2 show that the data is highly imbalanced: legit transactions make up 99.7% and fraud transactions 0.29%.
   - We tested logistic regression and decision tree algorithms and proceeded with the decision tree, which performed better of the two on this strongly imbalanced data.
3. How did you select variables to be included in the model?
   - We dropped the isFlaggedFraud column based on our conclusion in section 2.
   - Using the VIF values and correlation results in section 3.4, we dropped the correlated independent variables and kept the ones that were better correlated with the isFraud attribute.
4. Demonstrate the performance of the model by using the best set of tools.
   - Used the confusion matrix and classification report to assess the performance of the trained models. See sections 4.1 and 4.2.
5. What are the key factors that predict a fraudulent customer?
   - In section 3.5 we created two actual-amount columns (Actual_amount_orig and Actual_amount_dest) for the origin and destination accounts. They show a correlation with the target variable, so they are used as predictors.


