# Business Context

This case requires trainees to develop a model for predicting fraudulent transactions for a financial company and use insights from the model to develop an actionable plan. Data for the case is available in CSV format having 6362620 rows and 10 columns.
Candidates can use whatever method they wish to develop their machine learning model. Following usual model development procedures, the model would be estimated on the calibration data and tested on the validation data. This case requires both statistical analysis and creativity/judgment. We recommend you spend time on both fine-tuning and interpreting the results of your machine learning model.

# Data Dictionary

step - maps a unit of time in the real world. In this case 1 step is 1 hour of time. Total steps: 744 (described as a 30-day simulation, although 744 hours is 31 days).

type - CASH-IN, CASH-OUT, DEBIT, PAYMENT and TRANSFER.

amount - amount of the transaction in local currency.

nameOrig - customer who started the transaction

oldbalanceOrg - initial balance before the transaction

newbalanceOrig - new balance after the transaction

nameDest - customer who is the recipient of the transaction

oldbalanceDest - initial balance of the recipient before the transaction. Note that there is no information for customers whose names start with M (Merchants).

newbalanceDest - new balance of the recipient after the transaction. Note that there is no information for customers whose names start with M (Merchants).

isFraud - Marks the transactions made by the fraudulent agents inside the simulation. In this dataset, the fraudulent agents aim to profit by taking control of customers' accounts and emptying the funds by transferring them to another account and then cashing out of the system.

isFlaggedFraud - The business model aims to control massive transfers from one account to another and flags illegal attempts. An illegal attempt in this dataset is an attempt to transfer more than 200.000 in a single transaction.
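
The `isFlaggedFraud` rule described above can be checked against the data. The following is a minimal sketch on a small made-up frame rather than `Fraud.csv`; the helper name `flag_rule_mismatches` and the sample rows are assumptions for illustration, and the threshold "200.000" is read as two hundred thousand.

```python
import pandas as pd

# Assumption: "200.000" in the data dictionary means 200,000 units of local currency.
FLAG_THRESHOLD = 200_000

def flag_rule_mismatches(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the rows where isFlaggedFraud disagrees with the stated rule."""
    expected = (
        (frame["type"] == "TRANSFER") & (frame["amount"] > FLAG_THRESHOLD)
    ).astype(int)
    return frame[frame["isFlaggedFraud"] != expected]

# Hypothetical sample rows, not taken from the real dataset.
sample = pd.DataFrame({
    "type": ["TRANSFER", "TRANSFER", "CASH-OUT", "TRANSFER"],
    "amount": [250_000.0, 1_000.0, 300_000.0, 500_000.0],
    "isFlaggedFraud": [1, 0, 0, 0],
})

mismatches = flag_rule_mismatches(sample)
print(mismatches)
```

Any rows returned are transactions where the flag and the written rule disagree, which would be worth inspecting before relying on `isFlaggedFraud` as a feature.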

# Import Libraries

In [None]:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import os

# Import the Dataset

In [None]:
# We change the directory to where the files are currently stored

os.chdir('C:/Users/sadik/OneDrive/Desktop/Insaid')

df=pd.read_csv('Fraud.csv')

df.head()

In [None]:
df.info()

In [None]:
df.describe()

# Data Cleaning, Preprocessing and EDA

In [None]:
# Check for null values if any

df.isnull().values.any()


In [None]:
# Check for duplicate values

duplicateRows = df[df.duplicated()]

duplicateRows

# Check For NAN Values In df


In [None]:
df.isna().any()

In [None]:
df.type.value_counts()


In [None]:
df.isFraud.value_counts()


In [None]:
df.isFlaggedFraud.value_counts()


# We group the fraudulent transactions together, and likewise the transactions that are flagged as fraud

In [None]:
Fraud=df.loc[df.isFraud == 1]
Fraud.head()

In [None]:
FlaggedFraud= df.loc[df.isFlaggedFraud == 1]
FlaggedFraud.head(10)

In [None]:
def countplot(x,df):
    bar_plot1 = sns.countplot(x=x, data=df, order = df[x].value_counts().index)
    for p in bar_plot1.patches:
        height = p.get_height()
        bar_plot1.text(p.get_x()+ p.get_width()/2, height + 1, height)

In [None]:
countplot("isFraud", df)


In [None]:
sns.countplot(x=df.type) ## count of transactions per type (discrete data)


In [None]:
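# Visualizing correlation coefficients between the numeric features and fraud status:
fig = plt.figure(figsize=(8,10))
# numeric_only=True skips the string columns (type, nameOrig, nameDest), which are not encoded yet
ax = sns.heatmap(df.corr(numeric_only=True)[['isFraud']].sort_values('isFraud', ascending=False), annot = True, annot_kws = {"size":12}, cmap='Blues')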
ax.set_title('Correlation Coefficient Between Each Numeric Feature and Fraud Status', fontsize=18)
ax.set_xlabel('Features', fontsize = 16)
ax.set_ylabel('Features', fontsize = 16)
ax.tick_params(axis = "both", labelsize = 12);
y_min, y_max = ax.get_ylim()
ax.set_ylim(top=y_max+1);

## Converting categorical features to numerical


In [None]:
from sklearn.preprocessing import LabelEncoder
label = LabelEncoder()
df.type=label.fit_transform(df.type)
df.nameOrig=label.fit_transform(df.nameOrig)
df.nameDest=label.fit_transform(df.nameDest)
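
`LabelEncoder` assigns each category an arbitrary integer code, so the codes for `type` carry an ordering that has no real meaning. One alternative for a low-cardinality column such as `type` is one-hot encoding. The sketch below uses `pd.get_dummies` on a small made-up frame; the sample rows are assumptions for illustration, and whether this helps the model here has not been tested.

```python
import pandas as pd

# Hypothetical sample rows standing in for the real dataset.
sample = pd.DataFrame({
    "type": ["PAYMENT", "TRANSFER", "CASH-OUT", "TRANSFER"],
    "amount": [10.0, 250.0, 99.0, 1200.0],
})

# One indicator column per transaction type, with 0/1 integer values.
encoded = pd.get_dummies(sample, columns=["type"], prefix="type", dtype=int)
print(encoded.columns.tolist())
```

The high-cardinality columns `nameOrig` and `nameDest` are not good candidates for this, since they would produce one column per customer.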

## Checking for outliers, if any


In [None]:

box=df[['step', 'type', 'amount', 'nameOrig', 
          'oldbalanceOrg', 'newbalanceOrig', 'nameDest', 'oldbalanceDest', 'newbalanceDest', 'isFraud', 'isFlaggedFraud']]
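plt.figure(figsize=(25,25), facecolor='white')

# One boxplot per column, stacked vertically; the grid has exactly as many rows as columns
for plotnumber, column in enumerate(box.columns, start=1):
    ax = plt.subplot(len(box.columns), 1, plotnumber)
    sns.boxplot(x=box[column])
    plt.xlabel(column,fontsize=20)

plt.tight_layout()
plt.show()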
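
The boxplots show outliers visually. To count them, one common convention is the 1.5 × IQR rule that boxplot whiskers are based on. The following is a minimal sketch on made-up values; the helper name `iqr_outlier_mask` and the sample amounts are assumptions for illustration.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Hypothetical transaction amounts with one extreme value.
amounts = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 500.0])
mask = iqr_outlier_mask(amounts)
print(amounts[mask])
```

In a fraud setting, extreme amounts may be exactly the cases of interest, so counting them is usually safer than dropping them.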

In [None]:
# Perform Feature Selection

In [None]:
## Heatmap of the pairwise correlations between all columns
sns.set(font_scale=2.25)  # set the font scale before plotting so it applies to this figure
plt.figure(figsize=(20, 15))
sns.heatmap(df.corr(), annot=True, fmt= '.1f')
plt.show()

In [None]:
## Correlation of each column with isFraud, sorted from highest to lowest
corr_matrix = df.corr()
corr_matrix["isFraud"].sort_values(ascending=False)

# 5. What are the key factors that predict fraudulent customer?

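Judging by the correlation with isFraud computed above, the key factor that predicts a fraudulent transaction is amount. isFraud itself is the target variable, so it cannot count as a predictor.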

# 6. Do these factors make sense? If yes, How? If not, How not?

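Yes. As noted above, the higher the transaction amount, the more likely the transaction is to be fraudulent. This is consistent with the fraud pattern described in the data dictionary, where fraudulent agents try to empty an account's funds.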
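
The link between amount and fraud can be checked more directly than with a correlation coefficient by comparing the fraud rate across amount buckets. The following is a minimal sketch on made-up rows; the sample data and the choice of two buckets are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample rows, not taken from the real dataset.
sample = pd.DataFrame({
    "amount": [10, 20, 30, 40, 1000, 2000, 3000, 4000],
    "isFraud": [0, 0, 0, 0, 0, 1, 1, 1],
})

# Split the amounts at the median into a "low" and a "high" bucket.
sample["amount_bucket"] = pd.qcut(sample["amount"], q=2, labels=["low", "high"])

# The mean of a 0/1 column is the share of fraudulent transactions in each bucket.
fraud_rate = sample.groupby("amount_bucket", observed=True)["isFraud"].mean()
print(fraud_rate)
```

On the real data, more buckets (for example deciles) would show whether the fraud rate rises steadily with amount or only in the top range.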


# Model Building

In [None]:
##creating independent and dependent variables X & y
## isFraud is the target, so it is left out of X to avoid target leakage
X = df.loc[:,['amount','oldbalanceOrg','newbalanceOrig','step','type','nameOrig']]
y = df.isFraud

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix


# Splitting the data into test and train for calculating accuracy
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=42)
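
Because fraud cases are typically a small share of all transactions, a purely random split can leave the train and test sets with different fraud rates. Passing `stratify` to `train_test_split` keeps the class proportions equal in both parts. The sketch below uses synthetic data; the arrays `X_demo` and `y_demo` and the 2% positive rate are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 3 features, 2% positives.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.zeros(1000, dtype=int)
y_demo[:20] = 1

# stratify=y_demo preserves the 2% positive rate in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)
print(y_tr.mean(), y_te.mean())
```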

In [None]:
X_train.shape,X_test.shape,y_train.shape,y_test.shape


In [None]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# 2. Describe your fraud detection model in elaboration.

In [None]:
from sklearn.tree import DecisionTreeRegressor


In [None]:
dtree=DecisionTreeRegressor(max_depth=25)
dtree.fit(X_train, y_train)

# 3. How did you select variables to be included in the model?

By using the heatmap and checking the correlation between the target and the remaining variables.

# 4. Demonstrate the performance of the model by using best set of tools.

In [None]:
dtree.score(X_train, y_train)


In [None]:
dtree.score(X_test, y_test)


In [None]:
p_test = dtree.predict(X_test)


In [None]:
def mae(p, t):
    return np.sum(np.abs(p - t)) / len(p)

In [None]:
mae(p_test, y_test)


In [None]:
def print_score(mm):
    print("train r^2 " + str(mm.score(X_train, y_train)))
    print("validation r^2 " + str(mm.score(X_test, y_test)))
    p_test = mm.predict(X_test)
    p_train = mm.predict(X_train)
    print("mean absolute error(Train): " + str(mae(p_train, y_train)))
    print("mean absolute error(Validation): " + str(mae(p_test, y_test)))
print_score(dtree)

In [None]:
p_test = dtree.predict(X_test)


In [None]:
p_test
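
The R^2 and mean absolute error reported above are regression metrics. Since `isFraud` is a binary label, the problem can also be treated as classification and judged with a confusion matrix, precision and recall, which say more about rare fraud cases than a single score does. The following is a minimal sketch on synthetic data, not the notebook's model: the labelling rule, `max_depth=5`, and `class_weight="balanced"` are all assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: one "amount" feature and a made-up fraud rule.
rng = np.random.default_rng(0)
amount = rng.exponential(scale=1000.0, size=2000)
is_fraud = (amount > 4000).astype(int)  # hypothetical rule, for the demo only
X_demo = amount.reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, is_fraud, test_size=0.3, random_state=42, stratify=is_fraud
)

# class_weight="balanced" gives the rare fraud class more weight during training.
clf = DecisionTreeClassifier(max_depth=5, class_weight="balanced", random_state=0)
clf.fit(X_tr, y_tr)
pred = clf.predict(X_te)

# Rows are the true classes, columns are the predicted classes, in the order [0, 1].
cm = confusion_matrix(y_te, pred, labels=[0, 1])
print(cm)
print("precision:", precision_score(y_te, pred, zero_division=0))
print("recall:", recall_score(y_te, pred, zero_division=0))
```

Recall shows how much of the fraud is caught, and precision shows how many alerts are genuine. Plain accuracy can look high on imbalanced data even if no fraud is detected.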


# 7. What kind of prevention should be adopted while the company updates its infrastructure?

Set limits on transaction amounts. If the amount is very high, deploy two-step verification or other identity verification techniques to confirm the customer's identity.
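
A rule of this kind can be sketched as a small function that decides whether a transaction should trigger extra verification. This is only an illustration: the function name `needs_step_up`, the 200,000 limit (borrowed from the `isFlaggedFraud` threshold), and the account-draining check with its 0.9 ratio are all assumptions, not values derived from the model.

```python
def needs_step_up(
    amount: float,
    old_balance: float,
    amount_limit: float = 200_000.0,  # hypothetical limit
    drain_ratio: float = 0.9,         # hypothetical share of the balance
) -> bool:
    """Return True if the transaction should require extra identity verification."""
    # Very high amounts always require verification.
    if amount > amount_limit:
        return True
    # Transactions that nearly empty the account also require verification.
    if old_balance > 0 and amount / old_balance >= drain_ratio:
        return True
    return False

print(needs_step_up(250_000, 1_000_000))
print(needs_step_up(100, 10_000))
print(needs_step_up(9_500, 10_000))
```

The second check reflects the pattern in the data dictionary, where fraudulent agents try to empty an account. In practice the thresholds would be tuned against the observed fraud rate and the number of legitimate customers affected.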

# 8. Assuming these actions have been implemented, how would you determine if they work?

We can run the model again on newly collected data and compare the results with those from before the changes to determine whether the measures work.
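
One way to make that comparison concrete is to test whether the fraud rate after the changes is lower than before by more than chance would explain. The sketch below uses a one-sided two-proportion z-test; the function name `two_proportion_z` and the counts are made up for illustration, and the test assumes that the two periods are otherwise comparable.

```python
import math

def two_proportion_z(fraud_before: int, n_before: int,
                     fraud_after: int, n_after: int) -> tuple[float, float]:
    """One-sided z-test of whether the fraud rate dropped after the changes.

    Returns (z, p_value), where p_value = P(Z >= z) under the null
    hypothesis that the two fraud rates are equal.
    """
    p_before = fraud_before / n_before
    p_after = fraud_after / n_after
    pooled = (fraud_before + fraud_after) / (n_before + n_after)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_before + 1 / n_after))
    z = (p_before - p_after) / se
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper tail of the standard normal
    return z, p_value

# Hypothetical counts: 130 frauds in 100,000 transactions before, 90 in 100,000 after.
z, p_value = two_proportion_z(130, 100_000, 90, 100_000)
print(round(z, 2), round(p_value, 4))
```

A small p-value suggests the drop is unlikely to be noise, but it does not by itself prove that the new controls caused it.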