## Financial Fraud Data Wrangling 

Our initial EDA showed that the data contains many outliers and that most of the distributions are right-skewed. Normally we might remove some of these outliers to get a clearer picture of the data and its distribution. However, our bivariate analysis showed that these outliers are quite important for the model we build later, so we will not remove them.
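
One way to check whether outliers carry signal is to compare the fraud rate among outliers with the rate among the remaining rows. The sketch below is only an illustration: it uses a tiny made-up DataFrame rather than the real transaction log, the 1.5 × IQR rule is an assumed outlier definition (the EDA may have used a different one), and the helper names `iqr_outlier_mask` and `fraud_rate_by_outlier` are hypothetical.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values lying more than k * IQR outside the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


def fraud_rate_by_outlier(frame: pd.DataFrame, column: str, label: str = "isFraud") -> dict:
    """Return the mean of the label among outliers and among non-outliers."""
    mask = iqr_outlier_mask(frame[column])
    return {
        "outliers": frame.loc[mask, label].mean(),
        "non_outliers": frame.loc[~mask, label].mean(),
    }


# Made-up toy data, NOT taken from the real dataset.
toy = pd.DataFrame(
    {
        "amount": [10, 12, 11, 13, 12, 14, 11, 10, 5000, 7000],
        "isFraud": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
    }
)

rates = fraud_rate_by_outlier(toy, "amount")
print(rates)
```

If the two rates differ sharply on the real data, as they do in this toy example, dropping the outliers would discard exactly the rows the classifier needs.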

In [69]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns 

In [70]:
df = pd.read_csv('../data/raw/PS_20174392719_1491204439457_log.csv')

In [71]:
sampled_df = df.sample(n=50000, random_state=42)

In [72]:
# Drop the "nameOrig" and "nameDest" columns, since they are not helpful for the classification model we will use later.

cleaned_sample_df = sampled_df.drop(["nameOrig", "nameDest"], axis="columns")
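
A quick way to sanity-check a drop decision like this is to look at how many distinct values each column holds relative to the number of rows. The sketch below assumes, without confirmation from this notebook, that the name columns behave like near-unique identifiers. The toy values, the `unique_ratio` helper, and the 0.9 threshold are all hypothetical.

```python
import pandas as pd


def unique_ratio(frame: pd.DataFrame) -> pd.Series:
    """Share of distinct values per column (1.0 means every row is unique)."""
    return frame.nunique() / len(frame)


# Made-up toy rows, NOT taken from the real dataset.
toy = pd.DataFrame(
    {
        "type": ["PAYMENT", "CASH_IN", "PAYMENT", "TRANSFER"],
        "nameOrig": ["C1", "C2", "C3", "C4"],
        "nameDest": ["M1", "C5", "M2", "C6"],
    }
)

ratios = unique_ratio(toy)

# Columns whose values are almost all distinct look like identifiers.
id_like = ratios[ratios > 0.9].index.tolist()
print(id_like)
```

Columns with a ratio close to 1.0 act as row identifiers, which is one common reason to leave them out of the model's features.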

In [73]:
cleaned_sample_df.head(5)

| | step | type | amount | oldbalanceOrg | newbalanceOrig | oldbalanceDest | newbalanceDest | isFraud | isFlaggedFraud |
|---|---|---|---|---|---|---|---|---|---|
| 3737323 | 278 | CASH_IN | 330218.42 | 20866.0 | 351084.42 | 452419.57 | 122201.15 | 0 | 0 |
| 264914 | 15 | PAYMENT | 11647.08 | 30370.0 | 18722.92 | 0.0 | 0.0 | 0 | 0 |
| 85647 | 10 | CASH_IN | 152264.21 | 106589.0 | 258853.21 | 201303.01 | 49038.8 | 0 | 0 |
| 5899326 | 403 | TRANSFER | 1551760.63 | 0.0 | 0.0 | 3198359.45 | 4750120.08 | 0 | 0 |
| 2544263 | 206 | CASH_IN | 78172.3 | 2921331.58 | 2999503.88 | 415821.9 | 337649.6 | 0 | 0 |


In [74]:
cleaned_sample_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 50000 entries, 3737323 to 409882
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   step            50000 non-null  int64  
 1   type            50000 non-null  object 
 2   amount          50000 non-null  float64
 3   oldbalanceOrg   50000 non-null  float64
 4   newbalanceOrig  50000 non-null  float64
 5   oldbalanceDest  50000 non-null  float64
 6   newbalanceDest  50000 non-null  float64
 7   isFraud         50000 non-null  int64  
 8   isFlaggedFraud  50000 non-null  int64  
dtypes: float64(5), int64(3), object(1)
memory usage: 3.8+ MB


Next, let's export the data and move on to step three: modeling and making predictions with machine learning algorithms.

In [75]:
# Exclude the index when exporting the cleaned sample

cleaned_sample_df.to_csv("../data/processed/financial_fraud_data.csv", index = False)
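
The effect of `index=False` can be seen by writing a small frame both ways and reading each file back. This is a minimal sketch on a made-up DataFrame written to a temporary directory, not the real export path. The file names are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Made-up toy frame with a non-default index, similar in shape to a sampled frame.
toy = pd.DataFrame(
    {"step": [1, 2], "amount": [10.5, 20.0], "isFraud": [0, 1]},
    index=[3737323, 264914],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    with_index_path = os.path.join(tmp_dir, "with_index.csv")
    without_index_path = os.path.join(tmp_dir, "without_index.csv")

    toy.to_csv(with_index_path)                  # index is written as an unnamed first column
    toy.to_csv(without_index_path, index=False)  # index is left out

    with_index = pd.read_csv(with_index_path)
    without_index = pd.read_csv(without_index_path)

print(list(with_index.columns))
print(list(without_index.columns))
```

When the index is written, `pd.read_csv` picks it up as an extra `Unnamed: 0` column. Leaving it out keeps the processed file limited to the actual features and labels.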