# Basic Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) is a fundamental step in the data analysis process, as it allows us to understand and explore the data before applying more advanced techniques. This notebook will guide you through a basic EDA process to gain valuable insights into the dataset at hand.

This notebook covers the following steps:

1. **Basic overview of the Data Structure**: 
   - **Objective**: Gain a comprehensive overview of the dataset's structure.
   - **Details**: Examine the dataset's dimensions (number of records and columns), the types of data present, and the initial summary statistics.

2. **Clean the Data**:
   - **Objective**: Prepare the data for analysis by addressing any issues.
   - **Details**: Identify and handle missing values, remove duplicate entries, and correct any inconsistencies in the data.


4. **Bivariate Analysis**:
   - **Objective**: Investigate the relationships between pairs of variables.
   - **Details**: Examine how two variables interact with each other. This could involve calculating correlations, creating scatter plots, and evaluating any trends or patterns.

5. **Multivariate Analysis**:
   - **Objective**: Explore the relationships among three or more variables.
   - **Details**: Analyze how multiple variables interact simultaneously. Techniques may include multivariate regression, principal component analysis (PCA), and creating complex visualizations such as pair plots or 3D scatter plots.

6. **Visualize the Data**:
   - **Objective**: Create visual representations to simplify the understanding of data patterns.
   - **Details**: Generate various plots (e.g., histograms, bar charts, heatmaps) to highlight key patterns, trends, and relationships in the data, aiding in clearer interpretation and communication of findings.

7. **Conclusions**:
    Finally, I will summarize the most important findings from the EDA and discuss potential next steps for further analysis.





Let's get started!

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from src.utils import pie_plot
import numpy as np

# Basic overview of the Data Structure

In [2]:
df = pd.read_csv('../data/raw/PS_20174392719_1491204439457_log.csv')

In [3]:
print(f'This dataset contains {df.shape[0]} samples and {df.shape[1]} features')

This dataset contains 6362620 samples and 11 features


In [4]:
df.head()

| | step | type | amount | nameOrig | oldbalanceOrg | newbalanceOrig | nameDest | oldbalanceDest | newbalanceDest | isFraud | isFlaggedFraud |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | PAYMENT | 9839.64 | C1231006815 | 170136.0 | 160296.36 | M1979787155 | 0.0 | 0.0 | 0 | 0 |
| 1 | 1 | PAYMENT | 1864.28 | C1666544295 | 21249.0 | 19384.72 | M2044282225 | 0.0 | 0.0 | 0 | 0 |
| 2 | 1 | TRANSFER | 181.0 | C1305486145 | 181.0 | 0.0 | C553264065 | 0.0 | 0.0 | 1 | 0 |
| 3 | 1 | CASH_OUT | 181.0 | C840083671 | 181.0 | 0.0 | C38997010 | 21182.0 | 0.0 | 1 | 0 |
| 4 | 1 | PAYMENT | 11668.14 | C2048537720 | 41554.0 | 29885.86 | M1230701703 | 0.0 | 0.0 | 0 | 0 |


In [5]:
df.info(show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6362620 entries, 0 to 6362619
Data columns (total 11 columns):
 #   Column          Non-Null Count    Dtype  
---  ------          --------------    -----  
 0   step            6362620 non-null  int64  
 1   type            6362620 non-null  object 
 2   amount          6362620 non-null  float64
 3   nameOrig        6362620 non-null  object 
 4   oldbalanceOrg   6362620 non-null  float64
 5   newbalanceOrig  6362620 non-null  float64
 6   nameDest        6362620 non-null  object 
 7   oldbalanceDest  6362620 non-null  float64
 8   newbalanceDest  6362620 non-null  float64
 9   isFraud         6362620 non-null  int64  
 10  isFlaggedFraud  6362620 non-null  int64  
dtypes: float64(5), int64(3), object(3)
memory usage: 534.0+ MB


In [6]:
# Statistics of numerical features
df.describe()

| | step | amount | oldbalanceOrg | newbalanceOrig | oldbalanceDest | newbalanceDest | isFraud | isFlaggedFraud |
|---|---|---|---|---|---|---|---|---|
| count | 6362620.0 | 6362620.0 | 6362620.0 | 6362620.0 | 6362620.0 | 6362620.0 | 6362620.0 | 6362620.0 |
| mean | 243.3972 | 179861.9 | 833883.1 | 855113.7 | 1100702.0 | 1224996.0 | 0.00129082 | 2.514687e-06 |
| std | 142.332 | 603858.2 | 2888243.0 | 2924049.0 | 3399180.0 | 3674129.0 | 0.0359048 | 0.001585775 |
| min | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 156.0 | 13389.57 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 239.0 | 74871.94 | 14208.0 | 0.0 | 132705.7 | 214661.4 | 0.0 | 0.0 |
| 75% | 335.0 | 208721.5 | 107315.2 | 144258.4 | 943036.7 | 1111909.0 | 0.0 | 0.0 |
| max | 743.0 | 92445520.0 | 59585040.0 | 49585040.0 | 356015900.0 | 356179300.0 | 1.0 | 1.0 |


Observations:

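- Step: the mean (243.4) and the median (239) almost coincide, so the bulk of the distribution looks fairly symmetric. However, the maximum (743) lies far above the 75th percentile (335), so the later steps are comparatively sparse.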

- Amount: the amount of money transferred varies enormously and the standard deviation is very high (603,858), more than three times the mean (179,862). 50% of all transactions move less than 74,872, 75% move less than 208,721, and the maximum amount is 92,445,520, so we will find very few but huge transactions (long tail).

- oldbalanceOrg, newbalanceOrig, oldbalanceDest and newbalanceDest: within each pair (origin and destination) the standard deviations and maximum values are similar. It is striking that the minimum, the 25th percentile and, for newbalanceOrig, even the median are 0. That makes me wonder whether those accounts were opened or closed just to commit fraud. Some accounts with significant balances have also been completely emptied.

- isFraud: the mean is very close to zero (about 0.13% of transactions), so fraudulent labels will be rare and the dataset heavily imbalanced.

- isFlaggedFraud: the mean is even lower (about 0.00025% of transactions), but we have to take into account that this flag is only activated for transactions greater than 200,000.
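
The observations above can also be backed by a few numbers. The following is a minimal sketch on a small made-up frame (the values are illustrative, not taken from the dataset); it assumes the column names shown in `df.head()`, and `summarize_numeric` is a hypothetical helper that is not part of this notebook.

```python
import pandas as pd

# Toy data standing in for the real dataset (illustrative values only).
toy = pd.DataFrame({
    "amount": [10.0, 20.0, 30.0, 40.0, 10_000.0],
    "oldbalanceOrg": [0.0, 0.0, 50.0, 0.0, 10_000.0],
    "isFraud": [0, 0, 0, 0, 1],
})


def summarize_numeric(frame: pd.DataFrame) -> dict:
    """Hypothetical helper: a few numbers behind the observations."""
    return {
        # Positive skew indicates a long right tail (few very large values).
        "amount_skew": frame["amount"].skew(),
        # Share of rows whose origin balance is exactly zero.
        "zero_oldbalance_share": (frame["oldbalanceOrg"] == 0).mean(),
        # The mean of a 0/1 label is the fraction of positive cases.
        "fraud_rate": frame["isFraud"].mean(),
    }


summary = summarize_numeric(toy)
print(summary)
```

On the real data, the same call would be `summarize_numeric(df)`; a strongly positive skew and a fraud rate near zero would be consistent with the long tail and the class imbalance noted above.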

In [7]:
# Statistics of categorical features
df.describe(include=['O'])

| | type | nameOrig | nameDest |
|---|---|---|---|
| count | 6362620 | 6362620 | 6362620 |
| unique | 5 | 6353307 | 2722362 |
| top | CASH_OUT | C1530544995 | C1286084959 |
| freq | 2237500 | 3 | 113 |


Observations:

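- Type: there are 5 types of transactions; the most frequent is CASH_OUT (2,237,500 rows). We will analyse them and their relationship with fraudulent transactions.

- nameOrig and nameDest: some names are repeated. nameOrig is almost unique (6,353,307 distinct values, at most 3 occurrences each), while nameDest has only 2,722,362 distinct values and the most frequent one appears 113 times. We will check if that has something to do with fraudulent behaviours.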
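
Both follow-up checks mentioned above could look roughly like the sketch below. It runs on a tiny made-up frame (names and labels are illustrative), assuming the `type`, `nameDest` and `isFraud` columns from the dataset.

```python
import pandas as pd

# Toy data standing in for the real dataset (illustrative values only).
toy = pd.DataFrame({
    "type": ["PAYMENT", "TRANSFER", "CASH_OUT", "CASH_OUT", "TRANSFER"],
    "nameDest": ["M1", "C1", "C1", "C2", "C1"],
    "isFraud": [0, 1, 1, 0, 0],
})

# Fraud rate per transaction type: mean of the 0/1 label within each group.
fraud_rate_by_type = toy.groupby("type")["isFraud"].mean()

# How often each destination name appears; counts above 1 are repeated names.
dest_counts = toy["nameDest"].value_counts()
repeated_dest = dest_counts[dest_counts > 1]

print(fraud_rate_by_type)
print(repeated_dest)
```

Applied to `df`, the first result would show which transaction types carry fraud at all, and the second would list the destination accounts that receive more than one transaction.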

# Clean the data

### Missing values

In [12]:
df.isnull().sum()

step              0
type              0
amount            0
nameOrig          0
oldbalanceOrig    0
newbalanceOrig    0
nameDest          0
oldbalanceDest    0
newbalanceDest    0
isFraud           0
isFlaggedFraud    0
dtype: int64

### Duplicates

In [8]:
print(f'The amount of duplicated values is: {df.duplicated().sum()}')

The amount of duplicated values is: 0


### Fix data types

In [9]:
# Let's transform type to category to optimize memory and speed
df['type'] = df['type'].astype('category')
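
To see why the category dtype helps, the sketch below compares the memory footprint of a low-cardinality string column stored as `object` and as `category`. The column is synthetic, and the label names beyond those visible in `df.head()` are assumptions used only for illustration.

```python
import pandas as pd

# Synthetic column: five labels repeated many times (label names are illustrative).
labels = ["CASH_IN", "CASH_OUT", "DEBIT", "PAYMENT", "TRANSFER"]
as_object = pd.Series(labels * 20_000, dtype="object")
as_category = as_object.astype("category")

# deep=True also counts the Python string objects, not only the pointers.
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"object:   {object_bytes:,} bytes")
print(f"category: {category_bytes:,} bytes")
```

A categorical column stores each distinct label once plus a small integer code per row, which is why the saving grows with the number of rows when there are only a few distinct values.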

### Standardize Column Names
One feature name is spelled inconsistently: `oldbalanceOrg` does not match `newbalanceOrig`.

In [10]:
df = df.rename(columns={'oldbalanceOrg': 'oldbalanceOrig'})
df.head()

| | step | type | amount | nameOrig | oldbalanceOrig | newbalanceOrig | nameDest | oldbalanceDest | newbalanceDest | isFraud | isFlaggedFraud |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | PAYMENT | 9839.64 | C1231006815 | 170136.0 | 160296.36 | M1979787155 | 0.0 | 0.0 | 0 | 0 |
| 1 | 1 | PAYMENT | 1864.28 | C1666544295 | 21249.0 | 19384.72 | M2044282225 | 0.0 | 0.0 | 0 | 0 |
| 2 | 1 | TRANSFER | 181.0 | C1305486145 | 181.0 | 0.0 | C553264065 | 0.0 | 0.0 | 1 | 0 |
| 3 | 1 | CASH_OUT | 181.0 | C840083671 | 181.0 | 0.0 | C38997010 | 21182.0 | 0.0 | 1 | 0 |
| 4 | 1 | PAYMENT | 11668.14 | C2048537720 | 41554.0 | 29885.86 | M1230701703 | 0.0 | 0.0 | 0 | 0 |


We won't use the `isFlaggedFraud` column for now, although there is a basic analysis of it in the var_isflagggedfraud notebook.

In [11]:
# Keep every column except isFlaggedFraud and save the cleaned dataset
data = df[['step', 'type', 'amount', 'nameOrig', 'oldbalanceOrig', 'newbalanceOrig',
           'nameDest', 'oldbalanceDest', 'newbalanceDest', 'isFraud']].copy()

data.to_csv('../data/processed/df_fraud.csv', index=False)
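
One thing to keep in mind when reloading the processed file: CSV does not store dtypes, so the `category` conversion applied earlier is lost on disk. The sketch below shows this on a tiny made-up frame written to an in-memory buffer (no real file paths are used) and how the `dtype` argument of `pd.read_csv` can restore it.

```python
import io

import pandas as pd

# Toy frame with a categorical column (illustrative values only).
toy = pd.DataFrame({
    "type": pd.Series(["PAYMENT", "TRANSFER"], dtype="category"),
    "amount": [9839.64, 181.0],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Plain read: the category dtype is not preserved by the CSV format.
buffer.seek(0)
plain = pd.read_csv(buffer)

# Read with an explicit dtype to get the categorical column back.
buffer.seek(0)
restored = pd.read_csv(buffer, dtype={"type": "category"})

print(plain["type"].dtype)
print(restored["type"].dtype)
```

A later notebook that reads `df_fraud.csv` could pass the same `dtype` mapping, or convert the column again with `astype('category')`.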