# Mobile Money Transaction Analysis - Data Preprocessing
- This notebook walks through the data preprocessing steps applied to the mobile money transactions dataset in preparation for a user clustering task.

## Required Libraries

In [1]:
import pandas as pd
import numpy as np

## Loading the Dataset


In [7]:
FILE_PATH = "transactions.csv"
data = pd.read_csv(FILE_PATH)
data.head()

| | step | transactionType | amount | initiator | oldBalInitiator | newBalInitiator | recipient | oldBalRecipient | newBalRecipient | isFraud |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | TRANSFER | 19824.96 | 4537027967639631 | 187712.18 | 167887.22 | 4875702729424478 | 8.31 | 19833.27 | 1 |
| 1 | 0 | PAYMENT | 598.97 | 4296267625767470 | 8.92 | 8.92 | 25-0000401 | 0.0 | 0.0 | 0 |
| 2 | 0 | PAYMENT | 545.85 | 4178224023847746 | 93.6 | -452.25 | 13-0001587 | 0.0 | 545.85 | 0 |
| 3 | 0 | TRANSFER | 19847.01 | 4178224023847746 | -452.25 | -20299.26 | 4096920916696293 | 4011.72 | 23858.74 | 1 |
| 4 | 0 | PAYMENT | 546.89 | 4779013371563747 | 159148.76 | 158601.88 | 75-0003564 | 0.0 | 546.89 | 0 |


## Dataset Shape

In [8]:
data.shape

(1720181, 10)

## Dataset Summary

In [9]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1720181 entries, 0 to 1720180
Data columns (total 10 columns):
 #   Column           Dtype  
---  ------           -----  
 0   step             int64  
 1   transactionType  object 
 2   amount           float64
 3   initiator        int64  
 4   oldBalInitiator  float64
 5   newBalInitiator  float64
 6   recipient        object 
 7   oldBalRecipient  float64
 8   newBalRecipient  float64
 9   isFraud          int64  
dtypes: float64(5), int64(3), object(2)
memory usage: 131.2+ MB


### Descriptive Statistics

In [10]:
data.describe()

| | step | amount | initiator | oldBalInitiator | newBalInitiator | oldBalRecipient | newBalRecipient | isFraud |
|---|---|---|---|---|---|---|---|---|
| count | 1720181.0 | 1720181.0 | 1720181.0 | 1720181.0 | 1720181.0 | 1720181.0 | 1720181.0 | 1720181.0 |
| mean | 65.55529 | 52538.68 | 4499952000000000.0 | 2433758.0 | 2443880.0 | 108508.3 | 122277.2 | 0.1020346 |
| std | 44.67368 | 88356.5 | 289635100000000.0 | 1307615.0 | 1297181.0 | 283013.8 | 319227.7 | 0.3026939 |
| min | 0.0 | 0.24 | 4000062000000000.0 | -199997.1 | -199997.1 | -198368.5 | -135728.0 | 0.0 |
| 25% | 23.0 | 606.46 | 4248762000000000.0 | 1577186.0 | 1600496.0 | 16064.23 | 24962.33 | 0.0 |
| 50% | 54.0 | 17298.25 | 4508521000000000.0 | 2619827.0 | 2625680.0 | 63130.18 | 74481.61 | 0.0 |
| 75% | 106.0 | 71161.49 | 4750928000000000.0 | 3361338.0 | 3361872.0 | 137382.2 | 143170.2 | 0.0 |
| max | 143.0 | 2142928.0 | 4999855000000000.0 | 12244690.0 | 12244690.0 | 11885540.0 | 12066210.0 | 1.0 |


## Dropping Duplicates

In [11]:
data = data.drop_duplicates()
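
The cell above drops duplicates without reporting how many rows were affected. A minimal sketch of how that count could be checked is shown here on a small made-up frame (`toy`); the column names mirror the dataset, but the values and the variable names are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy frame with illustrative values (not taken from the real dataset)
toy = pd.DataFrame({
    "step": [0, 0, 0, 1],
    "transactionType": ["PAYMENT", "PAYMENT", "TRANSFER", "PAYMENT"],
    "amount": [598.97, 598.97, 19824.96, 598.97],
})

rows_before = len(toy)

# duplicated() flags each row that exactly repeats an earlier row
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each repeated row
deduped = toy.drop_duplicates()
rows_after = len(deduped)

print(f"removed {rows_before - rows_after} duplicate rows")
```

Only rows that match on every column count as duplicates, so the last row in the toy frame is kept because its `step` differs.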

## Checking for Null Values in Dataset

In [6]:
data.isnull().sum()

step               0
transactionType    0
amount             0
initiator          0
oldBalInitiator    0
newBalInitiator    0
recipient          0
oldBalRecipient    0
newBalRecipient    0
isFraud            0
dtype: int64

- No null values were found, so no missing-value handling is needed.
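
If the per-column listing becomes long, the same check can be narrowed to only the columns that contain nulls. The sketch below assumes a small made-up frame (`toy`) with one deliberately missing value; the frame and variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing amount (illustrative values only)
toy = pd.DataFrame({
    "amount": [598.97, np.nan, 545.85],
    "isFraud": [0, 0, 1],
})

# Count nulls per column, then keep only the columns that have any
null_counts = toy.isnull().sum()
columns_with_nulls = null_counts[null_counts > 0]

print(columns_with_nulls)
```

An empty result from the filter corresponds to the "no null values" case seen for this dataset.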

## Removing Outliers
- Records whose `amount` falls below the 1st percentile or above the 99th percentile are removed. This trims the extreme values and reduces the dataset size.

In [13]:
# 1st and 99th percentiles of the transaction amount
q_low = data['amount'].quantile(0.01)
q_high = data['amount'].quantile(0.99)

# Keep only rows whose amount lies within [q_low, q_high] (inclusive)
data = data[(data['amount'] >= q_low) & (data['amount'] <= q_high)]
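
The same percentile trimming can be wrapped in a small helper so that the bounds are easy to change. This is a sketch, not the notebook's own code: the function name `trim_by_quantile` and the toy frame are assumptions introduced for illustration. It relies on `Series.between`, which is inclusive on both ends by default and so matches the `>=` / `<=` comparison used above.

```python
import pandas as pd

def trim_by_quantile(df, column, lower=0.01, upper=0.99):
    """Keep rows whose `column` value lies within the [lower, upper] quantile range."""
    q_low = df[column].quantile(lower)
    q_high = df[column].quantile(upper)
    # between() is inclusive on both bounds by default
    return df[df[column].between(q_low, q_high)]

# Toy frame: amounts 1.0, 2.0, ..., 101.0 (illustrative values only)
toy = pd.DataFrame({"amount": [float(x) for x in range(1, 102)]})
trimmed = trim_by_quantile(toy, "amount")

print(len(toy), "->", len(trimmed))
```

Because both bounds are inclusive, rows that sit exactly on a percentile boundary are kept rather than dropped.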

### Shape after Filtering Outliers

In [14]:
data.shape

(1685998, 10)
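
As a quick sanity check, the two shapes reported in this notebook can be compared to see how many rows the deduplication and outlier filtering removed together. The arithmetic below uses only the row counts shown in the outputs; the variable names are illustrative.

```python
rows_loaded = 1_720_181  # row count from data.shape right after loading
rows_final = 1_685_998   # row count from data.shape after outlier filtering

rows_removed = rows_loaded - rows_final
share_removed = rows_removed / rows_loaded

print(f"{rows_removed} rows removed ({share_removed:.2%})")
```

The combined reduction is just under 2% of the loaded rows, which is in line with trimming roughly 1% from each end of the `amount` distribution.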