# Exploratory Data Analysis and Visualization of Credit Card Transactions

**Introduction**
This notebook explores a credit card transactions dataset: customer transaction patterns, the distribution of transaction amounts, and spending by gender, category, and month. The analysis is intended as groundwork for fraud detection and for understanding customer behavior.

**Objective**
The primary objective of this notebook is to conduct an in-depth exploratory data analysis (EDA) of credit card transactions, using Python and data analysis libraries.

# Let's import all our Gurujis (libraries) as usual:


In [34]:
# Suppress warnings to keep the output readable
import warnings
warnings.filterwarnings("ignore")

# Data manipulation libraries
import pandas as pd
import numpy as np

# Data visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from datetime import datetime
# Machine learning libraries
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report

# Import the dataset file
dataframe = pd.read_csv('/kaggle/input/comprehensive-credit-card-transactions-dataset/credit_card_transaction_flow.csv')

print('Import Successful')

Import Successful


# Data Exploration:
* Print basic information about the dataset, including data types, null values, and memory usage.
* Display the first few rows to get a glimpse of the data.

In [35]:
# Dataset Overview
print("Dataset Overview:")
print(dataframe.info())

# Display the first few rows
print("\nFirst few rows:")
print(dataframe.head())

Dataset Overview:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Customer ID         50000 non-null  int64  
 1   Name                50000 non-null  object 
 2   Surname             50000 non-null  object 
 3   Gender              44953 non-null  object 
 4   Birthdate           50000 non-null  object 
 5   Transaction Amount  50000 non-null  float64
 6   Date                50000 non-null  object 
 7   Merchant Name       50000 non-null  object 
 8   Category            50000 non-null  object 
dtypes: float64(1), int64(1), object(7)
memory usage: 3.4+ MB
None

First few rows:
   Customer ID      Name    Surname Gender   Birthdate  Transaction Amount  \
0       752858      Sean  Rodriguez      F  20-10-2002               35.47   
1        26381  Michelle     Phelps    NaN  24-10-1985             2552.72   
2       305449   

# Statistical Analysis:
Calculate and display summary statistics for numerical columns, including count, mean, standard deviation, minimum, and maximum values.

In [36]:
print("\nSummary Statistics:")
print(dataframe.describe())


Summary Statistics:
        Customer ID  Transaction Amount
count   50000.00000        50000.000000
mean   500136.79696          442.119239
std    288232.43164          631.669724
min        29.00000            5.010000
25%    251191.50000           79.007500
50%    499520.50000          182.195000
75%    749854.25000          470.515000
max    999997.00000         2999.880000
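
The summary above shows a mean transaction amount (about 442) well above the median (about 182), which usually points to a right-skewed distribution. The sketch below shows one way to quantify that and to check whether a log transform reduces it. The `amounts` values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up amounts with one large value, mimicking a long right tail
amounts = pd.Series([5.0, 20.0, 80.0, 180.0, 470.0, 3000.0])

# A mean far above the median is a quick hint of right skew
print("mean:", amounts.mean(), "median:", amounts.median())

# Sample skewness before and after a log(1 + x) transform
skewness = amounts.skew()
log_skewness = np.log1p(amounts).skew()
print("skew:", skewness, "skew after log1p:", log_skewness)
```

On the real data, the same two calls could be applied to `dataframe['Transaction Amount']`.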


# Handling Missing Values:
Checking for missing values and determining how to handle them, either by imputation or removal.

In [37]:
print("\nMissing Values")
print(dataframe.isnull().sum())


Missing Values
Customer ID              0
Name                     0
Surname                  0
Gender                5047
Birthdate                0
Transaction Amount       0
Date                     0
Merchant Name            0
Category                 0
dtype: int64
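
With 5047 of 50000 `Gender` values missing (roughly 10%), it helps to look at the share of missing values and at the two simplest options before picking a strategy. The following is a minimal sketch on a tiny made-up frame; the name `sample` and the `'Unknown'` label are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample standing in for the real dataset
sample = pd.DataFrame({
    "Gender": ["F", np.nan, "M", "M", np.nan],
    "Transaction Amount": [35.47, 2552.72, 115.97, 11.31, 62.21],
})

# Share of missing values per column, in percent
missing_pct = sample.isnull().mean() * 100
print(missing_pct)

# Option 1: drop the rows where 'Gender' is missing
dropped = sample.dropna(subset=["Gender"])

# Option 2: keep every row and label the gap explicitly
labelled = sample.assign(Gender=sample["Gender"].fillna("Unknown"))
print(labelled["Gender"].value_counts())
```

Option 2 keeps all transactions for amount-based analysis while making it obvious which rows have no recorded gender.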


In [38]:
# Let's handle the missing values in 'Gender', since later steps need complete data.
# There are several ways to do this: we could drop the rows with null values, or mark them as unknown with a placeholder value.
# But that would be a poor decision for a data scientist. I have one more idea, which makes use of the transaction amount.
# Imputing missing values based on the 'Transaction Amount' column can be an option, but it should be done with caution as it makes assumptions about gender based on financial behavior, which may not always be accurate.
# Let's start:
# 1. Calculate the average transaction amount for each gender.
# 2. For rows with a missing 'Gender' value, assign the gender whose average transaction amount is nearest. For example, if a transaction amount falls closer to the average for 'M' transactions, assign 'M'.

# Calculate the average transaction amount for 'Male' and 'Female'
average_transaction_amount_by_gender = dataframe.groupby('Gender')['Transaction Amount'].mean()
print("average_transaction_amount_by_gender")
print(average_transaction_amount_by_gender)

# Impute missing 'Gender' based on transaction amount.
# The filter selects only the rows where 'Gender' is null; iterrows() yields each row with its index.
for index, row in dataframe[dataframe['Gender'].isnull()].iterrows():
    transaction_amount = row['Transaction Amount']  # amount for this row

    # Calculate the absolute difference between transaction amount and average for both genders
    difference_male = abs(transaction_amount - average_transaction_amount_by_gender['M'])
    difference_female = abs(transaction_amount - average_transaction_amount_by_gender['F'])

    # Assign 'M' or 'F' based on the closest average (ties go to 'F')
    if difference_male < difference_female:
        dataframe.at[index, 'Gender'] = 'M'
    else:
        dataframe.at[index, 'Gender'] = 'F'


# Validate that the imputation worked
print("\nMissing Values")
print(dataframe.isnull().sum())

average_transaction_amount_by_gender
Gender
F    445.521078
M    440.417393
Name: Transaction Amount, dtype: float64

Missing Values
Customer ID           0
Name                  0
Surname               0
Gender                0
Birthdate             0
Transaction Amount    0
Date                  0
Merchant Name         0
Category              0
dtype: int64
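
The nearest-average rule used above reduces to a single cutoff at the midpoint of the two group means. With the means printed above (445.52 for F and 440.42 for M), that midpoint is about 442.97, so every missing row with an amount below it is labelled 'M'. Since the median amount is about 182, this would assign most missing rows to 'M' if their amounts are distributed like the rest of the data. The sketch below reproduces the same rule in vectorized form on made-up numbers so the effect is easy to inspect; `sample` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up data: group means are F = 450 and M = 250, so the cutoff is 350
sample = pd.DataFrame({
    "Gender": ["F", "M", "F", "M", np.nan, np.nan, np.nan],
    "Transaction Amount": [500.0, 300.0, 400.0, 200.0, 50.0, 120.0, 900.0],
})

means = sample.groupby("Gender")["Transaction Amount"].mean()
cutoff = (means["M"] + means["F"]) / 2

missing = sample["Gender"].isna()
amounts = sample.loc[missing, "Transaction Amount"]

# Strictly closer to the male mean -> 'M', otherwise 'F' (ties go to 'F', as in the loop)
closer_to_male = (amounts - means["M"]).abs() < (amounts - means["F"]).abs()
sample.loc[missing, "Gender"] = np.where(closer_to_male, "M", "F")

print("cutoff:", cutoff)
print(sample)
```

Comparing `value_counts()` of `Gender` before and after imputation on the real data is a quick way to see how far this rule shifts the gender split.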


# Data Preprocessing:
* Converting date columns to the correct datetime format.
* Ensuring proper data types for columns (e.g., categorical columns should be of 'category' data type).

In [39]:
# Calculate age from 'Birthdate' and add it as a new column
current_date = datetime.now()
dataframe['Age'] = current_date.year - dataframe['Birthdate'].dt.year

AttributeError: Can only use .dt accessor with datetimelike values

In [None]:
# As the error shows, 'Date' and 'Birthdate' are stored as strings rather than datetimes. Let's convert them to a proper datetime format.

# Correcting the date format for 'Date' and 'Birthdate' columns
dataframe['Date'] = pd.to_datetime(dataframe['Date'], format='%d-%m-%Y')
dataframe['Birthdate'] = pd.to_datetime(dataframe['Birthdate'], format='%d-%m-%Y')

# Now that 'Birthdate' is a datetime, the age calculation works
dataframe['Age'] = datetime.now().year - dataframe['Birthdate'].dt.year

# Validate that the conversion worked
print(dataframe.dtypes)
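
Subtracting calendar years alone overstates age by one for anyone whose birthday has not yet occurred in the current year. The sketch below shows one possible refinement, assuming the same `%d-%m-%Y` format as above. The fixed `reference` date is an assumption chosen only to make the example reproducible.

```python
import pandas as pd

# Birthdates in the same day-month-year format used by the dataset
birthdates = pd.to_datetime(
    pd.Series(["20-10-2002", "24-10-1985", "01-01-2000"]), format="%d-%m-%Y"
)

# Hypothetical fixed reference date (use pd.Timestamp.now() for the current date)
reference = pd.Timestamp("2023-10-21")

# True where the birthday has already occurred in the reference year
had_birthday = (birthdates.dt.month < reference.month) | (
    (birthdates.dt.month == reference.month) & (birthdates.dt.day <= reference.day)
)

# Subtract one year where the birthday is still to come
age = reference.year - birthdates.dt.year - (~had_birthday).astype(int)
print(age.tolist())
```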

# EDA of Credit Card Transactions

In [None]:
# Visualize transaction amount distribution
plt.figure(figsize=(10,6))
sns.histplot(data=dataframe, x='Transaction Amount', kde=True)
plt.title("Transaction Amount Distribution")
plt.show()

In [None]:
# Now, let's see who uses the credit card more frequently: male or female customers.
# Gender distribution
gender_counts = dataframe['Gender'].value_counts()
print(gender_counts)

plt.figure(figsize=(8,8))
plt.pie(gender_counts,labels=gender_counts.index,autopct='%1.2f%%',startangle=140,colors=('purple','green'))
plt.title("Transaction count based on Gender Distribution")
plt.show()

In [None]:
# We can observe that men have more transactions than women. Let's break this down by category to get more insight.
# Let's look at where men transact more and where women transact more.
# We will use plotly for the visualization.

dataframe_plotly = dataframe.groupby(['Category','Gender']).size().reset_index(name='Count')

# Creating an interactive plot
fig = px.bar(dataframe_plotly, x='Category', y='Count', color='Gender', barmode='group',
             labels={'Count': 'Transaction Count'},
             title="Transaction Breakdown by Category with Gender Distribution",
             text='Count')  # Display the count as a label on each bar

# Place the text labels outside the bars
fig.update_traces(texttemplate='%{text}', textposition='outside')

#show plot
fig.show()

We can observe that for the Electronics and Travel categories, women have more transactions than men. Surprisingly, men are ahead in Clothing, Cosmetics, Market (groceries) and Restaurant transactions. Is it because men always love to spend on their partners? 🤣 (just a joke)

In the Electronics category, though, the transaction counts are approximately equal.
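
Raw counts can be hard to compare across categories of different sizes. As a rough sketch, `pd.crosstab` can produce both the counts and the within-category shares in one step. The `sample` frame below is made up for illustration.

```python
import pandas as pd

# Made-up transactions, one row per transaction
sample = pd.DataFrame({
    "Category": ["Travel", "Travel", "Travel", "Market", "Market", "Market", "Market"],
    "Gender": ["F", "F", "M", "M", "M", "F", "M"],
})

# Transaction counts per category and gender
counts = pd.crosstab(sample["Category"], sample["Gender"])

# Share of each gender within a category (each row sums to 1)
shares = pd.crosstab(sample["Category"], sample["Gender"], normalize="index")

print(counts)
print(shares.round(2))
```

A share close to 0.5 for both genders corresponds to the "approximately equal" case described above.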

Let's see how much each gender spends in each category.

In [None]:
# Create a DataFrame with category, gender, and transaction amount data
dataframe_transactions_plotly = dataframe.groupby(['Category', 'Gender'])['Transaction Amount'].sum().reset_index()
dataframe_transactions_plotly['Count'] = dataframe.groupby(['Category', 'Gender']).size().values

# Creating the interactive bar plot: bar height is the transaction count,
# and the label on each bar is the total transaction amount
fig = px.bar(dataframe_transactions_plotly, x='Category', y='Count', color='Gender', barmode='group',
             labels={'Transaction Amount': 'Total Transaction Amount'},
             title="Transaction Amount Breakdown by Category with Transaction Count",
             text='Transaction Amount')  # Display the total amount on each bar

# Place the text labels outside the bars
fig.update_traces(texttemplate='%{text}', textposition='outside')

# Show the interactive plot
fig.show()


> We can see that both genders spend the most on Travel: 7.2M (female) and 5.6M (male).
> Also, for Clothing, Cosmetics, Market and Restaurant, men lead in both transaction count and amount.
> So, if we imagine every customer as happily married, we could say that men pay for their partners' daily needs so that they get a tour package as a gift in return and can flex in front of their friends 🤣🤣🤣🤣 (just a joke).
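
Totals and counts together hide the typical size of a transaction. One way to see all three at once is named aggregation on the grouped amounts. This is a sketch on made-up values, and the output column names `total`, `count` and `average` are chosen here for illustration.

```python
import pandas as pd

# Made-up transactions
sample = pd.DataFrame({
    "Category": ["Travel", "Travel", "Market", "Market", "Market"],
    "Gender": ["F", "M", "M", "M", "F"],
    "Transaction Amount": [1200.0, 800.0, 40.0, 60.0, 30.0],
})

# Total, number of transactions, and average amount per category and gender
summary = (
    sample.groupby(["Category", "Gender"])["Transaction Amount"]
    .agg(total="sum", count="count", average="mean")
    .reset_index()
)
print(summary)
```

A group can lead on total amount through many small transactions or a few large ones, and the `average` column distinguishes the two.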

In [None]:
# Let's see how transactions vary by month. As the dataset covers only a single year, we will visualize by month.

# Derive date-part columns to use in the visualization
dataframe['Year'] = dataframe['Date'].dt.year
dataframe['Month'] = dataframe['Date'].dt.month
dataframe['Day'] = dataframe['Date'].dt.day

# Total transaction amount by month
monthly_transaction_amount = dataframe.groupby('Month')['Transaction Amount'].sum().reset_index()

fig = px.line(monthly_transaction_amount, x='Month', y='Transaction Amount', title="Total Transaction Amount by Month")
fig.update_traces(mode='lines+markers', hovertext=monthly_transaction_amount['Transaction Amount'])

# Customize hover information
fig.update_layout(hovermode='x unified', hoverlabel=dict(namelength=0))

fig.show()

> As we can see, there is a sharp drop between August and October, which means people start saving for Christmas 🤣🤣.
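
Before reading much into a drop at the end of the series, it may be worth checking the date range, since a partially covered final month would also look like a decline. The sketch below computes monthly totals, the month-over-month change, and the first and last dates, all on made-up data (the dates and amounts are hypothetical).

```python
import pandas as pd

# Made-up transactions in day-month-year format
sample = pd.DataFrame({
    "Date": pd.to_datetime(
        ["05-08-2023", "17-08-2023", "02-09-2023", "11-10-2023"], format="%d-%m-%Y"
    ),
    "Transaction Amount": [100.0, 300.0, 200.0, 100.0],
})

# Total amount per calendar month
monthly = sample.groupby(sample["Date"].dt.month)["Transaction Amount"].sum()

# Month-over-month change in percent (the first month has no predecessor, so it is NaN)
pct_change = monthly.pct_change() * 100

# Coverage check: a last date early in its month means that month is incomplete
first_date, last_date = sample["Date"].min(), sample["Date"].max()

print(monthly)
print(pct_change)
print(first_date.date(), "to", last_date.date())
```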

In [None]:
plt.figure(figsize=(10, 6))
sns.scatterplot(data=dataframe, x='Age', y='Transaction Amount', hue='Gender', palette='nipy_spectral', alpha=0.7)
plt.title("Age vs. Transaction Amount")
plt.xlabel("Age")
plt.ylabel("Transaction Amount")
plt.legend(title="Gender")
plt.show()
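
A scatter plot with tens of thousands of points can be hard to read, so grouping ages into bands is a common complement. The following sketch uses `pd.cut` on made-up values. The bin edges and labels are assumptions, not taken from the notebook.

```python
import pandas as pd

# Made-up ages and amounts
sample = pd.DataFrame({
    "Age": [19, 23, 37, 45, 62, 71],
    "Transaction Amount": [35.0, 120.0, 2500.0, 80.0, 400.0, 60.0],
})

# Hypothetical age bands; right=False makes each band include its lower edge
bins = [18, 30, 45, 60, 100]
labels = ["18-29", "30-44", "45-59", "60+"]
sample["Age Group"] = pd.cut(sample["Age"], bins=bins, labels=labels, right=False)

# Number of transactions and median amount per age band
by_group = sample.groupby("Age Group", observed=True)["Transaction Amount"].agg(
    ["count", "median"]
)
print(by_group)
```

The median is used here instead of the mean because the amounts are skewed, so a few large transactions would otherwise dominate a band.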