<a href="https://colab.research.google.com/github/Sagarjain93/Operations_transaction_Data/blob/main/transaction_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**Project Title - Operational Transaction Data Analysis**

#**1. Introduction**

This project focuses on the exploratory analysis of operational transaction data to uncover behavioral patterns, potential fraud indicators, and performance bottlenecks. The dataset comprises detailed records of digital transactions including user identifiers, transaction metadata, network characteristics (like latency and bandwidth), device information, and fraud labels. Given the integration of both financial and network-level parameters, this dataset offers a unique opportunity to analyze how operational conditions impact financial transactions, and whether they correlate with suspicious or failed activities. The aim is to derive actionable insights that can support anomaly detection, fraud prevention, and system optimization.

*Potential Hypotheses to Explore*

**Fraud Detection Hypotheses**

1. Transactions with unusually high amounts have a higher chance of being fraudulent.

2. Fraudulent transactions are more likely to be initiated from mobile or unknown devices.

3. Transactions with higher latency or low bandwidth are more prone to fraud.

**Behavioral/Usage Patterns**

1. Users typically interact with a fixed set of counterparties (receiver accounts).

2. Certain PIN codes (locations) show higher transaction volumes or fraud rates.

**Network Impact Hypotheses**

1. Higher network latency is associated with increased transaction failures.

2. Specific Network Slice IDs are more commonly associated with failed or delayed transactions.

**Temporal Trends**

1. Fraudulent activity peaks during specific hours of the day or days of the week.

2. Transaction volume is higher during business hours and lower on weekends.

**Geolocation Hypotheses**

1. Certain geolocations (based on latitude/longitude or PIN code) are hotspots for fraud.

2. Distance between sender and receiver accounts correlates with fraud probability.
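
The distance-based hypothesis needs a way to measure how far apart two coordinate pairs are. The sketch below uses the haversine great-circle formula. `haversine_km` is a hypothetical helper, not part of the original notebook, and it assumes a second coordinate pair (for example, a receiver location) is available. The dataset described here records only one geolocation per transaction, so that second location would have to come from elsewhere.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Illustrative coordinates (roughly Paris and London), chosen for the example
distance = haversine_km(48.8566, 2.3522, 51.5074, -0.1278)
print(f"{distance:.0f} km")
```

With such a distance column in place, the hypothesis could be checked by comparing the distance distribution of fraudulent and genuine transactions.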

#**2. Data Description**

This dataset contains transactional records from an operational system, capturing various attributes related to digital money transfers.

**Transaction ID:** Unique identifier for each transaction.

**Sender Account ID:** ID of the account initiating the transaction.

**Receiver Account ID:** ID of the account receiving the funds.

**Transaction Amount:** Monetary value involved in the transaction.

**Transaction Type:** Type/category of the transaction (e.g., transfer, withdrawal, deposit).

**Timestamp:** Date and time when the transaction occurred.

**Transaction Status:** Outcome of the transaction. The dataset contains two values: Success and Failed.

**Fraud Flag:** Boolean indicator (True = fraudulent, False = genuine).

**Geolocation (Latitude/Longitude):** Coordinates of the transaction origin.

**Device Used:** Device type used to perform the transaction (e.g., mobile, desktop).

**Network Slice ID:** ID of the network slice allocated during the transaction (5G context).

**Latency (ms):** Network latency experienced during the transaction.

**Slice Bandwidth (Mbps):** Bandwidth available via the assigned network slice.

**PIN Code:** Postal code of the user’s location during the transaction.
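
The column list above can double as a lightweight schema check before any analysis runs. The following is a minimal sketch, assuming the column names match the description exactly. `EXPECTED_COLUMNS` and `missing_columns` are hypothetical names introduced for illustration.

```python
import pandas as pd

EXPECTED_COLUMNS = [
    "Transaction ID", "Sender Account ID", "Receiver Account ID",
    "Transaction Amount", "Transaction Type", "Timestamp",
    "Transaction Status", "Fraud Flag", "Geolocation (Latitude/Longitude)",
    "Device Used", "Network Slice ID", "Latency (ms)",
    "Slice Bandwidth (Mbps)", "PIN Code",
]


def missing_columns(frame, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from the frame."""
    return [name for name in expected if name not in frame.columns]


# Hypothetical frame that lacks the last column, to show a failing check
incomplete = pd.DataFrame(columns=EXPECTED_COLUMNS[:-1])
print(missing_columns(incomplete))
```

An empty list means every described column is present; anything else names the columns to investigate before continuing.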

#**3. Import Libraries**

In [2]:
#Data Manipulation Libraries
import pandas as pd
import numpy as np

#Data Visualization Libraries
import matplotlib.pyplot as plt
import seaborn as sns

#Set Consistent theme for all plots
sns.set_theme(style="whitegrid")

In [5]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


#**4. Load Dataset**


To begin the analysis, the transaction dataset is loaded from Google Drive. Mounting the drive in Google Colab exposes the CSV file at a regular file path, so it can be read with `pd.read_csv` like any local file. This keeps the data accessible in collaborative, cloud-based environments.

In [6]:
df = pd.read_csv('/content/drive/MyDrive/colab/eda/8. Operations/transaction_data.csv')
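
`pd.read_csv` can also set column types at load time. The sketch below is one possible variant, not the notebook's own call: it reads two sample rows in the dataset's layout from an in-memory string (so it runs without Drive access), parses `Timestamp` as a datetime, and keeps `PIN Code` as text on the assumption that postal codes are labels rather than quantities.

```python
import io

import pandas as pd

# Two sample rows in the dataset's layout, reduced to four columns
SAMPLE_CSV = '''Transaction ID,Timestamp,Geolocation (Latitude/Longitude),PIN Code
TXN9520068950,2025-01-17 10:14:00,"34.0522 N, -74.006 W",3075
TXN9412011085,2025-01-17 10:51:00,"35.6895 N, -118.2437 W",2369
'''

sample = pd.read_csv(
    io.StringIO(SAMPLE_CSV),
    parse_dates=["Timestamp"],  # parse timestamps while loading
    dtype={"PIN Code": str},    # treat postal codes as labels, not numbers
)
print(sample.dtypes)
```

The same two keyword arguments could be passed when reading the real file path.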

In [7]:
df.columns


Index(['Transaction ID', 'Sender Account ID', 'Receiver Account ID',
       'Transaction Amount', 'Transaction Type', 'Timestamp',
       'Transaction Status', 'Fraud Flag', 'Geolocation (Latitude/Longitude)',
       'Device Used', 'Network Slice ID', 'Latency (ms)',
       'Slice Bandwidth (Mbps)', 'PIN Code'],
      dtype='object')

#**5. Initial Data Inspection**

To gain a foundational understanding of the dataset, we begin with an initial inspection. We preview the first few records to see the structure and values, check the dataset's shape and completeness, and compare the data type of each feature against expectations. We then generate statistical summaries for the numerical and categorical features to understand their distributions, spot potential anomalies, and guide the next steps of the analysis.

###**5.1 Preview the First Few Records**

In [8]:
# Show First 5 rows of the data set
df.head()

| | Transaction ID | Sender Account ID | Receiver Account ID | Transaction Amount | Transaction Type | Timestamp | Transaction Status | Fraud Flag | Geolocation (Latitude/Longitude) | Device Used | Network Slice ID | Latency (ms) | Slice Bandwidth (Mbps) | PIN Code |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | TXN9520068950 | ACC14994 | ACC16656 | 495.9 | Deposit | 2025-01-17 10:14:00 | Failed | True | 34.0522 N, -74.006 W | Desktop | Slice3 | 10 | 179 | 3075 |
| 1 | TXN9412011085 | ACC58958 | ACC32826 | 529.62 | Withdrawal | 2025-01-17 10:51:00 | Success | False | 35.6895 N, -118.2437 W | Mobile | Slice2 | 11 | 89 | 2369 |
| 2 | TXN4407425052 | ACC56321 | ACC92481 | 862.47 | Withdrawal | 2025-01-17 10:50:00 | Failed | False | 48.8566 N, 2.3522 W | Mobile | Slice1 | 4 | 53 | 8039 |
| 3 | TXN2214150284 | ACC48650 | ACC76457 | 1129.88 | Transfer | 2025-01-17 10:56:00 | Success | True | 34.0522 N, -74.006 W | Mobile | Slice3 | 10 | 127 | 6374 |
| 4 | TXN4247571145 | ACC60921 | ACC11419 | 933.24 | Deposit | 2025-01-17 10:25:00 | Success | True | 55.7558 N, 37.6173 W | Mobile | Slice3 | 20 | 191 | 8375 |
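
The preview shows that the geolocation is stored as a single string such as `34.0522 N, -74.006 W`, so it has to be split into numeric latitude and longitude before any spatial analysis. The sketch below is one possible parser. `parse_geolocation` is a hypothetical helper, and it assumes the sign of each number carries the hemisphere while the trailing letters are fixed labels, since every previewed longitude ends in `W` regardless of its sign.

```python
import re

# Matches a signed decimal number followed by a hemisphere letter, e.g. "-74.006 W"
_COORD = re.compile(r"(-?\d+(?:\.\d+)?)\s*[NSEW]")


def parse_geolocation(text):
    """Return (latitude, longitude) as floats, keeping the sign written in the text."""
    matches = _COORD.findall(text)
    if len(matches) != 2:
        raise ValueError(f"unexpected geolocation format: {text!r}")
    latitude, longitude = (float(value) for value in matches)
    return latitude, longitude


print(parse_geolocation("34.0522 N, -74.006 W"))
```

Applied to the whole column, for example with `df['Geolocation (Latitude/Longitude)'].map(parse_geolocation)`, this would yield tuples that can be expanded into two numeric columns.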


###**5.2 Check the Dataset Shape**

In [10]:
df.shape

(1000, 14)

**Interpretation** - *The Dataset has 1000 rows and 14 columns*

### **5.3 Dataset Summary Overview**

Check for missing values and data types of each column.

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 14 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   Transaction ID                    1000 non-null   object 
 1   Sender Account ID                 1000 non-null   object 
 2   Receiver Account ID               1000 non-null   object 
 3   Transaction Amount                1000 non-null   float64
 4   Transaction Type                  1000 non-null   object 
 5   Timestamp                         1000 non-null   object 
 6   Transaction Status                1000 non-null   object 
 7   Fraud Flag                        1000 non-null   bool   
 8   Geolocation (Latitude/Longitude)  1000 non-null   object 
 9   Device Used                       1000 non-null   object 
 10  Network Slice ID                  1000 non-null   object 
 11  Latency (ms)                      1000 non-null   int64  
 12  Slice B

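**Interpretation:** All columns have 1000 non-null values, so there is no missing data to handle before analysis. The numeric types look correct: `Transaction Amount` is a float, `Latency (ms)`, `Slice Bandwidth (Mbps)`, and `PIN Code` are integers, and `Fraud Flag` is boolean. The categorical columns are stored as `object`. `Timestamp` is also stored as `object`, so it needs converting to a datetime type before any temporal analysis.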
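
Because `Timestamp` is loaded as text, the temporal hypotheses need a conversion step first. The sketch below shows one way to do it on a toy frame that uses the same timestamp format as the dataset. The `toy` frame and the `hour` and `day_name` columns are illustrative names, not part of the original notebook.

```python
import pandas as pd

# Toy frame with timestamps in the same text format as the dataset
toy = pd.DataFrame({"Timestamp": ["2025-01-17 10:14:00", "2025-01-17 10:51:00"]})

# Convert text to datetime; malformed values raise an error by default
toy["Timestamp"] = pd.to_datetime(toy["Timestamp"], format="%Y-%m-%d %H:%M:%S")

# Derived features for the hour-of-day and day-of-week hypotheses
toy["hour"] = toy["Timestamp"].dt.hour
toy["day_name"] = toy["Timestamp"].dt.day_name()

# Per-column count of missing values; all zeros means nothing to impute
print(toy.isna().sum())
print(toy)
```

Note that the previewed rows all fall within a single hour of one day, so it is worth checking the range of the converted column before relying on hour-level or weekday-level features.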

###**5.4. Statistical Summary of Numeric Columns**

Generate a statistical summary of the numeric columns to understand their distribution, central tendency, and spread across the dataset.

In [16]:
df.select_dtypes(include='number').describe()

| | Transaction Amount | Latency (ms) | Slice Bandwidth (Mbps) | PIN Code |
|---|---|---|---|---|
| count | 1000.0 | 1000.0 | 1000.0 | 1000.0 |
| mean | 771.16529 | 11.688 | 148.511 | 5458.666 |
| std | 411.01925 | 5.131958 | 57.78634 | 2603.03646 |
| min | 51.89 | 3.0 | 50.0 | 1000.0 |
| 25% | 423.3475 | 7.0 | 98.0 | 3281.75 |
| 50% | 761.655 | 12.0 | 148.0 | 5385.5 |
| 75% | 1122.6725 | 16.0 | 198.25 | 7535.0 |
| max | 1497.76 | 20.0 | 250.0 | 9999.0 |
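
The quartiles in this summary are enough to run a quick outlier check on `Transaction Amount` with the common 1.5 × IQR rule. The sketch below works directly from the reported statistics. The 1.5 multiplier is a conventional choice assumed here, not something the notebook specifies.

```python
# Quartiles and extremes of Transaction Amount, copied from the describe() output
q1, q3 = 423.3475, 1122.6725
observed_min, observed_max = 51.89, 1497.76

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# A value is flagged only if it falls outside [lower_fence, upper_fence]
has_outliers = observed_min < lower_fence or observed_max > upper_fence
print(round(lower_fence, 2), round(upper_fence, 2), has_outliers)
```

The fences come out at roughly -625.64 and 2171.66, and both the minimum and the maximum lie inside them, so this rule flags no transaction amounts. The hypothesis about unusually high amounts would therefore need a different threshold, such as an upper quantile.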


###**5.5. Statistical Summary of Categorical Columns**

Generate a summary of the categorical features to understand their unique values, frequency distribution, and potential data quality issues.

In [17]:
df.select_dtypes(include='object').describe()

| | Transaction ID | Sender Account ID | Receiver Account ID | Transaction Type | Timestamp | Transaction Status | Geolocation (Latitude/Longitude) | Device Used | Network Slice ID |
|---|---|---|---|---|---|---|---|---|---|
| count | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 |
| unique | 1000 | 994 | 994 | 3 | 60 | 2 | 36 | 2 | 3 |
| top | TXN3992032184 | ACC71245 | ACC36934 | Transfer | 2025-01-17 10:55:00 | Failed | 48.8566 N, 139.6917 W | Mobile | Slice2 |
| freq | 1 | 2 | 2 | 374 | 28 | 513 | 42 | 521 | 340 |
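
The `top` and `freq` rows report only the most common value in each column. For the fraud hypotheses, a more useful view is the share of each category together with the fraud rate within it. The sketch below illustrates the idea using only the `Device Used` and `Fraud Flag` values from the five preview rows in section 5.1, which is far too little data to draw conclusions from. The `preview` frame is an illustrative stand-in for the full `df`.

```python
import pandas as pd

# Device Used and Fraud Flag from the five preview rows in section 5.1
preview = pd.DataFrame({
    "Device Used": ["Desktop", "Mobile", "Mobile", "Mobile", "Mobile"],
    "Fraud Flag": [True, False, False, True, True],
})

# Share of each category (the shares sum to 1.0)
device_share = preview["Device Used"].value_counts(normalize=True)

# The mean of a boolean column is the fraction of True values, i.e. the fraud rate
fraud_rate = preview.groupby("Device Used")["Fraud Flag"].mean()

print(device_share)
print(fraud_rate)
```

The same two lines applied to the full dataset, and repeated for `Transaction Type`, `Transaction Status`, and `Network Slice ID`, would give a first look at which categories deserve closer inspection.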
