# Personal Finance Transaction Analyzer

## Overview
This notebook analyzes personal bank transaction data to categorize expenses and visualize spending patterns. The analysis includes data preprocessing, transaction categorization based on merchant names, and various visualizations to understand financial behavior.

## Data Requirements
- Input file: `bankstatements.csv`
- Expected columns: `name`, `amount`, `mode`, `DrCr`
- The analysis categorizes transactions into: Personal, Wives, Parents, Boyfriends, Children, and Other
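
Before running the rest of the notebook, it may help to confirm that the input actually has the expected columns. The following is a minimal sketch, not part of the original notebook: the helper name `missing_columns` and the two-row sample frame are hypothetical and stand in for the real `bankstatements.csv`.

```python
import pandas as pd

# Column names taken from the data requirements listed above
EXPECTED_COLUMNS = ["name", "amount", "mode", "DrCr"]

def missing_columns(frame: pd.DataFrame) -> list:
    """Return the expected columns that are absent from the frame (hypothetical helper)."""
    return [col for col in EXPECTED_COLUMNS if col not in frame.columns]

# Hypothetical sample standing in for the real CSV; it deliberately lacks 'DrCr'
sample = pd.DataFrame({
    "name": ["PhonePe", "Flipkart"],
    "amount": [250.0, 1299.0],
    "mode": ["UPI", "CARD"],
})

print(missing_columns(sample))
```

An empty list means all four columns are present; anything else names the columns to fix in the export before continuing.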

## Analysis Workflow
1. Data loading and initial inspection
2. Data cleaning and preprocessing
3. Transaction categorization
4. Exploratory data analysis and visualization

## 1. Setup and Imports
Import all necessary libraries for data manipulation, analysis, and visualization.

In [None]:
# Core data manipulation and analysis libraries
import pandas as pd
import numpy as np

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

## 2. Data Loading and Initial Inspection
Load the bank statement data and perform initial exploration to understand the dataset structure and quality.

In [None]:
# Load the bank statement data from CSV file
# This file should contain transaction records with columns: name, amount, mode, DrCr
df = pd.read_csv('./bankstatements.csv')

# Display basic information about the dataset structure
# Shows data types, non-null counts, and memory usage
df.info()

In [None]:
# Display the first few rows to understand data format and structure
df.head()

In [None]:
# Generate descriptive statistics for numerical columns
# Provides count, mean, std, min, quartiles, and max values
df.describe()

In [None]:
# Display a subset of data (first 5 rows and 5 columns) for detailed inspection
df.iloc[0:5, 0:5]

## 3. Data Quality Assessment and Cleaning
Identify and handle missing values, data type conversions, and text standardization.
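
The cells below assume that `amount` converts cleanly to float. If the export stores amounts as text (for example with thousands separators or placeholder strings), a more tolerant conversion may be needed. This is a sketch under that assumption; the sample values are invented for illustration and do not come from the real data.

```python
import pandas as pd

# Hypothetical raw rows: a missing name, a comma-formatted amount, and a non-numeric amount
raw = pd.DataFrame({
    "name": ["PhonePe", None, "Flipkart"],
    "amount": ["1,250.50", "300", "n/a"],
})

cleaned = raw.copy()

# Assign the result back instead of calling fillna(inplace=True) on a column selection
cleaned["name"] = cleaned["name"].fillna("UNKNOWN")

# Strip thousands separators, then coerce anything non-numeric to NaN
cleaned["amount"] = pd.to_numeric(
    cleaned["amount"].str.replace(",", "", regex=False),
    errors="coerce",
)

print(cleaned)
```

With `errors="coerce"`, unparseable amounts become `NaN` and show up in the next `isnull().sum()` check instead of raising an exception.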

In [None]:
# Check for missing values across all columns
# This helps identify data quality issues that need to be addressed
df.isnull().sum()

In [None]:
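# Fill missing merchant names with "UNKNOWN" to maintain data integrity
# This prevents issues during categorization and analysis
# Assign the result back: calling fillna(inplace=True) on a column selection is deprecated in recent pandas
df['name'] = df['name'].fillna('UNKNOWN')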

In [None]:
# Verify that missing values have been handled properly
df.isnull().sum()

In [None]:
# Convert 'amount' column to float data type for numerical operations
# Ensures proper mathematical calculations and aggregations
df['amount'] = df['amount'].astype(float)

# Verify the data type conversion was successful
df.info()

In [None]:
# Display updated data structure after cleaning
df.head()

## 4. Text Data Standardization
Normalize text fields to ensure consistent categorization and analysis.
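
The same strip-and-lowercase treatment applies to all three text columns, so it could be wrapped in one small helper. This is only a sketch: `normalize_text` is a hypothetical name, and the sample frame is made up for illustration.

```python
import pandas as pd

def normalize_text(frame: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Return a copy with the given text columns stripped and lowercased (hypothetical helper)."""
    out = frame.copy()
    for col in columns:
        out[col] = out[col].str.strip().str.lower()
    return out

# Hypothetical sample with inconsistent casing and stray whitespace
sample = pd.DataFrame({
    "name": ["  PhonePe ", "FLIPKART"],
    "mode": ["UPI", "Card "],
    "DrCr": ["Dr", "CR"],
})

normalized = normalize_text(sample, ["name", "mode", "DrCr"])
print(normalized)
```

Returning a copy leaves the original frame untouched, which makes it easier to compare before and after while checking the cleanup.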

In [None]:
# Standardize merchant names to lowercase for consistent pattern matching
# This ensures case-insensitive categorization
df['name'] = df['name'].str.lower()

# Remove leading and trailing whitespace from merchant names
# Prevents categorization errors due to inconsistent spacing
df['name'] = df['name'].str.strip()

df.head()

In [None]:
# Standardize transaction mode to lowercase for consistency
df['mode'] = df['mode'].str.lower()
df.head()

In [None]:
# Standardize debit/credit indicator to lowercase
df['DrCr'] = df['DrCr'].str.lower()
df.head()

In [None]:
# Display a random sample of cleaned data to verify standardization
df.sample(5)

## 5. Initial Data Exploration
Explore transaction patterns and distributions before categorization.

In [None]:
# Import seaborn for enhanced statistical visualizations
import seaborn as sns

# Create boxplot to visualize transaction amount distribution by debit/credit type
# This helps identify outliers and understand spending vs income patterns
sns.boxplot(data=df, x='DrCr', y='amount')
plt.title('Transaction Amount Distribution by Type')

# Limit the y-axis to the 95th percentile so the bulk of the distribution is visible
# Values above this limit stay in the data; they are only cut off from the plot
plt.ylim(0, df['amount'].quantile(0.95))
plt.show()

In [None]:
# Analyze frequency of unique merchant names to understand transaction patterns
# This helps identify the most frequent transaction partners
name_counts = df['name'].value_counts()

# Display top 10 most frequent merchants for quick analysis
print(name_counts.head(10))

## 6. Transaction Categorization
Categorize transactions based on merchant names using pattern matching to group related expenses.
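
One possible way to keep keywords and labels together is a dictionary that maps each category to its regex pattern and then feeds `np.select`. The sketch below reuses the keywords from this notebook; the names `CATEGORY_PATTERNS` and `categorize` and the sample merchant names are hypothetical.

```python
import numpy as np
import pandas as pd

# Category -> regex pattern, using the keywords from this notebook.
# Order matters: np.select picks the first condition that matches.
CATEGORY_PATTERNS = {
    "Personal": "abutalah",
    "Wives": "phonepe|nafeesab",
    "Parents": "sangalli|hdfcbank",
    "Boyfriends": "flipkart|dadakhala",
    "Children": "ayubraje|budesaheb",
}

def categorize(names: pd.Series) -> pd.Series:
    """Label each merchant name with the first matching category, else 'Other'."""
    conditions = [
        names.str.contains(pattern, case=False, na=False)
        for pattern in CATEGORY_PATTERNS.values()
    ]
    labels = list(CATEGORY_PATTERNS.keys())
    return pd.Series(np.select(conditions, labels, default="Other"), index=names.index)

# Hypothetical merchant names for illustration
names = pd.Series([
    "abutalah store", "phonepe recharge", "hdfcbank emi",
    "flipkart order", "ayubraje fees", "unknown",
])
print(categorize(names).tolist())
```

Adding a category then means adding one dictionary entry rather than editing two parallel lists.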

In [None]:
# Alternative approach using function-based categorization (commented out)
# This shows a more traditional approach using a custom function

# def categorize_by_name(name):
#     if 'abutalah' in name:
#         return 'Personal'
#     elif 'phonepe' in name or 'nafeesab' in name:
#         return 'Wives'
#     elif 'sangalli' in name or 'hdfcbank' in name:
#         return 'Parents'
#     elif 'flipkart' in name or 'dadakhala' in name:
#         return 'Boyfriends'
#     elif 'ayubraje' in name or 'budesaheb' in name:
#         return 'Children'
#     else:
#         return 'Other'

# # Apply categorization
# df['category'] = df['name'].apply(categorize_by_name)

# # check the distribution of categories
# category_counts = df['category'].value_counts()
# print(category_counts)

In [None]:
# Vectorized categorization using numpy.select
# Each condition is evaluated on the whole column at once, avoiding a Python function call per row

# Define conditions for each category using string pattern matching
# Each condition checks if merchant name contains specific keywords
conditions = [
    df['name'].str.contains('abutalah', case=False, na=False),           # Personal transactions
    df['name'].str.contains('phonepe|nafeesab', case=False, na=False),   # Wives-related transactions
    df['name'].str.contains('sangalli|hdfcbank', case=False, na=False),  # Parents-related transactions
    df['name'].str.contains('flipkart|dadakhala', case=False, na=False), # Boyfriends-related transactions
    df['name'].str.contains('ayubraje|budesaheb', case=False, na=False)  # Children-related transactions
]

# Define category labels corresponding to each condition
choices = np.array(['Personal', 'Wives', 'Parents', 'Boyfriends', 'Children', 'Other'])

# Apply categorization using numpy.select; the first matching condition wins
# Transactions not matching any condition are labeled as 'Other'
df['category'] = np.select(conditions, choices[:-1], default=choices[-1])

# Display the distribution of categories to understand spending patterns
category_counts = df['category'].value_counts()
print(category_counts)

In [None]:
# Verify specific pattern matching using word boundaries for precise matching
# This ensures 'abutalah' is matched as a complete word, not as part of another word
df['name'].str.contains(r'\babutalah\b', case=False, na=False).sum()
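
To see how the word-boundary check differs from the plain substring match used during categorization, the two can be compared on a few names. The sample names below are invented for illustration.

```python
import pandas as pd

# Hypothetical names: exact match, match followed by a word, embedded match, missing value
names = pd.Series(["abutalah", "abutalah traders", "mrabutalahshop", None])

# Plain substring match: also hits names where the keyword is embedded in a longer word
substring_hits = names.str.contains("abutalah", case=False, na=False)

# Word-boundary match: the keyword must stand alone as a word
whole_word_hits = names.str.contains(r"\babutalah\b", case=False, na=False)

print(substring_hits.sum(), whole_word_hits.sum())
```

If the two counts differ on the real data, some transactions are being categorized through an embedded match and may deserve a manual look.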

## 7. Financial Analysis by Category
Analyze spending patterns and amounts across different transaction categories.
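
A plain sum per category mixes money going out with money coming in. One way to separate them is to split the totals by the debit/credit indicator. This sketch assumes the standardized `DrCr` values are `'dr'` and `'cr'`, which the notebook does not confirm; the sample rows are made up.

```python
import pandas as pd

# Hypothetical categorized transactions; 'dr'/'cr' values are an assumption
sample = pd.DataFrame({
    "category": ["Personal", "Personal", "Parents", "Other"],
    "DrCr": ["dr", "cr", "dr", "dr"],
    "amount": [100.0, 40.0, 250.0, 75.0],
})

# One row per category, one column per debit/credit value, summed amounts in the cells
totals = sample.pivot_table(
    index="category",
    columns="DrCr",
    values="amount",
    aggfunc="sum",
    fill_value=0,
)

print(totals)
```

The `dr` column can then be read as spending per category, provided debits are recorded as positive amounts in the source file.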

In [None]:
# Calculate total transaction amount per category
# Note: this sum combines debits and credits; filter on 'DrCr' first to isolate spending
df.groupby('category')['amount'].sum()

## 8. Data Visualization and Insights
Create various visualizations to understand spending patterns and distributions across categories.

In [None]:
# Create boxplot showing transaction amount distribution by category
# Boxplots reveal median, quartiles, and outliers for each category
sns.boxplot(data=df, x='category', y='amount')
plt.title('Transaction Amount Distribution by Category')
plt.xticks(rotation=45)  # Rotate labels for better readability
plt.show()

In [None]:
# Create focused analysis excluding 'Other' category for clearer insights
# This removes noise from uncategorized transactions
df_filtered = df[df['category'] != 'Other']

# Boxplot for categorized transactions only
sns.boxplot(data=df_filtered, x='category', y='amount')
plt.title('Transaction Amount Distribution by Category (Excluding Other)')
plt.xticks(rotation=45)
plt.show()

In [None]:
# Create bar chart showing total spending by category
# Bar charts are ideal for comparing total amounts across categories
df_filtered = df[df['category'] != 'Other']

sns.barplot(data=df_filtered, x='category', y='amount', estimator=np.sum)
plt.title('Total Transaction Amount by Category (Excluding Other)')
plt.xticks(rotation=45)
plt.show()

In [None]:
# Create line chart of total transaction amount by category
# Categories have no inherent order, so the line only connects the totals and does not show a trend over time
df_filtered = df[df['category'] != 'Other']

# Aggregate data by category and create line plot
category_totals = df_filtered.groupby('category')['amount'].sum().reset_index()
sns.lineplot(data=category_totals, x='category', y='amount')
plt.title('Total Transaction Amount by Category (Excluding Other)')
plt.xticks(rotation=45)
plt.show()

## Summary

This analysis summarizes personal financial transactions through:

1. **Data Quality**: Cleaned and standardized transaction data for accurate analysis
2. **Categorization**: Automated classification of transactions based on merchant names
3. **Visualization**: Multiple chart types to understand spending patterns and distributions
4. **Insights**: Clear view of spending across different relationship categories

The analysis reveals spending patterns across Personal, Wives, Parents, Boyfriends, and Children categories, enabling better financial planning and budget management.