# **PhonePe Transaction Insights**

**Project Type** - EDA

**Contribution** - Individual

# **Project Summary**


The **"PhonePe Transaction Insights"** project leverages the open-source **PhonePe Pulse dataset**, an extensive repository of anonymized and aggregated digital transaction, user, and insurance data across India. This dataset is categorized into **aggregated**, **map**, and **top** levels, each offering insights at country and state levels, segmented quarterly from 2018 to 2024.

The project focuses on **Exploratory Data Analysis (EDA)** to uncover patterns in **digital payment behavior**, including transaction trends over time, top-performing states and districts, device preferences among users, and insurance adoption rates. It involves **data extraction, transformation, and visualization** using **Python (Pandas, Matplotlib, Seaborn, Plotly)**, with an interactive dashboard built in **Streamlit**.

Data is sourced from JSON files organized by year and quarter, allowing analysis of how digital payments have evolved across regions and categories. Visualizations include bar charts, pie charts, and line graphs to highlight growth trends, user engagement, and geographical performance.
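
As a rough illustration of that layout, the year and quarter can be read directly off a file path. The sketch below assumes files are stored as `<year>/<quarter>.json` (for example `.../2018/1.json`). The helper name `parse_period` and the example path are made up for this illustration and are not part of the project code.

```python
from pathlib import PurePosixPath


def parse_period(json_path):
    """Return (year, quarter) from a path shaped like .../<year>/<quarter>.json."""
    p = PurePosixPath(json_path)
    year = p.parent.name   # the directory that directly contains the file
    quarter = p.stem       # file name without the .json extension
    return year, quarter


# Hypothetical path, written out by hand for illustration
example = "pulse/data/aggregated/transaction/country/india/2018/1.json"
print(parse_period(example))
```

Both values come back as strings, which is also how the ETL functions later in this notebook store them.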

While primarily an EDA project, it holds potential for **regression-based forecasting**, **clustering**, or even **classification** if extended with predictive modeling.

This project serves as a powerful tool for stakeholders seeking to understand digital payment dynamics in India and supports **business intelligence, marketing strategy, and policy-making** through actionable visual insights.

It aligns with PhonePe’s mission to democratize access to meaningful financial data and encourages developers and analysts to build upon this open dataset for deeper insights.


# **GitHub Link**

# **Problem Statement**

With the increasing reliance on digital payment systems like PhonePe, understanding the dynamics of transactions, user engagement, and insurance-related data is crucial for improving services and targeting users effectively. This project aims to analyze and visualize aggregated values of payment categories, create maps for total values at state and district levels, and identify top-performing states, districts, and pin codes.

# **Code Implementation**

# Step 1: Clone PhonePe Pulse GitHub Repository

In [1]:
# Clone PhonePe Pulse GitHub Repository
!git clone https://github.com/PhonePe/pulse.git

Cloning into 'pulse'...
remote: Enumerating objects: 17904, done.
remote: Counting objects: 100% (49/49), done.
remote: Compressing objects: 100% (32/32), done.
remote: Total 17904 (delta 19), reused 17 (delta 17), pack-reused 17855 (from 2)
Receiving objects: 100% (17904/17904), 26.13 MiB | 7.67 MiB/s, done.
Resolving deltas: 100% (8723/8723), done.
Updating files: 100% (9029/9029), done.


# Step 2: Install Required Libraries

In [2]:
# Install Required Libraries
!pip install pandas matplotlib seaborn streamlit plotly openpyxl -q

# Step 3: Import Libraries and Setup Paths

In [3]:
# Import Libraries and Define Paths
import os
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import streamlit as st
import plotly.express as px

# Set paths
DATA_DIR = 'pulse/data'
AGGREGATED_DIR = os.path.join(DATA_DIR, 'aggregated')
MAP_DIR = os.path.join(DATA_DIR, 'map')
TOP_DIR = os.path.join(DATA_DIR, 'top')

print("Paths set up successfully.")

Paths set up successfully.
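
Before running the ETL step, it may help to confirm that the clone produced the three expected folders. This is a minimal sketch, assuming the same relative `pulse/data` path as the cell above; `count_json_files` is a hypothetical helper introduced only for this check.

```python
import os


def count_json_files(base_path):
    """Count .json files under base_path; returns 0 if the folder is missing."""
    total = 0
    for _root, _dirs, files in os.walk(base_path):
        total += sum(1 for name in files if name.endswith(".json"))
    return total


DATA_DIR = "pulse/data"  # same relative path as in the cell above
for section in ("aggregated", "map", "top"):
    path = os.path.join(DATA_DIR, section)
    status = "found" if os.path.isdir(path) else "MISSING"
    print(f"{section:<10} {status:<8} {count_json_files(path)} JSON files")
```

A section reported as `MISSING` usually means the notebook is running from a different working directory than the one the repository was cloned into.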


# Step 4: ETL Functions to Load Data
## 1. Define the ETL functions (written to src/etl.py in Step 5)

In [4]:
# File: src/etl.py

def load_json_files(base_path):
    """Load all JSON files from a given directory."""
    all_data = []
    for root, dirs, files in os.walk(base_path):
        for file in files:
            if file.endswith('.json'):
                file_path = os.path.join(root, file)
                try:
                    with open(file_path, 'r') as f:
                        data = json.load(f)
                        # Files sit in <year>/<quarter>.json, so the year is the
                        # name of the directory that contains the file (root).
                        year = os.path.basename(root)
                        quarter = os.path.splitext(file)[0]
                        data['metadata'] = {'year': year, 'quarter': quarter}
                        all_data.append(data)
                except Exception as e:
                    print(f"Error reading {file_path}: {e}")
    return all_data


def extract_transaction_data(data_list):
    """Extract transaction data from loaded JSON."""
    records = []
    for data in data_list:
        meta = data.get('metadata', {})
        for entry in data.get('data', {}).get('transactionData', []):
            category = entry.get('name', '')
            for instrument in entry.get('paymentInstruments', []):
                records.append({
                    'category': category,
                    'count': instrument.get('count', 0),
                    'amount': instrument.get('amount', 0),
                    'year': meta.get('year'),
                    'quarter': meta.get('quarter')
                })
    return pd.DataFrame(records)


def extract_user_device_data(data_list):
    """Extract user device data."""
    records = []
    for data in data_list:
        meta = data.get('metadata', {})
        aggregated = data.get('data', {}).get('aggregated', {})
        users_by_device = data.get('data', {}).get('usersByDevice', [])
        for device in users_by_device:
            records.append({
                'device_brand': device.get('brand'),
                'registered_users': device.get('count'),
                'percentage': device.get('percentage'),
                'total_registered_users': aggregated.get('registeredUsers', 0),
                'app_opens': aggregated.get('appOpens', 0),
                'year': meta.get('year'),
                'quarter': meta.get('quarter')
            })
    return pd.DataFrame(records)


def extract_map_transaction_data(data_list):
    """Extract map-level transaction data."""
    records = []
    for data in data_list:
        meta = data.get('metadata', {})
        hover_list = data.get('data', {}).get('hoverDataList', [])
        for item in hover_list:
            name = item.get('name', '').title()
            metric = item.get('metric', [{}])[0]
            records.append({
                'state_district': name,
                'count': metric.get('count', 0),
                'amount': metric.get('amount', 0),
                'year': meta.get('year'),
                'quarter': meta.get('quarter')
            })
    return pd.DataFrame(records)


def extract_top_states_data(data_list):
    """Extract top states by transaction volume."""
    records = []
    for data in data_list:
        meta = data.get('metadata', {})
        states = data.get('data', {}).get('states', [])
        for state in states:
            metric = state.get('metric', {})
            records.append({
                'state': state.get('entityName', '').title(),
                'count': metric.get('count', 0),
                'amount': metric.get('amount', 0),
                'year': meta.get('year'),
                'quarter': meta.get('quarter')
            })
    return pd.DataFrame(records)
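
To make the flattening step concrete, here is a small worked example on a hand-written payload. The numbers are made up, and the sample only mirrors the keys that `extract_transaction_data` reads (`transactionData`, `name`, `paymentInstruments`). `flatten_transactions` is a compact stand-in for a single payload, not the notebook's function itself.

```python
import pandas as pd

# Hand-written sample with made-up numbers; it only mirrors the keys the
# extractor reads and is not copied from the dataset.
sample = {
    "data": {
        "transactionData": [
            {"name": "Recharge & bill payments",
             "paymentInstruments": [{"type": "TOTAL", "count": 10, "amount": 2500.0}]},
            {"name": "Peer-to-peer payments",
             "paymentInstruments": [{"type": "TOTAL", "count": 4, "amount": 8000.0}]},
        ]
    },
    "metadata": {"year": "2018", "quarter": "1"},
}


def flatten_transactions(payload):
    """Turn one nested payload into one row per (category, payment instrument)."""
    meta = payload.get("metadata", {})
    rows = []
    for entry in payload.get("data", {}).get("transactionData", []):
        for instrument in entry.get("paymentInstruments", []):
            rows.append({
                "category": entry.get("name", ""),
                "count": instrument.get("count", 0),
                "amount": instrument.get("amount", 0),
                "year": meta.get("year"),
                "quarter": meta.get("quarter"),
            })
    return pd.DataFrame(rows)


df_sample = flatten_transactions(sample)
print(df_sample)
```

Each nested category becomes one flat row, which is the shape the later `groupby` calls expect.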

# Step 5: Package the ETL Functions as a Module (src/etl.py)

## 1. Create Required Folders


In [5]:
# Create Required Folders
import os

os.makedirs('src', exist_ok=True)

print("Folder structure created: src/")

Folder structure created: src/


## 2. Create etl.py file

In [6]:
%%writefile src/etl.py
# ETL helpers for the PhonePe Pulse JSON files

import os
import json
import pandas as pd

def load_json_files(base_path):
    """Load all JSON files from a given directory."""
    all_data = []
    for root, dirs, files in os.walk(base_path):
        for file in files:
            if file.endswith('.json'):
                file_path = os.path.join(root, file)
                try:
                    with open(file_path, 'r') as f:
                        data = json.load(f)
                        # Files sit in <year>/<quarter>.json, so the year is the
                        # name of the directory that contains the file (root).
                        year = os.path.basename(root)
                        quarter = os.path.splitext(file)[0]
                        data['metadata'] = {'year': year, 'quarter': quarter}
                        all_data.append(data)
                except Exception as e:
                    print(f"Error reading {file_path}: {e}")
    return all_data


def extract_transaction_data(data_list):
    """Extract transaction data from loaded JSON."""
    records = []
    for data in data_list:
        meta = data.get('metadata', {})
        for entry in data.get('data', {}).get('transactionData', []):
            category = entry.get('name', '')
            for instrument in entry.get('paymentInstruments', []):
                records.append({
                    'category': category,
                    'count': instrument.get('count', 0),
                    'amount': instrument.get('amount', 0),
                    'year': meta.get('year'),
                    'quarter': meta.get('quarter')
                })
    return pd.DataFrame(records)


def extract_user_device_data(data_list):
    """Extract user device data."""
    records = []
    for data in data_list:
        meta = data.get('metadata', {})
        aggregated = data.get('data', {}).get('aggregated', {})
        users_by_device = data.get('data', {}).get('usersByDevice', [])
        for device in users_by_device:
            records.append({
                'device_brand': device.get('brand'),
                'registered_users': device.get('count'),
                'percentage': device.get('percentage'),
                'total_registered_users': aggregated.get('registeredUsers', 0),
                'app_opens': aggregated.get('appOpens', 0),
                'year': meta.get('year'),
                'quarter': meta.get('quarter')
            })
    return pd.DataFrame(records)


def extract_map_transaction_data(data_list):
    """Extract map-level transaction data."""
    records = []
    for data in data_list:
        meta = data.get('metadata', {})
        hover_list = data.get('data', {}).get('hoverDataList', [])
        for item in hover_list:
            name = item.get('name', '').title()
            metric = item.get('metric', [{}])[0]
            records.append({
                'state_district': name,
                'count': metric.get('count', 0),
                'amount': metric.get('amount', 0),
                'year': meta.get('year'),
                'quarter': meta.get('quarter')
            })
    return pd.DataFrame(records)


def extract_top_states_data(data_list):
    """Extract top states by transaction volume."""
    records = []
    for data in data_list:
        meta = data.get('metadata', {})
        states = data.get('data', {}).get('states', [])
        for state in states:
            metric = state.get('metric', {})
            records.append({
                'state': state.get('entityName', '').title(),
                'count': metric.get('count', 0),
                'amount': metric.get('amount', 0),
                'year': meta.get('year'),
                'quarter': meta.get('quarter')
            })
    return pd.DataFrame(records)

Writing src/etl.py


## 3. Write ETL Module to src/etl.py

In [8]:
%%writefile src/etl.py
# ETL helpers for the PhonePe Pulse JSON files (defensive version)

import os
import json
import pandas as pd

def load_json_files(base_path):
    """Load all JSON files from a given directory."""
    all_data = []
    for root, dirs, files in os.walk(base_path):
        for file in files:
            if file.endswith('.json'):
                file_path = os.path.join(root, file)
                try:
                    with open(file_path, 'r') as f:
                        data = json.load(f)
                        # Files sit in <year>/<quarter>.json, so the year is the
                        # name of the directory that contains the file (root).
                        year = os.path.basename(root)
                        quarter = os.path.splitext(file)[0]
                        data['metadata'] = {'year': year, 'quarter': quarter}
                        all_data.append(data)
                except Exception as e:
                    print(f"Error reading {file_path}: {e}")
    return all_data


def extract_transaction_data(data_list):
    """Extract transaction data from loaded JSON."""
    records = []
    for data in data_list:
        meta = data.get('metadata', {})
        for entry in data.get('data', {}).get('transactionData', []):
            category = entry.get('name', '')
            for instrument in entry.get('paymentInstruments', []):
                records.append({
                    'category': category,
                    'count': instrument.get('count', 0),
                    'amount': instrument.get('amount', 0),
                    'year': meta.get('year'),
                    'quarter': meta.get('quarter')
                })
    return pd.DataFrame(records)


def extract_user_device_data(data_list):
    """Extract user device data with robust error handling."""
    records = []
    for data in data_list:
        try:
            # Skip anything that is not a dictionary before calling .get on it
            if not isinstance(data, dict):
                continue

            meta = data.get("metadata", {})

            # "data" can be null in some files, so require a non-empty dict
            data_content = data.get("data")
            if not (isinstance(data_content, dict) and data_content):
                continue

            # "aggregated" may also be null; fall back to an empty dict
            aggregated = data_content.get("aggregated") or {}
            users_by_device = data_content.get("usersByDevice", []) or []

            for device in users_by_device:
                records.append({
                    "device_brand": device.get("brand"),
                    "registered_users": device.get("count"),
                    "percentage": device.get("percentage"),
                    "total_registered_users": aggregated.get("registeredUsers", 0),
                    "app_opens": aggregated.get("appOpens", 0),
                    "year": meta.get("year"),
                    "quarter": meta.get("quarter"),
                })
        except Exception as e:
            print(f"Error processing one item: {e}")
            continue  # Skip problematic item without breaking loop

    return pd.DataFrame(records)


def extract_map_transaction_data(data_list):
    """Extract map-level transaction data."""
    records = []
    for data in data_list:
        meta = data.get('metadata', {})
        hover_list = data.get('data', {}).get('hoverDataList', [])
        for item in hover_list:
            name = item.get('name', '').title()
            metric = item.get('metric', [{}])[0]
            records.append({
                'state_district': name,
                'count': metric.get('count', 0),
                'amount': metric.get('amount', 0),
                'year': meta.get('year'),
                'quarter': meta.get('quarter')
            })
    return pd.DataFrame(records)


def extract_top_states_data(data_list):
    """Extract top states by transaction volume with robust error handling."""
    records = []
    for data in data_list:
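        try:
            # Skip anything that is not a dictionary before calling .get on it
            if not isinstance(data, dict):
                continue

            meta = data.get("metadata", {})

            # "data" can be null in some files, so require a non-empty dict
            data_content = data.get("data")
            if not (isinstance(data_content, dict) and data_content):
                continue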

            states = data_content.get("states")
            if not isinstance(states, list):
                states = []

            for state in states:
                metric = state.get("metric", {})
                records.append({
                    "state": state.get("entityName", "").title(),
                    "count": metric.get("count", 0),
                    "amount": metric.get("amount", 0),
                    "year": meta.get("year"),
                    "quarter": meta.get("quarter"),
                })
        except Exception as e:
            print(f"Error processing item: {e}")
            continue

    return pd.DataFrame(records)

Overwriting src/etl.py
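
The defensive checks in the rewritten extractors all guard against the same situation: a key that is present but holds `null`, or a level that is not a dictionary at all. One way to express that pattern once is a small nested lookup. The helper `safe_get` and the payloads below are hypothetical and are shown only as a sketch; they are not part of `src/etl.py`.

```python
def safe_get(mapping, *keys, default=None):
    """Walk nested dict keys, returning `default` if a level is missing, null, or not a dict."""
    current = mapping
    for key in keys:
        if not isinstance(current, dict):
            return default
        current = current.get(key)
    return default if current is None else current


# Made-up payloads covering the cases the rewritten extractors guard against
ok = {"data": {"usersByDevice": [{"brand": "Xiaomi", "count": 5}]}}
null_list = {"data": {"usersByDevice": None}}
null_data = {"data": None}

for payload in (ok, null_list, null_data):
    print(safe_get(payload, "data", "usersByDevice", default=[]))
```

The second and third payloads both fall back to an empty list, so a loop over the result simply does nothing instead of raising.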


# Step 6: Import ETL Module and Load Data

In [9]:
# Import ETL Module and Load Data
import sys
sys.path.append('src')

# Reload the etl module after modifying it
if 'etl' in sys.modules:
    del sys.modules['etl']

from etl import load_json_files, extract_transaction_data, extract_user_device_data, extract_map_transaction_data, extract_top_states_data

# Define base directories
DATA_DIR = 'pulse/data'
AGGREGATED_DIR = os.path.join(DATA_DIR, 'aggregated')
MAP_DIR = os.path.join(DATA_DIR, 'map')
TOP_DIR = os.path.join(DATA_DIR, 'top')

# --- Load Aggregated Transaction Data ---
agg_trans_data = load_json_files(os.path.join(AGGREGATED_DIR, 'transaction'))
df_agg_transactions = extract_transaction_data(agg_trans_data)

# --- Load Aggregated User Device Data ---
agg_user_data = load_json_files(os.path.join(AGGREGATED_DIR, 'user'))
df_agg_users = extract_user_device_data(agg_user_data)

# --- Load Map-Level Transaction Data ---
map_trans_data = load_json_files(os.path.join(MAP_DIR, 'transaction'))
df_map_transactions = extract_map_transaction_data(map_trans_data)

# --- Load Top States by Transaction Volume ---
top_trans_data = load_json_files(os.path.join(TOP_DIR, 'transaction'))
df_top_states = extract_top_states_data(top_trans_data)

print("All data loaded successfully.")

All data loaded successfully.
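
The success message alone does not show whether the frames contain sensible data, so a quick shape and missing-value check can be useful at this point. The sketch below uses a tiny stand-in frame with made-up values; `summarize_frame` is a hypothetical helper, and in the notebook it would be called on `df_agg_transactions`, `df_agg_users`, and the other loaded frames.

```python
import pandas as pd


def summarize_frame(name, df):
    """Return a small summary: row count, column count, and total missing values."""
    return {
        "name": name,
        "rows": len(df),
        "columns": df.shape[1],
        "missing_values": int(df.isna().sum().sum()),
    }


# Tiny stand-in frame with made-up values
demo = pd.DataFrame({
    "category": ["Merchant payments", None],
    "count": [3, 7],
    "year": ["2018", "2018"],
    "quarter": ["1", "2"],
})
print(pd.DataFrame([summarize_frame("demo", demo)]))
```

A frame with zero rows, or a `year` column holding anything other than four-digit years, points to a problem in the loading step rather than in the plots.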


# Step 7: Load All Data into Pandas DataFrames

In [10]:
# Load All Data into Pandas DataFrames
import sys
sys.path.append('src/')
from etl import *

# Load Aggregated Transaction Data
agg_trans_data = load_json_files(os.path.join(AGGREGATED_DIR, 'transaction'))
df_agg_transactions = extract_transaction_data(agg_trans_data)

# Load User Device Data
agg_user_data = load_json_files(os.path.join(AGGREGATED_DIR, 'user'))
df_agg_users = extract_user_device_data(agg_user_data)

# Load Map Transaction Data
map_trans_data = load_json_files(os.path.join(MAP_DIR, 'transaction'))
df_map_transactions = extract_map_transaction_data(map_trans_data)

# Load Top States Data
top_trans_data = load_json_files(os.path.join(TOP_DIR, 'transaction'))
df_top_states = extract_top_states_data(top_trans_data)

print("All dataframes loaded successfully.")

All dataframes loaded successfully.


# Step 8: Data Analysis and Visualization

In [11]:
%%writefile src/visualization.py
# Plotly chart helpers shared by the notebook and the Streamlit dashboard

import plotly.express as px
import plotly.graph_objects as go
import pandas as pd
import streamlit as st

# Color scheme
PHONEPE_PURPLE = "#5F0F40"
PHONEPE_RED = "#9A031E"
PHONEPE_ORANGE = "#FB8B24"
PHONEPE_DARK_ORANGE = "#E36414"
PHONEPE_TEAL = "#0F4C5C"
COLOR_SEQUENCE = [PHONEPE_PURPLE, PHONEPE_RED, PHONEPE_ORANGE, PHONEPE_DARK_ORANGE, PHONEPE_TEAL]

# Dark text colors for readability
TEXT_COLOR = "#333333"
AXIS_COLOR = "#555555"
GRID_COLOR = "#e0e0e0"

def apply_plot_style(fig, title):
    fig.update_layout(
        plot_bgcolor='rgba(255, 255, 255, 0.5)',
        paper_bgcolor='rgba(255, 255, 255, 0.5)',
        title={
            'text': f"<b>{title}</b>",
            'font': {'size': 18, 'color': PHONEPE_PURPLE},
            'x': 0.05,
            'xanchor': 'left'
        },
        font={'color': TEXT_COLOR},
        xaxis={
            'color': AXIS_COLOR,
            'gridcolor': GRID_COLOR,
            'title_font': {'size': 14}
        },
        yaxis={
            'color': AXIS_COLOR,
            'gridcolor': GRID_COLOR,
            'title_font': {'size': 14}
        },
        legend={
            'font': {'size': 12},
            'title_font': {'size': 13}
        },
        hoverlabel={
            'bgcolor': 'white',
            'font_size': 12,
            'font_family': "Arial",
            'bordercolor': PHONEPE_PURPLE
        },
        margin={'l': 50, 'r': 30, 't': 70, 'b': 50},
        hovermode='x unified'
    )
    return fig

def plot_transaction_trend(df, return_fig=False):
    df['period'] = df['year'].astype(str) + '-Q' + df['quarter'].astype(str)
    grouped = df.groupby('period')[['count', 'amount']].sum().reset_index()

    fig = px.line(grouped, x='period', y='count',
                 title="Transaction Trend Over Time",
                 labels={'count': 'Transaction Count', 'period': 'Quarter'},
                 color_discrete_sequence=[PHONEPE_PURPLE],
                 template='plotly_white')

    fig.update_traces(line_width=3, hovertemplate="%{y:,.0f} transactions")
    fig = apply_plot_style(fig, "Transaction Trend Over Time")
    fig.update_xaxes(tickangle=45)

    if return_fig:
        return fig
    st.plotly_chart(fig, use_container_width=True)

# The remaining plot functions follow the same pattern: aggregate, build the figure, apply the shared style.

def plot_category_distribution(df, return_fig=False):
    grouped = df.groupby('category')[['count', 'amount']].sum().reset_index()

    fig = px.bar(grouped, x='count', y='category',
                 title="Transaction Distribution by Category",
                 labels={'count': 'Number of Transactions', 'category': 'Category'},
                 color='category',
                 color_discrete_sequence=COLOR_SEQUENCE)

    fig = apply_plot_style(fig, "Transaction Distribution by Category")
    fig.update_yaxes(categoryorder='total ascending')

    if return_fig:
        return fig
    st.plotly_chart(fig, use_container_width=True)

def plot_top_states(df, return_fig=False):
    grouped = df.groupby('state')[['count', 'amount']].sum().sort_values(by='count', ascending=False).head(10).reset_index()

    fig = px.bar(grouped, x='count', y='state',
                 title="Top 10 States by Transaction Volume",
                 labels={'count': 'Transaction Count', 'state': 'State'},
                 color='state',
                 color_discrete_sequence=COLOR_SEQUENCE)

    fig = apply_plot_style(fig, "Top 10 States by Transaction Volume")
    fig.update_yaxes(categoryorder='total ascending')

    if return_fig:
        return fig
    st.plotly_chart(fig, use_container_width=True)

def plot_device_usage(df, return_fig=False):
    grouped = df.groupby('device_brand')['registered_users'].sum().sort_values(ascending=False).head(10).reset_index()

    fig = px.pie(grouped, values='registered_users', names='device_brand',
                 title="Top 10 Device Brands by User Share",
                 color_discrete_sequence=COLOR_SEQUENCE)

    fig = apply_plot_style(fig, "Top 10 Device Brands by User Share")
    fig.update_traces(textposition='inside', textinfo='percent+label')

    if return_fig:
        return fig
    st.plotly_chart(fig, use_container_width=True)

def plot_quarterly_growth(df, return_fig=False):
    df['period'] = df['year'].astype(str) + '-Q' + df['quarter'].astype(str)
    grouped = df.groupby('period')['count'].sum().reset_index()
    grouped['growth'] = grouped['count'].pct_change() * 100

    fig = px.line(grouped, x='period', y='growth',
                  title="Quarter-over-Quarter Growth Rate",
                  labels={'growth': 'Growth Rate (%)', 'period': 'Quarter'},
                  color_discrete_sequence=[PHONEPE_RED])

    fig = apply_plot_style(fig, "Quarter-over-Quarter Growth Rate")
    fig.update_xaxes(tickangle=45)
    fig.add_hline(y=0, line_dash="dash", line_color="gray")

    if return_fig:
        return fig
    st.plotly_chart(fig, use_container_width=True)

def plot_category_trends(df, return_fig=False):
    df['period'] = df['year'].astype(str) + '-Q' + df['quarter'].astype(str)
    grouped = df.groupby(['period', 'category'])['count'].sum().reset_index()

    fig = px.line(grouped, x='period', y='count', color='category',
                  title="Category-wise Transaction Trends",
                  labels={'count': 'Transaction Count', 'period': 'Quarter'},
                  color_discrete_sequence=COLOR_SEQUENCE)

    fig = apply_plot_style(fig, "Category-wise Transaction Trends")
    fig.update_xaxes(tickangle=45)

    if return_fig:
        return fig
    st.plotly_chart(fig, use_container_width=True)

def plot_state_analysis(df, return_fig=False):
    grouped = df.groupby('state')[['count', 'amount']].sum().reset_index()

    fig = px.scatter(grouped, x='count', y='amount', color='state',
                     size='count', hover_name='state',
                     title="State-wise Transaction Volume vs Value",
                     labels={'count': 'Transaction Count', 'amount': 'Transaction Amount'},
                     color_discrete_sequence=COLOR_SEQUENCE)

    fig = apply_plot_style(fig, "State-wise Transaction Volume vs Value")

    if return_fig:
        return fig
    st.plotly_chart(fig, use_container_width=True)

def plot_top_districts(df, return_fig=False):
    grouped = df.groupby('state_district')['count'].sum().sort_values(ascending=False).head(15).reset_index()

    fig = px.bar(grouped, x='count', y='state_district',
                 title="Top 15 Districts by Transaction Volume",
                 labels={'count': 'Transaction Count', 'state_district': 'District'},
                 color='state_district',
                 color_discrete_sequence=COLOR_SEQUENCE)

    fig = apply_plot_style(fig, "Top 15 Districts by Transaction Volume")
    fig.update_yaxes(categoryorder='total ascending')

    if return_fig:
        return fig
    st.plotly_chart(fig, use_container_width=True)

def plot_user_growth(df, return_fig=False):
    df['period'] = df['year'].astype(str) + '-Q' + df['quarter'].astype(str)
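    # total_registered_users repeats on every device row of a file, so collapse
    # the repeats before summing to avoid multiplying the total by the brand count
    deduped = df.drop_duplicates(subset=['period', 'total_registered_users'])
    grouped = deduped.groupby('period')['total_registered_users'].sum().reset_index()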

    fig = px.line(grouped, x='period', y='total_registered_users',
                  title="Registered User Growth",
                  labels={'total_registered_users': 'Registered Users', 'period': 'Quarter'},
                  color_discrete_sequence=[PHONEPE_TEAL])

    fig = apply_plot_style(fig, "Registered User Growth")
    fig.update_xaxes(tickangle=45)

    if return_fig:
        return fig
    st.plotly_chart(fig, use_container_width=True)

def plot_device_share_trend(df, return_fig=False):
    df['period'] = df['year'].astype(str) + '-Q' + df['quarter'].astype(str)
    grouped = df.groupby(['period', 'device_brand'])['registered_users'].sum().reset_index()
    top_brands = df.groupby('device_brand')['registered_users'].sum().sort_values(ascending=False).head(5).index
    grouped = grouped[grouped['device_brand'].isin(top_brands)]

    fig = px.area(grouped, x='period', y='registered_users', color='device_brand',
                  title="Top 5 Device Brands Over Time",
                  labels={'registered_users': 'Registered Users', 'period': 'Quarter'},
                  color_discrete_sequence=COLOR_SEQUENCE)

    fig = apply_plot_style(fig, "Top 5 Device Brands Over Time")
    fig.update_xaxes(tickangle=45)

    if return_fig:
        return fig
    st.plotly_chart(fig, use_container_width=True)

def plot_transaction_heatmap(df, return_fig=False):
    df['period'] = df['year'].astype(str) + '-Q' + df['quarter'].astype(str)
    grouped = df.groupby(['period', 'category'])['count'].sum().unstack()

    fig = px.imshow(grouped,
                   labels=dict(x="Category", y="Quarter", color="Transactions"),
                   title="Transaction Heatmap by Quarter and Category",
                   color_continuous_scale=[PHONEPE_PURPLE, PHONEPE_ORANGE])

    fig = apply_plot_style(fig, "Transaction Heatmap by Quarter and Category")

    if return_fig:
        return fig
    st.plotly_chart(fig, use_container_width=True)

def plot_avg_transaction_value(df, return_fig=False):
    grouped = df.groupby('category').agg({'count':'sum', 'amount':'sum'}).reset_index()
    grouped['avg_value'] = grouped['amount'] / grouped['count']

    fig = px.bar(grouped, x='category', y='avg_value',
                 title="Average Transaction Value by Category",
                 labels={'avg_value': 'Average Value (₹)', 'category': 'Category'},
                 color='category',
                 color_discrete_sequence=COLOR_SEQUENCE)

    fig = apply_plot_style(fig, "Average Transaction Value by Category")

    if return_fig:
        return fig
    st.plotly_chart(fig, use_container_width=True)

Writing src/visualization.py
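
Several of the functions above build a `period` label and then aggregate by it, and `plot_quarterly_growth` adds a percentage change on top. The arithmetic can be checked on a tiny frame with made-up counts. This sketch builds the label on a copy (via `assign`), which is a design choice made here and differs from the module, where the column is added to the frame that is passed in.

```python
import pandas as pd

# Made-up quarterly counts, two rows per quarter to mimic multiple categories
df_demo = pd.DataFrame({
    "year": ["2018", "2018", "2018", "2018", "2018", "2018"],
    "quarter": ["1", "1", "2", "2", "3", "3"],
    "count": [60, 40, 90, 60, 70, 50],
})

# Build the period label on a copy so the original frame is left untouched
work = df_demo.assign(
    period=df_demo["year"].astype(str) + "-Q" + df_demo["quarter"].astype(str)
)

# Sum per quarter, then compute quarter-over-quarter growth in percent
quarterly = work.groupby("period", as_index=False)["count"].sum()
quarterly["growth_pct"] = quarterly["count"].pct_change() * 100
print(quarterly)
```

The quarterly totals are 100, 150, and 120, so the growth column is undefined for the first quarter, then +50% and -20%. The first value is always `NaN` because there is no earlier quarter to compare against.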


# Step 9: Run Visualizations

In [12]:
# Run and Save All Visualizations
import sys
sys.path.append('src/')
from visualization import *
import os

# Create output directory if not exists
os.makedirs('output', exist_ok=True)
os.makedirs('output/visualizations', exist_ok=True)  # For saving plot images

# Function to save Plotly figures as HTML for Streamlit
def save_plotly_fig(fig, filename):
    fig.write_html(f"output/visualizations/{filename}.html")

# 1. Transaction Trend
fig1 = plot_transaction_trend(df_agg_transactions, return_fig=True)
save_plotly_fig(fig1, "transaction_trend")

# 2. Category Distribution
fig2 = plot_category_distribution(df_agg_transactions, return_fig=True)
save_plotly_fig(fig2, "category_distribution")

# 3. Top States
fig3 = plot_top_states(df_top_states, return_fig=True)
save_plotly_fig(fig3, "top_states")

# 4. Device Usage
fig4 = plot_device_usage(df_agg_users, return_fig=True)
save_plotly_fig(fig4, "device_usage")

# 5. Quarterly Growth
fig5 = plot_quarterly_growth(df_agg_transactions, return_fig=True)
save_plotly_fig(fig5, "quarterly_growth")

# 6. Category Trends
fig6 = plot_category_trends(df_agg_transactions, return_fig=True)
save_plotly_fig(fig6, "category_trends")

# 7. State Analysis
fig7 = plot_state_analysis(df_top_states, return_fig=True)
save_plotly_fig(fig7, "state_analysis")

# 8. Top Districts
fig8 = plot_top_districts(df_map_transactions, return_fig=True)
save_plotly_fig(fig8, "top_districts")

# 9. User Growth
fig9 = plot_user_growth(df_agg_users, return_fig=True)
save_plotly_fig(fig9, "user_growth")

# 10. Device Share Trends
fig10 = plot_device_share_trend(df_agg_users, return_fig=True)
save_plotly_fig(fig10, "device_share_trends")

# 11. Transaction Heatmap
fig11 = plot_transaction_heatmap(df_agg_transactions, return_fig=True)
save_plotly_fig(fig11, "transaction_heatmap")

# 12. Avg Transaction Value
fig12 = plot_avg_transaction_value(df_agg_transactions, return_fig=True)
save_plotly_fig(fig12, "avg_transaction_value")

# Save DataFrames to CSV for Streamlit
df_agg_transactions.to_csv('output/agg_transactions.csv', index=False)
df_agg_users.to_csv('output/agg_users.csv', index=False)
df_map_transactions.to_csv('output/map_transactions.csv', index=False)
df_top_states.to_csv('output/top_states.csv', index=False)

print("All data and visualizations saved to output/")

All data and visualizations saved to output/
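
One detail worth keeping in mind when these CSVs are read back (for example by the dashboard, which is an assumption since its loading code is not shown here): `year` and `quarter` are stored as strings in memory, but `pd.read_csv` infers them as integers unless told otherwise. The sketch below demonstrates this with an in-memory buffer and made-up rows instead of the real `output/*.csv` files.

```python
import io

import pandas as pd

original = pd.DataFrame({"year": ["2018", "2019"], "quarter": ["1", "2"], "count": [10, 20]})

# Write to an in-memory buffer instead of output/*.csv so the sketch has no side effects
buffer = io.StringIO()
original.to_csv(buffer, index=False)

buffer.seek(0)
inferred = pd.read_csv(buffer)  # year and quarter are inferred as integers

buffer.seek(0)
as_text = pd.read_csv(buffer, dtype={"year": str, "quarter": str})  # keep them as text

print(inferred.dtypes)
print(as_text.dtypes)
```

Passing an explicit `dtype` on reload, or converting with `astype(str)` before concatenating, keeps string operations such as building a `year-Qn` label from failing on integer columns.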


# Step 10: Save DataFrames for Reuse

In [13]:
# Save DataFrames to CSV
df_agg_transactions.to_csv('output/agg_transactions.csv', index=False)
df_agg_users.to_csv('output/agg_users.csv', index=False)
df_map_transactions.to_csv('output/map_transactions.csv', index=False)
df_top_states.to_csv('output/top_states.csv', index=False)

print("Data saved to output folder.")

Data saved to output folder.


# Step 11: Build Streamlit Dashboard

In [14]:
%%writefile dashboard.py
# Streamlit dashboard entry point

import streamlit as st
import pandas as pd
import sys
import os
sys.path.append('src/')
from visualization import *

# Set page config with improved styling
st.set_page_config(
    page_title="PhonePe Pulse Analytics",
    layout="wide",
    page_icon="📊"
)

# Logo as inline SVG (embedded directly instead of loaded from a URL, for reliability)
PHONEPE_LOGO = """
<svg width="150" height="40" viewBox="0 0 150 40" fill="none" xmlns="http://www.w3.org/2000/svg">
<path d="M30 10H20V30H30V10Z" fill="#5F0F40"/>
<path d="M40 10H30V30H40V10Z" fill="#9A031E"/>
<path d="M50 10H40V30H50V10Z" fill="#FB8B24"/>
<path d="M60 10H50V30H60V10Z" fill="#E36414"/>
<path d="M70 10H60V30H70V10Z" fill="#0F4C5C"/>
<text x="80" y="25" font-family="Arial" font-size="20" font-weight="bold" fill="#5F0F40">PhonePe Pulse</text>
</svg>
"""

# Apply custom CSS styling
st.markdown(f"""
    <style>
        /* Main background */
        .main, .stApp {{
            background-color: #5F0F40;
        }}

        /* Text colors */
        h1, h2, h3, h4, h5, h6 {{
            color: white !important;
            font-weight: 600 !important;
        }}

        /* Sidebar styling (auto-generated class name; may differ between Streamlit versions) */
        .css-1lcbmhc {{
            background-color: #5F0F40 !important;
        }}
        .css-1lcbmhc h1,
        .css-1lcbmhc .stRadio label {{
            color: white !important;
        }}

        /* Chart background */
        .stPlotlyChart, .plot-container {{
            background-color: rgba(0, 0, 0, 0.5);
            border-radius: 8px;
            padding: 15px;
            box-shadow: 0 2px 4px rgba(0,0,0,0.1);
        }}

        /* Navigation radio buttons */
        .stRadio div[role="radiogroup"] {{
            background-color: #6c1d5f;
            padding: 10px;
            border-radius: 8px;
        }}
        .stRadio label {{
            color: white !important;
            padding: 5px 10px;
        }}
        .stRadio label:hover {{
            background-color: #9A031E !important;
        }}

        /* Remove the 0 below logo (auto-generated class name; may differ between Streamlit versions) */
        .css-1v3fvcr {{
            display: none;
        }}
    </style>
""", unsafe_allow_html=True)

# Main title
st.title("📊 PhonePe Pulse Analytics Dashboard")
st.markdown("---")

# Sidebar navigation with proper logo
with st.sidebar:
    st.markdown(PHONEPE_LOGO, unsafe_allow_html=True)
    st.markdown("<h1 style='color:white !important;'>Navigation</h1>", unsafe_allow_html=True)
    option = st.radio(
        "Menu",
        [
            "Overview",
            "Transaction Analysis",
            "User Analysis",
            "Geographical Analysis",
            "Advanced Insights"
        ],
        index=0,
        label_visibility="collapsed"
    )

# Load data
@st.cache_data
def load_data():
    return {
        'agg_transactions': pd.read_csv('output/agg_transactions.csv'),
        'agg_users': pd.read_csv('output/agg_users.csv'),
        'map_transactions': pd.read_csv('output/map_transactions.csv'),
        'top_states': pd.read_csv('output/top_states.csv')
    }

data = load_data()

# Overview Page
if option == "Overview":
    st.subheader("📌 Key Insights at a Glance")

    # Create metrics row
    col1, col2, col3, col4 = st.columns(4)
    total_trans = data['agg_transactions']['count'].sum()
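    # Note: this sums every row; if registered-user figures repeat across rows
    # (for example per device brand or per quarter), the total will be overstated.
    total_users = data['agg_users']['total_registered_users'].sum()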
    avg_trans = data['agg_transactions']['amount'].sum() / total_trans
    top_state = data['top_states'].groupby('state')['count'].sum().idxmax()

    col1.metric("Total Transactions", f"{total_trans:,.0f}")
    col2.metric("Total Registered Users", f"{total_users:,.0f}")
    col3.metric("Avg. Transaction Value", f"₹{avg_trans:,.2f}")
    col4.metric("Top Performing State", top_state)

    st.markdown("---")

    # Show important charts in overview
    st.subheader("📈 Key Transaction Trends")
    plot_transaction_trend(data['agg_transactions'])

    st.subheader("📱 Top Device Brands")
    plot_device_usage(data['agg_users'])

    st.subheader("🏆 Top Performing States")
    plot_top_states(data['top_states'])

# Transaction Analysis Page
elif option == "Transaction Analysis":
    st.subheader("💸 Transaction Analysis")

    tab1, tab2, tab3 = st.tabs(["Trends", "Categories", "Metrics"])

    with tab1:
        st.subheader("Transaction Volume Over Time")
        plot_transaction_trend(data['agg_transactions'])

        st.subheader("Quarterly Growth Rates")
        plot_quarterly_growth(data['agg_transactions'])

    with tab2:
        st.subheader("Transaction Distribution by Category")
        plot_category_distribution(data['agg_transactions'])

        st.subheader("Category Trends Over Time")
        plot_category_trends(data['agg_transactions'])

    with tab3:
        st.subheader("Average Transaction Values")
        plot_avg_transaction_value(data['agg_transactions'])

        st.subheader("Transaction Heatmap")
        plot_transaction_heatmap(data['agg_transactions'])

# User Analysis Page
elif option == "User Analysis":
    st.subheader("👥 User Behavior Analysis")

    col1, col2 = st.columns(2)

    with col1:
        st.subheader("User Growth Over Time")
        plot_user_growth(data['agg_users'])

        st.subheader("Top Device Brands")
        plot_device_usage(data['agg_users'])

    with col2:
        st.subheader("Device Brand Trends")
        plot_device_share_trend(data['agg_users'])

# Geographical Analysis Page
elif option == "Geographical Analysis":
    st.subheader("🌍 Geographical Distribution")

    col1, col2 = st.columns(2)

    with col1:
        st.subheader("Top Performing States")
        plot_top_states(data['top_states'])

        st.subheader("State Performance Analysis")
        plot_state_analysis(data['top_states'])

    with col2:
        st.subheader("Top Districts by Volume")
        plot_top_districts(data['map_transactions'])

# Advanced Insights Page
elif option == "Advanced Insights":
    st.subheader("🔍 Advanced Analytics")

    st.subheader("Transaction Category Trends")
    plot_category_trends(data['agg_transactions'])

    st.subheader("Device Brand Evolution")
    plot_device_share_trend(data['agg_users'])

    st.subheader("Transaction Value Analysis")
    plot_avg_transaction_value(data['agg_transactions'])

    st.subheader("Quarterly Performance Heatmap")
    plot_transaction_heatmap(data['agg_transactions'])

Writing dashboard.py
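
The Overview page computes `avg_trans` as total `amount` divided by total `count`, which weights every transaction equally. The snippet below is a small sketch on made-up numbers showing how that differs from averaging the per-row averages. The helper `overall_avg_value` and the toy frame are hypothetical and only illustrate the arithmetic; they also add a zero-count guard that the dashboard code does not have.

```python
import pandas as pd

# Hypothetical rows: one high-volume, low-value row and one low-volume, high-value row.
toy = pd.DataFrame({
    "count": [100, 1],
    "amount": [1000.0, 500.0],
})


def overall_avg_value(df: pd.DataFrame) -> float:
    """Total amount divided by total count, returning 0.0 when there are no transactions."""
    total_count = df["count"].sum()
    if total_count == 0:
        return 0.0
    return float(df["amount"].sum() / total_count)


# Weighted by transaction count: 1500 / 101
weighted = overall_avg_value(toy)

# Unweighted mean of per-row averages: (10 + 500) / 2
naive = float((toy["amount"] / toy["count"]).mean())
```

On this toy data the weighted value is close to 15 while the unweighted mean is 255, which is why dividing the two sums is usually the safer choice for a headline metric.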


# Step 12: Launch Streamlit Dashboard

In [15]:
# Installing pyngrok package - enables creating secure tunnels to localhost via ngrok

!pip install pyngrok

Collecting pyngrok
  Downloading pyngrok-7.2.12-py3-none-any.whl.metadata (9.4 kB)
Downloading pyngrok-7.2.12-py3-none-any.whl (26 kB)
Installing collected packages: pyngrok
Successfully installed pyngrok-7.2.12


In [16]:
from pyngrok import ngrok
from google.colab import userdata

# Get the ngrok authtoken from Colab secrets
NGROK_AUTH_TOKEN = userdata.get('NGROK_AUTH_TOKEN')
if NGROK_AUTH_TOKEN:
    ngrok.set_auth_token(NGROK_AUTH_TOKEN)
else:
    print("NGROK_AUTH_TOKEN not found in Colab secrets. Please add it.")


# Run Streamlit in the background
!streamlit run dashboard.py &>/dev/null&

# Get the public URL
public_url = ngrok.connect(addr='8501')
print("Streamlit Dashboard URL:")
print(public_url)

Streamlit Dashboard URL:
NgrokTunnel: "https://bb2c4f1b2cbe.ngrok-free.app" -> "http://localhost:8501"
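
The cell above starts Streamlit in the background and opens the tunnel immediately, so the public URL can be printed before the app is ready to serve requests. One possible refinement, sketched here under the assumption that Streamlit listens on its default port 8501, is to poll the port before calling `ngrok.connect`. The helper `wait_for_port` is hypothetical and uses only the standard library.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds, False if timeout elapses first."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connection means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Not listening yet (or refused); wait briefly and retry.
            time.sleep(interval)
    return False


# Possible use in the launch cell (assumes pyngrok is installed and configured):
# if wait_for_port("localhost", 8501):
#     public_url = ngrok.connect(addr="8501")
```

This check only confirms that the port accepts connections; it does not verify that the dashboard itself rendered without errors.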


# **Conclusion**

The **PhonePe Transaction Insights** project leverages the open-source **PhonePe Pulse dataset** to perform **Exploratory Data Analysis (EDA)** on digital payment trends in India. It analyzes transaction, user, and insurance data across states, districts, and time periods to uncover patterns and regional performance. Using **Python**, **Pandas**, and **Streamlit**, it builds an interactive dashboard for visualizing key metrics like transaction volume, device usage, and top-performing regions. The project supports strategic decision-making for marketing, policy-making, and financial inclusion by transforming complex JSON data into meaningful insights through **data visualization** and structured analysis.