# Pulse Secure Vulnerability Analysis

This notebook loads, processes, and enriches Pulse Secure VPN device data collected via the Shodan API. The data is processed with Apache Spark, enriched with the Hunter.io API, and stored in MongoDB for further analysis.

In [1]:
# Import necessary libraries
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode
from src.data_loader import load_shodan_data
from src.process import process_shodan_data
from src.hunter_api import create_domain_index, fetch_and_store_hunter_data
from src.report_generator import generate_top_report

## Initialize Spark Session

We start by creating (or reusing) a Spark session, which handles the data processing and transformation steps throughout the notebook.

In [2]:
spark = SparkSession.builder.appName("ShodanPulseSecure").getOrCreate()

## Load Shodan Data

We load the Shodan scan results from a JSON file into a Spark DataFrame. This data contains information about exposed Pulse Secure VPN devices.

In [None]:
# Forward slash avoids the invalid "\p" escape sequence and works on every OS
df = load_shodan_data("data/product_pulse_secure_country_fi.json")
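
`load_shodan_data` lives in `src/data_loader.py` and is not shown in this notebook. As an illustration of the input format only: Shodan data exports are commonly newline-delimited JSON with one banner object per line, a layout that `spark.read.json` reads by default. The plain-Python sketch below assumes that layout; `iter_banners` is a hypothetical helper, not part of this project.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_banners(path: str) -> Iterator[dict]:
    """Yield one banner dict per non-empty line of a newline-delimited JSON file.

    Hypothetical helper for illustration; assumes one JSON object per line.
    """
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines between records
                yield json.loads(line)
```

Iterating lazily like this keeps memory use flat for large exports, which is handy for a quick look at a file before handing it to Spark.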

### Inspect the Loaded Data

Before processing, we inspect the raw data to understand its structure and content.

In [4]:
# Show the schema of the DataFrame
df.printSchema()
# Display the first few records
df.show(5, truncate=False)
# Count the total number of records
df.count()
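# Optional: list available columns
# df.columns
# Optional: display a sample record as JSON
# (default=str guards against values that json cannot serialize directly)
# import json
# sample_record = df.limit(1).toPandas().to_dict(orient="records")[0]
# print(json.dumps(sample_record, indent=2, default=str))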

root
 |-- _shodan: struct (nullable = true)
 |    |-- crawler: string (nullable = true)
 |    |-- id: string (nullable = true)
 |    |-- module: string (nullable = true)
 |    |-- options: struct (nullable = true)
 |    |    |-- hostname: string (nullable = true)
 |    |    |-- referrer: string (nullable = true)
 |    |    |-- scan: string (nullable = true)
 |    |-- ptr: boolean (nullable = true)
 |    |-- region: string (nullable = true)
 |-- asn: string (nullable = true)
 |-- cpe: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- cpe23: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- data: string (nullable = true)
 |-- domains: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- hash: long (nullable = true)
 |-- hostnames: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- http: struct (nullable = true)
 |    |-- components: struct (nullable = true)
 |    |    |-- AngularJS: st

157

## Process the Data

We process the loaded Shodan data to normalize organization and ISP names, assess device vulnerabilities based on version numbers, and prioritize entries for further analysis.
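
The internals of `process_shodan_data` are not shown in this notebook. The sketch below illustrates one way the three steps could work in plain Python: normalizing a name, comparing a dotted version against a threshold, and deriving a priority. Every name here (`normalize_name`, `parse_version`, `assess_version`, `priority_for`), the threshold `ASSUMED_FIXED_VERSION`, the assessment labels, and the priority values are assumptions made for illustration; none of them is taken from `src/process.py` or from a vendor advisory.

```python
import re
from typing import Optional, Tuple

# Placeholder threshold for illustration only; NOT a real advisory value.
ASSUMED_FIXED_VERSION: Tuple[int, ...] = (22, 7, 0)


def normalize_name(name: Optional[str]) -> str:
    """Replace '-' and '_' with spaces, collapse whitespace, lowercase."""
    if not name:
        return ""
    cleaned = re.sub(r"[_\-]+", " ", name)
    return re.sub(r"\s+", " ", cleaned).strip().lower()


def parse_version(version: Optional[str]) -> Optional[Tuple[int, ...]]:
    """Turn '22.3.17.25001' into (22, 3, 17, 25001); None if missing or malformed."""
    if not version or not re.fullmatch(r"\d+(?:\.\d+)*", version.strip()):
        return None
    return tuple(int(part) for part in version.strip().split("."))


def assess_version(version: Optional[str]) -> str:
    """Label a version relative to the assumed threshold."""
    parsed = parse_version(version)
    if parsed is None:
        return "unknown"
    # Tuples compare element by element, so (22, 3, ...) < (22, 7, 0).
    return "potentially vulnerable" if parsed < ASSUMED_FIXED_VERSION else "not flagged"


def priority_for(assessment: str) -> int:
    """Lower number means higher priority (assumed convention)."""
    return {"potentially vulnerable": 1, "unknown": 2}.get(assessment, 3)
```

Comparing integer tuples rather than raw strings matters here: as strings, `"22.10"` sorts before `"22.3"`, which would misclassify newer builds.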




### filtered_df

In [5]:
optimized_df, filtered_df = process_shodan_data(df)
filtered_df.printSchema()
filtered_df.show(5, truncate=False)

root
 |-- status: long (nullable = true)
 |-- ip_str: string (nullable = true)
 |-- org: string (nullable = true)
 |-- isp: string (nullable = true)
 |-- domains: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- version: string (nullable = true)
 |-- product: string (nullable = true)
 |-- cpe23: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- country_name: string (nullable = true)
 |-- city: string (nullable = true)
 |-- CN: string (nullable = true)
 |-- O: string (nullable = true)
 |-- timestamp: string (nullable = true)
 |-- vulnerability_assessment: string (nullable = true)
 |-- priority: integer (nullable = false)

+------+--------------+--------------------------------+--------------------------------------+-----------------+-------------+------------+-----+------------+--------+----------------------+-----------------+--------------------------+---------------------------------------------+--------+
|status|ip_str        |o

### optimized_df

In [6]:
optimized_df.printSchema()
optimized_df.show(5, truncate=False)

root
 |-- ip_str: string (nullable = true)
 |-- priority: integer (nullable = false)
 |-- org: string (nullable = true)
 |-- isp: string (nullable = true)
 |-- domains: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- version: string (nullable = true)
 |-- product: string (nullable = true)
 |-- country_name: string (nullable = true)
 |-- city: string (nullable = true)
 |-- CN: string (nullable = true)
 |-- O: string (nullable = true)
 |-- vulnerability_assessment: string (nullable = true)
 |-- timestamp: string (nullable = true)

+--------------+--------+--------------------------------+--------------------------------------+-----------------+-------------+------------+------------+--------+----------------------+-----------------+---------------------------------------------+--------------------------+
|ip_str        |priority|org                             |isp                                   |domains          |version      |product     |country_name|ci

### exploded_df

In [7]:
exploded_df = optimized_df.withColumn("domain", explode(col("domains")))
exploded_df = exploded_df.drop("domains")
exploded_df.printSchema()

root
 |-- ip_str: string (nullable = true)
 |-- priority: integer (nullable = false)
 |-- org: string (nullable = true)
 |-- isp: string (nullable = true)
 |-- version: string (nullable = true)
 |-- product: string (nullable = true)
 |-- country_name: string (nullable = true)
 |-- city: string (nullable = true)
 |-- CN: string (nullable = true)
 |-- O: string (nullable = true)
 |-- vulnerability_assessment: string (nullable = true)
 |-- timestamp: string (nullable = true)
 |-- domain: string (nullable = true)
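
One detail of `explode` is easy to miss: rows whose array is null or empty produce no output rows, so hosts without any entry in `domains` disappear from `exploded_df`. Spark's `explode_outer` keeps such rows with a null value instead. The plain-Python analogue below mimics both behaviours on a list of dicts; `explode_rows` is a hypothetical helper and the sample rows are made up.

```python
from typing import Any, Dict, Iterable, List


def explode_rows(
    rows: Iterable[Dict[str, Any]],
    array_key: str,
    out_key: str,
    keep_empty: bool = False,
) -> List[Dict[str, Any]]:
    """Emit one row per array element, dropping the array column.

    keep_empty=False mirrors explode (rows with a null/empty array vanish);
    keep_empty=True mirrors explode_outer (such rows survive with None).
    """
    result = []
    for row in rows:
        base = {k: v for k, v in row.items() if k != array_key}
        values = row.get(array_key) or []
        if not values and keep_empty:
            result.append({**base, out_key: None})
        for value in values:
            result.append({**base, out_key: value})
    return result


sample = [
    {"ip_str": "192.0.2.10", "domains": ["example.com", "example.org"]},
    {"ip_str": "192.0.2.11", "domains": []},
]
print(len(explode_rows(sample, "domains", "domain")))                   # 2
print(len(explode_rows(sample, "domains", "domain", keep_empty=True)))  # 3
```

If hosts without domains should stay visible in later steps, swapping `explode` for `explode_outer` in the cell above would be the corresponding change.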



## Inspect the Processed Data

After exploding the `domains` array, we inspect the resulting DataFrame to verify the transformations. Each row now pairs one host with one domain.

In [8]:
exploded_df.show(30, truncate=False)

+--------------+--------+--------------------------------+--------------------------------------+-------------+------------+------------+--------+----------------------+----------------------------+---------------------------------------------+--------------------------+--------------------+
|ip_str        |priority|org                             |isp                                   |version      |product     |country_name|city    |CN                    |O                           |vulnerability_assessment                     |timestamp                 |domain              |
+--------------+--------+--------------------------------+--------------------------------------+-------------+------------+------------+--------+----------------------+----------------------------+---------------------------------------------+--------------------------+--------------------+
|62.183.178.4  |1       |Deltamarin Ltd.                 |DNA Oyj                               |22.3.17.25001|Pulse Secu

In [9]:
from src.enrichment import get_top_domains
top_n = 10
top_domains = get_top_domains(exploded_df, top_n)
print(top_domains)

['deltamarin.com', 'vmp.fi', 'skoda.fi', 'fmi.fi', 'teliasonera.com', 'elo.fi', 'elake-fennia.fi', 'hermanit.fi']
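
`get_top_domains` is defined in `src/enrichment.py` and is not shown. The call returned eight domains although `top_n` is 10, which suggests that `top_n` acts as an upper bound rather than a guaranteed length. One plausible reading, sketched below in plain Python, is to sort rows by `priority`, keep the first occurrence of each domain, and stop at `top_n`. The function name `top_domains_by_priority`, the sort key, and the sample rows in the usage are assumptions.

```python
from typing import Any, Dict, Iterable, List


def top_domains_by_priority(rows: Iterable[Dict[str, Any]], top_n: int) -> List[str]:
    """Return up to top_n distinct domains, highest priority (lowest number) first."""
    if top_n <= 0:
        return []
    ordered = sorted(rows, key=lambda r: r["priority"])  # stable: ties keep input order
    seen = set()
    result = []
    for row in ordered:
        domain = row.get("domain")
        if domain and domain not in seen:
            seen.add(domain)
            result.append(domain)
            if len(result) == top_n:
                break
    return result


rows = [
    {"domain": "example.org", "priority": 2},
    {"domain": "example.com", "priority": 1},
    {"domain": "example.com", "priority": 1},
]
print(top_domains_by_priority(rows, top_n=10))  # ['example.com', 'example.org']
```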


In [10]:
create_domain_index()
fetch_and_store_hunter_data(top_domains, enrich=True, force=False)

ℹ️ Domain index already exists.
ℹ️ Domain 'deltamarin.com' already exists in MongoDB. Skipping API fetch.
ℹ️ Domain 'vmp.fi' already exists in MongoDB. Skipping API fetch.
ℹ️ Domain 'skoda.fi' already exists in MongoDB. Skipping API fetch.
ℹ️ Domain 'fmi.fi' already exists in MongoDB. Skipping API fetch.
ℹ️ Domain 'teliasonera.com' already exists in MongoDB. Skipping API fetch.
ℹ️ Domain 'elo.fi' already exists in MongoDB. Skipping API fetch.
ℹ️ Domain 'elake-fennia.fi' already exists in MongoDB. Skipping API fetch.
ℹ️ Domain 'hermanit.fi' already exists in MongoDB. Skipping API fetch.
✅ All domains have been processed.
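
The log above shows a cache-first pattern: each domain is looked up in MongoDB, and the Hunter.io call is skipped when a document already exists. Presumably `force=True` overrides that check. A minimal sketch of this control flow follows, using an in-memory dict in place of MongoDB and an injected `fetch` callable in place of the Hunter.io client so that it runs without network access. `fetch_with_cache`, `store`, and `fetch` are hypothetical names; the real `fetch_and_store_hunter_data` may differ.

```python
from typing import Callable, Dict, Iterable, List


def fetch_with_cache(
    domains: Iterable[str],
    store: Dict[str, dict],
    fetch: Callable[[str], dict],
    force: bool = False,
) -> List[str]:
    """Fetch data for domains missing from store; return the domains actually fetched."""
    fetched = []
    for domain in domains:
        if domain in store and not force:
            print(f"Domain '{domain}' already cached. Skipping fetch.")
            continue
        store[domain] = fetch(domain)  # overwrite when force=True
        fetched.append(domain)
    return fetched
```

With PyMongo, the dict lookup would roughly correspond to `collection.find_one({"domain": domain})`, and a unique index created with `collection.create_index("domain", unique=True)` would guard against duplicate documents. Whether `create_domain_index` does exactly this, and whether the field is named `domain`, are assumptions.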


In [11]:
report = generate_top_report(exploded_df, save_to_csv=True, top_n=top_n)
display(report)

✅ Top 10 report successfully saved to top_report.csv (emails formatted, dates formatted)


Unnamed: 0,IP,Company,Product,Version,Emails,Location,EmployeeCount,ShodanScanDate,HunterScanDate,Domain
0,62.183.178.4,Deltamarin Ltd.,Pulse Secure,22.3.17.25001,"info@deltamarin.com, info.pl@deltamarin.com, m...","Turku, Finland",51-250,2025-04-05,2025-04-27,deltamarin.com
1,62.236.202.13,Varamiespalvelu Oy,Pulse Secure,,"helsinki@vmp.fi, palvelukeskus.caverion@vmp.fi...","Turku, Finland",251-1K,2025-03-21,2025-04-27,vmp.fi
2,195.255.59.18,HELKAMA-AUTO_OY,Pulse Secure,,"huolenpitosopimus@skoda.fi, yritysmyynti@skoda...","Espoo, Finland",,2025-04-01,2025-04-27,skoda.fi
3,128.214.219.14,Finnish Meteorological Institute,Pulse Secure,,"communications@fmi.fi, expert.services@fmi.fi,...",Finland,,2025-03-28,2025-04-28,fmi.fi
4,131.177.99.252,Telia Finland Oyj,Pulse Secure,,"press@teliasonera.com, domain-manager@teliason...","Frankfurt am Main, Germany",1-10,2025-04-05,2025-04-28,teliasonera.com
5,131.177.99.250,Telia Finland Oyj,Pulse Secure,,"press@teliasonera.com, domain-manager@teliason...","Frankfurt am Main, Germany",1-10,2025-04-04,2025-04-28,teliasonera.com
6,62.71.244.166,Telia Cygate Oy,Pulse Secure,,"kuntoutus@elo.fi, viestinta@elo.fi, rekrytoint...","Espoo, Finland",251-1K,2025-04-01,2025-04-28,elo.fi
7,62.71.244.141,Telia Cygate Oy,Pulse Secure,,"kuntoutus@elo.fi, viestinta@elo.fi, rekrytoint...","Espoo, Finland",251-1K,2025-03-31,2025-04-28,elo.fi
8,62.71.244.141,Telia Cygate Oy,Pulse Secure,,,,,2025-03-31,2025-04-28,elake-fennia.fi
9,185.22.132.102,Televisiokatu 4,Pulse Secure,,servicedesk@hermanit.fi,"Kajaani, Finland",1-10,2025-04-05,2025-04-28,hermanit.fi
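
The success message mentions that emails and dates are formatted before the CSV is written. The sketch below shows what such formatting could look like: joining an email list into one comma-separated string and reducing an ISO 8601 timestamp to its date. Both helper names and the input shapes are assumptions; `generate_top_report` itself is not shown.

```python
from datetime import datetime
from typing import Iterable, Optional


def format_emails(emails: Optional[Iterable[str]]) -> str:
    """Join emails with ', ', trimming whitespace and dropping blanks and duplicates."""
    if not emails:
        return ""
    cleaned = (e.strip() for e in emails if e and e.strip())
    return ", ".join(dict.fromkeys(cleaned))  # dict keys keep first-seen order


def format_date(timestamp: Optional[str]) -> str:
    """Reduce an ISO 8601 timestamp to 'YYYY-MM-DD'; empty string if missing or invalid."""
    if not timestamp:
        return ""
    try:
        return datetime.fromisoformat(timestamp).date().isoformat()
    except ValueError:
        return ""
```

Returning an empty string for missing values, rather than raising, matches the blank cells visible in the report rows above.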
