# Pulse Secure Vulnerability Analysis

This notebook loads, processes, and enriches Pulse Secure VPN device data collected via the Shodan API. The data is processed with Apache Spark, enriched using the Hunter.io API, and stored in MongoDB for further analysis.

In [None]:
# Import necessary libraries
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode
from src.data_loader import load_shodan_data
from src.enrichment import get_top_domains
from src.process import process_shodan_data
from src.hunter_api import create_domain_index, fetch_and_store_hunter_data
from src.report_generator import generate_top_report

## Initialize Spark Session

We start by initializing a local Spark session, which is used for data processing and transformation tasks throughout the notebook.

In [2]:
spark = SparkSession.builder.appName("ShodanPulseSecure").getOrCreate()

## Load Shodan Data

We load the Shodan scan results from a JSON file into a Spark DataFrame. This data contains information about exposed Pulse Secure VPN devices.

In [3]:
# Forward slashes work on every platform; "\p" in the original path is an invalid escape sequence
df = load_shodan_data("data/product_pulse_secure_country_de.json")
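
The implementation of `load_shodan_data` is not shown in this notebook. As a rough, Spark-free sketch of what reading such a file can involve, and assuming the export is newline-delimited JSON (one banner per line), the records could be parsed as follows. The helper name `iter_banners` and the sample records are illustrative only; the field names follow the schema shown in this notebook, and the IP addresses come from a documentation range.

```python
import io
import json
from typing import Iterable, Iterator


def iter_banners(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty line of a JSON Lines stream."""
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_no} is not valid JSON: {exc}") from exc


# Made-up sample input standing in for the real export file.
sample = io.StringIO(
    '{"ip_str": "192.0.2.10", "domains": ["example.com"]}\n'
    "\n"
    '{"ip_str": "192.0.2.11", "domains": []}\n'
)
banners = list(iter_banners(sample))
print(len(banners))
```

Passing an open file handle instead of the `StringIO` object would work the same way, since both are iterated line by line.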

### Inspect the Loaded Data

Before processing, we inspect the raw data to understand its structure and content.

In [4]:
# Show the schema of the DataFrame
df.printSchema()
# Display the first few records
df.show(5, truncate=False)
# Count the total number of records
df.count()

# Optional: list the available columns
# df.columns

# Optional: display a sample record as JSON
# import json
# sample_record = df.limit(1).toPandas().to_dict(orient="records")[0]
# print(json.dumps(sample_record, indent=2))

root
 |-- _shodan: struct (nullable = true)
 |    |-- crawler: string (nullable = true)
 |    |-- id: string (nullable = true)
 |    |-- module: string (nullable = true)
 |    |-- options: struct (nullable = true)
 |    |    |-- hostname: string (nullable = true)
 |    |    |-- referrer: string (nullable = true)
 |    |    |-- scan: string (nullable = true)
 |    |-- ptr: boolean (nullable = true)
 |    |-- region: string (nullable = true)
 |-- asn: string (nullable = true)
 |-- cloud: struct (nullable = true)
 |    |-- provider: string (nullable = true)
 |    |-- region: string (nullable = true)
 |    |-- service: string (nullable = true)
 |-- cpe: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- cpe23: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- data: string (nullable = true)
 |-- domains: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- hash: long (nullable = true)
 |-- hostnames: array (nullabl

150

## Process the Data

We process the loaded Shodan data to normalize organization and ISP names, assess device vulnerabilities based on version numbers, and prioritize entries for further analysis.
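
The body of `process_shodan_data` is not part of this notebook, so the following is only a plain-Python sketch of the three steps named above: name normalization, a version-based assessment, and a priority derived from it. Every name here (`normalize_name`, `parse_version`, `assess`, `priority`), the threshold `ASSUMED_FIXED_VERSION`, and the priority mapping are hypothetical and chosen for illustration. Note that the schemas in this notebook list `version` as `void`, which suggests no version value was available; under this sketch such rows would all be assessed as `unknown`.

```python
import re
from typing import Optional, Tuple

# Hypothetical threshold, NOT taken from any advisory or from the project code.
ASSUMED_FIXED_VERSION: Tuple[int, int, int] = (9, 1, 0)

# Hypothetical mapping; a lower number means "look at this first".
PRIORITY_BY_ASSESSMENT = {
    "potentially vulnerable": 1,
    "unknown": 2,
    "likely patched": 3,
}


def normalize_name(name: Optional[str]) -> Optional[str]:
    """Collapse whitespace and case so spelling variants compare equal."""
    if name is None:
        return None
    return " ".join(name.split()).casefold()


def parse_version(version: Optional[str]) -> Optional[Tuple[int, ...]]:
    """Read up to three numeric components, e.g. '9.1R11' -> (9, 1, 11)."""
    if not version:
        return None
    numbers = [int(n) for n in re.findall(r"\d+", version)][:3]
    if not numbers:
        return None
    # Pad short versions such as '8.3' to three components.
    return tuple(numbers + [0] * (3 - len(numbers)))


def assess(version: Optional[str]) -> str:
    """Compare the parsed version against the assumed threshold."""
    parsed = parse_version(version)
    if parsed is None:
        return "unknown"
    if parsed < ASSUMED_FIXED_VERSION:
        return "potentially vulnerable"
    return "likely patched"


def priority(version: Optional[str]) -> int:
    """Map the assessment to a sortable priority."""
    return PRIORITY_BY_ASSESSMENT[assess(version)]


print(normalize_name("  Example   GmbH "))
print(assess("8.3"), assess(None))
```

Treating a missing version as its own category, rather than as patched, keeps devices with unknown state visible in the prioritized output.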

### filtered_df

In [5]:
optimized_df, filtered_df = process_shodan_data(df)
filtered_df.printSchema()
filtered_df.show(5, truncate=False)

root
 |-- status: long (nullable = true)
 |-- ip_str: string (nullable = true)
 |-- org: string (nullable = true)
 |-- isp: string (nullable = true)
 |-- domains: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- version: void (nullable = true)
 |-- product: string (nullable = true)
 |-- cpe23: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- country_name: string (nullable = true)
 |-- city: string (nullable = true)
 |-- CN: string (nullable = true)
 |-- O: string (nullable = true)
 |-- timestamp: string (nullable = true)
 |-- vulnerability_assessment: string (nullable = true)
 |-- priority: integer (nullable = false)

+------+-------------+-----------------------------------------------------+------------------------------------+--------------------------+-------+------------+-----+------------+-----------------+-------------------------------+----------------+--------------------------+------------------------+--------+
|status|i

### optimized_df

In [6]:
optimized_df.printSchema()
optimized_df.show(5, truncate=False)

root
 |-- ip_str: string (nullable = true)
 |-- priority: integer (nullable = false)
 |-- org: string (nullable = true)
 |-- isp: string (nullable = true)
 |-- domains: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- version: void (nullable = true)
 |-- product: string (nullable = true)
 |-- country_name: string (nullable = true)
 |-- city: string (nullable = true)
 |-- CN: string (nullable = true)
 |-- O: string (nullable = true)
 |-- vulnerability_assessment: string (nullable = true)
 |-- timestamp: string (nullable = true)

+-------------+--------+-----------------------------------------------------+------------------------------------+--------------------------+-------+------------+------------+-----------------+-------------------------------+----------------+------------------------+--------------------------+
|ip_str       |priority|org                                                  |isp                                 |domains                   |

### exploded_df

In [7]:
exploded_df = optimized_df.withColumn("domain", explode(col("domains")))
exploded_df = exploded_df.drop("domains")
exploded_df.printSchema()

root
 |-- ip_str: string (nullable = true)
 |-- priority: integer (nullable = false)
 |-- org: string (nullable = true)
 |-- isp: string (nullable = true)
 |-- version: void (nullable = true)
 |-- product: string (nullable = true)
 |-- country_name: string (nullable = true)
 |-- city: string (nullable = true)
 |-- CN: string (nullable = true)
 |-- O: string (nullable = true)
 |-- vulnerability_assessment: string (nullable = true)
 |-- timestamp: string (nullable = true)
 |-- domain: string (nullable = true)
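
Spark's `explode` emits one output row per array element, and rows whose array is null or empty produce no output at all (`explode_outer` would keep them with a null value instead). The plain-Python sketch below mimics that behavior on made-up rows; `explode_rows` is an illustrative helper, not part of the project code.

```python
from typing import Any, Dict, Iterable, Iterator


def explode_rows(
    rows: Iterable[Dict[str, Any]], array_key: str, out_key: str
) -> Iterator[Dict[str, Any]]:
    """Yield one row per element of row[array_key], dropping the array column."""
    for row in rows:
        values = row.get(array_key) or []  # None and [] both yield no rows
        base = {k: v for k, v in row.items() if k != array_key}
        for value in values:
            yield {**base, out_key: value}


# Made-up rows using documentation IP addresses.
rows = [
    {"ip_str": "192.0.2.10", "domains": ["a.example", "b.example"]},
    {"ip_str": "192.0.2.11", "domains": []},
    {"ip_str": "192.0.2.12", "domains": None},
]
exploded = list(explode_rows(rows, "domains", "domain"))
for row in exploded:
    print(row)
```

A device with two domains therefore appears twice after the explode step, once per domain, while devices without any domain drop out of `exploded_df`.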



## Inspect the Processed Data

After processing, we inspect the exploded DataFrame, which now holds one row per IP address and domain, to verify the transformations before enrichment.

In [8]:
exploded_df.show(30, truncate=False)

+---------------+--------+-----------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------+-------+------------+------------+--------------------+-------------------------------+------------------------------+------------------------+--------------------------+------------------------+
|ip_str         |priority|org                                                                          |isp                                                                                                     |version|product     |country_name|city                |CN                             |O                             |vulnerability_assessment|timestamp                 |domain                  |
+---------------+--------+-----------------------------------------------------------------------------+--------------------------------------------------------------------------------------

## Select the Top Domains

We select up to `top_n` domains from the exploded DataFrame; these are the domains passed to Hunter.io in the next step.

In [None]:
top_n = 10
top_domains = get_top_domains(exploded_df, top_n)
print(top_domains)

['deutschebahn.com', 'ericsson.net', 'db.de', 'bauer-gmbh.org', 'unbelievable-machine.net', 'siemens-energy.com', 'm-online.net', 'kvb.de', 'eurofins.com']
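
How `get_top_domains` ranks domains is not shown here. One plausible approach, sketched below under the assumption that ranking is by number of occurrences, uses `collections.Counter`. The function name `top_domains_by_count` and the sample data are illustrative. With this approach the result is shorter than `n` whenever fewer than `n` distinct domains exist.

```python
from collections import Counter
from typing import Iterable, List, Optional


def top_domains_by_count(domains: Iterable[Optional[str]], n: int) -> List[str]:
    """Return up to n domains, most frequent first; ties keep first-seen order."""
    counts = Counter(d.lower() for d in domains if d)  # skip None and ""
    return [domain for domain, _ in counts.most_common(n)]


# Made-up input; note the case variant and the missing value.
sample_domains = ["a.example", "b.example", "a.example", "c.example", None, "B.example"]
print(top_domains_by_count(sample_domains, 2))
print(top_domains_by_count(sample_domains, 10))
```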


## Enrich Domains with Hunter.io

We make sure the domain index exists, then fetch Hunter.io data for each selected domain and store the result. In this run, the request for `bauer-gmbh.org` returned status 404, so no Hunter.io document was stored for that domain.

In [10]:
create_domain_index()
fetch_and_store_hunter_data(top_domains, enrich=True, force=False)

ℹ️ Domain index already exists.
✅ New document inserted for domain: deutschebahn.com
✅ New document inserted for domain: ericsson.net
✅ New document inserted for domain: db.de
⚠️ Hunter.io request failed for domain 'bauer-gmbh.org', status_code=404
✅ New document inserted for domain: unbelievable-machine.net
✅ New document inserted for domain: siemens-energy.com
✅ New document inserted for domain: m-online.net
✅ New document inserted for domain: kvb.de
✅ New document inserted for domain: eurofins.com
✅ All domains have been processed.
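
The log above shows two behaviors worth spelling out: a failed request (status 404) is reported without aborting the loop, and each domain ends up as its own stored document. The sketch below reproduces that control flow with a dictionary standing in for the MongoDB collection and an injected `fetch` callable standing in for the HTTP request, so it runs without network access. The function names and the meaning given to `force` (refetch domains that are already stored) are assumptions; only the Hunter.io Domain Search endpoint URL is taken from the public API.

```python
from typing import Callable, Dict, Iterable, Tuple
from urllib.parse import urlencode

HUNTER_DOMAIN_SEARCH = "https://api.hunter.io/v2/domain-search"


def build_url(domain: str, api_key: str) -> str:
    """Build the Domain Search request URL for one domain."""
    return f"{HUNTER_DOMAIN_SEARCH}?{urlencode({'domain': domain, 'api_key': api_key})}"


def fetch_and_store(
    domains: Iterable[str],
    fetch: Callable[[str], Tuple[int, dict]],
    store: Dict[str, dict],
    force: bool = False,
) -> Dict[str, str]:
    """Fetch each domain once; keep going when a single request fails."""
    outcome = {}
    for domain in domains:
        if domain in store and not force:
            outcome[domain] = "skipped"  # assumed meaning of force=False
            continue
        status, payload = fetch(domain)
        if status != 200:
            outcome[domain] = f"failed ({status})"
            continue
        store[domain] = payload  # the dict key plays the role of a unique index
        outcome[domain] = "stored"
    return outcome


def fake_fetch(domain: str) -> Tuple[int, dict]:
    """Stand-in for the HTTP call: unknown domains answer with 404."""
    if domain == "missing.example":
        return 404, {}
    return 200, {"domain": domain, "emails": []}


store: Dict[str, dict] = {}
first_run = fetch_and_store(["a.example", "missing.example"], fake_fetch, store)
second_run = fetch_and_store(["a.example", "missing.example"], fake_fetch, store)
print(first_run)
print(second_run)
```

In a real run, `fetch` would issue a GET request to the URL returned by `build_url` and return the response status together with the parsed JSON body.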


## Generate the Report

Finally, we build the top report from the exploded DataFrame and save it as `top_report.csv`.

In [11]:
report = generate_top_report(exploded_df, save_to_csv=True, top_n=top_n)
display(report)

✅ Top 10 report successfully saved to top_report.csv (emails formatted, dates formatted)


|  | IP | Company | Product | Version | Emails | Location | EmployeeCount | ShodanScanDate | HunterScanDate | Domain |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 81.200.199.27 | Deutsche Bahn AG / DB Systel GmbH (German Rail... | Pulse Secure |  | compliance-dbcargo@deutschebahn.com, datenschu... | Frankfurt am Main, Germany | 1K-5K | 2025-04-27 | 2025-04-28 | deutschebahn.com |
| 1 | 129.192.10.118 | Ericsson Inc. | Pulse Secure |  |  | Kista, Sweden | 100K+ | 2025-04-27 | 2025-04-28 | ericsson.net |
| 2 | 81.200.199.16 | Deutsche Bahn AG / DB Systel GmbH (German Rail... | Pulse Secure |  | info@db.de, medienbetreuung@db.de |  |  | 2025-04-27 | 2025-04-28 | db.de |
| 3 | 91.26.50.154 | Bauer GmbH | Pulse Secure |  |  |  |  | 2025-04-27 | NaT | bauer-gmbh.org |
| 4 | 129.192.10.119 | Ericsson Inc. | Pulse Secure |  |  | Kista, Sweden | 100K+ | 2025-04-27 | 2025-04-28 | ericsson.net |
| 5 | 94.198.63.67 | Network for The unbelievable Machine Company S... | Pulse Secure |  |  |  |  | 2025-04-27 | 2025-04-28 | unbelievable-machine.net |
| 6 | 143.99.208.11 | Siemens Energy Management GmbH trading as Siem... | Pulse Secure |  | ariba.support@siemens-energy.com, sekkinquiryj... | 51222, Douglas, US, United States | 10K-50K | 2025-04-27 | 2025-04-28 | siemens-energy.com |
| 7 | 62.245.159.85 | Kassenaerztliche Verein. Bayerns | Pulse Secure |  | abuse@m-online.net, virus-alert@m-online.net |  |  | 2025-04-26 | 2025-04-28 | m-online.net |
| 8 | 62.245.159.85 | Kassenaerztliche Verein. Bayerns | Pulse Secure |  | recruiting@kvb.de, presse@kvb.de, patienten-in... | Munich, Germany | 251-1K | 2025-04-26 | 2025-04-28 | kvb.de |
| 9 | 178.15.111.67 | Arcor AG & Co KG Network Operation Center | Pulse Secure |  | formation-germande@eurofins.com, salessupport-... | Luxembourg, Luxembourg | 10K-50K | 2025-04-26 | 2025-04-28 | eurofins.com |
