## Ukraine EuRepoC Research

In this notebook, we build a list of links from the sources cited for EuRepoC incidents that impact Ukraine. The EuRepoC database records the receiver country of each incident, but not the specific cities or provinces impacted. We extract this missing location information for incidents that impact Ukraine.

We use `UkraineCyberMultiScraper`, a custom `MultiScraper` that combines several specialized scrapers: `CertUaScraper`, plus the bundled [`TelegramMessageScraper`, a fallback `NewsScraper`, and a `TextScraper`](https://scraipe.readthedocs.io/en/latest/get_started/bundled_components/).

## Setup
Install the packages, load the dataset, and load credentials from `secrets.env`.

A `secrets.env.template` file is provided. Rename it to `secrets.env` and fill in your credentials.

In [1]:
# Import Dependencies
from scraipe_cyber.ukraine import UkraineCyberMultiScraper
from scraipe.extended import TelegramMessageScraper
from scraipe.extended import OpenAiAnalyzer
from scraipe import Workflow
import pandas as pd
import re
import dotenv
import os

In [7]:
# Download the dataset
database_link = "https://zenodo.org/records/14965395/files/eurepoc_global_dataset_1_3.csv?download=1"
cyber_database = pd.read_csv(database_link, sep=",", encoding="utf-8")
cyber_database.head()

Unnamed: 0,incident_id,name,description,start_date,end_date,inclusion_criterion,inclusion_criterion_subcode,source_disclosure,incident_type,receiver_name,...,legal_response_subtype,legal_response_responding_country,legal_response_responding_actor,attribution_legal_reference,attribution_legal_reference_subcode,response_indicator,casualties,source_url,added_to_db,updated_at
0,4163,Russian State-Sponsored Actors Linked to GRU ...,"On 19 December 2024, a cyber attack attributed...",19.12.2024,19.12.2024,Attack conducted by nation state (generic “sta...,Not available;Not available,Incident disclosed by authorities of victim state,Disruption;Hijacking with Misuse,Ministry of Justice (Ukraine),...,Not available,Ukraine,Security Service of Ukraine (SBU),Not available,Not available,Countermeasures under international law justif...,Not available,https://www.t-online.de/nachrichten/ukraine/id...,2024-12-23,2025-02-18
1,4161,Unknown threat actors stole Microsoft Azure ac...,Unit 42 researchers uncovered a phishing campa...,01.06.2024,Not available,Attack on critical infrastructure target(s),Not available,Incident disclosed by IT-security company,Data theft;Hijacking with Misuse,Not available;Not available;Not available;Not ...,...,Not available,Not available,Not available,Not available,Not available,Unfriendly acts/retorsions justified (missing ...,Not available,https://www.bleepingcomputer.com/news/security...,2024-12-20,2025-02-18
2,4160,Unspecified US intelligence agencies stole tra...,The Chinese National Internet Emergency Respon...,01.05.2023,Not available,Attack conducted by nation state (generic “sta...,Not available,Incident disclosed by authorities of victim state,Data theft;Hijacking with Misuse,Not available,...,Not available,Not available,Not available,Not available,Not available,Countermeasures under international law justif...,Not available,https://cyberscoop.com/chinese-cyber-center-us...,2024-12-20,2025-02-18
3,4159,Unspecified US intelligence agency stole trade...,The Chinese National Internet Emergency Respon...,01.08.2024,Not available,Attack conducted by nation state (generic “sta...,Not available;Not available,Incident disclosed by authorities of victim state,Data theft;Hijacking with Misuse,Not available;Not available,...,Not available,Not available,Not available,Not available,Not available,Countermeasures under international law justif...,Not available,https://cyberscoop.com/chinese-cyber-center-us...,2024-12-20,2025-02-18
4,4158,Unknown Threat Actors breached the Attorney Ge...,"In March 2024, a hacker breached the computer ...",01.03.2024,Not available,"Attack on (inter alia) political target(s), no...",Not available,Not available,Data theft & Doxing;Hijacking with Misuse,Attorney General's Office of Nuevo León,...,Not available,Not available,Not available,Not available,Not available,Unfriendly acts/retorsions justified (missing ...,Not available,https://mvsnoticias.com/nuevo-leon/2024/12/18/...,2024-12-20,2025-02-04


In [4]:
# Load credentials for telegram and openai
dotenv.load_dotenv('secrets.env')
telegram_api_id = os.getenv('TELEGRAM_API_ID')
telegram_api_hash = os.getenv('TELEGRAM_API_HASH')
telegram_phone_number = os.getenv('TELEGRAM_PHONE_NUMBER')
openai_key = os.getenv('OPENAI_API_KEY')
assert telegram_api_id is not None, "TELEGRAM_API_ID not found in secrets.env"
assert telegram_api_hash is not None, "TELEGRAM_API_HASH not found in secrets.env"
assert telegram_phone_number is not None, "TELEGRAM_PHONE_NUMBER not found in secrets.env"
assert openai_key is not None, "OPENAI_API_KEY not found in secrets.env"
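
The assertions above stop at the first missing variable. As a minimal sketch, assuming you would rather see every missing credential at once, the check could be written as a small helper. `missing_credentials` and `example_env` are hypothetical names used only for illustration.

```python
import os

# Variable names taken from the cell above.
REQUIRED = (
    "TELEGRAM_API_ID",
    "TELEGRAM_API_HASH",
    "TELEGRAM_PHONE_NUMBER",
    "OPENAI_API_KEY",
)

def missing_credentials(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED if not env.get(name)]

# Illustrative environment with two of the four values filled in.
example_env = {"TELEGRAM_API_ID": "12345", "OPENAI_API_KEY": "placeholder"}
print(missing_credentials(example_env))
# ['TELEGRAM_API_HASH', 'TELEGRAM_PHONE_NUMBER']
```

Calling `missing_credentials()` with no argument checks the real process environment, so it can run right after `dotenv.load_dotenv('secrets.env')`.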

## Get links from data source

In [18]:
# Filter for incidents whose initiator_country mentions Ukraine
links_regex = r"(https?://[^\s]+)"
ukraine_incidents = pd.DataFrame(cyber_database[cyber_database["initiator_country"].str.contains("Ukraine", na=False)])

# Extract links from the 'attribution_source_url' column
ukraine_incidents["extracted_links"] = ukraine_incidents["attribution_source_url"].apply(
    lambda text: re.findall(links_regex, text) if isinstance(text, str) else []
)
links = ukraine_incidents["extracted_links"].explode().dropna().unique()
links[0:5]

array(['https://kyivindependent.com/hur-gazprombank-cyberattack/',
       'https://x.com/sudormRF6/status/1843153079535046660?prefetchTimestamp=1728373202822',
       'https://www.ukrinform.ua/rubric-ato/3898491-kiberfahivci-gur-zablokuvali-desatki-resursiv-promislovih-obektiv-rosii-dzerelo.html',
       'https://www.ukrinform.net/rubric-society/3896123-ukrainian-hackers-block-work-of-russian-nuclear-weapons-manufacturer.html',
       'https://t.me/cyber_anarchy_squad/215'], dtype=object)
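
To make the extraction step easier to follow, here is a standard-library sketch of the same idea on made-up cells: find every URL in each cell, treat non-string cells as empty, and keep the first occurrence of each link. The `example.org` URLs and the helper names `extract_links` and `unique_in_order` are illustrative assumptions, not part of the notebook.

```python
import re

LINKS_REGEX = r"(https?://[^\s]+)"  # same pattern as in the cell above

def extract_links(text):
    """Return all URLs in a cell, or an empty list for non-string cells such as NaN."""
    if not isinstance(text, str):
        return []
    return re.findall(LINKS_REGEX, text)

def unique_in_order(lists_of_links):
    """Flatten lists of links and drop duplicates, keeping first-seen order."""
    seen = {}
    for links in lists_of_links:
        for link in links:
            seen.setdefault(link, None)
    return list(seen)

cells = [
    "https://example.org/a https://example.org/b",
    float("nan"),  # missing value, as pandas would represent it
    "see https://example.org/a for details",
]
links = unique_in_order(extract_links(cell) for cell in cells)
print(links)
# ['https://example.org/a', 'https://example.org/b']
```

Because `[^\s]+` stops only at whitespace, punctuation or a separator such as `;` that directly follows a URL is captured as part of it. If a cell packs several URLs together without spaces, splitting on the separator first may be needed.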

## Configure Workflow

In [19]:
from pydantic import BaseModel
from typing import List

# Setup scraper
# This will ask for an auth code sent to your Telegram app the first time you run it
telegram_scraper = TelegramMessageScraper(telegram_api_id, telegram_api_hash, telegram_phone_number, session_name="my_session")
multi_scraper = UkraineCyberMultiScraper(telegram_message_scraper=telegram_scraper)

# Setup analyzer
instruction = """
Extract information from the text about the specific locations directly impacted by the cyber attack. The locations must be more specific than the country (i.e., city or province).
If the text does not contain a specific location, return an empty list.
Also provide short evidence from the text that supports the location.
Return a JSON object with the following fields:
{
    "location": ["location1", "location2", ...],
    "evidence": ["evidence1", "evidence2", ...]
}
"""
class ExpectedOutput(BaseModel):
    location: List[str]
    evidence: List[str]
analyzer = OpenAiAnalyzer(
    api_key=openai_key,
    instruction = instruction,
    pydantic_schema=ExpectedOutput)

# Setup workflow
workflow = Workflow(multi_scraper, analyzer)

Signed in successfully as Peter Naph; remember to not break the ToS or you will risk an account ban!
[IngressRule(match=re.compile('https://cert.gov.ua/article/\\d+'), scraper=<scraipe_cyber.ukraine.cert_ua_scraper.CertUaScraper object at 0x7f8dc0132750>), IngressRule(match=re.compile('https://t.me/[^/]+/[0-9]+'), scraper=<scraipe.extended.telegram_message_scraper.TelegramMessageScraper object at 0x7f8dc1774890>), IngressRule(match=re.compile('.*'), scraper=<scraipe.defaults.text_scraper.TextScraper object at 0x7f8dc121df50>), IngressRule(match=re.compile('.*'), scraper=<scraipe.defaults.text_scraper.TextScraper object at 0x7f8db7dda290>)]
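
The printed rule list suggests that each link is routed to a scraper by regular expression. The sketch below illustrates that routing idea with the patterns copied from the output. It is not the scraipe implementation: it assumes rules are tried in order and that the first match wins, and it uses plain string labels in place of scraper objects. A real multi-scraper may instead fall through to a later rule when an earlier scraper fails.

```python
import re

# Patterns copied from the printed IngressRule list; labels stand in for scraper objects.
RULES = [
    (re.compile(r"https://cert.gov.ua/article/\d+"), "CertUaScraper"),
    (re.compile(r"https://t.me/[^/]+/[0-9]+"), "TelegramMessageScraper"),
    (re.compile(r".*"), "TextScraper"),  # catch-all fallback
]

def route(link):
    """Return the label of the first rule whose pattern matches the start of the link."""
    for pattern, label in RULES:
        if pattern.match(link):
            return label
    return None

for link in (
    "https://cert.gov.ua/article/12345",
    "https://t.me/cyber_anarchy_squad/215",
    "https://kyivindependent.com/hur-gazprombank-cyberattack/",
):
    print(route(link), link)
```

Ordering matters in this scheme: the catch-all `.*` rule must come last, otherwise it would claim every link before the specialized rules are consulted.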


## Run Workflow

In [20]:
# Scrape links
workflow.scrape(links)
workflow.get_scrapes().head()

Scraping: 100%|██████████| 73/73 [00:19<00:00,  3.83link/s]


Unnamed: 0,link,content,scrape_success,scrape_error,metadata
0,https://gur.gov.ua/en/content/voienna-rozvidka...,Defence Intelligence of Ukraine conducted a cy...,True,,
1,https://www.newsweek.com/us-consulate-hacked-p...,U.S. Consulate Hacked by 'Putin Supporters' - ...,True,,
2,https://gur.gov.ua/en/content/zlam-federalnoi-...,Hacking of Federal Tax Service of the russian ...,True,,
3,https://gur.gov.ua/en/content/soft-shyfry-sekr...,"Software, Ciphers, Secret Documents — DIU Cybe...",True,,
4,https://www.bleepingcomputer.com/news/security...,Russian defense firm Rostec shuts down website...,True,,


In [22]:
# Extract location info with LLM
workflow.analyze()
workflow.get_analyses()

Analyzing: 100%|██████████| 52/52 [00:29<00:00,  1.76item/s]


Unnamed: 0,link,output,analysis_success,analysis_error
0,https://gur.gov.ua/en/content/voienna-rozvidka...,"{'location': [], 'evidence': []}",True,
1,https://www.newsweek.com/us-consulate-hacked-p...,"{'location': ['Milan'], 'evidence': ['Hackers ...",True,
2,https://gur.gov.ua/en/content/zlam-federalnoi-...,"{'location': ['Moscow', 'Crimea'], 'evidence':...",True,
3,https://gur.gov.ua/en/content/soft-shyfry-sekr...,"{'location': [], 'evidence': []}",True,
4,https://www.bleepingcomputer.com/news/security...,"{'location': [], 'evidence': []}",True,
...,...,...,...,...
68,https://therecord.media/ukrainian-hacktivists-...,,,
69,https://t.me/itarmyofukraine2022/855,"{'location': [], 'evidence': []}",True,
70,https://twitter.com/iiyonite/status/1512001395...,,,
71,https://www.welivesecurity.com/wp-content/uplo...,,,
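
Each populated `output` cell above has the shape declared by `ExpectedOutput`. As a rough sketch of what that schema enforces, the snippet below validates a raw JSON string against the same model and returns `None` when the shape is wrong. `parse_output` is a hypothetical helper, and this is not a description of how `OpenAiAnalyzer` validates responses internally.

```python
import json
from typing import List, Optional

from pydantic import BaseModel, ValidationError

class ExpectedOutput(BaseModel):
    location: List[str]
    evidence: List[str]

def parse_output(raw: str) -> Optional[ExpectedOutput]:
    """Parse a JSON string into ExpectedOutput, or return None if it does not fit."""
    try:
        return ExpectedOutput(**json.loads(raw))
    except (json.JSONDecodeError, ValidationError, TypeError):
        return None

ok = parse_output('{"location": ["Kherson"], "evidence": ["short quote"]}')
bad = parse_output('{"location": "Kherson"}')  # wrong type, and 'evidence' is missing

print(ok.location if ok else None)
print(bad)
```

Returning `None` instead of raising keeps one malformed response from aborting a whole batch; the caller can record the failure alongside the link.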


## Output

In [25]:
# Export results
export = workflow.export(verbose=True)
# Display rows with locations extracted
display(export[export["location"].str.len() > 0].head())

Unnamed: 0,link,scrape_success,scrape_error,analysis_success,analysis_error,location,evidence
1,https://www.newsweek.com/us-consulate-hacked-p...,True,,True,,[Milan],[Hackers took over the U.S. Consulate Milan's ...
2,https://gur.gov.ua/en/content/zlam-federalnoi-...,True,,True,,"[Moscow, Crimea]",[Communication between the central office in m...
5,https://jeffreycaruso.substack.com/p/another-g...,True,,True,,[Yambur],[a section of the Yambur gas pipeline — Elets-...
6,https://www.ibtimes.com/team-onefist-hackers-s...,True,,True,,"[Rostelecom, Kherson]",[Team OneFist Hackers Strike Russia's Rostelec...
7,https://jeffreycarr.substack.com/p/rostelecoms...,True,,True,,"[Sochi, Kyiv, Leningrad]",[Sochi is also where Putin conducts many of hi...


In [None]:
# Number of links that were successfully scraped and analyzed
success_count = (export["scrape_success"] & export["analysis_success"]).sum()
print(f"Successfully scraped and analyzed links: {success_count}/{len(export)} ({success_count/len(export)*100:.2f}%)")

# Number of analyses that yielded at least one location
populated_count = (export["location"].str.len() > 0).sum()
print(f"Successfully extracted locations from analyses: {populated_count}/{success_count} ({populated_count/success_count*100:.2f}%)")

Successfully scraped and analyzed links: 52/73 (71.23%)
Successfully extracted locations from analyses: 33/52 (63.46%)
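
The second percentage divides by the number of successful analyses, which would be zero if every scrape or analysis failed. A small sketch of a guarded formatter, using the counts reported above as inputs, is shown below. `rate` is a hypothetical helper name.

```python
def rate(count, total):
    """Format 'count/total (pct%)', avoiding division by zero when total is 0."""
    if total == 0:
        return f"{count}/{total} (n/a)"
    return f"{count}/{total} ({count / total * 100:.2f}%)"

print("Successfully scraped and analyzed links:", rate(52, 73))
print("Successfully extracted locations from analyses:", rate(33, 52))
print("Empty run:", rate(0, 0))
```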
