## Ukraine EuRepoC Research

In this notebook, we construct a list of links from the sources of incidents in the EuRepoC database that impact Ukraine. The EuRepoC database records the receiver country of each incident, but not the specific cities or provinces impacted. We extract this additional information for incidents that impact Ukraine.

We use `UkraineCyberMultiScraper`, a custom `MultiScraper` that combines several specialized scrapers: `CertUaScraper`, plus the bundled [`TelegramMessageScraper`, a fallback `NewsScraper`, and a `TextScraper`](https://scraipe.readthedocs.io/en/latest/get_started/bundled_components/).

## Setup
Install packages, load the dataset, and load credentials from `secrets.env`.

In [6]:
%pip install scraipe-cyber --quiet
%pip install python-dotenv --quiet

Note: you may need to restart the kernel to use updated packages.


A template file, `secrets.env.template`, is provided. Rename it to `secrets.env` and fill in your credentials.

In [8]:
# Import Dependencies
from scraipe_cyber.ukraine import UkraineCyberMultiScraper
from scraipe.extended import TelegramMessageScraper
from scraipe.extended import OpenAiAnalyzer
from scraipe import Workflow
import pandas as pd
import re
import dotenv
import os

In [10]:
# Load credentials for Telegram and OpenAI
dotenv.load_dotenv('secrets.env')
telegram_api_id = os.getenv('TELEGRAM_API_ID')
telegram_api_hash = os.getenv('TELEGRAM_API_HASH')
openai_key = os.getenv('OPENAI_API_KEY')
# Check by variable name so the error reports which variable is missing
for name in ['TELEGRAM_API_ID', 'TELEGRAM_API_HASH', 'OPENAI_API_KEY']:
    if os.getenv(name) is None:
        raise ValueError(f"Missing environment variable: {name} in secrets.env file. Please set it up before running the script.")

In [14]:
# Download the dataset
database_link = "https://zenodo.org/records/14965395/files/eurepoc_global_dataset_1_3.csv?download=1"
cyber_database = pd.read_csv(database_link, sep=",", encoding="utf-8")
cyber_database

Unnamed: 0,incident_id,name,description,start_date,end_date,inclusion_criterion,inclusion_criterion_subcode,source_disclosure,incident_type,receiver_name,...,legal_response_subtype,legal_response_responding_country,legal_response_responding_actor,attribution_legal_reference,attribution_legal_reference_subcode,response_indicator,casualties,source_url,added_to_db,updated_at
0,4163,Russian State-Sponsored Actors Linked to GRU ...,"On 19 December 2024, a cyber attack attributed...",19.12.2024,19.12.2024,Attack conducted by nation state (generic “sta...,Not available;Not available,Incident disclosed by authorities of victim state,Disruption;Hijacking with Misuse,Ministry of Justice (Ukraine),...,Not available,Ukraine,Security Service of Ukraine (SBU),Not available,Not available,Countermeasures under international law justif...,Not available,https://www.t-online.de/nachrichten/ukraine/id...,2024-12-23,2025-02-18
1,4161,Unknown threat actors stole Microsoft Azure ac...,Unit 42 researchers uncovered a phishing campa...,01.06.2024,Not available,Attack on critical infrastructure target(s),Not available,Incident disclosed by IT-security company,Data theft;Hijacking with Misuse,Not available;Not available;Not available;Not ...,...,Not available,Not available,Not available,Not available,Not available,Unfriendly acts/retorsions justified (missing ...,Not available,https://www.bleepingcomputer.com/news/security...,2024-12-20,2025-02-18
2,4160,Unspecified US intelligence agencies stole tra...,The Chinese National Internet Emergency Respon...,01.05.2023,Not available,Attack conducted by nation state (generic “sta...,Not available,Incident disclosed by authorities of victim state,Data theft;Hijacking with Misuse,Not available,...,Not available,Not available,Not available,Not available,Not available,Countermeasures under international law justif...,Not available,https://cyberscoop.com/chinese-cyber-center-us...,2024-12-20,2025-02-18
3,4159,Unspecified US intelligence agency stole trade...,The Chinese National Internet Emergency Respon...,01.08.2024,Not available,Attack conducted by nation state (generic “sta...,Not available;Not available,Incident disclosed by authorities of victim state,Data theft;Hijacking with Misuse,Not available;Not available,...,Not available,Not available,Not available,Not available,Not available,Countermeasures under international law justif...,Not available,https://cyberscoop.com/chinese-cyber-center-us...,2024-12-20,2025-02-18
4,4158,Unknown Threat Actors breached the Attorney Ge...,"In March 2024, a hacker breached the computer ...",01.03.2024,Not available,"Attack on (inter alia) political target(s), no...",Not available,Not available,Data theft & Doxing;Hijacking with Misuse,Attorney General's Office of Nuevo León,...,Not available,Not available,Not available,Not available,Not available,Unfriendly acts/retorsions justified (missing ...,Not available,https://mvsnoticias.com/nuevo-leon/2024/12/18/...,2024-12-20,2025-02-04
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3409,6,"""First Sino-US-Cyber-War"" II",After the collision of an American spy plane a...,01.05.2001,Not available,Attack conducted by non-state group / non-stat...,Not available;Not available,Incident disclosed by attacker,Disruption,Not available,...,Not available,Not available,Not available,Not available,Not available,Not available,Not available,https://www.nytimes.com/2001/05/13/weekinrevie...,2022-08-15,2024-12-19
3410,5,Honker Union of China defaced US government an...,After the collision of an American spy plane a...,01.04.2001,Not available,Attack conducted by non-state group / non-stat...,Not available;Not available,Incident disclosed by IT-security company,Disruption,Not available;Not available,...,Not available,Not available,Not available,Not available,Not available,Not available,Not available,https://www.nytimes.com/2001/05/13/weekinrevie...,2022-08-15,2024-02-23
3411,4,Chinese hacktivists targeted Taiwanese governm...,Chinese hackers succeeded in attacking several...,20.05.2000,20.05.2000,Attack conducted by non-state group / non-stat...,Not available;Not available,Incident disclosed by attacker,Disruption,Not available,...,Not available,Not available,Not available,Not available,Not available,Not available,Not available,http://www.hartford-hwp.com/archives/55/105.ht...,2022-08-15,2024-02-23
3412,3,Armenian hacktivists target Azerbaijani webpag...,In response to previous DDoS-operations agains...,01.02.2000,Not available,Attack conducted by non-state group / non-stat...,Not available;Not available,Incident disclosed by attacker,Disruption,Not available;Not available;Not available,...,Not available,Not available,Not available,Not available,Not available,Not available,Not available,Not available,2022-08-15,2024-02-23


## Get links from data source

In [17]:
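# Filter for Ukraine-related incidents. Note: this matches on "initiator_country",
# i.e. incidents attributed to actors in Ukraine, not incidents received by Ukraine.
links_regex = r"(https?://[^\s]+)"
ukraine_incidents = pd.DataFrame(cyber_database[cyber_database["initiator_country"].str.contains("Ukraine", na=False)])

# Extract links from the 'attribution_source_url' column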
ukraine_incidents["extracted_links"] = ukraine_incidents["attribution_source_url"].apply(
    lambda text: re.findall(links_regex, text) if isinstance(text, str) else []
)
links = ukraine_incidents["extracted_links"].explode().dropna().unique()
pd.DataFrame(links)

Unnamed: 0,0
0,https://kyivindependent.com/hur-gazprombank-cy...
1,https://x.com/sudormRF6/status/184315307953504...
2,https://www.ukrinform.ua/rubric-ato/3898491-ki...
3,https://www.ukrinform.net/rubric-society/38961...
4,https://t.me/cyber_anarchy_squad/215
...,...
68,https://www.bleepingcomputer.com/news/security...
69,https://medium.com/dfrlab/breaking-down-the-su...
70,https://web.archive.org/web/20171202045106/htt...
71,https://www.welivesecurity.com/wp-content/uplo...
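The link extraction above can be illustrated on its own with the standard library. This is a minimal sketch, not the notebook's method: it assumes that multi-valued cells may join several URLs with `;` (as other EuRepoC columns do with `Not available;Not available`), so the pattern here also stops at `;`, and it trims trailing `.` and `,`. The names `extract_links`, `unique_links`, and the `example.org` URLs are hypothetical.

```python
import re

# Assumption: ';' separates multiple values inside one cell, so a URL ends there.
LINK_PATTERN = re.compile(r"https?://[^\s;]+")

def extract_links(cell):
    """Return the URLs found in one cell; non-string cells (e.g. NaN) yield []."""
    if not isinstance(cell, str):
        return []
    # Trim sentence punctuation that often clings to URLs in free text.
    return [url.rstrip(".,") for url in LINK_PATTERN.findall(cell)]

def unique_links(cells):
    """Flatten per-cell results, dropping duplicates and keeping first-seen order."""
    seen = {}
    for cell in cells:
        for url in extract_links(cell):
            seen.setdefault(url, None)
    return list(seen)

sample = [
    "https://example.org/a;https://example.org/b",  # two URLs in one cell
    "See https://example.org/a.",                   # duplicate with trailing dot
    float("nan"),                                   # missing value
    "Not available",                                # placeholder text, no URL
]
print(unique_links(sample))
# ['https://example.org/a', 'https://example.org/b']
```

With the notebook's original pattern `https?://[^\s]+`, the first sample cell would be captured as a single string containing both URLs, which is worth checking for in the real column before scraping.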


## Configure Workflow

In [19]:
from pydantic import BaseModel
from typing import List

# Setup scraper
# This will ask for a QR code to be scanned
telegram_scraper = TelegramMessageScraper(telegram_api_id, telegram_api_hash, session_name="my_session")
multi_scraper = UkraineCyberMultiScraper(telegram_message_scraper=telegram_scraper)

# Setup analyzer
instruction = """
Extract information from the text about the specific locations directly impacted by the cyber attack. The locations must be more specific than the country (i.e., city or province).
If the text does not contain a specific location, return an empty list.
Also provide short evidence from the text that supports the location.
Return a JSON object with the following fields:
{
    "location": ["location1", "location2", ...],
    "evidence": ["evidence1", "evidence2", ...]
}
"""
class ExpectedOutput(BaseModel):
    location: List[str]
    evidence: List[str]
analyzer = OpenAiAnalyzer(
    api_key=openai_key,
    instruction = instruction,
    pydantic_schema=ExpectedOutput)

# Setup workflow
workflow = Workflow(multi_scraper, analyzer)



Please scan the QR code from the Telegram app:
[QR code output omitted]
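The instruction asks the model for two parallel lists, `location` and `evidence`. The sketch below shows one way to pair them after the fact. It uses only the standard library `json` module rather than pydantic, and it assumes the response arrives as a JSON string. The function name `parse_location_output` and the placeholder evidence text are hypothetical.

```python
import json

def parse_location_output(raw):
    """Parse one JSON response and pair each location with its evidence.

    Returns a list of (location, evidence) tuples. Evidence is None when the
    model returned fewer evidence strings than locations.
    """
    data = json.loads(raw)
    locations = data.get("location", [])
    evidence = data.get("evidence", [])
    if not isinstance(locations, list) or not isinstance(evidence, list):
        raise ValueError("'location' and 'evidence' must both be lists")
    # Pad evidence so a short list does not silently drop locations in zip().
    padded = evidence + [None] * (len(locations) - len(evidence))
    return list(zip(locations, padded))

raw = '{"location": ["Moscow", "St. Petersburg"], "evidence": ["placeholder evidence"]}'
print(parse_location_output(raw))
# [('Moscow', 'placeholder evidence'), ('St. Petersburg', None)]
print(parse_location_output('{"location": [], "evidence": []}'))
# []
```

Pairing the lists makes mismatched lengths visible, which the schema alone does not enforce.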

## Run Workflow

In [20]:
# Scrape links
workflow.scrape(links)
workflow.get_scrapes().head()

Scraping:   0%|          | 0/73 [00:00<?, ?link/s]

ERROR:root:Failed to scrape https://t.me/cyberResistanceUA/397: Message 397 from cyberResistanceUA is None.
ERROR:root:Failed to scrape https://t.me/sudo_RM_RF_6/37: Failed to get chat for {chat_name}
ERROR:root:Failed to scrape https://t.me/dfhmara/41: Failed to get chat for {chat_name}


Unnamed: 0,link,content,scrape_success,scrape_error,metadata
0,https://t.me/itarmyofukraine2022/1701,"Поки ви тут пили лате, наші великородні північ...",True,,
1,https://gur.gov.ua/en/content/voienna-rozvidka...,Defence Intelligence of Ukraine conducted a cy...,True,,
2,https://www.ukrinform.ua/rubric-ato/3898491-ki...,Кіберфахівці ГУР заблокували десятки ресурсів ...,True,,
3,https://www.kyivpost.com/post/31798,Ukrainian Hackers Launch Cyberattacks on Subsi...,True,,
4,https://twitter.com/cyber_etc/status/151786767...,,False,No scraper could handle link; TextScraper[FAIL...,
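To see which sources fail most often, failed scrapes can be grouped by hostname. This is a minimal standard-library sketch that assumes the scrape results are available as `(link, scrape_success)` pairs. The function name `failures_by_host` is hypothetical, and the failing links are the three from the error log above.

```python
from collections import Counter
from urllib.parse import urlparse

def failures_by_host(rows):
    """Count failed scrapes per hostname from (link, scrape_success) pairs."""
    counts = Counter()
    for link, success in rows:
        if not success:
            counts[urlparse(link).netloc.lower()] += 1
    return counts

rows = [
    ("https://t.me/cyberResistanceUA/397", False),
    ("https://t.me/sudo_RM_RF_6/37", False),
    ("https://t.me/dfhmara/41", False),
    ("https://www.kyivpost.com/post/31798", True),
]
print(failures_by_host(rows).most_common())
# [('t.me', 3)]
```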


In [21]:
# Extract location info with LLM
workflow.analyze()
workflow.get_analyses()

Analyzing:   0%|          | 0/52 [00:00<?, ?link/s]

Unnamed: 0,link,output,analysis_success,analysis_error
0,https://t.me/itarmyofukraine2022/1701,"{'location': [], 'evidence': []}",True,
1,https://gur.gov.ua/en/content/voienna-rozvidka...,"{'location': [], 'evidence': []}",True,
2,https://www.ukrinform.ua/rubric-ato/3898491-ki...,"{'location': ['Острогозьк', 'Чайка-сервіс'], '...",True,
3,https://www.kyivpost.com/post/31798,"{'location': ['Moscow', 'St. Petersburg'], 'ev...",True,
4,https://jeffreycaruso.substack.com/p/another-g...,"{'location': ['Yambur'], 'evidence': ['a secti...",True,
5,https://jeffreycarr.substack.com/p/kalashnikov...,"{'location': [], 'evidence': []}",True,
6,https://t.me/Hdr0_one/130,"{'location': ['Крым', 'Алтайский Край'], 'evid...",True,
7,https://t.me/itarmyofukraine2022/763,"{'location': [], 'evidence': []}",True,
8,https://www.ukrinform.net/rubric-society/38961...,"{'location': ['Snezhinsk', 'Chelyabinsk region...",True,
9,https://www.newsweek.com/us-consulate-hacked-p...,"{'location': ['Milan'], 'evidence': ['The U.S....",True,


## Output

In [22]:
# Export results
export = workflow.export(verbose=True)
# Display rows with locations extracted
display(export[export["location"].str.len() > 0].head())

Unnamed: 0,link,scrape_success,scrape_error,analysis_success,analysis_error,location,evidence
2,https://www.ukrinform.ua/rubric-ato/3898491-ki...,True,,True,,"[Острогозьк, Чайка-сервіс]","[уражено польовий склад боєприпасів, розташова..."
3,https://www.kyivpost.com/post/31798,True,,True,,"[Moscow, St. Petersburg]",[caused severe disruption to internet services...
5,https://jeffreycaruso.substack.com/p/another-g...,True,,True,,[Yambur],[a section of the Yambur gas pipeline — Elets-...
7,https://t.me/Hdr0_one/130,True,,True,,"[Крым, Алтайский Край]","[временного оккупации Крыма, временного оккупа..."
13,https://www.ukrinform.net/rubric-society/38961...,True,,True,,"[Snezhinsk, Chelyabinsk region]",[the Internet service provider Vega from the c...


In [23]:
# Number of links that were successfully scraped and analyzed
success_count = (export["scrape_success"] & export["analysis_success"]).sum()
print(f"Successfully scraped and analyzed links: {success_count}/{len(export)} ({success_count/len(export)*100:.2f}%)")

# Number of successful extractions from analyses
populated_count = (export["location"].str.len() > 0).sum()
print(f"Successfully extracted locations from analyses: {populated_count}/{success_count} ({populated_count/success_count*100:.2f}%)")

Successfully scraped and analyzed links: 52/73 (71.23%)
Successfully extracted locations from analyses: 36/52 (69.23%)
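The two percentages use different denominators, so it can help to also compute the end-to-end yield, meaning links with at least one location out of all links. The sketch below reuses the counts printed above. The helper name `rate` is hypothetical, and it guards against a zero denominator, which the cell above does not.

```python
def rate(part, whole):
    """Format part/whole as a count and a percentage; 'n/a' if whole is zero."""
    if whole == 0:
        return f"{part}/{whole} (n/a)"
    return f"{part}/{whole} ({part / whole:.2%})"

total_links, analyzed, with_location = 73, 52, 36

print("Scraped and analyzed:", rate(analyzed, total_links))
# Scraped and analyzed: 52/73 (71.23%)
print("Locations among analyzed:", rate(with_location, analyzed))
# Locations among analyzed: 36/52 (69.23%)
print("Locations among all links:", rate(with_location, total_links))
# Locations among all links: 36/73 (49.32%)
```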
