# Goal:
  
  * Review the candidate GitHub datasets and select the ones to use
  * Create a pipeline that standardizes datasets in different formats (PDF, HTML, JSON) into JSON
  * Preprocess the `CyberMetric` dataset and use it as the `EvalSet`
  * Create a script to chunk the raw unstructured text extracted from cybersecurity PDFs and HTML pages
  * Create an `SFT synthetic dataset` of `instruction-response` pairs, either with custom scripts or with the distilabel framework
  * Fine-tune one of the models (possibly `Llama 3.2`) on the SFT synthetic dataset
  * Evaluate the instruction-fine-tuned model on the `CyberMetric` dataset
  * Push to `HuggingFace`
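
The chunking step listed above could look like the following minimal sketch. It assumes fixed-size character windows with overlap; `chunk_text`, `chunk_size`, and `overlap` are hypothetical names that do not appear in this notebook, and the real chunking script may well use a different strategy (for example token- or sentence-based splitting).

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> List[str]:
    """Split text into overlapping character windows (hypothetical helper)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    # Collapse runs of whitespace left over from PDF/HTML extraction.
    cleaned = " ".join(text.split())
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(cleaned), step):
        chunks.append(cleaned[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(cleaned):
            break
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, overlap=1))
```

The overlap keeps a sentence that straddles a window boundary visible in both neighbouring chunks, at the cost of some duplicated text.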

## Datasets

1. [CyberMetric Dataset](https://github.com/cybermetric/CyberMetric)
2. [NIST Technical Series Publications](https://github.com/usnistgov/NIST-Tech-Pubs?tab=readme-ov-file)
3. [NIST-Cybersecurity-Documents](https://github.com/fractional-ciso/NIST-Cybersecurity-Documents)
4. [Canadian Institute for Cybersecurity datasets](https://www.unb.ca/cic/datasets/index.html)

## Data Exploration

After looking carefully into the provided datasets, I decided to use the first dataset in the list above, `CyberMetric`, as my eval set, for two reasons:

 * It is a benchmark dataset used to evaluate LLMs on their knowledge of cybersecurity. See the description in the official README file [here](https://github.com/cybermetric/CyberMetric/blob/main/README.md).

 * The dataset contains question-answer pairs, which makes it ideal as ground truth.
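
As a rough illustration of how such question-answer pairs can be turned into eval records, here is a minimal sketch. It assumes each CyberMetric file has a top-level `questions` list whose items carry `question`, `answers` (a letter-to-text mapping), and `solution` (the correct letter); please check this layout against the actual files. `flatten_cybermetric` and the sample question are made up for this example.

```python
import json
from typing import Any, Dict, List


def flatten_cybermetric(raw: Dict[str, Any]) -> List[Dict[str, str]]:
    """Turn the assumed CyberMetric layout into flat eval records."""
    records = []
    for item in raw.get("questions", []):
        answers = item["answers"]
        # Render the options as "A) ...", "B) ..." in letter order.
        options = "\n".join(f"{key}) {text}" for key, text in sorted(answers.items()))
        records.append({
            "prompt": f"{item['question']}\n{options}",
            "label": item["solution"],
            "label_text": answers[item["solution"]],
        })
    return records


# Tiny made-up sample in the assumed layout (not taken from the dataset).
sample = json.loads("""
{"questions": [{"question": "Which port does HTTPS use by default?",
                "answers": {"A": "21", "B": "443", "C": "25", "D": "110"},
                "solution": "B"}]}
""")
eval_records = flatten_cybermetric(sample)
print(eval_records[0]["label"], eval_records[0]["label_text"])
```

Keeping the correct letter as `label` makes scoring a simple string comparison against the model's chosen option.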

Secondly, I explored the repository for dataset 2 above, the NIST Technical Series Publications. I think it is a good dataset for research purposes and well suited to adapting our model to these publications. Only a sample of the HTML publications was used.

Lastly, the NIST cybersecurity documents are also used, as they contain PDFs covering different cybersecurity topics. I did not use the Canadian Institute for Cybersecurity datasets for now: they are complex, would require additional processing, and there was not enough time to explore them. Data exploration is often said to take up to 70% of the machine learning development cycle, because data quality determines whether the model receives the right context and knowledge. Given the deadline for this assessment, only the first 3 data sources are used.



### Data Preprocessing

* Write a script to fetch data from the data sources (GitHub repos in this case).
* Only a sample of the PDF and HTML files is used, because the sources contain many files and my compute and storage are limited.


In [1]:
!pip install langchain
!pip install langchain-openai
!pip install langchain_community
!pip install datasets
!pip install pypdf
!pip install pdfplumber




In [2]:
!pip install loguru -q

In [3]:
!pip install pymupdf -q


In [4]:
import os
import json
import pdfplumber
import requests
import uuid
import fitz
import re
from bs4 import BeautifulSoup
from git import Repo, GitError
from langchain_community.document_loaders import GitLoader
from pathlib import Path
from tqdm import tqdm
from typing import List, Dict, Optional, Any
from loguru import logger
from huggingface_hub import HfApi, login
from google.colab import userdata
from urllib.parse import urljoin,unquote

In [5]:
DATA_SOURCES_DICT = {
    "cybermetric": {
        "url": "https://github.com/cybermetric/CyberMetric",
        "file_types": [".json"],
        "output_dir": "./data/cybermetric"
    },
    "nist_cyber": {
        "url": "https://github.com/fractional-ciso/NIST-Cybersecurity-Documents",
        "file_types": [".pdf"],
        "output_dir": "./data/nist_cyber",
        "max_files": 5
    },
    "nist_pubs": {
        "url": "https://github.com/usnistgov/NIST-Tech-Pubs",
        "file_types": [".html"],
        "output_dir": "./data/nist_pubs",
        "max_files": 5
    }
}

In [6]:
HF_USERNAME = "Tiamz"

In [7]:
HF_TOKEN = userdata.get('HF_TOKEN')
login(token=HF_TOKEN)

In [8]:
OPENAI_API_KEY = userdata.get('openai_api_key')

In [9]:
api = HfApi()

In [10]:
!rm -rf ./temp ./data

In [11]:
def clone_and_process_repo(repo_url, file_types, save_path, max_files=None):
    """Clone repo and find files with optional limit"""
    repo_name = repo_url.split("/")[-1].replace(".git", "")
    repo_path = f"./temp/{repo_name}"

    # Shallow clone: only the latest snapshot of the files is needed
    Repo.clone_from(repo_url, to_path=repo_path, depth=1)

    Path(save_path).mkdir(parents=True, exist_ok=True)

    file_counts = {ft: 0 for ft in file_types}
    processed_files = []

    if max_files is None:
        max_files = float('inf')

    for root, _, files in os.walk(repo_path):
        for file in files:
            for ft in file_types:
                if file.endswith(ft) and file_counts[ft] < max_files:
                    processed_files.append(os.path.join(root, file))
                    file_counts[ft] += 1
                    break

            # Stop once the overall cap is reached (the cap is shared across file types)
            if sum(file_counts.values()) >= max_files:
                return processed_files

    return processed_files
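

`os.walk` does not guarantee any particular order, so the sample chosen under `max_files` can differ between runs or machines. The sketch below shows one way to make the selection reproducible by sorting before truncating. `collect_files` is a hypothetical helper, not part of the pipeline above, and it applies one overall cap rather than a per-type cap.

```python
import os
import tempfile
from pathlib import Path
from typing import List, Optional, Sequence


def collect_files(root: str, file_types: Sequence[str],
                  max_files: Optional[int] = None) -> List[str]:
    """Return up to max_files paths under root whose suffix is in file_types."""
    matches = sorted(
        str(path) for path in Path(root).rglob("*")
        if path.is_file() and path.suffix.lower() in file_types
    )
    # Sorting first means the same files are picked on every run.
    return matches if max_files is None else matches[:max_files]


# Demo on a throwaway directory with placeholder files.
with tempfile.TemporaryDirectory() as root:
    for name in ["b.pdf", "a.pdf", "notes.html"]:
        Path(root, name).write_text("placeholder")
    picked = collect_files(root, [".pdf"], max_files=1)
    picked_names = [os.path.basename(p) for p in picked]

print(picked_names)
```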

In [12]:
def process_pdf_link(pdf_url, output_dir):
    """Process individual PDF link and save as JSON"""
    try:

        pdf_name = unquote(pdf_url.split("/")[-1]).replace(".pdf", "")
        safe_name = re.sub(r"[^a-zA-Z0-9_-]", "_", pdf_name)
        json_filename = f"{safe_name}.json"
        output_path = os.path.join(output_dir, json_filename)


        if os.path.exists(output_path):
            return json_filename


        # A timeout keeps one unresponsive server from stalling the whole run
        response = requests.get(pdf_url, timeout=60)
        response.raise_for_status()

        with fitz.open(stream=response.content, filetype="pdf") as doc:
            content = " ".join([page.get_text() for page in doc])
            metadata = {
                "source_url": pdf_url,
                "title": doc.metadata.get("title", ""),
                "page_count": doc.page_count,
                "author": doc.metadata.get("author", ""),
                "creation_date": doc.metadata.get("creationDate", "")
            }


        with open(output_path, 'w') as f:
            json.dump({
                "id": str(uuid.uuid4()),
                "content": content,
                "metadata": metadata
            }, f, indent=2)

        return json_filename

    except Exception as e:
        print(f"Error processing PDF {pdf_url}: {str(e)}")
        return None
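

To make the naming rule in `process_pdf_link` easier to see, the sketch below isolates it in a small function. `pdf_url_to_json_name` and the example URL are hypothetical; the two transformation lines are copied from the cell above.

```python
import re
from urllib.parse import unquote


def pdf_url_to_json_name(pdf_url: str) -> str:
    """Mirror the naming rule used in process_pdf_link (URL -> JSON filename)."""
    # Decode %-escapes in the last path segment and drop the extension.
    pdf_name = unquote(pdf_url.split("/")[-1]).replace(".pdf", "")
    # Replace anything outside [a-zA-Z0-9_-] with an underscore.
    safe_name = re.sub(r"[^a-zA-Z0-9_-]", "_", pdf_name)
    return f"{safe_name}.json"


# Hypothetical URL, used only to illustrate the mapping.
example_name = pdf_url_to_json_name("https://example.org/docs/NIST%20SP%20800-53.pdf")
print(example_name)
```

One consequence of this rule is that two URLs sharing the same final path segment map to the same JSON file, so the `os.path.exists` check skips the second download.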

In [13]:
import io

def extract_pdf_content(pdf_url):
    """Extract text content from a PDF URL"""
    try:
        response = requests.get(pdf_url, timeout=60)
        response.raise_for_status()
        # pdfplumber needs a seekable file object, so buffer the download in memory
        with pdfplumber.open(io.BytesIO(response.content)) as pdf:
            # Extract each page once and skip pages without a text layer
            page_texts = (page.extract_text() for page in pdf.pages)
            return " ".join(text for text in page_texts if text)
    except Exception as e:
        print(f"Error processing PDF {pdf_url}: {str(e)}")
        return None

In [14]:
def parse_pdf_to_json(pdf_path):
    """Process local PDF files"""
    try:
        with fitz.open(pdf_path) as doc:
            content = " ".join([page.get_text() for page in doc])
            return {
                "id": str(uuid.uuid4()),
                "content": content,
                "metadata": {
                    "source": pdf_path,
                    "title": doc.metadata.get("title", ""),
                    "page_count": doc.page_count,
                    "author": doc.metadata.get("author", ""),
                    "creation_date": doc.metadata.get("creationDate", "")
                }
            }
    except Exception as e:
        print(f"Error processing PDF {pdf_path}: {str(e)}")
        return None

In [15]:
def parse_html_to_json(html_path, repo_url, output_dir):
    """Process HTML file and create individual JSONs for linked PDFs"""
    with open(html_path, 'r', encoding='utf-8') as f:
        soup = BeautifulSoup(f.read(), 'html.parser')
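        # NOTE: html_path is a local path, so this base only approximates the page's
        # real URL; absolute hrefs resolve correctly, relative ones may not.
        base_url = urljoin(repo_url, os.path.dirname(html_path) + "/")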

        pdf_links = [
            urljoin(base_url, a['href'])
            for a in soup.find_all('a', href=True)
            if a['href'].lower().endswith('.pdf')
        ]


        pdf_references = []
        for pdf_link in tqdm(pdf_links, desc="Processing PDF links"):
            json_name = process_pdf_link(pdf_link, output_dir)
            if json_name:
                pdf_references.append(json_name)


        return {
            "id": str(uuid.uuid4()),
            "content": soup.get_text(separator=' ', strip=True),
            "metadata": {
                "source": html_path,
                "title": soup.title.string if soup.title else "",
                "pdf_references": pdf_references,
                "file_type": "html"
            }
        }
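

Link resolution in `parse_html_to_json` relies on `urljoin`, so it helps to see how that function treats different kinds of `href`. The sketch below uses a hypothetical base URL and hypothetical links; only the `urljoin` behaviour itself is standard.

```python
from urllib.parse import urljoin

base = "https://example.org/pubs/index.html"  # hypothetical page URL

# An absolute href ignores the base entirely.
absolute = urljoin(base, "https://example.org/files/a.pdf")
# A relative href is resolved against the directory of the base.
relative = urljoin(base, "files/b.pdf")
# A root-relative href keeps only the scheme and host of the base.
root_relative = urljoin(base, "/files/c.pdf")

print(absolute)
print(relative)
print(root_relative)
```

Only the absolute case is independent of the base, so links of that kind are the ones that resolve reliably whatever base is supplied.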


In [16]:
def process_json_file(json_path):
    """Process CyberMetric JSON files"""
    with open(json_path) as f:
        data = json.load(f)

    return {
        "id": str(uuid.uuid4()),
        "content": data,
        "metadata": {
            "source": json_path,
            "file_type": "json"
        }
    }
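

All of the parsers above return records with the same three top-level keys: `id`, `content`, and `metadata`. A small check like the sketch below could catch malformed records before they are written or uploaded. `validate_record` and `REQUIRED_KEYS` are hypothetical names, and the rules are only an assumption about what a valid record should look like.

```python
from typing import Any, Dict, List

REQUIRED_KEYS = ("id", "content", "metadata")


def validate_record(record: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the record looks fine."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in record]
    # An empty string (e.g. a scanned PDF with no text layer) is flagged too.
    if "content" in record and not record["content"]:
        problems.append("empty content")
    if "metadata" in record and not isinstance(record["metadata"], dict):
        problems.append("metadata is not a dict")
    return problems


print(validate_record({"id": "1", "content": ""}))
```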


In [17]:

def process_files(file_paths, output_dir, repo_name, repo_url):
    """Main processing function"""
    for file_path in tqdm(file_paths, desc="Processing files"):
        try:
            if file_path.endswith(".pdf"):
                data = parse_pdf_to_json(file_path)
                if data:
                    base_name = os.path.basename(file_path).replace(".pdf", "")
                    json_filename = f"{base_name}.json"
                    local_path = os.path.join(output_dir, json_filename)
                    with open(local_path, 'w') as f:
                        json.dump(data, f, indent=2)

            elif file_path.endswith(".html"):
                html_data = parse_html_to_json(file_path, repo_url, output_dir)
                html_filename = os.path.basename(file_path).replace(".html", ".json")
                local_path = os.path.join(output_dir, html_filename)
                with open(local_path, 'w') as f:
                    json.dump(html_data, f, indent=2)

            elif file_path.endswith(".json"):
                # CyberMetric JSON files are not converted here; the eval file
                # is moved into the output directory in a later cell.
                pass

        except Exception as e:
            print(f"Error processing {file_path}: {str(e)}")

In [18]:
def main():
    for name, config in DATA_SOURCES_DICT.items():
        print(f"\nProcessing {name} repository...")

        files = clone_and_process_repo(
            config["url"],
            config["file_types"],
            config["output_dir"],
            max_files=config.get("max_files")
        )

        process_files(
            files,
            config["output_dir"],
            name,
            config["url"]  # Pass repo URL for link resolution
        )
        print(f"Completed processing {len(files)} files for {name}")

if __name__ == "__main__":
    main()


Processing cybermetric repository...


Processing files: 100%|██████████| 4/4 [00:00<00:00, 17734.90it/s]


Completed processing 4 files for cybermetric

Processing nist_cyber repository...


Processing files:  40%|████      | 2/5 [00:01<00:01,  2.06it/s]

MuPDF error: format error: No default Layer config



Processing files: 100%|██████████| 5/5 [00:03<00:00,  1.38it/s]


Completed processing 5 files for nist_cyber

Processing nist_pubs repository...


Processing files:   0%|          | 0/5 [00:00<?, ?it/s]
Processing PDF links: 100%|██████████| 10/10 [00:49<00:00,  4.92s/it]
Processing files:  20%|██        | 1/5 [00:49<03:16, 49.20s/it]
Processing PDF links:   0%|          | 0/12 [00:00<?, ?it/s]

Completed processing 5 files for nist_pubs





In [24]:
import shutil

In [25]:
if "CyberMetric-10000-v1.json" in os.listdir("./temp/CyberMetric"):
    shutil.move("./temp/CyberMetric/CyberMetric-10000-v1.json", "./data/cybermetric/CyberMetric-10000-v1.json")

In [26]:
from huggingface_hub import create_repo

In [27]:
repo_id = f"{HF_USERNAME}/cybersecurity-raw-json-datasets"
repo_type = "dataset"
create_repo(repo_id, repo_type= repo_type, exist_ok=True)

RepoUrl('https://huggingface.co/datasets/Tiamz/cybersecurity-raw-json-datasets', endpoint='https://huggingface.co', repo_type='dataset', repo_id='Tiamz/cybersecurity-raw-json-datasets')

In [29]:
data_path = "./data"

subfolders = ["cybermetric", "nist_cyber", "nist_pubs"]

for folder in subfolders:
    folder_path = os.path.join(data_path, folder)
    api.upload_folder(
        folder_path=folder_path,
        repo_id=repo_id,
        repo_type=repo_type,
        path_in_repo=folder,
        allow_patterns="*.json",
    )
    print(f"✅ Uploaded: {folder}")

No files have been modified since last commit. Skipping to prevent empty commit.


✅ Uploaded: cybermetric


No files have been modified since last commit. Skipping to prevent empty commit.
It seems you are trying to upload a large folder at once. This might take some time and then fail if the folder is too large. For such cases, it is recommended to upload in smaller batches or to use `HfApi().upload_large_folder(...)`/`huggingface-cli upload-large-folder` instead. For more details, check out https://huggingface.co/docs/huggingface_hub/main/en/guides/upload#upload-a-large-folder.


✅ Uploaded: nist_cyber


No files have been modified since last commit. Skipping to prevent empty commit.


✅ Uploaded: nist_pubs
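

The warning above suggests uploading in smaller batches or using `HfApi().upload_large_folder(...)`. A minimal sketch of the batching option follows. `batched` and `upload_json_in_batches` are hypothetical helpers; the sketch assumes the file names contain no glob metacharacters such as `[` or `*`, since `allow_patterns` entries are treated as patterns.

```python
from pathlib import Path
from typing import Iterator, List


def batched(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def upload_json_in_batches(api, folder_path: str, repo_id: str, path_in_repo: str,
                           batch_size: int = 50, repo_type: str = "dataset") -> int:
    """Upload the *.json files of one folder in several smaller commits.

    `api` is expected to behave like huggingface_hub.HfApi.
    Returns the number of upload calls made.
    """
    names = sorted(p.name for p in Path(folder_path).glob("*.json"))
    calls = 0
    for batch in batched(names, batch_size):
        # Each call only considers the files named in this batch.
        api.upload_folder(
            folder_path=folder_path,
            repo_id=repo_id,
            repo_type=repo_type,
            path_in_repo=path_in_repo,
            allow_patterns=batch,
        )
        calls += 1
    return calls
```

With the objects defined earlier, a call might look like `upload_json_in_batches(api, "./data/nist_cyber", repo_id, "nist_cyber")`.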


In [None]:
!rm -rf ./temp