<a href="https://colab.research.google.com/github/Kdavis2025/Automating-Compliance-AI-and-Machine-Learning-Approaches-to-Achieviing-CMMC-2.0-Certification/blob/main/CMMC_Documentation_and_Reporting_Automation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# 1) Install and download models
!pip install pdfplumber spacy jinja2 transformers torch pandas matplotlib --quiet
!python -m spacy download en_core_web_md --quiet

import pdfplumber
import re
import spacy
import pandas as pd
import time
from jinja2 import Environment, FileSystemLoader
from transformers import pipeline, GPT2TokenizerFast
from google.colab import files

# 2) Upload PDFs via Colab’s file‐picker
print("📂 Upload your CMMC 2.0 Final Rule PDF(s) and/or NIST SP 800-171 PDF(s) now:")
uploaded = files.upload()  # opens file picker
for fname in uploaded.keys():
    with open(fname, "wb") as f:
        f.write(uploaded[fname])
print("✅ PDFs successfully uploaded:", list(uploaded.keys()))

# 3) Extract all text from a PDF
def extract_pdf_text(path: str) -> str:
    pages = []
    with pdfplumber.open(path) as pdf:
        for p in pdf.pages:
            pages.append(p.extract_text() or "")
    return "\n".join(pages)

# 4) Parse CMMC/NIST controls by splitting on “digit.digit.digit” patterns
def parse_requirements(raw: str) -> dict:
    parts = re.split(r'(\d+\.\d+\.\d+)', raw)
    reqs = {}
    for i in range(1, len(parts), 2):
        cid = parts[i].strip()
        ctext = parts[i+1].strip().replace("\n", " ")
        reqs[cid] = ctext
    return reqs

# 5) Programmatically write SSP & IRP Jinja2 templates into /content
ssp_tmpl = """# System Security Plan (SSP)

## Organization Overview
**Organization Name**: {{ organization }}
**Environment**: {{ environment }}

## Control Implementation Summary
{% for ctrl, text in controls.items() %}
### Control {{ ctrl }}
**Requirement**: {{ text }}

**Implementation Description**:
[Describe how {{ organization }} implements control {{ ctrl }} here.]

**Responsible Party**:
[Name or role]

{% endfor %}
"""

irp_tmpl = """# Incident Response Plan (IRP)

## Organization Overview
**Organization Name**: {{ organization }}
**Environment**: {{ environment }}

## Incident Response Team
- **Team Lead**: [Name or role]
- **Members**: [List roles]

## Incident Response Procedures
### 1. Preparation
[Outline preparation steps.]

### 2. Detection and Analysis
[How to detect incidents; reference controls]
{% for ctrl, text in controls.items() %}
- **Control {{ ctrl }} Insight**: {{ text }}
{% endfor %}

### 3. Containment
[Containment strategies]

### 4. Eradication
[Eradication procedures]

### 5. Recovery
[Recovery steps]

### 6. Post‐Incident Activity
[Lessons learned, reporting]

## Communication Plan
- **Internal**: [Stakeholders]
- **External**: [Notification processes]
"""

with open("/content/SSP_template.md", "w") as f:
    f.write(ssp_tmpl)
with open("/content/IRP_template.md", "w") as f:
    f.write(irp_tmpl)

# 6) Auto‐create org_profile.csv and historical_metrics.csv
org_profile = {"name": ["Cyber Brilliance LLC"], "environment": ["AWS Cloud"]}
pd.DataFrame(org_profile).to_csv("/content/org_profile.csv", index=False)

historical_metrics = {
    "baseline_time": [120.0, 130.0, 115.0],    # seconds (example)
    "baseline_quality": [80.0, 85.0, 78.0]     # % scores (example)
}
pd.DataFrame(historical_metrics).to_csv("/content/historical_metrics.csv", index=False)

print("✅ Templates and CSV files created in /content.")

# 7) Load spaCy and GPT-2 tokenizer/pipeline
nlp = spacy.load("en_core_web_md")
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
generator = pipeline("text-generation", model="gpt2")

# 8) Load org data & historical metrics
org_data = pd.read_csv("/content/org_profile.csv")
metrics = pd.read_csv("/content/historical_metrics.csv")

# 9) Extract & parse controls from each uploaded PDF
all_controls = {}
for pdf_name in uploaded.keys():
    if pdf_name.lower().endswith(".pdf"):
        raw = extract_pdf_text(pdf_name)
        parsed = parse_requirements(raw)
        all_controls.update(parsed)
print(f"🔍 Parsed {len(all_controls)} total controls from uploaded PDFs.")

# 10) Prepare Jinja2 environment
env = Environment(loader=FileSystemLoader("/content"))
ssp_template = env.get_template("SSP_template.md")
irp_template = env.get_template("IRP_template.md")

# 11) Function to chunk & generate very large text
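def generate_long_text(input_text: str,
                       max_input_tokens: int = 1024,
                       gen_extra_tokens: int = 50) -> str:
    """
    1. Tokenize input_text into GPT-2 tokens.
    2. Split tokens into chunks small enough that chunk + gen_extra_tokens
       fits in the model's context window (1024 positions for GPT-2).
    3. For each chunk, decode and run generator(...).
    4. Return concatenated outputs.
    """
    # Prompt plus generated tokens must fit in the context window; otherwise
    # the position-embedding lookup can fail with "IndexError: index out of
    # range in self". The 16-token margin absorbs small count changes when a
    # decoded chunk is re-tokenized inside the pipeline.
    context_window = generator.model.config.n_positions
    chunk_size = max(1, min(max_input_tokens,
                            context_window - gen_extra_tokens - 16))
    # Encoding the whole document logs a "sequence length is longer than ..."
    # warning; it is harmless here because only chunks reach the model.
    tokens = tokenizer.encode(input_text)
    chunks = [
        tokens[i : i + chunk_size]
        for i in range(0, len(tokens), chunk_size)
    ]
    out_pieces = []
    for chunk_tokens in chunks:
        chunk_text = tokenizer.decode(chunk_tokens, clean_up_tokenization_spaces=False)
        # max_new_tokens bounds only the continuation, so it cannot clash with
        # the pipeline's default generation settings the way max_length did.
        generated = generator(
            chunk_text,
            max_new_tokens=gen_extra_tokens,
            do_sample=False,
            pad_token_id=tokenizer.eos_token_id,
        )[0]["generated_text"]
        out_pieces.append(generated)
    return "\n".join(out_pieces)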

# 12) Render, polish, measure time & coverage
start = time.time()

context = {
    "organization": org_data.loc[0, "name"],
    "environment": org_data.loc[0, "environment"],
    "controls": all_controls
}

raw_ssp = ssp_template.render(**context)
raw_irp = irp_template.render(**context)

# Generate long documents by chunking
ssp_final = generate_long_text(raw_ssp, max_input_tokens=1024, gen_extra_tokens=50)
irp_final = generate_long_text(raw_irp, max_input_tokens=1024, gen_extra_tokens=50)

elapsed = time.time() - start

# Coverage metric
total_controls = len(all_controls)

# Time‐saved versus historical average
avg_time = metrics["baseline_time"].mean()
time_saved = avg_time - elapsed

# 13) “Accuracy” via rephrase similarity (first 5 controls)
sim_scores = []
sample_ids = list(all_controls.keys())[:5]  # up to 5 controls
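for cid in sample_ids:
    # Parsed control text can run to thousands of tokens, so cap it to keep
    # the prompt plus continuation inside GPT-2's 1024-token window.
    orig_text = tokenizer.decode(tokenizer.encode(all_controls[cid])[:200])
    prompt = f"Rephrase this requirement clearly:\n\n\"{orig_text}\"\n\nRephrase:"
    # return_full_text=False returns only the continuation after the prompt.
    candidate = generator(
        prompt, max_new_tokens=60, do_sample=False,
        return_full_text=False, pad_token_id=tokenizer.eos_token_id,
    )[0]["generated_text"].strip()
    # Skip empty strings: an empty spaCy doc has no vector to compare.
    if orig_text.strip() and candidate:
        sim_scores.append(nlp(orig_text).similarity(nlp(candidate)))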

avg_similarity = float(pd.Series(sim_scores).mean() * 100) if sim_scores else 0.0

# 14) Print metrics summary
print("\n📊 Metrics Summary:")
print(f"• Total controls parsed: {total_controls}")
print(f"• SSP+IRP generation time: {elapsed:.1f} s (baseline avg: {avg_time:.1f} s → saved {time_saved:.1f} s)")
print(f"• Rephrase‐similarity (first 5 controls): {avg_similarity:.1f}%")
print(f"• Baseline quality (avg): {metrics['baseline_quality'].mean():.1f}%")

# 15) Write out and trigger downloads
with open("/content/Generated_SSP.md", "w") as f:
    f.write(ssp_final)
with open("/content/Generated_IRP.md", "w") as f:
    f.write(irp_final)

print("\n✅ Finished. Download ‘Generated_SSP.md’ and ‘Generated_IRP.md’ below.")
files.download("/content/Generated_SSP.md")
files.download("/content/Generated_IRP.md")


Saving NIST.SP.800-171r3.pdf to NIST.SP.800-171r3.pdf
✅ PDFs successfully uploaded: ['NIST.SP.800-171r3.pdf']
✅ Templates and CSV files created in /content.


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Device set to use cpu
Token indices sequence length is longer than the specified maximum sequence length for this model (10583 > 1024). Running this sequence through the model will result in indexing errors
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=1074) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/m

🔍 Parsed 130 total controls from uploaded PDFs.
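
Note on the count above: `parse_requirements` splits on every `digit.digit.digit` string anywhere in the text, so cross-references and similarly numbered items may be counted as controls, and a later mention of an ID overwrites the text stored for an earlier one. A minimal sketch of a stricter parser follows. It assumes control IDs start a line and have one or two digits per field (for example `3.1.1` or `03.01.01`); `parse_requirements_strict` and `CONTROL_ID` are hypothetical names, and the sample text is illustrative, not quoted from the standard.

```python
import re

# Assumption: a control ID such as "3.1.1" or "03.01.01" begins a line.
CONTROL_ID = re.compile(r"^(\d{1,2}\.\d{1,2}\.\d{1,2})\b[ \t]*", re.MULTILINE)

def parse_requirements_strict(raw: str) -> dict:
    """Map each line-leading control ID to the text that follows it."""
    matches = list(CONTROL_ID.finditer(raw))
    reqs = {}
    for idx, m in enumerate(matches):
        # A control's text runs until the next line-leading ID (or end of input).
        end = matches[idx + 1].start() if idx + 1 < len(matches) else len(raw)
        text = " ".join(raw[m.end():end].split())  # collapse whitespace
        # Keep the first non-empty entry so later repeats do not overwrite it.
        if text and m.group(1) not in reqs:
            reqs[m.group(1)] = text
    return reqs

# Illustrative sample only.
sample = (
    "03.01.01 Account Management\n"
    "Define the types of system accounts allowed. See also 03.01.02.\n"
    "03.01.02 Access Enforcement\n"
    "Enforce approved authorizations.\n"
)
result = parse_requirements_strict(sample)
for cid, text in result.items():
    print(cid, "->", text)
```

The mid-line reference to `03.01.02` stays inside the text of `03.01.01` instead of opening a new entry. Whether the line-start assumption holds depends on how `pdfplumber` lays out the extracted text for a given PDF, so the pattern may need adjusting.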


This is a friendly reminder - the current text generation call will exceed the model's predefined maximum length (1024). Depending on the model, you may observe exceptions, performance degradation, or nothing at all.


IndexError: index out of range in self
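
The `IndexError` is consistent with GPT-2 being asked to go past its 1024 positions: the log reports 1024-token prompts combined with up to 256 new tokens. A chunker therefore has to reserve room for the continuation before it splits the input. The sketch below shows only that budgeting step in plain Python; `chunk_for_generation` and its parameters are hypothetical names, not part of the notebook or of `transformers`.

```python
from typing import List, Sequence

def chunk_for_generation(token_ids: Sequence[int],
                         context_window: int = 1024,
                         new_tokens: int = 50) -> List[List[int]]:
    """Split token_ids so that each chunk plus new_tokens fits in context_window."""
    budget = context_window - new_tokens  # tokens left for the prompt itself
    if budget <= 0:
        raise ValueError("new_tokens must be smaller than context_window")
    ids = list(token_ids)
    return [ids[i:i + budget] for i in range(0, len(ids), budget)]

# 10583 is the sequence length reported in the tokenizer warning in the log.
chunks = chunk_for_generation(range(10583), context_window=1024, new_tokens=50)
print(len(chunks), max(len(c) for c in chunks))  # prints: 11 974
```

With a 50-token continuation, each prompt is capped at 974 tokens, so prompt plus output never exceeds 1024. The same budget has to be passed to the generator as `max_new_tokens`; otherwise a larger default can still push the total past the window.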