## 📚 Workshop: Parsing and Enriching Public Documents with NLP

In this workshop, we will learn how to:

1. Load real-world text content from a PDF document and a public HTML webpage.
2. Clean and structure raw data using BeautifulSoup and pandas.
3. Extract named entities using Hugging Face Transformers.
4. Tag the content with relevant hashtags based on predefined keyword dictionaries.

Our two examples are a legal document (the U.S. copyright statute) and a natural science article (the Wikipedia entry on ducks). We will prepare both for analysis or semantic indexing.


## 📄 PDF Loader and Retriever

We’ll download Title 17 of the U.S. Code (copyright law) as a PDF from copyright.gov, then extract its text with the `PyMuPDF` library, which is imported as `fitz`.


In [None]:
!pip install pymupdf

In [None]:
import fitz  # PyMuPDF
import requests

# Download a real public domain statute (Title 17)
url = "https://www.copyright.gov/title17/title17.pdf"
response = requests.get(url, timeout=60)
response.raise_for_status()  # fail early instead of saving an error page as a PDF
with open("title17.pdf", "wb") as f:
    f.write(response.content)

# Extract text from the first 10 pages only, to keep the demo fast.
# doc.pages(0, 10) avoids relying on slice support in doc[:10].
with fitz.open("title17.pdf") as doc:
    pdf_text = "\n".join(page.get_text() for page in doc.pages(0, 10))
print(pdf_text[:1000])  # preview output
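
Text extracted from a PDF usually carries layout artifacts: words hyphenated across line breaks, hard line wraps inside sentences, and runs of spaces. As a minimal sketch, a few regular expressions can tidy `pdf_text` before it goes into a DataFrame. The helper name `normalize_pdf_text` and its rules are assumptions for illustration, not part of PyMuPDF, and the sample string is made up.

```python
import re

def normalize_pdf_text(text: str) -> str:
    """Tidy common layout artifacts in PDF-extracted text (illustrative rules only)."""
    # Rejoin words hyphenated across a line break: "copy-\nright" -> "copyright"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Split on blank lines first so paragraph breaks survive the unwrapping below
    paragraphs = re.split(r"\n\s*\n", text)
    cleaned = []
    for para in paragraphs:
        # Collapse line wraps and repeated whitespace into single spaces
        para = re.sub(r"\s+", " ", para).strip()
        if para:
            cleaned.append(para)
    return "\n\n".join(cleaned)

# Made-up sample that imitates raw PDF output
sample = "The owner of copy-\nright has the exclusive\nrights to do.\n\n  Section  two. "
print(normalize_pdf_text(sample))
```

Note that the hyphen rule also merges genuine compounds that happen to break at a line end ("well-\nknown" becomes "wellknown"), so it is worth spot-checking the output on real pages.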


## 🌐 HTML Loader and Retriever

Next, we’ll fetch a public webpage (the Wikipedia article on ducks) and extract its main paragraphs using `requests` and `BeautifulSoup`. A CSS selector restricts the match to the `<p>` elements that sit directly inside the article body, so paragraphs nested in other page elements are skipped.


In [None]:
!pip install beautifulsoup4 requests

In [None]:
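import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Duck"
# Wikimedia asks clients to identify themselves; this User-Agent value is a placeholder
headers = {"User-Agent": "nlp-workshop-notebook/0.1 (educational example)"}
response = requests.get(url, headers=headers, timeout=30)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

# Select main paragraphs only
paragraphs = soup.select("#mw-content-text .mw-parser-output > p")
# get_text() without strip=True keeps the spaces around links and bold text
texts = (p.get_text().strip() for p in paragraphs)
wiki_text = "\n\n".join(t for t in texts if t)[:5000]
print(wiki_text[:1000])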
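
Wikipedia paragraphs keep their footnote markers (`[1]`, `[a]`, `[citation needed]`) after text extraction, and these add noise to word counts and entity extraction. The sketch below removes them with one regular expression. The helper name `strip_citation_markers` and the set of marker shapes it covers are assumptions, and the sample sentence is made up.

```python
import re

# Matches bracketed footnote markers such as [1], [23], [a], [note 2], [citation needed]
CITATION_MARKER = re.compile(r"\[(?:\d+|[a-z]|note \d+|citation needed)\]")

def strip_citation_markers(text: str) -> str:
    """Remove Wikipedia-style footnote markers from extracted text."""
    return CITATION_MARKER.sub("", text)

# Made-up sample in the style of a Wikipedia paragraph
sample = "Ducks are mostly aquatic birds.[1] They are smaller than swans[2][a] and geese.[citation needed]"
print(strip_citation_markers(sample))
# prints: Ducks are mostly aquatic birds. They are smaller than swans and geese.
```

In the notebook this could be applied as `wiki_text = strip_citation_markers(wiki_text)` before building the DataFrame.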

## 🧹 BeautifulSoup + pandas Cleaning

We’ll now collect the PDF and HTML text into a `pandas` DataFrame, one row per document, and add a word count as a first structural feature.


In [None]:
!pip install pandas

In [None]:
import pandas as pd

df = pd.DataFrame([
    {"source": "title17.pdf", "text": pdf_text},
    {"source": "wikipedia_duck", "text": wiki_text}
])

df["word_count"] = df["text"].apply(lambda x: len(x.split()))
df.head()
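
One row per document is coarse for later steps such as entity extraction or indexing. As a hedged sketch of finer-grained structuring, the hypothetical helper `to_paragraph_records` splits a text on blank lines and returns one dictionary per paragraph. The short strings below are placeholders standing in for `pdf_text` and `wiki_text`.

```python
def to_paragraph_records(source: str, text: str) -> list[dict]:
    """Split text on blank lines and return one record per non-empty paragraph."""
    records = []
    for raw in text.split("\n\n"):
        paragraph = " ".join(raw.split())  # collapse internal whitespace and line wraps
        if not paragraph:
            continue
        records.append({
            "source": source,
            "paragraph": len(records),  # 0-based position within the document
            "text": paragraph,
            "word_count": len(paragraph.split()),
        })
    return records

# Placeholder inputs; in the notebook these would be pdf_text and wiki_text
records = (
    to_paragraph_records("title17.pdf", "Subject matter of copyright.\n\nExclusive rights.")
    + to_paragraph_records("wikipedia_duck", "Ducks are birds.\n\n\n\nThey swim.")
)
for record in records:
    print(record)
```

Passing the list to `pd.DataFrame(records)` would give a table with one row per paragraph and the same `source` and `word_count` columns used above.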


## 🧠 Named Entity Recognition (NER)

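We’ll use the `dslim/bert-base-NER` model from the Hugging Face Hub to extract named entities from both documents. The model was fine-tuned on CoNLL-2003 and labels four entity types: persons (PER), organizations (ORG), locations (LOC), and miscellaneous (MISC). It has no dedicated labels for laws or species, so expect those to show up under MISC, if at all.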

In [None]:
!pip install transformers --quiet

In [None]:
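from transformers import pipeline

# aggregation_strategy="simple" merges word pieces into whole entities
# (it replaces the deprecated grouped_entities=True)
ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")

# Apply NER to the first 1000 characters of each document and keep up to 10 entities
df["entities"] = df["text"].apply(lambda t: [e["word"] for e in ner(t[:1000])][:10])
df[["source", "entities"]]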
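
The cell above only looks at the first 1,000 characters of each document. To cover more text, one option is to split it into chunks that break between words and run the pipeline on each chunk. The sketch below shows only the chunking step. The name `chunk_text` and the character budget are assumptions, and a character count is only a rough proxy for the model's token limit.

```python
def chunk_text(text: str, max_chars: int = 1000) -> list[str]:
    """Split text into chunks of up to max_chars characters, breaking between words.

    A single word longer than max_chars becomes its own oversized chunk.
    """
    chunks, current, length = [], [], 0
    for word in text.split():
        # The +1 accounts for the space that joins this word to the current chunk
        extra = len(word) + (1 if current else 0)
        if current and length + extra > max_chars:
            chunks.append(" ".join(current))
            current, length = [], 0
            extra = len(word)
        current.append(word)
        length += extra
    if current:
        chunks.append(" ".join(current))
    return chunks

# Sample sentence with a small budget so the splitting is visible
sample = "The Copyright Office is located in Washington at the Library of Congress"
for chunk in chunk_text(sample, max_chars=30):
    print(repr(chunk))
```

In the notebook, the entities for a document `t` could then be gathered with something like `[e["word"] for c in chunk_text(t) for e in ner(c)]`. The model sees each chunk without its neighbours, so entities near a chunk boundary have less context.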


## 🏷 Hashtagging from Keywords

We’ll define two keyword-to-hashtag dictionaries, one for legal content and one for birding content. A document receives a hashtag whenever its text contains the corresponding keyword, matched case-insensitively as a substring.

In [None]:
# Define keyword-to-hashtag maps
legal_hashtags = {
    "copyright": "#copyrightlaw",
    "intellectual property": "#IP",
    "section 107": "#fairuse",  # fair use is 17 U.S.C. § 107; § 106 lists the exclusive rights
    "DMCA": "#dmca",
    "public domain": "#publicdomain",
    "reproduction": "#reproductionrights",
    "infringement": "#infringement",
    "work of authorship": "#authorsrights",
    "exclusive rights": "#exclusivity",
    "derivative work": "#derivativework"
}

birding_hashtags = {
    "duck": "#waterfowl",
    "mallard": "#ducksofinstagram",
    "wing": "#birdwatching",
    "feathers": "#ornithology",
    "migratory": "#birdmigration",
    "quack": "#ducksounds",
    "nest": "#birdnest",
    "plumage": "#plumage",
    "beak": "#birdanatomy",
    "aquatic": "#wetlandbirds"
}

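# Map hashtags based on content (case-insensitive substring match, at most 10 tags)
def tag_text(text, tag_dict):
    lowered = text.lower()  # lowercase once instead of once per keyword
    return [tag for key, tag in tag_dict.items() if key.lower() in lowered][:10]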

df["hashtags"] = df.apply(
    lambda row: tag_text(row["text"], legal_hashtags if "title17" in row["source"] else birding_hashtags),
    axis=1
)

df[["source", "hashtags"]]
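
Because `tag_text` uses substring matching, short keywords can fire inside unrelated words: `wing` is found in "following" and `nest` in "honest". A possible refinement, sketched here under the assumption that whole-word matches are what we want, wraps each keyword in regular-expression word boundaries. The name `tag_text_whole_words` is hypothetical, and the small dictionary is a subset of `birding_hashtags` so that the block runs on its own.

```python
import re

def tag_text_whole_words(text: str, tag_dict: dict[str, str], limit: int = 10) -> list[str]:
    """Return hashtags whose keyword occurs as a whole word or phrase (case-insensitive)."""
    tags = []
    for key, tag in tag_dict.items():
        # \b anchors the keyword at word boundaries; re.escape guards special characters
        pattern = r"\b" + re.escape(key) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            tags.append(tag)
    return tags[:limit]

# Subset of birding_hashtags and a made-up sentence
sample_tags = {"wing": "#birdwatching", "nest": "#birdnest", "duck": "#waterfowl"}
sample = "In the following days the Duck stayed close to its nest."
print(tag_text_whole_words(sample, sample_tags))
# prints: ['#birdnest', '#waterfowl']
```

The trade-off is that `duck` no longer matches "ducks", so plural forms would need their own entries or a more permissive pattern.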