In [4]:
import pdfplumber

def extract_text_from_pdf(pdf_path):
    text = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            page_text = page.extract_text()
            if page_text:
                text.append(page_text)
    return "\n".join(text)

pdf_text = extract_text_from_pdf("C:\\Users\\dhruv joshi\\Downloads\\SP_Uttarakhand.pdf")

print(pdf_text[:1000]) 


State Profile of Uttarakhand
Uttarakhand was formed on the 9th November 2000 as the 27th State of India, when it was
carved out of northern Uttar Pradesh. Located at the foothills of the Himalayan mountain ranges,
it is largely a hilly State, having international boundaries with China (Tibet) in the north and
Nepal in the east. On its north-west lies Himachal Pradesh, while on the south is Uttar Pradesh. It
is rich in natural resources especially water and forests with many glaciers, rivers, dense forests
and snow-clad mountain peaks. Char-dhams, the four most sacred and revered Hindu temples of
Badrinath,Kedarnath, Gangotri and Yamunotri are nestled in the mighty mountains. It’s truly
God’s Land (Dev Bhoomi). Dehradun is the Capital of Uttarakhand. It is one of the most
beautiful resort in the submountain tracts of India, known for its scenic surroundings. The town
lies in the Dun Valley, on the watershed of the Ganga and Yamuna rivers.
It is blessed with a rare bio-diversity, inter-a

**Step 1: PDF Text Extraction (pdfplumber)**

pdfplumber extracts text using layout-aware parsing, working from the position of each character on the page.

What it does:

- Reads embedded text objects (scanned, image-only pages yield no text without OCR)

- Often preserves reading order better than PyPDF2

- Exposes line breaks and layout artifacts, which the next step cleans up


In [5]:
import re

def clean_text(text):
    text = re.sub(r'-\n', '', text)          # rejoin words hyphenated at line ends (also merges real compounds that wrap)
    text = re.sub(r'\n+', ' ', text)         # normalize line breaks
    text = re.sub(r'\s+', ' ', text)         # normalize spaces
    return text.strip()

cleaned_text = clean_text(pdf_text)


**Step 2: Text Cleaning**

I applied regex cleanup to the raw text.

Purpose:

- Rejoin words hyphenated across line breaks

- Collapse excessive newlines

- Normalize spacing
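
To make the effect of each substitution concrete, here is a minimal sketch that applies the same three patterns to a made-up snippet. The `sample` string is hypothetical and not taken from the PDF. It also shows a side effect worth knowing about: a genuine compound such as "north-west" loses its hyphen when it happens to wrap at a line end.

```python
import re

def clean_text(text):
    text = re.sub(r'-\n', '', text)   # rejoin words split by an end-of-line hyphen
    text = re.sub(r'\n+', ' ', text)  # turn remaining line breaks into spaces
    text = re.sub(r'\s+', ' ', text)  # collapse runs of whitespace
    return text.strip()

# Hypothetical sample that mimics raw PDF output (not from the real file).
sample = (
    "The State is rich in bio-\ndiversity and   dense\n\nforests.\n"
    "It borders the north-\nwest region. "
)

cleaned = clean_text(sample)
print(cleaned)
```

If keeping real hyphenated compounds matters, one option is to check the rejoined word against a vocabulary before dropping the hyphen; that is a design choice outside what this notebook does.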

In [6]:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp(cleaned_text)


**Step 3: Load spaCy Language Model**

I loaded en_core_web_sm, spaCy's small English pipeline, and ran it over the cleaned text.

This provides:

- Tokenizer

- Sentence boundary detection

- POS tagger

- Dependency parser

In [7]:
for token in doc[:100]:  # limit output or you will drown in it
    print(token.text, token.pos_, token.tag_)


State PROPN NNP
Profile PROPN NNP
of ADP IN
Uttarakhand PROPN NNP
Uttarakhand PROPN NNP
was AUX VBD
formed VERB VBN
on ADP IN
the DET DT
9th ADJ JJ
November PROPN NNP
2000 NUM CD
as ADP IN
the DET DT
27th ADJ JJ
State PROPN NNP
of ADP IN
India PROPN NNP
, PUNCT ,
when SCONJ WRB
it PRON PRP
was AUX VBD
carved VERB VBN
out ADP IN
of ADP IN
northern ADJ JJ
Uttar PROPN NNP
Pradesh PROPN NNP
. PUNCT .
Located VERB VBN
at ADP IN
the DET DT
foothills NOUN NNS
of ADP IN
the DET DT
Himalayan ADJ JJ
mountain NOUN NN
ranges AUX VBZ
, PUNCT ,
it PRON PRP
is AUX VBZ
largely ADV RB
a DET DT
hilly ADJ JJ
State NOUN NN
, PUNCT ,
having VERB VBG
international ADJ JJ
boundaries NOUN NNS
with ADP IN
China PROPN NNP
( PUNCT -LRB-
Tibet PROPN NNP
) PUNCT -RRB-
in ADP IN
the DET DT
north NOUN NN
and CCONJ CC
Nepal PROPN NNP
in ADP IN
the DET DT
east NOUN NN
. PUNCT .
On ADP IN
its PRON PRP$
north NOUN NN
- PUNCT HYPH
west NOUN NN
lies VERB VBZ
Himachal PROPN NNP
Pradesh PROPN NNP
, PUNCT ,
while SCONJ IN
on

**Step 4: NLP Processing**

I passed cleaned text to spaCy.

spaCy then:

- Segments sentences (in this pipeline, boundaries are derived from the dependency parse)

- Assigns POS tags to tokens

- Builds dependency trees per sentence

In [8]:
for token in doc[:100]:
    print(
        token.text,
        "HEAD:", token.head.text,
        "DEP:", token.dep_
    )


State HEAD: Profile DEP: compound
Profile HEAD: formed DEP: nsubjpass
of HEAD: Profile DEP: prep
Uttarakhand HEAD: Uttarakhand DEP: compound
Uttarakhand HEAD: of DEP: pobj
was HEAD: formed DEP: auxpass
formed HEAD: formed DEP: ROOT
on HEAD: formed DEP: prep
the HEAD: November DEP: det
9th HEAD: November DEP: amod
November HEAD: on DEP: pobj
2000 HEAD: November DEP: nummod
as HEAD: formed DEP: prep
the HEAD: State DEP: det
27th HEAD: State DEP: amod
State HEAD: as DEP: pobj
of HEAD: State DEP: prep
India HEAD: of DEP: pobj
, HEAD: State DEP: punct
when HEAD: carved DEP: advmod
it HEAD: carved DEP: nsubjpass
was HEAD: carved DEP: auxpass
carved HEAD: State DEP: relcl
out HEAD: carved DEP: prep
of HEAD: out DEP: prep
northern HEAD: Pradesh DEP: amod
Uttar HEAD: Pradesh DEP: compound
Pradesh HEAD: of DEP: pobj
. HEAD: formed DEP: punct
Located HEAD: is DEP: advcl
at HEAD: Located DEP: prep
the HEAD: foothills DEP: det
foothills HEAD: at DEP: pobj
of HEAD: foothills DEP: prep
the HEAD: moun

**Step 5: POS Tagging**

I extracted:

- Token text

- Coarse POS tag (token.pos_)

- Fine-grained POS tag (token.tag_)

Use case:

- Grammatical labeling

- Feature extraction
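
As a rough sketch of the feature-extraction use case, the snippet below counts coarse POS tags over a few (text, POS) pairs copied from the tagger output printed earlier. The `tagged` list is a hand-copied stand-in so the example runs without spaCy; with the real `doc`, the same pairs could be built from `token.text` and `token.pos_`.

```python
from collections import Counter

# (text, coarse POS) pairs copied from the first sentence of the printed output.
tagged = [
    ("Uttarakhand", "PROPN"), ("was", "AUX"), ("formed", "VERB"),
    ("on", "ADP"), ("the", "DET"), ("9th", "ADJ"),
    ("November", "PROPN"), ("2000", "NUM"),
]

# Tag frequencies: a simple bag-of-tags feature.
pos_counts = Counter(pos for _, pos in tagged)

# Filter by tag, e.g. to collect candidate names.
proper_nouns = [text for text, pos in tagged if pos == "PROPN"]

print(pos_counts)
print(proper_nouns)
```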

In [9]:
for sent in doc.sents:
    print("SENTENCE:", sent.text)
    for token in sent:
        print(
            token.text,
            token.pos_,
            token.dep_,
            "→",
            token.head.text
        )
    print()


SENTENCE: State Profile of Uttarakhand Uttarakhand was formed on the 9th November 2000 as the 27th State of India, when it was carved out of northern Uttar Pradesh.
State PROPN compound → Profile
Profile PROPN nsubjpass → formed
of ADP prep → Profile
Uttarakhand PROPN compound → Uttarakhand
Uttarakhand PROPN pobj → of
was AUX auxpass → formed
formed VERB ROOT → formed
on ADP prep → formed
the DET det → November
9th ADJ amod → November
November PROPN pobj → on
2000 NUM nummod → November
as ADP prep → formed
the DET det → State
27th ADJ amod → State
State PROPN pobj → as
of ADP prep → State
India PROPN pobj → of
, PUNCT punct → State
when SCONJ advmod → carved
it PRON nsubjpass → carved
was AUX auxpass → carved
carved VERB relcl → State
out ADP prep → carved
of ADP prep → out
northern ADJ amod → Pradesh
Uttar PROPN compound → Pradesh
Pradesh PROPN pobj → of
. PUNCT punct → formed

SENTENCE: Located at the foothills of the Himalayan mountain ranges, it is largely a hilly State, having int

**Step 6: Dependency Parsing**

I extracted:

- Syntactic heads

- Dependency labels

- Sentence roots
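
To illustrate how heads, labels, and roots fit together, here is a small sketch that works on (text, label, head) triples copied from the first sentence of the output above. The `parsed` list is a hand-copied subset, so the example runs without spaCy. Matching on head text only works here because no word in the subset repeats; with a real `doc`, `token.head` and `token.children` are the more robust route.

```python
# (text, dependency label, head text) triples copied from the printed parse.
parsed = [
    ("Profile", "nsubjpass", "formed"),
    ("was", "auxpass", "formed"),
    ("formed", "ROOT", "formed"),
    ("on", "prep", "formed"),
    ("November", "pobj", "on"),
    ("as", "prep", "formed"),
    ("State", "pobj", "as"),
]

# The root carries the label ROOT and is listed as its own head.
root = next(text for text, dep, head in parsed if dep == "ROOT")

# Direct dependents of the root, excluding the root itself.
children = [text for text, dep, head in parsed if head == root and text != root]

# The passive subject attached to the root verb.
subjects = [text for text, dep, head in parsed if dep == "nsubjpass" and head == root]

print("ROOT:", root)
print("children:", children)
print("subject:", subjects)
```

The subject comes out as "Profile" rather than "Uttarakhand" because the document title ran into the first sentence once the line breaks were collapsed, so the parser treats "State Profile of Uttarakhand Uttarakhand" as one noun phrase.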