In [2]:
!pip install spacy
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [3]:
import spacy
nlp = spacy.load("en_core_web_sm")

Explanation:
You first install spaCy and download a small English language model (en_core_web_sm).

import spacy brings the library into your script.

nlp = spacy.load("en_core_web_sm") loads the pre-trained pipeline and returns a Language object, conventionally named nlp.
Every example below calls nlp on a string, so this setup step must run first.
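
If the model package has not been downloaded, spacy.load raises an OSError. The following is a minimal sketch of a guarded load, assuming that falling back to a blank English pipeline (tokenizer only) is acceptable for your use case. The helper name load_pipeline is hypothetical and not part of spaCy.

```python
import spacy


def load_pipeline(name="en_core_web_sm"):
    """Load a trained pipeline, or fall back to a blank English one."""
    try:
        return spacy.load(name)
    except OSError:
        # Raised when the named model package is not installed.
        # A blank pipeline can tokenize but has no tagger, parser, or NER.
        print(f"Model '{name}' not found; using a blank English pipeline.")
        return spacy.blank("en")


nlp = load_pipeline()
# Names of the processing components that run on each text, in order.
print(nlp.pipe_names)
```

Checking nlp.pipe_names is a quick way to confirm which components (tagger, parser, ner, and so on) are actually available before relying on their output.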

Tokenization

In [4]:
text = "spaCy is an advanced NLP library."
doc = nlp(text)
for token in doc:
    print(token.text)


spaCy
is
an
advanced
NLP
library
.


Explanation:
Tokenization is splitting a text into basic units called tokens (words, punctuation, etc.).

doc = nlp(text) processes the input string, producing a spaCy Doc object.

Iterating over doc yields each token as an object; token.text gives the original text for each token.

This step is foundational, enabling detailed analysis of text structure and meaning.
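
Each token also records where it came from in the original string. As a small sketch, the cell below uses a blank English pipeline (an assumption made here so that only the tokenizer runs and no model download is needed) to print a few token attributes and to show that the original text can be rebuilt from the tokens.

```python
import spacy

# A blank pipeline contains only the tokenizer.
nlp = spacy.blank("en")
text = "spaCy is an advanced NLP library."
doc = nlp(text)

for token in doc:
    # token.i is the position in the Doc; token.idx is the character offset.
    print(token.i, token.idx, repr(token.text), token.is_alpha, token.is_punct)

# Tokenization is non-destructive: joining each token with its trailing
# whitespace reproduces the input string.
rebuilt = "".join(token.text_with_ws for token in doc)
print(rebuilt == text)
```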

Sentence Segmentation

In [5]:
text = "Natural Language Processing enables computers to understand human language. It's a fast-growing field!"
doc = nlp(text)
for sent in doc.sents:
    print(sent.text)


Natural Language Processing enables computers to understand human language.
It's a fast-growing field!


Explanation:
Sentence segmentation divides text into sentences.

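Each sent in doc.sents is a Span, a slice of the Doc that covers one sentence.

In en_core_web_sm, sentence boundaries are set by the dependency parser, so doc.sents depends on that component having run. Accurate sentence splitting matters for downstream tasks such as information extraction.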
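
When only sentence boundaries are needed, a rule-based splitter can be enough. The sketch below assumes a blank English pipeline with spaCy's built-in sentencizer component, which splits on sentence-final punctuation instead of using a trained model. This is an alternative to the cell above, not what en_core_web_sm does by default.

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # rule-based sentence boundary detection

text = (
    "Natural Language Processing enables computers to understand human language. "
    "It's a fast-growing field!"
)
doc = nlp(text)

sentences = list(doc.sents)
for sent in sentences:
    # sent.start and sent.end are token indices into the Doc.
    print(sent.start, sent.end, sent.text)
```

The rule-based approach is fast and needs no model, but it can split incorrectly on text where periods do not end sentences.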

Part-of-Speech (POS) Tagging

In [6]:
text = "spaCy processes text efficiently."
doc = nlp(text)
for token in doc:
    print(f"{token.text} - {token.pos_}")


spaCy - X
processes - NOUN
text - NOUN
efficiently - ADV
. - PUNCT


Explanation:
POS tagging labels each word with its grammatical category (noun, verb, etc.).

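token.pos_ provides the coarse-grained POS tag (e.g., 'NOUN', 'VERB').

Tags are model predictions and can be wrong: in the output above, "processes" is tagged NOUN although it is the verb of the sentence, and "spaCy" is tagged X (other).

This process is crucial for understanding syntax and structure, supporting tasks like parsing and information extraction.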
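
Tag names are terse, so it helps to look them up. The short sketch below uses spacy.explain to print a description for each tag that appears in the output above, with VERB added for comparison. No model is needed for this lookup.

```python
import spacy

# Tags taken from the output above, plus VERB for comparison.
tags = ["X", "NOUN", "VERB", "ADV", "PUNCT"]
descriptions = {tag: spacy.explain(tag) for tag in tags}

for tag, description in descriptions.items():
    print(f"{tag:<6} {description}")
```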

Lemmatization

In [7]:
text = "The leaves are falling."
doc = nlp(text)
for token in doc:
    print(f"{token.text} - Lemma: {token.lemma_}")


The - Lemma: the
leaves - Lemma: leave
are - Lemma: be
falling - Lemma: fall
. - Lemma: .


Explanation:
Lemmatization reduces each word to its dictionary (base) form.

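For example, "are" → "be" and "falling" → "fall". Note that the output above gives "leave" for "leaves" rather than the expected "leaf". spaCy's English lemmatizer uses the predicted part of speech, so this likely reflects "leaves" being tagged as a verb.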

This helps normalize vocabulary, supporting search and analysis tasks.

Named Entity Recognition (NER)

In [8]:
text = "Apple is looking at buying U.K. startup for $1 billion."
doc = nlp(text)
for ent in doc.ents:
    print(f"Entity: {ent.text}, Label: {ent.label_}")


Entity: Apple, Label: ORG
Entity: U.K., Label: GPE
Entity: $1 billion, Label: MONEY


Explanation:
NER identifies real-world entities like names, places, and monetary amounts in text.

Each detected entity (ent) has a label (e.g., 'ORG', 'GPE', 'MONEY').

Recognizing entities enables applications like information extraction, search, and question answering.

Stop Word Detection

In [9]:
text = "This is a simple test."
doc = nlp(text)
for token in doc:
    print(f"{token.text} - Stopword: {token.is_stop}")


This - Stopword: True
is - Stopword: True
a - Stopword: True
simple - Stopword: False
test - Stopword: False
. - Stopword: False


Explanation:
Stop words are common words (like "is", "a", "the") that usually carry little semantic meaning.

token.is_stop is True for stop words.

Removing stop words shortens the text and focuses analysis on content words, though it can discard information in tasks where function words matter.
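
Filtering is a one-line list comprehension. The sketch below uses a blank English pipeline, on the assumption that only the tokenizer and the language's built-in stop word list are needed, and also drops punctuation.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("This is a simple test.")

# Keep tokens that are neither stop words nor punctuation.
content_words = [token.text for token in doc if not token.is_stop and not token.is_punct]
print(content_words)  # ['simple', 'test']

# The stop word list itself is available on the language defaults.
print(len(nlp.Defaults.stop_words))
```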

Dependency Parsing

In [10]:
text = "spaCy lets you analyze linguistic structure easily."
doc = nlp(text)
for token in doc:
    print(f"{token.text} <--{token.dep_}-- {token.head.text}")


spaCy <--meta-- lets
lets <--ROOT-- lets
you <--nsubj-- analyze
analyze <--ccomp-- lets
linguistic <--amod-- structure
structure <--dobj-- analyze
easily <--advmod-- analyze
. <--punct-- lets


Explanation:
Dependency parsing analyzes sentence structure by describing how words depend on each other.

token.dep_ is the type of grammatical relationship (subject, object, etc.).

token.head.text is the word this token depends on. The ROOT token is its own head, which is why the output shows lets <--ROOT-- lets.

Understanding dependencies is key for extracting who does what to whom, supporting tasks like translation and question answering.

Combined Workflow Example

In [11]:
text = "Barack Obama was born in Hawaii. He was elected president in 2008."
doc = nlp(text)

# Sentences
for sent in doc.sents:
    print("Sentence:", sent.text)
    for token in sent:
        print(f"  {token.text}: POS={token.pos_}, Lemma={token.lemma_}, Stopword={token.is_stop}")
    for ent in sent.ents:
        print(f"  Entity: {ent.text}, Label: {ent.label_}")


Sentence: Barack Obama was born in Hawaii.
  Barack: POS=PROPN, Lemma=Barack, Stopword=False
  Obama: POS=PROPN, Lemma=Obama, Stopword=False
  was: POS=AUX, Lemma=be, Stopword=True
  born: POS=VERB, Lemma=bear, Stopword=False
  in: POS=ADP, Lemma=in, Stopword=True
  Hawaii: POS=PROPN, Lemma=Hawaii, Stopword=False
  .: POS=PUNCT, Lemma=., Stopword=False
  Entity: Barack Obama, Label: PERSON
  Entity: Hawaii, Label: GPE
Sentence: He was elected president in 2008.
  He: POS=PRON, Lemma=he, Stopword=True
  was: POS=AUX, Lemma=be, Stopword=True
  elected: POS=VERB, Lemma=elect, Stopword=False
  president: POS=NOUN, Lemma=president, Stopword=False
  in: POS=ADP, Lemma=in, Stopword=True
  2008: POS=NUM, Lemma=2008, Stopword=False
  .: POS=PUNCT, Lemma=., Stopword=False
  Entity: 2008, Label: DATE


Explanation:
This program combines several spaCy features:

Sentence segmentation: Splits the text into individual sentences.

Token-based analysis: Prints POS, lemma, and stop word status for each token, including punctuation.

NER per sentence: Shows entities found in each sentence.

This type of analysis is foundational for many advanced NLP applications, as described in the reference text.