### spaCy Processing Pipeline

When you pass text to the `nlp` object, spaCy first tokenizes it and then runs each component of the processing pipeline in order. The result is a `Doc` object, a container that gives access to the tokens and their linguistic annotations. The sections below walk through this process.

### Processing Pipeline in spaCy

spaCy's processing pipeline consists of several components that apply various linguistic annotations to the text. The default pipeline typically includes:

1. **Tokenizer**: Splits the text into individual tokens (words, punctuation, etc.).
2. **Tagger**: Assigns part-of-speech (POS) tags to each token.
3. **Parser**: Analyzes the syntactic structure of the sentence, identifying dependencies between tokens.
4. **NER (Named Entity Recognizer)**: Identifies named entities in the text and labels them.
5. **TextCategorizer**: (if included) Categorizes the text into predefined categories.
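
As a rough sketch of how components are wired together, the snippet below builds a blank English pipeline and adds a single component. It assumes only that spaCy is installed: `spacy.blank("en")` needs no model download, and the rule-based `sentencizer` stands in here for the trained components listed above. Note that the tokenizer always runs first and does not appear in `nlp.pipe_names`.

```python
import spacy

# A blank pipeline has a tokenizer but no components yet.
nlp = spacy.blank("en")
names_before = list(nlp.pipe_names)

# Add a rule-based sentence splitter. In a loaded model, trained
# components such as "tagger" or "ner" appear in the same list.
nlp.add_pipe("sentencizer")
names_after = list(nlp.pipe_names)

# Calling nlp runs the tokenizer, then every component in order.
doc = nlp("Hello world. Goodbye.")
sentences = [sent.text for sent in doc.sents]

print(names_before)
print(names_after)
print(sentences)
```

The same `nlp.pipe_names` attribute can be inspected on a loaded model such as `en_core_web_sm` to see which components it ships with.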


### spaCy Pre-trained Models

spaCy offers pre-trained models for many languages and purposes, in several sizes and with different capabilities. The sections below list some of them along with their characteristics.

## Available Models

The models are generally categorized by their size and by the types of annotations they provide. The size is encoded in the suffix of the package name:

- **sm**: Small
- **md**: Medium
- **lg**: Large
- **trf**: Transformer-based
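
To make the naming scheme concrete, here is a small sketch that splits a package name into its parts. The helper `parse_package_name` and the `PackageName` type are hypothetical and not part of spaCy; the sketch assumes the four-part `lang_type_genre_size` pattern followed by names such as `en_core_web_sm`.

```python
from typing import NamedTuple

# Size suffixes from the list above.
SIZE_LABELS = {
    "sm": "small",
    "md": "medium",
    "lg": "large",
    "trf": "transformer-based",
}


class PackageName(NamedTuple):
    lang: str   # language code, e.g. "en"
    kind: str   # pipeline type, e.g. "core"
    genre: str  # training text genre, e.g. "web" or "news"
    size: str   # size suffix, e.g. "sm"


def parse_package_name(name: str) -> PackageName:
    """Split a name like 'en_core_web_sm' into its four parts."""
    lang, kind, genre, size = name.split("_")
    if size not in SIZE_LABELS:
        raise ValueError(f"Unknown size suffix: {size!r}")
    return PackageName(lang, kind, genre, size)


parsed = parse_package_name("en_core_web_sm")
print(parsed, "->", SIZE_LABELS[parsed.size])
```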

### List of Pre-trained Models and Their Characteristics


In [56]:
# Show the installed spaCy version, platform, and pipeline packages
spacy.info()

{'spacy_version': '3.7.5',
 'location': '/usr/local/lib/python3.10/dist-packages/spacy',
 'platform': 'Linux-6.1.85+-x86_64-with-glibc2.35',
 'python_version': '3.10.12',
 'pipelines': {'en_core_web_sm': '3.7.1'}}

## Model Characteristics

Here’s a breakdown of some of the popular models and their characteristics:

### English Models

- **en_core_web_sm**
  - **Size**: 12MB
  - **Components**: tok2vec, tagger, parser, senter, ner, attribute_ruler, lemmatizer
  - **Vocab**: no static word vectors (0 keys, 0 unique vectors)

- **en_core_web_md**
  - **Size**: 43MB
  - **Components**: tok2vec, tagger, parser, senter, ner, attribute_ruler, lemmatizer
  - **Vocab**: 509k keys, 20k unique vectors (300 dimensions)

- **en_core_web_lg**
  - **Size**: 741MB
  - **Components**: tok2vec, tagger, parser, senter, ner, attribute_ruler, lemmatizer
  - **Vocab**: 509k keys, 300k unique vectors (300 dimensions)

- **en_core_web_trf**
  - **Size**: 438MB
  - **Components**: transformer, ner, tagger, parser, attribute_ruler, lemmatizer
  - **Vocab**: no static word vectors (0 keys, 0 unique vectors); token representations come from the transformer

### Other Language Models

- **es_core_news_sm** (Spanish)
  - **Size**: 42MB
  - **Components**: tok2vec, tagger, parser, senter, ner, attribute_ruler, lemmatizer
  - **Vocab**: 584k keys, 0 unique vectors

- **fr_core_news_sm** (French)
  - **Size**: 38MB
  - **Components**: tok2vec, tagger, parser, senter, ner, attribute_ruler, lemmatizer
  - **Vocab**: 700k keys, 0 unique vectors

- **de_core_news_sm** (German)
  - **Size**: 47MB
  - **Components**: tok2vec, tagger, parser, senter, ner, attribute_ruler, lemmatizer
  - **Vocab**: 597k keys, 0 unique vectors

### Usage Example

Let's load the `en_core_web_sm` model and process a sample text to see how it works.


### spaCy Introduction

In [1]:
import spacy

In [2]:
# Step 1: Clone the repository
!git clone https://github.com/wjbmattingly/freecodecamp_spacy.git

# Step 2: Change directory to the cloned repository
%cd freecodecamp_spacy

# Step 3: Install dependencies (if any)
!pip install -r requirements.txt


Cloning into 'freecodecamp_spacy'...
remote: Enumerating objects: 930, done.
remote: Total 930 (delta 0), reused 0 (delta 0), pack-reused 930
Receiving objects: 100% (930/930), 15.78 MiB | 16.59 MiB/s, done.
Resolving deltas: 100% (619/619), done.
/content/freecodecamp_spacy
Collecting jupyter-book>=0.9 (from -r requirements.txt (line 1))
  Downloading jupyter_book-1.0.0-py3-none-any.whl (43 kB)
Collecting textacy (from -r requirements.txt (line 3))
  Downloading textacy-0.13.0-py3-none-any.whl (210 kB)
Collecting myst-nb<3,>=1 (from jupyter-book>=0.9->-r requirements.txt (line 1))
  Downloading myst_nb-1.1.0-py3-none-any.whl (80 kB)

`en_core_web_sm` is one of the pre-trained pipelines provided by spaCy. It is designed for English and performs several common natural language processing (NLP) tasks. This section covers what it is and how to work with it.

### What is en_core_web_sm?
`en_core_web_sm` is a small, pre-trained NLP pipeline for English. Its components and capabilities include:

1. Tokenization: Splitting text into individual tokens (words, punctuation, etc.).
2. Part-of-Speech (POS) Tagging: Assigning parts of speech to each token.
3. Named Entity Recognition (NER): Identifying and classifying named entities (e.g., people, organizations, dates).
4. Dependency Parsing: Analyzing the syntactic structure of sentences.
5. Lemmatization: Reducing words to their base or dictionary form.

In [3]:
nlp = spacy.load('en_core_web_sm')

In [4]:
# Read the sample Wikipedia text shipped with the cloned repository
with open("/content/freecodecamp_spacy/data/wiki_us.txt", "r", encoding="utf-8") as f:
  text = f.read()

In [5]:
print(text)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America. It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j] At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d] The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22] With a population of more than 331 million people, it is the third most populous country in the world. The national capital is Washington, D.C., and the most populous city is New York.

Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century. The United States emerged from the thirteen British colonies est

In [6]:
doc = nlp(text)

In [7]:
print(len(text))  # number of characters in the raw string
print(len(doc))   # number of tokens in the Doc

3521
654


In [8]:
# Iterating over a string yields single characters
for char in text[0:10]:
  print(char)

T
h
e
 
U
n
i
t
e
d


In [9]:
# Iterating over a Doc yields Token objects
for token in doc[0:10]:
  print(token)

The
United
States
of
America
(
U.S.A.
or
USA
)


In [10]:
# str.split() only splits on whitespace, so punctuation stays attached
for token in text.split()[:10]:
  print(token)

The
United
States
of
America
(U.S.A.
or
USA),
commonly
known


In [39]:
# Process text
doc = nlp("SpaCy is a great NLP library!")

# Print tokens
for token in doc:
    print(token.text)


SpaCy
is
a
great
NLP
library
!


In [40]:
# Print tokens and their POS tags
for token in doc:
    print(f'{token.text} - {token.pos_}')


SpaCy - PROPN
is - AUX
a - DET
great - ADJ
NLP - PROPN
library - NOUN
! - PUNCT


In [41]:
# Print named entities
for ent in doc.ents:
    print(f'{ent.text} - {ent.label_}')


NLP - ORG


In [42]:
# Print tokens and their dependencies
for token in doc:
    print(f'{token.text} - {token.dep_} - {token.head.text}')


SpaCy - nsubj - is
is - ROOT - is
a - det - library
great - amod - library
NLP - compound - library
library - attr - is
! - punct - is


In [43]:
# Print tokens and their lemmas
for token in doc:
    print(f'{token.text} - {token.lemma_}')


SpaCy - SpaCy
is - be
a - a
great - great
NLP - NLP
library - library
! - !


### Tokenization

### Understanding Tokenization
Tokenization is the process of breaking down text into smaller units called tokens. These tokens can be words, phrases, or even punctuation marks. Tokenization is a fundamental step in natural language processing (NLP) because it gives structure to unstructured text data.

### Why Tokenization is Important
1. Text Analysis: Helps in analyzing and processing text by breaking it into manageable pieces.
2. Feature Extraction: Essential for extracting meaningful features from text for machine learning models.
3. Linguistic Structure: Preserves the linguistic structure of text, making it easier to understand the context.

### Types of Tokenization
1. Word Tokenization: Splitting text into individual words.
2. Sentence Tokenization: Splitting text into individual sentences.
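
To illustrate the two types without any library, here is a deliberately naive sketch built on regular expressions. It is not how spaCy tokenizes; the function names are made up for this example, and the rules would mishandle abbreviations such as "U.S.A.", which is one reason a dedicated tokenizer is worth using.

```python
import re
from typing import List


def simple_word_tokenize(text: str) -> List[str]:
    """Split into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def simple_sentence_tokenize(text: str) -> List[str]:
    """Split after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


words = simple_word_tokenize("SpaCy is a great NLP library!")
sentences = simple_sentence_tokenize(
    "SpaCy is a great NLP library. It is widely used in the industry."
)
print(words)
print(sentences)
```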

In [44]:
# Sample text
text = "SpaCy is a great NLP library!"

# Process the text
doc = nlp(text)

# Tokenize into words
tokens = [token.text for token in doc]

# Print tokens
print(tokens)


['SpaCy', 'is', 'a', 'great', 'NLP', 'library', '!']


In [45]:
# Sample text
text = "SpaCy is a great NLP library. It is widely used in the industry."

# Process the text
doc = nlp(text)

# Tokenize into sentences
sentences = [sent.text for sent in doc.sents]

# Print sentences
print(sentences)


['SpaCy is a great NLP library.', 'It is widely used in the industry.']


### Token Objects
In spaCy, each token is an object that contains various attributes and methods. Here are some useful attributes:

1. `token.text`: The original text of the token.
2. `token.lemma_`: The lemma or base form of the token.
3. `token.pos_`: The part-of-speech tag.
4. `token.dep_`: The syntactic dependency relation.
5. `token.is_stop`: Boolean flag indicating if the token is a stop word.
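
Some token attributes are purely lexical and do not need a trained model. The sketch below uses a blank English pipeline (an assumption made to avoid a model download) and prints a few such flags; `is_alpha`, `is_punct`, and `like_num` are additional spaCy token attributes not listed above. Attributes such as `lemma_`, `pos_`, and `dep_` would be empty here because no tagger, parser, or lemmatizer has run.

```python
import spacy

# Blank pipeline: tokenizer and lexical attributes only.
nlp = spacy.blank("en")
doc = nlp("the price is 5 dollars!")

# Collect a few lexical flags for every token.
rows = [
    (token.text, token.is_alpha, token.is_punct, token.like_num, token.is_stop)
    for token in doc
]

print("text | is_alpha | is_punct | like_num | is_stop")
for row in rows:
    print(" | ".join(str(value) for value in row))
```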

In [46]:
# Print detailed information about each token
for token in doc:
    print(f'Text: {token.text}, Lemma: {token.lemma_}, POS: {token.pos_}, Dependency: {token.dep_}, Is Stop Word: {token.is_stop}')


Text: SpaCy, Lemma: SpaCy, POS: PROPN, Dependency: nsubj, Is Stop Word: False
Text: is, Lemma: be, POS: AUX, Dependency: ROOT, Is Stop Word: True
Text: a, Lemma: a, POS: DET, Dependency: det, Is Stop Word: True
Text: great, Lemma: great, POS: ADJ, Dependency: amod, Is Stop Word: False
Text: NLP, Lemma: NLP, POS: PROPN, Dependency: compound, Is Stop Word: False
Text: library, Lemma: library, POS: NOUN, Dependency: attr, Is Stop Word: False
Text: ., Lemma: ., POS: PUNCT, Dependency: punct, Is Stop Word: False
Text: It, Lemma: it, POS: PRON, Dependency: nsubjpass, Is Stop Word: True
Text: is, Lemma: be, POS: AUX, Dependency: auxpass, Is Stop Word: True
Text: widely, Lemma: widely, POS: ADV, Dependency: advmod, Is Stop Word: False
Text: used, Lemma: use, POS: VERB, Dependency: ROOT, Is Stop Word: True
Text: in, Lemma: in, POS: ADP, Dependency: prep, Is Stop Word: True
Text: the, Lemma: the, POS: DET, Dependency: det, Is Stop Word: True
Text: industry, Lemma: industry, POS: NOUN, Dependency

### Custom Tokenization

Custom tokenization allows you to modify the default tokenization behavior of spaCy to suit specific needs or handle special cases in your text data. spaCy’s tokenizer is highly customizable, allowing you to change how text is split into tokens by adjusting the rules for prefix, suffix, infix, and exception handling.

Components of spaCy’s Tokenizer:

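1. Prefix: Characters split off the beginning of a token, such as `$` or `(`.
2. Suffix: Characters split off the end of a token, such as `)` or `!`.
3. Infix: Characters that split a token in the middle, such as a hyphen between letters.
4. Exceptions: Special-case rules that override the other rules, for example splitting `don't` into `do` and `n't`.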
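
As a minimal sketch of the exception mechanism, the snippet below registers a special case on the tokenizer of a blank English pipeline. It uses spaCy's `Tokenizer.add_special_case` method and the `ORTH` symbol; the word "gimme" and its split are just an illustrative choice.

```python
import spacy
from spacy.symbols import ORTH

nlp = spacy.blank("en")

# Without a special case, "gimme" stays a single token.
before = [token.text for token in nlp("gimme that")]

# Register an exception: always split "gimme" into "gim" + "me".
# The pieces must concatenate back to the original string.
nlp.tokenizer.add_special_case("gimme", [{ORTH: "gim"}, {ORTH: "me"}])
after = [token.text for token in nlp("gimme that")]

print(before)
print(after)
```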

In [47]:
import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_infix_regex


By default, spaCy has a set of infix patterns that split on certain characters inside a token. For English, these defaults already split hyphens between letters. We can extend them with a custom pattern that treats every hyphen (`-`) inside a token as a boundary.

Infix Patterns: We add the pattern `r"-"` to the default list. This tells spaCy to consider hyphens as boundaries within words, thus splitting them.



In [48]:
# Extend the default infix patterns with a plain hyphen
infixes = list(nlp.Defaults.infixes) + [r"-"]

# Compile the infix patterns into a single regex
infix_re = compile_infix_regex(infixes)


In [49]:
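# Create a new tokenizer that uses only the modified infix patterns.
# No prefix, suffix, or exception rules are passed, so tokens such as
# "Let's" and "tokens." are not split any further.
custom_tokenizer = Tokenizer(nlp.vocab, infix_finditer=infix_re.finditer)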


In [50]:
# Sample text to tokenize
text = "Let's test the custom-tokenizer to see how it handles well-formed tokens."

# Use the custom tokenizer
doc = custom_tokenizer(text)

# Print the resulting tokens
tokens = [token.text for token in doc]
print(tokens)


["Let's", 'test', 'the', 'custom', '-', 'tokenizer', 'to', 'see', 'how', 'it', 'handles', 'well', '-', 'formed', 'tokens.']


### Sentence Boundary Detection

In [None]:
# `doc` here is the Doc built from wiki_us.txt with the full pipeline
for sent in doc.sents:
  print(sent)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.
It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j]
At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d]
The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22]
With a population of more than 331 million people, it is the third most populous country in the world.
The national capital is Washington, D.C., and the most populous city is New York.


Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century.
The United States emerged from the thirteen British colonies es

In [None]:
# doc.sents is a generator, so indexing it directly raises an error
sentence1 = doc.sents[0]
print(sentence1)


TypeError: 'generator' object is not subscriptable

In [None]:
# Convert the generator to a list before indexing
sentence1 = list(doc.sents)[0]
print(sentence1)


The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.


In [None]:
token2 = sentence1[2]
print(token2)

States


In [None]:
for token in doc[:10]:
  print(token)

The
United
States
of
America
(
U.S.A.
or
USA
)


In [None]:
print(type(token))
print(type(token.text))


<class 'spacy.tokens.token.Token'>
<class 'str'>


In [None]:
token2.text

'States'

In [None]:
token2.left_edge

The

In [None]:
token2.right_edge

America

In [None]:
token2.ent_type

384

In [None]:
token2.ent_type_

'GPE'

In [None]:
token2.ent_iob_

'I'

In [None]:
token2.lemma_

'States'

In [None]:
sentence1[12]

known

In [None]:
sentence1[12].lemma_

'know'

In [None]:
token2.morph

Number=Sing

In [None]:
sentence1[12].morph

Aspect=Perf|Tense=Past|VerbForm=Part

In [None]:
token2.pos_

'PROPN'

In [None]:
token2.tag_

'NNP'

In [None]:
token2.dep_

'nsubj'

In [None]:
token2.lang_

'en'

### Part-of-Speech (POS) Tagging

In [None]:
text = "Mike enjoys playing football"
doc2 = nlp(text)
print(doc2)

Mike enjoys playing football


In [None]:
for token in doc2:
  print(token.text, token.pos_, token.dep_)

Mike PROPN nsubj
enjoys VERB ROOT
playing VERB xcomp
football NOUN dobj


In [51]:
from spacy import displacy
displacy.render(doc2, style="dep")

### Understanding POS Tags

spaCy uses the Universal POS tags, which are a simplified set of POS tags that are consistent across different languages. Here are some common POS tags and their meanings:

- **NOUN**: Noun (e.g., 'dog', 'car')
- **PROPN**: Proper noun (e.g., 'Mike', 'America')
- **VERB**: Verb (e.g., 'run', 'speak')
- **AUX**: Auxiliary (e.g., 'is', 'will')
- **ADJ**: Adjective (e.g., 'blue', 'quick')
- **ADV**: Adverb (e.g., 'quickly', 'very')
- **PRON**: Pronoun (e.g., 'he', 'they')
- **DET**: Determiner (e.g., 'the', 'a')
- **ADP**: Adposition (e.g., 'in', 'to')
- **CCONJ**: Coordinating conjunction (e.g., 'and', 'but')
- **SCONJ**: Subordinating conjunction (e.g., 'if', 'because')
- **NUM**: Numeral (e.g., 'one', 'two')
- **PART**: Particle (e.g., "'s", 'not')
- **INTJ**: Interjection (e.g., 'wow', 'oops')
- **PUNCT**: Punctuation (e.g., '.', ',')
- **SYM**: Symbol (e.g., '$', '%')
- **X**: Other (e.g., foreign words, typos)

### Detailed POS Tags

spaCy also provides more detailed POS tags, which are language-specific and offer more granularity. For English, these detailed tags are based on the Penn Treebank POS tags. Here are some examples:

- **NN**: Noun, singular or mass
- **NNS**: Noun, plural
- **NNP**: Proper noun, singular
- **NNPS**: Proper noun, plural
- **VB**: Verb, base form
- **VBD**: Verb, past tense
- **VBG**: Verb, gerund or present participle
- **VBN**: Verb, past participle
- **VBP**: Verb, non-3rd person singular present
- **VBZ**: Verb, 3rd person singular present
- **JJ**: Adjective
- **JJR**: Adjective, comparative
- **JJS**: Adjective, superlative
- **RB**: Adverb
- **RBR**: Adverb, comparative
- **RBS**: Adverb, superlative


In [52]:
# Sample text
text = "SpaCy is a great NLP library that supports POS tagging."

# Process the text
doc = nlp(text)

# Print tokens and their POS tags
for token in doc:
    print(f'Token: {token.text}, POS: {token.pos_}, Detailed POS: {token.tag_}')


Token: SpaCy, POS: PROPN, Detailed POS: NNP
Token: is, POS: AUX, Detailed POS: VBZ
Token: a, POS: DET, Detailed POS: DT
Token: great, POS: ADJ, Detailed POS: JJ
Token: NLP, POS: PROPN, Detailed POS: NNP
Token: library, POS: NOUN, Detailed POS: NN
Token: that, POS: PRON, Detailed POS: WDT
Token: supports, POS: VERB, Detailed POS: VBZ
Token: POS, POS: PROPN, Detailed POS: NNP
Token: tagging, POS: NOUN, Detailed POS: NN
Token: ., POS: PUNCT, Detailed POS: .


In [None]:
from spacy import displacy
displacy.render(sentence1, style="dep")

### Named Entity Recognition

In [None]:
for ent in doc.ents:
  print(ent.text, ent.label_)

The United States of America GPE
U.S.A. GPE
USA GPE
the United States GPE
U.S. GPE
US GPE
America GPE
North America LOC
50 CARDINAL
five CARDINAL
326 CARDINAL
Indian NORP
3.8 million square miles QUANTITY
9.8 million square kilometers QUANTITY
fourth ORDINAL
The United States GPE
Canada GPE
Mexico GPE
Bahamas GPE
Cuba GPE
more than 331 million CARDINAL
third ORDINAL
Washington GPE
D.C. GPE
New York GPE
Paleo-Indians NORP
Siberia LOC
North American NORP
at least 12,000 years ago DATE
European NORP
the 16th century DATE
The United States GPE
thirteen CARDINAL
British NORP
the East Coast LOC
Great Britain GPE
the American Revolutionary War ORG
the late 18th century DATE
U.S. GPE
North America LOC
Native Americans NORP
1848 DATE
the United States GPE
United States GPE
the second half of the 19th century DATE
the American Civil War ORG
Spanish NORP
World War EVENT
U.S. GPE
World War II EVENT
the Cold War EVENT
the United States GPE
the Korean War EVENT
the Vietnam War EVENT
the Soviet Union

In [None]:
displacy.render(doc, style="ent")

In [53]:
text = "Elon Musk founded SpaceX, which is headquartered in Hawthorne, California. The company has plans to colonize Mars by 2030."


In [54]:
# Process the text
doc = nlp(text)


In [55]:
# Print named entities with additional details
for ent in doc.ents:
    print(f'Entity: {ent.text}, Label: {ent.label_}, Start: {ent.start_char}, End: {ent.end_char}, Explanation: {spacy.explain(ent.label_)}')


Entity: Elon Musk, Label: PERSON, Start: 0, End: 9, Explanation: People, including fictional
Entity: Hawthorne, Label: GPE, Start: 52, End: 61, Explanation: Countries, cities, states
Entity: California, Label: GPE, Start: 63, End: 73, Explanation: Countries, cities, states
Entity: Mars, Label: LOC, Start: 109, End: 113, Explanation: Non-GPE locations, mountain ranges, bodies of water
Entity: 2030, Label: DATE, Start: 117, End: 121, Explanation: Absolute or relative dates or periods


### Word Vectors and spaCy

In [None]:
!python -m spacy download en_core_web_md

Collecting en-core-web-md==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.7.1/en_core_web_md-3.7.1-py3-none-any.whl (42.8 MB)
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.7.1
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [None]:
sentence1[0].vector

array([ 1.59222260e-01, -4.17643905e-01, -6.36026502e-01,  1.97980964e+00,
       -9.92742717e-01,  3.27986181e-01, -1.43436813e+00,  3.92762125e-01,
       -2.21372217e-01,  1.23972368e+00,  1.46654427e+00,  9.07576144e-01,
       -3.52612495e-01, -7.64090002e-01, -9.86878872e-01, -9.11739588e-01,
       -6.99107826e-01,  1.38432574e+00, -1.03957617e+00, -2.26922899e-01,
       -1.09980154e+00, -2.52399504e-01,  2.09237009e-01, -1.43642986e+00,
        1.98636830e+00,  4.54234242e-01,  8.36412311e-01,  1.15134805e-01,
        5.11804223e-03,  8.08914304e-01,  4.18873399e-01, -1.57853103e+00,
        6.36767626e-01, -9.50180352e-01,  5.02419174e-01,  1.34429443e+00,
        8.22311565e-02, -1.14306271e-01,  1.54729724e-01,  2.90426373e+00,
       -3.52550358e-01,  7.49375045e-01, -1.52755511e+00,  4.65825021e-01,
       -1.63595057e+00,  7.50666797e-01,  5.89215100e-01,  1.65174007e+00,
        7.01108217e-01,  2.49644145e-01, -9.34628427e-01, -4.80721891e-01,
        3.71875763e-01,  

In [None]:
nlp = spacy.load("en_core_web_md")

In [None]:
with open("data/wiki_us.txt", "r") as f:
  text = f.read()

In [None]:
doc = nlp(text)
sentence1 = list(doc.sents)[0]
print(sentence1)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.


In [None]:
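import numpy as np
your_word = "country"

# Find the 10 vectors in the vocabulary closest to the vector of `your_word`
ms = nlp.vocab.vectors.most_similar(
    np.asarray([nlp.vocab.vectors[nlp.vocab.strings[your_word]]]), n=10
)
# ms is a (keys, best_rows, scores) tuple; map the key hashes back to strings
words = [nlp.vocab.strings[w] for w in ms[0][0]]
distances = ms[2]  # similarity scores for the returned keys
print(words)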

['country—0,467', 'nationâ\x80\x99s', 'countries-', 'continente', 'Carnations', 'pastille', 'бесплатно', 'Argents', 'Tywysogion', 'Teeters']


In [None]:
doc1 = nlp("I like salty ries and hamburgers.")
doc2 = nlp("Fast food tastes very good.")

In [None]:
print(doc1, "<->", doc2, doc1.similarity(doc2))

I like salty ries and hamburgers <-> Fast food tastes very good 0.6068287196656371


### spaCy Pipelines

In [3]:
import spacy
nlp = spacy.blank("en")

In [4]:
nlp.add_pipe("sentencizer")

<spacy.pipeline.sentencizer.Sentencizer at 0x7f6049894e00>

In [5]:
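import requests
from bs4 import BeautifulSoup

# Download the complete works of Shakespeare as one large text file
s = requests.get("https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt")
# Rejoin words hyphenated across line breaks and flatten the remaining newlines
soup = BeautifulSoup(s.content).text.replace("-\n", "").replace("\n", " ")
# Raise spaCy's default limit of 1,000,000 characters so the whole text fits in one Doc
nlp.max_length = 5278439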

In [6]:
%%time
doc = nlp(soup)
print(len(list(doc.sents)))

94134
CPU times: user 17.9 s, sys: 281 ms, total: 18.1 s
Wall time: 19.7 s


In [7]:
nlp.analyze_pipes()

{'summary': {'sentencizer': {'assigns': ['token.is_sent_start', 'doc.sents'],
   'requires': [],
   'scores': ['sents_f', 'sents_p', 'sents_r'],
   'retokenizes': False}},
 'problems': {'sentencizer': []},
 'attrs': {'token.is_sent_start': {'assigns': ['sentencizer'], 'requires': []},
  'doc.sents': {'assigns': ['sentencizer'], 'requires': []}}}

In [8]:
nlp2 = spacy.load("en_core_web_sm")

In [9]:
nlp2.analyze_pipes()

{'summary': {'tok2vec': {'assigns': ['doc.tensor'],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'tagger': {'assigns': ['token.tag'],
   'requires': [],
   'scores': ['tag_acc'],
   'retokenizes': False},
  'parser': {'assigns': ['token.dep',
    'token.head',
    'token.is_sent_start',
    'doc.sents'],
   'requires': [],
   'scores': ['dep_uas',
    'dep_las',
    'dep_las_per_type',
    'sents_p',
    'sents_r',
    'sents_f'],
   'retokenizes': False},
  'attribute_ruler': {'assigns': [],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'lemmatizer': {'assigns': ['token.lemma'],
   'requires': [],
   'scores': ['lemma_acc'],
   'retokenizes': False},
  'ner': {'assigns': ['doc.ents', 'token.ent_iob', 'token.ent_type'],
   'requires': [],
   'scores': ['ents_f', 'ents_p', 'ents_r', 'ents_per_type'],
   'retokenizes': False}},
 'problems': {'tok2vec': [],
  'tagger': [],
  'parser': [],
  'attribute_ruler': [],
  'lemmatizer': [],
  'ner': []},
 'att

### spaCy EntityRuler

In [12]:
nlp = spacy.load("en_core_web_sm")
text = "Mr. Deeds It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"
doc = nlp(text)

In [13]:
for ent in doc.ents:
  print(ent.text, ent.label_)

Deeds It PERSON
Apple ORG
first ORDINAL
U.S. GPE
$1 trillion MONEY


In [14]:
# With no position given, the ruler is appended at the end, after ner
ruler = nlp.add_pipe("entity_ruler")

In [15]:
nlp.analyze_pipes()

{'summary': {'tok2vec': {'assigns': ['doc.tensor'],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'tagger': {'assigns': ['token.tag'],
   'requires': [],
   'scores': ['tag_acc'],
   'retokenizes': False},
  'parser': {'assigns': ['token.dep',
    'token.head',
    'token.is_sent_start',
    'doc.sents'],
   'requires': [],
   'scores': ['dep_uas',
    'dep_las',
    'dep_las_per_type',
    'sents_p',
    'sents_r',
    'sents_f'],
   'retokenizes': False},
  'attribute_ruler': {'assigns': [],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'lemmatizer': {'assigns': ['token.lemma'],
   'requires': [],
   'scores': ['lemma_acc'],
   'retokenizes': False},
  'ner': {'assigns': ['doc.ents', 'token.ent_iob', 'token.ent_type'],
   'requires': [],
   'scores': ['ents_f', 'ents_p', 'ents_r', 'ents_per_type'],
   'retokenizes': False},
  'entity_ruler': {'assigns': ['doc.ents', 'token.ent_type', 'token.ent_iob'],
   'requires': [],
   'scores': ['ents_f', 'ent

In [16]:
patterns = [{"label": "ORG", "pattern": "Apple"}, {"label": "GPE", "pattern": "New York"}]
ruler.add_patterns(patterns)

In [17]:
doc2 = nlp(text)
for ent in doc2.ents:
  print(ent.text, ent.label_)

Deeds It PERSON
Apple ORG
first ORDINAL
U.S. GPE
$1 trillion MONEY


In [18]:
nlp2 = spacy.load("en_core_web_sm")

In [20]:
ruler = nlp2.add_pipe('entity_ruler', before='ner')

In [21]:
doc = nlp2(text)



In [22]:
for ent in doc.ents:
  print(ent.text, ent.label_)

Deeds It PERSON
Apple ORG
first ORDINAL
U.S. GPE
$1 trillion MONEY


In [23]:
nlp3 = spacy.load("en_core_web_sm")

In [24]:
# Insert the ruler before ner so rule-based matches are set first
ruler = nlp3.add_pipe('entity_ruler', before='ner')

In [25]:
patterns = [{"label": "ORG", "pattern": "Apple"}, {"label": "GPE", "pattern": "New York"}]
ruler.add_patterns(patterns)

In [26]:
doc = nlp3(text)

In [27]:
for ent in doc.ents:
  print(ent.text, ent.label_)

Deeds It PERSON
Apple ORG
first ORDINAL
U.S. GPE
$1 trillion MONEY
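
In the runs above the output never changes, because the statistical model already labels "Apple" as ORG and "New York" does not occur in the text. To see the ruler's effect in isolation, the following sketch adds the same patterns to a blank English pipeline, so every entity found must come from the rules. The sample sentence is made up for this example.

```python
import spacy

# Blank pipeline: no statistical NER, only the rule-based ruler.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "ORG", "pattern": "Apple"},
    {"label": "GPE", "pattern": "New York"},
])

doc = nlp("Apple opened a new office in New York.")
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```

String patterns like these match exact token text, so "apple" in lower case would not be matched.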


### spaCy Matcher

In [28]:
from spacy.matcher import Matcher

In [29]:
nlp = spacy.load("en_core_web_sm")

In [31]:
matcher = Matcher(nlp.vocab)

In [32]:
# Match any single token that looks like an email address
pattern = [{"LIKE_EMAIL": True}]

In [33]:
matcher.add("EMAIL_ADDRESS", [pattern])

In [34]:
doc = nlp("This is an email address: john@example.com")

In [35]:
matches = matcher(doc)

In [37]:
# Each match is a (match_id, start, end) tuple; start and end are token indices
print(matches)

[(16571425990740197027, 6, 7)]
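
The raw tuple is hard to read on its own. As a small sketch, the snippet below repeats the same match on a blank English pipeline (sufficient here because `LIKE_EMAIL` is a lexical attribute) and decodes each result: the hash is looked up in `nlp.vocab.strings` to recover the rule name, and the token indices are used to slice the matched span out of the `Doc`.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("EMAIL_ADDRESS", [[{"LIKE_EMAIL": True}]])

doc = nlp("This is an email address: john@example.com")

results = []
for match_id, start, end in matcher(doc):
    rule_name = nlp.vocab.strings[match_id]  # hash -> "EMAIL_ADDRESS"
    span = doc[start:end]                    # token indices -> Span
    results.append((rule_name, start, end, span.text))

print(results)
```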


### Dependency Parsing with spaCy

Dependency parsing is the process of analyzing the grammatical structure of a sentence to establish relationships between "head" words and words that modify those heads. Dependency parsing helps in understanding the syntactic structure of a sentence, which is useful for various NLP tasks like information extraction, question answering, and more.

### Why Dependency Parsing is Important

- **Syntax Understanding**: Helps in understanding how words in a sentence relate to each other.
- **Information Extraction**: Useful for extracting relationships and entities from text.
- **Text Analysis**: Provides insights into the structure and meaning of text.

### Dependency Labels and Relations

In dependency parsing, each word in a sentence is assigned a grammatical relation to another word. These relations are called dependency labels. Some common dependency labels include:

- **nsubj**: Nominal subject
- **dobj**: Direct object
- **iobj**: Indirect object
- **prep**: Prepositional modifier
- **pobj**: Object of a preposition
- **amod**: Adjectival modifier
- **advmod**: Adverbial modifier
- **attr**: Attribute
- **ROOT**: Root of the sentence (its head is the token itself)
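
To show how these labels form a tree, here is a library-free sketch that reuses the parser output printed earlier in this notebook for "SpaCy is a great NLP library!". It groups each token under its head. Keying by token text is a simplification that only works because no word repeats in this sentence; real code would use token indices.

```python
from collections import defaultdict

# (token, dependency label, head) triples copied from the earlier output.
triples = [
    ("SpaCy", "nsubj", "is"),
    ("is", "ROOT", "is"),
    ("a", "det", "library"),
    ("great", "amod", "library"),
    ("NLP", "compound", "library"),
    ("library", "attr", "is"),
    ("!", "punct", "is"),
]

# The root is the one token labelled ROOT.
root = next(token for token, dep, head in triples if dep == "ROOT")

# Map every head to the tokens that depend on it.
children = defaultdict(list)
for token, dep, head in triples:
    if dep != "ROOT":
        children[head].append((token, dep))


def print_tree(token, depth=0):
    """Print the token, then its dependents indented one level deeper."""
    print("  " * depth + token)
    for child, dep in children[token]:
        print("  " * (depth + 1) + f"[{dep}]")
        print_tree(child, depth + 2)


print_tree(root)
```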


In [57]:
# Sample text
text = "Apple is looking at buying U.K. startup for $1 billion on January 1, 2023."

# Process the text
doc = nlp(text)


In [58]:
# Print tokens and their dependencies
print("\nDependencies:")
for token in doc:
    print(f'Token: {token.text}, Dependency: {token.dep_}, Head: {token.head.text}')



Dependencies:
Token: Apple, Dependency: nsubj, Head: looking
Token: is, Dependency: aux, Head: looking
Token: looking, Dependency: ROOT, Head: looking
Token: at, Dependency: prep, Head: looking
Token: buying, Dependency: pcomp, Head: at
Token: U.K., Dependency: dobj, Head: buying
Token: startup, Dependency: dep, Head: looking
Token: for, Dependency: prep, Head: startup
Token: $, Dependency: quantmod, Head: billion
Token: 1, Dependency: compound, Head: billion
Token: billion, Dependency: pobj, Head: for
Token: on, Dependency: prep, Head: startup
Token: January, Dependency: pobj, Head: on
Token: 1, Dependency: nummod, Head: January
Token: ,, Dependency: punct, Head: January
Token: 2023, Dependency: nummod, Head: January
Token: ., Dependency: punct, Head: looking


In [59]:
from spacy import displacy

# Visualize the dependency structure
displacy.render(doc, style="dep", jupyter=True)


### Lemmatization with spaCy

**Lemmatization** is the process of converting words to their base or dictionary form. This is important in natural language processing (NLP) as it helps in normalizing words to their root forms, making text analysis more efficient and accurate.

### Why Lemmatization is Important

- **Text Normalization**: Converts words to their base forms, reducing the complexity of text data.
- **Improved Accuracy**: Enhances the accuracy of text analysis by treating different forms of a word as a single entity.
- **Consistency**: Ensures consistency in text representation, which is crucial for tasks like text classification and information retrieval.

### Differences Between Stemming and Lemmatization

- **Stemming**: Involves cutting off the end or beginning of a word to find its base form. It is a crude method and can sometimes produce non-dictionary forms.
- **Lemmatization**: Uses vocabulary and morphological analysis to find the base form of a word. It ensures that the base form is a valid word in the language.
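
The contrast can be sketched with two toy functions. Neither is a real algorithm: `toy_stem` strips a handful of suffixes and is far cruder than the Porter stemmer, and `toy_lemma` is a hand-written lookup table standing in for the vocabulary a lemmatizer relies on. Both names and the word lists are made up for this illustration.

```python
# Toy suffix rules, checked in order.
SUFFIXES = ("ing", "ed", "s")

# Tiny hand-written dictionary of word -> base form.
LEMMA_LOOKUP = {"bats": "bat", "hanging": "hang", "feet": "foot", "are": "be"}


def toy_stem(word: str) -> str:
    """Chop off the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def toy_lemma(word: str) -> str:
    """Look the word up in the dictionary; fall back to the word itself."""
    return LEMMA_LOOKUP.get(word, word)


comparison = {w: (toy_stem(w), toy_lemma(w)) for w in ["bats", "hanging", "feet", "are"]}
for word, (stem, lemma) in comparison.items():
    print(f"{word:8} stem={stem:6} lemma={lemma}")
```

Regular forms such as "bats" come out the same either way, while irregular forms such as "feet" and "are" are only resolved by the dictionary lookup.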


In [60]:
# Sample text
text = "The striped bats are hanging on their feet for best."

# Process the text
doc = nlp(text)


In [61]:
# Print tokens and their lemmatized forms
print("Lemmatization:")
for token in doc:
    print(f'Token: {token.text}, Lemma: {token.lemma_}')


Lemmatization:
Token: The, Lemma: the
Token: striped, Lemma: stripe
Token: bats, Lemma: bat
Token: are, Lemma: be
Token: hanging, Lemma: hang
Token: on, Lemma: on
Token: their, Lemma: their
Token: feet, Lemma: foot
Token: for, Lemma: for
Token: best, Lemma: good
Token: ., Lemma: .


In [62]:
import spacy
from nltk.stem import PorterStemmer

# Load spaCy model
nlp = spacy.load('en_core_web_sm')

# Initialize Porter Stemmer
stemmer = PorterStemmer()

# Sample text
text = "The striped bats are hanging on their feet for best."

# Process the text with spaCy
doc = nlp(text)

# Print stemming and lemmatization
print("Token  | Stemming  | Lemmatization")
print("------------------------------------")
for token in doc:
    stem = stemmer.stem(token.text)
    lemma = token.lemma_
    print(f"{token.text:7} | {stem:8} | {lemma}")


Token  | Stemming  | Lemmatization
------------------------------------
The     | the      | the
striped | stripe   | stripe
bats    | bat      | bat
are     | are      | be
hanging | hang     | hang
on      | on       | on
their   | their    | their
feet    | feet     | foot
for     | for      | for
best    | best     | good
.       | .        | .


### What are Word Vectors?
Word vectors are representations of words in a continuous vector space where similar words are mapped close to each other. These vectors capture semantic meaning, allowing for operations like finding similar words, analogies, and more.
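
"Close to each other" is usually measured with cosine similarity, and analogies are commonly approached with vector arithmetic. The sketch below shows both on hand-picked 3-dimensional vectors. The values in `toy_vectors` are hypothetical and chosen only to make the arithmetic easy to follow; they do not come from any trained model.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy vectors, not taken from a real embedding model
toy_vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.2, 0.8]),
    "apple": np.array([0.1, 0.1, 0.1]),
}

print(cosine_similarity(toy_vectors["king"], toy_vectors["queen"]))

# Analogy by arithmetic: king - man + woman should land near queen
target = toy_vectors["king"] - toy_vectors["man"] + toy_vectors["woman"]
candidates = {w: v for w, v in toy_vectors.items() if w not in {"king", "man", "woman"}}
best = max(candidates, key=lambda w: cosine_similarity(candidates[w], target))
print(best)
```

With real embeddings the result of the arithmetic only approximates the expected word, which is why the nearest neighbour is looked up rather than compared for equality.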

### Pre-trained Word Vectors
Several pre-trained word vectors are available, such as:

1. Word2Vec: Trained on the Google News dataset.
2. GloVe: Released in several variants, trained on Wikipedia plus Gigaword, Common Crawl, or Twitter.
3. FastText: Trained on Wikipedia and Common Crawl; it also uses subword (character n-gram) information.

In [1]:
import gensim.downloader as api

# Load pre-trained 50-dimensional GloVe vectors (Wikipedia + Gigaword).
# gensim downloads the data on first use.
model = api.load("glove-wiki-gigaword-50")




In [2]:
# Similarity between two words
similarity = model.similarity("king", "queen")
print(f"Similarity between 'king' and 'queen': {similarity}")


Similarity between 'king' and 'queen': 0.7839043140411377


In [3]:
# Words most similar to 'king'
similar_words = model.most_similar("king", topn=5)
print("Words most similar to 'king':")
# Use a separate loop variable so the earlier `similarity` value is not overwritten
for word, score in similar_words:
    print(f"{word}: {score}")


Words most similar to 'king':
prince: 0.8236179351806641
queen: 0.7839043140411377
ii: 0.7746230363845825
emperor: 0.7736247777938843
son: 0.766719400882721


In [4]:
import numpy as np

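def phrase_vector(phrase, model):
    """Average the vectors of the words in `phrase` that the model knows."""
    words = phrase.split()
    # Words missing from the model's vocabulary are silently skipped
    word_vectors = [model[word] for word in words if word in model]
    return np.mean(word_vectors, axis=0)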

phrase1 = "king of the jungle"
phrase2 = "lion of the forest"

vector1 = phrase_vector(phrase1, model)
vector2 = phrase_vector(phrase2, model)

similarity = np.dot(vector1, vector2) / (np.linalg.norm(vector1) * np.linalg.norm(vector2))
print(f"Similarity between '{phrase1}' and '{phrase2}': {similarity}")


Similarity between 'king of the jungle' and 'lion of the forest': 0.9040785431861877


In [5]:
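def document_vector(document, model):
    """Average the vectors of the words in `document` that the model knows."""
    words = document.split()
    # Note: this GloVe vocabulary is lowercase, so capitalized tokens such as
    # "The" and "A" fail the membership test and are skipped.
    word_vectors = [model[word] for word in words if word in model]
    return np.mean(word_vectors, axis=0)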

doc1 = "The quick brown fox jumps over the lazy dog"
doc2 = "A fast brown fox leaps over a sleepy dog"

vector1 = document_vector(doc1, model)
vector2 = document_vector(doc2, model)

similarity = np.dot(vector1, vector2) / (np.linalg.norm(vector1) * np.linalg.norm(vector2))
print(f"Similarity between documents: {similarity}")


Similarity between documents: 0.9711934924125671
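
The two averaging helpers above skip any token that is not in the vocabulary, and if every token is skipped the mean of an empty list is undefined (NumPy returns `nan` with a warning). A slightly more defensive variant might look like the sketch below. The name `average_vector`, its `dim` parameter, and the 2-dimensional `toy` table are assumptions made for illustration; the table is a plain dict standing in for the gensim model, on the assumption that only membership tests and indexing are needed, as in the cells above.

```python
import numpy as np

def average_vector(text, vectors, dim):
    """Mean vector of the known, lowercased tokens; a zero vector if none are known."""
    tokens = text.lower().split()
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        # Avoid np.mean on an empty list, which would produce nan
        return np.zeros(dim)
    return np.mean(known, axis=0)

# Hypothetical 2-dimensional lookup table used in place of a real model
toy = {"the": np.array([1.0, 0.0]), "fox": np.array([0.0, 1.0])}

v = average_vector("The fox", toy, dim=2)       # "The" is lowercased before lookup
empty = average_vector("zzz qqq", toy, dim=2)   # no known tokens
print(v, empty)
```

With the gensim model, `dim` would presumably be `model.vector_size`. A zero vector still needs care downstream, since its norm is zero and cosine similarity divides by the norm.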


### Rule-based matching
Rule-based matching is a Natural Language Processing (NLP) technique that finds specific patterns or entities in text using hand-written rules rather than a trained statistical model. spaCy provides several tools for it:

1. Matcher: Matches sequences of tokens using patterns over token attributes (such as text, part of speech, or lemma).
2. PhraseMatcher: Efficiently matches exact phrases from a given list, which suits large terminology lists.
3. EntityRuler: A pipeline component that uses token patterns (as in Matcher) and phrase patterns (as in PhraseMatcher) to add named entities to `doc.ents`; it can be used alone or alongside the statistical named entity recognizer (NER).

In [6]:
# The Matcher allows you to create complex patterns to match sequences of tokens based on their attributes.


import spacy
from spacy.matcher import Matcher

# Load a pre-trained spacy model
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

# Define a pattern to match 'Hello World'
pattern = [{"LOWER": "hello"}, {"LOWER": "world"}]
matcher.add("HELLO_WORLD_PATTERN", [pattern])

# Apply the matcher to a document
doc = nlp("Hello world! This is an example.")
matches = matcher(doc)

# Print the results
for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(f"Matched text: {matched_span.text}")


Matched text: Hello world
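
Token patterns can also use boolean flags and quantifiers. The sketch below makes a punctuation token between the two words optional by combining `IS_PUNCT` with the `OP` key. It uses `spacy.blank("en")`, which provides only the tokenizer, so no model download is assumed; the names `blank_nlp`, `op_matcher`, and the pattern label are chosen for this example.

```python
import spacy
from spacy.matcher import Matcher

# Tokenizer-only English pipeline; lexical attributes such as LOWER and
# IS_PUNCT are available without a trained model
blank_nlp = spacy.blank("en")
op_matcher = Matcher(blank_nlp.vocab)

# "hello", then zero or one punctuation token, then "world"
pattern = [
    {"LOWER": "hello"},
    {"IS_PUNCT": True, "OP": "?"},
    {"LOWER": "world"},
]
op_matcher.add("HELLO_OPTIONAL_PUNCT", [pattern])

doc = blank_nlp("Hello, world! Hello world!")
matched_texts = [doc[start:end].text for _, start, end in op_matcher(doc)]
print(matched_texts)
```

Patterns that rely on part-of-speech tags or lemmas would need a pipeline that sets those attributes, such as the `en_core_web_sm` model loaded in the cell above.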


In [7]:
# The PhraseMatcher is used to match exact phrases from a given list.

import spacy
from spacy.matcher import PhraseMatcher

# Load a pre-trained spacy model
nlp = spacy.load("en_core_web_sm")
phrase_matcher = PhraseMatcher(nlp.vocab)

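# List of phrases to match
phrases = ["New York City", "San Francisco", "Los Angeles"]
# Build the pattern Docs with the tokenizer only (nlp.make_doc); running the
# full pipeline is unnecessary for exact-text matching
patterns = [nlp.make_doc(phrase) for phrase in phrases]
phrase_matcher.add("CITIES", patterns)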

# Apply the matcher to a document
doc = nlp("I love New York City and Los Angeles.")
matches = phrase_matcher(doc)

# Print the results
for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(f"Matched text: {matched_span.text}")


Matched text: New York City
Matched text: Los Angeles
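
The EntityRuler from the list above is not demonstrated in the preceding cells. A minimal sketch follows, assuming spaCy v3's `nlp.add_pipe("entity_ruler")` factory and a blank English pipeline so that no trained NER is involved. The labels, patterns, and sample sentence are illustrative choices, not part of the original notebook.

```python
import spacy

# Blank pipeline: the ruler is the only component that sets entities
ruler_nlp = spacy.blank("en")
ruler = ruler_nlp.add_pipe("entity_ruler")

ruler.add_patterns([
    # Phrase pattern: a plain string, matched as exact text
    {"label": "GPE", "pattern": "San Francisco"},
    # Token pattern: a list of attribute dicts, as used by the Matcher
    {"label": "ORG", "pattern": [{"LOWER": "acme"}, {"LOWER": "corp"}]},
])

doc = ruler_nlp("Acme Corp opened an office in San Francisco.")
entities = [(ent.text, ent.label_) for ent in doc.ents]
for text, label in entities:
    print(f"{text}: {label}")
```

In a pipeline that already contains a statistical `ner` component, where the ruler is placed relative to it affects which entities take precedence, so the position passed to `add_pipe` is a design choice worth checking against the spaCy documentation.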
