Imports the spaCy library and downloads the medium English model (en_core_web_md), which includes word vectors for semantic similarity tasks.

In [2]:
import spacy
!python -m spacy download en_core_web_md

Collecting en-core-web-md==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.8.0/en_core_web_md-3.8.0-py3-none-any.whl (33.5 MB)
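
Re-running the cell above repeats the download command. A minimal sketch of a guard, assuming the model is installed as an importable Python package of the same name (which is how spaCy model wheels are packaged); `is_installed` and `ensure_model` are hypothetical helper names, not part of spaCy:

```python
import importlib.util
import subprocess
import sys


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package with this name can be found."""
    return importlib.util.find_spec(package_name) is not None


def ensure_model(model_name: str = "en_core_web_md") -> None:
    """Run the spaCy download command only when the model package is missing."""
    if not is_installed(model_name):
        subprocess.run(
            [sys.executable, "-m", "spacy", "download", model_name],
            check=True,  # raise if the download command fails
        )
```

Calling `ensure_model()` before `spacy.load("en_core_web_md")` would then skip the download on later runs.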

Loads the medium English model and reads text from wiki_us.txt.
Processes it with spaCy’s NLP pipeline, storing the result in doc.
Extracts the first sentence from the document as sentence1.

In [3]:
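nlp = spacy.load("en_core_web_md")
# Read the whole file; encoding="utf-8" assumes the text file is UTF-8 encoded
with open("data/wiki_us.txt", "r", encoding="utf-8") as f:
    text = f.read()
doc = nlp(text)  # run the full pipeline over the text
sentence1 = list(doc.sents)[0]  # first sentence, as a Span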

In [8]:
print(sentence1)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.


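Looks up the 10 entries in spaCy’s vector table that are closest to the vector for "country".
most_similar() returns three arrays: the keys (string hashes), the matching row indices, and the similarity scores.
Note that the words printed below are not obvious synonyms of "country". A likely reason is that en_core_web_md stores far fewer unique vectors than vocabulary keys, so many words share one row and the key reported for a row can be any of them.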

In [20]:
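import numpy as np
your_word = "country"

# Build a (1, dim) query array from the word's vector
query = np.asarray([nlp.vocab.vectors[nlp.vocab.strings[your_word]]])
# most_similar() returns (keys, best_rows, scores), one row per query vector
keys, best_rows, scores = nlp.vocab.vectors.most_similar(query, n=10)
words = [nlp.vocab.strings[w] for w in keys[0]]  # hashes back to strings
# scores holds similarity values (higher = closer), not distances
print(words)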

['SLUMS', 'inner-city', 'anti-poverty', 'Socioeconomic', 'INTERSECT', 'Divides', 'dropout', 'handicaps', 'drop-out', 'south-east']
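
To make the idea of "closest in vector space" concrete, here is a minimal NumPy sketch of a cosine-similarity nearest-neighbour lookup. It is not spaCy's implementation; the vocabulary, the 3-dimensional vectors, and the function name `most_similar_words` are all invented for illustration.

```python
import numpy as np

# Hypothetical vocabulary and toy 3-d vectors (real models use hundreds of dimensions).
toy_words = ["country", "nation", "state", "banana", "guitar"]
toy_table = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.7, 0.3, 0.0],
    [0.0, 0.9, 0.2],
    [0.1, 0.0, 0.9],
])


def most_similar_words(query_word, n=3):
    """Return the n (word, cosine similarity) pairs closest to query_word."""
    # Normalize each row to unit length so a dot product equals cosine similarity.
    unit = toy_table / np.linalg.norm(toy_table, axis=1, keepdims=True)
    query = unit[toy_words.index(query_word)]
    sims = unit @ query
    best_rows = np.argsort(-sims)[:n]  # indices of the highest scores first
    return [(toy_words[i], float(sims[i])) for i in best_rows]


print(most_similar_words("country"))
```

In this sketch the query word is its own nearest neighbour, followed by the words whose toy vectors point in nearly the same direction.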


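Creates two short documents and computes their semantic similarity using word vectors.
The .similarity() method returns the cosine similarity of the two document vectors, where each document vector is the average of its token vectors. Cosine similarity can range from -1 to 1, but scores for ordinary text are usually positive; higher values mean the documents are closer in vector space.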

In [28]:
doc1 = nlp("I like salty fries and hamburgers.")
doc2 = nlp("Fast food tastes very good.")

# Similarity of two documents
print(doc1, "<->", doc2, doc1.similarity(doc2))

I like salty fries and hamburgers. <-> Fast food tastes very good. 0.8015960454940796


In [43]:
doc3 = nlp("I like pizza")
doc4 = nlp("I like cars")

# Similarity of two documents
print(doc3, "<->", doc4, doc3.similarity(doc4))

I like pizza <-> I like cars 0.7633953094482422
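
"I like pizza" and "I like cars" score fairly high even though pizza and cars are unrelated. Because each document vector is an average of token vectors, the shared tokens "I" and "like" pull the two averages together. The sketch below reproduces that effect with hand-made 3-dimensional vectors; the numbers in `TOY_VECTORS` are invented for illustration and are not taken from any spaCy model.

```python
import math

# Hypothetical toy vectors, invented for this illustration only.
TOY_VECTORS = {
    "I": [0.9, 0.1, 0.0],
    "like": [0.8, 0.3, 0.1],
    "pizza": [0.1, 0.9, 0.2],
    "cars": [0.1, 0.1, 0.9],
}


def average_vector(tokens):
    """Average the token vectors component by component."""
    vectors = [TOY_VECTORS[t] for t in tokens]
    return [sum(component) / len(vectors) for component in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


word_score = cosine(TOY_VECTORS["pizza"], TOY_VECTORS["cars"])
doc_score = cosine(
    average_vector(["I", "like", "pizza"]),
    average_vector(["I", "like", "cars"]),
)
print(f"pizza vs cars: {word_score:.3f}")
print(f"'I like pizza' vs 'I like cars': {doc_score:.3f}")
```

With these toy numbers the document-level score comes out well above the word-level score, which is the same pattern as the 0.76 result above.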


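Creates a blank English pipeline and adds only a sentencizer component, which splits text into sentences using punctuation rules, without tagging, parsing, or NER.
It is useful for lightweight tasks or as the starting point for a custom pipeline. Note that this rebinds nlp, so the medium model loaded earlier is no longer available under that name.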

In [46]:
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

<spacy.pipeline.sentencizer.Sentencizer at 0x1d1ea4b3550>
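
A short sketch of what the sentencizer does on its own. The example text and the names `nlp_blank` and `demo_doc` are made up here so that the notebook's own `nlp` and `doc` variables are left untouched; no trained model is needed for this to run.

```python
import spacy

# Blank pipeline: tokenizer only, no trained components.
nlp_blank = spacy.blank("en")
nlp_blank.add_pipe("sentencizer")

demo_doc = nlp_blank("This is one sentence. Here is another! And a third?")
sentences = [sent.text for sent in demo_doc.sents]

print(nlp_blank.pipe_names)
print(sentences)
```

Since the sentencizer relies on punctuation rather than a parse, text with many abbreviations may be split less accurately than with a parser-based model.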

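Analyzes the current pipeline. For each component, the result lists the attributes it assigns, the attributes it requires, the scores it reports during evaluation, and whether it retokenizes; a separate "problems" entry flags unmet requirements.
Here it shows only the sentencizer, since the pipeline was created blank.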

In [47]:
nlp.analyze_pipes()

{'summary': {'sentencizer': {'assigns': ['token.is_sent_start', 'doc.sents'],
   'requires': [],
   'scores': ['sents_f', 'sents_p', 'sents_r'],
   'retokenizes': False}},
 'problems': {'sentencizer': []},
 'attrs': {'doc.sents': {'assigns': ['sentencizer'], 'requires': []},
  'token.is_sent_start': {'assigns': ['sentencizer'], 'requires': []}}}

Loads the small English model (en_core_web_sm), which includes tagging, parsing, lemmatization, and NER components, and analyzes its pipeline.
Comparing this output with the blank pipeline’s shows which attributes the trained components add.
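
The result of analyze_pipes() is a plain dictionary, so the comparison can also be done in code. The sketch below uses abridged copies of the two "summary" outputs shown in this notebook (only the "assigns" lists are kept); `assigned_attributes` is a hypothetical helper name, not a spaCy function.

```python
def assigned_attributes(analysis):
    """Collect every attribute that some component in the pipeline assigns."""
    attrs = set()
    for info in analysis["summary"].values():
        attrs.update(info["assigns"])
    return attrs


# Abridged from the blank pipeline's analyze_pipes() output.
blank_analysis = {
    "summary": {
        "sentencizer": {"assigns": ["token.is_sent_start", "doc.sents"]},
    }
}

# Abridged from the en_core_web_sm analyze_pipes() output.
small_analysis = {
    "summary": {
        "tok2vec": {"assigns": ["doc.tensor"]},
        "tagger": {"assigns": ["token.tag"]},
        "parser": {"assigns": ["token.dep", "token.head",
                               "token.is_sent_start", "doc.sents"]},
        "attribute_ruler": {"assigns": []},
        "lemmatizer": {"assigns": ["token.lemma"]},
        "ner": {"assigns": ["doc.ents", "token.ent_iob", "token.ent_type"]},
    }
}

# Attributes the small model provides that the blank pipeline does not.
only_in_small = assigned_attributes(small_analysis) - assigned_attributes(blank_analysis)
print(sorted(only_in_small))
```

In a live session the same helper could be called directly on `nlp.analyze_pipes()` and `nlp2.analyze_pipes()` instead of the copied dictionaries.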

In [48]:
nlp2 = spacy.load("en_core_web_sm")

In [49]:
nlp2.analyze_pipes()

{'summary': {'tok2vec': {'assigns': ['doc.tensor'],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'tagger': {'assigns': ['token.tag'],
   'requires': [],
   'scores': ['tag_acc',
    'pos_acc',
    'tag_micro_p',
    'tag_micro_r',
    'tag_micro_f'],
   'retokenizes': False},
  'parser': {'assigns': ['token.dep',
    'token.head',
    'token.is_sent_start',
    'doc.sents'],
   'requires': [],
   'scores': ['dep_uas',
    'dep_las',
    'dep_las_per_type',
    'sents_p',
    'sents_r',
    'sents_f'],
   'retokenizes': False},
  'attribute_ruler': {'assigns': [],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'lemmatizer': {'assigns': ['token.lemma'],
   'requires': [],
   'scores': ['lemma_acc'],
   'retokenizes': False},
  'ner': {'assigns': ['doc.ents', 'token.ent_iob', 'token.ent_type'],
   'requires': [],
   'scores': ['ents_f', 'ents_p', 'ents_r', 'ents_per_type'],
   'retokenizes': False}},
 'problems': {'tok2vec': [],
  'tagger': [],
  'parse