# Basic NLP Course

## spaCy Language Processing Pipelines

In this class, we will learn how to define and use language processing pipelines with spaCy. This is the second class of the Basic NLP Course, which covers two popular Python libraries for Natural Language Processing: NLTK and spaCy. Both provide tools for text processing, tokenization, parsing, and more.

spaCy takes an object-oriented approach: text goes into a pipeline and annotated objects come out, which suits users who care mainly about the end result and need efficient pipelines for NLP tasks. NLTK, by contrast, relies on string processing and exposes a wide range of algorithms, which gives more room for customization and suits users who want to experiment and fine-tune their workflows.
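To make the contrast concrete, here is a minimal sketch. It uses plain `str.split` as a stand-in for string-based processing (it is not NLTK itself) and a blank English pipeline, which needs no model download. The sentence is one of the questions used later in this class; the variable names are chosen for illustration.

```python
import spacy

text = "What is the boiling point of ethanol at 1 atm?"

# String-based view: whitespace splitting yields plain strings,
# and the question mark stays glued to the last word.
string_tokens = text.split()
print(string_tokens[-1])  # atm?

# Object-based view: spaCy returns a Doc made of Token objects.
nlp = spacy.blank("en")  # tokenizer only, no trained components
doc = nlp(text)
spacy_tokens = [token.text for token in doc]
print(spacy_tokens[-2:])  # ['atm', '?']

# Each Token carries attributes and knows its position in the Doc:
# token index, character offset, and lexical flags such as is_alpha.
last_word = doc[-2]
print(last_word.i, last_word.idx, last_word.is_alpha)
```

Because each token remembers its character offset, annotations can always be mapped back to the original text, which is harder to do once a string has been split.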

In [2]:
import spacy
from spacy import displacy

# create a blank English NLP object
nlp = spacy.blank("en")
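
A blank pipeline contains only the tokenizer, so attributes that depend on trained components stay empty. The following sketch checks this; the name `blank_nlp` is chosen here for illustration and is not part of the notebook.

```python
import spacy

blank_nlp = spacy.blank("en")

# No components are registered yet: only the tokenizer runs.
print(blank_nlp.pipe_names)  # []

doc = blank_nlp("What is the melting point of sodium chloride?")

# Tokenization works, but without a tagger the part-of-speech tags are empty.
print(len(doc))
print(repr(doc[0].pos_))          # ''
print(doc.has_annotation("TAG"))  # False
```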

In [3]:
# Example natural-language questions about physical and chemical properties, with conditions given in varying units
questions = [
    "What is the boiling point of ethanol at 1 atm?",
    "How much does the density of mercury change at 25 °C?",
    "What is the specific heat capacity of aluminum in its solid state?",
    "Can you tell me the viscosity of olive oil at 40 °C?",
    "What is the melting point of sodium chloride?",
    "How does the thermal conductivity of copper vary at 100 °C?",
    "What is the vapor pressure of acetone at 50 °C?",
    "What is the refractive index of water at 20 °C?",
    "What is the solubility of carbon dioxide in water at 5 °C?",
    "What is the surface tension of ethanol at 30 °C?",
    "What is the pH of a 0.1 M HCl solution at 25 °C?",
    "What is the enthalpy of vaporization of benzene at its boiling point?",
    "What is the compressibility factor of nitrogen gas at 300 K and 10 atm?",
    "What is the freezing point of a 10% NaCl solution?",
    "What is the electrical conductivity of pure water at 25 °C?"
]

In [4]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [5]:
# load pipeline
nlp = spacy.load("en_core_web_sm")
nlp.pipeline

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x703cbbf33170>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x703cbbf33530>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x703cbc1e6c00>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x703cbc31d690>),
 ('lemmatizer',
  <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x703cbdd20e10>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x703cbc1e6ce0>)]
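
The list above is the sequence of components that a text passes through after tokenization. As a hedged sketch of how such a pipeline can be defined by hand, the example below starts from a blank pipeline, adds the built-in `sentencizer`, and appends a custom component. The component name `question_flagger` and the `is_question` key are hypothetical and exist only for this illustration.

```python
import spacy
from spacy.language import Language


# Hypothetical component for illustration: it records in doc.user_data
# whether the text ends with a question mark.
@Language.component("question_flagger")
def question_flagger(doc):
    doc.user_data["is_question"] = doc.text.strip().endswith("?")
    return doc


custom_nlp = spacy.blank("en")
custom_nlp.add_pipe("sentencizer")       # built-in rule-based sentence splitter
custom_nlp.add_pipe("question_flagger")  # appended at the end by default
print(custom_nlp.pipe_names)  # ['sentencizer', 'question_flagger']

doc = custom_nlp("What is the melting point of sodium chloride?")
print(doc.user_data["is_question"])  # True

# Components can be switched off temporarily without rebuilding the pipeline.
with custom_nlp.select_pipes(disable=["question_flagger"]):
    skipped = custom_nlp("What is the vapor pressure of acetone at 50 °C?")
print("is_question" in skipped.user_data)  # False
```

The same `select_pipes` call works on a loaded model such as `en_core_web_sm`, for example to skip components whose annotations are not needed.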

In [6]:
# create a Doc for each question; nlp.pipe processes the texts as a batch
docs = list(nlp.pipe(questions))

for doc in docs:
    for token in doc:
        print(f"{token.text:12} {token.lemma_:12} {token.pos_:6} {token.dep_:10} {token.shape_:12} {token.is_alpha:6} {token.is_stop:6}")

What         what         PRON   attr       Xxxx              1      1
is           be           AUX    ROOT       xx                1      1
the          the          DET    det        xxx               1      1
boiling      boiling      NOUN   compound   xxxx              1      0
point        point        NOUN   nsubj      xxxx              1      0
of           of           ADP    prep       xx                1      1
ethanol      ethanol      NOUN   pobj       xxxx              1      0
at           at           ADP    prep       xx                1      1
1            1            NUM    nummod     d                 0      0
atm          atm          NOUN   pobj       xxx               1      0
?            ?            PUNCT  punct      ?                 0      0
How          how          SCONJ  advmod     Xxx               1      1
much         much         ADJ    dobj       xxxx              1      1
does         do           VERB   ROOT       xxxx              1      1
the   
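
The columns printed above are, in order, the token text, lemma, coarse part-of-speech tag, dependency label, word shape, and the `is_alpha` and `is_stop` flags. The two flags appear as `1` and `0` because a bare width specifier such as `:6` formats booleans as integers. The short labels can be decoded with `spacy.explain`; the sketch below looks up a few labels taken from the output and shows the boolean formatting behavior.

```python
import spacy

# Look up human-readable descriptions for labels seen in the output.
labels = ["AUX", "ADP", "nsubj", "pobj", "nummod"]
descriptions = {label: spacy.explain(label) for label in labels}
for label, description in descriptions.items():
    print(f"{label:8} {description}")

# Booleans formatted with a bare width are rendered as integers;
# converting to str first keeps the words True/False.
flag = True
as_int = f"{flag:6}"
as_text = f"{str(flag):6}"
print(f"{as_int}|{as_text}|")
```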

In [None]:
for doc in docs:
    displacy.render(doc, style="ent")
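
Outside a notebook, `displacy.render` can return the markup as a string instead of displaying it. The sketch below is an illustration under stated assumptions: it uses a blank pipeline and marks "300 K" by hand as a `QUANTITY` entity, because the source does not show which entities `en_core_web_sm` finds in these questions. When a Doc has no entities at all, spaCy typically emits a warning instead of rendering highlights.

```python
import spacy
from spacy import displacy

nlp_blank = spacy.blank("en")
text = "What is the compressibility factor of nitrogen gas at 300 K and 10 atm?"
doc = nlp_blank(text)

# Assumption for illustration: annotate "300 K" manually as a QUANTITY entity.
start = text.index("300 K")
span = doc.char_span(start, start + len("300 K"), label="QUANTITY")
doc.set_ents([span])

# jupyter=False makes render return the HTML markup instead of displaying it.
html = displacy.render(doc, style="ent", jupyter=False)
print(html[:80])
```

The returned string can be written to an `.html` file and opened in a browser.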

