## Introduction to NLP and spaCy
Natural language processing (NLP) is a subfield of artificial intelligence concerned with enabling computers to understand human language. It involves analyzing, quantifying, understanding, and deriving meaning from natural languages. spaCy is a free, open-source NLP library for Python, written in Python and Cython. It is designed to make it easy to build systems for information extraction or general-purpose natural language processing.


spaCy provides trained models for different languages. The default model for English is en_core_web_sm. Because the models are quite large, they are installed separately from the library itself; bundling every language into one package would make the download far too large.

In [1]:
import spacy

nlp = spacy.load("en_core_web_sm")
nlp

<spacy.lang.en.English at 0x177f29b70>

In [2]:
# spacy.load() returns a callable Language object, which is conventionally assigned to a variable named nlp.
# To start processing text, construct a Doc object. A Doc is a sequence of Token objects, each representing
# one lexical token and carrying information about a particular piece of the text, typically a single word.
# You create a Doc by calling the Language object with the input string as its argument.

int_doc = nlp("This tutorial is about Natural Language Processing in spaCy.")
type(int_doc)

spacy.tokens.doc.Doc

In [3]:
[token.text for token in int_doc]

['This',
 'tutorial',
 'is',
 'about',
 'Natural',
 'Language',
 'Processing',
 'in',
 'spaCy',
 '.']
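
Each token carries more than its text. The sketch below prints a few lexical attributes for the same sentence. It assumes a blank English pipeline created with `spacy.blank("en")`, which only tokenizes and therefore does not need en_core_web_sm to be installed; the names `blank_nlp` and `attr_doc` are illustrative and not part of the original notebook.

```python
import spacy

# Assumption: a blank English pipeline is enough here because we only
# inspect tokenizer-level attributes, not tags, parses, or entities.
blank_nlp = spacy.blank("en")
attr_doc = blank_nlp("This tutorial is about Natural Language Processing in spaCy.")

for token in attr_doc:
    # idx is the character offset of the token in the original string;
    # is_alpha and is_punct are boolean lexical flags.
    print(f"{token.text:<12} idx={token.idx:<3} is_alpha={token.is_alpha} is_punct={token.is_punct}")
```

Attributes that depend on trained components, such as part-of-speech tags, would require a pipeline loaded with `spacy.load("en_core_web_sm")` instead.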

In [6]:
import pathlib

file_name = "golden.txt" # reading the file
intr_doc = nlp(pathlib.Path(file_name).read_text(encoding="utf8"))
print([token.text for token in intr_doc])

['\n', 'The', 'Golden', 'Retriever', 'is', 'a', 'popular', 'dog', 'breed', 'known', 'for', 'its', 'friendly', 'and', 'tolerant', 'attitude', ',', '\n', 'making', 'it', 'an', 'excellent', 'family', 'pet', '.', 'Originally', 'bred', 'for', 'retrieving', 'game', 'during', 'hunting', ',', '\n', 'they', 'are', 'highly', 'trainable', 'and', 'often', 'used', 'as', 'guide', 'dogs', 'and', 'in', 'search', '-', 'and', '-', 'rescue', 'missions']


In [7]:
# In the example above, the text file was read with encoding="utf8" passed explicitly to read_text().
# If your file uses a different encoding, pass that encoding's name as the encoding argument instead.
# If the file is not in the current working directory, give pathlib.Path() the full path to the file
# rather than just the file name. pathlib.Path() accepts a string or another pathlib.Path object.
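
As a minimal sketch of reading a file with a non-UTF-8 encoding from outside the working directory, the example below uses only the standard library. The temporary directory, the file name `golden_latin1.txt`, and the sample sentence are all hypothetical; they exist only so the snippet can run on its own.

```python
import pathlib
import tempfile

# Hypothetical location outside the current working directory.
data_dir = pathlib.Path(tempfile.mkdtemp())
file_path = data_dir / "golden_latin1.txt"

# Write sample text in Latin-1 so there is a non-UTF-8 file to read back.
sample_text = "Café visitors love the Golden Retriever."
file_path.write_text(sample_text, encoding="latin-1")

# pathlib.Path accepts a string or another Path; both refer to the same file.
from_string = pathlib.Path(str(file_path))
from_path = pathlib.Path(file_path)

# Read the file using the encoding it was actually written in.
text = from_string.read_text(encoding="latin-1")
print(text)
```

The resulting string can then be passed to `nlp(text)` exactly as in the earlier cell.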