In [2]:
import spacy  
nlp = spacy.load('en_core_web_sm')

nlp.pipeline 

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x7fa7cc20f280>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x7fa7cbf69e40>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x7fa7cc3f0eb0>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x7fa7cbd426c0>),
 ('lemmatizer',
  <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x7fa7cbd3d9c0>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x7fa7cc3f0f20>)]

The pipeline of en_core_web_sm consists of a token-to-vector layer (tok2vec), a tagger,
a parser, an attribute ruler, a lemmatizer, and a named-entity recognizer (ner), all of
which are language-dependent. The tokenizer is not listed in nlp.pipeline because
tokenization is always required.
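
To see that the tokenizer lives outside the component list, here is a minimal sketch. It assumes a blank English pipeline from spacy.blank("en") instead of en_core_web_sm, so it runs without a downloaded model; the names nlp_blank, doc_blank, and blank_tokens are only for illustration.

```python
import spacy

# A blank English pipeline: tokenizer only, no trained components.
nlp_blank = spacy.blank("en")

print(nlp_blank.pipe_names)       # names of the pipeline components
print(type(nlp_blank.tokenizer))  # the tokenizer is a separate attribute

# Tokenization still works although no component is listed.
doc_blank = nlp_blank("My best friend Ryan Peters likes fancy adventure games")
blank_tokens = [token.text for token in doc_blank]
print(blank_tokens)
```

The same nlp.tokenizer attribute exists on the pipeline loaded from en_core_web_sm.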

In [3]:
# Often you only need the tokenizer and the part-of-speech tagger.
# In that case, disable the parser and NER when loading the model:

nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

Processing Text

Calling nlp on a text returns an object of type
spacy.tokens.doc.Doc, a container to access the tokens, spans (ranges of tokens),
and their linguistic annotations.

In [4]:
nlp = spacy.load("en_core_web_sm")
text = "My best friend Ryan Peters likes fancy adventure games"
doc = nlp(text)

spaCy is object-oriented as well as nondestructive. The original text is always
retained. When you print the doc object, it uses doc.text, the property containing
the original text:

In [5]:
doc 

My best friend Ryan Peters likes fancy adventure games

But doc is also a container object for the tokens, and you can
iterate over them:

In [6]:
for token in doc:
    print(token, end='|')

My|best|friend|Ryan|Peters|likes|fancy|adventure|games|
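
Besides iteration, a Doc supports indexing and slicing, and each token remembers its character offset in the original text. The following is a small sketch of that, again assuming a blank English pipeline so that no model download is needed; nlp_demo, demo_text, demo_doc, first_token, and name_span are illustrative names.

```python
import spacy

nlp_demo = spacy.blank("en")
demo_text = "My best friend Ryan Peters likes fancy adventure games"
demo_doc = nlp_demo(demo_text)

first_token = demo_doc[0]   # indexing returns a Token
name_span = demo_doc[3:5]   # slicing returns a Span (a range of tokens)
print(first_token.text, "|", name_span.text)

# token.idx is the character offset of the token in the original text,
# and len(token) is its length in characters.
for token in demo_doc:
    start = token.idx
    end = start + len(token)
    assert demo_text[start:end] == token.text

# Nondestructive: the original text is retained unchanged.
print(demo_doc.text == demo_text)
```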

Each token is an object of spaCy's class Token. Tokens, as well as docs, have a
number of properties that are useful for language processing.

For each token, you find the lemma (lemma_), some descriptive flags (e.g., is_stop, is_alpha),
the part-of-speech tag (pos_), the dependency tag (dep_), and possibly some information
about the entity type (ent_type_).
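
One way to inspect these attributes is to collect them per token. The sketch below uses a hypothetical helper, token_rows, which is not part of spaCy. It tries to load en_core_web_sm and, as an assumption for portability, falls back to a blank pipeline if the model is not installed; in that case the model-based columns stay empty.

```python
import spacy


def token_rows(doc):
    """Collect a few attributes per token as plain dictionaries."""
    return [
        {
            "text": token.text,
            "lemma": token.lemma_,
            "is_stop": token.is_stop,
            "is_alpha": token.is_alpha,
            "pos": token.pos_,
            "dep": token.dep_,
            "ent_type": token.ent_type_,
        }
        for token in doc
    ]


try:
    nlp_attrs = spacy.load("en_core_web_sm")
except OSError:
    # Model not installed: tokenizer and rule-based flags still work,
    # but lemma, pos, dep, and ent_type will be empty strings.
    nlp_attrs = spacy.blank("en")

rows = token_rows(nlp_attrs("My best friend Ryan Peters likes fancy adventure games"))
for row in rows:
    print(row)
```

If pandas is available, passing rows to pandas.DataFrame gives a more readable table.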

The is_<something> flags are created based on rules, but
all part-of-speech, dependency, and named-entity attributes are based on neural network
models. So, there is always some degree of uncertainty in this information.
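
The difference between rule-based flags and model-based attributes can be seen with a pipeline that has no trained components. This is a sketch under that assumption: spacy.blank("en") provides only the tokenizer, so the flags are filled in while the part-of-speech tags are not. The names nlp_rules, rule_doc, and flags are illustrative.

```python
import spacy

nlp_rules = spacy.blank("en")  # tokenizer only, no neural components
rule_doc = nlp_rules("Ryan paid 20 dollars!")

# Rule-based flags are available without any trained model.
flags = [(t.text, t.is_alpha, t.is_punct, t.like_num) for t in rule_doc]
print(flags)

# Model-based attributes such as pos_ are not set by a blank pipeline.
print([t.pos_ for t in rule_doc])
```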

The corpora used for training contain a mixture of news articles and online articles. The
predictions of the model (en_core_web_sm) are fairly accurate if your data has similar linguistic characteristics.

But if your data is very different (Twitter data or IT service desk tickets,
for example), you should be aware that these predictions can be much
less reliable.

-----------------------------------------------------------------------------------------

Extracting Noun Phrases


In [None]:
text = "my best fr"