# Chapter 1: Finding words, phrases, names and concepts

---

## Introduction to spaCy

### The `nlp` object
- At the center of spaCy is the object containing the processing pipeline. We usually call this variable "nlp".
- It contains all the different components in the pipeline.
- It also includes language-specific rules used for tokenizing the text into words and punctuation. spaCy supports a variety of languages, which are available in `spacy.lang`.
    - Korean and Chinese are supported.

In [1]:
# Import the English language class
from spacy.lang.en import English

# Create the nlp object
nlp = English()
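
The same pattern should work for the other language classes in `spacy.lang`. The sketch below uses German as an arbitrary example (the choice of language and the sample sentence are assumptions, not part of the course notes) and shows that `spacy.blank` is a shortcut for creating the same kind of object from a language code.

```python
import spacy
from spacy.lang.de import German

# Create a German nlp object directly from its language class
nlp_de = German()

# spacy.blank builds the same kind of object from the language code
nlp_de_blank = spacy.blank("de")

# Tokenize a short sentence with the German rules
tokens_de = [token.text for token in nlp_de("Hallo Welt!")]
print(tokens_de)

# Both objects are instances of the German language class
print(isinstance(nlp_de_blank, German))
```

Neither object has any trained pipeline components at this point; it only knows how to tokenize.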

### The `Doc` object
- When you process a text with the nlp object, spaCy creates a Doc object – short for "document".
- The Doc lets you access information about the text in a structured way, and no information is lost.
- The Doc also behaves like a normal Python sequence: you can iterate over its tokens, or get a token by its index. But more on that later!

In [5]:
# Created by processing a string of text with the nlp object
doc = nlp("Hello world!")

# Iterate over tokens in a Doc
for token in doc:
    print(token.text)

Hello
world
!
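
As a small sketch of the two claims above (sequence behaviour and no information loss), the following checks the length of the Doc, indexes into it, and rebuilds the original string from the tokens. The variable names `text` and `rebuilt` are only illustrative.

```python
from spacy.lang.en import English

nlp = English()
text = "Hello world!"
doc = nlp(text)

# The Doc supports len() and indexing like a Python sequence
print(len(doc))
print(doc[0].text)
print(doc[-1].text)

# Each token remembers its trailing whitespace, so the original
# string can be reconstructed from the tokens
rebuilt = "".join(token.text_with_ws for token in doc)
print(rebuilt == text)
print(doc.text == text)
```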


### The `Token` object
- Token objects represent the tokens in a document – for example, a word or a punctuation character.
- To get a token at a specific position, you can index into the Doc.
- Token objects also provide various attributes that let you access more information about the tokens. For example, the `.text` attribute returns the verbatim token text.

<img src='https://course.spacy.io/doc.png'>

In [7]:
doc = nlp("Hello world!")

# Index into the Doc to get a single Token
token = doc[1]

# Get the token text via the .text attribute
print(token.text)

world


### The `Span` object
- A Span object is a slice of the document consisting of one or more tokens.
- It's only a view of the Doc and doesn't contain any data itself.
- To create a Span, you can use Python's slice notation.
    - For example, `1:3` will create a slice starting from the token at position 1, up to – but not including! – the token at position 3.

In [8]:
doc = nlp("Hello world!")

# A slice from the Doc is a Span object
span = doc[1:3]

# Get the span text via the .text attribute
print(span.text)

world!
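
To illustrate that a Span is a view rather than a copy, this sketch inspects the span's boundaries and its link back to the parent Doc. It assumes the same three-token example sentence as above.

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("Hello world!")

# Slice tokens 1 and 2 (the end index 3 is exclusive)
span = doc[1:3]

# Start and end are token positions in the parent Doc
print(span.start, span.end)
print(len(span))

# The span points back to the very same Doc object
print(span.doc is doc)

# Tokens in the span keep their indices from the parent Doc
print([token.i for token in span])
```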


### Lexical Attributes
Here you can see some of the available token attributes:

`i` is the index of the token within the parent document.

`text` returns the token text.

`is_alpha`, `is_punct` and `like_num` return boolean values indicating whether the token consists of alphabetic characters, whether it's punctuation or whether it resembles a number. For example, `like_num` is true for the token "10" and for the word "ten".

These attributes are also called lexical attributes: they refer to the entry in the vocabulary and don't depend on the token's context.

In [23]:
doc = nlp("It costs $5.")
print('Index:   ', [token.i for token in doc])
print('Text:    ', [token.text for token in doc])

print('is_alpha:', [token.is_alpha for token in doc])
print('is_punct:', [token.is_punct for token in doc])
print('like_num:', [token.like_num for token in doc])

Index:    [0, 1, 2, 3, 4]
Text:     ['It', 'costs', '$', '5', '.']
is_alpha: [True, True, False, False, False]
is_punct: [False, False, False, False, True]
like_num: [False, False, False, True, False]
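
Lexical attributes can be combined with token indices to find simple patterns. The sketch below, which is an illustration rather than part of the original notes, collects number-like tokens that directly follow a "$" token; the `amounts` list is a hypothetical name.

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("It costs $5.")

amounts = []
for token in doc:
    # Look only at number-like tokens that have a predecessor
    if token.like_num and token.i > 0:
        previous = doc[token.i - 1]
        # Keep the number if the token before it is a dollar sign
        if previous.text == "$":
            amounts.append(token.text)

print(amounts)

# like_num also recognizes spelled-out numbers in English
print(nlp("ten")[0].like_num)
```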


---

## Statistical models

### What are statistical models?
- Enable spaCy to predict linguistic attributes in *context*
    - Part-of-speech tags
    - Syntactic dependencies
    - Named entities
- Trained on labeled example texts
- Can be updated with more examples to fine-tune predictions

Some of the most interesting things you can analyze are context-specific: for example, whether a word is a verb or whether a span of text is a person name.

Statistical models enable spaCy to make predictions in context. This usually includes part-of-speech tags, syntactic dependencies and named entities.

Models are trained on large datasets of labeled example texts.

They can be updated with more examples to fine-tune their predictions – for example, to perform better on your specific data.
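
A minimal sketch of what such predictions look like in code follows. It assumes a trained English model package is installed; the package name `en_core_web_sm`, the helper `describe` and the sample sentence are assumptions that do not appear in these notes. If the package is missing, the helper returns `None` instead of failing.

```python
import spacy

# Assumption: this model package name is not given in the notes
MODEL_NAME = "en_core_web_sm"


def describe(text, model_name=MODEL_NAME):
    """Return per-token (text, pos, dep) and per-entity (text, label),
    or None if the model package is not installed."""
    try:
        nlp = spacy.load(model_name)
    except OSError:
        # spacy.load raises OSError when the package cannot be found
        return None
    doc = nlp(text)
    tokens = [(token.text, token.pos_, token.dep_) for token in doc]
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    return tokens, entities


result = describe("She ate the pizza")
if result is None:
    print("Model package not installed:", MODEL_NAME)
else:
    tokens, entities = result
    for text, pos, dep in tokens:
        print(text, pos, dep)
    print(entities)
```

The tags and labels printed depend on the installed model version, so no fixed output is shown here.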

### Model Packages

In [2]:
from spacy.lang.en import English

In [15]:
nlp = English()
doc = nlp("It costs $5.")

In [16]:
nlp.pipe_names

[]

In [10]:
for token in doc:
    print(token.pos_)
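
The loop above prints only empty lines because `English()` creates a blank pipeline with no trained components, so nothing has predicted the tags. A short sketch that makes this explicit (the variable `pos_tags` is illustrative):

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("It costs $5.")

# A blank pipeline has no components
print(nlp.pipe_names)

# Without a tagger, every part-of-speech tag is an empty string
pos_tags = [token.pos_ for token in doc]
print(pos_tags)

# Without an entity recognizer, no entities are set
print(len(doc.ents))
```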








In [11]:
import spacy

In [None]:
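# spacy.load() needs a model package name; "en_core_web_sm" is an assumed example
nlp = spacy.load("en_core_web_sm")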