___

<a href='http://www.pieriandata.com'> <img src='../Pierian_Data_Logo.png' /></a>
___

# Tokenization
The first step in creating a `Doc` object is to break down the incoming text into component pieces or "tokens".

In [4]:
# Import spaCy and load the language library
import spacy
nlp = spacy.load('ko_core_news_lg')

In [5]:
# Create a string that includes opening and closing quotation marks
mystring = '"우리는 지금 서울로 이동하고 있어!"'  # single quotes delimit the string, so the inner double quotes need no backslash escaping
print(mystring)

"우리는 지금 서울로 이동하고 있어!"


In [6]:
# Create a Doc object and explore tokens
doc = nlp(mystring)

for token in doc:
    print(token.text, end=' | ')

" | 우리는 | 지금 | 서울로 | 이동하고 | 있어 | ! | " | 

<img src="../tokenization.png" width="600">

-  **Prefix**:	Character(s) at the beginning &#9656; `$ ( “ ¿`
-  **Suffix**:	Character(s) at the end &#9656; `km ) , . ! ”`
-  **Infix**:	Character(s) in between &#9656; `- -- / ...`
-  **Exception**: Special-case rule to split a string into several tokens or prevent a token from being split when punctuation rules are applied &#9656; `St. U.S.`

Notice that tokens are pieces of the original text. That is, we don't see any conversion to word stems or lemmas (base forms of words) and we haven't seen anything about organizations/places/money etc. Tokens are the basic building blocks of a Doc object - everything that helps us understand the meaning of the text is derived from tokens and their relationship to one another.
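
To make the prefix/suffix/exception idea concrete, here is a toy sketch in plain Python. It is an illustration only and not spaCy's actual tokenizer: the rule tables `PREFIXES`, `SUFFIXES` and `EXCEPTIONS`, and the helper names `split_chunk` and `toy_tokenize`, are made up for this example, and infix handling is left out.

```python
# Toy illustration only: NOT spaCy's real tokenizer.
# The rule tables below are hypothetical and cover just a few characters.
PREFIXES = ('$', '(', '"', '¿')
SUFFIXES = ('km', ')', ',', '.', '!', '"')
EXCEPTIONS = {'St.', 'U.S.'}

def split_chunk(chunk):
    """Peel prefixes and suffixes off one whitespace-delimited chunk."""
    head, tail = [], []
    # Stop peeling as soon as the remainder is a known exception.
    while chunk and chunk not in EXCEPTIONS:
        prefix = next((p for p in PREFIXES if chunk.startswith(p)), None)
        suffix = next((s for s in SUFFIXES if chunk.endswith(s)), None)
        if prefix:
            head.append(prefix)
            chunk = chunk[len(prefix):]
        elif suffix:
            tail.insert(0, suffix)
            chunk = chunk[:-len(suffix)]
        else:
            break
    return head + ([chunk] if chunk else []) + tail

def toy_tokenize(text):
    """Split on whitespace first, then apply the punctuation rules."""
    return [tok for chunk in text.split() for tok in split_chunk(chunk)]

text = '"우리는 지금 서울로 이동하고 있어!"'
tokens = toy_tokenize(text)
print(' | '.join(tokens))
# " | 우리는 | 지금 | 서울로 | 이동하고 | 있어 | ! | "

# Tokens are pieces of the original text: nothing is stemmed or rewritten.
print(''.join(tokens) == text.replace(' ', ''))  # True
```

For this particular sentence the toy rules happen to produce the same pieces as the spaCy output above; real pipelines use much larger, language-specific rule sets.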

## Prefixes, Suffixes and Infixes
spaCy will isolate punctuation that does *not* form an integral part of a word. Quotation marks, commas, and punctuation at the end of a sentence will be assigned their own token. However, punctuation that exists as part of an email address, website or numerical value will be kept as part of the token.

In [9]:
doc2 = nlp(u"우리가 여기서 도와드릴게요! 이메일 보내주세요, support@oursite.com로 이메일을 보내주시거나 http://www.oursite.com로 방문해주세요!")

for t in doc2:
    print(t)

우리가
여기서
도와드릴게요
!
이메일
보내주세요
,
support@oursite.com로
이메일을
보내주시거나
http://www.oursite.com로
방문해주세요
!


<font color=green>Note that the exclamation points and the comma are assigned their own tokens, while the email address and the URL are each kept in one piece (here with the Korean particle `로` still attached).</font>

In [10]:
doc3 = nlp(u'5km NYC 택시 요금은 $10.30입니다.')

for t in doc3:
    print(t)

5
km
NYC
택시
요금은
$
10.30입니다
.


<font color=green>Here the distance unit and dollar sign are assigned their own tokens, while the decimal point in the amount is not split: `10.30` stays in one token, together with the attached Korean ending `입니다`.</font>

## Exceptions
Punctuation that exists as part of a known abbreviation will be kept as part of the token.

In [11]:
doc4 = nlp(u"내년에는 미국 세인트루이스로 방문해주세요.")

for t in doc4:
    print(t)

내년에는
미국
세인트루이스로
방문해주세요
.


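<font color=green>This Korean sentence spells out "미국" (United States) and "세인트루이스" (St. Louis), so there is no abbreviation to protect and only the final period is split off. In English text, abbreviations such as `St.` and `U.S.` keep their periods.</font>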

## Counting Tokens
`Doc` objects have a set number of tokens:

In [12]:
len(doc4)

5

## Counting Vocab Entries
`Vocab` objects contain a full library of items!

In [14]:
len(doc4.vocab)  # number of lexemes in the vocab; the English example with 'en_core_web_sm' reported 57,852, while this run shows 294

294

<font color=green>NOTE: This number changes based on the language library loaded at the start, and any new lexemes introduced to the `vocab` when the `Doc` was created.</font>
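
As a rough check of the note above, the sketch below counts the vocab entries before and after processing text. It assumes spaCy is installed and uses a blank English pipeline (`spacy.blank("en")`) rather than the Korean model, so that no model download is needed; the made-up words are only there because they are unlikely to be in the vocab already.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only) stands in for the
# Korean model used in this notebook, so the example runs without a download.
nlp = spacy.blank("en")

before = len(nlp.vocab)
doc = nlp("zzyzxqq flibbertigibbetish")  # made-up strings, new to the vocab
after = len(nlp.vocab)

# Processing text adds lexemes for previously unseen tokens.
print(before, after, after > before)
```

The exact counts depend on the spaCy version and the pipeline loaded, so no specific numbers are shown here.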

## Tokens can be retrieved by index position and slice
`Doc` objects can be thought of as lists of `Token` objects. As such, individual tokens can be retrieved by index position, and spans of tokens can be retrieved through slicing:

In [15]:
doc5 = nlp(u'받는 것보다 주는 게 낫다.')

# Retrieve the third token:
doc5[2]

주는

In [16]:
# Retrieve three tokens from the middle:
doc5[2:5]

주는 게 낫다

In [17]:
# Retrieve the last four tokens:
doc5[-4:]

주는 게 낫다.
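
The cells above print text, but indexing and slicing return different object types. The sketch below shows this with a blank English pipeline (an assumption made so the example runs without the Korean model); the sentence is an English rendering of the one used above.

```python
import spacy

nlp = spacy.blank("en")  # tokenizer only; assumed stand-in for the Korean model
doc = nlp("It is better to give than to receive.")

third = doc[2]        # a single Token
middle = doc[2:5]     # a Span covering three tokens
last_four = doc[-4:]  # negative indices count from the end, as with lists

print(type(third).__name__, '->', third.text)
print(type(middle).__name__, '->', middle.text)
print(type(last_four).__name__, '->', last_four.text)
```

A `Span` is a view onto the parent `Doc`, so slicing does not copy the underlying tokens.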

## Tokens cannot be reassigned
Although `Doc` objects can be considered lists of tokens, they do *not* support item reassignment:

In [18]:
# Two Docs that differ only in the first word ('저의' = my, '당신의' = your)
doc6 = nlp(u'저의 저녁이 맛있어요.')
doc7 = nlp(u'당신의 저녁이 맛있어요.')

In [20]:
# Try to overwrite the third token of doc6 with the third token of doc7
doc6[2] = doc7[2]

TypeError: 'spacy.tokens.doc.Doc' object does not support item assignment
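
Since tokens cannot be reassigned, one possible workaround is to build a new `Doc` from an edited word list. This is a minimal sketch rather than the course's method; it assumes a blank English pipeline and uses the `Doc(vocab, words=..., spaces=...)` constructor from `spacy.tokens`.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")  # assumed stand-in pipeline, tokenizer only
doc = nlp("My dinner was horrible.")

# Item assignment fails, as shown above.
try:
    doc[3] = "delicious"
except TypeError as err:
    print("TypeError:", err)

# Copy the token texts and trailing-space flags, edit the copy,
# then construct a fresh Doc that shares the same vocab.
words = [t.text for t in doc]
spaces = [bool(t.whitespace_) for t in doc]
words[3] = "delicious"
new_doc = Doc(nlp.vocab, words=words, spaces=spaces)

print(new_doc.text)  # My dinner was delicious.
```

The original `doc` is left unchanged; annotations such as entities or parses would have to be recomputed for the new `Doc`.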

___
# Named Entities
Going a step beyond tokens, *named entities* add another layer of context. The language model recognizes that certain words are organizational names while others are locations, and still other combinations relate to money, dates, etc. Named entities are accessible through the `ents` property of a `Doc` object.

In [23]:
doc8 = nlp(u'애플 회사는 600만 원에 홍콩 공장 짓는다')

for token in doc8:
    print(token.text, end=' | ')

print('\n----')

for ent in doc8.ents:
    print(ent.text+' - '+ent.label_+' - '+str(spacy.explain(ent.label_)))

애플 | 회사는 | 600만 | 원에 | 홍콩 | 공장 | 짓는다 | 
----
애플 - OG - None
600만 원에 - QT - None
홍콩 - LC - localizer


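<font color=green>Note how two tokens (`600만` and `원에`) combine to form the quantity entity, while `애플` and `홍콩` are single-token entities. The Korean model uses its own label scheme (`OG`, `QT`, `LC`), which `spacy.explain` does not cover well: it returns `None` for `OG` and `QT`, and the `localizer` gloss printed for `LC` appears to belong to a different tag set rather than describing a location entity.</font>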

In [24]:
len(doc8.ents)

3

Named Entity Recognition (NER) is an important machine learning tool applied to Natural Language Processing.<br>We'll do a lot more with it in an upcoming section. For more info on **named entities** visit https://spacy.io/usage/linguistic-features#named-entities
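
Because `spacy.explain` returned `None` for some of the labels above, a small lookup table can supply readable descriptions. The glosses below are an assumption about the Korean pipeline's label scheme, not something taken from this notebook's output, and `describe_label` is a hypothetical helper name.

```python
# Assumed meanings of the Korean pipeline's entity labels; verify them
# against the model documentation before relying on them.
KO_ENTITY_GLOSSES = {
    "DT": "date",
    "LC": "location",
    "OG": "organization",
    "PS": "person",
    "QT": "quantity",
    "TI": "time",
}

def describe_label(label, fallback=None):
    """Return a gloss for `label`, trying the table first, then `fallback`."""
    if label in KO_ENTITY_GLOSSES:
        return KO_ENTITY_GLOSSES[label]
    return fallback(label) if fallback is not None else None

# (text, label) pairs copied from the output above.
entities = [("애플", "OG"), ("600만 원에", "QT"), ("홍콩", "LC")]
for text, label in entities:
    print(f"{text} - {label} - {describe_label(label)}")
```

In a notebook you could pass `spacy.explain` as the `fallback` so that labels missing from the table still get spaCy's own description where one exists.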

---
# Noun Chunks
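Similar to `Doc.ents`, `Doc.noun_chunks` is another property of a `Doc` object. *Noun chunks* are "base noun phrases" – flat phrases that have a noun as their head. You can think of noun chunks as a noun plus the words describing the noun – for example, in [Sheb Wooley's 1958 song](https://en.wikipedia.org/wiki/The_Purple_People_Eater), a *"one-eyed, one-horned, flying, purple people-eater"* would be one long noun chunk. Note that `noun_chunks` depends on a language-specific syntax iterator, which spaCy does not implement for Korean, so with the `ko` model used here each cell below raises `NotImplementedError` instead of printing chunks.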

In [25]:
doc9 = nlp(u"자율주행차는 보험 책임을 제조업체로 전가합니다.")

for chunk in doc9.noun_chunks:
    print(chunk.text)

NotImplementedError: [E894] The 'noun_chunks' syntax iterator is not implemented for language 'ko'.

In [26]:
doc10 = nlp(u"빨간 자동차는 보험료가 더 높지 않습니다.")

for chunk in doc10.noun_chunks:
    print(chunk.text)

NotImplementedError: [E894] The 'noun_chunks' syntax iterator is not implemented for language 'ko'.

In [27]:
doc11 = nlp(u"그는 외눈박이, 외뿔, 날아다니는 보라색 사람을 잡아먹는 동물이었습니다.")

for chunk in doc11.noun_chunks:
    print(chunk.text)

NotImplementedError: [E894] The 'noun_chunks' syntax iterator is not implemented for language 'ko'.

We'll look at additional noun_chunks components besides `.text` in an upcoming section.<br>For more info on **noun_chunks** visit https://spacy.io/usage/linguistic-features#noun-chunks
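
If the same notebook has to run across languages, the `NotImplementedError` shown above can be caught so that unsupported languages simply yield no chunks. This is a minimal sketch; `safe_noun_chunks` is a hypothetical helper, and the stub classes only imitate the relevant `Doc` behaviour so the example runs without any model.

```python
from types import SimpleNamespace

def safe_noun_chunks(doc):
    """Return noun chunk texts, or [] if the language has no noun_chunks iterator."""
    try:
        # The error surfaces while iterating, so iterate inside the try block.
        return [chunk.text for chunk in doc.noun_chunks]
    except NotImplementedError:
        return []

class UnsupportedDoc:
    """Stand-in for a Doc whose language lacks a noun_chunks iterator."""
    @property
    def noun_chunks(self):
        raise NotImplementedError("noun_chunks is not implemented for this language")

# Stand-in for a Doc from a language that does support noun chunks.
supported_doc = SimpleNamespace(noun_chunks=[SimpleNamespace(text="red cars")])

print(safe_noun_chunks(UnsupportedDoc()))  # []
print(safe_noun_chunks(supported_doc))     # ['red cars']
```

With a real pipeline you would pass the `Doc` returned by `nlp(...)` in place of the stand-ins.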

___
# Built-in Visualizers

spaCy includes a built-in visualization tool called **displaCy**. displaCy is able to detect whether you're working in a Jupyter notebook, and will return markup that can be rendered in a cell right away. When you export your notebook, the visualizations will be included as HTML.

For more info visit https://spacy.io/usage/visualizers

## Visualizing the dependency parse
Run the cell below to import displacy and display the dependency graphic.

In [28]:
from spacy import displacy  # spaCy's built-in visualization tool

doc = nlp(u'삼성전자는 영국에 600만 달러를 들여 공장을 지을 예정이다.')
displacy.render(doc, style='dep', jupyter=True, options={'distance': 50})  # distance: spacing between tokens

The optional `'distance'` argument sets the distance between tokens. If the distance is made too small, text that appears beneath short arrows may become too compressed to read.

## Visualizing the entity recognizer

In [29]:
doc = nlp(u'지난 분기 동안 삼성전자는 약 2만 대의 iPod을 판매하여 600만 달러의 수익을 올렸습니다.')
displacy.render(doc, style='ent', jupyter=True)

___
## Creating Visualizations Outside of Jupyter
If you're using another Python IDE or writing a script, you can choose to have spaCy serve up HTML separately:

In [30]:
doc = nlp(u'이것은 문장이다.')
displacy.serve(doc, style='dep')

ValueError: [E1050] Port 5000 is already in use. Please specify an available port with `displacy.serve(doc, port=port)` or use `auto_select_port=True` to pick an available port automatically.

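<font color=blue>If the cell stops with the `E1050` error shown above, rerun it with a free port (`port=...`) or with `auto_select_port=True`, as the message suggests. **After the cell runs successfully, click the link below (adjusting the port if you changed it) to view the dependency parse**:</font>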

http://127.0.0.1:5000
<br><br>
<font color=red>**To shut down the server and return to jupyter**, interrupt the kernel either through the **Kernel** menu above, by hitting the black square on the toolbar, or by typing the keyboard shortcut `Esc`, `I`, `I`</font>
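
When port 5000 is already taken, as in the error above, one option is to ask the operating system for a free port and pass it to `displacy.serve`. The helper `find_free_port` below is a hypothetical name and uses only the standard library; the `serve` call is left as a comment because it blocks until interrupted.

```python
import socket

def find_free_port(host="127.0.0.1"):
    """Ask the OS for a currently unused TCP port on `host`."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))  # port 0 lets the OS pick a free port
        return sock.getsockname()[1]

port = find_free_port()
print(f"Serve on http://127.0.0.1:{port}")

# In a script with `doc` and `displacy` available (blocks until interrupted):
# displacy.serve(doc, style='dep', port=port)
```

There is a small window in which another process could claim the port before the server starts, so this is a convenience rather than a guarantee.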

Great! Now you should have an understanding of how tokenization divides text up into individual elements, how named entities provide context, and how certain tools help to visualize grammar rules and entity labels.
## Next up: Stemming