# Tokenizing Sentences with the [spaCy](https://spacy.io/usage/spacy-101) Package

In [2]:
import re
import spacy
import torch

## 0. Installation

Before running the examples, download the pre-trained English pipeline `en_core_web_sm` (which provides the tokenizer and parser used here) by running:

`python -m spacy download en_core_web_sm`

This assumes that the spaCy package is already installed. If it is not, install it first with `pip install -U spacy`.

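If you are installing from within mainland China, search Baidu for a manual download of `en_core_web_{sm, md}-<version>.tar.gz`, then install it with `python -m pip install en_core_web_{sm, md}-<version>.tar.gz`. Switch pip to the Tsinghua mirror when installing. On Windows the process is quite painful and requires administrator privileges, so it is best to look up a guide first. I am still not entirely sure how my own installation eventually succeeded.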
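
As a quick sanity check after installation, the sketch below tests whether the model can be found by the current Python interpreter. It relies on the fact that a model installed with the commands above is an ordinary Python package. The helper name `is_model_installed` is hypothetical and not part of spaCy.

```python
import importlib.util


def is_model_installed(name: str = "en_core_web_sm") -> bool:
    """Return True if a package called `name` is importable (hypothetical helper)."""
    # find_spec returns None when no top-level package with this name exists
    return importlib.util.find_spec(name) is not None


model_installed = is_model_installed()
print(f"en_core_web_sm installed: {model_installed}")
```

If this prints `False`, `spacy.load("en_core_web_sm")` in the next section will most likely fail, and the download step needs to be repeated in the same environment.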

## 1. Quick Example

Note that in the first example `U.K.` is kept as a single token, while in the second example `don't` is split into `do` and `n't`. So the tokenizer does more than simply separate words at whitespace.

In [3]:
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
print("One example:")
for i, token in enumerate(doc):
    print(f"{i}th token is: {token}")

print("-------------------------------")

print("Another example:")
doc = nlp("I don't wanna go to school!")
for i, token in enumerate(doc):
    print(f"{i}th token is: {token}")

One example:
0th token is: Apple
1th token is: is
2th token is: looking
3th token is: at
4th token is: buying
5th token is: U.K.
6th token is: startup
7th token is: for
8th token is: $
9th token is: 1
10th token is: billion
-------------------------------
Another example:
0th token is: I
1th token is: do
2th token is: n't
3th token is: wanna
4th token is: go
5th token is: 

6th token is: to
7th token is: school
8th token is: !
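
For contrast, here is a minimal sketch of what a plain whitespace split does with the same two sentences. It uses only Python's built-in `str.split`, not spaCy, and the variable names are illustrative.

```python
sentences = [
    "Apple is looking at buying U.K. startup for $1 billion",
    "I don't wanna go to school!",
]

# Naive tokenization: split on runs of whitespace only
whitespace_tokens = [sentence.split() for sentence in sentences]

for tokens in whitespace_tokens:
    print(tokens)
```

With this approach `$1`, `don't`, and `school!` each remain a single token, whereas the spaCy output above separates `$` from `1`, `do` from `n't`, and `school` from `!`.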


## 2. Sentences Containing Special Symbols

Not every symbol carries meaning, and some are simply unwanted. In the example below, token 7 of the raw string is nothing but a run of whitespace, and the sentence ends with a series of redundant `!` tokens. We can use `re.sub` to remove such symbols before tokenizing.

In [5]:
# Original string
str_raw = "Troy `is` a ^ very      \n nice place ~ (or town) to live in!!!!"

# Remove special characters like: ^ or ~
str_pro = re.sub(
    pattern=r"[\(\)`~^]"  # [] defines a set of characters to be matched
    , repl=" "            # Replace them with a single space
    , string=str_raw
)

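# Collapse newlines, repeated spaces, and repeated exclamation marks
# (raw strings avoid the invalid "\!" escape sequence)
str_pro = re.sub(r"\n", " ", str_pro)
str_pro = re.sub(r"[ ]+", " ", str_pro)
str_pro = re.sub(r"!+", "!", str_pro)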

print("Tokenization before processing:")
for i, token in enumerate(nlp(str_raw)):
    print(f"{i}th token is: {token}")

print("---------------------------------")

print("Tokenization after processing:")
for i, token in enumerate(nlp.tokenizer(str_pro)):
    print(f"{i}th token is: {token.text}")

Tokenization before processing:
0th token is: Troy
1th token is: `
2th token is: is
3th token is: `
4th token is: a
5th token is: ^
6th token is: very
7th token is:      
 
8th token is: nice
9th token is: place
10th token is: ~
11th token is: (
12th token is: or
13th token is: town
14th token is: )
15th token is: to
16th token is: live
17th token is: in
18th token is: !
19th token is: !
20th token is: !
21th token is: !
---------------------------------
Tokenization after processing:
0th token is: Troy
1th token is: is
2th token is: a
3th token is: very
4th token is: nice
5th token is: place
6th token is: or
7th token is: town
8th token is: to
9th token is: live
10th token is: in
11th token is: !
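
The cleaning steps above can be bundled into one reusable function. The sketch below is one possible arrangement, not the only one: the name `clean_text` is hypothetical, and it uses `\s+` to collapse newlines and spaces in a single step instead of the two separate substitutions shown earlier.

```python
import re


def clean_text(text: str) -> str:
    """Strip unwanted symbols and collapse repeats (hypothetical helper)."""
    text = re.sub(r"[()`~^]", " ", text)  # replace selected symbols with a space
    text = re.sub(r"\s+", " ", text)      # collapse spaces, tabs, and newlines
    text = re.sub(r"!+", "!", text)       # keep a single exclamation mark
    return text.strip()


raw = "Troy `is` a ^ very      \n nice place ~ (or town) to live in!!!!"
cleaned = clean_text(raw)
print(cleaned)
```

The cleaned string can then be passed to `nlp.tokenizer(...)` exactly as `str_pro` is above.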
