# Text Mining: Tokenization
Data source: Wikipedia

## Loading Wikipedia data

In [1]:
!pip install wikipedia

Collecting wikipedia
  Using cached wikipedia-1.4.0-py3-none-any.whl
Installing collected packages: wikipedia
Successfully installed wikipedia-1.4.0


In [2]:
import wikipedia
import string

In [3]:
# Fetch the "Taipei" page and inspect its content
cv = wikipedia.page("Taipei")
text = cv.content
print(cv.url)
print(type(text))
print("The length of Taipei page is: ", len(text))


https://en.wikipedia.org/wiki/Taipei
<class 'str'>
The length of Taipei page is:  46184


In [4]:
# Fetch a summary of the page, limited to 20 sentences
summary_text = wikipedia.summary("Taipei", sentences = 20)
print(len(summary_text))
print("The length of Taipei summary is: ", len(summary_text))

1832
The length of Taipei summary is:  1832


In [10]:
# print(summary_text)

## Tokenization

Tokenization splits a string (for example, a sentence) into a list of tokens. Three methods are shown:
*   Method 1: `.split()`
*   Method 2: `nltk`
*   Method 3: a hand-written Python function


### Method 1: `.split()`
Split the sentence on spaces (applicable to English text only).



In [5]:
# example: split a sentence on the ASCII space character
sentence_a = "Taipei is the economic, political, educational and cultural center of Taiwan and one of the major hubs in East Asia."
print(sentence_a.split(" "))

# any other separator works the same way
sentence_b = "2022/08/16 00:10:32"
print(sentence_b.split("/"))

['Taipei', 'is', 'the', 'economic,', 'political,', 'educational', 'and', 'cultural', 'center', 'of', 'Taiwan', 'and', 'one', 'of', 'the', 'major', 'hubs', 'in', 'East', 'Asia.']
['2022', '08', '16 00:10:32']


In [6]:
print(len(text.split(" ")))
print(text.split(" ")[:50])

7086
['Taipei', '(),', 'officially', 'Taipei', 'City,', 'is', 'the', 'capital', 'and', 'a', 'special', 'municipality', 'of', 'the', 'Republic', 'of', 'China', '(Taiwan).', 'Located', 'in', 'Northern', 'Taiwan,', 'Taipei', 'City', 'is', 'an', 'enclave', 'of', 'the', 'municipality', 'of', 'New', 'Taipei', 'City', 'that', 'sits', 'about', '25', 'km', '(16', 'mi)', 'southwest', 'of', 'the', 'northern', 'port', 'city', 'of', 'Keelung.', 'Most']
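
One detail worth keeping in mind: `.split(" ")` and `.split()` behave differently when the text contains newlines or consecutive spaces, which a long page such as this one likely does. The sketch below uses a made-up sample string (not taken from the page) to show the difference.

```python
# Hypothetical sample with a double space and a newline
sample = "Taipei  City\nis the capital"

# split(" ") cuts only on single spaces: it yields an empty token for the
# double space and leaves the newline inside a token
by_space = sample.split(" ")

# split() with no argument cuts on any run of whitespace
by_whitespace = sample.split()

print(by_space)
print(by_whitespace)
```

With `.split(" ")`, tokens such as `'City\nis'` stay glued together, which can distort the word counts later on.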


### Method 2: NLTK's `word_tokenize`



In [8]:
!pip install nltk

Collecting nltk
  Downloading nltk-3.8.1-py3-none-any.whl (1.5 MB)
     ---------------------------------------- 1.5/1.5 MB 3.4 MB/s eta 0:00:00
Collecting click
  Downloading click-8.1.3-py3-none-any.whl (96 kB)
     ---------------------------------------- 96.6/96.6 KB 1.8 MB/s eta 0:00:00
Collecting regex>=2021.8.3
  Downloading regex-2023.5.4-cp38-cp38-win_amd64.whl (267 kB)
     -------------------------------------- 267.9/267.9 KB 4.1 MB/s eta 0:00:00
Installing collected packages: regex, click, nltk
Successfully installed click-8.1.3 nltk-3.8.1 regex-2023.5.4


In [9]:
import nltk
nltk.download("punkt")
from nltk.tokenize import word_tokenize

nltk_tokenize = word_tokenize(text)
print(len(nltk_tokenize))
print(nltk_tokenize[:50])

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


8334
['Taipei', '(', ')', ',', 'officially', 'Taipei', 'City', ',', 'is', 'the', 'capital', 'and', 'a', 'special', 'municipality', 'of', 'the', 'Republic', 'of', 'China', '(', 'Taiwan', ')', '.', 'Located', 'in', 'Northern', 'Taiwan', ',', 'Taipei', 'City', 'is', 'an', 'enclave', 'of', 'the', 'municipality', 'of', 'New', 'Taipei', 'City', 'that', 'sits', 'about', '25', 'km', '(', '16', 'mi', ')']
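
The main visible difference from `.split(" ")` is that punctuation becomes its own token. As a rough illustration of that idea only (this is not how `word_tokenize` is implemented, and it treats contractions and abbreviations differently), a regular expression can separate words from punctuation. The function name `simple_regex_tokenize` is made up for this sketch.

```python
import re

def simple_regex_tokenize(txt):
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", txt)

regex_tokens = simple_regex_tokenize("Taipei (Taiwan), about 25 km.")
print(regex_tokens)
```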


### Method 3: a hand-written Python function

In [10]:
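def tokenizer(txt):
  """Split txt on single spaces, mirroring txt.split(" ")."""
  tok = ""
  word_list = []

  for ch in txt:
    if ch == " ":
      # a space ends the current token
      word_list.append(tok)
      tok = ""
    else:
      tok += ch
  # append the final token, which has no trailing space
  word_list.append(tok)
  return word_list

word_list = tokenizer(text)
print(len(word_list))
print(word_list[:50])

7086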
['Taipei', '(),', 'officially', 'Taipei', 'City,', 'is', 'the', 'capital', 'and', 'a', 'special', 'municipality', 'of', 'the', 'Republic', 'of', 'China', '(Taiwan).', 'Located', 'in', 'Northern', 'Taiwan,', 'Taipei', 'City', 'is', 'an', 'enclave', 'of', 'the', 'municipality', 'of', 'New', 'Taipei', 'City', 'that', 'sits', 'about', '25', 'km', '(16', 'mi)', 'southwest', 'of', 'the', 'northern', 'port', 'city', 'of', 'Keelung.', 'Most']
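
A possible variant of the hand-written approach treats any whitespace character as a separator and skips empty tokens, so that newlines and repeated spaces do not end up inside the token list. This is a sketch only; `whitespace_tokenizer` is a name chosen for illustration.

```python
def whitespace_tokenizer(txt):
    tok = ""
    word_list = []
    for ch in txt:
        if ch.isspace():
            # any whitespace ends the current token; empty tokens are skipped
            if tok:
                word_list.append(tok)
            tok = ""
        else:
            tok += ch
    # keep the last token when the text does not end with whitespace
    if tok:
        word_list.append(tok)
    return word_list

sample = "Taipei  City\nis the capital"
print(whitespace_tokenizer(sample))
```

On ordinary text like the sample, this gives the same tokens as `sample.split()`.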


## Counting
Count how many times each word occurs.


In [11]:
from collections import Counter

word_count = Counter(word_list)
for w, c in word_count.most_common(20):
  print("%s\t%s" % (w, c))

the	451
of	248
and	237
in	206
Taipei	116
to	113
is	101
a	98
The	83
as	75
by	54
was	46
city	41
on	41
for	41
are	40
Taiwan	37
from	35
has	34
with	33
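
In the counts above, `the` and `The` are tallied separately because `Counter` compares strings exactly. If case should not matter, one option is to lowercase the tokens before counting. The sketch below uses a small made-up token list rather than the page itself.

```python
from collections import Counter

sample_tokens = ["The", "city", "of", "Taipei", "is", "the", "capital", "city"]

# exact string comparison: "The" and "the" are different keys
exact_counts = Counter(sample_tokens)

# lowercase first so that both spellings share one key
folded_counts = Counter(tok.lower() for tok in sample_tokens)

print(exact_counts["the"], exact_counts["The"])
print(folded_counts["the"])
```

The trade-off is that proper nouns lose their capitalization too, so `Taipei` becomes `taipei`.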


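## Stopword and punctuation removal
This section uses the output of tokenization Method 2, NLTK's tokenizer (`nltk_tokenize`).


**Remove punctuation**
* Method 1: `string`, using `string.punctuation`
* Method 2: `.isalpha()`, which removes every token that contains a non-letter character

**Remove stopwords**
* Load the English stopwords from NLTK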


### Remove punctuation Method 1: by `string`


In [12]:
tokens = nltk_tokenize

In [13]:
import string
print(string.punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [15]:
def remove_punc(tokens):
  tokens_list = []
  for tok in tokens:
    if tok not in string.punctuation:
      tokens_list.append(tok)
  return tokens_list

print(len(remove_punc(tokens)))
print(remove_punc(tokens)[:50])

7336
['Taipei', 'officially', 'Taipei', 'City', 'is', 'the', 'capital', 'and', 'a', 'special', 'municipality', 'of', 'the', 'Republic', 'of', 'China', 'Taiwan', 'Located', 'in', 'Northern', 'Taiwan', 'Taipei', 'City', 'is', 'an', 'enclave', 'of', 'the', 'municipality', 'of', 'New', 'Taipei', 'City', 'that', 'sits', 'about', '25', 'km', '16', 'mi', 'southwest', 'of', 'the', 'northern', 'port', 'city', 'of', 'Keelung', 'Most', 'of']
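
Note that `tok not in string.punctuation` is a substring test against one long string, so it only catches tokens that happen to appear contiguously inside `string.punctuation`. Tokenizers can emit multi-character punctuation tokens such as `''` or `...`, and that test lets them through. A stricter variant, sketched below under the hypothetical name `remove_punc_strict`, drops a token when every character in it is punctuation.

```python
import string

PUNCT_CHARS = set(string.punctuation)

def remove_punc_strict(tokens):
    # keep a token unless it is non-empty and made up entirely of punctuation
    return [tok for tok in tokens
            if not (tok and all(ch in PUNCT_CHARS for ch in tok))]

sample_tokens = ["Taipei", ",", "``", "capital", "''", "...", "U.S.", "25"]
cleaned = remove_punc_strict(sample_tokens)
print(cleaned)
```

Tokens that mix letters and punctuation, such as `U.S.`, are kept.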


In [16]:
from collections import Counter
print("Before removing punctuation")
tokens = nltk_tokenize
word_count = Counter(tokens)
for w, c in word_count.most_common(10):
  print("%s\t%d" % (w, c))

print("=========================")

print("After removing punctuation")
tokens = remove_punc(tokens)
word_count = Counter(tokens)
for w, c in word_count.most_common(10):
  print("%s\t%d" % (w, c))

Before removing punctuation
the	454
,	446
.	304
of	249
and	237
in	206
Taipei	172
to	113
is	101
(	98
=========================
After removing punctuation
the	454
of	249
and	237
in	206
Taipei	172
to	113
is	101
a	98
The	94
as	75


### Remove punctuation Method 2: `.isalpha()`
Remove every token that contains a non-alphabetic character.

In [17]:
tokens = nltk_tokenize

In [19]:
def remove_punc2(tokens):
  tokens_list = []
  for tok in tokens:
    if tok.isalpha():
      tokens_list.append(tok)
  return tokens_list

# def remove_punc_short(tokens):
  # return [tok for tok in tokens if tok.isalpha()]

print(len(remove_punc2(tokens)))
print(remove_punc2(tokens)[:50])

6714
['Taipei', 'officially', 'Taipei', 'City', 'is', 'the', 'capital', 'and', 'a', 'special', 'municipality', 'of', 'the', 'Republic', 'of', 'China', 'Taiwan', 'Located', 'in', 'Northern', 'Taiwan', 'Taipei', 'City', 'is', 'an', 'enclave', 'of', 'the', 'municipality', 'of', 'New', 'Taipei', 'City', 'that', 'sits', 'about', 'km', 'mi', 'southwest', 'of', 'the', 'northern', 'port', 'city', 'of', 'Keelung', 'Most', 'of', 'the', 'city']
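
`.isalpha()` is stricter than it might first appear: it returns `True` only when every character is a letter, so numbers, hyphenated words, and tokens with apostrophes are all dropped. That is why `25` and `16` disappear from the output above. It is also Unicode-aware, so accented and non-Latin letters count as letters. A small check on made-up tokens:

```python
sample_tokens = ["Taipei", "25", "km", "well-known", "n't", "Café", "臺北", "COVID-19"]

# show the verdict for each token
for tok in sample_tokens:
    print(tok, tok.isalpha())

kept = [tok for tok in sample_tokens if tok.isalpha()]
print(kept)
```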


### Remove Stopwords
Load the English stopwords that ship with the NLTK package.

In [20]:
tokens = nltk_tokenize
tokens = remove_punc2(tokens)

In [23]:
# load english stopwords from NLTK
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")
stopword_list = stopwords.words('english')

print(stopword_list[:50])

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be']


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [24]:
# remove tokens that appear in the stopword list (the match is case-sensitive)

def remove_stopwords(tokens):
  tokens_list = []
  for tok in tokens:
    if tok not in stopword_list:
      tokens_list.append(tok)
  return tokens_list

print(len(remove_stopwords(tokens)))
print(remove_stopwords(tokens)[:50])

4411
['Taipei', 'officially', 'Taipei', 'City', 'capital', 'special', 'municipality', 'Republic', 'China', 'Taiwan', 'Located', 'Northern', 'Taiwan', 'Taipei', 'City', 'enclave', 'municipality', 'New', 'Taipei', 'City', 'sits', 'km', 'mi', 'southwest', 'northern', 'port', 'city', 'Keelung', 'Most', 'city', 'rests', 'Taipei', 'Basin', 'ancient', 'lakebed', 'The', 'basin', 'bounded', 'relatively', 'narrow', 'valleys', 'Keelung', 'Xindian', 'rivers', 'join', 'form', 'Tamsui', 'River', 'along', 'city']


In [25]:
from collections import Counter
tokens = remove_stopwords(tokens)
word_count = Counter(tokens)
for w, c in word_count.most_common(10):
  print("%s\t%d" % (w, c))

Taipei	172
The	94
Taiwan	71
city	61
City	26
also	23
District	22
area	21
Japanese	20
population	19
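
`The` still ranks second above because the stopword list is lowercase and the membership test is case-sensitive. One way to handle this, sketched below, is to compare the lowercased token against the list while keeping the original token in the output. To stay self-contained, the sketch uses a tiny stand-in set, `SAMPLE_STOPWORDS`, rather than `stopwords.words('english')`; converting the real list to a `set` would also make lookups faster.

```python
# Stand-in for stopwords.words('english'); an assumption for illustration only
SAMPLE_STOPWORDS = {"the", "of", "and", "in", "is", "a", "most"}

def remove_stopwords_ci(tokens, stopword_set=SAMPLE_STOPWORDS):
    # compare in lowercase, but keep the token's original casing in the output
    return [tok for tok in tokens if tok.lower() not in stopword_set]

sample_tokens = ["The", "basin", "is", "bounded", "Most", "of", "the", "city"]
filtered = remove_stopwords_ci(sample_tokens)
print(filtered)
```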


## Stemming and Lemmatization
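* In English text, the same word can be spelled differently depending on tense, number (singular or plural), and voice (active or passive).
* Normalizing these forms reduces complexity and speeds up language model training.


---

1.  **Stemming**: Snowball algorithm
2.  **Lemmatization**: reduces a word to its base form (usually more accurate than stemming)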

### Stemming: Snowball algorithm

Reference: http://snowball.tartarus.org/texts/introduction.html

`from nltk.stem.snowball import SnowballStemmer`

Drawback: overstemming can occur, where too much of a word is removed.
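
To make overstemming concrete, here is a deliberately naive suffix-stripping stemmer. It is a toy written for this note, not the Snowball algorithm, and the suffix list and minimum stem length are arbitrary choices. It shows how rule-based stripping can collapse unrelated words onto one stem.

```python
def toy_stem(word, suffixes=("ing", "ed", "es", "s"), min_stem_len=3):
    # strip the first matching suffix, as long as enough of the word remains
    word = word.lower()
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: len(word) - len(suffix)]
    return word

for w in ["opening", "talked", "cities", "news", "new"]:
    print(w, "->", toy_stem(w))
```

Here `news` and `new` end up with the same stem even though they are different words, and `cities` becomes the non-word `citi`.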

In [26]:
tokens = remove_stopwords(tokens)

In [27]:
from nltk.stem.snowball import SnowballStemmer
snowball_stemmer = SnowballStemmer("english")

stemmed_tokens = []
for tok in tokens:
  stemmed_tokens.append(snowball_stemmer.stem(tok))

word_count = Counter(stemmed_tokens)

for w, c in word_count.most_common(20):
  print("%s\t%d" % (w, c))

taipei	172
citi	102
the	94
taiwan	71
district	42
area	28
includ	26
nation	26
world	26
templ	26
intern	26
new	24
also	23
popul	22
build	21
locat	20
center	20
japanes	20
museum	20
system	20


### Lemmatization
Reduces a word to its base form; more accurate than stemming.

In [28]:
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')

from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

def lemmatize(token):
  # ADJ (a), ADJ_SAT (s), ADV (r), NOUN (n) or VERB (v)
  for p in ['v', 'n', 'a', 'r', 's']:
    l = wordnet_lemmatizer.lemmatize(token, pos=p)
    if l != token:
      return l
  return token

print(lemmatize('Dogs'))
print(lemmatize('dogs'))
print(lemmatize('hits'))

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...


Dogs
dog
hit


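The comparison below prints only the words for which stemming and lemmatization disagree. The lemmatizer returns dictionary forms (for example, `went` becomes `go`), while the stemmer can produce non-words such as `decompos`.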

In [29]:
print("words \t Stemming \t Lemmatization")
for w in [
    'open', 'opens', 'opened', 'opening', 'unopened',
    'talk', 'talks', 'talked', 'talking',
    'decompose', 'decomposes', 'decomposed', 'decomposing',
    'do', 'does', 'did', 
    'wrote', 'written', 'ran', 'gave', 'held', 'went', 'gone',
    'lied', 'lies', 'lay', 'lain', 'lying', 
    'cats', 'people', 'feet', 'women', 'smoothly', 'firstly', 'secondly', 
    ]:
    s = snowball_stemmer.stem(w)
    l = lemmatize(w)
    if s != l:
        print("%s\t%s\t%s" % (w, s, l))

words 	 Stemming 	 Lemmatization
unopened	unopen	unopened
decompose	decompos	decompose
decomposes	decompos	decompose
decomposed	decompos	decompose
decomposing	decompos	decompose
does	doe	do
did	did	do
wrote	wrote	write
written	written	write
ran	ran	run
gave	gave	give
held	held	hold
went	went	go
gone	gone	go
lain	lain	lie
people	peopl	people
feet	feet	foot
women	women	woman
smoothly	smooth	smoothly
firstly	first	firstly
secondly	second	secondly
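
Irregular forms such as `went` or `feet` cannot be recovered by stripping suffixes; a lemmatizer resolves them with the help of a lexicon (WordNet, in NLTK's case). The sketch below mimics that idea with a tiny hand-made table. `TOY_LEMMAS` and `toy_lemmatize` are illustrative names, and the table is nowhere near a real lexicon.

```python
# Tiny hand-made lookup table; a real lemmatizer relies on a full lexicon
TOY_LEMMAS = {
    "went": "go", "gone": "go",
    "feet": "foot", "women": "woman",
    "wrote": "write", "written": "write",
}

def toy_lemmatize(token):
    # lowercase before the lookup so that capitalized forms are also matched;
    # unknown words are returned unchanged
    return TOY_LEMMAS.get(token.lower(), token)

for w in ["went", "Feet", "written", "Taipei"]:
    print(w, "->", toy_lemmatize(w))
```

The lowercase step matters: in the earlier cell, `lemmatize('Dogs')` returned `Dogs` unchanged while `lemmatize('dogs')` returned `dog`.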


In [30]:
tokens = remove_stopwords(tokens)
lemmatized_tokens = []
for tok in tokens:
    lemmatized_tokens.append(lemmatize(tok))
word_count = Counter(lemmatized_tokens)

for w, c in word_count.most_common(20):
  print("%s\t%d" % (w, c))

Taipei	172
The	94
city	72
Taiwan	71
area	27
City	26
include	26
also	23
District	22
Japanese	20
build	20
district	20
population	19
In	19
National	18
Chinese	18
international	17
system	17
know	15
center	15


## Applications: Word Cloud
Join the tokens into a single space-separated string, then paste it into a word cloud generator:
https://www.jasondavies.com/wordcloud/

In [32]:
repeated_tokens = []
for w, c in word_count.most_common():
  for i in range(c):
    repeated_tokens.append(w)

print(" ".join(repeated_tokens)[:100])

Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Ta
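
The nested loop can also be written more compactly. The sketch below builds the same kind of space-separated string from a small made-up `Counter`; `sample_counts` is a stand-in for `word_count`.

```python
from collections import Counter

sample_counts = Counter({"Taipei": 3, "city": 2, "Taiwan": 1})

# repeat each word by its count, most frequent first
repeated = [w for w, c in sample_counts.most_common() for _ in range(c)]
cloud_input = " ".join(repeated)

print(cloud_input)
```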
