# Text Mining: Tokenization
Data source: Wikipedia

## Loading Wikipedia data

In [6]:
!pip install wikipedia

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting wikipedia
  Downloading wikipedia-1.4.0.tar.gz (27 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: wikipedia
  Building wheel for wikipedia (setup.py) ... done
  Created wheel for wikipedia: filename=wikipedia-1.4.0-py3-none-any.whl size=11695 sha256=7ff82499f578e8fda17096bd46605ca1076e21ea4dc898bb20192c63c2acecc3
  Stored in directory: /root/.cache/pip/wheels/07/93/05/72c05349177dca2e0ba31a33ba4f7907606f7ddef303517c6a
Successfully built wikipedia
Installing collected packages: wikipedia
Successfully installed wikipedia-1.4.0


In [7]:
import wikipedia
import string

In [8]:
# page "Taipei"
cv = wikipedia.page("Taipei")
text = cv.content
print(cv.url)
print(type(text))
print("The length of Taipei page is: ", len(text))


https://en.wikipedia.org/wiki/Taipei
<class 'str'>
The length of Taipei page is:  46155


In [9]:
# summary
summary_text = wikipedia.summary("Taipei", sentences = 20)
print(len(summary_text))
print("The length of Taipei summary is: ", len(summary_text))

1834
The length of Taipei summary is:  1834


In [10]:
# print(summary_text)

## Tokenization

Tokenization splits a string, such as a sentence, into a list of tokens. Three approaches are shown:
*   Method 1: `.split()`
*   Method 2: `nltk`
*   Method 3: a hand-written Python function


### Method 1: `.split()`
Split the sentence on spaces. This works only for text whose words are separated by spaces, such as English.



In [11]:
# example
sentence_a = "Taipei is the economic, political, educational and cultural center of Taiwan and one of the major hubs in East Asia."
print(sentence_a.split(" "))

sentence_b = "2022/08/16 00:10:32"
print(sentence_b.split("/"))

['Taipei', 'is', 'the', 'economic,', 'political,', 'educational', 'and', 'cultural', 'center', 'of', 'Taiwan', 'and', 'one', 'of', 'the', 'major', 'hubs', 'in', 'East', 'Asia.']
['2022', '08', '16 00:10:32']
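
The result of `.split()` depends heavily on the separator argument. The sketch below uses a made-up `sample` string (not taken from the Wikipedia page) to compare splitting on a single ASCII space, splitting with no argument, and splitting on a full-width space (U+3000), which is a different character and easy to type by accident with a Chinese input method.

```python
# Compare separator choices on a string with irregular whitespace.
sample = "Taipei  is the\ncapital"   # two spaces, then a newline

by_space = sample.split(" ")    # one cut per single ASCII space
by_default = sample.split()     # cuts on any run of whitespace

print(by_space)    # ['Taipei', '', 'is', 'the\ncapital']
print(by_default)  # ['Taipei', 'is', 'the', 'capital']

# A full-width (ideographic) space, U+3000, never occurs in this text,
# so using it as the separator leaves the string unsplit.
by_fullwidth = sample.split("\u3000")
print(len(by_fullwidth))  # 1
```

Splitting on `" "` keeps empty strings and embedded newlines, so `.split()` with no argument is often the safer default for raw page text.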


In [12]:
print(len(text.split(" ")))
print(text.split(" ")[:50])

7090
['Taipei', '(),', 'officially', 'Taipei', 'City,', 'is', 'the', 'capital', 'and', 'a', 'special', 'municipality', 'of', 'the', 'Republic', 'of', 'China', '(Taiwan).', 'Located', 'in', 'Northern', 'Taiwan,', 'Taipei', 'City', 'is', 'an', 'enclave', 'of', 'the', 'municipality', 'of', 'New', 'Taipei', 'City', 'that', 'sits', 'about', '25', 'km', '(16', 'mi)', 'southwest', 'of', 'the', 'northern', 'port', 'city', 'of', 'Keelung.', 'Most']


### Method 2: NLTK's `word_tokenize`



In [13]:
import nltk
nltk.download("punkt")
from nltk.tokenize import word_tokenize

nltk_tokenize = word_tokenize(text)
print(len(nltk_tokenize))
print(nltk_tokenize[:50])

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


8328
['Taipei', '(', ')', ',', 'officially', 'Taipei', 'City', ',', 'is', 'the', 'capital', 'and', 'a', 'special', 'municipality', 'of', 'the', 'Republic', 'of', 'China', '(', 'Taiwan', ')', '.', 'Located', 'in', 'Northern', 'Taiwan', ',', 'Taipei', 'City', 'is', 'an', 'enclave', 'of', 'the', 'municipality', 'of', 'New', 'Taipei', 'City', 'that', 'sits', 'about', '25', 'km', '(', '16', 'mi', ')']
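
The main visible difference from `.split(" ")` is that punctuation becomes its own token. As a rough illustration of that idea only, the sketch below uses a regular expression; `regex_tokenize` is a hypothetical helper and a simplification, not NLTK's algorithm, which handles many more cases.

```python
import re

def regex_tokenize(txt):
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", txt)

sample = "Taipei (), officially Taipei City, is the capital."
print(sample.split(" "))
print(regex_tokenize(sample))
```

On this sample the regular expression separates `(`, `)`, `,` and `.` from the words, while `.split(" ")` leaves them attached, as in `'City,'`.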


### Method 3: a hand-written Python function

In [21]:
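def tokenizer(txt):
  """Split txt on single spaces by scanning it character by character."""
  tok = ""
  word_list = []

  for ch in txt:
    if ch == " ":
      # A space ends the current token
      word_list.append(tok)
      tok = ""
    else:
      tok += ch
  # Flush the final token, which has no trailing space
  word_list.append(tok)
  return word_list

word_list = tokenizer(text)
print(len(word_list))
print(word_list[:50])

7090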
['Taipei', '(),', 'officially', 'Taipei', 'City,', 'is', 'the', 'capital', 'and', 'a', 'special', 'municipality', 'of', 'the', 'Republic', 'of', 'China', '(Taiwan).', 'Located', 'in', 'Northern', 'Taiwan,', 'Taipei', 'City', 'is', 'an', 'enclave', 'of', 'the', 'municipality', 'of', 'New', 'Taipei', 'City', 'that', 'sits', 'about', '25', 'km', '(16', 'mi)', 'southwest', 'of', 'the', 'northern', 'port', 'city', 'of', 'Keelung.', 'Most']


## Counting
Count how many times each token occurs.


In [22]:
from collections import Counter

word_count = Counter(word_list)
for w, c in word_count.most_common(20):
  print("%s\t%s" % (w, c))

the	449
of	246
and	238
in	205
Taipei	115
to	114
is	100
a	98
The	83
as	76
by	54
was	46
on	42
city	41
are	41
for	41
Taiwan	37
from	35
has	34
with	33
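
The counts above list `the` and `The` as separate tokens because `Counter` compares strings exactly. One common option, sketched here on a made-up token list rather than the notebook's data, is to lowercase tokens before counting.

```python
from collections import Counter

# Made-up tokens for illustration only
sample_tokens = ["The", "city", "the", "City", "of", "the", "city"]

raw_counts = Counter(sample_tokens)
folded_counts = Counter(tok.lower() for tok in sample_tokens)

print(raw_counts["the"], raw_counts["The"])   # counted separately
print(folded_counts["the"])                    # merged into one entry
```

Lowercasing also merges proper nouns with common nouns (`City` and `city`), so whether it helps depends on the analysis.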


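## Stopword and punctuation removal
This section works on the output of tokenization Method 2, the NLTK tokens (`nltk_tokenize`).


**Remove punctuation**
* Method 1: `string`, using `string.punctuation`
* Method 2: `.isalpha()`, which drops every token that contains a non-letter character

**Remove stopwords**
* Load the English stopword list from NLTK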


### Remove punctuation Method 1: by `string`


In [23]:
tokens = nltk_tokenize

In [24]:
import string
print(string.punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [26]:
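def remove_punc(tokens):
  tokens_list = []
  for tok in tokens:
    # `in` on a str is a substring test: single punctuation marks are dropped,
    # but multi-character tokens such as "'s" are kept
    if tok not in string.punctuation:
      tokens_list.append(tok)
  return tokens_list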

print(len(remove_punc(tokens)))
print(remove_punc(tokens))

7336
['Taipei', 'officially', 'Taipei', 'City', 'is', 'the', 'capital', 'and', 'a', 'special', 'municipality', 'of', 'the', 'Republic', 'of', 'China', 'Taiwan', 'Located', 'in', 'Northern', 'Taiwan', 'Taipei', 'City', 'is', 'an', 'enclave', 'of', 'the', 'municipality', 'of', 'New', 'Taipei', 'City', 'that', 'sits', 'about', '25', 'km', '16', 'mi', 'southwest', 'of', 'the', 'northern', 'port', 'city', 'of', 'Keelung', 'Most', 'of', 'the', 'city', 'rests', 'on', 'the', 'Taipei', 'Basin', 'an', 'ancient', 'lakebed', 'The', 'basin', 'is', 'bounded', 'by', 'the', 'relatively', 'narrow', 'valleys', 'of', 'the', 'Keelung', 'and', 'Xindian', 'rivers', 'which', 'join', 'to', 'form', 'the', 'Tamsui', 'River', 'along', 'the', 'city', "'s", 'western', 'border.The', 'municipality', 'of', 'Taipei', 'is', 'home', 'to', 'an', 'estimated', 'population', 'of', '2,488,043', 'January', '2023', 'forming', 'the', 'core', 'part', 'of', 'the', 'Taipei–Keelung', 'metropolitan', 'area', 'also', 'known', 'as', '

In [33]:
from collections import Counter
print("Before removing punctuation")
tokens = nltk_tokenize
word_count = Counter(tokens)
for w, c in word_count.most_common(10):
  print("%s\t%d" % (w, c))

print("=========================")

print("After removing punctuation")
tokens = remove_punc(tokens)
word_count = Counter(tokens)
for w, c in word_count.most_common(10):
  print("%s\t%d" % (w, c))

Before removing punctuation
the	452
,	444
.	304
of	247
and	238
in	205
Taipei	171
to	114
is	100
a	98
=========================
After removing punctuation
the	452
of	247
and	238
in	205
Taipei	171
to	114
is	100
a	98
The	94
as	76
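
The `remove_punc` cell tests `tok not in string.punctuation`, which on a string is a substring test rather than a per-character check. The sketch below shows a stricter variant that drops a token only when every character is punctuation; `is_punct_token` and `remove_punc_strict` are hypothetical names, and the sample tokens are made up.

```python
import string

PUNCT_CHARS = set(string.punctuation)

def is_punct_token(tok):
    # True only when the token is non-empty and every character is punctuation.
    return bool(tok) and all(ch in PUNCT_CHARS for ch in tok)

def remove_punc_strict(tokens):
    return [tok for tok in tokens if not is_punct_token(tok)]

sample_tokens = ["Taipei", "(", ")", ",", "``", "''", "'s", "25"]
print(remove_punc_strict(sample_tokens))  # ['Taipei', "'s", '25']

# The substring test treats two-character tokens inconsistently:
print("()" in string.punctuation)   # True, because "(" and ")" are adjacent in the constant
print("``" in string.punctuation)   # False, so this token would be kept
```

With the per-character check, two-character quote tokens are removed as well, while `'s` and numbers are still kept.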


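### Remove punctuation Method 2: `.isalpha()`
Drop every token that contains a non-letter character. This removes numbers and hyphenated words as well as punctuation.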

In [18]:
tokens = nltk_tokenize

In [19]:
def remove_punc2(tokens):
  tokens_list = []
  for tok in tokens:
    if tok.isalpha():
      tokens_list.append(tok)
  return tokens_list

# def remove_punc_short(tokens):
  # return [tok for tok in tokens if tok.isalpha()]

print(len(remove_punc2(tokens)))
print(remove_punc2(tokens))

6712
['Taipei', 'officially', 'Taipei', 'City', 'is', 'the', 'capital', 'and', 'a', 'special', 'municipality', 'of', 'the', 'Republic', 'of', 'China', 'Taiwan', 'Located', 'in', 'Northern', 'Taiwan', 'Taipei', 'City', 'is', 'an', 'enclave', 'of', 'the', 'municipality', 'of', 'New', 'Taipei', 'City', 'that', 'sits', 'about', 'km', 'mi', 'southwest', 'of', 'the', 'northern', 'port', 'city', 'of', 'Keelung', 'Most', 'of', 'the', 'city', 'rests', 'on', 'the', 'Taipei', 'Basin', 'an', 'ancient', 'lakebed', 'The', 'basin', 'is', 'bounded', 'by', 'the', 'relatively', 'narrow', 'valleys', 'of', 'the', 'Keelung', 'and', 'Xindian', 'rivers', 'which', 'join', 'to', 'form', 'the', 'Tamsui', 'River', 'along', 'the', 'city', 'western', 'municipality', 'of', 'Taipei', 'is', 'home', 'to', 'an', 'estimated', 'population', 'of', 'January', 'forming', 'the', 'core', 'part', 'of', 'the', 'metropolitan', 'area', 'also', 'known', 'as', 'Greater', 'Taipei', 'which', 'includes', 'the', 'nearby', 'cities', 'of
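
To make the effect of `.isalpha()` concrete, the sketch below checks a few tokens, most of which appear in the Method 1 output above. The last sample, `台北`, is added here only to show that `str.isalpha()` is Unicode-aware and not limited to English letters.

```python
samples = ["Taipei", "25", "2,488,043", "'s", "Taipei–Keelung", "border.The", "台北"]

# Map each token to whether .isalpha() would keep it
alpha_flags = {tok: tok.isalpha() for tok in samples}

for tok, keep in alpha_flags.items():
    print(repr(tok), keep)
```

Only `Taipei` and `台北` pass. The filter is simple, but it also discards numbers, the possessive `'s`, and compound names joined by a dash.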

### Remove Stopwords
Load the English stopword list that ships with NLTK.

In [20]:
tokens = nltk_tokenize
tokens = remove_punc2(tokens)

In [21]:
# load english stopwords from NLTK
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")
stopword_list = stopwords.words('english')

print(stopword_list)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [22]:
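# Remove every token that appears in stopword_list.
# The comparison is case-sensitive, so capitalized forms such as "The" are kept.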

def remove_stopwords(tokens):
  tokens_list = []
  for tok in tokens:
    if tok not in stopword_list:
      tokens_list.append(tok)
  return tokens_list

print(len(remove_stopwords(tokens)))
print(remove_stopwords(tokens)[:50])

4406
['Taipei', 'officially', 'Taipei', 'City', 'capital', 'special', 'municipality', 'Republic', 'China', 'Taiwan', 'Located', 'Northern', 'Taiwan', 'Taipei', 'City', 'enclave', 'municipality', 'New', 'Taipei', 'City', 'sits', 'km', 'mi', 'southwest', 'northern', 'port', 'city', 'Keelung', 'Most', 'city', 'rests', 'Taipei', 'Basin', 'ancient', 'lakebed', 'The', 'basin', 'bounded', 'relatively', 'narrow', 'valleys', 'Keelung', 'Xindian', 'rivers', 'join', 'form', 'Tamsui', 'River', 'along', 'city']


In [23]:
from collections import Counter
tokens = remove_stopwords(tokens)
word_count = Counter(tokens)
for w, c in word_count.most_common(10):
  print("%s\t%d" % (w, c))

Taipei	171
The	94
Taiwan	71
city	61
City	26
also	24
District	22
area	21
In	20
population	19


## Stemming and Lemmatization
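* In English text, the same word can be spelled differently depending on tense, number (singular or plural), and voice (active or passive).
* Normalizing these variant forms reduces complexity and speeds up language-model training.


---

1.  **Stemming**: Snowball algorithm
2.  **Lemmatization**: restores a word to its base form (usually more accurate than stemming)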

### Stemming: Snowball algorithm

Reference: http://snowball.tartarus.org/texts/introduction.html

`from nltk.stem.snowball import SnowballStemmer`

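Drawback: overstemming can occur, where the stemmer strips too much and the result is no longer a real word or merges unrelated words.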
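
To show what overstemming looks like without depending on any library, here is a toy suffix stripper. `naive_stem` and its rule list are assumptions made for illustration; this is not the Snowball algorithm, which uses far more careful rules.

```python
def naive_stem(word):
    # Toy rule set for illustration only; this is NOT the Snowball algorithm.
    for suffix in ("ing", "ed", "es", "s"):
        stem = word[: -len(suffix)]
        # Strip the first matching suffix if at least 3 characters remain.
        if word.endswith(suffix) and len(stem) >= 3:
            return stem
    return word

for w in ["opening", "talked", "cities", "news", "during"]:
    print(w, "->", naive_stem(w))
```

`opening` and `talked` reduce sensibly to `open` and `talk`, but `news` becomes `new` (colliding with an unrelated word) and `during` becomes `dur` (not a word). A rule-based stemmer has to balance the same trade-off.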

In [24]:
tokens = remove_stopwords(tokens)

In [25]:
from nltk.stem.snowball import SnowballStemmer
snowball_stemmer = SnowballStemmer("english")

stemmed_tokens = []
for tok in tokens:
  stemmed_tokens.append(snowball_stemmer.stem(tok))

word_count = Counter(stemmed_tokens)

for w, c in word_count.most_common(20):
  print("%s\t%d" % (w, c))

taipei	171
citi	102
the	94
taiwan	71
district	42
area	28
intern	27
nation	26
templ	26
includ	25
world	25
new	24
also	24
popul	22
build	21
locat	20
center	20
museum	20
in	20
system	20


### Lemmatization
Restores each word to its base form (lemma). It is generally more accurate than stemming.

In [26]:
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')

from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

def lemmatize(token):
  # ADJ (a), ADJ_SAT (s), ADV (r), NOUN (n) or VERB (v)
  for p in ['v', 'n', 'a', 'r', 's']:
    l = wordnet_lemmatizer.lemmatize(token, pos=p)
    if l != token:
      return l
  return token

print(lemmatize('Dogs'))
print(lemmatize('dogs'))
print(lemmatize('hits'))

Dogs
dog
hit


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
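
The output above shows `Dogs` returned unchanged while `dogs` becomes `dog`, which suggests the lookup is case-sensitive. One possible workaround is to retry in lowercase when the first attempt changes nothing. The sketch below demonstrates the idea with `toy_lemmatize`, a hypothetical dictionary-backed stand-in, so that it runs without WordNet; `lemmatize_with_case_fallback` is likewise an assumed helper name.

```python
# Hypothetical stand-in for a dictionary-backed lemmatizer; it only knows
# lowercase entries, mimicking the 'Dogs' vs 'dogs' behaviour.
TOY_LEMMAS = {"dogs": "dog", "hits": "hit", "cities": "city"}

def toy_lemmatize(token):
    return TOY_LEMMAS.get(token, token)

def lemmatize_with_case_fallback(token, lemmatize_fn):
    # First try the token exactly as written.
    lemma = lemmatize_fn(token)
    if lemma != token:
        return lemma
    # Otherwise retry in lowercase and keep that result only if it changed.
    lowered = token.lower()
    lemma = lemmatize_fn(lowered)
    return lemma if lemma != lowered else token

print(lemmatize_with_case_fallback("Dogs", toy_lemmatize))    # dog
print(lemmatize_with_case_fallback("Taipei", toy_lemmatize))  # Taipei
```

In the notebook, the `lemmatize` function defined above could be passed as `lemmatize_fn`. Lowercasing can mis-handle proper nouns that happen to match a common word, so the fallback is a judgment call.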


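The comparison below prints only the words for which stemming and lemmatization disagree. The lemmatizer returns dictionary forms and handles irregular forms (`went` → `go`, `feet` → `foot`), whereas the stemmer can produce non-words (`decompos`, `peopl`).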

In [27]:
print("words \t Stemming \t Lemmatization")
for w in [
    'open', 'opens', 'opened', 'opening', 'unopened',
    'talk', 'talks', 'talked', 'talking',
    'decompose', 'decomposes', 'decomposed', 'decomposing',
    'do', 'does', 'did', 
    'wrote', 'written', 'ran', 'gave', 'held', 'went', 'gone',
    'lied', 'lies', 'lay', 'lain', 'lying', 
    'cats', 'people', 'feet', 'women', 'smoothly', 'firstly', 'secondly', 
    ]:
    s = snowball_stemmer.stem(w)
    l = lemmatize(w)
    if s != l:
        print("%s\t%s\t%s" % (w, s, l))

words 	 Stemming 	 Lemmatization
unopened	unopen	unopened
decompose	decompos	decompose
decomposes	decompos	decompose
decomposed	decompos	decompose
decomposing	decompos	decompose
does	doe	do
did	did	do
wrote	wrote	write
written	written	write
ran	ran	run
gave	gave	give
held	held	hold
went	went	go
gone	gone	go
lain	lain	lie
people	peopl	people
feet	feet	foot
women	women	woman
smoothly	smooth	smoothly
firstly	first	firstly
secondly	second	secondly


In [28]:
tokens = remove_stopwords(tokens)
lemmatized_tokens = []
for tok in tokens:
    lemmatized_tokens.append(lemmatize(tok))
word_count = Counter(lemmatized_tokens)

for w, c in word_count.most_common(20):
  print("%s\t%d" % (w, c))

Taipei	171
The	94
city	72
Taiwan	71
area	27
City	26
include	25
also	24
District	22
build	20
district	20
In	20
population	19
Japanese	18
National	18
international	18
Chinese	17
system	17
know	15
center	15


## Applications: Word Cloud
Join the tokens into a single space-separated string, then paste it into a word cloud generator such as:
https://www.jasondavies.com/wordcloud/

In [29]:
repeated_tokens = []
for w, c in word_count.most_common():
  for i in range(c):
    repeated_tokens.append(w)

print(" ".join(repeated_tokens))

Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei
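
The nested loop above can also be written as a single generator expression. The sketch below wraps it in a hypothetical helper, `wordcloud_input`, and runs it on a small made-up `Counter` so the result is easy to check by eye.

```python
from collections import Counter

def wordcloud_input(word_count):
    # Repeat each word as many times as it was counted, most frequent first.
    return " ".join(w for w, c in word_count.most_common() for _ in range(c))

demo_count = Counter(["Taipei", "city", "Taipei", "Taiwan", "city", "Taipei"])
print(wordcloud_input(demo_count))  # Taipei Taipei Taipei city city Taiwan
```

Passing the notebook's `word_count` to this helper should produce the same kind of string as the cell above.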