In [1]:
import numpy as np
import spacy


In [2]:
nlp = spacy.load("en_core_web_sm")

In [3]:
text = ("""Thailand, officially the Kingdom of Thailand, is a country in Southeast Asia, located at the centre of the Indochinese Peninsula, spanning 513,120 square kilometres (198,120 sq mi), with a population of almost 70 million.[11] The country is bordered to the north by Myanmar and Laos, to the east by Laos and Cambodia, to the south by the Gulf of Thailand and Malaysia, and to the west by the Andaman Sea and the extremity of Myanmar. Thailand also shares maritime borders with Vietnam to the southeast, and Indonesia and India to the southwest. Bangkok is the nation's capital and largest city.

Tai peoples migrated from southwestern China to mainland Southeast Asia from the 11th century. Indianised kingdoms such as the Mon, Khmer Empire and Malay states ruled the region, competing with Thai states such as the Kingdoms of Ngoenyang, Sukhothai, Lan Na and Ayutthaya, which also rivalled each other. European contact began in 1511 with a Portuguese diplomatic mission to Ayutthaya, which became a regional power by the end of the 15th century. Ayutthaya reached its peak during the 18th century, until it was destroyed in the Burmese–Siamese War. Taksin quickly reunified the fragmented territory and established the short-lived Thonburi Kingdom. He was succeeded in 1782 by Buddha Yodfa Chulaloke, the first monarch of the current Chakri dynasty. Throughout the era of Western imperialism in Asia, Siam remained the only nation in the region to avoid colonization by foreign powers, although it was often forced to make territorial, trade and legal concessions in unequal treaties.[12] The Siamese system of government was centralised and transformed into a modern unitary absolute monarchy in the reign of Chulalongkorn. In World War I, Siam sided with the Allies, a political decision made in order to amend the unequal treaties. Following a bloodless revolution in 1932, it became a constitutional monarchy and changed its official name to Thailand, becoming an ally of Japan in World War II. In the late 1950s, a military coup under Field Marshal Sarit Thanarat revived the monarchy's historically influential role in politics. Thailand became a major ally of the United States, and played an anti-communist role in the region as a member of the failed SEATO, but from 1975 sought to improve relations with Communist China and Thailand's neighbours.

Apart from a brief period of parliamentary democracy in the mid-1970s, Thailand has periodically alternated between democracy and military rule. Since the 2000s the country has been caught in continual bitter political conflict between supporters and opponents of Thaksin Shinawatra, which resulted in two coups (in 2006 and 2014), along with the establishment of its current constitution, a nominally democratic government after the 2019 Thai general election, and large pro-democracy protests in 2020–2021 which included unprecedented demands to reform the monarchy. Since 2019, it has been nominally a parliamentary constitutional monarchy; in practice, however, structural advantages in the constitution have ensured the military's hold on power.[13]

Thailand is a middle power in global affairs and a founding member of ASEAN, and ranks very high in the Human Development Index. It has the second-largest economy in Southeast Asia and the 24th-largest in the world by PPP. Thailand is classified as a newly industrialised economy, with manufacturing, agriculture, and tourism as leading sectors.""")

In [4]:
doc = nlp(text)

In [5]:
# Iterating over the raw string yields single characters, not words
for char in text[0:10]:
    print(char)

T
h
a
i
l
a
n
d
,
 


In [6]:
# Iterating over the Doc yields tokens (words and punctuation)
for token in doc[0:10]:
    print(token)

Thailand
,
officially
the
Kingdom
of
Thailand
,
is
a


In [7]:
# Compare the number of characters in the string with the number of tokens in the Doc
print (len(text))
print (len(doc))

3460
624


In [8]:
sentence1 = list(doc.sents)
for x in sentence1[:5]:
    print (x)

Thailand, officially the Kingdom of Thailand, is a country in Southeast Asia, located at the centre of the Indochinese Peninsula, spanning 513,120 square kilometres (198,120 sq mi), with a population of almost 70 million.[11] The country is bordered to the north by Myanmar and Laos, to the east by Laos and Cambodia, to the south by the Gulf of Thailand and Malaysia, and to the west by the Andaman Sea and the extremity of Myanmar.
Thailand also shares maritime borders with Vietnam to the southeast, and Indonesia and India to the southwest.
Bangkok is the nation's capital and largest city.


Tai peoples migrated from southwestern China to mainland Southeast Asia from the 11th century.
Indianised kingdoms such as the Mon, Khmer Empire and Malay states ruled the region, competing with Thai states such as the Kingdoms of Ngoenyang, Sukhothai, Lan Na and Ayutthaya, which also rivalled each other.
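In the output above, the first printed "sentence" actually contains two sentences: the citation marker `[11]` after "million." appears to stop the parser from splitting there. One possible workaround, sketched here under the assumption that citation markers are always digits in square brackets, is to strip them before calling `nlp`. The `sample` string is a shortened excerpt used only for illustration.

```python
import re

# Short excerpt from the text above, including the "[11]" citation marker
sample = (
    "with a population of almost 70 million.[11] "
    "The country is bordered to the north by Myanmar and Laos."
)

# Assumption: citation markers are always digits inside square brackets
citation_marker = re.compile(r"\[\d+\]")
cleaned = citation_marker.sub("", sample)
print(cleaned)
```

Applying the same substitution to `text` before `nlp(text)` may also avoid odd tokens such as `power.[13`, though the effect on sentence splitting would need to be checked against the model in use.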


In [9]:
print (len(sentence1))

20


In [10]:
print ((sentence1)[19])

Thailand is classified as a newly industrialised economy, with manufacturing, agriculture, and tourism as leading sectors.


In [11]:
token1 = doc[0]
token1.text


'Thailand'

In [12]:
token1.ent_type

384

In [13]:
token1.ent_type_

'GPE'

In [14]:
(doc[45])

bordered

In [15]:
(doc[45]).lemma_

'border'

In [16]:
for token in ((sentence1)[19]):
    print (token.text, token.pos_, token.dep_)

Thailand PROPN nsubjpass
is AUX auxpass
classified VERB ROOT
as ADP prep
a DET det
newly ADV advmod
industrialised VERB amod
economy NOUN pobj
, PUNCT punct
with ADP prep
manufacturing NOUN pobj
, PUNCT punct
agriculture NOUN conj
, PUNCT punct
and CCONJ cc
tourism NOUN conj
as ADP prep
leading VERB amod
sectors NOUN pobj
. PUNCT punct


In [17]:
# Showing Part of Speech (POS) tagging and Syntactic Dependency (DEP) parsing
from spacy import displacy
displacy.render((sentence1)[19], style="dep")

In [18]:
# Showing Named Entities Recognition (NER)
for ent in (doc.ents[:20]):
    print (ent.text, ent.label_)

Thailand GPE
the Kingdom of Thailand GPE
Southeast Asia LOC
the Indochinese Peninsula LOC
513,120 square kilometres QUANTITY
198,120 sq mi QUANTITY
almost 70 CARDINAL
Myanmar GPE
Laos GPE
Laos GPE
Cambodia GPE
the Gulf of Thailand LOC
Malaysia GPE
the Andaman Sea LOC
Myanmar GPE
Thailand GPE
Vietnam GPE
Indonesia GPE
India GPE
Bangkok GPE


In [19]:
# Labelling Named Entities (NER)
displacy.render(doc[:250], style="ent")

In [20]:
# Switch to the medium model, which also includes word vectors
nlp = spacy.load("en_core_web_md")

In [21]:
doc = nlp(text)
sentence2 = list(doc.sents)[18]
print(sentence2)

It has the second-largest economy in Southeast Asia and the 24th-largest in the world by PPP.


In [22]:
from spacy.matcher import Matcher

In [23]:
matcher = Matcher(nlp.vocab)
pattern = [{"POS": "PROPN"}]
matcher.add("PROPER_NOUN", [pattern])
doc = nlp(text)
matches = matcher(doc)
print (len(matches))
for match in matches[:10]:
    print (match, doc[match[1]:match[2]])

88
(451313080118390996, 0, 1) Thailand
(451313080118390996, 4, 5) Kingdom
(451313080118390996, 6, 7) Thailand
(451313080118390996, 12, 13) Southeast
(451313080118390996, 13, 14) Asia
(451313080118390996, 21, 22) Indochinese
(451313080118390996, 22, 23) Peninsula
(451313080118390996, 30, 31) sq
(451313080118390996, 31, 32) mi
(451313080118390996, 50, 51) Myanmar


In [24]:
matcher = Matcher(nlp.vocab)
pattern = [{"POS": "PROPN", "OP": "+"}]
matcher.add("PROPER_NOUN", [pattern])
doc = nlp(text)
matches = matcher(doc)
print (len(matches))
for match in matches[:10]:
    print (match, doc[match[1]:match[2]])

118
(451313080118390996, 0, 1) Thailand
(451313080118390996, 4, 5) Kingdom
(451313080118390996, 6, 7) Thailand
(451313080118390996, 12, 13) Southeast
(451313080118390996, 12, 14) Southeast Asia
(451313080118390996, 13, 14) Asia
(451313080118390996, 21, 22) Indochinese
(451313080118390996, 21, 23) Indochinese Peninsula
(451313080118390996, 22, 23) Peninsula
(451313080118390996, 30, 31) sq


In [25]:
# Keep only the longest match where matches overlap
matcher = Matcher(nlp.vocab)
pattern = [{"POS": "PROPN", "OP": "+"}]
matcher.add("PROPER_NOUN", [pattern], greedy="LONGEST")
doc = nlp(text)
matches = matcher(doc)
print (len(matches))
for match in matches[:10]:
    print (match, doc[match[1]:match[2]])

65
(451313080118390996, 376, 380) Field Marshal Sarit Thanarat
(451313080118390996, 239, 242) Buddha Yodfa Chulaloke
(451313080118390996, 315, 318) World War I
(451313080118390996, 363, 366) World War II
(451313080118390996, 579, 582) Human Development Index
(451313080118390996, 12, 14) Southeast Asia
(451313080118390996, 21, 23) Indochinese Peninsula
(451313080118390996, 30, 32) sq mi
(451313080118390996, 79, 81) Andaman Sea
(451313080118390996, 125, 127) Southeast Asia
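The drop from 118 matches to 65, and the longest-first ordering of the output, can be illustrated with a small pure-Python filter. This is only a sketch of the idea behind longest-match filtering, not spaCy's actual implementation; `keep_longest` is a hypothetical helper, and the spans are copied from the unfiltered `"OP": "+"` output shown earlier.

```python
# (start, end) token spans taken from the unfiltered "+" output shown earlier
spans = [(0, 1), (4, 5), (6, 7), (12, 13), (12, 14), (13, 14),
         (21, 22), (21, 23), (22, 23), (30, 31)]

def keep_longest(spans):
    """Keep the longest spans first and drop any span that overlaps a kept one."""
    kept = []
    taken = set()  # token indices already covered by a kept span
    # Longest first; ties are broken by the earlier start position
    for start, end in sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0])):
        if not any(i in taken for i in range(start, end)):
            kept.append((start, end))
            taken.update(range(start, end))
    return kept

longest = keep_longest(spans)
print(longest)          # longest spans come first
print(sorted(longest))  # same spans in document order
```

Because the filtered result is ordered by length rather than by position, sorting by the start index is needed to read the matches in document order.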


In [26]:
# Show the last 10 matches
matcher = Matcher(nlp.vocab)
pattern = [{"POS": "PROPN", "OP": "+"}]
matcher.add("PROPER_NOUN", [pattern], greedy="LONGEST")
doc = nlp(text)
matches = matcher(doc)
print (len(matches))
for match in matches[-10:]:
    print (match, doc[match[1]:match[2]])

65
(451313080118390996, 390, 391) Thailand
(451313080118390996, 416, 417) SEATO
(451313080118390996, 429, 430) Thailand
(451313080118390996, 446, 447) Thailand
(451313080118390996, 505, 506) Thai
(451313080118390996, 555, 556) power.[13
(451313080118390996, 558, 559) Thailand
(451313080118390996, 571, 572) ASEAN
(451313080118390996, 602, 603) PPP
(451313080118390996, 604, 605) Thailand


In [27]:
# Sort the matches by start position so they appear in document order
matcher = Matcher(nlp.vocab)
pattern = [{"POS": "PROPN", "OP": "+"}]
matcher.add("PROPER_NOUN", [pattern], greedy="LONGEST")
doc = nlp(text)
matches = matcher(doc)
matches.sort(key = lambda x: x[1])
print (len(matches))
for match in matches[:10]:
    print (match, doc[match[1]:match[2]])

65
(451313080118390996, 0, 1) Thailand
(451313080118390996, 4, 5) Kingdom
(451313080118390996, 6, 7) Thailand
(451313080118390996, 12, 14) Southeast Asia
(451313080118390996, 21, 23) Indochinese Peninsula
(451313080118390996, 30, 32) sq mi
(451313080118390996, 50, 51) Myanmar
(451313080118390996, 52, 53) Laos
(451313080118390996, 58, 59) Laos
(451313080118390996, 60, 61) Cambodia


In [28]:
# Match a run of proper nouns followed by a verb
matcher = Matcher(nlp.vocab)
pattern = [{"POS": "PROPN", "OP": "+"}, {"POS": "VERB"}]
matcher.add("PROPER_NOUN", [pattern], greedy="LONGEST")
doc = nlp(text)
matches = matcher(doc)
matches.sort(key = lambda x: x[1])
print (len(matches))
for match in matches[:5]:
    print (match, doc[match[1]:match[2]])

5
(451313080118390996, 198, 200) Ayutthaya reached
(451313080118390996, 261, 263) Siam remained
(451313080118390996, 319, 321) Siam sided
(451313080118390996, 376, 381) Field Marshal Sarit Thanarat revived
(451313080118390996, 390, 392) Thailand became


In [29]:
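# Match a single-quoted phrase, a verb of speech or thought, a proper noun, then another single-quoted phrase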
speak_lemmas = ["think", "say"]
matcher = Matcher(nlp.vocab)
pattern = [{"ORTH": "'"}, 
           {"IS_ALPHA": True, "OP": "+"},
           {"IS_PUNCT": True, "OP": "*"}, 
           {"ORTH": "'"},
           {"POS": "VERB", "LEMMA": {"IN": speak_lemmas}},
           {"POS": "PROPN", "OP": "+"},
           {"ORTH": "'"}, 
           {"IS_ALPHA": True, "OP": "+"},
           {"IS_PUNCT": True, "OP": "*"}, 
           {"ORTH": "'"}          
          ]
matcher.add("PROPER_NOUN", [pattern], greedy="LONGEST")
doc = nlp(text)
matches = matcher(doc)
matches.sort(key = lambda x: x[1])
print (len(matches))
for match in matches[:10]:
    print (match, doc[match[1]:match[2]])

0


In [30]:
doc3 = nlp("Bangkok is a city. John is a data analyst.")

In [31]:
for ent in doc3.ents:
    print (ent.text, ent.label_)

Bangkok GPE
John PERSON


In [32]:
from spacy.language import Language

In [33]:
# Custom component that drops GPE entities from doc.ents
@Language.component("remove_gpe")
def remove_gpe(doc):
    # Keep every entity whose label is not GPE
    doc.ents = [ent for ent in doc.ents if ent.label_ != "GPE"]
    return doc


In [34]:
nlp.add_pipe("remove_gpe")

<function __main__.remove_gpe(doc)>

In [35]:
# Confirm that the remove_gpe component has been added to the pipeline
nlp.analyze_pipes()

{'summary': {'tok2vec': {'assigns': ['doc.tensor'],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'tagger': {'assigns': ['token.tag'],
   'requires': [],
   'scores': ['tag_acc'],
   'retokenizes': False},
  'parser': {'assigns': ['token.dep',
    'token.head',
    'token.is_sent_start',
    'doc.sents'],
   'requires': [],
   'scores': ['dep_uas',
    'dep_las',
    'dep_las_per_type',
    'sents_p',
    'sents_r',
    'sents_f'],
   'retokenizes': False},
  'attribute_ruler': {'assigns': [],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'lemmatizer': {'assigns': ['token.lemma'],
   'requires': [],
   'scores': ['lemma_acc'],
   'retokenizes': False},
  'ner': {'assigns': ['doc.ents', 'token.ent_iob', 'token.ent_type'],
   'requires': [],
   'scores': ['ents_f', 'ents_p', 'ents_r', 'ents_per_type'],
   'retokenizes': False},
  'remove_gpe': {'assigns': [],
   'requires': [],
   'scores': [],
   'retokenizes': False}},
 'problems': {'tok2vec': [],
  

In [36]:
doc3 = nlp("Bangkok is a city. John is a data analyst.")
for ent in doc3.ents:
    print (ent.text, ent.label_)

John PERSON


In [37]:
# Saving this to a new model
# nlp.to_disk("data/new_en_core_web_sm")

### Regexes - Pattern Matching of strings in Python
Useful for password checkers, phone numbers, emails, and more!

In [38]:
# Import the regular expression module
import re

In [39]:
text2 = "Paul Newman is an American actor, but Paul Hollywood is a British TV host. "
phrases = ["abcd","xxx","aaa abxxxcd ccc","ab cd"]

In [40]:
pattern = r"Paul [A-Z]\w+"

In [41]:
matches = re.finditer(pattern, text2)
for match in matches:
    print (match)

<re.Match object; span=(0, 11), match='Paul Newman'>
<re.Match object; span=(38, 52), match='Paul Hollywood'>
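Printing the raw `re.Match` objects shows the span and matched text, but the pieces are usually more convenient to pull out individually. A minimal sketch, assuming we only want the surname: the named group `surname` is an addition for illustration and is not part of the original pattern.

```python
import re

text2 = "Paul Newman is an American actor, but Paul Hollywood is a British TV host. "

# Same pattern as above, with a named group around the surname
pattern = re.compile(r"Paul (?P<surname>[A-Z]\w+)")

surnames = []
spans = []
for match in pattern.finditer(text2):
    # group(0) is the whole match; span() gives its character offsets
    print(match.group(0), match.span(), match.group("surname"))
    surnames.append(match.group("surname"))
    spans.append(match.span())
```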


In [42]:
regexp = re.compile(r"ab[^\s]*cd")

matches = []
for phrase in phrases:
    # .search finds the pattern anywhere in the string; .match only
    # succeeds at the start, so it would miss 'aaa abxxxcd ccc'
    if regexp.search(phrase):
        matches.append(phrase)

print(matches)

['abcd', 'aaa abxxxcd ccc']
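The choice between `search`, `match`, and `fullmatch` changes which phrases are kept. The sketch below runs all three on the same `phrases` list with the same compiled pattern; the result variable names are illustrative only.

```python
import re

phrases = ["abcd", "xxx", "aaa abxxxcd ccc", "ab cd"]
regexp = re.compile(r"ab[^\s]*cd")

# search: the pattern may occur anywhere in the string
found_anywhere = [p for p in phrases if regexp.search(p)]
# match: the pattern must start at the beginning of the string
found_at_start = [p for p in phrases if regexp.match(p)]
# fullmatch: the pattern must cover the entire string
found_whole = [p for p in phrases if regexp.fullmatch(p)]

print(found_anywhere)
print(found_at_start)
print(found_whole)
```

Note that `"ab cd"` is rejected by all three, because `[^\s]*` cannot consume the space between `ab` and `cd`.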


### Exercise on using a vectorizer from scikit-learn for classification

In [43]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import svm

In [44]:
class Category:
    BOOKS = "BOOKS"
    CLOTHING = "CLOTHING"

train_x = ["i love the book", "this is a great book", "the fit is great", "i love the shoes"]
train_y = [Category.BOOKS, Category.BOOKS, Category.CLOTHING, Category.CLOTHING]

In [45]:
vectorizer = CountVectorizer(binary=True)
train_x_vectors = vectorizer.fit_transform(train_x)

print(vectorizer.get_feature_names_out())
print(train_x_vectors.toarray())

['book' 'fit' 'great' 'is' 'love' 'shoes' 'the' 'this']
[[1 0 0 0 1 0 1 0]
 [1 0 1 1 0 0 0 1]
 [0 1 1 1 0 0 1 0]
 [0 0 0 0 1 1 1 0]]
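The feature list above has no `i` or `a`. That is consistent with `CountVectorizer`'s default token pattern, which keeps only tokens of two or more word characters. The following pure-Python sketch rebuilds the same binary matrix under that assumption; `tokenize`, `vocabulary`, and `rows` are illustrative names, and this is a simplified stand-in rather than scikit-learn's implementation.

```python
import re

train_x = ["i love the book", "this is a great book", "the fit is great", "i love the shoes"]

# Assumption: mimic CountVectorizer's default token pattern, which keeps
# only tokens of two or more word characters (so "i" and "a" are dropped)
token_pattern = re.compile(r"\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and return the set of tokens it contains."""
    return set(token_pattern.findall(text.lower()))

# The vocabulary is the sorted set of all tokens seen in the training texts
vocabulary = sorted(set().union(*(tokenize(t) for t in train_x)))

# binary=True means each cell records presence (1) or absence (0), not a count
rows = []
for text in train_x:
    tokens = tokenize(text)
    rows.append([1 if word in tokens else 0 for word in vocabulary])

print(vocabulary)
for row in rows:
    print(row)
```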


In [46]:
clf_svm = svm.SVC(kernel='linear')
clf_svm.fit(train_x_vectors, train_y)

SVC(kernel='linear')

In [47]:
test_x = vectorizer.transform(['i like the book'])
clf_svm.predict(test_x)


array(['BOOKS'], dtype='<U8')

In [48]:
test_x = vectorizer.transform(['Boots are alright'])
clf_svm.predict(test_x)

array(['CLOTHING'], dtype='<U8')
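The `CLOTHING` prediction for "Boots are alright" should be read with caution: none of its words occur in the training vocabulary, so the bag-of-words vector is all zeros and a linear model can only fall back on its intercept. The sketch below shows the two test vectors in pure Python, assuming the default two-or-more-character token pattern; `to_binary_vector` is a hypothetical helper, and the vocabulary is copied from the `get_feature_names_out()` output above.

```python
import re

# Vocabulary copied from the get_feature_names_out() output above
vocabulary = ["book", "fit", "great", "is", "love", "shoes", "the", "this"]
token_pattern = re.compile(r"\b\w\w+\b")  # assumed default CountVectorizer pattern

def to_binary_vector(text):
    """Return a 0/1 vector marking which vocabulary words occur in the text."""
    tokens = set(token_pattern.findall(text.lower()))
    return [1 if word in tokens else 0 for word in vocabulary]

known = to_binary_vector("i like the book")
unknown = to_binary_vector("Boots are alright")

print(known)    # "book" and "the" are in the vocabulary
print(unknown)  # no overlap with the vocabulary at all
```

This limitation is one motivation for the word-vector approach in the next section, where unseen words can still map to a meaningful representation.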

In [49]:
print(train_x)

['i love the book', 'this is a great book', 'the fit is great', 'i love the shoes']


### Show vectors using pre-trained model from spaCy

In [50]:
docs = [nlp(i) for i in train_x]
# nlp() takes one string at a time; nlp.pipe(train_x) is the batched alternative
train_x_word_vectors = [x.vector for x in docs]
print(docs[0], docs[0].vector[:50])

i love the book [-0.3980475  -1.705925   -0.90664995 -4.5425     -1.1165801  -2.915125
  3.175245    4.088725   -3.447475    2.38406     6.485725    2.3083498
 -8.64645     2.0437698   2.269975   -1.0261      4.09154    -0.7480149
  0.11435002 -1.9810501   1.3855026   1.707      -2.9752648  -1.9328325
 -1.42555    -2.0426226  -3.7064652  -0.4378465  -2.0860374   4.43085
 -1.0481     -0.78117514 -1.687       1.9781501   1.4894226  -0.28325254
 -1.4800999   1.4303375   2.606875   -1.393568   -0.45071498  1.8592875
  0.60194993 -2.03559     5.385375    3.3568425  -2.65585    -2.5876875
 -0.41877502  1.1819749 ]
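By default, spaCy's `doc.vector` is the average of the vectors of the tokens in the document, so each training sentence collapses to a single fixed-length vector. A toy sketch of that averaging follows; the 3-dimensional vectors in `token_vectors` are made-up values for illustration, not real model vectors, and `average_vector` is a hypothetical helper.

```python
# Hypothetical 3-dimensional token vectors; real model vectors are much longer
token_vectors = {
    "i":    [1.0, 0.0, 2.0],
    "love": [3.0, 2.0, 0.0],
    "the":  [0.0, 2.0, 2.0],
    "book": [4.0, 0.0, 0.0],
}

def average_vector(tokens):
    """Average the token vectors dimension by dimension."""
    vectors = [token_vectors[t] for t in tokens]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]

doc_vector = average_vector("i love the book".split())
print(doc_vector)
```

Averaging discards word order, which is usually acceptable for short category-style sentences like the ones in `train_x`.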


In [51]:
clf_svm_wv = svm.SVC(kernel='linear')
clf_svm_wv.fit(train_x_word_vectors, train_y)

SVC(kernel='linear')

In [52]:
test_x = ["these earings hurt"]
test_docs = [nlp(text) for text in test_x]
test_x_word_vectors = [x.vector for x in test_docs]

clf_svm_wv.predict(test_x_word_vectors)


array(['CLOTHING'], dtype='<U8')