# Quantifying Texts

- Bag of words
- Text reuse
- Part-of-speech tagging
- Named entity recognition

## Tokenising

In [28]:
import nltk
import re
from nltk.tokenize import TweetTokenizer
trump_twt = "TO ALL AMERICANS: #HappyNewYear &amp; many blessings to you all! Looking forward to a wonderful &amp prosperous 2017 as we work together to #MAGA https://t.co/UaBFaoDYhe"

In [29]:
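# Clean up the tweet text with regex patterns: HTML entities, URLs, hashtags, mentions, punctuation.
# Note: r'&amp;' requires the trailing semicolon, so the bare '&amp' later in the tweet
# is not replaced and survives as 'amp' in the output below.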
trump_twt_clean = re.sub(r'&amp;','and',trump_twt)
trump_twt_clean = re.sub(r'http\S+','',trump_twt_clean)
trump_twt_clean = re.sub(r'#[A-Za-z0-9_]+','',trump_twt_clean)
trump_twt_clean = re.sub(r'@[A-Za-z0-9_]+','',trump_twt_clean)
trump_twt_clean = re.sub(r'[^A-Za-z0-9 ]+','',trump_twt_clean)
trump_twt_clean = trump_twt_clean.strip()
print(trump_twt_clean)

TO ALL AMERICANS  and many blessings to you all Looking forward to a wonderful amp prosperous 2017 as we work together to
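
The output above still contains a stray `amp`, because the first pattern only matches `&amp;` with its semicolon. Below is a minimal sketch of one way to handle both forms, assuming it is acceptable to replace either with "and" and to collapse the leftover double spaces. The function name `clean_tweet` is hypothetical and not part of the original notebook.

```python
import re

def clean_tweet(text):
    """Strip ampersand entities, URLs, hashtags, mentions and punctuation."""
    text = re.sub(r'&amp;?', 'and', text)          # '&amp;' or bare '&amp'
    text = re.sub(r'http\S+', '', text)            # URLs
    text = re.sub(r'[#@][A-Za-z0-9_]+', '', text)  # hashtags and mentions
    text = re.sub(r'[^A-Za-z0-9 ]+', '', text)     # remaining punctuation
    return re.sub(r'\s+', ' ', text).strip()       # collapse repeated spaces

trump_twt = "TO ALL AMERICANS: #HappyNewYear &amp; many blessings to you all! Looking forward to a wonderful &amp prosperous 2017 as we work together to #MAGA https://t.co/UaBFaoDYhe"
print(clean_tweet(trump_twt))
```

The optional semicolon (`;?`) is a loose match: it would also rewrite text such as `&amplify`, so a stricter pattern may be preferable on noisier data.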


In [30]:
# nltk.download('stopwords')
# TweetTokenizer is defined in the nltk.tokenize.casual module
tknzr = TweetTokenizer()
tokens = tknzr.tokenize(trump_twt_clean)
print(tokens)

['TO', 'ALL', 'AMERICANS', 'and', 'many', 'blessings', 'to', 'you', 'all', 'Looking', 'forward', 'to', 'a', 'wonderful', 'amp', 'prosperous', '2017', 'as', 'we', 'work', 'together', 'to']


## Stemming

In [31]:
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")
stemmed_tokens = [stemmer.stem(token) for token in tokens]
print(stemmed_tokens)

['to', 'all', 'american', 'and', 'mani', 'bless', 'to', 'you', 'all', 'look', 'forward', 'to', 'a', 'wonder', 'amp', 'prosper', '2017', 'as', 'we', 'work', 'togeth', 'to']
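
To see what stemming does to the vocabulary, the sketch below compares the raw and stemmed token lists, copied verbatim from the two outputs above, using only the standard library. The Snowball stemmer lowercases its output, so `TO` and `to` collapse into one feature, as do `ALL` and `all`.

```python
from collections import Counter

# Token lists copied from the tokeniser and stemmer outputs above.
tokens = ['TO', 'ALL', 'AMERICANS', 'and', 'many', 'blessings', 'to', 'you',
          'all', 'Looking', 'forward', 'to', 'a', 'wonderful', 'amp',
          'prosperous', '2017', 'as', 'we', 'work', 'together', 'to']
stemmed_tokens = ['to', 'all', 'american', 'and', 'mani', 'bless', 'to', 'you',
                  'all', 'look', 'forward', 'to', 'a', 'wonder', 'amp',
                  'prosper', '2017', 'as', 'we', 'work', 'togeth', 'to']

raw_counts = Counter(tokens)
stem_counts = Counter(stemmed_tokens)

print(len(raw_counts), len(stem_counts))   # distinct features before and after
print(raw_counts['to'], stem_counts['to'])  # 'TO' and 'to' are merged after stemming
```

Here the vocabulary shrinks from 20 distinct tokens to 18, and the count for `to` rises from 3 to 4.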


## Document-feature matrix (DFM)
A matrix of $N$ documents (rows) by $J$ features (columns), where each entry $W_{ij}$ counts the number of times the $j$th feature appears in the $i$th document.

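A document-feature matrix is a data structure that turns text documents into a numeric matrix that a model can read. Each row is a document, each column is a feature, and each cell holds a numeric value.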

The features can be:
- Bag of words
- TF-IDF
- Term frequency
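
As an illustration of the definition above, here is a minimal sketch that builds a count-based DFM and a TF-IDF version in plain Python. The toy corpus is hypothetical, and the TF-IDF weighting shown (raw count times $\log(N / df_j)$) is only one common variant; libraries often apply smoothing or normalisation on top of it.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each document is already tokenised and stemmed.
docs = [
    ["we", "work", "togeth"],
    ["we", "look", "forward"],
    ["work", "work", "work"],
]

# Columns: the sorted vocabulary across all documents.
features = sorted({tok for doc in docs for tok in doc})

# Count matrix W: W[i][j] = number of times feature j appears in document i.
counts = [Counter(doc) for doc in docs]
W = [[c[f] for f in features] for c in counts]

# Document frequency: number of documents containing each feature.
df = [sum(1 for row in W if row[j] > 0) for j in range(len(features))]

# TF-IDF variant assumed here: raw count * log(N / df_j).
N = len(docs)
tfidf = [[row[j] * math.log(N / df[j]) for j in range(len(features))]
         for row in W]

print(features)
for row in W:
    print(row)
```

Each row of `W` is one document and each column one feature, so `W[2]` records that the third document contains `work` three times and nothing else.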



## Web Scraping

In [8]:
import requests
from bs4 import BeautifulSoup
# URL of the page to scrape (first page of the listing)
url = 'https://ieknet.iek.org.tw/iekrpt/DefaultFree.aspx?currentPageIndex=1&domain=0&indu_idno=0&actiontype=rpt'
# Headers sent with the GET request: a browser-like User-Agent,
# plus a custom 'Purpose' header stating why the page is being requested
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3',
    'Purpose': 'course practicing, email :t.hsu7@lse.ac.uk, phone: +44 7491 770206'
}   


In [None]:
response = requests.get(url, headers=headers)

# Check if the request was successful
if response.status_code == 200:
    # Parse the HTML content of the page
    soup = BeautifulSoup(response.content, 'html.parser')
    
    # Find all article titles (assuming they are in <h2> tags with class 'g-font-weight-600')
    titles = soup.find_all('h2', class_='g-font-weight-600')
    
    # Print the titles
    for title in titles:
        print(title.get_text(strip=True))




工研院盤點CES 2026　七大AI關鍵趨勢　AI落地成形　產業全面轉型
IEKView：服務業數位轉型與智慧機器人新契機
IEKView：全球藥品市場與台灣新藥出口競爭力
IEKView：擁抱量子革命
IEKView: Service Industry Digital Transformation: Emerging Opportunities with Smart Robotics
IEKView: Riding the AI Trend to Seize Business Opportunities in the Green Transition
新創引領轉型風潮　工研院論壇發表2026年十大創新趨勢
Zettabyte財務長龍牧生：算力資本化啟動全球AI新競局
工研院「眺望2026產業發展趨勢研討會」登場　以AI與量子為核心　四大策略打造臺灣量子產業生態系
2025 IEKTopics｜剖析量子科技新創聚落形成的三大類模式


In [36]:
# for page_num in range(1, 10):  
#     url = f'https://ieknet.iek.org.tw/iekrpt/DefaultFree.aspx?currentPageIndex={page_num}&domain=0&indu_idno=0&actiontype=rpt'
#     response = requests.get(url, headers=headers)
    
#     if response.status_code == 200:
#         soup = BeautifulSoup(response.content, 'html.parser')
#         titles = soup.find_all('h2', class_='g-font-weight-600')
        
#         print(f'--- Page {page_num} ---')
#         for title in titles:
#             print(title.get_text(strip=True))
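
The commented-out loop above changes only the `currentPageIndex` query parameter between requests. A minimal sketch of building those URLs with the standard library, rather than by string formatting, might look like this. The names `BASE_URL` and `page_url` are hypothetical, and the sketch assumes the site paginates purely through that one parameter.

```python
from urllib.parse import urlencode

BASE_URL = 'https://ieknet.iek.org.tw/iekrpt/DefaultFree.aspx'

def page_url(page_num):
    """Build the listing URL for a given page index."""
    params = {
        'currentPageIndex': page_num,
        'domain': 0,
        'indu_idno': 0,
        'actiontype': 'rpt',
    }
    return f'{BASE_URL}?{urlencode(params)}'

# Pages 1 to 9, matching range(1, 10) in the loop above.
urls = [page_url(n) for n in range(1, 10)]
print(urls[0])
```

Each URL can then be passed to `requests.get(url, headers=headers)` exactly as in the single-page cell. Pausing between requests (for example with `time.sleep`) keeps the load on the server low.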