 
# [Hands-on Exercise] NLP Pipeline

:::{note} NLP pipelines can vary

- An NLP workflow is not always linear
- Processing often loops back to earlier steps
- Each pipeline has to be designed around the specific task at hand
:::

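## Data Collection

### Data Acquisition: The Core of an ML System

- Ideal case: we have all the data we want
- This includes labels and annotations
- In practice, however, we rarely work in the ideal case and are usually missing data of one kind or another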

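### Handling Non-Ideal Situations

- An initial dataset with limited annotations/labels
- An initial dataset labeled with regular expressions or heuristics
- Public datasets (cf. [Google Dataset Search](https://datasetsearch.research.google.com/) or [Kaggle](https://www.kaggle.com/))
- Incomplete data
- Product intervention
- Data augmentation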

### Data Augmentation

- A technique that exploits linguistic similarity to generate new data.
- Common strategies include:
    - Synonym replacement
    - Related-word replacement (based on association metrics)
    - Back translation
    - Entity replacement
    - Noise injection (e.g., spelling errors, random words)
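
Two of the strategies above, synonym replacement and noise injection, can be sketched in a few lines. This is a minimal illustration rather than a production recipe: the `SYNONYMS` table and the helper names `replace_synonyms` and `add_typo` are hypothetical, and a real system would more likely draw synonyms from a lexical resource or from embeddings.

```python
import random

# Hypothetical toy synonym table, for illustration only.
SYNONYMS = {
    "quick": ["fast", "speedy"],
    "happy": ["glad", "cheerful"],
}

def replace_synonyms(tokens, rng):
    """Swap every token that has an entry in SYNONYMS for a random synonym."""
    return [rng.choice(SYNONYMS[t]) if t in SYNONYMS else t for t in tokens]

def add_typo(token, rng):
    """Inject spelling noise by swapping two adjacent characters."""
    if len(token) < 2:
        return token
    i = rng.randrange(len(token) - 1)
    return token[:i] + token[i + 1] + token[i] + token[i + 2:]

rng = random.Random(0)  # fixed seed so the augmented data is reproducible
tokens = "the quick dog looks happy".split()

augmented = replace_synonyms(tokens, rng)
# Only perturb longer words so that short function words stay readable.
noisy = [add_typo(t, rng) if len(t) > 3 else t for t in tokens]

print(" ".join(augmented))
print(" ".join(noisy))
```

Each augmented sentence keeps the label of the sentence it was derived from, which is what makes the technique useful when labeled data is scarce.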

## Text Extraction and Cleanup

### Text Extraction

- Extracting data from raw sources
    - HTML
    - PDF
- Relevant vs. irrelevant information
    - Non-textual information
    - Markup
    - Metadata
- Encoding format
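
As a rough sketch of the points above, the following uses only the standard library to decode a page with its actual encoding and keep the visible text while dropping markup and the contents of `<script>` and `<style>`. The page content and the `TextExtractor` class are made up for illustration, and GBK is assumed here simply as an example of a non-UTF-8 encoding.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text and skip the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

# Made-up page, stored as GBK bytes to mimic a non-UTF-8 source.
raw_bytes = (
    "<html><head><title>示例</title><style>p {color: red}</style></head>"
    "<body><p>自然语言处理</p><script>var x = 1;</script>"
    "<p>NLP pipeline</p></body></html>"
).encode("gbk")

html = raw_bytes.decode("gbk")  # decode with the encoding the page really uses
parser = TextExtractor()
parser.feed(html)
parser.close()
text_parts = parser.parts
print(text_parts)
```

Decoding the same bytes as UTF-8 would fail or produce garbled characters, which is why the encoding has to be settled before any text processing starts.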

#### Extracting Text from Web Pages

In [16]:
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = 'https://news.google.com/topstories?hl=zh-CN&gl=CN&ceid=CN:zh-Hans'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

# Headline links; the class name 'DY5T1d' is tied to the page layout and may change
titles = soup.find_all('a', class_='DY5T1d')
# Resolve the relative href ('./articles/...') of the second headline against the page URL
first_art_link = urljoin(url, titles[1]['href'])
print(first_art_link)

art_request = requests.get(first_art_link)
# The article may not be UTF-8 (e.g., GBK), so let requests detect the encoding from the content
art_request.encoding = art_request.apparent_encoding
soup_art = BeautifulSoup(art_request.text, 'html.parser')

# Print the text of every paragraph in the article
art_texts = [p.text for p in soup_art.find_all('p')]
for text in art_texts:
    print(text)

https://news.google.com/articles/CBMiPmh0dHA6Ly9wb2xpdGljcy5wZW9wbGUuY29tLmNuL24xLzIwMjIvMDgyNC9jMTAwMS0zMjUxMDA4Ny5odG1s0gEA?hl=zh-CN&gl=CN&ceid=CN%3Azh-Hans

#### Extracting Text from Scanned PDFs

This requires the Tesseract OCR engine; for an installation guide, see https://nanonets.com/blog/ocr-with-tesseract/

In [18]:
!pip install pytesseract

Collecting pytesseract
  Downloading pytesseract-0.3.10-py3-none-any.whl (14 kB)
Installing collected packages: pytesseract
Successfully installed pytesseract-0.3.10


In [19]:
from PIL import Image
from pytesseract import image_to_string


YOUR_DEMO_DATA_PATH = "data/"  # please change your file path
filename = YOUR_DEMO_DATA_PATH+'pdf-firth-text.png'
text = image_to_string(Image.open(filename))
print(text)

TesseractNotFoundError: tesseract is not installed or it's not in your PATH. See README file for more information.

#### Unicode Normalization

In [None]:
text = 'I feel really 😡. GOGOGO!! 💪💪💪  🤣🤣 ȀÆĎǦƓ'
print(text)
text2 = text.encode('utf-8') # encode the strings in bytes
print(text2)


In [None]:
import unicodedata
unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')

'I feel really . GOGOGO!!    ADG'

- For details, see the [unicodedata documentation](https://docs.python.org/3/library/unicodedata.html)
- Other useful libraries
    - Spelling check: pyenchant, Microsoft REST API
    - PDF: PyPDF, PDFMiner
    - OCR: pytesseract
 

### Text Cleaning

- Preliminaries
    - Sentence segmentation
    - Word tokenization
    

#### Segmentation and Tokenization

In [None]:
from nltk.tokenize import sent_tokenize, word_tokenize

text = '''
Python is an interpreted, high-level and general-purpose programming language. Python's design philosophy emphasizes code readability with its notable use of significant whitespace. Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects.
'''

## sent segmentation
sents = sent_tokenize(text)

## word tokenization
for sent in sents:
    print(sent)
    print(word_tokenize(sent))


Python is an interpreted, high-level and general-purpose programming language.
['Python', 'is', 'an', 'interpreted', ',', 'high-level', 'and', 'general-purpose', 'programming', 'language', '.']
Python's design philosophy emphasizes code readability with its notable use of significant whitespace.
['Python', "'s", 'design', 'philosophy', 'emphasizes', 'code', 'readability', 'with', 'its', 'notable', 'use', 'of', 'significant', 'whitespace', '.']
Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects.
['Its', 'language', 'constructs', 'and', 'object-oriented', 'approach', 'aim', 'to', 'help', 'programmers', 'write', 'clear', ',', 'logical', 'code', 'for', 'small', 'and', 'large-scale', 'projects', '.']


- Commonly used preprocessing steps
    - Stopword removal
    - Stemming and lemmatization
    - Digit and punctuation removal
    - Case normalization
    

#### Removing Stopwords, Punctuation, and Digits

In [None]:
from nltk.corpus import stopwords
from string import punctuation

eng_stopwords = stopwords.words('english')

text = "Mr. John O'Neil works at Wonderland, located at 245 Goleta Avenue, CA., 74208."

words = word_tokenize(text)

print(words)

# remove stopwords, punctuations, digits
for w in words:
    if w not in eng_stopwords and w not in punctuation and not w.isdigit():
        print(w)

['Mr.', 'John', "O'Neil", 'works', 'at', 'Wonderland', ',', 'located', 'at', '245', 'Goleta', 'Avenue', ',', 'CA.', ',', '74208', '.']
Mr.
John
O'Neil
works
Wonderland
located
Goleta
Avenue
CA.


#### Stemming and Lemmatization

In [None]:
## Stemming
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

words = ['cars','revolution', 'better']
print([stemmer.stem(w) for w in words])


['car', 'revolut', 'better']


In [None]:
## Lemmatization
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

## Wordnet requires POS of words
poss = ['n','n','a']

for w,p in zip(words,poss):
    print(lemmatizer.lemmatize(w, pos=p))

car
revolution
good


- Task-specific preprocessing
    - Unicode normalization
    - Language detection
    - Code mixing
    - Transliteration (e.g., using pinyin for Chinese words in English-Chinese code-switching texts)
    

- Automatic annotation
    - Part-of-speech (POS) tagging
    - Parsing
    - Named entity recognition
    - Coreference resolution
    

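### Important Reminders About Preprocessing

- Not every preprocessing step is necessary; weigh the benefits and drawbacks of each step case by case
- The steps are not strictly sequential
- The steps depend on the task
- Goals of preprocessing
    - Text normalization
    - Text tokenization
    - Text enrichment/annotation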

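## Feature Engineering

### What Is Feature Engineering?

- It is the process of feeding the extracted and preprocessed text into a machine learning algorithm
- It aims to capture the characteristics of the text in a numeric vector that ML algorithms can work with. (Cf. *construct*, *operational definitions*, and *measurement* in experimental science)
- In short, it is about how to represent text quantitatively in a meaningful way, i.e., text representation.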

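### Feature Engineering for Classical Machine Learning

- Word-based frequency lists
- Bag-of-words representations
- Domain-specific word frequency lists
- Handcrafted features based on domain-specific knowledge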
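
To make the bag-of-words idea concrete, here is a minimal sketch using only the standard library. The two documents are made up, the whitespace tokenizer is a simplifying assumption, and `bag_of_words` is a hypothetical helper name; in practice a library vectorizer would usually do this work.

```python
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Naive whitespace tokenization; a real pipeline would reuse its own tokenizer.
tokenized = [doc.split() for doc in docs]

# The sorted vocabulary fixes what each vector position means.
vocab = sorted({tok for toks in tokenized for tok in toks})

def bag_of_words(tokens, vocab):
    """Return one count per vocabulary word, ignoring word order."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

vectors = [bag_of_words(toks, vocab) for toks in tokenized]

print(vocab)
for vec in vectors:
    print(vec)
```

Both documents become vectors of the same length, so they can be compared or passed to a classifier, at the cost of discarding word order.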

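### Feature Engineering for Deep Learning

- DL models take the text directly as input
- DL models can learn features from the text (e.g., embeddings)
- The price is that the model is often hard to interpret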
    

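## Modeling

### From Simple to Complex

- Start with heuristics or rules
- Experiment with different ML models
    - From heuristics to features
    - From manual annotation to automatic extraction
    - Feature importance/weights
- Find the best model
    - Ensembling and stacking
    - Redo feature engineering
    - Transfer learning
    - Reapply heuristics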
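
The first two steps of this progression might look like the following sketch: a rule-based baseline for spam detection, and the same rule reused as a numeric feature for a later ML model. The trigger phrases and the function names `rule_based_is_spam` and `heuristic_feature` are assumptions chosen for illustration.

```python
import re

# Hypothetical trigger phrases, for illustration only.
SPAM_PATTERN = re.compile(r"free money|click here|winner", re.IGNORECASE)

def rule_based_is_spam(message):
    """Baseline: flag a message if any trigger phrase occurs."""
    return bool(SPAM_PATTERN.search(message))

def heuristic_feature(message):
    """The same heuristic as a feature: number of trigger phrases found."""
    return len(SPAM_PATTERN.findall(message))

messages = [
    "Click here to claim your FREE MONEY",
    "Meeting moved to 3 pm tomorrow",
    "You are our lucky winner!",
]

predictions = [rule_based_is_spam(m) for m in messages]
features = [heuristic_feature(m) for m in messages]

print(predictions)
print(features)
```

A baseline like this gives later models a score to beat, and its output can sit next to other features once an ML model takes over.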

## Evaluation

### Why Evaluate?

- We need to know how good the model we built is
- Factors that bear on the evaluation method
    - Model building
    - Deployment
    - Production
- ML metrics vs. business metrics


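### Intrinsic vs. Extrinsic Evaluation

- Take a spam-classification system as an example
- Intrinsic evaluation:
    - Precision and recall of the spam classification/prediction
- Extrinsic evaluation:
    - The amount of time users spend on spam emails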
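
For the intrinsic side, precision and recall can be computed directly from gold labels and predictions. The sketch below uses made-up labels, with `1` standing for spam; `precision_recall` is a hypothetical helper, and returning `0.0` when a denominator is zero is a convention assumed here.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Compute precision and recall for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Assumed convention: report 0.0 when the denominator is zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up gold labels and predictions (1 = spam, 0 = not spam).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here three spam messages are caught, one is missed, and one legitimate message is wrongly flagged, so both precision and recall come out at 0.75.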
    

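### General Principles

- Do intrinsic evaluation before extrinsic evaluation.
- Extrinsic evaluation is more expensive because it usually involves project stakeholders outside the AI team
- Proceed to extrinsic evaluation only after intrinsic evaluation yields consistently good results
- Poor intrinsic results often imply poor extrinsic results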

### Common Intrinsic Metrics

- Principles for choosing evaluation metrics
- Data type of the labels (ground truths)
    - Binary (e.g., sentiment)
    - Ordinal (e.g., information retrieval)
    - Categorical (e.g., POS tags)
    - Textual (e.g., named entity, machine translation, text generation)
- Automatic vs. human evaluation

## Post-Modeling Phases

- Deploy the model in a production environment (e.g., web service)
- Monitor system performance regularly
- Update the system with new data