In [2]:
import spacy
# Load the small English pipeline
nlp = spacy.load('en_core_web_sm')

# Sample text
text = """
Apple Inc. is planning to open a new store in New York City. 
The company's CEO Tim Cook announced this during the annual meeting.
"""

# Process text
doc = nlp(text)

# 1. Basic tokenization
# stemming      -> strip affixes by rule to get a stem
# lemmatization -> map a word to its dictionary form (lemma)
print("=== Tokenization Results ===")
for token in doc:
    print(f"Token: {token.text}, Lemma: {token.lemma_}")


=== Tokenization Results ===
Token: 
, Lemma: 

Token: Apple, Lemma: Apple
Token: Inc., Lemma: Inc.
Token: is, Lemma: be
Token: planning, Lemma: plan
Token: to, Lemma: to
Token: open, Lemma: open
Token: a, Lemma: a
Token: new, Lemma: new
Token: store, Lemma: store
Token: in, Lemma: in
Token: New, Lemma: New
Token: York, Lemma: York
Token: City, Lemma: City
Token: ., Lemma: .
Token: 
, Lemma: 

Token: The, Lemma: the
Token: company, Lemma: company
Token: 's, Lemma: 's
Token: CEO, Lemma: CEO
Token: Tim, Lemma: Tim
Token: Cook, Lemma: Cook
Token: announced, Lemma: announce
Token: this, Lemma: this
Token: during, Lemma: during
Token: the, Lemma: the
Token: annual, Lemma: annual
Token: meeting, Lemma: meeting
Token: ., Lemma: .
Token: 
, Lemma: 



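Tokenization (word segmentation) splits text into tokens. Tokens are then usually normalized with one of two common methods, stemming or lemmatization, which share the same goal: reducing a word to its base form. Lemmatization is dictionary-based, whereas stemming is rule-based, so lemmatization generally recovers a more accurate base form.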
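To make the rule-based versus dictionary-based distinction concrete, here is a minimal toy sketch. The suffix list `SUFFIXES` and the lookup table `LEMMA_DICT` are hand-written assumptions for illustration only; they are not spaCy data, and real stemmers and lemmatizers are considerably more elaborate.

```python
# Toy comparison of rule-based stemming and dictionary-based lemmatization.
# SUFFIXES and LEMMA_DICT are hypothetical, hand-written examples.

SUFFIXES = ("ing", "ed", "es", "s")

LEMMA_DICT = {
    "is": "be",
    "planning": "plan",
    "announced": "announce",
    "stores": "store",
}


def toy_stem(word: str) -> str:
    """Strip the first matching suffix; no dictionary is consulted."""
    lower = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words stay intact.
        if lower.endswith(suffix) and len(lower) - len(suffix) >= 3:
            return lower[: -len(suffix)]
    return lower


def toy_lemmatize(word: str) -> str:
    """Look the word up in the dictionary; fall back to the lowercased word."""
    lower = word.lower()
    return LEMMA_DICT.get(lower, lower)


for w in ["is", "planning", "announced", "stores"]:
    print(f"{w}: stem={toy_stem(w)}, lemma={toy_lemmatize(w)}")
```

The rule-based stem of "planning" comes out as "plann", which is not a word, while the dictionary lookup returns "plan". The price of the dictionary approach is that it only helps for words the dictionary covers.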

In [3]:
# 2. Part-of-speech tagging
print("\n=== POS Tagging ===")
for token in doc:
    print(f"Token: {token.text}, POS: {token.pos_}")


=== POS Tagging ===
Token: 
, POS: SPACE
Token: Apple, POS: PROPN
Token: Inc., POS: PROPN
Token: is, POS: AUX
Token: planning, POS: VERB
Token: to, POS: PART
Token: open, POS: VERB
Token: a, POS: DET
Token: new, POS: ADJ
Token: store, POS: NOUN
Token: in, POS: ADP
Token: New, POS: PROPN
Token: York, POS: PROPN
Token: City, POS: PROPN
Token: ., POS: PUNCT
Token: 
, POS: SPACE
Token: The, POS: DET
Token: company, POS: NOUN
Token: 's, POS: PART
Token: CEO, POS: PROPN
Token: Tim, POS: PROPN
Token: Cook, POS: PROPN
Token: announced, POS: VERB
Token: this, POS: PRON
Token: during, POS: ADP
Token: the, POS: DET
Token: annual, POS: ADJ
Token: meeting, POS: NOUN
Token: ., POS: PUNCT
Token: 
, POS: SPACE


In [4]:
# 3. Named Entity Recognition
print("\n=== Named Entity Recognition ===")
for ent in doc.ents:
    print(f"Entity: {ent.text}, Label: {ent.label_}")


=== Named Entity Recognition ===
Entity: Apple Inc., Label: ORG
Entity: New York City, Label: GPE
Entity: Tim Cook, Label: PERSON
Entity: annual, Label: DATE


In [5]:
# 4. Dependency Parsing
print("\n=== Dependency Parsing ===")
for token in doc:
    print(f"Token: {token.text}")
    print(f"Dependency: {token.dep_}")
    print(f"Head token: {token.head.text}")
    print("---")


=== Dependency Parsing ===
Token: 

Dependency: dep
Head token: Inc.
---
Token: Apple
Dependency: compound
Head token: Inc.
---
Token: Inc.
Dependency: nsubj
Head token: planning
---
Token: is
Dependency: aux
Head token: planning
---
Token: planning
Dependency: ROOT
Head token: planning
---
Token: to
Dependency: aux
Head token: open
---
Token: open
Dependency: xcomp
Head token: planning
---
Token: a
Dependency: det
Head token: store
---
Token: new
Dependency: amod
Head token: store
---
Token: store
Dependency: dobj
Head token: open
---
Token: in
Dependency: prep
Head token: store
---
Token: New
Dependency: compound
Head token: York
---
Token: York
Dependency: compound
Head token: City
---
Token: City
Dependency: pobj
Head token: in
---
Token: .
Dependency: punct
Head token: planning
---
Token: 

Dependency: dep
Head token: .
---
Token: The
Dependency: det
Head token: company
---
Token: company
Dependency: poss
Head token: CEO
---
Token: 's
Dependency: case
Head token: company
---
[output truncated]

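**Dependency parsing** is a core technique in natural language processing (NLP). It analyzes the **grammatical dependency relations** between the words of a sentence to expose the sentence's syntactic structure and, through it, part of its meaning. The central idea is to build a **tree** (the dependency tree) that describes the **head-dependent relations** between words, such as subject-verb, verb-object, and modifier relations.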

---

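### **Core Concepts**
1. **Dependency relations**  
   - Every word except the root depends on exactly one other word (its head, or parent node), forming a **directed edge**.
   - Relations are usually labeled, e.g. `nsubj` (nominal subject), `obj` (object), `amod` (adjectival modifier). spaCy's English models use `dobj` for direct objects, as in the output above.

2. **Structural properties**  
   - Tree structure: a sentence has exactly one root node (usually the main predicate verb).
   - Acyclic: dependency relations never form a cycle.
   - Projectivity: in most sentences, dependency arcs do not cross when drawn over the linear word order, although non-projective structures do occur.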
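The three structural properties can be checked mechanically. The sketch below is a minimal illustration, assuming a parse is stored as a list `heads` where `heads[i]` is the index of token `i`'s head and `-1` marks the root. This encoding and the function names are assumptions for the example, not spaCy's API. The first example uses the heads printed earlier for "Apple Inc. is planning to open a new store" (whitespace token dropped); the other two are made-up counterexamples.

```python
def has_single_root(heads):
    """True if exactly one token has no head (marked by -1)."""
    return sum(1 for h in heads if h == -1) == 1


def is_acyclic(heads):
    """True if following head links from any token never revisits a token."""
    for start in range(len(heads)):
        seen = set()
        node = start
        while node != -1:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node]
    return True


def is_projective(heads):
    """True if no two arcs cross when drawn above the sentence.

    Only arcs between words are compared; an artificial ROOT arc is not modeled.
    """
    arcs = [(min(i, h), max(i, h)) for i, h in enumerate(heads) if h != -1]
    for a_left, a_right in arcs:
        for b_left, b_right in arcs:
            # Two arcs cross if one starts inside the other and ends outside it.
            if a_left < b_left < a_right < b_right:
                return False
    return True


# Apple Inc. is planning to open a new store
english = [1, 3, 3, -1, 5, 3, 8, 8, 5]
crossing = [2, 3, -1, 2]   # hypothetical parse with crossing arcs
cyclic = [1, 0, -1]        # hypothetical invalid parse: tokens 0 and 1 head each other

for name, heads in [("english", english), ("crossing", crossing), ("cyclic", cyclic)]:
    print(name, has_single_root(heads), is_acyclic(heads), is_projective(heads))
```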

---

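### **Example**
Take the sentence **“他喜欢读书”** ("He likes reading books"), segmented as 他 / 喜欢 / 读 / 书:  
- **Dependency tree**:
  ```
  喜欢 (likes; root)
    ├─ 他 (he; nsubj, subject)
    └─ 读 (read; xcomp, clausal complement)
         └─ 书 (books; obj, object)
  ```
- Written as (head, relation, dependent) triples:  
  `(喜欢, nsubj, 他)`, `(喜欢, xcomp, 读)`, `(读, obj, 书)`.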
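The same two views, triples and an indented tree, can be produced from any parse. The sketch below is an illustration, assuming each token is stored as a `(word, head_index, label)` row with `-1` for the root. That row format and the helper names are assumptions, and the rows are copied by hand from the dependency output printed earlier for "Apple Inc. is planning to open a new store".

```python
# (word, index of head, dependency label); -1 marks the root.
ROWS = [
    ("Apple", 1, "compound"),
    ("Inc.", 3, "nsubj"),
    ("is", 3, "aux"),
    ("planning", -1, "ROOT"),
    ("to", 5, "aux"),
    ("open", 3, "xcomp"),
    ("a", 8, "det"),
    ("new", 8, "amod"),
    ("store", 5, "dobj"),
]


def to_triples(rows):
    """Return (head word, label, dependent word) for every non-root token."""
    return [(rows[head][0], label, word) for word, head, label in rows if head != -1]


def format_tree(rows, node=None, depth=0):
    """Return the tree as indented lines, starting from the root."""
    if node is None:
        node = next(i for i, (_, head, _) in enumerate(rows) if head == -1)
    word, _, label = rows[node]
    lines = ["  " * depth + f"{word} ({label})"]
    for child, (_, head, _) in enumerate(rows):
        if head == node:
            lines.extend(format_tree(rows, child, depth + 1))
    return lines


for triple in to_triples(ROWS):
    print(triple)
print("\n".join(format_tree(ROWS)))
```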

---

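### **Main Applications**
1. **Semantic understanding**  
   Helps a model capture the logical structure of a sentence (e.g. the agent and the patient of an action).
2. **Machine translation**  
   Dependency relations are used to align sentence structures across languages.
3. **Information extraction**  
   Identifies relations between entities (e.g. "Apple [the company] releases the iPhone").
4. **Question answering**  
   Analyzes the key dependency paths in a question to locate the answer.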
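As a rough illustration of the information-extraction use case, subject-verb-object facts can be read off dependency triples by pairing each verb's subject with its object. The triples below are hand-written assumptions (they are not parser output), and `extract_svo` is a hypothetical helper name; a real pipeline would also handle passives, conjunctions, and multi-word entities.

```python
from collections import defaultdict

# Hypothetical (head, label, dependent) triples for two short sentences.
TRIPLES = [
    ("releases", "nsubj", "Apple"),
    ("releases", "obj", "iPhone"),
    ("announced", "nsubj", "Cook"),
    ("announced", "obj", "this"),
    ("Cook", "compound", "Tim"),
]

SUBJECT_LABELS = {"nsubj"}
OBJECT_LABELS = {"obj", "dobj"}  # UD uses obj; spaCy's English models use dobj


def extract_svo(triples):
    """Pair every subject of a head word with every object of the same head."""
    subjects = defaultdict(list)
    objects = defaultdict(list)
    for head, label, dependent in triples:
        if label in SUBJECT_LABELS:
            subjects[head].append(dependent)
        elif label in OBJECT_LABELS:
            objects[head].append(dependent)
    return [
        (subj, verb, obj)
        for verb in subjects
        for subj in subjects[verb]
        for obj in objects.get(verb, [])
    ]


for fact in extract_svo(TRIPLES):
    print(fact)
```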

---

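### **Common Methods**
1. **Rule-based methods**  
   Dependency relations are defined by hand-written linguistic rules (e.g. the early Constraint Grammar).
2. **Statistical machine learning methods**  
   For example, transition-based parsers such as the arc-eager algorithm.
3. **Deep learning models**  
   Neural networks (e.g. BiLSTM, Transformer, BERT) predict the dependency tree directly (e.g. biaffine parsing).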
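To show what a transition-based parser does, here is a minimal sketch of the arc-eager transition system. A real parser uses a trained classifier to choose each transition; here the transition sequence is written by hand, no artificial ROOT node is used, and the function name `parse_arc_eager` is an assumption for the example. The phrase and labels come from the "open a new store" part of the output printed earlier.

```python
def parse_arc_eager(words, transitions):
    """Apply arc-eager transitions; return {dependent index: (head index, label)}."""
    stack, buffer, arcs = [], list(range(len(words))), {}
    for action, label in transitions:
        if action == "SHIFT":
            # Move the next input token onto the stack.
            stack.append(buffer.pop(0))
        elif action == "LEFT-ARC":
            # The buffer front becomes the head of the stack top, which is popped.
            if stack[-1] in arcs:
                raise ValueError("LEFT-ARC: stack top already has a head")
            dependent = stack.pop()
            arcs[dependent] = (buffer[0], label)
        elif action == "RIGHT-ARC":
            # The stack top becomes the head of the buffer front, which is pushed.
            dependent = buffer.pop(0)
            arcs[dependent] = (stack[-1], label)
            stack.append(dependent)
        elif action == "REDUCE":
            # Pop a token that already has its head and needs no more dependents.
            if stack[-1] not in arcs:
                raise ValueError("REDUCE: stack top has no head yet")
            stack.pop()
        else:
            raise ValueError(f"unknown transition: {action}")
    return arcs


WORDS = ["open", "a", "new", "store"]
TRANSITIONS = [
    ("SHIFT", None),        # stack: open
    ("SHIFT", None),        # stack: open a
    ("SHIFT", None),        # stack: open a new
    ("LEFT-ARC", "amod"),   # store -> new
    ("LEFT-ARC", "det"),    # store -> a
    ("RIGHT-ARC", "dobj"),  # open -> store
]

arcs = parse_arc_eager(WORDS, TRANSITIONS)
for dependent, (head, label) in sorted(arcs.items()):
    print(f"{WORDS[head]} --{label}--> {WORDS[dependent]}")
```

The token left without a head at the end ("open" here) is the root of the fragment. REDUCE is implemented for completeness but not needed for this short phrase.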

---

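### **Tools and Resources**
- **Libraries**:  
  - Stanford CoreNLP, spaCy (multilingual dependency parsing), MaltParser, UDPipe.
  - Chinese tools: LTP (Harbin Institute of Technology), THULAC (Tsinghua University; primarily word segmentation and POS tagging).
- **Annotation standard**:  
  Universal Dependencies (UD) provides a unified, cross-lingual annotation scheme.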

---

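### **Why Does It Matter?**
By representing the grammatical relations of a sentence in structured form, dependency parsing supplies key structural features for downstream tasks (such as sentiment analysis and text generation) and is a central step in understanding the logic of natural language.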

In [6]:
# 5. Sentence Segmentation
print("\n=== Sentence Segmentation ===")
for i, sent in enumerate(doc.sents):
    print(f"Sentence {i+1}: {sent.text}")


=== Sentence Segmentation ===
Sentence 1: 
Apple Inc. is planning to open a new store in New York City. 

Sentence 2: The company's CEO Tim Cook announced this during the annual meeting.



In [7]:
# 6. Stopword Filtering
print("\n=== Stopword Filtering ===")
filtered_tokens = [token.text for token in doc if not token.is_stop]
print("Original tokens:", [token.text for token in doc])
print("Filtered tokens:", filtered_tokens)


=== Stopword Filtering ===
Original tokens: ['\n', 'Apple', 'Inc.', 'is', 'planning', 'to', 'open', 'a', 'new', 'store', 'in', 'New', 'York', 'City', '.', '\n', 'The', 'company', "'s", 'CEO', 'Tim', 'Cook', 'announced', 'this', 'during', 'the', 'annual', 'meeting', '.', '\n']
Filtered tokens: ['\n', 'Apple', 'Inc.', 'planning', 'open', 'new', 'store', 'New', 'York', 'City', '.', '\n', 'company', 'CEO', 'Tim', 'Cook', 'announced', 'annual', 'meeting', '.', '\n']


In [8]:
# 7. Noun Phrase Chunking
print("\n=== Noun Phrase Chunking ===")
for chunk in doc.noun_chunks:
    print(f"Noun phrase: {chunk.text}")
    print(f"Root word: {chunk.root.text}")
    print(f"Root head: {chunk.root.head.text}")
    print("---")


=== Noun Phrase Chunking ===
Noun phrase: 
Apple Inc.
Root word: Inc.
Root head: planning
---
Noun phrase: a new store
Root word: store
Root head: open
---
Noun phrase: New York City
Root word: City
Root head: in
---
Noun phrase: The company's CEO Tim Cook
Root word: Cook
Root head: announced
---
Noun phrase: this
Root word: this
Root head: announced
---
Noun phrase: the annual meeting
Root word: meeting
Root head: during
---


In [9]:
# 8. Word Vector Similarity
print("\n=== Word Vector Similarity ===")
if "en_core_web_md" in spacy.util.get_installed_models():  # Check if medium model installed
    nlp_md = spacy.load("en_core_web_md")
    doc_md = nlp_md("apple orange computer")
    for token1 in doc_md:
        for token2 in doc_md:
            if token1 != token2:
                print(f"{token1.text} vs {token2.text}: {token1.similarity(token2):.2f}")
else:
    print("Install medium model: !python -m spacy download en_core_web_md")


=== Word Vector Similarity ===
apple vs orange: 0.59
apple vs computer: 0.07
orange vs apple: 0.59
orange vs computer: 0.11
computer vs apple: 0.07
computer vs orange: 0.11
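By default, spaCy's `similarity` is the cosine similarity between the two word vectors, which is symmetric; that is why every pair above appears twice with the same score. The sketch below computes cosine similarity by hand and uses `itertools.combinations` to visit each pair once. The three-dimensional vectors in `VECTORS` are made-up values for illustration, not the vectors from `en_core_web_md`.

```python
import math
from itertools import combinations

# Hypothetical 3-dimensional word vectors (real models use hundreds of dimensions).
VECTORS = {
    "apple": [0.9, 0.8, 0.1],
    "orange": [0.8, 0.9, 0.0],
    "computer": [0.1, 0.0, 0.9],
}


def cosine(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# combinations() yields each unordered pair once, avoiding the mirrored duplicates.
for w1, w2 in combinations(VECTORS, 2):
    print(f"{w1} vs {w2}: {cosine(VECTORS[w1], VECTORS[w2]):.2f}")
```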


In [10]:
# 9. Dependency Visualization
from spacy import displacy

print("\n=== Dependency Visualization ===")
displacy.render(doc, style="dep", options={'compact': True}, jupyter=True)


=== Dependency Visualization ===
