## Installing NLTK

In [1]:
! pip install nltk

Looking in indexes: http://mirrors.aliyun.com/pypi/simple/


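## Stemming algorithms
- Porter: One of the oldest stemming algorithms, dating from 1980. It focuses on stripping common word endings so that related words reduce to a shared form. It is relatively simple, and the original algorithm is no longer being developed.
- Snowball: Also known as the Porter2 stemming algorithm. It is widely regarded as better than Porter, a view shared by Porter's own author, and it adds many refinements on top of the original. Its output differs from Porter's for roughly 5% of words.
- Lancaster: A more aggressive algorithm that sometimes produces odd-looking stems. With the NLTK implementation it is fairly easy to add your own custom rules.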
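To make the idea of "stripping common endings" concrete, here is a minimal sketch of a rule-based suffix stripper. It is a toy written for illustration only: `SUFFIX_RULES` and `toy_stem` are hypothetical names, and the rules are a tiny hand-picked set, not the Porter, Snowball, or Lancaster algorithm.

```python
# Toy suffix stripper for illustration; NOT one of the real stemming algorithms.
# Each rule is (suffix, replacement); the first matching rule wins.
SUFFIX_RULES = [
    ("sses", "ss"),  # caresses -> caress
    ("ies", "i"),    # flies -> fli
    ("ss", "ss"),    # caress -> caress (protects "ss" from the "s" rule)
    ("s", ""),       # mules -> mule
    ("ing", ""),     # meeting -> meet
    ("ed", ""),      # owned -> own
]


def toy_stem(word):
    """Apply the first suffix rule that matches, or return the word unchanged."""
    for suffix, replacement in SUFFIX_RULES:
        # Keep at least two letters of stem so very short words survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: len(word) - len(suffix)] + replacement
    return word


for w in ["caresses", "flies", "mules", "meeting", "owned", "stating"]:
    print(w, "->", toy_stem(w))
```

Real stemmers layer many more rules and conditions on top of this. For example, the toy version turns `stating` into `stat`, whereas the Porter stemmer restores the final `e` and returns `state`.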

**Porter stemmer exercise**

In [2]:
from nltk.stem.porter import PorterStemmer

In [3]:
words = ['caresses', 'flies', 'dies', 'mules', 'denied',
    'died', 'agreed', 'owned', 'humbled', 'sized',
    'meeting', 'stating', 'siezing', 'itemization',
    'sensational', 'traditional', 'reference', 'colonizer',
    'plotted']

In [4]:
stemmer1 = PorterStemmer()

In [5]:
{word: stemmer1.stem(word) for word in words}

{'caresses': 'caress',
 'flies': 'fli',
 'dies': 'die',
 'mules': 'mule',
 'denied': 'deni',
 'died': 'die',
 'agreed': 'agre',
 'owned': 'own',
 'humbled': 'humbl',
 'sized': 'size',
 'meeting': 'meet',
 'stating': 'state',
 'siezing': 'siez',
 'itemization': 'item',
 'sensational': 'sensat',
 'traditional': 'tradit',
 'reference': 'refer',
 'colonizer': 'colon',
 'plotted': 'plot'}

**Snowball stemmer exercise**

In [6]:
from nltk.stem.snowball import SnowballStemmer

stemmer2 = SnowballStemmer(language='english')

In [7]:
{word: stemmer2.stem(word) for word in words}

{'caresses': 'caress',
 'flies': 'fli',
 'dies': 'die',
 'mules': 'mule',
 'denied': 'deni',
 'died': 'die',
 'agreed': 'agre',
 'owned': 'own',
 'humbled': 'humbl',
 'sized': 'size',
 'meeting': 'meet',
 'stating': 'state',
 'siezing': 'siez',
 'itemization': 'item',
 'sensational': 'sensat',
 'traditional': 'tradit',
 'reference': 'refer',
 'colonizer': 'colon',
 'plotted': 'plot'}

**Lancaster stemmer exercise**

In [8]:
from nltk.stem.lancaster import LancasterStemmer  

In [9]:
stemmer3 = LancasterStemmer()  
stemmer3.stem('maximum')

'maxim'

**WordNet Lemmatizer exercise**

In [10]:
import nltk
from nltk.stem import WordNetLemmatizer
# Requires the WordNet corpus; if it is missing, run nltk.download('wordnet') once.

In [11]:
lemma = WordNetLemmatizer()

In [12]:
lemma.lemmatize('dogs')

'dog'

**Viewing stop words**

In [13]:
from nltk.corpus import stopwords
# Requires the stop word corpus; if it is missing, run nltk.download('stopwords') once.

In [14]:
# Use a separate name so the earlier `words` list for the stemmers is not overwritten.
stop_words = stopwords.words('english')
print(stop_words)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '
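A stop word list is typically used to filter tokens before further processing. The sketch below shows that filtering step in plain Python. To stay self-contained it uses a small hand-picked subset of the list printed above (an assumption for illustration; in practice the set would be built from `stopwords.words('english')`), and `remove_stop_words` is a hypothetical helper name.

```python
# Hand-picked subset of the English stop words printed above (illustrative only).
STOP_WORDS = {"is", "the", "of", "a", "in", "as", "and"}

tokens = ["Citibank", "is", "the", "consumer", "division",
          "of", "financial", "services"]


def remove_stop_words(tokens, stop_words):
    """Drop every token whose lower-cased form is in stop_words."""
    # The NLTK list is lower case, so compare on lower-cased tokens.
    return [t for t in tokens if t.lower() not in stop_words]


filtered = remove_stop_words(tokens, STOP_WORDS)
print(filtered)
# ['Citibank', 'consumer', 'division', 'financial', 'services']
```

Storing the stop words in a `set` keeps each membership check constant time, which matters when filtering long documents.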

## Tokenization

### English

In [15]:
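# 1. nltk.word_tokenize (needs the Punkt models: nltk.download('punkt'), or 'punkt_tab' on recent NLTK releases)
import nltk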

In [31]:
sent = "Citibank is the consumer division of financial services multinational Citigroup. It's founded in 1812 as the City Bank of New York, and later became First National City Bank of New York."

In [32]:
sent

"Citibank is the consumer division of financial services multinational Citigroup. It's founded in 1812 as the City Bank of New York, and later became First National City Bank of New York."

In [33]:
print(nltk.word_tokenize(sent))

['Citibank', 'is', 'the', 'consumer', 'division', 'of', 'financial', 'services', 'multinational', 'Citigroup', '.', 'It', "'s", 'founded', 'in', '1812', 'as', 'the', 'City', 'Bank', 'of', 'New', 'York', ',', 'and', 'later', 'became', 'First', 'National', 'City', 'Bank', 'of', 'New', 'York', '.']


In [34]:
# 2. TreebankWordTokenizer
from nltk.tokenize import TreebankWordTokenizer

In [39]:
sent = "He isn't a boy"

In [36]:
word = TreebankWordTokenizer()

In [37]:
word.tokenize(sent)

['He', 'is', "n't", 'a', 'boy']

In [40]:
nltk.word_tokenize(sent)

['He', 'is', "n't", 'a', 'boy']
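Both results above split `isn't` into `is` and `n't`, following the Penn Treebank convention of detaching contractions. The sketch below imitates just that one behaviour with two regular expressions. It is a rough approximation for illustration, not the rule set NLTK actually uses, and `split_contractions` is a hypothetical name.

```python
import re


def split_contractions(text):
    """Detach common English contractions, then split on whitespace."""
    # Rule 1: separate "n't" from its host word (isn't -> is n't).
    text = re.sub(r"(\w+)(n't)\b", r"\1 \2", text)
    # Rule 2: separate a few other clitics (It's -> It 's, we'll -> we 'll).
    text = re.sub(r"(\w+)('s|'re|'ve|'ll|'d|'m)\b", r"\1 \2", text)
    return text.split()


print(split_contractions("He isn't a boy"))
# ['He', 'is', "n't", 'a', 'boy']
```

Unlike the real tokenizers, this sketch does not separate punctuation such as commas and periods from the neighbouring words.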

In [41]:
# 3. WordPunctTokenizer
from nltk.tokenize import WordPunctTokenizer

In [42]:
word = WordPunctTokenizer()

In [44]:
word.tokenize(sent)

['He', 'isn', "'", 't', 'a', 'boy']
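The split of `isn't` into `isn`, `'`, `t` follows from how `WordPunctTokenizer` is defined: per the NLTK documentation it tokenizes with the pattern `\w+|[^\w\s]+`, that is, runs of word characters or runs of punctuation. A minimal standard-library sketch of the same idea (`wordpunct_like` is a hypothetical name, not an NLTK function):

```python
import re


def wordpunct_like(text):
    """Return runs of word characters and runs of non-space punctuation."""
    return re.findall(r"\w+|[^\w\s]+", text)


print(wordpunct_like("He isn't a boy"))
# ['He', 'isn', "'", 't', 'a', 'boy']
```

Because the apostrophe is neither a word character nor whitespace, it always becomes its own token, which is why contractions are broken into three pieces.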

In [45]:
# 4. WhitespaceTokenizer
from nltk.tokenize import WhitespaceTokenizer

In [46]:
word = WhitespaceTokenizer()

In [47]:
word.tokenize(sent)

['He', "isn't", 'a', 'boy']
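Splitting on whitespace alone is, for ordinary input, what Python's built-in `str.split()` already does, so the result above can be reproduced without NLTK. The sketch below assumes that equivalence (`whitespace_like` is a hypothetical name) and also shows the main drawback: punctuation stays attached to the neighbouring word.

```python
def whitespace_like(text):
    """Split on runs of whitespace (spaces, tabs, newlines)."""
    return text.split()


print(whitespace_like("He isn't a boy"))
# ['He', "isn't", 'a', 'boy']

print(whitespace_like("New York, and later"))
# ['New', 'York,', 'and', 'later']
```

Note that `York,` keeps its comma, whereas `nltk.word_tokenize` in the earlier example emits `York` and `,` as separate tokens.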

### Chinese