In [1]:
%matplotlib inline
%reload_ext autoreload
%autoreload 2
# Display the result of every expression in a cell, not just the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all" 

## Blogs
Great NLP-related blogs:
- [Sebastian Ruder](http://ruder.io/)
- [Jay Alammar](https://jalammar.github.io/)
- [Abigail See](http://www.abigailsee.com/)
- [Joyce Xu](https://medium.com/@joycex99)
- [Stephen Merity](https://smerity.com/articles/articles.html)
- [Rachael Tatman](https://towardsdatascience.com/evaluating-text-output-in-nlp-bleu-at-your-own-risk-e8609665a213)

Other great technical blog posts:
- [Peter Norvig](http://nbviewer.jupyter.org/url/norvig.com/ipython/ProbabilityParadox.ipynb) (more [here](http://norvig.com/ipython/))
- [Julia Evans](https://codewords.recurse.com/issues/five/why-do-neural-networks-think-a-panda-is-a-vulture) (more [here](https://jvns.ca/blog/2014/08/12/what-happens-if-you-write-a-tcp-stack-in-python/))
- [Julia Ferraioli](http://blog.juliaferraioli.com/2016/02/exploring-world-using-vision-twilio.html)
- [Slav Ivanov](https://blog.slavv.com/picking-an-optimizer-for-style-transfer-86e7b8cba84b)
- find [more on twitter](https://twitter.com/math_rachel)

## NLP

NLP covers tasks such as:

- Part-of-speech tagging
- Named entity recognition (names of people, places)
- Question answering
- Speech recognition
- Text-to-speech and speech-to-text
- Topic modeling
- Sentiment classification
- Language modeling
- Translation

In [2]:
# !pip install -U nltk spacy gensim

In [3]:
# !sudo apt-get -y install protobuf-compiler libprotoc-dev

In [4]:
# !pip install pytext-nlp

```bash
$ git clone https://github.com/facebookresearch/fastText.git
$ cd fastText
$ make
```

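## Topic modeling with matrix factorization

The ideal factorization clusters the documents into two groups such that the word distributions of the two groups differ as much as possible, while the documents inside each group are as similar as possible. We call these two groups "topics". The words are likewise clustered into two groups, according to which topic they appear in most frequently.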
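To make the idea concrete, here is a minimal sketch on a made-up corpus. It assumes SVD as the factorization, which is only one possible choice; the toy documents and the names `docs`, `vocab`, `counts`, and `top_words` are invented for this illustration and do not come from the newsgroups data.

```python
import numpy as np

# Toy corpus (invented): three documents about space, two about graphics.
docs = [
    "rocket orbit launch rocket orbit",
    "orbit launch moon rocket",
    "rocket moon orbit",
    "pixel image render pixel",
    "image render shader pixel",
]

# Build the vocabulary and a document-term count matrix
# (rows: documents, columns: words).
vocab = sorted({word for doc in docs for word in doc.split()})
index = {word: j for j, word in enumerate(vocab)}
counts = np.zeros((len(docs), len(vocab)))
for i, doc in enumerate(docs):
    for word in doc.split():
        counts[i, index[word]] += 1

# Thin SVD: counts = U @ diag(s) @ Vt.
# Each row of Vt can be read as a "topic" expressed as weights over words.
U, s, Vt = np.linalg.svd(counts, full_matrices=False)


def top_words(topic_row, k=3):
    """Return the k words with the largest absolute weight in a topic."""
    order = np.argsort(-np.abs(topic_row))[:k]
    return [vocab[j] for j in order]


for t in range(2):
    print(f"topic {t}: {top_words(Vt[t])}")
```

On this toy matrix the first topic is dominated by the space words and the second by the graphics words, because the two groups of documents share no vocabulary. Real corpora overlap far more, so the topics are rarely this clean.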

In [5]:
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn import decomposition
from scipy import linalg
import matplotlib.pyplot as plt
from pathlib import Path

The newsgroups dataset contains about 18,000 newsgroup posts spread across 20 topics.

### Loading the dataset

In [19]:
root = Path('/home/lab/Datasets/scikit_learn_data')
root_nltk = Path('/home/lab/Datasets/nltk_data')

In [7]:
categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
remove = ('headers', 'footers', 'quotes')
newsgroups_train = fetch_20newsgroups(data_home=root, subset='train', categories=categories, remove=remove)
newsgroups_test = fetch_20newsgroups(data_home=root, subset='test', categories=categories, remove=remove)

In [8]:
newsgroups_train.filenames.shape  # file names of the samples
newsgroups_train.target.shape

(2034,)

(2034,)

In [9]:
newsgroups_train.keys()

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])

- Raw content of a sample

In [10]:
print(newsgroups_train.data[0])

Hi,

I've noticed that if you only save a model (with all your mapping planes
positioned carefully) to a .3DS file that when you reload it after restarting
3DS, they are given a default position and orientation.  But if you save
to a .PRJ file their positions/orientation are preserved.  Does anyone
know why this information is not stored in the .3DS file?  Nothing is
explicitly said in the manual about saving texture rules in the .PRJ file. 
I'd like to be able to read the texture rule information, does anyone have 
the format for the .PRJ file?

Is the .CEL file format available from somewhere?

Rych


- Labels and label names

In [11]:
newsgroups_train.target[0]

1

In [12]:
newsgroups_train.target_names[1]

'comp.graphics'

- Number of topics and number of top words shown per topic

In [13]:
num_topics, num_top_words = 6, 8

### Stop words

- Stop words are words that contribute little to the meaning of a text, such as "the", and are usually filtered out. See [Stop words - Wikipedia (Chinese)](https://zh.wikipedia.org/wiki/%E5%81%9C%E7%94%A8%E8%AF%8D).

In [14]:
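# Note: newer scikit-learn releases no longer expose this module publicly;
# there, use `from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS`.
from sklearn.feature_extraction import stop_words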

In [15]:
dir(stop_words)

['ENGLISH_STOP_WORDS',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__']

In [16]:
sorted(stop_words.ENGLISH_STOP_WORDS)[:20]  # ENGLISH_STOP_WORDS is an unordered frozenset, so sort it before slicing

['a',
 'about',
 'above',
 'across',
 'after',
 'afterwards',
 'again',
 'against',
 'all',
 'almost',
 'alone',
 'along',
 'already',
 'also',
 'although',
 'always',
 'am',
 'among',
 'amongst',
 'amoungst']

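- Stop words usually carry little specific meaning of their own.
- There is no single, definitive stop-word list that works for every tool; each library ships its own.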
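As a rough sketch of what filtering looks like in practice, the snippet below removes stop words from a sentence. `TOY_STOP_WORDS` and `remove_stop_words` are hypothetical names, and the word set is hand-picked for this illustration; it is not the scikit-learn list, which is far longer.

```python
# A tiny, hand-picked stop-word set used only for this illustration.
# Real lists (such as scikit-learn's) contain hundreds of entries.
TOY_STOP_WORDS = frozenset({"a", "an", "the", "is", "of", "to", "in", "and", "that"})


def remove_stop_words(text, stop_words=TOY_STOP_WORDS):
    """Lowercase the text, split on whitespace, and drop tokens found in stop_words."""
    return [token for token in text.lower().split() if token not in stop_words]


sentence = "The format of the file is stored in the manual"
print(remove_stop_words(sentence))
# ['format', 'file', 'stored', 'manual']
```

Passing a different set as `stop_words` changes which tokens survive, which is why results can differ between tools that ship different lists.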

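### Stemming and Lemmatization
- Consider organize, organizes, and organizing. Stemming simply chops letters off the end of a word, so the result may not be a real word. Lemmatization is more accurate, e.g. cars -> car reduces a noun to its base form.
- Lemmatization reduces a word to its dictionary form, which still carries the full meaning, and is the more complex method. Stemming extracts the stem or root of a word, which may not carry the full meaning, and is the simpler method.
- Both are language-dependent: different languages need different rules.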
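The contrast can be sketched with two toy functions. These are deliberately simplistic stand-ins, not the Porter algorithm or a WordNet-based lemmatizer: `SUFFIXES`, `LEMMA_TABLE`, `toy_stem`, and `toy_lemmatize` are all invented for this illustration.

```python
# Toy suffix rules invented for this illustration; real stemmers use many more.
SUFFIXES = ("ing", "es", "s")

# Toy lookup table standing in for a real dictionary resource.
LEMMA_TABLE = {
    "feet": "foot",
    "cars": "car",
    "organizes": "organize",
    "organizing": "organize",
}


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in the table; fall back to the word itself."""
    return LEMMA_TABLE.get(word, word)


words = ["cars", "feet", "organizes", "organizing"]
print([toy_stem(w) for w in words])
# ['car', 'feet', 'organiz', 'organiz']
print([toy_lemmatize(w) for w in words])
# ['car', 'foot', 'organize', 'organize']
```

The stemmer produces "organiz", which is not a real word, and it cannot handle the irregular plural "feet". The lookup-based lemmatizer returns real words but only for entries its dictionary knows about.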

In [26]:
import nltk

In [32]:
# nltk.download('wordnet', download_dir=root_nltk)
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     /home/lab/Datasets/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

[nltk_data] Downloading package wordnet to /home/lab/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

### Stemming

In [33]:
from nltk import stem

- Based on the Porter stemming algorithm

In [34]:
wnl = stem.WordNetLemmatizer()
porter = stem.porter.PorterStemmer()

In [35]:
word_list = ['feet', 'foot', 'foots', 'footing']

In [36]:
[wnl.lemmatize(word) for word in word_list]

['foot', 'foot', 'foot', 'footing']

In [37]:
[porter.stem(word) for word in word_list]

['feet', 'foot', 'foot', 'foot']

In [38]:
wl = ['organize', 'organizes', 'organizing']

In [39]:
[wnl.lemmatize(word) for word in wl]

['organize', 'organizes', 'organizing']

In [40]:
[porter.stem(word) for word in wl]

['organ', 'organ', 'organ']