# LDA模型应用：一眼看穿希拉里的邮件

我们拿到希拉里泄露的邮件，跑一把LDA，看看她平时都在聊什么。

首先，导入我们需要的一些库

In [1]:
import numpy as np
import pandas as pd
import re

然后，把希婆的邮件读取进来。

这里我们用pandas。不熟悉pandas的朋友，可以用python标准库csv

In [5]:
df = pd.read_csv("./input/HillaryEmails.csv")

In [6]:
df.head()

Unnamed: 0,Id,DocNumber,MetadataSubject,MetadataTo,MetadataFrom,SenderPersonId,MetadataDateSent,MetadataDateReleased,MetadataPdfLink,MetadataCaseNumber,...,ExtractedTo,ExtractedFrom,ExtractedCc,ExtractedDateSent,ExtractedCaseNumber,ExtractedDocNumber,ExtractedDateReleased,ExtractedReleaseInPartOrFull,ExtractedBodyText,RawText
0,1,C05739545,WOW,H,"Sullivan, Jacob J",87.0,2012-09-12T04:00:00+00:00,2015-05-22T04:00:00+00:00,DOCUMENTS/HRC_Email_1_296/HRCH2/DOC_0C05739545...,F-2015-04841,...,,"Sullivan, Jacob J <Sullivan11@state.gov>",,"Wednesday, September 12, 2012 10:16 AM",F-2015-04841,C05739545,05/13/2015,RELEASE IN FULL,,UNCLASSIFIED\nU.S. Department of State\nCase N...
1,2,C05739546,H: LATEST: HOW SYRIA IS AIDING QADDAFI AND MOR...,H,,,2011-03-03T05:00:00+00:00,2015-05-22T04:00:00+00:00,DOCUMENTS/HRC_Email_1_296/HRCH1/DOC_0C05739546...,F-2015-04841,...,,,,,F-2015-04841,C05739546,05/13/2015,RELEASE IN PART,"B6\nThursday, March 3, 2011 9:45 PM\nH: Latest...",UNCLASSIFIED\nU.S. Department of State\nCase N...
2,3,C05739547,CHRIS STEVENS,;H,"Mills, Cheryl D",32.0,2012-09-12T04:00:00+00:00,2015-05-22T04:00:00+00:00,DOCUMENTS/HRC_Email_1_296/HRCH2/DOC_0C05739547...,F-2015-04841,...,B6,"Mills, Cheryl D <MillsCD@state.gov>","Abedin, Huma","Wednesday, September 12, 2012 11:52 AM",F-2015-04841,C05739547,05/14/2015,RELEASE IN PART,Thx,UNCLASSIFIED\nU.S. Department of State\nCase N...
3,4,C05739550,CAIRO CONDEMNATION - FINAL,H,"Mills, Cheryl D",32.0,2012-09-12T04:00:00+00:00,2015-05-22T04:00:00+00:00,DOCUMENTS/HRC_Email_1_296/HRCH2/DOC_0C05739550...,F-2015-04841,...,,"Mills, Cheryl D <MillsCD@state.gov>","Mitchell, Andrew B","Wednesday, September 12,2012 12:44 PM",F-2015-04841,C05739550,05/13/2015,RELEASE IN PART,,UNCLASSIFIED\nU.S. Department of State\nCase N...
4,5,C05739554,H: LATEST: HOW SYRIA IS AIDING QADDAFI AND MOR...,"Abedin, Huma",H,80.0,2011-03-11T05:00:00+00:00,2015-05-22T04:00:00+00:00,DOCUMENTS/HRC_Email_1_296/HRCH1/DOC_0C05739554...,F-2015-04841,...,,,,,F-2015-04841,C05739554,05/13/2015,RELEASE IN PART,"H <hrod17@clintonemail.com>\nFriday, March 11,...",B6\nUNCLASSIFIED\nU.S. Department of State\nCa...


In [8]:
# 原邮件数据中有很多Nan的值，直接扔了。
df = df[['Id','ExtractedBodyText']].dropna()  # 选择'Id','ExtractedBodyText'，并按行删除含有NAN的

In [9]:
df.head()

Unnamed: 0,Id,ExtractedBodyText
1,2,"B6\nThursday, March 3, 2011 9:45 PM\nH: Latest..."
2,3,Thx
4,5,"H <hrod17@clintonemail.com>\nFriday, March 11,..."
5,6,Pis print.\n-•-...-^\nH < hrod17@clintonernail...
7,8,"H <hrod17@clintonemail.corn>\nFriday, March 11..."


### 文本预处理：

我们这里，针对邮件内容，写一组正则表达式：

In [10]:
def clean_email_text(text):
    text = text.replace('\n'," ") # 新行，我们是不需要的
    text = re.sub(r"-", " ", text) # 把 "-" 的两个单词，分开
    text = re.sub(r"\d+/\d+/\d+", "", text) # 日期，对主体模型没什么意义
    text = re.sub(r"[0-2]?[0-9]:[0-6][0-9]", "", text) # 时间，没意义
    text = re.sub(r"[\w]+@[\.\w]+", "", text) # 邮件地址，没意义
    text = re.sub(r"/[a-zA-Z]*[:\//\]*[A-Za-z0-9\-_]+\.+[A-Za-z0-9\.\/%&=\?\-_]+/i", "", text) # 网址，没意义
    pure_text = ''
    # 以防还有其他特殊字符（数字）等等，我们直接把他们loop一遍，过滤掉
    for letter in text:
        # 只留下字母和空格
        if letter.isalpha() or letter==' ':
            pure_text += letter
    # 再把那些去除特殊字符后落单的单词，直接排除。
    # 我们就只剩下有意义的单词了。
    text = ' '.join(word for word in pure_text.split() if len(word)>1)
    return text

好的，现在我们新建一个colum，并把我们的方法跑一遍：

In [11]:
docs = df['ExtractedBodyText']
docs = docs.apply(lambda s: clean_email_text(s))  

好，来看看长相：

In [15]:
docs.head(1).values

array(['Thursday March PM Latest How Syria is aiding Qaddafi and more Sid hrc memo syria aiding libya docx hrc memo syria aiding libya docx March For Hillary'],
      dtype=object)

我们直接把所有的邮件内容拿出来。

In [19]:
doclist = docs.values

### LDA模型构建：

用Gensim来做一次模型构建

首先，我们得把我们刚刚整出来的一大波文本数据
```
[[一条邮件字符串]，[另一条邮件字符串], ...]
```

转化成Gensim认可的语料库形式：

```
[[一，条，邮件，在，这里],[第，二，条，邮件，在，这里],[今天，天气，肿么，样],...]
```

引入库：

In [20]:
from gensim import corpora, models, similarities
import gensim

手写**停止词列表**：

这些词在不同语境中指代意义完全不同，但是在不同主题中的出现概率是几乎一致的。所以要去除，否则对模型的准确性有影响

In [21]:
stoplist = ['very', 'ourselves', 'am', 'doesn', 'through', 'me', 'against', 'up', 'just', 'her', 'ours', 
            'couldn', 'because', 'is', 'isn', 'it', 'only', 'in', 'such', 'too', 'mustn', 'under', 'their', 
            'if', 'to', 'my', 'himself', 'after', 'why', 'while', 'can', 'each', 'itself', 'his', 'all', 'once', 
            'herself', 'more', 'our', 'they', 'hasn', 'on', 'ma', 'them', 'its', 'where', 'did', 'll', 'you', 
            'didn', 'nor', 'as', 'now', 'before', 'those', 'yours', 'from', 'who', 'was', 'm', 'been', 'will', 
            'into', 'same', 'how', 'some', 'of', 'out', 'with', 's', 'being', 't', 'mightn', 'she', 'again', 'be', 
            'by', 'shan', 'have', 'yourselves', 'needn', 'and', 'are', 'o', 'these', 'further', 'most', 'yourself', 
            'having', 'aren', 'here', 'he', 'were', 'but', 'this', 'myself', 'own', 'we', 'so', 'i', 'does', 'both', 
            'when', 'between', 'd', 'had', 'the', 'y', 'has', 'down', 'off', 'than', 'haven', 'whom', 'wouldn', 
            'should', 've', 'over', 'themselves', 'few', 'then', 'hadn', 'what', 'until', 'won', 'no', 'about', 
            'any', 'that', 'for', 'shouldn', 'don', 'do', 'there', 'doing', 'an', 'or', 'ain', 'hers', 'wasn', 
            'weren', 'above', 'a', 'at', 'your', 'theirs', 'below', 'other', 'not', 're', 'him', 'during', 'which']

人工分词：

这里，英文的分词，直接就是对着空白处分割就可以了。

中文的分词稍微复杂点儿，具体可以百度：jieba，等等

分词的意义在于，把我们的长长的字符串原文本，转化成有意义的小元素：

In [22]:
texts = [[word for word in doc.lower().split() if word not in stoplist] for doc in doclist]
# 遍历每个文档，将其转为小写并按空格分词，遍历分词后的词，如果该词不在stoplist中，保留下来

这时候，我们的texts就是我们需要的样子了：

In [24]:
texts[0]  # 第一封邮件内容分词后的结果

['thursday',
 'march',
 'pm',
 'latest',
 'syria',
 'aiding',
 'qaddafi',
 'sid',
 'hrc',
 'memo',
 'syria',
 'aiding',
 'libya',
 'docx',
 'hrc',
 'memo',
 'syria',
 'aiding',
 'libya',
 'docx',
 'march',
 'hillary']

### 建立语料库

用词袋的方法，把每个单词用一个数字index指代，并把我们的原文本变成一条长长的数组：

In [40]:
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
# doc2bow:将文档转换为词袋（BoW）格式,输入数据为str列表

In [120]:
for i in range(10):
    print(dictionary[i])

aiding
docx
hillary
hrc
latest
libya
march
memo
pm
qaddafi


给你们看一眼：

In [29]:
corpus[13]

[(51, 1), (505, 1), (506, 1), (507, 1), (508, 1)]

这个列表告诉我们，第14（从0开始是第一）个邮件中，一共6个有意义的单词（经过我们的文本预处理，并去除了停止词后）

其中，36号单词出现1次，505号单词出现1次，以此类推。。。

接着，我们终于可以建立模型了：

In [44]:
lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=20)
"""
corpus:文档向量，也就是训练数据
id2word:从单词ID到单词的映射
num_topics: 要从训练语料库中提取的请求潜在主题的数量
"""

'\ncorpus:文档向量，也就是训练数据\nid2word:从单词ID到单词的映射\nnum_topics: 要从训练语料库中提取的请求潜在主题的数量\n'

我们可以看到，第10号分类，其中最常出现的单词是：

In [129]:
# 将单个主题作为格式化字符串
# 返回：主题的字符串表示，如'-0.340 *“类别”+ 0.298 *“$ M $”+ 0.183 *“代数”+ ...“。
# topicno：主题ID，这里是10
# topn: 将使用的主题中的单词数
lda.print_topic(10, topn=5)

'0.013*"state" + 0.011*"us" + 0.008*"department" + 0.007*"woodward" + 0.007*"know"'

我们把所有的主题打印出来看看

In [47]:
lda.print_topics(num_topics=20, num_words=5)

[(0,
  '0.019*"get" + 0.015*"see" + 0.012*"tomorrow" + 0.011*"sure" + 0.010*"good"'),
 (1,
  '0.023*"pm" + 0.013*"press" + 0.012*"clips" + 0.011*"dialogue" + 0.011*"boehner"'),
 (2,
  '0.022*"party" + 0.020*"labour" + 0.007*"david" + 0.006*"tax" + 0.005*"chamber"'),
 (3, '0.028*"call" + 0.014*"add" + 0.013*"pm" + 0.013*"yes" + 0.011*"pls"'),
 (4,
  '0.019*"call" + 0.011*"sbwhoeop" + 0.009*"negotiating" + 0.007*"vote" + 0.006*"saturday"'),
 (5,
  '0.025*"pm" + 0.015*"sullivan" + 0.011*"talk" + 0.011*"monday" + 0.009*"sunday"'),
 (6,
  '0.019*"israeli" + 0.012*"percent" + 0.012*"palestinian" + 0.008*"israel" + 0.008*"one"'),
 (7,
  '0.010*"said" + 0.007*"lona" + 0.007*"work" + 0.006*"would" + 0.006*"secretary"'),
 (8,
  '0.014*"cheryl" + 0.013*"see" + 0.013*"bloomberg" + 0.011*"mills" + 0.006*"fyi"'),
 (9,
  '0.006*"get" + 0.006*"today" + 0.006*"one" + 0.006*"well" + 0.006*"also"'),
 (10,
  '0.013*"state" + 0.011*"us" + 0.008*"department" + 0.007*"woodward" + 0.007*"know"'),
 (11,
  '0.0

### 接下来：

通过
```
get_document_topics（bow，minimum_probability = None，minimum_phi_value = None，per_word_topics = False ）

获取给定文档的主题分发

参数：
    bow:BOW格式的文档
    minimum_probability:将丢弃分配概率低于此阈值的主题
    minimum_phi_value:如果per_word_topics为True，则表示包含的术语概率的下限。如果设置为None，则使用值1e-8来防止0
    per_word_topics:如果为True，此函数还将返回两个额外的列表，如“返回”部分中所述
返回：
    list of（int，float）整个文档的主题分布。列表中的每个元素都是一对主题的id，以及分配给它的概率
    列表（int，列表（int，float），可选 - 每个单词最可能的主题。列表中的每个元素都是一对单词的id，以及按照与该单词的相关性排序的主题列表。仅在返回时返回per_word_topics设置为True。
    列表（int，float of float），可选 - Phi相关性值，乘以特征长度，用于每个单词 - 主题组合。列表中的每个元素都是一对单词的id和该单词与每个主题之间的phi值列表。仅在per_word_topics设置为True 时才返回。
```
或者
```
get_term_topics（word_id，minimum_probability = None)

获取与给定单词最相关的主题

参数：
    word_id（int）：将为其计算主题分布的单词。
    minimum_probability（float ，optional）：将丢弃分配概率低于此阈值的主题。

返回：	
    相关主题表示为其ID和分配概率的对，按与给定单词的相关性排序。
```

两个方法，我们可以把新鲜的文本/单词，分类成20个主题中的一个。

*但是注意，我们这里的文本和单词，都必须得经过同样步骤的文本预处理+词袋化，也就是说，变成数字表示每个单词的形式。*

### 作业：

我这里有希拉里twitter上的几条(每一空行是单独的一条)：

```
To all the little girls watching...never doubt that you are valuable and powerful & deserving of every chance & opportunity in the world.

I was greeted by this heartwarming display on the corner of my street today. Thank you to all of you who did this. Happy Thanksgiving. -H

Hoping everyone has a safe & Happy Thanksgiving today, & quality time with family & friends. -H

Scripture tells us: Let us not grow weary in doing good, for in due season, we shall reap, if we do not lose heart.

Let us have faith in each other. Let us not grow weary. Let us not lose heart. For there are more seasons to come and...more work to do

We have still have not shattered that highest and hardest glass ceiling. But some day, someone will

To Barack and Michelle Obama, our country owes you an enormous debt of gratitude. We thank you for your graceful, determined leadership

Our constitutional democracy demands our participation, not just every four years, but all the time

You represent the best of America, and being your candidate has been one of the greatest honors of my life

Last night I congratulated Donald Trump and offered to work with him on behalf of our country

Already voted? That's great! Now help Hillary win by signing up to make calls now

It's Election Day! Millions of Americans have cast their votes for Hillary—join them and confirm where you vote

We don’t want to shrink the vision of this country. We want to keep expanding it

We have a chance to elect a 45th president who will build on our progress, who will finish the job

I love our country, and I believe in our people, and I will never, ever quit on you. No matter what

```

使用训练好的LDA模型，判断每句话各自属于哪个potic

In [144]:
test_texts = []
with open("./data/test.txt") as fp:
    line = fp.readline()
    while line:
        test_texts.append(line[:-1])
        line = fp.readline()
fp.close()

In [145]:
testlist = []
for text in test_texts:
    if text != '':
        testlist.append(text)

In [146]:
print(testlist[:-1])

['To all the little girls watching...never doubt that you are valuable and powerful & deserving of every chance & opportunity in the world.', 'I was greeted by this heartwarming display on the corner of my street today. Thank you to all of you who did this. Happy Thanksgiving. -H', 'Hoping everyone has a safe & Happy Thanksgiving today, & quality time with family & friends. -H', 'Scripture tells us: Let us not grow weary in doing good, for in due season, we shall reap, if we do not lose heart.', 'Let us have faith in each other. Let us not grow weary. Let us not lose heart. For there are more seasons to come and...more work to do', 'We have still have not shattered that highest and hardest glass ceiling. But some day, someone will', 'To Barack and Michelle Obama, our country owes you an enormous debt of gratitude. We thank you for your graceful, determined leadership', 'Our constitutional democracy demands our participation, not just every four years, but all the time', 'You represent 

In [147]:
cleanlist = []
for text in testlist:
    cleanlist.append(clean_email_text(text))

In [148]:
print(cleanlist[0:-1])

['To all the little girls watchingnever doubt that you are valuable and powerful deserving of every chance opportunity in the world', 'was greeted by this heartwarming display on the corner of my street today Thank you to all of you who did this Happy Thanksgiving', 'Hoping everyone has safe Happy Thanksgiving today quality time with family friends', 'Scripture tells us Let us not grow weary in doing good for in due season we shall reap if we do not lose heart', 'Let us have faith in each other Let us not grow weary Let us not lose heart For there are more seasons to come andmore work to do', 'We have still have not shattered that highest and hardest glass ceiling But some day someone will', 'To Barack and Michelle Obama our country owes you an enormous debt of gratitude We thank you for your graceful determined leadership', 'Our constitutional democracy demands our participation not just every four years but all the time', 'You represent the best of America and being your candidate ha

In [149]:
# 分词并去停用词
texts1 = [[word for word in doc.lower().split() if word not in stoplist] for doc in cleanlist]

In [150]:
texts1[14]

['love',
 'country',
 'believe',
 'people',
 'never',
 'ever',
 'quit',
 'matter',
 'wha']

In [151]:
# doc2bow:将文档转换为词袋（BoW）格式
corpus1 = [dictionary.doc2bow(text) for text in texts1]

In [152]:
len(corpus1)

15

In [153]:
corpus1[0]

[(139, 1),
 (202, 1),
 (321, 1),
 (444, 1),
 (1961, 1),
 (2348, 1),
 (3760, 1),
 (3849, 1),
 (4374, 1),
 (23009, 1)]

In [154]:
resultlist = lda.get_document_topics(corpus1)
for result in resultlist:
    print(result)

[(11, 0.9049675)]
[(1, 0.23220752), (4, 0.2203621), (9, 0.26207253), (19, 0.18514422)]
[(8, 0.41687113), (9, 0.37908316), (19, 0.12670003)]
[(4, 0.15536655), (8, 0.12320458), (12, 0.65154374)]
[(10, 0.38240257), (12, 0.54833883)]
[(11, 0.63090694), (12, 0.24030071)]
[(9, 0.099997744), (12, 0.7332237), (13, 0.08171323)]
[(12, 0.8944042)]
[(12, 0.8812165)]
[(12, 0.88119155)]
[(2, 0.20206088), (9, 0.70791954)]
[(4, 0.21389806), (7, 0.574322), (16, 0.1173015)]
[(11, 0.88120323)]
[(8, 0.12825772), (9, 0.32152146), (11, 0.45572245)]
[(11, 0.8944002)]
