# Glossary Generator

即兴想起来就写了，主要目的是配合英语学习的透析阅读法使用，什么叫透析阅读法请自行搜索。

程序比较简单，简单地说，就是：
1. 读取一本小说的文本，干掉复数、时态这些东西，得到一本小说的词汇表；
2. 去掉认识和不常用的词，生成不认识且高频的单词表。

生词表生成后可导入欧陆词典一类的app，快速预习一下，可以大幅提升阅读原版书籍时的体验。

In [12]:
import os, re
import nltk, string
import collections
import textract

from nltk.stem.porter import PorterStemmer
porter_stemmer = PorterStemmer()

## 1. 读取文档
目前支持txt/pdf/doc/docx/csv/epub格式，扫描版的PDF不行，只能是文字版的。

除了txt外，其他格式读取时花的时间会长一点，请耐心等待。

In [2]:
# filename = 'pride_and_prejudice.txt' 
filename = os.path.join('input-book','The One Minute Manager_Kenneth H. Blanchard_1983.epub' )

读取文本，去掉标点符号。

In [3]:
byte = textract.process(filename)
text = byte.decode("utf-8").replace('\n', ' ')
text = text.replace('—', '').replace('.', '')
text.translate(str.maketrans('', '', string.punctuation))



'Cover The One Minute Manager Kenneth Blanchard  Spenser Johnson – THE ONE MINUTE MANAGER Read a Story That Will Change Your Life The  One  Minute  Manager  is  an  easily  read  story  which  quickly  shows  you  three very  practical  management  techniques  As  the  story  unfolds  you  will  discover  several studies in medicine and the behavioral sciences which help you to understand why these apparently  simple  methods  work  so  well  with  so  many  people  By  the  book’s  end  you will also know how to apply them to your own situation The book is brief the language is simple and best of all  it works That’s  why  The  One  Minute  Manager  has  become  America’s  national  sensation featured  in  People  magazine  and  on  The  Today  Show The  Merv  Griffin  Show  and other network television programs 1 Kenneth Blanchard  Spenser Johnson – THE ONE MINUTE MANAGER Books by Kenneth H Blanchard PhD MANAGEMENT      OF      ORGANIZATIONAL      BEHAVIOR      UTILIZING      HUMAN  

## 2. 切割和统计词干
获取单词
抽取词干或词根形式且转成小写
获取排名在1000以后且频率在3以上的词干


In [4]:
nltk_tokens = nltk.word_tokenize(text) # 切割成单词
stems = [porter_stemmer.stem(w) for w in nltk_tokens] # 抽取词的词干或词根形式且转成小写

In [5]:
stem_count = collections.Counter(stems)
stem_count = sorted(stem_count.items(), key=lambda pair: pair[1], reverse=True)

In [6]:
clean_stem = [pair[0] for pair in stem_count if pair[1]>=3 and pair[1] <=10]
len(clean_stem)

470

## 3. 去除熟悉单词

- 柯林斯词典词频: Collins的词库里只有14000-15000单词，但我测试的效果似乎比COCA好一些。大家可以自己更改文件名，看看自己哪一个更合适。
- 自已建立熟词库:

In [7]:
known_filename = "my_vocabulary.txt" 
unknown_filename = 'common30k.txt'

known_word = open(known_filename, 'r', encoding='utf-8').read().replace('\n', ' ').split(' ')
unknown_word = open(unknown_filename, 'r', encoding='utf-8').read().replace('\n', ' ').split(' ')

将词干和熟词库比较，去除熟词。将词干与词库比较，获取单词原形。

In [8]:
# 删除认识的词干
known_stem = [s for s in clean_stem for w in known_word if s in w]
clean_stem = set(clean_stem).difference(set(known_stem))
len(clean_stem)

In [27]:
new_word = list()
for s in clean_stem:
    for w in unknown_word:
        if re.match(s,w):
            new_word.append(w)
            break
new_word = set(new_word)
len(new_word)

26

In [28]:
glossary = os.path.join('output-txt',os.path.basename(filename).split('.')[0] + '_glossary.txt')
with open(glossary, 'w') as output:
    output.write('\n'.join(new_word))

## 4. 将已经背熟的单词加入熟词库

In [None]:
known_word = open(known_filename, 'r', encoding='utf-8').read().replace('\n', ' ').split(' ')
new_filename = os.path.join('','')
new_word = open(new_filename, 'r', encoding='utf-8').read().replace('\n', ' ').split(' ')
known_word = known_word.extend(new_word)
known_word = sorted(set(known_word))
with open(known_filename, 'w') as output:
    output.write('\n'.join(known_word))