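Some applications require a **deep semantic understanding of text**, for example:
- Question answering systems
- Context recognition
- Speech recognition

Text semantics is the study of the meaning of text or language.

Sentiment analysis is perhaps the most popular application of text analytics. There are plenty of tutorials, websites, and applications devoted to sentiment analysis of different text sources, ranging from corporate surveys to movie reviews.

The key to sentiment analysis is analyzing a body of text to understand the opinion it expresses, along with other factors such as mood and modality.

Sentiment analysis usually works better on subjective content than on **objective content**.

For sentiment analysis, we will look at how to **analyze sentiment with supervised machine learning techniques**, and also **use several lexicon-based unsupervised techniques to dig deeper into the sentiment, mood, and modality of natural language**.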

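## Semantic Analysis

Chapter 3 introduced the different structural components of natural language, including part-of-speech (POS) tagging, chunking, and grammar. Those concepts all belong to the syntactic and structural analysis of text data.

However, the **grammatical relations** studied there between words, phrases, and clauses are **based purely on their position, syntax, and structure**, not on their meaning.

Under semantic analysis we will cover the following topics:
- Exploring WordNet and synsets
- Analyzing lexical semantic relations
- Word sense disambiguation
- Named entity recognition
- Analyzing semantic representations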

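## Exploring WordNet
WordNet is a lexical database made up of nouns, adjectives, verbs, and adverbs. It groups related words that share the same concept into sets known as cognitive synonym sets, or synsets.

### Understanding Synsets
Synsets are among the most important concepts and structures that tie everything in WordNet together, so we start exploring WordNet by looking at them.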

In [1]:
from nltk.corpus import wordnet as wn
import pandas as pd

term = 'fruit'
synsets = wn.synsets(term)
#display total synsets
print('Total Synsets:', len(synsets))

('Total Synsets:', 5)


In [2]:
for synset in synsets:
    print('Synset:', synset)
    print('Part of speech:', synset.lexname()) # lexicographer file name, e.g. noun.plant
    print('Definition:', synset.definition())
    print('Lemmas:', synset.lemma_names())
    print('Examples:', synset.examples()) # example sentences
    print

('Synset:', Synset('fruit.n.01'))
('Part of speech:', u'noun.plant')
('Definition:', u'the ripened reproductive body of a seed plant')
('Lemmas:', [u'fruit'])
('Examples:', [])

('Synset:', Synset('yield.n.03'))
('Part of speech:', u'noun.artifact')
('Definition:', u'an amount of a product')
('Lemmas:', [u'yield', u'fruit'])
('Examples:', [])

('Synset:', Synset('fruit.n.03'))
('Part of speech:', u'noun.event')
('Definition:', u'the consequence of some effort or action')
('Lemmas:', [u'fruit'])
('Examples:', [u'he lived long enough to see the fruit of his policies'])

('Synset:', Synset('fruit.v.01'))
('Part of speech:', u'verb.creation')
('Definition:', u'cause to bear fruit')
('Lemmas:', [u'fruit'])
('Examples:', [])

('Synset:', Synset('fruit.v.02'))
('Part of speech:', u'verb.creation')
('Definition:', u'bear fruit')
('Lemmas:', [u'fruit'])
('Examples:', [u'the trees fruited early this year'])



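### Analyzing Lexical Semantic Relations
Text semantics refers to the study of meaning and context.

- Entailment
    Entailment usually refers to an event or action that logically involves, or is associated with, some other action or event that has happened or will happen.
    Ideally, this applies to verbs that denote specific actions.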

In [3]:
#entailments
for action in ['walk','eat','digest']:
    action_syn = wn.synsets(action, pos='v')[0]
    print action_syn, '--entails-->', action_syn.entailments()

Synset('walk.v.01') --entails--> [Synset('step.v.01')]
Synset('eat.v.01') --entails--> [Synset('chew.v.01'), Synset('swallow.v.01')]
Synset('digest.v.01') --entails--> [Synset('consume.v.02')]


- Homonyms and homographs

In [4]:
for synset in wn.synsets('bank'):
    print synset.name(),'-',synset.definition()

bank.n.01 - sloping land (especially the slope beside a body of water)
depository_financial_institution.n.01 - a financial institution that accepts deposits and channels the money into lending activities
bank.n.03 - a long ridge or pile
bank.n.04 - an arrangement of similar objects in a row or in tiers
bank.n.05 - a supply or stock held in reserve for future use (especially in emergencies)
bank.n.06 - the funds held by a gambling house or the dealer in some gambling games
bank.n.07 - a slope in the turn of a road or track; the outside is higher than the inside in order to reduce the effects of centrifugal force
savings_bank.n.02 - a container (usually with a slot in the top) for keeping money at home
bank.n.09 - a building in which the business of banking transacted
bank.n.10 - a flight maneuver; aircraft tips laterally about its longitudinal axis (especially in turning)
bank.v.01 - tip laterally
bank.v.02 - enclose with a bank
bank.v.03 - do business with a bank or keep an account at 

- Synonyms and antonyms

In [5]:
term = 'large'
synsets = wn.synsets(term)
adj_large = synsets[1]
adj_large = adj_large.lemmas()[0]
adj_large_synonym = adj_large.synset()
adj_large_antonym = adj_large.antonyms()[0].synset()
#print synonym and antonym
print 'Synonym:', adj_large_synonym.name()
print 'Definition:', adj_large_synonym.definition()
print 'Antonym:', adj_large_antonym.name()
print 'Definition:', adj_large_antonym.definition()

Synonym: large.a.01
Definition: above average in size or number or quantity or magnitude or extent
Antonym: small.a.01
Definition: limited or below average in number or quantity or magnitude or extent


In [6]:
term = 'rich'
synsets = wn.synsets(term)[:3]

#print synonym and antonym for different synsets
for synset in synsets:
    rich = synset.lemmas()[0]
    rich_synonym = rich.synset()
    rich_antonym = rich.antonyms()[0].synset()
    print('Synonym:', rich_synonym.name())
    print('Definition:', rich_synonym.definition())
    print('Antonym:', rich_antonym.name())
    print('Definition:', rich_antonym.definition())
    print('-'*20)

('Synonym:', u'rich_people.n.01')
('Definition:', u'people who have possessions and wealth (considered as a group)')
('Antonym:', u'poor_people.n.01')
('Definition:', u'people without possessions or wealth (considered as a group)')
--------------------
('Synonym:', u'rich.a.01')
('Definition:', u'possessing material wealth')
('Antonym:', u'poor.a.02')
('Definition:', u'having little money or few possessions')
--------------------
('Synonym:', u'rich.a.02')
('Definition:', u'having an abundant supply of desirable qualities or substances (especially natural resources)')
('Antonym:', u'poor.a.04')
('Definition:', u'lacking in specific resources, qualities or substances')
--------------------


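- Hypernyms and hyponyms
Synsets represent words with distinct semantics and concepts, and they are linked or related to one another based on similarity and context.

They are connected in a hierarchy that expresses an is-a relationship.

Hyponyms and hypernyms help us explore related concepts by navigating that hierarchy.

A hyponym refers to a concept or entity that is a subclass of a higher-level concept or entity. Compared with its superclass, a hyponym has a more specific meaning and context.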

In [7]:
term = 'tree'
synsets = wn.synsets(term)
tree = synsets[0]

#print the entity and its meaning
print 'Name:', tree.name()
print 'Definition:', tree.definition()

Name: tree.n.01
Definition: a tall perennial woody plant having a main trunk and branches forming a distinct elevated crown; includes both gymnosperms and angiosperms


In [8]:
#print total hyponyms and some sample hyponyms for 'tree'
#hyponyms are more specific than their superclass (hypernym)
hyponyms = tree.hyponyms()
print('Total Hyponyms:', len(hyponyms))  #number of hyponyms of tree; each one is a specific kind of tree
print('Sample Hyponyms')
print('-'*20)
for hyponym in hyponyms[:10]:
    print(hyponym.name(),'-',hyponym.definition())

('Total Hyponyms:', 180)
Sample Hyponyms
--------------------
(u'aalii.n.01', '-', u'a small Hawaiian tree with hard dark wood')
(u'acacia.n.01', '-', u'any of various spiny trees or shrubs of the genus Acacia')
(u'african_walnut.n.01', '-', u'tropical African timber tree with wood that resembles mahogany')
(u'albizzia.n.01', '-', u'any of numerous trees of the genus Albizia')
(u'alder.n.02', '-', u'north temperate shrubs or trees having toothed leaves and conelike fruit; bark is used in tanning and dyeing and the wood is rot-resistant')
(u'angelim.n.01', '-', u'any of several tropical American trees of the genus Andira')
(u'angiospermous_tree.n.01', '-', u'any tree having seeds and ovules contained in the ovary')
(u'anise_tree.n.01', '-', u'any of several evergreen shrubs and small trees of the genus Illicium')
(u'arbor.n.01', '-', u'tree (as opposed to shrub)')
(u'aroeira_blanca.n.01', '-', u'small resinous tree or shrub of Brazil')


In [9]:
#show the direct hypernym (superclass) of tree
hypernyms = tree.hypernyms()
print(hypernyms)

[Synset('woody_plant.n.01')]


In [10]:
#get total hierarchy pathways for 'tree'
hypernym_paths = tree.hypernym_paths()
print('Total Hypernym paths:', len(hypernym_paths))

('Total Hypernym paths:', 1)


In [11]:
#print the entire hypernym hierarchy
print('Hypernym Hierarchy')
print('->'.join(synset.name() for synset in hypernym_paths[0]))

Hypernym Hierarchy
entity.n.01->physical_entity.n.01->object.n.01->whole.n.02->living_thing.n.01->organism.n.01->plant.n.02->vascular_plant.n.01->woody_plant.n.01->tree.n.01


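The output above shows the complete hypernym hierarchy, with the hypernym or superclass at each level. Navigating down the hierarchy yields more specific concepts/entities; navigating in the opposite direction yields more general ones.

- Holonyms and meronyms

    Holonyms are entities that contain the specific entity we are interested in. Basically, **holonymy is the relation between a word or entity denoting a whole and a term denoting a specific part of that whole**.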

In [12]:
member_holonyms = tree.member_holonyms()
print('Total Member Holonyms:', len(member_holonyms))
print('Member Holonyms for [tree]:-')

for holonym in member_holonyms:
    print holonym.name(),'-',holonym.definition()

('Total Member Holonyms:', 1)
Member Holonyms for [tree]:-
forest.n.01 - the trees and other plants in a large densely wooded area


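The output shows that 'forest' is a holonym of tree, which is semantically correct because a forest is made up of trees.

Meronymy is the semantic relation in which a word or entity is a constituent or part of another.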

In [13]:
#part based meronyms for tree
part_meronyms = tree.part_meronyms()
print('Total Part Meronyms:', len(part_meronyms))
print('Part Meronyms for [tree]:-')
print('-'*40)
for meronym in part_meronyms:
    print(meronym.name(),'-',meronym.definition())

('Total Part Meronyms:', 5)
Part Meronyms for [tree]:-
----------------------------------------
(u'burl.n.02', '-', u'a large rounded outgrowth on the trunk or branch of a tree')
(u'crown.n.07', '-', u'the upper branches and leaves of a tree or other plant')
(u'limb.n.02', '-', u'any of the main branches arising from the trunk or a bough of a tree')
(u'stump.n.01', '-', u'the base part of a tree that remains standing after the tree has been felled')
(u'trunk.n.01', '-', u'the main stem of a tree; usually covered with bark; the bole is usually the part that is commercially useful for lumber')


The output above shows the part meronyms of tree, that is, the different components of a tree such as the trunk and the stump.

In [14]:
#substance based meronyms for tree
substance_meronyms = tree.substance_meronyms()
print('Total Substance Meronyms:', len(substance_meronyms))
print('Substance Meronyms for [tree]:-')
print('-'*40)

for meronym in substance_meronyms:
    print(meronym.name(),'-',meronym.definition())

('Total Substance Meronyms:', 2)
Substance Meronyms for [tree]:-
----------------------------------------
(u'heartwood.n.01', '-', u'the older inactive central wood of a tree or woody plant; usually darker and denser than the surrounding sapwood')
(u'sapwood.n.01', '-', u'newly formed outer wood lying between the cambium and the heartwood of a tree or woody plant; usually light colored; active in water conduction')


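The output above shows the substance meronyms of tree, that is, substances derived from a tree such as heartwood and sapwood.

- Semantic relations and similarity

    In the previous sections we looked at various concepts related to lexical semantic relations.

    We now look at ways to connect similar entities based on their semantic relations, and at how to measure their similarity. (These measures differ from the ones covered in Chapter 6.)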

In [15]:
tree = wn.synset('tree.n.01')
lion = wn.synset('lion.n.01')
tiger = wn.synset('tiger.n.02')
cat = wn.synset('cat.n.01')
dog = wn.synset('dog.n.01')

#create entities and extract names and definitions
entities = [tree, lion, tiger, cat, dog]
entity_names = [entity.name().split('.')[0] for entity in entities]
entity_definitions = [entity.definition() for entity in entities]

#print entities and their definitions
for entity, definition in zip(entity_names, entity_definitions):
    print(entity, '-', definition)

(u'tree', '-', u'a tall perennial woody plant having a main trunk and branches forming a distinct elevated crown; includes both gymnosperms and angiosperms')
(u'lion', '-', u'large gregarious predatory feline of Africa and India having a tawny coat with a shaggy mane in the male')
(u'tiger', '-', u'large feline of forests in most of Asia having a tawny coat with black stripes; endangered')
(u'cat', '-', u'feline mammal usually having thick soft fur and no ability to roar: domestic cats; wildcats')
(u'dog', '-', u'a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds')


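Next, we relate these entities through their common hypernyms. For each pair of entities, we try to find the lowest-level hypernym they share in the relationship hierarchy.

Related entities should share a very specific hypernym, whereas unrelated entities share only a very abstract or generic one.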

In [16]:
common_hypernyms = []
for entity in entities:
    #get pairwise lowest common hypernyms
    common_hypernyms.append([entity.lowest_common_hypernyms(compared_entity)[0].name().split('.')[0] 
                            for compared_entity in entities])
    
#build pairwise lowest common hypernym matrix
common_hypernym_frame = pd.DataFrame(common_hypernyms,
                                    index=entity_names,
                                    columns=entity_names)

#print the matrix
print common_hypernym_frame

           tree       lion      tiger        cat        dog
tree       tree   organism   organism   organism   organism
lion   organism       lion    big_cat     feline  carnivore
tiger  organism    big_cat      tiger     feline  carnivore
cat    organism     feline     feline        cat  carnivore
dog    organism  carnivore  carnivore  carnivore        dog


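We can also use semantic concepts to measure the semantic similarity between entities. We will use path similarity, which returns a value in the range [0, 1] based on the shortest path that connects two terms through the hypernym/hyponym hierarchy.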

In [17]:
similarities = []
for entity in entities:
    #get pairwise similarities
    similarities.append([round(entity.path_similarity(compared_entity),2) for compared_entity in entities])
    
#build pairwise similarity matrix
similarity_frame = pd.DataFrame(similarities, index=entity_names, columns=entity_names)

#print the matrix
print similarity_frame

       tree  lion  tiger   cat   dog
tree   1.00  0.07   0.07  0.08  0.13
lion   0.07  1.00   0.33  0.25  0.17
tiger  0.07  0.33   1.00  0.25  0.17
cat    0.08  0.25   0.25  1.00  0.20
dog    0.13  0.17   0.17  0.20  1.00
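
To make the two matrices above easier to interpret, here is a minimal sketch (Python 3) of the same idea on a hand-made toy hierarchy. The `PARENT` table is a hypothetical stand-in for a small slice of WordNet, not real WordNet data, and the score uses the formula 1 / (shortest path length + 1), which is how NLTK's `path_similarity` is documented to behave. In a simple tree like this one, the shortest path between two nodes always passes through their lowest common hypernym; the real WordNet graph allows multiple paths.

```python
# Toy is-a hierarchy (child -> parent). Hypothetical stand-in for WordNet.
PARENT = {
    'lion': 'big_cat',
    'tiger': 'big_cat',
    'big_cat': 'feline',
    'cat': 'feline',
    'feline': 'carnivore',
    'dog': 'canine',
    'canine': 'carnivore',
}


def path_to_root(node):
    """Return [node, parent, grandparent, ...] up to the root."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def lowest_common_hypernym(a, b):
    """First ancestor of a (a itself included) that is also an ancestor of b."""
    ancestors_b = set(path_to_root(b))
    for node in path_to_root(a):
        if node in ancestors_b:
            return node
    return None


def path_similarity(a, b):
    """1 / (number of edges on the shortest path + 1)."""
    lch = lowest_common_hypernym(a, b)
    if lch is None:
        return None
    distance = path_to_root(a).index(lch) + path_to_root(b).index(lch)
    return 1.0 / (distance + 1)


for a, b in [('lion', 'tiger'), ('lion', 'cat'), ('cat', 'dog'), ('lion', 'dog')]:
    print(a, b, lowest_common_hypernym(a, b), round(path_similarity(a, b), 2))
```

On this toy hierarchy the scores for lion/tiger (0.33), lion/cat (0.25), cat/dog (0.2), and lion/dog (0.17) agree with the corresponding cells of the NLTK matrix, which suggests the toy edges mirror the relevant WordNet paths.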


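## Word Sense Disambiguation
The sense of a word depends on how the word is used and on its semantics, and it is determined by context.

Identifying the correct sense or meaning of a word based on its usage is called word sense disambiguation.

Word sense disambiguation is a classic problem in natural language processing, with applications such as improving the relevance and coherence of search engine results.

There are several ways to tackle it, including lexicon- and dictionary-based methods as well as supervised and unsupervised machine learning methods.

Here we use the Lesk algorithm. Its basic principle is to take the dictionary or lexicon definitions of the word we want to disambiguate and compare the words in those definitions with a section of text surrounding the word of interest.

The main objective is to return the synset whose definition has the maximum number of overlapping words or terms with the context sentence.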
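
The overlap idea can be sketched in a few lines of plain Python 3. This is a simplified illustration, not NLTK's implementation: the sense inventory `SENSES`, the sense ids, the stopword list, and the second test sentence are all hypothetical, and the glosses are copied from the WordNet definitions shown earlier.

```python
# Hypothetical mini sense inventory: sense id -> gloss.
SENSES = {
    'fruit.plant': 'the ripened reproductive body of a seed plant',
    'fruit.result': 'the consequence of some effort or action',
}
# Small illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {'the', 'of', 'a', 'an', 'some', 'or', 'on', 'that',
             'his', 'as', 'he', 'have', 'and'}


def content_words(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return {w for w in text.lower().split() if w not in STOPWORDS}


def simplified_lesk(context_sentence, senses):
    """Pick the sense whose gloss shares the most words with the context."""
    context = content_words(context_sentence)
    best_sense, best_overlap = None, -1
    for sense, gloss in senses.items():
        overlap = len(context & content_words(gloss))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense, best_overlap


print(simplified_lesk('The fruits on that plant have ripened', SENSES))
print(simplified_lesk('His effort paid off and the action bore fruit', SENSES))
```

Because the sketch matches surface words only, a context that shares no words with any gloss falls back to the first sense in the dictionary. That is one reason practical implementations also use the POS tag and richer sense descriptions.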

In [18]:
from nltk.wsd import lesk
from nltk import word_tokenize

#sample text and word to disambiguate
samples = [('The fruits on that plant have ripened','n'),
          ('He finally reaped the fruit of his hard work as he won the race','n')]

word = 'fruit'
#perform word sense disambiguation
for sentence,pos_tag in samples:
    word_syn = lesk(word_tokenize(sentence.lower()), word, pos_tag)
    print('Sentence:', sentence)
    print('Word synset:', word_syn)
    print('Corresponding definition:', word_syn.definition())
    print

('Sentence:', 'The fruits on that plant have ripened')
('Word synset:', Synset('fruit.n.01'))
('Corresponding definition:', u'the ripened reproductive body of a seed plant')

('Sentence:', 'He finally reaped the fruit of his hard work as he won the race')
('Word synset:', Synset('fruit.n.03'))
('Corresponding definition:', u'the consequence of some effort or action')



In [19]:
#sample text and word to disambiguate
#to disambiguate, pass the word's context sentence together with its POS tag; Lesk uses both to pick a sense
samples = [('Lead is a very soft, malleable metal', 'n'),
          ('John is the actor who plays the lead in that movie','n'),
          ('This road leads to nowhere','v')]
word = 'lead'
#perform word sense disambiguation
for sentence, pos_tag in samples:
    word_syn = lesk(word_tokenize(sentence.lower()), word, pos_tag)
    print('Sentence:', sentence)
    print('Word synset:', word_syn)
    print('Corresponding definition:', word_syn.definition())
    print

('Sentence:', 'Lead is a very soft, malleable metal')
('Word synset:', Synset('lead.n.02'))
('Corresponding definition:', u'a soft heavy toxic malleable metallic element; bluish white when freshly cut but tarnishes readily to dull grey')

('Sentence:', 'John is the actor who plays the lead in that movie')
('Word synset:', Synset('star.n.04'))
('Corresponding definition:', u'an actor who plays a principal role')

('Sentence:', 'This road leads to nowhere')
('Word synset:', Synset('run.v.23'))
('Corresponding definition:', u'cause something to pass or lead somewhere')



## Named Entity Recognition
For more details, see the NLTK and Stanford NLP websites.

In [20]:
#sample document
text = """
Bayern Munich, or FC Bayern, is a German sports club based in Munich, 
Bavaria, Germany. It is best known for its professional football team, 
which plays in the Bundesliga, the top tier of the German football 
league system, and is the most successful club in German football 
history, having won a record 26 national titles and 18 national cups. 
FC Bayern was founded in 1900 by eleven football players led by Franz John. 
Although Bayern won its first national championship in 1932, the club 
was not selected for the Bundesliga at its inception in 1963. The club 
had its period of greatest success in the middle of the 1970s when, 
under the captaincy of Franz Beckenbauer, it won the European Cup three 
times in a row (1974-76). Overall, Bayern has reached ten UEFA Champions 
League finals, most recently winning their fifth title in 2013 as part 
of a continental treble. 
"""

In [21]:
import nltk
from normalization import parse_document
import pandas as pd

In [22]:
#tokenize sentences
sentences = parse_document(text)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]

#tag sentences and use nltk's Named Entity chunker
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences] #POS tagging
ne_chunked_sents = [nltk.ne_chunk(tagged) for tagged in tagged_sentences]  #named entity recognition

#extract all named entities
named_entities = []
for ne_tagged_sentence in ne_chunked_sents:
    for tagged_tree in ne_tagged_sentence:
        #extract only chunks having NE labels
        if hasattr(tagged_tree, 'label'):
            entity_name = ' '.join(c[0] for c in tagged_tree.leaves()) #get NE name; multi-word entities such as Franz John are joined into one string
            
            entity_type = tagged_tree.label()  #get NE category
            named_entities.append((entity_name, entity_type))
#get unique named entities
named_entities = list(set(named_entities))
#store named entities in a data frame
entity_name = pd.DataFrame(named_entities,
                          columns=['Entity Name','Entity Type'])

#display results
print entity_name
            

          Entity Name   Entity Type
0              Bayern        PERSON
1          Franz John        PERSON
2   Franz Beckenbauer        PERSON
3              Munich  ORGANIZATION
4            European  ORGANIZATION
5          Bundesliga  ORGANIZATION
6              German           GPE
7             Bavaria           GPE
8             Germany           GPE
9           FC Bayern  ORGANIZATION
10               UEFA  ORGANIZATION
11             Munich           GPE
12             Bayern           GPE
13            Overall           GPE


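Next, we analyze the text with the Stanford NER tagger and compare the results with the NLTK output above.

You need to download the Stanford NER resources: http://nlp.stanford.edu/software/stanford-ner-2014-08-27.zip

For details on Stanford NER, visit: http://nlp.stanford.edu/software/CRF-NER.shtml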

In [23]:
from nltk.tag import StanfordNERTagger
import os

In [24]:
#set java path in environment variables
java_path='/usr/bin/java'
os.environ['JAVAHOME'] = java_path

In [25]:
#load stanford NER
sn = StanfordNERTagger('/home/parallels/stanford-nlp/stanford-ner-2014-08-27/classifiers/english.all.3class.distsim.crf.ser.gz',
                      path_to_jar='/home/parallels/stanford-nlp/stanford-ner-2014-08-27/stanford-ner.jar')

In [26]:
ne_annotated_sentences = [sn.tag(sent) for sent in tokenized_sentences]

named_entities = []
for sentence in ne_annotated_sentences:
    temp_entity_name = ''
    temp_named_entity = None
    for term, tag in sentence:
        if tag != 'O':
            temp_entity_name = ' '.join([temp_entity_name, term]).strip()
            temp_named_entity = (temp_entity_name, tag)
        else:
            if temp_named_entity:
                named_entities.append(temp_named_entity)
                temp_entity_name = ''
                temp_named_entity = None
    #flush an entity that runs to the end of the sentence
    if temp_named_entity:
        named_entities.append(temp_named_entity)

named_entities = list(set(named_entities))
entity_frame = pd.DataFrame(named_entities, 
                            columns=['Entity Name', 'Entity Type'])
print entity_frame 

         Entity Name   Entity Type
0         Franz John        PERSON
1  Franz Beckenbauer        PERSON
2            Germany      LOCATION
3             Bayern  ORGANIZATION
4            Bavaria      LOCATION
5             Munich      LOCATION
6          FC Bayern  ORGANIZATION
7               UEFA  ORGANIZATION
8      Bayern Munich  ORGANIZATION
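
The grouping step, which merges consecutive tagged tokens into one entity, can also be written with `itertools.groupby`. The sketch below (Python 3) is an alternative formulation rather than the notebook's own code, and the `tagged` list is a hypothetical example of per-token tagger output. Unlike a running accumulator, grouping by tag also separates two adjacent entities of different types.

```python
from itertools import groupby


def group_entities(tagged_tokens, outside_tag='O'):
    """Merge runs of consecutive tokens that share the same non-O tag."""
    entities = []
    for tag, run in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if tag != outside_tag:
            entities.append((' '.join(token for token, _ in run), tag))
    return entities


# Hypothetical (token, tag) pairs in the shape a per-token NER tagger returns
tagged = [('Franz', 'PERSON'), ('Beckenbauer', 'PERSON'), ('joined', 'O'),
          ('Bayern', 'ORGANIZATION'), ('Munich', 'ORGANIZATION'),
          ('Germany', 'LOCATION')]
print(group_entities(tagged))
```

Two distinct entities of the same type that sit next to each other would still be merged, because plain per-token tags carry no boundary information.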


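How well a named entity recognizer performs depends on the type of corpus being analyzed.

You can build your own NER tagger by training a supervised machine learning model on a pre-tagged corpus, using an approach similar to the one in Chapter 3.

In fact, both taggers discussed above were trained on pre-tagged corpora such as CoNLL, MUC, and Penn Treebank.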

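## Analyzing Semantic Representations
So far we have discussed the semantics of, and the relations between, individual lexical units.

The next question is how to represent the semantics conveyed by one or more messages.

The frameworks of propositional logic and first-order logic can help us with semantic representation.

### Propositional Logic
A proposition is usually a declarative statement whose value is binary: either true or false.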

In [27]:
import nltk
import pandas as pd
import os

In [28]:
#assign symbols and propositions
symbol_P = 'P'
symbol_Q = 'Q'
proposition_P = 'He is hungry'
proposition_Q = 'He will eat a sandwich'
#assign various truth values to the propositions
p_statuses = [False, False, True, True]
q_statuses = [False, True, False, True]
#assign the various expressions combining the logical operators
conjunction = '(P&Q)'
disjunction = '(P|Q)'
implication = '(P->Q)'
equivalence = '(P<->Q)'
expressions = [conjunction, disjunction, implication, equivalence]

#evaluate each expression using propositional logic
results = []
for status_p, status_q in zip(p_statuses, q_statuses):
    dom = set([])
    val = nltk.Valuation([(symbol_P, status_p), (symbol_Q, status_q)])
    assignments = nltk.Assignment(dom)
    model = nltk.Model(dom, val)
    row = [status_p, status_q]
    for expression in expressions:
        #evaluate each expression based on proposition truth values
        result = model.evaluate(expression, assignments)
        row.append(result)
    results.append(row)

#build the result table
columns = [symbol_P, symbol_Q, conjunction, disjunction, implication, equivalence]
result_frame = pd.DataFrame(results, columns=columns)

#display results
print('P:', proposition_P)
print('Q:', proposition_Q)
print
print('Expression Outcomes:-')
print(result_frame)

('P:', 'He is hungry')
('Q:', 'He will eat a sandwich')

Expression Outcomes:-
       P      Q  (P&Q)  (P|Q)  (P->Q)  (P<->Q)
0  False  False  False  False    True     True
1  False   True  False   True    True    False
2   True  False  False   True   False    False
3   True   True   True   True    True     True
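
The implication and equivalence columns are the ones that most often cause confusion. As a cross-check, the same four operators can be written with plain Python booleans (a small Python 3 sketch; the helper names `implies` and `iff` are introduced here for illustration only).

```python
from itertools import product


def implies(p, q):
    # P -> Q is false only when P is true and Q is false
    return (not p) or q


def iff(p, q):
    # P <-> Q holds when both sides have the same truth value
    return p == q


# Columns: P, Q, P&Q, P|Q, P->Q, P<->Q
truth_table = [(p, q, p and q, p or q, implies(p, q), iff(p, q))
               for p, q in product([False, True], repeat=2)]
for row in truth_table:
    print(row)
```

The rows come out in the same order as the table above, so the two can be compared line by line.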


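### First-Order Logic
Propositional logic has several limitations, such as the inability to represent facts or complex relations and inferences.

Every new proposition needs its own unique symbol, which makes it hard to generalize facts, so the expressive power of propositional logic is quite limited.

First-order logic (FOL) has features such as functions, quantifiers, relations, connectives, and symbols.

This section focuses on how to represent **FOL expressions** in Python and how to **perform inference from evidence, given some goal and predefined rules and events**.

Several theorem provers can be used to evaluate expressions and prove theorems.

The nltk package has three main types of theorem provers: Prover9, TableauProver, and ResolutionProver.

Prover9 is free to use and can be downloaded from www.cs.unm.edu/~mccune/prover9/download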

In [29]:
import nltk
import os

#for reading FOL expressions
read_expr = nltk.sem.Expression.fromstring
#initialize theorem provers (you can choose any)
os.environ['PROVER9'] = r'/home/parallels/LADR-2009-11A/bin'
prover = nltk.Prover9()
#alternative that needs no external binary; the examples below call Prover9 through `prover`
prove = nltk.ResolutionProver()

In [30]:
#set the rule expression
rule = read_expr('all x. all y. (jumps_over(x,y))')

#set the event that occurred
event = read_expr('jumps_over(fox, dog)')

#set the outcome we want to evaluate -- the goal
test_outcome = read_expr('jumps_over(dog, fox)')

#get the result
prover.prove(goal=test_outcome, assumptions=[event, rule],verbose=True)

[Found prover9: /home/parallels/LADR-2009-11A/bin/prover9]
Calling: /home/parallels/LADR-2009-11A/bin/prover9
Args: []
Input:
 assign(max_seconds, 60).

clear(auto_denials).
formulas(assumptions).
    jumps_over(fox,dog).
    all x all y jumps_over(x,y).
end_of_list.

formulas(goals).
    jumps_over(dog,fox).
end_of_list.

 

Return code: 0
stdout:
Prover9 (64) version 2009-11A, November 2009.
Process 16769 was started by parallels on ubuntu,
Mon Jul 16 23:38:48 2018
The command was "/home/parallels/LADR-2009-11A/bin/prover9".

assign(max_seconds,60).
clear(auto_denials).

formulas(assumptions).
jumps_over(fox,dog).
(all x all y jumps_over(x,y)).
end_of_list.

formulas(goals).
jumps_over(dog,fox).
end_of_list.



% Formulas that are not ordinary clauses:
1 (all x all y jumps_over(x,y)) # label(non_clause).  [assumption].
2 jumps_over(dog,fox) # label(non_clause) # label(goal).  [goal].



% Clauses before input processing:

formulas(usable).
end_of_list.

formulas(sos).
jumps_over(fox,

True

In [31]:
#set the rule expression
rule = read_expr('all x. (studies(x, exam) -> pass(x, exam))')

#set the events and outcome we want to determine
event1 = read_expr('-studies(John, exam)')
test_outcome1 = read_expr('pass(John, exam)')

event2 = read_expr('studies(Pierre, exam)')
test_outcome2 = read_expr('pass(Pierre, exam)')

#get results
prover.prove(goal=test_outcome1,
            assumptions=[event1, rule],
            verbose=True)

prover.prove(goal=test_outcome2,
            assumptions=[event2, rule],
            verbose=True)

Calling: /home/parallels/LADR-2009-11A/bin/prover9
Args: []
Input:
 assign(max_seconds, 60).

clear(auto_denials).
formulas(assumptions).
    -(studies(John,exam)).
    all x (studies(x,exam) -> pass(x,exam)).
end_of_list.

formulas(goals).
    pass(John,exam).
end_of_list.

 

Return code: 2
stdout:
Prover9 (64) version 2009-11A, November 2009.
Process 16773 was started by parallels on ubuntu,
Mon Jul 16 23:38:49 2018
The command was "/home/parallels/LADR-2009-11A/bin/prover9".

assign(max_seconds,60).
clear(auto_denials).

formulas(assumptions).
-studies(John,exam).
(all x (studies(x,exam) -> pass(x,exam))).
end_of_list.

formulas(goals).
pass(John,exam).
end_of_list.



% Formulas that are not ordinary clauses:
1 (all x (studies(x,exam) -> pass(x,exam))) # label(non_clause).  [assumption].
2 pass(John,exam) # label(non_clause) # label(goal).  [goal].



% Clauses before input processing:

formulas(usable).
end_of_list.

formulas(sos).
-studies(John,exam).  [assumption].
-studies(x,e

True

In [32]:
#define symbols (entities\functions) and their values
rules = """
rover=>r
felix=>r
garfield=>g
alex=>a
dog=>{r, a}
cat=>{g}
fox=>{f}
runs=>{a, f}
sleeps=>{r, g}
jumps_over => {(f,g), (a,g), (f,r), (a,r)}
"""

In [33]:
val = nltk.Valuation.fromstring(rules)
#view the valuation object of symbols and their assigned values (dictionary)

print val

{'rover': 'r', 'runs': set([('f',), ('a',)]), 'alex': 'a', 'sleeps': set([('r',), ('g',)]), 'felix': 'r', 'fox': set([('f',)]), 'dog': set([('a',), ('r',)]), 'jumps_over': set([('a', 'g'), ('f', 'g'), ('a', 'r'), ('f', 'r')]), 'cat': set([('g',)]), 'garfield': 'g'}


In [34]:
#define domain and build FOL based model
dom = {'r','f','g','a'}

m = nltk.Model(dom, val)

#evaluate various expressions
print m.evaluate('jumps_over(felix, rover) & dog(rover) & runs(rover)', None)

False


In [35]:
print m.evaluate('jumps_over(felix, rover) & dog(rover) & -runs(rover)', None)

False


In [36]:
print m.evaluate('jumps_over(alex, garfield) & dog(alex) & cat(garfield) & sleeps(garfield)', None)

True


In [37]:
#assign r (rover) to x and f (the fox) to y in the domain
g = nltk.Assignment(dom, [('x','r'),('y','f')])

In [38]:
#evaluate more expressions based on above assigned symbols
print m.evaluate('runs(y) & jumps_over(y, x) & sleeps(x)', g)

True


In [39]:
print m.evaluate('exists y. (fox(y) & runs(y))', g)

True


In [40]:
#who are the animals who run
formula = read_expr('runs(x)')
print m.satisfiers(formula, 'x', g)

set(['a', 'f'])


In [41]:
#animals who run and are also a fox?
formula = read_expr('runs(x) & fox(x)')
print m.satisfiers(formula, 'x', g)

set(['f'])
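
What the model evaluation does can be mirrored with ordinary Python sets, which may help when reading the quantified expressions. The sketch below (Python 3) re-creates part of the valuation from the cells above by hand; it illustrates the semantics only and does not use NLTK.

```python
# Same domain and predicate extensions as the valuation above
domain = {'r', 'f', 'g', 'a'}
dog = {'r', 'a'}
fox = {'f'}
runs = {'a', 'f'}
jumps_over = {('f', 'g'), ('a', 'g'), ('f', 'r'), ('a', 'r')}

# exists y. (fox(y) & runs(y)): true if at least one individual satisfies both
exists_running_fox = any(y in fox and y in runs for y in domain)

# satisfiers of runs(x), and of runs(x) & fox(x)
runners = {x for x in domain if x in runs}
running_foxes = {x for x in domain if x in runs and x in fox}

# all x. (fox(x) -> runs(x)), using P -> Q == (not P) or Q
all_foxes_run = all((x not in fox) or (x in runs) for x in domain)

print(exists_running_fox, sorted(runners), sorted(running_foxes), all_foxes_run)
```

An existential quantifier corresponds to `any` over the domain, a universal quantifier to `all`, and `satisfiers` to a set comprehension over the domain.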


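## Sentiment Analysis
Unstructured text data can be broadly divided into two categories: fact-based (objective) and opinion-based (subjective).

Sentiment analysis is defined as the process of using techniques such as NLP, lexical resources, linguistics, and machine learning to extract subjective, opinion-related information, and then using that information to compute the polarity expressed by a text document.

Polarity refers to whether the document expresses a positive, negative, or neutral sentiment.

Sentiment analysis can be performed at several levels: the individual sentence, the paragraph, or the document as a whole. It is usually computed over the entire document, or computed sentence by sentence and then aggregated.

Polarity analysis usually assigns scores to the positive and negative sentiment expressed in a document and then labels the document based on the aggregated score.

We cover two main techniques for sentiment analysis:
- Supervised machine learning
- Unsupervised, lexicon-based techniques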
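
The score-then-label idea described above can be sketched as follows (Python 3). The lexicon, its scores, and the threshold are hypothetical values chosen for illustration; real lexicons are far larger and handle negation, intensity, and context.

```python
# Hypothetical toy lexicon: word -> polarity score (illustrative values only)
LEXICON = {'wonderful': 1.0, 'great': 1.0, 'good': 0.5,
           'boring': -0.5, 'bad': -1.0, 'terrible': -1.0}


def sentence_score(sentence):
    """Sum the lexicon scores of the words in one sentence."""
    words = (w.strip('.,!?') for w in sentence.lower().split())
    return sum(LEXICON.get(w, 0.0) for w in words)


def document_polarity(document, threshold=0.0):
    """Score sentence by sentence, aggregate, then assign a label."""
    sentences = [s for s in document.split('.') if s.strip()]
    total = sum(sentence_score(s) for s in sentences)
    if total > threshold:
        label = 'positive'
    elif total < -threshold:
        label = 'negative'
    else:
        label = 'neutral'
    return total, label


print(document_polarity('A wonderful little production. The acting was good.'))
print(document_polarity('The plot was boring. The ending was terrible!'))
```

Splitting on periods is a deliberately naive stand-in for a proper sentence tokenizer.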

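## Sentiment Analysis of IMDb Movie Reviews
The movie review dataset can be downloaded from http://ai.stanford.edu/~amaas/data/sentiment/

### Setting Up Dependencies

- Getting and formatting the data
    
    After downloading the data, use the review_data_extractor.py file provided with this chapter's code to extract every review from the unzipped directory, parse it, and neatly format the reviews into a data frame.
- Text normalization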


In [42]:
from HTMLParser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []
        
    def handle_data(self, d):
        self.fed.append(d)
        
    def get_data(self):
        return ' '.join(self.fed)
    
def strip_html(text):
    html_stripper = MLStripper()
    html_stripper.feed(text)
    return html_stripper.get_data()

In [43]:
import unicodedata

def normalize_accented_characters(text):
    '''Convert accented characters to their closest ASCII equivalents'''
    text = unicodedata.normalize('NFKD',text.decode('utf8')).encode('ascii', 'ignore')
    return text

In [44]:
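#expand_contractions, CONTRACTION_MAP, lemmatize_text, remove_special_characters, remove_stopwords,
#keep_text_characters and tokenize_text are assumed to be defined in this chapter's normalization.py
html_parser = HTMLParser()  #assumed module-level parser used for unescaping HTML entities

def normalize_corpus(corpus, lemmatize=True, only_text_chars=False, tokenize=False):
    normalized_corpus = []
    
    for index, text in enumerate(corpus):
        text = normalize_accented_characters(text)
        text = html_parser.unescape(text)
        text = strip_html(text)
        text = expand_contractions(text, CONTRACTION_MAP)
        if lemmatize:
            text = lemmatize_text(text)
        else:
            text = text.lower()
        #the remaining steps must also run once per document, inside the loop
        text = remove_special_characters(text)
        text = remove_stopwords(text)
        if only_text_chars:
            text = keep_text_characters(text)
        if tokenize:
            text = tokenize_text(text)
        normalized_corpus.append(text)
        
    return normalized_corpus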
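
The cells above target Python 2 (`HTMLParser` module, `str.decode`). For readers on Python 3, the two self-contained steps, stripping HTML and removing accents, might look like the sketch below. The names `TagStripper` and `remove_accents` are hypothetical, and the sample string is made up in the style of the review data.

```python
import unicodedata
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collect only the text content and drop every tag."""

    def __init__(self):
        super().__init__()  # convert_charrefs=True also decodes entities
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(text):
    stripper = TagStripper()
    stripper.feed(text)
    stripper.close()
    # Collapse the whitespace left behind where tags were removed
    return ' '.join(''.join(stripper.chunks).split())


def remove_accents(text):
    # NFKD splits 'é' into 'e' plus a combining accent, which ASCII encoding drops
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')


sample = ('A wonderful little production. <br /><br />'
          'The caf&eacute; sc\u00e8ne was <b>great</b>')
clean = remove_accents(strip_html(sample))
print(clean)
```

Dropping non-ASCII characters is lossy for text that is not Latin-based, so this step only makes sense for corpora like English movie reviews.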

- Feature extraction

In [45]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

def build_feature_matrix(documents, feature_type='frequency', ngram_range=(1,1), min_df=0.0, max_df=1.0):
    feature_type = feature_type.lower().strip()
    
    if feature_type == 'binary':
        vectorizer = CountVectorizer(binary=True, min_df=min_df,
                                      max_df=max_df, ngram_range=ngram_range)
    elif feature_type == 'frequency':
        vectorizer = CountVectorizer(binary=False, min_df=min_df,
                                      max_df=max_df, ngram_range=ngram_range)
    elif feature_type == 'tfidf':
        vectorizer = TfidfVectorizer(min_df=min_df, max_df=max_df, ngram_range=ngram_range)
    else:
        raise Exception('Wrong feature type entered. Possible values: "binary","frequency","tfidf"')
        
    feature_matrix = vectorizer.fit_transform(documents).astype(float)
    return vectorizer, feature_matrix
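
To show what the 'binary' and 'frequency' options mean, and why the vectorizer fitted on the training data is reused for new documents, here is a dependency-free sketch (Python 3). The helpers `fit_vocabulary` and `transform` are hypothetical simplifications and ignore everything scikit-learn does beyond whitespace tokenization.

```python
from collections import Counter


def fit_vocabulary(documents):
    """Map each distinct training token to a column index."""
    tokens = sorted({tok for doc in documents for tok in doc.split()})
    return {tok: idx for idx, tok in enumerate(tokens)}


def transform(documents, vocabulary, binary=False):
    """Build one row per document; tokens outside the vocabulary are ignored."""
    matrix = []
    for doc in documents:
        counts = Counter(tok for tok in doc.split() if tok in vocabulary)
        row = [0] * len(vocabulary)
        for tok, count in counts.items():
            row[vocabulary[tok]] = 1 if binary else count
        matrix.append(row)
    return matrix


train_docs = ['good movie good cast', 'bad movie']
vocab = fit_vocabulary(train_docs)
frequency_rows = transform(train_docs, vocab)
binary_rows = transform(train_docs, vocab, binary=True)
unseen_rows = transform(['good plot'], vocab)  # 'plot' was never seen in training
print(vocab, frequency_rows, binary_rows, unseen_rows)
```

The last call shows why test documents must go through the training vocabulary: the column layout stays fixed, and words never seen during training contribute nothing.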

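- Evaluating model performance (accuracy, precision, recall, and F1 score)
    
    We also look at the confusion matrix and a detailed per-class classification report.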

In [46]:
from sklearn import metrics
import numpy as np
import pandas as pd

In [47]:
def display_evaluation_metrics(true_labels, predicted_labels, positive_class=1):
    print("Accuracy:", np.round(metrics.accuracy_score(true_labels, predicted_labels),2))
    
    print('Precision:',np.round(metrics.precision_score(true_labels,predicted_labels,
                                                       pos_label=positive_class,
                                                       average='binary'),2))
    print("Recall:",np.round(metrics.recall_score(true_labels,predicted_labels,
                                                 pos_label=positive_class,
                                                 average='binary'),2))
    print("F1 Score:", np.round(metrics.f1_score(true_labels,predicted_labels,
                                                pos_label=positive_class,
                                                average='binary'),2))

In [48]:
def display_confusion_matrix(true_labels, predicted_labels, classes=[1,0]):
    '''Build and print the confusion matrix'''
    cm = metrics.confusion_matrix(y_true=true_labels,
                                 y_pred=predicted_labels,
                                 labels=classes)
    cm_frame = pd.DataFrame(data=cm,columns=pd.MultiIndex(levels=[['Predicted:'],classes], labels=[[0,0],[0,1]]),
                           index=pd.MultiIndex(levels=[['Actual:'],classes],labels=[[0,0],[0,1]]))
    print(cm_frame)

In [49]:
def display_classification_report(true_labels, predicted_labels,classes=[1,0]):
    report = metrics.classification_report(y_true=true_labels,
                                          y_pred=predicted_labels,
                                          labels=classes)
    print report
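
The four metrics are all derived from the confusion-matrix counts. The sketch below (Python 3) computes them by hand on a made-up set of labels so the definitions are visible; `binary_metrics` is a hypothetical helper, not part of scikit-learn.

```python
def binary_metrics(true_labels, predicted_labels, positive_class='positive'):
    """Compute accuracy, precision, recall, and F1 from raw label lists."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive_class and p == positive_class)
    fp = sum(1 for t, p in pairs if t != positive_class and p == positive_class)
    fn = sum(1 for t, p in pairs if t == positive_class and p != positive_class)
    tn = sum(1 for t, p in pairs if t != positive_class and p != positive_class)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {'accuracy': accuracy, 'precision': precision,
            'recall': recall, 'f1': f1}


# Made-up labels for illustration: 2 TP, 1 FN, 1 FP, 1 TN
true = ['positive', 'positive', 'positive', 'negative', 'negative']
pred = ['positive', 'positive', 'negative', 'positive', 'negative']
scores = binary_metrics(true, pred)
print({name: round(value, 2) for name, value in scores.items()})
```

Precision asks how many predicted positives were correct, recall asks how many actual positives were found, and F1 is their harmonic mean.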

### Preparing the Dataset

In [50]:
#load movie reviews data
dataset = pd.read_csv(r'movie_reviews.csv')
#print sample data
print dataset.head()

                                              review sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive


In [51]:
#prepare training and testing datasets
train_data = dataset[:3500]
test_data = dataset[3500:]

train_reviews = np.array(train_data['review'])
train_sentiments = np.array(train_data['sentiment'])
test_reviews = np.array(test_data['review'])
test_sentiments = np.array(test_data['sentiment'])

#prepare sample dataset for experiments
sample_docs = [100,5817,7626,7356, 1008, 7155, 3533, 13010]
sample_data = [(test_reviews[index], test_sentiments[index]) for index in sample_docs]

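### Supervised machine learning technique
1. Model training
    - Normalize the training data
    - Extract features and build the feature set and feature vectors
    - Build a predictive model with a supervised machine learning algorithm (SVM)

2. Model testing
    - Normalize the test data
    - Extract features using the vectorizer fitted on the training data
    - Use the trained model to predict the sentiment of the test reviews
    - Evaluate model performance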
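
The two phases above can be sketched end to end on a toy corpus. This is a minimal sketch under stated assumptions: it uses scikit-learn's `TfidfVectorizer` in place of the notebook's own `normalize_corpus` and `build_feature_matrix` helpers, and the documents and labels are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier

# toy corpus, made up for illustration
train_docs = ['a wonderful touching film', 'great acting and a great story',
              'a boring and terrible movie', 'awful plot and bad acting']
train_labels = ['positive', 'positive', 'negative', 'negative']
test_docs = ['a great and wonderful story', 'terrible and boring acting']

# 1. training: normalize (lowercasing only here), extract TF-IDF features, fit a linear SVM
vectorizer = TfidfVectorizer(lowercase=True, ngram_range=(1, 1))
train_features = vectorizer.fit_transform(train_docs)
svm = SGDClassifier(loss='hinge', max_iter=50, tol=None, random_state=42)
svm.fit(train_features, train_labels)

# 2. testing: reuse the fitted vectorizer (transform, not fit_transform), then predict
test_features = vectorizer.transform(test_docs)
predictions = svm.predict(test_features)
for doc, label in zip(test_docs, predictions):
    print(doc, '->', label)
```

The key design point is that the test data goes through `transform` only, so its feature columns line up with the vocabulary learned during training.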

In [None]:
from normalization import normalize_corpus
from utils import build_feature_matrix

#normalization
norm_train_reviews = normalize_corpus(train_reviews, lemmatize=True, only_text_chars=True)

#feature extraction
vectorizer, train_features = build_feature_matrix(documents=norm_train_reviews, 
                                                 feature_type='tfidf',
                                                 ngram_range=(1,1),
                                                 min_df=0.0, max_df=1.0)

In [None]:
#build the model using the SVM algorithm
from sklearn.linear_model import SGDClassifier
#note: later scikit-learn releases replace n_iter with max_iter
svm = SGDClassifier(loss='hinge',n_iter=50)
svm.fit(train_features, train_sentiments)



SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', max_iter=None, n_iter=50,
       n_jobs=1, penalty='l2', power_t=0.5, random_state=None,
       shuffle=True, tol=None, verbose=0, warm_start=False)

In [None]:
#extract features from the test dataset

#normalize reviews
norm_test_reviews = normalize_corpus(test_reviews, lemmatize=True, only_text_chars=True)

In [None]:
#extract features

test_features = vectorizer.transform(norm_test_reviews)

In [None]:
#predict sentiment for sample docs from test data
for doc_index in sample_docs:
    print('Review:-')
    print(test_reviews[doc_index])
    print('Actual Labeled Sentiment:', test_sentiments[doc_index])
    doc_features = test_features[doc_index]
    predicted_sentiment = svm.predict(doc_features)[0]
    print('Predicted Sentiment:', predicted_sentiment)
    print()

Next, we predict sentiment for the entire test set and evaluate the model's performance.

In [None]:
#predict the sentiment for test dataset movie reviews
predicted_sentiments = svm.predict(test_features)

#evaluate model prediction performance
from utils import display_evaluation_metrics, display_confusion_matrix, display_classification_report

#show performance metrics
display_evaluation_metrics(true_labels=test_sentiments,
                          predicted_labels=predicted_sentiments,
                          positive_class='positive')

In [None]:
#show confusion matrix
display_confusion_matrix(true_labels=test_sentiments,
                        predicted_labels=predicted_sentiments,
                        classes=['positive','negative'])

#show detailed per-class classification report
display_classification_report(true_labels=test_sentiments, predicted_labels=predicted_sentiments,
                             classes=['positive','negative'])

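### Unsupervised lexicon-based techniques
Several popular lexicons used for sentiment analysis are listed below:
- AFINN lexicon
    
    Polarity indicates how positive, negative, or neutral a term is, together with a corresponding numeric score.
    
    Download: www2.imm.dtu.dk/pubdb/views/publication_details.php?id=6010
    
    A Python API for this lexicon can also be used directly for sentiment analysis: github.com/fnielsen/afinn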
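- Bing Liu lexicon
    
    For more information, see: https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#lexicon
    
    The core idea is that once the lexicon's words are identified in a document, they can be used to determine whether the document's polarity is positive or negative.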
    
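- MPQA subjectivity lexicon
    
    MPQA stands for Multi-Perspective Question Answering. It comprises a large set of resources maintained by the University of Pittsburgh: an opinion corpus, a subjectivity lexicon, sense annotations, an argument lexicon, and debate datasets.
    
    Many of these resources can be used to analyze human sentiment and emotion. Related paper: "Recognizing Contextual Polarity in Phrase-Level Sentiment Analysis"
    
    The subjectivity lexicon can be downloaded from: http://mpqa.cs.pitt.edu/lexicons/subj_lexicon/
    
    The subjectivity clue words are in subjclueslen1-HLTEMNLP05.tff once the archive is extracted.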
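- SentiWordNet
    
    A lexical resource for sentiment analysis and opinion mining. It assigns three sentiment scores: a positive polarity score, a negative polarity score, and an objectivity score.
    
    For more information, see: http://sentiwordnet.isti.cnr.it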
- VADER lexicon

    A rule-based sentiment analysis framework built specifically for analyzing sentiment in social media.
    
    For more information, see: https://github.com/cjhutto/vaderSentiment
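- Pattern lexicon

    pattern is a complete package for natural language processing, text analysis, and information retrieval.
    
    The package includes a sentiment analysis module, as well as modules for mood analysis and text pattern analysis.
    
    For sentiment analysis, the module analyzes any text by splitting it into sentences, tokenizing it, and POS-tagging the tokens, among other steps.
    
    Its subjectivity-based sentiment lexicon is at: github.com/clips/pattern/blob/master/pattern/text/en/en-sentiment.xml
    
    The lexicon contains polarity, subjectivity, intensity, and confidence scores, along with POS tags and WordNet identifiers.
    
    pattern recommends a threshold of 0.1: scores at or above it are treated as positive and scores below it as negative.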
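
The shared idea behind these lexicons (look up each word's polarity, aggregate, then compare against a threshold) can be sketched as follows. The mini-lexicon `TOY_LEXICON`, its scores, and the helper `lexicon_sentiment` are hypothetical and exist only for illustration; real lexicons are far larger and use more elaborate rules.

```python
# hypothetical mini-lexicon: word -> polarity in [-1, 1]
TOY_LEXICON = {'wonderful': 0.8, 'great': 0.7, 'good': 0.5,
               'boring': -0.6, 'terrible': -0.9, 'bad': -0.5}

def lexicon_sentiment(text, lexicon=TOY_LEXICON, threshold=0.1):
    """Average the polarity of known words and compare it with a threshold."""
    tokens = text.lower().split()
    scores = [lexicon[tok] for tok in tokens if tok in lexicon]
    # documents with no lexicon words get a neutral score of 0.0
    polarity = sum(scores) / len(scores) if scores else 0.0
    label = 'positive' if polarity >= threshold else 'negative'
    return label, round(polarity, 2)

print(lexicon_sentiment('a wonderful film with great acting'))
print(lexicon_sentiment('a boring plot and terrible pacing'))
```

Note that a document containing no lexicon words scores 0.0, which falls below the 0.1 threshold and is therefore labeled negative under this binary rule.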

#### AFINN lexicon

In [None]:
from afinn import Afinn
afn = Afinn(emoticons=True)

#### SentiWordNet

In [None]:
import nltk
from nltk.corpus import sentiwordnet as swn
#get synset for 'good'
#senti_synsets() returns an iterator on Python 3, so materialize it before indexing
good = list(swn.senti_synsets('good','n'))[0]

#print synset sentiment scores
print('Positive Polarity Score:', good.pos_score())
print('Negative Polarity Score:', good.neg_score())
print('Objective Score:', good.obj_score())

In [None]:
from normalization import normalize_accented_characters, html_parser, strip_html

def analyze_sentiment_sentiwordnet_lexicon(review, verbose=False):
    #pre-process text
    review = normalize_accented_characters(review)
    review = html_parser.unescape(review)
    review = strip_html(review)
    
    #tokenize and POS tag text tokens
    text_tokens = nltk.word_tokenize(review)
    tagged_text = nltk.pos_tag(text_tokens)
    pos_score = neg_score = token_count = obj_score = 0
    
    #get wordnet synsets based on POS tags
    #get sentiment scores if synsets are found
    for word, tag in tagged_text:
        ss_set = None
        #senti_synsets() returns an iterator on Python 3, so wrap it in list() before testing or indexing
        if 'NN' in tag and list(swn.senti_synsets(word, 'n')):
            ss_set = list(swn.senti_synsets(word, 'n'))[0]
        elif 'VB' in tag and list(swn.senti_synsets(word, 'v')):
            ss_set = list(swn.senti_synsets(word, 'v'))[0]
        elif 'JJ' in tag and list(swn.senti_synsets(word, 'a')):
            ss_set = list(swn.senti_synsets(word, 'a'))[0]
        elif 'RB' in tag and list(swn.senti_synsets(word, 'r')):
            ss_set = list(swn.senti_synsets(word, 'r'))[0]
        #if senti-synset is found
        if ss_set:
            #add scores for all found synsets
            pos_score += ss_set.pos_score()
            neg_score += ss_set.neg_score()
            obj_score += ss_set.obj_score()
            token_count += 1
        
    #aggregate final scores
    final_score = pos_score - neg_score
    norm_final_score = round(float(final_score)/token_count, 2)
    final_sentiment = 'positive' if norm_final_score >= 0 else 'negative'

    if verbose:
        norm_obj_score = round(float(obj_score)/token_count, 2)
        norm_pos_score = round(float(pos_score)/token_count,2)
        norm_neg_score = round(float(neg_score)/token_count,2)
        # to display results in a nice table
        sentiment_frame = pd.DataFrame([[final_sentiment, norm_obj_score,
                                        norm_pos_score,norm_neg_score,norm_final_score]],
                                       columns=pd.MultiIndex.from_product([['SENTIMENT STATS:'],
                                                                           ['Predicted_Sentiment',
                                                                            'Objectivity',
                                                                            'Positive',
                                                                            'Negative',
                                                                            'Overall']])
                                      )
        print(sentiment_frame)
        
    return final_sentiment
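
The aggregation step at the end of this function can be isolated in a small sketch. The helper `aggregate_scores` and the score triples are made up for illustration. Unlike the function above, this sketch guards against a review in which no token has a senti-synset, where dividing by `token_count` would raise `ZeroDivisionError`; returning a neutral 0.0 in that case is an assumption, not the notebook's behavior.

```python
def aggregate_scores(token_scores):
    """token_scores: list of (pos, neg, obj) triples, one per token with a senti-synset."""
    token_count = len(token_scores)
    if token_count == 0:
        # no scored tokens: avoid ZeroDivisionError and fall back to a neutral score
        return 'positive', 0.0
    pos_score = sum(pos for pos, _, _ in token_scores)
    neg_score = sum(neg for _, neg, _ in token_scores)
    # normalize the net polarity by the number of scored tokens
    norm_final_score = round((pos_score - neg_score) / token_count, 2)
    final_sentiment = 'positive' if norm_final_score >= 0 else 'negative'
    return final_sentiment, norm_final_score

# made-up scores for two scored tokens each
print(aggregate_scores([(0.75, 0.0, 0.25), (0.0, 0.25, 0.75)]))
print(aggregate_scores([(0.0, 0.75, 0.25), (0.25, 0.0, 0.75)]))
print(aggregate_scores([]))
```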

In [None]:
#detail sentiment analysis for sample reviews
for review, review_sentiment in sample_data:
    print('Review:')
    print(review)
    print()
    print('Labeled Sentiment:', review_sentiment)
    print()
    final_sentiment = analyze_sentiment_sentiwordnet_lexicon(review,verbose=True)
    print('-'*60)

In [None]:
#predict sentiment for test movie reviews dataset
sentiwordnet_predictions = [analyze_sentiment_sentiwordnet_lexicon(review) for review in test_reviews]

#get model performance statistics
print('Performance metrics:')

display_evaluation_metrics(true_labels=test_sentiments,
                            predicted_labels=sentiwordnet_predictions,
                            positive_class='positive')
print('\nConfusion Matrix:')

display_confusion_matrix(true_labels=test_sentiments,
                        predicted_labels=sentiwordnet_predictions,
                        classes=['positive','negative'])

print('\nClassification report:')
display_classification_report(true_labels=test_sentiments,
                             predicted_labels=sentiwordnet_predictions,
                             classes=['positive','negative'])

#### VADER lexicon

In [None]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

def analyze_sentiment_vader_lexicon(review, threshold=0.1, verbose=False):
    #pre-process text
    review = normalize_accented_characters(review)
    review = html_parser.unescape(review)
    #analyze the sentiment for review
    analyzer = SentimentIntensityAnalyzer()
    scores = analyzer.polarity_scores(review)
    #get aggregate scores and final sentiment
    agg_score = scores['compound']
    final_sentiment = 'positive' if agg_score >= threshold else 'negative'
    
    if verbose:
        #display detailed sentiment statistic
        positive = str(round(scores['pos'], 2)*100) + '%'
        final = round(agg_score, 2)
        negative = str(round(scores['neg'], 2)*100)+'%'
        neutral = str(round(scores['neu'], 2)*100)+'%'
        sentiment_frame = pd.DataFrame([[final_sentiment, final, positive, negative, neutral]],
                                      columns=pd.MultiIndex.from_product([['SENTIMENT STATS:'],
                                                                          ['Predicted Sentiment','Polarity Score','Positive','Negative','Neutral']]))
        print(sentiment_frame)
        
    return final_sentiment

In [None]:
#get detailed sentiment statistics
for review, review_sentiment in sample_data:
    print('Review:')
    print(review)
    print()
    print('Labeled Sentiment:', review_sentiment)
    print()
    final_sentiment = analyze_sentiment_vader_lexicon(review,threshold=0.1,verbose=True)
    print('-'*60)

In [None]:
#predict sentiment for test movie reviews dataset
vader_predictions = [analyze_sentiment_vader_lexicon(review, threshold=0.1) for review in test_reviews]

#get model performance statistics
print('Performance metrics:')
display_evaluation_metrics(true_labels=test_sentiments, predicted_labels=vader_predictions, positive_class='positive')

print('\n Confusion Matrix:')

display_confusion_matrix(true_labels=test_sentiments,
                        predicted_labels=vader_predictions,
                        classes=['positive','negative'])
print('\nClassification report:')
display_classification_report(true_labels=test_sentiments,
                             predicted_labels=vader_predictions,
                             classes=['positive', 'negative'])

#### Pattern lexicon

In [None]:
from pattern.en import sentiment, mood, modality
def analyze_sentiment_pattern_lexicon(review, threshold=0.1, verbose=False):
    #pre-process text
    review = normalize_accented_characters(review)
    review = html_parser.unescape(review)
    review = strip_html(review)
    
    #analyze sentiment for the text document
    analysis = sentiment(review)
    sentiment_score = round(analysis[0], 2)
    sentiment_subjectivity = round(analysis[1],2)
    
    #get final sentiment
    final_sentiment = 'positive' if sentiment_score >= threshold else 'negative'
    
    if verbose:
        #display detailed sentiment statistics
        sentiment_frame = pd.DataFrame([[final_sentiment, sentiment_score, sentiment_subjectivity]],
                                      columns=pd.MultiIndex.from_product([['SENTIMENT STATS:'],
                                                                          ['Predicted Sentiment','Polarity Score','Subjectivity Score']]))
        
        print(sentiment_frame)
        assessment = analysis.assessments
        assessment_frame = pd.DataFrame(assessment, columns=pd.MultiIndex.from_product([['DETAILED ASSESSMENT STATS:'],
                                                                                        ['Key Terms','Polarity Score',
                                                                                         'Subjectivity Score','Type']]))
        print(assessment_frame)
        print()
    return final_sentiment

In [None]:
#get detailed sentiment statistics
for review, review_sentiment in sample_data:
    print('Review:')
    print(review)
    print()
    print('Labeled Sentiment:', review_sentiment)
    print()
    final_sentiment = analyze_sentiment_pattern_lexicon(review, threshold=0.1, verbose=True)
    print('-'*40)

In [None]:
for review, review_sentiment in sample_data:
    print('Review:')
    print(review)
    print('Labeled Sentiment:', review_sentiment)
    print('Mood:', mood(review))
    mod_score = modality(review)
    print('Modality Score:', round(mod_score, 2))
    print('Certainty:', 'Strong' if mod_score>0.5 else 'Medium' if mod_score > 0.35 else 'Low')
    print('-'*60)
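
The chained conditional that maps a modality score to a certainty label may be easier to read as a small helper. The function name `certainty_label` is hypothetical; the cut-offs 0.5 and 0.35 are the ones used in the cell above, and the sample scores are made up.

```python
def certainty_label(mod_score):
    """Map a modality score to the three certainty buckets used in this notebook."""
    if mod_score > 0.5:
        return 'Strong'
    if mod_score > 0.35:
        return 'Medium'
    return 'Low'

# sample scores, made up for illustration (boundary values included)
for score in (0.75, 0.5, 0.4, 0.35, -0.2):
    print(score, '->', certainty_label(score))
```

Both comparisons are strict, so a score of exactly 0.5 is 'Medium' and exactly 0.35 is 'Low'.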

In [None]:
#predict sentiment for test movie reviews dataset
pattern_predictions = [analyze_sentiment_pattern_lexicon(review, threshold=0.1) for review in test_reviews]

#get model performance statistics
print('Performance metrics:')
display_evaluation_metrics(true_labels=test_sentiments,predicted_labels=pattern_predictions,positive_class='positive')

print('\nConfusion Matrix:')
display_confusion_matrix(true_labels=test_sentiments, predicted_labels=pattern_predictions, classes=['positive','negative'])

print('\nClassification report:')
display_classification_report(true_labels=test_sentiments, predicted_labels=pattern_predictions, classes=['positive','negative'])