<a href="https://colab.research.google.com/github/DmitryKutsev/NIS_SentiFrame/blob/master/virt_udpipe.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
%%capture
!pip install spacy_udpipe
import spacy_udpipe

In [2]:
%%capture
!pip install pymorphy2[fast]
from pymorphy2 import MorphAnalyzer 
morph = MorphAnalyzer()

In [3]:
import os
import unicodedata
import json
import pandas as pd
import numpy as np
from tqdm import tqdm
from collections import Counter, defaultdict
from sklearn.metrics.pairwise import cosine_similarity

# Sentence-level experiment

For further exploration of the semantic axis method we make use of BERT embeddings.

Seed verbs remain the same as in the word-level experiments, but may be subject to change later.

1.   We construct seed sentences with regard to the arguments of the test sentence. For example, to test a sentence like 'Силовики вломились к журналисту', we compute the semantic axis as follows:

*   Replace the target verb with a seed verb.
*   Make changes to cases of arguments if necessary.
*   Repeat for each seed verb to construct seed sentences.

2.   We embed the sentences in each of the two seed groups and average the embeddings within each group.

3.   We compute the semantic axis for the test sentence by subtracting the mean embedding of the negative seed group from the mean embedding of the positive seed group.

4.   We embed the test sentence and measure its cosine similarity to the axis.

5.   We repeat the previous steps for each test sentence with the predicate in question.

6. The resulting dataset for all test sentences will hopefully look like this:

| Predicate (verb) | Polarity | Text               | Similarity |
| :----            | :----    | :----              | ----:      |
| защищать         | pos      | он защищает его    | 0.333      |
| защищать         | pos      | суд защищает права | 0.321      |
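
Steps 2 to 4 above can be sketched with NumPy alone. This is a minimal illustration, not the notebook's implementation: the two-dimensional vectors are made-up stand-ins for BERT sentence embeddings, and the function names `semantic_axis` and `axis_similarity` are hypothetical.

```python
import numpy as np

def semantic_axis(pos_vectors, neg_vectors):
    """Mean of the positive seed embeddings minus mean of the negative ones."""
    return np.mean(pos_vectors, axis=0) - np.mean(neg_vectors, axis=0)

def axis_similarity(vector, axis):
    """Cosine similarity between one embedding and the axis."""
    return float(np.dot(vector, axis) /
                 (np.linalg.norm(vector) * np.linalg.norm(axis)))

# Toy "embeddings" (assumption: real ones would be 768-dimensional BERT vectors)
pos = np.array([[1.0, 0.0], [1.0, 0.2]])     # positive seed sentences
neg = np.array([[-1.0, 0.0], [-1.0, -0.2]])  # negative seed sentences

axis = semantic_axis(pos, neg)
score_pos = axis_similarity(np.array([0.9, 0.1]), axis)   # close to +1
score_neg = axis_similarity(np.array([-0.8, 0.0]), axis)  # close to -1
```

A score near +1 means the test sentence points towards the positive seed group, and a score near -1 means it points towards the negative one.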


# BERT-as-service

Check out [this issue](https://github.com/hanxiao/bert-as-service/issues/380) and "make sure Colab is using Tensorflow 1.x, because bert-serving-start doesn't currently work with TF 2.1 and nohup hides the output of the command failing".

Also make sure you're using a GPU accelerator.

In [None]:
%tensorflow_version 1.x
# import tensorflow as tf
# print (tf.__version__)

TensorFlow 1.x selected.


In [None]:
%%capture
!pip install -U bert-serving-server[http]
!pip install bert-serving-client  # client, independent of `bert-serving-server`

In [None]:
%%capture
!wget https://storage.googleapis.com/bert_models/2018_11_23/multi_cased_L-12_H-768_A-12.zip
!unzip /content/multi_cased_L-12_H-768_A-12.zip

In [None]:
!nohup bert-serving-start -model_dir=./multi_cased_L-12_H-768_A-12 > out.file 2>&1 &

In [None]:
from bert_serving.client import BertClient
bc = BertClient()

In [None]:
encoded_test = bc.encode(['First do it', 
                          'then do it right', 'then do it better'
                          ])
len(encoded_test[0])

768

# UDPipe

In [12]:
spacy_udpipe.download("ru")
syntagrus = spacy_udpipe.load("ru")

Downloaded pre-trained UDPipe model for 'ru' language


In [None]:
# with open('16.txt') as f:
#     text = f.read()

# verbs = list(pd.read_csv('cross_seminar_task.csv', sep='\t')['verb'])

def text2ud(text):
    udpiped = []
    doc = syntagrus(text)
    doc_len = len(doc)
    for i, token in enumerate(doc):
        # skip verbs that are too close to either edge of the document
        if i <= 10 or i >= doc_len-10:
            continue
        if token.lemma_ in verbs:
            new_entry = {token.lemma_: []}
            for t in reversed(doc[i-10:i]):
                if t.head.lemma_ == token.lemma_:
                    new_entry[token.lemma_].append([t.text, t.lemma_, t.pos_, t.dep_])
            for t in doc[i:i+10]:
                if t.head.lemma_ == token.lemma_:
                    new_entry[token.lemma_].append([t.text, t.lemma_, t.pos_, t.dep_])
            udpiped.append(new_entry)
            
    with open('result.json', 'w') as f:
        json.dump(udpiped, f, ensure_ascii=False, indent=4)

# text2ud(text[:1000])
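
The window logic of `text2ud` can be tried without downloading a UDPipe model. The sketch below is an assumption-laden simplification: `Tok` is a hypothetical stand-in for a spaCy token, the dependency annotation of the example sentence is written by hand, and heads are matched by index rather than by lemma as in the cell above.

```python
from dataclasses import dataclass

@dataclass
class Tok:
    text: str
    lemma: str
    pos: str
    dep: str
    head: int  # index of the syntactic head within the sentence

def dependents_in_window(tokens, verbs, window=10):
    """For each token whose lemma is in `verbs`, collect its direct
    dependents found within `window` tokens on either side."""
    result = []
    for i, tok in enumerate(tokens):
        if tok.lemma not in verbs:
            continue
        # clamp the window to the sentence boundaries
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        deps = [[t.text, t.lemma, t.pos, t.dep]
                for j, t in enumerate(tokens[lo:hi], start=lo)
                if j != i and t.head == i]
        result.append({tok.lemma: deps})
    return result

# Hand-annotated example: 'Силовики вломились к журналисту'
sent = [
    Tok('Силовики', 'силовик', 'NOUN', 'nsubj', 1),
    Tok('вломились', 'вломиться', 'VERB', 'ROOT', 1),
    Tok('к', 'к', 'ADP', 'case', 3),
    Tok('журналисту', 'журналист', 'NOUN', 'obl', 1),
]
found = dependents_in_window(sent, {'вломиться'})
```

Clamping the window instead of skipping tokens near the edges is a design choice of this sketch; it lets short sentences be processed as well.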

# Semantic axis method

In [None]:
class SemanticAxis():

    def __init__(self):
        self.seed0 = set()
        self.seed1 = set()
        self.targets = set()
        self.axis_vector = None
        self.axis_similarities = None

    def add_seed(self, seed: set, seed_id: int):
        if seed_id:
            self.seed1 = set(seed)
        else:
            self.seed0 = set(seed)
    def add2seed(self, seed: set, seed_id: int):
        if seed_id:
            self.seed1.update(seed)
        else:
            self.seed0.update(seed)
    def flush_seed(self, seed_id=None, flush_both_seeds=True):
        if seed_id is not None:
            if seed_id:
                self.seed1 = set()
            else:
                self.seed0 = set()
        else:
            self.seed0, self.seed1 = set(), set()
    
    def add_targets(self, target):
        self.targets = target
    def add2targets(self, target):
        self.targets.update(target)
    def flush_targets(self):
        self.targets = set()
    
    def compute_bert_axis(self, bert_client):
        assert len(self.seed0) > 0, 'Seed0 set is empty.'
        assert len(self.seed1) > 0, 'Seed1 set is empty.'
        self.bert_client = bert_client

        # Fix the order of the targets once, so that keys and vectors stay
        # aligned (self.targets may be a set, which cannot be indexed).
        targets = list(self.targets)
        target_vectors = self.bert_client.encode(targets)
        # Mean embedding of each seed group
        seed_vectors = [self.bert_client.encode(list(s)).mean(axis=0)
                        for s in (self.seed0, self.seed1)]

        # The axis points from seed0 towards seed1
        self.axis_vector = seed_vectors[1] - seed_vectors[0]

        self.axis_similarities = {
            target: cosine_similarity(
                np.atleast_2d(vector),
                np.atleast_2d(self.axis_vector)
            ).item()
            for target, vector in zip(targets, target_vectors)}

In [None]:
sa = SemanticAxis()

In [None]:
sa.add_targets(list2check)
sa.add_seed(['разрушать'], 0)
sa.add_seed(['ценить'], 1)

In [None]:
sa.compute_bert_axis(bc)
df = pd.DataFrame({'target':sa.axis_similarities.keys(),
                   'similarity':sa.axis_similarities.values()})
df.sort_values(by='similarity')

# Triples (nsubj, root, obj)

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
!unrar x ""

In [6]:
def ovd_loader():
    path = 'OVD-Info/2019'
    for month in os.listdir(path):
        for filename in os.listdir('{}/{}'.format(path, month)):
            filepath = '{}/{}/{}'.format(path, month, filename)
            with open(filepath, 'r', encoding='utf-8') as f:
                yield filepath, f.read()

In [7]:
d = {x[0]:unicodedata.normalize("NFKD", x[1]) for x in ovd_loader()}
texts = pd.DataFrame({'url':d.keys(),
                      'text':d.values()})
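
A side effect of NFKD normalization is worth checking before further processing. The sketch below uses a made-up string: NFKD turns a no-break space into a plain space, but it also splits `й` and `ё` into a base letter plus a combining mark, which may matter for tools that expect precomposed Cyrillic letters. NFKC is shown only as an alternative to consider, not as what the cell above does.

```python
import unicodedata

raw = 'Суд\xa0защищает «своих»: й, ё'
nfkd = unicodedata.normalize('NFKD', raw)
nfkc = unicodedata.normalize('NFKC', raw)

# Both forms replace the no-break space (U+00A0) with a plain space.
has_nbsp_nfkd = '\xa0' in nfkd
has_nbsp_nfkc = '\xa0' in nfkc

# NFKD decomposes 'й' into 'и' + U+0306 and 'ё' into 'е' + U+0308,
# so the string grows by two code points; NFKC keeps them composed.
extra_nfkd = len(nfkd) - len(raw)
extra_nfkc = len(nfkc) - len(raw)
```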

In [198]:
texts['spacy_doc'] = texts.text.progress_apply(syntagrus)

100%|██████████| 3073/3073 [17:20<00:00,  2.95it/s]


# WIP

In [187]:
# TODO: rewrite for spaCy docs
def triples_mapping(text, tags=tags, udpiper=syntagrus):
    doc = udpiper(text)

    for sent in doc.sents:
        sent_triples = []
        root_triple = defaultdict(list)
        # TODO: rewrite, keep lemmas and POS tags
        root_triple['root'] = sent.root
        for child in sent.root.children:
            if child.dep_ in tags:
                root_triple[child.dep_].append(child)
        sent_triples.append(dict(root_triple))
    return sent_triples

In [189]:
# tqdm.pandas()
# tags = 'nsubj obj iobj'.split()
# texts['triples'] = texts.text.progress_apply(triples_mapping)

100%|██████████| 3073/3073 [17:13<00:00,  2.97it/s]


In [201]:
pd.to_pickle(texts.spacy_doc, 'ovd-info_spacy_docs.pkl')

In [86]:
triple_set = Counter()

# TODO: rewrite for the column with spaCy docs
for text, text_triples in tripled.items():
    for sent_triples in text_triples:
        d = sent_triples[0]
        # drop sentences with more than one iobj
        if 'iobj' in d:
            if len(d['iobj']) > 1:
                continue
        # drop sentences with more than one nsubj
        if 'nsubj' in d:
            if len(d['nsubj']) > 1:
                continue
            # TODO: rewrite, because iobj are currently left out
            else:
                if 'obj' in d:
                    # for each object
                    for obj in d['obj']:
                        tup = tuple(map(str, (d['nsubj'][0], d['root'], obj)))
                        triple_set.update({tuple(t.lower() for t in tup)})
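
The filtering rules of the loop above can be checked on toy data. This is a sketch under assumptions: the per-sentence dicts hold plain strings instead of spaCy tokens, the example sentences are invented, and `count_triples` is a hypothetical helper name.

```python
from collections import Counter

def count_triples(sentences):
    """Count lowercased (nsubj, root, obj) triples, skipping sentences with
    several iobj and keeping only those with exactly one nsubj."""
    counts = Counter()
    for d in sentences:
        if len(d.get('iobj', [])) > 1:
            continue
        if len(d.get('nsubj', [])) != 1:
            continue
        # one triple per direct object
        for obj in d.get('obj', []):
            triple = (d['nsubj'][0], d['root'], obj)
            counts[tuple(t.lower() for t in triple)] += 1
    return counts

sentences = [
    {'root': 'защищает', 'nsubj': ['Суд'], 'obj': ['права']},
    {'root': 'защищает', 'nsubj': ['суд'], 'obj': ['права']},
    {'root': 'задержали', 'obj': ['активиста']},  # no subject: skipped
    {'root': 'избили', 'nsubj': ['Полицейские', 'силовики'],
     'obj': ['журналиста']},                      # two subjects: skipped
]
counts = count_triples(sentences)
```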

In [None]:
triple_set.most_common(200)

In [92]:
with open('tripled_ovd-info_2019.json', 'w', encoding='utf-8') as f:
    json.dump({' '.join(list(tr)):v for tr, v in triple_set.items()}, 
        f, ensure_ascii=False)

In [148]:
# Group the triples by the normal form of their predicate,
# keeping only the parses tagged as a verb or an infinitive
predicates = defaultdict(list)
for tr, v in triple_set.most_common():
    for var in morph.parse(tr[1]):
        if ('VERB' in var.tag) or ('INFN' in var.tag):
            predicates[var.normal_form].append((tr, v))

In [128]:
!wget 'https://github.com/DmitryKutsev/NIS_SentiFrame/raw/master/annotations/polarity_annotation/3annotators_agree_on_these.zip'
!unzip '/content/3annotators_agree_on_these.zip'
with open('annotated_negative_3annotators_agree.json', encoding='utf-8') as f:
    nverbs = set(json.load(f))
with open('annotated_positive_3annotators_agree.json', encoding='utf-8') as f:
    pverbs = set(json.load(f))

In [256]:
p_seed = ['одобрять', 'хвалить', 'поощрять', 'любить',
                'обожать', 'ценить','превозносить', 
                # 'восхищаться', 'восторгаться', 'гордиться',
                # added
                'уважать'
                ]
n_seed = ['ненавидеть', 'ругать', 'убивать', 'разрушать',
        #   'злиться', 'негодовать', 
          'порицать', 'осуждать', 'обвинять', 'наказывать', ]

In [21]:
test_0 = 'избить задержать арестовать'.split()
test_1 = 'поддержать оправдать защищать'.split()
test_0 + test_1

['избить', 'задержать', 'арестовать', 'поддержать', 'оправдать', 'защищать']

In [156]:
test_triples = defaultdict(list)

for v in test_0 + test_1:
    for triple in predicates[v]:
        test_triples[v].append(triple[0])

In [242]:
# no prepositions for now

# Object that builds the seed sentences
# and the test sentence
# for each triple (a0, verb, a1)
class SentSeeds():

    def __init__(self, triple, seed0, seed1, morph,
                 casefile = 'cases.json',
                #  TODO: implement logging
                 logfile='SemanticAxis_log.txt'):
        
        self.triple = triple
        self.morph = morph
        self.casefile = casefile
        self.logfile = logfile

        self.seed_cases = self.read_seed_cases()
        self.parsed_triple = self.parse_triple()
        self.capitals = self.capitalization_check()
        self.grammemes2inflect = self.extract_grammemes()

        self.seed0 = seed0
        self.seed1 = seed1

        self.seed0_sents = self.stitch_seed(self.seed0)
        self.seed1_sents = self.stitch_seed(self.seed1)

    def read_seed_cases(self):
        # expects a dict with one nested dict per seed verb;
        # each nested dict maps an argument to its case
        with open(self.casefile, 'r', encoding='utf-8') as f:
            return json.load(f)

    def parse_triple(self):
        # morphological parse of every element of the triple
        a0, verb, a1 = tuple(map(self.morph.parse, self.triple))
        # for homographs, take the most probable parse
        return a0[0], verb[0], a1[0]
        
    def capitalization_check(self):
        return tuple([1 if el[0].isupper() else 0 for el in self.triple])

    def inflect_sent(self, seed_verb):
        # the arguments are already parsed
        a0, a1 = self.parsed_triple[0], self.parsed_triple[2]
        # the seed verb is not parsed yet
        v = self.morph.parse(seed_verb)[0]
        # grammemes used to put the words into the required form
        grs = self.grammemes2inflect
        # debug
        # print(self.grammemes2inflect[2])
        verb = v.inflect(grs[1]).word

        d = self.seed_cases
        # in most cases this should be 'nomn'
        a0 = a0.inflect({d[seed_verb]['a0']}).word
        # while this can be any case
        a1 = a1.inflect({d[seed_verb]['a1']}).word

        return tuple([word.capitalize() if self.capitals[i] else word 
                for i, word in enumerate([a0, verb, a1])])

    def stitch_seed(self, seed):
        # seed sentences are added without sentence-final punctuation
        return tuple([' '.join(self.inflect_sent(v)) for v in seed])
        

    def extract_grammemes(self):
        a0, verb, a1 = self.parsed_triple
        a0, verb, a1 = a0.tag, verb.tag, a1.tag
        # person and gender are mutually exclusive grammemes in a verb parse
        grammemes = {g for g in (verb.tense, verb.number, 
                                 verb.gender, verb.person) if g}
        # grammemes that can be used
        # to put the words into the required form
        return tuple([{a0.case}, grammemes, {a1.case}])

In [257]:
x = SentSeeds(triple = ('Суд', 'приговорил', 'Елену'),
              seed0=n_seed, seed1=p_seed+['нравиться'], morph=morph)

In [258]:
[sent for sent in x.seed1_sents]

['Суд одобрял Елену',
 'Суд хвалил Елену',
 'Суд поощрял Елену',
 'Суд любил Елену',
 'Суд обожал Елену',
 'Суд ценил Елену',
 'Суд превозносил Елену',
 'Суд уважал Елену',
 'Суд нравился Елене']

# Argument case dictionaries

In [None]:
!wget 'https://raw.githubusercontent.com/DmitryKutsev/NIS_SentiFrame/master/annotations/SENTIFRAME%20-%20case_annotation.csv'

In [148]:
casefile = pd.read_csv('SENTIFRAME - case_annotation.csv')

In [198]:
casedict = defaultdict(dict)
pol_col = casefile.polar_or_not.to_list()
a0_col = casefile['падеж первого аргумента, по умолчанию nomn'].to_list()
a1_col = casefile['падеж второго аргумента в нотации pymorphy2, по умолчанию accs'].to_list()

preps = ['с', 'в', 'против', 'за', 'на', 'от', 'над', 'у', 'перед', 'из-за']

for i, v in enumerate(casefile.verb.to_list()):
    casedict[v].update({'polarity':pol_col[i]})
    casedict[v].update({'a0':(a0_col[i] if pd.notna(a0_col[i]) else 'nomn')})
    if pd.notna(a1_col[i]) and (a1_col[i] != '0'):
        casedict[v].update({'a1':a1_col[i]})
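    # the comment column may hold the preposition required by the verb
    prep = casefile['Комментарий'].to_list()[i]
    # a Latin 'c' is treated as the Cyrillic preposition 'с'
    if prep == 'c':
        casedict[v].update({'preposition':'с'})
    elif prep in preps:
        casedict[v].update({'preposition':prep})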

In [204]:
# polarity is None for now for verbs without a polarity annotation
casedict['хвалить'] = {'polarity': None, 'a0': 'nomn', 'a1': 'accs'}
casedict['обожать'] = {'polarity': None, 'a0': 'nomn', 'a1': 'accs'}
casedict['превозносить'] = {'polarity': None, 'a0': 'nomn', 'a1': 'accs'}
casedict['ругать'] = {'polarity': None, 'a0': 'nomn', 'a1': 'accs'}
casedict['порицать'] = {'polarity': None, 'a0': 'nomn', 'a1': 'accs'}
casedict['наказывать'] = {'polarity': None, 'a0': 'nomn', 'a1': 'accs'}

In [206]:
casedict = dict(casedict)
with open('cases.json', 'w', encoding='utf-8') as f:
    json.dump(casedict, f, ensure_ascii=False)
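
The resulting structure of `cases.json` can be illustrated without pandas. This is only a sketch: the annotation rows below are invented for the example rather than taken from the CSV, and `build_case_entry` is a hypothetical helper that mirrors the defaults used in the loop above.

```python
import json

PREPOSITIONS = {'с', 'в', 'против', 'за', 'на', 'от', 'над', 'у',
                'перед', 'из-за'}

def build_case_entry(polarity, a0=None, a1=None, comment=None):
    """Build one per-verb entry: a0 defaults to 'nomn', a1 is stored only
    when annotated, and a known preposition is taken from the comment."""
    entry = {'polarity': polarity, 'a0': a0 or 'nomn'}
    if a1 and a1 != '0':
        entry['a1'] = a1
    if comment == 'c':  # Latin 'c' typed instead of Cyrillic 'с'
        comment = 'с'
    if comment in PREPOSITIONS:
        entry['preposition'] = comment
    return entry

# Invented rows: (verb, polarity, a0, a1, comment)
rows = [
    ('хвалить', None, None, 'accs', None),
    ('нравиться', None, None, 'datv', None),
    ('смеяться', None, None, 'ablt', 'над'),
]
cases = {verb: build_case_entry(pol, a0, a1, comment)
         for verb, pol, a0, a1, comment in rows}

# ensure_ascii=False keeps the Cyrillic keys readable in the JSON text
dumped = json.dumps(cases, ensure_ascii=False)
restored = json.loads(dumped)
```

`SentSeeds.inflect_sent` then looks up `d[seed_verb]['a0']` and `d[seed_verb]['a1']`, so every seed verb needs both keys in the file.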

# Each example on its own axis