# Semantic axis method

The goal of the method is to rank words by polarity, from negative to positive or vice versa. The method relies on the vectors of base words of each polarity (seeds).

In [12]:
from gensim.models import KeyedVectors
from gensim.models.word2vec import Word2Vec
import numpy as np
import os

In [16]:
model = KeyedVectors.load_word2vec_format(os.path.join("wv", "model.bin"), binary=True)

Next, we build four new embeddings, each representing the seeds of one polarity for one topic. To do this, we read our seeds separately for each topic. Then, for each topic, we average the seed embeddings of each polarity using the following formulas, where $E(w)$ is the embedding of word $w$, and $n$ and $m$ are the numbers of positive and negative seeds:

$$V^+ = \frac{1}{n}\sum_{i=1}^n E(w_{i}^+)$$
$$V^- = \frac{1}{m}\sum_{i=1}^m E(w_{i}^-)$$
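
As a minimal sketch of the averaging formulas, the snippet below computes a seed centroid from a toy embedding table. The dictionary `toy_embeddings`, the helper `seed_centroid`, and the choice to skip out-of-vocabulary seeds are assumptions made for illustration; they are not part of the notebook.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for the word2vec model.
toy_embeddings = {
    "good_ADJ": np.array([1.0, 0.0, 1.0]),
    "tasty_ADJ": np.array([3.0, 2.0, 1.0]),
    "bad_ADJ": np.array([-1.0, 0.0, 0.0]),
}

def seed_centroid(seeds, embeddings):
    """Average the vectors of the seeds found in `embeddings` (V+ or V-)."""
    vectors = [embeddings[w] for w in seeds if w in embeddings]
    if not vectors:
        raise ValueError("none of the seeds is in the vocabulary")
    return np.mean(vectors, axis=0)

# "missing_ADJ" is not in the toy table and is silently skipped (an assumption).
v_pos = seed_centroid(["good_ADJ", "tasty_ADJ", "missing_ADJ"], toy_embeddings)
print(v_pos)  # [2. 1. 1.]
```

Skipping unknown seeds avoids the `KeyError` that a direct `model[word]` lookup raises for a word outside the vocabulary.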

In [27]:
positive_food = []
negative_food = []
positive_service = []
negative_service = []

# Each seed line is whitespace-separated: the token is the second field
# and the polarity label ("1" = positive) is the last field.
with open(os.path.join("seeds", "Food_words.txt"), "r") as fd:
    food_words = fd.read().split("\n")

for word in food_words:
    line = word.split()
    if not line:  # skip blank lines, e.g. the one left by a trailing newline
        continue
    if line[-1] == "1":
        positive_food.append(model[line[1]])
    else:
        negative_food.append(model[line[1]])

with open(os.path.join("seeds", "Service_words.txt"), "r") as fd:
    service_words = fd.read().split("\n")

for word in service_words:
    line = word.split()
    if not line:
        continue
    if line[-1] == "1":
        positive_service.append(model[line[1]])
    else:
        negative_service.append(model[line[1]])

# Centroids V+ and V- for each topic
center_pos_food = np.mean(positive_food, axis=0)
center_neg_food = np.mean(negative_food, axis=0)
center_pos_service = np.mean(positive_service, axis=0)
center_neg_service = np.mean(negative_service, axis=0)

In [28]:
print(center_pos_food.shape)
print(center_neg_food.shape)
print(center_pos_service.shape)
print(center_neg_service.shape)

(300,)
(300,)
(300,)
(300,)


Next, we build a "semantic axis" for each topic using the following formula:

$$V_{axis} = V^+ - V^-$$

This gives us two axes. For each token in the corpus we compute the cosine distance (1 minus the cosine similarity) to the corresponding axis: the lower the distance, the more positively polarized the word. Sorting by this distance ranks all tokens from the most positive to the most negative.
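
A toy sketch of this scoring step follows. It assumes hypothetical 2-dimensional centroids and computes the distance with NumPy; `cosine_distance` is an illustrative helper that returns the same quantity as `scipy.spatial.distance.cosine`, namely 1 minus the cosine similarity.

```python
import numpy as np

def cosine_distance(u, v):
    """1 - cosine similarity between vectors u and v."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical 2-dimensional centroids, for illustration only.
v_pos = np.array([1.0, 0.0])
v_neg = np.array([-1.0, 0.0])
axis = v_pos - v_neg  # points from the negative pole to the positive pole

toy_words = {
    "positive-like": np.array([2.0, 0.0]),   # same direction as the axis
    "neutral-like": np.array([0.0, 3.0]),    # orthogonal to the axis
    "negative-like": np.array([-1.0, 0.0]),  # opposite direction
}
scores = {name: float(cosine_distance(vec, axis)) for name, vec in toy_words.items()}
print(scores)
# {'positive-like': 0.0, 'neutral-like': 1.0, 'negative-like': 2.0}
```

Distances lie between 0 (same direction as the axis) and 2 (opposite direction), with values near 1 for words unrelated to the axis.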

In [29]:
sem_axis_food = center_pos_food - center_neg_food
sem_axis_service = center_pos_service - center_neg_service

In [30]:
from scipy.spatial.distance import cosine

print(cosine(model["вкусно_ADV"], sem_axis_food))

1.0010059647029266


In [41]:
class WeightedWord:
    def __init__(self, word, cosine, pos):
        self.word = word
        self.cosine = cosine
        self.pos = pos
        
    def __lt__(self, other):
        return self.cosine < other.cosine
    
    def __eq__(self, other):
        return self.cosine == other.cosine

In [5]:
from conllu import parse
from tqdm import tqdm
from collections import Counter
import os
import pickle

weighted_food = []
processed_tokens = set()  # keys already scored; a set keeps membership checks fast
weighted_service = []

allowed_tags = ["VERB", "NOUN", "ADJ"]

for file in tqdm(os.listdir("parsed_train")):
    # Close each file as soon as it has been parsed
    with open(os.path.join("parsed_train", file), "r") as fd:
        conllu_text = parse(fd.read())
    for sentence in conllu_text:
        for word in sentence:
            # Model keys have the form "lemma_POS", e.g. "вкусный_ADJ"
            model_lemma = word['lemma'].lower().replace("&quot", "") + '_' + word['upostag']
            if model_lemma in processed_tokens:
                continue

            if word["upostag"] in allowed_tags and model_lemma in model:
                # Score the token against both axes (cosine distance)
                weighted_food.append(WeightedWord(model_lemma,
                                                 cosine(model[model_lemma], sem_axis_food),
                                                 word["upostag"]
                                                ))
                weighted_service.append(WeightedWord(model_lemma,
                                                    cosine(model[model_lemma], sem_axis_service),
                                                    word["upostag"]))
                processed_tokens.add(model_lemma)

100%|██████████| 19034/19034 [01:28<00:00, 215.04it/s]


In [86]:
sorted_food = sorted(weighted_food)
sorted_service = sorted(weighted_service)

In [87]:
with open(os.path.join("word_lists", "semantic_axis_method", "food.csv"), "w") as final_fd:
    for word in sorted_food:
        final_fd.write(word.word + "\t" + str(word.cosine) + "\n")
        
with open(os.path.join("word_lists", "semantic_axis_method", "service.csv"), "w") as final_fd:
    for word in sorted_service:
        final_fd.write(word.word + "\t" + str(word.cosine) + "\n")

We now have a ranked list of tokens, one for each topic. Let's look at which tokens are the most strongly marked. We expect the food and service lists to be very similar, since we have rather few seeds. Moreover, these seeds are not topic-specific (for example, "плохой" ("bad") fits both food and service).
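
One hedged way to check this expectation is to measure how much the extremes of the two rankings overlap. The helper `top_k_overlap` and the toy rankings below are hypothetical; in the notebook the inputs would be the word strings taken from `sorted_food` and `sorted_service`.

```python
def top_k_overlap(ranking_a, ranking_b, k):
    """Jaccard overlap between the first k items of two ranked word lists."""
    top_a, top_b = set(ranking_a[:k]), set(ranking_b[:k])
    union = top_a | top_b
    return len(top_a & top_b) / len(union) if union else 0.0

# Toy rankings (most positive first), for illustration only.
food_ranking = ["great_ADJ", "varied_ADJ", "pleasant_ADJ", "bad_ADJ"]
service_ranking = ["pleasant_ADJ", "friendly_ADJ", "great_ADJ", "bad_ADJ"]

print(top_k_overlap(food_ranking, service_ranking, k=3))  # 0.5
```

A value close to 1 would mean the two topics share almost the same extreme words; a value close to 0 would mean the axes separate the topics well.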

In [137]:
print("~~~FOOD~~~")
print("MAX:")
for word in sorted_food[-5:]:
    print(word.word, word.cosine)
print("MIN:")
for word in sorted_food[:5]:
    print(word.word, word.cosine)
    

print("\n\n~~~SERVICE~~~")
print("MAX:")
for word in sorted_service[-5:]:
    print(word.word, word.cosine)
print("MIN:")
for word in sorted_service[:5]:
    print(word.word, word.cosine)

~~~FOOD~~~
MAX:
плохо_NOUN 1.4711369276046753
нехороший_ADJ 1.4964621663093567
плохой_NOUN 1.5298885107040405
плохо_ADJ 1.6258410215377808
плохой_ADJ 1.6771029233932495
MIN:
величественный_ADJ 0.5564294457435608
великолепный_ADJ 0.5609675943851471
обширный_ADJ 0.5676152110099792
разнообразный_ADJ 0.5677912831306458
живописный_ADJ 0.5748997330665588


~~~SERVICE~~~
MAX:
люмпен_NOUN 1.2843709290027618
орать_NOUN 1.2850382924079895
обозвать_VERB 1.285082459449768
дезертир_NOUN 1.296842485666275
деморализованный_ADJ 1.2989414930343628
MIN:
приятный_ADJ 0.3453672528266907
приветливый_ADJ 0.36830490827560425
радушный_ADJ 0.4070584177970886
дружелюбный_ADJ 0.4409315586090088
милый_ADJ 0.442871630191803


Let's apply z-score normalization to the values.
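
As a quick illustration, the z-score of a value $x$ is $z = (x - \mu) / \sigma$, where $\mu$ and $\sigma$ are the mean and standard deviation of the whole list. The sketch below applies it to three made-up distances; the array `distances` is a toy input, not data from the notebook.

```python
import numpy as np

# Toy cosine distances, for illustration only.
distances = np.array([0.5, 1.0, 1.5])

# np.std defaults to the population standard deviation (ddof=0),
# which is also what the notebook's np.std calls compute.
z_scores = (distances - distances.mean()) / distances.std()
print(z_scores)  # approximately [-1.22  0.    1.22]
```

After normalization the values have zero mean and unit standard deviation, so scores from different methods become comparable in scale.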

In [138]:
food_mean = np.mean([word.cosine for word in sorted_food])
service_mean = np.mean([word.cosine for word in sorted_service])

food_std = np.std([word.cosine for word in sorted_food])
service_std = np.std([word.cosine for word in sorted_service])

print("food mean", food_mean)
print("food std", food_std)
print("service mean", service_mean)
print("service std", service_std)

food mean 0.9976846222538592
food std 0.10944905145854601
service mean 1.005963000205442
service std 0.09811330958646661


In [140]:
from collections import Counter

def z_score_normalization(word_list, list_mean, list_std, if_counter=False):
    normalized_list = {}
    for word in word_list:
        if if_counter:
            normalized_list[word] = (word_list[word] - list_mean) / list_std
        else:
            normalized_list[word.word] = (word.cosine - list_mean) / list_std
    return Counter(normalized_list)
        
norm_food = z_score_normalization(sorted_food, food_mean, food_std)
norm_service = z_score_normalization(sorted_service, service_mean, service_std)

In [141]:
print("~~~FOOD~~~")
print("MIN:")
for word in norm_food.most_common()[-5:]:
    print(word)
print("MAX:")
for word in norm_food.most_common(5):
    print(word)
    

print("\n\n~~~SERVICE~~~")
print("MIN:")
for word in norm_service.most_common()[-5:]:
    print(word)
print("MAX:")
for word in norm_service.most_common(5):
    print(word)

~~~FOOD~~~
MIN:
('живописный_ADJ', -3.8628465350147945)
('разнообразный_ADJ', -3.927794105059341)
('обширный_ADJ', -3.9294028181392635)
('великолепный_ADJ', -3.990139905726998)
('величественный_ADJ', -4.031603477874127)
MAX:
('плохой_ADJ', 6.207621647563761)
('плохо_ADJ', 5.739258503504132)
('плохой_NOUN', 4.862571957983156)
('нехороший_ADJ', 4.5571664387097055)
('плохо_NOUN', 4.325778058754002)


~~~SERVICE~~~
MIN:
('милый_ADJ', -5.739194533208467)
('дружелюбный_ADJ', -5.758968319160356)
('радушный_ADJ', -6.1042134337599)
('приветливый_ADJ', -6.499200716166587)
('приятный_ADJ', -6.732988115099437)
MAX:
('деморализованный_ADJ', 2.9861238405246207)
('дезертир_NOUN', 2.9647301338304444)
('обозвать_VERB', 2.844868452820256)
('орать_NOUN', 2.844418289208767)
('люмпен_NOUN', 2.8376163231142533)


# Word Lists

Now we want to obtain the final token lists, taking into account the intersection of all the methods we used. To do this, we add up the metrics for each word accordingly.
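
The combination can be sketched on toy data as a weighted sum restricted to words that appear in both score tables. The helper `combine_scores`, its `weight` parameter, and the toy scores are hypothetical and only illustrate the idea.

```python
from collections import Counter

def combine_scores(first, second, weight=1.0):
    """Weighted sum of two score tables, restricted to words present in both."""
    return Counter({w: weight * first[w] + second[w] for w in first if w in second})

# Toy normalized scores, for illustration only.
sam_scores = Counter({"good_ADJ": -2.0, "bad_ADJ": 3.0, "rare_ADJ": 0.5})
cnn_scores = Counter({"good_ADJ": -1.0, "bad_ADJ": 0.5})

total = combine_scores(sam_scores, cnn_scores, weight=2.0)
print(total.most_common())  # [('bad_ADJ', 6.5), ('good_ADJ', -5.0)]
```

Note that `rare_ADJ` is dropped because it has no score in the second table, so the combined list can only be as long as the shorter input.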

In [142]:
import pickle

with open(os.path.join("word_lists", "CNN", 'sent_food.pickle'), 'rb') as f:
    food_list = pickle.load(f)
    
with open(os.path.join("word_lists", "CNN", 'sent_service.pickle'), 'rb') as f:
    service_list = pickle.load(f)

In [143]:
def mean_metric(counter_list):

    average_metric = {}

    for counter in counter_list:
        for token in counter:
            if token not in average_metric:
                average_metric[token] = [counter[token]]
            else:
                average_metric[token].append(counter[token])
                
    for key in average_metric:
        average_metric[key] = np.mean(average_metric[key])
    return Counter(average_metric)

average_food = mean_metric(food_list)
average_service = mean_metric(service_list)

In [144]:
print("~~~FOOD~~~")
print("MAX:")
for el in average_food.most_common(5):
    print(el)
print("MIN:")   
for el in average_food.most_common()[-5:]:
    print(el)
    
print("\n\n~~~SERVICE~~~")
print("MAX:")
for el in average_service.most_common(5):
    print(el)
print("MIN:")   
for el in average_service.most_common()[-5:]:
    print(el)

~~~FOOD~~~
MAX:
('очень_ADV', 11437.910938599616)
('вкусный_ADJ', 9284.592686381122)
('интерьер_NOUN', 4962.086438708173)
('приятный_ADJ', 4419.284605484837)
('год_NOUN', 3819.742681226211)
MIN:
('оценка_NOUN', -74.68075154055259)
('просить_VERB', -81.60893007901905)
('посетитель_NOUN', -125.6364262850575)
('помещение_NOUN', -147.90723155334854)
('пойти_VERB', -183.5469616511209)


~~~SERVICE~~~
MAX:
('очень_ADV', 9962.569364785297)
('приятный_ADJ', 6230.708526983835)
('вкусный_ADJ', 5238.680879248544)
('уютный_ADJ', 4438.081501664099)
('атмосфера_NOUN', 4315.923531234968)
MIN:
('один_NUM', -198.5470799011896)
('столик_NOUN', -231.35559087586626)
('гость_NOUN', -292.2447034124025)
('внимание_NOUN', -295.53615339319185)
('стол_NOUN', -524.0081696923007)


In [145]:
average_food_mean = np.mean([average_food[word] for word in average_food])
average_service_mean = np.mean([average_service[word] for word in average_service])

average_food_std = np.std([average_food[word] for word in average_food])
average_service_std = np.std([average_service[word] for word in average_service])

print("food mean", average_food_mean)
print("food std", average_food_std)
print("service mean", average_service_mean)
print("service std", average_service_std)

food mean 3.4863766971231622
food std 68.84006361884803
service mean 3.7298273491683043
service std 66.60644373501773


In [146]:
average_norm_food = z_score_normalization(average_food, average_food_mean, average_food_std, if_counter=True)
average_norm_service = z_score_normalization(average_service, average_service_mean, average_service_std, if_counter=True)

In [147]:
print("~~~FOOD~~~")
print("MAX:")
for el in average_norm_food.most_common(5):
    print(el)
print("MIN:")   
for el in average_norm_food.most_common()[-5:]:
    print(el)
    
print("\n\n~~~SERVICE~~~")
print("MAX:")
for el in average_norm_service.most_common(5):
    print(el)
print("MIN:")   
for el in average_norm_service.most_common()[-5:]:
    print(el)

~~~FOOD~~~
MAX:
('очень_ADV', 166.10130730285684)
('вкусный_ADJ', 134.82129187258454)
('интерьер_NOUN', 72.03073038203021)
('приятный_ADJ', 64.14576042865092)
('год_NOUN', 55.43655981579044)
MIN:
('оценка_NOUN', -1.1354889017893648)
('просить_VERB', -1.2361305655859909)
('посетитель_NOUN', -1.8756926736312243)
('помещение_NOUN', -2.1992078492068816)
('пойти_VERB', -2.716925704540391)


~~~SERVICE~~~
MAX:
('очень_ADV', 149.51765893785978)
('приятный_ADJ', 93.48913333982564)
('вкусный_ADJ', 78.59526433697208)
('уютный_ADJ', 66.57541561528545)
('атмосфера_NOUN', 64.74138930222908)
MIN:
('один_NUM', -3.036896971336313)
('столик_NOUN', -3.5294695984713043)
('гость_NOUN', -4.443632089697725)
('внимание_NOUN', -4.493048479407463)
('стол_NOUN', -7.923227355313906)


In [153]:
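def count_total_score(one_list, another_list):
    # Weighted sum over the words present in both lists;
    # the score from the first list gets a weight of 20.
    total_list = {}
    for word in one_list:
        if word not in another_list:
            continue
        total_list[word] = one_list[word] * 20 + another_list[word]
    return Counter(total_list)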

In [154]:
total_food = count_total_score(norm_food, average_norm_food)
total_service = count_total_score(norm_service, average_norm_service)

In [156]:
print("~~~FOOD~~~")
print("MAX:")
for el in total_food.most_common(5):
    print(el)
print("MIN:")   
for el in total_food.most_common()[-5:]:
    print(el)
    
print("\n\n~~~SERVICE~~~")
print("MAX:")
for el in total_service.most_common(5):
    print(el)
print("MIN:")   
for el in total_service.most_common()[-5:]:
    print(el)

~~~FOOD~~~
MAX:
('вкусный_ADJ', 137.47484866640062)
('плохой_ADJ', 124.8155285447863)
('плохо_ADJ', 114.94069716003143)
('плохой_NOUN', 97.20079457112197)
('нехороший_ADJ', 91.09427657657248)
MIN:
('необыкновенный_ADJ', -73.04109101251572)
('великолепный_ADJ', -76.08433495459823)
('живописный_ADJ', -76.77724693908834)
('обширный_ADJ', -77.77615607578247)
('величественный_ADJ', -80.63956970724162)


~~~SERVICE~~~
MAX:
('деморализованный_ADJ', 59.66647881480288)
('дезертир_NOUN', 59.235652612393906)
('обозвать_VERB', 56.83842684217229)
('орать_NOUN', 56.83049384777462)
('люмпен_NOUN', 56.69632846659554)
MIN:
('ласковый_ADJ', -110.2697821090971)
('любезный_ADJ', -111.02047615590729)
('дружелюбный_ADJ', -113.13881994524824)
('радушный_ADJ', -119.29214391109717)
('приветливый_ADJ', -120.35591199195714)


In [157]:
print(len(total_food), len(total_service))

31265 29973


We select 5% of the tokens from each end of the list and label them accordingly.
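
A minimal sketch of this selection on toy data is shown below. The helper `mark_extremes` and its `percent` parameter are hypothetical; the sketch only shows how to take the same number of items from both ends of a `Counter` ranking.

```python
from collections import Counter

def mark_extremes(scores, percent=5):
    """Return (top, bottom) word lists, each holding `percent` % of the ranking."""
    ranked = [word for word, _ in scores.most_common()]  # highest score first
    n = len(ranked) * percent // 100
    top = ranked[:n]
    # ranked[-0:] would return the whole list, so guard against n == 0
    bottom = ranked[-n:] if n else []
    return top, bottom

# Toy scores for 40 hypothetical tokens: w0 has the lowest score, w39 the highest.
toy_scores = Counter({f"w{i}": float(i) for i in range(40)})
top, bottom = mark_extremes(toy_scores)
print(top, bottom)  # ['w39', 'w38'] ['w1', 'w0']
```

The slice for the lower end has to be negative (`[-n:]`); a positive slice `[n:]` would return everything except the top items.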

In [165]:
# Size of a 5% slice, computed separately for each list
n_food = int(len(total_food) / 100 * 5)
n_service = int(len(total_service) / 100 * 5)

# The top 5% (highest total score) are labeled 0, the bottom 5% are labeled 1.
with open(os.path.join("word_lists", "total_CNN+SAM", "food.csv"), "w") as food_fd:
    for el in total_food.most_common(n_food):
        food_fd.write(el[0] + "\t" + "0\n")
    for el in total_food.most_common()[-n_food:]:
        food_fd.write(el[0] + "\t" + "1\n")

with open(os.path.join("word_lists", "total_CNN+SAM", "service.csv"), "w") as service_fd:
    for el in total_service.most_common(n_service):
        service_fd.write(el[0] + "\t" + "0\n")
    for el in total_service.most_common()[-n_service:]:
        service_fd.write(el[0] + "\t" + "1\n")

We also save the full lists, just in case:

In [169]:
with open(os.path.join("word_lists", "total_CNN+SAM", "full_food_wo_mark.csv"), "w") as food_fd:
    for el in total_food.most_common():
        food_fd.write(el[0] + "\t" + str(el[1]) + "\n")
    
with open(os.path.join("word_lists", "total_CNN+SAM", "full_service_wo_mark.csv"), "w") as service_fd:
    for el in total_service.most_common():
        service_fd.write(el[0] + "\t" + str(el[1]) + "\n")