# cntext

cntext 是一个文本分析包，提供基于词嵌入模型的语义距离和语义投影。 此外，cntext还提供了传统的方法，如字数统计、可读性、文档相似度、情感分析等。

- github地址 https://github.com/hiDaDeng/cntext/
- pypi地址 https://pypi.org/project/cntext/

In [1]:
 pip install -i https://pypi.tuna.tsinghua.edu.cn/simple cntext

Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Collecting gensim==4.0.0
  Using cached https://pypi.tuna.tsinghua.edu.cn/packages/81/09/6929fd1e882943d1764f2aaf1e66ed32fc1cef987dab6ddbec0291e3ae4a/gensim-4.0.0-cp38-cp38-macosx_10_9_x86_64.whl (23.9 MB)
Installing collected packages: gensim
  Attempting uninstall: gensim
    Found existing installation: gensim 3.8.3
    Uninstalling gensim-3.8.3:
      Successfully uninstalled gensim-3.8.3
Successfully installed gensim-4.0.0
You should consider upgrading via the '/opt/anaconda3/bin/python -m pip install --upgrade pip' command.[0m[33m
[0mNote: you may need to restart the kernel to use updated packages.


In [2]:
import cntext as ct

help(ct)

Help on package cntext:

NAME
    cntext

PACKAGE CONTENTS
    dictionary
    mind
    similarity
    stats

VERSION
    1.7.9

FILE
    /opt/anaconda3/lib/python3.8/site-packages/cntext/__init__.py




In [3]:
text = '如何看待一网文作者被黑客大佬盗号改文，因万分惭愧而停更。'

ct.term_freq(text, lang='chinese')

Building prefix dict from the default dictionary ...
Dumping model to file cache /var/folders/l6/ntr5b4610hx38gy0_2xp7ngh0000gn/T/jieba.cache
Loading model cost 0.802 seconds.
Prefix dict has been built successfully.


Counter({'看待': 1,
         '网文': 1,
         '作者': 1,
         '黑客': 1,
         '大佬': 1,
         '盗号': 1,
         '改文因': 1,
         '万分': 1,
         '惭愧': 1,
         '停': 1})

## readability
文本可读性，指标越大，文章复杂度越高，可读性越差。

> readability(text, lang='chinese')

徐巍,姚振晔,陈冬华.中文年报可读性：衡量与检验[J].会计研究,2021(03):28-44.

- readability1 ---每个分句中的平均字数
- readability2 ---每个句子中副词和连词所占的比例
- readability3 ---参考Fog Index， readability3=(readability1+readability2)×0.5

In [4]:
text1 = '如何看待一网文作者被黑客大佬盗号改文，因万分惭愧而停更。'

ct.readability(text1, lang='chinese')

{'readability1': 28.0,
 'readability2': 0.15789473684210525,
 'readability3': 14.078947368421053}

In [5]:
import cntext as ct

# 获取cntext内置词典列表(pkl格式)
ct.dict_pkl_list()

['DUTIR.pkl',
 'HOWNET.pkl',
 'Chinese_Loughran_McDonald_Financial_Sentiment.pkl',
 'SentiWS.pkl',
 'ChineseFinancialFormalUnformalSentiment.pkl',
 'ANEW.pkl',
 'LSD2015.pkl',
 'NRC.pkl',
 'geninqposneg.pkl',
 'HuLiu.pkl',
 'Loughran_McDonald_Financial_Sentiment.pkl',
 'AFINN.pkl',
 'ADV_CONJ.pkl',
 'STOPWORDS.pkl',
 'Concreteness.pkl',
 'ChineseEmoBank.pkl']

In [9]:
import cntext as ct

print(ct.__version__)
# 导入pkl词典文件,
dutir = ct.load_pkl_dict('DUTIR.pkl')
print(dutir.keys())

1.7.9
dict_keys(['DUTIR', 'Referer', 'Desc'])


## sentiment

> sentiment(text, diction, lang='chinese') 

使用diy词典进行情感分析，计算各个情绪词出现次数; 未考虑强度副词、否定词对情感的复杂影响，

- text: 待分析中文文本
- diction: 情感词字典；
- lang: 语言类型，"chinese"或"english"，默认"chinese"

In [10]:
import cntext as ct

text = '我今天得奖了，很高兴，我要将快乐分享大家。'

ct.sentiment(text=text,
             diction=ct.load_pkl_dict('DUTIR.pkl')['DUTIR'],
             lang='chinese')

{'乐_num': 2,
 '好_num': 0,
 '怒_num': 0,
 '哀_num': 0,
 '惧_num': 0,
 '恶_num': 0,
 '惊_num': 0,
 'stopword_num': 8,
 'word_num': 14,
 'sentence_num': 1}

In [12]:
import cntext as ct

# load the concreteness.pkl dictionary file
concreteness_df = ct.load_pkl_dict('concreteness.pkl')
concreteness_df

{'Referer': 'Brysbaert, M., Warriner, A. B., & Kuperman, V. (2014). Concreteness ratings for 40 thousand generally known English word lemmas. Behavior Research Methods, 46, 904–911',
 'Desc': '语言具体性词典， 具体性计算应用案例可参考Packard, Grant, and Jonah Berger. "How concrete language shapes customer satisfaction." *Journal of Consumer Research* 47, no. 5 (2021): 787-806.',
 'Concreteness':                   word  valence
 0          roadsweeper     4.85
 1          traindriver     4.54
 2                 tush     4.45
 3            hairdress     3.93
 4        pharmaceutics     3.77
 ...                ...      ...
 39949         unenvied     1.21
 39950     agnostically     1.20
 39951  conceptualistic     1.18
 39952  conventionalism     1.18
 39953    essentialness     1.04
 
 [39954 rows x 2 columns]}

In [21]:
reply = "I'll go look for that"

score=ct.sentiment_by_valence(text=reply, 
                              diction=concreteness_df, 
                              lang='english')
score

{'valence': 9.28, 'word_num': 5}

In [17]:
text = 'What a happy day!'

ct.sentiment(text=text,
             diction=ct.load_pkl_dict('NRC.pkl')['NRC'],
             lang='english')

{'anger_num': 0,
 'anticipation_num': 1,
 'disgust_num': 0,
 'fear_num': 0,
 'joy_num': 1,
 'negative_num': 0,
 'positive_num': 1,
 'sadness_num': 0,
 'surprise_num': 0,
 'trust_num': 1,
 'stopword_num': 1,
 'word_num': 5,
 'sentence_num': 1}

In [19]:
# load the concreteness.pkl dictionary file;  cntext version >=1.7.1
concreteness_df = ct.load_pkl_dict('concreteness.pkl')['Concreteness']
concreteness_df.head()

Unnamed: 0,word,valence
0,roadsweeper,4.85
1,traindriver,4.54
2,tush,4.45
3,hairdress,3.93
4,pharmaceutics,3.77


In [23]:
employee_replys = ["I'll go look for that",
                   "I'll go search for that",
                   "I'll go search for that top",
                   "I'll go search for that t-shirt",
                   "I'll go look for that t-shirt in grey",
                   "I'll go search for that t-shirt in grey"]

for idx, reply in enumerate(employee_replys):
    score=ct.sentiment_by_valence(text=reply, 
                                  diction=concreteness_df, 
                                  lang='english')
    
    
    template = "Concreteness Score: {score:.2f} | Example-{idx}: {example}"
    print(template.format(score=score['valence']/score['word_num'], 
                          idx=idx, 
                          example=reply))
    

Concreteness Score: 1.86 | Example-0: I'll go look for that
Concreteness Score: 1.86 | Example-1: I'll go search for that
Concreteness Score: 2.21 | Example-2: I'll go search for that top
Concreteness Score: 2.04 | Example-3: I'll go search for that t-shirt
Concreteness Score: 2.37 | Example-4: I'll go look for that t-shirt in grey
Concreteness Score: 2.37 | Example-5: I'll go search for that t-shirt in grey


In [25]:
# This module is used to build or expand the vocabulary (dictionary), including

# SoPmi Co-occurrence algorithm to extend vocabulary (dictionary), Only support chinese
# W2VModels using word2vec to extend vocabulary (dictionary), support english & chinese

import os

sopmier = ct.SoPmi(cwd=os.getcwd(),
                   #raw corpus data，txt file.only support chinese data now.
                   input_txt_file='/Users/chengjun/GitHub/cntext/examples/data/sopmi_corpus.txt', 
                   #muanually selected seed words
                   seedword_txt_file='/Users/chengjun/GitHub/cntext/examples/data/sopmi_seed_words.txt', #人工标注的初始种子词
                   )   

sopmier.sopmi()

Step 1/4:...Preprocess   Corpus ...
Step 2/4:...Collect co-occurrency information ...
Step 3/4:...Calculate   mutual information ...
Step 4/4:...Save    candidate words ...
Finish! used 60.98 s


In [36]:
os.getcwd()

'/Users/chengjun/GitHub/css/notebook'

In [37]:
import os

#init W2VModels, corpus data w2v_corpus.txt
model = ct.W2VModels(cwd=os.getcwd(), lang='english')  
model.train(input_txt_file='/Users/chengjun/GitHub/cntext/examples/data/w2v_corpus.txt')


#According to the seed word, filter out the top 100 words that are most similar to each category words
model.find(seedword_txt_file='/Users/chengjun/GitHub/cntext/examples/data/w2v_seeds/integrity.txt', 
           topn=100)
model.find(seedword_txt_file='/Users/chengjun/GitHub/cntext/examples/data/w2v_seeds/innovation.txt', 
           topn=100)
model.find(seedword_txt_file='/Users/chengjun/GitHub/cntext/examples/data/w2v_seeds/quality.txt', 
           topn=100)
model.find(seedword_txt_file='/Users/chengjun/GitHub/cntext/examples/data/w2v_seeds/respect.txt', 
           topn=100)
model.find(seedword_txt_file='/Users/chengjun/GitHub/cntext/examples/data/w2v_seeds/teamwork.txt', 
           topn=100)

Step 1/4:...Preprocess   corpus ...
Step 2/4:...Train  word2vec model
            used   72 s
Step 3/4:...Prepare similar candidates for each seed word in the word2vec model...
Step 4/4 Finish! Used 82 s
Step 3/4:...Prepare similar candidates for each seed word in the word2vec model...
Step 4/4 Finish! Used 82 s
Step 3/4:...Prepare similar candidates for each seed word in the word2vec model...
Step 4/4 Finish! Used 82 s
Step 3/4:...Prepare similar candidates for each seed word in the word2vec model...
Step 4/4 Finish! Used 82 s
Step 3/4:...Prepare similar candidates for each seed word in the word2vec model...
Step 4/4 Finish! Used 82 s


In [42]:
from gensim.models import KeyedVectors

w2v_model = KeyedVectors.load("/Users/chengjun/GitHub/cntext/examples/output/w2v_candi_words/w2v.model")
w2v_model.get_vector('company')

array([-1.315214  ,  0.8820462 ,  1.1123291 ,  0.57502085, -0.4536534 ,
        0.04593553,  1.7608879 ,  1.8781638 ,  0.62472534, -0.40170938,
       -0.74852383,  0.20071138, -0.1615613 , -1.1544902 , -1.8266941 ,
        0.07904029,  0.04081295, -0.30807   , -0.25948036,  0.7637631 ,
        0.30233267,  0.02158124, -0.4548134 ,  0.22692135, -0.26944858,
       -0.24163176,  1.2302433 ,  0.70777947,  1.1255033 , -0.17567617,
       -0.7234768 , -1.0653226 , -0.09055816, -1.46483   ,  0.4885035 ,
       -0.9827379 ,  0.95400816, -0.13374516, -0.52517796, -0.23480052,
        0.62640554, -0.5553755 , -0.83863366,  1.3104644 , -0.07400908,
        1.1336509 ,  0.32375774,  0.354964  ,  0.10843342, -0.54200864,
       -0.8003504 , -0.7266025 ,  1.4322623 , -0.8662505 , -0.7207146 ,
        0.16009226,  1.746251  ,  1.7912021 ,  1.9516977 ,  0.7451547 ,
       -0.43844843, -0.30046785,  1.7179738 ,  0.8191196 , -0.41387284,
       -0.26602927, -1.3589116 , -0.0668382 , -0.02384867, -0.58

In [43]:
w2v_model.most_similar('innovation')

[('execution', 0.8257297873497009),
 ('capabilities', 0.8198432326316833),
 ('continue_focus', 0.8191850781440735),
 ('dsd_system', 0.8175386786460876),
 ('continue_expand', 0.8135412931442261),
 ('emphasis', 0.8049392700195312),
 ('informatics', 0.8048107624053955),
 ('continue_drive', 0.8032825589179993),
 ('technologies', 0.8026941418647766),
 ('dsd', 0.8008151650428772)]

In [44]:
w2v_model.get_vector('innovation')


array([ 0.10703183, -0.5493035 ,  1.0623612 , -0.43180484, -0.9257197 ,
       -1.0833972 , -0.07150532,  1.8700444 ,  0.26021707,  0.04272432,
       -0.93499535,  0.05657623,  1.1683912 ,  0.3376583 ,  0.49347457,
       -0.7696766 , -0.91470116, -0.66640836,  1.2935071 , -0.6879933 ,
        0.01370158, -0.07028562, -1.3170564 , -0.4464321 , -0.08877528,
        0.634105  , -0.9797471 ,  1.221016  , -0.5741402 , -0.18807778,
       -1.5563715 ,  0.14269993,  0.53213865, -0.16015887, -0.16328476,
       -0.03469868,  0.8123352 , -0.89830345, -0.4654071 ,  0.59473026,
       -0.01235237, -0.16233891,  0.0882808 ,  1.2530324 ,  0.63678694,
       -0.4859363 ,  0.63639694,  0.5577657 ,  1.1487194 ,  1.2548769 ,
        0.08172881,  1.2036918 ,  0.31136826, -0.33539033, -0.73395115,
       -0.35624295,  0.00873698, -0.7198372 ,  0.09266879,  0.0259969 ,
        0.5248478 ,  1.0185517 , -0.3534738 , -0.11561136, -0.64694124,
       -0.13806236, -0.7061978 ,  0.4576997 , -0.9011815 ,  0.47

In [26]:
documents = ["I go to school every day by bus .",
         "i go to theatre every night by bus"]

ct.co_occurrence_matrix(documents, 
                        window_size=2, 
                        lang='english')

Unnamed: 0,.,bus,by,day,every,go,i,night,school,theatre,to
.,0,1,1,0,0,0,0,0,0,0,0
bus,1,0,2,1,0,0,0,1,0,0,0
by,1,2,0,1,2,0,0,1,0,0,0
day,0,1,1,0,1,0,0,0,1,0,0
every,0,0,2,1,0,0,0,1,1,1,2
go,0,0,0,0,0,0,2,0,1,1,2
i,0,0,0,0,0,2,0,0,0,0,2
night,0,1,1,0,1,0,0,0,0,1,0
school,0,0,0,1,1,1,0,0,0,0,1
theatre,0,0,0,0,1,1,0,1,0,0,1


In [45]:
text1 = 'Programming is fun!'
text2 = 'Programming is interesting!'

print(ct.cosine_sim(text1, text2))
print(ct.jaccard_sim(text1, text2))
print(ct.minedit_sim(text1, text2))
print(ct.simple_sim(text1, text2))

0.67
0.50
1.00
0.90


In [50]:
# download glove_w2v.6B.100d.txt from google Driver
# https://drive.google.com/file/d/1tuQB9PDx42z67ScEQrg650aDTYPz-elJ/view
#Note: this is a word2vec format model
tm = ct.Text2Mind(w2v_model_path='/Users/chengjun/bigdata/glove_w2v.6B.100d.txt')
#tm = ct.Text2Mind(w2v_model_path='/Users/chengjun/GitHub/cntext/examples/output/Glove/brown_corpus_w2v.txt')


engineer = ['program', 'software', 'computer']
mans =  ["man", "he", "him"]
womans = ["woman", "she", "her"]


tm.sematic_distance(words=animals, 
                    c_words1=mans, 
                    c_words2=womans)

Loading the model of /Users/chengjun/bigdata/glove_w2v.6B.100d.txt
Load successfully, used 22.7 s


-0.43

In [51]:
animals = ['mouse', 'cat', 'horse',  'pig', 'whale']
small_words = ["small", "little", "tiny"]
large_words = ["large", "big", "huge"]

tm.sematic_projection(words=animals, 
                      c_words1=small_words, 
                      c_words2=large_words)

[('mouse', -1.68),
 ('cat', -0.92),
 ('pig', -0.46),
 ('whale', -0.24),
 ('horse', 0.4)]

In [4]:
ct.load_pkl_dict('ChineseEmoBank.pkl')


{'Referer-1': 'Lee, Lung-Hao, Jian-Hong Li, and Liang-Chih Yu. "Chinese EmoBank: Building Valence-Arousal Resources for Dimensional Sentiment Analysis." Transactions on Asian and Low-Resource Language Information Processing 21, no. 4 (2022): 1-18.',
 'Referer-2': 'Liang-Chih Yu, Lung-Hao Lee, Shuai Hao, Jin Wang, Yunchao He, Jun Hu, K. Robert Lai, and Xuejie Zhang. 2016. "Building Chinese affective resources in valence-arousal dimensions. In Proceedings of NAACL/HLT-16, pages 540-545.',
 'Desc': 'Chinese Sentiment Dictionary, includes 「valence」「arousal」. In cntext, we only take Chinese valence-arousal words (CVAW, single word) into account, ignore CVAP, CVAS, CVAT.',
 'ChineseEmoBank':       word  valence  arousal
 0     不可思议      5.4      7.2
 1       不平      3.6      5.8
 2       不甘      3.2      6.4
 3       不安      3.8      5.4
 4       不利      3.6      5.6
 ...    ...      ...      ...
 5505    黏闷      2.8      5.6
 5506    黏腻      2.7      5.8
 5507    艳丽      5.8      4.5
 5508 

In [5]:
diction_df = ct.load_pkl_dict('ChineseEmoBank.pkl')['ChineseEmoBank']
diction_df

Unnamed: 0,word,valence,arousal
0,不可思议,5.4,7.2
1,不平,3.6,5.8
2,不甘,3.2,6.4
3,不安,3.8,5.4
4,不利,3.6,5.6
...,...,...,...
5505,黏闷,2.8,5.6
5506,黏腻,2.7,5.8
5507,艳丽,5.8,4.5
5508,苗条,6.7,3.8


In [6]:
text = '很多车主抱怨新车怠速抖动严重---冷车时更严重。'

ct.sentiment_by_weight(text = text, 
                       diction = diction_df,
                       params = ['valence', 'arousal'],
                       lang = 'chinese')

Building prefix dict from the default dictionary ...
Loading model from cache /var/folders/l6/ntr5b4610hx38gy0_2xp7ngh0000gn/T/jieba.cache
Loading model cost 0.677 seconds.
Prefix dict has been built successfully.


{'valence': 14.8, 'arousal': 24.8, 'word_num': 13}

- valence是句子中各个chinese_emobank词valence得分的加总。
- arousal是句子中各个chinese_emobank词arousal得分的加总。
- word_num是句子中的词语数(含标点符号)，短文本的情况下，word_num会不太准确，长文本情况下无限接近真实词语数。

需要注意，文本越长，valence和arousal指标应该会越大。使用这两个指标时，需要结合word_num进行均值处理，即

- Valence = valence/word_num
- Arousal = arousal/word_num
这里未做均值处理，尽量保留文本的原始信息。

![image.png](img/chengjun2.png)