<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Preprocess-data" data-toc-modified-id="Preprocess-data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Preprocess data</a></span></li><li><span><a href="#Get-top-1000-common-words" data-toc-modified-id="Get-top-1000-common-words-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Get top 1000 common words</a></span></li><li><span><a href="#Classify-headlines-by-labelled-words" data-toc-modified-id="Classify-headlines-by-labelled-words-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Classify headlines by labelled words</a></span></li><li><span><a href="#Sentiment-analysis" data-toc-modified-id="Sentiment-analysis-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Sentiment analysis</a></span></li></ul></div>

- This notebook contains codes for sentiment analysis of headlines. 

In [1]:
from snownlp import SnowNLP
from snownlp import sentiment

import pandas as pd
import jieba as jb
import csv
import matplotlib.pyplot as plt

## Preprocess data

In [2]:
# Import data 
data = pd.read_csv('data/data_tgb_48w.csv', index_col=0)
data.columns = ['headline', 'n_replies', 'n_likes', 'year', 'month', 'day']
data.reset_index(inplace=True, drop=True)

print(data.shape)
data.head(5)

(489870, 6)


Unnamed: 0,headline,n_replies,n_likes,year,month,day
0,原\n股票大跌以后为什么总上不去？(923),0,14,2019,10,5
1,原\n股市里的大道到底是什么？ - - -什么才是股市里的道。(151),2,68,2019,10,5
2,原\n人类文明还能存在多久？死亡天体正在逼近，科学家给出准确答案(165),10,178,2019,10,5
3,原\n跟随市场，理解市场，做到了即成功也(4),0,64,2019,10,5
4,原\n一步！一步！我要明天会更好！(189),0,58,2019,10,5


In [3]:
# separate categories from headlines
def get_category_content(txt):
    '''Get the category and content of a headline'''
    category, content = txt.split('\n', maxsplit=1)
    return category, content

lst = data['headline'].apply(get_category_content)
categories = [row[0] for row in lst]
contents = [row[1] for row in lst]

data['category'] = categories
data['headline'] = contents

# separate no of views from headlines 
def get_views(txt):
    '''Get the content and number of views of a headline'''
    content, view = txt.rsplit('(', maxsplit=1)
    view = view.rstrip(')')
    return content, view 

lst = data['headline'].apply(get_views)
contents = [row[0] for row in lst]
views = [row[1] for row in lst]

data['headline'] = contents
data['views'] = views 

data.head(5)

Unnamed: 0,headline,n_replies,n_likes,year,month,day,category,views
0,股票大跌以后为什么总上不去？,0,14,2019,10,5,原,923
1,股市里的大道到底是什么？ - - -什么才是股市里的道。,2,68,2019,10,5,原,151
2,人类文明还能存在多久？死亡天体正在逼近，科学家给出准确答案,10,178,2019,10,5,原,165
3,跟随市场，理解市场，做到了即成功也,0,64,2019,10,5,原,4
4,一步！一步！我要明天会更好！,0,58,2019,10,5,原,189


In [4]:
# Save the preprocessed df into csv
data.to_csv('data/data_tgb_48w_clean.csv')

## Get top 1000 common words

- The purpose is to label most common words as positive, neutral or negative in the data, thereby improving the performance of sentiment analysis model.

In [5]:
data = pd.read_csv('data/data_tgb_48w_clean.csv', index_col = 0)
data.head(5)

Unnamed: 0,headline,n_replies,n_likes,year,month,day,category,views
0,股票大跌以后为什么总上不去？,0,14,2019,10,5,原,923
1,股市里的大道到底是什么？ - - -什么才是股市里的道。,2,68,2019,10,5,原,151
2,人类文明还能存在多久？死亡天体正在逼近，科学家给出准确答案,10,178,2019,10,5,原,165
3,跟随市场，理解市场，做到了即成功也,0,64,2019,10,5,原,4
4,一步！一步！我要明天会更好！,0,58,2019,10,5,原,189


In [6]:
# Tokenizations of headlines 
tokens = []

for line in data['headline']:
    line = list(jb.cut(str(line)))
    tokens.extend(line)

print('Number of words:', len(tokens))
print('Number of unique words:', len(set(tokens)))

Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\user\AppData\Local\Temp\jieba.cache
Loading model cost 0.555 seconds.
Prefix dict has been built successfully.


Number of words: 4572536
Number of unique words: 122429


In [7]:
# Import stopwords
with open('data/stop_words.txt', 'r', encoding="utf8") as sw:
    stopword = sw.read().split('\n')
    
# Remove stopwords from tokens 
tokens_wo_stopwords = [token for token in tokens if token not in stopword]
print('Number of words after removing stopwords:', len(tokens_wo_stopwords))
print('Number of unique words after removing stopwords:', len(set(tokens_wo_stopwords)))

# Calculate word frequency of each token 
word_count = {}
for token in tokens_wo_stopwords:
    if token in word_count:
        word_count[token] += 1
    else: 
        word_count[token] = 1
        
# Calculate term frequency 
total_count = len(tokens_wo_stopwords) 
for word, count in word_count.items():
    word_count[word] = count/total_count

# Find top 1000 words in terms of df
top_words = {}
for word in sorted(word_count, key = word_count.get, reverse = True)[:1000]:
    top_words[word] = word_count[word] 
    
# Save top 1000 words into a csv
with open('data/data_tgb_words_1000.csv', 'w') as f:
    for word, tf in top_words.items():
        f.write("%s,%s\n"%(word.encode('utf8'), tf))

Number of words after removing stopwords: 3129287
Number of unique words after removing stopwords: 121474


## Classify headlines by labelled words

- Import manually-labelled words to classify headlines as training data

In [8]:
keywords = pd.read_csv('data/data_tgb_words_1000_labelled.csv', header = None)
keywords.columns = ['word', 'tf', 'sentiment']

print(keywords.shape)
keywords.head()

(1000, 3)


Unnamed: 0,word,tf,sentiment
0,,0.032309,0.0
1,复盘,0.010134,0.0
2,股份,0.006431,0.0
3,—,0.006333,0.0
4,科技,0.005998,0.0


In [9]:
# View the distribution of word sentiment 
print(keywords['sentiment'].value_counts())

# Get word list for each sentiment
word_pos = keywords.loc[keywords['sentiment']==1.0, 'word'].to_list()
word_neg = keywords.loc[keywords['sentiment']==-1.0, 'word'].to_list()

# Classify text by sentiments (positive and negative only)
s_pos = set()
s_neg = set()

for headline in data['headline']:
    tokens = jb.cut(str(headline))
    for token in tokens:
        if token in word_pos:
            s_pos.add(headline)
        elif token in word_neg:
            s_neg.add(headline)
            
# Save headlines as txt
with open('data/pos_tgb.txt', 'w', encoding='utf-8') as f:
    for l in s_pos:
        f.write(l)
f.close()

with open('data/neg_tgb.txt', 'w', encoding='utf-8') as f:
    for l in s_neg:
        f.write(l)
f.close()

 0.0    882
 1.0     70
-1.0     46
Name: sentiment, dtype: int64


## Sentiment analysis

- Source of training data: https://github.com/isnowfy/snownlp
- Sentiment Analysis Model: Naive Bayes Model

In [10]:
# Integrate labelled headlines into training data 
filenames = ['data/pos.txt', 'data/pos_tgb.txt']
with open('data/pos_final.txt', 'w', encoding='utf-8') as outfile:
    for fname in filenames:
        with open(fname, 'r', encoding='utf-8') as infile:
            for line in infile:
                outfile.write(line)
                
filenames = ['data/neg.txt', 'data/neg_tgb.txt']
with open('data/neg_final.txt', 'w', encoding='utf-8') as outfile:
    for fname in filenames:
        with open(fname, 'r', encoding='utf-8') as infile:
            for line in infile:
                outfile.write(line)

In [11]:
# Train model with data 
sentiment.train('data/neg_final.txt', 'data/pos_final.txt')
sentiment.save('sentiment.marshal')

# Took 15 mins to run 

- Remember to change the data path in `snownlp/seg/__init__.py` to the above trained model, `sentiment.marshal` to avoid retraining the model every time. 
- Check the package location by the following code:
    - `import snownlp` and run `snownlp.__file__`

In [12]:
# Predict sentiments using trained model 
sentiments = []

for line in data['headline']:
    s = SnowNLP(str(line))
    sentiments.append(s.sentiments)

data['sentiments'] = sentiments
data.head(50)

# Took 20 mins to run

Unnamed: 0,headline,n_replies,n_likes,year,month,day,category,views,sentiments
0,股票大跌以后为什么总上不去？,0,14,2019,10,5,原,923,0.232284
1,股市里的大道到底是什么？ - - -什么才是股市里的道。,2,68,2019,10,5,原,151,0.360709
2,人类文明还能存在多久？死亡天体正在逼近，科学家给出准确答案,10,178,2019,10,5,原,165,0.99881
3,跟随市场，理解市场，做到了即成功也,0,64,2019,10,5,原,4,0.958018
4,一步！一步！我要明天会更好！,0,58,2019,10,5,原,189,0.224666
5,破除缠论束缚（六）：递归级别,0,57,2019,10,5,原,1873,0.996955
6,破除缠论束缚（五）：背驰（二）,0,56,2019,10,5,原,1082,0.986443
7,平凡之路之蜗牛慢行,2,69,2019,10,5,原,519,0.769066
8,才来，匆匆忙忙,0,54,2019,10,5,原,31,0.386525
9,我想买个多屏电脑操作股票，大家都什么意见吗,5,226,2019,10,5,原,60,0.397351


In [13]:
data.to_csv('data/data_tgb_48w_labelled.csv')