# Neural Network for sentiment analysis of press texts

### Goal:
Create a Neural Network that classifies the sentiment of press materials.

### Data:
The training set contains x news articles from 7 Polish news portals. For each article I calculate an emotional score based on the emotional valuation dictionary from Słowosieć (the Polish WordNet). The scores are later converted to labels for positively and negatively valued articles, which serve as the target in supervised learning.

### Method choice
I want to use a Neural Network for this task, as it has a higher potential for learning relationships between words. Models such as Logistic Regression on word counts treat every word independently of its neighbours, so they cannot learn these relationships. A model that does learn them should generalize better to texts on topics different from those in the training set.
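To illustrate the point about word relationships, the sketch below shows that a bag-of-words representation (the kind of input a count-based vectorizer gives to Logistic Regression) is identical for two sentences with opposite meaning. The sentences and the helper `bag_of_words` are invented for illustration and use only the standard library.

```python
from collections import Counter

def bag_of_words(text):
    """Count word occurrences, ignoring word order (hypothetical helper)."""
    return Counter(text.lower().split())

# two invented sentences built from the same words but with opposite meaning
a = "the minister praised the reform and not the protest"
b = "the minister praised the protest and not the reform"

same_counts = bag_of_words(a) == bag_of_words(b)
print(same_counts)  # True: word counts alone cannot tell the sentences apart
```

A sequence model receives the words in order, so it at least has the information needed to separate the two cases.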

In [3]:
# basic libraries
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
# NLP
from sklearn.feature_extraction.text import CountVectorizer
from wordcloud import WordCloud, STOPWORDS 
import spacy
from collections import Counter
import nltk
from nltk.collocations import *

In [None]:
!pip install spacy==3.0
!python -m spacy download pl_core_news_lg

Collecting spacy==3.0
Collecting thinc<8.1.0,>=8.0.0
Collecting pydantic<1.8.0,>=1.7.1
Collecting typer<0.4.0,>=0.3.0
Collecting pl-core-news-lg==3.0.0
Installing collected packages: pl-core-news-lg
Successfully installed pl-core-news-lg-3.0.0
✔ Download and installation successful
You can now load the package via spacy.load('pl_core_news_lg')


# Data preparation

I need to compute the emotional score for each article in the set. Words in the emotional dictionary are stored as lemmas, and Logistic Regression (my baseline model) works better with lemmatized words, so I need to lemmatize the articles. Moreover, later in the notebook I need word vectors to create the embedding matrix for the Neural Network. Both lemmas and word vectors can be obtained from a spaCy model, which I will use for preprocessing.
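As a minimal sketch of why lemmas matter for dictionary-based scoring, the example below sums valuations over a toy dictionary. The dictionary entries, the example text and the helper `toy_score` are invented; the numeric values follow the label scheme used by the scoring function later in this notebook.

```python
# hypothetical mini-dictionary: lemma -> valuation label
toy_dictionary = {"dobry": "+ s", "wspaniały": "+ m", "zły": "- s", "tragedia": "- m"}
# numeric value of each label, as in the scoring function of this notebook
label_values = {"- m": -4, "- s": -1, "amb": 0, "+ s": 1, "+ m": 4}

def toy_score(words):
    """Sum the numeric valuations of all words found in the toy dictionary."""
    return sum(label_values[toy_dictionary[w]] for w in words if w in toy_dictionary)

# invented, already lemmatized text
lemmas = ["to", "być", "wspaniały", "dzień", "mimo", "zły", "pogoda"]
score = toy_score(lemmas)
print(score)  # 3 = 4 (wspaniały) - 1 (zły)

# inflected forms are not dictionary entries, so without lemmatization nothing matches
inflected_score = toy_score(["wspaniałego", "złej"])
print(inflected_score)  # 0
```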

In [None]:
# reading datasets
# note: error_bad_lines=False skips malformed rows; pandas 1.3+ replaces it with on_bad_lines ("warn" or "skip")
data = {"gazeta_pl": pd.read_csv("gazeta_pl.csv",sep="^"),
        "oko_press": pd.read_csv("OKO_press_all.csv",sep="^"),
        "onet": pd.read_csv("Onet.csv",sep="^",error_bad_lines=False),
        "wp": pd.read_csv("wp.csv",sep="^",error_bad_lines=False),
        "wPolityce": pd.read_csv("wPolityce.csv",sep="^"),
        # the last column name of this file carries a stray carriage return
        "niezalezna": pd.read_csv("Niezalezna.csv",sep="^").rename({"Content\r":"Content"},axis=1),
        "krypol": pd.read_csv("krypol.csv",sep="^")}

b'Skipping line 1195: expected 5 fields, saw 9\n'


In [None]:
# loading large Polish spaCy model
nlp = spacy.load("pl_core_news_lg")

In [None]:
# checking whether the model loaded properly
nlp("ja")[0].lemma_

'ja'

In [None]:
# loading emotional dictionary
emotional_df = pd.read_csv(r"emotions.csv")

In [None]:
def count_emotional_value(text,emotional_df):
    """
    Calculates the emotional score of a single text based on the valuations in the emotional valuation dictionary.
    text - lemmatized version of the text (list of lemmas) to calculate the score on
    emotional_df - emotional dictionary read as pd.DataFrame
    """
    # variable storing emotional score
    emotional_score = 0
    # dictionary for translating emotional valuations into numerical values
    values = {"- m":-4,
             "- s":-1,
             "amb":0,
             "+ s":1,
             "+ m":4}

    # lemma -> first valuation listed for it; built once instead of scanning the DataFrame for every word
    first_valuation = (emotional_df.drop_duplicates("lemat")
                       .set_index("lemat")["stopien_nacechowania"].to_dict())

    # iterating over words in the text, checking whether the word has an emotional valuation, adding its value to the score
    for word in text:
        word = str(word)
        if word in first_valuation:
            emotional_score += values[first_valuation[word]]

    return emotional_score

In [None]:
def preprocess_data(data,nlp,emotional_df):
    """
    Adds spaCy docs, lemmatized texts and emotional scores to a single data frame of articles.
    """
    # process texts - saving columns with spacy docs and with lemmatized texts as lists of words
    data["processed"] = [nlp(x) for x in data.Content.astype("str")]
    data["lemmatized"] = [[x.lemma_ for x in text] for text in data.processed]

    # computing emotional values of texts
    data["emotional_value"] = [count_emotional_value(x,emotional_df) for x in data.lemmatized]

    return data

In [None]:
# preprocessing data from each data frame
for key in data.keys():
    data[key] = preprocess_data(data[key],nlp,emotional_df)

In [None]:
# checking distributions of positive and negative class
for key in data.keys():
    # print amounts of articles in positive and negative class
    print("Positive",data[key][data[key].emotional_value > 0].Title.count(),
          "Negative",data[key][data[key].emotional_value < 0].Title.count())
    
    # plot histogram to see the distribution of values
    plt.hist(data[key].emotional_value)
    plt.show()

# Data preparation for Neural Network

In the next steps I prepare the data for the Neural Network: train/test split, tokenization, integer encoding, padding, and creating the initial embedding matrix.
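The integer-encoding step can be sketched in plain Python as follows. This only mimics the idea behind the Keras `Tokenizer` (ids assigned by descending frequency, 0 reserved for padding, ids at or above `num_words` dropped); the helpers `fit_word_index` and `encode`, the alphabetical tie-breaking and the example texts are assumptions of the sketch.

```python
from collections import Counter

def fit_word_index(texts):
    """Assign integer ids by descending frequency, starting at 1 (0 is kept for padding)."""
    counts = Counter(word for text in texts for word in text.split())
    ranked = sorted(counts, key=lambda w: (-counts[w], w))  # ties broken alphabetically here
    return {word: i for i, word in enumerate(ranked, start=1)}

def encode(text, word_index, num_words=None):
    """Replace words by ids; skip unknown words and ids at or above num_words."""
    ids = [word_index[w] for w in text.split() if w in word_index]
    return [i for i in ids if num_words is None or i < num_words]

# invented training texts
train_texts = ["rząd ogłosił nowy program", "nowy program budzi sprzeciw"]
word_index = fit_word_index(train_texts)

# "nadzieję" never occurred in training, so it is silently dropped
encoded = encode("nowy program budzi nadzieję", word_index)
print(encoded)  # [1, 2, 3]
```

Because the index is fitted on the training texts only, words that appear just in the test set are dropped at encoding time.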

In [2]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score,f1_score
from sklearn.model_selection import train_test_split
# classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

from tensorflow.python.keras.utils import np_utils
# tensorflow and keras
import keras
from keras import regularizers, optimizers
from keras.layers.experimental.preprocessing import TextVectorization
from keras.layers import Activation, Bidirectional,Embedding, Dense, Dropout, Input, LSTM,MaxPooling1D,Flatten,SpatialDropout1D,BatchNormalization, Conv1D
from keras.models import Sequential
from keras.initializers import Constant
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.callbacks import History,EarlyStopping, ModelCheckpoint
from tensorflow.keras.preprocessing.text import one_hot

In [None]:
# creating learning set

# for each dataframe take emotional values and processed texts, then concatenate all frames into one learning set
frames = [pd.DataFrame({"Emotion":data[key].emotional_value,"Text":data[key].processed}) for key in data.keys()]
learning_set = pd.concat(frames,ignore_index=True)

In [None]:
# creating learning set for logistic regression - lemmatized texts joined into space-separated strings
learning_set["Logistic"] = [" ".join(x.lemma_ for x in text) for text in learning_set.Text]

In [4]:
learning_set = pd.read_csv("learning_set.csv",sep="^")

In [5]:
# drop NaNs from learning set
learning_set.dropna(inplace=True)

In [6]:
# prepare target and features variable for Neural Network and Logistic Regression

y = pd.Series([0 if x<0 else 1 for x in learning_set["Emotion"]])
y_log = pd.Series([0 if x<0 else 1 for x in learning_set["Emotion"]])
X_log = pd.Series(learning_set["Logistic"])
X = pd.Series(learning_set["Text"])
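The labelling rule in the cell above can be checked on a few invented scores. The sketch below applies the same condition (`0 if x<0 else 1`); the helper name `to_label` and the scores are made up.

```python
def to_label(score):
    """Same rule as in the cell above: negative scores -> 0, everything else -> 1."""
    return 0 if score < 0 else 1

toy_scores = [-7, -1, 0, 2, 15]  # invented emotional scores
labels = [to_label(s) for s in toy_scores]
print(labels)  # [0, 0, 1, 1, 1]
```

Note that an article with a score of exactly 0 lands in the positive class under this rule, although the earlier distribution check counts only scores above 0 as positive.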

In [7]:
#  train test split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2)
X_train_l,X_test_l,y_train_l,y_test_l = train_test_split(X_log,y_log,test_size=0.2)
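The two `train_test_split` calls above shuffle independently and without a fixed seed, so the baseline and the Neural Network are in general evaluated on different articles. A minimal sketch of splitting once and reusing the same row positions for every column is shown below; the helper `split_indices`, the seed and the toy columns are assumptions, not part of the original pipeline.

```python
import random

def split_indices(n_rows, test_size=0.2, seed=42):
    """Shuffle row positions once and cut them into train and test parts."""
    positions = list(range(n_rows))
    random.Random(seed).shuffle(positions)
    n_test = int(round(n_rows * test_size))
    return positions[n_test:], positions[:n_test]

# invented parallel columns standing in for "Text" and "Logistic"
texts = ["t%d" % i for i in range(10)]
lemmas = ["l%d" % i for i in range(10)]

train_idx, test_idx = split_indices(len(texts))

# both feature sets are cut with the same positions, so the test articles match
X_test_demo = [texts[i] for i in test_idx]
X_test_l_demo = [lemmas[i] for i in test_idx]
print(X_test_demo, X_test_l_demo)
```

With scikit-learn the same effect can be obtained by passing all arrays to a single call, for example `train_test_split(X, X_log, y, test_size=0.2, random_state=0)`, which returns a train and a test part for each array in order.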

### Baseline model - logistic regression

In [8]:
# fit a simple logistic regression on the data to check whether the set is classifiable

mdl = make_pipeline(CountVectorizer(),LogisticRegression(solver="liblinear"))
                    
mdl.fit(X_train_l,y_train_l)
mdl.score(X_test_l,y_test_l)

0.8435434111943587

In [9]:
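# prepare train and test variables for tokenizing - make sure that every element is a string
# after reading learning_set.csv each text is the string form of a token list, so it is parsed back into a list first
import ast

X_train_for_tokenizer = [[str(token) for token in ast.literal_eval(text)] for text in X_train]
X_test_for_tokenizer = [[str(token) for token in ast.literal_eval(text)] for text in X_test]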



In [10]:
X_train[:10]

9299     ['Pytany', 'w', 'programie', 'Michała', 'Racho...
5240     ['Kolejna', 'partia', 'szczepionek', 'na', 'CO...
5025     ['Dopiero', 'zaczynało', 'świtać', ',', 'gdy',...
403      ['Minionej', 'doby', 'odnotowano', '7', '038',...
10130    ['81', '-', 'letnia', 'mieszkanka', 'Olsztyna'...
378      ['Kancelaria', 'Prezydenta', 'poinformowała', ...
8935     ['Jeszcze', 'do', '25', 'lutego', 'uczniowie',...
319      ['Amerykańska', 'agencja', 'kosmiczna', 'NASA'...
3976     ['Ojciec', 'Święty', 'nawiązał', 'do', '90', '...
10830    ['Politycy', ',', 'publicyści', 'oraz', 'dyplo...
Name: Text, dtype: object

In [11]:
# median length of texts (in characters, as each text is stored as a string)
np.median([len(x) for x in X_train])

2887.0

In [12]:
# tokenizer with custom filter - does not filter out commas, full stops, ? and !, as they can carry important insights
t = Tokenizer(num_words=50000,lower=False,filters='#$%&+-/<=>@[\\]^_`{|}~\t\n')
t.fit_on_texts(X_train)

# vocab size for embedding matrix
vocab_size = len(t.word_index) + 1

# integer encode the documents
encoded_docs = t.texts_to_sequences(X_train)
encoded_test = t.texts_to_sequences(X_test)

# padding texts 
max_length = 1000
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
padded_test = pad_sequences(encoded_test, maxlen=max_length, padding='post')
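What `pad_sequences` does to a single sequence with `padding='post'` and its default `truncating='pre'` can be sketched in plain Python. The helper `pad_post_truncate_pre` and the example ids are invented for illustration.

```python
def pad_post_truncate_pre(sequence, maxlen, value=0):
    """Mimic padding='post' with the default truncating='pre' for one sequence."""
    kept = sequence[-maxlen:]                      # too long: drop ids from the start
    return kept + [value] * (maxlen - len(kept))   # too short: add zeros at the end

short = pad_post_truncate_pre([5, 8, 2], maxlen=5)
long = pad_post_truncate_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short)  # [5, 8, 2, 0, 0]
print(long)   # [3, 4, 5, 6, 7]
```

With the settings used above, articles longer than `max_length` therefore keep their last 1000 ids rather than their first. Passing `truncating='post'` to `pad_sequences` would keep the beginning of the article instead.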

    


In [13]:
padded_docs[0]

array([ 1606,     3,   863,  1775, 11104,    13,     4,     5, 10257,
          13,    16,    14,     1,    42,   546,  6051,  5902,     7,
         415,   824,     8,   121,     1, 10025,   527,     1, 13247,
           1,   740,  4517, 10792,   291,     1,    10,     3,  3548,
       22802,     1, 22803,     1,    15,    38,    11,   610,  6403,
       25853,     2,     4,     5,  4677,    12,    58,   658,     2,
         757,    31,  1042,  3121,  2057,  1594,     3,   666,  1419,
           3,   273,    59,     2,     6,    14,    12,    15,  3121,
        1268,     4,     5,   761,     2,    18,   334,  1078,  1024,
           6,  1274,   730,  7778,  6404,   509,  1326,    25,  1326,
           4,     5,   659,  1021,  2039,     3,   736,    11,  9320,
        6590,  5961,    27,   357,   263,     1,   159,   509,   393,
           1,  2907,     1,   546,     6,   567,   123,   437,   919,
         527,     2,   143,  6793,  5025, 46875,  7505,  7240, 25854,
          30,  4741,

# Creating architecture and training model

I use an architecture based on LSTMs, as they process text as a sequence and are well suited to capturing relationships between words. I use LSTMs instead of simple RNNs because I want the model to consider long-term relationships between words, which simple RNNs handle worse.





In [27]:
history = History()

# creating the model architecture
model = Sequential()
model.add(Embedding(
    50000,
    100,
    input_length = max_length,
    trainable=True))
model.add(Bidirectional(LSTM(32,return_sequences=True)))
model.add(BatchNormalization())
model.add(Activation("relu"))
model.add(Dropout(0.65))


model.add(Dense(32))
model.add(BatchNormalization())
model.add(Activation("relu"))
model.add(Dropout(0.65))
model.add(Dense(1,activation="sigmoid"))
model.summary()

Model: "sequential_6"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_6 (Embedding)      (None, 1000, 100)         5000000   
_________________________________________________________________
bidirectional_1 (Bidirection (None, 1000, 64)          34048     
_________________________________________________________________
batch_normalization_15 (Batc (None, 1000, 64)          256       
_________________________________________________________________
activation_15 (Activation)   (None, 1000, 64)          0         
_________________________________________________________________
dropout_15 (Dropout)         (None, 1000, 64)          0         
_________________________________________________________________
dense_12 (Dense)             (None, 1000, 32)          2080      
_________________________________________________________________
batch_normalization_16 (Batc (None, 1000, 32)         
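The parameter counts in the summary above can be reproduced by hand. The sketch below uses the standard formulas for these layer types; the helper function names are made up for this example.

```python
def embedding_params(vocab, dim):
    """One vector of length dim per vocabulary entry."""
    return vocab * dim

def lstm_params(input_dim, units):
    """4 gates, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    """Weight matrix plus one bias per unit."""
    return input_dim * units + units

def batch_norm_params(features):
    """gamma, beta, moving mean and moving variance per feature."""
    return 4 * features

print(embedding_params(50000, 100))  # 5000000
print(2 * lstm_params(100, 32))      # 34048 (two directions)
print(batch_norm_params(64))         # 256
print(dense_params(64, 32))          # 2080
```

Almost all of the parameters sit in the embedding layer, which is why the size of the vocabulary dominates the size of the model.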

In [28]:
# lower-case name so the optimizer instance does not shadow the Adam class
adam = keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999, amsgrad=False)
model.compile(loss="binary_crossentropy",optimizer=adam, metrics=["accuracy"])
save_best_model = ModelCheckpoint("sentiment.h5",save_best_only=True)

model.fit(padded_docs,y_train,validation_split=0.2,batch_size=32,epochs=15,callbacks=[history,save_best_model])


Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15

KeyboardInterrupt: ignored
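Because the checkpoint is created with `save_best_only=True`, interrupting training does not lose the best weights written so far. A minimal sketch of that rule, assuming the default monitored quantity (`val_loss`, lower is better), is shown below; the helper `checkpoints_written` and the loss values are invented, not results of this run.

```python
def checkpoints_written(val_losses):
    """Return (epoch, loss) pairs at which a 'save best only' checkpoint would be written."""
    best = float("inf")
    saved = []
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:  # overwrite the file only when validation loss improves
            best = loss
            saved.append((epoch, loss))
    return saved

val_losses = [0.62, 0.48, 0.51, 0.45, 0.47]  # invented validation losses
saved = checkpoints_written(val_losses)
print(saved)  # [(1, 0.62), (2, 0.48), (4, 0.45)]
```

In this toy run the file on disk after epoch 5 would still hold the weights from epoch 4.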