The dataset for fine-tuning is the [Financial Phrase Bank Dataset](https://www.researchgate.net/publication/251231364_FinancialPhraseBank-v10) that was collected by [Maloet al, 2014](https://arxiv.org/abs/1307.5336) 

The reason for choosing this dataset is because it consists of 4845 english sentences selected randomly from financial news, labeled according to how the information contained might affect the mentioned company stock price: positive, negative or neutral. The dataset is available in four possible configurations depending on the percentage of agreement of annotators, however, we will use file `Sentences_50Agree.txt` as this file contains the most complete data.

This picture below depicts how how text is classified in the datase: 
![image.png](attachment:image.png)
*Source: Good Debt or Bad Debt: Detecting Semantic Orientations in Economic Texts*

# Import

In [9]:
#import packages
import pandas as pd
import numpy as np

from collections import Counter

import re                                  # library for regular expression operations
import string                              # for string operations

from nltk.corpus import stopwords          # module for stop words that come with NLTK
from nltk.stem import PorterStemmer        # module for stemming
from nltk.tokenize import TweetTokenizer   # module for tokenizing strings

In [10]:
# import dataset
path_50 = '/Users/hienanh/Documents/GitHub/Model/Sentences_50Agree.txt'
df = pd.read_csv(path_50, sep="@",encoding="ISO-8859-1", names = ['sentence', 'label'])
df.head()

Unnamed: 0,sentence,label
0,"According to Gran , the company has no plans t...",neutral
1,Technopolis plans to develop in stages an area...,neutral
2,The international electronic industry company ...,negative
3,With the new production plant the company woul...,positive
4,According to the company 's updated strategy f...,positive


In [3]:
df.isnull().sum()

sentence    0
label       0
dtype: int64

In [4]:
Counter(df['label'].tolist())

Counter({'neutral': 2879, 'negative': 604, 'positive': 1363})

# Vectorizers: Bag of words

In [13]:
#import packages
import pandas as pd
import numpy as np

from collections import Counter

import re                                  # library for regular expression operations
import string                              # for string operations

from nltk.corpus import stopwords          # module for stop words that come with NLTK
from nltk.stem import PorterStemmer        # module for stemming
from nltk.tokenize import TweetTokenizer   # module for tokenizing strings

In [14]:
# import dataset
path_50 = '/Users/hienanh/Documents/GitHub/Model/Sentences_50Agree.txt'
df_50 = pd.read_csv(path_50, sep="@",encoding="ISO-8859-1", names = ['sentence', 'label'])
df_50.head()

Unnamed: 0,sentence,label
0,"According to Gran , the company has no plans t...",neutral
1,Technopolis plans to develop in stages an area...,neutral
2,The international electronic industry company ...,negative
3,With the new production plant the company woul...,positive
4,According to the company 's updated strategy f...,positive


In [15]:
# pre-processing 
## clean text
def process_text(text):
    stemmer = PorterStemmer()
    stopwords_english = stopwords.words('english')
    #text = text.str
    text = re.sub(r'\$\w*', '', text)
    text = re.sub(r'^RT[\s]+', '', text)
    text = re.sub(r'https?:\/\/.*[\r\n]*', '', text)
    text = re.sub(r'#', '', text)
    tokenizer = TweetTokenizer(preserve_case=False,        strip_handles=True,reduce_len=True)
    text_tokens = tokenizer.tokenize(text)

    text_clean = []
    for word in text_tokens:
        if (word not in stopwords_english and  
                word not in string.punctuation): 
            stem_word = stemmer.stem(word)  # stemming word
            text_clean.append(stem_word)
            
    text_clean_2 = ' '.join(text_clean)

    return text_clean_2

## quantify label 
def encode(label):
    if label == 'negative':
        value = -1
    elif label == 'neutral':
        value = 0
    elif label == 'positive':
        value = 1
    return value

In [16]:
df_50['clean_text'] = df_50['sentence'].apply(lambda x: process_text(x))
df_50['sentiment'] = df_50['label'].apply(lambda x: encode(x))
df_50

Unnamed: 0,sentence,label,clean_text,sentiment
0,"According to Gran , the company has no plans t...",neutral,accord gran compani plan move product russia a...,0
1,Technopolis plans to develop in stages an area...,neutral,"technopoli plan develop stage area less 100,00...",0
2,The international electronic industry company ...,negative,intern electron industri compani elcoteq laid ...,-1
3,With the new production plant the company woul...,positive,new product plant compani would increas capac ...,1
4,According to the company 's updated strategy f...,positive,accord compani updat strategi year 2009-2012 b...,1
...,...,...,...,...
4841,LONDON MarketWatch -- Share prices ended lower...,negative,london marketwatch share price end lower londo...,-1
4842,Rinkuskiai 's beer sales fell by 6.5 per cent ...,neutral,rinkuskiai beer sale fell 6.5 per cent 4.16 mi...,0
4843,Operating profit fell to EUR 35.4 mn from EUR ...,negative,oper profit fell eur 35.4 mn eur 68.8 mn 2007 ...,-1
4844,Net sales of the Paper segment decreased to EU...,negative,net sale paper segment decreas eur 221.6 mn se...,-1


In [17]:
from sklearn.model_selection import train_test_split

X=df_50['clean_text'].values
Y=df_50['sentiment'].values
X_train, X_test, Y_train, Y_test= train_test_split(X,Y, test_size=0.3, random_state=254)

X_train

array(['initi estim total valu contract 250 000 euro exclud vat',
       'payment 2.779 million lita interest long-term loan provid raguti major sharehold estonia le coq also ad loss',
       'finnish ac drive manufactur vacon acquir ac drive busi tb wood part us group altra hold',
       ...,
       'offer 30 million share aim rais x20ac 500 million us 640 million expect complet oct 9 outokumpu said',
       'net profit period 2009 euro 29 million',
       'accord mark white locatrix commun ceo compani web servic interfac allow devic owner friend famili track locat twig user via web browser'],
      dtype=object)

In [18]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(strip_accents='unicode', 
                             analyzer='word', 
                             ngram_range=(1,3), 
                             norm='l2')
vectorizer.fit(X_train)
vectorizer.fit(X_test)

TfidfVectorizer(ngram_range=(1, 3), strip_accents='unicode')

In [19]:
x_train = vectorizer.transform(X_train)
x_test = vectorizer.transform(X_test)
x_train

<3392x34472 sparse matrix of type '<class 'numpy.float64'>'
	with 52320 stored elements in Compressed Sparse Row format>

In [20]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [21]:
model = LogisticRegression(random_state=0,multi_class='ovr').fit(x_train, Y_train)
prediction = model.predict(x_test)
accuracy_score(Y_test, prediction)

0.6836313617606602

In [22]:
Counter(prediction)

Counter({0: 1320, 1: 104, -1: 30})

In [29]:
file_list = ['participants_1.csv','participants_2.csv',
             'participants_3.csv','participants_4.csv']

In [30]:
def test(file_list):
    list_text = []
    for file in file_list:
        path_dir = '/Users/hienanh/Desktop/Cass/ARP/text_data/'
        path = path_dir + file
        participants = pd.read_csv(path)
        text = participants.iloc[0,1]
        test = process_text(text)
        list_text.append(test)
    
    return list_text

In [31]:
test = test(file_list)
test

["ye thank uli good thing interfac surround somi read glass broke 15 minut meet axel kind enough toget qualiti uk product drugstor next corner i'm onlin thank you.thank uli think mention alreadi part – major part keyfigur result think start – get detail p l look key metric remark develop key metric whichwer driven us also market ezb whomsoever.first strong cash flow 2014 nearli reach € 2 billion mark westart littl bit slow year think mention previou occas wason u mobil contract might come contract littl bit later whichwa cash neg first two quarter turnaround – alreadi willturnaround 2015 get money back second half year verystrong cash flow think support posit messag qualiti theresult even littl bit depress overal growth cash flow stay verypositive.on right hand side asset manag 14 € 4.5 billion cantel develop u dollar yield especi yield thegovern bond today approach € 39 billion increas reallyremark driven good result also yield andespeci currenc exchang rates.th goe capit posit next s

In [32]:
test_vecto = vectorizer.transform(test)
prediction_test = model.predict(test_vecto)
prediction_test

array([1, 0, 0, 0])

In [33]:
Counter(prediction)

Counter({0: 1320, 1: 104, -1: 30})

In [35]:
path_dir = '/Users/hienanh/Desktop/Cass/ARP/text_data/'
new_path = path_dir + file_list[3]
text_check = pd.read_csv(new_path)
text_check.iloc[0,1]

"Three quick questions. One is just a follow-up on your last answer. I don't understand thenwhy the diversification benefit on page 84 has gone up 51%, if you reduced thediversification to how come the overall, and as a proportion of the gross capitalrequirement gone up, I think from 32% to 36%, so?Okay. And the two other quick questions. If I look at your life reinsurance business andyour MCEV disclosure over the last five years, there's been just short of €800 million ofnegative operating variances or assumption changes related to U.S., UK, longevity,Australian DII. You continue to express a lot of confidence in the Life result improving, andalso in your EV translating to IFRS profits. It's a bit hard for us to kind of see that, given theamount of sort of assumption changes and variances that have been going through. Somaybe do you think you need to do a more thorough kind of ground-up review again ofassumptions? Or we've really passed the worst on particularly the U.S.? I still find