## Financial News Data - Exploration of the Project Dataset

### Dataset Description:

A polar sentiment dataset of sentences from financial news. It consists of 4840 sentences from English-language financial news, categorised by sentiment. The dataset is split into subsets according to the rate of agreement among the 5-8 annotators who labelled each sentence.


#### Initial Data Collection and Normalization

The corpus used in this paper consists of English news on all listed companies in OMX Helsinki. The news was downloaded from the LexisNexis database using an automated web scraper. From this news database, a random subset of 10,000 articles was selected to obtain good coverage across small and large companies, companies in different industries, and different news sources. Following the approach taken by Maks and Vossen (2010), we excluded all sentences which did not contain any of the lexicon entities. This reduced the overall sample to 53,400 sentences, each of which contains at least one recognized lexicon entity. The sentences were then classified according to the types of entity sequences detected. Finally, a random sample of ∼5000 sentences was chosen to represent the overall news database.


#### Who are the source language producers?

The source data was written by various financial journalists.

#### Annotation process

This release of the financial phrase bank covers a collection of 4840 sentences. The selected collection of phrases was annotated by 16 people with adequate background knowledge on financial markets.

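Given the large number of overlapping annotations (5 to 8 annotations per sentence), there are several ways to define a majority-vote-based gold standard. To provide an objective comparison, we have formed 4 alternative reference datasets based on the strength of majority agreement: at least 50%, at least 66%, at least 75%, and 100% annotator agreement (the `sentences_50agree`, `sentences_66agree`, `sentences_75agree`, and `sentences_allagree` configurations, respectively).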
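To make the idea of agreement-based subsets concrete, here is a minimal sketch of how a majority label and its agreement rate could be computed for one sentence. The function names, the example votes, and the reading of each threshold as an "at least" cut-off are assumptions for illustration; this is not the original authors' procedure, and ties between labels are not handled.

```python
from collections import Counter

# Agreement cut-offs for the four reference subsets
# (assumption: each subset keeps sentences with agreement >= its cut-off).
THRESHOLDS = {
    "sentences_50agree": 0.50,
    "sentences_66agree": 0.66,
    "sentences_75agree": 0.75,
    "sentences_allagree": 1.00,
}


def majority_agreement(annotations):
    """Return (majority_label, share of annotators who chose it)."""
    label, votes = Counter(annotations).most_common(1)[0]
    return label, votes / len(annotations)


def subsets_for(annotations):
    """List every subset whose agreement cut-off this sentence meets."""
    _, agreement = majority_agreement(annotations)
    return [name for name, cutoff in THRESHOLDS.items() if agreement >= cutoff]


# Hypothetical votes from six annotators for a single sentence.
votes = ["positive", "positive", "positive", "positive", "neutral", "neutral"]
label, agreement = majority_agreement(votes)
print(label, round(agreement, 2))
print(subsets_for(votes))
```

With four of six annotators agreeing, this sentence would qualify for the 50% and 66% subsets but not for the 75% or full-agreement ones, which is why the stricter subsets are smaller.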


#### Who are the annotators?

Three of the annotators were researchers and the remaining 13 annotators were master's students at Aalto University School of Business with majors primarily in finance, accounting, and economics.


#### Discussion of Biases

All annotators were from the same institution, so inter-annotator agreement should be interpreted with this in mind.


In [16]:
from datasets import load_dataset

data = load_dataset('financial_phrasebank', 'sentences_50agree')

In [17]:
data

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label'],
        num_rows: 4846
    })
})

In [18]:
data.set_format(type='pandas')

In [22]:
type(data['train'])

datasets.arrow_dataset.Dataset

In [23]:
data = data['train']

In [24]:
data.to_csv('financial_news.txt', index=False)

641139

In [25]:
from pyspark.sql import SparkSession


In [26]:
spark = SparkSession.builder.appName('SentimentAnalysis').getOrCreate()

In [38]:
financial_news = spark.read.csv('financial_news.txt', header=True)
financial_news.head()

Row(sentence='According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .', label='1')

In [39]:
financial_news.show(20, truncate=70)

+----------------------------------------------------------------------+-----+
|                                                              sentence|label|
+----------------------------------------------------------------------+-----+
|According to Gran , the company has no plans to move all production...|    1|
|Technopolis plans to develop in stages an area of no less than 100,...|    1|
|The international electronic industry company Elcoteq has laid off ...|    0|
|With the new production plant the company would increase its capaci...|    2|
|According to the company 's updated strategy for the years 2009-201...|    2|
|FINANCING OF ASPOCOMP 'S GROWTH Aspocomp is aggressively pursuing i...|    2|
|For the last quarter of 2010 , Componenta 's net sales doubled to E...|    2|
|In the third quarter of 2010 , net sales increased by 5.2 % to EUR ...|    2|
|Operating profit rose to EUR 13.1 mn from EUR 8.7 mn in the corresp...|    2|
|Operating profit totalled EUR 21.1 mn , up from EUR

In [42]:
financial_news.printSchema()

root
 |-- sentence: string (nullable = true)
 |-- label: string (nullable = true)
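The schema reports `label` as a string because a CSV file carries no type information, so fields are read back as text unless a schema is inferred or supplied. The sketch below reproduces that round trip with Python's standard `csv` module rather than Spark, using the first sentence of the dataset; it is only an illustration of the effect. In PySpark, passing `inferSchema=True` to `spark.read.csv` or casting the column with `.cast('int')` are the usual remedies.

```python
import csv
import io

sentence = (
    "According to Gran , the company has no plans to move all production "
    "to Russia , although that is where the company is growing ."
)

# Write a header and one row; the integer label is serialised as text,
# and the sentence is quoted because it contains commas.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["sentence", "label"])
writer.writerow([sentence, 1])

# Read the row back: every field, including the label, is now a string.
buffer.seek(0)
row = next(csv.DictReader(buffer))
print(repr(row["label"]))

# An explicit cast restores the numeric type.
label = int(row["label"])
print(repr(label))
```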



In [44]:
financial_news.groupBy('label').agg({'sentence': 'count'}).show()

+-----+---------------+
|label|count(sentence)|
+-----+---------------+
|    0|            604|
|    1|           2879|
|    2|           1363|
+-----+---------------+
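The counts above indicate a noticeably imbalanced dataset. As a quick sketch, the class shares and the accuracy of a trivial "always predict the most frequent class" baseline can be derived directly from these numbers. The label names used here (0 = negative, 1 = neutral, 2 = positive) are an assumption, since the notebook output shows only the integer codes.

```python
# Counts copied from the groupBy output above.
counts = {0: 604, 1: 2879, 2: 1363}

# Assumed mapping from integer codes to sentiment names.
label_names = {0: "negative", 1: "neutral", 2: "positive"}

total = sum(counts.values())
shares = {label_names[code]: n / total for code, n in counts.items()}

for name, share in shares.items():
    print(f"{name:<8} {share:.1%}")

# Accuracy obtained by always predicting the most frequent class.
majority_baseline = max(shares.values())
print(f"majority-class baseline: {majority_baseline:.1%}")
```

Any classifier trained on this data should be judged against that baseline rather than against chance, and per-class metrics are likely to be more informative than overall accuracy.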



In [45]:
financial_news.take(2)

[Row(sentence='According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .', label='1'),
 Row(sentence='Technopolis plans to develop in stages an area of no less than 100,000 square meters in order to host companies working in computer technologies and telecommunications , the statement said .', label='1')]

In [46]:
financial_news.first()

Row(sentence='According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .', label='1')

In [53]:
financial_news.filter(financial_news['label'].cast('int') > 0).show(20, truncate=80)

+--------------------------------------------------------------------------------+-----+
|                                                                        sentence|label|
+--------------------------------------------------------------------------------+-----+
|According to Gran , the company has no plans to move all production to Russia...|    1|
|Technopolis plans to develop in stages an area of no less than 100,000 square...|    1|
|With the new production plant the company would increase its capacity to meet...|    2|
|According to the company 's updated strategy for the years 2009-2012 , Baswar...|    2|
|FINANCING OF ASPOCOMP 'S GROWTH Aspocomp is aggressively pursuing its growth ...|    2|
|For the last quarter of 2010 , Componenta 's net sales doubled to EUR131m fro...|    2|
|In the third quarter of 2010 , net sales increased by 5.2 % to EUR 205.5 mn ,...|    2|
|Operating profit rose to EUR 13.1 mn from EUR 8.7 mn in the corresponding per...|    2|
|Operating profit tot

In [55]:
display(financial_news)

DataFrame[sentence: string, label: string]

In [58]:
financial_news.pandas_api()



|   | sentence | label |
|---|----------|-------|
| 0 | According to Gran , the company has no plans t... | 1 |
| 1 | Technopolis plans to develop in stages an area... | 1 |
| 2 | The international electronic industry company ... | 0 |
| 3 | With the new production plant the company woul... | 2 |
| 4 | According to the company 's updated strategy f... | 2 |
| 5 | FINANCING OF ASPOCOMP 'S GROWTH Aspocomp is ag... | 2 |
| 6 | For the last quarter of 2010 , Componenta 's n... | 2 |
| 7 | In the third quarter of 2010 , net sales incre... | 2 |
| 8 | Operating profit rose to EUR 13.1 mn from EUR ... | 2 |
| 9 | Operating profit totalled EUR 21.1 mn , up fro... | 2 |

