# Queen's University Alternative Assets Fund
#### Learning and Development Session - Sentiment Analysis for Financial News


* Prepared by Robert Davis for QUAAF
* May 20, 2021
* To be run in Google Colab


## Setup

#### Load required packages


In [1]:
# Note that the simpletransformers installation requires a runtime restart
!pip install simpletransformers

import pandas as pd



### Load Data

Load data from Kaggle dataset located at: https://www.kaggle.com/ankurzing/sentiment-analysis-for-financial-news

In [2]:
df = pd.read_csv('https://raw.githubusercontent.com/QueensU-Alternative-Asset-Fund/Learning-and-Development/master/data/FinancialSentiment.csv', encoding='latin-1', header=None)

### Inspect Data

In [3]:
# Inspect dataframe
df

Unnamed: 0,0,1
0,neutral,"According to Gran , the company has no plans t..."
1,neutral,Technopolis plans to develop in stages an area...
2,negative,The international electronic industry company ...
3,positive,With the new production plant the company woul...
4,positive,According to the company 's updated strategy f...
...,...,...
4841,negative,LONDON MarketWatch -- Share prices ended lower...
4842,neutral,Rinkuskiai 's beer sales fell by 6.5 per cent ...
4843,negative,Operating profit fell to EUR 35.4 mn from EUR ...
4844,negative,Net sales of the Paper segment decreased to EU...


In [4]:
# Clean up the dataframe
# Need to add column titles, and remove any rows where the sentiment is neutral
# Need to change 'negative' to 0, and 'positive' to 1

df.columns = ['Sentiment','Text']
# .copy() makes the filtered frame independent of the original, which avoids pandas' SettingWithCopyWarning
df = df[df['Sentiment'] != 'neutral'].copy()
df.reset_index(inplace=True, drop=True)

# Map labels in the Sentiment column only, leaving the Text column untouched
df['Sentiment'] = df['Sentiment'].map({'negative': 0, 'positive': 1})


In [5]:
# Inspect updated dataframe
df

Unnamed: 0,Sentiment,Text
0,0,The international electronic industry company ...
1,1,With the new production plant the company woul...
2,1,According to the company 's updated strategy f...
3,1,FINANCING OF ASPOCOMP 'S GROWTH Aspocomp is ag...
4,1,"For the last quarter of 2010 , Componenta 's n..."
...,...,...
1962,0,HELSINKI Thomson Financial - Shares in Cargote...
1963,0,LONDON MarketWatch -- Share prices ended lower...
1964,0,Operating profit fell to EUR 35.4 mn from EUR ...
1965,0,Net sales of the Paper segment decreased to EU...


In [6]:
# Look at a particular row

row = 400
sentiment = df.iloc[row]['Sentiment']
text = df.iloc[row]['Text']

print(f'Row selected = {row}')
print(f'Sentiment: {sentiment}')
print(f'Text: {text}')

Row selected = 400
Sentiment: 1
Text: The company plans to expand into the international market through its subsidiaries and distributors from 2011 onwards .


### Data Quality
Most real-world datasets need significant cleaning before they can be used for modelling.
This dataset has already been cleaned, which allows us to skip that step.
In practice, data cleaning and feature engineering often make up the majority of the work in this type of analysis; figures as high as 80% are commonly quoted.
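
As a rough illustration of what that cleaning step usually involves, the sketch below trims whitespace, drops unlabeled or empty rows, and removes duplicate headlines. The `raw` dataframe and its rows are made up for the example and are not part of the session dataset.

```python
import pandas as pd

# Hypothetical raw headlines, not taken from the session dataset
raw = pd.DataFrame({
    'Sentiment': ['positive', 'negative', 'negative', None, 'positive'],
    'Text': ['Profit rose 10% .', '  Sales fell sharply .  ',
             'Sales fell sharply .', 'No label here .', ''],
})

cleaned = raw.copy()
cleaned['Text'] = cleaned['Text'].str.strip()        # trim stray whitespace
cleaned = cleaned.dropna(subset=['Sentiment'])       # drop rows with no label
cleaned = cleaned[cleaned['Text'] != '']             # drop empty headlines
cleaned = cleaned.drop_duplicates(subset=['Text'])   # drop repeated headlines
cleaned = cleaned.reset_index(drop=True)

print(cleaned)
print(cleaned['Sentiment'].value_counts())           # check the class balance
```

Real pipelines typically need more than this (encoding fixes, language filtering, label audits), so treat it as a starting checklist rather than a complete recipe.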


### Train Test Split


In [7]:
# Split the data

from sklearn.model_selection import train_test_split

X = df['Text']
y = df['Sentiment']

X_train, X_val, y_train, y_val = train_test_split(X,y,random_state=42)
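
The call above leaves `test_size` at its default, so scikit-learn holds out 25% of the rows for validation. A possible variation, not used in this session, is to pass `stratify` so that both splits keep roughly the same share of positive and negative headlines. The `toy` dataframe below is a made-up stand-in for `df`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 14 positive and 6 negative headlines
toy = pd.DataFrame({
    'Text': [f'headline {i}' for i in range(20)],
    'Sentiment': [1] * 14 + [0] * 6,
})

X_tr, X_va, y_tr, y_va = train_test_split(
    toy['Text'],
    toy['Sentiment'],
    test_size=0.25,             # same as the default, written out for clarity
    random_state=42,            # fixed seed so the split is reproducible
    stratify=toy['Sentiment'],  # keep class proportions similar in both splits
)

print(len(X_tr), len(X_va))
print(y_va.value_counts())
```

Stratifying matters most when one class is much rarer than the other, since a purely random split can leave the validation set with very few examples of it.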

## Sentiment Analysis with Transformers

### Data Prep

In [8]:
# SimpleTransformers expects text and labels in a single dataframe, but X and y are currently stored separately.
# The columns here are not named 'text' and 'labels', so the library falls back to column order (text first, label second).

X_train_transformers = pd.DataFrame(X_train)
X_train_transformers['Polarity'] = y_train

X_val_transformers = pd.DataFrame(X_val)
X_val_transformers['Polarity'] = y_val
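
SimpleTransformers also accepts a dataframe whose columns are explicitly named `text` and `labels`, as described in its documentation, which avoids relying on column order and should avoid the "Dataframe headers not specified" fallback message. The sketch below shows that layout on a few made-up rows; `X_small` and `y_small` are hypothetical stand-ins for `X_train` and `y_train`.

```python
import pandas as pd

# Hypothetical stand-ins for X_train and y_train
X_small = pd.Series(['Profit rose 10% .', 'Sales fell sharply .', 'Orders grew in Q2 .'], name='Text')
y_small = pd.Series([1, 0, 1], name='Sentiment')

# Build one dataframe with the column names SimpleTransformers looks for
train_df = pd.DataFrame({'text': X_small, 'labels': y_small})
print(train_df)
```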


### Model Setup

In [9]:
from simpletransformers.classification import ClassificationModel, ClassificationArgs

# Optional model configuration
model_args = ClassificationArgs(num_train_epochs=5, sliding_window=False, overwrite_output_dir=True, save_model_every_epoch=False, max_seq_length=420)

# Create a ClassificationModel
model = ClassificationModel("xlnet", "xlnet-base-cased", args=model_args, use_cuda=True)

Some weights of the model checkpoint at xlnet-base-cased were not used when initializing XLNetForSequenceClassification: ['lm_loss.bias', 'lm_loss.weight']
- This IS expected if you are initializing XLNetForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLNetForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of XLNetForSequenceClassification were not initialized from the model checkpoint at xlnet-base-cased and are newly initialized: ['logits_proj.weight', 'sequence_summary.summary.bias', 'sequence_summary.summary.weight', 'logits_proj.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions a

### Train the model

In [10]:
model.train_model(X_train_transformers)

  "Dataframe headers not specified. Falling back to using column 0 as text and column 1 as labels."


  0%|          | 0/1475 [00:00<?, ?it/s]

Epoch:   0%|          | 0/5 [00:00<?, ?it/s]

Running Epoch 0 of 5:   0%|          | 0/185 [00:00<?, ?it/s]



Running Epoch 1 of 5:   0%|          | 0/185 [00:00<?, ?it/s]

Running Epoch 2 of 5:   0%|          | 0/185 [00:00<?, ?it/s]

Running Epoch 3 of 5:   0%|          | 0/185 [00:00<?, ?it/s]

Running Epoch 4 of 5:   0%|          | 0/185 [00:00<?, ?it/s]

(925, 0.16956394432200278)
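
The tuple returned by `train_model` appears to be the total number of optimizer steps followed by the average training loss. The step count can be checked against the progress bars above: 1475 training rows at 185 batches per epoch implies a batch size of 8 (an assumption here, since it is not set explicitly in `model_args`), and 185 batches over 5 epochs gives 925 steps.

```python
import math

n_train = 1475     # training rows, from the progress bar
batch_size = 8     # assumption: inferred from 185 batches per epoch
num_epochs = 5     # num_train_epochs in model_args

batches_per_epoch = math.ceil(n_train / batch_size)
total_steps = batches_per_epoch * num_epochs

print(batches_per_epoch, total_steps)  # 185 925
```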

### Evaluate the model

In [11]:
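from sklearn.metrics import f1_score

# Extra metrics are passed as keyword arguments; each is called as metric(true_labels, predictions)
result, model_outputs, wrong_predictions = model.eval_model(X_val_transformers, f1=f1_score)
result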

  "Dataframe headers not specified. Falling back to using column 0 as text and column 1 as labels."


  0%|          | 0/492 [00:00<?, ?it/s]

Running Evaluation:   0%|          | 0/62 [00:00<?, ?it/s]

{'auprc': 0.9968419537683126,
 'auroc': 0.9937512900865061,
 'eval_loss': 0.20758276702869602,
 'f1': 0.9774436090225564,
 'fn': 6,
 'fp': 9,
 'mcc': 0.9305147090841114,
 'tn': 152,
 'tp': 325}
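
The reported scores can be reproduced by hand from the four confusion-matrix counts in the result, which is a useful sanity check on what each metric means. The sketch below uses the standard definitions of accuracy, precision, recall, F1 and the Matthews correlation coefficient (MCC); the variable names are chosen for illustration.

```python
import math

# Counts copied from the evaluation result
tp, tn, fp, fn = 325, 152, 9, 6

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)          # share of predicted positives that are correct
recall = tp / (tp + fn)             # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

print(f'accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}')
print(f'f1={f1:.4f} mcc={mcc:.4f}')
```

F1 and MCC come out at about 0.9774 and 0.9305, matching the `f1` and `mcc` entries in the result. The four counts sum to 492, the size of the validation set.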

### Predict a new sentence

In [12]:
input_text = ['APPLE SHARES DOWN ABOUT 6% PREMARKET AFTER CO FORECASTS Q4 PROFIT BELOW ESTIMATES',
              '$TSLA IS STUCK WITH OVER 10,000 CARS ON FACTORY HOLD, RESULTING IN A LOGISTICAL NIGHTMARE - ELECTREK']

In [13]:

# model.predict returns the predicted labels and the raw model outputs
predictions, raw_outputs = model.predict(input_text)

for sentence, prediction in zip(input_text, predictions):
  print(f'Sentence: {sentence}')
  print(f'Prediction: {prediction}')

  0%|          | 0/2 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

Sentence: APPLE SHARES DOWN ABOUT 6% PREMARKET AFTER CO FORECASTS Q4 PROFIT BELOW ESTIMATES
Prediction: 0
Sentence: $TSLA IS STUCK WITH OVER 10,000 CARS ON FACTORY HOLD, RESULTING IN A LOGISTICAL NIGHTMARE - ELECTREK
Prediction: 0
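
`model.predict` returns the predicted labels together with the raw model outputs. If a confidence score is wanted rather than a hard 0/1 label, one common approach is to pass the raw outputs for a sentence through a softmax. The logits below are invented for illustration and are not output from the trained model.

```python
import math

def softmax(logits):
    """Convert a list of raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw outputs for one sentence: [score for class 0, score for class 1]
example_logits = [2.3, -1.7]
probs = softmax(example_logits)

labels = {0: 'negative', 1: 'positive'}
best = max(range(len(probs)), key=probs.__getitem__)
print(f'{labels[best]} with probability {probs[best]:.3f}')
```

A probability near 0.5 would flag a headline the model is unsure about, which a hard label alone cannot show.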


## Named Entity Recognition


In [14]:
import spacy

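# Load the small English pipeline (tokenizer, tagger, parser and NER)
# If it is not already installed, run: !python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')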

In [15]:
doc = nlp(input_text[0])

# Analyze syntax
print("Noun phrases:", [chunk.text for chunk in doc.noun_chunks])
print("Verbs:", [token.lemma_ for token in doc if token.pos_ == "VERB"])

# Find named entities, phrases and concepts
for entity in doc.ents:
    print(entity.text, entity.label_)



Noun phrases: ['ABOUT 6% PREMARKET', 'CO FORECASTS Q4 PROFIT', 'ESTIMATES']
Verbs: ['share']
APPLE ORG
ABOUT 6% PERCENT
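
To combine the two steps of this session, the entities found in a headline can be stored next to its predicted sentiment. The sketch below does this in plain Python using the entities and prediction printed above for the first headline; the `tag_headline` helper is hypothetical and simply groups `(text, label)` pairs by label.

```python
from collections import defaultdict

def tag_headline(entities, sentiment):
    """Group (text, label) entity pairs by label and attach the sentiment."""
    grouped = defaultdict(list)
    for text, label in entities:
        grouped[label].append(text)
    return {'sentiment': sentiment, 'entities': dict(grouped)}

# Entities and prediction copied from the outputs above for the first headline
entities = [('APPLE', 'ORG'), ('ABOUT 6%', 'PERCENT')]
record = tag_headline(entities, sentiment=0)
print(record)
```

With spaCy, the `entities` list could be built as `[(ent.text, ent.label_) for ent in doc.ents]`.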
