### Qiaoling Huang(BU20421641)
### Financial Text Analysis Project
- Apply a neural network BERT model and a Naive Bayes classifier to predict two target labels, `EARNINGS` and `ACQUIS`
- Compare the accuracy score for both models



## Installing the transformers library

In [2]:
!pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/37/ba/dda44bbf35b071441635708a3dd568a5ca6bf29f77389f7c7c6818ae9498/transformers-2.7.0-py3-none-any.whl (544kB)

## Importing Packages

In [0]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
import torch
import transformers as ppb
import warnings
warnings.filterwarnings('ignore')

## Importing Dataset
We import the dataset from the local drive and read it with pandas.

In [6]:
#from google.colab import files
#uploaded = files.upload()

# pandas and numpy are already imported in the cell above
df = pd.read_csv('assign3.csv')
df.head()

| | TEST | EARNINGS | ACQUIS | NEWS_TEXT |
|---|---|---|---|---|
| 0 | 1 | 0 | 0 | Mounting trade friction between the U.S. And J... |
| 1 | 1 | 0 | 0 | survey of provinces and seven cities showed v... |
| 2 | 1 | 1 | 0 | Shr .p .p Div .p .p making .p .p Turnover . ... |
| 3 | 1 | 0 | 1 | Whim Creek Consolidated NL> said the consortiu... |
| 4 | 1 | 0 | 0 | The number of workers employed in the West Ger... |


Print out descriptive statistics for the variables (columns) in the dataframe. 

In [7]:
df.describe()

| | TEST | EARNINGS | ACQUIS |
|---|---|---|---|
| count | 2165.0 | 2165.0 | 2165.0 |
| mean | 0.263741 | 0.145497 | 0.37321 |
| std | 0.440762 | 0.352682 | 0.483769 |
| min | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 0.0 | 0.0 |
| 50% | 0.0 | 0.0 | 0.0 |
| 75% | 1.0 | 0.0 | 1.0 |
| max | 1.0 | 1.0 | 1.0 |


There are no descriptive statistics for the `NEWS_TEXT` column because it contains only text data. We will use text analysis to handle this column.

## Loading the pre-trained DistilBERT model.
Let's now load a pre-trained BERT model. We use DistilBERT instead of the full BERT model because DistilBERT is smaller, so it runs much faster and requires less memory.

In [8]:
#For DistilBERT:
model_class, tokenizer_class, pretrained_weights = (ppb.DistilBertModel, ppb.DistilBertTokenizer, 'distilbert-base-uncased')

## Want BERT instead of distilBERT? Uncomment the following line:
#model_class, tokenizer_class, pretrained_weights = (ppb.BertModel, ppb.BertTokenizer, 'bert-base-uncased')

# Load pretrained model/tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)

## Preparing the Dataset
Before we can hand our sentences to BERT, we need to do some minimal processing to put them in the format it requires.

### Tokenization
Our first step is to tokenize the sentences, that is, to break them up into words and subwords in the format BERT expects.

In [0]:
tokenized = df['NEWS_TEXT'].apply((lambda x: tokenizer.encode(x, add_special_tokens=True)))
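
To make "words and subwords" concrete, here is a minimal sketch of greedy longest-match subword splitting, in the spirit of the WordPiece scheme used by BERT tokenizers. `TOY_VOCAB`, `split_word` and `toy_tokenize` are hypothetical names for illustration only; the real tokenizer has its own much larger vocabulary, also splits on punctuation, and returns integer ids rather than strings.

```python
# Toy greedy longest-match-first subword splitting.
# TOY_VOCAB is a made-up vocabulary, not the one shipped with DistilBERT.
TOY_VOCAB = {"[CLS]", "[SEP]", "[UNK]", "share", "##holder", "##s", "profit", "rose"}


def split_word(word, vocab):
    """Split one lowercase word into subword pieces, or [UNK] if impossible."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces


def toy_tokenize(text, vocab=TOY_VOCAB):
    """Lowercase, split on whitespace, split into subwords, add special tokens."""
    tokens = ["[CLS]"]
    for word in text.lower().split():
        tokens.extend(split_word(word, vocab))
    tokens.append("[SEP]")
    return tokens


print(toy_tokenize("Shareholders profit rose"))
```

With this toy vocabulary, "Shareholders" is split into `share`, `##holder` and `##s`, and the sequence is wrapped in `[CLS]` and `[SEP]`, which is what `add_special_tokens=True` requests in the cell above.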

### Padding
After tokenization, `tokenized` holds one entry per sentence, and each sentence is represented as a list of token ids. We want BERT to process all our examples at once (as one batch), which is faster. For that reason, we need to pad all lists to the same length so that we can represent the input as one 2-d array rather than a list of lists of different lengths.

In [0]:
max_len = 0
for i in tokenized.values:
    if len(i) > max_len:
        max_len = len(i)

padded = np.array([i + [0]*(max_len-len(i)) for i in tokenized.values])

Our dataset is now in the `padded` variable. We can view its dimensions below:

In [11]:
np.array(padded).shape

(2165, 86)

### Masking
If we send `padded` to BERT directly, it would treat the padding zeros as real tokens. We need to create another variable that tells it to ignore (mask) the padding we've added when it processes its input. That's what `attention_mask` is:

In [12]:
attention_mask = np.where(padded != 0, 1, 0)
attention_mask.shape

(2165, 86)

Now we can see the attention_mask has the same shape as padded.
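
The padding and masking steps are easier to follow on a tiny input. The sketch below repeats them with plain Python lists; the token ids in `toy_tokenized` are made up for illustration, and `0` is used as the padding id as in the cells above.

```python
# Two made-up token-id sequences of different lengths.
toy_tokenized = [[101, 7592, 102], [101, 2023, 2003, 2146, 102]]

# Length of the longest sequence in the batch.
max_len = max(len(seq) for seq in toy_tokenized)

# Right-pad every sequence with 0 up to max_len.
toy_padded = [seq + [0] * (max_len - len(seq)) for seq in toy_tokenized]

# 1 marks a real token, 0 marks padding.
toy_mask = [[1 if tok != 0 else 0 for tok in seq] for seq in toy_padded]

print(toy_padded)
print(toy_mask)
```

Both results have the same shape (2 rows of 5 entries), mirroring the matching `(2165, 86)` shapes reported for `padded` and `attention_mask`.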

## Model #1-1: And Now, Deep Learning!
Now that we have our model and inputs ready, let's run our model!

The `model()` function runs our sentences through BERT. The results of the processing will be returned into `last_hidden_states`.

In [0]:
input_ids = torch.tensor(padded)  
attention_mask = torch.tensor(attention_mask)

with torch.no_grad():
    last_hidden_states = model(input_ids, attention_mask=attention_mask)

We'll save those in the `features` variable, as they'll serve as the features for our logistic regression model.

In [0]:
features = last_hidden_states[0][:,0,:].numpy()
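
The slice `[:, 0, :]` keeps, for every document, only the output vector at position 0, which is the `[CLS]` token added by `add_special_tokens=True`. A minimal sketch with nested lists standing in for the tensor follows; the values and the tiny sizes are made up, and the real array here would be 2165 by 768, assuming the standard DistilBERT hidden size.

```python
# Toy "hidden states": 2 documents, 3 token positions, hidden size 4.
toy_hidden = [
    [[0.1, 0.2, 0.3, 0.4], [1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0]],
    [[0.5, 0.6, 0.7, 0.8], [3.0, 3.0, 3.0, 3.0], [4.0, 4.0, 4.0, 4.0]],
]

# Equivalent of tensor[:, 0, :]: for every document keep position 0 only.
toy_features = [doc[0] for doc in toy_hidden]

print(len(toy_features), len(toy_features[0]))
print(toy_features)
```

Each document ends up as a single fixed-length vector, which is the form the logistic regression model needs.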

The labels indicating whether each news text belongs to the target category now go into the `labels` variable. We use `EARNINGS` as our labels for this model.

In [0]:
labels = df['EARNINGS']

## Model #1-1: Train/Test Split
Let's now split our dataset into a training set and a testing set.

In [0]:
train_features, test_features, train_labels, test_labels = train_test_split(features, labels)

## Grid Search for Parameters
We can dive into logistic regression directly with the scikit-learn default parameters, but sometimes it's worth searching for the best value of the C parameter, which is the inverse of the regularization strength (smaller values mean stronger regularization).

In [0]:
#parameters = {'C': np.linspace(0.0001, 100, 20)}
#grid_search = GridSearchCV(LogisticRegression(), parameters)
#grid_search.fit(train_features, train_labels)

#print('best parameters: ', grid_search.best_params_)
#print('best scores: ', grid_search.best_score_)
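
The candidate values in the commented-out grid can be listed without running the search. The sketch below reproduces `np.linspace(0.0001, 100, 20)` in plain Python (an equivalent written for illustration, not the notebook's own code) to show which values of C would be tried.

```python
# 20 evenly spaced values from 0.0001 to 100, both endpoints included.
start, stop, num = 0.0001, 100, 20
step = (stop - start) / (num - 1)
c_grid = [start + i * step for i in range(num)]

# Show the first few candidates, rounded for readability.
print([round(c, 2) for c in c_grid[:4]])
```

The second candidate rounds to 5.26, which suggests (but does not prove) that the example value `C=5.26` quoted in this notebook came from this grid.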

We now train the LogisticRegression model. If you've chosen to do the gridsearch, you can plug the value of C into the model declaration (e.g. LogisticRegression(C=5.26)).

In [18]:
lr_clf = LogisticRegression()
lr_clf.fit(train_features, train_labels)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

## Evaluating Model #1-1
So how well does our model do in classifying sentences? One way is to check the accuracy against the testing dataset:

In [19]:
lr_clf.score(test_features, test_labels)

0.9428044280442804

We got an accuracy score of about 0.94 for our model. How good is this score? What can we compare it against? Let's first look at a dummy classifier:

In [20]:
from sklearn.dummy import DummyClassifier
clf = DummyClassifier()

scores = cross_val_score(clf, train_features, train_labels)
print("Dummy classifier score: %0.3f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Dummy classifier score: 0.759 (+/- 0.04)
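
To see where a baseline of this size comes from, the sketch below computes two simple reference accuracies from the positive rates in the `df.describe()` output. It assumes the `DummyClassifier` default in the scikit-learn version used here guesses labels at random in proportion to their frequency; that is an assumption, not something the output confirms. The function names are hypothetical.

```python
def stratified_guess_accuracy(p):
    """Expected accuracy when guessing label 1 with probability p
    on data whose true positive rate is also p."""
    return p * p + (1 - p) * (1 - p)


def majority_class_accuracy(p):
    """Accuracy of always predicting the more frequent label."""
    return max(p, 1 - p)


# Positive rates (column means) taken from the df.describe() output.
for name, p in [("EARNINGS", 0.145497), ("ACQUIS", 0.37321)]:
    print(name,
          round(stratified_guess_accuracy(p), 3),
          round(majority_class_accuracy(p), 3))
```

Proportional random guessing gives roughly 0.75 for `EARNINGS` and 0.53 for `ACQUIS`, close to the dummy scores printed in this notebook. Always predicting the majority class would be a stronger baseline (roughly 0.85 and 0.63), and the logistic regression scores are still well above it.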


## Model #1-2 Repeat Modeling
We use `ACQUIS` variable as our labels this time.

In [0]:
labels = df['ACQUIS']

Let's now split our dataset into a training set and a testing set.

In [0]:
train_features, test_features, train_labels, test_labels = train_test_split(features, labels)

### Fitting model with new labels
The grid search above is commented out, so no tuned value of C is available from this run. The code below again uses the scikit-learn default, C=1.0, as the printed parameters confirm, rather than C=5.26.

In [23]:
lr_clf = LogisticRegression()
lr_clf.fit(train_features, train_labels)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

## Model #1-2 Evaluating Model


In [24]:
lr_clf.score(test_features, test_labels)

0.9501845018450185

We got an accuracy score of about 0.95 for our model with the new labels. Let's check how good this score is by looking at a dummy classifier.

In [25]:
from sklearn.dummy import DummyClassifier
clf = DummyClassifier()

scores = cross_val_score(clf, train_features, train_labels)
print("Dummy classifier score: %0.3f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Dummy classifier score: 0.515 (+/- 0.05)


## Naive Bayes Model


In [26]:
import nltk
nltk.download('all')
## split train and test dataframe
df_train = df[df.TEST == 0]
df_test = df[df.TEST == 1]

[nltk_data] Downloading collection 'all'

Create a word list that collects every word occurrence from the training texts in a single list.

In [0]:
##create wordlist for train 
tr_wordlist = []
for m in range(0, len(df_train)):
  i = df_train.iloc[m].NEWS_TEXT.split()
  for n in i:
    tr_wordlist.append(n)


Build the vocabulary of unique lowercased words from the word list, then define a feature function that maps a document to a dictionary of `contains(word)` flags, one per vocabulary word.

In [0]:
# Define the feature extractor
tr_words = nltk.FreqDist(w.lower() for w in tr_wordlist)
word_features = list(tr_words)

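def document_features(document):
    # Map a tokenized document to binary "contains(word)" features.
    # Note: word_features is lowercased but `document` is not, so
    # capitalized tokens in the document never match a feature.
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features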
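
A small self-contained example shows what the feature extractor produces. It uses `collections.Counter` in place of `nltk.FreqDist` to avoid the dependency (a substitution made for this sketch; both map each word to its count), and all `toy_` names and sentences are made up.

```python
from collections import Counter

# Build a lowercased vocabulary from made-up "training" words.
toy_train_words = "Net profit rose while net loss fell".split()
toy_vocab = list(Counter(w.lower() for w in toy_train_words))


def toy_document_features(document, vocab=toy_vocab):
    """Return one contains(word) flag per vocabulary word."""
    document_words = set(document)
    return {"contains({})".format(word): (word in document_words) for word in vocab}


# Tokens passed as-is, as in the notebook: "Profit" does not match "profit".
feats = toy_document_features("Profit fell sharply".split())
print(feats)

# Lowercasing the tokens first makes the match succeed.
feats_lower = toy_document_features([w.lower() for w in "Profit fell sharply".split()])
print(feats_lower["contains(profit)"])
```

The first call flags only `contains(fell)` as true, because the vocabulary is lowercased while the document tokens are not. Whether lowercasing the document tokens as well would improve the classifier is not tested here.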


## Model #2-1: Applying Naive Bayes model to predict `EARNINGS`
First, extract the `EARNINGS` column from the train and test datasets and create a list of tuples for the model.


In [0]:
## extract `EARNINGS` from train and test
tr_EARNINGS = df_train['EARNINGS']
test_EARNINGS = df_test['EARNINGS']

## create train dataset vs earnings
tr_earnings = [(list(df_train.iloc[i].NEWS_TEXT.split()), tr_EARNINGS.iloc[i]) for i in range(0, len(df_train))]

test_earnings = [(list(df_test.iloc[i].NEWS_TEXT.split()), test_EARNINGS.iloc[i]) for i in range(0, len(df_test))]


Second, convert both the train and test datasets into (features, label) tuples based on the unique words, and train the Naive Bayes model on the train features.

In [0]:
# Train Naive Bayes classifier
tr_featuresets = [(document_features(d), c) for (d,c) in tr_earnings]
test_featuresets = [(document_features(d), c) for (d,c) in test_earnings]
classifier = nltk.NaiveBayesClassifier.train(tr_featuresets)

Last but not least, look at the accuracy score by comparing the predicted results with the test labels.

In [39]:
# Test the classifier
print(nltk.classify.accuracy(classifier, test_featuresets))

0.978984238178634


This model reaches an accuracy score of about 0.98 on the test set.

## Model #2-2: Applying Naive Bayes model to predict `ACQUIS`
First, extract the `ACQUIS` column from the train and test datasets and create a list of tuples for the model.


In [0]:
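## extract `ACQUIS` from train and test
## (the variable names are reused from the EARNINGS section but now hold ACQUIS labels)
tr_EARNINGS = df_train['ACQUIS']
test_EARNINGS = df_test['ACQUIS']

## create train and test datasets of (tokens, ACQUIS label) pairs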
tr_earnings = [(list(df_train.iloc[i].NEWS_TEXT.split()), tr_EARNINGS.iloc[i]) for i in range(0, len(df_train))]

test_earnings = [(list(df_test.iloc[i].NEWS_TEXT.split()), test_EARNINGS.iloc[i]) for i in range(0, len(df_test))]

The following steps are the same as above. The only difference is that we are predicting `ACQUIS` as our target label.

In [41]:
# Train Naive Bayes classifier
tr_featuresets = [(document_features(d), c) for (d,c) in tr_earnings]
test_featuresets = [(document_features(d), c) for (d,c) in test_earnings]
classifier = nltk.NaiveBayesClassifier.train(tr_featuresets)

# Test the classifier
print(nltk.classify.accuracy(classifier, test_featuresets))


0.9422066549912435


## Compare Naive Bayes and BERT

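Naive Bayes works better for `EARNINGS` in this experiment (about 0.98 accuracy versus about 0.94 for logistic regression on DistilBERT features). One likely reason is that it treats the probability of each word appearing in a document as though it were independent of the probability of any other word appearing, which suits labels that are signalled by individual keywords.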

In simple terms, we can tell that a text is about earnings when earnings-related words appear in it. That is why this is an easy task for Naive Bayes.

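Moreover, Naive Bayes in our case was trained directly on this dataset's news texts, whereas the DistilBERT weights were pre-trained elsewhere and only the logistic regression on top was fitted here, which might have helped Naive Bayes get better accuracy than BERT in Earnings classification. However, Naive Bayes was slightly less accurate in Acquisition prediction (about 0.94 versus about 0.95). The reason might be that words related to acquisitions also occur in other contexts, so deciding that a text is about an acquisition requires looking at the words nearby. Note that the two approaches were scored on different test sets (a random `train_test_split` for BERT, the `TEST` column for Naive Bayes), so small differences should be read with caution.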
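
The independence assumption described above can be sketched with a tiny hand-written classifier over word-presence features. This is an illustration only: the documents are made up, and the add-one smoothing used here is a choice made for this sketch, not necessarily what `nltk.NaiveBayesClassifier` does internally.

```python
import math

# Made-up training documents as (set of words, label); 1 = earnings-related.
toy_docs = [
    ({"net", "profit", "dividend"}, 1),
    ({"profit", "shr", "dividend"}, 1),
    ({"merger", "bid", "shares"}, 0),
    ({"trade", "tariff", "shares"}, 0),
]
vocab = sorted(set().union(*(words for words, _ in toy_docs)))


def train(docs):
    """Estimate P(label) and P(word present | label) with add-one smoothing."""
    model = {}
    for label in {lab for _, lab in docs}:
        in_class = [words for words, lab in docs if lab == label]
        prior = len(in_class) / len(docs)
        present = {w: (sum(w in d for d in in_class) + 1) / (len(in_class) + 2)
                   for w in vocab}
        model[label] = (prior, present)
    return model


def log_score(model, words, label):
    """log P(label) plus one independent log term per vocabulary word.
    Summing per-word terms is where the independence assumption enters."""
    prior, present = model[label]
    total = math.log(prior)
    for w in vocab:
        p = present[w]
        total += math.log(p if w in words else 1 - p)
    return total


def predict(model, words):
    """Return the label with the highest log score."""
    return max(model, key=lambda label: log_score(model, words, label))


model = train(toy_docs)
print(predict(model, {"profit", "dividend", "rose"}))
print(predict(model, {"bid", "shares"}))
```

Each word contributes to the score on its own, regardless of its neighbours. That works well when single keywords such as "profit" or "dividend" identify the class, and less well when the meaning of a word such as "shares" depends on context.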