# **BA 870 – Assignment 3**

**Shangkun(Sherry) Zuo, Yuqi(Yoki) Liu, Yanni Lan, Jiayuan Zou, Ziyan Pei, Siqi Zhang**  
Cohort B  
April 6, 2020

## Upload data to Pandas DataFrames

In [0]:
from google.colab import files
uploaded = files.upload()

Saving assign3.csv to assign3.csv


The uploaded text file contains the labelled training and testing data of Reuters news articles.

In [0]:
import pandas as pd
import numpy as np
df = pd.read_csv('assign3.csv')  # read_csv already returns a DataFrame
df.head()

Unnamed: 0,TEST,EARNINGS,ACQUIS,NEWS_TEXT
0,1,0,0,Mounting trade friction between the U.S. And J...
1,1,0,0,survey of provinces and seven cities showed v...
2,1,1,0,Shr .p .p Div .p .p making .p .p Turnover . ...
3,1,0,1,Whim Creek Consolidated NL> said the consortiu...
4,1,0,0,The number of workers employed in the West Ger...


## Dataset Overview

In [0]:
#dimension check
df.shape

(2165, 4)

This is a subset of the Reuters-21578 dataset with 2,165 observations.

##### Column Descriptions:  

**TEST** = a variable that equals “1” if the observation belongs to the testing set used to evaluate the trained machine learning model; otherwise it equals “0”, meaning the observation is used in the training sample.  
**EARNINGS** = a variable that equals “1” if the text is labelled as “Earnings Announcement” news; otherwise it equals “0”.  
**ACQUIS** = a variable that equals “1” if the text is labelled as a “Corporate Acquisition” news item on Reuters; otherwise it equals “0”.  
**NEWS_TEXT** = a string of text that captures the beginning of the actual news report on Reuters.

In [0]:
#check missing values
df.isna().any()

TEST         False
EARNINGS     False
ACQUIS       False
NEWS_TEXT    False
dtype: bool

The dataset has no missing values, so it is ready for the next step.

In [0]:
#summary of statistics, check outliers
df.describe()

Unnamed: 0,TEST,EARNINGS,ACQUIS
count,2165.0,2165.0,2165.0
mean,0.263741,0.145497,0.37321
std,0.440762,0.352682,0.483769
min,0.0,0.0,0.0
25%,0.0,0.0,0.0
50%,0.0,0.0,0.0
75%,1.0,0.0,1.0
max,1.0,1.0,1.0


All numeric variables are binary (0/1) indicators, so there are no outliers to handle and the dataset is ready for the next step.



##### Numeric Variables' Value Count Check

In [0]:
df['TEST'].value_counts()

0    1594
1     571
Name: TEST, dtype: int64

There are 1594 observations in the training set, and 571 observations in the testing set.

In [0]:
df['EARNINGS'].value_counts()

0    1850
1     315
Name: EARNINGS, dtype: int64

There are 315 news items labeled "Earnings Announcement", and 1850 that are not.

In [0]:
df['ACQUIS'].value_counts()

0    1357
1     808
Name: ACQUIS, dtype: int64

There are 808 news items labeled “Corporate Acquisition”, and 1357 that are not. 

# DistilBERT

## Install the transformers library

In [0]:
#installing the huggingface transformers library so we can load our deep learning NLP model
!pip install transformers

Collecting transformers
  Downloading transformers-2.7.0-py3-none-any.whl (544kB)
Collecting sacremoses
  Downloading sacremoses-0.0.38.tar.gz (860kB)
Collecting tokenizers==0.5.2
  Downloading tokenizers-0.5.2-cp36-cp36m-manylinux1_x86_64.whl (3.7MB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.85-cp36-cp36m-manylinux1_x86_64.whl (1.0MB)

In [0]:
#import libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
import torch
import transformers as ppb
import warnings
warnings.filterwarnings('ignore')

## Load the pre-trained DistilBERT model.

In [0]:
# For DistilBERT:
model_class, tokenizer_class, pretrained_weights = (ppb.DistilBertModel, ppb.DistilBertTokenizer, 'distilbert-base-uncased')

# Load pretrained model/tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)

Right now, the variable `model` holds a pretrained DistilBERT model -- a version of BERT that is smaller and faster and requires less memory.

## Prepare Model #1

Before we can hand our sentences to BERT, we need to do some minimal processing to put our dataset in the format that BERT requires.

### Prepare the Reuters News dataset

##### Tokenization  
Our first step is to tokenize the sentences -- break them up into words and subwords in the format BERT expects.

In [0]:
tokenized = df['NEWS_TEXT'].apply((lambda x: tokenizer.encode(x, add_special_tokens=True)))
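
To make the idea of subword tokenization concrete, here is a minimal sketch of greedy longest-match splitting in the style of WordPiece. The vocabulary `toy_vocab` and the helper names `wordpiece` and `encode_tokens` are hypothetical and only for illustration; the real tokenizer uses the `distilbert-base-uncased` vocabulary, also splits on punctuation, and returns integer IDs rather than strings.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one lowercase word into subword pieces by greedy longest match."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a "##" prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece matched, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


def encode_tokens(text, vocab):
    """Lowercase, split on whitespace, and wrap the pieces in [CLS] ... [SEP]."""
    tokens = ["[CLS]"]
    for word in text.lower().split():
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    return tokens


# Hypothetical toy vocabulary, not the real DistilBERT vocabulary.
toy_vocab = {"profit", "divid", "##end", "##s", "rose"}
tokens = encode_tokens("Dividends rose", toy_vocab)
print(tokens)
```

The `[CLS]` and `[SEP]` markers correspond to what `add_special_tokens=True` adds in the cell above.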

### Pad the Reuters News dataset

After tokenization, `tokenized` holds one list of token IDs per news text. To be efficient, we want BERT to process all our examples at once (as one batch). For that reason, we pad every list to the same length, so that the input can be represented as one 2-D array rather than a list of lists of different lengths.

In [0]:
max_len = 0
for i in tokenized.values:
    if len(i) > max_len:
        max_len = len(i)

padded = np.array([i + [0]*(max_len-len(i)) for i in tokenized.values])

Our dataset is now in the `padded` variable. We can view its dimensions below:

In [0]:
np.array(padded).shape

(2165, 86)
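
The padding step can be illustrated on a tiny example. This is a sketch with made-up token IDs; `pad_sequences` is a hypothetical helper that mirrors the list comprehension used above, with 0 as the padding ID.

```python
def pad_sequences(sequences, pad_id=0):
    """Right-pad every sequence with pad_id up to the longest length."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]


# Made-up token IDs for three short "sentences" of different lengths.
toy_ids = [[101, 7, 8, 102], [101, 9, 102], [101, 102]]
toy_padded = pad_sequences(toy_ids)

for row in toy_padded:
    print(row)
```

Every row now has the same length, which is what allows the real data to be stored as a single (2165, 86) array.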

### Mask the Reuters News dataset

If we sent `padded` to BERT directly, the padding tokens would be treated as real input and could distort the result. To avoid this, we create another variable, `attention_mask`, that tells the model to ignore (mask) the padding we added:

In [0]:
attention_mask = np.where(padded != 0, 1, 0)
attention_mask.shape

(2165, 86)
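
As a small illustration of what the mask contains, the sketch below builds one for a toy padded batch. The token IDs are made up and `build_attention_mask` is a hypothetical helper; it applies the same rule as `np.where(padded != 0, 1, 0)`.

```python
def build_attention_mask(padded_rows, pad_id=0):
    """Return 1 for real tokens and 0 for padding positions."""
    return [[0 if tok == pad_id else 1 for tok in row] for row in padded_rows]


# Made-up padded token IDs; 0 marks padding.
toy_padded = [[101, 7, 8, 102], [101, 9, 102, 0], [101, 102, 0, 0]]
toy_mask = build_attention_mask(toy_padded)

for row in toy_mask:
    print(row)
```

The mask has the same shape as the padded input, which matches the two identical (2165, 86) shapes reported above.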

## “Earnings Announcement”

### DistilBERT Model #1 for Deep Learning

Now that we have our model and inputs ready, let's run our model!

The `model()` function runs our sentences through BERT. The results of the processing will be returned into `last_hidden_states`.

In [0]:
input_ids = torch.tensor(padded)  
attention_mask = torch.tensor(attention_mask)

with torch.no_grad():
    last_hidden_states = model(input_ids, attention_mask=attention_mask)

Let's slice only the part of the output that we need: the output corresponding to the first token of each sentence. For sentence classification, BERT adds a token called `[CLS]` (for classification) at the beginning of every sentence. The output corresponding to that token can be thought of as an embedding for the entire sentence.

We'll save those in the `features` variable, as they'll serve as the features for our logistic regression model.

In [0]:
features = last_hidden_states[0][:,0,:].numpy()

In [0]:
features

array([[-0.14431146, -0.13774735, -0.07801528, ..., -0.10113572,
         0.36493364,  0.17891423],
       [-0.21805531, -0.1451973 , -0.01023186, ..., -0.14361796,
         0.5414291 , -0.06821286],
       [-0.15863766, -0.18726122, -0.02563679, ..., -0.10387485,
         0.39985478,  0.43581346],
       ...,
       [-0.0598238 , -0.00563753,  0.19096635, ..., -0.0381794 ,
         0.6116144 ,  0.394104  ],
       [-0.09204306, -0.25917462, -0.18157378, ..., -0.0273106 ,
         0.34951085,  0.4110036 ],
       [-0.1615981 , -0.28920633, -0.08893129, ..., -0.01243179,
         0.35322142,  0.48190033]], dtype=float32)
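
The slice `[:, 0, :]` keeps, for every sentence, only the vector at token position 0. The sketch below shows the same selection on a toy nested list with 2 sentences, 3 token positions, and a hidden size of 4; the numbers are made up. In the real run the hidden size of `distilbert-base-uncased` should be 768, so `features` would have shape (2165, 768).

```python
# Toy stand-in for the model output: [sentence][token position][hidden unit].
toy_hidden = [
    [[0.1, 0.2, 0.3, 0.4], [1.0, 1.1, 1.2, 1.3], [2.0, 2.1, 2.2, 2.3]],
    [[0.5, 0.6, 0.7, 0.8], [3.0, 3.1, 3.2, 3.3], [4.0, 4.1, 4.2, 4.3]],
]

# Equivalent of array[:, 0, :]: take the first token's vector per sentence.
cls_vectors = [sentence[0] for sentence in toy_hidden]

print(cls_vectors)
print(len(cls_vectors), len(cls_vectors[0]))  # sentences, hidden size
```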

The labels indicating whether each news text is positive (1) or negative (0) go into the `labels` variable. This time the label is `EARNINGS` (“Earnings Announcement”).

In [0]:
labels = df['EARNINGS']

In [0]:
df.head()

Unnamed: 0,TEST,EARNINGS,ACQUIS,NEWS_TEXT
0,1,0,0,Mounting trade friction between the U.S. And J...
1,1,0,0,survey of provinces and seven cities showed v...
2,1,1,0,Shr .p .p Div .p .p making .p .p Turnover . ...
3,1,0,1,Whim Creek Consolidated NL> said the consortiu...
4,1,0,0,The number of workers employed in the West Ger...


### Logistic Regression Model #1 

We apply the embeddings from Model #1, computed on the Reuters News dataset, as features for a logistic regression classifier.

#### Random train/test split

In [0]:
train_features, test_features, train_labels, test_labels = train_test_split(features, labels)

After splitting the data into training and test sets, we fit our logistic regression model on the training features and labels.

In [0]:
lr_clf = LogisticRegression()
lr_clf.fit(train_features, train_labels)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

#### Evaluate the Performance of the Model

Determine the prediction accuracy for the testing sample.  

Check the accuracy against the testing dataset to see how well our model classifies sentences:

In [0]:
lr_clf.score(test_features, test_labels)

0.9538745387453874

Our model classifies 95.4% of the test sentences correctly.

We also compare this accuracy score against a dummy classifier as a baseline.

In [0]:
from sklearn.dummy import DummyClassifier
clf = DummyClassifier()

scores = cross_val_score(clf, train_features, train_labels)
print("Dummy classifier score: %0.3f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Dummy classifier score: 0.746 (+/- 0.04)


Our model's accuracy of 0.954 is much higher than the dummy classifier's score of 0.746, so the model performs well relative to this baseline.
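
For context on the baseline, the arithmetic below uses the `EARNINGS` value counts from the dataset overview to compute two simple reference points. It is a sketch under stated assumptions: it uses the full-data class counts rather than the training portion only, and it assumes the dummy classifier guesses at random in proportion to the class frequencies (the `stratified` strategy, which was the default in older scikit-learn releases).

```python
# EARNINGS value counts reported in the dataset overview.
n_negative, n_positive = 1850, 315
n_total = n_negative + n_positive
p = n_positive / n_total  # share of "Earnings Announcement" items

# Accuracy from always predicting the most frequent class.
majority_baseline = max(n_negative, n_positive) / n_total

# Expected accuracy from guessing each class with its own frequency.
stratified_baseline = p ** 2 + (1 - p) ** 2

print("majority-class baseline: %.3f" % majority_baseline)
print("stratified-guess baseline: %.3f" % stratified_baseline)
```

Under these assumptions the stratified figure is close to the dummy score reported above, while always predicting the majority class would be a stricter baseline for an imbalanced label such as `EARNINGS`.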

### Logistic Regression Model #1 Cont.

#### Use the actual training/test variable

First, we create a copy of our original dataframe so that we can work with the actual training/test variable. Then, we attach the `features` that we created before to the copy as a new column named `features` for modeling.

In [0]:
df2=df.copy()
df2['features'] = features.tolist()
df2.head()

Unnamed: 0,TEST,EARNINGS,ACQUIS,NEWS_TEXT,features
0,1,0,0,Mounting trade friction between the U.S. And J...,"[-0.14431145787239075, -0.13774734735488892, -..."
1,1,0,0,survey of provinces and seven cities showed v...,"[-0.21805530786514282, -0.1451973021030426, -0..."
2,1,1,0,Shr .p .p Div .p .p making .p .p Turnover . ...,"[-0.1586376577615738, -0.18726122379302979, -0..."
3,1,0,1,Whim Creek Consolidated NL> said the consortiu...,"[0.06605631858110428, -0.12267395853996277, -0..."
4,1,0,0,The number of workers employed in the West Ger...,"[-0.15836602449417114, -0.016903648152947426, ..."


Then we split the data into test and training sets according to the `TEST` variable, which means that we use the dataset's actual training/test split to train and evaluate the model.

In [0]:
# split features and labels based on the 'TEST' variable in df
train_features2=df2.features[df2['TEST'] == 0]
test_features2=df2.features[df2['TEST'] == 1]
train_labels2=df2.EARNINGS[df2['TEST'] == 0]
test_labels2=df2.EARNINGS[df2['TEST'] == 1]

The features selected above are pandas Series whose elements are Python lists, which is not the format that the logistic regression model requires. Therefore, we convert the features into 2-D numpy arrays.

In [0]:
# convert to the format the model expects
train_features3 = np.array([np.array(xi) for xi in train_features2])
test_features3 = np.array([np.array(xi) for xi in test_features2])
train_labels3 = train_labels2
test_labels3 = test_labels2

Finally, we fit our logistic regression model.

In [0]:
lr_clf2 = LogisticRegression()
lr_clf2.fit(train_features3, train_labels3)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

#### Evaluate the Performance of the Model

Determine the prediction accuracy for the testing sample.  

Check the accuracy against the actual testing dataset for how well our model does in classifying sentences:

In [0]:
lr_clf2.score(test_features3, test_labels3)

0.9772329246935202

Our model predicts the `EARNINGS` variable correctly for about 97.7% of the test observations.

In [0]:
from sklearn.dummy import DummyClassifier
clf2 = DummyClassifier()

scores2 = cross_val_score(clf2, train_features3, train_labels3)
print("Dummy classifier score: %0.3f (+/- %0.2f)" % (scores2.mean(), scores2.std() * 2))

Dummy classifier score: 0.715 (+/- 0.02)


The dummy classifier again serves as a baseline for our accuracy score. Our model's accuracy of about 0.977 is much higher than the dummy classifier's score of 0.715, so the model does a good job of predicting the `EARNINGS` variable.

**Moreover**, the model trained on the actual train/test split has a **higher** accuracy score (0.977) than the model trained on the random train/test split (0.954). This suggests that the way the data is split into training and test sets affects the measured performance of the model. Note that the random split was drawn without a fixed `random_state`, so its score will vary somewhat from run to run.
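
The `train_test_split` call in the random-split section does not set a `random_state`, so each run draws a different test set. The sketch below illustrates that effect with the standard library only; `random_split` is a hypothetical helper, not scikit-learn's implementation, and it assumes a 25% test share rounded up.

```python
import math
import random


def random_split(n_rows, test_fraction=0.25, seed=None):
    """Shuffle row indices and return (train_indices, test_indices)."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_rows * test_fraction)
    return sorted(indices[n_test:]), sorted(indices[:n_test])


train_a, test_a = random_split(2165, seed=0)
train_b, test_b = random_split(2165, seed=1)
train_c, test_c = random_split(2165, seed=0)

print(len(train_a), len(test_a))
print("same seed, same test set:", test_a == test_c)
print("different seed, same test set:", test_a == test_b)
```

Fixing the seed makes a random split reproducible, which makes it easier to compare against the fixed split defined by `TEST`.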

## “Corporate Acquisition”

### DistilBERT Model #1 for Deep Learning

The model and the features built above are the same here, so we do not need to rebuild them; we only need to change the labels.

The labels indicating whether each news text is positive (1) or negative (0) go into the `labels` variable. This time the label is `ACQUIS` (“Corporate Acquisition”).

In [0]:
labels = df['ACQUIS']

### Logistic Regression Model #1 

We apply the embeddings from Model #1, computed on the Reuters News dataset, as features for a logistic regression classifier.

#### Random train/test split

In [0]:
train_features, test_features, train_labels, test_labels = train_test_split(features, labels)

In [0]:
lr_clf = LogisticRegression()
lr_clf.fit(train_features, train_labels)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

#### Evaluate the Performance of the Model

Determine the prediction accuracy for the testing sample.  

Check the accuracy against the random testing dataset for how well our model does in classifying sentences:

In [0]:
lr_clf.score(test_features, test_labels)

0.9428044280442804

Our model predicts the `ACQUIS` variable correctly for 94.3% of the test observations.

In [0]:
from sklearn.dummy import DummyClassifier
clf = DummyClassifier()

scores = cross_val_score(clf, train_features, train_labels)
print("Dummy classifier score: %0.3f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Dummy classifier score: 0.531 (+/- 0.02)


The accuracy of our model for predicting the `ACQUIS` variable is around 0.943, compared with a dummy classifier score of 0.531. Our model therefore does a good job of predicting the `ACQUIS` variable. 

### Logistic Regression Model #1 Cont.

#### Use the actual training/test variable

Similarly, we also use the actual training/test variable to train our model for predicting the `ACQUIS` variable. The steps are the same as for the model predicting the `EARNINGS` variable with the actual training/test variable.

In [0]:
# split features and labels based on the 'TEST' variable in df
train_features4=df2.features[df2['TEST'] == 0]
test_features4=df2.features[df2['TEST'] == 1]
train_labels4=df2.ACQUIS[df2['TEST'] == 0]
test_labels4=df2.ACQUIS[df2['TEST'] == 1]

In [0]:
# convert to the format the model expects
train_features5 = np.array([np.array(xi) for xi in train_features4])
test_features5 = np.array([np.array(xi) for xi in test_features4])
train_labels5 = train_labels4
test_labels5 = test_labels4

In [0]:
lr_clf2 = LogisticRegression()
lr_clf2.fit(train_features5, train_labels5)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

#### Evaluate the Performance of the Model

Determine the prediction accuracy for the testing sample.  

Check the accuracy against the actual testing dataset to see how well our model classifies sentences:

In [0]:
lr_clf2.score(test_features5, test_labels5)

0.9492119089316988

Our model predicts the `ACQUIS` variable correctly for 94.9% of the test observations.

In [0]:
from sklearn.dummy import DummyClassifier
clf3 = DummyClassifier()

scores3 = cross_val_score(clf3, train_features5, train_labels5)
print("Dummy classifier score: %0.3f (+/- %0.2f)" % (scores3.mean(), scores3.std() * 2))

Dummy classifier score: 0.556 (+/- 0.06)


The accuracy of our model for predicting the `ACQUIS` variable is around 0.949, compared with a dummy classifier score of 0.556. Our model therefore does a good job of predicting the `ACQUIS` variable. 

Although the accuracy scores differ slightly between the two splitting methods (0.943 for the random split vs. 0.949 for the actual split), the difference is very small (less than 0.01). Therefore, we do not consider the choice of train/test split important enough for `ACQUIS` to warrant further analysis.
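
One rough way to judge whether such a gap is meaningful is to compare it with the sampling uncertainty of an accuracy estimate. The sketch below uses the binomial standard error, which assumes independent test observations; `accuracy_standard_error` is a hypothetical helper, and the inputs are the accuracy values and test-set size reported above.

```python
import math


def accuracy_standard_error(accuracy, n_test):
    """Binomial standard error of an accuracy measured on n_test examples."""
    return math.sqrt(accuracy * (1 - accuracy) / n_test)


# Actual split: accuracy 0.949 on 571 test observations.
se_actual = accuracy_standard_error(0.949, 571)
gap = 0.949 - 0.943  # actual split minus random split

print("standard error: %.4f" % se_actual)
print("gap between splits: %.4f" % gap)
print("gap smaller than one standard error:", gap < se_actual)
```

Under this approximation the gap between the two splits is smaller than one standard error, which is consistent with treating it as negligible.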

## Evaluate Performance for Different Variables and Different Models

Overall, the logistic regression model that we built works better for predicting the `EARNINGS` variable than the `ACQUIS` variable, since the accuracy scores for `EARNINGS` under both the random split and the actual split are higher than those for `ACQUIS`. The actual split works better than the random split in this sample for our BERT model. Both models clearly do better than the dummy classifiers. Nevertheless, to compare the two sets of models across the two variables further, we might need to consider additional evaluation metrics, such as AUC scores, in the future. 
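
As a pointer for that future work, AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it by direct pairwise comparison on made-up labels and scores; `roc_auc` is a hypothetical helper. With the classifiers above, the scores could plausibly come from something like `lr_clf2.predict_proba(test_features3)[:, 1]`.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Made-up labels and predicted scores, for illustration only.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(toy_labels, toy_scores))  # 0.75
```

Unlike accuracy, this measure does not depend on a single decision threshold, which can matter for an imbalanced label such as `EARNINGS`.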

# Naïve Bayes estimation method

## Install the nltk library

In [0]:
!pip install nltk
import nltk
nltk.download('all')

[nltk_data] Downloading collection 'all'

True

## Get a Wordlist for Reuters News

In [0]:
df_NB=df.copy()

In [0]:
# create wordlist
word_NB = []
for m in range(0, len(df_NB)):
  i=df_NB.iloc[m].NEWS_TEXT.split()
  for n in i:
    word_NB.append(n)


In [0]:
#check the list with first 10 words
word_NB[:10]

['Mounting',
 'trade',
 'friction',
 'between',
 'the',
 'U.S.',
 'And',
 'Japan',
 'has',
 'raised']

## Remove the Punctuations and Stop Words

We found that our word tokens contain punctuation and stop words, which are not useful for our prediction. 

So we remove the punctuation and stop words. 

In [0]:
# remove punctuation and stop words
import string
from nltk.corpus import stopwords
word_NB = [w.lower() for w in word_NB]  # lowercase every token
table = str.maketrans('', '', string.punctuation)  # translation table that deletes punctuation
stripped = [w.translate(table) for w in word_NB]
stop_words = set(stopwords.words('english'))
words = [w for w in stripped if w not in stop_words]  # drop English stop words
words[:10]

['mounting',
 'trade',
 'friction',
 'us',
 'japan',
 'raised',
 'fears',
 'among',
 'many',
 'asias']

## Create a frequency count of words in each Reuters News

In [0]:
# Define the feature extractor
NB_words = nltk.FreqDist(w.lower() for w in words)

def document_features(document):
    document_words = set(document)
    features = {}
    for word in NB_words:
        features['contains({})'.format(word)] = (word in document_words)
    return features
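
One detail worth noting: the vocabulary in `NB_words` is lowercased and stripped of punctuation, while the cells below pass raw `NEWS_TEXT.split()` tokens to `document_features`, so a raw token such as `Mounting` would not match the vocabulary entry `mounting`. The sketch below shows a variant that normalizes the document in the same way before the lookup. It is an illustration only, not the code that produced the accuracies reported in this notebook; the helper names, toy vocabulary, and toy stop-word set are hypothetical.

```python
import string

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize_tokens(text, stop_words=frozenset()):
    """Lowercase, strip punctuation, and drop empty strings and stop words."""
    tokens = []
    for raw in text.split():
        cleaned = raw.lower().translate(_PUNCT_TABLE)
        if cleaned and cleaned not in stop_words:
            tokens.append(cleaned)
    return tokens


def document_features_normalized(text, vocabulary, stop_words=frozenset()):
    """Presence/absence features computed on normalized document tokens."""
    document_words = set(normalize_tokens(text, stop_words))
    return {"contains({})".format(w): (w in document_words) for w in vocabulary}


# Hypothetical toy inputs.
toy_vocab = ["dividend", "mounting", "merger"]
toy_stop = {"the", "a"}
toy_text = "Mounting trade friction, the Dividend."

raw_tokens = set(toy_text.split())
feats = document_features_normalized(toy_text, toy_vocab, toy_stop)

print("mounting" in raw_tokens)  # raw tokens keep their original case
print(feats)
```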

## Data Split

### Random Split

We randomly assign data to test and training data set by 25% vs. 75% split.

In [0]:
#a random 75% training and 25% Testing Split
train = df_NB.sample(frac=0.75, random_state=6)
test = df_NB.loc[~df_NB.index.isin(train.index), :]

In [0]:
print(len(train))
print(len(test))

1624
541


### Actual Split based on `TEST` variable

We also assign data according to `TEST` variable.

In [0]:
test_NB = df_NB[df_NB.TEST == 1]
train_NB = df_NB[df_NB.TEST == 0]

In [0]:
print(len(train_NB))
print(len(test_NB))

1594
571


## “Earnings Announcement”

### Random Check

#### Train Naïve Bayes Model

In [0]:
#extract 'earnings'
Earnings_NB_train=train.EARNINGS
Earnings_NB_test=test.EARNINGS

In [0]:
# create train dataset -- earnings
train_earnings = [(list(train.iloc[i].NEWS_TEXT.split()), Earnings_NB_train.iloc[i])
                            for i in range(0, len(train))]
test_earnings = [(list(test.iloc[i].NEWS_TEXT.split()), Earnings_NB_test.iloc[i])
                            for i in range(0, len(test))]

In [0]:
#train NB model -- earnings
featuresets_train_earnings = [(document_features(m), n) for (m,n) in train_earnings]
featuresets_test_earnings = [(document_features(m), n) for (m,n) in test_earnings]
classifier_earnings= nltk.NaiveBayesClassifier.train(featuresets_train_earnings)

#### Evaluate the Performance of the Model

In [0]:
# accuracy on test dataset -- earnings
accuracy=nltk.classify.accuracy(classifier_earnings, featuresets_test_earnings)
accuracy

0.966728280961183

The accuracy of our model for predicting the `EARNINGS` variable is around 0.967, so the model does a good job of predicting the `EARNINGS` variable. 

### Actual Check

#### Train Naïve Bayes Model

In [0]:
#extract 'earnings'
Earnings_NB_train=train_NB.EARNINGS
Earnings_NB_test=test_NB.EARNINGS

In [0]:
# create train dataset -- earnings
train_earnings = [(list(train_NB.iloc[i].NEWS_TEXT.split()), Earnings_NB_train.iloc[i])
                            for i in range(0, len(train_NB))]
test_earnings = [(list(test_NB.iloc[i].NEWS_TEXT.split()), Earnings_NB_test.iloc[i])
                            for i in range(0, len(test_NB))]

In [0]:
#train NB model -- earnings
featuresets_train_earnings = [(document_features(m), n) for (m,n) in train_earnings]
featuresets_test_earnings = [(document_features(m), n) for (m,n) in test_earnings]
classifier_earnings= nltk.NaiveBayesClassifier.train(featuresets_train_earnings)

#### Evaluate the Performance of the Model

In [0]:
# accuracy on test dataset -- earnings
accuracy=nltk.classify.accuracy(classifier_earnings, featuresets_test_earnings)
accuracy

0.9527145359019265

The accuracy of our model for predicting the `EARNINGS` variable is around 0.953, so the model does a good job of predicting the `EARNINGS` variable. 

Although the accuracy scores differ slightly between the two splitting methods (0.967 for the random split vs. 0.953 for the actual split), the difference is small (less than 0.02). Therefore, we do not consider the choice of train/test split important enough to warrant further analysis.

In [0]:
# Show the most important features as interpreted by Naive Bayes in EARNINGS
classifier_earnings.show_most_informative_features(20)

Most Informative Features
      contains(dividend) = True                1 : 0      =     96.4 : 1.0
     contains(quarterly) = True                1 : 0      =     76.3 : 1.0
      contains(earnings) = True                1 : 0      =     65.9 : 1.0
       contains(profits) = True                1 : 0      =     57.4 : 1.0
     contains(reporting) = True                1 : 0      =     53.5 : 1.0
        contains(profit) = True                1 : 0      =     47.1 : 1.0
       contains(payable) = True                1 : 0      =     35.5 : 1.0
      contains(declared) = True                1 : 0      =     27.6 : 1.0
        contains(income) = True                1 : 0      =     27.1 : 1.0
   contains(improvement) = True                1 : 0      =     24.3 : 1.0
        contains(losses) = True                1 : 0      =     24.3 : 1.0
       contains(results) = True                1 : 0      =     21.7 : 1.0
        contains(fourth) = True                1 : 0      =     21.1 : 1.0
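
Each ratio in this listing can be read as a likelihood ratio: how much more probable the feature value is under one label than under the other. The sketch below computes such a ratio from hypothetical document counts. It assumes add-0.5 smoothing over two feature values (True/False), which is our understanding of the default estimator in NLTK's Naive Bayes trainer, so treat both the smoothing and the counts as assumptions.

```python
def smoothed_probability(count, total, bins=2, gamma=0.5):
    """Estimate P(feature value | label) with additive smoothing."""
    return (count + gamma) / (total + gamma * bins)


def likelihood_ratio(count_pos, total_pos, count_neg, total_neg):
    """Ratio of the larger to the smaller conditional probability."""
    p_pos = smoothed_probability(count_pos, total_pos)
    p_neg = smoothed_probability(count_neg, total_neg)
    return max(p_pos, p_neg) / min(p_pos, p_neg)


# Hypothetical counts: a word appears in 50 of 200 positive documents
# and in 2 of 1400 negative documents.
ratio = likelihood_ratio(50, 200, 2, 1400)
print("%.1f : 1.0" % ratio)
```

A word that is common in one class and rare in the other yields a large ratio, which is why terms such as `dividend` rank highly for `EARNINGS`.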

## “Corporate Acquisition”

### Random Check

#### Train Naïve Bayes Model

In [0]:
#extract 'acquis'
ACQUIS_NB_train=train.ACQUIS
ACQUIS_NB_test=test.ACQUIS

In [0]:
# create train dataset -- acquis
train_acquis = [(list(train.iloc[i].NEWS_TEXT.split()), ACQUIS_NB_train.iloc[i])
                            for i in range(0, len(train))]
test_acquis = [(list(test.iloc[i].NEWS_TEXT.split()), ACQUIS_NB_test.iloc[i])
                            for i in range(0, len(test))]

In [0]:
#train NB model -- acquis
featuresets_train_acquis = [(document_features(a), b) for (a,b) in train_acquis]
featuresets_test_acquis = [(document_features(a), b) for (a,b) in test_acquis]
classifier_acquis= nltk.NaiveBayesClassifier.train(featuresets_train_acquis)

#### Evaluate the Performance of the Model

In [0]:
# accuracy on test dataset -- acquis
accuracy=nltk.classify.accuracy(classifier_acquis, featuresets_test_acquis)
accuracy

0.9390018484288355

The accuracy of our model for predicting the `ACQUIS` variable is around 0.939, so the model does a good job of predicting the `ACQUIS` variable. 

### Actual Check

#### Train Naïve Bayes Model

In [0]:
#extract 'acquis'
ACQUIS_NB_train=train_NB.ACQUIS
ACQUIS_NB_test=test_NB.ACQUIS

In [0]:
# create train dataset -- acquis
train_acquis = [(list(train_NB.iloc[i].NEWS_TEXT.split()), ACQUIS_NB_train.iloc[i])
                            for i in range(0, len(train_NB))]
test_acquis = [(list(test_NB.iloc[i].NEWS_TEXT.split()), ACQUIS_NB_test.iloc[i])
                            for i in range(0, len(test_NB))]

In [0]:
#train NB model -- acquis
featuresets_train_acquis = [(document_features(a), b) for (a,b) in train_acquis]
featuresets_test_acquis = [(document_features(a), b) for (a,b) in test_acquis]
classifier_acquis= nltk.NaiveBayesClassifier.train(featuresets_train_acquis)

#### Evaluate the Performance of the Model

In [0]:
# accuracy on test dataset -- acquis
accuracy=nltk.classify.accuracy(classifier_acquis, featuresets_test_acquis)
accuracy

0.9369527145359019

The accuracy of our model for predicting the `ACQUIS` variable is around 0.937, so the model does a good job of predicting the `ACQUIS` variable. 

Although the accuracy scores differ slightly between the two splitting methods (0.939 for the random split vs. 0.937 for the actual split), the difference is very small (less than 0.01). Therefore, we do not consider the choice of train/test split important enough to warrant further analysis.

In [0]:
# Show the most important features as interpreted by Naive Bayes in ACQUIS
classifier_acquis.show_most_informative_features(20)

Most Informative Features
       contains(acquire) = True                1 : 0      =     88.0 : 1.0
        contains(intent) = True                1 : 0      =     37.7 : 1.0
        contains(merger) = True                1 : 0      =     37.0 : 1.0
          contains(rate) = True                0 : 1      =     26.6 : 1.0
        contains(filing) = True                1 : 0      =     26.2 : 1.0
         contains(stake) = True                1 : 0      =     26.1 : 1.0
      contains(investor) = True                1 : 0      =     24.1 : 1.0
        contains(prices) = True                0 : 1      =     24.0 : 1.0
      contains(takeover) = True                1 : 0      =     22.6 : 1.0
      contains(acquired) = True                1 : 0      =     21.4 : 1.0
          contains(fell) = True                0 : 1      =     21.4 : 1.0
          contains(rise) = True                0 : 1      =     20.6 : 1.0
        contains(letter) = True                1 : 0      =     19.2 : 1.0

## Evaluate Performance for Different Variables and Different Models

For the `Naïve Bayes` model, the random split works better than the actual split for both variables in our sample. The `Naïve Bayes` model that we built works better for predicting the `EARNINGS` variable than the `ACQUIS` variable, since the accuracy scores for `EARNINGS` under both the random split and the actual split are higher than those for `ACQUIS`. Thus, overall, both the `BERT` model and the `Naïve Bayes` model work better for predicting the `EARNINGS` variable than the `ACQUIS` variable.  

For predicting the `EARNINGS` variable, the `Naïve Bayes` model that we built works better than the `BERT` model for the random split method, and the `BERT` model that we built works better than the `Naïve Bayes` model for the actual split method.  

For predicting the `ACQUIS` variable, the `BERT` model that we built works better than the `Naïve Bayes` model for both data split methods (0.943 vs. 0.939 for the random split, and 0.949 vs. 0.937 for the actual split).   

Thus, different datasets and splits may yield different accuracy results for different models, so we cannot say that one model is better in general. However, in this Reuters News dataset, both models predict the `EARNINGS` variable better than the `ACQUIS` variable.    
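
To make the comparison easy to check, the sketch below collects the rounded accuracy scores reported in this notebook and picks the higher-scoring model for each variable and split. The dictionary layout and the model labels are our own naming for illustration.

```python
# Rounded test accuracies reported in the sections above.
accuracy = {
    ("EARNINGS", "random"): {"DistilBERT + LR": 0.954, "Naive Bayes": 0.967},
    ("EARNINGS", "actual"): {"DistilBERT + LR": 0.977, "Naive Bayes": 0.953},
    ("ACQUIS", "random"): {"DistilBERT + LR": 0.943, "Naive Bayes": 0.939},
    ("ACQUIS", "actual"): {"DistilBERT + LR": 0.949, "Naive Bayes": 0.937},
}

# Higher-scoring model for each (variable, split) combination.
best = {key: max(scores, key=scores.get) for key, scores in accuracy.items()}

for (target, split), model in best.items():
    print("%-8s %-6s split -> %s" % (target, split, model))
```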