# NLP Final Project
## Financial Claim Detection using Binary Classification
The primary task of our project is to determine which numerals in the Earning Conference Calls (ECCs) are in-claim (relevant to company’s financial performance) vs out-of-claim (not relevant to company’s financial performance)

In [4]:
from google.colab import drive 
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [5]:
cd drive/MyDrive/NLP\ Project/FinNum-3 

/content/drive/.shortcut-targets-by-id/1fE5bHnT8pi5BSIyVaZzWDHFCkvqEmiAq/NLP Project/FinNum-3


## Importing required libraries

In [6]:
import numpy as np 
import pandas as pd 
import string 

import nltk
from nltk import word_tokenize
nltk.download('punkt') 

from nltk.stem.wordnet import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
nltk.download('wordnet')
nltk.download('omw-1.4')

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import precision_score, recall_score, confusion_matrix, accuracy_score, f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from imblearn.over_sampling import SMOTE

from keras.models import Sequential
from keras.layers import Dense

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


## Loading the data

In [7]:
train_data = pd.read_json("data/ConCall_train.json")
dev_data = pd.read_json("data/ConCall_dev.json")
test_data = pd.read_json("data/ConCall_test.json")

In [8]:
print(train_data.shape)
print(dev_data.shape)
print(test_data.shape)

(8337, 6)
(1191, 6)
(2383, 6)


Let's look at the columns in the dataset

In [None]:
train_data.columns

Index(['paragraph', 'target_num', 'category', 'offset_start', 'offset_end',
       'claim'],
      dtype='object')

Let's have a look a sample of the dataset

In [None]:
train_data.sample(5)

Unnamed: 0,paragraph,target_num,category,offset_start,offset_end,claim
8081,We expect to make further progress on market s...,5.0,relative,548,549,0
6082,Our commercial cloud revenue was $6.9 billion ...,53.0,relative,54,56,0
2264,Next on Power orders of $4.9 billion were down...,2.0,relative,67,68,0
5184,Even without the charges our income grew by mo...,0.28,change,322,326,0
2015,Overall Power backlog closed at $93 billion up...,4.0,relative,143,144,0


We can see a unequal distribution of the 2 classes of the variable below, which shows that there is imbalance in the class weights in the train data 

In [9]:
print("Number of rows with claim = 0 are", len(train_data[train_data["claim"] == 0]))
print("Number of rows with claim = 1 are", len(train_data[train_data["claim"] == 1]))

Number of rows with claim = 0 are 7298
Number of rows with claim = 1 are 1039


## Pre-processing

### Lower-casing

In [10]:
train_data['paragraph'] = train_data['paragraph'].str.lower()
dev_data['paragraph'] = dev_data['paragraph'].str.lower()
test_data['paragraph'] = test_data['paragraph'].str.lower()

### Remove punctuations 
We have ensured to not remove $, % and . because we found them to be valuable to our task

In [11]:
punc_list = list(string.punctuation)
punc_list.remove('$')
punc_list.remove('%')
punc_list.remove('.')

In [12]:
train_data['paragraph'] = train_data['paragraph'].apply(lambda x: "".join(letter for letter in x if letter not in punc_list))
dev_data['paragraph'] = dev_data['paragraph'].apply(lambda x: "".join(letter for letter in x if letter not in punc_list))
test_data['paragraph'] = test_data['paragraph'].apply(lambda x: "".join(letter for letter in x if letter not in punc_list))

### Tokenization

In [13]:
train_data['paragraph'] = train_data['paragraph'].apply(lambda x: " ".join(word_tokenize(x)))
dev_data['paragraph'] = dev_data['paragraph'].apply(lambda x: " ".join(word_tokenize(x)))
test_data['paragraph'] = test_data['paragraph'].apply(lambda x: " ".join(word_tokenize(x)))

### Lemmatization 
We chose lemmatization as it was a more apt form of pre-processing techniques to reduce variation as compared to stemming

In [14]:
train_data['paragraph'].apply(lambda x: ' '.join(lemmatizer.lemmatize(w) for w in x.split()))
dev_data['paragraph'].apply(lambda x: ' '.join(lemmatizer.lemmatize(w) for w in x.split()))
test_data['paragraph'].apply(lambda x: ' '.join(lemmatizer.lemmatize(w) for w in x.split()))
print("*****pre-preocessing done*****")

*****pre-preocessing done*****


## Feature Engineering 

### Bag Of Words (BOWs)

In [15]:
# Instantiate the CountVectorizer method
count_vector = CountVectorizer()

X_train = train_data['paragraph']
# Fit the training data and then return the matrix
training_data_fit = count_vector.fit(X_train)
train_transformed_data = count_vector.transform(X_train)

X_dev = dev_data['paragraph']
# Transform dev data and return the matrix. 
# Note we are not fitting the dev data into the CountVectorizer()
dev_transformed_data = count_vector.transform(X_dev)

X_test = test_data['paragraph']
# Transform test data and return the matrix. 
# Note we are not fitting the test data into the CountVectorizer()
test_transformed_data = count_vector.transform(X_test)

### Balancing classes using SMOTE 

In [50]:
method = SMOTE()

In [51]:
# Instantiate the CountVectorizer method
vectorizer = CountVectorizer()

X_train = train_data['paragraph']
y_train = train_data['claim']
# Fit the training data and then return the matrix
vectorizer.fit(X_train.values.ravel())
X_train = vectorizer.transform(X_train.values.ravel())
X_train=X_train.toarray()

# balancing class weights in the train data
X_train_res, y_train_res = method.fit_resample(X_train, y_train)

In [52]:
X_dev = dev_data['paragraph']
y_dev = dev_data['claim']
# Transform testing data and return the matrix. 
# Note we are not fitting the testing data into the CountVectorizer()
X_dev = count_vector.transform(X_dev)

# balancing class weights in the dev data
X_dev_res, y_dev_res = method.fit_resample(X_dev, y_dev)

In [53]:
X_test = test_data['paragraph']
y_test = test_data['claim']
# Transform test data and return the matrix. 
# Note we are not fitting the test data into the CountVectorizer()
X_test = count_vector.transform(X_test)

# balancing class weights in the test data
X_test_res, y_test_res = method.fit_resample(X_test, y_test)

In [57]:
import numpy as np

np.savez('balanced_data.npz', X_train_res2=X_train_res, y_train_res2=y_train_res, X_dev_res2=X_dev_res, y_dev_res2=y_dev_res, X_test_res2=X_test_res, y_test_res2=y_test_res)


In [46]:
import os

# Print the current working directory
print(os.getcwd())

/content/drive/.shortcut-targets-by-id/1fE5bHnT8pi5BSIyVaZzWDHFCkvqEmiAq/NLP Project/FinNum-3


In [59]:
import numpy as np

# Load the balanced data file with allow_pickle=True
balanced_data = np.load('balanced_data.npz', allow_pickle=True)

# Access the arrays in the file
X_train_res2 = balanced_data['X_train_res2']
y_train_res2 = balanced_data['y_train_res2']
X_dev_res2 = balanced_data['X_dev_res2']
y_dev_res2 = balanced_data['y_dev_res2']
X_test_res2 = balanced_data['X_test_res2']
y_test_res2 = balanced_data['y_test_res2']

# Print the shapes of the arrays
print('X_train_res shape:', X_train_res2.shape)
print('y_train_res shape:', y_train_res2.shape)
print('X_dev_res shape:', X_dev_res2.shape)
print('y_dev_res shape:', y_dev_res2.shape)
print('X_test_res shape:', X_test_res2.shape)
print('y_test_res shape:', y_test_res2.shape)



X_train_res shape: (14596, 8757)
y_train_res shape: (14596,)
X_dev_res shape: ()
y_dev_res shape: (2154,)
X_test_res shape: ()
y_test_res shape: (4392,)


## Models and Evaluation

In [20]:
y_train = train_data['claim']

y_dev = dev_data['claim']
y_dev = y_dev.values

y_test = test_data['claim']
y_test = y_test.values

We performed the following models:

### Naive Bayes (baseline model)

In [21]:
naive_bayes = MultinomialNB()
naive_bayes.fit(train_transformed_data, y_train)

In [22]:
dev_pred = naive_bayes.predict(dev_transformed_data)
len(dev_pred)

1191

In [23]:
print ('Accuracy:', accuracy_score(y_dev, dev_pred))
print ('F1 score:', f1_score(y_dev, dev_pred))
print ('Precision score:', precision_score(y_dev, dev_pred))
print ('Recall score:', recall_score(y_dev, dev_pred))

Accuracy: 0.8471872376154492
F1 score: 0.4858757062146893
Precision score: 0.35833333333333334
Recall score: 0.7543859649122807


In [24]:
test_pred = naive_bayes.predict(test_transformed_data)
len(test_pred)

2383

In [25]:
print ('Accuracy:', accuracy_score(y_test, test_pred))
print ('F1 score:', f1_score(y_test, test_pred))
print ('Precision score:', precision_score(y_test, test_pred))
print ('Recall score:', recall_score(y_test, test_pred))

Accuracy: 0.8673940411246328
F1 score: 0.4254545454545454
Precision score: 0.32231404958677684
Recall score: 0.6256684491978609


### Decision Tree Classifier
Code from https://scikit-learn.org/stable/modules/tree.html

#### Decision Tree Classifier with imbalanced classes

In [26]:
treemodel = DecisionTreeClassifier(random_state=42)
treemodel = treemodel.fit(train_transformed_data, y_train)

In [28]:
dev_pred_tree = treemodel.predict(dev_transformed_data)
len(dev_pred_tree)

1191

In [29]:
print("Accuracy:", accuracy_score(y_dev, dev_pred_tree))
print('F1 score:', f1_score(y_dev, dev_pred_tree))
print('Precision score:', precision_score(y_dev, dev_pred_tree))
print('Recall score:', recall_score(y_dev, dev_pred_tree))

Accuracy: 0.8673383711167086
F1 score: 0.33050847457627125
Precision score: 0.319672131147541
Recall score: 0.34210526315789475


In [30]:
test_pred_tree = treemodel.predict(test_transformed_data)
len(test_pred_tree)

2383

In [31]:
print("Accuracy:", accuracy_score(y_test, test_pred_tree))
print('F1 score:', f1_score(y_test, test_pred_tree))
print('Precision score:', precision_score(y_test, test_pred_tree))
print('Recall score:', recall_score(y_test, test_pred_tree))

Accuracy: 0.8980276961812841
F1 score: 0.26586102719033233
Precision score: 0.3055555555555556
Recall score: 0.23529411764705882


#### Decision Tree Classifier with balanced class weights  

In [32]:
treemodel = DecisionTreeClassifier(random_state=42, class_weight='balanced')
treemodel = treemodel.fit(train_transformed_data, y_train)

In [33]:
dev_pred_tree = treemodel.predict(dev_transformed_data)
len(dev_pred_tree)

1191

In [34]:
print("Accuracy:", accuracy_score(y_dev, dev_pred_tree))
print('F1 score:', f1_score(y_dev, dev_pred_tree))
print('Precision score:', precision_score(y_dev, dev_pred_tree))
print('Recall score:', recall_score(y_dev, dev_pred_tree))

Accuracy: 0.8287153652392947
F1 score: 0.3964497041420119
Precision score: 0.29910714285714285
Recall score: 0.5877192982456141


In [None]:
test_pred_tree = treemodel.predict(test_transformed_data)
len(test_pred_tree)

In [36]:
print("Accuracy:", accuracy_score(y_test, test_pred_tree))
print('F1 score:', f1_score(y_test, test_pred_tree))
print('Precision score:', precision_score(y_test, test_pred_tree))
print('Recall score:', recall_score(y_test, test_pred_tree))

Accuracy: 0.8115820394460763
F1 score: 0.3207261724659607
Precision score: 0.22362869198312235
Recall score: 0.5668449197860963


#### Decision Tree Classifier with balanced classes using SMOTE

In [54]:
treemodel = DecisionTreeClassifier(random_state=42)
treemodel = treemodel.fit(X_train_res, y_train_res)

In [55]:
dev_pred_tree = treemodel.predict(X_dev_res)
len(dev_pred_tree)

2154

In [56]:
print("Accuracy:", accuracy_score(y_dev_res, dev_pred_tree))
print('F1 score:', f1_score(y_dev_res, dev_pred_tree))
print('Precision score:', precision_score(y_dev_res, dev_pred_tree))
print('Recall score:', recall_score(y_dev_res, dev_pred_tree))

Accuracy: 0.7878365831012071
F1 score: 0.7740978744438953
Precision score: 0.8276955602536998
Recall score: 0.7270194986072424


In [40]:
test_pred_tree = treemodel.predict(X_test_res)
len(test_pred_tree)

4392

In [41]:
print("Accuracy:", accuracy_score(y_test_res, test_pred_tree))
print('F1 score:', f1_score(y_test_res, test_pred_tree))
print('Precision score:', precision_score(y_test_res, test_pred_tree))
print('Recall score:', recall_score(y_test_res, test_pred_tree))

Accuracy: 0.7520491803278688
F1 score: 0.7410225921521997
Precision score: 0.7755102040816326
Recall score: 0.7094717668488161


In [44]:
X_train_res

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [60]:
#new
treemodel2 = DecisionTreeClassifier(random_state=42)
treemodel2 = treemodel2.fit(X_train_res2, y_train_res2)

In [61]:
#new
test_pred_tree2 = treemodel2.predict(X_test_res2)
len(test_pred_tree2)

ValueError: ignored

In [None]:
#new
print("Accuracy:", accuracy_score(y_test_res2, test_pred_tree))
print('F1 score:', f1_score(y_test_res2, test_pred_tree))
print('Precision score:', precision_score(y_test_res2, test_pred_tree))
print('Recall score:', recall_score(y_test_res2, test_pred_tree))

In [None]:
#new
import numpy as np

# Load the balanced data file
balanced_data = np.load('balanced_data.npz')

# Get the arrays
X_train_res = balanced_data['X_train_res']
y_train_res = balanced_data['y_train_res']
X_dev_res = balanced_data['X_dev_res']
y_dev_res = balanced_data['y_dev_res']
X_test_res = balanced_data['X_test_res']
y_test_res = balanced_data['y_test_res']

# Combine the features and target arrays for each dataset
train_data = np.hstack((X_train_res, y_train_res.reshape(-1,1)))
dev_data = np.hstack((X_dev_res, y_dev_res.reshape(-1,1)))
test_data = np.hstack((X_test_res, y_test_res.reshape(-1,1)))

# Save the datasets as CSV files
np.savetxt('train_data.csv', train_data, delimiter=',')
np.savetxt('dev_data.csv', dev_data, delimiter=',')
np.savetxt('test_data.csv', test_data, delimiter=',')


### Logistic regression

#### Logistic regression with imbalanced classes

In [None]:
logmodel = LogisticRegression(random_state = 42)
logmodel.fit(train_transformed_data,y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression(random_state=42)

In [None]:
dev_pred_log = logmodel.predict(dev_transformed_data)
len(dev_pred_log)

1191

In [None]:
print ('Accuracy:', accuracy_score(y_dev, dev_pred_log))
print ('F1 score:', f1_score(y_dev, dev_pred_log))
print ('Precision score:', precision_score(y_dev, dev_pred_log))
print ('Recall score:', recall_score(y_dev, dev_pred_log))

Accuracy: 0.8933669185558354
F1 score: 0.3618090452261307
Precision score: 0.4235294117647059
Recall score: 0.3157894736842105


In [None]:
test_pred_log = logmodel.predict(test_transformed_data)
len(test_pred_log)

2383

In [None]:
print ('Accuracy:', accuracy_score(y_test, test_pred_log))
print ('F1 score:', f1_score(y_test, test_pred_log))
print ('Precision score:', precision_score(y_test, test_pred_log))
print ('Recall score:', recall_score(y_test, test_pred_log))

Accuracy: 0.9185900125891733
F1 score: 0.31690140845070425
Precision score: 0.4639175257731959
Recall score: 0.24064171122994651


#### Logistic regression with balanced class weights

In [None]:
logmodel = LogisticRegression(random_state = 42, class_weight='balanced')
logmodel.fit(train_transformed_data,y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression(class_weight='balanced', random_state=42)

In [None]:
dev_pred_log = logmodel.predict(dev_transformed_data)
len(dev_pred_log)

1191

In [None]:
print ('Accuracy:', accuracy_score(y_dev, dev_pred_log))
print ('F1 score:', f1_score(y_dev, dev_pred_log))
print ('Precision score:', precision_score(y_dev, dev_pred_log))
print ('Recall score:', recall_score(y_dev, dev_pred_log))

Accuracy: 0.8631402183039463
F1 score: 0.45117845117845123
Precision score: 0.366120218579235
Recall score: 0.5877192982456141


In [None]:
test_pred_log = logmodel.predict(test_transformed_data)
len(test_pred_log)

2383

In [None]:
print ('Accuracy:', accuracy_score(y_test, test_pred_log))
print ('F1 score:', f1_score(y_test, test_pred_log))
print ('Precision score:', precision_score(y_test, test_pred_log))
print ('Recall score:', recall_score(y_test, test_pred_log))

Accuracy: 0.8997062526227444
F1 score: 0.40987654320987654
Precision score: 0.38073394495412843
Recall score: 0.44385026737967914


#### Logistic regression with balanced classes using SMOTE

In [None]:
logmodel = LogisticRegression(random_state = 42)
logmodel.fit(X_train_res, y_train_res)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression(random_state=42)

In [None]:
dev_pred_log = logmodel.predict(X_dev_res)
len(dev_pred_log)

2154

In [None]:
print ('Accuracy:', accuracy_score(y_dev_res, dev_pred_log))
print ('F1 score:', f1_score(y_dev_res, dev_pred_log))
print ('Precision score:', precision_score(y_dev_res, dev_pred_log))
print ('Recall score:', recall_score(y_dev_res, dev_pred_log))

Accuracy: 0.8347260909935005
F1 score: 0.8291746641074855
Precision score: 0.8579940417080437
Recall score: 0.8022284122562674


In [None]:
test_pred_log = logmodel.predict(X_test_res)
len(test_pred_log)

4392

In [None]:
print ('Accuracy:', accuracy_score(y_test_res, test_pred_log))
print ('F1 score:', f1_score(y_test_res, test_pred_log))
print ('Precision score:', precision_score(y_test_res, test_pred_log))
print ('Recall score:', recall_score(y_test_res, test_pred_log))

Accuracy: 0.8604280510018215
F1 score: 0.8516097797143549
Precision score: 0.9090439276485788
Recall score: 0.8010018214936248


### Neural Networks with balanced classes using SMOTE
Code from https://www.kaggle.com/code/jagdmir/tweet-analysis-ann-bert-cnn-n-gram-cnn#Model-Building-&-Evaluation

In [None]:
def define_model(n_words):
    # define network
    model = Sequential()
    model.add(Dense(128, input_shape=(n_words,), activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    # compile network
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics = ['accuracy'])
    # summarize defined model
    model.summary()
    # plot_model(model, to_file='model.png', show_shapes=True)
    return model

In [None]:
n_words = X_train_res.shape[1]
model = define_model(n_words)

Model: "sequential_7"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_12 (Dense)            (None, 128)               1121024   
                                                                 
 dense_13 (Dense)            (None, 1)                 129       
                                                                 
Total params: 1,121,153
Trainable params: 1,121,153
Non-trainable params: 0
_________________________________________________________________


In [None]:
X_train_res.shape, y_train_res.shape

((14596, 8757), (14596,))

In [None]:
model.fit(X_train_res,y_train_res,epochs=10,verbose=2)

Epoch 1/10
457/457 - 5s - loss: 0.2400 - accuracy: 0.9232 - 5s/epoch - 12ms/step
Epoch 2/10
457/457 - 2s - loss: 0.1608 - accuracy: 0.9461 - 2s/epoch - 5ms/step
Epoch 3/10
457/457 - 1s - loss: 0.1429 - accuracy: 0.9492 - 1s/epoch - 3ms/step
Epoch 4/10
457/457 - 1s - loss: 0.1310 - accuracy: 0.9509 - 1s/epoch - 3ms/step
Epoch 5/10
457/457 - 1s - loss: 0.1235 - accuracy: 0.9527 - 1s/epoch - 3ms/step
Epoch 6/10
457/457 - 2s - loss: 0.1200 - accuracy: 0.9527 - 2s/epoch - 4ms/step
Epoch 7/10
457/457 - 2s - loss: 0.1170 - accuracy: 0.9539 - 2s/epoch - 5ms/step
Epoch 8/10
457/457 - 3s - loss: 0.1145 - accuracy: 0.9549 - 3s/epoch - 6ms/step
Epoch 9/10
457/457 - 1s - loss: 0.1126 - accuracy: 0.9546 - 1s/epoch - 3ms/step
Epoch 10/10
457/457 - 1s - loss: 0.1119 - accuracy: 0.9555 - 1s/epoch - 3ms/step


<keras.callbacks.History at 0x7f8e8c298810>

In [None]:
dev_pred_nn = model.predict(X_dev_res)



In [None]:
dev_pred_nn = np.where(dev_pred_nn > 0.5, 1, 0)

In [None]:
print ('Accuracy:', accuracy_score(y_dev_res, dev_pred_nn))
print ('F1 score:', f1_score(y_dev_res, dev_pred_nn))
print ('Precision score:', precision_score(y_dev_res, dev_pred_nn))
print ('Recall score:', recall_score(y_dev_res, dev_pred_nn))

Accuracy: 0.8779015784586816
F1 score: 0.8728854519091348
Precision score: 0.9102822580645161
Recall score: 0.8384401114206128


In [None]:
test_pred_nn = model.predict(X_test_res)



In [None]:
test_pred_nn = np.where(test_pred_nn > 0.5, 1, 0)

In [None]:
print ('Accuracy:', accuracy_score(y_test_res, test_pred_nn))
print ('F1 score:', f1_score(y_test_res, test_pred_nn))
print ('Precision score:', precision_score(y_test_res, test_pred_nn))
print ('Recall score:', recall_score(y_test_res, test_pred_nn))


Accuracy: 0.8784153005464481
F1 score: 0.8651515151515152
Precision score: 0.9710884353741497
Recall score: 0.7800546448087432
