<a href="https://colab.research.google.com/github/79AceVo/Text-analytics/blob/main/Text_Classification_with_neural_networks.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Text Classification with Neural Networks

In this notebook we use neural networks (traditional, CNN, RNN, LSTM) to train text classifiers and make predictions.



In [None]:
#load the libraries

import numpy as np
import pandas as pd
import itertools

import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt

from sklearn.feature_extraction.text import CountVectorizer

import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

import string
import re

from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn import metrics

import sklearn
from sklearn.model_selection import train_test_split

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
df = pd.read_csv("Data/econ_news.csv", encoding = "ISO-8859-1") #need encoding because default UTF-8 does not work

#let's do some data exploration

In [None]:
df.shape #here is our table: 8000 rows and 15 columns

(8000, 15)

In [None]:
df.head()

Unnamed: 0,_unit_id,_golden,_unit_state,_trusted_judgments,_last_judgment_at,positivity,positivity:confidence,relevance,relevance:confidence,articleid,date,headline,positivity_gold,relevance_gold,text
0,842613455,False,finalized,3,12/5/15 17:48,3.0,0.64,yes,0.64,wsj_398217788,8/14/91,Yields on CDs Fell in the Latest Week,,,NEW YORK -- Yields on most certificates of dep...
1,842613456,False,finalized,3,12/5/15 16:54,,,no,1.0,wsj_399019502,8/21/07,The Morning Brief: White House Seeks to Limit ...,,,The Wall Street Journal Online</br></br>The Mo...
2,842613457,False,finalized,3,12/5/15 1:59,,,no,1.0,wsj_398284048,11/14/91,Banking Bill Negotiators Set Compromise --- Pl...,,,WASHINGTON -- In an effort to achieve banking ...
3,842613458,False,finalized,3,12/5/15 2:19,,0.0,no,0.675,wsj_397959018,6/16/86,Manager's Journal: Sniffing Out Drug Abusers I...,,,The statistics on the enormous costs of employ...
4,842613459,False,finalized,3,12/5/15 17:48,3.0,0.3257,yes,0.64,wsj_398838054,10/4/02,Currency Trading: Dollar Remains in Tight Rang...,,,NEW YORK -- Indecision marked the dollar's ton...


In [None]:
df.columns #here are all the columns


Index(['_unit_id', '_golden', '_unit_state', '_trusted_judgments',
       '_last_judgment_at', 'positivity', 'positivity:confidence', 'relevance',
       'relevance:confidence', 'articleid', 'date', 'headline',
       'positivity_gold', 'relevance_gold', 'text'],
      dtype='object')

In [None]:
df.describe() #here are some stats

Unnamed: 0,_unit_id,_trusted_judgments,positivity,positivity:confidence,relevance:confidence,positivity_gold,relevance_gold
count,8000.0,8000.0,1420.0,3775.0,8000.0,0.0,0.0
mean,836799500.0,3.0,4.985211,0.18845,0.859009,,
std,5816278.0,0.0,1.680357,0.269593,0.16618,,
min,830981600.0,3.0,2.0,0.0,0.3364,,
25%,830983600.0,3.0,3.0,0.0,0.6697,,
50%,836799500.0,3.0,5.0,0.0,1.0,,
75%,842615500.0,3.0,7.0,0.3458,1.0,,
max,842617500.0,3.0,9.0,1.0,1.0,,


In [None]:
df.describe(include="O") #here are some stats with text entry

Unnamed: 0,_unit_state,_last_judgment_at,relevance,articleid,date,headline,text
count,8000,8000,8000,8000,8000,8000,8000
unique,1,1229,3,8000,6109,7698,7994
top,finalized,11/18/15 8:27,no,wapo_149196517,12/15/94,Business and Finance,ÐÊ M B ROF. PAUL HURD outlines a frightening ...
freq,8000,36,6571,1,6,86,2


In [None]:
df["relevance"].value_counts() #here are the raw numbers

relevance,count
no,6571
yes,1420
not sure,9


In [None]:
df["relevance"].value_counts() /len(df) #here are the proportions

relevance,count
no,0.821375
yes,0.1775
not sure,0.001125


So in this dataset the majority of articles (about 82%) are not related to the US economy; only about 18% are. This kind of imbalance is understandable, and common in classification tasks. There is also a very small share of "not sure" labels (about 0.1%), which we will drop.

Data imbalance is really common. There are many ways to treat it, and there is ongoing debate about whether treatment is necessary at all.

But for now, let's focus on the binary classification task. We will relabel no as 0 (not relevant), yes as 1 (relevant).
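
Before modeling, it helps to quantify the imbalance noted above. The sketch below uses the counts printed earlier (6571 `no`, 1420 `yes`) to compute the accuracy of a classifier that always predicts `no`, plus one common weighting heuristic, `n_samples / (n_classes * class_count)`. The helper name `balanced_class_weights` is made up for this illustration, and this weighting is only one of several ways to treat imbalance.

```python
# Counts taken from the value_counts() output above, after dropping "not sure".
counts = {0: 6571, 1: 1420}  # 0 = not relevant, 1 = relevant

def balanced_class_weights(class_counts):
    """Hypothetical helper: weight each class by n_samples / (n_classes * count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_samples / (n_classes * count)
            for label, count in class_counts.items()}

# Accuracy of a model that always predicts the majority class.
baseline_accuracy = max(counts.values()) / sum(counts.values())
weights = balanced_class_weights(counts)

print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")
print(f"Balanced class weights: {weights}")
```

Any model worth keeping should beat the baseline accuracy. A dictionary of this shape can be passed as `class_weight` to Keras `model.fit`.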


## Data Preprocessing

In [None]:
df = df[df["relevance"]!="not sure"] #get rid of not sure

In [None]:
df.shape #check to see if the 9 is gone. should be 7991 in the rows now

(7991, 15)

In [None]:
df["relevance"] = df["relevance"].map({'yes':1, 'no':0}) #mapping of yes to 1 and no to 0

#since we only need some columns, let's reduce the dataframe

data = df[["text","relevance","headline"]]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["relevance"] = df["relevance"].map({'yes':1, 'no':0}) #mapping of yes to 1 and no to 0


In [None]:
data #here is the data for analysis. now we can start preprocessing the text as usual.

Unnamed: 0,text,relevance,headline
0,NEW YORK -- Yields on most certificates of dep...,1,Yields on CDs Fell in the Latest Week
1,The Wall Street Journal Online</br></br>The Mo...,0,The Morning Brief: White House Seeks to Limit ...
2,WASHINGTON -- In an effort to achieve banking ...,0,Banking Bill Negotiators Set Compromise --- Pl...
3,The statistics on the enormous costs of employ...,0,Manager's Journal: Sniffing Out Drug Abusers I...
4,NEW YORK -- Indecision marked the dollar's ton...,1,Currency Trading: Dollar Remains in Tight Rang...
...,...,...,...
7995,Secretary of Commerce Charles W. Sawyer said y...,1,"Sawyer Sees Strong Economy For 2 Years, Truce ..."
7996,"U.S. stocks inched up last week, overcoming co...",0,Oil's losses are airlines' gains
7997,Ben S. Bernanke cleared a key hurdle Thursday ...,0,Full Senate to vote on Bernanke; PANEL ADVANCE...
7998,The White House's push to contract out many fe...,0,Reinventing Opportunities


In [None]:
#let's take a look at some data

data.loc[2,"text"] #here is the text of the third row

"WASHINGTON -- In an effort to achieve banking reform, Senate negotiators and the Bush administration have agreed to drop efforts to allow banks to expand further into the securities business.</br></br>The compromise is one of several the Senate Banking Committee is pursuing to remove obstacles its banking bill will face when the Senate starts voting on the measure, perhaps today. The latest version of the House banking bill also drops the administration's proposals to broaden bank entry into the securities business.</br></br>Last night, the House began its second attempt to pass a banking bill after failing last week, in part because of disagreement over how to allow banks into the securities business. The House adopted on a voice vote provisions that would replenish the bank deposit insurance fund, tighten bank regulation, trim the scope of deposit insurance, and restrict the Federal Reserve Board's ability to keep sick banks alive with loans.</br></br>But the House delayed until tod

In [None]:
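stop_words_nltk = set(stopwords.words('english'))
def clean(doc): # doc is a string of text
    pdoc = doc.replace("</br>", " ") #replace the HTML line-break residue with a space
    pdoc = pdoc.split() #split on whitespace into tokens
    #drop tokens that appear in string.punctuation (e.g. a lone punctuation mark) or are all digits
    pdoc = [token for token in pdoc if token not in string.punctuation and not token.isdigit()]
    #drop stop words (the match is case-sensitive, so capitalized words such as "The" are kept)
    pdoc = [token for token in pdoc if token not in stop_words_nltk]
    pdoc = " ".join(pdoc) #join all the elements from the list with each other again, separated by space
    return pdoc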

In [None]:
df["text"] = df["text"].apply(lambda row : clean(row))

In [None]:
df["text"] #text looks ok for now, let's go to model

Unnamed: 0,text
0,NEW YORK -- Yields certificates deposit offere...
1,The Wall Street Journal Online The Morning Bri...
2,WASHINGTON -- In effort achieve banking reform...
3,The statistics enormous costs employee drug ab...
4,"NEW YORK -- Indecision marked dollar's tone, t..."
...,...
7995,Secretary Commerce Charles W. Sawyer said yest...
7996,"U.S. stocks inched last week, overcoming conce..."
7997,Ben S. Bernanke cleared key hurdle Thursday co...
7998,The White House's push contract many federal f...


## Modeling

Modeling usually follows the same steps:

1) Split the data into train and test sets (80/20 is a rule of thumb, but you can change this ratio; the code below uses 75/25)

2) Create features from the text. This time we use a bag-of-words (BoW) representation, so we use CountVectorizer to create the features

3) Transform both the train and test data the same way

4) Train the classifier

5) Evaluate the classifier
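
Steps 2 and 3 hinge on one idea: the vocabulary is learned from the training data only, and the test data is then mapped onto that same vocabulary. The toy sketch below illustrates the idea in plain Python. The functions `fit_vocabulary` and `transform` are written for this example and only mimic the fit/transform split; they are not how CountVectorizer is implemented, and its tokenization differs.

```python
from collections import Counter

def fit_vocabulary(train_docs):
    """Map each word seen in the training documents to a column index."""
    words = sorted({word for doc in train_docs for word in doc.lower().split()})
    return {word: idx for idx, word in enumerate(words)}

def transform(docs, vocabulary):
    """Turn documents into count vectors; words outside the vocabulary are ignored."""
    rows = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        rows.append([counts.get(word, 0) for word in vocabulary])
    return rows

train_docs = ["stocks fell on friday", "the fed raised rates"]
test_docs = ["rates fell as bonds rallied"]

vocabulary = fit_vocabulary(train_docs)           # learned from training data only
train_matrix = transform(train_docs, vocabulary)
test_matrix = transform(test_docs, vocabulary)    # same columns as the training matrix

print(sorted(vocabulary))
print(test_matrix)
```

Words that appear only in the test document ("as", "bonds", "rallied") get no column, which is why the train and test matrices always have the same width.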



In [None]:
# Step 1: train-test split
X = data.text # the column text contains textual data to extract features from
y = data.relevance # this is the column we are learning to predict.
print(X.shape, y.shape)
# split X and y into training and testing sets. By default, it splits 75% training and 25% test
# random_state=1 for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, test_size=0.25)
print(X_train.shape, y_train.shape) #check the sizes of the train and test splits
print(X_test.shape, y_test.shape)

(7991,) (7991,)
(5993,) (5993,)
(1998,) (1998,)


In [None]:
# Step 2-3: Preprocess and Vectorize train and test data
vect = CountVectorizer(preprocessor=clean) # instantiate a vectorizer
X_train_dtm = vect.fit_transform(X_train) # use it to extract features from training data
# transform testing data (using training data's features)
X_test_dtm = vect.transform(X_test)
print(X_train_dtm.shape, X_test_dtm.shape)
# i.e., the dimension of our feature vector is 45758!

(5993, 45758) (1998, 45758)


## Train with Neural Network

Here we have already vectorized the text with CountVectorizer (after tokenization), so we will pass that matrix into the neural network. Feel free to change the vectorization; a different one may perform better.

## Choose a GPU for the neural network

Neural networks train much faster on GPUs, which is one reason NVIDIA hardware is in such high demand.

In Colab, click on Runtime > Change Runtime Type



* T4 GPU: Suitable for moderate deep learning and machine learning tasks. This GPU handles most models well without being overpowered.
* L4 GPU: Ideal for more complex models that require additional power, such as intricate neural networks or large image processing tasks.
* A100 GPU: The most powerful option, recommended for training large-scale deep learning models with frameworks like TensorFlow and PyTorch.

In our course, T4 should be the first one to try.

In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.utils import pad_sequences


In [None]:
X_train_dtm.shape[1] #number of terms (vocabulary size)

45758

In [None]:
# pad/truncate every row to a fixed length so the matrix is dense and ready for DL
# if you want to retain all info, use maxlen=X_train_dtm.shape[1]
# use 5,000 to reduce complexity; note that truncating='post' keeps only the first 5,000 columns (terms)
X_train_dense = pad_sequences(X_train_dtm.toarray(), maxlen=5000, padding='post', truncating='post')
X_test_dense = pad_sequences(X_test_dtm.toarray(), maxlen=5000, padding='post', truncating='post')

In [None]:
X_train_dense.shape[1] #now 5000 columns instead of 45758

5000
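
Because each row of the document-term matrix holds one count per vocabulary term, `truncating='post'` with `maxlen=5000` keeps the first 5,000 columns and discards the counts of every later term. The pure-Python sketch below mimics that behaviour on a toy row. `pad_or_truncate_post` is a hypothetical stand-in written for this illustration, not the Keras function.

```python
def pad_or_truncate_post(row, maxlen, value=0):
    """Hypothetical mimic of post-padding / post-truncation for one row."""
    if len(row) >= maxlen:
        return row[:maxlen]                      # drop everything after maxlen
    return row + [value] * (maxlen - len(row))   # pad with zeros at the end

# Toy count vector over a 6-term vocabulary (columns are in vocabulary order).
counts = [2, 0, 1, 0, 3, 1]

truncated = pad_or_truncate_post(counts, maxlen=4)
padded = pad_or_truncate_post(counts, maxlen=8)

print(truncated)  # the counts for the last two terms are lost
print(padded)     # a shorter row would be extended with zeros
```

If the goal is simply a smaller feature space, `CountVectorizer(max_features=5000)` is one alternative to consider: it keeps the 5,000 most frequent terms rather than the first 5,000 columns.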

## Classic Neural Network


In [None]:
#Let's try an initial network

# Define the neural network model
model = keras.Sequential([
    keras.layers.Embedding(input_dim=X_train_dense.shape[1],output_dim= 64),
    keras.layers.GlobalAveragePooling1D(),
    keras.layers.Dense(4, activation='relu', kernel_regularizer=keras.regularizers.l2(0.01)), #regularization on each layer
    keras.layers.Dense(8, activation='relu', kernel_regularizer=keras.regularizers.l1(0.1)),
    keras.layers.Dense(1, activation='sigmoid')
])

"""
keras.layers.Embedding(input_dim=vocab_size, output_dim=64):
input_dim: the size of the vocabulary, i.e. the number of distinct integer ids the network will accept
output_dim: the size of the vector space in which words will be embedded (64-dimension vector as the embedding)
input_length (optional, not set here): length of the input sequences. All inputs must have the same length;
   sequences shorter than the max length are padded (see pad_sequences above)


keras.layers.GlobalAveragePooling1D(): averages all word embeddings in a sequence into one 64-dimensional vector per sequence (per row of X)

keras.layers.Dense(4, activation='relu', kernel_regularizer=keras.regularizers.l2(0.01)): first hidden layer, 4 neurons, relu activation, L2 regularization on this layer's weights
keras.layers.Dense(8, activation='relu', kernel_regularizer=keras.regularizers.l1(0.1)): second hidden layer, 8 neurons, relu activation, L1 regularization on this layer's weights
keras.layers.Dense(1, activation='sigmoid'): output layer, 1 neuron, sigmoid activation
"""

# Compile the model
model.compile(loss='binary_crossentropy', optimizer=keras.optimizers.Adam(learning_rate=0.001), metrics=[ keras.metrics.BinaryAccuracy(),
        keras.metrics.FalseNegatives(),]) #learning rate is added here

# Train the model
num_epochs = 5 # Adjust as needed
batch_size = 32 # Adjust as needed
class_weights = {0: 1, 1: 2}  # Example: Give class 1 twice(2x) the weight of class 0
history = model.fit(
    X_train_dense, y_train, epochs=num_epochs,
                    validation_data=(X_test_dense, y_test), batch_size=batch_size, class_weight=class_weights) #batch size

Epoch 1/5
188/188 ━━━━━━━━━━━━━━━━━━━━ 8s 17ms/step - binary_accuracy: 0.8055 - false_negatives_1: 526.2169 - loss: 1.7360 - val_binary_accuracy: 0.8288 - val_false_negatives_1: 342.0000 - val_loss: 1.1446
Epoch 2/5
188/188 ━━━━━━━━━━━━━━━━━━━━ 1s 7ms/step - binary_accuracy: 0.8227 - false_negatives_1: 535.0741 - loss: 1.2571 - val_binary_accuracy: 0.8288 - val_false_negatives_1: 342.0000 - val_loss: 0.8241
Epoch 3/5
188/188 ━━━━━━━━━━━━━━━━━━━━ 2s 6ms/step - binary_accuracy: 0.8184 - false_negatives_1: 550.6402 - loss: 0.9613 - val_binary_accuracy: 0.8288 - val_false_negatives_1: 342.0000 - val_loss: 0.6115
Epoch 4/5
188/188 ━━━━━━━━━━━━━━━━━━━━ 1s 5ms/step - binary_accuracy: 0.8111 - false_negatives_1: 561.4868 - loss: 0.8061 - val_binary_accuracy: 0.8288 - val_false_negatives_1: 342.0000 - val_loss: 0.5300
Epoch 5/5
188/188 ━━━━━━━━━━━━

In [None]:
model.summary()
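
The model notes above describe `Embedding` followed by `GlobalAveragePooling1D`. As a rough mental model, the sketch below performs the same two steps by hand: look up a vector for each integer in a sequence, then average those vectors position-wise. The tiny 3-dimensional table is invented for illustration; a real layer learns its table during training.

```python
# Invented lookup table: integer id -> 3-dimensional vector.
embedding_table = {
    0: [0.0, 0.0, 0.0],
    1: [1.0, 2.0, 3.0],
    2: [3.0, 0.0, 1.0],
}

def embed(sequence, table):
    """Replace each integer in the sequence with its vector."""
    return [table[token_id] for token_id in sequence]

def global_average_pool(vectors):
    """Average the vectors position-wise into a single vector."""
    n = len(vectors)
    return [sum(values) / n for values in zip(*vectors)]

sequence = [1, 2, 1, 0]  # one input row of integer ids
pooled = global_average_pool(embed(sequence, embedding_table))
print(pooled)
```

Whatever the sequence length, the pooled result always has the embedding dimension, which is what lets the following `Dense` layers accept it.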

In [None]:
# Evaluate the model
loss, accuracy, fn = model.evaluate(X_test_dense, y_test)
print(f"Loss: {loss}")
print(f"Accuracy: {accuracy}")
print(f"False Negatives: {fn}")

predictions = model.predict(X_test_dense)
# Example: Convert probabilities to classes (assuming a threshold of 0.5)
predicted_classes = (predictions > 0.5).astype(int)
predicted_classes

# classification report

print(metrics.classification_report(y_test, predicted_classes))
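
The 0.5 cut-off used above is a choice, not a requirement. The sketch below, on made-up labels and probabilities, counts the confusion-matrix cells at two thresholds to show the trade-off: lowering the threshold tends to reduce false negatives at the cost of more false positives. `confusion_counts` is a helper written for this example.

```python
def confusion_counts(y_true, probabilities, threshold=0.5):
    """Count TP, FP, TN, FN after thresholding predicted probabilities."""
    tp = fp = tn = fn = 0
    for truth, prob in zip(y_true, probabilities):
        predicted = 1 if prob > threshold else 0
        if predicted == 1 and truth == 1:
            tp += 1
        elif predicted == 1 and truth == 0:
            fp += 1
        elif predicted == 0 and truth == 0:
            tn += 1
        else:
            fn += 1
    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn}

# Made-up labels and probabilities, for illustration only.
y_true = [1, 0, 1, 0, 1, 0]
probabilities = [0.62, 0.10, 0.35, 0.45, 0.20, 0.05]

for threshold in (0.5, 0.3):
    print(threshold, confusion_counts(y_true, probabilities, threshold))
```

When missing a relevant article is the costlier mistake, a lower threshold may be worth the extra false positives.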


## Dealing with underfit / overfit

If the model underfits, we can:
* Change the tokenization / vectorization
* Add capacity: more layers, more neurons per layer
* Increase the number of epochs gradually
* Use early stopping so training halts once extra epochs stop helping
* Reduce or remove L1/L2 regularization
* Increase the batch size
* Change the learning rate
* Reduce or remove dropout
* Change the class weights

If the model overfits, you can try these things:


* Increase L2 regularization (often a better fit than L1 for deep networks)
* Remove L1 regularization. L1 pushes many weights to exactly zero (sparse weights), which can make training harder
* Add dropout
* Adjust the learning rate and optimizer
* Adjust the class weights, giving more weight to the more important class
* Reduce the complexity of the model, e.g. remove layers
* Use early stopping to stop training before the model starts to overfit
* Adjust the batch size; a smaller batch size is often better
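
To make the L1 and L2 items above concrete: each regularizer adds a penalty term to the loss, computed from a layer's weights. The sketch below computes both penalties for a made-up weight vector, using the coefficients from the model earlier in this notebook (`l2(0.01)` and `l1(0.1)`). The function names are illustrative, not Keras APIs.

```python
def l1_penalty(weights, coefficient):
    """L1 penalty: coefficient * sum of absolute weights."""
    return coefficient * sum(abs(w) for w in weights)

def l2_penalty(weights, coefficient):
    """L2 penalty: coefficient * sum of squared weights."""
    return coefficient * sum(w * w for w in weights)

# Made-up weights for one layer.
weights = [0.5, -2.0, 0.0, 1.5]

l1_value = l1_penalty(weights, 0.1)
l2_value = l2_penalty(weights, 0.01)

print(f"L1 penalty (coefficient 0.1):  {l1_value:.3f}")
print(f"L2 penalty (coefficient 0.01): {l2_value:.3f}")
```

L2 punishes large weights disproportionately because of the square, while L1 charges every non-zero weight in proportion to its size, which is what encourages weights to become exactly zero.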


Early stopping can be implemented like this:

#patience: how many epochs with no improvement to wait before stopping, see https://keras.io/api/callbacks/early_stopping/
early_stopping = keras.callbacks.EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

#X_val, y_val: a held-out validation set, vectorized and padded the same way as X_train_dense
history = model.fit(X_train_dense, y_train, epochs=20, validation_data=(X_val, y_val), callbacks=[early_stopping])
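
To see what `patience` does, the sketch below replays a made-up list of validation losses through a simplified version of the stopping rule. `simulate_early_stopping` is a hypothetical helper for illustration; it ignores callback options such as `min_delta`.

```python
def simulate_early_stopping(val_losses, patience):
    """Simplified mimic: return (epochs_run, best_epoch), both 1-based.

    Training stops once `patience` consecutive epochs fail to improve on
    the best validation loss seen so far.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

# Made-up validation losses: improvement stalls after epoch 4.
val_losses = [0.90, 0.70, 0.60, 0.55, 0.57, 0.56, 0.58, 0.50]
epochs_run, best_epoch = simulate_early_stopping(val_losses, patience=3)
print(f"Stopped after epoch {epochs_run}; best weights come from epoch {best_epoch}")
```

Note that the last loss in the list (0.50) is never reached. A larger `patience` would have waited long enough to see it, which is the trade-off when choosing this value.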

## Optimizers

Keras has several optimizers:

* SGD is simple and supports momentum; convergence can be slow. It can be used for large datasets
* Adam combines momentum with an adaptive learning rate and converges fast; it is usually the default for most deep learning tasks
* RMSprop adapts the learning rate dynamically, which helps prevent overshooting; often used for RNNs and time series
* Adagrad adapts the learning rate for each parameter with aggressive decay; can be used for NLP and sparse data
* Adadelta is designed to need little manual learning-rate tuning
* Adamax is an Adam variant; can be used for high-dimensional data
* Nadam combines Adam with Nesterov momentum, for complex deep models

Momentum lets the weight updates accelerate over time, so training can converge to the minimum more quickly.
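
The effect of momentum can be seen on a toy problem. The sketch below minimizes f(w) = 0.5 * w**2 (gradient w, minimum at 0) with and without momentum, using the commonly used update `velocity = momentum * velocity - learning_rate * gradient` followed by `w = w + velocity`. The function, starting point, and step count are arbitrary choices for illustration.

```python
def minimize(learning_rate, momentum, steps, start=10.0):
    """Run gradient descent on f(w) = 0.5 * w**2, whose gradient is w."""
    w, velocity = start, 0.0
    for _ in range(steps):
        gradient = w
        velocity = momentum * velocity - learning_rate * gradient
        w = w + velocity
    return w

w_plain = minimize(learning_rate=0.01, momentum=0.0, steps=50)
w_momentum = minimize(learning_rate=0.01, momentum=0.9, steps=50)

print(f"plain SGD:     w = {w_plain:.4f}")
print(f"with momentum: w = {w_momentum:.4f}")
```

With the same learning rate and number of steps, the momentum run ends much closer to the minimum, because past gradients keep contributing to each update.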