# News Text Classification

This notebook tackles a text classification problem: predicting the category of a news document. The target categories are "news", "finance", and "sports".

### Install Python packages

In [9]:
pip install gensim==4.2.0 nltk==3.6.1 --user

Collecting nltk==3.6.1
  Using cached nltk-3.6.1-py3-none-any.whl (1.5 MB)
Installing collected packages: nltk
Successfully installed nltk-3.6.1


### Import Packages

In [38]:
import re
import os

import pandas as pd
import numpy as np

import nltk 
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
#from nltk.stem import WordNetLemmatizer
from tensorflow.keras.preprocessing import text, sequence 

from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report

from gensim.models import KeyedVectors
import gensim

from keras.models import Sequential
from keras.layers import Dense, Embedding, GRU, LSTM, Dropout, Flatten
from keras.callbacks import ModelCheckpoint

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

### Read dataset as a Pandas dataframe

In [3]:
raw_df = pd.read_csv("news_text.csv", sep='\t')

In [4]:
raw_df.head()

Unnamed: 0,news_id,title,abstract,category
0,N55528,"The Brands Queen Elizabeth, Prince Charles, an...","Shop the notebooks, jackets, and more that the...",lifestyle
1,N19639,50 Worst Habits For Belly Fat,These seemingly harmless habits are holding yo...,health
2,N61837,The Cost of Trump's Aid Freeze in the Trenches...,Lt. Ivan Molchanets peeked over a parapet of s...,news
3,N53526,I Was An NBA Wife. Here's How It Affected My M...,"I felt like I was a fraud, and being an NBA wi...",health
4,N38324,"How to Get Rid of Skin Tags, According to a De...","They seem harmless, but there's a very good re...",health


### Check categories available and their counts

In [5]:
raw_df.category.value_counts()

news             15774
sports           14510
finance           3107
foodanddrink      2551
lifestyle         2479
travel            2350
video             2068
weather           2048
health            1885
autos             1639
tv                 889
music              769
movies             606
entertainment      587
kids                17
middleeast           2
northamerica         1
Name: category, dtype: int64

### Filter out only required categories

Retain only the records labelled "news", "sports", or "finance".

In [6]:
df = raw_df[raw_df.category.isin(['news', 'sports', 'finance'])]
df

Unnamed: 0,news_id,title,abstract,category
2,N61837,The Cost of Trump's Aid Freeze in the Trenches...,Lt. Ivan Molchanets peeked over a parapet of s...,news
5,N2073,Should NFL be able to fine players for critici...,Several fines came down against NFL players fo...,sports
7,N59295,Chile: Three die in supermarket fire amid prot...,Three people have died in a supermarket fire a...,news
9,N39237,"How to report weather-related closings, delays","When there are active closings, view them here...",news
19,N29120,"John Dorsey admits talks with Washington, but ...","Team officials in Washington ""emphatically"" de...",sports
...,...,...,...,...
51274,N43432,US Forest Service shuts down vandalized Georgi...,"GAINESVILLE, Ga. (AP) The U.S. Forest Servic...",news
51275,N17258,Realme takes chunk of India mobile market as S...,Over 400 percent more phones shipped year-on-year,news
51276,N23858,Young Northeast Florida fans flock to U.S. wom...,When the U.S. women's national soccer team arr...,sports
51279,N7482,St. Dominic soccer player tries to kick cancer...,"Sometimes, what happens on the sidelines can b...",sports


### Reset index to obtain continuous index values

In [7]:
#df = df.reset_index(drop=True)
df.reset_index(drop=True, inplace=True)

### Drop the unneeded "news_id" column

In [8]:
df.drop('news_id', axis=1, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,
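
The warning above appears because `df` was produced by filtering `raw_df`, so pandas cannot tell whether the in-place drop is meant to affect the original frame as well. One common way to avoid it is to take an explicit copy after filtering and to reassign rather than mutate in place. The following is a minimal sketch on a small invented frame (the rows are illustrative and not taken from the dataset):

```python
import pandas as pd

# Toy stand-in for raw_df; the values are invented for illustration.
raw_df = pd.DataFrame({
    "news_id": ["N1", "N2", "N3", "N4"],
    "title": ["a", "b", "c", "d"],
    "category": ["news", "health", "sports", "finance"],
})

wanted = ["news", "sports", "finance"]

# .copy() makes the filtered frame independent of raw_df,
# so later modifications are unambiguous.
df = raw_df[raw_df["category"].isin(wanted)].copy()

# Reassigning instead of using inplace=True keeps each step explicit.
df = df.reset_index(drop=True)
df = df.drop(columns=["news_id"])

print(df)
```

The original `raw_df` keeps its `news_id` column, while `df` holds only the filtered rows with a fresh 0-based index.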


### Truncate number of records from each category & concatenate to form a new Dataframe

In [9]:
df_news = df[df['category'] == 'news']
df_news = df_news.iloc[:3000]

df_sports = df[df['category'] == 'sports']
df_sports = df_sports.iloc[:3000]

df_finance = df[df['category'] == 'finance']

In [10]:
df = pd.concat([df_news, df_sports, df_finance])
df

Unnamed: 0,title,abstract,category
0,The Cost of Trump's Aid Freeze in the Trenches...,Lt. Ivan Molchanets peeked over a parapet of s...,news
2,Chile: Three die in supermarket fire amid prot...,Three people have died in a supermarket fire a...,news
3,"How to report weather-related closings, delays","When there are active closings, view them here...",news
5,Elijah Cummings to lie in state at US Capitol ...,"Cummings, a Democrat whose district included s...",news
6,Trump's Trustbusters Bring Microsoft Lessons t...,DOJ's Makan Delrahim and the FTC's Joe Simons ...,news
...,...,...,...
33201,Jamie Dimon: The '60 Minutes' interview,The chairman and CEO of JPMorgan Chase tells L...,finance
33339,Sunday Real Estate: 3 Luxurious Florida Homes,"Sunday Real Estate takes you to Star Island, L...",finance
33350,Only 1 World City Charges More Than NYC Per Fo...,The amount New Yorkers pay per square foot of ...,finance
33360,"Are Stores Open on Veterans Day? Target, Aldi,...",Will shoppers be able to make the most of Vete...,finance


### Combine "title" and "abstract" into a single corpus

In [11]:
df['corpus'] = df['title'] + ' ' + df['abstract']

### Drop "title" and "abstract" columns

In [12]:
df.drop(['title','abstract'],axis=1, inplace=True)

In [13]:
df['corpus'][0]

"The Cost of Trump's Aid Freeze in the Trenches of Ukraine's War Lt. Ivan Molchanets peeked over a parapet of sand bags at the front line of the war in Ukraine. Next to him was an empty helmet propped up to trick snipers, already perforated with multiple holes."

### Function to pre-process or clean data

In [14]:
def clean_data(sent):
    # A set gives constant-time stop-word lookups
    stop_words = set(stopwords.words('english'))
    #wordnet_lemmatizer = WordNetLemmatizer()
    tokenizer = nltk.NLTKWordTokenizer()
    
    # Remove punctuation and digits (keep only letters and whitespace)
    sent = re.sub(r'[^a-zA-Z\s]', '', str(sent))
    
    # Tokenize the sentence
    tokenized_sent = tokenizer.tokenize(sent)
    
    # Convert tokens to lower case
    lower_sent = [i.lower() for i in tokenized_sent]
    
    # Remove stop words and join the remaining tokens back into a string
    sent = [item for item in lower_sent if item not in stop_words]
    return ' '.join(sent)

In [15]:
df['corpus'] = df['corpus'].apply(clean_data)

In [16]:
df

Unnamed: 0,category,corpus
0,news,cost trumps aid freeze trenches ukraines war l...
2,news,chile three die supermarket fire amid protests...
3,news,report weatherrelated closings delays active c...
5,news,elijah cummings lie state us capitol thursday ...
6,news,trumps trustbusters bring microsoft lessons bi...
...,...,...
33201,finance,jamie dimon minutes interview chairman ceo jpm...
33339,finance,sunday real estate luxurious florida homes sun...
33350,finance,world city charges nyc per foot apt space amou...
33360,finance,stores open veterans day target aldi walmart s...
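
The cleaned rows above show two side effects of stripping every non-letter character: possessives lose their apostrophe ("Trump's" becomes "trumps") and hyphenated words are fused ("weather-related" becomes "weatherrelated"). The sketch below reproduces that behaviour with the same regular expression. It is a simplified stand-in for `clean_data`: the stop-word set is a small hand-picked assumption and a whitespace split replaces the NLTK tokenizer.

```python
import re

# Small illustrative stop-word set; the notebook uses NLTK's full English list.
STOP_WORDS = {"the", "of", "in", "to", "how", "a", "an"}

def clean_text(sent):
    # Keep letters and whitespace only, mirroring the regex in clean_data
    sent = re.sub(r"[^a-zA-Z\s]", "", str(sent))
    # A whitespace split stands in for the NLTK tokenizer here
    tokens = [tok.lower() for tok in sent.split()]
    return " ".join(tok for tok in tokens if tok not in STOP_WORDS)

first = clean_text("The Cost of Trump's Aid Freeze in the Trenches of Ukraine's War")
second = clean_text("How to report weather-related closings, delays")
print(first)
print(second)
```

Replacing punctuation with a space instead of an empty string would keep "weather" and "related" as separate tokens, which may matter later when tokens are looked up in a pre-trained vocabulary.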


### Split the dataset into train and test data

In [17]:
xtrain, xtest, ytrain, ytest = train_test_split(df['corpus'], df['category'], shuffle=True, test_size=0.2)

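# Find the length of the longest text in the training data.
# Note: len(x) on a string counts characters, not words, so this value
# is an upper bound on the number of tokens per text.
max_len = xtrain.apply(len).max()
print(f'Max number of characters in a text in training data: {max_len}')

Max number of characters in a text in training data: 552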
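
Since `len` applied to a string counts characters, the value 552 is a character count and therefore pads the sequences well beyond the longest text's token count. A minimal sketch of the difference, using two cleaned titles from the table above (the variable names are illustrative):

```python
texts = [
    "cost trumps aid freeze trenches ukraines war",
    "chile three die supermarket fire amid protests",
]

# len(str) counts characters, including spaces
max_len_chars = max(len(t) for t in texts)

# Splitting on whitespace first gives the number of tokens
max_len_words = max(len(t.split()) for t in texts)

print(max_len_chars, max_len_words)
```

Padding to the word-based maximum would give much shorter inputs, which in turn shrinks the flattened LSTM output and the first dense layer.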


### Convert text data into numerical data of equal lengths 

In [18]:
max_words = 24000
tokenizer = text.Tokenizer(num_words = max_words)
# create the vocabulary by fitting on the training text only (xtrain)
tokenizer.fit_on_texts(xtrain)

# generate the sequence of tokens
xtrain_seq = tokenizer.texts_to_sequences(xtrain)
xtest_seq = tokenizer.texts_to_sequences(xtest)

# pad the sequences
xtrain_pad = sequence.pad_sequences(xtrain_seq, maxlen=max_len, padding='post')
xtest_pad = sequence.pad_sequences(xtest_seq, maxlen=max_len,padding='post')
word_index = tokenizer.word_index
print(len(word_index))

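# Note: xtrain[0] selects the row whose index label is 0, whereas xtrain_seq[0]
# and xtrain_pad[0] are positional, so the text and the sequences printed here
# may belong to different rows; xtrain.iloc[0] would select the matching text.
print('text example:', xtrain[0])
print('sequence of indices(before padding):', xtrain_seq[0])
print('sequence of indices(after padding):', xtrain_pad[0])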

23113
text example: cost trumps aid freeze trenches ukraines war lt ivan molchanets peeked parapet sand bags front line war ukraine next empty helmet propped trick snipers already perforated multiple holes
sequence of indices(before padding): [960, 2086, 254, 5523, 380, 2319, 3650, 13423, 2747, 9596, 426, 13424, 631, 58, 13425, 13426, 585, 13427]
sequence of indices(after padding): [  960  2086   254  5523   380  2319  3650 13423  2747  9596   426 13424
   631    58 13425 13426   585 13427     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0 
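
To make the two steps above concrete, here is a simplified plain-Python illustration of what fitting a vocabulary, converting texts to index sequences, and post-padding amount to. It is a sketch of the idea rather than the Keras implementation: it ignores `num_words` and character filters, and the helper names and toy texts are assumptions.

```python
from collections import Counter

def fit_word_index(texts):
    # Rank words by descending frequency; index 0 is left free for padding
    counts = Counter(word for t in texts for word in t.split())
    ranked = sorted(counts, key=lambda w: -counts[w])
    return {word: i for i, word in enumerate(ranked, start=1)}

def to_sequences(texts, word_index):
    # Words that were not seen during fitting are skipped
    return [[word_index[w] for w in t.split() if w in word_index] for t in texts]

def pad_post(seqs, maxlen):
    padded = []
    for s in seqs:
        s = s[-maxlen:]                              # keep the last maxlen items if too long
        padded.append(s + [0] * (maxlen - len(s)))   # zeros appended on the right
    return padded

train_texts = ["war aid war", "aid freeze"]
test_texts = ["war unseen aid"]

demo_index = fit_word_index(train_texts)
train_pad = pad_post(to_sequences(train_texts, demo_index), maxlen=4)
test_pad = pad_post(to_sequences(test_texts, demo_index), maxlen=4)

print(demo_index)
print(train_pad, test_pad)
```

The word "unseen" never appears in the training texts, so it is dropped from the test sequence. This is why the vocabulary is fitted on the training split only.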

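### Address class imbalance by over-sampling

Both cells in this section are commented out, so over-sampling is not applied to the training data used in the rest of the notebook. The class counts shown are output retained from an earlier run.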

In [None]:
# oversampled = SMOTE(random_state=0)
# X_train_smote, y_train_smote = oversampled.fit_resample(xtrain_pad, ytrain)

In [None]:
# example = pd.DataFrame(y_train_smote)
# example.value_counts()

category
finance     2482
news        2482
sports      2482
dtype: int64
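
A caveat if these cells are re-enabled: SMOTE creates synthetic samples by interpolating between a sample and one of its nearest neighbours in feature space. When the features are padded token indices, as in `xtrain_pad`, the interpolated values no longer correspond to the words of either text. The sketch below shows only the interpolation step (no neighbour search), using invented index sequences:

```python
def interpolate(sample, neighbour, gap):
    # Core SMOTE step: move a fraction `gap` (between 0 and 1)
    # from the sample toward its neighbour, feature by feature
    return [a + gap * (b - a) for a, b in zip(sample, neighbour)]

# Two hypothetical padded token-index sequences
seq_a = [960, 2086, 254, 0]
seq_b = [58, 631, 0, 0]

synthetic = interpolate(seq_a, seq_b, gap=0.5)
print(synthetic)
```

The result contains fractional values and indices that point at unrelated vocabulary entries. Over-sampling in a continuous representation (for example averaged embeddings) or using class weights during training may be a better fit for this kind of input.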

### Encode the labels into numerical categories

In [26]:
le = LabelEncoder()
ytrain = le.fit_transform(ytrain)
# Reuse the mapping learned on the training labels instead of refitting
ytest = le.transform(ytest)
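
The label mapping should be learned once, on the training labels, and then reused for the test labels. The plain-Python sketch below mimics the idea behind `LabelEncoder` (classes sorted, then numbered) to show what can go wrong when the encoder is refitted on a split that happens to miss a class. `SimpleLabelEncoder` and the demo labels are hypothetical.

```python
class SimpleLabelEncoder:
    """Plain-Python stand-in for the idea behind sklearn's LabelEncoder."""

    def fit(self, labels):
        # Classes are sorted, then numbered from 0
        self.classes_ = sorted(set(labels))
        self._index = {c: i for i, c in enumerate(self.classes_)}
        return self

    def transform(self, labels):
        # Raises KeyError for labels that were not seen during fit
        return [self._index[c] for c in labels]

ytrain_demo = ["news", "sports", "finance", "news"]
ytest_demo = ["sports", "news"]

enc = SimpleLabelEncoder().fit(ytrain_demo)
train_ids = enc.transform(ytrain_demo)
test_ids = enc.transform(ytest_demo)

# Refitting on the test labels alone yields a different mapping
refit_ids = SimpleLabelEncoder().fit(ytest_demo).transform(ytest_demo)

print(enc.classes_, train_ids, test_ids, refit_ids)
```

With the shared mapping, "sports" is 2 in both splits. After refitting on the test labels alone it becomes 1, which would silently mislabel the evaluation data.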

### Load the pre-trained Word2Vec model from drive

In [27]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [28]:
embedding_dim = 300
word2vec_dir = "/content/drive/MyDrive"
word2vec_model = KeyedVectors.load_word2vec_format(os.path.join(word2vec_dir, 'GoogleNews-vectors-negative300.bin'), binary=True)

### Use word2vec model to find and map vocabulary with their respective embeddings

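Each row of the resulting embedding matrix holds the pre-trained vector for the word with that index. Words missing from the Word2Vec vocabulary keep an all-zero row. The matrix initializes the embedding layer of the deep learning model.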

In [30]:
embedding_matrix = np.zeros((len(word_index) + 1, embedding_dim))
for word, i in word_index.items():
    if word in word2vec_model: 
        embedding_vector = word2vec_model[word]
        embedding_matrix[i] = embedding_vector
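
Words that are absent from the pre-trained vocabulary stay as zero rows, so it can be useful to measure how much of the vocabulary is actually covered. Tokens fused by the cleaning step (such as "weatherrelated") may well be missing. The following is a minimal plain-Python sketch of the same lookup loop with a coverage check. The three-dimensional vectors and the tiny vocabulary are invented for illustration.

```python
embedding_dim = 3

# Hypothetical pre-trained vectors (stand-in for the Word2Vec model)
pretrained = {"war": [0.1, 0.2, 0.3], "aid": [0.4, 0.5, 0.6]}

# Hypothetical tokenizer vocabulary; index 0 is reserved for padding
word_index = {"war": 1, "aid": 2, "weatherrelated": 3}

# One row per index, initialized to zeros
matrix = [[0.0] * embedding_dim for _ in range(len(word_index) + 1)]
missing = []

for word, i in word_index.items():
    vec = pretrained.get(word)
    if vec is None:
        missing.append(word)      # row i stays all zeros
    else:
        matrix[i] = list(vec)

coverage = 1 - len(missing) / len(word_index)
print(f"coverage: {coverage:.2%}, missing: {missing}")
```

A low coverage figure would suggest revisiting the cleaning step before training, since every uncovered word is indistinguishable from padding in a frozen embedding layer.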

### Define embedding layer

In [32]:
embedding_layer = Embedding(len(word_index) + 1,
                            embedding_dim,
                            weights=[embedding_matrix],
                            input_length=max_len,
                            trainable=False)

### Build deep-learning model

In [34]:
model_word2vec = Sequential()
model_word2vec.add(embedding_layer)
model_word2vec.add(LSTM(units=128,  dropout=0.2, recurrent_dropout=0.25, return_sequences=True))
model_word2vec.add(Flatten())
model_word2vec.add(Dense(20, activation='relu'))
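# NOTE: a single sigmoid unit with binary cross-entropy models a two-class target,
# while the encoded labels here take three values (0, 1, 2). A three-class setup
# would use Dense(3, activation='softmax') with loss='sparse_categorical_crossentropy'.
model_word2vec.add(Dense(1, activation='sigmoid'))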

model_word2vec.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

print(model_word2vec.summary())

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 552, 300)          6934200   
                                                                 
 lstm_1 (LSTM)               (None, 552, 128)          219648    
                                                                 
 flatten_1 (Flatten)         (None, 70656)             0         
                                                                 
 dense_2 (Dense)             (None, 20)                1413140   
                                                                 
 dense_3 (Dense)             (None, 1)                 21        
                                                                 
Total params: 8,567,009
Trainable params: 1,632,809
Non-trainable params: 6,934,200
_________________________________________________________________
None
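
The parameter counts in the summary can be checked by hand, which also shows where most of the trainable weights sit. The sketch below recomputes them from the values printed earlier (vocabulary size 23113, sequence length 552), using the standard formulas for embedding, LSTM, and dense layers:

```python
vocab_rows = 23113 + 1      # len(word_index) + 1 for the padding index
embedding_dim = 300
seq_len = 552
lstm_units = 128

# Embedding: one vector per vocabulary row (frozen, so non-trainable)
embedding_params = vocab_rows * embedding_dim

# LSTM: 4 gates, each with input weights, recurrent weights and a bias
lstm_params = 4 * ((embedding_dim + lstm_units) * lstm_units + lstm_units)

# Flatten: every time step's hidden state is concatenated
flatten_size = seq_len * lstm_units

# Dense layers: weights plus biases
dense_20_params = flatten_size * 20 + 20
dense_1_params = 20 * 1 + 1

total = embedding_params + lstm_params + dense_20_params + dense_1_params
trainable = total - embedding_params

print(embedding_params, lstm_params, dense_20_params, dense_1_params)
print(total, trainable)
```

Most trainable parameters (about 1.4 million) come from the dense layer after `Flatten`, a direct consequence of the 552-step sequence length.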


### Create an object for model checkpointing

In [35]:
checkpoint = ModelCheckpoint('model-{epoch:03d}-{val_accuracy:03f}.h5', verbose=1, monitor='val_accuracy',save_best_only=True, mode='auto')
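
The checkpoint file name uses Python format fields. The spec `03f` means zero-padded to a minimum width of 3 with the default six decimal places, which is why the training log shows names such as `model-001-0.464325.h5`. If three decimals were intended, `.3f` would be the spec to use. A small sketch with plain `str.format`, using the values from the first epoch in the log:

```python
template = 'model-{epoch:03d}-{val_accuracy:03f}.h5'
name = template.format(epoch=1, val_accuracy=0.464325)

# Alternative spec with an explicit precision of three decimals
alt_template = 'model-{epoch:03d}-{val_accuracy:.3f}.h5'
alt_name = alt_template.format(epoch=1, val_accuracy=0.464325)

print(name)
print(alt_name)
```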

### Verify input array shapes

In [36]:
print(xtrain_pad.shape)
print(ytrain.shape)
print(xtest_pad.shape)
print(ytest.shape)

(7285, 552)
(7285,)
(1822, 552)
(1822,)
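
These shapes can be cross-checked against the earlier steps: 3000 "news" rows, 3000 "sports" rows, and all 3107 "finance" rows were concatenated, and 20% were held out. The arithmetic below assumes the test count is rounded up, which matches the shapes printed above:

```python
import math

n_total = 3000 + 3000 + 3107          # rows after truncation and concatenation
n_test = math.ceil(0.2 * n_total)     # assumed rounding for test_size=0.2
n_train = n_total - n_test

print(n_total, n_train, n_test)
```

The row count of 7285 also confirms that the training set was not over-sampled before being passed to the model.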


### Train the model with Word2Vec embeddings

In [37]:
history_word2vec = model_word2vec.fit(xtrain_pad, ytrain, batch_size=32, epochs=10, validation_data=(xtest_pad, ytest), callbacks=[checkpoint], verbose=1)

Epoch 1/10
Epoch 1: val_accuracy improved from -inf to 0.46432, saving model to model-001-0.464325.h5
Epoch 2/10
Epoch 2: val_accuracy did not improve from 0.46432
Epoch 3/10
Epoch 3: val_accuracy did not improve from 0.46432
Epoch 4/10
Epoch 4: val_accuracy did not improve from 0.46432
Epoch 5/10
Epoch 5: val_accuracy did not improve from 0.46432
Epoch 6/10
Epoch 6: val_accuracy improved from 0.46432 to 0.46487, saving model to model-006-0.464874.h5
Epoch 7/10
Epoch 7: val_accuracy did not improve from 0.46487
Epoch 8/10
Epoch 8: val_accuracy improved from 0.46487 to 0.49012, saving model to model-008-0.490121.h5
Epoch 9/10
Epoch 9: val_accuracy did not improve from 0.49012
Epoch 10/10
Epoch 10: val_accuracy improved from 0.49012 to 0.49122, saving model to model-010-0.491218.h5


In [1]:
best_val_acc = max(history_word2vec.history['val_accuracy'])
print(f"The validation accuracy is found to be {best_val_acc:.4f}.")

The validation accuracy is found to be 0.4912.
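
An accuracy near 0.49 on three classes is worth reading alongside the model definition. The output layer is a single sigmoid unit, and rounding a sigmoid output can only yield 0 or 1, so samples encoded as label 2 can never be predicted correctly. The plain-Python sketch below contrasts that with a softmax over three logits scored by sparse categorical cross-entropy. The logit values are invented. In Keras this would correspond to `Dense(3, activation='softmax')` with `loss='sparse_categorical_crossentropy'`, offered here as a suggested direction rather than something the run above used.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits):
    # Subtract the max for numerical stability
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def sparse_categorical_crossentropy(probs, label):
    # Negative log-probability assigned to the true integer label
    return -math.log(probs[label])

# A thresholded sigmoid only ever produces class 0 or class 1
binary_preds = {round(sigmoid(z)) for z in (-5.0, -0.1, 0.1, 5.0)}

# Three logits, one per class, allow class 2 to be predicted
logits = [0.2, 0.1, 2.5]
probs = softmax(logits)
pred = max(range(3), key=lambda k: probs[k])
loss = sparse_categorical_crossentropy(probs, label=2)

print(binary_preds, pred, round(loss, 4))
```

With a three-unit softmax head the integer labels from `LabelEncoder` can be used as they are, with no one-hot encoding required.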
