Adding text_normalizer.py to the Colab working directory. (Download text_normalizer.py to your PC first; the cell below uploads it and moves it into my_code/.)

In [None]:
!mkdir my_code
from google.colab import files

uploaded = files.upload()
!mv text_normalizer.py my_code/


mkdir: cannot create directory ‘my_code’: File exists


Saving text_normalizer.py to text_normalizer.py


Importing libraries

In [None]:
# dependencies
from sklearn.datasets import fetch_20newsgroups #getting dataset
import nltk
nltk.download('stopwords')
!pip install contractions


import numpy as np
import my_code.text_normalizer as tn
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.svm import LinearSVC
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Bidirectional, Dropout
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.metrics import classification_report  # accuracy_score is already imported above
from sklearn import metrics
from gensim.models import Word2Vec
from sklearn.naive_bayes import GaussianNB
from sklearn.decomposition import PCA
from tensorflow.keras.optimizers import Adam

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!




Fetch the entire 20 Newsgroups dataset (all categories), stripping headers, footers, and quotes from each post, and build a dictionary (data_labels_map) that maps numeric labels to the corresponding newsgroup names.

In [None]:
from sklearn.datasets import fetch_20newsgroups

data = fetch_20newsgroups(subset='all', shuffle=True,
                          remove=('headers', 'footers', 'quotes'))
data_labels_map = dict(enumerate(data.target_names))
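
To make the label mapping concrete, here is a minimal sketch. The three newsgroup names and the integer targets are hypothetical stand-ins for `data.target_names` and `data.target`; the real list has 20 names.

```python
# Hypothetical stand-in for data.target_names (the real list has 20 names).
target_names = ["alt.atheism", "comp.graphics", "rec.sport.hockey"]

# enumerate() pairs each name with its position; dict() turns those pairs
# into a lookup table from integer label to newsgroup name.
labels_map = dict(enumerate(target_names))

# Hypothetical integer targets, in the same form as data.target.
targets = [2, 0, 2, 1]
names = [labels_map[t] for t in targets]

print(labels_map)
print(names)
```

The same list comprehension is what produces the 'Target Name' column in the dataframe cell that follows.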

Transforming the data into a dataframe

In [None]:
corpus, target_labels, target_names = (data.data, data.target,
                                       [data_labels_map[label] for label in data.target])
data_df = pd.DataFrame({'Article': corpus, 'Target Label': target_labels, 'Target Name': target_names})
print(data_df.shape)
data_df.head(10)

(18846, 3)


Unnamed: 0,Article,Target Label,Target Name
0,\n\nI am sure some bashers of Pens fans are pr...,10,rec.sport.hockey
1,My brother is in the market for a high-perform...,3,comp.sys.ibm.pc.hardware
2,\n\n\n\n\tFinally you said what you dream abou...,17,talk.politics.mideast
3,\nThink!\n\nIt's the SCSI card doing the DMA t...,3,comp.sys.ibm.pc.hardware
4,1) I have an old Jasmine drive which I cann...,4,comp.sys.mac.hardware
5,\n\nBack in high school I worked as a lab assi...,12,sci.electronics
6,\n\nAE is in Dallas...try 214/241-6060 or 214/...,4,comp.sys.mac.hardware
7,"\n[stuff deleted]\n\nOk, here's the solution t...",10,rec.sport.hockey
8,"\n\n\nYeah, it's the second one. And I believ...",10,rec.sport.hockey
9,\nIf a Christian means someone who believes in...,19,talk.religion.misc


## Data Processing and Normalization

Before we preprocess and normalize our documents, let’s first check the dataset for empty documents and remove them.


In [None]:
#counting empty documents (they are removed in the next cell)
total_nulls = data_df[data_df.Article.str.strip() == ''].shape[0]
print("Empty documents:", total_nulls)

Empty documents: 515


We can now use a simple pandas filter to remove all records with no textual content in the article. **The dataset is also sampled here:** the slice `[0:10000:5]` keeps every fifth row of the first 10,000 remaining rows, which leaves 2,000 documents.

In [None]:
#removing records with no text
data_df = data_df[~(data_df.Article.str.strip() == '')][0:10000:5]
data_df.shape

(2000, 3)
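
The sampling slice can be checked on a plain list. This is a sketch only: the list of positions below is a hypothetical stand-in for the filtered dataframe rows, sized from the counts printed above (18,846 documents minus 515 empty ones).

```python
# Hypothetical stand-in for the filtered rows, represented by their positions.
rows = list(range(18846 - 515))

# [0:10000:5] = start at position 0, stop before position 10000, step 5.
sample = rows[0:10000:5]

print(len(sample))   # number of rows kept
print(sample[:4])    # first few positions kept
print(sample[-1])    # last position kept
```

The slice works on row positions, not index labels. Because the dataframe keeps its original index after the empty rows are dropped, the index labels of the sampled rows can exceed 10,000 even though only the first 10,000 positions are considered.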

In [None]:
#doing the general pre-processing or text wrangling
#refer to text_normalizer.py
stopword_list = nltk.corpus.stopwords.words('english')
# keep the negation words 'no' and 'not' so that negated phrases survive stopword removal
stopword_list.remove('no')
stopword_list.remove('not')

# normalize our corpus
norm_corpus = tn.normalize_corpus(corpus=data_df['Article'], html_stripping=True, contraction_expansion=False,
                                  accented_char_removal=True, text_lower_case=True, text_lemmatization=True,
                                  text_stemming=False, special_char_removal=True, remove_digits=True,
                                  stopword_removal=True, stopwords=stopword_list)
data_df['Clean Article'] = norm_corpus
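
The contents of text_normalizer.py are not shown in this notebook, so the following is only a simplified, hypothetical sketch of a few of the steps selected above (lowercasing, special-character and digit removal, stopword removal). The function `simple_normalize` and its tiny stopword set are illustrative assumptions, not the module's actual code, and lemmatization is left out.

```python
import re

# Hypothetical, tiny stopword set. The notebook uses NLTK's English list
# with "no" and "not" removed so that negations survive.
STOPWORDS = {"i", "am", "the", "is", "a", "of", "it", "this"}

def simple_normalize(text, stopwords=STOPWORDS):
    """Simplified stand-in: lowercase, strip non-letters, drop stopwords."""
    text = text.lower()
    # Replace anything that is not a letter or whitespace (digits, punctuation).
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = [tok for tok in text.split() if tok not in stopwords]
    return " ".join(tokens)

print(simple_normalize("This is NOT a 1993 Chevy!"))
print(repr(simple_normalize("1234 !!!")))
```

In this sketch, a document that contains only digits, symbols, or stopwords normalizes to an empty string, so a non-empty article can still end up with an empty cleaned version.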

In [None]:
data_df = data_df[['Article', 'Clean Article', 'Target Label', 'Target Name']]
data_df

Unnamed: 0,Article,Clean Article,Target Label,Target Name
0,\n\nI am sure some bashers of Pens fans are pr...,sure basher pen fan pretty confused lack kind ...,10,rec.sport.hockey
5,\n\nBack in high school I worked as a lab assi...,back high school work lab assistant bunch expe...,12,sci.electronics
10,the blood of the lamb.\n\nThis will be a hard ...,blood lamb hard task culture use animal blood ...,19,talk.religion.misc
15,In the following report: _Turkey Eyes Regional...,following report turkey eyes regional role ank...,17,talk.politics.mideast
20,: \n: I am considering buying a 1993 Chevy or ...,consider buy chevy gmc x full size pickup exte...,7,rec.autos
...,...,...,...,...
10265,Posted for a friend without posting access (bu...,post friend without post access e mail access ...,5,comp.windows.x
10270,"HiFonics ""Ceres"" 3-Band Parametric Equalizer\n...",hifonics ceres band parametric equalizer specs...,6,misc.forsale
10275,jsn104 is jeremy scott noonan\n,jsn jeremy scott noonan,0,alt.atheism
10281,I am working for a company which has only one ...,work company one connection internet firewall ...,5,comp.windows.x


In [None]:
#check for null or empty documents
data_df = data_df.replace(r'^\s*$', np.nan, regex=True)  # empty or whitespace-only strings become NaN
data_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2000 entries, 0 to 10286
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Article        2000 non-null   object
 1   Clean Article  1997 non-null   object
 2   Target Label   2000 non-null   int64 
 3   Target Name    2000 non-null   object
dtypes: int64(1), object(3)
memory usage: 78.1+ KB


After preprocessing, three documents have an empty Clean Article (1997 non-null values out of 2000). We can safely remove these documents using the following code.

In [None]:
#remove null/empty documents
data_df = data_df.dropna().reset_index(drop=True)
data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1997 entries, 0 to 1996
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Article        1997 non-null   object
 1   Clean Article  1997 non-null   object
 2   Target Label   1997 non-null   int64 
 3   Target Name    1997 non-null   object
dtypes: int64(1), object(3)
memory usage: 62.5+ KB


## Feature Extraction

Splitting into training and testing sets

In [None]:
X_train, X_test, y_train, y_test = train_test_split(data_df['Clean Article'], data_df['Target Label'], test_size=0.2, random_state=42)

### Using TF-IDF

In [None]:
# TF-IDF Vectorization
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)
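
The cell above fits the vectorizer on the training split only and then reuses it on the test split. The pure-Python sketch below illustrates that pattern, assuming scikit-learn's default weighting (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization). The helper names `fit_idf` and `transform` and the toy documents are hypothetical, and the sketch skips scikit-learn's tokenization and stop-word handling.

```python
import math
from collections import Counter

def fit_idf(train_docs):
    """Learn the vocabulary and smoothed IDF weights from training documents only."""
    n_docs = len(train_docs)
    doc_freq = Counter()
    for doc in train_docs:
        doc_freq.update(set(doc.split()))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    return {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

def transform(doc, idf):
    """Weight term counts by IDF and L2-normalize; terms unseen in training are ignored."""
    counts = Counter(tok for tok in doc.split() if tok in idf)
    weights = {term: count * idf[term] for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()} if norm else {}

# Hypothetical toy training documents.
train_docs = ["hockey game tonight", "hockey team win", "disk drive fail"]
idf = fit_idf(train_docs)

# A "test" document: "unknownword" was never seen during fitting.
vec = transform("hockey drive unknownword", idf)
print(vec)
```

Terms that appear in fewer training documents ("drive") receive a larger weight than common ones ("hockey"), and words that only occur in the test document contribute nothing, which is why the vectorizer must not be refitted on the test split.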

### Using Word-Embeddings

In [None]:
# Split each cleaned document into a list of word tokens
tokenized_sentences = [sentence.split() for sentence in X_train]

# Train the Word2Vec model. Passing `sentences` to the constructor builds the
# vocabulary and trains in one step, so separate build_vocab()/train() calls
# (which would reset and retrain the model) are not needed.
embedding_dim = 200  # You can adjust this
word2vec_model = Word2Vec(sentences=tokenized_sentences, vector_size=embedding_dim, window=5,
                          min_count=1, workers=4, epochs=10)

# Save the model (optional)
word2vec_model.save("word2vec_model")

# Use the trained embeddings to transform text data
def average_word_vectors(words, model, vocabulary, num_features):
    feature_vector = np.zeros((num_features,), dtype="float64")
    nwords = 0.

    for word in words:
        if word in vocabulary:
            nwords = nwords + 1.
            feature_vector = np.add(feature_vector, model.wv[word])

    if nwords:
        feature_vector = np.divide(feature_vector, nwords)

    return feature_vector

def get_avg_feature_vectors(reviews, model, vocabulary, num_features):
    review_feature_vectors = [average_word_vectors(review, model, vocabulary, num_features) for review in reviews]
    return np.array(review_feature_vectors)

# Get vocabulary
vocabulary = set(word2vec_model.wv.index_to_key)

# Transform text data to average word vectors
X_train_word2vec = get_avg_feature_vectors(tokenized_sentences, word2vec_model, vocabulary, embedding_dim)

# Repeat the process for the test set
tokenized_test_sentences = [sentence.split() for sentence in X_test]
X_test_word2vec = get_avg_feature_vectors(tokenized_test_sentences, word2vec_model, vocabulary, embedding_dim)

pca = PCA(n_components=embedding_dim)
X_train_word2vec = pca.fit_transform(X_train_word2vec)
X_test_word2vec = pca.transform(X_test_word2vec)
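
As a small illustration of the averaging step used above, the sketch below averages toy word vectors for one document. The 3-dimensional vectors, the word list, and the helper name `average_vector` are hypothetical; the notebook's Word2Vec vectors have 200 dimensions.

```python
# Hypothetical 3-dimensional word vectors standing in for model.wv.
toy_vectors = {
    "hockey": [1.0, 0.0, 2.0],
    "game":   [3.0, 2.0, 0.0],
}

def average_vector(tokens, vectors, dim):
    """Average the vectors of known tokens; return all zeros if none are known."""
    total = [0.0] * dim
    known = 0
    for tok in tokens:
        if tok in vectors:
            known += 1
            total = [t + v for t, v in zip(total, vectors[tok])]
    return [t / known for t in total] if known else total

doc_vector = average_vector(["hockey", "game", "zamboni"], toy_vectors, 3)
empty_vector = average_vector(["zamboni"], toy_vectors, 3)
print(doc_vector)
print(empty_vector)
```

Two properties of this representation are worth keeping in mind: out-of-vocabulary tokens are skipped (a document with no known tokens becomes the zero vector), and word order is discarded, since any permutation of the same tokens yields the same average.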



## Text Classification Models

### Naive-Bayes

TF-IDF with Naive Bayes

In [None]:
# Naive Bayes
nb_classifier = MultinomialNB()
nb_classifier.fit(X_train_tfidf, y_train)
nb_predictions = nb_classifier.predict(X_test_tfidf)

# Evaluate Models
print("Naive Bayes Metrics:")
print(metrics.classification_report(y_test, nb_predictions))

Naive Bayes Metrics:
              precision    recall  f1-score   support

           0       0.38      0.25      0.30        12
           1       0.62      0.65      0.63        20
           2       0.35      0.50      0.41        14
           3       0.54      0.79      0.64        19
           4       1.00      0.32      0.48        19
           5       0.60      0.71      0.65        21
           6       0.93      0.54      0.68        24
           7       0.28      0.94      0.43        16
           8       0.78      0.44      0.56        16
           9       0.79      0.71      0.75        21
          10       0.84      0.81      0.82        26
          11       0.50      0.88      0.64        24
          12       0.79      0.44      0.56        25
          13       0.59      0.77      0.67        13
          14       0.80      0.76      0.78        21
          15       0.72      0.62      0.67        37
          16       0.47      0.75      0.58        20
      

Word Embeddings with Naive-Bayes

In [None]:
# Gaussian Naive Bayes
nb_model = GaussianNB()
nb_model.fit(X_train_word2vec, y_train)
nb_predictions = nb_model.predict(X_test_word2vec)

# Evaluate Naive Bayes performance
accuracy_nb = accuracy_score(y_test, nb_predictions)
precision_nb = precision_score(y_test, nb_predictions, average='weighted')
recall_nb = recall_score(y_test, nb_predictions, average='weighted')
f1_nb = f1_score(y_test, nb_predictions, average='weighted')

print("\nNaive Bayes Performance:")
print(f"Accuracy: {accuracy_nb}")
print(f"Precision: {precision_nb}")
print(f"Recall: {recall_nb}")
print(f"F1 Score: {f1_nb}")


Naive Bayes Performance:
Accuracy: 0.25
Precision: 0.3495117508523098
Recall: 0.25
F1 Score: 0.26823354629394747
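
The identical Accuracy and Recall values above are not a coincidence: with `average='weighted'`, each class's score is weighted by its support (the number of true samples of that class), and support-weighted recall always equals accuracy. The sketch below recomputes the weighted scores by hand on a hypothetical six-sample example; `weighted_scores` is an illustrative helper, not a scikit-learn function.

```python
from collections import Counter

def weighted_scores(y_true, y_pred):
    """Per-class precision/recall/F1, averaged with weights equal to class support."""
    support = Counter(y_true)
    n = len(y_true)
    precision = recall = f1 = 0.0
    for label, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)
        prec = tp / predicted if predicted else 0.0
        rec = tp / count
        f = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
        weight = count / n
        precision += weight * prec
        recall += weight * rec
        f1 += weight * f
    return precision, recall, f1

# Hypothetical labels and predictions.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 2, 2]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = weighted_scores(y_true, y_pred)
print(accuracy, precision, recall, f1)
```

Weighted precision and F1 do not share this identity, which is why they differ from the accuracy in the output above.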


### Support Vector Machine (SVM)

TF-IDF with SVM

In [None]:
# SVM
svm_classifier = LinearSVC()
svm_classifier.fit(X_train_tfidf, y_train)
svm_predictions = svm_classifier.predict(X_test_tfidf)

print("Support Vector Machine Metrics:")
print(metrics.classification_report(y_test, svm_predictions))

Support Vector Machine Metrics:
              precision    recall  f1-score   support

           0       0.33      0.42      0.37        12
           1       0.61      0.70      0.65        20
           2       0.38      0.43      0.40        14
           3       0.68      0.68      0.68        19
           4       0.81      0.68      0.74        19
           5       0.70      0.67      0.68        21
           6       0.80      0.67      0.73        24
           7       0.54      0.81      0.65        16
           8       0.64      0.56      0.60        16
           9       0.70      0.67      0.68        21
          10       0.73      0.85      0.79        26
          11       0.83      0.83      0.83        24
          12       0.57      0.48      0.52        25
          13       0.59      0.77      0.67        13
          14       0.70      0.76      0.73        21
          15       0.75      0.81      0.78        37
          16       0.67      0.70      0.68      

Word Embeddings with SVM

In [None]:
# Linear Support Vector Classifier (SVM)
svm_model = LinearSVC()
svm_model.fit(X_train_word2vec, y_train)
svm_predictions = svm_model.predict(X_test_word2vec)

# Evaluate SVM performance
accuracy_svm = accuracy_score(y_test, svm_predictions)
precision_svm = precision_score(y_test, svm_predictions, average='weighted')
recall_svm = recall_score(y_test, svm_predictions, average='weighted')
f1_svm = f1_score(y_test, svm_predictions, average='weighted')

print("\nSVM Performance:")
print(f"Accuracy: {accuracy_svm}")
print(f"Precision: {precision_svm}")
print(f"Recall: {recall_svm}")
print(f"F1 Score: {f1_svm}")


SVM Performance:
Accuracy: 0.26
Precision: 0.2354572486218127
Recall: 0.26
F1 Score: 0.2165244728879609


### Using RNN (Recurrent Neural Network)

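TF-IDF with RNN (note: this cell does not use the TF-IDF matrix; the network receives padded integer token sequences and learns its own Embedding layer)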

In [None]:
# Define the RNN model
def create_rnn_model(input_shape, embedding_dim, vocab_size):
    model = Sequential()
    model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=input_shape))
    model.add(Bidirectional(LSTM(128, return_sequences=True)))  # Increased LSTM units for more complexity
    #model.add(Dropout(0.5))  # Added dropout for regularization
    model.add(Bidirectional(LSTM(64)))
    model.add(Dense(20, activation='softmax'))  # Assuming 20 categories
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    return model

# Convert the text into integer token sequences (the TF-IDF features are not used here)
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train)
X_train_sequences = tokenizer.texts_to_sequences(X_train)
X_test_sequences = tokenizer.texts_to_sequences(X_test)

# Pad sequences to a consistent length (the longest training sequence)
max_len = max(len(x) for x in X_train_sequences)
X_train_padded = pad_sequences(X_train_sequences, maxlen=max_len, padding='post')
X_test_padded = pad_sequences(X_test_sequences, maxlen=max_len, padding='post')

# Vocabulary size
vocab_size = len(tokenizer.word_index) + 1

embedding_dim=100
# Create and train the RNN model on the padded token sequences
rnn_model_tfidf = create_rnn_model(input_shape=max_len, embedding_dim=embedding_dim, vocab_size=vocab_size)
rnn_model_tfidf.fit(X_train_padded, y_train, epochs=5, batch_size=64, validation_data=(X_test_padded, y_test))

# Evaluate the model
y_pred_tfidf = np.argmax(rnn_model_tfidf.predict(X_test_padded), axis=1)

# Calculate metrics for TF-IDF
accuracy_tfidf = accuracy_score(y_test, y_pred_tfidf)
precision_tfidf = precision_score(y_test, y_pred_tfidf, average='weighted')
recall_tfidf = recall_score(y_test, y_pred_tfidf, average='weighted')
f1_tfidf = f1_score(y_test, y_pred_tfidf, average='weighted')

# Print metrics for TF-IDF
print("TF-IDF Metrics:")
print("Accuracy:", accuracy_tfidf)
print("Precision:", precision_tfidf)
print("Recall:", recall_tfidf)
print("F1 Score:", f1_tfidf)


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
TF-IDF Metrics:
Accuracy: 0.3275
Precision: 0.3246095682095682
Recall: 0.3275
F1 Score: 0.3119577265307792
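
The padding step in the cell above can be sketched in plain Python. This is an illustration under one assumption: that Keras `pad_sequences` is called with `padding='post'` and its default `truncating='pre'`, so sequences longer than `maxlen` lose their leading tokens. The helper `pad_post` and the toy sequences are hypothetical.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad with `value` at the end; drop leading tokens from sequences longer than maxlen."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep only the last maxlen tokens
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded

# Hypothetical integer token sequences.
train_sequences = [[4, 7], [3, 9, 2, 5]]
max_len = max(len(s) for s in train_sequences)  # taken from the training data only

train_padded = pad_post(train_sequences, max_len)
test_padded = pad_post([[8, 1, 6, 2, 9]], max_len)  # a test sequence longer than max_len
print(train_padded)
print(test_padded)
```

Index 0 is reserved for padding because the tokenizer's word indices start at 1, which is why the cell sets `vocab_size = len(tokenizer.word_index) + 1`. Since `max_len` comes from the longest training document, most rows consist largely of padding; a smaller fixed `maxlen` is a common alternative.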


Word Embeddings with RNN

In [None]:
# The Word2Vec model, the averaging helpers (average_word_vectors,
# get_avg_feature_vectors), and the PCA-transformed document vectors
# (X_train_word2vec, X_test_word2vec) were already built in the
# "Using Word-Embeddings" cell above, so they are reused here rather than rebuilt.
embedding_dim = 200  # must match the Word2Vec vector size (the previous cell set this to 100)

# Reshape the averaged Word2Vec vectors to (samples, time steps=1, features) for the LSTM
X_train_word2vec = X_train_word2vec.reshape(X_train_word2vec.shape[0], 1, X_train_word2vec.shape[1])
X_test_word2vec = X_test_word2vec.reshape(X_test_word2vec.shape[0], 1, X_test_word2vec.shape[1])

# Define the RNN model for Word2Vec
from tensorflow.keras import Input

def create_rnn_model_word2vec(input_shape, embedding_dim):
    # input_shape is kept for compatibility with the call below but is unused
    model = Sequential()
    # Declare the input on the first layer: one time step per document (the averaged vector)
    model.add(Input(shape=(1, embedding_dim)))
    model.add(Bidirectional(LSTM(128, return_sequences=True)))
    model.add(LSTM(64))
    model.add(Dense(20, activation='softmax'))  # Assuming 20 categories
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    return model

# Create and train the RNN model on Word2Vec
rnn_model_word2vec = create_rnn_model_word2vec(input_shape=embedding_dim, embedding_dim=embedding_dim)
rnn_model_word2vec.fit(X_train_word2vec, y_train, epochs=4000, batch_size=64, validation_data=(X_test_word2vec, y_test))

# Evaluate the model
y_pred_word2vec = np.argmax(rnn_model_word2vec.predict(X_test_word2vec), axis=1)

# Calculate metrics for Word2Vec
accuracy_word2vec = accuracy_score(y_test, y_pred_word2vec)
precision_word2vec = precision_score(y_test, y_pred_word2vec, average='weighted')
recall_word2vec = recall_score(y_test, y_pred_word2vec, average='weighted')
f1_word2vec = f1_score(y_test, y_pred_word2vec, average='weighted')

# Print metrics for Word2Vec
print("\nWord2Vec Metrics:")
print("Accuracy:", accuracy_word2vec)
print("Precision:", precision_word2vec)
print("Recall:", recall_word2vec)
print("F1 Score:", f1_word2vec)



Streaming output truncated to the last 5000 lines.
Epoch 1505/4000
Epoch 1506/4000
...
Epoch 1562/4000
[remaining training log truncated]

## Inference

The results highlight the strengths and weaknesses of each approach.

The project evaluated several text classification models on a 2,000-document sample of the 20 Newsgroups dataset (1,997 documents after cleaning, with 20% held out for testing). TF-IDF with Naive Bayes achieved a moderate 60% accuracy, with competitive precision and recall in some categories but weak results in others, possibly due to limitations in handling diverse topics. Word embeddings with Naive Bayes reached only 25% accuracy. TF-IDF with SVM stood out with 67% accuracy and improved precision, recall, and F1 score across categories.

Word embeddings with SVM were similarly weak at 26% accuracy. The RNNs reached 32.75% (the cell labeled TF-IDF, which actually trains on padded token sequences with a learned embedding layer) and 38.25% (word embeddings). Only the token-sequence model can capture sequential dependencies: the Word2Vec RNN receives each document as a single averaged vector, so it sees one time step and no word order. In conclusion, careful model selection and fine-tuning are crucial. TF-IDF with a linear SVM was the most effective combination here, and further optimization may improve overall performance.