# <center>Text Classification Algorithms: A Survey</center>

###### <center>Kamran Kowsari, Kiana Jafari Meimandi, Mojtaba Heidarysafa, Sanjana Mendu, Laura Barnes and Donald Brown.</center>

# <center>Logistic Regression:</center>

#### A key approach for text classification is logistic regression, which predicts categorical outcomes from input text. It models the probability that each class label applies to a given text sample. The technique applies a logistic function to a linear combination of weighted features extracted from the text, producing a probability value between 0 and 1.
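
As a minimal sketch of that mapping, the snippet below applies the logistic (sigmoid) function to a weighted sum of features. The weights, bias, and feature values are hypothetical numbers chosen for illustration; they do not come from the paper or from any trained model.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, weights, bias):
    """Probability of the positive class for a single feature vector."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical weights for three vocabulary words (illustrative values only).
weights = [1.2, -0.8, 0.3]
bias = -0.1
features = [2.0, 1.0, 0.0]  # e.g. how often each of the three words occurs

p = predict_proba(features, weights, bias)
print(round(p, 4))
```

A weighted sum of zero gives a probability of exactly 0.5; larger positive sums push the probability toward 1 and larger negative sums push it toward 0.
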
#### In text classification, the features are typically the frequency or presence of words or n-grams (sequences of adjacent words). Each feature is assigned a weight that indicates how strongly it points toward or away from a class. During training, an optimization algorithm adjusts these weights to reduce the discrepancy between predicted and actual labels. For binary tasks the logistic (sigmoid) function maps the weighted sum of features to a probability; for multi-class tasks the softmax function generalizes it so that the class probabilities sum to 1.
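
The training loop described above can be sketched in a few lines of plain Python. This is an illustrative binary example, not the notebook's method: the four-sentence corpus, the labels, the learning rate, and the iteration count are all made up, and full-batch gradient descent stands in for whatever optimizer a library would use.

```python
import math
from collections import Counter


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def vectorize(text, vocab):
    """Turn a text into word counts, one entry per vocabulary word."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]


# Tiny made-up corpus: label 1 = sports, label 0 = cooking.
texts = [
    "the team won the match",
    "a great goal in the match",
    "bake the cake with sugar",
    "add sugar and bake slowly",
]
labels = [1, 1, 0, 0]

vocab = sorted({w for t in texts for w in t.lower().split()})
X = [vectorize(t, vocab) for t in texts]

weights = [0.0] * len(vocab)
bias = 0.0
learning_rate = 0.5  # assumed value, small enough for this toy problem

for _ in range(200):
    grad_w = [0.0] * len(vocab)
    grad_b = 0.0
    for x, y in zip(X, labels):
        p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
        error = p - y  # gradient of the log loss w.r.t. the weighted sum
        for j, xi in enumerate(x):
            grad_w[j] += error * xi
        grad_b += error
    n = len(X)
    weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
    bias -= learning_rate * grad_b / n


def predict(text):
    """Probability that a text belongs to the 'sports' class."""
    x = vectorize(text, vocab)
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


for t in texts:
    print(t, "->", round(predict(t), 3))
```

Words that occur only in one class end up with weights of the corresponding sign, which is the sense in which a weight reflects a word's importance for the decision.
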

#### Research paper Link: <a href = "https://arxiv.org/pdf/1904.08067v5.pdf">Click Here</a>

#### Dataset source link: <a href = "https://github.com/kk7nc/Text_Classification/tree/master/Data">Click Here</a>

#### Github Link: <a href = "https://github.com/kk7nc/Text_Classification/tree/master/code">Click Here</a>

#### Here, we run logistic regression using the same <a href = "https://github.com/kk7nc/Text_Classification/blob/master/Data/Download_Glove.py">GloVe</a> data, an algorithm that differs from the set run by the authors of the research paper.

#### The list of algorithms run by the authors of the paper: <a href="https://github.com/kk7nc/Text_Classification/tree/master/code/RMDL">Click Here</a>

In [7]:
'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
RMDL: Random Multimodel Deep Learning for Classification
 * Copyright (C) 2018  Kamran Kowsari <kk7nc@virginia.edu>
 * Last Update: 04/25/2018
 * This file is part of  RMDL project, University of Virginia.
 * Free to use, change, share and distribute source code of RMDL
 * Refrenced paper : RMDL: Random Multimodel Deep Learning for Classification
 * Refrenced paper : An Improvement of Data Classification using Random Multimodel Deep Learning (RMDL)
 * Comments and Error: email: kk7nc@virginia.edu
'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''


from __future__ import print_function

import os, sys, tarfile
import numpy as np
import zipfile

if sys.version_info >= (3, 0, 0):
    import urllib.request as urllib  # ugly but works
else:
    import urllib

print(sys.version_info)

# path to the directory where the GloVe files are stored
DATA_DIR = os.path.join('.', 'Glove')


def download_and_extract(data='Wikipedia'):
    """
    Download and extract the GloVe
    :return: None
    """

    if data=='Wikipedia':
        DATA_URL = 'http://nlp.stanford.edu/data/glove.6B.zip'
    elif data=='Common_Crawl_840B':
        DATA_URL = 'http://nlp.stanford.edu/data/wordvecs/glove.840B.300d.zip'
    elif data=='Common_Crawl_42B':
        DATA_URL = 'http://nlp.stanford.edu/data/wordvecs/glove.42B.300d.zip'
    elif data=='Twitter':
        DATA_URL = 'http://nlp.stanford.edu/data/wordvecs/glove.twitter.27B.zip'
    else:
        raise ValueError("parameter should be Twitter, Common_Crawl_42B, Common_Crawl_840B, or Wikipedia")


    dest_directory = DATA_DIR
    if not os.path.exists(dest_directory):
        os.makedirs(dest_directory)
    filename = DATA_URL.split('/')[-1]
    filepath = os.path.join(dest_directory, filename)
    print(filepath)

    path = os.path.abspath(dest_directory)
    if not os.path.exists(filepath):
        def _progress(count, block_size, total_size):
            sys.stdout.write('\rDownloading %s %.2f%%' % (filename,
                                                          float(count * block_size) / float(total_size) * 100.0))
            sys.stdout.flush()

        filepath, _ = urllib.urlretrieve(DATA_URL, filepath)#, reporthook=_progress)


        zip_ref = zipfile.ZipFile(filepath, 'r')
        zip_ref.extractall(DATA_DIR)
        zip_ref.close()
    return path

sys.version_info(major=3, minor=8, micro=8, releaselevel='final', serial=0)


In [8]:
'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
RMDL: Random Multimodel Deep Learning for Classification
 * Copyright (C) 2018  Kamran Kowsari <kk7nc@virginia.edu>
 * Last Update: 04/25/2018
 * This file is part of  RMDL project, University of Virginia.
 * Free to use, change, share and distribute source code of RMDL
 * Refrenced paper : RMDL: Random Multimodel Deep Learning for Classification
 * Refrenced paper : An Improvement of Data Classification using Random Multimodel Deep Learning (RMDL)
 * Comments and Error: email: kk7nc@virginia.edu
'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''


from __future__ import print_function

import os, sys, tarfile
import numpy as np

if sys.version_info >= (3, 0, 0):
    import urllib.request as urllib  # ugly but works
else:
    import urllib

print(sys.version_info)

# path to the directory where the WOS data is stored
DATA_DIR = os.path.join('.', 'data_WOS')

# URL of the compressed Web of Science dataset
DATA_URL = 'http://kowsari.net/WebOfScience.tar.gz'


def download_and_extract():
    """
    Download and extract the WOS datasets
    :return: None
    """
    dest_directory = DATA_DIR
    if not os.path.exists(dest_directory):
        os.makedirs(dest_directory)
    filename = DATA_URL.split('/')[-1]
    filepath = os.path.join(dest_directory, filename)


    path = os.path.abspath(dest_directory)
    if not os.path.exists(filepath):
        def _progress(count, block_size, total_size):
            sys.stdout.write('\rDownloading %s %.2f%%' % (filename,
                                                          float(count * block_size) / float(total_size) * 100.0))
            sys.stdout.flush()

        filepath, _ = urllib.urlretrieve(DATA_URL, filepath, reporthook=_progress)

        print('Downloaded', filename)

        tarfile.open(filepath, 'r').extractall(dest_directory)
    return path

sys.version_info(major=3, minor=8, micro=8, releaselevel='final', serial=0)


In [12]:
!pip install RMDL
!pip install Keras-Preprocessing

import tensorflow as tf
from keras.layers import Dropout, Dense,Input,Embedding,Flatten, AveragePooling2D, Conv2D,Reshape
from keras.models import Sequential,Model
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from sklearn import metrics
from keras.preprocessing.text import Tokenizer
# from keras.utils import pad_sequences
from keras.preprocessing.sequence import pad_sequences
from sklearn.datasets import fetch_20newsgroups
from keras.layers import Concatenate

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression






In [10]:
def loadData_Tokenizer(X_train, X_test,MAX_NB_WORDS=75000,MAX_SEQUENCE_LENGTH=500):
    np.random.seed(7)
    text = np.concatenate((X_train, X_test), axis=0)
    text = np.array(text)
    tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
    tokenizer.fit_on_texts(text)
    sequences = tokenizer.texts_to_sequences(text)
    word_index = tokenizer.word_index
    text = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
    print('Found %s unique tokens.' % len(word_index))
    indices = np.arange(text.shape[0])
    # np.random.shuffle(indices)
    text = text[indices]
    print(text.shape)
    X_train = text[0:len(X_train), ]
    X_test = text[len(X_train):, ]
    embeddings_index = {}
    # GloVe file, which can be downloaded from https://nlp.stanford.edu/projects/glove/
    with open(r"C:\Users\ASUS\Downloads\glove.6B\glove.6B.100d.txt", encoding="utf8") as f:
        for line in f:
            values = line.split()
            word = values[0]
            try:
                coefs = np.asarray(values[1:], dtype='float32')
            except ValueError:
                # skip malformed lines instead of reusing the previous vector
                continue
            embeddings_index[word] = coefs
    print('Total %s word vectors.' % len(embeddings_index))
    return (X_train, X_test, word_index,embeddings_index)
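
The lookup that `loadData_Tokenizer` builds can be illustrated without downloading GloVe. The sketch below parses a three-line, made-up file in the same "word followed by numbers" layout and then fills an embedding matrix from a hypothetical `word_index`. The helper names `parse_glove` and `build_embedding_matrix` are assumptions for this example. Note that, unlike the notebook, which initializes missing rows randomly, the sketch leaves them at zero.

```python
import io


def parse_glove(lines, expected_dim):
    """Parse 'word v1 v2 ...' lines into {word: [floats]}, skipping bad lines."""
    embeddings = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) != expected_dim + 1:
            continue  # wrong number of fields
        try:
            embeddings[parts[0]] = [float(v) for v in parts[1:]]
        except ValueError:
            continue  # a field that is not a number
    return embeddings


def build_embedding_matrix(word_index, embeddings, dim):
    """Row i holds the vector of the word with index i; row 0 is for padding."""
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, i in word_index.items():
        vector = embeddings.get(word)
        if vector is not None:
            matrix[i] = vector
    return matrix


# Made-up 3-dimensional vectors; the third line is deliberately malformed.
sample = io.StringIO(
    "cat 0.1 0.2 0.3\n"
    "dog 0.4 0.5 0.6\n"
    "broken 0.7 oops 0.9\n"
)
embeddings = parse_glove(sample, expected_dim=3)
word_index = {"cat": 1, "dog": 2, "fish": 3}  # hypothetical tokenizer output
matrix = build_embedding_matrix(word_index, embeddings, dim=3)

for row in matrix:
    print(row)
```
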

In [11]:
def build_model_logistic_regression(word_index, embeddings_index, nclasses, MAX_SEQUENCE_LENGTH=500, EMBEDDING_DIM=100):

    model = Sequential()
    embedding_matrix = np.random.random((len(word_index) + 1, EMBEDDING_DIM))
    for word, i in word_index.items():
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            if len(embedding_matrix[i]) != len(embedding_vector):
                print("could not broadcast input array from shape", str(len(embedding_matrix[i])),
                      "into shape", str(len(embedding_vector)), "Please make sure your "
                      "EMBEDDING_DIM is equal to the embedding_vector file, GloVe,")
                exit(1)

            embedding_matrix[i] = embedding_vector

    embedding_layer = Embedding(len(word_index) + 1,
                                EMBEDDING_DIM,
                                weights=[embedding_matrix],
                                input_length=MAX_SEQUENCE_LENGTH,
                                trainable=True)

    sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
    embedded_sequences = embedding_layer(sequence_input)
    x = Flatten()(embedded_sequences)
    preds = Dense(nclasses, activation='softmax')(x)
    model = Model(sequence_input, preds)

    model.compile(loss='sparse_categorical_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])

    return model
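
Because the model is just an embedding layer followed by one dense softmax layer, its size can be worked out by hand. The sketch below does that arithmetic using the defaults from the function above (sequence length 500, embedding dimension 100), 20 classes, and a vocabulary of 5,182 tokens, which is the figure the notebook reports for its 80-document sample. The helper name is an assumption for this example.

```python
def logistic_regression_param_count(vocab_size, embedding_dim, seq_len, n_classes):
    """Count trainable parameters of Embedding -> Flatten -> Dense(softmax)."""
    # One vector per word index, plus one row for index 0 (padding).
    embedding_params = (vocab_size + 1) * embedding_dim
    # Flatten yields seq_len * embedding_dim inputs; Dense adds one bias per class.
    dense_params = seq_len * embedding_dim * n_classes + n_classes
    return embedding_params, dense_params


emb, dense = logistic_regression_param_count(
    vocab_size=5182, embedding_dim=100, seq_len=500, n_classes=20
)
print(emb, dense, emb + dense)  # 518300 1000020 1518320
```

The dense layer alone holds about a million weights, far more than the 40 training documents used in this notebook can constrain.
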

In [14]:
from sklearn.datasets import fetch_20newsgroups
from RMDL import text_feature_extraction as txt

newsgroups_train = fetch_20newsgroups(subset='train')
newsgroups_test = fetch_20newsgroups(subset='test')
X_train = newsgroups_train.data[:40]
X_test = newsgroups_test.data[:40]
y_train = newsgroups_train.target[:40]
y_test = newsgroups_test.target[:40]


X_train_Glove,X_test_Glove, word_index,embeddings_index = loadData_Tokenizer(X_train,X_test)


model_LR = build_model_logistic_regression(word_index,embeddings_index, 20)


model_LR.summary()

model_LR.fit(X_train_Glove, y_train,
                              validation_data=(X_test_Glove, y_test),
                              epochs=20,
                              batch_size=128,
                              verbose=2)

predicted = model_LR.predict(X_test_Glove)

predicted = np.argmax(predicted, axis=1)

sys.version_info(major=3, minor=8, micro=8, releaselevel='final', serial=0)
sys.version_info(major=3, minor=8, micro=8, releaselevel='final', serial=0)


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Found 5182 unique tokens.
(80, 500)
Total 400000 word vectors.
Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 500)]             0         
                                                                 
 embedding (Embedding)       (None, 500, 100)          518300    
                                                                 
 flatten (Flatten)           (None, 50000)             0         
                                                                 
 dense (Dense)               (None, 20)                1000020   
                                                                 
Total params: 1,518,320
Trainable params: 1,518,320
Non-trainable params: 0
_________________________________________________________________
Epoch 1/20
1/1 - 1s - loss: 3.0373 - accuracy: 0.1000 - val_loss: 16.4142 - val_accuracy: 0.0750 - 589ms/epoch - 589ms/s
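
The notebook computes `predicted` but does not score it against `y_test`. A minimal sketch of that step is shown below with made-up label lists, so the numbers say nothing about the actual model. With scikit-learn installed, `metrics.accuracy_score(y_test, predicted)` should give the same overall figure as the `accuracy` helper here.

```python
from collections import defaultdict


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def per_class_recall(y_true, y_pred):
    """For each true class, the fraction of its samples predicted correctly."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        if t == p:
            hits[t] += 1
    return {label: hits[label] / totals[label] for label in sorted(totals)}


# Made-up labels for illustration only.
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 2, 2, 2, 1, 1]

print("accuracy:", round(accuracy(y_true, y_pred), 3))
print("recall per class:", per_class_recall(y_true, y_pred))
```

Per-class recall is worth checking on a 20-class problem because a single accuracy figure can hide classes the model never predicts.
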

#### Conclusion:
In conclusion, combining GloVe embeddings with logistic regression offers a useful introduction to text categorization in natural language processing. The project lays out the essential ideas of tokenization, embedding, and classification as a starting point for machine learning and NLP work.

By using the pre-trained GloVe word embeddings, the model starts from word representations that already encode semantic relationships between words. Logistic regression, implemented here as a single softmax layer over the flattened embeddings, is a simple and efficient choice for this 20-class classification task.

Important tasks including data pre-processing, model creation, and evaluation are addressed throughout the implementation process. The model reached an accuracy of 1.0 in the last two epochs. Because it was fit on only 40 training documents, this figure should be read alongside the validation accuracy, and it underlines the importance of hyperparameter tuning for model performance.

As text classification is crucial for sentiment analysis, spam detection, and content categorization, the project also emphasizes the importance of real-world applicability. This practical experience not only improves understanding of machine learning methods, but also offers a foundation for investigating future developments in the constantly developing field of NLP.

In essence, this code supports both a practical grasp of logistic regression and word embeddings and further study of the varied applications of text categorization.