**Document Classification using Machine Learning and Deep Learning Techniques**

In the realm of automated document classification, we embarked on an exploration to classify a diverse set of documents into predefined categories. The dataset under scrutiny, sourced from Kaggle, is titled "[(10)Dataset Text Document Classification](https://www.kaggle.com/datasets/jensenbaxter/10dataset-text-document-classification)". It comprises documents categorized under the following labels: 'business', 'entertainment', 'food', 'graphics', 'historical', 'medical', 'politics', 'space', 'sport', and 'technology'. Each category is represented by 100 text files, culminating in a comprehensive dataset.

For the purpose of training and evaluation, the dataset was partitioned into a 70-30 split, with 70% allocated for training and the remaining 30% reserved for testing.

Our investigative approach encompassed a range of machine learning models, namely Naive Bayes, Support Vector Machines (SVM), and Random Forest. Additionally, a Deep Neural Network *(due to this model, the chanks of codes might take a while to load)*, a paradigm of deep learning, was also employed. To enhance the performance and robustness of these models, various techniques were integrated into the pipeline:

Feature Engineering: Utilization of N-grams to capture local word order information.
Word Embeddings: Leveraging pre-trained embeddings to represent words in a dense vector space.
Feature Selection: The Chi-Squared Test was employed to select significant features.
Model Ensembling: Bagging was used to reduce variance by training multiple models.
Regularization Techniques: Dropout was introduced to prevent overfitting in the deep learning model.
The models were evaluated on multiple metrics, including Precision, Recall, F1-Score, and Accuracy. The F1-Score, which harmoniously balances Precision and Recall, was chosen as the primary metric to determine the best model.

Upon thorough evaluation, all models showed a high performance averaging with more than 0.9. SVM without enhancements and Dropout Deep Neural Network were the best performance models, both achieving more 0.980. Nonetheless, as the Deep Neural Network model load time is significant, SVM without enhancement techniques is chosen with an F1 Score of 0.980 is chosen as the best. While other models with feature enhancements were closely competitive, the computational overhead and marginally inferior performance make the standardized SVM the recommended choice for this document classification task.

The development and refinement of our approach were greatly aided by resources from StackOverflow, ChatGPT, and a series of YouTube tutorials:

[Python Machine Learning #4 - Support Vector Machines
](https://www.youtube.com/watch?v=99Eyw7Quacc)

[Neural Network Python | How to make a Neural Network in Python | Python Tutorial | Edureka](https://www.youtube.com/watch?v=9UBqkUJVP4g)

[Machine Learning Tutorial Python - 11 Random Forest
](https://www.youtube.com/watch?v=ok2s1vV9XW0)

[Naive Bayes Classifier in Python (from scratch!)
](https://www.youtube.com/watch?v=3I8oX3OUL6I)

**Literature used for deeper understanding of the models:**
Alpaydin, E., 2020. Introduction to machine learning. MIT press.
Baltrušaitis, T., Ahuja, C. and Morency, L.P., 2018. Multimodal machine learning: A survey and taxonomy. IEEE transactions on pattern analysis and machine intelligence, 41(2), pp.423-443.
Chen, Y.W. and Lin, C.J., 2006. Combining SVMs with various feature selection strategies. Feature extraction: foundations and applications, pp.315-324.
Goodfellow, I., Bengio, Y. and Courville, A., 2016. Deep learning. MIT press.
Liaw, A. and Wiener, M., 2015. randomForest: Breiman and Cutler’s random forests for classification and regression. R package version, 4, p.14.
McCallum, A. and Nigam, K., 1998, July. A comparison of event models for naive bayes text classification. In AAAI-98 workshop on learning for text categorization (Vol. 752, No. 1, pp. 41-48).
Oliphant, T.E., 2007. Python for scientific computing. Computing in science & engineering, 9(3), pp.10-20.
Provost, F. and Fawcett, T., 2013. Data Science for Business: What you need to know about data mining and data-analytic thinking. " O'Reilly Media, Inc.".
Shawe-Taylor, J. and Cristianini, N., 2004. Kernel methods for pattern analysis. Cambridge university press.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. and Salakhutdinov, R., 2014. Dropout: a simple way to prevent neural networks from overfitting.
The journal of machine learning research, 15(1), pp.1929-1958.




Importing Libraries and API


In [19]:
# Import necessary libraries and API
import numpy as np
import pandas as pd
import os
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
import gensim.downloader as api
from gensim.models import Word2Vec
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.ensemble import BaggingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import LinearSVC
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier
from sklearn.metrics import f1_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.preprocessing import MinMaxScaler
word2vec_model = api.load("word2vec-google-news-300")





In [3]:
# Once uploaded, unzip using:
!unzip DATASET.zip
!ls


Archive:  DATASET.zip
  inflating: business/business_1.txt  
  inflating: business/business_10.txt  
  inflating: business/business_100.txt  
  inflating: business/business_11.txt  
  inflating: business/business_12.txt  
  inflating: business/business_13.txt  
  inflating: business/business_14.txt  
  inflating: business/business_15.txt  
  inflating: business/business_16.txt  
  inflating: business/business_17.txt  
  inflating: business/business_18.txt  
  inflating: business/business_19.txt  
  inflating: business/business_2.txt  
  inflating: business/business_20.txt  
  inflating: business/business_21.txt  
  inflating: business/business_22.txt  
  inflating: business/business_23.txt  
  inflating: business/business_24.txt  
  inflating: business/business_25.txt  
  inflating: business/business_26.txt  
  inflating: business/business_27.txt  
  inflating: business/business_28.txt  
  inflating: business/business_29.txt  
  inflating: business/business_3.txt  
  inflating: busines

Data Loading and Preprocessing


Extrcting each category, setting 30% of data for training and 70% for testing

In [4]:
# List of categories based on the directories
categories = ['business', 'entertainment', 'food', 'graphics', 'historical', 'medical', 'politics', 'space', 'sport', 'technologie']

texts = []
labels = []

for category in categories:
    for filename in os.listdir(category):
        with open(f"./{category}/{filename}", 'r', encoding='utf-8', errors='ignore') as file:
            texts.append(file.read())
            labels.append(category)

# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.3, random_state=42)


**Naive Bayes, SVM, Random Forest, Deep Neural Network (No enhancment tecniques)**

**Description:**

This code segment is dedicated to enhancing the document classification process by incorporating bigrams (pairs of adjacent words) and evaluating the performance of various machine learning models, including a deep neural network.

**Bigram Incorporation:**
The TfidfVectorizer is employed with the ngram_range parameter set to (1,2). This means that the vectorization process will consider both individual words (unigrams) and pairs of adjacent words (bigrams) to transform the text data into numerical format. This can capture more contextual information than using unigrams alone.

The training and test datasets are transformed using this vectorizer and converted to arrays for compatibility with Keras.

**Deep Neural Network Definition:**

A feed-forward neural network is defined using the Keras library. The network comprises two hidden layers with dropout regularization to prevent overfitting. The output layer uses a softmax activation function, suitable for multi-class classification problems. The loss function chosen is sparse_categorical_crossentropy, which is appropriate for integer-encoded class labels.

**Model Training and Evaluation:**

Four models are defined: Naive Bayes, Support Vector Machine (SVM), Random Forest, and the previously defined Deep Neural Network.
Each model is trained on the training dataset and then evaluated on the test dataset.
The performance of each model is assessed using various metrics, including precision, recall, F1-score, accuracy, and support. These metrics provide a comprehensive understanding of each model's performance, considering both the positive and negative classes.
The results are aggregated and presented in a tabular format using a DataFrame.

**Outcome:**

Upon execution, this code will display the performance metrics of each model when using bigrams as part of the feature extraction process. This provides insights into the potential benefits of capturing more contextual information from the text data.


In [5]:
# Using only unigrams
tfidf_vectorizer_ngrams = TfidfVectorizer(stop_words='english', ngram_range=(1,1))
X_train_tfidf_ngrams = tfidf_vectorizer_ngrams.fit_transform(X_train).toarray()  # Convert to array for Keras
X_test_tfidf_ngrams = tfidf_vectorizer_ngrams.transform(X_test).toarray()

# Define a simple feed-forward neural network for Keras
def create_nn_model():
    model = Sequential()
    model.add(Dense(128, input_dim=X_train_tfidf_ngrams.shape[1], activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(64, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(len(set(y_train)), activation='softmax'))  # Number of classes
    model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

def train_and_evaluate_models(X_train, X_test):
    # Define the models
    models = {
        'Naive Bayes': MultinomialNB(),
        'SVM': SVC(kernel='linear'),
        'Random Forest': RandomForestClassifier(random_state=42),
        'Deep Neural Network': KerasClassifier(build_fn=create_nn_model, epochs=10, batch_size=32, verbose=0)
    }

    # Train and evaluate the models
    results = {}
    for model_name, model in models.items():
        # Train
        model.fit(X_train, y_train)
        # Predict
        y_pred = model.predict(X_test)
        # Evaluate
        report = classification_report(y_test, y_pred, output_dict=True)
        results[model_name] = {
            'Precision': report['weighted avg']['precision'],
            'Recall': report['weighted avg']['recall'],
            'F1-Score': report['weighted avg']['f1-score'],
            'Accuracy': report['accuracy'],
            'Support': report['weighted avg']['support']
        }

    # Convert results to DataFrame for display
    df_results = pd.DataFrame(results).transpose()
    return df_results

# Train and evaluate models with Unigrams
df_results = train_and_evaluate_models(X_train_tfidf_ngrams, X_test_tfidf_ngrams)
print(df_results)


  'Deep Neural Network': KerasClassifier(build_fn=create_nn_model, epochs=10, batch_size=32, verbose=0)


                     Precision    Recall  F1-Score  Accuracy  Support
Naive Bayes           0.961440  0.956667  0.957449  0.956667    300.0
SVM                   0.981571  0.980000  0.980188  0.980000    300.0
Random Forest         0.931606  0.923333  0.924426  0.923333    300.0
Deep Neural Network   0.977766  0.976667  0.976829  0.976667    300.0


**Naive Bayes, SVM, Random Forest, Deep Neural Network (Feature Engineering - N-grams)**

**Description:**

This segment of code is dedicated to the task of document classification, with a particular emphasis on leveraging bigrams (pairs of adjacent words) as features. The performance of various machine learning models, inclusive of a deep neural network, is evaluated.


**Bigram Feature Extraction:**

The TfidfVectorizer is utilized with the ngram_range parameter set to (1,2). This ensures that the vectorization process captures both individual words (unigrams) and their adjacent pairs (bigrams). This approach can potentially encapsulate more contextual nuances than merely using unigrams.
The training and test datasets undergo transformation using this vectorizer and are subsequently converted to arrays to ensure compatibility with the Keras framework.


**Deep Neural Network Architecture:**

A feed-forward neural network is architected using the Keras library. The network encompasses two hidden layers, with dropout layers interspersed to mitigate the risk of overfitting. The activation function for the output layer is set to softmax, making it suitable for multi-class classification scenarios. The chosen loss function, sparse_categorical_crossentropy, is apt for integer-encoded class labels.

**Model Training & Evaluation Framework:**

Four distinct models are delineated: Naive Bayes, Support Vector Machine (SVM), Random Forest, and the previously defined Deep Neural Network.
Each model undergoes training on the training dataset, followed by evaluation on the test dataset.

Performance metrics, namely precision, recall, F1-score, accuracy, and support, are employed to gauge the efficacy of each model. These metrics furnish a holistic view of the model's performance, taking into account both the positive and negative classes.
The results are collated and presented in a structured tabular format using a DataFrame.

**Outcome:**

Upon execution, this code will render the performance metrics of each model when bigrams are incorporated into the feature extraction process. This elucidates the potential advantages of imbibing more contextual information from the textual data.




In [6]:
# Using bigrams
tfidf_vectorizer_ngrams = TfidfVectorizer(stop_words='english', ngram_range=(1,2))
X_train_tfidf_ngrams = tfidf_vectorizer_ngrams.fit_transform(X_train).toarray()  # Convert to array for Keras
X_test_tfidf_ngrams = tfidf_vectorizer_ngrams.transform(X_test).toarray()

# Define a simple feed-forward neural network for Keras
def create_nn_model():
    model = Sequential()
    model.add(Dense(128, input_dim=X_train_tfidf_ngrams.shape[1], activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(64, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(len(set(y_train)), activation='softmax'))  # Number of classes
    model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

def train_and_evaluate_models(X_train, X_test):
    # Define the models
    models = {
        'N-grams Naive Bayes': MultinomialNB(),
        'N-grams SVM': SVC(kernel='linear'),
        'N-grams Random Forest': RandomForestClassifier(random_state=42),
        'N-grams Deep Neural Network': KerasClassifier(build_fn=create_nn_model, epochs=10, batch_size=32, verbose=0)
    }

    # Train and evaluate the models
    results = {}
    for model_name, model in models.items():
        # Train
        model.fit(X_train, y_train)
        # Predict
        y_pred = model.predict(X_test)
        # Evaluate
        report = classification_report(y_test, y_pred, output_dict=True)
        results[model_name] = {
            'Precision': report['weighted avg']['precision'],
            'Recall': report['weighted avg']['recall'],
            'F1-Score': report['weighted avg']['f1-score'],
            'Accuracy': report['accuracy'],
            'Support': report['weighted avg']['support']
        }

    # Convert results to DataFrame for display
    df_results = pd.DataFrame(results).transpose()
    return df_results

# Train and evaluate models with N-grams
df_results_ngrams = train_and_evaluate_models(X_train_tfidf_ngrams, X_test_tfidf_ngrams)
print(df_results_ngrams)


  'N-grams Deep Neural Network': KerasClassifier(build_fn=create_nn_model, epochs=10, batch_size=32, verbose=0)


                             Precision    Recall  F1-Score  Accuracy  Support
N-grams Naive Bayes           0.955492  0.946667  0.948155  0.946667    300.0
N-grams SVM                   0.980477  0.980000  0.979978  0.980000    300.0
N-grams Random Forest         0.932119  0.926667  0.926885  0.926667    300.0
N-grams Deep Neural Network   0.980752  0.980000  0.980027  0.980000    300.0


**Naive Bayes, SVM, Random Forest, Deep Neural Network (Word Embeddings - Pre-trained Embeddings)**

**Description:**

This segment of the code is dedicated to the task of document classification, leveraging the power of Word2Vec embeddings. Word2Vec is a pre-trained model that captures semantic relationships between words by representing them as vectors in a high-dimensional space. The performance of various machine learning models, inclusive of a deep neural network, is evaluated using these embeddings.

**Word2Vec Embeddings:**

The Google News Word2Vec model, which is trained on a vast corpus and has a vector size of 300, is loaded.
A function, document_to_word2vec, is defined to convert a document into its corresponding Word2Vec representation. This is achieved by averaging the Word2Vec vectors of individual words present in the document.
Both the training and test datasets are transformed into their Word2Vec representations.

**Deep Neural Network Architecture:**

A feed-forward neural network is architected using the Keras library. The network encompasses two hidden layers, with dropout layers interspersed to mitigate the risk of overfitting. The activation function for the output layer is set to softmax, making it suitable for multi-class classification scenarios. The chosen loss function, sparse_categorical_crossentropy, is apt for integer-encoded class labels.

**Model Training & Evaluation Framework:**

Four distinct models are delineated: Naive Bayes (specifically GaussianNB due to continuous features from Word2Vec), Support Vector Machine (SVM), Random Forest, and the previously defined Deep Neural Network.
Each model undergoes training on the Word2Vec-transformed training dataset, followed by evaluation on the transformed test dataset.
Performance metrics, namely precision, recall, F1-score, accuracy, and support, are employed to gauge the efficacy of each model. These metrics furnish a holistic view of the model's performance, taking into account both the positive and negative classes.
The results are collated and presented in a structured tabular format using a DataFrame.

**Outcome:**
Upon execution, this code will render the performance metrics of each model when Word2Vec embeddings are employed as features. This elucidates the potential advantages of imbibing semantic information from the textual data.



In [7]:
def document_to_word2vec(doc, model):
    # Tokenize the document, filter out words not in the model's vocabulary
    words = [word for word in doc.split() if word in model.key_to_index]
    if len(words) == 0:
        return np.zeros(model.vector_size)
    # Convert words to vectors and average them
    return np.mean([model[word] for word in words], axis=0)

X_train_word2vec = np.array([document_to_word2vec(doc, word2vec_model) for doc in X_train])
X_test_word2vec = np.array([document_to_word2vec(doc, word2vec_model) for doc in X_test])

# Define a simple feed-forward neural network for Keras
def create_nn_model():
    model = Sequential()
    model.add(Dense(128, input_dim=X_train_word2vec.shape[1], activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(64, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(len(set(y_train)), activation='softmax'))  # Number of classes
    model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

def train_and_evaluate_models(X_train, X_test):
    # Define the models
    models = {
        'Pre-trained Embeddings Naive Bayes': GaussianNB(),
        'Pre-trained Embeddings SVM': SVC(kernel='linear'),
        'Pre-trained Embeddings Random Forest': RandomForestClassifier(random_state=42),
        'Pre-trained Embeddings Deep Neural Network': KerasClassifier(build_fn=create_nn_model, epochs=10, batch_size=32, verbose=0)
    }

    # Train and evaluate the models
    results = {}
    for model_name, model in models.items():
        # Train
        model.fit(X_train, y_train)
        # Predict
        y_pred = model.predict(X_test)
        # Evaluate
        report = classification_report(y_test, y_pred, output_dict=True)
        results[model_name] = {
            'Precision': report['weighted avg']['precision'],
            'Recall': report['weighted avg']['recall'],
            'F1-Score': report['weighted avg']['f1-score'],
            'Accuracy': report['accuracy'],
            'Support': report['weighted avg']['support']
        }

    # Convert results to DataFrame for display
    df_results = pd.DataFrame(results).transpose()
    return df_results

# Train and evaluate models with Word2Vec features
df_results_word2vec = train_and_evaluate_models(X_train_word2vec, X_test_word2vec)
print(df_results_word2vec)


  'Pre-trained Embeddings Deep Neural Network': KerasClassifier(build_fn=create_nn_model, epochs=10, batch_size=32, verbose=0)


                                            Precision    Recall  F1-Score  \
Pre-trained Embeddings Naive Bayes           0.929992  0.920000  0.920931   
Pre-trained Embeddings SVM                   0.956522  0.953333  0.953573   
Pre-trained Embeddings Random Forest         0.945605  0.943333  0.943607   
Pre-trained Embeddings Deep Neural Network   0.917206  0.913333  0.913377   

                                            Accuracy  Support  
Pre-trained Embeddings Naive Bayes          0.920000    300.0  
Pre-trained Embeddings SVM                  0.953333    300.0  
Pre-trained Embeddings Random Forest        0.943333    300.0  
Pre-trained Embeddings Deep Neural Network  0.913333    300.0  


**Naive Bayes, SVM, Random Forest, Deep Neural Network (Feature Selection - Chi-Squared Test)**

**Description:**

In this segment, the code is focused on enhancing the document classification task by incorporating bigrams and feature selection using the chi-squared test. The objective is to discern the impact of these techniques on the performance of various machine learning models, including a deep neural network.

**Bigrams with TF-IDF:**

The Term Frequency-Inverse Document Frequency (TF-IDF) vectorizer is employed with a configuration to consider both unigrams and bigrams. Bigrams can capture more contextual information than unigrams, potentially improving the model's understanding of the text.
The training and test datasets are transformed into their corresponding TF-IDF representations, which are then converted to arrays to facilitate compatibility with Keras.

**Feature Selection using Chi-Squared Test:**

The chi-squared test is a statistical test used to determine the dependence of two categorical variables. Here, it's used to select the top 10,000 features that have the strongest relationship with the output variable.
The TF-IDF transformed datasets are further refined by retaining only the selected features.

**Deep Neural Network Architecture:**

A feed-forward neural network is constructed using the Keras library. The network comprises two hidden layers, with dropout layers interspersed to prevent overfitting. The softmax activation function in the output layer ensures compatibility with multi-class classification. The loss function, sparse_categorical_crossentropy, is suitable for integer-encoded class labels.

**Model Training & Evaluation Framework:**

Four distinct models are delineated: Naive Bayes, Support Vector Machine (SVM), Random Forest, and the previously defined Deep Neural Network.
Each model undergoes training on the chi-squared selected features of the training dataset and is subsequently evaluated on the test dataset.
Performance metrics, namely precision, recall, F1-score, accuracy, and support, are employed to gauge the efficacy of each model. These metrics provide a comprehensive assessment of the model's performance, considering both the positive and negative classes.
The results are collated and presented in a structured tabular format using a DataFrame.

**Outcome:**
Upon execution, this code segment will render the performance metrics of each model when bigrams and chi-squared feature selection are employed. This will elucidate the potential advantages of these techniques in enhancing the model's understanding of the textual data.



In [8]:
# Using bigrams
tfidf_vectorizer_ngrams = TfidfVectorizer(stop_words='english', ngram_range=(1,2))
X_train_tfidf_ngrams = tfidf_vectorizer_ngrams.fit_transform(X_train).toarray()  # Convert to array for Keras
X_test_tfidf_ngrams = tfidf_vectorizer_ngrams.transform(X_test).toarray()

# Select top 10,000 features based on the chi-squared test
k_best = 10000
ch2 = SelectKBest(chi2, k=k_best)
X_train_chi2_selected = ch2.fit_transform(X_train_tfidf_ngrams, y_train)
X_test_chi2_selected = ch2.transform(X_test_tfidf_ngrams)

# Define a simple feed-forward neural network for Keras
def create_nn_model():
    model = Sequential()
    model.add(Dense(128, input_dim=X_train_chi2_selected.shape[1], activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(64, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(len(set(y_train)), activation='softmax'))  # Number of classes
    model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

def train_and_evaluate_models(X_train, X_test):
    # Define the models
    models = {
        'Chi-Squared Test Naive Bayes': MultinomialNB(),
        'Chi-Squared Test SVM': SVC(kernel='linear'),
        'Chi-Squared Test Random Forest': RandomForestClassifier(random_state=42),
        'Chi-Squared Test Deep Neural Network': KerasClassifier(build_fn=create_nn_model, epochs=10, batch_size=32, verbose=0)
    }

    # Train and evaluate the models
    results = {}
    for model_name, model in models.items():
        # Train
        model.fit(X_train, y_train)
        # Predict
        y_pred = model.predict(X_test)
        # Evaluate
        report = classification_report(y_test, y_pred, output_dict=True)
        results[model_name] = {
            'Precision': report['weighted avg']['precision'],
            'Recall': report['weighted avg']['recall'],
            'F1-Score': report['weighted avg']['f1-score'],
            'Accuracy': report['accuracy'],
            'Support': report['weighted avg']['support']
        }

    # Convert results to DataFrame for display
    df_results = pd.DataFrame(results).transpose()
    return df_results

# Train and evaluate models with chi-squared selected features
df_results_chi2 = train_and_evaluate_models(X_train_chi2_selected, X_test_chi2_selected)
print(df_results_chi2)


  'Chi-Squared Test Deep Neural Network': KerasClassifier(build_fn=create_nn_model, epochs=10, batch_size=32, verbose=0)


                                      Precision    Recall  F1-Score  Accuracy  \
Chi-Squared Test Naive Bayes           0.964585  0.960000  0.960622  0.960000   
Chi-Squared Test SVM                   0.974356  0.973333  0.973402  0.973333   
Chi-Squared Test Random Forest         0.951898  0.946667  0.947257  0.946667   
Chi-Squared Test Deep Neural Network   0.976628  0.973333  0.973876  0.973333   

                                      Support  
Chi-Squared Test Naive Bayes            300.0  
Chi-Squared Test SVM                    300.0  
Chi-Squared Test Random Forest          300.0  
Chi-Squared Test Deep Neural Network    300.0  


**Naive Bayes, SVM, Random Forest, Deep Neural Network (Model Ensembling - Bagging)**

**Description:**

In this segment, the code is centered on exploring the benefits of ensemble learning, specifically bagging, to enhance the document classification task. Bagging, or Bootstrap Aggregating, involves training multiple instances of a model on different subsets of the training data and then aggregating their predictions. This can reduce variance and improve generalization.

**Deep Neural Network Architecture:**

A feed-forward neural network is constructed using the Keras library. The network comprises two hidden layers, with dropout layers interspersed to prevent overfitting. The softmax activation function in the output layer ensures compatibility with multi-class classification. The loss function, sparse_categorical_crossentropy, is suitable for integer-encoded class labels.

**Bagging with Deep Neural Network:**

The BaggingClassifier from scikit-learn is employed to create an ensemble of the previously defined deep neural network. The ensemble consists of 10 instances of the neural network, each trained on a different subset of the chi-squared selected features of the training dataset.
The ensemble is trained and subsequently evaluated on the test dataset.

**Consolidation of Results:**

The performance metrics of the bagging ensemble for the deep neural network are consolidated with those of other models (SVM, Random Forest, Naive Bayes) that have presumably been evaluated using bagging in prior code segments.
Metrics such as precision, recall, F1-score, accuracy, and support are considered for a comprehensive assessment of the model's performance.
The consolidated results are structured in a tabular format using a DataFrame for a clear and concise presentation.

**Outcome:**
Upon execution, this code segment will render the performance metrics of each model when bagging is employed. This will provide insights into the potential advantages of ensemble learning in enhancing the model's robustness and generalization capabilities.



In [9]:
# Model Definitions
def create_nn_model(input_dim, num_classes):
    model = Sequential()
    model.add(Dense(128, input_dim=input_dim, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(64, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(num_classes, activation='softmax'))
    model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

# Model Training and Evaluation
def train_bagging_classifier(base_estimator, X_train, X_test, y_train, y_test):
    bagging_classifier = BaggingClassifier(base_estimator=base_estimator, n_estimators=10, random_state=42)
    bagging_classifier.fit(X_train, y_train)
    y_pred = bagging_classifier.predict(X_test)
    report = classification_report(y_test, y_pred, output_dict=True)
    return {
        'Precision': report['weighted avg']['precision'],
        'Recall': report['weighted avg']['recall'],
        'F1-Score': report['weighted avg']['f1-score'],
        'Accuracy': report['accuracy'],
        'Support': report['weighted avg']['support']
    }

# Main Execution
results_bagging = {
    'Bagging SVM': train_bagging_classifier(SVC(kernel='linear'), X_train_chi2_selected, X_test_chi2_selected, y_train, y_test),
    'Bagging Random Forest': train_bagging_classifier(RandomForestClassifier(random_state=42), X_train_chi2_selected, X_test_chi2_selected, y_train, y_test),
    'Bagging Naive Bayes': train_bagging_classifier(MultinomialNB(), X_train_chi2_selected, X_test_chi2_selected, y_train, y_test),
    'Bagging Deep Neural Network': train_bagging_classifier(KerasClassifier(build_fn=lambda: create_nn_model(X_train_chi2_selected.shape[1], len(set(y_train))), epochs=10, batch_size=32, verbose=0), X_train_chi2_selected, X_test_chi2_selected, y_train, y_test)
}

# Convert results to DataFrame for display
df_results_bagging = pd.DataFrame(results_bagging).transpose()
print(df_results_bagging)


  'Bagging Deep Neural Network': train_bagging_classifier(KerasClassifier(build_fn=lambda: create_nn_model(X_train_chi2_selected.shape[1], len(set(y_train))), epochs=10, batch_size=32, verbose=0), X_train_chi2_selected, X_test_chi2_selected, y_train, y_test)


                             Precision    Recall  F1-Score  Accuracy  Support
Bagging SVM                   0.968281  0.966667  0.966811  0.966667    300.0
Bagging Random Forest         0.947732  0.940000  0.941160  0.940000    300.0
Bagging Naive Bayes           0.955492  0.946667  0.948155  0.946667    300.0
Bagging Deep Neural Network   0.971912  0.970000  0.970361  0.970000    300.0


**Naive Bayes, SVM, Random Forest, Deep Neural Network (Regularization Techniques - Dropout)**

**Description:**

In this segment, the primary focus is on the exploration of regularization techniques to improve the generalization capabilities of various machine learning models for document classification. Regularization helps in preventing overfitting, ensuring that the model performs well on unseen data.

**Deep Neural Network with Dropout Regularization:**

A feed-forward neural network is constructed using the TensorFlow and Keras libraries. Dropout layers are introduced after the dense layers to serve as a regularization mechanism. By randomly setting a fraction of input units to 0 at each update during training, dropout helps prevent overfitting.
The network is trained on the chi-squared selected features of the training dataset and subsequently evaluated on the test dataset.

**Support Vector Machines (SVM) with L1 and L2 Regularization:**

Two SVM models are trained: one with L1 regularization and the other with L2 regularization. L1 regularization can lead to feature selection as it tends to produce a sparse weight vector, while L2 regularization can prevent overfitting without necessarily zeroing out weights.

**Naive Bayes with Regularization:**

A Naive Bayes classifier is trained with Laplace smoothing (controlled by the alpha parameter). This regularization technique helps in handling the absence of features in the training data that might appear in the test data.

**Random Forest with Regularization:**

A Random Forest classifier is trained with hyperparameters that act as regularization. The max_depth parameter ensures that the trees do not grow too deep, and the max_features parameter controls the number of features to consider when looking for the best split.

The performance metrics of all models, including precision, recall, F1-score, accuracy, and support, are consolidated into a structured format.
The results are presented in a tabular format using a DataFrame for clarity.

**Outcome:**
Upon execution, this code segment will display the performance metrics of each model with their respective regularization techniques. This will provide insights into the efficacy of regularization in enhancing the model's robustness and performance on unseen data.



In [11]:
# Model Definitions
def create_nn_model(input_dim, num_classes):
    model = Sequential()
    model.add(Dense(128, input_dim=input_dim, activation='relu'))
    model.add(Dropout(0.5))  # Dropout layer for regularization
    model.add(Dense(64, activation='relu'))
    model.add(Dropout(0.5))  # Dropout layer for regularization
    model.add(Dense(num_classes, activation='softmax'))
    model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

# Model Training and Evaluation
def train_and_evaluate_classifier(classifier, X_train, X_test, y_train, y_test):
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    report = classification_report(y_test, y_pred, output_dict=True)
    return {
        'Precision': report['weighted avg']['precision'],
        'Recall': report['weighted avg']['recall'],
        'F1-Score': report['weighted avg']['f1-score'],
        'Accuracy': report['accuracy'],
        'Support': report['weighted avg']['support']
    }

# Main Execution
results_regularized = {
    'Dropout SVM L1': train_and_evaluate_classifier(LinearSVC(penalty='l1', dual=False, C=1.0), X_train_chi2_selected, X_test_chi2_selected, y_train, y_test),
    'Dropout SVM L2': train_and_evaluate_classifier(SVC(kernel='linear', C=1.0), X_train_chi2_selected, X_test_chi2_selected, y_train, y_test),
    'Dropout Naive Bayes Regularized': train_and_evaluate_classifier(MultinomialNB(alpha=0.5), X_train_chi2_selected, X_test_chi2_selected, y_train, y_test),
    'Dropout Random Forest Regularized': train_and_evaluate_classifier(RandomForestClassifier(n_estimators=100, max_depth=10, max_features='sqrt', random_state=42), X_train_chi2_selected, X_test_chi2_selected, y_train, y_test),
    'Dropout Deep Neural Network': train_and_evaluate_classifier(KerasClassifier(build_fn=lambda: create_nn_model(X_train_chi2_selected.shape[1], len(set(y_train))), epochs=10, batch_size=32, verbose=0), X_train_chi2_selected, X_test_chi2_selected, y_train, y_test)
}

# Convert results to DataFrame for display
df_results_regularized = pd.DataFrame(results_regularized).transpose()
print(df_results_regularized)


  'Dropout Deep Neural Network': train_and_evaluate_classifier(KerasClassifier(build_fn=lambda: create_nn_model(X_train_chi2_selected.shape[1], len(set(y_train))), epochs=10, batch_size=32, verbose=0), X_train_chi2_selected, X_test_chi2_selected, y_train, y_test)


                                   Precision    Recall  F1-Score  Accuracy  \
Dropout SVM L1                      0.918721  0.916667  0.916141  0.916667   
Dropout SVM L2                      0.974356  0.973333  0.973402  0.973333   
Dropout Naive Bayes Regularized     0.964585  0.960000  0.960622  0.960000   
Dropout Random Forest Regularized   0.909965  0.903333  0.903895  0.903333   
Dropout Deep Neural Network         0.980425  0.980000  0.980047  0.980000   

                                   Support  
Dropout SVM L1                       300.0  
Dropout SVM L2                       300.0  
Dropout Naive Bayes Regularized      300.0  
Dropout Random Forest Regularized    300.0  
Dropout Deep Neural Network          300.0  


**Best Model without combinations**

**Description:**

In this code segment, the primary objective is to consolidate the results from various model configurations and preprocessing techniques, and then identify the best-performing model based on the F1-Score.

**Consolidation of Results:**

The results from different model configurations and preprocessing techniques, namely standard models, models with N-grams, models using Word2Vec embeddings, models with chi-squared selected features, models with bagging, and models with regularization, are consolidated into a single DataFrame.
This consolidation provides a unified view, making it easier to compare and analyze the performance of different configurations side by side.

**Sorting based on F1-Score:**

The consolidated results are sorted in descending order based on the F1-Score. The F1-Score is a harmonic mean of precision and recall, providing a balanced measure of a model's performance, especially in cases where class distributions might be imbalanced.

**Display of Results:**

The sorted results are displayed in tabular format, providing a clear view of how each model configuration performed relative to the others.

**Identification of the Best Model:**

The model with the highest F1-Score is identified as the best model.
Details of the best model, including its name and F1-Score, are displayed.

**Outcome:**
Upon execution, this code segment will present a comprehensive view of the performance metrics of all model configurations. It will also highlight the best-performing model based on the F1-Score, providing a clear recommendation for the most effective model configuration for the given task.



In [12]:
# Consolidate all results
all_results = pd.concat([
    df_results,
    df_results_ngrams,
    df_results_word2vec,
    df_results_chi2,
    df_results_bagging,
    df_results_regularized,
])

# Sort based on F1-Score
sorted_results = all_results.sort_values(by='F1-Score', ascending=False)

# Display the consolidated results
print(sorted_results)

# Display the best model
best_model = sorted_results.iloc[0]
print("\nBest Model based on F1-Score:")
print(best_model.name)
print("F1-Score:", best_model['F1-Score'])


                                            Precision    Recall  F1-Score  \
SVM                                          0.981571  0.980000  0.980188   
Dropout Deep Neural Network                  0.980425  0.980000  0.980047   
N-grams Deep Neural Network                  0.980752  0.980000  0.980027   
N-grams SVM                                  0.980477  0.980000  0.979978   
Bagging Deep Neural Network                  0.977733  0.976667  0.976864   
Deep Neural Network                          0.977766  0.976667  0.976829   
Chi-Squared Test Deep Neural Network         0.976628  0.973333  0.973876   
Dropout SVM L2                               0.974356  0.973333  0.973402   
Chi-Squared Test SVM                         0.974356  0.973333  0.973402   
Bagging SVM                                  0.968281  0.966667  0.966811   
Dropout Naive Bayes Regularized              0.964585  0.960000  0.960622   
Chi-Squared Test Naive Bayes                 0.964585  0.960000  0.960622   

**Best Model with combinations**

**Description:**

This code segment is designed to evaluate the performance of various machine learning models, combined with different preprocessing techniques, feature selection methods, and regularization techniques, on a given dataset. The primary objective is to identify the best-performing model configuration based on the F1-Score.

**Neural Network Definition:**

A simple feed-forward neural network is defined using Keras. This network will be used as one of the classifiers in the subsequent steps.

**Model, Preprocessing, and Feature Selection Definitions:**

Four machine learning models are defined: Linear SVM, Naive Bayes, Random Forest, and a Neural Network.
Two preprocessing techniques are defined: TF-IDF and TF-IDF with N-grams.
Two feature selection methods are defined: Chi-squared test and no selection.
Two regularization techniques are defined for the Linear SVM: L1 and L2.

**Training and Evaluation Loop:**

A nested loop is used to iterate through each combination of model, preprocessing technique, feature selection method, and regularization technique.
For each combination, the data is preprocessed, features are selected (if applicable), the model is trained, and predictions are made on the test set.
Evaluation metrics (Precision, Recall, F1-Score, Accuracy, and Support) are calculated for each combination and stored in a list.

**Results Consolidation and Display:**

The results are consolidated into a DataFrame for easier visualization and analysis.
The results are sorted based on the F1-Score in descending order to identify the best-performing model configuration.
The consolidated results are displayed, followed by details of the best-performing model.

**Outcome:**

Upon execution, this code segment will provide a comprehensive view of the performance metrics of all model configurations. It will also highlight the best-performing model based on the F1-Score, offering insights into the most effective model configuration for the given task.



In [23]:
# Define a function to create the neural network model
def create_nn_model(input_dim):
    model = Sequential()
    model.add(Dense(128, input_dim=input_dim, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(64, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

# Define models
models = {
    'Linear SVM': LinearSVC(),
    'Naive Bayes': MultinomialNB(),
    'Random Forest': RandomForestClassifier(random_state=42),
    'Neural Network': None  # Placeholder, will be defined later
}

# Define preprocessing techniques
preprocessing_techniques = {
    'TF-IDF': TfidfVectorizer(stop_words='english'),
    'TF-IDF Ngrams': TfidfVectorizer(stop_words='english', ngram_range=(1,2))
}

# Define feature selection/enhancement techniques
feature_selection_methods = {
    'Chi-squared': SelectKBest(chi2, k=10000),
    'No Selection': None
}

# Define regularization techniques (only for Linear SVM)
regularizations = {
    'L1': {'penalty': 'l1', 'dual': False},
    'L2': {'penalty': 'l2'}
}

results_list = []

for model_name, model in models.items():
    for preprocess_name, preprocess in preprocessing_techniques.items():
        # Preprocess the data
        X_train_preprocessed = preprocess.fit_transform(X_train)
        X_test_preprocessed = preprocess.transform(X_test)

        for feature_name, feature_method in feature_selection_methods.items():
            # Feature selection
            if feature_method:
                # Dynamically set k for SelectKBest
                if feature_name == 'Chi-squared':
                    k_value = min(10000, X_train_preprocessed.shape[1])
                    feature_method.set_params(k=k_value)

                # Scale the data to [0, 1] range
                scaler = MinMaxScaler()
                X_train_preprocessed = scaler.fit_transform(X_train_preprocessed.toarray())
                X_test_preprocessed = scaler.transform(X_test_preprocessed.toarray())

                X_train_selected = feature_method.fit_transform(X_train_preprocessed, y_train)
                X_test_selected = feature_method.transform(X_test_preprocessed)
            else:
                X_train_selected = X_train_preprocessed
                X_test_selected = X_test_preprocessed

            # Regularization (only for Linear SVM)
            if model_name == 'Linear SVM':
                for reg_name, reg_params in regularizations.items():
                    model.set_params(**reg_params)

                    # Train the model
                    model.fit(X_train_selected, y_train)

                    # Evaluate the model
                    y_pred = model.predict(X_test_selected)

                    precision = precision_score(y_test, y_pred, average='weighted')
                    recall = recall_score(y_test, y_pred, average='weighted')
                    f1 = f1_score(y_test, y_pred, average='weighted')
                    accuracy = accuracy_score(y_test, y_pred)
                    support = len(y_test)

                    results_list.append({
                        'Model': model_name,
                        'Preprocessing': preprocess_name,
                        'Feature Selection': feature_name,
                        'Regularization': reg_name,
                        'Precision': precision,
                        'Recall': recall,
                        'F1-Score': f1,
                        'Accuracy': accuracy,
                        'Support': support
                    })

            elif model_name == 'Neural Network':
                input_dim = X_train_selected.shape[1]
                model = KerasClassifier(build_fn=create_nn_model, input_dim=input_dim, epochs=10, batch_size=32, verbose=0)
                model.fit(X_train_selected, y_train)
                y_pred = model.predict(X_test_selected)

                precision = precision_score(y_test, y_pred, average='weighted')
                recall = recall_score(y_test, y_pred, average='weighted')
                f1 = f1_score(y_test, y_pred, average='weighted')
                accuracy = accuracy_score(y_test, y_pred)
                support = len(y_test)

                results_list.append({
                    'Model': model_name,
                    'Preprocessing': preprocess_name,
                    'Feature Selection': feature_name,
                    'Regularization': 'N/A',
                    'Precision': precision,
                    'Recall': recall,
                    'F1-Score': f1,
                    'Accuracy': accuracy,
                    'Support': support
                })

            else:
                # Enhancement: Bagging
                bagging_model = BaggingClassifier(base_estimator=model, n_estimators=10, random_state=42)
                bagging_model.fit(X_train_selected, y_train)
                y_pred = bagging_model.predict(X_test_selected)

                precision = precision_score(y_test, y_pred, average='weighted')
                recall = recall_score(y_test, y_pred, average='weighted')
                f1 = f1_score(y_test, y_pred, average='weighted')
                accuracy = accuracy_score(y_test, y_pred)
                support = len(y_test)

                results_list.append({
                    'Model': model_name + ' with Bagging',
                    'Preprocessing': preprocess_name,
                    'Feature Selection': feature_name,
                    'Regularization': 'N/A',
                    'Precision': precision,
                    'Recall': recall,
                    'F1-Score': f1,
                    'Accuracy': accuracy,
                    'Support': support
                })

# Convert results to DataFrame for display
df_results = pd.DataFrame(results_list)

# Sort based on F1-Score
sorted_results = df_results.sort_values(by='F1-Score', ascending=False)

# Display the consolidated results
print(sorted_results)

# Display the best model
best_model = sorted_results.iloc[0]
print("\nBest Model based on F1-Score:")
print(best_model['Model'])
print("F1-Score:", best_model['F1-Score'])


  model = KerasClassifier(build_fn=create_nn_model, input_dim=input_dim, epochs=10, batch_size=32, verbose=0)




  _warn_prf(average, modifier, msg_start, len(result))
  model = KerasClassifier(build_fn=create_nn_model, input_dim=input_dim, epochs=10, batch_size=32, verbose=0)




  _warn_prf(average, modifier, msg_start, len(result))
  model = KerasClassifier(build_fn=create_nn_model, input_dim=input_dim, epochs=10, batch_size=32, verbose=0)




  _warn_prf(average, modifier, msg_start, len(result))
  model = KerasClassifier(build_fn=create_nn_model, input_dim=input_dim, epochs=10, batch_size=32, verbose=0)


                         Model  Preprocessing Feature Selection  \
3                   Linear SVM         TF-IDF      No Selection   
7                   Linear SVM  TF-IDF Ngrams      No Selection   
6                   Linear SVM  TF-IDF Ngrams      No Selection   
2                   Linear SVM         TF-IDF      No Selection   
11    Naive Bayes with Bagging  TF-IDF Ngrams      No Selection   
13  Random Forest with Bagging         TF-IDF      No Selection   
0                   Linear SVM         TF-IDF       Chi-squared   
1                   Linear SVM         TF-IDF       Chi-squared   
5                   Linear SVM  TF-IDF Ngrams       Chi-squared   
8     Naive Bayes with Bagging         TF-IDF       Chi-squared   
9     Naive Bayes with Bagging         TF-IDF      No Selection   
4                   Linear SVM  TF-IDF Ngrams       Chi-squared   
10    Naive Bayes with Bagging  TF-IDF Ngrams       Chi-squared   
14  Random Forest with Bagging  TF-IDF Ngrams       Chi-square

  _warn_prf(average, modifier, msg_start, len(result))
