A Hybrid TF-IDF and RNN Model for Multi-label Classification of the Deep and Dark Web

This repository contains the code and data for the paper "A Hybrid TF-IDF and RNN Model for Multi-label Classification of the Deep and Dark Web" by Ashwini Dalvi et al., published in the International Journal of Advanced Computer Science and Applications (IJACSA), Volume 14, Issue 7, 2023.

Abstract

The classification of content on the deep and dark web has been a topic of interest for researchers. Researchers focus on adopting more efficient and effective classification methods as the data available on deep and dark web platforms continues to grow. Multi-label classification is the approach for simultaneously categorizing content into multiple classes. To address this, a hybrid approach combining Term Frequency-Inverse Document Frequency (TF-IDF) and Recurrent Neural Network (RNN) has been proposed. The approach involves preprocessing a dataset of Hypertext Markup Language (HTML) documents, selecting specific HTML tags to generate embeddings using TF-IDF, and using an RNN model for multi-label classification. The proposed model was evaluated against commonly used methods (Binary Relevance, Classifier Chains, and Label Powerset) using precision, recall, and F1-score as evaluation metrics, demonstrating promising results in accurately classifying data from the deep and dark web. This contribution represents a noteworthy advancement for researchers and analysts working in this field.

Key Points

The paper proposes a novel hybrid approach for multi-label classification of deep and dark web content, which combines TF-IDF and RNN techniques.
The paper collects and preprocesses a large dataset of HTML documents from the deep and dark web using a custom crawler and various data cleaning techniques.
The paper uses FastText to generate embeddings of the selected tokens from the HTML documents and assigns them to predefined categories based on their similarity.
The paper trains and evaluates an RNN model using the assigned categories as labels and compares its performance with three existing methods: Binary Relevance, Classifier Chains, and Label Powerset.
The paper demonstrates that the proposed approach outperforms the other methods in terms of precision, recall, and F1-score for most of the categories and provides insights into the factors affecting the multi-label classification performance.

Code and Data

The code and data for this paper can be found in this repository. The code is written in Python and uses libraries such as TensorFlow, Keras, Scikit-learn, and NLTK. The data consists of 50,000 HTML documents from the deep and dark web, which are preprocessed and labeled using the proposed approach. The data is split into training and testing sets using a 75:25 ratio. The code includes scripts for data preprocessing, feature extraction, label assignment, model training, and evaluation. The code also includes a Jupyter notebook that demonstrates the usage and results of the proposed approach. The code and data are licensed under the Creative Commons Attribution 4.0 International License.

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
A_Hybrid_TF_IDF_and_RNN_Model_for_Multi_label_Classification.pdf		A_Hybrid_TF_IDF_and_RNN_Model_for_Multi_label_Classification.pdf
LICENSE		LICENSE
Multilabel_Classification_Generic_Hidden_Data_Wiki.ipynb		Multilabel_Classification_Generic_Hidden_Data_Wiki.ipynb
README.md		README.md
multilabel-classification-2nd-dataset.ipynb		multilabel-classification-2nd-dataset.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

A Hybrid TF-IDF and RNN Model for Multi-label Classification of the Deep and Dark Web

Abstract

Key Points

Code and Data

About

Releases

Packages

Languages

License

AtharvaKitkaru/multilabel-classifier-for-darkweb

Folders and files

Latest commit

History

Repository files navigation

A Hybrid TF-IDF and RNN Model for Multi-label Classification of the Deep and Dark Web

Abstract

Key Points

Code and Data

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages