This repository contains the code and data for the paper "A Hybrid TF-IDF and RNN Model for Multi-label Classification of the Deep and Dark Web" by Ashwini Dalvi et al., published in the International Journal of Advanced Computer Science and Applications (IJACSA), Volume 14, Issue 7, 2023.
The classification of content on the deep and dark web has been a topic of interest for researchers. Researchers focus on adopting more efficient and effective classification methods as the data available on deep and dark web platforms continues to grow. Multi-label classification is the approach for simultaneously categorizing content into multiple classes. To address this, a hybrid approach combining Term Frequency-Inverse Document Frequency (TF-IDF) and Recurrent Neural Network (RNN) has been proposed. The approach involves preprocessing a dataset of Hypertext Markup Language (HTML) documents, selecting specific HTML tags to generate embeddings using TF-IDF, and using an RNN model for multi-label classification. The proposed model was evaluated against commonly used methods (Binary Relevance, Classifier Chains, and Label Powerset) using precision, recall, and F1-score as evaluation metrics, demonstrating promising results in accurately classifying data from the deep and dark web. This contribution represents a noteworthy advancement for researchers and analysts working in this field.
- The paper proposes a novel hybrid approach for multi-label classification of deep and dark web content, which combines TF-IDF and RNN techniques.
- The paper collects and preprocesses a large dataset of HTML documents from the deep and dark web using a custom crawler and various data cleaning techniques.
- The paper uses FastText to generate embeddings of the selected tokens from the HTML documents and assigns them to predefined categories based on their similarity.
- The paper trains and evaluates an RNN model using the assigned categories as labels and compares its performance with three existing methods: Binary Relevance, Classifier Chains, and Label Powerset.
- The paper demonstrates that the proposed approach outperforms the other methods in terms of precision, recall, and F1-score for most of the categories and provides insights into the factors affecting the multi-label classification performance.
The code and data for this paper can be found in this repository. The code is written in Python and uses libraries such as TensorFlow, Keras, Scikit-learn, and NLTK. The data consists of 50,000 HTML documents from the deep and dark web, which are preprocessed and labeled using the proposed approach. The data is split into training and testing sets using a 75:25 ratio. The code includes scripts for data preprocessing, feature extraction, label assignment, model training, and evaluation. The code also includes a Jupyter notebook that demonstrates the usage and results of the proposed approach. The code and data are licensed under the Creative Commons Attribution 4.0 International License.