The Exploit.in Forum Scraper is a Python-based tool designed to systematically extract and process data from the Exploit.in forum. This automated scraper collects data on forum categories, subforums, threads, and posts, which can be utilized for various analytical purposes such as threat intelligence analysis, trend detection, or content monitoring. Users must ensure they have the proper authorization and comply with all legal and ethical standards for data collection.
The scraper is developed to:
- Automate the extraction of structured information from the Exploit.in forum.
- Store this information in a MongoDB database for subsequent analysis and retrieval.
- Dynamically manage web sessions via Selenium to maintain valid and current session-specific cookies.
- Function selectively as a crawler that saves data based on the presence of specific keywords in the thread title, the post content, or both, as specified in the keywords.txt file.
The scraper requires the following:
- A GUI version of Linux (tested on Ubuntu 22.04.4 LTS).
- A valid Exploit.in account, because the scraper uses session cookies for authentication.
- A Python 3.x environment.
- MongoDB installed and running for data storage.
- The Python libraries listed in the provided requirements.txt.
The MongoDB database is configured to store various types of data extracted from the forum:
- Forums: Data about different forum categories.
- Subforums: Data about subforums under each category.
- Threads: Information about individual threads within each subforum.
- Posts: Individual posts within each thread.
- Users: User profiles extracted from posts and threads.
Documents in each collection are stored as JSON-like documents (MongoDB stores them internally as BSON), with fields tailored to the specific data type (e.g., thread title, post content, user profile information).
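To make the idea of per-collection fields more concrete, the following sketch builds a thread document and a post document as plain Python dictionaries. All field names here (thread_id, subforum_id, scraped_at, and so on) are hypothetical and are not taken from the scraper's actual schema, which this README does not spell out.

```python
import json
from datetime import datetime, timezone


def build_thread_document(thread_id, subforum_id, title, author):
    """Return a dict shaped like a hypothetical Threads document."""
    return {
        "thread_id": thread_id,
        "subforum_id": subforum_id,
        "title": title,
        "author": author,
        # Timestamp of collection, stored as an ISO 8601 string (assumption).
        "scraped_at": datetime.now(timezone.utc).isoformat(),
    }


def build_post_document(post_id, thread_id, author, content):
    """Return a dict shaped like a hypothetical Posts document."""
    return {
        "post_id": post_id,
        "thread_id": thread_id,  # links the post back to its thread
        "author": author,
        "content": content,
        "scraped_at": datetime.now(timezone.utc).isoformat(),
    }


thread = build_thread_document(101, 7, "Example thread title", "example_user")
post = build_post_document(5001, 101, "example_user", "Example post content")
print(json.dumps(post, indent=2))
```

With pymongo, a dictionary like this could be passed to a collection's insert_one method; the sketch stays with plain dictionaries so it runs without a database.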
- Python Installation: Ensure Python 3.x is installed. Download from the official Python website.
- Clone the Repository: First, clone this repository to your local machine using git:
git clone https://github.com/Ilansos/Exploit_forum_scraper.git
cd <repository-directory>
- Library Installation: Install the necessary Python libraries with pip:
pip install -r requirements.txt
- Install Translation Languages: Run the translator_install.py script to install the necessary languages for the translation library:
python translator_install.py
MongoDB Installation: Follow the installation guide on the official MongoDB website for your specific operating system.
Running MongoDB: Ensure MongoDB is running. Typically, it runs at mongodb://localhost:27017/ by default.
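A quick way to confirm that something is listening at the MongoDB address before starting the scraper is a plain TCP connection test. The sketch below uses only the standard library; the function name mongodb_is_reachable is hypothetical, it handles only simple single-host URIs, and a successful connection shows that the port is open, not that the service behind it is actually MongoDB.

```python
import socket
from urllib.parse import urlsplit


def mongodb_is_reachable(uri="mongodb://localhost:27017/", timeout=2.0):
    """Return True if a TCP connection to the URI's host and port succeeds."""
    parts = urlsplit(uri)
    host = parts.hostname or "localhost"
    port = parts.port or 27017  # MongoDB's default port
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if mongodb_is_reachable():
    print("A service is listening on the default MongoDB port.")
else:
    print("Nothing reachable at mongodb://localhost:27017/.")
```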
To use Firefox with Selenium, you'll need to download and set up Firefox in your project's root directory:
- Download Firefox: Navigate to the Firefox download page: https://www.mozilla.org/en-US/firefox/all/#product-desktop-release. Choose the appropriate Linux version (e.g., 64-bit). Download the tar.bz2 file.
- Extract and Set Up: Open a terminal and use the following commands to extract Firefox into your project's root directory:
tar -xjf firefox-xx.0.tar.bz2 -C /path/to/your/project/
This will create a firefox directory within your specified path.
Configure Selenium:
- Update the firefox_path in config.json to point to the newly extracted Firefox executable:
"firefox_path": "/path/to/your/project/firefox/firefox"
- Configuration: Update config.json with your MongoDB URI and, if you updated the Selenium driver and browser, add the new user agent string.
- Selenium Driver and Browser: The script uses Selenium; ensure that geckodriver and Firefox are properly installed and that their paths are specified in config.json.
- Account Credentials: Selenium will automatically retrieve the session cookies and save them in the config.json file.
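Because a wrong firefox_path is an easy mistake to make, it can help to sanity-check the configuration before running the scraper. The following is a minimal sketch, not part of the project: the helper names load_config and check_firefox_path are hypothetical, and only the firefox_path key is checked because it is the only path key named in this README.

```python
import json
import os


def load_config(path="config.json"):
    """Parse the JSON configuration file into a dictionary."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def check_firefox_path(config):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    firefox_path = config.get("firefox_path")
    if not firefox_path:
        problems.append("firefox_path is missing from the configuration")
    elif not os.path.isfile(firefox_path):
        problems.append(f"firefox_path does not point to a file: {firefox_path}")
    elif not os.access(firefox_path, os.X_OK):
        problems.append(f"firefox_path is not executable: {firefox_path}")
    return problems


# Example with an in-memory configuration instead of a real config.json.
example = {"firefox_path": "/path/to/your/project/firefox/firefox"}
for problem in check_firefox_path(example):
    print(problem)
```

To check the real file, pass the result of load_config("config.json") to check_firefox_path.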
The script can operate as a crawler that selectively stores data if specific keywords are found. These keywords must be listed in the keywords.txt file. The crawler can be configured to search for keywords in:
- Thread Titles: Only threads with titles containing any of the specified keywords are processed and saved to the database.
- Post Contents: Only posts containing any of the specified keywords are processed and saved.
- Both: The script checks both thread titles and post contents for keywords.
By default, keywords.txt contains cybersecurity-related keywords in English, Russian, Chinese, and Farsi.
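The three keyword modes above can be sketched as a small filter. This is an illustration under stated assumptions rather than the scraper's actual logic: it assumes keywords.txt holds one keyword per line, that matching is a case-insensitive substring test, that enabling both checks means a match in either the title or the content is enough, and that disabling both means nothing is filtered. The function names are hypothetical.

```python
def load_keywords(path="keywords.txt"):
    """Read keywords, assuming one per line; blank lines are skipped."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def contains_keyword(text, keywords):
    """Case-insensitive substring match against any keyword."""
    haystack = text.casefold()
    return any(keyword.casefold() in haystack for keyword in keywords)


def should_save(title, content, keywords, crawl_threads, crawl_content):
    """Decide whether an item is stored under the assumed rules."""
    if not crawl_threads and not crawl_content:
        return True  # assumption: no keyword filtering when both flags are off
    if crawl_threads and contains_keyword(title, keywords):
        return True
    if crawl_content and contains_keyword(content, keywords):
        return True
    return False


keywords = ["ransomware", "эксплойт"]
print(should_save("Weekly digest", "Notes on a ransomware sample", keywords,
                  crawl_threads=False, crawl_content=True))
```

str.casefold is used instead of str.lower because it handles case-insensitive comparison more consistently across scripts, which matters for a multilingual keyword list.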
To activate keyword-based crawling, set the crawl_threads and crawl_content flags in config.json to true. This setup allows for flexible data collection tailored to specific monitoring or analysis needs.
{
  "crawl_threads": true,
  "crawl_content": true
}
Set either flag to false if you don't need that type of crawling. JSON does not allow comments, so keep the file free of them.
Execute the scraper by running the following command in the terminal:
python exploit.py
This initializes the scraping process where Selenium manages web sessions in the background, periodically updating session cookies.
Session Management: Selenium runs in the background for session management, periodically retrieving cookies to keep the session valid.
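The general pattern of running a periodic task in the background while the main work continues can be sketched with the standard library alone. This is not the project's implementation: start_periodic and refresh_cookies are hypothetical names, the interval is an arbitrary example value, and the real script performs its refresh through Selenium in a way this README does not detail.

```python
import threading
import time


def start_periodic(task, interval_seconds):
    """Run task every interval_seconds in a daemon thread until stopped."""
    stop_event = threading.Event()

    def loop():
        # Event.wait returns False on timeout and True once stop is requested,
        # so the loop sleeps between runs and exits promptly on stop.
        while not stop_event.wait(interval_seconds):
            task()

    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return stop_event, thread


def refresh_cookies():
    """Hypothetical placeholder for the step that refreshes session cookies."""
    print("refreshing session cookies (placeholder)")


stop_event, worker = start_periodic(refresh_cookies, interval_seconds=0.05)
time.sleep(0.2)        # stands in for the main scraping work
stop_event.set()       # ask the background loop to finish
worker.join(timeout=1)
```

Using an Event for both the delay and the stop signal avoids a long blocking sleep, so shutdown does not have to wait for a full interval to elapse.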
Legal Compliance: Obtain permission for data scraping and comply with all laws and forum terms regarding data scraping and privacy.
Ethical Use: Handle the data ethically, especially since forums can contain sensitive or personally identifiable information.
This project is licensed under the MIT License - see the LICENSE file for details.