Exploit.in Forum Scraper

Overview

The Exploit.in Forum Scraper is a Python-based tool designed to systematically extract and process data from the Exploit.in forum. This automated scraper collects data on forum categories, subforums, threads, and posts, which can be utilized for various analytical purposes such as threat intelligence analysis, trend detection, or content monitoring. Users must ensure they have the proper authorization and comply with all legal and ethical standards for data collection.

Purpose

The scraper is developed to:

Automate the extraction of structured information from the Exploit.in forum.
Store this information in a MongoDB database for subsequent analysis and retrieval.
Dynamically manage web sessions via Selenium to maintain valid and current session-specific cookies.

Requirements

A Linux distribution with a graphical desktop environment (tested on Ubuntu 22.04.4 LTS).
A valid Exploit.in account, because the scraper authenticates with session cookies.
Python 3.x.
MongoDB installed and running for data storage.
The Python libraries listed in the provided requirements.txt.
Optional: a keywords.txt file, used when the scraper runs as a selective crawler that saves data only when specific keywords appear in the thread title, the post content, or both.

Database Architecture

The MongoDB database is configured to store various types of data extracted from the forum:

Forums: Data about different forum categories.
Subforums: Data about subforums under each category.
Threads: Information about individual threads within each subforum.
Posts: Individual posts within each thread.
Users: User profiles extracted from posts and threads.

Each collection holds JSON-like documents (MongoDB stores them internally as BSON) with fields tailored to the specific data type (e.g., thread title, post content, user profile information).
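As a rough illustration of what such documents could look like, the sketch below builds one thread document and one post document as plain Python dictionaries. The field names (`subforum_id`, `thread_id`, `author`, `scraped_at`, and so on) are assumptions made for this example; the actual schema is defined in the scraper's code and may differ.

```python
from datetime import datetime, timezone


def build_thread_document(thread_id, subforum_id, title, author):
    """Return a dict shaped like a document for the Threads collection.

    All field names here are hypothetical, chosen only for illustration.
    """
    return {
        "_id": thread_id,            # unique key, so re-scraping can update in place
        "subforum_id": subforum_id,  # link back to the parent subforum
        "title": title,
        "author": author,
        "scraped_at": datetime.now(timezone.utc),
    }


def build_post_document(post_id, thread_id, author, content):
    """Return a dict shaped like a document for the Posts collection."""
    return {
        "_id": post_id,
        "thread_id": thread_id,      # link back to the parent thread
        "author": author,
        "content": content,
        "scraped_at": datetime.now(timezone.utc),
    }


thread = build_thread_document("t-1", "sf-1", "Example thread title", "example_user")
post = build_post_document("p-1", thread["_id"], "example_user", "Example post body")
```

With pymongo, dictionaries like these can be passed to `insert_one`, or to `update_one` with `upsert=True` when the same thread or post may be scraped more than once.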

Installation Instructions

Setting up Python and Libraries

Python Installation: Ensure Python 3.x is installed. Download from the official Python website.

Clone the Repository: First, clone this repository to your local machine using git:

```
git clone https://github.com/Ilansos/Exploit_forum_scraper.git
cd Exploit_forum_scraper
```

Library Installation: Install the necessary Python libraries with pip:
```
pip install -r requirements.txt
```
Install Translation Languages: Run the translator_install.py script to install the necessary languages for the translation library:
```
python translator_install.py
```

Setting up MongoDB

MongoDB Installation: Follow the installation guide on the official MongoDB website for your specific operating system.

Running MongoDB: Ensure MongoDB is running. Typically, it runs at mongodb://localhost:27017/ by default.

Downloading and Setting Up Firefox

To use Firefox with Selenium, you'll need to download and set up Firefox in your project's root directory:

Download Firefox: Navigate to the Firefox download page: https://www.mozilla.org/en-US/firefox/all/#product-desktop-release. Choose the appropriate Linux version (e.g., 64-bit). Download the tar.bz2 file.
Extract and Set Up: Open a terminal and run the following command to extract Firefox into your project's root directory, replacing xx with the version you downloaded:

```
tar -xjf firefox-xx.0.tar.bz2 -C /path/to/your/project/
```

This creates a firefox directory inside the path you specified.

Configure Selenium: Update firefox_path in config.json to point to the newly extracted Firefox executable:

```
"firefox_path": "/path/to/your/project/firefox/firefox"
```

Usage Instructions

Configuration: Update config.json with your MongoDB URI. If you updated the Selenium driver or the browser, also set the new user agent string.
Selenium Driver and Browser: The script uses Selenium; ensure that geckodriver and Firefox are installed and that their paths are specified in config.json.
Account Credentials: Selenium automatically retrieves the session cookies and saves them in config.json.
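Before starting a long run, it can help to check that config.json parses and contains the expected settings. The following is a minimal sketch of such a check, not the script's own loader: `load_config` and `REQUIRED_KEYS` are hypothetical names, and only the keys this README mentions by name (`firefox_path`, `crawl_threads`, `crawl_content`) are checked, since the exact key names for the MongoDB URI and user agent are not given here.

```python
import json
import os
import tempfile

# Keys named in this README; the real config.json may contain more.
REQUIRED_KEYS = ("firefox_path", "crawl_threads", "crawl_content")


def load_config(path):
    """Load a JSON config file and check the keys this sketch expects."""
    with open(path, encoding="utf-8") as handle:
        config = json.load(handle)
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"config is missing keys: {', '.join(missing)}")
    for flag in ("crawl_threads", "crawl_content"):
        if not isinstance(config[flag], bool):
            raise TypeError(f"{flag} must be true or false")
    return config


# Demonstration against a temporary file rather than the real config.json.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "config.json")
    with open(demo_path, "w", encoding="utf-8") as handle:
        json.dump(
            {
                "firefox_path": "/path/to/your/project/firefox/firefox",
                "crawl_threads": True,
                "crawl_content": False,
            },
            handle,
        )
    demo_config = load_config(demo_path)
```

Failing early on a missing key or a non-boolean flag is cheaper than discovering the problem after Selenium has already opened a session.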

Crawler Functionality

The script can operate as a crawler that selectively stores data if specific keywords are found. These keywords must be listed in the keywords.txt file. The crawler can be configured to search for keywords in:

Thread Titles: Only threads with titles containing any of the specified keywords are processed and saved to the database.
Post Contents: Only posts containing any of the specified keywords are processed and saved.
Both: The script checks both thread titles and post contents for keywords.

By default, keywords.txt contains cybersecurity-related keywords in English, Russian, Chinese, and Farsi.

To activate keyword-based crawling, set the crawl_threads and crawl_content flags in config.json to true. Set a flag to false if you do not need that kind of crawling. This setup allows for flexible data collection tailored to specific monitoring or analysis needs.

```
{
    "crawl_threads": true,
    "crawl_content": true
}
```

JSON does not allow comments, so keep the values in config.json free of inline annotations.
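The keyword filtering described in this section can be sketched as follows. This is an illustration under stated assumptions, not the script's actual logic: it assumes keywords.txt holds one keyword per line, that matching is a case-insensitive substring test, that a hit in either enabled field is enough to save an item, and that everything is saved when both flags are false. The function names are hypothetical.

```python
def load_keywords(lines):
    """Normalize keyword lines: strip whitespace, lowercase, skip blanks."""
    return [line.strip().lower() for line in lines if line.strip()]


def contains_keyword(text, keywords):
    """Return True if any keyword occurs in text (case-insensitive)."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)


def should_save(title, content, keywords, crawl_threads, crawl_content):
    """Decide whether an item passes the keyword filter.

    Assumption: with both flags off, no filtering is applied at all.
    """
    if not (crawl_threads or crawl_content):
        return True
    title_hit = crawl_threads and contains_keyword(title, keywords)
    content_hit = crawl_content and contains_keyword(content, keywords)
    return title_hit or content_hit


# In practice the lines would come from open("keywords.txt", encoding="utf-8").
keywords = load_keywords(["ransomware\n", "\n", "Эксплойт\n"])
```

Lowercasing both sides keeps the check usable across the Latin and Cyrillic keywords in the default list; languages without letter case, such as Chinese, are unaffected by it.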

Running the Script

Execute the scraper by running the following command in the terminal:

```
python exploit.py
```

This initializes the scraping process where Selenium manages web sessions in the background, periodically updating session cookies.
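Selenium's `driver.get_cookies()` returns a list of dictionaries, each with at least a `name` and a `value` entry. How the script hands these cookies to its HTTP requests is not described in this README; the sketch below shows one common approach, flattening them into a name-to-value mapping and a `Cookie` header string. The helper names and the sample cookie names are made up for illustration.

```python
def cookies_to_dict(selenium_cookies):
    """Map cookie names to values from Selenium's list-of-dicts format."""
    return {cookie["name"]: cookie["value"] for cookie in selenium_cookies}


def cookies_to_header(selenium_cookies):
    """Build a Cookie header value such as 'a=1; b=2'."""
    return "; ".join(
        f"{cookie['name']}={cookie['value']}" for cookie in selenium_cookies
    )


# Sample data in the shape returned by driver.get_cookies();
# the cookie names and values here are invented.
sample_cookies = [
    {"name": "session_id", "value": "abc123", "domain": "example.invalid"},
    {"name": "member_id", "value": "42", "domain": "example.invalid"},
]

cookie_dict = cookies_to_dict(sample_cookies)
cookie_header = cookies_to_header(sample_cookies)
```

Refreshing this mapping each time Selenium retrieves new cookies keeps later requests authenticated without restarting the scraper.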

Important Notes

Session Management: Selenium runs in the background for session management, periodically retrieving cookies to keep the session valid.
Legal Compliance: Obtain permission for data scraping and comply with all laws and forum terms regarding data scraping and privacy.
Ethical Use: Handle the data ethically, especially since forums can contain sensitive or personally identifiable information.

License

This project is licensed under the MIT License - see the LICENSE file for details.
