This script allows you to crawl a website and collect links from its webpages based on a specified regex pattern. It can be useful for extracting links from websites for various purposes such as data scraping or analysis.
Before running the script, make sure you have the following installed:

- Python 3.x
- `argparse` library
- `requests` library
- `re` module
- `os` module
- `sys` module
- `base64` module
- `urllib.parse` module
- `bs4` (BeautifulSoup) library
- `shutil` module

All of these except `requests` and `bs4` ship with the Python standard library, so you only need to install the two third-party dependencies using pip:

```
pip install requests bs4
```
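If you want to confirm the environment before crawling, a quick check like the sketch below (not part of the script) verifies that the two third-party imports resolve; the remaining modules need no installation.

```python
# Optional environment check: the standard-library modules always import,
# so only the two third-party imports can fail here.
import argparse, base64, os, re, shutil, sys, urllib.parse  # standard library

import requests                 # third-party: pip install requests
from bs4 import BeautifulSoup   # third-party: pip install bs4

print("All dependencies are available.")
```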
To use the script, follow these steps:

- Clone or download the script file to your local machine.
- Open a terminal or command prompt.
- Navigate to the directory where the script is located.
- Run the following command:

```
python link_crawler.py -u <url> -p <pattern> [-d] [-c]
```
Replace `<url>` with the URL of the website you want to crawl, and `<pattern>` with the regex pattern to match the links.

Optional flags:

- `-d` or `--domain`: Include the website domain in internal links. By default, the domain name is stripped from internal links before the pattern is matched.
- `-c` or `--clear-directory`: Clear the output directory if it already exists for this command. By default, if the command is run again with the same pattern and domain, the search is not performed.
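For reference, here is a minimal sketch of how these options could be defined with `argparse` and how `-c` might use `shutil` (both modules appear in the prerequisites). The long name `--url`, the help strings, and the directory layout are assumptions, not the script's actual code.

```python
# A hedged sketch of the CLI described above, not the script's real parser.
import argparse
import os
import shutil

parser = argparse.ArgumentParser(
    description="Crawl a website and collect links matching a regex pattern."
)
parser.add_argument("-u", "--url", required=True, help="URL of the website to crawl")
parser.add_argument("-p", "--pattern", required=True, help="regex pattern to match links")
parser.add_argument("-d", "--domain", action="store_true",
                    help="include the website domain in internal links")
parser.add_argument("-c", "--clear-directory", action="store_true",
                    help="clear the output directory if it already exists")
args = parser.parse_args()

# Hypothetical output location; the real script derives it from the URL and pattern.
out_dir = os.path.join("data", "example-host", "example-pattern")
if args.clear_directory and os.path.isdir(out_dir):
    shutil.rmtree(out_dir)  # start fresh for this URL/pattern combination
```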
- The script will crawl the website, collect links from its webpages, and display the results.
- If links matching the regex pattern are found, the script saves them to a `links.txt` file in the corresponding directory.
- If no links are found, the script displays a message saying so.
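The directory naming is internal to the script, but since `base64` and `urllib.parse` appear in the prerequisites, a plausible sketch of the save step looks like this; the base64 encoding of the pattern and the helper name are assumptions.

```python
# A hedged sketch of writing matches to data/<host>/<pattern>/links.txt.
import base64
import os
from urllib.parse import urlparse

def save_links(url: str, pattern: str, links: list[str]) -> None:
    host = urlparse(url).netloc
    # Assumption: encode the pattern so characters like '/' and '*' are
    # safe to use in a directory name.
    pattern_dir = base64.urlsafe_b64encode(pattern.encode()).decode()
    out_dir = os.path.join("data", host, pattern_dir)
    os.makedirs(out_dir, exist_ok=True)
    with open(os.path.join(out_dir, "links.txt"), "w", encoding="utf-8") as f:
        f.write("\n".join(links))
```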
Note: The script crawls webpages within the specified website by following links found in HTML tags such as `<a>`, `<link>`, `<script>`, `<base>`, and `<form>`, as well as any other tag that contains links. It searches for the `href`, `src`, and `data-src` attributes of these tags to extract the links.

Note: The script also finds links that appear anywhere in the page source, even outside tag attributes.
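As a rough illustration of both notes, the sketch below pulls `href`, `src`, and `data-src` from every tag with BeautifulSoup, then scans the raw HTML with a regex to also catch bare URLs outside tag attributes. Function and variable names are illustrative, not the script's own.

```python
# A hedged sketch of the link extraction the notes above describe.
import re
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def extract_links(page_url: str) -> set[str]:
    html = requests.get(page_url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    links = set()
    for tag in soup.find_all(True):  # every tag, not just <a>
        for attr in ("href", "src", "data-src"):
            value = tag.get(attr)
            if value:
                # Resolve relative links against the page URL.
                links.add(urljoin(page_url, value))
    # Also match bare URLs anywhere in the page source.
    links.update(re.findall(r"https?://[^\s\"'<>]+", html))
    return links
```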
Here are a few examples of how you can use the script:
- Crawl a website and collect all links from its webpages:

  ```
  python link_crawler.py -u https://example.com -p ".*"
  ```

  This will crawl the `example.com` website, collect all links from its webpages, and save them to `links.txt` in the `data/<host>/<pattern>/` directory.

- Crawl a website and collect only specific links matching a pattern:

  ```
  python link_crawler.py -u https://example.com -p "https://example.com/downloads/.*"
  ```

  This will crawl the `example.com` website and collect only the links that match the pattern `https://example.com/downloads/.*`.

- Crawl a website while keeping domains in internal links:

  ```
  python link_crawler.py -u https://example.com -p ".*" -d
  ```

  This will crawl the `example.com` website, collect all links from its webpages with the domain included in internal links, and save them to `links.txt`.

- Clear the directory and crawl the website to collect fresh links:

  ```
  python link_crawler.py -u https://example.com -p ".*" -c
  ```

  This will clear the existing directory (if any) for the specified command and crawl the `example.com` website to collect fresh links.
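Putting the pieces together, the overall flow the examples imply can be sketched as a breadth-first crawl restricted to the starting host. In this self-contained version, link extraction is reduced to a bare-URL regex (the fuller tag-attribute version is sketched earlier), and all names are illustrative rather than the script's own.

```python
# A hedged sketch of the implied crawl loop: visit same-host pages
# breadth-first and collect links that match the pattern.
import re
import requests
from collections import deque
from urllib.parse import urlparse

def crawl(start_url: str, pattern: str) -> set[str]:
    host = urlparse(start_url).netloc
    regex = re.compile(pattern)
    seen, queue, matches = {start_url}, deque([start_url]), set()
    while queue:
        page = queue.popleft()
        try:
            html = requests.get(page, timeout=10).text
        except requests.RequestException:
            continue  # skip pages that fail to load
        for link in re.findall(r"https?://[^\s\"'<>]+", html):
            if regex.search(link):
                matches.add(link)
            # Follow only links that stay on the original website.
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return matches

print(sorted(crawl("https://example.com", r"https://example\.com/downloads/.*")))
```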
This script is licensed under the MIT License. Feel free to modify and use it according to your needs.