Cradose

This project is still a work in progress; expect bugs and missing features.

Cradose (CRAwl, DOcument, and SEarch) is a web application built using the Django framework that can crawl, index, download, and search the web.

Features

Crawl any URL you enter
Choose which files to download
- Source code
- Text files
- PDFs
- Microsoft Word documents, PowerPoint presentations, and Excel spreadsheets
- Images
- Videos
- Audio
- Archives
Compress downloaded files into a standard or password-protected ZIP file
Limit the number of links to crawl
Index crawled pages
Search the downloaded pages for any query
Display URLs sorted by relevance
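
As an illustration of the file-type selection above, here is a minimal sketch of how a URL could be matched to one of the download categories by its extension. The mapping and the names CATEGORIES, categorize, and should_download are hypothetical; the extensions Cradose actually recognizes are not documented here.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical mapping from file extension to the download categories
# listed above; the extensions Cradose actually recognizes may differ.
CATEGORIES = {
    "source code": {".html", ".css", ".js", ".py"},
    "text": {".txt", ".md"},
    "pdf": {".pdf"},
    "office": {".doc", ".docx", ".ppt", ".pptx", ".xls", ".xlsx"},
    "image": {".png", ".jpg", ".jpeg", ".gif", ".svg"},
    "video": {".mp4", ".webm", ".mov"},
    "audio": {".mp3", ".wav", ".ogg"},
    "archive": {".zip", ".tar", ".gz", ".7z"},
}


def categorize(url):
    """Return the category for a URL's file extension, or None if unknown."""
    # Only the path is inspected, so query strings do not affect the result.
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()
    for category, extensions in CATEGORIES.items():
        if suffix in extensions:
            return category
    return None


def should_download(url, selected):
    """True if the URL's category is among the user-selected categories."""
    return categorize(url) in selected
```

For example, should_download("https://example.com/a/song.mp3", {"audio", "video"}) is true, while a URL with no recognized extension is skipped.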

Prerequisites

To run this project, you need Python 3.10 or later. Install the other requirements with pip install -r requirements.txt.

Installation instructions for Django can be found in the official Django documentation.
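
If you want to confirm the prerequisites before starting the server, a small check like the following may help. It is only a sketch: python_ok and missing_modules are hypothetical helper names and are not part of Cradose, and django is checked only because it is the one dependency named in this README.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 10)  # minimum version stated in this README


def python_ok(version=None):
    """True if the given (or running) interpreter meets the minimum version."""
    version = sys.version_info if version is None else version
    return tuple(version[:2]) >= MIN_PYTHON


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    if not python_ok():
        print("Python 3.10 or later is required")
    # Other dependencies are listed in requirements.txt and are not checked here.
    for name in missing_modules(["django"]):
        print(f"Missing module: {name} (try: pip install -r requirements.txt)")
```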

Usage

Note: This program uses relative directories, which are defined in settings.py, so it should work out of the box.
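
For context, Django's default project template derives its base directory from the location of settings.py, which is what keeps paths independent of where the repository is cloned. The sketch below shows that pattern in a standalone function. project_paths and the OUTPUT_DIR entry are assumptions made for illustration; the actual contents of this project's settings.py may differ.

```python
from pathlib import Path


def project_paths(settings_file):
    """Derive project directories relative to a settings.py location."""
    # Same pattern as Django's default template:
    #   BASE_DIR = Path(__file__).resolve().parent.parent
    base_dir = Path(settings_file).resolve().parent.parent
    return {
        "BASE_DIR": base_dir,
        # Assumption: downloaded files are written to an "Output" folder
        # in the project root.
        "OUTPUT_DIR": base_dir / "Output",
    }
```

Inside a real settings.py you would pass __file__ as the argument, so every path follows the project wherever it is moved.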

To run the application, make sure you're in the root directory of this project, then run python manage.py runserver. This starts the server on localhost:8000.

Crawling

To crawl a webpage, go to localhost:8000/crawl, enter a URL, and fill out the settings below it. The crawl is finished when you are redirected to the search page.
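
To give a rough idea of what a crawl with a link limit involves, here is a minimal breadth-first sketch using only the standard library. It is not Cradose's actual crawler: LinkParser, crawl, and the max_links parameter are hypothetical names, and the fetch argument is any callable you supply that returns a page's HTML (or None on failure).

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkParser(HTMLParser):
    """Collect href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_links=50):
    """Breadth-first crawl from start_url, keeping at most max_links pages.

    fetch maps a URL to HTML text, or None if the page is unavailable.
    Returns a dict of URL -> HTML in the order the pages were fetched.
    """
    pages = {}
    seen = {start_url}
    queue = deque([start_url])
    while queue and len(pages) < max_links:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # failed fetches do not count toward the limit
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments.
            link = urldefrag(urljoin(url, href)).url
            if link.startswith(("http://", "https://")) and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Passing fetch as a parameter keeps the sketch testable without network access; for instance, a dict's get method can stand in for a small in-memory site.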

Searching

To use the search engine, go to localhost:8000/search and enter your query in the search bar. The search starts when you press Enter and might take some time. When it finishes, you are redirected to a page that lists all of the relevant URLs. Clicking any of these URLs takes you to the original page that was crawled.
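
This README does not describe how relevance is computed, so the following is only a stand-in to show the general idea: it scores each page by how often the query terms occur in it and returns the URLs with the highest score first. The names tokenize and rank are hypothetical, and Cradose may use a different scoring method.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank(pages, query):
    """Return URLs sorted by how often the query terms occur in each page.

    pages maps URL -> page text. Pages with no matching term are omitted.
    """
    terms = tokenize(query)
    scores = {}
    for url, text in pages.items():
        counts = Counter(tokenize(text))
        score = sum(counts[term] for term in terms)
        if score > 0:
            scores[url] = score
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda url: (-scores[url], url))
```

Plain term counts favor long pages, so a real search engine would usually normalize by document length or weight rare terms more heavily.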
