
Powerful web crawler and search engine built using the Django framework.


BugByte14/cradose


Cradose

This project is still a work in progress; expect bugs and missing features.

Cradose (CRAwl, DOcument, and SEarch) is a web application built using the Django framework that can crawl, index, download, and search the web.

Features

  • Crawl any URL you enter
  • Choose which file types to download
    • Source code
    • Text files
    • PDFs
    • Microsoft Word documents, PowerPoint presentations, and Excel spreadsheets
    • Images
    • Videos
    • Audio
    • Archives
  • Compress downloaded files into a standard or password-protected zip file
  • Limit the number of links to crawl
  • Index crawled pages
  • Search the downloaded pages for any query
  • Display URLs sorted by relevance
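
As an illustration of the file type selection above, here is a minimal sketch of how a URL could be matched against the chosen categories by its extension. The CATEGORIES mapping and the function names categorize and should_download are hypothetical and are not taken from the Cradose source.

```python
from pathlib import PurePosixPath
from typing import Optional
from urllib.parse import urlparse

# Hypothetical mapping; the extensions Cradose actually recognises may differ.
CATEGORIES = {
    "source": {".html", ".css", ".js", ".py"},
    "text": {".txt", ".md"},
    "pdf": {".pdf"},
    "office": {".doc", ".docx", ".ppt", ".pptx", ".xls", ".xlsx"},
    "image": {".png", ".jpg", ".jpeg", ".gif", ".svg"},
    "video": {".mp4", ".webm", ".mkv"},
    "audio": {".mp3", ".wav", ".ogg"},
    "archive": {".zip", ".tar", ".gz", ".7z"},
}


def categorize(url: str) -> Optional[str]:
    """Return the category for a URL based on its file extension, or None."""
    # Only the path is inspected, so query strings and fragments are ignored.
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()
    for category, extensions in CATEGORIES.items():
        if suffix in extensions:
            return category
    return None


def should_download(url: str, selected: set) -> bool:
    """True when the URL's category is one the user ticked."""
    return categorize(url) in selected
```

Matching on the extension is cheap but imprecise; a real crawler might also check the Content-Type response header.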

Prerequisites

Cradose requires Python 3.10 or later. The remaining requirements can be installed using pip install -r requirements.txt.

Installation instructions for Django can be found in the official Django documentation.

Usage

Note: This program uses relative directories, which are defined in settings.py, so it should work out of the box.
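
For readers unfamiliar with the pattern, the sketch below shows one common way a Django project derives its directories from the location of settings.py, so that nothing depends on where the repository is cloned. The build_paths helper and the downloads and index directory names are assumptions for illustration; the names Cradose actually uses may differ.

```python
from pathlib import Path


def build_paths(settings_file: str) -> dict:
    """Derive project directories relative to the settings module's location."""
    # Django's default project template uses the same idea:
    # BASE_DIR = Path(__file__).resolve().parent.parent
    base_dir = Path(settings_file).resolve().parent.parent
    return {
        "BASE_DIR": base_dir,
        # Hypothetical directory names, not taken from the Cradose source.
        "DOWNLOAD_DIR": base_dir / "downloads",
        "INDEX_DIR": base_dir / "index",
    }
```

Inside a real settings.py, the argument would simply be __file__.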

To start the server, make sure you are in the root directory of this project, then run python manage.py runserver. This starts the server on localhost:8000.

Crawling

To crawl a webpage, go to localhost:8000/crawl, enter a URL, and fill out the settings below it. The crawl is finished when you are redirected to the search page.
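
To give a rough idea of what a crawl with a link limit involves, here is a minimal breadth-first sketch using only the standard library. It is not the Cradose implementation: the crawl function, the LinkExtractor class, and the fetch and max_links parameters are all hypothetical, and fetch is passed in so the sketch can be tried without network access.

```python
from collections import deque
from html.parser import HTMLParser
from typing import Callable
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url: str, fetch: Callable[[str], str], max_links: int) -> list:
    """Breadth-first crawl that visits at most max_links pages."""
    visited = []
    seen = {start_url}
    queue = deque([start_url])
    while queue and len(visited) < max_links:
        url = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that cannot be fetched
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments before deduplicating.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute.startswith(("http://", "https://")) and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited
```

A caller would supply a fetch function that downloads a page and returns its HTML, for example one built on urllib.request.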

Searching

To use the search engine, go to localhost:8000/search, enter your query in the search bar, and press Enter. The search may take some time; when it finishes, you are redirected to a page that lists the relevant URLs. Clicking any of these URLs takes you to the original page that was crawled.
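
The README does not say how relevance is computed. As one simple possibility, the sketch below builds an inverted index over page text and ranks URLs by how often the query terms occur. The functions tokenize, build_index, and search are hypothetical, and plain term counting stands in for whatever scoring Cradose actually applies.

```python
import re
from collections import Counter, defaultdict


def tokenize(text: str) -> list:
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages: dict) -> dict:
    """Map each term to a Counter of {url: occurrences}."""
    index = defaultdict(Counter)
    for url, text in pages.items():
        for term in tokenize(text):
            index[term][url] += 1
    return index


def search(index: dict, query: str) -> list:
    """Return URLs ordered by total query-term occurrences, highest first."""
    scores = Counter()
    for term in tokenize(query):
        for url, count in index.get(term, {}).items():
            scores[url] += count
    # Ties are broken alphabetically so the ordering is deterministic.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [url for url, _score in ranked]
```

Raw term counts favour long pages; weighting schemes such as TF-IDF are a common refinement.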

Screenshots

crawl page

search page
