Skip to content

Amirsorouri00/web-search-engine

 
 

Repository files navigation

web-search-engine

License: MIT Python 3.5

API - a simple web search engine. The goal is to index an infinite list of URLs (web pages), and then be able to quickly search relevant URLs against a query. This engine uses the ElasticSearch database.

Indexing

The indexing operation of a new URL first crawls URL, then extracts the title and main text content from the page. Then, a new document representing the URL's data is saved in ElasticSearch, and goes for indexing.

Searching

When searching for relevant URLs, the engine will compare the query with the data of each document (web page), and retrieve a list of URLs matching the query, sorted by relevance.

UI

This search engine can be used with an UI : https://github.com/AnthonySigogne/web-search-engine-ui

Note

This API works for a finite list of languages, see here for the complete list : https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html.

DEMO

A demo can be found here : http://searchengine.byprog.com/

About 500 French URLs and 500 English URLs of the news network http://www.france24.com/ have been indexed.

INSTALL AND RUN

$ cd rq-worker
$ docker build -t rq .
$ cd ..
$ docker-compose up

REQUIREMENTS

This tool requires Python3+ and ElasticSearch5+.

WITH PIP

git clone https://github.com/AnthonySigogne/web-search-engine.git
cd web-search-engine
pip install -r requirements.txt

Then, run the tool :

FLASK_APP=index.py HOST=<ip> PORT=<port> USERNAME=<username> PASSWORD=<password> flask run

Where :

  • ip + port : route to ElasticSearch
  • username + password : credentials to access

To run in debug mode, prepend FLASK_DEBUG=1 to the command :

FLASK_DEBUG=1 ... flask run

WITH DOCKER

To run the tool with Docker, you can use my DockerHub image : https://hub.docker.com/r/anthonysigogne/web-search-engine/

docker run -p 5000:5000 \
-e "HOST=<ip>" \
-e "PORT=<port>" \
-e "USERNAME=<username>" \
-e "PASSWORD=<password>" \
anthonysigogne/web-search-engine

Where :

  • ip + port : route to ElasticSearch
  • username + password : credentials to access ElasticSearch

Or, build yourself a Docker image :

git clone https://github.com/AnthonySigogne/web-search-engine.git
cd web-search-engine
docker build -t web-search-engine .

USAGE AND EXAMPLES

To list all services of API, type this endpoint in your web browser : http://localhost:5000/

INDEXING

Index a web page through its URL.

  • URL

    /index

  • Method

    POST

  • Form Data Params

    Required:

    url=[string], the url to index

  • Success Response

    • Code: 200
      Content: Success
  • Error Response

    • Code: 400 INVALID USAGE
  • Sample Call (with cURL)

    curl http://localhost:5000/index --data "language=en&url=https://www.byprog.com/en/"
    

SEARCHING

Query engine to find a list of relevant URLs. Return the sublist of matching URLs sorted by relevance, and the total of matching URLs, in JSON.

  • URL

    /search

  • Method

    POST

  • Form Data Params

    Required:

    query=[string], the search query

    Not required:

    start=[integer], the start of hits (0 by default)

    hits=[integer], the number of hits returned by query (10 by default)

    highlight=[integer], return highlight parts for each URL (0 or 1, 0 by default)

  • Success Response

    • Code: 200
      Content:
      {
        "total": 1,
        "results": [
          {
            "title": "Anthony Sigogne / Freelance / Full-Stack Developer",
            "description": "Full-Stack Developer specialized in new technologies and innovative IT solutions.",
            "url": "https://www.byprog.com/en/"
          }
        ]
      }
      
  • Error Response

    • Code: 400 INVALID USAGE
  • Sample Call (with cURL)

    curl http://localhost:5000/search --data "query=freelance fullstack"
    curl http://localhost:5000/index --data "language=en&url=https://www.laracasts.com"
    curl http://localhost:5000/index --data "language=en&url=https://www.laravel.com"
    curl -d "query=laravel" -X POST http://localhost:5000/search
    curl localhost:9200/_cat/health
    

Explore Job (web search engine container removed from the services)

$ source web_env/bin/activate
$ python
>>> import index
>>> index.explore_job("link to the website to crawl")

FUTURE FEATURES

  • index more page features like keywords,...
  • better scoring function
  • filter bad results
  • create a docker compose
  • traduct tools in several languages
  • connect to pixel tool
  • better description of results
  • redis to index a single url

USEFULL URLS

justext

mine explore

$ python 
>>> import index
>>> index.explore_job(<url>)

done..

List of User Agents

Requirements

PORTS

  • 80
  • 9181
  • 5000

LICENCE

MIT

Releases

No releases published

Packages

 
 
 

Languages

  • Python 87.1%
  • Shell 11.3%
  • Dockerfile 1.6%