Back end implementation for the crawler (see contao#1057)
Description
-----------

This is a first draft, and I could really use some help from here on :)

This is how it looks at the moment:

<img width="1198" alt="Bildschirmfoto 2019-12-04 um 18 03 03" src="https://user-images.githubusercontent.com/481937/70164037-c140c200-16c0-11ea-9939-3072724a515a.png">

To-dos:

- [x] Bring back the FE member authentication. A future version should switch to stateless authentication, but for now it should at least continue to work as before.
- [x] Styling. It's all just some inline styles and not really pretty. I'm just really lacking the skills here.
- [x] What configuration options should we provide? All of the options the crawl command offers? Does it even make sense to let users decide on concurrency etc.?
- [x] Provide subscriber-specific logs (for level info+)

Commits
-------

e287095 First draft for the back end implementation
2f0d8b9 Implemented subscriber specific log files
7a5685d CS
32208f6 CrawlCommand CS
87fa4cf Discard back end configuration
5ca44b7 Re-implemented FE member auth as previously implemented
54f7a08 CS
9088730 Fix issue with timeout because of session deadlock
3ccffd8 Make sure the back end is never followed
ef8158c Merge branch 'master' into feature/crawl-backend
08be7b0 Fix the coding style
3bdd8dc Converted DCA to Doctrine entity and also properly named it
9a5a9e3 Fixed labels
aef6723 Adjust the back end implementation
68e636a Translated summary and warnings of all subscribers for the back end
76d33c5 Removed robots meta tags
45ce63e CS
c968781 Move the debug log link next to the progress bar
b2cec2c Move the CRAWL. labels into the default.xlf file and fix some minor CS issues
Toflar authored and leofeyer committed Jan 9, 2020
1 parent f144630 commit 49d437d
Showing 21 changed files with 794 additions and 321 deletions.
12 changes: 7 additions & 5 deletions src/Command/CrawlCommand.php
@@ -106,11 +106,13 @@ protected function execute(InputInterface $input, OutputInterface $output): int

         $logOutput = $output instanceof ConsoleOutput ? $output->section() : $output;

-        $this->escargot = $this->escargot->withLogger($this->createSourceProvidingConsoleLogger($logOutput));
-        $this->escargot = $this->escargot->withConcurrency((int) $input->getOption('concurrency'));
-        $this->escargot = $this->escargot->withRequestDelay((int) $input->getOption('delay'));
-        $this->escargot = $this->escargot->withMaxRequests((int) $input->getOption('max-requests'));
-        $this->escargot = $this->escargot->withMaxDepth((int) $input->getOption('max-depth'));
+        $this->escargot = $this->escargot
+            ->withLogger($this->createSourceProvidingConsoleLogger($logOutput))
+            ->withConcurrency((int) $input->getOption('concurrency'))
+            ->withRequestDelay((int) $input->getOption('delay'))
+            ->withMaxRequests((int) $input->getOption('max-requests'))
+            ->withMaxDepth((int) $input->getOption('max-depth'))
+        ;

         $io->comment('Started crawling...');

81 changes: 81 additions & 0 deletions src/Entity/CrawlQueue.php
@@ -0,0 +1,81 @@
+<?php
+
+declare(strict_types=1);
+
+/*
+ * This file is part of Contao.
+ *
+ * (c) Leo Feyer
+ *
+ * @license LGPL-3.0-or-later
+ */
+
+namespace Contao\CoreBundle\Entity;
+
+use Doctrine\ORM\Mapping as ORM;
+use Doctrine\ORM\Mapping\GeneratedValue;
+
+/**
+ * @ORM\Table(
+ *     name="tl_crawl_queue",
+ *     indexes={
+ *         @ORM\Index(name="job_id", columns={"job_id"}),
+ *         @ORM\Index(name="uri", columns={"uri"}),
+ *         @ORM\Index(name="processed", columns={"processed"}),
+ *     }
+ * )
+ * @ORM\Entity()
+ */
+class CrawlQueue
+{
+    /**
+     * @var int
+     *
+     * @ORM\Id
+     * @ORM\Column(type="integer", options={"unsigned"=true})
+     * @GeneratedValue
+     */
+    public $id;
+
+    /**
+     * @var string
+     *
+     * @ORM\Column(name="job_id", type="string", length=128, options={"fixed"=true})
+     */
+    public $jobId;
+
+    /**
+     * @var string
+     *
+     * @ORM\Column(type="string", length=255, options={"fixed"=true})
+     */
+    public $uri;
+
+    /**
+     * @var string
+     *
+     * @ORM\Column(name="found_on", type="string", nullable=true, length=255, options={"fixed"=true})
+     */
+    public $foundOn;
+
+    /**
+     * @var int
+     *
+     * @ORM\Column(type="smallint")
+     */
+    public $level;
+
+    /**
+     * @var bool
+     *
+     * @ORM\Column(type="boolean")
+     */
+    public $processed;
+
+    /**
+     * @var string
+     *
+     * @ORM\Column(type="text", nullable=true)
+     */
+    public $tags;
+}
3 changes: 3 additions & 0 deletions src/Resources/config/services.yml
@@ -541,13 +541,16 @@ services:

     contao.search.escargot_subscriber.broken_link_checker:
         class: Contao\CoreBundle\Search\Escargot\Subscriber\BrokenLinkCheckerSubscriber
+        arguments:
+            - '@translator'
         tags:
             - { name: 'contao.escargot_subscriber' }

     contao.search.escargot_subscriber.search_index:
         class: Contao\CoreBundle\Search\Escargot\Subscriber\SearchIndexSubscriber
         arguments:
             - '@contao.search.indexer'
+            - '@translator'
         tags:
             - { name: 'contao.escargot_subscriber' }
