# How to index more than 500k entities

If you want to index a lot of entities, it is not a good idea to use `solr:index:populate`. The command works well with 100k-200k entities; everything larger than that makes it incredibly slow and memory-hungry, because Doctrine is not designed to handle hundreds of thousands of entities.

In the following example I have a `person` table with 5,000,000 rows and three columns: `id`, `name` and `email`. The resulting documents are schemaless, so all fields get a suffix, e.g. `name_s`.

Here are some approaches that have worked well for me:

## CSV Export with MySQL Prepared Statement + Solr Post-Tool

This solution does not use PHP.

1. Export your data to a CSV file:

```sql
SET @TS = DATE_FORMAT(NOW(),'_%Y_%m_%d_%H_%i_%s');

SET @FOLDER = '/tmp/'; -- target dir
SET @PREFIX = 'person';
SET @EXT    = '.csv';

-- first select defines the header of the csv-file
SET @CMD = CONCAT("SELECT 'id', 'name_s', 'email_s' UNION ALL SELECT * FROM person INTO OUTFILE '",@FOLDER,@PREFIX,@TS,@EXT,
    "' FIELDS ENCLOSED BY '\"' TERMINATED BY ',' ESCAPED BY '\"'",
    "  LINES TERMINATED BY '\r\n';");

PREPARE statement FROM @CMD;

EXECUTE statement;
```

Then run this SQL script:

```sh
mysql -udbuser -p123 dbname < dump_person_table.sql
```
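Note that `SELECT … INTO OUTFILE` writes the file on the database server itself and, depending on your configuration, may be restricted to the directory set in MySQL's `secure_file_priv` option.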

The resulting file is named like `/tmp/person_2017_03_01_11_21_41.csv`.
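With the quoting options above applied, its first lines look roughly like this (the two data rows are made-up examples):

```csv
"id","name_s","email_s"
"1","Alice Example","alice@example.com"
"2","Bob Example","bob@example.com"
```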

2. Index the CSV with the post tool:

```sh
/opt/solr/solr-5.5.2/bin/post -c core0 /tmp/person_2017_03_01_11_21_41.csv
```
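The post tool detects the content type from the `.csv` extension, and Solr's CSV handler treats the first line of the file as the list of field names by default, so the header row from the SQL export maps each column to the right field.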

## PDO Select + Solarium BufferedAdd

This solution selects the rows with PDO and pushes them to Solr with Solarium's BufferedAdd plugin. The script has two parts:

1. Select a chunk of rows from the DB.
2. Add the rows to the index with Solarium's BufferedAdd plugin.
```php
<?php
require 'vendor/autoload.php';

$connection = new PDO('mysql:host=localhost;dbname=dbname;charset=utf8mb4', 'dbuser', '123');
$connection->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$statement = $connection->prepare('SELECT COUNT(*) as total_items FROM person');
$statement->execute();
$countResult = $statement->fetch(PDO::FETCH_ASSOC);

$totalItems = $countResult['total_items'];
$batchSize = 5000;

$pages = ceil($totalItems / $batchSize);

$client = new Solarium\Client([
    'endpoint' => [
        'localhost' => [
            'host' => 'localhost',
            'port' => 8983,
            'path' => '/solr/core0',
        ]
    ]
]);

/** @var \Solarium\Plugin\BufferedAdd\BufferedAdd $buffer */
$buffer = $client->getPlugin('bufferedadd');
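// the plugin sends its buffered documents to Solr automatically whenever the buffer is full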
$buffer->setBufferSize($batchSize);

for ($i = 1; $i <= $pages; $i++) {
    // chunk by id range; assumes contiguous auto-increment ids starting at 1
    $limitStart = ($i - 1) * $batchSize + 1;
    $limitEnd = $i * $batchSize;

    $statement = $connection->prepare('SELECT id, name, email FROM person WHERE id >= ? AND id <= ?');
    $statement->execute([$limitStart, $limitEnd]);

    foreach ($statement->fetchAll(PDO::FETCH_ASSOC) as $item) {
        $buffer->createDocument([
            'id' => $item['id'],
            'name_s' => $item['name'],
            'email_s' => $item['email']
        ]);
    }

    $statement->closeCursor();

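    // flushes the buffered documents of this chunk and makes them visible with a Solr commit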
    $buffer->commit();

    echo sprintf('Indexing page %s / %s', $i, $pages) . PHP_EOL;
}

// send any documents still left in the buffer
$buffer->flush();
```
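Chunking by id range instead of `LIMIT`/`OFFSET` keeps every query fast: with a large `OFFSET`, MySQL has to scan and throw away all the skipped rows, while an indexed id range jumps straight to the chunk. The trade-off is that the ids are assumed to be a contiguous auto-increment sequence; gaps only mean that some pages contain fewer rows.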

## PDO Select + CSV Export + Solr Post-Tool

This solution exports the table to CSV files using PDO. The exported files are written to `/tmp/export`.

```php
<?php

$connection = new PDO('mysql:host=localhost;dbname=dbname;charset=utf8mb4', 'dbuser', '123');
$connection->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$statement = $connection->prepare('SELECT COUNT(*) as total_items FROM person');
$statement->execute();
$countResult = $statement->fetch(PDO::FETCH_ASSOC);
$statement->closeCursor();

$totalItems = $countResult['total_items'];
$batchSize = 10000;

$pages = ceil($totalItems / $batchSize);

// create the target directory if it does not exist yet
if (!is_dir('/tmp/export')) {
    mkdir('/tmp/export');
}

for ($i = 1; $i <= $pages; $i++) {
    // chunk by id range; assumes contiguous auto-increment ids starting at 1
    $limitStart = ($i - 1) * $batchSize + 1;
    $limitEnd = $i * $batchSize;

    $statement = $connection->prepare('SELECT id, name, email FROM person WHERE id >= ? AND id <= ?');
    $statement->execute([$limitStart, $limitEnd]);

    $handle = fopen(sprintf('/tmp/export/person_%s.csv', $i), 'w');

    // the first line defines the header of the csv file
    fputcsv($handle, ['id', 'name_s', 'email_s']);

    foreach ($statement->fetchAll(PDO::FETCH_ASSOC) as $item) {
        // fputcsv quotes fields and escapes embedded quotes for us
        fputcsv($handle, [$item['id'], $item['name'], $item['email']]);
    }

    fclose($handle);
    $statement->closeCursor();

    echo sprintf('Exporting page %s / %s', $i, $pages) . PHP_EOL;
}
```
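`fputcsv()` is used here instead of hand-rolled `sprintf()` formatting so that names or emails containing quotes, commas or line breaks are escaped correctly; otherwise such a row would silently break the CSV for the post tool.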

To import the data we use the Solr Post-Tool:

```sh
/opt/solr/solr-5.5.2/bin/post -c core0 /tmp/export
```
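When `bin/post` is pointed at a directory, it posts all files with known extensions, so every `person_*.csv` chunk is picked up in one run.

To check that everything arrived, a quick count query with Solarium works well. This is a minimal sketch that assumes the same `core0` endpoint as in the script above:

```php
<?php
require 'vendor/autoload.php';

$client = new Solarium\Client([
    'endpoint' => [
        'localhost' => [
            'host' => 'localhost',
            'port' => 8983,
            'path' => '/solr/core0',
        ]
    ]
]);

// match all documents but fetch none; we only need the total count
$query = $client->createSelect();
$query->setQuery('*:*');
$query->setRows(0);

$result = $client->select($query);

// should print the same number as SELECT COUNT(*) FROM person
echo $result->getNumFound() . PHP_EOL;
```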