
How To Write a Simple Miner

Luigi Mori edited this page Sep 20, 2017 · 8 revisions

As a first simple example of writing MineMeld nodes, let's write a simple Miner.

Introduction

The idea behind the Miner used in this example comes from a script by Carlito Vleia, a Systems Engineer at Palo Alto Networks. The idea is pretty simple: periodically retrieve the list of videos published on a specific YouTube channel and translate the entries into a set of indicators of type URL for the External Dynamic List feature of Palo Alto Networks PAN-OS.

The Usual, Annoying Preliminary Steps

Before starting to code, make sure you have read about:

After all that reading, spin up a full MineMeld devel environment following the developer's guide.

Videos on a YouTube channel

The proper way of retrieving the list of videos from a YouTube channel is to use the YouTube API, as suggested by Luke Amery, but that would be a bit too complex for this tutorial. Instead we will use the quick and dirty way suggested by Carlito:

  1. Retrieve the page https://www.youtube.com/user/<channel_name>/videos
  2. Extract all the data-context-item-id attribute values inside the page
  3. Generate a URL for each of them with the format www.youtube.com/watch?v=<item>

Excerpt from the videos HTML page. Note the div with the data-context-item-id attribute:

[...]
<li class="channels-content-item yt-shelf-grid-item">
  <div class="yt-lockup clearfix yt-lockup-video yt-lockup-grid vve-check" data-context-item-id="FAKEVIDEOID" data-visibility-tracking="....">
    ...
  </div>
</li>
[...]
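The three steps above can be tried offline on the excerpt before any MineMeld code is written. The following is a minimal sketch that uses only the Python standard library; the names VideoIdParser, extract_urls and EXCERPT are hypothetical and exist only for this illustration (the Miner itself, later in this page, uses bs4 instead):

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2


class VideoIdParser(HTMLParser):
    """Hypothetical helper: collects data-context-item-id values from divs."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.video_ids = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, names are lowercased
        if tag != 'div':
            return
        for name, value in attrs:
            if name == 'data-context-item-id' and value:
                self.video_ids.append(value)


def extract_urls(page):
    """Step 2 and 3: extract the video IDs and build one URL per ID."""
    parser = VideoIdParser()
    parser.feed(page)
    return ['www.youtube.com/watch?v={}'.format(v) for v in parser.video_ids]


# the excerpt shown above, used here instead of a live HTTP request
EXCERPT = '''
<li class="channels-content-item yt-shelf-grid-item">
  <div class="yt-lockup clearfix yt-lockup-video yt-lockup-grid vve-check" data-context-item-id="FAKEVIDEOID" data-visibility-tracking="....">
  </div>
</li>
'''

print(extract_urls(EXCERPT))  # ['www.youtube.com/watch?v=FAKEVIDEOID']
```

Working on a saved excerpt keeps the experiment independent of the network and of changes to the live page layout.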

Process Overview

The Node

Let's start with writing the code for the new Miner.

The new Miner should be able to:

  1. Read the YouTube channel name from the configuration
  2. Periodically poll the URL of the channel
  3. Extract video IDs from the result
  4. Create an indicator of type URL for each video
  5. Generate an UPDATE message for new videos
  6. Generate a WITHDRAW message for removed videos
  7. Answer RPC requests coming from the master and the API
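The idea behind steps 5 and 6 can be pictured as a comparison between the indicators seen in two consecutive polls. The sketch below is only an illustration of that idea with plain sets; the function name diff_polls is hypothetical and this is not the code MineMeld uses internally:

```python
def diff_polls(previous, current):
    """Hypothetical helper: compare the indicators of two polls.

    Returns (updates, withdraws): indicators that appeared since the
    previous poll, and indicators that disappeared.
    """
    previous = set(previous)
    current = set(current)
    updates = sorted(current - previous)    # new videos -> UPDATE
    withdraws = sorted(previous - current)  # removed videos -> WITHDRAW
    return updates, withdraws


first_poll = ['www.youtube.com/watch?v=A', 'www.youtube.com/watch?v=B']
second_poll = ['www.youtube.com/watch?v=B', 'www.youtube.com/watch?v=C']

updates, withdraws = diff_polls(first_poll, second_poll)
print(updates)    # ['www.youtube.com/watch?v=C']
print(withdraws)  # ['www.youtube.com/watch?v=A']
```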

The Poller Boilerplate

To make our life a bit easier, the logic for steps 5, 6 and 7, and most of steps 1 and 2, is already implemented in minemeld.ft.basepoller.BasePollerFT - this is the superclass used by most of the polling Miners.

Create a file in the /opt/minemeld/engine/core/minemeld/ft directory, and call it ytexample.py. Copy & paste the following code into it:

from __future__ import absolute_import

import logging
import requests
import bs4  # we use bs4 to parse the HTML page

from . import basepoller

LOG = logging.getLogger(__name__)


class YTExample(basepoller.BasePollerFT):
    def configure(self):
        pass

    def _process_item(self, item):
        # called on each item returned by _build_iterator
        # it should return a list of (indicator, value) pairs
        pass

    def _build_iterator(self, now):
        # called at every polling interval
        # here you should retrieve and return the list of items
        pass

Read the Configuration

Our YouTube Miner should read the channel name from the config. This can be done inside the configure method. When configure is called, the configuration is already stored as a dictionary in the self.config attribute of the instance.

We also want to read from the config a polling timeout and a flag to control HTTPS certificate verification (default should be True).

The configure method of your class should look like:

    def configure(self):
        super(YTExample, self).configure()

        self.polling_timeout = self.config.get('polling_timeout', 20)
        self.verify_cert = self.config.get('verify_cert', True)

        self.channel_name = self.config.get('channel_name', None)
        if self.channel_name is None:
            raise ValueError('%s - channel name is required' % self.name)
        self.url = 'https://www.youtube.com/user/{}/videos'.format(
            self.channel_name
        )
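The defaulting and validation rules can be checked without a running engine by applying the same logic to a plain dictionary. This is a minimal sketch: the function read_config and its node_name parameter are hypothetical stand-ins for the method and for self.name, and are not part of MineMeld.

```python
def read_config(config, node_name='testYT'):
    """Hypothetical stand-in for YTExample.configure, using a plain dict."""
    polling_timeout = config.get('polling_timeout', 20)
    verify_cert = config.get('verify_cert', True)

    channel_name = config.get('channel_name', None)
    if channel_name is None:
        raise ValueError('%s - channel name is required' % node_name)

    return {
        'polling_timeout': polling_timeout,
        'verify_cert': verify_cert,
        'url': 'https://www.youtube.com/user/{}/videos'.format(channel_name),
    }


settings = read_config({'channel_name': 'EEVblog'})
print(settings['url'])  # https://www.youtube.com/user/EEVblog/videos

try:
    read_config({})
except ValueError as e:
    print(e)  # testYT - channel name is required
```

Only channel_name is mandatory; the other two options fall back to their defaults when they are missing from the config.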

Extract the IDs of the videos

At every polling interval, the _build_iterator method is called with the current time, in milliseconds since the epoch. It should return an iterable of items. Each item is then translated into a list of indicators by the _process_item method.

    def _build_iterator(self, now):
        # builds the request and retrieves the page
        rkwargs = dict(
            stream=False,
            verify=self.verify_cert,
            timeout=self.polling_timeout
        )

        r = requests.get(
            self.url,
            **rkwargs
        )

        try:
            r.raise_for_status()
        except requests.exceptions.HTTPError:
            # log the failed response, then let the base class handle the error
            LOG.debug('%s - exception in request: %s %s',
                      self.name, r.status_code, r.content)
            raise

        # parse the page
        html_soup = bs4.BeautifulSoup(r.content, "lxml")
        result = html_soup.find_all(
            'div',
            class_='yt-lockup-video',
            attrs={
                'data-context-item-id': True
            }
        )

        return result

Create the indicators

The method _build_iterator returns a list of bs4.element.Tag objects. Each object is then passed by the base class to the _process_item method. This method is responsible for translating each object into a list of indicators.

In our case the _process_item method creates an indicator of type URL for each object.

    def _process_item(self, item):
        video_id = item.attrs.get('data-context-item-id', None)
        if video_id is None:
            LOG.error('%s - no data-context-item-id attribute', self.name)
            return []

        indicator = 'www.youtube.com/watch?v={}'.format(video_id)
        value = {
            'type': 'URL',
            'confidence': 100
        }

        return [[indicator, value]]
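The translation logic is easy to exercise on its own, because it only reads the attrs dictionary of the item. The sketch below assumes a hypothetical FakeTag class as a stand-in for bs4.element.Tag, and a module-level process_item function that mirrors the method without self and without logging:

```python
class FakeTag(object):
    """Hypothetical stand-in for bs4.element.Tag: only attrs is used."""

    def __init__(self, attrs):
        self.attrs = attrs


def process_item(item):
    # same logic as the _process_item method, minus self and logging
    video_id = item.attrs.get('data-context-item-id', None)
    if video_id is None:
        return []

    indicator = 'www.youtube.com/watch?v={}'.format(video_id)
    value = {
        'type': 'URL',
        'confidence': 100
    }

    return [[indicator, value]]


# an item with the attribute yields exactly one [indicator, value] pair
pairs = process_item(FakeTag({'data-context-item-id': 'FAKEVIDEOID'}))
print(pairs[0][0])  # www.youtube.com/watch?v=FAKEVIDEOID

# an item without the attribute yields no indicators
print(process_item(FakeTag({'class': ['yt-lockup']})))  # []
```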

Test the new Miner

To test the new Miner, replace the content of the file /opt/minemeld/local/config/committed-config.yml with the following:

nodes:
  testYT:
    class: minemeld.ft.ytexample.YTExample
    inputs: []
    output: true
    config:
      # set the channel name to EEVblog
      channel_name: EEVblog
      # source name used in the indicators
      source_name: youtube.EEVblog
      # age out of indicators
      # disabled, removed when they disappear from the channel
      age_out:
        sudden_death: true
        default: null

and then restart the minemeld service:

$ sudo service minemeld stop
$ sudo service minemeld start

Check that all the services are running:

$ sudo -u minemeld /opt/minemeld/engine/current/bin/supervisorctl -c /opt/minemeld/local/supervisor/config/supervisord.conf status
minemeld-engine                  RUNNING   pid 4526, uptime 0:05:39
minemeld-traced                  RUNNING   pid 4527, uptime 0:05:39
minemeld-web                     RUNNING   pid 4528, uptime 0:05:39

If something fails, the first place to look is the engine log file, /opt/minemeld/log/minemeld-engine.log.

The prototype

Congratulations! Your new node is up and running!

To make it available in the CONFIG page of the UI, you should now create a simple prototype for it.

Create the local prototype

Create the file /opt/minemeld/local/prototypes/ytexample.yml with the following content:

description: Test prototype library

prototypes:
  YTEEVblog:
    author: Test
    description: Miner for videos of EEVblog
    class: minemeld.ft.ytexample.YTExample
    config:
      channel_name: EEVblog
      source_name: youtube.EEVblog
      age_out:
        sudden_death: true
        default: null

Test the prototype

Refresh the UI, and then go to CONFIG. Click the browse button and search for the new prototype. It should be there:

[Screenshot: prototype]

BAM !

Now you can use the prototype to instantiate the Miner from the UI:

[Screenshot: configured prototype]

And you can also modify the prototype from the UI to support additional YouTube channels:

[Screenshot: edit prototype]
