Documentation

py-am-i edited this page Oct 27, 2016 · 62 revisions

This is where you can learn how to use the Slack micro-framework to scrape the web good, and do other things good too.

Starting out

After installing the library, you'll need to generate a new project...

Generating a new project

Open up a Python console (or IDLE), import make_project, and call it:

from Slack import make_project
make_project('some/folder/Some Awesome Project')

This will create the project folder. Inside it there will be a SiteAutomations folder for your Controllers, a Jobs folder for running custom Jobs, and some project files. The examples below import from Slack's example SiteAutomations folder. When developing, you would write your Controllers in the SiteAutomations folder contained in your project.

The Project Files

.env

The .env file is used to hold static variables. It's used for things like setting up your database connection, or holding API keys:

# DB_TYPE values: sql, mysql, postgresql, berkeley
DB_TYPE=sql
DB=default.db
DB_HOST=localhost
DB_PORT=3306
DB_USERNAME=None
DB_PASSWORD=None

You can also store Python lists and dicts in the file:

# List
CUSTOMERS[]:
BOBLHEAD
WOOGLE
CUSTOMERS[END]

# Dict
PRICES{}:
BOOK=15.95
ORANGE=.75
PRICES{END}

Import Slack's env function to access these values:

>>> from Slack.Environment import env

>>> float(env('PRICES')['BOOK'])
15.95

Or, if you want to transform the data as you get it, give it a callback function:

>>> env('DB_PORT', func=int)
3306

This isn't meant to be used as a complex datastore. Its main purpose is to hold application configuration in a way that makes the data easily accessible.
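To make the file format above a bit more concrete, here is a minimal sketch of how a parser for it could work. This is not Slack's actual implementation: parse_env and make_env are hypothetical names, and the sketch assumes that every value is read as a string and that block markers look exactly like the NAME[]: / NAME[END] and NAME{}: / NAME{END} examples.

```python
def parse_env(text):
    """Parse .env-style text into a dict of strings, lists, and dicts."""
    values = {}
    block_name = None   # name of the list/dict block being read, if any
    block = None        # the list or dict currently being filled

    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and comments

        if block_name is not None:
            # Inside a block: check for its END marker first.
            if line in (block_name + '[END]', block_name + '{END}'):
                values[block_name] = block
                block_name, block = None, None
            elif isinstance(block, list):
                block.append(line)
            else:
                key, _, value = line.partition('=')
                block[key.strip()] = value.strip()
        elif line.endswith('[]:'):
            block_name, block = line[:-3], []
        elif line.endswith('{}:'):
            block_name, block = line[:-3], {}
        else:
            key, _, value = line.partition('=')
            values[key.strip()] = value.strip()
    return values


def make_env(values):
    """Return an env(key, func=None) lookup bound to the parsed values."""
    def env(key, func=None):
        value = values[key]
        return func(value) if func is not None else value
    return env


SAMPLE = """
# DB settings
DB_TYPE=sql
DB_PORT=3306

CUSTOMERS[]:
BOBLHEAD
WOOGLE
CUSTOMERS[END]

PRICES{}:
BOOK=15.95
ORANGE=.75
PRICES{END}
"""

env = make_env(parse_env(SAMPLE))
print(env('DB_PORT', func=int))        # 3306
print(env('CUSTOMERS'))                # ['BOBLHEAD', 'WOOGLE']
print(float(env('PRICES')['BOOK']))    # 15.95
```

Because everything comes back as a string, the func callback is what turns '3306' into an int.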

models.py

models.py is where you put your peewee models.

migrations.py

Finally, running migrations.py drops all the tables in your database (based on what is defined in models.py), then recreates them. It's meant for the database design side of development. All this file does is import your models and pass them to Slack's migrate function, which takes care of the rest. If you need to write database seeders, you could add them to migrations.py. I usually create an additional Seed module in my project's Jobs module, place all my table seeder jobs in it, then run each seeder as a Job using run_job('Seed.UserTable').
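If the drop-then-recreate cycle is unclear, the sketch below shows the idea using the standard library's sqlite3 module instead of peewee and Slack's migrate function. The SCHEMA dict and the migrate helper here are stand-ins made up for illustration; they only mimic the behaviour described above.

```python
import sqlite3

# Hypothetical schema standing in for what models.py would define.
SCHEMA = {
    'user': 'CREATE TABLE user (id INTEGER PRIMARY KEY, name TEXT, address TEXT)',
}


def migrate(connection, schema):
    """Drop every table named in schema, then recreate it empty."""
    for table, create_sql in schema.items():
        connection.execute('DROP TABLE IF EXISTS {}'.format(table))
        connection.execute(create_sql)
    connection.commit()


conn = sqlite3.connect(':memory:')
migrate(conn, SCHEMA)
conn.execute("INSERT INTO user (name, address) VALUES ('Ann', '1 Main St')")
conn.commit()

# Running the migration again wipes the data, which is why this kind of
# script belongs in development rather than production.
migrate(conn, SCHEMA)
row_count = conn.execute('SELECT COUNT(*) FROM user').fetchone()[0]
print(row_count)  # 0
```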

Single Threaded Automations

Here is an example of a single-threaded automation. If you want to learn about building multi-threaded automations, head to the ThreadedCommandFactory section. For single-threaded automations, we need to import all the appropriate pieces.

from time import sleep

# Database models used to interact with databases if needed.
import models

# The environment variable loader.  These variables can be set in the .env file.  
# This is important if we want to create configurable web automations / scrapers.
from Slack.Environment import env, env_driver

# Pull in controllers from Slack's example SiteAutomations.
from Slack.SiteAutomations.Examples import GoogleExample, BingExample

Controllers are kept in the SiteAutomations folder. All of these examples import Slack's example controllers, which are located at Slack/SiteAutomations/Examples. You can play with those, or write your own Controllers in the SiteAutomations folder in your project's folder.

# The quitting context helps to `close()` and `quit()` the WebDriver instance if
# something goes wrong.
from Slack.Helpers.Contexts import quitting
from selenium.webdriver.support.wait import WebDriverWait

To get a hold of our WebDriver instance, we need to use the env_driver and env functions. env('BROWSER') will return the name of the browser set in the .env file and env_driver takes the name of the browser, and returns the appropriate WebDriver instance. The quitting function is used to open the WebDriver instance the same way you would open a file using with. When you add all of this together, you get:

# This could be written as:
#
# browser = env("BROWSER")
# web_driver = env_driver(browser)
# with quitting(web_driver()) as driver:
#     pass
with quitting(env_driver(env("BROWSER"))()) as driver:
    # Do stuff.

Now that we have a valid WebDriver instance, we can instantiate our Controllers and do some work.

    # Get an instance of `WebDriverWait`.
    wait = WebDriverWait(driver, 30)

    # Pass the web driver to the site automation along with anything
    # else it might need to do its job. This could include an
    # instance of `WebDriverWait`, and even the collection of
    # Models.
    google_search = GoogleExample.GoogleSearch(driver, wait, models)
    bing_search = BingExample.BingSearch(driver, wait, models)

    # Do stuff with your controllers.
    google_search.do_search('google wiki')
    sleep(5)
    bing_search.do_search('bing wiki')
    sleep(5)
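The quitting helper used above behaves a lot like contextlib.closing. Here is a rough sketch of how such a context manager could be written. It is an assumption about the general shape, not Slack's source: quitting_sketch and FakeDriver are hypothetical names, and this version only calls quit().

```python
from contextlib import contextmanager


@contextmanager
def quitting_sketch(driver):
    """Yield the driver, then always call quit() on it, even after an error."""
    try:
        yield driver
    finally:
        driver.quit()


class FakeDriver(object):
    """Stand-in for a WebDriver so the sketch runs without a browser."""
    def __init__(self):
        self.quit_called = False

    def quit(self):
        self.quit_called = True


fake = FakeDriver()
try:
    with quitting_sketch(fake) as driver:
        raise RuntimeError('something went wrong mid-automation')
except RuntimeError:
    pass

print(fake.quit_called)  # True
```

The point of the pattern is that the browser process does not linger when an automation blows up halfway through.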

Controllers

There are two types of controllers. The first is really just a class that controls a WebDriver instance. The other is a class that inherits from IndependentController and also controls a WebDriver instance. The only difference is that instances of IndependentController attach their WebDriver after they're instantiated, using the attach_driver method. This facilitates the use of the ThreadedCommandFactory and CommandFactory objects. A basic controller might look something like:

class Google(object):
    def __init__(self, driver, wait):
        self.driver = driver
        self.wait = wait

    def do_search(self, search_term):
        self.driver.get('https://google.com')

        # Type search
        search_input = self.driver.find_element_by_name('q')
        search_input.send_keys(search_term)

        # Click search button.
        search_button = self.driver.find_element_by_name('btnG')
        search_button.click()
        self.wait.until(lambda the_driver: the_driver.find_element_by_id('resultStats').is_displayed())
        return self

Or, if you wanted to create Command objects with the ThreadedCommandFactory objects, it might look like this:

from Slack.Helpers.Controllers import has_kwargs

# Inherit from IndependentController to automatically get access to the `attach_driver` method.
# (IndependentController must be imported as well; its module path is not shown here.)
class ThreadedGoogleSearch(IndependentController):
    def __init__(self, models):
        self.models = models

    # Using the @has_kwargs decorator allows keyword arguments to be
    # passed to the method.  When you assemble a command pack for the
    # CommandManager, just include an instance of the Kwargs object.
    @has_kwargs
    def do_search(self, search_term, some_kwarg='some value'):
        print(some_kwarg)
        self.driver.get('https://google.com')

        # Type search
        search_input = self.driver.find_element_by_name('q')
        search_input.send_keys(search_term)

        # Click search button.
        search_button = self.driver.find_element_by_name('btnG')
        search_button.click()
        self.wait.until(lambda the_driver: the_driver.find_element_by_id('resultStats').is_displayed())
        return self
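To show what "attach the driver after instantiation" means in practice, here is a small sketch that runs without a browser. IndependentControllerSketch, RecordingDriver, and HomePage are hypothetical stand-ins; the real IndependentController may do more than this.

```python
class IndependentControllerSketch(object):
    """Hypothetical base class: the driver is attached after construction."""
    driver = None

    def attach_driver(self, driver):
        self.driver = driver
        return self


class RecordingDriver(object):
    """Stand-in for a WebDriver that records the URLs it is asked to load."""
    def __init__(self):
        self.visited = []

    def get(self, url):
        self.visited.append(url)


class HomePage(IndependentControllerSketch):
    def __init__(self, models):
        self.models = models  # note: no driver is passed in here

    def open(self):
        self.driver.get('https://google.com')
        return self


controller = HomePage(models=None)
print(controller.driver)  # None

driver = RecordingDriver()
controller.attach_driver(driver).open()
print(driver.visited)  # ['https://google.com']
```

Deferring the driver like this lets a factory build all the controllers first and hand each one its own browser later.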

ThreadedCommandFactory, CommandFactory, and Command

ThreadedCommandFactory and CommandFactory are used to create Command objects, which are used to execute Controller methods. This facilitates the use of a separate WebDriver for each Controller (each controller gets its own browser). Both factories inherit from BaseCommandFactory, which sets up the dict-like functionality and the base methods that make up the factories. To use one of these factories, you must pass it a dict of Controllers.

# Grab the models that the Controllers need. They aren't used here; they're passed in just as an example.
import models
# Grab the Example Controllers.
from Slack.SiteAutomations.Examples import GoogleExample, BingExample
# And lastly the CommandFactory
from Slack.Helpers.Commands import ThreadedCommandFactory

# Here we set up the dict of controllers.
controllers = {
    'google': GoogleExample.ThreadedGoogleSearch(models),
    'bing': BingExample.ThreadedBingSearch(models)
}

# Get the CommandFactory instance by passing it the Controllers.
cmd_factory = ThreadedCommandFactory(controllers, logging=False)

Once we have a CommandFactory, we can create the Command instance. The Command instance is used to execute the various commands (methods) your controllers have. This is done by creating a dict of tuples, using the same keys you used in the dict of Controllers. Then pass cmd_factory.create_command a function that takes a Controller as its first argument, followed by this new dict.

# Setting up the Command pack.
search_command = {
    'google': ('google wiki',),  # note how single arguments still need to be passed as a tuple
    'bing': ('bing wiki',)
}
# Here we pass an anonymous function as the first argument,
# and search_command as the second.
cmd = cmd_factory.create_command(
    lambda controller, *args: controller.do_search(*args), 
    search_command
)
# Start the command!
cmd.start()

This will execute the do_search method on each controller, each in its own thread, so the whole command takes only as long as the slowest method.
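For a feel of what happens when a command starts, here is a rough sketch built on the standard threading module. It is not how ThreadedCommandFactory is actually written: run_command and EchoController are hypothetical, and the controllers sleep instead of driving a browser.

```python
import threading
import time


class EchoController(object):
    """Stand-in controller; sleeps instead of driving a browser."""
    def __init__(self, delay):
        self.delay = delay
        self.searches = []

    def do_search(self, term):
        time.sleep(self.delay)
        self.searches.append(term)


def run_command(controllers, func, command_pack):
    """Call func(controller, *args) for each key, one thread per controller."""
    threads = []
    for key, args in command_pack.items():
        thread = threading.Thread(
            target=func,
            args=(controllers[key],) + tuple(args),
        )
        threads.append(thread)
        thread.start()
    # Wait for every controller to finish before returning.
    for thread in threads:
        thread.join()


controllers = {'google': EchoController(0.2), 'bing': EchoController(0.1)}
command_pack = {'google': ('google wiki',), 'bing': ('bing wiki',)}

run_command(controllers, lambda c, *args: c.do_search(*args), command_pack)

print(controllers['google'].searches)  # ['google wiki']
print(controllers['bing'].searches)    # ['bing wiki']
```

The keys in the command pack line up with the keys in the controller dict, which is why both dicts have to use the same names.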

Jobs

Jobs allow you to run pieces of code by calling run_job('JobName'). Jobs are kept inside the Jobs folder of your project as individual Python files. As long as a file contains a start_job function, passing the Job's filename (without the extension) to run_job will execute that function. Under the hood, run_job imports the name you pass to it, extracts the start_job function, and runs it for you. This is super useful for common database operations: maybe you need to seed a database with some fake data, or clear some data out of the database every Sunday at midnight. If you're using PyCharm, it's super simple to open a Python console in your project folder, then run your jobs.

from Slack.Project.Jobs import run_job
run_job('ExampleJob')

Here is what a custom job might look like. Let's assume this file is called UserSeeder.py:

import models
from faker import Faker


def start_job():
    print('starting user table seeder!')
    fake = Faker()

    # Create 10 fake users in the User table (which doesn't actually exist).
    for i in range(0, 10):
        user = {
            'name': fake.name(),
            'address': fake.address()
        }
        models.User.create(**user)

    print('finished seeding users table!')

To run this job, pass the Python file name (without the extension) to run_job:

run_job('UserSeeder')
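The import-then-call behaviour of run_job can be approximated with importlib. The sketch below is an assumption about the mechanism, not Slack's code: run_job_sketch is a hypothetical name, and the throwaway Jobs folder with HelloJob.py exists only so the example runs on its own.

```python
import importlib
import os
import sys
import tempfile


def run_job_sketch(name, package='Jobs'):
    """Import package.name and call its start_job function."""
    module = importlib.import_module('{}.{}'.format(package, name))
    return module.start_job()


# Build a throwaway project containing Jobs/HelloJob.py.
project_dir = tempfile.mkdtemp()
jobs_dir = os.path.join(project_dir, 'Jobs')
os.mkdir(jobs_dir)
open(os.path.join(jobs_dir, '__init__.py'), 'w').close()
with open(os.path.join(jobs_dir, 'HelloJob.py'), 'w') as job_file:
    job_file.write("def start_job():\n    return 'hello from HelloJob'\n")

# Make the project importable, much like running from the project folder.
sys.path.insert(0, project_dir)
importlib.invalidate_caches()

result = run_job_sketch('HelloJob')
print(result)  # hello from HelloJob
```

With this approach a dotted name such as 'Seed.UserTable' resolves to a module inside a Seed package under Jobs.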

Requests WebReader

WebReader allows you to use the find_element/find_elements methods to retrieve Requests.WebElement instances. These differ from a selenium.WebElement in that interaction with the elements is not supported, but you can traverse the DOM and read information from the elements in the same way. That means you can save resources by using the requests helpers instead of firing up a selenium WebDriver if all you are doing is scraping a non-interactive web page:

from Slack.Helpers.Requests import WebReader

driver = WebReader()

driver.get('http://docs.python-requests.org/en/master/')

section = driver.find_element_by_class_name('section')
print(section.text)
print(section.get_attribute('id'))
print(section.find_element_by_tag_name('p').text)
print([e.text for e in section.find_elements_by_xpath('//*[@class="section"]')])
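To see why a browser isn't needed for static pages, here is a tiny sketch that pulls text out of markup using only the standard library's html.parser. It is unrelated to how WebReader is implemented: ClassTextCollector is a hypothetical helper, it works on an inline HTML string rather than a fetched page, and it only collects text by class name.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {'br', 'hr', 'img', 'input', 'link', 'meta'}


class ClassTextCollector(HTMLParser):
    """Collect the text inside every element carrying a given class name."""
    def __init__(self, class_name):
        HTMLParser.__init__(self)
        self.class_name = class_name
        self.depth = 0      # greater than 0 while inside a matching element
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get('class') or '').split()
        if self.depth:
            self.depth += 1
        elif self.class_name in classes:
            self.depth = 1
            self.texts.append('')

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.texts[-1] += data


PAGE = (
    '<div class="section" id="intro"><h1>Title</h1><p>First paragraph.</p></div>'
    '<div class="other">skip me</div>'
)

collector = ClassTextCollector('section')
collector.feed(PAGE)
print(collector.texts)  # ['TitleFirst paragraph.']
```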
