Commits on Jul 17, 2017
  1. Make export multithreaded

    The vast majority of the time spent during sailthru exports is spent
    waiting for Sailthru. This adds a thread pool for blast exports.
    Sailthru only seems to run 12 jobs at once, so we conservatively
    limit the thread count to 8.
    While doing these tests, I accidentally started a bunch of jobs, which
    apparently are impossible to cancel, so I lowered the THREAD_COUNT when
    dry_run is set to prevent others from making the same mistake.
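    A minimal sketch of the approach described above, assuming hypothetical
    `export_blasts`/`export_one` names (the commit does not show the actual
    function names):

```python
# Sketch of the thread-pool export; names are illustrative, not the
# actual export script's API.
from concurrent.futures import ThreadPoolExecutor

THREAD_COUNT = 8           # Sailthru appears to run ~12 jobs at once; stay under.
DRY_RUN_THREAD_COUNT = 2   # fewer accidental jobs when dry_run is set

def export_blasts(blast_ids, export_one, dry_run=False):
    """Run export_one(blast_id) for each blast in a bounded thread pool."""
    workers = DRY_RUN_THREAD_COUNT if dry_run else THREAD_COUNT
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(export_one, blast_ids))
```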
    Test Plan:
     - `./ -nv export`
     - Went to and saw multiple jobs run.
     - Pressed ctrl+c once one finished.
     - Checked temporary folder for exported files.
     - Checked global variables and confirmed that
       `_get_sailthru_timezone_utc_offset` is thread safe.
    Reviewers: ragini
    Reviewed By: ragini
    Subscribers: tom
    Differential Revision:
    jnetterf committed Jul 17, 2017
  2. Add debugging options

    This adds two flags not intended to be used in production:
     - `--dry_run`: Does not upload data to BQ.
     - `--keep_temp`: Keeps the temporary folder for debugging.
    The second implies the first, as the temporary files are the only way to access the data from Sailthru in a dry run.
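    The flag handling could be sketched as follows; the `parse_args` helper
    and parser wiring are assumptions, only the flag names and the
    implication come from the commit:

```python
# Sketch: --keep_temp implies --dry_run, since keeping the temp folder
# is only meaningful when nothing is uploaded to BQ.
import argparse

def parse_args(argv):
    parser = argparse.ArgumentParser()
    parser.add_argument('--dry_run', action='store_true',
                        help='Do not upload data to BQ.')
    parser.add_argument('--keep_temp', action='store_true',
                        help='Keep the temporary folder for debugging.')
    args = parser.parse_args(argv)
    if args.keep_temp:
        args.dry_run = True   # --keep_temp implies --dry_run
    return args
```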
    Test Plan:
     - Commented out `import bq_util`
     - `./ -n campaigns --status sent --start_date 'January 1 2010' --end_date 'July 14 2017'`
     - Looked at data in temporary folder.
     - `./ -n blast --blast_id 10049846`
     - Looked at data in temporary folder.
    Reviewers: ragini
    Reviewed By: ragini
    Subscribers: tom
    Differential Revision:
    jnetterf committed Jul 13, 2017
  3. cache sailthru settings

    Summary: In export mode, we only want to load settings once.
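    A minimal sketch of the caching, with a hypothetical `fetch` callable
    standing in for the real settings loader:

```python
# Sketch of loading settings once per process; the loader name and
# signature are assumptions, not the script's actual API.
_settings_cache = None

def get_sailthru_settings(fetch):
    """Return cached settings, calling fetch() only on first use."""
    global _settings_cache
    if _settings_cache is None:
        _settings_cache = fetch()
    return _settings_cache
```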
    Reviewers: ragini
    Reviewed By: ragini
    Subscribers: tom
    Differential Revision:
    jnetterf committed Jul 12, 2017
Commits on Jul 15, 2017
  1. merge from upstream phabricator

    csilvers committed Jul 15, 2017
Commits on Jul 12, 2017
  1. Make schema for the Sailthru to Bigquery pipeline more precise.

    In particular, this uses TIMESTAMP instead of STRING for fields that are
    timestamps, and makes fields that contain multiple values load in REPEATED mode.
    Test Plan:
     - ./ blast --blast_id 8262943
     - In,
       verified that schema matched what was expected
     - In bq UI:
    	SELECT COUNT(first_ten_clicks) WITHIN RECORD as click_count,
    	       COUNT(first_ten_clicks_time) WITHIN RECORD as click_time_count,
    	       FIRST(first_ten_clicks) WITHIN RECORD as first_click,
    	       FIRST(first_ten_clicks_time) WITHIN RECORD AS first_click_time,
    	FROM []
    	WHERE open_time IS NOT NULL
     - Made sure that all timestamp columns had at least one non-null cell, except for purchase_time,
       since we don't have donation information in Sailthru
     - Downloaded the blast report manually, and made sure a random row matched with the newly imported data.
     - Verified that the timestamp in bq for a user who clicked on a link in 10049846 matched what was shown in Sailthru (note: I had to use a later blast, because the Sailthru UI only shows recent clicks)
    Reviewers: ragini
    Reviewed By: ragini
    Subscribers: johnsullivan, tom
    Differential Revision:
    jnetterf committed Jul 10, 2017
  2. Use the "schema_update_option" to allow adding new sailthru campaign …

    The sailthru export to bigquery is failing because it can't change the
    schema on the table it's trying to write (the campaigns table).  I
    think that if we specify the
    `--schema_update_option=ALLOW_FIELD_ADDITION` flag it will
    automatically patch the schema as new fields get added.
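    Conceptually, `ALLOW_FIELD_ADDITION` lets a load job append brand-new
    fields to the table schema without changing or dropping existing ones.
    A rough sketch of that merge (this mimics the behavior; it is not the
    bq API):

```python
# Conceptual sketch of what ALLOW_FIELD_ADDITION permits: fields present
# in the incoming data but absent from the table schema get appended;
# existing fields are left untouched.
def patch_schema(table_schema, load_schema):
    existing = {f['name'] for f in table_schema}
    return table_schema + [f for f in load_schema if f['name'] not in existing]
```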
    Test Plan:
    - fingers crossed (we're about to have a daily run of this, so we can
    just try it out on that)
    Auditors: ragini
    cjfuller committed Jul 12, 2017
Commits on Jul 5, 2017
  1. Update indentation

    Update some formatting in the sailthru_campaign_export_schema.json file
    which was giving errors.
    Auditors: csilvers
    Test Plan: Fingers crossed
    guptaragini committed Jul 5, 2017
Commits on Jul 3, 2017
  1. Fix failure due to newly added field

    Summary: Fixed the script that creates the campaigns table, which was
    failing due to the newly added Sailthru field "public_url".
    Test Plan: Fingers crossed
    Auditors: #infrastructure
    guptaragini committed Jul 3, 2017
Commits on Jun 30, 2017
  1. Add react_render_server.failure_pct as a stat we send over.

    We have to go through graphite for this because stackdriver can't
    compute the ratios itself.
    If I was better at graphite I'd be able to send over failure
    information for the most failing-est routes as well.  We'd need to add
    metric-labels to graphite_bridge, though.
    Auditors: nabil, colin
    Test plan:
    Fingers crossed.  I'll try to run this manually to at least make sure
    it doesn't crash.
    csilvers committed Jun 30, 2017
Commits on Jun 26, 2017
  1. Use `GROUP EACH BY` for the OOM query in

    This commit fixes an issue where the query to calculate OOMs by route is
    failing with "resources exceeded".
    The query in question doesn't do any joins, which are the usual
    culprit, so the only remaining candidates are the `GROUP BY` and
    `ORDER BY` clauses.  Neither changing the `GROUP BY` to `GROUP EACH BY`
    nor removing the `ORDER BY` helped, so the only remaining thing to do
    is to decrease the amount of data we keep in memory relative to the
    resources allocated.
    The most straightforward way to do this was to remove the
    `NEST(app_logs.message) WITHIN RECORD` that I added in D35735, and
    replace it with a less general solution for each query that works with
    something that's smaller than the full complement of messages for each record.
    For the OOM query, I replaced it with summing integers for whether each
    message contains the OOM string.  For the memory profiling query (which
    also hit resources exceeded after I fixed the first one), I replaced it
    with summing up the memory total and added figures across all messages
    in the record (which should all be 0, except for the message where the
    memory profiling info appears).
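    The per-record replacement described above can be sketched in plain
    Python; the OOM marker string here is illustrative, not the exact one
    in the query:

```python
# Sketch of the aggregation that replaced NEST(...) WITHIN RECORD: sum an
# indicator per message instead of keeping every message in memory.
OOM_MARKER = 'Exceeded soft private memory limit'  # illustrative marker

def count_oom_messages(messages):
    """Count how many log messages in a record contain the OOM string."""
    return sum(1 for m in messages if OOM_MARKER in m)
```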
    Test Plan:
    - `./ --dry-run --date 20170625`
    - verify this completes successfully, whereas before it would hit
      resources exceeded every time.
    Auditors: benkraft, csilvers
    cjfuller committed Jun 26, 2017
Commits on Jun 18, 2017
  1. merge from upstream phabricator

    csilvers committed Jun 18, 2017
Commits on Jun 14, 2017
  1. Allow one of our projects to be missing the totalBilledBytes field

    The gae_dashboard_stats job has been failing on toby, because the
    bigquery logs export table for the khan-academy project is missing the
    ...totalBilledBytes field.  Looking at the list of jobs for that
    project, it seems like we just haven't had a successful job run in that
    project yet today, so we don't actually have that field in the logs yet.
    This revision makes it so that we ignore (logging a warning) that error
    if it happens for some subset (i.e. not all) of our projects being cost
    monitored.  My rationale here was that it's much more unlikely we
    haven't had a successful job today in any of our projects, so if we see
    this condition, it may mean something is wrong that needs human attention.
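    The warn-vs-raise policy could look roughly like this;
    `fetch_all_billed_bytes` and `fetch_one` are hypothetical names:

```python
# Sketch: tolerate a missing totalBilledBytes field in a subset of
# projects (warn and skip), but raise if every project is missing it.
import logging

def fetch_all_billed_bytes(projects, fetch_one):
    """fetch_one(project) returns billed bytes, or raises KeyError when
    totalBilledBytes is missing from that project's logs export table."""
    results, failures = {}, []
    for project in projects:
        try:
            results[project] = fetch_one(project)
        except KeyError:
            logging.warning('missing totalBilledBytes for %s; skipping', project)
            failures.append(project)
    if failures and not results:
        raise RuntimeError('all projects missing totalBilledBytes: %s' % failures)
    return results
```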
    Test Plan:
    - `gae_dashboard/ --dry-run --date 20170614`
    - verify that since it is currently missing the field only in the
      khan-academy project, it logs a warning but continues, and that it
      says it would send the stats to graphite.
    - remove the deductive-jet project from the list of projects to check,
      so that khan-academy is now the only project
    - repeat the command, verify it now raises an exception because all
      projects failed with this message
    Reviewers: nabil
    Reviewed By: nabil
    Subscribers: tom
    Differential Revision:
    cjfuller committed Jun 14, 2017
Commits on Jun 6, 2017
  1. Merge branch 'master' into nabil

    Nabil Hassein committed Jun 6, 2017
  2. Fix field_name to fetch bigquery cost numbers from bigquery.

    It appears google renamed this field within the last week.
    I'm not sure if there is some announcement of these changes we
    could use to make changes in advance rather than after jobs break.
    This commit also includes minor whitespace changes to satisfy the linter,
    which is more strict now than the last time this file was modified.
    Test Plan:
    `gae_dashboard/ --dry-run --verbose`
    Before the change, this command failed with the same error as we see
    in the toby-admin mailing list, involving the missing field.
    After the change, the cost is fetched successfully.
    Reviewers: csilvers, benkraft, colin, amos
    Reviewed By: csilvers
    Differential Revision:
    Nabil Hassein committed Jun 6, 2017
Commits on May 27, 2017
  1. merge from upstream phabricator

    csilvers committed May 27, 2017
Commits on May 18, 2017
  1. merge from upstream phabricator

    csilvers committed May 18, 2017
Commits on May 16, 2017
  1. Change nightly usage reports to account for requests split into multiple entries

    In our nightly reports of instance hours / memory / rpcs / etc. the route
    `None` has suddenly been appearing at a much higher level.  It turns out the
    main thing that's been going on here is that we've been seeing more frequent
    occurrences where the logging API splits the logs for a single request into
    multiple entries.  Since only one of these entries will have a KALOG line,
    all the other logs for the same request will be bucketed into
    the `None` route.
    Additionally it turns out that when the logs are split, the latency field does
    not report on the segment of time corresponding to the log lines in an entry,
    but the entire time from the start of the request to when that log entry was
    emitted.  This means that we were counting time in the `None` bucket multiple
    times when there were more than 2 log entries per request.
    All of this can be fixed by deduplicating on request id before operating on the
    logs.  When deduplicating, we get the value for a field using `FIRST` which
    takes the first non-null value in a group.  In some queries, no change was
    necessary, as they were filtering only down to fields based on KALOG, which
    will only exist in one entry anyway.
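    The deduplication step can be sketched as follows, mimicking `FIRST`
    over the non-null values in each group (names are illustrative, not
    the report script's actual code):

```python
# Sketch: merge log entries by request id, keeping the first non-None
# value seen for every field -- the Python analogue of grouping by
# request id and taking FIRST over non-null values.
def dedup_by_request_id(entries):
    """entries: list of dicts, each with a 'request_id' key.
    Returns one merged dict per request id."""
    merged = {}
    for entry in entries:
        row = merged.setdefault(entry['request_id'], {})
        for field, value in entry.items():
            if row.get(field) is None and value is not None:
                row[field] = value
    return list(merged.values())
```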
    I also incidentally changed `COUNT(*)` to `COUNT(1)` in a couple places.  This
    doesn't actually make a difference in these queries, but this makes it clearer
    that our intent is to count rows in the subquery so that future refactorers
    won't have to worry about figuring out whether there are any nested/repeated
    fields in scope in the inner query that would change the result.
    Test Plan:
    - hard code `dry_run` to `True` in the send_to_stackdriver and send_email
      functions so that we can test running the queries for real, but not actually
      send the reports
    - `gae_dashboard/ --date 20170515`
    - verify that the results look largely the same as the actual reports for 5/15
      in the infrastructure-blackhole archive, except that the `None` route is much
      further down the list
    Reviewers: benkraft, nabil
    Reviewed By: benkraft, nabil
    Subscribers: tom
    Differential Revision:
    cjfuller committed May 16, 2017
  2. Lint cleanup in

    This fixes some indentation issues in so that we can remove
    the linter disable rule.  The main issue was that there was a gnuplot script
    literal requiring string interpolation with a dict indented flush with the left
    margin. The required indentation of the dict contents was off from what looks
    normal for python, so I wrapped the gnuplot script with `textwrap.dedent` and
    indented it at the normal level of the function that contains it.
    Test Plan:
    - `gae_dashboard/ --date 20170515 --dry-run`
    - verify it completes successfully
    Auditors: benkraft
    cjfuller committed May 16, 2017
Commits on Apr 20, 2017
  1. Resolve relative path errors

    Python was having problems finding files correctly due to differences in
    the path definitions locally and on toby. I have now used relative paths
    instead, which should resolve these errors.
    Test Plan: Run the scripts manually to test if they work fine.
    Subscribers: tom
    Auditors: csilvers, colin
    Differential Revision:
    guptaragini committed Apr 20, 2017
Commits on Apr 19, 2017
  1. merge from upstream phabricator

    tzjames committed Apr 19, 2017
Commits on Apr 17, 2017
  1. Send data from Sailthru to Bigquery

    Sending data from Sailthru using its APIs to create a campaigns table
    and individual blast tables in bigquery.
    Made changes based on previous review comments.
    Removed secrets from this repo; we now get them from the webapp. This
    diff is dependent on another diff which will be sent along with this.
    That diff adds this script to run on a schedule using cron and also
    enables creating a secrets file from where the secrets will be used.
    Combined calling both the campaigns generation function and the
    individual blasts generation function into one script. Made some
    formatting changes as well.
    Changes in how the script is called using argparse, using
    tempfile.mkdtemp() and some other minor changes.
    Test Plan:
      python gae_dashboard/ --options send_blast_details_to_bq --blast_id <blast ID>
      python gae_dashboard/ --options send_campaign_report --status <status> --start_date <date> --end_date <date>
    Reviewers: craig
    Subscribers: tom
    Differential Revision:
    guptaragini committed Feb 18, 2017
Commits on Apr 14, 2017
  1. Lint ignore for khan-linter upgrade

    See D35059.  This change is the output of webapp's:
    `tools/ --todo_target colin ~/khan/internal-webserver`
    Also includes a manual pyflakes fix to prevent overwriting a variable via list
    comprehension, and a substate update of alertlib to pick up lint fixes.
    Test Plan:
    `~/khan/devtools/khan-linter/ .`
    Auditors: csilvers
    cjfuller committed Apr 14, 2017
Commits on Apr 13, 2017
  1. merge from upstream phabricator

    csilvers committed Apr 13, 2017
Commits on Apr 10, 2017
  1. Fix typo in email template.

    Fixed a typo in the weekly performance report email template.
    Test Plan:
    Auditors: nabil
    latteier committed Apr 10, 2017
Commits on Mar 30, 2017
  1. Add charts of precise weekly numbers to email_reliability_metrics. Base graphs from zero. Explain data sources.

    This commit enhances our email reliability metrics html template to include
    html tables showing this week's numbers, in addition to the graphs.
    These tables include a column of hardcoded baseline numbers.
    The graphs have also been updated to always base the y-axis from zero.
    Finally, we add some more explanation of the sources of the data to the email,
    and make assorted stylistic changes.
    Test Plan:
    For convenience, I changed the line in send_email after `if dry_run:` to read:
    `print html`
    I then invoked the script with:
    ./ --dry-run --save-graphs --start-time=1490042503 --end-time=1490647306 > ~/test.html
    I chose those particular dates because they were already contained in my local
    database file, following my work in D34649. I will attach it for reference.
    To reproduce this test, use the same file as your ~/email_reliability_metrics.db.
    The start-time and end-time I passed are the results of, respectively:
    gdate -d "9 days ago" +%s
    gdate -d "2 days ago" +%s
    Which are unix times sometime on March 20 2017 and March 27 2017.
    The hardcoded baseline numbers are based on this same file.
    After running the above command, I opened the html file and verified the table looked
    as I intended, at least in a browser.
    Using this approach, the graphs do not appear inline, but due to the use of the
    --save-graphs flag, I was able to open the png graphs to verify their y-axes now
    always start from zero.
    To verify that the email looks how we intend in the inbox, I will test live on toby.
    Reviewers: amos
    Reviewed By: amos
    Differential Revision:
    Nabil Hassein committed Mar 29, 2017
Commits on Mar 29, 2017
  1. Add optional --start-time, --end-time and --save-graphs flags to emai…

    Previously, certain internal functions allowed custom start and end times,
    while others always used the defaults. This commit consistently allows
    optional arguments for start and end time for all relevant functions in the module,
    and adds the ability to pass unix times for these values at the command line.
    This makes it easier to seed the database with past weeks of data.
    It could prove similarly useful if we alter the data we report.
    For similar convenience, add a flag to save graphs to the disk from the command line.
    Test Plan:
    I deleted my existing ~/email_reliability_metrics.db, and invoked the script
    several times as follows:
    ./email_reliability_metrics --dry-run --start-time=`gdate -d "30 days ago" +%s` --end-time=`gdate -d "23 days ago" +%s`
    ./email_reliability_metrics --dry-run --start-time=`gdate -d "23 days ago" +%s` --end-time=`gdate -d "16 days ago" +%s`
    ./email_reliability_metrics --dry-run --start-time=`gdate -d "16 days ago" +%s` --end-time=`gdate -d "9 days ago" +%s`
    ./email_reliability_metrics --dry-run --save-graphs --start-time=`gdate -d "9 days ago" +%s` --end-time=`gdate -d "2 days ago" +%s`
    This added four entries in my database, as intended.
    Note that the fourth command also passes the flag to save the graphs to disk.
    I examined the generated PNGs and verified that they looked as I expected,
    graphing several weeks of data.
    Reviewers: amos
    Reviewed By: amos
    Differential Revision:
    Nabil Hassein committed Mar 28, 2017
Commits on Mar 28, 2017
  1. Improve email reliability report

    Fix multiple recipients.
    Fix inline images.
    Test Plan:
    Ran on toby and it worked.
    Auditors: nabil
    latteier committed Mar 28, 2017
  2. Make email_reliability_uptime script executable.

    Test Plan: None
    Auditors: amos
    Differential Revision:
    Nabil Hassein committed Mar 28, 2017
  3. Merge branch 'master' into nabil

    Nabil Hassein committed Mar 28, 2017
  4. Save pingdom and service degradation data to a file. Related refactor…

    This commit saves the uptime data from the Pingdom API and the service
    degradation data from the email_uptime script to our _PAGE_LOAD_DATABASE.
    This will make graphing either or both types of data easy, although this
    commit does not add that functionality.
    Relatedly, read from the same file to get the data to put into an email
    template. Because we consider the page load performance graphs the most
    critical part of the report, this was done in such a way that the report
    can still go out if some pingdom or service degradation data is missing.
    Test Plan:
    `python --dry-run`
    Other than that, we will test by deploying to Toby, observing the email,
    and tweaking based on that.
    Reviewers: amos
    Reviewed By: amos
    Differential Revision:
    Nabil Hassein committed Mar 27, 2017
Commits on Mar 24, 2017
  1. Integrate graphs into email.

    We should probably make pingdom and degraded service metrics persistent too.
    Then we can add graphs for them.
    I think we should redo the persistence format. I propose:
        {start_date: "YYYYMMDD",
         end_date: "YYYYMMDD",
         metrics: [
           {type: "TYPE OF METRIC",
            data: ...},
           ...]}
    Each record has a time and a list of metrics. Each metric has a type and some data appropriate for that metric.
    This way we can more easily add metrics, and scan the db by date.
    Here's an example metric:
        {"type: "page_load.by_page",
         "data": [{"page_load_time": 10.122, "page_load_page": "exercise_page"}, {"page_load_time": 5.731, "page_load_page": "curation_page"}, {"page_load_time": 5.547, "page_load_page": "scratchpad_page"}, {"page_load_time": 6.633, "page_load_page": "logged_out_homepage"}, {"page_load_time": 3.903, "page_load_page": "login_page"}, {"page_load_time": 3.968, "page_load_page": "signup_page"}, {"page_load_time": 11.095, "page_load_page": "video_page"}, {"page_load_time": 12.262, "page_load_page": "article_page"}, {"page_load_time": 9.797, "page_load_page": "logged_in_homepage"}]
    Also it would be nice to have a way to send test emails from a dev environment rather than having to deploy to toby to send an email.
    This is especially true because I suspect that we'll want to do a bunch of minor tweaks to the email formatting.
    Test Plan: --dry-run
    Auditors: nabil
    latteier committed Mar 24, 2017
  2. Add a function to build PNG graphs of page load perf from the database file via matplotlib.

    This commit builds on the previous work in this file to actually create
    the graphs to be included in the weekly email report of reliability metrics.
    We return the pngs as strings within a dictionary so they can later be
    attached to an email or inserted into a template.
    Test Plan:
    Locally, I had two entries in my _PAGE_LOAD_DATABASE achieved by invoking
    build_page_load_temp_table, get_weekly_page_load_data_from_bigquery, and
    save_weekly_page_load_data_to_disk both yesterday and today.
    I added the following line near the end of build_graphs_from_page_load_database,
    immediately below the existing call to plt.savefig:
    `plt.savefig('/Users/nabilhassein/' + title + '.png')`
    I examined the resulting images to verify that they looked how I wanted.
    Reviewers: amos
    Reviewed By: amos
    Differential Revision:
    Nabil Hassein committed Mar 24, 2017
  3. merge from upstream phabricator

    csilvers committed Mar 24, 2017