ChiHackNight Meeting Notes

beckeroobonsai edited this page May 11, 2016 · 23 revisions

This document contains brief notes on the progress of this project. Helps communicate completed tasks, dependencies, and new items that were discussed during ChiHackNight meetings.

May 10, 2016

  • Discussion about using an ensemble of models as the final model. Over-fitting can be reduced by averaging the predictions of nine models produced by hold-one-year-out training (over the years 2006-2014). The average of the GBM regression models will be used as the predictor output of the final model. The same will be done for the nine Random Forest classifiers produced by hold-one-year-out training, averaging them for a final prediction. The final predictions of the GBM regression model and the Random Forest classifier can complement each other.

  • Discussion on how to choose a threshold from model validation to then apply to the 2015 test-year data. Decided to require a precision no lower than .45 (chosen because that is the precision of the current model) and then pick the threshold which maximizes recall.
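The two decisions above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names `average_predictions` and `choose_threshold` are hypothetical, the scores and labels are made-up toy values, and in practice `per_model_scores` would hold one list of predictions per hold-one-year-out model.

```python
def average_predictions(per_model_scores):
    """Average scores row-wise across models (one inner list per model)."""
    n_models = len(per_model_scores)
    return [sum(column) / n_models for column in zip(*per_model_scores)]


def choose_threshold(scores, labels, min_precision=0.45):
    """Return (threshold, precision, recall) with the highest recall among
    thresholds whose precision is at least min_precision, or None."""
    total_positive = sum(labels)
    if total_positive == 0:
        return None
    best = None
    for threshold in sorted(set(scores)):
        predicted = [score >= threshold for score in scores]
        tp = sum(1 for p, y in zip(predicted, labels) if p and y)
        fp = sum(1 for p, y in zip(predicted, labels) if p and not y)
        if tp + fp == 0:
            continue
        precision = tp / (tp + fp)
        recall = tp / total_positive
        # Strict comparison keeps the lowest threshold when recall ties.
        if precision >= min_precision and (best is None or recall > best[2]):
            best = (threshold, precision, recall)
    return best


# Toy example: two models instead of nine, five beach-days.
model_a = [1.0, 0.75, 0.5, 0.5, 0.0]
model_b = [0.5, 0.75, 0.5, 0.0, 0.5]
labels = [1, 1, 0, 0, 0]  # 1 = elevated E. coli level

averaged = average_predictions([model_a, model_b])
best = choose_threshold(averaged, labels, min_precision=0.45)
print(averaged, best)
```

The threshold would be chosen on the 2006-2014 validation predictions and then applied unchanged to 2015.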

April 5, 2016

  • :exclamation: We have a model which performs better than the USGS model!
  • :zap: @dgalt looked around and it looks like no one keeps around their forecasts (but how do they know their accuracy!?). We have seen that including daily weather from the day-of (e.g. max temperature from the day-of) does improve model performance. It is likely that a predicted daily weather from, say, 4am the day of will be very representative of the observed weather. Going forward, we should consider two separate models, one that includes daily weather summaries from the day-of and one that does not, and then during the next summer use predicted daily weather summaries and see if performance drops.
  • :+1: Discussion on a different way to do dimensionality reduction, especially regarding the past 24 hours of weather. (INSERT LINK ONCE IT HAS BEEN UPLOADED)
  • :information_source: @chrisprokop discussed the limitations of using "feature importance" from tree-based models to weed out variables. "Feature importance" is simply the number of times that variable was used to split in the tree. If a variable is only used to split once, at the top of every tree in the random forest, then it will have very low "feature importance", but perhaps extremely high practical importance. Similarly, variables which are highly non-linear can have high "feature importance" but perhaps lower practical importance.
  • :soon: Looking ahead to other information sources: There are two very important data sources that are only available in 2015. Namely, the water sensor data and the USGS model predictions. After we have done our due diligence and made models using only data available from 2006-2015, we can investigate how useful those two data sources might be in models going forward.
  • :information_source: @kbrose mentions that using a partial AUC as a measure of model performance is going to be more informative than taking the AUC over the entire range of FPRs. Reasonable bounds would probably be FPRs in the range [0, .05].
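A partial AUC restricted to low false positive rates can be sketched as below. This is a minimal illustration, not the project's implementation: `partial_auc` is a hypothetical name, the area is not normalized (its maximum possible value equals `max_fpr`), and the toy scores are invented.

```python
def partial_auc(scores, labels, max_fpr=0.05):
    """Area under the ROC curve for FPR in [0, max_fpr] (trapezoid rule)."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")

    # Build ROC points by sweeping the threshold from high to low scores.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last member of a tied-score group.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))

    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Clip the segment at max_fpr using linear interpolation.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# A perfectly ranked toy example reaches the maximum possible area, max_fpr.
perfect = partial_auc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0], max_fpr=0.05)
print(perfect)
```

Setting `max_fpr=1.0` recovers the ordinary AUC, which makes it easy to compare the two measures on the same predictions.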

March 8, 2016

:exclamation: Very important notes:

  • The group made some critical decisions on the common test/train/validation framework (per issue #15). Namely:
*Train* period: 2006 - 2014
*Test* period (e.g., use random seeding / _k_-fold methods): 2006 - 2014
*Validation*: 2015 (leave out, fit predictions from 2006-2014 to 2015 actual observations)
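One way to picture this framework is the sketch below. It assumes each row is a dict with a `year` key; the function names, the choice of `k=3`, and the toy rows are illustrative only and are not taken from the repository.

```python
import random


def split_by_year(rows, first_year=2006, last_year=2014,
                  validation_year=2015, k=3, seed=0):
    """Return (folds, validation): k seeded folds from the modeling years,
    plus the held-out validation year."""
    modeling = [r for r in rows if first_year <= r["year"] <= last_year]
    validation = [r for r in rows if r["year"] == validation_year]
    shuffled = modeling[:]
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    folds = [shuffled[i::k] for i in range(k)]
    return folds, validation


def train_test_pairs(folds):
    """Yield (train, test) pairs, using each fold once as the test set."""
    for i, test in enumerate(folds):
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        yield train, test


# Toy data: two rows per year from 2005 through 2015.
rows = [{"year": y, "reading": 100 + y % 7}
        for y in range(2005, 2016) for _ in range(2)]
folds, validation = split_by_year(rows)
for train, test in train_test_pairs(folds):
    print(len(train), len(test))
```

The 2015 rows never enter any fold, so they stay untouched until the final comparison against actual observations.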

:information_source: Useful information:

  • @CPecaut added a relevant economic analysis of closing beaches.
  • @DGalt looked at levels by beach (sorted from south-to-north latitude). In particular, looking at days where there are elevated levels, the southern beaches have elevated levels.
  • @nicklucius conducted some PCA analysis (in #42):

February 23, 2016

:+1: Notable progress:

Since the last meeting, several very good models have been created. @beckeroobonsai used a GBM that showed tremendous promise:

A discussion started around the correct number and names of beaches. After the discussion, the team decided to do two things:

  • Reconcile which beaches should be included in the final list of beaches.
  • Shorten the names of beaches to a more compact form to make them easier to work with in code.

:exclamation: Important: Right now, there is a difference between the Python and R code when it "shifts" data. The Python code shifts back to the previous day while the R code shifts to the previous observation. For instance, a Monday typically does not have a reading from the prior day (Sunday). In the Python code, the shifted value would read as "NA" or "NULL". In the R code, however, it reverts to the previous reading (e.g., Friday) and will not be null.

Right now, we will maintain the difference to see if one leads to a better result instead of standardizing on a specific approach. @kbrose will modify the Python function to provide an option to do either.
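The difference between the two behaviors, and an option to pick either, might look like the following sketch. It is not the project's actual function: `add_prior_reading`, the `mode` values, the row layout, and the toy readings are all assumptions made for illustration.

```python
from datetime import date, timedelta


def add_prior_reading(rows, mode="previous_day"):
    """Return copies of rows with a 'prior_reading' column.

    mode="previous_day": value from exactly one calendar day earlier, else None.
    mode="previous_observation": value from the most recent earlier reading.
    """
    by_beach = {}
    for row in rows:
        by_beach.setdefault(row["beach"], []).append(row)

    shifted = []
    for beach_rows in by_beach.values():
        beach_rows = sorted(beach_rows, key=lambda r: r["date"])
        by_date = {r["date"]: r["reading"] for r in beach_rows}
        for i, row in enumerate(beach_rows):
            if mode == "previous_day":
                prior = by_date.get(row["date"] - timedelta(days=1))
            elif mode == "previous_observation":
                prior = beach_rows[i - 1]["reading"] if i > 0 else None
            else:
                raise ValueError(f"unknown mode: {mode}")
            shifted.append({**row, "prior_reading": prior})
    return shifted


# Toy data: a Friday and the following Monday, with no weekend readings.
rows = [
    {"beach": "Montrose", "date": date(2015, 6, 5), "reading": 120.0},
    {"beach": "Montrose", "date": date(2015, 6, 8), "reading": 300.0},
]
by_day = add_prior_reading(rows, mode="previous_day")
by_obs = add_prior_reading(rows, mode="previous_observation")
print(by_day[1]["prior_reading"], by_obs[1]["prior_reading"])
```

With the previous-day mode the Monday row has no prior value, while the previous-observation mode fills it with Friday's reading.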

February 9, 2016

:+1: Some findings from the weekend work:

  • The repo contains the weather data from forecast.io in the data/ExternalData/forecastio_daily_weather.csv file.
  • Holidays are also included in data/ExternalData/Holidays.csv file
  • The naive model has little predictive value (thanks @melissamcneill @kbrose @jonschoning)
  • It appears that some beaches are systematically sampled earlier than others. This graph shows the sample times by beach over the course of summer 2015. Despite the variation in some of the lines, the order (the order of colors) tends to be the same. The order appears to have two components: an approach from the south (the flatter lines) and an approach from the north (the lines that move in parallel). See issue #39 (figure: Sample times by beach).
  • This graph shows the relationship between E. coli levels (log) and reading times (thanks @kbrose). Figure: Reading time vs. log(E. coli)
  • Added the ability to read data with Python 3 with read_data3.py file (thanks @beckeroobonsai)
  • Below is the output of comparing several analytical models: black-solid = naive model (previous geomean); black-dashed = naive model (previous high reading); blue = previous geomean + icon (summary of day weather); red = previous geomean + day of week + month; green = previous geomean + icon + day of week + month + client.id (beach). Figure: Precision-Recall of simple GLM models

:shipit: Finished during the meeting:

  • @kbrose - merged pull request so Python code has weather, beach, and holiday information
  • @tomschenkjr - completed some analysis on the number of days between readings at beaches.
Days | 1 | 2 | 3 | 4 | 8
----- | ----- | ----- | ----- | ----- | -----
Count | 1203 | 24 | 232 | 38 | 1

:heavy_check_mark: Work in progress:

  • @mcsweeney - working on normalizing the name of beaches (issue #22)
  • Working on "shifting" data, so a row also contains historical observations (issue #47)
    • Function inputs: function(number_of_observations, original_data_frame, names_of_columns_to_shift)
    • Function outputs: All of the original data; additional columns that are shifted for number_of_observations since a given beach-day.
    • Later options: May want to return everything between beach-day and number_of_observations before, with an option to average/smooth them.
      • @beckeroobonsai - Working on this in Python
      • @nicklucius - Working on this in R
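The function described above could be sketched as follows, using lists of dicts in place of a data frame. The argument order follows the notes; the `_lagN` column naming, the `group_key` and `order_key` parameters, and the previous-observation semantics are assumptions for illustration.

```python
def shift_columns(number_of_observations, rows, names_of_columns_to_shift,
                  group_key="beach", order_key="date"):
    """Return all original data plus '<name>_lagN' columns holding the value
    from N observations earlier at the same beach (None when unavailable)."""
    groups = {}
    for row in rows:
        groups.setdefault(row[group_key], []).append(row)

    shifted = []
    for group in groups.values():
        group = sorted(group, key=lambda r: r[order_key])
        for i, row in enumerate(group):
            new_row = dict(row)
            for lag in range(1, number_of_observations + 1):
                for name in names_of_columns_to_shift:
                    prior = group[i - lag][name] if i - lag >= 0 else None
                    new_row[f"{name}_lag{lag}"] = prior
            shifted.append(new_row)
    return shifted


# Toy data: ISO date strings sort chronologically.
rows = [
    {"beach": "A", "date": "2015-06-01", "reading": 10},
    {"beach": "A", "date": "2015-06-02", "reading": 20},
    {"beach": "A", "date": "2015-06-03", "reading": 30},
]
result = shift_columns(2, rows, ["reading"])
print(result[2])
```

Averaging or smoothing over the lag columns, as mentioned under "Later options", could be layered on top of this output.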

February 2, 2016

  • @mncneill - working on a naive model using high reading and low reading as a predictor for tomorrow.
  • @kbrose is going to work on a multilevel model/HLM:
    • Level 1: Day-to-day readings
    • Level 2: Beaches
  • @nicklucius - working on Principal Component Analysis (PCA)
  • Note: Pratt beach is no longer present under that name; it is now named Toby Prince.

:+1: Major accomplishments:

  • Discovered that the basic 5-day moving average (MA) model over-identifies positives
  • A naive model using the geometric mean from the last reading (logistic regression) is a significant predictor, but with a weak coefficient (1.8e-4).
  • It appears the log of the readings helps normalize the data (new threshold is approximately 5.5)
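The "approximately 5.5" figure is consistent with taking the natural log of a 235 CFU/100 mL advisory level. That level is an assumption here (it is not stated in these notes), as is the floor of 1 used to avoid taking the log of zero. A minimal sketch:

```python
import math

# Assumption: the elevated-level cutoff on the original scale is 235.
ASSUMED_THRESHOLD = 235.0
LOG_THRESHOLD = math.log(ASSUMED_THRESHOLD)  # natural log


def log_reading(reading, floor=1.0):
    """Natural log of a reading; the floor is a hypothetical guard against
    zero readings, for which the log is undefined."""
    return math.log(max(reading, floor))


def is_elevated(reading):
    """Compare on the log scale against the transformed threshold."""
    return log_reading(reading) >= LOG_THRESHOLD


print(round(LOG_THRESHOLD, 2))
print(is_elevated(300.0), is_elevated(50.0), is_elevated(0.0))
```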

January 26, 2016

  • Need to eliminate rows that do not have weather sensor data
  • Beach readings tend to be correlated across beaches, showing a general rise and fall. However, some beaches seem to portend the movement of other beaches.
  • The analysis in #37 shows there is a relationship between disagreements (defined as 2 std. dev. above the typical std. dev. between Reading.1 and Reading.2) and elevated levels of E. coli.

:+1: Major accomplishments:

  • Merged Kevin's pull request that incorporated a Python codebase
  • Identified further lines of inquiry

:heavy_check_mark: Next steps

  • Look at the relationship between "disagreement", Reading.1, Reading.2, and elevated levels
  • Have the "data mining" team continue to look at potential variables.
  • Continuing to work on PR #35 and eliminate large CSV files that have already been committed to the request. Need to review pull request.

January 12, 2016

Merged pull request #30, which closed #23 by merging DrekBeach data into the historical lab results

The team began to look at some rudimentary analysis for 2015. First, calculating the confusion matrix to see where errors occur and which types of errors are being encountered:

Confusion matrix:

                                      | Predicted Elevated Levels              | Predicted Non-elevated Levels
------------------------------------- | -------------------------------------- | --------------------------------------
Actual Elevated Levels (positive)     | 13/200 = 6.5% (true positive)          | 187/200 = 93.5% (false negative)
Actual Non-elevated Levels (negative) | 16/1,318 = 1.2% (false positive)       | 1,302/1,318 = 99% (true negative)
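The counts in the confusion matrix translate directly into precision and recall. The sketch below uses the four counts from the table (13, 187, 16, and 1,302); the function name is hypothetical.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Summary rates derived from the four confusion-matrix counts."""
    return {
        "precision": tp / (tp + fp),            # predicted elevated that were elevated
        "recall": tp / (tp + fn),               # actual elevated that were caught
        "false_positive_rate": fp / (fp + tn),  # non-elevated flagged as elevated
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
    }


metrics = confusion_metrics(tp=13, fn=187, fp=16, tn=1302)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Precision comes out near .45, which is consistent with the precision floor discussed in the May 10 notes, while recall is only 6.5%.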

Then, the team began to look at residuals (predicted value - actual value). The results indicate that the predictions consistently underestimate actual values (negative residuals). Below is a summary for 2015:

Minimum | 1st Quartile | Median | Mean | 3rd Quartile | Maximum
------- | ------------ | ------ | ---- | ------------ | -------
-2388.00 | -50.00 | 2.65 | -70.69 | 19.70 | 396.40

:+1: Major accomplishment: Calculated baseline performance data and analyzed residuals

:heavy_check_mark: Next steps:

January 5, 2016

Went over some new (very minor) issues with the data — most require no new action because current code already handles them. For a few of these issues, for example days that have 3 or 4 readings, we’ll need to decide how to handle them. Matt has these documented and we can discuss next week.

While examining the DrekBeach data we realized there must be swim advisories for reasons beyond water quality. In fact, the Park District says “Swim advisories are issued for potentially hazardous weather or water quality conditions” (http://www.chicagoparkdistrict.com/faq/#262). This closes issue [#10](https://github.com/Chicago/e-coli-beach-predictions/issues/10). Shall we discard the SwimStatus variable, then?

DrekBeach data has been integrated. However, I’m going to normalize the names of beaches #22 and this will affect the code created for #23. @mesweeney4 will send a pull request which will close both issues before next Tuesday. Also, there are 61 instances where the DrekBeach geo.mean is off by more than one compared to the corresponding beach_readings geo.mean…

The significant outlier has been removed — will be included in @mesweeney4 pull request, which will close #24.

The NU students will begin working on issue #1 (potential variables).

:+1: Major accomplishment: Partial integration of DrekBeach data

:heavy_check_mark: Next steps:

  • The NU students will begin working on issue #1 (potential variables).
  • @mesweeney4 will work on #24

December 15, 2015

We've begun to focus more on the R code, as it now seems to handle the data errors better than the original Python script.

It appears that @IrvicRodriguez may have already written Python code to normalize the names of beaches (issue #22). Awaiting the Python pull request--the code will be kept in Python.

After looking at some data quality issues (see #26), we decided to remove data from 2006-07-06, 2006-07-08, and 2006-07-09. Those days had a few issues: the header column was duplicated and it did not contain two readings.

@tomschenkjr is reorganizing the code. No "new" code will be written; the work reorganizes the code already contributed to the project. Current progress on this task is in the dev branch. The objectives of the reorganization are:

  • Functions which clean data--for instance, split_sheets.R--will be placed within the lower directories of the repository.
  • Steps which clean data should be functions, for instance, clean(x, ...).
  • The analysis.R code will call the data cleaning functions. This will allow the code to be replicated in a single file instead of requiring multiple files.

Although analysis.R is an R-file, this project will still accept code from other languages. Will leverage shell commands or R-interpreters to call those files.

:+1: Major accomplishment: Completed task #26--see above.

:heavy_check_mark: Next steps:

  • @tomschenkjr will ask the Park District (a) whether historical forecasts of beach advisories are available and (b) whether there are swim advisories that aren't due to forecasted elevated E. coli levels.
  • @mesweeney4 will work on #23 and merge DrekBeach data into the data set (will only be 2015 values)
  • Need to normalize the name of the beaches #22
  • @tomschenkjr is reorganizing code (per above) #27

Next meeting will be held January 5, 2016.

December 8, 2015

Discovered that there is a high level of disagreement between reading 1 and reading 2. For instance, an initial reading may be zero ppm, but the second reading may be at elevated levels (e.g., over 300 ppm). This raises a question about how the Park District deals with large differences between the readings.

Data is both right- and left-censored. Many readings are left-censored near zero ppm. Meanwhile, the data is also moderately right-censored at ~2419 ppm. Any analytical model will need to account for both.

The Python script is going to be deprecated as it has some issues. The raw data is rather messy, which the Python script is not handling. Will be moving to the stack_sheets.R script

The data has several more points of inconsistency:

  • Sometimes both readings are blank, but the lab tests show the geometric mean.
  • Occasionally, neither beach reading is provided, but a geometric mean is still provided.
  • The maximum values are somewhat inconsistent, ranging from 2419.6 to slightly above 2420.
  • There is a value that is around 6,000 ppm.

:+1: Major accomplishment: A better understanding of how the data is censored. Namely, there is substantial left-censoring.

:heavy_check_mark: Next steps:

December 1, 2015

Discovered that data files differ by year:

  • 2002, 2003, and 2004 data files only contain the measurements from a single reading.
  • The 2005 file contains additional fields:
    • A "Laboratory ID"
    • A "Units" field
    • A sample collection time
  • Beginning in 2006 and onward, additional fields were added:
    • Two readings were added
    • An "Escherichia coli" field was added, which is the geometric mean of the two readings fields
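For two readings, the geometric mean is the square root of their product. The sketch below recomputes it and flags rows where a recorded mean disagrees with the recomputed one. The function names and the default tolerance of 1 are illustrative assumptions, and the example values are invented.

```python
import math


def geometric_mean(reading_1, reading_2):
    """Geometric mean of two non-negative readings."""
    return math.sqrt(reading_1 * reading_2)


def mean_mismatch(reading_1, reading_2, recorded_mean, tolerance=1.0):
    """True when the recorded mean differs from the recomputed geometric
    mean by more than the tolerance."""
    return abs(geometric_mean(reading_1, reading_2) - recorded_mean) > tolerance


print(geometric_mean(100.0, 400.0))
print(mean_mismatch(100.0, 400.0, recorded_mean=200.5))
print(mean_mismatch(100.0, 400.0, recorded_mean=250.0))
```

Note that a single zero reading pulls the geometric mean to zero regardless of the other reading.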

Data quality issues: The E. coli measures max out at 2420, and maxed-out values are often denoted as ">2420", which causes errors on data input.
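One way to avoid the input errors is to parse the raw text into a number plus a censoring flag rather than casting directly to a float. This is a sketch only: `parse_reading` is a hypothetical helper, and the handling of a "<" prefix and of thousands separators is an assumption, since the notes only mention ">2420".

```python
def parse_reading(raw):
    """Return (value, censoring) for a raw cell.

    censoring is "right" for values like ">2420", "left" for values like
    "<1", "none" for plain numbers, and None when the cell is blank.
    """
    text = str(raw).strip().replace(",", "")
    if text == "":
        return (None, None)
    if text.startswith(">"):
        return (float(text[1:]), "right")
    if text.startswith("<"):
        return (float(text[1:]), "left")
    return (float(text), "none")


for cell in [">2420", "2419.6", "<1", "", 135]:
    print(repr(cell), parse_reading(cell))
```

Keeping the flag alongside the value lets a later model treat capped readings differently from exact ones.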

:+1: Major accomplishment: Determined that data from 2008 and onward should be included in the model--removing earlier data.

:heavy_check_mark: Next steps:

November 17, 2015

Introduced new participants to the repo and data set.

Cleaned-up issues in the Issue Tracker.

:+1: Major accomplishment: created a Python and shell script that converts the multiple [Excel workbooks](../tree/master/data/ChicagoParkDistrict/raw/Standard 18 hr Testing/) (and the multiple sheets within them) to CSVs.
