# This notebook acts as a guide

The overall problem we are trying to solve is to take historical data about a project on GitHub and use it to build a deep-learning model that can predict the project's future activity. We have broken the problem into smaller steps, each of which produces some form of intermediate output that is saved to a file.

Each step is represented by a notebook. These are explained below. Note that all notebooks are found in the same directory as this one: /ideas-uo/machine_learning/predicting_project_activity.

Also note that these steps assume that a single project has been selected to study. The project you select is a parameter to the various notebooks.

## Step 1: wrangling (intake, inversion, compaction)

Notebook: build_days_table.ipynb

This notebook gets data into shape for feature set construction. It has two basic pieces. The first pulls information from the GitHub project into Python lists and dictionaries. The project info comes in organized by individual developer: each developer has a list of the commits they have made over the life of the project. The goal is to invert this structure by pulling out each individual commit as a row in a table. Iterating over the entire developer list gives us all the commits on the project. Some further wrangling is done to build a table with interesting columns. The table is then reordered by date to give a chronological picture of the commits in sequence. Converting this to a pandas table is then straightforward. The reordered table is written out as a CSV file.
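The inversion described above can be sketched roughly as follows. This is a minimal illustration, not the code in build_days_table.ipynb: the `developers` dictionary, its field names (`sha`, `date`, `lines_added`), and the output filename `commits_table.csv` are all assumptions made up for the example.

```python
import pandas as pd

# Hypothetical input: developer -> list of commits. The real structure
# and field names pulled from GitHub may differ.
developers = {
    "alice": [
        {"sha": "a1", "date": "2019-07-03T10:15:00", "lines_added": 12},
        {"sha": "a2", "date": "2019-07-01T09:00:00", "lines_added": 40},
    ],
    "bob": [
        {"sha": "b1", "date": "2019-07-01T17:30:00", "lines_added": 5},
    ],
}

# Invert the structure: one row per commit, tagged with its author.
rows = [
    {"author": author, **commit}
    for author, commits in developers.items()
    for commit in commits
]

# Convert to a pandas table and put the commits in chronological order.
commits_table = pd.DataFrame(rows)
commits_table["date"] = pd.to_datetime(commits_table["date"])
commits_table = commits_table.sort_values("date").reset_index(drop=True)

# Save the intermediate output (filename is an assumption).
commits_table.to_csv("commits_table.csv", index=False)
```

In this sketch the sort is done after building the DataFrame rather than on the Python lists; either order should give the same chronological table.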

The second piece compacts the commits table into a days table. The final days table will have a summarized account (as columns) for each day in the project, from its beginning until the present.

Here are screenshots of the two tables, taken from the slack project.

<img src="https://www.dropbox.com/s/j0wk6v3kn7bbyup/Screenshot%202019-07-14%2009.43.31.png?raw=1">


<img src="https://www.dropbox.com/s/k8tpjymolexlihj/Screenshot%202019-07-14%2010.26.48.png?raw=1">

## Step 2: create the feature set

Notebook: build_feature_set.ipynb

Given the table from step 1, first build a new table that is day-based: each row is a day, and its columns accumulate the individual commits made on that day. For days with no commit activity, produce rows whose accumulated values are 0. In particular, do not skip days.
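One way to get a day-based table with no skipped days is sketched below. It is only an illustration under assumed names: the columns `date` and `lines_added`, the summary columns `commit_count` and `lines_added`, and the `end_day` value are hypothetical, and the actual notebook may accumulate different columns.

```python
import pandas as pd

# Hypothetical commits table; the column names are assumptions.
commits_table = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2019-07-01 09:00", "2019-07-01 17:30", "2019-07-04 10:15"]
        ),
        "lines_added": [40, 5, 12],
    }
)

# Bucket commits by calendar day. Resampling emits a bucket for every
# day between the first and last commit, including empty ones.
daily = commits_table.set_index("date").resample("D")
days_table = pd.DataFrame(
    {
        "commit_count": daily.size(),
        "lines_added": daily["lines_added"].sum(),
    }
)

# Extend the table to a chosen end day (assumed here) so that quiet days
# after the last commit also appear as rows of zeros.
end_day = pd.Timestamp("2019-07-05")
full_range = pd.date_range(days_table.index.min(), end_day, freq="D")
days_table = days_table.reindex(full_range, fill_value=0)
```

The reindex step matters when the table should run until the present: resampling alone stops at the last commit.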

Once a day is compacted, build a feature for that day. A feature is a list of values. The simplest type of feature is one that takes values straight from the columns.
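As a rough illustration of the simplest kind of feature, the sketch below reads each day's values straight from the columns into a list. The table contents and the choice of `feature_columns` are assumptions for the example, not the feature set used in build_feature_set.ipynb.

```python
import pandas as pd

# Hypothetical days table; the column names are assumptions.
days_table = pd.DataFrame(
    {
        "commit_count": [2, 0, 1],
        "lines_added": [45, 0, 12],
    },
    index=pd.date_range("2019-07-01", periods=3, freq="D"),
)

# Columns whose values go directly into each day's feature.
feature_columns = ["commit_count", "lines_added"]

# One feature (a list of values) per day, in chronological order.
features = days_table[feature_columns].to_numpy().tolist()
```

Here `features[0]` is the feature for the first day, holding that day's commit count followed by its lines added.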