## Goal: determining attraction / retention

An important characteristic of a how a community evolves is how contributors are joining and leaving. Determining when contributors leave is not easy (as they could be on a temporary leave, but coming back later), but if after a certain period they are still inactive, it is very likely they can be considered 'gone'. With this definition, the evolution of attraction and retention, and its difference (net gain of developers, which can be negative) can be computed.

These data could be determined for each of the specific contributors groups defined in the first goal. 

This goal can be refined in the following questions:

**Questions**:

* How many contributors are joining the community?
* How many contributos are no longer active (leaving) in the community?
* How is the attraction / retention ratio, and the net gain of contributors, over time?

To answer these questions, the following metrics can be used:

**Metrics**:

* Number of contributors joining the community over time (attracted)
* Number of contirbutors leaving (becoming inactive) over time
* Number of contributors not leaving (retained) over time

These metrics can be computed for each of the "cohorts", defined as the groups of contributors joining during a certain period of time (for example, during each year). Some of these metrics will be computed for the speficied contributor groups, over time.

# Metric Calculations
First we need to load a connection against the proper ES instance. We use an external module to load credentials from a file that will not be shared. If you want to run this, please use your own credentials, just put them in a file named '.settings' (in the same directory as this notebook) following the example file 'settings.sample'.

This section includes common code to manage and plot data. Queries will be available at the corresponding section.

**TODO**: Add bot and merges filtering.

**TODO** : provide plots similar to:

https://analytics.mozilla.community/edit/app/kibana#/dashboard/Community-Analytics-Demographics

In [37]:
from datetime import datetime
import pandas

import plotly as plotly
import plotly.graph_objs as go

from util import ESConnection
from elasticsearch_dsl import Search

es_conn = ESConnection()

In [34]:
def create_search(source):
    s = Search(using=es_conn, index=source)
    # TODO: Add bot and merges filtering.
    #s = s.filter('range', grimoire_creation_date={'gt': 'now/M-2y', 'lt': 'now/M'})
    s.params(timeout=30)
    return s

# Metrics

## Groups of Contributors based on their experience in the Community

Looking at when a given contributor sent her first commit we calculate how long he has been contributing to the community at a given time. We define following groups:
* People with more than **1 month** of experience and less than **6 months**
* People with more than **6 months** of experience and less than **12 months**
* People with more than **1 year** of experience and less than **2 years**
* People with more than **2 years** of experience and less than **4 years**
* People with more than **4 years**


In [21]:
# Define ranges
ranges = [{
           'from': 'now-1M/M',
           'key': '1- Months'
         }, {
           'from': 'now-6M/M',
           'to': 'now-1M/M',
           'key': '1-6 Months'
         }, {
           'from': 'now-12M/M',
           'to': 'now-6M/M',
           'key': '6-12 Months'
         }, {
           'from': 'now-24M/M',
           'to': 'now-12M/M',
           'key':  '1-2 Years'
         }, {
           'from': 'now-48M/M',
           'to': 'now-24M/M',
           'key': '2-4 Years'
         }, {
           'to': 'now-24M/M',
           'key': '4+ Years'
         }]

In [22]:
s = Search(using=es_conn, index='git')
s.params(timeout=30)

# Unique count of Commits by Project (max 100 projects)
s = s.filter('range', grimoire_creation_date={'gt': 'now/M-2y', 'lt': 'now/M'})
s.aggs\
    .bucket('experience', 'date_range', field='grimoire_creation_date', ranges=ranges)\
    .metric('contributors', 'cardinality', field='author_uuid', precision_threshold=100000)

    #.bucket('org', 'terms', field='author_org_name', size=10)\
    #.bucket('time', 'date_histogram', field='grimoire_creation_date', interval='quarter')\
#print(s.to_dict())
result = s.execute()

# In case you need to check response, uncomment line below
print(result.to_dict()['aggregations'])


{'experience': {'buckets': [{'doc_count': 0, 'contributors': {'value': 0}, 'key': '4+ Years', 'to_as_string': '2015-04-01T00:00:00.000Z', 'to': 1427846400000.0}, {'doc_count': 0, 'contributors': {'value': 0}, 'from': 1364774400000.0, 'key': '2-4 Years', 'from_as_string': '2013-04-01T00:00:00.000Z', 'to_as_string': '2015-04-01T00:00:00.000Z', 'to': 1427846400000.0}, {'doc_count': 73684, 'contributors': {'value': 1242}, 'from': 1427846400000.0, 'key': '1-2 Years', 'from_as_string': '2015-04-01T00:00:00.000Z', 'to_as_string': '2016-04-01T00:00:00.000Z', 'to': 1459468800000.0}, {'doc_count': 31469, 'contributors': {'value': 871}, 'from': 1459468800000.0, 'key': '6-12 Months', 'from_as_string': '2016-04-01T00:00:00.000Z', 'to_as_string': '2016-10-01T00:00:00.000Z', 'to': 1475280000000.0}, {'doc_count': 0, 'contributors': {'value': 0}, 'from': 1475280000000.0, 'key': '1-6 Months', 'from_as_string': '2016-10-01T00:00:00.000Z', 'to_as_string': '2017-03-01T00:00:00.000Z', 'to': 1488326400000.0}

In [30]:
# Buckets by author name, finding first commit for each of them
s = Search(using=es_conn, index='git')
s.params(timeout=30)

s = s.filter('range', grimoire_creation_date={'lt': 'now/y'})

s.aggs.bucket('authors', 'terms', field='author_uuid', size=100000) \
    .metric('first', 'top_hits', _source=['author_date', 'author_org_name', 'author_uuid'],
            size=1, sort=[{"author_date": {"order": "asc"}}]) \
    .metric('last_commit', 'max', field='author_date')
s = s.sort("author_date")
#       pprint(s.to_dict())
result = s.execute()

In [39]:
def get_authors_df(result, author_bucket_field):        

        # Get a dataframe with each author and their first commit
        buckets_result = result['aggregations'][author_bucket_field]['buckets']

        buckets = []
        for bucket_author in buckets_result:
            author = bucket_author['key']
            
            first = bucket_author['first']['hits']['hits'][0]
            first_commit = first['sort'][0]/1000
            last_commit = bucket_author['last_commit']['value']/1000
            org_name = first['_source']['author_org_name']
            uuid = first['_source']['author_uuid']
            buckets.append({
                    'first_commit': datetime.utcfromtimestamp(first_commit),
                    'last_commit': datetime.utcfromtimestamp(last_commit),
                    'author': author,
                    'uuid': uuid,
                    'org': org_name
            })
        authors_df = pandas.DataFrame.from_records(buckets)
        authors_df.sort_values(by='first_commit', ascending=False,
                                inplace=True)
        return authors_df
    


get_authors_df(result, author_bucket_field='authors')



Unnamed: 0,author,first_commit,last_commit,org,uuid
3434,1e2828dd94a896d402c26b347cd26489a52a9afe,2016-08-29 12:36:35,2016-08-29 12:36:35,Independent,1e2828dd94a896d402c26b347cd26489a52a9afe
3118,837e2ec2315a7446cccb438e7768be998e530182,2016-08-26 19:09:12,2016-08-29 18:36:10,Independent,837e2ec2315a7446cccb438e7768be998e530182
3076,5acf8b9215f9210c90bfba8f829f60466dfd87a1,2016-08-26 18:05:56,2016-08-29 12:04:21,Independent,5acf8b9215f9210c90bfba8f829f60466dfd87a1
3423,1c907008c5769f535e5c83b0a249aff4f23d3cd3,2016-08-24 12:39:41,2016-08-24 12:39:41,Independent,1c907008c5769f535e5c83b0a249aff4f23d3cd3
2538,4c78e0a3162133de08b507175f4756bd0bcc1a0d,2016-08-23 22:36:46,2016-08-25 14:52:00,Independent,4c78e0a3162133de08b507175f4756bd0bcc1a0d
2951,25229e79153818c38ce482cd93f7e0eb6b4567de,2016-08-23 11:20:26,2016-08-26 11:09:40,Independent,25229e79153818c38ce482cd93f7e0eb6b4567de
3356,128a44bf7ca807ae229f8412f71109d9a2e97385,2016-08-23 10:46:40,2016-08-23 10:46:40,Independent,128a44bf7ca807ae229f8412f71109d9a2e97385
2952,2527617b37ff60fa5fdfea65b623e86b220ce63e,2016-08-22 12:11:55,2016-08-22 13:22:20,Independent,2527617b37ff60fa5fdfea65b623e86b220ce63e
3067,56ab6606091bb46925658fc5162346171c4eba14,2016-08-17 18:37:30,2016-08-18 13:52:34,Independent,56ab6606091bb46925658fc5162346171c4eba14
3770,7dff8571555cfed71a1c26a802bd615a73d4fdf0,2016-08-17 17:56:12,2016-08-17 17:56:12,Independent,7dff8571555cfed71a1c26a802bd615a73d4fdf0
