## Goal: determining attraction / retention

An important characteristic of how a community evolves is how contributors join and leave it. Determining when a contributor has left is not easy (they could be on temporary leave and come back later), but if they are still inactive after a certain period, they can very likely be considered 'gone'. With this definition, the evolution of attraction and retention, and their difference (the net gain of developers, which can be negative), can be computed.

These data could be determined for each of the specific contributor groups defined in the first goal.
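
A minimal sketch of this definition, assuming a dataframe with one row per contributor and `first_commit` / `last_commit` datetime columns. The function name `yearly_flow`, the parameters `inactive_after_years` and `as_of_year`, and the sample data are illustrative assumptions, not part of the original analysis:

```python
import pandas as pd

def yearly_flow(authors, inactive_after_years=1, as_of_year=2017):
    """Count contributors attracted and gone per year, plus the net gain.

    A contributor is counted as gone in the year of their last commit,
    but only if more than `inactive_after_years` years have passed since.
    """
    first_year = authors['first_commit'].dt.year.astype(int)
    last_year = authors['last_commit'].dt.year.astype(int)

    attracted = first_year.value_counts()
    # Recently active contributors may still come back: do not count them as gone
    is_gone = (as_of_year - last_year) > inactive_after_years
    gone = last_year[is_gone].value_counts()

    years = list(range(int(first_year.min()), as_of_year + 1))
    flow = pd.DataFrame({'attracted': attracted, 'gone': gone})
    flow = flow.reindex(years).fillna(0).astype(int)
    flow['net_gain'] = flow['attracted'] - flow['gone']
    return flow

# Hypothetical sample data
authors = pd.DataFrame({
    'first_commit': pd.to_datetime(['2014-03-01', '2014-06-01', '2015-01-10', '2016-05-05']),
    'last_commit': pd.to_datetime(['2014-09-01', '2017-02-01', '2015-11-30', '2016-12-01']),
})
flow = yearly_flow(authors, inactive_after_years=1, as_of_year=2017)
print(flow)
```

The inactivity threshold is the key design choice here: a longer threshold misclassifies fewer people on temporary leave, at the cost of detecting departures later.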

This goal can be refined in the following questions:

**Questions**:

* How many contributors are joining the community?
* How many contributors are no longer active (leaving) in the community?
* What is the attraction / retention ratio, and the net gain of contributors, over time?

To answer these questions, the following metrics can be used:

**Metrics**:

* Number of contributors joining the community over time (attracted)
* Number of contributors leaving (becoming inactive) over time
* Number of contributors not leaving (retained) over time

These metrics can be computed for each of the "cohorts", defined as the groups of contributors joining during a certain period of time (for example, during each year). Some of these metrics will be computed for the specified contributor groups, over time.
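
As a rough illustration of yearly cohorts, the sketch below groups contributors by the year of their first commit and counts how many members of each cohort are still around in every later year. It assumes `first_commit` / `last_commit` datetime columns and treats a contributor as present for the whole span between both dates, ignoring gaps. The name `cohort_retention` and the sample data are hypothetical:

```python
import pandas as pd

def cohort_retention(authors, years):
    """Rows: cohort (year of first commit). Columns: calendar year.

    Each cell counts the cohort members whose last commit falls in that
    calendar year or later.
    """
    cohort = authors['first_commit'].dt.year.astype(int)
    last_year = authors['last_commit'].dt.year.astype(int)

    table = {}
    for year in years:
        still_around = (cohort <= year) & (last_year >= year)
        table[year] = cohort[still_around].value_counts()
    return pd.DataFrame(table).fillna(0).astype(int).sort_index()

# Hypothetical sample data
authors = pd.DataFrame({
    'first_commit': pd.to_datetime(['2014-03-01', '2014-06-01', '2015-01-10', '2016-05-05']),
    'last_commit': pd.to_datetime(['2014-09-01', '2017-02-01', '2015-11-30', '2016-12-01']),
})
table = cohort_retention(authors, years=[2014, 2015, 2016, 2017])
print(table)
```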

# Metric Calculations
First we need to load a connection against the proper ES instance. We use an external module to load credentials from a file that will not be shared. If you want to run this, please use your own credentials, just put them in a file named '.settings' (in the same directory as this notebook) following the example file 'settings.sample'.

This section includes common code to manage and plot data. Queries will be available at the corresponding section.

**TODO**: Add bot and merges filtering.

**TODO** : provide plots similar to:

https://analytics.mozilla.community/edit/app/kibana#/dashboard/Community-Analytics-Demographics

In [1]:
from datetime import datetime
import pandas

import plotly
import plotly.offline
import plotly.graph_objs as go

import util as ut

from util import ESConnection
from elasticsearch_dsl import Search

es_conn = ESConnection()

In [2]:
def create_search(source):
    s = Search(using=es_conn, index=source)
    # TODO: Add bot and merges filtering.
    #s = s.filter('range', grimoire_creation_date={'gt': 'now/M-2y', 'lt': 'now/M'})
    #s.params(timeout=100)
    return s

In [3]:
def get_authors_df(result, author_bucket_field):

    # Get a dataframe with each author and their first commit
    buckets_result = result['aggregations'][author_bucket_field]['buckets']

    buckets = []
    for bucket_author in buckets_result:
        author = bucket_author['key']

        first = bucket_author['first']['hits']['hits'][0]
        first_commit = first['sort'][0]/1000
        last_commit = bucket_author['last_commit']['value']/1000
        org_name = first['_source']['author_org_name']
        uuid = first['_source']['author_uuid']
        buckets.append({
                'first_commit': datetime.utcfromtimestamp(first_commit),
                'last_commit': datetime.utcfromtimestamp(last_commit),
                'author': author,
                'uuid': uuid,
                'org': org_name
        })
    authors_df = pandas.DataFrame.from_records(buckets)
    authors_df.sort_values(by='first_commit', ascending=False,
                            inplace=True)
    return authors_df

def get_active_authors_df(result, author_bucket_field, year):
    """Returns a dataframe with first and last commit of those authors
    whose last commit was made within a given year"""

    # Get a dataframe with each author and their first commit
    buckets_result = result['aggregations'][author_bucket_field]['buckets']

    buckets = []
    for bucket_author in buckets_result:
        author = bucket_author['key']

        first = bucket_author['first']['hits']['hits'][0]
        first_commit = first['sort'][0]/1000
        last_commit = bucket_author['last_commit']['value']/1000
        org_name = first['_source']['author_org_name']
        #uuid = first['_source']['author_uuid']
        if datetime.utcfromtimestamp(last_commit).year == year:
            buckets.append({
                    'first_commit': datetime.utcfromtimestamp(first_commit),
                    'last_commit': datetime.utcfromtimestamp(last_commit),
                    'author': author,
                    #'uuid': uuid,
                    'org': org_name
            })
    authors_df = pandas.DataFrame.from_records(buckets)
    authors_df.sort_values(by='first_commit', ascending=False,
                            inplace=True)
    return authors_df

In [4]:
def print_horizontal_bar_chart(df, experience_field, title, min_range = 0):
    
    plotly.offline.init_notebook_mode(connected=True)
    
    experience = list(range(min_range, int(df[experience_field].max()) + 1))
    
    people_count = []
    for exp in experience:
        people_count.append(len(df.loc[df[experience_field] == exp]))
        
    data = [go.Bar(
            x=people_count,
            y=experience,
            orientation = 'h'
    )]
    
    layout = go.Layout(
        barmode='group',
        title= title
    )

    fig = go.Figure(data=data, layout=layout)
    plotly.offline.iplot(fig, filename='horizontal-bar')
    
    
def print_horizontal_bar_chart_relative(df, experience_field, title):
    
    plotly.offline.init_notebook_mode(connected=True)
    
    experience = list(range(0, int(df[experience_field].max()) + 1))
    
    people_count = []
    first_count = len(df.loc[df[experience_field] == 0])
    for exp in experience:
        current_count = len(df.loc[df[experience_field] == exp])
        people_count.append(current_count * 100/ first_count)
        
    data = [go.Bar(
            x=people_count,
            y=experience,
            orientation = 'h'
    )]
    
    layout = go.Layout(
        barmode='group',
        title= title
    )

    fig = go.Figure(data=data, layout=layout)
    plotly.offline.iplot(fig, filename='horizontal-bar')
    
def print_horizontal_bar_chart_percent(df, experience_field, title):
    
    plotly.offline.init_notebook_mode(connected=True)
    
    experience = list(range(0, int(df[experience_field].max()) + 1))
    
    people_count = []
    cusum = len(df)
    for exp in experience:
        current_count = len(df.loc[df[experience_field] == exp])
        people_count.append(current_count * 100/ cusum)
        
    data = [go.Bar(
            x=people_count,
            y=experience,
            orientation = 'h'
    )]
    
    layout = go.Layout(
        barmode='group',
        title= title
    )

    fig = go.Figure(data=data, layout=layout)
    plotly.offline.iplot(fig, filename='horizontal-bar')
    


# Metrics



In [5]:
def add_general_date_filters(s):
    # 01/01/1998
    initial_ts = '883609200000'
    return s.filter('range', grimoire_creation_date={'gt': initial_ts})

def add_bot_filter(s):
    return s.filter('term', author_bot='false')

def add_merges_filter(s):
    return s.filter('range', files={'gt': 0})
    


## Time from first to last contrib for authors who made a commit before a given year 

The next plot shows the number of authors grouped by the time from their first to their last contribution. This gives us an idea of how long contributors stay around the community. The chart says nothing about their activity during that period; it is just a quick, approximate glance at how long they remain around the community.

**Long bars in the group of 0 years of experience mean that many people made their first and last contributions within the same year, over the whole period**. That is, the bar holds the accumulated sum of people who made all their contributions within a single year since 1998.

* Y axis corresponds to the difference in years from first to last contributions.
* X axis corresponds to the number of contributors in the given group.
* Each plot shows a snapshot of this information from the specified year to the past (1998 was chosen as the oldest date to get results from). 
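
The difference in years plotted on the Y axis can be sketched as follows. This is an approximation based on the mean length of a Gregorian year, offered as an alternative to casting with `astype('timedelta64[Y]')`, which recent pandas releases may reject. The helper name `whole_years_between` and the sample dates are assumptions made for illustration:

```python
import pandas as pd

def whole_years_between(first, last):
    """Whole years elapsed from `first` to `last` (two datetime Series)."""
    days = (last - first).dt.days
    # 365.2425 days is the mean Gregorian year, so results are approximate
    return (days // 365.2425).astype(int)

# Hypothetical sample data
first_commit = pd.Series(pd.to_datetime(['2010-01-01', '2005-01-01']))
last_commit = pd.Series(pd.to_datetime(['2010-06-30', '2010-01-02']))

years = whole_years_between(first_commit, last_commit)
print(years.tolist())
```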

In [6]:
results = []
for i in range(0,4):

    # Buckets by author name, finding first commit for each of them
    s = Search(using=es_conn, index='git')
    s.params(timeout=30)

    # General filters
    s = add_general_date_filters(s)
    s = add_bot_filter(s)
    s = add_merges_filter(s)
    
    # Retrieve commits before given year
    s = s.filter('range', grimoire_creation_date={'lt': 'now-' + str(i) + 'y/y'})

    # Bucketize by uuid and get first and last commit
    s.aggs.bucket('authors', 'terms', field='author_uuid', size=100000) \
        .metric('first', 'top_hits', _source=['author_date', 'author_org_name', 'author_uuid'],
                size=1, sort=[{"author_date": {"order": "asc"}}]) \
        .metric('last_commit', 'max', field='author_date')
    s = s.sort("author_date")
    #print(s.to_dict())
    results.append(s.execute())

In [7]:
authors_dfs = []
for result in results:
    authors_df = get_authors_df(result, author_bucket_field='authors')
    authors_df['active_years'] = (authors_df.last_commit-authors_df.first_commit).astype('timedelta64[Y]')
    authors_dfs.append(authors_df)

authors_dfs


[                                         author        first_commit  \
 11577  149ad981c7f8acd65b9cb9a3f306833fc4b6ec8e 2016-12-31 12:01:23   
 13024  778f8adac79c8593c8413792e9d838e0620797c1 2016-12-31 08:34:00   
 14349  f0a27b3276baf429f51f6afed8ece01f1a09e7d0 2016-12-31 05:11:58   
 12669  5c6bef06c5d1fa2e83f13db2046ec3d9d27bd00d 2016-12-30 23:09:23   
 13425  977b2b48ece5125093c2cfdedb56be00bdba4f13 2016-12-30 22:21:03   
 12596  57a1cf0d7a35c3f77475953873212148d3c01e1b 2016-12-30 21:55:02   
 10727  8c536fb5c49a281f793cb47e933fd71403891960 2016-12-30 17:26:13   
 13554  a1e6b9aca42df46460c5fed6a73cde615b2b04a3 2016-12-30 16:45:19   
 11875  26e5b20e7ccff92ec895b5f4af3f5f0b9a5b0f0a 2016-12-30 14:34:47   
 9905   09a899922407e40e04fa6ab865da040f06c0d904 2016-12-30 12:22:11   
 9612   c91f349eb811f5870fe7b4e987bd34decfbf72cb 2016-12-30 11:34:16   
 13171  829a9b7eae96736b2408c62c4b2843721116fe41 2016-12-29 23:33:52   
 13884  c19b5436d45eb06bb0d9a16accd4a16da1ad9765 2016-12-29 14:2

In [8]:
# Plot bar charts for each dataframe
i = 0
for authors_df in authors_dfs:
#    print(author_df['experience_years'].max(), type(author_df['experience_years'].max()))
    print_horizontal_bar_chart(authors_df, 'active_years', title=str(2017 - i))
    i += 1


### Relative time from first to last contrib for authors who made a commit before a given year 

The next charts show the same information as above from a relative point of view: all numbers are shown as percentages relative to the group of people who made all their contributions within the same year (the 0 years group).

* Y axis corresponds to the difference in years from first to last contributions.
* X axis corresponds to the percentage of contributors in the given group related to the first group (0 years).
* Each plot shows a snapshot of this information from the specified year to the past (1998 was chosen as the oldest date to get results from). 

In [9]:
# Plot bar charts for each dataframe
i = 0
for authors_df in authors_dfs:
#    print(author_df['experience_years'].max(), type(author_df['experience_years'].max()))
    print_horizontal_bar_chart_relative(authors_df, 'active_years', title=str(2017 - i))
    i += 1

### Percentage of time from first to last contrib for authors who made a commit before a given year 

Next charts show another view on the same data. In this case the percentage of people in a given group out of the total number of people is shown.

* Y axis corresponds to the difference in years from first to last contributions.
* X axis corresponds to the percentage of contributors in the given group.
* Each plot shows a snapshot of this information from the specified year to the past (1998 was chosen as the oldest date to get results from). 
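
The relative view of the previous section and the percentage view of this one differ only in the denominator. A small sketch with made-up values (the `active_years` series below is illustrative, not real data):

```python
import pandas as pd

# Illustrative values only: years between first and last contribution
active_years = pd.Series([0, 0, 0, 0, 1, 1, 2, 5])

counts = active_years.value_counts().sort_index()
# Include empty groups so every year from 0 to the maximum gets a bar
all_groups = list(range(0, int(active_years.max()) + 1))
counts = counts.reindex(all_groups, fill_value=0)

# Relative view: the 0 years group is the 100% reference
relative_to_first = counts * 100 / counts.loc[0]
# Percentage view: all groups together add up to 100%
share_of_total = counts * 100 / counts.sum()

print(relative_to_first.tolist())
print(share_of_total.tolist())
```

Note that the relative view is undefined when the 0 years group is empty, while the percentage view only requires at least one contributor.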

In [10]:
# Plot bar charts for each dataframe
i = 0
for authors_df in authors_dfs:
#    print(author_df['experience_years'].max(), type(author_df['experience_years'].max()))
    print_horizontal_bar_chart_percent(authors_df, 'active_years', title=str(2017 - i))
    i += 1

## Time from first to last commit for authors active in a given year

We define an author as **active** iff she made at least one commit within a given year. E.g. an author would be considered active in 2017 if she made a commit after Jan. 1st, 2017 and before Dec. 31st 2017. 

In other words, the difference from the previous plots is that we only take into account contributors who made their last contribution in the year we are visualizing data from.

* Y axis corresponds to the difference in years from first to last contributions.
* X axis corresponds to the number of contributors in the given group.
* Each plot shows a snapshot of this information from the specified year to the past (1998 was chosen as the oldest date to get results from). 
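
Because each query only retrieves commits up to the end of the snapshot year, "at least one commit in that year" is equivalent to "last commit in that year". A minimal sketch of that selection with a boolean mask, using hypothetical sample data:

```python
import pandas as pd

# Hypothetical sample data; no commit is later than the snapshot year
authors = pd.DataFrame({
    'author': ['a', 'b', 'c'],
    'first_commit': pd.to_datetime(['2010-02-01', '2016-03-01', '2012-07-01']),
    'last_commit': pd.to_datetime(['2016-11-20', '2016-03-15', '2014-01-05']),
})

year = 2016
# Keep only authors whose last commit falls within the snapshot year
active = authors[authors['last_commit'].dt.year == year]
print(active['author'].tolist())
```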

In [11]:
results = []
for i in range(0,4):

    # Buckets by author name, finding first commit for each of them
    s = Search(using=es_conn, index='git')
    s.params(timeout=30)
    
    # General filters
    s = add_general_date_filters(s)
    s = add_bot_filter(s)
    s = add_merges_filter(s)

    # Retrieve commits before given year
    s = s.filter('range', grimoire_creation_date={'lte': 'now-' + str(i) + 'y/y'})

    # Bucketize by uuid and get first and last commit
    s.aggs.bucket('authors', 'terms', field='author_uuid', size=100000) \
        .metric('first', 'top_hits', _source=['author_date', 'author_org_name', 'author_uuid'],
                size=1, sort=[{"author_date": {"order": "asc"}}]) \
        .metric('last_commit', 'max', field='author_date')
    s = s.sort("author_date")
    #       pprint(s.to_dict())
    results.append(s.execute())

In [12]:
authors_dfs = []
year = 2016
for result in results:
    authors_df = get_active_authors_df(result, author_bucket_field='authors', year=year)
    authors_df['active_years'] = (authors_df.last_commit-authors_df.first_commit).astype('timedelta64[Y]')
    authors_dfs.append(authors_df)
    year -= 1

authors_dfs


[                                        author        first_commit  \
 2378  149ad981c7f8acd65b9cb9a3f306833fc4b6ec8e 2016-12-31 12:01:23   
 2741  778f8adac79c8593c8413792e9d838e0620797c1 2016-12-31 08:34:00   
 3077  f0a27b3276baf429f51f6afed8ece01f1a09e7d0 2016-12-31 05:11:58   
 2653  5c6bef06c5d1fa2e83f13db2046ec3d9d27bd00d 2016-12-30 23:09:23   
 2149  8c536fb5c49a281f793cb47e933fd71403891960 2016-12-30 17:26:13   
 2871  a1e6b9aca42df46460c5fed6a73cde615b2b04a3 2016-12-30 16:45:19   
 1920  09a899922407e40e04fa6ab865da040f06c0d904 2016-12-30 12:22:11   
 1861  c91f349eb811f5870fe7b4e987bd34decfbf72cb 2016-12-30 11:34:16   
 2775  829a9b7eae96736b2408c62c4b2843721116fe41 2016-12-29 23:33:52   
 2453  286f050fe75f8123df329c88b74f3dd61b73bda4 2016-12-28 18:43:08   
 2946  bd1da52b37734708ba675cb93b8ab3ab5ac5446c 2016-12-28 18:09:59   
 2400  1c9ccec2047e42eb37053a6e93cb10781c56ec82 2016-12-28 16:53:00   
 2597  4d0114c7876b3d23637a32f6e203ab41d3f44c29 2016-12-28 07:37:45   
 2343 

In [13]:
# Plot bar charts for each dataframe
i = 0
for authors_df in authors_dfs:
#    print(author_df['experience_years'].max(), type(author_df['experience_years'].max()))
    print_horizontal_bar_chart(authors_df, 'active_years', title=str(2016 - i))
    i += 1


## Years of Experience
We consider **12 commits** per year, i.e. approximately one commit per month, as the minimum needed to add one year of experience to a given author. From this assumption, we build groups of authors by years of experience. As a result, we present a plot with the number of people in each group.

To give a more complete idea of how the community evolves, we plot snapshots corresponding to different years. Each of them takes all commits sent until the given year, and calculates years of experience for all authors in that slice.

We only count authors whose last year of experience is the one we are analyzing data from. That is, if we are looking at year 2017, we only count those authors who made at least 12 commits in 2017. From there we add 1 year of experience for each year in which they fulfill this condition.

* Y axis corresponds to years of experience as defined above.
* X axis corresponds to the number of contributors in the given group.
* Each plot shows a snapshot of this information from the specified year to the past (1998 was chosen as the oldest date to get results from). 
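
The definition above can be sketched on a small table of commits per author and year. The column names (`author`, `year`, `commits`), the function name `years_of_experience` and the sample rows are assumptions for illustration; qualifying years do not need to be consecutive:

```python
import pandas as pd

MIN_COMMITS = 12  # threshold stated in the text above

def years_of_experience(commits_per_year, snapshot_year):
    """Expects one row per (author, year) with a `commits` count."""
    qualifies = ((commits_per_year['commits'] >= MIN_COMMITS)
                 & (commits_per_year['year'] <= snapshot_year))
    summary = (commits_per_year[qualifies]
               .groupby('author')['year']
               .agg(exp='count', last_active='max'))
    # Keep only authors whose last qualifying year is the snapshot year
    return summary[summary['last_active'] == snapshot_year]

# Hypothetical sample data
commits_per_year = pd.DataFrame(
    [('a', 2015, 30), ('a', 2016, 5), ('a', 2017, 40),
     ('b', 2016, 20), ('b', 2017, 11), ('c', 2017, 12)],
    columns=['author', 'year', 'commits'])

exp_2017 = years_of_experience(commits_per_year, snapshot_year=2017)
print(exp_2017)
```

Author `b` is dropped because 11 commits in 2017 is below the threshold, even though 2016 qualified.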

In [6]:
###
## GET COMMITS BY YEAR AND AUTHOR
###

results = []
min_commits = 1

for i in range(0,10):

    # Buckets by author name, finding first commit for each of them
    s = create_search(source='git')
    
    # General filters
    s = add_general_date_filters(s)
    s = add_bot_filter(s)
    s = add_merges_filter(s)
    
    # Retrieve commits before given year
    s = s.filter('range', grimoire_creation_date={'lte': 'now-' + str(i) + 'y/y'})

    # Bucketize by time, uuid and organization, then count commits per year
    s.aggs.bucket('time', 'date_histogram', field='grimoire_creation_date', interval='year') \
        .bucket('authors', 'terms', field='author_uuid', size=100000, min_doc_count=min_commits) \
        .bucket('org', 'terms', field='author_org_name', size=1) \
        .metric('commits', 'cardinality', field='hash', precision_threshold=1000)

    r = s.execute()
    # In case you need to check response, uncomment line below
    #print(r.to_dict()['aggregations']['time']['buckets'])
        
    results.append(r)
    
#results

In [9]:
###
## CREATE A DF CONTAINING, FOR EACH AUTHOR UUID, COUNT OF YEARS OF EXPERIENCE (YEARS
## WITH AT LEAST 12 COMMITS MADE) AND LAST YEAR ACTIVE
###
exp_df_list = []
year = 2017

for result in results:
    exp_df = ut.to_df_by_time(result, 'Author', 'Time', 'Commits', 'Org', 'authors', 'time', 'commits', 'org')
    exp_df['Time'] = exp_df['Time'].apply(lambda x: str(pandas.Period(x,'A')))
    
    ## ACTIVE CONDITION
    ## Filter out author-years with fewer than 12 commits
    exp_df = exp_df[exp_df['Commits'] >= 12]
    
    ## Group by author, get MAX YEAR and NUMBER OF ROWS FOR THE GIVEN AUTHOR
    exp_df = exp_df.groupby(['Author', 'Org']).agg({'Time': 'max', 'Commits': 'count'})
    ## Filter those whose last active year is not the one we want
    exp_df = exp_df[exp_df['Time'] == str(year)]
    
    exp_df['exp'] = exp_df['Commits']
    exp_df['last_active'] = exp_df['Time']
    exp_df= exp_df.drop('Commits', axis=1)
    exp_df = exp_df.drop('Time', axis=1)
    
    exp_df_list.append(exp_df)
    
    year -= 1

exp_df_list


[                                                        exp last_active
 Author                                   Org                           
 000063c4e47e93ab3b30607680609e4d2500ce5d Mozilla Staff    4        2017
 002893ffe1425c220756f8ba4c78e1e3bb0be50f Mozilla Staff    7        2017
 00834d313bfc6fc60be1631bcc57b2c05ee2e0e3 Mozilla Staff    9        2017
 00846eff46b051d92317fc74e54041c6fdccd7cf Mozilla Staff   10        2017
 00a40f9e9e7f7633ddab8291a99e1e487f88481c Community        3        2017
 00b934012989b386ac9efc706dbc28cd6be173c6 Mozilla Staff    2        2017
 00d00a6e7530f1ccac19b98727f25b6528ed9fe3 Community        1        2017
 00da6ede3bff8db21a33473d7e552b76f50757eb Mozilla Staff    5        2017
 00ed7b25063cf90c8bdeb9d45d37b73ba2317f96 Mozilla Staff    6        2017
 01307140d33369746c5013f45295092f8752d378 Community        2        2017
 013bbdb9412b88db677df21347e032c57b099d97 Community        3        2017
 016a5f2ec7191e74e984c71787b2b292d89543a1 Community

In [17]:
# Plot bar charts for each dataframe
i = 0
for exp_df in exp_df_list:
    if not exp_df.empty:
        print_horizontal_bar_chart(exp_df, 'exp', 
                                   title='All ' + str(2017 - i), min_range=1)
        print_horizontal_bar_chart(exp_df[[group2 in ['Mozilla Staff'] for group1, group2 in exp_df.index]], 'exp',
                                   title='Mozilla ' + str(2017 - i), min_range=1)
        print_horizontal_bar_chart(exp_df[[group2 not in ['Mozilla Staff'] for group1, group2 in exp_df.index]], 'exp', 
                                   title='Others ' + str(2017 - i), min_range=1)
    i += 1

In [30]:
exp_groups_evo_df = pandas.DataFrame(columns=['last_active', 'exp', 'count', 'org'])

for exp_df in exp_df_list:
    
    if exp_df.empty:
        continue
    
    experience = list(range(1, int(exp_df['exp'].max()) + 1))
    orgs = exp_df.index.get_level_values('Org').unique()
    
    last_active = exp_df['last_active'].unique()[0]
    for exp in experience:
        for org in orgs:
            org_df = exp_df[[group2 in [org] for group1, group2 in exp_df.index]]
            count = len(org_df.loc[org_df['exp'] == exp])
            #print(last_active, exp, count)
            exp_groups_evo_df.loc[len(exp_groups_evo_df)] = [last_active, exp, count, org]
        
print('Max. Exp: ', exp_groups_evo_df['exp'].max(), 'Max. Count: ',  exp_groups_evo_df['count'].max())
exp_groups_evo_df    

Max. Exp:  17.0 Max. Count:  336.0


| | last_active | exp | count | org |
|---|---|---|---|---|
| 0 | 2017 | 1.0 | 16.0 | Mozilla Staff |
| 1 | 2017 | 1.0 | 53.0 | Community |
| 2 | 2017 | 1.0 | 0.0 | Mozilla Reps |
| 3 | 2017 | 2.0 | 48.0 | Mozilla Staff |
| 4 | 2017 | 2.0 | 59.0 | Community |
| 5 | 2017 | 2.0 | 0.0 | Mozilla Reps |
| 6 | 2017 | 3.0 | 50.0 | Mozilla Staff |
| 7 | 2017 | 3.0 | 33.0 | Community |
| 8 | 2017 | 3.0 | 0.0 | Mozilla Reps |
| 9 | 2017 | 4.0 | 39.0 | Mozilla Staff |


### Evolution of Experience

The next table and plot show how each group changes over time. This way we can visualize how new people come and remain in the community. It is worth noting that we are not following a given group of people through time (that could be done by following diagonals in the table, which we look at in the next section), but looking at how a given group changes from one year to another.

For instance, if we look at the group with 2 years of experience in 2008, we see we had 204 people. If we look at the same group in 2009, we see that our **new group** of people accumulating 2 years of experience has 105 people. So we have fewer people with 2 years of experience in 2009. If we look at 2016, we find 224 people with two years of experience, so we have more people with 2 years of experience nowadays than we had 8 years ago.

The table can be read as follows:

* Cell values correspond to the number of contributors in the given group.
* Rows correspond to groups based on years of experience.
* Columns correspond to the years we are analyzing.
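
A table with this layout can also be built in one step with `pandas.crosstab`, given one row per contributor and snapshot year. The sketch below uses made-up rows; with the notebook's data, the input would be something like the concatenation of the per-year dataframes, using their `exp` and `last_active` columns (an assumption about how they would be combined):

```python
import pandas as pd

# Illustrative values only: one row per contributor and snapshot year
snapshots = pd.DataFrame({
    'year': ['2016', '2016', '2016', '2017', '2017'],
    'exp': [1, 1, 2, 2, 3],
})

# Rows: years of experience. Columns: snapshot year. Cells: contributor counts.
table = pd.crosstab(snapshots['exp'], snapshots['year'])
print(table)
```

Unlike a loop over every possible experience value, `crosstab` only creates rows for values that actually occur, so empty groups would need a `reindex` if they must appear as zeros.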

In [39]:
exp_groups_evo_df = pandas.DataFrame(columns=['exp'])

for exp_df in exp_df_list:
    
    if exp_df.empty:
        continue
    
    year = exp_df['last_active'].unique()[0]
    exp_groups_df = pandas.DataFrame(columns=['exp', year])
    
    experience = list(range(1, int(exp_df['exp'].max()) + 1))
    
    for exp in experience:
        count = len(exp_df.loc[exp_df['exp'] == exp])
        exp_groups_df.loc[len(exp_groups_df)] = [exp, count]

    exp_groups_evo_df = exp_groups_evo_df.merge(exp_groups_df, on='exp', how='outer')


# Fill Nan with 0's
exp_groups_evo_df = exp_groups_evo_df.fillna(0)

# Reorder columns
exp_groups_evo_df = exp_groups_evo_df.set_index('exp')
exp_groups_evo_df = exp_groups_evo_df.sort_index(axis=1)


#print('Max. Exp: ', exp_groups_evo_df['exp'].max(), 'Max. Count: ')
exp_groups_evo_df

| exp | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1.0 | 180.0 | 180.0 | 203.0 | 312.0 | 369.0 | 435.0 | 493.0 | 465.0 | 434.0 | 69.0 |
| 2.0 | 204.0 | 105.0 | 106.0 | 133.0 | 211.0 | 206.0 | 225.0 | 212.0 | 224.0 | 107.0 |
| 3.0 | 37.0 | 91.0 | 79.0 | 80.0 | 98.0 | 149.0 | 154.0 | 163.0 | 123.0 | 83.0 |
| 4.0 | 16.0 | 5.0 | 83.0 | 68.0 | 59.0 | 82.0 | 119.0 | 109.0 | 123.0 | 54.0 |
| 5.0 | 18.0 | 0.0 | 4.0 | 74.0 | 58.0 | 49.0 | 67.0 | 100.0 | 86.0 | 82.0 |
| 6.0 | 9.0 | 1.0 | 0.0 | 4.0 | 62.0 | 48.0 | 46.0 | 54.0 | 86.0 | 47.0 |
| 7.0 | 8.0 | 1.0 | 1.0 | 0.0 | 3.0 | 52.0 | 42.0 | 42.0 | 50.0 | 57.0 |
| 8.0 | 4.0 | 1.0 | 1.0 | 1.0 | 0.0 | 2.0 | 47.0 | 37.0 | 38.0 | 39.0 |
| 9.0 | 6.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 2.0 | 43.0 | 34.0 | 16.0 |
| 10.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 2.0 | 36.0 | 21.0 |


Next plot can be read as follows:
* Y axis corresponds to the number of contributors in the given group.
* X axis corresponds to years we are looking through.
* Each line corresponds to a given group based on their years of experience. 

In [19]:
plotly.offline.init_notebook_mode(connected=True)

data = []
for exp in exp_groups_evo_df.index.values:
    #print(exp, '\n', exp_groups_evo_df.loc[exp].tolist(), '\n', exp_groups_evo_df.loc[exp].index.values)
    data.append(
        go.Scatter(
            x = exp_groups_evo_df.loc[exp].index.values,
            y = exp_groups_evo_df.loc[exp].tolist(),
            mode = 'lines+markers',
            name = str(int(exp)) + ' years'
        )
    )
    


plotly.offline.iplot(data, filename='line-mode')    

### Evolution of Experience by Organization

In [33]:
exp_groups_evo_moz_df = pandas.DataFrame(columns=['exp'])

for exp_df in exp_df_list:
    
    if exp_df.empty:
        continue
        
    exp_df = exp_df[[group2 in ['Mozilla Staff'] for group1, group2 in exp_df.index]]
    
    year = exp_df['last_active'].unique()[0]
    exp_groups_df = pandas.DataFrame(columns=['exp', year])
    
    experience = list(range(1, int(exp_df['exp'].max()) + 1))
    
    for exp in experience:
        count = len(exp_df.loc[exp_df['exp'] == exp])
        exp_groups_df.loc[len(exp_groups_df)] = [exp, count]

    exp_groups_evo_moz_df = exp_groups_evo_moz_df.merge(exp_groups_df, on='exp', how='outer')


# Fill Nan with 0's
exp_groups_evo_moz_df = exp_groups_evo_moz_df.fillna(0)

# Reorder columns
exp_groups_evo_moz_df = exp_groups_evo_moz_df.set_index('exp')
exp_groups_evo_moz_df = exp_groups_evo_moz_df.sort_index(axis=1)


#print('Max. Exp: ', exp_groups_evo_moz_df['exp'].max(), 'Max. Count: ')
exp_groups_evo_moz_df

| exp | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1.0 | 93.0 | 93.0 | 112.0 | 186.0 | 219.0 | 226.0 | 188.0 | 129.0 | 114.0 | 16.0 |
| 2.0 | 96.0 | 76.0 | 75.0 | 89.0 | 152.0 | 159.0 | 158.0 | 117.0 | 103.0 | 48.0 |
| 3.0 | 7.0 | 83.0 | 61.0 | 63.0 | 75.0 | 123.0 | 132.0 | 124.0 | 85.0 | 50.0 |
| 4.0 | 0.0 | 5.0 | 75.0 | 57.0 | 48.0 | 70.0 | 102.0 | 103.0 | 99.0 | 39.0 |
| 5.0 | 1.0 | 0.0 | 4.0 | 66.0 | 51.0 | 42.0 | 59.0 | 90.0 | 84.0 | 70.0 |
| 6.0 | 1.0 | 1.0 | 0.0 | 4.0 | 56.0 | 44.0 | 38.0 | 50.0 | 77.0 | 46.0 |
| 7.0 | 0.0 | 1.0 | 1.0 | 0.0 | 3.0 | 47.0 | 38.0 | 35.0 | 46.0 | 53.0 |
| 8.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 2.0 | 44.0 | 33.0 | 31.0 | 37.0 |
| 9.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 2.0 | 40.0 | 30.0 | 13.0 |
| 10.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 2.0 | 35.0 | 18.0 |


In [36]:
exp_groups_evo_others_df = pandas.DataFrame(columns=['exp'])

for exp_df in exp_df_list:
    
    if exp_df.empty:
        continue
        
    exp_df = exp_df[[group2 not in ['Mozilla Staff'] for group1, group2 in exp_df.index]]
    
    year = exp_df['last_active'].unique()[0]
    exp_groups_df = pandas.DataFrame(columns=['exp', year])
    
    experience = list(range(1, int(exp_df['exp'].max()) + 1))
    
    for exp in experience:
        count = len(exp_df.loc[exp_df['exp'] == exp])
        exp_groups_df.loc[len(exp_groups_df)] = [exp, count]

    exp_groups_evo_others_df = exp_groups_evo_others_df.merge(exp_groups_df, on='exp', how='outer')


# Fill Nan with 0's
exp_groups_evo_others_df = exp_groups_evo_others_df.fillna(0)

# Reorder columns
exp_groups_evo_others_df = exp_groups_evo_others_df.set_index('exp')
exp_groups_evo_others_df = exp_groups_evo_others_df.sort_index(axis=1)


#print('Max. Exp: ', exp_groups_evo_others_df['exp'].max(), 'Max. Count: ')
exp_groups_evo_others_df

| exp | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1.0 | 87.0 | 87.0 | 91.0 | 126.0 | 150.0 | 209.0 | 305.0 | 336.0 | 320.0 | 53.0 |
| 2.0 | 108.0 | 29.0 | 31.0 | 44.0 | 59.0 | 47.0 | 67.0 | 95.0 | 121.0 | 59.0 |
| 3.0 | 30.0 | 8.0 | 18.0 | 17.0 | 23.0 | 26.0 | 22.0 | 39.0 | 38.0 | 33.0 |
| 4.0 | 16.0 | 0.0 | 8.0 | 11.0 | 11.0 | 12.0 | 17.0 | 6.0 | 24.0 | 15.0 |
| 5.0 | 17.0 | 0.0 | 0.0 | 8.0 | 7.0 | 7.0 | 8.0 | 10.0 | 2.0 | 12.0 |
| 6.0 | 8.0 | 0.0 | 0.0 | 0.0 | 6.0 | 4.0 | 8.0 | 4.0 | 9.0 | 1.0 |
| 7.0 | 8.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5.0 | 4.0 | 7.0 | 4.0 | 4.0 |
| 8.0 | 3.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 4.0 | 7.0 | 2.0 |
| 9.0 | 5.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 4.0 | 3.0 |
| 10.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 3.0 |


## Retention by experience
The next table shows the percentage of people remaining in the community for each experience group. It is calculated not by following individuals, but by comparing the total number of people in each group. This is a different picture from the one we showed before: here we try to follow the evolution of a given group of people through the years.

To read the table, take into account that each number is the percentage of people remaining in a given group (e.g. 2 years of experience in 2010) with respect to the same group during the previous year (i.e. 1 year of experience in 2009).

So, cell (3.0, 2010) reads as: the number of people with **3** years of experience in **2010** represents 75.24% of the people having **2** years of experience in **2009**. Note that both groups are in fact the same group as it evolves through time, increasing its years of experience.

The table can be read as follows:

* Cell values correspond to the percentage of contributors remaining in the given group with respect to the previous year, **in which they had 1 year less of experience**.
* Rows correspond to **groups based on the years of experience they had in the year specified in the column title**.
* Columns correspond to the years we are analyzing.
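
As a compact illustration of this calculation, each cell can be divided by the cell one row up and one column to the left. The sketch below uses a three-by-three excerpt of the counts table shown earlier (experience 1 to 3, years 2008 to 2010); the variable names are illustrative:

```python
import pandas as pd

# Excerpt of the counts table: rows are years of experience, columns are years
counts = pd.DataFrame(
    {'2008': [180, 204, 37], '2009': [180, 105, 91], '2010': [203, 106, 79]},
    index=[1, 2, 3])

# Shift one row down and one column right, so that every cell lines up with
# the same group one year earlier (one year less of experience)
previous = counts.shift(1, axis=0).shift(1, axis=1)
retention = (counts * 100 / previous).round(2)
print(retention)
```

Cells without a predecessor come out as NaN, and a zero predecessor yields `inf` or NaN rather than the 0 the loop-based version assigns, so that case would need explicit handling.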

In [20]:
exp_groups_evo_diff_df = pandas.DataFrame()

for exp in exp_groups_evo_df.index.values:
    #print(exp, '\n', exp_groups_evo_df.loc[exp].tolist(), '\n', exp_groups_evo_df.loc[exp].index.values)
    
    cols = list(exp_groups_evo_df)
    min_col = int(cols[0])
    
    #print(exp - 1)
    for col in list(exp_groups_evo_df): 
        # .at replaces DataFrame.get_value, which recent pandas releases no longer provide
        current_val = exp_groups_evo_df.at[exp, col]
        prev_row = exp - 1
        prev_col= int(col) - 1
        if prev_row > 0 and prev_col >= min_col:
            prev_val = exp_groups_evo_df.at[prev_row, str(prev_col)]
            #print(col, prev_val, current_val, prev_val - current_val)
            if prev_val == 0:
                percent = 0
            else:
                percent = current_val * 100 / prev_val
            # .loc replaces DataFrame.set_value and creates the row/column if missing
            exp_groups_evo_diff_df.loc[exp, col] = round(percent, 2)

exp_groups_evo_diff_df

    


| exp | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 |
|---|---|---|---|---|---|---|---|---|---|
| 2.0 | 58.33 | 58.89 | 65.52 | 67.63 | 55.83 | 51.72 | 42.89 | 48.17 | 24.71 |
| 3.0 | 44.61 | 75.24 | 75.47 | 73.68 | 70.62 | 74.76 | 72.44 | 58.29 | 37.05 |
| 4.0 | 13.51 | 91.21 | 86.08 | 73.75 | 83.67 | 80.54 | 70.78 | 75.46 | 43.09 |
| 5.0 | 0.0 | 80.0 | 89.16 | 85.29 | 83.05 | 81.71 | 84.17 | 78.9 | 66.67 |
| 6.0 | 5.56 | 0.0 | 100.0 | 83.78 | 82.76 | 93.88 | 80.6 | 86.14 | 54.65 |
| 7.0 | 11.11 | 100.0 | 0.0 | 75.0 | 83.87 | 87.5 | 91.3 | 92.59 | 66.67 |
| 8.0 | 12.5 | 100.0 | 100.0 | 0.0 | 66.67 | 90.38 | 88.1 | 90.48 | 78.0 |
| 9.0 | 25.0 | 0.0 | 100.0 | 0.0 | 0.0 | 100.0 | 91.49 | 91.89 | 42.11 |
| 10.0 | 16.67 | 0.0 | 0.0 | 100.0 | 0.0 | 100.0 | 100.0 | 83.72 | 61.76 |
| 11.0 | 0.0 | 100.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 100.0 | 69.44 |


### Following evolution of groups

Another way of visualizing this is to select a group from a particular year and follow its evolution through time. This way, we get the number of people belonging to that group through time.

The next table shows the evolution of groups from a given year, in this case **2008**. So the first row shows how the 180 people who had 1 year of experience in 2008 have evolved through the years, losing people year by year until 2016, when only 34 of them remain.

The table can be read as follows:

* Cell values correspond to the number of contributors in the given group.
* Rows correspond to **groups based on the years of experience they had in 2008**.
* Columns correspond to the years we are analyzing.

In [21]:
group_evo_df = pandas.DataFrame()

years = list(exp_groups_evo_df)
max_exp = exp_groups_evo_df.index.values.max()

# Group we want to follow
first_year = 2008


for exp_group in exp_groups_evo_df.index.values:

    # Walk down the diagonal: one more year of experience per calendar year
    exp_index = exp_group

    for year in years:
        if int(year) < first_year:
            continue

        if exp_index > max_exp:
            break

        # .at / .loc replace get_value / set_value, which were removed in pandas 1.0
        people_count = exp_groups_evo_df.at[exp_index, year]
        group_evo_df.loc[exp_group, year] = people_count
        exp_index += 1

# Cells past the end of a diagonal were never set; show them as 0
group_evo_df = group_evo_df.fillna(0)
group_evo_df


exp,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
1.0,180.0,105.0,79.0,68.0,58.0,48.0,42.0,37.0,34.0,21.0
2.0,204.0,91.0,83.0,74.0,62.0,52.0,47.0,43.0,36.0,25.0
3.0,37.0,5.0,4.0,4.0,3.0,2.0,2.0,2.0,2.0,1.0
4.0,16.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0
5.0,18.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
6.0,9.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
7.0,8.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8.0,4.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9.0,6.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0
10.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
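
Following a single group amounts to reading one diagonal of the experience table. As a rough sketch under the same assumptions as the notebook code (consecutive experience levels as rows, consecutive years stored as string column labels), a hypothetical helper `follow_cohort` could extract one diagonal directly. The sample values are again the top-left corner of Table 1.

```python
import pandas as pd


def follow_cohort(counts: pd.DataFrame, start_exp: float, first_year: int) -> pd.Series:
    """Return the diagonal that starts at (start_exp, first_year)."""
    row = counts.index.get_loc(start_exp)
    col = counts.columns.get_loc(str(first_year))
    # The diagonal stops when it runs out of rows or columns.
    steps = min(len(counts.index) - row, len(counts.columns) - col)
    values = [counts.iat[row + k, col + k] for k in range(steps)]
    return pd.Series(values, index=counts.columns[col:col + steps], name=start_exp)


# Hypothetical sample: top-left corner of Table 1.
sample = pd.DataFrame(
    {
        "2008": [180.0, 204.0, 37.0],
        "2009": [180.0, 105.0, 91.0],
        "2010": [203.0, 106.0, 79.0],
    },
    index=[1.0, 2.0, 3.0],
)
cohort = follow_cohort(sample, start_exp=1.0, first_year=2008)
print(cohort)
```

With this sample the result is 180, 105, 79, which matches the first three cells of the first row in the table above.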


We can look at the same data as percentages with respect to the previous year, i.e. what share of a group is retained from one year to the next. This is a different view of the first table, easier to follow because we read along rows instead of diagonals.

The table can be read as follows:

* Cell values correspond to the percentage of contributors remaining in the given group with respect to the previous year, **in which they had 1 year less of experience**.
* Rows correspond to **groups based on the years of experience they had in 2008**.
* Columns correspond to the years we are analyzing.

In [22]:
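group_evo_percent_df = pandas.DataFrame()

years = list(exp_groups_evo_df)
max_exp = exp_groups_evo_df.index.values.max()

# Group we want to follow
first_year = 2008


for exp_group in exp_groups_evo_df.index.values:

    # Walk down the diagonal: one more year of experience per calendar year
    exp_index = exp_group

    for year in years:
        if int(year) < first_year:
            continue

        if exp_index > max_exp:
            break

        # .at / .loc replace get_value / set_value, which were removed in pandas 1.0
        current_val = exp_groups_evo_df.at[exp_index, year]

        # The first year is its own baseline
        if int(year) == first_year:
            prev_val = current_val

        # With an empty previous year there is no meaningful ratio;
        # the fallback keeps the original behaviour (current_val * 100)
        if prev_val == 0:
            percent = current_val * 100
        else:
            percent = current_val * 100 / prev_val
        group_evo_percent_df.loc[exp_group, year] = round(percent, 2)

        exp_index += 1
        prev_val = current_val

# Cells past the end of a diagonal were never set; show them as 0
group_evo_percent_df = group_evo_percent_df.fillna(0)
group_evo_percent_df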

exp,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
1.0,100.0,58.33,75.24,86.08,85.29,82.76,87.5,88.1,91.89,61.76
2.0,100.0,44.61,91.21,89.16,83.78,83.87,90.38,91.49,83.72,69.44
3.0,100.0,13.51,80.0,100.0,75.0,66.67,100.0,100.0,100.0,50.0
4.0,100.0,0.0,0.0,0.0,0.0,100.0,100.0,0.0,0.0,0.0
5.0,100.0,5.56,100.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0
6.0,100.0,11.11,100.0,100.0,100.0,0.0,0.0,0.0,0.0,0.0
7.0,100.0,12.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8.0,100.0,25.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9.0,100.0,16.67,100.0,100.0,100.0,100.0,100.0,100.0,100.0,0.0
10.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


# Answers

* How many contributors are joining the community?
* How many contributors are no longer active (leaving) in the community?
* How is the attraction / retention ratio, and the net gain of contributors, over time?

To answer these questions we need to understand the following tables. Each of them shows contributor groups based on their years of experience in the community. To count years of experience we make the following assumptions:
* We consider **12 commits** per year, i.e. approximately one commit per month, as the minimum needed to add one year of experience to a given author.
* For each year we count only authors who were active that year, that is, authors whose last year of experience is the one we are analyzing (represented as columns in the table).
* Years of experience do not need to be consecutive: a contributor who made 12 or more contributions in 2010 and then 12 or more in 2014 will have 1 year of experience in 2010 and 2 years of experience in 2014. For the other years this contributor is not taken into account, as she was inactive.
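
The assumptions above can be sketched as follows. This is only an illustration, not the queries used in this notebook: the input layout (one row per commit with hypothetical `author` and `year` columns) and the helper name `experience_table` are assumptions.

```python
import pandas as pd

MIN_COMMITS = 12  # minimum commits in a year to earn one year of experience


def experience_table(commits: pd.DataFrame) -> pd.DataFrame:
    """Count authors per (years of experience, calendar year)."""
    # Commits per author and calendar year.
    per_year = commits.groupby(["author", "year"]).size().reset_index(name="n_commits")
    # Keep only the years in which the author reached the threshold.
    active = per_year[per_year["n_commits"] >= MIN_COMMITS].sort_values(["author", "year"])
    # Experience is the running count of qualifying years; they need not be consecutive.
    active = active.assign(exp=active.groupby("author").cumcount() + 1)
    # Each (author, year) pair appears once, so counting rows counts authors.
    return pd.crosstab(active["exp"], active["year"])


# Hypothetical data: "a" qualifies in 2010 and 2014, "b" only in 2010.
commits = pd.DataFrame(
    {
        "author": ["a"] * 12 + ["a"] * 12 + ["b"] * 12 + ["b"] * 5,
        "year": [2010] * 12 + [2014] * 12 + [2010] * 12 + [2011] * 5,
    }
)
table = experience_table(commits)
print(table)
```

In this toy data set, 2010 has two authors with 1 year of experience and 2014 has one author with 2 years of experience; 2011 does not appear because nobody reached 12 commits that year.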

It is worth noting that the numbers are calculated not by following individuals, but by comparing the total number of people in each group at a given time.

The table below shows data for all **Git** contributions made in the last 10 years. It can be read as follows:
* **Cell** values correspond to the number of contributors in the given group.
* **Rows** correspond to groups based on years of experience.
* **Columns** correspond to the years we are analyzing.

Depending on how we read the table we can analyze different aspects:
* **Attraction**. The exp 1.0 row gives an idea of the number of people attracted to the community each year. As we count only people who made at least 12 commits, we can consider them reasonably engaged in the community.
* **Retention**. Following the diagonals of the table lets us track the evolution of a given group from a particular year onwards, that is, the number of people still belonging to that group over time. For instance, in cell (1.0, 2008) we find 180 people with one year of experience in 2008. Moving to (2.0, 2009) we find 105 people with 2 years of experience in 2009, so we have lost 75 of the people attracted in 2008. Continuing along this diagonal we follow the evolution of the people who came in 2008. The procedure is the same whatever cell we use as a starting point, and it always gives the evolution of a given group of people.
* **Evolution of the community**. For instance, the group with 2 years of experience had 204 people in 2008. In 2009 the new group of people accumulating 2 years of experience has 105 people, so there are fewer people with 2 years of experience in 2009. In 2016 we find 224 people with 2 years of experience, so we have more people with 2 years of experience now than we had 8 years earlier.

Focusing on results:

**NOTE: 2017 column shows incomplete results. It could be removed to avoid mistakes, though it could be useful to know where we are at this point of the year**

#### Attraction 
(How many contributors are joining the community?)

The number of people joining the community each year grew steadily from 2008 and, in 2014, reached almost 275% of the 2008 figure (493 vs. 180). The last two years seem to show a decrease, but even in 2016 the figure (434) is still about 241% of the 2008 value.
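
As a worked check of these figures, the 1.0 row of Table 1 can be expressed relative to 2008. This is a small sketch with the values copied by hand from the table; the variable names are illustrative, and 2017 is left out because it is incomplete.

```python
import pandas as pd

# Row 1.0 of Table 1 (newcomers per year), 2008 to 2016.
newcomers = pd.Series(
    [180, 180, 203, 312, 369, 435, 493, 465, 434],
    index=range(2008, 2017),
)

# Each year as a percentage of the 2008 value.
relative_to_2008 = (newcomers / newcomers.loc[2008] * 100).round(1)
print(relative_to_2008)
```

With these values 2014 comes out at about 274% of the 2008 level and 2016 at about 241%, i.e. increases of roughly 174% and 141% over 2008.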

From the Mozilla Staff point of view ([see Table 2][2]) there is an interesting difference when we look at the 1.0 row. It seems to follow a different pattern from the general ([Table 1][1]) and Others ([Table 3][3]) tables. First, the difference between 2008 and 2016 is much smaller than in the other tables. Also, from 2014 onwards there is a sharp decrease in the number of people joining Mozilla Staff.

#### Retention 
(How many contributors are no longer active (leaving) in the community?)

If we compare (1.0, 2008) to (2.0, 2009) we find we lost 75 people. The next year we lost 26 more, and so on until 2016, when 34 people remain from those 180 attracted in 2008.

Comparing the retention of Mozilla Staff to Others, Mozilla Staff has fewer people with more than 2 years of experience in 2008 (we will talk about this in the [next section](#Evolution-of-Community)). Nevertheless, when new people come to the community, a higher percentage of them remains for the rest of the analyzed period. From this, we have two points to remark:
* Mozilla Staff is composed of more experienced contributors in 2016.
* Retention rate is better (as could be expected) for Mozilla Staff.
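
To put numbers on this comparison, the following sketch follows the people who had 1 year of experience in 2008 along the diagonals of Table 2 and Table 3, and expresses each year as a share of the starting group. The values are copied by hand from those tables; the variable names are illustrative.

```python
import pandas as pd

# Diagonal starting at (1.0, 2008) in Table 2 (Mozilla Staff) and Table 3 (Others).
cohort_2008 = pd.DataFrame(
    {
        "Mozilla Staff": [93, 76, 61, 57, 51, 44, 38, 33, 30],
        "Others": [87, 29, 18, 11, 7, 4, 4, 4, 4],
    },
    index=range(2008, 2017),
)

# Share of the original 2008 group still active in each year.
retained_pct = (cohort_2008 / cohort_2008.iloc[0] * 100).round(1)
print(retained_pct)
```

For this group, about 32% of Mozilla Staff are still active in 2016 compared with under 5% of Others, and most of the gap opens in the first year (roughly 82% vs. 33% retained in 2009).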

#### Evolution of Community 
(How is the attraction / retention ratio, and the net gain of contributors, over time?)

Comparing 2016 with 2008, the numbers are much better now than then: every group has more people. However, the groups with 1 to 3 years of experience have not grown since 2014; if anything, they seem to be decreasing.
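
These two observations can be checked directly against Table 1. The sketch below copies the 2008, 2014 and 2016 columns by hand and computes the change per experience group; the variable names are illustrative.

```python
import pandas as pd

# Columns 2008, 2014 and 2016 of Table 1, indexed by years of experience.
table_1 = pd.DataFrame(
    {
        "2008": [180, 204, 37, 16, 18, 9, 8, 4, 6, 1],
        "2014": [493, 225, 154, 119, 67, 46, 42, 47, 2, 1],
        "2016": [434, 224, 123, 123, 86, 86, 50, 38, 34, 36],
    },
    index=[float(n) for n in range(1, 11)],
)

# Net change per experience group over the whole period and since 2014.
since_2008 = table_1["2016"] - table_1["2008"]
since_2014 = table_1["2016"] - table_1["2014"]
print(pd.DataFrame({"since 2008": since_2008, "since 2014": since_2014}))
```

Every group shows a positive change since 2008, while the groups with 1 to 3 years of experience show a negative change since 2014.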

In general numbers tell us the community has been growing during this period. New contributors come to the community and a number of them contribute for years. In terms of experience, we find several groups well represented.

Comparing the evolution of Mozilla Staff to that of Others, Mozilla Staff has fewer people with more than 2 years of experience in 2008. Nevertheless, once a group begins to be populated, its numbers increase for the rest of the analyzed period. In other words, the numbers in the 2016 column for Mozilla Staff are clearly higher than those in the first columns of the same rows, when those experience groups first appeared in the community.

### Quick Summary of Mozilla vs Community

* Mozilla Staff doesn't show the same pattern in the 1.0 row. **Fewer people come**.
* Mozilla Staff shows more stability when following the same group of people through time. **More people remain**.
* Mozilla Staff seems to have less experienced contributors at the beginning, but much more experienced ones in 2016. This implies a much higher retention rate among Mozilla Staff. More experience usually implies a **better understanding of projects**.

[1]: #Table-1.-Experience-in-Git-contributors.-General-view
[2]: #Table-2.-Experience-in-Git-contributors.-Mozilla-view.
[3]: #Table-3.-Experience-in-Git-contributors.-Others-view.


In [34]:
exp_groups_evo_df

exp,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
1.0,180.0,180.0,203.0,312.0,369.0,435.0,493.0,465.0,434.0,69.0
2.0,204.0,105.0,106.0,133.0,211.0,206.0,225.0,212.0,224.0,107.0
3.0,37.0,91.0,79.0,80.0,98.0,149.0,154.0,163.0,123.0,83.0
4.0,16.0,5.0,83.0,68.0,59.0,82.0,119.0,109.0,123.0,54.0
5.0,18.0,0.0,4.0,74.0,58.0,49.0,67.0,100.0,86.0,82.0
6.0,9.0,1.0,0.0,4.0,62.0,48.0,46.0,54.0,86.0,47.0
7.0,8.0,1.0,1.0,0.0,3.0,52.0,42.0,42.0,50.0,57.0
8.0,4.0,1.0,1.0,1.0,0.0,2.0,47.0,37.0,38.0,39.0
9.0,6.0,1.0,0.0,1.0,0.0,1.0,2.0,43.0,34.0,16.0
10.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,2.0,36.0,21.0


##### Table 1. Experience in Git contributors. General view

In [35]:
exp_groups_evo_moz_df

exp,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
1.0,93.0,93.0,112.0,186.0,219.0,226.0,188.0,129.0,114.0,16.0
2.0,96.0,76.0,75.0,89.0,152.0,159.0,158.0,117.0,103.0,48.0
3.0,7.0,83.0,61.0,63.0,75.0,123.0,132.0,124.0,85.0,50.0
4.0,0.0,5.0,75.0,57.0,48.0,70.0,102.0,103.0,99.0,39.0
5.0,1.0,0.0,4.0,66.0,51.0,42.0,59.0,90.0,84.0,70.0
6.0,1.0,1.0,0.0,4.0,56.0,44.0,38.0,50.0,77.0,46.0
7.0,0.0,1.0,1.0,0.0,3.0,47.0,38.0,35.0,46.0,53.0
8.0,1.0,0.0,1.0,1.0,0.0,2.0,44.0,33.0,31.0,37.0
9.0,1.0,1.0,0.0,1.0,0.0,1.0,2.0,40.0,30.0,13.0
10.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,2.0,35.0,18.0


##### Table 2. Experience in Git contributors. Mozilla view.

In [37]:
exp_groups_evo_others_df

exp,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
1.0,87.0,87.0,91.0,126.0,150.0,209.0,305.0,336.0,320.0,53.0
2.0,108.0,29.0,31.0,44.0,59.0,47.0,67.0,95.0,121.0,59.0
3.0,30.0,8.0,18.0,17.0,23.0,26.0,22.0,39.0,38.0,33.0
4.0,16.0,0.0,8.0,11.0,11.0,12.0,17.0,6.0,24.0,15.0
5.0,17.0,0.0,0.0,8.0,7.0,7.0,8.0,10.0,2.0,12.0
6.0,8.0,0.0,0.0,0.0,6.0,4.0,8.0,4.0,9.0,1.0
7.0,8.0,0.0,0.0,0.0,0.0,5.0,4.0,7.0,4.0,4.0
8.0,3.0,1.0,0.0,0.0,0.0,0.0,3.0,4.0,7.0,2.0
9.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,4.0,3.0
10.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,3.0


##### Table 3. Experience in Git contributors. Others view.