## Goal: determining attraction / retention

An important characteristic of a how a community evolves is how contributors are joining and leaving. Determining when contributors leave is not easy (as they could be on a temporary leave, but coming back later), but if after a certain period they are still inactive, it is very likely they can be considered 'gone'. With this definition, the evolution of attraction and retention, and its difference (net gain of developers, which can be negative) can be computed.

These data could be determined for each of the specific contributors groups defined in the first goal. 

This goal can be refined in the following questions:

**Questions**:

* How many contributors are joining the community?
* How many contributos are no longer active (leaving) in the community?
* How is the attraction / retention ratio, and the net gain of contributors, over time?

To answer these questions, the following metrics can be used:

**Metrics**:

* Number of contributors joining the community over time (attracted)
* Number of contirbutors leaving (becoming inactive) over time
* Number of contributors not leaving (retained) over time

These metrics can be computed for each of the "cohorts", defined as the groups of contributors joining during a certain period of time (for example, during each year). Some of these metrics will be computed for the speficied contributor groups, over time.

# Metric Calculations
First we need to load a connection against the proper ES instance. We use an external module to load credentials from a file that will not be shared. If you want to run this, please use your own credentials, just put them in a file named '.settings' (in the same directory as this notebook) following the example file 'settings.sample'.

This section includes common code to manage and plot data. Queries will be available at the corresponding section.

**TODO**: Add bot and merges filtering.

**TODO** : provide plots similar to:

https://analytics.mozilla.community/edit/app/kibana#/dashboard/Community-Analytics-Demographics

In [1]:
from datetime import datetime
import pandas

import plotly as plotly
import plotly.graph_objs as go

import util as ut

from util import ESConnection
from elasticsearch_dsl import Search

es_conn = ESConnection()

In [2]:
def create_search(source):
    s = Search(using=es_conn, index=source)
    # TODO: Add bot and merges filtering.
    #s = s.filter('range', grimoire_creation_date={'gt': 'now/M-2y', 'lt': 'now/M'})
    #s.params(timeout=100)
    return s

In [3]:
def get_authors_df(result, author_bucket_field):        

    # Get a dataframe with each author and their first commit
    buckets_result = result['aggregations'][author_bucket_field]['buckets']

    buckets = []
    for bucket_author in buckets_result:
        author = bucket_author['key']

        first = bucket_author['first']['hits']['hits'][0]
        first_commit = first['sort'][0]/1000
        last_commit = bucket_author['last_commit']['value']/1000
        org_name = first['_source']['author_org_name']
        uuid = first['_source']['author_uuid']
        buckets.append({
                'first_commit': datetime.utcfromtimestamp(first_commit),
                'last_commit': datetime.utcfromtimestamp(last_commit),
                'author': author,
                'uuid': uuid,
                'org': org_name
        })
    authors_df = pandas.DataFrame.from_records(buckets)
    authors_df.sort_values(by='first_commit', ascending=False,
                            inplace=True)
    return authors_df

def get_active_authors_df(result, author_bucket_field, year):
    """Returns a dataframe with first and last commit of those authors
    whose last commit was made within a given year"""

    # Get a dataframe with each author and their first commit
    buckets_result = result['aggregations'][author_bucket_field]['buckets']

    buckets = []
    for bucket_author in buckets_result:
        author = bucket_author['key']

        first = bucket_author['first']['hits']['hits'][0]
        first_commit = first['sort'][0]/1000
        last_commit = bucket_author['last_commit']['value']/1000
        org_name = first['_source']['author_org_name']
        #uuid = first['_source']['author_uuid']
        if datetime.utcfromtimestamp(last_commit).year == year:
            buckets.append({
                    'first_commit': datetime.utcfromtimestamp(first_commit),
                    'last_commit': datetime.utcfromtimestamp(last_commit),
                    'author': author,
                    #'uuid': uuid,
                    'org': org_name
            })
    authors_df = pandas.DataFrame.from_records(buckets)
    authors_df.sort_values(by='first_commit', ascending=False,
                            inplace=True)
    return authors_df

In [4]:
def print_horizontal_bar_chart(df, experience_field, title, min_range = 0):
    
    plotly.offline.init_notebook_mode(connected=True)
    
    experience = list(range(min_range, int(df[experience_field].max()) + 1))
    
    people_count = []
    for exp in experience:
        people_count.append(len(df.loc[df[experience_field] == exp]))
        
    data = [go.Bar(
            x=people_count,
            y=experience,
            orientation = 'h'
    )]
    
    layout = go.Layout(
        barmode='group',
        title= title
    )

    fig = go.Figure(data=data, layout=layout)
    plotly.offline.iplot(fig, filename='horizontal-bar')
    
    
def print_horizontal_bar_chart_relative(df, experience_field, title):
    
    plotly.offline.init_notebook_mode(connected=True)
    
    experience = list(range(0, int(df[experience_field].max()) + 1))
    
    people_count = []
    first_count = len(df.loc[df[experience_field] == 0])
    for exp in experience:
        current_count = len(df.loc[df[experience_field] == exp])
        people_count.append(current_count * 100/ first_count)
        
    data = [go.Bar(
            x=people_count,
            y=experience,
            orientation = 'h'
    )]
    
    layout = go.Layout(
        barmode='group',
        title= title
    )

    fig = go.Figure(data=data, layout=layout)
    plotly.offline.iplot(fig, filename='horizontal-bar')
    
def print_horizontal_bar_chart_percent(df, experience_field, title):
    
    plotly.offline.init_notebook_mode(connected=True)
    
    experience = list(range(0, int(df[experience_field].max()) + 1))
    
    people_count = []
    cusum = len(df)
    for exp in experience:
        current_count = len(df.loc[df[experience_field] == exp])
        people_count.append(current_count * 100/ cusum)
        
    data = [go.Bar(
            x=people_count,
            y=experience,
            orientation = 'h'
    )]
    
    layout = go.Layout(
        barmode='group',
        title= title
    )

    fig = go.Figure(data=data, layout=layout)
    plotly.offline.iplot(fig, filename='horizontal-bar')
    


# Metrics

## Groups of Contributors based on their experience in the Community

Looking at when a given contributor sent her first commit we calculate how long he has been contributing to the community at a given time. We define following groups:
* People with more than **1 month** of experience and less than **6 months**
* People with more than **6 months** of experience and less than **12 months**
* People with more than **1 year** of experience and less than **2 years**
* People with more than **2 years** of experience and less than **4 years**
* People with more than **4 years**


In [5]:
# Define ranges
ranges = [{
           'from': 'now-1M/M',
           'key': '1- Months'
         }, {
           'from': 'now-6M/M',
           'to': 'now-1M/M',
           'key': '1-6 Months'
         }, {
           'from': 'now-12M/M',
           'to': 'now-6M/M',
           'key': '6-12 Months'
         }, {
           'from': 'now-24M/M',
           'to': 'now-12M/M',
           'key':  '1-2 Years'
         }, {
           'from': 'now-48M/M',
           'to': 'now-24M/M',
           'key': '2-4 Years'
         }, {
           'to': 'now-24M/M',
           'key': '4+ Years'
         }]

In [6]:
s = Search(using=es_conn, index='git')
s.params(timeout=30)

# Unique count of Commits by Project (max 100 projects)
s = s.filter('range', grimoire_creation_date={'gt': 'now/M-2y', 'lt': 'now/M'})
s.aggs\
    .bucket('experience', 'date_range', field='grimoire_creation_date', ranges=ranges)\
    .metric('contributors', 'cardinality', field='author_uuid', precision_threshold=100000)

    #.bucket('org', 'terms', field='author_org_name', size=10)\
    #.bucket('time', 'date_histogram', field='grimoire_creation_date', interval='quarter')\
#print(s.to_dict())
result = s.execute()

# In case you need to check response, uncomment line below
print(result.to_dict()['aggregations'])


{'experience': {'buckets': [{'to': 1430438400000.0, 'doc_count': 0, 'contributors': {'value': 0}, 'key': '4+ Years', 'to_as_string': '2015-05-01T00:00:00.000Z'}, {'from': 1367366400000.0, 'to': 1430438400000.0, 'to_as_string': '2015-05-01T00:00:00.000Z', 'from_as_string': '2013-05-01T00:00:00.000Z', 'contributors': {'value': 0}, 'key': '2-4 Years', 'doc_count': 0}, {'from': 1430438400000.0, 'to': 1462060800000.0, 'to_as_string': '2016-05-01T00:00:00.000Z', 'from_as_string': '2015-05-01T00:00:00.000Z', 'contributors': {'value': 4034}, 'key': '1-2 Years', 'doc_count': 472338}, {'from': 1462060800000.0, 'to': 1477958400000.0, 'to_as_string': '2016-11-01T00:00:00.000Z', 'from_as_string': '2016-05-01T00:00:00.000Z', 'contributors': {'value': 2886}, 'key': '6-12 Months', 'doc_count': 196420}, {'from': 1477958400000.0, 'to': 1491004800000.0, 'to_as_string': '2017-04-01T00:00:00.000Z', 'from_as_string': '2016-11-01T00:00:00.000Z', 'contributors': {'value': 2186}, 'key': '1-6 Months', 'doc_coun

## Time from first to last contrib for authors who made a commit before a given year  

In [7]:
results = []
for i in range(0,4):

    # Buckets by author name, finding first commit for each of them
    s = Search(using=es_conn, index='git')
    s.params(timeout=30)

    # Retrieve commits before given year
    s = s.filter('range', grimoire_creation_date={'lt': 'now-' + str(i) + 'y/y'})

    # Bucketize by uuid and get first and last commit
    s.aggs.bucket('authors', 'terms', field='author_uuid', size=100000) \
        .metric('first', 'top_hits', _source=['author_date', 'author_org_name', 'author_uuid'],
                size=1, sort=[{"author_date": {"order": "asc"}}]) \
        .metric('last_commit', 'max', field='author_date')
    s = s.sort("author_date")
    #       pprint(s.to_dict())
    results.append(s.execute())

In [8]:
authors_dfs = []
for result in results:
    authors_df = get_authors_df(result, author_bucket_field='authors')
    authors_df['active_years'] = (authors_df.last_commit-authors_df.first_commit).astype('timedelta64[Y]')
    authors_dfs.append(authors_df)

authors_dfs


[                                         author        first_commit  \
 11650  149ad981c7f8acd65b9cb9a3f306833fc4b6ec8e 2016-12-31 12:01:23   
 13088  778f8adac79c8593c8413792e9d838e0620797c1 2016-12-31 08:34:00   
 14420  f0a27b3276baf429f51f6afed8ece01f1a09e7d0 2016-12-31 05:11:58   
 12737  5c6bef06c5d1fa2e83f13db2046ec3d9d27bd00d 2016-12-30 23:09:23   
 13497  977b2b48ece5125093c2cfdedb56be00bdba4f13 2016-12-30 22:21:03   
 12664  57a1cf0d7a35c3f77475953873212148d3c01e1b 2016-12-30 21:55:02   
 10797  8c536fb5c49a281f793cb47e933fd71403891960 2016-12-30 17:26:13   
 13625  a1e6b9aca42df46460c5fed6a73cde615b2b04a3 2016-12-30 16:45:19   
 11944  26e5b20e7ccff92ec895b5f4af3f5f0b9a5b0f0a 2016-12-30 14:34:47   
 9983   09a899922407e40e04fa6ab865da040f06c0d904 2016-12-30 12:22:11   
 9687   c91f349eb811f5870fe7b4e987bd34decfbf72cb 2016-12-30 11:34:16   
 13237  829a9b7eae96736b2408c62c4b2843721116fe41 2016-12-29 23:33:52   
 13954  c19b5436d45eb06bb0d9a16accd4a16da1ad9765 2016-12-29 14:2

In [9]:
# Plot bar charts for each dataframe
i = 0
for authors_df in authors_dfs:
#    print(author_df['experience_years'].max(), type(author_df['experience_years'].max()))
    print_horizontal_bar_chart(authors_df, 'active_years', title=str(2017 - i))
    i += 1


In [10]:
# Plot bar charts for each dataframe
i = 0
for authors_df in authors_dfs:
#    print(author_df['experience_years'].max(), type(author_df['experience_years'].max()))
    print_horizontal_bar_chart_relative(authors_df, 'active_years', title=str(2017 - i))
    i += 1

In [11]:
# Plot bar charts for each dataframe
i = 0
for authors_df in authors_dfs:
#    print(author_df['experience_years'].max(), type(author_df['experience_years'].max()))
    print_horizontal_bar_chart_percent(authors_df, 'active_years', title=str(2017 - i))
    i += 1

## Time from first to last commit for authors active in a given year

We define an author as **active** iff she made at least one commit within a given year. E.g. an author would be considered active in 2017 if she made a commit after Jan. 1st, 2017 and before Dec. 31st 2017. 

In [12]:
results = []
for i in range(0,4):

    # Buckets by author name, finding first commit for each of them
    s = Search(using=es_conn, index='git')
    s.params(timeout=30)

    # Retrieve commits before given year
    s = s.filter('range', grimoire_creation_date={'lte': 'now-' + str(i) + 'y/y'})

    # Bucketize by uuid and get first and last commit
    s.aggs.bucket('authors', 'terms', field='author_uuid', size=100000) \
        .metric('first', 'top_hits', _source=['author_date', 'author_org_name', 'author_uuid'],
                size=1, sort=[{"author_date": {"order": "asc"}}]) \
        .metric('last_commit', 'max', field='author_date')
    s = s.sort("author_date")
    #       pprint(s.to_dict())
    results.append(s.execute())

In [13]:
authors_dfs = []
year = 2016
for result in results:
    authors_df = get_active_authors_df(result, author_bucket_field='authors', year=year)
    authors_df['active_years'] = (authors_df.last_commit-authors_df.first_commit).astype('timedelta64[Y]')
    authors_dfs.append(authors_df)
    year -= 1

authors_dfs


[                                        author        first_commit  \
 2387  149ad981c7f8acd65b9cb9a3f306833fc4b6ec8e 2016-12-31 12:01:23   
 2749  778f8adac79c8593c8413792e9d838e0620797c1 2016-12-31 08:34:00   
 3085  f0a27b3276baf429f51f6afed8ece01f1a09e7d0 2016-12-31 05:11:58   
 2663  5c6bef06c5d1fa2e83f13db2046ec3d9d27bd00d 2016-12-30 23:09:23   
 2157  8c536fb5c49a281f793cb47e933fd71403891960 2016-12-30 17:26:13   
 2879  a1e6b9aca42df46460c5fed6a73cde615b2b04a3 2016-12-30 16:45:19   
 1932  09a899922407e40e04fa6ab865da040f06c0d904 2016-12-30 12:22:11   
 1874  c91f349eb811f5870fe7b4e987bd34decfbf72cb 2016-12-30 11:34:16   
 2783  829a9b7eae96736b2408c62c4b2843721116fe41 2016-12-29 23:33:52   
 2462  286f050fe75f8123df329c88b74f3dd61b73bda4 2016-12-28 18:43:08   
 2954  bd1da52b37734708ba675cb93b8ab3ab5ac5446c 2016-12-28 18:09:59   
 2409  1c9ccec2047e42eb37053a6e93cb10781c56ec82 2016-12-28 16:53:00   
 2607  4d0114c7876b3d23637a32f6e203ab41d3f44c29 2016-12-28 07:37:45   
 2352 

In [14]:
# Plot bar charts for each dataframe
i = 0
for authors_df in authors_dfs:
#    print(author_df['experience_years'].max(), type(author_df['experience_years'].max()))
    print_horizontal_bar_chart(authors_df, 'active_years', title=str(2016 - i))
    i += 1


## Years of Experience
We consider **12 commits** per year, i.e. one commit per month aprox., as a minimum to add one year of experience to a given author. From this assumption, we build groups of authors by years of experience. As a result, we present a plot with number of people in each group.

To give a more complete idea of how community evolves, we plot snapshots corresponding to different years. Each of them will take all commits sent until the given year, and calculate years of experience for all authors in that slice.

In [15]:
###
## GET COMMITS BY YEAR AND AUTHOR
###

results = []
min_commits = 1

for i in range(0,10):

    # Buckets by author name, finding first commit for each of them
    s = create_search(source='git')
    
    # Retrieve commits before given year
    s = s.filter('range', grimoire_creation_date={'lte': 'now-' + str(i) + 'y/y'})

    # Bucketize by uuid and get first and last commit
    s.aggs.bucket('time', 'date_histogram', field='grimoire_creation_date', interval='year')\
        .bucket('authors', 'terms', field='author_uuid', size=100000, min_doc_count=min_commits) \
        .metric('commits', 'cardinality', field='hash', precision_threshold=1000)

    r = s.execute()
    # In case you need to check response, uncomment line below
    #print(len(r.to_dict()['aggregations']['time']['buckets']))
        
    results.append(r)
    
#results

In [16]:
###
## CREATE A DF CONTAINING, FOR EACH AUTHOR UUID, COUNT OF YEARS OF EXPERIENCE (YEARS
## WITH MORE THAN 12 COMMITS MADE) AND LAST YEAR ACTIVE
###
exp_df_list = []
year = 2017

for result in results:
    exp_df = ut.to_df_by_time(result, 'Author', 'Time', 'Commits', 'authors', 'time', 'commits')
    exp_df['Time'] = exp_df['Time'].apply(lambda x: str(pandas.Period(x,'A')))
    
    ## ACTIVE CONDITION
    ## Filter those having less than 12 commits per year
    exp_df = exp_df[exp_df['Commits'] >= 12]
    
    ## Group by author, get MAX YEAR and NUMBER OF ROWS FOR THE GIVEN AUTHOR
    exp_df = exp_df.groupby(['Author']).agg({'Time': 'max', 'Commits': 'count'})
    ## Filter those whose last active year is not the one we want
    exp_df = exp_df[exp_df['Time'] == str(year)]
    
    exp_df['exp'] = exp_df['Commits']
    exp_df['last_active'] = exp_df['Time']
    exp_df= exp_df.drop('Commits', axis=1)
    exp_df = exp_df.drop('Time', axis=1)
    
    exp_df_list.append(exp_df)
    
    year -= 1

exp_df_list


[                                          exp last_active
 Author                                                   
 000063c4e47e93ab3b30607680609e4d2500ce5d    4        2017
 002893ffe1425c220756f8ba4c78e1e3bb0be50f    7        2017
 00834d313bfc6fc60be1631bcc57b2c05ee2e0e3    9        2017
 00846eff46b051d92317fc74e54041c6fdccd7cf   10        2017
 00a40f9e9e7f7633ddab8291a99e1e487f88481c    3        2017
 00b934012989b386ac9efc706dbc28cd6be173c6    2        2017
 00d00a6e7530f1ccac19b98727f25b6528ed9fe3    1        2017
 00da6ede3bff8db21a33473d7e552b76f50757eb    5        2017
 00ed7b25063cf90c8bdeb9d45d37b73ba2317f96    6        2017
 01307140d33369746c5013f45295092f8752d378    4        2017
 013bbdb9412b88db677df21347e032c57b099d97    3        2017
 016a5f2ec7191e74e984c71787b2b292d89543a1    4        2017
 0183bd06d487bf71fd39a568ae892e3ec3079d82    5        2017
 019ee580517ec3e8eb2c90ad0e3b2d749e9f0359    9        2017
 01b36f5096db2e9bd47f3c2817644dec9e7dff17    8        20

In [17]:
# Plot bar charts for each dataframe
i = 0
for exp_df in exp_df_list:
    if not exp_df.empty:
        print_horizontal_bar_chart(exp_df, 'exp', title=str(2017 - i), min_range=1)
    i += 1

In [18]:
exp_groups_evo_df = pandas.DataFrame(columns=['last_active', 'exp', 'count'])

for exp_df in exp_df_list:
    
    if exp_df.empty:
        continue
    
    experience = list(range(1, int(exp_df['exp'].max()) + 1))
    
    last_active = exp_df['last_active'].unique()[0]
    for exp in experience:
        count = len(exp_df.loc[exp_df['exp'] == exp])
        #print(last_active, exp, count)
        exp_groups_evo_df.loc[len(exp_groups_evo_df)] = [last_active, exp, count]
        
print('Max. Exp: ', exp_groups_evo_df['exp'].max(), 'Max. Count: ',  exp_groups_evo_df['count'].max())
exp_groups_evo_df
    
    

Max. Exp:  17.0 Max. Count:  501.0


Unnamed: 0,last_active,exp,count
0,2017,1.0,68.0
1,2017,2.0,111.0
2,2017,3.0,87.0
3,2017,4.0,58.0
4,2017,5.0,87.0
5,2017,6.0,48.0
6,2017,7.0,59.0
7,2017,8.0,40.0
8,2017,9.0,17.0
9,2017,10.0,21.0


In [19]:
exp_groups_evo_df = pandas.DataFrame(columns=['exp'])

for exp_df in exp_df_list:
    
    if exp_df.empty:
        continue
    
    year = exp_df['last_active'].unique()[0]
    exp_groups_df = pandas.DataFrame(columns=['exp', year])
    
    experience = list(range(1, int(exp_df['exp'].max()) + 1))
    
    for exp in experience:
        count = len(exp_df.loc[exp_df['exp'] == exp])
        exp_groups_df.loc[len(exp_groups_df)] = [exp, count]

    exp_groups_evo_df = exp_groups_evo_df.merge(exp_groups_df, on='exp', how='outer')


# Fill Nan with 0's
exp_groups_evo_df = exp_groups_evo_df.fillna(0)

# Reorder columns
exp_groups_evo_df = exp_groups_evo_df.set_index('exp')
exp_groups_evo_df = exp_groups_evo_df.sort_index(axis=1)


#print('Max. Exp: ', exp_groups_evo_df['exp'].max(), 'Max. Count: ')
exp_groups_evo_df

Unnamed: 0_level_0,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
exp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
1.0,183.0,183.0,205.0,322.0,374.0,451.0,501.0,470.0,447.0,68.0
2.0,204.0,105.0,107.0,135.0,216.0,207.0,240.0,220.0,226.0,111.0
3.0,38.0,90.0,79.0,80.0,100.0,153.0,154.0,173.0,126.0,87.0
4.0,16.0,6.0,82.0,69.0,59.0,83.0,124.0,111.0,132.0,58.0
5.0,18.0,0.0,5.0,73.0,58.0,50.0,67.0,104.0,88.0,87.0
6.0,9.0,1.0,0.0,5.0,61.0,48.0,47.0,55.0,90.0,48.0
7.0,8.0,1.0,1.0,0.0,4.0,51.0,42.0,43.0,51.0,59.0
8.0,4.0,1.0,1.0,1.0,0.0,3.0,47.0,37.0,38.0,40.0
9.0,6.0,1.0,0.0,1.0,0.0,1.0,3.0,42.0,34.0,17.0
10.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,3.0,36.0,21.0


In [20]:
plotly.offline.init_notebook_mode(connected=True)

data = []
for exp in exp_groups_evo_df.index.values:
    #print(exp, '\n', exp_groups_evo_df.loc[exp].tolist(), '\n', exp_groups_evo_df.loc[exp].index.values)
    data.append(
        go.Scatter(
            x = exp_groups_evo_df.loc[exp].index.values,
            y = exp_groups_evo_df.loc[exp].tolist(),
            mode = 'lines+markers',
            name = str(int(exp)) + ' years'
        )
    )
    


plotly.offline.iplot(data, filename='line-mode')    

## Retention by experience
Next table shows percentage of people remaining in the community for each experience group. It is calculated not by following individuals, but comparing total number of people in each group.

To read the table have into account that each number corresponds to percentage of people remaining in a given group (e.g. 2 years of experience in 2010) with respect to same group during previous year (i.e. 1 year of experience in 2009).

So, if we look at cell (3.0, 2010) we can read it as number of people with **3** years of experience in **2010** represents a 78.90% of people having **2** years of experience in **2009**. Note that both groups are in fact the same as they evolve through time, increasing their years of experience. 

In [21]:
exp_groups_evo_diff_df = pandas.DataFrame()

for exp in exp_groups_evo_df.index.values:
    #print(exp, '\n', exp_groups_evo_df.loc[exp].tolist(), '\n', exp_groups_evo_df.loc[exp].index.values)
    
    cols = list(exp_groups_evo_df)
    min_col = int(cols[0])
    
    #print(exp - 1)
    for col in list(exp_groups_evo_df): 
        current_val = exp_groups_evo_df.get_value(exp, col)
        prev_row = exp - 1
        prev_col= int(col) - 1
        if prev_row > 0 and prev_col >= min_col:
            prev_val = exp_groups_evo_df.get_value(prev_row, str(prev_col))
            #print(col, prev_val, current_val, prev_val - current_val)
            if prev_val == 0:
                percent = 0
            else:
                percent = current_val * 100 / prev_val
            exp_groups_evo_diff_df.set_value(exp, col, round(percent, 2))

exp_groups_evo_diff_df

    


Unnamed: 0,2009,2010,2011,2012,2013,2014,2015,2016,2017
2.0,57.38,58.47,65.85,67.08,55.35,53.22,43.91,48.09,24.83
3.0,44.12,75.24,74.77,74.07,70.83,74.4,72.08,57.27,38.5
4.0,15.79,91.11,87.34,73.75,83.0,81.05,72.08,76.3,46.03
5.0,0.0,83.33,89.02,84.06,84.75,80.72,83.87,79.28,65.91
6.0,5.56,0.0,100.0,83.56,82.76,94.0,82.09,86.54,54.55
7.0,11.11,100.0,0.0,80.0,83.61,87.5,91.49,92.73,65.56
8.0,12.5,100.0,100.0,0.0,75.0,92.16,88.1,88.37,78.43
9.0,25.0,0.0,100.0,0.0,0.0,100.0,89.36,91.89,44.74
10.0,16.67,0.0,0.0,100.0,0.0,100.0,100.0,85.71,61.76
11.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,133.33,66.67


### Following evolution of groups

Another way of visualizing this is selecting a group from a particular year and following its evolution throug time. This way, we will get number of people belonging to that group through time.

Next table shows evolution of groups from a given year, in this case **2008**. So first row shows how the 138 people who had 1 year of experience in 2008 have evolved throug years, losing people year by year until 2016, when there are only 8 of them. 

In [22]:
group_evo_df = pandas.DataFrame()

years = list(exp_groups_evo_df)

# Group we want to follow
first_year = 2008


for exp_group in exp_groups_evo_df.index.values:
    
    exp_index = exp_group

    for year in years:
        if int(year) < first_year:
            continue

        if exp_index > exp_groups_evo_df.index.values.max():
            break

        people_count = exp_groups_evo_df.get_value(exp_index, year)
        #print(exp_index, year, people_count)
        group_evo_df.set_value(exp_group, year, people_count)
        exp_index += 1

group_evo_df = group_evo_df.fillna(0)
group_evo_df


Unnamed: 0,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
1.0,183.0,105.0,79.0,69.0,58.0,48.0,42.0,37.0,34.0,21.0
2.0,204.0,90.0,82.0,73.0,61.0,51.0,47.0,42.0,36.0,24.0
3.0,38.0,6.0,5.0,5.0,4.0,3.0,3.0,3.0,4.0,2.0
4.0,16.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0
5.0,18.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
6.0,9.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
7.0,8.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8.0,4.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9.0,6.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0
10.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


We can look at the same data in percentages over previous years, i.e. how many people are retained in a group from one year to the following:

In [23]:
group_evo_percent_df = pandas.DataFrame()

years = list(exp_groups_evo_df)

# Group we want to follow
first_year = 2008


for exp_group in exp_groups_evo_df.index.values:
    
    exp_index = exp_group

    for year in years:
        if int(year) < first_year:
            continue

        if exp_index > exp_groups_evo_df.index.values.max():
            break
        
        current_val = exp_groups_evo_df.get_value(exp_index, year)
        
        if int(year) == first_year:
            prev_val = current_val
        
        if prev_val == 0:
            percent = current_val * 100
        else:
            percent = current_val * 100 / prev_val
        group_evo_percent_df.set_value(exp_group, year, round(percent, 2))
        
        exp_index += 1
        prev_val = current_val
              
        

group_evo_percent_df = group_evo_percent_df.fillna(0)
group_evo_percent_df

Unnamed: 0,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
1.0,100.0,57.38,75.24,87.34,84.06,82.76,87.5,88.1,91.89,61.76
2.0,100.0,44.12,91.11,89.02,83.56,83.61,92.16,89.36,85.71,66.67
3.0,100.0,15.79,83.33,100.0,80.0,75.0,100.0,100.0,133.33,50.0
4.0,100.0,0.0,0.0,0.0,0.0,100.0,100.0,0.0,0.0,0.0
5.0,100.0,5.56,100.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0
6.0,100.0,11.11,100.0,100.0,100.0,0.0,0.0,0.0,0.0,0.0
7.0,100.0,12.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8.0,100.0,25.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9.0,100.0,16.67,100.0,100.0,100.0,100.0,100.0,100.0,100.0,0.0
10.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
