# Are Open Source Machine Learning Projects Driven by Researchers?

## Preamble

New Machine Learning algorithms appear frequently and are regularly integrated into Open Source libraries.

We believe that a heterogeneous crowd of contributors takes part in these projects, including a minority of researchers working in the field of machine learning, but that these researchers account for most of the progress made by these projects.

We want to evaluate the following hypotheses:
* Most contributors are researchers.
* Most contributions come from researchers.
* Researchers individually contribute more than other contributors.

Our working assumptions define a few notions and presume that contributors filled in their *commit* information in good faith:
* A researcher has an email address from an academic institution *or* has an author profile with publications on Google Scholar.
* The first and last name given in each *commit* are the author's.
* The email address given in each *commit* is the author's.

To cover as wide a range as possible, we conduct the study on thirty-four open source Machine Learning projects (see the list at the root of our [Git repository](https://github.com/AntoineAube/reace-study)).

The data used in this document was produced by running several scripts:
* Extracting information from each project's Git repository with Repodriller.
* Classifying contributors against the criteria that determine which of them are researchers.

Everything needed to reproduce the study is available in our Git repository.

In [None]:
# Let us import some awesome libraries!
import pandas as pd
import numpy as np
import math
import os
from operator import itemgetter
import datetime

import pygal

# Definition of some constants.
DATASETS_LOCATION = 'drilled-informations/commits-information'

In [None]:
# Let us load the commits datasets.

projects_commits = {}

for filename in map(lambda filename: filename.split('.csv')[0], os.listdir(DATASETS_LOCATION)):
    commits = pd.read_csv(DATASETS_LOCATION + '/' + filename + '.csv')
    
    # Add a PROJECT column because they are going to be merged.
    commits['PROJECT'] = filename
    
    # The generated timestamps are 1000 times too big for fromtimestamp, which
    # expects seconds; they appear to be expressed in milliseconds.
    commits['TIMESTAMP'] = commits['TIMESTAMP'].apply(lambda timestamp: datetime.datetime.fromtimestamp(timestamp / 1000))
    
    projects_commits[filename] = commits
    
projects_commits['scikit-learn'].sample(3)
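
The division by 1000 in the cell above suggests that the `TIMESTAMP` column holds milliseconds since the Unix epoch. Under that assumption, a minimal sketch of a vectorized conversion uses `pd.to_datetime` with `unit='ms'`. The toy values below are illustrative and do not come from the study's data.

```python
import pandas as pd

# Toy frame standing in for one project's CSV (illustrative values only).
commits = pd.DataFrame({'TIMESTAMP': [0, 1500000000000]})

# Assumption: the integers are milliseconds since the Unix epoch.
commits['TIMESTAMP'] = pd.to_datetime(commits['TIMESTAMP'], unit='ms')

print(commits)
```

Note that `datetime.datetime.fromtimestamp` converts to the machine's local time zone, whereas this sketch yields naive timestamps expressed in UTC, so the two approaches can differ by the local UTC offset.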

## Which contributors are researchers?

In line with our working assumption, we use two criteria to decide whether a contributor is a researcher.

A script run beforehand built a dataset as follows:
* From the list of *commits* of each project, list the contributors by name; for each contributor, list their email addresses.
* Using a whitelist of academic email domains, determine for each contributor whether they have at least one academic address.
* Using each contributor's name, search for an author of publications on Google Scholar to determine whether they are known there.
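
The whitelist step above can be sketched as follows. This is only an illustration, not the study's actual script: the `ACADEMIC_DOMAINS` set and the three helper functions are hypothetical, and matching parent domains (so that `cs.mit.edu` is accepted when `mit.edu` is whitelisted) is an assumption about how such a whitelist might be applied.

```python
# Hypothetical whitelist; the study's real list is not reproduced here.
ACADEMIC_DOMAINS = {'inria.fr', 'mit.edu', 'ox.ac.uk'}

def email_domain(email):
    """Return the lower-cased domain of an email address ('' if malformed)."""
    _, separator, domain = email.strip().rpartition('@')
    return domain.lower() if separator else ''

def is_academic_email(email, whitelist=ACADEMIC_DOMAINS):
    """True if the domain, or one of its parent domains, is whitelisted."""
    labels = email_domain(email).split('.')
    # 'cs.mit.edu' is tried as 'cs.mit.edu', then 'mit.edu', then 'edu'.
    return any('.'.join(labels[i:]) in whitelist for i in range(len(labels)))

def has_academic_email(emails, whitelist=ACADEMIC_DOMAINS):
    """True if at least one of a contributor's addresses is academic."""
    return any(is_academic_email(email, whitelist) for email in emails)

print(has_academic_email(['bob@gmail.com', 'bob@cs.mit.edu']))
```

A contributor with several addresses is flagged as soon as one of them matches, which mirrors the "at least one academic address" rule stated above.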

In [None]:
contributors_status = pd.read_csv('known-contributors.csv')

contributors_status.sample(3)

By taking the union of the two computed columns, we can determine which contributors we consider to be researchers.

In [None]:
def decide_if_researcher(row):
    if math.isnan(row['HAS_RESEARCHER_EMAIL']):
        row['IS_RESEARCHER'] = row['HAS_PUBLICATION']
    elif math.isnan(row['HAS_PUBLICATION']):
        row['IS_RESEARCHER'] = row['HAS_RESEARCHER_EMAIL']
    else:
        row['IS_RESEARCHER'] = row['HAS_PUBLICATION'] or row['HAS_RESEARCHER_EMAIL']
        
    return row

contributors_status = contributors_status.apply(decide_if_researcher, axis = 1)

contributors_status.set_index('NAME', inplace = True)

contributors_status.sample(3)
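
As a possible alternative to the row-wise `apply` above, the union of the two criteria can be sketched in a vectorized way. The toy frame is illustrative. One difference is deliberate and should be kept in mind: when both criteria are missing, this sketch yields `False`, whereas `decide_if_researcher` leaves a missing value.

```python
import numpy as np
import pandas as pd

# Toy stand-in for known-contributors.csv (illustrative values only).
status = pd.DataFrame({
    'NAME': ['A', 'B', 'C', 'D'],
    'HAS_RESEARCHER_EMAIL': [True, np.nan, False, np.nan],
    'HAS_PUBLICATION': [False, True, np.nan, np.nan],
})

criteria = status[['HAS_RESEARCHER_EMAIL', 'HAS_PUBLICATION']]

# A contributor is a researcher if at least one criterion is True.
# Missing values compare unequal to True, so they count as "not met".
status['IS_RESEARCHER'] = criteria.eq(True).any(axis=1)

print(status)
```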

In [None]:
# Share of each status among all contributors, as a percentage to match the title.
researchers_distribution = contributors_status['IS_RESEARCHER'].value_counts() / len(contributors_status) * 100

pie = pygal.Pie(inner_radius = .4)
pie.title = 'Contributors professions in the studied projects (in %)'
pie.add('Researcher', researchers_distribution[True])
pie.add('Not researcher', researchers_distribution[False])

In [None]:
# Let us annotate the commits for future uses.
def commit_has_been_made_by_researcher(row):
    global contributors_status
    
    row['IS_RESEARCHER'] = contributors_status['IS_RESEARCHER'][row['AUTHOR_NAME']]
    
    return row

for project_name in projects_commits.keys():
    commits = projects_commits[project_name]
    
    projects_commits[project_name] = commits.apply(commit_has_been_made_by_researcher, axis = 1)
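
The same annotation can be sketched without a row-wise `apply`, by mapping author names through the `IS_RESEARCHER` column. The toy frames below are illustrative. The sketch assumes contributor names are unique in the index of `contributors_status`; unlike the lookup above, an unknown author yields a missing value instead of raising a `KeyError`.

```python
import pandas as pd

# Toy stand-ins for the notebook's frames (illustrative values only).
contributors_status = pd.DataFrame(
    {'IS_RESEARCHER': [True, False]},
    index=pd.Index(['Ada', 'Bob'], name='NAME'),
)
commits = pd.DataFrame({'AUTHOR_NAME': ['Ada', 'Bob', 'Ada']})

# Series.map looks each author name up in the index of the given Series.
commits['IS_RESEARCHER'] = commits['AUTHOR_NAME'].map(contributors_status['IS_RESEARCHER'])

print(commits)
```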

## Study questions

The study's hypotheses lead to the following sub-questions:
* Are most contributors researchers?
* Are most contributions produced by researchers?
* Are researchers the contributors who individually contribute the most?

The data we have prepared allows us to answer these questions. Whenever we compare amounts of contribution, we do so both on the number of *commits* and on the number of added/removed lines.

In [None]:
commits = pd.concat(projects_commits.values())
commits.reset_index(drop = True, inplace = True)

commits.sample(3)

### Are most contributors researchers?

In [None]:
def compute_researchers(project_name):
    global projects_commits
    
    project_commits = projects_commits[project_name].drop_duplicates(['PROJECT', 'AUTHOR_NAME'])
    
    return len(project_commits[project_commits['IS_RESEARCHER'] == True]) / len(project_commits)

projects_to_researchers = []
for name in projects_commits.keys():
    projects_to_researchers.append([name, compute_researchers(name)])
    
projects_to_researchers = np.array(sorted(projects_to_researchers, key = itemgetter(1)))
projects_names = projects_to_researchers[:, 0]

bar = pygal.Bar(x_label_rotation = 50, show_legend = False)
bar.title = 'Proportion of researchers per project (in %)'
bar.x_title = 'Project\'s name'
bar.y_title = 'Proportion of researchers'
bar.x_labels = projects_names
bar.add('Researchers', np.array(list(map(float, list(projects_to_researchers[:, 1])))) * 100)

### Are most contributions produced by researchers?

#### By number of *commits*

In [None]:
def compute_researchers(project_name):
    global projects_commits
    
    project_commits = projects_commits[project_name]
    
    return len(project_commits[project_commits['IS_RESEARCHER'] == True]) / len(project_commits)

projects_to_researchers = []
for name in projects_commits.keys():
    projects_to_researchers.append([name, compute_researchers(name)])
    
projects_to_researchers = np.array(sorted(projects_to_researchers, key = itemgetter(1)))
projects_names = projects_to_researchers[:, 0]

bar = pygal.Bar(x_label_rotation = 50, show_legend = False)
bar.title = 'Commits of researchers per project (in %)'
bar.x_title = 'Project\'s name'
bar.y_title = 'Proportion of researchers commits in all commits'
bar.x_labels = projects_names
bar.add('Researchers', np.array(list(map(float, list(projects_to_researchers[:, 1])))) * 100)

#### By number of added/removed lines

In [None]:
def compute_researchers(project_name):
    global projects_commits
    
    project_commits = projects_commits[project_name]
    
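    # Lines added or removed by researchers, over all lines added or removed.
    researchers_commits = project_commits[project_commits['IS_RESEARCHER'] == True]
    researchers_lines = researchers_commits['ADDED_LINES'].sum() + researchers_commits['DELETED_LINES'].sum()
    all_lines = project_commits['ADDED_LINES'].sum() + project_commits['DELETED_LINES'].sum()
    
    return researchers_lines / all_lines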

projects_to_researchers = []
for name in projects_commits.keys():
    projects_to_researchers.append([name, compute_researchers(name)])
    
projects_to_researchers = np.array(sorted(projects_to_researchers, key = itemgetter(1)))
projects_names = projects_to_researchers[:, 0]

bar = pygal.Bar(x_label_rotation = 50, show_legend = False)
bar.title = 'Added and removed lines of researchers per project (in %)'
bar.x_title = 'Project\'s name'
bar.y_title = 'Proportion of researchers modified lines in all modified lines'
bar.x_labels = projects_names
bar.add('Researchers', np.array(list(map(float, list(projects_to_researchers[:, 1])))) * 100)
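
The three per-project proportions computed in the cells above (contributors, *commits*, modified lines) can also be sketched in a single pass with `groupby`. The toy data and the `share` helper are hypothetical and only illustrate the computation; they are not part of the study's scripts.

```python
import pandas as pd

# Toy stand-in for the merged commits frame (illustrative values only).
commits = pd.DataFrame({
    'PROJECT': ['p', 'p', 'p', 'q', 'q'],
    'AUTHOR_NAME': ['Ada', 'Ada', 'Bob', 'Cy', 'Dee'],
    'IS_RESEARCHER': [True, True, False, False, True],
    'ADDED_LINES': [10, 20, 30, 5, 5],
    'DELETED_LINES': [0, 10, 30, 5, 5],
})
commits['MODIFIED_LINES'] = commits['ADDED_LINES'] + commits['DELETED_LINES']
by_researchers = commits[commits['IS_RESEARCHER'] == True]

def share(numerator, denominator):
    """Per-project ratio; projects without any researcher get 0."""
    return (numerator / denominator).fillna(0.0)

shares = pd.DataFrame({
    'CONTRIBUTORS': share(by_researchers.groupby('PROJECT')['AUTHOR_NAME'].nunique(),
                          commits.groupby('PROJECT')['AUTHOR_NAME'].nunique()),
    'COMMITS': share(by_researchers.groupby('PROJECT').size(),
                     commits.groupby('PROJECT').size()),
    'MODIFIED_LINES': share(by_researchers.groupby('PROJECT')['MODIFIED_LINES'].sum(),
                            commits.groupby('PROJECT')['MODIFIED_LINES'].sum()),
})

print(shares)
```

Keeping the three measures side by side makes it easier to spot projects where researchers are few but account for a large share of the changes.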

### Are researchers the contributors who individually contribute the most?

In [None]:
def decide_if_commit_made_by_researcher(row):
    global contributors_status
    
    row['IS_RESEARCHER'] = contributors_status['IS_RESEARCHER'][row['NAME']]
    
    return row

def compute_contributors_statistics(commits):
    statistics = pd.DataFrame(index = commits['AUTHOR_NAME'].unique())
    
    statistics['NUMBER_OF_COMMITS'] = 0
    statistics['ADDED_LINES'] = 0
    statistics['DELETED_LINES'] = 0
    statistics['MODIFIED_LINES'] = 0
    
    for index, row in commits.iterrows():
        name = row['AUTHOR_NAME']
        
        # Update through .loc: chained indexing may write to a temporary copy.
        statistics.loc[name, 'NUMBER_OF_COMMITS'] += 1
        statistics.loc[name, 'ADDED_LINES'] += row['ADDED_LINES']
        statistics.loc[name, 'DELETED_LINES'] += row['DELETED_LINES']
        statistics.loc[name, 'MODIFIED_LINES'] += row['ADDED_LINES'] + row['DELETED_LINES']
        
    statistics['NAME'] = statistics.index
    
    statistics = statistics.apply(decide_if_commit_made_by_researcher, axis = 1)
    
    statistics.reset_index(drop = True, inplace = True)
        
    return statistics

projects_contributors = {}

for project_name in projects_commits.keys():
    contributors = compute_contributors_statistics(projects_commits[project_name])
    contributors['PROJECT'] = project_name
    
    projects_contributors[project_name] = contributors
    
contributors = pd.concat(projects_contributors.values())
contributors.reset_index(drop = True, inplace = True)

contributors.sample(3)
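
The per-contributor totals built above with `iterrows` can also be sketched with a `groupby` aggregation, which avoids the explicit loop. The toy frame is illustrative; the column names follow those used in this notebook.

```python
import pandas as pd

# Toy stand-in for one project's commits (illustrative values only).
commits = pd.DataFrame({
    'AUTHOR_NAME': ['Ada', 'Bob', 'Ada'],
    'ADDED_LINES': [10, 3, 5],
    'DELETED_LINES': [2, 1, 0],
})

grouped = commits.groupby('AUTHOR_NAME')

# Sum the line counts per author, then add the number of commits.
statistics = grouped[['ADDED_LINES', 'DELETED_LINES']].sum()
statistics.insert(0, 'NUMBER_OF_COMMITS', grouped.size())
statistics['MODIFIED_LINES'] = statistics['ADDED_LINES'] + statistics['DELETED_LINES']

# Expose the author name as a regular NAME column.
statistics = statistics.rename_axis('NAME').reset_index()

print(statistics)
```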

#### By number of *commits*

In [None]:
# Collect one summary row per project, then build the frames in one go
# (DataFrame.append was removed in pandas 2.0).
researchers_rows = []
non_researchers_rows = []

for project_name in contributors['PROJECT'].unique():
    project_contributors = contributors[contributors['PROJECT'] == project_name]
    
    project_commits = project_contributors['NUMBER_OF_COMMITS'].sum()
    
    researchers_rows.append(project_contributors[project_contributors['IS_RESEARCHER'] == True]['NUMBER_OF_COMMITS'].describe() / project_commits)
    non_researchers_rows.append(project_contributors[project_contributors['IS_RESEARCHER'] == False]['NUMBER_OF_COMMITS'].describe() / project_commits)

sum_up_researchers = pd.DataFrame(researchers_rows).reset_index(drop = True)
sum_up_non_researchers = pd.DataFrame(non_researchers_rows).reset_index(drop = True)
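
To make the normalization explicit, the sketch below first turns each contributor's count into a share of the project's total, then summarizes the shares per group. The mean, standard deviation and quartiles are the same as when the summary is divided by the total afterwards, since these statistics scale linearly; only `count` differs, as it stays a plain head count here. The toy data is illustrative.

```python
import pandas as pd

# Toy stand-in for the per-contributor frame (illustrative values only).
contributors = pd.DataFrame({
    'PROJECT': ['p', 'p', 'p', 'p'],
    'IS_RESEARCHER': [True, True, False, False],
    'NUMBER_OF_COMMITS': [10, 30, 20, 40],
})

# Each contributor's share of their project's commits.
project_total = contributors.groupby('PROJECT')['NUMBER_OF_COMMITS'].transform('sum')
contributors['COMMITS_SHARE'] = contributors['NUMBER_OF_COMMITS'] / project_total

# One summary row per (project, contributor type).
summary = contributors.groupby(['PROJECT', 'IS_RESEARCHER'])['COMMITS_SHARE'].describe()

print(summary)
```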

In [None]:
plot = pygal.Box(box_mode = 'stdev', legend_at_bottom = True)
plot.title = 'Normalized number of commits per contributor (mean)'
plot.x_title = 'Contributor type'
plot.y_title = 'Mean of normalized number of commits'
plot.add('Researchers', sum_up_researchers['mean'])
plot.add('Non researchers', sum_up_non_researchers['mean'])

In [None]:
plot = pygal.Box(box_mode = 'stdev', legend_at_bottom = True)
plot.title = 'Normalized number of commits per contributor (first quartile)'
plot.x_title = 'Contributor type'
plot.y_title = 'First quartile of normalized number of commits'
plot.add('Researchers', sum_up_researchers['25%'])
plot.add('Non researchers', sum_up_non_researchers['25%'])

In [None]:
plot = pygal.Box(box_mode = 'stdev', legend_at_bottom = True)
plot.title = 'Normalized number of commits per contributor (median)'
plot.x_title = 'Contributor type'
plot.y_title = 'Median of normalized number of commits'
plot.add('Researchers', sum_up_researchers['50%'])
plot.add('Non researchers', sum_up_non_researchers['50%'])

In [None]:
plot = pygal.Box(box_mode = 'stdev', legend_at_bottom = True)
plot.title = 'Normalized number of commits per contributor (third quartile)'
plot.x_title = 'Contributor type'
plot.y_title = 'Third quartile of normalized number of commits'
plot.add('Researchers', sum_up_researchers['75%'])
plot.add('Non researchers', sum_up_non_researchers['75%'])

#### By number of added/removed lines

In [None]:
# Collect one summary row per project, then build the frames in one go
# (DataFrame.append was removed in pandas 2.0).
researchers_rows = []
non_researchers_rows = []

for project_name in contributors['PROJECT'].unique():
    # Work on a copy so that adding a column does not touch a slice of `contributors`.
    project_contributors = contributors[contributors['PROJECT'] == project_name].copy()
    
    project_contributors['MODIFIED_LINES'] = project_contributors['ADDED_LINES'] + project_contributors['DELETED_LINES']
    project_modified_lines = project_contributors['MODIFIED_LINES'].sum()
    
    researchers_rows.append(project_contributors[project_contributors['IS_RESEARCHER'] == True]['MODIFIED_LINES'].describe() / project_modified_lines)
    non_researchers_rows.append(project_contributors[project_contributors['IS_RESEARCHER'] == False]['MODIFIED_LINES'].describe() / project_modified_lines)

sum_up_researchers = pd.DataFrame(researchers_rows).reset_index(drop = True)
sum_up_non_researchers = pd.DataFrame(non_researchers_rows).reset_index(drop = True)

In [None]:
plot = pygal.Box(box_mode = 'stdev', legend_at_bottom = True)
plot.title = 'Normalized number of modified lines per contributor (mean)'
plot.x_title = 'Contributor type'
plot.y_title = 'Mean of normalized number of modified lines'
plot.add('Researchers', sum_up_researchers['mean'])
plot.add('Non researchers', sum_up_non_researchers['mean'])

In [None]:
plot = pygal.Box(box_mode = 'stdev', legend_at_bottom = True)
plot.title = 'Normalized number of modified lines per contributor (first quartile)'
plot.x_title = 'Contributor type'
plot.y_title = 'First quartile of normalized number of modified lines'
plot.add('Researchers', sum_up_researchers['25%'])
plot.add('Non researchers', sum_up_non_researchers['25%'])

In [None]:
plot = pygal.Box(box_mode = 'stdev', legend_at_bottom = True)
plot.title = 'Normalized number of modified lines per contributor (median)'
plot.x_title = 'Contributor type'
plot.y_title = 'Median of normalized number of modified lines'
plot.add('Researchers', sum_up_researchers['50%'])
plot.add('Non researchers', sum_up_non_researchers['50%'])

In [None]:
plot = pygal.Box(box_mode = 'stdev', legend_at_bottom = True)
plot.title = 'Normalized number of modified lines per contributor (third quartile)'
plot.x_title = 'Contributor type'
plot.y_title = 'Third quartile of normalized number of modified lines'
plot.add('Researchers', sum_up_researchers['75%'])
plot.add('Non researchers', sum_up_non_researchers['75%'])