# Detect variants

In [44]:
import pandas
import matplotlib 

%matplotlib inline

In this notebook, we'll detect variants of software packages, based on the following statements:

 - a mainline project is distributed on npm;
 - a variant is a fork of a mainline project;
 - a variant is also distributed on npm.
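
To make these three statements concrete, here is a minimal sketch on made-up data. The package and repository names are hypothetical and are not taken from the dataset; only the column names mirror the ones used in this notebook. A package counts as a variant only when its repository is a fork and the forked repository also has a package on npm.

```python
import pandas

# Hypothetical toy data: three packages, each developed in its own repository.
packages = pandas.DataFrame({
    'package': ['left', 'left-fork', 'other'],
    'repoid': [1, 2, 3],
})
# Four repositories; 'dave/left' is a fork that is NOT distributed on npm.
repositories = pandas.DataFrame({
    'repository': ['alice/left', 'bob/left', 'carol/other', 'dave/left'],
    'repoid': [1, 2, 3, 4],
    'forked_from': [None, 'alice/left', None, 'alice/left'],
})

# Attach each package to its repository (repositories without a package are dropped).
linked = packages.merge(repositories, on='repoid', how='inner')

# Keep packages whose repository is a fork of a repository that also has a package.
toy_variants = linked.merge(
    linked[['repository', 'package']],
    left_on='forked_from',
    right_on='repository',
    suffixes=('', '_mainline'),
)[['package_mainline', 'package']]

print(toy_variants)
```

In this toy example only `left-fork` qualifies: `dave/left` is a fork of a mainline but has no package, and `other` is not a fork.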
 

### Load data

`df_packages` contains the list of projects distributed on npm.

In [31]:
df_packages = pandas.read_csv(
    '../data-raw/packages.csv.gz',
    usecols=['package', 'repoid'],
    dtype={'repoid': 'Int32'}
)

In [33]:
df_packages.head()

| | package | repoid |
|---|---|---|
| 0 | 0 | |
| 1 | 001 | |
| 2 | 001_skt | |
| 3 | 001_test | |
| 4 | 007 | 49873.0 |


`df_repositories` contains a subset of the GitHub repositories related to npm. It is expected to include the repositories of npm packages, as well as the repositories of projects that depend on an npm package.

In [32]:
df_repositories = pandas.read_csv(
    '../data-raw/repositories.csv.gz',
    usecols=['repository', 'repoid', 'forked_from'],
    dtype={'repoid': 'Int32'}
)

In [34]:
df_repositories.head()

| | repository | repoid | forked_from |
|---|---|---|---|
| 0 | brianmhunt/knockout-modal | 1 | |
| 1 | SteveSanderson/knockout.mapping | 2 | |
| 2 | azman-co/knockout-model | 3 | devco/knockup |
| 3 | zonuexe/aozora-ruby-parser.js | 4 | |
| 4 | immense/knockout-pickatime | 5 | |


### Detecting variants

Let's first associate repositories and packages.

In [35]:
df_variants = (
    df_packages
    # remove packages with no repository
    .dropna(subset=['repoid'])
    # remove every package that shares its repository with another package
    # (keep=False drops all duplicates, not just the extra ones)
    .drop_duplicates(subset=['repoid'], keep=False)
    # associate repositories and packages
    .merge(df_repositories, how='inner', on='repoid')
    # resolve `forked_from` to the repository, repoid and package of the mainline
    .pipe(lambda df:
        df
        .merge(
            df[['repository', 'repoid', 'package']], 
            how='inner', 
            left_on='forked_from', 
            right_on='repository',
            suffixes=('', '_mainline'),
        )
    )
    # rename to make things easier
    .rename(columns={
        'package_mainline': 'mainline',
        'repository_mainline': 'mainline_repo',
        'repoid_mainline': 'mainline_repoid',
        'package': 'variant',
        'repository': 'variant_repo',
        'repoid': 'variant_repoid',
    })
    [['mainline', 'mainline_repo', 'mainline_repoid', 'variant', 'variant_repo', 'variant_repoid']]
)
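
A few consistency checks can help confirm that the result has the expected shape. The sketch below is only an illustration: `check_variants` is a hypothetical helper, not part of the original notebook, and the "one mainline per variant" check assumes that repository names and repoids are unique in `df_repositories`. It is run here on a two-row sample copied from the output shown in this notebook.

```python
import pandas


def check_variants(df):
    """Return basic consistency checks on a variants table (hypothetical helper)."""
    return {
        # a repository cannot be a fork of itself
        'no_self_fork': bool((df['mainline_repoid'] != df['variant_repoid']).all()),
        # assumption: each variant package is linked to exactly one mainline
        'one_mainline_per_variant': bool(df['variant'].is_unique),
        # the inner merges should leave no missing values
        'no_missing_values': bool(df.notna().all().all()),
    }


# Two rows taken from the displayed output of df_variants.
sample = pandas.DataFrame({
    'mainline': ['wheat', 'wheat'],
    'mainline_repo': ['creationix/wheat', 'creationix/wheat'],
    'mainline_repoid': [162291, 162291],
    'variant': ['11zwheat', 'barley'],
    'variant_repo': ['sun11/wheat', 'frodare/barley'],
    'variant_repoid': [49882, 124697],
})

checks = check_variants(sample)
print(checks)
```

Calling `check_variants(df_variants)` on the full table would apply the same checks to all rows.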

In [54]:
df_variants

| | mainline | mainline_repo | mainline_repoid | variant | variant_repo | variant_repoid |
|---|---|---|---|---|---|---|
| 0 | wheat | creationix/wheat | 162291 | 11zwheat | sun11/wheat | 49882 |
| 1 | wheat | creationix/wheat | 162291 | barley | frodare/barley | 124697 |
| 2 | keypair | juliangruber/keypair | 110982 | akeypair | quartzjer/akeypair | 86500 |
| 3 | keypair | juliangruber/keypair | 110982 | jh-keypair | johnhaley81/keypair | 805497 |
| 4 | sasl-digest-md5 | jaredhanson/js-sasl-digest-md5 | 149511 | alt-sasl-digest-md5 | legastero/js-sasl-digest-md5 | 86665 |
| ... | ... | ... | ... | ... | ... | ... |
| 12808 | dot-values | bajankristof/dot-values | 34049409 | dot-values2 | bluelovers/dot-values | 41256794 |
| 12809 | kompression | tuananh/kompression | 30312975 | @nivinjoseph/kompression | nivinjoseph/kompression | 41256967 |
| 12810 | contentful-typescript-codegen | intercom/contentful-typescript-codegen | 39168489 | @zeusdeux/contentful-typescript-codegen | zeusdeux/contentful-typescript-codegen | 41257476 |
| 12811 | prometheus-gc-stats | SimenB/node-prometheus-gc-stats | 13589391 | prometheus-gc-stats2 | acifani/node-prometheus-gc-stats | 41257504 |


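How many variants does each mainline have? The table below gives, for each number of variants, the number of mainlines that have that many.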

In [55]:
(
    df_variants
    .groupby('mainline', sort=False)
    [['variant']]
    .count()
    .rename(columns={'variant': 'variants'})
    .assign(mainlines=1)
    .groupby('variants')
    .count()
    .T
)

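| variants | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 12 | 13 | 16 | 17 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mainlines | 9280 | 1117 | 234 | 63 | 20 | 11 | 6 | 2 | 2 | 2 | 2 | 2 | 1 | 1 |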
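
The same kind of distribution can also be obtained with `size()` followed by `value_counts()`, which some readers may find easier to follow than two chained `groupby` calls. This is an alternative sketch on made-up data (the `toy` frame below is hypothetical), not the computation used in this notebook.

```python
import pandas

# Hypothetical toy data: mainline 'a' has two variants, 'b' and 'c' have one each.
toy = pandas.DataFrame({
    'mainline': ['a', 'a', 'b', 'c'],
    'variant': ['a1', 'a2', 'b1', 'c1'],
})

distribution = (
    toy
    .groupby('mainline')
    .size()            # number of variants per mainline
    .value_counts()    # number of mainlines per number of variants
    .sort_index()
)

# Sanity check: weighting each bucket by its size gives back the number of rows.
total_variants = sum(k * v for k, v in distribution.items())

print(distribution.to_dict(), total_variants)
```

Replacing `toy` with `df_variants` should produce a Series indexed by the number of variants, comparable to the table above.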


In [57]:
df_variants.to_csv('../data/variants.csv.gz', compression='gzip', index=False)