In previous chapters we (1) assessed our ability to map the raw GHTorrent data onto sectors using some given lists [see chapter "Overview of raw GHTorrent professional affiliation data.ipynb"] and (2) looked at the data itself and considered/implemented some cleaning heuristics [see chapters "Difflib demo.ipynb" and "Company Cleaning Narritive.ipynb"].  Now that we've done all this work, we need to see whether these strategies have accomplished anything.  Ideally, what we have been doing is removing the "noise" from our data so that the entries in the raw data represent a more "unified" expression of the same affiliations, and thus map better onto the target entity lists.  The result would be a greater number of entries being mapped to each sector.  Another way to map more entries to those sectors would be to improve/augment our per-sector lists of entities.  We'll set that option aside for now and evaluate the results of our cleaning efforts using largely the same approach as in the "raw overview" chapter.
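
To make the success criterion concrete, here is a minimal sketch on toy data of what "more entries mapped after cleaning" means. The helper `count_matches`, the example entries, and the two-pattern sector list are all hypothetical; they are not part of ossPyFuncs or of the real data.

```python
import pandas as pd

def count_matches(entries, patterns):
    """Count entries that match at least one regular expression in patterns."""
    matched = pd.Series(False, index=entries.index)
    for pattern in patterns:
        # na=False treats missing entries as non-matches
        matched |= entries.str.contains(pattern, regex=True, na=False)
    return int(matched.sum())

# Toy data (hypothetical): the same affiliations before and after cleaning
raw_entries = pd.Series(["Microsoft Corp.", "@microsoft", "IBM", "ibm research", "freelancer"])
cleaned_entries = pd.Series(["microsoft", "microsoft", "ibm", "ibm research", "freelancer"])

# Hypothetical sector list: exact, lower-case company names
sector_patterns = [r"^microsoft$", r"^ibm$"]

raw_count = count_matches(raw_entries, sector_patterns)
cleaned_count = count_matches(cleaned_entries, sector_patterns)
print("raw: " + str(raw_count) + " matched, cleaned: " + str(cleaned_count) + " matched")
```

In this toy case the sector list is unchanged and only the cleaning differs, so any increase in the count is attributable to the cleaning step.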

This time we begin by loading the cleaned set, rather than pulling from the database.

In [1]:
#this code guarantees you can import the ossPyFuncs library
import subprocess
import os
#get top directory path of the current git repository, under the presumption that
#you're in the dspg20oss repo.
gitRepoPath=subprocess.check_output(['git', 'rev-parse', '--show-toplevel']).decode('ascii').strip()
#move to the ossPy directory, assuming the directory structure has remained the same
os.chdir(os.path.join(gitRepoPath, 'ossPy'))
#import the osspy library
import ossPyFuncs
import pandas as pd

currentDir=os.path.dirname('ossPyFuncs.py')
#read in the file
cleanedInput=pd.read_csv(os.path.join(currentDir,'PackageOuts/rawCleanNoNA.csv'))
companyColumn=pd.DataFrame(cleanedInput['company'])
companyColumn.head(20)

Unnamed: 0,company
0,nearform
1,answers
2,mbari
3,scality
4,crom microsystems
5,oohology
6,apple
7,object rocket at rackspace
8,rackspace hosting
9,thinkphp


Thus we see that slightly more than 80% of users have not entered anything into the company column.  This doesn't bode particularly well for our attempt to begin mapping sectors, but let's investigate what we do have.

Let's begin with the list for the household/individual category.  We'll start by loading our criteria list and taking a look at it.

In [2]:
#get path to local github directory for the ossPy function set, use that as a marker to find other files
currentDir=os.path.dirname('ossPyFuncs.py')
#read in the file
householdTermList=pd.read_csv(os.path.join(currentDir,'keyFiles/individualKeys.csv'),quotechar="'",header=None)
#look at some of the items  
householdTermList.head(20)

Unnamed: 0,0
0,(?i)^self$
1,(?i)^personal$
2,(?i)^home$
3,(?i)^private$
4,(?i)^individual$
5,(?i)^myself$
6,(?i)^me$
7,(?i)^house$
8,(?i)^independent$
9,(?i)independent contractor


We see that these are all case-insensitive regular expressions, most of those shown anchored to match the entire entry, for strings which are (we assume) associated with individuals who are engaging in home innovation with open source.  Let's see how many of the individuals from the GHTorrent database this reflects.
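
As a rough illustration of how patterns like these behave, the sketch below applies three of them to a handful of made-up company entries using pandas' `str.contains`. This is only an assumption about the general technique; the actual internals of `ossPyFuncs.addBooleanColumnFromCriteria` may differ.

```python
import pandas as pd

# Three patterns copied from the criteria list shown above
patterns = [r"(?i)^self$", r"(?i)^home$", r"(?i)independent contractor"]

# Toy company entries (hypothetical, not drawn from GHTorrent)
companies = pd.Series(["Self", "self-employed", "home", "Home Depot", "Independent Contractor, LLC"])

# Start with all False, then mark True wherever any pattern matches
is_household = pd.Series(False, index=companies.index)
for pattern in patterns:
    is_household |= companies.str.contains(pattern, regex=True, na=False)

result = pd.DataFrame({"company": companies, "household": is_household})
print(result)
```

The `^...$` anchors require the whole entry to equal the term, so "self-employed" and "Home Depot" are not caught by these three patterns, while the unanchored "independent contractor" pattern matches anywhere in the entry. The `(?i)` prefix makes each match case-insensitive.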

In [3]:
import numpy as np
#iteratively apply the list as a string search, and mark true where a match is found
householdOutColumn=ossPyFuncs.addBooleanColumnFromCriteria(companyColumn,householdTermList,'household')

print(str(np.count_nonzero(householdOutColumn['household'])) + ' household innovators found')

subsetHouseholdUsers=householdOutColumn[householdOutColumn['household']]
subsetHouseholdUsersCountDF=subsetHouseholdUsers['company'].value_counts()
subsetHouseholdUsersCountDF.head(20)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_with_indexer(indexer, value)


4819 household innovators found


freelancer                 990
freelance                  949
home                       339
self                       261
personal                   250
private                    199
self-employed              196
independent                153
individual                  75
myself                      75
freelance developer         63
me                          62
consultant                  50
independent contractor      49
freelance web developer     36
web developer               33
independent consultant      30
software developer          29
independent developer       28
freelancing                 23
Name: company, dtype: int64

That's a fairly sizable number.  Let's try this same approach again, but this time, instead of using a list of terms we generated, let's use an existing list of academic institutions.

In [4]:
#formulate sql query
postgreSql_selectQuery="SELECT institution FROM hipolabs.universities ;"
#perform query
universitiesList=ossPyFuncs.queryToPDTable(postgreSql_selectQuery)
universitiesList=pd.DataFrame(universitiesList['institution'].str.lower())




#use that query output for the iterative boolean vector creation
universityOutColumn=ossPyFuncs.addBooleanColumnFromCriteria(companyColumn,universitiesList,'academic')

#count the number of true
print(str(np.count_nonzero(universityOutColumn['academic'])) + ' academic contributors found')

subsetAcademicUsers=universityOutColumn[universityOutColumn['academic']]
subsetAcademicUsersCountDF=subsetAcademicUsers['company'].value_counts()
subsetAcademicUsersCountDF.head(20)




20772 academic contributors found


carnegie mellon university           474
university of washington             403
stanford university                  395
tsinghua university                  252
columbia university                  246
cornell university                   203
university of waterloo               202
zhejiang university                  195
university of toronto                193
imperial college london              182
northeastern university              167
university of oxford                 163
new york university                  163
duke university                      163
university of cambridge              163
peking university                    162
university of southern california    161
university of pennsylvania           159
harvard university                   156
johns hopkins university             153
Name: company, dtype: int64

And now let's do the same thing again, but for government branches.

In [5]:
#formulate sql query
postgreSql_selectQuery="SELECT agency FROM us_gov_depts.us_gov_azindex ;"
#perform query
govtList=ossPyFuncs.queryToPDTable(postgreSql_selectQuery)
govtList=pd.DataFrame(govtList['agency'].str.lower())
govtList=pd.DataFrame('\\b'+govtList['agency']+'\\b')


#use that query output for the iterative boolean vector creation
governmentOutColumn=ossPyFuncs.addBooleanColumnFromCriteria(companyColumn,govtList,'government')

#count the number of true
print(str(np.count_nonzero(governmentOutColumn['government'])) + ' government contributors found')

subsetGovernmentUsers=governmentOutColumn[governmentOutColumn['government']]
subsetGovernmentUsersCountDF=subsetGovernmentUsers['company'].value_counts()
subsetGovernmentUsersCountDF.head(100)

669 government contributors found


oak ridge national laboratory                   54
argonne national laboratory                     54
los alamos national laboratory                  41
lawrence livermore national laboratory          38
sandia national laboratories                    34
lawrence berkeley national laboratory           28
brookhaven national laboratory                  19
pacific northwest national laboratory           16
lawrence livermore national laboratory, llnl    15
slac national accelerator laboratory            13
idaho national laboratory                       11
fermi national accelerator laboratory           11
tendermint                                      10
consumer financial protection bureau             8
us army                                          7
paypermint                                       7
national renewable energy laboratory             6
minted                                           6
imint                                            5
fannie mae                     
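
The query cell above wraps each agency name in `\b` word boundaries. The sketch below, on toy data, illustrates what that wrapping is meant to do compared with plain substring matching. The entries and the single name "mint" are hypothetical, and whether the ossPyFuncs helper applies the patterns in exactly this way is an assumption. Entries such as "tendermint" and "minted" in the output look like the kind of partial-word hit that word boundaries are meant to exclude, so they may be worth checking by hand.

```python
import re
import pandas as pd

# Toy entries (hypothetical) and one short name to search for
companies = pd.Series(["us mint", "tendermint", "minted", "mint (u.s.)"])
name = "mint"

# Plain substring search: matches the name anywhere, even inside a longer word
substring_hits = companies.str.contains(re.escape(name), regex=True)

# Word-bounded search: the name must stand alone as a whole word
bounded_hits = companies.str.contains(r"\b" + re.escape(name) + r"\b", regex=True)

comparison = pd.DataFrame({"company": companies,
                           "substring": substring_hits,
                           "word_bounded": bounded_hits})
print(comparison)
```

`re.escape` is used here so that characters such as `.` in a name are treated literally rather than as regex wildcards.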

And finally, let's try this for commercial entities.

In [9]:
#extract multiple tables from the forbes dataset
postgreSql_selectQuery="SELECT company FROM forbes.fortune2018_us1000 ;"
fortune2018=ossPyFuncs.queryToPDTable(postgreSql_selectQuery)

postgreSql_selectQuery="SELECT company FROM forbes.fortune2019_us1000 ;"
fortune2019=ossPyFuncs.queryToPDTable(postgreSql_selectQuery)

postgreSql_selectQuery="SELECT company FROM forbes.fortune2020_global2000 ;"
global2020=ossPyFuncs.queryToPDTable(postgreSql_selectQuery)

#merge them together
mergedCompanies=pd.concat([fortune2018,fortune2019,global2020],ignore_index=True)
mergedCompanies=pd.DataFrame(mergedCompanies['company'].str.lower())
mergedCompanies=pd.DataFrame('\\b'+mergedCompanies['company']+'\\b')

#use that query output for the iterative boolean vector creation
businessOutColumn=ossPyFuncs.addBooleanColumnFromCriteria(companyColumn,mergedCompanies,'business')

#count the number of true
print(str(np.count_nonzero(businessOutColumn['business'])) + ' business contributors found')

subsetBusinessUsers=businessOutColumn[businessOutColumn['business']]
subsetBusinessUsersCountDF=subsetBusinessUsers['company'].value_counts()
subsetBusinessUsersCountDF.head(20)

35244 business contributors found


microsoft              6363
red hat                2063
ibm                    1865
facebook               1194
intel                  1000
alibaba                 677
baidu                   562
amazon                  547
sap                     510
shopify                 490
oracle                  417
uber                    395
amazon web services     384
vmware                  365
apple                   354
adobe                   331
twitter                 315
accenture               293
netease                 276
netflix                 259
Name: company, dtype: int64

Although we've captured a nontrivial number of people, this is still only a fraction of the more than 400 thousand users who have entered professional affiliations.  In the next notebook chapter, we'll look at a number of strategies for cleaning our input data to optimize our sectoring efforts.
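
One way to put a number on "only a fraction" is to combine the per-sector boolean columns and count the entries mapped to at least one sector. Simply adding the four printed counts could double-count an entry that matches more than one list. The sketch below uses a small hypothetical table; the column names mirror those used above, but the rows are made up.

```python
import pandas as pd

# Toy table (hypothetical rows) with one boolean column per sector
sectors = pd.DataFrame({
    "company": ["microsoft", "freelancer", "stanford university", "acme widgets", "ibm"],
    "household": [False, True, False, False, False],
    "academic": [False, False, True, False, False],
    "government": [False, False, False, False, False],
    "business": [True, False, False, False, True],
})
sector_cols = ["household", "academic", "government", "business"]

# True where an entry was mapped to at least one sector
mapped_any = sectors[sector_cols].any(axis=1)

# Share of all entries that received any sector label
coverage = mapped_any.mean()

# Per-sector totals, for comparison with the counts printed above
per_sector = sectors[sector_cols].sum()

print(per_sector)
print("mapped to any sector: " + str(int(mapped_any.sum())) + " of " + str(len(sectors)))
```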

In [8]:
mergedCompanies.head(50)

microsoft                            6363
red hat                              2063
ibm                                  1865
facebook                             1194
intel                                1000
alibaba                               677
baidu                                 562
amazon                                547
sap                                   510
shopify                               490
carnegie mellon university            474
oracle                                417
uber                                  395
amazon web services                   384
vmware                                365
apple                                 354
adobe                                 331
twitter                               315
accenture                             293
netease                               276
netflix                               259
zalando                               258
atlassian                             238
capgemini                         