In previous chapters we (1) assessed our ability to map the raw GHTorrent data onto sectors using some given lists [see chapter "Overview of raw GHTorrent professional affiliation data.ipynb"] and (2) looked at the data itself and considered/implemented some cleaning heuristics [see chapters "Difflib demo.ipynb" and "Company Cleaning Narritive.ipynb"].  Now that we've done all this work, we need to see whether these strategies have accomplished anything.  Ideally, what we have been doing is removing the "noise" from our data so that the entries in the raw data represent a more "unified" expression of the same affiliations, and thus map better onto the target entity lists.  The result would be a greater number of entries being mapped to each sector.  Alternatively, we could map more entries to those sectors by improving/augmenting our per-sector lists of entities.  We'll set that option aside for now and proceed by evaluating the results of our cleaning efforts using largely the same approach as we did in the "raw overview" chapter.
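To make the expected effect concrete, here is a minimal sketch of how unifying entries can raise the number of matches against an entity list. The data, the `entities` set, and the `toy_clean` helper are all hypothetical and are not the heuristics from the cleaning chapters; they only illustrate the before/after comparison we are about to make.

```python
import pandas as pd

# Hypothetical raw entries and a hypothetical target entity list
raw = pd.Series(["@Microsoft", "Microsoft Corporation", "microsoft ", "IBM", "self"])
entities = {"microsoft", "ibm"}

def toy_clean(value):
    """Toy normalization (an assumption, not the project's actual cleaning code)."""
    # lowercase, trim whitespace, and drop a leading "@"
    value = value.strip().lower().lstrip("@")
    # drop a trailing legal suffix if one is present
    for suffix in (" corporation", " inc", " llc"):
        if value.endswith(suffix):
            value = value[: -len(suffix)]
    return value

# Count exact matches before and after the toy cleaning step
raw_hits = raw.str.lower().isin(entities).sum()
clean_hits = raw.map(toy_clean).isin(entities).sum()

print(f"matches before cleaning: {raw_hits}")
print(f"matches after cleaning:  {clean_hits}")
```

If the cleaning has worked as hoped, the second count should exceed the first, which is the same comparison we make below at full scale.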

This time we begin by loading the cleaned set, rather than pulling from the database.

In [6]:
import os
#os.chdir('../')
import ossPyFuncs
import pandas as pd

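#locate the directory holding the ossPy function set
#(dirname of the bare string 'ossPyFuncs.py' is always empty, so resolve the imported module's actual location)
currentDir=os.path.dirname(os.path.abspath(ossPyFuncs.__file__))
#read in the file
cleanedInput=pd.read_csv(os.path.join(currentDir,'PackageOuts/rawCleanNoNA.csv'))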
companyColumn=pd.DataFrame(cleanedInput['company'])
companyColumn.head(20)

Unnamed: 0.1,Unnamed: 0,company
0,0,nearform
1,2,answers
2,3,mbari
3,4,scality
4,5,crom microsystems
5,10,oohology
6,11,apple
7,12,object rocket at rackspace
8,13,rackspace hosting
9,14,thinkphp


Thus we see that slightly more than 80% of users have not entered anything into the company column.  This doesn't bode particularly well for our attempt to begin mapping sectors, but let's investigate what we do have.

Let's begin with the list for the household/individual category.  We'll start by loading our criteria list and taking a look at it.

In [9]:
#get path to the local github directory for the ossPy function set, use that as a marker to find other files
#(dirname of the bare string 'ossPyFuncs.py' is always empty, so resolve the imported module's actual location)
currentDir=os.path.dirname(os.path.abspath(ossPyFuncs.__file__))
#read in the file
householdTermList=pd.read_csv(os.path.join(currentDir,'keyFiles/individualKeys.csv'),quotechar="'",header=None)
#look at some of the items  
householdTermList.head(20)

Unnamed: 0,0
0,(?i)^self$
1,(?i)^personal$
2,(?i)^home$
3,(?i)^private$
4,(?i)^individual$
5,(?i)^myself$
6,(?i)^me$
7,(?i)^house$
8,(?i)^independent$
9,(?i)independent contractor


We see that these are all regular expressions (case-insensitive, and mostly anchored so that they match the whole entry) for strings which are (we assume) associated with individuals who are engaging in home innovation with open source.  Let's see how many of the individuals from the GHTorrent database this reflects.
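For readers without access to the ossPy function set, the following is a minimal sketch of how a criteria list like this could be applied. The `flag_matches` function is a hypothetical stand-in written for illustration; it is not the implementation of `ossPyFuncs.addBooleanColumnFromCriteria`, and the toy data is invented.

```python
import pandas as pd

def flag_matches(frame, patterns, new_column, source_column="company"):
    """Return a copy of `frame` with a boolean column that is True wherever
    any pattern matches (hypothetical helper, not part of ossPyFuncs)."""
    out = frame.copy()  # work on a copy so the caller's frame is left untouched
    matched = pd.Series(False, index=out.index)
    # apply each pattern in turn and OR the results together;
    # each pattern carries its own inline (?i) flag, so they are not merged
    for pattern in patterns:
        matched |= out[source_column].str.contains(pattern, regex=True, na=False)
    out[new_column] = matched
    return out

# Invented entries and two patterns taken from the list shown above
toy = pd.DataFrame(
    {"company": ["Self", "self employed", "Independent Contractor at X", "apple", None]}
)
toy_patterns = ["(?i)^self$", "(?i)independent contractor"]

flagged = flag_matches(toy, toy_patterns, "household")
print(flagged)
```

Note how the anchored pattern `(?i)^self$` accepts "Self" but rejects "self employed", while the unanchored `(?i)independent contractor` matches anywhere in the entry.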

In [10]:
import numpy as np
#iteratively apply the list as a string search, and mark true where a match is found
householdOutColumn=ossPyFuncs.addBooleanColumnFromCriteria(companyColumn,householdTermList,'household')

print(str(np.count_nonzero(householdOutColumn['household'])) + ' household innovators found')

subsetHouseholdUsers=householdOutColumn[householdOutColumn['household']]
subsetHouseholdUsersCountDF=subsetHouseholdUsers['company'].value_counts()
subsetHouseholdUsersCountDF.head(20)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_with_indexer(indexer, value)


NameError: name 'np' is not defined

That's a fairly sizable number.  Let's try this same approach again, but this time, instead of using a list of terms we generated, let's use an existing list of academic institutions.

In [9]:
#formulate sql query
postgreSql_selectQuery="SELECT institution FROM hipolabs.universities ;"
#perform query
universitiesList=ossPyFuncs.queryToPDTable(postgreSql_selectQuery)

#use that query output for the iterative boolean vector creation
universityOutColumn=ossPyFuncs.addBooleanColumnFromCriteria(companyColumn,universitiesList,'academic')

#count the number of true
print(str(np.count_nonzero(universityOutColumn['academic'])) + ' academic contributors found')

subsetAcademicUsers=universityOutColumn[universityOutColumn['academic']]
subsetAcademicUsersCountDF=subsetAcademicUsers['company'].value_counts()
subsetAcademicUsersCountDF.head(20)


20202 academic contributors found


Carnegie Mellon University           471
University of Washington             392
Stanford University                  385
Columbia University                  241
Tsinghua University                  240
Cornell University                   203
University of Waterloo               200
University of Toronto                189
Imperial College London              182
Zhejiang University                  166
University of Cambridge              163
University of Oxford                 160
Peking University                    159
New York University                  159
Northeastern University              158
Duke University                      157
Harvard University                   153
University of Pennsylvania           152
Johns Hopkins University             149
University of Southern California    149
Name: company, dtype: int64

And now let's do the same thing again, but for government branches.

In [10]:
#formulate sql query
postgreSql_selectQuery="SELECT agency FROM us_gov_depts.us_gov_azindex ;"
#perform query
govtList=ossPyFuncs.queryToPDTable(postgreSql_selectQuery)

#use that query output for the iterative boolean vector creation
governmentOutColumn=ossPyFuncs.addBooleanColumnFromCriteria(companyColumn,govtList,'government')

#count the number of true
print(str(np.count_nonzero(governmentOutColumn['government'])) + ' government contributors found')

subsetGovernmentUsers=governmentOutColumn[governmentOutColumn['government']]
subsetGovernmentUsersCountDF=subsetGovernmentUsers['company'].value_counts()
subsetGovernmentUsersCountDF.head(20)

565 government contributors found


Argonne National Laboratory                       54
Oak Ridge National Laboratory                     54
Los Alamos National Laboratory                    40
Lawrence Livermore National Laboratory            37
Sandia National Laboratories                      34
Lawrence Berkeley National Laboratory             27
Brookhaven National Laboratory                    17
Pacific Northwest National Laboratory             16
SLAC National Accelerator Laboratory              12
Lawrence Livermore National Laboratory, @LLNL     12
Fermi National Accelerator Laboratory             11
Idaho National Laboratory                         10
Consumer Financial Protection Bureau               8
National Renewable Energy Laboratory               6
US Army                                            6
Fannie Mae                                         4
PayPerMint                                         4
US Navy                                            3
Lawrence Livermore National Laboratory, @LLNL 

And finally, let's try this for commercial entities.

In [5]:
#extract multiple tables from the forbes dataset
postgreSql_selectQuery="SELECT company FROM forbes.fortune2018_us1000 ;"
fortune2018=ossPyFuncs.queryToPDTable(postgreSql_selectQuery)

postgreSql_selectQuery="SELECT company FROM forbes.fortune2019_us1000 ;"
fortune2019=ossPyFuncs.queryToPDTable(postgreSql_selectQuery)

postgreSql_selectQuery="SELECT company FROM forbes.fortune2020_global2000 ;"
global2020=ossPyFuncs.queryToPDTable(postgreSql_selectQuery)

#merge them together
mergedCompanies=pd.concat([fortune2018,fortune2019,global2020],ignore_index=True)

#use that query output for the iterative boolean vector creation
businessOutColumn=ossPyFuncs.addBooleanColumnFromCriteria(companyColumn,mergedCompanies,'business')

#count the number of true
print(str(np.count_nonzero(businessOutColumn['business'])) + ' business contributors found')

subsetBusinessUsers=businessOutColumn[businessOutColumn['business']]
subsetBusinessUsersCountDF=subsetBusinessUsers['company'].value_counts()
subsetBusinessUsersCountDF.head(20)


37749 business contributors found


Microsoft                4301
Red Hat                  1594
IBM                      1492
@Microsoft                730
@Microsoft                674
Facebook                  670
Intel                     608
Amazon                    502
Alibaba                   351
Microsoft Corporation     335
Amazon Web Services       318
Baidu                     286
Oracle                    265
@IBM                      248
Intel Corporation         222
Uber                      218
Adobe                     214
Accenture                 214
SAP                       207
Twitter                   201
Name: company, dtype: int64

Although this is a nontrivial number of people we've captured, it's still only a fraction of the 400 thousand plus users who have entered professional affiliations.  In the next notebook chapter, we'll look at a number of strategies for cleaning our input data to optimize our sectoring efforts.
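To put a number on that fraction, one option is to gather the per-sector boolean columns into a single frame and summarize them. The sketch below uses invented toy data and assumes the four flags (`household`, `academic`, `government`, `business`) have been combined into one DataFrame, which the cells above do not do; it is an illustration rather than a result.

```python
import pandas as pd

# Invented stand-in for a frame holding all four sector flags side by side
toy = pd.DataFrame({
    "company": ["self", "Stanford University", "US Army", "Microsoft", "nearform", "scality"],
    "household": [True, False, False, False, False, False],
    "academic": [False, True, False, False, False, False],
    "government": [False, False, True, False, False, False],
    "business": [False, False, False, True, False, False],
})
sector_columns = ["household", "academic", "government", "business"]

# number of entries flagged in each sector
per_sector = toy[sector_columns].sum()

# True where an entry was mapped to at least one sector
mapped_any = toy[sector_columns].any(axis=1)

# share of all entries that were mapped to some sector
coverage = mapped_any.mean()

# the entries that no list captured, most frequent first
unmapped = toy.loc[~mapped_any, "company"].value_counts()

print(per_sector)
print(f"share of entries mapped to any sector: {coverage:.1%}")
print(unmapped.head(20))
```

Looking at the most frequent unmapped entries is a quick way to decide whether further cleaning or a longer entity list would capture more users.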