In our Open Source Software project, [name goes here], we are fortunate enough to have access to the data contained within the GHTorrent resource, specifically for those individuals who have made contributions to open source projects.  Let's begin by taking a look at this data in its raw form.

We'll start by running a SQL query and looking at what we get back.

In [1]:
import os
os.chdir('../')
import ossPyFuncs

#formulate sql query
postgreSql_selectQuery="SELECT login, company FROM gh.ctrs_raw ;"
#perform query
inputRaw=ossPyFuncs.queryToPDTable(postgreSql_selectQuery)

#print number of entries (rows) in database
print(len(inputRaw.index))

2143407
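
The internals of `ossPyFuncs.queryToPDTable` are not shown here. As a rough sketch of what a query-to-table helper of this kind can look like, the example below runs a similar query with pandas' `read_sql_query`. The in-memory SQLite database, the three invented rows, and the helper name `queryToTable` are all assumptions made for illustration; the real project queries a PostgreSQL database.

```python
import sqlite3

import pandas as pd

# Assumption: an in-memory SQLite database stands in for the project's
# PostgreSQL server, and the rows below are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ctrs_raw (login TEXT, company TEXT)")
conn.executemany(
    "INSERT INTO ctrs_raw VALUES (?, ?)",
    [("alice", "Microsoft"), ("bob", None), ("carol", "Freelancer")],
)
conn.commit()


def queryToTable(query, connection):
    """Hypothetical helper: run a SQL query and return a pandas DataFrame."""
    return pd.read_sql_query(query, connection)


toyRaw = queryToTable("SELECT login, company FROM ctrs_raw;", conn)

# number of entries (rows) returned by the query
print(len(toyRaw.index))

conn.close()
```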


As you can see, we have more than two million entries in this database, each corresponding to a username on GitHub.  However, given that we are interested in linking individuals with productive sectors (e.g. business, academic, household, governmental) we need to take a closer look at the ['company'] column in order to assess the viability of using it as the bridge from individual user to sector label.  One key feature is the *sparsity* of the data, that is, the "emptiness" of the data.

In [2]:
import numpy as np
import pandas as pd

#count the number of null values in the company column
numberOfNull=np.count_nonzero(pd.isnull(inputRaw['company']))

print(numberOfNull)

#compute the percentage this represents
print(np.divide(numberOfNull,len(inputRaw.index))*100)

1720890
80.2875982023013


Thus we see that slightly more than 80% of users have not entered anything into the company column.  This doesn't bode particularly well for our attempt to begin mapping sectors, but let's investigate what we do have.
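
As a small illustration of the sparsity calculation, the sketch below repeats it on an invented five-row table using pandas' own `isna` and `mean` methods, which give the same fraction as the NumPy calls above. Counting blank strings as "empty" as well is an added assumption, not something the calculation above does.

```python
import pandas as pd

# Toy stand-in for the query result; these rows are invented for illustration.
toyRaw = pd.DataFrame({
    "login": ["a", "b", "c", "d", "e"],
    "company": ["Microsoft", None, None, "", "Google"],
})

# Fraction of rows whose company field is null (None/NaN).
nullFraction = toyRaw["company"].isna().mean()

# Assumption: blank or whitespace-only strings also count as "empty".
blankFraction = toyRaw["company"].fillna("").str.strip().eq("").mean()

print(f"{nullFraction * 100:.1f}% null, {blankFraction * 100:.1f}% null or blank")
```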

In [3]:
companyColumn=pd.DataFrame(inputRaw['company'])

#get the counts for the unique values and convert that output to a proper table;
#naming the index and the counts explicitly avoids relying on pandas' default
#column names, which changed in pandas 2.0
tableUniqueFullNameCounts=companyColumn['company'].value_counts().rename_axis('company').reset_index(name='count')

#display table
tableUniqueFullNameCounts.head(n=20)

Unnamed: 0,company,count
0,Microsoft,4301
1,,3650
2,Google,2216
3,Red Hat,1594
4,IBM,1492
5,Freelancer,817
6,Freelance,795
7,@Microsoft,730
8,@Microsoft,674
9,Facebook,670


Here we see a number of names that we might expect to see: for example, tech companies like Microsoft, Google, and IBM, as well as status listings like Freelancer and Student.

However, from a "data cleanliness" perspective, we also notice that several of these listings are redundant. For example, there appear to be at least two Google listings and three Microsoft listings among just these most common listings! This may also prove to be a challenge when we attempt to assign sectors using explicit lists of companies.  In cases where the strings don't *exactly* match, we'll likely run into trouble.
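
One possible way to reduce this kind of redundancy is to normalize the strings before counting them. The sketch below is only an illustration on invented values: the rules in the hypothetical `normalizeCompany` helper (trim whitespace, lowercase, drop a leading `@`) are assumptions, and they would not catch every variant found in the real data.

```python
import pandas as pd

# Invented examples of near-duplicate company strings.
toyCompanies = pd.Series(
    ["Microsoft", "@Microsoft", "microsoft ", "Google", "@google", "Red Hat"]
)


def normalizeCompany(names: pd.Series) -> pd.Series:
    """Hypothetical cleanup: trim, lowercase, and drop a leading '@'."""
    cleaned = names.str.strip().str.lower()
    cleaned = cleaned.str.replace(r"^@", "", regex=True)
    return cleaned.str.strip()


# Count listings after normalization; variants now collapse into one row each.
normalizedCounts = normalizeCompany(toyCompanies).value_counts()
print(normalizedCounts)
```

On this toy input the three Microsoft variants collapse into a single `microsoft` entry and the two Google variants into a single `google` entry.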

Let's go ahead and try that now with the list for the household/individual category.  We'll begin by loading up our criteria list and taking a look at it.

In [4]:
#get path to the local directory containing the ossPy function set, use that as a marker to find other files
currentDir=os.path.dirname(os.path.abspath(ossPyFuncs.__file__))
#read in the file
householdTermList=pd.read_csv(os.path.join(currentDir,'keyFiles/individualKeys.csv'),quotechar="'",header=None)
#look at some of the items  
householdTermList.head(20)

Unnamed: 0,0
0,(?i)^self$
1,(?i)^personal$
2,(?i)^home$
3,(?i)^private$
4,(?i)^individual$
5,(?i)^myself$
6,(?i)^me$
7,(?i)^house$
8,(?i)^independent$
9,(?i)independent contractor


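We see that these are all regular expressions, made case-insensitive by the `(?i)` flag and mostly anchored with `^` and `$` so that they match the entire entry, for strings which are (we assume) associated with individuals who are engaging in home innovation with open source.  Let's see how many of the individuals from the GHTorrent database these patterns match.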
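
To make the matching behavior concrete, the sketch below applies three of the listed patterns to a handful of invented company strings using pandas' `str.contains`. It is not the implementation of `ossPyFuncs.addBooleanColumnFromCriteria`, whose internals are not shown here; `flagMatches` is a hypothetical helper written only for this illustration.

```python
import pandas as pd

# Invented company strings, including a missing value.
toyCompanies = pd.Series(
    ["Self", "self-employed", "Home", "Home Depot",
     "Independent Contractor LLC", None]
)

# Three patterns taken from the criteria list shown above.
patterns = [r"(?i)^self$", r"(?i)^home$", r"(?i)independent contractor"]


def flagMatches(values: pd.Series, patternList) -> pd.Series:
    """Hypothetical helper: True where any pattern matches the value."""
    matched = pd.Series(False, index=values.index)
    for pattern in patternList:
        # na=False treats missing company entries as non-matches
        matched |= values.str.contains(pattern, regex=True, na=False)
    return matched


flags = flagMatches(toyCompanies, patterns)
print(flags.tolist())
# → [True, False, True, False, True, False]
```

The anchored patterns match `Self` and `Home` but not `self-employed` or `Home Depot`, while the unanchored `independent contractor` pattern matches anywhere within the string.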

In [7]:
#iteratively apply the list as a string search, and mark true where a match is found
householdOutColumn=ossPyFuncs.addBooleanColumnFromCriteria(companyColumn,householdTermList,'household')

print(str(np.count_nonzero(householdOutColumn['household'])) + ' household innovators found')

subsetHouseholdUsers=householdOutColumn[householdOutColumn['household']]
subsetHouseholdUsersCountDF=subsetHouseholdUsers['company'].value_counts()
subsetHouseholdUsersCountDF.head(20)

4734 household innovators found


Freelancer             817
Freelance              795
Home                   220
Personal               189
Self                   169
freelancer             126
freelance              125
Private                125
Independent            124
Self-employed           88
self                    79
home                    76
private                 64
Self-Employed           62
Individual              53
Consultant              49
personal                48
Freelance Developer     45
self-employed           42
Myself                  37
Name: company, dtype: int64

That's a fairly sizable number.  Let's try this same approach again, but this time, instead of using a list of terms we generated, let's use an existing list of academic institutions.

In [None]:
#formulate sql query
postgreSql_selectQuery="SELECT institution FROM hipolabs.universities ;"
#perform query
universitiesList=ossPyFuncs.queryToPDTable(postgreSql_selectQuery)

#use that query output for the iterative boolean vector creation
universityOutColumn=ossPyFuncs.addBooleanColumnFromCriteria(companyColumn,universitiesList,'academic')

#count the number of true
print(str(np.count_nonzero(universityOutColumn['academic'])) + ' academic contributors found')

subsetAcademicUsers=universityOutColumn[universityOutColumn['academic']]
subsetAcademicUsersCountDF=subsetAcademicUsers['company'].value_counts()
subsetAcademicUsersCountDF.head(20)
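
One caveat when an external list of plain names is used as search criteria: if the names are treated as regular expressions, as the household patterns are, then short names can match inside unrelated strings and punctuation such as parentheses takes on a special meaning. The sketch below shows one way to guard against this with `re.escape` and anchoring. The institution and company names are invented, and whether `addBooleanColumnFromCriteria` needs this treatment is an assumption, since its internals are not shown here.

```python
import re

import pandas as pd

# Invented company entries and institution names for illustration.
toyCompanies = pd.Series(
    ["MIT", "Univ. of Somewhere (Main Campus)", "Summit Labs"]
)
institutionNames = ["MIT", "Univ. of Somewhere (Main Campus)"]

# An unanchored, case-insensitive search for "MIT" also hits "Summit Labs".
looseMatch = toyCompanies.str.contains(r"(?i)MIT", regex=True, na=False)
print(looseMatch.tolist())
# → [True, False, True]

# Escape regex metacharacters and anchor each name to the whole entry.
literalPatterns = ["(?i)^" + re.escape(name) + "$" for name in institutionNames]

isAcademic = pd.Series(False, index=toyCompanies.index)
for pattern in literalPatterns:
    isAcademic |= toyCompanies.str.contains(pattern, regex=True, na=False)
print(isAcademic.tolist())
# → [True, True, False]
```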


And now let's do the same thing again, but for government branches.

In [None]:
#formulate sql query
postgreSql_selectQuery="SELECT agency FROM us_gov_depts.us_gov_azindex ;"
#perform query
govtList=ossPyFuncs.queryToPDTable(postgreSql_selectQuery)

#use that query output for the iterative boolean vector creation
governmentOutColumn=ossPyFuncs.addBooleanColumnFromCriteria(companyColumn,govtList,'government')

#count the number of true
print(str(np.count_nonzero(governmentOutColumn['government'])) + ' government contributors found')

subsetGovernmentUsers=governmentOutColumn[governmentOutColumn['government']]
subsetGovernmentUsersCountDF=subsetGovernmentUsers['company'].value_counts()
subsetGovernmentUsersCountDF.head(20)