As part of our overarching strategy for assigning users to specific sectors, we need to be able to assign users to businesses as well.  Given the specifics of our source dataset (GHTorrent), we can reasonably assume that the more frequently a company name appears, the more "authoritative" (reflective of a consensus) that representation of the company name is.  Once we've removed the user entries that correspond to the non-business sectors, we can be reasonably confident in mapping users to the business sector when their workplace affiliation listing is shared by some critical threshold of users (e.g. 5).

Let's begin by obtaining the raw user data, as well as the user mappings for the academic and government sectors.

In [1]:
import os
os.chdir('../')
#assuming you're starting this notebook from its source directory,
#this will get us to the directory containing the ossPyFuncs library
import ossPyFuncs

#obtain the raw GHTorrent data
postgreSql_selectQuery="SELECT login, company FROM gh.ctrs_raw ;"
fullData=ossPyFuncs.queryToPDTable(postgreSql_selectQuery)

#perform sql query for academic entries
postgreSql_selectQuery="SELECT login, company_cleaned, is_academic FROM gh.sna_ctr_academic ;"
academicCleaned=ossPyFuncs.queryToPDTable(postgreSql_selectQuery)

#perform sql query for government entries
postgreSql_selectQuery="SELECT login, is_gov FROM gh.sna_ctr_gov ;"
govData=ossPyFuncs.queryToPDTable(postgreSql_selectQuery)

#add nonprofit here and in the next code block if desired

Now that we have obtained those pre-existing user mappings, we need to join them into one table.

In [2]:
#join the academic into the raw table
joinedData1=fullData.set_index('login').join(academicCleaned.set_index('login'))

#join the government table into the raw/academic table
joinedData2=joinedData1.join(govData.set_index('login'))

#reset the index
joinedAndReset=joinedData2.reset_index()

#it seems that the sql query pulls in boolean falses as NaNs, which is not what we want for a boolean vector
#as such, we replace the NaNs in the relevant columns with False
naReplaced=joinedAndReset[['is_gov','is_academic']].fillna(value=False)

#take those NAN replaced columns and reinsert them
fixedDataframe=joinedAndReset.assign(is_gov=naReplaced['is_gov'],is_academic=naReplaced['is_academic'])
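
To make the join-then-fill step concrete, here is a minimal sketch on toy data. The table contents are hypothetical stand-ins and are not drawn from GHTorrent. It shows how a left join on login leaves NaN for users absent from a sector table, and one way to turn that column back into a clean boolean. The use of `.eq(True)` is an alternative chosen for this sketch (it maps NaN to False and yields a bool dtype); the cell above uses `fillna` instead.

```python
import pandas as pd

# Toy stand-ins for the raw and academic tables (hypothetical data)
raw = pd.DataFrame({"login": ["ann", "bob", "cy"],
                    "company": ["MIT", "Acme", None]})
academic = pd.DataFrame({"login": ["ann"], "is_academic": [True]})

# Left join on login: users missing from the academic table receive NaN
joined = raw.set_index("login").join(academic.set_index("login")).reset_index()

# Comparing against True maps NaN to False and produces a bool column
joined["is_academic"] = joined["is_academic"].eq(True)

print(joined)
```
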

In order to perform a full sectoring we also need the information for household and null values.  Let's obtain those now from our source keylists.  After that, we'll determine which users have yet to be mapped.

In [5]:
import pandas as pd

#get the directory structure using the ossPyFuncs library as the reference point
currentDir=os.path.dirname('ossPyFuncs.py')

#obtain the household list from the keyfile directory, and make a bool column for it
houseHoldList=pd.read_csv(os.path.join(currentDir,'keyFiles/individualKeys.csv'),quotechar="'",header=None)
withHouseholdColumn=ossPyFuncs.addBooleanColumnFromCriteria(fixedDataframe,houseHoldList,'household')

#obtain the null list from the keyfile directory, and make a bool column for it
noneList=pd.read_csv(os.path.join(currentDir,'keyFiles/nullKeys.csv'),quotechar="'",header=None)
withNoneColumn=ossPyFuncs.addBooleanColumnFromCriteria(withHouseholdColumn,noneList,'null')

#generate a bool column for all users that have been mapped, these will be excluded from our business count
alreadyAssigned=withNoneColumn[['is_gov','is_academic','household','null']].any(axis=1)

#extract those users which are not assigned
onlyUnassignedFrame=fixedDataframe.loc[~alreadyAssigned]
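
The masking logic in the cell above can be illustrated on its own with plain pandas. This is a small sketch with made-up users; the column names mirror the ones used in this notebook, but the rows are hypothetical. A user counts as already assigned if any one of the sector flags is True, and the unassigned users are simply the complement of that mask.

```python
import pandas as pd

# Hypothetical sector flags for four users
flags = pd.DataFrame({
    "login": ["ann", "bob", "cy", "dee"],
    "is_gov": [False, False, True, False],
    "is_academic": [True, False, False, False],
    "household": [False, False, False, True],
    "null": [False, False, False, False],
})

sector_cols = ["is_gov", "is_academic", "household", "null"]

# True for any row where at least one sector flag is set
assigned = flags[sector_cols].any(axis=1)

# Invert the mask to keep only users with no sector yet
unassigned = flags.loc[~assigned]

print(unassigned["login"].tolist())
```
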

Now that we have derived the list of users who have yet to be assigned, let's clean their input in the company column in preparation for subsequent processing.

In [None]:
import numpy as np

#construct path to legal entity list and erase them
LElist=pd.read_csv(os.path.join(currentDir,'keyFiles/curatedLegalEntitesRaw.csv'),quotechar="'",header=None)
LEoutput, LEeraseList=ossPyFuncs.eraseFromColumn(onlyUnassignedFrame['company'],LElist)

#construct path to legal symbol list and erase them
symbollist=pd.read_csv(os.path.join(currentDir,'keyFiles/symbolRemove.csv'),quotechar="'",header=None)
Symboloutput, symbolEraseList=ossPyFuncs.eraseFromColumn(LEoutput,symbollist)
#construct path to domain list and erase them
domainsList=pd.read_csv(os.path.join(currentDir,'keyFiles/curatedDomains.csv'),quotechar="'",header=None)
domiansOutput, domainsEraseList=ossPyFuncs.eraseFromColumn(Symboloutput,domainsList)

#remap all of the space, case, and symbol variants to their most common form
fixedList, fixedReport=ossPyFuncs.spaceSymbolRemap(domiansOutput)

#report the number of changes made
print(str(np.sum(fixedReport['changeNum']))+' space, case, and symbol variant entries remapped')
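
For readers without access to ossPyFuncs, the idea behind remapping variants to their most common form can be sketched as follows. This is not the implementation of `ossPyFuncs.spaceSymbolRemap`; the helper `remap_variants` is a hypothetical name introduced here, and it assumes the input contains no missing values. Each name is reduced to a key (lowercased, with everything except letters and digits removed), and every name is then replaced by the most frequent original spelling that shares its key.

```python
import re
import pandas as pd

def remap_variants(names):
    """Hypothetical helper: map each name to the most common spelling
    among names that match after ignoring case, spaces, and symbols."""
    # Build a comparison key: lowercase, keep only letters and digits
    keys = names.map(lambda s: re.sub(r"[^a-z0-9]", "", s.lower()))
    # For each key, pick the original spelling that occurs most often
    canonical = names.groupby(keys).agg(lambda g: g.value_counts().idxmax())
    # Replace every entry with the canonical spelling for its key
    return keys.map(canonical)

companies = pd.Series(["Red Hat", "RedHat", "Red Hat", "red-hat", "Acme"])
remapped = remap_variants(companies)

print(remapped.tolist())
```
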


Now that we have fully cleaned and remapped the data (and in doing so, collapsed redundant entries into one another), we can apply our heuristic count.  Specifically, given that we have removed all entries which would be associated with governmental, academic, and independent (household) users (and removed null entries), it seems reasonable to assume that entries which have multiple users listing them are businesses.  This inference is based on our having exhausted any other workplace affiliations that a person might express.

However, we have to apply a (somewhat arbitrary) cutoff when we decide the minimum number of users who have to list the same workplace in order for us to assume it reflects a valid business.  We'll begin with 5, but the code can be changed as one sees fit.

In [None]:
#set the threshold, change if you'd like
threshold=5

#here we're going to obtain a frequency count of the workplace listings.  However, in preparation for a step we
#will be taking later, we'll also use this opportunity to obtain the user mappings for each unique company.  In
#this way we can determine which users are associated with a workplace listing that meets the threshold criterion
#that we established at the start of this cell.
sortedInputColumn, sortedTableUniqueFullNameCounts=ossPyFuncs.uniquePandasIndexMapping(fixedList)
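
The threshold heuristic itself can be sketched with plain pandas, independent of the return values of `ossPyFuncs.uniquePandasIndexMapping`. The company names below are made up, and the sketch assumes that "meets the threshold" means a count greater than or equal to the threshold. It counts how many users list each cleaned company name and then flags the users whose listing reaches the cutoff.

```python
import pandas as pd

threshold = 5

# Hypothetical cleaned company listings, one entry per user
cleaned = pd.Series(["Acme"] * 6 + ["Initech"] * 5
                    + ["Bob's Garage"] * 2 + ["Solo LLC"])

# Frequency of each unique listing
counts = cleaned.value_counts()

# Listings shared by at least `threshold` users (assumed to be >=)
business_names = counts[counts >= threshold].index

# Boolean vector marking the users mapped to the business sector
is_business = cleaned.isin(business_names)

print(int(is_business.sum()), "users mapped to the business sector")
```
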

