As part of our overarching strategy for assigning users to specific sectors, we need to be able to assign users to businesses as well. Given the specifics of our source dataset (GHTorrent), we can reasonably assume that the more frequently a company name appears, the more "authoritative" (reflective of a consensus) that representation of the company name is. Once we've removed the user entries that correspond to the non-business sectors, we can be reasonably confident in mapping users to the business sector when their workplace affiliation listing is shared with at least some critical threshold of other users (i.e., 5).
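
The threshold rule described above can be sketched on toy data as follows. This is a minimal illustration rather than the notebook's actual pipeline: the logins, company names, and the `THRESHOLD` and `is_business` names are all hypothetical.

```python
import pandas as pd

# Toy stand-in for the cleaned affiliation data; logins and company
# names are made up for illustration.
users = pd.DataFrame({
    "login": [f"user{i}" for i in range(9)],
    "company": ["acme"] * 5 + ["globex"] * 3 + [None],
})

THRESHOLD = 5  # hypothetical name for the critical threshold from the text

# Count how many users list each company name; missing values are ignored.
name_counts = users["company"].value_counts()

# Names shared by at least THRESHOLD users are treated as businesses.
business_names = name_counts[name_counts >= THRESHOLD].index

# Flag each user whose listed company is one of those names.
users["is_business"] = users["company"].isin(business_names)
```

Here only the five "acme" users would be flagged; the "globex" users fall below the threshold and the user with no listing is left unassigned.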

Let's begin by obtaining the raw user data, as well as the user mappings for the academic and government sectors.

In [None]:
import os
import pandas as pd
os.chdir('../')
#assuming you're starting this notebook from its source directory,
#this will get us to the directory containing the ossPyFuncs library
import ossPyFuncs

#obtain the raw GHTorrent data
postgreSql_selectQuery="SELECT login, company FROM gh.ctrs_raw ;"
fullData=ossPyFuncs.queryToPDTable(postgreSql_selectQuery)

#perform sql query for academic entries
postgreSql_selectQuery="SELECT login, company_cleaned, is_academic FROM gh.sna_ctr_academic ;"
academicCleaned=ossPyFuncs.queryToPDTable(postgreSql_selectQuery)

#perform sql query for government entries
postgreSql_selectQuery="SELECT login, is_gov FROM gh.sna_ctr_gov ;"
govData=ossPyFuncs.queryToPDTable(postgreSql_selectQuery)



Now that we have obtained those pre-existing user mappings, we need to join them into one table.

In [None]:
#join the academic into the raw table
joinedData1=fullData.set_index('login').join(academicCleaned.set_index('login'))

#join the government table into the raw/academic table
joinedData2=joinedData1.join(govData.set_index('login'))

#reset the index
joinedAndReset=joinedData2.reset_index()

#it seems that the sql query pulls in boolean falses as NaNs, which is not what we want for a boolean vector
#as such, we replace the NaNs in the relevant columns with False
naReplaced=joinedAndReset[['is_gov','is_academic']].fillna(value=False)

#take those NAN replaced columns and reinsert them
fixedDataframe=joinedAndReset.assign(is_gov=naReplaced['is_gov'],is_academic=naReplaced['is_academic'])
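
To see why the NaNs appear and what the replacement does, here is a small self-contained sketch on made-up tables (the logins and flags are hypothetical, not drawn from the database). A left join leaves NaN wherever a login is absent from the sector table, so the flag columns have to be filled before they can be used as boolean masks.

```python
import pandas as pd

# Made-up stand-ins for the raw, academic, and government tables.
raw = pd.DataFrame({"login": ["a", "b", "c"],
                    "company": ["MIT", "NASA", "acme"]})
academic = pd.DataFrame({"login": ["a"], "is_academic": [True]})
gov = pd.DataFrame({"login": ["b"], "is_gov": [True]})

# Left-join both sector tables onto the raw table by login.
joined = (
    raw.set_index("login")
    .join(academic.set_index("login"))
    .join(gov.set_index("login"))
    .reset_index()
)

# Logins missing from a sector table come through as NaN; convert to the
# nullable boolean dtype, fill with False, then cast to plain bool.
flag_columns = ["is_academic", "is_gov"]
joined[flag_columns] = (
    joined[flag_columns].astype("boolean").fillna(False).astype(bool)
)
```

Casting through the nullable `"boolean"` dtype is one way to end up with true boolean columns rather than object columns; the plain `fillna(value=False)` used above produces the same values.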

In order to perform a full sectoring we also need the information for household and null values. Let's obtain those now from our source keylists.

In [None]:
#get the directory structure using the ossPyFuncs library as the reference point
currentDir=os.path.dirname('ossPyFuncs.py')

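#load the household (individual) keys and flag matching entries
#note: we build on fixedDataframe so the NaN-corrected sector flags are kept
houseHoldList=pd.read_csv(os.path.join(currentDir,'keyFiles/individualKeys.csv'),quotechar="'",header=None)
withHouseholdColumn=ossPyFuncs.addBooleanColumnFromCriteria(fixedDataframe,houseHoldList,'household')

#load the null keys and flag matching entries
noneList=pd.read_csv(os.path.join(currentDir,'keyFiles/nullKeys.csv'),quotechar="'",header=None)
withNoneColumn=ossPyFuncs.addBooleanColumnFromCriteria(withHouseholdColumn,noneList,'null')

#a user is already assigned if any of the four sector flags is true
alreadyAssigned=withNoneColumn[['is_gov','is_academic','household','null']].any(axis=1)

#keep only the users that have not yet been assigned to a sector
onlyUnassignedFrame=withNoneColumn.loc[~alreadyAssigned]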


#begin cleaning

#construct path to legal entity list and erase them
LElist=pd.read_csv(os.path.join(currentDir,'keyFiles/curatedLegalEntitesRaw.csv'),quotechar="'",header=None)
LEoutput, LEeraseList=ossPyFuncs.eraseFromColumn(onlyUnassignedFrame['company'],LElist)

#construct path to symbol list and erase them
symbollist=pd.read_csv(os.path.join(currentDir,'keyFiles/symbolRemove.csv'),quotechar="'",header=None)
Symboloutput, symbolEraseList=ossPyFuncs.eraseFromColumn(LEoutput,symbollist)

#construct path to domain list and erase them
domainsList=pd.read_csv(os.path.join(currentDir,'keyFiles/curatedDomains.csv'),quotechar="'",header=None)
domainsOutput, domainsEraseList=ossPyFuncs.eraseFromColumn(Symboloutput,domainsList)

#apply the space/symbol remapping to the cleaned names
fixedList, fixedReport=ossPyFuncs.spaceSymbolRemap(domainsOutput)

#map the cleaned names to their unique values and obtain the count for each
sortedInputColumn, sortedTableUniqueFullNameCounts=ossPyFuncs.uniquePandasIndexMapping(fixedList)
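
As a rough illustration of what this cleaning stage is meant to achieve, the sketch below erases a few legal-entity suffixes and symbols from made-up company strings and then counts the resulting names. It is not the implementation of `ossPyFuncs.eraseFromColumn`; the helper `erase_patterns` and the pattern lists are hypothetical stand-ins for the curated key files.

```python
import re
import pandas as pd

def erase_patterns(column, patterns):
    """Remove every regex in `patterns` from each string in `column`."""
    combined = re.compile("|".join(f"(?:{p})" for p in patterns),
                          flags=re.IGNORECASE)
    cleaned = column.fillna("").str.replace(combined, "", regex=True)
    # Collapse leftover runs of whitespace and trim the ends.
    return cleaned.str.replace(r"\s+", " ", regex=True).str.strip()

# Made-up company listings showing typical variation in user input.
companies = pd.Series(["Acme, Inc.", "acme LLC", "@Acme", "Globex Corp"])

# Hypothetical stand-ins for the legal entity and symbol key lists.
legal_entities = [r",?\s*\binc\b\.?", r"\bllc\b", r"\bcorp\b\.?"]
symbols = [r"@"]

# Erase the patterns, normalize case, and count each resulting name.
cleaned = erase_patterns(companies, legal_entities + symbols).str.lower()
counts = cleaned.value_counts()
```

After cleaning, the three "Acme" variants collapse to a single name, which is what lets the frequency count reflect how many users actually share an affiliation.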