Given the [functions we've created](https://github.com/DSPG-Young-Scholars-Program/dspg20oss/blob/master/ossPy/ossPyFuncs.py) and our [sectoring capabilities](https://github.com/DSPG-Young-Scholars-Program/dspg20oss/blob/master/ossPy/Notebooks/Business_User_Bool_Vec_Creation.ipynb), we can now improve one of [our previous visualizations](https://github.com/DSPG-Young-Scholars-Program/dspg20oss/blob/master/ossPy/Notebooks/Company%20Cleaning%20Narritive.ipynb), [the wordcloud](https://github.com/DSPG-Young-Scholars-Program/dspg20oss/blob/master/ossPy/scrips/wordCloud.py).

Instead of simply allowing colors to be applied randomly, we can assign colors to affiliations based upon their sector membership (e.g. academic, business, individual, governmental).  To do this we have to reapply [many of the same methods we used previously when cleaning and sectoring](https://github.com/DSPG-Young-Scholars-Program/dspg20oss/blob/master/ossPy/Notebooks/Business_User_Bool_Vec_Creation.ipynb).

Let's begin by loading the existing data (raw GitHub, academic sectoring, and governmental sectoring) from our SQL database.

In [1]:
#this code guarantees you can import the ossPyFuncs library
import subprocess
import os
#get top directory path of the current git repository, under the presumption that
#you're in the dspg20oss repo.
gitRepoPath=subprocess.check_output(['git', 'rev-parse', '--show-toplevel']).decode('ascii').strip()
#move to the ossPy directory, assuming the directory structure has remained the same
os.chdir(os.path.join(gitRepoPath, 'ossPy'))
#import the osspy library
import ossPyFuncs

#obtain the raw GHTorrent data
postgreSql_selectQuery="SELECT login, company FROM gh.ctrs_raw ;"
fullData=ossPyFuncs.queryToPDTable(postgreSql_selectQuery)

#perform sql query for academic entries
postgreSql_selectQuery="SELECT login, company_cleaned, is_academic FROM gh.sna_ctr_academic ;"
academicCleaned=ossPyFuncs.queryToPDTable(postgreSql_selectQuery)

#perform sql query for government entries
postgreSql_selectQuery="SELECT login, is_gov FROM gh.sna_ctr_gov ;"
govData=ossPyFuncs.queryToPDTable(postgreSql_selectQuery)

#add nonprofit here and in the next code block if desired

Now that we have obtained those pre-existing user mappings, we need to join them into one table.

In [2]:
import pandas as pd
import numpy as np

#join the academic into the raw table
joinedData1=fullData.set_index('login').join(academicCleaned.set_index('login'))

#join the government table into the raw/academic table
joinedData2=joinedData1.join(govData.set_index('login'))

#reset the index
joinedAndReset=joinedData2.reset_index()

#the left joins leave NaN for users missing from a sector table, so we reset those to the relevant default for each column
academicNanFix=pd.DataFrame(joinedAndReset['is_academic'].fillna(value=False))
govNanFix=pd.DataFrame(joinedAndReset['is_gov'].fillna(value=False))
companyCleanFix=pd.DataFrame(joinedAndReset['company_cleaned'].fillna(value=''))


#get the academic count
academicCount=np.count_nonzero(academicNanFix['is_academic'])
print(str(academicCount) + ' users with academic affiliations')
#get the government count
govCount=np.count_nonzero(govNanFix['is_gov'])
print(str(govCount) + ' users with governmental affiliations')

#take those NAN replaced columns and reinsert them
fixedDataframe=joinedAndReset.assign(is_gov=govNanFix['is_gov'],is_academic=academicNanFix['is_academic'],company_cleaned=companyCleanFix['company_cleaned'])

fixedDataframe.head(10)

40273 users with academic affiliations
3576 users with governmental affiliations


Unnamed: 0,login,company,company_cleaned,is_academic,is_gov
0,0,,,False,False
1,0----0,,,False,False
2,0--key,,,False,False
3,0-0-1,,,False,False
4,0-1-,,,False,False
5,0-22,,,False,False
6,0-3,Reborn Network,,False,False
7,0-60FPS,,,False,False
8,0-8-15,,,False,False
9,0-CNice,,,False,False
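
To make the join behavior concrete, here is a small sketch using made-up logins (the toy tables and their values are assumptions for illustration, not rows from the database). A left join keeps every row of the raw table and leaves NaN wherever the sector table has no matching login, which is why the flags are filled with False afterwards.

```python
import pandas as pd

# hypothetical stand-ins for the raw table and the academic sector table
raw = pd.DataFrame({'login': ['ann', 'bob', 'cat'],
                    'company': ['UVA', None, 'Acme Inc']})
academic = pd.DataFrame({'login': ['ann'],
                         'company_cleaned': ['university of virginia'],
                         'is_academic': [True]})

# DataFrame.join defaults to a left join on the index
joined = raw.set_index('login').join(academic.set_index('login')).reset_index()

# logins missing from the sector table come back as NaN; reset them explicitly
joined['is_academic'] = joined['is_academic'].fillna(False).astype(bool)
joined['company_cleaned'] = joined['company_cleaned'].fillna('')

print(joined)
```

Only `ann` appears in the toy sector table, so the other two rows end up with `is_academic` set to False and an empty `company_cleaned`.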


In order to perform a full sectoring we also need the information for household and null values. Let's obtain those now from our source keylists for household and null values. After that, we'll determine which users have yet to be mapped.

In [3]:
#get the directory structure using the ossPy directory as the reference point
ossPyDir=os.path.join(gitRepoPath, 'ossPy')

#obtain the household list from the keyfile directory, and make a bool column for it
houseHoldList=pd.read_csv(os.path.join(ossPyDir,'keyFiles/individualKeys.csv'),quotechar="'",header=None)
withHouseholdColumn=ossPyFuncs.addBooleanColumnFromCriteria(pd.DataFrame(fixedDataframe['company']),houseHoldList,'household')
#get the count
householdCount=np.count_nonzero(withHouseholdColumn['household'])
print(str(householdCount) + ' users in individual sector')
#add the column to the main table
fixedDataframe['household']=withHouseholdColumn['household']

#obtain the null list from the keyfile directory, and make a bool column for it
noneList=pd.read_csv(os.path.join(ossPyDir,'keyFiles/nullKeys.csv'),quotechar="'",header=None)
withNoneColumn=ossPyFuncs.addBooleanColumnFromCriteria(pd.DataFrame(fixedDataframe['company']),noneList,'null')
#get the count
nullCount=np.count_nonzero(withNoneColumn['null'])
print(str(nullCount) + ' users with null affiliations')
#add the column to the main table
fixedDataframe['null']=withNoneColumn['null']

#generate a bool column for all users that have been mapped, these will be excluded from our business count
alreadyAssigned=fixedDataframe[['is_gov','is_academic','household','null']].any(axis=1)

#extract those users which are not assigned
onlyUnassignedFrame=fixedDataframe.loc[~alreadyAssigned]

SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

4715 users in individual sector
7819 users with null affiliations
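
Since the eventual goal is one color per sector, it may help to collapse the boolean columns into a single label per user. The following is a minimal sketch on made-up flags; the precedence among sectors (simply the column order used here) is an assumption, not something the analysis above prescribes.

```python
import pandas as pd

# hypothetical flags for four users; column names follow the table above
flags = pd.DataFrame({'is_gov':      [False, True,  False, False],
                      'is_academic': [True,  False, False, False],
                      'household':   [False, False, True,  False],
                      'null':        [False, False, False, False]})

# a user counts as assigned if any sector flag is set
alreadyAssigned = flags.any(axis=1)

# idxmax returns the first column holding the row maximum, i.e. the first True flag;
# rows with no flag set are relabeled as 'unassigned'
sectorLabel = flags.astype(int).idxmax(axis=1).where(alreadyAssigned, 'unassigned')

print(sectorLabel.tolist())
```

The unassigned users are the ones carried forward into the business heuristic.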


Now that we have derived the list of users who have yet to be assigned, let's clean their input in the company column in preparation for subsequent processing.  We'll be removing substrings related to [legal entities](https://github.com/DSPG-Young-Scholars-Program/dspg20oss/blob/master/ossPy/keyFiles/curatedLegalEntitesRaw.csv), [web domains](https://github.com/DSPG-Young-Scholars-Program/dspg20oss/blob/master/ossPy/keyFiles/curatedDomains.csv), and [extraneous symbols](https://github.com/DSPG-Young-Scholars-Program/dspg20oss/blob/master/ossPy/keyFiles/symbolRemove.csv) as [described in another notebook](https://github.com/DSPG-Young-Scholars-Program/dspg20oss/blob/master/ossPy/Notebooks/Company%20Cleaning%20Narritive.ipynb) and [quantitatively profiled in another](https://github.com/DSPG-Young-Scholars-Program/dspg20oss/blob/master/ossPy/Notebooks/Cleaning%20heuristic%20assesment.ipynb).

In [5]:
#apply the abbreviation key list (expandAbrevs.csv) to the unassigned users' company entries
replaceList=pd.read_csv(os.path.join(ossPyDir,'keyFiles/expandAbrevs.csv'),quotechar="'",header=None)
replaceOutput, replaceList=ossPyFuncs.eraseFromColumn(onlyUnassignedFrame['company'],replaceList)

#construct path to legal entity list and erase them
LElist=pd.read_csv(os.path.join(ossPyDir,'keyFiles/curatedLegalEntitesRaw.csv'),quotechar="'",header=None)
LEoutput, LEeraseList=ossPyFuncs.eraseFromColumn(replaceOutput,LElist)

#construct path to symbol list and erase them
symbollist=pd.read_csv(os.path.join(ossPyDir,'keyFiles/symbolRemove.csv'),quotechar="'",header=None)
Symboloutput, symbolEraseList=ossPyFuncs.eraseFromColumn(LEoutput,symbollist)

#construct path to web domain list and erase them
domainsList=pd.read_csv(os.path.join(ossPyDir,'keyFiles/curatedDomains.csv'),quotechar="'",header=None)
domiansOutput, domainsEraseList=ossPyFuncs.eraseFromColumn(Symboloutput,domainsList)
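
As a rough illustration of what this kind of cleaning does to a single entry, here is a standalone sketch using Python's `re` module. It is not the implementation of `ossPyFuncs.eraseFromColumn`; the pattern lists and the helper names `eraseSubstrings` and `cleanCompany` are hypothetical, and the real key lists live in the keyFiles directory.

```python
import re

# hypothetical erase patterns; the curated lists in keyFiles are far longer
legalEntities = [r'\binc\b\.?', r'\bllc\b\.?', r'\bltd\b\.?']
symbols = [r'[@,]']
domains = [r'\.com\b', r'\.org\b']

def eraseSubstrings(name, patterns):
    """Remove every match of each regex pattern, ignoring case."""
    for pattern in patterns:
        name = re.sub(pattern, '', name, flags=re.IGNORECASE)
    return name.strip()

def cleanCompany(name):
    # apply the erase passes in the same order as the cell above:
    # legal entities, then symbols, then web domains
    for patterns in (legalEntities, symbols, domains):
        name = eraseSubstrings(name, patterns)
    # collapse any whitespace left behind by the removals
    return re.sub(r'\s+', ' ', name).strip()

for raw in ['@Acme, Inc.', 'Example.com LLC', 'Widgets Ltd']:
    print(repr(raw), '->', repr(cleanCompany(raw)))
```

Variants such as `@Acme, Inc.` and `Acme` collapse to the same string after cleaning, which is what makes the later counting step meaningful.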

Having fully cleaned and remapped the data (and in doing so, collapsed redundant entries into one another), we can now apply our heuristic count.  Specifically, given that we have removed all entries which would be associated with governmental, academic, and independent (household) users (and removed null entries), it seems reasonable to assume that entries which have multiple users listing them are businesses.  This inference rests on our having exhausted the other workplace affiliations that a person might express.

However, we have to apply a (somewhat arbitrary) cutoff when we decide the minimum number of users who have to list the same workplace in order for us to assume it reflects a valid business.  We'll begin with 5, but the code can be changed as one sees fit.

In [6]:
#set the threshold, change if you'd like
threshold=5
#eraseFromColumn handed back a Series here, which has no .assign method,
#so convert it to a one-column table named 'company'
domiansOutput=pd.DataFrame(domiansOutput)
domiansOutput.columns=['company']
#force lowercase and remove spaces
domiansOutput=domiansOutput.assign(company=domiansOutput['company'].str.lower())
domiansOutput=domiansOutput.assign(company=domiansOutput['company'].str.replace(' ','',regex=False))
#get the column names
inputColumnName=domiansOutput.columns

#get the counts from the cleaned column
tableCleanedFullNameCounts=domiansOutput[inputColumnName[0]].value_counts()
#convert that output to a proper table
tableCleanedFullNameCounts=tableCleanedFullNameCounts.reset_index()
#name the columns explicitly, since the labels reset_index produces differ between pandas versions
tableCleanedFullNameCounts.columns=[inputColumnName[0],'count']


#+1 because we are using greater than or equal to
#we'll also be using this vector to obtain our user remapping
aboveThresholdBoolVec=tableCleanedFullNameCounts['count'].ge(threshold+1)

#create a bool column
tableCleanedFullNameCounts['is_business']=aboveThresholdBoolVec

totalUsersAboveThreshold=np.sum(tableCleanedFullNameCounts['count'].loc[aboveThresholdBoolVec])

print(str(totalUsersAboveThreshold)+ ' users assumed to be business sector, with ' + str(threshold) + ' or more other users with the same listing')
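
The threshold logic is easy to check on a toy list. The sketch below uses `collections.Counter` as a stand-in for `value_counts`, with made-up affiliation strings; it follows the same convention as the cell above, where a listing qualifies when at least `threshold` other users share it.

```python
from collections import Counter

# hypothetical cleaned affiliations, one entry per user
cleaned = ['acme'] * 6 + ['initech'] * 5 + ['momandpop'] * 2 + ['solo']

threshold = 5
counts = Counter(cleaned)

# a listing qualifies when at least `threshold` OTHER users share it,
# i.e. when its total count is threshold + 1 or more
businesses = {name for name, n in counts.items() if n >= threshold + 1}

# per-user boolean vector, aligned with the input list
isBusiness = [name in businesses for name in cleaned]
usersInBusinesses = sum(isBusiness)

print(sorted(businesses), usersInBusinesses)
```

Here `initech`, with exactly 5 users, falls just short because each of its users has only 4 others sharing the listing.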

Now that we have obtained the raw count of the users meeting these criteria, we can also obtain a boolean vector that indicates which users are associated with these presumed businesses.

For now, because we don't have information about government affiliations, we'll plot business and academic separately.

In [None]:
import wordcloud
import IPython.display

#business affiliations
longString=domiansOutput['company'].str.cat(sep=' ')

#generate a wordcloud and convert it to svg
outcloud=wordcloud.WordCloud(width=2000, height=1000, max_words=2000).generate(longString)
svgCloud=outcloud.to_svg()
IPython.display.SVG(svgCloud)

In [None]:
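import wordcloud
import IPython.display

#academic affiliations: force lowercase and remove spaces in company_cleaned,
#which is the column the word cloud is built from below
fixedDataframe=fixedDataframe.assign(company_cleaned=fixedDataframe['company_cleaned'].str.lower())
fixedDataframe=fixedDataframe.assign(company_cleaned=fixedDataframe['company_cleaned'].str.replace(' ','',regex=False))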

longString=fixedDataframe['company_cleaned'].str.cat(sep=' ')

#generate a wordcloud and convert it to svg
outcloud=wordcloud.WordCloud(width=2000, height=1000, max_words=2000).generate(longString)
svgCloud=outcloud.to_svg()
IPython.display.SVG(svgCloud)
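
The sector-based coloring motivated at the top of this notebook could be sketched as a callable that maps each word to its sector's color. The color values, the `wordSector` lookup, and the `SectorColorFunc` class below are hypothetical and only illustrate the shape of the approach; in practice the lookup would be built from the sector columns derived above.

```python
# hypothetical sector colors and word-to-sector lookup, for illustration only
sectorColors = {'academic': '#1f77b4', 'business': '#d62728',
                'government': '#2ca02c', 'household': '#9467bd'}
wordSector = {'universityofvirginia': 'academic', 'acme': 'business'}

class SectorColorFunc:
    """Callable that returns a color for a word based on its sector."""
    def __init__(self, wordSector, sectorColors, defaultColor='#7f7f7f'):
        self.wordSector = wordSector
        self.sectorColors = sectorColors
        self.defaultColor = defaultColor

    def __call__(self, word, **kwargs):
        # wordcloud also passes font_size, position, etc. as keyword arguments;
        # they are not needed for a purely sector-based color, so ignore them
        sector = self.wordSector.get(word)
        return self.sectorColors.get(sector, self.defaultColor)

colorBySector = SectorColorFunc(wordSector, sectorColors)
print(colorBySector('acme'), colorBySector('unknownword'))
```

With the wordcloud package, a callable like this can be supplied through the `color_func` argument of `wordcloud.WordCloud`, or applied to an existing cloud with `outcloud.recolor(color_func=colorBySector)`. Words missing from the lookup fall back to the default gray.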