One possible approach to cleaning our messy input data is to use a fuzzy matching algorithm to match infrequent or singleton entries with more frequent entries (under the assumption that deviation from the frequent entries indicates some sort of error process).

Let's begin by loading a pre-cleaned dataset.

In [3]:
import pandas as pd

#import file
cleanedUsers=pd.read_csv('/home/dnb3k/git/dspg20oss/ossPy/PackageOuts/sourceCountsNames.csv')
cleanedUsers=cleanedUsers[['company','count']]

cleanedUsers.head(10)

Unnamed: 0,company,count
0,microsoft,6363
1,google,3723
2,red hat,2063
3,ibm,1865
4,facebook,1194
5,intel,1000
6,freelancer,990
7,freelance,949
8,thoughtworks,864
9,none,850


We'll next need to select what proportion of these names we want to assume are "correct" and thus should be treated as the "dictionary" that everything else will be checked against.  We'll use an arbitrary cutoff of 5 listings for this purpose.  In this way we'll split the names into those entries with more than 5 user listings, and those with 5 or fewer.

In [None]:
# set threshold
thresholdEmployees=5

#extract new "verified" dataframe for companies listed more than thresholdEmployees times
#(cleanedUsers is the table loaded in the first cell)
verifiedCompanies=cleanedUsers['company'].loc[cleanedUsers['count']>thresholdEmployees]
verifiedCompanies=pd.DataFrame(verifiedCompanies)

#do the same for the "unverified" companies (thresholdEmployees or fewer listings)
unverifiedCompanies=cleanedUsers['company'].loc[cleanedUsers['count']<=thresholdEmployees]
unverifiedCompanies=pd.DataFrame(unverifiedCompanies)

unverifiedCompanies.head(10)

In [15]:
import ast
import difflib
import os

diffLibGuessPath='/home/dnb3k/git/dspg20oss/ossPy/PackageOuts/diffLibGuesses.csv'

#if the demo file exists locally, load it up to save time
if os.path.isfile(diffLibGuessPath):
    unverifiedCompanies=pd.read_csv(diffLibGuessPath)
    #the csv keeps the original row labels in the 'Unnamed: 0' column, so restore them as the index
    unverifiedCompanies=unverifiedCompanies.set_index(unverifiedCompanies['Unnamed: 0'])
    #the csv stores each list of guesses as text (e.g. "['ibm research']"), so convert back to real lists
    unverifiedCompanies['guesses']=unverifiedCompanies['guesses'].apply(ast.literal_eval)
else:
    #compute the guesses for each unverified company; an empty list means no close match was found
    unverifiedCompanies['guesses']=[
        difflib.get_close_matches(currentCompany,verifiedCompanies['company'],cutoff=0.8)
        for currentCompany in unverifiedCompanies['company']
    ]
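
To make the behavior of `difflib.get_close_matches` concrete, here is a minimal sketch on a toy dictionary.  The list `toyVerified` and the three result names are made up for illustration (they are not part of the real dataset); the call itself uses the same `cutoff=0.8` as above.  By default the function returns at most three candidates, best match first, and an empty list when nothing clears the cutoff.

```python
import difflib

# hypothetical toy "dictionary" of verified names (not the real dataset)
toyVerified = ['the iron yard', 'ibm research', 'adobe systems', 'radio systems']

# a run-together name is matched to its spaced equivalent
spacedMatch = difflib.get_close_matches('theironyard', toyVerified, cutoff=0.8)
# a near miss can return several candidates, best match first
multiMatch = difflib.get_close_matches('adobe system', toyVerified, cutoff=0.8)
# a name with nothing similar in the dictionary returns an empty list
noMatch = difflib.get_close_matches('thoughtworks', toyVerified, cutoff=0.8)

print(spacedMatch)
print(multiMatch)
print(noMatch)
```

Because every unverified name is compared against every verified name, this step is slow on the full dataset, which is presumably why the cell above caches its results in a csv.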

Let's take a look at some of this output.

In [16]:
#pick some good examples
sensibleGuesses=unverifiedCompanies.loc[[6614,6628,6650,6677,6680,6774,6788,6867]]
#show them
sensibleGuesses.head(8)

Unnamed: 0,Unnamed: 0,company,guesses
6614,6614,theironyard,['the iron yard']
6628,6628,independent software consultant,['independent consultant']
6650,6650,makinacorpus,['makina corpus']
6677,6677,toyotaresearchinstitute,['toyota research institute']
6680,6680,ibmresearch,['ibm research']
6774,6774,adobe system,"['adobe systems', 'radio systems']"
6788,6788,voyagegroup,['voyage group']
6867,6867,fourkitchens,['four kitchens']


Here we see some (manually curated) examples of (apparently) valid guesses.  In most of them we're seeing one of two cases:
    (1) the addition of spaces:  This is likely in cases where the @ has been removed from an affiliation tag.  Because the linked @ listings *do not feature spaces*, spaces must be added in order to match other users who have explicitly typed in the group's full name (using spaces).
    (2) the addition or removal of letters or words:  in the "independent software consultant" case the match drops a spurious word, and in the "adobe system" case it restores a missing letter.
    
While we have found some examples of valid matches, there are *many* examples of invalid matches.

In [18]:
#pick some bad examples
badGuesses=unverifiedCompanies.loc[[6622,6639,6686,6690,6699,6714,6735,6746]]
#show them
badGuesses.head(8)

Unnamed: 0,Unnamed: 0,company,guesses
6622,6622,post ch,['postech']
6639,6639,rmit,"['rit', 'mit']"
6686,6686,lumi,['pulumi']
6690,6690,limetray,['liferay']
6699,6699,and digital,"['andigital', 'nhs digital']"
6714,6714,ignitionapp,['invisionapp']
6735,6735,nodeart,['codeart']
6746,6746,directv,['directi']


Here we see some examples of guesses which are actually inaccurate.  In fact, most of the guesses made by this algorithm are of this sort.  Perhaps this isn't that surprising.  With 200 thousand unique entries, the probability of two distinct names being adjacent to one another in "Levenshtein space" becomes increasingly high.  Given that there's just a finite number of ways to combine words (and make company names seem "snappy" or "techy"), it's not surprising that we have valid company names that are only two or so characters from one another.
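
As a rough illustration of why short names collide so easily, the sketch below scores a few of the pairs from the tables above.  Strictly speaking, `difflib` does not compute a Levenshtein distance: it uses `SequenceMatcher.ratio()`, which is 2*M/T, where M is the number of matched characters and T is the combined length of the two strings.  The helper name `similarity` and the `pairs` list are just for this example.

```python
import difflib

def similarity(a, b):
    # ratio() = 2*M/T: M matched characters, T total characters in both strings
    return difflib.SequenceMatcher(None, a, b).ratio()

# (unverified name, guess) pairs taken from the output tables above
pairs = [('rmit', 'mit'),
         ('lumi', 'pulumi'),
         ('directv', 'directi'),
         ('theironyard', 'the iron yard')]

scores = {pair: similarity(*pair) for pair in pairs}
for (name, guess), score in scores.items():
    print(f'{name!r} vs {guess!r}: {score:.3f}')
```

For 'rmit' vs 'mit', three characters match out of seven in total, giving 6/7 (about 0.857), which clears the 0.8 cutoff even though these are different institutions.  The score alone cannot separate a typo from a genuinely different short name.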

Overall we see that, outside of limiting our use of the algorithm to finding where spaces ought to be inserted, there may not be a wide application for this algorithm.

Let's go ahead and implement this new "space based" remapping algorithm.

In [19]:
#establish new column for correctly spaced mappings
unverifiedCompanies['spaceDiffMap']=''
#find the entries that actually have guesses--to speed up iteration
entriesWithGuesses=unverifiedCompanies.index[unverifiedCompanies['guesses'].str.len()>0]

#iterate across entries with guesses (by index label)
for currentEntry in entriesWithGuesses:
    #extract current string from company vector
    string1=unverifiedCompanies.loc[currentEntry,'company']
    #extract what may be a list of guesses
    currentGuesses=unverifiedCompanies.loc[currentEntry,'guesses']
    #iterate across those guesses
    for string2 in currentGuesses:
        #remove all the spaces
        string1NoSpaces=string1.replace(' ', '')
        string2NoSpaces=string2.replace(' ', '')
        #determine if the second string actually has spaces, done to ensure correct directionality of mapping
        string2HasSpaces=' ' in string2
        #create a boolean if the two space-removed strings are equal
        onlySpaceDiff=string1NoSpaces==string2NoSpaces
        #if both of the conditions are true
        if string2HasSpaces and onlySpaceDiff:
            #set the mapping for the current entry (by label, not loop position) to the relevant value
            unverifiedCompanies.loc[currentEntry,'spaceDiffMap']=string2
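
If the space-only case is all we end up keeping, one possible alternative (a sketch, not what the cell above does) is to skip the fuzzy matching step and look names up directly by their space-stripped form.  The toy frames `toyVerified` and `toyUnverified` and the `spaceLookup` dictionary below are hypothetical stand-ins for the real tables.

```python
import pandas as pd

# hypothetical toy data standing in for verifiedCompanies / unverifiedCompanies
toyVerified = pd.Series(['the iron yard', 'ibm research', 'pulumi', 'mit'])
toyUnverified = pd.DataFrame({'company': ['theironyard', 'ibmresearch', 'lumi', 'rmit']})

# only verified names that contain a space can serve as a "correctly spaced" target
spacedNames = toyVerified[toyVerified.str.contains(' ', regex=False)]
# key: the name with spaces stripped; value: the correctly spaced name
spaceLookup = {name.replace(' ', ''): name for name in spacedNames}

# strip spaces from each unverified name and look it up; unmatched names become ''
toyUnverified['spaceDiffMap'] = (toyUnverified['company']
                                 .str.replace(' ', '', regex=False)
                                 .map(spaceLookup)
                                 .fillna(''))

print(toyUnverified)
```

This applies the same test as the loop (identical once spaces are removed, and the target contains a space), but it does not depend on the target having appeared among the top `difflib` guesses, so it may map a few more names than the loop does.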