Let's begin by importing the Excel file containing the company name list.

In [10]:
import pandas as pd

#establish path
pathToFDAList='/home/dnb3k/Downloads/NDC_Company_Dataset.xls'

#load it
fdaCompanyList=pd.read_excel(pathToFDAList)

fdaCompanyList=fdaCompanyList.rename(columns={'Row Labels': 'company'})
fdaCompanyList=fdaCompanyList.astype(str)

#view what we loaded
fdaCompanyList.head(30)



Unnamed: 0,company
0,19750101
1,20090901
2,20141217
3,20150301
4,20150801
5,20160601
6,20161012
7,20161026
8,20181130
9,20190501


Now that we have taken a quick look at it, let's look at the unique subtokens (i.e. the "words" found in these entries). We'll use a strategy adopted in a previous notebook.

In [12]:
#cat all the row entries into one long string
longString=fdaCompanyList['company'].str.cat(sep=' ')

#separate each "word" (space separated token) into an extremely long list
longStringSeparated=longString.split(' ')

#turn it into a dataframe
uniqueSubTokenFrame=pd.DataFrame(longStringSeparated)

#get the count on that column, this tells us the frequency of those unique tokens
columnUniqueCounts=uniqueSubTokenFrame.iloc[:,0].value_counts()

#convert that output to a proper table
#(naming the columns explicitly avoids relying on the default names that reset_index produces, which differ between pandas versions)
tableUniqueCounts=columnUniqueCounts.rename_axis('token').reset_index(name='count')

print(tableUniqueCounts.shape)
print('number of unique string tokens in this dataset')

tableUniqueCounts.head(20)

(7418, 2)
number of unique string tokens in this dataset


Unnamed: 0,token,count
0,Inc.,1636
1,LLC,1168
2,Inc,651
3,"Co.,",295
4,Medical,290
5,Pharmaceuticals,279
6,Ltd.,269
7,"Pharmaceuticals,",242
8,&,203
9,Products,168
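The token count above can also be sketched more compactly with `str.split` and `explode`. This is a minimal illustration rather than the notebook's own method: `sampleCompanies` is a made-up stand-in for `fdaCompanyList['company']`, and `str.split()` with no argument splits on any run of whitespace, so unlike `split(' ')` it does not produce empty tokens when two spaces occur in a row.

```python
import pandas as pd

# Hypothetical sample data standing in for fdaCompanyList['company']
sampleCompanies = pd.Series([
    "Acme Pharmaceuticals, Inc.",
    "Beta Medical LLC",
    "Gamma Labs Inc.",
])

tokenCounts = (
    sampleCompanies.str.split()      # list of whitespace-separated tokens per row
    .explode()                       # one token per row
    .value_counts()                  # frequency of each unique token, descending
    .rename_axis("token")            # name the index so it becomes a column
    .reset_index(name="count")       # back to a two-column table
)

print(tokenCounts.shape)
print(tokenCounts.head())
```

Because punctuation stays attached to each token, `Inc.` and `Inc` are still counted separately here, just as in the table above.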


As we can see, there are a number of very common legal entity substrings which could cause confusion as we try to merge and de-duplicate this data set (i.e. rows 0, 1, 2, 3, 6, 10, 11, 13, 15, and 17, which is half of the top 20!). Let's apply a tried and true method for removing these.

In [25]:
import os
#infer directory structure from location of ossPyFuncs file.  Open to suggestions on how to do this better.
currentDir='/home/dnb3k/git/dspg20oss/ossPy'

os.chdir(currentDir)

import ossPyFuncs

#load the curated legal entity list (one regex pattern per row)
LElist=pd.read_csv(os.path.join(currentDir,'keyFiles/curatedLegalEntitesRaw.csv'),quotechar="'",header=None)

#perform the erasure
LEoutput, LEeraseList=ossPyFuncs.eraseFromColumn(fdaCompanyList['company'],LElist)

#format the output
LEoutput=pd.DataFrame(LEoutput)
LEeraseList=LEeraseList.sort_values(by='changeNum',ascending=False)
#view some of the output statistics
LEeraseList.head(30)

Unnamed: 0,0,changeNum,changeIndexes
0,(?i) Inc\b,2574,"[26, 29, 30, 33, 38, 43, 44, 52, 57, 58, 59, 6..."
2,(?i) LLC\b,1284,"[34, 41, 48, 50, 53, 55, 61, 74, 75, 79, 80, 8..."
332,(?i) co\b,605,"[64, 71, 92, 95, 113, 142, 143, 152, 183, 206,..."
1,(?i) Ltd\b,532,"[60, 85, 86, 95, 99, 138, 152, 183, 200, 212, ..."
7,(?i) Company\b,236,"[45, 66, 68, 124, 184, 382, 397, 401, 402, 406..."
135,(?i) DBA\b,187,"[87, 88, 89, 114, 144, 161, 179, 186, 239, 256..."
4,(?i) Corporation\b,173,"[39, 129, 148, 165, 211, 348, 362, 413, 417, 4..."
10,(?i) Corp\b,120,"[28, 63, 127, 128, 187, 309, 389, 416, 457, 70..."
5,(?i) Limited\b,119,"[124, 160, 243, 258, 276, 278, 293, 442, 452, ..."
67,(?i) Incorporated\b,29,"[70, 913, 914, 921, 1099, 1332, 1359, 1626, 19..."
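To make the output above easier to interpret, here is a rough sketch of how a regex-based erasure of this kind could work. It is not the implementation of `ossPyFuncs.eraseFromColumn`, which is not shown in this notebook; `erase_patterns` and the sample data are hypothetical, and only the `changeNum` and `changeIndexes` column names and the pattern style (`(?i) Inc\b`) are taken from the output.

```python
import pandas as pd

def erase_patterns(column, patterns):
    """Hypothetical helper: strip each regex from a string column and
    record how many rows, and which ones, each pattern changed."""
    cleaned = column.copy()
    stats = []
    for pattern in patterns:
        # rows where this pattern occurs at least once
        matched = cleaned.str.contains(pattern, regex=True)
        # remove every occurrence of the pattern
        cleaned = cleaned.str.replace(pattern, "", regex=True)
        stats.append({
            "pattern": pattern,
            "changeNum": int(matched.sum()),
            "changeIndexes": matched[matched].index.tolist(),
        })
    return cleaned.str.strip(), pd.DataFrame(stats)

# Made-up sample data, not taken from the FDA list
sample = pd.Series([
    "Acme Pharmaceuticals, Inc.",
    "Beta Medical LLC",
    "Gamma Labs inc",
    "Delta Incubators",
])
patterns = [r"(?i) Inc\b", r"(?i) LLC\b"]

cleaned, stats = erase_patterns(sample, patterns)
print(cleaned)
print(stats.sort_values(by="changeNum", ascending=False))
```

The leading space and trailing `\b` keep the pattern from matching inside longer words such as "Incubators", and `(?i)` makes the match case-insensitive. Punctuation next to the erased term (the `,.` left behind in the first sample row) is not removed by these patterns and would need a separate cleanup step.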
