In [1]:
import json
import string

First combine the lists of raw names from each website scraped with BeautifulSoup:

In [2]:
names = []
for rx_site in ['assist', 'list']:
    with open(f'raw/names_rx{rx_site}.json', 'r') as f:
        names += list(json.load(f))
len(names)

9332

9332 total names. Looking at the contents of the json files there are clearly some duplicates especially when the names involve two words. Generally anything beyond the first word isn't necessarily relevant to the original brand name itself. 

The following saves only the first word for names and removes duplicates. It also makes names all lowercase:

In [3]:
def clean_name(name_str):
    name_str = name_str.split(' ')[0]
    name_str = name_str.lower()
    return name_str

names_clean = list(map(clean_name, names))  # apply the element-wise function
names_clean = list(set(names_clean))        # converting to a set and back to a list will remove duplicates
len(names_clean)

3704

After this step only 3704 names remain. Further filtering may be unwise at this point since we want a sufficiently large dataset. 

The following checks which characters appear in the names other than from the standard English alphabet, to see if there are any special characters:

In [4]:
names_str_long = ''.join(names_clean)
chars = list(
    set(names_str_long).difference(
        set(string.ascii_lowercase)
    )
)
chars

['-', '0', '\t', '6', '7', '4', 'é', '.', '1', '3', '/', '5', '2', '8', ',']

Let's look at these suspect names:

In [5]:
for char in chars:
    print(f'Words with "{char}":')
    print([name for name in names_clean if char in name])
    print('--------')

Words with "-":
['tri-sprintec', 'propimex-2', 'trivora-28', 'kenalog-40', 'micro-k', 'monoclate-p', 'dritho-scalp', 'synvisc-one', 'tyrex-2', 'platinol-aq', 'rimso-50', 'md-gastroview', 'oxsoralen-ultra', 'klor-con', 'podocon-25', 'depo-provera', 'an-sulfur', 'gavilyte-h', 'derma-smoothe/fs', 'yf-vax', 'je-vax', 'zembrace-symtouch', 'm-m-r', 'timoptic-xe', 'phenex-1', 'alphagan-p', 'slow-k', 'hominex-1', 'epivir-hbv', 'aci-jel', 'gavilyte-c', 'autoplex-t', 'ak-fluor', 'slo-phyllin', 'neo-fradin', 'monistat-derm', 'cnj-016', '8-mop', 'm-r-vax', 'ic-green', 'np-thyroid', 'neo-synephrine', 'vagistat-1', 'lac-hydrin', 'proplex-t', 'tyrex-1', 'clarinex-d', 'dyna-hex', 'glutarex-2', 'nabi-hb', 'ultra-technekow', 'center-al', 'propimex-1', 'tuxarin-er', 'omeclamox-pak', 'hi-cal', 'ak-pentolate', 'derma-smoothe', 'olux-e', 'phenex-2', 'depo-medrol', 'i-valex-2', 'nor-qd', 'coly-mycin', 'cytra-k', 'alka-seltzer', 'vira-a', 'k-tab', 'k-lor', 'gel-one', 'allegra-d', 'retin-a', 'paxil-cr', 'hydro

It seems more duplicates need be removed since the hyphen often acts as an alternative for a space when adding descriptors to the name. To try to best account for the inconcistency of format of the hyphenated names, since sometimes the descriptor tags come before (e.g. `tri-sprintec`) or after (e.g. `neotrace-4`) the presumed drug name, we will assume that the actual drug name in a sequence of hyphenated strings is the largest substring in the sequence.

So, in the previous examples, we would extract the names `sprintec` and `neotrace`:

In [6]:
def clean_hyphens(name_str):
    substrings = name_str.split('-')
    longest = max(substrings, key=len)
    return longest

names_clean = list(map(clean_hyphens, names_clean))     # apply the element-wise function

Let's see what outlier words remain with similar code from earlier:

In [7]:
names_str_long = ''.join(names_clean)
chars = list(
    set(names_str_long).difference(
        set(string.ascii_lowercase)
    )
)

for char in chars:
    print(f'Words with "{char}":', [name for name in names_clean if char in name])

Words with "0": ['acam2000']
Words with "	": ['voraxaze\t']
Words with "6": ['76r']
Words with "7": ['76r']
Words with ".": ['d.', 'h.p.', 'e.e.s.']
Words with "é": ['juvéderm']
Words with "1": ['b12']
Words with "/": ['smoothe/fs']
Words with "2": ['acam2000', 'b12']
Words with ",": ['naprosyn,', 'glucophage,', 'prempro,', 'aerobid,', 'biaxin,']


We make our final cleanup edits based on these words as follows:

In [12]:
def clean_misc(name_str):
    name_str = name_str.strip()     # handle the \t tab character
    name_str.replace(',', '')       # remove commas from names
    name_str.replace('é', 'e')      # remove accent on the single name

names_clean = list(map(clean_hyphens, names_clean))     # apply the element-wise function

# Now remove the rest of the remaining outlier words
names_str_long = ''.join(names_clean)
chars = list(
    set(names_str_long).difference(
        set(string.ascii_lowercase)
    )
)
remaining_outliers = [name for name in names_clean if char in name]
for name in remaining_outliers:
    names_clean.remove(name)

# Now ensure we've removed them all
remaining_outliers = [name for name in names_clean if char in name]
remaining_outliers

[]

Finally we should probably remove names that are excessively short, and make sure we remove any duplicates generated by the cleaning process:

In [13]:
names_clean = [name for name in names_clean if len(name) > 3]
names_final = list(set(names_clean))
len(names_final)

3601

We have 3601 total names to work with. Let's export them to a final JSON file:

In [14]:
with open('names_clean.json', 'w') as f:
    json.dump(names_final, f)