In [1]:
import json
import string

First, combine the lists of raw names that were scraped from each website with BeautifulSoup:

In [2]:
names = []
for rx_site in ['assist', 'list']:
    with open(f'raw/names_rx{rx_site}.json', 'r') as f:
        names += list(json.load(f))
len(names)

9332

There are 9332 names in total. Inspecting the JSON files shows clear duplicates, especially among names with two or more words. Anything beyond the first word is usually a descriptor rather than part of the brand name itself.

The following keeps only the first word of each name, converts it to lowercase, and removes duplicates:

In [3]:
def clean_name(name_str):
    name_str = name_str.split(' ')[0]
    name_str = name_str.lower()
    return name_str

names_clean = list(map(clean_name, names))  # apply the element-wise function
names_clean = list(set(names_clean))        # converting to a set and back to a list will remove duplicates
len(names_clean)

3704

After this step only 3704 names remain. Further filtering may be unwise at this point since we want a sufficiently large dataset. 
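
As a quick check of what the cleaning step does, the sketch below applies the same logic to a few inputs. The sample strings are made up for illustration and are not taken from the scraped files. Note that `split(' ')` splits on the space character only, so a trailing tab survives; `split()` with no argument would split on any whitespace instead.

```python
def clean_name(name_str):
    # Same logic as the cell above: first space-delimited word, lowercased
    return name_str.split(' ')[0].lower()

# Illustrative inputs only (assumed, not from the raw JSON files)
samples = ['Tylenol Extra Strength', 'TYLENOL', 'Zyrtec-D 12 Hour', 'Voraxaze\t']

cleaned = [clean_name(s) for s in samples]
unique = sorted(set(cleaned))  # sorted only to make the result deterministic

print(cleaned)
print(unique)
```

The two spellings of "Tylenol" collapse into one entry, while the hyphenated descriptor and the trailing tab are left untouched by this step.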

The following checks which characters other than the lowercase English letters appear in the names, to see whether there are any special characters:

In [4]:
names_str_long = ''.join(names_clean)
chars = list(
    set(names_str_long).difference(
        set(string.ascii_lowercase)
    )
)
chars

['1', '\t', 'é', '0', '6', '-', '8', '4', '5', '.', '/', '7', ',', '2', '3']
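
To judge how widespread each unusual character is, it may also help to count how many names contain it. The following is a minimal sketch using `collections.Counter` on a small hand-picked sample; `sample_names` is a hypothetical stand-in for `names_clean`.

```python
import string
from collections import Counter

# Hypothetical stand-in for names_clean
sample_names = ['b12', 'acam2000', 'juvéderm', 'retin-a', 'zyrtec-d', 'tylenol']

allowed = set(string.ascii_lowercase)

# Count each unusual character once per name that contains it
char_counts = Counter(
    char
    for name in sample_names
    for char in set(name) - allowed
)

for char, count in sorted(char_counts.items()):
    print(repr(char), count)
```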

Let's look at these suspect names:

In [5]:
for char in chars:
    print(f'Words with "{char}":')
    print([name for name in names_clean if char in name])
    print('--------')

Words with "1":
['propimex-1', 'cyclinex-1', 'glofil-125', 'ketonex-1', 'cnj-016', 'niferex-150', 'vagistat-1', 'tyrex-1', 'hominex-1', 'phenex-1', 'i-valex-1', 'b12', 'glutarex-1']
--------
Words with "	":
['voraxaze\t']
--------
Words with "é":
['juvéderm']
--------
Words with "0":
['cnj-016', 'niferex-150', 'rimso-50', 'anadrol-50', 'kenalog-40', 'acam2000']
--------
Words with "6":
['cnj-016', 'md-76r']
--------
Words with "-":
['propimex-1', 'ery-tab', 'neo-synalar', 'phenergan-codeine', 'np-thyroid', 'platinol-aq', 'gelsyn-3', 'melquin-3', 'cyclinex-1', 'gavilyte-g', 'derma-smoothe/fs', 'levo-t', 'glofil-125', 'ketonex-1', 'cyclinex-2', 'zyrtec-d', 'tylenol-codeine', 'hydro-q', 'dilaudid-hp', 'micro-k', 'hi-cal', 'retin-a', 'cnj-016', 'omeclamox-pak', 'gavilyte-c', 'lo-zumandimine', 'thyro-tabs', 'ez-disk', 'poly-pred', 'tyrex-2', 'gamunex-c', 'cardiogen-82', 'lac-hydrin', 'r-gene', 'glyrx-pf', 'halog-e', 'niferex-150', 'gavilyte-n', 'timoptic-xe', 'paxil-cr', 'vagistat-1', 'ultr

It seems more duplicates need to be removed, since the hyphen often stands in for a space when a descriptor is attached to a name. The format of the hyphenated names is inconsistent: the descriptor sometimes comes before the presumed drug name (e.g. `tri-sprintec`) and sometimes after it (e.g. `neotrace-4`). To account for this, we will assume that the actual drug name is the longest of the hyphen-separated substrings.

So, in the previous examples, we would extract the names `sprintec` and `neotrace`:

In [6]:
def clean_hyphens(name_str):
    substrings = name_str.split('-')
    longest = max(substrings, key=len)
    return longest

names_clean = list(map(clean_hyphens, names_clean))     # apply the element-wise function
names_clean = list(set(names_clean))                    # remove duplicates created by dropping the descriptors
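
To see how the longest-substring heuristic behaves, here is a small sketch that runs the same function on names that appear earlier in this notebook. One detail worth knowing is that `max` returns the first item when several substrings tie for the longest, so `ery-tab` resolves to `ery`.

```python
def clean_hyphens(name_str):
    # Keep the longest hyphen-separated piece; max() returns the first one on ties
    return max(name_str.split('-'), key=len)

examples = ['tri-sprintec', 'neotrace-4', 'i-valex-1', 'md-76r', 'ery-tab', 'phenergan-codeine']

results = {name: clean_hyphens(name) for name in examples}
for name, cleaned in results.items():
    print(f'{name} -> {cleaned}')
```

The heuristic is only an approximation: for combination products such as `phenergan-codeine` it keeps one component and discards the other, and for `md-76r` it keeps the numeric tag rather than a name.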

Let's see which outlier names remain, reusing the code from earlier:

In [8]:
names_str_long = ''.join(names_clean)
chars = list(
    set(names_str_long).difference(
        set(string.ascii_lowercase)
    )
)

for char in chars:
    print(f'Words with "{char}":')
    print([name for name in names_clean if char in name])
    print('--------')

Words with "1":
['b12']
--------
Words with "é":
['juvéderm']
--------
Words with "0":
['acam2000']
--------
Words with "6":
['76r']
--------
Words with ".":
['d.', 'h.p.', 'e.e.s.']
--------
Words with "/":
['smoothe/fs']
--------
Words with "7":
['76r']
--------
Words with ",":
['aerobid,', 'prempro,', 'glucophage,', 'naprosyn,', 'biaxin,']
--------
Words with "2":
['acam2000', 'b12']
--------
Words with "	":
['voraxaze\t']
--------


TO DO: remove names that are too short and (probably) any name containing a number or ".", then simply mask out the remaining non-alphabetic characters
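
One possible way to carry out these remaining steps is sketched below. It rests on several assumptions that the notebook does not specify: the minimum length `MIN_LENGTH = 3` is a hypothetical cutoff, `finalize_name` is a hypothetical helper, and "mask out" is interpreted as deleting every character outside a-z.

```python
import string

MIN_LENGTH = 3  # hypothetical cutoff, not specified in the notebook
ALLOWED = set(string.ascii_lowercase)

def finalize_name(name_str, min_length=MIN_LENGTH):
    """Return a cleaned name, or None if the name should be dropped."""
    # Drop anything containing a digit or a period
    if any(ch.isdigit() or ch == '.' for ch in name_str):
        return None
    # Assumed meaning of "mask out": delete every character outside a-z
    masked = ''.join(ch for ch in name_str if ch in ALLOWED)
    # Drop names that end up too short
    if len(masked) < min_length:
        return None
    return masked

# Outlier names taken from the output above
leftovers = ['b12', 'acam2000', '76r', 'd.', 'h.p.', 'e.e.s.',
             'aerobid,', 'voraxaze\t', 'smoothe/fs', 'juvéderm']

final = sorted({n for n in map(finalize_name, leftovers) if n is not None})
print(final)
```

Under this interpretation `juvéderm` loses its accented letter entirely and `smoothe/fs` becomes `smoothefs`, so it may be preferable to transliterate accents or split on "/" before masking.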