# Text Preprocessing Solution

##### Author: Alex Sherman | alsherman@deloitte.com

### Exercise
1. Print all the distinct entities tagged with 'CARDINAL'
2. Print all the distinct entities tagged with 'PERSON'
3. Print all the distinct entities tagged with 'GPE'

For all exercises:
- use a batch size of 100
- disable the parser and tagger (ner must stay enabled, since it adds the entity tags)

In [46]:
%%time

# print every entity tagged as CARDINAL
# (each occurrence is printed, so repeated values are not removed)
# search in immune_df.head(200)

for doc in nlp.pipe(immune_df.head(200), batch_size=100, disable=['parser','tagger']):
    for ent in doc.ents:
        # compare the label exactly rather than with a substring test
        if ent.label_ == 'CARDINAL':
            print(ent.text)

seven
at least eight
one
three
at least three
nine
five
three
two
at least two
one
two
two
one
seven
six
two
nine
two
two
Wall time: 1.15 s


In [47]:
%%time

# print every entity tagged as a person (PERSON)
# search in immune_df.head(500)
for doc in nlp.pipe(immune_df.head(500), batch_size=100, disable=['parser','tagger']):
    for ent in doc.ents:
        # compare the label exactly rather than with a substring test
        if ent.label_ == 'PERSON':
            print(ent.text)

conley
anti kell
Wall time: 2.3 s


In [48]:
%%time

# print every entity tagged as a geopolitical entity (GPE)
# search in immune_df.head(1000)
for doc in nlp.pipe(immune_df.head(1000), batch_size=100, disable=['parser','tagger']):
    for ent in doc.ents:
        # compare the label exactly rather than with a substring test
        if ent.label_ == 'GPE':
            print(ent.text)

mls
sweden
switzerland
india
Wall time: 3.28 s
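
The cells above print a matching entity every time it occurs, so values such as `two` repeat in the CARDINAL output. One way to get distinct values, sketched here, is to collect the texts in a set before printing. The helper name `distinct_entities` is hypothetical, and the stand-in objects only mimic the `ents`, `text`, and `label_` attributes of spaCy docs so that the sketch runs without a loaded model.

```python
from types import SimpleNamespace


def distinct_entities(docs, label):
    """Return the sorted, de-duplicated entity texts that carry `label`.

    `docs` is any iterable of objects exposing `.ents`, where each entity
    has `.text` and `.label_` (as spaCy Doc and Span objects do).
    """
    found = set()
    for doc in docs:
        for ent in doc.ents:
            if ent.label_ == label:
                found.add(ent.text)
    return sorted(found)


# Stand-in docs (assumption: they imitate spaCy's attributes, nothing more).
def fake_doc(*pairs):
    return SimpleNamespace(
        ents=[SimpleNamespace(text=t, label_=l) for t, l in pairs]
    )


docs = [
    fake_doc(("two", "CARDINAL"), ("sweden", "GPE")),
    fake_doc(("two", "CARDINAL"), ("seven", "CARDINAL")),
]

print(distinct_entities(docs, "CARDINAL"))  # ['seven', 'two']
```

In the notebook, the same helper could presumably be called as `distinct_entities(nlp.pipe(immune_df.head(200), batch_size=100, disable=['parser','tagger']), 'CARDINAL')`.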


### Exercise
1. Count how many times each individual entity appears
2. Create a mapping that keeps track of every pair of entities that appear in the same sentence
3. Count how many times each entity combo appears
4. Print the entity combos (using sorted) in descending order of count

In [92]:
%%time

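# imports used by this cell and the two cells below
from collections import Counter, defaultdict
from itertools import combinations

# create a defaultdict(int) called entity_relations
entity_relations = defaultdict(int)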

# create an empty list called counter_entities 
counter_entities = []

# during testing set .head() to a smaller number such as .head(1000) 
for doc in nlp.pipe(immune_df.head(1000), disable=['parser','tagger', 'ner']):

    # store the token.text of every token containing the substring 'toxic' (i.e. 'toxic' in token.text)
    # use a list comprehension
    entities = [token.text for token in doc if 'toxic' in token.text]

    # add the tokens from the current doc to counter_entities (use += to add the token.text)
    counter_entities += entities
    
    # create combinations of two terms each time multiple 'toxic' words appear
    # increment the count in entity_relations defaultdict each time a combo is repeated
    for combo in combinations(entities, 2):
        entity_relations[combo] += 1

Wall time: 199 ms


In [93]:
print(Counter(counter_entities))

Counter({'cytotoxic': 24, 'cytotoxicity': 3, 'bacteriotoxic': 1, 'toxicity': 1, 'toxic': 1, 'lymphocytotoxic': 1, 'thyrotoxic': 1, 'thyrotoxicosis': 1})
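
The printed `Counter` already lists the counts from highest to lowest, but `Counter.most_common` makes that ordering explicit and can limit the result to the top few tokens. A small sketch, using a hand-typed list as sample data rather than the real `counter_entities`:

```python
from collections import Counter

# Sample data (assumption: a hand-typed stand-in for counter_entities).
sample_tokens = [
    "cytotoxic", "toxicity", "cytotoxic",
    "cytotoxicity", "cytotoxic", "cytotoxicity",
]

counts = Counter(sample_tokens)

# most_common(n) returns the n highest counts as (token, count) tuples.
top_two = counts.most_common(2)
print(top_two)  # [('cytotoxic', 3), ('cytotoxicity', 2)]
```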


In [94]:
# view the entity pairs in descending order
sorted(entity_relations.items(), key=lambda x: x[1], reverse=True)

[(('cytotoxic', 'cytotoxic'), 3),
 (('cytotoxic', 'cytotoxicity'), 1),
 (('thyrotoxic', 'thyrotoxicosis'), 1)]
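
`combinations` yields pairs in the order the tokens appear in a document, so `('cytotoxic', 'cytotoxicity')` and `('cytotoxicity', 'cytotoxic')` would be stored under different keys. If pair order should not matter, one option is to sort each pair before counting. The sketch below uses hand-written token lists as stand-ins for the per-document `entities` lists, and the function name `count_pairs` is hypothetical.

```python
from collections import Counter
from itertools import combinations


def count_pairs(docs_tokens):
    """Count unordered co-occurring token pairs across documents."""
    pair_counts = Counter()
    for tokens in docs_tokens:
        for a, b in combinations(tokens, 2):
            # Sort the pair so (a, b) and (b, a) share one key.
            pair_counts[tuple(sorted((a, b)))] += 1
    return pair_counts


# Sample data (assumption: stand-ins for the per-document `entities` lists).
docs_tokens = [
    ["cytotoxic", "cytotoxicity"],
    ["cytotoxicity", "cytotoxic"],
    ["thyrotoxic"],  # a single token produces no pairs
]

pair_counts = count_pairs(docs_tokens)
print(pair_counts.most_common())  # [(('cytotoxic', 'cytotoxicity'), 2)]
```

Using a `Counter` here instead of `defaultdict(int)` is a design choice, not a requirement: both count the same way, but `Counter` adds `most_common` for the descending sort.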