# Filtering

One problem with the lexicons is that they contain some popular event terms, such as the words *manchester* or *kepa*.
This notebook filters both lexicons and lexical concepts (word clusters) using a systematic method.
The filter targets words that were used as tracking keywords to collect the datasets from which the terms were generated, and it is applied to both the lexicons and the lexical concepts.

In [4]:
import csv
import json
import os
import sys

sys.path.append(os.path.expanduser("~/GitHub/EvenTDT"))

meta = os.path.expanduser('~/DATA/analyses/tdt/meta') # tracking details for the datasets used to generate the lexicons

filters = os.path.expanduser('~/DATA/analyses/tdt/filters-all.json') # the filters, a bootstrapping file
new_filters = os.path.expanduser('~/DATA/analyses/tdt/filters-filtered.json') # the path where the new filters will be stored

# splits = os.path.expanduser('~/DATA/analyses/tdt/splits-80.csv') # the concepts, saved as a CSV file with one split on each line
splits = os.path.expanduser('~/DATA/analyses/tdt/concepts-15.json') # the concepts, saved as a JSON file with a list of concepts
new_splits = os.path.expanduser('~/DATA/analyses/tdt/concepts-filtered-15.csv') # the path where the new splits will be stored

The first step is to load all the tracking keywords.
It uses the keywords that were tracked during each event, which include the names of players, coaches and the stadium in addition to basic information.
The next code cell tokenizes all tracking keywords.

In [5]:
from eventdt.nlp import Tokenizer
tokenizer = Tokenizer(stem=True) # stem the tokens

all_tokens = set() # all the tokens used to collect datasets

for file in os.listdir(meta):
    with open(os.path.join(meta, file)) as f:
        metadata = json.load(f)
        keywords = metadata['event']['keywords'] # use the event keywords
        all_tokens.update(token for keyword in keywords
                                for token in tokenizer.tokenize(keyword))
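
To make the loop above concrete, the following minimal sketch runs the same collection logic on in-memory metadata. The `toy_tokenize` function is a hypothetical stand-in for EvenTDT's `Tokenizer` (it only lowercases and splits on whitespace, without stemming), `collect_tokens` is a helper name introduced here, and the sample keywords are invented for illustration.

```python
def toy_tokenize(text):
    """Hypothetical stand-in for the real tokenizer: lowercase and split on whitespace (no stemming)."""
    return text.lower().split()

def collect_tokens(metadata_files, tokenize=toy_tokenize):
    """Gather every token that appears in the event keywords of the given metadata."""
    tokens = set()
    for metadata in metadata_files:
        for keyword in metadata['event']['keywords']:
            tokens.update(tokenize(keyword))
    return tokens

# illustrative metadata, shaped like the files read above (the values are made up)
sample = [
    { 'event': { 'keywords': [ 'Man City', 'Kepa' ] } },
    { 'event': { 'keywords': [ 'Arsenal', 'Man Utd' ] } },
]

tokens = collect_tokens(sample)
print(sorted(tokens))
```

Multi-word keywords contribute each of their tokens separately, so a short, common token such as *man* can end up in the set alongside the names.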

The next code cell loads the bootstrapping output and filters the seed terms and bootstrapped terms that had also been used as tracking keywords to collect the datasets.

Note that this notebook does not remove the keywords, but prepends *EXCLUDED-\** to them.
Because the tokenizer stems and splits text, no token can ever match a prefixed term, so these keywords will never be used for filtering or splitting.

In [6]:
with open(filters) as f1, open(new_filters, 'w') as f2:
    data = json.load(f1)
    for term in data['pcmd']['seed']:
        if term in all_tokens:
            print(f"Filtering { term }")
    data['pcmd']['seed'] = [ f"EXCLUDED-{ keyword }" if keyword in all_tokens else keyword # filter seed terms
                                                     for keyword in data['pcmd']['seed'] ]
    
    for term in data['bootstrapped']:
        if term in all_tokens:
            print(f"Filtering { term }")
    data['bootstrapped'] = [ f"EXCLUDED-{ keyword }" if keyword in all_tokens else keyword # filter bootstrapped terms
                                                     for keyword in data['bootstrapped'] ]
    f2.write(json.dumps(data))

Filtering arsen
Filtering manchest
Filtering man
Filtering kepa
Filtering hold
Filtering rob
Filtering lacazett
Filtering hazard
Filtering willock
Filtering silva
Filtering foden
Filtering anthoni
Filtering mendi
Filtering bellerin
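
The marking step can also be isolated in a small helper, which makes it easier to see that terms are prefixed rather than dropped. This is a minimal sketch and not part of the notebook: `mark_excluded` is a hypothetical name, and the terms and tracking tokens below are illustrative.

```python
PREFIX = 'EXCLUDED-'

def mark_excluded(terms, tokens, prefix=PREFIX):
    """Prefix, rather than remove, every term that is also a tracking token."""
    return [ f"{ prefix }{ term }" if term in tokens else term for term in terms ]

tracking = { 'manchest', 'kepa' }               # illustrative tracking tokens
seed = [ 'goal', 'kepa', 'foul', 'manchest' ]   # illustrative seed terms

marked = mark_excluded(seed, tracking)
print(marked)

# the list keeps its length and order, and no marked term matches a tracking token
print(len(marked) == len(seed))
print(any(term in tracking for term in marked))
```

Keeping the excluded terms in place preserves the position of every other term in the list, and the prefix makes the excluded terms easy to find in the output file.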


Filtering the splits is similar, although the concepts are read from a JSON file and written to a CSV file with one concept on each line.
The next code cell processes each concept separately and filters its terms as above, by prepending *EXCLUDED-\**.

In [7]:
with open(splits) as f1, open(new_splits, 'w', newline='') as f2: # newline='' lets the csv module control line endings
    concepts = json.load(f1)['concepts']
    writer = csv.writer(f2, delimiter=',')
    for concept in concepts:
        for term in concept:
            if term in all_tokens:
                print(f"Filtering { term }")

        terms = [ term if term not in all_tokens else f"EXCLUDED-{ term }" for term in concept ] # filter the splits
        writer.writerow(terms)

Filtering manchest
Filtering kepa
Filtering arsen
Filtering man
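
As a quick check of the CSV layout, the sketch below writes a few filtered concepts to an in-memory buffer and reads them back. It assumes the same one-concept-per-row format as the cell above; `filter_concepts` is a hypothetical helper, and the concepts and tracking tokens are made up for illustration.

```python
import csv
import io

def filter_concepts(concepts, tokens, prefix='EXCLUDED-'):
    """Return the concepts with every tracking token prefixed instead of removed."""
    return [ [ f"{ prefix }{ term }" if term in tokens else term for term in concept ]
             for concept in concepts ]

concepts = [ [ 'goal', 'score' ], [ 'kepa', 'save', 'keeper' ] ]  # illustrative concepts
tracking = { 'kepa' }                                             # illustrative tracking tokens

# write one concept per row to an in-memory file instead of the real output path
buffer = io.StringIO(newline='')
csv.writer(buffer, delimiter=',').writerows(filter_concepts(concepts, tracking))

# read the rows back to confirm that each concept survives as one row
buffer.seek(0)
rows = list(csv.reader(buffer))
print(rows)
```

Rows may have different lengths because concepts contain different numbers of terms, which the `csv` module handles without padding.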
