### Filtering dataset by keywords

In [1]:
# imports
from datasets import load_dataset
import nltk
import itertools
from collections import Counter
from tqdm import tqdm
from nltk.stem.snowball import EnglishStemmer
from nltk.corpus import stopwords
import re
import random

In [2]:
# this cell downloads the dataset
# it uses the cnn_dailymail.py script from Hugging Face:
# https://huggingface.co/datasets/cnn_dailymail/tree/main
# this can take around 10 minutes, so go grab some coffee

dataset = load_dataset("cnn_dailymail.py", "3.0.0", split="train")

Reusing dataset cnn_dailymail (/Users/tppl/.cache/huggingface/datasets/cnn_dailymail/3.0.0/3.0.0/1b3c71476f6d152c31c1730e83ccb08bcf23e348233f4fcc11e182248e6bf7de)


In [3]:
# print the features and the number of data points in the dataset
print(dataset.features)
print(dataset.num_rows)

{'article': Value(dtype='string', id=None), 'highlights': Value(dtype='string', id=None), 'id': Value(dtype='string', id=None)}
287113


In [4]:
# helpers for pre-processing the data
en_stemmer = EnglishStemmer()  # stemmer for English words
nltk.download('stopwords')  # stop words, which add little value
stop_words = set(stopwords.words('english'))
# used with .match(), so it only checks that a token starts with a letter;
# this drops 'words' such as punctuation tokens
alph_string_pattern = re.compile("[a-zA-Z]")


def word_counter_text(text: str, stem=False, remove_stopwords=False):
    """
    Takes a text string as input.
    Returns a Counter object that counts all words in the text.
    """

    # split the text into a list of words
    # (sent_tokenize and word_tokenize need the NLTK 'punkt' tokenizer data)
    sents = nltk.tokenize.sent_tokenize(text)
    words = [nltk.word_tokenize(sent) for sent in sents]
    flatten_words = list(itertools.chain(*words))

    # stem the words, or only lowercase them
    if stem:
        flatten_lower_words = [en_stemmer.stem(word) for word in flatten_words]
    else:
        flatten_lower_words = [word.lower() for word in flatten_words]

    # remove stop words
    if remove_stopwords:
        flatten_lower_words = [word for word in flatten_lower_words if word not in stop_words]

    # remove tokens that do not start with a letter
    flatten_lower_words = [word for word in flatten_lower_words if alph_string_pattern.match(word)]

    return Counter(flatten_lower_words)

[nltk_data] Downloading package stopwords to /Users/tppl/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
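
Because `alph_string_pattern` is applied with `.match()`, which anchors at the start of the string, only the first character of each token is tested. A small standalone check of that behaviour, using made-up example tokens (not taken from the dataset):

```python
import re

# same pattern as in the cell above
alph_string_pattern = re.compile("[a-zA-Z]")

# made-up example tokens, chosen only to illustrate the behaviour
tokens = ["accident", "near-miss", "a1", "'s", "9/11", ".", "``"]

# .match() only looks at the start of the string,
# so a token is kept as soon as its first character is an ASCII letter
kept = [tok for tok in tokens if alph_string_pattern.match(tok)]
print(kept)  # ['accident', 'near-miss', 'a1']
```

Note that a token starting with a non-ASCII letter would also be dropped by this pattern.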


### Everything up to this point was done by Erik
From here on, we filter our dataset with a set of keywords.

### We start with the easiest task: removing all completely unrelated articles
After that, we filter further.
In [5]:
# a list of keywords devised by Coen

words_accident = ['accident','disaster','catastrophe','incident','near-miss', 'tragedy']

words_damage = ['victim','casualties','died','killed','damage','harm','hospital','hospitalized',
                'wounded','succumbed','unscathed','evacuate','rescue','first responders','ambulance','first aid']

words_specific = ['sunk','fire','derailed','collision','poisoned','burned']
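
Some entries in `words_damage` consist of two words ('first responders', 'first aid'). If the article text is later split into single-word tokens, such entries cannot be found by a plain membership test on the token set. A minimal sketch of one possible way to handle phrases, assuming whitespace-tokenized, lowercased text without stemming; the helper `count_keyword_hits` and the example sentence are hypothetical and not part of this notebook:

```python
def count_keyword_hits(tokens, keywords):
    """Count how many keywords occur in tokens, matching phrases as consecutive tokens."""
    single = {k for k in keywords if " " not in k}          # a set also removes duplicates
    phrases = [tuple(k.split()) for k in keywords if " " in k]

    token_set = set(tokens)
    hits = sum(1 for k in single if k in token_set)

    for phrase in phrases:
        n = len(phrase)
        # slide a window of length n over the token list
        if any(tuple(tokens[i:i + n]) == phrase for i in range(len(tokens) - n + 1)):
            hits += 1
    return hits


# hypothetical example sentence, already lowercased
tokens = "first responders gave first aid after the accident".split()
example_keywords = ["accident", "first responders", "first aid", "ambulance"]

token_set = set(tokens)
naive_hits = sum(1 for k in example_keywords if k in token_set)
phrase_hits = count_keyword_hits(tokens, example_keywords)
print(naive_hits, phrase_hits)  # 1 3
```

With stemming in the pipeline, each word of a phrase would have to be stemmed separately before comparing. Keywords that collapse to the same stem would also need deduplication, otherwise one occurrence counts more than once towards a threshold.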

In [6]:
def filter1(texts, num_articles, keywords, threshold):
    # note: num_articles is a list of article indices, not a count
    positive_articles = []
    negative_articles = []

    for i in tqdm(num_articles):
        # collect all unique words of the article
        article = texts[i]
        words = set(word_counter_text(article, stem=True, remove_stopwords=True))

        # count how many keywords occur in the article
        contains_keyword = 0
        for keyword in keywords:
            if keyword in words:
                contains_keyword += 1

        if contains_keyword >= threshold:
            positive_articles.append(i)
        else:
            negative_articles.append(i)

    return positive_articles, negative_articles
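
To see how the threshold works without downloading the dataset, here is a minimal sketch of the same counting logic on three made-up, already tokenized 'articles'. The function `threshold_filter` and the toy data are illustrative assumptions; stemming and stop word removal are left out.

```python
def threshold_filter(token_sets, indices, keywords, threshold):
    """Split indices by whether the article contains at least `threshold` of the keywords."""
    positive, negative = [], []
    for i in indices:
        hits = sum(1 for keyword in keywords if keyword in token_sets[i])
        (positive if hits >= threshold else negative).append(i)
    return positive, negative


# made-up token sets standing in for pre-processed articles
toy_articles = [
    {"fire", "killed", "rescue", "hospital", "night"},  # 4 keywords
    {"actor", "film", "award", "fire"},                 # 1 keyword
    {"collision", "wounded", "train"},                  # 2 keywords
]
toy_keywords = ["fire", "killed", "rescue", "hospital", "collision", "wounded"]

result_3 = threshold_filter(toy_articles, range(3), toy_keywords, threshold=3)
result_2 = threshold_filter(toy_articles, range(3), toy_keywords, threshold=2)
print(result_3)  # ([0], [1, 2])
print(result_2)  # ([0, 2], [1])
```

Lowering the threshold moves borderline articles into the positive list, which trades fewer missed articles for more false positives.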
    

In [7]:
# parameter settings
texts = dataset['article']
num_articles = list(range(2000))
keywords = words_accident + words_damage + words_specific + ['safety']
# stemmed version of the keywords
keywords = [en_stemmer.stem(w) for w in keywords]
# threshold for how many keywords an article must contain
threshold = 3

In [8]:
filter1_1 = filter1(texts, num_articles, keywords, threshold)

100%|██████████| 2000/2000 [00:43<00:00, 46.29it/s]


In [9]:
print('positive:', len(filter1_1[0]), 'negative:', len(filter1_1[1]))

positive: 482 negative: 1518


### Because there are no labels, we have to check manually whether the filtering worked.
#### The goal of this part is to remove all unsuitable articles

In [10]:
for i in range(30):
    print(dataset['highlights'][filter1_1[0][i]], '\n')

NEW: "I thought I was going to die," driver says .
Man says pickup truck was folded in half; he just has cut on face .
Driver: "I probably had a 30-, 35-foot free fall"
Minnesota bridge collapsed during rush hour Wednesday . 

Parents beam with pride, can't stop from smiling from outpouring of support .
Mom: "I was so happy I didn't know what to do"
Burn center in U.S. has offered to provide treatment for reconstructive surgeries .
Dad says, "Anything for Youssif" 

Aid workers: Violence, increased cost of living drive women to prostitution .
Group is working to raise awareness of the problem with Iraq's political leaders .
Two Iraqi mothers tell CNN they turned to prostitution to help feed their children .
"Everything is for the children," one woman says . 

Two cars loaded with gasoline and nails found abandoned in London Friday .
52 people killed on July 7, 2005 after bombs exploded on London bus, trains .
British capital wracked by violence by the IRA for years . 

NEW: President B

## Result of filter1 on positive_articles
Of the 30 articles, 11 were related to accidents and 2 were specifically about workplace accidents. So 19 of the 30 were wrongly classified as positive.
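
The manual counts above translate into a precision estimate for the positive set. A small sketch of the arithmetic, using only the counts reported in this section (the variable names are illustrative):

```python
checked = 30             # manually inspected positive articles
accident_related = 11    # judged to be about accidents
workplace_specific = 2   # judged to be about workplace accidents

precision_accident = accident_related / checked
precision_workplace = workplace_specific / checked
false_positives = checked - accident_related

print(f"precision (accident-related): {precision_accident:.1%}")    # 36.7%
print(f"precision (workplace-specific): {precision_workplace:.1%}")  # 6.7%
print(f"false positives: {false_positives} of {checked}")            # 19 of 30
```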

In [11]:
# check whether negative_articles also contains relevant articles
for i in range(30):
    print(dataset['highlights'][filter1_1[1][i]],'\n')

Harry Potter star Daniel Radcliffe gets £20M fortune as he turns 18 Monday .
Young actor says he has no plans to fritter his cash away .
Radcliffe's earnings from first five Potter films have been held in trust fund . 

Mentally ill inmates in Miami are housed on the "forgotten floor"
Judge Steven Leifman says most are there as a result of "avoidable felonies"
While CNN tours facility, patient shouts: "I am the son of the president"
Leifman says the system is unjust and he's fighting for change . 

Five small polyps found during procedure; "none worrisome," spokesman says .
President reclaims powers transferred to vice president .
Bush undergoes routine colonoscopy at Camp David . 

NEW: NFL chief, Atlanta Falcons owner critical of Michael Vick's conduct .
NFL suspends Falcons quarterback indefinitely without pay .
Vick admits funding dogfighting operation but says he did not gamble .
Vick due in federal court Monday; future in NFL remains uncertain . 

Tomas Medina Caracas was a fugit

## Result of filter1 on negative_articles
Of the 30 articles, 1 was related to accidents and 0 were specifically about workplace accidents. So filter1 works quite well for filtering out unrelated articles. So has the goal been reached?

### So far we only know about the first 30 articles. What if the articles are chosen at random? Do we still get the same result, namely that nearly all of the negative articles (29 of 30 above) are unrelated to accidents?

In [15]:
num_articles = random.sample(range(0, dataset.num_rows), 2000)
filter1_2 = filter1(texts, num_articles, keywords, threshold)

100%|██████████| 2000/2000 [00:50<00:00, 39.78it/s]
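
`random.sample` draws a different set of indices on every run, so the manual check below cannot be repeated on exactly the same articles. A minimal sketch of a reproducible draw, assuming a fixed seed is acceptable; the seed value 42 is an arbitrary choice, and the row count is the one printed earlier in this notebook:

```python
import random

num_rows = 287113   # dataset.num_rows as printed earlier
sample_size = 2000

# a dedicated Random instance with a fixed seed gives the same sample on every run
rng = random.Random(42)
sample_indices = rng.sample(range(num_rows), sample_size)

# sampling is without replacement, so all indices are distinct
print(len(set(sample_indices)) == sample_size)
print(sample_indices[:5])
```

Using a separate `random.Random` instance keeps the seed from affecting other code that relies on the global random state.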


In [16]:
for i in range(30):
    print(dataset['highlights'][filter1_2[0][i]], '\n')

AAA predicts 38.4 million Americans will travel over the holiday weekend .
AAA attributes slight increase to improved consumer confidence .
Those traveling by air probably will decline to 2.3 million, from 2.5 million last year .
The bulk of travelers will be going by car, AAA says . 

Gunman wearing a ski-mask fired at a woman narrowly missing her child .
Police linking 20 incidents to the random shooting spree .
First incident reported to Kansas City police  on March 8 .
State officials have called in the FBI in a bid to track down the gunman .
Random shooter 'causing a problem for everyone' 

The blaze began about 4:30am. Friday on the second floor of a multi-family house .
A total of seven children ranging in age from 1 to 9 lived on the second floor with their father, 60-year-old Troy Lewis .
Witnesses say they heard screaming as the fire ripped through the building .
Five children are dead, ranging in age from 19 months to 8 years, and sisters Shaca, 9, and Electra, 5, are in cri

Abu Anas Al-Liby, 50, died Friday night at a New York hospital .
Had complications stemming from a recent liver surgery .
Was on the FBI's most-wanted list with a $5 million price on his head .
Captured by US troops in the Libyan capital Tripoli in October 2013 .
Due to stand trial on January 12 over the attacks on the US embassies in Kenya and Tanzania that killed 244 people and wounded more than 5,000 . 

Scientists found they can distinguish whether memory will decline in healthy people from measuring blood flow to their brains .
Quicker detection would allow earlier treatment and maybe prevention .
Alzheimer’s and other forms of dementia affect some 800,000 Britons .
Number expected to double in a generation as population ages . 



## Result of filter1 on positive_articles
Of the 30 articles, 11 were related to accidents and 1 was specifically about workplace accidents. So 19 of the 30 were wrongly classified as positive.

In [17]:
for i in range(30):
    print(dataset['highlights'][filter1_2[1][i]], '\n')

Neil Patrick Harris is being confused with his Broadway musical character .
A fan attending an April 19 show expressed her adoration for the actor .
Her proclamation was met with profanity, as Harris was still in character .
He's had to explain that his response was part of the show . 

Xi .
Jinping took over this week as party general secretary in China's second orderly power transfer in 63 .
years .
Wife Peng Liyuan is far more famous as a syrup-voiced star of folk music .
They have one child, daughter Xi Mingze, who goes to Harvard .
She is described as studious and low key and joined a sorority . 

School staff among 900 carers and cleaners to benefit after deal with Sheffield city council . 

The cataclysmic death blasts occurred just 1.5bn years after the Big Bang .
New record beats the previous holder that was 11bn years old . 

The hedgehog triplets were born on the same day as Prince George .
In honour of the Prince they share his first names and live in a castle . 

Brad Hink

## Result of filter1 on negative_articles
Of the 30 articles, 0 were related to accidents and 0 were specifically about workplace accidents. So in this sample, everything filter1 rejected was indeed unrelated.
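
Finding 0 accident-related articles among 30 inspected negatives does not prove that the negative set contains none. As a rough sketch, and assuming the 30 inspected articles behave like a random sample of the negative set, the largest miss rate that is still compatible with such an observation at 95% confidence can be computed as follows:

```python
n_checked = 30      # inspected negative articles
confidence = 0.95

# If the true share of accident-related articles is p, the chance of seeing
# none in n_checked independent draws is (1 - p) ** n_checked.
# Solving (1 - p) ** n_checked = 1 - confidence for p gives the upper bound.
upper_bound = 1 - (1 - confidence) ** (1 / n_checked)

print(f"{upper_bound:.1%}")  # 9.5%
```

So up to roughly one in ten rejected articles could still be accident-related without contradicting this check; inspecting more articles would narrow the bound.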

# Sub-conclusion
### filter1 is very good at removing unrelated articles, but it performs less well at picking out the articles that are about accidents.

So the first goal has been reached: filter1 can reliably recognize and remove completely unrelated articles. The question now is whether it can do better at picking out the articles that are about accidents. To do this, we remove articles about war or terror on the basis of keywords.
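
The planned exclusion step could look roughly like the sketch below. It is only an illustration under assumptions: the exclusion keywords and the toy token sets are hypothetical placeholders, not the list this notebook will use, and stemming is left out.

```python
def exclude_by_keywords(token_sets, candidate_indices, exclusion_keywords, max_hits=0):
    """Keep candidates whose tokens contain at most `max_hits` exclusion keywords."""
    kept, excluded = [], []
    for i in candidate_indices:
        hits = sum(1 for keyword in exclusion_keywords if keyword in token_sets[i])
        (kept if hits <= max_hits else excluded).append(i)
    return kept, excluded


# hypothetical placeholder keywords, for illustration only
example_exclusion_keywords = {"war", "terror", "bomb"}

# made-up token sets, keyed by a made-up article index
toy_token_sets = {
    10: {"bridge", "collapse", "killed", "rescue"},
    11: {"bomb", "killed", "wounded", "terror"},
    12: {"fire", "hospital", "victim"},
}

kept, excluded = exclude_by_keywords(toy_token_sets, [10, 11, 12], example_exclusion_keywords)
print(kept, excluded)  # [10, 12] [11]
```

In this sketch the step would run only on the positive articles returned by `filter1`, so it can reduce false positives but cannot recover articles that were already rejected.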

In [None]:
indicates_war = ['']