# Data Collection Process for Team Vaccinated

## Collecting Datasets

We collected our data by directly parsing the HTML DOM of pages on Facebook.  

To find the pages, we searched specific queries on Facebook's search bar: 

####                ["stop vaccination", "anti vaccination"], ["parental advice", "child health"]
         
For each search, we followed this process:

1. Search the term.
2. Click the "Pages" tab to filter to only Page results.
3. Scroll downward until no results are left.  If laborious, you may use this script:
```javascript
var s = () => {window.scrollTo(0,document.body.scrollHeight); setTimeout(s, 500);};
s();
```
3. Open up the console and input this script:
```javascript
var parse = () => {
    return Array.from(document.getElementsByClassName('_3u1 _gli _6pe1')).map((elem) => {
        var mo = elem.getElementsByClassName('_32mo')[0];
        var name = mo.textContent;
        var href = mo.getAttribute('href');
        let likes = elem.getElementsByClassName('_pac')[0].getElementsByTagName('a')[0];
        likes = (likes) ? likes.textContent.split(' ')[0] : null;
        return {name: name, href: href, likes: likes};
    });
}
copy(parse());
```
4. The page results are now copied to your clipboard.  Open a text editor and paste the results into a file and save it

In [77]:
import pandas as pd
import numpy as np
import json
import os

datasets = ['Data/' + f for f in os.listdir('Data/') if f.startswith('Datasets - ')]
print(datasets)

['Data/Datasets - anti vaccination.json', 'Data/Datasets - child health.json', 'Data/Datasets - infant parent advice.json', 'Data/Datasets - stop vaccination.json']


In [78]:
def load_dataset(filename, query):
    with open(filename) as f:
        d = pd.DataFrame(json.loads(f.read()), columns=['name', 'likes', 'href'])
        d['query'] = query
        return d
    raise Exception("Could not open file")
def load_datasets(datasets):
    data = pd.concat([load_dataset(f, q) for f, q in zip(datasets, [n.split('.')[0].split('- ')[1] for n in datasets])])
    gb = data.groupby('name')
    base = pd.DataFrame(gb.query.apply(set).apply(list).apply(','.join))
    base['likes'] = gb.likes.last().fillna('0')
    base['href'] = gb.href.last()
    base['href_posts'] = base.href.apply(lambda x: 'https://www.facebook.com/pg/' + \
                                         x[25:].split('/')[0] + '/posts/?ref=page_internal')
    base['anti_vax'] = base['query'].apply(lambda x: 'vaccination' in x)
    def convert_num(n):
        if 'M' in n:
            return float(n[:-1]) * 1e6
        elif 'K' in n:
            return float(n[:-1]) * 1e3
        else:
            return float(n)
    base['likes_adj'] = base.likes.apply(convert_num)
    return base
data = load_datasets(datasets)
data.head()

Unnamed: 0_level_0,query,likes,href,href_posts,anti_vax,likes_adj
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
#StopBullying,stop vaccination,9.8K,https://www.facebook.com/StopBullying112/?ref=...,https://www.facebook.com/pg/StopBullying112/po...,True,9800.0
2/8 autistic meme causing anti-vaccination,anti vaccination,15,https://www.facebook.com/28-autistic-meme-caus...,https://www.facebook.com/pg/28-autistic-meme-c...,True,15.0
AAP Section On International Child Health,child health,620,https://www.facebook.com/AAPSOICH/?ref=br_rs,https://www.facebook.com/pg/AAPSOICH/posts/?re...,False,620.0
APHA Maternal & Child Health Section,child health,343,https://www.facebook.com/APHAMCHSection/?ref=b...,https://www.facebook.com/pg/APHAMCHSection/pos...,False,343.0
About Pediatrics and Parenting Advice,infant parent advice,1.5K,https://www.facebook.com/AboutPediatrics/?ref=...,https://www.facebook.com/pg/AboutPediatrics/po...,False,1500.0


# Remove unwanted pages

Given that the searches might not return results which are intended for the dataset, we must manually evaluate and filter pages. The criteria that we use is as follows:

- Page must explicitly and specifically dedicate itself to expanding, supporting, or espousing either the anti-vaccination movement or child health care.
- Page must have at least 10 posts.
- If the page specifies a location, it must be a predominately English-speaking location.
- Page must exclusively target the general topics of anti-vaccination or child health advice.  If the page specifically targets a subtopic, then it should be filtered out.
- Page must target the discussion of its topic.  Example of non-qualifier: a landing page for a company.

In [79]:
non_anti_vax_pages = {'Stop TB Partnership', 'Stop Amnesty', 'Stop Cyberbullying', 'Pro Vaccinations', 
                   'Stop the Anti-Science Movement', 'Letters to President Trump.  Please Help Us Stop Vaccine  Injury.', 
                   'STOP Abortion', 'Vaccine Ambassadors', 'Stop The Silence: Stop Child Sexual Abuse Inc.', 
                   'The Vaccines', 'Montanans for Vaccine Choice', 'The Dangers of Gardasil (HPV/Cervical Cancer Vaccine)', 
                   'Stop Clickbait - Gaming', 'Stop Abuse Campaign', 'Refutations to Anti-Vaccine Memes', 
                   'Informed Parents of Vaccinated Children', 'HIV Vaccine Trials Network', 'STOP the BLEED', 
                   'SPOT - Stop Pet Overpopulation Today', 'StopBullying.Gov', 'We Love GMOs and Vaccines', 
                   'Voices for Vaccines', 'Never Stop Learning', 'Anti-Anti-Vaccine Campaign', 'Stop Deforestation', 
                   'Stop Anti-Vaccine Misinformation', 'Stop Mesothelioma', 'The Pro-Vaccine Movement.', 
                   'Stop Chasing Pain', 'Vaccinate Your Family', 'Vaccination Information Network - India', 
                   'Northern Rivers Vaccination Supporters', 'Stop Pollution', 'Stop the Thyroid Madness', 
                   '#StopBullying', 'Sallie O. Elkordy for Mayor, Vaccine Free NYC', 'The Pet Stop Mobile Vaccine Clinic', 
                   'DA Pet Stop', 'TRUMENBA (Meningococcal Group B Vaccine)', 'Vaccinations for Dogs - The Alternatives', 
                   'Vaccines From Anti-Vaxxers', 'United Against Vaccination', 'My Vaccine Lawyer', 
                   "My child's vaccine reaction", 'Stop the Over-Vaccination of Pets', 'Vaccines', 
                   "Briar's journey after HPV vaccine injury", 'Lowcountry Pet Vaccine Clinic', 'Stop Bullying: Speak Up', 
                   'Stop Racism', 'Stop The Overpopulation of Pets', 
                   'Stop the Australian (Anti)Vaccination Network', 'Stop HPV - stop livmoderhalskræft', 
                   'Stop Pneumonia', 'Gavi, the Vaccine Alliance', 'Stop Aging Now', 'Stop Bullying', 
                   'Stop Street Harassment', 'Being Liberal', 'March for Science', 
                   '2/8 autistic meme causing anti-vaccination', 'Anti Anti Vaccination', 'Anti Records', 
                   'Anti Vaccination Idiots', 'Anti Vaccination Memes', 'Anti Vaccins', 'Anti Vaxxed : Catching The Autism',
                   'Anti anti-vaccines', 'Anti-Bullying', 'Anti vaccination intactivism', 'Anti-Vaccination Dum-Dums',
                   'Anti-Imperialist Parentiposting', 'Anti-Cancer Mom', 'Anti-Pit Bull Memes', 
                   'Anti-Trump / Pro-America Memes', 'Anti-Vaxxer has NO idea about vaccine science', 
                   'Autistic People Against the Anti-Vaccination Movement', 'Autism Awareness Australia', 'Anti-Vaxx Sins',
                   'Detox, AntiVax and Woo Insanity', 'Debunking Anti-Vaxxers', 'Detox, AntiVax and Woo Insanity',
                   'Head to Toe Anti-Rabies Vaccination Clinic', 'I fucking love science', 
                   'Meme Anti Vaccination Supporter Groups', 'Peace, Love, and Threats from anti-vaxxers', 
                   'Pro-Vaccination', 'Protecting Children and Communities through Vaccination - Global Network', 
                   'Pro-Vaccine Memes in the Style of Anti-Vaccine Memes', 'The Anti-Choice Project',
                   'Province-wide Annual Anti-Rabies Mass Vaccination for CY: 2019', 'Scary Mommy', 
                   'Start Mandatory Vaccination', 'Stop à la propagande anti-vaccins', 'The Onion', 
                   'Things anti-vaxers say', 'Vaccin Anti-Sioniste', 'Vaccination Anti HPV', 'Vaccination Myths Debunked',
                   'Vaxxed: From Cover-Up to Catastrophe', 'Non au vaccin', 'The Anti-Socialist', 
                   'Tigyaing Free Anti Rabies Vaccination Group For Dogs', 'The Anti-Vaccine Community Exposed',
                   'Vaccination against the Anti-vaccination virus (vaavv / WOW)', 'Refutations to Anti-Nuclear Memes',
                   'RJC Anti-Rabies Vaccination Center & Wellness Hub', 'TravelDoc Vaccination Clinic',
                   'Stop à la propagande anti-vaccins', 'StopBullying.Gov', 'GMAOQV', 
                   'Conscience collective (anti-vaccin H1N1)', 'Anti Mandatory Vaccination', 
                   'Dr. Tenpenny on Vaccines and Current Events', 'Anti-Vaccination'}
unwanted_normal_pages = {'Really Bad Parenting Advice', 'Baby Sleep Advice', 'Car Seat Safety Advice for Parents.'
                         'Advice for parents of children with Absence Seizures', 'Child Abuse Prevention Association',
                         'Center for Leadership in Maternal and Child Health University of Minnesota', 
                         'Baseball Parenting', 'Brazil Child Health', 'Daily Health Tips', 'Duke Health', 
                         'Flower Child (Dallas)', 'I love unsolicited parenting advice', 'Health24', 'Just a Step Parent?',
                         'Parenting Advice From A Non-Parent', 'Parents advice for kids', 'Just a Step Parent?',
                         'History of Maternal-Child Health', 'IACAPAP Textbook of Child and Adolescent Mental Health',
                         'IHA Child Health - WestArbor (4350 Jackson Rd, Suite 100, Ann Arbor, MI)',
                         'International Meeting on Indigenous Child Health 2019', 'Leaving Cert Parent Advice',
                         'Maternal Child Health Nursing Academy', 'NBC Parent Toolkit', 'Blissful Parenting',
                         'Parental Alienation Support: Professional Resources & References', '22:6 Parenting',
                         'Natural Birth, Midwifery and Maternal Child Health', 'Parent Quotes', 'HuffPost Parents',
                         'Muslim Parents - Education Advice', 'Mental Health on The Mighty', 'Parent Society',
                         'Parenting Advice for Foster Carers and Adopters - PAFCA', 'Autism Parenting Magazine',
                         'American Academy of Child & Adolescent Psychiatry', 'Terrible Parenting Advice',
                         'Parents Rule!', 'Parenting Works', 'Lesbian Love & Advice, LLC', 
                         'Journal of Paediatrics and Child Health', 'Parenting Children With Behavioral Problems',
                         'Parents Quotes', 'Parents of children on Dla Advice and support', 'Health+',
                         'Parents with kids with ADHD, ADD or ODD', 'Pet Parenting Advice', 'Your Teen for Parents',
                         'Soccer Parent Advice' 'Special Education Parental Advice Community', 
                         'SpecialKids Child Health & Development Clinic', 'eParent', 'Ulfulu Child', 'Ufulu Child'
                         'Step Parent & Non-Custodial Parent Advice, Support, & Legal Resources', 
                         "Step Parent's", 'Step-Parent Advice & Support', 'Step-Parents Place', 'Teen Parent Talk/Advice'}
banned_terms = {'mixed babies', 'teen', 'adolescent', 'mental', 'seizures',  'single', 'therapy', 'ltd', 'company',
                'institute', 'bad', 'circumcision', 'step', 'stories', 'autism', 'center', 'foster', 'natural', 'special',
                'spasms', 'brown', 'peaceful', 'college', 'program', 'section', 'organization', 'foundation', 'gender',
                'worldwide', 'myanmar', 'ufulu', 'collaboratory', 'paediatrics', 'kernicterus'}
def remove_unwanted_pages(df):
    df = df.loc[(~df.index.isin(non_anti_vax_pages)) & (~df.index.isin(unwanted_normal_pages))]
    def has_banned_terms(x):
        x = x.lower()
        for term in banned_terms:
            if term in x:
                return True
        return False
    return df[[not has_banned_terms(x) for x in df.index]]
data = remove_unwanted_pages(data)
print(data['query'].value_counts())

infant parent advice                 39
child health                         36
stop vaccination                     28
anti vaccination                     18
anti vaccination,stop vaccination    14
infant parent advice,child health     1
Name: query, dtype: int64


# Parsing each page

To parse each Facebook page, we follow a set methodology:

1. Head to the Posts section of the Facebook page you want to parse.
	Then, open up your browser's console (Chrome: Ctrl + Shift + J).

2. Paste the following code into the console and press Enter:

***
```javascript
try{document.getElementById("pagelet_bluebar").remove()}catch(e){}try{document.getElementById("pagelet_sidebar").remove()}catch(e){}try{document.getElementById("uiContextualLayerParent").style.display="none"}catch(e){}try{document.getElementsByClassName("_1qks")[0].remove()}catch(e){}try{document.getElementsByClassName("_1pfm")[0].remove()}catch(e){}try{document.getElementsByClassName("uiScrollableAreaContent")[0].remove()}catch(e){}var prog=document.querySelector('div[aria-label="Facebook"]').lastChild.firstChild;prog.style.fontSize="4.5em",prog.style.textShadow="1px 1px 4px black";var random_fun=()=>"#"+((1<<24)*Math.random()|0).toString(16),updated_count={posts:0,times_counted:0},update=()=>{let e=document.getElementsByClassName("userContentWrapper").length;if(e==updated_count.posts){if(updated_count.times_counted++,updated_count.times_counted>=15)return console.log("Tried scrolling 15 times, but no update has occurred."),prog.textContent="Loaded "+e+" posts. Right-click and Save!",!0}else updated_count.posts=e,updated_count.times_counted=0;return prog.textContent="Loaded "+e+" posts",prog.style.color=random_fun(),!1},hide=()=>Array.from(document.getElementsByClassName("userContentWrapper")).forEach(e=>e.style.display="none"),scroll=()=>window.scrollTo(0,document.body.scrollHeight),complete=()=>{updated_count={posts:0,times_counted:0},Array.from(document.getElementsByClassName("see_more_link_inner")).forEach(e=>e.click()),console.log("Done.")},get_abbr_year=e=>e?new Date(e.title).getFullYear():new Date,scroll_till_year=e=>{let t=Array.from(document.getElementsByClassName("timestampContent")).slice(-10).map(e=>e.parentElement).filter(e=>!e.classList.contains("livetimestamp")).map(get_abbr_year).filter(e=>!isNaN(e)),o=t.length?Math.max(...t):2019,l=Math.min(...t);if(o<=e||update())return setTimeout(complete,2e3);o!=l?console.log("Possible last seen:",o):console.log("Last seen:",o),scroll(),setTimeout(()=>{scroll_till_year(e)},1e3*(Math.random()+1)),hide()},scroll_i_times=e=>{if(0==e||update())return setTimeout(complete,2e3);e%10==0&&console.log("Scrolling",e,"times."),scroll(),setTimeout(()=>{scroll_i_times(e-1)},1e3*(Math.random()+1)),hide()};clear();
```
***
The above code is the following, in a semi-readable format:
```javascript
try { document.getElementById('pagelet_bluebar').remove(); } catch (err) {}
try { document.getElementById('pagelet_sidebar').remove(); } catch (err) {}
try { document.getElementById('uiContextualLayerParent').style.display = 'none'; } catch (err) {}
try { document.getElementsByClassName('_1qks')[0].remove(); } catch (err) {}
try { document.getElementsByClassName('_1pfm')[0].remove(); } catch (err) {}
try { document.getElementsByClassName('uiScrollableAreaContent')[0].remove(); } catch (err) {}
var prog = document.querySelector('div[aria-label="Facebook"]').lastChild.firstChild;
prog.style.fontSize = "4.5em";
prog.style.textShadow = "1px 1px 4px black";
var random_fun = () => "#"+((1<<24)*Math.random()|0).toString(16);
var updated_count = {posts: 0, times_counted: 0};
var update = () => {
    let posts = document.getElementsByClassName('userContentWrapper').length;
    if (posts == updated_count.posts) {
        updated_count.times_counted++;
        if (updated_count.times_counted >= 15) {
            console.log("Tried scrolling 15 times, but no update has occurred.");
            prog.textContent = "Loaded " + posts + " posts. Right-click and Save!"; 
            return true;
        }
    }
    else {
        updated_count.posts = posts;
        updated_count.times_counted = 0;
    }
    prog.textContent = "Loaded " + posts + " posts"; 
    prog.style.color = random_fun()
    return false;
};
var hide = () => Array.from(document.getElementsByClassName('userContentWrapper'))
                            .forEach((elem) => elem.style.display = 'none');
var scroll = (() => window.scrollTo(0, document.body.scrollHeight));
var complete = (() => {
    updated_count = {posts: 0, times_counted: 0};
    Array.from(document.getElementsByClassName("see_more_link_inner")).forEach(
        e => e.click());
    console.log("Done.");
});
var get_abbr_year = (e => e ? new Date(e.title).getFullYear() : new Date);
var scroll_till_year = (year => {
    let found_years = Array.from(document.getElementsByClassName("timestampContent")).slice(-10)
                .map(e => e.parentElement).filter(e => !e.classList.contains("livetimestamp"))
                .map(get_abbr_year).filter(e => !isNaN(e));
    let max_found_year = (found_years.length) ? Math.max(...found_years) : 2019;
    let min_found_year = Math.min(...found_years);
    if (max_found_year <= year || update()) return setTimeout(complete, 2000);
    else if (max_found_year != min_found_year) console.log("Possible last seen:", max_found_year);
    else console.log("Last seen:", max_found_year);
    scroll();
    setTimeout(() => {
        scroll_till_year(year)
    }, 1000 * (Math.random() + 1));
    hide();
});
var scroll_i_times = (i => {
    if (i == 0 || update()) return setTimeout(complete, 2000);
    else if (i % 10 == 0) console.log("Scrolling", i, "times.");
    scroll();
    setTimeout(() => {
        scroll_i_times(i - 1)
    }, 1000 * (Math.random() + 1));
    hide();
});
clear();
```
***

3. You can now scroll for a certain number of times or until you
	reach a certain year using:
    ```javascript
	scroll_i_times(100); //Tries to scroll 100 times
	scroll_till_year(2015); //Scrolls until bottom post from 2015
    ```
4. Right click the page and click "Save As".


## Splitting up pages amongst group members

We use this section to divide the download workload for group members

In [84]:
alon, alec, alec_2, phillip = np.array_split(np.random.permutation(data.index), 4)
alec += alec_2

In [101]:
from IPython.display import display, HTML
os.mkdir('Work/')
for name, work in zip(['alon', 'alec', 'phillip'], [alon, alec, phillip]):
    print(name[0].upper() + name[1:] + ':')
    print(len(work))
    data.loc[work, ['href', 'href_posts', 'anti_vax']].to_csv('Work/' + name + '.csv')

Alon:
34
Alec:
34
Phillip:
34
