# Data Collection Process for Team Vaccinated

## Collecting Datasets

We collected our data by directly parsing the HTML DOM of pages on Facebook.  

To find the pages, we searched specific queries on Facebook's search bar: 

####                ["stop vaccination", "anti vaccination"], ["parental advice", "child health"]
         
For each search, we followed this process:

1. Search the term.
2. Click the "Pages" tab to filter to only Page results.
3. Scroll downward until no results are left.  If laborious, you may use this script:
```javascript
var s = () => {window.scrollTo(0,document.body.scrollHeight); setTimeout(s, 500);};
s();
```
3. Open up the console and input this script:
```javascript
var parse = () => {
    return Array.from(document.getElementsByClassName('_3u1 _gli _6pe1')).map((elem) => {
        var mo = elem.getElementsByClassName('_32mo')[0];
        var name = mo.textContent;
        var href = mo.getAttribute('href');
        let likes = elem.getElementsByClassName('_pac')[0].getElementsByTagName('a')[0];
        likes = (likes) ? likes.textContent.split(' ')[0] : null;
        return {name: name, href: href, likes: likes};
    });
}
copy(parse());
```
4. The page results are now copied to your clipboard.  Open a text editor and paste the results into a file and save it

In [1]:
from tqdm import tqdm_notebook as tqdm
from fbdp import FBDesktopParser
import pandas as pd
import numpy as np
import json
import os

datasets = ['Data/' + f for f in os.listdir('Data/') if f.startswith('Datasets - ')]
print(datasets)

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Alec\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Alec\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\Alec\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Alec\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


['Data/Datasets - anti vaccination.json', 'Data/Datasets - child health.json', 'Data/Datasets - infant parent advice.json', 'Data/Datasets - stop vaccination.json']


In [2]:
def load_dataset(filename, query):
    with open(filename) as f:
        d = pd.DataFrame(json.loads(f.read()), columns=['name', 'likes', 'href'])
        d['query'] = query
        return d
    raise Exception("Could not open file")
def load_datasets(datasets):
    data = pd.concat([load_dataset(f, q) for f, q in zip(
        datasets, [n.split('.')[0].split('- ')[1] for n in datasets])])
    gb = data.groupby('name')
    base = pd.DataFrame(gb.query.apply(set).apply(list).apply(','.join))
    base['likes'] = gb.likes.last().fillna('0')
    base['href'] = gb.href.last()
    base['href_posts'] = base.href.apply(lambda x: 'https://www.facebook.com/pg/' + \
                                         x[25:].split('/')[0] + '/posts/?ref=page_internal')
    base['anti_vax'] = base['query'].apply(lambda x: 'vaccination' in x)
    def convert_num(n):
        if 'M' in n:
            return float(n[:-1]) * 1e6
        elif 'K' in n:
            return float(n[:-1]) * 1e3
        else:
            return float(n)
    base['likes_adj'] = base.likes.apply(convert_num)
    return base
data = load_datasets(datasets)
data.head()

Unnamed: 0_level_0,query,likes,href,href_posts,anti_vax,likes_adj
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
#StopBullying,stop vaccination,9.8K,https://www.facebook.com/StopBullying112/?ref=...,https://www.facebook.com/pg/StopBullying112/po...,True,9800.0
2/8 autistic meme causing anti-vaccination,anti vaccination,15,https://www.facebook.com/28-autistic-meme-caus...,https://www.facebook.com/pg/28-autistic-meme-c...,True,15.0
AAP Section On International Child Health,child health,620,https://www.facebook.com/AAPSOICH/?ref=br_rs,https://www.facebook.com/pg/AAPSOICH/posts/?re...,False,620.0
APHA Maternal & Child Health Section,child health,343,https://www.facebook.com/APHAMCHSection/?ref=b...,https://www.facebook.com/pg/APHAMCHSection/pos...,False,343.0
About Pediatrics and Parenting Advice,infant parent advice,1.5K,https://www.facebook.com/AboutPediatrics/?ref=...,https://www.facebook.com/pg/AboutPediatrics/po...,False,1500.0


# Remove unwanted pages

Given that the searches might not return results which are intended for the dataset, we must manually evaluate and filter pages. The criteria that we use is as follows:

- Page must explicitly and specifically dedicate itself to expanding, supporting, or espousing either the anti-vaccination movement or child health care.
- Page must have at least 10 posts.
- If the page specifies a location, it must be a predominately English-speaking location.
- Page must exclusively target the general topics of anti-vaccination or child health advice.  If the page specifically targets a subtopic, then it should be filtered out.
- Page must target the discussion of its topic.  Example of non-qualifier: a landing page for a company.

In [3]:
import json
stop_pages = json.loads(open('Data/stop_pages.json').read())
def remove_unwanted_pages(df):
    df = df.loc[(~df.index.isin(stop_pages['non_anti_vax_pages'])) & \
                (~df.index.isin(stop_pages['unwanted_normal_pages']))]
    def has_banned_terms(x):
        x = x.lower()
        for term in stop_pages['banned_terms']:
            if term in x:
                return True
        return False
    return df[[not has_banned_terms(x) for x in df.index]]
data = remove_unwanted_pages(data)
print(data['query'].value_counts())

child health                         34
stop vaccination                     29
infant parent advice                 27
anti vaccination                     15
stop vaccination,anti vaccination    14
child health,infant parent advice     1
Name: query, dtype: int64


# Parsing each page

To parse each Facebook page, we follow a set methodology:

1. Head to the Posts section of the Facebook page you want to parse.
	Then, open up your browser's console (Chrome: Ctrl + Shift + J).

2. Paste the following code into the console and press Enter:

***
```javascript
try{document.getElementById("pagelet_bluebar").remove()}catch(e){}try{document.getElementById("pagelet_sidebar").remove()}catch(e){}try{document.getElementById("uiContextualLayerParent").style.display="none"}catch(e){}try{document.getElementsByClassName("_1qks")[0].remove()}catch(e){}try{document.getElementsByClassName("_1pfm")[0].remove()}catch(e){}try{document.getElementsByClassName("uiScrollableAreaContent")[0].remove()}catch(e){}var prog=document.querySelector('div[role="contentinfo"]');prog.style="position: fixed;top: 50%;right: 10%;z-index: 999;width: 200px;font-size:4.5em;text-shadow:1px 1px 4px black;";try{prog.querySelector(".fsm").remove()}catch(e){}prog.textContent="";var random_fun=()=>"#"+((1<<24)*Math.random()|0).toString(16),updated_count={posts:0,times_counted:0},update=()=>{let e=document.getElementsByClassName("userContentWrapper").length;if(e==updated_count.posts){if(updated_count.times_counted++,updated_count.times_counted>=15)return console.log("Tried scrolling 15 times, but no update has occurred."),prog.textContent="Loaded "+e+" posts. Right-click and Save!",!0}else updated_count.posts=e,updated_count.times_counted=0;return prog.textContent="Loaded "+e+" posts",prog.style.color=random_fun(),!1},hide=()=>{Array.from(document.querySelectorAll(".userContentWrapper:not(.done)")).forEach(e=>{e.style.display="none",e.classList.add("done")})},scroll=()=>{window.scrollTo(0,document.body.scrollHeight);let e=document.querySelector(".mhs");e&&e.click()},complete=()=>{updated_count={posts:0,times_counted:0},Array.from(document.getElementsByClassName("see_more_link_inner")).forEach(e=>e.click()),console.log("See More's clicked. You may now save. (Done)")},get_abbr_year=e=>e?new Date(e.title).getFullYear():new Date,oversized_count=()=>!!(updated_count&&updated_count.posts>=1500)&&(complete(),console.log("Your page has become too big!  Save the page now as one chunk (name it as such) then call 'next()'."),!0),next=()=>{Array.from(document.getElementsByClassName("_4-u2")).forEach(e=>e.remove());let e=document.querySelector(".mhs");e&&e.click(),console.log("Page Cleared. You may now call scroll_till_year or scroll_i_times to begin the next chunk."),scroll()},scroll_till_year=e=>{let t=Array.from(document.getElementsByClassName("timestampContent")).slice(-10).map(e=>e.parentElement).filter(e=>!e.classList.contains("livetimestamp")).map(get_abbr_year).filter(e=>!isNaN(e)),o=t.length?Math.max(...t):2019,l=Math.min(...t);if(!oversized_count()){if(o<=e||update())return setTimeout(complete,2e3);o!=l?console.log("Possible last seen:",o):console.log("Last seen:",o),scroll(),setTimeout(()=>{scroll_till_year(e)},1e3*(Math.random()+1)),hide()}},scroll_i_times=e=>{if(!oversized_count()){if(0==e||update())return setTimeout(complete,2e3);e%10==0&&console.log("Scrolling",e,"times."),scroll(),setTimeout(()=>{scroll_i_times(e-1)},1e3*(Math.random()+1)),hide()}};clear(),scroll_till_year(2013);
```
***
The above code is the following, in a semi-readable format:
```javascript
try { document.getElementById('pagelet_bluebar').remove(); } catch (err) {}
try { document.getElementById('pagelet_sidebar').remove(); } catch (err) {}
try { document.getElementById('uiContextualLayerParent').style.display = 'none'; } catch (err) {}
try { document.getElementsByClassName('_1qks')[0].remove(); } catch (err) {}
try { document.getElementsByClassName('_1pfm')[0].remove(); } catch (err) {}
try { document.getElementsByClassName('uiScrollableAreaContent')[0].remove(); } catch (err) {}
var prog = document.querySelector('div[role="contentinfo"]');
prog.style = "position: fixed;top: 50%;right: 10%;z-index: 999;width: 200px;font-size:4.5em;text-shadow:1px 1px 4px black;"
try { prog.querySelector('.fsm').remove(); } catch (err) {}
prog.textContent = '';
var random_fun = () => "#"+((1<<24)*Math.random()|0).toString(16);
var updated_count = {posts: 0, times_counted: 0};
var update = () => {
    let posts = document.getElementsByClassName('userContentWrapper').length;
    if (posts == updated_count.posts) {
        updated_count.times_counted++;
        if (updated_count.times_counted >= 15) {
            console.log("Tried scrolling 15 times, but no update has occurred.");
            prog.textContent = "Loaded " + posts + " posts. Right-click and Save!"; 
            return true;
        }
    }
    else {
        updated_count.posts = posts;
        updated_count.times_counted = 0;
    }
    prog.textContent = "Loaded " + posts + " posts"; 
    prog.style.color = random_fun()
    return false;
};
var hide = () => {
    Array.from(document.querySelectorAll('.userContentWrapper:not(.done)')).forEach((elem) => {
        elem.style.display = 'none';
        elem.classList.add('done');
    });
}
var scroll = (() => {
    window.scrollTo(0, document.body.scrollHeight);
    let see_more = document.querySelector('.mhs');
    if (see_more) see_more.click();
});
var complete = (() => {
    updated_count = {posts: 0, times_counted: 0};
    Array.from(document.getElementsByClassName("see_more_link_inner")).forEach(
        e => e.click());
    console.log("See More's clicked. You may now save. (Done)");
});
var get_abbr_year = (e => e ? new Date(e.title).getFullYear() : new Date);
var oversized_count = () => {
    if (updated_count && updated_count.posts >= 1200) {
        complete();
        console.log("Your page has become too big!  Save the page now as one chunk (name it as such) then call 'next()'.");
        return true;
    }
    return false;
};
var next = () => {
    Array.from(document.getElementsByClassName('_4-u2')).forEach(elem => elem.remove());
    let see_more = document.querySelector('.mhs');
    if (see_more) see_more.click();
    console.log("Page Cleared. You may now call scroll_till_year or scroll_i_times to begin the next chunk.");
    scroll();
};
var scroll_till_year = (year => {
    let found_years = Array.from(document.getElementsByClassName("timestampContent")).slice(-10)
                .map(e => e.parentElement).filter(e => !e.classList.contains("livetimestamp"))
                .map(get_abbr_year).filter(e => !isNaN(e));
    let max_found_year = (found_years.length) ? Math.max(...found_years) : 2019;
    let min_found_year = Math.min(...found_years);
    if (oversized_count()) return;
    else if (max_found_year <= year || update()) return setTimeout(complete, 2000);
    else if (max_found_year != min_found_year) console.log("Possible last seen:", max_found_year);
    else console.log("Last seen:", max_found_year);
    scroll();
    setTimeout(() => {
        scroll_till_year(year)
    }, 1000 * (Math.random() + 1));
    hide();
});
var scroll_i_times = (i => {
    if (oversized_count()) return;
    else if (i == 0 || update()) return setTimeout(complete, 2000);
    else if (i % 10 == 0) console.log("Scrolling", i, "times.");
    scroll();
    setTimeout(() => {
        scroll_i_times(i - 1)
    }, 1000 * (Math.random() + 1));
    hide();
});
clear();
scroll_till_year(2013);
```
***

3. You can now scroll for a certain number of times or until you
	reach a certain year using:
```javascript
scroll_i_times(100); //Tries to scroll 100 times
scroll_till_year(2015); //Scrolls until bottom post from 2015
```
4. Right click the page and click "Save As".


## Validating Loaded Files

We use this section to validate the HTML data files we downloaded for completeness and accuracy.

In [4]:
data.index = data.index.str.replace('/', '_').str.replace(':', '_').str.replace('|', '_').str.replace('?', '_')
found_files = set(['-'.join(page.split('-')[:-1]).strip() for page in os.listdir('Data/') if page.endswith('sts.html')])
for file in found_files:
    assert not file.startswith('('), file + " Should not start with (#)"
data['Found'] = [x in found_files for x in data.index]
#Assert no pages have not been collected
assert data.Found.all(), "All files should be downloaded"
data.to_csv('Data_Clean/pages.csv')

In [5]:
#Ensure that all pages can be parsed (Can skip if already completed)
validate = False
for page_name in tqdm(data.index):
    if not validate:
        continue
    filename = 'Data/' + page_name + ' - Posts.html'
    parser = FBDesktopParser(filename)
    parser.parse_posts(limit=10)
    assert parser.posts.shape[0] == 10, page_name + " doesn't have at least 10 posts"

HBox(children=(IntProgress(value=0, max=120), HTML(value='')))




In [6]:
#Generate parser and validate data for given filename
def generate_page_posts(file, verbose=False):
    parser = FBDesktopParser(file)
    #Assert parser read posts from file
    assert parser.parse_posts().shape[0] >= 10, file + ', ' + str(parser.posts.shape[0])
    parser.posts = parser.posts.loc[parser.posts.timestamp > '2013-12-31']
    parser.posts = parser.posts.iloc[:1500]
    see_more = parser.posts.text.apply(lambda x: 'See More' in x).sum()
    assert see_more < 5, "'See More' should not appear multiple times"
    if verbose:
        print(file, 'Posts: ', parser.posts.shape[0], '- Number of See More\'s:', see_more)
    parser.extract_features(bag_of_words=False, lemmatize=True)
    return parser.posts

In [7]:
#Generate all parsers
parsers = []
for page_name in tqdm(data.index, smoothing=0):
    filename = 'Data/' + page_name + ' - Posts.html'
    if not os.path.exists(filename):
        raise Exception("File Not Found: " + page_name)
    try:
        parser = generate_page_posts(filename, verbose=False)
        parser['page_name'] = page_name
        parser['anti_vax'] = data.loc[page_name].anti_vax
        parsers.append(parser)
    except Exception as e:
        print(page_name, e)

HBox(children=(IntProgress(value=0, max=120), HTML(value='')))

Data/About Pediatrics and Parenting Advice - Posts.html Posts:  1129 - Number of See More's: 0
Data/Adult and Child Health - Posts.html Posts:  816 - Number of See More's: 0
Data/Advice4Parenting - Posts.html Posts:  8 - Number of See More's: 0
Data/Anti Vaccination Saskatoon - Posts.html Posts:  1500 - Number of See More's: 0
Data/Anti Vaccinations - Posts.html Posts:  232 - Number of See More's: 0
Data/Anti-Vaccination Australia - Posts.html Posts:  943 - Number of See More's: 0
Data/Anti-Vaccination Movement UK - Posts.html Posts:  160 - Number of See More's: 0
Data/Anti-Vaccine Choice MA - Posts.html Posts:  1500 - Number of See More's: 0
Data/Are Vaccines Safe - Posts.html Posts:  72 - Number of See More's: 0
Data/Association of Teachers of Maternal and Child Health - ATMCH - Posts.html Posts:  88 - Number of See More's: 0
Data/Assuring Better Child Health & Development - Posts.html Posts:  360 - Number of See More's: 0
Data/Australian Vaccination-risks Network Inc. - AVN - Posts.

In [8]:
#Combine all pages into one and save as a checkpoint
cleaned_df = pd.concat(parsers).reset_index(drop=True)
if not os.path.exists('Data_Clean'):
    os.makedirs('Data_Clean')
cleaned_df.to_csv('Data_Clean/posts.csv')
cleaned_df.to_json('Data_Clean/posts_json.json')
cleaned_df.head()

Unnamed: 0,article_host,article_name,article_subtitle,hashtags,img-label,img_src,linked_profiles,links,text,timestamp,...,percent_questionms,num_equals,percent_equals,num_dollars,percent_dollars,sentiment,readability,ttr,page_name,anti_vax
0,vaxopedia.org,Is the CDC Pushing Vaccines Because a Batch of...,Guess when the latest batches of MMR expire?,[],No photo description available.,./About Pediatrics and Parenting Advice - Post...,[About Pediatrics and Parenting Advice],[],The latest conspiracy theory is that MMR vacci...,2019-03-31 16:29:00,...,0.03125,0,0.0,0,0.0,"{'neg': 0.108, 'neu': 0.892, 'pos': 0.0, 'comp...","[0.0, 11.32, 6.6]",0.866667,About Pediatrics and Parenting Advice,False
1,keepkidshealthy.com,The New Vaccine Surveillance Network Report on...,Anyone who has been following the outbreaks of...,[],Sadio Mane has revealed he hates watching Man ...,./About Pediatrics and Parenting Advice - Post...,[About Pediatrics and Parenting Advice],[],The New Vaccine Surveillance Network Report on...,2019-03-28 13:07:00,...,0.0,0,0.0,0,0.0,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...","[0.0, 12.0, 9.6]",1.0,About Pediatrics and Parenting Advice,False
2,sccgov.org,Public Health Department Warns of Possible Mea...,The County of Santa Clara Public Health Depart...,[measles],,,[About Pediatrics and Parenting Advice],[],"Someone with in Santa Clara County, #Californ...",2019-03-27 14:36:00,...,0.0,0,0.0,0,0.0,"{'neg': 0.051, 'neu': 0.949, 'pos': 0.0, 'comp...","[0.0, 16.4, 15.4]",0.809524,About Pediatrics and Parenting Advice,False
3,vaxopedia.org,News on the Latest Measles Outbreaks of 2019,Get vaccinated and stop the measles outbreaks.,[],Image may contain: one or more people and peop...,./About Pediatrics and Parenting Advice - Post...,[About Pediatrics and Parenting Advice],[],"There are 33 new measles cases in Brooklyn, br...",2019-03-27 14:14:00,...,0.0,0,0.0,0,0.0,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...","[0.0, 5.8, 7.8]",0.761905,About Pediatrics and Parenting Advice,False
4,vaxopedia.org,The CDC Vaccine Price List Conspiracy,Did you believe this one?,[],No photo description available.,./About Pediatrics and Parenting Advice - Post...,[About Pediatrics and Parenting Advice],[],It took less a few minutes to debunk the lates...,2019-03-27 13:32:00,...,0.045455,0,0.0,0,0.0,"{'neg': 0.159, 'neu': 0.841, 'pos': 0.0, 'comp...","[0.0, 8.0, 6.0]",1.0,About Pediatrics and Parenting Advice,False


In [9]:
#Remove page from its own linked_profiles (faulty link)
names = remove_unwanted_pages(load_datasets(['Data/' + f for f in os.listdir('Data/') if f.startswith('Datasets - ')]))
names['page_name_adjusted'] = names.index.str.replace('/', '_').str.replace(':', '_').str.replace(
                                        '|', '_').str.replace('?', '_')
cleaned_df = cleaned_df.join(names[['page_name_adjusted']], on='page_name')
def check_bad_links(row):
    if len(row.linked_profiles) > 0:
        if row.linked_profiles is not None and row.page_name in row.linked_profiles:
            return row.linked_profiles.remove(row.page_name)
        if row.linked_profiles is not None and row.page_name_adjusted in row.linked_profiles:
            return row.linked_profiles.remove(row.page_name_adjusted) 
    return row.linked_profiles
cleaned_df['linked_profiles'] = cleaned_df.apply(check_bad_links, axis=1)

In [10]:
#Change objects/lists into seperate fields
for sen in ['neg', 'neu', 'pos', 'compound']:
    cleaned_df['sentiment_' + sen] = cleaned_df.sentiment.apply(lambda x: x[sen])
cleaned_df.drop('sentiment', axis=1, inplace=True)
for i, r in enumerate(['smog_index', 'gunning_fog', 'flesch_kincaid_grade']):
    cleaned_df['readability_' + r] = cleaned_df.readability.apply(lambda x: x[i])
cleaned_df.drop('readability', axis=1, inplace=True)
cleaned_df.head()

Unnamed: 0,article_host,article_name,article_subtitle,hashtags,img-label,img_src,linked_profiles,links,text,timestamp,...,page_name,anti_vax,page_name_adjusted,sentiment_neg,sentiment_neu,sentiment_pos,sentiment_compound,readability_smog_index,readability_gunning_fog,readability_flesch_kincaid_grade
0,vaxopedia.org,Is the CDC Pushing Vaccines Because a Batch of...,Guess when the latest batches of MMR expire?,[],No photo description available.,./About Pediatrics and Parenting Advice - Post...,,[],The latest conspiracy theory is that MMR vacci...,2019-03-31 16:29:00,...,About Pediatrics and Parenting Advice,False,About Pediatrics and Parenting Advice,0.108,0.892,0.0,-0.5267,0.0,11.32,6.6
1,keepkidshealthy.com,The New Vaccine Surveillance Network Report on...,Anyone who has been following the outbreaks of...,[],Sadio Mane has revealed he hates watching Man ...,./About Pediatrics and Parenting Advice - Post...,,[],The New Vaccine Surveillance Network Report on...,2019-03-28 13:07:00,...,About Pediatrics and Parenting Advice,False,About Pediatrics and Parenting Advice,0.0,1.0,0.0,0.0,0.0,12.0,9.6
2,sccgov.org,Public Health Department Warns of Possible Mea...,The County of Santa Clara Public Health Depart...,[measles],,,,[],"Someone with in Santa Clara County, #Californ...",2019-03-27 14:36:00,...,About Pediatrics and Parenting Advice,False,About Pediatrics and Parenting Advice,0.051,0.949,0.0,-0.0772,0.0,16.4,15.4
3,vaxopedia.org,News on the Latest Measles Outbreaks of 2019,Get vaccinated and stop the measles outbreaks.,[],Image may contain: one or more people and peop...,./About Pediatrics and Parenting Advice - Post...,,[],"There are 33 new measles cases in Brooklyn, br...",2019-03-27 14:14:00,...,About Pediatrics and Parenting Advice,False,About Pediatrics and Parenting Advice,0.0,1.0,0.0,0.0,0.0,5.8,7.8
4,vaxopedia.org,The CDC Vaccine Price List Conspiracy,Did you believe this one?,[],No photo description available.,./About Pediatrics and Parenting Advice - Post...,,[],It took less a few minutes to debunk the lates...,2019-03-27 13:32:00,...,About Pediatrics and Parenting Advice,False,About Pediatrics and Parenting Advice,0.159,0.841,0.0,-0.5267,0.0,8.0,6.0


In [15]:
#Add parts of speech features
import nltk
from nltk.stem.wordnet import wordnet
from collections import Counter
def get_wordnet_pos(pos):
    pos = pos[0].upper()
    wordnet_tag_dict = {"J": wordnet.ADJ,
        "N": wordnet.NOUN,
        "V": wordnet.VERB,
        "R": wordnet.ADV}
    return wordnet_tag_dict.get(pos, wordnet.NOUN)
pos = cleaned_df.text_tokenized_lemmatized.apply(lambda x: [pos for word, pos in nltk.pos_tag(x)])
counted_basic = pos.apply(lambda x: Counter([get_wordnet_pos(word) for word in x]))
counted = pos.apply(lambda x: Counter(x))
for tag in ['a', 'n', 'r', 'v']:
    cleaned_df['num_pos_basic_' + tag] = counted_basic.apply(lambda x: x[tag] if x and tag in x else 0)
for tag in set(counted.apply(lambda x: list(x.keys())).sum()):
    cleaned_df['num_pos_' + tag] = counted.apply(lambda x: x[tag] if x and tag in x else 0)
cleaned_df.head()

Unnamed: 0,article_host,article_name,article_subtitle,hashtags,img-label,img_src,linked_profiles,links,text,timestamp,...,num_pos_VBD,num_pos_RBS,num_pos_UH,num_pos_VB,num_pos_RB,num_pos_MD,num_pos_JJS,num_pos_'',num_pos_WP$,num_pos_VBZ
0,vaxopedia.org,Is the CDC Pushing Vaccines Because a Batch of...,Guess when the latest batches of MMR expire?,[],No photo description available.,./About Pediatrics and Parenting Advice - Post...,,[],The latest conspiracy theory is that MMR vacci...,2019-03-31 16:29:00,...,0,0,0,1,0,0,0,0,0,0
1,keepkidshealthy.com,The New Vaccine Surveillance Network Report on...,Anyone who has been following the outbreaks of...,[],Sadio Mane has revealed he hates watching Man ...,./About Pediatrics and Parenting Advice - Post...,,[],The New Vaccine Surveillance Network Report on...,2019-03-28 13:07:00,...,0,0,0,0,0,0,0,0,0,1
2,sccgov.org,Public Health Department Warns of Possible Mea...,The County of Santa Clara Public Health Depart...,[measles],,,,[],"Someone with in Santa Clara County, #Californ...",2019-03-27 14:36:00,...,0,0,0,0,0,0,0,0,0,0
3,vaxopedia.org,News on the Latest Measles Outbreaks of 2019,Get vaccinated and stop the measles outbreaks.,[],Image may contain: one or more people and peop...,./About Pediatrics and Parenting Advice - Post...,,[],"There are 33 new measles cases in Brooklyn, br...",2019-03-27 14:14:00,...,2,0,0,0,1,0,0,0,0,0
4,vaxopedia.org,The CDC Vaccine Price List Conspiracy,Did you believe this one?,[],No photo description available.,./About Pediatrics and Parenting Advice - Post...,,[],It took less a few minutes to debunk the lates...,2019-03-27 13:32:00,...,0,0,0,2,0,0,0,0,0,0


In [16]:
#Add LIWC and other features
from Liwc import LiwcAnalyzer
liwc = LiwcAnalyzer()
liwc_results = liwc.parse(cleaned_df.text)
new_df = cleaned_df.join(liwc_results.fillna(0))

In [17]:
new_df.to_csv('Data_Clean/posts_full.csv')
new_df.to_json('Data_Clean/posts_full_json.json')

### Clean Data Further

In [None]:
import dataprep
dataprep.load_and_clean('Data_Clean/posts_full.csv')['features'].to_csv('Data_Clean/features_scaled.csv')