Network mapping/analysis prototype #7

Open
sjacks26 opened this issue Feb 3, 2017 · 5 comments

Comments

@sjacks26
Contributor

sjacks26 commented Feb 3, 2017

This is a proof-of-concept for network mapping/analysis of far-right blogs/websites. My (rough) thinking is that we could grab all the links in a blog, parse them, and count the domains so we can generate a normalized "citation" score for each linked domain. I think we should collect both blogrolls and links mentioned in posts, but keep them separate: appearing on a blogroll means something different from being mentioned in a post.

So here are 3 blogs that are huge:

Steps (a rough code sketch follows the list):

  1. Find all links in posts
  2. Parse all links to identify domain name
  3. Generate domain counts (normalized by total number of pages, or total number of out-links, or something else)
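
A minimal sketch of those three steps in Python, assuming requests and BeautifulSoup are available and we already have a list of post URLs for a blog; domain_counts and the normalization choice here are illustrative, not settled:

    from collections import Counter
    from urllib.parse import urlparse

    import requests
    from bs4 import BeautifulSoup

    def domain_counts(post_urls):
        """Steps 1-3: find the links in each post, parse out domains, count them."""
        counts = Counter()
        for url in post_urls:
            html = requests.get(url, timeout=10).text
            soup = BeautifulSoup(html, "html.parser")
            for a in soup.find_all("a", href=True):  # step 1: find all links in the post
                domain = urlparse(a["href"]).netloc  # step 2: parse the link to get its domain
                if domain:                           # relative links have no domain; skip them
                    counts[domain] += 1
        total = sum(counts.values()) or 1
        # step 3: normalize; here by total out-links (per-page is another option)
        return {d: n / total for d, n in counts.items()}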

We can make this recursive by doing the same thing with all the domains linked. I have a feeling that the number of posts at level 0 might mean level 1 is astronomically large, but I don't know that.
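
To make the recursive version concrete, here is a hedged sketch of the level-by-level expansion, reusing domain_counts from the sketch above; get_post_urls (which would enumerate a domain's posts) is hypothetical:

    def expand_levels(seed_domains, get_post_urls, max_level=1):
        """Level 0 is the seed blogs; each pass adds every domain the current level cites."""
        seen = set(seed_domains)
        frontier = list(seed_domains)
        for _ in range(max_level):
            next_frontier = []
            for domain in frontier:
                scores = domain_counts(get_post_urls(domain))  # get_post_urls is hypothetical
                next_frontier.extend(d for d in scores if d not in seen)
            seen.update(next_frontier)
            frontier = next_frontier  # this is the step that could blow up at level 1
        return seen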

If this works and we want to scale it up, I have a list of 160 domains, almost all of which are associated with the patriot/militia movement.

Thoughts?

@sjacks26
Contributor Author

sjacks26 commented Feb 3, 2017

I started working on something like this a couple of weeks ago and wrote some sloppy Python that got a couple of things right but most things wrong. If someone wants to work on this and wants to see that code, let me know.

@ccarey

ccarey commented Feb 3, 2017

@sjacks26 Can you upload them to a directory here on GitHub? I'd be interested in taking a look at them.

@sjacks26
Contributor Author

sjacks26 commented Feb 3, 2017

@ccarey check out https://github.com/Data4Democracy/far-right-analysis/blob/master/citation_analysis/find_hrefs_loop.py. I was trying to do this on sites that I had mirrored, so the script is missing the piece that scrapes from the websites directly.
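
(For anyone picking this up: the missing fetch piece could be as small as the sketch below, which just saves a page's HTML to disk so the href-finding script can read it like a mirrored site. save_page and the output layout are an illustration, not part of find_hrefs_loop.py.)

    import os
    from urllib.parse import urlparse

    import requests

    def save_page(url, out_dir="mirrored"):
        """Fetch one page and save its HTML locally (hypothetical helper)."""
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
        parts = urlparse(url)
        name = (parts.netloc + parts.path.replace("/", "_")) or "index"
        os.makedirs(out_dir, exist_ok=True)
        path = os.path.join(out_dir, name + ".html")
        with open(path, "w", encoding="utf-8") as f:
            f.write(resp.text)
        return path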

@ccarey

ccarey commented Feb 3, 2017

@sjacks26 Thanks! I'll try to look into this some this weekend. Will let you know if I have any questions, but just at first glance the code you already have looks like a solid jumping-off point.

@sjacks26 changed the title from "Network mapping/analysis proof-of-concept" to "Network mapping/analysis prototype" on Feb 5, 2017
@bbrewington

I wrote some code to scrape all posts for a given blog domain, for the whole year. See pull request: #16

  • TODO: Loop through each blog post and grab the links it contains (note to self: use the code below)
library(rvest)    # read_html(), html_nodes(), html_attr(); re-exports %>%
library(stringr)  # str_detect()
read_html(blog_post) %>% html_nodes("p a") %>% html_attr("href") %>%
               .[str_detect(., "https?:")] %>%                # keep absolute links only
               .[!str_detect(., "https://westernrifleshooters.files.wordpress.com")]  # drop the blog's own media files
