Network mapping/analysis prototype #7

Open
sjacks26 opened this issue Feb 3, 2017 · 5 comments

Comments

@sjacks26
Contributor

sjacks26 commented Feb 3, 2017

This is a proof-of-concept for network mapping/analysis of far-right blogs/websites. My (rough) thinking is that we could grab all the links in a blog, parse them, and count the domains so we can generate a normalized "citation" score for each linked domain. I think we should collect both blogrolls and links mentioned in posts, but keep them separate: appearing on a blogroll means something different from being mentioned in a post.

So here are 3 blogs that are huge:

Steps (a rough code sketch follows the list):

  1. Find all links in posts
  2. Parse all links to identify domain name
  3. Generate domain counts (normalized by total number of pages, or total number of out-links, or something else)
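
A minimal sketch of those three steps in Python, assuming requests and BeautifulSoup are available and we already have a list of post URLs for a blog; domain_counts and the normalization choice here are illustrative, not settled:

    from collections import Counter
    from urllib.parse import urlparse

    import requests
    from bs4 import BeautifulSoup

    def domain_counts(post_urls):
        """Steps 1-3: find the links in each post, parse out domains, count them."""
        counts = Counter()
        for url in post_urls:
            html = requests.get(url, timeout=10).text
            soup = BeautifulSoup(html, "html.parser")
            for a in soup.find_all("a", href=True):  # step 1: find all links in the post
                domain = urlparse(a["href"]).netloc  # step 2: parse the link to get its domain
                if domain:                           # relative links have no domain; skip them
                    counts[domain] += 1
        total = sum(counts.values()) or 1
        # step 3: normalize; here by total out-links (per-page is another option)
        return {d: n / total for d, n in counts.items()}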

We can make this recursive by doing the same thing with all the domains linked. I have a feeling that the number of posts at level 0 might mean level 1 is astronomically large, but I don't know that.
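
To make the recursive version concrete, here is a hedged sketch of the level-by-level expansion, reusing domain_counts from the sketch above; get_post_urls (which would enumerate a domain's posts) is hypothetical:

    def expand_levels(seed_domains, get_post_urls, max_level=1):
        """Level 0 is the seed blogs; each pass adds every domain the current level cites."""
        seen = set(seed_domains)
        frontier = list(seed_domains)
        for _ in range(max_level):
            next_frontier = []
            for domain in frontier:
                scores = domain_counts(get_post_urls(domain))  # get_post_urls is hypothetical
                next_frontier.extend(d for d in scores if d not in seen)
            seen.update(next_frontier)
            frontier = next_frontier  # this is the step that could blow up at level 1
        return seen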

If this works and we want to scale it up, I have a list of 160 domains, almost all of which are associated with the patriot/militia movement.

Thoughts?

@sjacks26
Contributor Author

sjacks26 commented Feb 3, 2017

I started working on something like this a couple of weeks ago and wrote some sloppy Python that got a couple of things right but most things wrong. If someone wants to work on this and wants to see that code, let me know.

@ccarey

ccarey commented Feb 3, 2017

@sjacks26 Can you upload them to a directory here on GitHub? I'd be interested in taking a look at them.

@sjacks26
Contributor Author

sjacks26 commented Feb 3, 2017

@ccarey check out https://github.com/Data4Democracy/far-right-analysis/blob/master/citation_analysis/find_hrefs_loop.py. I was trying to do this on sites that I had mirrored, so the script is missing the piece that scrapes from the websites directly.
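
(For anyone picking this up: the missing fetch piece could be as small as the sketch below, which just saves a page's HTML to disk so the href-finding script can read it like a mirrored site. save_page and the output layout are an illustration, not part of find_hrefs_loop.py.)

    import os
    from urllib.parse import urlparse

    import requests

    def save_page(url, out_dir="mirrored"):
        """Fetch one page and save its HTML locally (hypothetical helper)."""
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
        parts = urlparse(url)
        name = (parts.netloc + parts.path.replace("/", "_")) or "index"
        os.makedirs(out_dir, exist_ok=True)
        path = os.path.join(out_dir, name + ".html")
        with open(path, "w", encoding="utf-8") as f:
            f.write(resp.text)
        return path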

@ccarey

ccarey commented Feb 3, 2017

@sjacks26 Thanks! I'll try to look into this some this weekend. Will let you know if I have any questions, but just at first glance the code you already have looks like a solid jumping-off point.

@sjacks26 changed the title from "Network mapping/analysis proof-of-concept" to "Network mapping/analysis prototype" on Feb 5, 2017
@bbrewington

I wrote some code to scrape all posts for a given blog domain, for the whole year. See pull request: #16

  • TODO: Loop through each blog post and grab the links it contains (note to self: use the code below)
library(rvest)    # read_html(), html_nodes(), html_attr(); re-exports %>%
library(stringr)  # str_detect()
read_html(blog_post) %>% html_nodes("p a") %>% html_attr("href") %>%
               .[str_detect(., "https?:")] %>%                # keep absolute links only
               .[!str_detect(., "https://westernrifleshooters.files.wordpress.com")]  # drop the blog's own media files
