Network mapping/analysis prototype #7
Comments
I started working on doing something like this a couple weeks ago and wrote some sloppy Python that did a couple things right but most things wrong. If someone wants to work on this and wants to see that, let me know.
@sjacks26 Can you upload them to a directory here on GitHub? I'd be interested in taking a look at them.
@ccarey check out https://github.com/Data4Democracy/far-right-analysis/blob/master/citation_analysis/find_hrefs_loop.py. I was trying to do this on sites that I had mirrored, so the script is missing the piece that scrapes from the websites directly.
@sjacks26 Thanks! I'll try to look into this some this weekend. Will let you know if I have any questions, but just at first glance the code you already have looks like a solid jumping-off point.
I wrote some code to scrape all posts for a given blog domain, for the whole year. See pull request: #16
read_html(blog_post) %>% html_nodes("p a") %>% html_attr("href") %>%
.[str_detect(., "https?:")] %>%
.[!str_detect(., "https://westernrifleshooters.files.wordpress.com")]
This is a proof-of-concept for network mapping/analysis of far-right blogs/websites. Basically, my (rough) thinking is we could grab all the links in a blog, parse them, and count the domains so we can generate some normalized "citation" score. I think we should collect both blogrolls and links mentioned in posts, but we should keep them separate: something appearing on a blogroll means something different than being mentioned in a post.
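To make the counting idea concrete, here is a minimal sketch in Python using only the standard library. It is not the project's actual code: the function names (`extract_domains`, `citation_scores`), the `own_domain` parameter, and the choice to normalize by dividing each domain's count by the total number of outbound links are all assumptions for illustration. Other normalizations (per post, per blog) would work just as well.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_domains(html, own_domain):
    """Return the external domains linked from one page of HTML.

    own_domain is a hypothetical parameter: links back to the blog
    itself (or its subdomains) are dropped so they don't inflate counts.
    """
    parser = LinkCollector()
    parser.feed(html)
    domains = []
    for href in parser.hrefs:
        parsed = urlparse(href)
        if parsed.scheme not in ("http", "https"):
            continue  # skip relative links, mailto:, etc.
        host = (parsed.hostname or "").lower()
        if host.startswith("www."):
            host = host[4:]
        if host and host != own_domain and not host.endswith("." + own_domain):
            domains.append(host)
    return domains


def citation_scores(pages, own_domain):
    """Map each linked domain to its share of all outbound links.

    Assumed normalization: count for the domain / total outbound links.
    """
    counts = Counter()
    for html in pages:
        counts.update(extract_domains(html, own_domain))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {domain: n / total for domain, n in counts.items()}
```

To keep blogrolls and in-post links separate, one option is to call `citation_scores` twice, once on the blogroll HTML and once on the post bodies, and store the two result dictionaries side by side.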
So here's 3 blogs that are huge:
Steps:
We can recursive-ize this by doing the same thing with all the domains linked. I have a feeling that the number of posts in level 0 might mean level 1 is astronomically large, but I don't know that.
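One way to keep the recursion from blowing up is to cap the depth and never revisit a domain. Here's a rough sketch of that, again in Python. The names are made up for illustration: `crawl_levels` is hypothetical, and `get_linked_domains` is a placeholder for whatever function ends up scraping a domain and returning the domains it links to.

```python
from collections import deque


def crawl_levels(seed_domains, get_linked_domains, max_depth=1):
    """Breadth-first walk over linked domains, stopping at max_depth.

    seed_domains are level 0. A domain is only expanded (scraped) if its
    level is below max_depth, so with max_depth=1 the level-1 domains are
    recorded but not scraped themselves.

    get_linked_domains is a hypothetical callable: domain -> iterable of
    linked domains. Returns (levels, edges), where levels maps each domain
    to the level at which it was first seen and edges is a list of
    (source, target) pairs.
    """
    levels = {domain: 0 for domain in seed_domains}
    edges = []
    queue = deque(seed_domains)
    while queue:
        domain = queue.popleft()
        depth = levels[domain]
        if depth >= max_depth:
            continue  # recorded, but too deep to expand
        for target in get_linked_domains(domain):
            edges.append((domain, target))
            if target not in levels:
                levels[target] = depth + 1
                queue.append(target)
    return levels, edges
```

Running this with `max_depth=1` first and looking at `len(levels)` would give a direct answer to how large level 1 actually is before committing to anything deeper.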
If this works and we want to scale it up, I have a list of 160 domains, almost all of which are associated with the patriot/militia movement.
Thoughts?