Identification of Polarized Blog Posts #4
Labelling Blog Sites
We need labelled data for various topics and sentiments, and we need a lot of it. We have decided on a form of labelling called distant supervision, where we use heuristics and tags to classify far more text than we could possibly label manually; the cost of potentially mislabelling some data is outweighed by the far greater volume. To do this we have targeted opinion blogs for 3 main reasons:
We will need to scrape this data, meaning we first need to label potential target sites. To do this, we need people to pick a topic, such as global warming, vaccination, religion/atheism, or some other polarizing topic. Once that topic is decided on, try to find blogs that deal more or less exclusively with it, and determine the dominant sentiment of official posts on the site (not comments). Check that the sentiment is fairly consistent between posts and authors (if there's more than one).
Once a site or domain is determined to be a good target, enter the URL into a text file. The text file should be named in the format Topic of Blog Posts - Sentiment (e.g. Climate Change - Denial, Abortion - Pro Choice, etc.). Each file should contain only one leaning, for the sake of easily running them through any automated scraper we create. Avoid ambiguously leaning sites (those that post from both sides) or those whose topic varies significantly.
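Since the scraper will eventually need to recover the topic and sentiment from these filenames, a minimal sketch of that step (assuming the "Topic - Sentiment.txt" convention above is followed exactly) might look like:

```python
import os

def parse_label_filename(path):
    # Strip the directory and extension, then split on the " - "
    # separator from the "Topic - Sentiment.txt" naming convention.
    name = os.path.splitext(os.path.basename(path))[0]
    topic, sentiment = name.split(" - ", 1)
    return topic, sentiment

print(parse_label_filename("Climate Change - Denial.txt"))
```

The function name and layout are hypothetical; the point is just that a consistent separator makes the labels machine-readable.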
What should be in the file
The first is the domain of the website, which will be used to limit where a crawler can go and which links it can follow. It should not include 'http://' or 'www', but simply the domain name, such as realclimate.org.
The next is the URL pattern for the blog posts. By this I mean the longest consistent URL for all blog pages on that site. For example, on realclimate.org all of the blog posts can be found by year, e.g. http://www.realclimate.org/index.php/archives/2017/05/ or http://www.realclimate.org/index.php/archives/2016/03/. Thus, the common URL would be http://www.realclimate.org/index.php/archives/20. This is not itself a valid URL, but all valid post URLs MUST contain this sequence. This makes it easy for anyone scraping with Portia or another scraper to simply enter this sequence into the regex section when designing a spider and then set it loose.

Finally, if you want, you can add a subjective evaluation of how extreme you believe the site to be in its position, with 1 being centrist and 5 being extremist. A template is available in the URL Dump folder, and remember to name your file with the topic and sentiment.
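To illustrate the "common URL sequence" idea, here is a small sketch of how a scraper could filter candidate links using the realclimate.org pattern described above (the function name is mine, not part of any scraper we've built):

```python
import re

# The common sequence for realclimate.org posts, from the example above.
# Any valid blog-post URL must contain this substring.
POST_PATTERN = re.compile(r"realclimate\.org/index\.php/archives/20")

def is_blog_post(url):
    """Return True if the URL contains the site's blog-post sequence."""
    return POST_PATTERN.search(url) is not None

urls = [
    "http://www.realclimate.org/index.php/archives/2017/05/some-post/",
    "http://www.realclimate.org/about/",
]
print([u for u in urls if is_blog_post(u)])
```

This is the same check Portia performs when you enter the sequence into its regex field: keep a link only if the pattern occurs somewhere in it.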
The list of possible topics includes but is not limited to:
We have deliberately stayed away from topics like Politics - Left/Right or Libertarian/Authoritarian for two reasons:
If you choose to create your own topic, please keep in mind that it should be clear/unambiguous as well as broad. E.g. Yankees vs. Red Sox would not be a good topic, as it's very specific. If you have any doubts, please comment on this issue with your suggested topic and we'll give you feedback. Also, while any self-directed initiative is encouraged, keep in mind that we'd rather have a lot of data for just a few topics than sparser data for many topics.
Thank you for your efforts and patience.
I think that encompassing the full range of stances on each side is important. That said, it's also important to get primarily the middle 90% of perspectives, and only have a few samples from the fringes.
We want the corpus to reflect all views under the broad umbrella of 'for' and 'against', and the extremists are likely to have some of the most distinctive speaking patterns which will be beneficial. But they also tend to be the most vitriolic, and so we definitely want to keep those in the minority. They'll be the easiest to find, so really they'll most likely have to be avoided once a few decently sized examples are found.
Thanks for the info, @TyJK.
I also wanted to let you know about a Firefox add-on called Pocket. It allows you to share bookmarked pages with a group of collaborators, and it also lets you assign labels to the bookmarked pages.
So things I like about it are:
I think it would work really well for a project like this, where we need to organize a bunch of links and share them.
@PiReel Tyler JK is what I have the name set up as; hopefully that's unique enough, but if not let me know. I'll see if I can set that up, but I've come across a second advantage of a .txt file, which is that I can read it into a scraper I write. Is there a way to download Pocket links into something like that? Thanks :)
I'm writing an extension that turns the browser's file system into an "Adobe Bridge"-type application that works with Pi Reel.
For your case, though, we would have to write your scraper as a browser extension that works with Pocket.
Which would be a nice piece of bookmarking software to have, but more elaborate.
That seems like one of those things that might be an interesting project on its own, but not necessarily something we can get up and running right now. I'm already going through the various features you could incorporate with this. From what I can tell, you need 3 things to efficiently scrape a site: the domain, the URL paths it should follow (I've found that with most blogs, the archive has a longer URL that's consistent for all blog posts and not shared by non-post content), and the webpage elements you want to scrape. Once you have that for each site, you should be able to put it into a dictionary or list of lists and then run everything. Still doing research, but hopefully I'll have something more detailed by tonight.
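The three-things-per-site idea above could be sketched as a list of config dictionaries; the keys, selectors, and helper below are illustrative placeholders, not a finished design:

```python
# Hypothetical per-site configuration: domain, the consistent URL
# prefix for posts, and CSS selectors for the elements to scrape.
SITES = [
    {
        "domain": "realclimate.org",
        "post_url_prefix": "realclimate.org/index.php/archives/20",
        "selectors": {"title": "h1.entry-title", "body": "div.entry-content"},
    },
    # ... one dict per labelled site ...
]

def should_follow(url, site):
    # A crawler would follow a link only if it stays on the site's
    # domain and matches the consistent post-URL prefix.
    return site["domain"] in url and site["post_url_prefix"] in url

print(should_follow(
    "http://www.realclimate.org/index.php/archives/2016/03/a-post/",
    SITES[0],
))
```

With a structure like this, one generic spider could loop over the list and be configured per site, which is the "run everything" step described above.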