Identification of Polarized Blog Posts #4

Open · TyJK opened this Issue May 9, 2017 · 12 comments

TyJK commented May 9, 2017

Labelling Blog Sites

We need labelled data for various topics and sentiments, and we need a lot of it. We have decided on a form of labelling called distant supervision, where we use heuristics and tags to classify far more text than we could possibly label manually; the idea is that the cost of potentially mislabelling some data is outweighed by the far greater volume. To do this we have targeted opinion blogs for three main reasons:

  • They contain far more text than a single social media comment
  • Posts on the same site should largely hold the same sentiment or point of view for a given topic
  • Unlike news articles, they should be very semantically similar to comments

We will need to scrape this data, meaning we first need to label potential target sites. To do this we need people to pick a topic, such as global warming, vaccination, religion/atheism, or some other polarizing topic. Once that topic is decided on, try to find blogs that deal more or less exclusively with it, and determine the dominant sentiment of official posts on the site (not comments). Check that the sentiment is fairly consistent between posts and authors (if there's more than one).

Once a site or domain is determined to be a good target, enter the URL into a text file. The text file should be named in the format: Topic of Blog Posts - Sentiment (e.g. Climate Change - Denial, Abortion - Pro Choice, etc.). Each file should contain only one leaning, for the sake of easily running them through any automated scraper we create. Avoid ambiguously leaning sites (those that post from both sides) or those whose topic varies significantly.

What should be in the file

The first item is the domain of the website, which will be used to limit where a crawler can go and which links it can follow. It should not include 'http://' or 'www', but simply the domain name, such as realclimate.org.

The next is the URL pattern for the blog posts: the longest URL prefix shared by all blog pages on that site. For example, on realclimate.org all of the blog posts can be found by year, e.g. http://www.realclimate.org/index.php/archives/2017/05/ or http://www.realclimate.org/index.php/archives/2016/03/. The common pattern is therefore http://www.realclimate.org/index.php/archives/20. This is not itself a valid URL, but every valid post URL MUST contain this sequence, which makes it easy for anyone scraping with Portia or another scraper to enter the sequence into the regex section when designing a spider and then set it loose.

Finally, if you want, you can add a subjective evaluation of how extreme you believe the site's position to be, with 1 being centrist and 5 being extremist. A template is available in the URL Dump folder, and remember to name your file with the topic and sentiment.
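As a rough illustration of how a scraper could use these two items, here is a minimal Python sketch, using the realclimate.org example above. The function name and the decision to substring-match are my own assumptions, not a settled design:

```python
import re

# Values for the realclimate.org example described above: the bare domain,
# and the longest URL prefix shared by all blog posts on the site.
DOMAIN = "realclimate.org"
POST_PATTERN = re.compile(re.escape("realclimate.org/index.php/archives/20"))

def is_candidate(url: str) -> bool:
    """True if a discovered link stays on the target domain AND matches
    the common post pattern, so a crawler should follow it."""
    return DOMAIN in url and bool(POST_PATTERN.search(url))

print(is_candidate("http://www.realclimate.org/index.php/archives/2017/05/"))  # True
print(is_candidate("http://www.realclimate.org/about/"))                        # False
```

In Portia the escaped pattern string would simply go into the spider's regex field; the sketch just shows what that filter does.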

The list of possible topics includes but is not limited to:

  • Climate Change - IsReal/Skeptic
  • Abortion - Pro-life/Pro-choice
  • Religion - Believers/Non-believers
  • Vaccines - Pro-vaccination/Anti-vaccination
  • Guns - Pro-gun/Anti-gun
  • Drug Policy - Criminalization/Decriminalization and Legalization

We have deliberately stayed away from topics like Politics - Left/Right or Libertarian/Authoritarian for two reasons:

  • These sorts of categories are quite general and tend to encompass many of the above topics
  • Defining what is Left vs. what is Right is more subjective and varies from person to person.

If you choose to create your own topic, please keep in mind that it should be clear/unambiguous as well as broad; e.g. Yankees vs. Red Sox would not be a good topic, as it's very specific. If you have any doubts, please comment on this issue with your suggested topic and we'll give you feedback. Also, while any self-directed initiative is encouraged, keep in mind that we'd rather have a lot of data for just a few topics than sparser data for many topics.

Thank you for your efforts and patience.

PikioopSo commented May 17, 2017

@TyJK

What's your take on extremist/radical sites? Do those count too?

Kip/PiReel

TyJK commented May 17, 2017

@PiReel

I think that encompassing the full range of stances on each side is important. That said, it's also important to get primarily the middle 90% of perspectives, and only have a few samples from the fringes.

We want the corpus to reflect all views under the broad umbrella of 'for' and 'against', and the extremists are likely to have some of the most distinctive speaking patterns which will be beneficial. But they also tend to be the most vitriolic, and so we definitely want to keep those in the minority. They'll be the easiest to find, so really they'll most likely have to be avoided once a few decently sized examples are found.

PikioopSo commented May 18, 2017

Thanks for the info, @TyJK.

I also wanted to let you know about a Firefox add-on called Pocket. It allows you to share bookmarked pages with a group of collaborators, and it also allows you to assign labels to the bookmarked pages.

Things I like about it:

  • Good-looking interface.
  • Shareable web pages make it easier to explain complicated subjects.
  • Connectable to third-party accounts.

I think it would work really well for a project like this, where we need to organize a bunch of links and share them.

TyJK commented May 18, 2017

@PiReel
I wasn't aware of that feature; it sounds like exactly what we need, thank you! I'll play around with it to get an idea of how we could use it in a systematic way, and then maybe post an update so that everyone who's contributing is on the same page.

TyJK commented May 26, 2017

Regarding Pocket: unfortunately, there doesn't seem to be any easy bulk-sharing feature. For now I'm going to leave the recommendation as simply sharing via a text file, but I'll keep tinkering with it to see how it can be used.

PikioopSo commented May 31, 2017

@TyJK, sorry for the late response; my email is packed with mozsprint stuff. I was wondering if you were trying to find other people to share Pocket items with — I guess you were.

I am going to try to do a search for you on Pocket. What should I search for?

PikioopSo commented May 31, 2017

I believe you can use tags, so that people contributing can do tag searches for an echoburst tag or something similar.

TyJK commented May 31, 2017

@PiReel Tyler JK is what I have my name set as; hopefully that's unique enough, but if not, let me know. I'll see if I can set that up, but I've come across a second advantage of a .txt file: I can read it into a scraper I write. Is there a way to download Pocket links into something like that? Thanks :)

KipOmaha commented May 31, 2017

I'm writing an extension that turns the browser's file system into an "Adobe Bridge"-type application that works with Pi Reel.

For your use case, though, we would have to write your scraper as a browser extension that works with Pocket.

That would be a nice piece of bookmarking software to have, but more elaborate.

TyJK commented May 31, 2017

That seems like one of those things that might be an interesting project on its own, but not necessarily something we can get up and running right now. I'm already going through the various features you could incorporate with this. From what I can tell, you need three things to efficiently scrape a site: the domain, the URL paths it should follow (I've found that with most blogs, the archive has a longer URL that's consistent for all blog posts and not shared by non-post content), and the webpage elements you want to scrape. Once you have that for each site, you should be able to put it into a dictionary or list of lists and then run everything. Still doing research, but hopefully I'll have something more detailed by tonight.
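The per-site dictionary idea above could be sketched like this in Python. The field names and the CSS selector are placeholders I've invented to illustrate the shape of the data, not a settled format:

```python
# One configuration entry per labelled site: the domain (limits where the
# crawler may go), the common URL prefix for posts, and the page element
# holding the post text. All values below are illustrative.
SITE_CONFIGS = [
    {
        "domain": "realclimate.org",
        "url_pattern": "realclimate.org/index.php/archives/20",
        "post_selector": "div.entry-content",  # hypothetical CSS selector
    },
    # ... one dict per site listed in the Topic - Sentiment files ...
]

def plan_crawls(configs):
    """Return the (domain, url_pattern) pairs a scraper would iterate over."""
    return [(c["domain"], c["url_pattern"]) for c in configs]

for domain, pattern in plan_crawls(SITE_CONFIGS):
    print(f"crawl {domain}, following only URLs containing {pattern}")
```

A real run would hand each entry to a spider rather than print it, but the structure — one dict per site, iterated in a single loop — is the point.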

TyJK commented Jun 2, 2017

I'm taking on Climate Change, both sides, at the moment. I'll upload within the next few hours, and then people can add to it if they wish.

TyJK commented Jun 2, 2017

Climate Change is up, will be working on Drug Policy - Criminalization/Decriminalization next.
