Skip to content

Unsupervised topic modeling of print journalism corpora ranging from tabloid articles to broadsheet articles. Trained a classifier to distinguish between tabloid and broadsheet articles.

License

Notifications You must be signed in to change notification settings

Atomahawk/print-journalism-topic-models

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

13 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Python Numpy Pandas Scikit-Learn
PyPI PyPI version PyPI version PyPI version

Slides live here. Clean notebooks coming soon

Print Journalism Article Classifier and Topic Models

  • Unsupervised topic modeling of print journalism corpora ranging from tabloid articles to broadsheet articles.
  • Trained a classifier to distinguish between tabloid and broadsheet articles based off the language provided.

Problem Statement

What’s driving sensationalist news is a change in the fundamental business model of news reporting: from a few well fact-checked news agencies to the new internet enabled anyone-can-write-a-headline-and-get-paid free for all.

  • There are lots of us who understand the inherent danger in the proliferation of sensationalist news without substance or depth.
  • I wanted to demonstrate using data science and objective measuring as a potential check on accurate and fair reporting within journalism.

Problem: Illustrate distinctions between broadsheet and tabloid journalism.

On Motives for Filtering News

  • I believe improving civil discourse begins with more consumption of news with substance or depth.
  • Although I could not assess an article's content for truth-value in an automated way, I could use NLP to objective measure word style and language choice to make clear distinctions to measure 'sensationalist' stories relative to my training set.

Data measuring the difference between broadsheets and tabloids

Broadsheets have come to be associated with a high-minded approach to the dissemination of news, and with an upscale readership. Even today, broadsheet papers tend to employ a traditional approach to newsgathering that emphasizes in-depth coverage and a sober tone in articles and editorials.

** Broadsheet training examples: NYT, Reuters, Associated Press **

  • Justification: The Times has been an annual winner of Pulitzer Prizes, and has so far amassed 117 of them, beginning in 1918 for its coverage of World War I. The Times is one of the few newspapers that has been successful in monetizing the Internet because it is so well-known and well-respected worldwide that people are willing to pay a subscription fee to read it online.
  • Pulled a corpus of 10,000+ documents from NYT API.

In contrast, tabloids tend to be more irreverent and slangy in their writing style than their more serious broadsheet brothers. In a crime story, a broadsheet might refer to a police officer, while the tabloid will call him a cop. And while a broadsheet might spend dozens of column inches on "serious" news - say, a major bill being debated in Congress - a tabloid is more likely to zero in on a heinous sensational crime story or celebrity gossip.

** Tabloid training examples: http://www.opensources.co/ **

  • Justification: Trusting on the integrity of university affiliated academics, I scraped a corpus 6000+ documents of click-bait news articles from many of the sources provided.

Analytical Approach

Will better explain the questions guiding analyses and a summary of the methods involved.

  • TF-IDF Vectorizer --> Multinomial Naive Bayes Classifier
  • CountVectorizer --> HDP --> LDA --> Clustering & Topic Modeling

Tool Stack

  • AWS t2.medium linux EC2 instance with a MongoDB database
  • Jupyter notebook
  • Python 3.5
  • Newspaper, BeautifulSoup4
  • Sci-kit Learn
  • NLTK
  • Gensim
  • D3.js

Might include step by step series of examples that tell you have to get a development env running

Visuals [In Development]

  • See slide deck. Interactive word clouds coming soon.

Conclusions

More to come. Will explain insights gleaned, model evaluation, or patterns in visualization.

Best Performer: Multinomial Naive Bayes Classifier

  • Score: 95%
Give an example

Future Work

Expand to other aspects of fake news as described by Brennan Borlaug - one member of small team of grad students at UC Berkeley who have defined 'fake news' as a broad category of the following attributes:

  • Clickbait — Shocking headlines meant to generate clicks to increase ad revenue. Oftentimes these stories are highly exaggerated or totally false.
  • Propaganda — Intentionally misleading or deceptive articles meant to promote the author’s agenda. Oftentimes the rhetoric is hateful and incendiary.
  • Commentary/Opinion — Biased reactions to current events. These articles oftentimes tell the reader how to perceive recent events.
  • Humor/Satire — Articles written for entertainment. These stories are not meant to be taken seriously.

Author

License

This project is licensed under the MIT License - see the LICENSE.md file for details

Acknowledgments & Inspirations

About

Unsupervised topic modeling of print journalism corpora ranging from tabloid articles to broadsheet articles. Trained a classifier to distinguish between tabloid and broadsheet articles.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published