CamCairns/NYT_article_classification

Classifying Articles from the New York Times

A small project to demonstrate NLP and text classification using a corpus derived from the New York Times. The goal was to classify articles into one of five 'news desk' categories (Arts, Business, Obituaries, World, Sport) based on the story's headline and body text.

A longer writeup and discussion of results can be found on my website [INSERT LINK HERE]

This repo contains two separate directories:

  • generate_corpus: Contains code to generate the NYT article corpus using the NYT API. A version of the corpus generated in 2015 is available to download here
  • classify_articles: Self-contained classification script. Reads in the corpus, creates an 80:20 train/test split and applies a Naive Bayes model. Plots a confusion matrix and writes the 10 most discriminating words for each article category.

The NYT Article Search API

The first thing you have to do is sign up and register as a developer. API keys are assigned per API, so make sure you specify the Article Search API.

Once you have received an API key, set it as the environment variable nyt_api_key in your .bash_profile:

export nyt_api_key=<YOUR API KEY>
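
From Python, the key can then be read back out of the environment. The following is a minimal sketch that assumes only the variable name nyt_api_key used above; the helper name get_api_key is hypothetical and is not taken from this repository:

```python
import os


def get_api_key(env=os.environ, name="nyt_api_key"):
    """Return the API key stored in the environment, or raise a clear error."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"Environment variable {name!r} is not set; "
            "add `export nyt_api_key=<YOUR API KEY>` to your .bash_profile."
        )
    return key
```

Failing early with an explicit message is easier to debug than a rejected request later on.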

Even before you register, you can use the NYT's handy API Console to interactively test your queries: http://developer.nytimes.com/io-docs

The Article Search API is pretty flexible; you can call it with no parameters except for your api-key and it will return (presumably) a list of articles, in reverse chronological order, starting from Sept. 18, 1851. However, it only returns 10 articles per request, and it won't let you paginate beyond a page parameter of 100 (i.e. you can't go to page 100000 to retrieve the 1,000,000th oldest Times article). To put it another way, you can only paginate through roughly 1,000 results per query (pages 0 through 100 at 10 articles each gives 1,010), so you'll have to facet your search.
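
One way to work within the page limit is to issue a separate query per news desk and walk pages 0 through 100 for each. The sketch below only builds the request URLs and makes no network calls. The endpoint URL and the fq filter syntax are assumptions based on version 2 of the Article Search API and should be checked against the current NYT documentation; build_query_urls is a hypothetical helper, not code from this repository:

```python
from urllib.parse import urlencode

# Assumed endpoint for version 2 of the Article Search API.
BASE_URL = "https://api.nytimes.com/svc/search/v2/articlesearch.json"
MAX_PAGE = 100  # largest page value allowed, per the limit described above


def build_query_urls(api_key, news_desk, max_page=MAX_PAGE):
    """Build one request URL per page (0..max_page) for a single news desk."""
    urls = []
    for page in range(max_page + 1):
        params = {
            "fq": f'news_desk:("{news_desk}")',  # facet: restrict to one desk
            "page": page,
            "api-key": api_key,
        }
        urls.append(f"{BASE_URL}?{urlencode(params)}")
    return urls
```

Calling build_query_urls(key, "Arts") yields 101 URLs, one for each page of that desk.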

A lot more information about each article can be pulled out, but for the classification project we only need the title and body text.
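
As an illustration, trimming one article record down to those fields might look like the sketch below. The field names (web_url, headline.main, lead_paragraph) are assumptions based on the version 2 response format, and lead_paragraph is used here only as a stand-in for the body text; how this repository actually obtains the body is not shown here. The sample record is made up:

```python
def extract_fields(doc):
    """Keep only the URL, headline and text from one article record."""
    headline = (doc.get("headline") or {}).get("main") or ""
    body = doc.get("lead_paragraph") or ""
    return {
        "url": doc.get("web_url") or "",
        "headline": headline.strip(),
        "body": body.strip(),
    }


# Hypothetical record, shaped like one entry of the response's list of docs.
sample_doc = {
    "web_url": "https://www.nytimes.com/example",
    "headline": {"main": " An Example Headline "},
    "lead_paragraph": "First paragraph of the story.",
    "news_desk": "Arts",
}
record = extract_fields(sample_doc)
```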

NYT Corpus Creation

To generate the corpus, first set the environment variable data_dir to the directory where you would like the corpus to be saved:

    export data_dir=/path/to/save/corpus

Then cd into the generate_corpus/bin directory and execute the shell script:

cd NYT_API_scraper/generate_corpus/bin/ 
./create_corpus.sh

This pulls the first 101 pages (1010 articles) for each of five news desks (Arts, Business, Obituaries, World, Sport) and saves each article as two individual files: a .csv (containing url, headline and body) and a .txt (containing headline + body). The csv is more of a 'debugging' tool (it lets you trace back to the original NYT article via the url), while the txt documents are used for the actual classification modelling.
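
The per-article save step could be sketched roughly as follows. This is an illustration, not the repository's own code: save_article is a hypothetical helper, and the file naming only approximates the layout described in this section.

```python
import csv
from pathlib import Path


def save_article(data_dir, category, index, url, headline, body):
    """Write one article as <category>_<index>.csv and <category>_<index>.txt."""
    stem = f"{category}_{index:04d}"
    csv_dir = Path(data_dir) / "csv_document" / category
    txt_dir = Path(data_dir) / "txt_document" / category
    csv_dir.mkdir(parents=True, exist_ok=True)
    txt_dir.mkdir(parents=True, exist_ok=True)

    # The CSV keeps the url so the article can be traced back to its source.
    with open(csv_dir / f"{stem}.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["url", "headline", "body"])
        writer.writerow([url, headline, body])

    # The TXT holds headline + body, the text used for classification.
    txt_path = txt_dir / f"{stem}.txt"
    txt_path.write_text(f"{headline}\n{body}\n", encoding="utf-8")
    return txt_path
```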

The files are stored in a two-level folder structure like the following:

txt_document/
    category_1_folder/
        category1_0001.txt category1_0002.txt ... category1_1010.txt
    category_2_folder/
        category2_0001.txt category2_0002.txt ... category2_1010.txt

csv_document/
    category_1_folder/
        category1_0001.csv category1_0002.csv ... category1_1010.csv
    category_2_folder/
        category2_0001.csv category2_0002.csv ... category2_1010.csv
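
With this layout, the folder name doubles as the class label. A minimal sketch of reading the txt documents back in, assuming UTF-8 files (load_corpus is a hypothetical helper, not the function used in the classification script):

```python
from pathlib import Path


def load_corpus(txt_root):
    """Return (texts, labels), using each subfolder name as the label."""
    texts, labels = [], []
    for category_dir in sorted(Path(txt_root).iterdir()):
        if not category_dir.is_dir():
            continue  # skip stray files at the top level
        for txt_file in sorted(category_dir.glob("*.txt")):
            texts.append(txt_file.read_text(encoding="utf-8"))
            labels.append(category_dir.name)
    return texts, labels
```

For example, load_corpus("/path/to/save/corpus/txt_document") would return one text and one label per saved article.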

Classification Model

To run the classification model:

cd NYT_API_scraper/classify_articles
python nyt_article_classification.py

This will generate a confusion matrix and a .json file with the top 10 most discriminating words for each category.
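
For readers who want to see the idea without running the script, here is a from-scratch sketch of the same steps (80:20 split, multinomial Naive Bayes, confusion counts, top words) using only the standard library. It is not the script's implementation, which may well rely on a library instead. The tokenizer, the add-one smoothing and the way words are ranked as "discriminating" are all assumptions made for illustration, and top_words needs at least two categories.

```python
import math
import random
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


def split_80_20(texts, labels, seed=0):
    """Shuffle the corpus and split it 80:20 into train and test pairs."""
    pairs = list(zip(texts, labels))
    random.Random(seed).shuffle(pairs)
    cut = int(0.8 * len(pairs))
    return pairs[:cut], pairs[cut:]


def train(pairs):
    """Count documents per class and word occurrences per class."""
    doc_counts, word_counts = Counter(), defaultdict(Counter)
    for text, label in pairs:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return doc_counts, word_counts, vocab


def log_likelihood(word, label, word_counts, vocab):
    """log P(word | label) with add-one (Laplace) smoothing."""
    total = sum(word_counts[label].values())
    return math.log((word_counts[label][word] + 1) / (total + len(vocab)))


def predict(text, model):
    """Return the label with the highest log posterior for the text."""
    doc_counts, word_counts, vocab = model
    n_docs = sum(doc_counts.values())
    scores = {}
    for label in doc_counts:
        score = math.log(doc_counts[label] / n_docs)  # class prior
        for word in tokenize(text):
            if word in vocab:  # ignore words never seen in training
                score += log_likelihood(word, label, word_counts, vocab)
        scores[label] = score
    return max(scores, key=scores.get)


def confusion_matrix(test_pairs, model):
    """Map (true label, predicted label) to a count."""
    return Counter((label, predict(text, model)) for text, label in test_pairs)


def top_words(label, model, n=10):
    """Rank words by log P(w | label) minus the best log P(w | other label)."""
    doc_counts, word_counts, vocab = model
    others = [other for other in doc_counts if other != label]

    def margin(word):
        return log_likelihood(word, label, word_counts, vocab) - max(
            log_likelihood(word, other, word_counts, vocab) for other in others
        )

    return sorted(vocab, key=margin, reverse=True)[:n]
```

A typical run would call split_80_20 on the loaded texts and labels, pass the training pairs to train, and then feed the test pairs and the resulting model to confusion_matrix.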
