A small project demonstrating NLP and text classification using a corpus derived from the New York Times. The goal was to classify articles into one of five 'news desk' categories (Arts, Business, Obituaries, World, Sport) based on each story's headline and body text.
A longer writeup and discussion of results can be found on my website [INSERT LINK HERE]
This repo contains two separate directories:
- `generate_corpus`: contains code to generate the NYT article corpus using the NYT API. A version of the corpus generated in 2015 is available to download here.
- `classify_articles`: a self-contained classification script. It reads in the corpus, creates an 80:20 train/test split, and applies a Naive Bayes model. It plots a confusion matrix and writes the 10 most discriminating words for each article category.
The first thing you have to do is sign up and register as a developer. API keys are assigned per API, so make sure you specify the Article Search API.

Once you have received an API key, set it as the environment variable `nyt_api_key` in your `.bash_profile`:

```
export nyt_api_key=<YOUR API KEY>
```
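
Since the scraper reads the key from the environment, a quick way to confirm the variable is actually visible from Python (a hypothetical sanity check, not part of the repo):

```python
import os

# The scraper expects the key in this environment variable.
api_key = os.environ.get("nyt_api_key")
if api_key is None:
    raise RuntimeError("nyt_api_key is not set -- add the export line "
                       "to your .bash_profile and re-source it")
```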
Even before you register, you can use the NYT's handy API Console to interactively test your queries: http://developer.nytimes.com/io-docs
The Article Search API is pretty flexible; you can call it with no parameters except your `api-key` and it will (presumably) return a list of articles in reverse chronological order, starting from Sept. 18, 1851. However, it only returns 10 articles per request, and it won't let you paginate beyond a `page` parameter of 100 (i.e. you can't jump to page 100000 to retrieve the 1,000,000th-oldest Times article). To put it another way, you can only paginate through a maximum of 1,010 results (101 pages × 10 articles), so you'll have to facet your search.
A lot more information about each article can be pulled out, but for the classification project we only need the title and body text.
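
For illustration, a faceted, paginated request might look like the sketch below. This is not the repo's scraper: `fetch_page` is my own helper name, while the endpoint, the parameters (`api-key`, `fq`, `page`) and the response fields (`web_url`, `headline.main`) follow the v2 Article Search docs as I understand them.

```python
import os

import requests

# v2 Article Search endpoint (as documented at developer.nytimes.com)
API_URL = "http://api.nytimes.com/svc/search/v2/articlesearch.json"

def fetch_page(news_desk, page):
    """Fetch one page (10 articles) of results for a single news desk."""
    params = {
        "api-key": os.environ["nyt_api_key"],
        "fq": 'news_desk:("{}")'.format(news_desk),  # facet by news desk
        "page": page,  # the API caps this at 100
    }
    resp = requests.get(API_URL, params=params)
    resp.raise_for_status()
    return resp.json()["response"]["docs"]

# e.g. the ten most recent Sports articles
for doc in fetch_page("Sports", 0):
    print(doc["web_url"], "-", doc["headline"]["main"])
```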
## Generate the corpus

First, set the environment variable `data_dir` to the path where you would like the corpus to be saved:

```
export data_dir=/path/to/save/corpus
```
Then `cd` into the `generate_corpus/bin` dir and execute the shell script:

```
cd NYT_API_scraper/generate_corpus/bin/
./create_corpus.sh
```
This pulls the first 101 pages (1,010 articles) for each of the five news desks (Arts, Business, Obituaries, World, Sport) and saves each article as an individual file in two formats: a `.csv` (containing url, headline and body) and a `.txt` (containing headline + body). The csv exists more as a 'debugging' tool (it lets you trace back to the original NYT article via the url); the txt documents are used for the actual classification modelling.
The files are stored in a two-level folder structure like the following:

```
txt_document/
    category_1_folder/
        category1_0001.txt  category1_0002.txt  ...  category1_1010.txt
    category_2_folder/
        category2_0001.txt  category2_0002.txt  ...  category2_1010.txt
csv_document/
    category_1_folder/
        category1_0001.csv  category1_0002.csv  ...  category1_1010.csv
    category_2_folder/
        category2_0001.csv  category2_0002.csv  ...  category2_1010.csv
```
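
As an aside, the corpus-writing step could be sketched as follows. `save_article` is a hypothetical helper (the real logic lives behind `create_corpus.sh`), but it produces the two-level layout shown above:

```python
import csv
import os

def save_article(data_dir, category, index, url, headline, body):
    """Hypothetical helper: write one article as both .txt and .csv,
    following the two-level layout above (one subfolder per category)."""
    stem = "{}_{:04d}".format(category.lower(), index)

    txt_dir = os.path.join(data_dir, "txt_document", category)
    os.makedirs(txt_dir, exist_ok=True)
    # .txt: headline + body -- the input used for classification modelling
    with open(os.path.join(txt_dir, stem + ".txt"), "w") as f:
        f.write(headline + "\n\n" + body)

    csv_dir = os.path.join(data_dir, "csv_document", category)
    os.makedirs(csv_dir, exist_ok=True)
    # .csv: url, headline, body -- kept so you can trace back to the article
    with open(os.path.join(csv_dir, stem + ".csv"), "w", newline="") as f:
        csv.writer(f).writerow([url, headline, body])
```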
## Run the classification model

```
cd NYT_API_scraper/classify_articles
python nyt_article_classification.py
```
This will generate a confusion matrix and a `.json` file with the top 10 most discriminating words for each category.
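
For reference, the core of this approach (one folder per category, an 80:20 split, a bag-of-words Naive Bayes model, a confusion matrix and per-category word lists) can be sketched in a few lines of scikit-learn. This is an illustrative sketch rather than the repo's script; in particular, ranking words by per-class log probability is just one simple proxy for "most discriminating":

```python
import numpy as np
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# load_files maps each subfolder of txt_document/ to one class label
corpus = load_files("txt_document", encoding="utf-8", decode_error="replace")

# 80:20 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    corpus.data, corpus.target, test_size=0.2, random_state=0)

vec = CountVectorizer(stop_words="english")
clf = MultinomialNB().fit(vec.fit_transform(X_train), y_train)

y_pred = clf.predict(vec.transform(X_test))
print(confusion_matrix(y_test, y_pred))

# Top 10 words per category, approximated by the highest per-class
# log probabilities in the fitted model
words = np.array(vec.get_feature_names_out())
for i, name in enumerate(corpus.target_names):
    top10 = np.argsort(clf.feature_log_prob_[i])[::-1][:10]
    print(name, ":", ", ".join(words[top10]))
```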