This project uses the NYT Developer APIs to gather word-occurrence data from past years and graphs it to demonstrate Zipf's Law.
Explore the docs » · View Demo · Report Bug · Request Feature
To get a local copy up and running, follow these simple steps.
You need the following to use the software; install it as shown.
- npm
npm install npm@latest -g
- Clone the repo
git clone https://github.com/Paristha/node_web_scraper.git
- Install NPM packages
npm install
The three dropdown lists control the year, month, and sampling, in that order from top to bottom. Years from 2009 to 2019 are included. The Times changed the layout of its articles at some point before 2009, and articles from before 2002 are archived, so 2009 was chosen as the cutoff point to ensure reliability.
The sampling offers three options: 10, 50, and 100 articles. When common words are excluded, a sample of 10 does not guarantee good results. Originally 1 and 5 were also options, but there is a chance of picking an article without words (a slideshow), so 10 was chosen as the minimum for a reasonable corpus.
The app works by calling the New York Times Archive API with the chosen month and year. This returns a JSON document with information on every NYT article from that month. A sample of these is taken using Math.random() to retrieve as many article URLs as requested, and each URL is then visited. The text of each article is extracted by taking all inner HTML from elements with the name 'articleBody'. The text is split into words, and the per-word counts are used to update a MySQL database with a column for 'word' (a string, which is the key) and a column for 'word-count' (a number). After all articles have been visited, the top 50 rows of the database, in descending order of 'word-count', are extracted and inserted into the HTML so that they can be used to render the chart when the webpage loads.
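The pipeline described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: every function name is hypothetical, an in-memory Map stands in for the MySQL table, the sample is assumed to be drawn without repeats, and the Archive API URL format is taken from the public NYT documentation and should be checked against it.

```javascript
// Sketch only: all names are hypothetical, and a Map replaces the MySQL table.

// Build the Archive API URL for one month (URL format assumed from the NYT docs).
function archiveUrl(year, month, apiKey) {
  const key = encodeURIComponent(apiKey);
  return `https://api.nytimes.com/svc/archive/v1/${year}/${month}.json?api-key=${key}`;
}

// Pick up to `n` distinct URLs at random using a partial Fisher-Yates shuffle.
function sampleUrls(urls, n, random = Math.random) {
  const pool = urls.slice();
  const size = Math.min(n, pool.length);
  for (let i = 0; i < size; i++) {
    const j = i + Math.floor(random() * (pool.length - i));
    [pool[i], pool[j]] = [pool[j], pool[i]];
  }
  return pool.slice(0, size);
}

// Add the words of one article to a running word -> count map.
function countWords(text, counts = new Map()) {
  const words = text.toLowerCase().match(/[a-z]+(?:'[a-z]+)*/g) || [];
  for (const word of words) {
    counts.set(word, (counts.get(word) || 0) + 1);
  }
  return counts;
}

// Return the `limit` most frequent words, highest count first.
function topWords(counts, limit = 50) {
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, limit)
    .map(([word, count]) => ({ word, count }));
}

// Usage: accumulate counts over two stand-in article texts.
const demoCounts = new Map();
for (const text of ['The cat and the hat.', 'The end of the story.']) {
  countWords(text, demoCounts);
}
console.log(topWords(demoCounts, 3));
```

Keeping the counts in one shared map mirrors the role of the database: each visited article only increments existing rows or adds new ones, and the top 50 are read out once at the end.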
The default word exclusion list is the 150 most common words, as found here. Custom word exclusion lists should follow the same format.
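As an illustration of how an exclusion list could be applied, here is a small sketch. It assumes, purely for the example, a plain-text list with one word per line; the real format is whatever the linked default list uses. The function names are hypothetical.

```javascript
// Sketch only: the one-word-per-line format is an assumption for illustration.

// Turn the list text into a Set of lowercase words, skipping blank lines.
function parseExclusionList(listText) {
  return new Set(
    listText
      .split(/\r?\n/)
      .map((line) => line.trim().toLowerCase())
      .filter((line) => line.length > 0)
  );
}

// Drop every { word, count } row whose word is on the exclusion list.
function excludeWords(rows, excluded) {
  return rows.filter(({ word }) => !excluded.has(word.toLowerCase()));
}

// Usage with a tiny stand-in list.
const excludedDemo = parseExclusionList('the\nof\nand\n');
console.log(excludeWords([{ word: 'the', count: 9 }, { word: 'zipf', count: 2 }], excludedDemo));
```

A Set keeps each lookup constant-time, which matters little for 150 words but keeps the filter simple for longer custom lists.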
The Word-Occurrence bar graph uses the words as labels; in the Log-Log scatter plot you can see the word by hovering over its point. The natural log is used, but any base would work equally well, since changing the base rescales both axes by the same constant and leaves the slope unchanged. The scatter plot should show a downward linear trend, demonstrating Zipf's Law.
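To make the log-log relationship concrete, the sketch below converts word counts into (ln rank, ln count) points and fits a least-squares slope. It is an illustration rather than the app's charting code; the function names are hypothetical, and counts are assumed to be positive. For counts that follow Zipf's Law exactly (count proportional to 1/rank), the fitted slope is -1.

```javascript
// Sketch only: hypothetical helpers, not the project's charting code.

// Rank rows by descending count and map each to (ln rank, ln count).
function toLogLogPoints(rows) {
  return rows
    .slice()
    .sort((a, b) => b.count - a.count)
    .map((row, i) => ({
      word: row.word,
      x: Math.log(i + 1), // rank starts at 1
      y: Math.log(row.count), // assumes count > 0
    }));
}

// Ordinary least-squares slope of y against x.
function fitSlope(points) {
  if (points.length < 2) {
    throw new Error('need at least two points to fit a slope');
  }
  const n = points.length;
  const meanX = points.reduce((sum, p) => sum + p.x, 0) / n;
  const meanY = points.reduce((sum, p) => sum + p.y, 0) / n;
  let numerator = 0;
  let denominator = 0;
  for (const p of points) {
    numerator += (p.x - meanX) * (p.y - meanY);
    denominator += (p.x - meanX) ** 2;
  }
  return numerator / denominator;
}

// Usage: idealised Zipfian counts, count = 1000 / rank.
const idealRows = [1, 2, 3, 4, 5].map((rank) => ({ word: `w${rank}`, count: 1000 / rank }));
console.log(fitSlope(toLogLogPoints(idealRows)));
```

Real samples will scatter around the line, especially with common words excluded, so the slope is best read as a rough indicator of the trend.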
Example graphs (taken from 10-2016, 100 articles sampled, excluding common words):
N.B.: The example word-exclusion list is not exhaustive, so some 'common' words can still be seen in these graphs. Future updates may include better common word lists, but the ability to tailor your own list to your needs should suffice.
- Improved common word list
- Ability to download the MySQL database created to store the data for the graph
Feel free to email me suggestions!
See the open issues for a list of proposed features (and known issues).
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
- Fork the Project
- Create your Feature Branch (git checkout -b feature/AmazingFeature)
- Commit your Changes (git commit -m 'Add some AmazingFeature')
- Push to the Branch (git push origin feature/AmazingFeature)
- Open a Pull Request
Distributed under the MIT License. See LICENSE for more information.
Thana Paris - tmp2121@caa.columbia.edu
Project Link: https://github.com/Paristha/node_web_scraper