GitHub - Paristha/node_web_scraper: A web-scraping program written in node.js

New York Times Word-Occurrence Grapher

This project uses the NYT Developer APIs to gather word-occurrence data from past years and graphs it to demonstrate Zipf's Law.
Explore the docs »

View Demo · Report Bug · Request Feature

About The Project

Built With

Getting Started

To get a local copy up and running follow these simple steps.

Prerequisites

The project needs npm to install its dependencies. The following command updates npm to the latest version.

npm

npm install npm@latest -g

Installation

Clone the repo

git clone https://github.com/Paristha/node_web_scraper.git

Install NPM packages

npm install

Usage

The three dropdown lists control, from top to bottom, the year, the month, and the sample size. Years from 2009 to 2019 are included. The Times changed the layout of its articles at some point before 2009, and articles from before 2002 are archived, so 2009 was chosen as the cutoff point to ensure reliability.

The sampling offers three options: 10, 50, and 100 articles. When common words are excluded, a sample of 10 does not guarantee good results. Originally 1 and 5 were also options, but with so few articles there is a chance that an article without words (a slideshow) is picked, so 10 was chosen as the minimum for providing a reasonable corpus.

The app works by calling the New York Times Archive API with the chosen month and year. This returns a JSON document with information on every NYT article from that month. A random sample of these is taken using Math.random() to retrieve as many article URLs as requested, and each URL is then visited. The text of an article is extracted by taking all inner HTML from elements with the name 'articleBody'. The text is split into words and counted, and the counts are used to update a MySQL database with a column for 'word' (a string, which is the key) and a column for 'word-count' (a number). After all articles are visited, the top 50 rows of the database, in descending order of 'word-count', are extracted and inserted into the HTML so that they can be used to render the chart when the webpage is loaded.
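The steps above can be sketched in a few small functions. This is only an illustration, not the code in app.js: the function names are hypothetical, an in-memory Map stands in for the MySQL table, and sampling without replacement is an assumption (the description only says Math.random() is used). The URL pattern is the one documented for the NYT Archive API; the API key is your own.

```javascript
// Sketch only: helper names are hypothetical and a Map replaces the MySQL table.

// Build the Archive API URL for one month of one year.
function archiveUrl(year, month, apiKey) {
  return `https://api.nytimes.com/svc/archive/v1/${year}/${month}.json?api-key=${apiKey}`;
}

// Pick up to `n` distinct items at random using a partial Fisher-Yates shuffle.
// Assumption: the sample is drawn without replacement.
function sampleWithoutReplacement(items, n, random = Math.random) {
  const pool = items.slice();
  const count = Math.min(n, pool.length);
  for (let i = 0; i < count; i++) {
    const j = i + Math.floor(random() * (pool.length - i));
    [pool[i], pool[j]] = [pool[j], pool[i]];
  }
  return pool.slice(0, count);
}

// Add the words of one article to a running word -> count map.
function countWords(text, counts = new Map()) {
  const words = text.toLowerCase().match(/[a-z]+(?:'[a-z]+)*/g) || [];
  for (const word of words) {
    counts.set(word, (counts.get(word) || 0) + 1);
  }
  return counts;
}

// Return the `limit` most frequent words, highest count first.
function topWords(counts, limit = 50) {
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, limit)
    .map(([word, count]) => ({ word, count }));
}
```

Calling countWords once per article with the same Map accumulates counts across the whole sample, which mirrors updating one database row per word.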

The default word exclusion list contains the 150 most common words, as found here. Custom word exclusion lists should follow the same format.
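As a rough sketch of how an exclusion list might be applied to the word counts: the exact file format is not described here, so the code below assumes plain text with words separated by commas and/or newlines. Both function names are hypothetical.

```javascript
// Assumption: the exclusion list is plain text, words separated by commas and/or whitespace.
function parseExclusionList(listText) {
  return new Set(
    listText
      .split(/[\s,]+/)
      .map((word) => word.toLowerCase())
      .filter((word) => word.length > 0)
  );
}

// Return a new word -> count map without the excluded words.
function removeExcluded(counts, excluded) {
  const kept = new Map();
  for (const [word, count] of counts) {
    if (!excluded.has(word)) {
      kept.set(word, count);
    }
  }
  return kept;
}
```

Using a Set keeps each lookup constant-time, so filtering stays cheap even with a long custom list.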

The Word-Occurrence bar graph uses the words as labels. On the Log-Log scatter plot, you can see the word by hovering over its point. The natural log is used, but any logarithm base would work the same way. The scatter plot should show a downward linear trend, demonstrating Zipf's Law.
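To make the log-log relationship concrete, here is a minimal sketch that turns ranked word counts into scatter-plot points and estimates the slope of the trend with an ordinary least-squares fit. The function names and the `{ word, count }` row shape are assumptions for illustration; the app itself only plots the points.

```javascript
// Convert rows of { word, count } into log-log points: x = ln(rank), y = ln(count).
function logLogPoints(rows) {
  const sorted = rows.slice().sort((a, b) => b.count - a.count);
  return sorted.map((row, i) => ({
    word: row.word,
    x: Math.log(i + 1), // rank starts at 1
    y: Math.log(row.count),
  }));
}

// Least-squares slope of y against x; a value near -1 is the classic Zipf pattern.
function fitSlope(points) {
  const n = points.length;
  const meanX = points.reduce((sum, p) => sum + p.x, 0) / n;
  const meanY = points.reduce((sum, p) => sum + p.y, 0) / n;
  let numerator = 0;
  let denominator = 0;
  for (const p of points) {
    numerator += (p.x - meanX) * (p.y - meanY);
    denominator += (p.x - meanX) ** 2;
  }
  return numerator / denominator;
}
```

Switching to another logarithm base multiplies both x and y by the same constant, so the fitted slope does not change. That is why the choice of the natural log does not matter.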

Example graphs (taken from 10-2016, 100 articles sampled, excluding common words):

N.B.: The example word-exclusion list is not exhaustive, so some 'common' words can still be seen in the graphs. Future updates may include better common word lists, but the ability to tailor your own list to your needs should suffice.

Roadmap

- Improved common word list
- Ability to download MySQL db created to store data for graph

Feel free to email me suggestions!

See the open issues for a list of proposed features (and known issues).

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

Fork the Project
Create your Feature Branch (git checkout -b feature/AmazingFeature)
Commit your Changes (git commit -m 'Add some AmazingFeature')
Push to the Branch (git push origin feature/AmazingFeature)
Open a Pull Request

License

Distributed under the MIT License. See LICENSE for more information.

Contact

Thana Paris - tmp2121@caa.columbia.edu

Project Link: https://github.com/Paristha/node_web_scraper

Repository files (37 commits):

.ebextensions
public/modules
.gitattributes
.gitignore
EBSampleApp-Nodejs.iml
EB_deployedfile_list.txt
LICENCE
README.md
app.html
app.js
cron.yaml
exclusion_list.csv
node_web_scraper.png
node_web_scraper_word-occurrence_graph.png
node_web_scraper_zipf_graph.png
nytArchiveGET.js
package-lock.json
package.json
