Historical US City populations
Type Name Latest commit message Commit time
.Rproj.user 2018 edits Mar 9, 2018
.ipynb_checkpoints Image and README Aug 22, 2017
wiki_census Image and README Aug 22, 2017
wikipedia_state_data 2018 edits Mar 9, 2018
.Rhistory Image and README Aug 22, 2017
.gitignore .gitignore Aug 22, 2017
1790-2010_MASTER.xlsx Image and README Aug 22, 2017
2016_Gaz_cousubs_national.txt Image and README Aug 22, 2017
City Sources.png
First parsing.Rmd Image and README Aug 22, 2017
First parsing.nb.html Image and README Aug 22, 2017
Gaz_counties_national.txt Image and README Aug 22, 2017
Kmean and loess.Rmd Image and README Aug 22, 2017
Kmean and loess.nb.html Image and README Aug 22, 2017
Maxpop.png Image and README Aug 22, 2017
Merge data together 2018.ipynb 2018 edits Mar 9, 2018
Merge data together.ipynb 2018 edits Mar 9, 2018
Parse Wikipedia Dumps for Gutentext.ipynb Image and README Aug 22, 2017
Parse Wikipedia Dumps.ipynb Update Parse Wikipedia Dumps.ipynb Mar 13, 2018
README.md Update README.md Mar 13, 2018
Second Parsing.Rmd 2018 edits Mar 9, 2018
Second Parsing.nb.html Image and README Aug 22, 2017
U2.ipynb Image and README Aug 22, 2017
cache.pickle
cache2.pickle Image and README Aug 22, 2017
city_pops.csv
extended_description.md 2018 edits Mar 9, 2018
merged.csv 2018 edits Mar 9, 2018
merging_functions.py 2018 edits Mar 9, 2018
nohup.out Image and README Aug 22, 2017
places.shelf Image and README Aug 22, 2017
provinces.py 2018 edits Mar 9, 2018
wikiparser.py 2018 edits Mar 9, 2018
wikipedia_population.Rproj Image and README Aug 22, 2017

README.md

Figure: The municipal places in this dataset, by year of maximum population

This repository contains a dataset, and the code that builds it, merging three major sources of historical US population data.

It is part of the in-progress Creating Data digital monograph. If citing, please cite that project in addition to this repo. For example: "Schmidt, Benjamin. Creating Data: The Invention of Information in the nineteenth century American State. http://creatingdata.us".

License

This data is in the public domain and there are no legal restrictions on its use. If you're an academic, I'd recommend also citing the CESTA population set that this draws on, as well as Wikipedia if you can swing that.

Content

A fuller description of data and method is contained in the file extended_description.md and on the project page.

The sources are:

  1. Every Wikipedia page with a population box.
  2. A manually entered set of CSVs by Wikipedia editor Jacob Alperin-Sheriff (most, but not all, of which is on Wikipedia).
  3. A set of historical populations compiled by Stanford's CESTA: U.S. Census Bureau and Erik Steiner, Spatial History Project, Center for Spatial and Textual Analysis, Stanford University.

There are many process files here. The most useful files are likely:

  • merged.csv (The union dataset.)
  • The files in wikipedia_state_data/, which include the parsed contents of all Wikipedia population boxes in the United States.
  • The files in wiki_census/, which are the sources Alperin-Sheriff used to build the Wikipedia page.
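Because the column layout of merged.csv is not described in this README, a reasonable first step is to look at the header and a few rows before doing anything else. The following is a minimal sketch using only the Python standard library; the helper name `peek_rows` is hypothetical and is not part of this repository.

```python
import csv
from itertools import islice


def peek_rows(lines, n=5):
    """Return (column_names, first_n_rows) from an iterable of CSV lines.

    `lines` can be an open file handle or any iterable of strings.
    Only the header and the first n data rows are read.
    """
    reader = csv.DictReader(lines)
    rows = list(islice(reader, n))
    return reader.fieldnames, rows


# Usage, run from the repository root:
#     with open("merged.csv", newline="", encoding="utf-8") as handle:
#         columns, rows = peek_rows(handle)
#     print(columns)
```

Reading only a handful of rows keeps this cheap even if the file is large, and it avoids assuming any particular column names.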

There are all sorts of errors here. Since this is built up programmatically, I'm not interested in corrections to individual data points, although I encourage you to correct the Wikipedia pages.

I have made many efforts to merge duplicate cities in the merged.csv file, but there are still many cases of double-counting of various sorts, especially when the Wikipedia and CESTA populations diverge for a single city, or when multiple levels of government each have an entry (for example, both Manhattan and New York City have entries).
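If you need to screen for the first kind of double-counting yourself, one rough approach is to group records by place and year and flag groups whose populations disagree. This is only a sketch under assumptions: the keys `name`, `state`, `year`, `population`, and `source` are hypothetical and may not match the columns in merged.csv, and this is not the merge logic used in this repository.

```python
from collections import defaultdict


def normalize(name):
    """Lowercase a place name and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def flag_divergent(rows, tolerance=0.05):
    """Return (key, sources) for place-years whose populations disagree.

    rows: iterable of dicts with the hypothetical keys
          "name", "state", "year", "population", "source".
    tolerance: largest allowed relative gap between the smallest and
               largest population reported for the same place and year.
    """
    groups = defaultdict(list)
    for row in rows:
        key = (normalize(row["name"]), row["state"], int(row["year"]))
        groups[key].append(row)

    flagged = []
    for key, members in groups.items():
        if len(members) < 2:
            continue  # only one record, nothing to compare
        pops = [float(m["population"]) for m in members]
        low, high = min(pops), max(pops)
        if high > 0 and (high - low) / high > tolerance:
            flagged.append((key, sorted({m["source"] for m in members})))
    return flagged
```

Matching on name and state will not catch the second kind of double-counting, where a borough and its parent city both appear under different names; that needs geographic or administrative information rather than string matching.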

The Wikipedia set is about four times larger than the CESTA one. The following maps show roughly what each dataset originally contributed:

Figure: Sources of cities

Also included is the code that extracts population boxes from a Wikipedia dump and the code that performs the merge (including a few examples of errors and differences between sets). These are mostly in IPython notebooks, with a little bit in R notebooks. Most of the operational Python code is broken out into the .py files, which the notebooks import.
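To give a sense of what extracting a population box involves, here is a small illustrative sketch that pulls year and population pairs out of raw wikitext. It is not the code in wikiparser.py. It assumes the box is written with a template named "US Census population" whose parameters look like `|1790= 33131`; both the template name and that layout are assumptions, and the function name `parse_population_box` is hypothetical.

```python
import re

# Matches parameters such as "|1790= 33,131": a four-digit year key
# followed by a numeric value that may contain thousands separators.
YEAR_PARAM = re.compile(r"\|\s*(\d{4})\s*=\s*(\d[\d,]*)")


def parse_population_box(wikitext, template="US Census population"):
    """Return {year: population} from the first matching template.

    Returns an empty dict if the template is not found. The template
    name is an assumption; pass a different one if a page uses another.
    """
    start = wikitext.find("{{" + template)
    if start == -1:
        return {}
    # Naive end detection: stops at the first "}}", so a nested
    # template inside the box will truncate the result.
    end = wikitext.find("}}", start)
    body = wikitext[start:end] if end != -1 else wikitext[start:]
    return {
        int(year): int(value.replace(",", ""))
        for year, value in YEAR_PARAM.findall(body)
    }
```

A real parser has to cope with nested templates, footnotes inside values, and variant template names, which is part of why errors of the kind mentioned above end up in the data.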