Data Wrangling Project — Udacity Data Analyst Nanodegree
Latest commit e12360d Jul 26, 2018

Wrangle OpenStreetMap Data

This project is part of the Data Analyst Nanodegree. Below you'll find the rest of the projects. I also wrote a short post about the experience.

Important note: The entire project is documented and explained in the file; I encourage you to start there. The project file structure is described below.

Main Scripts

  • calls all the functions and executes the program. To create the .csv files and import the data into the database in the data folder, just run this script with python and it will take care of the rest. It can also run the audit functions, but those are commented out by default since they do not modify the data itself.
  • this is the first look at the data. It programmatically checks for data validity, accuracy and other measures and prints its results in the terminal. It does not modify the data itself, only reports the issues it encounters.

The audit script consists of two similar functions:

  • audit_nodes(): checks for node elements.
  • audit_ways(): checks for way elements.

Running both at the same time could lead to parsing errors. It is therefore recommended to run one of them with the other commented out in the script, and to swap them only after the first has finished.
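As a rough illustration of this kind of read-only audit, the sketch below streams an OSM document and reports issues for one element type at a time. It is not the project's code: `audit_elements`, the coordinate range check and the sample data are assumptions made for the example. Each call gets its own fresh input stream, which mirrors the advice to run the node and way audits separately.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Tiny made-up OSM extract used only for this illustration.
SAMPLE_OSM = b"""<osm>
  <node id="1" lat="51.5" lon="-0.1"><tag k="addr:street" v="Baker St"/></node>
  <node id="2" lat="95.0" lon="-0.2"/>
  <way id="10"><nd ref="1"/><nd ref="2"/><tag k="highway" v="residential"/></way>
</osm>"""


def audit_elements(source, element_name):
    """Report issues for one element type without modifying the data.

    Hypothetical helper: returns a list of (element id, issue) pairs and a
    count of the tag keys seen on the audited elements.
    """
    issues = []
    tag_keys = Counter()
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != element_name:
            continue
        if element_name == "node":
            # Assumed validity check: coordinates must be plausible.
            lat = float(elem.get("lat"))
            lon = float(elem.get("lon"))
            if not (-90 <= lat <= 90 and -180 <= lon <= 180):
                issues.append((elem.get("id"), "coordinates out of range"))
        for tag in elem.iter("tag"):
            tag_keys[tag.get("k")] += 1
        elem.clear()  # free memory once the element has been audited
    return issues, tag_keys


# Each audit is a separate pass over a fresh stream.
node_issues, node_keys = audit_elements(io.BytesIO(SAMPLE_OSM), "node")
way_issues, way_keys = audit_elements(io.BytesIO(SAMPLE_OSM), "way")
print(node_issues, dict(node_keys), way_issues, dict(way_keys))
```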

  • reads in the data from the .osm file and exports all of it to .csv files. During the process, it ensures the export complies with the structure dictated by the schema. For data validity it focuses on semantics rather than format but, unlike the audit script, it treats and modifies (through the data wrangling functions) any data-related problems described in Part II of the project document.
  • after the data has been stored in .csv files, creates a database osm.db and the necessary tables matching the structure described in
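The export and import steps described in the two items above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's implementation: `NODE_SCHEMA`, `shape_node`, `export_nodes`, `import_nodes` and the `nodes` table layout are hypothetical, and the database is kept in memory rather than written to osm.db.

```python
import csv
import io
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical schema: column name -> converter used to validate and coerce values.
NODE_SCHEMA = {"id": int, "lat": float, "lon": float}

# Made-up input used only for this illustration.
SAMPLE_OSM = b"""<osm>
  <node id="1" lat="51.5" lon="-0.1"/>
  <node id="2" lat="51.6" lon="-0.2"/>
</osm>"""


def shape_node(elem):
    """Coerce a node's attributes to the schema types; raises on bad values."""
    return {field: convert(elem.attrib[field]) for field, convert in NODE_SCHEMA.items()}


def export_nodes(osm_source, csv_file):
    """Stream node elements from the OSM source into a CSV file."""
    writer = csv.DictWriter(csv_file, fieldnames=list(NODE_SCHEMA))
    writer.writeheader()
    for _, elem in ET.iterparse(osm_source, events=("end",)):
        if elem.tag == "node":
            writer.writerow(shape_node(elem))
            elem.clear()


def import_nodes(conn, csv_file):
    """Create the (assumed) nodes table and load the CSV rows into it."""
    conn.execute("CREATE TABLE nodes (id INTEGER PRIMARY KEY, lat REAL, lon REAL)")
    rows = [
        tuple(convert(row[field]) for field, convert in NODE_SCHEMA.items())
        for row in csv.DictReader(csv_file)
    ]
    conn.executemany("INSERT INTO nodes VALUES (?, ?, ?)", rows)
    conn.commit()


buffer = io.StringIO()
export_nodes(io.BytesIO(SAMPLE_OSM), buffer)
buffer.seek(0)

conn = sqlite3.connect(":memory:")  # the project writes to osm.db instead
import_nodes(conn, buffer)
node_count = conn.execute("SELECT COUNT(*) FROM nodes").fetchone()[0]
print(node_count)
```

Coercing every value through the schema at export time means a malformed attribute fails loudly before it reaches the database, which is one way to keep the .csv files and the tables in step.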

Support Features

  • contains all the data wrangling functions used by
  • takes an .osm file as an input and outputs a k-reduced version of it. k is a parameter that can be changed in the code.
  • schema of how the data will be exported from the .osm file to the database.
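One plausible reading of a "k-reduced" file is that it keeps every k-th top-level element of the original. The sketch below shows that idea. The function name `k_reduce`, the choice of top-level tags and the sample data are assumptions for illustration, and the project's sampling script may work differently.

```python
import io
import xml.etree.ElementTree as ET

# Made-up input: five nodes with ids 1..5.
SAMPLE_OSM = b"<osm>" + b"".join(
    b'<node id="%d" lat="0.0" lon="0.0"/>' % i for i in range(1, 6)
) + b"</osm>"


def k_reduce(source, k, tags=("node", "way", "relation")):
    """Return a new <osm> root holding every k-th top-level element (assumed behaviour)."""
    reduced = ET.Element("osm")
    seen = 0
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag in tags:
            if seen % k == 0:
                reduced.append(elem)  # keep this element, children included
            else:
                elem.clear()  # drop the contents of skipped elements
            seen += 1
    return reduced


reduced = k_reduce(io.BytesIO(SAMPLE_OSM), k=2)
kept_ids = [elem.get("id") for elem in reduced]
print(kept_ids)
```

A larger k gives a smaller sample, which is handy for testing the audit and export steps quickly before running them on the full file.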