# Wrangle-OpenStreetMap-Data

In this project, I use data munging techniques to assess the quality of OpenStreetMap (OSM) data for the city of Mumbai, with a focus on consistency and uniformity. The wrangling is done programmatically, using Python for most of the process and SQL for items that need further attention.

The dataset describes the city of Mumbai, India. Mumbai is the closest thing I have to a hometown in India, as I lived there for a good chunk of my childhood, so I was keen to look at it through this new, OpenStreetMap-based lens. The dataset is 66 MB and can be downloaded from here: https://mapzen.com/data/metro-extracts/metro/mumbai_india/

About the project

Scope

OpenStreetMap (OSM) is a collaborative project to create a free editable map of the world. The creation and growth of OSM have been motivated by restrictions on use or availability of map information across much of the world, and the advent of inexpensive portable satellite navigation devices.

In this project, I use data from https://www.openstreetmap.org/node/16173235 and data munging techniques to assess the quality of the data in terms of validity, accuracy, completeness, consistency and uniformity. Most of the wrangling is done programmatically in Python; the dataset is then loaded into a SQL database to examine any remaining elements that need attention. Finally, I perform some basic exploration and suggest some ideas for additional improvements.

Skills demonstrated

Assessment of the quality of data for validity, accuracy, completeness, consistency and uniformity. Parsing and gathering data from popular file formats such as .xml and .csv. Processing data from very large files that cannot be cleaned with spreadsheet programs. Storing, querying, and aggregating data using SQL.

Mumbai, India

https://www.openstreetmap.org/node/16173235 https://mapzen.com/data/metro-extracts/metro/mumbai_india/

Problems Encountered in the Map

Once the location was decided, I downloaded the full extract of the region and ran Python code to investigate any issues with the data. The following problems were discovered:

Street Names: Incomplete ('hanuman raod ___') or incorrect names ('Zhopadpatti'), along with street abbreviations ('rd.' instead of 'Road')

Postal Codes: Inconsistent postal code formats ('500023' and '120045') and incorrect post codes ('123')

To tackle these issues, I wrote Python scripts to clean each category of data. The auditing is explained in the Openstreetmap.ipynb notebook.
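As an illustration of the postal code cleaning, here is a minimal sketch. It assumes that a valid Indian PIN code is exactly six digits and does not start with 0; the function name `update_postcode` and the decision to return `None` for values that cannot be repaired are assumptions made for this example, not necessarily what the notebook does.

```python
import re

# Assumption: a valid Indian PIN code is six digits and does not start with 0.
PIN_RE = re.compile(r"^[1-9][0-9]{5}$")


def update_postcode(raw):
    """Return a cleaned six-digit postcode, or None if it cannot be repaired."""
    # Remove spaces and any other non-digit characters, e.g. "400 076" -> "400076".
    digits = re.sub(r"\D", "", raw)
    # Keep the value only if what remains looks like a PIN code.
    return digits if PIN_RE.match(digits) else None
```

With this sketch, a value such as '123' is rejected instead of being written to the database, while a value that only differs in spacing is normalized.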

I also created a mumbai_sample file, a subset of the mumbai_india file. It can be used to try out experiments before running them on the main OSM file, and it gives an idea of the OSM file format.
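One common way to build such a sample is to keep every k-th top-level element of the OSM file. The sketch below shows that approach; the function names and the choice of k are assumptions for illustration, and the actual sample may have been produced differently.

```python
import xml.etree.ElementTree as ET


def get_element(osm_file, tags=("node", "way", "relation")):
    """Yield each top-level element of interest without loading the whole file."""
    context = iter(ET.iterparse(osm_file, events=("start", "end")))
    _, root = next(context)  # the first event is the start of the root <osm> element
    for event, elem in context:
        if event == "end" and elem.tag in tags:
            yield elem
            root.clear()  # free memory used by elements already processed


def write_sample(osm_file, sample_file, k):
    """Write every k-th top-level element of osm_file to sample_file."""
    with open(sample_file, "wb") as output:
        output.write(b'<?xml version="1.0" encoding="UTF-8"?>\n')
        output.write(b"<osm>\n  ")
        for i, element in enumerate(get_element(osm_file)):
            if i % k == 0:
                output.write(ET.tostring(element, encoding="utf-8"))
        output.write(b"</osm>")
```

For example, `write_sample("mumbai_india.osm", "mumbai_sample.osm", 10)` would keep roughly one element in ten; both file names here are assumed.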

Different types of tags in the mumbai_india file

Afterwards, I matched the street types against a list of acceptable street types. Any type not in the list of expected types was added to a dictionary as a key, with the addresses that contain the problematic cases as its values.
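The audit step just described can be sketched as follows. The sketch assumes that the street type is the last word of the street name, and the `EXPECTED` list is a hypothetical example rather than the list used in the notebook.

```python
import re
from collections import defaultdict

# Assumption: the street type is the last word of the name, optionally ending in ".".
street_type_re = re.compile(r"\S+\.?$")

# Hypothetical list of acceptable street types.
EXPECTED = ["Road", "Marg", "Lane", "Street", "Highway", "Nagar"]


def audit_street_type(street_types, street_name):
    """Record street_name under its street type if that type is not expected."""
    m = street_type_re.search(street_name.strip())
    if m:
        street_type = m.group()
        if street_type not in EXPECTED:
            street_types[street_type].add(street_name)


# Usage: the unexpected types become keys, the offending names become values.
street_types = defaultdict(set)
for name in ["Hanuman Road", "hanuman raod", "LBS rd."]:
    audit_street_type(street_types, name)
```

After this loop, `street_types` contains the keys 'raod' and 'rd.', each mapped to the set of street names that use it.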

Having this overview allowed me to determine what my auditing function needed to accomplish. I created a dictionary for mapping/correcting purposes, 'mapping'. If my function came across a problematic street type, it would look up the corrected version in that dictionary and use it as the replacement.
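A minimal sketch of such a correction step is shown below. The entries in `mapping` and the function name `update_name` are illustrative assumptions; the real dictionary would be built from the results of the audit.

```python
import re

# Assumption: the street type is the last word of the name, optionally ending in ".".
street_type_re = re.compile(r"\S+\.?$")

# Hypothetical corrections; the real entries come from the audit results.
mapping = {
    "rd": "Road",
    "rd.": "Road",
    "Rd": "Road",
    "Rd.": "Road",
    "raod": "Road",
    "marg": "Marg",
}


def update_name(name, mapping):
    """Replace a problematic street type at the end of name with its correction."""
    name = name.strip()
    m = street_type_re.search(name)
    if m and m.group() in mapping:
        # Keep everything before the street type and append the corrected type.
        return name[:m.start()] + mapping[m.group()]
    return name
```

Names whose street type is not in the dictionary are returned unchanged, so the function is safe to apply to every street value.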

An analysis of the XML data, along with outside research on Google Maps and OpenStreetMap, was needed to identify the missing street types for the incomplete street names.


# Data Overview
In this section, I'll execute a number of SQL queries in order to analyze the dataset.
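Each query in this section opens a connection, runs one statement, and closes the connection again. That pattern could be wrapped in a small helper such as the one sketched here; `run_query` is a hypothetical name introduced for this example and is not used in the cells that follow.

```python
import sqlite3
from contextlib import closing


def run_query(db_path, query, params=()):
    """Open the database, run one query, and return all rows as a list of tuples."""
    # closing() guarantees that the connection is closed even if the query fails.
    with closing(sqlite3.connect(db_path)) as con:
        return con.execute(query, params).fetchall()
```

A call would then look like `run_query("mumbai_india.db", "SELECT user, COUNT(*) FROM nodes GROUP BY user ORDER BY COUNT(*) DESC LIMIT ?", (7,))`, with the limit passed as a bound parameter.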

Top 7 users by number of nodes contributed.

In [11]:
import sqlite3
from pprint import pprint

sql_file="mumbai_india.db"
con = sqlite3.connect(sql_file)
cur = con.cursor()
cur.execute('SELECT user,count(*) FROM nodes GROUP BY user ORDER BY count(*) DESC LIMIT 7')
all_rows=cur.fetchall()
pprint(all_rows)


con.close()

[(u'parambyte', 69327),
 (u'PlaneMad', 68363),
 (u'anushap', 62553),
 (u'Ashok09', 62208),
 (u'Narsimulu', 55611),
 (u'Srikanth07', 53795),
 (u'premkumar', 51097)]


Most common tag keys in ways_tags.

In [12]:
sql_file="mumbai_india.db"
con = sqlite3.connect(sql_file)
cur = con.cursor()
cur.execute('SELECT key,count(*) FROM ways_tags GROUP BY key ORDER BY count(*) DESC LIMIT 7')
all_rows=cur.fetchall()
pprint(all_rows)


con.close()

[(u'building', 223631),
 (u'highway', 40437),
 (u'name', 11741),
 (u'oneway', 4466),
 (u'source', 4142),
 (u'landuse', 3038),
 (u'levels', 2673)]


Total number of unique users.

In [43]:
sql_file="mumbai_india.db"
con = sqlite3.connect(sql_file)
cur = con.cursor()
cur.execute('SELECT COUNT(DISTINCT(e.uid)) FROM (SELECT uid FROM nodes UNION ALL SELECT uid FROM ways) e')
all_rows=cur.fetchall()
pprint(all_rows)


con.close()

[(1739,)]


Total number of users who contributed at most 8 times.

In [13]:
sql_file="mumbai_india.db"
con = sqlite3.connect(sql_file)
cur = con.cursor()
cur.execute('SELECT COUNT(*) FROM (SELECT e.user, COUNT(*) as num FROM (SELECT user FROM nodes UNION ALL SELECT user FROM ways) as e GROUP BY e.user HAVING num<=8)  u')
all_rows=cur.fetchall()
pprint(all_rows)


con.close()

[(1053,)]


Cuisines available in Mumbai and the number of restaurants serving each.

In [45]:
sql_file="mumbai_india.db"
con = sqlite3.connect(sql_file)
cur = con.cursor()
cur.execute('SELECT nodes_tags.value, COUNT(*) as num FROM nodes_tags JOIN (SELECT DISTINCT(id) FROM nodes_tags WHERE value="restaurant") as i ON nodes_tags.id=i.id WHERE nodes_tags.key="cuisine" GROUP BY nodes_tags.value ORDER BY num DESC')
all_rows=cur.fetchall()
pprint(all_rows)


con.close()

[(u'indian', 66),
 (u'regional', 21),
 (u'pizza', 14),
 (u'vegetarian', 13),
 (u'chinese', 12),
 (u'italian', 11),
 (u'burger', 5),
 (u'international', 5),
 (u'seafood', 4),
 (u'asian', 3),
 (u'South_Indian', 2),
 (u'all_types_of_food', 2),
 (u'lebanese', 2),
 (u'Indian,_Chinese_etc', 1),
 (u'Seafood', 1),
 (u'Vegetarian_Restaurant', 1),
 (u'american', 1),
 (u'cafe', 1),
 (u'chicken;fish;indian', 1),
 (u'chicken;kebab;indian', 1),
 (u'chicken_,fish,cafe', 1),
 (u'fast_food', 1),
 (u'grill;coffee_shop;asian;noodles;fish_and_chips;diner;chicken;italian_pizza;indian;curry;fish;french;friture;chinese;barbecue',
  1),
 (u'indian;south_indian', 1),
 (u'indian_aagri', 1),
 (u'italian_pizza;pizza', 1),
 (u'lebanese,_chinese,_indian', 1),
 (u'local', 1),
 (u'mediterranean', 1),
 (u'only_vegiterian', 1),
 (u'oriental', 1),
 (u'persian', 1),
 (u'sad_food', 1),
 (u'south Indian; Punjabi; agari; malwani; Chinese', 1),
 (u'south_indian', 1),
 (u'south_indian,_chinese', 1),
 (u'spanish', 1),
 (u'swee

Religions and the number of places of worship for each.

In [46]:
sql_file="mumbai_india.db"
con = sqlite3.connect(sql_file)
cur = con.cursor()
cur.execute('SELECT nodes_tags.value, COUNT(*) as num FROM nodes_tags  JOIN (SELECT DISTINCT(id) FROM nodes_tags WHERE value="place_of_worship") as i ON nodes_tags.id=i.id WHERE nodes_tags.key="religion" GROUP BY nodes_tags.value ORDER BY num DESC')
all_rows=cur.fetchall()
pprint(all_rows)


con.close()

[(u'hindu', 125),
 (u'muslim', 71),
 (u'christian', 34),
 (u'buddhist', 13),
 (u'jain', 6),
 (u'sikh', 4),
 (u'zoroastrian', 2),
 (u'jewish', 1)]


Types of leisure facilities and the number of nodes for each.

In [47]:
sql_file="mumbai_india.db"
con = sqlite3.connect(sql_file)
cur = con.cursor()
cur.execute('SELECT nodes_tags.value, count(*) as num FROM nodes_tags  WHERE nodes_tags.key="leisure" GROUP BY nodes_tags.value ORDER BY num DESC')
all_rows=cur.fetchall()
pprint(all_rows)


con.close()

[(u'park', 63),
 (u'playground', 20),
 (u'sports_centre', 15),
 (u'garden', 10),
 (u'fitness_centre', 8),
 (u'pitch', 6),
 (u'swimming_pool', 4),
 (u'aquarium', 1),
 (u'fitness_station', 1),
 (u'golf_course', 1),
 (u'stadium', 1)]


The ten most frequent street names after cleaning.

In [48]:
sql_file="mumbai_india.db"
con = sqlite3.connect(sql_file)
cur = con.cursor()
cur.execute('SELECT value, count(*) as num FROM nodes_tags WHERE key="street" GROUP BY value ORDER BY num DESC LIMIT 10')

all_rows=cur.fetchall()
pprint(all_rows)


con.close()

[(u'Hanuman Road', 78),
 (u'Yashavant Nagar Road', 29),
 (u'Hiranandani Estate', 24),
 (u'P.L. Lokhande Marg', 24),
 (u'New Link Road, Andheri West', 21),
 (u'LBS Marg', 18),
 (u'Road Number 3', 18),
 (u'Thane Ghodbunder Road', 18),
 (u'Eastern Express Highway', 14),
 (u'GD Somani Road', 13)]


Cities in and around Mumbai, with the number of node and way tags that reference each city.

In [49]:
sql_file="mumbai_india.db"
con = sqlite3.connect(sql_file)
cur = con.cursor()
cur.execute('SELECT tags.value, COUNT(*) as count FROM (SELECT * FROM nodes_tags UNION ALL  SELECT * FROM ways_tags) tags WHERE tags.key LIKE "%city" GROUP BY tags.value ORDER BY count DESC LIMIT 10')

all_rows=cur.fetchall()
pprint(all_rows)


con.close()

[(u'Mumbai', 607),
 (u'Bandra, Mumbai', 566),
 (u'mumbai', 187),
 (u'Virar West', 91),
 (u'Mulund (West)', 79),
 (u'Navi Mumbai', 70),
 (u'MUMBAI', 68),
 (u'Mulund (East)', 62),
 (u'Thane', 49),
 (u'Kharghar', 43)]


The ten least common amenities in nodes_tags.

In [12]:
sql_file="mumbai_india.db"
con = sqlite3.connect(sql_file)
cur = con.cursor()
cur.execute('SELECT value, COUNT(*) as count FROM nodes_tags WHERE key="amenity" GROUP BY value ORDER BY count LIMIT 10')

all_rows=cur.fetchall()
pprint(all_rows)


con.close()

[(u'car_rental', 1),
 (u'cold_storage', 1),
 (u'conference_centre', 1),
 (u'cyber_cafe', 1),
 (u'electric socket', 1),
 (u'internet_cafe', 1),
 (u'meditation_centre', 1),
 (u'parking_entrance', 1),
 (u'parking_space', 1),
 (u'picnic spot', 1)]


# Potential Additional Improvements

There are several areas in which the project could be improved in the future. The first is the completeness of the data. All of the above analysis is based on a dataset that covers a large part of Mumbai, but not only Mumbai. The reason is that there is no way to download a dataset for the whole of Mumbai without including parts of the neighboring cities. The analyst has to select either a part of the island/city or a wider area that includes parts of Thane and Ratnagiri. Also, because of the references between nodes, ways, and relations, the downloaded data extend much further than the actual selection.

As a future improvement, I would download a wider selection or the metro extract from MapZen and filter out the non-Mumbai nodes and their references. The initial filtering could be done by introducing latitude/longitude limits in the code to sort out most of the non-Mumbai nodes.
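A latitude/longitude filter of this kind could look like the sketch below. The bounding box values are rough placeholders chosen only for illustration and would need to be replaced with limits that actually match the Mumbai boundary; the function names are likewise assumptions.

```python
# Placeholder limits, NOT an authoritative boundary for Mumbai.
BOUNDS = {"min_lat": 18.85, "max_lat": 19.30, "min_lon": 72.75, "max_lon": 73.00}


def in_bounds(lat, lon, bounds=BOUNDS):
    """Return True if the coordinate lies inside the bounding box."""
    return (bounds["min_lat"] <= lat <= bounds["max_lat"]
            and bounds["min_lon"] <= lon <= bounds["max_lon"])


def filter_nodes(nodes, bounds=BOUNDS):
    """Keep only the nodes (dicts with 'id', 'lat', 'lon') inside the box."""
    return [n for n in nodes
            if in_bounds(float(n["lat"]), float(n["lon"]), bounds)]
```

The ids of the nodes that survive the filter could then be used to drop way references that point outside the box. A rectangle is only a first approximation, which is why the text above speaks of sorting out most, not all, of the non-Mumbai nodes.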

The second area with room for improvement is the exploratory analysis of the dataset. To mention just some of the explorations that could be carried out:


1. Popular franchises in the country (fast food, conventional stores, etc.)

2. Selection of a bank based on the average distance you have to walk to an ATM.

3. Which area has the biggest parks and recreation spaces.

The scope of the current project was the wrangling of the dataset, so all the above have been left for future improvement.
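As a starting point for the ATM idea, the sketch below computes great-circle distances with the haversine formula and averages the distance from a set of sample points to the nearest ATM. It measures straight-line distance, not walking distance along streets, and the function names and input format (lists of `(lat, lon)` tuples) are assumptions for this example.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def mean_nearest_distance(points, atms):
    """Average distance from each point to its nearest ATM, in metres."""
    total = 0.0
    for p in points:
        total += min(haversine_m(p[0], p[1], a[0], a[1]) for a in atms)
    return total / len(points)
```

Running `mean_nearest_distance` once per bank, with that bank's ATM coordinates taken from the nodes table, would give a simple figure for comparing banks.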

Increasing Submissions

Going through this dataset, my concerns were less with the cleanliness of the data, which I found surprisingly clean, and more with the lack of data. This part of Mumbai is too big to have as little information as it does. I think OpenStreetMap could go a long way in developing its map database if it took on certain initiatives to increase engagement with its service. One possible initiative would be for OpenStreetMap to form partnerships with educational institutions such as schools, or perhaps libraries, to engage students with its service. As a way to develop computer and internet literacy, computer-related courses could teach students how to use OpenStreetMap. This would expose them to online maps, GPS technology, how to participate in open source projects, and more, all while adding data to a free resource that could benefit the members of the community and the world.

Anticipated Problem: However, the concern here is that you might see an influx of dirty, unreliable data, particularly if the contributors aren't very computer literate or are only participating because it's a mandatory portion of a course. Naturally, the data that come from volunteers who get involved because of their genuine passion for the project would be of higher quality.

Ensuring Data Consistency

For data improvement, the biggest problem I came across in my data before cleaning it was the lack of a unified format for street types or phone numbers, or simply incomplete information. If OpenStreetMap enforced a strict format for street types, phone numbers, zip codes, etc., and ensured the format was appropriate for the city/country, there would be much cleaner data for analysis.



# Conclusion

It's clear from what we've seen that the Mumbai OpenStreetMap data is still incomplete and contains errors, but there is still much in this city to be found and explored. The upside is that much of the data that has been entered is fairly clean, so future OSM users who take on the task of improving the dataset with new data won't have much to worry about with regard to cleaning prior submissions.