# OpenStreetMap Project - Data Wrangling with SQL

[OpenStreetMap](https://wiki.openstreetmap.org/wiki/Main_Page) is a free world map to which anyone can contribute geographic data. Because anyone can contribute, the information it contains may not always be accurate.

The data set used for this project is located at https://www.openstreetmap.org/export#map=12/32.7831/-96.8067 

I picked this location because it is of interest to me. I recently relocated to the Dallas metro area, and getting around and discovering things in a new city can be challenging. We will explore the accuracy of the information in this area and correct crucial data.

## Overview of the Data

A count of the element tags in Dallas.osm shows that there are over 3 million node and way data points.

### Mapparser.py
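
A count like this could be produced with a streaming parse. The sketch below is a minimal illustration and not necessarily what Mapparser.py does; the function name `count_tags` is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_tags(source):
    """Count how often each element name occurs in an OSM XML file.

    `source` may be a file path or a binary file-like object.
    """
    counts = Counter()
    # iterparse yields each element once its closing tag has been read,
    # so the whole file never has to be parsed into one tree up front.
    for _, elem in ET.iterparse(source):
        counts[elem.tag] += 1
        elem.clear()  # drop the element's attributes and children
    return counts
```

Calling `count_tags("Dallas.osm")` would return a mapping from names such as `node`, `way`, and `tag` to their counts.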

We started by looking at the elements and categorizing them according to data patterns. It is rather surprising to see that there are no problem characters associated with the data.

### Tags.py
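
One way to do this kind of categorization is to test each tag key against a few regular expressions. The sketch below assumes the check is applied to the `k` attribute of `tag` elements; the three patterns and the category names are illustrative assumptions, not taken from Tags.py itself.

```python
import re
import xml.etree.ElementTree as ET

# Assumed patterns: all lowercase, lowercase with one colon, and
# keys containing characters that are awkward in a database.
LOWER = re.compile(r'^([a-z]|_)*$')
LOWER_COLON = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
PROBLEMCHARS = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')


def key_type(key):
    """Return the category name for a single tag key."""
    if LOWER.search(key):
        return "lower"
    if LOWER_COLON.search(key):
        return "lower_colon"
    if PROBLEMCHARS.search(key):
        return "problemchars"
    return "other"


def categorize_keys(source):
    """Count tag keys per category in an OSM XML file or file-like object."""
    counts = {"lower": 0, "lower_colon": 0, "problemchars": 0, "other": 0}
    for _, elem in ET.iterparse(source):
        if elem.tag == "tag":
            counts[key_type(elem.get("k", ""))] += 1
    return counts
```

A `problemchars` count of zero from `categorize_keys("Dallas.osm")` would match the observation above.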

Next we took a look at unique user contributions. We can see that 841 unique users contributed to the dataset.

### Users.py
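
Counting contributors can be sketched by collecting the `uid` attribute from every element that carries one. This is a minimal illustration with a hypothetical function name, not necessarily the approach in Users.py.

```python
import xml.etree.ElementTree as ET


def unique_users(source):
    """Return the set of distinct `uid` values in an OSM XML file."""
    users = set()
    for _, elem in ET.iterparse(source):
        uid = elem.get("uid")  # None for elements without a uid attribute
        if uid is not None:
            users.add(uid)
        elem.clear()
    return users
```

`len(unique_users("Dallas.osm"))` would then give the number of unique contributors.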

With a slightly better view of the data, we start by taking a look at what can be one of the most important areas of the data: streets. It won't matter what cool things we find to do in the city if we can't figure out how to get there. As we can see, there are numerous problems with the data.

### Audit.py
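
A street audit of this kind can be sketched by taking the last word of each `addr:street` value and grouping the names whose last word is not an expected street type. The `EXPECTED` set and the regular expression below are illustrative assumptions rather than the contents of Audit.py.

```python
import re
from collections import defaultdict
import xml.etree.ElementTree as ET

# Last whitespace-delimited token of the name, e.g. "Rd." in "Main Rd."
STREET_TYPE_RE = re.compile(r'\b\S+\.?$')

# Assumed list of street types that need no correction.
EXPECTED = {"Street", "Avenue", "Boulevard", "Drive", "Court", "Place",
            "Lane", "Road", "Trail", "Parkway"}


def audit(source, expected=EXPECTED):
    """Group street names by their unexpected final token."""
    unexpected = defaultdict(set)
    for _, elem in ET.iterparse(source):
        if elem.tag not in ("node", "way"):
            continue
        for tag in elem.iter("tag"):
            if tag.get("k") == "addr:street":
                name = tag.get("v", "")
                match = STREET_TYPE_RE.search(name)
                if match and match.group() not in expected:
                    unexpected[match.group()].add(name)
        elem.clear()  # the node or way has been fully processed
    return unexpected
```

Names without any suffix, such as a bare `Nile`, show up under their own last word, which is how entries like the ones discussed next would surface.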

Because this is a very large set of possible problems, we need to approach it carefully. We discovered that Dallas metro street names can be quite unusual, so we start with the clearer entries.

The entries for Avenue[A-Q] are correct and will not be altered. Those ending with abbreviations like Rd. or Blvd. are also straightforward to clean, so we will add those to the mapping.

Now we will look at the less clear entries.
One entry that is unique to the area is Via Estrella. The name is accurate, is not abbreviated, and is in the correct order, as are all associated listings (https://goo.gl/maps/ZvZEQu72Nos). Also in this area are a number of streets that do not have a suffix associated with them. Those have been ignored.

Nile is another oddity, and here in Dallas it can cause a barrage of bug repellent to be aimed in your direction.  In this case, Nile should be Nile Drive.

Pleasant Mound is a cemetery in Dallas. The entry was corrected to reflect its location on South Buckner Boulevard.

Of the remaining 20 entries, I was unable to determine the correct suffix for the following 8:

- Halsey
- Slick Rock Chase
- Millmar
- Lynnacre
- Wood Thrush 
- Mt Vernon 
- Stonecourt
- Hunterwood

These entries are unusual in that the streets lie within a short distance of each other, and each name is either correct as written in one city or carries a suffix in a neighboring city.

### update_street_name.py
All corrections were added to the mapping. This was more straightforward than handling each case with additional scripting.
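
A mapping-based correction can be sketched as follows. The three mapping entries are examples drawn from the cases discussed above; the actual mapping in update_street_name.py is larger, and the function name here is an assumption.

```python
import re

# Last whitespace-delimited token of the street name.
STREET_TYPE_RE = re.compile(r'\b\S+\.?$')

# Illustrative subset of the mapping: abbreviations and one missing suffix.
MAPPING = {
    "Rd.": "Road",
    "Blvd.": "Boulevard",
    "Nile": "Nile Drive",
}


def update_name(name, mapping=MAPPING):
    """Replace the final token of `name` if the mapping has a correction."""
    match = STREET_TYPE_RE.search(name)
    if match and match.group() in mapping:
        return name[:match.start()] + mapping[match.group()]
    return name  # names such as "Via Estrella" pass through unchanged
```

Keeping every correction in one dictionary means that a new problem entry only needs a new key, not a new branch of code.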

### data_wrangling_schema.sql
Creating the database with data_wrangling_schema.sql resulted in an extra entry in each table. The nodes and ways imports also returned a datatype mismatch error, and I was only able to create those tables when 'PRIMARY KEY' was removed.

### db_create.py
When we created the database with the Python script db_create.py instead, it was successful, and the queries returned the expected results.
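
One likely explanation for the difference, which this write-up does not confirm, is the CSV header row. If a header line is imported as data it becomes an extra row, and a value such as `id` cannot be stored in an `INTEGER PRIMARY KEY` column, which SQLite reports as a datatype mismatch. The sketch below uses a simplified, assumed `nodes` table and a hypothetical `load_nodes` helper; it is not the contents of db_create.py.

```python
import csv
import sqlite3


def load_nodes(conn, csv_file):
    """Load node rows from an open CSV file and return how many were added."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS nodes ("
        "id INTEGER PRIMARY KEY, lat REAL, lon REAL, user TEXT)"
    )
    # DictReader consumes the first line as column names, so the header
    # never reaches the table as a data row.
    reader = csv.DictReader(csv_file)
    rows = [(int(r["id"]), float(r["lat"]), float(r["lon"]), r["user"])
            for r in reader]
    conn.executemany("INSERT INTO nodes VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)
```

With this approach the primary key can stay in the schema, because only real data rows are inserted.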

### Top 10 data creators
We first take a look at which user made the most contributions to the dataset.  Here we can see that **Andrew Matheny** contributed the most by a very large margin.  It is rather interesting that the bot was behind him by a margin of **4-to-1**.  
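
A query for this ranking might look like the sketch below. It assumes `nodes` and `ways` tables that each have a `user` column; the table and column names are assumptions about the schema, and the function name is hypothetical.

```python
import sqlite3

TOP_USERS_SQL = """
SELECT e.user, COUNT(*) AS num
FROM (SELECT user FROM nodes
      UNION ALL
      SELECT user FROM ways) AS e
GROUP BY e.user
ORDER BY num DESC, e.user
LIMIT ?;
"""


def top_contributors(conn, limit=10):
    """Return (user, contribution count) pairs, largest first."""
    return conn.execute(TOP_USERS_SQL, (limit,)).fetchall()
```

`UNION ALL` is used rather than `UNION` so that repeated user names are kept and counted.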

### Dallas Metro top 5
To give us a sense of the area we take a look at the top 5 cities in this dataset. Dallas has grown rapidly since I arrived, so it is no surprise that this portion of the metro overlaps many cities.
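
The city ranking can be sketched in the same style. The query assumes `nodes_tags` and `ways_tags` tables with `key` and `value` columns, and that city names are stored under the key `city`; all of these names are assumptions about the schema.

```python
import sqlite3

TOP_CITIES_SQL = """
SELECT t.value AS city, COUNT(*) AS num
FROM (SELECT key, value FROM nodes_tags
      UNION ALL
      SELECT key, value FROM ways_tags) AS t
WHERE t.key = 'city'
GROUP BY t.value
ORDER BY num DESC, city
LIMIT ?;
"""


def top_cities(conn, limit=5):
    """Return (city, tag count) pairs, largest first."""
    return conn.execute(TOP_CITIES_SQL, (limit,)).fetchall()
```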

### It's all about those Amenities
Then we took a look at what amenities the area offers.  This being Texas, no one should be surprised that religion has the most entries.  
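
An amenity count could be sketched as below, again assuming `nodes_tags` and `ways_tags` tables with `key` and `value` columns. The `include_ways` switch is a hypothetical addition for comparing node-only counts with counts that also cover ways.

```python
import sqlite3


def amenity_counts(conn, include_ways=False, limit=10):
    """Return (amenity, count) pairs, largest first."""
    source = "SELECT key, value FROM nodes_tags"
    if include_ways:
        source += " UNION ALL SELECT key, value FROM ways_tags"
    sql = (
        "SELECT t.value, COUNT(*) AS num FROM (" + source + ") AS t "
        "WHERE t.key = 'amenity' "
        "GROUP BY t.value ORDER BY num DESC, t.value LIMIT ?"
    )
    return conn.execute(sql, (limit,)).fetchall()
```

Setting `include_ways=True` also counts amenities that are mapped as ways, which is common for features such as parking lots.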

### If religion is your thing
When we take a closer look we can see that the majority of entries are categorized as christian.

Out of curiosity we take an even deeper look to see all entries.  
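
Breaking one amenity down by a second tag needs a self-join on the tag table: first find the ids tagged with the amenity, then count another key for those ids. The sketch assumes a `nodes_tags` table with `id`, `key`, and `value` columns; the function name `breakdown` is hypothetical.

```python
import sqlite3

BREAKDOWN_SQL = """
SELECT nodes_tags.value, COUNT(*) AS num
FROM nodes_tags
JOIN (SELECT DISTINCT id FROM nodes_tags
      WHERE key = 'amenity' AND value = ?) AS matched
  ON nodes_tags.id = matched.id
WHERE nodes_tags.key = ?
GROUP BY nodes_tags.value
ORDER BY num DESC, nodes_tags.value;
"""


def breakdown(conn, amenity, key):
    """Count values of `key` among nodes tagged with the given amenity."""
    return conn.execute(BREAKDOWN_SQL, (amenity, key)).fetchall()
```

For example, `breakdown(conn, "place_of_worship", "religion")` lists religions, and the same function with `("restaurant", "cuisine")` lists cuisines.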

### Park it!
We then take a wider look at amenities and see that parking is a big one. A common problem in any major metro is finding a place to park, and Dallas is no exception. Food overall is a close second. Personally, food and coffee are my favorite topics.

### Food!
Speaking of food, we dive in and discover that Mexican cuisine, followed closely by pizza and American, is the most popular in the metro. One of the benefits of living in a metro area is the diversity of the food.

### More about FOOD!
Finally, we take a slightly different view of food to see that Mexican is still the top choice, and then the lines begin to blur a bit in favor of American and regional cuisines.

## Other Thoughts and Ideas

As you can see, the possibilities for discovery in the Dallas metro dataset are endless. One possible approach is investigating how restaurants are broken down categorically. Texas is all about barbeque, and it is really surprising that there is only one entry for this area. It could be that this area doesn't have a good representation of barbeque restaurants, but I doubt that. A quick look at Yelp tells me that the dataset is indeed incomplete here.

One of the most popular apps for getting places in the city is Waze. Google Maps is too inconsistent and often does not take into account construction and other possible delays, which is necessary for getting around efficiently. OpenStreetMap has the potential to provide better information with quicker, accurate updates, but this dataset suggests that not many people are aware that it exists. Having only 841 unique contributors across a large portion of the metro suggests that almost no one knows about it. Perhaps if more people became aware of the rich data points available and knew that they could play an active role in the discovery of this amazing city, we could see greater accuracy. Who knows what treasures may be discovered!

### Sources
https://gist.github.com/carlward/54ec1c91b62a5f911c42#file-sample_project-md <br>
https://www.w3schools.com/sql/default.asp <br>
https://www.sqlite.org/docs.html <br>
https://discussions.udacity.com/t/are-the-csvs-supposed-to-be-double-spaced/285305 <br>
https://discussions.udacity.com/t/nodes-csv-1-insert-failed-datatype-mismatch/239638