<a id="1"></a> 
# OpenStreetMap Data Case Study
**Name: Jan FOERSTER**
***

In [1]:
%matplotlib notebook

import pprint

from jfo.schema import schema
from jfo.mySQL3dbConn import mySQLLITE3
from jfo.myXML import Investigation
from jfo.myExtension import myDict2CSVTransformer

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# sample file, size 80 MB
#xmlfile = 'OpenStreetMap-Hamburg-Sample.osm'

# original file, size 378 MB
xmlfile = 'OpenStreetMap-Hamburg-31.osm'

myOSMfile = Investigation(xmlfile)

<div class="alert alert-block alert-info">
<a id="11"></a>
## Map Area
***
This map covers my hometown and its immediate surroundings, so I am particularly interested in what querying the database reveals, and I would like to take the opportunity to contribute to its improvement on OpenStreetMap.org. The data comes from a customized MapZen extract.

- OpenStreet Map: Freie und Hansestadt Hamburg, Germany


- __[MapZen Extract](https://mapzen.com/data/metro-extracts/your-extracts/d4104a4aab6c)__

In [2]:
%%HTML
<div class="alert alert-block alert-info">
    <iframe width="100%" height="100%" frameborder="0" scrolling="no" marginheight="0" marginwidth="0" src="http://www.openstreetmap.org/export/embed.html?bbox=9.534759521484375%2C53.38291881008687%2C10.441131591796877%2C53.7438381234801&amp;layer=mapnik&amp;marker=53.56371244239144%2C9.987602233886719" style="border: 1px solid black"></iframe>
    <br/>
    <small>
        <a href="http://www.openstreetmap.org/?mlat=53.5637&amp;mlon=9.9876#map=11/53.5637/9.9876">Show Detailed Map</a>
    </small>
</div>

<div class="alert alert-block alert-info">
<a id="11"></a> 
## Data Infrastructure
***
The method <code>*'count_tags'*</code> of my class <code>*'Investigation'*</code> stores the number of occurrences of each unique tag in the chosen XML file in a private dictionary. The method <code>*'get_Tags'*</code> returns the full result.

<div class="alert alert-block alert-info">
### Unique Tags

<div class="alert alert-block alert-success">

XML tag | frequency
--------|----------
<code>*'bounds'*</code>|1
<code>*'member'*</code>|84855
<code>*'nd'*</code>|2118748
<code>*'node'*</code>|1609887
<code>*'osm'*</code>|1
<code>*'relation'*</code>|6524
<code>*'tag'*</code>|1574672
<code>*'way'*</code>|300719
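
The counting step can be sketched as follows. This is a minimal sketch, assuming the tags are counted while streaming the file with `xml.etree.ElementTree.iterparse`; the actual implementation of <code>*'count_tags'*</code> in the `jfo` package is not shown in this notebook, and both the function name `count_tags_sketch` and the sample data are hypothetical.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_tags_sketch(source):
    """Return {tag name: number of occurrences} for an XML file or file object."""
    counts = Counter()
    # iterparse streams the document instead of loading it in one go
    for _event, elem in ET.iterparse(source, events=("end",)):
        counts[elem.tag] += 1
        elem.clear()  # drop attributes and children that are no longer needed
    return dict(counts)


# Tiny in-memory stand-in for an .osm file (illustrative data only)
SAMPLE_OSM = b"""<osm version="0.6">
  <node id="1" lat="53.55" lon="9.99"><tag k="amenity" v="bench"/></node>
  <node id="2" lat="53.56" lon="10.00"/>
  <way id="10"><nd ref="1"/><nd ref="2"/><tag k="highway" v="path"/></way>
</osm>"""

tag_counts = count_tags_sketch(io.BytesIO(SAMPLE_OSM))
```

Passing a file name such as the `xmlfile` variable instead of the `BytesIO` object should work the same way, since `iterparse` accepts both.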

<div class="alert alert-block alert-info">
To understand which attributes are used by which XML tag, the method <code>*'view_Tags_Attributes'*</code> stores this mapping in a further private dictionary.

<div class="alert alert-block alert-success">

* <code>*'node'*</code>: 
    * <code>*'changeset'*</code>
    * <code>*'uid'*</code>
    * <code>*'timestamp'*</code>
    * <code>*'lon'*</code>
    * <code>*'version'*</code>
    * <code>*'user'*</code>
    * <code>*'lat'*</code>
    * <code>*'id'*</code>


* <code>*'nd'*</code>:
    * <code>*'ref'*</code>


* <code>*'bounds'*</code>:
    * <code>*'minlat'*</code>
    * <code>*'maxlon'*</code>
    * <code>*'minlon'*</code>
    * <code>*'maxlat'*</code>


* <code>*'member'*</code>:
    * <code>*'role'*</code>
    * <code>*'ref'*</code>
    * <code>*'type'*</code>


* <code>*'tag'*</code>:
    * <code>*'k'*</code>
    * <code>*'v'*</code>


* <code>*'relation'*</code>:
    * <code>*'changeset'*</code>
    * <code>*'uid'*</code>
    * <code>*'timestamp'*</code>
    * <code>*'version'*</code>
    * <code>*'user'*</code>
    * <code>*'id'*</code>


* <code>*'way'*</code>:
    * <code>*'changeset'*</code>
    * <code>*'uid'*</code>
    * <code>*'timestamp'*</code>
    * <code>*'version'*</code>
    * <code>*'user'*</code>
    * <code>*'id'*</code>


* <code>*'osm'*</code>:
    * <code>*'timestamp'*</code>
    * <code>*'version'*</code>
    * <code>*'generator'*</code>


<div class="alert alert-block alert-info">
### Patterns in the Tags

The <code>*'k'*</code> attribute of the <code>*'tag'*</code> element follows different patterns. In order to identify potential problems caused by the characters used, all keys have been classified as follows by the method <code>*'audit_Data_Chars'*</code>:

* __lower__: the regex pattern matches if the key contains only lowercase letters
* __lower_colon__: the regex pattern matches if the key contains lowercase letters separated by a colon
* __problemchars__: the regex pattern matches if the key contains characters that could cause issues, e.g. '=\+/&<>;\'"\?%#$@\,\. \t\r\n'
* __other__: the key matches none of the patterns above

<div class="alert alert-block alert-success">

Type of Patterns | Frequency
-----------------|-----------
<code>*'lower'*</code>|789084
<code>*'lower_colon'*</code>|736073
<code>*'other'*</code>|49515
<code>*'problemchars'*</code>|0
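
The classification can be illustrated with three regular expressions. This is only a sketch: the exact patterns used by <code>*'audit_Data_Chars'*</code> are not shown in this notebook, so the expressions below are assumptions (they also allow underscores in keys), and the function name `classify_key` is hypothetical.

```python
import re

# Assumed patterns; the ones used in the jfo package may differ
LOWER = re.compile(r'^([a-z]|_)*$')
LOWER_COLON = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
PROBLEMCHARS = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')


def classify_key(key):
    """Return the pattern class of a tag key: the first matching class wins."""
    if LOWER.search(key):
        return "lower"
    if LOWER_COLON.search(key):
        return "lower_colon"
    if PROBLEMCHARS.search(key):
        return "problemchars"
    return "other"


examples = {key: classify_key(key)
            for key in ["highway", "addr:street", "bad key", "name_1"]}
```

Keys with digits, uppercase letters or more than one colon fall into the `other` class under these assumed patterns.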

<div class="alert alert-block alert-info">
### Users, who have contributed to OSM
The <code>*'user'*</code> attribute of the <code>*'node'*</code> tag shows which users have contributed the most and the fewest nodes to OSM. The method <code>*'audit_Unique_Values('node', 'user')'*</code> lists all users with their number of contributions.

<div class="alert alert-block alert-success">

User | Frequency of Contributions
-----|---------------------------
<code>*'fahrrad'*</code>|242516
<code>*'svbr'*</code>|232474
<code>*'Abendstund'*</code>|187830
<code>*'sundew'*</code>|172620
<code>*'Divjo'*</code>|148136
<code>*'findichgut'*</code>|128704
<code>*'glühwürmchen'*</code>|120734
<code>*'vademecum'*</code>|103042
<code>*'Joke123'*</code>|91336
<code>*'simlox'*</code>|89440
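
Counting contributions per user amounts to counting the values of one attribute on one tag. The following is a minimal sketch under that assumption; `unique_values_sketch` is a hypothetical stand-in for <code>*'audit_Unique_Values'*</code>, whose implementation is not shown here, and the user names are invented.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def unique_values_sketch(source, tag, attribute):
    """Count how often each value of `attribute` occurs on elements named `tag`."""
    counts = Counter()
    for _event, elem in ET.iterparse(source, events=("start",)):
        if elem.tag == tag and attribute in elem.attrib:
            counts[elem.attrib[attribute]] += 1
    return counts


# Tiny in-memory stand-in for an .osm file (illustrative data only)
SAMPLE_OSM = b"""<osm>
  <node id="1" user="alice"/>
  <node id="2" user="bob"/>
  <node id="3" user="alice"/>
  <way id="10" user="bob"/>
</osm>"""

node_users = unique_values_sketch(io.BytesIO(SAMPLE_OSM), "node", "user")
top_users = node_users.most_common(1)  # most active user first
```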

<div class="alert alert-block alert-info">
The <code>*'user'*</code> attribute of the <code>*'way'*</code> tag shows which users have contributed the most and the fewest ways to OSM. The method <code>*'audit_Unique_Values('way', 'user')'*</code> lists all users with their number of contributions.

<div class="alert alert-block alert-info">
<a id="12"></a> 
## Problems Encountered in the Map
***
After initially downloading a small sample size of the Hamburg area and running it against a provisional Python file, I noticed __2 main problems with the data__, which I will discuss in the following order:

- __mixed phone number notations__, e.g. "+49040123456789", "004940123456789", "040-123456789", etc.

- __house numbers entered in some street name fields__

__Postal codes__ have been analyzed and found to be __consistent__.

<div class="alert alert-block alert-info">
<a id="121"></a>
### Phone Numbers
***
The __major problem__ is that phone numbers are __not written in a standardized format__ such as "+49 (0)40 XXXXXXX": some contributors included the international prefix and others did not. Even where the international prefix was included, the rest of the number was not formatted consistently.

<div class="alert alert-block alert-success">

Old Phone Number | Corrected Phone Number
-----------------|--------------------------
'+494028787174'|'+49 (0)40 28787174'
'+494028806718'|'+49 (0)40 28806718'
'+4940862978'|'+49 (0)40 862978'
'04054880575'|'+49 (0)40 54880575'
'+494065067790'|'+49 (0)40 65067790'
'04036005520'|'+49 (0)40 36005520'
'+494079027754'|'+49 (0)40 30389505'
'+49404142760'|'+49 (0)40 4142760'
'+4940545077'|'+49 (0)40 545077'
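
The correction can be sketched as a small normalization function. This is an illustration under stated assumptions, not the cleaning code actually used: it only handles numbers with the Hamburg area code 040, returns `None` for everything else, and the function name `normalize_hamburg_phone` is hypothetical.

```python
import re


def normalize_hamburg_phone(raw):
    """Return '+49 (0)40 <subscriber>' or None if `raw` is not a 040 number."""
    digits = re.sub(r"[^0-9]", "", raw)  # drop '+', spaces, dashes, brackets
    # Strip the international prefix, written either as 0049 or as +49
    if digits.startswith("0049"):
        digits = digits[4:]
    elif raw.strip().startswith("+") and digits.startswith("49"):
        digits = digits[2:]
    # Strip the trunk zero, e.g. in 040... or +49 (0)40...
    if digits.startswith("0"):
        digits = digits[1:]
    # What remains must be the area code 40 followed by the subscriber number
    if not digits.startswith("40") or len(digits) <= 2:
        return None
    return "+49 (0)40 " + digits[2:]


cleaned = [normalize_hamburg_phone(number) for number in
           ["+494028787174", "04054880575", "004940123456789", "040-123456789"]]
```

Numbers that are already in the target format pass through unchanged, so the function can be applied repeatedly without side effects.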

<div class="alert alert-block alert-info">
<a id="122"></a>
### Postal Codes
***
All postal codes have been indicated correctly. __No cleaning has been required.__

<div class="alert alert-block alert-success">

Old Postal Code | Unchanged Postal Code
----------------|--------------------
'25488'|'25488'
'25482'|'25482'
'25469'|'25469'
'25462'|'25462'
'25421'|'25421'
'22880'|'22880'
'22869'|'22869'
'22769'|'22769'
'22767'|'22767'
'22765'|'22765'
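
A consistency check of this kind can be as simple as testing for exactly five digits, which is the format of German postal codes. The sketch below assumes that this is the only criterion; the function name `is_valid_postcode` is hypothetical, and the check does not verify that a code actually belongs to the Hamburg region.

```python
import re

FIVE_DIGITS = re.compile(r"[0-9]{5}")


def is_valid_postcode(value):
    """True if `value` consists of exactly five digits (surrounding spaces ignored)."""
    return FIVE_DIGITS.fullmatch(value.strip()) is not None


checks = {code: is_valid_postcode(code) for code in ["22769", "2276", "D-22769"]}
```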

<div class="alert alert-block alert-info">
<a id="123"></a>
### Street Names
***
All street names __are correct, with only three exceptions__ in which a house number was entered in place of the street name. These three house numbers have been deleted.

<div class="alert alert-block alert-success">

XML Street Names | Corrected Street Names
-----------------|-----------------------
<code>*'108'*</code>|<code>*''*</code>
<code>*'65'*</code>|<code>*''*</code>
<code>*'85'*</code>|<code>*''*</code>
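
The cleaning rule from the table can be sketched as follows, assuming that a street value counts as a misplaced house number when it consists only of digits with an optional single letter suffix. The pattern and the function name `clean_street_name` are assumptions made for illustration.

```python
import re

# Digits, optionally followed by one letter, e.g. "108" or "12a" (assumed rule)
HOUSE_NUMBER_ONLY = re.compile(r"[0-9]+\s*[a-zA-Z]?")


def clean_street_name(value):
    """Return '' for values that are only a house number, else the trimmed name."""
    stripped = value.strip()
    if HOUSE_NUMBER_ONLY.fullmatch(stripped):
        return ""
    return stripped


cleaned_streets = [clean_street_name(name) for name in ["108", "65", "Jungfernstieg"]]
```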


<div class="alert alert-block alert-info">
<a id="13"></a>
## Data Overview and Additional Ideas
***

### Create SQLite3 database to save XML content

In [3]:
#dbUdacity = mySQLLITE3('C:\\Users\\th65jt\\sqlite3\\', 'osmHamburg.db')
dbUdacity = mySQLLITE3('C:\\Users\\JanUser\\sqlite\\sqlite_windows\\', 'osmHamburg.db')

#dbUdacity.execute_SQLStatementsFile('C:\\Users\\JanUser\\Documents\\Udacity\\Sybullus\\L3 MongoDB\\Project\\OSM_Drop_Tables.sql')
#dbUdacity.commit_SQL()

#dbUdacity.execute_SQLStatementsFile('OSM_Create_Tables.sql')
#dbUdacity.import_preformatted_CSV('.\\nodes.csv', 'nodes')
#dbUdacity.import_preformatted_CSV('.\\nodes_tags.csv', 'nodes_tags')
#dbUdacity.import_preformatted_CSV('.\\ways.csv', 'ways')
#dbUdacity.import_preformatted_CSV('.\\ways_tags.csv', 'ways_tags')
#dbUdacity.import_preformatted_CSV('.\\ways_nodes.csv', 'ways_nodes')
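
The CSV import step can be sketched with the standard `sqlite3` and `csv` modules. The sketch is not the code behind <code>*'import_preformatted_CSV'*</code>; the function name `import_csv_sketch`, the column layout of `nodes_tags` (`id`, `key`, `value`, `type`) and the sample rows are assumptions, and an in-memory database is used so that nothing is written to disk.

```python
import csv
import io
import sqlite3


def import_csv_sketch(conn, csv_file, table, columns):
    """Insert all rows of a CSV file with a header line into `table`."""
    reader = csv.DictReader(csv_file)
    placeholders = ", ".join("?" for _ in columns)
    # Table and column names must come from trusted code, never from user input
    sql = "INSERT INTO {} ({}) VALUES ({})".format(
        table, ", ".join(columns), placeholders)
    conn.executemany(sql, ([row[c] for c in columns] for row in reader))
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes_tags (id INTEGER, key TEXT, value TEXT, type TEXT)")

# Stand-in for nodes_tags.csv (illustrative rows only)
SAMPLE_CSV = io.StringIO(
    "id,key,value,type\n"
    "1,amenity,restaurant,regular\n"
    "2,street,Jungfernstieg,addr\n"
)
import_csv_sketch(conn, SAMPLE_CSV, "nodes_tags", ["id", "key", "value", "type"])
row_count = conn.execute("SELECT COUNT(*) FROM nodes_tags").fetchone()[0]
```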

In [4]:
dbUdacity.query_AllResults('''SELECT nodes_tags.key                                
                                , count(nodes_tags.key) as sum
                           FROM nodes_tags
                           GROUP BY nodes_tags.key
                           ORDER BY sum DESC
                           LIMIT 10;''')

[(u'street', 34999),
 (u'housenumber', 34616),
 (u'postcode', 28577),
 (u'city', 26874),
 (u'natural', 24232),
 (u'name', 22760),
 (u'country', 22093),
 (u'entrance', 15777),
 (u'highway', 15071),
 (u'amenity', 15049)]

<div class="alert alert-block alert-info">
### Looking for 'restaurants' in the database

There seem to be some inconsistencies in how <code>*'restaurants'*</code> are mapped: there is a <code>*'tag'.key = 'restaurant'*</code> as well as a <code>*'tag'.key = 'amenity'*</code> with a <code>*'tag'.value = 'restaurant'*</code>.

In [5]:
dbUdacity.query_AllResults('''SELECT nodes_tags.value
                                    , count (nodes_tags.key) as Sum    
                                FROM nodes_tags                                    
                                WHERE nodes_tags.key = "restaurant"                                    
                                GROUP BY nodes_tags.value
                                ORDER BY Sum DESC
                            ;''')

[(u'cafe', 1), (u'yes', 1)]

In [6]:
dbUdacity.query_AllResults('''SELECT nodes_tags.value
                                    , count (nodes_tags.key) as Sum    
                                FROM nodes_tags                                    
                                WHERE nodes_tags.key = "amenity"                                    
                                GROUP BY nodes_tags.value
                                ORDER BY Sum DESC
                                LIMIT 10
                            ;''')

[(u'bench', 2531),
 (u'restaurant', 1527),
 (u'post_box', 1015),
 (u'vending_machine', 838),
 (u'recycling', 834),
 (u'parking', 763),
 (u'cafe', 669),
 (u'fast_food', 638),
 (u'waste_basket', 463),
 (u'bicycle_parking', 462)]
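
Both ways of mapping a restaurant can be covered in a single query. The following sketch uses an in-memory database with invented rows and assumes that `nodes_tags` has an `id` column identifying the node. Note that a `restaurant` key does not guarantee a restaurant (one of the two values found above is `cafe`), so the result is an upper bound rather than an exact count.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes_tags (id INTEGER, key TEXT, value TEXT, type TEXT)")
conn.executemany(
    "INSERT INTO nodes_tags VALUES (?, ?, ?, ?)",
    [
        (1, "amenity", "restaurant", "regular"),
        (1, "restaurant", "yes", "regular"),  # same node tagged both ways
        (2, "restaurant", "yes", "regular"),
        (3, "amenity", "bench", "regular"),
    ],
)

# DISTINCT avoids counting a node twice when it carries both kinds of tag
RESTAURANT_QUERY = """
    SELECT COUNT(DISTINCT id)
    FROM nodes_tags
    WHERE (key = 'amenity' AND value = 'restaurant')
       OR key = 'restaurant';
"""
restaurant_nodes = conn.execute(RESTAURANT_QUERY).fetchone()[0]
```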

<div class="alert alert-block alert-info">
### Looking for 'religion' in the database

In general, the __figures identified do not seem sufficient to cover all religious buildings__ in the region of Hamburg.

In [7]:
dbUdacity.query_AllResults('''SELECT nodes_tags.value
                                    , count (nodes_tags.key) as Sum    
                                FROM nodes_tags                                    
                                WHERE nodes_tags.key = "religion"                                    
                                GROUP BY nodes_tags.value
                                ORDER BY Sum DESC
                            ;''')

[(u'christian', 51),
 (u'muslim', 23),
 (u'all', 1),
 (u'buddhist', 1),
 (u'hindu', 1),
 (u'scientologist', 1),
 (u'sikh', 1)]

<div class="alert alert-block alert-info">
<a id="3"></a>
# Conclusion
***
The __OpenStreetMap data of Hamburg, Germany is of reasonable quality__. However, its __overall level of systematic completeness is very low__, as can be seen when the total number of rows for streets is compared with the total number of rows for postcodes, phone numbers, amenities, restaurants or any other meaningful piece of information.

Furthermore, the type of information to be provided for a given location is not systematically required during input, which puts OpenStreetMap at a disadvantage compared to competitors like Google Maps.

In the detailed analysis, human input typos in street names are so rare that some sort of data cleaning has evidently already occurred. The same holds for postal (ZIP) codes, all of which have the correct number of digits. __The only big remaining issue is the lack of a standardized format for phone numbers__.

Although the open source community has put a great deal of effort into providing a very good free map, __there are still a lot of gaps compared to competitors like Google Maps__, e.g.

* indicated __public transportation stops__,
* __traffic jams__,
* __touristic attractions__,
* __company locations__, including a direct __hyperlink to the company homepage__ etc.

## Additional suggestions & Ideas

* __Systematic level of available information__

    In general, the __type of information available is rather unsystematic__ and seems to depend purely on user
    interest.
    
    For example, 1527 nodes are tagged with <code>*'nodes_tags'.key = 'amenity'*</code> and <code>*'nodes_tags'.value = 'restaurant'*</code>. However, 2 further nodes use <code>*'nodes_tags'.key = 'restaurant'*</code> instead.
    
    * __benefits:__
        * Higher level of __data quality__
        * Higher level of users and __customer satisfaction__
        * __Further services__ could be offered, some services __with fees__ e.g. to support OpenStreetMap developments        
        <br>
    * __anticipated issues:__
        * Due to the high level of __data sensitivity awareness__, in particular in Germany, __users and customers may not want to provide__ all of the desired __information__ at once.
        

* __Typos and well-formatted input__

    Typically, cleaning activities after data entry should be avoided, as there are easy-to-__implement prevention measures__,
    e.g. GUI / CSS format patterns.
    
    Usually, format validation should take place directly at the input prompt, to cross-check whether the entered values
    match the defined format. For this, it is only necessary to define an associated CSS class in HTML5 and a
    stored pattern in the database.
    
    * __benefits:__
        * This __prevents any user from entering undesired input__. Only validated, well-formatted information is stored in the database in the background.
        
        <br>
    * __anticipated issues:__
        * More coding effort, introducing __more complexity and interaction between GUI, frontend, and backend code__ and software.
    
    
* __Marketshare__

    Similar to Google, OpenStreetMap should __set up an agreement with manufacturers of mobile devices for pre-loading__
    OpenStreetMap and encouraging users to systematically provide more information on pre-defined areas of interest.
    In order not to "overload" users with requests, a __roll-out concept__ should be defined first, specifying which __type of
    information__ should be requested first.
    
    * __benefits:__
        * __Gaining further market share__ from Google Maps
        * __Becoming known__ to a bigger audience
        
        <br>
    * __anticipated issues:__
        * __license fees__ as an additional cost, which could be __paid from services with fees__


<a id="4"></a>
# Bibliography & Files
***

- OSM files
    - OpenStreetMap-Hamburg-31.osm
    - OpenStreetMap-Hamburg-Sample.osm


- Python classes & methods
    - class myXML.py
    - class mySQL3dbConn.py
    - module myExtension.py, containing class 'myUnicodeDictWriter' inherited from csv.DictWriter and
      class 'myDict2CSVTransformer'
    - module schema.py
    - __[Sphinx Documentation jfo package](./doc/modules.html)__
    

- csv files
    - nodes.csv
    - nodes_tags.csv
    - ways.csv
    - ways_nodes.csv
    - ways_tags.csv


- SQL files
    - OSM_Create_Tables.sql
    - OSM_Drop_Tables.sql
    - OSM_Import_CSV.sql
    
    
- osmHamburg.db SQLite3 database


- Readme.md