# OpenStreetMap Data Case Study

#### Author: Denis Pastory

### MAP AREA
Dar es Salaam, Tanzania, East Africa

- [https://www.openstreetmap.org/search?query=dar%20es%20salaam#map=12/-6.8235/39.2695](https://www.openstreetmap.org/search?query=dar%20es%20salaam#map=12/-6.8235/39.2695)

Dar es Salaam is the former capital and the largest city of Tanzania. It is one of the largest cities in East Africa by population. The region had a population of 4,364,541 as of the official 2012 census, and the city is one of the fastest-growing in the world.
I would like to contribute improvements to OpenStreetMap to help ease movement within the city and the use of its facilities.

After downloading the map from OpenStreetMap, I found several mistakes that I managed to fix in this analysis.
The downloaded map was large, so I first had to reduce it.


# 1. PRELIMINARY ANALYSIS
### 1.1. Creating a sample of the data for the analysis
The original data was 1.3 GB, so I reduced it to about 70 MB using the code below.
```python
import xml.etree.ElementTree as ET  # Use cElementTree or lxml if too slow

OSM_FILE = "dar_sample.osm"  # Replace this with your osm file
SAMPLE_FILE = "DAR.osm" # The data to be used

k = 2 # Parameter: take every k-th top level element
def get_element(osm_file, tags=('node', 'way', 'relation')):
    """Yield element if it is the right type of tag
    Reference:
    http://stackoverflow.com/questions/3095434/inserting-newlines-in-xml-
    file-generated-via-xml-etree-elementtree-in-python
    """
    context = iter(ET.iterparse(osm_file, events=('start', 'end')))
    _, root = next(context)
    for event, elem in context:
        if event == 'end' and elem.tag in tags:
            yield elem
            root.clear()

with open(SAMPLE_FILE, 'wb') as output:
    # The file is opened in binary mode, so write bytes literals
    output.write(b'<?xml version="1.0" encoding="UTF-8"?>\n')
    output.write(b'<osm>\n  ')
    # Write every kth top level element
    for i, element in enumerate(get_element(OSM_FILE)):
        if i % k == 0:
            output.write(ET.tostring(element, encoding='utf-8'))

    output.write(b'</osm>')
```

### 1.2. Counting the tags in the sample file
With the sample data in hand, I counted the elements (especially `node`, `tag` and `way`) to get a sense of how much data I would be dealing with.
```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def count_tags_element(filename):
    tags = defaultdict(int)  # tag name -> number of occurrences in the file
    for event, element in ET.iterparse(filename):
        tags[element.tag] += 1
    return tags


count_tags_element("DAR.osm")

# Output:
# defaultdict(int,
#             {'member': 637,
#              'nd': 382408,
#              'node': 329542,
#              'osm': 1,
#              'relation': 58,
#              'tag': 102113,
#              'way': 56882})
```

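### 1.3. Checking the tag key types in the data
```python
import re
import xml.etree.ElementTree as ET

# Keys made up only of lowercase letters and underscores
lower = re.compile(r'^([a-z]|_)*$')
# Keys containing characters that are problematic in a database
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')

# Lowercase keys with a colon, such as addr:street
lower_colon = re.compile(r'^([a-z]|_)+:([a-z]|_)+')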

def key_type(element, keys):
    if element.tag == "tag":
        if lower.search(element.attrib['k']):
            keys['lower'] += 1
        elif lower_colon.search(element.attrib['k']):
            keys['lower_colon'] += 1
        elif problemchars.search(element.attrib['k']):
            keys['problemchars'] += 1
        else:
            keys['other'] += 1
    return keys

def process_map_check(filename):
    keys = {"lower": 0, "lower_colon": 0, "problemchars": 0, "other": 0}
    for _, element in ET.iterparse(filename):
        keys = key_type(element, keys)
    return keys
    
process_map_check("DAR.osm")
```
Output:

    {'lower': 69615, 'lower_colon': 32482, 'other': 16, 'problemchars': 0}
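
To make the four categories concrete, here is a small sketch that applies the same three patterns to a handful of keys. The `classify_key` helper and the sample keys are illustrative assumptions, not part of the project code. The sketch is written for Python 3.

```python
import re

# Same three patterns as in the section above.
lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)+:([a-z]|_)+')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')


def classify_key(key):
    """Return the category that key_type would increment for this key (hypothetical helper)."""
    if lower.search(key):
        return "lower"
    if lower_colon.search(key):
        return "lower_colon"
    if problemchars.search(key):
        return "problemchars"
    return "other"


# Hypothetical sample keys, chosen to hit each branch once.
samples = ["highway", "addr:street", "opening hours", "FIXME"]
categories = {key: classify_key(key) for key in samples}
print(categories)
```

Keys with uppercase letters and no problem characters fall into `other`, which is one plausible explanation for the 16 keys counted there.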

# 2. Problems Encountered in the Map

I will discuss one of the major problems encountered in the data set: the street names are not properly labeled. My main task was to clean the improper street names, most of which appear in `node` and `way` elements.

- Street names are not labeled 'Street' in the tag:

    ```XML
    <tag k="addr:street" v="Almasi" />
    ```

    For consistency, we should expect something like this:

    ```XML
    <tag k="addr:street" v="Almasi Street" />
    ```
    


# 3. AUDIT
To tackle the above challenge, I did some data auditing using [audit_data.py](https://github.com/DenisDPR/Data-Analyst-Nano-Degree/blob/master/Project%202/OpenStreetMapData/audit_data.py) before processing the whole data set.
The audit was done in four main parts, each implemented as a function.
### 3.1 audit_street_type
Using the [re](https://docs.python.org/2/library/re.html) module, this function searches each `addr:street` value with the `street_type_re` pattern and records the names whose matched street type is not in `expected_street_types`.
```python
def audit_street_type(street_types, street_name):
    m = street_type_re.search(street_name)
    if m:
        street_type = m.group()
        if street_type not in expected_street_types:
            street_types[street_type].add(street_name)
```
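
The function above relies on `street_type_re` and `expected_street_types`, which are not shown in this write-up. The sketch below fills them in with assumed values (a pattern that takes the last word of the name, and a short illustrative list of expected types) just to show how the audit groups names. The sample names other than "Almasi" are hypothetical.

```python
import re
from collections import defaultdict

# Assumption: the street type is the last whitespace-delimited word of the name.
street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)
# Assumption: a short illustrative list, not the project's actual one.
expected_street_types = ["Street", "Road", "Avenue", "Line"]


def audit_street_type(street_types, street_name):
    m = street_type_re.search(street_name)
    if m:
        street_type = m.group()
        if street_type not in expected_street_types:
            # Group the offending names under their unexpected street type
            street_types[street_type].add(street_name)


street_types = defaultdict(set)
for name in ["Almasi", "Almasi Street", "Morogoro Rd", "Kilwa Road"]:
    audit_street_type(street_types, name)

print(dict(street_types))
```

With these assumptions, "Almasi" and "Morogoro Rd" are flagged, while the names ending in "Street" or "Road" pass.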

### 3.2 is_street_name
This function checks whether a tag element's `k` attribute is equal to "addr:street".
 ```python
def is_street_name(elem):
    return (elem.attrib['k'] == "addr:street")
```


### 3.3 audit_street
```python
def audit_street(osmfile):
    street_types = defaultdict(set)
    with open(osmfile, "rb") as osm_file:
        # Use "end" events so that each element's child tags are fully parsed
        for event, elem in ET.iterparse(osm_file, events=("end",)):
            if elem.tag == "node" or elem.tag == "way":
                for tag in elem.iter("tag"):
                    if is_street_name(tag):
                        audit_street_type(street_types, tag.attrib['v'])
    return street_types
```


### 3.4 update_name
This function updates the street names for consistency, following the mapping and the expected street types in the data set.
```python
def update_name(name):
    """
    Clean a street name for consistency and accuracy
    before it is loaded into the SQL database
    """
    if num_line_street_re.match(name):
        nth = nth_re.search(name)
        name = lane_mapping[nth.group(0)] + " Line"
        return name
    
    elif name == "Ahmed & Mohamed Line" or name == "Ahmed/Mohamed Line":
        name = "Ahmed-Mohamed Line"
        return name

    else:
        original_name = name
        for key in mapping.keys():
            # When mapping key match such as "St." 
            type_fix_name = re.sub(r'\s' + re.escape(key) + r'$', ' ' + mapping[key], original_name)
            nesw = nesw_re.search(type_fix_name)
            if nesw is not None:
                for key in street_mapping.keys():
                    # No renaming proper names.
                    dir_fix_name = re.sub(r'\s' + re.escape(key) + re.escape(nesw.group(0)), " " + street_mapping[key] + nesw.group(0), type_fix_name)
                    if dir_fix_name != type_fix_name:
                        return dir_fix_name
            if type_fix_name != original_name:
                return type_fix_name
    # Capitalize a lowercase last word such as "street"
    last_word = original_name.rsplit(None, 1)[-1]
    if last_word.islower():
        # re.escape keeps characters such as "." in the word from acting as regex syntax
        original_name = re.sub(re.escape(last_word), last_word.title(), original_name)
    return original_name
```
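
The helper tables used by `update_name` (`mapping`, `street_mapping`, `lane_mapping` and the compiled patterns) are not reproduced in this write-up. As a reduced sketch of only the suffix-replacement step, the example below uses a hypothetical `mapping` table and a hypothetical `fix_street_suffix` helper. It is not the project's full cleaning logic.

```python
import re

# Hypothetical abbreviation table; the project's real `mapping` is not shown here.
mapping = {"St": "Street", "St.": "Street", "Rd": "Road", "Rd.": "Road"}


def fix_street_suffix(name):
    """Expand an abbreviated final word, e.g. 'Rd' -> 'Road' (illustrative helper)."""
    for abbreviation, full in mapping.items():
        # Only touch the abbreviation when it is the last word of the name
        pattern = r'\s' + re.escape(abbreviation) + r'$'
        fixed = re.sub(pattern, ' ' + full, name)
        if fixed != name:
            return fixed
    return name


for name in ["Morogoro Rd", "Samora St.", "Almasi Street"]:
    print(name, "->", fix_street_suffix(name))
```

Anchoring the pattern with `\s` and `$` means that a name which merely contains "St" or "Rd" somewhere in the middle is left untouched.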

# 4. PROCESSING
After auditing was complete, the next step was to prepare the data for insertion into a SQL database using __process_data.py__.
The script parses the elements of the DAR.osm XML file into a tabular format and writes them to .csv files, which can then be imported into a SQL database using the given schema.
Validation matters here, so the output is validated against the schema with the cerberus library.
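
The XML-to-CSV step can be sketched roughly as follows. This is a minimal illustration for Python 3, not the contents of process_data.py: the column lists, the `shape_node` helper and the sample node are assumptions, and the cerberus validation step is left out.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Assumed column layouts; the project's actual schema is not reproduced here.
NODE_FIELDS = ["id", "lat", "lon", "user", "uid", "version", "changeset", "timestamp"]
NODE_TAG_FIELDS = ["id", "key", "value", "type"]

# Hypothetical node used only for illustration.
SAMPLE_NODE = """
<node id="1" lat="-6.82" lon="39.27" user="demo" uid="7"
      version="1" changeset="10" timestamp="2017-01-01T00:00:00Z">
  <tag k="addr:street" v="Almasi Street" />
  <tag k="amenity" v="restaurant" />
</node>
"""


def shape_node(element):
    """Split one <node> element into a node row and a list of tag rows."""
    node_row = {field: element.attrib[field] for field in NODE_FIELDS}
    tag_rows = []
    for tag in element.iter("tag"):
        key = tag.attrib["k"]
        if ":" in key:
            # "addr:street" becomes type "addr" and key "street"
            tag_type, key = key.split(":", 1)
        else:
            tag_type = "regular"
        tag_rows.append({"id": node_row["id"], "key": key,
                         "value": tag.attrib["v"], "type": tag_type})
    return node_row, tag_rows


node_row, tag_rows = shape_node(ET.fromstring(SAMPLE_NODE))

# Write the tag rows to an in-memory CSV instead of a file on disk
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=NODE_TAG_FIELDS)
writer.writeheader()
writer.writerows(tag_rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Writing to an in-memory buffer keeps the sketch self-contained. In practice the same writer would point at a file such as nodes_tags.csv.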


### Code to determine file size
```python
import os
def convert_bytes(num):
    """
    this function will convert bytes to MB.... GB... etc
    """
    for x in ['bytes', 'KB', 'MB', 'GB', 'TB']:
        if num < 1024.0:
            return "%3.1f %s" % (num, x)
        num /= 1024.0
def file_size(file_path):
    """
    this function will return the file size
    """
    if os.path.isfile(file_path):
        file_info = os.stat(file_path)
        return convert_bytes(file_info.st_size)

file_path = r"DAR.osm"
print(file_size(file_path))
```

### File sizes

    DAR.osm  :  70.3 MB
    nodes.csv  :  27.3 MB
    nodes_tags.csv  :  265.3 KB
    ways.csv  :  3.5 MB
    ways_nodes.csv  :  8.8 MB
    ways_tags.csv  :  3.1 MB

# 4. SQL 

## Data Overview and Additional ideas:

### Number of nodes
    SELECT COUNT(*) FROM nodes;
    139480
### Number of ways
    SELECT Count(*) FROM ways;
    41964
### Number of users
    SELECT COUNT(distinct(user.uid)) 
    FROM (SELECT uid 
            FROM nodes union all 
            SELECT uid FROM ways) user;
    1279
### Top 10 most active users
    SELECT e.user, COUNT(*) as num
           FROM (SELECT user FROM Nodes UNION ALL SELECT user FROM Ways) e
           GROUP BY e.user
           ORDER BY num DESC
           LIMIT 10;
           
    kombe1207, 8258
    amour_nyl, 4720
    innocent maholi, 4697
    Doricas Mgusi, 4427
    Hawa Adinani, 4195
    tonny john, 2999
    Kabaka@1, 2170
    mwanaharusi ngaluma, 2132
    elia dominic, 2069
    Immaculate Mwanja, 2028
## Religion
    SELECT nodes_tags.value, COUNT(*) as num
    FROM nodes_tags 
    JOIN (SELECT DISTINCT(id) FROM nodes_tags WHERE value='place_of_worship') i
    ON nodes_tags.id=i.id
    WHERE nodes_tags.key='religion'
    GROUP BY nodes_tags.value
    ORDER BY num DESC;
    
    christian|6
    muslim|6
Since Dar es Salaam lies on the coast, an even 50/50 split between Christian and Muslim places of worship in this sample is not surprising.
[More information on the distribution of religions in Tanzania can be found here](https://en.wikipedia.org/wiki/Religion_in_Tanzania)
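
To show how this kind of self-join query behaves, here is a small sketch that runs the same query with Python's `sqlite3` module against an in-memory table. The table layout and every row are hypothetical stand-ins for illustration, not values from the Dar es Salaam data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed layout of nodes_tags; the project's real schema is not shown here.
conn.execute("CREATE TABLE nodes_tags (id INTEGER, key TEXT, value TEXT, type TEXT)")

# Hypothetical rows: three places of worship and one unrelated restaurant node.
rows = [
    (1, "amenity", "place_of_worship", "regular"),
    (1, "religion", "christian", "regular"),
    (2, "amenity", "place_of_worship", "regular"),
    (2, "religion", "muslim", "regular"),
    (3, "amenity", "place_of_worship", "regular"),
    (3, "religion", "muslim", "regular"),
    (4, "amenity", "restaurant", "regular"),
]
conn.executemany("INSERT INTO nodes_tags VALUES (?, ?, ?, ?)", rows)

query = """
SELECT nodes_tags.value, COUNT(*) AS num
FROM nodes_tags
JOIN (SELECT DISTINCT(id) FROM nodes_tags WHERE value='place_of_worship') i
ON nodes_tags.id = i.id
WHERE nodes_tags.key = 'religion'
GROUP BY nodes_tags.value
ORDER BY num DESC;
"""
result = conn.execute(query).fetchall()
print(result)
```

The subquery first collects the ids of nodes tagged as places of worship, and the join then keeps only the `religion` tags that belong to those nodes.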
    
## Most common cuisines
    SELECT nodes_tags.value, COUNT(*) as num
        FROM nodes_tags 
        JOIN (SELECT DISTINCT(id) FROM nodes_tags WHERE value='restaurant') i
        ON nodes_tags.id=i.id
        WHERE nodes_tags.key='cuisine'
        GROUP BY nodes_tags.value
        ORDER BY num DESC;

    burger,1
    chicken;african;barbecue;grill,1
    thai,1
## Number of bus stops (nodes tagged highway=bus_stop)
    SELECT COUNT(*)
    FROM nodes_tags
    WHERE key='highway' AND value='bus_stop';
    12



# 5. Data Improvement and statistics
Consider the following two queries.

#### Number of Highways
    SELECT COUNT(*)
    FROM nodes_tags
    WHERE key='highway';
    16

#### Number of Highways with traffic signals

    SELECT COUNT(*)
    FROM nodes_tags
    WHERE key='highway'
    AND value='traffic_signals';
    1

Highways with traffic signals account for only 6.25% (1 of 16) of the highways, which is a small proportion.
Pedestrians also use most highways in Dar es Salaam, as shown in the image below, so having so few traffic signals carries a high risk of accidents.
However, quite a few traffic signals exist that are not included in OpenStreetMap.
The responsible agency (Tanzania Roads Agency) should take some responsibility for updating this information.

![Highway in Tanzania](https://github.com/DenisDPR/Data-Analyst-Nano-Degree/blob/master/Project%202/OpenStreetMapData/Kilwa%20Highway.jpg)
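
The 6.25% share can also be computed in a single query. The sketch below does this with `sqlite3` on an in-memory table whose rows are stand-ins built to match the counts reported above (16 highway nodes, 1 traffic signal, 12 bus stops). The `crossing` value for the remaining three rows and the table layout are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed layout of nodes_tags; the project's real schema is not shown here.
conn.execute("CREATE TABLE nodes_tags (id INTEGER, key TEXT, value TEXT, type TEXT)")

# Stand-in rows matching the reported counts, not the real data.
rows = [(1, "highway", "traffic_signals", "regular")]
rows += [(i, "highway", "bus_stop", "regular") for i in range(2, 14)]   # 12 bus stops
rows += [(i, "highway", "crossing", "regular") for i in range(14, 17)]  # 3 filler rows (assumed value)
conn.executemany("INSERT INTO nodes_tags VALUES (?, ?, ?, ?)", rows)

# In SQLite a comparison evaluates to 0 or 1, so SUM counts the matching rows.
share = conn.execute("""
SELECT 100.0 * SUM(value = 'traffic_signals') / COUNT(*)
FROM nodes_tags
WHERE key = 'highway';
""").fetchone()[0]
print(share)
```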


## Other ideas about the data set

As far as Tanzania is concerned, and Dar es Salaam in particular, data is a big problem. The little data that is available is sometimes not used effectively, or its importance is not recognized.

During my data analysis, I discovered a lot of things that need to be improved. The case of the __highway traffic signals__ may be just one among many.
## Solution
Dar es Salaam and its sub-villages are under a local government system, so most development falls under the Local Government Authority. This includes the construction of roads, streets, etc. The major construction projects are in fact mainly for public use.
A well-established __INVENTORY__ of all infrastructure (roads, hospitals, etc.) could be one of the leading solutions. The information from this inventory can be used to improve OpenStreetMap.
Many students (majoring in Geography, Computer Science, etc.) are sometimes attached to these local government offices for their field practicals, so they can help with data entry to improve the OpenStreetMap data for Dar es Salaam using the inventory.

## Benefits
A well-established OpenStreetMap can be of benefit in a number of ways. I'll talk about one:
- Logistics.
  Dar es Salaam has poor logistics because house and building addresses are not well established, so delivering goods to a specified address is difficult. Once the city is properly addressed (street addresses properly labeled in OpenStreetMap), the addresses can be used for logistics and transportation purposes.


# Conclusion
The Dar es Salaam OpenStreetMap dataset is quite large, with a lot of missing information that needs improvement.
For this project I managed to clean the street names; however, much more data cleaning can be done on other messy information. Using SQL, a lot of important information can be drawn from the dataset. The data wrangling process is not easy, especially if one does not have clear information about the data being worked on and about the area.


## References

   - [Udacity Discussion Forums](https://discussions.udacity.com/c/nd002-data-wrangling/nd002-p-wrangle-openstreetmaps-data-with-sql)
   - [Python `re` module documentation](https://docs.python.org/2/library/re.html)
   - [Religion in Tanzania (Wikipedia)](https://en.wikipedia.org/wiki/Religion_in_Tanzania)

