# Project: OpenStreetMap Data

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction


### Map Area
**Brest, France:**
[OpenStreetMap export (206 MB)](https://www.openstreetmap.org/export#map=12/48.4085/-4.4986)
- Full extract (206 MB): Bounding Box coordinates (48.3325, -4.7131, 48.4843, -4.2840)
- Sample extract (6.2 MB): Bounding Box coordinates (48.3795, -4.4955, 48.3914, -4.4810)

This area is where some of my family live, so I'm especially interested in seeing what querying the database reveals, and I'd like an opportunity to contribute to its improvement on OpenStreetMap.org.  
Brest belongs to the French region of Bretagne (Brittany), which has its own regional language, Breton, in addition to French. Both languages are often used on road signs and to label public buildings, but there is no national legal framework that requires this or explains how to do it. I therefore expect issues or specificities related to this during data processing and exploration. In particular, I'd like to answer the following questions:  

1. Does the OpenStreetMap data set include both languages?
2. If yes, how?
    * Is there a tag that indicates the language of a node?
    * Do nodes include both languages, or only French or only Breton?
    * What proportion of the data is translated?
    * What kind of data is translated (amenity/building type)?
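A possible starting point for these questions is sketched below. It assumes the extract follows the common OpenStreetMap convention of language-suffixed name keys such as `name:br` and `name:fr`, which has not yet been checked against the Brest data. The helper `count_name_languages` and the inline XML sample are hypothetical, and the code is written so that it also runs under Python 3.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical miniature extract with placeholder values, for illustration only.
SAMPLE_XML = b"""<osm>
  <node id="1"><tag k="name" v="Place A"/><tag k="name:br" v="Place A (br)"/></node>
  <node id="2"><tag k="name" v="Place B"/><tag k="name:fr" v="Place B (fr)"/><tag k="name:br" v="Place B (br)"/></node>
  <node id="3"><tag k="amenity" v="cafe"/></node>
</osm>"""

def count_name_languages(source):
    """Count tags whose key is 'name' or starts with 'name:' (assumed language suffix)."""
    counts = Counter()
    for _, element in ET.iterparse(source):
        if element.tag == 'tag':
            key = element.get('k')
            if key == 'name' or key.startswith('name:'):
                counts[key] += 1
    return counts

counts = count_name_languages(io.BytesIO(SAMPLE_XML))
```

Comparing the count for `name:br` with the count for plain `name` would give a first, rough estimate of the proportion of translated names.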

In [1]:
# Packages used for parsing, auditing, cleaning and exporting the OSM data.

import xml.etree.cElementTree as ET
import re
import pprint
import cerberus
import codecs
import unicodedata
from collections import defaultdict, OrderedDict
import pandas as pd
import StringIO
import gzip
import process_xml #the data.py script used at the very end of lesson 13

<a id='wrangling'></a>
## Data Wrangling
### Open file
As suggested in the project notes, I will use a much smaller data sample (6.2 MB) to audit the data and test the cleaning functions. I manually selected and exported a smaller area rather than programmatically sampling the full data set.
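For comparison, the programmatic alternative could look like the sketch below, which keeps every k-th top-level element. This is not what was done here. The function name `sample_elements` and the value of `k` are illustrative assumptions, and the code is written so that it also runs under Python 3.

```python
import io
import xml.etree.ElementTree as ET

def sample_elements(source, k=10, tags=('node', 'way', 'relation')):
    """Yield every k-th top-level element from an OSM XML source."""
    context = iter(ET.iterparse(source, events=('start', 'end')))
    _, root = next(context)  # the first event is the start of the root element
    index = 0
    for event, element in context:
        if event == 'end' and element.tag in tags:
            if index % k == 0:
                yield element
            index += 1
            root.clear()  # drop already processed children to limit memory use

# Tiny in-memory example with five nodes, keeping every second one.
xml_text = '<osm>' + ''.join('<node id="%d"/>' % i for i in range(1, 6)) + '</osm>'
sampled_ids = [e.get('id') for e in sample_elements(io.BytesIO(xml_text.encode('utf-8')), k=2)]
```

A manually exported bounding box keeps neighbouring elements together, whereas this kind of sampling spreads the kept elements over the whole area, so ways may refer to nodes that were not kept.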

In [2]:
OSM_FILE = "brest_full" 
SAMPLE_FILE = 'brest_sample.osm'

### Explore file to identify what to clean


In [3]:
#top level tags
def count_tags(filename):
    f=open(filename,mode='r')
    tags={}
    for event,elem in ET.iterparse(f):
        if elem.tag in tags:
            tags[elem.tag]+=1
        else:
            tags[elem.tag]=1
    f.close()
    return tags

count_tags(SAMPLE_FILE)

{'bounds': 1,
 'member': 25852,
 'meta': 1,
 'nd': 25447,
 'node': 19065,
 'note': 1,
 'osm': 1,
 'relation': 219,
 'tag': 24319,
 'way': 2990}

In [4]:
#function returning a dictionary where the keys are the k attribute of the "tag" tags and the values a set
#of the v attributes
def list_tags(filename):
    f=open(filename,'r')
    tags=defaultdict(set)
    for _,element in ET.iterparse(f):
        if element.tag=='tag':
            tags[element.get('k')].add(element.get('v'))
    f.close() #the original "file.close" never called the method, so the file stayed open
    return tags
#list_tags(SAMPLE_FILE)

After manually scrolling through the resulting dictionary, I could already spot some entries that need cleaning:
- phone numbers in "contact:phone" and "contact:fax": the format is not consistent (+33 x xx xx xx xx, +332xxxxxxxx, or +33 x xxxxxxxx).  
I also don't know whether all phone numbers have the right length or start with the right country code (+33).
- opening_hours: the format is not consistent, for example:  
9h-19h du lundi au samedi  
Lu-Sa 07:00-22:00  
Mo-Fr 14:00-18:30

Additionally, although I could not spot any errors by eye, I suspect that the following may also need some cleaning:
- addr:postcode
- addr:street (same street types cleaning to perform as shown in the course)

I don't think the following attributes contain errors, but I still want to audit them too to be sure the data is consistent:
- addr:city
- addr:country
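The inconsistent opening_hours values could be audited in the same way as the other tags. The sketch below uses a deliberately simplified pattern (a two-letter English day or day range followed by one time range), which is an assumption and covers only a small part of what OpenStreetMap allows for this key. The name `audit_opening_hours` is hypothetical.

```python
import re

# Simplified assumption: "Mo-Fr 14:00-18:30" or "Sa 09:00-12:00" are the expected shapes.
DAY = r'(?:Mo|Tu|We|Th|Fr|Sa|Su)'
opening_hours_re = re.compile(r'^' + DAY + r'(?:-' + DAY + r')? \d{2}:\d{2}-\d{2}:\d{2}$')

def audit_opening_hours(value):
    """Return the value if it does not match the simplified pattern, else None."""
    if not opening_hours_re.match(value):
        return value

samples = ['9h-19h du lundi au samedi', 'Lu-Sa 07:00-22:00', 'Mo-Fr 14:00-18:30']
unexpected = [v for v in samples if audit_opening_hours(v) is not None]
```

With the three values observed in the sample, only the last one passes; the other two are reported for cleaning.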

In [5]:
expected_city='Brest'
postcode_re=re.compile(r'29\d{3}')
street_type_re=re.compile(r'[\w-]+')
phone_re=re.compile(r'\+33\s\d(\s\d\d){4}')
expected_country='FR'
expected_st_types=['Rue','Boulevard','Square','Place','Avenue','Jardin','Cours','Quai','Parking',\
                   'Eglise','Halles','Rampe']

def audit_city(city):
    if city != expected_city:
        return city

def audit_country(country):
    if country != expected_country:
        return country
    
def audit_postcode(postcode):
    m=re.search(postcode_re,postcode)
    if not m:
        return postcode
    
def audit_street_type(street_name):
    m=re.match(street_type_re,street_name)
    if m:
        street_type=m.group()
        if street_type not in expected_st_types:
            return street_name
        
def audit_phone(phone_number):
    m=re.match(phone_re,phone_number) # check if phone number is in the right format
    length=sum(c.isdigit() for c in phone_number) # check if phone number has the right number of digits
    if not m or length != 11 or phone_number[1:3] != "33":
        return phone_number

In [6]:
#investigate value attributes of the tags in the tags to audit list. Return the "weirdos": the ones which don't match the expected values
k_list=['addr:street','addr:city','addr:postcode','addr:country','contact:phone','contact:fax']
function_list=[audit_street_type, audit_city, audit_postcode, audit_country, audit_phone, audit_phone]
audit_dict={k:v for k,v in zip(k_list,function_list)}

def audit(filename):
    f=open(filename,'r')
    d=defaultdict(set)
    for _,element in ET.iterparse(f):
        if element.tag=='tag':
            k=element.get('k')
            v=element.get('v')
            if k in k_list:
                d[k].add(audit_dict[k](v)) #None means the value matched what was expected
    f.close()
    return d

results_audit=audit(SAMPLE_FILE)
results_audit

defaultdict(set,
            {'addr:city': {None},
             'addr:country': {None},
             'addr:postcode': {None},
             'addr:street': {None,
              'Bd Jean Moulin',
              u"Cit\xe9 d'Antin",
              'SQUARE PL Sadi Carnot',
              'SQUARE Rue Emile Zola',
              u'bd Fran\xe7ais Libres',
              'bd Jean Moulin',
              'eglise st louis',
              'halles St Louis',
              u'place de la Libert\xe9',
              u'rue Ducou\ufffddic',
              'rue Duquesne',
              'rue Emile Zola',
              'rue Louis Pasteur',
              'rue Michelet',
              'rue Traverse',
              'rue Yves Collet',
              'rue de Lyon',
              'rue de porstrein',
              'rue du Chateau',
              u'rue du Ch\ufffdteau',
              'square',
              'square Rue Aiguillon',
              'square Sangnier'},
             'contact:fax': {None,
              '+33 2 9843

The postcode, city, country tags are clean.  
**addr:street**, **contact:phone**, **contact:fax** need to be cleaned.  
We'll also need to encode/decode the strings properly. It looks like the 'é' character is not read properly.
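One likely contributor to the accent problem is the street-type pattern itself. In Python 2, `\w` matches only ASCII letters unless the `re.UNICODE` flag is set, so "Cité" is matched only up to "Cit". The sketch below shows the flag in use; the name `street_type_unicode_re` is hypothetical, and the code also runs under Python 3, where the flag is already the default for text patterns. The `\ufffd` characters in the audit output are a different matter: that is the Unicode replacement character, which may mean the original letter was already lost before parsing.

```python
# -*- coding: utf-8 -*-
import re

# Same pattern as street_type_re, with Unicode-aware \w.
street_type_unicode_re = re.compile(r'[\w-]+', re.UNICODE)

name = u"Cit\xe9 d'Antin"  # u"Cité d'Antin"
street_type = street_type_unicode_re.match(name).group()
```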

### Data Cleaning
#### Clean addr:street

In [7]:
#update street_name: map each unexpected street type to its canonical spelling
mapping={'Bd':'Boulevard',
         'bd':'Boulevard',
         'SQUARE':'Square',
         'eglise':'Eglise',
         'halles':'Halles',
         'place':'Place',
         'rue':'Rue',
         'square':'Square',
         'Cit':'Cit'} #street_type_re matches only 'Cit' in 'Cité', so the key is 'Cit'

def update_name(name):
    street_type=street_type_re.match(name).group()
    repl=mapping.get(street_type,street_type) #leave unknown street types unchanged
    return re.sub(street_type_re,repl,name,1)

for n in results_audit['addr:street']:
    if n is not None:
        print(update_name(n))


Place de la Liberté
Square
Cité d'Antin
Boulevard Jean Moulin
Rue Traverse
Square PL Sadi Carnot
Rue Yves Collet
Rue de porstrein
Rue Duquesne
Boulevard Français Libres
Halles St Louis
Square Rue Aiguillon
Rue Louis Pasteur
Rue du Chateau
Boulevard Jean Moulin
Eglise st louis
Rue Ducou�dic
Rue Emile Zola
Rue de Lyon
Square Rue Emile Zola
Rue Michelet
Square Sangnier
Rue du Ch�teau


#### Clean contact:phone and contact:fax

In [8]:
#Reformat all phone and fax numbers to the format +33 x xx xx xx xx, which is the format commonly used in France
def clean_number(num):
    #case 1: +33 x xxxxxxxx
    if re.search(r'\s\d{8}', num):
        num=num[:6]+num[6:8]+" "+num[8:10]+" "+num[10:12]+" "+num[12:]
        
    #case 2:+33 x xx xxx xxx
    elif re.search(r'\d{3}\s', num):
        num=num[:9]+num[9:11]+' '+num[11:12]+num[13:14]+' '+num[14:]
        
    #case 3: +33xxxxxxxxx
    elif not re.search(r'\s',num):
        num=num[:3]+' '+num[3:4]+' '+num[4:6]+' '+num[6:8]+' '+num[8:10]+' '+num[10:]
    
    return num

#test
numlist=['+33233445566','+33 2 33 221 555','+33 2 22222222']
for n in numlist:
    print n + ' ---> ' +clean_number(n)

+33233445566 ---> +33 2 33 44 55 66
+33 2 33 221 555 ---> +33 2 33 22 15 55
+33 2 22222222 ---> +33 2 22 22 22 22
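The slicing in `clean_number` depends on the exact position of each space, so any spacing variant not listed above is returned unchanged. An alternative, sketched here under the assumption that every valid number has the country code 33 followed by nine digits, is to strip all non-digits first and then regroup. The name `normalize_phone` is hypothetical, and the code also runs under Python 3.

```python
import re

def normalize_phone(number):
    """Return the number as '+33 x xx xx xx xx', or None if it cannot be normalized."""
    digits = re.sub(r'\D', '', number)  # keep digits only
    if len(digits) != 11 or not digits.startswith('33'):
        return None
    national = digits[2:]  # the nine digits after the country code
    pairs = [national[i:i + 2] for i in range(1, 9, 2)]
    return '+33 ' + national[0] + ' ' + ' '.join(pairs)

numlist = ['+33233445566', '+33 2 33 221 555', '+33 2 22222222']
normalized = [normalize_phone(n) for n in numlist]
```

Returning `None` for numbers with an unexpected length or country code makes them easy to list separately rather than silently keeping them.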


### Export to CSV

In [10]:
process_xml.clean('street','rue Blabla')

'rue Blabla'

In [11]:
process_xml.process_map(SAMPLE_FILE, validate=True)

### Import to SQL using *schema*
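A minimal sketch of this step, assuming the export produces a CSV of node tags with `id`, `key` and `value` columns. The table name `nodes_tags`, the column set and the sample rows are all hypothetical and are not taken from the actual schema. The code uses Python 3 syntax and an in-memory SQLite database.

```python
import csv
import io
import sqlite3

# Hypothetical CSV content with placeholder values, for illustration only.
NODES_TAGS_CSV = u"""id,key,value
1,name,Place A
1,name:br,Place A (br)
2,addr:street,Rue Example
"""

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE nodes_tags (id INTEGER, key TEXT, value TEXT)')

reader = csv.DictReader(io.StringIO(NODES_TAGS_CSV))
rows = [(int(r['id']), r['key'], r['value']) for r in reader]
conn.executemany('INSERT INTO nodes_tags VALUES (?, ?, ?)', rows)
conn.commit()

# Example query: how many node tags carry a Breton name?
breton_count = conn.execute(
    "SELECT COUNT(*) FROM nodes_tags WHERE key = 'name:br'").fetchone()[0]
conn.close()
```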

<a id='eda'></a>
## Exploratory Data Analysis


### Research Question 1 (Replace this header name!)

In [None]:
# Use this, and more code cells, to explore your data. Don't forget to add
#   Markdown cells to document your observations and findings.


### Research Question 2  (Replace this header name!)

In [None]:
# Continue to explore the data to address your additional research
#   questions. Add more headers as needed if you have more questions to
#   investigate.


<a id='conclusions'></a>
## Conclusions
