# Project Nitin Ramchand Data Wrangling with MongoDB

## Wrangle OpenStreetMap Data for Toulouse region

The region of Toulouse for which the OpenStreetMap data is wrangled is the one inside the orange perimeter in the image below.

Import all the libraries

In [1]:
import os
import os.path
import sys
import time
import requests
import xml.etree.cElementTree as ET
from pprint import pprint
import re
import codecs
import json
from collections import defaultdict
import bson
import pymongo
import matplotlib.pyplot as plt

## Auditing the OpenStreetMap Data (OSM XML File) for neighbourhood of Compans Caffarelli which is in the area of Toulouse

The following function __count_tags()__ counts how many times each tag occurs in the osm file and stores the counts in a dictionary. It uses the iterparse method to process the map file, which gives a feeling for which tags the input file has and how many there are of each. Using iterparse makes sense for such a big osm file because the file is read incrementally instead of being parsed in one go.

In [2]:
def count_tags(filename):
    count_tag_dict = {}
    for event, elem in ET.iterparse(filename):
        if elem.tag not in count_tag_dict:
            count_tag_dict[elem.tag] = 1
        else:
            count_tag_dict[elem.tag] += 1
    return count_tag_dict

In [3]:
tags = count_tags('Compans_Caffarelli.osm')
pprint(tags)

{'bounds': 1,
 'member': 5958,
 'meta': 1,
 'nd': 24310,
 'node': 16466,
 'note': 1,
 'osm': 1,
 'relation': 217,
 'tag': 16265,
 'way': 2533}
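
As a possible variation on the counting loop above, the sketch below uses `collections.Counter` and calls `elem.clear()` after each element so that memory use stays bounded on large files. It assumes Python 3; the function name `count_tags_compact` and the small inline sample are illustrative and not part of the original notebook.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_tags_compact(source):
    """Count how often each tag name occurs in an OSM XML source."""
    counts = Counter()
    # By default iterparse yields each element when its closing tag is read
    for _, elem in ET.iterparse(source):
        counts[elem.tag] += 1
        elem.clear()  # release children that have already been counted
    return dict(counts)


# Tiny hypothetical sample standing in for the real .osm file
SAMPLE_OSM = b"""<osm>
  <node id="1" lat="43.6" lon="1.43"><tag k="amenity" v="restaurant"/></node>
  <node id="2" lat="43.6" lon="1.44"/>
  <way id="3"><nd ref="1"/><nd ref="2"/><tag k="building" v="yes"/></way>
</osm>"""

sample_counts = count_tags_compact(io.BytesIO(SAMPLE_OSM))
print(sample_counts)
```

The function accepts a file name as well as a file-like object, so it could be called on the `.osm` file in the same way as `count_tags()`.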


As we can see, several tags, namely __bounds, meta, note and osm__, occur only once; they contain the metadata of the osm file. <br>
<br>
The rest of the tags are the ones that we will look at in detail and clean. <br>
<br>
Nodes, Ways and Relations are known as elements in OpenStreetMap and make up an osm file. <br>
- __Nodes:__ (defining points in space),<br>
- __Ways:__ (defining linear features and area boundaries), and <br>
- __Relations:__ (which are sometimes used to explain how other elements work together).
<br>
<br>
All above mentioned elements contain a child tag called tag.<br>
- __Tag:__ All types of data element (nodes, ways and relations), as well as changesets, can have tags. A tag is a child inside an element and describes the meaning of the element to which it is attached. It consists of two free-format text fields: a 'key' and a 'value'.

Finally the last two tags found in the osm file are member and nd.<br>
- __Member:__ is a child tag under relation and is used to define logical or geographic relationships between other elements. A member of a relation can optionally have a role which describes the part that a particular feature plays within a relation.

- __nd:__ is also a child tag, in this case under the element way, and lists the node references that make up a specific way.

Giving an example of how the OSM XML file looks:

A Node element looks like the following:

```XML
<node id="4250506436" lat="43.6119336" lon="1.4383687" version="3" timestamp="2017-09-24T09:18:10Z" changeset="52323529" uid="2774341" user="Floeditor">
    <tag k="addr:city" v="Toulouse"/>
    <tag k="addr:housenumber" v="15"/>
    <tag k="addr:postcode" v="31000"/>
    <tag k="addr:street" v="Avenue Honoré Serres"/>
    <tag k="amenity" v="restaurant"/>
    <tag k="name" v="Fiel mon restô"/>
    <tag k="phone" v="+33 5 61 21 82 72"/>
    <tag k="source" v="survey"/>
    <tag k="website" v="http://fielmonresto.com/"/>
  </node>
```

A way element looks like the following:
```XML
  <way id="64649549" version="2" timestamp="2012-12-19T15:17:01Z" changeset="14332349" uid="297482" user="don-vip">
    <nd ref="793922896"/>
    <nd ref="1536844827"/>
    <nd ref="793925492"/>
    <nd ref="793918685"/>
    <nd ref="793923221"/>
    <nd ref="793928395"/>
    <nd ref="793922896"/>
    <tag k="building" v="yes"/>
    <tag k="source" v="cadastre-dgi-fr source : Direction Générale des Impôts - Cadastre. Mise à jour : 2010"/>
  </way>
```

 A relation element looks like the following:
```XML
   <relation id="1546842" version="1" timestamp="2011-04-17T21:24:48Z" changeset="7892211" uid="297482" user="don-vip">
    <member type="node" ref="1249091595" role="stop"/>
    <member type="node" ref="1249091584" role="stop"/>
    <member type="node" ref="1249091573" role="stop"/>
    <member type="node" ref="1045297714" role="platform"/>
    <member type="node" ref="1148214269" role="platform"/>
    <member type="node" ref="1148214274" role="platform"/>
    <tag k="name" v="Héraclès"/>
    <tag k="public_transport" v="stop_area"/>
    <tag k="type" v="public_transport"/>
  </relation>
``` 
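
To make the key/value structure of the `tag` children concrete, here is a minimal sketch that reads them into a dictionary. It assumes Python 3, uses a shortened copy of the node example above, and the helper name `tags_as_dict` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the node example shown above
NODE_XML = """<node id="4250506436" lat="43.6119336" lon="1.4383687">
    <tag k="addr:city" v="Toulouse"/>
    <tag k="addr:housenumber" v="15"/>
    <tag k="amenity" v="restaurant"/>
</node>"""


def tags_as_dict(element):
    """Return the k/v pairs of an element's <tag> children as a dict."""
    return {tag.attrib["k"]: tag.attrib["v"] for tag in element.iter("tag")}


node = ET.fromstring(NODE_XML)
node_tags = tags_as_dict(node)
print(node.tag, node.attrib["id"], node_tags)
```

The same helper would work for way and relation elements, since their `tag` children have the same shape.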


Before importing the OpenStreetMap XML file into a database to perform some statistical analysis on what the Toulouse region contains, we want to clean up the data inside the OSM XML file. Once it is cleaned, we convert it to a JSON file so that we can import it into a MongoDB database and run our queries.

The format in which we want to convert the OSM XML file is the following:

```
{
"id": "2406124091",
"type": "node",
"visible":"true",
"created": {
          "version":"2",
          "changeset":"17206049",
          "timestamp":"2013-08-03T16:43:42Z",
          "user":"linuxUser16",
          "uid":"1219059"
        },
"pos": [41.9757030, -87.6921867],
"address": {
          "housenumber": "5157",
          "postcode": "60625",
          "street": "North Lincoln Ave"
        },
"amenity": "restaurant",
"cuisine": "mexican",
"name": "La Cabana De Don Luis",
"phone": "1 (773)-271-5176"
}
```
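
The target format above could be produced along the following lines. This is only a sketch under stated assumptions, not the notebook's own conversion code: it assumes Python 3, the function name `shape_element` and the constant `CREATED` are illustrative, keys with more than one colon under `addr:` are skipped, and a tag whose key collides with a top-level field (for example `type`) is not handled.

```python
import xml.etree.ElementTree as ET

# Attributes that go into the nested "created" dictionary
CREATED = ("version", "changeset", "timestamp", "user", "uid")


def shape_element(element):
    """Shape a node or way element into the nested dict format shown above."""
    if element.tag not in ("node", "way"):
        return None
    doc = {"type": element.tag, "created": {}}
    for name, value in element.attrib.items():
        if name in CREATED:
            doc["created"][name] = value
        elif name not in ("lat", "lon"):
            doc[name] = value  # e.g. "id" and, when present, "visible"
    if "lat" in element.attrib and "lon" in element.attrib:
        doc["pos"] = [float(element.attrib["lat"]), float(element.attrib["lon"])]
    for tag in element.iter("tag"):
        key, value = tag.attrib["k"], tag.attrib["v"]
        if key.startswith("addr:"):
            parts = key.split(":")
            if len(parts) == 2:  # assumption: ignore deeper keys like addr:street:name
                doc.setdefault("address", {})[parts[1]] = value
        else:
            doc[key] = value
    return doc


# Shortened copy of the node example from earlier in the document
SAMPLE_NODE = """<node id="4250506436" lat="43.6119336" lon="1.4383687" version="3"
      timestamp="2017-09-24T09:18:10Z" changeset="52323529" uid="2774341" user="Floeditor">
    <tag k="addr:housenumber" v="15"/>
    <tag k="addr:street" v="Avenue Honoré Serres"/>
    <tag k="amenity" v="restaurant"/>
</node>"""

shaped = shape_element(ET.fromstring(SAMPLE_NODE))
print(shaped)
```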

Next, to get familiar with the attributes of each tag, we look at how many times each attribute occurs. The following function __count_attributes()__ builds a dictionary that counts the occurrences of each attribute name.

In [4]:
def count_attributes(filename):
    count_attributes_dict = {}
    for _, element in ET.iterparse(filename):
        for attribute in element.attrib:
            if attribute not in count_attributes_dict:
                count_attributes_dict[attribute] = 1
            else:
                count_attributes_dict[attribute] += 1
    return count_attributes_dict               
   

attributes = count_attributes('Compans_Caffarelli.osm')
pprint(attributes)

{'changeset': 19216,
 'generator': 1,
 'id': 19216,
 'k': 16265,
 'lat': 16466,
 'lon': 16466,
 'maxlat': 1,
 'maxlon': 1,
 'minlat': 1,
 'minlon': 1,
 'osm_base': 1,
 'ref': 30268,
 'role': 5958,
 'timestamp': 19216,
 'type': 5958,
 'uid': 19216,
 'user': 19216,
 'v': 16265,
 'version': 19217}


As we can see above, aside from the attributes found in each of the elements (node, way, and relation) and some of the meta data attributes which are found only once, the k and the v attributes are very interesting since they are under the tag "tag" of most elements. The k attribute represents the key and the v the value of that key.

It is worth digging deeper into what types of keys there are, since the file contains many different key values that we can clean up before passing them into the JSON format mentioned above. The following __types_keys()__ function outputs a dictionary of the key "k" attributes under all tags named "tag" and how many times each one shows up. The dictionary types_keys_dict stores all 'k' values with their occurrences, and the dictionary shown below, keys_slice_dict, contains the 'k' attributes that occur at least 20 times in the OpenStreetMap OSM XML file.

In [2]:
def types_keys(filename):
    types_keys_dict = {}
    for event, element in ET.iterparse(filename, events=("start","end")):
        if event == "start":
            for tag in element.iter("tag"):
                key = tag.attrib['k']
                if key not in types_keys_dict:
                    types_keys_dict[key] = 1
                else:
                    types_keys_dict[key] += 1
    return types_keys_dict               
   

types_keys_dict = types_keys('Toulouse.osm')
keys_slice_dict = dict((k, v) for k, v in types_keys_dict.items() if v >= 20)
pprint(keys_slice_dict)



{'CEMT': 68,
 'CLC:code': 52,
 'CLC:id': 54,
 'CLC:year': 54,
 'FIXME': 66,
 'access': 29778,
 'access:disabled': 35,
 'addr:city': 13529,
 'addr:country': 9667,
 'addr:housename': 349,
 'addr:housenumber': 165574,
 'addr:place': 26,
 'addr:postcode': 7522,
 'addr:street': 32283,
 'addr:unit': 34,
 'admin_level': 770,
 'advertising': 3496,
 'advertising:animated': 22,
 'advertising:luminous': 22,
 'advertising:size': 22,
 'aeroway': 1381,
 'air_conditioning': 206,
 'alt_name': 617,
 'amenity': 31293,
 'animated': 30,
 'area': 918,
 'artist_name': 81,
 'artwork_type': 103,
 'atm': 547,
 'attraction': 93,
 'backrest': 1286,
 'barrier': 15115,
 'bench': 2386,
 'bicycle': 12571,
 'bicycle_parking': 614,
 'bike_safety': 112,
 'bin': 643,
 'board_type': 34,
 'boat': 332,
 'bollard': 119,
 'border_type': 36,
 'boules': 76,
 'boundary': 1353,
 'brand': 922,
 'brand:website': 62,
 'brand:wikidata': 700,
 'brand:wikipedia': 628,
 'brewery': 54,
 'bridge': 2159,
 'bridge:name': 72,
 'building': 7

In [6]:
phone = []
for event, elem in ET.iterparse('Compans_Caffarelli.osm', events=("start",)):
    if elem.tag == "node" or elem.tag == "way" or elem.tag == "relation": 
        for tag in elem.iter("tag"):
            key = tag.attrib['k']
            if (key == "phone") or (key == "contact:phone"):
                    phone.append(tag.attrib['v'])
pprint(len(phone))               
   




83


In [14]:
def audit_phonenumber(phonenumbers, actual_number):
    m = phone_number_re.search(actual_number)
    if not m:
        phonenumbers.append(actual_number)

def audit_phone(osmfile):
    osm_file = open(osmfile, "r")
    phonenumbers = []
    for event, elem in ET.iterparse(osm_file, events=("start",)):
            for tag in elem.iter():
                for key in tag.attrib:
                    if type(tag.attrib[key]) == str:
                        tag.attrib[key] = unicode(tag.attrib[key])
#                        print tag.attrib
                if elem.tag == "node" or elem.tag == "way":
                    for tag in elem.iter("tag"):
                        if (tag.attrib['k'] == "phone") or (tag.attrib['k'] == "contact:phone"):
                            audit_phonenumber(phonenumbers, tag.attrib['v'])
    osm_file.close()
    return phonenumbers

bad_phones = audit_phone('Compans_Caffarelli.osm')
pprint(bad_phones)

['3631',
 '3631',
 '3631',
 '3631',
 '3631',
 '3631',
 '3631',
 '3631',
 '3631',
 '3631',
 '3631',
 '3631',
 '3631',
 '3631',
 u'3631',
 u'3631',
 u'3631',
 u'3631',
 u'3631',
 u'3631',
 '0984180201',
 '0984180201',
 '0984180201',
 '0984180201',
 '0984180201',
 u'0984180201']


In [8]:
phone_number_re = re.compile(r'^\+')
m = phone_number_re.search('+33 5 62 15 01 70')
m.group()                         


'+'

The __types_values()__ function (cell In [10] below) outputs a dictionary of the value "v" attributes under all tags named "tag" and how many times each one shows up. The occurrences are much lower and the variety much higher than for the keys: there are street names, websites, phone numbers and other useful information. The dictionary types_values_dict stores all 'v' values with their occurrences, and the dictionary shown below, values_slice_dict, contains the 'v' attributes that occur at least 5 times in the OpenStreetMap OSM XML file.

In [9]:
import phonenumbers

test_phone = '+33 5 61 62 02 26'
test_phone2 = '+33562256228'
x = phonenumbers.parse(test_phone, None)
y = phonenumbers.parse(test_phone2, None)
print x
print y

Country Code: 33 National Number: 561620226
Country Code: 33 National Number: 562256228


In [10]:
def types_values(filename):
    types_values_dict = {}
    for event, element in ET.iterparse(filename, events=("start","end")):
        if event == "start":
            for tag in element.iter("tag"):
                value = tag.attrib['v']
                if value not in types_values_dict:
                    types_values_dict[value] = 1
                else:
                    types_values_dict[value] += 1
    return types_values_dict               
   

types_values_dict = types_values('Compans_Caffarelli.osm')
values_slice_dict = dict((k, v) for k, v in types_values_dict.items() if v >= 5)
pprint(values_slice_dict)

{'#572F08': 6,
 '#58AC25': 6,
 '#A0670F': 5,
 '#DC006B': 5,
 '#E675A7': 8,
 '#FF671B': 6,
 '#FFD900': 5,
 '+33561215515': 6,
 '+33562256228': 6,
 '-1': 47,
 '0': 14,
 '0.15': 6,
 '0.5': 41,
 '1': 281,
 u'1.5\u20ac': 63,
 '10': 104,
 '11': 94,
 '11.5': 23,
 '11B': 10,
 '12': 106,
 u'12\u20ac': 65,
 '13': 76,
 '14': 70,
 '15': 89,
 '16': 74,
 '17': 82,
 '17B': 6,
 '18': 64,
 '18 m': 10,
 '19': 57,
 '1997-08-13': 10,
 '1B': 8,
 '2': 209,
 '2 hours': 63,
 '2.5 hours': 65,
 '20': 67,
 '2007-11-16': 20,
 '2011': 6,
 '2011-06-11': 82,
 '2011-11-06': 280,
 '2012-04-04': 28,
 '2012-05-30': 48,
 '2012-10-04': 180,
 '2012-11-29': 176,
 '2012/10/04': 46,
 '2015-02-14': 6,
 '2015-03': 5,
 '2015-04-06': 8,
 '2018-09-08': 18,
 '2018-09-29': 10,
 '2018-11-06': 6,
 '2018-11-11': 20,
 '20B': 16,
 '21': 52,
 '22': 56,
 '22B': 6,
 '23': 38,
 '23B': 6,
 '24': 45,
 '24 hours': 57,
 '24/7': 12,
 '25': 34,
 '26': 38,
 '27': 32,
 '28': 41,
 '29': 35,
 '2B': 20,
 '2T': 6,
 '3': 213,
 '30': 206,
 '30B': 6,
 '31'

Both the keys and the values shown above are the attributes that we need to clean up before converting to the JSON file. This information is entered by humans and clearly has the most errors in syntax and format. The rest of the attributes found directly on the elements, such as "lat", "lon", "id", etc., all share the same format and occur as often as their elements, so they are clean enough to go into the JSON file and then into the MongoDB database.

In this project we only clean up some data in the 'k' and 'v' attributes. Before the next section presents the cleaning plan, we do some more auditing to understand these attributes better and to group the syntax errors we find, which helps in developing that plan.

The following function __key_type()__ looks through all tags called "tag" and sorts the 'k' attribute into the following categories, counting the occurrences of each:
- "lowercase", for tags that contain only lowercase letters and are valid,
- "lowercase_colon", for otherwise valid tags with a colon in their names,
- "lowercase_semicolon", for otherwise valid tags with a semicolon in their names,
- "lowercase_morethanone_colon", for tags with more than one colon in their names,
- "capitalized", for tags starting with a capital letter,
- "problemchars", for tags with problematic characters, and
- "other", for tags that do not fall into any of the categories above.

In [25]:
lowercase = re.compile(r'^([a-z]|_)*$')
lowercase_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
lowercase_semicolon = re.compile(r'^([a-z]|_)*;([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')
capitalized = re.compile(r'^([A-Z])\w')
lowercase_morethanone_colon = re.compile(r'(\b:\b)+')
     

def key_type(element, keys):
    if element.tag == "tag":
        if lowercase.search(element.attrib['k']):
            keys['lowercase'] += 1
#            print("k:" + element.attrib['k'] + " " "v:" + element.attrib['v'])
        elif lowercase_colon.search(element.attrib['k']):
            keys['lowercase_colon'] += 1
#            print("k:" + element.attrib['k'] + " " "v:" + element.attrib['v'])
        elif capitalized.search(element.attrib['k']):
            keys['capitalized'] += 1
        elif lowercase_morethanone_colon.search(element.attrib['k']):
            keys['lowercase_morethanone_colon'] += 1
#            print(element.attrib['k'])
        elif lowercase_semicolon.search(element.attrib['k']):
            keys['lowercase_semicolon'] += 1
        elif problemchars.search(element.attrib['k']):
            keys['problemchars'] += 1
        else:
            keys['other'] += 1
#            print(element.attrib['k'])

    return keys


def process_map_keys(filename):
    keys = {"lowercase": 0, "lowercase_colon": 0, "lowercase_morethanone_colon": 0,"lowercase_semicolon": 0, "capitalized": 0, "problemchars": 0, "other": 0}
    for _, element in ET.iterparse(filename):
        keys = key_type(element, keys)

    return keys


keys = process_map_keys('Toulouse.osm')
pprint(keys)

{'capitalized': 165,
 'lowercase': 1432376,
 'lowercase_colon': 202929,
 'lowercase_morethanone_colon': 11856,
 'lowercase_semicolon': 0,
 'other': 24,
 'problemchars': 0}
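
One detail of the patterns above is worth illustrating: `(\b:\b)+` matches any colon that sits between two word characters, so a key with a single colon also matches it. In `key_type()` this only works as intended because `lowercase_colon` is checked first. The sketch below compares it with a stricter pattern that requires three or more lowercase parts; the name `multi_colon` and the sample keys are hypothetical, and Python 3 is assumed.

```python
import re

# Pattern used in the notebook for "more than one colon"
original = re.compile(r'(\b:\b)+')
# Hypothetical stricter pattern: three or more lowercase parts joined by colons
multi_colon = re.compile(r'^[a-z_]+(?::[a-z_]+){2,}$')

samples = ["addr:street", "seamark:light:colour", "hola:buenas: com"]
results = {}
for key in samples:
    results[key] = (bool(original.search(key)), bool(multi_colon.search(key)))
    print(key, results[key])
```

With the stricter pattern the order of the checks would no longer matter for this category.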


In [12]:
test_string = 'hola:buenas: com'
#lowercase_morethanone_colon = re.compile(r':{2,}')
lowercase_morethanone_colon = re.compile(r'(\b:\b)+')
print(lowercase_morethanone_colon.search(test_string))

<_sre.SRE_Match object at 0x0000000008FE4E40>


Since we saw some 'v' attributes of the tag "tag" displayed with escape sequences such as \xe9 and \xe8, we also try to find how many values contain special characters. The __value_type()__ function first filters for problem characters, then checks the colon and semicolon patterns, then values that start with a capital letter, and puts the rest in the "other" category. In the output below the `problemchars` count is 0: the pattern `^(\\)` only matches values that begin with a literal backslash, whereas sequences like \xe9 are how Python 2 displays accented characters, so this filter does not detect them.

In [13]:
problemchars_nospaces = re.compile(r'^(\\)')

def value_type(element, values):
    if element.tag == "tag":
        if problemchars_nospaces.search(element.attrib['v']):
            values['problemchars'] += 1
            print (element.attrib['v'])
        elif lowercase_colon.search(element.attrib['v']):
            values['lowercase_colon'] += 1
        elif lowercase_morethanone_colon.search(element.attrib['v']):
            values['lowercase_morethanone_colon'] += 1
        elif lowercase_semicolon.search(element.attrib['v']):
            values['lowercase_semicolon'] += 1
        elif capitalized.search(element.attrib['v']):
            values['capitalized'] += 1
        else:
            values['other'] += 1
#            print(element.attrib['v'])

    return values


def process_map_values(filename):
    values = {"capitalized": 0, "problemchars": 0, "other": 0, "lowercase_colon": 0, "lowercase_morethanone_colon": 0,"lowercase_semicolon": 0}
    for _, element in ET.iterparse(filename):
        values = value_type(element, values)

    return values

values = process_map_values('Compans_Caffarelli.osm')
pprint(values)

{'capitalized': 3272,
 'lowercase_colon': 0,
 'lowercase_morethanone_colon': 200,
 'lowercase_semicolon': 4,
 'other': 12789,
 'problemchars': 0}
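
If the goal is to count values that contain accented or other non-ASCII characters, one option is to test the code points directly rather than to look for a backslash. The following is a minimal sketch, assuming Python 3; the helper name `has_non_ascii` and the sample values are illustrative.

```python
def has_non_ascii(text):
    """True when the text contains at least one character outside ASCII."""
    return any(ord(ch) > 127 for ch in text)


# The last sample is a literal backslash followed by "xe9", not an accented letter
sample_values = ["Avenue Honoré Serres", "restaurant", "Héraclès", "1.5€", "\\xe9"]
flags = [has_non_ascii(value) for value in sample_values]
print(list(zip(sample_values, flags)))
```

Such a check could be used as the first branch of `value_type()` in place of the `problemchars_nospaces` pattern.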


# Plan for Cleaning Data

The cleaning focuses on the parts of the data most exposed to human error: the `k` and `v` attributes under the tag `tag`, which are entered by many different users. The following things will be cleaned up:<br>
1) The street types will be made consistent: first letter capitalized and the full word instead of an abbreviation, for example "Rue", "Avenue", etc.
2) The phone number attribute, for each element that has one, will be cleaned so that it contains no spaces or brackets and starts with the international code "0033".

# Cleaning Data

## 1) Abbreviated, Misspelled and Lower Case Street Types 

The names of the street types e.g. rue, avenue, etc were changed to all having the first letter capitalized and non abbreviated types. E.g. rue => Rue, Av. => Avenue, Bd => Boulevard

In [16]:

street_type_re = re.compile(r'^\b\S+\.?', re.IGNORECASE)

expected_street_ascii = ["Rue", "Esplanade", "Boulevard", "Avenue", "Place", u'All\xe9e', 'Route', 'Voie', 
                   'Impasse', 'Chemin', 'Rond-Point', 'Quai', 'Promenade', 'Port', 'Passage', 'Mail',
                   'Descente', 'Clos' ]

expected_street = map(unicode,expected_street_ascii)
# UPDATE THIS VARIABLE
mapping_street = { u"route": u"Route",
            u"rue": u"Rue",
            u"rte": u"Route",
            u"esplanade": u"Esplanade",
            u"voie":u"Voie",
            u"place":u"Place",
            u"impasse":u"Impasse",
            u"chemin":u"Chemin",
            u"boulevard":u"Boulevard",
            u"avenue":u"Avenue",
            u'all\xe9es': u'All\xe9e',
            u'all\xe9e': u'All\xe9e',
            u"RUE":u"Rue",
            u"ROUTE":u"Route",
            u"Cheminement":u"Chemin",
            u"Av.":u"Avenue",
            u"Bd":u"Boulevard",
            u'All\xe9es': u'All\xe9e',
            u"AVENUE":u"Avenue",
            u"ALLEE": u'All\xe9e'
            }

def audit_street_type(street_types, street_name):
    m = street_type_re.search(street_name)
    if m:
        street_type = m.group()
        if street_type not in expected_street:
            street_types[street_type].add(street_name)
            
def is_street_name(elem):
    return (elem.attrib['k'] == "addr:street")

def audit_street(osmfile):  
    osm_file = open(osmfile, "r")
    street_types = defaultdict(set)
    for event, elem in ET.iterparse(osm_file, events=("start",)):
            for tag in elem.iter():
                for key in tag.attrib:
                    if type(tag.attrib[key]) == str:
                        tag.attrib[key] = unicode(tag.attrib[key])
#                        print tag.attrib
#                        print type(tag.attrib[key])
                if elem.tag == "node" or elem.tag == "way":
                    for tag in elem.iter("tag"):
                        if is_street_name(tag):
                            audit_street_type(street_types, tag.attrib['v'])
    osm_file.close()
    return street_types

Next, the __update_name_street()__ function is defined. It checks whether the street type is in the expected_street list defined above and, if not, replaces it using the mapping_street dictionary also defined above.

In [19]:
def update_name_street(name, mapping_street):
    # Replace the leading street type with its cleaned form when a mapping exists
    m = street_type_re.search(name)
    if m:
        # Only read the match after checking that there is one
        street_type = m.group()
        if street_type in mapping_street and street_type not in expected_street:
            name = re.sub(street_type_re, mapping_street[street_type], name)

    return name

Finally, the cleaned-up name is shown next to the original. On the full file, every key found in the mapping_street dictionary is changed to the corresponding cleaned-up name; below, just one example is shown.

In [20]:
st_types = audit_street('Compans_Caffarelli.osm')
for st_type, ways in st_types.iteritems():
    for name in ways:
        better_name = update_name_street(name, mapping_street)
        print name, "=>", better_name

esplanade compans-caffarelli => Esplanade compans-caffarelli
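
The mapping above lists each spelling separately ("rue", "RUE", "Av.", ...). A possible alternative, sketched here for Python 3, is to key the mapping by the lowercased street type so that one entry covers every capitalization. The names `STREET_TYPES` and `normalize_street` are hypothetical, and the mapping is only a subset chosen for illustration.

```python
import re

STREET_TYPE_RE = re.compile(r'^\S+')

# Hypothetical mapping keyed by the lowercased street type
STREET_TYPES = {
    "rue": "Rue", "rte": "Route", "route": "Route",
    "av.": "Avenue", "avenue": "Avenue",
    "bd": "Boulevard", "boulevard": "Boulevard",
    "esplanade": "Esplanade",
    "allee": "Allée", "allée": "Allée", "allées": "Allée",
}


def normalize_street(name):
    """Replace the first word of a street name with its cleaned street type."""
    match = STREET_TYPE_RE.search(name)
    if not match:
        return name
    cleaned = STREET_TYPES.get(match.group().lower())
    if cleaned is None:
        return name  # unknown street type: leave the name untouched
    return cleaned + name[match.end():]


for raw in ["RUE de Metz", "Av. Honoré Serres", "esplanade compans-caffarelli"]:
    print(raw, "=>", normalize_street(raw))
```

As in the notebook's version, only the street type is changed; the rest of the name keeps its original capitalization.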


## 2) Phone number Inconsistent Format

Some phone numbers have a format such as +33561

The following two functions take the phone numbers found under the `node`, `way` and `relation` elements, specifically under the "contact:phone" and "phone" keys, and put them into two baskets: those that start with a plus sign and those that do not.

In [21]:
# Anchored at the start so that only numbers beginning with "+" match
phone_number_re = re.compile(r'^\+')
phone_number_cleaning_re = re.compile(r'^\+[0-9]*')

def audit_phonenumber(no_plus_sign_phones, plus_sign_phones, actual_number):
    # Sort each number by whether it starts with a plus sign
    if phone_number_re.search(actual_number):
        plus_sign_phones.append(actual_number)
    else:
        no_plus_sign_phones.append(actual_number)
        
def audit_phone(osmfile):
    osm_file = open(osmfile, "r")
    no_plus_sign_phones = []
    plus_sign_phones = []
    for event, elem in ET.iterparse(osm_file, events=("start",)):
        for tag in elem.iter():
            for key in tag.attrib:
                if type(tag.attrib[key]) == str:
                    tag.attrib[key] = unicode(tag.attrib[key])
        if elem.tag == "node" or elem.tag == "way" or elem.tag == "relation":
            for tag in elem.iter("tag"):
                if (tag.attrib['k'] == "phone") or (tag.attrib['k'] == "contact:phone"):
                    audit_phonenumber(no_plus_sign_phones, plus_sign_phones, actual_number=tag.attrib['v'])
    osm_file.close()
    return no_plus_sign_phones, plus_sign_phones

Next, for this small sample file, we can see what the basket without a plus sign contains. Some phone numbers start with "0", which is how they are dialed locally; to keep things consistent, the leading "0" is substituted with "0033".

The second type is the number "3631". This is a short number specific to France for calling the post office. The full number is "0033972721213", so it is substituted. Finally, in the full file there are some numbers that start with "33"; for these, "00" is added in front so that the number starts with "0033".

In [22]:
no_plus_sign_phones, plus_sign_phones = audit_phone('Compans_Caffarelli.osm')
pprint(no_plus_sign_phones)

[u'3631', u'0984180201']


For the numbers that start with a plus sign, we can simply replace the "+" with "00".


In [23]:
pprint(plus_sign_phones)

[u'+33 5 61 63 81 82',
 u'+33562309977',
 u'+33561216654',
 u'+33 5 61 62 02 26',
 u'+33 5 61 22 18 88',
 u'+33 5 61 21 68 81',
 u'+33 5 61 21 74 74',
 u'+33561213764',
 u'+33 9 54 13 39 50',
 u'+33561215533',
 u'+33 5 61 21 76 74',
 u'+33 5 61 12 67 01',
 u'+33 5 34 30 92 51',
 u'+33 534251010',
 u'+33 5 34 33 51 42',
 u'+33 5 62 80 57 01',
 u'+33 5 61 23 93 98',
 u'+33 5 61 21 39 46',
 u'+33 5 61 21 34 45',
 u'+33 5 61 21 30 75',
 u'+33561221465',
 u'+33561218070',
 u'+33561235686',
 u'+33 5 61 23 16 22',
 u'+33 5 62 30 05 30',
 u'+33 5 34 25 28 82',
 u'+33 5 34 25 28 82',
 u'+33561217308',
 u'+33561216184',
 u'+33561219633',
 u'+33561218738',
 u'+33 5 61 23 57 51',
 u'+33 5 61 22 47 60',
 u'+33 5 61 62 24 49',
 u'+33 5 31 22 03 14',
 u'+33 5 61 21 78 27',
 u'+33 5 61 29 23 49',
 u'+33 5 61 12 12 72',
 u'+33561230137',
 u'+33 5 61 23 82 52',
 u'+33 5 61 23 77 28',
 u'+33 5 61 22 17 77',
 u'+33 5 34 33 77 32',
 u'+33 9 88 77 66 80',
 u'+33 6 30 13 30 72',
 u'+33 5 61 21 74 89',
 u'+33

Finally, apart from the replacement of numbers and plus signs mentioned above, all the spaces and brackets are removed to give consistent 13-digit phone numbers.

In [24]:
for plus_sign_phone in plus_sign_phones:
    m = phone_number_cleaning_re.search(plus_sign_phone)
    if m:
        # Replace the leading "+" with "00", then drop every non-digit character
        updated_number_step1 = re.sub(r'^\+', '00', plus_sign_phone)
        updated_number = re.sub(r'\D+', '', updated_number_step1)
        print plus_sign_phone, "=>", updated_number

phone_number_start_0_re = re.compile(r'^0')

for no_plus_sign_phone in no_plus_sign_phones:
    if no_plus_sign_phone == u'3631':
        # Short number for the French post office, replaced by its full number
        updated_number = u'0033972721213'
    elif phone_number_start_0_re.match(no_plus_sign_phone):
        # Local format: replace the leading "0" with "0033"
        updated_number_step1 = re.sub(r'^0', '0033', no_plus_sign_phone)
        updated_number = re.sub(r'\D+', '', updated_number_step1)
    else:
        # Numbers starting with "33": prefix "00" so they start with "0033"
        updated_number_step1 = re.sub(r'^3', '003', no_plus_sign_phone)
        updated_number = re.sub(r'\D+', '', updated_number_step1)
    print no_plus_sign_phone, "=>", updated_number

+33 5 61 63 81 82 => 0033561638182
+33562309977 => 0033562309977
+33561216654 => 0033561216654
+33 5 61 62 02 26 => 0033561620226
+33 5 61 22 18 88 => 0033561221888
+33 5 61 21 68 81 => 0033561216881
+33 5 61 21 74 74 => 0033561217474
+33561213764 => 0033561213764
+33 9 54 13 39 50 => 0033954133950
+33561215533 => 0033561215533
+33 5 61 21 76 74 => 0033561217674
+33 5 61 12 67 01 => 0033561126701
+33 5 34 30 92 51 => 0033534309251
+33 534251010 => 0033534251010
+33 5 34 33 51 42 => 0033534335142
+33 5 62 80 57 01 => 0033562805701
+33 5 61 23 93 98 => 0033561239398
+33 5 61 21 39 46 => 0033561213946
+33 5 61 21 34 45 => 0033561213445
+33 5 61 21 30 75 => 0033561213075
+33561221465 => 0033561221465
+33561218070 => 0033561218070
+33561235686 => 0033561235686
+33 5 61 23 16 22 => 0033561231622
+33 5 62 30 05 30 => 0033562300530
+33 5 34 25 28 82 => 0033534252882
+33 5 34 25 28 82 => 0033534252882
+33561217308 => 0033561217308
+33561216184 => 0033561216184
+33561219633 => 0033561219633
+335
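
The rules applied in the loops above can also be gathered into one function, which would make them easier to reuse during the JSON conversion. The sketch below assumes Python 3; the names `normalize_phone` and `SHORT_NUMBERS` are hypothetical, and the branch for numbers that already start with "0033" is an added assumption that is not in the notebook.

```python
import re

# Short number and its replacement, as given in the text above
SHORT_NUMBERS = {"3631": "0033972721213"}


def normalize_phone(raw):
    """Return a phone number as digits only, starting with 0033."""
    raw = raw.strip()
    if raw in SHORT_NUMBERS:
        return SHORT_NUMBERS[raw]
    digits = re.sub(r'\D', '', raw)  # drop spaces, brackets, plus signs
    if raw.startswith('+'):
        return '00' + digits          # "+33..." becomes "0033..."
    if digits.startswith('0033'):
        return digits                 # assumption: already in the target format
    if digits.startswith('0'):
        return '0033' + digits[1:]    # local format "05..." becomes "00335..."
    if digits.startswith('33'):
        return '00' + digits          # "33..." becomes "0033..."
    return digits


for raw in ['+33 5 61 63 81 82', '+33562309977', '0984180201', '3631']:
    print(raw, "=>", normalize_phone(raw))
```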

## Converting all nodes to JSON format

As mentioned at the beginning, the OSM XML file will now be converted to the following format, with the street types and phone numbers cleaned up:

```
{
"id": "2406124091",
"type": "node",
"visible":"true",
"created": {
          "version":"2",
          "changeset":"17206049",
          "timestamp":"2013-08-03T16:43:42Z",
          "user":"linuxUser16",
          "uid":"1219059"
        },
"pos": [41.9757030, -87.6921867],
"address": {
          "housenumber": "5157",
          "postcode": "60625",
          "street": "North Lincoln Ave"
        },
"amenity": "restaurant",
"cuisine": "mexican",
"name": "La Cabana De Don Luis",
"phone": "1 (773)-271-5176"
}
```
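
Once elements have this shape, they could be written out with one JSON document per line, which is the layout that `mongoimport` reads without the `--jsonArray` flag. The sketch below assumes Python 3; the function name `write_json_lines` and the two sample documents are hypothetical, and an in-memory buffer stands in for the output file.

```python
import io
import json


def write_json_lines(documents, handle):
    """Write each document as one JSON object per line; return the count."""
    count = 0
    for doc in documents:
        # ensure_ascii=False keeps accented characters readable in the file
        handle.write(json.dumps(doc, ensure_ascii=False) + "\n")
        count += 1
    return count


docs = [
    {"id": "1", "type": "node", "name": "Fiel mon restô"},
    {"id": "2", "type": "way"},
]
buffer = io.StringIO()
written = write_json_lines(docs, buffer)
lines = buffer.getvalue().splitlines()
print(written, lines)
```

For a real file, the buffer would be replaced by a handle opened with `encoding="utf-8"`.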