# INTRODUCTION 

I have been living in the Boston area for the last few years since grad school. The dataset analyzed for the purposes of this project pertains to the Boston area. The Boston area dataset was exported from [openstreetmaps](http://www.openstreetmap.org/#map=17/40.71652/-73.94470&layers=H). The analysis included the following steps 

* **Question Phase:** This phase involves asking general questions about the dataset. The questions involve the problem we are trying to solve for. 
* **Data Auditing:** This phase involves auditing the data to identify anomalies and patterns. E.g. In the streetmap data we could run into street names which have some kind special characters in them, or we could run into zipcodes in the Boston area that have some kind of alphabetical characters in them. 
* **Data Cleansing:** This phase involves classifying the anomalies that are identified in the previous step and devising approaches to clean up the data. The cleansing could be either manual or done programmatically. The project assumes both a programmatic and a manual approach to cleansing data. The focus is mostly been around cleansing the data programmatically. However in certain cases there is also a need for a manual review 

Data Auditing and Data Cleansing follow a repetive approach till a fair amount of data anomalies have been identified and also cleansed approrpriately. 

* **Conclusion:** This phase involves drawing conclusion about the dataset, based on the auditing and cleansing steps 
* **Communication:** The phase involves communicating the results of the analysis to the audiences. In a real life scenario this would be the business users who make business decisions based on the dataset analysis. 

In addition, this project also involves importing the dataset into [mongoDB](https://www.mongodb.com/), followed by executing some of the mongoDB's aggregation commands to further analyze the dataset that has been imported. 






# Question Phase



# Data Auditing
## Identifying the TAGS along with the count of occurences of each of the TAGS

This step involves doing an initial analysis of the dataset and doing an assessment of the XML nodes. The step also involves counting the number of instances of the specific node. While this step does provide a good start to the data auditing process, it does not answer a whole of questions that needs to be answered. This step definitely helps us confirm the validity of the XML format as the XML parser (ET.iterparse) is able to parse through the entire XML file. 

In [2]:
import xml.etree.cElementTree as ET
import pprint
import re
import codecs
import json
from collections import defaultdict

filename = 'boston.osm'

def test(): 
    tag_list = []
    tag_dict = defaultdict(int)
    for _, element in ET.iterparse(filename): 
        tag_list.append(element.tag)
    for item in tag_list: 
        tag_dict[item] += 1
    pprint.pprint(tag_dict)


if __name__ == "__main__":
    test()

defaultdict(<type 'int'>, {'node': 444899, 'nd': 551066, 'bounds': 1, 'member': 5284, 'tag': 221456, 'relation': 645, 'way': 75362, 'osm': 1})


# Data Auditing (contd)
## Identifying the Count of Unique Users (based on UIDs) 
OpenSteet map being an openly available map which can be updated by users all over the world, it made sense to get an idea of the number of unique users who have contributed to the map. 


In [3]:
import xml.etree.cElementTree as ET
import pprint
import re
import codecs
import json
from collections import defaultdict

filename = 'boston.osm'

def test(): 
    users = set()
    for _,element in ET.iterparse(filename):
        if 'uid' in element.attrib:
            users.add(element.get('uid'))
    print "Count of Unique Users:", (len(users))
# Commenting out the Line that prints out set of Unique Users 
#     print users
    users_list = list(users)
    print "Printing List of ten Unique User Ids:", users_list[:10]
    
    
if __name__ == "__main__":
    test()

Count of Unique Users: 848
Printing List of ten Unique User Ids: ['701372', '3057995', '378464', '14850', '967832', '152074', '4581744', '2176051', '113450', '45027']


# Data Auditing (contd)
## Identifying the Count of Unique Users (based on Users) 
This test is very similar to the test done above, the only difference is that instead of using the UID element we are using the USER element. The test is done to be sure that the results are the same. In both the tests we notice that the counts are the same

In [4]:
import xml.etree.cElementTree as ET
import pprint
import re
import codecs
import json
from collections import defaultdict

filename = 'boston.osm'

def test(): 
    users = set()
    for _,element in ET.iterparse(filename):
        if 'user' in element.attrib:
            users.add(element.get('user'))
    print len(users)
    users_list = list(users)
    print "Printing List of ten Unique Users:", users_list[:10]
    
if __name__ == "__main__":
    test()

848
Printing List of ten Unique Users: ['maxmetcalfe', 'Roger Neumann', 'dloutzen', 'TuftsReady', 'Matej Cepl', 'Steven Deeds', 'Brett Camper', 'noobi', 'Thia564', 'signed0']


# Data Auditing (Contd)
## User Contribution Count:

The purpose of this audit is to identify the number of times a specific user has contributed to the map. This test was done to just get an assessment of the top contribution numbers

In [5]:
import xml.etree.cElementTree as ET
import codecs
import json
import operator
from collections import defaultdict

filename = 'boston.osm'

def test(): 
    user_list = []
    user_dict = defaultdict(int)
    for _,element in ET.iterparse(filename):
        if 'user' in element.attrib:
            user_list.append(element.get('user'))
        else: 
            continue
    
    for item in user_list: 
        user_dict[item] +=1 
    sorted_user_dict = sorted(user_dict.iteritems(), key=operator.itemgetter(1), reverse=True)
    pprint.pprint(sorted_user_dict)
    
    
if __name__ == "__main__":
    test()

[('crschmidt', 269245),
 ('jremillard-massgis', 64989),
 ('wambag', 29490),
 ('OceanVortex', 27828),
 ('ryebread', 21770),
 ('morganwahl', 20412),
 ('mapper999', 8315),
 ('cspanring', 6817),
 ('JasonWoof', 5445),
 ('synack', 5054),
 ('ingalls_imports', 4166),
 ('Alexey Lukin', 3516),
 ('fiveisalive', 3145),
 ('MassGIS Import', 3115),
 ('Utible', 2872),
 ('probiscus', 1779),
 ('Prithason', 1439),
 ('phyzome', 1409),
 ('Extant', 1193),
 ('Alan Bragg', 1171),
 ('massDOT', 1167),
 ('Steven Deeds', 1124),
 ('pkoby', 1090),
 ('Ahlzen', 1080),
 ('thetornado76', 955),
 ('JessAk71', 953),
 ('3yoda', 890),
 ('jokeefe', 881),
 ('ceyockey', 790),
 ('woodpeck_repair', 771),
 ('Aredhel', 731),
 ('Pouletic', 702),
 ('jwass', 683),
 ('mterry', 669),
 ('dloutzen', 633),
 ('Peter Dobratz', 585),
 ('pokey', 574),
 ('KindredCoda', 553),
 ('dannya222', 546),
 ('aroach', 509),
 ('onurozgun', 504),
 ('headwatersolver', 464),
 ('kalanz', 410),
 ('nimapper', 394),
 ('Echo Echo', 393),
 ('nkhall', 382),
 ('spac

**The top ten contributors based on the results of the output above:**
[('crschmidt', 269245),
 ('jremillard-massgis', 64989),
 ('wambag', 29490),
 ('OceanVortex', 27828),
 ('ryebread', 21770),
 ('morganwahl', 20412),
 ('mapper999', 8315),
 ('cspanring', 6817),
 ('JasonWoof', 5445),
 ('synack', 5054)]
 
 In addition did some research on crschmidt and found this (http://crschmidt.net/mapping/). Looks like the user has been actively contributing to the online map community via series of hacks 

# Data Auditing (contd)
## This step involves auditing the value of the addr tags for 

1. Lower Case Characters 
2. Lower Case Characters with Colon 
3. Problematic Characters 

The final output includes a dictionary with keys for each of the category and the count of values for each of the categories

In [6]:
import re
import xml.etree.cElementTree as ET
import pprint
import re
import codecs
import json
from collections import defaultdict

lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')

filename = 'boston.osm'


def process_address(addr_value):
    if re.search(lower,addr_value):
        keys['lower'] += 1
    elif re.search(lower_colon, addr_value): 
        keys['lower_colon'] += 1
    elif re.search(problemchars, addr_value): 
        keys['problemchars'] += 1 
    else: 
        keys['other'] += 1
                

def read_file():
    for _, element in ET.iterparse(filename): 
        if element.tag == 'tag': 
            key=element.get('k')
            if 'addr:' in key:
                addr_value = element.get('v')
                process_address(addr_value)
    return keys
                    

if __name__ == "__main__":
    keys = defaultdict(int)
    read_key = read_file()
    print read_key


defaultdict(<type 'int'>, {'problemchars': 3332, 'lower': 83, 'other': 8195})


# Data Auditing (contd) 
## Auditing Street Names and Identifying Anomalies
This step involves auditing the street names for the last element in the string name. Identifying the last element in street name is done via regular expression. 

The psuedocode involves comparing this "last element value" in the street name with list of expected elements in the "expected" list. If the "last element value" is not found in the expected list, the element is added to the dictionary as key along with the associated street name. 

In [7]:
import re
import xml.etree.cElementTree as ET
import pprint
import re
import codecs
import json
from collections import defaultdict

street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)

expected = ["Street", "Avenue", "Boulevard", "Drive", "Court", "Place", "Square", "Lane", "Road", 
            "Trail", "Parkway", "Commons", "Heights", "North", "East", "West", "South"]


filename = 'boston.osm'
                
def process_addressanomalies(addressvalue): 
    match = street_type_re.search(addressvalue)
    if match: 
        street_type = match.group()
        if street_type not in expected: 
            street_types[street_type].add(addressvalue)
        
def read_file():
    for _, element in ET.iterparse(filename): 
        if element.tag == 'tag': 
            key=element.get('k')
            if 'addr:street' in key:
                addr_value = element.get('v')
                process_addressanomalies(addr_value)
    return street_types                

if __name__ == "__main__":
    street_types = defaultdict(set)
    read_key = read_file()
pprint.pprint(dict(read_key))

    


{'1100': set(['First Street, Suite 1100']),
 '1302': set(['Cambridge Street #1302']),
 '3': set(['Kendall Square - 3']),
 '303': set(['First Street, Suite 303']),
 '501': set(['Bromfield Street #501']),
 '6': set(['South Station, near Track 6']),
 '846028': set(['PO Box 846028']),
 'Ave': set(['738 Commonwealth Ave',
             'Boston Ave',
             'College Ave',
             'Commonwealth Ave',
             'Concord Ave',
             'Francesca Ave',
             'Highland Ave',
             'Josephine Ave',
             'Lexington Ave',
             'Massachusetts Ave',
             'Morrison Ave',
             'Mystic Ave',
             'Somerville Ave',
             'Western Ave',
             'Willow Ave']),
 'Ave.': set(['Brighton Ave.',
              'Massachusetts Ave.',
              'Somerville Ave.',
              'Spaulding Ave.']),
 'Broadway': set(['Broadway']),
 'Cambrdige': set(['Cambrdige']),
 'Center': set(['Cambridge Center', 'Financial Center']),
 'Circle':

# On quick review of the output from the previous step we see that 
1. Street has different variations. E.g. Some forms include "st", "St", "St." , "St," and also "ST" 
2. Similarly we also see that Avenue is sometimes referred to as "Ave". 

# Data Auditing (contd) 
## Auditing the Phone Numbers and Identifying Anomalies

In [8]:
import re
import xml.etree.cElementTree as ET
import pprint
import re
import codecs
import json
from collections import defaultdict


filename = 'Boston.osm'
                
        
def read_file():
    for _, element in ET.iterparse(filename): 
        for child in element.getchildren(): 
            if child.tag=='tag': 
                key=child.get('k')
                if 'phone' in key: 
                    phone_number = child.get('v')
                    phone_set.add(phone_number)
    return phone_set
                

if __name__ == "__main__":
    street_types = defaultdict(set)
    phone_set = set()
    read_key = read_file()
    print len(read_key)
    print read_key


424
set(['+1 617 876-3988', '617-261-0158', '+1 617 576 1253', '+1 (781) 267-4539', '(617) 277-3737', '617-254-7163', '+1-617-294-4233', '6174894408', '(617) 206-2994', '617 491 2999', '+1 617 375 8550', '+1-617-764-3152', '+16177872430', '617-635-8937', '+1 617 661 0077', '+1 (617) 623-9068', '+1 617 876 6990', '617 236 0571', '(617) 863-3650', '+1 617 440 4192', '+16176610433', '+1 (857) 417-2396', '+1 617 227-2750', '(617) 307-7608', '+1 617 4971513', '+1 617 876-6555', '+1 617 349 3937', '+1 617 868-6330', '+1 617 262 2424', '(617) 714-3974', '(866) 995-2479', '+1 617 496-5955', '+1-617-627-3030', '+1 617 628-3618', '617-926-7740', '+16175420210', '+1-857-242-3605', '+1-617-542-5942', '+1 617 254-0112', '+1 617 441-2500', '+16174920711', '+1 857 2596552', '+1 617 987-4236', '617-787-5507', '+1 617 623 1159', '(617) 494-8700', '+1-617-236-1100', '+26777722147', '18006350489', '617-225-2777', '+1 617 783-5804', '+1 617 497 3926', '+16173496567', '617-635-6470', '+1 617 764 4960', '61

## Careful Review of the Phone Data did not indicate a lot of anomalies, however it does indicate varying formats as described below  
* A few phone numbers have Country Code (+1) in them. E.g. +16175722000
* A few phone numbers have spaces in between them instead of the hyphens. E.g. 617 242 9000
* A few phone numbers have parenthesis. E.g. (857) 417-2396


# Data Cleansing
## Cleansing Street Names

This step involves creation of logic to cleanse the anomalies identified in the previous step. This steps involves a mapping dictionary which has the mapping between the incorrect format and the correct format of the street names in the form of key/value pairs. The psuedocode involves utilizing some of old code, that involved building a dictionary, by grouping street names that ended with a similar value. 

The cleansing step involves looping through the dictionary of elements and validating to see if the value can be found in the mapping dictionary. If the mapping is found then appropriate replacement is done to the street name using the python replace method


In [9]:
import re
import xml.etree.cElementTree as ET
import pprint
import re
import codecs
import json
from collections import defaultdict

street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)

expected = ["Street", "Avenue", "Boulevard", "Drive", "Court", "Place", "Square", "Lane", "Road", 
            "Trail", "Parkway", "Commons", "Heights", "North", "East", "West", "South"]

mapping = { "St": "Street",
            "St.": "Street",
            "Rd.": "Road", 
            "Ave": "Avenue", 
            "Ave.": "Avenue", 
            "St": "Street", 
            "St,": "Street", 
           "St.": "Street", 
           "ST": "Street"
            }


filename = 'boston.osm'
                
def process_addressanomalies(addressvalue): 
    match = street_type_re.search(addressvalue)
    if match: 
        street_type = match.group()
        if street_type not in expected: 
            street_types[street_type].add(addressvalue)

def update_name(): 
    for k, v in read_key.iteritems(): 
        for vitem in v: 
            match = street_type_re.search(vitem) 
            val =  match.group(0)
            if val in mapping: 
                new_name = vitem.replace(match.group(0), mapping[match.group(0)])
                print vitem, "==>", new_name
            

def read_file():
    for _, element in ET.iterparse(filename): 
        if element.tag == 'tag': 
            key=element.get('k')
            if 'addr:street' in key:
                addr_value = element.get('v')
                process_addressanomalies(addr_value)
    return street_types                

if __name__ == "__main__":
    street_types = defaultdict(set)
    read_key = read_file()
    better_adress = update_name()

Walnut St, ==> Walnut Street
Pearl St. ==> Pearl Street
Banks St. ==> Banks Street
Marshall St. ==> Marshall Street
Prospect St. ==> Prospect Street
Main St. ==> Main Street
Albion St. ==> Albion Street
Saint Mary's St. ==> Saint Mary's Street
Boylston St. ==> Boylston Street
Stuart St. ==> Stuart Street
Elm St. ==> Elm Street
Newton ST ==> Newton Street
Brighton Ave. ==> Brighton Avenue
Spaulding Ave. ==> Spaulding Avenue
Massachusetts Ave. ==> Massachusetts Avenue
Somerville Ave. ==> Somerville Avenue
Brentwood St ==> Brentwood Street
Athol St ==> Athol Street
Everett St ==> Everett Street
South Waverly St ==> South Waverly Street
Litchfield St ==> Litchfield Street
Hampshire St ==> Hampshire Street
Main St ==> Main Street
Cambridge St ==> Cambridge Street
Arsenal St ==> Arsenal Street
Merrill St ==> Merrill Street
Antwerp St ==> Antwerp Street
1629 Cambridge St ==> 1629 Cambridge Street
Elm St ==> Elm Street
Lothrop St ==> Lothrop Street
Charles St ==> Charles Street
Dane St ==> Dan

# Data Cleansing
## Cleansing Phone Numbers


In [10]:
# Importing all the Needed Libraries
import re
import xml.etree.cElementTree as ET
import pprint
import re
import codecs
import json
from collections import defaultdict


'''Regular Expressions for different kinds of Cleaning'''
# Regular Expression to check whether the phone number is starting with + 1 
phonecheckone = re.compile(r'^\+[1]\s*')

# Regular Expression to check whether the phone number has a hyphen in it 
phonecheckhyphen = re.compile(r'\-+')

# Regular Expression to check whether the phone number has any white spaces in it 
phonecheckspace = re.compile(r'\s+?')

# Regular Expression to check whether the phone number contains any parenthesis
phonecheckpar1 = re.compile(r'\(+')
phonecheckpar2 = re.compile(r'\)+')

# Regular Expression to check whether the phone number contains any periods
phonecheckperiod = re.compile(r'\.+')

# Regular Expression to check whether the phone number contains any alphabets in them 
phonecheckalpha = re.compile(r'[a-zA-Z].*')

# Regular Expression to check whether the phone number has any commas in it 
phonecheckcomma = re.compile(r'\,+')

filename = 'Boston.osm'



'''Function that does the phone number cleanse through an iterative process'''
def clean_phone(): 
    
    '''For loop to check for alphabets in the phonenumber. Phone Numbers that have alphabets
    are populated in a seperate list for manual review. Phone numbers that do not have any alphabets proceed along
    to other steps in the cleaning process '''
    for phone_item in read_key:
        match=phonecheckalpha.search(phone_item)
        if match: 
            phone_list1Manual.append("To be Review Manually:" + phone_item)
        else:
            phone_list1.append(phone_item)

    '''For Loop to cleanse any whitespaces in the phonenumber'''
    for phonelist1_item in phone_list1: 
        match=phonecheckspace.search(phonelist1_item)
        if match: 
            phonelist1_itemnew=phonelist1_item.replace(match.group(0), '')
            phone_list2.append(phonelist1_itemnew)
        else: 
            phone_list2.append(phonelist1_item)
    
    '''For Loop to cleanse any Hyphens in the phonenumber'''
    for phonelist2_item in phone_list2: 
        match = phonecheckhyphen.search(phonelist2_item)
        if match: 
            phonelist2_itemnew = phonelist2_item.replace(match.group(0), '')
            phone_list3.append(phonelist2_itemnew)
        else: 
            phone_list3.append(phonelist2_item)
    
    '''For Loop to cleanse any phonenumbers that start with +1 '''  
    for phonelist3_item in phone_list3: 
        match = phonecheckone.search(phonelist3_item)
        if match: 
            phonelist3_itemnew = phonelist3_item.replace(match.group(0), '')
            phone_list4.append(phonelist3_itemnew)
        else: 
            phone_list4.append(phonelist3_item)
    
    '''For loop to cleanse any paranthesis in the phonenumber'''
    for phonelist4_item in phone_list4: 
        match = phonecheckpar1.search(phonelist4_item)
        if match: 
            phonelist4_itemnew = phonelist4_item.replace(match.group(0), '')
            phone_list5.append(phonelist4_itemnew)
        else: 
            phone_list5.append(phonelist4_item)

    
    for phonelist5_item in phone_list5: 
        match = phonecheckpar2.search(phonelist5_item)
        if match: 
            phonelist5_itemnew = phonelist5_item.replace(match.group(0), '')
            phone_list6.append(phonelist5_itemnew)
        else: 
            phone_list6.append(phonelist5_item)
    
    '''For loop to cleanse any periods in the phonenumber'''
    for phonelist6_item in phone_list6: 
        match = phonecheckperiod.search(phonelist6_item)
        if match: 
            phonelist6_itemnew = phonelist6_item.replace(match.group(0), '')
            phone_list7.append(phonelist6_itemnew)
        else: 
            phone_list7.append(phonelist6_item)

    '''For loop to cleanse any phone numbers that begin with a 1'''
    for phonelist7_item in phone_list7: 
        if phonelist7_item.startswith('1') or phonelist7_item.startswith('+') : 
            phone_list8.append(phonelist7_item[1:])
        else: 
            phone_list8.append(phonelist7_item)

    '''For loop to cleanse any commas in the phonenumbers'''
    for phonelist8_item in phone_list8: 
        match = phonecheckcomma.search(phonelist8_item)
        if match: 
            phonelist8_itemnew = phonelist8_item.replace(match.group(0), '')
            phone_list9.append(phonelist8_itemnew)
        else: 
            phone_list9.append(phonelist8_item)
        
def checkalphabetsfromsource(): 
     for phonevalue in read_key:
        match=phonecheckalpha.search(phonevalue)
        if match: 
            phone = phonevalue.replace(match.group(0), '')
            phoneset_alpha.append(phonevalue)
        else: 
            phoneset_noalpha.append(phonevalue)

    
       
def read_file():
    for _, element in ET.iterparse(filename): 
        for child in element.getchildren(): 
            if child.tag=='tag': 
                key=child.get('k')
                if 'phone' in key: 
                    phone_number = child.get('v')
                    phone_set.append(phone_number)
    return phone_set
                

if __name__ == "__main__":
    count = 0 
    
    '''Initializing the different list before using them through the 
    different steps of the cleanse process with the regular expression'''
    
    
    phone_set = []
    phoneset_alpha=[]
    phoneset_noalpha=[]
    phone_list1=[]
    phone_list1Manual=[]
    phone_list2=[]
    phone_list3=[]
    phone_list4=[]
    phone_list5=[]
    phone_list6=[]
    phone_list7=[]
    phone_list8=[]
    phone_list9=[]
    
    
    
    read_key = read_file()
    clean_phone = clean_phone()
    check_alphabets = checkalphabetsfromsource()
    
    
    
    
    print "Phone Number Cleansing High Level Stats"
    print "-------------------------------------------------"
    print "Total Number of Phone Numbers to be Validated", len(read_key)
    print "Total Number Cleaned Up Phone Numbers: ", count 
    print "Total Number of Phone Numbers to be cleaned Manually: ", len(phone_list1Manual) 
    print "List of Phone Numbers to be Cleaned Manually" 
    pprint.pprint(phone_list1Manual)
    print "-------------------------------------------------"
    
    
    '''Sample List of Cleansed Phone Numbers'''
    print "Sample List of Cleansed Phone Numbers"
    for i in range(len(phone_list1)): 
        if len(phone_list9[i])==10 and count<=10: 
            print phone_list1[i],    "--->" ,  phone_list9[i]
            count = count + 1 
            
        elif len(phone_list9[i])>10: 
            phone_list1Manual.append("To be Review Manually:" + phone_list9[i])
        
        else: 
            continue
    '''List of PhoneNumbers to be Cleansed Manually'''
    print "List of PhoneNumbers to be Cleansed Manually"
    pprint.pprint(phone_list1Manual)
    

Phone Number Cleansing High Level Stats
-------------------------------------------------
Total Number of Phone Numbers to be Validated 475
Total Number Cleaned Up Phone Numbers:  0
Total Number of Phone Numbers to be cleaned Manually:  6
List of Phone Numbers to be Cleaned Manually
['To be Review Manually:yes',
 'To be Review Manually:yes',
 'To be Review Manually:AstraZeneca Neuroscience: T: (617) 679-1680',
 'To be Review Manually:617-494-9330,  Forest City Management',
 'To be Review Manually:Phone 617.714.0555',
 'To be Review Manually:+1617958DELI']
-------------------------------------------------
Sample List of Cleansed Phone Numbers
617-635-8532 ---> 6176358532
617-254-8383 ---> 6172548383
617-349-6555 ---> 6173496555
617-266-8427 ---> 6172668427
781-393-2333 ---> 7813932333
617-284-7800 ---> 6172847800
617-635-8497 ---> 6176358497
617-635-9976 ---> 6176359976
617-354-0047 ---> 6173540047
617-666-3311 ---> 6176663311
617-262-9562 ---> 6172629562
List of PhoneNumbers to be Clea

# Data Cleansing
## Cleansing Zip Codes

In [11]:
# Importing all the Needed Libraries
import re
import xml.etree.cElementTree as ET
import pprint
import re
import codecs
import json
from collections import defaultdict


# Regular Expression to check whether the phone number contains any alphabets in them 
zipcheckalpha = re.compile(r'[a-zA-Z].*')
zipcheckhyphen = re.compile(r'^(\d{5})-\d{4}$')

def identifypostcode(): 
    
    '''Identifying Post Codes that have Characters 
        in them and sending them in a manual review list'''
    
    for zip_item in read_key:
        match=zipcheckalpha.search(zip_item)
        if match: 
            zipmanual.append("To be Review Manually:" + zip_item)
        else:
            zipauto.append(zip_item)

def fixpostcode(): 
    
    '''Check for Hyphens in Post Code and Extract the first part of the Zipcode''' 
    
    for zipcode_item in zipauto: 
        match = zipcheckhyphen.search(zipcode_item)
        if match: 
            ziphyphen.append(zipcode_item)
            zipcode_split = zipcode_item.split('-')
            ziphyphencleaned.append(zipcode_split[0]) 
        else: 
            ziphyphencleaned.append(zipcode_item)
        

            
def read_file():
    for _, element in ET.iterparse(filename): 
        for child in element.getchildren(): 
            if child.tag=='tag': 
                key=child.get('k')
                if 'addr:postcode' in key: 
                    post_code = child.get('v')
                    postcode.append(post_code)
    return postcode


    

if __name__ == "__main__":
    count = 0 
    
    '''Initializing the different list before using them through the 
    different steps of the cleanse process with the regular expression'''
    
    
    postcode = []
    zipmanual=[]
    ziphyphen=[]
    ziphyphencleaned=[]
    zipauto=[]
    
    read_key = read_file()
    identifypostcode = identifypostcode()
    fixpostcode=fixpostcode()
    
    print "Zip Code Cleansing High Level Stats"
    print "---------------------------------------------------------------------"
    print "Total Number of Zip Codes Encountered in the File: ", len(read_key)
    print "Total Number of Zip Codes to be Cleaned Manually: ", len(zipmanual)
    print "Total Number of Zip Codes Cleaned Using Regular Expressions: ", len(zipauto)
    print "Total Number of Zip Codes that were Cleaned: ", len(ziphyphencleaned)
    print "---------------------------------------------------------------------"

    '''Sample List of Cleansed ZipCodes'''
    
    print "Sample List of Cleansed ZipCodes"
    for i in range(len(ziphyphencleaned)): 
        if len(zipauto[i])>5 and count<=10: 
            print "Old Post Code: " + zipauto[i] + "--->" + "Cleaned Post Code: " + ziphyphencleaned[i]
            count = count + 1 
        else: 
            continue
    
    print "---------------------------------------------------------------------"
            
    '''List of Zipcodes to be Cleansed Manually'''
    print "List of Zipcodes to be Cleansed Manually"
    pprint.pprint(zipmanual)

    

Zip Code Cleansing High Level Stats
---------------------------------------------------------------------
Total Number of Zip Codes Encountered in the File:  1685
Total Number of Zip Codes to be Cleaned Manually:  5
Total Number of Zip Codes Cleaned Using Regular Expressions:  1680
Total Number of Zip Codes that were Cleaned:  1680
---------------------------------------------------------------------
Sample List of Cleansed ZipCodes
Old Post Code: 02114-3203--->Cleaned Post Code: 02114
Old Post Code: 02110-1301--->Cleaned Post Code: 02110
Old Post Code: 02140-1340--->Cleaned Post Code: 02140
Old Post Code: 02284-6028--->Cleaned Post Code: 02284
Old Post Code: 02134-1409--->Cleaned Post Code: 02134
Old Post Code: 02138-2706--->Cleaned Post Code: 02138
Old Post Code: 02114-3203--->Cleaned Post Code: 02114
Old Post Code: 02138-1901--->Cleaned Post Code: 02138
Old Post Code: 02138-3824--->Cleaned Post Code: 02138
Old Post Code: 02138-2701--->Cleaned Post Code: 02138
Old Post Code: 02138-29

# XML to a JSON Conversion 
## This step involves parsing through the XML file to create a JSON file, which will then be used to import into mongoDB. We need to follow the below rules for translation 

* Process only 2 types of top level tags: "node" and "way"
* All attributes of "node" and "way" should be turned into regular key/value pairs, except: attributes in the CREATED array should be added under a key "created", attributes for latitude and longitude should be added to a "pos" array, for use in geospacial indexing. Make sure the values inside "pos" array are floats and not strings.
* If second level tag "k" value contains problematic characters, it should be ignored
* If second level tag "k" value starts with "addr:", it should be added to a dictionary "address"
* If second level tag "k" value does not start with "addr:", but contains ":", you can process it same as any other tag.
* If there is a second ":" that separates the type/direction of a street, the tag should be ignored


**In addition as a part of the import the street names, phone numbers and zipcodes were also cleaned up**

In [13]:
import xml.etree.cElementTree as ET
import pprint
import re
import codecs
import json
import sys
sys.setrecursionlimit(10000)
from collections import defaultdict

CREATED = [ "version", "changeset", "timestamp", "user", "uid"]


'''Parameters for Zipcode Clean Up'''

zipcheckalpha = re.compile(r'[a-zA-Z].*')
zipcheckhyphen = re.compile(r'^(\d{5})-\d{4}$')


'''Parameters for Address Street Name Clean Up'''

street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)

expected = ["Street", "Avenue", "Boulevard", "Drive", "Court", "Place", "Square", "Lane", "Road", 
            "Trail", "Parkway", "Commons", "Heights", "North", "East", "West", "South"]

mapping = { "St": "Street",
            "St.": "Street",
            "Rd.": "Road", 
            "Ave": "Avenue", 
            "Ave.": "Avenue", 
            "St": "Street", 
            "St,": "Street", 
           "St.": "Street", 
           "ST": "Street"
            }


'''Parameters for Phone Clean Up'''

# Regular Expression to check whether the phone number is starting with + 1 
phonecheckone = re.compile(r'^\+[1]\s*')

# Regular Expression to check whether the phone number has a hyphen in it 
phonecheckhyphen = re.compile(r'\-+')

# Regular Expression to check whether the phone number has any white spaces in it 
phonecheckspace = re.compile(r'\s+?')

# Regular Expression to check whether the phone number contains any parenthesis
phonecheckpar1 = re.compile(r'\(+')
phonecheckpar2 = re.compile(r'\)+')

# Regular Expression to check whether the phone number contains any periods
phonecheckperiod = re.compile(r'\.+')

# Regular Expression to check whether the phone number contains any alphabets in them 
phonecheckalpha = re.compile(r'[a-zA-Z].*')

# Regular Expression to check whether the phone number has any commas in it 
phonecheckcomma = re.compile(r'\,+')

#-------------------------------------------------------------------------------------------

'''Cleansing Function to Clean Zipcodes'''

def clean_zipcode(zipcode):
    ca = clean_zalpha(zipcode)
    if ca:
        return zipcode 
    else: 
        cleaned_zipcode = clean_zhyphen(zipcode)
        return cleaned_zipcode 


def clean_zalpha(zipcode): 
    match = zipcheckalpha.search(zipcode)
    if match: 
        return zipcode 
    else: 
        pass
            

def clean_zhyphen(zipcode): 
    match = phonecheckhyphen.search(zipcode)
    if match: 
        zipcode = match.group(0).split('-')[0]
        return zipcode
    else: 
        return zipcode

#-------------------------------------------------------------------------------------------


'''Cleansing Function to Clean Street Names'''

def clean_streetname(old_name): 
    match = street_type_re.search(old_name) 
    val =  match.group(0)
    if val in mapping: 
        name = old_name.replace(match.group(0), mapping[match.group(0)])
        return name
    else:
        name = old_name
        return name
#-------------------------------------------------------------------------------------------

    
'''Cleansing Function to Clean Phone Numbers'''

def clean_phone(phone):
    ca = clean_alpha(phone)
    if ca:
        return phone 
    else: 
        whitespace= clean_whitespaces(phone) 
        hyphen= clean_hyphen(whitespace)
        plusone=clean_plusone(hyphen)
        paren1=clean_paranthesis1(plusone)
        paren2= clean_paranthesis2(paren1)
        prd= clean_period(paren2)
        cleanone = clean_one(prd)
        cleancomma = clean_comma(cleanone)
        return cleancomma 


def clean_alpha(phone): 
    match = phonecheckalpha.search(phone)
    if match: 
        return phone 
    else: 
        pass
            
def clean_whitespaces(phone): 
    match=phonecheckspace.search(phone)
    if match: 
        phone=phone.replace(match.group(0), '')
        return phone
    else: 
        return phone


def clean_hyphen(phone): 
    match = phonecheckhyphen.search(phone)
    if match: 
        phone = phone.replace(match.group(0), '')
        return phone
    else: 
        return phone
        
def clean_plusone(phone): 
    match = phonecheckone.search(phone)
    if match: 
        phone = phone.replace(match.group(0), '')
        return phone
    else: 
        return phone 
        
def clean_paranthesis1(phone): 
    match = phonecheckpar1.search(phone)
    if match: 
        phone = phone.replace(match.group(0), '')
        return phone
    else: 
        return phone

def clean_paranthesis2(phone): 
    match = phonecheckpar2.search(phone)
    if match: 
        phone = phone.replace(match.group(0), '')
        return phone
    else: 
        return phone

        
def clean_period(phone): 
    match = phonecheckperiod.search(phone)
    if match: 
        phone = phone.replace(match.group(0), '')
        return phone
    else: 
        return phone
        
def clean_one(phone): 
    if phone.startswith('1') or phone.startswith('+'): 
        phone = phone[1:]
        return phone 
    else: 
        return phone 


def clean_comma(phone): 
    match = phonecheckcomma.search(phone)
    if match: 
        phone = phone.replace(match.group(0), '')
        return phone
    else: 
        return phone       
    
#-------------------------------------------------------------------------------------------
    
def floatOrNofloat(n):
    return float(n) if n else None
    
def shape_element(element): 
    node = defaultdict(dict) 
    if element.tag == "node" or element.tag == "way":
        
        node["tag"] = element.tag

        node ["id"] = element.get('id')

        lat = element.get('lat')

        lon = element.get('lon')

        if lat or lon:
            node['pos'] = [floatOrNofloat(lat), floatOrNofloat(lon)]
        
        node["created"] = {}

        for key in CREATED:
            node["created"][key] = element.get(key)
        
        for child in element.getchildren():
            
            key = child.get("k")
            ref = child.get("ref")
            
            if key == 'address': 
                node['fulladdress'] = child.get('v')
            
            '''Included the logic to clean the Phone Number by calling the Phone Cleaning Function prior to JSON Import'''
            if key == 'phone': 
                if len(clean_phone(child.get('v'))) == 10: 
                    node['phonenumber'] = clean_phone(child.get('v'))
                else: 
                    node['phonenumber'] = 'Phone Number Removed Due to Incorrect Value'
                    
            
            if key is not None: 
                if key.startswith('addr:'):
                    split_key = key.split(":")
                    
                    '''Included the logic to clean the Street Name and Post Codes by calling the respective functions
                    prior to JSON Import'''
                    
                    if split_key[1] == 'street': 
                        node['address'][split_key[1]]= clean_streetname(child.get('v'))
                    elif split_key[1] == 'postcode':
                        node['address'][split_key[1]]= clean_zipcode(child.get('v'))
                    else: 
                        node['address'][split_key[1]] = child.get('v')                        

                elif 'amenity' in key: 
                    node['amenity'] = child.get('v')
                elif 'name' in key: 
                    node['name'] = child.get('v')
            
            if ref: 
                if "node_refs" not in node: 
                    node["node_refs"] = []
                else: 
                    node["node_refs"].append(ref)
        
        return node
    else:
        return None
        

    
def process_map(file_in, pretty = False):
    file_out = "{0}.json".format(file_in)
    data = []
    with codecs.open(file_out, "w") as fo:
        for _, element in ET.iterparse(file_in):
            el = shape_element(element)
            if el:
                data.append(el)
                if pretty:
                    fo.write(json.dumps(el, indent=2)+"\n")
                else:
                    fo.write(json.dumps(el) + "\n")
    return data

def test():
    data = process_map('Boston.osm', True)
    

if __name__ == "__main__":
    test()

The output of the step above is the creation of **"Boston.osm.json"** file, which is later been used to import into MongoDB. In addition as a part of the import the street names, phone numbers and zipcodes were also cleaned up 

# Setting up for Mongo Data Analysis

In [14]:
import pymongo
from pymongo import MongoClient
import pprint
client = MongoClient()
db = client.boston
print db

Database(MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True), u'boston')


# Data Analysis/Data Exploration in MongoDB

# Assessing the Size of the Original OSM File and the JSON File 

In [15]:
import os
print 'The original OSM file is {} MB'.format(os.path.getsize('Boston.osm')/1.0e6)
print 'The JSON file is {} MB'.format(os.path.getsize('Boston.osm' + ".json")/1.0e6)

The original OSM file is 100.298777 MB
The JSON file is 145.96108 MB


In [16]:
boston = db['bostonc']

# Number of Documents

In [17]:
boston.find().count()

520261

# Number of Nodes and Ways

In [18]:
print "Number of nodes:",boston.find({'tag': 'node'}).count()
print "Number of ways:", boston.find({'tag': 'way'}).count()

Number of nodes: 444899
Number of ways: 75362


# Top 10 Contributors along with the UserNames

In [19]:
result = boston.aggregate( [
                                        { "$group" : {"_id" : "$created.user", "count" : { "$sum" : 1} } },
                                        { "$sort" : {"count" : -1} }, 
                                        { "$limit" : 10 } ] )

pprint.pprint(list(result))

[{u'_id': u'crschmidt', u'count': 269155},
 {u'_id': u'jremillard-massgis', u'count': 64989},
 {u'_id': u'wambag', u'count': 29468},
 {u'_id': u'OceanVortex', u'count': 27793},
 {u'_id': u'ryebread', u'count': 21755},
 {u'_id': u'morganwahl', u'count': 20291},
 {u'_id': u'mapper999', u'count': 8309},
 {u'_id': u'cspanring', u'count': 6817},
 {u'_id': u'JasonWoof', u'count': 5439},
 {u'_id': u'synack', u'count': 5042}]


# List of Top 50 Amenities in the Boston Area

In [20]:
result = boston.aggregate( [            {'$match': {'amenity': {'$exists': 1}}},
                                        { "$group" : {"_id" : "$amenity", "count" : { "$sum" : 1} } },
                                        { "$sort" : {"count" : -1} }, 
                                        { "$limit" : 50 } ] )

pprint.pprint(list(result))

[{u'_id': u'parking', u'count': 545},
 {u'_id': u'bench', u'count': 495},
 {u'_id': u'restaurant', u'count': 398},
 {u'_id': u'bicycle_parking', u'count': 214},
 {u'_id': u'school', u'count': 205},
 {u'_id': u'place_of_worship', u'count': 184},
 {u'_id': u'library', u'count': 162},
 {u'_id': u'cafe', u'count': 158},
 {u'_id': u'fast_food', u'count': 114},
 {u'_id': u'bicycle_rental', u'count': 89},
 {u'_id': u'university', u'count': 77},
 {u'_id': u'post_box', u'count': 69},
 {u'_id': u'bank', u'count': 65},
 {u'_id': u'waste_basket', u'count': 59},
 {u'_id': u'pub', u'count': 49},
 {u'_id': u'fuel', u'count': 41},
 {u'_id': u'fountain', u'count': 34},
 {u'_id': u'pharmacy', u'count': 34},
 {u'_id': u'atm', u'count': 33},
 {u'_id': u'hospital', u'count': 31},
 {u'_id': u'fire_station', u'count': 31},
 {u'_id': u'drinking_water', u'count': 31},
 {u'_id': u'car_sharing', u'count': 28},
 {u'_id': u'bar', u'count': 28},
 {u'_id': u'parking_space', u'count': 27},
 {u'_id': u'post_office', u

# Extracting the List of Colleges from the DataSet

In [21]:
colleges = boston.aggregate([{"$match":{"amenity":{"$exists":1},
                                 "amenity":"college",}},      
                      {"$group":{"_id":{"Name":"$name"},
                                 "count":{"$sum":1}}},
                      {"$project":{"_id":0,
                                  "College":"$_id.Name",
                                  "Name":"$count"}},
                      {"$sort":{"Count":-1}}, 
                      {"$limit":10}])
pprint.pprint(list(colleges))

[{u'College': u'North Bennet Street School', u'Name': 1},
 {u'College': u'Radcliffe Quad', u'Name': 1},
 {u'College': u'Bunker Hill Community College', u'Name': 1},
 {u'College': u'Emerson College', u'Name': 7},
 {u'College': u'Berklee College of Music', u'Name': 7},
 {u'College': u'Emerson College \u2013 Walker Building', u'Name': 1},
 {u'College': u'Emerson College \u2013 Tuffte Performing Arts Center',
  u'Name': 1},
 {u'College': u'Fisher College', u'Name': 1},
 {u'College': u'Emerson College - Little Building', u'Name': 1},
 {u'College': u'Emerson College \u2013 Piano Row', u'Name': 1}]


**This list is definitely missing some of the key universities in the Boston Area like Harvard, MIT, NorthEastern. On further review of the dataset I noticed that the missing schools and colleges are infact a part of the dataset, they just don't have an amenity of college attached to them** 

# Extracting the list of Public Buildings in the Boston Area

In [22]:
building = boston.aggregate([{"$match":{"amenity":{"$exists":1},
                                 "amenity":"public_building",}},      
                      {"$group":{"_id":{"Name":"$name"},
                                 "count":{"$sum":1}}},
                      {"$project":{"_id":0,
                                  "Building":"$_id.Name",
                                  "Name":"$count"}},
                      {"$sort":{"Count":-1}}, 
                      {"$limit":10}])
pprint.pprint(list(building))

[{u'Building': None, u'Name': 1},
 {u'Building': u'Social Security Administration', u'Name': 1},
 {u'Building': u'Suffolk', u'Name': 5},
 {u'Building': u'Middlesex', u'Name': 2}]


# Extracting the Top Cities in the Boston Area

In [23]:
cities = boston.aggregate([
        {"$match": {"address.city":{"$exists":1}}}, 
        {"$group":{"_id":"$address.city", "count":{"$sum":1}}},
        {"$sort": {"count": -1}}, 
        {"$limit":10}                                 
    ])

pprint.pprint(list(cities))

[{u'_id': u'Boston', u'count': 619},
 {u'_id': u'Cambridge', u'count': 555},
 {u'_id': u'Somerville', u'count': 240},
 {u'_id': u'Arlington', u'count': 172},
 {u'_id': u'Allston', u'count': 17},
 {u'_id': u'Arlington, MA', u'count': 9},
 {u'_id': u'Charlestown', u'count': 9},
 {u'_id': u'Watertown', u'count': 9},
 {u'_id': u'Cambridge, MA', u'count': 8},
 {u'_id': u'Brookline', u'count': 7}]


# Conclusion and Other Suggested Improvements

Given the size of the Boston Data Set that was analyzed and the number of issues that existed with the dataset, it was a lot better than I anticipated. That said, there are definitely areas for improvement. E.g.We noticed inconsistencies with the street names while executing our auditing in Python. We also noticed some minor anomalies with ZipCode and the Phone data. In addition we found other issues around missing data, or the data being associated with different types 

1. When we executed a query to extract the list of colleges in the Boston Area based on amenity == "college", the result set was missing some of the key institutions in the Boston area (e.g. MIT, Harvard, NorthEastern). On further analysis by looking at OSM file we noticed that the data is infact present, but just that it was associated with a different type. 
2. Similarly, when we executed a query to extract the list of public buildings in the Boston Area based on amenity == "public_building, not a whole lot of buildings showed up. 

If we further analyze the root cause for 1 and 2, we can definitely conclude that these are the effects of manual contribution from hundreds and hundreds of users over the web. 

## Here are some recommendations to improve the quality of data within Open Street Maps
### Option 1: Structured Input Form: 

One approach to rectify this would be to use a structured input form to consume data from the users 

#### Pros: 
1. Forces users to input data by adhering to a general format 

#### Cons: 
2. In some cases the form restrictions might prevent users from entering valid data. In situations like those users might either leave the data as blank and proceed or enter data to adhere to the form settings, but might be incorrect data. E.g. Say if the List of Cities is presented as a Drop Down Value, and the user is not able to find the relevant city, they might be inclined to select another neighbouring city for the purposes of ingesting the value into the maps. This would lead to other issues and subsequent clean ups. 

### Option 2: Address Validation using WebService Calls with other Address Verification Services 
Utilize Address Verification Services (E.g. LexisNexis, DOTS, smartystreets) via API calls to validate the Address being entered either via Synchronous Calls. The API could inturn return a much cleaner version of the Address, which could then be consumed and ingested into OpenStreetmaps. 

#### Pros: 
1. The Synchronous call to a Address Verification service will serve as an Address Cleansing Step prior to the ingestion into Open Street maps. 

#### Cons: 

1. API services might be costly, and OpenStreet might have to pay for those services. This might defeat the purpose of OpenStreet Map being a open source project 

2. The intermediate third party API call might slow down the Address Intake process from a user perspective as the user will have to wait for the response from the Synchronous Call 


In addition, it might be a good idea to organize Hackathons/Meetups in different parts of the country to cleanse the dataset in a particular area. E.g A hackathon in the Boston Area could be tasked with the force to cleanse the Boston Area dataset on a periodic basis. Just judging by the sheer volume of datasets for the Boston area alone and extrapolating it to datasets across the world, data wranglers across the world would have a field data cleansing the open street map datasets. 


# References

Listed Below are some of the references I used 
* http://wiki.openstreetmap.org/wiki/Browsing
* Udacity Lectures 
* A lot of StackOverflow Threads (everytime I ran into an error in Python and Regular Expressions)
* https://docs.python.org/2/library/re.html
* https://regexone.com/references/python
* https://pymotw.com/2/xml/etree/ElementTree/parse.html#parsing-an-entire-document
* http://effbot.org/zone/element-iterparse.htm



