# OpenStreetMap - Data Wrangling With MongoDB
Map Area: Ahmedabad, India

Problems Faced in the Dataset

Ahmedabad was selected instead of Nagpur, my home town, because of the file size. Apart from Nagpur, Ahmedabad is the city whose streets and addresses I know well enough to confirm the data.

The major problems were:

1. Local terms appear in place names; for example, "Chowk" denotes a crossroads or square.
2. Several variants are used for the same word, for example Road, RD., Rd and rd.
3. The same area is represented under multiple names; for example, Manewada and Manevada are the same area with different spellings.

# Solving the above problems:

In [None]:
Mapping = {
    "St.": "Street",
    "Rd.": "Road",
    "N.": "North",
    "St": "Street",
    "no": "No",
    "Rd": "Road",
    "ROAD": "Road",
    "ROad": "Road",
    "marg": "Road",
    "road": "Road",
    "stn": "Station",
    "Marg": "Road",
    "lane": "Lane",
    "sector": "Sector",
    "Chowk": "Square",
    "chowk": "Square",
    "Nagar": "Suburb",
}

The mapping above lists the inconsistencies noted in the file that are to be corrected. They are language-specific and need to be normalized so that analysts get a clean database to work with.

In [None]:
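def update_name(name, mapping):
    """Replace the first matching mapping key in name, trying longer keys first."""
    # Longer keys go first so that "Rd." is tried before "Rd".
    keys_by_length = sorted(mapping.keys(), key=len, reverse=True)
    for key in keys_by_length:
        if key in name:
            # str.replace substitutes every occurrence of this key.
            name = name.replace(key, mapping[key])
            return name
    # No key matched: return the name unchanged.
    return name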

The code above helps remove these errors from the data. It is applied to the individual tags that contain street names. Each street name is fetched and checked against the list of expected values; names that are not in the list are updated with the values from the dictionary.
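One caveat of substring matching is that a short key such as "St" also occurs inside longer words such as "Station". The following is a minimal sketch of a word-level alternative, not the method used in this project. The names `update_name_by_token` and `SAMPLE_MAPPING` are assumptions introduced only for illustration, and unlike `update_name` this variant replaces every matching word rather than stopping at the first key.

```python
# Illustrative subset of the mapping; the name SAMPLE_MAPPING is hypothetical.
SAMPLE_MAPPING = {
    "Rd": "Road",
    "Rd.": "Road",
    "St": "Street",
    "St.": "Street",
    "marg": "Road",
    "Chowk": "Square",
}


def update_name_by_token(name, mapping):
    """Replace whole whitespace-separated words that appear as keys in mapping.

    Words that are not keys are left untouched, so "Station" is not
    affected by the "St" key. Runs of whitespace collapse to one space.
    """
    words = name.split()
    return " ".join(mapping.get(word, word) for word in words)


station_road = update_name_by_token("Station Rd", SAMPLE_MAPPING)
ashram_road = update_name_by_token("Ashram marg", SAMPLE_MAPPING)
print(station_road)
print(ashram_road)
```

A word-level lookup does not handle punctuation attached to a word (for example "Rd,"), so it may need extra cleanup depending on the data.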

Zip Code Errors
Some zip codes were abbreviated to only two characters. All zip codes in Ahmedabad begin with 38, so entries with zip codes such as 01 and 08 were changed to 380001 and 380008.

In [None]:
<tag k="addr:postcode" v="01"/> => <tag k="addr:postcode" v="380001"/> 
<tag k="addr:postcode" v="08"/> => <tag k="addr:postcode" v="380008"/>

The reason for this is that zip codes in Ahmedabad appear in two forms in the data:

1. the full six-digit form, such as 380008;
2. a two-digit short form, such as 08, 04 or 01, in which the leading 3800 is left implicit.

To make sure all zip codes are in the same format, 3800 is prepended to the two-digit area code.
Since the code is short and there were not many such instances, the snippet used is attached below.

In [None]:
def is_post_code(elem):
    return elem.attrib['k'] == "addr:postcode"

def audit_post_code(post_code):
    # A two-character post code is the short form that needs expanding.
    return len(post_code) == 2

def update_post_code(post_code):
    # Distinct from update_name(name, mapping), which handles street names.
    return '3800' + post_code

# This part of the snippet is the same as in the iterative parser,
# where elem is the element currently being processed.
for tag in elem.iter("tag"):
    if is_post_code(tag):
        if audit_post_code(tag.attrib['v']):
            # Update the tag attribute
            tag.attrib['v'] = update_post_code(tag.attrib['v'])
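To show the post code fix end to end, here is a small self-contained sketch that runs the same steps on a single in-memory element. The sample XML is made up for illustration and is not taken from the Ahmedabad file. The helper name `update_post_code`, the wrapper `fix_post_codes` and the extra digits-only check are assumptions added here.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample element, not taken from the actual dataset.
SAMPLE = (
    '<node id="1">'
    '<tag k="addr:postcode" v="08"/>'
    '<tag k="addr:street" v="Relief Road"/>'
    '</node>'
)


def is_post_code(tag):
    return tag.attrib.get("k") == "addr:postcode"


def audit_post_code(post_code):
    # Short form: exactly two digits (the digit check is an added assumption).
    return len(post_code) == 2 and post_code.isdigit()


def update_post_code(post_code):
    # Prepend the implicit prefix to get the six-digit form.
    return "3800" + post_code


def fix_post_codes(elem):
    """Expand every short post code found in the element's tag children."""
    for tag in elem.iter("tag"):
        if is_post_code(tag) and audit_post_code(tag.attrib["v"]):
            tag.attrib["v"] = update_post_code(tag.attrib["v"])
    return elem


node = fix_post_codes(ET.fromstring(SAMPLE))
postcode = node.find("tag[@k='addr:postcode']").attrib["v"]
street = node.find("tag[@k='addr:street']").attrib["v"]
print(postcode, street)
```

Tags other than `addr:postcode`, and post codes that are already six digits long, pass through unchanged.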

# Data Overview

This section describes the statistics of the dataset used and the basic queries performed. The file size was updated to 104 MB, as specified in the change.

In [None]:
Query: db.ahm.count()

Output: 620782

This is the number of documents present in the database.

In [None]:
Query: db.ahm.find({"type":"node"}).count()

Output: 540312

This is the number of documents of type node.

In [None]:
Query: db.ahm.find({"type":"way"}).count()

Output: 80466

This is the number of documents of type way.

# Contributing Users

In [None]:
Top contributing user:

Query:
db.ahm.aggregate([
    {'$match': {'created.user': {'$exists': 1}}},
    {'$group': {'_id': '$created.user',
                'count': {'$sum': 1}}},
    {'$sort': {'count': -1}},
    {'$limit': 1}
])

Output:
{ "_id" : "uday01", "count" : 177343 }


In [None]:
Number of users having a single post:

Query:
db.ahm.aggregate([
    {"$group": {"_id": "$created.user",
                "count": {"$sum": 1}}},
    {"$group": {"_id": "$count",
                "num_users": {"$sum": 1}}},
    {"$sort": {"_id": 1}},
    {"$limit": 1}
])

Output:
{ "_id" : 1, "num_users" : 63 }
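The two `$group` stages in this query can be hard to read at first, so the same logic is sketched below in plain Python. The sample documents are hypothetical and contain only the `created.user` field used by the pipeline; the function name `users_per_post_count` is an assumption introduced for this illustration.

```python
from collections import Counter

# Hypothetical documents, only the field used by the pipeline is included.
sample_docs = [
    {"created": {"user": "a"}},
    {"created": {"user": "a"}},
    {"created": {"user": "b"}},
    {"created": {"user": "c"}},
]


def users_per_post_count(docs):
    """Return a Counter mapping a post count to the number of users with it."""
    # First $group: number of documents per user.
    posts_per_user = Counter(doc["created"]["user"] for doc in docs)
    # Second $group: number of users per document count.
    return Counter(posts_per_user.values())


histogram = users_per_post_count(sample_docs)
# $sort on _id ascending plus $limit 1 keeps the smallest post count.
smallest_count = min(histogram)
single_post_users = histogram[1]
print(smallest_count, single_post_users)
```

Note that the sort and limit stages return the smallest post count present, which equals 1 only when at least one user has exactly one document, as is the case in the output above.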

In [None]:
Top Contributing Users:

Query:
db.ahm.aggregate([
    {'$match': {'created.user': {'$exists': 1}}},
    {'$group': {'_id': '$created.user',
                'count': {'$sum': 1}}},
    {'$sort': {'count': -1}},
    {'$limit': 5}
])

Output:
{ "_id" : "uday01", "count" : 177343 }
{ "_id" : "sramesh", "count" : 136822 }
{ "_id" : "chaitanya110", "count" : 123307 }
{ "_id" : "shashi2", "count" : 49514 }
{ "_id" : "shravan91", "count" : 22461 }
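To put these counts in perspective, the share of each contributor can be computed from the figures reported in this document. The snippet below is a small sketch that uses only the total from `db.ahm.count()` and the five counts listed above; the helper name `share` is an assumption.

```python
# Total number of documents, as reported by db.ahm.count().
TOTAL_DOCUMENTS = 620782

# Counts of the five top contributors, as reported by the aggregation output.
TOP_FIVE = {
    "uday01": 177343,
    "sramesh": 136822,
    "chaitanya110": 123307,
    "shashi2": 49514,
    "shravan91": 22461,
}


def share(count, total):
    """Return count as a percentage of total."""
    return 100.0 * count / total


top_five_total = sum(TOP_FIVE.values())
top_user_share = share(TOP_FIVE["uday01"], TOTAL_DOCUMENTS)
top_five_share = share(top_five_total, TOTAL_DOCUMENTS)

print(f"Top user: {top_user_share:.1f}% of all documents")
print(f"Top five users: {top_five_share:.1f}% of all documents")
```

The top five users together account for well over three quarters of all documents, which shows how skewed the contributions are.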


In [None]:
Data With Street Names:

Query:
db.ahm.aggregate([
    {'$match': {'address.street': {'$exists': 1}}},
    {'$limit': 5}
])

Output:
            { "_id" : ObjectId("584aebb87e546026a0116d5b"), "amenity" : "school", "name" : "Street Xavier's High School, Loyola Hall", "created" : { "changeset" : "1622675", "user" : "thepatel", "version" : "1", "uid" : "138012", "timestamp" : "2009-06-25T13:34:41Z" }, "pos" : [ 23.047875, 72.5490839 ], "address" : { "street" : "Street Xavier's High School, Loyola Hall" }, "type" : "node", "id" : "429228993" }
{ "_id" : ObjectId("584aebb87e546026a0116d5c"), "layer" : "1", "name" : "Ellisbridge Gymkhana", "created" : { "changeset" : "27499936", "user" : "shravan91", "version" : "4", "uid" : "1051550", "timestamp" : "2014-12-16T07:05:37Z" }, "pos" : [ 23.0231449, 72.5601423 ], "leisure" : "sports_centre", "address" : { "city" : "Ahmedabad", "street" : "Netaji Subhash Chandra Road", "postcode" : "380006" }, "sport" : "multi", "type" : "node", "id" : "429228996" }
{ "_id" : ObjectId("584aebb87e546026a011745f"), "amenity" : "college", "name" : "Street Xavier's College", "created" : { "changeset" : "1649373", "user" : "thepatel", "version" : "1", "uid" : "138012", "timestamp" : "2009-06-27T13:13:11Z" }, "pos" : [ 23.0329052, 72.5517045 ], "address" : { "street" : "Street Xavier's College" }, "type" : "node", "id" : "429795953" }
{ "_id" : ObjectId("584aebb87e546026a0118bf8"), "website" : "http://www.cafeuppercrust.com/", "cuisine" : "international", "amenity" : "restaurant", "capacity" : "40", "name" : "Upper Crust Cafe", "created" : { "changeset" : "27498669", "user" : "shravan91", "version" : "3", "uid" : "1051550", "timestamp" : "2014-12-16T04:57:29Z" }, "opening_hours" : "07:00-23:00", "wheelchair" : "no", "pos" : [ 23.0412121, 72.5485673 ], "phone" : "91-79-26401554", "source" : "http://www.cafeuppercrust.com/cafe-uppercrust.php", "address" : { "city" : "Ahmedabad", "street" : "Vijay Cross Roads", "housenumber" : "Aarohi Complex", "postcode" : "380009" }, "smoking" : "no", "type" : "node", "id" : "1313805181" }
{ "_id" : ObjectId("584aebb87e546026a0118e38"), "name" : "Mahendrakumar Sampatlal Shah", "designation" : "Owner", "created" : { "changeset" : "29613854", "user" : "shravan91", "version" : "3", "uid" : "1051550", "timestamp" : "2015-03-20T12:54:20Z" }, "old_name" : "Sampatlal Raichand Shah", "pos" : [ 23.0497885, 72.5985435 ], "source" : "Owner", "alt_name" : "Megh Prem", "address" : { "housenumber" : "48/1 Girdharnagar Society Shahibaug", "street" : "Girdharnagar Road", "housename" : "Megh Prem", "postcode" : "380004" }, "type" : "node", "id" : "1369667053" }

            

In [None]:
Bank Names:

Query:
db.ahm.aggregate([
    {'$match': {'amenity': 'bank',
                'name': {'$exists': 1}}},
    {'$project': {'_id': '$name',
                  'contact': '$phone'}}
])

Output:
{ "_id" : "029" }
{ "_id" : "Indian Bank" }
{ "_id" : "State Bank of India, St. Xavier's School Road Branch" }
{ "_id" : "Indian Bank" }
{ "_id" : "Central Bank of India, St. Xavier's School Road Branch" }
{ "_id" : "Kalupur Commercial Co-operative Bank" }
{ "_id" : "Bank of Baroda" }
{ "_id" : "Ahmedabad District Cooperative Bank" }
{ "_id" : "The State Bank of India, Ahmedabad Office" }
{ "_id" : "Citibank" }
{ "_id" : "The State Bank of India, Ahmedabad Office" }
{ "_id" : "Kotak Mahindra Bank" }
{ "_id" : "Central Bank of India" }
{ "_id" : "HDFC Bank Relief Road Branch" }
{ "_id" : "HDFC Bank Relief Road Branch" }
{ "_id" : "Bank of India" }
{ "_id" : "HDFC Bank" }
{ "_id" : "Gol Bank" }
{ "_id" : "Bank of Baroda" }
{ "_id" : "HDFC Bank" }
Type "it" for more


Conclusion:
After studying the map data and the query output, it is clear that the cleaned dataset is still not perfect; many errors remain. For example, the last query lists the banks, but details such as contact information and branch are not available. Since only a few users have provided most of the data, it has not been independently verified. As a resident, I can see that a lot of data is missing from the file.


To enhance the cleaning:

1. Divide the data into categories:

The data can be divided into categories such as streets, restaurants and banks. With such categories, an analysis that looks for a particular street in the system does not need to go through restaurant or bank data, which is of no use for that question.

2. Data sources:

The data comes only from OpenStreetMap. Data from other sources, such as Google Maps, other map APIs and other repositories, could be combined with it to make the analysis more reliable. However, data from different sources can be redundant, and some of it is paid, such as APIs and plugins that have to be purchased to get the data.

3. Missing data:

The top 5 users have contributed most of the data, so the data is skewed and a lot of it is missing. If users are asked for more data at the time of input, this problem can be reduced. There can be redundancy here too, and data from users cannot be fully trusted while performing analysis.


