### Objective:
* Find the distance of every locality in a city from a chosen locality.
* This can be approached by computing the distance between pincodes or between longitude-latitude pairs.

Three methods have been used:
1. **pgeocode** - Postal code geocoding and distance calculations
2. **geopy** - Distance between localities using longitude & latitude
3. **geocoder & geopy** - Get Latitude & longitude of localities & compute distance using geopy

### Method 1: pgeocode - Postal code geocoding and distance calculations

In [3]:
#!pip install tqdm pgeocode geopy geocoder

import pgeocode
import os
import pandas as pd
from tqdm.notebook import tqdm_notebook
tqdm_notebook.pandas()

Progress of apply using tqdm - https://towardsdatascience.com/progress-bars-in-python-and-pandas-f81954d33bae

In [4]:
path = os.getcwd()

#### Import data - Pincode by locality in India 

Data source - Locality based Pin mapping - India https://data.gov.in/resources/villagelocality-based-pin-mapping-16th-march-2017

In [5]:
path1 = path + '/Data1/Locality_village_pincode_final_mar-2017.csv'
data = pd.read_csv(path1,encoding= 'unicode_escape')

In [6]:
data.head()

Unnamed: 0,Village/Locality name,Officename ( BO/SO/HO),Pincode,Sub-distname,Districtname,StateName
0,Aliganj,Lodi Road H.O,110003,Defence Colony,SOUTH EAST DELHI,DELHI
1,Kasturba Nagar,Lodi Road H.O,110003,Defence Colony,SOUTH EAST DELHI,DELHI
2,Jeewan Nagar,Jungpura S.O,110014,Defence Colony,SOUTH EAST DELHI,DELHI
3,Tehkhand,Okhla Industrial Estate S.O,110020,Defence Colony,SOUTH EAST DELHI,DELHI
4,Zakir Nagar SO,New Friends Colony S.O,110025,Defence Colony,SOUTH EAST DELHI,DELHI


#### Apply Filters

In [7]:
district = ['CHENNAI','KANCHIPURAM']
data1 = data[data['Districtname'].isin(district)]

In [8]:
data1.head()

Unnamed: 0,Village/Locality name,Officename ( BO/SO/HO),Pincode,Sub-distname,Districtname,StateName
558410,Parrys,Chennai G.P.O.,600001,Fort - Tondiarpet,CHENNAI,TAMIL NADU
558411,Chennai,Anna Road H.O,600002,Egmore - Nungambakkam,CHENNAI,TAMIL NADU
558412,Parrys,Park Town H.O,600003,Fort - Tondiarpet,CHENNAI,TAMIL NADU
558413,Mylapore,Mylapore H.O,600004,Mylapore - Triplicane,CHENNAI,TAMIL NADU
558414,Tiruvallikkeni,Tiruvallikkeni S.O,600005,Mylapore - Triplicane,CHENNAI,TAMIL NADU


#### Postal code geocoding and distance calculations - pgeocode

* pgeocode is a Python library for high performance off-line querying of GPS coordinates, region name and municipality name from postal codes. 
* Distances between postal codes as well as general distance queries are also supported. 
* The used GeoNames database includes postal codes for 83 countries.

https://pgeocode.readthedocs.io

In [10]:
dist = pgeocode.GeoDistance('IN') # INDIA

# distance between two pincodes
dist.query_postal_code(600119,600117) # returned distance in km

21.82619913175519

#### Distance between pincodes

In [11]:
data1[data1['Pincode']==600119]

Unnamed: 0,Village/Locality name,Officename ( BO/SO/HO),Pincode,Sub-distname,Districtname,StateName
558661,Sholinganallur,Sholinganallur S.O,600119,Sholinganallur,KANCHIPURAM,TAMIL NADU
558662,Uthandi,Sholinganallur S.O,600119,Tambaram,KANCHIPURAM,TAMIL NADU


In [12]:
data2 = data1

In [13]:
data2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1304 entries, 558410 to 601014
Data columns (total 6 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   Village/Locality name   1304 non-null   object
 1   Officename ( BO/SO/HO)  1304 non-null   object
 2   Pincode                 1304 non-null   int64 
 3   Sub-distname            1304 non-null   object
 4   Districtname            1304 non-null   object
 5   StateName               1304 non-null   object
dtypes: int64(1), object(5)
memory usage: 71.3+ KB


In [14]:
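def distancebw(x):
    # Distance in km from the reference pincode 600119 (Sholinganallur) to this row's pincode.
    # Note: a new GeoDistance object is built on every call, which likely adds to the run time.
    geo_dist = pgeocode.GeoDistance('IN') # INDIA
    return geo_dist.query_postal_code(600119, x.Pincode)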

In [15]:
data22 = data2

In [167]:
%time data22['distance(in kms)'] = data22.progress_apply(distancebw,axis=1) #7mins

Wall time: 7min 22s
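Many localities share a pincode (for example, Sholinganallur and Uthandi both map to 600119), so the distance only needs to be looked up once per distinct pincode. The sketch below illustrates that idea in plain Python. The helper name `distances_via_unique_codes` and the stand-in lookup `fake_distance` are hypothetical and are used only so the example runs without pgeocode; it is one possible way to cut the run time, not a measured result.

```python
from typing import Callable, Dict, Hashable, Iterable, List


def distances_via_unique_codes(
    codes: Iterable[Hashable],
    distance_fn: Callable[[Hashable], float],
) -> List[float]:
    """Return one distance per input code, calling distance_fn once per distinct code."""
    cache: Dict[Hashable, float] = {}
    result = []
    for code in codes:
        if code not in cache:
            cache[code] = distance_fn(code)  # only reached for a code not seen before
        result.append(cache[code])
    return result


# Stand-in lookup with made-up values, so the sketch runs without pgeocode.
calls = []


def fake_distance(code):
    calls.append(code)
    return {600001: 23.0, 600119: 0.0}[code]


pincodes = [600001, 600001, 600119, 600119, 600001]
print(distances_via_unique_codes(pincodes, fake_distance))  # one value per row
print(len(calls))  # number of lookups actually performed
```

In the notebook, `distance_fn` could be a small function that wraps `query_postal_code` on a single `pgeocode.GeoDistance('IN')` object created once, outside the per-row function.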


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
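The warning above appears because the frame being modified was obtained by filtering another DataFrame, and a plain assignment such as `data2 = data1` only creates another name for the same object. A minimal sketch of the usual remedy, taking an explicit `.copy()` of the filtered frame, is shown below. The tiny `data` frame is a stand-in built from values in the outputs above, and `subset` is a hypothetical name.

```python
import pandas as pd

# Tiny stand-in for the full pincode table (values taken from the outputs above).
data = pd.DataFrame({
    "Pincode": [110003, 600001, 600119],
    "Districtname": ["SOUTH EAST DELHI", "CHENNAI", "KANCHIPURAM"],
})

# .copy() makes the filtered frame independent of `data`,
# so adding a column to it is unambiguous.
subset = data[data["Districtname"].isin(["CHENNAI", "KANCHIPURAM"])].copy()
subset["distance(in kms)"] = [23.079828, 0.0]

print(subset)
print("distance(in kms)" in data.columns)  # the original frame is left untouched
```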


In [172]:
data22.to_csv('result.csv',index=False)

In [173]:
data22.head()

Unnamed: 0,Village/Locality name,Officename ( BO/SO/HO),Pincode,Sub-distname,Districtname,StateName,distance(in kms)
558410,Parrys,Chennai G.P.O.,600001,Fort - Tondiarpet,CHENNAI,TAMIL NADU,23.079828
558411,Chennai,Anna Road H.O,600002,Egmore - Nungambakkam,CHENNAI,TAMIL NADU,20.959674
558412,Parrys,Park Town H.O,600003,Fort - Tondiarpet,CHENNAI,TAMIL NADU,21.121267
558413,Mylapore,Mylapore H.O,600004,Mylapore - Triplicane,CHENNAI,TAMIL NADU,21.121267
558414,Tiruvallikkeni,Tiruvallikkeni S.O,600005,Mylapore - Triplicane,CHENNAI,TAMIL NADU,19.498688


### Observation: 
A few of the computed distances were wrong. The issues on the pgeocode GitHub repository confirm that the underlying postal code data contains erroneous entries, which explains these errors.
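One simple way to spot such errors is to flag distances that are missing or implausibly large for localities within one metropolitan area. The sketch below is only an illustration: the helper `flag_suspect_distances` and the 100 km limit are assumptions, and the last two sample values are made up to represent bad cases.

```python
import math


def flag_suspect_distances(distances_km, max_plausible_km):
    """Return indices whose distance is missing (None/NaN), negative, or above the limit."""
    suspects = []
    for i, d in enumerate(distances_km):
        if d is None or math.isnan(d) or d < 0 or d > max_plausible_km:
            suspects.append(i)
    return suspects


# The first three values come from the output above; the last two are made-up bad cases.
sample = [23.079828, 20.959674, 21.121267, float("nan"), 950.0]
print(flag_suspect_distances(sample, max_plausible_km=100.0))
```

The flagged rows can then be checked by hand or recomputed with one of the other methods.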

---

### Method 2: Distance using longitude & latitude

In [57]:
import geopy

#### Import data - All India Pincode list with latitude and longitude

Data source - All India Pincode list with latitude and longitude - https://github.com/mrparveensharma/All-India-Pincode-list-with-latitude-and-longitude/blob/master/All-India-Pincode-list-with-latitude-and-longitude.csv

In [17]:
path1 = path + '/Data1/All-India-Pincode-list-with-latitude-and-longitude.csv'
data_longlat = pd.read_csv(path1,encoding= 'unicode_escape')

In [18]:
data_longlat.head()

Unnamed: 0,CID,CityName/AreaName,Pincode,District,State,Latitude,Longitude
0,1,Baroda House,110001,Central Delhi,DELHI,28.628075,77.21785
1,2,Bengali Market,110001,Central Delhi,DELHI,28.628075,77.21785
2,3,Bhagat Singh Market,110001,Central Delhi,DELHI,28.628075,77.21785
3,4,Connaught Place,110001,Central Delhi,DELHI,28.628075,77.21785
4,5,Constitution House,110001,Central Delhi,DELHI,28.628075,77.21785


#### Apply Filters

In [47]:
district_list = ['Chennai', 'Kanchipuram']
data1_longlat = data_longlat[data_longlat['District'].isin(district_list)]

In [48]:
data1_longlat.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 593 entries, 101464 to 109748
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   CID                593 non-null    int64  
 1   CityName/AreaName  593 non-null    object 
 2   Pincode            593 non-null    int64  
 3   District           593 non-null    object 
 4   State              593 non-null    object 
 5   Latitude           546 non-null    float64
 6   Longitude          546 non-null    float64
dtypes: float64(2), int64(2), object(3)
memory usage: 37.1+ KB


In [37]:
data1_longlat = data1_longlat.dropna() # remove null

In [40]:
data1_longlat[data1_longlat['Pincode'] == 600115]

Unnamed: 0,CID,CityName/AreaName,Pincode,District,State,Latitude,Longitude
101741,101742,Injambakkam,600115,Kanchipuram,TAMIL NADU,12.950784,80.255188


#### Distance using longitude & latitude
##### Example

In [41]:
# Importing the geodesic module from the library 
from geopy.distance import geodesic 
  
# Loading the lat-long data for Kolkata & Delhi 
kolkata = (22.5726, 88.3639) 
delhi = (28.7041, 77.1025) 
  
# Print the distance calculated in km 
print(geodesic(kolkata, delhi).km) 

1318.13891581683
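For comparison, geopy's `geodesic` works on an ellipsoidal model of the Earth, whereas the simpler haversine formula treats the Earth as a sphere. The sketch below implements haversine with only the standard library. The function name `haversine_km` and the mean radius constant are choices made for this illustration, so the result should be close to, but not identical with, the geodesic value printed above.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an assumption of the spherical model


def haversine_km(p1, p2):
    """Great-circle distance in km between two (latitude, longitude) pairs in degrees."""
    lat1, lon1 = map(math.radians, p1)
    lat2, lon2 = map(math.radians, p2)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


kolkata = (22.5726, 88.3639)
delhi = (28.7041, 77.1025)
print(haversine_km(kolkata, delhi))
```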


In [42]:
def distancebw1(x):
    # Coordinates of the current row and of the reference locality (Injambakkam, from the CSV)
    place = (x['Latitude'],x['Longitude'])
    injambakkam = (12.950784, 80.255188) 

    # Return the geodesic distance in km
    return(geodesic(place, injambakkam).km) 

In [104]:
# From google search
from geopy.distance import geodesic
inj = (12.9198, 80.2511)
chin = (13.0750, 80.2698)
print(geodesic(inj, chin))

# From csv
inj = (12.95078373,80.25518799)
chin = (13.02393532,80.25163269)
print(geodesic(inj, chin))

17.289281309550788 km
8.101974805309982 km


In [43]:
data2_longlat = data1_longlat

In [44]:
%time data2_longlat['distance(in kms)'] = data2_longlat.progress_apply(distancebw1,axis=1) #424ms

Wall time: 424 ms


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


In [46]:
data2_longlat.head()

Unnamed: 0,CID,CityName/AreaName,Pincode,District,State,Latitude,Longitude,distance(in kms)
101464,101465,Flower Bazaar,600001,Chennai,TAMIL NADU,13.079564,80.266045,14.295615
101465,101466,Govt Stanley Hospital,600001,Chennai,TAMIL NADU,13.079564,80.266045,14.295615
101466,101467,Mannady (Chennai),600001,Chennai,TAMIL NADU,13.079564,80.266045,14.295615
101467,101468,Mint Building,600001,Chennai,TAMIL NADU,13.079564,80.266045,14.295615
101468,101469,MPT AO,600001,Chennai,TAMIL NADU,13.079564,80.266045,14.295615


In [47]:
data2_longlat.to_csv('result1.csv',index=False)

### Observation: 
The latitude & longitude values contain errors, and many locations share the same coordinates. The data has to be improved before the distances can be trusted.
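The extent of the shared-coordinate problem can be measured by counting how many localities carry each (latitude, longitude) pair. The sketch below does this with the standard library on a few rows copied from the outputs above. The helper name `shared_coordinate_groups` is hypothetical.

```python
from collections import Counter

# (area, pincode, latitude, longitude) rows copied from the outputs above.
rows = [
    ("Baroda House", 110001, 28.628075, 77.21785),
    ("Bengali Market", 110001, 28.628075, 77.21785),
    ("Bhagat Singh Market", 110001, 28.628075, 77.21785),
    ("Connaught Place", 110001, 28.628075, 77.21785),
    ("Constitution House", 110001, 28.628075, 77.21785),
    ("Injambakkam", 600115, 12.950784, 80.255188),
]


def shared_coordinate_groups(rows):
    """Map each (lat, lon) pair used by more than one locality to its locality count."""
    counts = Counter((lat, lon) for _, _, lat, lon in rows)
    return {coord: n for coord, n in counts.items() if n > 1}


print(shared_coordinate_groups(rows))
```

On the full DataFrame, a similar check could be done with `data1_longlat.duplicated(subset=['Latitude', 'Longitude'], keep=False).sum()`.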

---

### Method 3: Latitude & longitude using Geocoder

* https://geocoder.readthedocs.io/
* https://github.com/DenisCarriere/geocoder#a-glimpse-at-the-api
* https://geocoder.readthedocs.io/providers/ArcGIS.html

In [52]:
import geocoder
from tqdm.notebook import tqdm_notebook
tqdm_notebook.pandas()

In [30]:
# Json
g = geocoder.arcgis('Injambakkam')
g.json

{'address': 'Injambakkam, Chennai, Kancheepuram, Tamil Nadu',
 'bbox': {'northeast': [12.936370000000077, 80.26025000000006],
  'southwest': [12.916370000000077, 80.24025000000005]},
 'confidence': 7,
 'lat': 12.926370000000077,
 'lng': 80.25025000000005,
 'ok': True,
 'quality': 'Locality',
 'raw': {'name': 'Injambakkam, Chennai, Kancheepuram, Tamil Nadu',
  'extent': {'xmin': 80.24025000000005,
   'ymin': 12.916370000000077,
   'xmax': 80.26025000000006,
   'ymax': 12.936370000000077},
  'feature': {'geometry': {'x': 80.25025000000005, 'y': 12.926370000000077},
   'attributes': {'Score': 100, 'Addr_Type': 'Locality'}}},
 'score': 100,
 'status': 'OK'}

In [31]:
# Coordinates of Injambakkam
g = geocoder.arcgis('Injambakkam,Chennai,TAMIL NADU')
print(g.address)
print(g.lat)
print(g.lng)

Injambakkam, Chennai, Kancheepuram, Tamil Nadu
12.926370000000077
80.25025000000005


#### Import data - All India Pincode list with latitude and longitude

Data source - All India Pincode list with latitude and longitude - https://github.com/mrparveensharma/All-India-Pincode-list-with-latitude-and-longitude/blob/master/All-India-Pincode-list-with-latitude-and-longitude.csv

In [22]:
path1 = path + '/Data1/All-India-Locailty-Pincode-list.csv'
df = pd.read_csv(path1,encoding= 'unicode_escape')

In [23]:
df.head()

Unnamed: 0,CID,CityName/AreaName,Pincode,District,State
0,1,Baroda House,110001,Central Delhi,DELHI
1,2,Bengali Market,110001,Central Delhi,DELHI
2,3,Bhagat Singh Market,110001,Central Delhi,DELHI
3,4,Connaught Place,110001,Central Delhi,DELHI
4,5,Constitution House,110001,Central Delhi,DELHI


In [24]:
district_list = ['Chennai', 'Kanchipuram']
df = df[df['District'].isin(district_list)]

In [25]:
df['Locality'] = df[['CityName/AreaName', 'Pincode', 'District', 'State']].apply(lambda x: ','.join(x.fillna('').map(str)), axis=1)

In [26]:
df.head()

Unnamed: 0,CID,CityName/AreaName,Pincode,District,State,Locality
101464,101465,Flower Bazaar,600001,Chennai,TAMIL NADU,"Flower Bazaar,600001,Chennai,TAMIL NADU"
101465,101466,Govt Stanley Hospital,600001,Chennai,TAMIL NADU,"Govt Stanley Hospital,600001,Chennai,TAMIL NADU"
101466,101467,Mannady (Chennai),600001,Chennai,TAMIL NADU,"Mannady (Chennai),600001,Chennai,TAMIL NADU"
101467,101468,Mint Building,600001,Chennai,TAMIL NADU,"Mint Building,600001,Chennai,TAMIL NADU"
101468,101469,MPT AO,600001,Chennai,TAMIL NADU,"MPT AO,600001,Chennai,TAMIL NADU"


In [42]:
# Single-field helpers: each one sends its own request to ArcGIS.
def location(x):
    g = geocoder.arcgis(x['Locality'])
    return g.address

def lat(x):
    g = geocoder.arcgis(x['Locality'])
    return g.lat

def lng(x):
    g = geocoder.arcgis(x['Locality'])
    return g.lng


# Fetch address, latitude and longitude with a single request per row.
def latlong(x):
    g = geocoder.arcgis(x['Locality'])
    x['Location'] = g.address
    x['Latitude'] = g.lat
    x['Longitude'] = g.lng
    return x
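The geocoder result carries an `ok` flag (visible in the JSON output above), so a failed lookup can be detected before its fields are used. The sketch below shows one way to handle that case. The helper `extract_fields` is hypothetical, and the `SimpleNamespace` objects are stand-ins for real geocoder results so the example runs without network access; the coordinates are the rounded Injambakkam values from the output above.

```python
from types import SimpleNamespace


def extract_fields(result):
    """Pull address/lat/lng from a geocoder-style result, or None values if the lookup failed."""
    if result is None or not getattr(result, "ok", False):
        return {"Location": None, "Latitude": None, "Longitude": None}
    return {"Location": result.address, "Latitude": result.lat, "Longitude": result.lng}


# Stand-ins for geocoder results (a successful and a failed lookup).
found = SimpleNamespace(
    ok=True,
    address="Injambakkam, Chennai, Kancheepuram, Tamil Nadu",
    lat=12.92637,
    lng=80.25025,
)
missing = SimpleNamespace(ok=False, address=None, lat=None, lng=None)

print(extract_fields(found))
print(extract_fields(missing))
```

Rows that come back with `None` values can then be retried with a differently formatted query string or reviewed manually.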

In [46]:
!pip install tqdm



In [54]:
%time df = df.progress_apply(latlong,axis=1) #12min 3s


Wall time: 12min 3s


In [55]:
df.head()

Unnamed: 0,CID,CityName/AreaName,Pincode,District,State,Locality,Location,Latitude,Longitude
101464,101465,Flower Bazaar,600001,Chennai,TAMIL NADU,"Flower Bazaar,600001,Chennai,TAMIL NADU","Flower Bazaar Chowk, George Town, Chennai, Tam...",13.088194,80.281455
101465,101466,Govt Stanley Hospital,600001,Chennai,TAMIL NADU,"Govt Stanley Hospital,600001,Chennai,TAMIL NADU",Blood Bank-Government Stanley Hospital,13.10556,80.28648
101466,101467,Mannady (Chennai),600001,Chennai,TAMIL NADU,"Mannady (Chennai),600001,Chennai,TAMIL NADU","Mannady, George Town, Chennai, Tamil Nadu",13.10087,80.2938
101467,101468,Mint Building,600001,Chennai,TAMIL NADU,"Mint Building,600001,Chennai,TAMIL NADU",T. N. K. Buildings,13.08471,80.27819
101468,101469,MPT AO,600001,Chennai,TAMIL NADU,"MPT AO,600001,Chennai,TAMIL NADU",600001,13.0937,80.295838


In [56]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 593 entries, 101464 to 109748
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   CID                593 non-null    int64  
 1   CityName/AreaName  593 non-null    object 
 2   Pincode            593 non-null    int64  
 3   District           593 non-null    object 
 4   State              593 non-null    object 
 5   Locality           593 non-null    object 
 6   Location           593 non-null    object 
 7   Latitude           593 non-null    float64
 8   Longitude          593 non-null    float64
dtypes: float64(2), int64(2), object(5)
memory usage: 46.3+ KB


In [64]:
df[df['Pincode']==600115]

Unnamed: 0,CID,CityName/AreaName,Pincode,District,State,Locality,Location,Latitude,Longitude
101741,101742,Injambakkam,600115,Kanchipuram,TAMIL NADU,"Injambakkam,600115,Kanchipuram,TAMIL NADU","600115, Injambakkam, Chennai, Kancheepuram, Ta...",12.923007,80.250555


In [63]:
# From google search
from geopy.distance import geodesic
inj = (12.9198, 80.2511)
chin = (13.0750, 80.2698)
print(geodesic(inj, chin))

# From csv
inj = (12.923007, 80.250555)
chin = (13.069074,80.270825)
print(geodesic(inj, chin))

17.289281309550788 km
16.308396141019735 km


In [65]:
def distancebw1(x):
    # Coordinates of the current row and of the reference locality (Injambakkam, from the geocoder)
    place = (x['Latitude'],x['Longitude'])
    injambakkam = (12.923007, 80.250555) 

    # Return the geodesic distance in km
    return(geodesic(place, injambakkam).km) 

In [66]:
df1 = df
%time df1['distance(in kms)'] = df1.progress_apply(distancebw1,axis=1)


Wall time: 200 ms


In [67]:
df1.head()

Unnamed: 0,CID,CityName/AreaName,Pincode,District,State,Locality,Location,Latitude,Longitude,distance(in kms)
101464,101465,Flower Bazaar,600001,Chennai,TAMIL NADU,"Flower Bazaar,600001,Chennai,TAMIL NADU","Flower Bazaar Chowk, George Town, Chennai, Tam...",13.088194,80.281455,18.579589
101465,101466,Govt Stanley Hospital,600001,Chennai,TAMIL NADU,"Govt Stanley Hospital,600001,Chennai,TAMIL NADU",Blood Bank-Government Stanley Hospital,13.10556,80.28648,20.568512
101466,101467,Mannady (Chennai),600001,Chennai,TAMIL NADU,"Mannady (Chennai),600001,Chennai,TAMIL NADU","Mannady, George Town, Chennai, Tamil Nadu",13.10087,80.2938,20.228572
101467,101468,Mint Building,600001,Chennai,TAMIL NADU,"Mint Building,600001,Chennai,TAMIL NADU",T. N. K. Buildings,13.08471,80.27819,18.138748
101468,101469,MPT AO,600001,Chennai,TAMIL NADU,"MPT AO,600001,Chennai,TAMIL NADU",600001,13.0937,80.295838,19.512334


In [68]:
df1.to_csv('result2.csv',index=False)

### Observation:
The latitude & longitude values generated by the geocoder are approximately correct, so the resulting distances are approximate as well.