Importing required libraries

In [1]:
import pandas as pd
import numpy as np
import math
import seaborn as sns
import matplotlib.pyplot as plt

Data loading
In this notebook, we will clean the BMMS_Overview dataset.

In [2]:
bridges = pd.read_excel('data/BMMS_overview.xlsx')

In [3]:
bridges.head()

Unnamed: 0,road,km,type,LRPName,name,length,condition,structureNr,roadName,chainage,width,constructionYear,spans,zone,circle,division,sub-division,lat,lon,EstimatedLoc
0,N1,1.8,Box Culvert,LRP001a,.,11.3,A,117861,Dhaka (Jatrabari)-Comilla (Mainamati)-Chittago...,1.8,19.5,2005.0,2.0,Dhaka,Dhaka,Narayanganj,Narayanganj-1,23.702889,90.450389,bcs1
1,N1,4.925,Box Culvert,LRP004b,.,6.6,A,117862,Dhaka (Jatrabari)-Comilla (Mainamati)-Chittago...,4.925,35.4,2006.0,1.0,Dhaka,Dhaka,Narayanganj,Narayanganj-1,23.693611,90.478833,bcs1
2,N1,8.976,PC Girder Bridge,LRP008b,Kanch pur Bridge.,394.23,A,119889,Dhaka (Jatrabari)-Comilla (Mainamati)-Chittago...,8.976,,,,Dhaka,Dhaka,Narayanganj,Narayanganj-1,23.704583,90.518833,road_precise
3,N1,10.88,Box Culvert,LRP010b,NOYAPARA CULVERT,6.3,A,112531,Dhaka (Jatrabari)-Comilla (Mainamati)-Chittago...,10.88,12.2,1992.0,2.0,Dhaka,Dhaka,Narayanganj,Vitikandi,23.699833,90.530722,bcs1
4,N1,10.897,Box Culvert,LRP010c,ADUPUR CULVERT,6.3,A,112532,Dhaka (Jatrabari)-Comilla (Mainamati)-Chittago...,10.897,12.2,1984.0,2.0,Dhaka,Dhaka,Narayanganj,Vitikandi,23.699667,90.530722,bcs1


We create a new ID from the road name and the LRP name to define a unique location point for each LRP, so that a bridge can be differentiated from bridges that carry the same LRP name on another road.

In [4]:
bridges['UniqueID']=bridges['road']+bridges['LRPName']

Next we look for bridges that share the same UniqueID. This can happen when two bridges are built at the same location, or when a new bridge is constructed next to an old one. For the purposes of this model, we discard these duplicates and keep a single record per UniqueID. Note that every structureNr value is unique, so this column can be used to cross-check records and to guard against other errors (for example quality or capacity issues) later on.

In [5]:
boolen = bridges.duplicated(subset=['UniqueID'])
boolen.value_counts()

False    18327
True      3080
dtype: int64

In [6]:
boolen = bridges.duplicated(subset=['structureNr'])
boolen.value_counts()

False    21407
dtype: int64

In [7]:
# make copy of the original DataFrame, so we can always compare later
BMMS_mod = bridges.copy()

# count NaN values for each row only in relevant columns
BMMS_mod['count_NaN'] = bridges[['road', 'km', 'type', 'LRPName', 'name', 'length', 'condition', 'structureNr', 'chainage', 'width', 'constructionYear', 'spans', 'lat', 'lon','UniqueID']].isnull().sum(axis=1)

# sort by road and LRPName, then fewest NaN values first, then newest construction year first
BMMS_mod = BMMS_mod.sort_values(by=['road', 'LRPName', 'count_NaN', 'constructionYear'], ascending=[True, True, True, False])

# reset the index
BMMS_mod = BMMS_mod.reset_index(drop=True)

# drop duplicates and keep the first one (least NaN values)
BMMS_mod = BMMS_mod.drop_duplicates(subset=['road','LRPName'], keep='first')

# reset the index again
BMMS_mod = BMMS_mod.reset_index(drop=True)

In [8]:
BMMS_mod.shape

(18327, 22)

In [9]:
BMMS_mod.head()

Unnamed: 0,road,km,type,LRPName,name,length,condition,structureNr,roadName,chainage,...,spans,zone,circle,division,sub-division,lat,lon,EstimatedLoc,UniqueID,count_NaN
0,N1,1.8,Box Culvert,LRP001a,.,11.3,A,117861,Dhaka (Jatrabari)-Comilla (Mainamati)-Chittago...,1.8,...,2.0,Dhaka,Dhaka,Narayanganj,Narayanganj-1,23.702889,90.450389,bcs1,N1LRP001a,0
1,N1,4.925,Box Culvert,LRP004b,.,6.6,A,117862,Dhaka (Jatrabari)-Comilla (Mainamati)-Chittago...,4.925,...,1.0,Dhaka,Dhaka,Narayanganj,Narayanganj-1,23.693611,90.478833,bcs1,N1LRP004b,0
2,N1,8.976,PC Girder Bridge,LRP008b,KANCHPUR PC GIRDER BRIDGE,397.0,C,101102,Dhaka (Jatrabari)-Comilla (Mainamati)-Chittago...,8.976,...,8.0,Dhaka,Dhaka,Narayanganj,Narayanganj-1,23.702083,90.515917,bcs1,N1LRP008b,0
3,N1,10.543,Box Culvert,LRP010a,KATCHPUR BOX CULVERT,8.0,B,101106,Dhaka (Jatrabari)-Comilla (Mainamati)-Chittago...,10.543,...,2.0,Dhaka,Dhaka,Narayanganj,Vitikandi,23.702056,90.528194,bcs1,N1LRP010a,0
4,N1,10.88,Box Culvert,LRP010b,NOYAPARA CULVERT,6.3,A,112531,Dhaka (Jatrabari)-Comilla (Mainamati)-Chittago...,10.88,...,2.0,Dhaka,Dhaka,Narayanganj,Vitikandi,23.699833,90.530722,bcs1,N1LRP010b,0
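
The de-duplication rule used above (prefer the record with the fewest missing values, then the most recent construction year) can be illustrated on a small toy frame. This is only a sketch: the three rows and the reduced column set are made up for illustration and are not taken from the BMMS data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two records share road N1 / LRP001a,
# and the second one is more complete than the first.
toy = pd.DataFrame({
    'road': ['N1', 'N1', 'N2'],
    'LRPName': ['LRP001a', 'LRP001a', 'LRP001a'],
    'width': [np.nan, 19.5, 7.0],
    'constructionYear': [1990.0, 2005.0, 1984.0],
    'structureNr': [1, 2, 3],
})

# Count missing values per row in the columns of interest.
toy['count_NaN'] = toy[['width', 'constructionYear']].isnull().sum(axis=1)

# Most complete record first, newest construction year as tie-breaker.
toy = toy.sort_values(
    by=['road', 'LRPName', 'count_NaN', 'constructionYear'],
    ascending=[True, True, True, False],
)

# Keep the first (best) record for every road/LRP pair.
deduped = (
    toy.drop_duplicates(subset=['road', 'LRPName'], keep='first')
       .reset_index(drop=True)
)
print(deduped[['road', 'LRPName', 'structureNr']])
```

Here the incomplete N1 record (structureNr 1) is dropped, while the N2 record survives because the road name is part of the key.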


Now that we have tentatively removed duplicate values from the bridges data, we need to check whether the LRP identifier of each bridge exists in the roads dataset. Since we are not looking at the location data yet, the coordinate processing in the next cell can be ignored for now; we focus only on the LRPs and the road names.

In [10]:
df = pd.read_excel('data/Roads_InfoAboutEachLRP.xlsx')
highList=pd.unique(df.road)
roadMap=pd.DataFrame()
atlas={}
for ID in highList:
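    road = df.loc[df.road==ID].copy()  # explicit copy avoids SettingWithCopyWarning
    road = road.reset_index(drop=True)  # reset_index is not in-place; index 0..n-1 is needed for the window logic below
    road['Xchanged']='False'
    road['Ychanged']='False'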
    atlas[ID]=len(road)
    if len(road)>5:
        for count in range(0,3):
            for i in road.index:
                #loop1, loop2, loop3 = False, False, False 
                '''while i < 5: 
                    # capture confidence interval in absolute terms (not with multiplicator)
                    x_min, x_max = road.loc[i:i+5,'lon'].median() * 0.999 , road.loc[i:i+5,'lon'].median() * 1.001
                    y_min, y_max = road.loc[i:i+5,'lat'].median() * 0.999 , road.loc[i:i+5,'lat'].median() * 1.001

                    if not x_min <  road.loc[i,'lon'] < x_max: # check for outliers and overwrite 
                        road.loc[i,'lon'] = (road.loc[i-1,'lon'] + road.loc[i+1,'lon']) / 2 # position point right in between last and next point 
                        road.loc[i,'Xchanged']='True'

                    if not y_min < road.loc[i,'lat'] < y_max:
                        road.loc[i,'lat'] = (road.loc[i-1,'lat'] + road.loc[i+1,'lat']) / 2 # position point right in between last and next point 
                        road.loc[i,'Ychanged']='True'
                    i+=1
                    break'''
                # only smooth points with five neighbours on each side (full 11-point window)
                if 5 <= i < len(road)-5:
                    x_min, x_max = road.loc[i-5:i+5,'lon'].median() * .9999 , road.loc[i-5:i+5,'lon'].median() * 1.0001 # how big do we want to make search depth?
                    y_min, y_max = road.loc[i-5:i+5,'lat'].median() * .9999 , road.loc[i-5:i+5,'lat'].median() * 1.0001 

                    if not x_min <  road.loc[i,'lon'] < x_max: # check for outliers and overwrite 
                        road.loc[i,'lon'] = (road.loc[i-1,'lon'] + road.loc[i+1,'lon']) / 2 # position point right in between last and next point
                        road.loc[i,'Xchanged']='True'

                    if not y_min < road.loc[i,'lat'] < y_max:
                        road.loc[i,'lat'] = (road.loc[i-1,'lat'] + road.loc[i+1,'lat']) / 2 # position point right in between last and next point
                        road.loc[i,'Ychanged']='True'
                '''while i > len(road)-5: 
                    # capture confidence interval in absolute terms (not with multiplicator)
                    x_min, x_max = road.loc[i:i+5,'lon'].median() * 0.999 , road.loc[i:i+5,'lon'].median() * 1.001
                    y_min, y_max = road.loc[i:i+5,'lat'].median() * 0.999 , road.loc[i:i+5,'lat'].median() * 1.001

                    if not x_min <  road.loc[i,'lon'] < x_max: # check for outliers and overwrite 
                        road.loc[i,'lon'] = (road.loc[i-1,'lon'] + road.loc[i+1,'lon']) / 2 # position point right in between last and next point 
                        road.loc[i,'Xchanged']='True'

                    if not y_min < road.loc[i,'lat'] < y_max:
                        road.loc[i,'lat'] = (road.loc[i-1,'lat'] + road.loc[i+1,'lat']) / 2 # position point right in between last and next point 
                        road.loc[i,'Ychanged']='True'
                    i+=1
                    break'''
    roadMap=pd.concat([roadMap,road])
roadMap=roadMap.reset_index(drop=True)
'''#WARNING: TAKES ATLEAST 15 mins to run

for road in roadMap['road'].unique():
    for elem in roadMap.loc[roadMap['road'] == road].index:
        # NOT first or last LRP
        if elem != 0 and elem != 1 and elem != (len(roadMap.loc[roadMap['road'] == road])-1) and elem != (len(roadMap.loc[roadMap['road'] == road])-2):
                    
                #far from both neighbors
            if abs(roadMap.loc[elem, 'lat'] - roadMap.loc[elem-2, 'lat']) > 0.1 and abs(roadMap.loc[elem, 'lat'] - roadMap.loc[elem+2, 'lat']) > 0.1 :  
                    roadMap.loc[elem, 'lat'] = (roadMap.loc[elem-3, 'lat'] + roadMap.loc[elem+3, 'lat'])/2 # replace with average of neighbors
                    roadMap.loc[elem, 'lon'] = (roadMap.loc[elem-3, 'lon'] + roadMap.loc[elem+3, 'lon'])/2 # replace with average of neighbors
                    #print(road)
                    
                    #far from precursor only
            elif abs(roadMap.loc[elem, 'lat'] - roadMap.loc[elem-2, 'lat']) > 0.1 and elem > 6: 
                if abs(roadMap.loc[elem+3, 'lat'] - (roadMap.loc[elem-3, 'lat'] + abs(roadMap.loc[elem-6, 'lat'] - roadMap.loc[elem-3, 'lat']))) < abs(roadMap.loc[elem+3, 'lat'] - roadMap.loc[elem, 'lat']): # if extending linear trend of previous datapoints brings outlier closer to successor
                    roadMap.loc[elem, 'lat'] = roadMap.loc[elem-3, 'lat'] + abs(roadMap.loc[elem-6, 'lat'] - roadMap.loc[elem-3, 'lat'])
                    roadMap.loc[elem, 'lon'] = roadMap.loc[elem-3, 'lon'] + abs(roadMap.loc[elem-6, 'lon'] - roadMap.loc[elem-3, 'lon'])
                            
            # last LRP
            elif elem == len(roadMap.loc[roadMap['road'] == road])-1 or elem == len(roadMap.loc[roadMap['road'] == road])-2: 
                if abs(roadMap.loc[elem, 'lat'] - roadMap.loc[elem-3, 'lat']) > 0.1: #if far from precursor
                    roadMap.loc[elem, 'lat'] = roadMap.loc[elem-3, 'lat'] + abs(roadMap.loc[elem-6, 'lat'] - roadMap.loc[elem-3, 'lat']) #replace with linear extension of precursors
                    roadMap.loc[elem, 'lon'] = roadMap.loc[elem-3, 'lon'] + abs(roadMap.loc[elem-6, 'lon'] - roadMap.loc[elem-3, 'lon'])
                    #print(road)
           
            # first LRP
            elif elem == 0 or elem == 1: 
                if abs(roadMap.loc[elem, 'lat'] - roadMap.loc[elem+3, 'lat']) > 0.1: #if far from successor
                    roadMap.loc[elem, 'lat'] = roadMap.loc[elem+3, 'lat'] - abs(roadMap.loc[elem+3, 'lat'] - roadMap.loc[elem+6, 'lat']) #replace with linear extension of successor
                    roadMap.loc[elem, 'lon'] = roadMap.loc[elem+3, 'lon'] - abs(roadMap.loc[elem+3, 'lon'] - roadMap.loc[elem+6, 'lon'])
roadMap=roadMap.reset_index(drop=True)'''
roadMap['UniqueID']=roadMap['road']+roadMap['lrp']


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  road['Xchanged']='False'
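
The windowed outlier check in the cell above can also be written without explicit Python loops. The following is a minimal sketch under stated assumptions, not the method used in this notebook: the helper name `smooth_outliers`, the `_changed` column suffix and the toy coordinates are hypothetical, and it performs a single vectorized pass, whereas the loop above updates points sequentially and repeats three times. It works on an explicit copy, which also avoids the SettingWithCopyWarning.

```python
import pandas as pd

def smooth_outliers(road, col, window=11, tol=1e-4):
    """Replace points that deviate from the centered rolling median by more
    than a relative tolerance with the mean of their two neighbours."""
    road = road.reset_index(drop=True).copy()
    median = road[col].rolling(window, center=True).median()
    # The rolling median is NaN for the first and last five points, and
    # comparisons with NaN are False, so edge points are left untouched.
    is_outlier = (road[col] - median).abs() > tol * median.abs()
    neighbour_mean = (road[col].shift(1) + road[col].shift(-1)) / 2
    road[col + '_changed'] = is_outlier
    road.loc[is_outlier, col] = neighbour_mean[is_outlier]
    return road

# Hypothetical road with 20 evenly spaced points and one bad longitude.
lon = [90.0 + 0.001 * k for k in range(20)]
lon[10] = 95.0
toy_road = pd.DataFrame({'lon': lon})

fixed = smooth_outliers(toy_road, 'lon')
print(fixed.loc[8:12])
```

Only the point at position 10 is flagged, and it is moved halfway between its two neighbours.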

Now, using the UniqueID column of each dataframe, we can compare the two sets of identifiers to see how many points are present in both datasets. This allows us to see how many points would be lost if we merged the datasets right now.

In [11]:
lrpRoad = set(pd.unique(roadMap.UniqueID))  # a set gives fast membership tests
lrpBridge = pd.unique(bridges.UniqueID)
# count the bridge IDs that also appear among the road LRP IDs
count = sum(1 for i in lrpBridge if i in lrpRoad)
print('Number of Coinciding bridges and Road LRPs is : '+str(count))
print('Number of Bridges is : '+str(len(lrpBridge)))
print('Number of Roads is : '+str(len(lrpRoad)))
print('Number of bridges without road LRPs : '+str(len(lrpBridge)-count))

Number of Coinciding bridges and Road LRPs is : 12561
Number of Bridges is : 18327
Number of Roads is : 51928
Number of bridges without road LRPs : 5766
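
To see which bridges would be lost, rather than only how many, one option is a left merge with `indicator=True`. The sketch below uses small made-up frames (`bridges_toy`, `roads_toy`) in place of the real data, so the IDs are purely illustrative.

```python
import pandas as pd

# Hypothetical stand-ins for the bridge and road LRP tables.
bridges_toy = pd.DataFrame({'UniqueID': ['N1LRP001a', 'N1LRP004b', 'N2LRP010x']})
roads_toy = pd.DataFrame({'UniqueID': ['N1LRP001a', 'N1LRP004b', 'N1LRP005']})

# A left merge keeps every bridge; the _merge column records
# whether a matching road LRP was found ('both') or not ('left_only').
merged = bridges_toy.merge(
    roads_toy.drop_duplicates(subset='UniqueID'),
    on='UniqueID',
    how='left',
    indicator=True,
)

unmatched = merged.loc[merged['_merge'] == 'left_only', 'UniqueID'].tolist()
print(unmatched)
```

The resulting list of unmatched IDs is a natural starting point for inspecting why those LRP names do not line up.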


Due to a shortage of time, we were unable to proceed further with the data cleaning on this portion. However, given the opportunity, we would have moved forward with the following steps:

1. Check whether the LRP of the bridge exists in the road LRPs.

2. If it does, check whether the coordinates match.
   - If they match, move on.
   - If they do not match, measure the distance between the two LRPs. If the distance is small, overwrite the road coordinates with the bridge coordinates; if it is large, overwrite the bridge coordinates with the road coordinates.

3. If the LRP cannot be found in the road LRPs, inspect the data, identify the errors that prevent a match, and edit the bridge LRP so that it fits the road LRPs.
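
The steps above could be sketched roughly as follows. This was not implemented in the notebook; the function names `haversine_km` and `reconcile`, the `action` labels and in particular the 0.5 km cut-off between "small" and "large" distances are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

def reconcile(bridges, roads, small_km=0.5):
    """Label each bridge with the action suggested by the steps above.
    small_km is a hypothetical threshold, not a value from the source."""
    merged = bridges.merge(
        roads[['UniqueID', 'lat', 'lon']], on='UniqueID', how='left',
        suffixes=('_bridge', '_road'), indicator=True,
    )
    dist = haversine_km(merged['lat_bridge'], merged['lon_bridge'],
                        merged['lat_road'], merged['lon_road'])
    found = merged['_merge'] == 'both'
    merged['distance_km'] = dist
    merged['action'] = 'match'  # LRP found and coordinates identical
    merged.loc[~found, 'action'] = 'inspect_manually'
    merged.loc[found & (dist > 0) & (dist <= small_km), 'action'] = 'use_bridge_coords'
    merged.loc[found & (dist > small_km), 'action'] = 'use_road_coords'
    return merged

# Hypothetical example covering all four cases.
bridges_toy = pd.DataFrame({
    'UniqueID': ['A', 'B', 'C', 'D'],
    'lat': [23.70, 23.70, 23.70, 23.70],
    'lon': [90.45, 90.45, 90.45, 90.45],
})
roads_toy = pd.DataFrame({
    'UniqueID': ['A', 'B', 'C'],
    'lat': [23.70, 23.7001, 24.70],
    'lon': [90.45, 90.45, 90.45],
})
result = reconcile(bridges_toy, roads_toy)
print(result[['UniqueID', 'distance_km', 'action']])
```

The function only labels rows; actually overwriting coordinates would be a separate step once the threshold has been chosen with care.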

In [12]:
BMMS_mod = BMMS_mod.drop(columns=['count_NaN','UniqueID'])

In [13]:
# save BMMS data with pandas
BMMS_mod.to_excel('Data/processed/BMMS_overview.xlsx', index=False, sheet_name='BMMS_overview')