# Imports

In [1]:
import pandas as pd
import csv

# Data loading
In this notebook, we clean the BMMS and \_roads datasets.

In [2]:
# Load the BMMS file directly from the .xlsx file
BMMS = pd.read_excel('../Data/raw/BMMS_overview.xlsx')

# Sneak peek of the dataset
BMMS.head(5)

FileNotFoundError: [Errno 2] No such file or directory: '../Data/raw/BMMS_overview.xlsx'

In [None]:
# Load tcv file as a tab-separated csv. Results in a nested list.
with open('../Data/raw/_roads.tcv') as f:
    reader = csv.reader(f, delimiter="\t")
    d = list(reader)

# Sneak peek of the dataset
d[1][:10]

# Cleaning the roads dataset
The roads are now in a nested list. Each element of the list represents one road: its first element is the road name, followed by a repeating sequence of LRPName, latitude, longitude for every point on the road. <br>
We convert all entries to their correct datatype and correct obvious outliers.
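To make the row layout concrete, here is a minimal sketch that splits one road row into its name and its (LRPName, latitude, longitude) triples. The helper `split_road` and the example row are hypothetical and not part of the notebook or the dataset; the sketch assumes a row has the form described above.

```python
def split_road(row):
    """Split one road row into its name and a list of (LRPName, lat, lon) triples."""
    name, rest = row[0], row[1:]
    # Walk the remainder in steps of three; an incomplete trailing triple is ignored.
    usable = len(rest) - len(rest) % 3
    points = [tuple(rest[i:i + 3]) for i in range(0, usable, 3)]
    return name, points


# Made-up example row (assumption), values are still strings as after csv.reader
example_row = ["N1", "LRPS", "23.70", "90.44", "LRP001", "23.71", "90.45"]
name, points = split_road(example_row)
print(name, points)
```

A road with a single datapoint would yield exactly one triple, which matches a row of four entries.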


In the dataset, many of the variables are stored as strings instead of floats. The first step is to convert all road coordinates to floats.

In [None]:
# For each element, try to convert to float. Only succeeds for latitudes and longitudes;
# road names and LRP names raise a ValueError and are left as strings.
for road in range(len(d)):
    for elem in range(len(d[road])):
        try:
            d[road][elem] = float(d[road][elem])
        except ValueError:
            pass

Next, we remove all roads that consist of only one datapoint.

In [None]:
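# A road with a single datapoint has exactly 4 entries: name, LRPName, latitude, longitude.
# Filtering into a new list avoids the index shifts caused by removing items during iteration.
d = [road for road in d if len(road) != 4]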

Next, we look at all datapoints of each road and check for clear errors in the data. Clear errors are identified by their lack of proximity to their neighboring road points. For this purpose, we assume that every outlier lies at least 0.1 degrees away from its neighbors, which corresponds to roughly 11 kilometers in latitude. <br>
A spike is a single outlier whose two neighbors both have reasonable coordinates; it is replaced by the average latitude/longitude of its neighbors. Some outliers come in consecutive pairs; these are handled by extending the trend of the previous two correct datapoints to replace the outliers. To avoid applying this trend wrongly, we check whether the procedure brings the outlier closer to its succeeding neighbor. <br>
The start and the end of a road naturally have only one neighboring point, so here we consider only their proximity to that one neighbor.
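The spike rule can be illustrated in isolation. The sketch below is a simplified version that works on a flat list holding a single coordinate (for example only latitudes), not on the interleaved road layout used in this notebook. The function name `fix_spikes` and the sample values are assumptions made for illustration; it handles only isolated spikes, not pairs of outliers or road ends.

```python
THRESHOLD = 0.1  # degrees, the cut-off stated in the text


def fix_spikes(values, threshold=THRESHOLD):
    """Return a copy of values with isolated spikes replaced by the mean of their neighbours."""
    fixed = list(values)
    for i in range(1, len(fixed) - 1):
        prev, cur, nxt = fixed[i - 1], fixed[i], fixed[i + 1]
        # A spike is far from both its predecessor and its successor.
        if abs(cur - prev) > threshold and abs(cur - nxt) > threshold:
            fixed[i] = (prev + nxt) / 2
    return fixed


# Made-up latitudes (assumption): the third value is an isolated spike
lats = [23.70, 23.71, 25.72, 23.73, 23.74]
print(fix_spikes(lats))
```

Two consecutive outliers are left untouched by this rule, because each of them is close to the other; that is why the notebook treats such pairs separately.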

In [None]:
for road in range(1, len(d)): # for each road
    for elem in range(2, len(d[road])): # for each LRP of that road
        if isinstance(d[road][elem], float): # only check coordinates 
            
            # NOT first or last LRP
            if elem != 2 and elem != 3 and elem != len(d[road])-1 and elem != len(d[road])-2:
                    
                    #far from both neighbors
                    if abs(d[road][elem]-d[road][(elem-3)]) > 0.1 and abs(d[road][elem]-d[road][(elem+3)]) > 0.1 :  
                        d[road][elem] = (d[road][elem-3]+d[road][elem+3])/2 # replace with average of neighbors
                    
                    #far from precursor only
                    elif abs(d[road][elem]-d[road][(elem-3)]) > 0.1 and elem>6: 
                        if abs(d[road][elem+3] - (d[road][elem-3] + abs(d[road][elem-6]-d[road][elem-3]))) < abs(d[road][elem+3]-d[road][elem]):# if extending linear trend of previous datapoints brings outlier closer to successor
                            d[road][elem] = d[road][elem-3] + abs(d[road][elem-6]-d[road][elem-3])
                            
            # last LRP
            elif elem==len(d[road])-1 or elem==len(d[road])-2:
                if abs(d[road][elem]-d[road][(elem-3)]) > 0.1 and elem>6: #if far from precursor and two precursors exist
                    d[road][elem] = d[road][elem-3] + abs(d[road][elem-6]-d[road][elem-3]) #replace with linear extension of precursors

            # first LRP
            elif elem==2 or elem==3:
                if elem+6 < len(d[road]) and abs(d[road][elem]-d[road][(elem+3)]) > 0.1 and abs(d[road][elem+3]-d[road][(elem+6)]) < 0.1: #if two successors exist, first LRP is outlier and second LRP is NOT outlier
                    d[road][elem] = d[road][elem+3] - abs(d[road][elem+6]-d[road][elem+3]) # replace with linear extension of successors

# Cleaning the bridges dataset
The bridges dataset contains a number of problems that are tackled in this section:
- Duplicate bridges are removed, keeping the record that is most complete
- Bridges at an incorrect location are moved to the matching road point
- Bridges that do not correspond to a road point in the roads dataset are removed

In [None]:
# Inspect the data
BMMS.info()

Here, duplicate bridges are removed while preserving the bridge data that is most complete.

In [None]:
# make copy of the original DataFrame, so we can always compare later
BMMS_mod = BMMS.copy()

# count NaN values for each row only in relevant columns
BMMS_mod['count_NaN'] = BMMS[['road', 'km', 'type', 'LRPName', 'name', 'length', 'condition', 'structureNr', 'chainage', 'width', 'constructionYear', 'spans', 'lat', 'lon']].isnull().sum(axis=1)

# sort by road and LRPName, then by fewest NaN values and most recent constructionYear
BMMS_mod = BMMS_mod.sort_values(by=['road', 'LRPName', 'count_NaN', 'constructionYear'], ascending=[True, True, True, False])

# reset the index
BMMS_mod = BMMS_mod.reset_index(drop=True)

# drop duplicates and keep the first one (least NaN values)
BMMS_mod = BMMS_mod.drop_duplicates(subset=['road','LRPName'], keep='first')

# reset the index again
BMMS_mod = BMMS_mod.reset_index(drop=True)
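The sort-then-drop pattern can be checked on a small toy frame. The sketch below uses made-up rows and only two of the data columns (both are assumptions for illustration); it shows that after sorting by the NaN count, `keep='first'` retains the most complete record of each (road, LRPName) pair.

```python
import pandas as pd

# Toy data (assumption): the first two rows describe the same bridge location
toy = pd.DataFrame({
    "road": ["N1", "N1", "N2"],
    "LRPName": ["LRP001", "LRP001", "LRP005"],
    "length": [None, 12.5, 8.0],
    "condition": [None, "A", "B"],
})

# Count missing values per row in the columns of interest
toy["count_NaN"] = toy[["length", "condition"]].isnull().sum(axis=1)

# Most complete record first within each (road, LRPName) group, then drop the rest
deduped = (
    toy.sort_values(by=["road", "LRPName", "count_NaN"])
       .drop_duplicates(subset=["road", "LRPName"], keep="first")
       .reset_index(drop=True)
)
print(deduped)
```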

In [None]:
# check shape again after dropping the duplicates to see how many bridges were duplicates
BMMS_mod.shape

The dataframe decreased from 21407 entries to 18327 entries, which means that 3080 entries were duplicates. <br>
Next, we match all bridges to the cleaned road data. Every bridge that lies on a road point in the road data is moved to the location of that road point. <br>
Some road starts or ends have different names ('LRPS', 'LRPSg', or 'LRPSf') in the roads and BMMS files; these are brought to the same location as well. <br>
Bridges that are on LRPs that do not exist in the road dataset are removed. <br>
Bridges that are on roads that do not exist in the road dataset are linked together to create a coarse representation of the road that should connect these bridges.

In [None]:
# convert road list to dictionary 
di = {d[el][0] : d[el][1:] for el in range(1, len(d))}

In [None]:
# create list of alternative LRPNames
alt_names = ['LRPS', 'LRPSg', 'LRPSf']

# empty list to append new roads to
newRoads = []

for index, row in BMMS_mod.iterrows():
    # if road is in the road list, loop through nested points. If LRPName of bridge can be found on that road, line up the coordinates.
    if row['road'] in di: 
        for point in di[row['road']]: 
            if row['LRPName'] == point: 
                BMMS_mod.loc[index, 'lat'] = di[row['road']][di[row['road']].index(point) + 1]
                BMMS_mod.loc[index, 'lon'] = di[row['road']][di[row['road']].index(point) + 2]
                break
            if row['LRPName'] in alt_names and point in alt_names:
                # the point does occur, but under an alternative name
                BMMS_mod.loc[index, 'lat'] = di[row['road']][di[row['road']].index(point) + 1]
                BMMS_mod.loc[index, 'lon'] = di[row['road']][di[row['road']].index(point) + 2]
                break
            
        else:
            # if the linked LRPName is not found on the road, remove the bridge
            BMMS_mod.drop(index, inplace=True)
    else:
        # if the road is not found in the road dictionary, a new road is created that connects all bridges that should be on it
        # names of the roads that have already been added to the newRoads list
        roadsAdded = [el[0] for el in newRoads]
        # if road is already created
        if row['road'] in roadsAdded:
            # find the position of this road in newRoads and extend its list of points
            road_idx = roadsAdded.index(row['road'])
            newRoads[road_idx].extend([row['LRPName'], row['lat'], row['lon']])
        else:
            # if road does not exist yet, create it
            newRoads.append([row['road'], row['LRPName'], row['lat'], row['lon']])
            
# merge the roads list with the newRoads list
d = d + newRoads
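The three outcomes of the matching step (bridge matched, LRP unknown, road unknown) can also be sketched with a nested dictionary lookup. This is an alternative illustration on toy data, not the procedure used in the cell above: the helpers `build_lookup` and `locate` and all sample values are hypothetical, alternative start names are not handled, and if an LRPName occurs twice on a road the dictionary keeps the last occurrence, whereas `list.index` finds the first.

```python
def build_lookup(roads):
    """Map road name -> {LRPName: (lat, lon)} for rows laid out as [name, LRP, lat, lon, ...]."""
    lookup = {}
    for row in roads:
        name, rest = row[0], row[1:]
        lookup[name] = {
            rest[i]: (rest[i + 1], rest[i + 2])
            for i in range(0, len(rest) - 2, 3)
        }
    return lookup


def locate(lookup, road, lrp):
    """Return (status, coords) for a bridge given by its road and LRPName."""
    if road not in lookup:
        return "unknown road", None
    if lrp not in lookup[road]:
        return "unknown LRP", None
    return "matched", lookup[road][lrp]


# Toy data (assumption), coordinates already converted to floats
toy_roads = [["N1", "LRPS", 23.70, 90.44, "LRP001", 23.71, 90.45]]
lookup = build_lookup(toy_roads)
for road, lrp in [("N1", "LRP001"), ("N1", "LRP999"), ("Z9", "LRP001")]:
    print(road, lrp, locate(lookup, road, lrp))
```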

In [None]:
# check shape again
BMMS_mod.shape

In [None]:
# check how many new roads are created
len(newRoads)

This procedure removed 5283 bridges from the dataset, which is roughly 29% of the 18327 bridges that remained after deduplication. In addition, 28 new roads were added.

# Saving modified data

In [None]:
# an additional column was added to the BMMS file in the cleaning process, 
# we revert the dataset to its original format before saving again 
BMMS_mod = BMMS_mod.drop(columns=['count_NaN'])

In [None]:
# save BMMS data with pandas
BMMS_mod.to_excel('../Data/processed/BMMS_overview.xlsx', index=False, sheet_name='BMMS_overview')

In [None]:
# save road data with csv writer
with open('../Data/processed/_roads.tcv', 'w', newline='') as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerows(d)
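A quick round-trip check can confirm that rows of unequal length survive the tab-separated format. The sketch below writes toy rows to a temporary file rather than to the project folders; the file name `roads_check.tcv` and the rows are assumptions for illustration. Note that coordinates come back as strings after reloading, so they would need converting to floats again.

```python
import csv
import os
import tempfile

# Toy rows (assumption) of different lengths, as in the roads list
rows = [
    ["N1", "LRPS", 23.70, 90.44, "LRP001", 23.71, 90.45],
    ["N2", "LRPS", 24.10, 90.10],
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "roads_check.tcv")
    # Write the rows as tab-separated values
    with open(path, "w", newline="") as f:
        csv.writer(f, delimiter="\t").writerows(rows)
    # Read them back the same way the notebook loads the raw file
    with open(path, newline="") as f:
        reloaded = list(csv.reader(f, delimiter="\t"))

same_shape = [len(r) for r in reloaded] == [len(r) for r in rows]
print(same_shape, type(reloaded[0][2]).__name__)
```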