# ECS7024P Coursework 2

##  1. Introduction

### 1.1 The Dataset

The original data comes from the [National Bridge Inspection](https://www.fhwa.dot.gov/bridge/nbi/ascii.cfm) section of the FHWA's web site. However, this dataset has been greatly simplified.

**The aim of the Bridge Inspection programme is to check on the state of bridges so that necessary repairs can be carried out. If this is not done, a bridge can fail. The dataset has information about the bridges and the condition given in the most recent inspection.**

* The FHWA's database covers the whole USA, but this dataset focuses exclusively on Texas. 
* While the FHWA dataset also includes tunnels and culverts, the Texas data used here includes only bridges. Culverts (drains under highways) have been removed.
* All of the bridges carry a highway, but what lies underneath (another road, waterway, or railway) varies.


The original FHWA dataset has over 100 variables (Texas collects even more). This version has been simplified and includes both continuous and categorical variables.


| Variable      |      Description             | Type | 
|:--------------|:-----------------------------|:------:|
|Structure_id   | Unique identifier of the bridge                  | String |
|District       | Highway district in Texas responsible for bridge | category | 
|Detour_Km      | Length of detour if bridge closed                | continuous |
|Toll           | Whether a toll is paid to use bridge             | category |
|Maintainer     | The authority responsible for maintenance        | category |
|Urban          | Whether the bridge is located in an urban or rural area   | category |
|Status         | The road class: interstate to local                       | category | 
|Year           | The year the bridge was built                             | continuous | 
|Lanes_on       | The number of lanes that run over the bridge              | continuous (or discrete) |
|Lanes_under    | The number of lanes that run under the bridge             | continuous (or discrete) |
|AverageDaily   | The average daily traffic (number of vehicles)            | continuous |
|Future_traffic | The estimated daily traffic in approx 20 years time       | continuous |
|Trucks_percent | The percent of traffic made up of 'trucks' (i.e. lorries) | continuous |
|Historic       | Whether the bridge is historic                            | category | 
|Service_under  | The (most important) service that runs under the bridge   | category |
|Material       | The dominant material the bridge is made from             | category |
|Design         | The design of the bridge                                  | category |
|Spans          | The number of spans the bridge has                        | category (or discrete) |
|Length         | The length of the bridge in metres                        | continuous |
|Width          | The width of the bridge in metres                         | continuous |
|Rated_load     | The rated max loading of bridge (in tonnes)               | continuous |
|Scour_rating   | Only for bridges over water: the 'scour' condition        | ordinal |
|Deck_rating    | The condition of the deck of the bridge                   | ordinal |
|Superstr_rating| The condition of the bridge superstructure                | ordinal |
|Substr_rating  | The condition of the bridge substructure (foundations)    | ordinal |

**Note on 'scour'**: For bridges over water, the flow can erode or weaken the bridge supports (piers). This process, called "scouring", is measured by the `Scour_rating`.

 
**Values of Categorical Variables** In the original data, the values of the categorical variables are represented as integers, with their meanings given in a data dictionary. In this dataset, these 'numeric codes' have been replaced with suitable names.

| Variable      |      Values            |
|:--------------|:-----------------------|
|District       | Each district has a unique number  |
|Toll           | Toll, Free                |
|Maintainer     | State, County, Town or City, Agency, Private, Railroad, Toll Authority, Military, Unknown |
|Urban          | Urban, Rural |
|Status         | Interstate, Arterial, Minor, Local |
|Historic       | Register, Possible, Unknown, Not historic |
|Service_under  | Other, Highway, Railroad, Pedestrian, Interchange, Building |
|Material       | Other, Concrete, Steel, Timber, Masonry |
|Design         | Other, Slab, Beam, Frame, Truss, Arch, Suspension, Movable, Tunnel, Culvert, Mixed |
|Scour_rating   | Unknown, Critical, Unstable, Stable, Protected, Dry, No waterway |
|Deck_rating    | *Rating*: NA, Excellent, Very Good, Good, Satisfactory, Fair, Poor, Serious, Critical, Failing, Failed |
|Superstr_rating| *Rating* |
|Substr_rating  | *Rating* |


### 1.2 Scenario

The Texas Department of Transportation aims to investigate how well specific variables can predict the current condition of bridges. The following variables are of interest: 

1. **Age** (derived from the `Year` variable)
2. **Average Daily Traffic** (`AverageDaily`)
3. **Percentage of Trucks** (`Trucks_percent`)
4. **Material** (`Material`)
5. **Design** (`Design`)

The current condition of bridges is derived from three variables `Deck_rating`, `Superstr_rating` and `Substr_rating` of the bridges. 
The department wishes to answer the following questions:

1. How well can these variables predict the current condition of bridges?
2. Which variables have the greatest influence on the current condition?

The use of regression has been agreed in advance. 


### 1.3 Loading the Data

A 'type map' is used to set the appropriate data type for each variable. Non-numeric fields are represented as categorical variables: the type `category` gives the default behaviour (each unique value becomes a category, and the categories are not ordered). However, the ordinal variables (such as the ratings) must be declared explicitly with a suitable ordered type.
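
As a minimal sketch of why the explicit ordinal type matters, the example below uses a toy series (the values and the name `toy_rating_type` are illustrative and not taken from the dataset). An ordered `CategoricalDtype` supports comparisons and `min`/`max`, whereas the default unordered `category` type only supports equality tests.

```python
import pandas as pd

# Illustrative ordered type with a subset of rating levels (worst to best)
toy_rating_type = pd.CategoricalDtype(
    categories=['Poor', 'Fair', 'Good'], ordered=True)

toy = pd.Series(['Good', 'Poor', 'Fair'], dtype=toy_rating_type)

# Ordered categories allow comparison against a level, and min/max
at_least_fair = toy >= 'Fair'
worst = toy.min()

# The default unordered type rejects ordering comparisons
unordered = pd.Series(['Good', 'Poor', 'Fair'], dtype='category')
try:
    unordered >= 'Fair'
    comparison_allowed = True
except TypeError:
    comparison_allowed = False
```

This is why the rating variables are given `rating_type` rather than plain `category` in the type map.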


**Note**: The Introduction markdown is adapted and modified from Notebook3 provided by Dr. William Marsh.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns  
%matplotlib inline

In [2]:
# The code below declares a categorical type with categories in a specified order
# This can be used for an ordinal variable
rating_type = pd.CategoricalDtype(
    categories=['Failed', 'Failing', 'Critical', 'Serious', 'Poor', 'Fair', 
                'Satisfactory', 'Good', 'Very Good', 'Excellent', 'NA'], 
    ordered=True)

# This one is also for an ordinal variable, but with a slightly different set of values
scour_type = pd.CategoricalDtype(
    categories=['Unknown', 'Critical','Unstable', 'Stable', 'Protected', 'Dry', 'No waterway'], 
    ordered=True)

types_dict = { 'Structure_id': str, 'District':'category', 'Toll':'category', 
              'Maintainer':'category', 'Urban':'category', 'Status':'category', 
              'Historic':'category', 'Service_under':'category', 'Material':'category', 
              'Design':'category', 
              'Deck_rating':rating_type, 'Superstr_rating':rating_type, 'Substr_rating':rating_type, 
              'Scour_rating':scour_type}

bridges = pd.read_csv('tx19_bridges_sample.csv', dtype = types_dict, index_col = 'Structure_id')
bridges  

Structure_id,District,Detour_Km,Toll,Maintainer,Urban,Status,Year,Lanes_on,Lanes_under,AverageDaily,...,Spans,Length,Width,Deck_rating,Superstr_rating,Substr_rating,Rated_load,Trucks_percent,Scour_rating,Future_traffic
000021521-00101,District2,199,Free,Agency,Rural,Local,2005,1,0,1,...,2,31.4,4.3,Good,Very Good,Very Good,41.7,0.0,Dry,1
000021521-00181,District2,199,Free,Agency,Rural,Local,2005,1,0,1,...,1,15.5,4.3,Good,Good,Very Good,41.7,0.0,Dry,1
000021521-TMP20,District2,199,Free,Agency,Rural,Local,2012,2,0,100,...,1,10.1,8.4,Very Good,Very Good,Very Good,48.1,0.0,Dry,150
000021525-00012,District2,199,Free,Agency,Rural,Local,1950,1,0,80,...,14,45.4,3.7,Good,Good,Poor,10.0,0.0,Dry,120
000021580-00092,District2,6,Free,Agency,Rural,Local,2004,2,0,150,...,1,25.0,7.3,Good,Very Good,Very Good,37.2,4.0,Dry,200
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
DAPFORHOO000012,District9,0,Free,Military,Urban,Local,1994,3,0,300,...,3,45.4,19.0,Good,Good,Good,64.3,40.0,No waterway,600
DAPFORHOO000013,District9,0,Free,Military,Urban,Local,2000,2,0,300,...,1,12.3,10.8,Good,Good,Good,35.1,40.0,No waterway,600
DAPFORHOO000015,District9,0,Free,Military,Urban,Minor,1996,2,7,1200,...,4,73.2,9.8,Good,Good,Good,24.5,25.0,No waterway,1500
DAPFORHOO00005A,District9,0,Free,Military,Urban,Local,1991,2,0,300,...,2,27.8,10.4,Good,Good,Satisfactory,53.3,15.0,Stable,300


##  2. Requirements

### 2.1: Data Preparation
In this section, the dataset is prepared by deriving new variables, simplifying categories, and addressing any outliers. The following steps are required for data preparation.

#### 2.1.1 Deriving Age Variable
As the dataset does not have an `Age` variable, it is derived from the `Year` variable. The `Age` of a bridge is calculated by subtracting the year it was built from the current year (2024).

Bridges listed on the historic register (`Historic` equal to `Register`) and bridges older than 100 years are excluded, as shown below.

In [24]:
this_year = 2024
bridges['Age'] = this_year - bridges['Year']

bridges_filtered_age = bridges[(bridges['Historic'].isin(['Not historic', 'Possible', 'Unknown'])) & (bridges['Age'] <= 100)]
bridges_filtered_age[['Historic', 'Age']].head()

Structure_id,Historic,Age
000021521-00101,Not historic,19
000021521-00181,Not historic,19
000021521-TMP20,Not historic,12
000021525-00012,Not historic,74
000021580-00092,Not historic,20


#### 2.1.2 Simplifying Categories in `Material` and `Design` 

To reduce the number of categories in the `Material` and `Design` variables, the very small categories were merged with the existing `Other` category into a single group:
- **Material**: `Masonry` and `Other` were merged into an **"Others"** category.

- **Design**: `Frame`, `Truss`, `Movable`, `Suspension` and `Other` were merged into an **"Others"** category.

In [36]:
#First check distribution of Material and Design 
material_distr = bridges_filtered_age['Material'].value_counts()
design_distr = bridges_filtered_age['Design'].value_counts()

material_distr, design_distr

(Material
 Concrete    26764
 Steel        6400
 Timber        464
 Other          47
 Masonry         5
 Name: count, dtype: int64,
 Design
 Beam          28072
 Slab           4181
 Other          1237
 Arch            111
 Frame            52
 Truss            16
 Movable           8
 Suspension        3
 Name: count, dtype: int64)

In [49]:
# Variable 1: Material
# Define a new (unordered) categorical type for the simplified material
simp_material_type = pd.CategoricalDtype(categories=['Concrete','Steel','Timber','Others'])

#Create new dictionary mapping existing to new values 
simp_material_dict={'Concrete':'Concrete', 'Steel':'Steel', 'Timber':'Timber', 
                    'Other':'Others', 'Masonry':'Others'}

def simpMaterial(row):
    if row.Material in simp_material_dict:
        return simp_material_dict[row.Material]
    return row.Material

bridges_filtered_age = bridges_filtered_age.assign(SimpMat = bridges_filtered_age.apply(simpMaterial, axis=1))
bridges_filtered_age = bridges_filtered_age.astype({'SimpMat':simp_material_type})

bridges_filtered_age.loc[:,['Material', 'SimpMat']].head()

Structure_id,Material,SimpMat
000021521-00101,Concrete,Concrete
000021521-00181,Concrete,Concrete
000021521-TMP20,Concrete,Concrete
000021525-00012,Timber,Timber
000021580-00092,Concrete,Concrete


In [48]:
# Variable 2: Design
# Define a new (unordered) categorical type for the simplified design
simp_design_type = pd.CategoricalDtype(categories=['Beam','Slab','Arch','Others'])

#Create new dictionary mapping existing to new values 
simp_design_dict={'Beam':'Beam', 'Slab':'Slab', 'Arch':'Arch', 'Other':'Others', 
                  'Frame':'Others', 'Truss':'Others', 'Movable':'Others', 'Suspension':'Others'}

def simpDesign(row):
    if row.Design in simp_design_dict:
        return simp_design_dict[row.Design]
    return row.Design 

#Apply the function
bridges_filtered_age = bridges_filtered_age.assign(SimpDes = bridges_filtered_age.apply(simpDesign, axis=1))
bridges_filtered_age = bridges_filtered_age.astype({'SimpDes':simp_design_type})

bridges_filtered_age.loc[:,['Design', 'SimpDes']].head()

Structure_id,Design,SimpDes
000021521-00101,Slab,Slab
000021521-00181,Slab,Slab
000021521-TMP20,Beam,Beam
000021525-00012,Beam,Beam
000021580-00092,Beam,Beam
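
The row-wise `apply` calls above work, but the same merge can also be written as a column-wise lookup. The sketch below is an alternative under stated assumptions, not the method used in this notebook: it uses a toy frame in place of `bridges_filtered_age` and `Series.map` with the same dictionary. Values missing from the dictionary become `NaN`.

```python
import pandas as pd

# Toy stand-in for bridges_filtered_age (values are illustrative only)
toy = pd.DataFrame({'Design': pd.Categorical(
    ['Beam', 'Slab', 'Truss', 'Movable', 'Arch'])})

simp_design_type = pd.CategoricalDtype(categories=['Beam', 'Slab', 'Arch', 'Others'])

simp_design_dict = {'Beam': 'Beam', 'Slab': 'Slab', 'Arch': 'Arch', 'Other': 'Others',
                    'Frame': 'Others', 'Truss': 'Others', 'Movable': 'Others',
                    'Suspension': 'Others'}

# Convert to plain objects first so the many-to-one mapping returns ordinary values,
# then cast the result to the simplified categorical type
toy['SimpDes'] = (toy['Design'].astype(object)
                  .map(simp_design_dict)
                  .astype(simp_design_type))
```

The same pattern would apply to `Material` with `simp_material_dict` and `simp_material_type`.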


#### 2.1.3 Deriving the `Current_condition` Variable
A new variable `Current_condition` is derived by combining the categorical ratings from three variables: `Deck_rating`, `Superstr_rating` and `Substr_rating`.

Each of the categorical values is converted into an integer score where:
- **0** represents a **Failed** condition,
- **1-9** represent increasing levels of condition quality (e.g., from **Failing** to **Excellent**).

In [53]:
# Define a function to map categorical ratings to integer values
def integer_score(row):
    rating_int = {'Failed': 0, 'Failing': 1, 'Critical': 2, 'Serious': 3, 'Poor': 4, 
                  'Fair': 5, 'Satisfactory': 6, 'Good': 7, 'Very Good': 8, 'Excellent': 9, 'NA': None}

    deck_score = rating_int.get(row['Deck_rating'], None)
    superstr_score = rating_int.get(row['Superstr_rating'], None)
    substr_score = rating_int.get(row['Substr_rating'], None)
    
    # Sum the three scores 
    if None not in (deck_score, superstr_score, substr_score):
        return deck_score + superstr_score + substr_score
    else:
        return None

# Apply the function
bridges_filtered_age['Current_condition'] = bridges_filtered_age.apply(integer_score, axis=1)

bridges_filtered_age[['SimpMat', 'SimpDes', 'Current_condition']].head()


Structure_id,SimpMat,SimpDes,Current_condition
000021521-00101,Concrete,Slab,23.0
000021521-00181,Concrete,Slab,22.0
000021521-TMP20,Concrete,Beam,24.0
000021525-00012,Timber,Beam,18.0
000021580-00092,Concrete,Beam,23.0
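
As a check on the arithmetic, the first bridge shown is rated Good (7), Very Good (8) and Very Good (8), giving 7 + 8 + 8 = 23. The sketch below reproduces the scoring column-wise on a toy frame (an illustrative alternative to the row-wise function, with made-up rows in place of `bridges_filtered_age`). It assumes that a total should be missing whenever any one of the three ratings is missing, which matches the behaviour of `integer_score`.

```python
import pandas as pd

rating_int = {'Failed': 0, 'Failing': 1, 'Critical': 2, 'Serious': 3, 'Poor': 4,
              'Fair': 5, 'Satisfactory': 6, 'Good': 7, 'Very Good': 8, 'Excellent': 9}

rating_cols = ['Deck_rating', 'Superstr_rating', 'Substr_rating']

# Toy rows (illustrative); the last row has a missing rating
toy = pd.DataFrame({
    'Deck_rating':     ['Good', 'Good', 'Fair'],
    'Superstr_rating': ['Very Good', 'Good', None],
    'Substr_rating':   ['Very Good', 'Poor', 'Good'],
})

# Map each rating column to its integer score; unmapped or missing values become NaN
scores = toy[rating_cols].apply(lambda col: col.astype(object).map(rating_int))

# skipna=False keeps the total missing when any of the three ratings is missing
toy['Current_condition'] = scores.sum(axis=1, skipna=False)
```

With three scores each between 0 and 9, `Current_condition` ranges from 0 to 27.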
