# Infrastructure Risk Analysis and Failure Prediction

## Project Applied Data Science
**Name:** Alexander Ruiz

**Objective:** The goal of this project is to analyze infrastructure condition data, identify risk profiles using clustering, and build a machine learning model to predict whether an infrastructure asset will fail within five years.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score

np.random.seed(42)

In [2]:
df = pd.read_csv("infrastructure_risk_dataset.csv")
df.head()

Unnamed: 0,asset_id,asset_type,material_type,age_years,crack_index,corrosion_level,vibration_rms,avg_daily_traffic,heavy_vehicle_pct,avg_temperature_f,humidity_pct,flood_risk_score,inspections_per_year,years_since_last_repair,strain_peak_microstrain,displacement_mm,failure_probability,failure_within_5yrs
0,A200000,Pipeline,Asphalt,37.2,0.577,0.449,2.778,15495.0,0.303,54.7,57.7,0.585,0.78,13.8,283.7,1.03,0.5,0
1,A200001,Building,Composite,39.0,0.509,0.573,3.362,23230.0,0.25,52.2,42.1,0.688,0.68,0.0,211.5,6.85,0.32,0
2,A200002,Building,Steel,50.7,0.501,0.239,2.413,31376.0,0.001,85.9,14.6,0.483,0.1,0.0,,8.01,0.294,0
3,A200003,Bridge,Asphalt,37.0,0.371,0.367,2.825,41336.0,0.0,64.4,79.8,0.312,,13.2,311.2,8.09,0.309,1
4,A200004,Road,Concrete,33.3,0.623,1.0,3.028,16300.0,0.062,65.6,46.0,0.435,1.98,0.0,204.4,14.64,0.542,0


In [3]:
df.shape

(25000, 18)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 18 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   asset_id                 25000 non-null  object 
 1   asset_type               25000 non-null  object 
 2   material_type            25000 non-null  object 
 3   age_years                25000 non-null  float64
 4   crack_index              23238 non-null  float64
 5   corrosion_level          23227 non-null  float64
 6   vibration_rms            23258 non-null  float64
 7   avg_daily_traffic        23277 non-null  float64
 8   heavy_vehicle_pct        25000 non-null  float64
 9   avg_temperature_f        25000 non-null  float64
 10  humidity_pct             23277 non-null  float64
 11  flood_risk_score         25000 non-null  float64
 12  inspections_per_year     23220 non-null  float64
 13  years_since_last_repair  25000 non-null  float64
 14  strain_peak_microstrai

In [5]:
(df.isna().mean() * 100).sort_values(ascending=False)

inspections_per_year       7.120
corrosion_level            7.092
crack_index                7.048
strain_peak_microstrain    7.036
vibration_rms              6.968
avg_daily_traffic          6.892
humidity_pct               6.892
asset_id                   0.000
failure_probability        0.000
displacement_mm            0.000
years_since_last_repair    0.000
avg_temperature_f          0.000
flood_risk_score           0.000
asset_type                 0.000
heavy_vehicle_pct          0.000
age_years                  0.000
material_type              0.000
failure_within_5yrs        0.000
dtype: float64

**Notes:** Seven columns have about 7% missing values each and the remaining columns are complete. Since the missing percentage is relatively small, imputation is reasonable instead of dropping rows.

In [6]:
df_clean = df.copy()

num_cols = df_clean.select_dtypes(include=["float64", "int64"]).columns
for col in num_cols:
    df_clean[col] = df_clean[col].fillna(df_clean[col].median())

**Notes:** I used median imputation because the data contains outliers. Unlike the mean, the median is not pulled toward extreme values, so the filled values stay near the center of each column.
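
To illustrate why the median is the safer fill value when outliers are present, here is a minimal sketch on a toy column. The values are made up for illustration and are not taken from the project dataset.

```python
import numpy as np
import pandas as pd

# Toy column with one extreme outlier and one missing value (illustrative only)
s = pd.Series([1.0, 2.0, 3.0, 4.0, 1000.0, np.nan])

# Mean imputation: the outlier drags the fill value far from typical values
mean_filled = s.fillna(s.mean())

# Median imputation: the fill value stays at the middle of the observed values
median_filled = s.fillna(s.median())

print(mean_filled.iloc[-1], median_filled.iloc[-1])  # 202.0 3.0
```

The same contrast applies column by column in the loop above: a handful of extreme readings would shift a mean-based fill but leave a median-based fill almost unchanged.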

In [7]:
df_clean.isna().sum()

asset_id                   0
asset_type                 0
material_type              0
age_years                  0
crack_index                0
corrosion_level            0
vibration_rms              0
avg_daily_traffic          0
heavy_vehicle_pct          0
avg_temperature_f          0
humidity_pct               0
flood_risk_score           0
inspections_per_year       0
years_since_last_repair    0
strain_peak_microstrain    0
displacement_mm            0
failure_probability        0
failure_within_5yrs        0
dtype: int64

**Notes:** At this point, no missing values remain. The only preprocessing step left is to encode the categorical columns (asset_type and material_type).

In [8]:
#Create a new dataframe with categorical variables encoded
df_encoded = pd.get_dummies(
    df_clean,                      # cleaned dataset
    columns=["asset_type", "material_type"],  # categorical columns to encode
    drop_first=True                # drop first category to avoid redundancy
)
df_encoded.head()

Unnamed: 0,asset_id,age_years,crack_index,corrosion_level,vibration_rms,avg_daily_traffic,heavy_vehicle_pct,avg_temperature_f,humidity_pct,flood_risk_score,...,displacement_mm,failure_probability,failure_within_5yrs,asset_type_Building,asset_type_Pipeline,asset_type_Power_System,asset_type_Road,material_type_Composite,material_type_Concrete,material_type_Steel
0,A200000,37.2,0.577,0.449,2.778,15495.0,0.303,54.7,57.7,0.585,...,1.03,0.5,0,0,1,0,0,0,0,0
1,A200001,39.0,0.509,0.573,3.362,23230.0,0.25,52.2,42.1,0.688,...,6.85,0.32,0,1,0,0,0,1,0,0
2,A200002,50.7,0.501,0.239,2.413,31376.0,0.001,85.9,14.6,0.483,...,8.01,0.294,0,1,0,0,0,0,0,1
3,A200003,37.0,0.371,0.367,2.825,41336.0,0.0,64.4,79.8,0.312,...,8.09,0.309,1,0,0,0,0,0,0,0
4,A200004,33.3,0.623,1.0,3.028,16300.0,0.062,65.6,46.0,0.435,...,14.64,0.542,0,0,0,0,1,0,1,0


In [9]:
#Select only the feature columns for clustering
#Remove IDs and target-related columns
X_cluster = df_encoded.drop(
    ["asset_id", "failure_probability", "failure_within_5yrs"],
    axis=1
)

**Notes:** Clustering should be based only on condition and environment features.

In [10]:
from sklearn.preprocessing import StandardScaler
#Create the scaler
scaler = StandardScaler()
#Fit the scaler and transform the data
X_scaled = scaler.fit_transform(X_cluster)

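**Notes:** Scaling is required because features have different units and ranges (for example, avg_daily_traffic is in the tens of thousands, while the crack_index values in the sample rows are below 1). KMeans uses Euclidean distance, so scaling prevents large values from dominating.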
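
The effect of scaling on distances can be sketched with three toy assets described by two features. The numbers below are illustrative assumptions, not rows from the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Three toy assets: [avg_daily_traffic, crack_index] (illustrative values only)
X = np.array([
    [15000.0, 0.10],
    [15100.0, 0.90],  # similar traffic, very different cracking
    [20000.0, 0.10],  # different traffic, identical cracking
])

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(a - b))

# In raw units the traffic difference swamps the crack difference
raw_01 = dist(X[0], X[1])
raw_02 = dist(X[0], X[2])

# After standardization both features contribute on a comparable scale
X_std = StandardScaler().fit_transform(X)
scaled_01 = dist(X_std[0], X_std[1])
scaled_02 = dist(X_std[0], X_std[2])

print(f"raw:    {raw_01:.1f} vs {raw_02:.1f}")
print(f"scaled: {scaled_01:.2f} vs {scaled_02:.2f}")
```

In raw units, asset 0 looks about fifty times closer to asset 1 than to asset 2, even though asset 1 has far more cracking. After scaling, the two distances are nearly equal, so the crack difference is no longer ignored.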

In [11]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
silhouette_scores = {}

#Try different numbers of clusters
for k in range(2, 9):
    kmeans = KMeans(n_clusters=k, random_state=42, n_init="auto")
    labels = kmeans.fit_predict(X_scaled)
    score = silhouette_score(X_scaled, labels)
    silhouette_scores[k] = score
silhouette_scores

{2: 0.07145857367319931,
 3: 0.08274590310329138,
 4: 0.08769134300248894,
 5: 0.09899532512935064,
 6: 0.08967948306626071,
 7: 0.07682788161234982,
 8: 0.08429520072947885}

In [12]:
#KMeans model using the chosen number of clusters
kmeans = KMeans(n_clusters=5, random_state=42, n_init="auto")
#Fit model and assign cluster
df_encoded["risk_cluster"] = kmeans.fit_predict(X_scaled)
df_encoded[["asset_id", "risk_cluster"]].head()

Unnamed: 0,asset_id,risk_cluster
0,A200000,3
1,A200001,4
2,A200002,1
3,A200003,3
4,A200004,3


In [13]:
df_encoded["risk_cluster"].value_counts()

3    8698
0    6116
1    4286
4    3347
2    2553
Name: risk_cluster, dtype: int64

**Notes:** 
- I tested values of K from 2 to 8 using silhouette score.

- The highest silhouette score occurred at K = 5.

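- This suggests that 5 clusters provide the best separation among the values tested.

- All silhouette scores are below 0.10, which indicates heavily overlapping clusters. Scores this low are common for noisy, real-world infrastructure data, so the groups should be read as loose tendencies rather than sharply separated segments.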

- Based on this, I selected K = 5 for the final clustering model.
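
To put silhouette values in context, the sketch below scores two synthetic datasets: one with compact, well-separated groups and one with no group structure at all. Both datasets and the helper name `kmeans_silhouette` are assumptions made for illustration and are unrelated to the project data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Reference 1: three compact, well-separated groups
X_blobs, _ = make_blobs(
    n_samples=600,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

# Reference 2: uniform noise with no underlying groups
X_noise = rng.uniform(-10, 10, size=(600, 2))

def kmeans_silhouette(X, k):
    """Fit KMeans with k clusters and return the mean silhouette score."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    return silhouette_score(X, labels)

score_blobs = kmeans_silhouette(X_blobs, 3)
score_noise = kmeans_silhouette(X_noise, 3)
print(f"blobs: {score_blobs:.2f}, noise: {score_noise:.2f}")
```

Silhouette scores range from -1 to 1. Values close to 1 indicate compact, well-separated clusters, while values close to 0 indicate clusters that overlap, which is the situation the scores in this notebook point to.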

In [14]:
#Calculate average values for key features
cluster_summary = df_encoded.groupby("risk_cluster")[[
    "age_years",
    "crack_index",
    "corrosion_level",
    "vibration_rms",
    "avg_daily_traffic",
    "flood_risk_score",
    "years_since_last_repair"
]].mean()
cluster_summary

risk_cluster,age_years,crack_index,corrosion_level,vibration_rms,avg_daily_traffic,flood_risk_score,years_since_last_repair
0,38.427224,0.500546,0.461954,3.303564,19100.868378,0.390796,7.234827
1,38.420462,0.496353,0.450626,3.368615,19310.833878,0.387868,7.1972
2,38.425147,0.500008,0.451937,3.341477,19517.354877,0.388878,6.977673
3,38.140573,0.506346,0.457239,3.333958,19103.453782,0.39309,7.195194
4,38.060711,0.503367,0.4501,3.312556,19085.707499,0.39165,7.200717


In [15]:
#Create mapping from cluster number to risk level
risk_map = {
    0: "Medium Risk",
    1: "Low Risk",
    2: "High Risk",
    3: "Medium-High Risk",
    4: "Medium Risk"
}
#Map risk labels
df_encoded["risk_level"] = df_encoded["risk_cluster"].map(risk_map)
df_encoded[["asset_id", "risk_cluster", "risk_level"]].head()

Unnamed: 0,asset_id,risk_cluster,risk_level
0,A200000,3,Medium-High Risk
1,A200001,4,Medium Risk
2,A200002,1,Low Risk
3,A200003,3,Medium-High Risk
4,A200004,3,Medium-High Risk


In [16]:
df_encoded["risk_level"].value_counts()

Medium Risk         9463
Medium-High Risk    8698
Low Risk            4286
High Risk           2553
Name: risk_level, dtype: int64

**Notes:** 
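- Most assets fall into the Medium (9,463 assets, about 38%) and Medium-High Risk (8,698 assets, about 35%) categories.

- High Risk is the smallest group (2,553 assets, about 10%), which makes sense since extreme risk conditions are less common.

- Low Risk assets make up about 17% of the dataset (4,286 assets).

- The cluster means in the summary table differ only slightly, so these risk labels are approximate.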
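
In the cluster summary above, the cluster means differ only slightly (crack_index, for example, ranges from about 0.496 to 0.506), which makes hand-assigned labels hard to justify. One possible alternative, sketched below, is to derive the ordering from a composite score. The per-cluster means, the choice of features, and the equal weighting are all hypothetical assumptions for illustration; this is not the mapping used in this notebook.

```python
import pandas as pd

# Hypothetical per-cluster means (illustrative values, not the project results)
cluster_means = pd.DataFrame(
    {
        "crack_index": [0.30, 0.55, 0.70],
        "corrosion_level": [0.25, 0.50, 0.65],
        "years_since_last_repair": [3.0, 7.0, 12.0],
    },
    index=pd.Index([0, 1, 2], name="risk_cluster"),
)

# Standardize each feature across clusters so no single unit dominates
z = (cluster_means - cluster_means.mean()) / cluster_means.std(ddof=0)

# Assumption: a higher value of every listed feature means higher risk,
# and all features are weighted equally
risk_score = z.mean(axis=1)

# Rank clusters from lowest to highest composite score and attach labels
ordered = risk_score.sort_values().index
labels = ["Low Risk", "Medium Risk", "High Risk"]
derived_risk_map = {int(c): label for c, label in zip(ordered, labels)}

print(derived_risk_map)
```

A rule like this makes the label assignment reproducible, and it exposes cases where the scores are too close together for the ordering to carry much meaning.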

**After analyzing risk clusters, the next step is to build a supervised model to predict whether an asset will fail within 5 years.**


In [17]:
from sklearn.model_selection import train_test_split

#Separate features and target
X = df_encoded.drop(
    ["asset_id", "failure_probability", "failure_within_5yrs"],
    axis=1
)
y = df_encoded["failure_within_5yrs"]

#Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.25,
    stratify=y,
    random_state=42
)

**Notes:** The dataset was split into features and the target variable, then divided into training and testing sets. Stratified splitting was used to keep the same proportion of failure cases in both sets.
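
The effect of `stratify` can be checked on toy labels. This sketch uses a made-up label vector with a 7% positive rate (close to the failure share in this project's test set, 439 of 6,250) rather than the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels with a 7% positive rate (illustrative, not the project data)
y_toy = np.array([1] * 70 + [0] * 930)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy,
    y_toy,
    test_size=0.25,
    stratify=y_toy,   # keep the positive rate the same in both splits
    random_state=42,
)

print(f"overall: {y_toy.mean():.3f}")
print(f"train:   {y_tr.mean():.3f}")
print(f"test:    {y_te.mean():.3f}")
```

Both splits end up with a positive rate within a fraction of a percentage point of the overall rate. Without stratification, a rare class can be noticeably over- or under-represented in the test set purely by chance.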

In [18]:
#Find any columns that are still text/object type
df_encoded.select_dtypes(include="object").columns

Index(['asset_id', 'risk_level'], dtype='object')

**Notes:** This output shows that asset_id and risk_level are still text columns and need to be removed before training the model.

In [19]:
#Create feature matrix by dropping non-numeric and target-related columns
X = df_encoded.drop(
    [
        "asset_id",              # identifier
        "risk_level",            # text label
        "failure_probability",   # avoid data leakage
        "failure_within_5yrs"    # target variable
    ],
    axis=1
)
#Target
y = df_encoded["failure_within_5yrs"]

**Notes:** Non-numeric and target-related columns were removed so the model only uses valid numeric features for training.

In [20]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.25,
    stratify=y,
    random_state=42
)

**Notes:** The data was split into training and testing sets using a 75/25 split, with stratification to preserve the failure ratio.

In [21]:
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(
    n_estimators=200,
    max_depth=10,
    random_state=42
)
rf_model.fit(X_train, y_train)

**Notes:** The Random Forest model was created and trained with the specified parameters (200 trees, maximum depth of 10).

In [22]:
#Predict class labels
y_pred = rf_model.predict(X_test)
#Predict probabilities for the failure class
y_prob = rf_model.predict_proba(X_test)[:, 1]

**Notes:** The model predicts failure labels using y_pred and estimates failure probabilities using y_prob, which are used for evaluation and threshold adjustment.

In [23]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.93      1.00      0.96      5811
           1       0.50      0.00      0.00       439

    accuracy                           0.93      6250
   macro avg       0.71      0.50      0.48      6250
weighted avg       0.90      0.93      0.90      6250



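**Notes:** The model performs well on non-failures but detects almost none of the rare failure cases (recall of 0.00 for class 1) due to class imbalance. The 0.93 accuracy is no better than always predicting non-failure, since 5,811 of the 6,250 test assets did not fail.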
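
One common way to counter class imbalance is the `class_weight="balanced"` option of `RandomForestClassifier`, which was not used in the cells above. The sketch below compares failure-class recall with and without it on synthetic data from `make_classification`. The data and the resulting numbers are illustrative only and say nothing about how the option would perform on this project's dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data with roughly 7% positives (not the project dataset)
X_syn, y_syn = make_classification(
    n_samples=4000,
    n_features=10,
    n_informative=4,
    weights=[0.93, 0.07],
    flip_y=0.02,
    random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.25, stratify=y_syn, random_state=42
)

recalls = {}
for name, class_weight in [("unweighted", None), ("balanced", "balanced")]:
    model = RandomForestClassifier(
        n_estimators=100,
        max_depth=10,
        class_weight=class_weight,  # "balanced" reweights classes inversely to frequency
        random_state=42,
    )
    model.fit(X_tr, y_tr)
    recalls[name] = recall_score(y_te, model.predict(X_te))

print(recalls)
```

Class weighting usually trades some precision for recall, so any gain should be checked against the false-positive cost that is acceptable for inspection planning.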

In [24]:
from sklearn.metrics import roc_auc_score
roc_auc = roc_auc_score(y_test, y_prob)
print("ROC AUC:", roc_auc)

ROC AUC: 0.6278854532817932


**Notes:** A ROC AUC of about 0.63 (where 0.5 corresponds to random ranking and 1.0 to perfect ranking) shows the model has some ability to separate failures from non-failures, but the signal available in these features is weak.
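
ROC AUC can be read as the probability that a randomly chosen failure receives a higher score than a randomly chosen non-failure. The sketch below verifies this on four made-up scores by counting correctly ranked pairs and comparing the result with `roc_auc_score`.

```python
from itertools import product

from sklearn.metrics import roc_auc_score

# Four toy predictions (illustrative values only)
y_true = [0, 0, 1, 1]
y_score = [0.10, 0.40, 0.35, 0.80]

pos = [s for s, t in zip(y_score, y_true) if t == 1]
neg = [s for s, t in zip(y_score, y_true) if t == 0]

# Fraction of (positive, negative) pairs ranked correctly; ties count as half
wins = sum(
    1.0 if p > n else 0.5 if p == n else 0.0
    for p, n in product(pos, neg)
)
auc_by_pairs = wins / (len(pos) * len(neg))
auc_sklearn = roc_auc_score(y_true, y_score)

print(auc_by_pairs, auc_sklearn)  # 0.75 0.75
```

Read this way, an AUC of about 0.63 means the model ranks a failing asset above a non-failing one in roughly 63% of such pairs.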

In [25]:
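#Lower the decision threshold from the default 0.50 to 0.30
threshold = 0.30
#Convert predicted probabilities to class labels using the new threshold
y_pred_thresh = (y_prob >= threshold).astype(int)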

**Notes:** Lowering the threshold allows the model to predict failures more often, which helps detect rare failure events but may reduce precision.
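
Rather than testing a single cutoff, the trade-off can be inspected across several thresholds. The sketch below uses synthetic labels and scores as stand-ins for `y_test` and `y_prob`. The score distribution is an assumption chosen only to make the example run, so the printed numbers do not describe the project model.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(0)

# Synthetic stand-ins for y_test and y_prob (about 7% positives)
y_true = (rng.random(2000) < 0.07).astype(int)
y_score = np.clip(0.10 + 0.15 * y_true + rng.normal(0, 0.10, size=2000), 0, 1)

rows = []
for threshold in [0.50, 0.40, 0.30, 0.20, 0.10]:
    y_hat = (y_score >= threshold).astype(int)
    precision = precision_score(y_true, y_hat, zero_division=0)
    recall = recall_score(y_true, y_hat)
    rows.append((threshold, precision, recall))

for threshold, precision, recall in rows:
    print(f"threshold={threshold:.2f}  precision={precision:.2f}  recall={recall:.2f}")
```

Recall can only stay the same or increase as the threshold is lowered, because every asset flagged at a higher threshold is still flagged at a lower one. Precision has no such guarantee, which is why both need to be tracked together.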

In [26]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred_thresh))

              precision    recall  f1-score   support

           0       0.93      1.00      0.96      5811
           1       0.43      0.03      0.06       439

    accuracy                           0.93      6250
   macro avg       0.68      0.52      0.51      6250
weighted avg       0.90      0.93      0.90      6250



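**Notes:** With the threshold at 0.30, recall for the failure class rises only from 0.00 to 0.03, while precision drops from 0.50 to 0.43. The model still predicts non-failures accurately but has difficulty detecting rare failure cases because failures are underrepresented in the data.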

In [27]:
from sklearn.metrics import confusion_matrix
import pandas as pd

cm_thresh = confusion_matrix(y_test, y_pred_thresh)
pd.DataFrame(
    cm_thresh,
    index=["Actual No Failure", "Actual Failure"],
    columns=["Pred No Failure", "Pred Failure"]
)

Unnamed: 0,Pred No Failure,Pred Failure
Actual No Failure,5791,20
Actual Failure,424,15


**Notes:** The confusion matrix shows that the model predicts non-failure cases very well, with only 20 false positives out of 5,811 non-failures. However, it misses 424 of the 439 actual failures, resulting in a high number of false negatives. This is expected because failure events are rare and the features for failure and non-failure cases often overlap.
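
The metrics in the thresholded classification report follow directly from the four counts in the confusion matrix. As a quick arithmetic check:

```python
# Counts read from the confusion matrix at threshold 0.30
tn, fp, fn, tp = 5791, 20, 424, 15

precision = tp / (tp + fp)                  # 15 / 35
recall = tp / (tp + fn)                     # 15 / 439
f1 = 2 * tp / (2 * tp + fp + fn)            # 30 / 474
accuracy = (tp + tn) / (tp + tn + fp + fn)  # 5806 / 6250

print(round(precision, 2), round(recall, 2), round(f1, 2), round(accuracy, 2))
# 0.43 0.03 0.06 0.93
```

These match the class 1 row and the accuracy line of the report, and they show that the high accuracy comes almost entirely from the 5,791 true negatives.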

## Conclusion and Key Takeaways
In this project, I analyzed a realistic infrastructure dataset containing asset condition, environmental, and inspection-related features. The dataset included missing values and noise, so preprocessing was required before modeling. Missing values were handled using median imputation, and categorical variables were encoded to prepare the data for machine learning. Clustering was then used to explore infrastructure risk profiles, which grouped assets into general categories such as low, medium, and high risk. Although the clusters overlapped, this reflects real-world infrastructure systems where asset conditions exist on a continuum.

A Random Forest model was trained to predict whether an asset would fail within five years. The model performed well at identifying non-failure cases but struggled to detect rare failure events due to class imbalance and overlapping feature patterns. Lowering the decision threshold to 0.30 slightly improved failure detection (recall rose from 0.00 to 0.03), but recall remained limited. Class weighting was not applied in this notebook and remains an option to test. Overall, this project demonstrates the challenges of predicting rare infrastructure failures using static condition data and highlights the importance of additional data, such as time-based sensor readings or maintenance history, for improved prediction accuracy.