# column description

segment_id → Unique ID for each road/catchment segment.

city_name → City where the segment is located.

admin_ward → Administrative ward/sector of the city.

latitude, longitude → Geographic coordinates of the segment.

catchment_id → ID for the catchment area (drainage basin).

elevation_m → Elevation above sea level (in meters).

dem_source → Source of elevation data (e.g., Copernicus, SRTM).

land_use → Type of land use (Residential, Industrial, Institutional, etc.).

soil_group → Soil classification (A, B, C, D) affecting infiltration.

drainage_density_km_per_km2 → Drainage network density in the area.

storm_drain_proximity_m → Distance to the nearest storm drain (meters).

storm_drain_type → Type of drainage (Open Channel, Curb Inlet, Manhole, etc.).

rainfall_source → Source of rainfall data (ERA5, IMD, etc.).

historical_rainfall_intensity_mm_hr → Past rainfall intensity (mm/hr).

return_period_years → Flood return period in years (e.g., 2, 5, 10, 25, 50, 100).

risk_labels → Pipe-separated flood risk labels (e.g., monitor, ponding_hotspot, low_lying), optionally ending with an event date tag such as event_2025-05-02.

In [106]:
import pandas as pd
import seaborn as sns


In [107]:
df = pd.read_csv('urban_pluvial_flood_risk_dataset.csv')
df

Unnamed: 0,segment_id,city_name,admin_ward,latitude,longitude,catchment_id,elevation_m,dem_source,land_use,soil_group,drainage_density_km_per_km2,storm_drain_proximity_m,storm_drain_type,rainfall_source,historical_rainfall_intensity_mm_hr,return_period_years,risk_labels
0,SEG-00001,"Colombo, Sri Lanka",Borough East,6.920633,79.912600,CAT-136,,Copernicus_EEA-10_v5,Institutional,,4.27,160.5,CurbInlet,ERA5,39.4,50,monitor
1,SEG-00002,"Chennai, India",Ward D,13.076487,80.281774,CAT-049,-2.19,Copernicus_EEA-10_v5,Residential,D,7.54,,OpenChannel,ERA5,56.8,25,ponding_hotspot|low_lying|event_2025-05-02
2,SEG-00003,"Ahmedabad, India",Sector 12,23.019473,72.638578,CAT-023,30.88,SRTM_3arc,Industrial,B,11.00,152.5,OpenChannel,IMD,16.3,5,monitor
3,SEG-00004,"Hong Kong, China",Sector 14,22.302602,114.078673,CAT-168,24.28,SRTM_3arc,Residential,B,7.32,37.0,Manhole,ERA5,77.0,10,monitor
4,SEG-00005,"Durban, South Africa",Sector 5,-29.887602,30.911008,CAT-171,35.70,SRTM_3arc,Industrial,C,4.50,292.4,OpenChannel,ERA5,20.8,5,monitor
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2958,SEG-02959,"Paris, France",Ward B,48.872870,2.246250,CAT-036,30.46,Copernicus_GLO-30_v2023,Residential,C,,78.8,GratedInlet,,46.5,10,monitor
2959,SEG-02960,"Shanghai, China",Sector 17,31.195529,121.435540,CAT-050,-3.00,SRTM_3arc,Industrial,C,5.54,4.9,OpenChannel,LocalGauge,70.6,5,ponding_hotspot|low_lying|event_2024-07-26
2960,SEG-02961,"Vancouver, Canada",Sector 12,49.162783,-123.037084,CAT-052,14.16,Copernicus_GLO-90_v2022,Green,B,,236.1,GratedInlet,,27.7,2,monitor
2961,SEG-02962,"Lagos, Nigeria",Zone V,6.504570,3.388571,CAT-092,7.00,Copernicus_GLO-30_v2023,Industrial,B,8.74,294.8,OpenChannel,ERA5,131.6,100,extreme_rain_history|low_lying


In [108]:
# sns.boxplot(df['elevation_m']) - Too many outliers, replace nulls with median
# df['soil_group'] (categorical column) - replace nulls with mode
# sns.boxplot(df['drainage_density_km_per_km2']) - No outliers, replace nulls with mean
# sns.boxplot(df['storm_drain_proximity_m']) - Too many outliers, replace nulls with median
# df['storm_drain_type'] (categorical column) - replace nulls with mode
# df['rainfall_source'] (categorical column) - replace nulls with mode
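
The median-versus-mean choice above was made by eye from boxplots. A minimal sketch of the same check done numerically, assuming the usual 1.5 * IQR whisker rule; the helper name `iqr_outlier_count` and the toy values are hypothetical and not taken from the dataset:

```python
import pandas as pd


def iqr_outlier_count(series: pd.Series) -> int:
    """Count values outside the 1.5 * IQR whiskers a boxplot would draw."""
    s = series.dropna()
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return int(((s < lower) | (s > upper)).sum())


# Toy column with one missing value and one extreme value (illustrative only)
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, None, 100.0])

n_out = iqr_outlier_count(toy)
# Median is robust to outliers; mean is fine when none are present
fill_value = toy.median() if n_out > 0 else toy.mean()
filled = toy.fillna(fill_value)
```

The same function could be run on each numeric column of `df` before choosing its imputation value.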

In [109]:
df.duplicated().any()

np.False_

In [110]:
# Imputation values: median for columns with many outliers, mean where none were seen, mode for categoricals
fill_values = {
    'elevation_m': df['elevation_m'].median(),
    'soil_group': df['soil_group'].mode()[0],
    'drainage_density_km_per_km2': df['drainage_density_km_per_km2'].mean(),
    'storm_drain_proximity_m': df['storm_drain_proximity_m'].median(),
    'storm_drain_type': df['storm_drain_type'].mode()[0],
    'rainfall_source': df['rainfall_source'].mode()[0],
}

In [111]:
# Fill each column's missing values with its own imputation value
df = df.fillna(fill_values)

In [112]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2963 entries, 0 to 2962
Data columns (total 17 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   segment_id                           2963 non-null   object 
 1   city_name                            2963 non-null   object 
 2   admin_ward                           2963 non-null   object 
 3   latitude                             2963 non-null   float64
 4   longitude                            2963 non-null   float64
 5   catchment_id                         2963 non-null   object 
 6   elevation_m                          2963 non-null   float64
 7   dem_source                           2963 non-null   object 
 8   land_use                             2963 non-null   object 
 9   soil_group                           2963 non-null   object 
 10  drainage_density_km_per_km2          2963 non-null   float64
 11  storm_drain_proximity_m       

In [113]:
df['risk_labels'].unique()

array(['monitor', 'ponding_hotspot|low_lying|event_2025-05-02',
       'extreme_rain_history', 'low_lying|event_2024-02-14', 'low_lying',
       'ponding_hotspot|extreme_rain_history|low_lying|event_2024-09-25',
       'monitor|event_2022-09-04',
       'ponding_hotspot|low_lying|event_2024-07-31',
       'ponding_hotspot|extreme_rain_history|event_2022-10-11',
       'low_lying|sparse_drainage',
       'ponding_hotspot|low_lying|event_2024-07-21',
       'ponding_hotspot|extreme_rain_history|low_lying|event_2023-05-16',
       'sparse_drainage', 'sparse_drainage|event_2024-05-19',
       'ponding_hotspot|extreme_rain_history|low_lying|event_2023-09-13',
       'low_lying|sparse_drainage|event_2023-07-20',
       'extreme_rain_history|low_lying', 'ponding_hotspot|low_lying',
       'ponding_hotspot|low_lying|event_2024-08-16',
       'ponding_hotspot|low_lying|event_2024-03-31',
       'low_lying|event_2024-03-24',
       'ponding_hotspot|low_lying|event_2022-12-18',
       'low_lying|

In [114]:
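# Split the pipe-separated labels into a list per row
df['risk_labels'] = df['risk_labels'].str.split('|')

def f1(d):
    # Route each label to its own column: event tag, low_lying marker, or remaining risk labels
    risk = []
    for i in d['risk_labels']:
        if 'event' in i:
            d['event'] = i
        elif 'lying' in i:
            d['level'] = i
        else:
            risk.append(i)
            d['risks'] = '|'.join(risk)
    return d

# Rows with no remaining risk label (e.g. only low_lying) end up with NaN in 'risks'
df = df.apply(f1, axis=1)
df['risks'].unique()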


array(['monitor', 'ponding_hotspot', 'extreme_rain_history', nan,
       'ponding_hotspot|extreme_rain_history', 'sparse_drainage',
       'extreme_rain_history|sparse_drainage',
       'ponding_hotspot|extreme_rain_history|sparse_drainage',
       'ponding_hotspot|sparse_drainage'], dtype=object)
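
The row-wise `apply` above is slow on large frames and reorders the columns. A possible vectorized sketch of the same split is shown below on a hypothetical toy frame; it assumes event tags always look like `event_YYYY-MM-DD` and that `low_lying` is the only "level" label, which is slightly stricter than the substring checks in `f1`:

```python
import pandas as pd

# Toy labels in the same pipe-separated format (illustrative only)
toy = pd.DataFrame({"risk_labels": [
    "monitor",
    "ponding_hotspot|low_lying|event_2025-05-02",
    "low_lying",
    "extreme_rain_history|sparse_drainage",
]})

labels = toy["risk_labels"]

# Event tag such as "event_2025-05-02"; NaN when the row has none
toy["event"] = labels.str.extract(r"(event_\d{4}-\d{2}-\d{2})", expand=False)

# low_lying marker in its own column; NaN when absent
toy["level"] = labels.str.extract(r"(low_lying)", expand=False)

# Everything else, re-joined with "|"
risks = labels.str.split("|").apply(
    lambda xs: "|".join(
        x for x in xs if not x.startswith("event_") and x != "low_lying"
    )
)
# Rows left with no risk label become NaN, matching the output of f1
toy["risks"] = risks.where(risks != "")
```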

In [115]:
# df['risk_labels'] = df['risk_labels'].str.split('|')
# df['risk'] = [i[0] for i in df['risk_labels']]
# df['event'] = [i[1] if len(i)>1 else None for i in df['risk_labels']]
# df['level'] = [i[1] if len(i)>2 else None for i in df['risk_labels']]
# df = df.drop(columns="risk_labels")

# Analysis

5. ------------------------------------------------Location Insights------------------------------------------------

Which city has the highest number of flood hotspots?

Within a city, which wards are most affected?
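
One way these two questions could be answered is sketched below. It assumes a "hotspot" means any segment whose `risks` value contains `ponding_hotspot`; the toy city and ward names are hypothetical:

```python
import pandas as pd

# Toy frame with the same column names as the dataset (illustrative only)
toy = pd.DataFrame({
    "city_name": ["City A", "City A", "City A", "City B", "City B"],
    "admin_ward": ["Ward 1", "Ward 1", "Ward 2", "Ward 1", "Ward 2"],
    "risks": [
        "ponding_hotspot",
        "ponding_hotspot|sparse_drainage",
        "monitor",
        "ponding_hotspot",
        None,
    ],
})

# True where the segment carries the ponding_hotspot label; missing values count as False
is_hotspot = toy["risks"].str.contains("ponding_hotspot", na=False)

# Hotspot count per city, largest first
hotspots_by_city = toy.loc[is_hotspot, "city_name"].value_counts()
top_city = hotspots_by_city.idxmax()

# Hotspot count per ward within each city
hotspots_by_ward = (
    toy[is_hotspot]
    .groupby(["city_name", "admin_ward"])
    .size()
    .sort_values(ascending=False)
)
```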

6. ------------------------------------------------Combined Factors & Trends------------------------------------------------

Which 2–3 factors combined (e.g., low elevation + residential + high rainfall) appear most often in risky areas?

What patterns or trends can be seen in the dataset overall?
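
A rough sketch of how factor combinations could be counted among flagged segments. The `elevation_band` column is a hypothetical helper that would be derived from `elevation_m`, and the toy rows are illustrative only:

```python
import pandas as pd

# Toy frame; elevation_band is an assumed derived column, not in the dataset
toy = pd.DataFrame({
    "land_use": ["Residential", "Residential", "Industrial", "Residential", "Industrial"],
    "soil_group": ["D", "D", "B", "D", "B"],
    "elevation_band": ["below 25 m", "below 25 m", "25 m and above", "below 25 m", "25 m and above"],
    "risks": ["ponding_hotspot", "ponding_hotspot", "monitor", "extreme_rain_history", "sparse_drainage"],
})

# Keep only segments with something other than the plain monitor label
flagged = toy[toy["risks"].ne("monitor")]

# Count how often each three-factor combination appears, most frequent first
combos = (
    flagged
    .groupby(["land_use", "soil_group", "elevation_band"])
    .size()
    .sort_values(ascending=False)
)
top_combo = combos.index[0]
```

Raw counts favour combinations that are common in the dataset overall, so dividing by the count of each combination across all segments would give a fairer comparison.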

7. ------------------------------------------------Decision & Action------------------------------------------------

Which areas (or regions) are at highest risk of flooding?

Which locations must be prioritized for flood prevention measures?

Which areas are safest for new houses or infrastructure?

Which cities or wards need early flood warning systems immediately?

Should resources (funds, manpower) be focused on high rainfall zones, clay soil regions, or urbanized areas?

What is the top recommendation for reducing flood risk in this region?

1. ----------------------------------------------------Dataset Basics-------------------------------------------------

How many total records are there in the dataset?

What are the different risk_labels?

How many segments fall under each risk label?

In [116]:
len(df)

2963

Observation - Total records = 2963

In [117]:
df['risks'].unique()

array(['monitor', 'ponding_hotspot', 'extreme_rain_history', nan,
       'ponding_hotspot|extreme_rain_history', 'sparse_drainage',
       'extreme_rain_history|sparse_drainage',
       'ponding_hotspot|extreme_rain_history|sparse_drainage',
       'ponding_hotspot|sparse_drainage'], dtype=object)

In [118]:
df.groupby('risks')['segment_id'].count()

risks
extreme_rain_history                                     162
extreme_rain_history|sparse_drainage                      11
monitor                                                 1994
ponding_hotspot                                          134
ponding_hotspot|extreme_rain_history                      75
ponding_hotspot|extreme_rain_history|sparse_drainage       6
ponding_hotspot|sparse_drainage                            7
sparse_drainage                                          157
Name: segment_id, dtype: int64

2. -------------------------------------------------Geography & Environment----------------------------------------------

Do low elevation areas have more flood risk than high elevation areas?

Which soil type (A–D) is most common in risky areas?

Which land use type (Residential, Industrial, Institutional) shows the highest flood risk?

In [119]:
df

Unnamed: 0,admin_ward,catchment_id,city_name,dem_source,drainage_density_km_per_km2,elevation_m,event,historical_rainfall_intensity_mm_hr,land_use,latitude,level,longitude,rainfall_source,return_period_years,risk_labels,risks,segment_id,soil_group,storm_drain_proximity_m,storm_drain_type
0,Borough East,CAT-136,"Colombo, Sri Lanka",Copernicus_EEA-10_v5,4.270000,25.13,,39.4,Institutional,6.920633,,79.912600,ERA5,50,[monitor],monitor,SEG-00001,B,160.5,CurbInlet
1,Ward D,CAT-049,"Chennai, India",Copernicus_EEA-10_v5,7.540000,-2.19,event_2025-05-02,56.8,Residential,13.076487,low_lying,80.281774,ERA5,25,"[ponding_hotspot, low_lying, event_2025-05-02]",ponding_hotspot,SEG-00002,D,91.7,OpenChannel
2,Sector 12,CAT-023,"Ahmedabad, India",SRTM_3arc,11.000000,30.88,,16.3,Industrial,23.019473,,72.638578,IMD,5,[monitor],monitor,SEG-00003,B,152.5,OpenChannel
3,Sector 14,CAT-168,"Hong Kong, China",SRTM_3arc,7.320000,24.28,,77.0,Residential,22.302602,,114.078673,ERA5,10,[monitor],monitor,SEG-00004,B,37.0,Manhole
4,Sector 5,CAT-171,"Durban, South Africa",SRTM_3arc,4.500000,35.70,,20.8,Industrial,-29.887602,,30.911008,ERA5,5,[monitor],monitor,SEG-00005,C,292.4,OpenChannel
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2958,Ward B,CAT-036,"Paris, France",Copernicus_GLO-30_v2023,6.290866,30.46,,46.5,Residential,48.872870,,2.246250,ERA5,10,[monitor],monitor,SEG-02959,C,78.8,GratedInlet
2959,Sector 17,CAT-050,"Shanghai, China",SRTM_3arc,5.540000,-3.00,event_2024-07-26,70.6,Industrial,31.195529,low_lying,121.435540,LocalGauge,5,"[ponding_hotspot, low_lying, event_2024-07-26]",ponding_hotspot,SEG-02960,C,4.9,OpenChannel
2960,Sector 12,CAT-052,"Vancouver, Canada",Copernicus_GLO-90_v2022,6.290866,14.16,,27.7,Green,49.162783,,-123.037084,ERA5,2,[monitor],monitor,SEG-02961,B,236.1,GratedInlet
2961,Zone V,CAT-092,"Lagos, Nigeria",Copernicus_GLO-30_v2023,8.740000,7.00,,131.6,Industrial,6.504570,low_lying,3.388571,ERA5,100,"[extreme_rain_history, low_lying]",extreme_rain_history,SEG-02962,B,294.8,OpenChannel


In [120]:
# Segments above 25 m (a row at exactly 25 m falls outside both this filter and the < 25 filter)
high_level = df.loc[df['elevation_m'] > 25, ['risks']]
high_level['risks'].value_counts()

risks
monitor                                 1400
sparse_drainage                           93
extreme_rain_history                      63
extreme_rain_history|sparse_drainage       5
ponding_hotspot|extreme_rain_history       4
Name: count, dtype: int64

In [121]:
# Segments below 25 m
low_level = df.loc[df['elevation_m'] < 25, ['risks']]
low_level['risks'].value_counts()

risks
monitor                                                 594
ponding_hotspot                                         134
extreme_rain_history                                     99
ponding_hotspot|extreme_rain_history                     71
sparse_drainage                                          64
ponding_hotspot|sparse_drainage                           7
extreme_rain_history|sparse_drainage                      6
ponding_hotspot|extreme_rain_history|sparse_drainage      6
Name: count, dtype: int64

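Observation - Higher-elevation segments are flagged less often: above 25 m, 1400 of 1565 labelled segments (about 89%) are monitor-only and none is a standalone ponding_hotspot, while below 25 m only 594 of 981 (about 61%) are monitor-only.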
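
Comparing shares instead of raw counts makes the two elevation groups directly comparable. A minimal sketch on hypothetical toy values, assuming "flagged" means any `risks` value other than `monitor` and keeping the 25 m cut-off used above:

```python
import pandas as pd

# Toy elevations and labels (illustrative only)
toy = pd.DataFrame({
    "elevation_m": [-2.0, 10.0, 24.0, 30.0, 40.0, 55.0],
    "risks": [
        "ponding_hotspot", "monitor", "extreme_rain_history",
        "monitor", "monitor", "sparse_drainage",
    ],
})

# True for anything other than the plain monitor label.
# Note: a missing risks value would also count as flagged here.
toy["flagged"] = toy["risks"].ne("monitor")

# Two bands that together cover every row, including exactly 25 m
toy["band"] = toy["elevation_m"].ge(25).map(
    {True: "25 m and above", False: "below 25 m"}
)

# Share of flagged segments in each band
flagged_share = toy.groupby("band")["flagged"].mean()
```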

In [122]:
df.loc[df['risks']!="monitor",['soil_group']].value_counts()

soil_group
B             380
C             213
D             199
A             177
Name: count, dtype: int64

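Observation - B is the most common soil group among non-monitor segments (380 of 969). Missing soil_group values were filled with the mode, B, so this count is partly inflated by imputation.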

In [123]:
df.loc[df['risks']!= 'monitor',['land_use']].value_counts()

land_use     
Residential      292
Roads            182
Green            148
Commercial       136
Industrial       100
Mixed             38
Institutional     34
Water             33
Informal           6
Name: count, dtype: int64
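
The counts above partly reflect how common each land use is in the dataset. A sketch of the per-category flagged share, which would answer "highest flood risk" more directly; the toy rows are hypothetical and "flagged" is assumed to mean any `risks` value other than `monitor`:

```python
import pandas as pd

# Toy frame: a common category and a rare one (illustrative only)
toy = pd.DataFrame({
    "land_use": ["Residential"] * 4 + ["Informal"] * 2,
    "risks": [
        "monitor", "monitor", "monitor", "ponding_hotspot",
        "sparse_drainage", "monitor",
    ],
})

# 1 for flagged segments, 0 for monitor-only
flagged = toy["risks"].ne("monitor").astype(int)

# Per land use: flagged count, total segments, and flagged share
summary = flagged.groupby(toy["land_use"]).agg(
    flagged_count="sum",
    total="count",
    flagged_share="mean",
).sort_values("flagged_share", ascending=False)
```

In this toy example both categories have one flagged segment, but the rarer category has the higher share.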

3. ------------------------------------------------Infrastructure------------------------------------------------

Are places closer to drains safer compared to those far away?

Which drain type (Open Channel, Manhole, Curb Inlet) is seen more in risky areas?

In [124]:
df.loc[df['risks']!='monitor'].groupby('storm_drain_type')['storm_drain_type'].count()

storm_drain_type
CurbInlet      326
GratedInlet    203
Manhole        252
OpenChannel    188
Name: storm_drain_type, dtype: int64

In [125]:
df.loc[df['risks']!='monitor',['storm_drain_type']].value_counts()

storm_drain_type
CurbInlet           326
Manhole             252
GratedInlet         203
OpenChannel         188
Name: count, dtype: int64
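
The drain proximity question is not covered by the cells above. One possible approach is sketched here on hypothetical toy values; the distance bands (50 m and 150 m) are arbitrary assumptions chosen for illustration:

```python
import pandas as pd

# Toy distances and labels (illustrative only)
toy = pd.DataFrame({
    "storm_drain_proximity_m": [5.0, 20.0, 60.0, 120.0, 250.0, 300.0],
    "risks": [
        "monitor", "monitor", "monitor",
        "sparse_drainage", "monitor", "sparse_drainage",
    ],
})

toy["flagged"] = toy["risks"].ne("monitor")
toy["status"] = toy["risks"].eq("monitor").map({True: "monitor", False: "flagged"})

# Assumed distance bands; adjust the edges to suit the data
edges = [0, 50, 150, float("inf")]
names = ["0-50 m", "50-150 m", "over 150 m"]
toy["distance_band"] = pd.cut(
    toy["storm_drain_proximity_m"], bins=edges, labels=names, include_lowest=True
)

# Share of flagged segments in each distance band
share_by_band = toy.groupby("distance_band", observed=True)["flagged"].mean()

# Median distance to a drain for flagged versus monitor-only segments
median_by_status = toy.groupby("status")["storm_drain_proximity_m"].median()
```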

In [126]:
df

Unnamed: 0,admin_ward,catchment_id,city_name,dem_source,drainage_density_km_per_km2,elevation_m,event,historical_rainfall_intensity_mm_hr,land_use,latitude,level,longitude,rainfall_source,return_period_years,risk_labels,risks,segment_id,soil_group,storm_drain_proximity_m,storm_drain_type
0,Borough East,CAT-136,"Colombo, Sri Lanka",Copernicus_EEA-10_v5,4.270000,25.13,,39.4,Institutional,6.920633,,79.912600,ERA5,50,[monitor],monitor,SEG-00001,B,160.5,CurbInlet
1,Ward D,CAT-049,"Chennai, India",Copernicus_EEA-10_v5,7.540000,-2.19,event_2025-05-02,56.8,Residential,13.076487,low_lying,80.281774,ERA5,25,"[ponding_hotspot, low_lying, event_2025-05-02]",ponding_hotspot,SEG-00002,D,91.7,OpenChannel
2,Sector 12,CAT-023,"Ahmedabad, India",SRTM_3arc,11.000000,30.88,,16.3,Industrial,23.019473,,72.638578,IMD,5,[monitor],monitor,SEG-00003,B,152.5,OpenChannel
3,Sector 14,CAT-168,"Hong Kong, China",SRTM_3arc,7.320000,24.28,,77.0,Residential,22.302602,,114.078673,ERA5,10,[monitor],monitor,SEG-00004,B,37.0,Manhole
4,Sector 5,CAT-171,"Durban, South Africa",SRTM_3arc,4.500000,35.70,,20.8,Industrial,-29.887602,,30.911008,ERA5,5,[monitor],monitor,SEG-00005,C,292.4,OpenChannel
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2958,Ward B,CAT-036,"Paris, France",Copernicus_GLO-30_v2023,6.290866,30.46,,46.5,Residential,48.872870,,2.246250,ERA5,10,[monitor],monitor,SEG-02959,C,78.8,GratedInlet
2959,Sector 17,CAT-050,"Shanghai, China",SRTM_3arc,5.540000,-3.00,event_2024-07-26,70.6,Industrial,31.195529,low_lying,121.435540,LocalGauge,5,"[ponding_hotspot, low_lying, event_2024-07-26]",ponding_hotspot,SEG-02960,C,4.9,OpenChannel
2960,Sector 12,CAT-052,"Vancouver, Canada",Copernicus_GLO-90_v2022,6.290866,14.16,,27.7,Green,49.162783,,-123.037084,ERA5,2,[monitor],monitor,SEG-02961,B,236.1,GratedInlet
2961,Zone V,CAT-092,"Lagos, Nigeria",Copernicus_GLO-30_v2023,8.740000,7.00,,131.6,Industrial,6.504570,low_lying,3.388571,ERA5,100,"[extreme_rain_history, low_lying]",extreme_rain_history,SEG-02962,B,294.8,OpenChannel


4. ------------------------------------------------Rainfall------------------------------------------------

Do areas with higher rainfall intensity face more flooding?

Does flood risk increase with longer return periods (5 vs 25 vs 50 years)?

In [130]:
df.loc[df['historical_rainfall_intensity_mm_hr']>43,['risks']].value_counts()

risks                                               
monitor                                                 658
extreme_rain_history                                    162
ponding_hotspot                                          81
ponding_hotspot|extreme_rain_history                     75
sparse_drainage                                          53
extreme_rain_history|sparse_drainage                     11
ponding_hotspot|extreme_rain_history|sparse_drainage      6
ponding_hotspot|sparse_drainage                           4
Name: count, dtype: int64

In [128]:
df.loc[df['historical_rainfall_intensity_mm_hr']<43,["risks"]].value_counts()

risks                          
monitor                            1335
sparse_drainage                     104
ponding_hotspot                      53
ponding_hotspot|sparse_drainage       3
Name: count, dtype: int64

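Observation - Segments with rainfall intensity above 43 mm/hr are flagged more often: 658 of 1050 (about 63%) are monitor-only, against 1335 of 1495 (about 89%) below 43 mm/hr. All extreme_rain_history labels fall in the higher-intensity group.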

In [None]:
# Count of each risk label per return period
df.groupby('return_period_years')['risks'].value_counts()
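
A sketch of how the return period question could be examined with a cross-tabulation, shown on hypothetical toy rows. It assumes "flagged" means any `risks` value other than `monitor`:

```python
import pandas as pd

# Toy frame with two return periods (illustrative only)
toy = pd.DataFrame({
    "return_period_years": [5, 5, 5, 5, 50, 50],
    "risks": [
        "monitor", "monitor", "monitor", "ponding_hotspot",
        "extreme_rain_history", "monitor",
    ],
})

# Rows: return period; columns: risk label; cells: segment counts
counts = pd.crosstab(toy["return_period_years"], toy["risks"])

# Share of flagged segments for each return period
flagged = toy["risks"].ne("monitor")
share_by_period = flagged.groupby(toy["return_period_years"]).mean()
```

If the share rises steadily from short to long return periods, that would support the idea that risk increases with return period; a flat or irregular pattern would not.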