# 🌍 Week 1 – Climate Risk & Disaster Management
### AICTE Cycle 3 (2025)

📌 **Objective:**  
In this notebook, we begin with **data exploration and cleaning** on the EM-DAT Disaster Dataset (1900–2021).  
The goal is to understand the dataset, check for missing values, and prepare it for further analysis.  

### Step 1: Import Libraries  
We import `pandas` for data handling and `numpy` for numeric operations.  


In [7]:
import pandas as pd
import numpy as np


### Step 2: Load Dataset  
We load the EM-DAT disasters dataset with `pd.read_csv`. Make sure the CSV file is stored at the location given in `DATA_PATH` (here, the Colab `/content/sample_data/` folder).  


In [8]:
DATA_PATH = "/content/sample_data/1900_2021_DISASTERS- emdat data.csv"
df = pd.read_csv(DATA_PATH)

print("✅ Dataset Loaded with shape:", df.shape)
df.head()


✅ Dataset Loaded with shape: (16126, 45)


| | Year | Seq | Glide | Disaster Group | Disaster Subgroup | Disaster Type | Disaster Subtype | Disaster Subsubtype | Event Name | Country | ... | No Affected | No Homeless | Total Affected | Insured Damages ('000 US$) | Total Damages ('000 US$) | CPI | Adm Level | Admin1 Code | Admin2 Code | Geo Locations |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1900 | 9002 | | Natural | Climatological | Drought | Drought | | | Cabo Verde | ... | | | | | | 3.221647 | | | | |
| 1 | 1900 | 9001 | | Natural | Climatological | Drought | Drought | | | India | ... | | | | | | 3.221647 | | | | |
| 2 | 1902 | 12 | | Natural | Geophysical | Earthquake | Ground movement | | | Guatemala | ... | | | | | 25000.0 | 3.350513 | | | | |
| 3 | 1902 | 3 | | Natural | Geophysical | Volcanic activity | Ash fall | | Santa Maria | Guatemala | ... | | | | | | 3.350513 | | | | |
| 4 | 1902 | 10 | | Natural | Geophysical | Volcanic activity | Ash fall | | Santa Maria | Guatemala | ... | | | | | | 3.350513 | | | | |
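
If the CSV is not at the expected location, `pd.read_csv` fails with a `FileNotFoundError`. The sketch below shows one way to fail early with a clearer message. `load_disasters` is a hypothetical helper name, not part of the original notebook, and the error text is only illustrative.

```python
from pathlib import Path

import pandas as pd


def load_disasters(path):
    """Load a CSV, raising a readable error if the file is missing.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    csv_path = Path(path)
    if not csv_path.is_file():
        raise FileNotFoundError(
            f"Dataset not found at {csv_path}. Upload it or update the path."
        )
    return pd.read_csv(csv_path)
```

In the notebook this could be called as `df = load_disasters(DATA_PATH)` in place of the direct `pd.read_csv` call.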


### Step 3: Explore Dataset  
We use `.info()`, `.describe()`, and `.isnull().sum()` to get an overview.  


In [9]:
# Dataset information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16126 entries, 0 to 16125
Data columns (total 45 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Year                        16126 non-null  int64  
 1   Seq                         16126 non-null  int64  
 2   Glide                       1581 non-null   object 
 3   Disaster Group              16126 non-null  object 
 4   Disaster Subgroup           16126 non-null  object 
 5   Disaster Type               16126 non-null  object 
 6   Disaster Subtype            13016 non-null  object 
 7   Disaster Subsubtype         1077 non-null   object 
 8   Event Name                  3861 non-null   object 
 9   Country                     16126 non-null  object 
 10  ISO                         16126 non-null  object 
 11  Region                      16126 non-null  object 
 12  Continent                   16126 non-null  object 
 13  Location                    143

In [10]:
# Summary statistics
df.describe()

| | Year | Seq | Aid Contribution | Dis Mag Value | Start Year | Start Month | Start Day | End Year | End Month | End Day | Total Deaths | No Injured | No Affected | No Homeless | Total Affected | Insured Damages ('000 US$) | Total Damages ('000 US$) | CPI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 16126.0 | 16126.0 | 677.0 | 4946.0 | 16126.0 | 15739.0 | 12498.0 | 16126.0 | 15418.0 | 12570.0 | 11413.0 | 3895.0 | 9220.0 | 2430.0 | 11617.0 | 1096.0 | 5245.0 | 15811.0 |
| mean | 1996.76479 | 714.78482 | 125413.6 | 47350.38 | 1996.77837 | 6.444374 | 15.233957 | 1996.835607 | 6.576728 | 15.77502 | 2842.866 | 2621.102 | 882361.2 | 73293.14 | 716508.8 | 798651.4 | 724783.5 | 63.215103 |
| std | 20.159065 | 1929.635089 | 2997875.0 | 309424.2 | 20.15571 | 3.393965 | 8.953821 | 20.14301 | 3.352965 | 8.865486 | 68605.95 | 34403.43 | 8573913.0 | 523005.8 | 7718598.0 | 3057638.0 | 4723131.0 | 26.734285 |
| min | 1900.0 | 1.0 | 1.0 | -57.0 | 1900.0 | 1.0 | 1.0 | 1900.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 3.0 | 1.0 | 34.0 | 2.0 | 3.221647 |
| 25% | 1989.0 | 93.0 | 175.0 | 7.0 | 1989.0 | 4.0 | 7.0 | 1989.0 | 4.0 | 8.0 | 6.0 | 14.0 | 1244.75 | 572.5 | 650.0 | 50000.0 | 8300.0 | 45.692897 |
| 50% | 2001.0 | 270.0 | 721.0 | 151.5 | 2001.0 | 7.0 | 15.0 | 2001.0 | 7.0 | 16.0 | 20.0 | 50.0 | 10000.0 | 3000.0 | 5965.0 | 172500.0 | 60000.0 | 68.415379 |
| 75% | 2011.0 | 486.0 | 3511.0 | 11296.5 | 2011.0 | 9.0 | 23.0 | 2011.0 | 9.0 | 23.0 | 63.0 | 200.0 | 91823.0 | 17500.0 | 58255.0 | 500000.0 | 317300.0 | 84.252733 |
| max | 2021.0 | 9881.0 | 78000000.0 | 13025870.0 | 2021.0 | 12.0 | 31.0 | 2021.0 | 12.0 | 31.0 | 3700000.0 | 1800000.0 | 330000000.0 | 15850000.0 | 330000000.0 | 60000000.0 | 210000000.0 | 100.0 |


In [11]:
# Missing values
df.isnull().sum()


| Column | Missing values |
|---|---|
| Year | 0 |
| Seq | 0 |
| Glide | 14545 |
| Disaster Group | 0 |
| Disaster Subgroup | 0 |
| Disaster Type | 0 |
| Disaster Subtype | 3110 |
| Disaster Subsubtype | 15049 |
| Event Name | 12265 |
| Country | 0 |
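
Raw counts are easier to judge next to the share of rows they represent. For example, `Glide` is missing in 14545 of 16126 rows, about 90%. A minimal sketch of such a summary follows. `missing_summary` is a hypothetical helper, and the toy values are placeholders, not EM-DAT records.

```python
import numpy as np
import pandas as pd


def missing_summary(frame):
    """Return missing-value counts and percentages, most incomplete first."""
    counts = frame.isnull().sum()
    percent = (counts / len(frame) * 100).round(1)
    summary = pd.DataFrame({"missing": counts, "percent": percent})
    return summary.sort_values("missing", ascending=False)


# Toy data for illustration only (placeholder values, not from EM-DAT)
toy = pd.DataFrame({
    "Year": [1900, 1902, 1902, 1903],
    "Glide": [np.nan, np.nan, np.nan, "PLACEHOLDER-ID"],
    "Total Deaths": [100.0, np.nan, 200.0, 300.0],
})
print(missing_summary(toy))
```

On the real dataset, `missing_summary(df).head(10)` would list the ten most incomplete columns.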


### Step 4: Data Cleaning  
- Drop duplicates  
- Handle missing values (numeric → median, categorical → mode)  


In [12]:
# Drop duplicate rows
df = df.drop_duplicates()

# Separate numeric and categorical columns
num_cols = df.select_dtypes(include=[np.number]).columns
cat_cols = df.select_dtypes(include=["object"]).columns

# Fill numeric NaN with median
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# Fill categorical NaN with mode
for col in cat_cols:
    df[col] = df[col].fillna(df[col].mode()[0])

print("✅ Missing values after cleaning:\n", df.isnull().sum())


✅ Missing values after cleaning:
 Year                          0
Seq                           0
Glide                         0
Disaster Group                0
Disaster Subgroup             0
Disaster Type                 0
Disaster Subtype              0
Disaster Subsubtype           0
Event Name                    0
Country                       0
ISO                           0
Region                        0
Continent                     0
Location                      0
Origin                        0
Associated Dis                0
Associated Dis2               0
OFDA Response                 0
Appeal                        0
Declaration                   0
Aid Contribution              0
Dis Mag Value                 0
Dis Mag Scale                 0
Latitude                      0
Longitude                     0
Local Time                    0
River Basin                   0
Start Year                    0
Start Month                   0
Start Day                     0
End Ye
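
To see what median/mode imputation does on a small scale, the cleaning step can be sketched as a function and run on toy data. `impute_median_mode` is a hypothetical name and the toy values are made up for illustration. Unlike the cell above, this sketch treats every non-numeric column as categorical and skips columns that have no mode (all values missing), which is an added assumption.

```python
import numpy as np
import pandas as pd


def impute_median_mode(frame):
    """Drop duplicate rows, then fill numeric NaN with the column median
    and non-numeric NaN with the column mode."""
    out = frame.drop_duplicates().copy()
    num_cols = out.select_dtypes(include=[np.number]).columns
    # Assumption: every non-numeric column is treated as categorical
    cat_cols = out.select_dtypes(exclude=[np.number]).columns
    out[num_cols] = out[num_cols].fillna(out[num_cols].median())
    for col in cat_cols:
        modes = out[col].mode()
        if not modes.empty:  # an all-NaN column has no mode; leave it unchanged
            out[col] = out[col].fillna(modes.iloc[0])
    return out


# Toy data for illustration only; the last two rows are exact duplicates
toy = pd.DataFrame({
    "Year": [1900, 1901, 1902, 1903, 1903],
    "Total Deaths": [10.0, np.nan, 30.0, 50.0, 50.0],
    "Disaster Subtype": ["Drought", None, "Drought", "Ash fall", "Ash fall"],
})
cleaned = impute_median_mode(toy)
print(cleaned)
```

After the duplicate row is dropped, the missing `Total Deaths` value becomes 30.0 (the median of 10, 30, 50) and the missing `Disaster Subtype` becomes `Drought` (the most frequent value).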

### Step 5: Save Cleaned Data  
We save the cleaned dataset for Week 2 analysis.  


In [13]:
df.to_csv("1900_2021_DISASTERS_cleaned.csv", index=False)
print("📂 Cleaned dataset saved successfully.")


📂 Cleaned dataset saved successfully.
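
A quick round-trip check can confirm that the saved file reloads with the same shape and column names. Passing `index=False` matters here: without it, the row index is written as an extra column that comes back as `Unnamed: 0` on reload. The sketch below uses a hypothetical helper, `save_and_verify`, and a temporary folder with toy data.

```python
import os
import tempfile

import pandas as pd


def save_and_verify(frame, path):
    """Write `frame` to CSV, read it back, and check shape and column names."""
    frame.to_csv(path, index=False)
    reloaded = pd.read_csv(path)
    if reloaded.shape != frame.shape:
        raise ValueError(
            f"Shape changed on reload: {frame.shape} -> {reloaded.shape}"
        )
    if list(reloaded.columns) != list(frame.columns):
        raise ValueError("Column names changed on reload")
    return reloaded


# Toy data written to a temporary folder, for illustration only
with tempfile.TemporaryDirectory() as tmp_dir:
    toy = pd.DataFrame({"Year": [1900, 1902], "Country": ["India", "Guatemala"]})
    out_path = os.path.join(tmp_dir, "toy_cleaned.csv")
    reloaded = save_and_verify(toy, out_path)
    print(reloaded.shape)  # (2, 2)
```

The check compares only shape and column names. Column dtypes can still differ after a CSV round trip, so they are not asserted here.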


In [14]:
from google.colab import drive
drive.mount('/content/drive')

# Save into Drive
df.to_csv("/content/drive/MyDrive/1900_2021_DISASTERS_cleaned.csv", index=False)

# Load later from Drive
df = pd.read_csv("/content/drive/MyDrive/1900_2021_DISASTERS_cleaned.csv")


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## ✅ Summary (Week 1)
- Dataset loaded successfully.  
- Basic structure and statistics explored.  
- Duplicate rows dropped and missing values imputed (numeric → median, categorical → mode).  
- Cleaned dataset saved for Week 2.  


