---
#  Richter's Predictor: Modeling Earthquake Damage
---

# Methodology

1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Modeling
5. Evaluation
6. Deployment

---

### 1. Business Understanding

#### **Objective**
The goal of this project is to predict the level of damage to buildings caused by the 2015 Gorkha earthquake in Nepal based on various building and location features. The prediction target is an ordinal variable, `damage_grade`, which has three categories:

- **1:** Low damage
- **2:** Medium damage
- **3:** Severe damage (almost complete destruction)

Accurately predicting the level of damage will help local authorities, policymakers, and disaster response teams better allocate resources for post-disaster rebuilding efforts and preventive measures.

#### **Key Questions to Answer**
- What factors contribute most to earthquake-induced building damage?
- Can a predictive model provide actionable insights to identify high-risk structures?
- How can this model help inform disaster management strategies?

#### **Success Criteria**
The success of this project will be evaluated using the **micro-averaged F1 score**, which balances precision and recall across all damage grades.

---

### 2. Data Understanding

#### **Data Description**
The dataset describes each building's structural features, geographic location, and ownership status. Each row represents one building affected by the earthquake. The feature file has 39 columns (`building_id` plus 38 features); merging in the labels adds `damage_grade`, for 40 columns in total:

- **Target Variable:** `damage_grade` (1, 2, 3)
- **Key Features:**
  - `geo_level_1_id`, `geo_level_2_id`, `geo_level_3_id`: Geographic identifiers at different administrative levels.
  - `count_floors_pre_eq`: Number of floors before the earthquake.
  - `age`: Age of the building in years.
  - `area_percentage`: Normalized building footprint area.
  - `height_percentage`: Normalized building height.
  - `land_surface_condition`: Categorical variable for surface condition (`n`, `o`, `t`).
  - `foundation_type`: Categorical variable for foundation type (`h`, `i`, `r`, `u`, `w`).
  - `has_superstructure_*`: Binary flags indicating the material used for building superstructures (e.g., `has_superstructure_adobe_mud`, `has_superstructure_rc_engineered`).
  - `legal_ownership_status`: Ownership status of the building (`a`, `r`, `v`, `w`).
  - `count_families`: Number of families living in the building.

#### **Data Characteristics**
- **Categorical Variables:** Values are obfuscated as random lowercase ASCII characters, so the letters themselves carry no semantic meaning.
- **Binary Variables:** Indicate the presence of specific building features (e.g., superstructure types).
- **Numerical Variables:** Include continuous and discrete values such as building age and count of floors.
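
Because the categorical letters carry no meaning, they have to be converted to numbers before most models can use them. The sketch below shows one common option, one-hot encoding with `pd.get_dummies`. The source does not state which encoding this project uses, so treat the choice as an assumption; the toy frame and its values are illustrative only.

```python
import pandas as pd

# Toy frame using two documented categorical columns (values are illustrative only)
df = pd.DataFrame({
    "foundation_type": ["r", "h", "w"],
    "land_surface_condition": ["t", "n", "t"],
    "age": [15, 30, 5],
})

# Assumption: one-hot encoding is used; each letter becomes its own indicator column
categorical_cols = ["foundation_type", "land_surface_condition"]
encoded = pd.get_dummies(df, columns=categorical_cols)

print(list(encoded.columns))
```

One-hot encoding avoids implying an order between letters, at the cost of one extra column per distinct value.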

#### **Example Data Row**
| Feature                   | Value |
|----------------------------|-------|
| `geo_level_1_id`          | 8     |
| `count_floors_pre_eq`     | 2     |
| `age`                     | 15    |
| `area_percentage`         | 4     |
| `foundation_type`         | r     |
| `has_superstructure_adobe_mud` | 1 |
| `damage_grade`            | 2     |

#### **Performance Metric**
The micro-averaged F1 score will be computed using `sklearn.metrics.f1_score` with `average='micro'` to assess model performance.
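
To make the metric concrete, here is a minimal pure-Python sketch of what micro-averaging computes: true positives, false positives, and false negatives are pooled over all classes before a single F1 is calculated. The helper name `micro_f1` and the toy labels are hypothetical; on the same inputs it should agree with `f1_score(..., average='micro')`.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP, FN over all classes, then compute F1 once."""
    tp = fp = fn = 0
    for label in set(y_true) | set(y_pred):
        for t, p in zip(y_true, y_pred):
            if p == label and t == label:
                tp += 1  # correctly predicted this class
            elif p == label and t != label:
                fp += 1  # predicted this class, but it was another one
            elif p != label and t == label:
                fn += 1  # missed an instance of this class
    return 2 * tp / (2 * tp + fp + fn)


# Toy labels (illustrative only, not taken from the dataset)
y_true = [1, 2, 2, 3, 3, 2]
y_pred = [1, 2, 3, 3, 2, 2]

score = micro_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(score, accuracy)
```

When every building receives exactly one predicted grade, each error counts once as a false positive and once as a false negative, so the micro-averaged F1 equals plain accuracy.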

---

This structured understanding will guide the data preparation, modeling, and evaluation phases of the project.


---

### 3. Data Preparation
---




#### Data Collection



In [5]:
import pandas as pd

# Read the training features and labels
df_train_values = pd.read_csv('data/train_values.csv')
df_train_labels = pd.read_csv('data/train_labels.csv')

# Merge features and labels on the shared building_id key for modeling
df_train = pd.merge(df_train_values, df_train_labels, on='building_id')
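
A merge like this can silently drop or duplicate rows if the key is not unique in both tables. As a hedged sketch, the `validate` argument of `pd.merge` can guard against that. The toy frames below stand in for the two CSV files and their values are made up.

```python
import pandas as pd

# Toy stand-ins for train_values.csv and train_labels.csv (values are illustrative only)
values = pd.DataFrame({"building_id": [101, 102, 103], "age": [15, 30, 5]})
labels = pd.DataFrame({"building_id": [103, 101, 102], "damage_grade": [3, 2, 1]})

# validate="one_to_one" raises MergeError if building_id repeats in either frame
merged = pd.merge(values, labels, on="building_id", how="inner", validate="one_to_one")

# An inner join should keep every building when both files cover the same IDs
assert len(merged) == len(values)
print(merged)
```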


#### Data Inspection

In [6]:
df_train.shape

(260601, 40)

- The merged training set has 260,601 rows and 40 columns.

In [7]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 260601 entries, 0 to 260600
Data columns (total 40 columns):
 #   Column                                  Non-Null Count   Dtype 
---  ------                                  --------------   ----- 
 0   building_id                             260601 non-null  int64 
 1   geo_level_1_id                          260601 non-null  int64 
 2   geo_level_2_id                          260601 non-null  int64 
 3   geo_level_3_id                          260601 non-null  int64 
 4   count_floors_pre_eq                     260601 non-null  int64 
 5   age                                     260601 non-null  int64 
 6   area_percentage                         260601 non-null  int64 
 7   height_percentage                       260601 non-null  int64 
 8   land_surface_condition                  260601 non-null  object
 9   foundation_type                         260601 non-null  object
 10  roof_type                               260601 non-null 

- The features are of mixed types: some are numerical, some are binary flags, and others are strings.
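
The three groups can be separated programmatically. The sketch below assumes that binary flags are exactly the columns with the `has_` prefix and that every non-numeric column is categorical; both are assumptions drawn from the column names, and the toy frame is illustrative only.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Toy frame with a few of the documented columns (values are illustrative only)
df = pd.DataFrame({
    "age": [15, 30, 5],
    "count_floors_pre_eq": [2, 3, 1],
    "foundation_type": ["r", "h", "r"],
    "has_superstructure_timber": [1, 0, 0],
    "has_secondary_use": [0, 0, 1],
})

# Assumption: binary flags share the `has_` prefix
binary_cols = [c for c in df.columns if c.startswith("has_")]
# Assumption: any non-numeric column is a categorical string column
categorical_cols = [c for c in df.columns if not is_numeric_dtype(df[c])]
# Everything else is treated as numerical
numeric_cols = [c for c in df.columns if c not in binary_cols + categorical_cols]

print(binary_cols, categorical_cols, numeric_cols)
```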

In [8]:
df_train.isna().sum()

building_id                               0
geo_level_1_id                            0
geo_level_2_id                            0
geo_level_3_id                            0
count_floors_pre_eq                       0
age                                       0
area_percentage                           0
height_percentage                         0
land_surface_condition                    0
foundation_type                           0
roof_type                                 0
ground_floor_type                         0
other_floor_type                          0
position                                  0
plan_configuration                        0
has_superstructure_adobe_mud              0
has_superstructure_mud_mortar_stone       0
has_superstructure_stone_flag             0
has_superstructure_cement_mortar_stone    0
has_superstructure_mud_mortar_brick       0
has_superstructure_cement_mortar_brick    0
has_superstructure_timber                 0
has_superstructure_bamboo       

- The dataset is complete: no column contains null values.
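
With 40 columns, the full `isna().sum()` listing is long to scan. A minimal sketch that reports only the columns with missing values is shown below. The toy frame deliberately contains one null so the filter has something to show; on the real training set the result would be expected to be empty.

```python
import pandas as pd

# Toy frame with one deliberately missing value (illustrative only)
df = pd.DataFrame({
    "age": [15, None, 5],
    "foundation_type": ["r", "h", "r"],
})

# Count nulls per column, then keep only the columns that have any
null_counts = df.isna().sum()
columns_with_nulls = null_counts[null_counts > 0]

print(columns_with_nulls)
```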

#### Feature Inspection

The [competition page](https://www.drivendata.org/competitions/57/nepal-earthquake/) describes the features as follows:

| **Feature Name** | **Description** |  
|------------------|-----------------|  
| `geo_level_1_id`, `geo_level_2_id`, `geo_level_3_id` | Geographic region in which the building exists, from the largest (level 1) to the most specific sub-region (level 3). Possible values: level 1 (0-30), level 2 (0-1427), level 3 (0-12567). |  
| `count_floors_pre_eq` | Number of floors in the building before the earthquake. |  
| `age` | Age of the building in years. |  
| `area_percentage` | Normalized area of the building footprint. |  
| `height_percentage` | Normalized height of the building footprint. |  
| `land_surface_condition` | Surface condition of the land where the building was built. Possible values: `n`, `o`, `t`. |  
| `foundation_type` | Type of foundation used while building. Possible values: `h`, `i`, `r`, `u`, `w`. |  
| `roof_type` | Type of roof used while building. Possible values: `n`, `q`, `x`. |  
| `ground_floor_type` | Type of the ground floor. Possible values: `f`, `m`, `v`, `x`, `z`. |  
| `other_floor_type` | Type of construction used in floors above the ground floor (excluding the roof). Possible values: `j`, `q`, `s`, `x`. |  
| `position` | Position of the building. Possible values: `j`, `o`, `s`, `t`. |  
| `plan_configuration` | Building plan configuration. Possible values: `a`, `c`, `d`, `f`, `m`, `n`, `o`, `q`, `s`, `u`. |  
| `has_superstructure_adobe_mud` | Flag indicating if the superstructure was made of Adobe/Mud. |  
| `has_superstructure_mud_mortar_stone` | Flag indicating if the superstructure was made of Mud Mortar - Stone. |  
| `has_superstructure_stone_flag` | Flag indicating if the superstructure was made of Stone. |  
| `has_superstructure_cement_mortar_stone` | Flag indicating if the superstructure was made of Cement Mortar - Stone. |  
| `has_superstructure_mud_mortar_brick` | Flag indicating if the superstructure was made of Mud Mortar - Brick. |  
| `has_superstructure_cement_mortar_brick` | Flag indicating if the superstructure was made of Cement Mortar - Brick. |  
| `has_superstructure_timber` | Flag indicating if the superstructure was made of Timber. |  
| `has_superstructure_bamboo` | Flag indicating if the superstructure was made of Bamboo. |  
| `has_superstructure_rc_non_engineered` | Flag indicating if the superstructure was made of non-engineered reinforced concrete. |  
| `has_superstructure_rc_engineered` | Flag indicating if the superstructure was made of engineered reinforced concrete. |  
| `has_superstructure_other` | Flag indicating if the superstructure was made of any other material. |  
| `legal_ownership_status` | Legal ownership status of the land where the building was built. Possible values: `a`, `r`, `v`, `w`. |  
| `count_families` | Number of families living in the building. |  
| `has_secondary_use` | Flag indicating if the building was used for any secondary purpose. |  
| `has_secondary_use_agriculture` | Flag indicating if the building was used for agricultural purposes. |  
| `has_secondary_use_hotel` | Flag indicating if the building was used as a hotel. |  
| `has_secondary_use_rental` | Flag indicating if the building was used for rental purposes. |  
| `has_secondary_use_institution` | Flag indicating if the building was used as an institution. |  
| `has_secondary_use_school` | Flag indicating if the building was used as a school. |  
| `has_secondary_use_industry` | Flag indicating if the building was used for industrial purposes. |  
| `has_secondary_use_health_post` | Flag indicating if the building was used as a health post. |  
| `has_secondary_use_gov_office` | Flag indicating if the building was used as a government office. |  
| `has_secondary_use_use_police` | Flag indicating if the building was used as a police station. |  
| `has_secondary_use_other` | Flag indicating if the building was secondarily used for other purposes. |  
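
The documented value sets in the table can double as a sanity check on the loaded data. The sketch below compares a frame against three of the sets listed above. The dictionary name `documented_values` and the toy frame, including its deliberately invalid `z`, are hypothetical.

```python
import pandas as pd

# Allowed values copied from the feature table (only three columns shown)
documented_values = {
    "land_surface_condition": {"n", "o", "t"},
    "foundation_type": {"h", "i", "r", "u", "w"},
    "roof_type": {"n", "q", "x"},
}

# Toy frame (illustrative only); "z" is not a documented roof_type
df = pd.DataFrame({
    "land_surface_condition": ["t", "n", "o"],
    "foundation_type": ["r", "h", "w"],
    "roof_type": ["n", "q", "z"],
})

# Collect, per column, any values that fall outside the documented set
unexpected = {}
for col, allowed in documented_values.items():
    extra = sorted(set(df[col]) - allowed)
    if extra:
        unexpected[col] = extra

print(unexpected)
```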

