---
#  Richter's Predictor: Modeling Earthquake Damage
---

# Methodology

-   1. Business Understanding
-   2. Data Understanding
-   3. Data Preparation
-   4. Modeling
-   5. Evaluation(Iterative)
-   6. Submission(Iterative)
---

### 1. Business Understanding

#### **Objective**
The goal of this project is to predict the level of damage to buildings caused by the 2015 Gorkha earthquake in Nepal based on various building and location features. The prediction target is an ordinal variable, `damage_grade`, which has three categories:

- **1:** Low damage
- **2:** Medium damage
- **3:** Severe damage (almost complete destruction)

Accurately predicting the level of damage will help local authorities, policymakers, and disaster response teams better allocate resources for post-disaster rebuilding efforts and preventive measures.

#### **Key Questions to Answer**
- What factors contribute most to earthquake-induced building damage?
- Can a predictive model provide actionable insights to identify high-risk structures?
- How can this model help inform disaster management strategies?

#### **Success Criteria**
The success of this project will be evaluated using the **micro-averaged F1 score**, which balances precision and recall across all damage grades.

---


#### **Data Description**
The dataset consists of information on buildings' structural features, geographic locations, and ownership statuses. Each row represents a specific building affected by the earthquake, with a total of 39 columns:

- **Target Variable:** `damage_grade` (1, 2, 3)
- **Key Features:**
  - `geo_level_1_id`, `geo_level_2_id`, `geo_level_3_id`: Geographic identifiers at different administrative levels.
  - `count_floors_pre_eq`: Number of floors before the earthquake.
  - `age`: Age of the building in years.
  - `area_percentage`: Normalized building footprint area.
  - `height_percentage`: Normalized building height.
  - `land_surface_condition`: Categorical variable for surface condition (`n`, `o`, `t`).
  - `foundation_type`: Categorical variable for foundation type (`h`, `i`, `r`, `u`, `w`).
  - `has_superstructure_*`: Binary flags indicating the material used for building superstructures (e.g., `has_superstructure_adobe_mud`, `has_superstructure_rc_engineered`).
  - `legal_ownership_status`: Ownership status of the building (`a`, `r`, `v`, `w`).
  - `count_families`: Number of families living in the building.

#### **Data Characteristics**
- **Categorical Variables:** Obfuscated random lowercase ASCII characters that do not imply semantic meaning.
- **Binary Variables:** Indicate the presence of specific building features (e.g., superstructure types).
- **Numerical Variables:** Include continuous and discrete values such as building age and count of floors.

#### **Example Data Row**
| Feature                   | Value |
|----------------------------|-------|
| `geo_level_1_id`          | 8     |
| `count_floors_pre_eq`     | 2     |
| `age`                     | 15    |
| `area_percentage`         | 4     |
| `foundation_type`         | r     |
| `has_superstructure_adobe_mud` | 1 |
| `damage_grade`            | 2     |

#### **Performance Metric**
The micro-averaged F1 score will be computed using `sklearn.metrics.f1_score` with `average='micro'` to assess model performance.

---

This structured understanding will guide the data preparation, modeling, and evaluation phases of the project.


---

### 3. Data Preparation
---




####    Data Collection



In [18]:
import pandas as pd
# read data
df_train_values= pd.read_csv('data/train_values.csv')
df_train_labels = pd.read_csv('data/train_labels.csv')

# merge train values and labels for modeling
df_train = pd.merge(df_train_values,df_train_labels, on = 'building_id')


####    Data Inspection

In [19]:
df_train.shape

(260601, 40)

-   There are 260601 rows and  40 columns

In [20]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 260601 entries, 0 to 260600
Data columns (total 40 columns):
 #   Column                                  Non-Null Count   Dtype 
---  ------                                  --------------   ----- 
 0   building_id                             260601 non-null  int64 
 1   geo_level_1_id                          260601 non-null  int64 
 2   geo_level_2_id                          260601 non-null  int64 
 3   geo_level_3_id                          260601 non-null  int64 
 4   count_floors_pre_eq                     260601 non-null  int64 
 5   age                                     260601 non-null  int64 
 6   area_percentage                         260601 non-null  int64 
 7   height_percentage                       260601 non-null  int64 
 8   land_surface_condition                  260601 non-null  object
 9   foundation_type                         260601 non-null  object
 10  roof_type                               260601 non-null 

-   The features are diverse, some are numerical, others binary, yet others are strings

In [21]:
df_train.isna().sum()

building_id                               0
geo_level_1_id                            0
geo_level_2_id                            0
geo_level_3_id                            0
count_floors_pre_eq                       0
age                                       0
area_percentage                           0
height_percentage                         0
land_surface_condition                    0
foundation_type                           0
roof_type                                 0
ground_floor_type                         0
other_floor_type                          0
position                                  0
plan_configuration                        0
has_superstructure_adobe_mud              0
has_superstructure_mud_mortar_stone       0
has_superstructure_stone_flag             0
has_superstructure_cement_mortar_stone    0
has_superstructure_mud_mortar_brick       0
has_superstructure_cement_mortar_brick    0
has_superstructure_timber                 0
has_superstructure_bamboo       

-   The dataset is very complete in that there are no null values

####    Feature Inspection

From the data source [link](https://www.drivendata.org/competitions/57/nepal-earthquake/), these are descriptions of the features:

| **Feature Name** | **Description** |  
|------------------|-----------------|  
| `geo_level_1_id`, `geo_level_2_id`, `geo_level_3_id` | Geographic region in which the building exists, from the largest (level 1) to the most specific sub-region (level 3). Possible values: level 1 (0-30), level 2 (0-1427), level 3 (0-12567). |  
| `count_floors_pre_eq` | Number of floors in the building before the earthquake. |  
| `age` | Age of the building in years. |  
| `area_percentage` | Normalized area of the building footprint. |  
| `height_percentage` | Normalized height of the building footprint. |  
| `land_surface_condition` | Surface condition of the land where the building was built. Possible values: `n`, `o`, `t`. |  
| `foundation_type` | Type of foundation used while building. Possible values: `h`, `i`, `r`, `u`, `w`. |  
| `roof_type` | Type of roof used while building. Possible values: `n`, `q`, `x`. |  
| `ground_floor_type` | Type of the ground floor. Possible values: `f`, `m`, `v`, `x`, `z`. |  
| `other_floor_type` | Type of constructions used in higher than the ground floors (except for roof). Possible values: `j`, `q`, `s`, `x`. |  
| `position` | Position of the building. Possible values: `j`, `o`, `s`, `t`. |  
| `plan_configuration` | Building plan configuration. Possible values: `a`, `c`, `d`, `f`, `m`, `n`, `o`, `q`, `s`, `u`. |  
| `has_superstructure_adobe_mud` | Flag indicating if the superstructure was made of Adobe/Mud. |  
| `has_superstructure_mud_mortar_stone` | Flag indicating if the superstructure was made of Mud Mortar - Stone. |  
| `has_superstructure_stone_flag` | Flag indicating if the superstructure was made of Stone. |  
| `has_superstructure_cement_mortar_stone` | Flag indicating if the superstructure was made of Cement Mortar - Stone. |  
| `has_superstructure_mud_mortar_brick` | Flag indicating if the superstructure was made of Mud Mortar - Brick. |  
| `has_superstructure_cement_mortar_brick` | Flag indicating if the superstructure was made of Cement Mortar - Brick. |  
| `has_superstructure_timber` | Flag indicating if the superstructure was made of Timber. |  
| `has_superstructure_bamboo` | Flag indicating if the superstructure was made of Bamboo. |  
| `has_superstructure_rc_non_engineered` | Flag indicating if the superstructure was made of non-engineered reinforced concrete. |  
| `has_superstructure_rc_engineered` | Flag indicating if the superstructure was made of engineered reinforced concrete. |  
| `has_superstructure_other` | Flag indicating if the superstructure was made of any other material. |  
| `legal_ownership_status` | Legal ownership status of the land where the building was built. Possible values: `a`, `r`, `v`, `w`. |  
| `count_families` | Number of families living in the building. |  
| `has_secondary_use` | Flag indicating if the building was used for any secondary purpose. |  
| `has_secondary_use_agriculture` | Flag indicating if the building was used for agricultural purposes. |  
| `has_secondary_use_hotel` | Flag indicating if the building was used as a hotel. |  
| `has_secondary_use_rental` | Flag indicating if the building was used for rental purposes. |  
| `has_secondary_use_institution` | Flag indicating if the building was used as an institution. |  
| `has_secondary_use_school` | Flag indicating if the building was used as a school. |  
| `has_secondary_use_industry` | Flag indicating if the building was used for industrial purposes. |  
| `has_secondary_use_health_post` | Flag indicating if the building was used as a health post. |  
| `has_secondary_use_gov_office` | Flag indicating if the building was used as a government office. |  
| `has_secondary_use_use_police` | Flag indicating if the building was used as a police station. |  
| `has_secondary_use_other` | Flag indicating if the building was secondarily used for other purposes. |  



#### Feature Selection

Right of the cuff, there are some features especaially for building identification which are superfluous for our purposes which we shall remove right away, however, we shall do a correlation check to be sure.

In [22]:
#   check correlation with target 'damage_grade'
df_train.corr()

Unnamed: 0,building_id,geo_level_1_id,geo_level_2_id,geo_level_3_id,count_floors_pre_eq,age,area_percentage,height_percentage,has_superstructure_adobe_mud,has_superstructure_mud_mortar_stone,...,has_secondary_use_hotel,has_secondary_use_rental,has_secondary_use_institution,has_secondary_use_school,has_secondary_use_industry,has_secondary_use_health_post,has_secondary_use_gov_office,has_secondary_use_use_police,has_secondary_use_other,damage_grade
building_id,1.0,-0.00285,0.000347,-0.000393,-0.000654,-0.001476,-0.00207,9.6e-05,-0.000307,0.002423,...,0.001934,-0.002152,0.000706,-0.000362,0.002348,-0.000374,0.000538,-0.003116,-0.002295,0.001063
geo_level_1_id,-0.00285,1.0,-0.061405,0.002718,-0.089364,-0.003908,0.071158,-0.063474,-0.018245,-0.152038,...,0.001911,0.023523,0.0037,0.002977,0.002655,-0.002303,0.00106,0.000523,-0.017992,-0.072347
geo_level_2_id,0.000347,-0.061405,1.0,0.000921,0.04773,0.012594,-0.049443,0.035516,0.015833,0.076491,...,-0.008439,-0.030704,-0.00484,-0.004856,0.000687,-0.000757,-0.000152,0.001926,-0.013068,0.043161
geo_level_3_id,-0.000393,0.002718,0.000921,1.0,-0.021646,-0.006385,-0.005643,-0.024507,-0.015732,0.026294,...,-0.002001,-0.007356,-0.007058,-0.004373,-0.000862,-0.002632,-0.000943,0.000269,-0.002463,0.007932
count_floors_pre_eq,-0.000654,-0.089364,0.04773,-0.021646,1.0,0.086668,0.101071,0.772734,0.174852,-0.027116,...,0.07712,0.035425,0.016384,0.008833,-0.002611,0.006786,0.009639,0.003939,-0.002073,0.122308
age,-0.001476,-0.003908,0.012594,-0.006385,0.086668,1.0,-0.004323,0.061074,0.068032,0.001321,...,-0.010021,0.001193,-0.004189,-0.003514,-0.003658,-0.002169,-0.001764,-0.001195,-0.004534,0.029273
area_percentage,-0.00207,0.071158,-0.049443,-0.005643,0.101071,-0.004323,1.0,0.196645,0.026287,-0.225541,...,0.159885,0.105983,0.052212,0.050164,0.019421,0.015109,0.01529,0.004983,0.013111,-0.125221
height_percentage,9.6e-05,-0.063474,0.035516,-0.024507,0.772734,0.061074,0.196645,1.0,0.149725,-0.106573,...,0.123551,0.068909,0.031366,0.020032,0.001946,0.011192,0.01466,0.004048,0.005397,0.04813
has_superstructure_adobe_mud,-0.000307,-0.018245,0.015833,-0.015732,0.174852,0.068032,0.026287,0.149725,1.0,-0.306861,...,-0.012642,-0.003935,-0.004281,-0.002369,0.001762,-0.003292,-0.002648,-0.001493,-0.010074,0.055314
has_superstructure_mud_mortar_stone,0.002423,-0.152038,0.076491,0.026294,-0.027116,0.001321,-0.225541,-0.106573,-0.306861,1.0,...,-0.159532,-0.117948,-0.036064,-0.02307,-0.025507,-0.008763,-0.011904,-0.00338,0.005628,0.291325


-   Not very clear due to the large no. of variables. Let us check the correlations of our target variable and the features, in descending order

In [23]:
correlations = df_train.corr()['damage_grade'].abs().sort_values(ascending= False)
correlations

damage_grade                              1.000000
has_superstructure_mud_mortar_stone       0.291325
has_superstructure_cement_mortar_brick    0.254131
has_superstructure_rc_engineered          0.179014
has_superstructure_rc_non_engineered      0.158145
area_percentage                           0.125221
count_floors_pre_eq                       0.122308
has_secondary_use_hotel                   0.097942
has_secondary_use_rental                  0.083754
has_secondary_use                         0.079630
geo_level_1_id                            0.072347
has_superstructure_timber                 0.069852
has_superstructure_stone_flag             0.066039
has_superstructure_bamboo                 0.063051
has_superstructure_cement_mortar_stone    0.060295
count_families                            0.056151
has_superstructure_adobe_mud              0.055314
height_percentage                         0.048130
geo_level_2_id                            0.043161
has_superstructure_other       

- As expected, identification variables have very low correlations. Generally, the correlations are not very large. Let us first remove the ID variables and legal ownership status:  

        - `geo_level_1_id`
        - `geo_level_2_id`
        - `geo_level_3_id` 
        - `building_id`
        - `legal_ownership_status`
      

In [24]:
list(df_train.columns)


['building_id',
 'geo_level_1_id',
 'geo_level_2_id',
 'geo_level_3_id',
 'count_floors_pre_eq',
 'age',
 'area_percentage',
 'height_percentage',
 'land_surface_condition',
 'foundation_type',
 'roof_type',
 'ground_floor_type',
 'other_floor_type',
 'position',
 'plan_configuration',
 'has_superstructure_adobe_mud',
 'has_superstructure_mud_mortar_stone',
 'has_superstructure_stone_flag',
 'has_superstructure_cement_mortar_stone',
 'has_superstructure_mud_mortar_brick',
 'has_superstructure_cement_mortar_brick',
 'has_superstructure_timber',
 'has_superstructure_bamboo',
 'has_superstructure_rc_non_engineered',
 'has_superstructure_rc_engineered',
 'has_superstructure_other',
 'legal_ownership_status',
 'count_families',
 'has_secondary_use',
 'has_secondary_use_agriculture',
 'has_secondary_use_hotel',
 'has_secondary_use_rental',
 'has_secondary_use_institution',
 'has_secondary_use_school',
 'has_secondary_use_industry',
 'has_secondary_use_health_post',
 'has_secondary_use_gov_of

In [25]:
# Delete redundant columns

redundant_columns = ['geo_level_1_id','geo_level_2_id','geo_level_3_id','building_id','legal_ownership_status']
df_train = df_train.drop(columns= redundant_columns)
df_train.columns


Index(['count_floors_pre_eq', 'age', 'area_percentage', 'height_percentage',
       'land_surface_condition', 'foundation_type', 'roof_type',
       'ground_floor_type', 'other_floor_type', 'position',
       'plan_configuration', 'has_superstructure_adobe_mud',
       'has_superstructure_mud_mortar_stone', 'has_superstructure_stone_flag',
       'has_superstructure_cement_mortar_stone',
       'has_superstructure_mud_mortar_brick',
       'has_superstructure_cement_mortar_brick', 'has_superstructure_timber',
       'has_superstructure_bamboo', 'has_superstructure_rc_non_engineered',
       'has_superstructure_rc_engineered', 'has_superstructure_other',
       'count_families', 'has_secondary_use', 'has_secondary_use_agriculture',
       'has_secondary_use_hotel', 'has_secondary_use_rental',
       'has_secondary_use_institution', 'has_secondary_use_school',
       'has_secondary_use_industry', 'has_secondary_use_health_post',
       'has_secondary_use_gov_office', 'has_secondary_use_u

In [26]:
df_test = pd.read_csv('data/test_values.csv')

In [27]:
df_test.head()

Unnamed: 0,building_id,geo_level_1_id,geo_level_2_id,geo_level_3_id,count_floors_pre_eq,age,area_percentage,height_percentage,land_surface_condition,foundation_type,...,has_secondary_use_agriculture,has_secondary_use_hotel,has_secondary_use_rental,has_secondary_use_institution,has_secondary_use_school,has_secondary_use_industry,has_secondary_use_health_post,has_secondary_use_gov_office,has_secondary_use_use_police,has_secondary_use_other
0,300051,17,596,11307,3,20,7,6,t,r,...,0,0,0,0,0,0,0,0,0,0
1,99355,6,141,11987,2,25,13,5,t,r,...,1,0,0,0,0,0,0,0,0,0
2,890251,22,19,10044,2,5,4,5,t,r,...,0,0,0,0,0,0,0,0,0,0
3,745817,26,39,633,1,0,19,3,t,r,...,0,0,1,0,0,0,0,0,0,0
4,421793,17,289,7970,3,15,8,7,t,r,...,0,0,0,0,0,0,0,0,0,0


In [28]:
df_test.drop(columns=redundant_columns)
df_test.columns

Index(['building_id', 'geo_level_1_id', 'geo_level_2_id', 'geo_level_3_id',
       'count_floors_pre_eq', 'age', 'area_percentage', 'height_percentage',
       'land_surface_condition', 'foundation_type', 'roof_type',
       'ground_floor_type', 'other_floor_type', 'position',
       'plan_configuration', 'has_superstructure_adobe_mud',
       'has_superstructure_mud_mortar_stone', 'has_superstructure_stone_flag',
       'has_superstructure_cement_mortar_stone',
       'has_superstructure_mud_mortar_brick',
       'has_superstructure_cement_mortar_brick', 'has_superstructure_timber',
       'has_superstructure_bamboo', 'has_superstructure_rc_non_engineered',
       'has_superstructure_rc_engineered', 'has_superstructure_other',
       'legal_ownership_status', 'count_families', 'has_secondary_use',
       'has_secondary_use_agriculture', 'has_secondary_use_hotel',
       'has_secondary_use_rental', 'has_secondary_use_institution',
       'has_secondary_use_school', 'has_secondary_use_i

####    One Hot Encoding

-    We shall now one hot encode the categorical columns. Let's first confirm if binary columns are correct


In [35]:
# Import library
from sklearn.preprocessing import OneHotEncoder

In [29]:
df_train.info()
cols = list(df_test.columns)
print(cols)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 260601 entries, 0 to 260600
Data columns (total 35 columns):
 #   Column                                  Non-Null Count   Dtype 
---  ------                                  --------------   ----- 
 0   count_floors_pre_eq                     260601 non-null  int64 
 1   age                                     260601 non-null  int64 
 2   area_percentage                         260601 non-null  int64 
 3   height_percentage                       260601 non-null  int64 
 4   land_surface_condition                  260601 non-null  object
 5   foundation_type                         260601 non-null  object
 6   roof_type                               260601 non-null  object
 7   ground_floor_type                       260601 non-null  object
 8   other_floor_type                        260601 non-null  object
 9   position                                260601 non-null  object
 10  plan_configuration                      260601 non-null 

In [30]:
['land_surface_condition', 'foundation_type','roof_type','ground_floor_type','other_floor_type','position','plan_configuration']

['building_id',
 'geo_level_1_id',
 'geo_level_2_id',
 'geo_level_3_id',
 'count_floors_pre_eq',
 'age',
 'area_percentage',
 'height_percentage',
 'land_surface_condition',
 'foundation_type',
 'roof_type',
 'ground_floor_type',
 'other_floor_type',
 'position',
 'plan_configuration',
 'has_superstructure_adobe_mud',
 'has_superstructure_mud_mortar_stone',
 'has_superstructure_stone_flag',
 'has_superstructure_cement_mortar_stone',
 'has_superstructure_mud_mortar_brick',
 'has_superstructure_cement_mortar_brick',
 'has_superstructure_timber',
 'has_superstructure_bamboo',
 'has_superstructure_rc_non_engineered',
 'has_superstructure_rc_engineered',
 'has_superstructure_other',
 'legal_ownership_status',
 'count_families',
 'has_secondary_use',
 'has_secondary_use_agriculture',
 'has_secondary_use_hotel',
 'has_secondary_use_rental',
 'has_secondary_use_institution',
 'has_secondary_use_school',
 'has_secondary_use_industry',
 'has_secondary_use_health_post',
 'has_secondary_use_gov_of

In [32]:
for x in binary_cols:
    print(df_train[x].value_counts())

0    231445
1     29156
Name: has_secondary_use, dtype: int64
0    243824
1     16777
Name: has_secondary_use_agriculture, dtype: int64
0    251838
1      8763
Name: has_secondary_use_hotel, dtype: int64
0    258490
1      2111
Name: has_secondary_use_rental, dtype: int64
0    260356
1       245
Name: has_secondary_use_institution, dtype: int64
0    260507
1        94
Name: has_secondary_use_school, dtype: int64
0    260322
1       279
Name: has_secondary_use_industry, dtype: int64
0    260552
1        49
Name: has_secondary_use_health_post, dtype: int64
0    260563
1        38
Name: has_secondary_use_gov_office, dtype: int64
0    260578
1        23
Name: has_secondary_use_use_police, dtype: int64
0    259267
1      1334
Name: has_secondary_use_other, dtype: int64
0    237500
1     23101
Name: has_superstructure_adobe_mud, dtype: int64
1    198561
0     62040
Name: has_superstructure_mud_mortar_stone, dtype: int64
0    251654
1      8947
Name: has_superstructure_stone_flag, dtype: int6

-They are OK. Separating categorical and numerical columns:

In [37]:
categorical_cols = ['land_surface_condition', 'foundation_type','roof_type','ground_floor_type',
                    'other_floor_type','position','plan_configuration']
numerical_cols = ['has_secondary_use','has_secondary_use_agriculture','has_secondary_use_hotel',
               'has_secondary_use_rental','has_secondary_use_institution','has_secondary_use_school',
               'has_secondary_use_industry','has_secondary_use_health_post','has_secondary_use_gov_office',
               'has_secondary_use_use_police','has_secondary_use_other','has_superstructure_adobe_mud',
               'has_superstructure_mud_mortar_stone','has_superstructure_stone_flag','has_superstructure_cement_mortar_stone',
               'has_superstructure_mud_mortar_brick','has_superstructure_cement_mortar_brick','has_superstructure_timber',
               'has_superstructure_bamboo','has_superstructure_rc_non_engineered','has_superstructure_rc_engineered',
               'has_superstructure_other', 'count_floors_pre_eq','age','area_percentage','height_percentage','count_families'     
               ]
# check if all columns are picked
len(categorical_cols)+len(numerical_cols)== len(df_train.columns)-1

True

In [None]:
categorical_df = df_train[categorical_cols]
numerical_df = df_train[numerical_cols]

In [40]:
ohe = OneHotEncoder(handle_unknown="ignore",drop='first')


In [41]:
# Perform One Hot Encoding


X_train_categorical_ohe = ohe.fit_transform(categorical_df).toarray()

X_train_encoded_categorical = pd.DataFrame(
    X_train_categorical_ohe,
    columns=ohe.get_feature_names_out(categorical_df.columns)
)
X_train_encoded_categorical.shape

(260601, 27)

In [50]:



X_train_encoded = pd.concat([X_train_encoded_categorical, numerical_df], axis=1)


In [51]:
X_train_encoded_categorical

Unnamed: 0,land_surface_condition_o,land_surface_condition_t,foundation_type_i,foundation_type_r,foundation_type_u,foundation_type_w,roof_type_q,roof_type_x,ground_floor_type_m,ground_floor_type_v,...,position_t,plan_configuration_c,plan_configuration_d,plan_configuration_f,plan_configuration_m,plan_configuration_n,plan_configuration_o,plan_configuration_q,plan_configuration_s,plan_configuration_u
0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
260596,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
260597,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
260598,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
260599,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [52]:
X_train_encoded.shape

(260601, 54)

In [53]:
X_train_encoded.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 260601 entries, 0 to 260600
Data columns (total 54 columns):
 #   Column                                  Non-Null Count   Dtype  
---  ------                                  --------------   -----  
 0   land_surface_condition_o                260601 non-null  float64
 1   land_surface_condition_t                260601 non-null  float64
 2   foundation_type_i                       260601 non-null  float64
 3   foundation_type_r                       260601 non-null  float64
 4   foundation_type_u                       260601 non-null  float64
 5   foundation_type_w                       260601 non-null  float64
 6   roof_type_q                             260601 non-null  float64
 7   roof_type_x                             260601 non-null  float64
 8   ground_floor_type_m                     260601 non-null  float64
 9   ground_floor_type_v                     260601 non-null  float64
 10  ground_floor_type_x                     2606

#### Min-Max Scaling


In [54]:
# Import library
from sklearn.preprocessing import MinMaxScaler

# Perform Feature Scaling
scaler = MinMaxScaler()

In [55]:
#Transform train set
scaler.fit(X_train_encoded)
X_train_encoded_scaled = pd.DataFrame(
    scaler.transform(X_train_encoded),
    # index is important to ensure we can concatenate with other columns
    index=X_train_encoded.index,
    columns=X_train_encoded.columns
)
X_train_encoded_scaled

Unnamed: 0,land_surface_condition_o,land_surface_condition_t,foundation_type_i,foundation_type_r,foundation_type_u,foundation_type_w,roof_type_q,roof_type_x,ground_floor_type_m,ground_floor_type_v,...,has_superstructure_timber,has_superstructure_bamboo,has_superstructure_rc_non_engineered,has_superstructure_rc_engineered,has_superstructure_other,count_floors_pre_eq,age,area_percentage,height_percentage,count_families
0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.125,0.030151,0.050505,0.100000,0.111111
1,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.125,0.010050,0.070707,0.166667,0.111111
2,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.125,0.010050,0.040404,0.100000,0.111111
3,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,1.0,0.0,0.0,0.0,0.125,0.010050,0.050505,0.100000,0.111111
4,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.250,0.030151,0.070707,0.233333,0.111111
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
260596,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000,0.055276,0.050505,0.033333,0.111111
260597,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.125,0.000000,0.050505,0.100000,0.111111
260598,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.250,0.055276,0.050505,0.166667,0.111111
260599,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.125,0.010050,0.131313,0.133333,0.111111


####    Data Prepreprocessing on the test set

In [59]:
# Separate Categorical and Numerical columns
categorical_df = df_test[categorical_cols]
numerical_df = df_test[numerical_cols]

# Perform One Hot Encoding
X_test_categorical_ohe = ohe.fit_transform(categorical_df).toarray()

X_test_encoded_categorical = pd.DataFrame(
    X_test_categorical_ohe,
    columns=ohe.get_feature_names_out(categorical_df.columns)
)

#Concatenate Categorical and Numerical columns
X_test_encoded = pd.concat([X_test_encoded_categorical, numerical_df], axis=1)

#Transform train set
scaler.fit(X_test_encoded)
X_test_encoded_scaled = pd.DataFrame(
    scaler.transform(X_test_encoded),
    # index is important to ensure we can concatenate with other columns
    index=X_test_encoded.index,
    columns=X_test_encoded.columns
)
X_test_encoded_scaled


Unnamed: 0,land_surface_condition_o,land_surface_condition_t,foundation_type_i,foundation_type_r,foundation_type_u,foundation_type_w,roof_type_q,roof_type_x,ground_floor_type_m,ground_floor_type_v,...,has_superstructure_timber,has_superstructure_bamboo,has_superstructure_rc_non_engineered,has_superstructure_rc_engineered,has_superstructure_other,count_floors_pre_eq,age,area_percentage,height_percentage,count_families
0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.285714,0.020101,0.065934,0.133333,0.125
1,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.142857,0.025126,0.131868,0.100000,0.125
2,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.142857,0.005025,0.032967,0.100000,0.125
3,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.197802,0.033333,0.250
4,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.285714,0.015075,0.076923,0.166667,0.125
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
86863,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.285714,0.070352,0.208791,0.133333,0.125
86864,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.285714,0.025126,0.054945,0.166667,0.125
86865,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.000000,0.050251,0.021978,0.033333,0.125
86866,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.142857,0.005025,0.087912,0.100000,0.125


In [65]:
X_test = X_test_encoded_scaled

##  Modeling

- For baseline, we shall use the entire given train set for modeling.


In [61]:
df_train.columns

Index(['count_floors_pre_eq', 'age', 'area_percentage', 'height_percentage',
       'land_surface_condition', 'foundation_type', 'roof_type',
       'ground_floor_type', 'other_floor_type', 'position',
       'plan_configuration', 'has_superstructure_adobe_mud',
       'has_superstructure_mud_mortar_stone', 'has_superstructure_stone_flag',
       'has_superstructure_cement_mortar_stone',
       'has_superstructure_mud_mortar_brick',
       'has_superstructure_cement_mortar_brick', 'has_superstructure_timber',
       'has_superstructure_bamboo', 'has_superstructure_rc_non_engineered',
       'has_superstructure_rc_engineered', 'has_superstructure_other',
       'count_families', 'has_secondary_use', 'has_secondary_use_agriculture',
       'has_secondary_use_hotel', 'has_secondary_use_rental',
       'has_secondary_use_institution', 'has_secondary_use_school',
       'has_secondary_use_industry', 'has_secondary_use_health_post',
       'has_secondary_use_gov_office', 'has_secondary_use_u

In [63]:
# Initialize X_train and y_train
X_train = X_train_encoded_scaled
y_train = df_train['damage_grade']

# check if they have same no. of rows
X_train.shape[0] == y_train.shape[0]

True

In [64]:
# Iniatialize Modeling phase 
from sklearn.tree import DecisionTreeClassifier

In [66]:
# Fit a model
dt = DecisionTreeClassifier(random_state=42)
model_dt = dt.fit(X_train, y_train)

print(model_dt)
 # Predict
y_hat_test = dt.predict(X_test)

DecisionTreeClassifier(random_state=42)


In [67]:
y_hat_test.shape

(86868,)

In [69]:
y_hat_test

array([2, 2, 2, ..., 2, 2, 1])

In [83]:
# Get ID mapping for test set
test_df = pd.read_csv("data/test_values.csv")

#Prepare submission df
submission_df = pd.concat([test_df['building_id'], pd.DataFrame({'damage_grade':y_hat_test})], axis= 1)

In [86]:
submission_df

Unnamed: 0,building_id,damage_grade
0,300051,2
1,99355,2
2,890251,2
3,745817,1
4,421793,2
...,...,...
86863,310028,3
86864,663567,3
86865,1049160,2
86866,442785,2


In [87]:
# Save to file 
submission_df.to_csv('Submission_1.csv', index = False)