## Import Required Libraries

In [31]:
%matplotlib inline

# Reload all modules before each cell is executed, so that changes to our
# own modules take effect without restarting the kernel
%load_ext autoreload
%autoreload 2

from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import loading_data as ld

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Loading Data

In [32]:
df_train = ld.load_train_values()
df_label = ld.load_train_labels()
df_test = ld.load_test_values()

# integrate damage_grade into train_values
# df_train['damage_grade'] = df_label['damage_grade']

In [33]:
print(df_train.shape)
print(df_label.shape)
print(df_test.shape)

(260601, 39)
(260601, 2)
(86868, 39)


## Data Exploration

### Finding the missing values

In [34]:
df_train_summary = pd.DataFrame({
    "Data type": df_train.dtypes,
    "Any nulls?": df_train.isnull().any(),
    "Unique values": df_train.nunique()
})
print(df_train_summary)

                                       Data type  Any nulls?  Unique values
building_id                                int64       False         260601
geo_level_1_id                             int64       False             31
geo_level_2_id                             int64       False           1414
geo_level_3_id                             int64       False          11595
count_floors_pre_eq                        int64       False              9
age                                        int64       False             42
area_percentage                            int64       False             84
height_percentage                          int64       False             27
land_surface_condition                    object       False              3
foundation_type                           object       False              5
roof_type                                 object       False              3
ground_floor_type                         object       False              5
other_floor_

No column contains null values, so no rows need to be cleaned at this stage.
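The summary above only covers `df_train`. A minimal sketch of how the same check could be run over several frames at once is shown here; `null_report` and the toy frames are hypothetical and not part of the project modules.

```python
import numpy as np
import pandas as pd

def null_report(frames):
    """Return the number of missing cells per column for each named DataFrame."""
    rows = []
    for name, df in frames.items():
        counts = df.isnull().sum()
        for column, n_missing in counts.items():
            rows.append({"frame": name, "column": column, "missing": int(n_missing)})
    return pd.DataFrame(rows)

# Tiny stand-in frames (assumption); in the notebook one would pass
# {"train": df_train, "label": df_label, "test": df_test} instead.
toy_train = pd.DataFrame({"age": [10, 25, np.nan], "roof_type": ["n", "q", "x"]})
toy_test = pd.DataFrame({"age": [5, 40], "roof_type": ["n", None]})

report = null_report({"train": toy_train, "test": toy_test})
# Show only the columns that actually contain missing values
print(report[report["missing"] > 0])
```

An empty filtered report would confirm that none of the frames needs row-level cleaning.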

## Data Preparation

### Balancing the data

In [35]:
import balancing_data as bd

df_train_balanced, df_label_balanced = bd.balance_dataset(df_train, df_label)
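The implementation of `bd.balance_dataset` is not shown in this notebook. As an illustration only, one common way to balance classes is random undersampling to the size of the rarest class. The function below is a hypothetical sketch under that assumption and may differ from what `balancing_data` actually does.

```python
import pandas as pd

def undersample_to_smallest_class(features, labels, target="damage_grade", seed=42):
    """Randomly drop rows so every class has as many rows as the rarest class."""
    # Join features and labels so that both are sampled together
    merged = features.merge(labels, on="building_id")
    n_smallest = merged[target].value_counts().min()
    balanced = (
        merged.groupby(target)
        .sample(n=n_smallest, random_state=seed)
        .reset_index(drop=True)
    )
    # Split back into a feature frame and a label frame with matching row order
    return balanced.drop(columns=[target]), balanced[["building_id", target]]

# Toy data (assumption): class 1 is the rarest with a single row
toy_features = pd.DataFrame({"building_id": [1, 2, 3, 4, 5, 6],
                             "age": [5, 10, 15, 20, 25, 30]})
toy_labels = pd.DataFrame({"building_id": [1, 2, 3, 4, 5, 6],
                           "damage_grade": [1, 2, 2, 2, 3, 3]})

X_bal, y_bal = undersample_to_smallest_class(toy_features, toy_labels)
print(y_bal["damage_grade"].value_counts())
```

Undersampling throws data away, so oversampling or class weights are alternatives worth comparing.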

### Data Cleaning, Dropping the unnecessary columns, Encoding the categorical variables

In [36]:
import feature_engineering as fe
import geolocation_encoding as ge

df_geolocation_encoding, mean_damage_maps = ge.encode_geolocation(df_train_balanced, df_label=df_label_balanced)
df_train_engineered = fe.engineer_features(df_geolocation_encoding)

Dropping 3 columns from the dataframe.
List of columns to drop:
0:	position
1:	plan_configuration
2:	legal_ownership_status

One-hot encoding 5 columns.
List of columns to one-hot encode:
0:	land_surface_condition
1:	foundation_type
2:	roof_type
3:	ground_floor_type
4:	other_floor_type
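The name `mean_damage_maps` and the `*_mean_damage` columns in the log output suggest that the geolocation IDs are replaced by the mean damage grade observed per ID. The following is a hedged sketch of that idea. `mean_damage_encode` is a hypothetical helper, not the code in `geolocation_encoding`, and the fallback for unseen IDs (the mean of the mapping) is an assumption.

```python
import pandas as pd

def mean_damage_encode(df, column, labels=None, mapping=None, target="damage_grade"):
    """Add `<column>_mean_damage`; learn the mapping from labels or reuse a given one."""
    out = df.copy()
    if mapping is None:
        # Training path: average damage grade per ID.
        # Assumes df and labels share the same row index.
        mapping = labels[target].groupby(out[column]).mean()
    new_col = f"{column}_mean_damage"
    out[new_col] = out[column].map(mapping)
    # IDs not seen during training map to NaN; fall back to the mean of the mapping
    out[new_col] = out[new_col].fillna(mapping.mean())
    return out, mapping

# Toy data (assumption): ID 9 appears only in the test frame
train_geo = pd.DataFrame({"geo_level_1_id": [1, 1, 2, 2]})
train_lab = pd.DataFrame({"damage_grade": [1, 3, 3, 3]})
test_geo = pd.DataFrame({"geo_level_1_id": [1, 2, 9]})

train_enc, geo_map = mean_damage_encode(train_geo, "geo_level_1_id", labels=train_lab)
test_enc, _ = mean_damage_encode(test_geo, "geo_level_1_id", mapping=geo_map)
print(test_enc)
```

Reusing the mapping learned on the training data for the test data keeps label information from leaking in the other direction.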



In [37]:
df_train_engineered.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 260601 entries, 0 to 260600
Data columns (total 53 columns):
 #   Column                                  Non-Null Count   Dtype  
---  ------                                  --------------   -----  
 0   building_id                             260601 non-null  int64  
 1   geo_level_1_id                          260601 non-null  int64  
 2   geo_level_2_id                          260601 non-null  int64  
 3   geo_level_3_id                          260601 non-null  int64  
 4   count_floors_pre_eq                     260601 non-null  int64  
 5   age                                     260601 non-null  int64  
 6   area_percentage                         260601 non-null  int64  
 7   height_percentage                       260601 non-null  int64  
 8   has_superstructure_adobe_mud            260601 non-null  int64  
 9   has_superstructure_mud_mortar_stone     260601 non-null  int64  
 10  has_superstructure_stone_flag           2606

## 3. Modeling: Selection and Implementation

In [38]:
import models as md

X_train, X_test, y_train, y_test, rf_model = md.make_and_return_model(df_train_engineered, df_label_balanced)
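`md.make_and_return_model` is defined in our own `models` module and is not reproduced here. A minimal sketch of what such a function could look like follows, assuming a random forest (as the name `rf_model` suggests) and a 20% hold-out split. The hyperparameters, the stratified split, and the function name `make_model_sketch` are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def make_model_sketch(features, labels, seed=0):
    """Split the data, fit a random forest, and return the splits plus the model."""
    # building_id is an identifier, not a predictive feature (assumption)
    X = features.drop(columns=["building_id"])
    y = labels["damage_grade"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )
    model = RandomForestClassifier(n_estimators=50, random_state=seed)
    model.fit(X_train, y_train)
    return X_train, X_test, y_train, y_test, model

# Synthetic stand-in data (assumption), only to show that the sketch runs
rng = np.random.default_rng(0)
n = 300
toy_features = pd.DataFrame({
    "building_id": np.arange(n),
    "age": rng.integers(0, 100, n),
    "count_floors_pre_eq": rng.integers(1, 5, n),
})
toy_labels = pd.DataFrame({
    "building_id": np.arange(n),
    "damage_grade": rng.integers(1, 4, n),
})

X_tr, X_te, y_tr, y_te, toy_model = make_model_sketch(toy_features, toy_labels)
print(X_tr.shape, X_te.shape)
```

A test size of 0.2 would be consistent with the 52121 held-out rows out of 260601 reported in the evaluation.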

## 4. Evaluation

In [39]:
# Predictions
preds = rf_model.predict(X_test)

# We want to evaluate our model with micro average f1 score
from sklearn.metrics import f1_score
f1_score(y_test, preds, average='micro')

0.7396251031254197
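For a single-label multiclass problem, the micro-averaged F1 score pools true positives, false positives and false negatives over all classes, which makes it equal to plain accuracy. The sketch below computes it by hand with NumPy on toy labels (the toy labels and the helper `micro_f1` are illustrative, not part of the notebook).

```python
import numpy as np

def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP and FN over all classes, then compute F1."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    classes = np.union1d(y_true, y_pred)
    tp = sum(int(np.sum((y_pred == c) & (y_true == c))) for c in classes)
    fp = sum(int(np.sum((y_pred == c) & (y_true != c))) for c in classes)
    fn = sum(int(np.sum((y_pred != c) & (y_true == c))) for c in classes)
    return 2 * tp / (2 * tp + fp + fn)

toy_true = [1, 2, 2, 3, 3, 2]
toy_pred = [1, 2, 3, 3, 2, 2]

toy_micro = micro_f1(toy_true, toy_pred)
toy_accuracy = float(np.mean(np.asarray(toy_true) == np.asarray(toy_pred)))
print(toy_micro, toy_accuracy)
```

Every misclassified sample counts once as a false positive (for the predicted class) and once as a false negative (for the true class), which is why the two numbers coincide.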

**Predictions**

01: 0.5682738243702155

02: 0.7203046756585638

These predictions were great on the training set, but not on the test set:

03: 0.7440570979067938

04: 0.7444408203986875

Here we also got good results on the test set:

05: 0.7328140288943037

In [40]:
# How is the model doing on each class?
from sklearn.metrics import classification_report
print(classification_report(y_test, preds))

              precision    recall  f1-score   support

           1       0.66      0.52      0.58      5025
           2       0.74      0.84      0.79     29652
           3       0.75      0.64      0.69     17444

    accuracy                           0.74     52121
   macro avg       0.72      0.67      0.69     52121
weighted avg       0.74      0.74      0.74     52121



**NOTE: The model does poorly on class 3, the second most common class (recall 0.64), and even worse on class 1 (recall 0.52).**
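To see where the misclassified samples of a class end up, a row-normalised confusion table can help. This is a small sketch using `pd.crosstab` on toy labels; the helper `confusion_table` is hypothetical, and in the notebook one would pass `y_test` and `preds`.

```python
import numpy as np
import pandas as pd

def confusion_table(y_true, y_pred):
    """Rows are true classes, columns are predicted classes; each row sums to 1."""
    # Converting to arrays avoids accidental index alignment between Series
    return pd.crosstab(
        np.asarray(y_true),
        np.asarray(y_pred),
        rownames=["true"],
        colnames=["predicted"],
        normalize="index",
    )

# Toy labels (assumption), chosen so that class 3 leaks into class 2
toy_true = [1, 1, 2, 2, 2, 3, 3, 3]
toy_pred = [1, 2, 2, 2, 2, 2, 3, 3]

table = confusion_table(toy_true, toy_pred)
print(table.round(2))
```

The diagonal of this table is the per-class recall, so the off-diagonal cells show which class absorbs the errors.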

## 5. Predictions Output

Prepare the predictions for the competition.
The submission file (CSV) must have the following format:

building_id,damage_grade
11456,1
16528,1
3253,1
18614,1
1544,1

(all numbers need to be integers!)

Steps:
* make a dataframe with the building_id
* add the predictions to the dataframe (damage_grade)
* make building_id the index
* save to csv

In [41]:
# Apply the same preprocessing steps as for the training data,
# reusing the geolocation mapping learned on the training set
df_geolocation_encoding, _ = ge.encode_geolocation(df_test, mean_damage_maps=mean_damage_maps)
df_test_engineered = fe.engineer_features(df_geolocation_encoding, do_fit=False)

# DataFrame with only the building_id column
# (copied so that adding a column does not trigger a SettingWithCopyWarning)
df_test_pred = df_test[['building_id']].copy()

# Print shape of the engineered test dataframe and df_test_pred
# print(df_test_engineered.shape)
# print(df_test_pred.shape)

# Add the predictions as the damage_grade column
df_test_pred['damage_grade'] = rf_model.predict(df_test_engineered)

# Make building_id the index
df_test_pred = df_test_pred.set_index('building_id')

# Save the dataframe to a csv file
df_test_pred.to_csv('../data/submission.csv')

geo_level_1_id_mean_damage column has 0 NaN values, which getting filled with mean 2.1798668693427175
geo_level_2_id_mean_damage column has 5 NaN values, which getting filled with mean 2.215149104061613
Dropping 3 columns from the dataframe.
List of columns to drop:
0:	position
1:	plan_configuration
2:	legal_ownership_status

One-hot encoding 5 columns.
List of columns to one-hot encode:
0:	land_surface_condition
1:	foundation_type
2:	roof_type
3:	ground_floor_type
4:	other_floor_type
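Before uploading, it can be worth checking the file against the format described above. The following is a hedged sketch of such a check; `check_submission` is a hypothetical helper, and the allowed grades 1 to 3 are taken from the classes in the classification report.

```python
import io
import pandas as pd

def check_submission(csv_source, expected_ids):
    """Return a list of problems found in a submission file (empty list = OK)."""
    sub = pd.read_csv(csv_source)
    problems = []
    if list(sub.columns) != ["building_id", "damage_grade"]:
        problems.append(f"unexpected columns: {list(sub.columns)}")
        return problems
    if not pd.api.types.is_integer_dtype(sub["damage_grade"]):
        problems.append("damage_grade is not an integer column")
    if not sub["damage_grade"].isin([1, 2, 3]).all():
        problems.append("damage_grade contains values outside 1..3")
    if sub["building_id"].tolist() != list(expected_ids):
        problems.append("building_id values or order differ from the test set")
    return problems

# In-memory stand-ins for a file on disk (assumption); in the notebook one
# would pass '../data/submission.csv' and df_test['building_id'] instead.
good_csv = "building_id,damage_grade\n11456,1\n16528,2\n"
bad_csv = "building_id,damage_grade\n11456,1.0\n16528,2.0\n"

good_result = check_submission(io.StringIO(good_csv), [11456, 16528])
bad_result = check_submission(io.StringIO(bad_csv), [11456, 16528])
print(good_result)
print(bad_result)
```

The second file fails only the integer check, which is the kind of mistake the note about integers is meant to prevent.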



**Submissions**

01 by Johannes: 0.5683

02 by Johannes: 0.58..

03 by Johannes: 0.7342