This Jupyter notebook contains a quick attempt at building a good predictive model for this competition:

https://www.hackerearth.com/problem/machine-learning/predict-the-energy-used-612632a9-3f496e7f/

## 1. Loading dataset

In [2]:
import pandas as pd

train = pd.read_csv("./dataset/train.csv")
test  = pd.read_csv("./dataset/test.csv")
build_own = pd.read_csv("./dataset/Building_Ownership_Use.csv") 
build_str = pd.read_csv("./dataset/Building_Structure.csv")

# Merging the structure and ownership dataframes on their shared key columns
build_data = pd.merge(build_str, build_own, on=['building_id', 'district_id', 'vdcmun_id', 'ward_id'])

# train dataset with features of build_own and build_str dataframes
trainFull  = pd.merge(train, build_data, on=['building_id', 'district_id', 'vdcmun_id'])
# test dataset with features of build_own and build_str dataframes
testFull   = pd.merge(test,  build_data, on=['building_id', 'district_id', 'vdcmun_id'])

In [3]:
print("trainFull.shape: "+str(trainFull.shape)+"\n"+"testFull.shape:  "+str(testFull.shape))

trainFull.shape: (631761, 53)
testFull.shape:  (421175, 52)
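A merge on key columns can silently duplicate or drop rows when the keys are not unique or do not match on both sides. The sketch below uses small made-up frames (the values are illustrative assumptions, not taken from the competition files) to show how the `validate` and `indicator` parameters of `pd.merge` can be used to check for this.

```python
import pandas as pd

# Toy frames standing in for train.csv and the merged building data (made-up values)
left = pd.DataFrame({"building_id": ["a1", "b2", "c3"],
                     "damage_grade": ["Grade 1", "Grade 3", "Grade 2"]})
right = pd.DataFrame({"building_id": ["a1", "b2", "d4"],
                      "age_building": [10, 25, 3]})

# validate raises pandas.errors.MergeError if building_id is duplicated on either side;
# indicator adds a _merge column recording where each row came from
checked = pd.merge(left, right, on="building_id", how="left",
                   validate="one_to_one", indicator=True)

# Rows of `left` that found no partner in `right`
unmatched = checked.loc[checked["_merge"] == "left_only", "building_id"].tolist()
print(unmatched)
```

With the default inner join used in this notebook, such unmatched rows would simply disappear, so comparing row counts before and after the merge is a cheap additional check.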


## 2. Converting categorical features to numerical (and dealing with *NaN* entries)

### 2.1 Dropping data points with *NaN*'s

Number of *NaN*'s in each feature:

In [4]:
# Features where the number of NaN's is nonzero
print("trainFull:")
print(trainFull.isnull().sum().loc[trainFull.isnull().sum()!=0])
print("\ntestFull:")
print(testFull.isnull().sum().loc[testFull.isnull().sum()!=0])

trainFull:
has_repair_started    33417
count_families            1
dtype: int64

testFull:
has_repair_started    21922
dtype: int64


We can see in the training dataset that the feature *has_repair_started* has 33417 *NaN*'s and *count_families* has only 1 *NaN*. In the test dataset only *has_repair_started* shows up again with 21922 *NaN*'s.

As there is only one *NaN* in *count_families* and, more importantly, **it occurs only in the training dataset**, we will drop this data point/row.

In [5]:
import numpy as np
# Dropping the data point with NaN in the count_families feature
trainCutted = trainFull[np.isfinite(trainFull['count_families'])]

# Showing results
print("Number of samples:\ntrainCutted: {}\ntrainFull:   {}\n".format(len(trainCutted), len(trainFull)))
print("trainCutted NaN's:\n"+str(trainCutted.isnull().sum().loc[trainCutted.isnull().sum()!=0]))

Number of samples:
trainCutted: 631760
trainFull:   631761

trainCutted NaN's:
has_repair_started    33417
dtype: int64
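An equivalent way to drop rows with a missing value in a single column is `DataFrame.dropna` with the `subset` argument. This is a minimal sketch on a made-up frame, not the notebook's actual data; it shows that *NaN*'s in other columns are left untouched.

```python
import numpy as np
import pandas as pd

# Toy frame with one NaN in each column (made-up values)
toy = pd.DataFrame({
    "count_families": [1.0, np.nan, 2.0],
    "has_repair_started": [0.0, 1.0, np.nan],
})

# Drop only the rows where count_families is missing;
# NaN in has_repair_started does not cause a row to be dropped
cut = toy.dropna(subset=["count_families"])

print(len(toy), len(cut))
print(cut.isnull().sum())
```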


Frequency analysis of *has_repair_started* feature:

In [6]:
display(trainCutted.loc[:, 'has_repair_started'].value_counts(dropna=False))

0.0    409222
1.0    189121
NaN     33417
Name: has_repair_started, dtype: int64

As the number of *NaN*'s is high (33417), we won't drop these data points. A *NaN* in the *has_repair_started* feature may itself be relevant information for the models.

### 2.2 Using dummy variables for categorical features

Splitting train dataset between *X_tr* and *Y_tr*:

In [7]:
# Y_tr are the target classes
# Converting Y_tr to numerical format: keep the last character of each label
# and cast from object -> int (vectorized, no Python loop over the rows)
Y_tr = trainCutted['damage_grade'].str[-1].astype('int')
    
# X_tr are the features:
features = trainCutted.columns.values.tolist()
features.remove('damage_grade')
X_tr = trainCutted.loc[:, features].copy()

Then we search for differences between the feature values of the train and test sets.

In [8]:
# Comparing test and train features values
columns = testFull.columns.values
equal     = ""
different = ""
for column in columns:
    temp1 = trainCutted[column].unique()
    temp1.sort()
    temp2 = testFull[column].unique()
    temp2.sort()

    if np.array_equal(temp1, temp2):
        equal+=column+"\n"
    else:
        different+=column+"\n"+\
        "train ({})\n{}\n".format(len(temp1), temp1)+\
        "test ({})\n{}".format(len(temp2), temp2)+"\n\n"

spaces = 45
print("{}TRAIN == TEST:\n{}\n".format(" "*spaces, equal))
print("="*100+"\n\n")
print("{}TRAIN != TEST:\n{}".format(" "*spaces, different))

                                             TRAIN == TEST:
area_assesed
district_id
has_geotechnical_risk
has_geotechnical_risk_fault_crack
has_geotechnical_risk_flood
has_geotechnical_risk_land_settlement
has_geotechnical_risk_landslide
has_geotechnical_risk_liquefaction
has_geotechnical_risk_other
has_geotechnical_risk_rock_fall
count_floors_pre_eq
count_floors_post_eq
land_surface_condition
foundation_type
roof_type
ground_floor_type
other_floor_type
position
plan_configuration
has_superstructure_adobe_mud
has_superstructure_mud_mortar_stone
has_superstructure_stone_flag
has_superstructure_cement_mortar_stone
has_superstructure_mud_mortar_brick
has_superstructure_cement_mortar_brick
has_superstructure_timber
has_superstructure_bamboo
has_superstructure_rc_non_engineered
has_superstructure_rc_engineered
has_superstructure_other
condition_post_eq
legal_ownership_status
has_secondary_use
has_secondary_use_agriculture
has_secondary_use_hotel
has_secondary_use_rental
has_secondary_use_i
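For encoding purposes, the case that matters most is a category that appears in the test set but never in the training set. The sketch below, on made-up frames, reports exactly those values using set differences; the helper name `unseen_values` is hypothetical and not part of the notebook.

```python
import pandas as pd

# Toy train/test frames (made-up values)
train_toy = pd.DataFrame({"roof_type": ["A", "B", "A"], "position": ["x", "y", "x"]})
test_toy = pd.DataFrame({"roof_type": ["A", "C", "B"], "position": ["y", "x", "x"]})

def unseen_values(train_df, test_df, columns):
    """Return {column: sorted values present in test_df but absent from train_df}."""
    report = {}
    for col in columns:
        # NaN is excluded so that missing values are not reported as a category
        extra = set(test_df[col].dropna().unique()) - set(train_df[col].dropna().unique())
        if extra:
            report[col] = sorted(extra)
    return report

report = unseen_values(train_toy, test_toy, ["roof_type", "position"])
print(report)
```

Unlike `np.array_equal` on sorted unique values, this comparison is not affected by *NaN* entries, which never compare equal to each other.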

Creating and adding dummies to the dataset (**note: we drop one of the *K* dummy variables, leaving *K*-1 dummy variables per categorical column**):

In [9]:
X_tr_new = X_tr

# Categorical features that need encoding
catFeat = [
    'area_assesed',
    'district_id',
    'land_surface_condition',
    'foundation_type',
    'roof_type',
    'ground_floor_type',
    'other_floor_type',
    'position',
    'plan_configuration',
    'condition_post_eq',
    'legal_ownership_status',
#   'has_repair_started', # because of NaN's
]   

# Numerical/Ordinal features
numFeat = [
    'building_id', # searching for data leakage
    'vdcmun_id',   # categorical, but used as numerical for simplicity
    'ward_id',     # categorical, but used as numerical for simplicity
    'count_floors_pre_eq',
    'count_floors_post_eq',
    'age_building',
    'plinth_area_sq_ft',
    'height_ft_pre_eq',
    'height_ft_post_eq',
    'count_families'
]

X_tr_new = pd.get_dummies(X_tr,     columns=catFeat,                drop_first=True)
X_tr_new = pd.get_dummies(X_tr_new, columns=['has_repair_started'], drop_first=True, dummy_na=True) # because of NaN's
X_tr_new.shape

(631760, 113)
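To make the effect of `drop_first` and `dummy_na` concrete, here is a small sketch on a made-up frame. It mirrors the two `pd.get_dummies` calls above on a three-level categorical column and on a binary column containing a *NaN*.

```python
import numpy as np
import pandas as pd

# Toy frame: one categorical column with K=3 levels, one binary column with a NaN
toy = pd.DataFrame({
    "roof_type": ["A", "B", "C", "A"],
    "has_repair_started": [0.0, 1.0, np.nan, 0.0],
})

# K=3 levels -> K-1=2 dummy columns (the first level, "A", is dropped)
enc = pd.get_dummies(toy, columns=["roof_type"], drop_first=True)

# dummy_na=True treats NaN as an extra level, so the levels are 0.0, 1.0 and NaN;
# drop_first removes the 0.0 column and keeps the 1.0 and NaN indicators
enc = pd.get_dummies(enc, columns=["has_repair_started"], drop_first=True, dummy_na=True)

print(enc.columns.tolist())
```

A row with all dummies of a column equal to zero then represents the dropped first level.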

In [10]:
# Converting 'building_id' to numerical format (searching for data leakage)
X_num = X_tr_new.copy()  # copy so that X_tr_new is left unchanged

X_num['building_id'] = X_num['building_id'].apply(lambda x: int(x,16))
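The conversion above relies on `building_id` being a hexadecimal string, which is what `int(x, 16)` expects. A minimal sketch with made-up IDs (not real values from the dataset) shows the mapping:

```python
import pandas as pd

# Made-up hexadecimal IDs, assumed to resemble the building_id format
ids = pd.Series(["a1f", "00ff", "10"])

# int(x, 16) parses a string as a base-16 number;
# it raises ValueError if the string contains non-hexadecimal characters
as_int = ids.apply(lambda x: int(x, 16))

print(as_int.tolist())  # [2591, 255, 16]
```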

In [11]:
# Saving dataframes to CSV files
X_num.to_csv("X_tr.csv", index=False)
Y_tr.to_csv("Y_tr.csv", header=False, index=False)