<a href="https://colab.research.google.com/github/Nell87/drivendata_richter/blob/main/script/feature_selection_and_feature_engineering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Importing data** 

In [4]:
####    INCLUDES  _______________________________________ #### 
#Loading Libraries:# 
import pandas as pd
import os
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

####    READING TRAIN AND TEST DATA _______________________________________ #### 
train_values = data = pd.read_csv("https://raw.githubusercontent.com/Nell87/drivendata_richter/main/data/train_values.csv")
train_labels = pd.read_csv("https://raw.githubusercontent.com/Nell87/drivendata_richter/main/data/train_labels.csv")
test_values = pd.read_csv("https://raw.githubusercontent.com/Nell87/drivendata_richter/main/data/test_values.csv")
train_merge = train_values.merge(train_labels, on='building_id', how='inner')
print(train_merge.shape)

(260601, 40)


# **Feature engineering** 

## **Feature engineering: Location features**
The features **geo_level_1_id, geo_level_2_id, geo_level_3_id** represent the geographic region in which the building is located, from the largest region (level 1) to the most specific sub-region (level 3). Possible values: level 1: 0-30, level 2: 0-1427, level 3: 0-12567.

Each location feature has a large number of categorical values, so we'll apply feature engineering to them. For every row, we'll replace the region id with the conditional probability of that row's damage_grade given the region, i.e. P(damage_grade | geo_level).
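To make the encoding concrete, here is a minimal sketch on made-up toy data (the column names mirror the dataset, but the values are invented for illustration). It is not the notebook's function, just the same arithmetic spelled out: count the rows per (region, damage_grade) pair and divide by the number of rows in the region.

```python
import pandas as pd

# Toy data: column names mirror the dataset, values are made up for illustration.
toy = pd.DataFrame({
    "geo_level_1_id": [0, 0, 0, 0, 1, 1],
    "damage_grade":   [1, 2, 2, 2, 3, 3],
})

# Number of rows sharing each row's (region, damage_grade) pair.
pair_counts = toy.groupby(["geo_level_1_id", "damage_grade"])["damage_grade"].transform("size")
# Number of rows in each row's region.
region_counts = toy.groupby("geo_level_1_id")["damage_grade"].transform("size")

# P(damage_grade | region) for each row's own (region, damage_grade) pair.
toy["prob_cond_geo_level_1"] = pair_counts / region_counts
print(toy)
```

In region 0, one of four buildings has grade 1 and three have grade 2, so those rows get 0.25 and 0.75; every building in region 1 has grade 3, so those rows get 1.0.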

In [5]:
# Function to replace a categorical feature that has many values with the conditional
# probability of each row's target value given that category, P(columns | index)
def categoricalvalues_condprob(data, index, columns, new_column_name):
  # P(index): share of rows in each category
  probs = data.groupby(index).size().div(len(data))
  # P(index, columns) / P(index) = P(columns | index), kept in long format
  probs_group = data.groupby([index, columns]).size().div(len(data)).div(probs, axis=0, level=index).reset_index()
  probs_group.columns = [index, columns, new_column_name]

  # Add the probability column to the main dataset, matching on category and target value
  data_merge = data.merge(probs_group, on=[index, columns], how='left')

  # Get rid of the categorical feature
  data_merge = data_merge.drop(index, axis=1)

  # Return dataset
  return data_merge

# Apply the function
train_merge = categoricalvalues_condprob(train_merge, 'geo_level_1_id', 'damage_grade', 'prob_cond_geo_level_1')
train_merge = categoricalvalues_condprob(train_merge, 'geo_level_2_id', 'damage_grade', 'prob_cond_geo_level_2')
train_merge = categoricalvalues_condprob(train_merge, 'geo_level_3_id', 'damage_grade', 'prob_cond_geo_level_3')

train_merge

| | building_id | count_floors_pre_eq | age | area_percentage | height_percentage | land_surface_condition | foundation_type | roof_type | ground_floor_type | other_floor_type | ... | has_secondary_use_school | has_secondary_use_industry | has_secondary_use_health_post | has_secondary_use_gov_office | has_secondary_use_use_police | has_secondary_use_other | damage_grade | prob_cond_geo_level_1 | prob_cond_geo_level_2 | prob_cond_geo_level_3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 802906 | 2 | 30 | 6 | 5 | t | r | n | f | q | ... | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0.248185 | 0.744444 | 0.837838 |
| 1 | 28830 | 2 | 10 | 8 | 7 | o | r | n | x | q | ... | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0.446174 | 0.492462 | 0.812500 |
| 2 | 94947 | 2 | 10 | 5 | 5 | t | r | n | f | x | ... | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0.584996 | 0.601136 | 0.610294 |
| 3 | 590882 | 2 | 10 | 6 | 5 | t | r | n | f | x | ... | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0.739603 | 0.853659 | 0.838710 |
| 4 | 201944 | 3 | 30 | 8 | 9 | t | r | n | f | x | ... | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0.384672 | 0.378613 | 0.377049 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 260596 | 688636 | 1 | 55 | 6 | 3 | n | r | n | f | j | ... | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0.779516 | 0.724138 | 0.928571 |
| 260597 | 669485 | 2 | 0 | 6 | 5 | t | r | n | f | q | ... | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0.807546 | 0.934866 | 0.979592 |
| 260598 | 602512 | 3 | 55 | 6 | 7 | t | r | q | f | q | ... | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0.807546 | 0.918919 | 0.863636 |
| 260599 | 151409 | 2 | 10 | 14 | 6 | t | r | x | v | s | ... | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0.559142 | 0.452947 | 0.766949 |
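Because the merge in `categoricalvalues_condprob` matches on damage_grade, the same call cannot be used on `test_values`, which has no labels. One possible variant, sketched here on made-up toy data, builds one probability column per damage grade from the training set and joins it onto unlabeled rows by region id alone. The helper name `condprob_table`, the `_grade_` column suffix, and the toy frames are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

def condprob_table(train, index, target, prefix):
    """Hypothetical helper: one column per target class holding P(target | index)."""
    # Row-normalised contingency table: each row sums to 1.
    table = pd.crosstab(train[index], train[target], normalize="index")
    table.columns = [f"{prefix}_grade_{c}" for c in table.columns]
    return table.reset_index()

# Toy data: column names mirror the dataset, values are made up for illustration.
toy_train = pd.DataFrame({
    "geo_level_1_id": [0, 0, 0, 0, 1, 1],
    "damage_grade":   [1, 2, 2, 2, 3, 3],
})
toy_test = pd.DataFrame({"geo_level_1_id": [1, 0, 2]})

# Probabilities are estimated on the training rows only...
table = condprob_table(toy_train, "geo_level_1_id", "damage_grade", "prob_cond_geo_level_1")
# ...and joined onto the unlabeled rows by region id.
toy_test_encoded = toy_test.merge(table, on="geo_level_1_id", how="left")
print(toy_test_encoded)
```

Region 2 never appears in the toy training data, so its probability columns come back as NaN after the left merge; how to fill such gaps (for example with the overall class frequencies) is a separate modelling choice.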


# **Feature selection** 