# Data Exploration: Baseline Model Data

The data contains variables that describe the effects of tropical cyclones (TC) on municipalities in the Philippines. This notebook explores the data to better understand what I'm working with and what it implies for the research objective: "how to detect algorithm bias".

In [1]:
# handling dataframes and csv files
import pandas as pd
import csv
import numpy as np

In [2]:
# reading data files

path_base_data = '/Users/masinde/Documents/phd/causal fairness/data/baseline/cleaned_input_data.csv'
path_inc_data = '/Users/masinde/Documents/phd/causal fairness/data/extra_data/phl_income_cls.csv'
#path_vul_data = ''

## baseline data (municipality data)
base_df = pd.read_csv(path_base_data)

## income class data
inc_df = pd.read_csv(path_inc_data)

# vulnerability data



In [3]:
# first 5 rows of base_df
base_df.head(5)

Unnamed: 0.1,Unnamed: 0,Mun_Code,typhoon,HAZ_rainfall_Total,HAZ_rainfall_max_6h,HAZ_rainfall_max_24h,HAZ_v_max,HAZ_dis_track_min,GEN_landslide_per,GEN_stormsurge_per,...,VUL_StrongRoof_SalvageWall,VUL_LightRoof_StrongWall,VUL_LightRoof_LightWall,VUL_LightRoof_SalvageWall,VUL_SalvagedRoof_StrongWall,VUL_SalvagedRoof_LightWall,VUL_SalvagedRoof_SalvageWall,VUL_vulnerable_groups,VUL_pantawid_pamilya_beneficiary,new_y
0,1,PH175101000,durian2006,185.828571,14.716071,7.381696,55.032241,2.478142,2.64,6.18,...,0.097425,2.533055,41.892832,1.002088,0.0,0.027836,0.083507,2.951511,46.931106,3.632568
1,2,PH030801000,durian2006,28.4875,1.89375,1.070833,23.402905,136.527982,0.78,40.87,...,0.118842,0.248487,2.182368,0.0,0.0,0.010804,0.010804,0.867603,8.967156,0.0
2,3,PH083701000,durian2006,8.81875,0.455208,0.255319,8.72838,288.358553,0.06,0.0,...,0.850008,1.218595,13.645253,0.54912,0.030089,0.090266,0.112833,3.338873,25.989168,0.0
3,5,PH015501000,durian2006,24.175,2.408333,0.957639,10.945624,274.953818,1.52,1.28,...,0.197179,0.667374,15.592295,0.075838,0.0,0.015168,0.075838,2.131755,32.185651,0.0
4,7,PH015502000,durian2006,14.93,1.65,0.58625,12.108701,252.828578,0.0,0.0,...,0.279362,0.675125,7.100454,0.02328,0.01164,0.0,0.128041,1.589369,29.612385,0.0


In [4]:
# datatypes of base_df
base_df.dtypes

Unnamed: 0                            int64
Mun_Code                             object
typhoon                              object
HAZ_rainfall_Total                  float64
HAZ_rainfall_max_6h                 float64
HAZ_rainfall_max_24h                float64
HAZ_v_max                           float64
HAZ_dis_track_min                   float64
GEN_landslide_per                   float64
GEN_stormsurge_per                  float64
GEN_Bu_p_inSSA                      float64
GEN_Bu_p_LS                         float64
GEN_Red_per_LSbldg                  float64
GEN_Or_per_LSblg                    float64
GEN_Yel_per_LSSAb                   float64
GEN_RED_per_SSAbldg                 float64
GEN_OR_per_SSAbldg                  float64
GEN_Yellow_per_LSbl                 float64
TOP_mean_slope                      float64
TOP_mean_elevation_m                float64
TOP_ruggedness_stdev                float64
TOP_mean_ruggedness                 float64
TOP_slope_stdev                 

In [5]:
# first 5 rows of inc_df
inc_df.head(5)

Unnamed: 0.1,Unnamed: 0,Municipality,10 Digit Code,Correspondence Code,Income Class,Population(2020 Census)
0,0,Pateros,1381701000,137606000.0,1st,65227
1,1,Bangued,1400101000,140101000.0,1st,50382
2,2,Boliney,1400102000,140102000.0,5th,4551
3,3,Bucay,1400103000,140103000.0,5th,17953
4,4,Bucloc,1400104000,140104000.0,6th,2395


In [6]:
# data types of the vars in dataframe inc_df?
inc_df.dtypes

Unnamed: 0                   int64
Municipality                object
10 Digit Code                int64
Correspondence Code        float64
Income Class                object
Population(2020 Census)     object
dtype: object

In [7]:
# change the type of the population column from string to integer
#inc_df.astype({'Municipality': 'str'}, {'Income Class': 'str'}, {'Population(2020 Census)': 'int64'}).dtypes

## first remove the thousands separators (commas)
inc_df['Population(2020 Census)'] = inc_df['Population(2020 Census)'].str.replace(',', '')

## convert population to int
#inc_df.astype({'Population(2020 Census)': 'int64'}).dtypes
#inc_df['Population(2020 Census)'].astype(str).astype(int)

inc_df['Population(2020 Census)'] = pd.to_numeric(inc_df['Population(2020 Census)'])

print(inc_df.head(3))

print(inc_df.dtypes)

   Unnamed: 0 Municipality  10 Digit Code  Correspondence Code Income Class  \
0           0      Pateros     1381701000          137606000.0          1st   
1           1     Bangued      1400101000          140101000.0          1st   
2           2      Boliney     1400102000          140102000.0          5th   

   Population(2020 Census)  
0                    65227  
1                    50382  
2                     4551  
Unnamed: 0                   int64
Municipality                object
10 Digit Code                int64
Correspondence Code        float64
Income Class                object
Population(2020 Census)      int64
dtype: object
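
The comma stripping above can also be done at read time. The following is a minimal sketch, assuming the raw CSV stores the population as quoted strings such as "65,227". `pd.read_csv` accepts a `thousands` argument that parses such values directly as integers. The inline CSV text is a made-up stand-in for `phl_income_cls.csv`, not the real file.

```python
import io

import pandas as pd

# Made-up stand-in for the income class file (assumption: commas as thousands separators)
csv_text = (
    "Municipality,Population(2020 Census)\n"
    'Pateros,"65,227"\n'
    'Boliney,"4,551"\n'
)

# thousands=',' tells the parser to treat commas inside numbers as separators
toy = pd.read_csv(io.StringIO(csv_text), thousands=",")

print(toy.dtypes)
print(toy)
```

If the real file contains other non-numeric entries in this column, the column would still be read as `object`, so checking `dtypes` afterwards remains worthwhile.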


# NOTE!!
'Correspondence Code' (in inc_df) matches 'Mun_Code' (in base_df) once the 'PH' prefix is removed from 'Mun_Code'.

In [8]:
# type casting throws an error if there are NaN values, so find them first
nan_rows = inc_df[inc_df['Correspondence Code'].isnull()]

print(nan_rows)

      Unnamed: 0  Municipality  10 Digit Code  Correspondence Code  \
1485        1485     Kapalawan     1999901000                  NaN   
1486        1486  Old Kaabakan     1999902000                  NaN   
1487        1487    Kadayangan     1999903000                  NaN   
1488        1488     Nabalawag     1999904000                  NaN   
1489        1489    Pahamuddin     1999905000                  NaN   
1490        1490     Malidegao     1999906000                  NaN   
1491        1491     Ligawasan     1999907000                  NaN   
1492        1492       Tugunan     1999908000                  NaN   

     Income Class  Population(2020 Census)  
1485            -                    28463  
1486            -                    16658  
1487            -                    25573  
1488            -                    25723  
1489            -                    19627  
1490            -                    36438  
1491            -                    29784  
1492     

In [9]:
# remove rows with NaN values from inc_df
inc_df = inc_df.dropna()

len(inc_df)

1485
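
Note that `dropna()` without arguments drops a row if any column is missing, not only the join key. A hedged sketch of the difference, using a small invented frame (the column names follow `inc_df`, the rows are illustrative only):

```python
import numpy as np
import pandas as pd

# Invented rows; only the column names mirror inc_df
toy = pd.DataFrame(
    {
        "Municipality": ["A", "B", "C"],
        "Correspondence Code": [140101000.0, np.nan, 140103000.0],
        "Income Class": ["1st", "-", None],
    }
)

# Drop only rows where the join key is missing
only_key = toy.dropna(subset=["Correspondence Code"])

# Drop rows with a missing value in any column
any_col = toy.dropna()

print(len(only_key), len(any_col))
```

Restricting the drop to `subset=['Correspondence Code']` would keep municipalities that have a valid code but a gap elsewhere.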

In [24]:
# changing data type of 'Correspondence Code' to string
#inc_df['Correspondence Code'] = inc_df['Correspondence Code'].astype(int).astype(str)

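# to integer
inc_df['Correspondence Code'] = inc_df['Correspondence Code'].astype(int)
# NOTE: 'Mun_Code_2' is created in cell In [29] below; run that cell first,
# otherwise the next line raises a KeyError
base_df['Mun_Code_2']  = base_df['Mun_Code_2'].astype(int)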

print(inc_df.dtypes)

print(base_df.dtypes)

Unnamed: 0                  int64
Municipality               object
10 Digit Code               int64
Correspondence Code         int64
Income Class               object
Population(2020 Census)     int64
dtype: object
Unnamed: 0                            int64
Mun_Code                             object
typhoon                              object
HAZ_rainfall_Total                  float64
HAZ_rainfall_max_6h                 float64
HAZ_rainfall_max_24h                float64
HAZ_v_max                           float64
HAZ_dis_track_min                   float64
GEN_landslide_per                   float64
GEN_stormsurge_per                  float64
GEN_Bu_p_inSSA                      float64
GEN_Bu_p_LS                         float64
GEN_Red_per_LSbldg                  float64
GEN_Or_per_LSblg                    float64
GEN_Yel_per_LSSAb                   float64
GEN_RED_per_SSAbldg                 float64
GEN_OR_per_SSAbldg                  float64
GEN_Yellow_per_LSbl               

In [27]:
print(inc_df['Correspondence Code'])

0       137606000
1       140101000
2       140102000
3       140103000
4       140104000
          ...    
1480    153808000
1481    153837000
1482    153817000
1483    153813000
1484    153816000
Name: Correspondence Code, Length: 1485, dtype: int64


In [29]:
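# change municipality code to digits by removing the 'PH' prefix

# str.strip('PH') removes any leading/trailing 'P' or 'H' characters, not the prefix as a unit;
# anchoring on the start of the string is safer
base_df['Mun_Code_2'] = base_df['Mun_Code'].str.replace(r'^PH', '', regex=True)

# casting to int drops leading zeros (e.g. 'PH030801000' becomes 30801000)
base_df['Mun_Code_2']  = base_df['Mun_Code_2'].astype(int)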

print(base_df['Mun_Code_2'].head(5))

0    175101000
1     30801000
2     83701000
3     15501000
4     15502000
Name: Mun_Code_2, dtype: int64


In [None]:
# check if any Mun_Code_2 values are longer than 9 digits
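
One way to run this check, sketched under the assumption that "greater than 9" refers to the number of digits in the code. The series below holds three values copied from the `Mun_Code_2` output above; in the notebook it would be `base_df['Mun_Code_2']`.

```python
import pandas as pd

# Three codes taken from the Mun_Code_2 output above
codes = pd.Series([175101000, 30801000, 83701000], name="Mun_Code_2")

# Number of digits per code (leading zeros were already lost in the int cast)
n_digits = codes.astype(str).str.len()

# Codes with more than 9 digits would not fit the expected format
too_long = codes[n_digits > 9]

print(n_digits.tolist())
print(len(too_long))
```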

In [30]:
# add municipality name, income class, and population to base_df

#base_df['Municipality'] = np.nan
#base_df['Income Class'] =  np.nan
#base_df['Population'] =  np.nan


base_df_merge = pd.merge(base_df, inc_df, how = 'left', left_on = 'Mun_Code_2', right_on = 'Correspondence Code')

base_df_merge.head(5)

Unnamed: 0,Unnamed: 0_x,Mun_Code,typhoon,HAZ_rainfall_Total,HAZ_rainfall_max_6h,HAZ_rainfall_max_24h,HAZ_v_max,HAZ_dis_track_min,GEN_landslide_per,GEN_stormsurge_per,...,VUL_vulnerable_groups,VUL_pantawid_pamilya_beneficiary,new_y,Mun_Code_2,Unnamed: 0_y,Municipality,10 Digit Code,Correspondence Code,Income Class,Population(2020 Census)
0,1,PH175101000,durian2006,185.828571,14.716071,7.381696,55.032241,2.478142,2.64,6.18,...,2.951511,46.931106,3.632568,175101000,522.0,Abra De Ilog,1705101000.0,175101000.0,2nd,35176.0
1,2,PH030801000,durian2006,28.4875,1.89375,1.070833,23.402905,136.527982,0.78,40.87,...,0.867603,8.967156,0.0,30801000,281.0,Abucay,300801000.0,30801000.0,3rd,42984.0
2,3,PH083701000,durian2006,8.81875,0.455208,0.255319,8.72838,288.358553,0.06,0.0,...,3.338873,25.989168,0.0,83701000,949.0,Abuyog,803701000.0,83701000.0,1st*,61216.0
3,5,PH015501000,durian2006,24.175,2.408333,0.957639,10.945624,274.953818,1.52,1.28,...,2.131755,32.185651,0.0,15501000,148.0,Agno,105501000.0,15501000.0,3rd,29947.0
4,7,PH015502000,durian2006,14.93,1.65,0.58625,12.108701,252.828578,0.0,0.0,...,1.589369,29.612385,0.0,15502000,149.0,Aguilar,105502000.0,15502000.0,3rd,45100.0


After the join, let's check that the row counts match.

In [31]:
# length 
len(base_df_merge)

8981

In [32]:
len(base_df)

8981
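
Equal lengths before and after the left join are consistent with each code appearing only once in `inc_df`, since a duplicated right-hand key would add rows. The sketch below shows that effect on invented toy frames and how `validate='many_to_one'` in `pd.merge` can turn a silent row increase into an error. The key values are placeholders, not real municipality codes.

```python
import pandas as pd

# Invented keys; the right frame deliberately contains a duplicate
left = pd.DataFrame({"Mun_Code_2": [1, 2, 2]})
right_dup = pd.DataFrame(
    {"Correspondence Code": [1, 2, 2], "Income Class": ["1st", "3rd", "3rd"]}
)

# Without validation the left join silently grows
grown = pd.merge(
    left, right_dup, how="left",
    left_on="Mun_Code_2", right_on="Correspondence Code",
)
print(len(left), len(grown))

# With validation pandas raises MergeError when right-hand keys are not unique
try:
    pd.merge(
        left, right_dup, how="left",
        left_on="Mun_Code_2", right_on="Correspondence Code",
        validate="many_to_one",
    )
    caught = False
except pd.errors.MergeError:
    caught = True

print(caught)
```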

In [33]:
# are there any rows in base_df without a match in inc_df?
nan_rows = base_df_merge[base_df_merge['Correspondence Code'].isnull()]

print(nan_rows)

      Unnamed: 0_x     Mun_Code     typhoon  HAZ_rainfall_Total  \
15              28  PH035401000  durian2006           23.300000   
49              79  PH041005000  durian2006          140.171429   
75             119  PH034903000  durian2006           43.610000   
86             140  PH137501000  durian2006           44.266667   
104            175  PH015503000  durian2006           14.933333   
...            ...          ...         ...                 ...   
8772         25607  PH050506000    noul2015           14.050000   
8814         25653  PH051724000    noul2015           39.533333   
8827         25668  PH034919000    noul2015           15.016667   
8896         25747  PH034926000    noul2015           11.660000   
8939         25797  PH034917000    noul2015           13.166667   

      HAZ_rainfall_max_6h  HAZ_rainfall_max_24h  HAZ_v_max  HAZ_dis_track_min  \
15               1.227083              0.880729  17.326242         186.337606   
49              11.227381        

In [40]:
print(base_df.loc[[75]])

    Unnamed: 0     Mun_Code     typhoon  HAZ_rainfall_Total  \
75         119  PH034903000  durian2006               43.61   

    HAZ_rainfall_max_6h  HAZ_rainfall_max_24h  HAZ_v_max  HAZ_dis_track_min  \
75             4.238333              1.506667  14.664434          218.98314   

    GEN_landslide_per  GEN_stormsurge_per  ...  VUL_LightRoof_StrongWall  \
75                0.0                 0.0  ...                  0.210467   

    VUL_LightRoof_LightWall  VUL_LightRoof_SalvageWall  \
75                 2.398722                   0.025375   

    VUL_SalvagedRoof_StrongWall  VUL_SalvagedRoof_LightWall  \
75                     0.029853                    0.164194   

    VUL_SalvagedRoof_SalvageWall  VUL_vulnerable_groups  \
75                       0.30749               0.901297   

    VUL_pantawid_pamilya_beneficiary  new_y  Mun_Code_2  
75                          9.408305    0.0    34903000  

[1 rows x 40 columns]


So after merging, 682 rows (observations) have no income class data because their municipality code has no match in inc_df. These rows belong to 134 distinct municipalities.

In [50]:
# how many distinct municipality codes are among the unmatched rows?
len(nan_rows['Mun_Code_2'].unique())

134
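
The unmatched rows can also be identified without relying on NaN values in a merged column. A minimal sketch, assuming the same key columns as above: `indicator=True` adds a `_merge` column that marks each row as `both` or `left_only`. The frames are invented toy data with placeholder codes and typhoon names.

```python
import pandas as pd

# Invented toy data; code 30 has no counterpart in the right frame
left = pd.DataFrame(
    {"Mun_Code_2": [10, 20, 30, 30], "typhoon": ["t1", "t1", "t1", "t2"]}
)
right = pd.DataFrame(
    {"Correspondence Code": [10, 20], "Income Class": ["2nd", "3rd"]}
)

merged = pd.merge(
    left, right, how="left",
    left_on="Mun_Code_2", right_on="Correspondence Code",
    indicator=True,
)

# Rows that exist only in the left frame have no income class data
unmatched = merged[merged["_merge"] == "left_only"]

n_rows = len(unmatched)                    # unmatched observations
n_mun = unmatched["Mun_Code_2"].nunique()  # distinct unmatched municipalities
rows_per_mun = unmatched["Mun_Code_2"].value_counts()

print(n_rows, n_mun)
print(rows_per_mun)
```

The per-municipality counts show how many typhoon events each unmatched municipality contributes, which helps judge how much data is affected by the missing codes.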