<h1><center> PPOL564 - Data Science I: Foundations </center></h1>
<h3><center> Lecture 8 <br><br><font color='grey'> 
Data Wrangling using Pandas <br> <br> <em>Part 1</em> </font></center></h3>

## Missingness

Above we immediately notice that a large portion of the storm data is missing or incomplete. We need a way to easily assess the extent of the missingness in our data, to drop these observations (if need be), or to plug the holes by filling in data values. 

### `.isna` & `.isnull`: take a census of the missing

In [41]:
storm.isna().head()

Unnamed: 0,begin_yearmonth,begin_day,begin_time,end_yearmonth,end_day,end_time,episode_id,event_id,state,state_fips,year,month_name,event_type,cz_type,cz_fips,cz_name,wfo,begin_date_time,cz_timezone,end_date_time,injuries_direct,injuries_indirect,deaths_direct,deaths_indirect,damage_property,damage_crops,source,magnitude,magnitude_type,flood_cause,category,tor_f_scale,tor_length,tor_width,tor_other_wfo,tor_other_cz_state,tor_other_cz_fips,tor_other_cz_name,begin_range,begin_azimuth,begin_location,end_range,end_azimuth,end_location,begin_lat,begin_lon,end_lat,end_lon,episode_narrative,event_narrative,data_source
0,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,True,False,True,True,True,False,False,False,True,True,True,True,False,True,True,False,True,True,False,False,False,False,True,True,False
1,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,True,False,True,True,True,False,False,False,True,True,True,True,False,True,True,False,True,True,False,False,False,False,True,True,False
2,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,True,False,True,True,True,False,False,False,True,True,True,True,False,True,True,False,True,True,False,False,False,False,True,True,False
3,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,True,False,True,True,True,False,False,False,True,True,True,True,False,True,True,False,True,True,False,False,True,True,True,True,False
4,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,True,False,True,True,True,False,False,False,True,True,True,True,False,True,True,False,True,True,False,False,True,True,True,True,False


In [139]:
missing = storm.isna().sum()
missing

begin_yearmonth         0
begin_day               0
begin_time              0
end_yearmonth           0
end_day                 0
end_time                0
episode_id            223
event_id                0
state                   0
state_fips              0
year                    0
month_name              0
event_type              0
cz_type                 0
cz_fips                 0
cz_name                 0
wfo                   223
begin_date_time         0
cz_timezone             0
end_date_time           0
injuries_direct         0
injuries_indirect       0
deaths_direct           0
deaths_indirect         0
damage_property         0
damage_crops            0
source                223
magnitude               0
magnitude_type        223
flood_cause           223
category              223
tor_f_scale             6
tor_length              0
tor_width               0
tor_other_wfo         223
tor_other_cz_state    223
tor_other_cz_fips     223
tor_other_cz_name     223
begin_range 

`.isna()` and `.isnull()` perform the same operation: `.isnull()` is simply an alias for `.isna()`.

In [43]:
np.all(storm.isnull().sum() == missing)

True

We can make this summary even more informative by dividing the series by the total number of rows (`storm.shape[0]`), which gives the proportion of entries missing in each column.

In [140]:
prop_missing = missing/storm.shape[0]
prop_missing

begin_yearmonth       0.000000
begin_day             0.000000
begin_time            0.000000
end_yearmonth         0.000000
end_day               0.000000
end_time              0.000000
episode_id            1.000000
event_id              0.000000
state                 0.000000
state_fips            0.000000
year                  0.000000
month_name            0.000000
event_type            0.000000
cz_type               0.000000
cz_fips               0.000000
cz_name               0.000000
wfo                   1.000000
begin_date_time       0.000000
cz_timezone           0.000000
end_date_time         0.000000
injuries_direct       0.000000
injuries_indirect     0.000000
deaths_direct         0.000000
deaths_indirect       0.000000
damage_property       0.000000
damage_crops          0.000000
source                1.000000
magnitude             0.000000
magnitude_type        1.000000
flood_cause           1.000000
category              1.000000
tor_f_scale           0.026906
tor_leng

Finally, let's subset the series to look only at the variables that have missing entries. As we can see, 16 variables are _completely missing_ (i.e. there are no data in these columns). Likewise, two variables (the geo-references `end_lat` and `end_lon`) are missing roughly 50% of their entries, and `tor_f_scale` is missing about 3%.

In [45]:
prop_missing[prop_missing>0].sort_values(ascending=False)

event_narrative       1.000000
episode_narrative     1.000000
wfo                   1.000000
source                1.000000
magnitude_type        1.000000
flood_cause           1.000000
category              1.000000
tor_other_wfo         1.000000
tor_other_cz_state    1.000000
tor_other_cz_fips     1.000000
tor_other_cz_name     1.000000
begin_azimuth         1.000000
begin_location        1.000000
end_azimuth           1.000000
end_location          1.000000
episode_id            1.000000
end_lat               0.484305
end_lon               0.484305
tor_f_scale           0.026906
dtype: float64
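
As a side note, the same proportions can be computed in one step, because the mean of a boolean column is the share of `True` values. The sketch below uses a small made-up data frame (`toy`, not the storm data) to show that `.isna().mean()` agrees with dividing the counts by the number of rows.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the storm data (values are made up for illustration).
toy = pd.DataFrame({
    "event_id": [1, 2, 3, 4],
    "end_lat": [35.17, np.nan, np.nan, 40.27],
    "wfo": [np.nan, np.nan, np.nan, np.nan],
})

# The mean of a boolean column is the share of True values,
# so this matches toy.isna().sum() / toy.shape[0].
prop_missing_toy = toy.isna().mean()

# Keep only the columns with at least one missing entry, worst first.
print(prop_missing_toy[prop_missing_toy > 0].sort_values(ascending=False))
```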

Let's `.drop()` the columns that do not contain any data. 

In [46]:
drop_these_vars = prop_missing[prop_missing==1].index
drop_these_vars

Index(['episode_id', 'wfo', 'source', 'magnitude_type', 'flood_cause',
       'category', 'tor_other_wfo', 'tor_other_cz_state', 'tor_other_cz_fips',
       'tor_other_cz_name', 'begin_azimuth', 'begin_location', 'end_azimuth',
       'end_location', 'episode_narrative', 'event_narrative'],
      dtype='object')

Here we create a new object containing the subsetted data.

In [47]:
storm2 = storm.drop(columns=drop_these_vars)

In [48]:
# Compare the dimensions of the two data frames.
print(storm.shape)
print(storm2.shape)

(223, 51)
(223, 35)
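
An alternative route to the same result, sketched here on a made-up data frame (`toy` is hypothetical), is to let `.dropna()` remove the empty columns directly with `axis="columns"` and `how="all"`. This avoids building the list of column names by hand.

```python
import numpy as np
import pandas as pd

# Toy data: "wfo" is entirely missing, "end_lat" only partially.
toy = pd.DataFrame({
    "event_id": [1, 2, 3],
    "end_lat": [35.17, np.nan, 40.27],
    "wfo": [np.nan, np.nan, np.nan],
})

# Drop only the columns in which every entry is missing.
toy2 = toy.dropna(axis="columns", how="all")

# Compare the dimensions of the two data frames.
print(toy.shape)
print(toy2.shape)
```

Note that the partially missing column `end_lat` is retained; only columns with no data at all are removed.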


Note that in dropping columns, we are making a new data object (i.e. a copy of the original).

In [49]:
id(storm)

4505933024

In [50]:
id(storm2)

4509586936
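
The practical consequence is that changes to the new object do not travel back to the original. A minimal sketch on made-up data (the names `toy` and `toy2` are hypothetical):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"a": [1, 2], "b": [np.nan, np.nan]})
toy2 = toy.drop(columns=["b"])

# drop() returned a new object rather than modifying toy.
print(toy2 is toy)  # False

# The original still has both columns.
print(list(toy.columns))

# Changing the new object leaves the original untouched.
toy2.loc[0, "a"] = 100
print(toy.loc[0, "a"])  # 1
```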

### `.fillna()` or `.dropna()`: dealing with incompleteness 

There is a trade-off we always have to make when dealing with missing values. 

1. **list-wise deletion**: ignore them and drop them.


2. **imputation**: guess a plausible value that the data could take on.

Neither method is risk-free. Both potentially distort the data in undesirable ways. The decision on how to deal with missing data is ultimately a **hyperparameter**, i.e. a parameter we can't learn from the model but must specify. Thus, we can adjust how we choose to deal with missing data and look at its downstream impact on model performance, just as we'll do with any machine learning model.

#### List-wise deletion
If we drop the missing values from our storm data, we'll lose roughly half the data. Given our limited sample size, that might not be ideal.

In [51]:
storm3 = storm2.dropna()

In [52]:
storm3.shape

(115, 35)
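
By default `.dropna()` removes a row if _any_ column is missing. If only some variables matter for the analysis, the `subset` argument restricts the check to those columns, which can preserve more rows. A sketch on made-up data (the data frame `toy` and its values are hypothetical):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "event_id": [1, 2, 3, 4],
    "end_lat": [35.17, np.nan, 40.65, 40.27],
    "end_lon": [-99.20, np.nan, -75.47, -76.07],
    "tor_f_scale": ["F1", "F0", np.nan, "F2"],
})

# Default: drop a row if any column is missing (keeps rows 0 and 3).
all_complete = toy.dropna()

# Only require the coordinates to be present (keeps rows 0, 2, and 3).
geo_complete = toy.dropna(subset=["end_lat", "end_lon"])

print(all_complete.shape)
print(geo_complete.shape)
```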

#### Imputation

Rather than dropping observations, we can fill the missing values with a placeholder.

In [141]:
example = storm2.loc[:6,["end_lat","end_lon"]]
example

Unnamed: 0,end_lat,end_lon
0,35.17,-99.2
1,31.73,-98.6
2,40.65,-75.47
3,,
4,,
5,,
6,40.27,-76.07


**Placeholder values**

In [54]:
example.fillna(-99) # Why is this problematic?

Unnamed: 0,end_lat,end_lon
0,35.17,-99.2
1,31.73,-98.6
2,40.65,-75.47
3,-99.0,-99.0
4,-99.0,-99.0
5,-99.0,-99.0
6,40.27,-76.07


In [55]:
example.fillna("Missing") # Why is this problematic?

Unnamed: 0,end_lat,end_lon
0,35.17,-99.2
1,31.73,-98.6
2,40.65,-75.47
3,Missing,Missing
4,Missing,Missing
5,Missing,Missing
6,40.27,-76.07
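
One way to see the problem with both placeholders is to look at what they do to summary statistics and column types. The sketch below rebuilds the `end_lat` values shown above in a small stand-alone data frame (`example_toy` is a hypothetical name) so that it runs on its own.

```python
import numpy as np
import pandas as pd

# Same end_lat values as in the example above.
example_toy = pd.DataFrame({
    "end_lat": [35.17, 31.73, 40.65, np.nan, np.nan, np.nan, 40.27],
})

# With NaN, pandas skips the missing entries when averaging.
mean_nan = example_toy["end_lat"].mean()

# With -99, the placeholder is treated as a real latitude
# and drags the average far away from the observed values.
mean_placeholder = example_toy.fillna(-99)["end_lat"].mean()

# With "Missing", the column is no longer numeric at all.
dtype_string = example_toy.fillna("Missing")["end_lat"].dtype

print(mean_nan)
print(mean_placeholder)
print(dtype_string)
```

A numeric placeholder silently enters every calculation, while a string placeholder converts the column to a generic `object` type on which numeric operations no longer work as intended.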


**Forward-fill**: forward propagation of previous values into current values.

In [56]:
example.ffill() 

Unnamed: 0,end_lat,end_lon
0,35.17,-99.2
1,31.73,-98.6
2,40.65,-75.47
3,40.65,-75.47
4,40.65,-75.47
5,40.65,-75.47
6,40.27,-76.07


**Back-fill**: backward propagation of future values into current values.

In [57]:
example.bfill()

Unnamed: 0,end_lat,end_lon
0,35.17,-99.2
1,31.73,-98.6
2,40.65,-75.47
3,40.27,-76.07
4,40.27,-76.07
5,40.27,-76.07
6,40.27,-76.07


Forward- and back-filling make little sense unless we're explicitly dealing with a time series that tracks the same units over time (e.g. countries). Here, for example, the location of one storm event is being used to plug the holes in the entries of a different, unrelated event, making the filled values meaningless.
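
For contrast, here is a sketch of a setting where forward-filling is more defensible: a made-up country-year panel (the data frame `panel` and its values are hypothetical). Filling within each country, after sorting by time, keeps values from leaking across units.

```python
import numpy as np
import pandas as pd

panel = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B", "B"],
    "year": [2000, 2001, 2002, 2000, 2001, 2002],
    "gdp": [1.0, np.nan, 1.2, np.nan, 5.0, np.nan],
})

# Order observations by time within each unit before filling.
panel = panel.sort_values(["country", "year"])

# Naive forward-fill: country B's first year inherits country A's last value.
panel["gdp_naive"] = panel["gdp"].ffill()

# Group-wise forward-fill: values only propagate within the same country.
panel["gdp_ffill"] = panel.groupby("country")["gdp"].ffill()

print(panel)
```

In the group-wise version, country B's first year stays missing because there is no earlier observation for that country to carry forward.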

Note that this <u>_barely scratches the surface of imputation techniques_</u>. But we'll always want to think carefully about what it means to manufacture data when data doesn't exist.