# BW \#68 Dangerously hot weather
As we start to enjoy (or not!) warmer weather, I thought it might be interesting to dig into the data to see whether heat-related fatalities are really increasing and, if so, by how much.

## Data and six questions
On the National Weather Service's hazards list, there is a link to download the 80-year summary of all weather-related fatalities in the United States:

https://www.weather.gov/media/hazstat/80years_2023.pdf

As you can see from the file extension, it's a PDF file. You'll want to use the Tabula-py (https://tabula-py.readthedocs.io/en/latest/) package to read this into Pandas.


## Challenges
The learning goals include working with PDF files, nullable dtypes, plotting, and correlations.
- Download the PDF file describing extreme weather incidents. Read the table into a data frame. We don't need the final "All Wx Fatalities" column. We also don't need the final three rows with summaries and totals. Ensure that both header rows are used for the header names. How much memory is being used? What dtypes are being used?

- Set all columns to be of type `pd.Int16Dtype` except for where `pd.Float64Dtype` or `pd.StringDtype` would be more appropriate. Remove any rows containing only NA values. Set "Year" to be the index. How much memory (if any) do you save by using these dtypes?


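tabula-py requires a Java environment, so let's first check the Java setup on your machine. If Java isn't installed, download the Java Development Kit (JDK) from the official Oracle website. The cell below adds the JDK's bin directory to PATH for the current session; adjust the path to match your own installation.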

In [1]:
import pandas as pd

In [2]:
import os
jdk_path = r"C:\Program Files\Java\jdk-22\bin"
os.environ['PATH'] = jdk_path + os.pathsep + os.environ['PATH']
os.environ['PATH']

'C:\\Program Files\\Java\\jdk-22\\bin;c:\\Users\\npigeon\\AppData\\Local\\miniconda3;C:\\Users\\npigeon\\AppData\\Local\\miniconda3;C:\\Users\\npigeon\\AppData\\Local\\miniconda3\\Library\\mingw-w64\\bin;C:\\Users\\npigeon\\AppData\\Local\\miniconda3\\Library\\usr\\bin;C:\\Users\\npigeon\\AppData\\Local\\miniconda3\\Library\\bin;C:\\Users\\npigeon\\AppData\\Local\\miniconda3\\Scripts;C:\\Users\\npigeon\\AppData\\Local\\miniconda3\\bin;C:\\Users\\npigeon\\AppData\\Local\\miniconda3\\condabin;C:\\WINDOWS\\system32;C:\\WINDOWS;C:\\WINDOWS\\System32\\Wbem;C:\\WINDOWS\\System32\\WindowsPowerShell\\v1.0;C:\\WINDOWS\\System32\\OpenSSH;C:\\Program Files\\Ubisoft\\SRM\\Remedy;C:\\Program Files\\Git\\cmd;C:\\Users\\npigeon\\AppData\\Local\\Programs\\Python\\Python310\\Scripts;C:\\Users\\npigeon\\AppData\\Local\\Programs\\Python\\Python310;C:\\Users\\npigeon\\AppData\\Local\\Programs\\Python\\Python312\\Scripts;C:\\Users\\npigeon\\AppData\\Local\\Programs\\Python\\Python312;C:\\Users\\npigeon\\Ap

In [3]:
!java -version

java version "22.0.2" 2024-07-16
Java(TM) SE Runtime Environment (build 22.0.2+9-70)
Java HotSpot(TM) 64-Bit Server VM (build 22.0.2+9-70, mixed mode, sharing)
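
If you'd rather not rely on the `!` shell escape, the same check can be sketched in plain Python. This is only an illustration: the helper name `java_available` is made up for this example and is not part of tabula-py.

```python
import shutil
import subprocess


def java_available() -> bool:
    """Return True if a `java` executable can be found on PATH (hypothetical helper)."""
    return shutil.which("java") is not None


if java_available():
    # `java -version` writes its report to stderr rather than stdout
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    print(result.stderr.strip())
else:
    print("No java executable found on PATH; tabula-py will not be able to run.")
```

If this prints the "not found" message, fix PATH (as in the cell above) before going further.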


After confirming the Java environment, install tabula-py with pip.

In [4]:
!pip install -q tabula-py


Before trying tabula-py, check your environment with its environment_info() function, which shows the Python version, the Java version, and details of your operating system.

In [5]:
import tabula

tabula.environment_info()

Python version:
    3.12.2 | packaged by Anaconda, Inc. | (main, Feb 27 2024, 17:28:07) [MSC v.1916 64 bit (AMD64)]
Java version:
    java version "22.0.2" 2024-07-16
Java(TM) SE Runtime Environment (build 22.0.2+9-70)
Java HotSpot(TM) 64-Bit Server VM (build 22.0.2+9-70, mixed mode, sharing)
tabula-py version: 2.9.3
platform: Windows-10-10.0.19045-SP0
uname:
    uname_result(system='Windows', node='FLO-LAP-097404', release='10', version='10.0.19045', machine='AMD64')
linux_distribution: ('', '', '')
mac_ver: ('', ('', '', ''), '')


Let's read the PDF. tabula-py's read_pdf() function can load a PDF from a local path, a URL, or a file-like object; here we read a local copy of the file downloaded from the National Weather Service.

The result of calling “read_pdf” in this way isn’t a data frame. Rather, it’s a list of data frames. We only want one, and it’s the first one, so we’ll use “[0]” to retrieve the first element of that list. We also invoked “dropna” to remove any row containing NaN values. That’s because I found that a row containing nothing but NaNs somehow got into the data frame when I imported it. By default, “dropna” removes any row containing even one NaN value. That’s too strict for our purposes, so I used the “thresh” keyword argument to say that as long as we have at least 4 good values, we should keep the row.

In [19]:
from tabula import read_pdf
filename = "C:\\Users\\npigeon\\Git\\BW #68 Dangerously hot weather\\80years_2023.pdf"
df = (
    read_pdf(filename,
             pages=1,  # We only want the first page
             multiple_tables=False,  # We only have one table per page
             pandas_options={'header': [0, 1]}  # The first row is taken to be the header, but it’s actually a two-line header
             )
    [0]
    .drop('All Wx', level=0, axis='columns')  # We don’t need the “All Wx” column
    .iloc[:-3]  # We don’t need the final three lines
    .dropna(thresh=4)  # Drop rows that have fewer than four non-NA values
)
df

Unnamed: 0_level_0,Year,Lightning,Tornado,Flood,Hurricane,Heat,Cold,Winter,Rip Curr.,Wind,All Hazard
Unnamed: 0_level_1,Unnamed: 0_level_1,Fatalities,Fatalities,Fatalities,Fatalities,Fatalities,Fatalities,Fatalities,Fatalities,Fatalities,Damages (M)
0,1940,340.0,65.0,60.0,51.0,,,,,,
1,1941,388.0,53.0,47.0,10.0,,,,,,
2,1942,372.0,384.0,68.0,8.0,,,,,,
3,1943,432.0,58.0,107.0,16.0,,,,,,
4,1944,419.0,275.0,33.0,64.0,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...
80,2019,20.0,42.0,92.0,0.0,187,35.0,27.0,74.0,51.0,"$7,657.01"
81,2020,17.0,76.0,57.0,24.0,350,13.0,29.0,82.0,55.0,"$27,311.25"
82,2021,11.0,104.0,146.0,12.0,375,106.0,40.0,111.0,56.0,"$18,994.55"
83,2022,19.0,23.0,93.0,116.0,383,22.0,66.0,69.0,55.0,"$21,698.58"
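
To see what “thresh” changes, here is a small sketch on made-up data (the values and the `toy` frame are purely illustrative, not taken from the PDF). One row is entirely empty, and one column is missing everywhere, much like the early years in the real table.

```python
import numpy as np
import pandas as pd

# Toy frame: row 1 is completely empty, and "Heat" is missing in every row
toy = pd.DataFrame({
    "Year": [1940, np.nan, 1941],
    "Lightning": [340, np.nan, 388],
    "Tornado": [65, np.nan, 53],
    "Flood": [60, np.nan, 47],
    "Heat": [np.nan, np.nan, np.nan],
})

strict = toy.dropna()            # drops any row with at least one NaN
lenient = toy.dropna(thresh=4)   # keeps rows with at least 4 non-NA values

print(len(strict), len(lenient))
```

The default call throws away every row, because “Heat” is always missing. With “thresh=4”, only the empty row is removed.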


We have a very strange multi-index of column names; they are simply spread across two lines. 

#### How can we make the column names a bit more normal?
We must find some way to combine the two parts of the index into one.

In [7]:
df.columns

MultiIndex([(      'Year', 'Unnamed: 0_level_1'),
            ( 'Lightning',         'Fatalities'),
            (   'Tornado',         'Fatalities'),
            (     'Flood',         'Fatalities'),
            ( 'Hurricane',         'Fatalities'),
            (      'Heat',         'Fatalities'),
            (      'Cold',         'Fatalities'),
            (    'Winter',         'Fatalities'),
            ( 'Rip Curr.',         'Fatalities'),
            (      'Wind',         'Fatalities'),
            ('All Hazard',        'Damages (M)')],
           )

The multi-index is effectively a list of tuples. Since we know that each tuple contains two strings (from the outer and inner levels), maybe we can just iterate over each tuple, join the two parts together, and assign the result to “df.columns”.
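
A minimal sketch of that naive approach, using a hand-built three-column multi-index rather than the real one. Joining with a space is my assumption here; the notebook's own cell further down joins with the empty string.

```python
import pandas as pd

# A cut-down stand-in for df.columns
columns = pd.MultiIndex.from_tuples([
    ("Year", "Unnamed: 0_level_1"),
    ("Lightning", "Fatalities"),
    ("All Hazard", "Damages (M)"),
])

# Iterating over a MultiIndex yields one tuple per column
naive_names = [" ".join(one_t) for one_t in columns]
print(naive_names)
```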

However, because the first column had a value for the outer layer and no value for the inner layer, Pandas provided one, “Unnamed: 0_level_1”. Which means that after running such a comprehension, the first column has a name of “Year Unnamed: 0_level_1”. Not wrong, but not exactly what I would want.

I thus want to join the two parts of the tuple together unless the second part begins with “Unnamed”. In such a case, I’ll just take the first part. That sounds like an “if” statement in Python, but we can’t really put “if” statements, or any other statements, in list comprehensions. We can only include expressions. Fortunately, Python's conditional expression (“x if condition else y”) is an expression, so it can go inside the comprehension.

In [20]:
df.columns = [one_t[0] 
                if one_t[1].startswith('Unnamed') 
                else ''.join(one_t)
             for one_t in df.columns]
df

Unnamed: 0,Year,LightningFatalities,TornadoFatalities,FloodFatalities,HurricaneFatalities,HeatFatalities,ColdFatalities,WinterFatalities,Rip Curr.Fatalities,WindFatalities,All HazardDamages (M)
0,1940,340.0,65.0,60.0,51.0,,,,,,
1,1941,388.0,53.0,47.0,10.0,,,,,,
2,1942,372.0,384.0,68.0,8.0,,,,,,
3,1943,432.0,58.0,107.0,16.0,,,,,,
4,1944,419.0,275.0,33.0,64.0,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...
80,2019,20.0,42.0,92.0,0.0,187,35.0,27.0,74.0,51.0,"$7,657.01"
81,2020,17.0,76.0,57.0,24.0,350,13.0,29.0,82.0,55.0,"$27,311.25"
82,2021,11.0,104.0,146.0,12.0,375,106.0,40.0,111.0,56.0,"$18,994.55"
83,2022,19.0,23.0,93.0,116.0,383,22.0,66.0,69.0,55.0,"$21,698.58"


In [21]:
df['Year'] = df['Year'].astype('int16')

In [10]:
df.memory_usage(deep=True).sum()


13293

The data frame uses 13,293 bytes of memory.
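
The “deep=True” argument matters whenever a data frame has object columns holding Python strings. Here is a small sketch on made-up data (the `toy` frame is illustrative only) comparing the two measurements:

```python
import pandas as pd

toy = pd.DataFrame({
    "count": [1.0, 2.0, 3.0],
    # Force the object dtype so the strings are stored as Python objects
    "damages": pd.Series(["$7,657.01", "$27,311.25", "$18,994.55"], dtype=object),
})

shallow = toy.memory_usage().sum()        # counts only the pointers for object columns
deep = toy.memory_usage(deep=True).sum()  # also counts the strings themselves

print(shallow, deep)
```

The shallow figure undercounts, which is also why “df.info()” reports memory with a trailing “+” when object columns are present.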

#### Set all columns to be of type `pd.Int16Dtype` except for where `pd.Float64Dtype` or `pd.StringDtype` would be more appropriate. Remove any rows containing only NA values. Set "Year" to be the index. How much memory (if any) do you save by using these dtypes?

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 84 entries, 0 to 84
Data columns (total 11 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Year                   84 non-null     int16  
 1   LightningFatalities    84 non-null     float64
 2   TornadoFatalities      84 non-null     float64
 3   FloodFatalities        84 non-null     float64
 4   HurricaneFatalities    84 non-null     float64
 5   HeatFatalities         38 non-null     object 
 6   ColdFatalities         36 non-null     float64
 7   WinterFatalities       38 non-null     float64
 8   Rip Curr.Fatalities    22 non-null     float64
 9   WindFatalities         29 non-null     float64
 10  All HazardDamages (M)  36 non-null     object 
dtypes: float64(8), int16(1), object(2)
memory usage: 7.4+ KB


NaN is a float value. We want to handle missing values with appropriate data types and implement these changes in a way that could potentially save memory.

In Pandas, when a column of integers contains NaN, the entire column is forced to be of type float64, and a column of strings with missing values is stored with the generic object dtype. This is not ideal, because it can lead to unnecessary memory usage and hides the column's intended type.

Pandas provides nullable types (pd.Int16Dtype, pd.Float64Dtype, pd.StringDtype) that allow you to keep the column's intended type (integer, float, string) even when there are missing values. Instead of using NaN, these types use pd.NA, which is a special missing value indicator that works with these types without changing the column to a less efficient or incorrect type.
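
A short sketch of the difference, on three made-up values:

```python
import numpy as np
import pandas as pd

# With NaN in the data, Pandas falls back to float64
classic = pd.Series([187, 350, np.nan])

# With a nullable dtype, the integers stay integers and the gap becomes pd.NA
nullable = pd.Series([187, 350, pd.NA], dtype=pd.Int16Dtype())

# An existing float column holding whole numbers can be converted the same way
converted = classic.astype(pd.Int16Dtype())

print(classic.dtype, nullable.dtype, converted.dtype)
```

Aggregations such as “sum” skip pd.NA by default, just as they skip NaN.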

We could convert each column to its most appropriate nullable type by calling astype on one series at a time, but there is a more convenient way: the data frame version of astype lets us convert many columns at once. We indicate the destination types in a dict, with the column names as the keys and the new types as the values.

But wait: what extension type should we provide? In most cases, as I indicated in the question, we’ll use “pd.Int16Dtype”. That works for all columns except one, “All HazardDamages (M)”, which should use “pd.Float64Dtype”. If you have many columns and most of them should be of the same type (e.g., pd.Int16Dtype()), it would be cumbersome to write out a dictionary with all of the column names by hand. Instead, you can use a dictionary comprehension to automate this process.

Suppose you want to set most columns to pd.Int16Dtype() except for one specific column. You can create the dictionary that will be passed to astype like this:

In [12]:
dtypes_dict = {col: pd.Int16Dtype() for col in df.columns}
dtypes_dict['All HazardDamages (M)'] = pd.Float64Dtype()


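Before passing this dictionary to the DataFrame version of astype, we need to clean the two columns that Pandas read as strings (dtype object), removing every character that isn't part of a number: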

In [24]:
# Replace \D (i.e., any non-digit character) with the empty string.
# That takes care of commas, allowing us to have integers in the column.
df['HeatFatalities'] = (
    df['HeatFatalities']
    .str.replace(r'\D', '', regex=True)
)

# Replace [^\d.] (i.e., any character that is not a digit or a period) with the empty string.
# That takes care of dollar signs and commas, allowing us to have floats in the column.
df['All HazardDamages (M)'] = (
    df['All HazardDamages (M)']
    .str.replace(r'[^\d.]', '', regex=True)
)
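
To check what the second pattern does, here is a sketch on a few values copied from the table above plus one missing entry. Going through `pd.to_numeric` before the nullable cast is my own choice for this example, to make the string-to-number step explicit; the notebook itself hands the cleaned strings straight to astype.

```python
import numpy as np
import pandas as pd

damages = pd.Series(["$7,657.01", "$27,311.25", np.nan], dtype=object)

# Keep only digits and periods; missing values stay missing
cleaned = damages.str.replace(r"[^\d.]", "", regex=True)

# Parse the strings as floats, then move to the nullable Float64 dtype
as_float = pd.to_numeric(cleaned).astype(pd.Float64Dtype())

print(cleaned.tolist()[:2])
print(as_float.dtype)
```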


In [13]:
# Apply the type casting for the entire DataFrame
df = df.astype(dtypes_dict)
df

Unnamed: 0,Year,LightningFatalities,TornadoFatalities,FloodFatalities,HurricaneFatalities,HeatFatalities,ColdFatalities,WinterFatalities,Rip Curr.Fatalities,WindFatalities,All HazardDamages (M)
0,1940,340,65,60,51,,,,,,
1,1941,388,53,47,10,,,,,,
2,1942,372,384,68,8,,,,,,
3,1943,432,58,107,16,,,,,,
4,1944,419,275,33,64,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...
80,2019,20,42,92,0,187,35,27,74,51,7657.01
81,2020,17,76,57,24,350,13,29,82,55,27311.25
82,2021,11,104,146,12,375,106,40,111,56,18994.55
83,2022,19,23,93,116,383,22,66,69,55,21698.58

Another way to build the same mapping is with a “defaultdict” whose default value is “pd.Int16Dtype()”, so that only the exceptions need to be spelled out. We can then convert the columns and set “Year” to be the index in one chained expression:


In [25]:
from collections import defaultdict
conversion_dtypes = defaultdict(pd.Int16Dtype)
conversion_dtypes['All HazardDamages (M)'] = pd.Float64Dtype()
conversions = {column_name: conversion_dtypes[column_name]
              for column_name in df.columns}
df = (
    df
    .astype(conversions)
    .set_index('Year')
)


In [27]:
df

Year,LightningFatalities,TornadoFatalities,FloodFatalities,HurricaneFatalities,HeatFatalities,ColdFatalities,WinterFatalities,Rip Curr.Fatalities,WindFatalities,All HazardDamages (M)
1940,340,65,60,51,,,,,,
1941,388,53,47,10,,,,,,
1942,372,384,68,8,,,,,,
1943,432,58,107,16,,,,,,
1944,419,275,33,64,,,,,,
...,...,...,...,...,...,...,...,...,...,...
2019,20,42,92,0,187,35,27,74,51,7657.01
2020,17,76,57,24,350,13,29,82,55,27311.25
2021,11,104,146,12,375,106,40,111,56,18994.55
2022,19,23,93,116,383,22,66,69,55,21698.58


In [26]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 84 entries, 1940 to 2023
Data columns (total 10 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   LightningFatalities    84 non-null     Int16  
 1   TornadoFatalities      84 non-null     Int16  
 2   FloodFatalities        84 non-null     Int16  
 3   HurricaneFatalities    84 non-null     Int16  
 4   HeatFatalities         38 non-null     Int16  
 5   ColdFatalities         36 non-null     Int16  
 6   WinterFatalities       38 non-null     Int16  
 7   Rip Curr.Fatalities    22 non-null     Int16  
 8   WindFatalities         29 non-null     Int16  
 9   All HazardDamages (M)  36 non-null     Float64
dtypes: Float64(1), Int16(9)
memory usage: 3.2 KB


In [28]:
df.memory_usage(deep=True).sum()

3276

In other words, we’re using about one quarter of the memory we started with (3,276 bytes versus 13,293). And we can now use pd.NA, which is more elegant.
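
A rough way to see where the saving on the numeric columns comes from. A float64 value takes 8 bytes, whereas a nullable Int16 column stores a 2-byte integer plus a 1-byte missing-value flag per entry. The sketch below uses 84 made-up values, the same number of rows as our data frame.

```python
import numpy as np
import pandas as pd

n = 84  # same number of rows as the data frame above
values = np.arange(n, dtype="float64")

as_float64 = pd.Series(values)
as_int16 = pd.Series(values).astype(pd.Int16Dtype())

# Ignore the index so that only the column data is measured
float_bytes = as_float64.memory_usage(index=False)
nullable_bytes = as_int16.memory_usage(index=False)

print(float_bytes, nullable_bytes)
```

The rest of the difference in our data frame comes from the two object columns, whose Python strings were replaced by compact numeric values.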