# 4. PySpark Mini-project 

## Dataset

This analysis studies storm events in the US during 2019, using the NOAA National Weather Service dataset available [here](https://www.ncdc.noaa.gov/stormevents/ftp.jsp). The **data** folder contains three CSV files with the event details, fatalities and locations of every event.

In [1]:
# import SparkSession (entry point) and SQLContext (used for the SQL queries below)
from pyspark.sql import SparkSession, SQLContext

In [2]:
sc = SparkSession.builder.appName("pysparkDataframes").getOrCreate()

In [3]:
# sc is a SparkSession, so pass its underlying SparkContext to SQLContext
sqlContext = SQLContext(sc.sparkContext)

In [4]:
dfStorm2019 = sc.read.format('csv')\
                .option('header', 'true')\
                .option('delimiter', ',')\
                .option('inferSchema', 'true')\
                .load('../pyspark/data/StormEvents_details-ftp_v1.0_d2019_c20200317.csv.gz')

In [5]:
dfFatalities2019 = sc.read.format('csv')\
                    .option('header', 'true')\
                    .option('delimiter', ',')\
                    .option('inferSchema', 'true')\
                    .load('../pyspark/data/StormEvents_fatalities-ftp_v1.0_d2019_c20200317.csv.gz')

In [6]:
dfLocations2019 = sc.read.format('csv')\
                    .option('header', 'true')\
                    .option('delimiter', ',')\
                    .option('inferSchema', 'true')\
                    .load('../pyspark/data/StormEvents_locations-ftp_v1.0_d2019_c20200317.csv.gz')

Displaying the schema of every dataset:

In [7]:
dfStorm2019.printSchema()

root
 |-- BEGIN_YEARMONTH: integer (nullable = true)
 |-- BEGIN_DAY: integer (nullable = true)
 |-- BEGIN_TIME: integer (nullable = true)
 |-- END_YEARMONTH: integer (nullable = true)
 |-- END_DAY: integer (nullable = true)
 |-- END_TIME: integer (nullable = true)
 |-- EPISODE_ID: integer (nullable = true)
 |-- EVENT_ID: integer (nullable = true)
 |-- STATE: string (nullable = true)
 |-- STATE_FIPS: integer (nullable = true)
 |-- YEAR: integer (nullable = true)
 |-- MONTH_NAME: string (nullable = true)
 |-- EVENT_TYPE: string (nullable = true)
 |-- CZ_TYPE: string (nullable = true)
 |-- CZ_FIPS: integer (nullable = true)
 |-- CZ_NAME: string (nullable = true)
 |-- WFO: string (nullable = true)
 |-- BEGIN_DATE_TIME: string (nullable = true)
 |-- CZ_TIMEZONE: string (nullable = true)
 |-- END_DATE_TIME: string (nullable = true)
 |-- INJURIES_DIRECT: integer (nullable = true)
 |-- INJURIES_INDIRECT: integer (nullable = true)
 |-- DEATHS_DIRECT: integer (nullable = true)
 |-- DEATHS_INDIRE

In [8]:
dfFatalities2019.printSchema()

root
 |-- FAT_YEARMONTH: integer (nullable = true)
 |-- FAT_DAY: integer (nullable = true)
 |-- FAT_TIME: integer (nullable = true)
 |-- FATALITY_ID: integer (nullable = true)
 |-- EVENT_ID: integer (nullable = true)
 |-- FATALITY_TYPE: string (nullable = true)
 |-- FATALITY_DATE: string (nullable = true)
 |-- FATALITY_AGE: integer (nullable = true)
 |-- FATALITY_SEX: string (nullable = true)
 |-- FATALITY_LOCATION: string (nullable = true)
 |-- EVENT_YEARMONTH: integer (nullable = true)



In [9]:
dfLocations2019.printSchema()

root
 |-- YEARMONTH: integer (nullable = true)
 |-- EPISODE_ID: integer (nullable = true)
 |-- EVENT_ID: integer (nullable = true)
 |-- LOCATION_INDEX: integer (nullable = true)
 |-- RANGE: double (nullable = true)
 |-- AZIMUTH: string (nullable = true)
 |-- LOCATION: string (nullable = true)
 |-- LATITUDE: double (nullable = true)
 |-- LONGITUDE: double (nullable = true)
 |-- LAT2: integer (nullable = true)
 |-- LON2: integer (nullable = true)



DataFrames have a **registerTempTable** method that registers them as temporary tables so they can be queried with Spark SQL; query results can then be saved using **.write.saveAsTable(name_table)**. RDDs don't have this method because they are not structured tables.

Note: **registerTempTable** is deprecated. Use **createOrReplaceTempView** instead.

In [10]:
dfStorm2019.createOrReplaceTempView('stormDetails_table')

### 1. Type of Events

In [11]:
sqlContext.sql('SELECT \
                    EVENT_TYPE, \
                    COUNT(EVENT_TYPE) AS NUMBER_EPISODES\
                    FROM stormDetails_table \
                    GROUP BY EVENT_TYPE \
                    ORDER BY COUNT(EVENT_TYPE) DESC').show()

+--------------------+---------------+
|          EVENT_TYPE|NUMBER_EPISODES|
+--------------------+---------------+
|   Thunderstorm Wind|          18617|
|                Hail|           9013|
|               Flood|           4943|
|         Flash Flood|           4068|
|      Winter Weather|           3800|
|           High Wind|           3743|
|        Winter Storm|           3312|
|          Heavy Snow|           2844|
|Marine Thundersto...|           2502|
|             Tornado|           1727|
|         Strong Wind|           1590|
|          Heavy Rain|           1416|
|                Heat|           1291|
|Extreme Cold/Wind...|           1065|
|             Drought|           1007|
|            Blizzard|            852|
|      Excessive Heat|            827|
|        Frost/Freeze|            654|
|           Dense Fog|            652|
|     Cold/Wind Chill|            470|
+--------------------+---------------+
only showing top 20 rows



In [12]:
sqlContext.sql('SELECT EVENT_TYPE, NUMBER_EPISODES, PERCENTAGE \
                FROM ( \
                   SELECT \
                        DISTINCT(EVENT_TYPE), \
                        COUNT(EVENT_TYPE) OVER (PARTITION BY EVENT_TYPE) AS NUMBER_EPISODES, \
                        ROUND(COUNT(EVENT_TYPE) OVER (PARTITION BY EVENT_TYPE)/COUNT(EVENT_TYPE) OVER(), 5) AS PERCENTAGE\
                        FROM stormDetails_table \
                ORDER BY NUMBER_EPISODES DESC \
                      )').show(100)

+--------------------+---------------+----------+
|          EVENT_TYPE|NUMBER_EPISODES|PERCENTAGE|
+--------------------+---------------+----------+
|   Thunderstorm Wind|          18617|   0.27648|
|                Hail|           9013|   0.13385|
|               Flood|           4943|   0.07341|
|         Flash Flood|           4068|   0.06041|
|      Winter Weather|           3800|   0.05643|
|           High Wind|           3743|   0.05559|
|        Winter Storm|           3312|   0.04919|
|          Heavy Snow|           2844|   0.04224|
|Marine Thundersto...|           2502|   0.03716|
|             Tornado|           1727|   0.02565|
|         Strong Wind|           1590|   0.02361|
|          Heavy Rain|           1416|   0.02103|
|                Heat|           1291|   0.01917|
|Extreme Cold/Wind...|           1065|   0.01582|
|             Drought|           1007|   0.01495|
|            Blizzard|            852|   0.01265|
|      Excessive Heat|            827|   0.01228|


From 1851 to 2018, hurricane seasons have hit different states across the country. The top hurricane states on record are Florida, Texas, North Carolina, Louisiana, South Carolina, Alabama, Georgia, Mississippi, New York and Massachusetts. The Atlantic hurricane season runs from June to November, with most activity between August and September. From summer to fall, weather conditions become ideal for generating storms, with cooler air and warm ocean water temperatures. The Pacific hurricane season runs from May to November (the Pacific coast of the United States is affected by storms originating in Mexico on their way back out to sea, toward Hawaii).

### 2. Number of episodes and events by STATE

An episode contains one or more events, which may occur at different hours and on different days.
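To make the episode/event relationship concrete, here is a minimal sketch in plain Python (not Spark). It groups a few `(EPISODE_ID, EVENT_ID)` pairs copied from the sample rows of the storm details table shown in this notebook; the helper name `events_per_episode` is hypothetical.

```python
from collections import defaultdict

# (EPISODE_ID, EVENT_ID) pairs copied from the sample rows of the storm details table
rows = [
    (137295, 824116),
    (140217, 843354),
    (142648, 861581),
    (142648, 861584),
    (142648, 861582),
    (142648, 856504),
]

def events_per_episode(pairs):
    """Map each episode id to the set of distinct event ids it contains."""
    grouped = defaultdict(set)
    for episode_id, event_id in pairs:
        grouped[episode_id].add(event_id)
    return dict(grouped)

grouped = events_per_episode(rows)
n_episodes = len(grouped)                            # analogous to COUNT(DISTINCT EPISODE_ID)
n_events = len({event_id for _, event_id in rows})   # analogous to COUNT(DISTINCT EVENT_ID)
print(n_episodes, n_events)
```

In this sample, episode 142648 alone accounts for four of the six events, which is why episode counts and event counts can rank states differently.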

In [13]:
sqlContext.sql('SELECT \
                    STATE, \
                    COUNT(DISTINCT(EPISODE_ID)) AS NUMBER_EPISODES \
                    FROM stormDetails_table \
                    GROUP BY STATE ORDER BY COUNT(DISTINCT(EPISODE_ID))DESC').show(20)

+--------------+---------------+
|         STATE|NUMBER_EPISODES|
+--------------+---------------+
|         TEXAS|            536|
|  SOUTH DAKOTA|            430|
|    CALIFORNIA|            402|
|      COLORADO|            331|
|        KANSAS|            325|
|      ILLINOIS|            299|
|       WYOMING|            294|
|      VIRGINIA|            283|
|GULF OF MEXICO|            280|
|     MINNESOTA|            279|
|      NEBRASKA|            275|
|          IOWA|            263|
|      MISSOURI|            256|
|  PENNSYLVANIA|            256|
|      NEW YORK|            254|
|NORTH CAROLINA|            247|
|       FLORIDA|            243|
|       MONTANA|            232|
|     WISCONSIN|            232|
|          OHIO|            229|
+--------------+---------------+
only showing top 20 rows



The states with the most episodes during 2019 were Texas, South Dakota, California, Colorado and Kansas.

In [14]:
sqlContext.sql('SELECT \
                   STATE, \
                   NUMBER_EPISODES, \
                   PERCENTAGE, \
                   RANK() OVER (ORDER BY NUMBER_EPISODES DESC) AS RANKING \
                   FROM ( \
                       SELECT DISTINCT STATE, \
                            COUNT(EPISODE_ID) OVER (PARTITION BY STATE) AS NUMBER_EPISODES, \
                            ROUND(COUNT(EPISODE_ID) OVER (PARTITION BY STATE)/COUNT(EPISODE_ID) OVER(), 5) AS PERCENTAGE \
                            FROM ( \
                               SELECT DISTINCT EPISODE_ID, STATE \
                               FROM stormDetails_table \
                            ) \
                   )').show(100)

+--------------------+---------------+----------+-------+
|               STATE|NUMBER_EPISODES|PERCENTAGE|RANKING|
+--------------------+---------------+----------+-------+
|               TEXAS|            536|   0.05092|      1|
|        SOUTH DAKOTA|            430|   0.04085|      2|
|          CALIFORNIA|            402|   0.03819|      3|
|            COLORADO|            331|   0.03144|      4|
|              KANSAS|            325|   0.03087|      5|
|            ILLINOIS|            299|    0.0284|      6|
|             WYOMING|            294|   0.02793|      7|
|            VIRGINIA|            283|   0.02688|      8|
|      GULF OF MEXICO|            280|    0.0266|      9|
|           MINNESOTA|            279|    0.0265|     10|
|            NEBRASKA|            275|   0.02612|     11|
|                IOWA|            263|   0.02498|     12|
|            MISSOURI|            256|   0.02432|     13|
|        PENNSYLVANIA|            256|   0.02432|     13|
|            N
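Missouri and Pennsylvania share rank 13 in the output above because `RANK()` gives tied values the same rank and then skips the following position. A minimal sketch of that behaviour in plain Python, assuming descending order; the function name `sql_rank` is hypothetical:

```python
def sql_rank(values):
    """Return (value, rank) pairs with SQL RANK()-style ranks, in descending order.

    Tied values share a rank and the next distinct value skips ahead (1, 2, 2, 4).
    """
    ordered = sorted(values, reverse=True)
    ranks = []
    for position, value in enumerate(ordered, start=1):
        if ranks and value == ordered[position - 2]:
            ranks.append(ranks[-1])   # same value as the previous row: reuse its rank
        else:
            ranks.append(position)    # otherwise the rank equals the row position
    return list(zip(ordered, ranks))

# Episode counts for IOWA, MISSOURI, PENNSYLVANIA and NEW YORK from the per-state table
ranked = sql_rank([263, 256, 256, 254])
print(ranked)
```

If consecutive ranks without gaps are preferred, SQL's `DENSE_RANK()` is the usual alternative.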

In [15]:
sqlContext.sql('SELECT \
                    STATE, \
                    COUNT(DISTINCT(EVENT_ID)) AS NUMBER_EVENTS \
                    FROM stormDetails_table \
                    GROUP BY STATE ORDER BY COUNT(DISTINCT(EVENT_ID))DESC').show(20)

+--------------+-------------+
|         STATE|NUMBER_EVENTS|
+--------------+-------------+
|         TEXAS|         4338|
|        KANSAS|         2672|
|    CALIFORNIA|         2643|
|  SOUTH DAKOTA|         2543|
|      NEW YORK|         2514|
|      VIRGINIA|         2398|
|  PENNSYLVANIA|         2395|
|          OHIO|         2279|
|          IOWA|         2276|
|      MISSOURI|         2159|
|     MINNESOTA|         2126|
|      NEBRASKA|         2088|
|      ILLINOIS|         2084|
|      OKLAHOMA|         1801|
|      COLORADO|         1776|
|     WISCONSIN|         1573|
|      KENTUCKY|         1522|
|NORTH CAROLINA|         1448|
|       INDIANA|         1447|
|       MONTANA|         1286|
+--------------+-------------+
only showing top 20 rows



In [16]:
sqlContext.sql('SELECT \
                   STATE, \
                   NUMBER_EVENTS, \
                   PERCENTAGE, \
                   RANK() OVER (ORDER BY NUMBER_EVENTS DESC) AS RANKING \
                   FROM ( \
                       SELECT DISTINCT STATE, \
                            COUNT(EVENT_ID) OVER (PARTITION BY STATE) AS NUMBER_EVENTS, \
                            ROUND(COUNT(EVENT_ID) OVER (PARTITION BY STATE)/COUNT(EVENT_ID) OVER(), 5) AS PERCENTAGE \
                            FROM ( \
                               SELECT DISTINCT EVENT_ID, STATE \
                               FROM stormDetails_table \
                            ) \
                   )').show(100)

+--------------------+-------------+----------+-------+
|               STATE|NUMBER_EVENTS|PERCENTAGE|RANKING|
+--------------------+-------------+----------+-------+
|               TEXAS|         4338|   0.06442|      1|
|              KANSAS|         2672|   0.03968|      2|
|          CALIFORNIA|         2643|   0.03925|      3|
|        SOUTH DAKOTA|         2543|   0.03777|      4|
|            NEW YORK|         2514|   0.03733|      5|
|            VIRGINIA|         2398|   0.03561|      6|
|        PENNSYLVANIA|         2395|   0.03557|      7|
|                OHIO|         2279|   0.03384|      8|
|                IOWA|         2276|    0.0338|      9|
|            MISSOURI|         2159|   0.03206|     10|
|           MINNESOTA|         2126|   0.03157|     11|
|            NEBRASKA|         2088|   0.03101|     12|
|            ILLINOIS|         2084|   0.03095|     13|
|            OKLAHOMA|         1801|   0.02675|     14|
|            COLORADO|         1776|   0.02637| 

In [17]:
sqlContext.sql('SELECT \
                    EVENT_TYPE, \
                    STATE, \
                    COUNT(EVENT_TYPE) AS NUMBER_EVENTS \
                    FROM stormDetails_table \
                    GROUP BY STATE, EVENT_TYPE ORDER BY COUNT(EVENT_TYPE) DESC').show(20)

+--------------------+--------------+-------------+
|          EVENT_TYPE|         STATE|NUMBER_EVENTS|
+--------------------+--------------+-------------+
|                Hail|         TEXAS|         1394|
|   Thunderstorm Wind|  PENNSYLVANIA|         1245|
|   Thunderstorm Wind|      VIRGINIA|         1199|
|   Thunderstorm Wind|         TEXAS|         1047|
|Marine Thundersto...|ATLANTIC NORTH|          997|
|   Thunderstorm Wind|      NEW YORK|          903|
|   Thunderstorm Wind|          OHIO|          891|
|   Thunderstorm Wind|        KANSAS|          848|
|                Hail|        KANSAS|          847|
|   Thunderstorm Wind|NORTH CAROLINA|          791|
|Marine Thundersto...|GULF OF MEXICO|          775|
|               Flood|  SOUTH DAKOTA|          752|
|   Thunderstorm Wind|       GEORGIA|          673|
|   Thunderstorm Wind|      MISSOURI|          648|
|                Hail|      NEBRASKA|          640|
|   Thunderstorm Wind|SOUTH CAROLINA|          619|
|   Thunders

Hail, thunderstorm wind and marine thunderstorm wind were the most common events in 2019.

### 3. Duration of events

First, we inspect the date information available in `stormDetails_table`:

In [18]:
sqlContext.sql('SELECT \
                    EPISODE_ID, \
                    EVENT_ID, \
                    STATE, \
                    EVENT_TYPE, \
                    BEGIN_YEARMONTH, \
                    BEGIN_DAY, \
                    BEGIN_TIME,  \
                    END_YEARMONTH, \
                    END_DAY, \
                    END_TIME \
                    FROM stormDetails_table').show(5)

+----------+--------+---------+-----------------+---------------+---------+----------+-------------+-------+--------+
|EPISODE_ID|EVENT_ID|    STATE|       EVENT_TYPE|BEGIN_YEARMONTH|BEGIN_DAY|BEGIN_TIME|END_YEARMONTH|END_DAY|END_TIME|
+----------+--------+---------+-----------------+---------------+---------+----------+-------------+-------+--------+
|    137295|  824116|    TEXAS|      Flash Flood|         201905|        9|      1554|       201905|      9|    1830|
|    140217|  843354|MINNESOTA|Thunderstorm Wind|         201907|       15|      1640|       201907|     15|    1641|
|    142648|  861581|    TEXAS|Thunderstorm Wind|         201910|       20|      2223|       201910|     20|    2223|
|    142648|  861584|    TEXAS|Thunderstorm Wind|         201910|       20|      2312|       201910|     20|    2312|
|    142648|  861582|    TEXAS|Thunderstorm Wind|         201910|       20|      2236|       201910|     20|    2236|
+----------+--------+---------+-----------------+-------

For a more readable view, the begin and end dates are transformed into strings with the format **yyyy-mm-dd hh:mm**:

In [19]:
sqlContext.sql("SELECT \
                    EPISODE_ID, \
                    EVENT_ID, \
                    STATE, \
                    EVENT_TYPE, \
                    CONCAT(SUBSTRING(BEGIN_YEARMONTH, 1, 4), '-', \
                    SUBSTRING(BEGIN_YEARMONTH, 5, 2), '-', \
                    BEGIN_DAY, ' ', SUBSTRING(BEGIN_TIME, 1, 2), ':', SUBSTRING(BEGIN_TIME, 3, 2)) AS BEGIN_DATE, \
                    CONCAT(SUBSTRING(END_YEARMONTH, 1, 4), '-', \
                    SUBSTRING(END_YEARMONTH, 5, 2), '-', \
                    END_DAY, ' ', SUBSTRING(END_TIME, 1, 2), ':', SUBSTRING(END_TIME, 3, 2)) AS END_DATE \
                    FROM stormDetails_table").show()   

+----------+--------+--------------+--------------------+----------------+----------------+
|EPISODE_ID|EVENT_ID|         STATE|          EVENT_TYPE|      BEGIN_DATE|        END_DATE|
+----------+--------+--------------+--------------------+----------------+----------------+
|    137295|  824116|         TEXAS|         Flash Flood| 2019-05-9 15:54| 2019-05-9 18:30|
|    140217|  843354|     MINNESOTA|   Thunderstorm Wind|2019-07-15 16:40|2019-07-15 16:41|
|    142648|  861581|         TEXAS|   Thunderstorm Wind|2019-10-20 22:23|2019-10-20 22:23|
|    142648|  861584|         TEXAS|   Thunderstorm Wind|2019-10-20 23:12|2019-10-20 23:12|
|    142648|  861582|         TEXAS|   Thunderstorm Wind|2019-10-20 22:36|2019-10-20 22:36|
|    142648|  856504|         TEXAS|             Tornado|2019-10-20 20:48|2019-10-20 20:54|
|    141212|  848333|       VERMONT|                Hail| 2019-09-4 12:29| 2019-09-4 12:29|
|    141215|  848338|      NEW YORK|   Thunderstorm Wind|2019-09-26 15:54|2019-0
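Because the date parts are inferred as integers, single-digit days appear unpadded in the output above (for example `2019-05-9`), and a time earlier than 10:00 would have fewer than four digits, which shifts the `SUBSTRING` positions. As a hedged alternative, the same fields can be assembled in plain Python with zero-padding. The helper `format_event_time` is hypothetical and assumes the time column is an HHMM integer.

```python
from datetime import datetime

def format_event_time(yearmonth, day, hhmm):
    """Build a 'yyyy-mm-dd hh:mm' string from integer YEARMONTH, DAY and HHMM fields."""
    year, month = divmod(yearmonth, 100)   # 201905 -> (2019, 5)
    hour, minute = divmod(hhmm, 100)       # 1554 -> (15, 54); 554 -> (5, 54)
    return datetime(year, month, day, hour, minute).strftime("%Y-%m-%d %H:%M")

# First sample row of the table: BEGIN_YEARMONTH=201905, BEGIN_DAY=9, BEGIN_TIME=1554
print(format_event_time(201905, 9, 1554))
```

Using `datetime` also validates the parts, so an impossible date or time raises a `ValueError` instead of producing a malformed string.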

In the table below, `DELTA_YEAR` and `DELTA_MONTH` are the differences between the begin/end years and months respectively. Both are extracted from `BEGIN_YEARMONTH` and `END_YEARMONTH` using the SQL function SUBSTRING. The same function is used to retrieve hours and minutes from `BEGIN_TIME` and `END_TIME`. `DELTA_DAY` is computed directly from `BEGIN_DAY` and `END_DAY`.

The first 20 rows (the default for `.show()`) are displayed as follows:

In [20]:
sqlContext.sql('SELECT \
                   EPISODE_ID, \
                   EVENT_ID, STATE, \
                   EVENT_TYPE, \
                   CAST(SUBSTRING(END_YEARMONTH, 1, 4) AS INT) - CAST(SUBSTRING(BEGIN_YEARMONTH, 1, 4) AS INT) AS DELTA_YEAR, \
                   CAST(SUBSTRING(END_YEARMONTH, 5, 2) AS INT) - CAST(SUBSTRING(BEGIN_YEARMONTH, 5, 2) AS INT) AS DELTA_MONTH, \
                   END_DAY - BEGIN_DAY AS DELTA_DAY, \
                   ((CAST(SUBSTRING(END_TIME, 1, 2) AS INT)*60 + CAST(SUBSTRING(END_TIME, 3, 2) AS INT)) - (CAST(SUBSTRING(BEGIN_TIME, 1, 2) AS INT)*60 + CAST(SUBSTRING(BEGIN_TIME, 3, 2) AS INT)))/60.0 AS DELTA_HOURS \
                   FROM stormDetails_table').show()

+----------+--------+--------------+--------------------+----------+-----------+---------+-----------+
|EPISODE_ID|EVENT_ID|         STATE|          EVENT_TYPE|DELTA_YEAR|DELTA_MONTH|DELTA_DAY|DELTA_HOURS|
+----------+--------+--------------+--------------------+----------+-----------+---------+-----------+
|    137295|  824116|         TEXAS|         Flash Flood|         0|          0|        0|   2.600000|
|    140217|  843354|     MINNESOTA|   Thunderstorm Wind|         0|          0|        0|   0.016667|
|    142648|  861581|         TEXAS|   Thunderstorm Wind|         0|          0|        0|   0.000000|
|    142648|  861584|         TEXAS|   Thunderstorm Wind|         0|          0|        0|   0.000000|
|    142648|  861582|         TEXAS|   Thunderstorm Wind|         0|          0|        0|   0.000000|
|    142648|  856504|         TEXAS|             Tornado|         0|          0|        0|   0.100000|
|    141212|  848333|       VERMONT|                Hail|         0|     
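The `DELTA_HOURS` arithmetic used in the query above can be checked by hand. The following is a minimal sketch in plain Python, assuming the time columns are HHMM integers; `delta_hours` is a hypothetical helper name.

```python
def delta_hours(begin_hhmm, end_hhmm):
    """Difference between two HHMM integers, in hours (negative if end is earlier)."""
    begin_minutes = (begin_hhmm // 100) * 60 + begin_hhmm % 100
    end_minutes = (end_hhmm // 100) * 60 + end_hhmm % 100
    return (end_minutes - begin_minutes) / 60.0

# First two sample rows: 15:54 -> 18:30 (Flash Flood) and 16:40 -> 16:41 (Thunderstorm Wind)
print(round(delta_hours(1554, 1830), 6))
print(round(delta_hours(1640, 1641), 6))
```

These two values agree with the first two `DELTA_HOURS` entries in the table (2.600000 and 0.016667).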

Events with a `DELTA_MONTH` greater than 0:

In [21]:
sqlContext.sql('SELECT \
                   EPISODE_ID, \
                   EVENT_ID, \
                   STATE, \
                   EVENT_TYPE, \
                   DELTA_DAY, \
                   DELTA_HOURS \
                   FROM (\
                        SELECT \
                            EPISODE_ID, \
                            EVENT_ID, \
                            STATE, \
                            EVENT_TYPE, \
                            CAST(SUBSTRING(END_YEARMONTH, 1, 4) AS INT) - CAST(SUBSTRING(BEGIN_YEARMONTH, 1, 4) AS INT) AS DELTA_YEAR, \
                            CAST(SUBSTRING(END_YEARMONTH, 5, 2) AS INT) - CAST(SUBSTRING(BEGIN_YEARMONTH, 5, 2) AS INT) AS DELTA_MONTH, \
                            END_DAY - BEGIN_DAY AS DELTA_DAY, \
                            ((CAST(SUBSTRING(END_TIME, 1, 2) AS INT)*60 + CAST(SUBSTRING(END_TIME, 3, 2) AS INT)) - (CAST(SUBSTRING(BEGIN_TIME, 1, 2) AS INT)*60 + CAST(SUBSTRING(BEGIN_TIME, 3, 2) AS INT)))/60.0 AS DELTA_HOURS \
                            FROM stormDetails_table \
                        ) \
                   WHERE DELTA_MONTH > 0').show()

+----------+--------+-----+----------+---------+-----------+
|EPISODE_ID|EVENT_ID|STATE|EVENT_TYPE|DELTA_DAY|DELTA_HOURS|
+----------+--------+-----+----------+---------+-----------+
+----------+--------+-----+----------+---------+-----------+



All the events start and finish in the same month. This means that the duration of every event can be computed using only `DELTA_DAY` and `DELTA_HOURS`, as follows:

In [22]:
sqlContext.sql('SELECT \
                   B.EPISODE_ID, \
                   B.EVENT_ID, \
                   B.STATE, \
                   B.EVENT_TYPE, \
                   B.BEGIN_YEARMONTH, \
                   B.END_YEARMONTH, \
                   B.BEGIN_DAY, \
                   B.END_DAY, \
                   B.EFFECTIVE_DELTA_DAYS \
                   FROM ( \
                       SELECT \
                           A.EPISODE_ID, \
                           A.EVENT_ID, \
                           A.STATE, \
                           A.EVENT_TYPE, \
                           A.BEGIN_YEARMONTH, \
                           A.END_YEARMONTH, \
                           A.BEGIN_DAY, \
                           A.END_DAY, \
                           CASE WHEN A.DELTA_HOURS IS NOT NULL THEN A.DELTA_DAY + A.DELTA_HOURS/24.0 ELSE A.DELTA_DAY END AS EFFECTIVE_DELTA_DAYS \
                           FROM (\
                                SELECT \
                                   EPISODE_ID, \
                                   EVENT_ID, \
                                   STATE, \
                                   EVENT_TYPE, \
                                   BEGIN_YEARMONTH, \
                                   END_YEARMONTH, \
                                   BEGIN_DAY, \
                                   END_DAY, \
                                   END_DAY - BEGIN_DAY AS DELTA_DAY, \
                                   ABS((CAST(SUBSTRING(END_TIME, 1, 2) AS INT) + CAST(SUBSTRING(END_TIME, 3, 2) AS INT)) - (CAST(SUBSTRING(BEGIN_TIME, 1, 2) AS INT) + CAST(SUBSTRING(BEGIN_TIME, 3, 2) AS INT)))/60.0 AS DELTA_HOURS \
                                   FROM stormDetails_table \
                                   WHERE CAST(END_TIME AS INT) > 0 AND CAST(BEGIN_TIME AS INT) > 0 \
                                ) AS A \
                    ) AS B \
                    ORDER BY B.EFFECTIVE_DELTA_DAYS DESC').show(200)

+----------+--------+--------------+----------+---------------+-------------+---------+-------+--------------------+
|EPISODE_ID|EVENT_ID|         STATE|EVENT_TYPE|BEGIN_YEARMONTH|END_YEARMONTH|BEGIN_DAY|END_DAY|EFFECTIVE_DELTA_DAYS|
+----------+--------+--------------+----------+---------------+-------------+---------+-------+--------------------+
|    140042|  842536|  SOUTH DAKOTA|     Flood|         201907|       201907|        1|     31|       30.0361111250|
|    140042|  842537|  SOUTH DAKOTA|     Flood|         201907|       201907|        1|     31|       30.0361111250|
|    141294|  848629|         IDAHO|  Wildfire|         201908|       201908|        1|     31|       30.0000000000|
|    137101|  822737|      ILLINOIS|     Flood|         201904|       201904|        1|     30|       29.0500000000|
|    138578|  833168|          IOWA|     Flood|         201906|       201906|        1|     30|       29.0500000000|
|    138858|  835315|  SOUTH DAKOTA|     Flood|         201906| 
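The idea of combining `DELTA_DAY` and `DELTA_HOURS` into a single duration in days can be sketched in plain Python as follows. This is an illustration under assumptions, not a reproduction of the query: it converts hours to minutes before subtracting, as the earlier `DELTA_HOURS` expression does, and keeps the sign of the hour difference, so its values are not expected to match the `EFFECTIVE_DELTA_DAYS` column above exactly. The function name and the example inputs are hypothetical.

```python
def effective_delta_days(begin_day, begin_hhmm, end_day, end_hhmm):
    """Duration in days for an event that begins and ends within the same month."""
    begin_minutes = (begin_hhmm // 100) * 60 + begin_hhmm % 100
    end_minutes = (end_hhmm // 100) * 60 + end_hhmm % 100
    delta_day = end_day - begin_day
    delta_hours = (end_minutes - begin_minutes) / 60.0   # may be negative
    return delta_day + delta_hours / 24.0

# Hypothetical inputs: day 1 at 08:00 to day 3 at 20:00 is two and a half days
print(effective_delta_days(1, 800, 3, 2000))
# Hypothetical overnight event: day 1 at 22:00 to day 2 at 02:00 is four hours
print(effective_delta_days(1, 2200, 2, 200) * 24)
```

Keeping the sign matters for overnight events: the day difference is 1, and the negative hour difference brings the total back down to four hours.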

### 4. Narrative

In [23]:
sqlContext.sql('SELECT \
                    EPISODE_ID, \
                    EVENT_ID, \
                    EPISODE_NARRATIVE, \
                    EVENT_NARRATIVE \
                    FROM stormDetails_table').show()

+----------+--------+--------------------+--------------------+
|EPISODE_ID|EVENT_ID|   EPISODE_NARRATIVE|     EVENT_NARRATIVE|
+----------+--------+--------------------+--------------------+
|    137295|  824116|Thunderstorms dev...|Thunderstorms pro...|
|    140217|  843354|An area of low pr...|A 14-inch tree fe...|
|    142648|  861581|Thunderstorms eru...|A trained spotter...|
|    142648|  861584|Thunderstorms eru...|The local police ...|
|    142648|  861582|Thunderstorms eru...|A wind gust of 75...|
|    142648|  856504|Thunderstorms eru...|Tornadic wind dam...|
|    141212|  848333|A strong mid-leve...|Quarter size hail...|
|    141215|  848338|A strong mid-leve...|A few trees and p...|
|    140688|  845760|Fast moving showe...|Weatherflow site ...|
|    140688|  845762|Fast moving showe...|US Air Force wind...|
|    140688|  845764|Fast moving showe...|Weatherflow site ...|
|    141232|  848441|Police responded ...|Following a 9-1-1...|
|    133946|  801726|A strong surface ..

### 5. Devastation: injuries, deaths and damages

In [24]:
sqlContext.sql('SELECT \
                    EPISODE_ID, \
                    EVENT_TYPE, \
                    SUM(INJURIES_DIRECT) AS TOTAL_INJURIES_DIRECT, \
                    SUM(INJURIES_INDIRECT) AS TOTAL_INJURIES_INDIRECT, \
                    SUM(INJURIES_DIRECT) + SUM(INJURIES_INDIRECT) AS TOTAL_INJURIES \
                    FROM stormDetails_table \
                    GROUP BY EPISODE_ID, EVENT_TYPE \
                    HAVING SUM(INJURIES_DIRECT) > 0 OR SUM(INJURIES_INDIRECT) > 0').show()

+----------+-----------------+---------------------+-----------------------+--------------+
|EPISODE_ID|       EVENT_TYPE|TOTAL_INJURIES_DIRECT|TOTAL_INJURIES_INDIRECT|TOTAL_INJURIES|
+----------+-----------------+---------------------+-----------------------+--------------+
|    142648|Thunderstorm Wind|                    3|                      0|             3|
|    138507|      Rip Current|                    1|                      0|             1|
|    136101|Thunderstorm Wind|                    1|                      0|             1|
|    134997|Thunderstorm Wind|                    1|                      0|             1|
|    135479|      Strong Wind|                    1|                      0|             1|
|    135393|   Winter Weather|                    0|                      2|             2|
|    136708|            Sleet|                    0|                      3|             3|
|    134345|        Dense Fog|                    0|                     35|    

In [25]:
sqlContext.sql('SELECT \
                    EPISODE_ID, \
                    EVENT_TYPE, \
                    SUM(DEATHS_DIRECT) AS TOTAL_DEATHS_DIRECT, \
                    SUM(DEATHS_INDIRECT) AS TOTAL_DEATHS_INDIRECT, \
                    SUM(DEATHS_DIRECT) + SUM(DEATHS_INDIRECT) AS TOTAL_DEATHS \
                    FROM stormDetails_table \
                    GROUP BY EPISODE_ID, EVENT_TYPE \
                    HAVING SUM(DEATHS_DIRECT) > 0 OR SUM(DEATHS_INDIRECT) > 0').show()

+----------+--------------------+-------------------+---------------------+------------+
|EPISODE_ID|          EVENT_TYPE|TOTAL_DEATHS_DIRECT|TOTAL_DEATHS_INDIRECT|TOTAL_DEATHS|
+----------+--------------------+-------------------+---------------------+------------+
|    141232|         Rip Current|                  1|                    0|           1|
|    136101|   Thunderstorm Wind|                  1|                    0|           1|
|    138456|               Flood|                  1|                    0|           1|
|    133947|      Winter Weather|                  0|                    1|           1|
|    137299|     Cold/Wind Chill|                  1|                    0|           1|
|    135254|         Debris Flow|                  1|                    0|           1|
|    135179|               Flood|                  1|                    0|           1|
|    135394|        Winter Storm|                  0|                    1|           1|
|    135494|      Win

In [26]:
sqlContext.sql('SELECT \
                    EVENT_TYPE, \
                    SUM(DEATHS_DIRECT) + SUM(DEATHS_INDIRECT) AS TOTAL_DEATHS \
                    FROM stormDetails_table \
                    GROUP BY EVENT_TYPE \
                    ORDER BY SUM(DEATHS_DIRECT) + SUM(DEATHS_INDIRECT) DESC').show(100)

+--------------------+------------+
|          EVENT_TYPE|TOTAL_DEATHS|
+--------------------+------------+
|         Rip Current|          58|
|      Winter Weather|          56|
|               Flood|          49|
|                Heat|          48|
|             Tornado|          42|
|         Flash Flood|          41|
|   Thunderstorm Wind|          39|
|     Cold/Wind Chill|          33|
|           Lightning|          24|
|      Excessive Heat|          20|
|        Winter Storm|          20|
|           High Surf|          19|
|         Strong Wind|          18|
|Extreme Cold/Wind...|          15|
|           Avalanche|          14|
|          Heavy Snow|          10|
|           Dense Fog|           9|
|            Blizzard|           7|
|          Heavy Rain|           6|
|            Wildfire|           6|
| Tropical Depression|           4|
|Marine Thundersto...|           3|
|         Debris Flow|           2|
|         Sneakerwave|           2|
|                Hail|      

In [27]:
sqlContext.sql('SELECT \
                    STATE, \
                    SUM(DEATHS_DIRECT) + SUM(DEATHS_INDIRECT) AS TOTAL_DEATHS \
                    FROM stormDetails_table \
                    GROUP BY STATE \
                    ORDER BY SUM(DEATHS_DIRECT) + SUM(DEATHS_INDIRECT) DESC').show(100)

+--------------------+------------+
|               STATE|TOTAL_DEATHS|
+--------------------+------------+
|          CALIFORNIA|          59|
|              NEVADA|          44|
|               TEXAS|          42|
|             FLORIDA|          40|
|             ALABAMA|          29|
|            MISSOURI|          26|
|           WISCONSIN|          26|
|            COLORADO|          21|
|                OHIO|          17|
|      NORTH CAROLINA|          17|
|            ILLINOIS|          16|
|             INDIANA|          15|
|            KENTUCKY|          13|
|            OKLAHOMA|          12|
|         MISSISSIPPI|          11|
|             ARIZONA|          10|
|        SOUTH DAKOTA|           9|
|          WASHINGTON|           8|
|            VIRGINIA|           8|
|           LOUISIANA|           8|
|           TENNESSEE|           7|
|            NEBRASKA|           7|
|                IOWA|           7|
|      SOUTH CAROLINA|           7|
|              OREGON|      

In [28]:
sqlContext.sql('SELECT \
                    STATE, \
                    SUM(DEATHS_DIRECT) AS TOTAL_DEATHS_DIRECT, \
                    SUM(DEATHS_INDIRECT) AS TOTAL_DEATHS_INDIRECT, \
                    SUM(DEATHS_DIRECT) + SUM(DEATHS_INDIRECT) AS TOTAL_DEATHS \
                    FROM stormDetails_table \
                    GROUP BY STATE \
                    ORDER BY SUM(DEATHS_DIRECT) + SUM(DEATHS_INDIRECT) DESC').show(50)

+--------------+-------------------+---------------------+------------+
|         STATE|TOTAL_DEATHS_DIRECT|TOTAL_DEATHS_INDIRECT|TOTAL_DEATHS|
+--------------+-------------------+---------------------+------------+
|    CALIFORNIA|                 32|                   27|          59|
|        NEVADA|                 25|                   19|          44|
|         TEXAS|                 37|                    5|          42|
|       FLORIDA|                 33|                    7|          40|
|       ALABAMA|                 29|                    0|          29|
|     WISCONSIN|                 10|                   16|          26|
|      MISSOURI|                 14|                   12|          26|
|      COLORADO|                 17|                    4|          21|
|          OHIO|                 11|                    6|          17|
|NORTH CAROLINA|                 17|                    0|          17|
|      ILLINOIS|                 11|                    5|          16|

### 5. Flood causes and property damage by episode

In [29]:
# Property damage per flood episode. LEFT(..., LENGTH(...)-1) drops the trailing
# unit letter, so every value is read as thousands (K) whatever its suffix.
sqlContext.sql('''
    SELECT
        EPISODE_ID,
        EVENT_TYPE,
        STATE,
        FLOOD_CAUSE,
        SUM(CAST(LEFT(DAMAGE_PROPERTY, LENGTH(DAMAGE_PROPERTY)-1) AS FLOAT)) AS DAMAGE_PROPERTY_K
    FROM stormDetails_table
    WHERE FLOOD_CAUSE IS NOT NULL
    GROUP BY EPISODE_ID, EVENT_TYPE, STATE, FLOOD_CAUSE
    ORDER BY DAMAGE_PROPERTY_K DESC
''').show(100)

+----------+-----------+--------------------+--------------------+------------------+
|EPISODE_ID| EVENT_TYPE|               STATE|         FLOOD_CAUSE| DAMAGE_PROPERTY_K|
+----------+-----------+--------------------+--------------------+------------------+
|    138078|      Flood|            NEW YORK|Heavy Rain / Snow...|            8550.0|
|    139574|      Flood|            NEW YORK|Heavy Rain / Snow...|            7100.0|
|    137944|      Flood|        SOUTH DAKOTA|Heavy Rain / Snow...| 6322.610000014305|
|    138354|      Flood|            ARKANSAS|          Heavy Rain|            6000.0|
|    142900|      Flood|        SOUTH DAKOTA|          Heavy Rain| 5441.819999575615|
|    135533|Flash Flood|             GEORGIA|          Heavy Rain|            5000.0|
|    136519|      Flood|        SOUTH DAKOTA|Heavy Rain / Snow...| 4492.980007410049|
|    140910|      Flood|            NEW YORK|Heavy Rain / Snow...|            4150.0|
|    139113|      Flood|            ARKANSAS|         
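
The `DAMAGE_PROPERTY_K` column above comes from dropping the last character of `DAMAGE_PROPERTY` and casting the rest to a number, so each value is read as thousands regardless of its unit letter. If the raw strings mix units, a suffix-aware conversion might look like the sketch below. It assumes NOAA-style suffixes `K`, `M` and `B`, which is an assumption about the data rather than something checked in this notebook, and `damage_to_thousands` is a hypothetical helper, not part of the analysis.

```python
# Hypothetical helper: convert a damage string such as "10.00K" or "1.5M"
# to a number expressed in thousands of dollars.
# Assumption: suffixes are K (thousand), M (million), B (billion).
_MULTIPLIER_IN_K = {"K": 1.0, "M": 1_000.0, "B": 1_000_000.0}


def damage_to_thousands(raw):
    """Return the damage in thousands, or None when missing or malformed."""
    if raw is None:
        return None
    text = raw.strip().upper()
    if not text:
        return None
    suffix = text[-1]
    if suffix in _MULTIPLIER_IN_K:
        number, factor = text[:-1], _MULTIPLIER_IN_K[suffix]
    else:
        # No recognised suffix: treat the value as plain dollars.
        number, factor = text, 0.001
    try:
        return float(number) * factor
    except ValueError:
        # Covers inputs such as a lone "K" with no numeric part.
        return None


print(damage_to_thousands("10.00K"))
print(damage_to_thousands("1.5M"))
```

One way to apply this in Spark would be to wrap the function with `pyspark.sql.functions.udf`; a SQL `CASE` on the last character would avoid the Python round trip.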

### 6. Property damage by TYPE OF EVENT

In [30]:
# Property damage per event type (numeric part of DAMAGE_PROPERTY, suffix dropped)
sqlContext.sql('''
    SELECT
        EVENT_TYPE,
        SUM(CAST(LEFT(DAMAGE_PROPERTY, LENGTH(DAMAGE_PROPERTY)-1) AS FLOAT)) AS DAMAGE_PROPERTY_K
    FROM stormDetails_table
    GROUP BY EVENT_TYPE
    ORDER BY DAMAGE_PROPERTY_K DESC
''').show(100)

+--------------------+------------------+
|          EVENT_TYPE| DAMAGE_PROPERTY_K|
+--------------------+------------------+
|               Flood|118979.80000536144|
|   Thunderstorm Wind|  91331.7800008934|
|             Tornado| 81087.81000041962|
|         Flash Flood| 75903.32001295686|
|     Lakeshore Flood|           49786.0|
|           High Wind| 23776.90000024438|
|                Hail|14577.580000052229|
|         Strong Wind|   12119.800000377|
|           Lightning| 9683.800000026822|
|        Winter Storm|            7889.5|
|      Winter Weather|            6557.5|
|            Blizzard|3959.0800001621246|
|            Wildfire|2456.7000000476837|
|           Ice Storm| 2306.060000002384|
|          Heavy Snow|2077.9499995708466|
|      Excessive Heat|            1500.0|
|    Lake-Effect Snow|            1307.0|
|          Heavy Rain|            1020.0|
|         Debris Flow|  981.920000076294|
|      Tropical Storm| 948.8000000268221|
|  Marine Strong Wind|            

### 7. Property damage by STATE

In [31]:
# Property damage per state (numeric part of DAMAGE_PROPERTY, suffix dropped)
sqlContext.sql('''
    SELECT
        STATE,
        SUM(CAST(LEFT(DAMAGE_PROPERTY, LENGTH(DAMAGE_PROPERTY)-1) AS FLOAT)) AS DAMAGE_PROPERTY_K
    FROM stormDetails_table
    GROUP BY STATE
    ORDER BY DAMAGE_PROPERTY_K DESC
''').show(100)

+--------------------+--------------------+
|               STATE|   DAMAGE_PROPERTY_K|
+--------------------+--------------------+
|            NEW YORK|   88445.98000001907|
|        SOUTH DAKOTA|   37072.49000740051|
|               TEXAS|  29313.399999916553|
|         MISSISSIPPI|  27281.850000038743|
|                OHIO|  26119.100000061095|
|            NEBRASKA|   20113.04999923706|
|           MINNESOTA|  19876.259999990463|
|            ARKANSAS|             18800.5|
|                IOWA|   17799.80999982357|
|             GEORGIA|  15952.590000018477|
|          CALIFORNIA|  14429.900000028312|
|           WISCONSIN|  14100.220000006258|
|            MISSOURI|   13335.85000038147|
|           TENNESSEE|  12550.560013025999|
|            OKLAHOMA|  12336.900000013411|
|           LOUISIANA|             10909.0|
|      NORTH CAROLINA|  10389.600000023842|
|        PENNSYLVANIA|  10350.200000047684|
|            KENTUCKY|  10030.099999979138|
|            MICHIGAN|    9304.5

### 8. Property and crop damage by STATE

In [32]:
sqlContext.sql('SELECT \
                    STATE, \
                    SUM(CAST(LEFT(DAMAGE_PROPERTY, LENGTH(DAMAGE_PROPERTY)-1) AS FLOAT)) AS DAMAGE_PROPERTY_K, \
                    SUM(CAST(LEFT(DAMAGE_CROPS, LENGTH(DAMAGE_CROPS)-1) AS FLOAT)) AS DAMAGE_CROPS_K \
                    FROM stormDetails_table \
                    GROUP BY STATE').show(100)

+--------------------+--------------------+------------------+
|               STATE|   DAMAGE_PROPERTY_K|    DAMAGE_CROPS_K|
+--------------------+--------------------+------------------+
|               TEXAS|  29313.399999916553|             781.0|
|           MINNESOTA|  19876.259999990463| 4489.860000252724|
|             VERMONT|              8243.0|               0.0|
|            NEW YORK|   88445.98000001907|               1.0|
|      ATLANTIC SOUTH|0.009999999776482582|               0.0|
|             FLORIDA|   5224.810000007972|               1.0|
|       WEST VIRGINIA|   5284.100000008941| 677.3199999332428|
|            ARKANSAS|             18800.5|            4148.0|
|      GULF OF MEXICO|                 5.5|               0.0|
|             MONTANA|              1462.5|               0.0|
|            MISSOURI|   13335.85000038147|            1333.0|
|             GEORGIA|  15952.590000018477|5.5500000063329935|
|         CONNECTICUT|   802.9999999403954|            
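
Values such as `0.009999999776482582` for ATLANTIC SOUTH in the table above look like single-precision rounding: `CAST(... AS FLOAT)` stores each amount in 32 bits before the sum is reported as a 64-bit number. The sketch below reproduces the effect in plain Python with `struct` and shows that `decimal.Decimal` keeps the digits exact. In Spark SQL the analogous change would be casting to `DOUBLE` or `DECIMAL` instead of `FLOAT`. `as_float32` is a hypothetical helper name and the amounts are illustrative.

```python
import struct
from decimal import Decimal


def as_float32(value):
    """Round a Python float to the nearest 32-bit float, then widen it back to 64 bits."""
    return struct.unpack("f", struct.pack("f", value))[0]


# 0.01 has no exact binary representation; single precision keeps fewer digits.
print(as_float32(0.01))
# Values like 0.5 are exactly representable and survive unchanged.
print(as_float32(0.5))

# Summing decimal strings with Decimal avoids binary rounding altogether.
amounts = ["0.01", "6322.61", "5441.82"]
total = sum(Decimal(a) for a in amounts)
print(total)
```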

In [33]:
# Crop damage per state (numeric part of DAMAGE_CROPS, suffix dropped)
sqlContext.sql('''
    SELECT
        STATE,
        SUM(CAST(LEFT(DAMAGE_CROPS, LENGTH(DAMAGE_CROPS)-1) AS FLOAT)) AS DAMAGE_CROPS_K
    FROM stormDetails_table
    GROUP BY STATE
    ORDER BY DAMAGE_CROPS_K DESC
''').show(100)

+--------------------+------------------+
|               STATE|    DAMAGE_CROPS_K|
+--------------------+------------------+
|           MINNESOTA| 4489.860000252724|
|        SOUTH DAKOTA|4478.5900003910065|
|            ARKANSAS|            4148.0|
|            NEBRASKA|            3686.0|
|            ILLINOIS|            3615.0|
|         MISSISSIPPI|            2919.0|
|          CALIFORNIA|            2855.0|
|        NORTH DAKOTA|            2220.0|
|                IOWA|2219.3799999281764|
|              KANSAS|1691.7000001370907|
|            MISSOURI|            1333.0|
|           WISCONSIN|             990.0|
|               TEXAS|             781.0|
|      NORTH CAROLINA|             769.0|
|            VIRGINIA|             720.0|
|           LOUISIANA|             700.0|
|       WEST VIRGINIA| 677.3199999332428|
|            COLORADO|             388.0|
|             INDIANA|             371.0|
|            MICHIGAN|             190.0|
|        PENNSYLVANIA|            

### 9. Total damage by STATE

In [34]:
# Property plus crop damage per state, highest first
sqlContext.sql('''
    SELECT
        STATE,
        SUM(CAST(LEFT(DAMAGE_PROPERTY, LENGTH(DAMAGE_PROPERTY)-1) AS FLOAT))
          + SUM(CAST(LEFT(DAMAGE_CROPS, LENGTH(DAMAGE_CROPS)-1) AS FLOAT)) AS TOTAL_DAMAGES
    FROM stormDetails_table
    GROUP BY STATE
    ORDER BY TOTAL_DAMAGES DESC
''').show(100)

+--------------------+--------------------+
|               STATE|       TOTAL_DAMAGES|
+--------------------+--------------------+
|            NEW YORK|   88446.98000001907|
|        SOUTH DAKOTA|   41551.08000779152|
|         MISSISSIPPI|  30200.850000038743|
|               TEXAS|  30094.399999916553|
|                OHIO|  26271.100000061095|
|           MINNESOTA|  24366.120000243187|
|            NEBRASKA|   23799.04999923706|
|            ARKANSAS|             22948.5|
|                IOWA|  20019.189999751747|
|          CALIFORNIA|  17284.900000028312|
|             GEORGIA|   15958.14000002481|
|           WISCONSIN|  15090.220000006258|
|            MISSOURI|   14668.85000038147|
|           TENNESSEE|  12550.560013025999|
|            OKLAHOMA|  12337.900000013411|
|           LOUISIANA|             11609.0|
|      NORTH CAROLINA|  11158.600000023842|
|        PENNSYLVANIA|  10525.200000047684|
|            KENTUCKY|  10132.819999996573|
|            MICHIGAN|    9494.5
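
`SUM(a) + SUM(b)` evaluates to NULL whenever either sum is NULL, which would happen for a state whose crop (or property) values are all missing. Whether that occurs in this dataset is not checked here. The sketch below uses Python's built-in `sqlite3` with made-up rows purely to illustrate the NULL behaviour and a `COALESCE` guard. The table `damages` and its values are hypothetical, and SQLite stands in for Spark SQL only because it runs without a cluster.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE damages (state TEXT, property_k REAL, crops_k REAL)")
conn.executemany(
    "INSERT INTO damages VALUES (?, ?, ?)",
    [
        ("A", 100.0, 5.0),
        ("A", 50.0, None),
        ("B", 20.0, None),  # state B has no crop values at all
    ],
)

rows = conn.execute(
    """
    SELECT
        state,
        SUM(property_k) + SUM(crops_k) AS naive_total,
        COALESCE(SUM(property_k), 0) + COALESCE(SUM(crops_k), 0) AS guarded_total
    FROM damages
    GROUP BY state
    ORDER BY state
    """
).fetchall()

# State B gets a NULL (None) naive total but a usable guarded total.
for row in rows:
    print(row)
conn.close()
```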

## Fatalities

In [35]:
dfFatalities2019.createOrReplaceTempView('fatalities_table')

In [36]:
sqlContext.sql('SELECT *\
                   FROM fatalities_table').show(10)

+-------------+-------+--------+-----------+--------+-------------+-------------------+------------+------------+--------------------+---------------+
|FAT_YEARMONTH|FAT_DAY|FAT_TIME|FATALITY_ID|EVENT_ID|FATALITY_TYPE|      FATALITY_DATE|FATALITY_AGE|FATALITY_SEX|   FATALITY_LOCATION|EVENT_YEARMONTH|
+-------------+-------+--------+-----------+--------+-------------+-------------------+------------+------------+--------------------+---------------+
|       201906|      6|       0|      38341|  817354|            D|06/06/2019 00:00:00|          97|           M|  Outside/Open Areas|         201906|
|       201906|      9|       0|      38480|  819069|            D|06/09/2019 00:00:00|          45|           M|Vehicle/Towed Tra...|         201906|
|       201906|     20|       0|      38532|  820152|            D|06/20/2019 00:00:00|        null|           M|            In Water|         201906|
|       201906|     22|       0|      38559|  820397|            D|06/22/2019 00:00:00|       

In [37]:
sqlContext.sql('SELECT \
                   FATALITY_LOCATION, \
                   COUNT(FATALITY_LOCATION) \
                   FROM fatalities_table \
                   GROUP BY FATALITY_LOCATION \
                   ORDER BY COUNT(FATALITY_LOCATION) DESC').show(15)

+--------------------+------------------------+
|   FATALITY_LOCATION|count(FATALITY_LOCATION)|
+--------------------+------------------------+
|Vehicle/Towed Tra...|                     168|
|  Outside/Open Areas|                     133|
|            In Water|                     110|
|      Permanent Home|                      38|
| Mobile/Trailer Home|                      32|
|             Unknown|                      22|
|             Boating|                      17|
|          Under Tree|                      13|
| Permanent Structure|                       9|
|               Other|                       7|
|             Camping|                       3|
|Heavy Equipment/C...|                       2|
|          Ball Field|                       1|
+--------------------+------------------------+



In [38]:
# Fatalities per age decade: FATALITY_RANGE 0 covers ages 0-9, 5 covers 50-59, etc.
sqlContext.sql('''
    SELECT
        INT(FATALITY_AGE/10) AS FATALITY_RANGE,
        COUNT(INT(FATALITY_AGE/10)) AS NUMBER_FATALITIES
    FROM fatalities_table
    WHERE FATALITY_AGE IS NOT NULL
    GROUP BY INT(FATALITY_AGE/10)
    ORDER BY FATALITY_RANGE
''').show(100)

+--------------+-----------------+
|FATALITY_RANGE|NUMBER_FATALITIES|
+--------------+-----------------+
|             0|               38|
|             1|               43|
|             2|               70|
|             3|               57|
|             4|               46|
|             5|              107|
|             6|               70|
|             7|               42|
|             8|               23|
|             9|                4|
+--------------+-----------------+
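
`FATALITY_RANGE` is the age divided by ten and truncated, so 0 stands for ages 0 to 9 and 5 for ages 50 to 59. A small sketch of the same bucketing in plain Python, with a readable label added; `decade_bucket` and `decade_label` are hypothetical helpers, not part of the notebook.

```python
def decade_bucket(age):
    """Mirror INT(FATALITY_AGE/10): integer decade index, or None for a missing age."""
    if age is None:
        return None
    return int(age / 10)


def decade_label(age):
    """Turn an age into a label such as '50-59'."""
    bucket = decade_bucket(age)
    if bucket is None:
        return None
    low = bucket * 10
    return f"{low}-{low + 9}"


for age in (7, 45, 97, None):
    print(age, decade_bucket(age), decade_label(age))
```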



In [39]:
# Fatalities per age decade and sex; rows with a missing age or sex are excluded
sqlContext.sql('''
    SELECT
        INT(FATALITY_AGE/10) AS FATALITY_RANGE,
        FATALITY_SEX,
        COUNT(INT(FATALITY_AGE/10)) AS NUMBER_FATALITIES
    FROM fatalities_table
    WHERE FATALITY_AGE IS NOT NULL AND FATALITY_SEX IS NOT NULL
    GROUP BY INT(FATALITY_AGE/10), FATALITY_SEX
    ORDER BY FATALITY_RANGE, FATALITY_SEX
''').show(100)

+--------------+------------+-----------------+
|FATALITY_RANGE|FATALITY_SEX|NUMBER_FATALITIES|
+--------------+------------+-----------------+
|             0|           F|               14|
|             0|           M|               21|
|             1|           F|               11|
|             1|           M|               32|
|             2|           F|               19|
|             2|           M|               50|
|             3|           F|               12|
|             3|           M|               45|
|             4|           F|               12|
|             4|           M|               34|
|             5|           F|               30|
|             5|           M|               77|
|             6|           F|               14|
|             6|           M|               56|
|             7|           F|               15|
|             7|           M|               27|
|             8|           F|                9|
|             8|           M|           

In [40]:
# Stop the Spark session created at the top of the notebook
sc.stop()