# 4. PySpark Mini-project 

## Dataset

This analysis studies storm events in the US during 2019, using the NOAA National Weather Service Storm Events dataset, available [here](https://www.ncdc.noaa.gov/stormevents/ftp.jsp). The **data** folder contains three CSV files with the details, fatalities, and locations of every event.

In [1]:
# import SparkSession (entry point for DataFrames) and SQLContext (used for the SQL queries below)
from pyspark.sql import SparkSession, SQLContext

In [2]:
sc = SparkSession.builder.appName("pysparkDataframes").getOrCreate()

In [12]:
sqlContext = SQLContext(sc.sparkContext)  # SQLContext expects a SparkContext; sc is a SparkSession

In [3]:
dfStorm2019 = sc.read.format('csv')\
                .option('header', 'true')\
                .option('delimiter', ',')\
                .option('inferSchema', 'true')\
                .load('../pyspark/data/StormEvents_details-ftp_v1.0_d2019_c20200317.csv.gz')

In [4]:
dfFatalities2019 = sc.read.format('csv')\
                    .option('header', 'true')\
                    .option('delimiter', ',')\
                    .option('inferSchema', 'true')\
                    .load('../pyspark/data/StormEvents_fatalities-ftp_v1.0_d2019_c20200317.csv.gz')

In [5]:
dfLocations2019 = sc.read.format('csv')\
                    .option('header', 'true')\
                    .option('delimiter', ',')\
                    .option('inferSchema', 'true')\
                    .load('../pyspark/data/StormEvents_locations-ftp_v1.0_d2019_c20200317.csv.gz')
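
The three reads above differ only in the file path, so they could be wrapped in a small helper. This is a minimal sketch, not part of the original notebook: `load_storm_csv` is a hypothetical name, and `spark` stands for the session object that the notebook calls `sc`.

```python
def load_storm_csv(spark, path):
    """Read one Storm Events CSV file into a DataFrame.

    `spark` is assumed to be a SparkSession (named `sc` in this notebook).
    """
    return (
        spark.read.format('csv')
        .option('header', 'true')       # first row holds the column names
        .option('delimiter', ',')
        .option('inferSchema', 'true')  # let Spark infer the column types
        .load(path)
    )
```

Calling `load_storm_csv(sc, path)` with each of the three paths above would then return the corresponding DataFrame.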

Displaying the schema of every dataset:

In [6]:
dfStorm2019.printSchema()

root
 |-- BEGIN_YEARMONTH: integer (nullable = true)
 |-- BEGIN_DAY: integer (nullable = true)
 |-- BEGIN_TIME: integer (nullable = true)
 |-- END_YEARMONTH: integer (nullable = true)
 |-- END_DAY: integer (nullable = true)
 |-- END_TIME: integer (nullable = true)
 |-- EPISODE_ID: integer (nullable = true)
 |-- EVENT_ID: integer (nullable = true)
 |-- STATE: string (nullable = true)
 |-- STATE_FIPS: integer (nullable = true)
 |-- YEAR: integer (nullable = true)
 |-- MONTH_NAME: string (nullable = true)
 |-- EVENT_TYPE: string (nullable = true)
 |-- CZ_TYPE: string (nullable = true)
 |-- CZ_FIPS: integer (nullable = true)
 |-- CZ_NAME: string (nullable = true)
 |-- WFO: string (nullable = true)
 |-- BEGIN_DATE_TIME: string (nullable = true)
 |-- CZ_TIMEZONE: string (nullable = true)
 |-- END_DATE_TIME: string (nullable = true)
 |-- INJURIES_DIRECT: integer (nullable = true)
 |-- INJURIES_INDIRECT: integer (nullable = true)
 |-- DEATHS_DIRECT: integer (nullable = true)
 |-- DEATHS_INDIRE

In [7]:
dfFatalities2019.printSchema()

root
 |-- FAT_YEARMONTH: integer (nullable = true)
 |-- FAT_DAY: integer (nullable = true)
 |-- FAT_TIME: integer (nullable = true)
 |-- FATALITY_ID: integer (nullable = true)
 |-- EVENT_ID: integer (nullable = true)
 |-- FATALITY_TYPE: string (nullable = true)
 |-- FATALITY_DATE: string (nullable = true)
 |-- FATALITY_AGE: integer (nullable = true)
 |-- FATALITY_SEX: string (nullable = true)
 |-- FATALITY_LOCATION: string (nullable = true)
 |-- EVENT_YEARMONTH: integer (nullable = true)



In [8]:
dfLocations2019.printSchema()

root
 |-- YEARMONTH: integer (nullable = true)
 |-- EPISODE_ID: integer (nullable = true)
 |-- EVENT_ID: integer (nullable = true)
 |-- LOCATION_INDEX: integer (nullable = true)
 |-- RANGE: double (nullable = true)
 |-- AZIMUTH: string (nullable = true)
 |-- LOCATION: string (nullable = true)
 |-- LATITUDE: double (nullable = true)
 |-- LONGITUDE: double (nullable = true)
 |-- LAT2: integer (nullable = true)
 |-- LON2: integer (nullable = true)



DataFrames have a **registerTempTable** method that registers the DataFrame as a temporary table, so that it can be queried with Spark SQL. The query results can then be saved using **.write.saveAsTable(name_table)**. RDDs do not have this method because they are not structured tables.

Note: **registerTempTable** is deprecated. Use **createOrReplaceTempView** instead.

In [11]:
dfStorm2019.createOrReplaceTempView('stormDetails_table')
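
As a rough illustration of the workflow described above (register a view, query it with Spark SQL, optionally persist the result), consider the sketch below. The function name `query_view` and its parameters are hypothetical; `spark` is assumed to be a SparkSession and `df` a DataFrame. Where a saved table ends up depends on the session's warehouse configuration.

```python
def query_view(spark, df, view_name, query, table_name=None):
    """Register `df` as a temporary view, run `query`, optionally save the result."""
    # The temporary view only lives as long as the Spark session.
    df.createOrReplaceTempView(view_name)
    result = spark.sql(query)
    if table_name is not None:
        # Persist the query result as a table in the session catalog.
        result.write.saveAsTable(table_name)
    return result
```

For instance, `query_view(sc, dfStorm2019, 'stormDetails_table', 'SELECT STATE FROM stormDetails_table')` would return a DataFrame without saving anything.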

### 1. Number of events by STATE

In [25]:
sqlContext.sql('SELECT STATE, COUNT(STATE) AS NUMBER_EVENTS \
                FROM stormDetails_table \
                GROUP BY STATE ORDER BY COUNT(STATE) DESC').show(20)

+--------------+-------------+
|         STATE|NUMBER_EVENTS|
+--------------+-------------+
|         TEXAS|         4338|
|        KANSAS|         2672|
|    CALIFORNIA|         2643|
|  SOUTH DAKOTA|         2543|
|      NEW YORK|         2514|
|      VIRGINIA|         2398|
|  PENNSYLVANIA|         2395|
|          OHIO|         2279|
|          IOWA|         2276|
|      MISSOURI|         2159|
|     MINNESOTA|         2126|
|      NEBRASKA|         2088|
|      ILLINOIS|         2084|
|      OKLAHOMA|         1801|
|      COLORADO|         1776|
|     WISCONSIN|         1573|
|      KENTUCKY|         1522|
|NORTH CAROLINA|         1448|
|       INDIANA|         1447|
|       MONTANA|         1286|
+--------------+-------------+
only showing top 20 rows



### 2. Duration of events

First, we inspect the date fields available in the `stormDetails_table` view.

In [31]:
sqlContext.sql('SELECT BEGIN_YEARMONTH, BEGIN_DAY, BEGIN_TIME,  \
                END_YEARMONTH, END_DAY, END_TIME \
                FROM stormDetails_table').show(5)

+---------------+---------+----------+-------------+-------+--------+
|BEGIN_YEARMONTH|BEGIN_DAY|BEGIN_TIME|END_YEARMONTH|END_DAY|END_TIME|
+---------------+---------+----------+-------------+-------+--------+
|         201905|        9|      1554|       201905|      9|    1830|
|         201907|       15|      1640|       201907|     15|    1641|
|         201910|       20|      2223|       201910|     20|    2223|
|         201910|       20|      2312|       201910|     20|    2312|
|         201910|       20|      2236|       201910|     20|    2236|
+---------------+---------+----------+-------------+-------+--------+
only showing top 5 rows



In [46]:
# Zero-pad BEGIN_DAY and BEGIN_TIME so single-digit days and times before 10:00 format correctly
sqlContext.sql("SELECT CONCAT(SUBSTRING(BEGIN_YEARMONTH, 1, 4), '-', \
                SUBSTRING(BEGIN_YEARMONTH, 5, 2), '-', \
                LPAD(BEGIN_DAY, 2, '0'), ' ', \
                SUBSTRING(LPAD(BEGIN_TIME, 4, '0'), 1, 2), ':', \
                SUBSTRING(LPAD(BEGIN_TIME, 4, '0'), 3, 2)) AS BEGIN_DATE \
                FROM stormDetails_table").show(5)

+----------------+
|      BEGIN_DATE|
+----------------+
|2019-05-09 15:54|
|2019-07-15 16:40|
|2019-10-20 22:23|
|2019-10-20 23:12|
|2019-10-20 22:36|
+----------------+
only showing top 5 rows
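
To get from these fields to an event duration, one option is to split the integer columns arithmetically instead of by substring. This also handles times before 10:00, which lose their leading zero when stored as integers (for example, 554 for 05:54). The sketch below is plain Python, not the notebook's own method; the helper names `to_datetime` and `duration_minutes` are hypothetical. It ignores time zones and assumes the `*_TIME` columns encode the time as HHMM.

```python
from datetime import datetime

def to_datetime(yearmonth, day, hhmm):
    """Build a datetime from YEARMONTH (YYYYMM), DAY and TIME (HHMM) integers."""
    year, month = divmod(yearmonth, 100)  # 201905 -> (2019, 5)
    hour, minute = divmod(hhmm, 100)      # 1554 -> (15, 54); 554 -> (5, 54)
    return datetime(year, month, day, hour, minute)

def duration_minutes(begin, end):
    """Minutes elapsed between two (yearmonth, day, hhmm) tuples."""
    delta = to_datetime(*end) - to_datetime(*begin)
    return delta.total_seconds() / 60

# First row of the date fields shown earlier: 15:54 to 18:30 on 2019-05-09
print(duration_minutes((201905, 9, 1554), (201905, 9, 1830)))  # 156.0
```

A function like this could be applied row by row after converting to pandas, or wrapped as a Spark UDF, although built-in Spark SQL date functions are likely to be faster on large tables.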



In [47]:
#SELECT DATEDIFF(month, '2017/08/25', '2011/08/25') AS DateDiff;

In [26]:
df_stormPandas = dfStorm2019.toPandas()

In [None]:
sc.stop()