# Exploratory Analysis on Harmful Events

| | |
| --- | --- |
| Author | Julie Koesmarno |
| Last Updated | 2/10/2020 |
| Purpose | To showcase data manipulation and analysis in Python as well as Data Visualization in Plotly - all done in Azure Data Studio! |

## Abstract
This paper uses exploratory analysis to find which types of severe weather events are most harmful to population health and to the economy. Using the NOAA Storm Database, which holds records since 1950, we show that tornadoes are the most harmful event type, both for population health (injuries and fatalities combined) and economically.

## Introduction
This paper explores the [NOAA Storm Database](https://www.ncdc.noaa.gov/stormevents/) and answers some basic questions about severe weather events. It uses data from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database and serves as an example of reproducible data analysis.

This paper uses a simple visualization of the top 10 event types ranked by Property Damage, which represents total economic impact since 1950, shown alongside Fatalities + Injuries, which represents total population health impact. It is also possible to analyze which occurrence had the most damaging impact at a given time, or the average or median impact per event type, which would be useful for deeper future analysis. We include a method for doing so as a reference for evolving this project.
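
The per-event average and median mentioned above, and the most damaging record in a given period, could be computed with a pandas `groupby`. The sketch below is only an illustration: the rows are hypothetical, not taken from the NOAA data, and the `YEAR` column is an assumption (it would have to be derived from `BGN_DATE`). Only the `EVTYPE` and `PROPDMG` names mirror the real dataset.

```python
import pandas as pd

# Hypothetical sample rows; only EVTYPE and PROPDMG mirror the NOAA column names.
sample = pd.DataFrame({
    "YEAR":    [1950, 1950, 1951, 1951, 1951],
    "EVTYPE":  ["TORNADO", "HAIL", "TORNADO", "FLOOD", "TORNADO"],
    "PROPDMG": [25.0, 2.5, 250.0, 50.0, 5.0],
})

# Mean and median property damage per event type.
per_event = sample.groupby("EVTYPE")["PROPDMG"].agg(["mean", "median"])

# The single most damaging record in each year.
worst_per_year = sample.loc[sample.groupby("YEAR")["PROPDMG"].idxmax()]

print(per_event)
print(worst_per_year)
```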

## Step 1: Load data into dataframe

Download the zipped data from https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2 

In [1]:
import urllib.request
import pandas as pd

print('Beginning file download with urllib.request...')

url = 'https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2'
# Adjust this path to a writable location on your machine
local_path = 'C:/Users/jukoesma/Desktop/StormData.csv.bz2'
urllib.request.urlretrieve(url, local_path)

# The file is bz2-compressed and encoded as ISO-8859-1 rather than UTF-8
df = pd.read_csv(local_path, compression='bz2', encoding='ISO-8859-1')

Beginning file download with urllib.request...
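
To try the `read_csv` options without downloading the full file, a small round trip through a temporary bz2-compressed CSV may help. This is a sketch under stated assumptions: the two sample rows and the `sample.csv.bz2` file name are hypothetical, and only the `compression` and `encoding` arguments match the cell above.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row sample standing in for the real storm data.
sample = pd.DataFrame({"EVTYPE": ["TORNADO", "HAIL"], "PROPDMG": [25.0, 2.5]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.csv.bz2")
    # Write a bz2-compressed CSV with the same encoding used for the NOAA file.
    sample.to_csv(path, index=False, compression="bz2", encoding="ISO-8859-1")
    # Read it back with the same options the notebook uses for StormData.csv.bz2.
    loaded = pd.read_csv(path, compression="bz2", encoding="ISO-8859-1")

print(loaded)
```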



Check the dataset by inspecting the first few rows

In [2]:
df.head()


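| | STATE__ | BGN_DATE | BGN_TIME | TIME_ZONE | COUNTY | COUNTYNAME | STATE | EVTYPE | BGN_RANGE | BGN_AZI | ... | CROPDMGEXP | WFO | STATEOFFIC | ZONENAMES | LATITUDE | LONGITUDE | LATITUDE_E | LONGITUDE_ | REMARKS | REFNUM |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1.0 | 4/18/1950 0:00:00 | 130 | CST | 97.0 | MOBILE | AL | TORNADO | 0.0 | | ... | | | | | 3040.0 | 8812.0 | 3051.0 | 8806.0 | | 1.0 |
| 1 | 1.0 | 4/18/1950 0:00:00 | 145 | CST | 3.0 | BALDWIN | AL | TORNADO | 0.0 | | ... | | | | | 3042.0 | 8755.0 | 0.0 | 0.0 | | 2.0 |
| 2 | 1.0 | 2/20/1951 0:00:00 | 1600 | CST | 57.0 | FAYETTE | AL | TORNADO | 0.0 | | ... | | | | | 3340.0 | 8742.0 | 0.0 | 0.0 | | 3.0 |
| 3 | 1.0 | 6/8/1951 0:00:00 | 900 | CST | 89.0 | MADISON | AL | TORNADO | 0.0 | | ... | | | | | 3458.0 | 8626.0 | 0.0 | 0.0 | | 4.0 |
| 4 | 1.0 | 11/15/1951 0:00:00 | 1500 | CST | 43.0 | CULLMAN | AL | TORNADO | 0.0 | | ... | | | | | 3412.0 | 8642.0 | 0.0 | 0.0 | | 5.0 |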


Check the number of rows and the column names

In [3]:
len(df)

902297

In [4]:
df.columns

Index(['STATE__', 'BGN_DATE', 'BGN_TIME', 'TIME_ZONE', 'COUNTY', 'COUNTYNAME',
       'STATE', 'EVTYPE', 'BGN_RANGE', 'BGN_AZI', 'BGN_LOCATI', 'END_DATE',
       'END_TIME', 'COUNTY_END', 'COUNTYENDN', 'END_RANGE', 'END_AZI',
       'END_LOCATI', 'LENGTH', 'WIDTH', 'F', 'MAG', 'FATALITIES', 'INJURIES',
       'PROPDMG', 'PROPDMGEXP', 'CROPDMG', 'CROPDMGEXP', 'WFO', 'STATEOFFIC',
       'ZONENAMES', 'LATITUDE', 'LONGITUDE', 'LATITUDE_E', 'LONGITUDE_',
       'REMARKS', 'REFNUM'],
      dtype='object')
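
As a small aside, `DataFrame.shape` reports the row and column counts in one call. The sketch below uses a hypothetical three-row frame rather than the storm data.

```python
import pandas as pd

# Hypothetical sample; the real df has 902297 rows.
sample = pd.DataFrame({
    "EVTYPE": ["TORNADO", "HAIL", "FLOOD"],
    "PROPDMG": [25.0, 2.5, 50.0],
})

# shape is a (rows, columns) tuple.
n_rows, n_cols = sample.shape
print(n_rows, n_cols)
```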

## Step 2: Visualize the top 10 natural disasters

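In this step, you will see that Tornado has a total `PROPDMG` above 3M and nearly 100K total fatalities and injuries. Note that the chart sums the raw `PROPDMG` values, without applying the magnitude codes stored in `PROPDMGEXP`.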
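
In the NOAA data, each `PROPDMG` value is paired with a magnitude code in `PROPDMGEXP`. A possible refinement, not used in this notebook, would be to scale the values to dollars before summing. The sketch below assumes that only the codes K, M, and B (thousand, million, billion) need handling and treats anything else as a multiplier of 1; the real column contains other codes that would need their own treatment. The sample rows and the `PROPDMG_USD` column name are hypothetical.

```python
import pandas as pd

# Assumed mapping: only the K/M/B codes are handled in this sketch.
MULTIPLIERS = {"K": 1e3, "M": 1e6, "B": 1e9}

# Hypothetical rows; only the column names mirror the NOAA dataset.
sample = pd.DataFrame({
    "EVTYPE": ["TORNADO", "TORNADO", "FLOOD", "HAIL"],
    "PROPDMG": [25.0, 2.5, 1.5, 10.0],
    "PROPDMGEXP": ["K", "B", "M", None],
})

# Upper-case the code, look up its multiplier, and fall back to 1 when it is unknown or missing.
factor = sample["PROPDMGEXP"].str.upper().map(MULTIPLIERS).fillna(1.0)
sample["PROPDMG_USD"] = sample["PROPDMG"] * factor

print(sample.groupby("EVTYPE")["PROPDMG_USD"].sum())
```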

In [5]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots

df['INCIDENTS'] = df['FATALITIES'] + df['INJURIES']
tdf = df.groupby(['EVTYPE']).agg({'PROPDMG':'sum','INCIDENTS':'sum'}).reset_index()

tdf = tdf.sort_values(by=['PROPDMG'], ascending = False).head(10)

# Create figure with secondary y-axis
fig = make_subplots(specs=[[{"secondary_y": True}]])

# Add traces
fig.add_trace(
    go.Scatter(x=tdf['EVTYPE'], y=tdf['PROPDMG'], name="Property Damage", mode='markers'),
    secondary_y=False,
)

fig.add_trace(
    go.Scatter(x=tdf['EVTYPE'], y=tdf['INCIDENTS'], name="Incidents", mode='markers'),
    secondary_y=True,
)

# Add figure title
fig.update_layout(
    title_text="Top 10 Natural Disasters"
)

# Set x-axis title
fig.update_xaxes(title_text="Event Type")

# Set y-axes titles
fig.update_yaxes(title_text="Property Damage ($)", secondary_y=False)
fig.update_yaxes(title_text="Fatalities + Injuries", secondary_y=True)

fig.show()
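
To see what the aggregation in the cell above produces before plotting, the same `groupby`, `sort_values`, and `head` steps can be run on a tiny frame. The rows here are hypothetical and `head(2)` stands in for `head(10)`.

```python
import pandas as pd

# Hypothetical rows; only the column names mirror the NOAA dataset.
sample = pd.DataFrame({
    "EVTYPE": ["TORNADO", "HAIL", "TORNADO", "FLOOD", "HAIL"],
    "PROPDMG": [25.0, 2.5, 250.0, 50.0, 1.0],
    "FATALITIES": [0, 0, 1, 2, 0],
    "INJURIES": [15, 0, 4, 3, 1],
})

# Combine fatalities and injuries into one health-impact measure.
sample["INCIDENTS"] = sample["FATALITIES"] + sample["INJURIES"]

# Total both measures per event type, then keep the top rows by property damage.
totals = sample.groupby("EVTYPE").agg({"PROPDMG": "sum", "INCIDENTS": "sum"}).reset_index()
top2 = totals.sort_values(by="PROPDMG", ascending=False).head(2)

print(top2)
```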



## Optional: Exploring Property Damage statistics on the top 10 event types

The following interactive chart lets you apply simple aggregation functions (such as sum, average, or median) to Property Damage for the top 10 event types. Only records with `PROPDMG` above 100 are included.

In [6]:
import plotly.io as pio

# Keep only records for the top 10 event types with PROPDMG above 100
ndf = df[df.EVTYPE.isin(tdf.EVTYPE) & (df.PROPDMG > 100)]

subject = ndf['EVTYPE']
score = ndf['PROPDMG']

aggs = ["count","sum","avg","median","mode","rms","stddev","min","max","first","last"]

# Build one dropdown button per aggregation function
agg_func = [
    dict(
        args=['transforms[0].aggregations[0].func', agg],
        label=agg,
        method='restyle'
    )
    for agg in aggs
]


data = [dict(
  type = 'scatter',
  x = subject,
  y = score,
  mode = 'markers',
  transforms = [dict(
    type = 'aggregate',
    groups = subject,
    aggregations = [dict(
        target = 'y', func = 'sum', enabled = True)
    ]
  )]
)]

layout = dict(
  title = '<b>Plotly Aggregations</b><br>aggregation:<br> ',
  xaxis = dict(title = 'Event Type'),
  yaxis = dict(title = 'Property Damage'),
  updatemenus = [dict(
        x = 0.85,
        y = 1.15,
        xref = 'paper',
        yref = 'paper',
        yanchor = 'top',
        active = 1,
        showactive = False,
        buttons = agg_func
  )]
)

fig_dict = dict(data=data, layout=layout)

pio.show(fig_dict, validate=False)
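
The chart computes its aggregations in the browser. To cross-check the numbers, or to get them as a table, several of the same statistics could be computed in pandas. This is a sketch on hypothetical rows, not the notebook's method: Plotly's `avg` corresponds to `mean` in pandas, and `rms` is computed by hand because pandas has no built-in aggregation of that name.

```python
import pandas as pd

# Hypothetical rows, all above the PROPDMG > 100 filter used for the chart.
sample = pd.DataFrame({
    "EVTYPE": ["TORNADO", "TORNADO", "TORNADO", "HAIL", "HAIL"],
    "PROPDMG": [150.0, 250.0, 500.0, 200.0, 400.0],
})

grouped = sample.groupby("EVTYPE")["PROPDMG"]

# Built-in aggregations, one column per statistic.
stats = grouped.agg(["count", "sum", "mean", "median", "min", "max"])

# Root mean square: square, average, then take the square root.
stats["rms"] = grouped.agg(lambda s: (s ** 2).mean() ** 0.5)

print(stats)
```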


In [7]:
import datetime
now = datetime.datetime.now()
print("Current date and time: ")
print(str(now))


Current date and time: 
2021-10-08 16:51:06.737370


## Conclusion

Based on the above analysis, Tornado causes the most total property damage and the most fatalities and injuries, and is therefore the most harmful event type both for population health and economically.

### References
Useful websites that helped with the data visualizations:

* [Plotly Aggregations](https://plot.ly/python/aggregations/)
* [Azure Data Studio - Notebooks](https://docs.microsoft.com/en-us/sql/big-data-cluster/notebooks-guidance?view=sql-server-ver15)

