# Automated Detection of Hazards


This notebook performs an exploratory analysis of the particulate matter (PM) dataset.


Authors: 
- [Cristhian Castillo (KorKux)](https://github.com/Korkux1)
- [Christian Urcuqui](https://github.com/urcuqui)
- [Jhoan Steven Delgado Villarreal](https://)

Date: 03 October 2020


In [77]:
from datetime import datetime
print(f"last update {datetime.now()}")

last update 2020-10-03 15:37:06.509723


## Data Description

Particulate matter with a diameter of 2.5 µm (micrometers) or less, known as PM2.5, affects human health. Because surface PM2.5 monitors have limited spatial and temporal coverage, data from geostationary satellites can help estimate surface levels. Meteorological factors that affect surface PM2.5 levels may also make these estimates more accurate.

The dataset has data sorted according to the stations and time.

- **station_id**: Unique identifier of the PM2.5 monitors stationed across the US
- **stime**: Date and time at which the sample was recorded
- **air_data_value**: EPA air data PM2.5 readings
- **RH**: Relative humidity from HRRR
- **UGRD**, **VGRD**: Wind vector components from HRRR
- **HPBL**: Height of the planetary boundary layer from HRRR
- **TMP**: Temperature from HRRR
- **goes_measurement**: AOD reading from GOES-R
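
The two wind components are often easier to interpret as a speed and a direction. The sketch below shows one way to derive both; it assumes `UGRD` is the eastward component and `VGRD` the northward component, which the dataset description does not state explicitly. The column names `wind_speed` and `wind_dir_deg` are hypothetical.

```python
import numpy as np
import pandas as pd

# Two (UGRD, VGRD) pairs copied from the first rows of the dataset.
sample = pd.DataFrame({
    "UGRD": [-2.106623, 1.205877],
    "VGRD": [-1.797583, 1.764917],
})

# Wind speed is the magnitude of the (U, V) vector.
sample["wind_speed"] = np.hypot(sample["UGRD"], sample["VGRD"])

# Meteorological direction: the bearing the wind blows FROM, in degrees
# clockwise from north (assumes U is eastward and V is northward).
sample["wind_dir_deg"] = (
    np.degrees(np.arctan2(-sample["UGRD"], -sample["VGRD"])) % 360
)

print(sample)
```

Speed and direction are a convenience for plots and summaries; the raw components are usually the safer input for linear models because direction wraps around at 360 degrees.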



## Packages

In [78]:
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from sklearn.preprocessing import PowerTransformer


## Dataset

### Sync Drive

In [79]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Load Dataset
Loading the data from a CSV file

In [80]:
df = pd.read_csv('/content/drive/My Drive/Automated Detection of Hazards/Datasets/PM2.5_dataset.csv')

## Data Staging

Overview of the data in its current state

### Data overview


In [81]:
df.head()

Unnamed: 0.1,Unnamed: 0,station_id,stime,air_data_value,RH,UGRD,VGRD,HPBL,TMP,goes_measurement
0,0,06-011-0007,2019-01-02 20:00:00,17.0,31.6,-2.106623,-1.797583,256.61905,282.8188,-0.005922
1,1,06-019-0500,2019-01-02 20:00:00,13.0,62.2,1.205877,1.764917,337.49405,281.6313,0.08709
2,2,06-061-0003,2019-01-02 20:00:00,21.0,61.5,1.518377,1.014917,270.61905,280.1313,0.094333
3,3,06-073-1201,2019-01-02 20:00:00,6.0,15.400001,2.080877,-1.610083,1009.3066,288.1938,-0.024185
4,4,06-079-2004,2019-01-02 20:00:00,7.0,50.7,2.393377,-1.172583,460.43155,285.1938,-0.014013


In [82]:
df.tail()

Unnamed: 0.1,Unnamed: 0,station_id,stime,air_data_value,RH,UGRD,VGRD,HPBL,TMP,goes_measurement
31565,103560,49-035-4002,2019-10-30 23:00:00,6.3,25.5,0.72385,-2.034275,892.4208,273.49664,0.054725
31566,103591,49-021-0005,2019-10-16 20:00:00,6.4,8.1,6.964932,4.869552,1664.2008,297.60638,0.01003
31567,103592,49-035-4002,2019-10-16 20:00:00,8.5,8.900001,2.214932,4.682052,842.38837,297.60638,0.282129
31568,103594,49-035-4002,2019-10-16 21:00:00,8.1,8.3,1.386543,5.465472,1292.0718,298.64655,0.233581
31569,103595,49-021-0005,2019-10-18 18:00:00,2.1,44.100002,-3.219561,-4.265393,980.8905,283.69052,0.200753


Remove the leftover index column `Unnamed: 0`

In [83]:
df = df.drop(['Unnamed: 0'], axis=1)
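
As an alternative to dropping the column after loading, `pd.read_csv` can treat the first CSV column as the index. The sketch below uses a hypothetical two-row stand-in for the real file, so the CSV text is an assumption about how the file is laid out.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for PM2.5_dataset.csv: the empty first
# header cell is what pandas names "Unnamed: 0" on load.
csv_text = (
    ",station_id,stime,air_data_value\n"
    "0,06-011-0007,2019-01-02 20:00:00,17.0\n"
    "1,06-019-0500,2019-01-02 20:00:00,13.0\n"
)

# Default read: the saved index shows up as an extra column.
with_extra = pd.read_csv(io.StringIO(csv_text))

# index_col=0 uses that column as the index, so nothing has to be dropped.
without_extra = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_extra.columns))
print(list(without_extra.columns))
```

With this approach the index keeps the labels stored in the file (the `df.tail()` output shows values such as 103595), so `reset_index(drop=True)` would be needed to get a contiguous 0-based index.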

### Dimensions

Dataset size

In [84]:
print(f'Rows: {df.shape[0]} \nColumns: {df.shape[1]}')

Rows: 31570 
Columns: 9


### NaN Values

In [85]:
df.isna().sum()

station_id          0
stime               0
air_data_value      0
RH                  0
UGRD                0
VGRD                0
HPBL                0
TMP                 0
goes_measurement    0
dtype: int64

## Brief Description

In [86]:
df.describe()

Unnamed: 0,air_data_value,RH,UGRD,VGRD,HPBL,TMP,goes_measurement
count,31570.0,31570.0,31570.0,31570.0,31570.0,31570.0,31570.0
mean,7.948898,42.447127,1.077064,0.459147,1109.665863,294.993541,0.360805
std,5.912331,21.691148,2.895799,2.927115,845.487755,9.300556,0.574477
min,2.0,2.7,-12.932089,-16.077074,19.441486,252.97461,-0.05
25%,4.0,24.800001,-0.811573,-1.375287,404.699665,289.247043,0.083083
50%,6.5,40.4,0.945436,0.298247,978.84455,295.9647,0.197401
75%,10.0,58.100002,2.859779,2.253714,1607.01055,301.890297,0.408507
max,91.0,100.0,15.319673,13.251497,4843.8394,321.82535,4.999973
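
The `TMP` range in the summary (roughly 253 to 322) suggests the values are in kelvin. Under that assumption, which the dataset description does not confirm, a conversion to degrees Celsius makes the range easier to sanity-check. The name `tmp_celsius` is hypothetical.

```python
import pandas as pd

# TMP summary values copied from the df.describe() output above.
tmp_kelvin = pd.Series(
    {"min": 252.97461, "mean": 294.993541, "max": 321.82535}, name="TMP"
)

# Assumption: TMP is in kelvin, so Celsius = kelvin - 273.15.
tmp_celsius = (tmp_kelvin - 273.15).rename("TMP_celsius")

print(tmp_celsius.round(2))
```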


Column data types and non-null counts

In [87]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31570 entries, 0 to 31569
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   station_id        31570 non-null  object 
 1   stime             31570 non-null  object 
 2   air_data_value    31570 non-null  float64
 3   RH                31570 non-null  float64
 4   UGRD              31570 non-null  float64
 5   VGRD              31570 non-null  float64
 6   HPBL              31570 non-null  float64
 7   TMP               31570 non-null  float64
 8   goes_measurement  31570 non-null  float64
dtypes: float64(7), object(2)
memory usage: 2.2+ MB
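
The `stime` column is stored as `object` (plain strings), so time-based operations need a conversion first. A minimal sketch, assuming every value follows the `YYYY-MM-DD HH:MM:SS` pattern seen in the rows displayed above:

```python
import pandas as pd

# Timestamps copied from the df.head() and df.tail() output above.
stime = pd.Series(
    ["2019-01-02 20:00:00", "2019-10-30 23:00:00", "2019-10-16 20:00:00"],
    name="stime",
)

# Parse the strings; the explicit format is an assumption based on the
# rows shown in this notebook.
parsed = pd.to_datetime(stime, format="%Y-%m-%d %H:%M:%S")

print(parsed.dtype)
print(parsed.dt.month.tolist())
print(parsed.dt.hour.tolist())

# Quick check of whether the rows are in chronological order.
print(parsed.is_monotonic_increasing)
```

On the full frame the same call would be `pd.to_datetime(df['stime'])`, after which the `.dt` accessor exposes month, hour and similar fields for grouping.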


Find records whose `goes_measurement` is 2.5 or higher (the filter uses `>=`, so values equal to 2.5 are included)

In [88]:
print(f"Number of records with measurement greater than 2.5: {df[df['goes_measurement'] >= 2.5].shape[0]}")

Number of records with measurement greater than 2.5: 386


In [89]:
df[df['goes_measurement'] >= 2.5]

Unnamed: 0,station_id,stime,air_data_value,RH,UGRD,VGRD,HPBL,TMP,goes_measurement
44,48-273-0314,2019-01-02 23:00:00,5.0,94.200005,3.133377,-5.757474,94.627990,281.53480,4.999973
49,35-013-0016,2019-01-03 15:00:00,6.0,88.500000,-1.019524,0.068159,116.884960,270.94930,3.648572
75,35-001-0026,2019-01-03 18:00:00,5.6,58.300000,-0.649633,-1.928151,334.216600,268.39594,3.994032
120,44-009-0007,2019-01-03 21:00:00,4.0,55.600002,3.178871,-1.741848,1210.400400,278.64014,4.106462
151,48-201-1034,2019-01-03 23:00:00,4.0,89.000000,3.817409,-1.081760,592.357540,281.70847,4.999973
...,...,...,...,...,...,...,...,...,...
31044,48-273-0314,2019-02-07 20:00:00,11.0,87.400000,-2.742662,0.787426,44.722775,291.76843,4.999973
31128,29-189-3001,2019-02-03 23:00:00,7.8,61.600002,-1.142912,2.629372,1076.420500,290.47894,4.999973
31180,48-273-0314,2019-02-05 14:00:00,8.0,93.200005,-2.191226,2.462985,26.406189,289.92600,4.999973
31462,49-021-0005,2019-10-18 16:00:00,2.6,62.300000,-3.190867,-3.497366,763.928650,280.52770,2.586146
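
Several of the rows above share the exact value 4.999973, which is also the column maximum reported by `df.describe()`. The sketch below counts readings that sit at the maximum; reading them as saturated or clipped values is an assumption, not something the dataset description states.

```python
import pandas as pd

# goes_measurement values copied from the filtered table above.
goes = pd.Series(
    [4.999973, 3.648572, 3.994032, 4.106462, 4.999973, 2.586146],
    name="goes_measurement",
)

# Flag readings equal to the largest observed value. Treating these as
# saturated or clipped is an assumption made for illustration.
at_ceiling = goes == goes.max()

print(f"{at_ceiling.sum()} of {len(goes)} readings equal the maximum {goes.max()}")
```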


## Visualizations

### Data Distribution

Air data Distribution

In [90]:
px.histogram(df, x='air_data_value')

RH Data Distribution

In [91]:
px.histogram(df, x='RH')

UGRD Data Distribution

In [92]:
px.histogram(df, x='UGRD')

VGRD Data Distribution

In [93]:
px.histogram(df, x='VGRD')

HPBL Data Distribution

In [94]:
px.histogram(df, x='HPBL')

TMP Data Distribution

In [95]:
px.histogram(df, x='TMP')

goes_measurement Data Distribution

In [96]:
px.histogram(df, x='goes_measurement')
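
The shape seen in a histogram can also be summarized with a single number. The sketch below computes the sample skewness with pandas on a handful of `air_data_value` readings; it is only an illustration on ten rows, not a result for the full dataset.

```python
import pandas as pd

# air_data_value readings copied from the df.head() and df.tail() output.
air = pd.Series(
    [17.0, 13.0, 21.0, 6.0, 7.0, 6.3, 6.4, 8.5, 8.1, 2.1],
    name="air_data_value",
)

# Positive skewness means a long right tail; values near 0 mean symmetry.
print(f"skewness: {air.skew():.3f}")
```

On the full frame, `df.skew(numeric_only=True)` would report one value per numeric column, which helps decide which columns might benefit from a power transform.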

### Relationship between air_data_value and goes_measurement

In [97]:
px.histogram(df, x='air_data_value', y='goes_measurement', nbins=20, histfunc='avg')

### Relationship between RH and goes_measurement

In [98]:
px.histogram(df, x='RH', y='goes_measurement', nbins=20, histfunc='avg')

### Relationship between UGRD and goes_measurement

In [99]:
px.histogram(df, x='UGRD', y='goes_measurement', nbins=20, histfunc='avg')

### Relationship between VGRD and goes_measurement

In [100]:
px.histogram(df, x='VGRD', y='goes_measurement', nbins=20, histfunc='avg')

### Relationship between HPBL and goes_measurement

In [101]:
px.histogram(df, x='HPBL', y='goes_measurement', nbins=20, histfunc='avg')

### Relationship between TMP and goes_measurement

In [102]:
px.histogram(df, x='TMP', y='goes_measurement', nbins=20, histfunc='avg')
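
Each plot in this group bins the x variable and shows the average `goes_measurement` per bin (`histfunc='avg'`). The same idea can be sketched in plain pandas with `pd.cut` and `groupby`. Plotly chooses its own bin edges, so this is an approximation of the plotted values rather than an exact reproduction, and the names `bins` and `binned_mean` are hypothetical.

```python
import pandas as pd

# (RH, goes_measurement) pairs copied from the df.head() and df.tail() output.
sample = pd.DataFrame({
    "RH": [31.6, 62.2, 61.5, 15.400001, 50.7,
           25.5, 8.1, 8.900001, 8.3, 44.100002],
    "goes_measurement": [-0.005922, 0.08709, 0.094333, -0.024185, -0.014013,
                         0.054725, 0.01003, 0.282129, 0.233581, 0.200753],
})

# Split RH into equal-width bins and average goes_measurement in each one.
# Four bins keep the toy output short; the plots request up to 20 via nbins.
bins = pd.cut(sample["RH"], bins=4)
binned_mean = sample.groupby(bins, observed=True)["goes_measurement"].mean()

print(binned_mean)
```

A table like this is handy when the exact per-bin averages are needed for a report, or when the bin count per group should be inspected alongside the mean.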

## Variable correlation

In [103]:
columns_t_analyze = df.select_dtypes(["float64", "int64"])
transformer =  PowerTransformer(method='yeo-johnson').fit(columns_t_analyze)

In [104]:
# Reuse the transformer fitted in the previous cell instead of fitting a second one
columns_transformed = transformer.transform(columns_t_analyze)

In [105]:
columns_transformed = pd.DataFrame(columns_transformed)
columns_transformed.columns = columns_t_analyze.columns

In [106]:
fig = go.Figure(
    data=go.Heatmap(
        z=columns_transformed.corr(),
        x=columns_transformed.columns,
        y=columns_transformed.columns,
        hoverongaps=False,
    )
)
fig.show()
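
As a cross-check on the heatmap, a rank-based (Spearman) correlation can be computed directly on the raw columns. This is a different technique from the Yeo-Johnson plus Pearson approach used in this notebook: Spearman depends only on the ordering of values, so it is unaffected by monotonic transforms such as Yeo-Johnson. The sketch uses ten rows copied from the notebook output, so the numbers are illustrative only.

```python
import pandas as pd

# Rows copied from the df.head() and df.tail() output of this notebook.
sample = pd.DataFrame({
    "air_data_value": [17.0, 13.0, 21.0, 6.0, 7.0, 6.3, 6.4, 8.5, 8.1, 2.1],
    "RH": [31.6, 62.2, 61.5, 15.400001, 50.7,
           25.5, 8.1, 8.900001, 8.3, 44.100002],
    "HPBL": [256.61905, 337.49405, 270.61905, 1009.3066, 460.43155,
             892.4208, 1664.2008, 842.38837, 1292.0718, 980.8905],
    "goes_measurement": [-0.005922, 0.08709, 0.094333, -0.024185, -0.014013,
                         0.054725, 0.01003, 0.282129, 0.233581, 0.200753],
})

# Spearman correlation: Pearson correlation computed on the ranks.
spearman = sample.corr(method="spearman")

# Correlation of every other column with the PM2.5 reading.
with_target = spearman["air_data_value"].drop("air_data_value")

print(with_target)
```

If the Spearman matrix and the heatmap of the transformed data disagree strongly for some pair of columns, that pair is worth a closer look with a scatter plot.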
