<a id='ds0'></a>
#  <div class="h1">  DS4G: Environmental Insights Explorer 🌏</div>
### Exploring alternatives for emissions factor calculations
    
    

[🌏🌿Green Future: Analysis and Solution](https://www.kaggle.com/caesarlupum/green-future-analysis-and-solution/)

<div class="h3"> Submissions: </div>

The kernel submissions, in order:
<ul>
    <li>
        <a href="https://www.kaggle.com/caesarlupum/ds4g-go-to-the-green-future" target="_blank">Part 1: 🌏🌿DS4G: Go to the Green Future! - A Gentle Introduction </a>  
    </li>
    <li>
        <a href="https://www.kaggle.com/maxlenormand/saving-the-power-plants-csv-to-geojson" target="_blank">Part 2: Saving the Power Plants CSV to GeoJSON - EDA Analysis - Tutorial, analytics </a>  
    </li>
    <li>
        <a href="https://www.kaggle.com/caesarlupum/ds4g-anomaly-analysis" target="_blank">Part 3: 🌏🌿Green Future: Anomaly Analysis & Time Series - A Deep Analysis </a>  
    </li>

</ul>

<div align='center'><font size="5" color="#00b899">🌏🌿Green Future: Anomaly Analysis & Time Series</font></div>
<div align='center'>Other Parts: <a href='https://www.kaggle.com/caesarlupum/ds4g-go-to-the-green-future'>Part 1</a> | <a href='https://www.kaggle.com/maxlenormand/saving-the-power-plants-csv-to-geojson'>Part 2</a> | <a href='https://www.kaggle.com/caesarlupum/ds4g-anomaly-analysis'>Part 3</a>  

</div>

<a class="anchor" id="top"></a>
<a id='dsf4'></a>
# <div class="h2">  Table of contents</div>

1. [Glimpse of Data](#PREPARATION)
    * [Import packages](#IMPORT)
2. [Reading S5P data: Weather and NO2](#READS5P)
3. [Visuals](#V1)
4. [Anomaly Analysis](#OUTLIER)
    4.1 [Gaussian](#OUTLIER1)
    4.2 [Isolation Forest](#OUTLIER2)
    4.3 [One Class SVM](#OUTLIER3)

5. [Prediction using LSTM with Python](#LSTM1)
    5.1 [Get the root mean squared error (RMSE)](#LSTM2)

6. [ARIMA with Python](#ARP)
    6.1. [Rolling Forecast ARIMA Model](#ARP2)

7. [Time series prediction using Prophet in Python](#PRO)
    7.1. [Forecast quality evaluation](#PRO1)
    7.2. [Incorporating the Effects of Weather Condition](#PRO2)
    7.3. [Forecast quality evaluation](#PRO3)
    7.4. [Save The Model](#PRO4)
8. [Prediction of NO2 density for each primary_fuel throughout the year](#M1)
    8.1. [Forecast quality evaluation for Power Plant over the year](#M2)
    8.2. [Outlier Analysis of Power Plant - Coal over the year](#M3)
       
9. [About the data](#ABOUTDATA)  
10. [Ending note](#END)  

  <hr>

## In this notebook we investigate the NO2 concentration in the air and how it changes over days and years. Because accurate air quality forecasts matter, detecting anomalously high increases in pollutant concentrations cannot be postponed. This study aims to inform government decision-making and to educate people about the spatiotemporal, geographical, and economic conditions behind anomalously high NO2 concentrations in the air. We model the problem and analyze the impact of air pollution for each region of Puerto Rico and for each primary_fuel over the year.

<div class="h2"> Glimpse of Data - Power Plants </div>
<a id="PREPARATION"></a>
[Back to Table of Contents](#top)

[General Findings](#theend)
  

# <div class="h3">Imports </div>
<a id="IMPORT"></a>
[Back to Table of Contents](#top)

We use the following stack: ``numpy``, ``pandas``, ``sklearn``, ``matplotlib``, ``rasterio``, ``plotly``, plus ``keras``, ``statsmodels`` and ``fbprophet`` for the forecasting sections.

In [None]:
import os, random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
pd.set_option('display.max_columns', 200)  # full option name; the bare 'max_columns' is ambiguous in newer pandas

In [None]:
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM
from sklearn import preprocessing

In [None]:
# !pip install plotly
# !pip install fbprophet

In [None]:
from keras.models import Sequential
from keras.layers import Dense, LSTM

import datetime as dt
from statsmodels.tsa.arima_model import ARIMA

from fbprophet import Prophet
from fbprophet.plot import plot_plotly

import plotly.offline as py
from matplotlib import pyplot
py.init_notebook_mode()


In [None]:
%%HTML
<style type="text/css">
div.h1 {
    background-color: #00b899; 
    color: white; 
    padding: 8px; 
    padding-right: 300px; 
    font-size: 35px; 
    max-width: 1500px; 
    margin: auto; 
    margin-top: 50px;
}

div.h2 {
    background-color: #00b899; 
    color: white; 
    padding: 8px; 
    padding-right: 300px; 
    font-size: 25px; 
    max-width: 1500px; 
    margin: auto; 
    margin-top: 50px;
}
div.h3 {
    color: #00b899;
    font-size: 16px; 
    margin-top: 20px; 
    margin-bottom:4px;
}
div.h4 {
    font-size: 15px; 
    margin-top: 20px; 
    margin-bottom: 8px;
}
span.note {
    font-size: 5; 
    color: gray; 
    font-style: italic;
}
span.captiona {
    font-size: 5; 
    color: dimgray; 
    font-style: italic;
    margin-left: 130px;
    vertical-align: top;
}
hr {
    display: block; 
    color: gray;
    height: 1px; 
    border: 0; 
    border-top: 1px solid;
}
hr.light {
    display: block; 
    color: lightgray;
    height: 1px; 
    border: 0; 
    border-top: 1px solid;
}

</style>

<div class="h1"> Reading S5P data: Weather and NO2 </div>
<a id="READS5P"></a>
[Back to Table of Contents](#top)

[General Findings](#theend)
  
For more detail on how this data was prepared, see: [Prepare Data for Modeling](https://www.kaggle.com/caesarlupum/ds4g-go-to-the-green-future#-Satellite-Information)

In [None]:
import pandas as pd
no2_weather = pd.read_csv("../input/s5p-data-csv/no2_weather.csv")
s5p_no2_pictures_df = pd.read_csv("../input/s5p-data-csv/s5p_no2_pictures_df.csv")
weather_pictures_df = pd.read_csv("../input/s5p-data-csv/weather_pictures_df.csv")

In [None]:
import rasterio as rio
def split_column_into_new_columns(dataframe,column_to_split,new_column_one,begin_column_one,end_column_one):
    for i in range(0, len(dataframe)):
        dataframe.loc[i, new_column_one] = dataframe.loc[i, column_to_split][begin_column_one:end_column_one]
    return dataframe

Power plants in Puerto Rico

In [None]:
power_plants = pd.read_csv('/kaggle/input/ds4g-environmental-insights-explorer/eie_data/gppd/gppd_120_pr.csv')
power_plants = split_column_into_new_columns(power_plants,'.geo','latitude',50,66)
power_plants = split_column_into_new_columns(power_plants,'.geo','longitude',31,48)
power_plants['latitude'] = power_plants['latitude'].astype(float)
a = np.array(power_plants['latitude'].values.tolist()) # the fixed-offset slice can drop the leading 1, giving 8.x instead of 18.x
power_plants['latitude'] = np.where(a < 10, a+10, a).tolist() 

power_plants_df = power_plants.sort_values('capacity_mw',ascending=False).reset_index()
power_plants_df['img_idx_lt']=(((18.6-power_plants_df.latitude)*148/(18.6-17.9))).astype(int)
power_plants_df['img_idx_lg']=((67.3+power_plants_df.longitude.astype(float))*475/(67.3-65.2)).astype(int)
power_plants_df['plant']=power_plants_df.name.str[:3]+power_plants_df.name.str[-1]+'_'+power_plants_df.primary_fuel
power_plants=power_plants_df[['name','latitude','longitude','primary_fuel','capacity_mw','img_idx_lt','img_idx_lg','plant']]
power_plants
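
The fixed-offset slicing above depends on every `.geo` string having its digits at exactly the same character positions, which appears to be why some latitudes lose their leading `1` and need the `+10` correction. As a minimal sketch, assuming the `.geo` column holds GeoJSON point strings of the form `{"type":"Point","coordinates":[lon, lat]}`, the coordinates could instead be parsed with `json`. The helper name `add_lon_lat` and the sample rows are illustrative and not taken from the dataset.

```python
import json
import pandas as pd

def add_lon_lat(df, geo_col=".geo"):
    """Return a copy of df with float longitude/latitude parsed from GeoJSON strings."""
    coords = df[geo_col].map(lambda s: json.loads(s)["coordinates"])
    out = df.copy()
    out["longitude"] = coords.map(lambda c: float(c[0]))  # GeoJSON order is [lon, lat]
    out["latitude"] = coords.map(lambda c: float(c[1]))
    return out

# Illustrative rows only (hypothetical plants, not read from gppd_120_pr.csv)
demo = pd.DataFrame({
    "name": ["Plant A", "Plant B"],
    ".geo": [
        '{"type":"Point","coordinates":[-66.1486,18.4271]}',
        '{"type":"Point","coordinates":[-66.7,17.95]}',
    ],
})
demo = add_lon_lat(demo)
print(demo[["name", "latitude", "longitude"]])
```

Parsing the structure rather than counting characters keeps the result correct when the number of digits in a coordinate varies from row to row.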

<div class="h3"> shape and head no2 weather </div>
<a id="P"></a>
  

In [None]:
no2_weather.shape

In [None]:
no2_weather.head()

<div class="h3"> shape and head s5p  </div>
<a id="P"></a>
  

In [None]:
s5p_no2_pictures_df.shape

In [None]:
s5p_no2_pictures_df.head()

<div class="h3"> shape and head weather  </div>
<a id="P"></a>
  

In [None]:
weather_pictures_df.shape

In [None]:
weather_pictures_df.head()

<div class="h2"> Visuals  </div>
<a id="V1"></a>

[Back to Table of Contents](#top)

[General Findings](#theend)
  

In [None]:
x = weather_pictures_df['date']
y = weather_pictures_df["total_precipitation_surface_mean"]
plt.plot(x,y)
plt.show()

In [None]:
def parser(x):
    return dt.datetime.strptime(x, "%Y-%m-%d")

path= '../input/s5p-data-csv/no2_weather.csv' 
data = pd.read_csv(path, header=0, parse_dates=[0], squeeze=True, date_parser=parser)
data = data[['start_date','no2_emission_sum']]
data["start_date"] = data["start_date"].dt.strftime('%Y%m%d').astype(float)

<div class="h3"> data info  </div>
<a id="DF"></a>
[Back to Table of Contents](#top)

[General Findings](#theend)
  

In [None]:
data.info()

<div class="h3"> drop nan values  </div>
<a id="P"></a>
[Back to Table of Contents](#top)

[General Findings](#theend)
  

In [None]:
data.dropna(axis=0, inplace=True)
print(data.shape)

In [None]:
data = data.set_index('start_date')
data.head()

In [None]:
weather_pictures_df.head(2)

In [None]:
weather_pictures_df.shape

In [None]:
no2_weather['start_date'] = pd.to_datetime(no2_weather['start_date'])
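# NOTE: (x - 32) * 5/9 is the Fahrenheit-to-Celsius formula; applied to NO2 it only rescales the values linearly
no2_weather['no2_emission_sum'] = (no2_weather['no2_emission_sum'] - 32) * 5/9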
# plot the data
no2_weather.plot(x='start_date', y='no2_emission_sum')

<div class="h1">Anomaly Analysis</div>

<a id="OUTLIER"></a>

[Back to Table of Contents](#top)

[General Findings](#theend)
  

Because accurate air quality forecasts matter, detecting anomalously high increases in pollutant concentrations cannot be postponed. This notebook investigates the NO2 concentration in the air, considering its increase over the years as well as its health risks. It also looks for spatiotemporal segments with anomalously high NO2 concentrations in Puerto Rico.


In [None]:
# Suppress warnings 
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=FutureWarning)
from IPython.display import HTML

HTML('<iframe width="700" height="400" src="https://www.youtube.com/embed/8DfXJUDjx64" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>')

Anomaly Detection | Developing And Evaluating An Anomaly Detection System

<div class="h3">Feature engineering</div>
Extracting some features


In [None]:
# An estimate of the fraction of anomalies in the dataset (required by several algorithms)
outliers_fraction = 0.01

no2_weather['day'] = no2_weather['start_date'].dt.day
# the day of the week (Monday=0, Sunday=6)
no2_weather['DayOfTheWeek'] = no2_weather['start_date'].dt.dayofweek

<div class="h3">Creation of 7 distinct emission categories (one per day of the week)</div>


In [None]:
# creation of 7 distinct categories, one per day of the week (Monday=0, Sunday=6)
no2_weather['catDayEmission'] = no2_weather['DayOfTheWeek']

a = no2_weather.loc[no2_weather['catDayEmission'] == 0, 'no2_emission_sum']
b = no2_weather.loc[no2_weather['catDayEmission'] == 1, 'no2_emission_sum']
c = no2_weather.loc[no2_weather['catDayEmission'] == 2, 'no2_emission_sum']
d = no2_weather.loc[no2_weather['catDayEmission'] == 3, 'no2_emission_sum']
e = no2_weather.loc[no2_weather['catDayEmission'] == 4, 'no2_emission_sum']
f = no2_weather.loc[no2_weather['catDayEmission'] == 5, 'no2_emission_sum']
g = no2_weather.loc[no2_weather['catDayEmission'] == 6, 'no2_emission_sum']



In [None]:
# creation of 7 distinct categories, one per day of the week (Monday=0, Sunday=6)
no2_weather['catDayEPrecSurfaceMean'] = no2_weather['DayOfTheWeek']

a2 = no2_weather.loc[no2_weather['catDayEPrecSurfaceMean'] == 0, 'total_precipitation_surface_mean']
b2 = no2_weather.loc[no2_weather['catDayEPrecSurfaceMean'] == 1, 'total_precipitation_surface_mean']
c2 = no2_weather.loc[no2_weather['catDayEPrecSurfaceMean'] == 2, 'total_precipitation_surface_mean']
d2 = no2_weather.loc[no2_weather['catDayEPrecSurfaceMean'] == 3, 'total_precipitation_surface_mean']
e2 = no2_weather.loc[no2_weather['catDayEPrecSurfaceMean'] == 4, 'total_precipitation_surface_mean']
f2 = no2_weather.loc[no2_weather['catDayEPrecSurfaceMean'] == 5, 'total_precipitation_surface_mean']
g2 = no2_weather.loc[no2_weather['catDayEPrecSurfaceMean'] == 6, 'total_precipitation_surface_mean']

These features let us analyse **no2_emission_sum** and **total_precipitation_surface_mean** for each **day of the week**.

<div class="h3">time with int to plot easily</div>


In [None]:
no2_weather['time_epoch'] = (no2_weather['start_date'].astype(np.int64)/100000000000).astype(np.int64)

In [None]:
# Take the useful features and standardize them
data_IF = no2_weather[['time_epoch','DayOfTheWeek','day','no2_emission_sum','temperature_2m_above_ground_mean','specific_humidity_2m_above_ground_mean','relative_humidity_2m_above_ground_mean','u_component_of_wind_10m_above_ground_mean','v_component_of_wind_10m_above_ground_mean','total_precipitation_surface_mean']].copy()

data_IF.dropna(axis=0, inplace=True)
print(data_IF.shape)

# despite the variable name, this is a StandardScaler (zero mean, unit variance per feature)
min_max_scaler = preprocessing.StandardScaler()
np_scaled = min_max_scaler.fit_transform(data_IF)
# keep the original row labels so predictions can later be aligned with no2_weather
data_IF = pd.DataFrame(np_scaled, index=data_IF.index)

<div class="h2">Gaussian</div>

<a id="OUTLIER1"></a>

[Back to Table of Contents](#top)

[General Findings](#theend)
  

In [None]:
# Suppress warnings 
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=FutureWarning)
from IPython.display import HTML

HTML('<iframe width="700" height="400" src="https://www.youtube.com/embed/mh6rAYA0e7Q" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>')

Anomaly Detection | Gaussian Distribution — [ Machine Learning | Andrew Ng ]


In [None]:
# qq = no2_weather.loc[no2_weather['catDayEmission'] == 0, 'no2_emission_sum']
# qq 
no2_weather['catDayEmission'].value_counts()

In [None]:
df_class0 = no2_weather.loc[no2_weather['catDayEmission'] == 0, 'no2_emission_sum']
df_class1 = no2_weather.loc[no2_weather['catDayEmission'] == 1, 'no2_emission_sum']
df_class2 = no2_weather.loc[no2_weather['catDayEmission'] == 2, 'no2_emission_sum']
df_class3 = no2_weather.loc[no2_weather['catDayEmission'] == 3, 'no2_emission_sum']
df_class4 = no2_weather.loc[no2_weather['catDayEmission'] == 4, 'no2_emission_sum']
df_class5 = no2_weather.loc[no2_weather['catDayEmission'] == 5, 'no2_emission_sum']
df_class6 = no2_weather.loc[no2_weather['catDayEmission'] == 6, 'no2_emission_sum']


<div class="h3">Plot the NO2 emission distribution by catDayEmission</div>


In [None]:
fig, axs = plt.subplots(4,2)
df_class0.hist(ax=axs[0,0],bins=28)
df_class1.hist(ax=axs[0,1],bins=28)
df_class2.hist(ax=axs[1,0],bins=28)
df_class3.hist(ax=axs[1,1],bins=28)
df_class4.hist(ax=axs[2,0],bins=28)
df_class5.hist(ax=axs[2,1],bins=28)
df_class6.hist(ax=axs[3,0],bins=28)


In [None]:
df_class0.dropna(axis=0, inplace=True)
df_class1.dropna(axis=0, inplace=True)
df_class2.dropna(axis=0, inplace=True)
df_class3.dropna(axis=0, inplace=True)
df_class4.dropna(axis=0, inplace=True)
df_class5.dropna(axis=0, inplace=True)
df_class6.dropna(axis=0, inplace=True)

In [None]:
print('df_class0.shape ',df_class0.shape)
print('df_class1.shape ',df_class1.shape)
print('df_class2.shape ',df_class2.shape)
print('df_class3.shape ',df_class3.shape)
print('df_class4.shape ',df_class4.shape)
print('df_class5.shape ',df_class5.shape)
print('df_class6.shape ',df_class6.shape)
print('total data no2_weather.shape ',no2_weather.shape)


 
<div class="h3">EllipticEnvelope (Gaussian distribution) for each catDayEmission</div>


In [None]:
# apply EllipticEnvelope (Gaussian distribution) to each category

envelope =  EllipticEnvelope(contamination = outliers_fraction) 
X_train = df_class0.values.reshape(-1,1)
envelope.fit(X_train)
df_class0 = pd.DataFrame(df_class0)
df_class0['deviation'] = envelope.decision_function(X_train)
df_class0['anomaly'] = envelope.predict(X_train)

envelope =  EllipticEnvelope(contamination = outliers_fraction) 
X_train = df_class1.values.reshape(-1,1)
envelope.fit(X_train)
df_class1 = pd.DataFrame(df_class1)
df_class1['deviation'] = envelope.decision_function(X_train)
df_class1['anomaly'] = envelope.predict(X_train)

envelope =  EllipticEnvelope(contamination = outliers_fraction) 
X_train = df_class2.values.reshape(-1,1)
envelope.fit(X_train)
df_class2 = pd.DataFrame(df_class2)
df_class2['deviation'] = envelope.decision_function(X_train)
df_class2['anomaly'] = envelope.predict(X_train)

envelope =  EllipticEnvelope(contamination = outliers_fraction) 
X_train = df_class3.values.reshape(-1,1)
envelope.fit(X_train)
df_class3 = pd.DataFrame(df_class3)
df_class3['deviation'] = envelope.decision_function(X_train)
df_class3['anomaly'] = envelope.predict(X_train)

In [None]:
envelope =  EllipticEnvelope(contamination = outliers_fraction) 
X_train = df_class4.values.reshape(-1,1)
envelope.fit(X_train)
df_class4 = pd.DataFrame(df_class4)
df_class4['deviation'] = envelope.decision_function(X_train)
df_class4['anomaly'] = envelope.predict(X_train)

envelope =  EllipticEnvelope(contamination = outliers_fraction) 
X_train = df_class5.values.reshape(-1,1)
envelope.fit(X_train)
df_class5 = pd.DataFrame(df_class5)
df_class5['deviation'] = envelope.decision_function(X_train)
df_class5['anomaly'] = envelope.predict(X_train)

envelope =  EllipticEnvelope(contamination = outliers_fraction) 
X_train = df_class6.values.reshape(-1,1)
envelope.fit(X_train)
df_class6 = pd.DataFrame(df_class6)
df_class6['deviation'] = envelope.decision_function(X_train)
df_class6['anomaly'] = envelope.predict(X_train)
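
The seven near-identical blocks above can also be written as one loop. The following is a minimal sketch on synthetic data, not the notebook's actual pipeline: the helper name `flag_anomalies_by_group`, the `random_state` argument, and the generated values are assumptions made for illustration. It fits one `EllipticEnvelope` per weekday and writes the labels back by index, so rows keep their original position.

```python
import numpy as np
import pandas as pd
from sklearn.covariance import EllipticEnvelope

def flag_anomalies_by_group(df, value_col, group_col, contamination=0.01, random_state=0):
    """Fit one EllipticEnvelope per group; return a Series of 1 (normal) / -1 (anomaly)."""
    flags = pd.Series(np.nan, index=df.index, name="anomaly")
    for _, group in df.groupby(group_col):
        values = group[value_col].dropna()
        X = values.to_numpy().reshape(-1, 1)  # sklearn expects a 2-D array
        envelope = EllipticEnvelope(contamination=contamination, random_state=random_state)
        flags.loc[values.index] = envelope.fit_predict(X)
    return flags

# Synthetic stand-in for no2_weather: 60 observations per weekday
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "DayOfTheWeek": np.tile(np.arange(7), 60),
    "no2_emission_sum": rng.normal(loc=1.0, scale=0.1, size=420),
})
demo.loc[0, "no2_emission_sum"] = 5.0  # one obvious spike on a Monday

demo["anomaly"] = flag_anomalies_by_group(demo, "no2_emission_sum", "DayOfTheWeek")
print(demo["anomaly"].value_counts())
```

Because the labels are assigned through the index, no separate `pd.concat` step is needed to bring the per-day results back into the main dataframe.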

 
<div class="h3">Day Emission with anomalies</div>

In [None]:
a0 = df_class0.loc[df_class0['anomaly'] == 1, 'no2_emission_sum']
b0 = df_class0.loc[df_class0['anomaly'] == -1, 'no2_emission_sum']

a1 = df_class1.loc[df_class1['anomaly'] == 1, 'no2_emission_sum']
b1 = df_class1.loc[df_class1['anomaly'] == -1, 'no2_emission_sum']

a2 = df_class2.loc[df_class2['anomaly'] == 1, 'no2_emission_sum']
b2 = df_class2.loc[df_class2['anomaly'] == -1, 'no2_emission_sum']

a3 = df_class3.loc[df_class3['anomaly'] == 1, 'no2_emission_sum']
b3 = df_class3.loc[df_class3['anomaly'] == -1, 'no2_emission_sum']

a4 = df_class4.loc[df_class4['anomaly'] == 1, 'no2_emission_sum']
b4 = df_class4.loc[df_class4['anomaly'] == -1, 'no2_emission_sum']

a5 = df_class5.loc[df_class5['anomaly'] == 1, 'no2_emission_sum']
b5 = df_class5.loc[df_class5['anomaly'] == -1, 'no2_emission_sum']

a6 = df_class6.loc[df_class6['anomaly'] == 1, 'no2_emission_sum']
b6 = df_class6.loc[df_class6['anomaly'] == -1, 'no2_emission_sum']

 <div class="h3">Plot the NO2 Day Emission with anomalies</div>

In [None]:

fig, axs = plt.subplots(2,2)
axs[0,0].hist([a0,b0], bins=32, stacked=True, color=['blue', 'red'], label=['normal', 'anomaly'])
axs[0,1].hist([a1,b1], bins=32, stacked=True, color=['blue', 'red'], label=['normal', 'anomaly'])
axs[1,0].hist([a2,b2], bins=32, stacked=True, color=['blue', 'red'], label=['normal', 'anomaly'])
axs[1,1].hist([a3,b3], bins=32, stacked=True, color=['blue', 'red'], label=['normal', 'anomaly'])
plt.legend()
plt.show()

 <div class="h3"> NO2 emission statistics by day of the week (Monday=0, Sunday=6)</div>

Monday NO2 emission stats. In general, Monday shows more variation than the other days of the week.

In [None]:
a0.describe()

Tuesday NO2 emission stats.


In [None]:
a1.describe()

Wednesday NO2 emission stats.
 

In [None]:
a2.describe()

Thursday NO2 emission stats.


In [None]:
a3.describe()

Monday, Tuesday, Wednesday, Thursday

In [None]:
fig, axs = plt.subplots(2,2)
axs[0,0].hist([a4,b4], bins=32, stacked=True, color=['blue', 'red'], label=['normal', 'anomaly'])
axs[0,1].hist([a5,b5], bins=32, stacked=True, color=['blue', 'red'], label=['normal', 'anomaly'])
axs[1,0].hist([a6,b6], bins=32, stacked=True, color=['blue', 'red'], label=['normal', 'anomaly'])

plt.legend()
plt.show()

Friday NO2 emission stats.

In [None]:
a4.describe()

Saturday NO2 emission stats.

In [None]:
a5.describe()

Sunday NO2 emission stats.

In [None]:
a6.describe()

Friday, Saturday, Sunday.
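
The per-day `describe()` calls above can also be compared side by side in one table, which makes a claim such as "Monday varies the most" easy to check. A minimal sketch on synthetic data, where Monday is deliberately generated with a larger spread (the numbers are made up for illustration and the column names mirror the ones used in this notebook):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: 50 observations per weekday, Monday (0) with a larger spread
rng = np.random.default_rng(42)
days = np.repeat(np.arange(7), 50)
scale = np.where(days == 0, 0.5, 0.1)
demo = pd.DataFrame({
    "DayOfTheWeek": days,
    "no2_emission_sum": rng.normal(1.0, scale),
})

day_names = {0: "Mon", 1: "Tue", 2: "Wed", 3: "Thu", 4: "Fri", 5: "Sat", 6: "Sun"}
stats = demo.groupby("DayOfTheWeek")["no2_emission_sum"].describe()
stats.index = stats.index.map(day_names)

print(stats[["count", "mean", "std", "min", "max"]])
print("most variable day:", stats["std"].idxmax())
```

On the real data the same `groupby(...).describe()` call would be applied to `no2_weather`, optionally after filtering out the rows flagged as anomalies.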

In [None]:
# add the anomaly flags of all seven weekday classes back to the main dataframe
df_class = pd.concat([df_class0, df_class1, df_class2, df_class3, df_class4, df_class5, df_class6])
no2_weather['anomaly22'] = df_class['anomaly']
no2_weather['anomaly22'] = np.array(no2_weather['anomaly22'] == -1).astype(int) 

In [None]:
# visualisation of anomaly throughout time
fig, ax = plt.subplots()

a = no2_weather.loc[no2_weather['anomaly22'] == 1, ['time_epoch', 'no2_emission_sum']] #anomaly

ax.plot(no2_weather['time_epoch'], no2_weather['no2_emission_sum'], color='blue')
ax.scatter(a['time_epoch'],a['no2_emission_sum'], color='red')
plt.show()

In [None]:
# visualisation of anomalies within the NO2 emission distribution
a = no2_weather.loc[no2_weather['anomaly22'] == 0, 'no2_emission_sum']
b = no2_weather.loc[no2_weather['anomaly22'] == 1, 'no2_emission_sum']

fig, axs = plt.subplots()
axs.hist([a,b], bins=32, stacked=True, color=['blue', 'red'], label=['normal', 'anomaly'])
plt.legend()
plt.show()

The extreme values are detected well.

<div class="h2">Isolation Forest </div>
<a id="OUTLIER2"></a>

[Back to Table of Contents](#top)

[General Findings](#theend)
  

In [None]:
# Suppress warnings 
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=FutureWarning)
from IPython.display import HTML

HTML('<iframe width="700" height="400" src="https://www.youtube.com/embed/5p8B2Ikcw-k" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>')

Unsupervised Anomaly Detection with Isolation Forest - Elena Sharov

In [None]:
# train isolation forest 
model =  IsolationForest(contamination = outliers_fraction)
model.fit(data_IF)
  
no2_weather['anomaly25'] = pd.Series(model.predict(data_IF), index=data_IF.index)  # align predictions by row label
no2_weather['anomaly25'] = no2_weather['anomaly25'].map( {1: 0, -1: 1} )
print(no2_weather['anomaly25'].value_counts())
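
Because rows with missing values are dropped before the model is fitted, the predictions are shorter than the main dataframe and are safest written back by index label rather than by position. A minimal sketch on synthetic data (the values and the two-column feature set are assumptions for illustration; `random_state` is fixed only to make the run repeatable):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "no2_emission_sum": rng.normal(1.0, 0.1, size=200),
    "total_precipitation_surface_mean": rng.normal(0.5, 0.05, size=200),
})
demo.loc[[3, 50], "no2_emission_sum"] = np.nan  # rows that dropna() will remove

features = demo.dropna()  # keeps the original row labels
model = IsolationForest(contamination=0.01, random_state=0)
labels = pd.Series(model.fit_predict(features), index=features.index)

# Assignment aligns on the index: dropped rows get NaN instead of a shifted label
demo["anomaly"] = labels.map({1: 0, -1: 1})
print(demo["anomaly"].value_counts(dropna=False))
```

Rows that were excluded from training end up with a missing flag, which makes it visible that the model never scored them.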



<div class="h3">Visualisation of anomaly throughout time</div>
<a id="IF"></a>

[Back to Table of Contents](#top)

[General Findings](#theend)
  

In [None]:
a = no2_weather.loc[no2_weather['anomaly25'] == 0, 'total_precipitation_surface_mean']
b = no2_weather.loc[no2_weather['anomaly25'] == 1, 'total_precipitation_surface_mean']

fig, axs = plt.subplots()
axs.hist([a,b], bins=32, stacked=True, color=['blue', 'red'], label = ['normal', 'anomaly'])
plt.legend()
plt.show()

In [None]:
fig, ax = plt.subplots()

a = no2_weather.loc[no2_weather['anomaly25'] == 1, ['time_epoch', 'total_precipitation_surface_mean']] #anomaly

ax.plot(no2_weather['time_epoch'], no2_weather['total_precipitation_surface_mean'], color='blue')
ax.scatter(a['time_epoch'],a['total_precipitation_surface_mean'], color='red')
plt.show()

In [None]:
a = no2_weather.loc[no2_weather['anomaly25'] == 0, 'no2_emission_sum']
b = no2_weather.loc[no2_weather['anomaly25'] == 1, 'no2_emission_sum']

fig, axs = plt.subplots()
axs.hist([a,b], bins=32, stacked=True, color=['blue', 'red'], label = ['normal', 'anomaly'])
plt.legend()
plt.show()

no2_emission_sum

In [None]:
fig, ax = plt.subplots()

a = no2_weather.loc[no2_weather['anomaly25'] == 1, ['time_epoch', 'no2_emission_sum']] #anomaly

ax.plot(no2_weather['time_epoch'], no2_weather['no2_emission_sum'], color='blue')
ax.scatter(a['time_epoch'],a['no2_emission_sum'], color='red')
plt.show()

<div class="h2">One class SVM</div>
<a id="OUTLIER3"></a>

[Back to Table of Contents](#top)

[General Findings](#theend)
  

In [None]:
# Suppress warnings 
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=FutureWarning)
from IPython.display import HTML

HTML('<iframe width="700" height="400" src="https://www.youtube.com/embed/086OcT-5DYI" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>')

Anomaly Detection Problem | Motivation — [ Machine Learning | Andrew Ng ]

In [None]:
# Take the useful features and standardize them
data_SVM = no2_weather[['time_epoch','DayOfTheWeek','day','no2_emission_sum','temperature_2m_above_ground_mean','specific_humidity_2m_above_ground_mean','relative_humidity_2m_above_ground_mean','u_component_of_wind_10m_above_ground_mean','v_component_of_wind_10m_above_ground_mean','total_precipitation_surface_mean']].copy()
data_SVM.dropna(axis=0, inplace=True)
print(data_SVM.shape)


# despite the variable name, this is a StandardScaler (zero mean, unit variance per feature)
min_max_scaler = preprocessing.StandardScaler()
np_scaled = min_max_scaler.fit_transform(data_SVM)
# train one-class SVM
model =  OneClassSVM(nu=0.95 * outliers_fraction) #nu=0.95 * outliers_fraction  + 0.05
# keep the original row labels so predictions align with no2_weather
data_SVM = pd.DataFrame(np_scaled, index=data_SVM.index)
model.fit(data_SVM)
# add the anomaly flags to the main dataframe
no2_weather['anomaly26'] = pd.Series(model.predict(data_SVM), index=data_SVM.index)
no2_weather['anomaly26'] = no2_weather['anomaly26'].map( {1: 0, -1: 1} )
print(no2_weather['anomaly26'].value_counts())


<div class="h3">Visualisation of anomaly throughout time</div>
<a id="IF"></a>

[Back to Table of Contents](#top)

[General Findings](#theend)
  

total precipitation surface mean

In [None]:
a = no2_weather.loc[no2_weather['anomaly26'] == 0, 'total_precipitation_surface_mean']
b = no2_weather.loc[no2_weather['anomaly26'] == 1, 'total_precipitation_surface_mean']

fig, axs = plt.subplots()
axs.hist([a,b], bins=32, stacked=True, color=['blue', 'red'], label = ['normal', 'anomaly'])
plt.legend()
plt.show()

total_precipitation_surface_mean

In [None]:
fig, ax = plt.subplots()

a = no2_weather.loc[no2_weather['anomaly26'] == 1, ['time_epoch', 'total_precipitation_surface_mean']] #anomaly

ax.plot(no2_weather['time_epoch'], no2_weather['total_precipitation_surface_mean'], color='blue')
ax.scatter(a['time_epoch'],a['total_precipitation_surface_mean'], color='red')
plt.show()

no2 emission sum

In [None]:
a = no2_weather.loc[no2_weather['anomaly26'] == 0, 'no2_emission_sum']
b = no2_weather.loc[no2_weather['anomaly26'] == 1, 'no2_emission_sum']

fig, axs = plt.subplots()
axs.hist([a,b], bins=32, stacked=True, color=['blue', 'red'], label = ['normal', 'anomaly'])
plt.legend()
plt.show()

In [None]:
fig, ax = plt.subplots()

a = no2_weather.loc[no2_weather['anomaly26'] == 1, ['time_epoch', 'no2_emission_sum']] #anomaly

ax.plot(no2_weather['time_epoch'], no2_weather['no2_emission_sum'], color='blue')
ax.scatter(a['time_epoch'],a['no2_emission_sum'], color='red')
plt.show()



<div class="h3">Our purpose is to detect these abnormal observations in advance!</div>


Creating features

In [None]:
no2_weather['yr'] = no2_weather.start_date.dt.year
no2_weather['mt'] = no2_weather.start_date.dt.month
no2_weather['d'] = no2_weather.start_date.dt.day

no2_weather['weekday'] = no2_weather.start_date.dt.weekday
no2_weather['weekday_mean'] = no2_weather.weekday.replace(no2_weather[:199].groupby('weekday')['no2_emission_sum'].mean().to_dict())

In [None]:
no2_weather.head(2)

<div class="h3">Time lag feature: lag in weeks vs. correlation coefficient</div>

In [None]:
timeLags = np.arange(1,10*48*7)
autoCorr = [no2_weather.no2_emission_sum.autocorr(lag=dt) for dt in timeLags]
plt.figure(figsize=(19,8))
plt.plot(1.0/(48*7)*timeLags, autoCorr)
plt.xlabel('time lag [weeks]')
plt.ylabel('correlation coeff', fontsize=12)

AutoCorrelation 10 weeks depth

NO2 emission seems to follow a weekly pattern: on certain days of the week it is higher than on others. The autocorrelation plot supports this, since a weekly pattern shows up as peaks at lags that are multiples of one week.
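
To illustrate what a weekly pattern looks like in an autocorrelation curve, here is a minimal sketch on a synthetic series. It assumes one observation per day, so a lag of 7 equals one week; the sine-plus-noise signal is made up for the example and is not the NO2 data.

```python
import numpy as np
import pandas as pd

# Synthetic daily series with a 7-day cycle plus a little noise
rng = np.random.default_rng(1)
n_days = 140
dates = pd.date_range("2019-01-01", periods=n_days, freq="D")
weekly = np.sin(2 * np.pi * np.arange(n_days) / 7)
series = pd.Series(weekly + rng.normal(0, 0.1, size=n_days), index=dates)

# Autocorrelation for lags of 1 to 28 days (4 weeks)
lags = np.arange(1, 29)
autocorr = pd.Series([series.autocorr(lag=int(lag)) for lag in lags], index=lags)
print(autocorr.round(2))
```

The values peak at lags 7, 14, 21 and 28 and dip in between, which is the signature to look for in the plot of the real series.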

<div class="h1"> Prediction using LSTM with Python</div>
<a id="LSTM1"></a>

[Back to Table of Contents](#top)

[General Findings](#theend)
  

In [None]:
# Suppress warnings 
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=FutureWarning)
from IPython.display import HTML

HTML('<iframe width="700" height="400" src="https://www.youtube.com/embed/9zhrxE5PQgY" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>')

Recurrent networks can be improved to remember long-range dependencies by using what is called a Long Short-Term Memory (LSTM) cell. The video above builds one using only NumPy and walks through the cell components as well as the forward and backward pass logic; this notebook uses the Keras `LSTM` layer instead.
  

In [None]:
data_ = no2_weather.loc[:,['start_date','no2_emission_sum']] 
data_['start_date'] = pd.to_datetime(no2_weather['start_date'])
data_.set_index('start_date', inplace=True)
data_ = data_.resample("1D").sum() # day sum

In [None]:
#Create a new dataframe with only the 'no2_emission_sum' column
data_2 = data_.filter(['no2_emission_sum'])
#Convert the dataframe to a numpy array
dataset = data_2.values
#Get the number of rows to train the model on
training_data_len = int(np.ceil( len(dataset) * .8 ))
training_data_len

In [None]:
#Scale the data
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0,1))
scaled_data = scaler.fit_transform(dataset)
# scaled_data

#### Create the training data set and the scaled training data set

In [None]:
train_data = scaled_data[0:int(training_data_len), :]
#Split the data into x_train and y_train data sets
x_train = []
y_train = []

for i in range(129, len(train_data)):
    x_train.append(train_data[i-129:i, 0])
    y_train.append(train_data[i, 0])
#     if i<= 61:
#         print(x_train)
#         print(y_train)
#         print()

In [None]:
# Convert the x_train and y_train to numpy arrays 
x_train, y_train = np.array(x_train), np.array(y_train)

In [None]:
#Reshape the data
x_train = np.reshape(x_train, (x_train.shape[0], x_train.shape[1], 1))
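
The windowing loop and reshape above turn a 1-D series into `(samples, timesteps, features)` input for the LSTM. A minimal sketch of the same idea as a reusable helper, shown on a tiny array with a window of 3 instead of 129 (the name `make_windows` is hypothetical and not part of the notebook):

```python
import numpy as np

def make_windows(series, window):
    """Build (X, y) where each X[i] holds `window` past values and y[i] is the next value.

    Requires len(series) > window.
    """
    series = np.asarray(series, dtype=float).ravel()
    X = np.stack([series[i - window:i] for i in range(window, len(series))])
    y = series[window:]
    # Keras LSTM layers expect input of shape (samples, timesteps, features)
    return X.reshape(X.shape[0], window, 1), y

demo = np.arange(10, dtype=float)
X, y = make_windows(demo, window=3)
print(X.shape, y.shape)     # (7, 3, 1) (7,)
print(X[0].ravel(), y[0])   # [0. 1. 2.] 3.0
```

Each target is the value immediately after its window, so the first `window` points of the series never appear as targets.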

#### Build the LSTM model

In [None]:
model = Sequential()
model.add(LSTM(50, return_sequences=True, input_shape= (x_train.shape[1], 1)))
model.add(LSTM(50, return_sequences= False))
model.add(Dense(25))
model.add(Dense(1))

#### Compile and Train the model

In [None]:

model.compile(optimizer='adam', loss='mean_squared_error')
model.fit(x_train, y_train, batch_size=1, epochs=1)

In [None]:
#Create the testing data set
test_data = scaled_data[training_data_len - 129: , :]
#Create the data sets x_test and y_test
x_test = []
y_test = dataset[training_data_len:, :]
for i in range(129, len(test_data)):
    x_test.append(test_data[i-129:i, 0])
    
# Convert the data to a numpy array
x_test = np.array(x_test)


<div class="h2">Get the root mean squared error (RMSE)</div>
<a id="LSTM2"></a>

[Back to Table of Contents](#top)

[General Findings](#theend)



In [None]:
# Reshape the data
x_test = np.reshape(x_test, (x_test.shape[0], x_test.shape[1], 1 ))
# Get the model's predicted values 
predictions = model.predict(x_test)
predictions = scaler.inverse_transform(predictions)
# Get the root mean squared error (RMSE)
rmse = np.sqrt(np.mean(((predictions - y_test) ** 2)))
rmse
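
The RMSE is the square root of the mean squared difference between predictions and observed values. In the cell above both arrays are 2-D with shape `(n, 1)`, so the subtraction is element-wise. If one of them were flattened to `(n,)`, NumPy would broadcast the difference to an `(n, n)` matrix and silently return a different number. A minimal sketch of a shape-safe helper (the function name `rmse` and the toy values are illustrative):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error on flattened inputs of equal length."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    if y_true.shape != y_pred.shape:
        raise ValueError("y_true and y_pred must contain the same number of values")
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

y_true = np.array([1.0, 2.0, 3.0])        # shape (3,)
y_pred = np.array([[1.0], [2.0], [5.0]])  # shape (3, 1), like the output of model.predict

safe = rmse(y_true, y_pred)                        # element-wise: sqrt(4/3)
naive = np.sqrt(np.mean((y_pred - y_true) ** 2))   # broadcasts to a (3, 3) matrix
print(safe, naive)
```

The two numbers differ, which shows why it is worth checking array shapes before trusting an error metric.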

### Visualize the data


In [None]:
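# Split into train and validation frames for plotting
train = data_[:training_data_len]
valid = data_[training_data_len:].copy()  # copy so the new column is not written into a slice of data_
valid['Predictions'] = predictions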

In [None]:
# Visualize the data
plt.figure(figsize=(16,8))
plt.title('Model')
plt.xlabel('start_date', fontsize=18)
plt.ylabel('NO2 Emission Sum', fontsize=18)
plt.plot(train['no2_emission_sum'])
plt.plot(valid[['no2_emission_sum', 'Predictions']])
plt.legend(['Train', 'Val', 'Predictions'], loc='lower right')
plt.show()



<div class="h3">30 days NO2 average</div>


In [None]:
f, ax = plt.subplots(figsize=(14,8))
pd.plotting.register_matplotlib_converters() # Add this 
data_.plot(ax=ax, color='C0')
data_.rolling(window=30, center=True).mean().plot(ax=ax, ls='-', lw=3, color='C3')
ax.grid(ls=':')
ax.legend(['daily values','30 days NO2 average'], frameon=False, fontsize=14)

[l.set_fontsize(13) for l in ax.xaxis.get_ticklabels()]
[l.set_fontsize(13) for l in ax.yaxis.get_ticklabels()]
ax.set_xlabel('date', fontsize=15)
ax.set_ylabel('NO2 values', fontsize=15);
# ax.axvline('2018', color='0.8', lw=8, zorder=-1)

The NO2 mean is high in September.

<div class="h1"> ARIMA with Python</div>
<a id="ARP"></a>

[Back to Table of Contents](#top)

[General Findings](#theend)
  

In [None]:
# Suppress warnings 
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=FutureWarning)
from IPython.display import HTML

HTML('<iframe width="700" height="400" src="https://www.youtube.com/embed/zlZaOnBbpUg?list=PL436A4F939FBE10D7" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>')

The Analysis of Time Series

We print a summary of the fitted model.
It summarizes the coefficient values used as well as the skill of the fit on the in-sample observations.

In [None]:
# statsmodels < 0.13 API; newer releases provide statsmodels.tsa.arima.model.ARIMA instead
from statsmodels.tsa.arima_model import ARIMA
from sklearn.metrics import mean_squared_error

arimaM = ARIMA(data, order=(5,1,0))
arimaMfit = arimaM.fit(disp=0)
print(arimaMfit.summary())
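
To make the `order=(5,1,0)` setting more concrete: it means five autoregressive lags, one round of differencing, and no moving-average terms. The sketch below illustrates that idea with a plain least-squares fit on first differences. This is a swapped-in simplification, not what statsmodels does internally, and the function names and toy series are hypothetical.

```python
import numpy as np

def fit_ar_on_differences(series, p):
    """Least-squares AR(p) fit on the first differences of `series`."""
    diffs = np.diff(np.asarray(series, dtype=float))
    # Each row holds p consecutive differences; the target is the next one
    rows = np.array([diffs[i:i + p] for i in range(len(diffs) - p)])
    X = np.column_stack([np.ones(len(rows)), rows])
    y = diffs[p:]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef  # intercept followed by p lag weights (oldest lag first)

def forecast_one_step(series, coef):
    """Predict the next difference and add it back to the last level."""
    values = np.asarray(series, dtype=float)
    p = len(coef) - 1
    diffs = np.diff(values)
    next_diff = coef[0] + diffs[-p:] @ coef[1:]
    return float(values[-1] + next_diff)

# Toy series with a constant slope of 1 (not notebook data)
toy = list(range(20))
coef = fit_ar_on_differences(toy, p=5)
next_value = forecast_one_step(toy, coef)
```

On the toy series every difference equals 1, so the one-step forecast continues the line. Real NO2 data would of course give noisier coefficients.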

First, we get a line plot of the residual errors, suggesting that there may still be some trend information not captured by the model.

Next, we get a density plot of the residual error values, suggesting the errors are Gaussian but may not be centered on zero.

<div class="h3">Plot residual errors</div>
<a id="P"></a>
  

In [None]:
errors = pd.DataFrame(arimaMfit.resid)
errors.plot()
pyplot.show()
errors.plot(kind='kde')
pyplot.show()
print(errors.describe())

The distribution of the residual errors is displayed. 
The results show that indeed there is a bias in the prediction (a non-zero mean in the residuals).


<div class="h2"> Rolling Forecast ARIMA Model</div>
<a id="ARP2"></a>

[Back to Table of Contents](#top)

[General Findings](#theend)


In [None]:
X = data.values
size = int(len(X) * 0.70)
limitCount = 40
train, test = X[0:size], X[size:size+limitCount]
history = [x for x in train]

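We can also calculate a final mean squared error (MSE) and a root mean squared error (RMSE) for the predictions, providing a point of comparison for other ARIMA configurations. Note that the helper below is named `rmsle` but applies no log transform, so it returns the plain RMSE.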

In [None]:
def rmsle(y, y_pred):
    # Despite its name, this helper returns the plain RMSE: no log transform is applied
    return np.sqrt(mean_squared_error(y, y_pred))
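
For comparison, here is a minimal sketch of how RMSE and a log-based RMSLE differ. It assumes all values are greater than -1 so that `log1p` is defined; the function names and toy numbers are hypothetical and not taken from the notebook.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error on the raw values."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

def rmsle_log(y_true, y_pred):
    """RMSE computed on log1p-transformed values; inputs must be > -1."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# Toy values: only the last prediction is off
y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 5.0]
plain = rmse(y_true, y_pred)        # penalizes the absolute gap of 2
logged = rmsle_log(y_true, y_pred)  # penalizes the ratio (1 + 5) / (1 + 3)
```

The log version measures relative rather than absolute error, which is why the two numbers are not interchangeable when comparing models.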

In [None]:
pred = []
for t in range(len(test)):
    model = ARIMA(history, order=(5,1,0))
    model_fit = model.fit(disp=0)
    output = model_fit.forecast()
    yhat = output[0]
    pred.append(yhat)
    obs = test[t]
    history.append(obs)
    print('pred=%f, exp=%f' % (yhat, obs))
error = mean_squared_error(test, pred)
error2 = rmsle(pred,test)  # root of the MSE, computed by the helper above

print('Mean Squared Error: %.3f' % error)
print('RMSE: %.3f' % error2)


A line plot is created showing the expected values (blue) compared to the rolling forecast predictions (red).
We can see that the predictions follow some of the trend and are on the correct scale.

In [None]:
pyplot.plot(test)
pyplot.plot(pred, color='red')
pyplot.show()

<div class="h1">Time series prediction using Prophet in Python</div>
<a id="PRO"></a>

[Back to Table of Contents](#top)

[General Findings](#theend)


In [None]:
# Suppress warnings 
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=FutureWarning)
from IPython.display import HTML

HTML('<iframe width="700" height="400" src="https://www.youtube.com/embed/pOYAXv15r3A" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>')

Delivered by Sean Taylor (Facebook) at the 2018 New York R Conference at Work-Bench on April 20 and 21.

<div class="h3">Advantages of using Prophet</div>

- Accommodates seasonality with multiple periods
- Prophet is resilient to missing values
- Best way to handle outliers in Prophet is to remove them
- Fitting of the model is fast
- Intuitive hyperparameters which are easy to tune
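
Since Prophet tolerates missing values, one way to "remove" outliers is to blank them out in the `y` column before fitting. The sketch below assumes a simple z-score rule with a threshold of 3; both the rule and the helper name `mask_outliers` are illustrative assumptions, not something the notebook or the library prescribes.

```python
import numpy as np

def mask_outliers(values, z_thresh=3.0):
    """Return a copy of `values` with points beyond z_thresh std devs set to NaN."""
    y = np.asarray(values, dtype=float).copy()
    std = np.nanstd(y)
    if std == 0:
        return y  # constant series: nothing to mask
    z = np.abs(y - np.nanmean(y)) / std
    y[z > z_thresh] = np.nan
    return y

# Toy series: twenty ordinary readings and one spike
readings = [1.0] * 20 + [100.0]
cleaned = mask_outliers(readings)
n_masked = int(np.isnan(cleaned).sum())
```

The masked array could then replace the `y` column of the training dataframe while the `ds` dates are kept.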

Define Prophet dataset

In [None]:
X= no2_weather[['start_date','no2_emission_sum','temperature_2m_above_ground_mean','specific_humidity_2m_above_ground_mean','relative_humidity_2m_above_ground_mean',
            'u_component_of_wind_10m_above_ground_mean','v_component_of_wind_10m_above_ground_mean','total_precipitation_surface_mean']]
y=no2_weather['no2_emission_sum']

### Creating the data set for Prophet

In [None]:
train_dataset= pd.DataFrame()
train_dataset['ds'] = pd.to_datetime(no2_weather["start_date"])
train_dataset['y']=y
train_dataset.head(2)

In [None]:
prophet_basic = Prophet()
prophet_basic.fit(train_dataset)

<div class="h3">Predicting the values for the future</div>

To predict values with Prophet, we need a dataframe with a ds (datestamp) column containing the dates for which we want predictions.
We use make_future_dataframe(), passing the number of days to extend into the future. By default it also includes the dates from the history.

In [None]:
future= prophet_basic.make_future_dataframe(periods=30)
future.tail()
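
To illustrate what the future dataframe holds, the sketch below rebuilds the same idea with plain dates: the history followed by `periods` extra days. The function `future_dates` is a hypothetical stand-in for illustration only, and it assumes daily data.

```python
from datetime import date, timedelta

def future_dates(history, periods, include_history=True):
    """Return the history dates plus `periods` consecutive days after the last one."""
    last = max(history)
    extra = [last + timedelta(days=k) for k in range(1, periods + 1)]
    return sorted(history) + extra if include_history else extra

# Toy history of three days (not notebook data)
history = [date(2019, 6, 28), date(2019, 6, 29), date(2019, 6, 30)]
all_dates = future_dates(history, periods=30)
only_new = future_dates(history, periods=30, include_history=False)
```

With `include_history=False` only the 30 new dates are returned, which is handy when the in-sample fit is not needed.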

The future dataframe contains the historical dates plus the next 30 days.

In [None]:
forecast=prophet_basic.predict(future)

<div class='h3'>Plotting the predicted data</div>


In [None]:
fig1 =prophet_basic.plot(forecast)

<div class='h3'>Plotting the forecasted components</div>
We can plot the trend and seasonality, components of the forecast.

In [None]:
fig1 = prophet_basic.plot_components(forecast)

The components show that Mondays and Wednesdays have high values, and that the last month of the year has high values.

<div class="h2">Forecast quality evaluation</div>
<a id="PRO1"></a>

[Back to Table of Contents](#top)

[General Findings](#theend)


Let's evaluate the quality of the algorithm by calculating error metrics over a 30-day forecast horizon. For this, we need the observations $y_i$ and the corresponding predicted values $\hat{y}_i$.

Prophet's cross_validation helper collects both for us by forecasting from cutoff dates inside the history, and performance_metrics summarizes the errors:

In [None]:
# fbprophet was renamed to prophet in version 1.0; adjust the import if needed
from fbprophet.diagnostics import cross_validation, performance_metrics
df_cv = cross_validation(prophet_basic, horizon='30 days')
df_p = performance_metrics(df_cv)
df_p.head(30)
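
For reference, the two metrics used in this section are usually defined as follows, where $n$ is the number of forecast points at a given horizon. These are the textbook definitions, stated here as an assumption about what the `rmse` and `mape` columns report rather than taken from the library source.

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2},
\qquad
\mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|
$$

RMSE is expressed in the units of the series, while MAPE is a ratio, so a MAPE of 0.10 corresponds to a 10% average relative error.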

In [None]:
from fbprophet.plot import plot_cross_validation_metric
fig_mape = plot_cross_validation_metric(df_cv, metric='mape')
fig_rmse = plot_cross_validation_metric(df_cv, metric='rmse')


The plot shows that a 10-day forecast has an error of around 10% (MAPE).

#### Adding ChangePoints to Prophet
Changepoints are the points in time where the time series changes its trajectory abruptly.

In [None]:
from fbprophet.plot import add_changepoints_to_plot
fig = prophet_basic.plot(forecast)
a = add_changepoints_to_plot(fig.gca(), prophet_basic, forecast)

We can view the dates where the changepoints occurred:

In [None]:
prophet_basic.changepoints[:10]

We can change the range in which changepoints are inferred by setting changepoint_range (the default, 0.8, uses the first 80% of the history).

In [None]:
pro_change= Prophet(changepoint_range=0.9)
forecast = pro_change.fit(train_dataset).predict(future)
fig= pro_change.plot(forecast);
a = add_changepoints_to_plot(fig.gca(), pro_change, forecast)
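
To see what `changepoint_range` changes, the sketch below spreads candidate changepoints evenly over the first fraction of the history. It mirrors my understanding of how Prophet chooses candidates, so treat the exact placement rule as an assumption; the function name is hypothetical.

```python
import numpy as np

def candidate_changepoint_indexes(n_obs, n_changepoints=25, changepoint_range=0.8):
    """Evenly spaced row indexes within the first `changepoint_range` of the history."""
    hist_size = int(np.floor(n_obs * changepoint_range))
    idx = np.linspace(0, hist_size - 1, n_changepoints + 1).round().astype(int)
    return idx[1:]  # the first observation is never a changepoint

# 100 observations, 4 candidates, two different ranges
narrow = candidate_changepoint_indexes(100, n_changepoints=4, changepoint_range=0.8)
wide = candidate_changepoint_indexes(100, n_changepoints=4, changepoint_range=0.9)
```

Raising the range from 0.8 to 0.9 moves the last candidate closer to the end of the series, which lets the trend react to recent changes at the cost of a higher risk of overfitting.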

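Setting changepoint_prior_scale to 0.08, above the default of 0.05, makes the trend more flexible; a smaller value such as 0.001 would make it less flexible. We also set n_changepoints=20 and enable yearly seasonality.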

In [None]:
pro_change= Prophet(n_changepoints=20, yearly_seasonality=True, changepoint_prior_scale=0.08)
forecast = pro_change.fit(train_dataset).predict(future)
fig= pro_change.plot(forecast);
a = add_changepoints_to_plot(fig.gca(), pro_change, forecast)

<div class="h2">Incorporating the effects of weather condition</div>
<a id="PRO2"></a>

[Back to Table of Contents](#top)

[General Findings](#theend)



> Now we add the weather variables as extra regressors in the fbprophet model

In [None]:
train_dataset['temperature_2m_above_ground_mean'] = X['temperature_2m_above_ground_mean']
train_dataset['specific_humidity_2m_above_ground_mean'] = X['specific_humidity_2m_above_ground_mean']
train_dataset['relative_humidity_2m_above_ground_mean'] = X['relative_humidity_2m_above_ground_mean']
train_dataset['u_component_of_wind_10m_above_ground_mean'] = X['u_component_of_wind_10m_above_ground_mean']
train_dataset['v_component_of_wind_10m_above_ground_mean'] = X['v_component_of_wind_10m_above_ground_mean']
train_dataset['total_precipitation_surface_mean'] = X['total_precipitation_surface_mean']

train_X= train_dataset[:200]
test_X= train_dataset[200:]


In [None]:
#Additional Regressor
pro_regressor= Prophet()
pro_regressor.add_regressor('temperature_2m_above_ground_mean')
pro_regressor.add_regressor('specific_humidity_2m_above_ground_mean')
pro_regressor.add_regressor('relative_humidity_2m_above_ground_mean')
pro_regressor.add_regressor('u_component_of_wind_10m_above_ground_mean')
pro_regressor.add_regressor('v_component_of_wind_10m_above_ground_mean')
pro_regressor.add_regressor('total_precipitation_surface_mean')
#Fitting the data
pro_regressor.fit(train_X)
future_data = pro_regressor.make_future_dataframe(periods=30) # 30 days
#forecast the data for Test  data
forecast_data = pro_regressor.predict(test_X)
pro_regressor.plot(forecast_data);

Predicted data is the blue shaded region at the end.

<div class="h2">Forecast quality evaluation</div>
<a id="PRO3"></a>

[Back to Table of Contents](#top)

[General Findings](#theend)


Let's evaluate the quality of the model with the weather regressors by calculating error metrics over a 30-day forecast horizon. For this, we need the observations $y_i$ and the corresponding predicted values $\hat{y}_i$.


In [None]:
df_cv_reg = cross_validation(pro_regressor, horizon='30 days')
df_p_reg = performance_metrics(df_cv_reg)
df_p_reg.head(30)

The RMSE at the 30-day horizon is 0.984912.

In [None]:
fig_mape_reg = plot_cross_validation_metric(df_cv_reg, metric='mape')
fig_rmse_reg = plot_cross_validation_metric(df_cv_reg, metric='rmse')


The plot shows that a 10-day forecast has an error of around 7% (MAPE).

In [None]:
from fbprophet.plot import plot_plotly, add_changepoints_to_plot
import plotly.offline as py

fig_d_reg = plot_plotly(pro_regressor, forecast_data)

py.iplot(fig_d_reg) 

fig_d_reg = pro_regressor.plot(forecast_data,xlabel='Date',ylabel='NO2 values')


- ds: forecast date
- yhat: forecast value for the given date
- yhat_lower: lower forecast boundary for the given date
- yhat_upper: upper forecast boundary for the given date

The plot produced by the Prophet model's plot function shows how the model fits the training data (black points: training data, blue line: forecast value, light blue area: forecast boundaries).

## We can see that the Best Approach is Prophet.

<div class="h2">Save The Model</div>
<a id="PRO4"></a>

[Back to Table of Contents](#top)

[General Findings](#theend)

The model should be re-trained when new data becomes available, but there is no point in re-training it if the data has not changed. Instead, save the model and load it again when the user wants to call the predict function. Use pickle for that:


In [None]:
import pickle
with open('forecast_model_No2.pckl', 'wb') as fout:
    pickle.dump(pro_regressor, fout)
with open('forecast_model_No2.pckl', 'rb') as fin:
    m2 = pickle.load(fin)


<div class="h1">Prediction of  NO2 density for each primary_fuel throughout the year</div>
<a id="M1"></a>
[Back to Table of Contents](#top)

[General Findings](#theend)

<div class="h3">Regional NO2 Density</div>

In [None]:
from datetime import datetime
files=[]
for dirname, _, filenames in os.walk('/kaggle/input/ds4g-environmental-insights-explorer/eie_data/s5p_no2'):
    for filename in filenames:
        files.append(os.path.join(dirname, filename))

# read all the NO2 data into one list of arrays
no2_first_day=[]
no2_first_key=[]
no2_arr=[]
band=0 #  NO2_column_number_density
for i in range(0,len(files)):
    no2_first_day.append(datetime.strptime(files[i][76:91], '%Y%m%dT%H%M%S').date())
    no2_first_key.append(datetime.strptime(files[i][76:91], '%Y%m%dT%H%M%S').toordinal()+1) # correction of + 1 day in order to sync on climate data
    no2_arr.append(rio.open(files[i]).read(band+1))
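
The fixed slice `files[i][76:91]` only works while the directory path keeps exactly the same length. A more tolerant option is to search the file name for the timestamp pattern. This is a minimal sketch; the example file name is an assumption about how the S5P files are named, and `first_timestamp` is a hypothetical helper.

```python
import os
import re
from datetime import datetime

# Matches the first YYYYMMDDTHHMMSS block in a file name
_TIMESTAMP = re.compile(r'(\d{8}T\d{6})')

def first_timestamp(path):
    """Parse the first timestamp found in the base name of `path`."""
    match = _TIMESTAMP.search(os.path.basename(path))
    if match is None:
        raise ValueError('no timestamp found in ' + path)
    return datetime.strptime(match.group(1), '%Y%m%dT%H%M%S')

# Assumed example path, for illustration only
example = ('/kaggle/input/ds4g-environmental-insights-explorer/eie_data/'
           's5p_no2/s5p_no2_20180701T161259_20180707T175356.tif')
start = first_timestamp(example)
start_day = start.date()
start_key = start.toordinal() + 1  # same +1 day shift as in the cell above
```

Because the search runs on the base name, the helper keeps working if the dataset is mounted under a different directory.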



In [None]:
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()

a=[]
a_pos=[]
for i in range(0,len(no2_arr)): 
    a.append(np.nanmean(no2_arr[i]))
    a_pos.append(np.nanmean(np.clip(no2_arr[i],0,10000)))
    
no2_rgn=pd.DataFrame({ 'start_date': no2_first_day,'no2_rgn' : a_pos, 'key_date' : no2_first_key })
no2_rgn=no2_rgn.sort_values('start_date')
no2_rgn=no2_rgn.reset_index()

In [None]:
# read only the NO2 index arrays with a nan-percentage <5% into one list of arrays for calculation of local NO2 data
files=[]
for dirname, _, filenames in os.walk('/kaggle/input/ds4g-environmental-insights-explorer/eie_data/s5p_no2'):
    for filename in filenames:
        files.append(os.path.join(dirname, filename))

no2_first_day=[]
no2_first_key=[]
no2_arr=[]
band=0 # NO2_column_number_density
for i in range(0,len(files)):
    a=rio.open(files[i]).read(band+1)
    if pd.isnull(a).sum().sum() < 3515:
        no2_first_day.append(datetime.strptime(files[i][76:91], '%Y%m%dT%H%M%S').date())
        no2_first_key.append(datetime.strptime(files[i][76:91], '%Y%m%dT%H%M%S').toordinal()+1) # correction of + 1 day in order to sync on climate data
        no2_arr.append(np.clip(a,0,10000))  # clip negative values to zero
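
The constant 3515 in the filter above appears to correspond to 5% of a 148 x 475 raster (70,300 cells); that image shape is an assumption here. Expressing the threshold as a fraction avoids the hard-coded count. A minimal sketch with a hypothetical helper `nan_fraction`:

```python
import numpy as np

def nan_fraction(arr):
    """Share of NaN cells in a raster band, between 0 and 1."""
    return float(np.isnan(arr).mean())

# Toy rasters with the assumed 148 x 475 shape
mostly_valid = np.zeros((148, 475))
mostly_valid[:5, :] = np.nan      # 2375 NaN cells, about 3.4%
too_sparse = np.zeros((148, 475))
too_sparse[:10, :] = np.nan       # 4750 NaN cells, about 6.8%

keep_first = nan_fraction(mostly_valid) < 0.05
keep_second = nan_fraction(too_sparse) < 0.05
```

The same comparison would keep working if the images were exported at a different resolution.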

<div class="h3">Local NO2 Density</div>

In [None]:
gray= power_plants[['name','primary_fuel','capacity_mw','img_idx_lt','img_idx_lg','plant']].copy() 
gray.head()

In [None]:
pollute_clean_primary_fuel= power_plants.loc[((power_plants['primary_fuel']=='Coal') | (power_plants['primary_fuel']=='Oil') | (power_plants['primary_fuel']=='Gas')),['name','primary_fuel','capacity_mw','img_idx_lt','img_idx_lg','plant']]
pollute_clean_primary_fuel.head()

We compute the NO2_column_number_density value in the proximity of every plant, where proximity means +/- n grid points around the plant's location in the image. More information [here](https://www.kaggle.com/tiurii/ds4g-modelling-of-emissions-of-power-plants), @tiurii


In [None]:
# NO2_column_number_density value in proximity of all plants with all locations in location mask - proximity is +/- n points from location of plant
n=11
no2_=[]
for j in range(0,len(gray)):
    idx_lt=gray.iloc[j,3]
    idx_lg=gray.iloc[j,4]
    no2_j=[]
    for i in range(0,len(no2_arr)):
        no2_j.append(np.nanmean(no2_arr[i][idx_lt-n:idx_lt+n,idx_lg-n:idx_lg+n])) # calculate average of no2 for location of plant
    
    no2_.append(no2_j)
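
One caveat with the slice `idx_lt-n:idx_lt+n`: if a plant lies fewer than `n` cells from the top or left edge, the start index becomes negative and NumPy wraps around, which can yield an empty window. Whether any plant in this dataset is that close to the border is not checked here. The sketch below clips the window to the image bounds; `window_mean` is a hypothetical helper.

```python
import numpy as np

def window_mean(img, row, col, n):
    """NaN-aware mean of the window [row-n, row+n) x [col-n, col+n), clipped to the image."""
    r0, r1 = max(row - n, 0), min(row + n, img.shape[0])
    c0, c1 = max(col - n, 0), min(col + n, img.shape[1])
    return float(np.nanmean(img[r0:r1, c0:c1]))

# Toy raster whose cell value encodes its position
img = np.arange(148 * 475, dtype=float).reshape(148, 475)
n = 11
naive_size = img[3 - n:3 + n, 25 - n:25 + n].size  # negative start wraps: empty slice
clipped = window_mean(img, 3, 25, n)               # rows 0..13, cols 14..35
```

For plants well inside the image the clipped window is identical to the original slice, so the results only differ near the borders.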

In [None]:
aa=pd.DataFrame({'key_date':np.array(no2_first_key), 'start_date': no2_first_day}) 

for j in range(0,len(gray)):
    aa[gray.iloc[j,5]]=no2_[j]  # add average of NO2 for location of plant to dataframe with column name from df gray.plant

print('size of dataframe with NO2 data for gray-energy power-plant locations: ',aa.shape)
# sorting dataframe on date to produce ordered time series
aa=aa.sort_values('key_date')
aa=aa.reset_index()
aa=aa.drop(columns=['index'])
aa=aa.fillna(0)
aa.head()

In [None]:
gray.loc[:,'EF_wght']=1
ww2=pd.DataFrame({'start_date':no2_first_day})
XX2=pd.DataFrame({})

for j in range(0,len(gray)):
    ww2[gray.iloc[j,0]]=no2_[j]  # add the NO2 averages around plant j as a column named after the plant

    x=ww2.groupby(by='start_date').agg(['mean'])
    X2=pd.merge(aa.loc[:,['start_date',gray.iloc[j,5]]],x, how='inner', on='start_date')
    X2=X2.rename(columns = {gray.iloc[j,5]:'no2_density_locationofplant'})
    c=gray.iloc[j,5]   
    X2[c]=np.ones((len(X2)))*gray.iloc[j,6] # addition of EF_wght for each plant to the dataframe
    XX=pd.concat([XX2,X2], axis=0, sort=False) # aggregation of dataframe per plant_location
XX=XX.fillna(0) 
XX=XX.reset_index()

In [None]:
for i in range(XX.shape[1]):
    if i==0 or i==1:
        pass
    else:
        XX  = XX.rename( columns= {XX.columns[i] :str(XX.columns[i]).replace("'",'').replace('(','').replace(')','').replace(',','').replace(' ','_') })
XX.head()

Save the Data

In [None]:
XX=XX.drop(columns=['index','Tor2_Hydro'])
XX.to_csv('no2_density_estimation.csv', index=False)

In [None]:
X= XX[['start_date','no2_density_locationofplant','Aguirre_mean','Costa_Sur_mean','San_Juan_CC_mean','Palo_Seco_mean','EcoEléctrica_mean','A.E.S._Corp._mean','Cambalache_mean','Mayagüez_mean','Santa_Isabel_Wind_Farm_mean','Oriana_Solar_Farm_mean','Yabucoa_mean','Daguao_mean','Jobos_mean','Vega_Baja_mean','San_Fermin_Solar_Farm_mean','Loiza_Solar_Park_mean','Yauco_1_mean','AES_Ilumina_mean','Punta_Lima_mean','Caonillas_1_mean','Salinas_mean','Dos_Bocas_mean','Carite_1_mean','Yauco_2_mean','Toro_Negro_1_mean','Garzas_1_mean','Vieques_EPP_mean','Garzas_2_mean','Río_Blanco_mean','Windmar_Ponce_mean','Caonillas_2_mean','Toro_Negro_2_mean']]
y=XX['no2_density_locationofplant']
print(XX.shape)
       

Creating the data set for Prophet

In [None]:
train_dataset= pd.DataFrame()
train_dataset['ds'] = pd.to_datetime(XX["start_date"])
train_dataset['y']=y
train_dataset.head(2)

Incorporating the primary_fuel conditions

In [None]:
train_dataset['Aguirre_mean'] = X['Aguirre_mean']
train_dataset['Costa_Sur_mean'] = X['Costa_Sur_mean']
train_dataset['San_Juan_CC_mean'] = X['San_Juan_CC_mean']
train_dataset['Palo_Seco_mean'] = X['Palo_Seco_mean']
train_dataset['EcoEléctrica_mean'] = X['EcoEléctrica_mean']
train_dataset['A.E.S._Corp._mean'] = X['A.E.S._Corp._mean']
train_dataset['Cambalache_mean'] = X['Cambalache_mean']
train_dataset['Mayagüez_mean'] = X['Mayagüez_mean']
train_dataset['Santa_Isabel_Wind_Farm_mean'] = X['Santa_Isabel_Wind_Farm_mean']
train_dataset['Oriana_Solar_Farm_mean'] = X['Oriana_Solar_Farm_mean']
train_dataset['Yabucoa_mean'] = X['Yabucoa_mean']
train_dataset['Daguao_mean'] = X['Daguao_mean']
train_dataset['Jobos_mean'] = X['Jobos_mean']
train_dataset['Vega_Baja_mean'] = X['Vega_Baja_mean']
train_dataset['San_Fermin_Solar_Farm_mean'] = X['San_Fermin_Solar_Farm_mean']
train_dataset['Loiza_Solar_Park_mean'] = X['Loiza_Solar_Park_mean']
train_dataset['Yauco_1_mean'] = X['Yauco_1_mean']
train_dataset['AES_Ilumina_mean'] = X['AES_Ilumina_mean']
train_dataset['Punta_Lima_mean'] = X['Punta_Lima_mean']
train_dataset['Salinas_mean'] = X['Salinas_mean']
train_dataset['Dos_Bocas_mean'] = X['Dos_Bocas_mean']
train_dataset['Carite_1_mean'] = X['Carite_1_mean']
train_dataset['Yauco_2_mean'] = X['Yauco_2_mean']
train_dataset['Toro_Negro_1_mean'] = X['Toro_Negro_1_mean']
train_dataset['Garzas_1_mean'] = X['Garzas_1_mean']
train_dataset['Vieques_EPP_mean'] = X['Vieques_EPP_mean']
train_dataset['Garzas_2_mean'] = X['Garzas_2_mean']
train_dataset['Río_Blanco_mean'] = X['Río_Blanco_mean']
train_dataset['Windmar_Ponce_mean'] = X['Windmar_Ponce_mean']
train_dataset['Caonillas_2_mean'] = X['Caonillas_2_mean']
train_dataset['Toro_Negro_2_mean'] = X['Toro_Negro_2_mean']

train_X= train_dataset[:200]
test_X= train_dataset[200:]

In [None]:
#Additional Regressor
pro_regressor= Prophet()
pro_regressor.add_regressor('Aguirre_mean')
pro_regressor.add_regressor('Costa_Sur_mean')
pro_regressor.add_regressor('San_Juan_CC_mean')
pro_regressor.add_regressor('Palo_Seco_mean')
pro_regressor.add_regressor('EcoEléctrica_mean')
pro_regressor.add_regressor('A.E.S._Corp._mean')
pro_regressor.add_regressor('Cambalache_mean')
pro_regressor.add_regressor('Mayagüez_mean')
pro_regressor.add_regressor('Santa_Isabel_Wind_Farm_mean')
pro_regressor.add_regressor('Oriana_Solar_Farm_mean')
pro_regressor.add_regressor('Yabucoa_mean')
pro_regressor.add_regressor('Daguao_mean')
pro_regressor.add_regressor('Jobos_mean')
pro_regressor.add_regressor('Vega_Baja_mean')
pro_regressor.add_regressor('San_Fermin_Solar_Farm_mean')
pro_regressor.add_regressor('Loiza_Solar_Park_mean')
pro_regressor.add_regressor('Yauco_1_mean')
pro_regressor.add_regressor('AES_Ilumina_mean')
pro_regressor.add_regressor('Punta_Lima_mean')
pro_regressor.add_regressor('Salinas_mean')
pro_regressor.add_regressor('Dos_Bocas_mean')
pro_regressor.add_regressor('Carite_1_mean')
pro_regressor.add_regressor('Yauco_2_mean')
pro_regressor.add_regressor('Toro_Negro_1_mean')
pro_regressor.add_regressor('Garzas_1_mean')
pro_regressor.add_regressor('Vieques_EPP_mean')
pro_regressor.add_regressor('Garzas_2_mean')
pro_regressor.add_regressor('Río_Blanco_mean')
pro_regressor.add_regressor('Windmar_Ponce_mean')
pro_regressor.add_regressor('Caonillas_2_mean')
pro_regressor.add_regressor('Toro_Negro_2_mean')

#Fitting the data
pro_regressor.fit(train_X)
future_data = pro_regressor.make_future_dataframe(periods=30) # 30 days
#forecast the data for Test  data
forecast_data = pro_regressor.predict(test_X)
pro_regressor.plot(forecast_data);

Predicted data is the blue shaded region at the end.

<div class="h2">Forecast quality evaluation for Power Plant over the year</div>
<a id="M2"></a>
[Back to Table of Contents](#top)

[General Findings](#theend)

Let's evaluate the quality of the per-plant model by calculating error metrics over a 30-day forecast horizon. For this, we need the observations $y_i$ and the corresponding predicted values $\hat{y}_i$.


In [None]:
df_cv_reg = cross_validation(pro_regressor, horizon='30 days')
df_p_reg = performance_metrics(df_cv_reg)
df_p_reg.head(30)

The RMSE at the 30-day horizon is 1.778483e-11.

In [None]:
fig_mape_reg = plot_cross_validation_metric(df_cv_reg, metric='mape')
fig_rmse_reg = plot_cross_validation_metric(df_cv_reg, metric='rmse')

In [None]:
from fbprophet.plot import plot_plotly, add_changepoints_to_plot
import plotly.offline as py

fig_d_reg = plot_plotly(pro_regressor, forecast_data)

py.iplot(fig_d_reg) 

fig_d_reg = pro_regressor.plot(forecast_data,xlabel='Date',ylabel='no2 Density by Location of Plant values')

# Save The Model for primary_fuel

In [None]:
import pickle
with open('forecast_model_No2Density.pckl', 'wb') as fout:
    pickle.dump(pro_regressor, fout)
with open('forecast_model_No2Density.pckl', 'rb') as fin:
    m2 = pickle.load(fin)

### Another Example - Verification A.E.S.Corp 

In [None]:
power_plants_df[['name','primary_fuel','plant']][power_plants_df['primary_fuel']=='Coal']

In [None]:
train_dataset['A.E.S._Corp._mean'] = X['A.E.S._Corp._mean']
# Additional Regressor
pro_regressor= Prophet()
pro_regressor.add_regressor('A.E.S._Corp._mean')
train_X= train_dataset[:200]
test_X= train_dataset[200:]

#Fitting the data
pro_regressor.fit(train_X)
future_data = pro_regressor.make_future_dataframe(periods=30) # 30 days
#forecast the data for Test  data
forecast_data = pro_regressor.predict(test_X)
pro_regressor.plot(forecast_data);

### Forecast quality evaluation for region
Let's evaluate the quality of the single-regressor model by calculating error metrics over a 30-day forecast horizon. For this, we need the observations $y_i$ and the corresponding predicted values $\hat{y}_i$.


In [None]:
df_cv_reg = cross_validation(pro_regressor, horizon='30 days')
df_p_reg = performance_metrics(df_cv_reg)
df_p_reg.head(30)

In [None]:
fig_mape_reg = plot_cross_validation_metric(df_cv_reg, metric='mape')
fig_rmse_reg = plot_cross_validation_metric(df_cv_reg, metric='rmse')

<div class="h2">Outlier Analysis of Power Plant - Coal over the year</div>
<a id="M3"></a>
[Back to Table of Contents](#top)

[General Findings](#theend)
   


Identifying outliers for A.E.S._Corp 

In [None]:
month_p_fuel = X[['start_date', 'no2_density_locationofplant', 'A.E.S._Corp._mean']].copy()
month_p_fuel['date'] = pd.to_datetime(X["start_date"])
month_p_fuel['date'] = month_p_fuel['date'].dt.month
month_p_fuel = month_p_fuel.groupby(['date','A.E.S._Corp._mean']).sum()
month_p_fuel


In [None]:
month_p_fuel_agg = month_p_fuel.groupby(['date', 'A.E.S._Corp._mean']).agg(['sum'])
month_p_fuel_agg = month_p_fuel_agg.reset_index()
level_0 = month_p_fuel_agg.columns.droplevel(0)
level_1 = month_p_fuel_agg.columns.droplevel(1)
level_0 = ['' if x == '' else '-' + x for x in level_0]

month_p_fuel_agg.columns = level_1 + level_0
month_p_fuel_agg.rename_axis(None, axis=1)
# month_p_fuel_agg.head()

In [None]:
from plotly import tools, subplots
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.express as px

fig_total = px.line(month_p_fuel_agg, x='date', y='no2_density_locationofplant-sum', color='A.E.S._Corp._mean', render_mode='svg')
fig_total.update_layout(title='Total NO2 aspect in A.E.S._Corp._mean - 12 months')
fig_total.show()

The sum, faceted by the A.E.S._Corp._mean value, shows some aberrant values, for example in month 4 (April). August has the lowest NO2 value.

### Outlier Analysis of Power Plant - Coal day of week

In [None]:
dayweek_p_fuel = X[['start_date', 'no2_density_locationofplant', 'A.E.S._Corp._mean']].copy()
dayweek_p_fuel['dateofweek'] = pd.to_datetime(X["start_date"])
dayweek_p_fuel['dateofweek'] = dayweek_p_fuel['dateofweek'].dt.dayofweek
dayweek_p_fuel = dayweek_p_fuel.groupby(['dateofweek','A.E.S._Corp._mean']).sum()

dayofweek_p_fuel_agg = dayweek_p_fuel.groupby(['dateofweek', 'A.E.S._Corp._mean']).agg(['sum'])
dayofweek_p_fuel_agg = dayofweek_p_fuel_agg.reset_index()
level_0 = dayofweek_p_fuel_agg.columns.droplevel(0)
level_1 = dayofweek_p_fuel_agg.columns.droplevel(1)
level_0 = ['' if x == '' else '-' + x for x in level_0]

dayofweek_p_fuel_agg.columns = level_1 + level_0
dayofweek_p_fuel_agg.rename_axis(None, axis=1)

### Outlier Analysis - day of the week with Monday=0, Sunday=6

In [None]:
from plotly import tools, subplots
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.express as px

fig_total = px.line(dayofweek_p_fuel_agg, x='dateofweek', y='no2_density_locationofplant-sum', color='A.E.S._Corp._mean', render_mode='svg')
fig_total.update_layout(title='Total NO2 aspect in A.E.S._Corp._mean - day of week')
fig_total.show()

The sum, faceted by the A.E.S._Corp._mean value, shows some aberrant values. For example, Thursday (3) has the largest aberrant value, and Tuesday (1) has the lowest NO2 value.

## Our modeling investigates regions and primary_fuel, and it lets us decompose emission factors between plants and over time and identify anomalous events.

   <hr>
Inspired by: [Exploratory Data Analysis and Factor Model](https://www.kaggle.com/ragnar123/exploratory-data-analysis-and-factor-model-idea),
[Modelling of emissions of power plants](https://www.kaggle.com/tiurii/ds4g-modelling-of-emissions-of-power-plants)

source: [Survey](https://www.tandfonline.com/doi/full/10.1080/10962247.2019.1577314?scroll=top&needAccess=true),[Arima python](https://machinelearningmastery.com/arima-for-time-series-forecasting-with-python/), [anomaly detection](https://github.com/Vicam/Unsupervised_Anomaly_Detection/blob/master/Anomaly%20detection%2C%20different%20methods%20on%20a%20simple%20example.ipynb), [Intro LSTM](https://colah.github.io/posts/2015-08-Understanding-LSTMs/),[LSTM](https://www.kaggle.com/faressayah/stock-market-analysis-prediction-using-lstm), [prophet facebook](https://towardsdatascience.com/time-series-prediction-using-prophet-in-python-35d65f626236), [forecast in python](https://towardsdatascience.com/forecasting-in-python-with-facebook-prophet-29810eb57e66)

   <hr>
<a id='ds5'></a>
# <div class="h2">About the data</div>
<a id="ABOUTTHEDATA"></a>

[Back to Table of Contents](#top)

[General Findings](#theend)
<hr>

[Global Power Plant database ](https://developers.google.com/earth-engine/datasets/catalog/WRI_GPPD_power_plants) by WRI
> Description
The Global Power Plant Database is a comprehensive, open source database of power plants around the world. It centralizes power plant data to make it easier to navigate, compare and draw insights for one’s own analysis. The database covers approximately 30,000 power plants from 164 countries and includes thermal plants (e.g. coal, gas, oil, nuclear, biomass, waste, geothermal) and renewables (e.g. hydro, wind, solar). Each power plant is geolocated and entries contain information on plant capacity, generation, ownership, and fuel type. It will be continuously updated as data becomes available.

[Sentinel 5P OFFL NO2](https://developers.google.com/earth-engine/datasets/catalog/COPERNICUS_S5P_OFFL_L3_NO2) by [EU/ESA/Copernicus](https://sentinel.esa.int/web/sentinel/user-guides/sentinel-5p-tropomi/document-library)
> Sentinel-5 Precursor
Sentinel-5 Precursor is a satellite launched on 13 October 2017 by the European Space Agency to monitor air pollution. The onboard sensor is frequently referred to as Tropomi (TROPOspheric Monitoring Instrument). The OFFL/NO2 is a dataset that provides offline high-resolution imagery of **NO2 concentration**.

[Global Forecast System 384-Hour Predicted Atmosphere Data](https://developers.google.com/earth-engine/datasets/catalog/NOAA_GFS0P25) by NOAA/NCEP/EMC
> The Global Forecast System (GFS) is a weather forecast model produced by the National Centers for Environmental Prediction (NCEP). The GFS dataset consists of selected model outputs (described below) as gridded forecast variables. The 384-hour forecasts, with 3-hour forecast interval, are made at 6-hour temporal resolution (i.e. updated four times daily). Use the 'creation_time' and 'forecast_time' properties to select data of interest.

[Global Land Data Assimilation System](https://developers.google.com/earth-engine/datasets/catalog/NASA_GLDAS_V021_NOAH_G025_T3H) by NASA
> Global Land Data Assimilation System (GLDAS) ingests satellite and ground-based observational data products. Using advanced land surface modeling and data assimilation techniques, it generates optimal fields of land surface states and fluxes. This dataset is provided by NASA.

Participants may also consider using other public datasets related to trade commodities for fuel types, total fuel consumed, and/or data from the [US Energy Information Administration (EIA)](https://www.eia.gov/state/data.php?sid=RQ#CarbonDioxideEmissions).

<hr>
<a id='ds5'></a>
# <div class="h2">Don't hesitate to give your suggestions in the comment section.</div>
<a id="theend"></a>
# <div class="h3">Remember the upvote button is next to the fork button, and it's free too! ;)</div>

# Ending note