#### Columns to keep

- meas_columns
- L3_CO_CO_column_number_density
- L3_NO2_NO2_column_number_density
- L3_CO_cloud_height
- L3_NO2_absorbing_aerosol_index
- L3_NO2_cloud_fraction
- L3_HCHO_cloud_fraction
- L3_HCHO_tropospheric_HCHO_column_number_density

- L3_SO2_SO2_column_number_density
- L3_SO2_absorbing_aerosol_index
- L3_SO2_cloud_fraction
- L3_O3_O3_column_number_density
- L3_O3_cloud_fraction

- L3_CLOUD_cloud_fraction
- L3_CLOUD_cloud_optical_depth
- L3_CLOUD_cloud_base_height
- L3_AER_AI_absorbing_aerosol_index

In [1]:
# imports
import warnings
warnings.filterwarnings('ignore')  # suppress library warnings in the notebook output

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.ticker import PercentFormatter

# plot styling: default figure size, white background, black axes
plt.rcParams.update({"figure.figsize": (8, 5), "axes.facecolor": "white", "axes.edgecolor": "black"})
plt.rcParams["figure.facecolor"] = "w"
pd.plotting.register_matplotlib_converters()
pd.set_option('display.float_format', lambda x: '%.2f' % x)  # show floats with two decimal places


The ground-based air quality sensors measure the target variable, the PM2.5 particle concentration. In addition to the target column (the daily mean concentration), there are columns for the minimum and maximum readings on that day, the variance of the readings, and the number (count) of sensor readings used to compute the target value. These columns are provided only for the train set; the target variable must be predicted for the test set.

In [2]:
# load the data
df_train = pd.read_csv('data/Train.csv')
df_train.head()

Unnamed: 0,Place_ID X Date,Date,Place_ID,target,target_min,target_max,target_variance,target_count,precipitable_water_entire_atmosphere,relative_humidity_2m_above_ground,...,L3_SO2_sensor_zenith_angle,L3_SO2_solar_azimuth_angle,L3_SO2_solar_zenith_angle,L3_CH4_CH4_column_volume_mixing_ratio_dry_air,L3_CH4_aerosol_height,L3_CH4_aerosol_optical_depth,L3_CH4_sensor_azimuth_angle,L3_CH4_sensor_zenith_angle,L3_CH4_solar_azimuth_angle,L3_CH4_solar_zenith_angle
0,010Q650 X 2020-01-02,2020-01-02,010Q650,38.0,23.0,53.0,769.5,92,11.0,60.2,...,38.59,-61.75,22.36,1793.79,3227.86,0.01,74.48,37.5,-62.14,22.55
1,010Q650 X 2020-01-03,2020-01-03,010Q650,39.0,25.0,63.0,1319.85,91,14.6,48.8,...,59.62,-67.69,28.61,1789.96,3384.23,0.02,75.63,55.66,-53.87,19.29
2,010Q650 X 2020-01-04,2020-01-04,010Q650,24.0,8.0,56.0,1181.96,96,16.4,33.4,...,49.84,-78.34,34.3,,,,,,,
3,010Q650 X 2020-01-05,2020-01-05,010Q650,49.0,10.0,55.0,1113.67,96,6.91,21.3,...,29.18,-73.9,30.55,,,,,,,
4,010Q650 X 2020-01-06,2020-01-06,010Q650,21.0,9.0,52.0,1164.82,95,13.9,44.7,...,0.8,-68.61,26.9,,,,,,,
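
As a quick plausibility check of the description above, a daily mean should lie between that day's minimum and maximum reading. The sketch below tests this on the five rows shown in the `head()` output; the name `sample` is only used for this illustration, and on the full data the same comparison could be run on `df_train` directly.

```python
import pandas as pd

# Target columns of the first five rows, copied from the head() output above
sample = pd.DataFrame({
    "target":       [38.0, 39.0, 24.0, 49.0, 21.0],
    "target_min":   [23.0, 25.0, 8.0, 10.0, 9.0],
    "target_max":   [53.0, 63.0, 56.0, 55.0, 52.0],
    "target_count": [92, 91, 96, 96, 95],
})

# The daily mean must not fall outside the [min, max] range of that day
in_range = (sample["target_min"] <= sample["target"]) & (sample["target"] <= sample["target_max"])
print("rows with mean inside [min, max]:", int(in_range.sum()), "of", len(sample))
```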


In [3]:
### List of weather and satellite measurement_columns to keep
col_keep = list(df_train.columns[0:14]) + ['L3_AER_AI_absorbing_aerosol_index', # AER
                # CLOUD
                'L3_CLOUD_cloud_base_height',
                'L3_CLOUD_cloud_fraction',
                'L3_CLOUD_cloud_optical_depth',
                ## NO2
                'L3_NO2_NO2_column_number_density',
                'L3_NO2_absorbing_aerosol_index',
                'L3_NO2_cloud_fraction',
                ## CO
                'L3_CO_CO_column_number_density',
                'L3_CO_cloud_height',
                ## HCHO
                'L3_HCHO_tropospheric_HCHO_column_number_density',
                'L3_HCHO_cloud_fraction',
                ## O3
                'L3_O3_O3_column_number_density',
                'L3_O3_cloud_fraction',
                ## SO2
                'L3_SO2_SO2_column_number_density',
                'L3_SO2_absorbing_aerosol_index',
                'L3_SO2_cloud_fraction'
                ]

df_train = df_train[col_keep]

In [4]:
import pickle

In [5]:
pd.to_pickle(df_train, 'data/df_train.pkl')

In [6]:
pd.to_pickle(col_keep, 'data/col_keep.pkl')
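
A later notebook can presumably load both files again with `pd.read_pickle`, which returns whatever object was pickled (a DataFrame or a plain list). The following is a minimal round-trip sketch on a toy frame; it writes to a temporary directory, not to `data/`, so it does not touch the real files.

```python
import os
import tempfile

import pandas as pd

# Toy stand-ins for df_train and col_keep (values for illustration only)
df = pd.DataFrame({"Place_ID": ["010Q650", "010Q650"], "target": [38.0, 39.0]})
cols = list(df.columns)

with tempfile.TemporaryDirectory() as tmp:
    df_path = os.path.join(tmp, "df_train.pkl")
    cols_path = os.path.join(tmp, "col_keep.pkl")

    # pd.to_pickle accepts any picklable object, not only DataFrames
    pd.to_pickle(df, df_path)
    pd.to_pickle(cols, cols_path)

    # read both objects back before the temporary directory is removed
    df_loaded = pd.read_pickle(df_path)
    cols_loaded = pd.read_pickle(cols_path)

print(df_loaded.equals(df), cols_loaded == cols)
```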

In [7]:
df_train.shape

(30557, 30)

#### To Do

- **train-test-split**
- replace 0 with NaN in relevant columns
- impute missing values
- (log-transform particle concentration)

- drop all trace-gas columns (keep only the weather-related columns) for the baseline models
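
The steps in this list could be sketched roughly as follows. This is a minimal illustration on a made-up toy frame, not the final pipeline: which columns count as "relevant" for the 0-to-NaN replacement, the median imputation strategy, and the use of `np.log1p` are all assumptions. Splitting before imputing keeps the imputer from learning anything from the test rows.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy stand-in for df_train; all values are made up for illustration
toy = pd.DataFrame({
    "target": [38.0, 39.0, 24.0, 49.0, 21.0, 30.0, 45.0, 27.0],
    "temperature_2m_above_ground": [18.5, 19.1, 17.3, 20.2, 16.8, 18.0, 21.4, 17.9],
    "L3_NO2_NO2_column_number_density": [0.0, 7e-5, 9e-5, 0.0, 6e-5, 8e-5, np.nan, 5e-5],
})
# Assumption: a zero in these columns means "no measurement"
zero_as_missing = ["L3_NO2_NO2_column_number_density"]

X = toy.drop(columns="target")
y = np.log1p(toy["target"])  # optional log-transform; invert with np.expm1

# 1) train-test-split first
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
X_train, X_test = X_train.copy(), X_test.copy()

# 2) replace 0 with NaN in the chosen columns
X_train[zero_as_missing] = X_train[zero_as_missing].replace(0, np.nan)
X_test[zero_as_missing] = X_test[zero_as_missing].replace(0, np.nan)

# 3) impute missing values; the medians come from the training rows only
imputer = SimpleImputer(strategy="median")
X_train_imp = pd.DataFrame(imputer.fit_transform(X_train), columns=X_train.columns, index=X_train.index)
X_test_imp = pd.DataFrame(imputer.transform(X_test), columns=X_test.columns, index=X_test.index)

print(X_train_imp.isna().sum().sum(), X_test_imp.isna().sum().sum())
```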

In [8]:
df_train.drop(['target_min', 'target_max', 'target_variance', 'target_count'], axis=1, inplace=True)


In [10]:
from sklearn.model_selection import train_test_split

# define X and y
X = df_train.drop(['target'], axis=1)
y = df_train.target

# train-test-split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [23]:
from sklearn.linear_model import LinearRegression

# weather-related feature columns used for the baseline
coluse = [
    'precipitable_water_entire_atmosphere',
    'relative_humidity_2m_above_ground',
    'specific_humidity_2m_above_ground',
    'temperature_2m_above_ground',
    'u_component_of_wind_10m_above_ground',
    'v_component_of_wind_10m_above_ground',
]

# instantiate the model
linreg = LinearRegression()

# fit model
linreg.fit(X_train[coluse], y_train)

LinearRegression()

In [24]:
# predict
y_train_pred = linreg.predict(X_train[coluse])
y_pred = linreg.predict(X_test[coluse])

In [21]:
# only regression metrics are needed for this baseline
from sklearn.metrics import mean_squared_error, r2_score

In [25]:
# Mean Squared Error
print('MSE_Baseline Train:\n', mean_squared_error(y_train, y_train_pred))
print('MSE_Baseline Test:\n', mean_squared_error(y_test, y_pred))

# Root Mean Squared Error
# (np.sqrt works across scikit-learn versions; the `squared` argument is deprecated in newer releases)
print('RMSE_Baseline Train:\n', np.sqrt(mean_squared_error(y_train, y_train_pred)))
print('RMSE_Baseline Test:\n', np.sqrt(mean_squared_error(y_test, y_pred)))

# R^2 Score
print('R^2 Train:\n', r2_score(y_train, y_train_pred))
print('R^2 Test:\n', r2_score(y_test, y_pred))


MSE_Baseline Train:
 2114.3870898762484
MSE_Baseline Test:
 2067.9936882800084
RMSE_Baseline Train:
 45.98246502609716
RMSE_Baseline Test:
 45.47519860627338
R^2 Train:
 0.04384884486838381
R^2 Test:
 0.04247753517089092
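
To put these numbers into perspective, the short check below recomputes the RMSE from the printed MSE and derives what a model that always predicts the mean would score. It relies only on the definition R^2 = 1 - MSE / Var(y); the variable names are chosen for this illustration.

```python
import math

# Values copied from the output above
mse_train, mse_test = 2114.3870898762484, 2067.9936882800084
r2_train, r2_test = 0.04384884486838381, 0.04247753517089092

# RMSE is the square root of the MSE
rmse_train = math.sqrt(mse_train)
rmse_test = math.sqrt(mse_test)

# R^2 = 1 - MSE / Var(y)  =>  Var(y) = MSE / (1 - R^2).
# sqrt(Var(y)) is the RMSE of a constant prediction equal to the mean of
# the respective y (y_train for train, y_test for test).
rmse_mean_train = math.sqrt(mse_train / (1 - r2_train))
rmse_mean_test = math.sqrt(mse_test / (1 - r2_test))

print(f"train: model {rmse_train:.2f} vs. mean-only {rmse_mean_train:.2f}")
print(f"test:  model {rmse_test:.2f} vs. mean-only {rmse_mean_test:.2f}")
```

With an R^2 of about 0.04, the linear baseline is only marginally better than predicting the mean, which leaves plenty of room for later models.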


## Baseline Model

Air Pollution Prediction

- **Value of product:** predict the PM2.5 particle concentration
- **Prediction:** PM2.5 value
- **Evaluation metric:** RMSE (also recommended by Zindi)
- **Baseline model:** linear regression

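$\text{PM2.5} = b_0 + b_1 \cdot \text{precipitable water} + b_2 \cdot \text{relative humidity} + b_3 \cdot \text{specific humidity} + b_4 \cdot \text{temperature} + b_5 \cdot \text{wind}_u + b_6 \cdot \text{wind}_v$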
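
In scikit-learn, the intercept and the coefficients of such a formula are available on the fitted model as `intercept_` and `coef_`. The sketch below shows how they can be paired with the feature names. It uses synthetic data with known coefficients (an assumption made purely for illustration), so the printed values are not those of the baseline fitted above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
features = ["temperature_2m_above_ground", "relative_humidity_2m_above_ground"]

# Synthetic features and a noise-free target with known intercept 50
# and coefficients -2 and 0.5 (illustration only)
X = pd.DataFrame(rng.normal(size=(200, 2)), columns=features)
y = 50 - 2 * X[features[0]] + 0.5 * X[features[1]]

linreg = LinearRegression().fit(X, y)

# coef_ follows the column order of X, so it can be labelled with the feature names
coefs = pd.Series(linreg.coef_, index=features)
print("intercept:", linreg.intercept_)
print(coefs)
```

On the real model, the same two attributes of `linreg` together with `coluse` would give the fitted intercept and one coefficient per weather feature.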

**Score:**

- RMSE_Baseline Train: 45.98246502609716
- RMSE_Baseline Test: 45.47519860627338
- R^2 Train: 0.04384884486838381
- R^2 Test: 0.04247753517089092

In [26]:
# create windstrength column

df_train['windstrength'] = np.sqrt(df_train.u_component_of_wind_10m_above_ground**2 + df_train.v_component_of_wind_10m_above_ground**2)
df_train.windstrength.describe()

count   30557.00
mean        3.10
std         2.21
min         0.02
25%         1.50
50%         2.55
75%         4.15
max        18.16
Name: windstrength, dtype: float64
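
The wind strength is the Euclidean norm of the two wind components, so `np.hypot` gives the same result as the explicit square root. The toy values below are made up for illustration. Note that `X_train` and `X_test` were created before this cell, so if `windstrength` is meant to be a model feature, it would presumably have to be added before the split.

```python
import numpy as np
import pandas as pd

# Toy u/v wind components (made-up values)
wind = pd.DataFrame({"u": [3.0, 0.0, -1.5], "v": [4.0, -2.0, 2.0]})

# explicit formula, as used in the cell above
speed_sqrt = np.sqrt(wind["u"]**2 + wind["v"]**2)

# equivalent, more compact form
speed_hypot = np.hypot(wind["u"], wind["v"])

print(np.allclose(speed_sqrt, speed_hypot))
```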