In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="whitegrid")

## Data Understanding

**The data comes from three main sources:**

1. *Ground-based air quality sensors*. These measure the target variable (PM2.5 particle concentration). In addition to the target column (which is the daily mean concentration) there are also columns for minimum and maximum readings on that day, the variance of the readings and the total number (count) of sensor readings used to compute the target value. This data is only provided for the train set - you must predict the target variable for the test set.
2. *The Global Forecast System (GFS) for weather data*. This provides humidity, temperature and wind speed, which can be used as inputs for your model.
3. *The Sentinel 5P satellite*. This satellite monitors various pollutants in the atmosphere. For each pollutant, we queried the offline Level 3 (L3) datasets available in Google Earth Engine (you can read more about the individual products here: https://developers.google.com/earth-engine/datasets/catalog/sentinel-5p). For a given pollutant, for example NO2, we provide all data from the Sentinel 5P dataset for that pollutant. This includes the key measurements like NO2_column_number_density (a measure of NO2 concentration) as well as metadata like the satellite altitude. We recommend that you focus on the key measurements, either the column_number_density or the tropospheric_X_column_number_density (which measures density closer to Earth’s surface).
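
As a rough illustration of the recommendation above, the sketch below picks out the density columns by name and derives a wind speed from the two GFS wind components. The helper `key_measurement_columns`, the derived column `wind_speed_10m`, and the choice to exclude the slant columns are assumptions made for this example; the toy values are made up and only the column names come from the dataset.

```python
import numpy as np
import pandas as pd


def key_measurement_columns(columns):
    """Hypothetical helper: keep Sentinel 5P density columns, skip metadata."""
    return [
        c for c in columns
        if c.startswith("L3_")
        and c.endswith("column_number_density")
        and "slant" not in c  # assumption: slant columns are not treated as key measurements
    ]


# Toy frame: column names taken from the dataset, values invented for illustration
toy = pd.DataFrame({
    "L3_NO2_NO2_column_number_density": [5.0e-05, 5.5e-05],
    "L3_NO2_NO2_slant_column_number_density": [1.0e-04, 1.1e-04],
    "L3_NO2_sensor_altitude": [840000.0, 840100.0],
    "u_component_of_wind_10m_above_ground": [3.0, 0.0],
    "v_component_of_wind_10m_above_ground": [4.0, -2.0],
})

key_cols = key_measurement_columns(toy.columns)

# Wind speed is the magnitude of the (u, v) wind vector
toy["wind_speed_10m"] = np.hypot(
    toy["u_component_of_wind_10m_above_ground"],
    toy["v_component_of_wind_10m_above_ground"],
)
```

On the real data, the same helper could be called as `key_measurement_columns(data.columns)` to get a shortlist of satellite features to start from.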

In [4]:
data = pd.read_csv("Test.csv")

In [7]:
data.head()

Unnamed: 0,Place_ID X Date,Date,Place_ID,precipitable_water_entire_atmosphere,relative_humidity_2m_above_ground,specific_humidity_2m_above_ground,temperature_2m_above_ground,u_component_of_wind_10m_above_ground,v_component_of_wind_10m_above_ground,L3_NO2_NO2_column_number_density,...,L3_SO2_sensor_zenith_angle,L3_SO2_solar_azimuth_angle,L3_SO2_solar_zenith_angle,L3_CH4_CH4_column_volume_mixing_ratio_dry_air,L3_CH4_aerosol_height,L3_CH4_aerosol_optical_depth,L3_CH4_sensor_azimuth_angle,L3_CH4_sensor_zenith_angle,L3_CH4_solar_azimuth_angle,L3_CH4_solar_zenith_angle
0,0OS9LVX X 2020-01-02,2020-01-02,0OS9LVX,11.6,30.200001,0.00409,14.656824,3.956377,0.712605,5.3e-05,...,1.445658,-95.984984,22.942019,,,,,,,
1,0OS9LVX X 2020-01-03,2020-01-03,0OS9LVX,18.300001,42.900002,0.00595,15.026544,4.23043,0.661892,5e-05,...,34.641758,-95.014908,18.539116,,,,,,,
2,0OS9LVX X 2020-01-04,2020-01-04,0OS9LVX,17.6,41.299999,0.0059,15.511041,5.245728,1.640559,5e-05,...,55.872276,-94.015418,14.14082,,,,,,,
3,0OS9LVX X 2020-01-05,2020-01-05,0OS9LVX,15.011948,53.100002,0.00709,14.441858,5.454001,-0.190532,5.5e-05,...,59.174188,-97.247602,32.730553,,,,,,,
4,0OS9LVX X 2020-01-06,2020-01-06,0OS9LVX,9.7,71.599998,0.00808,11.896295,3.511787,-0.279441,5.5e-05,...,40.925873,-96.057265,28.320527,1831.261597,3229.118652,0.031068,-100.278343,41.84708,-95.910744,28.498789


In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16136 entries, 0 to 16135
Data columns (total 77 columns):
 #   Column                                               Non-Null Count  Dtype  
---  ------                                               --------------  -----  
 0   Place_ID X Date                                      16136 non-null  object 
 1   Date                                                 16136 non-null  object 
 2   Place_ID                                             16136 non-null  object 
 3   precipitable_water_entire_atmosphere                 16136 non-null  float64
 4   relative_humidity_2m_above_ground                    16136 non-null  float64
 5   specific_humidity_2m_above_ground                    16136 non-null  float64
 6   temperature_2m_above_ground                          16136 non-null  float64
 7   u_component_of_wind_10m_above_ground                 16136 non-null  float64
 8   v_component_of_wind_10m_above_ground                 16136 non-nul

**Observations from data.info():**
1. The dataset contains 16136 rows and 77 columns.
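2. There are 3 columns with object data types (`Place_ID X Date`, `Date` and `Place_ID`) and 74 columns with float64 data types. Note that `Date` is stored as a string, not as a datetime.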
3. Some columns have missing values, as indicated by the non-null counts being less than 16136.
4. The memory usage of the dataset is approximately 9.5 MB.
5. The column names are descriptive, but some may need renaming for better readability (for example, `Place_ID X Date` contains spaces).
6. The dataset includes a mix of meteorological and atmospheric measurements.
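
The two clean-up steps suggested by these observations, renaming an awkward column and parsing `Date`, could look roughly like the sketch below. The new name `place_date_id` is a hypothetical choice, and the toy frame only mimics a few columns of the real data.

```python
import pandas as pd

# Toy frame mimicking a few columns of the dataset
toy = pd.DataFrame({
    "Place_ID X Date": ["0OS9LVX X 2020-01-02", "0OS9LVX X 2020-01-03"],
    "Date": ["2020-01-02", "2020-01-03"],
    "Place_ID": ["0OS9LVX", "0OS9LVX"],
    "temperature_2m_above_ground": [14.656824, 15.026544],
})

# Rename the column that contains spaces (the new name is an assumption)
toy = toy.rename(columns={"Place_ID X Date": "place_date_id"})

# Parse the date strings so that date arithmetic and resampling become possible
toy["Date"] = pd.to_datetime(toy["Date"], format="%Y-%m-%d")

# Numeric feature columns, and the total memory footprint in bytes
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
memory_bytes = int(toy.memory_usage(deep=True).sum())
```

Passing `deep=True` to `memory_usage` includes the memory held by the string columns, so the result can be larger than the figure that `data.info()` reports by default.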

In [18]:
# percent of missing values in each column
missing_columns = data.columns[data.isnull().any()]
data[missing_columns].isnull().mean()*100

L3_NO2_NO2_column_number_density           8.223847
L3_NO2_NO2_slant_column_number_density     8.223847
L3_NO2_absorbing_aerosol_index             8.223847
L3_NO2_cloud_fraction                      8.223847
L3_NO2_sensor_altitude                     8.223847
                                            ...    
L3_CH4_aerosol_optical_depth              80.695340
L3_CH4_sensor_azimuth_angle               80.695340
L3_CH4_sensor_zenith_angle                80.695340
L3_CH4_solar_azimuth_angle                80.695340
L3_CH4_solar_zenith_angle                 80.695340
Length: 68, dtype: float64

**Observations of Missing Values:**

1. The dataset contains 77 columns, 68 of which have missing values.
2. The percentage of missing values varies across columns, with some columns having a significant proportion of missing data.
3. Columns related to CH4 (e.g., `L3_CH4_CH4_column_volume_mixing_ratio_dry_air`) have the highest percentage of missing values, with only 3115 non-null entries out of 16136 rows (~80.7% missing).
4. Columns related to NO2, O3, CO, HCHO, and SO2 have varying levels of missing data, with some columns missing up to ~30% of their values.
5. The presence of missing values may require imputation, removal, or other preprocessing techniques depending on their significance to the analysis or model.
6. The missing data might be due to limitations in data collection from sensors or satellites, and this should be considered when interpreting the results.
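
One possible way to act on these observations is sketched below: drop columns whose share of missing values exceeds a threshold, then fill the remaining gaps with the median for each place. The 50% threshold, the helper name `drop_sparse_columns` and per-place median imputation are all assumptions made for illustration, not choices taken from this notebook. The toy values are invented.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df, max_missing=0.5):
    """Hypothetical helper: keep columns whose missing share is at most max_missing."""
    missing_share = df.isnull().mean()
    keep = missing_share[missing_share <= max_missing].index
    return df[keep]


# Toy frame: column names from the dataset, values invented
toy = pd.DataFrame({
    "Place_ID": ["A", "A", "A", "B", "B"],
    "L3_NO2_NO2_column_number_density": [5.3e-05, np.nan, 5.0e-05, 5.5e-05, 5.5e-05],
    "L3_CH4_aerosol_optical_depth": [np.nan, np.nan, np.nan, np.nan, 0.031068],
})

# The CH4 column is 80% missing here, so it is dropped
reduced = drop_sparse_columns(toy, max_missing=0.5)

# Fill the remaining gaps with the median of the same place
num_cols = reduced.select_dtypes(include="number").columns.tolist()
imputed = reduced.copy()
imputed[num_cols] = reduced.groupby("Place_ID")[num_cols].transform(
    lambda s: s.fillna(s.median())
)
```

A per-place median keeps the imputed values within the typical range for each location. If a place has no observed values at all for a column, this approach leaves the gap unfilled, so a global fallback may still be needed on the real data.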

68