# Air Pollution Challenge
### Iris Winkler, Carlos Duque, Johannes Gooth

neuefische can-ds-24-2

The Air Pollution Challenge on Zindi is focused on predicting air quality in various cities around the world using satellite data. Participants in this challenge are tasked with using machine learning techniques to estimate pollution levels, specifically PM2.5 concentrations. The challenge leverages data that spans across different geographical locations and conditions, providing a platform for developers to test and enhance their predictive modeling skills in real-world scenarios.

The competition is designed to address the growing issue of urban air pollution, providing insights and solutions that can help in managing air quality in densely populated cities. The use of satellite data in this context allows for a wide-reaching analysis that can cover areas where ground-based sensors might not be available, thus filling crucial data gaps in global environmental health studies.

For more detailed discussions and resources related to this challenge, including data sets and participant discussions, you can visit the official challenge page on Zindi's website: Urban Air Pollution Challenge on Zindi.

## Set-Up the Working Environment

In [1]:
# Avoid restarting the kernel
%load_ext autoreload
%autoreload 2

# Import of relevant packages
import numpy as np
import pandas as pd
import warnings
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

# Set random seed 
RSEED = 7
warnings.filterwarnings("ignore")

## Import the Data

In [2]:
# Import the data into a Pandas DataFrame
df_train = pd.read_csv('data/Train.csv')
df_test = pd.read_csv('data/Test.csv')

## Quick Look at the Data

In [None]:
# Display first 5 rows of the train DataFrame
df_train.head()

In [None]:
# Display first 5 rows of the test DataFrame
df_test.head()

## Train-Test Split
The df_test DataFrame holds the final evaluation dataset for the Air Pollution challenge. To ensure that we do not use this data until our final model is ready, we also divide df_train into training and testing subsets. This allows us to assess our models during the optimization process:

In [6]:
# Define your features and target variable
X = df_train.drop('target', axis=1)  # Features (all columns except 'target')
y = df_train['target']               # Target variable

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=RSEED)

# Checking the dimensions of the splits
print("Training set features shape:", X_train.shape)
print("Training set target shape:", y_train.shape)
print("Testing set features shape:", X_test.shape)
print("Testing set target shape:", y_test.shape)

Training set features shape: (21389, 81)
Training set target shape: (21389,)
Testing set features shape: (9168, 81)
Testing set target shape: (9168,)


## Explore the Data

### Shape of the DataFrames

In [None]:
# Shape of the train DataFrame
X_train.shape

X_train has 81 columns and 21389 rows, i.e. about 70% of the 30557 rows in the full train DataFrame.

In [None]:
# Shape of the test DataFrame
df_test.shape

The test DataFrame has 77 columns and 16136 rows. 

Thus, the train and test DataFrames have different columns. We will analyze the differences between the two DataFrames by comparing their column names:

In [None]:
# Get sets of column names for both DataFrames
columns_X_train = set(X_train.columns)
columns_df_test = set(df_test.columns)
    
# Find columns that are unique to each DataFrame
unique_to_df_train = list(columns_X_train - columns_df_test)

print(unique_to_df_train)

The SampleSubmission file clearly states that only the 'target' column is the target value of our model (i.e. the PM2.5 concentration) and should be predicted. Therefore, 'target_variance', 'target_count', 'target_max' and 'target_min' could only be considered as additional features. However, training a machine learning model on features that do not appear in the test data can lead to several issues and is generally not recommended. Here are some key points to consider:
1. Generalization Ability: The primary goal of a machine learning model is to generalize well from the training data to unseen data. If the model is trained on features that are not available in the test data, it might learn patterns that are not applicable when making predictions in real-world scenarios or during evaluation, leading to poor performance.
2. Overfitting: Training on features not present in the test set can cause the model to overfit to the training data. Overfitting occurs when a model learns noise or irrelevant details in the training data instead of capturing the underlying patterns applicable more broadly. This typically results in high accuracy on training data but poor accuracy on unseen (test) data.
3. Feature Relevance: Features used in training should ideally be representative of the data the model will ultimately work with, including the test set and any real-world application data. Features that do not appear in the test set might not be relevant or available when the model is deployed, making them practically useless and potentially misleading during the training phase.
4. Model Complexity: Including irrelevant or unavailable features increases the complexity of the model unnecessarily. This not only affects the model's efficiency but can also complicate the model maintenance and interpretability.
5. Resource Allocation: Training on irrelevant features consumes computational resources and time, which could be better spent on feature engineering, model tuning, or training with relevant features that improve the model's performance on the test set.

Therefore, we decided to drop the other unique columns.
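
The column comparison and the drop described above can be sketched on toy data as follows. The frames `toy_train` and `toy_test` and the helper `drop_train_only_columns` are hypothetical names used only for illustration; they are not part of this notebook's preprocessing module, and the values are made up.

```python
import pandas as pd


def drop_train_only_columns(train, test, keep=("target",)):
    """Return a copy of `train` without the columns that are missing from `test`.

    Columns listed in `keep` are retained even if `test` does not contain them.
    """
    train_only = [
        col for col in train.columns
        if col not in test.columns and col not in keep
    ]
    return train.drop(columns=train_only)


# Toy stand-ins for the real DataFrames (values are made up)
toy_train = pd.DataFrame({
    "feature_a": [1.0, 2.0],
    "target": [40.0, 55.0],
    "target_min": [10.0, 20.0],
    "target_max": [90.0, 80.0],
})
toy_test = pd.DataFrame({"feature_a": [3.0, 4.0]})

toy_train_reduced = drop_train_only_columns(toy_train, toy_test)
print(list(toy_train_reduced.columns))
```

Keeping the label in `keep` ensures that it survives the drop even though the test data does not contain it.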

### Columns in the DataFrames

Next, we will check the column names of our DataFrames:

In [None]:
# Columns in the dataframe
X_train.columns

Not all column names are lowercase, and some contain spaces instead of underscores. To adhere to Python naming conventions, we'll convert all column names in the train and test DataFrames to lowercase and replace spaces with underscores.

More information about the column names can be found on https://developers.google.com/earth-engine/datasets/catalog/sentinel-5p.

Next, we will identify the different datatypes present in our datasets:

### Datatypes

In [None]:
# Checking for the data-types
X_train.info()

| Data-Type | Result |
|:-|:-|
| *object* | There are three features (columns) with an object data-type in our dataset: 'Place_ID X Date', 'Place_ID' and 'Date'. This means there are strings or mixed data-types in these columns. <br> - Since 'Date' is the date on which the measurement was performed, it needs to be converted to a datetime format. |
| *float64* | Floats are the data-type of the remaining features (columns). We will consider rounding at a later stage. |

Next, we will check for missing values in the dataset.

### Missing Values

In [None]:
# Checking for missing values ('NaN'/'None')
X_train.isnull().sum()

In [None]:
# Plot the share of non-missing values per column
msno.bar(X_train)

In [None]:
# plotting the matrix of missing values
msno.matrix(X_train)

There are few instances where missing values co-occur, and there appears to be no discernible pattern in the distribution of missing data. Consequently, it seems that the missing data occurs randomly.
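
One way to put a number on how often missing values co-occur is to correlate the missingness indicators of the columns. The following is only a sketch on a made-up toy frame (the column names `no2`, `so2` and `co` are placeholders, not columns of the challenge data); missingno offers a comparable view with `msno.heatmap`.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; NaN marks a missing measurement
toy = pd.DataFrame({
    "no2": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],
    "so2": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],  # missing in the same rows as no2
    "co": [np.nan, 2.0, 3.0, 4.0, 5.0, 6.0],      # missing in a different row
})

# 1 where a value is missing, 0 otherwise
missing_mask = toy.isnull().astype(int)

# Keep only columns that have at least one missing value;
# a column without missing values has no defined correlation
missing_mask = missing_mask.loc[:, missing_mask.sum() > 0]

# Pairwise correlation of the missingness indicators
missing_corr = missing_mask.corr()
print(missing_corr.round(2))
```

Values close to 1 indicate columns whose gaps fall in the same rows, while values near 0 are consistent with gaps that occur independently of each other.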

We deliberated on whether to retain or discard columns containing missing values through a thorough examination of the scientific literature:

### Discussion of feature relevance based on scientific literature review

1. ***Methane (CH4) itself*** is not directly related to PM2.5 concentrations, as it does not directly contribute to particulate matter levels. 

    -> drop columns.

2. ***Sulfur dioxide (SO2)*** is closely related to PM2.5. SO2 is a significant precursor to sulfate aerosols, which are a major component of fine particulate matter (PM2.5) in the atmosphere. Here's how the relationship works:

    1. **Chemical Transformations**: When SO2 is emitted into the atmosphere, it can undergo various chemical reactions, particularly with hydroxyl radicals. This transformation leads to the formation of sulfuric acid (H2SO4), which can further react or nucleate to form sulfate particles.

    2. **Secondary Particulate Formation**: These sulfate particles contribute to secondary particulate matter, significantly influencing the concentration of PM2.5. This process is especially relevant in urban areas and regions with high fossil fuel combustion rates, where SO2 emissions are substantial.

    3. **Environmental and Health Impact**: The sulfate component of PM2.5 is known for its ability to degrade air quality and cause health problems, including respiratory and cardiovascular issues.

    4. **Regulation and Control**: Due to its role in forming PM2.5, controlling SO2 emissions is a critical strategy for reducing overall particulate matter levels and improving air quality.

    Understanding the link between SO2 and PM2.5 is crucial for environmental policy and implementing effective air quality management strategies (MDPI).

    -> keep columns

3. ***Aerosols*** are closely related to PM2.5. In the context of air pollution, "aerosol" refers to the suspension of fine solid particles or liquid droplets in air or another gas. These can include a wide range of particles, such as dust, pollen, soot, smoke, and liquid droplets, which are small enough to be suspended in the atmosphere.

    PM2.5 specifically refers to particulate matter that has a diameter of less than 2.5 micrometers. These fine particles are a subset of aerosols and are significant from both environmental and health perspectives because they can penetrate deep into the lungs and even enter the bloodstream, causing various health issues.

    Aerosols can be generated from both natural sources, such as volcanic eruptions and forest fires, and human-made sources, such as vehicle emissions and industrial processes. The study and management of aerosols, particularly PM2.5, are crucial for air quality monitoring and public health protection.

    -> keep columns

4. ***Cloud characteristics*** such as cloud fraction, height, pressure at the base and top, optical depth, and surface albedo can influence PM2.5 levels, although often indirectly. Here’s how these factors might play a role:

    1. **Cloud Fraction and Optical Depth**: These determine the amount of sunlight that reaches the Earth's surface, which can affect photochemical reactions in the atmosphere. These reactions are crucial for the formation of secondary pollutants that contribute to PM2.5. For instance, lower sunlight due to high cloud cover can reduce the rate of photochemical smog formation, potentially lowering PM2.5 levels.

    2. **Cloud Height and Pressure**: The height and pressure of clouds can influence atmospheric circulation and weather patterns, which in turn affect air pollution dispersion. Higher clouds and different pressure levels can lead to changes in wind patterns, possibly dispersing or concentrating pollutants like PM2.5.

    3. **Surface Albedo**: This refers to the Earth's surface ability to reflect sunlight. Surfaces with high albedo (like those covered with snow or light-colored materials) reflect more solar radiation, which can reduce ground-level temperatures and affect thermal circulations. These changes can impact how pollutants are mixed and dispersed in the lower atmosphere.

    4. **Clouds and Precipitation**: Clouds are also related to precipitation processes. Precipitation can remove pollutants from the atmosphere, a process known as wet deposition. When particulate matter like PM2.5 is washed out of the atmosphere by rain or snow, it leads to temporary improvements in air quality.

    In summary, while clouds don't directly emit or absorb PM2.5, their presence and characteristics can significantly alter meteorological conditions, thereby influencing the levels and distribution of particulate matter indirectly through changes in sunlight exposure, weather patterns, and precipitation.

    -> keep columns

5. ***Formaldehyde (HCHO)*** is related to PM2.5 primarily through secondary formation processes in the atmosphere. Formaldehyde is a volatile organic compound (VOC) that can act as a precursor to secondary organic aerosols (SOAs), which are a component of PM2.5. Here’s how this relationship works:

    1. **Chemical Reactions**: Formaldehyde in the atmosphere can undergo photochemical reactions driven by sunlight. These reactions can produce radicals and other compounds that further react to form secondary organic aerosols.

    2. **Contribution to PM2.5**: The SOAs formed from formaldehyde and other VOCs contribute to the overall mass of PM2.5 in the atmosphere. These aerosols can be significant, especially in urban and industrial areas with high levels of VOC emissions.

    3. **Air Quality Impact**: Because formaldehyde is involved in the formation of particulate matter, controlling VOC emissions (including formaldehyde) is important for managing PM2.5 levels and improving air quality.

    The link between HCHO and PM2.5 underscores the complexity of air pollution chemistry and the importance of monitoring a wide range of pollutants to effectively manage and mitigate air quality issues.

    -> keep columns

6. ***Carbon monoxide (CO)*** is not a direct component of PM2.5, but it is often correlated with PM2.5 concentrations in the atmosphere, primarily due to shared sources. Here’s how CO is related to PM2.5:

    1. **Common Sources**: Both CO and PM2.5 are often emitted from the same sources, such as motor vehicles, biomass burning, and industrial processes. This commonality means that areas with high CO levels often also have high PM2.5 levels due to simultaneous emissions of various pollutants.

    2. **Indicator of Combustion Efficiency**: CO is a product of incomplete combustion. High levels of CO can indicate poor combustion efficiency, which is also likely to produce higher amounts of particulate matter, including PM2.5. For example, inefficient fuel burning in engines or heaters can release both CO and a variety of particulate pollutants.

    3. **Atmospheric Chemistry**: While CO does not directly form PM2.5, it participates in atmospheric chemical reactions that can influence the levels of other pollutants that do form PM2.5. For instance, CO can react in the atmosphere to form secondary pollutants that contribute to the overall particulate load.

    4. **Air Quality Management**: Because of the correlation between the sources of CO and PM2.5, measures to reduce CO emissions (such as improving fuel combustion efficiency or upgrading vehicle emission standards) can also lead to reductions in PM2.5 concentrations.

    Thus, while CO is not a direct precursor or component of PM2.5, its presence is associated with conditions that can lead to increased levels of particulate matter. Monitoring CO levels can help in assessing combustion-related pollution and formulating strategies to control overall air pollution, including PM2.5.

    -> keep columns

7. ***Ozone (O3)*** is not directly a component of PM2.5, but it is related to particulate matter formation through atmospheric chemistry interactions. Here’s how ozone is connected to PM2.5:

    1. **Secondary Organic Aerosols (SOAs)**: Ozone plays a crucial role in the formation of secondary organic aerosols. Ozone can react with volatile organic compounds (VOCs) emitted from sources such as vehicles, industrial processes, and vegetation. These reactions often occur in the presence of sunlight and lead to the formation of new particulate matter, contributing to PM2.5 levels.

    2. **Photochemical Smog**: Ozone is a major component of photochemical smog, which includes a mixture of air pollutants including particulate matter. The same conditions that favor the formation of ozone (such as sunny, warm days) also promote the formation of PM2.5 from precursor gases like VOCs and nitrogen oxides (NOx).

    3. **Oxidizing Capacity**: Ozone has strong oxidizing properties, which can transform primary emissions into more reactive species that can then nucleate or condense to form particulate matter. This process can transform less harmful emissions into particles small enough to be classified as PM2.5.

    4. **Indicator of Air Quality**: Elevated levels of ozone and PM2.5 often occur together under certain meteorological conditions, particularly in urban environments. Thus, high ozone days can often also be high PM2.5 days, especially during periods of stagnant air that prevent the dispersion of pollutants.

    In summary, while ozone itself does not make up PM2.5, it influences the atmospheric processes that lead to the formation of PM2.5. Managing ozone levels, especially in urban areas, is therefore important not only for controlling ozone itself but also for managing levels of fine particulate matter.

    -> keep columns

8. ***Nitrogen dioxide (NO2)*** is related to PM2.5, primarily through its role in the formation of secondary particulate matter. Here’s how NO2 contributes to PM2.5 levels:

    1. **Formation of Nitrate Aerosols**: NO2 reacts with atmospheric compounds to form nitrate aerosols, which are a significant component of PM2.5. This reaction typically occurs with ammonia (NH3) to produce particulate nitrates, which increase the concentration of fine particulate matter in the atmosphere.

    2. **Photochemical Reactions**: NO2 plays a crucial role in photochemical reactions that produce ozone and other photochemical oxidants. These oxidants can further react with volatile organic compounds (VOCs) and other precursors to form secondary organic aerosols (SOAs), contributing to PM2.5 levels.

    3. **Urban Air Pollution**: In urban environments, where NO2 emissions are high due to traffic and industrial activities, the contribution of NO2 to PM2.5 formation is particularly significant. The density of emission sources accelerates the formation of both nitrate aerosols and SOAs.

    4. **Indicator of Combustion Sources**: Similar to carbon monoxide (CO), NO2 is often emitted from combustion sources such as vehicles and power plants. High levels of NO2 can indicate significant combustion-related pollution, which often corresponds with increased PM2.5 from various combustion byproducts.

    Managing NO2 emissions is important for controlling PM2.5 levels, especially in urban areas where exposure to NO2 and fine particulates can pose significant health risks. Reducing NO2 emissions from vehicles and industrial sources is crucial for improving air quality and reducing particulate matter pollution.

    -> keep columns

9. ***Wind*** is significantly related to PM2.5 levels, primarily through its role in the dispersion and transport of these fine particles. Here’s how wind impacts PM2.5:

    1. **Dispersion of Particles**: Wind helps to disperse air pollutants, including PM2.5. Higher wind speeds can dilute concentrations of particulate matter by spreading them over a wider area, potentially reducing the pollutant levels in a specific locale while distributing these particles over a larger region.

    2. **Transport of Pollutants**: Wind can carry PM2.5 from their sources to distant locations. This means that areas downwind of major pollution sources (like industrial areas, urban centers, or areas with heavy traffic) can experience elevated PM2.5 levels due to wind transport.

    3. **Influence on Air Quality**: Wind direction and speed are critical factors in determining daily air quality. For example, winds blowing from industrial zones towards residential areas can lead to higher exposure to PM2.5 for populations in those residential areas.

    4. **Interaction with Weather Patterns**: Wind is also influenced by broader weather patterns and geographical features. For instance, mountainous regions can experience "channeling" effects that may concentrate or disperse air pollutants differently, affecting PM2.5 concentrations.

    5. **Impact on Emission Sources**: In situations where wind influences the operation of pollution sources (such as in open mining, construction sites, or agriculture), it can increase the generation of particulate matter by stirring up dust and other particles.

    In air quality modeling and environmental monitoring, understanding the patterns of wind in a region is crucial for predicting PM2.5 levels and managing pollution control strategies effectively. Wind data is often used alongside other meteorological factors to develop accurate forecasts and public health advisories regarding air quality.

    -> keep columns

10. ***Temperature*** is related to PM2.5 levels through several mechanisms:

    1. **Chemical Reaction Rates**: Higher temperatures can increase the rate of chemical reactions in the atmosphere that lead to the formation of secondary particulate matter. For example, higher temperatures accelerate the photochemical reactions involving volatile organic compounds (VOCs) and nitrogen oxides (NOx), which form secondary organic aerosols and nitrate aerosols, respectively, contributing to PM2.5 levels.

    2. **Atmospheric Stability**: Temperature influences the stability of the atmosphere. Warmer temperatures tend to increase atmospheric instability, which can enhance the vertical mixing of pollutants and disperse air pollutants, including PM2.5. Conversely, lower temperatures, especially during night or winter, can lead to temperature inversions that trap pollutants close to the ground, increasing PM2.5 concentrations.

    3. **Emission Sources**: Temperature can affect emissions from various sources. For instance, residential heating demand increases during colder temperatures, potentially increasing emissions of PM2.5 from combustion processes.

    4. **Volatility of Organic Compounds**: Temperature changes affect the volatility of organic compounds. Higher temperatures can increase the volatility, enhancing the formation of secondary organic aerosols as more vapors are available to participate in atmospheric reactions.

    These relationships highlight the complex interplay between meteorological conditions and air quality, making temperature a critical factor in modeling and managing PM2.5 pollution.

    -> keep columns

11. ***Humidity*** is related to PM2.5 concentrations in the atmosphere through various mechanisms:

    1. **Particle Growth**: High humidity levels can lead to the growth of particulate matter. Water vapor condenses on particles, causing them to grow in size and mass. This can increase the PM2.5 concentration as smaller particles grow to become part of the PM2.5 classification.

    2. **Chemical Reactions**: Humidity can also influence the chemical reactions that form secondary particles, such as sulfates and nitrates. For example, sulfur dioxide (SO2) and nitrogen oxides (NOx) can react with water vapor to form these secondary particulate matters, which are components of PM2.5.

    3. **Hygroscopic Properties**: Many particles in the PM2.5 category are hygroscopic, meaning they can absorb moisture from the air. This absorption can change the chemical composition and physical properties of the particles, potentially making them more hazardous or altering their ability to be filtered by the respiratory system.

    4. **Agglomeration**: Higher humidity can enhance the agglomeration of particles, where smaller particles stick together to form larger particles. While this might reduce the number of particles classified as PM2.5 by mass, it can also make them more likely to deposit in the respiratory tract.

    Understanding the role of humidity in PM2.5 dynamics is crucial for accurate air quality forecasting and health risk assessment, especially in regions with high humidity levels.

    -> keep columns

12. ***The angles*** in the features refer to geographical and/or directional measurements related to the data, such as the angles of wind direction, satellite data acquisition angles, or other related metrics. Angles can be important in air quality modeling if they relate to how pollutants disperse or how satellite observations are made relative to the Earth's surface.

    For example, wind direction, which could be represented by angles, is crucial for understanding how air pollutants travel and disperse across different areas. Similarly, satellite observation angles might affect the accuracy and relevance of the measurements for specific locations, especially in complex urban geometries where building shadows or reflections might influence sensor readings.

    -> keep columns

To prevent any alteration of the dataset through the creation of synthetic data, we have decided to retain the NaN values in the columns we preserve.

Next, we will examine the dataset for any zero values.



### 0-values

For features with 'angles' in their names, along with 'cloud_fraction' and 'weekday', zero values are permissible. However, in all other features, the presence of zero values may indicate an error.

To focus on columns where zero values might signify an error, we first exclude columns whose headers contain 'angle' or 'cloud_fraction':


In [None]:
# Create a new DataFrame without columns containing 'angle' or 'cloud_fraction'
# (.copy() makes it an independent DataFrame, so columns can be added to it later)
df_filtered_noangle_nocloud = X_train[[col for col in X_train.columns if 'angle' not in col and 'cloud_fraction' not in col]].copy()

print(df_filtered_noangle_nocloud)


In [None]:
# Check for zeros in all columns and count them
zero_counts = (df_filtered_noangle_nocloud == 0).sum()

# Filter and print entries where the sum is not zero
non_zero_counts = zero_counts[zero_counts != 0]
print("Number of zeros in each column:")
print(non_zero_counts)

The 'target_variance' column will be dropped during data preprocessing later on. Zero values in all other columns are likely indicative of errors.  

The occurrence of 0-values across all data for a specific air pollutant suggests that they originate from the same sensor, which was likely malfunctioning. To preserve the integrity of the dataset, we aim to avoid generating synthetic data. Nonetheless, to assess the impact of removing these potentially erroneous entries, we will now investigate how many rows contain 0-values that are likely errors:

In [None]:
# Calculate the number of zeros in each row 
df_filtered_noangle_nocloud['zero_count'] = (df_filtered_noangle_nocloud == 0).sum(axis=1) 

# Count the rows with 5 or more zeros
number_of_rows_with_error_zero_values = (df_filtered_noangle_nocloud['zero_count'] >= 5).sum()

print(f"There are {number_of_rows_with_error_zero_values} rows with 0-values that likely indicate errors.")

1840 rows constitute less than 10% of our observations. Consequently, we plan to exclude them later.

Next, we'll verify our analysis by examining plots of some of the data:

In [None]:
# Identify columns with missing values
columns_with_missing = df_filtered_noangle_nocloud.columns[df_filtered_noangle_nocloud.isnull().any()]

# Filter the DataFrame to include only columns with missing values
data_missing = df_filtered_noangle_nocloud[columns_with_missing]

num_columns = len(data_missing.columns)  # Number of columns to plot
cols_per_row = 3  # Number of plots per row (adjust as needed)

# Calculate the number of rows needed
num_rows = (num_columns + cols_per_row - 1) // cols_per_row  # Rounds up if not a perfect multiple

# Create a figure with the calculated number of subplots;
# squeeze=False keeps ax two-dimensional even if there is only one row
fig, ax = plt.subplots(num_rows, cols_per_row, figsize=(16, 4 * num_rows), squeeze=False)

count = 0
for item in data_missing.columns:
    # One horizontal boxplot per column, titled with the column name
    sns.boxplot(x=data_missing[item], ax=ax[count // cols_per_row][count % cols_per_row], color='#33658A').set(title=item, xlabel='')
    count += 1

# Hide empty subplots if they exist
if count < num_rows * cols_per_row:
    for i in range(count, num_rows * cols_per_row):
        ax.flat[i].set_visible(False)

fig.tight_layout(pad=3)
plt.show()

Next, we will look at the descriptive statistics of our data:

### Descriptive Statistics

In [None]:
# Examining the descriptive statistics of the dataset
X_train.describe()

Now, we take a look at the correlation matrix of our data:

### Correlation Matrix

In [None]:
# Remove columns with non-numerical values and append the target of the training split
df_train_num = X_train.drop(['Date', 'Place_ID', 'Place_ID X Date'], axis=1).join(y_train)

# correlation plot
ax = sns.heatmap(df_train_num.corr(), annot=False)

The density of the data points makes it difficult to derive useful information. As a result, we will refine the correlation matrix to highlight only high correlation values. Given that the satellite angles are inherently correlated, we will remove them from the matrix before replotting the heatmap:

In [None]:
# Create a new DataFrame without columns containing the word 'angle'
df_filtered_noangle = df_train_num[[col for col in df_train_num.columns if 'angle' not in col]]

print(df_filtered_noangle)

In [None]:
ax = sns.heatmap(df_filtered_noangle.corr(), annot=False)

Now, we will filter for high correlation values. To achieve this, we first need to create a DataFrame containing the correlation matrix.

In [None]:
# Create the correlation matrix
corr_matrix = df_filtered_noangle.corr()

# Find correlations of other columns with 'target' where the absolute value of the correlation is greater than 0.2
high_corr_with_target = corr_matrix['target'].abs().loc[lambda s: s > 0.2]

print(high_corr_with_target)

The columns 'target_min' and 'target_max' will be excluded from the features when training the models later, and 'target' itself is the label rather than a feature. Despite this, some significant correlations remain evident.

## Preprocess the Data

In [7]:
from basic_functions_preprocessing import *

In [8]:
X_train, y_train = preprocessing_df(X_train, y_train)

New dataframe shape: (17766, 73) (17766,)


In [9]:
X_test, y_test = preprocessing_df(X_test, y_test)

New dataframe shape: (7595, 73) (7595,)


### Drop all unique columns in the train DataFrame other than 'target'

In [None]:
# Drop all unique columns other than 'target'
df_train.drop(['target_min', 'target_max', 'target_count', 'target_variance'], axis=1, inplace=True)

df_train.shape

The train DataFrame now includes only one additional column compared to the test DataFrame, namely the 'target' column.

### Convert all column names in the train and test DataFrames to lowercase and replace spaces with underscores:

In [None]:
# Transform all column headers of the DataFrames into lower case and replace all spaces by underscores
df_train.columns = df_train.columns.str.lower().str.replace(' ', '_')
df_test.columns = df_test.columns.str.lower().str.replace(' ', '_')

df_train.head()

### Convert the 'date' column to datetime

In [None]:
# Change "date" dtype to datetime with format %Y-%m-%d
df_train['date'] = pd.to_datetime(df_train['date'], format='%Y-%m-%d')

df_train['month'] = df_train['date'].dt.month
df_train['day'] = df_train['date'].dt.day
df_train['day_of_week'] = df_train['date'].dt.dayofweek

df_train.head()
df_train.info()

### Missing Values
Drop all columns with the word 'CH4' in the header.
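
As a minimal sketch of this step, assuming the CH4 features can be recognised by the substring 'CH4' in their header, the drop could look as follows. The frame `toy` is a made-up stand-in, and the actual implementation in `preprocessing_df` may differ.

```python
import pandas as pd

# Made-up stand-in for the train DataFrame
toy = pd.DataFrame({
    "L3_CH4_aerosol_height": [0.0, 1.0],
    "L3_CH4_aerosol_optical_depth": [0.0, 0.5],
    "L3_NO2_sensor_altitude": [830000.0, 829500.0],
})

# Case-insensitive match, so it also works after the headers were lowercased
ch4_columns = [col for col in toy.columns if "ch4" in col.lower()]
toy_without_ch4 = toy.drop(columns=ch4_columns)

print(list(toy_without_ch4.columns))
```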

### 0-Values

Drop rows in which the following features contain 0-values:

L3_NO2_NO2_column_number_density                       
L3_NO2_NO2_slant_column_number_density                 
L3_NO2_absorbing_aerosol_index                        
L3_NO2_sensor_altitude                                 
L3_NO2_stratospheric_NO2_column_number_density         
L3_NO2_tropopause_pressure                            
L3_NO2_tropospheric_NO2_column_number_density          
L3_O3_O3_column_number_density                         
L3_O3_O3_effective_temperature                          
L3_CO_CO_column_number_density                          
L3_CO_H2O_column_number_density                         
L3_CO_cloud_height                                      
L3_CO_sensor_altitude                                   
L3_HCHO_HCHO_slant_column_number_density                
L3_HCHO_tropospheric_HCHO_column_number_density         
L3_HCHO_tropospheric_HCHO_column_number_density_amf     
L3_SO2_SO2_column_number_density                        
L3_SO2_SO2_column_number_density_amf                    
L3_SO2_SO2_slant_column_number_density                  
L3_SO2_absorbing_aerosol_index                          
L3_CH4_CH4_column_volume_mixing_ratio_dry_air          
L3_CH4_aerosol_height                                  
L3_CH4_aerosol_optical_depth                           
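
The row filter for these features can be sketched as below, assuming that a single 0-value in any of the checked columns is enough to discard a row. The frame `toy` and the shortened list `zero_check_columns` are illustrative only; in practice the list would hold the feature names given above.

```python
import pandas as pd

# Made-up stand-in for the train DataFrame
toy = pd.DataFrame({
    "L3_NO2_sensor_altitude": [830000.0, 0.0, 829500.0, 830200.0],
    "L3_O3_O3_column_number_density": [0.12, 0.11, 0.0, 0.13],
    "cloud_fraction": [0.0, 0.4, 0.2, 0.0],  # zeros are valid here, so it is not checked
})

# Shortened, illustrative list of columns in which a 0 is treated as an error
zero_check_columns = ["L3_NO2_sensor_altitude", "L3_O3_O3_column_number_density"]

# True for rows where at least one checked column is exactly 0
has_error_zero = (toy[zero_check_columns] == 0).any(axis=1)

# Keep only rows without such zeros
toy_clean = toy.loc[~has_error_zero]
print(toy_clean.index.tolist())
```

Because a comparison of NaN with 0 evaluates to False, rows with missing values in the checked columns are kept by this filter, in line with the decision to retain NaN values.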

### Normalize/Standardize the Data