<a href="https://colab.research.google.com/github/Irenekayla/ML_Notebooks/blob/main/AirQo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

 **AIR QUALITY PREDICTION**


**Overview**:
Air pollution is a pressing global issue, particularly in developing regions such as sub-Saharan Africa. The challenge addresses the need for accurate estimation of PM2.5 levels in Kenya.
These estimates are crucial for designing interventions that mitigate the adverse effects of air pollution on public health and the environment.


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!pip install geodatasets
!pip install ydata-profiling

**Importing necessary libraries**

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

**Loading and preprocessing the data**


In [None]:
train_data = pd.read_csv('/content/drive/MyDrive/AirQo_Kenya/Train.csv')
test_data = pd.read_csv('/content/drive/MyDrive/AirQo_Kenya/Test.csv')

In [None]:
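# Coerce every column to numeric; values that cannot be converted become NaN.
train_data = train_data.apply(pd.to_numeric, errors='coerce')
test_data = test_data.apply(pd.to_numeric, errors='coerce')

# Drop columns that are entirely NaN (for example, text columns that could not be converted).
train_data = train_data.dropna(how='all', axis=1)
test_data = test_data.dropna(how='all', axis=1)

# Drop rows with missing values last, so that NaNs introduced by the coercion are removed too.
train_data = train_data.dropna()
test_data = test_data.dropna()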




**Separating features and target variable**

In [None]:
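# Features are every column except the target 'pm2_5'.
X_train, y_train = train_data.drop('pm2_5', axis=1), train_data['pm2_5']
# Note: this line assumes Test.csv also contains a 'pm2_5' column; it raises a KeyError otherwise.
X_test, y_test = test_data.drop('pm2_5', axis=1), test_data['pm2_5']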
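
If Test.csv does not include a `pm2_5` column (common for competition test sets, though the notebook does not confirm it either way), a validation set can be held out from the labelled training data with `train_test_split`, which is already imported above. The following is a minimal sketch on a synthetic frame standing in for `train_data`; the names `demo_train`, `feature_a`, and `feature_b` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for train_data; 'feature_a' and 'feature_b' are hypothetical columns.
rng = np.random.default_rng(0)
demo_train = pd.DataFrame({
    'feature_a': rng.normal(size=100),
    'feature_b': rng.normal(size=100),
    'pm2_5': rng.uniform(5, 80, size=100),
})

X = demo_train.drop('pm2_5', axis=1)
y = demo_train['pm2_5']

# Hold out 20% of the labelled rows for validation.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=1)
print(X_tr.shape, X_val.shape)
```

The model would then be fitted on `X_tr, y_tr` and scored on `X_val, y_val`, leaving the unlabelled test file for final predictions only.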


**Model training**


In [None]:

rf_model = RandomForestRegressor(n_estimators=100, max_depth=5, random_state=1, verbose=1)

rf_model.fit(X_train, y_train)

[Parallel(n_jobs=1)]: Done  49 tasks      | elapsed:    0.0s
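
Once a random forest is fitted, its `feature_importances_` attribute gives a rough view of which inputs the trees relied on. The sketch below illustrates this on synthetic data, not on the AirQo features; `signal` and `noise` are hypothetical column names chosen so that only one of them drives the target.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the target depends on 'signal' only; 'noise' is irrelevant.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    'signal': rng.normal(size=200),
    'noise': rng.normal(size=200),
})
y_demo = 3.0 * X_demo['signal'] + rng.normal(scale=0.1, size=200)

demo_model = RandomForestRegressor(n_estimators=100, max_depth=5, random_state=1)
demo_model.fit(X_demo, y_demo)

# Impurity-based importances, one per feature, summing to 1.
importances = pd.Series(demo_model.feature_importances_, index=X_demo.columns)
print(importances.sort_values(ascending=False))
```

Applied to the notebook's model, the same pattern would be `pd.Series(rf_model.feature_importances_, index=X_train.columns)`.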


In [None]:
print(train_data.columns)
print(test_data.columns)



In [None]:
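# scikit-learn is already installed and imported above, so no further install is needed.
# StandardScaler is not used below; tree-based models such as random forests do not require feature scaling.
from sklearn.preprocessing import StandardScaler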



**Model Evaluation**

In [None]:
y_pred = rf_model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Mean Squared Error: {mse:.2f}')
print(f'R-squared Score: {r2:.2f}')
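
Mean squared error is expressed in squared units of the target, which makes it hard to read directly. As a small illustration, the root mean squared error (RMSE) and mean absolute error (MAE) can be reported alongside it, since both are in the same units as `pm2_5`. The arrays below are made-up values for demonstration, not results from this model.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up true and predicted values, for illustration only.
y_true_demo = np.array([10.0, 20.0, 30.0, 40.0])
y_pred_demo = np.array([12.0, 18.0, 33.0, 36.0])

mse_demo = mean_squared_error(y_true_demo, y_pred_demo)
rmse_demo = np.sqrt(mse_demo)  # same units as the target
mae_demo = mean_absolute_error(y_true_demo, y_pred_demo)

print(f'MSE: {mse_demo:.2f}, RMSE: {rmse_demo:.2f}, MAE: {mae_demo:.2f}')
```
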

**Predict PM2.5 levels for new data**


In [None]:
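# new_data must contain the same feature columns, in the same order, as X_train.
new_data = pd.read_csv('new_data.csv')

# 'pm2.5_predictions' is not a valid Python identifier, so use an underscore instead.
pm2_5_predictions = rf_model.predict(new_data)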

**Tips for Improvement:**

**Feature Engineering**: Extract additional features from the meteorological data and temporal trends to capture more variability in PM2.5 levels.
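
As one possible sketch of temporal feature engineering, calendar features can be derived from a timestamp column with pandas. This assumes the data has a date column, here hypothetically named `date`; the notebook does not show the actual column names. Any such parsing would need to happen before the numeric coercion step, which turns date strings into NaN.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame; 'date' is a hypothetical column name.
demo = pd.DataFrame({
    'date': pd.to_datetime(['2023-01-01', '2023-01-02', '2023-06-15']),
    'pm2_5': [12.0, 15.0, 30.0],
})

# Calendar features extracted from the timestamp.
demo['month'] = demo['date'].dt.month
demo['day_of_week'] = demo['date'].dt.dayofweek  # Monday=0, Sunday=6

# Cyclical encoding so that December and January end up close together.
demo['month_sin'] = np.sin(2 * np.pi * demo['month'] / 12)
demo['month_cos'] = np.cos(2 * np.pi * demo['month'] / 12)

print(demo)
```
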

**Advanced Modeling Techniques**: Experiment with ensemble methods, deep learning architectures, or hybrid models.

**Fine-tuning Hyperparameters**: Conduct thorough hyperparameter tuning using techniques like grid search or random search to optimize model performance.
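
A grid search over random forest settings might look like the sketch below. It runs on synthetic data, and the grid values are arbitrary choices for illustration rather than recommended settings for the AirQo data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for X_train, y_train.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 3))
y_demo = 2.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.1, size=120)

# Small illustrative grid; real searches usually cover more values.
param_grid = {'n_estimators': [20, 50], 'max_depth': [3, 5]}

search = GridSearchCV(
    RandomForestRegressor(random_state=1),
    param_grid,
    cv=3,                              # 3-fold cross-validation
    scoring='neg_mean_squared_error',  # higher (closer to 0) is better
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f'Best CV MSE: {-search.best_score_:.3f}')
```

For larger search spaces, `RandomizedSearchCV` from the same module samples a fixed number of combinations instead of trying them all.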

**Data Augmentation**: Augment the training data by incorporating synthetic samples or generating new features to improve the model's robustness.

**Ensemble Learning**: Combine predictions from multiple models using techniques like stacking.
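
Stacking can be sketched with scikit-learn's `StackingRegressor`, which fits several base models and trains a final estimator on their cross-validated predictions. The example below uses synthetic data and combines a random forest with a linear regression; the choice of base models and of `Ridge` as the final estimator is an assumption made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic regression data standing in for X_train, y_train.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 3))
y_demo = 2.0 * X_demo[:, 0] + np.sin(X_demo[:, 1]) + rng.normal(scale=0.1, size=150)

stack = StackingRegressor(
    estimators=[
        ('rf', RandomForestRegressor(n_estimators=50, max_depth=5, random_state=1)),
        ('lr', LinearRegression()),
    ],
    final_estimator=Ridge(),  # combines the base models' predictions
    cv=3,                     # folds used to build the meta-features
)
stack.fit(X_demo, y_demo)

stack_pred = stack.predict(X_demo[:5])
print(stack_pred)
```
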
