This project builds a model that predicts air quality in Madrid from roughly 170k rows of hourly measurements recorded between 2001 and 2022.
The dataset was downloaded from Kaggle and can be found here.
├── data
│   ├── MadridPolution2001-2022.csv
│   └── MadridPolution2001-2022_cleaned.csv
│
├── notebooks
│   ├── 1 - EDA - Air pollution.ipynb
│   ├── 2 - Data preprocessing - Air pollution.ipynb
│   └── 3 - Modelling implementation & assessment Air pollution.ipynb
│
└── README.md
- Exploratory Data Analysis
  - found that the data is not clean and needs to be cleaned
  - remove the outliers
  - find the correlation between the features
  - find the distribution of the features
- Data Preprocessing
  - fill the missing values
  - group the data by year and month
  - get the mean of each year
  - split the data into train and test
- Modelling implementation & assessment
  - build 3 models
    - Logistic Regression
    - Random Forest
    - XGBoost
  - evaluate the models
    - accuracy
    - precision
    - recall
  - choose the best model
    - XGBoost
      - accuracy: 0.77
      - precision: 0.77
      - recall: 0.77
      - f1-score: 0.77
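
The exploration and preprocessing steps listed above could look roughly like the sketch below. It is not the notebooks' actual code: it runs on synthetic hourly data, the column names `NO2` and `O3` are hypothetical, and the specific choices (an IQR rule for outliers, time-based interpolation for missing values, one mean per year and month group, a chronological 80/20 split) are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for one year of hourly readings.
# The column names are hypothetical, not taken from the real CSV.
rng = np.random.default_rng(0)
n = 24 * 365
index = pd.Timestamp("2001-01-01") + pd.to_timedelta(np.arange(n), unit="h")
df = pd.DataFrame(
    {"NO2": rng.normal(40.0, 10.0, n), "O3": rng.normal(60.0, 15.0, n)},
    index=index,
)
df.iloc[::50, 0] = np.nan   # inject some missing NO2 readings
df.iloc[10, 1] = 1_000.0    # inject one obvious O3 outlier


def remove_iqr_outliers(frame, k=1.5):
    """Drop rows where any column lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    within = (frame >= q1 - k * iqr) & (frame <= q3 + k * iqr)
    # Comparisons with NaN are False, so keep missing cells for later filling.
    return frame[(within | frame.isna()).all(axis=1)]


# EDA: outliers, correlation, distribution summary.
clean = remove_iqr_outliers(df)
correlations = clean.corr()   # pairwise Pearson correlation
summary = clean.describe()    # count, mean, std, quartiles per column

# Preprocessing: fill gaps, then aggregate per (year, month).
filled = clean.interpolate(method="time").bfill()
monthly = filled.groupby([filled.index.year, filled.index.month]).mean()
monthly.index.names = ["year", "month"]

# Chronological split: earliest 80% of months for training.
split = int(len(monthly) * 0.8)
train, test = monthly.iloc[:split], monthly.iloc[split:]
print(train.shape, test.shape)
```

A chronological split is used here so that the test months come after the training months; a random split would be a different, equally plausible reading of the step.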
The best model is XGBoost, with an accuracy of 0.77.
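
As an illustration of how three classifiers can be compared on accuracy, precision, recall and F1, here is a minimal sketch. It uses a synthetic three-class dataset from scikit-learn rather than the Madrid data, the hyperparameters are placeholders, and weighted averaging of the per-class metrics is an assumption; none of this is taken from the project's notebook. XGBoost is included only if the `xgboost` package is installed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the air-quality classes (assumption: 3 classes).
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=5, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
try:
    # Optional dependency; skipped when xgboost is not installed.
    from xgboost import XGBClassifier

    models["XGBoost"] = XGBClassifier(n_estimators=100, random_state=0)
except ImportError:
    pass


def evaluate(model):
    """Fit on the training split and score on the held-out split."""
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, pred),
        "precision": precision_score(y_test, pred, average="weighted", zero_division=0),
        "recall": recall_score(y_test, pred, average="weighted", zero_division=0),
        "f1": f1_score(y_test, pred, average="weighted", zero_division=0),
    }


results = {name: evaluate(model) for name, model in models.items()}
best = max(results, key=lambda name: results[name]["f1"])

for name, scores in results.items():
    print(name, {metric: round(value, 2) for metric, value in scores.items()})
print("best by F1:", best)
```

Note that with weighted averaging, recall is mathematically equal to accuracy, so those two numbers will always coincide under this choice; macro averaging would report them separately.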