# Machine Learning Pipeline: Hands-on Exercise

This [dataset](http://201.116.60.46/Datos_de_calidad_del_agua_de_5000_sitios_de_monitoreo.zip) contains parameters used to evaluate water quality at monitoring sites across México. Each sample is classified with a *traffic light* (`SEMAFORO`, in Spanish) code:


| SEMAFORO | DESCRIPTION                                                                                                          |
|----------|----------------------------------------------------------------------------------------------------------------------|
| Verde    | Excellent quality.                                                                                                   |
| Amarillo | Water contaminated with Total Suspended Solids (SST in Spanish).                                                     |
| Rojo     | Water contaminated with Biochemical Oxygen Demand (DBO in Spanish) or Chemical Oxygen Demand (DQO in Spanish).       |
| Morado   | Water sample is out of range in Total Dissolved Solids (SDT in Spanish).                                             |
| Azul     | Water sample is within the range of Total Dissolved Solids (SDT in Spanish).                                         |

This [link](https://es.wikipedia.org/wiki/Anexo:Definiciones_usuales_en_calidad_del_agua) explains (in Spanish) the parameters used to evaluate water quality.

## Main objective

Predict the water quality class (`SEMAFORO`) from the given features.

---

1. Download the [dataset](http://201.116.60.46/Datos_de_calidad_del_agua_de_5000_sitios_de_monitoreo.zip)
2. Extract the files, load the `CSV` file `Resultados_de_calidad_del_agua_de_5000_sitios_de_monitoreo.csv` into pandas, and check that the data is loaded correctly; if not, find where the problem is (*hint: check that all the columns contain the same type of data. Additional tools like LibreOffice may help you*).
3. Mark all the empty values with `ND` (No Data/No Disponible).

4. Replace the `ND` with `NaN`.

5. Inspect the data (head, describe, dtypes, info, ...)

6. Check that the types of the features are correct; if not, change them.

7. Notice that there are geospatial features (`LONGITUD`, `LATITUD`). Plot them.

8. Now, we can work with the numerical features. We can see how they are correlated (or not) with the label `SEMAFORO`. Group the samples by the final class and calculate the mean of DBO, DQO, SST and SDT for each label.

9. Inspect the rest of the features. At this point, which features will you drop? Drop them, if any.

10. Find how many instances we have of each class.

11. Keep only the classes `Rojo`, `Amarillo` and `Verde`.

12. Make a `pairplot` of the numerical features and explain the result.

13. Plot `LONGITUD` against `LATITUD`, coloring each point by its class (`Rojo`='r', `Amarillo`='y', `Verde`='g').

14. Find how many missing values are in each feature.

15. Replace the missing values.

16. Build the correlation matrix for numerical features.

17. Convert the categorical features to categorical data types.

18. Check if the classes to predict are balanced. 

19. Choose and build a model to predict the `SEMAFORO` class.
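
As a rough illustration of the cleaning and grouping steps (3, 4, 6, 8, 10 and 11), here is a minimal sketch on a toy DataFrame. The rows and values are invented, and the short column names `DBO`, `DQO` and `SST` are assumptions; the real CSV may name its columns differently, so adapt the names after inspecting the file.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real CSV: invented values, assumed column names.
raw = pd.DataFrame(
    {
        "DBO": ["2.1", "", "45.0", "3.0", "ND", "1.0"],
        "DQO": ["10.0", "12.5", "", "9.0", "150.0", "5.0"],
        "SST": ["20.0", "180.0", "35.0", "", "40.0", "10.0"],
        "SEMAFORO": ["Verde", "Amarillo", "Rojo", "Verde", "Rojo", "Azul"],
    }
)
numeric_cols = ["DBO", "DQO", "SST"]

# Steps 3-4: mark empty cells as "ND", then turn every "ND" into NaN.
marked = raw.replace("", "ND")
clean = marked.replace("ND", np.nan)

# Step 6: the measurements were read as text, so convert them to floats.
clean[numeric_cols] = clean[numeric_cols].apply(pd.to_numeric, errors="coerce")

# Step 8: mean of each measurement per SEMAFORO class (NaN values are skipped).
class_means = clean.groupby("SEMAFORO")[numeric_cols].mean()

# Step 10: number of samples in each class.
class_counts = clean["SEMAFORO"].value_counts()

# Step 11: keep only the three classes of interest.
subset = clean[clean["SEMAFORO"].isin(["Rojo", "Amarillo", "Verde"])]

print(class_means)
print(class_counts)
```

Doing the `ND` marking and the `NaN` replacement as two separate calls mirrors the two exercise steps; on the real data both could be handled at load time instead.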

---
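
For the later steps (14 to 18), the following sketch shows one possible approach on invented data. Median imputation is an assumption chosen for illustration; the exercise does not prescribe a strategy, and the column names `DBO` and `SST` are again hypothetical stand-ins for the real ones.

```python
import numpy as np
import pandas as pd

# Invented sample with a few missing measurements.
df = pd.DataFrame(
    {
        "DBO": [2.0, np.nan, 40.0, 4.0, 60.0, 3.0],
        "SST": [20.0, 180.0, np.nan, 30.0, 50.0, 160.0],
        "SEMAFORO": ["Verde", "Amarillo", "Rojo", "Verde", "Rojo", "Amarillo"],
    }
)
numeric_cols = ["DBO", "SST"]

# Step 14: count the missing values in each feature.
missing_per_feature = df.isna().sum()

# Step 15: fill each numeric column with its own median (one option among many).
medians = df[numeric_cols].median()
filled = df.fillna(medians)

# Step 16: correlation matrix of the numerical features.
corr = filled[numeric_cols].corr()

# Step 17: store the label as a pandas categorical.
filled["SEMAFORO"] = filled["SEMAFORO"].astype("category")

# Step 18: share of samples per class; similar shares suggest balanced classes.
class_share = filled["SEMAFORO"].value_counts(normalize=True)

print(missing_per_feature)
print(corr)
print(class_share)
```

The median is less sensitive to extreme readings than the mean, which may matter for heavily skewed measurements, but whether it suits the real data is something to check during the inspection steps.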