The following credentials are to be used to log in to the application:
- ID: IGAC_user
- Password: 123456
Testing data can be found at `./Datos_prueba.zip`. This .zip file contains two testing files:
- `Datos_sin_Clasificacion.csv`
- `Datos_con_Clasificacion.csv`
The first file does not contain the ORDER column, so it exercises the classification model and updates the plots accordingly. The second file includes the ORDER column, so no classification is performed; it is used only to generate the plots on the Map, Treemap and Pivot Table.
This project was conceived out of the need of the Instituto Geográfico Agustín Codazzi (IGAC) for a tool capable of performing automatic soil taxonomy classification using data collected in the field.
This tool seeks to aid and streamline the taxonomic soil classification process carried out by the Instituto's edaphologists, whose goal is to inventory and map the soils of Colombia. For this purpose, several models were fitted and tested on a database of more than 12,000 observations recording 209 different variables, collected in the Cundiboyacense Plateau.
The classification is based on the USDA's methodology. The current application can classify the first category of that taxonomic hierarchy (i.e. Order).
In order to achieve this, the following steps were taken:
- USDA's Soil Taxonomy was studied
- EDA
- Data Cleaning
- Statistical Classification Algorithms
- Dashboard
- Maps
- Graphs
- Database management
- Backend development
This project seeks to become the foundation for deeper automatic taxonomic classification (i.e. Suborder, Great Group, Subgroup, etc.), which may be achieved by fitting further models. Additionally, data-entry standardization is highly recommended in order to improve database quality and minimize Data Cleaning processes in the future.
The Project is called CATS, which stands for Clasificador Automático Taxonómico de Suelos in Spanish, or Automatic Taxonomic Soil Classifier.
In order to perform the taxonomic classification, several models were fitted, such as:
- Random Forests
- Multinomial Regressions
- Stacked Models
It was concluded that Random Forests was the most appropriate algorithm for this classification problem. The selection was made, broadly, by balancing accuracy, speed and parsimony.
The Random Forests were built using the `scikit-learn` library. They were optimized by testing more than 70 forests with different parameters, and the best one was selected based on the Out-of-Bag Error.
The final model obtained an accuracy of 94.2% on the test dataset.
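The selection procedure described above can be sketched as follows. This is an illustrative sketch, not the project's actual training code: the candidate parameter grid and the synthetic stand-in data are assumptions, and only the mechanism (comparing forests by their Out-of-Bag score in scikit-learn) reflects the text.

```python
# Sketch: selecting a Random Forest by Out-of-Bag (OOB) error with scikit-learn.
# The data below is a synthetic placeholder for the IGAC soil database
# (9 predictor variables, 5 taxonomic orders).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=9, n_informative=6,
                           n_classes=5, random_state=0)

best_oob, best_model = -1.0, None
for n_estimators in (100, 200):           # candidate forest sizes (assumed)
    for max_features in ("sqrt", None):   # candidate split widths (assumed)
        rf = RandomForestClassifier(n_estimators=n_estimators,
                                    max_features=max_features,
                                    oob_score=True, random_state=0)
        rf.fit(X, y)
        # oob_score_ is the accuracy on out-of-bag samples,
        # so the OOB error is 1 - rf.oob_score_.
        if rf.oob_score_ > best_oob:
            best_oob, best_model = rf.oob_score_, rf

print(f"Best OOB accuracy: {best_oob:.3f}")
```

The OOB score is computed from samples each tree never saw during bootstrapping, so it acts as a built-in validation estimate without holding out a separate set during model selection.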
By using the app, it’s possible to:
- Obtain predictions of data uploaded by the user
- Visualize and Interact with user-uploaded data and the original database
The usage process is as follows:
- Log in using authorized credentials
- Visualize and interact with the data from the original database using the Map, Treemap and Pivot Table
- Filter the data using the filters located above the map
- Obtain detailed information by hovering over each entry point plotted in the map
- Upload valid files (`.csv`, `.xls` or `.xlsx`); the maps and graphs will be updated automatically
- Interact with the map and graphs
- Obtain and interpret the classifications performed by the model
If the user wishes to bypass the Dashboard, they can do so by interacting directly with the project's API.
Currently there are 3 API Endpoints:
- `/api/status/`: Pings the API; the API responds with a boolean indicating whether it is online and able to respond.
- `/api/predict`: Requests a single prediction based on a set of 9 variables (see the sample JSON structure below), passed in a JSON-like structure. The response, again a JSON structure, contains the most likely prediction and the probability of belonging to each of the 5 possible taxonomic orders.
- `/api/predict_many`: Performs several predictions in a single call. A JSON payload containing the multiple observations to be predicted, each based on the same set of 9 variables (see the sample JSON structure below), must be passed. Once again, the response is a JSON structure containing the predictions as well as the probability of each. This is a highly efficient way of communicating with the application, reducing response time by several orders of magnitude; for example, it can perform up to 4,000 classifications in less than a second.
This is the only way in which the probability of each classification can be obtained.
NOTE: Numerical values can't contain null values.
Further details of the API documentation can be found here.
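A call to `/api/predict_many` might be sketched as below using only the Python standard library. The `BASE_URL` is a placeholder for wherever the application is deployed, and the exact payload shape (a JSON list of observation objects) is an assumption based on the description above; consult the API documentation for the authoritative contract.

```python
# Sketch of a /api/predict_many request. BASE_URL and the list-of-objects
# payload shape are assumptions, not confirmed by the API docs.
import json
from urllib import request

BASE_URL = "http://localhost:8000"  # placeholder: adjust to the real deployment

observations = [
    {"ALTITUD": 291.0,
     "CONTENIDO_CENIZA_VOLCANICA": "False",
     "DRENAJE_NATURAL": "Pobre",
     "EPIPEDON": "Ocrico",
     "FAMILIA_TEXTURAL": "Fina",
     "H1_ESPESOR": 17.0,
     "H1_RESULTADO_ph": 4.5,
     "H2_ESPESOR": 38.0,
     "PROFUNDIDAD_MAXIMA": 110.0},
    # ... more observations, one dict per sample
]

payload = json.dumps(observations).encode("utf-8")
req = request.Request(f"{BASE_URL}/api/predict_many", data=payload,
                      headers={"Content-Type": "application/json"})

# Uncomment once the API is reachable:
# with request.urlopen(req) as resp:
#     predictions = json.loads(resp.read())
```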
The following is a sample of how a request would look in the JSON structure:

```
{'ALTITUD': 291.0,
 'CONTENIDO_CENIZA_VOLCANICA': 'False',
 'DRENAJE_NATURAL': 'Pobre',
 'EPIPEDON': 'Ocrico',
 'FAMILIA_TEXTURAL': 'Fina',
 'H1_ESPESOR': 17.0,
 'H1_RESULTADO_ph': 4.5,
 'H2_ESPESOR': 38.0,
 'PROFUNDIDAD_MAXIMA': 110.0
}
```
The following table details the characteristics of the valid structure. All fields must be included. It is worth noting that the Random Forest algorithm can handle null values; however, non-null values are encouraged for a better prediction.
Variable | Type | Title |
---|---|---|
ALTITUD | number | Altitud |
CONTENIDO_CENIZA_VOLCANICA | string | Contenido Ceniza Volcanica |
DRENAJE_NATURAL | string | Drenaje Natural |
EPIPEDON | string | Epipedon |
FAMILIA_TEXTURAL | string | Familia Textural |
H1_ESPESOR | number | H1 Espesor |
H1_RESULTADO_ph | number | H1 Resultado pH |
H2_ESPESOR | number | H2 Espesor |
PROFUNDIDAD_MAXIMA | number | Profundidad Maxima |
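A client could check an observation against this table before calling the API, along the lines of the sketch below. The `SCHEMA` dict mirrors the table; mapping "number" to `int`/`float` and "string" to `str` is my assumption, not part of the official API contract.

```python
# Minimal client-side validation sketch for the field table above.
SCHEMA = {
    "ALTITUD": (int, float),
    "CONTENIDO_CENIZA_VOLCANICA": (str,),
    "DRENAJE_NATURAL": (str,),
    "EPIPEDON": (str,),
    "FAMILIA_TEXTURAL": (str,),
    "H1_ESPESOR": (int, float),
    "H1_RESULTADO_ph": (int, float),
    "H2_ESPESOR": (int, float),
    "PROFUNDIDAD_MAXIMA": (int, float),
}

def validate(obs: dict) -> list:
    """Return a list of problems; an empty list means the observation looks valid."""
    problems = [f"missing field: {k}" for k in SCHEMA if k not in obs]
    problems += [f"unexpected field: {k}" for k in obs if k not in SCHEMA]
    for key, types in SCHEMA.items():
        # A null (None) value also fails this check, matching the note above
        # that numerical values must not be null.
        if key in obs and not isinstance(obs[key], types):
            problems.append(
                f"{key}: expected {'/'.join(t.__name__ for t in types)}")
    return problems
```

Running `validate` on the sample observation shown earlier returns an empty list, while a missing or mistyped field produces a human-readable message per problem.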