The following credentials are to be used to log in to the application:
- ID: IGAC_user
- Password: 123456
Testing data can be found at `./Datos_prueba.zip`. This .zip file contains two testing files:
- `Datos_sin_Clasificacion.csv`
- `Datos_con_Clasificacion.csv`
The first file does not contain the ORDER column, so it exercises the classification model and updates the plots accordingly. The second file includes the ORDER column, so no classification is performed; it is used only to generate the plots on the Map, Treemap and Pivot Table.
This project was conceived out of the need of the Instituto Geográfico Agustín Codazzi (IGAC) for a tool capable of performing automatic soil taxonomy classification using data collected in the field.
This tool seeks to aid and streamline the taxonomic soil classification process carried out by the Instituto's edaphologists, whose goal is to inventory and map the soils of Colombia. For this purpose, several models were fitted and tested on a database of more than 12,000 observations recording 209 different variables, collected in the Cundiboyacense Plateau.
The classification is based on the USDA's methodology. The current application can classify the first category of that taxonomic hierarchy (i.e. Order).
In order to achieve this, the following steps were taken:
- USDA's Soil Taxonomy was studied
- EDA
- Data Cleaning
- Statistical Classification Algorithms
- Dashboard
- Maps
- Graphs
- Database management
- Backend development
This project seeks to become the foundation for deeper automatic taxonomic classification (i.e. Suborder, Great Group, Subgroup, etc.), which may be achieved by fitting further models. Additionally, data-entry standardization is highly recommended in order to improve database quality and minimize Data Cleaning processes in the future.
The Project is called CATS, which stands for Clasificador Automático Taxonómico de Suelos in Spanish, or Automatic Taxonomic Soil Classifier.
In order to perform the taxonomic classification, several models were fitted, such as:
- Random Forests
- Multinomial Regressions
- Stacked Models
It was concluded that Random Forests was the most appropriate algorithm for this classification problem. The selection was made, broadly, by balancing accuracy, speed and parsimony.
The Random Forests were built using the `scikit-learn` library. They were optimized by testing more than 70 forests with different parameters, and the best one was selected based on the Out-of-Bag Error.
The final model obtained an accuracy of 94.2% on the test dataset.
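The selection procedure described above can be sketched as follows. This is an illustrative sketch, not the project's actual training code: the candidate parameter grid and the synthetic stand-in data are assumptions, and only the mechanism (comparing forests by their Out-of-Bag score in scikit-learn) reflects the text.

```python
# Sketch: selecting a Random Forest by Out-of-Bag (OOB) error with scikit-learn.
# The data below is a synthetic placeholder for the IGAC soil database
# (9 predictor variables, 5 taxonomic orders).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=9, n_informative=6,
                           n_classes=5, random_state=0)

best_oob, best_model = -1.0, None
for n_estimators in (100, 200):           # candidate forest sizes (assumed)
    for max_features in ("sqrt", None):   # candidate split widths (assumed)
        rf = RandomForestClassifier(n_estimators=n_estimators,
                                    max_features=max_features,
                                    oob_score=True, random_state=0)
        rf.fit(X, y)
        # oob_score_ is the accuracy on out-of-bag samples,
        # so the OOB error is 1 - rf.oob_score_.
        if rf.oob_score_ > best_oob:
            best_oob, best_model = rf.oob_score_, rf

print(f"Best OOB accuracy: {best_oob:.3f}")
```

The OOB score is computed from samples each tree never saw during bootstrapping, so it acts as a built-in validation estimate without holding out a separate set during model selection.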
By using the app, it’s possible to:
- Obtain predictions of data uploaded by the user
- Visualize and Interact with user-uploaded data and the original database
The usage process is as follows:
- Log in using authorized credentials
- Visualize and interact with the data from the original database using the Map, Treemap and Pivot Table
- Filter the data using the filters located above the map
- Obtain detailed information by hovering over each entry point plotted in the map
- Upload valid files (`.csv`, `.xls` or `.xlsx`); the maps and graphs will be updated automatically
- Interact with the map and graphs
- Obtain and interpret the classifications performed by the model
If the user wishes to bypass the Dashboard, they can do so by interacting directly with the project's API.
Currently there are 3 API Endpoints:
- `/api/status/`: Pings the API; the API responds with a boolean indicating whether it is online and able to respond.
- `/api/predict`: Requests a single prediction based on a set of 9 variables (see the sample JSON structure below), passed in a JSON-like structure. The response, again a JSON structure, contains the most likely prediction and the probability of belonging to each of the 5 possible taxonomic orders.
- `/api/predict_many`: Performs several predictions in a single call. A JSON payload containing the multiple observations to be predicted, each based on the same set of 9 variables (see the sample JSON structure below), must be passed. Once again, the response is a JSON structure containing the predictions as well as the probability of each. This is a highly efficient way of communicating with the application, reducing response time by several orders of magnitude; for example, it can perform up to 4,000 classifications in less than a second.
This is the only way in which the probability of each classification can be obtained.
NOTE: Numerical values can't contain null values.
Further details of the API documentation can be found here.
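A call to `/api/predict_many` might be sketched as below using only the Python standard library. The `BASE_URL` is a placeholder for wherever the application is deployed, and the exact payload shape (a JSON list of observation objects) is an assumption based on the description above; consult the API documentation for the authoritative contract.

```python
# Sketch of a /api/predict_many request. BASE_URL and the list-of-objects
# payload shape are assumptions, not confirmed by the API docs.
import json
from urllib import request

BASE_URL = "http://localhost:8000"  # placeholder: adjust to the real deployment

observations = [
    {"ALTITUD": 291.0,
     "CONTENIDO_CENIZA_VOLCANICA": "False",
     "DRENAJE_NATURAL": "Pobre",
     "EPIPEDON": "Ocrico",
     "FAMILIA_TEXTURAL": "Fina",
     "H1_ESPESOR": 17.0,
     "H1_RESULTADO_ph": 4.5,
     "H2_ESPESOR": 38.0,
     "PROFUNDIDAD_MAXIMA": 110.0},
    # ... more observations, one dict per sample
]

payload = json.dumps(observations).encode("utf-8")
req = request.Request(f"{BASE_URL}/api/predict_many", data=payload,
                      headers={"Content-Type": "application/json"})

# Uncomment once the API is reachable:
# with request.urlopen(req) as resp:
#     predictions = json.loads(resp.read())
```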
The following is a sample of how a request would look in the JSON structure:

```
{'ALTITUD': 291.0,
 'CONTENIDO_CENIZA_VOLCANICA': 'False',
 'DRENAJE_NATURAL': 'Pobre',
 'EPIPEDON': 'Ocrico',
 'FAMILIA_TEXTURAL': 'Fina',
 'H1_ESPESOR': 17.0,
 'H1_RESULTADO_ph': 4.5,
 'H2_ESPESOR': 38.0,
 'PROFUNDIDAD_MAXIMA': 110.0
}
```
The following table details the characteristics of the valid structure. All fields must be included. It is worth noting that the Random Forest algorithm can handle null values; however, non-null values are encouraged for a better prediction.
Variable | Type | Title |
---|---|---|
ALTITUD | number | Altitud |
CONTENIDO_CENIZA_VOLCANICA | string | Contenido Ceniza Volcanica |
DRENAJE_NATURAL | string | Drenaje Natural |
EPIPEDON | string | Epipedon |
FAMILIA_TEXTURAL | string | Familia Textural |
H1_ESPESOR | number | H1 Espesor |
H1_RESULTADO_ph | number | H1 Resultado pH |
H2_ESPESOR | number | H2 Espesor |
PROFUNDIDAD_MAXIMA | number | Profundidad Maxima |
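A client could check an observation against this table before calling the API, along the lines of the sketch below. The `SCHEMA` dict mirrors the table; mapping "number" to `int`/`float` and "string" to `str` is my assumption, not part of the official API contract.

```python
# Minimal client-side validation sketch for the field table above.
SCHEMA = {
    "ALTITUD": (int, float),
    "CONTENIDO_CENIZA_VOLCANICA": (str,),
    "DRENAJE_NATURAL": (str,),
    "EPIPEDON": (str,),
    "FAMILIA_TEXTURAL": (str,),
    "H1_ESPESOR": (int, float),
    "H1_RESULTADO_ph": (int, float),
    "H2_ESPESOR": (int, float),
    "PROFUNDIDAD_MAXIMA": (int, float),
}

def validate(obs: dict) -> list:
    """Return a list of problems; an empty list means the observation looks valid."""
    problems = [f"missing field: {k}" for k in SCHEMA if k not in obs]
    problems += [f"unexpected field: {k}" for k in obs if k not in SCHEMA]
    for key, types in SCHEMA.items():
        # A null (None) value also fails this check, matching the note above
        # that numerical values must not be null.
        if key in obs and not isinstance(obs[key], types):
            problems.append(
                f"{key}: expected {'/'.join(t.__name__ for t in types)}")
    return problems
```

Running `validate` on the sample observation shown earlier returns an empty list, while a missing or mistyped field produces a human-readable message per problem.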