DS4A-Team19-2021/Agustin-Codazzi-Project

Agustin-Codazzi-Project

Use the following credentials to log in to the application:

  • ID: IGAC_user
  • Password: 123456

Testing data can be found at ./Datos_prueba.zip/

This .zip file contains two testing files:

  1. Datos_sin_Clasificacion.csv
  2. Datos_con_Clasificacion.csv

The first file does not contain the ORDER column, so uploading it runs the classification model and updates the plots accordingly.

The second file already includes the ORDER column, so no classification is performed; uploading it only generates the plots on the Map, Treemap and Pivot Table.
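The rule described above (classify only when the ORDER column is absent) can be sketched as follows. This is a minimal illustration using Python's standard csv module; the function name `needs_classification` and the inline sample rows are hypothetical and are not taken from the application's code.

```python
import csv
import io


def needs_classification(csv_text: str, target_column: str = "ORDER") -> bool:
    """Return True when the header lacks the target column, i.e. the model should run."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # an empty file yields an empty header
    return target_column not in [name.strip() for name in header]


# Hypothetical miniature files mimicking the two test datasets.
unclassified = "ALTITUD,EPIPEDON\n291.0,Ocrico\n"
classified = "ALTITUD,EPIPEDON,ORDER\n291.0,Ocrico,ORDER_A\n"

print(needs_classification(unclassified))  # True
print(needs_classification(classified))    # False
```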


1. Description

This project arose from the need of the Instituto Geográfico Agustín Codazzi (IGAC) for a tool capable of performing automatic taxonomic soil classification from data collected in the field.

The tool aims to support and simplify the taxonomic soil classification performed by the Instituto's edaphologists, whose goal is to produce an inventory and cartography of the soils of Colombia. For this purpose, several models were fitted and tested using a database of more than 12,000 observations covering 209 different variables, collected in the Cundiboyacense Plateau.

This classification is based on the USDA’s methodology. The current application is capable of classifying the first category in said taxonomic hierarchy (i.e. Order).

In order to achieve this, the project covered the following work:

  • Study of USDA's Soil Taxonomy
  • Exploratory data analysis (EDA)
  • Data cleaning
  • Statistical classification algorithms
  • Dashboard
  • Maps
  • Graphs
  • Database management
  • Backend development

This project aims to lay the foundation for a deeper automatic taxonomic classification (i.e. Suborder, Great Group, Sub Group, etc.) that may be achieved by fitting further models. Additionally, standardizing data entry is highly recommended, in order to improve the quality of the databases and minimize data cleaning in the future.

The Project is called CATS, which stands for Clasificador Automático Taxonómico de Suelos in Spanish, or Automatic Taxonomic Soil Classifier.

2. Model

In order to perform the taxonomic classification, several models were fitted, such as:

  • Random Forests
  • Multinomial Regressions
  • Stacked Models

It was concluded that Random Forests was the most appropriate algorithm for this classification problem. It was selected, broadly, by balancing accuracy, speed and parsimony.

The Random Forests were fitted using the scikit-learn library. They were tuned by testing more than 70 forests with different parameters, and the best one was selected based on its Out-of-Bag Error.
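The selection procedure described above can be illustrated with a short sketch. It is not the project's training code: the synthetic dataset, the small parameter grid and the variable names are assumptions made for illustration. It relies on scikit-learn's `oob_score=True` option, which stores the out-of-bag accuracy in `oob_score_`, so the Out-of-Bag Error is `1 - oob_score_`.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the soil data (assumption): 9 features, 5 classes.
X, y = make_classification(
    n_samples=600, n_features=9, n_informative=6, n_redundant=0,
    n_classes=5, n_clusters_per_class=1, random_state=0,
)

# Hypothetical grid; the project itself tested more than 70 configurations.
grid = {
    "n_estimators": [50, 100],
    "max_features": [2, 3],
    "min_samples_leaf": [1, 5],
}

results = []
for values in product(*grid.values()):
    params = dict(zip(grid.keys(), values))
    forest = RandomForestClassifier(
        oob_score=True, bootstrap=True, random_state=0, **params
    )
    forest.fit(X, y)
    # oob_score_ is the out-of-bag accuracy; the error is its complement.
    results.append((1.0 - forest.oob_score_, params))

# Keep the configuration with the lowest Out-of-Bag Error.
best_error, best_params = min(results, key=lambda item: item[0])
print(best_params, round(best_error, 3))
```

Because out-of-bag samples are never seen by the trees that score them, this comparison needs no separate validation split, which leaves the test dataset untouched for the final accuracy estimate.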

The final model obtained an accuracy of 94.2% on the test dataset.

3. App Use

By using the app, it’s possible to:

  • Obtain predictions of data uploaded by the user
  • Visualize and Interact with user-uploaded data and the original database

The usage process is as follows:

  1. Log in using authorized credentials
  2. Visualize and interact with the data from the original database using the Map, Treemap and Pivot Table
  3. Filter the data using the filters located above the map
  4. Obtain detailed information by hovering over each entry point plotted in the map
  5. Upload valid files (.csv, .xls or .xlsx) (The maps and graphs will be automatically updated)
  6. Interact with the map and graphs
  7. Obtain and interpret the classifications performed by the model
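As a small illustration of step 5, the check for a valid upload can be sketched with the standard library. The function name `is_supported_upload` is hypothetical; only the three extensions come from the list above, and treating the comparison as case-insensitive is an assumption.

```python
from pathlib import Path

# Extensions listed in the usage steps above.
SUPPORTED_EXTENSIONS = {".csv", ".xls", ".xlsx"}


def is_supported_upload(filename: str) -> bool:
    """Return True when the file extension is one the app accepts (case-insensitive)."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS


print(is_supported_upload("Datos_sin_Clasificacion.csv"))  # True
print(is_supported_upload("Datos_prueba.zip"))             # False
```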

4. API

Users who wish to bypass the Dashboard can interact directly with the project's API.

Currently there are 3 API Endpoints:

  1. /api/status/ Pings the API; the API responds with a boolean indicating whether it is online and able to respond.

  2. /api/predict Requests a single prediction based on a set of 9 variables (see Sample JSON structure below), passed as a JSON structure. The response, also a JSON structure, contains the most likely prediction and the probability of belonging to each of the 5 possible taxonomic orders.

  3. /api/predict_many Performs several predictions in a single call. The request must be a JSON structure containing the multiple observations to be predicted, each based on the same set of 9 variables (see Sample JSON structure below). The response is again a JSON structure containing the predictions and the probability of each. This is a highly efficient way of communicating with the application, reducing the response time by several orders of magnitude. For example, it can perform up to 4000 classifications in less than a second.

The API is the only way to obtain the probability of each classification.

NOTE: Numerical fields can't be null.

Further details of the API documentation can be found here

Sample JSON structure

The following is a sample of what a request looks like in the JSON structure:

{"ALTITUD": 291.0,
 "CONTENIDO_CENIZA_VOLCANICA": "False",
 "DRENAJE_NATURAL": "Pobre",
 "EPIPEDON": "Ocrico",
 "FAMILIA_TEXTURAL": "Fina",
 "H1_ESPESOR": 17.0,
 "H1_RESULTADO_ph": 4.5,
 "H2_ESPESOR": 38.0,
 "PROFUNDIDAD_MAXIMA": 110.0
}
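A request carrying this structure could be assembled as in the sketch below, using only Python's standard library. Several details are assumptions not stated in this README: the base URL is a placeholder, the POST method is assumed, and /api/predict_many is assumed to accept a JSON list of observations. The helper name `build_request` is hypothetical.

```python
import json
import urllib.request

# Hypothetical base URL; replace with the address where the application is deployed.
BASE_URL = "http://localhost:8000"

observation = {
    "ALTITUD": 291.0,
    "CONTENIDO_CENIZA_VOLCANICA": "False",
    "DRENAJE_NATURAL": "Pobre",
    "EPIPEDON": "Ocrico",
    "FAMILIA_TEXTURAL": "Fina",
    "H1_ESPESOR": 17.0,
    "H1_RESULTADO_ph": 4.5,
    "H2_ESPESOR": 38.0,
    "PROFUNDIDAD_MAXIMA": 110.0,
}


def build_request(endpoint: str, payload) -> urllib.request.Request:
    """Build (but do not send) a JSON request for the given endpoint."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=BASE_URL + endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",  # assumption: the README does not state the HTTP method
    )


single = build_request("/api/predict", observation)
# Assumption: the batch endpoint takes a JSON list of observations.
batch = build_request("/api/predict_many", [observation, observation])

# Sending is left commented out so the sketch runs without a live server:
# with urllib.request.urlopen(single) as response:
#     print(json.load(response))
print(single.full_url, batch.full_url)
```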

Request JSON structure detail

The following table details the characteristics of the valid structure. All fields must be included. It is worth noting that the Random Forest algorithm can deal with null values; however, using non-null values is encouraged for a better prediction.

Variable                    Type    Title
ALTITUD                     number  Altitud
CONTENIDO_CENIZA_VOLCANICA  string  Contenido Ceniza Volcanica
DRENAJE_NATURAL             string  Drenaje Natural
EPIPEDON                    string  Epipedon
FAMILIA_TEXTURAL            string  Familia Textural
H1_ESPESOR                  number  H1 Espesor
H1_RESULTADO_ph             number  H1 Resultado pH
H2_ESPESOR                  number  H2 Espesor
PROFUNDIDAD_MAXIMA          number  Profundidad Maxima
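Before sending a request, a client could check an observation against the table above. The following is a sketch of such a client-side check, not the API's own validation: the function name `validate_observation` and the error messages are hypothetical, and allowing null in string fields is an assumption (the README only states that numerical values cannot be null).

```python
from numbers import Real
from typing import List

# Field names and types as listed in the table above.
SCHEMA = {
    "ALTITUD": Real,
    "CONTENIDO_CENIZA_VOLCANICA": str,
    "DRENAJE_NATURAL": str,
    "EPIPEDON": str,
    "FAMILIA_TEXTURAL": str,
    "H1_ESPESOR": Real,
    "H1_RESULTADO_ph": Real,
    "H2_ESPESOR": Real,
    "PROFUNDIDAD_MAXIMA": Real,
}


def validate_observation(obs: dict) -> List[str]:
    """Return a list of problems; an empty list means the observation looks valid."""
    errors = []
    for field, expected in SCHEMA.items():
        if field not in obs:
            errors.append(f"missing field: {field}")
            continue
        value = obs[field]
        if expected is Real:
            # bool is a subclass of int, so it is rejected explicitly.
            if isinstance(value, bool) or not isinstance(value, Real):
                errors.append(f"{field} must be a non-null number")
        elif value is not None and not isinstance(value, str):
            errors.append(f"{field} must be a string")
    return errors


sample = {
    "ALTITUD": 291.0, "CONTENIDO_CENIZA_VOLCANICA": "False",
    "DRENAJE_NATURAL": "Pobre", "EPIPEDON": "Ocrico",
    "FAMILIA_TEXTURAL": "Fina", "H1_ESPESOR": 17.0,
    "H1_RESULTADO_ph": 4.5, "H2_ESPESOR": 38.0,
    "PROFUNDIDAD_MAXIMA": 110.0,
}
print(validate_observation(sample))                       # []
print(validate_observation({**sample, "ALTITUD": None}))  # ['ALTITUD must be a non-null number']
```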

5. Team Members:

[Image: team members]

About

This repository contains the code for the Agustín Codazzi institute project, developed in the context of the 2021 edition of the DS4A course.
