<a href="https://colab.research.google.com/github/mnaR99/narco_aguacate/blob/main/notebooks/siap_asc_selenium.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Monthly avocado production, 2018 to 2020

Web scraping with Selenium: the script drives the site's JavaScript query forms and extracts the HTML of the tables each query generates.

In [1]:
%%capture
!pip install selenium
!pip install pyjanitor
!apt-get update 
!apt install chromium-chromedriver

Initial configuration for running Selenium from a notebook.

In [2]:
%%capture
from selenium import webdriver
from selenium.webdriver.support.ui import Select
import time

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')

url = "https://nube.siap.gob.mx/avance_agricola/"

Retrieve monthly avocado production for 2018 to 2020 by state (entidad federativa).

In [3]:
%%capture
from selenium.webdriver.common.by import By

tbls_html = []

# Current Selenium releases take options= and locate elements with find_element(By..., ...)
web = webdriver.Chrome(options=chrome_options)
web.get(url)

time.sleep(0.7)
cicloProd = Select(web.find_element(By.ID, "cicloProd"))
cicloProd.select_by_value("3")  # Code for perennial crops (Perennes)

time.sleep(0.7)
modalidad = Select(web.find_element(By.ID, "modalidad"))
modalidad.select_by_value("3")  # Code for irrigated + rainfed (Riego + Temporal)

time.sleep(0.7)
cultivo = Select(web.find_element(By.ID, "cultivo"))
cultivo.select_by_value("7")  # Code for avocado (Aguacate)

time.sleep(1)
for anio in range(2018, 2021):

    time.sleep(0.8)
    anioagric = Select(web.find_element(By.ID, "anioagric"))
    anioagric.select_by_value(str(anio))

    for mes in range(1, 13):
        time.sleep(0.8)
        mesagric = Select(web.find_element(By.ID, "mesagric"))
        mesagric.select_by_value(str(mes))

        time.sleep(0.8)
        consultar = web.find_element(By.XPATH, '//*[contains(concat( " ", @class, " " ), concat( " ", "pull-right", " " ))]')
        consultar.click()  # Query button

        time.sleep(1)
        tbl_html = web.find_element(By.ID, "Resultados-reporte").get_attribute("outerHTML")
        tbls_html.append(tbl_html)

time.sleep(0.5)
web.quit()  # End the browser session and the driver process
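The fixed `time.sleep` pauses above assume the page always updates within a fraction of a second. A more robust pattern is to poll for a condition until it holds or a timeout expires. Selenium provides `WebDriverWait` for this. The sketch below shows the same idea with the standard library only, so it runs without a browser. `wait_until` and `table_is_ready` are hypothetical names used for illustration and are not part of the original notebook.

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.2):
    """Poll `predicate` until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(interval)


# Hypothetical stand-in for a page element that only appears on the third poll
attempts = {"count": 0}


def table_is_ready():
    attempts["count"] += 1
    return "<table>" if attempts["count"] >= 3 else None


html = wait_until(table_is_ready, timeout=2.0, interval=0.01)
```

In the scraper, the predicate would be a function that tries to locate the result table and returns its HTML once it is present, which removes the need to guess how long each query takes.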

In [4]:
# Write as UTF-8 so that parse_html can read the file back with encoding="utf-8"
with open("tbls_html.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(tbls_html))
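Before parsing, it can help to confirm that the saved file holds the expected number of tables (3 years × 12 months = 36), since a query that failed silently would shift every later year and month label. The following is a minimal sketch using only the standard library. It assumes each query result contains exactly one `<table>`, and `TableCounter` and `count_tables` are hypothetical helpers, not part of the original notebook. The sample HTML is synthetic.

```python
from html.parser import HTMLParser


class TableCounter(HTMLParser):
    """Counts opening <table> tags in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.tables = 0

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables += 1


def count_tables(html_text):
    """Return the number of <table> elements opened in `html_text`."""
    counter = TableCounter()
    counter.feed(html_text)
    counter.close()
    return counter.tables


# Synthetic stand-in for the saved file: three result blocks, one table each
sample = "\n".join(
    f'<div id="Resultados-reporte"><table><tr><td>{i}</td></tr></table></div>'
    for i in range(3)
)
n_tables = count_tables(sample)
```

Applied to the real file, the check would read `tbls_html.txt` and compare the count against 36 before moving on.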

Retrieve monthly avocado production for 2018 to 2020 by municipality in Michoacán.

In [5]:
%%capture
from selenium.webdriver.common.by import By

tbls_mich_html = []

# Current Selenium releases take options= and locate elements with find_element(By..., ...)
web = webdriver.Chrome(options=chrome_options)
web.get(url)

time.sleep(2)
mun = web.find_element(By.ID, "opcionDDRMpio4")
mun.click()

time.sleep(0.7)
cicloProd = Select(web.find_element(By.ID, "cicloProd"))
cicloProd.select_by_value("3")  # Code for perennial crops (Perennes)

time.sleep(0.7)
modalidad = Select(web.find_element(By.ID, "modalidad"))
modalidad.select_by_value("3")  # Code for irrigated + rainfed (Riego + Temporal)

time.sleep(0.7)
entidad = Select(web.find_element(By.ID, "entidad"))
entidad.select_by_value("16")  # Code for Michoacán

time.sleep(0.7)
cultivo = Select(web.find_element(By.ID, "cultivo"))
cultivo.select_by_value("7")  # Code for avocado (Aguacate)

time.sleep(1)
for anio in range(2018, 2021):

    time.sleep(0.8)
    anioagric = Select(web.find_element(By.ID, "anioagric"))
    anioagric.select_by_value(str(anio))

    for mes in range(1, 13):

        time.sleep(0.8)
        mesagric = Select(web.find_element(By.ID, "mesagric"))
        mesagric.select_by_value(str(mes))

        time.sleep(0.8)
        consultar = web.find_element(By.XPATH, '//*[contains(concat( " ", @class, " " ), concat( " ", "pull-right", " " ))]')
        consultar.click()  # Query button

        time.sleep(1)
        tbl_html = web.find_element(By.ID, "Resultados-reporte").get_attribute("outerHTML")
        tbls_mich_html.append(tbl_html)

time.sleep(0.5)
web.quit()  # End the browser session and the driver process

In [6]:
# Write as UTF-8 so that parse_html can read the file back with encoding="utf-8"
with open("tbls_mich_html.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(tbls_mich_html))

Reading and cleaning the scraped tables of monthly avocado production from 2018 to 2020.

In [7]:
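import itertools

import janitor  # pyjanitor: registers DataFrame.clean_names()
import pandas as pd


def clean_tbl(tbl, args):
    """Drop the top level of the multi-level header and tag the table with its year and month."""
    anio, mes = args
    tbl.columns = tbl.columns.droplevel()
    tbl["Anio"] = anio
    tbl["Mes"] = mes
    return tbl


def parse_html(path):
    """Read every scraped table in `path` and stack them into one DataFrame."""
    tbls = pd.read_html(path, index_col=0, encoding="utf-8")
    # Tables were saved in scrape order: by year, then months 1 to 12 within each year
    cp = itertools.product(range(2018, 2021), range(1, 13))

    tbls = [clean_tbl(tbl, pair) for tbl, pair in zip(tbls, cp)]
    tbl = pd.concat(tbls, ignore_index=True).clean_names()

    return tbl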
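`parse_html` relies on the tables appearing in the same order as the scraping loops: the outer loop runs over years and the inner loop over months, which is exactly the order `itertools.product` yields. Because `zip` stops silently at the shorter input, a missing table would go unnoticed. The sketch below illustrates the pairing with an explicit length check. `label_in_scrape_order` is a hypothetical helper and the table placeholders are made-up strings, not real data.

```python
import itertools

YEARS = range(2018, 2021)
MONTHS = range(1, 13)


def label_in_scrape_order(items):
    """Pair each item with its (year, month), year-major, and fail on a count mismatch."""
    periods = list(itertools.product(YEARS, MONTHS))
    if len(items) != len(periods):
        raise ValueError(f"expected {len(periods)} tables, got {len(items)}")
    return list(zip(periods, items))


# Hypothetical placeholders standing in for the 36 parsed tables
fake_tables = [f"table_{i}" for i in range(36)]
labeled = label_in_scrape_order(fake_tables)
```

The first pair is January 2018, the thirteenth is January 2019, and the last is December 2020, matching the order in which the queries were issued.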

In [8]:
prodnac = parse_html("tbls_html.txt")
prodmic = parse_html("tbls_mich_html.txt")

print(prodnac, prodmic, sep="\n\n")

                  entidad   sembrada  cosechada  ...  rendimiento_udm_ha_  anio  mes
0          Aguascalientes      23.00       0.00  ...                 0.00  2018    1
1         Baja California      46.75       0.00  ...                 0.00  2018    1
2     Baja California Sur     168.00       0.00  ...                 0.00  2018    1
3                Campeche      80.50       0.00  ...                 0.00  2018    1
4                  Colima     773.50       0.00  ...                 0.00  2018    1
...                   ...        ...        ...  ...                  ...   ...  ...
1027             Tlaxcala      27.00      25.00  ...                 7.90  2020   12
1028             Veracruz     897.00     684.00  ...                10.80  2020   12
1029              Yucatán     523.28     400.32  ...                22.35  2020   12
1030            Zacatecas      45.50      41.50  ...                 7.09  2020   12
1031                Total  241140.11  224425.94  ...             

In [9]:
prodnac.to_csv("asc_entidad.csv", index=False)
prodmic.to_csv("asc_michoacan.csv", index=False)