# Exploratory data analysis with all dataframes combined

In this notebook we combine the project's dataframes to perform an EDA (exploratory data analysis) and look for patterns relating agricultural, economic, and climate variables to crops in Mexico.

### Libraries

In [83]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns

### Combining the dataframes

We start with the SIAP dataframe.

In [84]:
directory = '../limpieza'

df_SIAP = pd.read_csv(os.path.join(directory,"SIAP.csv"))
df_SIAP.head()

Unnamed: 0,Año,Mes,Cultivo,Estado,Distrito,Municipio,Superficie(ha)_Sembrada,Superficie(ha)_Cosechada,Superficie(ha)_Siniestrada,Producción,Rendimiento(udm/ha)
0,2020,1,Tomate rojo,Aguascalientes,Aguascalientes,Calvillo,16.0,6.0,0.0,90.0,15.0
1,2020,1,Tomate rojo,Baja California,Ensenada,Ensenada,19.5,0.0,0.0,0.0,0.0
2,2020,1,Tomate rojo,Baja California Sur,Mulegé,Mulegé,80.0,0.0,0.0,0.0,0.0
3,2020,1,Tomate rojo,Baja California Sur,Comondú,Comondú,127.0,0.0,0.0,0.0,0.0
4,2020,1,Tomate rojo,Baja California Sur,La Paz,La Paz,611.0,106.0,0.0,4429.76,41.79


Some of the dataframes we combine later are broken down only by state, not by municipality. We therefore group by state and sum the numeric columns, except for the yield. The SIAP computes the yield by dividing 'Producción' by 'Superficie(ha)_Cosechada', so summing it across municipalities would be meaningless; instead we recompute it after grouping. Some yield values come out as NaN because there are records with zero production and zero harvested area (0/0); we simply convert those to a numeric *0*.

In [85]:
df_SIAP_state = df_SIAP.groupby(['Año', 'Mes', 'Cultivo', 'Estado']).agg({
    'Superficie(ha)_Sembrada': 'sum',
    'Superficie(ha)_Cosechada': 'sum',
    'Superficie(ha)_Siniestrada': 'sum',
    'Producción': 'sum',
}).reset_index()

# Compute 'Rendimiento(udm/ha)' and convert NaN records to 0
df_SIAP_state['Rendimiento(udm/ha)'] = df_SIAP_state['Producción'] / df_SIAP_state['Superficie(ha)_Cosechada']
df_SIAP_state['Rendimiento(udm/ha)'] = df_SIAP_state['Rendimiento(udm/ha)'].fillna(0)
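
To illustrate why the yield is recomputed rather than summed, here is a minimal sketch on made-up numbers. The states `A` and `B` and all values are hypothetical, not taken from the SIAP data.

```python
import pandas as pd

# Hypothetical toy data: state A has two municipalities, state B has no harvest yet.
toy = pd.DataFrame({
    "Estado": ["A", "A", "B"],
    "Producción": [90.0, 300.0, 0.0],
    "Superficie(ha)_Cosechada": [6.0, 15.0, 0.0],
})

by_state = toy.groupby("Estado", as_index=False).sum()

# Yield of the aggregate = total production / total harvested area.
by_state["Rendimiento(udm/ha)"] = (
    by_state["Producción"] / by_state["Superficie(ha)_Cosechada"]
)
# State B is 0/0, which pandas evaluates to NaN; replace it with 0.
by_state["Rendimiento(udm/ha)"] = by_state["Rendimiento(udm/ha)"].fillna(0)
```

Note that `fillna(0)` only covers the 0/0 case. A positive production over a zero harvested area would give `inf`, which would need separate handling.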

In [86]:
df_SIAP_state.head(10)

Unnamed: 0,Año,Mes,Cultivo,Estado,Superficie(ha)_Sembrada,Superficie(ha)_Cosechada,Superficie(ha)_Siniestrada,Producción,Rendimiento(udm/ha)
0,2020,1,Berenjena,Baja California Sur,11.0,0.0,0.0,0.0,0.0
1,2020,1,Berenjena,Morelos,0.4,0.0,0.0,0.0,0.0
2,2020,1,Berenjena,Nayarit,60.0,0.0,0.0,0.0,0.0
3,2020,1,Berenjena,Quintana Roo,12.5,10.0,0.0,49.0,4.9
4,2020,1,Berenjena,Sinaloa,1313.5,1073.23,0.0,50629.27,47.174669
5,2020,1,Berenjena,Sonora,76.0,0.0,0.0,0.0,0.0
6,2020,1,Berenjena,Yucatán,28.46,0.0,0.0,0.0,0.0
7,2020,1,Brócoli,Aguascalientes,440.0,0.0,0.0,0.0,0.0
8,2020,1,Brócoli,Baja California,886.46,815.81,0.0,12273.01,15.043956
9,2020,1,Brócoli,Baja California Sur,1.9,0.0,0.0,0.0,0.0


In [87]:
df_SIAP_state.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28282 entries, 0 to 28281
Data columns (total 9 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Año                         28282 non-null  int64  
 1   Mes                         28282 non-null  int64  
 2   Cultivo                     28282 non-null  object 
 3   Estado                      28282 non-null  object 
 4   Superficie(ha)_Sembrada     28282 non-null  float64
 5   Superficie(ha)_Cosechada    28282 non-null  float64
 6   Superficie(ha)_Siniestrada  28282 non-null  float64
 7   Producción                  28282 non-null  float64
 8   Rendimiento(udm/ha)         28282 non-null  float64
dtypes: float64(5), int64(2), object(2)
memory usage: 1.9+ MB


In [88]:
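columns_to_update = ['Superficie(ha)_Sembrada', 'Superficie(ha)_Cosechada',
                     'Superficie(ha)_Siniestrada', 'Producción']

# For each year, walk the months backwards (December to February) and, for every
# state/crop pair, subtract the previous month's values from the current month's,
# but only when no column would become negative.
for year in list(df_SIAP_state['Año'].unique()):

    for month in range(12,1,-1):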
        
        for state in list(df_SIAP_state['Estado'].unique()):

            for crop in list(df_SIAP_state['Cultivo'].unique()):

                try:
                
                    actual_month = (df_SIAP_state['Año'] == year) & (df_SIAP_state['Mes'] == month) & \
                                   (df_SIAP_state['Estado'] == state) &  (df_SIAP_state['Cultivo'] == crop)
                    previous_month = (df_SIAP_state['Año'] == year) & (df_SIAP_state['Mes'] == month-1) & \
                                     (df_SIAP_state['Estado'] == state) & (df_SIAP_state['Cultivo'] == crop)

                    if(all(df_SIAP_state.loc[actual_month, columns_to_update].values[0] - \
                        df_SIAP_state.loc[previous_month, columns_to_update].values[0] >= 0)):

                        df_SIAP_state.loc[actual_month, columns_to_update] -= \
                        df_SIAP_state.loc[previous_month, columns_to_update].values[0]
                        
                        df_SIAP_state.loc[actual_month, 'Rendimiento(udm/ha)'] = \
                        df_SIAP_state.loc[actual_month, 'Producción'] / \
                        df_SIAP_state.loc[actual_month, 'Superficie(ha)_Cosechada']

                except IndexError:
                    # No record for this month or the previous one
                    # (`.values[0]` on an empty selection): skip it.
                    continue
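
The nested loop above visits every year/month/state/crop combination one at a time, which can be slow on roughly 28,000 rows. A vectorized alternative is sketched below. It is intended to mirror the loop's rule (subtract the previous month only when it exists and no column would go negative), but it is a sketch rather than a verified drop-in replacement. The helper name `deaccumulate` and the toy data are hypothetical.

```python
import pandas as pd

def deaccumulate(df, value_cols, group_cols=("Año", "Estado", "Cultivo")):
    """Subtract the previous month's values within each year/state/crop group."""
    out = df.sort_values([*group_cols, "Mes"]).copy()
    grouped = out.groupby(list(group_cols))
    diffs = grouped[value_cols].diff()   # current row minus previous row
    month_step = grouped["Mes"].diff()   # 1 when the previous month is present
    # Update only when the previous month exists and no column would go negative.
    ok = (month_step == 1) & (diffs >= 0).all(axis=1)
    out.loc[ok, value_cols] = diffs.loc[ok, value_cols]
    return out.sort_index()

# Hypothetical toy data: one state/crop with three consecutive months.
toy = pd.DataFrame({
    "Año": [2020, 2020, 2020],
    "Mes": [1, 2, 3],
    "Estado": ["A", "A", "A"],
    "Cultivo": ["X", "X", "X"],
    "Producción": [10.0, 25.0, 40.0],
})
monthly = deaccumulate(toy, ["Producción"])
```

The sketch only handles the subtraction; the yield column would still need to be recomputed afterwards, as the loop does.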

Next we combine the SIAP data with the SNIIM data using an `inner join`.

In [140]:
directory = '../limpieza'

df_sniim = pd.read_csv(os.path.join(directory,"df_sniim.csv"))

df_sniim.head()

Unnamed: 0,Año,Mes,Estado,Cultivo,Precio
0,2020,1,Aguascalientes,Berenjena,19.681818
1,2020,1,Aguascalientes,Brócoli,12.181818
2,2020,1,Aguascalientes,Calabacita,11.393182
3,2020,1,Aguascalientes,Cebolla,6.795455
4,2020,1,Aguascalientes,Chile verde,15.309659


In [141]:
df_sniim.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31288 entries, 0 to 31287
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   Año      31288 non-null  int64  
 1   Mes      31288 non-null  int64  
 2   Estado   31288 non-null  object 
 3   Cultivo  31288 non-null  object 
 4   Precio   31288 non-null  float64
dtypes: float64(1), int64(2), object(2)
memory usage: 1.2+ MB


In [142]:
df_SIAP_sniim = pd.merge(
    df_SIAP_state,
    df_sniim,
    how="inner",
    on=["Año", "Mes", "Cultivo", "Estado"],
)

df_SIAP_sniim.head()

Unnamed: 0,Año,Mes,Cultivo,Estado,Superficie(ha)_Sembrada,Superficie(ha)_Cosechada,Superficie(ha)_Siniestrada,Producción,Rendimiento(udm/ha),Precio
0,2020,1,Berenjena,Nayarit,60.0,0.0,0.0,0.0,0.0,26.545455
1,2020,1,Berenjena,Yucatán,28.46,0.0,0.0,0.0,0.0,5.83
2,2020,1,Cebolla,Aguascalientes,247.0,7.0,0.0,128.0,18.285714,6.795455
3,2020,1,Cebolla,Baja California Sur,244.25,15.0,0.0,300.0,20.0,15.484848
4,2020,1,Cebolla,Chiapas,218.53,0.0,0.0,0.0,0.0,20.954545


In [143]:
df_SIAP_sniim.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14200 entries, 0 to 14199
Data columns (total 10 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Año                         14200 non-null  int64  
 1   Mes                         14200 non-null  int64  
 2   Cultivo                     14200 non-null  object 
 3   Estado                      14200 non-null  object 
 4   Superficie(ha)_Sembrada     14200 non-null  float64
 5   Superficie(ha)_Cosechada    14200 non-null  float64
 6   Superficie(ha)_Siniestrada  14200 non-null  float64
 7   Producción                  14200 non-null  float64
 8   Rendimiento(udm/ha)         14200 non-null  float64
 9   Precio                      14200 non-null  float64
dtypes: float64(6), int64(2), object(2)
memory usage: 1.1+ MB
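
The inner join keeps only the key combinations present in both sources, which is why the row count drops from 28,282 to 14,200. One way to see which side the unmatched keys come from is an outer merge with `indicator=True`. The sketch below uses tiny hypothetical frames (`left`, `right`) rather than the project data.

```python
import pandas as pd

keys = ["Año", "Mes", "Cultivo", "Estado"]

# Hypothetical stand-ins for df_SIAP_state and df_sniim.
left = pd.DataFrame({
    "Año": [2020, 2020], "Mes": [1, 1],
    "Cultivo": ["Cebolla", "Berenjena"], "Estado": ["Chiapas", "Sonora"],
    "Producción": [0.0, 0.0],
})
right = pd.DataFrame({
    "Año": [2020, 2020], "Mes": [1, 1],
    "Cultivo": ["Cebolla", "Papa"], "Estado": ["Chiapas", "Puebla"],
    "Precio": [20.95, 12.0],
})

check = pd.merge(left, right, how="outer", on=keys, indicator=True)
# '_merge' is 'both', 'left_only' or 'right_only' for every key combination.
counts = check["_merge"].value_counts()
```

Filtering `check` on `_merge == "left_only"` would then list the SIAP rows that have no price.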


Next we add weather data by state with a `left join`; from here on, the left dataframe is the combined SIAP and SNIIM dataframe.

In [145]:
directory = '../limpieza'

df_climate = pd.read_csv(os.path.join(directory,"Data_Climatic_Estados_Inferido.csv"))

df_climate.head()

Unnamed: 0,Año,Mes,Estado,Estado_CVE,Temp_Superficial,Temp_Superficial_MAX,Temp_Superficial_MIN,Temp_2_Metros,Temp_2_Metros_MAX,Temp_2_Metros_MIN,...,Temp_2_Metros_Pto_Húmedo,Presión_Superficial,Velocidad_Viento,Humedad_Relativa,Flujo_Evapotranspiración,Perfil_Humedad_Suelo,Dias_Sin_Nubosidad,Precipitacion,Horas_De_Sol,Insolacion_Mediodia
0,2020,1,Aguascalientes,1,11.871672,24.617947,3.252815,11.622434,19.955865,4.849208,...,6.980059,79.269296,2.306334,59.219238,0.035396,0.437889,0.193548,0.8361,10.099707,16.71868
1,2020,1,Baja California,2,11.503548,21.848387,5.296516,12.156581,19.138839,7.561161,...,7.694516,96.414323,2.328968,60.059871,0.034194,0.552387,0.387097,0.225097,9.864516,14.626258
2,2020,1,Baja California Sur,3,18.086452,27.860968,12.222968,17.575806,24.043097,13.04529,...,14.08529,99.619742,2.680968,67.542323,0.014194,0.48071,0.354839,0.050258,10.0,16.26529
3,2020,1,Campeche,4,24.531466,30.339472,20.069032,24.49305,30.125425,19.977243,...,21.721496,100.730762,0.696833,73.75088,0.448534,0.611965,0.085044,1.040616,10.510264,15.509824
4,2020,1,Chiapas,7,20.45848,27.968901,15.246714,20.349943,26.273483,15.924568,...,18.356055,91.695314,1.527048,80.52096,1.063639,0.6931,0.062056,1.351758,11.0,15.682993


In [146]:
df_climate.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1504 entries, 0 to 1503
Data columns (total 21 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Año                            1504 non-null   int64  
 1   Mes                            1504 non-null   int64  
 2   Estado                         1504 non-null   object 
 3   Estado_CVE                     1504 non-null   int64  
 4   Temp_Superficial               1472 non-null   float64
 5   Temp_Superficial_MAX           1472 non-null   float64
 6   Temp_Superficial_MIN           1472 non-null   float64
 7   Temp_2_Metros                  1472 non-null   float64
 8   Temp_2_Metros_MAX              1472 non-null   float64
 9   Temp_2_Metros_MIN              1472 non-null   float64
 10  Temp_2_Metros_Pto_Congelación  1472 non-null   float64
 11  Temp_2_Metros_Pto_Húmedo       1472 non-null   float64
 12  Presión_Superficial            1472 non-null   f

In [147]:
df_SIAP_sniim_climate = pd.merge(
    df_SIAP_sniim,
    df_climate,
    how="left",
    on=["Año", "Mes", "Estado"],
)
df_SIAP_sniim_climate.head()

Unnamed: 0,Año,Mes,Cultivo,Estado,Superficie(ha)_Sembrada,Superficie(ha)_Cosechada,Superficie(ha)_Siniestrada,Producción,Rendimiento(udm/ha),Precio,...,Temp_2_Metros_Pto_Húmedo,Presión_Superficial,Velocidad_Viento,Humedad_Relativa,Flujo_Evapotranspiración,Perfil_Humedad_Suelo,Dias_Sin_Nubosidad,Precipitacion,Horas_De_Sol,Insolacion_Mediodia
0,2020,1,Berenjena,Nayarit,60.0,0.0,0.0,0.0,0.0,26.545455,...,15.70579,92.821242,1.488903,63.214177,0.351323,0.5595,0.143548,2.056097,10.15,16.45821
1,2020,1,Berenjena,Yucatán,28.46,0.0,0.0,0.0,0.0,5.83,...,21.435265,101.449988,0.637279,70.907876,0.103201,0.584197,0.027085,0.629178,10.248326,15.474038
2,2020,1,Cebolla,Aguascalientes,247.0,7.0,0.0,128.0,18.285714,6.795455,...,6.980059,79.269296,2.306334,59.219238,0.035396,0.437889,0.193548,0.8361,10.099707,16.71868
3,2020,1,Cebolla,Baja California Sur,244.25,15.0,0.0,300.0,20.0,15.484848,...,14.08529,99.619742,2.680968,67.542323,0.014194,0.48071,0.354839,0.050258,10.0,16.26529
4,2020,1,Cebolla,Chiapas,218.53,0.0,0.0,0.0,0.0,20.954545,...,18.356055,91.695314,1.527048,80.52096,1.063639,0.6931,0.062056,1.351758,11.0,15.682993


In [148]:
df_SIAP_sniim_climate.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14200 entries, 0 to 14199
Data columns (total 28 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Año                            14200 non-null  int64  
 1   Mes                            14200 non-null  int64  
 2   Cultivo                        14200 non-null  object 
 3   Estado                         14200 non-null  object 
 4   Superficie(ha)_Sembrada        14200 non-null  float64
 5   Superficie(ha)_Cosechada       14200 non-null  float64
 6   Superficie(ha)_Siniestrada     14200 non-null  float64
 7   Producción                     14200 non-null  float64
 8   Rendimiento(udm/ha)            14200 non-null  float64
 9   Precio                         14200 non-null  float64
 10  Estado_CVE                     13508 non-null  float64
 11  Temp_Superficial               13508 non-null  float64
 12  Temp_Superficial_MAX           13508 non-null 
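
After the left join, the climate columns have 13,508 non-null values out of 14,200 rows, so some state/month combinations found no climate record. The sketch below, on hypothetical frames, shows one way to list the states that failed to match and to guard against accidental row duplication with `validate`.

```python
import pandas as pd

# Hypothetical stand-ins for df_SIAP_sniim and df_climate.
crops = pd.DataFrame({
    "Año": [2020, 2020, 2020], "Mes": [1, 1, 1],
    "Estado": ["Chiapas", "Chiapas", "Ciudad de México"],
    "Cultivo": ["Cebolla", "Papa", "Nopal"],
})
climate = pd.DataFrame({
    "Año": [2020], "Mes": [1], "Estado": ["Chiapas"], "Estado_CVE": [7],
})

merged = pd.merge(
    crops, climate, how="left", on=["Año", "Mes", "Estado"],
    validate="many_to_one",  # raises MergeError if climate has duplicate keys
)
# Rows without a climate match have NaN in every climate column.
unmatched_states = merged.loc[merged["Estado_CVE"].isna(), "Estado"].unique().tolist()
```

The missing matches also explain why `Estado_CVE` is `int64` in `df_climate` but `float64` after the merge: the NaN values force the conversion.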

We also include monthly precipitation at the state level, again with a `left join`.

In [149]:
directory = '../limpieza'

df_precipitation = pd.read_csv(os.path.join(directory,"precip.csv"))

df_precipitation.head()
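# Align the key column with the other dataframes ('Entidad' -> 'Estado').
# Note that df_climate already has a 'Precipitacion' column (without accent).
df_precipitation = df_precipitation.rename(
    columns={'Precipitacion': 'Precipitación',
             'Entidad': 'Estado'})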

In [150]:
df_precipitation.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3168 entries, 0 to 3167
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Unnamed: 0     3168 non-null   int64  
 1   Estado         3168 non-null   object 
 2   Año            3168 non-null   int64  
 3   Mes            3168 non-null   int64  
 4   Precipitación  3168 non-null   float64
dtypes: float64(1), int64(3), object(1)
memory usage: 123.9+ KB
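
The merge itself is not shown in this part of the notebook. A minimal sketch of how it could look follows, using hypothetical stand-in frames (`combined`, `precipitation`) and assuming the join keys are `Año`, `Mes` and `Estado`, as in the climate merge.

```python
import pandas as pd

# Hypothetical stand-ins for the combined dataframe and df_precipitation.
combined = pd.DataFrame({
    "Año": [2020, 2020], "Mes": [1, 2], "Estado": ["Nayarit", "Nayarit"],
    "Precio": [26.5, 24.0],
})
precipitation = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "Estado": ["Nayarit", "Nayarit"], "Año": [2020, 2020], "Mes": [1, 2],
    "Precipitación": [30.1, 5.4],
})

with_precip = pd.merge(
    combined,
    # 'Unnamed: 0' is typically a leftover index column from the CSV; drop it.
    precipitation.drop(columns=["Unnamed: 0"]),
    how="left",
    on=["Año", "Mes", "Estado"],
)
```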


Finally, we add the physical volume index (IVF), which indicates how much of each crop was produced relative to the previous year. It will also be added with a `left join`.

In [151]:
directory = '../limpieza'

df_volume = pd.read_csv(os.path.join(directory,"ivf_15-22.csv"))

df_volume.head()

Unnamed: 0,Cultivo,Año,Mes,Ivf
0,Berenjena,2015,12,3.88
1,Brocolí,2015,12,39.05
2,Calabacita,2015,12,109.13
3,Cebolla,2015,12,86.09
4,Chile verde,2015,12,134.97


---

Checking for differences in names

In [131]:
print(list(df_SIAP['Estado'].unique()) == list(df_sniim['Estado'].unique()))
print(list(df_SIAP['Estado'].unique()) == list(df_climate['Estado'].unique()))
print(list(df_SIAP['Estado'].unique()) == list(df_precipitation['Estado'].unique()))
print(list(df_SIAP['Estado'].unique()) == list(df_volume['Estado'].unique()))

False
False


KeyError: 'Estado'
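
Comparing `list(...unique())` values with `==` is order-sensitive, so it prints `False` even when two dataframes contain the same names in a different order. It also raises `KeyError` for any dataframe without an `Estado` column (`df_volume`, for instance, only has `Cultivo`, `Año`, `Mes` and `Ivf`). A set-based comparison reports which names are actually missing on each side. The sketch below uses a hypothetical helper `name_differences` and made-up frames.

```python
import pandas as pd

def name_differences(a, b, column):
    """Return (names only in a, names only in b) for a shared column."""
    if column not in a.columns or column not in b.columns:
        return None  # the column is missing on one side: nothing to compare
    names_a, names_b = set(a[column].unique()), set(b[column].unique())
    return sorted(names_a - names_b), sorted(names_b - names_a)

# Hypothetical frames: same states in a different order, plus one naming mismatch.
siap = pd.DataFrame({"Estado": ["Sonora", "Chiapas", "Ciudad de México"]})
sniim = pd.DataFrame({"Estado": ["Chiapas", "Sonora", "DF"]})
volume = pd.DataFrame({"Cultivo": ["Cebolla"]})

diff_states = name_differences(siap, sniim, "Estado")
no_column = name_differences(siap, volume, "Estado")
```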

In [152]:
print(list(set(df_sniim['Cultivo'].unique()) - set(df_SIAP['Cultivo'].unique())))
print(list(set(df_sniim['Cultivo'].unique()) - set(df_volume['Cultivo'].unique())))

['Limón', 'Plátano', 'Piña', 'Brócoli', 'Sandía', 'Espárrago', 'Melón']
['Limón', 'Plátano', 'Tomate rojo', 'Piña', 'Brócoli', 'Sandía', 'Espárrago', 'Toronja', 'Melón', 'Nopal']


In [153]:
df_SIAP['Cultivo'].unique()

array(['Tomate rojo', 'Chile verde', 'Limón', 'Plátano', 'Mango',
       'Brócoli', 'Cebolla', 'Sandía', 'Papaya', 'Lechuga', 'Nopal',
       'Nuez', 'Fresa', 'Toronja', 'Piña', 'Berenjena', 'Uva', 'Naranja',
       'Papa', 'Melón', 'Manzana', 'Pera', 'Durazno', 'Espárrago',
       'Zarzamora', 'Coliflor', 'Guayaba', 'Tomate verde', 'Frijol',
       'Garbanzo grano', 'Frambuesa', 'Pepino', 'Calabacita'],
      dtype=object)
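
Every name in the first difference list carries an accented or non-ASCII letter ('Limón', 'Piña', 'Brócoli', ...), yet the same names appear in `df_SIAP['Cultivo'].unique()`. One possible explanation, which is an assumption and not something verified here, is that the two sources encode those characters in different Unicode normalization forms (composed vs. decomposed) or carry stray whitespace. A minimal sketch of normalizing names before comparing or merging, with a hypothetical helper `normalize_names`:

```python
import unicodedata
import pandas as pd

def normalize_names(series):
    """Strip whitespace and put text in NFC (composed) Unicode form."""
    return series.str.strip().map(lambda s: unicodedata.normalize("NFC", s))

# Hypothetical examples of the same names in two encodings.
composed = pd.Series(["Lim\u00f3n", "Pi\u00f1a"])        # accented letter as one code point
decomposed = pd.Series(["Limo\u0301n ", "Pin\u0303a"])   # base letter + combining mark

raw_diff = set(decomposed) - set(composed)
clean_diff = set(normalize_names(decomposed)) - set(normalize_names(composed))
```

In this toy case `raw_diff` contains both names while `clean_diff` is empty. The `df_volume` sample also shows 'Brocolí' where the other sources use 'Brócoli'; a genuine spelling difference like that would need an explicit mapping, for example `replace({'Brocolí': 'Brócoli'})`.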