# **MACHINE LEARNING Model**

**The problem addressed by the project is: economic losses in agriculture resulting from poor management of food crops, caused by poor crop planning during the sowing process.**


**The general objective of the project is: to implement SmartAgro, a system that uses pattern mining and crop simulation to provide recommendations that support better decision-making and help reduce economic losses in Peru's agricultural sector.**

Import libraries

In [None]:
!pip install https://bit.ly/3o4smsZ

from fim import *
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import random 
from graphviz import *

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting https://bit.ly/3o4smsZ
  Downloading https://bit.ly/3o4smsZ (343 kB)
     |████████████████████████████████| 343 kB 5.9 MB/s
Building wheels for collected packages: fim
  Building wheel for fim (setup.py) ... done
  Created wheel for fim: filename=fim-6.27-cp37-cp37m-linux_x86_64.whl size=523513 sha256=053a0441be49d06a17e1de7006aa88a1699256ef3baf1a6574cd6eac9a03690c
  Stored in directory: /tmp/pip-ephem-wheel-cache-n3271krs/wheels/71/84/bf/c9f96714839ef275ecaa4ba1a0cbd3c6dd20931a451e13ba1d
Successfully built fim
Installing collected packages: fim
Successfully installed fim-6.27


## **1) Data Preprocessing**

Before the preprocessing steps can be carried out, the working dataset has to be built. The final dataset combines information from three different datasets: MIDAGRI (sowing intentions), SISAP (product prices and volumes), and SENAMHI (temperature of each district).

### 1.1. Load Datasets

#### 1.1.1. Load the SISAP dataset

Load the dataset with each product's price and production volume

In [None]:
# Libraries for reading a CSV file from a Google Drive folder
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials 
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [None]:
# Read a CSV file from a Google Drive folder
from google.colab import data_table
link_google_drive = 'https://drive.google.com/open?id=1p3INJVPnQGylITuqku91bJ0Sn5iq6yE7'
flu, id = link_google_drive.split('=')
dataset_ = drive.CreateFile({'id':id})

# Download the 'reporte_sisap' dataset
dataset_.GetContentFile('reporte_sisap.csv')

# Build the DataFrame
df_precios = pd.read_csv('reporte_sisap.csv', sep=';')
df_precios.head()

Unnamed: 0,nombre,año,mes,precio,volumen
0,Aji,2020,1,2.41,2291.0
1,Aji,2020,2,2.71,2322.0
2,Aji,2020,3,2.6,1939.0
3,Aji,2020,4,1.87,2040.0
4,Aji,2020,5,1.74,2246.0


#### 1.1.2. Load the MIDAGRI dataset

Load the dataset on sowing intentions for agricultural products in the 2020-2021 period

In [None]:
def cargar_dataset():
  # Load the dataset:
  df_siembra = pd.read_excel("https://www.datosabiertos.gob.pe/node/6920/download")
  return df_siembra

In [None]:
dataset = cargar_dataset()

In [None]:
dataset.head()

Unnamed: 0,DEPARTAMENTO,PROVINICA,DISTRITO,CULTIVO,CAMPANA,AGO,SEP,OCT,NOV,DIC,ENE,FEB,MAR,ABR,MAY,JUN,JUL
0,ANCASH,AIJA,SUCCHA,Papa nativa,2020-2021,0,0,1,0,0,0,0,0,0,0,0,0
1,ANCASH,ANTONIO RAYMONDI,LLAMELLIN,Olluco,2020-2021,0,0,1,0,0,0,0,0,0,0,0,0
2,ANCASH,ANTONIO RAYMONDI,LLAMELLIN,Quinua,2020-2021,0,0,1,0,0,0,0,0,0,0,0,0
3,ANCASH,ANTONIO RAYMONDI,ACZO,Quinua,2020-2021,0,0,1,0,0,0,0,0,0,0,0,0
4,ANCASH,ANTONIO RAYMONDI,CHINGAS,Olluco,2020-2021,0,0,1,0,0,0,0,0,0,0,0,0


#### 1.1.3. Load the SENAMHI dataset

In [None]:
# Read a CSV file from a Google Drive folder
from google.colab import data_table
link_google_drive = 'https://drive.google.com/open?id=1zdB3VonAoAqb66Vvj2oAUeqBfOL_2AJi'
flu, id = link_google_drive.split('=')
dataset_ = drive.CreateFile({'id':id})

# Download the 'reporte_senamhi' dataset
dataset_.GetContentFile('reporte_senamhi.csv')

# Build the DataFrame
df_clima = pd.read_csv('reporte_senamhi.csv')
df_clima.head()

Unnamed: 0,depa,prov,dist,nombre,year,month,temp,hum,precip
0,AMAZONAS,BAGUA,IMAZA,CHIRIACO,2015,4,25.67,87.45,220.0
1,AMAZONAS,BAGUA,IMAZA,CHIRIACO,2015,5,25.71,88.32,227.6
2,AMAZONAS,BAGUA,IMAZA,CHIRIACO,2015,6,25.77,87.03,144.1
3,AMAZONAS,BAGUA,IMAZA,CHIRIACO,2015,7,25.51,88.04,188.7
4,AMAZONAS,BAGUA,IMAZA,CHIRIACO,2015,8,26.15,84.25,146.3


### 1.2. Merging datasets: MIDAGRI with SISAP

This section merges information from the MIDAGRI and SISAP datasets

#### 1.2.1. Select a Month and a Year (SISAP)

Sort the SISAP dataset by product name

In [None]:
def Precios_Tabla_Ordenada(df_precios):
  # Sort by product name (descending), keep the five relevant columns,
  # and reset the index so rows can be addressed by position 0..n-1
  columnas = ['nombre', 'año', 'mes', 'precio', 'volumen']
  df_sorted = df_precios.sort_values(by='nombre', ascending=False)
  return df_sorted[columnas].reset_index(drop=True)

Select the data for a given month and year

In [None]:
def tabla_precios_mes(nro_mes, nro_anio, data_precios):
  data_mes = data_precios.query("mes == @nro_mes and año == @nro_anio")
  data_mes = Precios_Tabla_Ordenada(data_mes)

  return data_mes

In [None]:
# SISAP data for month 11 of year 2020:
df_precios_mes = tabla_precios_mes(11, 2020, df_precios)
df_precios_mes.head()

Unnamed: 0,nombre,año,mes,precio,volumen
0,Zapallo,2020,11,0.83,4250.0
1,Zanahoria,2020,11,0.84,9475.0
2,Yuca,2020,11,1.11,5346.0
3,Tomate,2020,11,1.18,7525.0
4,Paprika,2020,11,9.58,14.0


#### 1.2.2. Merge the data

Function that adds the information in 'df_precios_mes' to the sowing dataset (MIDAGRI)

In [None]:
def juntar_df_precios(df_siembra_, df_precios_mes_):
  arr_precios = []
  arr_volumenes = []

  for k in range(len(df_siembra_['CULTIVO'])):
    for i in range(len(df_precios_mes_['nombre'])):
      if df_siembra_['CULTIVO'][k] == df_precios_mes_['nombre'][i]:
        arr_precios.append(df_precios_mes_['precio'][i])
        arr_volumenes.append(df_precios_mes_['volumen'][i])
        break  # Keep only the first match so the lists stay aligned with the rows

    # No matching product: fill with 0 as a "missing" marker
    if len(arr_precios) <= k:
      arr_precios.append(0)
      arr_volumenes.append(0)

  np_precios = np.array(arr_precios)
  np_volumenes = np.array(arr_volumenes)

  # Note: the new columns are added to the DataFrame that was passed in
  df_siembra_['Precio Promedio'] = np_precios.tolist()
  df_siembra_['Volumen Promedio'] = np_volumenes.tolist()
  
  return df_siembra_

df_siembra_precios = Union of the SISAP and MIDAGRI datasets

In [None]:
df_siembra_precios = juntar_df_precios(dataset, df_precios_mes)
df_siembra_precios.head()

Unnamed: 0,DEPARTAMENTO,PROVINICA,DISTRITO,CULTIVO,CAMPANA,AGO,SEP,OCT,NOV,DIC,ENE,FEB,MAR,ABR,MAY,JUN,JUL,Precio Promedio,Volumen Promedio
0,ANCASH,AIJA,SUCCHA,Papa nativa,2020-2021,0,0,1,0,0,0,0,0,0,0,0,0,1.65,3033.0
1,ANCASH,ANTONIO RAYMONDI,LLAMELLIN,Olluco,2020-2021,0,0,1,0,0,0,0,0,0,0,0,0,1.7,2954.0
2,ANCASH,ANTONIO RAYMONDI,LLAMELLIN,Quinua,2020-2021,0,0,1,0,0,0,0,0,0,0,0,0,0.0,0.0
3,ANCASH,ANTONIO RAYMONDI,ACZO,Quinua,2020-2021,0,0,1,0,0,0,0,0,0,0,0,0,0.0,0.0
4,ANCASH,ANTONIO RAYMONDI,CHINGAS,Olluco,2020-2021,0,0,1,0,0,0,0,0,0,0,0,0,1.7,2954.0
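
The same join can also be expressed with a single `pandas.DataFrame.merge` call. The sketch below is only an illustration on toy data, not the notebook's method: `siembra_demo` and `precios_demo` are hypothetical stand-ins for `dataset` and `df_precios_mes`, and it assumes each product appears at most once in the price table.

```python
import pandas as pd

# Hypothetical toy stand-ins for the MIDAGRI and SISAP tables
siembra_demo = pd.DataFrame({
    'DISTRITO': ['SUCCHA', 'LLAMELLIN', 'LLAMELLIN'],
    'CULTIVO': ['Papa nativa', 'Olluco', 'Quinua'],
})
precios_demo = pd.DataFrame({
    'nombre': ['Olluco', 'Papa nativa'],
    'precio': [1.70, 1.65],
    'volumen': [2954.0, 3033.0],
})

# Left join: keep every sowing row and attach price/volume where the crop name matches
unido = siembra_demo.merge(
    precios_demo, how='left', left_on='CULTIVO', right_on='nombre'
).drop(columns='nombre')

# Rename to the column names used in the notebook
unido = unido.rename(columns={'precio': 'Precio Promedio',
                              'volumen': 'Volumen Promedio'})

# Crops without a price are NaN after the join; fill with 0 to mirror juntar_df_precios
columnas_precio = ['Precio Promedio', 'Volumen Promedio']
unido[columnas_precio] = unido[columnas_precio].fillna(0)

print(unido)
```

A left join keeps the row order of the sowing table, so the result lines up with the original records without an explicit loop.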


### 1.3. Merging datasets: MIDAGRI-SISAP with SENAMHI

This section merges the df_siembra_precios dataset (MIDAGRI-SISAP) with the SENAMHI dataset

#### 1.3.1. Select a Month and a Year (SENAMHI)

Sort the SENAMHI dataset by department name

In [None]:
def Clima_Tabla_Ordenada(df_clima):
  # Sort by department (descending), keep the nine relevant columns,
  # and reset the index so rows can be addressed by position 0..n-1
  columnas = ['depa', 'prov', 'dist', 'nombre', 'year', 'month', 'temp', 'hum', 'precip']
  df_sorted = df_clima.sort_values(by='depa', ascending=False)
  return df_sorted[columnas].reset_index(drop=True)

Select the data for a given month and year

In [None]:
def tabla_clima_mes(nro_mes, nro_anio, data_clima):
  data_mes = data_clima.query("month == @nro_mes and year == @nro_anio")
  data_mes = Clima_Tabla_Ordenada(data_mes)

  return data_mes

In [None]:
# SENAMHI data for month 11 of year 2020:
df_clima_mes = tabla_clima_mes(11, 2020, df_clima)
df_clima_mes.head()

Unnamed: 0,depa,prov,dist,nombre,year,month,temp,hum,precip
0,UCAYALI,PURUS,PURUS,PUERTO ESPERANZA,2020,11,26.45,85.65,202.52
1,UCAYALI,PADRE ABAD,IRAZOLA,SAN ALEJANDRO,2020,11,26.7,86.1,87.9
2,UCAYALI,PADRE ABAD,PADRE ABAD,AGUAYTIA,2020,11,27.96,85.62,92.0
3,UCAYALI,PADRE ABAD,CURIMANA,EL MARONAL,2020,11,26.58,90.13,144.1
4,TUMBES,CONTRALMIRANTE VILLAR,CASITAS,CAÑAVERAL,2020,11,26.49,77.96,2.12


#### 1.3.2. Merge the data

Function that adds the information in 'df_clima_mes' to the MIDAGRI-SISAP dataset

In [None]:
def juntar_df_clima(df_siembra_precios_, df_clima_mes_):
  arr_temperatura = []
  arr_humedad = []
  arr_precipitacion = []

  for k in range(len(df_siembra_precios_['DISTRITO'])):
    for i in range(len(df_clima_mes_['dist'])):
      # Use the first station (in table order) that shares the district,
      # the province, or the department with the sowing record
      if (df_siembra_precios_['DISTRITO'][k] == df_clima_mes_['dist'][i]
          or df_siembra_precios_['PROVINICA'][k] == df_clima_mes_['prov'][i]
          or df_siembra_precios_['DEPARTAMENTO'][k] == df_clima_mes_['depa'][i]):
        arr_temperatura.append(df_clima_mes_['temp'][i])
        arr_humedad.append(df_clima_mes_['hum'][i])
        arr_precipitacion.append(df_clima_mes_['precip'][i])
        break

    # No station matched: fill with 0 as a "missing" marker
    if len(arr_temperatura) <= k:
      arr_temperatura.append(0)
      arr_humedad.append(0)
      arr_precipitacion.append(0)

  np_temperatura = np.array(arr_temperatura)
  np_humedad = np.array(arr_humedad)
  np_precipitacion = np.array(arr_precipitacion)

  df_siembra_precios_['Temperatura'] = np_temperatura.tolist()
  df_siembra_precios_['Humedad'] = np_humedad.tolist()
  df_siembra_precios_['Precipitación'] = np_precipitacion.tolist()
  
  return df_siembra_precios_

df_siembra_precios_clima (df_spc or spc) = Union of the SISAP, MIDAGRI, and SENAMHI datasets

In [None]:
df_siembra_precios_clima = juntar_df_clima(df_siembra_precios, df_clima_mes)
df_siembra_precios_clima.head()

Unnamed: 0,DEPARTAMENTO,PROVINICA,DISTRITO,CULTIVO,CAMPANA,AGO,SEP,OCT,NOV,DIC,...,ABR,MAY,JUN,JUL,Precio Promedio,Volumen Promedio,Temperatura,Humedad,Precipitación,Ubigeo
0,ANCASH,AIJA,SUCCHA,Papa nativa,2020-2021,0,0,1,0,0,...,0,0,0,0,1.65,3033.0,14.14,87.0,112.0,ANCASH-AIJA-SUCCHA
1,ANCASH,ANTONIO RAYMONDI,LLAMELLIN,Olluco,2020-2021,0,0,1,0,0,...,0,0,0,0,1.7,2954.0,14.14,87.0,112.0,ANCASH-ANTONIO RAYMONDI-LLAMELLIN
2,ANCASH,ANTONIO RAYMONDI,LLAMELLIN,Quinua,2020-2021,0,0,1,0,0,...,0,0,0,0,0.0,0.0,14.14,87.0,112.0,ANCASH-ANTONIO RAYMONDI-LLAMELLIN
3,ANCASH,ANTONIO RAYMONDI,ACZO,Quinua,2020-2021,0,0,1,0,0,...,0,0,0,0,0.0,0.0,14.14,87.0,112.0,ANCASH-ANTONIO RAYMONDI-ACZO
4,ANCASH,ANTONIO RAYMONDI,CHINGAS,Olluco,2020-2021,0,0,1,0,0,...,0,0,0,0,1.7,2954.0,14.14,87.0,112.0,ANCASH-ANTONIO RAYMONDI-CHINGAS
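
Because the match in `juntar_df_clima` accepts a district, province, or department hit in a single `or` condition, every record takes the first station in table order that matches at any level. One possible refinement, shown here only as a sketch, is to prefer the most specific level and fall back to broader ones. The helper `buscar_clima` and the toy values in `clima_demo` are hypothetical and are not part of the notebook.

```python
import pandas as pd

def buscar_clima(fila, clima):
    """Return (temp, hum, precip) from the most specific matching station."""
    # Try district first, then province, then department
    niveles = [('DISTRITO', 'dist'), ('PROVINICA', 'prov'), ('DEPARTAMENTO', 'depa')]
    for col_siembra, col_clima in niveles:
        coincidencias = clima[clima[col_clima] == fila[col_siembra]]
        if not coincidencias.empty:
            primera = coincidencias.iloc[0]
            return primera['temp'], primera['hum'], primera['precip']
    return 0, 0, 0  # same "no match" marker the notebook uses

# Hypothetical station table with made-up readings
clima_demo = pd.DataFrame({
    'depa': ['ANCASH', 'ANCASH'],
    'prov': ['HUARAZ', 'AIJA'],
    'dist': ['HUARAZ', 'SUCCHA'],
    'temp': [14.5, 11.0],
    'hum': [87.0, 70.0],
    'precip': [112.0, 50.0],
})
fila_demo = pd.Series({'DEPARTAMENTO': 'ANCASH', 'PROVINICA': 'AIJA',
                       'DISTRITO': 'SUCCHA'})

resultado = buscar_clima(fila_demo, clima_demo)
print(resultado)
```

With the single `or` condition, this record would take the HUARAZ station because it comes first and shares the department. The level-by-level search picks the SUCCHA station instead. Province and district names can repeat across departments, so a production version would likely compare the full department-province-district key.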


### 1.4. Preprocessing

#### 1.4.1. Reduce Columns:

Drop the columns of the dataset that are not needed

In [None]:
def reducir_tabla_spc_columnas(df_spc):
  # Drop the monthly sowing flags and the campaign column in a single call.
  # The `columns=` keyword replaces the positional axis argument, which
  # recent pandas versions no longer accept.
  columnas_a_eliminar = ['AGO', 'SEP', 'OCT', 'NOV', 'DIC', 'ENE', 'FEB',
                         'MAR', 'ABR', 'MAY', 'JUN', 'JUL', 'CAMPANA']
  return df_spc.drop(columns=columnas_a_eliminar)

In [None]:
df_spc_ = df_siembra_precios_clima

In [None]:
df_spc_ = reducir_tabla_spc_columnas(df_spc_)
df_spc_.head()

  


Unnamed: 0,DEPARTAMENTO,PROVINICA,DISTRITO,CULTIVO,Precio Promedio,Volumen Promedio,Temperatura,Humedad,Precipitación,Ubigeo
0,ANCASH,AIJA,SUCCHA,Papa nativa,1.65,3033.0,14.14,87.0,112.0,ANCASH-AIJA-SUCCHA
1,ANCASH,ANTONIO RAYMONDI,LLAMELLIN,Olluco,1.7,2954.0,14.14,87.0,112.0,ANCASH-ANTONIO RAYMONDI-LLAMELLIN
2,ANCASH,ANTONIO RAYMONDI,LLAMELLIN,Quinua,0.0,0.0,14.14,87.0,112.0,ANCASH-ANTONIO RAYMONDI-LLAMELLIN
3,ANCASH,ANTONIO RAYMONDI,ACZO,Quinua,0.0,0.0,14.14,87.0,112.0,ANCASH-ANTONIO RAYMONDI-ACZO
4,ANCASH,ANTONIO RAYMONDI,CHINGAS,Olluco,1.7,2954.0,14.14,87.0,112.0,ANCASH-ANTONIO RAYMONDI-CHINGAS


#### 1.4.2. Merge the DEPARTAMENTO, PROVINICA, and DISTRITO columns

In [None]:
df_spc_['Ubigeo'] = df_spc_.DEPARTAMENTO + '-' + df_spc_.PROVINICA + '-' + df_spc_.DISTRITO
df_spc_.head()

Unnamed: 0,DEPARTAMENTO,PROVINICA,DISTRITO,CULTIVO,Precio Promedio,Volumen Promedio,Temperatura,Humedad,Precipitación,Ubigeo
0,ANCASH,AIJA,SUCCHA,Papa nativa,1.65,3033.0,14.14,87.0,112.0,ANCASH-AIJA-SUCCHA
1,ANCASH,ANTONIO RAYMONDI,LLAMELLIN,Olluco,1.7,2954.0,14.14,87.0,112.0,ANCASH-ANTONIO RAYMONDI-LLAMELLIN
2,ANCASH,ANTONIO RAYMONDI,LLAMELLIN,Quinua,0.0,0.0,14.14,87.0,112.0,ANCASH-ANTONIO RAYMONDI-LLAMELLIN
3,ANCASH,ANTONIO RAYMONDI,ACZO,Quinua,0.0,0.0,14.14,87.0,112.0,ANCASH-ANTONIO RAYMONDI-ACZO
4,ANCASH,ANTONIO RAYMONDI,CHINGAS,Olluco,1.7,2954.0,14.14,87.0,112.0,ANCASH-ANTONIO RAYMONDI-CHINGAS



#### 1.4.2. Reduce Rows

Remove rows with missing values (rows whose 'Precio Promedio' is 0 because the crop had no match in SISAP)

In [None]:
df_spc_aux = df_spc_

In [None]:
# Drop the rows without a SISAP price (the merge filled their price with 0.0)
filas_sin_precio = df_spc_aux.index[df_spc_aux['Precio Promedio'] == 0.0]
df_spc_aux.drop(filas_sin_precio, axis=0, inplace=True)

In [None]:
df_spc_final = df_spc_aux
df_spc_final.head()

Unnamed: 0,DEPARTAMENTO,PROVINICA,DISTRITO,CULTIVO,Precio Promedio,Volumen Promedio,Temperatura,Humedad,Precipitación,Ubigeo
0,ANCASH,AIJA,SUCCHA,Papa nativa,1.65,3033.0,14.14,87.0,112.0,ANCASH-AIJA-SUCCHA
1,ANCASH,ANTONIO RAYMONDI,LLAMELLIN,Olluco,1.7,2954.0,14.14,87.0,112.0,ANCASH-ANTONIO RAYMONDI-LLAMELLIN
4,ANCASH,ANTONIO RAYMONDI,CHINGAS,Olluco,1.7,2954.0,14.14,87.0,112.0,ANCASH-ANTONIO RAYMONDI-CHINGAS
5,ANCASH,ANTONIO RAYMONDI,CHINGAS,Papa color,0.43,1850.0,14.14,87.0,112.0,ANCASH-ANTONIO RAYMONDI-CHINGAS
7,ANCASH,CORONGO,ACO,Frijol grano seco,1.91,431.0,14.14,87.0,112.0,ANCASH-CORONGO-ACO


View unique values:

In [None]:
print(df_spc_final['DEPARTAMENTO'].unique())
print(len(df_spc_final['DEPARTAMENTO'].unique()))

['ANCASH' 'APURIMAC' 'AREQUIPA' 'CAJAMARCA' 'HUANCAVELICA' 'HUANUCO'
 'JUNIN' 'LIMA' 'MOQUEGUA' 'PUNO' 'CUSCO' 'ICA' 'PIURA' 'AMAZONAS'
 'LA LIBERTAD' 'PASCO' 'TACNA' 'TUMBES' 'LAMBAYEQUE' 'MADRE DE DIOS'
 'UCAYALI' 'LORETO' 'SAN MARTIN']
23


In [None]:
print(df_spc_final['PROVINICA'].unique())
print(len(df_spc_final['PROVINICA'].unique()))

['AIJA' 'ANTONIO RAYMONDI' 'CORONGO' 'ANTABAMBA' 'AYMARAES' 'COTABAMBAS'
 'CARAVELI' 'CASTILLA' 'CAYLLOMA' 'CONDESUYOS' 'CUTERVO' 'HUANCAVELICA'
 'AMBO' 'HUANCAYO' 'CHANCHAMAYO' 'YAULI' 'LIMA' 'HUAROCHIRI' 'YAUYOS'
 'GENERAL SANCHEZ CERRO' 'ILO' 'CHUCUITO' 'MOHO' 'YUNGUYO' 'RECUAY'
 'CHINCHEROS' 'GRAU' 'CONTUMAZA' 'ANTA' 'ACOBAMBA' 'ANGARAES'
 'CASTROVIRREYNA' 'DOS DE MAYO' 'PISCO' 'CONCEPCION' 'CHUPACA' 'MORROPON'
 'PUNO' 'UTCUBAMBA' 'ANDAHUAYLAS' 'CAJAMARCA' 'SAN MARCOS' 'SAN MIGUEL'
 'CANAS' 'CANCHIS' 'SANTIAGO DE CHUCO' 'BARRANCA' 'HUAURA' 'AZANGARO'
 'LAMPA' 'SAN ANTONIO DE PUTINA' 'CHACHAPOYAS' 'BONGARA' 'LUYA' 'CARHUAZ'
 'CARLOS FERMIN FITZCARRALD' 'HUAYLAS' 'PALLASCA' 'YUNGAY' 'ABANCAY'
 'CHOTA' 'LA CONVENCION' 'URUBAMBA' 'HUAYTARA' 'TAYACAJA' 'HUAMALIES'
 'TARMA' 'OTUZCO' 'OYON' 'DANIEL ALCIDES CARRION' 'TACNA' 'JORGE BASADRE'
 'CELENDIN' 'JAEN' 'PAUCARTAMBO' 'PATAZ' 'CANETE' 'SANDIA' 'TUMBES'
 'BOLOGNESI' 'POMABAMBA' 'CHINCHA' 'SANCHEZ CARRION' 'HUARAZ' 'HUARI'
 'MARISCAL LUZ

In [None]:
print(df_spc_final['DISTRITO'].unique())
print(len(df_spc_final['DISTRITO'].unique()))

['SUCCHA' 'LLAMELLIN' 'CHINGAS' ... 'COMANDANTE NOEL' 'MAGDALENA DE CAO'
 'CAYALTI']
1399


In [None]:
print(df_spc_final['Ubigeo'].unique())
print(len(df_spc_final['Ubigeo'].unique()))

['ANCASH-AIJA-SUCCHA' 'ANCASH-ANTONIO RAYMONDI-LLAMELLIN'
 'ANCASH-ANTONIO RAYMONDI-CHINGAS' ...
 'LA LIBERTAD-ASCOPE-MAGDALENA DE CAO' 'PIURA-MORROPON-BUENOS AIRES'
 'LAMBAYEQUE-CHICLAYO-CAYALTI']
1502


Drop the DEPARTAMENTO, PROVINICA, and DISTRITO columns

In [None]:
df_spc_final = df_spc_final.drop(columns=['DEPARTAMENTO', 'PROVINICA', 'DISTRITO'])


In [None]:
df_spc_final.head(5)

Unnamed: 0,CULTIVO,Precio Promedio,Volumen Promedio,Temperatura,Humedad,Precipitación,Ubigeo
0,Papa nativa,1.65,3033.0,14.14,87.0,112.0,ANCASH-AIJA-SUCCHA
1,Olluco,1.7,2954.0,14.14,87.0,112.0,ANCASH-ANTONIO RAYMONDI-LLAMELLIN
4,Olluco,1.7,2954.0,14.14,87.0,112.0,ANCASH-ANTONIO RAYMONDI-CHINGAS
5,Papa color,0.43,1850.0,14.14,87.0,112.0,ANCASH-ANTONIO RAYMONDI-CHINGAS
7,Frijol grano seco,1.91,431.0,14.14,87.0,112.0,ANCASH-CORONGO-ACO


#### 1.4.3. Converting Categorical Data to Numeric

One-Hot Encoder:

In [None]:
from sklearn.preprocessing import OneHotEncoder

# Create an instance of the one-hot encoder
encoder = OneHotEncoder(handle_unknown='ignore')

# Perform one-hot encoding on the 'Ubigeo' column; reuse the original index
# so the encoded rows stay aligned with df_spc_final after rows were dropped
encoder_df = pd.DataFrame(encoder.fit_transform(df_spc_final[['Ubigeo']]).toarray(),
                          index=df_spc_final.index)

# Merge the one-hot encoded columns back with the original DataFrame
final_df = df_spc_final.join(encoder_df)
final_df.head()

Unnamed: 0,CULTIVO,Precio Promedio,Volumen Promedio,Temperatura,Humedad,Precipitación,Ubigeo,0,1,2,...,1492,1493,1494,1495,1496,1497,1498,1499,1500,1501
0,Papa nativa,1.65,3033.0,14.14,87.0,112.0,ANCASH-AIJA-SUCCHA,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Olluco,1.7,2954.0,14.14,87.0,112.0,ANCASH-ANTONIO RAYMONDI-LLAMELLIN,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Olluco,1.7,2954.0,14.14,87.0,112.0,ANCASH-ANTONIO RAYMONDI-CHINGAS,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Papa color,0.43,1850.0,14.14,87.0,112.0,ANCASH-ANTONIO RAYMONDI-CHINGAS,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,Frijol grano seco,1.91,431.0,14.14,87.0,112.0,ANCASH-CORONGO-ACO,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Label Encoder

In [None]:
from sklearn import preprocessing
# Encode all categorical variables, since the classifiers only accept numeric data
categorical_feature_mask = df_spc_final.dtypes==object
categorical_cols = df_spc_final.columns[categorical_feature_mask].tolist()
le = preprocessing.LabelEncoder()
df_spc_final[categorical_cols] = df_spc_final[categorical_cols].apply(lambda col: le.fit_transform(col))
df_spc_final.head()

Unnamed: 0,CULTIVO,Precio Promedio,Volumen Promedio,Temperatura,Humedad,Precipitación,Ubigeo
0,14,1.65,3033.0,14.14,87.0,112.0,88
1,11,1.7,2954.0,14.14,87.0,112.0,92
4,11,1.7,2954.0,14.14,87.0,112.0,91
5,13,0.43,1850.0,14.14,87.0,112.0,91
7,7,1.91,431.0,14.14,87.0,112.0,130
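
The integer codes only become useful recommendations if they can be mapped back to crop names. Because `fit_transform` refits the encoder on every column it is applied to, a single shared encoder only remembers the last column it saw. The sketch below, on toy data, keeps a dedicated encoder for the crop column; `cultivos_demo` and `codificador` are hypothetical names used only for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy crop column standing in for df_spc_final['CULTIVO']
cultivos_demo = pd.Series(['Papa nativa', 'Olluco', 'Olluco', 'Quinua'])

# Dedicated encoder for the target so its mapping is not overwritten
codificador = LabelEncoder()
codigos = codificador.fit_transform(cultivos_demo)

# Classes are stored in sorted order; each code is an index into classes_
print(list(codificador.classes_))
print(list(codigos))

# Recover the original names from the integer codes
recuperados = codificador.inverse_transform(codigos)
print(list(recuperados))
```

Keeping one encoder per column means `inverse_transform` can decode model predictions back into crop names.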


Move the target to the last position

In [None]:
df_spc_final['Cultivo'] = df_spc_final['CULTIVO']

In [None]:
df_spc_final = df_spc_final.drop('CULTIVO', axis=1)
df_spc_final.head()

Unnamed: 0,Precio Promedio,Volumen Promedio,Temperatura,Humedad,Precipitación,Ubigeo,Cultivo
0,1.65,3033.0,14.14,87.0,112.0,88,14
1,1.7,2954.0,14.14,87.0,112.0,92,11
4,1.7,2954.0,14.14,87.0,112.0,91,11
5,0.43,1850.0,14.14,87.0,112.0,91,13
7,1.91,431.0,14.14,87.0,112.0,130,7


#### 1.4.4. Handling Null Values

Create an auxiliary variable

In [None]:
df_spc_final_ = df_spc_final

Check whether there are null values

In [None]:
print("number of records: ", len(df_spc_final_))
df_spc_final_.isnull().sum()

number of records:  9532


Precio Promedio       0
Volumen Promedio      0
Temperatura         142
Humedad             142
Precipitación         0
Ubigeo                0
Cultivo               0
dtype: int64

Remove the rows that contain null values

In [None]:
df_spc_final_.dropna(subset = ["Temperatura"], axis = 0, inplace = True)

In [None]:
df_spc_final_.dropna(subset = ["Humedad"], axis = 0, inplace = True)

Confirm that no null values remain

In [None]:
print("number of records: ", len(df_spc_final_))
df_spc_final_.isnull().sum()

number of records:  9390


Precio Promedio     0
Volumen Promedio    0
Temperatura         0
Humedad             0
Precipitación       0
Ubigeo              0
Cultivo             0
dtype: int64

#### 1.4.5. Normalization

In [None]:
# Min-max normalization of the dataset
ind_clase = 6 # Index of the class (target) column, which is left unscaled
names = df_spc_final_.columns.values # Store the dataset's column names in an array
dfn = df_spc_final_.copy()
for i in range(len(names)):
  if i != ind_clase:
    dfn[names[i]] = (df_spc_final_[names[i]] - df_spc_final_[names[i]].min())/ (df_spc_final_[names[i]].max()-df_spc_final_[names[i]].min())
dfn.describe()

Unnamed: 0,Precio Promedio,Volumen Promedio,Temperatura,Humedad,Precipitación,Ubigeo,Cultivo
count,9390.0,9390.0,9390.0,9390.0,9390.0,9390.0,9390.0
mean,0.116198,0.189309,0.547463,0.786949,0.206439,0.457771,9.685729
std,0.115743,0.285737,0.193072,0.152633,0.216462,0.272863,4.730185
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.060109,0.052681,0.414485,0.728573,0.034953,0.221352,7.0
50%,0.110383,0.073172,0.480789,0.830567,0.130724,0.436376,10.0
75%,0.139891,0.14607,0.653519,0.860111,0.324712,0.680047,13.0
max,1.0,1.0,1.0,1.0,1.0,1.0,19.0


In [None]:
dfn.head()

Unnamed: 0,Precio Promedio,Volumen Promedio,Temperatura,Humedad,Precipitación,Ubigeo,Cultivo
0,0.133333,0.082706,0.480789,0.908237,0.391472,0.058628,14
1,0.138798,0.080541,0.480789,0.908237,0.391472,0.061292,11
4,0.138798,0.080541,0.480789,0.908237,0.391472,0.060626,11
5,0.0,0.050297,0.480789,0.908237,0.391472,0.060626,13
7,0.161749,0.011424,0.480789,0.908237,0.391472,0.086609,7
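
The normalization cell applies min-max scaling, x' = (x - min) / (max - min), to every column except the target. A minimal sketch of the same idea as a reusable function is shown below. The helper `normalizar_min_max` and the toy frame `demo` are hypothetical, and the guard for constant columns is an addition that the notebook cell does not have.

```python
import pandas as pd

def normalizar_min_max(df, columna_objetivo):
    """Scale every column except the target to the [0, 1] range."""
    resultado = df.copy()
    for nombre in df.columns:
        if nombre == columna_objetivo:
            continue  # leave the class labels untouched
        minimo, maximo = df[nombre].min(), df[nombre].max()
        rango = maximo - minimo
        # Guard against constant columns, which would divide by zero
        resultado[nombre] = 0.0 if rango == 0 else (df[nombre] - minimo) / rango
    return resultado

# Toy data with made-up values
demo = pd.DataFrame({'Temperatura': [10.0, 15.0, 20.0],
                     'Cultivo': [7, 11, 14]})
demo_norm = normalizar_min_max(demo, 'Cultivo')
print(demo_norm)
```

Selecting the target by name rather than by position avoids having to update a hard-coded index if the column order changes.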


#### 1.4.6. End of preprocessing

In [None]:
df_spc_final = dfn

Save the dataset

In [None]:
df_spc_final_.to_csv('SPC_Final.csv', index=False)

Preview the renamed dataset

In [None]:
df_spc_final.head()

Unnamed: 0,Precio Promedio,Volumen Promedio,Temperatura,Humedad,Precipitación,Ubigeo,Cultivo
0,0.133333,0.082706,0.480789,0.908237,0.391472,0.058628,14
1,0.138798,0.080541,0.480789,0.908237,0.391472,0.061292,11
4,0.138798,0.080541,0.480789,0.908237,0.391472,0.060626,11
5,0.0,0.050297,0.480789,0.908237,0.391472,0.060626,13
7,0.161749,0.011424,0.480789,0.908237,0.391472,0.086609,7


The 'df_spc_final' dataset is the final dataset produced by the data preprocessing. It combines the three datasets needed to make predictions.

## **2) Prediction: MACHINE LEARNING Model**

Import the libraries for the Machine Learning model

In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve
from sklearn.metrics import precision_recall_fscore_support
#-------------
from numpy import set_printoptions
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
import heapq
from sklearn.linear_model import LassoCV
#--------------
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from numpy import *
from sklearn.ensemble import RandomForestClassifier as RFC

In this section, the SVM, KNN, and RF algorithms are trained on the 'df_spc_final' dataset.
The goal is to predict the 'Cultivo' variable from the variables [Ubigeo (DEPARTAMENTO, PROVINICA, and DISTRITO combined), Precio Promedio, Volumen Promedio, Temperatura, Humedad, Precipitación]


> The target is the variable: **Cultivo**



In [None]:
df_spc_final.head()

Unnamed: 0,Precio Promedio,Volumen Promedio,Temperatura,Humedad,Precipitación,Ubigeo,Cultivo
0,0.133333,0.082706,0.480789,0.908237,0.391472,0.058628,14
1,0.138798,0.080541,0.480789,0.908237,0.391472,0.061292,11
4,0.138798,0.080541,0.480789,0.908237,0.391472,0.060626,11
5,0.0,0.050297,0.480789,0.908237,0.391472,0.060626,13
7,0.161749,0.011424,0.480789,0.908237,0.391472,0.086609,7


#### 2.1. Model Training

The model is trained through the 'test_model' function, using stratified cross-validation with k folds, where k equals 10.

This function trains each algorithm and returns its results, averaged over the folds, for the metrics accuracy, precision, recall, and F1.

In [None]:
def test_model(model, df, nfols):  # Evaluate a model with cross-validated metrics
    X = df.iloc[:, :-1].values  # Features
    Y = df.iloc[:, -1].values  # Target column
    skf = StratifiedKFold(n_splits=nfols)

    acuracy = 0
    precision = 0
    recall = 0
    f1 = 0

    # Stratified cross-validation with k folds
    for train_index, test_index in skf.split(X, Y):
        X_train, X_test = X[train_index], X[test_index]
        Y_train, Y_test = Y[train_index], Y[test_index]

        model.fit(X_train, Y_train)
        predicciones = model.predict(X_test)

        acuracy += accuracy_score(Y_test, predicciones)

        # Weighted average across classes; undefined metrics count as 0
        precisionp, recallp, f1p, support = precision_recall_fscore_support(
            Y_test, predicciones, pos_label=1, average='weighted', zero_division=0)

        precision += precisionp
        recall += recallp
        f1 += f1p

    # Average each metric over the folds
    return acuracy/nfols, precision/nfols, recall/nfols, f1/nfols
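
For comparison, scikit-learn's `cross_validate` can compute the same four averaged metrics without a manual loop. The sketch below is an alternative illustration, not the notebook's method. It uses the bundled iris data as a stand-in for `df_spc_final` and a decision tree as a placeholder model, and the built-in scorers use the default `zero_division` handling rather than the explicit `zero_division=0` above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Stand-in data, not the project dataset
X_demo, y_demo = load_iris(return_X_y=True)

# Built-in scorer names for weighted multi-class metrics
metricas = ['accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted']

resultados = cross_validate(
    DecisionTreeClassifier(random_state=0), X_demo, y_demo,
    cv=StratifiedKFold(n_splits=10), scoring=metricas)

# cross_validate returns one score per fold under the key 'test_<metric>'
promedios = {m: resultados[f'test_{m}'].mean() for m in metricas}
print(promedios)
```

Each entry in `promedios` corresponds to one of the values returned by `test_model`.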

#### 2.2. SVM Algorithm

Function to evaluate the results obtained with the SVM algorithm

In [None]:
def svm(df):

  # Define variables for the metrics:
  accuracy = 0
  precision = 0
  recall = 0
  f1 = 0

  # SVM: the hyperparameters were chosen through a search with GridSearchCV()
  best_svm = SVC(C=135, kernel='rbf', probability=True) 
  
  # Train the SVM model:
  acc, pre, rec, f1_ = test_model(best_svm, df, 10)
  accuracy = acc
  precision = pre
  recall = rec
  f1 = f1_

  return accuracy, precision, recall, f1

In [None]:
acc, pre, re, f1 = svm(df_spc_final)
print("Accuracy obtained with the Support Vector Machine (SVM) model:", acc)
print("Precision obtained with the Support Vector Machine (SVM) model:", pre)
print("Recall obtained with the Support Vector Machine (SVM) model:", re)
print("F1 obtained with the Support Vector Machine (SVM) model:", f1)

Accuracy obtained with the Support Vector Machine (SVM) model: 0.9975505857294994
Precision obtained with the Support Vector Machine (SVM) model: 0.9976821797379104
Recall obtained with the Support Vector Machine (SVM) model: 0.9975505857294994
F1 obtained with the Support Vector Machine (SVM) model: 0.9975549357445818
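
The comments in the model functions mention that the hyperparameters were chosen with `GridSearchCV()`, but the search itself is not shown. A minimal sketch of such a search is given below. The parameter grid is hypothetical (the values actually searched in the project are not documented), and the iris data is only a stand-in for `df_spc_final`.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Stand-in data, not the project dataset
X_demo, y_demo = load_iris(return_X_y=True)

# Hypothetical grid: the values searched in the project are not documented
parametros = {'C': [1, 10, 135], 'kernel': ['rbf', 'linear']}

# Try every combination with stratified 5-fold cross-validation
busqueda = GridSearchCV(SVC(), parametros,
                        cv=StratifiedKFold(n_splits=5), scoring='accuracy')
busqueda.fit(X_demo, y_demo)

print(busqueda.best_params_)
print(round(busqueda.best_score_, 3))
```

The same pattern applies to KNN and RF by swapping the estimator and the grid.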


#### 2.3. KNN Algorithm

Function to evaluate the results obtained with the KNN algorithm

In [None]:
def knn(df):

  # Define variables for the metrics:
  accuracy = 0
  precision = 0
  recall = 0
  f1 = 0

  # KNN: the hyperparameters were chosen through a search with GridSearchCV()
  best_knn = KNN(n_neighbors=20, weights='distance', algorithm='brute')

  # Train the KNN model:
  acc, pre, rec, f1_ = test_model(best_knn, df, 10)
  accuracy = acc
  precision = pre
  recall = rec
  f1 = f1_

  return accuracy, precision, recall, f1

In [None]:
acc, pre, re, f1 = knn(df_spc_final)
print("Accuracy obtained with the K-Nearest Neighbors (KNN) model:", acc)
print("Precision obtained with the K-Nearest Neighbors (KNN) model:", pre)
print("Recall obtained with the K-Nearest Neighbors (KNN) model:", re)
print("F1 obtained with the K-Nearest Neighbors (KNN) model:", f1)

Accuracy obtained with the K-Nearest Neighbors (KNN) model: 0.9473908413205537
Precision obtained with the K-Nearest Neighbors (KNN) model: 0.9512887081178274
Recall obtained with the K-Nearest Neighbors (KNN) model: 0.9473908413205537
F1 obtained with the K-Nearest Neighbors (KNN) model: 0.947366541374046


#### 2.4. RF Algorithm

In [None]:
def rf(df):

  # Define variables for the metrics:
  accuracy = 0
  precision = 0
  recall = 0
  f1 = 0

  # RF: the hyperparameters were chosen through a search with GridSearchCV()
  best_rf = RFC(n_estimators=20, max_depth=100, max_features='log2', bootstrap=False)

  # Train the RF model:
  acc, pre, rec, f1_ = test_model(best_rf, df, 10)
  accuracy = acc
  precision = pre
  recall = rec
  f1 = f1_

  return accuracy, precision, recall, f1

In [None]:
acc, pre, re, f1 = rf(df_spc_final)
print("Accuracy obtained with the Random Forest (RF) model:", acc)
print("Precision obtained with the Random Forest (RF) model:", pre)
print("Recall obtained with the Random Forest (RF) model:", re)
print("F1 obtained with the Random Forest (RF) model:", f1)

Accuracy obtained with the Random Forest (RF) model: 0.9993610223642172
Precision obtained with the Random Forest (RF) model: 0.9993820402968903
Recall obtained with the Random Forest (RF) model: 0.9993610223642172
F1 obtained with the Random Forest (RF) model: 0.9993545577793126


## **3) Results**

### 3.1. Evaluation Metrics

* ***Machine Learning Metrics Results***

![Boxplot Step](https://i.ibb.co/yFTDs8Y/kjudgsgsgdgdgd.png)