### Analysis of the avocado price dataset

In this kernel we analyze a dataset, applying what we know about EDA and getting familiar with useful pandas and numpy functions.

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import LabelEncoder

import pylab as pl
import os
print(os.listdir("../input"))




['avocado.csv']


### Data visualization

![Avocados for everyone](https://exoticfruitbox.com/wp-content/uploads/2015/10/aguacate.jpg)

## **Data analysis**

Before working with the data, we first need to know the data and the data types we are dealing with. In this case we have taken a dataset of avocado prices.

In [2]:
#Load the dataset with pandas using read_csv
#Specifying the separator explicitly with delimiter is recommended
#If a column holds unique IDs, we can pass its column number as index_col to use it as the index
data = pd.read_csv("../input/avocado.csv",delimiter=",",index_col=0)

#Show the number of rows and columns in our data
data.shape

(18249, 13)
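
To make the effect of `index_col` concrete, here is a minimal sketch that reads two rows in raw CSV form from memory. It assumes the file's first column has no header (which is why it shows up as `Unnamed: 0` when it is not used as the index) and keeps only a subset of the columns for brevity. `parse_dates` is an optional extra that this notebook does not use.

```python
import io
import pandas as pd

# Two rows taken from this dataset, reduced to a few columns (assumed raw layout:
# the first column has an empty header)
raw = io.StringIO(
    ",Date,AveragePrice,Total Volume,type,year,region\n"
    "0,2015-12-27,1.33,64236.62,conventional,2015,Albany\n"
    "1,2015-12-20,1.35,54876.98,conventional,2015,Albany\n"
)

# index_col=0 uses the first column as the index; parse_dates converts Date on load
sample = pd.read_csv(raw, delimiter=",", index_col=0, parse_dates=["Date"])

print(sample.shape)
print(sample.dtypes["Date"])
```

Keep in mind that an index column is only a unique ID if its values do not repeat, which can be checked with `sample.index.is_unique`.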

In [3]:
#The info method lets us identify the type of each field
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 18249 entries, 0 to 11
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Date          18249 non-null  object 
 1   AveragePrice  18249 non-null  float64
 2   Total Volume  18249 non-null  float64
 3   4046          18249 non-null  float64
 4   4225          18249 non-null  float64
 5   4770          18249 non-null  float64
 6   Total Bags    18249 non-null  float64
 7   Small Bags    18249 non-null  float64
 8   Large Bags    18249 non-null  float64
 9   XLarge Bags   18249 non-null  float64
 10  type          18249 non-null  object 
 11  year          18249 non-null  int64  
 12  region        18249 non-null  object 
dtypes: float64(9), int64(1), object(3)
memory usage: 1.9+ MB


In [4]:
#Take a quick look at the first rows of the data
data.head(5)

Unnamed: 0,Date,AveragePrice,Total Volume,4046,4225,4770,Total Bags,Small Bags,Large Bags,XLarge Bags,type,year,region
0,2015-12-27,1.33,64236.62,1036.74,54454.85,48.16,8696.87,8603.62,93.25,0.0,conventional,2015,Albany
1,2015-12-20,1.35,54876.98,674.28,44638.81,58.33,9505.56,9408.07,97.49,0.0,conventional,2015,Albany
2,2015-12-13,0.93,118220.22,794.7,109149.67,130.5,8145.35,8042.21,103.14,0.0,conventional,2015,Albany
3,2015-12-06,1.08,78992.15,1132.0,71976.41,72.58,5811.16,5677.4,133.76,0.0,conventional,2015,Albany
4,2015-11-29,1.28,51039.6,941.48,43838.39,75.78,6183.95,5986.26,197.69,0.0,conventional,2015,Albany


As noted above, we need to understand the data and the meaning of each column correctly. Only then can we move on to any later step.

Here is what each column tells us:

| Name | Description |
|---------|--------------|
| Date | Date of the observation |
| AveragePrice | Average price of a single avocado |
| type | Type (conventional or organic) |
| year | Year |
| region | City or region of the observation |
| Total Volume | Total volume sold |
| 4046 | Total number of avocados with PLU code 4046 sold |
| 4225 | Total number of avocados with PLU code 4225 sold |
| 4770 | Total number of avocados with PLU code 4770 sold |
| Total Bags | Total number of bags sold |
| Small Bags | Total number of small bags |
| Large Bags | Total number of large bags |
| XLarge Bags | Total number of extra-large bags |

Note that:

Total Volume = 4046 + 4225 + 4770 + Total Bags

Total Bags = Small Bags + Large Bags + XLarge Bags

More information about the data is available from the [Hass Avocado Board](http://www.hassavocadoboard.com/retail/volume-and-price-data)

## **Data consistency**

Now that we know the data, we check its consistency.

In [5]:
#Check that both formulas hold
#Any record that breaks either rule is treated as an inconsistency and discarded.
dataIncorrecta = data[((data['4046'] + data['4225'] + data['4770'] + data['Total Bags']) != data['Total Volume'] ) | ((data['Small Bags'] + data['Large Bags'] + data['XLarge Bags']) != data['Total Bags'] )]
dataCorrecta =  data[((data['4046'] + data['4225'] + data['4770'] + data['Total Bags']) == data['Total Volume'] ) & ((data['Small Bags'] + data['Large Bags'] + data['XLarge Bags']) == data['Total Bags'] )]

dataIncorrecta.reset_index(inplace = True)
dataCorrecta.reset_index(inplace = True)

print("Data Total",data.shape)
print("Data Incorrecta",dataIncorrecta.shape)
print("Data Correcta",dataCorrecta.shape)

Data Total (18249, 13)
Data Incorrecta (8143, 14)
Data Correcta (10106, 14)
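
The check above compares `float64` sums with exact equality. Row 1 of the `head()` output (2015-12-20) adds up correctly on paper (674.28 + 44638.81 + 58.33 + 9505.56 = 54876.98 and 9408.07 + 97.49 + 0.0 = 9505.56), yet it does not appear among the first rows of `dataCorrecta` later on. That suggests some of the 8143 rejected rows may fail because of binary rounding rather than a real inconsistency. A tolerance-based comparison is one possible alternative, sketched below. The tolerance of 0.005 is an assumption based on the values being shown with two decimals, and this variant would change the row counts reported in this notebook.

```python
import numpy as np
import pandas as pd

# Rows 0 and 1 from the head() output of the raw data
rows = pd.DataFrame({
    "Total Volume": [64236.62, 54876.98],
    "4046": [1036.74, 674.28],
    "4225": [54454.85, 44638.81],
    "4770": [48.16, 58.33],
    "Total Bags": [8696.87, 9505.56],
    "Small Bags": [8603.62, 9408.07],
    "Large Bags": [93.25, 97.49],
    "XLarge Bags": [0.0, 0.0],
})

volume_sum = rows["4046"] + rows["4225"] + rows["4770"] + rows["Total Bags"]
bags_sum = rows["Small Bags"] + rows["Large Bags"] + rows["XLarge Bags"]

# atol=0.005 is an assumed tolerance (half of the smallest displayed unit)
volume_ok = np.isclose(volume_sum, rows["Total Volume"], rtol=0, atol=0.005)
bags_ok = np.isclose(bags_sum, rows["Total Bags"], rtol=0, atol=0.005)

# Keep only rows where both identities hold within the tolerance
consistent = rows[volume_ok & bags_ok]
print(len(consistent), "of", len(rows), "rows pass both checks")
```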


In [6]:
#Check for missing values, in case there are any
dataCorrecta.isnull().sum()

index           0
Date            0
AveragePrice    0
Total Volume    0
4046            0
4225            0
4770            0
Total Bags      0
Small Bags      0
Large Bags      0
XLarge Bags     0
type            0
year            0
region          0
dtype: int64

## **Feature engineering**
We plot the data to pick out anomalies or extreme values, so that our data is as standardized as possible.

In [7]:
dataCorrecta =dataCorrecta[(dataCorrecta['Small Bags']<300000.0) & (dataCorrecta['Large Bags']<150000.0) & (dataCorrecta['XLarge Bags']<10000.0)]
dataCorrecta.shape

(8888, 14)
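
The cut-offs used in the filter above are fixed numbers. As a rough sketch of how such limits could be derived from the data itself, the example below applies Tukey's rule (values above Q3 + 1.5 × IQR) to a toy column. Both the rule and the toy values are assumptions for illustration, not the criterion this notebook used.

```python
import pandas as pd

# Hypothetical toy values; the last one plays the role of an extreme observation
bags = pd.DataFrame(
    {"Small Bags": [8603.62, 9408.07, 8042.21, 5677.40, 5986.26, 950000.0]}
)

# describe() summarizes the distribution; a max far above the 75% quartile hints at outliers
print(bags["Small Bags"].describe())

# Tukey's rule: flag values above Q3 + 1.5 * IQR
q1 = bags["Small Bags"].quantile(0.25)
q3 = bags["Small Bags"].quantile(0.75)
upper = q3 + 1.5 * (q3 - q1)

filtered = bags[bags["Small Bags"] <= upper]
print(len(bags) - len(filtered), "row(s) above the limit of", round(upper, 2))
```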

In [8]:
conventional = dataCorrecta[dataCorrecta.type=="conventional"]
organic = dataCorrecta[dataCorrecta.type=="organic"]

groupBy1_price = conventional.groupby('Date').mean(numeric_only=True)

groupBy2_price = organic.groupby('Date').mean(numeric_only=True)


In [9]:
#dataCorrecta['Date2']=pd.to_datetime(dataCorrecta['Date'], format="%Y/%m/%d")
dataCorrecta['Date'] =dataCorrecta['Date'].astype('datetime64[ns]')
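
As a small illustration of the conversion, `pd.to_datetime` with an explicit format is an alternative to `astype`. The dates in this dataset are written with dashes, so the matching format string is `"%Y-%m-%d"`; a format with slashes would not match them. The sketch below uses three dates copied from the output shown earlier.

```python
import pandas as pd

dates = pd.Series(["2015-12-27", "2015-12-13", "2015-11-29"], name="Date")

# The strings use dashes, so the explicit format is "%Y-%m-%d"
parsed = pd.to_datetime(dates, format="%Y-%m-%d")

print(parsed.dtype)
# Once the column is a datetime, the .dt accessor exposes its parts
print(parsed.dt.year.tolist(), parsed.dt.month.tolist())
```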



In [10]:
#We need to analyze the non-numeric categorical data, to convert it to numeric or decide whether it is useful
dataCorrecta['type'].value_counts()

type
organic         5495
conventional    3393
Name: count, dtype: int64

In [11]:
dataCorrecta.head()

Unnamed: 0,index,Date,AveragePrice,Total Volume,4046,4225,4770,Total Bags,Small Bags,Large Bags,XLarge Bags,type,year,region
0,0,2015-12-27,1.33,64236.62,1036.74,54454.85,48.16,8696.87,8603.62,93.25,0.0,conventional,2015,Albany
1,2,2015-12-13,0.93,118220.22,794.7,109149.67,130.5,8145.35,8042.21,103.14,0.0,conventional,2015,Albany
2,4,2015-11-29,1.28,51039.6,941.48,43838.39,75.78,6183.95,5986.26,197.69,0.0,conventional,2015,Albany
3,5,2015-11-22,1.26,55979.78,1184.27,48067.99,43.61,6683.91,6556.47,127.44,0.0,conventional,2015,Albany
4,7,2015-11-08,0.98,109428.33,703.75,101815.36,80.0,6829.22,6266.85,562.37,0.0,conventional,2015,Albany


In [12]:
dataCorrecta['year'].value_counts()

year
2015    2926
2016    2710
2017    2647
2018     605
Name: count, dtype: int64

In [13]:
dataCorrecta['region'].value_counts()

region
NorthernNewEngland     221
BuffaloRochester       218
Charlotte              214
Albany                 210
Jacksonville           207
Syracuse               205
Nashville              205
HarrisburgScranton     204
Spokane                200
Pittsburgh             200
Tampa                  199
Boston                 198
Boise                  194
Philadelphia           194
Sacramento             192
Roanoke                191
Orlando                191
GrandRapids            191
NewOrleansMobile       190
RaleighGreensboro      190
SouthCarolina          188
MiamiFtLauderdale      188
Louisville             186
SanFrancisco           186
LasVegas               184
RichmondNorfolk        184
StLouis                184
BaltimoreWashington    183
Chicago                180
Columbus               179
SanDiego               176
Atlanta                173
PhoenixTucson          173
HartfordSpringfield    169
Indianapolis           169
Houston                163
CincinnatiDayton     

In [14]:
#Analyze how prices behave by region for organic avocados
mask = dataCorrecta['type']=='organic'


In [15]:
#Analyze how prices behave by region for conventional avocados
mask = dataCorrecta['type']=='conventional'
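
The two cells above only build the boolean masks. One way such a mask could be used to compare prices by region is sketched below on a made-up miniature table; the prices are hypothetical and only the column names follow the dataset.

```python
import pandas as pd

# Hypothetical miniature of dataCorrecta; prices are made up for illustration
df = pd.DataFrame({
    "type": ["organic", "organic", "conventional", "conventional", "organic"],
    "region": ["Albany", "Boston", "Albany", "Boston", "Albany"],
    "AveragePrice": [1.80, 2.00, 1.30, 1.10, 1.60],
})

mask = df["type"] == "organic"

# Mean price per region for the selected type, highest first
price_by_region = (
    df[mask]
    .groupby("region")["AveragePrice"]
    .mean()
    .sort_values(ascending=False)
)
print(price_by_region)
```

Swapping the mask for `df["type"] == "conventional"` gives the same summary for the other type.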


In [16]:
label = LabelEncoder()
dicts = {}

label.fit(dataCorrecta.type.drop_duplicates()) 
dicts['type'] = list(label.classes_)
dataCorrecta.type = label.transform(dataCorrecta.type) 

In [17]:
dataCorrecta['type'].value_counts()

type
1    5495
0    3393
Name: count, dtype: int64
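
To see which code corresponds to which label, note that `LabelEncoder` sorts its classes, so `conventional` maps to 0 and `organic` to 1, in line with the counts above. A minimal sketch on a three-element list:

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
# classes_ are sorted, so "conventional" becomes 0 and "organic" becomes 1
codes = encoder.fit_transform(["conventional", "organic", "organic"])

mapping = {str(name): int(code) for code, name in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the codes
restored = encoder.inverse_transform(codes)
print(list(restored))
```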

In [18]:
#Build a correlation table to see how strongly the fields relate to each other and to our target
cols = ['AveragePrice','Total Volume','4046','4225','4770','Total Bags','Small Bags','Large Bags','XLarge Bags','year','type']
corr = dataCorrecta[cols].corr('spearman') 
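
Spearman correlation works on ranks, so it captures any monotonic relationship, not only linear ones. The sketch below contrasts it with Pearson on hypothetical toy values where price falls as volume grows by powers of ten.

```python
import pandas as pd

# Hypothetical toy data: price falls as volume rises, but not linearly
toy = pd.DataFrame({
    "AveragePrice": [2.0, 1.5, 1.2, 1.1, 1.0],
    "Total Volume": [10.0, 100.0, 1000.0, 10000.0, 100000.0],
})

spearman = toy.corr(method="spearman")
pearson = toy.corr(method="pearson")

# Rank correlation is -1 (up to float rounding) for a strictly decreasing relationship,
# while Pearson is weaker because the relationship is not a straight line
print(spearman.loc["AveragePrice", "Total Volume"])
print(pearson.loc["AveragePrice", "Total Volume"])
```

To rank features against the target on the real table, something like `corr['AveragePrice'].sort_values()` would list them from most negative to most positive.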


In [19]:
#Create the dummy variables for region
region_dummi =pd.get_dummies(dataCorrecta['region'], prefix='reg')
region_dummi.head()

Unnamed: 0,reg_Albany,reg_Atlanta,reg_BaltimoreWashington,reg_Boise,reg_Boston,reg_BuffaloRochester,reg_California,reg_Charlotte,reg_Chicago,reg_CincinnatiDayton,...,reg_SouthCarolina,reg_SouthCentral,reg_Southeast,reg_Spokane,reg_StLouis,reg_Syracuse,reg_Tampa,reg_TotalUS,reg_West,reg_WestTexNewMexico
0,True,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,True,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,True,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,True,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,True,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
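
A compact view of what `get_dummies` does, on a hypothetical four-row column: each category gets its own indicator column named `<prefix>_<category>`. The `drop_first=True` variant is shown only as an option that this notebook does not use; it leaves out one category to act as the baseline, which some linear models may benefit from.

```python
import pandas as pd

regions = pd.Series(["Albany", "Boston", "Albany", "Chicago"], name="region")

# One indicator column per category, named <prefix>_<category>
dummies = pd.get_dummies(regions, prefix="reg")
print(dummies.columns.tolist())

# drop_first=True omits the first category, which then acts as the baseline
reduced = pd.get_dummies(regions, prefix="reg", drop_first=True)
print(reduced.columns.tolist())
```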


In [20]:
#Add the dummy columns to the original data
data_nueva = pd.concat([dataCorrecta, region_dummi], axis=1)
data_nueva.head() 

Unnamed: 0,index,Date,AveragePrice,Total Volume,4046,4225,4770,Total Bags,Small Bags,Large Bags,...,reg_SouthCarolina,reg_SouthCentral,reg_Southeast,reg_Spokane,reg_StLouis,reg_Syracuse,reg_Tampa,reg_TotalUS,reg_West,reg_WestTexNewMexico
0,0,2015-12-27,1.33,64236.62,1036.74,54454.85,48.16,8696.87,8603.62,93.25,...,False,False,False,False,False,False,False,False,False,False
1,2,2015-12-13,0.93,118220.22,794.7,109149.67,130.5,8145.35,8042.21,103.14,...,False,False,False,False,False,False,False,False,False,False
2,4,2015-11-29,1.28,51039.6,941.48,43838.39,75.78,6183.95,5986.26,197.69,...,False,False,False,False,False,False,False,False,False,False
3,5,2015-11-22,1.26,55979.78,1184.27,48067.99,43.61,6683.91,6556.47,127.44,...,False,False,False,False,False,False,False,False,False,False
4,7,2015-11-08,0.98,109428.33,703.75,101815.36,80.0,6829.22,6266.85,562.37,...,False,False,False,False,False,False,False,False,False,False


In [21]:
#Drop the region and index columns
data_nueva = data_nueva.drop(labels=['region'], axis=1)
data_nueva = data_nueva.drop(labels=['index'], axis=1)

We generate a correlation plot

In [22]:
#corr = data_nueva.corr('spearman') 
#plt.figure(figsize = ( 35 , 20 )) 
#sns.heatmap(corr,annot=True,fmt='.2f',cmap="YlGnBu");

In [23]:
#data_train starts as the full table; targer_train is the AveragePrice target column
data_train = data_nueva
targer_train = data_nueva['AveragePrice']

In [24]:
#Extract the month number (1-12) from the date
data_train['monthy'] = data_train['Date'].astype('datetime64[ns]').dt.month

In [25]:
dummi_month =pd.get_dummies(data_train['monthy'], prefix='month')
data_train = pd.concat([data_train, dummi_month], axis=1)

In [26]:
data_train = data_train.drop(['monthy'],axis=1)
#data_train = data_train.drop(['Date'],axis=1)

In [27]:
data_train.head(5)

Unnamed: 0,Date,AveragePrice,Total Volume,4046,4225,4770,Total Bags,Small Bags,Large Bags,XLarge Bags,...,month_3,month_4,month_5,month_6,month_7,month_8,month_9,month_10,month_11,month_12
0,2015-12-27,1.33,64236.62,1036.74,54454.85,48.16,8696.87,8603.62,93.25,0.0,...,False,False,False,False,False,False,False,False,False,True
1,2015-12-13,0.93,118220.22,794.7,109149.67,130.5,8145.35,8042.21,103.14,0.0,...,False,False,False,False,False,False,False,False,False,True
2,2015-11-29,1.28,51039.6,941.48,43838.39,75.78,6183.95,5986.26,197.69,0.0,...,False,False,False,False,False,False,False,False,True,False
3,2015-11-22,1.26,55979.78,1184.27,48067.99,43.61,6683.91,6556.47,127.44,0.0,...,False,False,False,False,False,False,False,False,True,False
4,2015-11-08,0.98,109428.33,703.75,101815.36,80.0,6829.22,6266.85,562.37,0.0,...,False,False,False,False,False,False,False,False,True,False


In [28]:
data_train['year'].value_counts()

year
2015    2926
2016    2710
2017    2647
2018     605
Name: count, dtype: int64

In [29]:
region_dummi_fecha =pd.get_dummies(data_train['year'], prefix='year')

In [30]:
data_train = pd.concat([data_train, region_dummi_fecha], axis=1)

In [31]:
data_train.head(5)

Unnamed: 0,Date,AveragePrice,Total Volume,4046,4225,4770,Total Bags,Small Bags,Large Bags,XLarge Bags,...,month_7,month_8,month_9,month_10,month_11,month_12,year_2015,year_2016,year_2017,year_2018
0,2015-12-27,1.33,64236.62,1036.74,54454.85,48.16,8696.87,8603.62,93.25,0.0,...,False,False,False,False,False,True,True,False,False,False
1,2015-12-13,0.93,118220.22,794.7,109149.67,130.5,8145.35,8042.21,103.14,0.0,...,False,False,False,False,False,True,True,False,False,False
2,2015-11-29,1.28,51039.6,941.48,43838.39,75.78,6183.95,5986.26,197.69,0.0,...,False,False,False,False,True,False,True,False,False,False
3,2015-11-22,1.26,55979.78,1184.27,48067.99,43.61,6683.91,6556.47,127.44,0.0,...,False,False,False,False,True,False,True,False,False,False
4,2015-11-08,0.98,109428.33,703.75,101815.36,80.0,6829.22,6266.85,562.37,0.0,...,False,False,False,False,True,False,True,False,False,False


In [32]:
data_train = data_train.drop(['year'],axis=1)
data_train = data_train.drop(['Date'], axis=1)
data_train = data_train.drop(['AveragePrice'],axis=1)

In [33]:

data_train.head(5)

Unnamed: 0,Total Volume,4046,4225,4770,Total Bags,Small Bags,Large Bags,XLarge Bags,type,reg_Albany,...,month_7,month_8,month_9,month_10,month_11,month_12,year_2015,year_2016,year_2017,year_2018
0,64236.62,1036.74,54454.85,48.16,8696.87,8603.62,93.25,0.0,0,True,...,False,False,False,False,False,True,True,False,False,False
1,118220.22,794.7,109149.67,130.5,8145.35,8042.21,103.14,0.0,0,True,...,False,False,False,False,False,True,True,False,False,False
2,51039.6,941.48,43838.39,75.78,6183.95,5986.26,197.69,0.0,0,True,...,False,False,False,False,True,False,True,False,False,False
3,55979.78,1184.27,48067.99,43.61,6683.91,6556.47,127.44,0.0,0,True,...,False,False,False,False,True,False,True,False,False,False
4,109428.33,703.75,101815.36,80.0,6829.22,6266.85,562.37,0.0,0,True,...,False,False,False,False,True,False,True,False,False,False


In [34]:
targer_train.mean()

1.4789187668766877

We split the data into test and training sets

In [35]:
data_train.head()

Unnamed: 0,Total Volume,4046,4225,4770,Total Bags,Small Bags,Large Bags,XLarge Bags,type,reg_Albany,...,month_7,month_8,month_9,month_10,month_11,month_12,year_2015,year_2016,year_2017,year_2018
0,64236.62,1036.74,54454.85,48.16,8696.87,8603.62,93.25,0.0,0,True,...,False,False,False,False,False,True,True,False,False,False
1,118220.22,794.7,109149.67,130.5,8145.35,8042.21,103.14,0.0,0,True,...,False,False,False,False,False,True,True,False,False,False
2,51039.6,941.48,43838.39,75.78,6183.95,5986.26,197.69,0.0,0,True,...,False,False,False,False,True,False,True,False,False,False
3,55979.78,1184.27,48067.99,43.61,6683.91,6556.47,127.44,0.0,0,True,...,False,False,False,False,True,False,True,False,False,False
4,109428.33,703.75,101815.36,80.0,6829.22,6266.85,562.37,0.0,0,True,...,False,False,False,False,True,False,True,False,False,False
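
A split of this kind could look like the sketch below, which uses scikit-learn's `train_test_split` on synthetic stand-ins for `data_train` and `targer_train`. The 80/20 proportion and the `random_state` are assumed values, not taken from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for data_train (features) and targer_train (target)
rng = np.random.default_rng(0)
features = pd.DataFrame({
    "Total Volume": rng.uniform(1e4, 1e5, size=100),
    "type": rng.integers(0, 2, size=100),
})
target = pd.Series(rng.uniform(0.5, 3.0, size=100), name="AveragePrice")

# test_size and random_state are assumed values
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```

Because the observations are dated, a random split mixes weeks from all years. Splitting by date instead would be another reasonable choice, depending on how the model is meant to be used.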


## From this point on we can work with a prediction model
Depending on the prediction model, we can prepare the data to predict the individual average price. We start with Ridge.

We try several models to determine which one gives the best results with the default configuration.

In this case we take the RandomForest and test it on its own using the mean absolute error (MAE).
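
The comparison described above could be sketched as follows. This is only an outline under stated assumptions: the data is synthetic (standing in for `data_train` and `targer_train`), the models use scikit-learn defaults apart from `random_state`, and the scores it prints say nothing about the results on the avocado data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data standing in for data_train / targer_train
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(300, 4))
y = 1.0 + np.sin(4.0 * X[:, 0]) + 0.5 * X[:, 1] + rng.normal(0.0, 0.05, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Default settings, apart from random_state for reproducibility
models = {
    "Ridge": Ridge(),
    "RandomForest": RandomForestRegressor(random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    scores[name] = mean_absolute_error(y_test, predictions)
    print(name, round(scores[name], 3))

# Baseline for reference: always predict the training mean
baseline = mean_absolute_error(y_test, np.full(len(y_test), y_train.mean()))
print("Baseline", round(baseline, 3))
```

The MAE is expressed in the same units as the target, so on the real data it would read directly as an average error in price. Comparing it against the mean-prediction baseline gives a sense of how much a model actually adds.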