# Pandas: introduction and reading data

We have already covered the basic elements needed to write simple programs in Python. We now move on to tools for working with data. Data comes in many formats, but the most common are Excel spreadsheets and .csv files.

What these files have in common is a tabular structure (rows and columns). In Python, the go-to library for reading this kind of file is pandas, and in particular its DataFrame structure.

In [1]:
import pandas as pd

In general, a frame is a way of storing data in a grid of rows and columns (just like an Excel spreadsheet). A DataFrame is a two-dimensional array (rows and columns). There are several ways to create a DataFrame:

1. By hand
2. From a list of numbers or a list of lists
3. From external files
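
Option 2 in the list above can be sketched as follows. The sample rows and the column names passed to `columns` are illustrative assumptions, not data from this notebook.

```python
import pandas as pd

# Hypothetical sample data: each inner list is one row.
rows = [["Ana", 30], ["Luis", 25]]

# Column names are supplied separately through the `columns` argument.
df_rows = pd.DataFrame(rows, columns=["name", "age"])

# A flat list of numbers becomes a single-column DataFrame.
df_numbers = pd.DataFrame([10, 20, 30], columns=["value"])

print(df_rows)
print(df_numbers)
```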

## Creating DataFrames

In [2]:
# Manual creation of a DataFrame from a dictionary

d = {'col1': [1, 2], 'col2': [3, 4]}


In [4]:
print(d)

{'col1': [1, 2], 'col2': [3, 4]}


In [5]:
df = pd.DataFrame(data=d)

In [6]:
df

Unnamed: 0,col1,col2
0,1,3
1,2,4


In [15]:
alumnos = {"Nombre": ["Tamara", "Johnny", "Francisca"], "Edad": [27,28,24]}

In [16]:
df_alumnos = pd.DataFrame(data=alumnos)

In [19]:
df_alumnos

Unnamed: 0,Nombre,Edad
0,Tamara,27
1,Johnny,28
2,Francisca,24


In [18]:
df_alumnos.describe()

Unnamed: 0,Edad
count,3.0
mean,26.333333
std,2.081666
min,24.0
25%,25.5
50%,27.0
75%,27.5
max,28.0


### Reading Excel spreadsheets

To read Excel files, we use the pandas function read_excel. It supports files with the extensions xls, xlsx, xlsm, xlsb, odf, ods, and odt. Documentation: https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html

In [20]:
data_1 = pd.read_excel("sample.xls", sheet_name="Hoja1")

In [21]:
data_1

Unnamed: 0,Height(Inches),Weight(Pounds)
0,65.78,112.99
1,71.52,136.49
2,69.40,153.03
3,68.22,142.34
4,67.79,144.30
...,...,...
195,65.80,120.84
196,66.11,115.78
197,68.24,128.30
198,68.02,127.47


read_excel() loads the requested sheet of the Excel file into a data structure called a DataFrame.

In [22]:
data_1.describe()

Unnamed: 0,Height(Inches),Weight(Pounds)
count,200.0,200.0
mean,67.9498,127.22195
std,1.940363,11.960959
min,63.43,97.9
25%,66.5225,119.895
50%,67.935,127.875
75%,69.2025,136.0975
max,73.9,158.96


In [23]:
data_2 = pd.read_excel("sample.xls", sheet_name="Hoja2")

In [24]:
data_2

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,,,,,
1,,,,,
2,,,,,
3,,,,,
4,,,,,
...,...,...,...,...,...
116,4.8,3,1.4,0.3,Iris-setosa
117,5.1,3.8,1.6,0.2,Iris-setosa
118,4.6,3.2,1.4,0.2,Iris-setosa
119,5.3,3.7,1.5,0.2,Iris-setosa


In [25]:
data_2.describe()

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4
count,101,101,101.0,101.0,101
unique,29,24,29.0,16.0,3
top,5,3,1.5,0.2,Iris-setosa
freq,10,14,14.0,28.0,50
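
The two outputs above suggest that this sheet has blank rows at the top and no header row: the columns are named `Unnamed: N`, and describe() reports count/unique/top/freq instead of numeric statistics. One possible cleanup is sketched below on a small hand-built stand-in for the sheet. The stand-in data and the column names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for a sheet with blank leading rows (not the real file).
raw = pd.DataFrame(
    {
        "Unnamed: 0": [np.nan, np.nan, 4.8, 5.1],
        "Unnamed: 1": [np.nan, np.nan, 3.0, 3.8],
        "Unnamed: 2": [np.nan, np.nan, "Iris-setosa", "Iris-setosa"],
    }
)

# Drop rows in which every cell is missing, then renumber the index.
clean = raw.dropna(how="all").reset_index(drop=True)

# Hypothetical column names chosen for this example.
clean.columns = ["length", "width", "species"]

print(clean)
```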


In [26]:
data_3 = pd.read_excel("sample.xls", sheet_name="Hoja3")

In [27]:
data_3

In [28]:
data_3.describe()

ValueError: Cannot describe a DataFrame without columns
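
The error above occurs because describe() cannot summarize a DataFrame that has no columns, which appears to be what reading this sheet produced. A minimal guard, assuming a hypothetical helper named `describe_if_possible`, could look like this:

```python
import pandas as pd


def describe_if_possible(df):
    """Return df.describe(), or None when df has no columns (hypothetical helper)."""
    if df.shape[1] == 0:
        return None
    return df.describe()


empty = pd.DataFrame()                   # stand-in for a blank sheet
filled = pd.DataFrame({"x": [1, 2, 3]})  # made-up data

print(describe_if_possible(empty))
print(describe_if_possible(filled))
```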

### Reading .csv files

In [29]:
data_4 = pd.read_csv("winemag-data-130k-v2.csv")

In [30]:
data_4

Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
3,3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian
4,4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
129966,129966,Germany,Notes of honeysuckle and cantaloupe sweeten th...,Brauneberger Juffer-Sonnenuhr Spätlese,90,28.0,Mosel,,,Anna Lee C. Iijima,,Dr. H. Thanisch (Erben Müller-Burggraef) 2013 ...,Riesling,Dr. H. Thanisch (Erben Müller-Burggraef)
129967,129967,US,Citation is given as much as a decade of bottl...,,90,75.0,Oregon,Oregon,Oregon Other,Paul Gregutt,@paulgwine,Citation 2004 Pinot Noir (Oregon),Pinot Noir,Citation
129968,129968,France,Well-drained gravel soil gives this wine its c...,Kritt,90,30.0,Alsace,Alsace,,Roger Voss,@vossroger,Domaine Gresser 2013 Kritt Gewurztraminer (Als...,Gewürztraminer,Domaine Gresser
129969,129969,France,"A dry style of Pinot Gris, this is crisp with ...",,90,32.0,Alsace,Alsace,,Roger Voss,@vossroger,Domaine Marcel Deiss 2012 Pinot Gris (Alsace),Pinot Gris,Domaine Marcel Deiss
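
The `Unnamed: 0` column in the output above is typical of a CSV whose first column has no header, often a row index that was saved along with the data. One way to handle it is the `index_col` parameter of `read_csv`. The sketch below uses a tiny in-memory CSV as a stand-in; its contents are made up for illustration.

```python
import io

import pandas as pd

# Made-up CSV text whose first header field is blank, as when an index is saved.
csv_text = ",country,points\n0,Italy,87\n1,Portugal,87\n"

# Default read: the unnamed first column becomes a regular column.
with_extra = pd.read_csv(io.StringIO(csv_text))

# index_col=0 uses that first column as the row index instead.
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_extra.columns))
print(list(with_index.columns))
```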


## Some useful methods

Column names

In [17]:
data_4.columns

Index(['Unnamed: 0', 'country', 'description', 'designation', 'points',
       'price', 'province', 'region_1', 'region_2', 'taster_name',
       'taster_twitter_handle', 'title', 'variety', 'winery'],
      dtype='object')

First 5 rows

In [31]:
data_4.head()

Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
3,3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian
4,4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks
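
head() returns the first 5 rows by default. It also accepts a row count, and tail() does the same from the end of the DataFrame. A short sketch on a made-up DataFrame:

```python
import pandas as pd

df_demo = pd.DataFrame({"n": range(10)})  # made-up data

first_three = df_demo.head(3)  # rows 0, 1, 2
last_two = df_demo.tail(2)     # rows 8, 9

print(first_three)
print(last_two)
```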


To get a summary of the DataFrame, we have the describe() method.

Documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html

In [32]:
data_4.describe()

Unnamed: 0.1,Unnamed: 0,points,price
count,129971.0,129971.0,120975.0
mean,64985.0,88.447138,35.363389
std,37519.540256,3.03973,41.022218
min,0.0,80.0,4.0
25%,32492.5,86.0,17.0
50%,64985.0,88.0,25.0
75%,97477.5,91.0,42.0
max,129970.0,100.0,3300.0
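
Two details of the output above are worth noting. By default describe() summarizes only the numeric columns, and count reports non-missing values, which is why price has a lower count than points. Both can be checked on a small made-up DataFrame:

```python
import numpy as np
import pandas as pd

# Made-up data: one text column and one missing price.
wines = pd.DataFrame(
    {
        "country": ["Italy", "US", "US"],
        "points": [87, 88, 90],
        "price": [np.nan, 14.0, 65.0],
    }
)

summary = wines.describe()

print(list(summary.columns))          # the text column is left out
print(summary.loc["count", "price"])  # the missing value is not counted
```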


In [33]:
data_4.describe(include="all")

Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
count,129971.0,129908,129971,92506,129971.0,120975.0,129908,108724,50511,103727,98758,129971,129970,129971
unique,,43,119955,37979,,,425,1229,17,19,15,118840,707,16757
top,,US,"Ripe plum, game, truffle, leather and menthol ...",Reserve,,,California,Napa Valley,Central Coast,Roger Voss,@vossroger,Gloria Ferrer NV Sonoma Brut Sparkling (Sonoma...,Pinot Noir,Wines & Winemakers
freq,,54504,3,2009,,,36247,4480,11065,25514,25514,11,13272,222
mean,64985.0,,,,88.447138,35.363389,,,,,,,,
std,37519.540256,,,,3.03973,41.022218,,,,,,,,
min,0.0,,,,80.0,4.0,,,,,,,,
25%,32492.5,,,,86.0,17.0,,,,,,,,
50%,64985.0,,,,88.0,25.0,,,,,,,,
75%,97477.5,,,,91.0,42.0,,,,,,,,
max,129970.0,,,,100.0,3300.0,,,,,,,,

