# Introduction to Pandas

**Author:** Roberto Muñoz <br />
**E-mail:** <rmunoz@metricarts.com>

[Pandas](http://pandas.pydata.org/) is a Python package that provides data structures similar to R's data frames. Pandas depends on NumPy, the library that adds a powerful array type to Python. The main kinds of data that pandas can represent are:

- Tabular data with heterogeneously typed columns and labels on both columns and rows.
- Time series.

Pandas provides tools to:

- read and write data in different formats: CSV, Microsoft Excel, SQL databases and HDF5
- select and filter tables easily by position, value or label
- merge and join data
- transform data by applying functions, either globally or over windows
- manipulate time series
- make plots

Pandas has two basic object types, both built on top of NumPy:

- Series (lists, 1D) and
- DataFrame (tables, 2D).

Older versions also included Panel (3D tables), which was deprecated and then removed in pandas 0.25.

In [None]:
import numpy as np
import pandas as pd
pd.__version__

## 1.  Series

A Pandas Series is a one-dimensional array of indexed data. It can be created from a list or array as follows:

In [None]:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data

As we see in the output, the Series wraps both a sequence of values and a sequence of indices, which we can access with the values and index attributes. The values are simply a familiar NumPy array:

In [None]:
data.values

The index is an array-like object of type pd.Index, which we'll discuss in more detail momentarily.

In [None]:
data.index

As with a NumPy array, data can be accessed by the associated index via the familiar Python square-bracket notation:

In [None]:
data[1]

### Series as generalized NumPy array

From what we've seen so far, it may look like the Series object is basically interchangeable with a one-dimensional NumPy array. The essential difference is the presence of the index: while the NumPy array has an implicitly defined integer index used to access the values, the Pandas Series has an explicitly defined index associated with the values.

In [None]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data

And the item access works as expected:

In [None]:
data['b']
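
Because the index is explicit, it does not need to be contiguous, sorted, or even made of strings. The following is a minimal sketch (the name `data_sparse` and its labels are made up for illustration) showing that, with an integer index, square brackets look up by label rather than by position, and that `.loc` and `.iloc` make the intent explicit:

```python
import pandas as pd

# An explicit integer index that is neither contiguous nor sorted (illustrative values).
data_sparse = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])

# Square brackets with an integer index look the value up by label, not by position.
value_by_label = data_sparse[5]

# .loc always uses labels, .iloc always uses positions.
value_loc = data_sparse.loc[5]      # label 5
value_iloc = data_sparse.iloc[1]    # second element

print(value_by_label, value_loc, value_iloc)
```

All three lookups return the same element here, but only because label `5` happens to sit at position 1.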

### Series as specialized dictionary

In this way, you can think of a Pandas Series a bit like a specialization of a Python dictionary. A dictionary is a structure that maps arbitrary keys to a set of arbitrary values, and a Series is a structure which maps typed keys to a set of typed values. This typing is important: just as the type-specific compiled code behind a NumPy array makes it more efficient than a Python list for certain operations, the type information of a Pandas Series makes it much more efficient than Python dictionaries for certain operations.

In [None]:
population_dict = {'Arica y Parinacota': 243149,
                   'Antofagasta': 631875,
                   'Metropolitana de Santiago': 7399042,
                   'Valparaiso': 1842880,
                   'Bíobío': 2127902,
                   'Magallanes y Antártica Chilena': 165547}
population = pd.Series(population_dict)
population

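Note that the order of the index depends on the pandas version: since pandas 0.23 (on Python 3.6 or later), a Series built from a dictionary keeps the dictionary's insertion order, while older versions sorted the index lexicographically.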

In [None]:
population['Arica y Parinacota']

Unlike a dictionary, though, the Series also supports array-style operations such as slicing:

In [None]:
population['Metropolitana de Santiago':'Valparaiso']
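
One detail of slicing is easy to miss: a slice by label includes both endpoints, whereas a slice by position excludes the stop, as in plain Python. A small sketch, assuming a reduced copy of the population data (the name `population_demo` is introduced here only for illustration):

```python
import pandas as pd

# Reduced copy of the population data, built with an explicit index
# so the row order is the same on every pandas version.
population_demo = pd.Series(
    [243149, 631875, 7399042, 1842880],
    index=['Arica y Parinacota', 'Antofagasta',
           'Metropolitana de Santiago', 'Valparaiso'])

# Label-based slice: both endpoints are included.
by_label = population_demo.loc['Antofagasta':'Valparaiso']

# Position-based slice: the stop position is excluded.
by_position = population_demo.iloc[1:3]

print(len(by_label), len(by_position))
```

Both endpoint labels must exist in the index exactly as written (including accents) when the index is not sorted.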

## 2. DataFrame

The next fundamental structure in Pandas is the DataFrame. Like the Series object discussed in the previous section, the DataFrame can be thought of either as a generalization of a NumPy array, or as a specialization of a Python dictionary. We'll now take a look at each of these perspectives.

### DataFrame as a generalized NumPy array

If a Series is an analog of a one-dimensional array with flexible indices, a DataFrame is an analog of a two-dimensional array with both flexible row indices and flexible column names.

In [None]:
# Area in km^2
area_dict = {'Arica y Parinacota': 16873.3,
             'Antofagasta': 126049.1,
             'Metropolitana de Santiago': 15403.2,
             'Valparaiso': 16396.1,
             'Bíobío': 37068.7,
             'Magallanes y Antártica Chilena': 1382291.1}
area = pd.Series(area_dict)
area

Now that we have this along with the population Series from before, we can use a dictionary to construct a single two-dimensional object containing this information:

In [None]:
regions = pd.DataFrame({'population': population,
                       'area': area})
regions
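
When a DataFrame is built from several Series, the rows are matched by index label, not by position. The sketch below uses two deliberately mismatched Series (`pop_demo` and `area_demo` are illustrative names, with values taken from the dictionaries above) to show that labels missing from one Series are filled with `NaN`:

```python
import numpy as np
import pandas as pd

# Two Series that share only the 'Valparaiso' label.
pop_demo = pd.Series({'Antofagasta': 631875, 'Valparaiso': 1842880})
area_demo = pd.Series({'Valparaiso': 16396.1, 'Bíobío': 37068.7})

# The resulting index is the union of both indices; gaps become NaN.
aligned = pd.DataFrame({'population': pop_demo, 'area': area_demo})

print(aligned)
```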

In [None]:
regions.index

In [None]:
regions.columns

### DataFrame as specialized dictionary

Similarly, we can also think of a DataFrame as a specialization of a dictionary. Where a dictionary maps a key to a value, a DataFrame maps a column name to a Series of column data. For example, asking for the 'area' item returns the Series object containing the areas we saw earlier:

In [None]:
regions['area']
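
The dictionary analogy also covers assignment: setting a new key adds a new column. As a rough sketch, assuming a two-row copy of the table (`regions_demo` and the `density` column are names introduced here, not part of the original notebook):

```python
import pandas as pd

# Two-row copy of the regions table, using values from the dictionaries above.
regions_demo = pd.DataFrame(
    {'population': [243149, 631875],
     'area': [16873.3, 126049.1]},
    index=['Arica y Parinacota', 'Antofagasta'])

# Dictionary-style assignment creates a new column computed row by row.
regions_demo['density'] = regions_demo['population'] / regions_demo['area']

print(regions_demo)
```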

### Constructing DataFrame objects
A Pandas DataFrame can be constructed in a variety of ways. Here we'll give several examples.

### From a single Series object
A DataFrame is a collection of Series objects, and a single-column DataFrame can be constructed from a single Series:

In [None]:
pd.DataFrame(population, columns=['population'])

### From a dictionary of Series objects
As we saw before, a DataFrame can be constructed from a dictionary of Series objects as well:

In [None]:
pd.DataFrame({'population': population,
              'area': area}, columns=['population', 'area'])
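
Two other common constructors, not covered above, are a list of dictionaries and a two-dimensional NumPy array. The following sketch uses made-up column and row labels purely for illustration:

```python
import numpy as np
import pandas as pd

# From a list of dictionaries: each dictionary becomes one row.
rows = [{'a': i, 'b': 2 * i} for i in range(3)]
from_records = pd.DataFrame(rows)

# From a two-dimensional NumPy array, with explicit column and row labels.
from_array = pd.DataFrame(np.arange(6).reshape(3, 2),
                          columns=['foo', 'bar'],
                          index=['a', 'b', 'c'])

print(from_records)
print(from_array)
```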

## 3. Reading a CSV file and doing common Pandas operations

In [None]:
# Paths to the input CSV files
regiones_file = 'data/chile_regiones.csv'
provincias_file = 'data/chile_provincias.csv'
comunas_file = 'data/chile_comunas.csv'

# The first row holds the column names and fields are comma-separated
regiones = pd.read_csv(regiones_file, header=0, sep=',')
provincias = pd.read_csv(provincias_file, header=0, sep=',')
comunas = pd.read_csv(comunas_file, header=0, sep=',')
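
If the CSV files are not at hand, `pd.read_csv` can be tried on an in-memory string instead. This is only a sketch: the column names `CodigoRegion` and `NombreRegion` are hypothetical and may not match the real files.

```python
import io
import pandas as pd

# Hypothetical CSV content; the real files may use different column names.
csv_text = """CodigoRegion,NombreRegion
15,Arica y Parinacota
2,Antofagasta
"""

# read_csv accepts any file-like object. A comma separator and a header
# in the first row are what it assumes when no options are given.
regiones_demo = pd.read_csv(io.StringIO(csv_text))

print(regiones_demo.dtypes)
```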

In [None]:
print('regiones table: ', regiones.columns.values.tolist())
print('provincias table: ', provincias.columns.values.tolist())
print('comunas table: ', comunas.columns.values.tolist())

In [None]:
regiones.head()

In [None]:
provincias.head()

In [None]:
comunas.head()

In [None]:
# No `on` argument: the merge key is every column name the two tables share
regiones_provincias = pd.merge(regiones, provincias, how='outer')
regiones_provincias.head()

In [None]:
provincias_comunas = pd.merge(provincias, comunas, how='outer')
provincias_comunas.head()

In [None]:
regiones_provincias_comunas = pd.merge(regiones_provincias, comunas, how='outer')
regiones_provincias_comunas.index.name = 'ID'
regiones_provincias_comunas.head()
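
To see what an outer merge does, it helps to try it on two tiny tables. The sketch below is illustrative only: the column names and rows are invented, the key is named explicitly with `on=`, and `indicator=True` adds a `_merge` column that records where each row came from.

```python
import pandas as pd

# Toy tables with a shared key column (hypothetical names and rows).
regiones_demo = pd.DataFrame({'CodigoRegion': [2, 5],
                              'NombreRegion': ['Antofagasta', 'Valparaiso']})
provincias_demo = pd.DataFrame({'CodigoRegion': [2, 2, 13],
                                'NombreProvincia': ['Antofagasta', 'El Loa', 'Santiago']})

# An outer merge keeps keys found in either table; unmatched cells become NaN.
merged = pd.merge(regiones_demo, provincias_demo,
                  on='CodigoRegion', how='outer', indicator=True)

print(merged)
```

Naming the key with `on=` avoids surprises when the tables happen to share other column names.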

In [None]:
regiones_provincias_comunas.to_csv('data/chile_demographic_merge.csv', index=False)

## 4. Loading full dataset

In [None]:
data_file='data/chile_demographic.csv'
data=pd.read_csv(data_file, header=0, sep=',')
data

In [None]:
data.sort_values('Poblacion')

In [None]:
data.sort_values('Poblacion', ascending=False)

In [None]:
# Select several columns with a list; tuple-style selection fails in pandas 2.0+
data.groupby('Region')[['Poblacion', 'Superficie']].sum()

In [None]:
data.groupby('Region')[['Poblacion', 'Superficie']].sum().sort_values('Poblacion')
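
The group-then-sort pattern can be checked on a small hand-made table. This sketch reuses the column names `Region`, `Poblacion` and `Superficie` from the cells above, but the rows are invented so the totals are easy to verify by eye:

```python
import pandas as pd

# Invented rows; only the column names come from the dataset used above.
demo = pd.DataFrame({'Region': ['A', 'A', 'B'],
                     'Poblacion': [100, 50, 400],
                     'Superficie': [10.0, 5.0, 20.0]})

# Sum both numeric columns within each region.
totals = demo.groupby('Region')[['Poblacion', 'Superficie']].sum()

# Order the regions from most to least populated.
totals_sorted = totals.sort_values('Poblacion', ascending=False)

print(totals_sorted)
```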