# List comprehension

**List comprehensions are an elegant way to build lists while keeping the code as concise as possible.**

In [None]:
x = [1,2,3,4]

In [None]:
# Traditional approach
out = []
for item in x:
    out.append(item**2)
print(out)

In [None]:
# Using a list comprehension
[item**2 for item in x]

**Structure**

[ expression(i) for i in iterable if condition ]

In [None]:
palabras = ['casa', 'perro', 'puerta', 'pizza']
cap = [palabra.title() for palabra in palabras]
print(cap)

['Casa', 'Perro', 'Puerta', 'Pizza']
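
The optional if clause in the structure above is not exercised by the examples so far. A minimal sketch, where numbers and even_squares are illustrative names that do not come from the original notebook:

```python
# Illustrative data; not part of the original notebook.
numbers = [1, 2, 3, 4, 5, 6]

# Keep only the even values, then square each one.
even_squares = [n**2 for n in numbers if n % 2 == 0]
print(even_squares)  # [4, 16, 36]
```

The condition is evaluated before the expression, so elements that fail it never reach n**2.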


# Lambda expressions


**A lambda function is a small anonymous function.**

A lambda function can take any number of arguments, but it can contain only one expression.

**Whereas regular functions are defined in Python with the def keyword, anonymous functions are defined with the lambda keyword.**

lambda arguments : expression

In [None]:
def times2(var):
    return var*2

In [None]:
times2(2)

4

In [None]:
func=lambda var: var*2

In [None]:
func(34)

68

In [None]:
con_valores = lambda val, val1, val2 : val + val1 + val2

In [None]:
con_valores(23,12,11)

46
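
Lambdas are most useful when passed directly to another function rather than assigned to a name. The sketch below uses the built-in sorted with a lambda as its key; the variable names by_length and by_last_letter are illustrative assumptions:

```python
palabras = ['casa', 'perro', 'puerta', 'pizza']

# Sort by word length; ties keep their original order (sorted is stable).
by_length = sorted(palabras, key=lambda palabra: len(palabra))
print(by_length)  # ['casa', 'perro', 'pizza', 'puerta']

# Sort by the last letter of each word.
by_last_letter = sorted(palabras, key=lambda palabra: palabra[-1])
print(by_last_letter)  # ['casa', 'puerta', 'pizza', 'perro']
```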

# Functional programming

## map


**The map function applies a function to each element of a collection (lists, tuples, etc.).**

map(function to apply, iterable)

In [None]:
seq = [1,2,3,4,5]

**The map function returns a map object, which is a lazy iterator rather than a list:**

In [None]:
map(times2,seq)

<map at 0x7f934c39f748>

In [None]:
list(map(times2,seq))

[2, 4, 6, 8, 10]

In [None]:
# Square every element in the list.

def cuadrado(elemento=0):
    return elemento * elemento

lista = [1,2,3,4,5,6,7,8,9,10]
resultado = list( map(cuadrado, lista) )
print(resultado)

[1, 4, 9, 16, 25, 36, 49, 64, 81, 100]


In [None]:
list(map(lambda var: var*2,seq))

[2, 4, 6, 8, 10]
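
Because a map object is an iterator, it computes values on demand and can be consumed only once. A small sketch of that behaviour (the names doubled, first, remaining and exhausted are illustrative):

```python
seq = [1, 2, 3, 4, 5]

doubled = map(lambda var: var * 2, seq)

# Values are produced one at a time, only when requested.
first = next(doubled)
print(first)  # 2

# list() consumes whatever is left in the iterator.
remaining = list(doubled)
print(remaining)  # [4, 6, 8, 10]

# A second pass finds nothing: the iterator is already exhausted.
exhausted = list(doubled)
print(exhausted)  # []
```

This is why the examples wrap map in list() when the full result is needed more than once.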

## filter

**The filter function keeps only the elements of a collection for which a given function returns a true value.**

In [None]:
filter(lambda item: item%2 == 0,seq)

<filter at 0x7f934c39fac8>

In [None]:
list(filter(lambda item: item%2 == 0,seq))

[2, 4]

In [None]:
# Count the elements greater than 5 in the tuple.

def mayor_a_cinco(elemento):
    return elemento > 5

tupla = (5,2,6,7,8,10,77,55,2,1,30,4,2,3)
resultado = tuple(filter( mayor_a_cinco, tupla))
resultado = len(resultado)
print(resultado)

7
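
The same count can be obtained without building an intermediate tuple. A sketch comparing filter with a generator expression; con_filter and con_generador are illustrative names:

```python
tupla = (5, 2, 6, 7, 8, 10, 77, 55, 2, 1, 30, 4, 2, 3)

# filter with a lambda instead of a named function.
con_filter = len(list(filter(lambda elemento: elemento > 5, tupla)))

# Equivalent generator expression: add 1 for every element that passes.
con_generador = sum(1 for elemento in tupla if elemento > 5)

print(con_filter, con_generador)  # 7 7
```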


## Reduce

**Use reduce when you have a collection of elements and need to produce a single result. reduce folds the elements of the collection into one value; you can think of it as an accumulator.**

reduce(function to apply, iterable)

The important part is the function to apply. It must take exactly two parameters. The first is the accumulator, a variable whose value is updated for each element of the collection. The second is the current element of the collection. The function must return a new value, which becomes the new accumulator. When no initial value is supplied, the first element of the collection serves as the initial accumulator.

In [None]:
# Sum all the elements in the list

lista = [1,2,3,4]
acumulador = 0

for elemento in lista:
    acumulador += elemento

print(acumulador)

10


In [None]:
from functools import reduce

lista = [1,2,3,4]

def funcion_acumulador(acumulador=0, elemento=0):
    return acumulador + elemento

resultado = reduce(funcion_acumulador, lista)
print(resultado)

10


In [None]:
suma = reduce(lambda x, y: x + y, lista)
print(suma)

10
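
To make the accumulator visible, the sketch below records each call that reduce makes. The helper traced_sum and the list steps are illustrative names; the optional third argument of reduce is the initial accumulator value.

```python
from functools import reduce

lista = [1, 2, 3, 4]
steps = []

def traced_sum(acumulador, elemento):
    # Record every (accumulator, element) pair that reduce passes in.
    steps.append((acumulador, elemento))
    return acumulador + elemento

total = reduce(traced_sum, lista)
print(steps)  # [(1, 2), (3, 3), (6, 4)]
print(total)  # 10

# With an initial value, reduce also works on an empty collection.
empty_total = reduce(traced_sum, [], 0)
print(empty_total)  # 0
```

Without the initial value, calling reduce on an empty collection raises a TypeError.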


# PANDAS

# Data Input and Output

This notebook is reference code for reading and writing data. pandas can read a variety of file types with its pd.read_* functions. Let's look at the most common formats:

In [None]:
import numpy as np
import pandas as pd

## CSV

### CSV Input

In [None]:
df = pd.read_csv('example.csv')
df

Unnamed: 0,a,b,c,d
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11
3,12,13,14,15


### CSV Output

In [None]:
df.to_csv('example.csv',index=False)
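
The Unnamed: 0 column in the output above appears when a CSV was written with its index and then read back without declaring it. A minimal sketch using an in-memory buffer instead of example.csv (the small frame is an illustrative assumption, not the original file):

```python
import io
import pandas as pd

# Illustrative frame; the contents of example.csv are not reproduced here.
df = pd.DataFrame({'a': [0, 4], 'b': [1, 5]})

# Default: the index is written as an unnamed first column.
with_index = pd.read_csv(io.StringIO(df.to_csv()))
print(list(with_index.columns))  # ['Unnamed: 0', 'a', 'b']

# index=False leaves the index out, so the round trip is clean.
without_index = pd.read_csv(io.StringIO(df.to_csv(index=False)))
print(list(without_index.columns))  # ['a', 'b']

# Alternatively, tell read_csv that the first column holds the index.
restored = pd.read_csv(io.StringIO(df.to_csv()), index_col=0)
print(list(restored.columns))  # ['a', 'b']
```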

## Excel
pandas can read and write Excel files. Note that this imports only the data, not formulas or images; a workbook that contains images or macros may cause read_excel to fail.

### Excel Input

In [None]:
pd.read_excel('/home/carlita/Documents/DIPLOMADO/Gen6/PYTHON/Pandas/Excel_Sample.xlsx',sheet_name='Sheet1')

Unnamed: 0.1,Unnamed: 0,a,b,c,d
0,0,0,1,2,3
1,1,4,5,6,7
2,2,8,9,10,11
3,3,12,13,14,15


### Excel Output

In [None]:
df.to_excel('Excel_Sample.xlsx',sheet_name='Sheet1')

## HTML

You need to install html5lib, lxml, and BeautifulSoup4. In your terminal, run:

    pip install lxml
    pip install html5lib
    pip install BeautifulSoup4

Then restart Jupyter Notebook.

pandas can read tables from HTML pages. For example:

### HTML Input

The pandas read_html function reads the tables on a web page and returns a list of DataFrame objects:

In [None]:
df = pd.read_html('https://en.wikipedia.org/wiki/Minnesota')

In [None]:
df[0]

Unnamed: 0,Minnesota,Minnesota.1
0,State,State
1,State of Minnesota,State of Minnesota
2,FlagSeal,FlagSeal
3,"Nickname(s): Land of 10,000 Lakes;North Star S...","Nickname(s): Land of 10,000 Lakes;North Star S..."
4,Motto(s): L'Étoile du Nord (French: The Star o...,Motto(s): L'Étoile du Nord (French: The Star o...
5,"Anthem: ""Hail! Minnesota""","Anthem: ""Hail! Minnesota"""
6,Map of the United States with Minnesota highli...,Map of the United States with Minnesota highli...
7,Country,United States
8,Before statehood,Minnesota Territory
9,Admitted to the Union,"May 11, 1858 (32nd)"


# DATAFRAMES

In [None]:
from numpy.random import randn
np.random.seed(101)

In [None]:
df = pd.DataFrame(randn(5,4),index='A B C D E'.split(),columns='W X Y Z'.split())

df

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509



# Selection and indexing

Let's learn the various ways to get data out of a DataFrame

In [None]:
df['W']

A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64

In [None]:
# Pass a list of column names
df[['W','Z']]

Unnamed: 0,W,Z
A,2.70685,0.503826
B,0.651118,0.605965
C,-2.018168,-0.589001
D,0.188695,0.955057
E,0.190794,0.683509


In [None]:
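# Attribute (dot) access, sometimes called SQL-style syntax (NOT RECOMMENDED: it fails for column names that contain spaces or clash with DataFrame methods)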
df.W

A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64

DataFrame columns are just Series

In [None]:
type(df['W'])

pandas.core.series.Series

**Creating a new column:**

In [None]:
df['new'] = df['W'] + df['Y']

df

Unnamed: 0,W,X,Y,Z,new
A,2.70685,0.628133,0.907969,0.503826,3.614819
B,0.651118,-0.319318,-0.848077,0.605965,-0.196959
C,-2.018168,0.740122,0.528813,-0.589001,-1.489355
D,0.188695,-0.758872,-0.933237,0.955057,-0.744542
E,0.190794,1.978757,2.605967,0.683509,2.796762



**Removing columns**

In [None]:
df.drop('new',axis=1)

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [None]:
df.drop(columns=["W"])

Unnamed: 0,X,Y,Z,new
A,0.628133,0.907969,0.503826,3.614819
B,-0.319318,-0.848077,0.605965,-0.196959
C,0.740122,0.528813,-0.589001,-1.489355
D,-0.758872,-0.933237,0.955057,-0.744542
E,1.978757,2.605967,0.683509,2.796762


In [None]:
# Not in place unless specified!
df

Unnamed: 0,W,X,Y,Z,new
A,2.70685,0.628133,0.907969,0.503826,3.614819
B,0.651118,-0.319318,-0.848077,0.605965,-0.196959
C,-2.018168,0.740122,0.528813,-0.589001,-1.489355
D,0.188695,-0.758872,-0.933237,0.955057,-0.744542
E,0.190794,1.978757,2.605967,0.683509,2.796762


In [None]:
df.drop('new',axis=1,inplace=True)

df

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


You can also drop rows this way:

In [None]:
df.drop('E',axis=0)

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057


**Selecting rows**

In [None]:
df.loc['A']

W    2.706850
X    0.628133
Y    0.907969
Z    0.503826
Name: A, dtype: float64

Or select by position instead of by label

In [None]:
df

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [None]:
df.iloc[2]

W   -2.018168
X    0.740122
Y    0.528813
Z   -0.589001
Name: C, dtype: float64

**Selecting a subset of rows and columns**

In [None]:
df.loc['B','Y']

-0.8480769834036315

In [None]:
df.loc[['A','B'],['W','Y']]

Unnamed: 0,W,Y
A,2.70685,0.907969
B,0.651118,-0.848077
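
One difference between loc and iloc that is easy to miss: label slices include the end label, while position slices exclude the end position. A sketch on a small illustrative frame (the values are assumptions, chosen to be easy to read):

```python
import pandas as pd

# Small illustrative frame with the same kind of labels as above.
df = pd.DataFrame({'W': [1, 2, 3], 'Y': [4, 5, 6]}, index=['A', 'B', 'C'])

# Label-based slice: 'B' is included.
by_label = df.loc['A':'B']

# Position-based slice: position 2 is excluded.
by_position = df.iloc[0:2]

print(list(by_label.index))     # ['A', 'B']
print(list(by_position.index))  # ['A', 'B']

# A single cell, first by label and then by position.
cell_by_label = df.loc['B', 'Y']
cell_by_position = df.iloc[1, 1]
print(cell_by_label, cell_by_position)  # 5 5
```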


### Conditional selection

An important feature of pandas is conditional selection using bracket notation, very similar to NumPy:


In [None]:
df

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [None]:
df>0

Unnamed: 0,W,X,Y,Z
A,True,True,True,True
B,True,False,False,True
C,False,True,True,False
D,True,False,False,True
E,True,True,True,True


In [None]:
df[df>0]

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,,,0.605965
C,,0.740122,0.528813,
D,0.188695,,,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [None]:
df[df['W']>0]

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [None]:
df[df['W']>0]['Y']

A    0.907969
B   -0.848077
D   -0.933237
E    2.605967
Name: Y, dtype: float64

In [None]:
df[df['W']>0][['Y','X']]

Unnamed: 0,Y,X
A,0.907969,0.628133
B,-0.848077,-0.319318
D,-0.933237,-0.758872
E,2.605967,1.978757


For two conditions, combine them with | (or) and & (and), wrapping each condition in parentheses:

In [None]:
df[(df['W']>0) & (df['Y'] > 1)]

Unnamed: 0,W,X,Y,Z
E,0.190794,1.978757,2.605967,0.683509
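
The example above only uses &. The sketch below also shows | and what happens when the Python keyword and is used instead; the frame values are illustrative assumptions rather than the random data above:

```python
import pandas as pd

# Illustrative values, rounded for readability.
df = pd.DataFrame({'W': [2.7, 0.65, -2.0, 0.19],
                   'Y': [0.9, -0.85, 0.53, 2.6]},
                  index=list('ABCD'))

# Rows where at least one condition holds.
either = df[(df['W'] > 0) | (df['Y'] > 1)]
print(list(either.index))  # ['A', 'B', 'D']

# Rows where both conditions hold.
both = df[(df['W'] > 0) & (df['Y'] > 1)]
print(list(both.index))  # ['D']

# The keyword `and` needs a single True/False, which a Series cannot provide.
try:
    df[(df['W'] > 0) and (df['Y'] > 1)]
except ValueError as error:
    message = str(error)
    print(message)
```

The parentheses matter because & and | bind more tightly than comparison operators such as >.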


### More index details

Let's look at a few more indexing features, including resetting the index or setting it to something else. We'll also talk about index hierarchy!

In [None]:
df

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [None]:
# Reset to the default 0, 1, ..., n index
df.reset_index()

Unnamed: 0,index,W,X,Y,Z
0,A,2.70685,0.628133,0.907969,0.503826
1,B,0.651118,-0.319318,-0.848077,0.605965
2,C,-2.018168,0.740122,0.528813,-0.589001
3,D,0.188695,-0.758872,-0.933237,0.955057
4,E,0.190794,1.978757,2.605967,0.683509


In [None]:
newind = 'CA NY WY OR CO'.split()

In [None]:
df['States'] = newind

In [None]:
df

Unnamed: 0,W,X,Y,Z,States
A,2.70685,0.628133,0.907969,0.503826,CA
B,0.651118,-0.319318,-0.848077,0.605965,NY
C,-2.018168,0.740122,0.528813,-0.589001,WY
D,0.188695,-0.758872,-0.933237,0.955057,OR
E,0.190794,1.978757,2.605967,0.683509,CO


In [None]:
df.set_index('States')

Unnamed: 0_level_0,W,X,Y,Z
States,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
CA,2.70685,0.628133,0.907969,0.503826
NY,0.651118,-0.319318,-0.848077,0.605965
WY,-2.018168,0.740122,0.528813,-0.589001
OR,0.188695,-0.758872,-0.933237,0.955057
CO,0.190794,1.978757,2.605967,0.683509


In [None]:
df

Unnamed: 0,W,X,Y,Z,States
A,2.70685,0.628133,0.907969,0.503826,CA
B,0.651118,-0.319318,-0.848077,0.605965,NY
C,-2.018168,0.740122,0.528813,-0.589001,WY
D,0.188695,-0.758872,-0.933237,0.955057,OR
E,0.190794,1.978757,2.605967,0.683509,CO


In [None]:
df.set_index('States',inplace=True)

In [None]:
df

Unnamed: 0_level_0,W,X,Y,Z
States,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
CA,2.70685,0.628133,0.907969,0.503826
NY,0.651118,-0.319318,-0.848077,0.605965
WY,-2.018168,0.740122,0.528813,-0.589001
OR,0.188695,-0.758872,-0.933237,0.955057
CO,0.190794,1.978757,2.605967,0.683509


### Index hierarchy and MultiIndex

Let's go over how to work with a MultiIndex. First we'll create a quick example of what a multi-indexed DataFrame looks like:

In [None]:
# Index levels
outside = ['G1','G1','G1','G2','G2','G2']
inside = [1,2,3,1,2,3]
hier_index = list(zip(outside,inside))
hier_index = pd.MultiIndex.from_tuples(hier_index)

In [None]:
hier_index

MultiIndex([('G1', 1),
            ('G1', 2),
            ('G1', 3),
            ('G2', 1),
            ('G2', 2),
            ('G2', 3)],
           )

In [None]:
df = pd.DataFrame(np.random.randn(6,2),index=hier_index,columns=['A','B'])
df

Unnamed: 0,Unnamed: 1,A,B
G1,1,0.903319,1.543568
G1,2,3.014676,-0.13917
G1,3,0.482311,0.977939
G2,1,-0.344431,-1.785068
G2,2,0.530949,0.628364
G2,3,2.283261,0.148912


Now let's see how to index this! For an index hierarchy we use df.loc[]; if the hierarchy were on the columns axis, you would just use normal bracket notation df[]. Selecting one level of the index returns a sub-DataFrame:

In [None]:
df.loc['G1']

Unnamed: 0,A,B
1,0.903319,1.543568
2,3.014676,-0.13917
3,0.482311,0.977939


In [None]:
df.loc['G1'].loc[1]

A    0.903319
B    1.543568
Name: 1, dtype: float64

In [None]:
df.index.names

FrozenList([None, None])

In [None]:
df.index.names = ['Group','Num']

In [None]:
df

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B
Group,Num,Unnamed: 2_level_1,Unnamed: 3_level_1
G1,1,0.903319,1.543568
G1,2,3.014676,-0.13917
G1,3,0.482311,0.977939
G2,1,-0.344431,-1.785068
G2,2,0.530949,0.628364
G2,3,2.283261,0.148912


In [None]:
df.xs('G1')

Unnamed: 0_level_0,A,B
Num,Unnamed: 1_level_1,Unnamed: 2_level_1
1,0.903319,1.543568
2,3.014676,-0.13917
3,0.482311,0.977939


In [None]:
df.xs(('G1',1))  # pass the key as a tuple; list keys are deprecated in recent pandas versions

A    0.903319
B    1.543568
Name: (G1, 1), dtype: float64

In [None]:
df.xs(1,level='Num')

Unnamed: 0_level_0,A,B
Group,Unnamed: 1_level_1,Unnamed: 2_level_1
G1,0.903319,1.543568
G2,-0.344431,-1.785068
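
The two access patterns shown above, an outer-to-inner key with loc and a cross-section on an inner level with xs, can be compared on deterministic data. This is a sketch; the integer values are assumptions standing in for the random numbers:

```python
import pandas as pd

# Same index layout as above, with predictable values instead of random ones.
index = pd.MultiIndex.from_product([['G1', 'G2'], [1, 2, 3]],
                                   names=['Group', 'Num'])
df = pd.DataFrame({'A': list(range(6)), 'B': list(range(10, 16))}, index=index)

# A full key (outer level, inner level) returns a single row as a Series.
row = df.loc[('G1', 2)]
print(row['A'], row['B'])  # 1 11

# xs with level= selects on an inner level across every outer group.
cross = df.xs(2, level='Num')
print(list(cross.index))  # ['G1', 'G2']
print(cross['A'].tolist())  # [1, 4]
```

xs is convenient here because plain loc indexing works from the outer level inward, whereas xs can target the inner level directly.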


In [None]:
df

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B
Group,Num,Unnamed: 2_level_1,Unnamed: 3_level_1
G1,1,0.903319,1.543568
G1,2,3.014676,-0.13917
G1,3,0.482311,0.977939
G2,1,-0.344431,-1.785068
G2,2,0.530949,0.628364
G2,3,2.283261,0.148912


In [None]:
df.droplevel('Group')  # drop the outer level, keeping only Num

Unnamed: 0_level_0,A,B
Num,Unnamed: 1_level_1,Unnamed: 2_level_1
1,0.903319,1.543568
2,3.014676,-0.13917
3,0.482311,0.977939
1,-0.344431,-1.785068
2,0.530949,0.628364
3,2.283261,0.148912


In [None]:

df1 = pd.DataFrame(np.random.randint(1, 100, (5, 4)),
             columns = [['A', 'A', 'B', 'B'],['english', 'math', 'english', 'math']],
             index = [1, 2, 3, 4, 5])

In [None]:
df1

Unnamed: 0_level_0,A,A,B,B
Unnamed: 0_level_1,english,math,english,math
1,65,70,44,6
2,59,30,58,63
3,15,93,49,11
4,64,47,17,81
5,65,25,57,20


In [None]:
df.columns = pd.MultiIndex.from_tuples([
    ('c', 'e'), ('d', 'f')
], names=['level_1', 'level_2'])


In [None]:
df1.columns

MultiIndex([('A', 'english'),
            ('A',    'math'),
            ('B', 'english'),
            ('B',    'math')],
           )

In [None]:
df1["A"]["english"]

1    65
2    59
3    15
4    64
5    65
Name: english, dtype: int64

# Groupby
The groupby method lets you group rows of data and call aggregate functions on each group

In [None]:
# Create DataFrame
data = {'Company':['GOOG','GOOG','MSFT','MSFT','FB','FB'],
       'Person':['Sam','Charlie','Amy','Vanessa','Carl','Sarah'],
       'Sales':[200,120,340,124,243,350]}

In [None]:
df = pd.DataFrame(data)

In [None]:
df

Unnamed: 0,Company,Person,Sales
0,GOOG,Sam,200
1,GOOG,Charlie,120
2,MSFT,Amy,340
3,MSFT,Vanessa,124
4,FB,Carl,243
5,FB,Sarah,350


**Now you can use the .groupby() method to group rows based on a column name. For example, let's group by company. This creates a DataFrameGroupBy object:**

In [None]:
df.groupby('Company')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f5e9ffe9358>

In [None]:
by_comp = df.groupby("Company")

In [None]:
by_comp.mean(numeric_only=True)  # skip the non-numeric Person column

Unnamed: 0_level_0,Sales
Company,Unnamed: 1_level_1
FB,296.5
GOOG,160.0
MSFT,232.0


In [None]:
df.groupby('Company').mean(numeric_only=True)

Unnamed: 0_level_0,Sales
Company,Unnamed: 1_level_1
FB,296.5
GOOG,160.0
MSFT,232.0


In [None]:
by_comp.std(numeric_only=True)

Unnamed: 0_level_0,Sales
Company,Unnamed: 1_level_1
FB,75.660426
GOOG,56.568542
MSFT,152.735065


In [None]:
by_comp.min()

Unnamed: 0_level_0,Person,Sales
Company,Unnamed: 1_level_1,Unnamed: 2_level_1
FB,Carl,243
GOOG,Charlie,120
MSFT,Amy,124


In [None]:
by_comp.max()

Unnamed: 0_level_0,Person,Sales
Company,Unnamed: 1_level_1,Unnamed: 2_level_1
FB,Sarah,350
GOOG,Sam,200
MSFT,Vanessa,340


In [None]:
by_comp.count()

Unnamed: 0_level_0,Person,Sales
Company,Unnamed: 1_level_1,Unnamed: 2_level_1
FB,2,2
GOOG,2,2
MSFT,2,2


In [None]:
by_comp.describe()

Unnamed: 0_level_0,Sales,Sales,Sales,Sales,Sales,Sales,Sales,Sales
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max
Company,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2
FB,2.0,296.5,75.660426,243.0,269.75,296.5,323.25,350.0
GOOG,2.0,160.0,56.568542,120.0,140.0,160.0,180.0,200.0
MSFT,2.0,232.0,152.735065,124.0,178.0,232.0,286.0,340.0


In [None]:
by_comp.describe().transpose()

Unnamed: 0,Company,FB,GOOG,MSFT
Sales,count,2.0,2.0,2.0
Sales,mean,296.5,160.0,232.0
Sales,std,75.660426,56.568542,152.735065
Sales,min,243.0,120.0,124.0
Sales,25%,269.75,140.0,178.0
Sales,50%,296.5,160.0,232.0
Sales,75%,323.25,180.0,286.0
Sales,max,350.0,200.0,340.0


In [None]:
by_comp.describe().transpose()['GOOG']

Sales  count      2.000000
       mean     160.000000
       std       56.568542
       min      120.000000
       25%      140.000000
       50%      160.000000
       75%      180.000000
       max      200.000000
Name: GOOG, dtype: float64

In [None]:
df.groupby('Company')[['Sales']].agg(["mean","min","max"])  # aggregate only the numeric Sales column

Unnamed: 0_level_0,Sales,Sales,Sales
Unnamed: 0_level_1,mean,min,max
Company,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
FB,296.5,243,350
GOOG,160.0,120,200
MSFT,232.0,124,340


In [None]:
func=lambda x: min(x)

In [None]:
df.groupby('Company').agg([func,"max"])

Unnamed: 0_level_0,Person,Person,Sales,Sales
Unnamed: 0_level_1,<lambda_0>,max,<lambda_0>,max
Company,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
FB,Carl,Sarah,243,350
GOOG,Charlie,Sam,120,200
MSFT,Amy,Vanessa,124,340
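
Passing a lambda to agg produces column labels such as <lambda_0>. One way to get readable labels is named aggregation, where each keyword becomes an output column. A sketch on the same data; the output names total_sales, best_sale and people are illustrative choices:

```python
import pandas as pd

data = {'Company': ['GOOG', 'GOOG', 'MSFT', 'MSFT', 'FB', 'FB'],
        'Person': ['Sam', 'Charlie', 'Amy', 'Vanessa', 'Carl', 'Sarah'],
        'Sales': [200, 120, 340, 124, 243, 350]}
df = pd.DataFrame(data)

# Each keyword is an output column: (source column, aggregation).
summary = df.groupby('Company').agg(
    total_sales=('Sales', 'sum'),
    best_sale=('Sales', 'max'),
    people=('Person', 'count'),
)
print(summary)
```

The result has one flat level of columns, which is usually easier to work with than the two-level header produced by passing a list of functions.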


# Operations
There are many pandas operations that are really useful but do not fall into any distinct category. Let's go through them here:

In [None]:
import pandas as pd
df = pd.DataFrame({'col1':[1,2,3,4],'col2':[444,555,666,444],'col3':['abc','def','ghi','xyz']})
df.head()

Unnamed: 0,col1,col2,col3
0,1,444,abc
1,2,555,def
2,3,666,ghi
3,4,444,xyz


## Information about unique values

In [None]:
df['col2'].unique()

array([444, 555, 666])

In [None]:
df['col2'].nunique()

3

In [None]:
df['col2'].value_counts()

444    2
555    1
666    1
Name: col2, dtype: int64

## Selecting data

In [None]:
# Select from a DataFrame using criteria on multiple columns
newdf = df[(df['col1']>2) & (df['col2']==444)]

In [None]:
newdf

Unnamed: 0,col1,col2,col3
3,4,444,xyz


## Applying functions

In [None]:
def times2(x):
    return x*2

In [None]:
df['col1'].apply(times2)

0    2
1    4
2    6
3    8
Name: col1, dtype: int64

In [None]:
df['col3'].apply(len)

0    3
1    3
2    3
3    3
Name: col3, dtype: int64

In [None]:
df['col1'].sum()

10
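
apply also accepts a lambda and, on a whole DataFrame, can work row by row with axis=1. The sketch below is illustrative; row_sum and vectorized are assumed names, and the vectorized form is shown as the usual alternative when plain arithmetic is enough:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4],
                   'col2': [444, 555, 666, 444],
                   'col3': ['abc', 'def', 'ghi', 'xyz']})

# A lambda works just like the named function times2.
doubled = df['col1'].apply(lambda x: x * 2)
print(doubled.tolist())  # [2, 4, 6, 8]

# axis=1 passes each row to the function as a Series.
row_sum = df.apply(lambda row: row['col1'] + row['col2'], axis=1)
print(row_sum.tolist())  # [445, 557, 669, 448]

# The same result using column arithmetic, without apply.
vectorized = df['col1'] + df['col2']
print(vectorized.tolist())  # [445, 557, 669, 448]
```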

**Permanently removing a column**

In [None]:
del df['col1']

In [None]:
df

Unnamed: 0,col2,col3
0,444,abc
1,555,def
2,666,ghi
3,444,xyz


**Getting index and column names:**

In [None]:
df.columns

Index(['col2', 'col3'], dtype='object')

In [None]:
df.index

RangeIndex(start=0, stop=4, step=1)

**Sorting a DataFrame:**

In [None]:
df.sort_values(by='col2')  # inplace=False by default

Unnamed: 0,col2,col3
0,444,abc
3,444,xyz
1,555,def
2,666,ghi


**Finding or checking for null values**

In [None]:
df.isnull()

Unnamed: 0,col2,col3
0,False,False
1,False,False
2,False,False
3,False,False


In [None]:
# Drop rows with NaN values
df.dropna()

Unnamed: 0,col2,col3
0,444,abc
1,555,def
2,666,ghi
3,444,xyz
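
The frame above has no missing values, so isnull and dropna leave it unchanged. A sketch with a frame that does contain gaps (the values are illustrative assumptions):

```python
import numpy as np
import pandas as pd

# Illustrative frame with one missing value per column.
df = pd.DataFrame({'col2': [444, np.nan, 666],
                   'col3': ['abc', 'def', None]})

# Count the missing values in each column.
missing_per_column = df.isnull().sum()
print(missing_per_column['col2'], missing_per_column['col3'])  # 1 1

# dropna keeps only the rows without any missing value.
complete_rows = df.dropna()
print(list(complete_rows.index))  # [0]

# fillna replaces missing values; here only in col2.
filled = df.fillna({'col2': 0})
print(filled['col2'].tolist())  # [444.0, 0.0, 666.0]
```

Like drop, both dropna and fillna return a new DataFrame and leave the original untouched.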


## Pivot tables

In [None]:
data = {'A':['foo','foo','foo','bar','bar','bar'],
     'B':['one','one','two','two','one','one'],
       'C':['x','y','x','y','x','y'],
       'D':[1,3,2,5,4,1]}

df = pd.DataFrame(data)

In [None]:
df

Unnamed: 0,A,B,C,D
0,foo,one,x,1
1,foo,one,y,3
2,foo,two,x,2
3,bar,two,y,5
4,bar,one,x,4
5,bar,one,y,1


In [None]:
df.pivot_table(values='D',index=['A', 'B'],columns=['C'])

Unnamed: 0_level_0,C,x,y
A,B,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,one,4.0,1.0
bar,two,,5.0
foo,one,1.0,3.0
foo,two,2.0,
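
In the table above every combination of index and column labels occurs at most once, so no aggregation is visible. pivot_table averages duplicates by default; the sketch below uses a small illustrative frame with a repeated combination to show the default and the aggfunc and fill_value parameters:

```python
import pandas as pd

# Illustrative data: ('foo', 'x') appears twice.
df = pd.DataFrame({'A': ['foo', 'foo', 'foo', 'bar'],
                   'C': ['x', 'x', 'y', 'x'],
                   'D': [1, 3, 2, 5]})

# Default aggregation is the mean; missing combinations become NaN.
mean_table = df.pivot_table(values='D', index='A', columns='C')
print(mean_table.loc['foo', 'x'])  # 2.0

# Sum instead of mean, and fill missing combinations with 0.
sum_table = df.pivot_table(values='D', index='A', columns='C',
                           aggfunc='sum', fill_value=0)
print(sum_table.loc['foo', 'x'])  # 4
print(sum_table.loc['bar', 'y'])  # 0
```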


# Merging, Joining, and Concatenating

In [None]:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                        'B': ['B0', 'B1', 'B2', 'B3'],
                        'C': ['C0', 'C1', 'C2', 'C3'],
                        'D': ['D0', 'D1', 'D2', 'D3']},
                        index=[0, 1, 2, 3])

df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                        'B': ['B4', 'B5', 'B6', 'B7'],
                        'C': ['C4', 'C5', 'C6', 'C7'],
                        'D': ['D4', 'D5', 'D6', 'D7']},
                         index=[4, 5, 6, 7]) 

df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
                        'B': ['B8', 'B9', 'B10', 'B11'],
                        'C': ['C8', 'C9', 'C10', 'C11'],
                        'D': ['D8', 'D9', 'D10', 'D11']},
                        index=[8, 9, 10, 11])

In [None]:
df1

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3


In [None]:
df2

Unnamed: 0,A,B,C,D
4,A4,B4,C4,D4
5,A5,B5,C5,D5
6,A6,B6,C6,D6
7,A7,B7,C7,D7


In [None]:
df3

Unnamed: 0,A,B,C,D
8,A8,B8,C8,D8
9,A9,B9,C9,D9
10,A10,B10,C10,D10
11,A11,B11,C11,D11


# Concatenation
Concatenation basically glues DataFrames together. Keep in mind that the labels on the other axis should match; where they do not, pandas fills the gaps with NaN. You can use **pd.concat** and pass in a list of DataFrames to concatenate:

In [None]:
pd.concat([df1,df2,df3])

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3
4,A4,B4,C4,D4
5,A5,B5,C5,D5
6,A6,B6,C6,D6
7,A7,B7,C7,D7
8,A8,B8,C8,D8
9,A9,B9,C9,D9
10,A10,B10,C10,D10
11,A11,B11,C11,D11


In [None]:
pd.concat([df1,df2,df3],axis=1)

Unnamed: 0,A,B,C,D,A.1,B.1,C.1,D.1,A.2,B.2,C.2,D.2
0,A0,B0,C0,D0,,,,,,,,
1,A1,B1,C1,D1,,,,,,,,
2,A2,B2,C2,D2,,,,,,,,
3,A3,B3,C3,D3,,,,,,,,
4,,,,,A4,B4,C4,D4,,,,
5,,,,,A5,B5,C5,D5,,,,
6,,,,,A6,B6,C6,D6,,,,
7,,,,,A7,B7,C7,D7,,,,
8,,,,,,,,,A8,B8,C8,D8
9,,,,,,,,,A9,B9,C9,D9
10,,,,,,,,,A10,B10,C10,D10
11,,,,,,,,,A11,B11,C11,D11
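
The frames above have disjoint indexes, so stacking them gives unique row labels. When the indexes overlap, concat keeps the duplicates unless told otherwise. A sketch with two small illustrative frames (first and second are assumed names):

```python
import pandas as pd

# Both frames use the default index 0, 1.
first = pd.DataFrame({'A': ['A0', 'A1']})
second = pd.DataFrame({'A': ['A2', 'A3']})

# Plain concat keeps the original labels, so they repeat.
stacked = pd.concat([first, second])
print(list(stacked.index))  # [0, 1, 0, 1]

# ignore_index=True renumbers the rows.
renumbered = pd.concat([first, second], ignore_index=True)
print(list(renumbered.index))  # [0, 1, 2, 3]

# keys= adds an outer index level recording where each row came from.
labeled = pd.concat([first, second], keys=['first', 'second'])
print(labeled.loc['second']['A'].tolist())  # ['A2', 'A3']
```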


In [None]:
left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
   
right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                          'C': ['C0', 'C1', 'C2', 'C3'],
                          'D': ['D0', 'D1', 'D2', 'D3']}) 

In [None]:
left

Unnamed: 0,key,A,B
0,K0,A0,B0
1,K1,A1,B1
2,K2,A2,B2
3,K3,A3,B3


In [None]:
right

Unnamed: 0,key,C,D
0,K0,C0,D0
1,K1,C1,D1
2,K2,C2,D2
3,K3,C3,D3


## Merging
The merge function lets you combine DataFrames using logic similar to joining SQL tables. For example:



In [None]:
pd.merge(left,right,how='inner',on='key')

Unnamed: 0,key,A,B,C,D
0,K0,A0,B0,C0,D0
1,K1,A1,B1,C1,D1
2,K2,A2,B2,C2,D2
3,K3,A3,B3,C3,D3


Or, to show a more complicated example:

In [None]:
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                        'A': ['A0', 'A1', 'A2', 'A3'],
                        'B': ['B0', 'B1', 'B2', 'B3']})
    
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                               'key2': ['K0', 'K0', 'K0', 'K0'],
                                  'C': ['C0', 'C1', 'C2', 'C3'],
                                  'D': ['D0', 'D1', 'D2', 'D3']})

In [None]:
left

Unnamed: 0,key1,key2,A,B
0,K0,K0,A0,B0
1,K0,K1,A1,B1
2,K1,K0,A2,B2
3,K2,K1,A3,B3


In [None]:
right

Unnamed: 0,key1,key2,C,D
0,K0,K0,C0,D0
1,K1,K0,C1,D1
2,K1,K0,C2,D2
3,K2,K0,C3,D3


In [None]:
pd.merge(left, right, on=['key1', 'key2'])

Unnamed: 0,key1,key2,A,B,C,D
0,K0,K0,A0,B0,C0,D0
1,K1,K0,A2,B2,C1,D1
2,K1,K0,A2,B2,C2,D2


In [None]:
pd.merge(left, right, how='outer', on=['key1', 'key2'])  # keeps every key combination from both DataFrames

Unnamed: 0,key1,key2,A,B,C,D
0,K0,K0,A0,B0,C0,D0
1,K0,K1,A1,B1,,
2,K1,K0,A2,B2,C1,D1
3,K1,K0,A2,B2,C2,D2
4,K2,K1,A3,B3,,
5,K2,K0,,,C3,D3


In [None]:
pd.merge(left, right, how='right', on=['key1', 'key2'])

Unnamed: 0,key1,key2,A,B,C,D
0,K0,K0,A0,B0,C0,D0
1,K1,K0,A2,B2,C1,D1
2,K1,K0,A2,B2,C2,D2
3,K2,K0,,,C3,D3


In [None]:
pd.merge(left, right, how='left', on=['key1', 'key2'])

Unnamed: 0,key1,key2,A,B,C,D
0,K0,K0,A0,B0,C0,D0
1,K0,K1,A1,B1,,
2,K1,K0,A2,B2,C1,D1
3,K1,K0,A2,B2,C2,D2
4,K2,K1,A3,B3,,
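
To see which side each row of a merge came from, pd.merge accepts indicator=True, which adds a _merge column. A sketch on two small illustrative frames with a single key column:

```python
import pandas as pd

# Illustrative frames: K1 is the only key present on both sides.
left = pd.DataFrame({'key': ['K0', 'K1'], 'A': ['A0', 'A1']})
right = pd.DataFrame({'key': ['K1', 'K2'], 'C': ['C1', 'C2']})

merged = pd.merge(left, right, how='outer', on='key', indicator=True)

# Map each key to the side(s) it was found on.
origin = dict(zip(merged['key'], merged['_merge'].astype(str)))
print(origin)
```

Here K0 is reported as left_only, K1 as both and K2 as right_only, which mirrors the difference between the left, inner and right results above.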


## Joining
join is a convenient method for combining the columns of two potentially differently indexed DataFrames into a single result DataFrame

In [None]:
left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                     'B': ['B0', 'B1', 'B2']},
                      index=['K0', 'K1', 'K2']) 

right = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
                    'D': ['D0', 'D2', 'D3']},
                      index=['K0', 'K2', 'K3'])

In [None]:
left

Unnamed: 0,A,B
K0,A0,B0
K1,A1,B1
K2,A2,B2


In [None]:
right

Unnamed: 0,C,D
K0,C0,D0
K2,C2,D2
K3,C3,D3


In [None]:
left.join(right)

Unnamed: 0,A,B,C,D
K0,A0,B0,C0,D0
K1,A1,B1,,
K2,A2,B2,C2,D2


In [None]:
left.join(right, how='outer')

Unnamed: 0,A,B,C,D
K0,A0,B0,C0,D0
K1,A1,B1,,
K2,A2,B2,C2,D2
K3,,,C3,D3
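
join matches rows on the index, which corresponds to a merge with left_index=True and right_index=True. A sketch comparing the two on frames shaped like the ones above (reduced to one column each for brevity):

```python
import pandas as pd

left = pd.DataFrame({'A': ['A0', 'A1', 'A2']}, index=['K0', 'K1', 'K2'])
right = pd.DataFrame({'C': ['C0', 'C2', 'C3']}, index=['K0', 'K2', 'K3'])

# Inner join: only the index labels present in both frames survive.
joined = left.join(right, how='inner')
print(list(joined.index))  # ['K0', 'K2']

# The same operation expressed with merge on the indexes.
merged = pd.merge(left, right, left_index=True, right_index=True, how='inner')
print(merged.values.tolist())  # [['A0', 'C0'], ['A2', 'C2']]
```

Note that join defaults to how='left', while merge defaults to how='inner'.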


# SERIES

The first main data type we'll learn about in pandas is the Series. Let's import pandas and explore the Series object.

A Series is very similar to a NumPy array (in fact, it is built on top of the NumPy array object). What differentiates a Series from a NumPy array is that a Series can have axis labels, meaning it can be indexed by a label instead of just a numeric position. It also doesn't need to hold numeric data; it can hold any arbitrary Python object.

Let's explore this concept through some examples:

Creating a Series:

You can convert a list, a NumPy array, or a dictionary to a Series:

In [None]:
import numpy as np
import pandas as pd

In [None]:
labels = ['a','b','c']
my_list = [10,20,30]
arr = np.array([10,20,30])
d = {'a':10,'b':20,'c':30}

**Using lists**

In [None]:
pd.Series(data=my_list)

0    10
1    20
2    30
dtype: int64

In [None]:
pd.Series(data=my_list,index=labels)

a    10
b    20
c    30
dtype: int64

In [None]:
pd.Series(my_list,labels)

a    10
b    20
c    30
dtype: int64

**NumPy arrays**

In [None]:
pd.Series(arr)

0    10
1    20
2    30
dtype: int64

In [None]:
pd.Series(arr,labels)

a    10
b    20
c    30
dtype: int64

In [None]:
pd.Series(d)

a    10
b    20
c    30
dtype: int64

Data in a Series
A pandas Series can hold a variety of object types:

In [None]:
pd.Series(data=labels)

0    a
1    b
2    c
dtype: object

In [None]:
# Even functions, although you are unlikely to use this
pd.Series([sum,len])

0    <built-in function sum>
1    <built-in function len>
dtype: object

## Using an index
The key to using a Series is understanding its index. pandas uses these index names or numbers to allow fast lookups of information (it works like a hash table or dictionary).

Let's see some examples of how to get information out of a Series. Let's create two Series, ser1 and ser2:

In [None]:
ser1 = pd.Series([1,2,3,4],index = ['USA', 'Germany','USSR', 'Japan'])  

In [None]:
ser1

USA        1
Germany    2
USSR       3
Japan      4
dtype: int64

In [None]:
ser2 = pd.Series([1,2,5,4],index = ['USA', 'Germany','Italy', 'Japan']) 

In [None]:
ser2

USA        1
Germany    2
Italy      5
Japan      4
dtype: int64

In [None]:
ser1['USA']

1

Operations are also aligned on the index; a label present in only one Series produces NaN:

In [None]:
ser1 + ser2

Germany    4.0
Italy      NaN
Japan      8.0
USA        2.0
USSR       NaN
dtype: float64
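
If a missing label should count as zero instead of producing NaN, the add method accepts a fill_value. A sketch on the same two Series; plain and filled are illustrative names:

```python
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4], index=['USA', 'Germany', 'USSR', 'Japan'])
ser2 = pd.Series([1, 2, 5, 4], index=['USA', 'Germany', 'Italy', 'Japan'])

# The + operator: labels missing on either side give NaN.
plain = ser1 + ser2
print(plain.isna().sum())  # 2

# add with fill_value=0 treats a missing label as 0 before adding.
filled = ser1.add(ser2, fill_value=0)
print(filled['Italy'], filled['USSR'])  # 5.0 3.0
```

The result is still float64, because the alignment step introduces missing values before they are filled.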