## The Pandas Library

Pandas is a library for data processing and analysis (loading, preparing, manipulating, modeling, and analyzing data). It provides flexible data structures that make this work efficient.

Pandas offers the following data structures:

- Series: one-dimensional arrays with an index (labeled arrays), similar to dictionaries. They can be created from dictionaries or from lists.

- DataFrame: data structures similar to the tables of a relational (SQL) database.

- Panel, Panel4D and PanelND: data structures for working with more than two dimensions. This notebook does not cover them, and they have since been removed from pandas (Panel in version 0.25).

### Installation

If you have Anaconda, you can install Pandas from your terminal or command prompt with:

conda install pandas

If you do not have Anaconda on your computer, install Pandas from your terminal with:

pip install pandas
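
Some calls used in this notebook behave differently across pandas releases, so it can help to confirm which version ended up installed. A minimal check, assuming the installation succeeded (the printed version depends on your environment):

```python
import pandas as pd

# Print the installed pandas version; the exact string varies by environment.
print(pd.__version__)
```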

## Working with pandas

In [2]:
import pandas as pd  # import the library

import numpy as np

## Series

A Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating-point numbers, Python objects, etc.). The axis labels are collectively called the index.

In [5]:
a = pd.Series([1,2,3,4,5])
a

0    1
1    2
2    3
3    4
4    5
dtype: int64

In [7]:
a.index

RangeIndex(start=0, stop=5, step=1)

In [8]:
a.values

array([1, 2, 3, 4, 5])

In [10]:
b = pd.Series([1,2,3,4,5], index = ['a','b','c','d','e'])
b

a    1
b    2
c    3
d    4
e    5
dtype: int64

In [11]:
b.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [12]:
b.values

array([1, 2, 3, 4, 5])

In [16]:
c_prev = {'m': 1, 'n':2, 'o':3, 'p':4, 'q':5}
c = pd.Series(c_prev)
c

m    1
n    2
o    3
p    4
q    5
dtype: int64

In [17]:
c.index

Index(['m', 'n', 'o', 'p', 'q'], dtype='object')

In [18]:
c.values

array([1, 2, 3, 4, 5])

In [13]:
a[3]

4

In [15]:
b['d']

4

In [20]:
pd.Series(a, index = [0,1,2,3,4,5])

0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
5    NaN
dtype: float64

In [21]:
pd.Series(4.5, index = ['a','b','c','d'])

a    4.5
b    4.5
c    4.5
d    4.5
dtype: float64

In [22]:
a % 2 == 0

0    False
1     True
2    False
3     True
4    False
dtype: bool

In [23]:
a[a % 2 == 0]

1    2
3    4
dtype: int64

In [24]:
a * 2

0     2
1     4
2     6
3     8
4    10
dtype: int64

In [25]:
a * a

0     1
1     4
2     9
3    16
4    25
dtype: int64

In [26]:
d = pd.Series([-1,-2,-5], index = ['a','b','e'])
d

a   -1
b   -2
e   -5
dtype: int64

In [28]:
b + d  # operations between Series align on index labels; labels missing from either Series produce NaN

a    0.0
b    0.0
c    NaN
d    NaN
e    0.0
dtype: float64
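
When the NaN results from unmatched labels are not wanted, one option is the `add` method with `fill_value`. The sketch below rebuilds `b` and `d` from the cells above; `plain` and `filled` are illustrative names, not part of the original notebook.

```python
import pandas as pd

b = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
d = pd.Series([-1, -2, -5], index=['a', 'b', 'e'])

# The + operator aligns on labels; 'c' and 'd' exist only in b, so they become NaN.
plain = b + d

# add() with fill_value=0 treats a label missing from one side as 0 before adding.
filled = b.add(d, fill_value=0)
print(filled)
```

With `fill_value=0`, the labels `c` and `d` keep their values from `b` (3.0 and 4.0) instead of turning into NaN.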

### Indexing and Slicing

![Python slicing](https://infohost.nmt.edu/tcc/help/pubs/python/web/fig/slicing.png)

In [29]:
a[3]

4

In [30]:
a[1:4]

1    2
2    3
3    4
dtype: int64

In [31]:
a[-1:]

4    5
dtype: int64

In [32]:
a[:4]

0    1
1    2
2    3
3    4
dtype: int64

In [34]:
e = pd.Series([2,4,6,8], name ='pares')
e

0    2
1    4
2    6
3    8
Name: pares, dtype: int64

In [35]:
e.rename('Par')

0    2
1    4
2    6
3    8
Name: Par, dtype: int64

In [39]:
e.index.name = 'Enteros'
e

Enteros
0    2
1    4
2    6
3    8
Name: pares, dtype: int64

## DataFrames

A DataFrame is a two-dimensional labeled data structure whose columns can have different types. You can think of it as a spreadsheet, a SQL table, or a dict of Series objects. It is generally the most commonly used pandas object. Like Series, DataFrame accepts many different kinds of input:

- Dicts of 1-D ndarrays, lists, or Series
- 2-D ndarray
- Structured or record ndarray
- A Series
- Another DataFrame

Note:

A dict is a mutable collection of key-value pairs, indexed by key. Since Python 3.7 a dict preserves insertion order; in older versions the order was not guaranteed. In Python, dicts are written with curly braces. When the data is a dict and the columns are not specified:
- The DataFrame columns follow the insertion order of the dict if you are using Python >= 3.6 and Pandas >= 0.23.
- The DataFrame columns follow the lexical order of the dict keys if you are using Python < 3.6 or Pandas < 0.23.
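
The ordering rules above can be checked with a short sketch. The dict `data` and its keys are made up for illustration; the first result assumes a recent Python and pandas, while passing `columns=` fixes the order explicitly on any version.

```python
import pandas as pd

# Keys deliberately inserted in non-alphabetical order.
data = {'zeta': [1, 2], 'alpha': [3, 4], 'mid': [5, 6]}

# On Python >= 3.7 with a recent pandas, columns follow insertion order.
df = pd.DataFrame(data)
print(list(df.columns))

# Passing columns= sets the order explicitly, independent of the version.
df_sorted = pd.DataFrame(data, columns=sorted(data))
print(list(df_sorted.columns))
```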

Example of a dict object:

In [2]:
thisdict = {"brand": "Ford", "model": "Mustang", "year": 1964}
thisdict

{'brand': 'Ford', 'model': 'Mustang', 'year': 1964}

### Creating DataFrames

From a dict of Series or dicts

In [7]:
a = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']), 'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
a  # the dict of Series that the DataFrame is built from

{'one': a    1.0
 b    2.0
 c    3.0
 dtype: float64, 'two': a    1.0
 b    2.0
 c    3.0
 d    4.0
 dtype: float64}

In [9]:
df_a = pd.DataFrame(a)
df_a

Unnamed: 0,one,two
a,1.0,1.0
b,2.0,2.0
c,3.0,3.0
d,,4.0


In [10]:
pd.DataFrame(a, index = ['d','c','b','a'])

Unnamed: 0,one,two
d,,4.0
c,3.0,3.0
b,2.0,2.0
a,1.0,1.0


In [11]:
pd.DataFrame(a, index = ['a','b','c'], columns = ['two','three'])

Unnamed: 0,two,three
a,1.0,
b,2.0,
c,3.0,


From a dict of lists / arrays

In [28]:
b = {'one': [1., 2., 3., 4.], 'two': [4., 3., 2., 1.]}
b

{'one': [1.0, 2.0, 3.0, 4.0], 'two': [4.0, 3.0, 2.0, 1.0]}

In [29]:
df_b = pd.DataFrame(b)
df_b

Unnamed: 0,one,two
0,1.0,4.0
1,2.0,3.0
2,3.0,2.0
3,4.0,1.0


In [14]:
pd.DataFrame(b, index = ['a','b','c','d'])

Unnamed: 0,one,two
a,1.0,4.0
b,2.0,3.0
c,3.0,2.0
d,4.0,1.0


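From a structured or record array

In [20]:
m = np.zeros((2, ), dtype=[('A', 'i4'), ('B', 'f4'), ('C', 'S10')])  # 'S10' is a 10-byte string; the older 'a10' alias is not accepted by NumPy 2.0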
m

array([(0, 0., b''), (0, 0., b'')],
      dtype=[('A', '<i4'), ('B', '<f4'), ('C', 'S10')])

In [21]:
m[:] = [(1, 2., 'Hello'), (2, 3., "World")]
m

array([(1, 2., b'Hello'), (2, 3., b'World')],
      dtype=[('A', '<i4'), ('B', '<f4'), ('C', 'S10')])

In [22]:
pd.DataFrame(m)

Unnamed: 0,A,B,C
0,1,2.0,b'Hello'
1,2,3.0,b'World'


From a list of dicts

In [23]:
x = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
x

[{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]

In [24]:
pd.DataFrame(x, index = ['first','second'])

Unnamed: 0,a,b,c
first,1,2,
second,5,10,20.0


In [25]:
pd.DataFrame(x, columns = ['a','b'])

Unnamed: 0,a,b
0,1,2
1,5,10


From a dict of tuples

In [26]:
pd.DataFrame({('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
              ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
              ('a', 'c'): {('A', 'B'): 5, ('A', 'C'): 6},
              ('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8},
              ('b', 'b'): {('A', 'D'): 9, ('A', 'B'): 10}})

Unnamed: 0_level_0,Unnamed: 1_level_0,a,a,a,b,b
Unnamed: 0_level_1,Unnamed: 1_level_1,a,b,c,a,b
A,B,4.0,1.0,5.0,8.0,10.0
A,C,3.0,2.0,6.0,7.0,
A,D,,,,,9.0


In [3]:
a = {
    'province' : ['M', 'M', 'M', 'B', 'B'],
    'population': [1.5e6, 2e6, 3e6, 5e5, 1.5e6],
    'year' : [1900, 1950, 2000, 1900, 2000]   
}

df_a = pd.DataFrame(a)
df_a

Unnamed: 0,population,province,year
0,1500000.0,M,1900
1,2000000.0,M,1950
2,3000000.0,M,2000
3,500000.0,B,1900
4,1500000.0,B,2000


In [4]:
df_a2 = pd.DataFrame(df_a, columns=['province','population', 'year', 'debt'])
df_a2

Unnamed: 0,province,population,year,debt
0,M,1500000.0,1900,
1,M,2000000.0,1950,
2,M,3000000.0,2000,
3,B,500000.0,1900,
4,B,1500000.0,2000,


In [5]:
df_a2.index

RangeIndex(start=0, stop=5, step=1)

In [6]:
df_a2.columns

Index(['province', 'population', 'year', 'debt'], dtype='object')

In [7]:
df_a2['population']

0    1500000.0
1    2000000.0
2    3000000.0
3     500000.0
4    1500000.0
Name: population, dtype: float64

In [8]:
df_a2.population

0    1500000.0
1    2000000.0
2    3000000.0
3     500000.0
4    1500000.0
Name: population, dtype: float64

In [9]:
df_a2['2nd_language'] = np.nan
df_a2

Unnamed: 0,province,population,year,debt,2nd_language
0,M,1500000.0,1900,,
1,M,2000000.0,1950,,
2,M,3000000.0,2000,,
3,B,500000.0,1900,,
4,B,1500000.0,2000,,


In [10]:
df_a2['2nd_language']

0   NaN
1   NaN
2   NaN
3   NaN
4   NaN
Name: 2nd_language, dtype: float64

Be careful with column names: if a name starts with a digit, the following attribute-style access does not work.

In [11]:
df_a2.2nd_language

SyntaxError: invalid syntax (<ipython-input-11-8847d22aea3f>, line 1)
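
The error comes from Python syntax rather than from pandas: an attribute name must be a valid identifier. A small sketch of two ways around it, using a made-up two-row frame (`df`, `renamed` and the new column name `second_language` are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'province': ['M', 'B'], '2nd_language': [np.nan, np.nan]})

# A name that starts with a digit is not a valid Python identifier.
print('2nd_language'.isidentifier())  # False

# Option 1: bracket access works for any column name.
col = df['2nd_language']

# Option 2: rename the column to a valid identifier, then use attribute access.
renamed = df.rename(columns={'2nd_language': 'second_language'})
same_col = renamed.second_language
```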

In [12]:
df_a2.index = list('abcde')
df_a2

Unnamed: 0,province,population,year,debt,2nd_language
a,M,1500000.0,1900,,
b,M,2000000.0,1950,,
c,M,3000000.0,2000,,
d,B,500000.0,1900,,
e,B,1500000.0,2000,,


In [13]:
df_a2['debt'] = 10.0
df_a2

Unnamed: 0,province,population,year,debt,2nd_language
a,M,1500000.0,1900,10.0,
b,M,2000000.0,1950,10.0,
c,M,3000000.0,2000,10.0,
d,B,500000.0,1900,10.0,
e,B,1500000.0,2000,10.0,


In [14]:
df_a2['debt'] = [1,0,2,.5,.7]
df_a2

Unnamed: 0,province,population,year,debt,2nd_language
a,M,1500000.0,1900,1.0,
b,M,2000000.0,1950,0.0,
c,M,3000000.0,2000,2.0,
d,B,500000.0,1900,0.5,
e,B,1500000.0,2000,0.7,


In [15]:
df_a2['capital'] = df_a2['province'] == 'M'
df_a2

Unnamed: 0,province,population,year,debt,2nd_language,capital
a,M,1500000.0,1900,1.0,,True
b,M,2000000.0,1950,0.0,,True
c,M,3000000.0,2000,2.0,,True
d,B,500000.0,1900,0.5,,False
e,B,1500000.0,2000,0.7,,False


In [16]:
df_a2['2nd_language'] = 'English'
df_a2

Unnamed: 0,province,population,year,debt,2nd_language,capital
a,M,1500000.0,1900,1.0,English,True
b,M,2000000.0,1950,0.0,English,True
c,M,3000000.0,2000,2.0,English,True
d,B,500000.0,1900,0.5,English,False
e,B,1500000.0,2000,0.7,English,False


In [17]:
df_a2['economy'] = df_a2['year'][3:]
df_a2

Unnamed: 0,province,population,year,debt,2nd_language,capital,economy
a,M,1500000.0,1900,1.0,English,True,
b,M,2000000.0,1950,0.0,English,True,
c,M,3000000.0,2000,2.0,English,True,
d,B,500000.0,1900,0.5,English,False,1900.0
e,B,1500000.0,2000,0.7,English,False,2000.0


In [18]:
df_a2.insert(2,'first_language',df_a2['2nd_language'])
df_a2

Unnamed: 0,province,population,first_language,year,debt,2nd_language,capital,economy
a,M,1500000.0,English,1900,1.0,English,True,
b,M,2000000.0,English,1950,0.0,English,True,
c,M,3000000.0,English,2000,2.0,English,True,
d,B,500000.0,English,1900,0.5,English,False,1900.0
e,B,1500000.0,English,2000,0.7,English,False,2000.0


In [19]:
del df_a2['2nd_language']

In [20]:
debt = df_a2.pop('debt')

In [44]:
df_a2.drop('economy', axis = 1, inplace = True)  # inplace = True modifies df_a2 directly instead of returning a new DataFrame

In [46]:
df_a2.drop(['e'], inplace = True)
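
For comparison, `drop` without `inplace = True` leaves the original untouched and returns a modified copy. A minimal sketch on a made-up frame (`df`, `no_y` and `no_c` are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]}, index=['a', 'b', 'c'])

# Without inplace=True, drop returns a new DataFrame and df is unchanged.
no_y = df.drop(columns='y')   # same as df.drop('y', axis=1)
no_c = df.drop(index='c')     # same as df.drop(['c'])

print(df.shape, no_y.shape, no_c.shape)  # (3, 2) (3, 1) (2, 2)
```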

In [47]:
df_a2.T

Unnamed: 0,a,b,c,d
province,M,M,M,B
population,1.5e+06,2e+06,3e+06,500000
first_language,English,English,English,English
year,1900,1950,2000,1900
capital,True,True,True,False


In [48]:
df_a2.describe()

Unnamed: 0,population,year
count,4.0,4.0
mean,1750000.0,1937.5
std,1040833.0,47.871355
min,500000.0,1900.0
25%,1250000.0,1900.0
50%,1750000.0,1925.0
75%,2250000.0,1962.5
max,3000000.0,2000.0


In [49]:
df_a2.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
population,4.0,1750000.0,1040833.0,500000.0,1250000.0,1750000.0,2250000.0,3000000.0
year,4.0,1937.5,47.87136,1900.0,1900.0,1925.0,1962.5,2000.0


In [50]:
df_a2

Unnamed: 0,province,population,first_language,year,capital
a,M,1500000.0,English,1900,True
b,M,2000000.0,English,1950,True
c,M,3000000.0,English,2000,True
d,B,500000.0,English,1900,False


In [26]:
df_a2.index[1]

'b'

In [27]:
df_a2.loc['c']

province                M
population          3e+06
first_language    English
year                 2000
capital              True
Name: c, dtype: object

In [28]:
df_a2.iloc[2]

province                M
population          3e+06
first_language    English
year                 2000
capital              True
Name: c, dtype: object
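
The two cells above return the same row because `loc` selects by label and `iloc` by integer position. The sketch below, on a made-up one-column frame, also shows how the two differ when slicing:

```python
import pandas as pd

df = pd.DataFrame({'year': [1900, 1950, 2000, 1900]}, index=list('abcd'))

by_label = df.loc['c', 'year']    # row labeled 'c'
by_position = df.iloc[2, 0]       # third row, first column

# Label slices include the end label; positional slices exclude the end position.
label_slice = df.loc['a':'c']     # rows a, b, c
position_slice = df.iloc[0:2]     # rows a, b
```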

### Indexing and Slicing

![Python slicing](https://infohost.nmt.edu/tcc/help/pubs/python/web/fig/slicing.png)

In [52]:
df_a2[2:]

Unnamed: 0,province,population,first_language,year,capital
c,M,3000000.0,English,2000,True
d,B,500000.0,English,1900,False


In [53]:
df_a2[:2]

Unnamed: 0,province,population,first_language,year,capital
a,M,1500000.0,English,1900,True
b,M,2000000.0,English,1950,True


In [59]:
df_a2[-1:]

Unnamed: 0,province,population,first_language,year,capital
d,B,500000.0,English,1900,False


In [60]:
df_a2[:-1]

Unnamed: 0,province,population,first_language,year,capital
a,M,1500000.0,English,1900,True
b,M,2000000.0,English,1950,True
c,M,3000000.0,English,2000,True


In [64]:
df_a2[df_a2['population'] > 1800000.0]

Unnamed: 0,province,population,first_language,year,capital
b,M,2000000.0,English,1950,True
c,M,3000000.0,English,2000,True


In [65]:
df_a2[(df_a2['population'] > 1800000.0) & (df_a2['year'] < 2000)]

Unnamed: 0,province,population,first_language,year,capital
b,M,2000000.0,English,1950,True


In [66]:
even1 = df_a2['population'] > 1800000.0
even2 = df_a2['year'] < 2000

In [67]:
df_a2[even1 & even2]

Unnamed: 0,province,population,first_language,year,capital
b,M,2000000.0,English,1950,True


In [68]:
df_a2[even1][even2]  # chained boolean indexing makes pandas emit a UserWarning (the key is reindexed); prefer df_a2[even1 & even2]


Unnamed: 0,province,population,first_language,year,capital
b,M,2000000.0,English,1950,True
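
A single combined mask avoids the chained lookup. The sketch below rebuilds only the two columns involved; `big`, `old`, `by_mask` and `by_query` are illustrative names, and `query` is shown as one possible alternative spelling of the same filter.

```python
import pandas as pd

df = pd.DataFrame(
    {'population': [1.5e6, 2e6, 3e6, 5e5], 'year': [1900, 1950, 2000, 1900]},
    index=list('abcd'),
)

big = df['population'] > 1_800_000
old = df['year'] < 2000

# One combined boolean mask, applied in a single step.
by_mask = df.loc[big & old]

# The same filter written as a query string.
by_query = df.query('population > 1800000 and year < 2000')
```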


### Special Functions

In [69]:
np.sqrt(df_a2['year'])

a    43.588989
b    44.158804
c    44.721360
d    43.588989
Name: year, dtype: float64

In [70]:
df_b = pd.DataFrame(np.random.randn(4,3) * 17 + 15, columns=list('bde'), index=list('BMPZ'))
df_b

Unnamed: 0,b,d,e
B,5.76595,22.495522,14.837095
M,5.377841,30.645967,11.468873
P,-18.325052,28.345197,4.924948
Z,24.57595,41.260861,44.199906


In [71]:
np.abs(df_b)

Unnamed: 0,b,d,e
B,5.76595,22.495522,14.837095
M,5.377841,30.645967,11.468873
P,18.325052,28.345197,4.924948
Z,24.57595,41.260861,44.199906


In [72]:
df_b.apply(lambda x: x.max()-x.min())

b    42.901002
d    18.765339
e    39.274958
dtype: float64

In [73]:
df_b.applymap(lambda x: x % 10)  # applymap is deprecated since pandas 2.1 in favor of DataFrame.map

Unnamed: 0,b,d,e
B,5.76595,2.495522,4.837095
M,5.377841,0.645967,1.468873
P,1.674948,8.345197,4.924948
Z,4.57595,1.260861,4.199906


In [81]:
df_b.apply(lambda x: x.max()-x.min(), axis = 1)

B    16.729572
M    25.268126
P    46.670249
Z    19.623956
dtype: float64

In [87]:
def f(m):
    return pd.Series([m.min(),m.max()], index = ['min','max'])

df_b.apply(f)

Unnamed: 0,b,d,e
min,-18.325052,22.495522,4.924948
max,24.57595,41.260861,44.199906


In [89]:
for item in df_b.items():
    print(item)

('b', B     5.765950
M     5.377841
P   -18.325052
Z    24.575950
Name: b, dtype: float64)
('d', B    22.495522
M    30.645967
P    28.345197
Z    41.260861
Name: d, dtype: float64)
('e', B    14.837095
M    11.468873
P     4.924948
Z    44.199906
Name: e, dtype: float64)


In [90]:
for item in df_b.iteritems():  # iteritems() was removed in pandas 2.0; items() is the equivalent
    print(item)

('b', B     5.765950
M     5.377841
P   -18.325052
Z    24.575950
Name: b, dtype: float64)
('d', B    22.495522
M    30.645967
P    28.345197
Z    41.260861
Name: d, dtype: float64)
('e', B    14.837095
M    11.468873
P     4.924948
Z    44.199906
Name: e, dtype: float64)


In [94]:
def two_digit(m):
    return '%.2f' % m

In [97]:
df_b.applymap(two_digit)

Unnamed: 0,b,d,e
B,5.77,22.5,14.84
M,5.38,30.65,11.47
P,-18.33,28.35,4.92
Z,24.58,41.26,44.2
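
Formatting with `'%.2f'` turns every cell into a string. If the values need to stay numeric, `round` is one alternative. The sketch below uses two rows copied from `df_b`; the `elementwise` helper is an assumption added here so the same code runs on pandas versions with or without `DataFrame.map`.

```python
import pandas as pd

df = pd.DataFrame(
    {'b': [5.76595, -18.325052], 'd': [22.495522, 28.345197]},
    index=['B', 'P'],
)

# round keeps the values as floats, so they can still be used in arithmetic.
rounded = df.round(2)

# Element-wise string formatting; use DataFrame.map when this pandas version has it.
elementwise = df.map if hasattr(df, 'map') else df.applymap
as_text = elementwise(lambda v: '%.2f' % v)

print(rounded.dtypes.tolist(), as_text.dtypes.tolist())
```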


In [102]:
df_b.sort_index(ascending = False)

Unnamed: 0,b,d,e
Z,24.57595,41.260861,44.199906
P,-18.325052,28.345197,4.924948
M,5.377841,30.645967,11.468873
B,5.76595,22.495522,14.837095


In [103]:
df_b.sort_index(ascending = False, axis = 1)

Unnamed: 0,e,d,b
B,14.837095,22.495522,5.76595
M,11.468873,30.645967,5.377841
P,4.924948,28.345197,-18.325052
Z,44.199906,41.260861,24.57595


In [106]:
df_b.sort_values(by = 'd')

Unnamed: 0,b,d,e
B,5.76595,22.495522,14.837095
P,-18.325052,28.345197,4.924948
M,5.377841,30.645967,11.468873
Z,24.57595,41.260861,44.199906


In [107]:
df_b.sort_values(by = ['d','e'])

Unnamed: 0,b,d,e
B,5.76595,22.495522,14.837095
P,-18.325052,28.345197,4.924948
M,5.377841,30.645967,11.468873
Z,24.57595,41.260861,44.199906


In [109]:
df_c = pd.Series([2,3,8,4,3,2,1], index=list('abcdefg'))
df_c.sort_values()

g    1
a    2
f    2
b    3
e    3
d    4
c    8
dtype: int64

In [112]:
df_c.rank()

a    2.5
b    4.5
c    7.0
d    6.0
e    4.5
f    2.5
g    1.0
dtype: float64

In [113]:
pd.Series([1,1,1]).rank()

0    2.0
1    2.0
2    2.0
dtype: float64

In [114]:
pd.Series([10,20,30], index = ['m','n','o']).rank()

m    1.0
n    2.0
o    3.0
dtype: float64
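
By default `rank` gives tied values the average of their positions, which explains the 2.5 and 4.5 above. The `method` parameter changes how ties are handled; the sketch below reuses the values of `df_c` under the illustrative name `s`.

```python
import pandas as pd

s = pd.Series([2, 3, 8, 4, 3, 2, 1], index=list('abcdefg'))

average = s.rank()               # default: ties share the mean of their positions
lowest = s.rank(method='min')    # ties share the lowest position in the group
dense = s.rank(method='dense')   # like 'min', but ranks grow by 1 between groups
first = s.rank(method='first')   # ties broken by order of appearance

print(pd.DataFrame({'average': average, 'min': lowest, 'dense': dense, 'first': first}))
```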

### DataFrame Operations

In [3]:
x = pd.Series([1.2, np.nan, 4, np.nan, 9], index=list('abcde'))
y = pd.Series([5, 3, 7, np.nan, 14], index=list('abcde'))

In [6]:
df_table = pd.DataFrame([x,y], index = ['x','y']).T
df_table

Unnamed: 0,x,y
a,1.2,5.0
b,,3.0
c,4.0,7.0
d,,
e,9.0,14.0


In [10]:
df_table.sum()  # to sum a single column, use df_table['x'].sum()

x    14.2
y    29.0
dtype: float64

In [11]:
df_table.sum(axis = 1)

a     6.2
b     3.0
c    11.0
d     0.0
e    23.0
dtype: float64

In [12]:
df_table.sum(axis = 1, skipna = False)

a     6.2
b     NaN
c    11.0
d     NaN
e    23.0
dtype: float64
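
The two results above differ in how they treat row `d`, where every value is NaN: the default sum reports 0.0, while `skipna = False` turns any row containing a NaN into NaN. The `min_count` parameter offers a middle ground. A minimal sketch on a made-up three-row frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'x': [1.2, np.nan, np.nan], 'y': [5.0, 3.0, np.nan]},
    index=list('abd'),
)

default_sum = df.sum(axis=1)                 # the all-NaN row sums to 0.0
strict_sum = df.sum(axis=1, skipna=False)    # any NaN makes the row NaN
at_least_one = df.sum(axis=1, min_count=1)   # NaN only when no valid value exists
```

Here row `b` sums to 3.0 with `min_count=1`, while row `d` becomes NaN instead of 0.0.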

In [13]:
df_table.mean()

x    4.733333
y    7.250000
dtype: float64

In [14]:
df_table.mean(axis = 1)

a     3.1
b     3.0
c     5.5
d     NaN
e    11.5
dtype: float64

In [15]:
df_table.cumsum()

Unnamed: 0,x,y
a,1.2,5.0
b,,8.0
c,5.2,15.0
d,,
e,14.2,29.0


In [17]:
df_table.std()

x    3.951371
y    4.787136
dtype: float64

In [18]:
df_table['x'].unique()

array([1.2, nan, 4. , 9. ])

In [19]:
df_table['x'].value_counts()

1.2    1
9.0    1
4.0    1
Name: x, dtype: int64
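
Note that `unique` reports NaN but `value_counts` skips it by default. To count missing values as well, `dropna=False` can be passed. A short sketch using the same values as `x`:

```python
import numpy as np
import pandas as pd

x = pd.Series([1.2, np.nan, 4, np.nan, 9], index=list('abcde'))

counts = x.value_counts()                 # NaN is excluded
with_nan = x.value_counts(dropna=False)   # NaN is counted as its own entry
missing = x.isna().sum()                  # number of missing values
```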

In [25]:
df_table['x'].isin([1.2])

a     True
b    False
c    False
d    False
e    False
Name: x, dtype: bool

In [27]:
df_table[df_table.isin([1.2]) == True]

Unnamed: 0,x,y
a,1.2,
b,,
c,,
d,,
e,,


### Null Values

In [3]:
df_s = pd.Series(['Ma', 'Lu', 'Ca', 'Va', np.nan])
df_s

0     Ma
1     Lu
2     Ca
3     Va
4    NaN
dtype: object

In [7]:
df_s[df_s.isnull() == False]

0    Ma
1    Lu
2    Ca
3    Va
dtype: object

In [8]:
df_s[df_s.notnull()]

0    Ma
1    Lu
2    Ca
3    Va
dtype: object

In [9]:
df_s[~df_s.notnull()]

4    NaN
dtype: object

In [10]:
df_s2 = pd.DataFrame([[1,2,3], 
                     [np.nan, 8, 7], 
                     [4, np.nan, 90], 
                     [67,42,53]], 
                     columns=list('abc'))
df_s2

Unnamed: 0,a,b,c
0,1.0,2.0,3
1,,8.0,7
2,4.0,,90
3,67.0,42.0,53


In [11]:
df_s2.notnull()

Unnamed: 0,a,b,c
0,True,True,True
1,False,True,True
2,True,False,True
3,True,True,True


In [13]:
df_s2['a'].isnull()

0    False
1     True
2    False
3    False
Name: a, dtype: bool

In [16]:
df_s2.notnull().any()

a    True
b    True
c    True
dtype: bool

In [17]:
df_s2.isnull().any()

a     True
b     True
c    False
dtype: bool

In [18]:
df_s2.notnull().all()

a    False
b    False
c     True
dtype: bool

In [19]:
df_s2.isnull().all()

a    False
b    False
c    False
dtype: bool

In [20]:
df_s2.dropna()

Unnamed: 0,a,b,c
0,1.0,2.0,3
3,67.0,42.0,53


In [21]:
df_s2

Unnamed: 0,a,b,c
0,1.0,2.0,3
1,,8.0,7
2,4.0,,90
3,67.0,42.0,53


In [22]:
df_s2.dropna(axis = 1)

Unnamed: 0,c
0,3
1,7
2,90
3,53


In [23]:
array = np.random.randn(8,3) * 20 + 100
df_a = pd.DataFrame(array, columns = list('xyz'), index = list('abcdefgh'))
df_a

Unnamed: 0,x,y,z
a,104.611895,105.082375,100.00884
b,96.631927,96.991471,77.157631
c,95.802443,113.347616,88.310658
d,143.547377,109.818436,82.114171
e,109.326199,85.891334,76.539174
f,98.342287,104.527499,97.190781
g,105.587103,85.128823,117.55441
h,111.343073,144.498375,75.870951


Assigning a value to specific elements of the DataFrame

In [27]:
df_a.iloc[2:5, 1] = np.nan
df_a.iloc[1:3, 2] = np.nan
df_a

Unnamed: 0,x,y,z
a,104.611895,105.082375,100.00884
b,96.631927,96.991471,
c,95.802443,,
d,143.547377,,82.114171
e,109.326199,,76.539174
f,98.342287,104.527499,97.190781
g,105.587103,85.128823,117.55441
h,111.343073,144.498375,75.870951


Dropping rows or columns with too few non-null values (thresh sets the minimum number of non-null values needed to keep a row or column)

In [30]:
df_a.dropna(thresh = 2)

Unnamed: 0,x,y,z
a,104.611895,105.082375,100.00884
b,96.631927,96.991471,
d,143.547377,,82.114171
e,109.326199,,76.539174
f,98.342287,104.527499,97.190781
g,105.587103,85.128823,117.55441
h,111.343073,144.498375,75.870951


In [36]:
df_a.dropna(thresh = 6, axis = 1)

Unnamed: 0,x,z
a,104.611895,100.00884
b,96.631927,
c,95.802443,
d,143.547377,82.114171
e,109.326199,76.539174
f,98.342287,97.190781
g,105.587103,117.55441
h,111.343073,75.870951
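
Besides `thresh`, `dropna` accepts `how` and `subset`. The sketch below compares them on a small made-up frame; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'x': [1.0, 2.0, np.nan], 'y': [np.nan, 5.0, np.nan], 'z': [7.0, 8.0, np.nan]},
    index=list('abc'),
)

any_null = df.dropna()                     # default how='any': keeps only row b
all_null = df.dropna(how='all')            # drops only row c, which is entirely null
needs_x_z = df.dropna(subset=['x', 'z'])   # checks only x and z: keeps a and b
two_valid = df.dropna(thresh=2)            # at least 2 non-null values: keeps a and b
```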


Replacing null values

In [37]:
df_a.fillna(0)

Unnamed: 0,x,y,z
a,104.611895,105.082375,100.00884
b,96.631927,96.991471,0.0
c,95.802443,0.0,0.0
d,143.547377,0.0,82.114171
e,109.326199,0.0,76.539174
f,98.342287,104.527499,97.190781
g,105.587103,85.128823,117.55441
h,111.343073,144.498375,75.870951


In [39]:
df_a.fillna({'x' : 100, 'y' : 50, 'z' : 20})

Unnamed: 0,x,y,z
a,104.611895,105.082375,100.00884
b,96.631927,96.991471,20.0
c,95.802443,50.0,20.0
d,143.547377,50.0,82.114171
e,109.326199,50.0,76.539174
f,98.342287,104.527499,97.190781
g,105.587103,85.128823,117.55441
h,111.343073,144.498375,75.870951


In [40]:
df_a.fillna(method='ffill')  # the method= argument is deprecated since pandas 2.1; df_a.ffill() is equivalent

Unnamed: 0,x,y,z
a,104.611895,105.082375,100.00884
b,96.631927,96.991471,100.00884
c,95.802443,96.991471,100.00884
d,143.547377,96.991471,82.114171
e,109.326199,96.991471,76.539174
f,98.342287,104.527499,97.190781
g,105.587103,85.128823,117.55441
h,111.343073,144.498375,75.870951
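
Forward filling copies the last valid value downward. The dedicated `ffill` and `bfill` methods do the same job, and `limit` caps how many consecutive gaps are filled. A minimal sketch on a made-up Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan], index=list('abcde'))

forward = s.ffill()          # 1, 1, 1, 4, 4
backward = s.bfill()         # 1, 4, 4, 4, NaN (nothing comes after e)
limited = s.ffill(limit=1)   # 1, 1, NaN, 4, 4
```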


In [42]:
df_a.fillna(df_a.median())

Unnamed: 0,x,y,z
a,104.611895,105.082375,100.00884
b,96.631927,96.991471,89.652476
c,95.802443,104.527499,89.652476
d,143.547377,104.527499,82.114171
e,109.326199,104.527499,76.539174
f,98.342287,104.527499,97.190781
g,105.587103,85.128823,117.55441
h,111.343073,144.498375,75.870951


In [43]:
df_a.median()

x    105.099499
y    104.527499
z     89.652476
dtype: float64