# 1.2 - Intro Pandas (Panel Data)

**[Documentation](https://pandas.pydata.org/docs/reference/index.html#api)**

**[Source code](https://github.com/pandas-dev/pandas)**


![pandas](images/pandas.png)


Pandas is a Python library specialized in the manipulation and analysis of data structures.


The main features of this library are:

+ It defines new data structures based on NumPy arrays, with added functionality.
+ It reads and writes CSV files, Excel files and SQL databases with little code.
+ It allows access to data through indexes or names for rows and columns.
+ It offers methods to reorder, split and combine data sets.
+ It supports working with time series.
+ It performs these operations efficiently, largely because they are vectorized on top of NumPy.


**Pandas data types**
Pandas provides two main data structures:

+ Series: a one-dimensional structure.
+ DataFrame: a two-dimensional structure (a table).

Both are built on top of NumPy arrays and add new functionality.
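
As a minimal sketch of the two structures (the names `s` and `df_demo` and their values are illustrative, not part of the original notebook), the following builds a `Series` and a `DataFrame` and checks that the data is exposed as a NumPy array through `to_numpy()`:

```python
import numpy as np
import pandas as pd

# One-dimensional structure: values plus an index of labels
s = pd.Series([1.5, 2.5, 3.5], index=['a', 'b', 'c'])

# Two-dimensional structure: each column behaves like a Series
df_demo = pd.DataFrame({'x': [1, 2, 3], 'y': [0.1, 0.2, 0.3]})

print(s.ndim, df_demo.ndim)      # number of dimensions of each structure
print(type(s.to_numpy()))        # the values as a NumPy array
print(type(df_demo['x']))        # a single column is a Series
```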

In [1]:
%pip install pandas

Note: you may need to restart the kernel to use updated packages.


In [2]:
import pandas as pd

In [3]:
import numpy as np

In [4]:
import warnings
warnings.filterwarnings('ignore')

### Series

A Series is similar to a one-dimensional array. It is homogeneous, meaning that all its elements share a single dtype (assigning a value of another type converts the whole Series to the generic `object` dtype, as the cells below show). Its size is immutable: the length cannot be changed, although the contents can.

It has an index that associates a label with each element of the series, and elements are accessed through that label.
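
A short sketch of access through the index, using an illustrative Series that is not part of the original notebook. `loc` looks an element up by its label and `iloc` by its integer position:

```python
import pandas as pd

s = pd.Series([10.0, 20.0, 30.0], index=['q', 't', 'y'])

by_label = s.loc['t']       # access by index label
by_position = s.iloc[1]     # access by integer position

print(by_label, by_position)
```

Both lookups return the same element here because the label `'t'` sits at position 1.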

In [5]:
lst = [(3.4 + i)**2 for i in range(10)]

lst

[11.559999999999999,
 19.360000000000003,
 29.160000000000004,
 40.96000000000001,
 54.760000000000005,
 70.56,
 88.36000000000001,
 108.16000000000001,
 129.96,
 153.76000000000002]

In [6]:
serie=pd.Series(lst)

serie

0     11.56
1     19.36
2     29.16
3     40.96
4     54.76
5     70.56
6     88.36
7    108.16
8    129.96
9    153.76
dtype: float64

In [7]:
serie[2]='hola'

In [8]:
serie

0     11.56
1     19.36
2      hola
3     40.96
4     54.76
5     70.56
6     88.36
7    108.16
8    129.96
9    153.76
dtype: object
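
The assignment above changes the dtype from `float64` to `object` because a Series holds a single dtype. The sketch below, with illustrative values, makes that conversion explicit with `astype(object)` and then recovers numeric values with `pd.to_numeric`. Recent pandas versions may warn when an assignment silently changes the dtype, so the explicit conversion is one way to state the intent:

```python
import pandas as pd

s = pd.Series([11.56, 19.36, 29.16])
print(s.dtype)                      # float64

s_obj = s.astype(object)            # explicit conversion to a generic dtype
s_obj.loc[2] = 'hola'               # mixed content is now allowed
print(s_obj.dtype)                  # object

# Coerce back to numbers; values that cannot be parsed become NaN
s_num = pd.to_numeric(s_obj, errors='coerce')
print(s_num.isna().sum())
```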

In [9]:
type(serie[0])

float

In [10]:
#help(serie)

In [11]:
serie.head()    # head: the first 5 elements by default

0    11.56
1    19.36
2     hola
3    40.96
4    54.76
dtype: object

In [12]:
serie.head(10)

0     11.56
1     19.36
2      hola
3     40.96
4     54.76
5     70.56
6     88.36
7    108.16
8    129.96
9    153.76
dtype: object

In [13]:
serie.tail()

5     70.56
6     88.36
7    108.16
8    129.96
9    153.76
dtype: object

In [14]:
serie.index

RangeIndex(start=0, stop=10, step=1)

In [15]:
serie.index = ['q', 't', 'y', 'o', 'p', 'a', 's', 'd', 'f', 'g']

serie

q     11.56
t     19.36
y      hola
o     40.96
p     54.76
a     70.56
s     88.36
d    108.16
f    129.96
g    153.76
dtype: object

In [16]:
serie['p']

54.760000000000005

In [17]:
serie.to_dict()

{'q': 11.559999999999999,
 't': 19.360000000000003,
 'y': 'hola',
 'o': 40.96000000000001,
 'p': 54.760000000000005,
 'a': 70.56,
 's': 88.36000000000001,
 'd': 108.16000000000001,
 'f': 129.96,
 'g': 153.76000000000002}

In [18]:
serie.to_frame()

Unnamed: 0,0
q,11.56
t,19.36
y,hola
o,40.96
p,54.76
a,70.56
s,88.36
d,108.16
f,129.96
g,153.76


### DataFrame

A DataFrame is a data set structured as a table in which each column is a Series, so all the values in a column share the same dtype, while each row is a record that can contain values of different types.

A DataFrame has two indexes, one for the rows and one for the columns, and its elements can be accessed by row and column labels.
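
A minimal sketch of the two indexes, on a small illustrative frame (`df_demo` and its labels are assumptions made for the example). `loc` takes a row label and a column label, and `iloc` takes the corresponding integer positions:

```python
import pandas as pd

df_demo = pd.DataFrame(
    {'col1': [1, 2, 3], 'col2': [4.0, 5.0, 6.0]},
    index=['r1', 'r2', 'r3'],
)

print(df_demo.index)      # row index
print(df_demo.columns)    # column index

cell_by_label = df_demo.loc['r2', 'col2']    # row label, column label
cell_by_position = df_demo.iloc[1, 1]        # row position, column position
```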

In [19]:
columnas=['col1', 'col2', 'col 3', 'col4', 'col5']

array=np.random.random((10, 5))

array

array([[0.32922196, 0.25134209, 0.65102292, 0.61121212, 0.88111804],
       [0.53542901, 0.03740546, 0.3767634 , 0.28943778, 0.02759077],
       [0.98600565, 0.77635575, 0.73208542, 0.50554418, 0.13815986],
       [0.10302939, 0.49233588, 0.91757283, 0.74867143, 0.04460873],
       [0.71508943, 0.72857881, 0.30683738, 0.47557772, 0.52995679],
       [0.25667457, 0.65992076, 0.08235689, 0.06654583, 0.43175918],
       [0.27727478, 0.14756843, 0.10374827, 0.02629115, 0.30466667],
       [0.45217775, 0.67088985, 0.95626472, 0.97295323, 0.08415188],
       [0.91812845, 0.76323463, 0.63800678, 0.02749082, 0.26678428],
       [0.91260289, 0.88664167, 0.26284888, 0.52401953, 0.44184007]])

In [20]:
df = pd.DataFrame(array, columns=columnas)

df

Unnamed: 0,col1,col2,col 3,col4,col5
0,0.329222,0.251342,0.651023,0.611212,0.881118
1,0.535429,0.037405,0.376763,0.289438,0.027591
2,0.986006,0.776356,0.732085,0.505544,0.13816
3,0.103029,0.492336,0.917573,0.748671,0.044609
4,0.715089,0.728579,0.306837,0.475578,0.529957
5,0.256675,0.659921,0.082357,0.066546,0.431759
6,0.277275,0.147568,0.103748,0.026291,0.304667
7,0.452178,0.67089,0.956265,0.972953,0.084152
8,0.918128,0.763235,0.638007,0.027491,0.266784
9,0.912603,0.886642,0.262849,0.52402,0.44184


In [21]:
df['col 3']

0    0.651023
1    0.376763
2    0.732085
3    0.917573
4    0.306837
5    0.082357
6    0.103748
7    0.956265
8    0.638007
9    0.262849
Name: col 3, dtype: float64

In [22]:
df.col 3   # invalid: attribute access needs a name that is a valid Python identifier

SyntaxError: invalid syntax (3887910251.py, line 1)

In [23]:
df.columns = [e.replace(' ', '_') for e in df.columns]

df.columns

Index(['col1', 'col2', 'col_3', 'col4', 'col5'], dtype='object')

In [24]:
df.col_3

0    0.651023
1    0.376763
2    0.732085
3    0.917573
4    0.306837
5    0.082357
6    0.103748
7    0.956265
8    0.638007
9    0.262849
Name: col_3, dtype: float64

In [25]:
df.rename(columns={'col_3': 'columna'},
          inplace=True) # key=old name, value=new name

#help(df.rename)

In [26]:
df

Unnamed: 0,col1,col2,columna,col4,col5
0,0.329222,0.251342,0.651023,0.611212,0.881118
1,0.535429,0.037405,0.376763,0.289438,0.027591
2,0.986006,0.776356,0.732085,0.505544,0.13816
3,0.103029,0.492336,0.917573,0.748671,0.044609
4,0.715089,0.728579,0.306837,0.475578,0.529957
5,0.256675,0.659921,0.082357,0.066546,0.431759
6,0.277275,0.147568,0.103748,0.026291,0.304667
7,0.452178,0.67089,0.956265,0.972953,0.084152
8,0.918128,0.763235,0.638007,0.027491,0.266784
9,0.912603,0.886642,0.262849,0.52402,0.44184


In [27]:
cols=['col2', 'col4']

df[cols]

Unnamed: 0,col2,col4
0,0.251342,0.611212
1,0.037405,0.289438
2,0.776356,0.505544
3,0.492336,0.748671
4,0.728579,0.475578
5,0.659921,0.066546
6,0.147568,0.026291
7,0.67089,0.972953
8,0.763235,0.027491
9,0.886642,0.52402


In [28]:
df[['col2', 'col4']]

Unnamed: 0,col2,col4
0,0.251342,0.611212
1,0.037405,0.289438
2,0.776356,0.505544
3,0.492336,0.748671
4,0.728579,0.475578
5,0.659921,0.066546
6,0.147568,0.026291
7,0.67089,0.972953
8,0.763235,0.027491
9,0.886642,0.52402


In [29]:
df[['col2', 'col4', 'col1']]

Unnamed: 0,col2,col4,col1
0,0.251342,0.611212,0.329222
1,0.037405,0.289438,0.535429
2,0.776356,0.505544,0.986006
3,0.492336,0.748671,0.103029
4,0.728579,0.475578,0.715089
5,0.659921,0.066546,0.256675
6,0.147568,0.026291,0.277275
7,0.67089,0.972953,0.452178
8,0.763235,0.027491,0.918128
9,0.886642,0.52402,0.912603


In [30]:
df['ceros']=0.  # column of zeros

df

Unnamed: 0,col1,col2,columna,col4,col5,ceros
0,0.329222,0.251342,0.651023,0.611212,0.881118,0.0
1,0.535429,0.037405,0.376763,0.289438,0.027591,0.0
2,0.986006,0.776356,0.732085,0.505544,0.13816,0.0
3,0.103029,0.492336,0.917573,0.748671,0.044609,0.0
4,0.715089,0.728579,0.306837,0.475578,0.529957,0.0
5,0.256675,0.659921,0.082357,0.066546,0.431759,0.0
6,0.277275,0.147568,0.103748,0.026291,0.304667,0.0
7,0.452178,0.67089,0.956265,0.972953,0.084152,0.0
8,0.918128,0.763235,0.638007,0.027491,0.266784,0.0
9,0.912603,0.886642,0.262849,0.52402,0.44184,0.0


In [31]:
df['nulos']=np.nan

df

Unnamed: 0,col1,col2,columna,col4,col5,ceros,nulos
0,0.329222,0.251342,0.651023,0.611212,0.881118,0.0,
1,0.535429,0.037405,0.376763,0.289438,0.027591,0.0,
2,0.986006,0.776356,0.732085,0.505544,0.13816,0.0,
3,0.103029,0.492336,0.917573,0.748671,0.044609,0.0,
4,0.715089,0.728579,0.306837,0.475578,0.529957,0.0,
5,0.256675,0.659921,0.082357,0.066546,0.431759,0.0,
6,0.277275,0.147568,0.103748,0.026291,0.304667,0.0,
7,0.452178,0.67089,0.956265,0.972953,0.084152,0.0,
8,0.918128,0.763235,0.638007,0.027491,0.266784,0.0,
9,0.912603,0.886642,0.262849,0.52402,0.44184,0.0,


In [32]:
df['nueva']=[i for i in range(len(df))]

df

Unnamed: 0,col1,col2,columna,col4,col5,ceros,nulos,nueva
0,0.329222,0.251342,0.651023,0.611212,0.881118,0.0,,0
1,0.535429,0.037405,0.376763,0.289438,0.027591,0.0,,1
2,0.986006,0.776356,0.732085,0.505544,0.13816,0.0,,2
3,0.103029,0.492336,0.917573,0.748671,0.044609,0.0,,3
4,0.715089,0.728579,0.306837,0.475578,0.529957,0.0,,4
5,0.256675,0.659921,0.082357,0.066546,0.431759,0.0,,5
6,0.277275,0.147568,0.103748,0.026291,0.304667,0.0,,6
7,0.452178,0.67089,0.956265,0.972953,0.084152,0.0,,7
8,0.918128,0.763235,0.638007,0.027491,0.266784,0.0,,8
9,0.912603,0.886642,0.262849,0.52402,0.44184,0.0,,9


In [33]:
df.shape

(10, 8)

In [34]:
len(df)

10

In [35]:
df['col10']=df.col1 * df.col2 / df.col4

df

Unnamed: 0,col1,col2,columna,col4,col5,ceros,nulos,nueva,col10
0,0.329222,0.251342,0.651023,0.611212,0.881118,0.0,,0,0.135382
1,0.535429,0.037405,0.376763,0.289438,0.027591,0.0,,1,0.069196
2,0.986006,0.776356,0.732085,0.505544,0.13816,0.0,,2,1.514192
3,0.103029,0.492336,0.917573,0.748671,0.044609,0.0,,3,0.067753
4,0.715089,0.728579,0.306837,0.475578,0.529957,0.0,,4,1.095508
5,0.256675,0.659921,0.082357,0.066546,0.431759,0.0,,5,2.545387
6,0.277275,0.147568,0.103748,0.026291,0.304667,0.0,,6,1.556303
7,0.452178,0.67089,0.956265,0.972953,0.084152,0.0,,7,0.311794
8,0.918128,0.763235,0.638007,0.027491,0.266784,0.0,,8,25.490238
9,0.912603,0.886642,0.262849,0.52402,0.44184,0.0,,9,1.544125


In [36]:
df.insert(1, 'insertada', [i for i in range(len(df))])

df

Unnamed: 0,col1,insertada,col2,columna,col4,col5,ceros,nulos,nueva,col10
0,0.329222,0,0.251342,0.651023,0.611212,0.881118,0.0,,0,0.135382
1,0.535429,1,0.037405,0.376763,0.289438,0.027591,0.0,,1,0.069196
2,0.986006,2,0.776356,0.732085,0.505544,0.13816,0.0,,2,1.514192
3,0.103029,3,0.492336,0.917573,0.748671,0.044609,0.0,,3,0.067753
4,0.715089,4,0.728579,0.306837,0.475578,0.529957,0.0,,4,1.095508
5,0.256675,5,0.659921,0.082357,0.066546,0.431759,0.0,,5,2.545387
6,0.277275,6,0.147568,0.103748,0.026291,0.304667,0.0,,6,1.556303
7,0.452178,7,0.67089,0.956265,0.972953,0.084152,0.0,,7,0.311794
8,0.918128,8,0.763235,0.638007,0.027491,0.266784,0.0,,8,25.490238
9,0.912603,9,0.886642,0.262849,0.52402,0.44184,0.0,,9,1.544125


In [37]:
df[sorted(df.columns)]

Unnamed: 0,ceros,col1,col10,col2,col4,col5,columna,insertada,nueva,nulos
0,0.0,0.329222,0.135382,0.251342,0.611212,0.881118,0.651023,0,0,
1,0.0,0.535429,0.069196,0.037405,0.289438,0.027591,0.376763,1,1,
2,0.0,0.986006,1.514192,0.776356,0.505544,0.13816,0.732085,2,2,
3,0.0,0.103029,0.067753,0.492336,0.748671,0.044609,0.917573,3,3,
4,0.0,0.715089,1.095508,0.728579,0.475578,0.529957,0.306837,4,4,
5,0.0,0.256675,2.545387,0.659921,0.066546,0.431759,0.082357,5,5,
6,0.0,0.277275,1.556303,0.147568,0.026291,0.304667,0.103748,6,6,
7,0.0,0.452178,0.311794,0.67089,0.972953,0.084152,0.956265,7,7,
8,0.0,0.918128,25.490238,0.763235,0.027491,0.266784,0.638007,8,8,
9,0.0,0.912603,1.544125,0.886642,0.52402,0.44184,0.262849,9,9,


In [38]:
df.columns

Index(['col1', 'insertada', 'col2', 'columna', 'col4', 'col5', 'ceros',
       'nulos', 'nueva', 'col10'],
      dtype='object')

In [39]:
df['col10'][5]

2.545386717128177
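
The lookup above selects the column first and then the row. A sketch of the equivalent single-step form with `loc`, on an illustrative frame; the single indexer is also the form usually recommended for assignments, since chained indexing may end up writing to a temporary copy:

```python
import pandas as pd

df_demo = pd.DataFrame({'col10': [0.5, 1.5, 2.5]})

chained = df_demo['col10'][2]       # column first, then row label
direct = df_demo.loc[2, 'col10']    # row label and column in one call

df_demo.loc[2, 'col10'] = 9.9       # assignment through a single indexer
```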

In [40]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 10 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   col1       10 non-null     float64
 1   insertada  10 non-null     int64  
 2   col2       10 non-null     float64
 3   columna    10 non-null     float64
 4   col4       10 non-null     float64
 5   col5       10 non-null     float64
 6   ceros      10 non-null     float64
 7   nulos      0 non-null      float64
 8   nueva      10 non-null     int64  
 9   col10      10 non-null     float64
dtypes: float64(8), int64(2)
memory usage: 928.0 bytes


In [41]:
df.fillna('hola', inplace=True)

In [42]:
df

Unnamed: 0,col1,insertada,col2,columna,col4,col5,ceros,nulos,nueva,col10
0,0.329222,0,0.251342,0.651023,0.611212,0.881118,0.0,hola,0,0.135382
1,0.535429,1,0.037405,0.376763,0.289438,0.027591,0.0,hola,1,0.069196
2,0.986006,2,0.776356,0.732085,0.505544,0.13816,0.0,hola,2,1.514192
3,0.103029,3,0.492336,0.917573,0.748671,0.044609,0.0,hola,3,0.067753
4,0.715089,4,0.728579,0.306837,0.475578,0.529957,0.0,hola,4,1.095508
5,0.256675,5,0.659921,0.082357,0.066546,0.431759,0.0,hola,5,2.545387
6,0.277275,6,0.147568,0.103748,0.026291,0.304667,0.0,hola,6,1.556303
7,0.452178,7,0.67089,0.956265,0.972953,0.084152,0.0,hola,7,0.311794
8,0.918128,8,0.763235,0.638007,0.027491,0.266784,0.0,hola,8,25.490238
9,0.912603,9,0.886642,0.262849,0.52402,0.44184,0.0,hola,9,1.544125


In [43]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 10 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   col1       10 non-null     float64
 1   insertada  10 non-null     int64  
 2   col2       10 non-null     float64
 3   columna    10 non-null     float64
 4   col4       10 non-null     float64
 5   col5       10 non-null     float64
 6   ceros      10 non-null     float64
 7   nulos      10 non-null     object 
 8   nueva      10 non-null     int64  
 9   col10      10 non-null     float64
dtypes: float64(7), int64(2), object(1)
memory usage: 928.0+ bytes
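
Filling with a string turned the `nulos` column into `object`, as the second `info()` call shows. When the columns should stay numeric, one option is to pass `fillna` a dictionary with one fill value per column. A sketch on an illustrative frame (the column names `a` and `b` are assumptions made for the example):

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                        'b': [np.nan, 2.0, np.nan]})

# A dict maps each column name to its own fill value
filled = df_demo.fillna({'a': df_demo['a'].mean(), 'b': 0.0})

print(filled.dtypes)
```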


In [44]:
# build a DataFrame from a list of lists

lst_lst=[[687261, 'hola', 4728364], 
         [83546, 'adios', 58943], 
         [321, 'oo^oo']]


columnas=['num', 'palabra', 'otro_num']

In [45]:
df_lst = pd.DataFrame(lst_lst, columns=columnas)

df_lst

Unnamed: 0,num,palabra,otro_num
0,687261,hola,4728364.0
1,83546,adios,58943.0
2,321,oo^oo,


In [46]:
df_lst.fillna(0., inplace=True)  # fills every null value

df_lst

Unnamed: 0,num,palabra,otro_num
0,687261,hola,4728364.0
1,83546,adios,58943.0
2,321,oo^oo,0.0


In [47]:
# from a dictionary

dictio={'casa': lst_lst[0],
        'oficina': lst_lst[1],
        'numero': lst_lst[2]+[0]}

dictio

{'casa': [687261, 'hola', 4728364],
 'oficina': [83546, 'adios', 58943],
 'numero': [321, 'oo^oo', 0]}

In [48]:
df_dictio = pd.DataFrame(dictio)

df_dictio

Unnamed: 0,casa,oficina,numero
0,687261,83546,321
1,hola,adios,oo^oo
2,4728364,58943,0
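
In the DataFrame above each dictionary key became a column, so every column mixes numbers and text. If each key is meant to be a row instead, `pd.DataFrame.from_dict` with `orient='index'` is one option. A sketch reusing the same values; the name `df_rows` is illustrative:

```python
import pandas as pd

dictio = {'casa': [687261, 'hola', 4728364],
          'oficina': [83546, 'adios', 58943],
          'numero': [321, 'oo^oo', 0]}

# Keys become row labels; `columns` names the positions inside each list
df_rows = pd.DataFrame.from_dict(dictio,
                                 orient='index',
                                 columns=['num', 'palabra', 'otro_num'])

print(df_rows)
```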


In [49]:
df.drop('nulos',        # column name
        axis=1,         # axis 1: operate on columns
        inplace=True    # modifies df in place
       )  # drops a column

In [50]:
df.head()

Unnamed: 0,col1,insertada,col2,columna,col4,col5,ceros,nueva,col10
0,0.329222,0,0.251342,0.651023,0.611212,0.881118,0.0,0,0.135382
1,0.535429,1,0.037405,0.376763,0.289438,0.027591,0.0,1,0.069196
2,0.986006,2,0.776356,0.732085,0.505544,0.13816,0.0,2,1.514192
3,0.103029,3,0.492336,0.917573,0.748671,0.044609,0.0,3,0.067753
4,0.715089,4,0.728579,0.306837,0.475578,0.529957,0.0,4,1.095508


In [51]:
df.drop(0,              # index label
        axis=0,         # axis 0: operate on rows
        inplace=True    # modifies df in place
       )  # drops a row

In [52]:
df.reset_index()

Unnamed: 0,index,col1,insertada,col2,columna,col4,col5,ceros,nueva,col10
0,1,0.535429,1,0.037405,0.376763,0.289438,0.027591,0.0,1,0.069196
1,2,0.986006,2,0.776356,0.732085,0.505544,0.13816,0.0,2,1.514192
2,3,0.103029,3,0.492336,0.917573,0.748671,0.044609,0.0,3,0.067753
3,4,0.715089,4,0.728579,0.306837,0.475578,0.529957,0.0,4,1.095508
4,5,0.256675,5,0.659921,0.082357,0.066546,0.431759,0.0,5,2.545387
5,6,0.277275,6,0.147568,0.103748,0.026291,0.304667,0.0,6,1.556303
6,7,0.452178,7,0.67089,0.956265,0.972953,0.084152,0.0,7,0.311794
7,8,0.918128,8,0.763235,0.638007,0.027491,0.266784,0.0,8,25.490238
8,9,0.912603,9,0.886642,0.262849,0.52402,0.44184,0.0,9,1.544125


In [53]:
df.index=[i for i in  range(len(df))]
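
`reset_index()` above returns a new DataFrame and keeps the old labels in an `index` column, so the notebook then assigns a fresh index by hand. A sketch of an alternative on a small illustrative frame: passing `drop=True` discards the old labels, and assigning the result back makes the change persistent.

```python
import pandas as pd

df_demo = pd.DataFrame({'x': [10, 20, 30]})
df_demo = df_demo.drop(0, axis=0)            # remove the row labelled 0

# drop=True discards the old labels instead of adding an 'index' column
df_demo = df_demo.reset_index(drop=True)

print(df_demo)
```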

### Operations


In [54]:
df

Unnamed: 0,col1,insertada,col2,columna,col4,col5,ceros,nueva,col10
0,0.535429,1,0.037405,0.376763,0.289438,0.027591,0.0,1,0.069196
1,0.986006,2,0.776356,0.732085,0.505544,0.13816,0.0,2,1.514192
2,0.103029,3,0.492336,0.917573,0.748671,0.044609,0.0,3,0.067753
3,0.715089,4,0.728579,0.306837,0.475578,0.529957,0.0,4,1.095508
4,0.256675,5,0.659921,0.082357,0.066546,0.431759,0.0,5,2.545387
5,0.277275,6,0.147568,0.103748,0.026291,0.304667,0.0,6,1.556303
6,0.452178,7,0.67089,0.956265,0.972953,0.084152,0.0,7,0.311794
7,0.918128,8,0.763235,0.638007,0.027491,0.266784,0.0,8,25.490238
8,0.912603,9,0.886642,0.262849,0.52402,0.44184,0.0,9,1.544125


In [55]:
df.transpose()

Unnamed: 0,0,1,2,3,4,5,6,7,8
col1,0.535429,0.986006,0.103029,0.715089,0.256675,0.277275,0.452178,0.918128,0.912603
insertada,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0
col2,0.037405,0.776356,0.492336,0.728579,0.659921,0.147568,0.67089,0.763235,0.886642
columna,0.376763,0.732085,0.917573,0.306837,0.082357,0.103748,0.956265,0.638007,0.262849
col4,0.289438,0.505544,0.748671,0.475578,0.066546,0.026291,0.972953,0.027491,0.52402
col5,0.027591,0.13816,0.044609,0.529957,0.431759,0.304667,0.084152,0.266784,0.44184
ceros,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
nueva,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0
col10,0.069196,1.514192,0.067753,1.095508,2.545387,1.556303,0.311794,25.490238,1.544125


In [56]:
df.T

Unnamed: 0,0,1,2,3,4,5,6,7,8
col1,0.535429,0.986006,0.103029,0.715089,0.256675,0.277275,0.452178,0.918128,0.912603
insertada,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0
col2,0.037405,0.776356,0.492336,0.728579,0.659921,0.147568,0.67089,0.763235,0.886642
columna,0.376763,0.732085,0.917573,0.306837,0.082357,0.103748,0.956265,0.638007,0.262849
col4,0.289438,0.505544,0.748671,0.475578,0.066546,0.026291,0.972953,0.027491,0.52402
col5,0.027591,0.13816,0.044609,0.529957,0.431759,0.304667,0.084152,0.266784,0.44184
ceros,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
nueva,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0
col10,0.069196,1.514192,0.067753,1.095508,2.545387,1.556303,0.311794,25.490238,1.544125


In [57]:
df.T.index

Index(['col1', 'insertada', 'col2', 'columna', 'col4', 'col5', 'ceros',
       'nueva', 'col10'],
      dtype='object')

In [58]:
df.columns

Index(['col1', 'insertada', 'col2', 'columna', 'col4', 'col5', 'ceros',
       'nueva', 'col10'],
      dtype='object')

In [59]:
df.index

Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8], dtype='int64')

In [60]:
df.sum()

col1          5.156412
insertada    45.000000
col2          5.162931
columna       4.376485
col4          3.636532
col5          2.269518
ceros         0.000000
nueva        45.000000
col10        34.194498
dtype: float64

In [61]:
df.sum(axis=0)

col1          5.156412
insertada    45.000000
col2          5.162931
columna       4.376485
col4          3.636532
col5          2.269518
ceros         0.000000
nueva        45.000000
col10        34.194498
dtype: float64

In [62]:
df.sum(axis=1)

0     3.335823
1     8.652343
2     8.373972
3    11.851548
4    14.042644
5    14.415853
6    17.448232
7    44.103883
8    22.572078
dtype: float64
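
The `axis` argument decides which dimension is collapsed. A sketch with an illustrative frame whose totals are easy to check by hand:

```python
import pandas as pd

df_demo = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

per_column = df_demo.sum(axis=0)   # collapses the rows: one total per column
per_row = df_demo.sum(axis=1)      # collapses the columns: one total per row

print(per_column)
print(per_row)
```

The same convention applies to the other aggregations used in this section, such as `mean`, `std` and `var`.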

In [63]:
seleccion=['col1', 'col2', 'col4']

df[seleccion].sum(axis=1)

0    0.862272
1    2.267906
2    1.344037
3    1.919246
4    0.983141
5    0.451134
6    2.096021
7    1.708854
8    2.323264
dtype: float64

In [64]:
df.std()

col1         0.325884
insertada    2.738613
col2         0.294026
columna      0.334279
col4         0.332372
col5         0.188118
ceros        0.000000
nueva        2.738613
col10        8.175566
dtype: float64

In [65]:
df.var()

col1          0.106200
insertada     7.500000
col2          0.086452
columna       0.111742
col4          0.110471
col5          0.035388
ceros         0.000000
nueva         7.500000
col10        66.839872
dtype: float64

In [66]:
df.mean()

col1         0.572935
insertada    5.000000
col2         0.573659
columna      0.486276
col4         0.404059
col5         0.252169
ceros        0.000000
nueva        5.000000
col10        3.799389
dtype: float64

In [67]:
df.median()

col1         0.535429
insertada    5.000000
col2         0.670890
columna      0.376763
col4         0.475578
col5         0.266784
ceros        0.000000
nueva        5.000000
col10        1.514192
dtype: float64

In [68]:
df.describe()

Unnamed: 0,col1,insertada,col2,columna,col4,col5,ceros,nueva,col10
count,9.0,9.0,9.0,9.0,9.0,9.0,9.0,9.0,9.0
mean,0.572935,5.0,0.573659,0.486276,0.404059,0.252169,0.0,5.0,3.799389
std,0.325884,2.738613,0.294026,0.334279,0.332372,0.188118,0.0,2.738613,8.175566
min,0.103029,1.0,0.037405,0.082357,0.026291,0.027591,0.0,1.0,0.067753
25%,0.277275,3.0,0.492336,0.262849,0.066546,0.084152,0.0,3.0,0.311794
50%,0.535429,5.0,0.67089,0.376763,0.475578,0.266784,0.0,5.0,1.514192
75%,0.912603,7.0,0.763235,0.732085,0.52402,0.431759,0.0,7.0,1.556303
max,0.986006,9.0,0.886642,0.956265,0.972953,0.529957,0.0,9.0,25.490238


In [69]:
df.mean() / len(df)

col1         0.063659
insertada    0.555556
col2         0.063740
columna      0.054031
col4         0.044895
col5         0.028019
ceros        0.000000
nueva        0.555556
col10        0.422154
dtype: float64

### Importing files

+ CSV
+ XLSX
+ XLS
+ JSON

In [70]:
pd.set_option('display.max_columns', None)  # show all columns

pd.set_option('display.max_rows', None)     # show all rows

In [71]:
df_csv = pd.read_csv('../data/vehicles_messy.csv')

df_csv.head()

Unnamed: 0,barrels08,barrelsA08,charge120,charge240,city08,city08U,cityA08,cityA08U,cityCD,cityE,cityUF,co2,co2A,co2TailpipeAGpm,co2TailpipeGpm,comb08,comb08U,combA08,combA08U,combE,combinedCD,combinedUF,cylinders,displ,drive,engId,eng_dscr,feScore,fuelCost08,fuelCostA08,fuelType,fuelType1,ghgScore,ghgScoreA,highway08,highway08U,highwayA08,highwayA08U,highwayCD,highwayE,highwayUF,hlv,hpv,id,lv2,lv4,make,model,mpgData,phevBlended,pv2,pv4,range,rangeCity,rangeCityA,rangeHwy,rangeHwyA,trany,UCity,UCityA,UHighway,UHighwayA,VClass,year,youSaveSpend,guzzler,trans_dscr,tCharger,sCharger,atvType,fuelType2,rangeA,evMotor,mfrCode,c240Dscr,charge240b,c240bDscr,createdOn,modifiedOn,startStop,phevCity,phevHwy,phevComb
0,15.695714,0.0,0.0,0.0,19,0.0,0,0.0,0.0,0.0,0.0,-1,-1,0.0,423.190476,21,0.0,0,0.0,0.0,0.0,0.0,4.0,2.0,Rear-Wheel Drive,9011,(FFS),-1,1600,0,Regular,Regular Gasoline,-1,-1,25,0.0,0,0.0,0.0,0.0,0.0,0,0,1,0,0,Alfa Romeo,Spider Veloce 2000,Y,False,0,0,0,0.0,0.0,0.0,0.0,Manual 5-spd,23.3333,0.0,35.0,0.0,Two Seaters,1985,-1250,,,,,,,,,,,0.0,,Tue Jan 01 00:00:00 EST 2013,Tue Jan 01 00:00:00 EST 2013,,0,0,0
1,29.964545,0.0,0.0,0.0,9,0.0,0,0.0,0.0,0.0,0.0,-1,-1,0.0,807.909091,11,0.0,0,0.0,0.0,0.0,0.0,12.0,4.9,Rear-Wheel Drive,22020,(GUZZLER),-1,3050,0,Regular,Regular Gasoline,-1,-1,14,0.0,0,0.0,0.0,0.0,0.0,0,0,10,0,0,Ferrari,Testarossa,N,False,0,0,0,0.0,0.0,0.0,0.0,Manual 5-spd,11.0,0.0,19.0,0.0,Two Seaters,1985,-8500,T,,,,,,,,,,0.0,,Tue Jan 01 00:00:00 EST 2013,Tue Jan 01 00:00:00 EST 2013,,0,0,0
2,12.207778,0.0,0.0,0.0,23,0.0,0,0.0,0.0,0.0,0.0,-1,-1,0.0,329.148148,27,0.0,0,0.0,0.0,0.0,0.0,4.0,2.2,Front-Wheel Drive,2100,(FFS),-1,1250,0,Regular,Regular Gasoline,-1,-1,33,0.0,0,0.0,0.0,0.0,0.0,19,77,100,0,0,Dodge,Charger,Y,False,0,0,0,0.0,0.0,0.0,0.0,Manual 5-spd,29.0,0.0,47.0,0.0,Subcompact Cars,1985,500,,SIL,,,,,,,,,0.0,,Tue Jan 01 00:00:00 EST 2013,Tue Jan 01 00:00:00 EST 2013,,0,0,0
3,29.964545,0.0,0.0,0.0,10,0.0,0,0.0,0.0,0.0,0.0,-1,-1,0.0,807.909091,11,0.0,0,0.0,0.0,0.0,0.0,8.0,5.2,Rear-Wheel Drive,2850,,-1,3050,0,Regular,Regular Gasoline,-1,-1,12,0.0,0,0.0,0.0,0.0,0.0,0,0,1000,0,0,Dodge,B150/B250 Wagon 2WD,N,False,0,0,0,0.0,0.0,0.0,0.0,Automatic 3-spd,12.2222,0.0,16.6667,0.0,Vans,1985,-8500,,,,,,,,,,,0.0,,Tue Jan 01 00:00:00 EST 2013,Tue Jan 01 00:00:00 EST 2013,,0,0,0
4,17.347895,0.0,0.0,0.0,17,0.0,0,0.0,0.0,0.0,0.0,-1,-1,0.0,467.736842,19,0.0,0,0.0,0.0,0.0,0.0,4.0,2.2,4-Wheel or All-Wheel Drive,66031,"(FFS,TRBO)",-1,2150,0,Premium,Premium Gasoline,-1,-1,23,0.0,0,0.0,0.0,0.0,0.0,0,0,10000,0,14,Subaru,Legacy AWD Turbo,N,False,0,90,0,0.0,0.0,0.0,0.0,Manual 5-spd,21.0,0.0,32.0,0.0,Compact Cars,1993,-4000,,,T,,,,,,,,0.0,,Tue Jan 01 00:00:00 EST 2013,Tue Jan 01 00:00:00 EST 2013,,0,0,0


In [72]:
len(df_csv)

37843

In [73]:
df_csv.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37843 entries, 0 to 37842
Data columns (total 83 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   barrels08        37843 non-null  float64
 1   barrelsA08       37843 non-null  float64
 2   charge120        37843 non-null  float64
 3   charge240        37843 non-null  float64
 4   city08           37843 non-null  int64  
 5   city08U          37843 non-null  float64
 6   cityA08          37843 non-null  int64  
 7   cityA08U         37843 non-null  float64
 8   cityCD           37843 non-null  float64
 9   cityE            37843 non-null  float64
 10  cityUF           37843 non-null  float64
 11  co2              37843 non-null  int64  
 12  co2A             37843 non-null  int64  
 13  co2TailpipeAGpm  37843 non-null  float64
 14  co2TailpipeGpm   37843 non-null  float64
 15  comb08           37843 non-null  int64  
 16  comb08U          37843 non-null  float64
 17  combA08     

In [74]:
%pip install openpyxl
%pip install xlrd

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


In [75]:
# xlsx

df_xlsx = pd.read_excel('../data/Online Retail.xlsx')

df_xlsx.head()

Unnamed: 0,InvoiceNo,InvoiceDate,StockCode,Description,Quantity,UnitPrice,Revenue,CustomerID,Country
0,536365,2010-12-01 08:26:00,85123A,CREAM HANGING HEART T-LIGHT HOLDER,6,2.55,15.3,17850,United Kingdom
1,536373,2010-12-01 09:02:00,85123A,CREAM HANGING HEART T-LIGHT HOLDER,6,2.55,15.3,17850,United Kingdom
2,536375,2010-12-01 09:32:00,85123A,CREAM HANGING HEART T-LIGHT HOLDER,6,2.55,15.3,17850,United Kingdom
3,536390,2010-12-01 10:19:00,85123A,CREAM HANGING HEART T-LIGHT HOLDER,64,2.55,163.2,17511,United Kingdom
4,536394,2010-12-01 10:39:00,85123A,CREAM HANGING HEART T-LIGHT HOLDER,32,2.55,81.6,13408,United Kingdom


In [76]:
df_xlsx.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 396034 entries, 0 to 396033
Data columns (total 9 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   InvoiceNo    396034 non-null  int64         
 1   InvoiceDate  396034 non-null  datetime64[ns]
 2   StockCode    396034 non-null  object        
 3   Description  396034 non-null  object        
 4   Quantity     396034 non-null  int64         
 5   UnitPrice    396034 non-null  float64       
 6   Revenue      396034 non-null  float64       
 7   CustomerID   396034 non-null  int64         
 8   Country      396034 non-null  object        
dtypes: datetime64[ns](1), float64(2), int64(3), object(3)
memory usage: 90.7 MB


In [77]:
# xls


df_xls=pd.read_excel('../data/Sensor Data.xls')

df_xls.head()

Unnamed: 0,Input 1,Input 2,Input 3,Input 4,Input 5,Input 6,Input 7,Input 8,Input 9,Input 10,Input 11,Input 12,output1,output2,class
0,1.473,2.311,3.179,2.666,0.2795,0.2771,0.2234,0.1855,0.2539,1.138,1.111,4.712,1,1,one
1,1.46,2.377,3.214,2.92,0.2527,0.3064,0.02563,0.1965,0.3027,1.213,1.027,5.463,1,1,one
2,1.552,2.164,3.064,2.745,0.282,0.21,0.1721,0.1929,0.21,1.221,1.058,5.332,1,1,one
3,1.605,2.228,3.149,2.834,0.2917,0.3613,0.2087,0.1294,0.2734,1.144,1.062,4.829,1,1,one
4,1.534,2.114,3.309,2.976,0.21,0.2502,0.2258,0.177,0.2039,1.254,1.112,5.734,1,1,one


In [78]:
df_xls=pd.read_excel('../data/Sensor Data.xls', 'Sheet1')

df_xls.head()

Unnamed: 0,Input 1,Input 2,Input 3,Input 4,Input 5,Input 6,Input 7,Input 8,Input 9,Input 10,Input 11,Input 12,output1,output2,class
0,1.473,2.311,3.179,2.666,0.2795,0.2771,0.2234,0.1855,0.2539,1.138,1.111,4.712,1,1,one
1,1.46,2.377,3.214,2.92,0.2527,0.3064,0.02563,0.1965,0.3027,1.213,1.027,5.463,1,1,one
2,1.552,2.164,3.064,2.745,0.282,0.21,0.1721,0.1929,0.21,1.221,1.058,5.332,1,1,one
3,1.605,2.228,3.149,2.834,0.2917,0.3613,0.2087,0.1294,0.2734,1.144,1.062,4.829,1,1,one
4,1.534,2.114,3.309,2.976,0.21,0.2502,0.2258,0.177,0.2039,1.254,1.112,5.734,1,1,one


In [79]:
df_xls=pd.read_excel('../data/Sensor Data.xls', 'Sheet2')

df_xls.head()

Unnamed: 0,Sensor Data
0,The data source as well as the exact nature of...
1,Each data instance contains 12 real-valued inp...
2,represents a sensor designed to detect the pre...
3,"of substances. As an alternative, the sensor r..."
4,


In [80]:
df_xls=pd.read_excel('../data/Sensor Data.xls', 'Sheet3')

df_xls.head()

Unnamed: 0,hola,amiguis,estamos,probando,pandas


In [81]:
df_xls=pd.read_excel('../data/Sensor Data.xls', 'Hoja1')

df_xls.head()

Unnamed: 0,32,42,4q34q34
0,23r,4,
1,,42,


In [82]:
df_xls=pd.read_excel('../data/Sensor Data.xls', 0)

df_xls.head()

Unnamed: 0,Input 1,Input 2,Input 3,Input 4,Input 5,Input 6,Input 7,Input 8,Input 9,Input 10,Input 11,Input 12,output1,output2,class
0,1.473,2.311,3.179,2.666,0.2795,0.2771,0.2234,0.1855,0.2539,1.138,1.111,4.712,1,1,one
1,1.46,2.377,3.214,2.92,0.2527,0.3064,0.02563,0.1965,0.3027,1.213,1.027,5.463,1,1,one
2,1.552,2.164,3.064,2.745,0.282,0.21,0.1721,0.1929,0.21,1.221,1.058,5.332,1,1,one
3,1.605,2.228,3.149,2.834,0.2917,0.3613,0.2087,0.1294,0.2734,1.144,1.062,4.829,1,1,one
4,1.534,2.114,3.309,2.976,0.21,0.2502,0.2258,0.177,0.2039,1.254,1.112,5.734,1,1,one


In [83]:
xl = pd.ExcelFile('../data/Sensor Data.xls')

xl.sheet_names  # see all sheet names

['Sheet1', 'Sheet2', 'Sheet3', 'Hoja1']

In [84]:
dictio_df = {}

for hoja in xl.sheet_names:
    
    dictio_df[hoja]=pd.read_excel('../data/Sensor Data.xls', hoja)

In [85]:
dictio_df['Sheet1'].head()

Unnamed: 0,Input 1,Input 2,Input 3,Input 4,Input 5,Input 6,Input 7,Input 8,Input 9,Input 10,Input 11,Input 12,output1,output2,class
0,1.473,2.311,3.179,2.666,0.2795,0.2771,0.2234,0.1855,0.2539,1.138,1.111,4.712,1,1,one
1,1.46,2.377,3.214,2.92,0.2527,0.3064,0.02563,0.1965,0.3027,1.213,1.027,5.463,1,1,one
2,1.552,2.164,3.064,2.745,0.282,0.21,0.1721,0.1929,0.21,1.221,1.058,5.332,1,1,one
3,1.605,2.228,3.149,2.834,0.2917,0.3613,0.2087,0.1294,0.2734,1.144,1.062,4.829,1,1,one
4,1.534,2.114,3.309,2.976,0.21,0.2502,0.2258,0.177,0.2039,1.254,1.112,5.734,1,1,one


In [86]:
dictio_df['Hoja1'].head()

Unnamed: 0,32,42,4q34q34
0,23r,4,
1,,42,


In [87]:
dictio_df.keys()

dict_keys(['Sheet1', 'Sheet2', 'Sheet3', 'Hoja1'])

In [88]:
type(dictio_df['Hoja1'])

pandas.core.frame.DataFrame

In [89]:
dictio_df['Hoja1']['hola']=0

dictio_df['Hoja1']

Unnamed: 0,32,42,4q34q34,hola
0,23r,4,,0
1,,42,,0


In [90]:
dictio_df['Hoja1'].columns

Index([32, 42, '4q34q34', 'hola'], dtype='object')

In [91]:
# json

df_json = pd.read_json('../data/companies.json',
                      orient='records',  # each record is one JSON object
                      lines=True)        # one record per line (JSON Lines)

df_json.head()

Unnamed: 0,_id,name,permalink,crunchbase_url,homepage_url,blog_url,blog_feed_url,twitter_username,category_code,number_of_employees,founded_year,founded_month,founded_day,deadpooled_year,tag_list,alias_list,email_address,phone_number,description,created_at,updated_at,overview,image,products,relationships,competitions,providerships,total_money_raised,funding_rounds,investments,acquisition,acquisitions,offices,milestones,video_embeds,screenshots,external_links,partners,deadpooled_month,deadpooled_day,deadpooled_url,ipo
0,{'$oid': '52cdef7c4bab8bd675297d8a'},Wetpaint,abc2,http://www.crunchbase.com/company/wetpaint,http://wetpaint-inc.com,http://digitalquarters.net/,http://digitalquarters.net/feed/,BachelrWetpaint,web,47.0,2005.0,10.0,17.0,1.0,"wiki, seattle, elowitz, media-industry, media-...",,info@wetpaint.com,206.859.6300,Technology Platform Company,{'$date': 1180075887000},2013-12-08 07:15:44+00:00,<p>Wetpaint is a technology platform company t...,"{'available_sizes': [[[150, 75], 'assets/image...","[{'name': 'Wikison Wetpaint', 'permalink': 'we...","[{'is_past': False, 'title': 'Co-Founder and V...","[{'competitor': {'name': 'Wikia', 'permalink':...",[],$39.8M,"[{'id': 888, 'round_code': 'a', 'source_url': ...",[],"{'price_amount': 30000000, 'price_currency_cod...",[],"[{'description': '', 'address1': '710 - 2nd Av...","[{'id': 5869, 'description': 'Wetpaint named i...",[],"[{'available_sizes': [[[150, 86], 'assets/imag...",[{'external_url': 'http://www.geekwire.com/201...,[],,,,
1,{'$oid': '52cdef7c4bab8bd675297d8b'},AdventNet,abc3,http://www.crunchbase.com/company/adventnet,http://adventnet.com,,,manageengine,enterprise,600.0,1996.0,,,2.0,,Zoho ManageEngine,pr@adventnet.com,925-924-9500,Server Management Software,{'$date': 1180121062000},2012-10-31 18:26:09+00:00,"<p>AdventNet is now <a href=""/company/zoho-man...","{'available_sizes': [[[150, 55], 'assets/image...",[],"[{'is_past': True, 'title': 'CEO and Co-Founde...",[],"[{'title': 'DHFH', 'is_past': True, 'provider'...",$0,[],[],,[],"[{'description': 'Headquarters', 'address1': '...",[],[],"[{'available_sizes': [[[150, 94], 'assets/imag...",[],[],,,,
2,{'$oid': '52cdef7c4bab8bd675297d8c'},Zoho,abc4,http://www.crunchbase.com/company/zoho,http://zoho.com,http://blogs.zoho.com/,http://blogs.zoho.com/feed,zoho,software,1600.0,2005.0,9.0,15.0,3.0,"zoho, officesuite, spreadsheet, writer, projec...",,info@zohocorp.com,1-888-204-3539,Online Business Apps Suite,Fri May 25 19:30:28 UTC 2007,2013-10-30 00:07:05+00:00,"<p>Zoho offers a suite of Business, Collaborat...","{'available_sizes': [[[150, 55], 'assets/image...","[{'name': 'Zoho Office Suite', 'permalink': 'z...","[{'is_past': False, 'title': 'CEO and Founder'...","[{'competitor': {'name': 'Empressr', 'permalin...",[],$0,[],[],,[],"[{'description': 'Headquarters', 'address1': '...","[{'id': 388, 'description': 'Zoho Reaches 2 Mi...","[{'embed_code': '<object width=""430"" height=""2...",[],[{'external_url': 'http://www.online-tech-tips...,[],,,,
3,{'$oid': '52cdef7c4bab8bd675297d8d'},Digg,digg,http://www.crunchbase.com/company/digg,http://www.digg.com,http://blog.digg.com/,http://blog.digg.com/?feed=rss2,digg,news,60.0,2004.0,10.0,11.0,,"community, social, news, bookmark, digg, techn...",,feedback@digg.com,(415) 436-9638,user driven social content website,Fri May 25 20:03:23 UTC 2007,2013-11-05 21:35:47+00:00,<p>Digg is a user driven social content websit...,"{'available_sizes': [[[150, 150], 'assets/imag...","[{'name': 'Digg', 'permalink': 'digg'}]","[{'is_past': False, 'title': 'CEO', 'person': ...","[{'competitor': {'name': 'Reddit', 'permalink'...","[{'title': 'Public Relations', 'is_past': True...",$45M,"[{'id': 1, 'round_code': 'b', 'source_url': 'h...",[],"{'price_amount': 500000, 'price_currency_code'...","[{'price_amount': None, 'price_currency_code':...","[{'description': None, 'address1': '135 Missis...","[{'id': 9588, 'description': 'Another Digg Exe...","[{'embed_code': '<embed src=""http://blip.tv/pl...","[{'available_sizes': [[[117, 150], 'assets/ima...",[{'external_url': 'http://www.sociableblog.com...,[],,,,
4,{'$oid': '52cdef7c4bab8bd675297d8e'},Facebook,facebook,http://www.crunchbase.com/company/facebook,http://facebook.com,http://blog.facebook.com,http://blog.facebook.com/atom.php,facebook,social,5299.0,2004.0,2.0,1.0,,"facebook, college, students, profiles, network...",,,,Social network,Fri May 25 21:22:15 UTC 2007,2013-11-21 19:40:55+00:00,<p>Facebook is the world&#8217;s largest socia...,"{'available_sizes': [[[150, 61], 'assets/image...","[{'name': 'Facebook Platform', 'permalink': 'f...","[{'is_past': False, 'title': 'Founder and CEO,...","[{'competitor': {'name': 'MySpace', 'permalink...","[{'title': '', 'is_past': False, 'provider': {...",$2.43B,"[{'id': 2, 'round_code': 'angel', 'source_url'...","[{'funding_round': {'round_code': 'seed', 'sou...",,"[{'price_amount': None, 'price_currency_code':...","[{'description': 'Headquarters', 'address1': '...","[{'id': 108, 'description': 'Facebook adds com...",[],"[{'available_sizes': [[[150, 68], 'assets/imag...",[{'external_url': 'http://latimesblogs.latimes...,[],,,,"{'valuation_amount': 104000000000, 'valuation_..."
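
Several columns in the output above (`_id`, `image`, `offices`, and others) hold nested dictionaries or lists rather than scalar values. A minimal sketch of how one such column could be flattened with `pd.json_normalize`, using two made-up JSON lines instead of `companies.json` (the sample values and the names `raw`, `df_lines`, `ids` and `oid` are illustrative assumptions):

```python
import io
import pandas as pd

# Two made-up records in JSON Lines format (one JSON object per line)
raw = (
    '{"_id": {"$oid": "a1"}, "name": "Acme", "offices": [{"city": "X"}]}\n'
    '{"_id": {"$oid": "b2"}, "name": "Beta", "offices": []}\n'
)

df_lines = pd.read_json(io.StringIO(raw), orient='records', lines=True)

# The nested dict in '_id' is kept as a Python object in a single column;
# json_normalize expands a list of dicts into one column per key
ids = pd.json_normalize(df_lines['_id'].tolist())
df_lines['oid'] = ids['$oid']

print(df_lines[['oid', 'name']])
```

Both frames share the default `0..n-1` index here, so the assignment aligns row by row; with a custom index, the alignment would need to be checked first.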
