# 1.2 - Intro Pandas (Panel Data)

**[Documentation](https://pandas.pydata.org/docs/reference/index.html#api)**

**[Source code](https://github.com/pandas-dev/pandas)**


![pandas](images/pandas.png)


pandas is a Python library specialized in handling and analyzing structured data.


The main features of the library are:

+ It defines new data structures built on NumPy arrays, with additional functionality.
+ It reads and writes CSV files, Excel files, and SQL databases with little effort.
+ It gives access to the data through indexes or names for rows and columns.
+ It offers methods to reorder, split, and combine data sets.
+ It supports working with time series.
+ It performs these operations efficiently, since much of the work is delegated to vectorized NumPy routines.


**pandas data types**

pandas provides two main data structures:

+ Series: a one-dimensional structure.
+ DataFrame: a two-dimensional structure (tables).

Both are built on top of NumPy arrays and add functionality such as labelled indexes.
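As a minimal sketch of that relationship (the names `values`, `table`, `s`, and `df_demo` are illustrative and not part of the notebook), a Series and a DataFrame can be built from NumPy arrays and handed back as arrays with `to_numpy()`:

```python
import numpy as np
import pandas as pd

# Illustrative data; these names are assumptions made for this sketch.
values = np.array([1.5, 2.5, 3.5])
table = np.arange(6).reshape(3, 2)

s = pd.Series(values, index=['a', 'b', 'c'])        # one dimension plus labels
df_demo = pd.DataFrame(table, columns=['x', 'y'])   # two dimensions plus row/column labels

# Both structures can return their values as NumPy arrays.
print(type(s.to_numpy()), s.to_numpy().shape)
print(type(df_demo.to_numpy()), df_demo.to_numpy().shape)
```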

In [1]:
%pip install pandas

Note: you may need to restart the kernel to use updated packages.


In [2]:
import pandas as pd

In [3]:
import numpy as np

In [4]:
import warnings

warnings.filterwarnings('ignore')

### Series

A Series is similar to a one-dimensional array. It is homogeneous, meaning all of its elements share a single dtype (when the values are of mixed types, pandas falls back to the generic `object` dtype). Its size is immutable, that is, it cannot be changed, although its contents can.

It has an index that associates a label with each element of the series, and elements are accessed through that label.
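A small sketch of both ideas, using made-up values (the names `numbers` and `mixed` are assumptions for illustration): the dtype pandas infers for uniform versus mixed data, and access by label versus by position.

```python
import pandas as pd

# Hypothetical values, chosen only to illustrate dtype inference.
numbers = pd.Series([1.0, 2.5, 4.0], index=['a', 'b', 'c'])
mixed = pd.Series([1.0, 'hola', 4.0], index=['a', 'b', 'c'])

print(numbers.dtype)   # float64: every element shares one numeric type
print(mixed.dtype)     # object: mixed values are stored as generic Python objects

# Label-based (.loc) and position-based (.iloc) access to the same element.
print(numbers.loc['b'], numbers.iloc[1])
```

Using `.loc` and `.iloc` explicitly avoids any ambiguity about whether a key is a label or a position.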

In [5]:
lst = [(3.4 + i)**2 for i in range(10)]

lst

[11.559999999999999,
 19.360000000000003,
 29.160000000000004,
 40.96000000000001,
 54.760000000000005,
 70.56,
 88.36000000000001,
 108.16000000000001,
 129.96,
 153.76000000000002]

In [6]:
lst.append('hola')

In [7]:
serie = pd.Series(lst)

serie

0      11.56
1      19.36
2      29.16
3      40.96
4      54.76
5      70.56
6      88.36
7     108.16
8     129.96
9     153.76
10      hola
dtype: object

In [8]:
serie[2]='hola'

serie

0      11.56
1      19.36
2       hola
3      40.96
4      54.76
5      70.56
6      88.36
7     108.16
8     129.96
9     153.76
10      hola
dtype: object

In [9]:
type(serie[0])

float

In [10]:
type(serie[2])

str

In [11]:
#help(serie)

In [12]:
serie.head()    # the first rows, 5 by default

0    11.56
1    19.36
2     hola
3    40.96
4    54.76
dtype: object

In [13]:
serie.head(10) 

0     11.56
1     19.36
2      hola
3     40.96
4     54.76
5     70.56
6     88.36
7    108.16
8    129.96
9    153.76
dtype: object

In [14]:
serie.tail() 

6      88.36
7     108.16
8     129.96
9     153.76
10      hola
dtype: object

In [15]:
serie.index

RangeIndex(start=0, stop=11, step=1)

In [16]:
serie.shape

(11,)

In [17]:
serie.index = ['q', 't', 'y', 'o', 'p', 'a', 's', 'd', 'f', 'g', 'v']

serie

q     11.56
t     19.36
y      hola
o     40.96
p     54.76
a     70.56
s     88.36
d    108.16
f    129.96
g    153.76
v      hola
dtype: object

In [18]:
serie.iloc[0]    # access by position; plain serie[0] on a non-integer index is deprecated in recent pandas

11.559999999999999

In [19]:
serie['q']

11.559999999999999

In [20]:
serie.to_dict()

{'q': 11.559999999999999,
 't': 19.360000000000003,
 'y': 'hola',
 'o': 40.96000000000001,
 'p': 54.760000000000005,
 'a': 70.56,
 's': 88.36000000000001,
 'd': 108.16000000000001,
 'f': 129.96,
 'g': 153.76000000000002,
 'v': 'hola'}

In [21]:
serie.to_frame()

Unnamed: 0,0
q,11.56
t,19.36
y,hola
o,40.96
p,54.76
a,70.56
s,88.36
d,108.16
f,129.96
g,153.76
v,hola


In [22]:
serie.to_frame().shape

(11, 1)

### DataFrame

A DataFrame is a data set structured as a table in which each column is a Series, so all the values in a given column share the same type, while the rows are records that may contain values of different types.

A DataFrame has two indexes, one for the rows and one for the columns, and its elements can be accessed through the row and column names.
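The sketch below illustrates the two indexes on a tiny, made-up table (the frame `people` and its labels are assumptions for this example): each column has its own dtype, and a cell can be reached by labels with `.loc` or by integer positions with `.iloc`.

```python
import pandas as pd

# Small illustrative table; column and row names are assumptions for this sketch.
people = pd.DataFrame(
    {'name': ['Ana', 'Luis'], 'age': [31, 45], 'height': [1.68, 1.80]},
    index=['r1', 'r2'],
)

print(people.dtypes)             # one dtype per column
print(people.loc['r2', 'age'])   # access by row label and column label
print(people.iloc[0, 2])         # access by integer position (row 0, column 2)
print(people.loc['r1'])          # a single row can mix values of different types
```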

In [23]:
columnas=['col1', 'col2', 'col 3', 'col4', 'col5']

array=np.random.random((10, 5))

array

array([[0.35933884, 0.63854683, 0.59038904, 0.62835423, 0.6552042 ],
       [0.13043818, 0.31811479, 0.02432246, 0.59842842, 0.35008353],
       [0.27860645, 0.73718369, 0.10505254, 0.92342884, 0.97071581],
       [0.7752517 , 0.27969096, 0.43244209, 0.90127785, 0.99318389],
       [0.46071881, 0.92056973, 0.54019119, 0.00414845, 0.74703018],
       [0.95935814, 0.87222316, 0.10033472, 0.23299002, 0.63385998],
       [0.23347625, 0.55031907, 0.09222117, 0.66250078, 0.35227427],
       [0.15568885, 0.83083693, 0.81115762, 0.56001674, 0.92826593],
       [0.80666256, 0.19305189, 0.39513005, 0.15598495, 0.05682995],
       [0.17033105, 0.29363353, 0.18875597, 0.44625741, 0.124136  ]])

In [24]:
df = pd.DataFrame(array, columns=columnas)

display(df)

Unnamed: 0,col1,col2,col 3,col4,col5
0,0.359339,0.638547,0.590389,0.628354,0.655204
1,0.130438,0.318115,0.024322,0.598428,0.350084
2,0.278606,0.737184,0.105053,0.923429,0.970716
3,0.775252,0.279691,0.432442,0.901278,0.993184
4,0.460719,0.92057,0.540191,0.004148,0.74703
5,0.959358,0.872223,0.100335,0.23299,0.63386
6,0.233476,0.550319,0.092221,0.662501,0.352274
7,0.155689,0.830837,0.811158,0.560017,0.928266
8,0.806663,0.193052,0.39513,0.155985,0.05683
9,0.170331,0.293634,0.188756,0.446257,0.124136


In [25]:
df['col 3']

0    0.590389
1    0.024322
2    0.105053
3    0.432442
4    0.540191
5    0.100335
6    0.092221
7    0.811158
8    0.395130
9    0.188756
Name: col 3, dtype: float64

In [26]:
df.col 3

SyntaxError: invalid syntax (3887910251.py, line 1)

In [27]:
df.columns = [c.replace(' ', '_') for c in df.columns]

df

Unnamed: 0,col1,col2,col_3,col4,col5
0,0.359339,0.638547,0.590389,0.628354,0.655204
1,0.130438,0.318115,0.024322,0.598428,0.350084
2,0.278606,0.737184,0.105053,0.923429,0.970716
3,0.775252,0.279691,0.432442,0.901278,0.993184
4,0.460719,0.92057,0.540191,0.004148,0.74703
5,0.959358,0.872223,0.100335,0.23299,0.63386
6,0.233476,0.550319,0.092221,0.662501,0.352274
7,0.155689,0.830837,0.811158,0.560017,0.928266
8,0.806663,0.193052,0.39513,0.155985,0.05683
9,0.170331,0.293634,0.188756,0.446257,0.124136


In [28]:
df.col_3

0    0.590389
1    0.024322
2    0.105053
3    0.432442
4    0.540191
5    0.100335
6    0.092221
7    0.811158
8    0.395130
9    0.188756
Name: col_3, dtype: float64

In [29]:
df.col1

0    0.359339
1    0.130438
2    0.278606
3    0.775252
4    0.460719
5    0.959358
6    0.233476
7    0.155689
8    0.806663
9    0.170331
Name: col1, dtype: float64

In [30]:
select_cols = ['col1', 'col4']

df[select_cols]

Unnamed: 0,col1,col4
0,0.359339,0.628354
1,0.130438,0.598428
2,0.278606,0.923429
3,0.775252,0.901278
4,0.460719,0.004148
5,0.959358,0.23299
6,0.233476,0.662501
7,0.155689,0.560017
8,0.806663,0.155985
9,0.170331,0.446257


In [31]:
df[['col1', 'col4']]

Unnamed: 0,col1,col4
0,0.359339,0.628354
1,0.130438,0.598428
2,0.278606,0.923429
3,0.775252,0.901278
4,0.460719,0.004148
5,0.959358,0.23299
6,0.233476,0.662501
7,0.155689,0.560017
8,0.806663,0.155985
9,0.170331,0.446257


In [32]:
df.rename(columns={'col_3': 'nuevo_col'},
          inplace=True)

In [33]:
df

Unnamed: 0,col1,col2,nuevo_col,col4,col5
0,0.359339,0.638547,0.590389,0.628354,0.655204
1,0.130438,0.318115,0.024322,0.598428,0.350084
2,0.278606,0.737184,0.105053,0.923429,0.970716
3,0.775252,0.279691,0.432442,0.901278,0.993184
4,0.460719,0.92057,0.540191,0.004148,0.74703
5,0.959358,0.872223,0.100335,0.23299,0.63386
6,0.233476,0.550319,0.092221,0.662501,0.352274
7,0.155689,0.830837,0.811158,0.560017,0.928266
8,0.806663,0.193052,0.39513,0.155985,0.05683
9,0.170331,0.293634,0.188756,0.446257,0.124136


In [34]:
df['ceros'] = 0.

df

Unnamed: 0,col1,col2,nuevo_col,col4,col5,ceros
0,0.359339,0.638547,0.590389,0.628354,0.655204,0.0
1,0.130438,0.318115,0.024322,0.598428,0.350084,0.0
2,0.278606,0.737184,0.105053,0.923429,0.970716,0.0
3,0.775252,0.279691,0.432442,0.901278,0.993184,0.0
4,0.460719,0.92057,0.540191,0.004148,0.74703,0.0
5,0.959358,0.872223,0.100335,0.23299,0.63386,0.0
6,0.233476,0.550319,0.092221,0.662501,0.352274,0.0
7,0.155689,0.830837,0.811158,0.560017,0.928266,0.0
8,0.806663,0.193052,0.39513,0.155985,0.05683,0.0
9,0.170331,0.293634,0.188756,0.446257,0.124136,0.0


In [35]:
df.insert(0, 'nuevo_cero', [i for i in range(10)])  # insert as the first column

In [36]:
df

Unnamed: 0,nuevo_cero,col1,col2,nuevo_col,col4,col5,ceros
0,0,0.359339,0.638547,0.590389,0.628354,0.655204,0.0
1,1,0.130438,0.318115,0.024322,0.598428,0.350084,0.0
2,2,0.278606,0.737184,0.105053,0.923429,0.970716,0.0
3,3,0.775252,0.279691,0.432442,0.901278,0.993184,0.0
4,4,0.460719,0.92057,0.540191,0.004148,0.74703,0.0
5,5,0.959358,0.872223,0.100335,0.23299,0.63386,0.0
6,6,0.233476,0.550319,0.092221,0.662501,0.352274,0.0
7,7,0.155689,0.830837,0.811158,0.560017,0.928266,0.0
8,8,0.806663,0.193052,0.39513,0.155985,0.05683,0.0
9,9,0.170331,0.293634,0.188756,0.446257,0.124136,0.0


In [37]:
df['nulos'] = np.nan

df

Unnamed: 0,nuevo_cero,col1,col2,nuevo_col,col4,col5,ceros,nulos
0,0,0.359339,0.638547,0.590389,0.628354,0.655204,0.0,
1,1,0.130438,0.318115,0.024322,0.598428,0.350084,0.0,
2,2,0.278606,0.737184,0.105053,0.923429,0.970716,0.0,
3,3,0.775252,0.279691,0.432442,0.901278,0.993184,0.0,
4,4,0.460719,0.92057,0.540191,0.004148,0.74703,0.0,
5,5,0.959358,0.872223,0.100335,0.23299,0.63386,0.0,
6,6,0.233476,0.550319,0.092221,0.662501,0.352274,0.0,
7,7,0.155689,0.830837,0.811158,0.560017,0.928266,0.0,
8,8,0.806663,0.193052,0.39513,0.155985,0.05683,0.0,
9,9,0.170331,0.293634,0.188756,0.446257,0.124136,0.0,


In [38]:
df.shape

(10, 8)

In [39]:
len(df)

10

In [40]:
df.shape[0]

10

In [41]:
df['col10'] = df.col1 * df.col4 / df.col2

df

Unnamed: 0,nuevo_cero,col1,col2,nuevo_col,col4,col5,ceros,nulos,col10
0,0,0.359339,0.638547,0.590389,0.628354,0.655204,0.0,,0.353603
1,1,0.130438,0.318115,0.024322,0.598428,0.350084,0.0,,0.245377
2,2,0.278606,0.737184,0.105053,0.923429,0.970716,0.0,,0.348995
3,3,0.775252,0.279691,0.432442,0.901278,0.993184,0.0,,2.498176
4,4,0.460719,0.92057,0.540191,0.004148,0.74703,0.0,,0.002076
5,5,0.959358,0.872223,0.100335,0.23299,0.63386,0.0,,0.256266
6,6,0.233476,0.550319,0.092221,0.662501,0.352274,0.0,,0.28107
7,7,0.155689,0.830837,0.811158,0.560017,0.928266,0.0,,0.10494
8,8,0.806663,0.193052,0.39513,0.155985,0.05683,0.0,,0.651779
9,9,0.170331,0.293634,0.188756,0.446257,0.124136,0.0,,0.258865
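Derived columns like `col10` rely on element-wise arithmetic between columns. A minimal sketch with made-up data (the frame `sales` and the 1.21 factor are hypothetical) shows the same idea, plus `assign`, which returns a new DataFrame instead of modifying the original:

```python
import pandas as pd

# Illustrative frame; names and numbers are assumptions for this sketch.
sales = pd.DataFrame({'units': [2, 5, 3], 'price': [10.0, 4.0, 7.5]})

# Arithmetic between columns is applied element by element, aligned on the row index.
sales['total'] = sales['units'] * sales['price']

# assign returns a new DataFrame with the extra column; `sales` itself is unchanged.
with_tax = sales.assign(total_with_tax=sales['total'] * 1.21)

print(sales['total'].tolist())     # [20.0, 20.0, 22.5]
print(with_tax.columns.tolist())
```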


In [42]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
nuevo_cero,10.0,4.5,3.02765,0.0,2.25,4.5,6.75,9.0
col1,10.0,0.432987,0.305704,0.130438,0.186117,0.318973,0.696618,0.959358
col2,10.0,0.563417,0.27509,0.193052,0.299754,0.594433,0.807424,0.92057
nuevo_col,10.0,0.328,0.264883,0.024322,0.101514,0.291943,0.513254,0.811158
col4,10.0,0.511339,0.304339,0.004148,0.286307,0.579223,0.653964,0.923429
col5,10.0,0.581158,0.344376,0.05683,0.350631,0.644532,0.882957,0.993184
ceros,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
nulos,0.0,,,,,,,
col10,10.0,0.500115,0.722211,0.002076,0.248099,0.269968,0.352451,2.498176


In [43]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   nuevo_cero  10 non-null     int64  
 1   col1        10 non-null     float64
 2   col2        10 non-null     float64
 3   nuevo_col   10 non-null     float64
 4   col4        10 non-null     float64
 5   col5        10 non-null     float64
 6   ceros       10 non-null     float64
 7   nulos       0 non-null      float64
 8   col10       10 non-null     float64
dtypes: float64(8), int64(1)
memory usage: 848.0 bytes


In [44]:
df.fillna('hola', inplace=True)

In [45]:
df.col1 = df.col1.fillna(0)
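The cells above fill missing values with a single constant. As a hedged sketch on made-up data (the frame `gaps` is an assumption for illustration), `fillna` also accepts a dict that maps each column to its own fill value:

```python
import numpy as np
import pandas as pd

# Illustrative frame with gaps; names are assumptions for this sketch.
gaps = pd.DataFrame({'price': [1.0, np.nan, 3.0], 'label': ['a', None, 'c']})

# A dict maps each column to its own fill value; the result is a new DataFrame.
filled = gaps.fillna({'price': 0.0, 'label': 'unknown'})

print(filled)
print(gaps.isna().sum().sum(), filled.isna().sum().sum())  # missing cells before and after
```

Filling per column keeps numeric columns numeric, whereas filling everything with a string turns any affected numeric column into `object`.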

In [46]:
# build a DataFrame from a list of lists

lst_lst=[[687261, 'hola', 4728364], 
         [83546, 'adios', 58943], 
         [321, 'oo^oo']]


columnas=['num', 'palabra', 'otro_num']

In [47]:
df_lst = pd.DataFrame(lst_lst, columns=columnas)

df_lst

Unnamed: 0,num,palabra,otro_num
0,687261,hola,4728364.0
1,83546,adios,58943.0
2,321,oo^oo,


In [48]:
df_lst['num']

0    687261
1     83546
2       321
Name: num, dtype: int64

In [49]:
df_lst['otro_num'] = df_lst['otro_num'].fillna(0)  # assign back; a chained inplace fillna may not update df_lst in recent pandas

In [50]:
df_lst

Unnamed: 0,num,palabra,otro_num
0,687261,hola,4728364.0
1,83546,adios,58943.0
2,321,oo^oo,0.0


In [51]:
# from a dictionary

dictio={'casa': lst_lst[0],
        'oficina': lst_lst[1],
        'numero': lst_lst[2]+[0]}

dictio

{'casa': [687261, 'hola', 4728364],
 'oficina': [83546, 'adios', 58943],
 'numero': [321, 'oo^oo', 0]}

In [52]:
df_dictio = pd.DataFrame(dictio)

df_dictio

Unnamed: 0,casa,oficina,numero
0,687261,83546,321
1,hola,adios,oo^oo
2,4728364,58943,0


In [53]:
df.drop('nulos',   # column name
        axis=1,    # operate on columns
        inplace=True    # modify df in place
       )

In [54]:
df.drop(columns=['col1', 'col2'], inplace=True)

In [55]:
df.drop(0, axis=0, inplace=True)

In [56]:
df.drop(1, inplace=True)

In [57]:
df.drop(index=[5, 6, 7], inplace=True)

In [58]:
df

Unnamed: 0,nuevo_cero,nuevo_col,col4,col5,ceros,col10
2,2,0.105053,0.923429,0.970716,0.0,0.348995
3,3,0.432442,0.901278,0.993184,0.0,2.498176
4,4,0.540191,0.004148,0.74703,0.0,0.002076
8,8,0.39513,0.155985,0.05683,0.0,0.651779
9,9,0.188756,0.446257,0.124136,0.0,0.258865


In [59]:
df.reset_index()

Unnamed: 0,index,nuevo_cero,nuevo_col,col4,col5,ceros,col10
0,2,2,0.105053,0.923429,0.970716,0.0,0.348995
1,3,3,0.432442,0.901278,0.993184,0.0,2.498176
2,4,4,0.540191,0.004148,0.74703,0.0,0.002076
3,8,8,0.39513,0.155985,0.05683,0.0,0.651779
4,9,9,0.188756,0.446257,0.124136,0.0,0.258865


In [60]:
df.reset_index(drop=True)

Unnamed: 0,nuevo_cero,nuevo_col,col4,col5,ceros,col10
0,2,0.105053,0.923429,0.970716,0.0,0.348995
1,3,0.432442,0.901278,0.993184,0.0,2.498176
2,4,0.540191,0.004148,0.74703,0.0,0.002076
3,8,0.39513,0.155985,0.05683,0.0,0.651779
4,9,0.188756,0.446257,0.124136,0.0,0.258865


In [61]:
df.index = [i for i in range(len(df))]

df

Unnamed: 0,nuevo_cero,nuevo_col,col4,col5,ceros,col10
0,2,0.105053,0.923429,0.970716,0.0,0.348995
1,3,0.432442,0.901278,0.993184,0.0,2.498176
2,4,0.540191,0.004148,0.74703,0.0,0.002076
3,8,0.39513,0.155985,0.05683,0.0,0.651779
4,9,0.188756,0.446257,0.124136,0.0,0.258865
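The cells above use `inplace=True`. A short sketch on made-up data (the frame `base` is an assumption for illustration) shows the default behaviour, where `drop` and `reset_index` return new objects and leave the original untouched:

```python
import pandas as pd

# Illustrative frame; names are assumptions for this sketch.
base = pd.DataFrame({'a': [10, 20, 30, 40], 'b': [1, 2, 3, 4]})

# drop returns a new object by default, so `base` keeps all rows and columns.
trimmed = base.drop(index=[0, 2]).drop(columns=['b'])
print(trimmed.index.tolist())      # remaining row labels: [1, 3]

# reset_index(drop=True) renumbers the rows from 0 and discards the old labels.
renumbered = trimmed.reset_index(drop=True)
print(renumbered.index.tolist())   # [0, 1]
print(base.shape)                  # (4, 2): the original is untouched
```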


### Operations


In [62]:
df

Unnamed: 0,nuevo_cero,nuevo_col,col4,col5,ceros,col10
0,2,0.105053,0.923429,0.970716,0.0,0.348995
1,3,0.432442,0.901278,0.993184,0.0,2.498176
2,4,0.540191,0.004148,0.74703,0.0,0.002076
3,8,0.39513,0.155985,0.05683,0.0,0.651779
4,9,0.188756,0.446257,0.124136,0.0,0.258865


In [63]:
df.transpose()

Unnamed: 0,0,1,2,3,4
nuevo_cero,2.0,3.0,4.0,8.0,9.0
nuevo_col,0.105053,0.432442,0.540191,0.39513,0.188756
col4,0.923429,0.901278,0.004148,0.155985,0.446257
col5,0.970716,0.993184,0.74703,0.05683,0.124136
ceros,0.0,0.0,0.0,0.0,0.0
col10,0.348995,2.498176,0.002076,0.651779,0.258865


In [64]:
df.T

Unnamed: 0,0,1,2,3,4
nuevo_cero,2.0,3.0,4.0,8.0,9.0
nuevo_col,0.105053,0.432442,0.540191,0.39513,0.188756
col4,0.923429,0.901278,0.004148,0.155985,0.446257
col5,0.970716,0.993184,0.74703,0.05683,0.124136
ceros,0.0,0.0,0.0,0.0,0.0
col10,0.348995,2.498176,0.002076,0.651779,0.258865


In [65]:
df.T.index

Index(['nuevo_cero', 'nuevo_col', 'col4', 'col5', 'ceros', 'col10'], dtype='object')

In [66]:
df.columns

Index(['nuevo_cero', 'nuevo_col', 'col4', 'col5', 'ceros', 'col10'], dtype='object')

In [67]:
df.index

Int64Index([0, 1, 2, 3, 4], dtype='int64')

In [68]:
df.sum()

nuevo_cero    26.000000
nuevo_col      1.661572
col4           2.431097
col5           2.891896
ceros          0.000000
col10          3.759891
dtype: float64

In [69]:
df.sum(axis=0)

nuevo_cero    26.000000
nuevo_col      1.661572
col4           2.431097
col5           2.891896
ceros          0.000000
col10          3.759891
dtype: float64

In [70]:
df.sum(axis=1)

0     4.348192
1     7.825080
2     5.293446
3     9.259724
4    10.018015
dtype: float64

In [71]:
seleccion=['col4', 'col5', 'col10']

df[seleccion].sum(axis=1)

0    2.243139
1    4.392637
2    0.753255
3    0.864594
4    0.829259
dtype: float64

In [72]:
df.std()

nuevo_cero    3.114482
nuevo_col     0.179890
col4          0.420259
col5          0.456285
ceros         0.000000
col10         1.003389
dtype: float64

In [73]:
df.var()

nuevo_cero    9.700000
nuevo_col     0.032360
col4          0.176617
col5          0.208196
ceros         0.000000
col10         1.006789
dtype: float64

In [74]:
df.mean()

nuevo_cero    5.200000
nuevo_col     0.332314
col4          0.486219
col5          0.578379
ceros         0.000000
col10         0.751978
dtype: float64

In [75]:
df.median()

nuevo_cero    4.000000
nuevo_col     0.395130
col4          0.446257
col5          0.747030
ceros         0.000000
col10         0.348995
dtype: float64
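A compact sketch of the `axis` argument on made-up data (the frame `grid` is an assumption for illustration): `axis=0` collapses the rows and yields one value per column, while `axis=1` collapses the columns and yields one value per row. It also shows that `var` (like `std`) uses the sample formula with `ddof=1` unless told otherwise.

```python
import pandas as pd

# Illustrative 2x3 table; names are assumptions for this sketch.
grid = pd.DataFrame({'x': [1, 2], 'y': [10, 20], 'z': [100, 200]})

col_totals = grid.sum(axis=0)   # collapses the rows: one result per column
row_totals = grid.sum(axis=1)   # collapses the columns: one result per row

print(col_totals.tolist())      # [3, 30, 300]
print(row_totals.tolist())      # [111, 222]

# Sample variance (ddof=1, the default) versus population variance (ddof=0).
print(grid['x'].var(), grid['x'].var(ddof=0))   # 0.5 0.25
```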

### Importing files

+ CSV
+ XLSX
+ XLS
+ JSON
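As a rough sketch of how the text-based readers behave, the example reads CSV and JSON from in-memory strings (made-up content standing in for files on disk). `pd.read_excel` follows the same pattern but needs a real workbook plus an engine such as openpyxl or xlrd.

```python
import io
import pandas as pd

# In-memory text stands in for files on disk; the contents are made up for this sketch.
csv_text = "make,year,mpg\nAcme,1984,17\nAcme,1985,16\n"
json_text = '[{"name": "A", "offices": 2}, {"name": "B", "offices": 1}]'

from_csv = pd.read_csv(io.StringIO(csv_text))
from_json = pd.read_json(io.StringIO(json_text))

print(from_csv.dtypes)    # column types are inferred while parsing
print(from_json.shape)    # (2, 2)
```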

In [76]:
pd.set_option('display.max_columns', None)  # show all columns

pd.set_option('display.max_rows', None)     # show all rows

In [77]:
df_csv = pd.read_csv('../data/vehicles.csv')

df_csv.head()

Unnamed: 0,Make,Model,Year,Engine Displacement,Cylinders,Transmission,Drivetrain,Vehicle Class,Fuel Type,Fuel Barrels/Year,City MPG,Highway MPG,Combined MPG,CO2 Emission Grams/Mile,Fuel Cost/Year
0,AM General,DJ Po Vehicle 2WD,1984,2.5,4.0,Automatic 3-spd,2-Wheel Drive,Special Purpose Vehicle 2WD,Regular,19.388824,18,17,17,522.764706,1950
1,AM General,FJ8c Post Office,1984,4.2,6.0,Automatic 3-spd,2-Wheel Drive,Special Purpose Vehicle 2WD,Regular,25.354615,13,13,13,683.615385,2550
2,AM General,Post Office DJ5 2WD,1985,2.5,4.0,Automatic 3-spd,Rear-Wheel Drive,Special Purpose Vehicle 2WD,Regular,20.600625,16,17,16,555.4375,2100
3,AM General,Post Office DJ8 2WD,1985,4.2,6.0,Automatic 3-spd,Rear-Wheel Drive,Special Purpose Vehicle 2WD,Regular,25.354615,13,13,13,683.615385,2550
4,ASC Incorporated,GNX,1987,3.8,6.0,Automatic 4-spd,Rear-Wheel Drive,Midsize Cars,Premium,20.600625,14,21,16,555.4375,2550


In [78]:
df_csv.shape

(35952, 15)

In [80]:
df_csv.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35952 entries, 0 to 35951
Data columns (total 15 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Make                     35952 non-null  object 
 1   Model                    35952 non-null  object 
 2   Year                     35952 non-null  int64  
 3   Engine Displacement      35952 non-null  float64
 4   Cylinders                35952 non-null  float64
 5   Transmission             35952 non-null  object 
 6   Drivetrain               35952 non-null  object 
 7   Vehicle Class            35952 non-null  object 
 8   Fuel Type                35952 non-null  object 
 9   Fuel Barrels/Year        35952 non-null  float64
 10  City MPG                 35952 non-null  int64  
 11  Highway MPG              35952 non-null  int64  
 12  Combined MPG             35952 non-null  int64  
 13  CO2 Emission Grams/Mile  35952 non-null  float64
 14  Fuel Cost/Year        

In [None]:
%pip install openpyxl
%pip install xlrd

In [81]:
# xlsx

df_xlsx = pd.read_excel('../data/Online Retail.xlsx')

df_xlsx.head()

Unnamed: 0,InvoiceNo,InvoiceDate,StockCode,Description,Quantity,UnitPrice,Revenue,CustomerID,Country
0,536365,2010-12-01 08:26:00,85123A,CREAM HANGING HEART T-LIGHT HOLDER,6,2.55,15.3,17850,United Kingdom
1,536373,2010-12-01 09:02:00,85123A,CREAM HANGING HEART T-LIGHT HOLDER,6,2.55,15.3,17850,United Kingdom
2,536375,2010-12-01 09:32:00,85123A,CREAM HANGING HEART T-LIGHT HOLDER,6,2.55,15.3,17850,United Kingdom
3,536390,2010-12-01 10:19:00,85123A,CREAM HANGING HEART T-LIGHT HOLDER,64,2.55,163.2,17511,United Kingdom
4,536394,2010-12-01 10:39:00,85123A,CREAM HANGING HEART T-LIGHT HOLDER,32,2.55,81.6,13408,United Kingdom


In [82]:
df_xlsx.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 396034 entries, 0 to 396033
Data columns (total 9 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   InvoiceNo    396034 non-null  int64         
 1   InvoiceDate  396034 non-null  datetime64[ns]
 2   StockCode    396034 non-null  object        
 3   Description  396034 non-null  object        
 4   Quantity     396034 non-null  int64         
 5   UnitPrice    396034 non-null  float64       
 6   Revenue      396034 non-null  float64       
 7   CustomerID   396034 non-null  int64         
 8   Country      396034 non-null  object        
dtypes: datetime64[ns](1), float64(2), int64(3), object(3)
memory usage: 90.7 MB


In [83]:
# xls

df_xls=pd.read_excel('../data/Sensor Data.xls')

df_xls.head()

Unnamed: 0,Input 1,Input 2,Input 3,Input 4,Input 5,Input 6,Input 7,Input 8,Input 9,Input 10,Input 11,Input 12,output1,output2,class
0,1.473,2.311,3.179,2.666,0.2795,0.2771,0.2234,0.1855,0.2539,1.138,1.111,4.712,1,1,one
1,1.46,2.377,3.214,2.92,0.2527,0.3064,0.02563,0.1965,0.3027,1.213,1.027,5.463,1,1,one
2,1.552,2.164,3.064,2.745,0.282,0.21,0.1721,0.1929,0.21,1.221,1.058,5.332,1,1,one
3,1.605,2.228,3.149,2.834,0.2917,0.3613,0.2087,0.1294,0.2734,1.144,1.062,4.829,1,1,one
4,1.534,2.114,3.309,2.976,0.21,0.2502,0.2258,0.177,0.2039,1.254,1.112,5.734,1,1,one


In [84]:
df_xls=pd.read_excel('../data/Sensor Data.xls', 'Sheet1')

df_xls.head()

Unnamed: 0,Input 1,Input 2,Input 3,Input 4,Input 5,Input 6,Input 7,Input 8,Input 9,Input 10,Input 11,Input 12,output1,output2,class
0,1.473,2.311,3.179,2.666,0.2795,0.2771,0.2234,0.1855,0.2539,1.138,1.111,4.712,1,1,one
1,1.46,2.377,3.214,2.92,0.2527,0.3064,0.02563,0.1965,0.3027,1.213,1.027,5.463,1,1,one
2,1.552,2.164,3.064,2.745,0.282,0.21,0.1721,0.1929,0.21,1.221,1.058,5.332,1,1,one
3,1.605,2.228,3.149,2.834,0.2917,0.3613,0.2087,0.1294,0.2734,1.144,1.062,4.829,1,1,one
4,1.534,2.114,3.309,2.976,0.21,0.2502,0.2258,0.177,0.2039,1.254,1.112,5.734,1,1,one


In [85]:
df_xls=pd.read_excel('../data/Sensor Data.xls', 'Sheet2')

df_xls.head()

Unnamed: 0,Sensor Data
0,The data source as well as the exact nature of...
1,Each data instance contains 12 real-valued inp...
2,represents a sensor designed to detect the pre...
3,"of substances. As an alternative, the sensor r..."
4,


In [86]:
df_xls=pd.read_excel('../data/Sensor Data.xls', 'Sheet3')

df_xls.head()

Unnamed: 0,hola,amiguis,estamos,probando,pandas


In [87]:
df_xls=pd.read_excel('../data/Sensor Data.xls', 'Hoja1')

df_xls.head()

Unnamed: 0,32,42,4q34q34
0,23r,4,
1,,42,


In [88]:
df_xls=pd.read_excel('../data/Sensor Data.xls', 0)

df_xls.head()

Unnamed: 0,Input 1,Input 2,Input 3,Input 4,Input 5,Input 6,Input 7,Input 8,Input 9,Input 10,Input 11,Input 12,output1,output2,class
0,1.473,2.311,3.179,2.666,0.2795,0.2771,0.2234,0.1855,0.2539,1.138,1.111,4.712,1,1,one
1,1.46,2.377,3.214,2.92,0.2527,0.3064,0.02563,0.1965,0.3027,1.213,1.027,5.463,1,1,one
2,1.552,2.164,3.064,2.745,0.282,0.21,0.1721,0.1929,0.21,1.221,1.058,5.332,1,1,one
3,1.605,2.228,3.149,2.834,0.2917,0.3613,0.2087,0.1294,0.2734,1.144,1.062,4.829,1,1,one
4,1.534,2.114,3.309,2.976,0.21,0.2502,0.2258,0.177,0.2039,1.254,1.112,5.734,1,1,one


In [89]:
df_xls=pd.read_excel('../data/Sensor Data.xls', 1)

df_xls.head()

Unnamed: 0,Sensor Data
0,The data source as well as the exact nature of...
1,Each data instance contains 12 real-valued inp...
2,represents a sensor designed to detect the pre...
3,"of substances. As an alternative, the sensor r..."
4,


In [90]:
xl = pd.ExcelFile('../data/Sensor Data.xls')

xl.sheet_names  # see all sheet names

['Sheet1', 'Sheet2', 'Sheet3', 'Hoja1']

In [91]:
dictio_df = {}


for hoja in xl.sheet_names:
    
    dictio_df[hoja] = pd.read_excel('../data/Sensor Data.xls', hoja)

In [93]:
dictio_df['Hoja1']

Unnamed: 0,32,42,4q34q34
0,23r,4,
1,,42,


In [94]:
dictio_df['Sheet1'].shape

(2212, 15)

In [95]:
dictio_df.keys()

dict_keys(['Sheet1', 'Sheet2', 'Sheet3', 'Hoja1'])

In [97]:
# json

df_json = pd.read_json('../data/oficinas.json')

df_json.head()

Unnamed: 0,name,totalOffices,lat,lng,principal
0,Wetpaint,2,47.603122,-122.333253,"{'type': 'Point', 'coordinates': [-122.333253,..."
1,AdventNet,1,37.692934,-121.904945,"{'type': 'Point', 'coordinates': [-121.904945,..."
2,Zoho,1,37.692934,-121.904945,"{'type': 'Point', 'coordinates': [-121.904945,..."
3,Digg,1,37.764726,-122.394523,"{'type': 'Point', 'coordinates': [-122.394523,..."
4,Facebook,3,37.41605,-122.151801,"{'type': 'Point', 'coordinates': [-122.151801,..."


In [98]:
df_json.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9618 entries, 0 to 13743
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   name          9618 non-null   object 
 1   totalOffices  9618 non-null   int64  
 2   lat           9618 non-null   float64
 3   lng           9618 non-null   float64
 4   principal     9618 non-null   object 
dtypes: float64(2), int64(1), object(2)
memory usage: 450.8+ KB


In [102]:
type(df_json['principal'][0])

dict
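Since each cell of `principal` holds a dict, one possible way to flatten that kind of structure is `pd.json_normalize`. The sketch below uses made-up records shaped like that column (the list `records` and its coordinates are assumptions, not data from `oficinas.json`):

```python
import pandas as pd

# Made-up records shaped like the `principal` column (one dict per row).
records = [
    {'name': 'A', 'principal': {'type': 'Point', 'coordinates': [-122.33, 47.60]}},
    {'name': 'B', 'principal': {'type': 'Point', 'coordinates': [-121.90, 37.69]}},
]

# json_normalize expands nested dicts into columns with dotted names,
# such as 'principal.type' and 'principal.coordinates'.
flat = pd.json_normalize(records)

print(flat.columns.tolist())
print(flat['principal.type'].unique())
```

For a column that already lives in a DataFrame, the same function can be applied to the list of dicts obtained with `tolist()`.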