# 1.2 - Intro to Pandas (Panel Data)

**[Documentation](https://pandas.pydata.org/docs/reference/index.html#api)**

**[Source code](https://github.com/pandas-dev/pandas)**


![pandas](images/pandas.png)


Pandas is a Python library specialized in handling and analyzing data structures.


The main features of this library are:

+ It defines new data structures based on NumPy arrays, with additional functionality.
+ It makes it easy to read and write files in CSV and Excel formats, as well as SQL databases.
+ It allows data to be accessed through indexes or names for rows and columns.
+ It offers methods to reorder, split, and combine datasets.
+ It supports working with time series.
+ It performs all these operations very efficiently.
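
Several of the features listed above can be sketched in a few lines. This is a minimal illustration, not part of the original notebook: the CSV text, the column names (`city`, `temp`) and the variable names are made up for the example, and the data is kept in memory so no file on disk is needed.

```python
import io

import pandas as pd

# Hypothetical CSV content, held in memory instead of a real file
csv_text = "city,temp\nMadrid,31.5\nOslo,18.0\nLima,22.3\n"

# Read CSV data into a DataFrame
df_cities = pd.read_csv(io.StringIO(csv_text))

# Reorder the rows by a column, then access a value by position and column name
hottest = df_cities.sort_values("temp", ascending=False)
first_city = hottest.iloc[0]["city"]

# Write the DataFrame back to CSV text (without the row index)
out_text = df_cities.to_csv(index=False)

# Time series: a Series indexed by consecutive days
dates = pd.date_range("2024-01-01", periods=3, freq="D")
ts = pd.Series([1.0, 2.0, 3.0], index=dates)
second_day = ts.loc[pd.Timestamp("2024-01-02")]
```

Reading from `io.StringIO` is only a convenience here; `pd.read_csv` accepts a file path in the same position.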


**Pandas data types**

Pandas provides two main data structures:

+ Series: a one-dimensional structure.
+ DataFrame: a two-dimensional structure (tables).

These structures are built on top of NumPy arrays and add new functionality.
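
As a rough sketch of that relationship, the snippet below builds both structures from one NumPy array and converts them back with `to_numpy()`. The values, labels and column names are illustrative assumptions, not taken from the notebook.

```python
import numpy as np
import pandas as pd

# Illustrative values only
values = np.array([1.5, 2.5, 4.0])

# One-dimensional structure: a Series built from the array, with labels
s = pd.Series(values, index=["a", "b", "c"])

# Two-dimensional structure: a DataFrame whose columns come from arrays
df_np = pd.DataFrame({"x": values, "y": values * 2})

# Both can be converted back to plain NumPy arrays
s_arr = s.to_numpy()
df_arr = df_np.to_numpy()
```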

In [1]:
%pip install pandas

Note: you may need to restart the kernel to use updated packages.


In [2]:
import pandas as pd

In [3]:
import numpy as np

In [4]:
import warnings
warnings.filterwarnings('ignore')

### Series

A Series is a structure similar to a one-dimensional array. It is homogeneous, meaning all its elements must be of the same type, and its size is immutable, meaning it cannot be changed, although its contents can.

It has an index that associates a name with each element of the Series, and that name is used to access the element.
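
The role of the index can be sketched as follows. The labels and values are hypothetical and only serve to show access by label (`[]` or `.loc`) versus access by position (`.iloc`).

```python
import pandas as pd

# Hypothetical labelled data
prices = pd.Series([1.2, 0.8, 2.5], index=["apple", "pear", "kiwi"])

by_label = prices["pear"]      # access through the index label
by_loc = prices.loc["pear"]    # explicit label-based access
by_position = prices.iloc[1]   # access by integer position

# The contents can be modified in place through the label
prices.loc["pear"] = 0.9
```

All elements share a single dtype, here `float64`, which is what `prices.dtype` reports.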

In [5]:
lst=[(3.4 + i)**2 for i in range(10)]   # list

lst

[11.559999999999999,
 19.360000000000003,
 29.160000000000004,
 40.96000000000001,
 54.760000000000005,
 70.56,
 88.36000000000001,
 108.16000000000001,
 129.96,
 153.76000000000002]

In [6]:
serie=pd.Series(lst)

serie

0     11.56
1     19.36
2     29.16
3     40.96
4     54.76
5     70.56
6     88.36
7    108.16
8    129.96
9    153.76
dtype: float64

In [7]:
serie.head()   # first 5 by default

0    11.56
1    19.36
2    29.16
3    40.96
4    54.76
dtype: float64

In [8]:
serie.tail()   # last 5 by default

5     70.56
6     88.36
7    108.16
8    129.96
9    153.76
dtype: float64

In [9]:
serie.tail(7)

3     40.96
4     54.76
5     70.56
6     88.36
7    108.16
8    129.96
9    153.76
dtype: float64

In [10]:
type(serie)

pandas.core.series.Series

In [11]:
serie.index

RangeIndex(start=0, stop=10, step=1)

In [12]:
serie.index=['a', '0', 'r', 'tt', 'qw', 'tr', 'b', 'c', 'd', 'e']

serie

a      11.56
0      19.36
r      29.16
tt     40.96
qw     54.76
tr     70.56
b      88.36
c     108.16
d     129.96
e     153.76
dtype: float64

In [13]:
serie['r']

29.160000000000004

### DataFrame

A DataFrame object defines a dataset structured as a table in which each column is a Series object, so all the data in a given column has the same type, while the rows are records that can contain data of different types.

A DataFrame has two indexes, one for the rows and one for the columns, and its elements can be accessed through the row and column names.
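
A small sketch of the two indexes, using made-up row labels (`r1`, `r2`) and column names (`name`, `age`) that do not appear elsewhere in this notebook:

```python
import pandas as pd

# Hypothetical table with explicit row labels
df_demo = pd.DataFrame(
    {"name": ["Ana", "Luis"], "age": [31, 45]},
    index=["r1", "r2"],
)

col = df_demo["age"]               # a column is a Series with a single dtype
cell = df_demo.loc["r2", "name"]   # access by row label and column name
row = df_demo.loc["r1"]            # a row can mix types (text and integer here)

row_labels = list(df_demo.index)
col_labels = list(df_demo.columns)
```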

In [18]:
columnas=['col1', 'col2', 'col 3', 'col4', 'col5']

array=np.random.random((10, 5))

array

array([[0.20025972, 0.20148142, 0.49378962, 0.7140157 , 0.73143581],
       [0.34827032, 0.97782122, 0.55640765, 0.54732955, 0.59788223],
       [0.10930748, 0.87928854, 0.45474314, 0.78924598, 0.57568456],
       [0.00359591, 0.82663527, 0.50460742, 0.32460935, 0.00894501],
       [0.19802999, 0.83747182, 0.12093094, 0.7177272 , 0.05833654],
       [0.78850876, 0.58147379, 0.14517897, 0.43374   , 0.4185503 ],
       [0.13573374, 0.54654626, 0.6110027 , 0.09197064, 0.7772164 ],
       [0.56707947, 0.42088243, 0.41102323, 0.38130984, 0.98922562],
       [0.37875778, 0.86360851, 0.69933687, 0.96070572, 0.99476563],
       [0.88172836, 0.17535158, 0.19916798, 0.12075066, 0.60464521]])

In [19]:
df=pd.DataFrame(array, columns=columnas)

df

Unnamed: 0,col1,col2,col 3,col4,col5
0,0.20026,0.201481,0.49379,0.714016,0.731436
1,0.34827,0.977821,0.556408,0.54733,0.597882
2,0.109307,0.879289,0.454743,0.789246,0.575685
3,0.003596,0.826635,0.504607,0.324609,0.008945
4,0.19803,0.837472,0.120931,0.717727,0.058337
5,0.788509,0.581474,0.145179,0.43374,0.41855
6,0.135734,0.546546,0.611003,0.091971,0.777216
7,0.567079,0.420882,0.411023,0.38131,0.989226
8,0.378758,0.863609,0.699337,0.960706,0.994766
9,0.881728,0.175352,0.199168,0.120751,0.604645


In [20]:
df['col 3']

0    0.493790
1    0.556408
2    0.454743
3    0.504607
4    0.120931
5    0.145179
6    0.611003
7    0.411023
8    0.699337
9    0.199168
Name: col 3, dtype: float64

In [22]:
df.col2

0    0.201481
1    0.977821
2    0.879289
3    0.826635
4    0.837472
5    0.581474
6    0.546546
7    0.420882
8    0.863609
9    0.175352
Name: col2, dtype: float64

In [26]:
df.columns=['col1', 'col2', 'col3', 'col4', 'col5']

df

Unnamed: 0,col1,col2,col3,col4,col5
0,0.20026,0.201481,0.49379,0.714016,0.731436
1,0.34827,0.977821,0.556408,0.54733,0.597882
2,0.109307,0.879289,0.454743,0.789246,0.575685
3,0.003596,0.826635,0.504607,0.324609,0.008945
4,0.19803,0.837472,0.120931,0.717727,0.058337
5,0.788509,0.581474,0.145179,0.43374,0.41855
6,0.135734,0.546546,0.611003,0.091971,0.777216
7,0.567079,0.420882,0.411023,0.38131,0.989226
8,0.378758,0.863609,0.699337,0.960706,0.994766
9,0.881728,0.175352,0.199168,0.120751,0.604645


In [27]:
df.rename(columns={'col3': 'columna'}, inplace=True)

In [28]:
df

Unnamed: 0,col1,col2,columna,col4,col5
0,0.20026,0.201481,0.49379,0.714016,0.731436
1,0.34827,0.977821,0.556408,0.54733,0.597882
2,0.109307,0.879289,0.454743,0.789246,0.575685
3,0.003596,0.826635,0.504607,0.324609,0.008945
4,0.19803,0.837472,0.120931,0.717727,0.058337
5,0.788509,0.581474,0.145179,0.43374,0.41855
6,0.135734,0.546546,0.611003,0.091971,0.777216
7,0.567079,0.420882,0.411023,0.38131,0.989226
8,0.378758,0.863609,0.699337,0.960706,0.994766
9,0.881728,0.175352,0.199168,0.120751,0.604645


In [30]:
cols=['col2', 'columna', 'col4']

df[cols]

Unnamed: 0,col2,columna,col4
0,0.201481,0.49379,0.714016
1,0.977821,0.556408,0.54733
2,0.879289,0.454743,0.789246
3,0.826635,0.504607,0.324609
4,0.837472,0.120931,0.717727
5,0.581474,0.145179,0.43374
6,0.546546,0.611003,0.091971
7,0.420882,0.411023,0.38131
8,0.863609,0.699337,0.960706
9,0.175352,0.199168,0.120751


In [31]:
df[['col2', 'columna', 'col4']]

Unnamed: 0,col2,columna,col4
0,0.201481,0.49379,0.714016
1,0.977821,0.556408,0.54733
2,0.879289,0.454743,0.789246
3,0.826635,0.504607,0.324609
4,0.837472,0.120931,0.717727
5,0.581474,0.145179,0.43374
6,0.546546,0.611003,0.091971
7,0.420882,0.411023,0.38131
8,0.863609,0.699337,0.960706
9,0.175352,0.199168,0.120751


In [33]:
df['ceros']=0.

In [34]:
df

Unnamed: 0,col1,col2,columna,col4,col5,ceros
0,0.20026,0.201481,0.49379,0.714016,0.731436,0.0
1,0.34827,0.977821,0.556408,0.54733,0.597882,0.0
2,0.109307,0.879289,0.454743,0.789246,0.575685,0.0
3,0.003596,0.826635,0.504607,0.324609,0.008945,0.0
4,0.19803,0.837472,0.120931,0.717727,0.058337,0.0
5,0.788509,0.581474,0.145179,0.43374,0.41855,0.0
6,0.135734,0.546546,0.611003,0.091971,0.777216,0.0
7,0.567079,0.420882,0.411023,0.38131,0.989226,0.0
8,0.378758,0.863609,0.699337,0.960706,0.994766,0.0
9,0.881728,0.175352,0.199168,0.120751,0.604645,0.0


In [36]:
df['Nulos']=np.nan

df

Unnamed: 0,col1,col2,columna,col4,col5,ceros,Nulos
0,0.20026,0.201481,0.49379,0.714016,0.731436,0.0,
1,0.34827,0.977821,0.556408,0.54733,0.597882,0.0,
2,0.109307,0.879289,0.454743,0.789246,0.575685,0.0,
3,0.003596,0.826635,0.504607,0.324609,0.008945,0.0,
4,0.19803,0.837472,0.120931,0.717727,0.058337,0.0,
5,0.788509,0.581474,0.145179,0.43374,0.41855,0.0,
6,0.135734,0.546546,0.611003,0.091971,0.777216,0.0,
7,0.567079,0.420882,0.411023,0.38131,0.989226,0.0,
8,0.378758,0.863609,0.699337,0.960706,0.994766,0.0,
9,0.881728,0.175352,0.199168,0.120751,0.604645,0.0,


In [38]:
df['col10']=df.col1 * df.col2 / df.col4

df

Unnamed: 0,col1,col2,columna,col4,col5,ceros,Nulos,col10
0,0.20026,0.201481,0.49379,0.714016,0.731436,0.0,,0.056509
1,0.34827,0.977821,0.556408,0.54733,0.597882,0.0,,0.622196
2,0.109307,0.879289,0.454743,0.789246,0.575685,0.0,,0.121778
3,0.003596,0.826635,0.504607,0.324609,0.008945,0.0,,0.009157
4,0.19803,0.837472,0.120931,0.717727,0.058337,0.0,,0.231069
5,0.788509,0.581474,0.145179,0.43374,0.41855,0.0,,1.057078
6,0.135734,0.546546,0.611003,0.091971,0.777216,0.0,,0.806614
7,0.567079,0.420882,0.411023,0.38131,0.989226,0.0,,0.625931
8,0.378758,0.863609,0.699337,0.960706,0.994766,0.0,,0.340477
9,0.881728,0.175352,0.199168,0.120751,0.604645,0.0,,1.280427


In [40]:
df['col10'][5]

1.0570783795114187

In [41]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 8 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   col1     10 non-null     float64
 1   col2     10 non-null     float64
 2   columna  10 non-null     float64
 3   col4     10 non-null     float64
 4   col5     10 non-null     float64
 5   ceros    10 non-null     float64
 6   Nulos    0 non-null      float64
 7   col10    10 non-null     float64
dtypes: float64(8)
memory usage: 768.0 bytes


In [44]:
df.fillna('hola', inplace=True)

In [45]:
df

Unnamed: 0,col1,col2,columna,col4,col5,ceros,Nulos,col10
0,0.20026,0.201481,0.49379,0.714016,0.731436,0.0,hola,0.056509
1,0.34827,0.977821,0.556408,0.54733,0.597882,0.0,hola,0.622196
2,0.109307,0.879289,0.454743,0.789246,0.575685,0.0,hola,0.121778
3,0.003596,0.826635,0.504607,0.324609,0.008945,0.0,hola,0.009157
4,0.19803,0.837472,0.120931,0.717727,0.058337,0.0,hola,0.231069
5,0.788509,0.581474,0.145179,0.43374,0.41855,0.0,hola,1.057078
6,0.135734,0.546546,0.611003,0.091971,0.777216,0.0,hola,0.806614
7,0.567079,0.420882,0.411023,0.38131,0.989226,0.0,hola,0.625931
8,0.378758,0.863609,0.699337,0.960706,0.994766,0.0,hola,0.340477
9,0.881728,0.175352,0.199168,0.120751,0.604645,0.0,hola,1.280427


In [46]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 8 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   col1     10 non-null     float64
 1   col2     10 non-null     float64
 2   columna  10 non-null     float64
 3   col4     10 non-null     float64
 4   col5     10 non-null     float64
 5   ceros    10 non-null     float64
 6   Nulos    10 non-null     object 
 7   col10    10 non-null     float64
dtypes: float64(7), object(1)
memory usage: 768.0+ bytes


In [47]:
# build a DataFrame from a list of lists

lst_lst=[[687261, 'hola', 4728364], 
         [83546, 'adios', 58943], 
         [321, 'oo^oo']]


columnas=['num', 'palabra', 'otro_num']

In [48]:
df_lst=pd.DataFrame(lst_lst, columns=columnas)

df_lst

Unnamed: 0,num,palabra,otro_num
0,687261,hola,4728364.0
1,83546,adios,58943.0
2,321,oo^oo,


In [49]:
df_lst.fillna(0., inplace=True)

df_lst

Unnamed: 0,num,palabra,otro_num
0,687261,hola,4728364.0
1,83546,adios,58943.0
2,321,oo^oo,0.0


In [50]:
# from a dictionary

dictio={'casa': lst_lst[0],
        'oficina': lst_lst[1],
        'numero': lst_lst[2]+[0]}

dictio

{'casa': [687261, 'hola', 4728364],
 'oficina': [83546, 'adios', 58943],
 'numero': [321, 'oo^oo', 0]}

In [53]:
df_dictio=pd.DataFrame(dictio)

df_dictio['n_col']=12

df_dictio

Unnamed: 0,casa,oficina,numero,n_col
0,687261,83546,321,12
1,hola,adios,oo^oo,12
2,4728364,58943,0,12


In [54]:
df_dictio.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   casa     3 non-null      object
 1   oficina  3 non-null      object
 2   numero   3 non-null      object
 3   n_col    3 non-null      int64 
dtypes: int64(1), object(3)
memory usage: 224.0+ bytes


In [59]:
df.drop('Nulos', axis=1, inplace=True)

In [60]:
df

Unnamed: 0,col1,col2,columna,col4,col5,ceros,col10
0,0.20026,0.201481,0.49379,0.714016,0.731436,0.0,0.056509
1,0.34827,0.977821,0.556408,0.54733,0.597882,0.0,0.622196
2,0.109307,0.879289,0.454743,0.789246,0.575685,0.0,0.121778
3,0.003596,0.826635,0.504607,0.324609,0.008945,0.0,0.009157
4,0.19803,0.837472,0.120931,0.717727,0.058337,0.0,0.231069
5,0.788509,0.581474,0.145179,0.43374,0.41855,0.0,1.057078
6,0.135734,0.546546,0.611003,0.091971,0.777216,0.0,0.806614
7,0.567079,0.420882,0.411023,0.38131,0.989226,0.0,0.625931
8,0.378758,0.863609,0.699337,0.960706,0.994766,0.0,0.340477
9,0.881728,0.175352,0.199168,0.120751,0.604645,0.0,1.280427


In [61]:
df_dictio.drop(columns=['oficina', 'numero'], inplace=True)

df_dictio

Unnamed: 0,casa,n_col
0,687261,12
1,hola,12
2,4728364,12


### Operations


In [62]:
df

Unnamed: 0,col1,col2,columna,col4,col5,ceros,col10
0,0.20026,0.201481,0.49379,0.714016,0.731436,0.0,0.056509
1,0.34827,0.977821,0.556408,0.54733,0.597882,0.0,0.622196
2,0.109307,0.879289,0.454743,0.789246,0.575685,0.0,0.121778
3,0.003596,0.826635,0.504607,0.324609,0.008945,0.0,0.009157
4,0.19803,0.837472,0.120931,0.717727,0.058337,0.0,0.231069
5,0.788509,0.581474,0.145179,0.43374,0.41855,0.0,1.057078
6,0.135734,0.546546,0.611003,0.091971,0.777216,0.0,0.806614
7,0.567079,0.420882,0.411023,0.38131,0.989226,0.0,0.625931
8,0.378758,0.863609,0.699337,0.960706,0.994766,0.0,0.340477
9,0.881728,0.175352,0.199168,0.120751,0.604645,0.0,1.280427


In [63]:
df.transpose()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
col1,0.20026,0.34827,0.109307,0.003596,0.19803,0.788509,0.135734,0.567079,0.378758,0.881728
col2,0.201481,0.977821,0.879289,0.826635,0.837472,0.581474,0.546546,0.420882,0.863609,0.175352
columna,0.49379,0.556408,0.454743,0.504607,0.120931,0.145179,0.611003,0.411023,0.699337,0.199168
col4,0.714016,0.54733,0.789246,0.324609,0.717727,0.43374,0.091971,0.38131,0.960706,0.120751
col5,0.731436,0.597882,0.575685,0.008945,0.058337,0.41855,0.777216,0.989226,0.994766,0.604645
ceros,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
col10,0.056509,0.622196,0.121778,0.009157,0.231069,1.057078,0.806614,0.625931,0.340477,1.280427


In [64]:
df.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
col1,0.20026,0.34827,0.109307,0.003596,0.19803,0.788509,0.135734,0.567079,0.378758,0.881728
col2,0.201481,0.977821,0.879289,0.826635,0.837472,0.581474,0.546546,0.420882,0.863609,0.175352
columna,0.49379,0.556408,0.454743,0.504607,0.120931,0.145179,0.611003,0.411023,0.699337,0.199168
col4,0.714016,0.54733,0.789246,0.324609,0.717727,0.43374,0.091971,0.38131,0.960706,0.120751
col5,0.731436,0.597882,0.575685,0.008945,0.058337,0.41855,0.777216,0.989226,0.994766,0.604645
ceros,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
col10,0.056509,0.622196,0.121778,0.009157,0.231069,1.057078,0.806614,0.625931,0.340477,1.280427


In [66]:
df.T.index

Index(['col1', 'col2', 'columna', 'col4', 'col5', 'ceros', 'col10'], dtype='object')

In [67]:
df.columns

Index(['col1', 'col2', 'columna', 'col4', 'col5', 'ceros', 'col10'], dtype='object')

In [71]:
df.sum()

col1       3.611272
col2       6.310561
columna    4.196189
col4       5.081405
col5       5.756687
ceros      0.000000
col10      5.151237
dtype: float64

In [72]:
df.sum(axis=1)

0    2.397492
1    3.649907
2    2.930048
3    1.677550
4    2.163566
5    3.424530
6    2.969083
7    3.395452
8    4.237652
9    3.262071
dtype: float64

In [73]:
df.std()

col1       0.296479
col2       0.291124
columna    0.200206
col4       0.288724
col5       0.337702
ceros      0.000000
col10      0.437207
dtype: float64

In [74]:
df.var()

col1       0.087900
col2       0.084753
columna    0.040083
col4       0.083362
col5       0.114043
ceros      0.000000
col10      0.191150
dtype: float64

In [75]:
df.mean()

col1       0.361127
col2       0.631056
columna    0.419619
col4       0.508140
col5       0.575669
ceros      0.000000
col10      0.515124
dtype: float64

In [76]:
df.max()

col1       0.881728
col2       0.977821
columna    0.699337
col4       0.960706
col5       0.994766
ceros      0.000000
col10      1.280427
dtype: float64

In [79]:
df.median() 

col1       0.274265
col2       0.704055
columna    0.474266
col4       0.490535
col5       0.601264
ceros      0.000000
col10      0.481336
dtype: float64

In [83]:
df.mean() / len(df)

col1       0.036113
col2       0.063106
columna    0.041962
col4       0.050814
col5       0.057567
ceros      0.000000
col10      0.051512
dtype: float64

### Importing files

+ CSV
+ XLSX
+ XLS
+ JSON
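
Before loading the real data files, here is a self-contained sketch of the CSV and JSON readers working on in-memory text. The content and column names are invented for illustration; the Excel formats are left out because `pd.read_excel` needs an actual workbook and an extra engine package.

```python
import io

import pandas as pd

# Hypothetical CSV text and JSON Lines text (one JSON record per line)
csv_text = "id,name\n1,alpha\n2,beta\n"
jsonl_text = '{"id": 1, "name": "alpha"}\n{"id": 2, "name": "beta"}\n'

df_from_csv = pd.read_csv(io.StringIO(csv_text))

# lines=True tells read_json to parse one JSON object per line
df_from_json = pd.read_json(io.StringIO(jsonl_text), lines=True)
```

Both calls produce a DataFrame with the same two columns, so the choice of format mostly affects how the file is parsed, not how the data is used afterwards.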

In [87]:
pd.set_option('display.max_columns', None)  # show all columns
pd.set_option('display.max_rows', None)     # show all rows

In [90]:
df_csv=pd.read_csv('../data/vehicles_messy.csv')

df_csv.head(20)

Unnamed: 0,barrels08,barrelsA08,charge120,charge240,city08,city08U,cityA08,cityA08U,cityCD,cityE,cityUF,co2,co2A,co2TailpipeAGpm,co2TailpipeGpm,comb08,comb08U,combA08,combA08U,combE,combinedCD,combinedUF,cylinders,displ,drive,engId,eng_dscr,feScore,fuelCost08,fuelCostA08,fuelType,fuelType1,ghgScore,ghgScoreA,highway08,highway08U,highwayA08,highwayA08U,highwayCD,highwayE,highwayUF,hlv,hpv,id,lv2,lv4,make,model,mpgData,phevBlended,pv2,pv4,range,rangeCity,rangeCityA,rangeHwy,rangeHwyA,trany,UCity,UCityA,UHighway,UHighwayA,VClass,year,youSaveSpend,guzzler,trans_dscr,tCharger,sCharger,atvType,fuelType2,rangeA,evMotor,mfrCode,c240Dscr,charge240b,c240bDscr,createdOn,modifiedOn,startStop,phevCity,phevHwy,phevComb
0,15.695714,0.0,0.0,0.0,19,0.0,0,0.0,0.0,0.0,0.0,-1,-1,0.0,423.190476,21,0.0,0,0.0,0.0,0.0,0.0,4.0,2.0,Rear-Wheel Drive,9011,(FFS),-1,1600,0,Regular,Regular Gasoline,-1,-1,25,0.0,0,0.0,0.0,0.0,0.0,0,0,1,0,0,Alfa Romeo,Spider Veloce 2000,Y,False,0,0,0,0.0,0.0,0.0,0.0,Manual 5-spd,23.3333,0.0,35.0,0.0,Two Seaters,1985,-1250,,,,,,,,,,,0.0,,Tue Jan 01 00:00:00 EST 2013,Tue Jan 01 00:00:00 EST 2013,,0,0,0
1,29.964545,0.0,0.0,0.0,9,0.0,0,0.0,0.0,0.0,0.0,-1,-1,0.0,807.909091,11,0.0,0,0.0,0.0,0.0,0.0,12.0,4.9,Rear-Wheel Drive,22020,(GUZZLER),-1,3050,0,Regular,Regular Gasoline,-1,-1,14,0.0,0,0.0,0.0,0.0,0.0,0,0,10,0,0,Ferrari,Testarossa,N,False,0,0,0,0.0,0.0,0.0,0.0,Manual 5-spd,11.0,0.0,19.0,0.0,Two Seaters,1985,-8500,T,,,,,,,,,,0.0,,Tue Jan 01 00:00:00 EST 2013,Tue Jan 01 00:00:00 EST 2013,,0,0,0
2,12.207778,0.0,0.0,0.0,23,0.0,0,0.0,0.0,0.0,0.0,-1,-1,0.0,329.148148,27,0.0,0,0.0,0.0,0.0,0.0,4.0,2.2,Front-Wheel Drive,2100,(FFS),-1,1250,0,Regular,Regular Gasoline,-1,-1,33,0.0,0,0.0,0.0,0.0,0.0,19,77,100,0,0,Dodge,Charger,Y,False,0,0,0,0.0,0.0,0.0,0.0,Manual 5-spd,29.0,0.0,47.0,0.0,Subcompact Cars,1985,500,,SIL,,,,,,,,,0.0,,Tue Jan 01 00:00:00 EST 2013,Tue Jan 01 00:00:00 EST 2013,,0,0,0
3,29.964545,0.0,0.0,0.0,10,0.0,0,0.0,0.0,0.0,0.0,-1,-1,0.0,807.909091,11,0.0,0,0.0,0.0,0.0,0.0,8.0,5.2,Rear-Wheel Drive,2850,,-1,3050,0,Regular,Regular Gasoline,-1,-1,12,0.0,0,0.0,0.0,0.0,0.0,0,0,1000,0,0,Dodge,B150/B250 Wagon 2WD,N,False,0,0,0,0.0,0.0,0.0,0.0,Automatic 3-spd,12.2222,0.0,16.6667,0.0,Vans,1985,-8500,,,,,,,,,,,0.0,,Tue Jan 01 00:00:00 EST 2013,Tue Jan 01 00:00:00 EST 2013,,0,0,0
4,17.347895,0.0,0.0,0.0,17,0.0,0,0.0,0.0,0.0,0.0,-1,-1,0.0,467.736842,19,0.0,0,0.0,0.0,0.0,0.0,4.0,2.2,4-Wheel or All-Wheel Drive,66031,"(FFS,TRBO)",-1,2150,0,Premium,Premium Gasoline,-1,-1,23,0.0,0,0.0,0.0,0.0,0.0,0,0,10000,0,14,Subaru,Legacy AWD Turbo,N,False,0,90,0,0.0,0.0,0.0,0.0,Manual 5-spd,21.0,0.0,32.0,0.0,Compact Cars,1993,-4000,,,T,,,,,,,,0.0,,Tue Jan 01 00:00:00 EST 2013,Tue Jan 01 00:00:00 EST 2013,,0,0,0
5,14.982273,0.0,0.0,0.0,21,0.0,0,0.0,0.0,0.0,0.0,-1,-1,0.0,403.954545,22,0.0,0,0.0,0.0,0.0,0.0,4.0,1.8,Front-Wheel Drive,66020,(FFS),-1,1500,0,Regular,Regular Gasoline,-1,-1,24,0.0,0,0.0,0.0,0.0,0.0,0,0,10001,0,15,Subaru,Loyale,N,False,0,88,0,0.0,0.0,0.0,0.0,Automatic 3-spd,27.0,0.0,33.0,0.0,Compact Cars,1993,-750,,,,,,,,,,,0.0,,Tue Jan 01 00:00:00 EST 2013,Tue Jan 01 00:00:00 EST 2013,,0,0,0
6,13.1844,0.0,0.0,0.0,22,0.0,0,0.0,0.0,0.0,0.0,-1,-1,0.0,355.48,25,0.0,0,0.0,0.0,0.0,0.0,4.0,1.8,Front-Wheel Drive,66020,(FFS),-1,1350,0,Regular,Regular Gasoline,-1,-1,29,0.0,0,0.0,0.0,0.0,0.0,0,0,10002,0,15,Subaru,Loyale,Y,False,0,88,0,0.0,0.0,0.0,0.0,Manual 5-spd,28.0,0.0,41.0,0.0,Compact Cars,1993,0,,,,,,,,,,,0.0,,Tue Jan 01 00:00:00 EST 2013,Tue Jan 01 00:00:00 EST 2013,,0,0,0
7,13.73375,0.0,0.0,0.0,23,0.0,0,0.0,0.0,0.0,0.0,-1,-1,0.0,370.291667,24,0.0,0,0.0,0.0,0.0,0.0,4.0,1.6,Front-Wheel Drive,57005,(FFS),-1,1400,0,Regular,Regular Gasoline,-1,-1,26,0.0,0,0.0,0.0,0.0,0.0,0,0,10003,0,13,Toyota,Corolla,Y,False,0,89,0,0.0,0.0,0.0,0.0,Automatic 3-spd,29.0,0.0,37.0,0.0,Compact Cars,1993,-250,,,,,,,,,,,0.0,,Tue Jan 01 00:00:00 EST 2013,Tue Jan 01 00:00:00 EST 2013,,0,0,0
8,12.677308,0.0,0.0,0.0,23,0.0,0,0.0,0.0,0.0,0.0,-1,-1,0.0,341.807692,26,0.0,0,0.0,0.0,0.0,0.0,4.0,1.6,Front-Wheel Drive,57005,(FFS),-1,1300,0,Regular,Regular Gasoline,-1,-1,31,0.0,0,0.0,0.0,0.0,0.0,0,0,10004,0,13,Toyota,Corolla,Y,False,0,89,0,0.0,0.0,0.0,0.0,Manual 5-spd,30.0,0.0,43.0,0.0,Compact Cars,1993,250,,,,,,,,,,,0.0,,Tue Jan 01 00:00:00 EST 2013,Tue Jan 01 00:00:00 EST 2013,,0,0,0
9,13.1844,0.0,0.0,0.0,23,0.0,0,0.0,0.0,0.0,0.0,-1,-1,0.0,355.48,25,0.0,0,0.0,0.0,0.0,0.0,4.0,1.8,Front-Wheel Drive,57006,(FFS),-1,1350,0,Regular,Regular Gasoline,-1,-1,30,0.0,0,0.0,0.0,0.0,0.0,0,0,10005,0,13,Toyota,Corolla,Y,False,0,89,0,0.0,0.0,0.0,0.0,Automatic 4-spd,29.0,0.0,42.0,0.0,Compact Cars,1993,0,,,,,,,,,,,0.0,,Tue Jan 01 00:00:00 EST 2013,Tue Jan 01 00:00:00 EST 2013,,0,0,0


In [91]:
len(df_csv.columns)

83

In [92]:
df_csv.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37843 entries, 0 to 37842
Data columns (total 83 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   barrels08        37843 non-null  float64
 1   barrelsA08       37843 non-null  float64
 2   charge120        37843 non-null  float64
 3   charge240        37843 non-null  float64
 4   city08           37843 non-null  int64  
 5   city08U          37843 non-null  float64
 6   cityA08          37843 non-null  int64  
 7   cityA08U         37843 non-null  float64
 8   cityCD           37843 non-null  float64
 9   cityE            37843 non-null  float64
 10  cityUF           37843 non-null  float64
 11  co2              37843 non-null  int64  
 12  co2A             37843 non-null  int64  
 13  co2TailpipeAGpm  37843 non-null  float64
 14  co2TailpipeGpm   37843 non-null  float64
 15  comb08           37843 non-null  int64  
 16  comb08U          37843 non-null  float64
 17  combA08     

In [93]:
%pip install openpyxl
%pip install xlrd



In [95]:
# xlsx


df_xlsx=pd.read_excel('../data/Online Retail.xlsx')

df_xlsx.head()

Unnamed: 0,InvoiceNo,InvoiceDate,StockCode,Description,Quantity,UnitPrice,Revenue,CustomerID,Country
0,536365,2010-12-01 08:26:00,85123A,CREAM HANGING HEART T-LIGHT HOLDER,6,2.55,15.3,17850,United Kingdom
1,536373,2010-12-01 09:02:00,85123A,CREAM HANGING HEART T-LIGHT HOLDER,6,2.55,15.3,17850,United Kingdom
2,536375,2010-12-01 09:32:00,85123A,CREAM HANGING HEART T-LIGHT HOLDER,6,2.55,15.3,17850,United Kingdom
3,536390,2010-12-01 10:19:00,85123A,CREAM HANGING HEART T-LIGHT HOLDER,64,2.55,163.2,17511,United Kingdom
4,536394,2010-12-01 10:39:00,85123A,CREAM HANGING HEART T-LIGHT HOLDER,32,2.55,81.6,13408,United Kingdom


In [96]:
df_xlsx.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 396034 entries, 0 to 396033
Data columns (total 9 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   InvoiceNo    396034 non-null  int64         
 1   InvoiceDate  396034 non-null  datetime64[ns]
 2   StockCode    396034 non-null  object        
 3   Description  396034 non-null  object        
 4   Quantity     396034 non-null  int64         
 5   UnitPrice    396034 non-null  float64       
 6   Revenue      396034 non-null  float64       
 7   CustomerID   396034 non-null  int64         
 8   Country      396034 non-null  object        
dtypes: datetime64[ns](1), float64(2), int64(3), object(3)
memory usage: 27.2+ MB


In [97]:
# xls


df_xls=pd.read_excel('../data/Sensor Data.xls')

df_xls.head()

Unnamed: 0,Input 1,Input 2,Input 3,Input 4,Input 5,Input 6,Input 7,Input 8,Input 9,Input 10,Input 11,Input 12,output1,output2,class
0,1.473,2.311,3.179,2.666,0.2795,0.2771,0.2234,0.1855,0.2539,1.138,1.111,4.712,1,1,one
1,1.46,2.377,3.214,2.92,0.2527,0.3064,0.02563,0.1965,0.3027,1.213,1.027,5.463,1,1,one
2,1.552,2.164,3.064,2.745,0.282,0.21,0.1721,0.1929,0.21,1.221,1.058,5.332,1,1,one
3,1.605,2.228,3.149,2.834,0.2917,0.3613,0.2087,0.1294,0.2734,1.144,1.062,4.829,1,1,one
4,1.534,2.114,3.309,2.976,0.21,0.2502,0.2258,0.177,0.2039,1.254,1.112,5.734,1,1,one


In [98]:
df_xls=pd.read_excel('../data/Sensor Data.xls', 'Sheet1')

df_xls.head()

Unnamed: 0,Input 1,Input 2,Input 3,Input 4,Input 5,Input 6,Input 7,Input 8,Input 9,Input 10,Input 11,Input 12,output1,output2,class
0,1.473,2.311,3.179,2.666,0.2795,0.2771,0.2234,0.1855,0.2539,1.138,1.111,4.712,1,1,one
1,1.46,2.377,3.214,2.92,0.2527,0.3064,0.02563,0.1965,0.3027,1.213,1.027,5.463,1,1,one
2,1.552,2.164,3.064,2.745,0.282,0.21,0.1721,0.1929,0.21,1.221,1.058,5.332,1,1,one
3,1.605,2.228,3.149,2.834,0.2917,0.3613,0.2087,0.1294,0.2734,1.144,1.062,4.829,1,1,one
4,1.534,2.114,3.309,2.976,0.21,0.2502,0.2258,0.177,0.2039,1.254,1.112,5.734,1,1,one


In [99]:
df_xls=pd.read_excel('../data/Sensor Data.xls', 'Sheet2')

df_xls.head()

Unnamed: 0,Sensor Data
0,The data source as well as the exact nature of...
1,Each data instance contains 12 real-valued inp...
2,represents a sensor designed to detect the pre...
3,"of substances. As an alternative, the sensor r..."
4,


In [100]:
df_xls=pd.read_excel('../data/Sensor Data.xls', 'Sheet3')

df_xls.head()

Unnamed: 0,hola,amiguis,estamos,probando,pandas


In [101]:
df_xls=pd.read_excel('../data/Sensor Data.xls', 'Hoja1')

df_xls.head()

Unnamed: 0,32,42,4q34q34
0,23r,4,
1,,42,


In [104]:
df_xls=pd.read_excel('../data/Sensor Data.xls', 3)

df_xls.head()

Unnamed: 0,32,42,4q34q34
0,23r,4,
1,,42,


In [105]:
xl = pd.ExcelFile('../data/Sensor Data.xls')

xl.sheet_names  # see all sheet names

['Sheet1', 'Sheet2', 'Sheet3', 'Hoja1']

In [106]:
# json

df_json=pd.read_json('../data/companies.json', lines=True, orient='records')

df_json.head()

Unnamed: 0,_id,name,permalink,crunchbase_url,homepage_url,blog_url,blog_feed_url,twitter_username,category_code,number_of_employees,founded_year,founded_month,founded_day,deadpooled_year,tag_list,alias_list,email_address,phone_number,description,created_at,updated_at,overview,image,products,relationships,competitions,providerships,total_money_raised,funding_rounds,investments,acquisition,acquisitions,offices,milestones,video_embeds,screenshots,external_links,partners,deadpooled_month,deadpooled_day,deadpooled_url,ipo
0,{'$oid': '52cdef7c4bab8bd675297d8a'},Wetpaint,abc2,http://www.crunchbase.com/company/wetpaint,http://wetpaint-inc.com,http://digitalquarters.net/,http://digitalquarters.net/feed/,BachelrWetpaint,web,47.0,2005.0,10.0,17.0,1.0,"wiki, seattle, elowitz, media-industry, media-...",,info@wetpaint.com,206.859.6300,Technology Platform Company,{'$date': 1180075887000},2013-12-08 07:15:44+00:00,<p>Wetpaint is a technology platform company t...,"{'available_sizes': [[[150, 75], 'assets/image...","[{'name': 'Wikison Wetpaint', 'permalink': 'we...","[{'is_past': False, 'title': 'Co-Founder and V...","[{'competitor': {'name': 'Wikia', 'permalink':...",[],$39.8M,"[{'id': 888, 'round_code': 'a', 'source_url': ...",[],"{'price_amount': 30000000, 'price_currency_cod...",[],"[{'description': '', 'address1': '710 - 2nd Av...","[{'id': 5869, 'description': 'Wetpaint named i...",[],"[{'available_sizes': [[[150, 86], 'assets/imag...",[{'external_url': 'http://www.geekwire.com/201...,[],,,,
1,{'$oid': '52cdef7c4bab8bd675297d8b'},AdventNet,abc3,http://www.crunchbase.com/company/adventnet,http://adventnet.com,,,manageengine,enterprise,600.0,1996.0,,,2.0,,Zoho ManageEngine,pr@adventnet.com,925-924-9500,Server Management Software,{'$date': 1180121062000},2012-10-31 18:26:09+00:00,"<p>AdventNet is now <a href=""/company/zoho-man...","{'available_sizes': [[[150, 55], 'assets/image...",[],"[{'is_past': True, 'title': 'CEO and Co-Founde...",[],"[{'title': 'DHFH', 'is_past': True, 'provider'...",$0,[],[],,[],"[{'description': 'Headquarters', 'address1': '...",[],[],"[{'available_sizes': [[[150, 94], 'assets/imag...",[],[],,,,
2,{'$oid': '52cdef7c4bab8bd675297d8c'},Zoho,abc4,http://www.crunchbase.com/company/zoho,http://zoho.com,http://blogs.zoho.com/,http://blogs.zoho.com/feed,zoho,software,1600.0,2005.0,9.0,15.0,3.0,"zoho, officesuite, spreadsheet, writer, projec...",,info@zohocorp.com,1-888-204-3539,Online Business Apps Suite,Fri May 25 19:30:28 UTC 2007,2013-10-30 00:07:05+00:00,"<p>Zoho offers a suite of Business, Collaborat...","{'available_sizes': [[[150, 55], 'assets/image...","[{'name': 'Zoho Office Suite', 'permalink': 'z...","[{'is_past': False, 'title': 'CEO and Founder'...","[{'competitor': {'name': 'Empressr', 'permalin...",[],$0,[],[],,[],"[{'description': 'Headquarters', 'address1': '...","[{'id': 388, 'description': 'Zoho Reaches 2 Mi...","[{'embed_code': '<object width=""430"" height=""2...",[],[{'external_url': 'http://www.online-tech-tips...,[],,,,
3,{'$oid': '52cdef7c4bab8bd675297d8d'},Digg,digg,http://www.crunchbase.com/company/digg,http://www.digg.com,http://blog.digg.com/,http://blog.digg.com/?feed=rss2,digg,news,60.0,2004.0,10.0,11.0,,"community, social, news, bookmark, digg, techn...",,feedback@digg.com,(415) 436-9638,user driven social content website,Fri May 25 20:03:23 UTC 2007,2013-11-05 21:35:47+00:00,<p>Digg is a user driven social content websit...,"{'available_sizes': [[[150, 150], 'assets/imag...","[{'name': 'Digg', 'permalink': 'digg'}]","[{'is_past': False, 'title': 'CEO', 'person': ...","[{'competitor': {'name': 'Reddit', 'permalink'...","[{'title': 'Public Relations', 'is_past': True...",$45M,"[{'id': 1, 'round_code': 'b', 'source_url': 'h...",[],"{'price_amount': 500000, 'price_currency_code'...","[{'price_amount': None, 'price_currency_code':...","[{'description': None, 'address1': '135 Missis...","[{'id': 9588, 'description': 'Another Digg Exe...","[{'embed_code': '<embed src=""http://blip.tv/pl...","[{'available_sizes': [[[117, 150], 'assets/ima...",[{'external_url': 'http://www.sociableblog.com...,[],,,,
4,{'$oid': '52cdef7c4bab8bd675297d8e'},Facebook,facebook,http://www.crunchbase.com/company/facebook,http://facebook.com,http://blog.facebook.com,http://blog.facebook.com/atom.php,facebook,social,5299.0,2004.0,2.0,1.0,,"facebook, college, students, profiles, network...",,,,Social network,Fri May 25 21:22:15 UTC 2007,2013-11-21 19:40:55+00:00,<p>Facebook is the world&#8217;s largest socia...,"{'available_sizes': [[[150, 61], 'assets/image...","[{'name': 'Facebook Platform', 'permalink': 'f...","[{'is_past': False, 'title': 'Founder and CEO,...","[{'competitor': {'name': 'MySpace', 'permalink...","[{'title': '', 'is_past': False, 'provider': {...",$2.43B,"[{'id': 2, 'round_code': 'angel', 'source_url'...","[{'funding_round': {'round_code': 'seed', 'sou...",,"[{'price_amount': None, 'price_currency_code':...","[{'description': 'Headquarters', 'address1': '...","[{'id': 108, 'description': 'Facebook adds com...",[],"[{'available_sizes': [[[150, 68], 'assets/imag...",[{'external_url': 'http://latimesblogs.latimes...,[],,,,"{'valuation_amount': 104000000000, 'valuation_..."
