# 1.2 - Intro Pandas (Panel Data)

**[Documentation](https://pandas.pydata.org/docs/reference/index.html#api)**

**[Source code](https://github.com/pandas-dev/pandas)**


![pandas](images/pandas.png)


Pandas is a Python library specialized in handling and analyzing structured data.


The main features of this library are:

+ It defines new data structures based on NumPy arrays, with added functionality.
+ It makes it easy to read and write CSV files, Excel files, and SQL databases.
+ It lets you access data through indexes or names for rows and columns.
+ It provides methods to reorder, split, and combine datasets.
+ It supports working with time series.
+ It performs these operations efficiently, largely by relying on vectorized NumPy operations rather than Python loops.
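
As a rough illustration of the reading, writing, and name-based access features listed above, the sketch below round-trips a small table through CSV text held in memory. The `sales` data and its column names are made up for this example.

```python
import io

import pandas as pd

# Hypothetical sample data, used only for illustration.
sales = pd.DataFrame(
    {"product": ["a", "b", "c"], "units": [3, 5, 2], "price": [1.5, 2.0, 4.25]}
)

# Write the table as CSV text in memory instead of a file on disk.
buffer = io.StringIO()
sales.to_csv(buffer, index=False)

# Read the same text back into a new DataFrame.
buffer.seek(0)
restored = pd.read_csv(buffer)

# Access columns by name and combine them element-wise.
restored["total"] = restored["units"] * restored["price"]
print(restored.sort_values("total", ascending=False))
```

Replacing the `io.StringIO` buffer with a file path gives the usual on-disk workflow.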


**Pandas data types**

Pandas provides two main data structures:

+ Series: a one-dimensional structure.
+ DataFrame: a two-dimensional structure (tables).

Both are built on top of NumPy arrays and add new functionality.
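
A minimal sketch of that relationship, using hypothetical names (`demo_series`, `demo_frame`): both structures can be created from a NumPy array and converted back with `to_numpy()`.

```python
import numpy as np
import pandas as pd

arr = np.array([1.0, 2.5, 4.0])

demo_series = pd.Series(arr, name="values")          # one dimension
demo_frame = pd.DataFrame({"x": arr, "y": arr * 2})  # two dimensions

print(demo_series.ndim, demo_frame.ndim)   # number of dimensions of each structure
print(type(demo_series.to_numpy()))        # the data as a NumPy array
print(demo_frame.to_numpy().shape)         # (rows, columns)
```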

In [1]:
%pip install pandas

Note: you may need to restart the kernel to use updated packages.


In [2]:
import pandas as pd

In [3]:
import numpy as np

In [4]:
import warnings
warnings.filterwarnings('ignore')

### Series

A Series is similar to a one-dimensional array. It is homogeneous: all of its elements share a single dtype (when values of different types are mixed, pandas falls back to the generic `object` dtype). A Series is usually treated as fixed-length: its values can be changed in place, whereas adding or removing elements generally produces a new Series.

It has an index that associates a label with each element of the Series, and elements are accessed through that label.
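
The following sketch, with made-up temperature data, shows the difference between accessing an element by its index label and by its position, and that values can be overwritten in place.

```python
import pandas as pd

temps = pd.Series([21.5, 19.0, 23.2], index=["mon", "tue", "wed"])

by_label = temps["tue"]        # access through the index label
by_loc = temps.loc["tue"]      # explicit label-based access
by_position = temps.iloc[1]    # explicit position-based access

temps["tue"] = 18.0            # the value changes, the length does not
print(by_label, by_loc, by_position, temps["tue"])
```

Using `.loc` and `.iloc` makes the intent explicit, which helps when the labels are themselves integers.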

In [5]:
lst = [(3.4 + i) ** 2 for i in range(10)]

lst

[11.559999999999999,
 19.360000000000003,
 29.160000000000004,
 40.96000000000001,
 54.760000000000005,
 70.56,
 88.36000000000001,
 108.16000000000001,
 129.96,
 153.76000000000002]

In [6]:
serie = pd.Series(lst)

serie

0     11.56
1     19.36
2     29.16
3     40.96
4     54.76
5     70.56
6     88.36
7    108.16
8    129.96
9    153.76
dtype: float64

In [7]:
serie.head() # shows the first 5 elements of the Series

0    11.56
1    19.36
2    29.16
3    40.96
4    54.76
dtype: float64

In [8]:
serie.head(2)

0    11.56
1    19.36
dtype: float64

In [9]:
serie.tail() # shows the last 5 elements of the Series

5     70.56
6     88.36
7    108.16
8    129.96
9    153.76
dtype: float64

In [10]:
serie.tail(2)

8    129.96
9    153.76
dtype: float64

In [11]:
serie.index

RangeIndex(start=0, stop=10, step=1)

In [13]:
serie.index = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

serie

a     11.56
b     19.36
c     29.16
d     40.96
e     54.76
f     70.56
g     88.36
h    108.16
i    129.96
j    153.76
dtype: float64

In [14]:
serie['e']

54.760000000000005

In [19]:
pd.Series(['a.', 2, 4.5,])

0     a.
1      2
2    4.5
dtype: object
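
The `object` dtype in the output above is how pandas stores values that have no common numeric type. A small sketch of how the dtype is inferred from the input (the variable names are illustrative only):

```python
import pandas as pd

all_floats = pd.Series([1.0, 2.0, 4.5])
ints_and_floats = pd.Series([1, 2, 4.5])   # ints are upcast to float
mixed = pd.Series(["a.", 2, 4.5])          # no common numeric type

print(all_floats.dtype, ints_and_floats.dtype, mixed.dtype)

# Inside an object Series the original Python objects are preserved.
print([type(v).__name__ for v in mixed])
```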

### DataFrame

A DataFrame object defines a dataset structured as a table in which each column is a Series object, so all the data in a given column share the same type, while the rows are records that can contain data of different types.

A DataFrame has two indexes, one for the rows and one for the columns, and its elements can be accessed through the row and column names.
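
A short sketch of the two indexes, using a hypothetical `grades` table: `.loc` takes a row label and a column label, while `.iloc` takes the equivalent positions.

```python
import pandas as pd

grades = pd.DataFrame(
    {"math": [7.5, 9.0], "physics": [6.0, 8.5]},
    index=["ana", "luis"],   # row labels
)

print(grades.index)      # row index
print(grades.columns)    # column index

cell_by_label = grades.loc["luis", "physics"]   # row label, column label
cell_by_position = grades.iloc[1, 1]            # row position, column position
row = grades.loc["ana"]                         # a whole row, returned as a Series
```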

In [20]:
columnas = ['col1', 'col2', 'col3', 'col4', 'col5']

data = np.random.random((10,5))

data

array([[0.14105378, 0.77363495, 0.36737768, 0.4096323 , 0.6725172 ],
       [0.20290091, 0.46285084, 0.40053348, 0.69730366, 0.45284935],
       [0.82400727, 0.3337272 , 0.54213336, 0.52194031, 0.47414145],
       [0.11772997, 0.83546618, 0.20063629, 0.05842498, 0.73689502],
       [0.38650355, 0.0036334 , 0.68934932, 0.60036556, 0.87265461],
       [0.95353574, 0.7040162 , 0.33994901, 0.74875351, 0.79416392],
       [0.07156641, 0.565189  , 0.54339026, 0.0428382 , 0.79439049],
       [0.59984969, 0.51404524, 0.87543542, 0.93876432, 0.4760651 ],
       [0.34949752, 0.52638328, 0.08896287, 0.84996552, 0.82261722],
       [0.09413659, 0.00774166, 0.70981519, 0.10801707, 0.15955572]])

In [21]:
df = pd.DataFrame(data, columns=columnas)

df.head()

Unnamed: 0,col1,col2,col3,col4,col5
0,0.141054,0.773635,0.367378,0.409632,0.672517
1,0.202901,0.462851,0.400533,0.697304,0.452849
2,0.824007,0.333727,0.542133,0.52194,0.474141
3,0.11773,0.835466,0.200636,0.058425,0.736895
4,0.386504,0.003633,0.689349,0.600366,0.872655


In [23]:
type(df['col2'])

pandas.core.series.Series

In [24]:
type(df)

pandas.core.frame.DataFrame

In [25]:
df.col2

0    0.773635
1    0.462851
2    0.333727
3    0.835466
4    0.003633
5    0.704016
6    0.565189
7    0.514045
8    0.526383
9    0.007742
Name: col2, dtype: float64

In [26]:
df.rename(columns={'col2': 'columna_2'}, inplace=True)

In [27]:
df.head()

Unnamed: 0,col1,columna_2,col3,col4,col5
0,0.141054,0.773635,0.367378,0.409632,0.672517
1,0.202901,0.462851,0.400533,0.697304,0.452849
2,0.824007,0.333727,0.542133,0.52194,0.474141
3,0.11773,0.835466,0.200636,0.058425,0.736895
4,0.386504,0.003633,0.689349,0.600366,0.872655


In [34]:
df[['col1', 'col3', 'col5']] # select several columns at once

Unnamed: 0,col1,col3,col5
0,0.141054,0.367378,0.672517
1,0.202901,0.400533,0.452849
2,0.824007,0.542133,0.474141
3,0.11773,0.200636,0.736895
4,0.386504,0.689349,0.872655
5,0.953536,0.339949,0.794164
6,0.071566,0.54339,0.79439
7,0.59985,0.875435,0.476065
8,0.349498,0.088963,0.822617
9,0.094137,0.709815,0.159556


In [35]:
df['nueva_col'] = df.col1 * df.columna_2 / df.col5

df.head()

Unnamed: 0,col1,columna_2,col3,col4,col5,nueva_col
0,0.141054,0.773635,0.367378,0.409632,0.672517,0.162262
1,0.202901,0.462851,0.400533,0.697304,0.452849,0.207382
2,0.824007,0.333727,0.542133,0.52194,0.474141,0.579982
3,0.11773,0.835466,0.200636,0.058425,0.736895,0.133478
4,0.386504,0.003633,0.689349,0.600366,0.872655,0.001609


In [36]:
df['ceros']= 0.

df.head()

Unnamed: 0,col1,columna_2,col3,col4,col5,nueva_col,ceros
0,0.141054,0.773635,0.367378,0.409632,0.672517,0.162262,0.0
1,0.202901,0.462851,0.400533,0.697304,0.452849,0.207382,0.0
2,0.824007,0.333727,0.542133,0.52194,0.474141,0.579982,0.0
3,0.11773,0.835466,0.200636,0.058425,0.736895,0.133478,0.0
4,0.386504,0.003633,0.689349,0.600366,0.872655,0.001609,0.0


In [37]:
lst = [i*2.3 for i in range(8)]

lst.append(None)
lst.append(None)

lst

[0.0,
 2.3,
 4.6,
 6.8999999999999995,
 9.2,
 11.5,
 13.799999999999999,
 16.099999999999998,
 None,
 None]

In [39]:
df['col_nulos'] = lst

df.tail()

Unnamed: 0,col1,columna_2,col3,col4,col5,nueva_col,ceros,col_nulos
5,0.953536,0.704016,0.339949,0.748754,0.794164,0.845297,0.0,11.5
6,0.071566,0.565189,0.54339,0.042838,0.79439,0.050918,0.0,13.8
7,0.59985,0.514045,0.875435,0.938764,0.476065,0.647705,0.0,16.1
8,0.349498,0.526383,0.088963,0.849966,0.822617,0.223639,0.0,
9,0.094137,0.007742,0.709815,0.108017,0.159556,0.004568,0.0,


In [40]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 8 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   col1       10 non-null     float64
 1   columna_2  10 non-null     float64
 2   col3       10 non-null     float64
 3   col4       10 non-null     float64
 4   col5       10 non-null     float64
 5   nueva_col  10 non-null     float64
 6   ceros      10 non-null     float64
 7   col_nulos  8 non-null      float64
dtypes: float64(8)
memory usage: 768.0 bytes


In [41]:
df = df.fillna('hola_soy_un_dato_nulo')

df

Unnamed: 0,col1,columna_2,col3,col4,col5,nueva_col,ceros,col_nulos
0,0.141054,0.773635,0.367378,0.409632,0.672517,0.162262,0.0,0.0
1,0.202901,0.462851,0.400533,0.697304,0.452849,0.207382,0.0,2.3
2,0.824007,0.333727,0.542133,0.52194,0.474141,0.579982,0.0,4.6
3,0.11773,0.835466,0.200636,0.058425,0.736895,0.133478,0.0,6.9
4,0.386504,0.003633,0.689349,0.600366,0.872655,0.001609,0.0,9.2
5,0.953536,0.704016,0.339949,0.748754,0.794164,0.845297,0.0,11.5
6,0.071566,0.565189,0.54339,0.042838,0.79439,0.050918,0.0,13.8
7,0.59985,0.514045,0.875435,0.938764,0.476065,0.647705,0.0,16.1
8,0.349498,0.526383,0.088963,0.849966,0.822617,0.223639,0.0,hola_soy_un_dato_nulo
9,0.094137,0.007742,0.709815,0.108017,0.159556,0.004568,0.0,hola_soy_un_dato_nulo


In [42]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 8 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   col1       10 non-null     float64
 1   columna_2  10 non-null     float64
 2   col3       10 non-null     float64
 3   col4       10 non-null     float64
 4   col5       10 non-null     float64
 5   nueva_col  10 non-null     float64
 6   ceros      10 non-null     float64
 7   col_nulos  10 non-null     object 
dtypes: float64(7), object(1)
memory usage: 768.0+ bytes
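
As the `info()` output shows, filling the gaps with a string turns `col_nulos` into an `object` column. When the column should stay numeric, one common alternative is to fill with a number. The sketch below uses a hypothetical `demo` frame and the column mean as the filler; whether the mean is a sensible choice depends on the data.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"value": [1.0, 3.0, np.nan, np.nan]})

missing = demo["value"].isna().sum()                  # count the missing values
filled = demo["value"].fillna(demo["value"].mean())   # mean() skips NaN by default

print(missing)
print(filled.tolist(), filled.dtype)
```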


In [44]:
# Create an empty DataFrame
df_vacio = pd.DataFrame()

df_vacio

In [46]:
# build a DataFrame from a list of lists
lst_lst = [[655643, 'buenas', 35432],
          [354, 'como andas', 899],
          [3543, 'ooi']]

columnas = ['num', 'str', 'otro_num']

df_lst = pd.DataFrame(lst_lst, columns=columnas)

df_lst

Unnamed: 0,num,str,otro_num
0,655643,buenas,35432.0
1,354,como andas,899.0
2,3543,ooi,


In [49]:
df_lst.index = df_lst.num

df_lst

num,num,str,otro_num
655643,655643,buenas,35432.0
354,354,como andas,899.0
3543,3543,ooi,
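
Assigning a column to `.index` keeps that column in the table too, which is why the `num` values appear twice above. A possible alternative is `set_index`, which moves the column into the index. The sketch below uses a reduced, hypothetical copy of the data.

```python
import pandas as pd

people = pd.DataFrame(
    {"num": [655643, 354, 3543], "text": ["buenas", "como andas", "ooi"]}
)

indexed = people.set_index("num")     # returns a new DataFrame
print(indexed.columns.tolist())       # 'num' is no longer a regular column
print(indexed.loc[354, "text"])       # look up a row by its new label

restored = indexed.reset_index()      # move the index back into a column
```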


In [53]:
# from a dictionary

dictio = {'casa':lst_lst[0],
         'oficina': lst_lst[1],
         'numero': lst_lst[2]+[0]}

dictio

{'casa': [655643, 'buenas', 35432],
 'oficina': [354, 'como andas', 899],
 'numero': [3543, 'ooi', 0]}

In [54]:
df_dictio = pd.DataFrame(dictio)

df_dictio

Unnamed: 0,casa,oficina,numero
0,655643,354,3543
1,buenas,como andas,ooi
2,35432,899,0


In [55]:
df_dictio.columns = ['a', 'b', 'c']

df_dictio

Unnamed: 0,a,b,c
0,655643,354,3543
1,buenas,como andas,ooi
2,35432,899,0
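
When a DataFrame is built from a dictionary of lists, each key becomes a column, so all the lists need the same length (which is why a `0` was appended to the shorter list above). If the keys should instead label the rows, `DataFrame.from_dict` with `orient="index"` is one option. A sketch with a reduced copy of the dictionary:

```python
import pandas as pd

records = {
    "casa": [655643, "buenas", 35432],
    "oficina": [354, "como andas", 899],
}

by_columns = pd.DataFrame(records)                          # keys become column names
by_rows = pd.DataFrame.from_dict(records, orient="index")   # keys become row labels

print(by_columns.shape)   # rows x columns when keys are columns
print(by_rows.shape)      # rows x columns when keys are rows
```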


### Operations


In [56]:
df

Unnamed: 0,col1,columna_2,col3,col4,col5,nueva_col,ceros,col_nulos
0,0.141054,0.773635,0.367378,0.409632,0.672517,0.162262,0.0,0.0
1,0.202901,0.462851,0.400533,0.697304,0.452849,0.207382,0.0,2.3
2,0.824007,0.333727,0.542133,0.52194,0.474141,0.579982,0.0,4.6
3,0.11773,0.835466,0.200636,0.058425,0.736895,0.133478,0.0,6.9
4,0.386504,0.003633,0.689349,0.600366,0.872655,0.001609,0.0,9.2
5,0.953536,0.704016,0.339949,0.748754,0.794164,0.845297,0.0,11.5
6,0.071566,0.565189,0.54339,0.042838,0.79439,0.050918,0.0,13.8
7,0.59985,0.514045,0.875435,0.938764,0.476065,0.647705,0.0,16.1
8,0.349498,0.526383,0.088963,0.849966,0.822617,0.223639,0.0,hola_soy_un_dato_nulo
9,0.094137,0.007742,0.709815,0.108017,0.159556,0.004568,0.0,hola_soy_un_dato_nulo


In [58]:
# transpose the DataFrame
df.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
col1,0.141054,0.202901,0.824007,0.11773,0.386504,0.953536,0.071566,0.59985,0.349498,0.094137
columna_2,0.773635,0.462851,0.333727,0.835466,0.003633,0.704016,0.565189,0.514045,0.526383,0.007742
col3,0.367378,0.400533,0.542133,0.200636,0.689349,0.339949,0.54339,0.875435,0.088963,0.709815
col4,0.409632,0.697304,0.52194,0.058425,0.600366,0.748754,0.042838,0.938764,0.849966,0.108017
col5,0.672517,0.452849,0.474141,0.736895,0.872655,0.794164,0.79439,0.476065,0.822617,0.159556
nueva_col,0.162262,0.207382,0.579982,0.133478,0.001609,0.845297,0.050918,0.647705,0.223639,0.004568
ceros,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
col_nulos,0.0,2.3,4.6,6.9,9.2,11.5,13.8,16.1,hola_soy_un_dato_nulo,hola_soy_un_dato_nulo


In [59]:
df.T.index

Index(['col1', 'columna_2', 'col3', 'col4', 'col5', 'nueva_col', 'ceros',
       'col_nulos'],
      dtype='object')

In [60]:
df.columns

Index(['col1', 'columna_2', 'col3', 'col4', 'col5', 'nueva_col', 'ceros',
       'col_nulos'],
      dtype='object')

In [61]:
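# column-wise sum of the DataFrame
# numeric_only=True skips the text column col_nulos; pandas >= 2.0 needs it here

df.sum(numeric_only=True)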

col1         3.740781
columna_2    4.726688
col3         4.757583
col4         4.976005
col5         6.255850
nueva_col    2.856841
ceros        0.000000
dtype: float64
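
The result above has no entry for `col_nulos` because that column now holds text. A minimal sketch of two ways to restrict a reduction to numeric columns, using a hypothetical `mixed` frame. Recent pandas releases are expected to raise an error rather than silently drop such a column, so stating the intent explicitly is the safer habit.

```python
import pandas as pd

mixed = pd.DataFrame(
    {"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0], "label": ["x", "y", "z"]}
)

totals = mixed.sum(numeric_only=True)                  # skip non-numeric columns
numeric_part = mixed.select_dtypes(include="number")   # keep only numeric columns
row_totals = numeric_part.sum(axis=1)                  # sum across each row

print(totals)
print(row_totals)
```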

In [63]:
df

Unnamed: 0,col1,columna_2,col3,col4,col5,nueva_col,ceros,col_nulos
0,0.141054,0.773635,0.367378,0.409632,0.672517,0.162262,0.0,0.0
1,0.202901,0.462851,0.400533,0.697304,0.452849,0.207382,0.0,2.3
2,0.824007,0.333727,0.542133,0.52194,0.474141,0.579982,0.0,4.6
3,0.11773,0.835466,0.200636,0.058425,0.736895,0.133478,0.0,6.9
4,0.386504,0.003633,0.689349,0.600366,0.872655,0.001609,0.0,9.2
5,0.953536,0.704016,0.339949,0.748754,0.794164,0.845297,0.0,11.5
6,0.071566,0.565189,0.54339,0.042838,0.79439,0.050918,0.0,13.8
7,0.59985,0.514045,0.875435,0.938764,0.476065,0.647705,0.0,16.1
8,0.349498,0.526383,0.088963,0.849966,0.822617,0.223639,0.0,hola_soy_un_dato_nulo
9,0.094137,0.007742,0.709815,0.108017,0.159556,0.004568,0.0,hola_soy_un_dato_nulo


In [64]:
df_T = df.T

df_T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
col1,0.141054,0.202901,0.824007,0.11773,0.386504,0.953536,0.071566,0.59985,0.349498,0.094137
columna_2,0.773635,0.462851,0.333727,0.835466,0.003633,0.704016,0.565189,0.514045,0.526383,0.007742
col3,0.367378,0.400533,0.542133,0.200636,0.689349,0.339949,0.54339,0.875435,0.088963,0.709815
col4,0.409632,0.697304,0.52194,0.058425,0.600366,0.748754,0.042838,0.938764,0.849966,0.108017
col5,0.672517,0.452849,0.474141,0.736895,0.872655,0.794164,0.79439,0.476065,0.822617,0.159556
nueva_col,0.162262,0.207382,0.579982,0.133478,0.001609,0.845297,0.050918,0.647705,0.223639,0.004568
ceros,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
col_nulos,0.0,2.3,4.6,6.9,9.2,11.5,13.8,16.1,hola_soy_un_dato_nulo,hola_soy_un_dato_nulo


In [62]:
df.std(numeric_only=True)

col1         0.317829
columna_2    0.287602
col3         0.242734
col4         0.332013
col5         0.226981
nueva_col    0.297033
ceros        0.000000
dtype: float64

In [65]:
df_T.std()

0    0.293232
1    0.722182
2    1.479580
3    2.355553
4    3.142262
5    3.856938
6    4.784200
7    5.495007
dtype: float64

In [67]:
df_T.info()

<class 'pandas.core.frame.DataFrame'>
Index: 8 entries, col1 to col_nulos
Data columns (total 10 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       8 non-null      object
 1   1       8 non-null      object
 2   2       8 non-null      object
 3   3       8 non-null      object
 4   4       8 non-null      object
 5   5       8 non-null      object
 6   6       8 non-null      object
 7   7       8 non-null      object
 8   8       8 non-null      object
 9   9       8 non-null      object
dtypes: object(10)
memory usage: 1004.0+ bytes
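
Every column of `df_T` is reported as `object` because, after transposing, each column mixes the values of differently typed original columns. A small sketch of the effect with a hypothetical `table`:

```python
import pandas as pd

table = pd.DataFrame({"n": [1, 2], "x": [0.5, 1.5], "s": ["a", "b"]})

print(table.dtypes)      # one dtype per column
flipped = table.T
print(flipped.dtypes)    # each column now mixes int, float and str values

# Transposing only the numeric columns keeps a numeric dtype.
numeric_flipped = table[["n", "x"]].T
print(numeric_flipped.dtypes)
```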


In [68]:
df.mean(numeric_only=True)

col1         0.374078
columna_2    0.472669
col3         0.475758
col4         0.497601
col5         0.625585
nueva_col    0.285684
ceros        0.000000
dtype: float64

In [69]:
df.mode()

Unnamed: 0,col1,columna_2,col3,col4,col5,nueva_col,ceros,col_nulos
0,0.071566,0.003633,0.088963,0.042838,0.159556,0.001609,0.0,hola_soy_un_dato_nulo
1,0.094137,0.007742,0.200636,0.058425,0.452849,0.004568,,
2,0.11773,0.333727,0.339949,0.108017,0.474141,0.050918,,
3,0.141054,0.462851,0.367378,0.409632,0.476065,0.133478,,
4,0.202901,0.514045,0.400533,0.52194,0.672517,0.162262,,
5,0.349498,0.526383,0.542133,0.600366,0.736895,0.207382,,
6,0.386504,0.565189,0.54339,0.697304,0.794164,0.223639,,
7,0.59985,0.704016,0.689349,0.748754,0.79439,0.579982,,
8,0.824007,0.773635,0.709815,0.849966,0.822617,0.647705,,
9,0.953536,0.835466,0.875435,0.938764,0.872655,0.845297,,


In [70]:
df.median(numeric_only=True)

col1         0.276199
columna_2    0.520214
col3         0.471333
col4         0.561153
col5         0.704706
nueva_col    0.184822
ceros        0.000000
dtype: float64

In [71]:
df.max(numeric_only=True)

col1         0.953536
columna_2    0.835466
col3         0.875435
col4         0.938764
col5         0.872655
nueva_col    0.845297
ceros        0.000000
dtype: float64

In [72]:
df.max(axis=1, numeric_only=True)

0    0.773635
1    0.697304
2    0.824007
3    0.835466
4    0.872655
5    0.953536
6    0.794390
7    0.938764
8    0.849966
9    0.709815
dtype: float64

In [73]:
df.min(axis=0, numeric_only=True)

col1         0.071566
columna_2    0.003633
col3         0.088963
col4         0.042838
col5         0.159556
nueva_col    0.001609
ceros        0.000000
dtype: float64
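
The `axis` argument used in the last cells decides the direction of the reduction. A sketch on a tiny, made-up frame where the results are easy to check by eye:

```python
import pandas as pd

small = pd.DataFrame({"a": [1, 5], "b": [4, 2], "c": [3, 6]})

per_column_max = small.max(axis=0)         # one result per column (the default)
per_row_max = small.max(axis=1)            # one result per row
per_column_min = small.min(axis="index")   # string alias for axis=0

print(per_column_max.tolist())
print(per_row_max.tolist())
print(per_column_min.tolist())
```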

### Importing files

+ CSV
+ XLSX
+ XLS
+ JSON

In [75]:
# csv
pd.set_option('display.max_columns', None) # show all the columns of a DataFrame
#pd.set_option('display.max_rows', None) # show all the rows of a DataFrame

In [76]:
#help(pd.set_option)

In [77]:
df_csv = pd.read_csv('../data/vehicles_messy.csv')

df_csv.head()

Unnamed: 0,barrels08,barrelsA08,charge120,charge240,city08,city08U,cityA08,cityA08U,cityCD,cityE,cityUF,co2,co2A,co2TailpipeAGpm,co2TailpipeGpm,comb08,comb08U,combA08,combA08U,combE,combinedCD,combinedUF,cylinders,displ,drive,engId,eng_dscr,feScore,fuelCost08,fuelCostA08,fuelType,fuelType1,ghgScore,ghgScoreA,highway08,highway08U,highwayA08,highwayA08U,highwayCD,highwayE,highwayUF,hlv,hpv,id,lv2,lv4,make,model,mpgData,phevBlended,pv2,pv4,range,rangeCity,rangeCityA,rangeHwy,rangeHwyA,trany,UCity,UCityA,UHighway,UHighwayA,VClass,year,youSaveSpend,guzzler,trans_dscr,tCharger,sCharger,atvType,fuelType2,rangeA,evMotor,mfrCode,c240Dscr,charge240b,c240bDscr,createdOn,modifiedOn,startStop,phevCity,phevHwy,phevComb
0,15.695714,0.0,0.0,0.0,19,0.0,0,0.0,0.0,0.0,0.0,-1,-1,0.0,423.190476,21,0.0,0,0.0,0.0,0.0,0.0,4.0,2.0,Rear-Wheel Drive,9011,(FFS),-1,1600,0,Regular,Regular Gasoline,-1,-1,25,0.0,0,0.0,0.0,0.0,0.0,0,0,1,0,0,Alfa Romeo,Spider Veloce 2000,Y,False,0,0,0,0.0,0.0,0.0,0.0,Manual 5-spd,23.3333,0.0,35.0,0.0,Two Seaters,1985,-1250,,,,,,,,,,,0.0,,Tue Jan 01 00:00:00 EST 2013,Tue Jan 01 00:00:00 EST 2013,,0,0,0
1,29.964545,0.0,0.0,0.0,9,0.0,0,0.0,0.0,0.0,0.0,-1,-1,0.0,807.909091,11,0.0,0,0.0,0.0,0.0,0.0,12.0,4.9,Rear-Wheel Drive,22020,(GUZZLER),-1,3050,0,Regular,Regular Gasoline,-1,-1,14,0.0,0,0.0,0.0,0.0,0.0,0,0,10,0,0,Ferrari,Testarossa,N,False,0,0,0,0.0,0.0,0.0,0.0,Manual 5-spd,11.0,0.0,19.0,0.0,Two Seaters,1985,-8500,T,,,,,,,,,,0.0,,Tue Jan 01 00:00:00 EST 2013,Tue Jan 01 00:00:00 EST 2013,,0,0,0
2,12.207778,0.0,0.0,0.0,23,0.0,0,0.0,0.0,0.0,0.0,-1,-1,0.0,329.148148,27,0.0,0,0.0,0.0,0.0,0.0,4.0,2.2,Front-Wheel Drive,2100,(FFS),-1,1250,0,Regular,Regular Gasoline,-1,-1,33,0.0,0,0.0,0.0,0.0,0.0,19,77,100,0,0,Dodge,Charger,Y,False,0,0,0,0.0,0.0,0.0,0.0,Manual 5-spd,29.0,0.0,47.0,0.0,Subcompact Cars,1985,500,,SIL,,,,,,,,,0.0,,Tue Jan 01 00:00:00 EST 2013,Tue Jan 01 00:00:00 EST 2013,,0,0,0
3,29.964545,0.0,0.0,0.0,10,0.0,0,0.0,0.0,0.0,0.0,-1,-1,0.0,807.909091,11,0.0,0,0.0,0.0,0.0,0.0,8.0,5.2,Rear-Wheel Drive,2850,,-1,3050,0,Regular,Regular Gasoline,-1,-1,12,0.0,0,0.0,0.0,0.0,0.0,0,0,1000,0,0,Dodge,B150/B250 Wagon 2WD,N,False,0,0,0,0.0,0.0,0.0,0.0,Automatic 3-spd,12.2222,0.0,16.6667,0.0,Vans,1985,-8500,,,,,,,,,,,0.0,,Tue Jan 01 00:00:00 EST 2013,Tue Jan 01 00:00:00 EST 2013,,0,0,0
4,17.347895,0.0,0.0,0.0,17,0.0,0,0.0,0.0,0.0,0.0,-1,-1,0.0,467.736842,19,0.0,0,0.0,0.0,0.0,0.0,4.0,2.2,4-Wheel or All-Wheel Drive,66031,"(FFS,TRBO)",-1,2150,0,Premium,Premium Gasoline,-1,-1,23,0.0,0,0.0,0.0,0.0,0.0,0,0,10000,0,14,Subaru,Legacy AWD Turbo,N,False,0,90,0,0.0,0.0,0.0,0.0,Manual 5-spd,21.0,0.0,32.0,0.0,Compact Cars,1993,-4000,,,T,,,,,,,,0.0,,Tue Jan 01 00:00:00 EST 2013,Tue Jan 01 00:00:00 EST 2013,,0,0,0


In [78]:
df_csv.shape

(37843, 83)

In [80]:
print(df_csv)

       barrels08  barrelsA08  charge120  charge240  city08  city08U  cityA08  \
0      15.695714         0.0        0.0        0.0      19      0.0        0   
1      29.964545         0.0        0.0        0.0       9      0.0        0   
2      12.207778         0.0        0.0        0.0      23      0.0        0   
3      29.964545         0.0        0.0        0.0      10      0.0        0   
4      17.347895         0.0        0.0        0.0      17      0.0        0   
...          ...         ...        ...        ...     ...      ...      ...   
37838  14.982273         0.0        0.0        0.0      19      0.0        0   
37839  14.330870         0.0        0.0        0.0      20      0.0        0   
37840  15.695714         0.0        0.0        0.0      18      0.0        0   
37841  15.695714         0.0        0.0        0.0      18      0.0        0   
37842  18.311667         0.0        0.0        0.0      16      0.0        0   

       cityA08U  cityCD  cityE  cityUF 

In [81]:
%pip install openpyxl
%pip install xlrd

Note: you may need to restart the kernel to use updated packages.
Collecting xlrd
  Using cached xlrd-2.0.1-py2.py3-none-any.whl (96 kB)
Installing collected packages: xlrd
Successfully installed xlrd-2.0.1
Note: you may need to restart the kernel to use updated packages.


In [82]:
# xlsx
df_xlsx = pd.read_excel('../data/Online Retail.xlsx')

df_xlsx.head()

Unnamed: 0,InvoiceNo,InvoiceDate,StockCode,Description,Quantity,UnitPrice,Revenue,CustomerID,Country
0,536365,2010-12-01 08:26:00,85123A,CREAM HANGING HEART T-LIGHT HOLDER,6,2.55,15.3,17850,United Kingdom
1,536373,2010-12-01 09:02:00,85123A,CREAM HANGING HEART T-LIGHT HOLDER,6,2.55,15.3,17850,United Kingdom
2,536375,2010-12-01 09:32:00,85123A,CREAM HANGING HEART T-LIGHT HOLDER,6,2.55,15.3,17850,United Kingdom
3,536390,2010-12-01 10:19:00,85123A,CREAM HANGING HEART T-LIGHT HOLDER,64,2.55,163.2,17511,United Kingdom
4,536394,2010-12-01 10:39:00,85123A,CREAM HANGING HEART T-LIGHT HOLDER,32,2.55,81.6,13408,United Kingdom


In [83]:
# xls

df_xls = pd.read_excel('../data/Sensor Data.xls')

df_xls

Unnamed: 0,Input 1,Input 2,Input 3,Input 4,Input 5,Input 6,Input 7,Input 8,Input 9,Input 10,Input 11,Input 12,output1,output2,class
0,1.473,2.311,3.179,2.666,0.2795,0.2771,0.22340,0.18550,0.2539,1.138,1.111,4.712,1,1,one
1,1.460,2.377,3.214,2.920,0.2527,0.3064,0.02563,0.19650,0.3027,1.213,1.027,5.463,1,1,one
2,1.552,2.164,3.064,2.745,0.2820,0.2100,0.17210,0.19290,0.2100,1.221,1.058,5.332,1,1,one
3,1.605,2.228,3.149,2.834,0.2917,0.3613,0.20870,0.12940,0.2734,1.144,1.062,4.829,1,1,one
4,1.534,2.114,3.309,2.976,0.2100,0.2502,0.22580,0.17700,0.2039,1.254,1.112,5.734,1,1,one
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2207,3.682,1.301,4.939,4.453,0.4895,0.7922,0.23190,0.05005,0.3687,1.478,1.174,5.125,0,1,two
2208,3.412,1.293,4.949,4.199,0.4578,0.9521,0.21360,0.23070,0.4578,1.526,1.167,5.433,0,1,two
2209,3.640,1.284,5.111,4.460,0.5786,0.8020,0.26980,0.31740,0.4309,1.460,1.118,4.867,0,1,two
2210,3.746,1.261,5.049,4.885,0.5835,1.1470,0.32350,0.23070,0.4614,1.482,1.128,5.627,0,1,two


In [84]:
df_xlsx_3 = pd.read_excel('../data/Sensor Data.xls', sheet_name='Sheet3')

df_xlsx_3

Unnamed: 0,hola,amiguis,estamos,probando,pandas


In [86]:
# json

df_json = pd.read_json('../data/companies.json', lines=True, orient='records')

df_json.head()

Unnamed: 0,_id,name,permalink,crunchbase_url,homepage_url,blog_url,blog_feed_url,twitter_username,category_code,number_of_employees,founded_year,founded_month,founded_day,deadpooled_year,tag_list,alias_list,email_address,phone_number,description,created_at,updated_at,overview,image,products,relationships,competitions,providerships,total_money_raised,funding_rounds,investments,acquisition,acquisitions,offices,milestones,video_embeds,screenshots,external_links,partners,deadpooled_month,deadpooled_day,deadpooled_url,ipo
0,{'$oid': '52cdef7c4bab8bd675297d8a'},Wetpaint,abc2,http://www.crunchbase.com/company/wetpaint,http://wetpaint-inc.com,http://digitalquarters.net/,http://digitalquarters.net/feed/,BachelrWetpaint,web,47.0,2005.0,10.0,17.0,1.0,"wiki, seattle, elowitz, media-industry, media-...",,info@wetpaint.com,206.859.6300,Technology Platform Company,{'$date': 1180075887000},2013-12-08 07:15:44+00:00,<p>Wetpaint is a technology platform company t...,"{'available_sizes': [[[150, 75], 'assets/image...","[{'name': 'Wikison Wetpaint', 'permalink': 'we...","[{'is_past': False, 'title': 'Co-Founder and V...","[{'competitor': {'name': 'Wikia', 'permalink':...",[],$39.8M,"[{'id': 888, 'round_code': 'a', 'source_url': ...",[],"{'price_amount': 30000000, 'price_currency_cod...",[],"[{'description': '', 'address1': '710 - 2nd Av...","[{'id': 5869, 'description': 'Wetpaint named i...",[],"[{'available_sizes': [[[150, 86], 'assets/imag...",[{'external_url': 'http://www.geekwire.com/201...,[],,,,
1,{'$oid': '52cdef7c4bab8bd675297d8b'},AdventNet,abc3,http://www.crunchbase.com/company/adventnet,http://adventnet.com,,,manageengine,enterprise,600.0,1996.0,,,2.0,,Zoho ManageEngine,pr@adventnet.com,925-924-9500,Server Management Software,{'$date': 1180121062000},2012-10-31 18:26:09+00:00,"<p>AdventNet is now <a href=""/company/zoho-man...","{'available_sizes': [[[150, 55], 'assets/image...",[],"[{'is_past': True, 'title': 'CEO and Co-Founde...",[],"[{'title': 'DHFH', 'is_past': True, 'provider'...",$0,[],[],,[],"[{'description': 'Headquarters', 'address1': '...",[],[],"[{'available_sizes': [[[150, 94], 'assets/imag...",[],[],,,,
2,{'$oid': '52cdef7c4bab8bd675297d8c'},Zoho,abc4,http://www.crunchbase.com/company/zoho,http://zoho.com,http://blogs.zoho.com/,http://blogs.zoho.com/feed,zoho,software,1600.0,2005.0,9.0,15.0,3.0,"zoho, officesuite, spreadsheet, writer, projec...",,info@zohocorp.com,1-888-204-3539,Online Business Apps Suite,Fri May 25 19:30:28 UTC 2007,2013-10-30 00:07:05+00:00,"<p>Zoho offers a suite of Business, Collaborat...","{'available_sizes': [[[150, 55], 'assets/image...","[{'name': 'Zoho Office Suite', 'permalink': 'z...","[{'is_past': False, 'title': 'CEO and Founder'...","[{'competitor': {'name': 'Empressr', 'permalin...",[],$0,[],[],,[],"[{'description': 'Headquarters', 'address1': '...","[{'id': 388, 'description': 'Zoho Reaches 2 Mi...","[{'embed_code': '<object width=""430"" height=""2...",[],[{'external_url': 'http://www.online-tech-tips...,[],,,,
3,{'$oid': '52cdef7c4bab8bd675297d8d'},Digg,digg,http://www.crunchbase.com/company/digg,http://www.digg.com,http://blog.digg.com/,http://blog.digg.com/?feed=rss2,digg,news,60.0,2004.0,10.0,11.0,,"community, social, news, bookmark, digg, techn...",,feedback@digg.com,(415) 436-9638,user driven social content website,Fri May 25 20:03:23 UTC 2007,2013-11-05 21:35:47+00:00,<p>Digg is a user driven social content websit...,"{'available_sizes': [[[150, 150], 'assets/imag...","[{'name': 'Digg', 'permalink': 'digg'}]","[{'is_past': False, 'title': 'CEO', 'person': ...","[{'competitor': {'name': 'Reddit', 'permalink'...","[{'title': 'Public Relations', 'is_past': True...",$45M,"[{'id': 1, 'round_code': 'b', 'source_url': 'h...",[],"{'price_amount': 500000, 'price_currency_code'...","[{'price_amount': None, 'price_currency_code':...","[{'description': None, 'address1': '135 Missis...","[{'id': 9588, 'description': 'Another Digg Exe...","[{'embed_code': '<embed src=""http://blip.tv/pl...","[{'available_sizes': [[[117, 150], 'assets/ima...",[{'external_url': 'http://www.sociableblog.com...,[],,,,
4,{'$oid': '52cdef7c4bab8bd675297d8e'},Facebook,facebook,http://www.crunchbase.com/company/facebook,http://facebook.com,http://blog.facebook.com,http://blog.facebook.com/atom.php,facebook,social,5299.0,2004.0,2.0,1.0,,"facebook, college, students, profiles, network...",,,,Social network,Fri May 25 21:22:15 UTC 2007,2013-11-21 19:40:55+00:00,<p>Facebook is the world&#8217;s largest socia...,"{'available_sizes': [[[150, 61], 'assets/image...","[{'name': 'Facebook Platform', 'permalink': 'f...","[{'is_past': False, 'title': 'Founder and CEO,...","[{'competitor': {'name': 'MySpace', 'permalink...","[{'title': '', 'is_past': False, 'provider': {...",$2.43B,"[{'id': 2, 'round_code': 'angel', 'source_url'...","[{'funding_round': {'round_code': 'seed', 'sou...",,"[{'price_amount': None, 'price_currency_code':...","[{'description': 'Headquarters', 'address1': '...","[{'id': 108, 'description': 'Facebook adds com...",[],"[{'available_sizes': [[[150, 68], 'assets/imag...",[{'external_url': 'http://latimesblogs.latimes...,[],,,,"{'valuation_amount': 104000000000, 'valuation_..."
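
The `lines=True` argument used above tells `read_json` that the file holds one JSON object per line. A self-contained sketch of the same idea with two invented records read from memory instead of `companies.json`:

```python
import io

import pandas as pd

# Hypothetical records in JSON Lines format: one JSON object per line.
raw = io.StringIO(
    '{"name": "Acme", "founded_year": 1999, "offices": [{"city": "Madrid"}]}\n'
    '{"name": "Globex", "founded_year": 2004, "offices": []}\n'
)

companies = pd.read_json(raw, lines=True)

print(companies.shape)
print(companies["name"].tolist())
# Nested values stay as Python lists and dicts inside an object column.
print(type(companies.loc[0, "offices"]))
```

Nested columns like `offices` usually need a further flattening step before they can be analyzed as regular columns.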
