# 1.2 - Intro to Pandas (Panel Data)

**[Documentation](https://pandas.pydata.org/docs/reference/index.html#api)**

**[Source code](https://github.com/pandas-dev/pandas)**


![pandas](images/pandas.png)


Pandas is a Python library specialized in manipulating and analyzing structured data.


The main features of this library are:

+ It defines new data structures built on NumPy arrays, with added functionality.
+ It reads and writes CSV files, Excel files, and SQL databases with little effort.
+ It gives access to the data through indexes or names for rows and columns.
+ It provides methods to reorder, split, and combine datasets.
+ It supports time series.
+ It performs these operations efficiently.


**Pandas data types**
Pandas provides two main data structures:

+ Series: one-dimensional structure.
+ DataFrame: two-dimensional structure (tables).

These structures are built on top of NumPy arrays and add new functionality.
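As a minimal sketch of the two structures (the names `values`, `s`, and `table` are made up for illustration), the following builds a Series and a DataFrame from the same NumPy array and checks their dimensions:

```python
import numpy as np
import pandas as pd

# Illustrative data, not taken from the notebook
values = np.array([1.5, 2.5, 3.5])

# One dimension: values plus an index of labels
s = pd.Series(values, index=['a', 'b', 'c'])

# Two dimensions: several columns sharing the same row index
table = pd.DataFrame({'x': values, 'y': values * 2}, index=['a', 'b', 'c'])

print(s.ndim, table.ndim)
print(type(table['x']))      # each column of a DataFrame is a Series
print(type(s.to_numpy()))    # the values can be exported as a NumPy array
```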

In [None]:
%pip install pandas

In [1]:
import pandas as pd

In [2]:
import numpy as np

In [3]:
import warnings
warnings.filterwarnings('ignore')

### Series

Series are structures similar to one-dimensional arrays. They are homogeneous, that is, a Series has a single dtype for all its elements (mixed values are stored under the generic `object` dtype), and their size is immutable, that is, it cannot be changed, although their contents can.

A Series has an index that associates a label with each element, and elements are accessed through that label.

In [7]:
lst = [(3.4 + i)**2 for i in range(10)]

lst

[11.559999999999999,
 19.360000000000003,
 29.160000000000004,
 40.96000000000001,
 54.760000000000005,
 70.56,
 88.36000000000001,
 108.16000000000001,
 129.96,
 153.76000000000002]

In [9]:
lst.append('hola')

In [10]:
serie = pd.Series(lst)

serie

0      11.56
1      19.36
2      29.16
3      40.96
4      54.76
5      70.56
6      88.36
7     108.16
8     129.96
9     153.76
10      hola
dtype: object

In [11]:
type(serie)

pandas.core.series.Series

In [12]:
serie[2]

29.160000000000004

In [13]:
type(serie[2])

float

In [14]:
serie[10]

'hola'

In [15]:
serie[10] = 0.5

serie

0      11.56
1      19.36
2      29.16
3      40.96
4      54.76
5      70.56
6      88.36
7     108.16
8     129.96
9     153.76
10       0.5
dtype: object
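Replacing the string with a number does not restore a numeric dtype: the Series was created as `object` and stays that way. A small sketch of this behavior, with illustrative names and `pd.to_numeric` as one possible way to convert back:

```python
import pandas as pd

numeric = pd.Series([1.5, 2.5, 3.5])
mixed = pd.Series([1.5, 2.5, 'hola'])

print(numeric.dtype)   # all floats: a numeric dtype
print(mixed.dtype)     # floats and a string: generic object dtype

# Overwrite the string; the dtype of the Series does not change
mixed.loc[2] = 0.5
print(mixed.dtype)

# Explicit conversion once every value is numeric
restored = pd.to_numeric(mixed)
print(restored.dtype)
```

Keeping numeric columns out of `object` dtype matters because arithmetic and statistics are much faster on native numeric types.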

In [17]:
#help(serie)

In [18]:
serie.head()     # the head: the first 5 elements by default

0    11.56
1    19.36
2    29.16
3    40.96
4    54.76
dtype: object

In [19]:
serie.head(3)

0    11.56
1    19.36
2    29.16
dtype: object

In [20]:
serie.tail()

6      88.36
7     108.16
8     129.96
9     153.76
10       0.5
dtype: object

In [21]:
serie.index

RangeIndex(start=0, stop=11, step=1)

In [22]:
serie.shape

(11,)

In [23]:
len(serie)

11

In [24]:
serie.index = ['q', 't', 'y', 'o', 'p', 'a', 's', 'd', 'f', 'g', 'v']

serie

q     11.56
t     19.36
y     29.16
o     40.96
p     54.76
a     70.56
s     88.36
d    108.16
f    129.96
g    153.76
v       0.5
dtype: object

In [25]:
serie.iloc[0]   # by position

11.559999999999999

In [26]:
serie['q']   # by label

11.559999999999999
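Once the index holds labels instead of integers, it helps to state explicitly whether a lookup is by position or by label. A minimal sketch with the `.iloc` and `.loc` accessors, on a small made-up Series named `s`:

```python
import pandas as pd

s = pd.Series([10.0, 20.0, 30.0], index=['q', 't', 'y'])

first_by_position = s.iloc[0]     # position 0, whatever its label is
first_by_label = s.loc['q']       # the element labelled 'q'
subset = s.loc[['q', 'y']]        # several labels at once return a Series
last_two = s.iloc[-2:]            # positional slicing works like a list

print(first_by_position, first_by_label)
print(subset)
print(last_two)
```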

In [28]:
serie.to_dict()

{'q': 11.559999999999999,
 't': 19.360000000000003,
 'y': 29.160000000000004,
 'o': 40.96000000000001,
 'p': 54.760000000000005,
 'a': 70.56,
 's': 88.36000000000001,
 'd': 108.16000000000001,
 'f': 129.96,
 'g': 153.76000000000002,
 'v': 0.5}

In [29]:
serie.to_frame()

Unnamed: 0,0
q,11.56
t,19.36
y,29.16
o,40.96
p,54.76
a,70.56
s,88.36
d,108.16
f,129.96
g,153.76
v,0.5


In [30]:
serie.to_frame()[0]

q     11.56
t     19.36
y     29.16
o     40.96
p     54.76
a     70.56
s     88.36
d    108.16
f    129.96
g    153.76
v       0.5
Name: 0, dtype: object

In [31]:
serie.to_frame().shape

(11, 1)

### DataFrame

A DataFrame object defines a dataset structured as a table in which each column is a Series object, that is, all the data in a given column share the same type, while the rows are records that can contain data of different types.

A DataFrame has two indexes, one for the rows and one for the columns, and its elements can be accessed through the row and column names.
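The description above can be sketched with a tiny table. The data and the names `people`, `r1`, and `r2` are invented for illustration: each column keeps its own dtype, and `.loc` takes a row label and a column label.

```python
import pandas as pd

people = pd.DataFrame(
    {'name': ['Ana', 'Luis'], 'age': [31, 45], 'height': [1.68, 1.80]},
    index=['r1', 'r2'],
)

print(people.dtypes)              # one dtype per column
print(people.index)               # row index
print(people.columns)             # column index

age_r2 = people.loc['r2', 'age']  # row label, column label
row_r1 = people.loc['r1']         # a whole row, returned as a Series
print(age_r2)
print(row_r1)
```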

In [32]:
array = np.random.random((10, 5))

array

array([[0.8056203 , 0.84019067, 0.27463653, 0.53337614, 0.31151428],
       [0.9795593 , 0.80667633, 0.18777822, 0.94841119, 0.97999034],
       [0.31133204, 0.56192321, 0.51593251, 0.14590945, 0.40208586],
       [0.54974715, 0.21662595, 0.8318242 , 0.04872466, 0.60432034],
       [0.81076504, 0.83940335, 0.92727757, 0.54346741, 0.78503632],
       [0.42840294, 0.76366696, 0.75921654, 0.49446435, 0.77657444],
       [0.4874654 , 0.28097329, 0.92093774, 0.82954583, 0.09337614],
       [0.89510964, 0.59834308, 0.1875831 , 0.9127205 , 0.7743789 ],
       [0.81952603, 0.50564655, 0.91273152, 0.10356162, 0.69699366],
       [0.75325319, 0.68017777, 0.43453219, 0.78420698, 0.11468395]])

In [37]:
cols = ['col1', 'col2', 'col 3', 'col4', 'col5']

In [38]:
df = pd.DataFrame(array, columns=cols)

display(df)

Unnamed: 0,col1,col2,col 3,col4,col5
0,0.80562,0.840191,0.274637,0.533376,0.311514
1,0.979559,0.806676,0.187778,0.948411,0.97999
2,0.311332,0.561923,0.515933,0.145909,0.402086
3,0.549747,0.216626,0.831824,0.048725,0.60432
4,0.810765,0.839403,0.927278,0.543467,0.785036
5,0.428403,0.763667,0.759217,0.494464,0.776574
6,0.487465,0.280973,0.920938,0.829546,0.093376
7,0.89511,0.598343,0.187583,0.91272,0.774379
8,0.819526,0.505647,0.912732,0.103562,0.696994
9,0.753253,0.680178,0.434532,0.784207,0.114684


In [39]:
df['col 3']

0    0.274637
1    0.187778
2    0.515933
3    0.831824
4    0.927278
5    0.759217
6    0.920938
7    0.187583
8    0.912732
9    0.434532
Name: col 3, dtype: float64

In [41]:
df.col 3   # attribute access fails when the column name contains a space

SyntaxError: invalid syntax (3887910251.py, line 1)

In [43]:
df.columns = [e.lower().replace(' ', '_') for e in df.columns]

df.columns

Index(['col1', 'col2', 'col_3', 'col4', 'col5'], dtype='object')

In [46]:
df.col_3

0    0.274637
1    0.187778
2    0.515933
3    0.831824
4    0.927278
5    0.759217
6    0.920938
7    0.187583
8    0.912732
9    0.434532
Name: col_3, dtype: float64

In [47]:
seleccion = ['col2', 'col4']

seleccion

['col2', 'col4']

In [48]:
df[seleccion]

Unnamed: 0,col2,col4
0,0.840191,0.533376
1,0.806676,0.948411
2,0.561923,0.145909
3,0.216626,0.048725
4,0.839403,0.543467
5,0.763667,0.494464
6,0.280973,0.829546
7,0.598343,0.91272
8,0.505647,0.103562
9,0.680178,0.784207


In [52]:
df[['col1', 'col5']]

Unnamed: 0,col1,col5
0,0.80562,0.311514
1,0.979559,0.97999
2,0.311332,0.402086
3,0.549747,0.60432
4,0.810765,0.785036
5,0.428403,0.776574
6,0.487465,0.093376
7,0.89511,0.774379
8,0.819526,0.696994
9,0.753253,0.114684


In [53]:
df.rename(columns={'col_3': 'nueva_col'})    # {key: old name, value: new name}

Unnamed: 0,col1,col2,nueva_col,col4,col5
0,0.80562,0.840191,0.274637,0.533376,0.311514
1,0.979559,0.806676,0.187778,0.948411,0.97999
2,0.311332,0.561923,0.515933,0.145909,0.402086
3,0.549747,0.216626,0.831824,0.048725,0.60432
4,0.810765,0.839403,0.927278,0.543467,0.785036
5,0.428403,0.763667,0.759217,0.494464,0.776574
6,0.487465,0.280973,0.920938,0.829546,0.093376
7,0.89511,0.598343,0.187583,0.91272,0.774379
8,0.819526,0.505647,0.912732,0.103562,0.696994
9,0.753253,0.680178,0.434532,0.784207,0.114684


In [55]:
# overwrite the DataFrame with the renamed version

# option 1: reassign the result
df = df.rename(columns={'col_3': 'nueva_col'})


# option 2: modify in place
df.rename(columns={'col_3': 'nueva_col'}, inplace=True)


In [56]:
df

Unnamed: 0,col1,col2,nueva_col,col4,col5
0,0.80562,0.840191,0.274637,0.533376,0.311514
1,0.979559,0.806676,0.187778,0.948411,0.97999
2,0.311332,0.561923,0.515933,0.145909,0.402086
3,0.549747,0.216626,0.831824,0.048725,0.60432
4,0.810765,0.839403,0.927278,0.543467,0.785036
5,0.428403,0.763667,0.759217,0.494464,0.776574
6,0.487465,0.280973,0.920938,0.829546,0.093376
7,0.89511,0.598343,0.187583,0.91272,0.774379
8,0.819526,0.505647,0.912732,0.103562,0.696994
9,0.753253,0.680178,0.434532,0.784207,0.114684


In [57]:
df['nueva'] = 0.

In [58]:
df

Unnamed: 0,col1,col2,nueva_col,col4,col5,nueva
0,0.80562,0.840191,0.274637,0.533376,0.311514,0.0
1,0.979559,0.806676,0.187778,0.948411,0.97999,0.0
2,0.311332,0.561923,0.515933,0.145909,0.402086,0.0
3,0.549747,0.216626,0.831824,0.048725,0.60432,0.0
4,0.810765,0.839403,0.927278,0.543467,0.785036,0.0
5,0.428403,0.763667,0.759217,0.494464,0.776574,0.0
6,0.487465,0.280973,0.920938,0.829546,0.093376,0.0
7,0.89511,0.598343,0.187583,0.91272,0.774379,0.0
8,0.819526,0.505647,0.912732,0.103562,0.696994,0.0
9,0.753253,0.680178,0.434532,0.784207,0.114684,0.0


In [59]:
# position, name, data

df.insert(0, 'nueva_cero', list(range(10)))

In [60]:
df

Unnamed: 0,nueva_cero,col1,col2,nueva_col,col4,col5,nueva
0,0,0.80562,0.840191,0.274637,0.533376,0.311514,0.0
1,1,0.979559,0.806676,0.187778,0.948411,0.97999,0.0
2,2,0.311332,0.561923,0.515933,0.145909,0.402086,0.0
3,3,0.549747,0.216626,0.831824,0.048725,0.60432,0.0
4,4,0.810765,0.839403,0.927278,0.543467,0.785036,0.0
5,5,0.428403,0.763667,0.759217,0.494464,0.776574,0.0
6,6,0.487465,0.280973,0.920938,0.829546,0.093376,0.0
7,7,0.89511,0.598343,0.187583,0.91272,0.774379,0.0
8,8,0.819526,0.505647,0.912732,0.103562,0.696994,0.0
9,9,0.753253,0.680178,0.434532,0.784207,0.114684,0.0


In [61]:
df['nulos'] = np.nan

df

Unnamed: 0,nueva_cero,col1,col2,nueva_col,col4,col5,nueva,nulos
0,0,0.80562,0.840191,0.274637,0.533376,0.311514,0.0,
1,1,0.979559,0.806676,0.187778,0.948411,0.97999,0.0,
2,2,0.311332,0.561923,0.515933,0.145909,0.402086,0.0,
3,3,0.549747,0.216626,0.831824,0.048725,0.60432,0.0,
4,4,0.810765,0.839403,0.927278,0.543467,0.785036,0.0,
5,5,0.428403,0.763667,0.759217,0.494464,0.776574,0.0,
6,6,0.487465,0.280973,0.920938,0.829546,0.093376,0.0,
7,7,0.89511,0.598343,0.187583,0.91272,0.774379,0.0,
8,8,0.819526,0.505647,0.912732,0.103562,0.696994,0.0,
9,9,0.753253,0.680178,0.434532,0.784207,0.114684,0.0,


In [62]:
df.shape

(10, 8)

In [63]:
df.size

80

In [64]:
len(df)   # number of rows

10

In [65]:
len(df.columns)

8

In [66]:
df.shape[0]

10

In [67]:
# 3 data columns but 4 column names: this raises a ValueError
df2 = pd.DataFrame(np.random.random((10, 3)), columns=['a', 'b', 'c', 'd'])

df2

ValueError: Shape of passed values is (10, 3), indices imply (10, 4)

In [69]:
df['col10'] = df.col1 * df.col4 / df.col2

df

Unnamed: 0,nueva_cero,col1,col2,nueva_col,col4,col5,nueva,nulos,col10
0,0,0.80562,0.840191,0.274637,0.533376,0.311514,0.0,,0.51143
1,1,0.979559,0.806676,0.187778,0.948411,0.97999,0.0,,1.15167
2,2,0.311332,0.561923,0.515933,0.145909,0.402086,0.0,,0.080841
3,3,0.549747,0.216626,0.831824,0.048725,0.60432,0.0,,0.123652
4,4,0.810765,0.839403,0.927278,0.543467,0.785036,0.0,,0.524926
5,5,0.428403,0.763667,0.759217,0.494464,0.776574,0.0,,0.277385
6,6,0.487465,0.280973,0.920938,0.829546,0.093376,0.0,,1.439193
7,7,0.89511,0.598343,0.187583,0.91272,0.774379,0.0,,1.365412
8,8,0.819526,0.505647,0.912732,0.103562,0.696994,0.0,,0.167847
9,9,0.753253,0.680178,0.434532,0.784207,0.114684,0.0,,0.868459


In [74]:
df.describe()

Unnamed: 0,nueva_cero,col1,col2,nueva_col,col4,col5,nueva,nulos,col10
count,10.0,10.0,10.0,10.0,10.0,10.0,10.0,0.0,10.0
mean,4.5,0.684078,0.609363,0.595245,0.534439,0.553895,0.0,,0.651082
std,3.02765,0.222812,0.2234,0.310394,0.339632,0.306047,0.0,,0.520808
min,0.0,0.311332,0.216626,0.187583,0.048725,0.093376,0.0,,0.080841
25%,2.25,0.503036,0.519716,0.31461,0.233048,0.334157,0.0,,0.195232
50%,4.5,0.779437,0.63926,0.637575,0.538422,0.650657,0.0,,0.518178
75%,6.75,0.817336,0.795924,0.892505,0.818211,0.776026,0.0,,1.080867
max,9.0,0.979559,0.840191,0.927278,0.948411,0.97999,0.0,,1.439193


In [75]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   nueva_cero  10 non-null     int64  
 1   col1        10 non-null     float64
 2   col2        10 non-null     float64
 3   nueva_col   10 non-null     float64
 4   col4        10 non-null     float64
 5   col5        10 non-null     float64
 6   nueva       10 non-null     float64
 7   nulos       0 non-null      float64
 8   col10       10 non-null     float64
dtypes: float64(8), int64(1)
memory usage: 848.0 bytes


In [77]:
df.fillna('hola')    # fill the null values

Unnamed: 0,nueva_cero,col1,col2,nueva_col,col4,col5,nueva,nulos,col10
0,0,0.80562,0.840191,0.274637,0.533376,0.311514,0.0,hola,0.51143
1,1,0.979559,0.806676,0.187778,0.948411,0.97999,0.0,hola,1.15167
2,2,0.311332,0.561923,0.515933,0.145909,0.402086,0.0,hola,0.080841
3,3,0.549747,0.216626,0.831824,0.048725,0.60432,0.0,hola,0.123652
4,4,0.810765,0.839403,0.927278,0.543467,0.785036,0.0,hola,0.524926
5,5,0.428403,0.763667,0.759217,0.494464,0.776574,0.0,hola,0.277385
6,6,0.487465,0.280973,0.920938,0.829546,0.093376,0.0,hola,1.439193
7,7,0.89511,0.598343,0.187583,0.91272,0.774379,0.0,hola,1.365412
8,8,0.819526,0.505647,0.912732,0.103562,0.696994,0.0,hola,0.167847
9,9,0.753253,0.680178,0.434532,0.784207,0.114684,0.0,hola,0.868459


In [79]:
# option 1: reassign the result
df = df.fillna('hola')


# option 2: modify in place
df.fillna('hola', inplace=True)

In [80]:
df

Unnamed: 0,nueva_cero,col1,col2,nueva_col,col4,col5,nueva,nulos,col10
0,0,0.80562,0.840191,0.274637,0.533376,0.311514,0.0,hola,0.51143
1,1,0.979559,0.806676,0.187778,0.948411,0.97999,0.0,hola,1.15167
2,2,0.311332,0.561923,0.515933,0.145909,0.402086,0.0,hola,0.080841
3,3,0.549747,0.216626,0.831824,0.048725,0.60432,0.0,hola,0.123652
4,4,0.810765,0.839403,0.927278,0.543467,0.785036,0.0,hola,0.524926
5,5,0.428403,0.763667,0.759217,0.494464,0.776574,0.0,hola,0.277385
6,6,0.487465,0.280973,0.920938,0.829546,0.093376,0.0,hola,1.439193
7,7,0.89511,0.598343,0.187583,0.91272,0.774379,0.0,hola,1.365412
8,8,0.819526,0.505647,0.912732,0.103562,0.696994,0.0,hola,0.167847
9,9,0.753253,0.680178,0.434532,0.784207,0.114684,0.0,hola,0.868459


In [81]:
df.col1 = df.col1.fillna(0)

In [82]:
df.col1.mean()

0.6840781051045217

In [83]:
df.col1 = df.col1.fillna(df.col1.mean())

In [84]:
# build a DataFrame from a list of lists

lst_lst=[[687261, 'hola', 4728364], 
         [83546, 'adios', 58943], 
         [321, 'oo^oo']]


columnas=['num', 'palabra', 'otro_num']

In [85]:
df_lst = pd.DataFrame(lst_lst, columns=columnas)

In [86]:
df_lst

Unnamed: 0,num,palabra,otro_num
0,687261,hola,4728364.0
1,83546,adios,58943.0
2,321,oo^oo,


In [89]:
df_lst['num']

0    687261
1     83546
2       321
Name: num, dtype: int64

In [91]:
df_lst['otro_num'] = df_lst['otro_num'].fillna(0)  # reassign the column; inplace=True on a selected column may not update df_lst

In [92]:
df_lst

Unnamed: 0,num,palabra,otro_num
0,687261,hola,4728364.0
1,83546,adios,58943.0
2,321,oo^oo,0.0
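An alternative that avoids handling a single column in isolation is to pass a dictionary to `fillna`, mapping each column name to its fill value. A minimal sketch that rebuilds the same small table (the names `rows`, `frame`, and `filled` are illustrative):

```python
import pandas as pd

rows = [[687261, 'hola', 4728364],
        [83546, 'adios', 58943],
        [321, 'oo^oo']]            # short row: the missing cell becomes NaN

frame = pd.DataFrame(rows, columns=['num', 'palabra', 'otro_num'])

# Fill only 'otro_num'; other columns are left untouched
filled = frame.fillna({'otro_num': 0})

print(frame['otro_num'].isna().sum())   # nulls in the original frame
print(filled)
```

Because `fillna` returns a new DataFrame here, `frame` keeps its missing value and `filled` holds the completed table.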


In [93]:
# from a dictionary

dictio={'casa': lst_lst[0],
        'oficina': lst_lst[1],
        'numero': lst_lst[2]+[0]}

dictio

{'casa': [687261, 'hola', 4728364],
 'oficina': [83546, 'adios', 58943],
 'numero': [321, 'oo^oo', 0]}

In [94]:
df_dictio = pd.DataFrame(dictio)

df_dictio

Unnamed: 0,casa,oficina,numero
0,687261,83546,321
1,hola,adios,oo^oo
2,4728364,58943,0


In [100]:
df_dictio = df_dictio.T

In [102]:
df_dictio.columns = ['a', 'b', 'c']

In [104]:
df_dictio.index

Index(['casa', 'oficina', 'numero'], dtype='object')

In [96]:
df

Unnamed: 0,nueva_cero,col1,col2,nueva_col,col4,col5,nueva,nulos,col10
0,0,0.80562,0.840191,0.274637,0.533376,0.311514,0.0,hola,0.51143
1,1,0.979559,0.806676,0.187778,0.948411,0.97999,0.0,hola,1.15167
2,2,0.311332,0.561923,0.515933,0.145909,0.402086,0.0,hola,0.080841
3,3,0.549747,0.216626,0.831824,0.048725,0.60432,0.0,hola,0.123652
4,4,0.810765,0.839403,0.927278,0.543467,0.785036,0.0,hola,0.524926
5,5,0.428403,0.763667,0.759217,0.494464,0.776574,0.0,hola,0.277385
6,6,0.487465,0.280973,0.920938,0.829546,0.093376,0.0,hola,1.439193
7,7,0.89511,0.598343,0.187583,0.91272,0.774379,0.0,hola,1.365412
8,8,0.819526,0.505647,0.912732,0.103562,0.696994,0.0,hola,0.167847
9,9,0.753253,0.680178,0.434532,0.784207,0.114684,0.0,hola,0.868459


In [98]:
df.drop(columns = ['col1', 'nulos'], inplace=True)

In [107]:
df.drop('col2', axis=1)    # axis=1 drops columns

Unnamed: 0,nueva_cero,nueva_col,col4,col5,nueva,col10
0,0,0.274637,0.533376,0.311514,0.0,0.51143
1,1,0.187778,0.948411,0.97999,0.0,1.15167
2,2,0.515933,0.145909,0.402086,0.0,0.080841
3,3,0.831824,0.048725,0.60432,0.0,0.123652
4,4,0.927278,0.543467,0.785036,0.0,0.524926
5,5,0.759217,0.494464,0.776574,0.0,0.277385
6,6,0.920938,0.829546,0.093376,0.0,1.439193
7,7,0.187583,0.91272,0.774379,0.0,1.365412
8,8,0.912732,0.103562,0.696994,0.0,0.167847
9,9,0.434532,0.784207,0.114684,0.0,0.868459


In [108]:
df.drop(0, axis=0)    # axis=0 drops rows

Unnamed: 0,nueva_cero,col2,nueva_col,col4,col5,nueva,col10
1,1,0.806676,0.187778,0.948411,0.97999,0.0,1.15167
2,2,0.561923,0.515933,0.145909,0.402086,0.0,0.080841
3,3,0.216626,0.831824,0.048725,0.60432,0.0,0.123652
4,4,0.839403,0.927278,0.543467,0.785036,0.0,0.524926
5,5,0.763667,0.759217,0.494464,0.776574,0.0,0.277385
6,6,0.280973,0.920938,0.829546,0.093376,0.0,1.439193
7,7,0.598343,0.187583,0.91272,0.774379,0.0,1.365412
8,8,0.505647,0.912732,0.103562,0.696994,0.0,0.167847
9,9,0.680178,0.434532,0.784207,0.114684,0.0,0.868459


In [110]:
df.drop(index=[0,1,2,3,4,5,6], inplace=True)

In [119]:
df

Unnamed: 0,nueva_cero,col2,nueva_col,col4,col5,nueva,col10
7,7,0.598343,0.187583,0.91272,0.774379,0.0,1.365412
8,8,0.505647,0.912732,0.103562,0.696994,0.0,0.167847
9,9,0.680178,0.434532,0.784207,0.114684,0.0,0.868459


In [112]:
df.reset_index()

Unnamed: 0,index,nueva_cero,col2,nueva_col,col4,col5,nueva,col10
0,7,7,0.598343,0.187583,0.91272,0.774379,0.0,1.365412
1,8,8,0.505647,0.912732,0.103562,0.696994,0.0,0.167847
2,9,9,0.680178,0.434532,0.784207,0.114684,0.0,0.868459


In [117]:
df.reset_index(drop=True)

Unnamed: 0,nueva_cero,col2,nueva_col,col4,col5,nueva,col10
0,7,0.598343,0.187583,0.91272,0.774379,0.0,1.365412
1,8,0.505647,0.912732,0.103562,0.696994,0.0,0.167847
2,9,0.680178,0.434532,0.784207,0.114684,0.0,0.868459


In [116]:
df.loc[7, 'col2']

0.5983430756473155

### Operations


In [120]:
df

Unnamed: 0,nueva_cero,col2,nueva_col,col4,col5,nueva,col10
7,7,0.598343,0.187583,0.91272,0.774379,0.0,1.365412
8,8,0.505647,0.912732,0.103562,0.696994,0.0,0.167847
9,9,0.680178,0.434532,0.784207,0.114684,0.0,0.868459


In [121]:
df.T

Unnamed: 0,7,8,9
nueva_cero,7.0,8.0,9.0
col2,0.598343,0.505647,0.680178
nueva_col,0.187583,0.912732,0.434532
col4,0.91272,0.103562,0.784207
col5,0.774379,0.696994,0.114684
nueva,0.0,0.0,0.0
col10,1.365412,0.167847,0.868459


In [122]:
df.transpose()

Unnamed: 0,7,8,9
nueva_cero,7.0,8.0,9.0
col2,0.598343,0.505647,0.680178
nueva_col,0.187583,0.912732,0.434532
col4,0.91272,0.103562,0.784207
col5,0.774379,0.696994,0.114684
nueva,0.0,0.0,0.0
col10,1.365412,0.167847,0.868459


In [124]:
df.T.index

Index(['nueva_cero', 'col2', 'nueva_col', 'col4', 'col5', 'nueva', 'col10'], dtype='object')

In [125]:
df.columns

Index(['nueva_cero', 'col2', 'nueva_col', 'col4', 'col5', 'nueva', 'col10'], dtype='object')

In [126]:
df.index

RangeIndex(start=7, stop=10, step=1)

In [128]:
df.sum()

nueva_cero    24.000000
col2           1.784167
nueva_col      1.534847
col4           1.800489
col5           1.586057
nueva          0.000000
col10          2.401718
dtype: float64

In [129]:
df.sum(axis=0)

nueva_cero    24.000000
col2           1.784167
nueva_col      1.534847
col4           1.800489
col5           1.586057
nueva          0.000000
col10          2.401718
dtype: float64

In [130]:
df.sum(axis=1)

7    10.838438
8    10.386781
9    11.882060
dtype: float64

In [131]:
df['hola'] = 'kdejfgnr'

In [133]:
df.sum()

nueva_cero                          24
col2                          1.784167
nueva_col                     1.534847
col4                          1.800489
col5                          1.586057
nueva                              0.0
col10                         2.401718
hola          kdejfgnrkdejfgnrkdejfgnr
dtype: object

In [134]:
df.mean(numeric_only=True)   # skip the text column 'hola'

nueva_cero    8.000000
col2          0.594722
nueva_col     0.511616
col4          0.600163
col5          0.528686
nueva         0.000000
col10         0.800573
dtype: float64

In [135]:
df.std(numeric_only=True)

nueva_cero    1.000000
col2          0.087322
nueva_col     0.368668
col4          0.434843
col5          0.360618
nueva         0.000000
col10         0.601662
dtype: float64

In [136]:
df.var(numeric_only=True)

nueva_cero    1.000000
col2          0.007625
nueva_col     0.135916
col4          0.189089
col5          0.130045
nueva         0.000000
col10         0.361997
dtype: float64

In [137]:
df.median(numeric_only=True)

nueva_cero    8.000000
col2          0.598343
nueva_col     0.434532
col4          0.784207
col5          0.696994
nueva         0.000000
col10         0.868459
dtype: float64

In [138]:
df.mode()

Unnamed: 0,nueva_cero,col2,nueva_col,col4,col5,nueva,col10,hola
0,7,0.505647,0.187583,0.103562,0.114684,0.0,0.167847,kdejfgnr
1,8,0.598343,0.434532,0.784207,0.696994,,0.868459,
2,9,0.680178,0.912732,0.91272,0.774379,,1.365412,
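The aggregations in this section can be combined on a frame that mixes numbers and text. The sketch below uses a small made-up frame; `numeric_only=True`, `select_dtypes`, and `agg` are one way, among others, to restrict the statistics to numeric columns.

```python
import pandas as pd

frame = pd.DataFrame({
    'a': [1.0, 2.0, 3.0],
    'b': [4.0, 5.0, 9.0],
    'label': ['x', 'y', 'z'],
})

# Column-wise mean, ignoring the text column
col_means = frame.mean(numeric_only=True)

# Row-wise sum over the numeric columns only
row_sums = frame.select_dtypes(include='number').sum(axis=1)

# Several statistics at once, one row per statistic
summary = frame[['a', 'b']].agg(['mean', 'median', 'std'])

print(col_means)
print(row_sums)
print(summary)
```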


### Importing files

+ CSV
+ XLSX
+ XLS
+ JSON
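The readers for these formats share the same pattern: a `pd.read_*` function that receives a path or a file-like object and returns a DataFrame. As a self-contained sketch that needs no files on disk, the example below reads made-up CSV and JSON text through `io.StringIO`; the contents and variable names are invented for illustration.

```python
import io

import pandas as pd

# CSV: the first line is used as the header
csv_text = "make,year,mpg\nAudi,2015,28\nFord,1999,21\n"
from_csv = pd.read_csv(io.StringIO(csv_text))

# JSON: a list of records becomes one row per record
json_text = '[{"name": "Zoho", "offices": 1}, {"name": "Digg", "offices": 2}]'
from_json = pd.read_json(io.StringIO(json_text))

print(from_csv)
print(from_json)
```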

In [141]:
# csv


ruta = '../data/vehicles.csv'


df_csv = pd.read_csv(ruta)


df_csv.head(1)

Unnamed: 0,Make,Model,Year,Engine Displacement,Cylinders,Transmission,Drivetrain,Vehicle Class,Fuel Type,Fuel Barrels/Year,City MPG,Highway MPG,Combined MPG,CO2 Emission Grams/Mile,Fuel Cost/Year
0,AM General,DJ Po Vehicle 2WD,1984,2.5,4.0,Automatic 3-spd,2-Wheel Drive,Special Purpose Vehicle 2WD,Regular,19.388824,18,17,17,522.764706,1950


In [142]:
df_csv.shape

(35952, 15)

In [144]:
df_csv.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35952 entries, 0 to 35951
Data columns (total 15 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Make                     35952 non-null  object 
 1   Model                    35952 non-null  object 
 2   Year                     35952 non-null  int64  
 3   Engine Displacement      35952 non-null  float64
 4   Cylinders                35952 non-null  float64
 5   Transmission             35952 non-null  object 
 6   Drivetrain               35952 non-null  object 
 7   Vehicle Class            35952 non-null  object 
 8   Fuel Type                35952 non-null  object 
 9   Fuel Barrels/Year        35952 non-null  float64
 10  City MPG                 35952 non-null  int64  
 11  Highway MPG              35952 non-null  int64  
 12  Combined MPG             35952 non-null  int64  
 13  CO2 Emission Grams/Mile  35952 non-null  float64
 14  Fuel Cost/Year        

In [None]:
%pip install openpyxl
%pip install xlrd

In [145]:
# xlsx

df_xlsx = pd.read_excel('../data/Online Retail.xlsx')

df_xlsx.head()

Unnamed: 0,InvoiceNo,InvoiceDate,StockCode,Description,Quantity,UnitPrice,Revenue,CustomerID,Country
0,536365,2010-12-01 08:26:00,85123A,CREAM HANGING HEART T-LIGHT HOLDER,6,2.55,15.3,17850,United Kingdom
1,536373,2010-12-01 09:02:00,85123A,CREAM HANGING HEART T-LIGHT HOLDER,6,2.55,15.3,17850,United Kingdom
2,536375,2010-12-01 09:32:00,85123A,CREAM HANGING HEART T-LIGHT HOLDER,6,2.55,15.3,17850,United Kingdom
3,536390,2010-12-01 10:19:00,85123A,CREAM HANGING HEART T-LIGHT HOLDER,64,2.55,163.2,17511,United Kingdom
4,536394,2010-12-01 10:39:00,85123A,CREAM HANGING HEART T-LIGHT HOLDER,32,2.55,81.6,13408,United Kingdom


In [146]:
df_xlsx.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 396034 entries, 0 to 396033
Data columns (total 9 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   InvoiceNo    396034 non-null  int64         
 1   InvoiceDate  396034 non-null  datetime64[ns]
 2   StockCode    396034 non-null  object        
 3   Description  396034 non-null  object        
 4   Quantity     396034 non-null  int64         
 5   UnitPrice    396034 non-null  float64       
 6   Revenue      396034 non-null  float64       
 7   CustomerID   396034 non-null  int64         
 8   Country      396034 non-null  object        
dtypes: datetime64[ns](1), float64(2), int64(3), object(3)
memory usage: 90.7 MB


In [148]:
# xls

df_xls = pd.read_excel('../data/Sensor Data.xls')

df_xls.head()

Unnamed: 0,Input 1,Input 2,Input 3,Input 4,Input 5,Input 6,Input 7,Input 8,Input 9,Input 10,Input 11,Input 12,output1,output2,class
0,1.473,2.311,3.179,2.666,0.2795,0.2771,0.2234,0.1855,0.2539,1.138,1.111,4.712,1,1,one
1,1.46,2.377,3.214,2.92,0.2527,0.3064,0.02563,0.1965,0.3027,1.213,1.027,5.463,1,1,one
2,1.552,2.164,3.064,2.745,0.282,0.21,0.1721,0.1929,0.21,1.221,1.058,5.332,1,1,one
3,1.605,2.228,3.149,2.834,0.2917,0.3613,0.2087,0.1294,0.2734,1.144,1.062,4.829,1,1,one
4,1.534,2.114,3.309,2.976,0.21,0.2502,0.2258,0.177,0.2039,1.254,1.112,5.734,1,1,one


In [149]:
df_xls = pd.read_excel('../data/Sensor Data.xls', 'Sheet1')

df_xls.head()

Unnamed: 0,Input 1,Input 2,Input 3,Input 4,Input 5,Input 6,Input 7,Input 8,Input 9,Input 10,Input 11,Input 12,output1,output2,class
0,1.473,2.311,3.179,2.666,0.2795,0.2771,0.2234,0.1855,0.2539,1.138,1.111,4.712,1,1,one
1,1.46,2.377,3.214,2.92,0.2527,0.3064,0.02563,0.1965,0.3027,1.213,1.027,5.463,1,1,one
2,1.552,2.164,3.064,2.745,0.282,0.21,0.1721,0.1929,0.21,1.221,1.058,5.332,1,1,one
3,1.605,2.228,3.149,2.834,0.2917,0.3613,0.2087,0.1294,0.2734,1.144,1.062,4.829,1,1,one
4,1.534,2.114,3.309,2.976,0.21,0.2502,0.2258,0.177,0.2039,1.254,1.112,5.734,1,1,one


In [150]:
df_xls = pd.read_excel('../data/Sensor Data.xls', 'Sheet2')

df_xls.head()

Unnamed: 0,Sensor Data
0,The data source as well as the exact nature of...
1,Each data instance contains 12 real-valued inp...
2,represents a sensor designed to detect the pre...
3,"of substances. As an alternative, the sensor r..."
4,


In [151]:
df_xls = pd.read_excel('../data/Sensor Data.xls', 'Sheet3')

df_xls.head()

Unnamed: 0,hola,amiguis,estamos,probando,pandas


In [152]:
df_xls = pd.read_excel('../data/Sensor Data.xls', 'Hoja1')

df_xls.head()

Unnamed: 0,32,42,4q34q34
0,23r,4,
1,,42,


In [153]:
df_xls = pd.read_excel('../data/Sensor Data.xls', 0)

df_xls.head()

Unnamed: 0,Input 1,Input 2,Input 3,Input 4,Input 5,Input 6,Input 7,Input 8,Input 9,Input 10,Input 11,Input 12,output1,output2,class
0,1.473,2.311,3.179,2.666,0.2795,0.2771,0.2234,0.1855,0.2539,1.138,1.111,4.712,1,1,one
1,1.46,2.377,3.214,2.92,0.2527,0.3064,0.02563,0.1965,0.3027,1.213,1.027,5.463,1,1,one
2,1.552,2.164,3.064,2.745,0.282,0.21,0.1721,0.1929,0.21,1.221,1.058,5.332,1,1,one
3,1.605,2.228,3.149,2.834,0.2917,0.3613,0.2087,0.1294,0.2734,1.144,1.062,4.829,1,1,one
4,1.534,2.114,3.309,2.976,0.21,0.2502,0.2258,0.177,0.2039,1.254,1.112,5.734,1,1,one


In [154]:
df_xls = pd.read_excel('../data/Sensor Data.xls', 1)

df_xls.head()

Unnamed: 0,Sensor Data
0,The data source as well as the exact nature of...
1,Each data instance contains 12 real-valued inp...
2,represents a sensor designed to detect the pre...
3,"of substances. As an alternative, the sensor r..."
4,


In [155]:
df_xls = pd.read_excel('../data/Sensor Data.xls', 2)

df_xls.head()

Unnamed: 0,hola,amiguis,estamos,probando,pandas


In [156]:
df_xls = pd.read_excel('../data/Sensor Data.xls', 3)

df_xls.head()

Unnamed: 0,32,42,4q34q34
0,23r,4,
1,,42,


In [159]:
xls = pd.ExcelFile('../data/Sensor Data.xls')

xls.sheet_names

['Sheet1', 'Sheet2', 'Sheet3', 'Hoja1']

In [160]:
f'h_{0}' = 0   # a computed string cannot be the target of an assignment; a dict is used instead

SyntaxError: cannot assign to f-string expression (1875232999.py, line 1)

In [161]:
xls

<pandas.io.excel._base.ExcelFile at 0x177ae6d30>

In [162]:
# one DataFrame per sheet, keyed by sheet name
dictio_df = {}

for hoja in xls.sheet_names:
    dictio_df[hoja] = pd.read_excel('../data/Sensor Data.xls', hoja)

In [166]:
dictio_df['Sheet1'].head()

Unnamed: 0,Input 1,Input 2,Input 3,Input 4,Input 5,Input 6,Input 7,Input 8,Input 9,Input 10,Input 11,Input 12,output1,output2,class
0,1.473,2.311,3.179,2.666,0.2795,0.2771,0.2234,0.1855,0.2539,1.138,1.111,4.712,1,1,one
1,1.46,2.377,3.214,2.92,0.2527,0.3064,0.02563,0.1965,0.3027,1.213,1.027,5.463,1,1,one
2,1.552,2.164,3.064,2.745,0.282,0.21,0.1721,0.1929,0.21,1.221,1.058,5.332,1,1,one
3,1.605,2.228,3.149,2.834,0.2917,0.3613,0.2087,0.1294,0.2734,1.144,1.062,4.829,1,1,one
4,1.534,2.114,3.309,2.976,0.21,0.2502,0.2258,0.177,0.2039,1.254,1.112,5.734,1,1,one


In [165]:
dictio_df.keys()

dict_keys(['Sheet1', 'Sheet2', 'Sheet3', 'Hoja1'])

In [167]:
# json


df_json = pd.read_json('../data/oficinas.json')

df_json.head()

Unnamed: 0,name,totalOffices,lat,lng,principal
0,Wetpaint,2,47.603122,-122.333253,"{'type': 'Point', 'coordinates': [-122.333253,..."
1,AdventNet,1,37.692934,-121.904945,"{'type': 'Point', 'coordinates': [-121.904945,..."
2,Zoho,1,37.692934,-121.904945,"{'type': 'Point', 'coordinates': [-121.904945,..."
3,Digg,1,37.764726,-122.394523,"{'type': 'Point', 'coordinates': [-122.394523,..."
4,Facebook,3,37.41605,-122.151801,"{'type': 'Point', 'coordinates': [-122.151801,..."


In [168]:
# add 20 dummy columns so the frame is wider than the default display limit
for i in range(20):
    df_json[i] = 0

In [171]:
df_json.head()

Unnamed: 0,name,totalOffices,lat,lng,principal,0,1,2,3,4,...,10,11,12,13,14,15,16,17,18,19
0,Wetpaint,2,47.603122,-122.333253,"{'type': 'Point', 'coordinates': [-122.333253,...",0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,AdventNet,1,37.692934,-121.904945,"{'type': 'Point', 'coordinates': [-121.904945,...",0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Zoho,1,37.692934,-121.904945,"{'type': 'Point', 'coordinates': [-121.904945,...",0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Digg,1,37.764726,-122.394523,"{'type': 'Point', 'coordinates': [-122.394523,...",0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Facebook,3,37.41605,-122.151801,"{'type': 'Point', 'coordinates': [-122.151801,...",0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [172]:
# show every column and every row instead of truncating the output with '...'
pd.set_option('display.max_columns', None)

pd.set_option('display.max_rows', None)
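`pd.set_option` changes the setting for the rest of the session. When the wider display is needed only once, `pd.option_context` applies it temporarily. The sketch below uses a made-up 30-column frame named `wide`.

```python
import pandas as pd

wide = pd.DataFrame([[0] * 30], columns=[f'c{i}' for i in range(30)])

limit_before = pd.get_option('display.max_columns')

# The option is changed only inside the with block
with pd.option_context('display.max_columns', None):
    limit_inside = pd.get_option('display.max_columns')
    print(wide)

limit_after = pd.get_option('display.max_columns')
print(limit_before, limit_inside, limit_after)
```

A setting changed with `pd.set_option` can be returned to its default with `pd.reset_option('display.max_columns')`.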

In [173]:
df_json.head()

Unnamed: 0,name,totalOffices,lat,lng,principal,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,Wetpaint,2,47.603122,-122.333253,"{'type': 'Point', 'coordinates': [-122.333253,...",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,AdventNet,1,37.692934,-121.904945,"{'type': 'Point', 'coordinates': [-121.904945,...",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,Zoho,1,37.692934,-121.904945,"{'type': 'Point', 'coordinates': [-121.904945,...",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,Digg,1,37.764726,-122.394523,"{'type': 'Point', 'coordinates': [-122.394523,...",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,Facebook,3,37.41605,-122.151801,"{'type': 'Point', 'coordinates': [-122.151801,...",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
