**pandas** is a column-oriented data analysis API. It is a great tool for handling and analyzing input data, and many ML frameworks accept pandas data structures as inputs. Although a comprehensive introduction to the pandas API would span many pages, the core concepts are fairly straightforward, and we present them below. For a more complete reference, the [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/index.html) site contains extensive documentation and many tutorials.

In [5]:
import numpy as np

In [2]:
import pandas as pd

In [3]:
s = pd.Series([4, 3, 7, 8, 6, 8])
s

0    4
1    3
2    7
3    8
4    6
5    8
dtype: int64

In [7]:
s2 = pd.Series([1, np.nan,8,9,12])
s2

0     1.0
1     NaN
2     8.0
3     9.0
4    12.0
dtype: float64
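
The `float64` dtype of `s2` above follows from the missing value: `np.nan` is a floating-point value, so a Series that mixes integers with `np.nan` is upcast to float. The sketch below shows this and two common ways of handling the gap. The name `s_missing` is only illustrative.

```python
import numpy as np
import pandas as pd

# Same values as s2 above; s_missing is just an illustrative name.
s_missing = pd.Series([1, np.nan, 8, 9, 12])

print(s_missing.dtype)                     # float64, because NaN is a float
n_missing = int(s_missing.isna().sum())    # count the missing entries
print(n_missing)                           # 1

filled = s_missing.fillna(0)               # returns a new Series with NaN replaced by 0
print(filled.tolist())                     # [1.0, 0.0, 8.0, 9.0, 12.0]
```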

Working with dates

In [9]:
datas = pd.date_range('20200101', periods=6)
datas

DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
               '2020-01-05', '2020-01-06'],
              dtype='datetime64[ns]', freq='D')
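
As a small sketch of how `pd.date_range` can be parameterized, the same six days can be requested either by a start plus a number of periods or by a start and an end. The variable names here are illustrative.

```python
import pandas as pd

# Six consecutive days; daily frequency ("D") is the default.
days = pd.date_range("2020-01-01", periods=6, freq="D")

# The same range given by explicit start and end (both endpoints included).
days_by_end = pd.date_range(start="2020-01-01", end="2020-01-06")

print(days.equals(days_by_end))   # True
print(days[0].day_name())         # Wednesday
```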

Creating **DataFrame** data structures

In [28]:
novodf = pd.DataFrame({ 'Nome':['Luisa','Julia','Marcos','Andre','Marcia'],
                        'Idade':[23,28,32,41,37],
                        'Renda':[12550,23000,7200,8600,2400]})
novodf

,Nome,Idade,Renda
0,Luisa,23,12550
1,Julia,28,23000
2,Marcos,32,7200
3,Andre,41,8600
4,Marcia,37,2400
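
Once a DataFrame has been built from a dictionary of columns, a few attributes give a quick overview of it. The following is a minimal sketch that rebuilds the same data under the illustrative name `people`.

```python
import pandas as pd

people = pd.DataFrame({
    "Nome": ["Luisa", "Julia", "Marcos", "Andre", "Marcia"],
    "Idade": [23, 28, 32, 41, 37],
    "Renda": [12550, 23000, 7200, 8600, 2400],
})

print(people.shape)                    # (5, 3): five rows, three columns
print(people["Idade"].dtype)           # int64
mean_income = people["Renda"].mean()   # column-wise aggregation
print(mean_income)                     # 10750.0
```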


Creating from an array

In [15]:
matrizdf = np.array([['Joao',25,1995,2016],['Maria',47,1973,2000],['Luisa',38,1982,2005]], dtype=object)
matrizdf

array([['Joao', 25, 1995, 2016],
       ['Maria', 47, 1973, 2000],
       ['Luisa', 38, 1982, 2005]], dtype=object)

Assigning column names

In [17]:
df = pd.DataFrame(matrizdf, columns = ['Nome','Idade','Ano de nascimento','Formado em'])
df.head()


,Nome,Idade,Ano de nascimento,Formado em
0,Joao,25,1995,2016
1,Maria,47,1973,2000
2,Luisa,38,1982,2005
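
One consequence of building a DataFrame from a NumPy array with `dtype=object` is that every column keeps the `object` dtype, including the numeric ones. The sketch below shows one possible way to convert them with `astype`. The names `raw`, `people`, and `typed` are illustrative.

```python
import numpy as np
import pandas as pd

raw = np.array([["Joao", 25, 1995, 2016],
                ["Maria", 47, 1973, 2000],
                ["Luisa", 38, 1982, 2005]], dtype=object)
cols = ["Nome", "Idade", "Ano de nascimento", "Formado em"]
people = pd.DataFrame(raw, columns=cols)

print(people["Idade"].dtype)   # object, inherited from the array

# Convert the numeric columns explicitly so they behave as integers.
typed = people.astype({"Idade": "int64",
                       "Ano de nascimento": "int64",
                       "Formado em": "int64"})
print(typed["Idade"].dtype)    # int64
```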


Creating with NumPy options

In [29]:
df2 = pd.DataFrame({'A': 2.,
                    'B': pd.Timestamp('20200503'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(["teste", "treinamento", "teste", "treinamento"]),
                    'F': 'Validado'})
df2

,A,B,C,D,E,F
0,2.0,2020-05-03,1.0,3,teste,Validado
1,2.0,2020-05-03,1.0,3,treinamento,Validado
2,2.0,2020-05-03,1.0,3,teste,Validado
3,2.0,2020-05-03,1.0,3,treinamento,Validado
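
In a DataFrame like `df2`, scalar values are broadcast to every row and each column keeps its own dtype. The sketch below rebuilds a reduced version (only columns A, C, D, and E, under the illustrative name `mixed`) and inspects the result.

```python
import numpy as np
import pandas as pd

mixed = pd.DataFrame({
    "A": 2.0,                                              # scalar, repeated on every row
    "C": pd.Series(1, index=range(4), dtype="float32"),    # sets the row index 0..3
    "D": np.array([3] * 4, dtype="int32"),
    "E": pd.Categorical(["teste", "treinamento", "teste", "treinamento"]),
})

print(mixed.dtypes)                      # one dtype per column
print(mixed.shape)                       # (4, 4)
print(list(mixed["E"].cat.categories))   # ['teste', 'treinamento']
```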


Using a DataFrame built from random numeric values

In [37]:
ndf = pd.DataFrame(np.random.randn(6, 5), index=datas, columns=list('ABCDE'))
ndf

,A,B,C,D,E
2020-01-01,-0.085765,-1.994739,-0.612832,-0.994954,-0.167886
2020-01-02,0.790734,1.103944,0.177617,1.372493,1.475589
2020-01-03,0.37082,0.668729,-1.673084,1.337538,-0.466822
2020-01-04,0.623525,-1.845401,-0.944458,0.22356,-0.412013
2020-01-05,1.369901,-1.366934,0.157924,-0.350285,0.473977
2020-01-06,0.683218,1.283325,1.455264,1.070301,-0.562917
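
Because the values above come from an unseeded random generator, they change on every run. If reproducible numbers are wanted, one option is a seeded NumPy `Generator`, sketched below. The helper `make_frame` and the seed value are assumptions made for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2020-01-01", periods=6)

def make_frame(seed):
    """Build a 6x5 frame of standard-normal draws from a seeded generator."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame(rng.standard_normal((6, 5)),
                        index=dates, columns=list("ABCDE"))

first = make_frame(42)
second = make_frame(42)
print(first.equals(second))   # True: same seed, same values
```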


Transposing a DataFrame

In [38]:
ndf.T

,2020-01-01,2020-01-02,2020-01-03,2020-01-04,2020-01-05,2020-01-06
A,-0.085765,0.790734,0.37082,0.623525,1.369901,0.683218
B,-1.994739,1.103944,0.668729,-1.845401,-1.366934,1.283325
C,-0.612832,0.177617,-1.673084,-0.944458,0.157924,1.455264
D,-0.994954,1.372493,1.337538,0.22356,-0.350285,1.070301
E,-0.167886,1.475589,-0.466822,-0.412013,0.473977,-0.562917
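
Transposition swaps the row index with the column labels. A minimal sketch on a small, illustrative frame makes the effect easy to check:

```python
import pandas as pd

small = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}, index=["x", "y", "z"])
flipped = small.T

print(small.shape, flipped.shape)   # (3, 2) (2, 3)
print(list(flipped.index))          # ['A', 'B']
print(list(flipped.columns))        # ['x', 'y', 'z']
```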


Sorting a DataFrame

In [39]:
ndf.sort_index(axis=1, ascending=False)

,E,D,C,B,A
2020-01-01,-0.167886,-0.994954,-0.612832,-1.994739,-0.085765
2020-01-02,1.475589,1.372493,0.177617,1.103944,0.790734
2020-01-03,-0.466822,1.337538,-1.673084,0.668729,0.37082
2020-01-04,-0.412013,0.22356,-0.944458,-1.845401,0.623525
2020-01-05,0.473977,-0.350285,0.157924,-1.366934,1.369901
2020-01-06,-0.562917,1.070301,1.455264,1.283325,0.683218


In [40]:
ndf.sort_index(axis=1, ascending=True)

,A,B,C,D,E
2020-01-01,-0.085765,-1.994739,-0.612832,-0.994954,-0.167886
2020-01-02,0.790734,1.103944,0.177617,1.372493,1.475589
2020-01-03,0.37082,0.668729,-1.673084,1.337538,-0.466822
2020-01-04,0.623525,-1.845401,-0.944458,0.22356,-0.412013
2020-01-05,1.369901,-1.366934,0.157924,-0.350285,0.473977
2020-01-06,0.683218,1.283325,1.455264,1.070301,-0.562917
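
The calls above sort by column labels (`axis=1`). With `axis=0`, `sort_index` sorts by the row index instead. The sketch below uses a small illustrative frame and also shows that the original is left unchanged, since a new DataFrame is returned.

```python
import pandas as pd

small = pd.DataFrame({"B": [1, 2, 3], "A": [4, 5, 6]},
                     index=pd.date_range("2020-01-01", periods=3))

by_rows_desc = small.sort_index(axis=0, ascending=False)   # newest date first
by_cols_asc = small.sort_index(axis=1)                     # columns in order A, B

print(by_rows_desc.index[0].date())   # 2020-01-03
print(list(by_cols_asc.columns))      # ['A', 'B']
print(list(small.columns))            # ['B', 'A'] (the original is unchanged)
```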


Sorting by values

In [41]:
ndf.sort_values(by='B')

,A,B,C,D,E
2020-01-01,-0.085765,-1.994739,-0.612832,-0.994954,-0.167886
2020-01-04,0.623525,-1.845401,-0.944458,0.22356,-0.412013
2020-01-05,1.369901,-1.366934,0.157924,-0.350285,0.473977
2020-01-03,0.37082,0.668729,-1.673084,1.337538,-0.466822
2020-01-02,0.790734,1.103944,0.177617,1.372493,1.475589
2020-01-06,0.683218,1.283325,1.455264,1.070301,-0.562917


In [42]:
ndf.sort_values(by='B', ascending=False)

,A,B,C,D,E
2020-01-06,0.683218,1.283325,1.455264,1.070301,-0.562917
2020-01-02,0.790734,1.103944,0.177617,1.372493,1.475589
2020-01-03,0.37082,0.668729,-1.673084,1.337538,-0.466822
2020-01-05,1.369901,-1.366934,0.157924,-0.350285,0.473977
2020-01-04,0.623525,-1.845401,-0.944458,0.22356,-0.412013
2020-01-01,-0.085765,-1.994739,-0.612832,-0.994954,-0.167886
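
`sort_values` also accepts a list of columns, with one sort direction per column. A minimal sketch, using an illustrative frame named `scores`:

```python
import pandas as pd

scores = pd.DataFrame({"group": ["b", "a", "b", "a"],
                       "value": [2, 9, 7, 1]})

# Sort by group ascending, then by value descending within each group.
ordered = scores.sort_values(by=["group", "value"], ascending=[True, False])

print(ordered["value"].tolist())   # [9, 1, 7, 2]
print(ordered.index.tolist())      # [1, 3, 2, 0]: the original row labels are kept
```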


### Selecting a variable/column to get a Series

In [43]:
ndf['C']

2020-01-01   -0.612832
2020-01-02    0.177617
2020-01-03   -1.673084
2020-01-04   -0.944458
2020-01-05    0.157924
2020-01-06    1.455264
Freq: D, Name: C, dtype: float64

### Selecting values/rows

In [44]:
ndf[0:4]

,A,B,C,D,E
2020-01-01,-0.085765,-1.994739,-0.612832,-0.994954,-0.167886
2020-01-02,0.790734,1.103944,0.177617,1.372493,1.475589
2020-01-03,0.37082,0.668729,-1.673084,1.337538,-0.466822
2020-01-04,0.623525,-1.845401,-0.944458,0.22356,-0.412013


### Selecting across rows and columns

In [45]:
ndf['E'][0:4]

2020-01-01   -0.167886
2020-01-02    1.475589
2020-01-03   -0.466822
2020-01-04   -0.412013
Freq: D, Name: E, dtype: float64
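
The expression `ndf['E'][0:4]` indexes in two steps: first the column, then the rows. The same selection can be written in a single step with `.loc` or `.iloc`, which is generally the safer habit when values will later be assigned. The sketch below uses a deterministic illustrative frame so that the results can be compared.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2020-01-01", periods=6)
frame = pd.DataFrame(np.arange(30).reshape(6, 5),
                     index=dates, columns=list("ABCDE"))

chained = frame["E"][0:4]                       # two indexing steps
by_label = frame.loc[dates[0]:dates[3], "E"]    # one step; label slices include the end
col_e = frame.columns.get_loc("E")              # integer position of column E
by_position = frame.iloc[0:4, col_e]            # one step; positional end is excluded

print(chained.equals(by_label), chained.equals(by_position))   # True True
print(chained.tolist())                                        # [4, 9, 14, 19]
```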

### Selecting by label

In [47]:
ndf.loc[datas[0]]

A   -0.085765
B   -1.994739
C   -0.612832
D   -0.994954
E   -0.167886
Name: 2020-01-01 00:00:00, dtype: float64

In [49]:
ndf.loc[:, ['B', 'C']]

,B,C
2020-01-01,-1.994739,-0.612832
2020-01-02,1.103944,0.177617
2020-01-03,0.668729,-1.673084
2020-01-04,-1.845401,-0.944458
2020-01-05,-1.366934,0.157924
2020-01-06,1.283325,1.455264


### Selecting a specific part of the data

In [52]:
ndf.loc['20200102':'20200104', ['C', 'D']]

,C,D
2020-01-02,0.177617,1.372493
2020-01-03,-1.673084,1.337538
2020-01-04,-0.944458,0.22356
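
Note that the label slice above returns three rows, including the one for `2020-01-04`: slices with `.loc` include both endpoints, whereas positional slices with `.iloc` exclude the end. A small sketch with an illustrative Series:

```python
import pandas as pd

dates = pd.date_range("2020-01-01", periods=6)
values = pd.Series(range(6), index=dates)

by_label = values.loc["2020-01-02":"2020-01-04"]   # includes 2020-01-04
by_position = values.iloc[1:4]                     # positions 1, 2, 3

print(by_label.tolist())              # [1, 2, 3]
print(by_label.equals(by_position))   # True
print(values.iloc[1:3].tolist())      # [1, 2]: the positional end is excluded
```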


### Retrieving specific values

In [54]:
ndf.loc['20200103', ['B', 'D']]

B    0.668729
D    1.337538
Name: 2020-01-03 00:00:00, dtype: float64

### Retrieving a single value

In [56]:
ndf.loc[datas[2], 'D']

1.3375381742975472
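
For single values, pandas also provides the scalar accessors `.at` (by label) and `.iat` (by position), which return the same element as `.loc` and `.iloc`. A minimal sketch on an illustrative frame with known values:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2020-01-01", periods=6)
frame = pd.DataFrame(np.arange(30.0).reshape(6, 5),
                     index=dates, columns=list("ABCDE"))

via_loc = frame.loc[dates[2], "D"]   # general label-based selection
via_at = frame.at[dates[2], "D"]     # label-based scalar access
via_iat = frame.iat[2, 3]            # position-based scalar access

print(via_loc, via_at, via_iat)      # 13.0 13.0 13.0
```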

### Other kinds of selection

In [60]:
ndf.iloc[4]

A    1.369901
B   -1.366934
C    0.157924
D   -0.350285
E    0.473977
Name: 2020-01-05 00:00:00, dtype: float64

In [62]:
ndf.iloc[2:5, 1:3]

,B,C
2020-01-03,0.668729,-1.673084
2020-01-04,-1.845401,-0.944458
2020-01-05,-1.366934,0.157924


In [67]:
ndf.iloc[1:4, :]

,A,B,C,D,E
2020-01-02,0.790734,1.103944,0.177617,1.372493,1.475589
2020-01-03,0.37082,0.668729,-1.673084,1.337538,-0.466822
2020-01-04,0.623525,-1.845401,-0.944458,0.22356,-0.412013


**Using lists**

In [66]:
ndf.iloc[[1, 3, 5], [1, 3]]

,B,D
2020-01-02,1.103944,1.372493
2020-01-04,-1.845401,0.22356
2020-01-06,1.283325,1.070301
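
Since `.iloc` is purely positional, it follows the usual Python conventions for negative indices and stepped slices. The sketch below illustrates this on a small frame with known values. The names are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(30).reshape(6, 5), columns=list("ABCDE"))

last_row = frame.iloc[-1]               # negative positions count from the end
every_other = frame.iloc[::2, [0, -1]]  # rows 0, 2, 4 and the first and last columns

print(last_row.tolist())                # [25, 26, 27, 28, 29]
print(every_other.to_numpy().tolist())  # [[0, 4], [10, 14], [20, 24]]
```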


### Using Boolean indexing

In [71]:
ndf[ndf > 0]

,A,B,C,D,E
2020-01-01,,,,,
2020-01-02,0.790734,1.103944,0.177617,1.372493,1.475589
2020-01-03,0.37082,0.668729,,1.337538,
2020-01-04,0.623525,,,0.22356,
2020-01-05,1.369901,,0.157924,,0.473977
2020-01-06,0.683218,1.283325,1.455264,1.070301,


In [70]:
ndf[ndf['C'] > 0]

,A,B,C,D,E
2020-01-02,0.790734,1.103944,0.177617,1.372493,1.475589
2020-01-05,1.369901,-1.366934,0.157924,-0.350285,0.473977
2020-01-06,0.683218,1.283325,1.455264,1.070301,-0.562917
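
Boolean masks can be combined. As a sketch on a small illustrative frame: each condition goes in its own parentheses and is joined with `&` (and), `|` (or), or `~` (not), while `isin` tests membership in a list of values.

```python
import pandas as pd

frame = pd.DataFrame({"A": [1, -2, 3, -4], "C": [-1, 5, 2, 8]})

# Keep rows where both columns are positive.
both_positive = frame[(frame["A"] > 0) & (frame["C"] > 0)]
print(both_positive.index.tolist())   # [2]

# Keep rows whose C value is one of the listed values.
in_set = frame[frame["C"].isin([5, 8])]
print(in_set["A"].tolist())           # [-2, -4]
```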



### Converting to an array



In [58]:
ndf.to_numpy()


array([[-0.085765, -1.994739, -0.612832, -0.994954, -0.167886],
       [ 0.790734,  1.103944,  0.177617,  1.372493,  1.475589],
       [ 0.37082 ,  0.668729, -1.673084,  1.337538, -0.466822],
       [ 0.623525, -1.845401, -0.944458,  0.22356 , -0.412013],
       [ 1.369901, -1.366934,  0.157924, -0.350285,  0.473977],
       [ 0.683218,  1.283325,  1.455264,  1.070301, -0.562917]])
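
A detail worth checking here: `to_numpy` is a method, so it has to be called with parentheses. Without them, the expression only refers to the bound method and the notebook prints its representation rather than an array. A minimal sketch on an illustrative frame:

```python
import pandas as pd

frame = pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, 4.0]}, index=["x", "y"])

method = frame.to_numpy      # a bound method object, not the data
array = frame.to_numpy()     # a NumPy array holding the values

print(callable(method))      # True
print(type(array).__name__)  # ndarray
print(array.shape)           # (2, 2)
print(array.tolist())        # [[1.0, 3.0], [2.0, 4.0]]
```

The index and column labels are not part of the resulting array, so they need to be kept separately if they are still required.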