In [56]:
import numpy as np
import pandas as pd

# Basic data structures in pandas

pandas provides two classes for working with data:

1. **Series:**  
  - A one-dimensional labeled array holding data of any type, such as integers, strings, or Python objects.

2. **DataFrame:** 
  - A two-dimensional data structure that holds data like a two-dimensional array or a table with rows and columns.

# Object creation

[See the Intro to data structures section.](Intro_to_data_structures.ipynb)

- Creating a Series by passing a list of values, letting pandas create a default RangeIndex:

In [57]:
s = pd.Series([1, 3, 5, np.nan, 6, 8])
display(s)

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
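
As a contrast to the default RangeIndex shown above, the sketch below passes explicit labels through the `index` argument. The labels `a`, `b`, `c` and the names `s_default` and `s_labeled` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Default behaviour: pandas builds a RangeIndex (0, 1, 2, ...)
s_default = pd.Series([1, 3, 5, np.nan, 6, 8])

# Explicit labels (hypothetical, for illustration only)
s_labeled = pd.Series([10, 20, 30], index=["a", "b", "c"])

print(type(s_default.index).__name__)  # RangeIndex
print(s_default.dtype)                 # float64, because NaN forces a float dtype
print(s_labeled["b"])                  # 20
```

The integer inputs end up as `float64` only because of the `np.nan` entry; without it the Series would keep an integer dtype.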

- Creating a DataFrame by passing a NumPy array, with a datetime index built using date_range() and labeled columns:

In [58]:
# date_range builds a fixed-frequency range of dates

dates = pd.date_range("20130101", periods=6)
display(dates)

df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
display(df)

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

                   A         B         C         D
2013-01-01  0.050611  0.215956   0.45306  1.007216
2013-01-02 -0.029879 -0.964668  1.034395 -0.119978
2013-01-03  0.855928 -0.447104  0.019303  0.000129
2013-01-04  -0.21826   -0.5394  0.238007  0.836796
2013-01-05 -0.000812  0.573812  0.383736  1.239562
2013-01-06   0.39443 -0.371078  0.010284   -0.1017
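
Because `np.random.randn` draws new values on every run, the numbers printed above will differ between executions. The following is a minimal sketch of a reproducible variant, assuming NumPy's `default_rng` generator; the seed `0` and the names `rng` and `df_seeded` are arbitrary choices made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen arbitrarily for illustration
dates = pd.date_range("20130101", periods=6)

# Same shape and labels as the example above, but with reproducible values
df_seeded = pd.DataFrame(
    rng.standard_normal((6, 4)), index=dates, columns=list("ABCD")
)

print(df_seeded.shape)             # (6, 4)
print(df_seeded.columns.tolist())  # ['A', 'B', 'C', 'D']
```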


- Creating a DataFrame by passing a dictionary of objects, where the keys are the column labels and the values are the column values:

In [59]:
df2 = pd.DataFrame(
    {
        "A": 1.0,  # A scalar float, repeated on every row.
        "B": pd.Timestamp("20130102"),  # A single timestamp, repeated on every row.
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),  # A Series of 4 elements with value 1 and dtype float32.
        "D": np.array([3] * 4, dtype="int32"),  # A NumPy array of four elements equal to 3, with dtype int32.
        "E": pd.Categorical(["test", "train", "test", "train"]),  # A categorical with the repeated values "test" and "train".
        "F": "foo",  # A scalar string, repeated on every row.
    }
)

display(df2)


     A           B    C  D      E    F
0  1.0  2013-01-02  1.0  3   test  foo
1  1.0  2013-01-02  1.0  3  train  foo
2  1.0  2013-01-02  1.0  3   test  foo
3  1.0  2013-01-02  1.0  3  train  foo


- The columns of the resulting DataFrame have different dtypes:

In [60]:
display(df2.dtypes)

A          float64
B    datetime64[s]
C          float32
D            int32
E         category
F           object
dtype: object
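
The dictionary constructor above mixes scalars with array-like values. As a rough sketch of what happens, the scalars are repeated to match the length set by the array-like columns, and each column keeps its own dtype. The frame `df_small` and its column names are hypothetical, chosen only for this illustration.

```python
import numpy as np
import pandas as pd

df_small = pd.DataFrame(
    {
        "x": 1.0,                                 # scalar, repeated on every row
        "y": np.array([3, 4, 5], dtype="int32"),  # sets the number of rows to 3
        "z": "foo",                               # scalar string, repeated
    }
)

print(len(df_small))           # 3
print(df_small["x"].tolist())  # [1.0, 1.0, 1.0]
print(df_small["y"].dtype)     # int32

# Keep only the numeric columns, based on the per-column dtypes
print(df_small.select_dtypes(include="number").columns.tolist())  # ['x', 'y']
```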

# Viewing data
[See the Essential basic functionality section.](Essential_basic_functionality.ipynb)

- Use DataFrame.head() and DataFrame.tail() to view the top and bottom rows of the frame, respectively:

In [61]:
df.head()

                   A         B         C         D
2013-01-01  0.050611  0.215956   0.45306  1.007216
2013-01-02 -0.029879 -0.964668  1.034395 -0.119978
2013-01-03  0.855928 -0.447104  0.019303  0.000129
2013-01-04  -0.21826   -0.5394  0.238007  0.836796
2013-01-05 -0.000812  0.573812  0.383736  1.239562


In [62]:
df.tail(3)

                   A         B         C         D
2013-01-04  -0.21826   -0.5394  0.238007  0.836796
2013-01-05 -0.000812  0.573812  0.383736  1.239562
2013-01-06   0.39443 -0.371078  0.010284   -0.1017
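
The two calls above rely on the row-count argument `n`, which defaults to 5 for both methods. The short sketch below makes that explicit on a hypothetical ten-row frame named `df_rows`.

```python
import pandas as pd

df_rows = pd.DataFrame({"n": range(10)})  # hypothetical frame with rows 0..9

print(len(df_rows.head()))             # 5, the default number of rows
print(df_rows.tail(3)["n"].tolist())   # [7, 8, 9]

# A negative n returns everything except the last |n| rows
print(df_rows.head(-2)["n"].tolist())  # [0, 1, 2, 3, 4, 5, 6, 7]
```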


- Display DataFrame.index or DataFrame.columns:

In [63]:
display(df.index)

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [64]:
display(df.columns)

Index(['A', 'B', 'C', 'D'], dtype='object')

- DataFrame.to_numpy() returns a NumPy representation of the underlying data, without the index or column labels:

In [65]:
df.to_numpy()

array([[ 5.06114575e-02,  2.15956198e-01,  4.53060487e-01,
         1.00721604e+00],
       [-2.98789691e-02, -9.64667631e-01,  1.03439527e+00,
        -1.19977930e-01],
       [ 8.55927524e-01, -4.47104449e-01,  1.93029753e-02,
         1.29411005e-04],
       [-2.18260261e-01, -5.39400260e-01,  2.38007097e-01,
         8.36795948e-01],
       [-8.11945889e-04,  5.73811986e-01,  3.83736211e-01,
         1.23956236e+00],
       [ 3.94429822e-01, -3.71078224e-01,  1.02836869e-02,
        -1.01700379e-01]])

* Note

    NumPy arrays have one dtype for the entire array, while pandas DataFrames have one dtype per column.  
    When you call DataFrame.to_numpy(), pandas finds the NumPy dtype that can hold all of the dtypes in the DataFrame.   
    If that common dtype is object, DataFrame.to_numpy() has to copy the data.

In [66]:
df2.dtypes

A          float64
B    datetime64[s]
C          float32
D            int32
E         category
F           object
dtype: object

In [67]:
df2.to_numpy()

array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']],
      dtype=object)
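
The note on common dtypes can be checked on small frames. This is a minimal sketch with three hypothetical frames (`homogeneous`, `mixed_numeric`, `mixed_object`) that are not part of the original notebook.

```python
import pandas as pd

homogeneous = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
mixed_numeric = pd.DataFrame({"a": [1, 2], "b": [0.5, 1.5]})
mixed_object = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

print(homogeneous.to_numpy().dtype)    # float64
print(mixed_numeric.to_numpy().dtype)  # float64 (the integer column is upcast)
print(mixed_object.to_numpy().dtype)   # object (no numeric dtype holds strings)
```

When the result is `object`, every element becomes a generic Python object, which is why the conversion is comparatively expensive for large frames.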

- describe() shows a quick statistical summary of your data:

In [68]:
df.describe()

              A         B         C         D
count       6.0       6.0       6.0       6.0
mean   0.175336 -0.255414  0.356464  0.477004
std    0.388541  0.555629  0.378739  0.618185
min    -0.21826 -0.964668  0.010284 -0.119978
25%   -0.022612 -0.516326  0.073979 -0.076243
50%      0.0249 -0.409091  0.310872  0.418463
75%    0.308475  0.069198  0.435729  0.964611
max    0.855928  0.573812  1.034395  1.239562
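
To make the rows of the summary easier to interpret, the sketch below runs `describe()` on a tiny hypothetical column `v` whose statistics can be verified by hand. The names `values` and `summary` are assumptions for this example.

```python
import pandas as pd

values = pd.DataFrame({"v": [1.0, 2.0, 3.0, 4.0]})
summary = values.describe()

print(summary.index.tolist())
# ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']

print(summary.loc["mean", "v"])  # 2.5
print(summary.loc["50%", "v"])   # 2.5, the median
print(summary.loc["25%", "v"])   # 1.75, using linear interpolation

# 'std' is the sample standard deviation (ddof=1), matching Series.std()
print(summary.loc["std", "v"] == values["v"].std())
```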


- Transposing your data swaps the axes: the columns become the index and the index becomes the columns:

In [69]:
display(df.T)

   2013-01-01  2013-01-02  2013-01-03  2013-01-04  2013-01-05  2013-01-06
A    0.050611   -0.029879    0.855928    -0.21826   -0.000812     0.39443
B    0.215956   -0.964668   -0.447104     -0.5394    0.573812   -0.371078
C     0.45306    1.034395    0.019303    0.238007    0.383736    0.010284
D    1.007216   -0.119978    0.000129    0.836796    1.239562     -0.1017


- DataFrame.sort_index() sorts by an axis:  
    It returns a new DataFrame sorted by label when the inplace argument is False (the default); otherwise it updates the original DataFrame and returns None.

In [74]:
display(df.sort_index(axis=1, ascending=False))

                   D         C         B         A
2013-01-01  1.007216   0.45306  0.215956  0.050611
2013-01-02 -0.119978  1.034395 -0.964668 -0.029879
2013-01-03  0.000129  0.019303 -0.447104  0.855928
2013-01-04  0.836796  0.238007   -0.5394  -0.21826
2013-01-05  1.239562  0.383736  0.573812 -0.000812
2013-01-06   -0.1017  0.010284 -0.371078   0.39443
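
The difference between the two `inplace` modes of `sort_index()` is easy to miss. Here is a small sketch on a hypothetical two-row frame; the names `frame`, `by_columns`, and `result` are assumptions for this example.

```python
import pandas as pd

frame = pd.DataFrame({"b": [1, 2], "a": [3, 4]}, index=["y", "x"])

# inplace=False (default): a new frame comes back, the original is untouched
by_columns = frame.sort_index(axis=1)
print(by_columns.columns.tolist())  # ['a', 'b']
print(frame.columns.tolist())       # ['b', 'a']

# inplace=True: the rows of `frame` itself are reordered and None is returned
result = frame.sort_index(inplace=True)
print(result)                       # None
print(frame.index.tolist())         # ['x', 'y']
```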


- DataFrame.sort_values() sorts by values:

In [77]:
display(df.sort_values(by="B"))

                   A         B         C         D
2013-01-02 -0.029879 -0.964668  1.034395 -0.119978
2013-01-04  -0.21826   -0.5394  0.238007  0.836796
2013-01-03  0.855928 -0.447104  0.019303  0.000129
2013-01-06   0.39443 -0.371078  0.010284   -0.1017
2013-01-01  0.050611  0.215956   0.45306  1.007216
2013-01-05 -0.000812  0.573812  0.383736  1.239562
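
`sort_values()` also accepts several columns and a sort direction per column. The sketch below uses a hypothetical `scores` frame to show one possible combination: ascending by `group`, then descending by `score` within each group.

```python
import pandas as pd

scores = pd.DataFrame(
    {"group": ["b", "a", "b", "a"], "score": [2, 5, 9, 1]}
)

# Sort by group (ascending), breaking ties by score (descending)
ordered = scores.sort_values(by=["group", "score"], ascending=[True, False])

print(ordered["score"].tolist())  # [5, 1, 9, 2]
print(ordered.index.tolist())     # [1, 3, 2, 0], original labels travel with the rows
```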


# Selection

- Note:  
While standard Python/NumPy expressions for selecting and setting are intuitive and come in handy for interactive work,  
for production code, we recommend the optimized pandas data access methods DataFrame.at, DataFrame.iat, DataFrame.loc and DataFrame.iloc, which are indexers used with square brackets.
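
As a minimal sketch of the four accessors named in the note, the example below builds a small hypothetical frame (the names `frame` and `dates` and its values are assumptions for illustration) and reads from it by label and by position.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=3)
frame = pd.DataFrame(
    np.arange(6).reshape(3, 2), index=dates, columns=["A", "B"]
)

print(frame.loc[dates[0], "A"])  # label-based selection: 0
print(frame.iloc[1, 1])          # position-based selection: 3
print(frame.at[dates[2], "B"])   # fast scalar access by label: 5
print(frame.iat[0, 1])           # fast scalar access by position: 1

# Label slices with .loc include both endpoints
print(frame.loc["2013-01-02":"2013-01-03", "A"].tolist())  # [2, 4]
```

`at` and `iat` only return a single scalar, while `loc` and `iloc` also accept slices, lists, and boolean masks.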