<a href="https://colab.research.google.com/github/CarlosLeandro09/DataAnalysisRadiology/blob/main/Um_pouco_de_Pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Basic stage**

**1.** Essential imports to get started

In [1]:
import pandas as pd
import numpy as np

**2.** One-dimensional labeled data: **Series**

In [2]:
series = pd.Series([np.nan, 0, 1, 2])
series

0    NaN
1    0.0
2    1.0
3    2.0
dtype: float64
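The output above is `float64` even though most values are whole numbers, because `NaN` is a floating-point value. The sketch below illustrates this with the same values; the string labels `"a"` to `"d"` are made up for the example.

```python
import numpy as np
import pandas as pd

# Same values as above, with hypothetical string labels instead of 0..3
s = pd.Series([np.nan, 0, 1, 2], index=["a", "b", "c", "d"])

# NaN is a float, so the whole Series is stored as float64
print(s.dtype)

# isna() flags missing entries; dropna() returns a copy without them
n_missing = int(s.isna().sum())
clean = s.dropna()
print(n_missing, clean.index.tolist())
```

`dropna()` does not modify `s`; it returns a new Series.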

**3.** Still on the topic of series, let's work with **dates**

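`pd.date_range(start=None, end=None, periods=None, freq=None, tz=None, normalize=False, name=None, closed=None, **kwargs)` (signature from the pandas 1.x line used in this notebook; pandas 2.0 replaced `closed` with `inclusive`)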

In [3]:
datas = pd.date_range("20200101",periods=4,freq="D")
datas

DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04'], dtype='datetime64[ns]', freq='D')
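Besides `start` plus `periods`, `date_range` also accepts `start` plus `end`, and `freq` can be something other than daily. A small sketch, where the weekly `"W-MON"` frequency is chosen only for illustration:

```python
import pandas as pd

# start + end instead of start + periods
by_end = pd.date_range(start="2020-01-01", end="2020-01-04", freq="D")

# Weekly steps anchored on Mondays; the first entry is the first
# Monday on or after the start date
mondays = pd.date_range("2020-01-01", periods=3, freq="W-MON")

print(by_end)
print(mondays)
```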

**4.** Creating a **DataFrame**

In [4]:
df = pd.DataFrame(np.random.randn(4,4), index = datas, columns = list("ABCD"))
df

Unnamed: 0,A,B,C,D
2020-01-01,1.264532,0.260836,-0.42595,-0.118932
2020-01-02,-0.412366,0.009984,-0.448599,-1.231509
2020-01-03,0.332856,-0.254038,2.553168,0.558746
2020-01-04,1.133242,0.120179,0.853995,0.343288
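Because the values come from `np.random.randn`, the table above changes on every run. One possible way to make it reproducible is a seeded generator; the seed `0` and the names `df_seeded` and `df_again` are arbitrary choices for this sketch.

```python
import numpy as np
import pandas as pd

datas = pd.date_range("20200101", periods=4, freq="D")

# A seeded generator yields the same numbers on every run
rng = np.random.default_rng(seed=0)
df_seeded = pd.DataFrame(rng.standard_normal((4, 4)), index=datas, columns=list("ABCD"))

# Re-creating the generator with the same seed reproduces the same table
rng_again = np.random.default_rng(seed=0)
df_again = pd.DataFrame(rng_again.standard_normal((4, 4)), index=datas, columns=list("ABCD"))

print(df_seeded.equals(df_again))
```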


In [5]:
df2 = pd.DataFrame({"A":7,
                    "B":pd.Series(1,index=list(range(5)),dtype="float32"),
                    "C":np.array([3]*5,dtype="int32"), 
                    "D":pd.Categorical(["Carro","Coelho","Caipora","Cigarro","Cinema"]),
                    "E":pd.Timestamp("20190204"),
                    "F":"Dragonball"})
df2

Unnamed: 0,A,B,C,D,E,F
0,7,1.0,3,Carro,2019-02-04,Dragonball
1,7,1.0,3,Coelho,2019-02-04,Dragonball
2,7,1.0,3,Caipora,2019-02-04,Dragonball
3,7,1.0,3,Cigarro,2019-02-04,Dragonball
4,7,1.0,3,Cinema,2019-02-04,Dragonball


In [6]:
df2.head(3)

Unnamed: 0,A,B,C,D,E,F
0,7,1.0,3,Carro,2019-02-04,Dragonball
1,7,1.0,3,Coelho,2019-02-04,Dragonball
2,7,1.0,3,Caipora,2019-02-04,Dragonball


In [7]:
df2.tail(3)

Unnamed: 0,A,B,C,D,E,F
2,7,1.0,3,Caipora,2019-02-04,Dragonball
3,7,1.0,3,Cigarro,2019-02-04,Dragonball
4,7,1.0,3,Cinema,2019-02-04,Dragonball


In [8]:
df2.index

Int64Index([0, 1, 2, 3, 4], dtype='int64')

In [9]:
df2.columns

Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')

In [10]:
# to_numpy() drops the index and column labels and returns only the values
df.to_numpy()

array([[ 1.26453209,  0.26083568, -0.42595042, -0.11893177],
       [-0.41236607,  0.00998421, -0.44859887, -1.2315087 ],
       [ 0.33285558, -0.25403825,  2.55316778,  0.558746  ],
       [ 1.13324155,  0.12017861,  0.85399508,  0.34328784]])
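`to_numpy()` returns a single array, so the result has one dtype for all columns. The sketch below, using two small made-up frames, suggests what happens when the columns are homogeneous versus mixed.

```python
import numpy as np
import pandas as pd

# All-float columns: the array keeps a numeric dtype
homogeneous = pd.DataFrame(np.zeros((2, 2)), columns=list("AB"))

# Integer and string columns: the only common dtype is object
mixed = pd.DataFrame({"A": [1, 2], "F": ["x", "y"]})

homogeneous_dtype = homogeneous.to_numpy().dtype
mixed_dtype = mixed.to_numpy().dtype
print(homogeneous_dtype, mixed_dtype)
```

For a frame like `df2`, which mixes numbers, categories, timestamps and strings, the same fallback to `object` is to be expected.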

In [11]:
df2.dtypes

A             int64
B           float32
C             int32
D          category
E    datetime64[ns]
F            object
dtype: object

In [12]:
df2.shape

(5, 6)

**5.** Adding **columns** to the DataFrame

In [13]:
df2["G"] = pd.Series("RX",index=list(range(5)),dtype="str")
df2

Unnamed: 0,A,B,C,D,E,F,G
0,7,1.0,3,Carro,2019-02-04,Dragonball,RX
1,7,1.0,3,Coelho,2019-02-04,Dragonball,RX
2,7,1.0,3,Caipora,2019-02-04,Dragonball,RX
3,7,1.0,3,Cigarro,2019-02-04,Dragonball,RX
4,7,1.0,3,Cinema,2019-02-04,Dragonball,RX
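Building an explicit `pd.Series` is one way to add a constant column. As a sketch of two alternatives, a scalar can be assigned directly, and `assign()` returns a new frame instead of modifying the original; `frame`, `extended` and the column `H` are hypothetical names.

```python
import pandas as pd

frame = pd.DataFrame({"A": [7, 7, 7]})

# A scalar is broadcast to every row, so no explicit Series is needed
frame["G"] = "RX"

# assign() returns a new DataFrame and leaves `frame` untouched
extended = frame.assign(H=frame["A"] * 2)

print(frame.columns.tolist(), extended.columns.tolist())
```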


**6.** **Operations** between columns

In [14]:
df2["Soma"] = df2["A"] + df2["C"]
df2

Unnamed: 0,A,B,C,D,E,F,G,Soma
0,7,1.0,3,Carro,2019-02-04,Dragonball,RX,10
1,7,1.0,3,Coelho,2019-02-04,Dragonball,RX,10
2,7,1.0,3,Caipora,2019-02-04,Dragonball,RX,10
3,7,1.0,3,Cigarro,2019-02-04,Dragonball,RX,10
4,7,1.0,3,Cinema,2019-02-04,Dragonball,RX,10
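Column arithmetic is element-wise and aligned on index labels. Within a single DataFrame the rows always line up, but with two independent Series the alignment becomes visible. A minimal sketch with made-up values:

```python
import pandas as pd

left = pd.Series([1, 2, 3], index=[0, 1, 2])
right = pd.Series([10, 20, 30], index=[1, 2, 3])

# Labels 0 and 3 exist on one side only, so their sum is NaN
total = left + right

# add() with fill_value treats the missing side as 0 instead
filled = left.add(right, fill_value=0)

print(total)
print(filled)
```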


In [15]:
# Transpose
df2.T

Unnamed: 0,0,1,2,3,4
A,7,7,7,7,7
B,1,1,1,1,1
C,3,3,3,3,3
D,Carro,Coelho,Caipora,Cigarro,Cinema
E,2019-02-04 00:00:00,2019-02-04 00:00:00,2019-02-04 00:00:00,2019-02-04 00:00:00,2019-02-04 00:00:00
F,Dragonball,Dragonball,Dragonball,Dragonball,Dragonball
G,RX,RX,RX,RX,RX
Soma,10,10,10,10,10


**7.** **Concatenating** DataFrames

In [16]:
df1 = pd.DataFrame(np.random.randn(2,2), index = pd.date_range("20190104",periods=2,freq="D"), columns = list("AB"))
df2 = pd.DataFrame(np.random.randn(2,2), index = pd.date_range("20190106",periods=2,freq="D"), columns = list("AB"))
df3 = pd.DataFrame(np.random.randn(2,2), index = pd.date_range("20190108",periods=2,freq="D"), columns = list("AB"))

In [17]:
combinacao = pd.concat([df1,df2,df3],keys=["df1","df2","df3"])
combinacao

Unnamed: 0,Unnamed: 1,A,B
df1,2019-01-04,0.918701,-3.111946
df1,2019-01-05,0.02733,0.48798
df2,2019-01-06,-0.919493,1.10057
df2,2019-01-07,-0.84799,-1.852274
df3,2019-01-08,-1.235591,0.054842
df3,2019-01-09,0.192946,-1.196063


In [18]:
# Drop rows with repeated values in the chosen column (the first occurrence is kept)
combinacao["ValoresRepetidos"] = list("XYXKXL")
combinacao.drop_duplicates(subset="ValoresRepetidos")

Unnamed: 0,Unnamed: 1,A,B,ValoresRepetidos
df1,2019-01-04,0.918701,-3.111946,X
df1,2019-01-05,0.02733,0.48798,Y
df2,2019-01-07,-0.84799,-1.852274,K
df3,2019-01-09,0.192946,-1.196063,L
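`drop_duplicates` keeps the first occurrence by default and returns a new DataFrame. The `keep` parameter changes which row survives. A sketch with the same `"XYXKXL"` pattern, where `table` and the `valor` column are made up so the surviving rows are easy to follow:

```python
import pandas as pd

table = pd.DataFrame({"valor": [1, 2, 3, 4, 5, 6],
                      "ValoresRepetidos": list("XYXKXL")})

# Default: keep the first row of each repeated value
first = table.drop_duplicates(subset="ValoresRepetidos")

# keep="last": keep the last row of each repeated value
last = table.drop_duplicates(subset="ValoresRepetidos", keep="last")

# keep=False: discard every row whose value appears more than once
unique_only = table.drop_duplicates(subset="ValoresRepetidos", keep=False)

print(first["valor"].tolist(), last["valor"].tolist(), unique_only["valor"].tolist())
```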


In [19]:
# Select a single column
combinacao["A"]

df1  2019-01-04    0.918701
     2019-01-05    0.027330
df2  2019-01-06   -0.919493
     2019-01-07   -0.847990
df3  2019-01-08   -1.235591
     2019-01-09    0.192946
Name: A, dtype: float64

In [20]:
# Select all rows under one key of the outer index level
combinacao.loc["df1"]

Unnamed: 0,A,B,ValoresRepetidos
2019-01-04,0.918701,-3.111946,X
2019-01-05,0.02733,0.48798,Y


**8.** Applying **merge**

In [21]:
df4 = pd.DataFrame({'ID': [123,321,231,213],
                    'lkey': ['foo', 'bar', 'baz', 'foo'],
                    'value': [1, 2, 3, 5],
                    'Compras':[0,11,22,33]})
df5 = pd.DataFrame({'ID': [123,323,231,212],
                    'rkey': ['foo', 'bar', 'baz', 'foo'],
                    'value': [5, 6, 7, 8]})

In [22]:
# Inner join --> keeps only the IDs present in both DataFrames
pd.merge(df4,df5,how="inner",on=["ID"],suffixes=["_A","_B"])

Unnamed: 0,ID,lkey,value_A,Compras,rkey,value_B
0,123,foo,1,0,foo,5
1,231,baz,3,22,baz,7


In [23]:
# Left join --> keeps every row of the left DataFrame, plus matching data from the right
pd.merge(df4,df5,how="left",on='ID')

Unnamed: 0,ID,lkey,value_x,Compras,rkey,value_y
0,123,foo,1,0,foo,5.0
1,321,bar,2,11,,
2,231,baz,3,22,baz,7.0
3,213,foo,5,33,,


In [24]:
# Outer join --> keeps every ID from both DataFrames; indicator=True adds a _merge column reporting the side(s) each row came from
pd.merge(df4,df5,how='outer',on='ID',indicator=True)

Unnamed: 0,ID,lkey,value_x,Compras,rkey,value_y,_merge
0,123,foo,1.0,0.0,foo,5.0,both
1,321,bar,2.0,11.0,,,left_only
2,231,baz,3.0,22.0,baz,7.0,both
3,213,foo,5.0,33.0,,,left_only
4,323,,,,bar,6.0,right_only
5,212,,,,foo,8.0,right_only
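The `_merge` column produced by `indicator=True` can be used to filter rows by origin, for instance to list the IDs that exist only on the left side. A minimal sketch with reduced versions of the two frames (`left`, `right` and `left_only` are hypothetical names). Row order of an outer merge may differ between pandas versions, so the result is sorted before printing.

```python
import pandas as pd

left = pd.DataFrame({"ID": [123, 321, 231, 213], "value": [1, 2, 3, 5]})
right = pd.DataFrame({"ID": [123, 323, 231, 212], "value": [5, 6, 7, 8]})

merged = pd.merge(left, right, how="outer", on="ID", indicator=True)

# Keep only the IDs that have no match on the right side
left_only = merged.loc[merged["_merge"] == "left_only", "ID"].tolist()

print(sorted(left_only))
```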


**9.** Now for the well-known **groupby**

In [25]:
grupinho = pd.DataFrame({'A':["RX","CT","RX","CT","RX","CT"],
                         'B':[1,1,2,2,2,1],
                         'C':np.random.randn(6)})
grupinho

Unnamed: 0,A,B,C
0,RX,1,-1.208905
1,CT,1,0.521042
2,RX,2,-3.426962
3,CT,2,-0.326033
4,RX,2,0.086438
5,CT,1,-2.195619


In [26]:
group = grupinho.groupby(["A"]).sum()
group

A,B,C
CT,4,-2.00061
RX,5,-4.549429


In [27]:
group.loc['CT']

B    4.00000
C   -2.00061
Name: CT, dtype: float64

In [28]:
# Grouping by two columns produces a two-level index, much like the keyed concatenation
group2 = grupinho.groupby(["A","B"]).sum()
group2

A,B,C
CT,1,-1.674577
CT,2,-0.326033
RX,1,-1.208905
RX,2,-3.340524
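`sum()` applies one function to every column. When different columns need different summaries, `agg` with named aggregation is one option. The sketch below uses fixed values in `C` instead of random ones so the results can be checked by hand; `exams`, `summary` and the output column names are made up.

```python
import pandas as pd

exams = pd.DataFrame({"A": ["RX", "CT", "RX", "CT", "RX", "CT"],
                      "B": [1, 1, 2, 2, 2, 1],
                      "C": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]})

# Each keyword names an output column: (input column, aggregation)
summary = exams.groupby("A").agg(total_B=("B", "sum"),
                                 mean_C=("C", "mean"),
                                 n=("C", "count"))

print(summary)
```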


**10.** **Indexing**

In [30]:
arrays = [[1,1,3,3],['A','B','A','B']]
pd.MultiIndex.from_arrays(arrays,names=('numero','letra'))

MultiIndex([(1, 'A'),
            (1, 'B'),
            (3, 'A'),
            (3, 'B')],
           names=['numero', 'letra'])

In [31]:
# Cartesian product
numbers = [1,2,3]
letras = ['A','B']
pd.MultiIndex.from_product([numbers,letras],names=('numero','letra'))

MultiIndex([(1, 'A'),
            (1, 'B'),
            (2, 'A'),
            (2, 'B'),
            (3, 'A'),
            (3, 'B')],
           names=['numero', 'letra'])
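A `MultiIndex` becomes useful once it is attached to data. The sketch below uses the same product index as the row index of a small frame and selects on each level; `table` and the `valor` column are hypothetical.

```python
import pandas as pd

idx = pd.MultiIndex.from_product([[1, 2, 3], ["A", "B"]], names=("numero", "letra"))
table = pd.DataFrame({"valor": range(6)}, index=idx)

# Select every row whose outer level ("numero") equals 2
outer = table.loc[2]

# Select on the inner level by its name
inner = table.xs("B", level="letra")

print(outer)
print(inner)
```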