# Why Python?

- **The most widely used language in Data Science**
- Easy to learn: simple syntax
- Open source
- Large number of libraries available
- Drawback: lower computational performance, since it is an interpreted language

# A Brief Introduction to Python

In [1]:
# Variables are dynamically typed
nome = "João"
idade = 45
peso = 85.3

# Lists can hold elements of different types and can be modified
dados = [nome, idade, peso, 5]
print(dados)

# Tuples can also hold elements of different types, but cannot be modified
ponto = (1,3,-1)
print(ponto)

# Dictionaries map a key to a value
dic = {'nome':nome, 'idade':idade, 'peso':peso, 'end': 'Av. Jornalista Anibal Fernandes, Centro de Informática'}
print(dic)
print(f"Name: {dic['nome']:s}")

['João', 45, 85.3, 5]
(1, 3, -1)
{'nome': 'João', 'idade': 45, 'peso': 85.3, 'end': 'Av. Jornalista Anibal Fernandes, Centro de Informática'}
Name: João
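
The difference between mutable lists and immutable tuples can be checked directly. The following is a minimal sketch; the names `values`, `point` and `person`, and the `"altura"` key, are illustrative assumptions and not part of the notebook above.

```python
# Lists are mutable: elements can be replaced and appended in place.
values = ["João", 45, 85.3]
values[1] = 46          # replace an element
values.append(5)        # grow the list

# Tuples are immutable: item assignment raises TypeError.
point = (1, 3, -1)
try:
    point[0] = 10
    tuple_was_modified = True
except TypeError:
    tuple_was_modified = False

# Dictionaries map keys to values; .get() returns a default for a missing key.
person = {"nome": "João", "idade": 45}
person["peso"] = 85.3                    # add a new key
missing = person.get("altura", None)     # hypothetical key that does not exist

print(values)
print(tuple_was_modified)
print(person, missing)
```

Using `.get()` instead of `person["altura"]` avoids a `KeyError` when a key may be absent.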


In [2]:
# Conditional tests
if idade > 65:
    # Indentation defines which block each statement belongs to
    print(nome, "is a candidate for retirement")
    print("Error")
else:
    print(nome, "has to work a little longer")

João has to work a little longer


In [3]:

# Loops can be written in two ways
# 1 - using while loops
vezes = 1
while vezes <= 5:
    print("I have passed through here {0} time(s)".format(vezes))
    vezes += 1

I have passed through here 1 time(s)
I have passed through here 2 time(s)
I have passed through here 3 time(s)
I have passed through here 4 time(s)
I have passed through here 5 time(s)


In [4]:
# 2 - iterating over the elements of a list (an iterable object)
for vezes in [1, 2, 3, 4, 5]:
    print("I have passed through here {0} time(s)".format(vezes))
print("====")
# Or, alternatively, over a range (the stop value 6 is excluded)
for vezes in range(1,6,1):
    print("I have passed through here {0} time(s)".format(vezes))

I have passed through here 1 time(s)
I have passed through here 2 time(s)
I have passed through here 3 time(s)
I have passed through here 4 time(s)
I have passed through here 5 time(s)
====
I have passed through here 1 time(s)
I have passed through here 2 time(s)
I have passed through here 3 time(s)
I have passed through here 4 time(s)
I have passed through here 5 time(s)


In [5]:
for dado in dados:
    print(dado)

João
45
85.3
5


In [6]:
for i in range(0,len(dados),1):
    print(dados[i])

João
45
85.3
5
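
When both the position and the element are needed, `enumerate` is usually preferred over indexing with `range(len(...))`. A minimal sketch, using a stand-in list named `items` (an assumption; the notebook's own list is `dados`):

```python
items = ["João", 45, 85.3, 5]

# enumerate yields (index, element) pairs, so no manual indexing is needed.
pairs = []
for i, item in enumerate(items):
    pairs.append((i, item))
    print(i, item)

# range(start, stop) excludes `stop`, so range(1, 6) covers 1 through 5.
counts = list(range(1, 6))
print(counts)
```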


In [7]:
# Slicing: elements at indices 2 and 3 (the stop index 4 is excluded)
print(dados[2:4])

[85.3, 5]
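
Slices accept an optional start, stop and step, and negative indices count from the end of the list. The sketch below illustrates a few common forms on a stand-in list `items` (an illustrative name, with the same contents as `dados`):

```python
items = ["João", 45, 85.3, 5]

first_two = items[:2]         # indices 0 and 1
last_two = items[2:4]         # indices 2 and 3 (the stop index is excluded)
last_item = items[-1]         # negative indices count from the end
every_other = items[::2]      # step of 2: indices 0 and 2
reversed_items = items[::-1]  # a negative step walks the list backwards

print(first_two, last_two, last_item, every_other, reversed_items)
```

Slicing always returns a new list, so `items` itself is left unchanged.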


In [8]:
# Functions in Python
def quadrado(x):
    return x**2

print(quadrado(3))

# Simpler functions can be defined as lambda functions
f = lambda x: x**2
print(f(3))

9
9
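
Lambda functions are most useful when passed as arguments to other functions. A small sketch, assuming made-up `records` data that does not come from the notebook:

```python
def square(x):
    return x ** 2

# A named function can be passed to map() like any other value.
squares = list(map(square, [1, 2, 3]))

# A lambda is convenient as a short, one-off sort key.
records = [("João", 45), ("Ana", 31), ("Rui", 52)]  # hypothetical (name, age) pairs
by_age = sorted(records, key=lambda record: record[1])

# A list comprehension is often the more idiomatic alternative to map().
comprehension = [x ** 2 for x in [1, 2, 3]]

print(squares, by_age, comprehension)
```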


# Column-Oriented Processing

In [9]:
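# The loader is imported directly; the `sklearn.datasets.california_housing`
# module is not importable in recent scikit-learn releases
from sklearn.datasets import fetch_california_housing



In [10]:
data = fetch_california_housing()
data.keys()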

Downloading Cal. housing from https://ndownloader.figshare.com/files/5976036 to /root/scikit_learn_data


dict_keys(['data', 'target', 'feature_names', 'DESCR'])

In [11]:
print(data["DESCR"])

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block
        - HouseAge      median house age in block
        - AveRooms      average number of rooms
        - AveBedrms     average number of bedrooms
        - Population    block population
        - AveOccup      average house occupancy
        - Latitude      house block latitude
        - Longitude     house block longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
http://lib.stat.cmu.edu/datasets/

The target variable is the median house value for California districts.

This dataset was derived from the 1990 U.S. census, using one row per census
block group. A block group is the smallest geographical unit for which the U.S.
Census Bur

# Frameworks

## 1 - Pandas

A framework for manipulating tabular data in memory

* **Pros**: very extensive API (inspired by R's native data frames)
* **Cons**: single-threaded

Operations can be applied to an entire column at once instead of going row by row. `pandas` implements several optimizations under the hood, so that an operation such as adding a constant to every value of a column runs *quasi*-simultaneously, without a Python-level loop over the rows.
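
The contrast between row-by-row and whole-column operations can be sketched on a tiny data frame. The frame `toy` and its column names are hypothetical and are not the California housing data used in this notebook:

```python
import pandas as pd

# Hypothetical toy data, not the California housing dataset.
toy = pd.DataFrame({"rooms": [4.0, 5.5, 6.0], "households": [2, 4, 3]})

# Row by row: an explicit Python loop over the values.
looped = []
for value in toy["rooms"]:
    looped.append(value + 1)

# Column-oriented: one expression applied to the whole column.
vectorized = toy["rooms"] + 1

# Arithmetic between two columns is also element-wise.
toy["rooms_per_household"] = toy["rooms"] / toy["households"]

print(vectorized.tolist())
print(toy)
```

Both approaches give the same values; the column-oriented form is shorter and leaves the iteration to `pandas`.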

In [12]:
import pandas as pd

In [33]:
# Building a data frame
df = pd.DataFrame(data["data"], columns=data["feature_names"])
df.head(10)

   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude
0  8.3252      41.0  6.984127    1.02381       322.0  2.555556     37.88    -122.23
1  8.3014      21.0  6.238137    0.97188      2401.0  2.109842     37.86    -122.22
2  7.2574      52.0  8.288136   1.073446       496.0   2.80226     37.85    -122.24
3  5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85    -122.25
4  3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85    -122.25
5  4.0368      52.0  4.761658   1.103627       413.0  2.139896     37.85    -122.25
6  3.6591      52.0  4.931907   0.951362      1094.0  2.128405     37.84    -122.25
7    3.12      52.0  4.797527   1.061824      1157.0  1.788253     37.84    -122.25
8  2.0804      42.0  4.294118   1.117647      1206.0  2.026891     37.84    -122.26
9  3.6912      52.0  4.970588   0.990196      1551.0  2.172269     37.84    -122.25


In [15]:
df.describe()

         MedInc   HouseAge    AveRooms  AveBedrms   Population     AveOccup   Latitude    Longitude
count   20640.0    20640.0     20640.0    20640.0      20640.0      20640.0    20640.0      20640.0
mean   3.870671  28.639486       5.429   1.096675  1425.476744     3.070655  35.631861  -119.569704
std    1.899822  12.585558    2.474173   0.473911  1132.462122     10.38605   2.135952     2.003532
min      0.4999        1.0    0.846154   0.333333          3.0     0.692308      32.54      -124.35
25%      2.5634       18.0    4.440716   1.006079        787.0     2.429741      33.93       -121.8
50%      3.5348       29.0    5.229129    1.04878       1166.0     2.818116      34.26      -118.49
75%     4.74325       37.0    6.052381   1.099526       1725.0     3.282261      37.71      -118.01
max     15.0001       52.0  141.909091  34.066667      35682.0  1243.333333      41.95      -114.31


In [16]:
df.dtypes

MedInc        float64
HouseAge      float64
AveRooms      float64
AveBedrms     float64
Population    float64
AveOccup      float64
Latitude      float64
Longitude     float64
dtype: object

In [17]:
# Selecting a column
print(df['Population'].head())

0     322.0
1    2401.0
2     496.0
3     558.0
4     565.0
Name: Population, dtype: float64


In [18]:
# Selecting multiple columns
print(df[['Latitude','Longitude']].head())

   Latitude  Longitude
0     37.88    -122.23
1     37.86    -122.22
2     37.85    -122.24
3     37.85    -122.25
4     37.85    -122.25


In [19]:
# Selecting rows with a slice
df[2:5]

   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude
2  7.2574      52.0  8.288136   1.073446       496.0   2.80226     37.85    -122.24
3  5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85    -122.25
4  3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85    -122.25


In [20]:
# Selecting rows and columns by label: loc (the end label 15 is included)
print(df.loc[10:15,['Population','AveRooms']])

    Population  AveRooms
10       910.0  5.477612
11      1504.0  4.772480
12      1098.0  5.322650
13       345.0  4.000000
14      1212.0  4.262903
15       697.0  4.242424


In [21]:
# Selecting rows and columns by position: iloc (the end position 15 is excluded)
print(df.iloc[10:15,[4,2]])

    Population  AveRooms
10       910.0  5.477612
11      1504.0  4.772480
12      1098.0  5.322650
13       345.0  4.000000
14      1212.0  4.262903
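
The two outputs above differ by one row because `loc` slices by label and includes the end label, while `iloc` slices by position and excludes the end position. A minimal sketch on a hypothetical frame `toy` whose index labels are not consecutive makes the difference easier to see:

```python
import pandas as pd

# Hypothetical toy frame with non-consecutive integer labels.
toy = pd.DataFrame(
    {"a": [10, 20, 30, 40], "b": [1.0, 2.0, 3.0, 4.0]},
    index=[7, 8, 11, 12],
)

by_label = toy.loc[7:11, ["a"]]    # labels 7, 8 and 11: the end label is included
by_position = toy.iloc[0:2, [0]]   # positions 0 and 1: the end position is excluded

print(by_label)
print(by_position)
```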


In [24]:
# Selecting data with boolean expressions
df[(df['Population'] > 1000) & (df.AveBedrms > 1)].head()

    MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude
7     3.12      52.0  4.797527   1.061824      1157.0  1.788253     37.84    -122.25
8   2.0804      42.0  4.294118   1.117647      1206.0  2.026891     37.84    -122.26
11  3.2705      52.0   4.77248   1.024523      1504.0  2.049046     37.85    -122.26
12   3.075      52.0   5.32265   1.012821      1098.0  2.346154     37.85    -122.26
14  1.9167      52.0  4.262903   1.009677      1212.0  1.954839     37.85    -122.26
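
A boolean expression on a column produces a mask (a Series of `True`/`False` values) that can be combined with `&` (and), `|` (or) and `~` (not). Each comparison needs its own parentheses because these operators bind more tightly than `>`. The sketch below uses a hypothetical frame `toy` with made-up values:

```python
import pandas as pd

# Hypothetical toy data, not the California housing dataset.
toy = pd.DataFrame({"population": [500, 1500, 2500], "bedrooms": [0.9, 1.2, 0.8]})

mask = (toy["population"] > 1000) & (toy["bedrooms"] > 1)          # both conditions
either = toy[(toy["population"] > 1000) | (toy["bedrooms"] > 1)]   # at least one
neither = toy[~mask]                                               # negated mask
selected = toy[mask]

# Summing a boolean mask counts the rows that satisfy it.
count = int(mask.sum())

print(selected)
print(count)
```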


# **Index**

In [34]:
df1 = df[(df.Population > 1000) & (df.AveBedrms > 1)]
df1.head()

    MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude
7     3.12      52.0  4.797527   1.061824      1157.0  1.788253     37.84    -122.25
8   2.0804      42.0  4.294118   1.117647      1206.0  2.026891     37.84    -122.26
11  3.2705      52.0   4.77248   1.024523      1504.0  2.049046     37.85    -122.26
12   3.075      52.0   5.32265   1.012821      1098.0  2.346154     37.85    -122.26
14  1.9167      52.0  4.262903   1.009677      1212.0  1.954839     37.85    -122.26


In [28]:
print(df1.loc[7:11,['MedInc','HouseAge']])

    MedInc  HouseAge
7   3.1200      52.0
8   2.0804      42.0
11  3.2705      52.0


In [35]:
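# Without arguments, reset_index returns a new data frame (df1 is unchanged)
# and keeps the old index as a column named "index"
df1.reset_index()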

In [40]:
# drop=True discards the old index; inplace=True modifies df1 itself
df1.reset_index(inplace=True, drop=True)
df1.head()

   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude
0    3.12      52.0  4.797527   1.061824      1157.0  1.788253     37.84    -122.25
1  2.0804      42.0  4.294118   1.117647      1206.0  2.026891     37.84    -122.26
2  3.2705      52.0   4.77248   1.024523      1504.0  2.049046     37.85    -122.26
3   3.075      52.0   5.32265   1.012821      1098.0  2.346154     37.85    -122.26
4  1.9167      52.0  4.262903   1.009677      1212.0  1.954839     37.85    -122.26


In [41]:
print(df1.loc[0:1,['MedInc','HouseAge']])

   MedInc  HouseAge
0  3.1200      52.0
1  2.0804      42.0


In [42]:
print(df1.iloc[0:2,[0,1]])

   MedInc  HouseAge
0  3.1200      52.0
1  2.0804      42.0


<img src="https://raw.githubusercontent.com/ProfLuciano/intro_cd/130b92d33280d3ae7e2e1f34e435b21f8e3a8025/notebooks/pandas_selection.png">

In [43]:
%%time
# Computing the mean number of rooms with a row-oriented loop
accumulator = 0
for record in data["data"]:
    accumulator += record[2]  # column 2 is AveRooms
accumulator /= len(data["data"])
print(f"The mean of the average number of rooms in California is {accumulator:.3f}")

The mean of the average number of rooms in California is 5.429
CPU times: user 14.7 ms, sys: 0 ns, total: 14.7 ms
Wall time: 20.6 ms


In [44]:
%%time
# Computing the mean with a column-oriented operation
mean = df.AveRooms.sum() / len(df)
print(f"The mean of the average number of rooms in California is {mean:.3f}")

The mean of the average number of rooms in California is 5.429
CPU times: user 2.06 ms, sys: 0 ns, total: 2.06 ms
Wall time: 2.09 ms


In [46]:
%%time
# Simplest form
mean = df.AveRooms.mean()
print(f"The mean of the average number of rooms in California is {mean:.3f}")

The mean of the average number of rooms in California is 5.429
CPU times: user 438 µs, sys: 0 ns, total: 438 µs
Wall time: 449 µs


In [47]:
# apply: calls a function on each value of a column and returns a new column
from math import log
df["HouseAgeLog"] = df.HouseAge.apply(log)
df.head()

   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude  HouseAgeLog
0  8.3252      41.0  6.984127    1.02381       322.0  2.555556     37.88    -122.23     3.713572
1  8.3014      21.0  6.238137    0.97188      2401.0  2.109842     37.86    -122.22     3.044522
2  7.2574      52.0  8.288136   1.073446       496.0   2.80226     37.85    -122.24     3.951244
3  5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85    -122.25     3.951244
4  3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85    -122.25     3.951244
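
`apply` with `math.log` calls the function once per value. For common numeric functions, a NumPy function applied to the whole column gives the same result without the per-value Python call. The sketch below is only an illustration on a stand-in Series `ages` holding three of the `HouseAge` values shown above; it is not a change to the notebook's own cell.

```python
import math

import numpy as np
import pandas as pd

# Stand-in for the HouseAge column (first values from the table above).
ages = pd.Series([41.0, 21.0, 52.0], name="HouseAge")

with_apply = ages.apply(math.log)   # one Python function call per value
with_numpy = np.log(ages)           # one call on the whole column

print(np.allclose(with_apply, with_numpy))
```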


Exercises:

1. What is the geographic centroid of the dataset? (one line!)
2. How many blocks have an average house age (`HouseAge`) below 10 years and an average occupancy (`AveOccup`) below 3 people?

In [51]:
df[['Latitude','Longitude']].mean()

Latitude      35.631861
Longitude   -119.569704
dtype: float64

In [54]:
len(df[(df['HouseAge'] < 10) & (df.AveOccup < 3)])

771

## 2 - Dask

> Dask is a flexible library for parallel computing in Python.

`dask` is a multi-core and/or distributed alternative to pandas.
* **Pros**: multi-core, cluster mode, scales to larger datasets
* **Cons**: limited API, requires "thinking distributed", not every operation is trivially parallelizable

`dask` follows a *lazy* evaluation model: an operation is not actually executed until `.compute()` is called explicitly, which returns an in-memory `pandas` object (or a plain scalar for a reduction such as a mean).

In [59]:
!pip install "dask[dataframe]"

Collecting partd>=0.3.10; extra == "dataframe"
  Downloading https://files.pythonhosted.org/packages/41/94/360258a68b55f47859d72b2d0b2b3cfe0ca4fbbcb81b78812bd00ae86b7c/partd-1.2.0-py3-none-any.whl
Collecting locket
  Downloading https://files.pythonhosted.org/packages/50/b8/e789e45b9b9c2db75e9d9e6ceb022c8d1d7e49b2c085ce8c05600f90a96b/locket-0.2.1-py2.py3-none-any.whl
Installing collected packages: locket, partd
Successfully installed locket-0.2.1 partd-1.2.0


In [63]:
import dask.dataframe as dd
from multiprocessing import cpu_count


In [64]:
ddf = dd.from_pandas(df, npartitions= cpu_count())

In [67]:
%%time
mean = ddf["AveRooms"].mean().compute()
print(f"The mean of the average number of rooms in California is {mean:.3f}")

The mean of the average number of rooms in California is 5.429
CPU times: user 9.67 ms, sys: 0 ns, total: 9.67 ms
Wall time: 10.6 ms


When should you use `dask` instead of `pandas`?

1. The dataset does not fit in memory;
2. `pandas` is slow and you want to use all of the machine's cores to parallelize the processing;
3. The dataset is split across many files, which is related to (1).
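
The idea behind cases (1) and (3) can be sketched with `pandas` alone: each partition contributes only a partial sum and a row count, and the partial results are combined at the end, so no step needs the whole dataset at once. This is a conceptual illustration under stated assumptions, not Dask's API or its actual implementation; the `partitions` list and its values are hypothetical stand-ins for chunks that would normally live in separate files.

```python
import pandas as pd

# Hypothetical stand-ins for partitions that would normally be separate files.
partitions = [
    pd.DataFrame({"AveRooms": [6.0, 5.0]}),
    pd.DataFrame({"AveRooms": [4.0]}),
    pd.DataFrame({"AveRooms": [7.0, 3.0, 5.0]}),
]

# Each partition is reduced to a (partial sum, row count) pair on its own.
partials = [(part["AveRooms"].sum(), len(part)) for part in partitions]

# The partial results are then combined into the global mean.
total = sum(partial_sum for partial_sum, _ in partials)
rows = sum(row_count for _, row_count in partials)
partition_mean = total / rows

# Reference value computed on the full data, for comparison only.
full_mean = pd.concat(partitions)["AveRooms"].mean()
print(partition_mean, full_mean)
```

Reductions such as sums, counts and means decompose this way; operations that need all rows together (for example an exact median) do not, which is one reason not every operation is trivially parallelizable.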

# References

* Docs
    * https://pandas.pydata.org/pandas-docs/stable/
    * http://docs.dask.org/en/latest/


* Designing Data-Intensive Applications
    * https://dataintensive.net 
    * https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321


* Jupyter notebook extensions:
    * https://jupyter-contrib-nbextensions.readthedocs.io/en/latest/nbextensions.html
    * https://jupyter-contrib-nbextensions.readthedocs.io/en/latest/install.html
    * https://towardsdatascience.com/jupyter-notebook-extensions-517fa69d2231
    