# Pandas - Working with Files
- Working with *large files* makes memory management a real challenge.
- Sometimes we need to manipulate a large dataset, so we have to handle files differently.
- One approach is to read these files in a limited way so that we do not consume all of the server's memory.
- Dask can be an option.

In [8]:
import pandas as pd

In [9]:
# Raw string, so the backslash in the Windows path is not treated as an escape
arquivo = r'Arquivos\kc_house_data.csv'

In [3]:
# nrows loads only the given number of rows from the file.
# By default, pandas reads the ENTIRE file and loads it into memory.
dataset = pd.read_csv(arquivo, sep=',', nrows = 5)

In [6]:
# Result: only 5 rows, because nrows = 5
dataset

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,20141013T000000,221900,3,1.0,1180,5650,1,0,0,...,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650
1,6414100192,20141209T000000,538000,3,2.25,2570,7242,2,0,0,...,7,2170,400,1951,1991,98125,47.721,-122.319,1690,7639
2,5631500400,20150225T000000,180000,2,1.0,770,10000,1,0,0,...,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062
3,2487200875,20141209T000000,604000,4,3.0,1960,5000,1,0,0,...,7,1050,910,1965,0,98136,47.5208,-122.393,1360,5000
4,1954400510,20150218T000000,510000,3,2.0,1680,8080,1,0,0,...,8,1680,0,1987,0,98074,47.6168,-122.045,1800,7503


In [30]:
# The chunksize parameter sets how many rows each chunk contains; very large files yield many chunks.
# Careful: the reader is an iterator, so each chunk is available only once. After a full pass the reader is exhausted and read_csv must be called again.
chunk = pd.read_csv(arquivo, chunksize=10000)

In [31]:
type(chunk)

pandas.io.parsers.TextFileReader

In [32]:
# Printing the size of each chunk
# Our file has 21613 rows, so it was split into 3 chunks: 2 of 10000 and 1 of 1613 (to complete the total)
for parte in chunk:
    print(len(parte))

10000
10000
1613
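
Because each chunk is an ordinary DataFrame, a statistic can be accumulated across chunks without ever holding the whole file in memory. The following is a minimal sketch under stated assumptions: the CSV text, the `price` values and the names `total`, `rows` and `mean_price` are made up for illustration and are not the house dataset used above.

```python
import io

import pandas as pd

# Hypothetical stand-in for a large file: 25 rows with one 'price' column.
csv_text = "price\n" + "\n".join(str(100 * i) for i in range(1, 26))

total = 0.0
rows = 0
# chunksize=10 yields chunks of 10, 10 and 5 rows.
for parte in pd.read_csv(io.StringIO(csv_text), chunksize=10):
    total += float(parte["price"].sum())  # keep only the running sum
    rows += len(parte)                    # and the running row count

mean_price = total / rows
print(rows, mean_price)
```

Only the running totals survive each iteration, so peak memory is bounded by the size of one chunk rather than by the size of the file.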


In [35]:
# Recreate the reader, again with 10000 rows per chunk (the previous reader was exhausted by the loop above)
chunk = pd.read_csv(arquivo, chunksize=10000)

In [36]:
# How do we work with this data?
# Suppose we want to apply an operation to a column and store the result in a new column or a new dataset
lista=[]
for parte in chunk:
    lista.append(parte['bedrooms']*2)
    


In [37]:
lista

[0       6.0
 1       6.0
 2       4.0
 3       8.0
 4       6.0
        ... 
 9995    4.0
 9996    6.0
 9997    8.0
 9998    6.0
 9999    6.0
 Name: bedrooms, Length: 10000, dtype: float64,
 10000     8
 10001    10
 10002     6
 10003     8
 10004     6
          ..
 19995    10
 19996     6
 19997     6
 19998     4
 19999     8
 Name: bedrooms, Length: 10000, dtype: int64,
 20000     6
 20001     6
 20002     6
 20003     6
 20004    10
          ..
 21608     6
 21609     8
 21610     4
 21611     6
 21612     4
 Name: bedrooms, Length: 1613, dtype: int64]

In [38]:
# pd.concat joins the list of Series into one Series; the assignment aligns on the index, so the 5-row dataset keeps only labels 0 to 4.
dataset['bedrooms_size'] = pd.concat(lista)

In [39]:
dataset

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15,bedrooms_size
0,7129300520,20141013T000000,221900,3,1.0,1180,5650,1,0,0,...,1180,0,1955,0,98178,47.5112,-122.257,1340,5650,6.0
1,6414100192,20141209T000000,538000,3,2.25,2570,7242,2,0,0,...,2170,400,1951,1991,98125,47.721,-122.319,1690,7639,6.0
2,5631500400,20150225T000000,180000,2,1.0,770,10000,1,0,0,...,770,0,1933,0,98028,47.7379,-122.233,2720,8062,4.0
3,2487200875,20141209T000000,604000,4,3.0,1960,5000,1,0,0,...,1050,910,1965,0,98136,47.5208,-122.393,1360,5000,8.0
4,1954400510,20150218T000000,510000,3,2.0,1680,8080,1,0,0,...,1680,0,1987,0,98074,47.6168,-122.045,1800,7503,6.0
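
The assignment above relies on index alignment, which is easy to overlook. The sketch below reproduces the behaviour on a small made-up CSV (the data and the names `doubled`, `sample` and `full` are assumptions for illustration): the concatenated Series carries every row label, but a frame read with `nrows=5` only has labels 0 to 4, so only those values are kept.

```python
import io

import pandas as pd

# Hypothetical 12-row file with a single 'bedrooms' column.
csv_text = "bedrooms\n" + "\n".join(str(i % 4 + 1) for i in range(12))

# One transformed Series per chunk; each chunk keeps its original row labels.
lista = [parte["bedrooms"] * 2
         for parte in pd.read_csv(io.StringIO(csv_text), chunksize=5)]
doubled = pd.concat(lista)  # 12 values, labels 0..11

# A 5-row frame only matches labels 0..4; the other 7 values are dropped.
sample = pd.read_csv(io.StringIO(csv_text), nrows=5)
sample["bedrooms_size"] = doubled

# The full frame has all 12 labels, so every value is kept.
full = pd.read_csv(io.StringIO(csv_text))
full["bedrooms_size"] = doubled

print(len(doubled), len(sample), len(full))
```

If the goal is a new column for the whole file, the target frame has to contain all the rows; otherwise the extra values are discarded without any warning.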


# More Resources for Working with Large Datasets
1. Try to work only with the columns you will actually need.
2. Pay attention to the data type of each column.
3. Check which separator is used to delimit the data.

Regarding item 3, we can inspect the file before loading it. On Linux and in Jupyter Notebook,
we can use the command:

'head -n 5 dataset.csv' - Linux
'!head -n 5 dataset.csv' - In Jupyter

On Windows, we have to open PowerShell and use the command:

'gc dataset.csv -head 5'
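
The separator can also be checked from Python itself, which works the same way on any operating system. This is a sketch using the standard library's `csv.Sniffer`; the sample text is made up, and the candidate list `",;\t|"` is an assumption about which separators are plausible.

```python
import csv
import io

import pandas as pd

# Hypothetical sample. With a real file, read only the first few KB, e.g.:
#   with open(arquivo, newline="") as f:
#       sample = f.read(4096)
sample = "id;date;price\n1;20141013;221900\n2;20141209;538000\n"

# Ask the sniffer to choose among a few candidate separators.
dialect = csv.Sniffer().sniff(sample, delimiters=",;\t|")
print(repr(dialect.delimiter))

# Pass the detected separator on to pandas.
df_sample = pd.read_csv(io.StringIO(sample), sep=dialect.delimiter)
print(df_sample.shape)
```

The sniffer is a heuristic, so it is still worth looking at the first lines by eye when the result seems odd.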

In [10]:
# Reading the dataset with the nrows parameter
df = pd.read_csv(arquivo, sep=',', nrows=5)

In [11]:
df

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,20141013T000000,221900,3,1.0,1180,5650,1,0,0,...,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650
1,6414100192,20141209T000000,538000,3,2.25,2570,7242,2,0,0,...,7,2170,400,1951,1991,98125,47.721,-122.319,1690,7639
2,5631500400,20150225T000000,180000,2,1.0,770,10000,1,0,0,...,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062
3,2487200875,20141209T000000,604000,4,3.0,1960,5000,1,0,0,...,7,1050,910,1965,0,98136,47.5208,-122.393,1360,5000
4,1954400510,20150218T000000,510000,3,2.0,1680,8080,1,0,0,...,8,1680,0,1987,0,98074,47.6168,-122.045,1800,7503


In [12]:
# Narrowing down the columns we are going to work with
# Get the column names
df.columns.tolist()

['id',
 'date',
 'price',
 'bedrooms',
 'bathrooms',
 'sqft_living',
 'sqft_lot',
 'floors',
 'waterfront',
 'view',
 'condition',
 'grade',
 'sqft_above',
 'sqft_basement',
 'yr_built',
 'yr_renovated',
 'zipcode',
 'lat',
 'long',
 'sqft_living15',
 'sqft_lot15']

In [13]:
# Load only the columns we want to work with
df = pd.read_csv(arquivo, sep=',', usecols=['id',
 'date',
 'price',
 'bedrooms',
 'bathrooms',
 'sqft_living',
 'sqft_lot',
 'floors',
 'waterfront'])

In [14]:
df.head()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront
0,7129300520,20141013T000000,221900.0,3.0,1.0,1180,5650,1.0,0
1,6414100192,20141209T000000,538000.0,3.0,2.25,2570,7242,2.0,0
2,5631500400,20150225T000000,180000.0,2.0,1.0,770,10000,1.0,0
3,2487200875,20141209T000000,604000.0,4.0,3.0,1960,5000,1.0,0
4,1954400510,20150218T000000,510000.0,3.0,2.0,1680,8080,1.0,0


In [15]:
# Columns can also be selected by position
df = pd.read_csv(arquivo, usecols=[0,1,2,3,4,5])

In [16]:
df.head()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living
0,7129300520,20141013T000000,221900.0,3.0,1.0,1180
1,6414100192,20141209T000000,538000.0,3.0,2.25,2570
2,5631500400,20150225T000000,180000.0,2.0,1.0,770
3,2487200875,20141209T000000,604000.0,4.0,3.0,1960
4,1954400510,20150218T000000,510000.0,3.0,2.0,1680


In [17]:
# Reading the full file to check memory usage
df = pd.read_csv(arquivo)

In [19]:
# The full file uses 3.5+ MB
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             21613 non-null  int64  
 1   date           21613 non-null  object 
 2   price          21613 non-null  float64
 3   bedrooms       21609 non-null  float64
 4   bathrooms      21613 non-null  float64
 5   sqft_living    21613 non-null  int64  
 6   sqft_lot       21613 non-null  int64  
 7   floors         21612 non-null  float64
 8   waterfront     21613 non-null  int64  
 9   view           21613 non-null  int64  
 10  condition      21613 non-null  int64  
 11  grade          21613 non-null  int64  
 12  sqft_above     21613 non-null  int64  
 13  sqft_basement  21613 non-null  int64  
 14  yr_built       21613 non-null  int64  
 15  yr_renovated   21613 non-null  int64  
 16  zipcode        21613 non-null  int64  
 17  lat            21613 non-null  float64
 18  long  

In [20]:
# Reducing the number of columns
df = pd.read_csv(arquivo, usecols=[0,1,2,3,4])

In [21]:
# With fewer columns loaded, memory usage is much lower (844.4+ KB)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   id         21613 non-null  int64  
 1   date       21613 non-null  object 
 2   price      21613 non-null  float64
 3   bedrooms   21609 non-null  float64
 4   bathrooms  21613 non-null  float64
dtypes: float64(3), int64(1), object(1)
memory usage: 844.4+ KB
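
The `+` in the `memory usage` line means that pandas did not measure the strings held by object columns. For a more exact comparison, `memory_usage(deep=True)` can be summed for each frame. The sketch below uses a made-up table of 10 integer columns; the column names `c0` to `c9` and the variable names are assumptions for illustration.

```python
import io

import pandas as pd

# Hypothetical file: 1,000 rows and 10 integer columns named c0..c9.
n = 1000
csv_text = pd.DataFrame({f"c{i}": range(n) for i in range(10)}).to_csv(index=False)

full = pd.read_csv(io.StringIO(csv_text))
reduced = pd.read_csv(io.StringIO(csv_text), usecols=["c0", "c1"])

# deep=True also counts the contents of object (string) columns, if any.
bytes_full = int(full.memory_usage(deep=True).sum())
bytes_reduced = int(reduced.memory_usage(deep=True).sum())

print(full.shape, bytes_full)
print(reduced.shape, bytes_reduced)
```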


In [22]:
# We can also read every column except a few.
# For very large files with many columns, excluding is easier than selecting the needed ones one by one.
# Note: 'sfgt_lot' is misspelled (the column is 'sqft_lot'), so it matches nothing and sqft_lot is still loaded.
df = pd.read_csv(arquivo, usecols = lambda column : column not in ['sqft_living','sfgt_lot','floors'])


In [23]:
df.head()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_lot,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,20141013T000000,221900.0,3.0,1.0,5650,0,0,3,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650
1,6414100192,20141209T000000,538000.0,3.0,2.25,7242,0,0,3,7,2170,400,1951,1991,98125,47.721,-122.319,1690,7639
2,5631500400,20150225T000000,180000.0,2.0,1.0,10000,0,0,3,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062
3,2487200875,20141209T000000,604000.0,4.0,3.0,5000,0,0,5,7,1050,910,1965,0,98136,47.5208,-122.393,1360,5000
4,1954400510,20150218T000000,510000.0,3.0,2.0,8080,0,0,3,8,1680,0,1987,0,98074,47.6168,-122.045,1800,7503


In [24]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 19 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             21613 non-null  int64  
 1   date           21613 non-null  object 
 2   price          21613 non-null  float64
 3   bedrooms       21609 non-null  float64
 4   bathrooms      21613 non-null  float64
 5   sqft_lot       21613 non-null  int64  
 6   waterfront     21613 non-null  int64  
 7   view           21613 non-null  int64  
 8   condition      21613 non-null  int64  
 9   grade          21613 non-null  int64  
 10  sqft_above     21613 non-null  int64  
 11  sqft_basement  21613 non-null  int64  
 12  yr_built       21613 non-null  int64  
 13  yr_renovated   21613 non-null  int64  
 14  zipcode        21613 non-null  int64  
 15  lat            21613 non-null  float64
 16  long           21613 non-null  float64
 17  sqft_living15  21613 non-null  int64  
 18  sqft_l
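
A callable passed to `usecols` silently ignores names that match no column, so a misspelled exclusion goes unnoticed. One way to guard against that, sketched here on a made-up five-column CSV, is to read only the header with `nrows=0` and validate the exclusion list first. The names `exclude`, `header` and `unknown` are assumptions for illustration.

```python
import io

import pandas as pd

# Hypothetical file with five columns.
csv_text = (
    "id,price,sqft_living,sqft_lot,floors\n"
    "1,221900,1180,5650,1\n"
    "2,538000,2570,7242,2\n"
)

exclude = {"sqft_living", "sqft_lot", "floors"}

# nrows=0 parses only the header line, which is cheap even for a huge file.
header = pd.read_csv(io.StringIO(csv_text), nrows=0).columns
unknown = exclude - set(header)
if unknown:
    raise ValueError(f"Columns not found in file: {sorted(unknown)}")

# The callable receives each column name and returns True to keep it.
df_kept = pd.read_csv(io.StringIO(csv_text), usecols=lambda name: name not in exclude)
print(df_kept.columns.tolist())
```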

### Working with Appropriate Data Types
- Watch out for columns of dtype object.
- Categorical data can be stored with the category dtype.

In [25]:
df = pd.read_csv('https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv')

In [26]:
# Columns of dtype object consume more memory
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


### Converting the Data Types

In [27]:
df.Sex = df.Sex.astype('category')
df.Embarked = df.Embarked.astype('category')
df.Survived = df.Survived.astype('category')
df.Pclass = df.Pclass.astype('category')
df.PassengerId = df.PassengerId.astype('int32')
df.Parch = df.Parch.astype('int32')
df.SibSp = df.SibSp.astype('int32')

In [28]:
# Memory usage drops by about 40% (from 83.7+ KB to 49.2+ KB).
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   PassengerId  891 non-null    int32   
 1   Survived     891 non-null    category
 2   Pclass       891 non-null    category
 3   Name         891 non-null    object  
 4   Sex          891 non-null    category
 5   Age          714 non-null    float64 
 6   SibSp        891 non-null    int32   
 7   Parch        891 non-null    int32   
 8   Ticket       891 non-null    object  
 9   Fare         891 non-null    float64 
 10  Cabin        204 non-null    object  
 11  Embarked     889 non-null    category
dtypes: category(4), float64(2), int32(3), object(3)
memory usage: 49.2+ KB
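
The saving comes from how a category column is stored: each distinct value is kept once, and the rows hold small integer codes. The sketch below measures this on a made-up column with two distinct strings; the data and the names `as_text` and `as_category` are assumptions, and the exact byte counts depend on the pandas version.

```python
import pandas as pd

# Hypothetical low-cardinality column: 2 distinct strings over 10,000 rows.
as_text = pd.Series(["male", "female"] * 5000, name="Sex")
as_category = as_text.astype("category")

# deep=True counts the string contents, not just the references to them.
bytes_text = int(as_text.memory_usage(deep=True))
bytes_category = int(as_category.memory_usage(deep=True))

print(bytes_text, bytes_category)
print(as_category.cat.categories.tolist())
```

The benefit shrinks as the number of distinct values grows, so columns such as `Name` or `Ticket`, where almost every value is unique, are poor candidates for the category dtype.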


In [31]:
# The dtype parameter converts the types while the data is being read,
# so we avoid loading everything first and converting afterwards.
data = 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'
df = pd.read_csv(data, dtype={'Embarked':'category', 'Survived':'category','Parch':'int32'})

In [32]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   PassengerId  891 non-null    int64   
 1   Survived     891 non-null    category
 2   Pclass       891 non-null    int64   
 3   Name         891 non-null    object  
 4   Sex          891 non-null    object  
 5   Age          714 non-null    float64 
 6   SibSp        891 non-null    int64   
 7   Parch        891 non-null    int32   
 8   Ticket       891 non-null    object  
 9   Fare         891 non-null    float64 
 10  Cabin        204 non-null    object  
 11  Embarked     889 non-null    category
dtypes: category(2), float64(2), int32(1), int64(3), object(4)
memory usage: 68.1+ KB
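
The call above converts only three columns, which is why its memory usage (68.1+ KB) sits between the unconverted frame and the fully converted one. A `dtype` mapping can list every column that should be converted. The sketch below uses a small made-up CSV that borrows the Titanic column names; the rows and the `types` mapping are assumptions for illustration.

```python
import io

import pandas as pd

# Hypothetical rows using a few of the Titanic column names.
csv_text = (
    "Survived,Pclass,Embarked,Parch\n"
    "0,3,S,0\n"
    "1,1,C,0\n"
    "1,3,S,2\n"
)

# One mapping for every column to be converted during parsing.
types = {
    "Survived": "category",
    "Pclass": "category",
    "Embarked": "category",
    "Parch": "int32",
}
df_typed = pd.read_csv(io.StringIO(csv_text), dtype=types)
print(df_typed.dtypes)
```

According to the pandas I/O documentation, categories created with `dtype='category'` at read time are parsed as strings, so a numeric column such as `Survived` would hold the categories `'0'` and `'1'` rather than integers.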
