# Working with Real Data

#### Popular open data repositories
- UC Irvine Machine Learning Repository (http://archive.ics.uci.edu/ml/)
- Kaggle datasets (https://www.kaggle.com/datasets)
- Amazon AWS datasets (http://aws.amazon.com/fr/datasets/)
#### Meta portals (they list open data repositories):
- http://dataportals.org/
- http://opendatamonitor.eu/
- http://quandl.com/
#### Other pages that list many popular open data repositories:
- Wikipedia's list of Machine Learning datasets (https://goo.gl/SJHN2k)
- Quora.com question (http://goo.gl/zDR78y)
- Datasets subreddit (https://www.reddit.com/r/datasets)

In this lesson, we will work with the California housing prices dataset from the StatLib repository. The dataset is based on the 1990 California census.


In [3]:
# Loading the libraries
import sys

# Machine learning library
import sklearn

# Numerical computing library
import numpy as np

# Data visualization library
import matplotlib as mpl
import matplotlib.pyplot as plt

# Library for working with tabular data
import pandas as pd

In [4]:
# The path is the link shared in the group
caminho = 'https://raw.githubusercontent.com/FilipeCFernandes/Machine_Learning/refs/heads/main/BASE_DE_DADOS/housing/housing.csv'

housing = pd.read_csv(caminho)
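
`pd.read_csv` accepts any file-like object, not only a URL. The sketch below is a minimal offline check, assuming a three-row sample copied from the first rows displayed in this notebook; the names `sample_csv` and `sample` are introduced here for illustration only and are not part of the original lesson.

```python
import io

import pandas as pd

# Small in-memory sample (first three rows shown in this notebook), used only
# so the example runs without network access.
sample_csv = """longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
"""

# read_csv parses the header row into column names and infers the dtypes
sample = pd.read_csv(io.StringIO(sample_csv))

print(sample.shape)          # (number of rows, number of columns)
print(list(sample.columns))  # column names taken from the header row
```

Checking `shape` and `columns` right after loading is a quick way to confirm that the file was parsed with the expected header.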

## The Pandas Library

<div align="center">
    <img src="https://github.com/FilipeCFernandes/Machine_Learning/blob/main/Imagens/Panda.png?raw=true" width="500">
</div>


## Useful Pandas functions

Given a DataFrame loaded with `df = pd.read_csv(path)`:

- `df.head(n)`: shows the first `n` rows
- `df.tail(n)`: shows the last `n` rows
- `df.info()`: shows a summary of the DataFrame (columns, non-null counts, dtypes)
- `df.describe()`: descriptive statistics of the numeric columns
- `df['column']`: selects one column
- `df[['c1','c2']]`: selects several columns
- `df[df['column'] > value]`: filters rows
- `df.groupby('column')`: groups the data
- `df.sort_values(by='column', ascending=True)`: sorts the data (pass `ascending=False` for descending order)
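
The selection, filtering, grouping and sorting calls listed above can be sketched on a tiny DataFrame. This is an illustrative example only: `df_demo` is a hypothetical name, and its four rows are copied from rows displayed later in this notebook.

```python
import pandas as pd

# Hypothetical mini DataFrame; values copied from a few rows of the housing data
df_demo = pd.DataFrame({
    "median_income": [8.3252, 8.3014, 3.5673, 1.7],
    "median_house_value": [452600.0, 358500.0, 112000.0, 92300.0],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND", "INLAND"],
})

# Select several columns at once (the result is a DataFrame)
subset = df_demo[["median_income", "median_house_value"]]

# Filter rows with a boolean mask
expensive = df_demo[df_demo["median_house_value"] > 300000]

# Group by a category and aggregate one column
mean_by_region = df_demo.groupby("ocean_proximity")["median_house_value"].mean()

# Sort from the highest to the lowest house value
ranked = df_demo.sort_values(by="median_house_value", ascending=False)

print(expensive)
print(mean_by_region)
print(ranked.head(2))
```

Note that `groupby` on its own only builds the groups; an aggregation such as `.mean()` is what produces a result.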

### Exercise

In [5]:
# 1. Show the first 10 rows of the housing DataFrame
df = pd.read_csv(caminho)
df.head(10)



| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
| 5 | -122.25 | 37.85 | 52.0 | 919.0 | 213.0 | 413.0 | 193.0 | 4.0368 | 269700.0 | NEAR BAY |
| 6 | -122.25 | 37.84 | 52.0 | 2535.0 | 489.0 | 1094.0 | 514.0 | 3.6591 | 299200.0 | NEAR BAY |
| 7 | -122.25 | 37.84 | 52.0 | 3104.0 | 687.0 | 1157.0 | 647.0 | 3.12 | 241400.0 | NEAR BAY |
| 8 | -122.26 | 37.84 | 42.0 | 2555.0 | 665.0 | 1206.0 | 595.0 | 2.0804 | 226700.0 | NEAR BAY |
| 9 | -122.25 | 37.84 | 52.0 | 3549.0 | 707.0 | 1551.0 | 714.0 | 3.6912 | 261100.0 | NEAR BAY |


In [6]:
# 2. Show the last 10 rows of the housing DataFrame
df.tail(10)


| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 20630 | -121.32 | 39.29 | 11.0 | 2640.0 | 505.0 | 1257.0 | 445.0 | 3.5673 | 112000.0 | INLAND |
| 20631 | -121.4 | 39.33 | 15.0 | 2655.0 | 493.0 | 1200.0 | 432.0 | 3.5179 | 107200.0 | INLAND |
| 20632 | -121.45 | 39.26 | 15.0 | 2319.0 | 416.0 | 1047.0 | 385.0 | 3.125 | 115600.0 | INLAND |
| 20633 | -121.53 | 39.19 | 27.0 | 2080.0 | 412.0 | 1082.0 | 382.0 | 2.5495 | 98300.0 | INLAND |
| 20634 | -121.56 | 39.27 | 28.0 | 2332.0 | 395.0 | 1041.0 | 344.0 | 3.7125 | 116800.0 | INLAND |
| 20635 | -121.09 | 39.48 | 25.0 | 1665.0 | 374.0 | 845.0 | 330.0 | 1.5603 | 78100.0 | INLAND |
| 20636 | -121.21 | 39.49 | 18.0 | 697.0 | 150.0 | 356.0 | 114.0 | 2.5568 | 77100.0 | INLAND |
| 20637 | -121.22 | 39.43 | 17.0 | 2254.0 | 485.0 | 1007.0 | 433.0 | 1.7 | 92300.0 | INLAND |
| 20638 | -121.32 | 39.43 | 18.0 | 1860.0 | 409.0 | 741.0 | 349.0 | 1.8672 | 84700.0 | INLAND |
| 20639 | -121.24 | 39.37 | 16.0 | 2785.0 | 616.0 | 1387.0 | 530.0 | 2.3886 | 89400.0 | INLAND |


In [7]:
# 3. Use `.info()` to get a quick description of the DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
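
In the `info()` output, `total_bedrooms` has 20433 non-null entries out of 20640, so 20640 - 20433 = 207 values are missing. One way to obtain such counts directly is `isna().sum()`. The sketch below uses a hypothetical four-row `df_demo` in which one `total_bedrooms` value was deliberately replaced by `NaN` for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: the NaN is inserted on purpose to mimic the gaps
# that info() reveals in the total_bedrooms column.
df_demo = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
})

# Count the missing entries in every column
missing_per_column = df_demo.isna().sum()
print(missing_per_column)

# Same arithmetic as reading info(): total rows minus the non-null count
n_missing = len(df_demo) - df_demo["total_bedrooms"].count()
print(n_missing)
```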


### Quick statistical analysis of the data

In [9]:
# 4. Use `.describe()` to get a quick statistical summary of the housing DataFrame
df.describe()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| count | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20433.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 |
| mean | -119.569704 | 35.631861 | 28.639486 | 2635.763081 | 537.870553 | 1425.476744 | 499.53968 | 3.870671 | 206855.816909 |
| std | 2.003532 | 2.135952 | 12.585558 | 2181.615252 | 421.38507 | 1132.462122 | 382.329753 | 1.899822 | 115395.615874 |
| min | -124.35 | 32.54 | 1.0 | 2.0 | 1.0 | 3.0 | 1.0 | 0.4999 | 14999.0 |
| 25% | -121.8 | 33.93 | 18.0 | 1447.75 | 296.0 | 787.0 | 280.0 | 2.5634 | 119600.0 |
| 50% | -118.49 | 34.26 | 29.0 | 2127.0 | 435.0 | 1166.0 | 409.0 | 3.5348 | 179700.0 |
| 75% | -118.01 | 37.71 | 37.0 | 3148.0 | 647.0 | 1725.0 | 605.0 | 4.74325 | 264725.0 |
| max | -114.31 | 41.95 | 52.0 | 39320.0 | 6445.0 | 35682.0 | 6082.0 | 15.0001 | 500001.0 |


In [10]:
# Select the total_bedrooms column (its quartiles are the 25%, 50% and 75% rows of describe())
df['total_bedrooms']

| | total_bedrooms |
|---|---|
| 0 | 129.0 |
| 1 | 1106.0 |
| 2 | 190.0 |
| 3 | 235.0 |
| 4 | 280.0 |
| ... | ... |
| 20635 | 374.0 |
| 20636 | 150.0 |
| 20637 | 485.0 |
| 20638 | 409.0 |
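
Selecting a column returns a Series, and the quartiles of that Series can be computed with `quantile`. The sketch below is a minimal illustration, assuming only the `total_bedrooms` values of the first 10 rows shown earlier (so the numbers differ from the full-dataset statistics); the name `bedrooms` is introduced here for the example.

```python
import pandas as pd

# total_bedrooms values of the first 10 rows shown earlier in this notebook
bedrooms = pd.Series(
    [129.0, 1106.0, 190.0, 235.0, 280.0, 213.0, 489.0, 687.0, 665.0, 707.0],
    name="total_bedrooms",
)

# Quartiles computed directly (linear interpolation is the pandas default)
quartiles = bedrooms.quantile([0.25, 0.5, 0.75])
print(quartiles)

# The same numbers appear in the 25%, 50% and 75% rows of describe()
summary = bedrooms.describe()
print(summary[["25%", "50%", "75%"]])
```

The 50% quartile is the median: half of the values fall below it and half above.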


# Bibliography

* GÉRON, Aurélien. Mãos à obra: aprendizado de máquina com Scikit-Learn & TensorFlow: conceitos, ferramentas e técnicas para a construção de sistemas inteligentes. Rio de Janeiro: Alta Books, 2018.