# Collect Data
We download the data from a remote URL as a compressed archive (`.tgz`), extract its contents, and then delete the archive.


In [1]:
import os
import tarfile

import wget


DOWNLOAD_URL = 'https://raw.githubusercontent.com/ageron/handson-ml/master/datasets/housing/housing.tgz'

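# Make sure the target directory exists before downloading into it.
os.makedirs('data', exist_ok=True)

# Download the archive, extract its contents, then delete the archive.
wget.download(DOWNLOAD_URL, 'data/temp.tgz')
with tarfile.open('data/temp.tgz') as archive:
    archive.extractall('data/')
os.remove('data/temp.tgz')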

Test that we can load the file, which should be called `housing.csv`.

In [2]:
import pandas as pd

housing_data = pd.read_csv('data/housing.csv')
housing_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
longitude             20640 non-null float64
latitude              20640 non-null float64
housing_median_age    20640 non-null float64
total_rooms           20640 non-null float64
total_bedrooms        20433 non-null float64
population            20640 non-null float64
households            20640 non-null float64
median_income         20640 non-null float64
median_house_value    20640 non-null float64
ocean_proximity       20640 non-null object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB


## Split Data into Train and Test Sets
We split the data into train and test sets purely at random. A better solution would be stratified sampling with `StratifiedShuffleSplit` on categories of median income, so that both sets keep a similar income distribution. A random split will, however, suffice for a simple example.

In [8]:
from sklearn.model_selection import train_test_split

housing_train, housing_test = train_test_split(housing_data, test_size=0.2, random_state=43)

housing_train.to_csv('data/housing_train.csv', index=False)
housing_test.to_csv('data/housing_test.csv', index=False)
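
For comparison, the stratified alternative mentioned in this section could look roughly like the sketch below. It is not part of the pipeline above. The helper name `stratified_income_split` and the income bin edges are assumptions chosen for illustration, and the small synthetic `demo` frame stands in for `housing_data` so that the cell runs without the download.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit


def stratified_income_split(df, test_size=0.2, random_state=43):
    """Split df so that income categories keep similar proportions in both sets."""
    # Assumed bin edges for median_income; adjust them to the real distribution.
    bins = [0.0, 1.5, 3.0, 4.5, 6.0, np.inf]
    income_cat = pd.cut(df['median_income'], bins=bins, labels=False, include_lowest=True)

    splitter = StratifiedShuffleSplit(n_splits=1, test_size=test_size, random_state=random_state)
    # split() yields positional indices, so select rows with iloc.
    train_idx, test_idx = next(splitter.split(df, income_cat))
    return df.iloc[train_idx], df.iloc[test_idx]


# Synthetic stand-in for housing_data (hypothetical values, not the real dataset).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'median_income': rng.lognormal(mean=1.2, sigma=0.5, size=1000),
    'median_house_value': rng.uniform(50_000, 500_000, size=1000),
})
demo_train, demo_test = stratified_income_split(demo)
```

With the real data, calling `stratified_income_split(housing_data)` would return the two frames in the same order as `train_test_split`, so the `to_csv` calls above could stay unchanged.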