# **01 - Data Collection**

## Objectives

* Download housing data from Kaggle using authentication.

* Store the raw data in the correct directory: inputs/datasets/raw/.

* Review the downloaded files to confirm they are complete and usable.

* Save cleaned copies of the datasets in: outputs/datasets/collection/.

## Inputs

* Kaggle API JSON file: Used to authenticate access to the dataset on Kaggle.

## Outputs

* inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv

* inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/inherited_houses.csv

* inputs/datasets/raw/house-metadata.txt

* outputs/datasets/collection/HousePricesRecords.csv

* outputs/datasets/collection/InheritedHouses.csv

## Additional Comments

### Business Requirements Addressed
* BR1: The client wants to understand how different house features (e.g., size, location, condition) affect sale prices in Ames, Iowa. She expects visualizations that clearly show these relationships.

* BR2: The client owns four inherited properties. She wants to predict their potential sale prices as well as understand the market value of other properties in Ames.

### Additional Notes
* HousePricesRecords.csv (in outputs/datasets/collection/) will be used to create data visualizations that show trends between house features and prices.

* InheritedHouses.csv (in outputs/datasets/collection/) includes the specific properties the client owns. These will be passed to the prediction model to estimate their expected sale prices.

---

# Change working directory

* The notebooks are stored in a subfolder, so when you run a notebook in the editor you need to change the working directory.

We need to change the working directory from the notebook's folder to its parent folder.
* We access the current directory with `os.getcwd()`.

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspaces/house-price-for-UK/jupyter_notebooks'

We want to make the parent of the current directory the new current directory.
* `os.path.dirname()` gets the parent directory
* `os.chdir()` sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspaces/house-price-for-UK'
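Re-running the `os.chdir` cell moves the working directory up one more level each time. The following is a minimal sketch of a guard against that, assuming the notebooks live in a folder named `jupyter_notebooks` (as in the path shown above). The helper name `resolve_project_root` is hypothetical and not part of the original notebook.

```python
import os
from pathlib import Path


def resolve_project_root(cwd: Path, notebooks_dirname: str = "jupyter_notebooks") -> Path:
    """Return the parent of `cwd` if `cwd` is the notebooks folder, else `cwd` itself."""
    return cwd.parent if cwd.name == notebooks_dirname else cwd


# Safe to run more than once: the directory only changes on the first call.
project_root = resolve_project_root(Path.cwd())
os.chdir(project_root)
print(f"Working directory: {project_root}")
```

Because the check is on the folder name, running the cell from the project root leaves the working directory unchanged.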

## Kaggle

This section downloads the UK Housing Prices Paid dataset from Kaggle using the Kaggle API.

In [4]:
%pip install kaggle==1.5.12


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.1.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


This code sets up the environment for Kaggle API access. It tells the Kaggle client where to find `kaggle.json` and restricts that file's permissions to its owner.

In [5]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
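If `kaggle.json` is missing from the project root, the `chmod` call fails with a shell error that is easy to overlook. As a rough sketch, the same setup can be done in pure Python with an explicit check. The function name `configure_kaggle` is hypothetical; `KAGGLE_CONFIG_DIR` is the environment variable already used in the cell above.

```python
import os
import stat
from pathlib import Path


def configure_kaggle(config_dir: Path) -> Path:
    """Point the Kaggle client at `config_dir` and restrict kaggle.json to its owner."""
    token = config_dir / "kaggle.json"
    if not token.is_file():
        raise FileNotFoundError(f"Expected Kaggle API token at {token}")
    os.environ["KAGGLE_CONFIG_DIR"] = str(config_dir)
    # Equivalent of `chmod 600`: read/write for the owner only.
    token.chmod(stat.S_IRUSR | stat.S_IWUSR)
    return token
```

In the notebook this would be called as `configure_kaggle(Path.cwd())` after the working directory has been set to the project root.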

This code sets the dataset path (KaggleDatasetPath) for the UK housing prices dataset and the local folder (DestinationFolder) where the dataset will be saved. It then uses the Kaggle API command to download the dataset into the specified folder.

In [6]:
KaggleDatasetPath = "hm-land-registry/uk-housing-prices-paid"
DestinationFolder = "inputs/datasets/raw"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Downloading uk-housing-prices-paid.zip to inputs/datasets/raw
100%|███████████████████████████████████████▊| 729M/731M [00:15<00:00, 52.0MB/s]
100%|████████████████████████████████████████| 731M/731M [00:15<00:00, 48.9MB/s]


This command unzips the downloaded .zip file into the same folder, then deletes the original .zip file and the kaggle.json API key file to keep the workspace clean and organized.

In [7]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json

Archive:  inputs/datasets/raw/uk-housing-prices-paid.zip
  inflating: inputs/datasets/raw/price_paid_records.csv  
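The shell command above relies on `unzip` and `rm` being available. Where they are not, a possible alternative is Python's standard `zipfile` module. This is a sketch only; the helper name `extract_and_remove` is hypothetical, and unlike the cell above it does not delete `kaggle.json`.

```python
import zipfile
from pathlib import Path


def extract_and_remove(folder: Path) -> list[str]:
    """Extract every .zip in `folder` into that folder, then delete the archive."""
    extracted = []
    for archive in sorted(folder.glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(folder)
            extracted.extend(zf.namelist())
        # Remove the archive only after extraction has finished without error.
        archive.unlink()
    return extracted
```

Called as `extract_and_remove(Path("inputs/datasets/raw"))`, it returns the names of the extracted members, which can be printed to mirror the `unzip` output.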




---

## Load and Inspect Kaggle data

The full dataset is too large to load comfortably in this workspace, so I read only the first 1,000 rows with `nrows=1000`.

In [8]:
import pandas as pd

df_chunk = pd.read_csv("inputs/datasets/raw/price_paid_records.csv", nrows=1000)
df_chunk.head()


| Unnamed: 0 | Transaction unique identifier | Price | Date of Transfer | Property Type | Old/New | Duration | Town/City | District | County | PPDCategory Type | Record Status - monthly file only |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | {81B82214-7FBC-4129-9F6B-4956B4A663AD} | 25000 | 1995-08-18 00:00 | T | N | F | OLDHAM | OLDHAM | GREATER MANCHESTER | A | A |
| 1 | {8046EC72-1466-42D6-A753-4956BF7CD8A2} | 42500 | 1995-08-09 00:00 | S | N | F | GRAYS | THURROCK | THURROCK | A | A |
| 2 | {278D581A-5BF3-4FCE-AF62-4956D87691E6} | 45000 | 1995-06-30 00:00 | T | N | F | HIGHBRIDGE | SEDGEMOOR | SOMERSET | A | A |
| 3 | {1D861C06-A416-4865-973C-4956DB12CD12} | 43150 | 1995-11-24 00:00 | T | N | F | BEDFORD | NORTH BEDFORDSHIRE | BEDFORDSHIRE | A | A |
| 4 | {DD8645FD-A815-43A6-A7BA-4956E58F1874} | 18899 | 1995-06-23 00:00 | S | N | F | WAKEFIELD | LEEDS | WEST YORKSHIRE | A | A |
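Reading only the first 1,000 rows gives a quick preview but says nothing about the rest of the file. One way to cover the whole file without holding it in memory is to stream it in chunks with the `chunksize` parameter of `pd.read_csv`. A minimal sketch, assuming pandas is installed; the helper name `count_rows` and the default chunk size are illustrative choices, not part of the original notebook.

```python
import pandas as pd


def count_rows(csv_source, chunksize: int = 100_000) -> int:
    """Count data rows by streaming the CSV in chunks instead of loading it whole."""
    total = 0
    for chunk in pd.read_csv(csv_source, chunksize=chunksize):
        total += len(chunk)
    return total
```

For this dataset the call would be `count_rows("inputs/datasets/raw/price_paid_records.csv")`. The same loop can be extended to aggregate or sample each chunk if a larger working set is needed later.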


---

## Check Raw Data Files
This code checks whether the raw data files were downloaded and saved to the correct folder (`inputs/datasets/raw`).


In [9]:
# Check what files were downloaded
import os
for root, dirs, files in os.walk("inputs/datasets/raw"):
    for name in files:
        print(os.path.join(root, name))


inputs/datasets/raw/price_paid_records.csv
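Listing file names confirms that a file exists, but not that it has any content. A small sketch that also reports file sizes, using `pathlib`; the helper name `list_files_with_sizes` is hypothetical.

```python
from pathlib import Path


def list_files_with_sizes(folder: Path) -> list[tuple[str, int]]:
    """Return (relative path, size in bytes) for every file under `folder`."""
    return sorted(
        (str(path.relative_to(folder)), path.stat().st_size)
        for path in folder.rglob("*")
        if path.is_file()
    )
```

Calling `list_files_with_sizes(Path("inputs/datasets/raw"))` and printing the result makes an empty or truncated file easier to spot, since its size would be zero or far below that of the downloaded archive.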


---

## Prepare Output Folder

This code creates the `outputs/datasets/collection` folder, where the cleaned and processed data files will be saved. Passing `exist_ok=True` makes the cell safe to re-run.


In [10]:
# Create collection folder for cleaned data outputs.
# The working directory is already the project root, so the path must not start with "../".
os.makedirs("outputs/datasets/collection", exist_ok=True)
print("outputs/datasets/collection/ created")

outputs/datasets/collection/ created


---