# **01 - Data Collection**

## Objectives

* Download housing data from Kaggle using authentication.

* Store the raw data in the correct directory: inputs/datasets/raw/.

* Review the downloaded files to confirm they are complete and usable.

* Save cleaned copies of the datasets in: outputs/datasets/collection/.

## Inputs

* Kaggle API JSON file: Used to authenticate access to the dataset on Kaggle.

## Outputs

* inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv

* inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/inherited_houses.csv

* inputs/datasets/raw/house-metadata.txt

* outputs/datasets/collection/HousePricesRecords.csv

* outputs/datasets/collection/InheritedHouses.csv

## Additional Comments

### Business Requirements Addressed
* BR1: The client wants to understand how different house features (e.g., size, location, condition) affect sale prices in Ames, Iowa. She expects visualizations that clearly show these relationships.

* BR2: The client owns four inherited properties. She wants to predict their potential sale prices as well as understand the market value of other properties in Ames.

### Additional Notes
* HousePricesRecords.csv (in outputs/datasets/collection/) will be used to create data visualizations that show trends between house features and prices.

* InheritedHouses.csv (in outputs/datasets/collection/) includes the specific properties the client owns. These will be passed to the prediction model to estimate their expected sale prices.

---

# Change working directory

* The notebooks are stored in a subfolder, so when you run a notebook in the editor you need to change the working directory.

We need to change the working directory from the notebook's folder to its parent folder.
* We access the current directory with os.getcwd()

In [2]:
import os
current_dir = os.getcwd()
current_dir

'/workspaces/house-price-for-UK/jupyter_notebooks'

We want to make the parent of the current directory the new current directory.
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [3]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [4]:
current_dir = os.getcwd()
current_dir

'/workspaces/house-price-for-UK'
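
Re-running the os.chdir() cell moves the working directory up one more level each time it executes. The following is a minimal sketch of a guard against that, assuming the notebooks folder is named jupyter_notebooks as in the output above. The helper name move_to_project_root is hypothetical and not part of the original notebook.

```python
import os

def move_to_project_root(notebook_folder="jupyter_notebooks"):
    """Move up one level only when the current folder is the notebooks folder."""
    current = os.getcwd()
    if os.path.basename(current) == notebook_folder:
        os.chdir(os.path.dirname(current))
    # Return the resulting working directory so the cell can display it
    return os.getcwd()
```

Calling move_to_project_root() a second time leaves the directory unchanged, because the project root is no longer named jupyter_notebooks.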

# Kaggle

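This section uses the Kaggle API to download the UK Housing Prices Paid dataset (hm-land-registry/uk-housing-prices-paid) from Kaggle. The download needs a kaggle.json API token in the directory that KAGGLE_CONFIG_DIR points to.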

In [5]:
%pip install kaggle==1.5.12


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.1.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


In [2]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json

chmod: cannot access 'kaggle.json': No such file or directory


In [3]:
KaggleDatasetPath = "hm-land-registry/uk-housing-prices-paid"
DestinationFolder = "inputs/datasets/raw"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Traceback (most recent call last):
  File "/home/cistudent/.local/bin/kaggle", line 5, in <module>
    from kaggle.cli import main
  File "/home/cistudent/.local/lib/python3.12/site-packages/kaggle/__init__.py", line 23, in <module>
    api.authenticate()
  File "/home/cistudent/.local/lib/python3.12/site-packages/kaggle/api/kaggle_api_extended.py", line 164, in authenticate
    raise IOError('Could not find {}. Make sure it\'s located in'
OSError: Could not find kaggle.json. Make sure it's located in /workspaces/house-price-for-UK/jupyter_notebooks. Or use the environment method.
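
The traceback above shows the Kaggle CLI failing because kaggle.json was not in the directory it searched. Checking for the token before calling the CLI gives a clearer message. This is a sketch only, assuming the token lives in the directory named by KAGGLE_CONFIG_DIR (or the current directory when that variable is unset). The helper name find_kaggle_token is hypothetical.

```python
import os

def find_kaggle_token(config_dir=None):
    """Return the path to kaggle.json if it exists in config_dir, else None."""
    config_dir = config_dir or os.environ.get("KAGGLE_CONFIG_DIR", os.getcwd())
    token_path = os.path.join(config_dir, "kaggle.json")
    return token_path if os.path.isfile(token_path) else None

token = find_kaggle_token()
if token is None:
    print("kaggle.json not found; place it in the config directory before downloading.")
else:
    # Restrict the token to the owner, like `chmod 600 kaggle.json`
    os.chmod(token, 0o600)
```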


In [10]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json

Archive:  inputs/datasets/raw/uk-housing-prices-paid.zip
  inflating: inputs/datasets/raw/price_paid_records.csv  
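
The shell cell above depends on unzip and rm being available in the environment. A portable alternative can be sketched with Python's standard zipfile module. The function name extract_and_remove_zips is hypothetical, and unlike the shell cell this sketch does not delete kaggle.json.

```python
import zipfile
from pathlib import Path

def extract_and_remove_zips(folder):
    """Extract every .zip in folder into that folder, then delete the archive."""
    extracted = []
    for zip_path in sorted(Path(folder).glob("*.zip")):
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(folder)
            extracted.extend(archive.namelist())
        # Remove the archive only after it has been extracted and closed
        zip_path.unlink()
    return extracted

# Example call, using the destination folder from the download cell:
# extract_and_remove_zips("inputs/datasets/raw")
```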




---

# Load and Inspect Kaggle data

The full data pack is too large to work with here, so the cells below read only the first 1,000 rows of the CSV (nrows=1000).

In [1]:
import pandas as pd
df_chunk = pd.read_csv("../inputs/datasets/raw/price_paid_records.csv", nrows=1000)
df_chunk.head()


Unnamed: 0,Transaction unique identifier,Price,Date of Transfer,Property Type,Old/New,Duration,Town/City,District,County,PPDCategory Type,Record Status - monthly file only
0,{81B82214-7FBC-4129-9F6B-4956B4A663AD},25000,1995-08-18 00:00,T,N,F,OLDHAM,OLDHAM,GREATER MANCHESTER,A,A
1,{8046EC72-1466-42D6-A753-4956BF7CD8A2},42500,1995-08-09 00:00,S,N,F,GRAYS,THURROCK,THURROCK,A,A
2,{278D581A-5BF3-4FCE-AF62-4956D87691E6},45000,1995-06-30 00:00,T,N,F,HIGHBRIDGE,SEDGEMOOR,SOMERSET,A,A
3,{1D861C06-A416-4865-973C-4956DB12CD12},43150,1995-11-24 00:00,T,N,F,BEDFORD,NORTH BEDFORDSHIRE,BEDFORDSHIRE,A,A
4,{DD8645FD-A815-43A6-A7BA-4956E58F1874},18899,1995-06-23 00:00,S,N,F,WAKEFIELD,LEEDS,WEST YORKSHIRE,A,A
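
Reading only the first 1,000 rows keeps memory use low but ignores the rest of the file. If the whole file needs to be scanned, pandas can iterate over it in pieces through the chunksize parameter of read_csv. The following is a minimal sketch that counts rows this way. The helper name count_rows_in_chunks and the three-row sample are made up for illustration.

```python
import io
import pandas as pd

def count_rows_in_chunks(source, chunksize=100_000):
    """Count data rows by reading the CSV one chunk at a time."""
    total = 0
    for chunk in pd.read_csv(source, chunksize=chunksize):
        # Each chunk is an ordinary DataFrame holding at most `chunksize` rows
        total += len(chunk)
    return total

# Tiny in-memory sample standing in for price_paid_records.csv
sample = io.StringIO("Price,Town/City\n25000,OLDHAM\n42500,GRAYS\n45000,HIGHBRIDGE\n")
print(count_rows_in_chunks(sample, chunksize=2))  # prints 3
```

The same loop could aggregate or filter each chunk instead of counting it, so that only the reduced result is held in memory.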


In [5]:
import pandas as pd

# Load a chunk of the data
df_chunk = pd.read_csv("../inputs/datasets/raw/price_paid_records.csv", nrows=1000)

# Drop columns that are not useful for modelling
df_chunk.drop(columns=[
    "Transaction unique identifier",
    "District",  # Optional – too many unique values can be noisy
    "Record Status - monthly file only"
], inplace=True)

# Convert 'Date of Transfer' to datetime
df_chunk["Date of Transfer"] = pd.to_datetime(df_chunk["Date of Transfer"], errors='coerce')

# Extract year and month
df_chunk["Year"] = df_chunk["Date of Transfer"].dt.year
df_chunk["Month"] = df_chunk["Date of Transfer"].dt.month

# Encode 'Old/New': N = 0, Y = 1
df_chunk["Old/New"] = df_chunk["Old/New"].map({'N': 0, 'Y': 1})

# Encode 'Duration': F = 1 (Freehold), L = 0 (Leasehold)
df_chunk["Duration"] = df_chunk["Duration"].map({'F': 1, 'L': 0})

# One-hot encode 'Property Type'
df_chunk = pd.get_dummies(df_chunk, columns=["Property Type"], prefix="Property")


# Preview result
df_chunk.head()



Unnamed: 0,Price,Date of Transfer,Old/New,Duration,Town/City,County,PPDCategory Type,Year,Month,Property_D,Property_F,Property_S,Property_T
0,25000,1995-08-18,0,1,OLDHAM,GREATER MANCHESTER,A,1995,8,False,False,False,True
1,42500,1995-08-09,0,1,GRAYS,THURROCK,A,1995,8,False,False,True,False
2,45000,1995-06-30,0,1,HIGHBRIDGE,SOMERSET,A,1995,6,False,False,False,True
3,43150,1995-11-24,0,1,BEDFORD,BEDFORDSHIRE,A,1995,11,False,False,False,True
4,18899,1995-06-23,0,1,WAKEFIELD,WEST YORKSHIRE,A,1995,6,False,False,True,False
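
Series.map with a dictionary returns NaN for any value that has no entry in the dictionary, so an unexpected code in 'Old/New' or 'Duration' would silently become a missing value. The following sketch surfaces such codes. The helper name encode_binary is hypothetical, and the "U" code in the sample is invented for illustration.

```python
import pandas as pd

def encode_binary(series, mapping):
    """Map categories to numbers and report values missing from the mapping."""
    encoded = series.map(mapping)
    # Values that were present in the input but became NaN after mapping
    unmapped = sorted(series[encoded.isna() & series.notna()].unique())
    return encoded, unmapped

old_new = pd.Series(["N", "Y", "N", "U"])  # "U" is a made-up code
encoded, unmapped = encode_binary(old_new, {"N": 0, "Y": 1})
print(unmapped)
```

An empty list means every value was covered by the mapping.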


---

# Push files to Repo

* If you do not need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
import os
try:
  # create here your folder
  # os.makedirs(name='')
  pass  # placeholder so the empty try block is valid Python
except Exception as e:
  print(e)
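
os.makedirs raises an error if the folder already exists, which is why the template wraps it in try/except. Passing exist_ok=True avoids that. This is a sketch, assuming the target is the outputs/datasets/collection folder named in the Objectives. The helper name ensure_folder is hypothetical.

```python
import os

def ensure_folder(path):
    """Create path (and any parents) if missing; do nothing if it exists."""
    os.makedirs(path, exist_ok=True)
    return os.path.isdir(path)

# Example call, using the folder named in the Objectives:
# ensure_folder("outputs/datasets/collection")
```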
