# **Data Collection Notebook**

## Objectives

* Fetch data from Kaggle and save it as raw data.
* Inspect the data and save it under outputs/datasets/collection.

## Inputs

* Kaggle JSON file - the authentication token.

## Outputs

* Generated dataset: outputs/datasets/collection/house_prices_records.csv

## Additional Comments

* In a work environment, projects rarely use Kaggle data. The data usually comes from multiple sources, hosted either internally (for example, in a data warehouse) or outside the company. For the learning context of this project, we fetch the data from Kaggle.

* In a work environment, data is also not pushed to a public repository, for security reasons. For the learning context of this project only, we host the data in a public repository.


---

## Install Python packages in the notebooks ##

### Change working directory ###

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with os.getcwd().

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\david\\Portfolio 5\\heritage-housing\\jupyter_notebooks'

We want to make the parent of the current directory the new current directory.
* os.path.dirname() gets the parent directory.
* os.chdir() sets the new current directory.

In [2]:
os.chdir(os.path.dirname(current_dir))
print(f"Current directory changed to: {os.getcwd()}")

Current directory changed to: c:\Users\david\Portfolio 5\heritage-housing


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\david\\Portfolio 5\\heritage-housing'
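Re-running cell 2 moves the working directory up one more level each time, which can leave the notebook outside the project folder. A minimal sketch of a guard, assuming the notebooks live in a folder named `jupyter_notebooks` as in the output above; the helper name `move_to_project_root` is hypothetical:

```python
import os
from pathlib import Path


def move_to_project_root(notebook_dir_name: str = "jupyter_notebooks") -> Path:
    """Move to the parent folder only while still inside the notebooks folder.

    `notebook_dir_name` is an assumption taken from the path shown above.
    """
    cwd = Path.cwd()
    if cwd.name == notebook_dir_name:
        os.chdir(cwd.parent)
    return Path.cwd()


print(f"Current directory: {move_to_project_root()}")
```

Calling the helper a second time leaves the directory unchanged, so the cell is safe to re-run.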

# Fetch data from Kaggle 

Install the Kaggle package to fetch the data.

In [4]:
%pip install kaggle

Collecting kaggle
  Downloading kaggle-1.8.3-py3-none-any.whl.metadata (16 kB)
Collecting black>=24.10.0 (from kaggle)
  Downloading black-25.12.0-cp312-cp312-win_amd64.whl.metadata (86 kB)
Collecting kagglesdk<1.0,>=0.1.14 (from kaggle)
  Downloading kagglesdk-0.1.14-py3-none-any.whl.metadata (13 kB)
Collecting mypy>=1.15.0 (from kaggle)
  Downloading mypy-1.19.1-cp312-cp312-win_amd64.whl.metadata (2.3 kB)
Collecting protobuf (from kaggle)
  Downloading protobuf-6.33.3-cp310-abi3-win_amd64.whl.metadata (593 bytes)
Collecting python-slugify (from kaggle)
  Downloading python_slugify-8.0.4-py2.py3-none-any.whl.metadata (8.5 kB)
Collecting tqdm (from kaggle)
  Downloading tqdm-4.67.1-py3-none-any.whl.metadata (57 kB)
Collecting types-requests (from kaggle)
  Downloading types_requests-2.32.4.20260107-py3-none-any.whl.metadata (2.0 kB)
Collecting types-tqdm (from kaggle)
  Downloading types_tqdm-4.67.0.20250809-py3-none-any.whl.metadata (1.7 kB)
Collecting click>=8.0.0 (from black>=24.10.


[notice] A new release of pip is available: 24.3.1 -> 25.3
[notice] To update, run: python.exe -m pip install --upgrade pip


In [6]:
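import os

# Tell the Kaggle CLI to look for kaggle.json in the current (project root) directory
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()

KaggleDatasetPath = "codeinstitute/housing-prices-data"
DestinationFolder = "inputs/datasets/raw"
os.makedirs(DestinationFolder, exist_ok=True)

# Download the dataset archive into the destination folder
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

# Windows only: unzip the archive, then delete the zip file and the kaggle.json token
! powershell -Command "Expand-Archive -Path '{DestinationFolder}\\*.zip' -DestinationPath '{DestinationFolder}'; Remove-Item '{DestinationFolder}\\*.zip'; Remove-Item 'kaggle.json'"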


Dataset URL: https://www.kaggle.com/datasets/codeinstitute/housing-prices-data
License(s): unknown
housing-prices-data.zip: Skipping, found more recently modified local copy (use --force to force download)
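The PowerShell command above only works on Windows. As a portable alternative, the unzip-and-delete step could be sketched with the standard library `zipfile` module. The helper name `extract_and_remove_zips` is hypothetical, and unlike the cell above it does not delete `kaggle.json`:

```python
import zipfile
from pathlib import Path


def extract_and_remove_zips(folder: str) -> list[str]:
    """Extract every .zip file in `folder` into that folder, then delete the archive."""
    extracted = []
    for zip_path in sorted(Path(folder).glob("*.zip")):
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(folder)
            extracted.extend(archive.namelist())
        zip_path.unlink()  # remove the archive once its contents are extracted
    return extracted
```

In this notebook it would be called as `extract_and_remove_zips(DestinationFolder)`.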


---

# Load and Inspect Kaggle data

Load the raw CSV file into a pandas DataFrame and preview the first rows.

In [18]:
%pip install pandas
import pandas as pd

df = pd.read_csv(r"inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv")
df.head()






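|   | 1stFlrSF | 2ndFlrSF | BedroomAbvGr | BsmtExposure | BsmtFinSF1 | BsmtFinType1 | BsmtUnfSF | EnclosedPorch | GarageArea | GarageFinish | ... | LotFrontage | MasVnrArea | OpenPorchSF | OverallCond | OverallQual | TotalBsmtSF | WoodDeckSF | YearBuilt | YearRemodAdd | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 856 | 854.0 | 3.0 | No | 706 | GLQ | 150 | 0.0 | 548 | RFn | ... | 65.0 | 196.0 | 61 | 5 | 7 | 856 | 0.0 | 2003 | 2003 | 208500 |
| 1 | 1262 | 0.0 | 3.0 | Gd | 978 | ALQ | 284 | NaN | 460 | RFn | ... | 80.0 | 0.0 | 0 | 8 | 6 | 1262 | NaN | 1976 | 1976 | 181500 |
| 2 | 920 | 866.0 | 3.0 | Mn | 486 | GLQ | 434 | 0.0 | 608 | RFn | ... | 68.0 | 162.0 | 42 | 5 | 7 | 920 | NaN | 2001 | 2002 | 223500 |
| 3 | 961 | NaN | NaN | No | 216 | ALQ | 540 | NaN | 642 | Unf | ... | 60.0 | 0.0 | 35 | 5 | 7 | 756 | NaN | 1915 | 1970 | 140000 |
| 4 | 1145 | NaN | 4.0 | Av | 655 | GLQ | 490 | 0.0 | 836 | RFn | ... | 84.0 | 350.0 | 84 | 5 | 8 | 1145 | NaN | 2000 | 2000 | 250000 |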


In [19]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 24 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   1stFlrSF       1460 non-null   int64  
 1   2ndFlrSF       1374 non-null   float64
 2   BedroomAbvGr   1361 non-null   float64
 3   BsmtExposure   1422 non-null   object 
 4   BsmtFinSF1     1460 non-null   int64  
 5   BsmtFinType1   1315 non-null   object 
 6   BsmtUnfSF      1460 non-null   int64  
 7   EnclosedPorch  136 non-null    float64
 8   GarageArea     1460 non-null   int64  
 9   GarageFinish   1225 non-null   object 
 10  GarageYrBlt    1379 non-null   float64
 11  GrLivArea      1460 non-null   int64  
 12  KitchenQual    1460 non-null   object 
 13  LotArea        1460 non-null   int64  
 14  LotFrontage    1201 non-null   float64
 15  MasVnrArea     1452 non-null   float64
 16  OpenPorchSF    1460 non-null   int64  
 17  OverallCond    1460 non-null   int64  
 18  OverallQ
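The `df.info()` output shows that several columns have fewer than 1460 non-null values. A minimal sketch of how the missing values could be summarised per column; the helper name `missing_summary` is hypothetical, and the toy frame only reuses a few values from the preview above:

```python
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values for columns that have any."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(1),
    })
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy frame built from a few values in the preview above, for illustration only.
sample = pd.DataFrame({
    "1stFlrSF": [856, 1262, 920, 961, 1145],
    "2ndFlrSF": [854.0, 0.0, 866.0, None, None],
    "WoodDeckSF": [0.0, None, None, None, None],
})
print(missing_summary(sample))
```

On the full dataset the same helper would be called as `missing_summary(df)`.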

We want to check if any houses appear multiple times.

In [20]:
duplicates = df[df.duplicated()]
print(f"Number of duplicate rows: {len(duplicates)}")

Number of duplicate rows: 0
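By default, `DataFrame.duplicated()` flags only the second and later copies of a row, comparing all columns. If duplicates were found, passing `keep=False` would show every row involved. A small sketch on made-up values (not taken from the housing dataset):

```python
import pandas as pd

# Made-up data for illustration only: rows 0 and 2 are identical.
toy = pd.DataFrame({
    "LotArea": [1000, 2000, 1000],
    "SalePrice": [100000, 150000, 100000],
})

later_copies = toy[toy.duplicated()]          # only the repeated row (index 2)
all_copies = toy[toy.duplicated(keep=False)]  # every row involved (index 0 and 2)

print(f"Repeated rows: {len(later_copies)}, rows involved: {len(all_copies)}")
```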


---

# Push files to Repo

In [23]:
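import os

# Create the output folder if it does not exist yet
os.makedirs('outputs/datasets/collection', exist_ok=True)

# Save without the index so that no extra index column is written to the file
df.to_csv('outputs/datasets/collection/house_prices_records.csv', index=False)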
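To confirm that the file was written as expected, it can be read back and compared with the original frame. A minimal sketch, assuming a small stand-in frame and a temporary folder in place of the notebook's `df` and `outputs/datasets/collection`:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in for the notebook's `df`; a temporary folder replaces the real output folder.
frame = pd.DataFrame({"1stFlrSF": [856, 1262], "SalePrice": [208500, 181500]})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "house_prices_records.csv"
    frame.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# With index=False the reloaded frame has the same shape and columns as the original.
assert reloaded.shape == frame.shape
assert list(reloaded.columns) == list(frame.columns)
```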

