_This project was developed independently as part of Code Institute’s Predictive Analytics Project. Any datasets or templates used are openly provided by the course or via public sources like Kaggle. All commentary and code logic are my own._

# Notebook 01: Data Collection

## Objectives
The purpose of this notebook is to gather, load, and perform a preliminary inspection of the raw data for the Heritage Housing project. This step confirms that we are working with valid data sources, builds an understanding of their structure, and prepares for the cleaning and wrangling in the next phase.


### Inputs  
- Kaggle API token (`kaggle.json`)  
- Dataset from Kaggle: Ames Housing (Code Institute)

### Outputs  
- `data/raw/house_prices_records.csv`  
- `data/raw/inherited_houses.csv`  
- `data/raw/house-metadata.txt`


### Change Working Directory

- To access the data files with stable relative paths, the working directory must be the project root.
- Because this notebook lives in a subfolder (e.g. `jupyter_notebooks`), we change the working directory from that folder to the project root, its parent folder.

In [None]:
"""
Heritage Housing – Data Collection Notebook

Purpose:
This notebook automates the initial data collection process:
- Installs and configures the Kaggle CLI
- Downloads the housing dataset from Kaggle
- Extracts the dataset into the project's raw data directory
- Loads house price and inherited property data into memory
- Provides initial shape and column inspection for both datasets
"""

import os
# Smart Working Directory Setup 
project_root = '/workspaces/heritage_housing'
if os.getcwd() != project_root:
    try:
        os.chdir(project_root)
        print(f"[INFO] Changed working directory to project root: {os.getcwd()}")
    except FileNotFoundError:
        raise FileNotFoundError(f"[ERROR] Project root '{project_root}' not found!")
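
The cell above relies on a hardcoded path. As a hedged alternative, the project root could be located by walking up from the current folder until a known marker is found. The helper name `find_project_root` and the `README.md` marker are assumptions for illustration, not part of the original notebook:

```python
import tempfile
from pathlib import Path
from typing import Optional


def find_project_root(start: Path, marker: str) -> Optional[Path]:
    """Return the closest directory at or above `start` that contains `marker`."""
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    return None  # No ancestor contains the marker


# Demo on a throwaway directory tree (all names here are made up)
with tempfile.TemporaryDirectory() as tmp:
    demo_root = Path(tmp).resolve()
    (demo_root / "README.md").write_text("demo")
    notebook_dir = demo_root / "jupyter_notebooks"
    notebook_dir.mkdir()
    found_root = find_project_root(notebook_dir, marker="README.md")
    found_matches = found_root == demo_root

print(found_matches)
```

In the notebook itself, the result of `find_project_root(Path.cwd(), marker="README.md")` could be passed to `os.chdir` whenever it is not `None`, which would keep the setup working outside `/workspaces/heritage_housing`.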

### Fetch Dataset from Kaggle

- To keep the workflow reproducible and professional, we will use Kaggle’s API to programmatically download the dataset.

Setup: Install Kaggle and Authenticate

In [None]:
# Install Kaggle CLI (if not already installed)
!pip3 install kaggle==1.5.12

- You must first create a Kaggle account and generate an API token from your account settings. This downloads a `kaggle.json` file.
- Move `kaggle.json` to the root directory of this repo.
- Run the cell below to register the token and restrict the file's permissions:

In [None]:
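import os

# Point the Kaggle CLI at the project root, where kaggle.json is expected
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
# Fail early with a clear message before touching the file
assert os.path.exists('kaggle.json'), "[ERROR] kaggle.json not found. Please place it in the project root."
! chmod 600 kaggle.json  # Restrict read/write access to the file owner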
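
The existence check does not catch a malformed token. A minimal sketch of a stricter check follows, assuming the usual token layout of a JSON object with `username` and `key` fields. The helper name `check_kaggle_token` is hypothetical, and the demo credentials are made up:

```python
import json
import tempfile
from pathlib import Path


def check_kaggle_token(path="kaggle.json"):
    """Return a list of problems found in the token file (empty if none)."""
    token_path = Path(path)
    if not token_path.is_file():
        return [f"{token_path} not found"]
    try:
        token = json.loads(token_path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        return [f"{token_path} is not valid JSON: {exc}"]
    if not isinstance(token, dict):
        return [f"{token_path} does not contain a JSON object"]
    # Both fields must be present and non-empty
    return [
        f"missing or empty field: {field}"
        for field in ("username", "key")
        if not token.get(field)
    ]


# Demo on a throwaway file with a made-up, incomplete token
with tempfile.TemporaryDirectory() as tmp:
    demo_token = Path(tmp) / "kaggle.json"
    demo_token.write_text(json.dumps({"username": "example_user"}))
    demo_problems = check_kaggle_token(demo_token)

print(demo_problems)  # ['missing or empty field: key']
```

An empty list means the file looks usable. It does not prove that Kaggle will accept the credentials.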

### Download the Dataset

- We now fetch the dataset using the CLI. This project uses the `codeinstitute/housing-prices-data` dataset (Ames Housing data provided by Code Institute), which is hosted on Kaggle as a dataset rather than a competition.

In [None]:
KaggleDatasetPath = "codeinstitute/housing-prices-data"
DestinationFolder = "data/raw"

! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

- Unzip the downloaded file, then delete the archive.

In [None]:
import zipfile
from pathlib import Path

for zip_file in Path(DestinationFolder).glob("*.zip"):
    with zipfile.ZipFile(zip_file, 'r') as z:
        z.extractall(DestinationFolder)
    zip_file.unlink()  # Delete zip file after extraction
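
After extraction, it may be worth confirming that the files named in the Outputs section are actually present. The sketch below assumes they sit directly in the destination folder, as the load step expects. The helper name `missing_raw_files` is hypothetical:

```python
import tempfile
from pathlib import Path

# File names taken from the Outputs list of this notebook
EXPECTED_FILES = (
    "house_prices_records.csv",
    "inherited_houses.csv",
    "house-metadata.txt",
)


def missing_raw_files(folder, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from `folder`."""
    folder = Path(folder)
    return [name for name in expected if not (folder / name).is_file()]


# Demo on a throwaway folder that holds only one of the three files
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "house_prices_records.csv").write_text("SalePrice\n")
    demo_missing = missing_raw_files(tmp)

print(demo_missing)  # ['inherited_houses.csv', 'house-metadata.txt']
```

Calling `missing_raw_files("data/raw")` in the notebook should return an empty list if the archive unpacked as expected. A non-empty result may mean the archive extracted into a subfolder.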

### Load Required Libraries

In [None]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Load Datasets

In [None]:
# Load house sales records from the CSV file into a pandas DataFrame
house_prices_df = pd.read_csv('data/raw/house_prices_records.csv')

# Generic alias used for structure inspection only (df.info(), etc.);
# it points at the same DataFrame instead of reading the file a second time
df = house_prices_df

# Load inherited houses data
inherited_df = pd.read_csv('data/raw/inherited_houses.csv')

### Quick Peek at Each Dataset
- This gives us a rough idea of the data shape and the kind of features we’ll be dealing with.

In [None]:
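from IPython.display import display

print("\n--- House Prices Data ---")
print(house_prices_df.shape)
print(house_prices_df.columns)
# display() is required here: Jupyter only renders a cell's last expression automatically
display(house_prices_df.head())

print("\n--- Inherited Houses Data ---")
print(inherited_df.shape)
print(inherited_df.columns)
display(inherited_df.head())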

### Dataframe Summary
- The `.info()` method can now be called on the DataFrame to print its summary: the column names, the non-null count for each column, and the data types.


In [None]:
df.info()

## Our Initial Observations

- `house_prices_records.csv` contains 1,460 rows and 24 columns. The target variable `SalePrice` is included.

- `inherited_houses.csv` contains 4 rows and 23 columns: the same features, minus the `SalePrice` column.

- No explicit ID column is shared between the datasets, so merging isn’t directly possible.

- Features include both numeric (e.g., `LotArea`, `YearBuilt`) and categorical data (e.g., `BsmtExposure`, `KitchenQual`).

- Nulls are likely in columns such as `GarageYrBlt` and `LotFrontage`, and in the basement features.

- Column names match between the datasets, which suggests they are aligned in structure.
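
The claim that the two datasets share the same columns apart from `SalePrice` can be checked rather than eyeballed. Here is a minimal sketch using plain Python collections. The helper name `compare_columns` is hypothetical, and the column lists in the demo are shortened stand-ins for the real ones:

```python
def compare_columns(left, right):
    """Report column names that appear in only one of two collections."""
    left, right = list(left), list(right)
    left_set, right_set = set(left), set(right)
    return {
        "only_in_left": [name for name in left if name not in right_set],
        "only_in_right": [name for name in right if name not in left_set],
    }


# Demo with abbreviated, illustrative column lists
demo_diff = compare_columns(
    ["LotArea", "YearBuilt", "SalePrice"],
    ["LotArea", "YearBuilt"],
)
print(demo_diff)  # {'only_in_left': ['SalePrice'], 'only_in_right': []}
```

In the notebook, `compare_columns(house_prices_df.columns, inherited_df.columns)` would be expected to report only `SalePrice` on the left side if the observation above holds.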

## Summary of Actions Completed

- Changed working directory to project root.

- Installed Kaggle CLI and configured authentication.

- Programmatically downloaded and extracted all raw data from Kaggle.

- Loaded raw datasets and confirmed structural consistency.

- Inspected column names, dimensions, and got a feel for data types and formats.

- Noted down initial findings to guide the cleaning strategy.

## Next Steps

The raw datasets are now ready for:


### Preprocessing and Data Cleaning:

- Address missing values in columns such as `LotFrontage`, `GarageYrBlt`, and `BedroomAbvGr` in `house_prices_records.csv`.
- Examine columns with many null values, such as `EnclosedPorch` and `WoodDeckSF`, and decide whether they are worth keeping.
- Align the layout and structure of the two datasets to make integration and analysis easier.
- Standardize column types where the datasets differ (for example, floats vs. integers).
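
As a starting point for the cleaning steps listed above, the missing values can be tabulated per column. The following is a sketch, not the method used in the next notebook. The helper name `missing_summary` is hypothetical, and the demo frame contains made-up values:

```python
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Summarize the columns of `frame` that contain missing values."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(1),
        "dtype": frame.dtypes.astype(str),
    })
    # Keep only columns with at least one null, worst first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Demo frame with made-up values (not taken from the real dataset)
demo_frame = pd.DataFrame({
    "LotFrontage": [65.0, None, 80.0, None],
    "LotArea": [8450, 9600, 11250, 9550],
})
demo_summary = missing_summary(demo_frame)
print(demo_summary)
```

Running `missing_summary(house_prices_df)` and `missing_summary(inherited_df)` side by side would also show where the column dtypes differ between the two files.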

### EDA (Exploratory Data Analysis):

- Investigate the relationships between house attributes and sale prices in `house_prices_records.csv` to identify important features.
- Use visualizations such as heatmaps and scatter plots to guide feature selection and model building.
- Examine how the characteristics of the inherited houses compare with the data in the larger dataset.
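
One simple way to begin the relationship analysis is to rank the numeric features by their Pearson correlation with `SalePrice`. The sketch below assumes that approach. The helper name `correlation_with_target` is hypothetical, and the demo values are made up:

```python
import pandas as pd


def correlation_with_target(frame: pd.DataFrame, target: str = "SalePrice") -> pd.Series:
    """Rank numeric columns by absolute Pearson correlation with `target`."""
    numeric = frame.select_dtypes(include="number")  # Drops categorical columns
    corr = numeric.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Demo frame with made-up values (not taken from the real dataset)
demo_houses = pd.DataFrame({
    "GrLivArea": [1000, 1500, 2000, 2500],
    "KitchenQual": ["TA", "Gd", "Gd", "Ex"],
    "SalePrice": [100000, 150000, 200000, 250000],
})
demo_corr = correlation_with_target(demo_houses)
print(demo_corr)
```

Categorical columns such as `KitchenQual` are skipped here and would need encoding before they could be compared in the same way.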

