_This project was developed independently as part of Code Institute’s Predictive Analytics Project. Any datasets or templates used are openly provided by the course or via public sources like Kaggle. All commentary and code logic are my own._

# Notebook 01: Data Collection

## Objectives
The purpose of this notebook is to gather, load, and perform a preliminary inspection of the raw data for the Heritage Housing project. This step ensures we're working with valid data sources, understand their structure, and prepare for cleaning and wrangling in the next phase.


## Source Files
- Kaggle JSON Authentication Token: Required to access the Kaggle API.

- We are using three key files from the Kaggle dataset:

  - house_prices_records.csv: Historical house sale records.

  - inherited_houses.csv: List of inherited houses we want to estimate prices for.

  - house-metadata.txt: Describes the features present in both CSVs.

    These were placed in the /data folder for structured access.

### Change Working Directory

- To allow smooth access to the data files, we need to adjust our working directory.
- Since this notebook lives in a subfolder (e.g. jupyter_notebooks), we change the working directory from the notebook's folder to its parent folder, the project root.

In [8]:
import os

# Display the current directory
current_dir = os.getcwd()
print(f"Current directory: {current_dir}")

# Move to the parent directory
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

# Confirm the change
current_dir = os.getcwd()
print(f"New directory: {current_dir}")

Current directory: /workspaces/heritage_housing/jupyter_notebooks
You set a new current directory
New directory: /workspaces/heritage_housing
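
Because the cell above moves up one level every time it runs, re-running it would step out of the project folder. A minimal sketch of a repeatable alternative follows. It assumes the project root is the folder that contains `jupyter_notebooks`; the helper name `find_project_root` is hypothetical and not part of the original notebook.

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "jupyter_notebooks") -> Path:
    """Walk upwards from `start` until a folder containing `marker` is found.

    `marker` is an assumption: the name of the notebooks subfolder.
    """
    start = start.resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains '{marker}'")


# In the notebook this would be used as:
#   import os
#   os.chdir(find_project_root(Path.cwd()))
```

Calling it from the project root or from the notebooks folder returns the same path, so the cell can be run more than once without changing the result.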


### Fetch Dataset from Kaggle

- To keep the workflow reproducible and professional, we will use Kaggle’s API to programmatically download the dataset.

Setup: Install Kaggle and Authenticate

In [9]:
# Install Kaggle CLI (if not already installed)
!pip3 install kaggle==1.5.12

Collecting kaggle==1.5.12
  Downloading kaggle-1.5.12.tar.gz (58 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: kaggle
  Building wheel for kaggle (setup.py) ... done
  Created wheel for kaggle: filename=kaggle-1.5.12-py3-none-any.whl size=73026 sha256=e6b97c9f4058ea4c5d8989bfaf3288727a6d55898d93a95f9bbf7e65371f76a4
  Stored in directory: /home/cistudent/.cache/pip/wheels/f5/69/4d/d701fc604b9fb09be59718b4056fd5556a22588ce1f25dd090
Successfully built kaggle
Installing collected packages: kaggle
  Attempting uninstall: kaggle
    Found existing installation: kaggle 1.7.4.2
    Uninstalling kaggle-1.7.4.2:
      Successfully uninstalled kaggle-1.7.4.2
Successfully installed kaggle-1.5.12

- You must first create a Kaggle account and generate an API token from your account settings. This will download a kaggle.json file.
- Move kaggle.json to the root directory of this repo.
- Run the below to register the token and adjust permissions:

In [1]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
!chmod 600 kaggle.json  # Secure the file

chmod: cannot access 'kaggle.json': No such file or directory
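
The `chmod` error above appears when `kaggle.json` is not in the working directory. A small guard can make that failure explicit before the CLI is called. This is a sketch that assumes the token should sit in the folder passed in; `register_kaggle_token` is a hypothetical helper, not part of the Kaggle package.

```python
import os
import stat
from pathlib import Path


def register_kaggle_token(config_dir: Path) -> bool:
    """Point the Kaggle client at `config_dir` and restrict kaggle.json to its owner.

    Returns True if the token file was found, False otherwise.
    """
    token = config_dir / "kaggle.json"
    if not token.is_file():
        print(f"kaggle.json not found in {config_dir}; skipping authentication setup")
        return False
    os.environ["KAGGLE_CONFIG_DIR"] = str(config_dir)
    # Owner read/write only, the same permissions as `chmod 600`
    token.chmod(stat.S_IRUSR | stat.S_IWUSR)
    return True
```

In the notebook it could be called as `register_kaggle_token(Path.cwd())` after moving `kaggle.json` to the project root.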


### Download the Dataset

- We now fetch the dataset using the Kaggle CLI. The dataset path below identifies the "Heritage Housing Predictor" data hosted on Kaggle.

In [2]:
KaggleDatasetPath = "codeinstitute/housing-prices-data"
DestinationFolder = "data/raw"

# `kaggle datasets download -d <owner>/<dataset>` fetches the dataset as a zip archive
!kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Traceback (most recent call last):
  File "/home/cistudent/.local/bin/kaggle", line 5, in <module>
    from kaggle.cli import main
  File "/home/cistudent/.local/lib/python3.12/site-packages/kaggle/__init__.py", line 23, in <module>
    api.authenticate()
  File "/home/cistudent/.local/lib/python3.12/site-packages/kaggle/api/kaggle_api_extended.py", line 164, in authenticate
    raise IOError('Could not find {}. Make sure it\'s located in'
OSError: Could not find kaggle.json. Make sure it's located in /workspaces/heritage_housing/jupyter_notebooks. Or use the environment method.


- Unzip the Downloaded File

In [3]:
import zipfile
from pathlib import Path

for zip_file in Path(DestinationFolder).glob("*.zip"):
    with zipfile.ZipFile(zip_file, 'r') as z:
        z.extractall(DestinationFolder)
    zip_file.unlink()  # Delete zip file after extraction

### Load Required Libraries

In [4]:
import pandas as pd
import numpy as np
pd.options.display.max_columns = None
pd.options.display.max_rows = None

### Load Datasets

In [11]:
# Paths are relative to the project root set in "Change Working Directory"

# Load house sales records
house_prices_df = pd.read_csv('data/raw/house_prices_records.csv')

# Load inherited houses data
inherited_df = pd.read_csv('data/raw/inherited_houses.csv')

# Alias used by the Dataframe Summary cell, avoids reading the same file twice
df = house_prices_df

### Quick Peek at Each Dataset
- This gives us a rough idea of the data shape and the kind of features we’ll be dealing with.

In [12]:
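from IPython.display import display

print("\n--- House Prices Data ---")
print(house_prices_df.shape)
print(house_prices_df.columns)
# Only the last expression in a cell is rendered, so display this one explicitly
display(house_prices_df.head())

print("\n--- Inherited Houses Data ---")
print(inherited_df.shape)
print(inherited_df.columns)
inherited_df.head()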


--- House Prices Data ---
(1460, 24)
Index(['1stFlrSF', '2ndFlrSF', 'BedroomAbvGr', 'BsmtExposure', 'BsmtFinSF1',
       'BsmtFinType1', 'BsmtUnfSF', 'EnclosedPorch', 'GarageArea',
       'GarageFinish', 'GarageYrBlt', 'GrLivArea', 'KitchenQual', 'LotArea',
       'LotFrontage', 'MasVnrArea', 'OpenPorchSF', 'OverallCond',
       'OverallQual', 'TotalBsmtSF', 'WoodDeckSF', 'YearBuilt', 'YearRemodAdd',
       'SalePrice'],
      dtype='object')

--- Inherited Houses Data ---
(4, 23)
Index(['1stFlrSF', '2ndFlrSF', 'BedroomAbvGr', 'BsmtExposure', 'BsmtFinSF1',
       'BsmtFinType1', 'BsmtUnfSF', 'EnclosedPorch', 'GarageArea',
       'GarageFinish', 'GarageYrBlt', 'GrLivArea', 'KitchenQual', 'LotArea',
       'LotFrontage', 'MasVnrArea', 'OpenPorchSF', 'OverallCond',
       'OverallQual', 'TotalBsmtSF', 'WoodDeckSF', 'YearBuilt',
       'YearRemodAdd'],
      dtype='object')


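| | 1stFlrSF | 2ndFlrSF | BedroomAbvGr | BsmtExposure | BsmtFinSF1 | BsmtFinType1 | BsmtUnfSF | EnclosedPorch | GarageArea | GarageFinish | GarageYrBlt | GrLivArea | KitchenQual | LotArea | LotFrontage | MasVnrArea | OpenPorchSF | OverallCond | OverallQual | TotalBsmtSF | WoodDeckSF | YearBuilt | YearRemodAdd |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 896 | 0 | 2 | No | 468.0 | Rec | 270.0 | 0 | 730.0 | Unf | 1961.0 | 896 | TA | 11622 | 80.0 | 0.0 | 0 | 6 | 5 | 882.0 | 140 | 1961 | 1961 |
| 1 | 1329 | 0 | 3 | No | 923.0 | ALQ | 406.0 | 0 | 312.0 | Unf | 1958.0 | 1329 | Gd | 14267 | 81.0 | 108.0 | 36 | 6 | 6 | 1329.0 | 393 | 1958 | 1958 |
| 2 | 928 | 701 | 3 | No | 791.0 | GLQ | 137.0 | 0 | 482.0 | Fin | 1997.0 | 1629 | TA | 13830 | 74.0 | 0.0 | 34 | 5 | 5 | 928.0 | 212 | 1997 | 1998 |
| 3 | 926 | 678 | 3 | No | 602.0 | GLQ | 324.0 | 0 | 470.0 | Fin | 1998.0 | 1604 | Gd | 9978 | 78.0 | 20.0 | 36 | 6 | 6 | 926.0 | 360 | 1998 | 1998 |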


### Dataframe Summary
- The `.info()` method can now be called on the DataFrame to print a summary of its columns, non-null counts, and data types.


In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 24 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   1stFlrSF       1460 non-null   int64  
 1   2ndFlrSF       1374 non-null   float64
 2   BedroomAbvGr   1361 non-null   float64
 3   BsmtExposure   1422 non-null   object 
 4   BsmtFinSF1     1460 non-null   int64  
 5   BsmtFinType1   1315 non-null   object 
 6   BsmtUnfSF      1460 non-null   int64  
 7   EnclosedPorch  136 non-null    float64
 8   GarageArea     1460 non-null   int64  
 9   GarageFinish   1225 non-null   object 
 10  GarageYrBlt    1379 non-null   float64
 11  GrLivArea      1460 non-null   int64  
 12  KitchenQual    1460 non-null   object 
 13  LotArea        1460 non-null   int64  
 14  LotFrontage    1201 non-null   float64
 15  MasVnrArea     1452 non-null   float64
 16  OpenPorchSF    1460 non-null   int64  
 17  OverallCond    1460 non-null   int64  
 18  OverallQ

## Our Initial Observations

- house_prices_records.csv contains 1,460 rows and 24 columns. The target variable SalePrice is included.

- inherited_houses.csv contains 4 rows and 23 columns: the same features as the sales records, without the `SalePrice` column.

- No explicit ID column is shared between the datasets, so merging isn’t directly possible.

- Features include both numeric (e.g., `LotArea`, `YearBuilt`) and categorical data (e.g., `BsmtExposure`, `KitchenQual`).

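- `df.info()` reports missing values in several columns of house_prices_records.csv, including `GarageYrBlt` (1,379 of 1,460 non-null), `LotFrontage` (1,201 non-null), and the basement features `BsmtExposure` and `BsmtFinType1`.

- Column names match between the datasets apart from `SalePrice`, which suggests they are aligned in structure.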
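
The column comparison behind these observations can be made explicit rather than read off the printed indexes. The sketch below uses a hypothetical helper, `compare_columns`, and a shortened, illustrative subset of the column names; it accepts any iterable of names, including a DataFrame's `.columns`.

```python
def compare_columns(reference_cols, other_cols):
    """Return the column names that appear in only one of the two collections."""
    reference_cols = list(reference_cols)
    other_cols = list(other_cols)
    return {
        "only_in_reference": [c for c in reference_cols if c not in other_cols],
        "only_in_other": [c for c in other_cols if c not in reference_cols],
    }


# Shortened example lists (assumption: a subset of the real columns, for illustration)
records_cols = ["1stFlrSF", "LotArea", "YearBuilt", "SalePrice"]
inherited_cols = ["1stFlrSF", "LotArea", "YearBuilt"]

diff = compare_columns(records_cols, inherited_cols)
print(diff)
```

In the notebook the call would be `compare_columns(house_prices_df.columns, inherited_df.columns)`.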

## Summary of Actions Completed

- Changed working directory to project root.

- Installed Kaggle CLI and configured authentication.

- Programmatically downloaded and extracted all raw data from Kaggle.

- Loaded raw datasets and confirmed structural consistency.

- Inspected column names, dimensions, and got a feel for data types and formats.

- Noted down initial findings to guide the cleaning strategy.

## Next Steps

Our acquired raw datasets are now prepared for:


### Preprocessing and Data Cleaning:

- Address missing values for columns such as `LotFrontage`, `GarageYrBlt`, and `BedroomAbvGr` in house_prices_records.csv.
- Examine columns with many null values, such as `EnclosedPorch` (only 136 non-null values out of 1,460) and `WoodDeckSF`, and decide whether they are worth keeping.
- To make integration and analysis easier, align the two datasets' layout and structure.
- Where there may be differences, standardize the types of columns (floats vs. integers, for example).
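
To help prioritise the cleaning work, the non-null counts from the `df.info()` output can be turned into missing-value shares. The sketch below hard-codes only the counts visible in the truncated output (columns 0 to 17), so columns past that point, such as `WoodDeckSF`, are not covered; `missing_share` is a hypothetical helper name.

```python
TOTAL_ROWS = 1460  # from RangeIndex: 1460 entries

# Non-null counts copied from the visible part of the df.info() output
non_null_counts = {
    "2ndFlrSF": 1374,
    "BedroomAbvGr": 1361,
    "BsmtExposure": 1422,
    "BsmtFinType1": 1315,
    "EnclosedPorch": 136,
    "GarageFinish": 1225,
    "GarageYrBlt": 1379,
    "LotFrontage": 1201,
    "MasVnrArea": 1452,
}


def missing_share(non_null, total):
    """Return (column, missing count, missing %) tuples, most incomplete first."""
    rows = [
        (col, total - count, round(100 * (total - count) / total, 1))
        for col, count in non_null.items()
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


summary = missing_share(non_null_counts, TOTAL_ROWS)
for col, missing, pct in summary:
    print(f"{col:<14} {missing:>5} missing ({pct}%)")
```

With the loaded DataFrame, `df.isna().sum()` gives the same counts for every column without copying numbers by hand.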

### EDA (Exploratory Data Analysis):

- To find important features, look into the relationships between house attributes and sale prices in house_prices_records.csv.
- Use visual aids, such as heatmaps or scatter plots, to help direct feature selection and model construction.
- Examine any connections that might exist between the data in the larger dataset and the characteristics of the inherited homes.

