# Data Collection Notebook - Heritage Housing Issues

## Objectives
- Fetch the Ames Housing dataset from Kaggle and save it as raw data.
- Inspect the dataset to confirm successful collection.
- Store the dataset under `data/raw` for further analysis.

## Inputs
- **Kaggle JSON File:** Authentication token for Kaggle API.
- **Kaggle Dataset:** `codeinstitute/housing-prices-data`

## Outputs
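- **Raw Datasets:** `house_prices_records.csv` and `inherited_houses.csv` under `data/raw/house-price-20211124T154130Z-001/house-price/`, plus `data/raw/house-metadata.txt`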

## Additional Comments
- In real-world projects, data is often hosted in internal data warehouses or external APIs rather than Kaggle.
- For this project, we are using Kaggle to simulate a **data collection process**.
- Ensure that your **Kaggle API key is not pushed to a public repository** for security reasons.
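
One way to act on the last point is to make sure Git ignores the token file before it is ever copied into the project. The sketch below is a minimal illustration, assuming the token is named `kaggle.json` and that a `.gitignore` in the project root is the right place for the rule; the helper name `ensure_ignored` is hypothetical.

```python
from pathlib import Path


def ensure_ignored(pattern: str, gitignore: Path = Path(".gitignore")) -> bool:
    """Append `pattern` to a .gitignore file unless it is already listed.

    Returns True if the file was modified, False if the pattern was present.
    """
    text = gitignore.read_text() if gitignore.exists() else ""
    if pattern in (line.strip() for line in text.splitlines()):
        return False
    # Start on a fresh line in case the file lacks a trailing newline.
    prefix = "" if text == "" or text.endswith("\n") else "\n"
    with gitignore.open("a") as fh:
        fh.write(prefix + pattern + "\n")
    return True
```

Calling `ensure_ignored("kaggle.json")` from the project root would add the rule once and do nothing on later runs.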


---

# Change working directory

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with `os.getcwd()`

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspaces/Heritage-Housing-Issues/jupyter_notebooks'

We want to make the parent of the current directory the new current directory.
* `os.path.dirname()` gets the parent directory
* `os.chdir()` sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspaces/Heritage-Housing-Issues'
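
The same move can be written with `pathlib`. This is a sketch rather than the notebook's method: it assumes, as above, that the notebook lives in a folder named `jupyter_notebooks`, and the helper name `project_root_from` is hypothetical. The name check makes the cell safe to re-run, whereas re-running the cells above moves up another level each time.

```python
import os
from pathlib import Path


def project_root_from(cwd: Path, notebooks_dir: str = "jupyter_notebooks") -> Path:
    """Return the parent of `cwd` if `cwd` is the notebooks folder, else `cwd` unchanged."""
    return cwd.parent if cwd.name == notebooks_dir else cwd


# Only climbs one level when we are actually inside the notebooks folder.
os.chdir(project_root_from(Path.cwd()))
print("Current directory:", Path.cwd())
```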

## Install Python Packages
We will install the `kaggle` package to download datasets programmatically.

In [4]:
%pip install kaggle --quiet

Note: you may need to restart the kernel to use updated packages.


---

## Kaggle Authentication

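We authenticate with a `kaggle.json` token placed in the project root (the current working directory) and point `KAGGLE_CONFIG_DIR` at that folder so the Kaggle client can find it.  
This avoids hardcoding keys in the notebook, and `chmod 600` restricts the token file to its owner.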

In [5]:
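import os
# Point the Kaggle client at the folder that holds kaggle.json (the project root)
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
# Restrict the token file to owner read/write only
! chmod 600 kaggle.json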
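
Before calling the Kaggle CLI it can help to confirm that the token really is where `KAGGLE_CONFIG_DIR` points. The following is a minimal sketch of the same setup in pure Python; the helper name `prepare_token` is hypothetical, and it assumes a POSIX file system (as in a Codespace), where `os.chmod` with owner read/write bits has the same effect as `chmod 600`.

```python
import os
import stat
from pathlib import Path


def prepare_token(config_dir: Path, filename: str = "kaggle.json") -> Path:
    """Check that the token exists, lock down its permissions, and
    point KAGGLE_CONFIG_DIR at the folder that contains it."""
    token = config_dir / filename
    if not token.is_file():
        raise FileNotFoundError(f"Expected Kaggle token at {token}")
    # Owner read/write only, equivalent to `chmod 600`.
    os.chmod(token, stat.S_IRUSR | stat.S_IWUSR)
    # Tell the Kaggle client where to look for the token.
    os.environ["KAGGLE_CONFIG_DIR"] = str(config_dir)
    return token
```

In this notebook the call would be `prepare_token(Path.cwd())`, which fails early with a clear message if the token has not been uploaded yet.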

In [6]:
KaggleDatasetPath = "codeinstitute/housing-prices-data"
DestinationFolder = "data/raw"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Dataset URL: https://www.kaggle.com/datasets/codeinstitute/housing-prices-data
License(s): unknown
Downloading housing-prices-data.zip to data/raw
  0%|                                               | 0.00/49.6k [00:00<?, ?B/s]
100%|███████████████████████████████████████| 49.6k/49.6k [00:00<00:00, 215MB/s]


Unzip the downloaded file, then delete the zip file and the `kaggle.json` file containing the API token.

In [7]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json

Archive:  data/raw/housing-prices-data.zip
  inflating: data/raw/house-metadata.txt  
  inflating: data/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv  
  inflating: data/raw/house-price-20211124T154130Z-001/house-price/inherited_houses.csv  
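
The shell command above depends on `unzip` being installed. A portable alternative, sketched here with the standard-library `zipfile` module, extracts every archive in a folder and then deletes it. The function name `extract_and_remove` is hypothetical, and removing `kaggle.json` is deliberately left out, so that would still need its own step.

```python
import zipfile
from pathlib import Path


def extract_and_remove(folder: Path) -> list[str]:
    """Extract every .zip in `folder` into `folder`, delete the archives,
    and return the names of the extracted members."""
    extracted: list[str] = []
    for archive in sorted(folder.glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(folder)
            extracted.extend(zf.namelist())
        # Reached only if extraction did not raise, so a bad archive is kept.
        archive.unlink()
    return extracted
```

With the variables from the download cell, the call would be `extract_and_remove(Path(DestinationFolder))`.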


---

# **Data Inspection**

In this step, we will:
1. Load the datasets from our `data/raw` folder.
2. Inspect their structure, columns, and dimensions.
3. Check for missing values and data types.

**Datasets:**
- `house_prices_records.csv` → Main dataset for model training.
- `inherited_houses.csv` → The 4 inherited houses for prediction.
- `house-metadata.txt` → Metadata describing the dataset.

In [9]:
import pandas as pd

# Load the datasets
main_data = pd.read_csv('data/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv')
inherited_data = pd.read_csv('data/raw/house-price-20211124T154130Z-001/house-price/inherited_houses.csv')

print("Main Dataset Shape:", main_data.shape)
print("Inherited Houses Shape:", inherited_data.shape)

# Preview first 5 rows of each
display(main_data.head())
display(inherited_data)


Main Dataset Shape: (1460, 24)
Inherited Houses Shape: (4, 23)


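| | 1stFlrSF | 2ndFlrSF | BedroomAbvGr | BsmtExposure | BsmtFinSF1 | BsmtFinType1 | BsmtUnfSF | EnclosedPorch | GarageArea | GarageFinish | ... | LotFrontage | MasVnrArea | OpenPorchSF | OverallCond | OverallQual | TotalBsmtSF | WoodDeckSF | YearBuilt | YearRemodAdd | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 856 | 854.0 | 3.0 | No | 706 | GLQ | 150 | 0.0 | 548 | RFn | ... | 65.0 | 196.0 | 61 | 5 | 7 | 856 | 0.0 | 2003 | 2003 | 208500 |
| 1 | 1262 | 0.0 | 3.0 | Gd | 978 | ALQ | 284 | NaN | 460 | RFn | ... | 80.0 | 0.0 | 0 | 8 | 6 | 1262 | NaN | 1976 | 1976 | 181500 |
| 2 | 920 | 866.0 | 3.0 | Mn | 486 | GLQ | 434 | 0.0 | 608 | RFn | ... | 68.0 | 162.0 | 42 | 5 | 7 | 920 | NaN | 2001 | 2002 | 223500 |
| 3 | 961 | NaN | NaN | No | 216 | ALQ | 540 | NaN | 642 | Unf | ... | 60.0 | 0.0 | 35 | 5 | 7 | 756 | NaN | 1915 | 1970 | 140000 |
| 4 | 1145 | NaN | 4.0 | Av | 655 | GLQ | 490 | 0.0 | 836 | RFn | ... | 84.0 | 350.0 | 84 | 5 | 8 | 1145 | NaN | 2000 | 2000 | 250000 |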


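| | 1stFlrSF | 2ndFlrSF | BedroomAbvGr | BsmtExposure | BsmtFinSF1 | BsmtFinType1 | BsmtUnfSF | EnclosedPorch | GarageArea | GarageFinish | ... | LotArea | LotFrontage | MasVnrArea | OpenPorchSF | OverallCond | OverallQual | TotalBsmtSF | WoodDeckSF | YearBuilt | YearRemodAdd |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 896 | 0 | 2 | No | 468.0 | Rec | 270.0 | 0 | 730.0 | Unf | ... | 11622 | 80.0 | 0.0 | 0 | 6 | 5 | 882.0 | 140 | 1961 | 1961 |
| 1 | 1329 | 0 | 3 | No | 923.0 | ALQ | 406.0 | 0 | 312.0 | Unf | ... | 14267 | 81.0 | 108.0 | 36 | 6 | 6 | 1329.0 | 393 | 1958 | 1958 |
| 2 | 928 | 701 | 3 | No | 791.0 | GLQ | 137.0 | 0 | 482.0 | Fin | ... | 13830 | 74.0 | 0.0 | 34 | 5 | 5 | 928.0 | 212 | 1997 | 1998 |
| 3 | 926 | 678 | 3 | No | 602.0 | GLQ | 324.0 | 0 | 470.0 | Fin | ... | 9978 | 78.0 | 20.0 | 36 | 6 | 6 | 926.0 | 360 | 1998 | 1998 |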
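
The load cell hard-codes the timestamped folder that the archive creates (`house-price-20211124T154130Z-001`). If that folder name ever changes, both `read_csv` calls break. A sketch of a more tolerant lookup follows, assuming each file name occurs exactly once somewhere under `data/raw`; `find_raw_file` is a hypothetical helper.

```python
from pathlib import Path


def find_raw_file(name: str, root: Path = Path("data/raw")) -> Path:
    """Search `root` recursively for a file called `name` and return its path.

    Raises FileNotFoundError unless exactly one match is found.
    """
    matches = sorted(root.rglob(name))
    if len(matches) != 1:
        raise FileNotFoundError(
            f"Expected exactly one {name!r} under {root}, found {len(matches)}"
        )
    return matches[0]
```

The main dataset could then be loaded with `pd.read_csv(find_raw_file("house_prices_records.csv"))`.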


### **Check Data Types and Missing Values**
We now check for:
- Column data types
- Null values count per column
- Basic descriptive statistics

In [10]:
# Check info and missing values
print("\n--- Main Dataset Info ---")
main_data.info()  # info() prints its own report and returns None, so it needs no print()
print("\n--- Missing Values (Main Dataset) ---")
print(main_data.isnull().sum())

print("\n--- Inherited Houses Info ---")
inherited_data.info()
print("\n--- Missing Values (Inherited Houses) ---")
print(inherited_data.isnull().sum())

# Optional: Quick descriptive stats
main_data.describe().T.head(10)



--- Main Dataset Info ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 24 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   1stFlrSF       1460 non-null   int64  
 1   2ndFlrSF       1374 non-null   float64
 2   BedroomAbvGr   1361 non-null   float64
 3   BsmtExposure   1422 non-null   object 
 4   BsmtFinSF1     1460 non-null   int64  
 5   BsmtFinType1   1315 non-null   object 
 6   BsmtUnfSF      1460 non-null   int64  
 7   EnclosedPorch  136 non-null    float64
 8   GarageArea     1460 non-null   int64  
 9   GarageFinish   1225 non-null   object 
 10  GarageYrBlt    1379 non-null   float64
 11  GrLivArea      1460 non-null   int64  
 12  KitchenQual    1460 non-null   object 
 13  LotArea        1460 non-null   int64  
 14  LotFrontage    1201 non-null   float64
 15  MasVnrArea     1452 non-null   float64
 16  OpenPorchSF    1460 non-null   int64  
 17  OverallCond    1460 non-n

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| 1stFlrSF | 1460.0 | 1162.626712 | 386.587738 | 334.0 | 882.0 | 1087.0 | 1391.25 | 4692.0 |
| 2ndFlrSF | 1374.0 | 348.524017 | 438.865586 | 0.0 | 0.0 | 0.0 | 728.0 | 2065.0 |
| BedroomAbvGr | 1361.0 | 2.869214 | 0.820115 | 0.0 | 2.0 | 3.0 | 3.0 | 8.0 |
| BsmtFinSF1 | 1460.0 | 443.639726 | 456.098091 | 0.0 | 0.0 | 383.5 | 712.25 | 5644.0 |
| BsmtUnfSF | 1460.0 | 567.240411 | 441.866955 | 0.0 | 223.0 | 477.5 | 808.0 | 2336.0 |
| EnclosedPorch | 136.0 | 25.330882 | 66.684115 | 0.0 | 0.0 | 0.0 | 0.0 | 286.0 |
| GarageArea | 1460.0 | 472.980137 | 213.804841 | 0.0 | 334.5 | 480.0 | 576.0 | 1418.0 |
| GarageYrBlt | 1379.0 | 1978.506164 | 24.689725 | 1900.0 | 1961.0 | 1980.0 | 2002.0 | 2010.0 |
| GrLivArea | 1460.0 | 1515.463699 | 525.480383 | 334.0 | 1129.5 | 1464.0 | 1776.75 | 5642.0 |
| LotArea | 1460.0 | 10516.828082 | 9981.264932 | 1300.0 | 7553.5 | 9478.5 | 11601.5 | 215245.0 |


---

# **Conclusions and Next Steps**

**Observations:**
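1. The main dataset contains 1,460 rows and 24 columns.
2. The inherited houses dataset contains 4 rows and 23 columns; these are the houses whose prices will be predicted.
3. Several columns have missing values that will require data cleaning, for example `EnclosedPorch` (only 136 of 1,460 values present), `LotFrontage` (1,201), `GarageFinish` (1,225), and `BsmtFinType1` (1,315).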
4. The target variable for our ML task is `SalePrice`.
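
Since the inherited houses are meant for prediction, their columns should match the main dataset's features once the target is set aside. A minimal sketch of that check is shown here, working on plain column names so it does not depend on how the files were loaded; `feature_mismatch` is a hypothetical helper.

```python
def feature_mismatch(train_columns, predict_columns, target="SalePrice"):
    """Compare feature names between a training set and a prediction set.

    Returns two sorted lists: features missing from the prediction set,
    and columns in the prediction set that the training set does not have.
    The target column is excluded from the expected features.
    """
    expected = set(train_columns) - {target}
    actual = set(predict_columns)
    return sorted(expected - actual), sorted(actual - expected)
```

In this notebook the call would be `feature_mismatch(main_data.columns, inherited_data.columns)`; two empty lists would mean the feature sets line up.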

**Next Steps:**
- Perform **Exploratory Data Analysis (EDA)** to identify key patterns.
- Investigate **correlations and distributions** of features with `SalePrice`.
- Formulate **initial project hypotheses** for validation.
