# **01 - Data Collection**

## Objectives

* Install and configure the Kaggle API for dataset retrieval.
* Download and save the raw dataset for further processing.
* Inspect the dataset to ensure it is loaded correctly.

## Inputs

* **Dataset**: [Housing Prices Data](https://www.kaggle.com/codeinstitute/housing-prices-data)
* **Authentication**: Kaggle API token stored in `kaggle.json`.

## Outputs

* Raw dataset saved as `inputs/datasets/raw/house_prices_records.csv`.

---

## Change Working Directory

Set the working directory to the project root so that relative file paths resolve the same way regardless of where the notebook is launched.

In [33]:
import os
current_dir = os.getcwd()
current_dir

'd:\\'

In [34]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


In [35]:
current_dir = os.getcwd()
current_dir

'd:\\'
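
The output above shows the working directory at the drive root (`'d:\\'`) both before and after the change, and re-running the cell would keep moving one level up whenever a parent exists. A minimal sketch of a more defensive approach is to walk upward until a marker file is found. The helper name `find_project_root` and the choice of `README.md` as the marker are assumptions for illustration, not part of the original notebook.

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "README.md") -> Path:
    """Return the nearest ancestor of `start` (inclusive) that contains `marker`.

    Both the helper name and the default marker are hypothetical choices.
    """
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"No directory containing {marker!r} at or above {start}")
```

In the notebook this could be used as `os.chdir(find_project_root(Path.cwd()))`, which is safe to re-run because it stops at the first directory that holds the marker.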

---

## 1. Data Retrieval

### 1.1 Install Kaggle API
The Kaggle API is required to download datasets directly from Kaggle. In this step, we will:
1. Install the Kaggle API package.
2. Configure the Kaggle API token for authentication.
3. Restrict access to the token for security purposes.

In [37]:
%pip install kaggle

Note: you may need to restart the kernel to use updated packages.


### 1.2 Configure Kaggle API
To authenticate with the Kaggle API, we need to:
1. Set the environment variable `KAGGLE_CONFIG_DIR` to the directory containing the `kaggle.json` file.
2. Restrict access to the `kaggle.json` file to ensure it is secure.

In [38]:
# Configure the Kaggle API
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()

# Restrict access to the Kaggle API token for security
# (POSIX shells only; `chmod` is not available in the Windows command shell)
! chmod 600 kaggle.json

'chmod' is not recognized as an internal or external command,
operable program or batch file.
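
The error above arises because `chmod` is a POSIX shell command and is not available in the Windows command shell. A hedged alternative is to set the permissions from Python with `os.chmod`. The helper name `restrict_token` is hypothetical. On Windows, `os.chmod` can only toggle the read-only flag, so there this sketch approximates `chmod 600` rather than matching it.

```python
import os
import stat
from pathlib import Path


def restrict_token(path: str = "kaggle.json") -> bool:
    """Limit the token file to owner read/write; return False if it is missing."""
    token = Path(path)
    if not token.is_file():
        return False
    # Equivalent to `chmod 600` on POSIX systems. On Windows only the
    # read-only bit is honoured, so other users' access is not changed.
    os.chmod(token, stat.S_IRUSR | stat.S_IWUSR)
    return True
```

The boolean return value makes a missing token visible immediately, instead of surfacing later as an authentication failure.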


### 1.3 Download Dataset
Using the Kaggle API, we will download the raw dataset and save it in the `inputs/datasets/raw/` folder.

In [39]:
# Define the Kaggle dataset path and destination folder
KaggleDatasetPath = "codeinstitute/housing-prices-data"
DestinationFolder = "inputs/datasets/raw"

# Download the dataset
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "d:\Projects\milestone-project-heritage-housing-issues\.venv\Scripts\kaggle.exe\__main__.py", line 6, in <module>
  File "d:\Projects\milestone-project-heritage-housing-issues\.venv\Lib\site-packages\kaggle\cli.py", line 68, in main
    out = args.func(**command_args)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "d:\Projects\milestone-project-heritage-housing-issues\.venv\Lib\site-packages\kaggle\api\kaggle_api_extended.py", line 1741, in dataset_download_cli
    with self.build_kaggle_client() as kaggle:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "d:\Projects\milestone-project-heritage-housing-issues\.venv\Lib\site-packages\kaggle\api\kaggle_api_extended.py", line 688, in build_kaggle_client
    username=self.config_values['username'],
             ~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
KeyError: 'username'
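
The `KeyError: 'username'` above indicates that the Kaggle client did not find a `username` entry in its configuration. One plausible cause is that `kaggle.json` is missing from the directory `KAGGLE_CONFIG_DIR` points to, or lacks one of the `username` and `key` fields that a Kaggle token contains. This is an assumption, since the notebook does not show the token file. A minimal pre-flight check might look like this; `missing_token_fields` is a hypothetical helper.

```python
import json
from pathlib import Path

REQUIRED_FIELDS = ("username", "key")


def missing_token_fields(config_dir: str) -> list[str]:
    """Return the required fields absent from `config_dir`/kaggle.json.

    A missing or unparsable file counts as lacking every field.
    """
    token_path = Path(config_dir) / "kaggle.json"
    try:
        token = json.loads(token_path.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        return list(REQUIRED_FIELDS)
    if not isinstance(token, dict):
        return list(REQUIRED_FIELDS)
    return [field for field in REQUIRED_FIELDS if not token.get(field)]
```

Calling `missing_token_fields(os.environ["KAGGLE_CONFIG_DIR"])` before the download cell returns an empty list when the token looks complete, and otherwise names the fields to fix.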


### 1.4 Unzip and Organize Dataset
After downloading the dataset, we will:
1. Unzip the dataset.
2. Remove unnecessary files and folders.
3. Save the raw data in the appropriate directory.

In [40]:
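# Unzip the archive, then delete it and the API token
# (requires a POSIX shell that provides `unzip` and `rm`)
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json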

'unzip' is not recognized as an internal or external command,
operable program or batch file.
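
`unzip` and `rm` are POSIX tools, so this cell fails in the Windows command shell. As a sketch under that assumption, the standard-library `zipfile` module can do the same job on any platform. The helper name `extract_archives` is hypothetical, and unlike the original cell it leaves `kaggle.json` in place.

```python
import zipfile
from pathlib import Path


def extract_archives(folder: str) -> list[str]:
    """Extract every .zip in `folder` into `folder`, then delete the archive.

    Returns the names of the extracted members.
    """
    extracted = []
    for archive in sorted(Path(folder).glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(folder)
            extracted.extend(zf.namelist())
        archive.unlink()  # remove the archive only after a successful extract
    return extracted
```

With the variables defined earlier, `extract_archives(DestinationFolder)` would stand in for the shell command. Deleting each archive only after extraction succeeds means a corrupt download is left in place for inspection.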


---

## 2. Load and Inspect Kaggle Data

### Objective
Load the two CSV files into pandas DataFrames and perform an initial inspection.

### 2.1 Load the dataset

In [None]:
# Import Pandas for data manipulation
import pandas as pd

# Create DataFrame containing house price records
df_prices = pd.read_csv("inputs/datasets/raw/house_prices_records.csv")

# Create DataFrame containing inherited house information
df_inherited = pd.read_csv("inputs/datasets/raw/inherited_houses.csv")

# Display the first few rows of the df_prices dataset for initial inspection
df_prices.head()

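|   | 1stFlrSF | 2ndFlrSF | BedroomAbvGr | BsmtExposure | BsmtFinSF1 | BsmtFinType1 | BsmtUnfSF | EnclosedPorch | GarageArea | GarageFinish | ... | LotFrontage | MasVnrArea | OpenPorchSF | OverallCond | OverallQual | TotalBsmtSF | WoodDeckSF | YearBuilt | YearRemodAdd | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 856 | 854.0 | 3.0 | No | 706 | GLQ | 150 | 0.0 | 548 | RFn | ... | 65.0 | 196.0 | 61 | 5 | 7 | 856 | 0.0 | 2003 | 2003 | 208500 |
| 1 | 1262 | 0.0 | 3.0 | Gd | 978 | ALQ | 284 | NaN | 460 | RFn | ... | 80.0 | 0.0 | 0 | 8 | 6 | 1262 | NaN | 1976 | 1976 | 181500 |
| 2 | 920 | 866.0 | 3.0 | Mn | 486 | GLQ | 434 | 0.0 | 608 | RFn | ... | 68.0 | 162.0 | 42 | 5 | 7 | 920 | NaN | 2001 | 2002 | 223500 |
| 3 | 961 | NaN | NaN | No | 216 | ALQ | 540 | NaN | 642 | Unf | ... | 60.0 | 0.0 | 35 | 5 | 7 | 756 | NaN | 1915 | 1970 | 140000 |
| 4 | 1145 | NaN | 4.0 | Av | 655 | GLQ | 490 | 0.0 | 836 | RFn | ... | 84.0 | 350.0 | 84 | 5 | 8 | 1145 | NaN | 2000 | 2000 | 250000 |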


### 2.2 Inspect the structure and summary statistics
In this step, we will inspect the `df_prices` dataset to:
* Understand its structure and data types.
* Identify any missing values.
* Generate a statistical summary of the numeric columns.

*Note: We are only processing `df_prices` at this stage.*

In [42]:
# Inspect dataset
print("Dataset info:")
df_prices.info()

Dataset info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 24 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   1stFlrSF       1460 non-null   int64  
 1   2ndFlrSF       1374 non-null   float64
 2   BedroomAbvGr   1361 non-null   float64
 3   BsmtExposure   1422 non-null   object 
 4   BsmtFinSF1     1460 non-null   int64  
 5   BsmtFinType1   1315 non-null   object 
 6   BsmtUnfSF      1460 non-null   int64  
 7   EnclosedPorch  136 non-null    float64
 8   GarageArea     1460 non-null   int64  
 9   GarageFinish   1225 non-null   object 
 10  GarageYrBlt    1379 non-null   float64
 11  GrLivArea      1460 non-null   int64  
 12  KitchenQual    1460 non-null   object 
 13  LotArea        1460 non-null   int64  
 14  LotFrontage    1201 non-null   float64
 15  MasVnrArea     1452 non-null   float64
 16  OpenPorchSF    1460 non-null   int64  
 17  OverallCond    1460 non-null   int64  

In [43]:
# Generate a statistical summary of the numeric columns
print("Statistical Summary of df_prices:")
print(df_prices.describe())

Statistical Summary of df_prices:
          1stFlrSF     2ndFlrSF  BedroomAbvGr   BsmtFinSF1    BsmtUnfSF  \
count  1460.000000  1374.000000   1361.000000  1460.000000  1460.000000   
mean   1162.626712   348.524017      2.869214   443.639726   567.240411   
std     386.587738   438.865586      0.820115   456.098091   441.866955   
min     334.000000     0.000000      0.000000     0.000000     0.000000   
25%     882.000000     0.000000      2.000000     0.000000   223.000000   
50%    1087.000000     0.000000      3.000000   383.500000   477.500000   
75%    1391.250000   728.000000      3.000000   712.250000   808.000000   
max    4692.000000  2065.000000      8.000000  5644.000000  2336.000000   

       EnclosedPorch   GarageArea  GarageYrBlt    GrLivArea        LotArea  \
count     136.000000  1460.000000  1379.000000  1460.000000    1460.000000   
mean       25.330882   472.980137  1978.506164  1515.463699   10516.828082   
std        66.684115   213.804841    24.689725   525.480

In [44]:
# Check for missing values
df_prices.isnull().sum()

1stFlrSF            0
2ndFlrSF           86
BedroomAbvGr       99
BsmtExposure       38
BsmtFinSF1          0
BsmtFinType1      145
BsmtUnfSF           0
EnclosedPorch    1324
GarageArea          0
GarageFinish      235
GarageYrBlt        81
GrLivArea           0
KitchenQual         0
LotArea             0
LotFrontage       259
MasVnrArea          8
OpenPorchSF         0
OverallCond         0
OverallQual         0
TotalBsmtSF         0
WoodDeckSF       1305
YearBuilt           0
YearRemodAdd        0
SalePrice           0
dtype: int64
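
Raw counts are easier to judge as a share of the 1460 rows reported by `df_prices.info()`. The sketch below redoes that arithmetic in plain Python from the counts printed above; the function name `missing_share` is hypothetical. In the notebook itself, `df_prices.isnull().mean()` gives the same proportions as fractions.

```python
TOTAL_ROWS = 1460  # from the RangeIndex reported by df_prices.info()

# Non-zero missing-value counts copied from the output above
missing_counts = {
    "2ndFlrSF": 86,
    "BedroomAbvGr": 99,
    "BsmtExposure": 38,
    "BsmtFinType1": 145,
    "EnclosedPorch": 1324,
    "GarageFinish": 235,
    "GarageYrBlt": 81,
    "LotFrontage": 259,
    "MasVnrArea": 8,
    "WoodDeckSF": 1305,
}


def missing_share(counts: dict[str, int], total: int) -> dict[str, float]:
    """Return missing values as a percentage of `total`, largest first."""
    shares = {name: round(100 * n / total, 1) for name, n in counts.items()}
    return dict(sorted(shares.items(), key=lambda item: item[1], reverse=True))


for column, pct in missing_share(missing_counts, TOTAL_ROWS).items():
    print(f"{column:<14}{pct:>5.1f}%")
```

On these figures, `EnclosedPorch` and `WoodDeckSF` are each missing in roughly nine out of ten rows, which is worth keeping in mind for later cleaning.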

---

## Conclusions and Next Steps
