# **01 - Data Collection**

## Objectives

* Install and configure the Kaggle API for dataset retrieval.
* Download and save the raw dataset for further processing.
* Inspect the dataset to ensure it is loaded correctly.

## Inputs

* **Dataset**: [Housing Prices Data](https://www.kaggle.com/codeinstitute/housing-prices-data)
* **Authentication**: Kaggle API token stored in `kaggle.json`.

## Outputs

* Raw dataset saved as `inputs/datasets/raw/house_prices_records.csv`.

---

## Change Working Directory

Set the working directory to the project root so that relative file paths resolve the same way regardless of where the notebook is launched.

In [1]:
import os
current_dir = os.getcwd()
current_dir

'd:\\Projects\\milestone-project-heritage-housing-issues\\jupyter_notebooks'

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

In [2]:
current_dir = os.getcwd()
current_dir

'd:\\Projects\\milestone-project-heritage-housing-issues\\jupyter_notebooks'
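The recorded output above still shows the `jupyter_notebooks` folder, which is what you would see if the `os.chdir` cell was skipped. A minimal sketch of a guard that only moves up one level when the notebook is actually inside that folder, so the cell can be re-run safely. The helper name `resolve_project_root` is hypothetical, and the folder name `jupyter_notebooks` is taken from the path shown above.

```python
from pathlib import Path


def resolve_project_root(start, notebooks_dir="jupyter_notebooks"):
    """Return the parent of `start` if `start` is the notebooks folder, else `start` itself.

    `notebooks_dir` is an assumption based on the path printed in this notebook.
    """
    start = Path(start)
    return start.parent if start.name == notebooks_dir else start
```

In the notebook this could be used as `os.chdir(resolve_project_root(os.getcwd()))`. Running it a second time leaves the directory unchanged instead of climbing another level.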

---

## 1. Data Retrieval

### 1.1 Install Kaggle API
The Kaggle API is required to download datasets directly from Kaggle. In this step, we will:
1. Install the Kaggle API package.
2. Configure the Kaggle API token for authentication.
3. Restrict access to the token for security purposes.

In [None]:
%pip install kaggle

### 1.2 Configure Kaggle API
To authenticate with the Kaggle API, we need to:
1. Set the environment variable `KAGGLE_CONFIG_DIR` to the directory containing the `kaggle.json` file.
2. Restrict access to the `kaggle.json` file to ensure it is secure.

In [None]:
# Configure the Kaggle API
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()

# Restrict access to the Kaggle API token for security
# (chmod is a POSIX command; it is not available in the default Windows shell)
! chmod 600 kaggle.json
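Before calling the API it can help to confirm that the token file is where `KAGGLE_CONFIG_DIR` points and that it parses. The following is a rough sketch, not part of the Kaggle tooling: the helper names `check_kaggle_token` and `restrict_token` are hypothetical. It assumes the usual `kaggle.json` layout with `username` and `key` fields. `os.chmod` gives owner-only read/write on POSIX systems; on Windows it can only toggle the read-only flag, so it does not restrict access there.

```python
import json
import os
import stat
from pathlib import Path


def check_kaggle_token(config_dir):
    """Return True if `config_dir` holds a kaggle.json with 'username' and 'key' fields."""
    token_path = Path(config_dir) / "kaggle.json"
    if not token_path.is_file():
        return False
    try:
        token = json.loads(token_path.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return False
    return isinstance(token, dict) and {"username", "key"}.issubset(token)


def restrict_token(token_path):
    """Set owner-only read/write (0o600) on POSIX; has little effect on Windows."""
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
```

A call such as `check_kaggle_token(os.environ["KAGGLE_CONFIG_DIR"])` returning `False` points to a missing or malformed token before any download is attempted.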

### 1.3 Download Dataset
Using the Kaggle API, we will download the raw dataset and save it in the `inputs/datasets/raw/` folder.

In [None]:
# Define the Kaggle dataset path and destination folder
KaggleDatasetPath = "codeinstitute/housing-prices-data"
DestinationFolder = "inputs/datasets/raw"

# Download the dataset
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

### 1.4 Unzip and Organize Dataset
After downloading the dataset, we will:
1. Unzip the dataset.
2. Remove unnecessary files and folders.
3. Save the raw data in the appropriate directory.

In [16]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json

'unzip' is not recognized as an internal or external command,
operable program or batch file.
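The error above shows that `unzip` is not available in this Windows shell. One shell-independent alternative is Python's standard `zipfile` module. This is a minimal sketch under that assumption. The helper name `extract_archives` is hypothetical, and unlike the cell above it leaves `kaggle.json` in place.

```python
import zipfile
from pathlib import Path


def extract_archives(folder):
    """Extract every .zip found directly in `folder` into `folder`, then delete the archive.

    Returns the list of member names that were extracted.
    """
    folder = Path(folder)
    extracted = []
    for archive in sorted(folder.glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(folder)
            extracted.extend(zf.namelist())
        # Remove the archive only after it has been closed
        archive.unlink()
    return extracted
```

In this notebook the call would be `extract_archives(DestinationFolder)`. The returned names show where the CSV files actually landed. If the token should also be removed, as in the shell version, `Path("kaggle.json").unlink(missing_ok=True)` does that.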


---

## 2. Load and Inspect Kaggle Data

### Objective
Load the dataset into Pandas DataFrames and perform an initial inspection.

### 2.1 Load the dataset

In [17]:
# Import Pandas for data manipulation
import pandas as pd

# Create DataFrame of raw dataset containing house price records
df_prices = pd.read_csv("inputs/datasets/raw/house_prices_records.csv")

# Create DataFrame of dataset containing inherited house information
df_inherited = pd.read_csv("inputs/datasets/raw/inherited_houses.csv")

# Display the first few rows of the df_prices dataset for initial inspection
df_prices.head()

FileNotFoundError: [Errno 2] No such file or directory: 'inputs/datasets/raw/house_prices_records.csv'
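The `FileNotFoundError` above follows from the failed unzip step: the archive was never extracted, so the CSV is not at the expected path. An archive may also unpack into a nested subfolder (an assumption, not something this notebook confirms). A small sketch that searches for the file under the raw data folder instead of hard-coding its location; the helper name `find_csv` is hypothetical.

```python
from pathlib import Path


def find_csv(root, filename):
    """Return the first path under `root` (searched recursively) named `filename`, or None."""
    matches = sorted(Path(root).rglob(filename))
    return matches[0] if matches else None
```

For example, `find_csv("inputs/datasets/raw", "house_prices_records.csv")` returns `None` while the archive is still zipped, and a usable path for `pd.read_csv` once extraction has succeeded.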

### 2.2 Inspect the structure and summary statistics
In this step, we will inspect the `df_prices` dataset to:
* Understand its structure and data types.
* Identify any missing values.
* Generate a statistical summary of the numeric columns.

*Note: We are only processing `df_prices` at this stage.*

In [None]:
# Inspect dataset
print("Dataset info:")
df_prices.info()

In [None]:
# Generate a statistical summary of the numeric columns
print("Statistical Summary of df_prices:")
print(df_prices.describe())

In [None]:
# Check for missing values
df_prices.isnull().sum()
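Raw counts from `isnull().sum()` are easier to compare as percentages of the row count. A minimal sketch of that conversion follows; the helper name `missing_summary` and the output column names are hypothetical.

```python
import pandas as pd


def missing_summary(df):
    """Return count and percentage of missing values, for columns that have any."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing_count": counts,
        "missing_pct": (counts / len(df) * 100).round(1),
    })
    # Keep only columns with at least one missing value, worst first
    summary = summary[summary["missing_count"] > 0]
    return summary.sort_values("missing_pct", ascending=False)
```

Calling `missing_summary(df_prices)` lists only the affected columns, sorted from the highest missing percentage to the lowest.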

---

## Conclusions and Next Steps

### Conclusions
- The dataset contains **1460 rows** and **24 columns**.
- Key columns with missing values include:
  - `LotFrontage` (17.7% missing values).
  - `GarageFinish` (16.1% missing values).
- Initial exploration highlighted potential challenges such as:
  - Missing values in key columns.
  - Outliers in numeric columns.

### Next Steps: Data Cleaning
1. **Handle missing values**:
   - Impute numeric columns like `LotFrontage` with the median.
   - Fill categorical columns like `GarageFinish` with a placeholder category such as `'None'`.
2. **Drop irrelevant columns**:
   - Remove columns with high percentages of missing data or low relevance.
3. **Explore outliers**:
   - Investigate potential outliers in numeric columns to ensure data quality.
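
The cleaning steps listed above could be sketched roughly as follows. This is an illustration under stated assumptions, not the cleaning notebook itself. The helper names `impute_basic` and `iqr_outlier_mask` are hypothetical, and the 1.5 × IQR rule is one common way to flag outliers that this notebook does not prescribe.

```python
import pandas as pd


def impute_basic(df, numeric_cols, categorical_cols, placeholder="None"):
    """Return a copy with numeric columns median-filled and categorical columns placeholder-filled."""
    cleaned = df.copy()
    for col in numeric_cols:
        cleaned[col] = cleaned[col].fillna(cleaned[col].median())
    for col in categorical_cols:
        cleaned[col] = cleaned[col].fillna(placeholder)
    return cleaned


def iqr_outlier_mask(series, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)
```

With the columns named above, a call might look like `impute_basic(df_prices, ["LotFrontage"], ["GarageFinish"])`. The original DataFrame is left untouched, so the raw data stays available for comparison.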