# **Data Collection and Loading**

## Objectives

* Fetch the data from Kaggle and prepare it for further processing.

## Inputs

* Kaggle JSON file: Authentication token.
* Datasets:
    * house_prices_records.csv: A dataset containing housing attributes and sale prices in Ames, Iowa.
    * inherited_houses.csv: A dataset containing information on four inherited houses.

## Outputs

* Verified and cleaned datasets saved in the /data folder.

## Additional Comments

* No additional comments


---

# Import Packages

In [5]:
# Install required packages
%pip install -r ../requirements.txt

Collecting numpy==1.26.1
  Downloading numpy-1.26.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.2 MB)
Collecting pandas==2.1.1
  Downloading pandas-2.1.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.3 MB)
Collecting matplotlib==3.8.0
  Downloading matplotlib-3.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (11.6 MB)
Collecting seaborn==0.13.2
  Downloading seaborn-0.13.2-py3-none-any.whl (294 kB)

In [4]:
# Check Python version
!python --version

Python 3.12.2


In [4]:
!which python

/home/gitpod/.pyenv/shims/python


In [5]:
%pip install pandas

Collecting pandas
  Downloading pandas-2.3.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.3 MB)
Collecting pytz>=2020.1
  Downloading pytz-2025.2-py2.py3-none-any.whl (509 kB)
Collecting numpy>=1.22.4
  Downloading numpy-2.2.6-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (16.8 MB)
Collecting tzdata>=2022.7
  Downloading tzdata-2025.2-py2.py3-none-any.whl (347 kB)
Installing collected packages: pytz, tzdata, numpy, pandas
Successfull

In [6]:
# Import required libraries
import pandas as pd
import numpy as np
import os
import zipfile

# Change working directory

In [7]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/heritage-housing-analysis/jupyter_notebooks'

In [8]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [9]:
current_dir = os.getcwd()
current_dir

'/workspace/heritage-housing-analysis'
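
Running `os.chdir(os.path.dirname(current_dir))` a second time moves the working directory up one more level. The sketch below is one way to make the step safe to re-run. The function name `move_to_project_root` is hypothetical, and the folder name `jupyter_notebooks` is an assumption taken from the path printed above.

```python
import os


def move_to_project_root(notebook_dir_name="jupyter_notebooks"):
    """Change to the parent directory only when currently inside the notebooks folder."""
    current = os.getcwd()
    # Only move up when the current folder is the (assumed) notebooks folder
    if os.path.basename(current) == notebook_dir_name:
        os.chdir(os.path.dirname(current))
    return os.getcwd()
```

Calling `move_to_project_root()` repeatedly leaves the working directory at the project root after the first call.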

# Install Kaggle API and Authenticate

In [10]:
# Install Kaggle
%pip install kaggle==1.5.12

Note: you may need to restart the kernel to use updated packages.


In [11]:
# Set Kaggle authentication
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json


chmod: cannot access 'kaggle.json': No such file or directory
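
The `chmod` error above shows that `kaggle.json` was not in the working directory when the cell ran. A minimal sketch of a guard follows, assuming the token is expected in the directory passed in. The helper name `prepare_kaggle_token` is hypothetical and not part of the Kaggle package.

```python
import os
import stat


def prepare_kaggle_token(config_dir):
    """Point the Kaggle client at config_dir and restrict the token's permissions.

    Returns True when kaggle.json was found, False otherwise.
    """
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        print(f"kaggle.json not found in {config_dir}; add it before downloading.")
        return False
    os.environ["KAGGLE_CONFIG_DIR"] = config_dir
    # Equivalent to `chmod 600`: read and write for the owner only
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    return True
```

A call such as `prepare_kaggle_token(os.getcwd())` could gate the download cell, so that it runs only when the function returns `True`.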


---

# Download Datasets from Kaggle

In [3]:
# Define Kaggle dataset path and destination folder
kaggle_dataset_path = "codeinstitute/housing-prices-data"
destination_folder = "data"

# Download dataset from Kaggle
! kaggle datasets download -d {kaggle_dataset_path} -p {destination_folder}

Traceback (most recent call last):
  File "/workspace/.pip-modules/bin/kaggle", line 5, in <module>
    from kaggle.cli import main
  File "/workspace/.pip-modules/lib/python3.10/site-packages/kaggle/__init__.py", line 23, in <module>
    api.authenticate()
  File "/workspace/.pip-modules/lib/python3.10/site-packages/kaggle/api/kaggle_api_extended.py", line 164, in authenticate
    raise IOError('Could not find {}. Make sure it\'s located in'
OSError: Could not find kaggle.json. Make sure it's located in /home/gitpod/.kaggle. Or use the environment method.


In [14]:
# Unzip the downloaded dataset
with zipfile.ZipFile(f"{destination_folder}/housing-prices-data.zip", 'r') as zip_ref:
    zip_ref.extractall(destination_folder)

# Remove the zip file after extraction
os.remove(f"{destination_folder}/housing-prices-data.zip")

print("Dataset successfully downloaded and extracted.")

FileNotFoundError: [Errno 2] No such file or directory: 'data/housing-prices-data.zip'
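
The `FileNotFoundError` above follows from the failed download: no archive was written to `data/`. The sketch below checks for the archive before extracting, so the cell reports the missing file instead of raising. The helper name `extract_and_remove` is hypothetical.

```python
import os
import zipfile


def extract_and_remove(zip_path, destination):
    """Extract zip_path into destination, then delete the archive.

    Returns the list of extracted member names, or None if the archive is missing.
    """
    if not os.path.isfile(zip_path):
        print(f"{zip_path} not found; check that the Kaggle download succeeded.")
        return None
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(destination)
        names = zip_ref.namelist()
    # Remove the archive only after a successful extraction
    os.remove(zip_path)
    return names
```

With the variables defined earlier, the call would be `extract_and_remove(f"{destination_folder}/housing-prices-data.zip", destination_folder)`.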

# Inspect Datasets

In [12]:
# Load datasets
house_prices_path = "data/house_prices_records.csv"
inherited_houses_path = "data/inherited_houses.csv"

house_prices_df = pd.read_csv(house_prices_path)
inherited_houses_df = pd.read_csv(inherited_houses_path)

# Inspect the datasets
print("House Prices Dataset:")
print(house_prices_df.info())
print(house_prices_df.head())

print("\nInherited Houses Dataset:")
print(inherited_houses_df.info())
print(inherited_houses_df.head())

House Prices Dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 24 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   1stFlrSF       1460 non-null   int64  
 1   2ndFlrSF       1374 non-null   float64
 2   BedroomAbvGr   1361 non-null   float64
 3   BsmtExposure   1422 non-null   object 
 4   BsmtFinSF1     1460 non-null   int64  
 5   BsmtFinType1   1315 non-null   object 
 6   BsmtUnfSF      1460 non-null   int64  
 7   EnclosedPorch  136 non-null    float64
 8   GarageArea     1460 non-null   int64  
 9   GarageFinish   1225 non-null   object 
 10  GarageYrBlt    1379 non-null   float64
 11  GrLivArea      1460 non-null   int64  
 12  KitchenQual    1460 non-null   object 
 13  LotArea        1460 non-null   int64  
 14  LotFrontage    1201 non-null   float64
 15  MasVnrArea     1452 non-null   float64
 16  OpenPorchSF    1460 non-null   int64  
 17  OverallCond    1460 non-null  

# Handle Missing Data

In [13]:
# Check for missing data
print("Missing values in House Prices Dataset:")
print(house_prices_df.isnull().sum())

print("\nMissing values in Inherited Houses Dataset:")
print(inherited_houses_df.isnull().sum())

# Identify numeric and non-numeric columns
numeric_cols = inherited_houses_df.select_dtypes(include=['number']).columns
non_numeric_cols = inherited_houses_df.select_dtypes(exclude=['number']).columns

# Handle missing data
# Fill numeric columns with their median
inherited_houses_df[numeric_cols] = inherited_houses_df[numeric_cols].fillna(inherited_houses_df[numeric_cols].median())

# Fill non-numeric columns with their mode (the most frequent value)
for col in non_numeric_cols:
    inherited_houses_df[col] = inherited_houses_df[col].fillna(inherited_houses_df[col].mode()[0])

# Print confirmation
print("Missing data handled successfully!")
print(inherited_houses_df.isnull().sum())

Missing values in House Prices Dataset:
1stFlrSF            0
2ndFlrSF           86
BedroomAbvGr       99
BsmtExposure       38
BsmtFinSF1          0
BsmtFinType1      145
BsmtUnfSF           0
EnclosedPorch    1324
GarageArea          0
GarageFinish      235
GarageYrBlt        81
GrLivArea           0
KitchenQual         0
LotArea             0
LotFrontage       259
MasVnrArea          8
OpenPorchSF         0
OverallCond         0
OverallQual         0
TotalBsmtSF         0
WoodDeckSF       1305
YearBuilt           0
YearRemodAdd        0
SalePrice           0
dtype: int64

Missing values in Inherited Houses Dataset:
1stFlrSF         0
2ndFlrSF         0
BedroomAbvGr     0
BsmtExposure     0
BsmtFinSF1       0
BsmtFinType1     0
BsmtUnfSF        0
EnclosedPorch    0
GarageArea       0
GarageFinish     0
GarageYrBlt      0
GrLivArea        0
KitchenQual      0
LotArea          0
LotFrontage      0
MasVnrArea       0
OpenPorchSF      0
OverallCond      0
OverallQual      0
TotalBsmtSF  
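
The cell above imputes only `inherited_houses_df`, which reports no missing values, while `house_prices_df` has missing values in several columns. As a sketch, the same median and mode logic can be wrapped in a function that works on either DataFrame and returns a copy. The name `impute_missing` is hypothetical. Whether median or mode imputation suits columns such as `EnclosedPorch` (1324 missing values) is a separate decision that this sketch does not make.

```python
import pandas as pd


def impute_missing(df):
    """Return a copy with numeric NaNs filled by the median and others by the mode."""
    result = df.copy()
    numeric_cols = result.select_dtypes(include=["number"]).columns
    other_cols = result.select_dtypes(exclude=["number"]).columns
    result[numeric_cols] = result[numeric_cols].fillna(result[numeric_cols].median())
    for col in other_cols:
        modes = result[col].mode()
        # mode() is empty when every value in the column is missing
        if not modes.empty:
            result[col] = result[col].fillna(modes.iloc[0])
    return result
```

Because the function returns a copy, the original DataFrame stays available for comparison, for example `impute_missing(house_prices_df).isnull().sum()`.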

---

# Push files to Repo