# **Data Collection and Loading**

## Objectives

* Fetch data from Kaggle and prepare it for further processing.

## Inputs

* Kaggle JSON file: Authentication token.
* Datasets:
    * house_prices_records.csv: A dataset containing housing attributes and sale prices in Ames, Iowa.
    * inherited_houses.csv: A dataset containing information on four inherited houses.

## Outputs

* Verified and cleaned datasets saved in the /data folder.

## Additional Comments

* No additional comments


---

# Import Packages

In [10]:
# Install required packages
%pip install -r ./requirements.txt

Note: you may need to restart the kernel to use updated packages.


In [9]:
# Check Python version
!python --version

Python 3.12.2


In [12]:
!which python

/home/gitpod/.pyenv/versions/3.12.2/bin/python


In [13]:
%pip install pandas

Note: you may need to restart the kernel to use updated packages.


In [2]:
# Import required libraries
import pandas as pd
import numpy as np
import os
import zipfile

# Change working directory

In [3]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/heritage-housing-analysis/jupyter_notebooks'

In [4]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [5]:
current_dir = os.getcwd()
current_dir

'/workspace/heritage-housing-analysis'
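
Moving to the parent directory works only when the notebook starts in `jupyter_notebooks`; re-running that cell moves one level too high. A minimal sketch of an idempotent alternative follows. The helper name `find_project_root` and the `.git` marker are assumptions for illustration, not part of the original notebook.

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = ".git") -> Path:
    """Walk upward from `start` until a directory containing `marker` is found.

    `marker` is a hypothetical choice; use any file or folder that exists
    only at the top level of the project.
    """
    start = start.resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"No directory containing {marker!r} at or above {start}")
```

With this helper, `os.chdir(find_project_root(Path.cwd()))` lands in the same directory no matter how many times the cell is run.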

# Install Kaggle API and Authenticate

In [7]:
# Install Kaggle
%pip install kaggle==1.5.12

Note: you may need to restart the kernel to use updated packages.


In [15]:
# Set Kaggle authentication
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
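
The authentication cell assumes `kaggle.json` is already present in the working directory. The sketch below performs the same two steps in plain Python and fails early with a clear message when the token is missing. The function name `secure_kaggle_token` is hypothetical, and the permission check assumes a POSIX file system.

```python
import os
import stat
from pathlib import Path


def secure_kaggle_token(config_dir: str) -> Path:
    """Check that kaggle.json exists in `config_dir` and restrict its permissions."""
    token = Path(config_dir) / "kaggle.json"
    if not token.is_file():
        raise FileNotFoundError(f"Expected Kaggle token at {token}")
    # Same effect as `chmod 600`: only the owner may read and write the file.
    token.chmod(stat.S_IRUSR | stat.S_IWUSR)
    # Point the Kaggle client at the directory that holds the token.
    os.environ["KAGGLE_CONFIG_DIR"] = str(token.parent)
    return token
```

Calling `secure_kaggle_token(os.getcwd())` mirrors the cell above.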


---

# Download Datasets from Kaggle

In [17]:
# Define Kaggle dataset path and destination folder
kaggle_dataset_path = "codeinstitute/housing-prices-data"
destination_folder = "data"

# Download dataset from Kaggle
! kaggle datasets download -d {kaggle_dataset_path} -p {destination_folder}

Downloading housing-prices-data.zip to data
  0%|                                               | 0.00/49.6k [00:00<?, ?B/s]
100%|██████████████████████████████████████| 49.6k/49.6k [00:00<00:00, 54.0MB/s]


In [19]:
# Unzip the downloaded dataset
with zipfile.ZipFile(f"{destination_folder}/housing-prices-data.zip", 'r') as zip_ref:
    zip_ref.extractall(destination_folder)

# Remove the zip file after extraction
os.remove(f"{destination_folder}/housing-prices-data.zip")

print("Dataset successfully downloaded and extracted.")

Dataset successfully downloaded and extracted.
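
A shell command launched with `!` does not stop the notebook when it fails, so a failed download would only surface later as an error from the unzip step. The following sketch wraps the extract-and-remove step in a function that checks the archive first. The name `extract_and_remove` is an assumption for illustration.

```python
import os
import zipfile


def extract_and_remove(zip_path: str, destination: str) -> list[str]:
    """Extract `zip_path` into `destination`, delete the archive, return member names."""
    # is_zipfile returns False for a missing file as well as for a corrupt one.
    if not zipfile.is_zipfile(zip_path):
        raise ValueError(f"{zip_path} is missing or is not a valid zip archive")
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        names = zip_ref.namelist()
        zip_ref.extractall(destination)
    # Remove the archive only after extraction has succeeded.
    os.remove(zip_path)
    return names
```

In this notebook the call would be `extract_and_remove(f"{destination_folder}/housing-prices-data.zip", destination_folder)`.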


# Inspect Datasets

In [23]:
# Load datasets
house_prices_path = "data/house_prices_records.csv"
inherited_houses_path = "data/inherited_houses.csv"

house_prices_df = pd.read_csv(house_prices_path)
inherited_houses_df = pd.read_csv(inherited_houses_path)

# Inspect the datasets
print("House Prices Dataset:")
print(house_prices_df.info())
print(house_prices_df.head())

print("\nInherited Houses Dataset:")
print(inherited_houses_df.info())
print(inherited_houses_df.head())

House Prices Dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 24 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   1stFlrSF       1460 non-null   int64  
 1   2ndFlrSF       1374 non-null   float64
 2   BedroomAbvGr   1361 non-null   float64
 3   BsmtExposure   1422 non-null   object 
 4   BsmtFinSF1     1460 non-null   int64  
 5   BsmtFinType1   1315 non-null   object 
 6   BsmtUnfSF      1460 non-null   int64  
 7   EnclosedPorch  136 non-null    float64
 8   GarageArea     1460 non-null   int64  
 9   GarageFinish   1225 non-null   object 
 10  GarageYrBlt    1379 non-null   float64
 11  GrLivArea      1460 non-null   int64  
 12  KitchenQual    1460 non-null   object 
 13  LotArea        1460 non-null   int64  
 14  LotFrontage    1201 non-null   float64
 15  MasVnrArea     1452 non-null   float64
 16  OpenPorchSF    1460 non-null   int64  
 17  OverallCond    1460 non-null  
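
`info()` reports non-null counts, which leaves the share of missing values to be worked out by hand. As a possible complement, the sketch below lists only the columns with gaps, along with their percentage of missing rows. The helper name `missing_summary` is hypothetical.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return missing counts and percentages for columns that contain NaN values."""
    counts = df.isnull().sum()
    summary = pd.DataFrame(
        {"missing": counts, "percent": (counts / len(df) * 100).round(2)}
    )
    # Keep only columns with at least one gap, largest gap first.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)
```

Usage: `missing_summary(house_prices_df)`.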

# Handle Missing Data

In [30]:
# Check for missing data
print("Missing values in House Prices Dataset:")
print(house_prices_df.isnull().sum())

print("\nMissing values in Inherited Houses Dataset:")
print(inherited_houses_df.isnull().sum())

# Identify numeric and non-numeric columns
numeric_cols = inherited_houses_df.select_dtypes(include=['number']).columns
non_numeric_cols = inherited_houses_df.select_dtypes(exclude=['number']).columns

# Handle missing data
# Note: only inherited_houses_df is imputed here; house_prices_df is left unchanged
# Fill numeric columns with their median
inherited_houses_df[numeric_cols] = inherited_houses_df[numeric_cols].fillna(inherited_houses_df[numeric_cols].median())

# Fill non-numeric columns with their mode (the most frequent value)
for col in non_numeric_cols:
    inherited_houses_df[col] = inherited_houses_df[col].fillna(inherited_houses_df[col].mode()[0])

# Print confirmation
print("Missing data handled successfully!")
print(inherited_houses_df.isnull().sum())

Missing values in House Prices Dataset:
1stFlrSF            0
2ndFlrSF           86
BedroomAbvGr       99
BsmtExposure       38
BsmtFinSF1          0
BsmtFinType1      145
BsmtUnfSF           0
EnclosedPorch    1324
GarageArea          0
GarageFinish      235
GarageYrBlt        81
GrLivArea           0
KitchenQual         0
LotArea             0
LotFrontage       259
MasVnrArea          8
OpenPorchSF         0
OverallCond         0
OverallQual         0
TotalBsmtSF         0
WoodDeckSF       1305
YearBuilt           0
YearRemodAdd        0
SalePrice           0
dtype: int64

Missing values in Inherited Houses Dataset:
1stFlrSF         0
2ndFlrSF         0
BedroomAbvGr     0
BsmtExposure     0
BsmtFinSF1       0
BsmtFinType1     0
BsmtUnfSF        0
EnclosedPorch    0
GarageArea       0
GarageFinish     0
GarageYrBlt      0
GrLivArea        0
KitchenQual      0
LotArea          0
LotFrontage      0
MasVnrArea       0
OpenPorchSF      0
OverallCond      0
OverallQual      0
TotalBsmtSF  
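
The imputation cell fills gaps in `inherited_houses_df` only, although the counts above show that the missing values are in `house_prices_df`. The sketch below packages the same median and mode logic as a reusable function that returns a copy, so it could be applied to either dataset. The name `impute_median_mode` is an assumption. Median and mode filling is also only one possible strategy, and may not suit columns such as `EnclosedPorch`, where 1324 of 1460 values are missing.

```python
import pandas as pd


def impute_median_mode(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with numeric NaNs filled by the median and other NaNs by the mode."""
    result = df.copy()
    numeric_cols = result.select_dtypes(include=["number"]).columns
    other_cols = result.select_dtypes(exclude=["number"]).columns

    # Fill each numeric column with its own median.
    result[numeric_cols] = result[numeric_cols].fillna(result[numeric_cols].median())

    # Fill each non-numeric column with its most frequent value.
    for col in other_cols:
        modes = result[col].mode()
        if not modes.empty:  # a column that is entirely NaN has no mode
            result[col] = result[col].fillna(modes.iloc[0])
    return result
```

Usage: `house_prices_df = impute_median_mode(house_prices_df)`.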

# Save Cleaned Data

In [31]:
# Save the datasets to the data folder (house_prices_df was not imputed above)
house_prices_df.to_csv("data/cleaned_house_prices.csv", index=False)
inherited_houses_df.to_csv("data/cleaned_inherited_houses.csv", index=False)

print("Cleaned datasets saved.")

Cleaned datasets saved.
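
The save step prints a confirmation without checking what was written. One way to verify the output is to reload each file and compare its layout with the DataFrame in memory, as in this sketch. The helper name `save_and_verify` is hypothetical.

```python
import pandas as pd


def save_and_verify(df: pd.DataFrame, path: str) -> pd.DataFrame:
    """Write `df` to CSV, reload it, and confirm shape and column names survived."""
    df.to_csv(path, index=False)
    reloaded = pd.read_csv(path)
    if reloaded.shape != df.shape or list(reloaded.columns) != list(df.columns):
        raise ValueError(f"Round trip through {path} changed the data layout")
    return reloaded
```

Usage: `save_and_verify(inherited_houses_df, "data/cleaned_inherited_houses.csv")`.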


---

NOTE

* You may add as many sections as you want, as long as they support your project workflow.
* All notebook cells should be run top-down. Do not create a workflow where, at a given point, you need to go back to a previous cell to execute some task, such as refreshing a variable's content.

---

# Push files to Repo

* If you do not need to push files to the repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
import os
try:
  # create your folder here
  # os.makedirs(name='')
  pass  # placeholder: a try block containing only comments is a syntax error
except Exception as e:
  print(e)
