# **Data Collection**

## Objectives

* Fetch raw data from Kaggle.
* Inspect datasets.
* Save a modified copy of the raw house prices dataset.
* Save a modified copy of the raw inherited houses dataset.

## Inputs

* Kaggle JSON file (API authentication token)

## Outputs

* Modified copy of the raw house prices dataset: outputs/datasets/collection/house_prices.csv
* Modified copy of the raw inherited houses dataset: outputs/datasets/collection/inherited_houses.csv

---

## Change working directory

Change the working directory from the current folder to its parent folder.

In [None]:
import os
current_dir = os.getcwd()
current_dir

In [None]:
os.chdir(os.path.dirname(current_dir))
os.getcwd()

## Fetch dataset from Kaggle

The housing dataset identified by the client is downloaded from Kaggle, updating any existing copy of the dataset.

In [None]:
import env

KaggleDatasetPath = "codeinstitute/housing-prices-data"
DestinationFolder = "inputs/datasets/raw"
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}
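The download above depends on Kaggle API credentials being available. The contents of the `env` module are not shown in this notebook, so the following is only a minimal sketch of a pre-flight check, assuming the credentials are supplied in one of the conventional ways: the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables, or a `kaggle.json` file in the folder named by `KAGGLE_CONFIG_DIR` (falling back to `~/.kaggle`). The function name `kaggle_credentials_available` is hypothetical.

```python
import os
from pathlib import Path


def kaggle_credentials_available(config_dir=None):
    """Return True if Kaggle API credentials appear to be configured.

    Hypothetical helper: it only checks the conventional locations and
    does not validate the credentials against the Kaggle service.
    """
    # Option 1: credentials supplied through environment variables.
    if os.environ.get("KAGGLE_USERNAME") and os.environ.get("KAGGLE_KEY"):
        return True
    # Option 2: a kaggle.json token file in the configuration folder.
    if config_dir is None:
        config_dir = os.environ.get("KAGGLE_CONFIG_DIR", Path.home() / ".kaggle")
    return (Path(config_dir) / "kaggle.json").is_file()
```

Calling `kaggle_credentials_available()` before the download cell gives an early, readable signal if the token is missing, rather than a failure inside the shell command.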

In [None]:
! unzip -u {DestinationFolder}/*.zip -d {DestinationFolder} \
&& rm {DestinationFolder}/*.zip

## Load datasets and inspect

**House prices dataset:**

In [None]:
import pandas as pd

house_prices_df = pd.read_csv("inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv")
house_prices_df.info()

**Initial observations:**  
The dataset has 24 columns spanning three data types. Unsurprisingly, not every record has a (valid) value for every column.
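To quantify which columns are incomplete, a per-column summary of missing values can help. The following is a minimal sketch on a small made-up frame standing in for `house_prices_df`; the helper name `missing_summary`, the column `ExampleCategory`, and all values are illustrative assumptions, not taken from the real data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for house_prices_df; names and values are made up.
toy_df = pd.DataFrame({
    "ExampleArea": [8450, 9600, 11250, 9550],
    "ExampleCategory": ["A", None, "B", None],
    "EnclosedPorch": [0.0, np.nan, np.nan, np.nan],
})


def missing_summary(df):
    """Return count, percentage and dtype for each column with missing values."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
        "dtype": df.dtypes.astype(str),
    })
    # Keep only incomplete columns, most affected first.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


print(missing_summary(toy_df))
```

Applied to the real dataset, the same call would be `missing_summary(house_prices_df)`.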

---

In [None]:
house_prices_df.head()

Check for duplicates

In [None]:
house_prices_df.duplicated().any()
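The duplicate check reports only whether any fully duplicated rows exist. As a hedged illustration of how the check could be extended, the sketch below counts duplicated rows and drops them on a small made-up frame; the column names and values are illustrative assumptions only.

```python
import pandas as pd

# Toy frame with one fully duplicated row (rows 0 and 2 are identical).
toy_df = pd.DataFrame({
    "ExampleArea": [8450, 9600, 8450],
    "ExamplePrice": [208500, 181500, 208500],
})

# duplicated() marks every repeat of an earlier row as True.
duplicate_mask = toy_df.duplicated()
has_duplicates = bool(duplicate_mask.any())
n_duplicates = int(duplicate_mask.sum())

# Keep the first occurrence of each row and renumber the index.
deduplicated_df = toy_df.drop_duplicates().reset_index(drop=True)

print(has_duplicates, n_duplicates, len(deduplicated_df))
```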

Rename the column 'EnclosedPorch' to 'EnclosedPorchSF'.

In [None]:
house_prices_df.rename(columns={'EnclosedPorch': 'EnclosedPorchSF'}, inplace=True)

**Possible dataset limitations**:  
* The house prices dataset lacks features that represent the location of a house, such as the proximity to the nearest school or town centre, which may influence the sale price.
* The dataset also lacks features relating to the time of sale, which may likewise have a significant impact on the sale price.

Without these features, the dataset may not be sufficient to train an ML model that adequately predicts the sale price of a house in Ames, Iowa; at the very least, model performance would likely be higher if such features were included.

**Inherited houses dataset:**

In [None]:
inherited_houses_df = pd.read_csv("inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/inherited_houses.csv")
inherited_houses_df.info()

In [None]:
inherited_houses_df.head()

Rename the column 'EnclosedPorch' to 'EnclosedPorchSF'.

In [None]:
inherited_houses_df.rename(columns={'EnclosedPorch': 'EnclosedPorchSF'}, inplace=True)
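Because the same rename is applied to both datasets, one option is to keep the mapping in a single place. By default `DataFrame.rename` silently ignores keys that are not present, so a misspelled column name would go unnoticed; passing `errors="raise"` turns that into a `KeyError`. The sketch below assumes toy frames with made-up values, and the names `COLUMN_RENAMES` and `rename_columns` are hypothetical.

```python
import pandas as pd

# Shared mapping so both datasets are renamed identically.
COLUMN_RENAMES = {"EnclosedPorch": "EnclosedPorchSF"}


def rename_columns(df, mapping=COLUMN_RENAMES):
    """Return a renamed copy; raise KeyError if a mapped column is absent."""
    return df.rename(columns=mapping, errors="raise")


# Toy frames standing in for the two datasets; values are made up.
toy_prices = pd.DataFrame({"EnclosedPorch": [0, 272], "ExamplePrice": [208500, 140000]})
toy_inherited = pd.DataFrame({"EnclosedPorch": [0, 36]})

toy_prices = rename_columns(toy_prices)
toy_inherited = rename_columns(toy_inherited)

print(list(toy_prices.columns), list(toy_inherited.columns))
```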

## Save outputs

In [None]:
# Build the output folder path; exist_ok avoids an error on re-runs.
path = os.path.join(os.getcwd(), 'outputs/datasets/collection')
try:
    os.makedirs(path, exist_ok=True)
except Exception as e:
    print(e)

In [None]:
try:
    house_prices_df.to_csv(os.path.join(path, 'house_prices.csv'), index=False)
except Exception as e:
    print(e)

In [None]:
try:
    inherited_houses_df.to_csv(os.path.join(path, 'inherited_houses.csv'), index=False)
except Exception as e:
    print(e)
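As a sanity check on the saved files, a CSV can be read back and compared with the frame it was written from. The following is a minimal sketch, assuming a toy frame with made-up values and a temporary folder in place of the real output folder; the helper name `save_csv` is hypothetical.

```python
import os
import tempfile

import pandas as pd


def save_csv(df, folder, filename):
    """Create the folder if needed, write df as CSV, and return the file path."""
    os.makedirs(folder, exist_ok=True)
    file_path = os.path.join(folder, filename)
    df.to_csv(file_path, index=False)
    return file_path


# Toy frame standing in for house_prices_df; values are made up.
toy_df = pd.DataFrame({"EnclosedPorchSF": [0, 272], "ExampleArea": [8450, 9600]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_folder = os.path.join(tmp_dir, "outputs", "datasets", "collection")
    saved_path = save_csv(toy_df, out_folder, "house_prices.csv")
    # Read the file back and compare it with the original frame.
    reloaded_df = pd.read_csv(saved_path)
    round_trip_ok = reloaded_df.equals(toy_df)

print(round_trip_ok)
```

With real data, a strict `equals` comparison may fail for harmless reasons such as dtype changes on reload, so comparing shapes and column names can be a more forgiving first check.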