# **Data Collection Notebook**

## Objectives

* Fetch data from Kaggle and save as raw data

## Inputs

* Kaggle JSON file - authentication token to access and download Kaggle data 

## Outputs

* Generate Datasets: 
  * inputs/datasets/raw/house_prices_records.csv
  * inputs/datasets/raw/inherited_houses.csv

## Additional Comments

* The first dataset in the outputs above is the data used to build our machine learning model(s). The second file consists of the inherited houses whose prices our client wants to predict. 


---

# Change working directory

We need to change the working directory from the notebook's current folder to its parent folder.
* We access the current directory with os.getcwd()

In [2]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/heritage-housing-issues/jupyter_notebooks'

We want to make the parent of the current directory the new current directory.
* os.path.dirname() returns the parent directory
* os.chdir() sets the new current directory

In [3]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [4]:
current_dir = os.getcwd()
current_dir

'/workspace/heritage-housing-issues'
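
Re-running the os.chdir() cell moves the working directory up one more level each time. The sketch below is one possible guard against that. It assumes the notebooks live in a folder named jupyter_notebooks, as in the path printed earlier. The helper name resolve_project_root is hypothetical.

```python
import os


def resolve_project_root(cwd: str, notebooks_dir: str = "jupyter_notebooks") -> str:
    """Return the parent of cwd only when cwd is the notebooks folder.

    Hypothetical helper; notebooks_dir is an assumption based on the
    path shown in the notebook output.
    """
    normalized = os.path.normpath(cwd)
    if os.path.basename(normalized) == notebooks_dir:
        # Still inside the notebooks folder: step up one level
        return os.path.dirname(normalized)
    # Already outside the notebooks folder: leave the path unchanged
    return cwd
```

With this helper, `os.chdir(resolve_project_root(os.getcwd()))` can be run repeatedly without climbing above the project folder.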

# Fetch raw data from Kaggle

First, we install the kaggle package so that we can use the Kaggle API to fetch the raw data.

In [None]:
! pip install kaggle==1.5.12

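Make the Kaggle authentication token available for the session. Setting KAGGLE_CONFIG_DIR tells the kaggle client where to look for kaggle.json, and chmod 600 restricts the file to owner read/write.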

In [None]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
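
The same token setup can be sketched in pure Python, which also fails early if kaggle.json is missing from the folder. The function name prepare_kaggle_token is hypothetical. The permission bits mirror chmod 600 and take effect on POSIX systems.

```python
import os
import stat


def prepare_kaggle_token(config_dir: str, token_name: str = "kaggle.json") -> str:
    """Point KAGGLE_CONFIG_DIR at config_dir and restrict the token file.

    Hypothetical helper, not part of the kaggle package.
    """
    token_path = os.path.join(config_dir, token_name)
    if not os.path.isfile(token_path):
        raise FileNotFoundError(f"{token_name} not found in {config_dir}")
    os.environ["KAGGLE_CONFIG_DIR"] = config_dir
    # Owner read/write only, the equivalent of `chmod 600`
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    return token_path
```

In the notebook, this would be called as `prepare_kaggle_token(os.getcwd())`.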

* Define path to the Kaggle dataset we want to download 
* Indicate the destination folder for the downloaded data 
* Download the data

In [None]:
KaggleDatasetPath = "codeinstitute/housing-prices-data"
DestinationFolder = "inputs/datasets/raw"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

* Unzip the downloaded archive
* Remove the zip file
* Remove the kaggle.json file

In [None]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json
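
Where the unzip and rm shell commands are not available, the extraction step can be approximated with Python's standard library. This is a minimal sketch. The function name extract_and_remove_zips is hypothetical, and it leaves kaggle.json in place, so that file would still need to be removed separately.

```python
import glob
import os
import zipfile


def extract_and_remove_zips(folder: str) -> list:
    """Extract every .zip in folder into folder, then delete the archives.

    Returns the names of the archive members that were extracted.
    """
    extracted = []
    for zip_path in sorted(glob.glob(os.path.join(folder, "*.zip"))):
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(folder)
            extracted.extend(archive.namelist())
        # Delete the archive only after it has been extracted and closed
        os.remove(zip_path)
    return extracted
```

In the notebook, this would be called as `extract_and_remove_zips(DestinationFolder)`.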

# Load and Inspect Kaggle data

* Import the pandas library
* Load the dataset as a pandas DataFrame and assign it to df
* View the first five rows of the data in the df variable

In [None]:
import pandas as pd
df = pd.read_csv("inputs/datasets/raw/house_prices_records.csv")
df.head()

The DataFrame summary shows the index range, the number of columns, the column labels and data types, the number of non-null values in each column, and the memory usage.

* The DataFrame has one target variable and 23 features.
* The features have int, float or object data types.
* Some features, such as EnclosedPorch and WoodDeckSF, are null in the great majority of rows.

In [None]:
df.info(max_cols=24)
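
To put a number on how many values are missing per column, the fraction of nulls can be computed directly. The sketch below uses a made-up toy DataFrame, not values from the dataset. The helper name missing_share and the FullColumn column are hypothetical.

```python
import pandas as pd


def missing_share(frame: pd.DataFrame) -> pd.Series:
    """Return the fraction of null values per column, largest first."""
    return frame.isnull().mean().sort_values(ascending=False)


# Toy data for illustration only; the values are invented
example = pd.DataFrame({
    "EnclosedPorch": [None, None, None, 0.0],
    "WoodDeckSF": [None, None, 120.0, None],
    "FullColumn": [1, 2, 3, 4],  # hypothetical column with no nulls
})
shares = missing_share(example)
```

On the loaded data, `missing_share(df)` would list the columns with the highest share of missing values first.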

# Push files to Repo

In this notebook, we have collected the Kaggle data and inspected the columns in the dataset.
* From this quick inspection, we can already see that the data needs to be cleaned before any analysis.

We save the dataset to outputs/datasets/collection so that it can be pushed to the repo.

In [None]:
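import os

# Create the output folder; exist_ok=True avoids an error if it already exists
os.makedirs('outputs/datasets/collection', exist_ok=True)

# Save the collected data without the index column
df.to_csv("outputs/datasets/collection/HousePrices.csv", index=False)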
