# Data Collection

## Objectives

* Fetch data from Kaggle and save it as raw data.
* Inspect the data and save it under outputs/datasets/collection.

## Inputs

* Kaggle JSON file - the authentication token.

## Outputs

* Generate housing record dataset: 'outputs/datasets/collection/HousePricesRecords.csv'

## Additional Comments

* There are no ethical or privacy concerns.
* The client found a public dataset.


---

## Install Python packages

In [None]:
%pip install -r /workspaces/Project-heritage-housing-issues/requirements.txt

# Change working directory

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory.
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
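
Every relative path later in this notebook assumes the working directory is the project root, so it can help to check that the change had the intended effect. The following is a minimal sketch, assuming the root can be recognized by the presence of a `requirements.txt` file; the helper name `looks_like_project_root` is hypothetical.

```python
import os

def looks_like_project_root(path, marker="requirements.txt"):
    """Return True if `path` contains the marker file (an assumed convention)."""
    return os.path.isfile(os.path.join(path, marker))

# Prints True when the current directory contains requirements.txt
print(looks_like_project_root(os.getcwd()))
```

If this prints `False`, the directory change above may have been run more than once, or from a different starting folder.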

---

## Fetch data from Kaggle

In [None]:
! pip install kaggle

In the Data Collection Section notebook we studied how to download a JSON file (the authentication token) from Kaggle. The token is needed to authenticate with Kaggle before downloading data in this session.

* You will need kaggle.json available.
* If you don't have it, please refer to the Data Collection > Data Collection Unit 1: Getting Your Data notebook.

The next step is to manually drag kaggle.json into the session.

Once you have done that, run the cell below so that the token is recognized in the session.

In [None]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
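
Before attempting the download, it may be worth confirming that the token file is where `KAGGLE_CONFIG_DIR` points and that it is well formed. The sketch below assumes the token is a JSON object with `username` and `key` fields; the helper name `check_kaggle_token` is hypothetical and is not part of the Kaggle CLI.

```python
import json
import os

def check_kaggle_token(config_dir):
    """Return a list of problems found with kaggle.json in `config_dir`."""
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        return [f"{token_path} not found"]
    try:
        with open(token_path) as f:
            token = json.load(f)
    except json.JSONDecodeError:
        return [f"{token_path} is not valid JSON"]
    if not isinstance(token, dict):
        return [f"{token_path} does not contain a JSON object"]
    # Assumption: a usable token holds both of these fields.
    return [f"missing field: {k}" for k in ("username", "key") if k not in token]

problems = check_kaggle_token(os.getcwd())
print(problems or "kaggle.json looks usable")
```

An empty list means the file exists and has the expected shape; it does not prove that Kaggle will accept the credentials.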

We are using the following [Kaggle URL](https://www.kaggle.com/codeinstitute/housing-prices-data).

Get the dataset path from the Kaggle URL.
* When you are viewing the dataset on Kaggle, the dataset path is the part of the address that comes after https://www.kaggle.com/.
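
The same extraction can be done programmatically. This is a minimal sketch using only the standard library; the function name `kaggle_dataset_path` is hypothetical, and the handling of a leading `datasets` segment is an assumption about how some Kaggle dataset URLs are laid out.

```python
from urllib.parse import urlparse

def kaggle_dataset_path(url):
    """Extract '<owner>/<dataset>' from a Kaggle dataset URL."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    # Assumption: some URLs carry a leading 'datasets' segment; drop it if present.
    if parts and parts[0] == "datasets":
        parts = parts[1:]
    if len(parts) < 2:
        raise ValueError(f"no dataset path found in {url!r}")
    return "/".join(parts[:2])

print(kaggle_dataset_path("https://www.kaggle.com/codeinstitute/housing-prices-data"))
# codeinstitute/housing-prices-data
```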

Define the Kaggle dataset and destination folder, then download the dataset.

In [None]:
KaggleDatasetPath = "codeinstitute/housing-prices-data"
DestinationFolder = "inputs/datasets/raw"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Unzip the downloaded file, delete the zip file, and delete the kaggle.json file.

In [None]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json
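
Where the `unzip` and `rm` shell commands are not available, the extraction step could be done with Python's standard library instead. This is a sketch of an alternative, not the approach used in the cell above; the function name `extract_and_remove_zips` is hypothetical.

```python
import glob
import os
import zipfile

def extract_and_remove_zips(folder):
    """Extract every .zip in `folder` into that folder, then delete the archive."""
    extracted = []
    for zip_path in sorted(glob.glob(os.path.join(folder, "*.zip"))):
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(folder)
        os.remove(zip_path)
        extracted.append(zip_path)
    return extracted

# Returns the list of archives that were extracted (empty if none were found)
extract_and_remove_zips("inputs/datasets/raw")
```

Unlike the shell cell, this sketch leaves kaggle.json in place, so the token would still need to be deleted separately.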

---

# Load and Inspect Kaggle data

## Dataset of house prices records 

In [None]:
import pandas as pd
df = pd.read_csv("inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv")
df.head()
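
The path above includes a timestamped folder name taken from the extracted archive, which could differ if the dataset is re-packaged. One way to guard against that is to search the raw folder for the file by name. A minimal sketch, with the hypothetical helper `find_csv`:

```python
import glob
import os

def find_csv(root, filename):
    """Return sorted paths under `root` (at any depth) whose basename is `filename`."""
    pattern = os.path.join(root, "**", filename)
    return sorted(glob.glob(pattern, recursive=True))

# Prints an empty list if the file has not been downloaded yet
print(find_csv("inputs/datasets/raw", "house_prices_records.csv"))
```

If the list has exactly one entry, that entry could be passed to `pd.read_csv` in place of the hard-coded path.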

---

DataFrame Summary

In [None]:
df.info()

## Convert the area and sale price columns from int to float so that they all use the same datatype

In [None]:
float_cols = ['1stFlrSF', 'BsmtFinSF1', 'BsmtUnfSF', 'GarageArea', 'GrLivArea',
              'LotArea', 'OpenPorchSF', 'TotalBsmtSF', 'SalePrice']
df[float_cols] = df[float_cols].astype(float)

In [None]:
df.info()
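
To see the effect of the conversion in isolation, here is a small sketch on a toy DataFrame. The values are made up for illustration, and the `Other` column is hypothetical; only `LotArea` and `SalePrice` are names taken from the dataset.

```python
import pandas as pd

# Toy data with made-up values, standing in for the real dataset
toy = pd.DataFrame({
    "LotArea": [8450, 9600],
    "SalePrice": [208500, 181500],
    "Other": [7, 6],  # hypothetical column that is left untouched
})

cols = ["LotArea", "SalePrice"]
toy[cols] = toy[cols].astype(float)

# The two selected columns are now float64; 'Other' keeps its integer dtype
print(toy.dtypes)
```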

### Show first ten rows

In [None]:
df.head(10)

---

# Push files to Repo

In [None]:
import os

# Create the output folder; exist_ok=True avoids an error if it already exists
os.makedirs('outputs/datasets/collection', exist_ok=True)

df.to_csv("outputs/datasets/collection/HousePricesRecords.csv", index=False)
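
A quick way to gain confidence in the saved file is to read it back and compare it with what was written. The sketch below does this with made-up rows in a temporary directory, so it does not touch the real outputs folder; the variable names are illustrative.

```python
import os
import tempfile

import pandas as pd

# Made-up rows standing in for the real dataset
sample = pd.DataFrame({"LotArea": [8450.0, 9600.0], "SalePrice": [208500.0, 181500.0]})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = os.path.join(tmp, "outputs", "datasets", "collection")
    os.makedirs(out_dir, exist_ok=True)
    out_path = os.path.join(out_dir, "HousePricesRecords.csv")
    sample.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# Compares values and dtypes of the written and reloaded frames
print(reloaded.equals(sample))
```

Writing with `index=False` matters here: without it, the reloaded file would gain an extra unnamed index column and the comparison would fail.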

---