# Data Collection

## Objectives
- Retrieve the dataset from Kaggle and save it as raw data in a .csv file

## Input

- Authentication Token for Kaggle (kaggle.json)

## Output
- outputs/datasets/collection/house_prices_records.csv

# Change working directory

### We need to change the working directory from its current folder to its parent folder

- We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

### We want to make the parent of the current directory the new current directory.

- os.path.dirname() gets the parent directory
- os.chdir() sets the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
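
Because `os.chdir(os.path.dirname(...))` moves up one level every time it runs, re-running that cell would leave the notebook one folder too high. A minimal sketch of a guard follows. It assumes the notebook lives in a folder named `jupyter_notebooks`; that name and the helper `move_to_parent_once` are hypothetical and should be adapted to your project.

```python
import os


def move_to_parent_once(notebook_folder: str = "jupyter_notebooks") -> str:
    """Move to the parent directory only while still inside `notebook_folder`.

    `notebook_folder` is an assumed folder name, not one taken from this project.
    """
    cwd = os.getcwd()
    # Only step up if the current folder is the notebook folder;
    # a second call is then a no-op instead of moving up again.
    if os.path.basename(cwd) == notebook_folder:
        os.chdir(os.path.dirname(cwd))
    return os.getcwd()
```

Calling `move_to_parent_once()` twice in a row returns the same directory both times, so the cell is safe to re-run.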

# Fetch data from Kaggle 

Install the Kaggle package to fetch the data

In [None]:
%pip install kaggle==1.5.12

Place your kaggle.json token in the current working directory, then run the cell below so the token is recognized in the session

In [None]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
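
If the token file is missing or malformed, the download step fails later with a less obvious error. The following is a minimal sketch of an up-front check in plain Python. The helper name `prepare_kaggle_token` is hypothetical, and the check assumes the token holds the `username` and `key` fields that Kaggle writes into `kaggle.json`.

```python
import json
import os
import stat


def prepare_kaggle_token(config_dir: str) -> str:
    """Check kaggle.json in `config_dir`, restrict its permissions,
    and point KAGGLE_CONFIG_DIR at that folder. Returns the token path."""
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        raise FileNotFoundError(f"kaggle.json not found in {config_dir}")

    with open(token_path, encoding="utf-8") as fh:
        token = json.load(fh)
    # Assumed token layout: a JSON object with "username" and "key"
    missing = {"username", "key"} - set(token)
    if missing:
        raise ValueError(f"kaggle.json is missing keys: {sorted(missing)}")

    # Same effect as `chmod 600 kaggle.json`: owner read/write only
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    os.environ["KAGGLE_CONFIG_DIR"] = config_dir
    return token_path
```

In the notebook this could be called as `prepare_kaggle_token(os.getcwd())` in place of the two shell-style lines.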

Define the Kaggle dataset and the destination folder, then download the dataset.

In [None]:
KaggleDatasetPath = "codeinstitute/housing-prices-data"
DestinationFolder = "inputs/datasets/raw"
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Run the cell below to unzip the downloaded file; it also deletes the zip file and the kaggle.json file

In [None]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json
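
The shell command above relies on `unzip` and `rm` being available, which is not the case on every system. As a portable alternative, here is a sketch using only the standard library. The helper name `unzip_and_clean` is hypothetical, and the sketch deliberately leaves the removal of `kaggle.json` as a separate step.

```python
import zipfile
from pathlib import Path


def unzip_and_clean(folder: str) -> list[str]:
    """Extract every .zip in `folder` into that folder, then delete the archive.

    Returns the names of the archives that were processed.
    """
    processed = []
    for archive in sorted(Path(folder).glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(folder)
        archive.unlink()  # remove the zip only after extraction succeeded
        processed.append(archive.name)
    return processed
```

With the variables defined earlier, the call would be `unzip_and_clean(DestinationFolder)`.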

# Load and Inspect Kaggle data

In [None]:
import pandas as pd
df = pd.read_csv("inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv")
df.head()
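
The path above hard-codes the timestamped folder name inside the archive, so it breaks if the dataset is repackaged. A minimal sketch of a lookup that searches the raw folder instead is shown below. The helper name `find_csv` is hypothetical, and the sketch assumes exactly one file with that name exists under the raw folder.

```python
from pathlib import Path


def find_csv(root: str, filename: str = "house_prices_records.csv") -> Path:
    """Search `root` recursively and return the single file named `filename`."""
    matches = sorted(Path(root).rglob(filename))
    if len(matches) != 1:
        raise FileNotFoundError(
            f"expected exactly one {filename} under {root}, found {len(matches)}"
        )
    return matches[0]
```

The returned path can be passed straight to `pd.read_csv`, for example `pd.read_csv(find_csv("inputs/datasets/raw"))`.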

DataFrame Summary

In [None]:
df.info()

# Save DataFrame to New Folder

In [None]:
import os
# Create the output folder; exist_ok=True avoids an error if it already exists
os.makedirs('outputs/datasets/collection', exist_ok=True)

df.to_csv("outputs/datasets/collection/house_prices_records.csv", index=False)