# **Data_Collection**

## Objectives

* Fetch data from Kaggle and save as raw data
* Inspect the data and save under outputs/datasets/collection

## Inputs

* Kaggle JSON file - authentication token

## Outputs

* Generate Dataset: outputs/datasets/collection/house_prices_records.csv and outputs/datasets/collection/inherited_houses.csv

## CRISP-DM

* "Data Collection"

---

# Change working directory

* The notebooks are stored in a subfolder, so when you run a notebook in the editor you need to change the working directory to the project root

We need to change the working directory from the notebook's folder to its parent folder
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
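
Re-running the change-directory cell moves one level further up each time. The sketch below is one way to guard against that, assuming the notebooks live in a folder called `jupyter_notebooks`. That folder name and the helper `move_to_parent_once` are hypothetical and not part of the original notebook.

```python
import os

def move_to_parent_once(notebook_folder="jupyter_notebooks"):
    """Move to the parent directory only while still inside the notebook folder.

    `notebook_folder` is an assumed name; adjust it to the actual subfolder.
    """
    current = os.getcwd()
    # Only step up when the current folder is the notebook subfolder,
    # so running the cell a second time leaves the directory unchanged.
    if os.path.basename(current) == notebook_folder:
        os.chdir(os.path.dirname(current))
    return os.getcwd()

current_dir = move_to_parent_once()
```

Calling the helper again after the first move returns the same directory, because the folder-name check no longer matches.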

# Fetch data from Kaggle

Install kaggle

In [None]:
!pip install kaggle 

Run the cell below to **set the Kaggle configuration directory to the current working directory and restrict the permissions on the Kaggle authentication JSON file**

In [None]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
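
The same configuration step can be sketched in pure Python, which also lets the notebook fail early when the token is missing. The helper name `prepare_kaggle_token` is hypothetical. It assumes a POSIX system, where `os.chmod` with owner read/write bits is equivalent to `chmod 600`.

```python
import os
import stat

def prepare_kaggle_token(token_path="kaggle.json"):
    """Point KAGGLE_CONFIG_DIR at the token's folder and restrict its permissions."""
    if not os.path.isfile(token_path):
        raise FileNotFoundError(f"Kaggle token not found: {token_path}")
    # The Kaggle client reads the token from the folder named in KAGGLE_CONFIG_DIR.
    config_dir = os.path.dirname(os.path.abspath(token_path))
    os.environ["KAGGLE_CONFIG_DIR"] = config_dir
    # Owner read/write only, the same mode as `chmod 600`.
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    return config_dir
```

With `kaggle.json` in the working directory, calling `prepare_kaggle_token()` would have the same effect as the cell above.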

Connect Kaggle dataset to the notebook

In [None]:
KaggleDatasetPath = "codeinstitute/housing-prices-data"
DestinationFolder = "inputs/datasets"
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Unzip the downloaded file, delete the zip file and delete the kaggle.json file

In [None]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json
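
On systems without the `unzip` command, the extraction step could be done with the standard library instead. This is a minimal sketch, not the notebook's own method. The helper `extract_and_remove_zips` is a hypothetical name, and unlike the shell command above it does not delete `kaggle.json`.

```python
import glob
import os
import zipfile

def extract_and_remove_zips(folder):
    """Extract every .zip file in `folder` into that folder, then delete the archive."""
    extracted = []
    for zip_path in sorted(glob.glob(os.path.join(folder, "*.zip"))):
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(folder)
            extracted.extend(archive.namelist())
        # Remove the archive only after it has been extracted successfully.
        os.remove(zip_path)
    return extracted
```

Calling `extract_and_remove_zips(DestinationFolder)` would return the list of extracted member names, which is a quick way to see where the CSV files ended up.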

---

# Load and inspect data

All house price records

In [None]:
import pandas as pd
df = pd.read_csv("inputs/datasets/house_prices_data/house-price/house_prices_records.csv")
df.head()

Inherited houses

In [None]:
import pandas as pd
df_inherited = pd.read_csv("inputs/datasets/house_prices_data/house-price/inherited_houses.csv")
df_inherited.head()

DataFrame Summary

In [None]:
df.info()

In [None]:
df_inherited.info()

---

Check for duplicated data

In [None]:
df[df.duplicated(subset=None, keep='first')]

There are no duplicated rows
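
To see what the duplicate check returns when duplicates do exist, here is a small sketch on a toy DataFrame. The column names and values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy data: the third row repeats the first row exactly.
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 8450],
    "SalePrice": [208500, 181500, 208500],
})

# keep='first' flags every repeat after the first occurrence.
duplicate_rows = toy[toy.duplicated(subset=None, keep="first")]
n_duplicates = int(toy.duplicated().sum())
```

Here `duplicate_rows` holds only the repeated third row, so an empty result on the real data means every row is unique.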

---

# Push files to Repo

In [None]:
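import os
try:
  # exist_ok=True avoids an error when the folder already exists
  os.makedirs(name='outputs/datasets/collection', exist_ok=True)
except Exception as e:
  print(e)

df.to_csv("outputs/datasets/collection/house_prices_records.csv", index=False)
# Save the inherited houses from df_inherited, not df
df_inherited.to_csv("outputs/datasets/collection/inherited_houses.csv", index=False)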