# **Data Collection Notebook**

### Objectives

* Fetch data from Kaggle and save as raw data.
* Inspect the data and save it under outputs/datasets/collection.
* Push the files to GitHub.

### Inputs

* Kaggle house prices records
* Kaggle API key (kaggle.json)

### Outputs

* outputs/datasets/collection/house_prices.csv
* outputs/datasets/collection/inherited_houses.csv

### Additional Comments

* This file and its contents were inspired by and adapted from the Churnometer Walkthrough Project 2.


---

### Change working directory

* The notebooks are assumed to be stored in a subfolder, so when you run a notebook in the editor you need to change the working directory.

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory.
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
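Re-running the change-directory cell above moves up one more level each time. A minimal sketch of a guard against that follows. It assumes the notebooks live in a folder called `jupyter_notebooks`. Both that folder name and the helper `move_to_parent_once` are hypothetical and not part of the original notebook.

```python
import os


def move_to_parent_once(notebook_folder="jupyter_notebooks"):
    """Move to the parent directory only while still inside the notebook folder.

    `notebook_folder` is an assumed name; adjust it to the real subfolder.
    """
    current_dir = os.getcwd()
    # Only step up if the current directory is the notebook subfolder itself
    if os.path.basename(current_dir) == notebook_folder:
        os.chdir(os.path.dirname(current_dir))
    return os.getcwd()
```

Calling `move_to_parent_once()` a second time leaves the working directory unchanged, because the current folder is then no longer the notebook subfolder.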

### Fetch data from Kaggle

Install Kaggle package to fetch data


In [None]:
! pip install kaggle==1.5.12


In [None]:
print(os.environ)


In [None]:
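import os
# Point the Kaggle CLI at the folder that holds kaggle.json (the current directory)
os.environ["KAGGLE_CONFIG_DIR"] = os.getcwd()
# Restrict the token file to owner read/write only
! chmod 600 kaggle.json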
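The cell above relies on a shell command and fails with a less obvious error if kaggle.json is missing. As a hedged alternative, the same two steps can be sketched in plain Python. The helper name `prepare_kaggle_token` is hypothetical, and the permission change assumes a POSIX system.

```python
import os
import stat


def prepare_kaggle_token(config_dir):
    """Check that kaggle.json exists in config_dir, then register and protect it."""
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        raise FileNotFoundError(f"No kaggle.json found in {config_dir}")
    # Tell the Kaggle CLI where to look for the token
    os.environ["KAGGLE_CONFIG_DIR"] = config_dir
    # Owner read/write only, equivalent to `chmod 600`
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    return token_path
```

In this notebook the call would be `prepare_kaggle_token(os.getcwd())`, run after the working directory has been changed to the project root.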


In [None]:
KaggleDatasetPath = "codeinstitute/housing-prices-data"
DestinationFolder = "inputs/datasets/raw"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}
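The download command normally saves the dataset as a zip archive, while the loading cells further down read CSV files from an extracted folder, so an extraction step is implied. The sketch below shows one way to do it with the standard library. The helper name `extract_archives` is hypothetical, and deleting the archive afterwards is an assumption about the desired clean-up.

```python
import glob
import os
import zipfile


def extract_archives(folder):
    """Extract every .zip file in `folder` into that folder, then delete the archive."""
    extracted = []
    for archive in sorted(glob.glob(os.path.join(folder, "*.zip"))):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(folder)
            extracted.extend(zf.namelist())
        # Remove the archive once its contents are on disk
        os.remove(archive)
    return extracted
```

In this notebook the call would be `extract_archives(DestinationFolder)`. The returned list of member names is a quick way to confirm which files and folders were unpacked.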

### Load and Inspect Kaggle data

House Prices Records

In [None]:
import pandas as pd
df = pd.read_csv("inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv")
df.head()

In [None]:
df.info()

In [None]:
df[df.duplicated(subset=None, keep="first")]

There are no duplicate rows in the house prices records.

### Load and inspect the inherited house records

In [None]:
df_inherited = pd.read_csv("inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/inherited_houses.csv")
df_inherited.head()

In [None]:
df_inherited[df_inherited.duplicated(subset=None, keep="first")]

### Conclusions

* There is a difference between the extracted house price dataset and the inherited houses dataset.
* There are no duplicates in either dataset.
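The first conclusion does not say where the two datasets differ. One way to make that concrete is to compare their column names, sketched below. The helper `compare_columns` is hypothetical, and the column names in the example are placeholders rather than the real dataset columns.

```python
def compare_columns(left_columns, right_columns):
    """Report which column names are unique to each side and which are shared."""
    left, right = list(left_columns), list(right_columns)
    return {
        "only_in_left": [c for c in left if c not in right],
        "only_in_right": [c for c in right if c not in left],
        "shared": [c for c in left if c in right],
    }


# Placeholder column names, for illustration only
example = compare_columns(["a", "b", "c"], ["a", "b"])
print(example["only_in_left"])  # ['c']
```

In this notebook the call would be `compare_columns(df.columns, df_inherited.columns)`, which lists any columns present in one dataset but not the other.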

### Push files to Repo

In [None]:
import os
# Create the output folder; exist_ok=True avoids an error if it already exists
os.makedirs("outputs/datasets/collection", exist_ok=True)

df.to_csv("outputs/datasets/collection/house_prices.csv", index=False)
df_inherited.to_csv("outputs/datasets/collection/inherited_houses.csv", index=False)