# **Data Collection Notebook**

## Objectives

* Fetch data from Kaggle and save as raw data
* Load and inspect the data and save it under inputs/datasets/raw
* Push the files to the GitHub repository

## Inputs

* Kaggle JSON authentication token
* The recommended house price dataset, downloaded from [Kaggle](https://www.kaggle.com/datasets/codeinstitute/housing-prices-data)

## Outputs

* The Kaggle files were unzipped to:
  * inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv
  * inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/inherited_houses.csv

## Additional Comments

* This notebook was written following the guidelines in the data collection lesson of the Customer Churn walkthrough project.
* This notebook relates to the Data Understanding step of the CRISP-DM methodology.


---

# Change working directory

Change the working directory from its current folder to its parent folder
* Access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
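
As a small illustration of what `os.path.dirname()` returns, the sketch below applies it to a made-up path. The folder names are hypothetical and are used only to show that the last path component is dropped.

```python
import os

# Hypothetical path used only for illustration
notebook_dir = os.path.join("workspace", "project", "jupyter_notebooks")

# os.path.dirname() strips the last path component, giving the parent folder
project_dir = os.path.dirname(notebook_dir)

print(project_dir)
```

Note that re-running the `os.chdir()` cell above moves up one more level each time, so it is worth confirming the directory after every run.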

# Fetch data from Kaggle

Install Kaggle library

In [None]:
! pip3 install kaggle==1.5.12

* For the data download to work, the user needs a Kaggle account and a downloaded kaggle.json file.
* The kaggle.json file contains the authentication token that Kaggle requires to authorise a data download.
* The kaggle.json token file must be copied to the root directory of the project repository.
* Next, set the Kaggle environment variable and restrict the token file's permissions to read and write for the owner only.

In [None]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
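
The same setup can also be done in pure Python, without shell commands. The following is a minimal sketch, assuming the token sits in the given project folder; the helper name `configure_kaggle_token` is hypothetical and is not part of the Kaggle library.

```python
import os
import stat
from pathlib import Path


def configure_kaggle_token(project_root="."):
    """Point Kaggle at the token in project_root and restrict its permissions.

    Hypothetical helper for illustration; assumes kaggle.json is in project_root.
    """
    root = Path(project_root).resolve()
    token = root / "kaggle.json"
    if not token.is_file():
        raise FileNotFoundError(f"Expected a Kaggle token at {token}")
    # Tell the Kaggle client where to look for kaggle.json
    os.environ["KAGGLE_CONFIG_DIR"] = str(root)
    # Equivalent to `chmod 600`: read and write for the owner only
    token.chmod(stat.S_IRUSR | stat.S_IWUSR)
    return token
```

Calling `configure_kaggle_token()` from the project root would mirror the cell above, with the added benefit of a clear error if the token is missing.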

Get the dataset path from the Kaggle URL

In [None]:
KaggleDatasetPath = "codeinstitute/housing-prices-data"
DestinationFolder = "inputs/datasets/raw"
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Unzip the downloaded file, then delete the zip file and the kaggle.json file

In [None]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json
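
Where the `unzip` command is not available, the extraction step could be done with the standard library instead. This is a sketch under that assumption; `extract_and_remove_zips` is a hypothetical helper name, and unlike the cell above it does not delete kaggle.json.

```python
import zipfile
from pathlib import Path


def extract_and_remove_zips(folder):
    """Extract every .zip archive in folder into folder, then delete the archive."""
    folder = Path(folder)
    extracted = []
    for archive in sorted(folder.glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(folder)
            extracted.extend(zf.namelist())
        # Remove the archive only after it has been extracted without error
        archive.unlink()
    return extracted
```

In this notebook it would be called as `extract_and_remove_zips(DestinationFolder)`; the returned list of member names gives a quick view of what was unpacked.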

In [None]:
! pip3 uninstall -y kaggle==1.5.12

---

# Import packages & set environment variables

In [None]:
import pandas as pd
pd.options.display.max_columns = None
pd.options.display.max_rows = None

---

# Load and Inspect the House Price Records

Read the house_prices_records dataset csv file into a Pandas dataframe

In [None]:
df = pd.read_csv("inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv")
print(df.shape)
df.head()

Display dataframe summary information

In [None]:
df.info()

The dataset has no `id` field to identify unique rows, so no identifier-based duplicate check is carried out.
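
If a duplicate check is wanted anyway, whole-row comparison is one option that does not need an `id` field. The sketch below uses a toy dataframe with made-up values and an illustrative `GrLivArea` column; in the notebook, `df` would take the place of `demo`.

```python
import pandas as pd

# Toy frame standing in for the house price records (values are made up)
demo = pd.DataFrame({
    "GrLivArea": [1710, 1262, 1710],
    "SalePrice": [208500, 181500, 208500],
})

# duplicated() flags rows that repeat an earlier row across all columns
n_duplicates = int(demo.duplicated().sum())
print(f"Fully duplicated rows: {n_duplicates}")
```

Identical rows are not necessarily errors here, since two different houses could share the same recorded attributes, so any flagged rows would need a manual look.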

---

# Load and Inspect the Inherited House Records

Read the inherited_houses dataset csv file into a Pandas dataframe

In [None]:
df_inherited = pd.read_csv("inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/inherited_houses.csv")
print(df_inherited.shape)
df_inherited

Display dataframe summary information

In [None]:
df_inherited.info()

---

# Conclusions and Next Steps

* Note that the datatypes differ between the house price dataset and the inherited houses dataset.
  * Some features are of type int in the house price dataset but of type float in the inherited houses dataset, and vice versa.
  * However, this difference should not affect the analysis of the data.
  * SalePrice (which exists only in the house price dataset, of course) is of type integer.
* Now that we have collected the datasets we need, we can move on to data cleaning.
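
The datatype differences noted above can be listed programmatically rather than by comparing the two `info()` outputs by eye. This is a minimal sketch; `dtype_differences` is a hypothetical helper, and the toy frames use made-up values with an illustrative `LotArea` column.

```python
import pandas as pd


def dtype_differences(left, right):
    """Return the shared columns whose dtypes differ between two dataframes."""
    shared = left.columns.intersection(right.columns)
    comparison = pd.DataFrame({
        "left": left[shared].dtypes.astype(str),
        "right": right[shared].dtypes.astype(str),
    })
    return comparison[comparison["left"] != comparison["right"]]


# Toy frames with made-up values; in the notebook these would be df and df_inherited
a = pd.DataFrame({"LotArea": [8450, 9600], "SalePrice": [208500, 181500]})
b = pd.DataFrame({"LotArea": [11622.0, 14267.0]})
print(dtype_differences(a, b))
```

Columns that exist in only one dataframe, such as SalePrice, are left out of the comparison because only shared columns are inspected.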

---