# **Data Collection Notebook**

## Objectives

* Fetch data from Kaggle and save as raw data
* Inspect the data and save under outputs/datasets/collection

## Inputs

* Kaggle JSON file (authentication token)

## Outputs

* Generate Dataset: outputs/datasets/collection/HeritageHousing.csv

## Additional Comments

* For this project, we are fetching the data from Kaggle.

---

# Change working directory

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
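
As a minimal sketch of what the parent-directory step does, the snippet below applies `os.path.dirname()` to a fixed path instead of the live working directory. The path and the helper name `parent_directory` are hypothetical and used only for illustration.

```python
import os


def parent_directory(path: str) -> str:
    """Return the parent folder of `path`, exactly as os.path.dirname() reports it."""
    return os.path.dirname(path)


# Hypothetical notebook folder, used only for illustration
example_path = "/workspace/project/jupyter_notebooks"
print(parent_directory(example_path))
```

Note that `os.chdir(os.path.dirname(os.getcwd()))` moves up one level every time it runs, so re-running that cell moves one folder higher than intended.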

# Fetch data from Kaggle

---

First, we install the kaggle package so we can fetch the data.

In [None]:
! pip install kaggle==1.5.12

Next, we need our JSON token file from Kaggle. Kaggle uses it to authenticate the download of the data required for this project.
* You will need a **kaggle.json** file. You can get one by registering a free account at https://www.kaggle.com/ and creating an API token from your account settings.
* Drag and drop the kaggle.json file into the root directory of this project.

![Kaggle](../media/kaggle.png)

Once that is done, we run the cell below so that the token is recognized in the session.

In [None]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
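
Before downloading, it can help to confirm that the token file is where `KAGGLE_CONFIG_DIR` points and that it parses as JSON. The sketch below is one possible check, not part of the original workflow. It assumes the token contains the `username` and `key` fields; the helper name `kaggle_token_is_valid` and the placeholder credentials are hypothetical.

```python
import json
import os
import tempfile


def kaggle_token_is_valid(config_dir: str) -> bool:
    """Return True if config_dir/kaggle.json exists and holds both expected fields."""
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        return False
    try:
        with open(token_path, encoding="utf-8") as handle:
            token = json.load(handle)
    except (OSError, json.JSONDecodeError):
        return False
    # Assumed token layout: a JSON object with "username" and "key"
    return isinstance(token, dict) and all(k in token for k in ("username", "key"))


# Demonstration in a throwaway folder with placeholder credentials
with tempfile.TemporaryDirectory() as tmp_dir:
    missing = kaggle_token_is_valid(tmp_dir)  # no kaggle.json yet
    with open(os.path.join(tmp_dir, "kaggle.json"), "w", encoding="utf-8") as handle:
        json.dump({"username": "example_user", "key": "example_key"}, handle)
    present = kaggle_token_is_valid(tmp_dir)

print(missing, present)
```

In the notebook itself, the call would be `kaggle_token_is_valid(os.getcwd())`, since the token is placed in the project root.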

We are using the following [Kaggle URL](https://www.kaggle.com/datasets/codeinstitute/housing-prices-data).

Using the dataset URL, we then define the Kaggle dataset, destination folder and download it

In [None]:
KaggleDatasetPath = "codeinstitute/housing-prices-data"
DestinationFolder = "inputs/datasets/raw"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Next, we unzip the downloaded file, then delete the zip file and the kaggle.json file. We now have the data and no longer risk pushing our private API key to the repository.

In [None]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json
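
The shell commands above rely on `unzip` and `rm` being available. As a hedged alternative for environments without them, the same unzip-then-delete step can be sketched with Python's standard `zipfile` module. The helper name `extract_and_remove` and the sample archive contents are hypothetical.

```python
import os
import tempfile
import zipfile


def extract_and_remove(zip_path: str, destination: str) -> list:
    """Extract zip_path into destination, delete the archive, and return the member names."""
    with zipfile.ZipFile(zip_path) as archive:
        names = archive.namelist()
        archive.extractall(destination)
    os.remove(zip_path)
    return names


# Demonstration on a small throwaway archive
with tempfile.TemporaryDirectory() as tmp_dir:
    archive_path = os.path.join(tmp_dir, "sample.zip")
    with zipfile.ZipFile(archive_path, "w") as archive:
        archive.writestr("data/records.csv", "a,b\n1,2\n")

    extracted = extract_and_remove(archive_path, tmp_dir)
    archive_gone = not os.path.exists(archive_path)
    csv_present = os.path.isfile(os.path.join(tmp_dir, "data", "records.csv"))

print(extracted, archive_gone, csv_present)
```

This sketch does not delete kaggle.json; that would still need a separate `os.remove("kaggle.json")` once the download has finished.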

---

# Load and Inspect Kaggle Data

Let's take a look at our data.

In [None]:
import pandas as pd
df = pd.read_csv("inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv")
df.head(10)

### Abbreviations explained:

1stFlrSF: First Floor square feet

2ndFlrSF: Second floor square feet

BedroomAbvGr: Bedrooms above grade (does NOT include basement bedrooms)

BsmtExposure: Refers to walkout or garden level walls
*   Gd: Good Exposure;
*   Av: Average Exposure;
*   Mn: Minimum Exposure;
*   No: No Exposure;
*   None: No Basement

BsmtFinType1: Rating of basement finished area
*   GLQ: Good Living Quarters;
*   ALQ: Average Living Quarters;
*   BLQ: Below Average Living Quarters;
*   Rec: Average Rec Room;
*   LwQ: Low Quality;
*   Unf: Unfinished;
*   None: No Basement

BsmtFinSF1: Type 1 finished square feet

BsmtUnfSF: Unfinished square feet of basement area

TotalBsmtSF: Total square feet of basement area

GarageArea: Size of garage in square feet

GarageFinish: Interior finish of the garage
*   Fin: Finished;
*   RFn: Rough Finished;
*   Unf: Unfinished;
*   None: No Garage

GarageYrBlt: Year garage was built

GrLivArea: Above grade (ground) living area square feet

KitchenQual: Kitchen quality
*   Ex: Excellent;
*   Gd: Good;
*   TA: Typical/Average;
*   Fa: Fair;
*   Po: Poor

LotArea: Lot size in square feet

LotFrontage: Linear feet of street connected to property

MasVnrArea: Masonry veneer area in square feet

EnclosedPorch: Enclosed porch area in square feet

OpenPorchSF: Open porch area in square feet

OverallCond: Rates the overall condition of the house
*   10: Very Excellent;
*   9: Excellent;
*   8: Very Good;
*   7: Good;
*   6: Above Average;
*   5: Average;
*   4: Below Average;
*   3: Fair;
*   2: Poor;
*   1: Very Poor

OverallQual: Rates the overall material and finish of the house
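
The coded values listed above can be kept next to the data as plain lookup tables, which makes it easier to read categorical columns during inspection. The snippet below is a small sketch for `KitchenQual` only. The names `KITCHEN_QUAL_LABELS` and `describe_kitchen_qual` are hypothetical, and the fallback text for unlisted codes is an assumption.

```python
# Codes and labels copied from the KitchenQual list above, best to worst
KITCHEN_QUAL_LABELS = {
    "Ex": "Excellent",
    "Gd": "Good",
    "TA": "Typical/Average",
    "Fa": "Fair",
    "Po": "Poor",
}


def describe_kitchen_qual(code: str) -> str:
    """Translate a KitchenQual code into its label; unlisted codes get a fallback."""
    return KITCHEN_QUAL_LABELS.get(code, "Unknown code")


print(describe_kitchen_qual("TA"))
```

With the dataframe loaded, the same table could be applied to a whole column, for example with `df["KitchenQual"].map(KITCHEN_QUAL_LABELS)`.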

### Dataframe Summary

In [None]:
df.info()

---

# Push files to Repo

In [None]:
import os

# Create the outputs/datasets/collection folder; exist_ok=True avoids an error if it already exists
os.makedirs('outputs/datasets/collection', exist_ok=True)

# Save the inspected data without the index column
df.to_csv("outputs/datasets/collection/HeritageHousing.csv", index=False)
