# **Data Collection Notebook**

## Objectives

* Fetch data from Kaggle and save it as raw data.
* Inspect the data and save it under outputs/datasets/collection.

## Inputs

*   Kaggle JSON file - the authentication token.
*   inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv
*   inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/inherited_houses.csv

## Outputs

* Generate Dataset: outputs/datasets/collection/HousePrices.csv
* Generate Dataset: outputs/datasets/collection/InheritedHouses.csv

## Additional Comments

*  Data collection is essential to delivering the two business requirements.
*  The client is interested in discovering how the house attributes correlate with the sale price. 
   Therefore, the client expects data visualisations of the correlated variables against the sale price.
*  The client is interested in predicting the sale price of her four inherited houses and of any other house in Ames, Iowa.


---

# Install python packages in the notebooks

# Change working directory

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory.
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
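
Re-running the cell that changes the directory moves one level further up each time. A minimal sketch of a guard against that, assuming the notebooks live in a folder named `jupyter_notebooks` (a hypothetical name; adjust it to the actual folder). The helper `resolve_project_root` is illustrative only and is not part of the original notebook.

```python
import os

def resolve_project_root(cwd, notebooks_folder="jupyter_notebooks"):
    """Return the parent of cwd if cwd is the notebooks folder, else cwd unchanged.

    notebooks_folder is an assumed name, not taken from the project.
    """
    normalised = os.path.normpath(cwd)
    if os.path.basename(normalised) == notebooks_folder:
        return os.path.dirname(normalised)
    return cwd

# Safe to call more than once: the second call leaves the path unchanged.
project_root = resolve_project_root(os.path.join("workspace", "project", "jupyter_notebooks"))
print(project_root)
print(resolve_project_root(project_root))
```

To apply it to the session, one could call `os.chdir(resolve_project_root(os.getcwd()))` instead of changing to the parent unconditionally.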

# Fetch data from Kaggle

Install Kaggle package to fetch data

In [None]:
%pip install kaggle==1.5.12

In the Data Collection Section notebook we studied how to download a **JSON file (authentication token)** from Kaggle. The token authenticates this session with Kaggle so that it can download data.
* You will need **kaggle.json** available
* In case you don't have it, please refer to the Data Collection > Data Collection Unit 1: Getting Your Data notebook.


The next step is to manually drag the kaggle.json file into the session.

Once you have done that, run the cell below so that the token is recognized in the session.

In [None]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
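
Before downloading, it can help to confirm that the token file is actually in place. The sketch below is a hedged example: `check_kaggle_token` is a hypothetical helper, and it relies only on the fact that a Kaggle token is a JSON object with `username` and `key` entries.

```python
import json
import os

def check_kaggle_token(config_dir):
    """Return True if kaggle.json in config_dir is a JSON object with 'username' and 'key'."""
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        return False
    try:
        with open(token_path) as f:
            token = json.load(f)
    except (json.JSONDecodeError, OSError):
        return False
    return isinstance(token, dict) and {"username", "key"}.issubset(token)

# Checks the current directory, which the cell above sets as KAGGLE_CONFIG_DIR.
print(check_kaggle_token(os.getcwd()))
```

The check never prints the token contents, so it is safe to leave in a notebook that gets committed.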

We are using the following [Kaggle URL](https://www.kaggle.com/datasets/codeinstitute/housing-prices-data)

Get the dataset path from the Kaggle url
* When you are viewing the dataset at Kaggle, check what comes after https://www.kaggle.com/datasets/ in the URL. For the URL above, that is `codeinstitute/housing-prices-data`.
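
The same lookup can be done in code. This is a small sketch, assuming the URL has the form `https://www.kaggle.com/datasets/<owner>/<dataset>`; `kaggle_dataset_path` is a hypothetical helper name.

```python
from urllib.parse import urlparse

def kaggle_dataset_path(url):
    """Extract '<owner>/<dataset>' from a Kaggle dataset URL."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) < 3 or parts[0] != "datasets":
        raise ValueError(f"Not a Kaggle dataset URL: {url}")
    return "/".join(parts[1:3])

dataset_path = kaggle_dataset_path(
    "https://www.kaggle.com/datasets/codeinstitute/housing-prices-data"
)
print(dataset_path)  # codeinstitute/housing-prices-data
```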

Define the Kaggle dataset, and destination folder and download it.

In [None]:
KaggleDatasetPath = "codeinstitute/housing-prices-data"
# No trailing slash: later cells append "/*.zip" to this folder
DestinationFolder = "inputs/datasets/raw"
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Unzip the downloaded file, delete the zip file and delete the kaggle.json file

In [None]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json
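
The shell commands above depend on `unzip` and `rm` being available. As a portable alternative, the extraction step can be sketched with Python's standard library; `extract_and_remove_zips` is a hypothetical helper, not part of the original notebook.

```python
import glob
import os
import zipfile

def extract_and_remove_zips(folder):
    """Extract every .zip archive in folder into folder, then delete the archive."""
    extracted = []
    for zip_path in glob.glob(os.path.join(folder, "*.zip")):
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(folder)
        os.remove(zip_path)
        extracted.append(zip_path)
    return extracted

# Example call, using the destination folder from the download cell:
# extract_and_remove_zips("inputs/datasets/raw")
```

This sketch does not delete the token; removing kaggle.json remains a separate step, for example with `os.remove("kaggle.json")`.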

---

# Load and Inspect Kaggle data

In [None]:
import pandas as pd
# Plain strings: no f-string prefix is needed because nothing is interpolated
df = pd.read_csv("inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv")
df_inherited = pd.read_csv("inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/inherited_houses.csv")
df.head()

The inherited houses dataset has no `SalePrice` column because the sale price of these houses is not yet known.

In [None]:
df_inherited.head()
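
The missing column can also be confirmed programmatically rather than by eye. A minimal sketch on toy frames (the values are made up and stand in for `df` and `df_inherited`):

```python
import pandas as pd

# Toy frames standing in for df and df_inherited (values are illustrative only).
df_demo = pd.DataFrame({"LotArea": [8450, 9600], "SalePrice": [208500, 181500]})
df_inherited_demo = pd.DataFrame({"LotArea": [11622, 14267]})

# Columns present in the sales records but absent from the inherited houses.
missing_in_inherited = df_demo.columns.difference(df_inherited_demo.columns).tolist()
print(missing_in_inherited)
```

On the real data, the equivalent call would be `df.columns.difference(df_inherited.columns)`.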

DataFrame Summary

In [None]:
df.info()

In [None]:
df_inherited.info()

We check both datasets for duplicated rows: there are none.

In [None]:
df[df.duplicated(subset=None)]

In [None]:
df_inherited[df_inherited.duplicated(subset=None)]
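
Displaying the duplicated rows works well for small results; a single count is often easier to read. A short sketch on a toy frame (illustrative values only):

```python
import pandas as pd

# Toy frame in which the third row repeats the first.
df_demo = pd.DataFrame({"LotArea": [8450, 9600, 8450], "YearBuilt": [2003, 1976, 2003]})

# duplicated() marks each row that repeats an earlier row; sum() counts the marks.
n_duplicates = int(df_demo.duplicated().sum())
print(n_duplicates)
```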

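We noticed `BsmtExposure` is a categorical variable stored as text. We will convert its categories to integers because the ML model requires numeric variables. The cell below lists the categories present; `replace` leaves any value that is missing from the mapping unchanged, so every category needs an entry for the column to become numeric.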

In [None]:
df['BsmtExposure'].unique()

In [None]:
df['BsmtExposure'] = df['BsmtExposure'].replace({"No": 2, "Av": 1, "Other": 0})

Check the `BsmtExposure` data type.

In [None]:
df['BsmtExposure'].dtype
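
If the data type is still `object`, at least one category had no entry in the mapping. One way to find such gaps is `Series.map`, which turns unmapped values into NaN. The sketch below uses illustrative categories and codes that are assumptions, not the project's actual encoding.

```python
import pandas as pd

# Illustrative categories; the real ones come from df['BsmtExposure'].unique().
exposure_demo = pd.Series(["No", "Av", "Gd", "No"])
exposure_codes = {"No": 0, "Av": 1}  # deliberately incomplete to show the check

encoded = exposure_demo.map(exposure_codes)

# Values without an entry in the mapping become NaN, so they are easy to list.
unmapped = sorted(exposure_demo[encoded.isna()].unique())
print(unmapped)
```

An empty list means the mapping covers every category. Note that values already missing in the source column would also show up as NaN after `map`.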

The data collection is ready to be sent to the repo including the new output directories and csv files.

# Push files to Repo

In [None]:
import os
# Create outputs/datasets/collection; exist_ok=True avoids an error if it already exists
os.makedirs('outputs/datasets/collection', exist_ok=True)

df.to_csv("outputs/datasets/collection/HousePrices.csv", index=False)
df_inherited.to_csv("outputs/datasets/collection/InheritedHouses.csv", index=False)
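
A quick way to confirm that a saved file matches the DataFrame in memory is to read it back and compare. The sketch below does this in a temporary folder with a toy frame (illustrative values), so it does not touch the real outputs.

```python
import os
import tempfile

import pandas as pd

df_demo = pd.DataFrame({"LotArea": [8450, 9600], "SalePrice": [208500, 181500]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_dir = os.path.join(tmp_dir, "outputs", "datasets", "collection")
    os.makedirs(out_dir, exist_ok=True)  # no error if the folder already exists
    out_path = os.path.join(out_dir, "HousePrices.csv")
    df_demo.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# equals() compares values, column labels and dtypes.
round_trip_ok = reloaded.equals(df_demo)
print(round_trip_ok)
```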

### The next steps
The next notebook covers the data study.