# **Notebook 1: Data Collection**

## Objectives

* Fetch (download) data from Kaggle and save it as raw data (inputs/datasets/raw)
* Inspect the data and save it to outputs/datasets/collection
* Save the project and push it to the GitHub repository

## Inputs

* Kaggle JSON file (kaggle.json) - the Kaggle authentication token used to download the dataset

## Outputs

* Datasets saved in outputs/datasets/collection: HousePricesRecords.csv and InheritedHouses.csv

***

## Change working directory

In this section we get the location of the current directory and move one step up to its parent folder, so that the app can access the project folder.

1. We need to change the working directory from its current folder to its parent folder
    * We access the current directory with os.getcwd()

In [None]:
import os

current_dir = os.getcwd()
current_dir

2. We want to make the parent of the current directory the new current directory
    * os.path.dirname() gets the parent directory
    * os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("you have set a new current directory")

3. Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
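
The steps above can also be sketched as a single helper that computes the parent folder without changing directory, which makes it easy to inspect the target before calling os.chdir(). The name `parent_directory` is hypothetical and is not part of this notebook.

```python
import os


def parent_directory(path: str) -> str:
    """Return the parent folder of `path` (hypothetical helper, for illustration)."""
    # abspath normalises the path first, so a trailing separator does not
    # make dirname return the same folder.
    return os.path.dirname(os.path.abspath(path))


# Inspect the target before switching to it with os.chdir(target).
target = parent_directory(os.getcwd())
print(target)
```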

## Getting (fetching) data from Kaggle

1. First we have to install the Kaggle library

In [None]:
! pip install kaggle==1.6.12

2. Download the authentication token from Kaggle:
    * You need the Kaggle authentication token 'kaggle.json', which you can download from your Kaggle account settings, under APIs (create new token)
    * If you do not have a Kaggle account, create one and then download the kaggle.json token
    * Once the token is downloaded, put it in the main project folder
    * Then run the cell below to point Kaggle at the project folder and restrict the token's file permissions

In [None]:
import os

os.environ["KAGGLE_CONFIG_DIR"] = os.getcwd()
! chmod 600 kaggle.json
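
The `chmod 600` step gives the file owner read and write access and removes access for everyone else. As a minimal sketch, assuming a POSIX system, the same permission change can be made from Python with the standard library. The helper name `restrict_to_owner` is hypothetical.

```python
import os
import stat


def restrict_to_owner(token_path: str) -> int:
    """Set owner-only read/write (0o600) on the file and return the resulting mode bits."""
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)  # same effect as `chmod 600`
    return stat.S_IMODE(os.stat(token_path).st_mode)


# Usage, assuming kaggle.json sits in the current working directory:
# restrict_to_owner("kaggle.json")
```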

3. Fetch the dataset from Kaggle
    * We will be using the dataset named "House Prices"
    * We define the destination folder for the dataset
    * We then download the dataset

In [None]:
Kaggle_dataset_name = "codeinstitute/housing-prices-data"
Destination_folder = "inputs/datasets/raw"
! kaggle datasets download -d {Kaggle_dataset_name} -p {Destination_folder}

4. Unzip the downloaded file
    * First we unzip the downloaded file into the destination folder
    * After unzipping, we delete the downloaded zip file
    * Finally we delete the Kaggle token, as it is not required anymore

In [None]:
! unzip {Destination_folder}/*.zip -d {Destination_folder}
! rm {Destination_folder}/*.zip
! rm kaggle.json
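
The cell above relies on the `unzip` and `rm` shell commands being available. Where they are not, a rough equivalent can be written with Python's standard `zipfile` module. This is a sketch only, and the helper name `extract_and_remove` is hypothetical.

```python
import os
import zipfile


def extract_and_remove(zip_path: str, destination: str) -> list[str]:
    """Extract `zip_path` into `destination`, delete the archive, return the member names."""
    with zipfile.ZipFile(zip_path) as archive:
        names = archive.namelist()
        archive.extractall(destination)
    # The archive is removed only after extraction finished without raising.
    os.remove(zip_path)
    return names
```

Returning the member names makes it easy to see which folder structure the archive created before hard-coding a path to the CSV files.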

5. The Kaggle library is not needed anymore, so we uninstall it

In [None]:
! pip uninstall -y kaggle==1.6.12

## Importing packages and setting environment variables

In [None]:
import pandas as pd

pd.options.display.max_columns = None
pd.options.display.max_rows = None

## Loading and Inspecting Dataset Records
* We load the dataset CSV file into a pandas DataFrame

In [None]:
df = pd.read_csv('inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv')
print(df.shape)
df.head()

## Dataframe Summary
* We get a summary of the dataset using the .info() method

In [None]:
df.info()

* The dataset has no Customer ID or any other ID field, so there is no identifier on which to check for duplicated records
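
Even without an ID field, rows that are identical across every column can still be counted if a check is wanted. The sketch below uses a small toy dataframe, not rows from the House Prices dataset, and the column names are made up for illustration.

```python
import pandas as pd

# Toy data for illustration only; these are not rows from the House Prices dataset.
toy = pd.DataFrame({"area": [100, 100, 250], "price": [1.5, 1.5, 3.0]})

# duplicated() marks every row that repeats an earlier row across all columns.
n_duplicates = int(toy.duplicated().sum())
print(n_duplicates)
```

On the notebook's own data the equivalent call would be `df.duplicated().sum()`.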

## Open and read the inherited houses dataset into a pandas DataFrame

In [None]:
df_inherited = pd.read_csv("inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/inherited_houses.csv")
print(df_inherited.shape)
df_inherited

## Save datasets to output folder and push files to GitHub

In [None]:
import os

# create the outputs folder; exist_ok avoids an error if it already exists
os.makedirs('outputs/datasets/collection', exist_ok=True)

# index=False keeps the row index out of the saved files
df.to_csv('outputs/datasets/collection/HousePricesRecords.csv', index=False)
df_inherited.to_csv('outputs/datasets/collection/InheritedHouses.csv', index=False)

## Outcome and Next Steps
* The two datasets (house prices and inherited houses) differ in structure
* The House Prices dataset is a mix of INT and FLOAT type features
* The inherited houses dataset does not include a price
* The next step is to clean the data
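
One way to confirm how the two datasets differ is to compare their column names. The sketch below shows the idea with plain lists. The helper name `columns_only_in_first` and the column names are hypothetical and are not taken from the datasets.

```python
def columns_only_in_first(first, second):
    """Return the column names present in `first` but not in `second`, keeping order."""
    other = set(second)
    return [name for name in first if name not in other]


# Hypothetical column names, for illustration only.
records_columns = ["area", "year_built", "price"]
inherited_columns = ["area", "year_built"]
print(columns_only_in_first(records_columns, inherited_columns))
```

With the notebook's dataframes, the call would be `columns_only_in_first(df.columns, df_inherited.columns)`.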