# **Data Collection Notebook**

## Objectives

* Fetch data from Kaggle and save as raw data

## Inputs

* Kaggle JSON file - authentication token to access and download Kaggle data 

## Outputs

* Generate Datasets: 
  * inputs/datasets/raw/house_prices_records.csv
  * inputs/datasets/raw/inherited_houses.csv

## Additional Comments

* The first dataset in the outputs above is the data used to build our machine learning model(s). The second file consists of the inherited houses whose prices our client wants to predict. 


---

# Change working directory

We need to change the working directory from the current folder to its parent folder.
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspaces/PP5-Housing-Prices/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspaces/PP5-Housing-Prices'
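
Re-running the cell that changes the directory moves up one more level each time, which is easy to do by accident in a notebook. A minimal sketch of a guard, assuming the notebooks folder is named `jupyter_notebooks` as in the path printed earlier (the helper name `parent_if_notebook_dir` is hypothetical):

```python
import os


def parent_if_notebook_dir(path, notebook_dir_name="jupyter_notebooks"):
    """Return the parent of `path` only if `path` is the notebooks folder.

    `notebook_dir_name` is an assumption based on the path shown in this notebook.
    """
    path = os.path.normpath(path)
    if os.path.basename(path) == notebook_dir_name:
        return os.path.dirname(path)
    # Already outside the notebooks folder: leave the path unchanged
    return path


# Compute the target once; calling the helper on its own result changes nothing
target_dir = parent_if_notebook_dir(os.getcwd())
```

Passing `target_dir` to `os.chdir()` would then move to the parent folder on the first run and do nothing on later runs.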

# Fetch raw data from Kaggle

First, we need to install the kaggle package to access the Kaggle API and fetch the raw data.

In [4]:
! pip install kaggle==1.5.12

Collecting kaggle==1.5.12
  Downloading kaggle-1.5.12.tar.gz (58 kB)
  Preparing metadata (setup.py) ... done
Collecting python-slugify (from kaggle==1.5.12)
  Downloading python_slugify-8.0.1-py2.py3-none-any.whl (9.7 kB)
Collecting text-unidecode>=1.3 (from python-slugify->kaggle==1.5.12)
  Downloading text_unidecode-1.3-py2.py3-none-any.whl (78 kB)
Building wheels for collected packages: kaggle
  Building wheel for kaggle (setup.py) ... done
  Created wheel for kaggle: filename=kaggle-1.5.12-py3-none-any.whl size=73026 sha256=5c77151af2e4cd953c39d62c238d92ce7219fbca10cd20c17f0638784344347f
  Stored in directory: /home/codeany/.cache/pip/wheels/29/da/11/144cc25aebdaeb4931b231e25fd34b394e6a5725cbb2f50106
Suc

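Make the Kaggle authentication token (kaggle.json) available for the session by pointing KAGGLE_CONFIG_DIR at the project root and restricting the file's permissions.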

In [5]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json

chmod: cannot access 'kaggle.json': No such file or directory
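
The `chmod` call fails here because `kaggle.json` is not in the project root. A minimal sketch, not part of the original notebook, that checks for the token before restricting its permissions (the helper name `secure_kaggle_token` is hypothetical):

```python
import os
import stat


def secure_kaggle_token(config_dir, filename="kaggle.json"):
    """Restrict the token file to owner read/write (the equivalent of chmod 600).

    Returns the token path, or raises FileNotFoundError with a hint if missing.
    """
    token_path = os.path.join(config_dir, filename)
    if not os.path.isfile(token_path):
        raise FileNotFoundError(
            f"{filename} not found in {config_dir}; "
            "copy the token there before running this cell"
        )
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)  # 0o600
    return token_path
```

Calling `secure_kaggle_token(os.getcwd())` after setting `KAGGLE_CONFIG_DIR` would surface the missing file with a clear Python error instead of a shell message.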


* Define path to the Kaggle dataset we want to download 
* Indicate the destination folder for the downloaded data 
* Download the data

In [6]:
KaggleDatasetPath = "codeinstitute/housing-prices-data"
DestinationFolder = "inputs/datasets/raw"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Traceback (most recent call last):
  File "/home/codeany/.local/bin/kaggle", line 5, in <module>
    from kaggle.cli import main
  File "/home/codeany/.local/lib/python3.8/site-packages/kaggle/__init__.py", line 23, in <module>
    api.authenticate()
  File "/home/codeany/.local/lib/python3.8/site-packages/kaggle/api/kaggle_api_extended.py", line 164, in authenticate
    raise IOError('Could not find {}. Make sure it\'s located in'
OSError: Could not find kaggle.json. Make sure it's located in /workspaces/PP5-Housing-Prices. Or use the environment method.
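
The traceback shows the client failing at import time because no credentials were found. The sketch below is a pre-flight check that could run before the download cell. It assumes the "environment method" in the message refers to the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables; the helper name `kaggle_credentials_available` is hypothetical:

```python
import os


def kaggle_credentials_available(config_dir, env=None):
    """Return True if either credential source appears to be set up.

    Checks for kaggle.json in `config_dir`, then for the KAGGLE_USERNAME and
    KAGGLE_KEY variables (assumed to be what the client reads).
    """
    env = os.environ if env is None else env
    has_file = os.path.isfile(os.path.join(config_dir, "kaggle.json"))
    has_env = bool(env.get("KAGGLE_USERNAME")) and bool(env.get("KAGGLE_KEY"))
    return has_file or has_env


if not kaggle_credentials_available(os.getcwd()):
    print("No Kaggle credentials found; the download step will fail.")
```

The check only confirms that credentials are present, not that they are valid.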


* Unzip the downloaded folder
* Remove the zipped folder
* Remove the kaggle JSON file

In [7]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json

unzip:  cannot find or open inputs/datasets/raw/*.zip, inputs/datasets/raw/*.zip.zip or inputs/datasets/raw/*.zip.ZIP.

No zipfiles found.
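
The shell command depends on `unzip` being installed and on the glob matching at least one archive. A rough Python equivalent of the unzip and cleanup steps using the standard library, offered as a sketch and not as the notebook's method (the helper name `extract_and_remove_zips` is hypothetical):

```python
import zipfile
from pathlib import Path


def extract_and_remove_zips(folder):
    """Extract every .zip archive in `folder` into it, then delete the archive.

    Returns the list of archive names that were processed.
    """
    processed = []
    for archive in sorted(Path(folder).glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(folder)
        archive.unlink()  # remove the zipped file once extracted
        processed.append(archive.name)
    return processed
```

`extract_and_remove_zips("inputs/datasets/raw")` returns an empty list when no archive is present, which makes a failed download easy to detect.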


# Load and Inspect Kaggle data

* Import the pandas library
* Load the dataset as a pandas DataFrame and assign it to df
* View the first five rows of the data in the df variable

In [None]:
import pandas as pd
df = pd.read_csv("inputs/datasets/raw/house_prices_records.csv")
df.head()

We use the DataFrame summary to see the number of columns, the column labels and data types, the memory usage, the range index, and the number of non-null values in each column.

* The DataFrame has one target variable and 23 features.
* The data consists of features that have int, float or object data types.
* Some features, such as EnclosedPorch and WoodDeckSF, are null in the great majority of rows.

In [None]:
df.info(max_cols=24)
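
To put numbers on the observation about missing values, the share of nulls per column can be computed directly. A minimal sketch on a small made-up DataFrame (the values and the `ExampleFeature` column are illustrative, not taken from the dataset; the helper name `null_share` is hypothetical):

```python
import pandas as pd


def null_share(frame):
    """Return the fraction of missing values per column, highest first."""
    return frame.isnull().mean().sort_values(ascending=False)


# Toy data for illustration only; two column names mirror features named in the notes
toy = pd.DataFrame({
    "EnclosedPorch": [None, None, None, 0.0],
    "WoodDeckSF": [None, 120.0, None, None],
    "ExampleFeature": [1, 2, 3, 4],
})
shares = null_share(toy)
```

On the real data, `null_share(df)` would list the columns with the most missing values first.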

# Push files to Repo

In this notebook, we have collected Kaggle data and inspected the columns in the dataset.
* From the quick inspection of the data, we can already see that the data should be cleaned before any analysis.

We save the dataset to outputs/datasets/collection so that it can be pushed to the repo.

In [None]:
import os

# Create the output folder; exist_ok avoids an error if it already exists
os.makedirs('outputs/datasets/collection', exist_ok=True)

# Save the collected data without the index column
df.to_csv("outputs/datasets/collection/HousePrices.csv", index=False)
