# **Data Collection Notebook**

## Objectives

* Fetch data from Kaggle and save it as raw data.
* Inspect the data and save it under outputs/datasets/collection.

## Inputs

* Kaggle JSON file (the personal authentication token).

## Outputs

* Generate Datasets: outputs/datasets/collection/house_prices.csv and outputs/datasets/collection/inherited_houses.csv

## Additional Comments

* This data comes from an open, public source and poses no ethical or privacy concerns.


---

# Change working directory

* Install Python packages in the notebook

In [1]:
%pip install -r /workspaces/data-analytics-housing-project/requirements.txt

Note: you may need to restart the kernel to use updated packages.


We need to change the working directory from its current folder to its parent folder.
* We access the current directory with os.getcwd().

In [2]:
import os
current_dir = os.getcwd()
current_dir

'/workspaces/data-analytics-housing-project/jupyter_notebooks'

We want to make the parent of the current directory the new current directory.
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [3]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [4]:
current_dir = os.getcwd()
current_dir

'/workspaces/data-analytics-housing-project'
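
Re-running the cell that calls os.chdir() moves the working directory up one more level each time. A minimal sketch of a guard against that, assuming the notebooks live in a folder named jupyter_notebooks (as in the output above); the helper name move_to_project_root is hypothetical and not part of the original notebook.

```python
import os


def move_to_project_root(notebook_folder="jupyter_notebooks"):
    """Change to the parent directory only while still inside the notebook folder."""
    current = os.getcwd()
    # Only step up when the current folder is the notebook folder,
    # so running this more than once has no further effect.
    if os.path.basename(current) == notebook_folder:
        os.chdir(os.path.dirname(current))
    return os.getcwd()
```

Calling move_to_project_root() a second time leaves the working directory unchanged, which makes the cell safe to re-run.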

## Fetch Kaggle Data

Install Kaggle package to fetch data.

In [5]:
%pip install kaggle==1.5.12

Note: you may need to restart the kernel to use updated packages.


* A personal authentication token (JSON file) is needed to authenticate Kaggle in order to download the data.

* Once you have your token, drag and drop the kaggle.json file into the project's root directory (the current working directory) and then run the following:

In [6]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json

chmod: cannot access 'kaggle.json': No such file or directory
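
The chmod error above appears when kaggle.json has not yet been placed in the working directory. A minimal sketch of the same permission step in pure Python, which checks for the file first; the helper name secure_kaggle_token is hypothetical, and the sketch assumes a POSIX file system.

```python
import os
import stat


def secure_kaggle_token(config_dir="."):
    """Restrict kaggle.json to owner read/write; return False if the file is missing."""
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        print(f"kaggle.json not found in {os.path.abspath(config_dir)}")
        return False
    # Owner read + owner write, the same permissions as `chmod 600`
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    return True
```

If the function returns False, the download cell that follows will fail for the same reason, so the token should be added before continuing.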


* Next, we need the dataset URL path: 'datasets/codeinstitute/housing-prices-data'

* Define the Kaggle dataset and destination folder. Then download your dataset.

In [7]:
KaggleDatasetPath = "codeinstitute/housing-prices-data"
DestinationFolder = "inputs/datasets/raw"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Traceback (most recent call last):
  File "/home/codeany/.local/bin/kaggle", line 5, in <module>
    from kaggle.cli import main
  File "/home/codeany/.local/lib/python3.8/site-packages/kaggle/__init__.py", line 23, in <module>
    api.authenticate()
  File "/home/codeany/.local/lib/python3.8/site-packages/kaggle/api/kaggle_api_extended.py", line 164, in authenticate
    raise IOError('Could not find {}. Make sure it\'s located in'
OSError: Could not find kaggle.json. Make sure it's located in /workspaces/data-analytics-housing-project. Or use the environment method.


* Unzip the downloaded file, delete the zip file and delete the kaggle.json file.

In [8]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json

unzip:  cannot find or open inputs/datasets/raw/*.zip, inputs/datasets/raw/*.zip.zip or inputs/datasets/raw/*.zip.ZIP.

No zipfiles found.
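
The unzip step can also be written with Python's standard library, which avoids a shell error when no archive is present. This is a minimal sketch, not the notebook's method: the helper name extract_and_remove_zips is hypothetical, and unlike the shell cell it does not delete kaggle.json.

```python
import glob
import os
import zipfile


def extract_and_remove_zips(folder):
    """Extract every .zip in `folder` into that folder, then delete the archive."""
    extracted = []
    for zip_path in sorted(glob.glob(os.path.join(folder, "*.zip"))):
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(folder)
            extracted.extend(archive.namelist())
        os.remove(zip_path)  # remove the archive only after a successful extract
    return extracted
```

The function returns the names of the extracted files, or an empty list when the folder holds no zip file, which makes a failed download easy to spot.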


---

## Load and Inspect Kaggle Data

* First, we will inspect the dataset of the houses in Ames, Iowa from Kaggle (not including our client's four inherited houses)

In [9]:
import pandas as pd
df = pd.read_csv("inputs/datasets/raw/house_prices_records.csv")
df.head(20)

Unnamed: 0,1stFlrSF,2ndFlrSF,BedroomAbvGr,BsmtExposure,BsmtFinSF1,BsmtFinType1,BsmtUnfSF,EnclosedPorch,GarageArea,GarageFinish,...,LotFrontage,MasVnrArea,OpenPorchSF,OverallCond,OverallQual,TotalBsmtSF,WoodDeckSF,YearBuilt,YearRemodAdd,SalePrice
0,856,854.0,3.0,No,706,GLQ,150,0.0,548,RFn,...,65.0,196.0,61,5,7,856,0.0,2003,2003,208500
1,1262,0.0,3.0,Gd,978,ALQ,284,,460,RFn,...,80.0,0.0,0,8,6,1262,,1976,1976,181500
2,920,866.0,3.0,Mn,486,GLQ,434,0.0,608,RFn,...,68.0,162.0,42,5,7,920,,2001,2002,223500
3,961,,,No,216,ALQ,540,,642,Unf,...,60.0,0.0,35,5,7,756,,1915,1970,140000
4,1145,,4.0,Av,655,GLQ,490,0.0,836,RFn,...,84.0,350.0,84,5,8,1145,,2000,2000,250000
5,796,566.0,1.0,No,732,GLQ,64,,480,Unf,...,85.0,0.0,30,5,5,796,,1993,1995,143000
6,1694,0.0,3.0,Av,1369,GLQ,317,,636,RFn,...,75.0,186.0,57,5,8,1686,,2004,2005,307000
7,1107,983.0,3.0,Mn,859,ALQ,216,,484,,...,,240.0,204,6,7,1107,,1973,1973,200000
8,1022,752.0,2.0,No,0,Unf,952,,468,Unf,...,51.0,0.0,0,5,7,952,,1931,1950,129900
9,1077,0.0,2.0,No,851,GLQ,140,,205,RFn,...,50.0,0.0,4,6,5,991,,1939,1950,118000


* Data Frame Summary

In [None]:
df.info()


* This is an index of what the different variables mean:

    <img src="../static/images/abbreviations1.png" alt="abbreviations for housing data" height="500" />

    <img src="../static/images/abbreviations2.png" alt="abbreviations for housing data" height="460" />

We want to check if there are any duplicates in this dataset.

In [None]:
has_duplicates = df.duplicated().any()

if has_duplicates:
    print("There are duplicates in the Data Frame.")
else:
    print("There are no duplicates in the Data Frame.")
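
If duplicates were found, it helps to know how many there are and how to remove them. A minimal sketch on toy data (the values below are illustrative and not taken from the housing dataset):

```python
import pandas as pd

# Toy data: the last row repeats the first one.
toy = pd.DataFrame({
    "1stFlrSF": [856, 1262, 856],
    "SalePrice": [208500, 181500, 208500],
})

# duplicated() marks rows that are identical to an earlier row
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(f"{n_duplicates} duplicate row(s); {len(deduplicated)} row(s) remain")
```

The same two calls can be applied to df; whether to drop duplicate rows there is a decision for the data cleaning stage.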


* The variable '2ndFlrSF' contains both 0 and NaN values. Many rows with NaN for this variable have valid figures for '1stFlrSF', which suggests that NaN means the property has no second storey.

* To make this variable easier to compare, NaN is converted to 0, which is consistent with its float64 type. From this point forward, '2ndFlrSF = 0' is taken to mean that the property does not have a second storey (but may have a basement).

In [None]:
# Treat a missing second-floor area as 0 (no second storey)
df['2ndFlrSF'] = df['2ndFlrSF'].fillna(0)

* We inspect the Data Frame again to confirm that the change has been applied.

In [None]:
df.head(20)
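
Looking at the first rows only confirms the change for those rows. A minimal sketch of a check over the whole column, mirroring the conversion above on a toy column (the values are illustrative):

```python
import numpy as np
import pandas as pd

# Toy column: one measured area, one zero, one missing value.
toy = pd.DataFrame({"2ndFlrSF": [854.0, 0.0, np.nan]})

missing_before = int(toy["2ndFlrSF"].isna().sum())
toy["2ndFlrSF"] = toy["2ndFlrSF"].fillna(0)   # NaN becomes 0
missing_after = int(toy["2ndFlrSF"].isna().sum())

print(f"missing before: {missing_before}, missing after: {missing_after}")
```

Running df['2ndFlrSF'].isna().sum() on the real Data Frame should likewise return 0 once the conversion has been applied.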

* Secondly, we inspect the dataset for our client's inherited houses.

In [None]:
import pandas as pd
df_inherited = pd.read_csv("inputs/datasets/raw/inherited_houses.csv")
df_inherited.head()

* We inspect 'df_inherited' and note that its columns and types are the same as those of 'df', except that it does not contain a 'SalePrice' column.

In [None]:
df_inherited.info()
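
The comparison of columns and types can also be done in code rather than by reading two info() listings side by side. A minimal sketch on two toy frames (column subset and values are illustrative; the variable names are hypothetical):

```python
import pandas as pd

# Toy frames: the second one lacks the target column.
toy_prices = pd.DataFrame({"1stFlrSF": [856], "YearBuilt": [2003], "SalePrice": [208500]})
toy_inherited = pd.DataFrame({"1stFlrSF": [896], "YearBuilt": [1961]})

# Columns present in the first frame but absent from the second
missing_in_inherited = sorted(set(toy_prices.columns) - set(toy_inherited.columns))

# Shared columns whose dtypes differ between the two frames
shared = [c for c in toy_prices.columns if c in toy_inherited.columns]
dtype_mismatches = [c for c in shared if toy_prices[c].dtype != toy_inherited[c].dtype]

print("Missing columns:", missing_in_inherited)
print("Dtype mismatches:", dtype_mismatches)
```

Substituting df and df_inherited for the toy frames gives the same report for the real data.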

---

# Push files to Repo

* We will save both Data Frames as CSV files, without their indexes.

In [None]:
import os
# Create outputs/datasets/collection if it does not exist yet
os.makedirs('outputs/datasets/collection', exist_ok=True)

df.to_csv("outputs/datasets/collection/house_prices.csv", index=False)
df_inherited.to_csv("outputs/datasets/collection/inherited_houses.csv", index=False)
