# **Data Collection Notebook**

## Objectives

* Fetch data from Kaggle and save as raw data
* Inspect the data and save under outputs/datasets/collection

## Inputs

* Kaggle JSON file - authentication token

## Outputs

* Generated datasets:
  * outputs/datasets/collection/house_prices_records.csv
  * outputs/datasets/collection/inherited_houses.csv


---

# Change working directory

* The notebooks are assumed to be stored in a subfolder, so when running a notebook in the editor you need to change the working directory

We need to change the working directory from the notebook's current folder to its parent folder
* os.getcwd() returns the current directory

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
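
The directory change above can also be expressed with pathlib. The following is a minimal sketch, not the notebook's own method: it runs inside a temporary directory so it does not touch your real working directory, and the folder name `jupyter_notebooks` is an assumption used only for illustration.

```python
import os
import tempfile
from pathlib import Path

original_dir = Path.cwd()

with tempfile.TemporaryDirectory() as tmp:
    # Throwaway layout: <tmp>/jupyter_notebooks (hypothetical folder name)
    project_root = Path(tmp).resolve()
    notebooks_dir = project_root / "jupyter_notebooks"
    notebooks_dir.mkdir()

    os.chdir(notebooks_dir)        # pretend the notebook starts here
    os.chdir(Path.cwd().parent)    # move up to the parent folder
    new_dir = Path.cwd()

    os.chdir(original_dir)         # restore before the temp dir is removed

print(new_dir == project_root)
```

Note that each run of a "move to parent" cell moves one level further up, so running it twice leaves you above the project root.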

# Fetch data from Kaggle

Install Kaggle package to fetch data

In [None]:
! pip install kaggle

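Point the Kaggle client at the authentication token (kaggle.json) in the current directory and restrict the file's permissions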

In [None]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json

Define the Kaggle dataset and the destination folder, then download the dataset.

In [None]:
KaggleDatasetPath = "codeinstitute/housing-prices-data"
DestinationFolder = "inputs/datasets/raw"
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Unzip the downloaded file, then delete the zip file and the kaggle.json file.

In [None]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json

---

# Load and Inspect Data

Load and inspect the house_prices_records.csv file

In [None]:
import pandas as pd
df1 = pd.read_csv("inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv")
df1.head()

Check basic information about the house_prices_records dataset

In [None]:
df1.info()

Show summary statistics of the numerical columns of the house_prices_records dataset

In [None]:
df1.describe()

Check for missing values in each column of the house_prices_records dataset

In [None]:
df1.isnull().sum()
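
On a wide dataset the full `isnull().sum()` output is mostly zeros. The sketch below keeps only the columns that have gaps and adds a percentage. It uses a small made-up frame in place of the real data, so the values shown are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset; the values are made up
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, np.nan],
    "GarageFinish": ["RFn", None, "Unf", "Fin"],
    "SalePrice": [208500, 181500, 223500, 140000],
})

missing = toy.isnull().sum()
# Keep only columns with at least one missing value, largest first
missing = missing[missing > 0].sort_values(ascending=False)
missing_report = pd.DataFrame({
    "count": missing,
    "percent": (missing / len(toy) * 100).round(1),
})
print(missing_report)
```

Replacing `toy` with `df1` would give the same report for the house_prices_records dataset.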

There are missing values. Most of them can be filled with zero for numeric features and 'None' for categorical features; GarageYrBlt and LotFrontage are special cases and get their own fill values.

In [None]:
numeric_fill_zero = ['2ndFlrSF', 'BedroomAbvGr', 'EnclosedPorch', 'WoodDeckSF']
df1[numeric_fill_zero] = df1[numeric_fill_zero].fillna(0)

categorical_fill_none = ['BsmtExposure', 'BsmtFinType1', 'GarageFinish']
df1[categorical_fill_none] = df1[categorical_fill_none].fillna('None')

# Use numeric fill values so these columns keep a numeric dtype
df1['GarageYrBlt'] = df1['GarageYrBlt'].fillna(1900)

df1['LotFrontage'] = df1['LotFrontage'].fillna(21)
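
The per-column fills above can also be collected into a single mapping and applied in one call, since `DataFrame.fillna` accepts a dict of column name to fill value. This is a minimal sketch on a made-up frame with a subset of the columns. It assumes the numeric columns should receive numeric fill values, so 1900 and 21 are written as numbers.

```python
import numpy as np
import pandas as pd

# Toy frame with a subset of the columns; the values are made up
toy = pd.DataFrame({
    "2ndFlrSF": [854.0, np.nan],
    "GarageFinish": ["RFn", None],
    "GarageYrBlt": [2003.0, np.nan],
    "LotFrontage": [np.nan, 80.0],
})

# One fill value per column, mirroring the strategy described above
fill_values = {
    "2ndFlrSF": 0,
    "GarageFinish": "None",
    "GarageYrBlt": 1900,
    "LotFrontage": 21,
}
toy_filled = toy.fillna(value=fill_values)
print(toy_filled.dtypes)
```

Keeping the fill values in one dict makes the strategy easier to review and to reuse on another frame.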

Check that no missing values remain.

In [None]:
df1.isnull().sum()

Load and inspect the inherited_houses.csv file.

In [None]:
df2 = pd.read_csv("inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/inherited_houses.csv")
df2.head()

Check basic information about the inherited_houses dataset

In [None]:
df2.info()

Show summary statistics of the numerical columns of the inherited_houses dataset

In [None]:
df2.describe()

Check for missing values in each column of the inherited_houses dataset.

In [None]:
df2.isnull().sum()

All data has been collected and is ready for analysis.

---

# Push files to Repo

In [None]:
import os
# Create the output folder; exist_ok=True avoids an error if it already exists
os.makedirs('outputs/datasets/collection', exist_ok=True)

df1.to_csv("outputs/datasets/collection/house_prices_records.csv", index=False)
df2.to_csv("outputs/datasets/collection/inherited_houses.csv", index=False)
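
To check that a saved file can be read back as expected, a quick round trip helps. The sketch below writes a made-up frame into a temporary folder that mirrors the output path and reloads it. The file name `toy.csv` and the data are illustrative assumptions, not part of the project.

```python
import os
import tempfile
import pandas as pd

# Made-up data standing in for the collected dataset
toy = pd.DataFrame({"SalePrice": [208500, 181500], "GarageFinish": ["RFn", "Unf"]})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = os.path.join(tmp, "outputs", "datasets", "collection")
    os.makedirs(out_dir, exist_ok=True)   # creates the nested folders
    os.makedirs(out_dir, exist_ok=True)   # safe to call again
    out_path = os.path.join(out_dir, "toy.csv")
    toy.to_csv(out_path, index=False)     # index=False avoids an extra column
    reloaded = pd.read_csv(out_path)

print(reloaded.shape == toy.shape)
```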