# **Data Collection**

## Objectives

* Fetch data from Kaggle and save as raw data.

## Inputs

* Kaggle JSON file - authentication token to access and download Kaggle data.

## Outputs

* Generate Datasets:
  * inputs/datasets/raw/house_prices_records.csv
  * inputs/datasets/raw/inherited_houses.csv

## Additional Comments

* The first dataset in the outputs above is the data used to build our machine learning model(s). The second file consists of the inherited houses whose prices our client wants to predict.


---

# Change working directory

* We assume the notebooks are stored in a subfolder, so when you run a notebook in the editor you need to change the working directory.

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with os.getcwd().

In [14]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/heritage-housing2'

We want to make the parent of the current directory the new current directory.
* os.path.dirname() gets the parent directory.
* os.chdir() sets the new current directory.

In [15]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [16]:
current_dir = os.getcwd()
current_dir

'/workspace'
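
The steps above can also be wrapped in a small helper. This is a minimal sketch, not part of the original notebook: the function name `change_to_parent` and the subfolder name `jupyter_notebooks` are assumptions made for illustration. The demonstration runs inside a temporary directory and restores the original working directory afterwards.

```python
import os
import tempfile


def change_to_parent(path=None):
    """Make the parent of `path` (default: current directory) the working directory."""
    start = os.path.abspath(path or os.getcwd())
    parent = os.path.dirname(start)
    os.chdir(parent)
    return parent


# Demonstration in a throwaway directory so the real working directory is untouched.
original = os.getcwd()
with tempfile.TemporaryDirectory() as root:
    notebooks = os.path.join(root, "jupyter_notebooks")  # hypothetical subfolder name
    os.makedirs(notebooks)
    os.chdir(notebooks)
    new_dir = change_to_parent()
    # realpath guards against temp folders that are reached through a symlink
    moved_up = os.path.realpath(new_dir) == os.path.realpath(root)
    os.chdir(original)  # leave the temp folder before it is deleted

print(moved_up)
```

Each call moves up exactly one level, so running such a cell twice moves two levels up. Restarting the kernel typically resets the working directory to the notebook's own folder.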

---

# Fetch raw data from Kaggle

First, we install the kaggle package so that we can access the Kaggle API and fetch the raw data.

In [5]:
! pip install kaggle==1.5.12



Make the kaggle authentication token available for the session.

In [2]:
import os

# Set the path to the directory containing kaggle.json
kaggle_config_dir = '/workspace/heritage-housing2'

# Set the environment variable
os.environ['KAGGLE_CONFIG_DIR'] = kaggle_config_dir

# Change the file permissions of kaggle.json
!chmod 600 {kaggle_config_dir}/kaggle.json
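
The same setup can be done in pure Python, which also lets us fail early when the token file is missing. This is a sketch under assumptions: the helper name `configure_kaggle_token` is hypothetical, and the demonstration uses an empty placeholder `kaggle.json` in a temporary folder rather than a real token.

```python
import os
import stat
import tempfile


def configure_kaggle_token(config_dir):
    """Point the Kaggle client at config_dir and restrict kaggle.json to its owner."""
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        raise FileNotFoundError(f"No kaggle.json found in {config_dir}")
    # Equivalent of `chmod 600`: read and write permission for the owner only
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    os.environ["KAGGLE_CONFIG_DIR"] = config_dir
    return token_path


# Demonstration with a placeholder file (not a real token)
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, "kaggle.json"), "w") as f:
        f.write("{}")
    token_path = configure_kaggle_token(demo_dir)
    token_name = os.path.basename(token_path)
    env_matches = os.environ["KAGGLE_CONFIG_DIR"] == demo_dir
```

In this notebook the call would be `configure_kaggle_token('/workspace/heritage-housing2')`.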

- Define path to the Kaggle dataset we want to download.
- Establish the destination folder for the downloaded data.
- Download the data.

In [6]:
KaggleDatasetPath = "codeinstitute/housing-prices-data"
DestinationFolder = "inputs/datasets/raw"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Downloading housing-prices-data.zip to inputs/datasets/raw
  0%|                                               | 0.00/49.6k [00:00<?, ?B/s]
100%|██████████████████████████████████████| 49.6k/49.6k [00:00<00:00, 2.70MB/s]


- Unzip the downloaded folder.
- Remove the zipped folder.
- Remove the kaggle JSON file.

In [7]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm /workspace/heritage-housing2/kaggle.json

Archive:  inputs/datasets/raw/housing-prices-data.zip
replace inputs/datasets/raw/house-metadata.txt? [y]es, [n]o, [A]ll, [N]one, [r]ename: ^C
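
The `replace ...?` prompt in the output above most likely appears because the files already existed from an earlier run, and an interactive prompt cannot be answered from a notebook cell. One way to avoid it is to extract with Python's `zipfile` module, which overwrites existing files without asking. The following is a sketch, not the notebook's original method: the helper name `extract_and_remove_zips` is an assumption, and the demonstration uses a small archive created in a temporary folder.

```python
import glob
import os
import tempfile
import zipfile


def extract_and_remove_zips(folder):
    """Extract every .zip in folder into folder, then delete the archives."""
    extracted = []
    for zip_path in sorted(glob.glob(os.path.join(folder, "*.zip"))):
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(folder)  # overwrites existing files without asking
            extracted.extend(archive.namelist())
        os.remove(zip_path)
    return extracted


with tempfile.TemporaryDirectory() as folder:
    # An older copy of the file, standing in for a previous run
    with open(os.path.join(folder, "example.txt"), "w") as f:
        f.write("old")
    with zipfile.ZipFile(os.path.join(folder, "demo.zip"), "w") as archive:
        archive.writestr("example.txt", "new")

    names = extract_and_remove_zips(folder)

    with open(os.path.join(folder, "example.txt")) as f:
        content = f.read()
    remaining_zips = glob.glob(os.path.join(folder, "*.zip"))
```

In the notebook, the call would be `extract_and_remove_zips(DestinationFolder)`; removing `kaggle.json` would remain a separate step.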


---

# Load and Inspect Kaggle data

- Install pandas.

In [6]:
%pip install pandas

Note: you may need to restart the kernel to use updated packages.


- Import the pandas library.
- Load the dataset as a pandas DataFrame and assign it to df.
- View the first five rows of the data in the df variable.

In [3]:
import pandas as pd
df = pd.read_csv("inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv")
df.head()

Unnamed: 0,1stFlrSF,2ndFlrSF,BedroomAbvGr,BsmtExposure,BsmtFinSF1,BsmtFinType1,BsmtUnfSF,EnclosedPorch,GarageArea,GarageFinish,...,LotFrontage,MasVnrArea,OpenPorchSF,OverallCond,OverallQual,TotalBsmtSF,WoodDeckSF,YearBuilt,YearRemodAdd,SalePrice
0,856,854.0,3.0,No,706,GLQ,150,0.0,548,RFn,...,65.0,196.0,61,5,7,856,0.0,2003,2003,208500
1,1262,0.0,3.0,Gd,978,ALQ,284,,460,RFn,...,80.0,0.0,0,8,6,1262,,1976,1976,181500
2,920,866.0,3.0,Mn,486,GLQ,434,0.0,608,RFn,...,68.0,162.0,42,5,7,920,,2001,2002,223500
3,961,,,No,216,ALQ,540,,642,Unf,...,60.0,0.0,35,5,7,756,,1915,1970,140000
4,1145,,4.0,Av,655,GLQ,490,0.0,836,RFn,...,84.0,350.0,84,5,8,1145,,2000,2000,250000
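
The path passed to `pd.read_csv` includes a timestamped export folder, so it may differ between downloads. As a hedged alternative, the file can be located by name. In this sketch the helper `find_file` and the folder `some-export-folder` are hypothetical, and the demonstration creates an empty file in a temporary directory instead of using the real dataset.

```python
import os
import tempfile


def find_file(root, filename):
    """Return the first path under root whose file name equals filename, or None."""
    for folder, _subfolders, files in os.walk(root):
        if filename in files:
            return os.path.join(folder, filename)
    return None


with tempfile.TemporaryDirectory() as root:
    nested = os.path.join(root, "some-export-folder", "house-price")  # hypothetical layout
    os.makedirs(nested)
    open(os.path.join(nested, "house_prices_records.csv"), "w").close()

    found = find_file(root, "house_prices_records.csv")
    relative = os.path.relpath(found, root)
    missing = find_file(root, "not_there.csv")

print(relative)
```

In the notebook this could be used as `pd.read_csv(find_file("inputs/datasets/raw", "house_prices_records.csv"))`, after checking that the result is not `None`.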


The DataFrame summary shows the range index, the number of columns, the column labels and data types, the non-null count for each column, and the memory usage.

- The DataFrame has one target variable and 23 features.
- The features have int, float or object data types.
- Some features, such as EnclosedPorch (136 non-null values out of 1460 entries) and WoodDeckSF, are null in the great majority of rows.

In [4]:
df.info(max_cols=24)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 24 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   1stFlrSF       1460 non-null   int64  
 1   2ndFlrSF       1374 non-null   float64
 2   BedroomAbvGr   1361 non-null   float64
 3   BsmtExposure   1422 non-null   object 
 4   BsmtFinSF1     1460 non-null   int64  
 5   BsmtFinType1   1315 non-null   object 
 6   BsmtUnfSF      1460 non-null   int64  
 7   EnclosedPorch  136 non-null    float64
 8   GarageArea     1460 non-null   int64  
 9   GarageFinish   1225 non-null   object 
 10  GarageYrBlt    1379 non-null   float64
 11  GrLivArea      1460 non-null   int64  
 12  KitchenQual    1460 non-null   object 
 13  LotArea        1460 non-null   int64  
 14  LotFrontage    1201 non-null   float64
 15  MasVnrArea     1452 non-null   float64
 16  OpenPorchSF    1460 non-null   int64  
 17  OverallCond    1460 non-null   int64  
 18  OverallQ
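
To rank columns by how much data is missing, the non-null counts can be turned into a missing-value share. The sketch below is illustrative only: it uses three columns with the values copied from the five rows displayed by `df.head()`, so the resulting shares do not represent the full dataset.

```python
import pandas as pd

nan = float("nan")

# Values copied from the five rows shown by df.head() above
sample = pd.DataFrame({
    "1stFlrSF": [856, 1262, 920, 961, 1145],
    "2ndFlrSF": [854.0, 0.0, 866.0, nan, nan],
    "EnclosedPorch": [0.0, nan, 0.0, nan, 0.0],
})

# Fraction of missing values per column, largest first
missing_share = sample.isna().mean().sort_values(ascending=False)
print(missing_share)
```

On the full data, the same expression would be `df.isna().mean().sort_values(ascending=False)`.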

---

# Push files to Repo

In this notebook, we have collected the Kaggle data and inspected the columns in the dataset.

- From this quick inspection, we can already see that the data needs to be cleaned before any analysis.
- We save the dataset to outputs/datasets/collection so that it can be pushed to the repo.

In [5]:
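import os

# Create the output folder; exist_ok=True avoids an error if it already exists
os.makedirs(name='outputs/datasets/collection', exist_ok=True)

# Save the collected data without the DataFrame index
df.to_csv("outputs/datasets/collection/HousePrices.csv", index=False)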