# **Data Collection**

 
 
## Objectives

* Fetch data from Kaggle and save it as raw data.
* Load and inspect the data and save it under outputs/datasets/collection.
* Push files to the GitHub repository.

## Inputs

* Kaggle JSON file (authentication token)
* The recommended house price dataset, downloaded from [Kaggle](https://www.kaggle.com/datasets/codeinstitute/housing-prices-data)

## Outputs

* The Kaggle archive was unzipped to reveal the following input files:
* inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv
* inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/inherited_houses.csv
* Generate Dataset: input/datasets/house-price/house_prices_records and input/datasets/house-price/inherited_houses

## Additional Comments

* This notebook covers the Data Understanding phase of the CRISP-DM methodology.
* Walkthrough Project 2 provided the guidelines followed in this notebook for data collection.

---

# Change working directory

Here we will change the working directory from its current folder to its parent folder.
* We access the current directory with the command os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/Heritage-Housing-Issues-PP5/jupyter_notebooks'

We want to make the parent of the current directory the new current directory.

* os.path.dirname() gets the parent directory

* os.chdir() defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Here we will confirm the new current directory.

In [3]:
current_dir = os.getcwd()
current_dir

'/workspace/Heritage-Housing-Issues-PP5'
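
The same step can also be sketched with `pathlib`. This is only an illustrative alternative to the cells above; the helper name `move_to_parent` is hypothetical and not part of the original notebook.

```python
import os
from pathlib import Path


def move_to_parent() -> Path:
    """Make the parent of the current working directory the new working directory."""
    parent = Path.cwd().parent  # same folder that os.path.dirname(os.getcwd()) returns
    os.chdir(parent)
    return Path.cwd()
```

Each call moves up one more level, so the cell should be run only once per kernel session.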

# Fetch data from Kaggle

* The Kaggle library must be installed first.


In [4]:
 ! pip3 install kaggle==1.5.12

Collecting kaggle==1.5.12
  Downloading kaggle-1.5.12.tar.gz (58 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 59.0/59.0 kB 2.5 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: kaggle
  Building wheel for kaggle (pyproject.toml) ... done
  Created wheel for kaggle: filename=kaggle-1.5.12-py3-none-any.whl size=73027 sha256=6d9315d7ce3b6024e5654956a5014bb124eb777d578ca2b56a690a94e8d9cac2
  Stored in directory: /workspace/.pyenv_mirror/pip_cache/wheels/2e/27/39/f44e52756a6407b444143f233abe9fda0e18a23e8b20e0cd1c
Successfully built kaggle
Installing collected packages: kaggle
Successfully installed kaggle-1.5.12

[notice] A new release of pip is available: 23.1.2 -> 23.2.1

* You are required to create a Kaggle account and download a kaggle.json file for the data download to work.

* The kaggle.json file holds the authentication token required to download data from Kaggle. You can request one from your Kaggle account settings, in the API section (Create New Token).

* Copy the downloaded kaggle.json file into the root directory of the project repository.

* Next, we set the Kaggle environment variable and the permissions on the token file.

---

# Using your key

* Once you have downloaded your API key from the Kaggle website, place the file in the root directory of your project. Dragging and dropping it there is enough.

* Then run the cell below to assign the correct permissions to the file.

In [6]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
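
For environments where the `chmod` shell command is not available, the same setup can be sketched in pure Python. The helper name `configure_kaggle_token` is hypothetical; it assumes the token sits in the folder passed in, as the cell above assumes for the current working directory.

```python
import os
import stat
from pathlib import Path


def configure_kaggle_token(project_root: str) -> Path:
    """Point the Kaggle client at project_root and restrict kaggle.json to its owner."""
    token = Path(project_root) / "kaggle.json"
    if not token.is_file():
        raise FileNotFoundError(f"Expected the API token at {token}")
    os.environ["KAGGLE_CONFIG_DIR"] = str(Path(project_root))
    # Same effect as `chmod 600 kaggle.json`: read/write for the owner only.
    os.chmod(token, stat.S_IRUSR | stat.S_IWUSR)
    return token
```

Calling `configure_kaggle_token(os.getcwd())` would mirror the cell above, with an explicit error if the token file is missing.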

# Retrieving the dataset path from the Kaggle URL

* We are using the Kaggle dataset called House Prices.

* Copy the part of the dataset URL on Kaggle that comes after "https://www.kaggle.com/datasets/", here codeinstitute/housing-prices-data.

* We then define the dataset path and its destination folder for the download.

In [7]:
KaggleDatasetPath = "codeinstitute/housing-prices-data"
DestinationFolder = "inputs/datasets/raw"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Downloading housing-prices-data.zip to inputs/datasets/raw
  0%|                                               | 0.00/49.6k [00:00<?, ?B/s]
100%|██████████████████████████████████████| 49.6k/49.6k [00:00<00:00, 1.76MB/s]
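
Instead of copying the dataset path by hand, it could be derived from the full dataset URL. The sketch below uses only the standard library; the helper name `kaggle_dataset_path` is hypothetical, and it assumes the URL has the form `https://www.kaggle.com/datasets/<owner>/<dataset>`.

```python
from urllib.parse import urlparse


def kaggle_dataset_path(url: str) -> str:
    """Return the '<owner>/<dataset>' part of a Kaggle dataset URL."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if parts and parts[0] == "datasets":
        parts = parts[1:]  # drop the leading 'datasets' segment
    if len(parts) < 2:
        raise ValueError(f"Not a Kaggle dataset URL: {url}")
    return "/".join(parts[:2])


path = kaggle_dataset_path(
    "https://www.kaggle.com/datasets/codeinstitute/housing-prices-data"
)
print(path)  # codeinstitute/housing-prices-data
```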


* We now have to unzip the downloaded file, delete the zip file and delete the kaggle.json file.

In [8]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json 

Archive:  inputs/datasets/raw/housing-prices-data.zip
  inflating: inputs/datasets/raw/house-metadata.txt  
  inflating: inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv  
  inflating: inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/inherited_houses.csv  
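
The unzip-and-clean-up step can also be sketched with Python's `zipfile` module, which avoids depending on the `unzip` and `rm` shell commands. The helper name `unzip_and_remove` is hypothetical, and unlike the cell above it leaves kaggle.json untouched.

```python
import zipfile
from pathlib import Path
from typing import List


def unzip_and_remove(folder: str) -> List[str]:
    """Extract every .zip in folder into folder, delete the archives, return the extracted names."""
    extracted: List[str] = []
    for archive in sorted(Path(folder).glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(folder)
            extracted.extend(zf.namelist())
        archive.unlink()  # delete the archive only after extraction succeeded
    return extracted
```

With the variables defined earlier, `unzip_and_remove(DestinationFolder)` would extract the downloaded archive in place.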


In [9]:
! pip3 uninstall -y kaggle==1.5.12

Found existing installation: kaggle 1.5.12
Uninstalling kaggle-1.5.12:
  Successfully uninstalled kaggle-1.5.12


# Import packages & set environment variables

In [10]:
import pandas as pd
pd.options.display.max_columns = None
pd.options.display.max_rows = None

# Load and Inspect the House Price Records

* We can read the house_prices_records dataset csv file into a Pandas dataframe.

In [11]:
df = pd.read_csv("inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv")
print(df.shape)
df.head()

(1460, 24)


Unnamed: 0,1stFlrSF,2ndFlrSF,BedroomAbvGr,BsmtExposure,BsmtFinSF1,BsmtFinType1,BsmtUnfSF,EnclosedPorch,GarageArea,GarageFinish,GarageYrBlt,GrLivArea,KitchenQual,LotArea,LotFrontage,MasVnrArea,OpenPorchSF,OverallCond,OverallQual,TotalBsmtSF,WoodDeckSF,YearBuilt,YearRemodAdd,SalePrice
0,856,854.0,3.0,No,706,GLQ,150,0.0,548,RFn,2003.0,1710,Gd,8450,65.0,196.0,61,5,7,856,0.0,2003,2003,208500
1,1262,0.0,3.0,Gd,978,ALQ,284,,460,RFn,1976.0,1262,TA,9600,80.0,0.0,0,8,6,1262,,1976,1976,181500
2,920,866.0,3.0,Mn,486,GLQ,434,0.0,608,RFn,2001.0,1786,Gd,11250,68.0,162.0,42,5,7,920,,2001,2002,223500
3,961,,,No,216,ALQ,540,,642,Unf,1998.0,1717,Gd,9550,60.0,0.0,35,5,7,756,,1915,1970,140000
4,1145,,4.0,Av,655,GLQ,490,0.0,836,RFn,2000.0,2198,Gd,14260,84.0,350.0,84,5,8,1145,,2000,2000,250000


# Dataframe Summary
* We can read the dataframe summary by calling the .info() method on the dataframe object.

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 24 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   1stFlrSF       1460 non-null   int64  
 1   2ndFlrSF       1374 non-null   float64
 2   BedroomAbvGr   1361 non-null   float64
 3   BsmtExposure   1422 non-null   object 
 4   BsmtFinSF1     1460 non-null   int64  
 5   BsmtFinType1   1315 non-null   object 
 6   BsmtUnfSF      1460 non-null   int64  
 7   EnclosedPorch  136 non-null    float64
 8   GarageArea     1460 non-null   int64  
 9   GarageFinish   1225 non-null   object 
 10  GarageYrBlt    1379 non-null   float64
 11  GrLivArea      1460 non-null   int64  
 12  KitchenQual    1460 non-null   object 
 13  LotArea        1460 non-null   int64  
 14  LotFrontage    1201 non-null   float64
 15  MasVnrArea     1452 non-null   float64
 16  OpenPorchSF    1460 non-null   int64  
 17  OverallCond    1460 non-null   int64  
 18  OverallQ

* The summary above shows no ID field to establish record uniqueness, so we do not check for duplicate data.
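
Because `.info()` only prints its result, it can be handy to collect the same per-column information in a dataframe of its own. The sketch below is one possible way to do that; the helper name `summarise` is hypothetical, and the small `sample` frame reuses a few values from the first rows shown earlier rather than the full dataset.

```python
import pandas as pd


def summarise(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its dtype, non-null count and missing count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "non_null": df.notna().sum(),
        "missing": df.isna().sum(),
    })


# Tiny illustrative frame, not the real dataset.
sample = pd.DataFrame({
    "1stFlrSF": [856, 1262, 920],
    "EnclosedPorch": [0.0, None, 0.0],
    "KitchenQual": ["Gd", "TA", "Gd"],
})
summary = summarise(sample)
print(summary)
```

If a duplicate check is wanted later, `DataFrame.duplicated()` compares whole rows and does not need an ID field.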

# Load and Inspect Inherited House Records

* Read the inherited_houses dataset csv file into a Pandas dataframe.

In [14]:
df_inherited = pd.read_csv("inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/inherited_houses.csv")
print(df_inherited.shape)
df_inherited

(4, 23)


Unnamed: 0,1stFlrSF,2ndFlrSF,BedroomAbvGr,BsmtExposure,BsmtFinSF1,BsmtFinType1,BsmtUnfSF,EnclosedPorch,GarageArea,GarageFinish,GarageYrBlt,GrLivArea,KitchenQual,LotArea,LotFrontage,MasVnrArea,OpenPorchSF,OverallCond,OverallQual,TotalBsmtSF,WoodDeckSF,YearBuilt,YearRemodAdd
0,896,0,2,No,468.0,Rec,270.0,0,730.0,Unf,1961.0,896,TA,11622,80.0,0.0,0,6,5,882.0,140,1961,1961
1,1329,0,3,No,923.0,ALQ,406.0,0,312.0,Unf,1958.0,1329,Gd,14267,81.0,108.0,36,6,6,1329.0,393,1958,1958
2,928,701,3,No,791.0,GLQ,137.0,0,482.0,Fin,1997.0,1629,TA,13830,74.0,0.0,34,5,5,928.0,212,1997,1998
3,926,678,3,No,602.0,GLQ,324.0,0,470.0,Fin,1998.0,1604,Gd,9978,78.0,20.0,36,6,6,926.0,360,1998,1998
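
The two shapes, (1460, 24) and (4, 23), suggest that one column is missing from the inherited houses data. A minimal sketch of how the column sets could be compared is given below; the helper name `column_difference` is hypothetical, and the two small frames are reduced stand-ins for `df` and `df_inherited`.

```python
import pandas as pd


def column_difference(left: pd.DataFrame, right: pd.DataFrame) -> dict:
    """List the columns found in only one of the two dataframes."""
    return {
        "only_in_left": sorted(set(left.columns) - set(right.columns)),
        "only_in_right": sorted(set(right.columns) - set(left.columns)),
    }


# Reduced stand-ins using values from the first row of each dataset.
records = pd.DataFrame({"1stFlrSF": [856], "LotArea": [8450], "SalePrice": [208500]})
inherited = pd.DataFrame({"1stFlrSF": [896], "LotArea": [11622]})
diff = column_difference(records, inherited)
print(diff)
```

In the notebook itself, `column_difference(df, df_inherited)` would run the same comparison on the full datasets.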


---

# Push files to Repo

* Per the objectives, the inspected datasets are to be saved under outputs/datasets/collection before the files are pushed to the GitHub repository.
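
The objectives list saving the inspected data under outputs/datasets/collection. A minimal sketch of that step is shown below; the helper name `save_dataset` and any output file names are assumptions, not taken from the original notebook.

```python
import os

import pandas as pd


def save_dataset(df: pd.DataFrame, folder: str, filename: str) -> str:
    """Write df to folder/filename as CSV, creating the folder if it does not exist."""
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, filename)
    df.to_csv(path, index=False)  # the row index is not written to the file
    return path
```

For example, `save_dataset(df, "outputs/datasets/collection", "house_prices_records.csv")` would write the house price records, and a second call could do the same for `df_inherited`.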

# Conclusions and Next Steps

* The house price dataset (1460 rows, 24 columns) and the inherited houses dataset (4 rows, 23 columns) differ in size and in columns.

* Some features in the house price dataset are stored as integers, others as floats, and the categorical features as objects.

* However, these type differences should not affect the analysis of the data.

* SalePrice is found only in the house price dataset, in integer form.

* We can now proceed to cleaning the data.