# **Data Collection Notebook**

## Objectives

* Fetch data from Kaggle and save as raw data
* Inspect the data and save under outputs/datasets/collection

## Inputs

* Kaggle JSON file: authentication token

## Outputs

* Generate Dataset: outputs/datasets/collection/HeritageHousing.csv

## Additional Comments

* For this project, we are fetching the data from Kaggle.

---

# Change working directory

We need to change the working directory from the notebook's folder to its parent folder, the project root.
* We access the current directory with os.getcwd()

In [2]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/Heritage-Housing-Issues-P5/jupyter_notebooks'

We want to make the parent of the current directory the new current directory.
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [3]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [4]:
current_dir = os.getcwd()
current_dir

'/workspace/Heritage-Housing-Issues-P5'
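The cell that changes directory moves up one level every time it runs, so re-executing it would leave the session outside the project. Below is a minimal sketch of a guarded version, assuming the notebooks live in a folder named `jupyter_notebooks` as the path above shows. The helper name `project_root_from` is hypothetical and not part of the original notebook.

```python
import os


def project_root_from(path, notebooks_folder="jupyter_notebooks"):
    """Return the parent of `path` if it is the notebooks folder, else `path` unchanged."""
    normalized = os.path.normpath(path)
    if os.path.basename(normalized) == notebooks_folder:
        return os.path.dirname(normalized)
    return path


# Safe to re-run: the directory only changes while we are still inside
# the notebooks folder (folder name is an assumption taken from the path above).
os.chdir(project_root_from(os.getcwd()))
```

Running the cell a second time is then a no-op, because the project root is not named `jupyter_notebooks`.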

# Fetch data from Kaggle

---

First we need to install kaggle to fetch the data

In [5]:
! pip install kaggle==1.5.12



Next we need the kaggle.json authentication token from Kaggle. The kaggle package uses it to authenticate before downloading the data required for this project.
* You will need a **kaggle.json** file available. You can get this by registering a free account at https://www.kaggle.com/
* Drag and drop the kaggle.json file into the root directory of this project.

![Kaggle](../media/kaggle.png)

Once that is done, we run the cell below so the token is recognized in the session.

In [6]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json

chmod: cannot access 'kaggle.json': No such file or directory
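The `chmod` output above shows what happens when the token has not been placed in the project root (for example, after it was deleted by an earlier run). Below is a minimal sketch that checks for the file first and applies the same `600` permissions from Python. The helper name `secure_kaggle_token` is hypothetical.

```python
import os
import stat


def secure_kaggle_token(config_dir):
    """Restrict kaggle.json to owner read/write; return False if it is missing."""
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        print(f"kaggle.json not found in {config_dir}; add it before downloading")
        return False
    # Owner read + write only, equivalent to `chmod 600 kaggle.json`
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    return True


os.environ["KAGGLE_CONFIG_DIR"] = os.getcwd()
token_ready = secure_kaggle_token(os.getcwd())
```

`token_ready` can then be used to decide whether it is worth running the download cell.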


We are using the following [Kaggle url](https://www.kaggle.com/datasets/codeinstitute/housing-prices-data)

Using the dataset URL, we then define the Kaggle dataset, destination folder and download it

In [7]:
KaggleDatasetPath = "codeinstitute/housing-prices-data"
DestinationFolder = "inputs/datasets/raw"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Traceback (most recent call last):
  File "/workspace/.pip-modules/bin/kaggle", line 5, in <module>
    from kaggle.cli import main
  File "/workspace/.pip-modules/lib/python3.8/site-packages/kaggle/__init__.py", line 23, in <module>
    api.authenticate()
  File "/workspace/.pip-modules/lib/python3.8/site-packages/kaggle/api/kaggle_api_extended.py", line 164, in authenticate
    raise IOError('Could not find {}. Make sure it\'s located in'
OSError: Could not find kaggle.json. Make sure it's located in /workspace/Heritage-Housing-Issues-P5. Or use the environment method.
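The traceback above mentions "the environment method" as an alternative to kaggle.json. As far as we know, the kaggle package reads the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables for this purpose. Below is a minimal sketch that checks for either form of credentials before the download is attempted. The helper name `kaggle_credentials_found` is hypothetical.

```python
import os


def kaggle_credentials_found(config_dir):
    """Return True if a kaggle.json file or both environment variables are present."""
    has_env = bool(os.environ.get("KAGGLE_USERNAME") and os.environ.get("KAGGLE_KEY"))
    has_file = os.path.isfile(os.path.join(config_dir, "kaggle.json"))
    return has_env or has_file


if not kaggle_credentials_found(os.getcwd()):
    # Without credentials the kaggle CLI fails as soon as it starts
    print("No Kaggle credentials found; skip the download cell until they are provided")
```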


Next we unzip the downloaded file, delete the zip file and delete the kaggle.json file. With the token removed, we have the data and are no longer at risk of pushing our private API key to the repository.


In [8]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json

unzip:  cannot find or open inputs/datasets/raw/*.zip, inputs/datasets/raw/*.zip.zip or inputs/datasets/raw/*.zip.ZIP.

No zipfiles found.
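The shell command above fails when no archive is present, as the output shows. Below is a minimal Python sketch of the same unzip-and-delete step using the standard library, which simply does nothing when there are no zip files. The helper name `extract_and_remove_zips` is hypothetical, and the sketch deliberately leaves kaggle.json alone.

```python
import glob
import os
import zipfile


def extract_and_remove_zips(folder):
    """Extract every .zip in `folder` into that folder, then delete the archive."""
    extracted = []
    for zip_path in glob.glob(os.path.join(folder, "*.zip")):
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(folder)
        os.remove(zip_path)
        extracted.append(zip_path)
    return extracted


archives = extract_and_remove_zips("inputs/datasets/raw")
print(f"Extracted {len(archives)} archive(s)")
```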


---

# Load and Inspect Kaggle Data

Let's take a look at our data.

In [9]:
import pandas as pd
df = pd.read_csv("inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv")
df.head(10)

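| | 1stFlrSF | 2ndFlrSF | BedroomAbvGr | BsmtExposure | BsmtFinSF1 | BsmtFinType1 | BsmtUnfSF | EnclosedPorch | GarageArea | GarageFinish | ... | LotFrontage | MasVnrArea | OpenPorchSF | OverallCond | OverallQual | TotalBsmtSF | WoodDeckSF | YearBuilt | YearRemodAdd | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 856 | 854.0 | 3.0 | No | 706 | GLQ | 150 | 0.0 | 548 | RFn | ... | 65.0 | 196.0 | 61 | 5 | 7 | 856 | 0.0 | 2003 | 2003 | 208500 |
| 1 | 1262 | 0.0 | 3.0 | Gd | 978 | ALQ | 284 | NaN | 460 | RFn | ... | 80.0 | 0.0 | 0 | 8 | 6 | 1262 | NaN | 1976 | 1976 | 181500 |
| 2 | 920 | 866.0 | 3.0 | Mn | 486 | GLQ | 434 | 0.0 | 608 | RFn | ... | 68.0 | 162.0 | 42 | 5 | 7 | 920 | NaN | 2001 | 2002 | 223500 |
| 3 | 961 | NaN | NaN | No | 216 | ALQ | 540 | NaN | 642 | Unf | ... | 60.0 | 0.0 | 35 | 5 | 7 | 756 | NaN | 1915 | 1970 | 140000 |
| 4 | 1145 | NaN | 4.0 | Av | 655 | GLQ | 490 | 0.0 | 836 | RFn | ... | 84.0 | 350.0 | 84 | 5 | 8 | 1145 | NaN | 2000 | 2000 | 250000 |
| 5 | 796 | 566.0 | 1.0 | No | 732 | GLQ | 64 | NaN | 480 | Unf | ... | 85.0 | 0.0 | 30 | 5 | 5 | 796 | NaN | 1993 | 1995 | 143000 |
| 6 | 1694 | 0.0 | 3.0 | Av | 1369 | GLQ | 317 | NaN | 636 | RFn | ... | 75.0 | 186.0 | 57 | 5 | 8 | 1686 | NaN | 2004 | 2005 | 307000 |
| 7 | 1107 | 983.0 | 3.0 | Mn | 859 | ALQ | 216 | NaN | 484 | NaN | ... | NaN | 240.0 | 204 | 6 | 7 | 1107 | NaN | 1973 | 1973 | 200000 |
| 8 | 1022 | 752.0 | 2.0 | No | 0 | Unf | 952 | NaN | 468 | Unf | ... | 51.0 | 0.0 | 0 | 5 | 7 | 952 | NaN | 1931 | 1950 | 129900 |
| 9 | 1077 | 0.0 | 2.0 | No | 851 | GLQ | 140 | NaN | 205 | RFn | ... | 50.0 | 0.0 | 4 | 6 | 5 | 991 | NaN | 1939 | 1950 | 118000 |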


### Abbreviations explained:

1stFlrSF: First Floor square feet

2ndFlrSF: Second floor square feet

BedroomAbvGr: Bedrooms above grade (does NOT include basement bedrooms)

BsmtExposure: Refers to walkout or garden level walls
*   Gd: Good Exposure;
*   Av: Average Exposure;
*   Mn: Minimum Exposure;
*   No: No Exposure;
*   None: No Basement

BsmtFinType1: Rating of basement finished area
*   GLQ: Good Living Quarters;
*   ALQ: Average Living Quarters;
*   BLQ: Below Average Living Quarters;
*   Rec: Average Rec Room;
*   LwQ: Low Quality;
*   Unf: Unfinished;
*   None: No Basement

BsmtFinSF1: Type 1 finished square feet

BsmtUnfSF: Unfinished square feet of basement area

TotalBsmtSF: Total square feet of basement area

GarageArea: Size of garage in square feet

GarageFinish: Interior finish of the garage
*   Fin: Finished;
*   RFn: Rough Finished;
*   Unf: Unfinished;
*   None: No Garage

GarageYrBlt: Year garage was built

GrLivArea: Above grade (ground) living area square feet

KitchenQual: Kitchen quality
*   Ex: Excellent;
*   Gd: Good;
*   TA: Typical/Average;
*   Fa: Fair;
*   Po: Poor

LotArea: Lot size in square feet

LotFrontage: Linear feet of street connected to property

MasVnrArea: Masonry veneer area in square feet

EnclosedPorch: Enclosed porch area in square feet

OpenPorchSF: Open porch area in square feet

OverallCond: Rates the overall condition of the house
*   10: Very Excellent;
*   9: Excellent;
*   8: Very Good;
*   7: Good;
*   6: Above Average;
*   5: Average;
*   4: Below Average;
*   3: Fair;
*   2: Poor;
*   1: Very Poor

OverallQual: Rates the overall material and finish of the house

### Dataframe Summary

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 24 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   1stFlrSF       1460 non-null   int64  
 1   2ndFlrSF       1374 non-null   float64
 2   BedroomAbvGr   1361 non-null   float64
 3   BsmtExposure   1460 non-null   object 
 4   BsmtFinSF1     1460 non-null   int64  
 5   BsmtFinType1   1346 non-null   object 
 6   BsmtUnfSF      1460 non-null   int64  
 7   EnclosedPorch  136 non-null    float64
 8   GarageArea     1460 non-null   int64  
 9   GarageFinish   1298 non-null   object 
 10  GarageYrBlt    1379 non-null   float64
 11  GrLivArea      1460 non-null   int64  
 12  KitchenQual    1460 non-null   object 
 13  LotArea        1460 non-null   int64  
 14  LotFrontage    1201 non-null   float64
 15  MasVnrArea     1452 non-null   float64
 16  OpenPorchSF    1460 non-null   int64  
 17  OverallCond    1460 non-null   int64  
 18  OverallQ
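
The summary above shows several columns with fewer than 1460 non-null values (for example, EnclosedPorch has only 136). Below is a minimal sketch for ranking columns by missing data. It runs on a small stand-in frame built from the first four rows shown earlier, so the numbers it prints describe the sample, not the full dataset. The helper name `missing_summary` is hypothetical.

```python
import pandas as pd


def missing_summary(frame):
    """Return count and percentage of missing values for columns that have any."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(1),
    })
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Stand-in sample: three columns from the first four rows of df.head(10)
sample = pd.DataFrame({
    "1stFlrSF": [856, 1262, 920, 961],
    "2ndFlrSF": [854.0, 0.0, 866.0, None],
    "EnclosedPorch": [0.0, None, 0.0, None],
})
print(missing_summary(sample))
```

In the notebook itself, `missing_summary(df)` would give the same view for all 24 columns.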

---

# Push files to Repo

In [11]:
import os
try:
  os.makedirs(name='outputs/datasets/collection') # create outputs/datasets/collection folder
except Exception as e:
  print(e)

df.to_csv("outputs/datasets/collection/HeritageHousing.csv", index=False)


[Errno 17] File exists: 'outputs/datasets/collection'
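
The `[Errno 17]` message above is harmless: it only reports that the folder was created on an earlier run. Below is a minimal sketch of an alternative that avoids the message by passing `exist_ok=True` to `os.makedirs`. The helper name `ensure_folder` is hypothetical.

```python
import os


def ensure_folder(path):
    """Create `path` (and any missing parents) without failing if it already exists."""
    os.makedirs(path, exist_ok=True)
    return path


output_folder = ensure_folder("outputs/datasets/collection")
```

With this approach other errors, such as a permissions problem, still raise instead of being printed and ignored.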
