# **Notebook 1: Data Collection**

## Objectives

* Fetch (download) the data from Kaggle and save it as raw data (inputs/datasets/raw)
* Inspect the data and save it to outputs/datasets/collection
* Save the project and push it to the GitHub repository

## Inputs

* Kaggle JSON file (kaggle.json): the Kaggle authentication token needed to download the dataset

## Outputs

* Datasets saved in outputs/datasets/collection (HousePricesRecords.csv and InheritedHouses.csv)

***

## Change working directory

In this section we get the location of the current directory and move one level up, to its parent folder, so that the code in this notebook accesses the project folder.

1. We need to change the working directory from its current folder to its parent folder
    * We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/Users/pecukevicius/DataspellProjects/heritage_houses_p5/jupyter_notebooks'

2. We want to make the parent of the current directory the new current directory
    * os.path.dirname() gets the parent directory
    * os.chdir() defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("you have set a new current directory")

you have set a new current directory


3. Confirm new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/Users/pecukevicius/DataspellProjects/heritage_houses_p5'
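Re-running the cell that changes the directory moves one more level up each time. A minimal sketch of a guarded version is shown below; it assumes the notebooks live in a folder named `jupyter_notebooks` (as in the path printed above), and the helper name `move_to_project_root` is hypothetical.

```python
import os
from pathlib import Path


def move_to_project_root(notebook_dir_name: str = "jupyter_notebooks") -> Path:
    """Move to the parent folder only when the cwd is the notebooks folder.

    The folder name is an assumption taken from the path shown in this notebook.
    """
    cwd = Path.cwd()
    if cwd.name == notebook_dir_name:
        os.chdir(cwd.parent)
    # Return the directory we ended up in, whether or not we moved
    return Path.cwd()


print(move_to_project_root())
```

Calling the helper a second time leaves the working directory unchanged, because the project folder is not named `jupyter_notebooks`.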

## Getting (fetching) data from Kaggle

1. First we have to install the Kaggle library

In [4]:
! pip install kaggle==1.6.12

Collecting kaggle==1.6.12
  Using cached kaggle-1.6.12-py3-none-any.whl
Installing collected packages: kaggle
Successfully installed kaggle-1.6.12


2. Download the authentication token from Kaggle:
    * You will need the Kaggle authentication token 'kaggle.json', which you can download from your Kaggle account settings, under API (create new token)
    * If you do not have a Kaggle account, create one and then download the kaggle.json token
    * Once the token is downloaded, put it in the main project folder
    * After that, run the cell below to adjust the token file's permissions

In [5]:
import os

os.environ["KAGGLE_CONFIG_DIR"] = os.getcwd()
! chmod 600 kaggle.json
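The same permission change can be done without a shell command. The sketch below is one possible pure-Python equivalent for POSIX systems; the helper name `restrict_token` is hypothetical, and it additionally reports when the token file is missing instead of failing.

```python
import os
import stat
from pathlib import Path


def restrict_token(token_path: str = "kaggle.json") -> bool:
    """Set owner read/write only (the equivalent of chmod 600) if the file exists."""
    path = Path(token_path)
    if not path.is_file():
        print(f"{token_path} not found; place the token in the project folder first")
        return False
    # S_IRUSR | S_IWUSR is 0o600: readable and writable by the owner only
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
    return True


restrict_token()
```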

3. Fetch the dataset from Kaggle
    * We will be using the dataset named "House Prices"
    * We define the dataset destination folder
    * We download the dataset

In [6]:
Kaggle_dataset_name = "codeinstitute/housing-prices-data"
Destination_folder = "inputs/datasets/raw"
! kaggle datasets download -d {Kaggle_dataset_name} -p {Destination_folder}

Dataset URL: https://www.kaggle.com/datasets/codeinstitute/housing-prices-data
License(s): unknown
Downloading housing-prices-data.zip to inputs/datasets/raw
  0%|                                               | 0.00/49.6k [00:00<?, ?B/s]
100%|██████████████████████████████████████| 49.6k/49.6k [00:00<00:00, 2.54MB/s]


4. Unzip the downloaded file
    * First we unzip the downloaded file into the destination folder
    * After unzipping, we delete the downloaded zip file
    * We also delete the Kaggle token, as it is no longer required

In [7]:
! unzip {Destination_folder}/*.zip -d {Destination_folder}
! rm {Destination_folder}/*.zip
! rm kaggle.json

Archive:  inputs/datasets/raw/housing-prices-data.zip
  inflating: inputs/datasets/raw/house-metadata.txt  
  inflating: inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv  
  inflating: inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/inherited_houses.csv  
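On systems without the `unzip` and `rm` commands, the extract-then-delete step can be sketched with Python's standard `zipfile` module. The helper name `extract_and_remove_zips` is hypothetical, and this is an alternative to the shell commands above, not what the notebook ran.

```python
import zipfile
from pathlib import Path


def extract_and_remove_zips(folder: str) -> list:
    """Extract every .zip in `folder` into that folder, then delete the archive."""
    extracted = []
    for zip_path in sorted(Path(folder).glob("*.zip")):
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(folder)
            extracted.extend(archive.namelist())
        # Remove the archive only after it has been extracted successfully
        zip_path.unlink()
    return extracted
```

For this project the call would be `extract_and_remove_zips("inputs/datasets/raw")`, which returns the names of the extracted files.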


5. The Kaggle library is no longer needed, so we uninstall it

In [8]:
! pip uninstall -y kaggle==1.6.12

Found existing installation: kaggle 1.6.12
Uninstalling kaggle-1.6.12:
  Successfully uninstalled kaggle-1.6.12


## Importing packages and setting environment variables

In [9]:
import pandas as pd

pd.options.display.max_columns = None
pd.options.display.max_rows = None

## Loading and Inspecting Dataset Records
* We load the dataset CSV file into a Pandas DataFrame

In [10]:
df = pd.read_csv('inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv')
print(df.shape)
df.head()

(1460, 24)


Unnamed: 0,1stFlrSF,2ndFlrSF,BedroomAbvGr,BsmtExposure,BsmtFinSF1,BsmtFinType1,BsmtUnfSF,EnclosedPorch,GarageArea,GarageFinish,GarageYrBlt,GrLivArea,KitchenQual,LotArea,LotFrontage,MasVnrArea,OpenPorchSF,OverallCond,OverallQual,TotalBsmtSF,WoodDeckSF,YearBuilt,YearRemodAdd,SalePrice
0,856,854.0,3.0,No,706,GLQ,150,0.0,548,RFn,2003.0,1710,Gd,8450,65.0,196.0,61,5,7,856,0.0,2003,2003,208500
1,1262,0.0,3.0,Gd,978,ALQ,284,,460,RFn,1976.0,1262,TA,9600,80.0,0.0,0,8,6,1262,,1976,1976,181500
2,920,866.0,3.0,Mn,486,GLQ,434,0.0,608,RFn,2001.0,1786,Gd,11250,68.0,162.0,42,5,7,920,,2001,2002,223500
3,961,,,No,216,ALQ,540,,642,Unf,1998.0,1717,Gd,9550,60.0,0.0,35,5,7,756,,1915,1970,140000
4,1145,,4.0,Av,655,GLQ,490,0.0,836,RFn,2000.0,2198,Gd,14260,84.0,350.0,84,5,8,1145,,2000,2000,250000


## Dataframe Summary
* We get a summary of the dataset using the .info() method

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 24 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   1stFlrSF       1460 non-null   int64  
 1   2ndFlrSF       1374 non-null   float64
 2   BedroomAbvGr   1361 non-null   float64
 3   BsmtExposure   1460 non-null   object 
 4   BsmtFinSF1     1460 non-null   int64  
 5   BsmtFinType1   1346 non-null   object 
 6   BsmtUnfSF      1460 non-null   int64  
 7   EnclosedPorch  136 non-null    float64
 8   GarageArea     1460 non-null   int64  
 9   GarageFinish   1298 non-null   object 
 10  GarageYrBlt    1379 non-null   float64
 11  GrLivArea      1460 non-null   int64  
 12  KitchenQual    1460 non-null   object 
 13  LotArea        1460 non-null   int64  
 14  LotFrontage    1201 non-null   float64
 15  MasVnrArea     1452 non-null   float64
 16  OpenPorchSF    1460 non-null   int64  
 17  OverallCond    1460 non-null   int64  
 18  OverallQ

* We can see that there is no Customer ID or any other ID field, so there is no identifier on which to check for duplicated data
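The non-null counts in the `.info()` summary show that several columns have gaps. A minimal sketch for listing missing values per column is shown below. It uses a small toy frame with illustrative values rather than the real dataset; in the notebook, `toy` would be replaced by `df`.

```python
import pandas as pd

# Toy frame standing in for the house prices records (values are illustrative only)
toy = pd.DataFrame({
    "1stFlrSF": [856, 1262, 920],
    "2ndFlrSF": [854.0, 0.0, None],
    "EnclosedPorch": [0.0, None, None],
})

missing = toy.isna().sum()
missing_pct = (toy.isna().mean() * 100).round(1)
summary = pd.DataFrame({"missing": missing, "percent": missing_pct})
# Keep only the columns that have gaps, largest first
summary = summary[summary["missing"] > 0].sort_values("missing", ascending=False)
print(summary)
```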

## Open and read Inherited houses dataset into Pandas dataframe

In [12]:
df_inherited = pd.read_csv("inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/inherited_houses.csv")
print(df_inherited.shape)
df_inherited

(4, 23)


Unnamed: 0,1stFlrSF,2ndFlrSF,BedroomAbvGr,BsmtExposure,BsmtFinSF1,BsmtFinType1,BsmtUnfSF,EnclosedPorch,GarageArea,GarageFinish,GarageYrBlt,GrLivArea,KitchenQual,LotArea,LotFrontage,MasVnrArea,OpenPorchSF,OverallCond,OverallQual,TotalBsmtSF,WoodDeckSF,YearBuilt,YearRemodAdd
0,896,0,2,No,468.0,Rec,270.0,0,730.0,Unf,1961.0,896,TA,11622,80.0,0.0,0,6,5,882.0,140,1961,1961
1,1329,0,3,No,923.0,ALQ,406.0,0,312.0,Unf,1958.0,1329,Gd,14267,81.0,108.0,36,6,6,1329.0,393,1958,1958
2,928,701,3,No,791.0,GLQ,137.0,0,482.0,Fin,1997.0,1629,TA,13830,74.0,0.0,34,5,5,928.0,212,1997,1998
3,926,678,3,No,602.0,GLQ,324.0,0,470.0,Fin,1998.0,1604,Gd,9978,78.0,20.0,36,6,6,926.0,360,1998,1998
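The two shapes differ by one column (24 versus 23). A short sketch for finding which columns differ between two DataFrames is shown below. It uses empty toy frames with a few of the real column names; in the notebook, `records` and `inherited` would be `df` and `df_inherited`.

```python
import pandas as pd

# Toy frames with a subset of the real column names (no data needed for this check)
records = pd.DataFrame(columns=["1stFlrSF", "GrLivArea", "SalePrice"])
inherited = pd.DataFrame(columns=["1stFlrSF", "GrLivArea"])

# Index.difference returns the labels present in one index but not the other
only_in_records = records.columns.difference(inherited.columns).tolist()
only_in_inherited = inherited.columns.difference(records.columns).tolist()
print(only_in_records, only_in_inherited)
```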


## Save datasets to output folder and push files to GitHub

In [13]:
import os

# create the outputs folder; exist_ok=True avoids an error if it already exists
os.makedirs(name='outputs/datasets/collection', exist_ok=True)

# save both datasets as CSV files
df.to_csv('outputs/datasets/collection/HousePricesRecords.csv')
df_inherited.to_csv('outputs/datasets/collection/InheritedHouses.csv')
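The `to_csv` calls above keep pandas' default `index=True`, so the row index is written as an unnamed first column. The sketch below illustrates, on a toy frame saved to a temporary folder, how that column appears when the file is read back and how `index_col=0` restores the original shape. The file name `toy.csv` and the values are illustrative only.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({"GrLivArea": [1710, 1262], "SalePrice": [208500, 181500]})

# Write to a temporary folder so the project outputs are not touched
out_dir = os.path.join(tempfile.mkdtemp(), "outputs", "datasets", "collection")
os.makedirs(out_dir, exist_ok=True)
path = os.path.join(out_dir, "toy.csv")
toy.to_csv(path)

plain = pd.read_csv(path)                  # index comes back as 'Unnamed: 0'
restored = pd.read_csv(path, index_col=0)  # first column is used as the index again
print(plain.columns.tolist())
print(restored.shape)
```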

## Outcome and Next Steps
* The two datasets differ: house price records (1460 rows, 24 columns) and inherited houses (4 rows, 23 columns)
* The House Prices dataset is a mix of int64, float64 and object type features
* The inherited houses dataset does not have the SalePrice column
* The next step is cleaning the collected data