# **Notebook 1: Data Collection**

## Objectives

* Fetch data from Kaggle and save it as raw data.
* Inspect the data and save it under outputs/datasets/raw.
* Push files to GitHub repository.

## Inputs

* Kaggle JSON File - Authentication Token.
* Recommended house price dataset downloaded from Kaggle.

## Outputs

* The two raw datasets, house_prices_records.csv and inherited_houses.csv, saved under outputs/datasets/raw.


---

# Change Working Directory

* The notebooks are assumed to be stored in a subfolder, so when you run a notebook in the editor you need to change the working directory.

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with os.getcwd().

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/Predictive-Analytics-PP5/jupyter_notebooks'

We want to make the parent of the current directory the new current directory.
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspace/Predictive-Analytics-PP5'
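
Running the directory change a second time would move one level above the project root. A minimal sketch of a guard against that is shown below. It assumes the notebooks live in a folder named `jupyter_notebooks` (as in the path printed above), and the helper name `move_to_project_root` is hypothetical.

```python
import os


def move_to_project_root(notebook_folder: str = "jupyter_notebooks") -> str:
    """Change to the parent directory only when currently inside the notebook folder."""
    current_dir = os.getcwd()
    # Only step up one level if we are still inside the notebooks subfolder,
    # so re-running the cell does not climb above the project root.
    if os.path.basename(current_dir) == notebook_folder:
        os.chdir(os.path.dirname(current_dir))
    return os.getcwd()
```

Calling `move_to_project_root()` repeatedly should leave the working directory unchanged after the first call.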

# Fetch Data from Kaggle

* Use the command below to install the Kaggle library:

In [4]:
! pip3 install kaggle==1.5.12



* To proceed, you need a Kaggle account. Without an account you cannot access the required data.
* You need to download a kaggle.json file, which contains the authentication token used to authenticate any download from Kaggle. You can request it in your Kaggle account settings, under the API section, by creating a new token.
* Once you have downloaded the JSON file, upload it to the root directory of the project repository by dragging and dropping it there.
* In the next sections, we will set up the Kaggle environment variable and the permissions for the token we have downloaded.
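
Before continuing, it can help to confirm that the token file really is in the project root and looks complete. The sketch below is one possible check. It assumes the token is a JSON object with `username` and `key` fields, and `check_kaggle_token` is a hypothetical helper name.

```python
import json
from pathlib import Path


def check_kaggle_token(token_path: str = "kaggle.json") -> bool:
    """Return True if the token file exists and holds non-empty 'username' and 'key' fields."""
    path = Path(token_path)
    if not path.is_file():
        return False
    try:
        token = json.loads(path.read_text())
    except json.JSONDecodeError:
        return False
    # Assumption: the token is a JSON object with 'username' and 'key' entries.
    return isinstance(token, dict) and all(token.get(field) for field in ("username", "key"))
```

If `check_kaggle_token()` returns `False`, the file is probably missing from the working directory or was not downloaded completely.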

---

# Using Your Authentication Token

* Now that the authentication token has been added to the project, run the command below to tell Kaggle where to find the token and to ensure the correct permissions have been assigned for the handling of the file.

In [5]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
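
The `chmod 600` shell command gives the file owner read and write access and removes access for everyone else. If you prefer to stay in Python, a rough equivalent is sketched below. `restrict_to_owner` is a hypothetical helper name, and the result is only meaningful on POSIX systems such as Linux or macOS.

```python
import os
import stat


def restrict_to_owner(path: str) -> int:
    """Give the file owner read/write access only (the same as chmod 600)."""
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
    # Return the resulting permission bits so they can be checked.
    return stat.S_IMODE(os.stat(path).st_mode)
```

For example, `oct(restrict_to_owner("kaggle.json"))` lets you inspect the permission bits after the change.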

# Retrieve the Dataset Path from the Kaggle URL

* The dataset being used for this project is called "House Prices".
* You will need to copy the section of the Kaggle URL that follows "https://www.kaggle.com/".
* You can then define the dataset and its destination folder for the download.
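
Copying the path by hand is easy to get wrong, so the step can also be sketched in code. The helper below is hypothetical. It assumes the path has the form `<owner>/<dataset>` and that some Kaggle URLs may carry a leading `datasets` segment, which it drops.

```python
from urllib.parse import urlparse


def dataset_path_from_url(url: str) -> str:
    """Return the '<owner>/<dataset>' part of a Kaggle dataset URL."""
    parts = [part for part in urlparse(url).path.split("/") if part]
    # Assumption: some Kaggle URLs carry a leading 'datasets' segment; drop it.
    if parts and parts[0] == "datasets":
        parts = parts[1:]
    if len(parts) < 2:
        raise ValueError(f"Cannot find an owner/dataset pair in {url!r}")
    return "/".join(parts[:2])
```

For the URL of this project's dataset, the function should return the same string that is assigned to `KaggleDatasetPath` in the next cell.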

In [6]:
KaggleDatasetPath = "codeinstitute/housing-prices-data"
DestinationFolder = "inputs/dataset/raw"
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Downloading housing-prices-data.zip to inputs/dataset/raw
  0%|                                               | 0.00/49.6k [00:00<?, ?B/s]
100%|██████████████████████████████████████| 49.6k/49.6k [00:00<00:00, 2.08MB/s]


* The data is downloaded as a zip file, so you now need to unzip it to extract the data and then delete the zip file. The same command also deletes the Kaggle JSON file.

In [7]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
    && rm {DestinationFolder}/*.zip \
    && rm kaggle.json

Archive:  inputs/dataset/raw/housing-prices-data.zip
  inflating: inputs/dataset/raw/house-metadata.txt  
  inflating: inputs/dataset/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv  
  inflating: inputs/dataset/raw/house-price-20211124T154130Z-001/house-price/inherited_houses.csv  
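
The unzip-and-delete step can also be done with Python's standard library, which avoids relying on the `unzip` shell tool. The following is a minimal sketch and not the command used above. `extract_and_remove_zips` is a hypothetical helper, and it deliberately leaves kaggle.json untouched.

```python
import zipfile
from pathlib import Path


def extract_and_remove_zips(folder: str) -> list:
    """Extract every .zip archive in `folder` into that folder, then delete the archive."""
    extracted = []
    for archive in sorted(Path(folder).glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(folder)
            extracted.extend(zf.namelist())
        archive.unlink()  # remove the archive once its contents are extracted
    return extracted
```

In this notebook it would be called as `extract_and_remove_zips(DestinationFolder)`, and the returned list names the extracted files.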


In [8]:
! pip3 uninstall -y kaggle==1.5.12

Found existing installation: kaggle 1.5.12
Uninstalling kaggle-1.5.12:
  Successfully uninstalled kaggle-1.5.12


# Import Packages & Set Environment Variables

* The command below imports the pandas package and sets its display options so that all columns and rows are shown.

In [9]:
import pandas as pd
pd.options.display.max_columns = None
pd.options.display.max_rows = None
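
Setting `max_rows` to `None` applies to the whole session, so displaying a large dataframe in full will print every row. If that becomes unwieldy, one alternative is to lift the limits only temporarily with `pd.option_context`. The sketch below uses a small stand-in dataframe (`df_demo`, hypothetical) and not the project data.

```python
import pandas as pd

df_demo = pd.DataFrame({"a": range(3), "b": list("xyz")})  # hypothetical stand-in data

previous = pd.get_option("display.max_columns")
with pd.option_context("display.max_columns", None, "display.max_rows", None):
    # Inside the block both limits are lifted.
    inside = pd.get_option("display.max_columns")
    print(df_demo)
# After the block the earlier setting is back in place.
restored = pd.get_option("display.max_columns")
```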

# Load & Inspect the House Price Records Dataset

* Now that you have imported the pandas package, you can read the house_prices_records dataset from the CSV file into a pandas dataframe.

In [10]:
df = pd.read_csv("inputs/dataset/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv")
print(df.shape)
df.head()

(1460, 24)


Unnamed: 0,1stFlrSF,2ndFlrSF,BedroomAbvGr,BsmtExposure,BsmtFinSF1,BsmtFinType1,BsmtUnfSF,EnclosedPorch,GarageArea,GarageFinish,GarageYrBlt,GrLivArea,KitchenQual,LotArea,LotFrontage,MasVnrArea,OpenPorchSF,OverallCond,OverallQual,TotalBsmtSF,WoodDeckSF,YearBuilt,YearRemodAdd,SalePrice
0,856,854.0,3.0,No,706,GLQ,150,0.0,548,RFn,2003.0,1710,Gd,8450,65.0,196.0,61,5,7,856,0.0,2003,2003,208500
1,1262,0.0,3.0,Gd,978,ALQ,284,,460,RFn,1976.0,1262,TA,9600,80.0,0.0,0,8,6,1262,,1976,1976,181500
2,920,866.0,3.0,Mn,486,GLQ,434,0.0,608,RFn,2001.0,1786,Gd,11250,68.0,162.0,42,5,7,920,,2001,2002,223500
3,961,,,No,216,ALQ,540,,642,Unf,1998.0,1717,Gd,9550,60.0,0.0,35,5,7,756,,1915,1970,140000
4,1145,,4.0,Av,655,GLQ,490,0.0,836,RFn,2000.0,2198,Gd,14260,84.0,350.0,84,5,8,1145,,2000,2000,250000


# Dataframe Summary

* You can now call the .info() method on the dataframe object to read the dataframe summary, which lists each column with its non-null count and data type.

---

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 24 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   1stFlrSF       1460 non-null   int64  
 1   2ndFlrSF       1374 non-null   float64
 2   BedroomAbvGr   1361 non-null   float64
 3   BsmtExposure   1460 non-null   object 
 4   BsmtFinSF1     1460 non-null   int64  
 5   BsmtFinType1   1346 non-null   object 
 6   BsmtUnfSF      1460 non-null   int64  
 7   EnclosedPorch  136 non-null    float64
 8   GarageArea     1460 non-null   int64  
 9   GarageFinish   1298 non-null   object 
 10  GarageYrBlt    1379 non-null   float64
 11  GrLivArea      1460 non-null   int64  
 12  KitchenQual    1460 non-null   object 
 13  LotArea        1460 non-null   int64  
 14  LotFrontage    1201 non-null   float64
 15  MasVnrArea     1452 non-null   float64
 16  OpenPorchSF    1460 non-null   int64  
 17  OverallCond    1460 non-null   int64  
 18  OverallQ

* Note from the above that the dataset has no ID field that identifies unique records, so there is no requirement to check for duplicate data.
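
The summary above also shows several columns whose non-null count is below 1460, for example EnclosedPorch and LotFrontage. A compact way to list such columns is sketched below. `missing_summary` is a hypothetical helper, and the `demo` frame is a hand-made miniature, not the real dataset.

```python
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Count missing values per column and express them as a percentage of all rows."""
    missing = frame.isna().sum()
    summary = pd.DataFrame({
        "missing": missing,
        "percent": (missing / len(frame) * 100).round(2),
    })
    # Keep only columns that actually have gaps, largest first.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Hypothetical miniature frame, not the real dataset.
demo = pd.DataFrame({"LotArea": [8450, 9600, 11250, 9550],
                     "LotFrontage": [65.0, None, 68.0, 60.0],
                     "EnclosedPorch": [0.0, None, None, None]})
print(missing_summary(demo))
```

In the notebook, `missing_summary(df)` would give the same kind of table for the house price records.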

# Load & Inspect the Inherited House Records Dataset

* Following the same process as noted above, you can now read the inherited_houses dataset csv file in a Pandas dataframe.

In [13]:
df_inherited = pd.read_csv("inputs/dataset/raw/house-price-20211124T154130Z-001/house-price/inherited_houses.csv")
print(df_inherited.shape)
df_inherited

(4, 23)


Unnamed: 0,1stFlrSF,2ndFlrSF,BedroomAbvGr,BsmtExposure,BsmtFinSF1,BsmtFinType1,BsmtUnfSF,EnclosedPorch,GarageArea,GarageFinish,GarageYrBlt,GrLivArea,KitchenQual,LotArea,LotFrontage,MasVnrArea,OpenPorchSF,OverallCond,OverallQual,TotalBsmtSF,WoodDeckSF,YearBuilt,YearRemodAdd
0,896,0,2,No,468.0,Rec,270.0,0,730.0,Unf,1961.0,896,TA,11622,80.0,0.0,0,6,5,882.0,140,1961,1961
1,1329,0,3,No,923.0,ALQ,406.0,0,312.0,Unf,1958.0,1329,Gd,14267,81.0,108.0,36,6,6,1329.0,393,1958,1958
2,928,701,3,No,791.0,GLQ,137.0,0,482.0,Fin,1997.0,1629,TA,13830,74.0,0.0,34,5,5,928.0,212,1997,1998
3,926,678,3,No,602.0,GLQ,324.0,0,470.0,Fin,1998.0,1604,Gd,9978,78.0,20.0,36,6,6,926.0,360,1998,1998


* Again you can run the below command to display the dataframe summary information:

In [14]:
df_inherited.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 23 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   1stFlrSF       4 non-null      int64  
 1   2ndFlrSF       4 non-null      int64  
 2   BedroomAbvGr   4 non-null      int64  
 3   BsmtExposure   4 non-null      object 
 4   BsmtFinSF1     4 non-null      float64
 5   BsmtFinType1   4 non-null      object 
 6   BsmtUnfSF      4 non-null      float64
 7   EnclosedPorch  4 non-null      int64  
 8   GarageArea     4 non-null      float64
 9   GarageFinish   4 non-null      object 
 10  GarageYrBlt    4 non-null      float64
 11  GrLivArea      4 non-null      int64  
 12  KitchenQual    4 non-null      object 
 13  LotArea        4 non-null      int64  
 14  LotFrontage    4 non-null      float64
 15  MasVnrArea     4 non-null      float64
 16  OpenPorchSF    4 non-null      int64  
 17  OverallCond    4 non-null      int64  
 18  OverallQual   
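
The two datasets differ by one column (24 versus 23), and the header rows shown above suggest that SalePrice is present only in the house price records. A small sketch for confirming which columns differ follows. `compare_columns` is a hypothetical helper, and the column lists are shortened for illustration.

```python
def compare_columns(left, right):
    """Return (only_in_left, only_in_right) as sorted lists of column names."""
    left_set, right_set = set(left), set(right)
    return sorted(left_set - right_set), sorted(right_set - left_set)


# Shortened, hand-typed column lists for illustration only.
records_columns = ["1stFlrSF", "2ndFlrSF", "GrLivArea", "SalePrice"]
inherited_columns = ["1stFlrSF", "2ndFlrSF", "GrLivArea"]
only_records, only_inherited = compare_columns(records_columns, inherited_columns)
print(only_records, only_inherited)
```

In the notebook the call would be `compare_columns(df.columns, df_inherited.columns)`.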

---

# Conclusion & Next Steps

* Both datasets have been loaded and inspected. As a next step, save them as CSV files under outputs/datasets/raw so that they can be pushed to the GitHub repository.

In [15]:
import os

# Create the output folder; exist_ok=True avoids an error if it already exists
os.makedirs('outputs/datasets/raw', exist_ok=True)

# Save both dataframes as CSV files without the index column
df.to_csv("outputs/datasets/raw/house_prices_records.csv", index=False)
df_inherited.to_csv("outputs/datasets/raw/inherited_houses.csv", index=False)

print("Raw files have been saved.")

Raw files have been saved.
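
The printed message only confirms that the cell ran to the end. To check that the files really exist and are not empty, something like the sketch below could be used. `check_saved_files` is a hypothetical helper and not part of the original notebook.

```python
from pathlib import Path


def check_saved_files(folder: str, filenames) -> dict:
    """Map each expected filename to its size in bytes, or None if it is missing."""
    sizes = {}
    for name in filenames:
        path = Path(folder) / name
        sizes[name] = path.stat().st_size if path.is_file() else None
    return sizes
```

For this notebook the call would be `check_saved_files("outputs/datasets/raw", ["house_prices_records.csv", "inherited_houses.csv"])`. A value of `None` means the file was not written.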
