# **Data Collection Notebook**

## Objectives

* Fetch data from Kaggle and save as raw data
* Inspect the data and save it under outputs/datasets/collection

## Inputs

* Kaggle JSON file - the authentication token.

## Outputs

* Generate Datasets: outputs/datasets/collection/HousePricesRecords.csv<br>outputs/datasets/collection/InheritedHouses.csv


---

# Change working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [3]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/heritage-housing-predictive-analytics/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [4]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [5]:
current_dir = os.getcwd()
current_dir

'/workspace/heritage-housing-predictive-analytics'
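Re-running the `os.chdir` cell moves one level further up each time. A minimal sketch of a guard against that is given here, assuming the notebooks folder is named `jupyter_notebooks` as in the path printed earlier. The helper name `move_to_project_root` is hypothetical and not part of the original notebook.

```python
import os


def move_to_project_root(notebooks_dir_name="jupyter_notebooks"):
    """Move to the parent folder only while still inside the notebooks folder.

    `notebooks_dir_name` is an assumption taken from the path shown above.
    """
    current_dir = os.getcwd()
    if os.path.basename(current_dir) == notebooks_dir_name:
        # Same operation as in the notebook: the parent becomes the cwd.
        os.chdir(os.path.dirname(current_dir))
    return os.getcwd()


print(move_to_project_root())
```

Calling the helper a second time leaves the working directory unchanged, so the cell can be re-run safely.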

# Fetch data from Kaggle

Run the following so that the Kaggle token is recognised

In [6]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json

chmod: cannot access 'kaggle.json': No such file or directory
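The `chmod` error above is what appears when `kaggle.json` is not in the working directory. One possible way to check for the token before setting permissions is sketched below in plain Python. The helper name `prepare_kaggle_token` is hypothetical, and the sketch assumes the token is expected in the current directory, as in the cell above.

```python
import os


def prepare_kaggle_token(config_dir=None, token_name="kaggle.json"):
    """Return True if the token was found and prepared, False otherwise."""
    config_dir = config_dir or os.getcwd()
    token_path = os.path.join(config_dir, token_name)
    if not os.path.isfile(token_path):
        # Nothing to prepare: the token file still has to be added.
        return False
    # Point the Kaggle client at the folder that holds the token.
    os.environ["KAGGLE_CONFIG_DIR"] = config_dir
    # Equivalent of `chmod 600 kaggle.json`: owner read/write only.
    os.chmod(token_path, 0o600)
    return True


if not prepare_kaggle_token():
    print("kaggle.json not found in", os.getcwd())
```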


Define the Kaggle dataset and the destination folder, then download the dataset.

In [7]:
KaggleDatasetPath = "codeinstitute/housing-prices-data"
DestinationFolder = "inputs/datasets/raw"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Traceback (most recent call last):
  File "/workspace/.pip-modules/bin/kaggle", line 5, in <module>
    from kaggle.cli import main
  File "/workspace/.pip-modules/lib/python3.8/site-packages/kaggle/__init__.py", line 23, in <module>
    api.authenticate()
  File "/workspace/.pip-modules/lib/python3.8/site-packages/kaggle/api/kaggle_api_extended.py", line 164, in authenticate
    raise IOError('Could not find {}. Make sure it\'s located in'
OSError: Could not find kaggle.json. Make sure it's located in /workspace/heritage-housing-predictive-analytics. Or use the environment method.


Unzip the downloaded file, delete the zip file and delete the kaggle.json file

In [8]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json

unzip:  cannot find or open inputs/datasets/raw/*.zip, inputs/datasets/raw/*.zip.zip or inputs/datasets/raw/*.zip.ZIP.

No zipfiles found.
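The unzip-and-delete step can also be written with the standard library, which avoids a shell error when no archive is present. This is a sketch rather than the notebook's method, and the function name `extract_and_remove_zips` is hypothetical. Removing `kaggle.json` is left out here.

```python
import glob
import os
import zipfile


def extract_and_remove_zips(folder):
    """Extract every .zip in `folder` into that folder, then delete the archive."""
    processed = []
    for zip_path in sorted(glob.glob(os.path.join(folder, "*.zip"))):
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(folder)
        os.remove(zip_path)
        processed.append(zip_path)
    # An empty list means no archives were found.
    return processed


print(extract_and_remove_zips("inputs/datasets/raw"))
```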


---

# Load and inspect Kaggle data

Import dataframe

In [9]:
import pandas as pd
df = pd.read_csv("inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv")
df.head()

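|   | 1stFlrSF | 2ndFlrSF | BedroomAbvGr | BsmtExposure | BsmtFinSF1 | BsmtFinType1 | BsmtUnfSF | EnclosedPorch | GarageArea | GarageFinish | ... | LotFrontage | MasVnrArea | OpenPorchSF | OverallCond | OverallQual | TotalBsmtSF | WoodDeckSF | YearBuilt | YearRemodAdd | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 856 | 854.0 | 3.0 | No | 706 | GLQ | 150 | 0.0 | 548 | RFn | ... | 65.0 | 196.0 | 61 | 5 | 7 | 856 | 0.0 | 2003 | 2003 | 208500 |
| 1 | 1262 | 0.0 | 3.0 | Gd | 978 | ALQ | 284 | NaN | 460 | RFn | ... | 80.0 | 0.0 | 0 | 8 | 6 | 1262 | NaN | 1976 | 1976 | 181500 |
| 2 | 920 | 866.0 | 3.0 | Mn | 486 | GLQ | 434 | 0.0 | 608 | RFn | ... | 68.0 | 162.0 | 42 | 5 | 7 | 920 | NaN | 2001 | 2002 | 223500 |
| 3 | 961 | NaN | NaN | No | 216 | ALQ | 540 | NaN | 642 | Unf | ... | 60.0 | 0.0 | 35 | 5 | 7 | 756 | NaN | 1915 | 1970 | 140000 |
| 4 | 1145 | NaN | 4.0 | Av | 655 | GLQ | 490 | 0.0 | 836 | RFn | ... | 84.0 | 350.0 | 84 | 5 | 8 | 1145 | NaN | 2000 | 2000 | 250000 |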


Dataframe summary

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 24 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   1stFlrSF       1460 non-null   int64  
 1   2ndFlrSF       1374 non-null   float64
 2   BedroomAbvGr   1361 non-null   float64
 3   BsmtExposure   1460 non-null   object 
 4   BsmtFinSF1     1460 non-null   int64  
 5   BsmtFinType1   1346 non-null   object 
 6   BsmtUnfSF      1460 non-null   int64  
 7   EnclosedPorch  136 non-null    float64
 8   GarageArea     1460 non-null   int64  
 9   GarageFinish   1298 non-null   object 
 10  GarageYrBlt    1379 non-null   float64
 11  GrLivArea      1460 non-null   int64  
 12  KitchenQual    1460 non-null   object 
 13  LotArea        1460 non-null   int64  
 14  LotFrontage    1201 non-null   float64
 15  MasVnrArea     1452 non-null   float64
 16  OpenPorchSF    1460 non-null   int64  
 17  OverallCond    1460 non-null   int64  
 18  OverallQ

BsmtFinType1 is a categorical variable and needs to be converted to a numerical representation. I believe this category will affect the house price.<br>
Find the unique categories and the number of each.

In [20]:
unique_categories = df['BsmtFinType1'].unique()
print(unique_categories)

category_counts = df['BsmtFinType1'].value_counts()
print(category_counts)

['GLQ' 'ALQ' 'Unf' 'Rec' nan 'BLQ' 'None' 'LwQ']
Unf     396
GLQ     385
ALQ     202
BLQ     136
Rec     126
LwQ      70
None     31
Name: BsmtFinType1, dtype: int64


BsmtExposure is a categorical variable. Find the unique categories and the number of each.

In [21]:
unique_categories = df['BsmtExposure'].unique()
print(unique_categories)

category_counts = df['BsmtExposure'].value_counts()
print(category_counts)

['No' 'Gd' 'Mn' 'Av' 'None']
No      953
Av      221
Gd      134
Mn      114
None     38
Name: BsmtExposure, dtype: int64


GarageFinish is a categorical variable. Find the unique categories and the number of each.

In [22]:
unique_categories = df['GarageFinish'].unique()
print(unique_categories)

category_counts = df['GarageFinish'].value_counts()
print(category_counts)

['RFn' 'Unf' nan 'Fin' 'None']
Unf     546
RFn     366
Fin     313
None     73
Name: GarageFinish, dtype: int64
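The same two calls are repeated for each categorical column, and `value_counts()` leaves out `nan` by default even though `unique()` lists it. A small sketch that summarises several columns at once and keeps the missing entries visible could look like this. The helper name `summarise_categories` and the sample data are illustrative assumptions, not values from the dataset.

```python
import pandas as pd


def summarise_categories(df, columns):
    """Return value counts per column, including missing values."""
    return {col: df[col].value_counts(dropna=False) for col in columns}


# Illustrative sample only; the real notebook would pass `df` instead.
sample = pd.DataFrame({
    "GarageFinish": ["RFn", "Unf", None, "Fin", "None", "Unf"],
    "KitchenQual": ["Gd", "TA", "Ex", "Fa", "TA", "TA"],
})

summary = summarise_categories(sample, ["GarageFinish", "KitchenQual"])
for name, counts in summary.items():
    print(name)
    print(counts)
```

Note that the string `'None'` and a missing value (`nan`) are counted separately, which matches the two distinct entries in the `unique()` output above.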


KitchenQual is also categorical and will need converting to numerical.

In [23]:
unique_categories = df['KitchenQual'].unique()
print(unique_categories)

category_counts = df['KitchenQual'].value_counts()
print(category_counts)

['Gd' 'TA' 'Ex' 'Fa']
TA    735
Gd    586
Ex    100
Fa     39
Name: KitchenQual, dtype: int64
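One common way to convert a quality rating like this to numbers is an ordinal mapping. The sketch below assumes the ordering Fa < TA < Gd < Ex, which the notebook does not confirm, and the integer codes are arbitrary. It illustrates the idea and is not the encoding used later in the project.

```python
import pandas as pd

# Assumed ordering from worst to best (hypothetical; check the data dictionary).
kitchen_qual_order = {"Fa": 1, "TA": 2, "Gd": 3, "Ex": 4}

# Illustrative sample with the four categories found above.
sample = pd.DataFrame({"KitchenQual": ["Gd", "TA", "Ex", "Fa"]})

# Categories missing from the mapping would become NaN, which makes gaps easy to spot.
sample["KitchenQual_encoded"] = sample["KitchenQual"].map(kitchen_qual_order)
print(sample)
```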


---

# Missing values

Display the number of missing values per column. Consider dropping columns with a very high proportion of missing values.

In [24]:
missing_values = df.isnull().sum()
print(missing_values[missing_values > 0]) 

2ndFlrSF           86
BedroomAbvGr       99
BsmtFinType1      114
EnclosedPorch    1324
GarageFinish      162
GarageYrBlt        81
LotFrontage       259
MasVnrArea          8
WoodDeckSF       1305
dtype: int64
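Raw counts are easier to judge as a share of the total rows. A minimal sketch that adds a percentage column is shown below. The function name `missing_report` and the sample frame are illustrative assumptions.

```python
import pandas as pd


def missing_report(df):
    """Return missing counts and percentages for columns with any missing values."""
    counts = df.isnull().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    return report[report["missing"] > 0].sort_values("missing", ascending=False)


# Illustrative sample only; the real notebook would pass `df` instead.
sample = pd.DataFrame({
    "a": [1, None, None, None],
    "b": [1, 2, 3, 4],
    "c": [1, None, 3, 4],
})
report = missing_report(sample)
print(report)
```

With the counts shown above and the 1460 rows reported by `df.info()`, EnclosedPorch would come out at roughly 90.7% missing and WoodDeckSF at roughly 89.4%.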


---

# Initial Inspection Summary

There are 4 categorical variables that need converting to numerical values. These are identified as BsmtExposure, BsmtFinType1, GarageFinish and KitchenQual.<br>
EnclosedPorch (1324 missing) and WoodDeckSF (1305 missing) lack values in roughly 90% of the 1460 rows and will therefore be dropped.
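Dropping the two sparse columns could be done along these lines. This is a sketch on an illustrative sample, assuming the drop is applied to a copy rather than to the raw data in place.

```python
import pandas as pd

columns_to_drop = ["EnclosedPorch", "WoodDeckSF"]

# Illustrative sample only; the real notebook would use `df` instead.
sample = pd.DataFrame({
    "1stFlrSF": [856, 1262],
    "EnclosedPorch": [0.0, None],
    "WoodDeckSF": [0.0, None],
    "SalePrice": [208500, 181500],
})

# drop() returns a new DataFrame, so the original sample stays untouched.
cleaned = sample.drop(columns=columns_to_drop)
print(cleaned.columns.tolist())
```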