# **(Data Collection Notebook)**

## Objectives

* Fetch data from Kaggle and save as raw data

## Inputs

* Kaggle JSON file - authentication token used to access and download Kaggle data

## Outputs

* Generate Dataset: outputs/datasets/collection/HousePrices.csv

## Additional Comments

* We use Kaggle data and push it publicly in this case; normally we would not do so.

---

# Change working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [3]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/housepricepred2/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [4]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [5]:
current_dir = os.getcwd()
current_dir

'/workspace/housepricepred2'
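
Re-running the `os.chdir` cell moves up one more level each time, which is easy to do by accident in a notebook. The following is a minimal sketch of a guard against that, assuming the notebooks live in a folder named `jupyter_notebooks` (as in the path shown above). The helper name `project_root_from` is hypothetical and not part of the original notebook.

```python
import os


def project_root_from(cwd: str, notebook_folder: str = "jupyter_notebooks") -> str:
    """Return the parent of cwd, but only when cwd is the notebook folder."""
    # Strip a trailing separator so basename() sees the folder name.
    cwd = cwd.rstrip(os.sep) or os.sep
    if os.path.basename(cwd) == notebook_folder:
        return os.path.dirname(cwd)
    # Already at (or outside) the project root: leave the path unchanged.
    return cwd


root = project_root_from("/workspace/housepricepred2/jupyter_notebooks")
print(root)
```

In the notebook this could be used as `os.chdir(project_root_from(os.getcwd()))`, which is safe to run more than once.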

# Fetch Data From Kaggle To Use In Project

To access the Kaggle API and download the raw data, the first step is to install the Kaggle CLI (Command Line Interface).

In [6]:
! pip install kaggle==1.5.12


[notice] A new release of pip is available: 23.0.1 -> 23.3.2
[notice] To update, run: python3.8 -m pip install --upgrade pip


Drag the kaggle.json token into the base directory. Then make the Kaggle authentication token available for the session.

In [7]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
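
The same token setup can be done in pure Python, which also lets the cell fail early with a clear message when the token is missing. This is a minimal sketch, assuming the token file is named `kaggle.json` and sits in the directory passed in. The function name `prepare_kaggle_token` is hypothetical.

```python
import os
import stat


def prepare_kaggle_token(config_dir: str) -> str:
    """Check the token exists, restrict its permissions, and export its folder."""
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        raise FileNotFoundError(f"No kaggle.json found in {config_dir}")
    # Equivalent to `chmod 600`: read and write for the owner only.
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    # Same environment variable the cell above sets.
    os.environ["KAGGLE_CONFIG_DIR"] = config_dir
    return token_path
```

Calling `prepare_kaggle_token(os.getcwd())` would mirror the cell above.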

* Specify the path of the Kaggle dataset.
* Designate the target folder in which to store the downloaded data.
* Fetch the dataset and save it to the specified folder.

In [8]:
KaggleDatasetPath = "codeinstitute/housing-prices-data"
DestinationFolder = "inputs/datasets/raw"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Downloading housing-prices-data.zip to inputs/datasets/raw
  0%|                                               | 0.00/49.6k [00:00<?, ?B/s]
100%|██████████████████████████████████████| 49.6k/49.6k [00:00<00:00, 2.22MB/s]
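
A `!` shell line in a notebook does not stop execution when the command fails. As a hedged alternative, the sketch below builds the same `kaggle datasets download` command (with the `-d` and `-p` flags used above) and runs it through `subprocess.run` with `check=True`, so a failed download raises an exception. The helper names are hypothetical, and the slug check only assumes the `owner/dataset` form seen in `KaggleDatasetPath`.

```python
import subprocess


def build_download_command(dataset: str, destination: str) -> list:
    """Build the argument list for `kaggle datasets download`."""
    owner, _, name = dataset.partition("/")
    if not owner or not name:
        raise ValueError(f"Expected 'owner/dataset', got {dataset!r}")
    return ["kaggle", "datasets", "download", "-d", dataset, "-p", destination]


def download_dataset(dataset: str, destination: str) -> None:
    """Run the download; check=True raises CalledProcessError on failure."""
    subprocess.run(build_download_command(dataset, destination), check=True)
```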


* Extract the contents of the downloaded file by decompressing it.
* Remove the zip file from the directory.
* Delete the kaggle.json file, if it exists, from your system.

In [9]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm -f kaggle.json

Archive:  inputs/datasets/raw/housing-prices-data.zip
replace inputs/datasets/raw/house-metadata.txt? [y]es, [n]o, [A]ll, [N]one, [r]ename: ^C
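
The output above shows `unzip` asking whether to replace an existing file, a prompt a notebook cell cannot answer. One way around this is to do the extraction with Python's standard `zipfile` module, which overwrites existing files without prompting. This is a sketch only, and the helper names `extract_and_remove_zips` and `remove_if_exists` are hypothetical.

```python
import glob
import os
import zipfile


def extract_and_remove_zips(folder: str) -> list:
    """Extract every .zip in folder into folder, then delete the archives."""
    extracted = []
    for zip_path in sorted(glob.glob(os.path.join(folder, "*.zip"))):
        with zipfile.ZipFile(zip_path) as archive:
            # extractall() overwrites existing files without asking.
            archive.extractall(folder)
            extracted.extend(archive.namelist())
        os.remove(zip_path)
    return extracted


def remove_if_exists(path: str) -> bool:
    """Delete path if present; return True when something was removed."""
    if os.path.exists(path):
        os.remove(path)
        return True
    return False
```

With the variables from the cells above, this would be called as `extract_and_remove_zips(DestinationFolder)` followed by `remove_if_exists("kaggle.json")`.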


## Load and Inspect the Kaggle data

* Import the pandas library into your Python environment.
* Load the dataset and create a pandas DataFrame called 'df' to store the data.
* Display the first five rows of the DataFrame 'df' to view the initial data entries.

In [10]:
import pandas as pd
df = pd.read_csv("inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv")
df.head()

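|   | 1stFlrSF | 2ndFlrSF | BedroomAbvGr | BsmtExposure | BsmtFinSF1 | BsmtFinType1 | BsmtUnfSF | EnclosedPorch | GarageArea | GarageFinish | ... | LotFrontage | MasVnrArea | OpenPorchSF | OverallCond | OverallQual | TotalBsmtSF | WoodDeckSF | YearBuilt | YearRemodAdd | SalePrice |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 856 | 854.0 | 3.0 | No | 706 | GLQ | 150 | 0.0 | 548 | RFn | ... | 65.0 | 196.0 | 61 | 5 | 7 | 856 | 0.0 | 2003 | 2003 | 208500 |
| 1 | 1262 | 0.0 | 3.0 | Gd | 978 | ALQ | 284 | NaN | 460 | RFn | ... | 80.0 | 0.0 | 0 | 8 | 6 | 1262 | NaN | 1976 | 1976 | 181500 |
| 2 | 920 | 866.0 | 3.0 | Mn | 486 | GLQ | 434 | 0.0 | 608 | RFn | ... | 68.0 | 162.0 | 42 | 5 | 7 | 920 | NaN | 2001 | 2002 | 223500 |
| 3 | 961 | NaN | NaN | No | 216 | ALQ | 540 | NaN | 642 | Unf | ... | 60.0 | 0.0 | 35 | 5 | 7 | 756 | NaN | 1915 | 1970 | 140000 |
| 4 | 1145 | NaN | 4.0 | Av | 655 | GLQ | 490 | 0.0 | 836 | RFn | ... | 84.0 | 350.0 | 84 | 5 | 8 | 1145 | NaN | 2000 | 2000 | 250000 |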
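
The path passed to `pd.read_csv` hard-codes a timestamped folder name from the archive, which would break if the archive layout changed. A hedged alternative is to search the raw folder for the CSV by file name. The helper `find_file` below is hypothetical and uses only `os.walk` from the standard library.

```python
import os


def find_file(root: str, filename: str) -> str:
    """Return the path of the first file named filename found under root."""
    for folder, _subfolders, files in os.walk(root):
        if filename in files:
            return os.path.join(folder, filename)
    raise FileNotFoundError(f"{filename} not found under {root}")
```

In the notebook this could be used as `pd.read_csv(find_file("inputs/datasets/raw", "house_prices_records.csv"))`.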


* Column Count and Range Index: This DataFrame has 24 columns in total, comprising one target variable and 23 feature variables. It holds 1460 entries, indexed from 0 to 1459.
* Column Names and Data Types: Each column in the DataFrame is labeled, such as '1stFlrSF', 'BsmtExposure', 'EnclosedPorch', etc. The columns are of different data types: a mix of int64, float64, and object.
* Non-Null Values in Each Column: The summary also shows the number of non-null (non-missing) values in each column. For instance, columns like 'EnclosedPorch' and 'WoodDeckSF' have a significant number of null values; 'EnclosedPorch' has only 136 non-null values out of 1460.
* Memory Usage: The DataFrame's memory usage is also reported, which is approximately 273.9+ KB in this case.
* The df.info() summary is particularly useful for getting an overview of the dataset's structure, understanding the data types present, and identifying columns with missing values.

In [11]:
df.info(max_cols=24)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 24 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   1stFlrSF       1460 non-null   int64  
 1   2ndFlrSF       1374 non-null   float64
 2   BedroomAbvGr   1361 non-null   float64
 3   BsmtExposure   1460 non-null   object 
 4   BsmtFinSF1     1460 non-null   int64  
 5   BsmtFinType1   1346 non-null   object 
 6   BsmtUnfSF      1460 non-null   int64  
 7   EnclosedPorch  136 non-null    float64
 8   GarageArea     1460 non-null   int64  
 9   GarageFinish   1298 non-null   object 
 10  GarageYrBlt    1379 non-null   float64
 11  GrLivArea      1460 non-null   int64  
 12  KitchenQual    1460 non-null   object 
 13  LotArea        1460 non-null   int64  
 14  LotFrontage    1201 non-null   float64
 15  MasVnrArea     1452 non-null   float64
 16  OpenPorchSF    1460 non-null   int64  
 17  OverallCond    1460 non-null   int64  
 18  OverallQ
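
`df.info()` reports non-null counts, so the number of missing values has to be worked out by subtraction. The sketch below turns that into a direct report of missing counts and percentages per column. The helper `missing_value_report` is hypothetical, and the small `toy` frame is made-up illustration data, not the Kaggle dataset.

```python
import numpy as np
import pandas as pd


def missing_value_report(frame: pd.DataFrame) -> pd.DataFrame:
    """List columns with missing values, most affected first."""
    counts = frame.isna().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(1),
        "dtype": frame.dtypes.astype(str),
    })
    # Keep only columns that actually have gaps.
    return report[report["missing"] > 0].sort_values("missing", ascending=False)


# Toy data for illustration only (column names borrowed from the dataset).
toy = pd.DataFrame({
    "1stFlrSF": [856, 1262, 920],
    "EnclosedPorch": [0.0, np.nan, np.nan],
    "BsmtExposure": ["No", "Gd", "Mn"],
})
print(missing_value_report(toy))
```

On the real data this would be called as `missing_value_report(df)`.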

---

# Push files to Repo

In [None]:
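import os

# Create the output folder; exist_ok avoids an error if it already exists
os.makedirs('outputs/datasets/collection', exist_ok=True)

# Save the collected data without the DataFrame index
df.to_csv("outputs/datasets/collection/HousePrices.csv", index=False)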
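
Before pushing the file, it can be worth confirming that the CSV on disk matches the DataFrame in memory. The following is a minimal sketch of such a round-trip check. The helper `save_and_verify` is hypothetical, and it compares only the shape and column names, not every value.

```python
import os

import pandas as pd


def save_and_verify(frame: pd.DataFrame, path: str) -> pd.DataFrame:
    """Write frame to path as CSV, read it back, and compare the structure."""
    folder = os.path.dirname(path)
    if folder:
        os.makedirs(folder, exist_ok=True)
    frame.to_csv(path, index=False)
    reloaded = pd.read_csv(path)
    if reloaded.shape != frame.shape or list(reloaded.columns) != list(frame.columns):
        raise ValueError(f"Saved file {path} does not match the DataFrame")
    return reloaded
```

In this notebook the call would be `save_and_verify(df, "outputs/datasets/collection/HousePrices.csv")`.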