# **(Data Collection Notebook)**

## Objectives

* Fetch data from Kaggle and save as raw data

## Inputs

* Kaggle JSON file - authentication token used to access and download Kaggle data

## Outputs

* Generate Dataset: outputs/datasets/collection/ 

## Additional Comments

* We use Kaggle data and push it to the repository publicly in this case; normally we would not do so.


---

# Change working directory

In [None]:
%pip install -r /workspaces/Housepriceissues/requirements.txt

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
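
Re-running the `os.chdir` cell moves up one more level each time, which can leave the session outside the project. A minimal sketch of a guard, assuming the notebooks live in a folder named `jupyter_notebooks` (a hypothetical name, not taken from this project; adjust it to your layout):

```python
import os

def move_to_project_root(notebook_folder="jupyter_notebooks"):
    """Change to the parent directory only when currently inside notebook_folder.

    notebook_folder is a hypothetical folder name used for illustration.
    Returns the working directory after the call.
    """
    current = os.getcwd()
    # Only step up when the session is still inside the notebook folder
    if os.path.basename(current) == notebook_folder:
        os.chdir(os.path.dirname(current))
    return os.getcwd()
```

Calling `move_to_project_root()` a second time leaves the working directory unchanged, so the cell is safe to re-run.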

# Fetch Data From Kaggle To Use In Project


To access the Kaggle API and download the raw data, the first step is to install the Kaggle CLI (Command Line Interface).

In [None]:
! pip install kaggle==1.5.12

Drag the JSON token into the base directory. Then make the Kaggle authentication token available for the session.

In [None]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
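
Before calling the CLI, it can help to confirm that the token file is actually in place. The following is a minimal sketch, assuming the token is a JSON object with `username` and `key` fields; `check_kaggle_token` is a hypothetical helper written for illustration, not part of the Kaggle package:

```python
import json
import os

def check_kaggle_token(config_dir):
    """Return True if config_dir/kaggle.json exists and has 'username' and 'key' fields."""
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        return False
    with open(token_path) as fh:
        try:
            token = json.load(fh)
        except json.JSONDecodeError:
            # The file exists but is not valid JSON
            return False
    return isinstance(token, dict) and {"username", "key"} <= token.keys()
```

For example, `check_kaggle_token(os.getcwd())` should return `True` once the token has been dragged into the base directory.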

* Specify the location or directory to the Kaggle dataset.
* Designate the target folder where you intend to store the downloaded data.
* Initiate the data retrieval process, which involves fetching and saving the dataset to the specified folder.

In [None]:
KaggleDatasetPath = "codeinstitute/housing-prices-data"
DestinationFolder = "inputs/datasets/raw"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

* Extract the contents of the downloaded file by decompressing it.
* Remove the zip file from the directory.
* Delete the kaggle.json file, if it exists, from your system.

In [None]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm -f kaggle.json
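
On systems without the `unzip` command, the extraction and clean-up can be done with the standard library instead. This is a sketch of that alternative, not the approach used in the cell above; `extract_and_remove_zips` is a hypothetical helper name:

```python
import glob
import os
import zipfile

def extract_and_remove_zips(folder):
    """Extract every .zip in folder into folder, then delete the archive.

    Returns the list of archive paths that were processed.
    """
    processed = []
    for zip_path in sorted(glob.glob(os.path.join(folder, "*.zip"))):
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(folder)
        # Remove the archive only after a successful extraction
        os.remove(zip_path)
        processed.append(zip_path)
    return processed
```

With the variables defined earlier, this would be called as `extract_and_remove_zips(DestinationFolder)`. The token file still needs to be deleted separately.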

---

# Load and Inspect the Kaggle data


* Import the pandas library into your Python environment.
* Load the dataset and create a pandas DataFrame called 'df' to store the data.
* Display the first five rows of the DataFrame 'df' to view the initial data entries

In [None]:
import pandas as pd
df = pd.read_csv("inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv")
df.head()
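
The extracted archive nests the CSV inside a timestamped folder, so the hard-coded path breaks if that folder name ever changes. A minimal sketch of a more tolerant lookup, assuming the file name stays the same; `find_csv` is a hypothetical helper:

```python
from pathlib import Path

def find_csv(root, filename):
    """Return the first path under root whose name matches filename, or None."""
    # rglob searches root and all of its subfolders
    matches = sorted(Path(root).rglob(filename))
    return matches[0] if matches else None
```

For example, `find_csv("inputs/datasets/raw", "house_prices_records.csv")` returns a path that can be passed straight to `pd.read_csv`.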

The `df.info()` call below summarises the structure of the DataFrame:

* Column Count and Range Index: This DataFrame has 24 columns in total: one target variable and 23 feature variables. The RangeIndex line of the output gives the number of entries and the index range.
* Column Names and Data Types: Each column is listed by name, such as 'EnclosedPorch' and 'WoodDeckSF', together with its data type. The columns are a mix of float64, int64, and object types, and the dtypes line of the output gives the count of each.
* Non-Null Values in Each Column: The summary also shows the number of non-null (non-missing) values in each column. For instance, columns like 'EnclosedPorch' and 'WoodDeckSF' have a significant number of null values.
* Memory Usage: The DataFrame's memory usage is reported, which is approximately 273.9+ KB in this case.
* `df.info()` is particularly useful for getting an overview of the dataset's structure, understanding the data types present, and identifying columns with missing values.

In [None]:
df.info(max_cols=24)
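
`df.info()` reports non-null counts, so the number of missing values has to be worked out by subtraction. A minimal sketch that lists the missing values directly; `missing_summary` is a hypothetical helper, not part of pandas:

```python
import pandas as pd

def missing_summary(frame):
    """Return count and percentage of missing values for columns that have any."""
    counts = frame.isnull().sum()
    # Keep only columns with at least one missing value, largest first
    counts = counts[counts > 0].sort_values(ascending=False)
    return pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
```

Calling `missing_summary(df)` gives one row per affected column, which makes it easier to see which features need attention during cleaning.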

---

# Push files to Repo

In [None]:
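import os

# Create the output folder; exist_ok avoids an error when it already exists
os.makedirs('outputs/datasets/collection', exist_ok=True)

# Save the collected data without the index column
df.to_csv("outputs/datasets/collection/HousePrices.csv", index=False)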
