# **Data Collection**

## Objectives

- **Fetch Data from Kaggle**: Obtain the dataset from Kaggle and save it as raw data.
- **Inspect Data**: Review the acquired data.
- **Save Data**: Save the dataset in the designated location under `outputs/datasets/collection`.

## Inputs

- **Kaggle JSON File**: The authentication token required to access the Kaggle dataset.

## Outputs

- **Generated Dataset**: The datasets will be saved in `outputs/datasets/collection` as `house_prices_records.csv` and `inherited_houses.csv`.

## Additional Comments

- In a typical scenario, housing price data is considered sensitive, but for the purpose of this business example, the repository containing the dataset can be found online. Please refer to the `readme.md` file for formatting and further details.


---

# Install Python packages in the notebook

In [None]:
%pip install -r /workspaces/test/requirements.txt

---

# Change working directory

* We assume the notebooks are stored in a subfolder, so when running a notebook in the editor you need to change the working directory.

We need to change the working directory from the current folder to its parent folder.
* We access the current directory with `os.getcwd()`

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory.
* `os.path.dirname()` gets the parent directory
* `os.chdir()` sets the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
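
To illustrate what `os.path.dirname()` returns, here is a minimal sketch on a hypothetical path. The folder `/workspaces/project/jupyter_notebooks` is an assumed example, not necessarily your own layout, and the snippet does not change the working directory.

```python
import os

# Hypothetical notebook folder, used only for illustration
example_dir = "/workspaces/project/jupyter_notebooks"

# os.path.dirname() strips the last path component, giving the parent folder
parent_dir = os.path.dirname(example_dir)
print(parent_dir)
```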

---

# Fetch data from Kaggle

Install the Kaggle package to fetch the data

In [None]:
%pip install kaggle==1.5.12

---

Add the Kaggle API token (`kaggle.json`) to the current working directory, then run the cell below

In [None]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
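
The `chmod 600` step restricts the token file to owner read and write access. As a sketch of what that mode means, the following applies the same permission with `os.chmod` to a throwaway temporary file (a stand-in, not your real `kaggle.json`), assuming a POSIX system.

```python
import os
import stat
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Throwaway file standing in for kaggle.json (illustration only)
    token_path = os.path.join(tmp_dir, "kaggle.json")
    with open(token_path, "w") as f:
        f.write("{}")

    # 0o600: owner may read and write; group and others have no access
    os.chmod(token_path, 0o600)

    # Read the permission bits back to confirm the change
    mode = stat.S_IMODE(os.stat(token_path).st_mode)
    print(oct(mode))
```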

Get the dataset path from the Kaggle URL

* When viewing the dataset on Kaggle, note the part of the URL that comes after `https://www.kaggle.com/`.
* Define the Kaggle dataset path and the destination folder, then download the dataset.

In [None]:
KaggleDatasetPath = "codeinstitute/housing-prices-data"
DestinationFolder = "inputs/datasets/raw"
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}
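
The dataset path used above can also be derived from the URL programmatically. This is a minimal sketch: `kaggle_dataset_path` is a hypothetical helper written for illustration and is not part of the Kaggle package. It assumes the URL has the form `https://www.kaggle.com/<owner>/<dataset>`, optionally with a leading `datasets/` segment.

```python
from urllib.parse import urlparse


def kaggle_dataset_path(url: str) -> str:
    """Return the '<owner>/<dataset>' part of a Kaggle dataset URL."""
    # Split the URL path into its non-empty segments
    parts = [p for p in urlparse(url).path.split("/") if p]
    # Drop an optional leading 'datasets' segment
    if parts and parts[0] == "datasets":
        parts = parts[1:]
    return "/".join(parts[:2])


dataset_path = kaggle_dataset_path(
    "https://www.kaggle.com/codeinstitute/housing-prices-data"
)
print(dataset_path)
```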

Unzip the downloaded file, delete the zip file and delete the kaggle.json file

In [None]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json
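
If the `unzip` command is not available, the same unzip-then-delete step could be sketched with Python's standard `zipfile` module. The example below works on a small stand-in archive inside a temporary folder (the archive name and contents are made up), so it does not touch the real download.

```python
import os
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as dest_folder:
    # Build a small stand-in archive (hypothetical content, for illustration)
    zip_path = os.path.join(dest_folder, "example.zip")
    with zipfile.ZipFile(zip_path, "w") as zf:
        zf.writestr("data/example.csv", "a,b\n1,2\n")

    # Extract everything into the destination folder, then delete the archive
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest_folder)
    os.remove(zip_path)

    # List the files that remain, relative to the destination folder
    extracted = sorted(
        os.path.relpath(os.path.join(root, name), dest_folder)
        for root, _, files in os.walk(dest_folder)
        for name in files
    )
    print(extracted)
```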

---

# Load and Inspect Kaggle data

## Inherited houses dataset

In [None]:
import pandas as pd
df_inherited = pd.read_csv("inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/inherited_houses.csv")
df_inherited.head()

- DataFrame Summary

In [None]:
df_inherited.info()

- **float64**: 7 columns
- **int64**: 12 columns
- **object**: 4 columns

  - Categorical Columns (may need encoding):
    - "KitchenQual"
    - "GarageFinish"
    - "BsmtFinType1"
    - "BsmtExposure"

Columns with data type `object` are typically categorical variables, and most machine learning models need them encoded as numerical values. This is covered in more depth in the feature engineering notebook.
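
As a rough sketch of how non-numeric columns can be listed programmatically rather than read off the `info()` output, the snippet below uses a toy frame. The values are made up and the numeric column names are hypothetical; only `KitchenQual` echoes the dataset.

```python
import pandas as pd

# Toy frame with made-up values (not the real dataset)
toy = pd.DataFrame({
    "area": [8450, 9600, 11250],       # hypothetical integer column
    "frontage": [65.0, 80.0, 68.0],    # hypothetical float column
    "KitchenQual": ["Gd", "TA", "Gd"],  # text column, candidate for encoding
})

# Columns that are not numeric are the candidates for encoding
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(categorical_cols)
```

The same call on `df_inherited` should return the columns listed above, assuming they are the only non-numeric ones.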

## House prices records dataset

In [None]:
df_houseprice = pd.read_csv("inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv")
df_houseprice.head()

- DataFrame Summary

In [None]:
df_houseprice.info()

- Check for duplicates

Duplicated rows can distort analysis and lead to inconsistent reporting, so checking for them is an important data quality step. The cell below lists every row that exactly repeats an earlier row; an empty result means no duplicates were found.

In [None]:
df_houseprice[df_houseprice.duplicated(subset=None, keep='first')]
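
To show how `duplicated(keep='first')` behaves, here is a minimal sketch on a toy frame with made-up values. The first occurrence of a row is not flagged; only later repeats are.

```python
import pandas as pd

# Toy frame with made-up values; the rows at index 1 and 2 are identical
toy = pd.DataFrame({"id": [1, 2, 2, 3], "price": [100, 200, 200, 300]})

# True for every row that repeats an earlier row across all columns
dup_mask = toy.duplicated(subset=None, keep="first")
duplicates = toy[dup_mask]
n_duplicates = int(dup_mask.sum())

# drop_duplicates() removes the flagged rows and keeps the first occurrence
deduplicated = toy.drop_duplicates(keep="first")
print(duplicates)
```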

# Push files to Repo

* The loaded data is saved to `outputs/datasets/collection` so it can be pushed to the repository

In [None]:
import os

# Create the outputs/datasets/collection folder; do nothing if it already exists
os.makedirs(name='outputs/datasets/collection', exist_ok=True)

df_houseprice.to_csv("outputs/datasets/collection/house_prices_records.csv", index=False)
df_inherited.to_csv("outputs/datasets/collection/inherited_houses.csv", index=False)
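
As a quick sanity check on why `index=False` is used, the sketch below saves a toy frame (made-up values) to a temporary folder and reads it back. Without `index=False`, the row index would be written as an extra unnamed column.

```python
import os
import tempfile

import pandas as pd

# Toy frame with made-up values (not the real dataset)
toy = pd.DataFrame({"id": [1, 2, 3], "price": [100, 200, 300]})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "toy.csv")

    # index=False keeps the row index out of the file
    toy.to_csv(csv_path, index=False)

    # Reload to confirm the columns survive the round trip unchanged
    reloaded = pd.read_csv(csv_path)

print(list(reloaded.columns))
```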