# **Data Collection**

## Objectives

* Fetch the electricity cost dataset from Kaggle and save it as raw data
* Load and inspect the dataset
* Save the collected dataset under outputs/datasets/collection

## Inputs

* Kaggle JSON file (authentication token)
* Kaggle dataset: shalmamuji/electricity-cost-prediction-dataset 

## Outputs

* outputs/datasets/collection/ElectricityCost.csv

## Additional Comments

* Initial inspection only; data quality issues will be addressed in later notebooks


---

# Change working directory

We need to change the working directory from the notebook's current folder to its parent folder.
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
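Re-running the `os.chdir` cell above moves one more level up each time. As a hedged sketch, the change can be guarded so that it only happens while the working directory is still the notebook folder. The folder name `jupyter_notebooks` and the helper `resolve_project_root` are assumptions for illustration, not part of this project.

```python
import os
from pathlib import Path

# Hypothetical name of the folder that holds the notebooks; adjust to your project.
NOTEBOOK_DIR_NAME = "jupyter_notebooks"


def resolve_project_root(cwd: Path, notebook_dir_name: str = NOTEBOOK_DIR_NAME) -> Path:
    """Return the parent of cwd only when cwd is the notebook folder, otherwise cwd itself."""
    return cwd.parent if cwd.name == notebook_dir_name else cwd


project_root = resolve_project_root(Path.cwd())
os.chdir(project_root)
print(f"Working directory: {project_root}")
```

With this guard, running the cell a second time leaves the working directory unchanged.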

# Import Dataset from Kaggle

First, the Kaggle API package must be installed before the data can be downloaded.

A registered Kaggle account is required to obtain an API token, which is provided as a JSON file (kaggle.json).

In [None]:
%pip install kaggle==1.5.12

Next, the Kaggle config directory is set to the current working directory, and the permissions on kaggle.json are restricted so that only the file owner can read and write it (mode 600).


In [None]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
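The same permission change can be sketched in pure Python, which also makes it possible to read back the resulting mode. This is an optional alternative on POSIX systems, not a required step; `restrict_to_owner` is a hypothetical helper name.

```python
import os
import stat
from pathlib import Path


def restrict_to_owner(token_path: Path) -> str:
    """Set owner-only read/write (0o600) and return the resulting mode as octal text."""
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    return oct(stat.S_IMODE(token_path.stat().st_mode))


token = Path("kaggle.json")
if token.exists():
    print(restrict_to_owner(token))
else:
    print("kaggle.json not found in the current directory")
```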

Get the dataset path from the Kaggle URL

* When you are viewing the data at Kaggle, check what is after https://www.kaggle.com/datasets/.
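As a small illustration of the rule above, the dataset path can be extracted from a Kaggle dataset URL with the standard library. The helper name `dataset_path_from_url` is hypothetical, and the sketch assumes the URL follows the `/datasets/<owner>/<dataset>` pattern.

```python
from urllib.parse import urlparse


def dataset_path_from_url(url: str) -> str:
    """Return '<owner>/<dataset>' from a URL of the form .../datasets/<owner>/<dataset>."""
    parts = [part for part in urlparse(url).path.split("/") if part]
    if len(parts) < 3 or parts[0] != "datasets":
        raise ValueError(f"Not a Kaggle dataset URL: {url}")
    return "/".join(parts[1:3])


url = "https://www.kaggle.com/datasets/shalmamuji/electricity-cost-prediction-dataset"
print(dataset_path_from_url(url))
```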

Define the Kaggle dataset path and the destination folder, then download the dataset.

In [None]:
KaggleDatasetPath = "shalmamuji/electricity-cost-prediction-dataset"
DestinationFolder = "inputs/datasets/raw"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Unzip the downloaded file, delete the zip file and delete the kaggle.json file

In [None]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
    && rm {DestinationFolder}/*.zip \
    && rm kaggle.json
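On systems where the `unzip` command is unavailable, the extraction step could be sketched with Python's `zipfile` module instead. This is an alternative under that assumption, not the method used in this notebook. `extract_and_remove_zips` is a hypothetical helper, and it leaves kaggle.json untouched.

```python
import zipfile
from pathlib import Path


def extract_and_remove_zips(folder: Path) -> list[str]:
    """Extract every .zip in folder into that folder, delete the archives, return member names."""
    extracted = []
    for zip_path in sorted(folder.glob("*.zip")):
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(folder)
            extracted.extend(archive.namelist())
        zip_path.unlink()  # remove the archive once its contents are extracted
    return extracted


raw_dir = Path("inputs/datasets/raw")
if raw_dir.is_dir():
    print(extract_and_remove_zips(raw_dir))
```

If the shell cell above has already run, no archives remain and the function returns an empty list.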

Check the exact filename Kaggle provides 

In [None]:
os.listdir("inputs/datasets/raw")
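Instead of reading the filename from the listing by eye, the CSV can be located programmatically. The sketch below assumes the raw folder should contain exactly one CSV file and fails loudly otherwise; `find_single_csv` is a hypothetical helper.

```python
from pathlib import Path


def find_single_csv(folder: Path) -> Path:
    """Return the only CSV file in folder; raise if there are none or several."""
    csv_files = sorted(folder.glob("*.csv"))
    if len(csv_files) != 1:
        raise FileNotFoundError(
            f"Expected exactly one CSV in {folder}, found {len(csv_files)}"
        )
    return csv_files[0]


# Example usage once the download has completed:
# csv_path = find_single_csv(Path("inputs/datasets/raw"))
```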

---

# Load and inspect the data

Using the pandas library, the dataset is loaded as a dataframe and briefly inspected to understand its structure.

In [None]:
import pandas as pd

df = pd.read_csv("inputs/datasets/raw/electricity_cost_dataset.csv")
df.head()

The shape of the dataset is checked first, followed by a dataframe summary showing each column's data type and non-null count.

In [None]:
df.shape

In [None]:
df.info()
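The data types reported by `df.info()` can also be used to split columns into numerical and non-numerical groups. The sketch below runs on a toy dataframe whose column names and values are made up for illustration; they are not taken from the real dataset.

```python
import pandas as pd


def split_columns_by_kind(frame: pd.DataFrame) -> tuple[list[str], list[str]]:
    """Return (numerical column names, non-numerical column names)."""
    numerical = frame.select_dtypes(include="number").columns.to_list()
    categorical = frame.select_dtypes(exclude="number").columns.to_list()
    return numerical, categorical


# Toy data with hypothetical column names, for illustration only.
toy = pd.DataFrame(
    {
        "site_area": [100, 250],
        "air_quality_index": [42.5, 60.1],
        "structure_type": ["Residential", "Commercial"],
    }
)
numerical_cols, categorical_cols = split_columns_by_kind(toy)
print(numerical_cols, categorical_cols)
```

Applied to the notebook's `df`, the same call would be `split_columns_by_kind(df)`.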

The dataset consists of numerical and categorical features, which will require encoding and transformation during the modelling phase. The next cell lists any columns that contain missing values.

In [None]:
columns_with_nan = df.columns[df.isna().sum() > 0].to_list()
columns_with_nan
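When missing values are present, a per-column count and percentage is often more informative than a list of names. The following sketch shows one way to build such a report. It runs on a toy dataframe with made-up column names and values, and `missing_value_report` is a hypothetical helper.

```python
import pandas as pd


def missing_value_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages for columns with at least one NaN."""
    counts = frame.isna().sum()
    report = pd.DataFrame(
        {"missing": counts, "percent": (counts / len(frame) * 100).round(2)}
    )
    return report[report["missing"] > 0].sort_values("missing", ascending=False)


# Toy data with hypothetical column names, for illustration only.
toy = pd.DataFrame(
    {
        "site_area": [100.0, None, 250.0, None],
        "structure_type": ["a", "b", None, "d"],
        "cost": [1, 2, 3, 4],
    }
)
print(missing_value_report(toy))
```

For a dataframe without missing values the report is empty, which matches an empty `columns_with_nan` list.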

The dataset contains 10,000 observations and 9 columns. No missing values were detected at this stage. A mix of numerical and categorical features is present, which will be addressed in later notebooks.

---

# Save the dataset

Save the dataset to the outputs/datasets/collection directory for use in subsequent notebooks.

In [None]:
# Create the output folder; exist_ok=True avoids an error if it already exists
os.makedirs("outputs/datasets/collection", exist_ok=True)

df.to_csv("outputs/datasets/collection/ElectricityCost.csv", index=False)
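A quick round-trip check can confirm that the saved file reloads with the same shape and column names. The sketch below is self-contained and uses a toy dataframe and a temporary folder; `save_and_verify` is a hypothetical helper, not part of the notebook's workflow.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_and_verify(frame: pd.DataFrame, path: Path) -> bool:
    """Write frame to path as CSV, reload it, and compare shape and column names."""
    path.parent.mkdir(parents=True, exist_ok=True)
    frame.to_csv(path, index=False)
    reloaded = pd.read_csv(path)
    return reloaded.shape == frame.shape and list(reloaded.columns) == list(frame.columns)


# Toy data written to a temporary folder, for illustration only.
toy = pd.DataFrame({"feature": [1, 2, 3], "cost": [10.5, 20.0, 30.25]})
with tempfile.TemporaryDirectory() as tmp_dir:
    round_trip_ok = save_and_verify(toy, Path(tmp_dir) / "collection" / "Toy.csv")
print(round_trip_ok)
```

In the notebook, the equivalent call would pass `df` and the `outputs/datasets/collection/ElectricityCost.csv` path.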

---

## Conclusions

In this notebook, the following steps were completed:

* The electricity cost dataset was successfully downloaded from Kaggle
* The dataset was loaded and inspected
* No missing values were identified during initial inspection
* The dataset was saved to the outputs directory for further analysis

## Next Steps

The next notebook will focus on exploratory data analysis (EDA) to better understand feature distributions and their relationships with electricity cost.