# **Data Collection**

## Objectives

* Fetch the electricity cost dataset from Kaggle and save it as raw data under inputs/datasets/raw
* Load and inspect the dataset
* Save the collected dataset under outputs/datasets/collection

## Inputs

* Kaggle JSON file (authentication token)
* Kaggle dataset: shalmamuji/electricity-cost-prediction-dataset 

## Outputs

* outputs/datasets/collection/ElectricityCost.csv

## Additional Comments

* Initial inspection only; data quality issues will be addressed in later notebooks


---

# Change working directory

We need to change the working directory from the current folder to its parent folder.
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspaces/PP5-TBC-/jupyter_notebooks'

We want to make the parent of the current directory the new current directory.
* os.path.dirname() returns the parent directory
* os.chdir() sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspaces/PP5-TBC-'
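
Note that re-running the os.chdir() cell moves one level further up each time. The following is a minimal sketch of a guard against that, assuming the notebooks always live in a folder named jupyter_notebooks (as in the path shown above). The helper name `project_root` is hypothetical and not part of the original notebook.

```python
import os


def project_root(cwd, notebooks_dirname="jupyter_notebooks"):
    """Return the parent of `cwd` if `cwd` is the notebooks folder, else `cwd` unchanged.

    `notebooks_dirname` is an assumption based on the path printed above.
    """
    normalised = os.path.normpath(cwd)
    if os.path.basename(normalised) == notebooks_dirname:
        return os.path.dirname(normalised)
    return cwd


# Usage in the notebook (this changes the process working directory):
# os.chdir(project_root(os.getcwd()))
```

With this guard, running the cell a second time leaves the working directory where it is, because the current folder is no longer the notebooks folder.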

# Import Dataset from Kaggle

First, the Kaggle API package must be installed before the dataset can be downloaded.

A registered Kaggle account is required to obtain an API token, which is provided as a JSON file (kaggle.json).

In [4]:
%pip install kaggle==1.5.12

Note: you may need to restart the kernel to use updated packages.


Next, the Kaggle config directory is set to the current working directory, and kaggle.json is restricted to read/write access for the owner only (mode 600).


In [5]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
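
The same permission change can be made from Python instead of the shell. This is a minimal sketch for a POSIX system. The helper name `restrict_to_owner` is hypothetical, and the demonstration uses a throwaway temporary file rather than a real token.

```python
import os
import stat
import tempfile


def restrict_to_owner(path):
    """Set mode 600 on `path`: read/write for the owner, no access for group or others."""
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
    # Return only the permission bits so the result can be checked
    return stat.S_IMODE(os.stat(path).st_mode)


# Demonstration on a temporary file (not on kaggle.json itself)
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    demo_path = tmp.name
demo_mode = restrict_to_owner(demo_path)
os.remove(demo_path)
print(oct(demo_mode))
```

In the notebook, the call would be `restrict_to_owner("kaggle.json")`, run from the folder that contains the token.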

Get the dataset path from the Kaggle URL.

* When viewing the dataset on Kaggle, the dataset path is the part of the URL that follows https://www.kaggle.com/datasets/.
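
As an illustration of the rule above, the dataset path can also be extracted from a URL programmatically. This is a minimal sketch using only the standard library. The function name `dataset_path_from_url` is hypothetical, and the example URL is assembled from the prefix and dataset path given in this notebook.

```python
from urllib.parse import urlparse

DATASETS_PREFIX = "/datasets/"


def dataset_path_from_url(url):
    """Return the '<owner>/<dataset>' part of a Kaggle dataset URL."""
    path = urlparse(url).path
    if not path.startswith(DATASETS_PREFIX):
        raise ValueError(f"Not a Kaggle dataset URL: {url}")
    # Keep only the first two segments after /datasets/ (owner and dataset name)
    parts = path[len(DATASETS_PREFIX):].strip("/").split("/")
    if len(parts) < 2 or not all(parts[:2]):
        raise ValueError(f"URL does not contain an owner and a dataset name: {url}")
    return "/".join(parts[:2])


url = "https://www.kaggle.com/datasets/shalmamuji/electricity-cost-prediction-dataset"
print(dataset_path_from_url(url))
```

Any extra path segments or query parameters after the dataset name are ignored.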

Define the Kaggle dataset path and the destination folder, then download the dataset.

In [6]:
KaggleDatasetPath = "shalmamuji/electricity-cost-prediction-dataset"
DestinationFolder = "inputs/datasets/raw"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

electricity-cost-prediction-dataset.zip: Skipping, found more recently modified local copy (use --force to force download)


Unzip the downloaded file, then delete the zip file and the kaggle.json file.

In [7]:
! unzip -o {DestinationFolder}/*.zip -d {DestinationFolder} \
    && rm {DestinationFolder}/*.zip \
    && rm kaggle.json

Archive:  inputs/datasets/raw/electricity-cost-prediction-dataset.zip
  inflating: inputs/datasets/raw/electricity_cost_dataset.csv  


Check the exact filename of the extracted file.

In [8]:
os.listdir("inputs/datasets/raw")

['electricity_cost_dataset.csv']

---

# Load and inspect the data

Using the pandas library, the dataset is loaded as a dataframe and briefly inspected to understand its structure.

In [9]:
import pandas as pd

df = pd.read_csv("inputs/datasets/raw/electricity_cost_dataset.csv")
df.head()

| | site area | structure type | water consumption | recycling rate | utilisation rate | air qality index | issue reolution time | resident count | electricity cost |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1360 | Mixed-use | 2519.0 | 69 | 52 | 188 | 1 | 72 | 1420.0 |
| 1 | 4272 | Mixed-use | 2324.0 | 50 | 76 | 165 | 65 | 261 | 3298.0 |
| 2 | 3592 | Mixed-use | 2701.0 | 20 | 94 | 198 | 39 | 117 | 3115.0 |
| 3 | 966 | Residential | 1000.0 | 13 | 60 | 74 | 3 | 35 | 1575.0 |
| 4 | 4926 | Residential | 5990.0 | 23 | 65 | 32 | 57 | 185 | 4301.0 |


Next, check the size of the dataset with df.shape, then the column data types and non-null counts with df.info().

In [10]:
df.shape

(10000, 9)

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 9 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   site area             10000 non-null  int64  
 1   structure type        10000 non-null  object 
 2   water consumption     10000 non-null  float64
 3   recycling rate        10000 non-null  int64  
 4   utilisation rate      10000 non-null  int64  
 5   air qality index      10000 non-null  int64  
 6   issue reolution time  10000 non-null  int64  
 7   resident count        10000 non-null  int64  
 8   electricity cost      10000 non-null  float64
dtypes: float64(2), int64(6), object(1)
memory usage: 703.3+ KB


The dataset consists of numerical and categorical features, which will require encoding and transformation during the modelling phase. The following cell lists any columns that contain missing values.

In [12]:
columns_with_nan = df.columns[df.isna().sum() > 0].to_list()
columns_with_nan

[]

The dataset contains 10,000 observations and 9 columns. No missing values were detected at this stage. A mix of numerical and categorical features is present, which will be addressed in later notebooks.
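
The two checks summarised above can be illustrated on a small frame. This is a minimal sketch that rebuilds three columns from the first three rows of the df.head() output. The variable names `sample` and `with_gap` are illustrative only, and the missing value is introduced deliberately to show what the check reports.

```python
import pandas as pd

# First three rows of the df.head() output, restricted to three columns
sample = pd.DataFrame({
    "site area": [1360, 4272, 3592],
    "structure type": ["Mixed-use", "Mixed-use", "Mixed-use"],
    "electricity cost": [1420.0, 3298.0, 3115.0],
})

# Split the columns into numerical and non-numerical (categorical) features
numeric_cols = sample.select_dtypes(include="number").columns.to_list()
categorical_cols = sample.select_dtypes(exclude="number").columns.to_list()

# Same missing-value check as in the notebook, on a copy with one value blanked out
with_gap = sample.copy()
with_gap.loc[0, "electricity cost"] = float("nan")
columns_with_nan = with_gap.columns[with_gap.isna().sum() > 0].to_list()

print(numeric_cols, categorical_cols, columns_with_nan)
```

On the full dataset the same expression returned an empty list, which is why no missing values are reported.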

---

# Save the dataset

Save the dataset to the outputs/datasets/collection directory for use in subsequent notebooks.

In [13]:
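# Create the output folder; report (rather than fail) if it already exists
try:
    os.makedirs("outputs/datasets/collection")
except FileExistsError as e:
    print(e)

# index=False keeps the dataframe index out of the CSV, so only the 9 data columns are written
df.to_csv("outputs/datasets/collection/ElectricityCost.csv", index=False)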

[Errno 17] File exists: 'outputs/datasets/collection'
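
The message above appears because the folder already existed from an earlier run. As an alternative to catching the error, os.makedirs() accepts exist_ok=True, which makes the call safe to repeat. This is a minimal sketch run in a temporary folder. The helper name `ensure_dir` is hypothetical.

```python
import os
import tempfile


def ensure_dir(path):
    """Create `path` (including parents) if needed; do nothing if it already exists."""
    os.makedirs(path, exist_ok=True)
    return os.path.isdir(path)


# Demonstration: calling the helper twice on the same path raises no error
with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "outputs", "datasets", "collection")
    first_call = ensure_dir(target)
    second_call = ensure_dir(target)

print(first_call, second_call)
```

In the notebook, `ensure_dir("outputs/datasets/collection")` would replace the try/except block before the call to df.to_csv().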


---

## Conclusions

In this notebook, the following steps were completed:

* The electricity cost dataset was successfully downloaded from Kaggle
* The dataset was loaded and inspected
* No missing values were identified during initial inspection
* The dataset was saved to the outputs directory for further analysis

## Next Steps

The next notebook will focus on exploratory data analysis (EDA) to better understand the feature distributions and their relationships with electricity cost.