# Data Cleaning & Preprocessing

## Objectives

* Prepare the electricity cost dataset for machine learning
* Apply cleaning and preprocessing steps informed by EDA findings
* Ensure that the features are model-ready while retaining business interpretability

## Inputs

* outputs/datasets/collection/ElectricityCost.csv

## Outputs

* Cleaned dataset for modelling
* Preprocessing logic reusable in the ML pipeline

## Additional Comments

* This step supports **Business Requirement 2** by ensuring the input data is suitable for training a reliable electricity cost prediction model.


---

# Change working directory

We need to change the working directory from the current folder to its parent folder.
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspaces/PP5-TBC-/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspaces/PP5-TBC-'

---

# Load data

Load the dataset produced in the data collection notebook so that it can be cleaned and preprocessed.

In [4]:
import pandas as pd
import numpy as np

df = pd.read_csv("outputs/datasets/collection/ElectricityCost.csv")
df.head()

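|   | site area | structure type | water consumption | recycling rate | utilisation rate | air qality index | issue reolution time | resident count | electricity cost |
|---|-----------|----------------|-------------------|----------------|------------------|------------------|----------------------|----------------|------------------|
| 0 | 1360 | Mixed-use | 2519.0 | 69 | 52 | 188 | 1 | 72 | 1420.0 |
| 1 | 4272 | Mixed-use | 2324.0 | 50 | 76 | 165 | 65 | 261 | 3298.0 |
| 2 | 3592 | Mixed-use | 2701.0 | 20 | 94 | 198 | 39 | 117 | 3115.0 |
| 3 | 966 | Residential | 1000.0 | 13 | 60 | 74 | 3 | 35 | 1575.0 |
| 4 | 4926 | Residential | 5990.0 | 23 | 65 | 32 | 57 | 185 | 4301.0 |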


---

# Initial data checks

Confirm the dataset shape, data types, and absence of missing values before applying transformations.

In [5]:
df.shape

(10000, 9)

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 9 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   site area             10000 non-null  int64  
 1   structure type        10000 non-null  object 
 2   water consumption     10000 non-null  float64
 3   recycling rate        10000 non-null  int64  
 4   utilisation rate      10000 non-null  int64  
 5   air qality index      10000 non-null  int64  
 6   issue reolution time  10000 non-null  int64  
 7   resident count        10000 non-null  int64  
 8   electricity cost      10000 non-null  float64
dtypes: float64(2), int64(6), object(1)
memory usage: 703.3+ KB
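
The non-null counts above already indicate that no values are missing. An explicit check can make that assumption fail loudly if the source file changes. The sketch below is illustrative only: `sample` is a hypothetical stand-in built from a few of the rows shown earlier, and in the notebook the same calls would run on `df`.

```python
import pandas as pd

# Hypothetical stand-in for `df`, built from rows shown in the head() output.
sample = pd.DataFrame({
    "site area": [1360, 4272, 3592],
    "structure type": ["Mixed-use", "Mixed-use", "Mixed-use"],
    "electricity cost": [1420.0, 3298.0, 3115.0],
})

# Count NaN values per column, then sum them into a single figure.
missing_per_column = sample.isna().sum()
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print(f"Total missing values: {total_missing}")

# Stop early if the "no missing values" assumption does not hold.
assert total_missing == 0, "Unexpected missing values found"
```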


---

# Standardise column names

Column names are standardised to snake_case for consistency, readability, and compatibility with ML pipelines.
This addresses minor naming inconsistencies identified during EDA.

In [7]:
df.columns = (
    df.columns
    .str.strip()
    .str.lower()
    .str.replace(" ", "_")
)

df.columns

Index(['site_area', 'structure_type', 'water_consumption', 'recycling_rate',
       'utilisation_rate', 'air_qality_index', 'issue_reolution_time',
       'resident_count', 'electricity_cost'],
      dtype='object')

---

# Correct known column name spelling issues

Minor spelling errors are corrected to avoid confusion and ensure clarity in the subsequent analysis.

In [8]:
df = df.rename(columns={
    "air_qality_index": "air_quality_index",
    "issue_reolution_time": "issue_resolution_time"
})

df.columns

Index(['site_area', 'structure_type', 'water_consumption', 'recycling_rate',
       'utilisation_rate', 'air_quality_index', 'issue_resolution_time',
       'resident_count', 'electricity_cost'],
      dtype='object')
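
Since one of the stated outputs is preprocessing logic that can be reused in the ML pipeline, the two renaming steps above could be wrapped in a single function. This is a minimal sketch, not the notebook's own implementation: the names `SPELLING_FIXES` and `clean_column_names` are hypothetical, and the empty `raw` frame exists only to demonstrate the call.

```python
import pandas as pd

# Hypothetical constant holding the two spelling corrections applied above.
SPELLING_FIXES = {
    "air_qality_index": "air_quality_index",
    "issue_reolution_time": "issue_resolution_time",
}


def clean_column_names(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with snake_case column names and known typos corrected."""
    cleaned = frame.copy()  # leave the caller's DataFrame untouched
    cleaned.columns = (
        cleaned.columns
        .str.strip()
        .str.lower()
        .str.replace(" ", "_")
    )
    return cleaned.rename(columns=SPELLING_FIXES)


# Demonstration on an empty frame that only carries the raw column names.
raw = pd.DataFrame(columns=["site area", "air qality index", "issue reolution time"])
cleaned_names = list(clean_column_names(raw).columns)
print(cleaned_names)
```

Returning a copy rather than modifying the input keeps the function safe to call on the same raw data more than once.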

---

# Encode structure type

One-hot encode the structure type column so that this categorical feature can be used as numeric model input.
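
One possible way to do this is with `pd.get_dummies`. The sketch below is an assumption about how the step might look, not the notebook's final code. The `sample` frame is hypothetical and uses only the two categories visible in the preview above; the full dataset may contain others.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned `df`, using values from the preview.
sample = pd.DataFrame({
    "structure_type": ["Mixed-use", "Mixed-use", "Residential", "Residential"],
    "site_area": [1360, 4272, 966, 4926],
})

# Replace the categorical column with one 0/1 indicator column per category.
encoded = pd.get_dummies(
    sample,
    columns=["structure_type"],
    prefix="structure_type",
    dtype=int,
)

print(encoded.columns.tolist())
```

Whether to drop one category (`drop_first=True`) depends on the model that is chosen later. Linear models can be affected by the redundant column, while tree-based models generally are not.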

---

# Handle resident count

* Preserve zero residents
* Reduce skew
* Improve interpretability
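
A `log1p` transform is one common option that fits these goals, although the notebook does not yet commit to a method, so this is a sketch under that assumption. The `residents` series is hypothetical: the zero is added for illustration, and the other values come from the data preview. Whether the transform actually reduces skew should be checked on the full column.

```python
import numpy as np
import pandas as pd

# Hypothetical values: 0 is illustrative, the rest appear in the data preview.
residents = pd.Series([0, 35, 72, 117, 185, 261], name="resident_count")

# log1p computes log(1 + x), so a count of zero stays defined and maps to 0.
resident_count_log = np.log1p(residents)

# expm1 is the exact inverse, so values can be reported as counts again.
recovered = np.expm1(resident_count_log)

print(pd.DataFrame({
    "resident_count": residents,
    "resident_count_log": resident_count_log,
    "recovered": recovered,
}))
```

Keeping the inverse available means that model inputs can still be explained to stakeholders in terms of actual resident counts.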

---

# Handle water consumption skewness
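
No method is specified for this step yet. One way to decide whether a transform is worthwhile is to compare skewness before and after a candidate transform. The sketch below assumes `log1p` as the candidate and uses only the five water consumption values from the data preview, so the figures are illustrative and would need to be recomputed on the full column.

```python
import numpy as np
import pandas as pd

# The five water consumption values shown in the data preview.
water = pd.Series([2519.0, 2324.0, 2701.0, 1000.0, 5990.0], name="water_consumption")

# Series.skew() returns the sample skewness; values near 0 suggest symmetry.
raw_skew = water.skew()
log_skew = np.log1p(water).skew()

print(f"Skew before transform: {raw_skew:.3f}")
print(f"Skew after log1p:      {log_skew:.3f}")
```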

---


# Conclusions

Content 

# Next Steps

Content