# Part 1: Data Cleaning and Preprocessing

## 1.1 Load and Inspect the Dataset
    a. Load the dataset and display its shape, column names, and data types


In [118]:
import pandas as pd
df = pd.read_csv("Building_Energy_Benchmarking.csv")
print("Shape:", df.shape)
print(('_'*70))
print("List of columns:\n", df.columns)
print(('_'*70))
print("Data Types:\n")
df.info()

Shape: (494, 31)
______________________________________________________________________
List of columns:
 Index(['Property Id', 'Property Name', 'Address 1', 'City', 'Postal Code',
       'Province', 'Primary Property Type - Self Selected',
       'Number of Buildings', 'Year Built',
       'Property GFA - Self-Reported (m²)', 'ENERGY STAR Score',
       'Site Energy Use (GJ)', 'Weather Normalized Site Energy Use (GJ)',
       'Site EUI (GJ/m²)', 'Weather Normalized Site EUI (GJ/m²)',
       'Source Energy Use (GJ)', 'Weather Normalized Source Energy Use (GJ)',
       'Source EUI (GJ/m²)', 'Weather Normalized Source EUI (GJ/m²)',
       'Total GHG Emissions (Metric Tons CO2e)',
       'Total GHG Emissions Intensity (kgCO2e/m²)',
       'Direct GHG Emissions (Metric Tons CO2e)',
       'Direct GHG Emissions Intensity (kgCO2e/m²)',
       'Electricity Use - Grid Purchase (kWh)', 'Natural Gas Use (GJ)',
       'District Hot Water Use (GJ)',
       'Electricity Use – Generated from Onsite 

    b. Identify and list the number of missing values in each column

In [120]:
# Detect Missing Values in the DataFrame
df.isnull()

Unnamed: 0,Property Id,Property Name,Address 1,City,Postal Code,Province,Primary Property Type - Self Selected,Number of Buildings,Year Built,Property GFA - Self-Reported (m²),...,Direct GHG Emissions (Metric Tons CO2e),Direct GHG Emissions Intensity (kgCO2e/m²),Electricity Use - Grid Purchase (kWh),Natural Gas Use (GJ),District Hot Water Use (GJ),Electricity Use – Generated from Onsite Renewable Systems (kWh),Green Power - Onsite and Offsite (kWh),Avoided Emissions - Onsite and Offsite Green Power (Metric Tons CO2e),Year Ending,Unique ID
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,True,True,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,True,True,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,True,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,True,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,True,True,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
489,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,True,True,False,False,False,False
490,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,True,True,False,False,False,False
491,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,True,True,False,False,False,False
492,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,True,False,False,False,False,False


In [121]:
# Count Missing Values per Column
df.isna().sum()

Property Id                                                                0
Property Name                                                              0
Address 1                                                                  0
City                                                                       0
Postal Code                                                                0
Province                                                                   0
Primary Property Type - Self Selected                                      0
Number of Buildings                                                        0
Year Built                                                                 0
Property GFA - Self-Reported (m²)                                          0
ENERGY STAR Score                                                        329
Site Energy Use (GJ)                                                       0
Weather Normalized Site Energy Use (GJ)                                    0
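
The per-column counts are easier to scan when paired with percentages and limited to the columns that actually have gaps. The following is a minimal sketch on a small made-up frame (`toy` and its values are hypothetical, not taken from the benchmarking file); the same expressions should work on `df`.

```python
import pandas as pd

# Hypothetical toy data standing in for the benchmarking file.
toy = pd.DataFrame({
    "Property Id": [1, 2, 3, 4],
    "ENERGY STAR Score": [75.0, None, None, None],
    "District Hot Water Use (GJ)": [None, 12.5, None, 3.0],
})

# One table holding both the count and the share of missing values per column.
missing_summary = pd.DataFrame({
    "missing_count": toy.isna().sum(),
    "missing_pct": (toy.isna().mean() * 100).round(1),
})

# Keep only the columns that have gaps, worst first.
missing_summary = missing_summary[missing_summary["missing_count"] > 0]
missing_summary = missing_summary.sort_values("missing_count", ascending=False)
print(missing_summary)
```

Sorting by the count puts the candidates for dropping at the top of the table.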

## 1.2 Handling Missing Data
    a. Drop columns with more than 40% missing values
    b. For numerical columns, fill missing values with the median of their respective column
    c. For categorical columns, fill missing values with the mode of their respective column

In [123]:
# Percentage of missing data
missing_data = df.isnull().mean() * 100
print("Percentage of Missing Data per Column:\n")
print(missing_data)

# Drop columns with more than 40% missing values
df_cleaned = df.drop(columns=missing_data[missing_data > 40].index)
# Also drop rows that have fewer than 60% non-missing values
df_cleaned = df_cleaned.dropna(thresh=int(df_cleaned.shape[1] * 0.6), axis=0)

# Impute missing values in numerical columns with the column median
numerical_columns = df_cleaned.select_dtypes(include=['float64', 'int64']).columns
df_cleaned[numerical_columns] = df_cleaned[numerical_columns].fillna(df_cleaned[numerical_columns].median())

# Impute missing values in categorical columns with the column mode
categorical_columns = df_cleaned.select_dtypes(include=['object', 'category']).columns
df_cleaned[categorical_columns] = df_cleaned[categorical_columns].fillna(df_cleaned[categorical_columns].mode().iloc[0])

Percentage of Missing Data per Column:

Property Id                                                               0.000000
Property Name                                                             0.000000
Address 1                                                                 0.000000
City                                                                      0.000000
Postal Code                                                               0.000000
Province                                                                  0.000000
Primary Property Type - Self Selected                                     0.000000
Number of Buildings                                                       0.000000
Year Built                                                                0.000000
Property GFA - Self-Reported (m²)                                         0.000000
ENERGY STAR Score                                                        66.599190
Site Energy Use (GJ)                           
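
To check that the drop-and-impute logic behaves as intended, it can help to run it on a frame small enough to verify by hand. This is only a sketch under stated assumptions: `toy`, `threshold`, and the values are hypothetical, and text columns are selected with `exclude="number"` rather than by dtype name.

```python
import pandas as pd

# Hypothetical toy data: one column is 60% missing, the others 20%.
toy = pd.DataFrame({
    "Site Energy Use (GJ)": [10.0, None, 30.0, 50.0, 20.0],
    "ENERGY STAR Score": [None, None, None, 80.0, 60.0],
    "City": ["Calgary", None, "Calgary", "Calgary", "Calgary"],
})

# Drop columns whose share of missing values exceeds the threshold.
threshold = 0.40
missing_share = toy.isna().mean()
kept = toy.loc[:, missing_share <= threshold].copy()

# Median for numeric columns, most frequent value for the remaining columns.
num_cols = kept.select_dtypes(include="number").columns
cat_cols = kept.select_dtypes(exclude="number").columns
kept[num_cols] = kept[num_cols].fillna(kept[num_cols].median())
kept[cat_cols] = kept[cat_cols].fillna(kept[cat_cols].mode().iloc[0])

print(kept)
print("Remaining missing values:", int(kept.isna().sum().sum()))
```

Here the 60% missing column is removed, the numeric gap is filled with the median of the observed values (25.0), and the missing city is filled with the most frequent one.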

## 1.3 Extracting and Cleaning Data Using Regex
    a. Extract numeric values from text-based numeric columns (e.g., Property GFA, Energy Use, Emissions)
    b. Standardize Postal Codes to follow the Canadian format (A1A 1A1)
    c. Clean and extract meaningful text from Property Names and Addresses
    d. Ensure extracted values are properly converted to numerical types for analysis

In [131]:
import re
# Find object columns that contain at least one purely numeric string
def find_str_num(df):
    numeric_columns = []
    for col in df.select_dtypes(include=['object']).columns:
        if df[col].astype(str).str.match(r'^-?\d+(\.\d+)?$', na=False).any():
            numeric_columns.append(col)
    return numeric_columns
names = find_str_num(df_cleaned)

# Extract the first numeric value from each entry of those columns (None if there is none)
def extract_number(x):
    found = re.findall(r'-?\d+\.?\d*', str(x))
    return float(found[0]) if found else None
for col in names:
    df_cleaned[col] = df_cleaned[col].apply(extract_number)

# Standardize Postal Codes to the Canadian format (A1A 1A1); values that do not match become None
def Official_format(postal):
    match = re.match(r'([A-Z]\d[A-Z])\s?(\d[A-Z]\d)', str(postal).upper().strip())
    return f"{match.group(1)} {match.group(2)}" if match else None
df_cleaned["Postal Code"] = df_cleaned["Postal Code"].apply(Official_format)

# Clean Property Names: collapse repeated whitespace into single spaces and trim both ends
df_cleaned['Property Name'] = df_cleaned['Property Name'].apply(lambda x: re.sub(r'\s+', ' ', str(x)).strip())

# Address 1: convert fully capitalized words of three or more letters (e.g. street names) to title case; shorter codes such as ST, AV, or NE stay uppercase
def capitalize_large_words(text):
    return re.sub(r'\b[A-Z]{3,}\b', lambda m: m.group(0).capitalize(), text)
df_cleaned['Address 1'] = df_cleaned['Address 1'].apply(capitalize_large_words)

#Ensure extracted values are properly converted to numerical types for analysis.
df_cleaned['Year Built'] = df_cleaned['Year Built'].astype(int)
df_cleaned['Year Ending'] = df_cleaned['Year Ending'].astype(int)

print(('_'*70))
print("Numeric values from text-based numeric columns:\n")
print(names)

print(('_'*70))
print("Data conversion validation:\n")
df_cleaned.info()

print(('_'*70))
print("Standardized Postal Codes:\n")
print(df_cleaned["Postal Code"].head())

print(('_'*70))
print("Text from Property Names and Addresses:\n")
df_cleaned[['Property Name', 'Address 1']].head(20)

______________________________________________________________________
Numeric values from text-based numeric columns:

[]
______________________________________________________________________
Data conversion validation:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 494 entries, 0 to 493
Data columns (total 26 columns):
 #   Column                                      Non-Null Count  Dtype  
---  ------                                      --------------  -----  
 0   Property Id                                 494 non-null    int64  
 1   Property Name                               494 non-null    object 
 2   Address 1                                   494 non-null    object 
 3   City                                        494 non-null    object 
 4   Postal Code                                 491 non-null    object 
 5   Province                                    494 non-null    object 
 6   Primary Property Type - Self Selected       494 non-null    object 
 7   Number of 
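
The detection step printed an empty list, so no text column held values made up only of digits. If a column stored numbers with thousands separators or unit suffixes, the strict full-match test would skip it. The sketch below shows one more tolerant way to pull a number out of such strings; `parse_number`, `to_number`, and the sample values are hypothetical and are not part of the original notebook.

```python
import re
import pandas as pd

# A comma or whitespace sitting between a digit and a group of exactly three digits.
THOUSANDS = re.compile(r"(?<=\d)[,\s](?=\d{3}\b)")
# Optional sign, digits, optional decimal part.
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")

def parse_number(value):
    """Return the first number found in value as a float, or NaN if there is none."""
    if pd.isna(value):
        return float("nan")
    text = THOUSANDS.sub("", str(value))
    match = NUMBER.search(text)
    return float(match.group()) if match else float("nan")

def to_number(series: pd.Series) -> pd.Series:
    """Apply parse_number to every entry and return a float Series."""
    return series.map(parse_number).astype(float)

# Hypothetical raw values; the real columns may be formatted differently.
raw = pd.Series(["1,234.5", " 980 ", "2 450 GJ", "Not Available", None])
result = to_number(raw)
print(result)
```

Entries without any digits come back as NaN, so they can be counted or imputed in the same way as other missing values.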

Unnamed: 0,Property Name,Address 1
0,Acadia Aquatic & Fitness Centre,9009 Fairmount Dr SE
1,Ad Valorem,2924 11 ST NE
2,Alberta Trade Centre,315 10 AV SE
3,Andrew Davison,133 6 AV SE
4,Animal Services Centre,2201 Portland ST SE
5,Apparatus Repair Shop and Spare Apparatus Shop,1725 18 AV NE
6,Beltline Aquatic & Fitness Centre,221 12 Av SW
7,Bob Bohan Aquatic and Fitness Centre,4812 14 Av SE
8,Bowmont Civic Building,5000 Bowness Civic Building
9,Calgary Public Building,205 8 AV SE
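
The postal code pattern used above accepts any letter in every position and ignores trailing text. Canadian postal codes never use the letters D, F, I, O, Q, or U, and W and Z do not appear as the first letter, so a stricter check is possible. The following is a sketch of such a validator, not the notebook's method; `POSTAL_RE`, `standardize_postal`, and the sample strings are hypothetical.

```python
import re

# Letters D, F, I, O, Q, U are never used; W and Z are excluded from the first position.
POSTAL_RE = re.compile(
    r"([ABCEGHJ-NPRSTVXY]\d[ABCEGHJ-NPRSTV-Z])[\s-]?(\d[ABCEGHJ-NPRSTV-Z]\d)"
)

def standardize_postal(value):
    """Return the code formatted as 'A1A 1A1', or None if it is not a valid pattern."""
    if value is None:
        return None
    # fullmatch rejects strings with extra characters after the code.
    match = POSTAL_RE.fullmatch(str(value).strip().upper())
    return f"{match.group(1)} {match.group(2)}" if match else None

# Hypothetical inputs covering valid, malformed, and disallowed-letter cases.
samples = ["t2p2m5", "T2G 4K8", " t3a-0z1 ", "T2P 2M", "D2P 2M5", "T2P 2M5 extra"]
for sample in samples:
    print(repr(sample), "->", standardize_postal(sample))
```

Comparing the number of `None` results from a stricter check like this with the three nulls reported in the `Postal Code` column would show whether any malformed codes passed the looser pattern.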
