# 1. Electricity Consumption Dataset Preparation

The dataset chosen for analysis is the Electricity Consumption dataset from Ireland, provided by the Commission for Energy Regulation (CER) as part of the CER Smart Metering initiative. This publicly available dataset records electricity consumption at 30-minute intervals, giving 48 readings per meter per day.

## Project Background

The dataset is sourced from the "Electricity Customer Behaviour Trial, 2009-2010, 1st Edition," archived by the Irish Social Science Data Archive in 2012. The data is publicly accessible at [this URL](https://www.ucd.ie/issda/data/commissionforenergyregulationcer/).

## Dataset Details

- **File Structure:** Six zipped files (File1.txt.zip to File6.txt.zip), each containing a single text file.
- **Data Format:** Each data file consists of three columns:
  1. Meter ID
  2. Five-digit code, made up of:
     - Day code (digits 1-3, with day 1 corresponding to January 1, 2009)
     - Time code (digits 4-5, numbering the 30-minute intervals of the day from 1 to 48, with 1 corresponding to 00:00:00 to 00:29:59)
  3. Electricity consumed during the 30-minute interval (in kWh)
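To make the five-digit code concrete, here is a minimal sketch that decodes one record into a meter ID, an interval start time, and a consumption value. The function name `parse_record`, the constant `TRIAL_EPOCH`, and the sample line are hypothetical and used only for illustration. The sketch assumes time codes run from 1 to 48 and rejects anything else.

```python
from datetime import datetime, timedelta

# Day code 1 corresponds to January 1, 2009 (from the dataset description)
TRIAL_EPOCH = datetime(2009, 1, 1)

def parse_record(line):
    """Decode one 'meter_id code kwh' line (hypothetical helper)."""
    meter_id, code, kwh = line.split()
    day_code = int(code[:3])    # digits 1-3
    time_code = int(code[3:5])  # digits 4-5
    # Assumption: only codes 1-48 are valid half-hour intervals
    if not 1 <= time_code <= 48:
        raise ValueError(f'unexpected time code {time_code}')
    start = TRIAL_EPOCH + timedelta(days=day_code - 1,
                                    minutes=30 * (time_code - 1))
    return meter_id, start, float(kwh)

# Illustrative line, not taken from the dataset
print(parse_record('1392 19503 0.14'))
# → ('1392', datetime.datetime(2009, 7, 14, 1, 0), 0.14)
```

Day code 195 falls on July 14, 2009, and time code 03 is the third half-hour of the day, which starts at 01:00.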

## Unify Data

This code concatenates multiple text files into a single one. The list `files` holds the names of the input files, which are assumed to have been extracted from their zip archives already. The concatenated result is saved in a new file named `FileData.txt`.

In [None]:
# List of input files to be concatenated
files = ['File1.txt', 'File2.txt', 'File3.txt', 'File4.txt', 'File5.txt', 'File6.txt']

# Output file where the concatenated data will be saved
output_file = 'FileData.txt'

# Function to concatenate files
def concatenate_files(input_files, output_file):
    # Open the output file in write mode
    with open(output_file, 'w') as output:
        # Iterate through each input file
        for file in input_files:
            # Open the current input file in read mode
            with open(file, 'r') as f:
                # Read the content of the input file
                content = f.read()
                # Write the content to the output file
                output.write(content)

# Call the function to concatenate files
concatenate_files(files, output_file)

# Print a message indicating the successful concatenation
print(f'Files concatenated into {output_file}')
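Since the files are distributed as zip archives, the extraction step can be folded into the concatenation. The following is a sketch only, assuming each archive holds exactly one text member, as described under Dataset Details. The function name `concatenate_zipped` is hypothetical. It streams each member in chunks rather than reading whole files into memory.

```python
import shutil
import zipfile

def concatenate_zipped(zip_paths, output_path):
    """Append the single text member of each archive to output_path."""
    with open(output_path, 'wb') as output:
        for zip_path in zip_paths:
            with zipfile.ZipFile(zip_path) as archive:
                members = archive.namelist()
                # Assumption: one text file per archive
                if len(members) != 1:
                    raise ValueError(
                        f'{zip_path}: expected 1 member, found {len(members)}')
                with archive.open(members[0]) as member:
                    # Copy in chunks so large files are never fully in memory
                    shutil.copyfileobj(member, output)
```

A call such as `concatenate_zipped([f'File{i}.txt.zip' for i in range(1, 7)], 'FileData.txt')` would produce the same output file as the cell above. Like that cell, the sketch does not insert a newline between files, so it relies on every input ending with one.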

## Clean Data

This code cleans and processes the household electricity consumption data. The half-hourly readings are summed into hourly values, so each observation holds 24 values: the consumption of one consumer for each hour of one day. The following filters are then applied:

- All daily observations of non-household consumers, defined as those whose consumption exceeds 15 kWh in any hour of any day, have been removed.
- Observations where the consumption of any hour is missing have been removed.
- Observations with total daily consumption less than 100 Wh have been removed.
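The aggregation and the filters can be illustrated on a single toy day. This is a minimal sketch, not the notebook's own implementation: the helper names `to_hourly_wh`, `keep_day`, and `is_household` are hypothetical, and a zero hourly value is treated as a missing reading, which mirrors how the cleaning cell handles it.

```python
def to_hourly_wh(half_hourly_kwh):
    """Sum 48 half-hourly kWh readings into 24 hourly totals in Wh."""
    if len(half_hourly_kwh) != 48:
        raise ValueError('expected 48 half-hourly readings')
    return [
        (half_hourly_kwh[2 * h] + half_hourly_kwh[2 * h + 1]) * 1000
        for h in range(24)
    ]

def keep_day(hourly_wh, low_threshold=100):
    """Keep a day with no zero (missing) hour and a total above the threshold."""
    return all(v != 0 for v in hourly_wh) and sum(hourly_wh) > low_threshold

def is_household(days, high_threshold=15000):
    """A consumer is a household if no hour of any day exceeds the threshold."""
    return all(max(hourly_wh) <= high_threshold for hourly_wh in days)

# Toy day: 0.25 kWh in every half-hour interval
day = to_hourly_wh([0.25] * 48)
print(day[0], keep_day(day), is_household([day]))
# → 500.0 True True
```

The thresholds are expressed in Wh because the readings are converted from kWh before filtering, so 15 kWh becomes 15000 Wh.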

In [None]:
import csv

In [None]:
# Input and output files
input_file = 'FileData.txt'
output_file = 'FinalData.csv'

# Dictionary to store the data
data = {}

# Read the input file and store the lines
with open(input_file, 'r') as file:
    lines = file.readlines()

# Process each line in the input file
for line in lines:
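    # Split the line into its three columns: meter ID, five-digit code, and consumption
    ID, date, consumption = line.split()

    # Extract the day code (digits 1-3) and the half-hour time code (digits 4-5)
    day = int(date[:3])
    half_hour = int(date[3:5])
    # Map time codes 1-48 to hours 0-23 (two half-hour intervals per hour)
    hour = (half_hour - 1) // 2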
    
    # Initialize the data dictionary if necessary
    if ID not in data:
        data[ID] = {}
    
    if day not in data[ID]:
        data[ID][day] = [0.0] * 24

    # Add the reading to its hour; time codes outside 1-48 fall outside hours 0-23 and are skipped
    if 0 <= hour <= 23:
        data[ID][day][hour] += float(consumption) * 1000  # Convert kWh to Wh

# Additional filters:
# - Total daily consumption must be greater than 100 Wh
# - Maximum consumption in an hour must not exceed 15000 Wh
# - No zero values for any hour

low_threshold = 100
high_threshold = 15000

# Identify IDs with daily consumption below the minimum threshold
IDs_below_threshold = {
    ID for ID, date_and_consumption in data.items()
    if any(sum(consumption_list) < low_threshold for day, consumption_list in date_and_consumption.items())
}

# Count the total number of days meeting the condition
total_days_below_threshold = sum(
    sum(sum(consumption_list) < low_threshold for day, consumption_list in date_and_consumption.items())
    for ID, date_and_consumption in data.items()
)

# Identify IDs with maximum consumption exceeding 15000 Wh in any hour
IDs_above_threshold = {
    ID for ID, date_and_consumption in data.items()
    if any(max(consumption_list) > high_threshold for day, consumption_list in date_and_consumption.items())
}

# Remove days with a zero value for any hour (an hour with no reading keeps
# its initial 0.0, so this also drops days with missing data)
cleaned_data = {
    ID: {
        day: consumption_list
        for day, consumption_list in date_and_consumption.items()
        if all(hour_consumption != 0 for hour_consumption in consumption_list)
    }
    for ID, date_and_consumption in data.items()
}

# Filter the data based on additional conditions
filtered_data = {
    ID: {
        day: consumption_list
        for day, consumption_list in date_and_consumption.items()
        if sum(consumption_list) > low_threshold
    }
    for ID, date_and_consumption in cleaned_data.items()
    if ID not in IDs_above_threshold
}

# Create the CSV file
with open(output_file, 'w', newline='') as file:
    writer = csv.writer(file)
    # Header
    writer.writerow(['ID', 'date', 'H0', 'H1', 'H2', 'H3', 'H4', 'H5', 'H6', 'H7', 'H8', 'H9', 'H10', 'H11', 'H12', 'H13', 'H14', 'H15', 'H16', 'H17', 'H18', 'H19', 'H20', 'H21', 'H22', 'H23'])
    
    # Write each row to the CSV file
    for ID, date_and_consumption in filtered_data.items():
        for day, consumption_list in date_and_consumption.items():
            # Build the row: ID, day code, then the 24 hourly values in Wh
            row = [ID, day] + consumption_list
            writer.writerow(row)
            

The following code snippet prints various statistics related to the processed electricity consumption data. It provides insights into the distribution of users based on daily consumption and the duration of data available for each user.

In [None]:
# Print the number of users with daily consumption below 100 Wh
print('Number of users with daily consumption below 100 Wh:', len(IDs_below_threshold))  # 220

# Print the total number of days with daily consumption below 100 Wh
print('Total number of days with daily consumption below 100 Wh:', total_days_below_threshold)  # 15729

# Print the number of users with maximum consumption above 15000 Wh
print('Number of users with maximum consumption above 15000 Wh:', len(IDs_above_threshold))  # 946

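# Reload the cleaned data from the CSV written above. The cell that originally
# defined `loaded_filtered_data` is not shown, so this reload is an assumption.
loaded_filtered_data = {}
with open(output_file, newline='') as file:
    reader = csv.reader(file)
    next(reader)  # Skip the header row
    for ID, day, *values in reader:
        loaded_filtered_data.setdefault(ID, {})[int(day)] = [float(v) for v in values]

# Additional statistics for the loaded filtered data
print('Number of unique users:', len(loaded_filtered_data))  # 5489
print('Number of rows:', sum(len(date_and_consumption) for date_and_consumption in loaded_filtered_data.values()))  # 2772102
print('Number of users with more than 365 days of data:', sum(len(date_and_consumption) > 365 for date_and_consumption in loaded_filtered_data.values()))  # 5118
print('Number of users with 365 days or fewer of data:', sum(len(date_and_consumption) <= 365 for date_and_consumption in loaded_filtered_data.values()))  # 371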