
In [None]:
import pandas as pd

In [None]:
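# Load the CSV named in the task description below (an empty path would raise an error).
df = pd.read_csv('/content/LCL-June2015v2_99.csv')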

# Task
Load, explore, and summarize the dataset from the CSV file '/content/LCL-June2015v2_99.csv' to understand its structure, content, and potential use cases, including identifying column meanings and generating descriptive statistics.

## Load Dataset

### Subtask:
Load the provided CSV file '/content/LCL-June2015v2_99.csv' into a pandas DataFrame.


**Reasoning**:
Load the specified CSV file into a pandas DataFrame and display its first few rows.



In [None]:
df = pd.read_csv('/content/LCL-June2015v2_99.csv')
df.head()

|   | LCLid | stdorToU | DateTime | KWH/hh (per half hour) |
|---|-------|----------|----------|------------------------|
| 0 | MAC004221 | Std | 2013-08-06 07:00:00.0000000 | 0.046 |
| 1 | MAC004221 | Std | 2013-08-06 07:30:00.0000000 | 0.04 |
| 2 | MAC004221 | Std | 2013-08-06 08:00:00.0000000 | 0.016 |
| 3 | MAC004221 | Std | 2013-08-06 08:30:00.0000000 | 0.018 |
| 4 | MAC004221 | Std | 2013-08-06 09:00:00.0000000 | 0.016 |


## Explore Data Structure

### Subtask:
Display column names, data types, and non-null counts of the dataset to understand its basic structure.


**Reasoning**:
To display column names, data types, and non-null counts, I will use the `.info()` method on the DataFrame `df`.



In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 4 columns):
 #   Column                   Non-Null Count    Dtype 
---  ------                   --------------    ----- 
 0   LCLid                    1000000 non-null  object
 1   stdorToU                 1000000 non-null  object
 2   DateTime                 1000000 non-null  object
 3   KWH/hh (per half hour)   1000000 non-null  object
dtypes: object(4)
memory usage: 30.5+ MB


## Generate Descriptive Statistics

### Subtask:
Calculate and display descriptive statistics for numerical columns and unique value counts for categorical columns to get an overview of the data distribution.


**Reasoning**:
The 'DateTime' and 'KWH/hh (per half hour)' columns need to be converted to appropriate data types before descriptive statistics can be generated. 'DateTime' will be converted to datetime objects, and 'KWH/hh (per half hour)' will be converted to numeric, with errors coerced to NaN.



In [None]:
df['DateTime'] = pd.to_datetime(df['DateTime'])
df['KWH/hh (per half hour)'] = pd.to_numeric(df['KWH/hh (per half hour)'], errors='coerce')
print("Data types after conversion:")
df.info()

KeyError: 'KWH/hh (per half hour)'

**Reasoning**:
The previous code failed with a KeyError, most likely because the column name 'KWH/hh (per half hour)' contains hidden leading or trailing whitespace in the file. This is consistent with the earlier `df.info()` output, where the column-name field is padded wider than the visible name. To fix this, I will first strip leading and trailing whitespace from all column names. Then I will convert 'DateTime' to datetime and 'KWH/hh (per half hour)' to numeric, coercing errors to NaN, and finally display the descriptive statistics and unique value counts as requested.
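
Before stripping, the whitespace hypothesis can be checked directly. The following is a minimal sketch, assuming a trailing space in the last header; the in-memory CSV and the names `sample`, `raw_names`, and `padded` are illustrative stand-ins, not part of the original notebook.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for the real file (assumption: the last header
# carries a trailing space, as the KeyError suggests).
sample_csv = io.StringIO(
    "LCLid,stdorToU,DateTime,KWH/hh (per half hour) \n"
    "MAC004221,Std,2013-08-06 07:00:00.0000000,0.046\n"
)
sample = pd.read_csv(sample_csv)

# repr() makes leading/trailing whitespace visible in each column name.
raw_names = [repr(name) for name in sample.columns]
print(raw_names)

# Names that change when stripped are the ones that would trigger a KeyError
# when addressed by their "clean" spelling.
padded = [name for name in sample.columns if name != name.strip()]
print(padded)
```

On the real DataFrame, the same check would be `[repr(c) for c in df.columns]`, run before `df.columns.str.strip()` is applied.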



In [None]:
df.columns = df.columns.str.strip()
df['DateTime'] = pd.to_datetime(df['DateTime'])
df['KWH/hh (per half hour)'] = pd.to_numeric(df['KWH/hh (per half hour)'], errors='coerce')
print("Data types after conversion:")
df.info()

print("\nDescriptive statistics for numerical columns:")
print(df.describe())

print("\nUnique value count for 'LCLid':")
print(df['LCLid'].nunique())

print("\nUnique value count for 'stdorToU':")
print(df['stdorToU'].nunique())

Data types after conversion:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 4 columns):
 #   Column                  Non-Null Count    Dtype         
---  ------                  --------------    -----         
 0   LCLid                   1000000 non-null  object        
 1   stdorToU                1000000 non-null  object        
 2   DateTime                1000000 non-null  datetime64[ns]
 3   KWH/hh (per half hour)  999965 non-null   float64       
dtypes: datetime64[ns](1), float64(1), object(2)
memory usage: 30.5+ MB

Descriptive statistics for numerical columns:
                            DateTime  KWH/hh (per half hour)
count                        1000000           999965.000000
mean   2013-04-02 14:47:53.520819712                0.219970
min              2012-05-23 10:30:00                0.000000
25%              2012-10-24 20:00:00                0.063000
50%              2013-03-28 14:30:00                0.127000
75%

## Identify Column Meanings

### Subtask:
Attempt to infer the meaning and purpose of each column based on its name, data type, and values, and provide any contextual information that can be deduced.


### Column Meanings and Purpose

Based on the column names, data types, and descriptive statistics, we can infer the following about each column:

*   **`LCLid`**: This column likely represents a unique identifier for each customer or household participating in the study. It has 36 unique values, so the dataset contains readings from 36 distinct electricity meters or locations, which allows per-customer analysis.

*   **`stdorToU`**: This column appears to indicate the type of tariff or pricing scheme applied to the electricity consumption. With only one unique value ('Std'), it suggests that all entries in this specific dataset correspond to a 'Standard' tariff. This implies that the dataset might be a subset focusing on a particular tariff type, or that time-of-use (ToU) tariffs are not represented in this particular data extract.

*   **`DateTime`**: This column stores the timestamp of each electricity consumption measurement. It has been successfully converted to a datetime type, allowing for time-series analysis. The data ranges from `2012-05-23 10:30:00` to `2014-02-28 00:00:00`, which is about 21 months of half-hourly consumption data.

*   **`KWH/hh (per half hour)`**: This is the core consumption column: the electricity consumed, in kilowatt-hours (kWh), during each half-hour period. It has been converted to a numeric (`float64`) type. The descriptive statistics show values ranging from a minimum of `0.000` kWh (no consumption recorded in that half hour) to a maximum of `4.330` kWh. Only `999965` of the `1000000` entries are non-null after conversion, so 35 raw values could not be parsed as numbers and were coerced to `NaN`. Because the original `object` column reported no nulls, these 35 entries were non-numeric strings rather than empty cells.
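
To see which raw strings were coerced, one option is to compare the column before and after `pd.to_numeric`. This is a sketch on a small hypothetical Series; the placeholder text `"Null"` is an assumption for illustration, and the actual non-numeric strings in the file are not shown in this notebook.

```python
import pandas as pd

# Hypothetical raw column; "Null" is an assumed placeholder value,
# not something confirmed by the notebook output.
raw = pd.Series(["0.046", "0.04", "Null", "0.016"], name="KWH/hh (per half hour)")
converted = pd.to_numeric(raw, errors="coerce")

# Entries that were present in the raw data but are NaN after conversion
# are exactly the ones that errors="coerce" replaced.
coerced_mask = raw.notna() & converted.isna()
n_coerced = int(coerced_mask.sum())

# Tally the offending raw strings to see what they actually contain.
print(raw[coerced_mask].value_counts())
print(n_coerced)
```

In the notebook itself the column has already been overwritten with its numeric version, so this comparison would need a fresh copy of the raw column (for example, from reloading the CSV).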

## Final Task

### Subtask:
Provide a comprehensive summary of the dataset, including insights into its attributes, their meanings, and any initial observations or potential use cases.


## Summary:

### Data Analysis Key Findings

*   The dataset `LCL-June2015v2_99.csv` was successfully loaded, containing 1,000,000 entries across four columns: `LCLid`, `stdorToU`, `DateTime`, and `KWH/hh (per half hour)`.
*   Initial data exploration showed that all four columns were loaded as `object` dtype and that none contained null values at load time.
*   Data type conversions were performed:
    *   The `DateTime` column was converted to `datetime64[ns]`, with data spanning from `2012-05-23 10:30:00` to `2014-02-28 00:00:00`.
    *   The `KWH/hh (per half hour)` column was converted to `float64`; 35 out of 1,000,000 values were coerced to `NaN` during this process.
*   Descriptive statistics for `KWH/hh (per half hour)` show a mean consumption of approximately `0.22` kWh per half-hour, ranging from `0.00` to `4.33` kWh.
*   The `LCLid` column contains 36 unique identifiers, suggesting data from multiple customers or households.
*   The `stdorToU` column has only one unique value ('Std'), indicating that all entries in this subset relate to a "Standard" tariff.
*   **Column Meanings**:
    *   `LCLid`: Unique identifier for each customer/household.
    *   `stdorToU`: Tariff type, consistently 'Standard' in this dataset.
    *   `DateTime`: Timestamp of electricity consumption measurements.
    *   `KWH/hh (per half hour)`: Electricity consumed in kilowatt-hours per half-hour period.
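
The length of the observation window follows from the two timestamps reported above. The sketch below works through that arithmetic; the 30.44-day average month is an approximation, and `slots` is a hypothetical upper bound on readings for a single meter that reported in every half hour of the window.

```python
import pandas as pd

# First and last timestamps as reported in the findings above.
start = pd.Timestamp("2012-05-23 10:30:00")
end = pd.Timestamp("2014-02-28 00:00:00")

span = end - start
# Convert to months using an average month length (approximation).
approx_months = span.days / 30.44

# Half-hour slots between start and end, inclusive of both endpoints.
slots = span // pd.Timedelta(minutes=30) + 1

print(span)                     # 645 days 13:30:00
print(round(approx_months, 1))  # 21.2
print(slots)                    # 30988
```

Since 36 meters times 30,988 slots exceeds the 1,000,000 rows in the file, not every meter can cover the full window without gaps; checking per-meter coverage would be a natural follow-up.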

### Insights or Next Steps

*   The dataset covers about 21 months of half-hourly electricity consumption for 36 unique meter IDs, all on the standard tariff. This makes it well suited to time-series analysis, consumption-pattern identification, and customer-behavior studies for standard-tariff customers.
*   Address the 35 `NaN` values in the `KWH/hh (per half hour)` column (e.g., imputation or removal) before conducting in-depth numerical analysis. Further investigation into the specific dates/times and `LCLid` for these `NaN` values could be insightful.
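
The two options for handling the `NaN` values, followed by a simple daily aggregation, might look like the sketch below. It runs on a small synthetic frame with made-up meter IDs (`MAC_A`, `MAC_B`) and values; per-meter linear interpolation is one possible imputation choice, not a method the notebook prescribes.

```python
import numpy as np
import pandas as pd

col = "KWH/hh (per half hour)"

# Small synthetic frame standing in for the cleaned DataFrame (illustrative values).
times = pd.date_range("2013-08-06 00:00", periods=4, freq="30min")
demo = pd.DataFrame({
    "LCLid": ["MAC_A"] * 4 + ["MAC_B"] * 4,
    "DateTime": list(times) * 2,
    col: [0.1, np.nan, 0.3, 0.2, 0.5, 0.5, np.nan, 0.5],
})

# Locate the gaps: which meter and which timestamp.
missing = demo.loc[demo[col].isna(), ["LCLid", "DateTime"]]

# Option A: drop the incomplete readings.
dropped = demo.dropna(subset=[col])

# Option B: interpolate linearly within each meter's own series,
# so values never leak from one meter into another.
filled = demo.sort_values(["LCLid", "DateTime"]).copy()
filled[col] = filled.groupby("LCLid")[col].transform(
    lambda s: s.interpolate(limit_direction="both")
)

# Daily total consumption per meter, a common starting point for forecasting.
daily = filled.set_index("DateTime").groupby("LCLid")[col].resample("D").sum()
print(missing)
print(daily)
```

With only 35 gaps in a million rows, either option is unlikely to change aggregate statistics much; interpolation mainly matters when a model needs an unbroken half-hourly series.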
