## Predicting Energy Consumption in Low Energy Houses Using Data Driven Approach

### Exploratory Data Analysis

The data set is from https://www.kaggle.com/datasets/sohommajumder21/appliances-energy-prediction-data-set and was originally published at https://archive.ics.uci.edu/dataset/374/appliances+energy+prediction

In [None]:
# Import Libraries

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

In [None]:
#Load the data
df = pd.read_csv('energydata.csv')

In [None]:
#inspect appearance of rows and columns
df.head()

In [None]:
df.tail()

In [None]:
#checking dimensionality
df.shape

In [None]:
#looking for data that needs to be converted (date column is non-numerical)
df.dtypes

In [None]:
#no missing value
df.isnull().sum()

In [None]:
# no duplicated rows (duplicated() compares whole rows, not individual columns)
df.duplicated().sum()

In [None]:
df.info()
# Overall, the 19735 x 29 data set is already clean, and date is the only object-type column

In [None]:
#Descriptive Statistics
df.describe()
# Of all the variables, the target "Appliances" has the highest std at 102.52 Wh, which indicates a wide spread and likely outliers

In [None]:
# Set up a dictionary of units to make the graphs easier to understand

#dictionary to store the units for each column
columns_units = {
    'Appliances': 'energy (Wh)',
    'lights': 'energy (Wh)',
    'T1': 'Temperature:kitchen, degC',
    'RH_1': 'Humidity:kitchen, %',
    'T2': 'Temperature:living rm, degC',
    'RH_2': 'Humidity:living rm, %',
    'T3': 'Temperature:laundry rm, degC',
    'RH_3': 'Humidity:laundry rm, %',
    'T4': 'Temperature:office rm, degC',
    'RH_4': 'Humidity:office rm, %',
    'T5': 'Temperature:bathroom, degC',
    'RH_5': 'Humidity:bathroom, %',
    'T6': 'Temperature outside bldg (North), degC',
    'RH_6': 'Humidity outside bldg (North), %',
    'T7': 'Temperature:ironing rm, degC',
    'RH_7': 'Humidity:ironing rm, %',
    'T8': 'Temperature: teen\'s rm 2, degC',
    'RH_8': 'Humidity:teen\'s rm2, %',
    'T9': 'Temperature:parents rm, degC',
    'RH_9': 'Humidity:parents rm, %',
    'T_out': 'Temperature outside, degC',
    'Press_mm_hg': 'Pressure outside, mmHg',
    'RH_out': 'Humidity outside, %',
    'Windspeed': 'Wind speed, outside, m/s',
    'Visibility': 'Visibility, outside, km',
    'Tdewpoint': 'Dewpoint, outside, °C',
    'rv1': 'Random var1',
    'rv2': 'Random var2'
}

# Adjust the figure size
plt.figure(figsize=(15, 10))  

# Loop over columns and create subplots
for i, col in enumerate(df.columns):
    plt.subplot(6, 5, i + 1)  
    sns.histplot(df[col], kde=True)  # kde=True adds the density plot
    
    # Set the title of the plot (column name as the plot title)
    plt.title(f"{col}")
    
    # Set the x-axis label to the unit from the dictionary (empty string if not found)
    plt.xlabel(f"{columns_units.get(col, '')}")
    plt.ylabel("Count")  # histplot's default statistic is the count per bin, not a density

# Adjust layout to prevent overlapping of titles and labels
plt.tight_layout()
plt.show()


# The dependent variable "Appliances" shows a skewed distribution with a long tail to the right.
# The independent variables, on the other hand, show several types of distribution:

## T1, RH_1, T2, RH_2, T4, RH_5, T6, Tdewpoint, T_out and Press_mm_hg look normally distributed with slight skewness
## T3, RH_3, RH_4, T5, T7, RH_8, T9 and RH_9 show twin peaks, which could indicate a mix of two different processes or populations
## RH_6 shows no clear pattern
## date is a single block, as expected for uniform data collection at 10-minute intervals from Jan-May 2016
## rv1 and rv2 are synthetic data included to represent uncertain factors

In [None]:
# Adjust the figure size for the boxplots
plt.figure(figsize=(15, 10))  # Adjust the figure size if necessary

# Loop over columns and create subplots for boxplots
for i, col in enumerate(df.columns):
    plt.subplot(6, 5, i + 1)  # Adjust the number of rows and columns for better layout (6 rows, 5 columns)
    sns.boxplot(x=df[col])  # Create a boxplot for the column
    
    # Set the title of the plot (Column name as the plot title)
    plt.title(f"{col}")
    
    # Set the x-axis label as the column name with the unit from the dictionary
    plt.xlabel(f"{columns_units.get(col, '')}")  # Get the unit from the dictionary (default empty string if not found)
    plt.ylabel("Value")  # The y-axis label remains as "Value"

# Adjust layout to prevent overlapping of titles and labels
plt.tight_layout()

plt.show()

# Outliers are present in almost all variables, especially the target.

In [None]:
#Feature Transformation by Extracting parts of a date-time feature (feature engineering app#1)

# Convert 'date' column to datetime format (without dropping the original)
df['date'] = pd.to_datetime(df['date'], format='%d-%m-%Y %H:%M', errors='coerce')

# Extract numerical features from the 'date' column,
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['hour'] = df['date'].dt.hour
df['minute'] = df['date'].dt.minute

# New data shape after date transformation: 19735 x 34 (29 original columns, including date and target, plus the 5 new date parts)
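
Because `errors='coerce'` turns any timestamp that does not match the format string into `NaT` without raising, it can be worth counting the failures before extracting the date parts. A minimal sketch on three made-up timestamps (the `sample` frame and its values are hypothetical and are not taken from `energydata.csv`):

```python
import pandas as pd

# Hypothetical timestamps: two follow the day-month-year format, one does not
sample = pd.DataFrame({
    "date": ["11-01-2016 17:00", "11-01-2016 17:10", "2016-01-11 17:20:00"]
})

# Same call pattern as in the notebook cell
parsed = pd.to_datetime(sample["date"], format="%d-%m-%Y %H:%M", errors="coerce")

# Rows that failed to parse become NaT; count them before using .dt accessors
n_failed = int(parsed.isna().sum())
print(f"Unparsed rows: {n_failed} of {len(parsed)}")

# Date parts are NaN wherever parsing failed
parts = pd.DataFrame({
    "month": parsed.dt.month,
    "day": parsed.dt.day,
    "hour": parsed.dt.hour,
    "minute": parsed.dt.minute,
})
print(parts)
```

If the count is not zero on the real file, the format string probably does not match how the dates are stored.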

In [None]:
#new df (year,month,day,hour and minute were separated and added as new columns)
df_new = df.copy()
df_new

In [None]:
# Tried to do a time-series graph; the output did not seem to be correct, so Excel was used to plot in the end.
# Set 'date' as the index for df_new
df_new.set_index('date', inplace=True)

# Select only numeric columns (exclude 'date' and any non-numeric columns)
df_numeric = df_new.select_dtypes(include=['float64', 'int64'])

# Resample data to 5-hour intervals and aggregate using mean
# (lowercase '5h' is the spelling recent pandas expects; the uppercase 'H' alias is deprecated)
df_resampled = df_numeric.resample('5h').mean()

# Extract the relevant features for the plot
df_selected = df_resampled[['T2', 'T6', 'Appliances', 'lights']]

# Create the figure and axes
fig, ax1 = plt.subplots(figsize=(12, 8))

# Plot temperature features (T2 and T6) on the first y-axis
ax1.plot(df_selected.index, df_selected['T2'], label='Temperature:living rm, degC', color='tab:blue', alpha=0.7)
ax1.plot(df_selected.index, df_selected['T6'], label='Temperature outside bldg (North), degC', color='tab:orange', alpha=0.7)

# Formatting for the first y-axis
ax1.set_xlabel("Date and Hour")
ax1.set_ylabel("Temperature (°C)", color='tab:blue')
ax1.tick_params(axis='y', labelcolor='tab:blue')

# Create a second y-axis for energy values
ax2 = ax1.twinx()
ax2.plot(df_selected.index, df_selected['Appliances'], label='Energy from Appliances (Wh)', color='tab:green', alpha=0.7)
ax2.plot(df_selected.index, df_selected['lights'], label='Energy from Lights (Wh)', color='tab:red', alpha=0.7)

# Formatting for the second y-axis
ax2.set_ylabel("Energy (Wh)", color='tab:green')
ax2.tick_params(axis='y', labelcolor='tab:green')

# Rotate the date ticks that matplotlib places automatically. Passing the whole index to
# set_xticklabels only relabels the existing ticks with the first few timestamps, which mislabels the axis.
ax1.tick_params(axis='x', rotation=45)

# Title and legend
plt.title("Time-Series Plot of Temperature and Energy Features (5-Hour Interval)")
ax1.legend(loc='upper left', fontsize=8)
ax2.legend(loc='upper right', fontsize=8)

# Show the plot
plt.grid(True)
plt.tight_layout()

plt.show()

In [None]:
# Reset the index to make 'date' a column
df_new.reset_index(inplace=True)

# Convert 'date' column to datetime format
df_new['date'] = pd.to_datetime(df_new['date'])

# Set 'date' as the index for df_new
df_new.set_index('date', inplace=True)

# Select only numeric columns (exclude 'date' and any non-numeric columns)
df_numeric = df_new.select_dtypes(include=['float64', 'int64'])

# Resample data to 5-hour intervals and aggregate using mean
# (lowercase '5h' is the spelling recent pandas expects; the uppercase 'H' alias is deprecated)
df_resampled = df_numeric.resample('5h').mean()

# Loop through each numeric feature and plot with respect to 'lights'
for feature in df_resampled.columns:
    if feature != 'lights':  # Skip plotting 'lights' against itself
        # Create the figure and axes for each feature
        fig, ax1 = plt.subplots(figsize=(12, 8))

        # Plot the selected feature on the first y-axis
        ax1.plot(df_resampled.index, df_resampled[feature], label=f'{feature}', color='tab:blue', alpha=0.7)

        # Formatting for the first y-axis
        ax1.set_xlabel("Date and Hour")
        ax1.set_ylabel(columns_units.get(feature, feature), color='tab:blue')  # unit from the dictionary, else the column name
        ax1.tick_params(axis='y', labelcolor='tab:blue')

        # Create a second y-axis for energy values ('lights')
        ax2 = ax1.twinx()
        ax2.plot(df_resampled.index, df_resampled['lights'], label='Energy from Lights (Wh)', color='tab:red', alpha=0.7)

        # Formatting for the second y-axis
        ax2.set_ylabel("Energy (Wh)", color='tab:red')
        ax2.tick_params(axis='y', labelcolor='tab:red')

        # Rotate the date ticks that matplotlib places automatically. Passing the whole index to
        # set_xticklabels only relabels the existing ticks with the first few timestamps, which mislabels the axis.
        ax1.tick_params(axis='x', rotation=45)

        # Title and legend
        plt.title(f"Time-Series Plot of {feature} and Energy from Lights (5-Hour Interval)")
        ax1.legend(loc='upper left', fontsize=8)
        ax2.legend(loc='upper right', fontsize=8)

        # Show the plot
        plt.grid(True)
        plt.tight_layout()

        # Display the plot for each feature
        plt.show()

In [None]:
# Create the figure with 3 subplots (for Daily, Weekly, and Monthly data)
fig, axs = plt.subplots(3, 1, figsize=(12, 15), sharex=True)

# Daily Data Plot
# Plot appliances and lights (energy in Wh) on the left y-axis
df_new.resample('D').mean()['Appliances'].plot(ax=axs[0], color='tab:blue', label='Appliances (Wh)')
df_new.resample('D').mean()['lights'].plot(ax=axs[0], color='tab:orange', label='Lights (Wh)')

# Plot features like RH_1 on the right y-axis
ax2 = axs[0].twinx()  
df_new.resample('D').mean()['RH_1'].plot(ax=ax2, color='tab:green', label='Humidity: kitchen (%)', linestyle='--')

# Add labels and title for the first subplot
axs[0].set_ylabel("Energy (Wh)", color='tab:blue')
ax2.set_ylabel("Humidity (%)", color='tab:green')
axs[0].set_title("Daily Energy and Features")
axs[0].grid(True)

# Weekly Data Plot
# Plot appliances and lights (energy in Wh) on the left y-axis
df_new.resample('W').mean()['Appliances'].plot(ax=axs[1], color='tab:blue', label='Appliances (Wh)')
df_new.resample('W').mean()['lights'].plot(ax=axs[1], color='tab:orange', label='Lights (Wh)')

# Plot features like T1 on the right y-axis
ax2 = axs[1].twinx()  
df_new.resample('W').mean()['T1'].plot(ax=ax2, color='tab:red', label='Temperature: kitchen (°C)', linestyle='--')

# Add labels and title for the second subplot
axs[1].set_ylabel("Energy (Wh)", color='tab:blue')
ax2.set_ylabel("Temperature (°C)", color='tab:red')
axs[1].set_title("Weekly Energy and Features")
axs[1].grid(True)

# Monthly Data Plot
# Plot appliances and lights (energy in Wh) on the left y-axis
# ('ME' = month end; it replaces the deprecated 'M' alias and matches the later cells)
df_new.resample('ME').mean()['Appliances'].plot(ax=axs[2], color='tab:blue', label='Appliances (Wh)')
df_new.resample('ME').mean()['lights'].plot(ax=axs[2], color='tab:orange', label='Lights (Wh)')

# Plot features like RH_2 on the right y-axis
ax2 = axs[2].twinx()  
df_new.resample('ME').mean()['RH_2'].plot(ax=ax2, color='tab:green', label='Humidity: living room (%)', linestyle='--')

# Add labels and title for the third subplot
axs[2].set_ylabel("Energy (Wh)", color='tab:blue')
ax2.set_ylabel("Humidity (%)", color='tab:green')
axs[2].set_title("Monthly Energy and Features")
axs[2].grid(True)

# Format the x-axis labels to be readable
for ax in axs:
    ax.set_xlabel("Date")
    ax.legend(loc='upper left')

# Adjust layout for better spacing between subplots
plt.tight_layout()

# Display the plots
plt.show()

In [None]:
# List of feature columns to plot (excluding time-related and energy columns)
feature_columns = [
    'T1', 'RH_1', 'T2', 'RH_2', 'T3', 'RH_3', 'T4', 'RH_4', 'T5', 'RH_5',
    'T6', 'RH_6', 'T7', 'RH_7', 'T8', 'RH_8', 'T9', 'RH_9', 'T_out', 'Press_mm_hg',
    'RH_out', 'Windspeed', 'Visibility', 'Tdewpoint', 'rv1', 'rv2'
]

# Create the figure with subplots (3 for daily, weekly, and monthly data)
fig, axs = plt.subplots(len(feature_columns), 3, figsize=(15, len(feature_columns)*5), sharex=True)

# Loop through each feature
for i, feature in enumerate(feature_columns):
    # Plot daily data
    df_new.resample('D').mean()['Appliances'].plot(ax=axs[i, 0], color='tab:blue', label='Appliances (Wh)')
    df_new.resample('D').mean()['lights'].plot(ax=axs[i, 0], color='tab:orange', label='Lights (Wh)')
    ax2 = axs[i, 0].twinx()
    df_new.resample('D').mean()[feature].plot(ax=ax2, color='tab:green', label=feature, linestyle='--')
    axs[i, 0].set_title(f"Daily: Energy and {feature}")
    axs[i, 0].set_ylabel("Energy (Wh)", color='tab:blue')
    ax2.set_ylabel(f"{feature} Value", color='tab:green')
    axs[i, 0].grid(True)
    
    # Plot weekly data
    df_new.resample('W').mean()['Appliances'].plot(ax=axs[i, 1], color='tab:blue', label='Appliances (Wh)')
    df_new.resample('W').mean()['lights'].plot(ax=axs[i, 1], color='tab:orange', label='Lights (Wh)')
    ax2 = axs[i, 1].twinx()
    df_new.resample('W').mean()[feature].plot(ax=ax2, color='tab:green', label=feature, linestyle='--')
    axs[i, 1].set_title(f"Weekly: Energy and {feature}")
    axs[i, 1].set_ylabel("Energy (Wh)", color='tab:blue')
    ax2.set_ylabel(f"{feature} Value", color='tab:green')
    axs[i, 1].grid(True)
    
    # Plot monthly data
    df_new.resample('ME').mean()['Appliances'].plot(ax=axs[i, 2], color='tab:blue', label='Appliances (Wh)')
    df_new.resample('ME').mean()['lights'].plot(ax=axs[i, 2], color='tab:orange', label='Lights (Wh)')
    ax2 = axs[i, 2].twinx()
    df_new.resample('ME').mean()[feature].plot(ax=ax2, color='tab:green', label=feature, linestyle='--')
    axs[i, 2].set_title(f"Monthly: Energy and {feature}")
    axs[i, 2].set_ylabel("Energy (Wh)", color='tab:blue')
    ax2.set_ylabel(f"{feature} Value", color='tab:green')
    axs[i, 2].grid(True)

# Format the x-axis labels to be readable (rotate them)
for ax in axs.flatten():
    ax.set_xlabel("Date")
    ax.legend(loc='upper left', fontsize=8)
    ax.tick_params(axis='x', rotation=45)  # rotate ticks without overwriting their labels

# Adjust layout for better spacing between subplots
plt.tight_layout()

# Display the plots
plt.show()

In [None]:
sns.pairplot(df[['T2', 'T_out', 'RH_out', 'Appliances','lights','hour','Tdewpoint','Windspeed','Visibility']])  # features of interest
plt.show()

In [None]:
#Finding relationship of features with the target variable "Appliances"
# Calculate the correlation matrix
correlation_matrix = df_new.corr()

# Create a mask for the upper triangle of the correlation matrix
mask = np.triu(np.ones_like(correlation_matrix, dtype=bool))

# Set up the matplotlib figure
plt.figure(figsize=(15, 10))

# Create the heatmap with the mask applied
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', mask=mask, vmin=-1, vmax=1, cbar_kws={'shrink': 0.8})

# Add title
plt.title('Correlation Heatmap')

# Show the plot
plt.show()

# The heatmap shows several highly correlated features, and the synthetic
# variables (rv1, rv2) are also visible in the data set
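
With this many columns the annotated heatmap is hard to read, so it can help to list the strongly correlated pairs directly. A minimal sketch on synthetic data (the `toy` frame, its column names, and the 0.9 cutoff are assumptions for illustration, not values from the notebook):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=500)

# Synthetic stand-in for the real features
toy = pd.DataFrame({
    "a": base,
    "b": base + rng.normal(scale=0.1, size=500),  # nearly a copy of "a"
    "c": rng.normal(size=500),                     # independent noise
})

corr = toy.corr()

# Keep only the strict upper triangle so every pair is listed once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna()

# Pairs whose absolute correlation exceeds the (assumed) cutoff
high_pairs = pairs[pairs.abs() > 0.9]
print(high_pairs)
```

The same pattern applied to `correlation_matrix` would give a sortable list instead of a dense grid of annotations.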


### Data Preparation and Feature Engineering

In [None]:
# Splitting the data set into Training, Evaluation, and Test Set
from sklearn.model_selection import train_test_split

# Drop the datetime index before defining features and target (the date parts are already stored as separate columns)
df_new = df_new.reset_index(drop=True)
df_new

In [None]:
X = df_new.drop(columns=['Appliances','year'])  # Exclude the target and year column
y = df_new['Appliances']  # Target variable

# Split the data into training (70%), evaluation (15%), and test (15%)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_eval, X_test, y_eval, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# Check the shape of the resulting splits
print(f"Training set size: {X_train.shape}")
print(f"Evaluation set size: {X_eval.shape}")
print(f"Test set size: {X_test.shape}")

# The column count increased from 28 to 33 because the date was split into
# year, month, day, hour, and minute
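
The two-stage split can be sanity-checked on row indices alone. This sketch assumes only the row count of 19735 reported by `df.shape`; the index array stands in for the real `X` and `y`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 19735          # row count reported earlier in the notebook
idx = np.arange(n_rows)  # stand-in for the real rows

# Stage 1: 70% train, 30% held out
train_idx, temp_idx = train_test_split(idx, test_size=0.3, random_state=42)
# Stage 2: split the held-out 30% in half (about 15% evaluation, 15% test)
eval_idx, test_idx = train_test_split(temp_idx, test_size=0.5, random_state=42)

sizes = {"train": len(train_idx), "eval": len(eval_idx), "test": len(test_idx)}
print(sizes)

# The three subsets should be disjoint and cover every row
all_idx = np.concatenate([train_idx, eval_idx, test_idx])
print("Covers all rows once:", len(np.unique(all_idx)) == n_rows)
```

Note that `train_test_split` shuffles by default, so neighbouring 10-minute readings can land in different subsets. Whether that is acceptable depends on how the model will be used.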


In [None]:
# Calculate the correlation of each feature with the target variable (y_train)
correlation_with_target = X_train.apply(lambda x: x.corr(y_train))

# Define your thresholds
thresholds = [0.03, 0.05, 0.075, 0.1]

# Initialize a list to store selected features at each threshold
selected_features_at_thresholds = {}

for threshold in thresholds:
    # Select features based on correlation threshold
    selected_features = correlation_with_target[correlation_with_target.abs() > threshold].index
    
    # Store the selected features for this threshold
    selected_features_at_thresholds[threshold] = selected_features

    print(f"Threshold: {threshold}, Features selected: {len(selected_features)}")


In [None]:
# Store selected features for each threshold
selected_features_at_thresholds = {}

for threshold in thresholds:
    # Select features based on correlation threshold
    selected_features = correlation_with_target[correlation_with_target.abs() > threshold].index.tolist()
    
    # Store the selected features
    selected_features_at_thresholds[threshold] = selected_features

    # Print threshold and selected features
    print(f"\nThreshold: {threshold}, Features selected: {len(selected_features)}")
    print("Selected Features:", selected_features)

In [None]:
# Check the actual correlation values with the target for each feature
print(correlation_with_target)
# These correlation values are lower than those in the overall correlation heatmap, partly because they are
# computed on the training split (only 70% of the data set) and because this feature selection step measures only
# the correlation between each feature and the target variable (y_train), which can be much weaker.

In [None]:
# Double-check that the training subset is representative of the whole data set
# Select a feature that had high correlation in the full dataset
feature = 'T_out'  # 0.97

plt.figure(figsize=(10, 5))
sns.histplot(df_new[feature], color='blue', label='Full Data', kde=True)
sns.histplot(X_train[feature], color='red', label='Training Data', kde=True)
plt.legend()
plt.title(f"Distribution of {feature} in Full vs. Training Data")
plt.show()

In [None]:
# Compute Spearman correlation instead of Pearson
spearman_correlation = X_train.corrwith(y_train, method='spearman').sort_values(ascending=False)
print(spearman_correlation)

In [None]:
# Check whether the relationships with the target are linear or only monotonic, using Pearson (linear) and Spearman (rank-based, monotonic) correlation

# Compute Pearson correlation
pearson_corr = X_train.corrwith(y_train, method='pearson')

# Compute Spearman correlation
spearman_corr = X_train.corrwith(y_train, method='spearman')

# Combine both into a DataFrame
correlation_df = pd.DataFrame({'Pearson': pearson_corr, 'Spearman': spearman_corr})

# Sort by absolute Pearson correlation for better visualization
correlation_df = correlation_df.reindex(pearson_corr.abs().sort_values(ascending=False).index)

# Plot (DataFrame.plot creates its own figure, so a separate plt.figure call would only leave an empty figure)
correlation_df.plot(kind='bar', figsize=(15, 5), width=0.8, colormap='coolwarm')
plt.axhline(y=0, color='black', linewidth=0.8)  # Add a reference line at 0
plt.title("Pearson vs. Spearman Correlation with Target (Appliances)")
plt.ylabel("Correlation Coefficient")
plt.xlabel("Features")
plt.legend(loc="upper right")
plt.xticks(rotation=90)
plt.grid(axis='y', linestyle='--', alpha=0.6)
plt.show()

# Spearman correlation > Pearson, meaning the features have a monotonic but non-linear relationship with the target
# This suggests that feature selection based on linear correlation is not the recommended feature engineering approach

In [None]:
# Further Checking Pearson and Spearman scores with  Mutual Information Regression

from sklearn.feature_selection import mutual_info_regression
mi = mutual_info_regression(X_train, y_train)
mi_series = pd.Series(mi, index=X_train.columns).sort_values(ascending=False)
print(mi_series)

In [None]:
# Combine Pearson, Spearman, and Mutual Information into a DataFrame and plot

correlation_df = pd.DataFrame({
    'Pearson': pearson_corr,
    'Spearman': spearman_corr,
    'Mutual Info': mi_series
})

# Sort by highest mutual information for better visualization
correlation_df = correlation_df.reindex(mi_series.sort_values(ascending=False).index)

# Plot (DataFrame.plot creates its own figure, so a separate plt.figure call would only leave an empty figure)
correlation_df.plot(kind='bar', figsize=(15, 5), width=0.8, colormap='coolwarm')
plt.axhline(y=0, color='black', linewidth=0.8)  # Reference line at 0
plt.title("Feature Importance: Pearson vs. Spearman vs. Mutual Information")
plt.ylabel("Score")
plt.xlabel("Features")
plt.legend(loc="upper right")
plt.xticks(rotation=90)
plt.grid(axis='y', linestyle='--', alpha=0.6)
plt.show()

#### Interpretation:
- Spearman > Pearson indicates that as one variable increases, the other consistently increases or decreases but at a variable rate, so the relationship is monotonic rather than perfectly linear.
- RH_6, Press_mm_hg, RH_2, RH_9, RH_7, and RH_out have positive Mutual Information scores while their Pearson and Spearman scores are negative. Mutual Information is non-negative by construction, so the negative correlations describe the direction of the relationship (inverse), while the MI score suggests these features still carry information about the target.
- RH_5 and RH_3 have a (+) score on Pearson and Mutual Information but a (-) score on Spearman, so there is likely a non-monotonic but partially linear relationship with the target. Their (+) Mutual Information scores indicate that RH_5 and RH_3 contain predictive information about the dependent variable, even if the relationship isn't purely linear or monotonic.
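
How the three scores behave can be illustrated on synthetic data where the true relationship is known. This is a minimal sketch and not a result from the energy data; the `toy` frame, the `scores` helper, and both target columns are made up for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=2000)

toy = pd.DataFrame({
    "x": x,
    "monotonic": np.exp(x),  # always increasing in x, but not linear
    "symmetric": x ** 2,     # depends on x, but neither linear nor monotonic
})

def scores(target):
    """Return (Pearson, Spearman, mutual information) between x and the target."""
    pearson = toy["x"].corr(toy[target], method="pearson")
    spearman = toy["x"].corr(toy[target], method="spearman")
    mi = mutual_info_regression(toy[["x"]], toy[target], random_state=0)[0]
    return pearson, spearman, mi

mono_scores = scores("monotonic")
sym_scores = scores("symmetric")

for name, (p, s, m) in [("monotonic", mono_scores), ("symmetric", sym_scores)]:
    print(f"{name:>9}: Pearson={p:.2f}  Spearman={s:.2f}  MI={m:.2f}")
```

For the monotonic target Spearman exceeds Pearson. For the symmetric target both correlations sit near zero while the mutual information stays clearly positive, which is the case where correlation-based selection would discard a useful feature.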

In [None]:
# Verifying the (-) Spearman with (+) Pearson & MI score
for feature in ['RH_3', 'RH_5']:
    sns.lmplot(x=feature, y='Appliances', data=df_new, lowess=True)
    plt.title(f'Lowess Curve for {feature} vs Appliances')
    plt.show()

# These scatter plots do not show a clear pattern that would indicate a relationship with Appliances.
# RH_3 and RH_5 do not appear to have a direct relationship with the target; instead, outliers could be
# distorting the correlation scores and may therefore be giving misleading information.

In [None]:
# Checking for outliers, which can distort the correlation scores and may explain the unclear scatter plot pattern

sns.boxplot(x=df_new['RH_3'])
plt.show()
sns.boxplot(x=df_new['RH_5'])
plt.show()

- Feature selection using correlation is not the best approach, as indicated by the Spearman scores being higher than the Pearson correlation values.
- Recursive Feature Elimination will therefore be checked using the training set.

In [None]:
# Check for skewness
# Identify highly skewed features
skewed_features = X_train.skew().sort_values(ascending=False)
highly_skewed = skewed_features[skewed_features > 0.75].index

# Plot histograms
plt.figure(figsize=(12, 6))
for i, col in enumerate(highly_skewed[:6]):  # Limit to first 6 for readability
    plt.subplot(2, 3, i + 1)
    sns.histplot(X_train[col], bins=30, kde=True)
    plt.title(f'Distribution of {col}')
plt.tight_layout()
plt.show()

In [None]:
print(skewed_features[skewed_features > 0.75]) 

In [None]:
plt.figure(figsize=(12, 6))
for i, col in enumerate(highly_skewed[:6]):  # Limit to first 6
    plt.subplot(2, 3, i + 1)
    sns.boxplot(x=X_train[col])
    plt.title(f'Boxplot of {col}')
plt.tight_layout()
plt.show()

In [None]:
plt.figure(figsize=(10, 5))
sns.barplot(x=skewed_features.index, y=skewed_features.values, palette="viridis")
plt.xticks(rotation=90)
plt.ylabel("Skewness")
plt.title("Skewness of Features in X_train")
plt.show()

In [None]:
# Skipping the skewness issue, as correcting it is mainly advantageous for linear relationships
# Instead, focus on models which are tree-based (Random Forest, Gradient Boosting, XGBoost) and Neural Network
# These models capture non-linear relationships and are less sensitive to skewness and outliers
# Will not apply transformation for now
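
For reference, if a linear model were added later, one common option for right-skewed positive features is a log transform. The notebook does not apply it; the sketch below uses a synthetic log-normal column (`toy_feature` is a made-up name) only to show the effect on the skewness statistic:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed, strictly positive feature
skewed = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=5000), name="toy_feature")

skew_before = skewed.skew()
# log1p(x) = log(1 + x); it is defined at zero, unlike a plain log
skew_after = np.log1p(skewed).skew()

print(f"Skewness before: {skew_before:.2f}")
print(f"Skewness after log1p: {skew_after:.2f}")
```

Any such transform would need to be fitted on the training set only and then applied unchanged to the evaluation and test sets.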

In [None]:
# Rebuild the full feature matrix from df_new (excluding the target and year column)
X = df_new.drop(columns=['Appliances', 'year'])

# Split the data again as you did before (if needed)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_eval, X_test, y_eval, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

In [None]:
# Check the shape of the resulting splits
print(f"Training set size: {X_train.shape}")
print(f"Evaluation set size: {X_eval.shape}")
print(f"Test set size: {X_test.shape}")

In [None]:
# This will select the most important features
from sklearn.feature_selection import mutual_info_regression

# Compute MI Scores
mi_scores = mutual_info_regression(X_train, y_train)
mi_scores = pd.Series(mi_scores, index=X_train.columns).sort_values(ascending=False)

# Select the top 15 features
top_features = mi_scores.head(15).index.tolist()

# Reduce feature set
X_train = X_train[top_features]

# Display selected features
print(mi_scores.head(15))

# Mutual information regression produces MI scores that rank how relevant each feature is
# to the target. Features with higher MI scores can be prioritized, while those with lower
# scores can be dropped to reduce dimensionality and limit overfitting

In [None]:
# Plot the MI scores as a bar chart
plt.figure(figsize=(10, 6))
mi_scores.head(15).plot(kind='bar')  # 15 bars, to match the "Top 15" title
plt.title('Top 15 Mutual Information Scores')
plt.xlabel('Features')
plt.ylabel('Mutual Information Score')
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()

In [None]:
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestRegressor

# Train a Random Forest model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Extract the feature importances
feature_importances = rf_model.feature_importances_

# Create a DataFrame to pair features with their importance scores
importance_df = pd.DataFrame({'Feature': X_train.columns, 'Importance': feature_importances})

# Sort the features by importance
importance_df = importance_df.sort_values(by='Importance', ascending=False)

# Plot the feature importances as a bar chart
plt.figure(figsize=(10, 6))
plt.barh(importance_df['Feature'], importance_df['Importance'], color='skyblue')
plt.title('Random Forest Feature Importances')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.tight_layout()
plt.show()

#### Summary of Feature engineering
- Feature engineering is applied to the training set: rows = 13814, columns = 31 (excluding target and year).
1. Dropping highly correlated features using correlation values was not used.
   - The correlation values from X_train fall mostly between 0.03 and 0.075, which indicates a weak relationship between features and target. To verify that the training set is representative of the whole data set, the histogram of the feature with the highest correlation value in the whole dataset (T_out) was plotted against the training set.
   - This was verified using the Spearman and Pearson correlation scores. The comparison showed higher Spearman scores among the features, indicating a monotonic but non-linear relationship.
   - This was further checked using the Mutual Information Regression score (MI score). For most of the features, the MI score is between the Spearman and Pearson scores.
2. Skewness of the features in the training set was checked but later left uncorrected, as correcting it was seen to be not very advantageous given the non-linear behavior of the independent variables with respect to the dependent variable.
3. MI scores (mutual_info_regression) were used to identify the top 15 features: hour, T9, T7, RH_6, T5, T4, T8, T1, T3, RH_1, Press_mm_hg, T_out, T2, RH_8, and RH_5.
4. Recursive Feature Elimination (RFE) was also intended for selecting the most important features; in the code above this step is a Random Forest feature-importance ranking fitted on the 15 MI-selected features. It gives hour, Press_mm_hg, T_out, T3, RH_1, RH_5, T8, RH_8, RH_6, T4, T7, T2, T5, T9, and T1.
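
The Random Forest cell imports `RFE` but ranks features from a single fit's `feature_importances_`. For comparison, a minimal sketch of how scikit-learn's `RFE` wrapper could be run is shown below. It uses synthetic data in which only the first two columns drive the target; `X_toy`, `y_toy`, and the choice of two features are assumptions for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

rng = np.random.default_rng(0)

# Synthetic data: 6 features, only columns 0 and 1 influence the target
X_toy = rng.normal(size=(300, 6))
y_toy = 3 * X_toy[:, 0] - 2 * X_toy[:, 1] + rng.normal(scale=0.1, size=300)

# RFE refits the estimator repeatedly, dropping the weakest feature each round
selector = RFE(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    n_features_to_select=2,
    step=1,
)
selector.fit(X_toy, y_toy)

selected = np.flatnonzero(selector.support_)  # indices of the kept columns
print("Selected columns:", selected.tolist())
print("Elimination ranking (1 = kept):", selector.ranking_.tolist())
```

To compare the two methods independently, `RFE` would need to be fitted on the full 31-column training set rather than on the 15 columns already chosen by MI.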

In [None]:
# From these, compare the two feature selections using a Venn diagram

from matplotlib_venn import venn2
import matplotlib.pyplot as plt

# Features selected by MI and RFE
mi_features = {'hour', 'T9', 'T7', 'RH_6', 'T5', 'T4', 'T8', 'T1', 'T3', 'RH_1', 'Press_mm_hg', 'T_out', 'T2', 'RH_8', 'RH_5'}
rfe_features = {'hour', 'Press_mm_hg', 'T_out', 'T3', 'RH_1', 'RH_5', 'T8', 'RH_8', 'RH_6', 'T4', 'T7', 'T2', 'T5', 'T9', 'T1'}

# Create the Venn diagram
plt.figure(figsize=(8, 6))
venn2([mi_features, rfe_features], set_labels=('MI Scores', 'RFE'), alpha=0.5)
plt.title("Venn Diagram of Features Selected by MI Scores and RFE")
plt.show()

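# This Venn diagram shows that mutual_info_regression and the Random Forest ranking (labelled RFE)
# give the same important features in relation to the dependent variable "Appliances".
# The full overlap is expected here, because the Random Forest was fitted on X_train after it was reduced to the top 15 MI features.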


In [None]:
# Looking at the relationship of the top 15 features in the tsne
X_train

In [None]:
#Visualizing the whole training set with top-15 features
from sklearn.manifold import TSNE

# Initialize t-SNE model
tsne = TSNE(n_components=2, random_state=42)

# Fit and transform the training data using t-SNE
X_train_tsne = tsne.fit_transform(X_train)

# Create a scatter plot of the transformed data
plt.figure(figsize=(10, 6))
plt.scatter(X_train_tsne[:, 0], X_train_tsne[:, 1], c=y_train, cmap='viridis', s=20, alpha=0.7)
plt.title("t-SNE visualization of top features from MI and RFE")
plt.xlabel("t-SNE component 1")
plt.ylabel("t-SNE component 2")
plt.colorbar(label='Appliances')
plt.show()
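
t-SNE builds its neighbourhoods from Euclidean distances, so columns with a wider numeric spread can dominate the embedding. Standardizing the features first is one option. The sketch below uses synthetic columns on hour-like, pressure-like, and temperature-like scales; the `StandardScaler` step is an assumption and is not something the notebook applies:

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic columns on very different scales
toy = np.column_stack([
    rng.integers(0, 24, size=200),  # hour-like column, 0 to 23
    rng.normal(755, 7, size=200),   # pressure-like column, mmHg scale
    rng.normal(20, 2, size=200),    # temperature-like column, degC scale
])

# Give every column zero mean and unit variance before measuring distances
scaled = StandardScaler().fit_transform(toy)

embedding = TSNE(n_components=2, random_state=42).fit_transform(scaled)
print(embedding.shape)
```

In a modelling pipeline the scaler would be fitted on the training set only.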

In [None]:
# Create a pairplot showing only the lower triangle (corner=True); the diagonal histograms are kept
sns.pairplot(X_train, corner=True)
plt.show()

##### The top features do not seem to show a specific scatter plot pattern.
##### The histograms show different distributions, and a few look almost normal, such as RH_5, T2, T_out, Press_mm_hg, RH_1, T8, and T4.
#### My final 15 features to be used in my modelling are: 
- Note: Temperatures are all in degC; RH is relative humidity in %
1. hour = hour of day (24-hour cycle)
2. T9 = Temp in parents' room
3. T7 = Temp in ironing room
4. RH_6 = Humidity outside the building (North)
5. T5 = Temp in bathroom
6. T4 = Temp in office room
7. T8 = Temp in teen's room
8. T1 = Temp in kitchen
9. T3 = Temp in laundry room
10. RH_1 = Humidity in kitchen
11. Press_mm_hg = Pressure outside, in mmHg
12. T_out = Temp outside
13. T2 = Temp in living room
14. RH_8 = Humidity in teen's room
15. RH_5 = Humidity in bathroom


#####  Next step is to proceed with modelling using SVM, Random Forest, Gradient Boosting, XGBoost, and Neural Network.
