# Introduction
Motivation for participating in decarbonization stems from a deep understanding of the urgent need to address the escalating environmental challenges posed by climate change. Decarbonization refers to the process of reducing carbon dioxide and other greenhouse gas emissions in order to mitigate the effects of climate change. It involves transitioning from fossil fuels to cleaner and renewable energy sources, implementing energy-efficient technologies, and adopting sustainable practices across various sectors of society.

 

The motivation to engage in decarbonization arises from a recognition of the detrimental impacts of climate change on our planet and the well-being of future generations. Rising global temperatures, extreme weather events, sea-level rise, and ecological disruptions are just a few examples of the consequences of unchecked carbon emissions. By actively participating in decarbonization efforts, individuals, communities, businesses, and governments aim to mitigate these risks, preserve the natural environment, and create a sustainable future.

 

Decarbonization is vital in our society for several reasons. First and foremost, it is crucial for averting the worst effects of climate change and safeguarding the planet's habitability. By reducing greenhouse gas emissions, we can limit global warming, protect ecosystems, and promote biodiversity. Additionally, decarbonization brings numerous societal benefits, including improved air quality, enhanced public health, and reduced dependency on finite fossil fuel resources. It also drives innovation and economic opportunities, as the transition to clean energy and sustainable practices creates jobs and fosters the development of new technologies and industries. Furthermore, decarbonization helps build resilient communities and promotes social equity by ensuring access to affordable and clean energy for all, regardless of socioeconomic status or geographical location.

 

In summary, the motivation to participate in decarbonization arises from a shared responsibility to combat climate change and secure a sustainable future. By reducing carbon emissions and embracing cleaner alternatives, we can protect the environment, improve public health, drive economic growth, and foster a more equitable society. Decarbonization is not only important for addressing the challenges we face today but also for creating a better world for generations to come.

# Our Solution
Infosys, a leading global technology services and consulting company, has been at the forefront of sustainability and decarbonization efforts. They have a strong commitment to environmental responsibility and have set ambitious goals to become carbon neutral. Infosys has implemented various initiatives to reduce their carbon footprint, including energy-efficient infrastructure, renewable energy adoption, and waste management programs. They have also actively pursued green building certifications and invested in innovative technologies to optimize energy usage across their operations.

 

Our solution, the AI-powered chatbot for energy, cleantech, and sustainability, aligns perfectly with Infosys' goals and vision for becoming carbon neutral. The chatbot's predictive models, optimization module, and intelligent recommendations would provide invaluable support to Infosys in their journey towards decarbonization. By leveraging historical and real-time data, the chatbot can identify peak energy usage patterns, optimize the utilization of renewable energy sources, and suggest energy-saving practices for Infosys' facilities.

 

The chatbot's ability to determine optimal locations for EV charging stations would also be highly beneficial for Infosys. By strategically placing charging infrastructure based on traffic patterns and population density, Infosys can encourage the adoption of electric vehicles among its employees and visitors, further reducing carbon emissions from transportation.

 

Moreover, the chatbot's optimization module, which showcases the ideal energy mix at different times, would empower Infosys to make informed decisions regarding resource allocation and energy distribution. It would assist in maximizing the use of renewable energy sources while minimizing reliance on fossil fuels, resulting in a significant reduction in greenhouse gas emissions.
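
As a rough illustration of the "ideal energy mix" idea above, the sketch below fills demand from sources in a fixed priority order, renewables first. It is only a minimal sketch under stated assumptions: the function name `dispatch`, the source names, and all megawatt figures are hypothetical and are not taken from the optimization module itself.

```python
def dispatch(demand_mw, available_mw, priority):
    """Fill demand from sources in priority order.

    Returns the MW drawn from each source and any demand left unserved.
    """
    remaining = demand_mw
    mix = {}
    for source in priority:
        used = min(remaining, available_mw.get(source, 0.0))
        mix[source] = used
        remaining -= used
    return mix, remaining


# Hypothetical priority: renewables first, fossil generation last
priority = ["solar", "wind", "gas"]

# Illustrative availability (MW) at two times of day
midday = {"solar": 600.0, "wind": 200.0, "gas": 1000.0}
evening = {"solar": 0.0, "wind": 300.0, "gas": 1000.0}

midday_mix, midday_short = dispatch(900.0, midday, priority)
evening_mix, evening_short = dispatch(900.0, evening, priority)

print("midday :", midday_mix)
print("evening:", evening_mix)
```

With these made-up numbers the same 900 MW of demand needs far more gas in the evening than at midday, which is the kind of time-dependent mix the module is meant to surface.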

 

Overall, our solution would complement Infosys' ongoing sustainability efforts and provide a powerful tool to accelerate their journey towards carbon neutrality. By leveraging the chatbot's insights and recommendations, Infosys can optimize their energy usage, further increase the adoption of renewable energy, promote sustainable practices, and drive a culture of environmental responsibility within their organization. The solution would not only assist Infosys in achieving their decarbonization goals but also position them as a leader in sustainable technology and inspire others to follow suit.

# First Checkpoint: Energy mix and grid flexibility requirements

Our proposed solution effectively addresses the challenge of energy mix and grid flexibility requirements. Through the chatbot's predictive models and optimization module, we can analyze historical and real-time data to understand energy demand patterns and identify the optimal energy mix for different time intervals. By leveraging the dataset from PJM Interconnection LLC, which includes over 10 years of hourly energy consumption data, we can train our models to accurately predict energy demand in the region served by PJM.

 

The predictive model will utilize advanced algorithms to analyze historical energy consumption patterns, taking into account factors such as time of day, day of the week, seasonal variations, and specific events that impact energy usage. This analysis allows the model to make accurate predictions about when energy demand is expected to be high or low. By incorporating this information into the chatbot, users can receive real-time recommendations on how to manage energy consumption during periods of high demand, promoting energy efficiency and reducing strain on the grid.
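
To make the time-of-day factor concrete, here is a small sketch that averages demand per hour of the day and flags the highest hours. It is an assumption-laden illustration, not the predictive model: `hourly_profile`, `peak_hours`, the 0.9 quantile, and the synthetic readings are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def hourly_profile(readings):
    """Average demand for each hour of the day from (timestamp, MW) pairs."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for ts, mw in readings:
        totals[ts.hour] += mw
        counts[ts.hour] += 1
    return {hour: totals[hour] / counts[hour] for hour in totals}


def peak_hours(profile, quantile=0.9):
    """Hours whose average demand is at or above the chosen quantile of the profile."""
    ranked = sorted(profile.values())
    threshold = ranked[int(quantile * (len(ranked) - 1))]
    return sorted(hour for hour, mw in profile.items() if mw >= threshold)


# Synthetic two-day series: demand is higher from 17:00 to 20:00 (illustrative values only)
start = datetime(2016, 1, 1)
readings = [
    (start + timedelta(hours=h), 2500.0 if 17 <= (h % 24) <= 20 else 1800.0)
    for h in range(48)
]

profile = hourly_profile(readings)
print(peak_hours(profile))
```

The same grouping could be repeated by day of week or month to capture the weekly and seasonal variation mentioned above.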

 

 


In conclusion, our solution includes a predictive model trained on PJM's dataset, which enables accurate predictions of energy demand. By leveraging this information, along with the chatbot's optimization capabilities, we can effectively address the energy mix and grid flexibility requirements. This empowers users to make informed decisions, optimize energy usage, and contribute to the decarbonization goals of the PJM region.

[Link to Dataset](https://www.kaggle.com/datasets/selfishgene/historical-hourly-weather-data)


Import the libraries used to build the model.

In [180]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
import datetime
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from sklearn.metrics import mean_absolute_error

import tensorflow as tf

Read the four regional CSV files and combine them into a single DataFrame.

In [96]:
df1 = pd.read_csv('data_check1/AEP_hourly.csv', index_col='Datetime')
df2 = pd.read_csv('data_check1/DAYTON_hourly.csv', index_col='Datetime')
df3 = pd.read_csv('data_check1/PJME_hourly.csv', index_col='Datetime')
df4 = pd.read_csv('data_check1/PJMW_hourly.csv', index_col='Datetime')

# For visualization purpose
df_final = df1.join([df2, df3, df4])

Check basic information, i.e. column names, dtypes, and non-null counts.

In [97]:
df_final.info()

Compute summary statistics for each column.

In [98]:
df_final.describe()

# EDA

Plot the full series to look for trends and other patterns.

In [99]:
df_final.index = pd.to_datetime(df_final.index)

df_final.plot(figsize=(20,8))
plt.title('PJM Energy Consumption', weight='bold', fontsize=25)

Find which hours of the day have the highest demand.

In [100]:
df_final['Hour'] = df_final.index.hour
df_final['Day'] = df_final.index.day
df_final['Month'] = df_final.index.month
df_final['Quarter'] = df_final.index.quarter
df_final['Year'] = df_final.index.year

columns = ['AEP_MW', 'DAYTON_MW', 'PJME_MW', 'PJMW_MW']

# One panel per region: four columns fit a 2x2 grid
f, axes = plt.subplots(nrows=2, ncols=2, figsize=(20, 14))
f.suptitle('Energy Consumption by Hour of Day', weight='bold', fontsize=25)

for i, col in enumerate(columns):
    sns.boxplot(data=df_final, x='Hour', y=col, ax=axes.flatten()[i])

In [101]:
# TODO: edit the loop so that it shows every year for each region
# One panel per region: four columns fit a 2x2 grid
f, axes = plt.subplots(nrows=2, ncols=2, figsize=(20, 14))
f.suptitle('Energy Consumption by Day of Month', weight='bold', fontsize=25)

for i, col in enumerate(columns):
    sns.stripplot(data=df_final, x='Day', y=col, ax=axes.flatten()[i],hue = "Year")

In [102]:
# TODO: edit the loop so that it shows every year for each region
# One panel per region: four columns fit a 2x2 grid
f, axes = plt.subplots(nrows=2, ncols=2, figsize=(20, 14))
f.suptitle('Average Energy Consumption by Month', weight='bold', fontsize=25)

for i, col in enumerate(columns):
    sns.barplot(data=df_final, x='Month', y=col, ax=axes.flatten()[i], hue = "Year")

In [103]:
# One panel per region: four columns fit a 2x2 grid
f, axes = plt.subplots(nrows=2, ncols=2, figsize=(20, 14))
f.suptitle('Average Energy Consumption\nfrom 2014-2018', weight='bold', fontsize=25)

for i, col in enumerate(columns):
    sns.barplot(data=df_final, x='Year', y=col, ax=axes.flatten()[i], hue="Year")

Some observations that we made were 

* x

In [104]:
df1.index = pd.to_datetime(df1.index)
df1 = df1.sort_index()
df2.index = pd.to_datetime(df2.index)
df2 = df2.sort_index()
df3.index = pd.to_datetime(df3.index)
df3 = df3.sort_index()
df4.index = pd.to_datetime(df4.index)
df4 = df4.sort_index()
df1.head()
df2.head()
df3.head()
df4.head()

In [105]:
df2_train, df2_test = df2[df2.index < '2016-01-01'], df2[df2.index >= '2016-01-01']

print('Train:\t', len(df2_train))
print('Test:\t', len(df2_test))

df1_train, df1_test = df1[df1.index < '2016-01-01'], df1[df1.index >= '2016-01-01']

print('Train:\t', len(df1_train))
print('Test:\t', len(df1_test))

df3_train, df3_test = df3[df3.index < '2016-01-01'], df3[df3.index >= '2016-01-01']

print('Train:\t', len(df3_train))
print('Test:\t', len(df3_test))

df4_train, df4_test = df4[df4.index < '2016-01-01'], df4[df4.index >= '2016-01-01']

print('Train:\t', len(df4_train))
print('Test:\t', len(df4_test))

In [106]:
plt.figure(figsize=(20,8))

df2_train['DAYTON_MW'].plot(label='Training Set')
df2_test['DAYTON_MW'].plot(label='Test Set')
plt.axvline('2016-01-01', color='black', ls='--', lw=3)
plt.text('2016-02-01', 3700, 'Split', fontsize=20, fontweight='bold')
plt.title('Data Splitting', weight='bold', fontsize=25)
plt.legend()

In [107]:
df1_train['AEP_MW'].plot(label='Training Set')
df1_test['AEP_MW'].plot(label='Test Set')
plt.axvline('2016-01-01', color='black', ls='--', lw=3)
plt.text('2016-02-01', df1['AEP_MW'].max(), 'Split', fontsize=20, fontweight='bold', va='top')
plt.title('Data Splitting', weight='bold', fontsize=25)
plt.legend()

In [108]:
df3_train['PJME_MW'].plot(label='Training Set')
df3_test['PJME_MW'].plot(label='Test Set')
plt.axvline('2016-01-01', color='black', ls='--', lw=3)
plt.text('2016-02-01', df3['PJME_MW'].max(), 'Split', fontsize=20, fontweight='bold', va='top')
plt.title('Data Splitting', weight='bold', fontsize=25)
plt.legend()

In [109]:
df4_train['PJMW_MW'].plot(label='Training Set')
df4_test['PJMW_MW'].plot(label='Test Set')
plt.axvline('2016-01-01', color='black', ls='--', lw=3)
plt.text('2016-02-01', df4['PJMW_MW'].max(), 'Split', fontsize=20, fontweight='bold', va='top')
plt.title('Data Splitting', weight='bold', fontsize=25)
plt.legend()

In [110]:
# EXAMPLE

dataset = tf.expand_dims(df2_train['DAYTON_MW'].head(10), axis=-1)

# Build a tf.data dataset from the first 10 DAYTON_MW readings
dataset = tf.data.Dataset.from_tensor_slices(dataset)

# Window the data but only take those with the specified size
dataset = dataset.window(5, shift=1, drop_remainder=True)

# Flatten the windows by putting its elements in a single batch
dataset = dataset.flat_map(lambda window: window.batch(5))

# Create tuples with features (first four elements of the window) and labels (last element)
dataset = dataset.map(lambda window: (window[:-1], window[-1]))

# Print the results
for x,y in dataset:
    print("x = ", x.numpy())
    print("y = ", y.numpy())
    print()
    

In [111]:
def windowing(data, window_size, shuffle_buffer, batch_size):
    dataset = tf.expand_dims(data, axis=-1)
    dataset = tf.data.Dataset.from_tensor_slices(dataset)
    dataset = dataset.window(window_size+1, shift=1, drop_remainder=True) # window_size inputs + 1 label
    dataset = dataset.flat_map(lambda window: window.batch(window_size+1))
    dataset = dataset.map(lambda window: (window[:-1], window[-1])) # (features, label)
    dataset = dataset.shuffle(shuffle_buffer)
    dataset = dataset.batch(batch_size).prefetch(1)
    
    return dataset
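
The windowing step can be hard to picture from the `tf.data` calls alone. The pure-Python sketch below mirrors the same idea on a plain list: each window of `window_size` values is the input and the value right after it is the label. The helper name `make_windows` and the sample values are hypothetical, and shuffling and batching are left out.

```python
def make_windows(series, window_size):
    """Split a sequence into (features, label) pairs.

    Each pair holds `window_size` consecutive values as features and the
    value that immediately follows them as the label.
    """
    pairs = []
    for start in range(len(series) - window_size):
        features = series[start:start + window_size]
        label = series[start + window_size]
        pairs.append((features, label))
    return pairs


# Illustrative hourly readings, not real data
hourly_mw = [10, 11, 12, 13, 14, 15]
pairs = make_windows(hourly_mw, window_size=4)

for features, label in pairs:
    print(features, "->", label)
```

A series of length `n` yields `n - window_size` pairs, which is why incomplete windows at the end are dropped.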

In [112]:
train = windowing(df2_train['DAYTON_MW'], 24, 72, 32)
test = windowing(df2_test['DAYTON_MW'], 24, 72, 32)

In [113]:
dnn_model = tf.keras.models.Sequential([
    tf.keras.layers.Conv1D(filters=16, kernel_size=3,
                      strides=1, padding="causal",
                      activation="relu",
                      input_shape=[24,1]),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(16, activation='relu')),
    tf.keras.layers.Dense(16),
    tf.keras.layers.Dense(1)
])

dnn_model.summary()

In [114]:
dnn_model.compile(loss='mae', optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
dnn_model.fit(train, validation_data=test, epochs=10)

In [115]:
metric = pd.DataFrame(dnn_model.history.history)
metric.plot()

In [116]:
window_size = 24

train_length = len(df2_train)
# Keep the last `window_size` training hours so the first test hour has a full input window
forecast_series = df2['DAYTON_MW'].values[train_length - window_size:]

# Build one input window per test hour and predict them all in a single batched call
windows = np.lib.stride_tricks.sliding_window_view(forecast_series, window_size)[:-1]

# Add the channel axis expected by the model and drop single dimensional axes from the output
results = dnn_model.predict(windows[..., np.newaxis]).squeeze()

In [117]:
df2_test = df2_test.copy()  # avoid SettingWithCopyWarning when adding a column to a slice
df2_test['Pred'] = results

mae = round(mean_absolute_error(df2_test['DAYTON_MW'], df2_test['Pred']), 3)

plt.figure(figsize=(20,8))

df2_test['DAYTON_MW'].plot(label='Actual')
df2_test['Pred'].plot(label='Predicted')
plt.text(16770, 3250, 'MAE: {}'.format(mae), fontsize=20, color='red')
plt.title('Testing Set Forecast', weight='bold', fontsize=25)
plt.legend()
plt.show()
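
An MAE value is easier to judge next to a simple reference. One common reference, which is not part of the notebook above and is shown here only as an assumption, is a persistence forecast that predicts each hour as the value observed `lag` hours earlier. The helper names and numbers below are hypothetical.

```python
def mae(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def persistence_forecast(series, lag):
    """Predict each value as the value observed `lag` steps earlier."""
    return series[:-lag]


# Illustrative demand values, not real data
actual = [100.0, 120.0, 90.0, 110.0, 130.0, 95.0]
lag = 1

baseline_pred = persistence_forecast(actual, lag)
baseline_mae = mae(actual[lag:], baseline_pred)

print("Persistence MAE:", baseline_mae)
```

If the model's MAE on the test set is not clearly below such a baseline, the extra model complexity is probably not paying off.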

# Second Checkpoint: EV charging and infrastructure

Our proposed solution effectively tackles the challenge of EV charging infrastructure by leveraging AI algorithms and data analysis techniques. To address this issue, our solution incorporates geographical data such as traffic patterns and population density. By acquiring and analyzing this data, we can gain valuable insights into areas with high traffic volume and concentrated population, which are prime locations for EV charging stations.

 

Using AI algorithms, we can optimize the placement of EV chargers based on the analyzed data. The algorithms consider factors such as traffic congestion, commuting patterns, and population distribution to identify optimal locations for charging stations. By strategically placing chargers in areas where EV demand is high and where drivers frequently commute, we can ensure convenient access to charging infrastructure and promote the adoption of electric vehicles.
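
One simple way to picture the placement step is a weighted score over normalized traffic and population density. The sketch below is a hypothetical illustration only: the site names, figures, equal weights, and the functions `normalize` and `rank_sites` are assumptions, and a real placement model would likely consider more factors.

```python
def normalize(values):
    """Scale values to the 0-1 range (all zeros if every value is equal)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def rank_sites(sites, traffic_weight=0.5, density_weight=0.5, top_k=2):
    """Return the names of the top_k sites by weighted traffic/density score."""
    traffic = normalize([s["daily_traffic"] for s in sites])
    density = normalize([s["population_density"] for s in sites])
    scored = [
        (traffic_weight * t + density_weight * d, s["name"])
        for s, t, d in zip(sites, traffic, density)
    ]
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_k]]


# Made-up candidate locations
candidate_sites = [
    {"name": "Downtown garage", "daily_traffic": 42000, "population_density": 9500},
    {"name": "Highway rest stop", "daily_traffic": 60000, "population_density": 300},
    {"name": "Suburban mall", "daily_traffic": 25000, "population_density": 4000},
    {"name": "Rural depot", "daily_traffic": 5000, "population_density": 150},
]

best = rank_sites(candidate_sites)
print(best)
```

Changing the weights shifts the ranking between commuter corridors and dense residential areas, which is the trade-off described above.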

 

Furthermore, the chatbot's interactive interface allows users to input specific preferences or requirements for EV charging, such as proximity to certain landmarks or availability of fast charging options. The chatbot can then generate customized recommendations that align with the user's preferences and optimize the overall charging infrastructure network.

 

In conclusion, our solution effectively addresses the challenge of EV charging and infrastructure by leveraging AI algorithms and analyzing geographical data. By optimizing the placement of EV chargers based on traffic patterns and population density, we can ensure that charging infrastructure is strategically located to cater to the needs of EV owners. This not only facilitates the widespread adoption of electric vehicles but also contributes to the decarbonization goals of reducing greenhouse gas emissions from the transportation sector.

# Third Checkpoint: The need for longer-term backup capacity
Our proposed solution addresses the need for longer-term backup capacity through AI models and predictive analysis. We developed an AI model that uses historical data from two solar plants in India covering a 34-day period. This data includes power generation patterns, solar irradiance levels, weather conditions, and other relevant factors.

 

By training the AI model on this dataset, we can accurately predict future power generation from the solar plants. This enables us to anticipate periods of high or low solar energy availability, which is crucial for planning backup capacity. For instance, if the AI model predicts a period of low solar power generation due to cloudy weather or other factors, it alerts the chatbot and prompts recommendations for alternative energy sources or backup systems to be activated during that period.
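
The alerting logic described above can be sketched as a comparison between forecast generation and expected load. This is a minimal sketch under stated assumptions: `backup_alerts`, the 10% reserve margin, and the kW figures are hypothetical and do not come from the solar model.

```python
def backup_alerts(forecast_kw, expected_load_kw, reserve_margin=0.1):
    """Return (period index, shortfall in kW) for periods that need backup.

    A period needs backup when forecast generation is below the expected
    load plus the reserve margin.
    """
    alerts = []
    for i, (gen, load) in enumerate(zip(forecast_kw, expected_load_kw)):
        required = load * (1 + reserve_margin)
        if gen < required:
            alerts.append((i, round(required - gen, 1)))
    return alerts


# Illustrative forecast for four consecutive periods, e.g. as clouds move in
forecast_kw = [800.0, 650.0, 300.0, 120.0]
expected_load_kw = [500.0, 500.0, 500.0, 500.0]

alerts = backup_alerts(forecast_kw, expected_load_kw)
for period, shortfall in alerts:
    print(f"Period {period}: activate backup for about {shortfall} kW")
```

The chatbot could turn each alert into a recommendation, for example to schedule backup capacity for the flagged periods.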

 

The chatbot interacts with users in real-time, providing information and suggestions based on the AI model's predictions. It can suggest the activation of backup systems or advise on energy conservation measures during periods of reduced solar power generation. By incorporating these predictions and recommendations into the chatbot's functionality, we enable electrical companies to plan for and manage longer-term backup capacity effectively.

 

In conclusion, our solution addresses the need for longer-term backup capacity by utilizing AI models and predictive analysis. By training the AI model on data from solar plants in India, we can accurately forecast power generation in the future. This information allows the chatbot to provide timely recommendations and assist electrical companies in planning backup capacity during periods of reduced solar energy availability. With our solution, companies can optimize their backup systems, ensure uninterrupted power supply, and contribute to a more reliable and sustainable energy infrastructure.

[Link to Dataset](https://www.kaggle.com/datasets/anikannal/solar-power-generation-data)

In [118]:
# Standard Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from time import perf_counter
from sys import getsizeof

# import for improving a colorbar
from matplotlib.colors import rgb2hex, Normalize
from matplotlib import rcParams

# Machine learning imports
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import TimeSeriesSplit, cross_validate
from sklearn.linear_model import LinearRegression, SGDRegressor, Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR, LinearSVR
from sklearn.pipeline import Pipeline
from sklearn.base import clone
from sklearn.metrics import mean_squared_error, r2_score

# Statistical aids
from scipy.stats import kurtosis, skew

# initialize settings
from cycler import cycler
rcParams['axes.prop_cycle'] = cycler(
                                color=['navy','orange','k','b',
                                       'y','pink', 'magenta','cyan',
                                       'r','midnightblue',]
                                    )

# for plotting the residuals distribution
from ipywidgets import widgets, interact

In [119]:
# prod_p1 is for: Production of plant #1 
prod_p1 = pd.read_csv('solar_data/Plant_1_Generation_Data.csv')
prod_p1.head()

In [120]:
prod_p1.info()

In [121]:
prod_p1.describe()

In [122]:
# Histogram of all non zero values
sample_df = prod_p1.loc[:, prod_p1.columns != 'PLANT_ID']
sample_df = sample_df[(prod_p1['DAILY_YIELD'] > 0) & (prod_p1['AC_POWER'] > 0)]
sample_df.hist(figsize=(10,10))
plt.show()

## Observations:

In [123]:
def optimize_formats(df):
    
    if df.columns[0].isupper():
        initial_size = getsizeof(df)
        supp_data = dict({'plant_id':0,'source_key':[]})
        
        # change column names to lowercase
        df.columns = [col.lower() for col in df.columns]

        # encode "source_key" into integers, store original "source_key" in separate variable
        encoder = LabelEncoder()
        encoder.fit(np.unique(df['source_key']))
        df['source_key']= np.array(encoder.transform(df['source_key'].values),dtype=np.int8)
        supp_data['source_key'] = encoder.classes_

        # delete "plant_id" column and store its value in an external variable
        plant_id = df['plant_id'].values[0]
        df.drop(columns=['plant_id'],inplace=True)
        supp_data['plant_id'] = plant_id

        # change 'date_time' from string to pd.Timestamp
        df['date_time'] =pd.to_datetime(df['date_time'].values, dayfirst=True)
        final_size = getsizeof(df) + getsizeof(supp_data)
        print(f'Initial size: {initial_size/1e6:.2f} Mb')
        print(f'Final size:    {final_size/1e6:.2f} Mb')
        print(f"Memory footprint reduction: {(initial_size - final_size)/initial_size*100:.2f}%")
    else:
        raise ValueError("Formats already optimized!")
    return df, supp_data

In [124]:
prod_p1, prod_p1_supp_data = optimize_formats(prod_p1)

In [125]:
prod_p1_supp_data['source_key'],prod_p1_supp_data['plant_id']

In [126]:
prod_p1_dc_power = prod_p1[prod_p1['dc_power'] > 0]['dc_power'].values
prod_p1_ac_power = prod_p1[prod_p1['ac_power'] > 0]['ac_power'].values

prod_p2 = pd.read_csv('solar_data/Plant_2_Generation_Data.csv')
prod_p2_dc_power =prod_p2[prod_p2['DC_POWER'] > 0]['DC_POWER'].values
prod_p2_ac_power =prod_p2[prod_p2['AC_POWER'] > 0]['AC_POWER'].values

data = [prod_p1_dc_power,prod_p1_ac_power,prod_p2_dc_power,prod_p2_ac_power]
labels = ['Plant 1: DC_POWER','Plant 1: AC_POWER',
            'Plant 2: DC_POWER','Plant 2: AC_POWER']
plt.figure(figsize=(10,3))
patches = plt.boxplot(data ,labels=labels,vert=False, patch_artist=True)
patches['boxes'][0].set_facecolor( '#FF7722')
plt.xticks([0,np.median(prod_p1_ac_power),prod_p1_ac_power.max(),
            np.median(prod_p1_dc_power),prod_p1_dc_power.max()])
plt.title("Scale Comparison AC & DC Power for the Two plants")

plt.show()


In [127]:
plant2_eff = 100*np.max(prod_p2_ac_power)/np.max( prod_p2_dc_power)
print(f"Power ratio AC/DC (Efficiency) plant #2: {plant2_eff:0.3f}%")
plant1_eff = 100*np.max(prod_p1_ac_power)/np.max(prod_p1_dc_power )
print(f"Power ratio AC/DC (Efficiency) plant #1:  {plant1_eff:0.3f}%")
print(f"Eff_plant_2/Eff_plant_1 (using max values): {plant2_eff/plant1_eff:.3f}")

In [128]:
efficiency = (np.mean(prod_p2_ac_power)/np.mean(prod_p2_dc_power))/(np.mean(prod_p1_ac_power)/np.mean(prod_p1_dc_power))
print(f"Scale ratio comparison ( using mean values ): {efficiency.mean():.3f}")
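
The reasoning behind comparing AC/DC ratios across plants can be shown with a small worked example. If one plant logs DC power on a scale ten times larger than the other, the ratio of the two AC/DC ratios comes out close to 10. The readings below are made up for illustration and `ac_dc_ratio` is a hypothetical helper.

```python
def ac_dc_ratio(ac_values, dc_values):
    """Ratio of mean AC output to mean DC input."""
    mean_ac = sum(ac_values) / len(ac_values)
    mean_dc = sum(dc_values) / len(dc_values)
    return mean_ac / mean_dc


# Made-up readings: plant A logs DC about ten times larger than plant B
plant_a_ac, plant_a_dc = [95.0, 190.0, 285.0], [1000.0, 2000.0, 3000.0]
plant_b_ac, plant_b_dc = [96.0, 192.0, 288.0], [100.0, 200.0, 300.0]

ratio_a = ac_dc_ratio(plant_a_ac, plant_a_dc)
ratio_b = ac_dc_ratio(plant_b_ac, plant_b_dc)
scale = ratio_b / ratio_a

print(f"Plant A AC/DC: {ratio_a:.3f}")
print(f"Plant B AC/DC: {ratio_b:.3f}")
print(f"Scale ratio:   {scale:.2f}")
```

A scale ratio near a power of ten points to a unit or logging mismatch rather than a real efficiency difference, which would motivate rescaling one plant's `dc_power`.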

In [129]:
# rescaling dc_power of plant #1
def scale_dc_power(df):
    df['dc_power'] = df['dc_power'].values/10
    return df

In [130]:
prod_p1 = scale_dc_power(prod_p1)

In [131]:
prod_p1_dc_power = prod_p1['dc_power'][prod_p1['dc_power']>0].values
data2 = [prod_p1_dc_power,prod_p1_ac_power,prod_p2_dc_power,prod_p2_ac_power]
plt.figure(figsize=(6,3))
patches = plt.boxplot(data2,showfliers=False, vert=False,labels=labels, patch_artist=True)
median_2 = np.median(np.hstack(data2))
max_val_2 = prod_p1_dc_power.max()
plt.xticks([0,median_2,max_val_2])
plt.title("Corrected scale AC & DC Power for the Two plants")
# fig.tight_layout(pad=5)
plt.show()

In [132]:
def plot_inverter(df, supp_data, key_n, start=0, end=3400):
    '''
    Visualizes the ac_power, dc_power, daily_yield and total_yield of
    an inverter in a given datetime interval
    
    returns: List of pyplot.axes for the 4 variables.
    
    Parameters
    ----------
    df       : Pandas DataFrame with the production data
    supp_data: dict returned by optimize_formats, holding the original source_key labels
    key_n    : int from 0 to 21 representing the inverter number
    start    : int (0-3400) row position where the plotted interval starts
    end      : int (0-3400) row position where the plotted interval ends
    '''
    
    df = df.copy()
    fig_size = (12,12)
    fig = plt.figure(figsize=fig_size)
    for i,item in enumerate(['ac_power','dc_power','daily_yield',
                             'total_yield']):
        xsize,ysize = fig_size
        key_data = df[df['source_key'] == key_n].iloc[start:end]
        plt.subplot(4,1,i+1)
        plt.plot(key_data['date_time'].values, key_data[item].values,
                 linewidth=1.5,alpha=.4)
        ymin = key_data[item].values.min()
        ymax = key_data[item].values.max() 
        plt.yticks(np.linspace(ymin, ymax, 5))
#       plt.xticks(key_data['date_time'],key_data['date_time'],rotation=90)
        plt.xticks([]) # plotting the xlabels takes too much time !!!

        key  = supp_data['source_key'][key_n]
        start_date= pd.to_datetime(key_data['date_time'].iloc[0]).date()
        end_date  = pd.to_datetime(key_data['date_time'].iloc[-1]).date()
        text=f'{item}, inverter #{key_n} ({key})\nfrom {start_date} to {end_date}'
        plt.title(text, fontsize=12)
    fig.subplots_adjust(hspace=0.3)
#     plt.savefig(f'inverter_{key}.png')
    return fig.axes


In [133]:
# a sample taken from one of the inverters
# in the arrows dict, xy are points given as fractions of the x and y axis
# in xy are 4 matrices, one for each plot 'ac_power', 'dc_power',
#         'daily_yield' & 'total_yield'

arrows =   dict({
            'saffron_arrows':
            dict({'color':'#FF7722','angle':140,
                  'xy': [[[.280,.1]],
                         [[.280,.1]],
                         [[.295,.1],[.785,.05]],
                         []]}),
            'green_arrows':
            dict({'color':'g','angle':15,
                  'xy':[[],
                        [],
                        [[.39,.79],[.64,.85],[.7,.73],[.82,.88]],
                        []]})
                })

def annotate_arrows(axes,arrows):
    for j,ax in enumerate(axes):
        for i,arrow in enumerate(arrows.values()):
            for xy_raw in arrow['xy'][j]:
                rad = lambda angle: angle/360*2*np.pi
                xmin,xmax = ax.get_xlim()
                xspan = xmax-xmin
                ymin,ymax = ax.get_ylim()
                yspan = ymax-ymin
                xy_new = np.multiply(xy_raw,[xspan,yspan])+np.array([xmin,ymin])
                xy_text= np.array(xy_new)+ np.array([
                                    0.041*np.cos(rad(arrow['angle']))*xspan,
                                    0.250*np.sin(rad(arrow['angle']))*yspan])
                ax.annotate("",xy=xy_new,xytext=xy_text,
                            arrowprops=dict(arrowstyle="->",mutation_aspect=1.2,
                            mutation_scale=15,color=arrow['color'],lw=3))

In [134]:
#values for key_n= 14, start=200, end=1500
axes = plot_inverter(prod_p1,prod_p1_supp_data,14,start=200,end=1500)
annotate_arrows(axes, arrows)

In [135]:
def plot_all_ac_power(prod_df):
    '''Plots all values of ac_power against the time of day to show
    when the solar power plant begins and ends production,
    so that day can be separated from night'''
    
    date_time = pd.DatetimeIndex(prod_df['date_time'].values)
    xlabels = np.unique(date_time.strftime("%H:%M"))
    date_time = date_time.hour + date_time.minute/60
    xticks =  np.unique(date_time)
    ac_power = prod_df['ac_power'].values
    plt.figure(figsize=(11,5))
    ax = plt.gca()
    ax.scatter(date_time,ac_power,s=1)
    ax.set_xticks(xticks)
    ax.set_xticklabels(xlabels,rotation=90,fontsize=11)
    # not necessary to show all the zero values
    xmin, xmax = ax.get_xlim()
    xmin = xmin + (xmax-xmin)*0.22
    xmax = xmax - (xmax-xmin)*0.26
    plt.gca().set_xlim((xmin,xmax))
    plt.title("ac_power by hour of the day for all inverters")
    plt.show()

In [136]:
plot_all_ac_power(prod_p1)

In [137]:
date_time = pd.DatetimeIndex(prod_p1['date_time'].values)
date_time = date_time.minute/60 + date_time.hour
get_hr = lambda s : int(s.split(":")[0])+int(s.split(":")[1])/60
for s in ["05:45","06:00","18:30","18:45"]:
    answer = (prod_p1[date_time==get_hr(s)]['ac_power']>0).any()
    print(f"Is there any ac_power > 0 at {s} ?", answer)
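
The check above probes a few hand-picked times. A complementary sketch, using only the standard library, derives the earliest and latest time of day with non-zero output directly from the readings. The function `production_window` and the sample readings are hypothetical.

```python
from datetime import datetime


def production_window(readings):
    """Earliest and latest time of day (HH:MM) with power above zero.

    `readings` is an iterable of (timestamp, power) pairs; returns None
    when no reading is above zero.
    """
    active = sorted(ts.strftime("%H:%M") for ts, power in readings if power > 0)
    if not active:
        return None
    return active[0], active[-1]


# Illustrative readings across two days
readings = [
    (datetime(2020, 5, 15, 5, 45), 0.0),
    (datetime(2020, 5, 15, 6, 0), 12.5),
    (datetime(2020, 5, 15, 12, 0), 830.0),
    (datetime(2020, 5, 15, 18, 30), 4.2),
    (datetime(2020, 5, 15, 18, 45), 0.0),
    (datetime(2020, 5, 16, 6, 15), 9.1),
]

window = production_window(readings)
print(window)
```

Zero-padded `HH:MM` strings sort in chronological order, so no extra time parsing is needed.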

In [138]:
# visualize daily zeros

arrows =   dict({
            'saffron_arrows1':
            dict({'color':'#FF7722','angle':140,
                  'xy': [[[.12,.05]],
                         [[.12,.05]],
                         [],
                         []]}),
            'saffron_arrows2':
            dict({'color':'#FF7722','angle':20,
                  'xy': [[[.77,.05]],
                         [[.77,.05]],
                         [],
                         []]}),
            'green_arrows':
            dict({'color':'b','angle':120,
                  'xy':[[],
                        [],
                        [[.125,.45],[.76,.32]],
                        [[.125,.12],[.76,.75]]]})
                })
axes = plot_inverter(prod_p1,prod_p1_supp_data,11,start=2070,end=3030)
annotate_arrows(axes,arrows)

In [139]:

def find_missing(df):
    '''
    Find missing timestamps and their corresponding missing inverters.
    find_missing(df: pandas.DataFrame)
    
    Returns a pandas.DataFrame containing the columns:
        "date_time": Timestamps missing from "date_time", or
                     Timestamps that are present but lack some inverters.
        "source_key": the "source_key" values of the missing inverters
                     for a given Timestamp.
    Parameters
    ----------
    df: production data of the solar plant'''
    
    missing_data = pd.DataFrame({})
    df = df.copy()
    
    date_time, dt_count = np.unique(df['date_time'].values,return_counts=True)
    dt_range  = pd.date_range(start=date_time[0],
                              end=date_time[-1],freq="15min")
    key = np.unique(df['source_key'].values)
    
    # find the datetimes that are not present in the datetimes of the df
    missing_data['date_time'] = dt_range[np.isin(dt_range,date_time)==False]
    source_key = [key for _ in range(len(missing_data['date_time']))]
    missing_data['source_key'] = source_key
    
    # find which inverters are missing in the datetimes that have < 22 invs.   
    dt_with_missing = date_time[dt_count < 22]
    df_missing = df[np.isin(df['date_time'].values,dt_with_missing)]
    for dt in dt_with_missing:
        present_key= df_missing[df_missing['date_time']==dt]['source_key'].values
        missing_key= key[np.isin(key,present_key)==False]
        missing_data = pd.concat([missing_data, pd.DataFrame.from_dict({'date_time':dt,
                            'source_key':missing_key})], ignore_index=True)
#         missing_data = missing_data.concat({'date_time':dt,
#                             'source_key':missing_key}, ignore_index=True)
    missing_data.sort_values(by=['date_time'],ignore_index=True,inplace=True)
    return missing_data
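
As a cross-check of what `find_missing` reports, the sketch below rebuilds the full grid of expected (timestamp, inverter) pairs on toy data and subtracts the pairs that are present. `toy` and `missing_pairs` are illustrative names with made-up values, and the 15-minute sampling is assumed from the `freq="15min"` used above. Unlike `find_missing`, it returns one row per missing pair and does not hardcode the number of inverters.

```python
import pandas as pd

# Toy production log (made-up values): 2 inverters sampled every 15 minutes.
# Inverter 1 is absent at 00:15 and the whole 00:30 timestamp is absent.
toy = pd.DataFrame({
    "date_time": pd.to_datetime([
        "2020-05-15 00:00", "2020-05-15 00:00",
        "2020-05-15 00:15",
        "2020-05-15 00:45", "2020-05-15 00:45",
    ]),
    "source_key": [0, 1, 0, 0, 1],
})


def missing_pairs(df, freq="15min"):
    """Return one row per (date_time, source_key) pair absent from df."""
    full_range = pd.date_range(df["date_time"].min(),
                               df["date_time"].max(), freq=freq)
    keys = sorted(int(k) for k in df["source_key"].unique())
    # Every pair that a complete log would contain.
    expected = {(t, k) for t in full_range for k in keys}
    # Every pair that the log actually contains.
    present = set(zip(df["date_time"], df["source_key"]))
    return pd.DataFrame(sorted(expected - present),
                        columns=["date_time", "source_key"])


gaps = missing_pairs(toy)
print(gaps)
```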

In [140]:
prod_p1_missing_data = find_missing(prod_p1)

In [141]:
def fill_night_missing(df, missing_data):
    '''Fill the records missing at night: dc_power, ac_power and
    daily_yield are set to 0, and total_yield takes the nearest
    previous value.'''
    
    cols = ['date_time','source_key','dc_power','ac_power',
            'daily_yield','total_yield'] 
    new_entries = pd.DataFrame([],columns=cols)
    night_ix = pd.DatetimeIndex(missing_data['date_time'].values)
    night_ix = night_ix.hour + night_ix.minute/60
    night_ix = (night_ix < 6.00) | (night_ix > 18.5)
    night_missing = missing_data[night_ix].explode('source_key')
    for key in np.unique(night_missing['source_key'].values):
        key_data = df[df['source_key']==key]
        
        #source_key missing during the night time = skmnt
        skmnt = night_missing[night_missing['source_key']==key]['date_time'].values
        skmnt = pd.DatetimeIndex(skmnt)
        index = np.searchsorted( key_data['date_time'].values,skmnt)
        nearest_total_yield = key_data.iloc[index-1]['total_yield']
        source_key  = np.full(len(index),key, dtype=np.int8)
        zeros = np.full(len(index),0.0, dtype=np.float32)
        total_yield = np.full(len(index),nearest_total_yield, dtype=np.float32)
        intermediate_df = pd.DataFrame( zip(skmnt, source_key,
                                            zeros, zeros, zeros, total_yield),
                                            columns = cols)
#         new_entries = new_entries.append(intermediate_df,ignore_index=True )
        new_entries = pd.concat([new_entries, intermediate_df], ignore_index=True)
    new_entries['source_key'] = np.array(new_entries['source_key' ].values, np.int8)
#     df = df.append(new_entries,ignore_index=True)
    df = pd.concat([df, new_entries], ignore_index=True)
    df.sort_values(by=['date_time'],ascending=True,ignore_index=True, inplace=True)
    return df

def std_daily_yield(df):
    'replace daily_yield with zero in hours between 18:30 and 6:00'
    
    date_time = pd.DatetimeIndex(df['date_time'].values)
    date_time = date_time.hour + date_time.minute/60
    ix = np.multiply((date_time <= 18.5), (date_time >= 6))
    df['daily_yield'] = np.multiply(df['daily_yield'].values,ix)
   
    return df

In [142]:
def remove_daytime_zeros(prod_df):
    date_time = pd.DatetimeIndex(prod_df['date_time'].values)
    date_time = date_time.hour + date_time.minute/60
    day_ix = (date_time > 6.5) & (date_time < 18)
    daytime_zeros = day_ix & (prod_df['ac_power'] <= 0)
    return prod_df[daytime_zeros == 0]

In [143]:
prod_p1 = remove_daytime_zeros(prod_p1)
prod_p1 = fill_night_missing(prod_p1, prod_p1_missing_data)
prod_p1 = std_daily_yield(prod_p1)

In [144]:
# Therefore the first 4 values will be dropped for this inverter

prod_p1.drop(index=prod_p1[prod_p1['source_key']==7].iloc[:4].index,inplace=True)
plt.figure(figsize=(12,3))
sample = prod_p1[prod_p1['source_key']==7]['total_yield'].values
plt.plot(range(len(sample)),sample)
plt.show()
sample[0:6]

In [145]:
p1ws = pd.read_csv('solar_data/Plant_1_Weather_Sensor_Data.csv')
p1ws.head()

In [146]:
p1ws.info()
p1ws.describe()

In [147]:
def optimize_formats_ws(df):
    if df.columns[0].isupper():
        initial_size = getsizeof(df)
        supp_data = dict({'plant_id':0,'sensors':[]})

        # drop "source_key" , store original "source_key" in separate variable
        supp_data['sensors'] = df['SOURCE_KEY'].values[0]
        df.drop(columns=['SOURCE_KEY'], inplace=True)

        # drop the "plant_id" column and store its value in an external variable
        supp_data['plant_id'] = df['PLANT_ID'].values[0]
        df.drop(columns=['PLANT_ID'], inplace=True)

        # change column names to lowercase, rename columns to shorter name
        df.rename(columns={'DATE_TIME':'date_time','AMBIENT_TEMPERATURE': 'ambient_t',
                           'MODULE_TEMPERATURE':'module_t', 'IRRADIATION':'irradiation'},
                 inplace=True)

        # change 'date_time' from string to pd.Timestamp
        df['date_time'] = pd.DatetimeIndex(df['date_time'].values, dayfirst=True)
        final_size = getsizeof(df) + getsizeof(supp_data)
        print(f'Initial size: {initial_size/1e6:.2f} Mb')
        print(f'Final size:    {final_size/1e6:.2f} Mb')
        print(f"Memory footprint reduction: {(initial_size - final_size)/initial_size*100:.2f}%")
    else:
        raise ValueError("Formats already optimized!")
    return df, supp_data

In [148]:
p1ws, p1ws_supp_data = optimize_formats_ws(p1ws)

In [149]:
p1ws.head(3)

In [150]:
p1ws_supp_data['sensors'],p1ws_supp_data['plant_id']

In [151]:
prod_p2 = pd.read_csv('solar_data/Plant_2_Generation_Data.csv')
prod_p2[prod_p2['DC_POWER']>0].head()

In [152]:
prod_p2[prod_p2['DC_POWER']>0].tail()

In [153]:
prod_p2, prod_p2_supp_data = optimize_formats(prod_p2)

In [154]:
prod_p2_missing_data = find_missing(prod_p2)

In [155]:
# Fill the night-missing time-stamps with zeros for plant #2
prod_p2 = fill_night_missing(prod_p2, prod_p2_missing_data)

# Update the missing values after the night-time gaps are filled
prod_p2_missing_data = find_missing(prod_p2)

# Standardize daily_yield to be zero when there is no longer production
prod_p2 = std_daily_yield(prod_p2)

# Remove zero or negative outliers from daytime
prod_p2 = remove_daytime_zeros(prod_p2)

# Update the missing values after the daytime zeros are removed
prod_p2_missing_data = find_missing(prod_p2)

In [156]:
def filter_total_yield_anomalies(prod_df, variable):
    new_ix = [] #list of the indexes without anomalies
    for key in np.unique(prod_df['source_key'].values):
        key_df = prod_df[prod_df['source_key']==key]
        field = key_df[variable].values
        if variable == 'daily_yield':
            field = np.cumsum(field)
        filter_ix = (field[1:]-field[0:-1]) < 0
        present_outliers = np.any(filter_ix == True)
        while present_outliers:
            dt = pd.DatetimeIndex(key_df['date_time'].values)
            field = key_df[variable].values
            if variable == 'daily_yield':
                field = np.cumsum(field)
            filter_ix = np.hstack([[False],(field[1:]-field[0:-1]) < 0])
            key_df = key_df[filter_ix == False]
            field = key_df[variable].values
            filter_ix = np.hstack([(field[1:]-field[0:-1]) < 0,[False]])
            present_outliers = np.any(filter_ix == True)
        new_ix += list(key_df.index)
    new_ix = list(np.sort(new_ix))
    init = len(prod_df.index)
    end  = len(new_ix)
    print(f'Initial Dataframe Length: {init}')
    print(f'Final Dataframe length: {end}')
    print(f'filtered out records: {init-end} ({(init-end)/init*100:.1f}%)')
    return prod_df.loc[new_ix,:]
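
The core of `filter_total_yield_anomalies` is that a cumulative counter such as `total_yield` should never decrease, so any record lower than its predecessor is dropped and the check is repeated until none remain. The sketch below shows that loop on a plain array. `drop_decreasing` is a hypothetical helper written for illustration, and the values are made up.

```python
import numpy as np


def drop_decreasing(values):
    """Repeatedly drop every element that is lower than the kept element before it."""
    values = np.asarray(values, dtype=float)
    keep = np.ones(len(values), dtype=bool)
    while True:
        kept_positions = np.flatnonzero(keep)
        # True where a kept element is smaller than the previous kept element.
        drops = np.diff(values[kept_positions]) < 0
        if not drops.any():
            return values[keep]
        # Discard the later element of each decreasing pair, then re-check.
        keep[kept_positions[1:][drops]] = False


dip = drop_decreasing([100, 101, 50, 102, 103])
spike = drop_decreasing([100, 200, 101, 102, 103])
print(dip, spike)
```

The second call illustrates a property of this approach: a single spuriously high reading makes every later, lower reading look like a decrease, so all of them are removed.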

In [157]:
prod_p2 = filter_total_yield_anomalies(prod_p2,'total_yield')
prod_p1 = filter_total_yield_anomalies(prod_p1,'total_yield')

In [158]:
def filter_daily_yield_anomalies(prod_df):
    df = prod_df.copy()
    initial_size = len(df)
    
    def get_filter_values(sub_df):
        dt = pd.DatetimeIndex(sub_df['date_time'].values)
        daily_yield = sub_df['daily_yield'].values
        delta = np.hstack([[0],daily_yield[1:]-daily_yield[0:-1]])
        # number of 15-minute (900 s) periods elapsed between consecutive records
        periods = np.hstack([ [1], (dt[1:] - dt[0:-1]).total_seconds()/900])
        delta = delta/periods
        return (delta < 0) & (daily_yield != 0)
        
    for key in np.unique(df['source_key'].values):
        key_df = df[df['source_key'] == key]
        yield_filter = get_filter_values(key_df)
        ix_to_drop = []
        while np.any(yield_filter):
            key_ix = key_df.index
            bad_ix = [ key_ix[key_ix < ix][-1] for ix in  key_ix[yield_filter]]
            ix_to_drop += bad_ix
            key_df = key_df.drop(bad_ix)
            yield_filter = get_filter_values(key_df)
        df.drop(ix_to_drop, inplace=True)
    print(f"Dataset reduction: {(initial_size-len(df))/initial_size*100:0.2f}%")
    return df

In [159]:
prod_p2 = filter_daily_yield_anomalies(prod_p2)
prod_p1 = filter_daily_yield_anomalies(prod_p1)

In [160]:
prod_p2_missing_data = find_missing(prod_p2)
prod_p2 = fill_night_missing(prod_p2, prod_p2_missing_data)
prod_p2_missing_data = find_missing(prod_p2)

prod_p1_missing_data = find_missing(prod_p1)
prod_p1 = fill_night_missing(prod_p1, prod_p1_missing_data)
prod_p1_missing_data = find_missing(prod_p1)

In [161]:
p2ws = pd.read_csv('solar_data/Plant_2_Weather_Sensor_Data.csv')

p2ws, p2ws_supp_data = optimize_formats_ws(p2ws)

In [162]:
# visualization of the relationship between power production, temperature, and irradiation

def power_vs_irr_vs_temp(sensor_df,prod_df,key):
    key_df = pd.merge(left=sensor_df,right=prod_df,
                      how='inner',on='date_time')
    key_df = key_df[key_df['source_key']==key]
    date_time = pd.DatetimeIndex(key_df['date_time'].values)
    date_time = date_time.hour + date_time.minute/60
    day_ix = (date_time >= 6) & (date_time <= 18.5)
    key_df = key_df[day_ix]
    key_df.drop(columns=key_df.columns.difference(['date_time','ambient_t',
                    'module_t','irradiation','ac_power']),inplace=True)
    key_df = key_df.sort_values(by=['irradiation'])
    fig = plt.figure(figsize=(16,6))
    plt.subplot(1,2,1)
    ax = plt.gca()
    color_index = np.unique(key_df['irradiation'].values/(1.1*key_df['irradiation'].max())+0.1)
    colors = plt.cm.Greens(X=color_index)[-1::-1]
    ax.scatter(key_df['ambient_t'],key_df['ac_power'],color= colors,alpha=1)
    ax.set_title(f"AC Power vs Ambient Temperature, inverter #{key}")
    ax.set_xlabel("Ambient Temperature (ºC)")
    ax.set_ylabel("AC Power (kW)",labelpad=30)
    ax.yaxis.get_label().set_rotation(-90)
    plt.colorbar( plt.cm.ScalarMappable(norm=Normalize(vmax=0.0,vmin=1.2),
                        cmap='Greens_r'), ticks=np.arange(0,1.21,0.2),label='Irradiation (W / $m^2$)')

    fig.add_subplot(1,2,2,projection='3d')
    ax = plt.gca()
    ax.scatter(key_df['ambient_t'],key_df['irradiation'].values,key_df['ac_power'])
    ax.view_init(elev=30,azim=-140)
    ax.set_xlabel('Ambient Temperature (ºC)')
    ax.set_ylabel('Irradiation (W / $m^2$)')
    ax.set_zlabel('AC Power (kW)')
    ax.zaxis.get_label().set_rotation(1)
    ax.set_title(f'AC Power vs ambient_t & Irradiation, inverter #{key}')
    plt.show()

In [163]:
power_vs_irr_vs_temp(p1ws,prod_p1,21)

In [164]:
def create_new_features(prod_df,sensor_df,key_n):
    df = prod_df.copy()[prod_df['source_key']==key_n]
    df = pd.merge(left=sensor_df,right=df,how='inner',on='date_time')
    hours = pd.DatetimeIndex(df['date_time'].values)
    hours = hours.hour + hours.minute/60
    df = df[(hours >= 6) & (hours <= 18.5)]
    df.rename(columns={'irradiation':'Gir','ac_power':'pac', 
                       'ambient_t': 'Ta'}, inplace=True)
    df = df[['date_time','source_key','Gir','Ta','pac']]
    df['hours'] = hours[(hours >= 6) & (hours <= 18.5)]
    df['Gir^3'] = np.power(df['Gir'],3)
    df['Gir^2'] = np.power(df['Gir'],2)
    df['Ta^2']  = np.power(df['Ta'],2)
    df['Gir^2.Ta'] = np.multiply(df['Gir^2'],df['Ta'])
    df['Gir.Ta^2'] = np.multiply(df['Gir'],df['Ta^2'])
    df['Gir.Ta'] =  np.multiply(df['Gir'],df['Ta'])
    df = df[['date_time','hours','Gir^3','Gir^2','Gir^2.Ta','Gir.Ta^2',
                     'Gir.Ta','Gir','pac']]
                     
    return df

In [165]:
prod_p1_new_data = create_new_features(prod_p1,p1ws,1)

In [166]:
def split_data(df):
    X = df.drop(columns=['hours','pac'])
    pac = df['pac'].values
    train_size = int(0.8*len(pac))
    X_train, X_test = X.iloc[:train_size,:], X.iloc[train_size:,:]
    pac_train, pac_test = pac[:train_size], pac[train_size:]
    # filter out pac=0 from the test set
    ix = pac_test > 0
    pac_test = pac_test[ix]
    X_test = X_test[ix]
    return X_train, X_test, pac_train, pac_test


In [167]:
X_train, X_test, pac_train, pac_test = split_data(prod_p1_new_data)

In [168]:
def find_best_regressor(X_train, pac_train):
    X_train = X_train.drop(columns=['date_time']).values
    models = [
            ('Linear', LinearRegression(fit_intercept =False, n_jobs=-1)),
            ('Ridge', Ridge(fit_intercept =False, solver= 'lsqr', random_state=1973)),
            ('DTree', DecisionTreeRegressor(random_state=1973)),
            ('RForest', RandomForestRegressor(random_state=1973, n_jobs=-1)),
            ('KNReg', KNeighborsRegressor(n_neighbors=5,n_jobs=-1)),
            ]
    scores = dict({})
    ts_split = TimeSeriesSplit(n_splits=6)
    for name,model in models:
        folds = ts_split.split(X_train, pac_train)
        regressor = clone(model)
        cv_scores = cross_validate(regressor, X_train, pac_train,
                                   cv=folds, scoring='neg_mean_squared_error')
        scores[name.rjust(10," ")] = np.mean(cv_scores['test_score'])
    scores = pd.DataFrame(scores.items(),columns=['regressor','Neg_MSE'])
    scores = scores.sort_values(by=['Neg_MSE'],ascending=False,ignore_index=True)
    return scores
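
`find_best_regressor` scores each model with `TimeSeriesSplit(n_splits=6)`, so that no fold is trained on samples that come after its test samples. The sketch below is a hand-rolled illustration of that expanding-window idea on sample positions only. `expanding_window_splits` is a hypothetical helper written for this note and is not scikit-learn's implementation.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_positions, test_positions) with a growing training window."""
    test_size = n_samples // (n_splits + 1)
    if test_size == 0:
        raise ValueError("not enough samples for the requested number of splits")
    first_test = n_samples - n_splits * test_size
    for start in range(first_test, n_samples, test_size):
        # Train on everything before the test block, test on the block itself.
        yield list(range(start)), list(range(start, start + test_size))


splits = list(expanding_window_splits(n_samples=14, n_splits=6))
for train, test in splits:
    print(f"train: 0..{train[-1]:>2}   test: {test}")
```

Each training window ends exactly where its test block begins, which is why the order of the rows (sorted by `date_time`) matters for this kind of validation.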

In [169]:
find_best_regressor(X_train, pac_train )

In [170]:
def construct_models(prod_df,sensor_df):
    models = []
    predictions_df = pd.DataFrame({})
    rmse = []
    def get_final_model(X, X_t, pac, pac_t):
        model = LinearRegression()
        model.fit(X, pac)
        pac_pred = model.predict(X_t).flatten()
        return model, pac_pred, mean_squared_error(pac_t, pac_pred)
    
    for key in np.unique(prod_df['source_key'].values):
        new_features = create_new_features(prod_df,sensor_df,key)
        X_train, X_test, pac_train, pac_test = split_data(new_features)
        date_time = X_test['date_time'].values
        X_train = X_train.drop(columns=['date_time']).values
        X_test  = X_test.drop(columns=['date_time']).values
        model, pac_predicted, mse = get_final_model(X_train, X_test,
                                                    pac_train, pac_test)
        models += [model]
        rmse += [np.sqrt(mse)]
        r2score = int(100*r2_score(pac_test,pac_predicted))
        residuals = (pac_predicted-pac_test)
        str1 = f'#{str(key).rjust(2)}   --->   '
        str2 = f' rmse: {int(np.sqrt(mse))}'
        str3 = f',      R^2 (Determination coeff.): {r2score}%'
        print( str1 + str2 +str3)
        predictions_df = pd.concat([predictions_df, pd.DataFrame.from_dict({
              'date_time':date_time,'source_key':np.full(len(residuals),key),
              'pac':pac_test, 'pac_predicted':pac_predicted,
              'residuals':residuals})])
    predictions_df.sort_values(by='date_time', ignore_index=True, inplace=True)
    return models, predictions_df, rmse
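
Taken together, `create_new_features` and `get_final_model` fit, for each inverter, an ordinary least-squares model that is polynomial in irradiation and ambient temperature. It is written out below under the assumption that `LinearRegression()` keeps its default intercept, using $G$ for `Gir`, $T_a$ for `Ta` and $\beta_i$ for the fitted coefficients (symbols chosen here for readability, not taken from the notebook):

```latex
\hat{P}_{ac} = \beta_0 + \beta_1 G^3 + \beta_2 G^2 + \beta_3 G^2 T_a
             + \beta_4 G\,T_a^2 + \beta_5 G\,T_a + \beta_6 G
```

The `Ta^2` column is computed in `create_new_features` but is not among the columns kept for the fit, so temperature enters the model only through its interactions with irradiation.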

In [171]:
( plant1_models, plant1_predictions,
plant1_rmses ) = construct_models(prod_p1, p1ws)

In [172]:
# plot AC_Power test-set and predicted AC_Power for this set.
def plot_sample(predictions_df,key):
    key_df = predictions_df[predictions_df['source_key']==key]
    plt.figure(figsize=(16,3))
    plt.plot(key_df['date_time'],key_df['pac'],label='Pac',alpha=0.7)
    plt.plot(key_df['date_time'],key_df['pac_predicted'], alpha=0.7,label="prediction")
    plt.legend(loc='best')
    plt.title(f"prediction and actual AC_POWER for inverter #{key}")
    plt.show()

In [173]:
plot_sample(plant1_predictions, 4)

In [174]:
def plot_all_residuals(prediction_df,figsize=(16,16),set_title=True):
    keys = np.unique(prediction_df['source_key'].values)
    cols = 4
    rows = (len(keys) + cols - 1)//cols  # ceiling division, no spare empty row
    if rows == 1:
        cols = len(keys)
    fig ,axes = plt.subplots(nrows=rows,ncols=cols,figsize=figsize)
    if type(axes) == np.ndarray:
        axes = axes.ravel()
    else:
        axes = [axes]
    for key,ax in zip(keys, axes[:len(keys)]):
        key_df =  prediction_df[prediction_df['source_key']==key]
        residuals = key_df['residuals'].values
        plt.subplot(rows,cols,key+1)
        ax.hist(residuals, bins=50, density=True,color='navy',
                label = 'Residuals')
        if set_title == True:
            title = f"Residuals distribution for Inv #{key}"
            ax.set_title(title)
        xmin,xmax = ax.get_xlim()
        ymin,ymax = ax.get_ylim()
        xticks=np.linspace(xmin,xmax,5)
        labels=[f'{int(x)}' for x in xticks]
        plt.xticks(ticks=xticks,labels=labels)
        # make room for the legends
        plt.gca().set_xlim((xmin,xmax + (xmax-xmin)*0.65))
        plt.vlines(xmax,ymin,ymax,color='k')
        plt.gca().set_ylim((ymin,ymax))
        
        #draw legends
        n = len(residuals)
        mu = residuals.mean()
        m2 = np.sum(np.power(residuals-mu,2))/n
        m3 = np.sum(np.power(residuals-mu,3))/n
        m4 = np.sum(np.power(residuals-mu,4))/n
        sigma  = residuals.std()
        kurt = m4/(m2**2)
        skew = m3/(m2**1.5)
        rmse = int(np.sqrt(np.sum(np.power(residuals,2))/n))
        r2score = int(100*r2_score(key_df['pac'],key_df['pac_predicted']))
        legends = [ 
            r'     $\mu$ :' + f'{mu:0.1f}'.rjust(6," "),
            r'     $\sigma$ :' +  f'{sigma:0.1f}'.rjust(6," "),
            f'   kurt:' + f'{kurt:0.1f}'.rjust(6," "),
            f'skew:' + f'{skew:0.1f}'.rjust(6," "),
            f' rmse:' + f'{rmse}'.rjust(6," "),
            f'  R^2 :' + f'{r2score}%'.rjust(6," "),]
        for i,legend in enumerate(legends):
            plt.text(xmax + (xmax-xmin)*0.05,
                     ymax-(ymax-ymin)*(0.11*(i+1)), legend,color='navy')
            
    # do not show empty plots
    for i in range(rows*cols)[len(keys):]:
        axes[i].axis('off')
    fig.subplots_adjust(hspace=0.5)
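
The legend values in `plot_all_residuals` come from the central moments of the residuals: skewness is `m3/m2**1.5` and kurtosis is `m4/m2**2`. The sketch below isolates that computation so it can be checked on a small array. `residual_moments` is a hypothetical helper, and the input values are made up.

```python
import numpy as np


def residual_moments(residuals):
    """Return mean, standard deviation, skewness and kurtosis of the residuals."""
    residuals = np.asarray(residuals, dtype=float)
    mu = residuals.mean()
    centered = residuals - mu
    # Population central moments (divide by n, as in the notebook).
    m2 = np.mean(centered ** 2)
    m3 = np.mean(centered ** 3)
    m4 = np.mean(centered ** 4)
    return {
        "mu": float(mu),
        "sigma": float(np.sqrt(m2)),
        "skew": float(m3 / m2 ** 1.5),
        "kurt": float(m4 / m2 ** 2),
    }


stats = residual_moments([-1.0, 0.0, 1.0])
print(stats)
```

The kurtosis defined this way is not the excess kurtosis: a normal distribution gives a value of 3, so residual histograms with `kurt` well above 3 have heavier tails than a Gaussian.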

In [175]:
plot_all_residuals(plant1_predictions)

In [176]:
# for plant 2

power_vs_irr_vs_temp(p2ws,prod_p2,21)

In [177]:
prod_p2_new_data = create_new_features(prod_p2, p2ws, 1)

In [178]:
( plant2_models, plant2_predictions,
plant2_rmses ) = construct_models(prod_p2, p2ws)

In [179]:
plot_sample(plant2_predictions,4)
plot_all_residuals(plant2_predictions)

# Business Plan

### Executive Summary:

 

Our company offers an AI-powered chatbot designed to optimize energy usage, support decarbonization efforts, and improve sustainability. Our primary customers are electrical companies, including prominent players such as PG&E and their international counterparts. Defense companies and the airplane industry, which are transitioning to alternative energy resources such as hydrogen power, are additional potential customer segments.

 

Our solution leverages predictive models, optimization algorithms, and real-time data analysis to provide actionable insights and recommendations. By accurately predicting energy demand, optimizing renewable energy utilization, suggesting optimal locations for EV charging stations, and addressing the need for longer-term backup capacity, our solution enables customers to reduce carbon emissions, improve energy efficiency, and meet sustainability goals.

 

The market for energy optimization and decarbonization solutions is rapidly growing, driven by the increasing global focus on mitigating climate change and transitioning to clean energy. Electrical companies, defense organizations, and the airplane industry are actively seeking innovative technologies to enhance their energy efficiency and reduce their environmental impact. Our solution caters to these market segments by offering a comprehensive AI-powered platform that aligns with their sustainability objectives.

 

### Market Analysis:

 

The market for energy optimization and decarbonization solutions is experiencing significant growth due to rising environmental concerns and increased regulatory pressures. Electrical companies are under increasing pressure to reduce their carbon footprint and transition to clean energy sources. Defense organizations and the airplane industry are also embracing sustainability initiatives and seeking solutions that optimize their energy usage while transitioning to alternative energy resources.

 

Our solution aligns perfectly with the market needs as it offers a comprehensive platform that addresses the specific challenges faced by these industries. By providing accurate predictions of energy demand, optimizing renewable energy utilization, and suggesting optimal EV charging station locations, our solution enables customers to make informed decisions that reduce emissions, improve efficiency, and align with their sustainability goals.

 

### Product Differentiation:

 

Our solution stands out from existing offerings through its advanced AI capabilities, customization options, and real-time optimization features. The chatbot's integration with predictive models and optimization algorithms provides valuable insights tailored to customers' specific needs. The ability to predict energy demand, suggest optimal renewable energy utilization, optimize EV charging station locations, and address backup capacity requirements sets us apart from competitors. Moreover, our solution's user-friendly interface and interactive nature enhance customer experience and ease of adoption.

 

The customization options offered by our solution allow customers to tailor the chatbot's recommendations and insights to their specific requirements. This flexibility ensures that the solution can adapt to different energy infrastructures and unique operational demands, making it a highly desirable choice for electrical companies, defense organizations, and the airplane industry.

 

### Customer Acquisition and Benefits:

 

Our customer acquisition strategy will focus on targeted marketing campaigns, partnerships with industry leaders, and demonstrations of our solution's effectiveness. Electrical companies will be motivated to adopt our solution to achieve their decarbonization goals, improve energy efficiency, and reduce operational costs. Defense companies and the airplane industry can benefit from our solution's ability to optimize their transition to alternative energy resources such as hydrogen power. The product is desirable because it offers real-time, data-driven recommendations that optimize energy usage, improve sustainability, and support regulatory compliance.

 

Customers adopting our solution can expect numerous benefits. Firstly, they will experience improved energy efficiency, leading to reduced operational costs and increased profitability. Secondly, our solution helps customers meet their sustainability goals by reducing carbon emissions and reliance on fossil fuels. Thirdly, the real-time optimization capabilities of our solution ensure that customers can adapt quickly to changing energy demand patterns, resulting in improved grid stability and flexibility.

 

### Revenue Model:

 

Our revenue model is subscription-based: customers pay a recurring fee to access our AI-powered chatbot and optimization features. The fee will be tailored to the size and requirements of each customer, which keeps the pricing scalable and affordable. Additional revenue streams may include customization services, data analytics insights, and value-added features.

 

To ensure customer satisfaction and long-term relationships, we will provide ongoing customer support and updates, ensuring that our solution continuously evolves to meet the changing needs of our customers.

 

### Financial Projections:

 

Our financial projections are based on anticipated customer acquisition rates, average subscription fees, and cost analysis. We expect a steady increase in revenue as we acquire customers and expand our market presence. Investments will be allocated to research and development, marketing, talent acquisition, and infrastructure development. Detailed financial projections, including revenue, expenses, and profitability, will be provided in the attached financial statement.

 

In conclusion, our AI-powered energy optimization solution holds significant viability, feasibility, and desirability in the market. The combination of advanced AI capabilities, real-time optimization, and customization options positions us as a leader in the energy decarbonization landscape. By targeting electrical companies, defense organizations, and the airplane industry, we are poised to capture a significant market share and contribute to global sustainability goals. With a strong business plan, solid financial projections, and a focus on customer satisfaction, our company is well-positioned for success in the rapidly evolving energy optimization market.



# Data Privacy and Ethics
Data ethics and privacy are of paramount importance in our AI-powered energy optimization solution. We recognize that the datasets we use, such as the energy consumption data from PJM and the solar plant data from India, contain sensitive information and must be handled with utmost care.

 

To ensure data ethics, we strictly adhere to privacy regulations and industry best practices. We anonymize and encrypt personal data, removing any personally identifiable information from the datasets. We also implement strong security measures to protect the data from unauthorized access or breaches.
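
One common way to remove direct identifiers from a dataset is to replace them with keyed hashes before analysis. The sketch below illustrates that idea with Python's standard `hmac` and `hashlib` modules. The `pseudonymize` helper, the example identifiers and the secret are all hypothetical and are not taken from our pipeline.

```python
import hashlib
import hmac


def pseudonymize(identifier, secret):
    """Replace an identifier with a short keyed SHA-256 digest."""
    digest = hmac.new(secret, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


# Hypothetical secret; in practice it would be stored outside the dataset.
SECRET = b"example-secret-do-not-reuse"

token_a = pseudonymize("INVERTER-A", SECRET)
token_b = pseudonymize("INVERTER-B", SECRET)
print(token_a, token_b)
```

The same identifier always maps to the same token, so records can still be grouped per device. Keyed hashing is pseudonymization rather than full anonymization, because anyone holding the secret can recompute the mapping.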

 

Furthermore, we obtain proper consent and permissions when collecting and using the data. We ensure transparency by clearly communicating the purpose of data collection and obtaining informed consent from individuals or organizations involved. Our data collection and usage processes comply with applicable data protection laws and regulations, safeguarding the privacy of individuals and organizations involved in the datasets.

 

We also prioritize data minimization and data retention policies. We only collect and retain the data necessary for the functioning of our solution, and we promptly discard any unnecessary or outdated information. This approach helps reduce the risk of data misuse and ensures compliance with data protection principles.

 

In summary, data ethics and privacy are foundational principles in our solution. We are committed to handling data responsibly, complying with privacy regulations, and implementing robust security measures to protect the privacy and confidentiality of the datasets we use. By upholding these principles, we maintain the trust of our customers and prioritize the ethical use of data in our AI-powered energy optimization solution.

# Challenges
One challenge we faced was finding datasets that covered all the variables our solution requires. Relevant, comprehensive datasets are hard to find and access, especially when the goal is to combine parameters such as energy consumption patterns, renewable energy availability, and geographical data. We dedicated substantial effort to sourcing and curating datasets that would provide the foundation for accurate predictions and optimizations.

 

Additionally, we encountered a hurdle in obtaining access to an API that would provide us with the real-time data needed to identify optimal EV charging station positions. Without direct access to the API, we had to rely on alternative approaches and approximations to generate recommendations for optimal charging station locations. Despite this limitation, we leveraged available data and applied our expertise to design an AI algorithm that optimized EV charging station placements based on factors like traffic patterns and population density.

 

Despite these challenges, we persevered and developed a viable solution that addresses energy optimization, decarbonization, and sustainability. Our commitment to problem-solving, creativity, and resourcefulness allowed us to overcome these obstacles and deliver a valuable solution during the hackathon.



# Future Directions
In terms of future directions and next steps for the project, there are several key areas to focus on. First, implementing the chatbot and conducting benchmarking against other AI models like Claude and ChatGPT would be crucial. This evaluation will help assess the performance and effectiveness of our chatbot's responses, ensuring that it meets or exceeds the standards set by existing AI models in terms of accuracy and quality.

 

Expanding the available data to make the models more robust and applicable to larger areas is another important step. By acquiring additional datasets from different regions, we can enhance the predictive capabilities of our models and ensure their viability in various geographical contexts.

 

Implementing optimal EV charger locations based on traffic patterns, population density, and other relevant factors is an essential aspect of the project. By incorporating this feature, we enable electrical companies to strategically plan the placement of charging infrastructure, further promoting the adoption of electric vehicles and supporting the transition to cleaner transportation.
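
One simple way to turn traffic and population density into charger locations is a greedy ranking: score each candidate site, then pick the best-scoring sites while keeping them a minimum distance apart. The sketch below illustrates this under stated assumptions. The `rank_sites` helper, the equal weights, the planar coordinates and all candidate values are hypothetical, and this is not the algorithm built during the hackathon.

```python
from math import hypot


def rank_sites(candidates, n_sites, min_distance, w_traffic=0.5, w_density=0.5):
    """Greedily pick the highest-scoring sites that are far enough apart."""

    def score(site):
        # Traffic and density are assumed to be normalized to the range 0..1.
        return w_traffic * site["traffic"] + w_density * site["density"]

    chosen = []
    for site in sorted(candidates, key=score, reverse=True):
        far_enough = all(
            hypot(site["x"] - c["x"], site["y"] - c["y"]) >= min_distance
            for c in chosen
        )
        if far_enough:
            chosen.append(site)
        if len(chosen) == n_sites:
            break
    return [site["name"] for site in chosen]


# Made-up candidates on a planar grid (units are arbitrary).
candidates = [
    {"name": "A", "x": 0, "y": 0, "traffic": 0.90, "density": 0.80},
    {"name": "B", "x": 1, "y": 0, "traffic": 0.85, "density": 0.80},
    {"name": "C", "x": 10, "y": 0, "traffic": 0.60, "density": 0.70},
    {"name": "D", "x": 0, "y": 10, "traffic": 0.30, "density": 0.40},
]
selected = rank_sites(candidates, n_sites=2, min_distance=5)
print(selected)
```

Site B scores almost as high as A but is skipped because it sits too close to it, which is the behavior the spacing constraint is meant to produce.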

 

Refining our time series models by using transformer architectures in place of the current neural network models could improve accuracy and the ability to capture long-term patterns. Transformers have shown promise in capturing temporal dependencies and long-term trends, which makes them a candidate for time series analysis tasks.

 

Designing an architecture that integrates all the models into a cohesive and interactive chatbot is another important step. This architecture should facilitate seamless interaction with users, provide accurate predictions and recommendations, and ensure a smooth user experience.

 

Lastly, benchmarking the current models with additional datasets to assess their accuracy and performance is essential. By comparing the outputs of our models against diverse datasets, we can gain insights into their effectiveness and identify areas for improvement.
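
Such a benchmark can reuse the metrics already reported per inverter in the notebook (RMSE and R^2) and compute them for one fitted model across several datasets. The sketch below shows the shape of that loop. The `benchmark` helper, the stand-in `predict` function and the dataset names and values are hypothetical, and each target array is assumed not to be constant.

```python
import numpy as np


def benchmark(predict, datasets):
    """Return RMSE and R^2 of `predict` on each named (X, y) dataset."""
    results = {}
    for name, (X, y) in datasets.items():
        y = np.asarray(y, dtype=float)
        residuals = np.asarray(predict(X), dtype=float) - y
        rmse = float(np.sqrt(np.mean(residuals ** 2)))
        ss_res = float(np.sum(residuals ** 2))
        ss_tot = float(np.sum((y - y.mean()) ** 2))  # assumes y is not constant
        results[name] = {"rmse": rmse, "r2": 1.0 - ss_res / ss_tot}
    return results


def predict(X):
    """Stand-in for a fitted model: doubles its input."""
    return 2 * np.asarray(X, dtype=float)


# Made-up datasets: the model matches the first exactly and the second poorly.
datasets = {
    "in_sample": ([1, 2, 3], [2, 4, 6]),
    "other_site": ([1, 2, 3], [3, 4, 5]),
}
report = benchmark(predict, datasets)
print(report)
```

A model that scores well in-sample but close to zero R^2 elsewhere, as in the second toy dataset, is the kind of gap this comparison is meant to expose.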

 

By focusing on these future directions and next steps, we can further refine and enhance the capabilities of our solution, ensuring its effectiveness, accuracy, and scalability.