# Plan..

• The main goal is to build a model that predicts the energy behavior of prosumers in order to reduce energy imbalance costs.

# Analyze..

-- Questions for my analysis:

• What are the most important factors affecting production or consumption?

• Which product type consumes the most energy?

• Which product type produces the most energy?

• What are the ideal conditions for each product in terms of the most important factors affecting it?

• What is the appropriate budget to cover the requirements of each business product?

• When is a product's consumption above the limit or an outlier?

• When is consumption or production highest for each product?

In [None]:
import pandas as pd
import numpy as np
import plotly as plot
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, date, timedelta
import plotly.express as px
import plotly.subplots as sp
import plotly.graph_objects as go
import warnings
from scipy.stats import probplot
from scipy.stats import ttest_ind
from sklearn.feature_selection import mutual_info_classif
from sklearn.neighbors import LocalOutlierFactor
from statsmodels.tsa.seasonal import seasonal_decompose
from sklearn.preprocessing import StandardScaler
from scipy import stats
from sklearn.model_selection import cross_val_predict
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
warnings.filterwarnings('ignore')

**- Prepare dataset for analysis..**

In [None]:
client_train=pd.read_csv('/content/drive/MyDrive/predict-energy-behavior-of-prosumers/client.csv')
electricity_train=pd.read_csv('/content/drive/MyDrive/predict-energy-behavior-of-prosumers/electricity_prices.csv')
hist_weather_train=pd.read_csv('/content/drive/MyDrive/predict-energy-behavior-of-prosumers/historical_weather.csv')
gas_price_train=pd.read_csv('/content/drive/MyDrive/predict-energy-behavior-of-prosumers/gas_prices.csv')
train=pd.read_csv('/content/drive/MyDrive/predict-energy-behavior-of-prosumers/train.csv')
locations=pd.read_csv('/content/drive/MyDrive/archive-4/county_lon_lats.csv')

In [None]:
train

In [None]:
train.info() # Convert the datetime column to datetime type (object --> datetime)
             # Remove the row_id column
             # Separate datetime into (month, year, day, hour)

In [None]:
train.isnull().sum() # Remove all rows that contain a null value in the target field

In [None]:
client_train

In [None]:
client_train.info() # Convert the date column to datetime type (object --> datetime)
                    # Separate datetime into (month, year, day, hour)
                    # Subtract 2 from data_block_id (as done in preprocessing_client)

In [None]:
client_train.isnull().sum() #Perfect

In [None]:
electricity_train

In [None]:
electricity_train.info() # Convert the datetime column to datetime type (object --> datetime)
                         # Separate datetime into (hour)
                         # Subtract 1 from data_block_id

In [None]:
electricity_train.isnull().sum()

In [None]:
hist_weather_train

In [None]:
hist_weather_train.info() # Convert the datetime column to datetime type (object --> datetime)
                          # Separate datetime into (hour)
                          # Subtract 1 from data_block_id

In [None]:
hist_weather_train.isnull().sum() #Perfect

In [None]:
gas_price_train

In [None]:
gas_price_train.info()#Subtract 1 from data_block_id

In [None]:
gas_price_train.isnull().sum() #Perfect

In [None]:
locations # I need this data to merge with the weather data, because I need to know the county for each longitude and latitude.
          # Drop the 'Unnamed: 0' column

**- Build functions to fix the problems in each dataset, then merge them together.**

In [None]:
def preprocessing_client(client):
    client['data_block_id']-=2
    return client

In [None]:
def preprocessing_gas(gas):
    gas['data_block_id']-=1
    return gas

In [None]:
def preprocessing_electricity(electricity):
    electricity = electricity.rename(columns= {'forecast_date' : 'datetime'})
    electricity['datetime'] = pd.to_datetime(electricity['datetime'], utc= True)
    electricity['hour'] = electricity['datetime'].dt.hour
    electricity['data_block_id']-=1
    return electricity

In [None]:
def preprocessing_weather(weather,locations):
    # Copy the selected columns so that the assignments below do not write to a view of `weather`.
    hist_weather=weather[['temperature','dewpoint','cloudcover_total', 'cloudcover_low','latitude', 'longitude',
       'cloudcover_mid', 'cloudcover_high','direct_solar_radiation','snowfall','datetime','data_block_id']].copy()
    locations = locations.drop('Unnamed: 0', axis= 1)
    hist_weather[['latitude', 'longitude']] = hist_weather[['latitude', 'longitude']].astype(float).round(1)
    hist_weather= hist_weather.merge(locations, how='left', on=['longitude','latitude'])
    hist_weather.dropna(axis= 0, inplace= True)
    hist_weather.drop(['latitude', 'longitude'], axis=1, inplace= True)
    hist_weather['county'] = hist_weather['county'].astype('int64')
    hist_weather['datetime']= pd.to_datetime(hist_weather['datetime'], utc= True)
    hist_weather_datetime= hist_weather.groupby([hist_weather['datetime'].dt.to_period('h')])[list(hist_weather.drop(['county','datetime','data_block_id'], axis= 1).columns)].mean().reset_index()
    hist_weather_datetime['datetime']= pd.to_datetime(hist_weather_datetime['datetime'].dt.to_timestamp(), utc=True)
    hist_weather_datetime= hist_weather_datetime.merge(hist_weather[['datetime', 'data_block_id']], how='left', on='datetime')
    hist_weather_datetime_county= hist_weather.groupby(['county',hist_weather['datetime'].dt.to_period('h')])[list(hist_weather.drop(['county','datetime', 'data_block_id'], axis= 1).columns)].mean().reset_index()
    hist_weather_datetime_county['datetime']= pd.to_datetime(hist_weather_datetime_county['datetime'].dt.to_timestamp(), utc=True)
    hist_weather_datetime_county= hist_weather_datetime_county.merge(hist_weather[['datetime', 'data_block_id']], how='left', on='datetime')
    hist_weather_datetime['hour']= hist_weather_datetime['datetime'].dt.hour
    hist_weather_datetime_county['hour']= hist_weather_datetime_county['datetime'].dt.hour
    hist_weather_datetime.drop_duplicates(inplace=True)
    hist_weather_datetime_county.drop_duplicates(inplace=True)
    hist_weather_datetime.drop('datetime', axis= 1, inplace= True)
    hist_weather_datetime_county.drop('datetime', axis= 1, inplace= True)
    return(hist_weather_datetime_county,hist_weather_datetime)
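
The coordinate join in `preprocessing_weather` only works if both tables hold exactly the same latitude/longitude values. The sketch below illustrates why the rounding step matters. The frames `weather_demo` and `locations_demo` and all their values are invented for illustration, and rounding the lookup table as well is an extra precaution that the function above does not take.

```python
import pandas as pd

# Hypothetical toy data: two weather grid points and a county lookup table.
weather_demo = pd.DataFrame({
    "latitude": [57.60000038, 58.20000076],
    "longitude": [21.70000076, 22.20000076],
    "temperature": [3.5, 4.1],
})
locations_demo = pd.DataFrame({
    "latitude": [57.6, 58.2],
    "longitude": [21.7, 22.2],
    "county": [1, 2],
})
keys = ["latitude", "longitude"]

# Without rounding, the float keys do not match exactly, so no county is found.
unrounded = weather_demo.merge(locations_demo, how="left", on=keys)

# Rounding both sides to one decimal makes the keys comparable.
weather_demo[keys] = weather_demo[keys].round(1)
locations_demo[keys] = locations_demo[keys].round(1)
rounded = weather_demo.merge(locations_demo, how="left", on=keys)

print("missing counties without rounding:", unrounded["county"].isna().sum())
print("missing counties with rounding:", rounded["county"].isna().sum())
```

Counting the missing `county` values right after the merge is a cheap way to see how many weather rows the later `dropna` call will discard.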


In [None]:
def preprocessing_train(train):
    # Keep only rows with a known target; copy so the new columns are not written to a view.
    train = train[train['target'].notnull()].copy()
    train['datetime'] = pd.to_datetime(train['datetime'], utc=True)
    # Split the timestamp into calendar features.
    train['year'] = train['datetime'].dt.year
    train['month'] = train['datetime'].dt.month
    train['day'] = train['datetime'].dt.day
    train['hour'] = train['datetime'].dt.hour
    return train

In [None]:
def data_merge(data,client,hist_weather_datetime_county,hist_weather_datetime,electricity,gas):
    data= data.merge(client.drop(columns = ['date']), how='left', on=['data_block_id', 'county', 'is_business', 'product_type'])
    data= data.merge(gas[['data_block_id', 'lowest_price_per_mwh', 'highest_price_per_mwh']], how='left', on='data_block_id')
    data= data.merge(electricity[['euros_per_mwh', 'hour', 'data_block_id']], how='left', on=['hour', 'data_block_id'])
    data= data.merge(hist_weather_datetime, how='left', on=['data_block_id', 'hour'])
    data= data.merge(hist_weather_datetime_county, how='left', on=['data_block_id', 'county', 'hour'],
                     suffixes= ('_hist_mean','_hist_mean_by_county'))
    data= data.groupby(['year', 'day', 'hour'], as_index=False).apply(lambda row: row.ffill().bfill()).reset_index()
    data.drop(['row_id','index', 'data_block_id'],axis=1,inplace=True)
    data.dropna(inplace=True)
    data=data[data['year']<2023]
    return (data)
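
A left merge silently duplicates training rows if the right-hand table has repeated keys. The sketch below shows one way to guard against that with the `validate` argument of `DataFrame.merge`. The frames `train_demo` and `gas_demo` are invented stand-ins, and this check is a suggestion rather than part of `data_merge` above.

```python
import pandas as pd

# Hypothetical toy frames standing in for `train` and `gas`.
train_demo = pd.DataFrame({"data_block_id": [0, 0, 1], "target": [1.0, 2.0, 3.0]})
gas_demo = pd.DataFrame({"data_block_id": [0, 1], "lowest_price_per_mwh": [45.0, 46.0]})

# validate="many_to_one" raises pandas.errors.MergeError when the right-hand keys
# are not unique, instead of quietly multiplying the left-hand rows.
merged = train_demo.merge(gas_demo, how="left", on="data_block_id", validate="many_to_one")

# A lookup table with a duplicated key is rejected.
gas_dup = pd.concat([gas_demo, gas_demo.iloc[[0]]], ignore_index=True)
try:
    train_demo.merge(gas_dup, how="left", on="data_block_id", validate="many_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True

print(len(train_demo), len(merged), duplicate_detected)
```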

In [None]:
client_train=preprocessing_client(client_train.copy())
hist_weather_datetime_county,hist_weather_datetime=preprocessing_weather(hist_weather_train.copy(),locations)
train=preprocessing_train(train.copy())
electricity_train=preprocessing_electricity(electricity_train.copy())
gas_price_train=preprocessing_gas(gas_price_train.copy())


data=data_merge(train.copy(),client_train.copy(),hist_weather_datetime_county.copy(),hist_weather_datetime.copy(),electricity_train.copy(),gas_price_train.copy())

In [None]:
data

In [None]:
data.info() #Perfect

In [None]:
data.isnull().sum()

**- Analysis..**

**- I will apply all strategies and answer all questions for (Group 1). Behind the scenes, I will implement the same things for the remaining groups.**

In [None]:
# To answer my questions, I split the data into groups based on (prediction_unit_id).
# Grouping by the column name rather than a one-element list lets get_group take a scalar key.
groups=data.groupby('prediction_unit_id')
print("Number of prediction units (products):",len(groups.groups.keys()))

In [None]:
#Group 1
group=groups.get_group(1)
consumption=group[group['is_consumption']==1]
production=group[group['is_consumption']==0]

**1)Consumption Analysis...**

In [None]:
# General information about the product
print("Number of counties in this product:",len(consumption.county.unique()))
answer="yes" if consumption['is_business'].values[0] else "no"
print("Is this a business product?",answer)
# Map the numeric product_type code to its name; any other code falls back to "Spot", as before.
product_type_names={0:"Combined",1:"Fixed",2:"General service"}
product_type=product_type_names.get(consumption['product_type'].values[0],"Spot")
print("Product type:",product_type)

**- Q: When is consumption highest for this product?**

In [None]:
# Descriptive analysis of consumption
display(consumption.target.describe())
# sns.distplot is deprecated in recent seaborn releases; histplot with a KDE curve is a comparable plot.
sns.histplot(consumption.target, kde=True, stat="density")
plt.show()
probplot(consumption.target, dist='norm', plot=plt)
plt.show()
# Results
## The consumption of this product is clustered around 20.
## The spread of its consumption is considered acceptable.
## I will look for the reasons behind any value > 34 or < 11.
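
The cut-offs above are read off the summary statistics. One common way to derive such limits systematically is the 1.5 × IQR fence, sketched below. This is an assumption about how bounds could be chosen, not the rule used above, and the helper `iqr_bounds` and the sample values are invented for illustration.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) fences placed k * IQR beyond the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up consumption values; in the notebook this would be consumption['target'].
demo = pd.Series([18.0, 19.0, 20.0, 21.0, 22.0, 60.0])
lower, upper = iqr_bounds(demo)
outliers = demo[(demo < lower) | (demo > upper)]
print(lower, upper, outliers.tolist())
```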

**- Q: What is the appropriate budget to cover the requirements of each business product?**

In [None]:
# Example: the budget needed in 2021
# This is not a business product, but the same calculation applies: take the total revenue from
# production during 2021 and subtract the total cost of consumption.
consumption_21=consumption[consumption['year']==2021]
production_21=production[production['year']==2021]

# Cost of consumption: total consumed energy times the average electricity price.
energy_consumption = np.sum(consumption_21['target'].values)
price_per_mwh = np.mean(consumption_21['euros_per_mwh'].values)
total_cost_consumption = energy_consumption * price_per_mwh

# Revenue from production: total produced energy times the midpoint of the lowest and highest prices.
# Note: lowest_price_per_mwh and highest_price_per_mwh come from the gas price table (see data_merge),
# so this revenue figure is only a rough proxy.
energy_production =np.sum( production_21['target'].values)
lowest_price_per_mwh = np.mean(production_21['lowest_price_per_mwh'].values)
highest_price_per_mwh = np.mean(production_21['highest_price_per_mwh'].values)
average_price_per_mwh = (lowest_price_per_mwh + highest_price_per_mwh) / 2

total_revenue_production = energy_production * average_price_per_mwh

total_budget = total_revenue_production - total_cost_consumption
print(total_budget)
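
Multiplying total energy by the average price treats every hour as if it were priced the same. The sketch below contrasts that with pricing each hour separately. The numbers are invented, and it assumes `target` and the price refer to the same energy unit; the notebook does not state the unit of `target`, so no kWh/MWh conversion is applied.

```python
import pandas as pd

# Invented hourly data: energy per hour and the matching electricity price.
hours = pd.DataFrame({
    "target": [1.0, 3.0, 2.0],
    "euros_per_mwh": [50.0, 100.0, 80.0],
})

# Aggregate approach: total energy times the mean price.
aggregate_cost = hours["target"].sum() * hours["euros_per_mwh"].mean()

# Hour-by-hour approach: price each hour's energy at that hour's price, then sum.
hourly_cost = (hours["target"] * hours["euros_per_mwh"]).sum()

print(aggregate_cost, hourly_cost)
```

The two figures differ whenever consumption is concentrated in cheaper or more expensive hours, which is why the hour-by-hour sum is usually the safer estimate.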

**- Q: What are the most important factors affecting production or consumption?**

In [None]:
possible_inf_columns=consumption.drop(['county','is_business','product_type','prediction_unit_id','year','day','month','hour','is_consumption','datetime',
                                       'lowest_price_per_mwh', 'highest_price_per_mwh', 'euros_per_mwh'],axis=1).columns
possible_inf_columns
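
Before going through the candidate columns one by one, it can help to rank them by their correlation with `target`. The sketch below does this on synthetic data. The helper `rank_by_correlation` and the frame `demo` are invented for illustration; in the notebook the candidates would be `possible_inf_columns` without `target` itself.

```python
import numpy as np
import pandas as pd

def rank_by_correlation(df: pd.DataFrame, target: str, candidates: list[str]) -> pd.Series:
    """Pearson correlation of each candidate column with the target, strongest first."""
    corr = df[candidates].corrwith(df[target])
    return corr.reindex(corr.abs().sort_values(ascending=False).index)

# Synthetic stand-in for the consumption frame.
rng = np.random.default_rng(0)
temperature = rng.normal(10, 5, 500)
demo = pd.DataFrame({
    "temperature": temperature,
    "noise": rng.normal(0, 1, 500),
    "target": 30 - 0.8 * temperature + rng.normal(0, 1, 500),
})

ranking = rank_by_correlation(demo, "target", ["temperature", "noise"])
print(ranking)
```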

In [None]:
# Look for important features.
# 'eic_count' --> the aggregated number of consumption points
fig = sp.make_subplots(rows=2, cols=2, subplot_titles=['Target_2021','eic_count','Target_2022','eic_count'])
fig.add_trace(go.Scatter(x=consumption[consumption['year']==2021].datetime, y=consumption[consumption['year']==2021].target),row=1,col=1)
fig.add_trace(go.Scatter(x=consumption[consumption['year']==2021].datetime, y=consumption[consumption['year']==2021].eic_count),row=1,col=2)
fig.add_trace(go.Scatter(x=consumption[consumption['year']==2022].datetime, y=consumption[consumption['year']==2022].target),row=2,col=1)
fig.add_trace(go.Scatter(x=consumption[consumption['year']==2022].datetime, y=consumption[consumption['year']==2022].eic_count),row=2,col=2)

fig.update_layout(template='plotly_dark')
fig.show()
#correlation
sns.heatmap(consumption[['target','eic_count']].corr(),
            annot=True,
            cmap='coolwarm')
plt.show()

##hypothesis testing
t_statistic, p_value = ttest_ind(consumption.target.values, consumption.eic_count.values)

print(f"T-statistic: {t_statistic}")
print(f"P-value: {p_value}")

alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant difference between the means.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference between the means.")

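# Results
## The relationship is not clear, and the p-value is very small.
## Note: ttest_ind only compares the means of the two columns, so a small p-value shows that
## their means differ, not that eic_count influences the target.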
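
`ttest_ind` compares the means of two samples, so when it is applied to `target` and a feature it mostly reflects that the two columns are on different scales. To ask whether two columns move together, a correlation test is one option. The sketch below uses `scipy.stats.pearsonr` on synthetic data. It is a swapped-in technique rather than the test used in this notebook, and the arrays are invented.

```python
import numpy as np
from scipy.stats import pearsonr, ttest_ind

rng = np.random.default_rng(42)
feature = rng.normal(100, 10, 1000)         # a feature on a larger scale
unrelated_target = rng.normal(20, 5, 1000)  # generated independently of the feature

# The t-test rejects equal means simply because the columns are on different scales.
_, p_ttest = ttest_ind(unrelated_target, feature)

# The correlation test asks whether the two columns vary together.
r, p_corr = pearsonr(unrelated_target, feature)

print(f"t-test p-value: {p_ttest:.3g}")
print(f"Pearson r: {r:.3f}, p-value: {p_corr:.3g}")
```

Here the t-test reports an extremely small p-value even though the two arrays are independent, while the correlation coefficient stays close to zero.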

In [None]:
#installed_capacity --> Installed photovoltaic solar panel capacity in kilowatts.
fig = sp.make_subplots(rows=2, cols=2, subplot_titles=['Target_2021','installed_capacity','Target_2022','installed_capacity'])
fig.add_trace(go.Scatter(x=consumption[consumption['year']==2021].datetime, y=consumption[consumption['year']==2021].target),row=1,col=1)
fig.add_trace(go.Scatter(x=consumption[consumption['year']==2021].datetime, y=consumption[consumption['year']==2021].installed_capacity),row=1,col=2)
fig.add_trace(go.Scatter(x=consumption[consumption['year']==2022].datetime, y=consumption[consumption['year']==2022].target),row=2,col=1)
fig.add_trace(go.Scatter(x=consumption[consumption['year']==2022].datetime, y=consumption[consumption['year']==2022].installed_capacity),row=2,col=2)

fig.update_layout(template='plotly_dark')
fig.show()
sns.heatmap(consumption[['target','installed_capacity']].corr(),
            annot=True,
            cmap='coolwarm')
plt.show()

##hypothesis testing
t_statistic, p_value = ttest_ind(consumption.target.values, consumption.installed_capacity.values)

print(f"T-statistic: {t_statistic}")
print(f"P-value: {p_value}")

alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant difference between the means.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference between the means.")

# Results
## The relationship is fairly clear, and the p-value is reported as 0.0, which means:
       # 1) The means of the two columns differ by more than random chance would explain.
       # 2) A p-value of 0.0 is below any conventional significance level (e.g., 0.05 or 0.01).
       # Note: ttest_ind compares means only, so it does not by itself show that the feature affects the target.

In [None]:
# temperature_hist_mean --> historical temperature averaged over all weather locations for each hour
fig = sp.make_subplots(rows=2, cols=2, subplot_titles=['Target_2021','temperature_hist_mean','Target_2022','temperature_hist_mean'])
fig.add_trace(go.Scatter(x=consumption[consumption['year']==2021].datetime, y=consumption[consumption['year']==2021].target),row=1,col=1)
fig.add_trace(go.Scatter(x=consumption[consumption['year']==2021].datetime, y=consumption[consumption['year']==2021].temperature_hist_mean),row=1,col=2)
fig.add_trace(go.Scatter(x=consumption[consumption['year']==2022].datetime, y=consumption[consumption['year']==2022].target),row=2,col=1)
fig.add_trace(go.Scatter(x=consumption[consumption['year']==2022].datetime, y=consumption[consumption['year']==2022].temperature_hist_mean),row=2,col=2)

fig.update_layout(template='plotly_dark')
fig.show()
sns.heatmap(consumption[['target','temperature_hist_mean']].corr(),
            annot=True,
            cmap='coolwarm')
plt.show()

##hypothesis testing
t_statistic, p_value = ttest_ind(consumption.target.values, consumption.temperature_hist_mean.values)

print(f"T-statistic: {t_statistic}")
print(f"P-value: {p_value}")

alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant difference between the means.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference between the means.")

# Results
## The relationship is very clear, and the p-value is reported as 0.0, which means:
       # 1) The means of the two columns differ by more than random chance would explain.
       # 2) A p-value of 0.0 is below any conventional significance level (e.g., 0.05 or 0.01).
       # Note: ttest_ind compares means only, so it does not by itself show that the feature affects the target.

In [None]:
# temperature_hist_mean_by_county --> historical temperature averaged per county for each hour
fig = sp.make_subplots(rows=2, cols=2, subplot_titles=['Target_2021','temperature_hist_mean_by_county','Target_2022','temperature_hist_mean_by_county'])
fig.add_trace(go.Scatter(x=consumption[consumption['year']==2021].datetime, y=consumption[consumption['year']==2021].target),row=1,col=1)
fig.add_trace(go.Scatter(x=consumption[consumption['year']==2021].datetime, y=consumption[consumption['year']==2021].temperature_hist_mean_by_county),row=1,col=2)
fig.add_trace(go.Scatter(x=consumption[consumption['year']==2022].datetime, y=consumption[consumption['year']==2022].target),row=2,col=1)
fig.add_trace(go.Scatter(x=consumption[consumption['year']==2022].datetime, y=consumption[consumption['year']==2022].temperature_hist_mean_by_county),row=2,col=2)

fig.update_layout(template='plotly_dark')
fig.show()
sns.heatmap(consumption[['target','temperature_hist_mean_by_county']].corr(),
            annot=True,
            cmap='coolwarm')
plt.show()

##hypothesis testing
t_statistic, p_value = ttest_ind(consumption.target.values, consumption.temperature_hist_mean_by_county.values)

print(f"T-statistic: {t_statistic}")
print(f"P-value: {p_value}")

alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant difference between the means.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference between the means.")

# Results
## The relationship is very clear, and the p-value is reported as 0.0, which means:
       # 1) The means of the two columns differ by more than random chance would explain.
       # 2) A p-value of 0.0 is below any conventional significance level (e.g., 0.05 or 0.01).
       # Note: ttest_ind compares means only, so it does not by itself show that the feature affects the target.

In [None]:
# dewpoint_hist_mean --> historical dewpoint averaged over all weather locations for each hour
fig = sp.make_subplots(rows=2, cols=2, subplot_titles=['Target_2021','dewpoint_hist_mean','Target_2022','dewpoint_hist_mean'])
fig.add_trace(go.Scatter(x=consumption[consumption['year']==2021].datetime, y=consumption[consumption['year']==2021].target),row=1,col=1)
fig.add_trace(go.Scatter(x=consumption[consumption['year']==2021].datetime, y=consumption[consumption['year']==2021].dewpoint_hist_mean),row=1,col=2)
fig.add_trace(go.Scatter(x=consumption[consumption['year']==2022].datetime, y=consumption[consumption['year']==2022].target),row=2,col=1)
fig.add_trace(go.Scatter(x=consumption[consumption['year']==2022].datetime, y=consumption[consumption['year']==2022].dewpoint_hist_mean),row=2,col=2)

fig.update_layout(template='plotly_dark')
fig.show()
sns.heatmap(consumption[['target','dewpoint_hist_mean']].corr(),
            annot=True,
            cmap='coolwarm')
plt.show()

##hypothesis testing
t_statistic, p_value = ttest_ind(consumption.target.values, consumption.dewpoint_hist_mean.values)

print(f"T-statistic: {t_statistic}")
print(f"P-value: {p_value}")

alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant difference between the means.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference between the means.")

# Results
## The relationship is very clear, and the p-value is reported as 0.0, which means:
       # 1) The means of the two columns differ by more than random chance would explain.
       # 2) A p-value of 0.0 is below any conventional significance level (e.g., 0.05 or 0.01).
       # Note: ttest_ind compares means only, so it does not by itself show that the feature affects the target.

In [None]:
# dewpoint_hist_mean_by_county --> historical dewpoint averaged per county for each hour
fig = sp.make_subplots(rows=2, cols=2, subplot_titles=['Target_2021','dewpoint_hist_mean_by_county','Target_2022','dewpoint_hist_mean_by_county'])
fig.add_trace(go.Scatter(x=consumption[consumption['year']==2021].datetime, y=consumption[consumption['year']==2021].target),row=1,col=1)
fig.add_trace(go.Scatter(x=consumption[consumption['year']==2021].datetime, y=consumption[consumption['year']==2021].dewpoint_hist_mean_by_county),row=1,col=2)
fig.add_trace(go.Scatter(x=consumption[consumption['year']==2022].datetime, y=consumption[consumption['year']==2022].target),row=2,col=1)
fig.add_trace(go.Scatter(x=consumption[consumption['year']==2022].datetime, y=consumption[consumption['year']==2022].dewpoint_hist_mean_by_county),row=2,col=2)

fig.update_layout(template='plotly_dark')
fig.show()
sns.heatmap(consumption[['target','dewpoint_hist_mean_by_county']].corr(),
            annot=True,
            cmap='coolwarm')
plt.show()

##hypothesis testing
t_statistic, p_value = ttest_ind(consumption.target.values, consumption.dewpoint_hist_mean_by_county.values)

print(f"T-statistic: {t_statistic}")
print(f"P-value: {p_value}")

alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant difference between the means.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference between the means.")

# Results
## The relationship is very clear, and the p-value is reported as 0.0, which means:
       # 1) The means of the two columns differ by more than random chance would explain.
       # 2) A p-value of 0.0 is below any conventional significance level (e.g., 0.05 or 0.01).
       # Note: ttest_ind compares means only, so it does not by itself show that the feature affects the target.

In [None]:
# cloudcover_total_hist_mean --> historical cloudcover_total averaged over all weather locations for each hour
fig = sp.make_subplots(rows=2, cols=2, subplot_titles=['Target_2021','cloudcover_total_hist_mean','Target_2022','cloudcover_total_hist_mean'])
fig.add_trace(go.Scatter(x=consumption[consumption['year']==2021].datetime, y=consumption[consumption['year']==2021].target),row=1,col=1)
fig.add_trace(go.Scatter(x=consumption[consumption['year']==2021].datetime, y=consumption[consumption['year']==2021].cloudcover_total_hist_mean),row=1,col=2)
fig.add_trace(go.Scatter(x=consumption[consumption['year']==2022].datetime, y=consumption[consumption['year']==2022].target),row=2,col=1)
fig.add_trace(go.Scatter(x=consumption[consumption['year']==2022].datetime, y=consumption[consumption['year']==2022].cloudcover_total_hist_mean),row=2,col=2)

fig.update_layout(template='plotly_dark')
fig.show()
sns.heatmap(consumption[['target','cloudcover_total_hist_mean']].corr(),
            annot=True,
            cmap='coolwarm')
plt.show()

t_statistic, p_value = ttest_ind(consumption.target.values, consumption.cloudcover_total_hist_mean.values)

print(f"T-statistic: {t_statistic}")
print(f"P-value: {p_value}")

alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant difference between the means.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference between the means.")

# Results
## The relationship is not clear, and the p-value is reported as 0.0, which means:
       # 1) The means of the two columns differ by more than random chance would explain.
       # 2) A p-value of 0.0 is below any conventional significance level (e.g., 0.05 or 0.01).
       # Note: ttest_ind compares means only, so it does not by itself show that the feature affects the target.

In [None]:
# cloudcover_total_hist_mean_by_county --> historical cloudcover_total averaged per county for each hour
fig = sp.make_subplots(rows=2, cols=2, subplot_titles=['Target_2021','cloudcover_total_hist_mean_by_county','Target_2022','cloudcover_total_hist_mean_by_county'])
fig.add_trace(go.Scatter(x=consumption[consumption['year']==2021].datetime, y=consumption[consumption['year']==2021].target),row=1,col=1)
fig.add_trace(go.Scatter(x=consumption[consumption['year']==2021].datetime, y=consumption[consumption['year']==2021].cloudcover_total_hist_mean_by_county),row=1,col=2)
fig.add_trace(go.Scatter(x=consumption[consumption['year']==2022].datetime, y=consumption[consumption['year']==2022].target),row=2,col=1)
fig.add_trace(go.Scatter(x=consumption[consumption['year']==2022].datetime, y=consumption[consumption['year']==2022].cloudcover_total_hist_mean_by_county),row=2,col=2)

fig.update_layout(template='plotly_dark')
fig.show()
sns.heatmap(consumption[['target','cloudcover_total_hist_mean_by_county']].corr(),
            annot=True,
            cmap='coolwarm')
plt.show()

t_statistic, p_value = ttest_ind(consumption.target.values, consumption.cloudcover_total_hist_mean_by_county.values)

print(f"T-statistic: {t_statistic}")
print(f"P-value: {p_value}")

alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant difference between the means.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference between the means.")

# Results
## The relationship is not clear, and the p-value is reported as 0.0, which means:
       # 1) The means of the two columns differ by more than random chance would explain.
       # 2) A p-value of 0.0 is below any conventional significance level (e.g., 0.05 or 0.01).
       # Note: ttest_ind compares means only, so it does not by itself show that the feature affects the target.

In [None]:
# cloudcover_low_hist_mean --> historical cloudcover_low averaged over all weather locations for each hour
fig = sp.make_subplots(rows=2, cols=2, subplot_titles=['Target_2021','cloudcover_low_hist_mean','Target_2022','cloudcover_low_hist_mean'])
fig.add_trace(go.Scatter(x=consumption[consumption['year']==2021].datetime, y=consumption[consumption['year']==2021].target),row=1,col=1)
fig.add_trace(go.Scatter(x=consumption[consumption['year']==2021].datetime, y=consumption[consumption['year']==2021].cloudcover_low_hist_mean),row=1,col=2)
fig.add_trace(go.Scatter(x=consumption[consumption['year']==2022].datetime, y=consumption[consumption['year']==2022].target),row=2,col=1)
fig.add_trace(go.Scatter(x=consumption[consumption['year']==2022].datetime, y=consumption[consumption['year']==2022].cloudcover_low_hist_mean),row=2,col=2)

fig.update_layout(template='plotly_dark')
fig.show()
sns.heatmap(consumption[['target','cloudcover_low_hist_mean']].corr(),
            annot=True,
            cmap='coolwarm')
plt.show()

t_statistic, p_value = ttest_ind(consumption.target.values, consumption.cloudcover_low_hist_mean.values)

print(f"T-statistic: {t_statistic}")
print(f"P-value: {p_value}")

alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant difference between the means.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference between the means.")

# Results
## The relationship is not visually clear, and the p-value is reported as 0.0, which means:
       # 1) The means of the two columns differ by more than random chance would explain.
       # 2) A p-value of 0.0 is below any conventional significance level (e.g., 0.05 or 0.01).
       # Note: ttest_ind compares means only, so it does not by itself show that the feature affects the target.

In [None]:
# cloudcover_low_hist_mean_by_county --> historical cloudcover_low averaged per county for each hour
fig = sp.make_subplots(rows=2, cols=2, subplot_titles=['Target_2021','cloudcover_low_hist_mean_by_county','Target_2022','cloudcover_low_hist_mean_by_county'])
fig.add_trace(go.Scatter(x=consumption[consumption['year']==2021].datetime, y=consumption[consumption['year']==2021].target),row=1,col=1)
fig.add_trace(go.Scatter(x=consumption[consumption['year']==2021].datetime, y=consumption[consumption['year']==2021].cloudcover_low_hist_mean_by_county),row=1,col=2)
fig.add_trace(go.Scatter(x=consumption[consumption['year']==2022].datetime, y=consumption[consumption['year']==2022].target),row=2,col=1)
fig.add_trace(go.Scatter(x=consumption[consumption['year']==2022].datetime, y=consumption[consumption['year']==2022].cloudcover_low_hist_mean_by_county),row=2,col=2)

fig.update_layout(template='plotly_dark')
fig.show()
sns.heatmap(consumption[['target','cloudcover_low_hist_mean_by_county']].corr(),
            annot=True,
            cmap='coolwarm')
plt.show()

t_statistic, p_value = ttest_ind(consumption.target.values, consumption.cloudcover_low_hist_mean_by_county.values)

print(f"T-statistic: {t_statistic}")
print(f"P-value: {p_value}")

alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant difference between the means.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference between the means.")

# Results
## The relationship is not visually clear, and the p-value is reported as 0.0, which means:
       # 1) The means of the two columns differ by more than random chance would explain.
       # 2) A p-value of 0.0 is below any conventional significance level (e.g., 0.05 or 0.01).
       # Note: ttest_ind compares means only, so it does not by itself show that the feature affects the target.

In [None]:
# cloudcover_mid_hist_mean --> historical cloudcover_mid averaged over all weather locations for each hour
fig = sp.make_subplots(rows=2, cols=2, subplot_titles=['Target_2021','cloudcover_mid_hist_mean','Target_2022','cloudcover_mid_hist_mean'])
fig.add_trace(go.Scatter(x=consumption[consumption['year']==2021].datetime, y=consumption[consumption['year']==2021].target),row=1,col=1)
fig.add_trace(go.Scatter(x=consumption[consumption['year']==2021].datetime, y=consumption[consumption['year']==2021].cloudcover_mid_hist_mean),row=1,col=2)
fig.add_trace(go.Scatter(x=consumption[consumption['year']==2022].datetime, y=consumption[consumption['year']==2022].target),row=2,col=1)
fig.add_trace(go.Scatter(x=consumption[consumption['year']==2022].datetime, y=consumption[consumption['year']==2022].cloudcover_mid_hist_mean),row=2,col=2)

fig.update_layout(template='plotly_dark')
fig.show()
sns.heatmap(consumption[['target','cloudcover_mid_hist_mean']].corr(),
            annot=True,
            cmap='coolwarm')
plt.show()

t_statistic, p_value = ttest_ind(consumption.target.values, consumption.cloudcover_mid_hist_mean.values)

print(f"T-statistic: {t_statistic}")
print(f"P-value: {p_value}")

alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant difference between the means.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference between the means.")

# Results
## The relationship is not visually clear, and the p-value is very small (the means of the two columns differ).

In [None]:
# cloudcover_mid_hist_mean_by_county --> historical cloudcover_mid averaged per county for each hour
fig = sp.make_subplots(rows=2, cols=2, subplot_titles=['Target_2021','cloudcover_mid_hist_mean_by_county','Target_2022','cloudcover_mid_hist_mean_by_county'])
fig.add_trace(go.Scatter(x=consumption[consumption['year']==2021].datetime, y=consumption[consumption['year']==2021].target),row=1,col=1)
fig.add_trace(go.Scatter(x=consumption[consumption['year']==2021].datetime, y=consumption[consumption['year']==2021].cloudcover_mid_hist_mean_by_county),row=1,col=2)
fig.add_trace(go.Scatter(x=consumption[consumption['year']==2022].datetime, y=consumption[consumption['year']==2022].target),row=2,col=1)
fig.add_trace(go.Scatter(x=consumption[consumption['year']==2022].datetime, y=consumption[consumption['year']==2022].cloudcover_mid_hist_mean_by_county),row=2,col=2)

fig.update_layout(template='plotly_dark')
fig.show()
sns.heatmap(consumption[['target','cloudcover_mid_hist_mean_by_county']].corr(),
            annot=True,
            cmap='coolwarm')
plt.show()

t_statistic, p_value = ttest_ind(consumption.target.values, consumption.cloudcover_mid_hist_mean_by_county.values)

print(f"T-statistic: {t_statistic}")
print(f"P-value: {p_value}")

alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant difference between the means.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference between the means.")

# Results
## The relationship is not visually clear, and the p-value is very small (the means of the two columns differ).

In [None]:
# cloudcover_high_hist_mean --> historical cloudcover_high averaged over all weather locations for each hour
fig = sp.make_subplots(rows=2, cols=2, subplot_titles=['Target_2021','cloudcover_high_hist_mean','Target_2022','cloudcover_high_hist_mean'])
fig.add_trace(go.Scatter(x=consumption[consumption['year']==2021].datetime, y=consumption[consumption['year']==2021].target),row=1,col=1)
fig.add_trace(go.Scatter(x=consumption[consumption['year']==2021].datetime, y=consumption[consumption['year']==2021].cloudcover_high_hist_mean),row=1,col=2)
fig.add_trace(go.Scatter(x=consumption[consumption['year']==2022].datetime, y=consumption[consumption['year']==2022].target),row=2,col=1)
fig.add_trace(go.Scatter(x=consumption[consumption['year']==2022].datetime, y=consumption[consumption['year']==2022].cloudcover_high_hist_mean),row=2,col=2)

fig.update_layout(template='plotly_dark')
fig.show()
sns.heatmap(consumption[['target','cloudcover_high_hist_mean']].corr(),
            annot=True,
            cmap='coolwarm')
plt.show()

t_statistic, p_value = ttest_ind(consumption.target.values, consumption.cloudcover_high_hist_mean.values)

print(f"T-statistic: {t_statistic}")
print(f"P-value: {p_value}")

alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant difference between the means.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference between the means.")

# Results
# The relationship is not clear visually, and both the p-value and the correlation are very small, so the impact on the target is likely small.

In [None]:
# cloudcover_high_hist_mean_by_county--> the mean of cloudcover_high (based on county) forecast cloudcover_high
fig = sp.make_subplots(rows=2, cols=2, subplot_titles=['Target_2021','cloudcover_high_hist_mean_by_county','Target_2022','cloudcover_high_hist_mean_by_county'])
fig.add_trace(go.Scatter(x=consumption[consumption['year']==2021].datetime, y=consumption[consumption['year']==2021].target),row=1,col=1)
fig.add_trace(go.Scatter(x=consumption[consumption['year']==2021].datetime, y=consumption[consumption['year']==2021].cloudcover_high_hist_mean_by_county),row=1,col=2)
fig.add_trace(go.Scatter(x=consumption[consumption['year']==2022].datetime, y=consumption[consumption['year']==2022].target),row=2,col=1)
fig.add_trace(go.Scatter(x=consumption[consumption['year']==2022].datetime, y=consumption[consumption['year']==2022].cloudcover_high_hist_mean_by_county),row=2,col=2)

fig.update_layout(template='plotly_dark')
fig.show()
sns.heatmap(consumption[['target','cloudcover_high_hist_mean_by_county']].corr(),
            annot=True,
            cmap='coolwarm')
plt.show()

t_statistic, p_value = ttest_ind(consumption.target.values, consumption.cloudcover_high_hist_mean_by_county.values)

print(f"T-statistic: {t_statistic}")
print(f"P-value: {p_value}")

alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant difference between the means.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference between the means.")

# Results
# The relationship is not clear visually, and both the p-value and the correlation are very small, so the impact on the target is likely small.

In [None]:
# direct_solar_radiation_hist_mean--> the mean of direct_solar_radiation_hist (based on datetime) forecast direct_solar_radiation
fig = sp.make_subplots(rows=2, cols=2, subplot_titles=['Target_2021','direct_solar_radiation_hist_mean','Target_2022','direct_solar_radiation_hist_mean'])
fig.add_trace(go.Scatter(x=consumption[consumption['year']==2021].datetime, y=consumption[consumption['year']==2021].target),row=1,col=1)
fig.add_trace(go.Scatter(x=consumption[consumption['year']==2021].datetime, y=consumption[consumption['year']==2021].direct_solar_radiation_hist_mean),row=1,col=2)
fig.add_trace(go.Scatter(x=consumption[consumption['year']==2022].datetime, y=consumption[consumption['year']==2022].target),row=2,col=1)
fig.add_trace(go.Scatter(x=consumption[consumption['year']==2022].datetime, y=consumption[consumption['year']==2022].direct_solar_radiation_hist_mean),row=2,col=2)

fig.update_layout(template='plotly_dark')
fig.show()
sns.heatmap(consumption[['target','direct_solar_radiation_hist_mean']].corr(),
            annot=True,
            cmap='coolwarm')
plt.show()

t_statistic, p_value = ttest_ind(consumption.target.values, consumption.direct_solar_radiation_hist_mean.values)

print(f"T-statistic: {t_statistic}")
print(f"P-value: {p_value}")

alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant difference between the means.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference between the means.")

# Results
# The relationship is clear visually, the p-value is very small, and the correlation is high (nice).

In [None]:
# direct_solar_radiation_hist_mean_by_county--> the mean of direct_solar_radiation_hist (based on county) forecast direct_solar_radiation
fig = sp.make_subplots(rows=2, cols=2, subplot_titles=['Target_2021','direct_solar_radiation_hist_mean_by_county','Target_2022','direct_solar_radiation_hist_mean_by_county'])
fig.add_trace(go.Scatter(x=consumption[consumption['year']==2021].datetime, y=consumption[consumption['year']==2021].target),row=1,col=1)
fig.add_trace(go.Scatter(x=consumption[consumption['year']==2021].datetime, y=consumption[consumption['year']==2021].direct_solar_radiation_hist_mean_by_county),row=1,col=2)
fig.add_trace(go.Scatter(x=consumption[consumption['year']==2022].datetime, y=consumption[consumption['year']==2022].target),row=2,col=1)
fig.add_trace(go.Scatter(x=consumption[consumption['year']==2022].datetime, y=consumption[consumption['year']==2022].direct_solar_radiation_hist_mean_by_county),row=2,col=2)

fig.update_layout(template='plotly_dark')
fig.show()
sns.heatmap(consumption[['target','direct_solar_radiation_hist_mean_by_county']].corr(),
            annot=True,
            cmap='coolwarm')
plt.show()

t_statistic, p_value = ttest_ind(consumption.target.values, consumption.direct_solar_radiation_hist_mean_by_county.values)

print(f"T-statistic: {t_statistic}")
print(f"P-value: {p_value}")

alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant difference between the means.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference between the means.")

# Results
# The relationship is clear visually, the p-value is very small, and the correlation is high (nice).

In [None]:
# snowfall_hist_mean--> the mean of snowfall_hist_mean (based on datetime) forecast snowfall
fig = sp.make_subplots(rows=2, cols=2, subplot_titles=['Target_2021','snowfall_hist_mean','Target_2022','snowfall_hist_mean'])
fig.add_trace(go.Scatter(x=consumption[consumption['year']==2021].datetime, y=consumption[consumption['year']==2021].target),row=1,col=1)
fig.add_trace(go.Scatter(x=consumption[consumption['year']==2021].datetime, y=consumption[consumption['year']==2021].snowfall_hist_mean),row=1,col=2)
fig.add_trace(go.Scatter(x=consumption[consumption['year']==2022].datetime, y=consumption[consumption['year']==2022].target),row=2,col=1)
fig.add_trace(go.Scatter(x=consumption[consumption['year']==2022].datetime, y=consumption[consumption['year']==2022].snowfall_hist_mean),row=2,col=2)

fig.update_layout(template='plotly_dark')
fig.show()
sns.heatmap(consumption[['target','snowfall_hist_mean']].corr(),
            annot=True,
            cmap='coolwarm')
plt.show()

t_statistic, p_value = ttest_ind(consumption.target.values, consumption.snowfall_hist_mean.values)

print(f"T-statistic: {t_statistic}")
print(f"P-value: {p_value}")

alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant difference between the means.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference between the means.")

# Results
# The relationship is fairly clear visually, and the p-value is reported as 0.0. This means:
       # 1) There is strong evidence that the difference between the two means is real and not due to random chance.
       # 2) A p-value of 0.0 is below any conventional significance level (e.g., 0.05 or 0.01).

In [None]:
# snowfall_hist_mean_by_county--> the mean of snowfall_hist_mean (based on county) forecast snowfall
fig = sp.make_subplots(rows=2, cols=2, subplot_titles=['Target_2021','snowfall_hist_mean_by_county','Target_2022','snowfall_hist_mean_by_county'])
fig.add_trace(go.Scatter(x=consumption[consumption['year']==2021].datetime, y=consumption[consumption['year']==2021].target),row=1,col=1)
fig.add_trace(go.Scatter(x=consumption[consumption['year']==2021].datetime, y=consumption[consumption['year']==2021].snowfall_hist_mean_by_county),row=1,col=2)
fig.add_trace(go.Scatter(x=consumption[consumption['year']==2022].datetime, y=consumption[consumption['year']==2022].target),row=2,col=1)
fig.add_trace(go.Scatter(x=consumption[consumption['year']==2022].datetime, y=consumption[consumption['year']==2022].snowfall_hist_mean_by_county),row=2,col=2)

fig.update_layout(template='plotly_dark')
fig.show()
sns.heatmap(consumption[['target','snowfall_hist_mean_by_county']].corr(),
            annot=True,
            cmap='coolwarm')
plt.show()

t_statistic, p_value = ttest_ind(consumption.target.values, consumption.snowfall_hist_mean_by_county.values)

print(f"T-statistic: {t_statistic}")
print(f"P-value: {p_value}")

alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant difference between the means.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference between the means.")

# Results
# The relationship is fairly clear visually, and the p-value is reported as 0.0. This means:
       # 1) There is strong evidence that the difference between the two means is real and not due to random chance.
       # 2) A p-value of 0.0 is below any conventional significance level (e.g., 0.05 or 0.01).

In [None]:
# We want to run a specific experiment to test our analysis.
from sklearn.model_selection import train_test_split  # not included in the imports cell

# Build an ML model on the chosen features
x=consumption[['installed_capacity','temperature_hist_mean','dewpoint_hist_mean','cloudcover_total_hist_mean',
               'cloudcover_low_hist_mean','direct_solar_radiation_hist_mean','snowfall_hist_mean','year','hour','day']]
y=consumption['target']
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
rf_model = RandomForestRegressor(n_estimators=4, random_state=42)
rf_model.fit(X_train, y_train)
y_pred = rf_model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")

# Build an ML model on all features
x=consumption.drop(['target','datetime','prediction_unit_id'],axis=1)
y=consumption['target']
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
rf_model = RandomForestRegressor(n_estimators=4, random_state=42)
rf_model.fit(X_train, y_train)
y_pred = rf_model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error(all): {mse}")
print(f"R-squared(all): {r2}")
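
Because the rows are hourly observations, a random split lets the model train on hours that surround each test row. A minimal sketch of a chronological hold-out follows, assuming a `datetime` column as in `consumption`. The helper `chronological_split` and the toy frame are hypothetical.

```python
import pandas as pd

def chronological_split(df, time_col="datetime", test_size=0.2):
    """Return (train, test) where test holds the latest `test_size` fraction of rows."""
    ordered = df.sort_values(time_col).reset_index(drop=True)
    cut = int(len(ordered) * (1 - test_size))
    return ordered.iloc[:cut], ordered.iloc[cut:]

# Toy hourly frame, shuffled on purpose to show that the split re-sorts by time.
toy = pd.DataFrame({
    "datetime": pd.date_range("2022-01-01", periods=10, freq=pd.Timedelta(hours=1)),
    "target": range(10),
})
train_part, test_part = chronological_split(toy.sample(frac=1, random_state=0))
```

Whether this changes the comparison between the two feature sets is something to verify on the real data; the sketch only shows the mechanics.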


In [None]:
drop_list=['eic_count','cloudcover_mid_hist_mean','cloudcover_high_hist_mean','cloudcover_high_hist_mean_by_county',
           'temperature_hist_mean_by_county',
       'dewpoint_hist_mean_by_county', 'cloudcover_total_hist_mean_by_county',
       'cloudcover_low_hist_mean_by_county',
       'cloudcover_mid_hist_mean_by_county',
       'direct_solar_radiation_hist_mean_by_county',
       'snowfall_hist_mean_by_county']
mask = np.isin(possible_inf_columns, drop_list, invert=True)
possible_inf_columns=possible_inf_columns[mask]

**- Answer, based on the above:**

  **(target, installed_capacity, temperature_hist_mean, dewpoint_hist_mean, cloudcover_total_hist_mean,
      cloudcover_low_hist_mean, direct_solar_radiation_hist_mean, snowfall_hist_mean) and, of course, datetime (month, year, day, hour).**

**- Q: What are the perfect conditions for each product in terms of the most important factors affecting it?**


In [None]:
# We want to find the hours during which the product makes a profit and extract the appropriate conditions from them
energy_consumption = consumption['target'].values
price_per_mwh =consumption['euros_per_mwh'].values
total_cost_consumption = energy_consumption * price_per_mwh
energy_production =production['target'].values
lowest_price_per_mwh = production['lowest_price_per_mwh'].values
highest_price_per_mwh = production['highest_price_per_mwh'].values
average_price_per_mwh = (lowest_price_per_mwh + highest_price_per_mwh) / 2
total_revenue_production = energy_production * average_price_per_mwh
total_budget = total_revenue_production - total_cost_consumption
profit_data=pd.concat([consumption[total_budget>0],production[total_budget>0]])
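
The element-wise arithmetic above relies on `consumption` and `production` having the same number of rows in the same order. A sketch that makes the alignment explicit by merging on shared keys follows. It assumes `datetime` alone identifies matching rows, which may not hold in the real data (a county or unit id may be needed as well). `hourly_budget` and the toy frames are hypothetical.

```python
import pandas as pd

def hourly_budget(consumption, production, keys=("datetime",)):
    """Merge the two frames on `keys` and compute revenue minus cost per row."""
    keys = list(keys)
    cons = consumption[keys + ["target", "euros_per_mwh"]].rename(
        columns={"target": "consumed"})
    prod = production[keys + ["target", "lowest_price_per_mwh",
                              "highest_price_per_mwh"]].rename(
        columns={"target": "produced"})
    merged = cons.merge(prod, on=keys, how="inner")
    avg_price = (merged["lowest_price_per_mwh"] + merged["highest_price_per_mwh"]) / 2
    revenue = merged["produced"] * avg_price
    cost = merged["consumed"] * merged["euros_per_mwh"]
    merged["budget"] = revenue - cost
    return merged

times = pd.to_datetime(["2022-01-01 00:00", "2022-01-01 01:00"])
cons_toy = pd.DataFrame({"datetime": times, "target": [2.0, 1.0],
                         "euros_per_mwh": [10.0, 10.0]})
# Rows deliberately in the opposite order to show that the merge realigns them.
prod_toy = pd.DataFrame({"datetime": times[::-1], "target": [3.0, 1.0],
                         "lowest_price_per_mwh": [8.0, 8.0],
                         "highest_price_per_mwh": [12.0, 12.0]})
budget = hourly_budget(cons_toy, prod_toy)
profitable = budget[budget["budget"] > 0]
```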

In [None]:
fig = sp.make_subplots(rows=8, cols=1, subplot_titles=['target','installed_capacity',
 'temperature_hist_mean','dewpoint_hist_mean','cloudcover_total_hist_mean','cloudcover_low_hist_mean','direct_solar_radiation_hist_mean',
 'snowfall_hist_mean'])
start=0
end=0
profit_data=profit_data.sort_values(by='datetime')
for row in range(8):
    fig.add_trace(go.Scatter(x=profit_data.datetime, y=profit_data[f'{possible_inf_columns[row]}']),row=row+1,col=1)
  


fig.update_layout(
                 height=2000,
                 template='plotly_dark',
                  )
fig.show()


**- Q: When is product consumption above the limit, or an outlier?**

In [None]:
# We will apply two methods for outlier detection.
# Method 1: univariate outliers.
# The goal of the first method is to detect significantly higher consumption.
data = consumption.target.values

# Set your Z-score threshold (commonly 2 or 3)
z_threshold = 2.5

# Perform univariate outlier detection.
# stats.zscore standardizes the data itself, so a separate StandardScaler step is not needed.
z_scores = np.abs(stats.zscore(data))
outliers = data[z_scores > z_threshold]
non_outliers = data[z_scores <= z_threshold]

# Print the fraction of rows flagged as outliers
print("Fraction of detected outliers:", outliers.shape[0]/consumption.shape[0])
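
Since the stated goal is unusually high consumption, a one-sided threshold may fit better than the absolute z-score, which also flags unusually low values. A minimal sketch follows, assuming the same threshold of 2.5. `high_side_outliers` is a hypothetical helper and the sample values are made up.

```python
import numpy as np

def high_side_outliers(values, z_threshold=2.5):
    """Boolean mask of values more than z_threshold standard deviations above the mean."""
    values = np.asarray(values, dtype=float)
    # Assumes the values are not all identical (std > 0).
    z = (values - values.mean()) / values.std()
    return z > z_threshold

sample = np.array([10.0] * 20 + [11.0] * 20 + [100.0])
mask = high_side_outliers(sample)
```

With the notebook's data, `consumption[high_side_outliers(consumption.target.values)]` would select only the high-consumption rows.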

In [None]:
max_targets=consumption.iloc[z_scores > z_threshold]
drop_list=['eic_count','cloudcover_mid_hist_mean','cloudcover_high_hist_mean','cloudcover_high_hist_mean_by_county',
           'temperature_hist_mean_by_county',
       'dewpoint_hist_mean_by_county', 'cloudcover_total_hist_mean_by_county',
       'cloudcover_low_hist_mean_by_county',
       'cloudcover_mid_hist_mean_by_county',
       'direct_solar_radiation_hist_mean_by_county',
       'snowfall_hist_mean_by_county']
mask = np.isin(possible_inf_columns, drop_list, invert=True)
possible_inf_columns=possible_inf_columns[mask]
fig = sp.make_subplots(rows=8, cols=1, subplot_titles=['target','installed_capacity',
 'temperature_hist_mean','dewpoint_hist_mean','cloudcover_total_hist_mean','cloudcover_low_hist_mean',
'direct_solar_radiation_hist_mean','snowfall_hist_mean'])
start=0
end=0
max_targets=max_targets.sort_values(by='datetime')
for row in range(8):
    fig.add_trace(go.Scatter(x=max_targets.datetime, y=max_targets[f'{possible_inf_columns[row]}']),row=row+1,col=1)
  


fig.update_layout(
                 height=2000,
                 template='plotly_dark',
                  )
fig.show()


In [None]:
# Method 2: multivariate outliers.
# Our goal in extracting outliers is to identify unexpected behavior from the product.
lof = LocalOutlierFactor(n_neighbors=5)
outlier_labels = lof.fit_predict(consumption[possible_inf_columns])
print('percentage:',consumption[outlier_labels==-1].shape[0]/consumption.shape[0])
# If the percentage is large, this is an indication that something is wrong with the product.

In [None]:
# Check the production side: hours in which production does not exceed consumption
diff=production.target.values-consumption.target.values
print("number of hours with non-positive net production:",consumption[diff<=0].shape[0])
print("total number of hours:",consumption.shape[0])
print("share of hours with non-positive net production:",consumption[diff<=0].shape[0]/consumption.shape[0])

In [None]:
#bad_conditions
bad_conditions=consumption[diff<=0]
fig = sp.make_subplots(rows=8, cols=1, subplot_titles=['target','installed_capacity',
 'temperature_hist_mean','dewpoint_hist_mean','cloudcover_total_hist_mean','cloudcover_low_hist_mean',
'direct_solar_radiation_hist_mean','snowfall_hist_mean'])
start=0
end=0
bad_conditions=bad_conditions.sort_values(by='datetime')
for row in range(8):
    fig.add_trace(go.Scatter(x=bad_conditions.datetime, y=bad_conditions[f'{possible_inf_columns[row]}']),row=row+1,col=1)
  


fig.update_layout(
                 height=2000,
                 template='plotly_dark',
                  )
fig.show()


**- Repeat this for every consumption and production product.**

# - Construct..

**- Q: What is the needed transformation?**

In [None]:
# Because we are dealing with time-series data, the model should be given the value of the target from n days earlier
def create_n_day_lags(data, N_day_lags):
    original_datetime = data['datetime']
    revealed_targets = data[['datetime', 'prediction_unit_id', 'is_consumption', 'target']].copy()
    for day_lag in range(2, N_day_lags+1):
        revealed_targets['datetime'] = original_datetime + pd.DateOffset(days=day_lag)
        data = data.merge(revealed_targets, 
                          how='left', on = ['datetime', 'prediction_unit_id', 'is_consumption'],
                          suffixes = ('', f'_{day_lag}_days_ago'))
    # Cyclical encoding of the hour: the angle is 2*pi*hour/24 = pi*hour/12
    data['sin_hour']= np.sin(np.pi * data['hour'] / 12)
    data['cos_hour']= np.cos(np.pi * data['hour'] / 12)
    data['target_mean']= data[[f'target_{i}_days_ago' for i in range(2, N_day_lags+1)]].mean(1)
    data['target_std']= data[[f'target_{i}_days_ago' for i in range(2, N_day_lags+1)]].std(1)
    data['target_var']= data[[f'target_{i}_days_ago' for i in range(2, N_day_lags+1)]].var(1)
    return data
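
The sine and cosine hour features are meant to place hour 23 next to hour 0 instead of 23 steps away. A short worked check of the standard cyclical encoding, $\sin(2\pi h/24)$ and $\cos(2\pi h/24)$, follows. `encode_hour` is a hypothetical helper used only for this illustration.

```python
import numpy as np

def encode_hour(hour):
    """Map hour-of-day (0..23) to a point on the unit circle."""
    angle = 2 * np.pi * np.asarray(hour, dtype=float) / 24
    return np.sin(angle), np.cos(angle)

hours = np.arange(24)
sin_h, cos_h = encode_hour(hours)
points = np.column_stack([sin_h, cos_h])

# Distances between encoded hours: neighbours across midnight are as close
# as any other pair of neighbours, and hours 12 apart are the farthest.
gap_23_to_0 = np.linalg.norm(points[23] - points[0])
gap_0_to_1 = np.linalg.norm(points[0] - points[1])
gap_0_to_12 = np.linalg.norm(points[0] - points[12])
```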

In [None]:
def present_skew_data(data):
    """Return a table of skewness per numeric column, flagging |skew| >= 0.5."""
    skew_df = pd.DataFrame(data.select_dtypes(np.number).columns, columns=['Feature'])
    # `stats` is scipy.stats (imported above via `from scipy import stats`)
    skew_df['Skew'] = skew_df['Feature'].apply(lambda feature: stats.skew(data[feature]))
    skew_df['Absolute Skew'] = skew_df['Skew'].apply(abs)
    skew_df['Skewed'] = skew_df['Absolute Skew'] >= 0.5
    skew_df = skew_df[~skew_df['Feature'].isin(['year', 'prediction_unit_id'])]
    return skew_df

In [None]:
#Editing the distribution for some columns that have a significant deviation
present_skew_data(data)

In [None]:
def skew_data(data):
    """Log-transform every numeric column whose absolute skew is at least 0.5."""
    skew_df = pd.DataFrame(data.select_dtypes(np.number).columns, columns=['Feature'])
    # `stats` is scipy.stats (imported above via `from scipy import stats`)
    skew_df['Skew'] = skew_df['Feature'].apply(lambda feature: stats.skew(data[feature]))
    skew_df['Absolute Skew'] = skew_df['Skew'].apply(abs)
    skew_df['Skewed'] = skew_df['Absolute Skew'] >= 0.5
    skew_df = skew_df[~skew_df['Feature'].isin(['year', 'prediction_unit_id'])]
    columns = skew_df[skew_df['Skewed']].Feature.values
    data = distribution_preprocessing(data.copy(), columns)
    return data

In [None]:
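def distribution_preprocessing(data, columns):
    """Log-transform the given columns, leaving zeros at 0."""
    for i in columns:
        data[i] = np.where(data[i] != 0, np.log(data[i]), 0)
    # Return after the loop so that every column is transformed, not only the first
    return data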
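
A common alternative for skewed, non-negative columns is `np.log1p`, which maps 0 to 0 without a special case and is inverted by `np.expm1`. The sketch below assumes that swap is acceptable. It is a different transform from the `np.log` used above, so predictions would need `np.expm1` rather than `np.exp`. `log1p_columns` is a hypothetical helper, and clipping negatives to 0 is an added assumption.

```python
import numpy as np
import pandas as pd

def log1p_columns(df, columns):
    """Return a copy of df with log(1 + x) applied to the given columns."""
    out = df.copy()
    for col in columns:
        # Assumption: negative values are not meaningful here and are clipped to 0.
        out[col] = np.log1p(out[col].clip(lower=0))
    return out

toy = pd.DataFrame({"target": [0.0, 1.0, 99.0], "hour": [0, 1, 2]})
transformed = log1p_columns(toy, ["target"])
restored = np.expm1(transformed["target"])  # inverse transform
```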

In [None]:
#apply functions
data=create_n_day_lags(data.copy(),7)
data=skew_data(data)

In [None]:
#drop unneeded columns
data.drop(['datetime','prediction_unit_id'],axis=1,inplace=True)

**- Q: Do we need two models?**

- Yes, we need to build two models: one for consumption and another for production.
- Why? Because each model has a different goal. It is illogical to build a single model for two contradictory goals, in addition to the clearly different patterns and influencing factors between them.

In [None]:
x_is_consumption= data[data['is_consumption'] != 0].drop('target', axis= 1)
y_is_consumption= data[data['is_consumption'] != 0]['target']
x_production= data[data['is_consumption'] == 0].drop('target', axis= 1)
y_production= data[data['is_consumption'] == 0]['target']

In [None]:
from sklearn.model_selection import train_test_split  # not included in the imports cell

x_is_consumption_train, x_is_consumption_test, y_is_consumption_train, y_is_consumption_test = train_test_split(
    x_is_consumption, y_is_consumption, test_size=0.3, shuffle=True, random_state=5)
x_production_train, x_production_test, y_production_train, y_production_test = train_test_split(
    x_production, y_production, test_size=0.3, shuffle=True, random_state=51)

**- Q: Which is the best model?**

- We tried a group of algorithms; the best of them, as a result, is RandomForest.
- An important note: the resources available to us limit our use of some complex algorithms.

In [None]:
# Apply grid search to optimize the parameters..
#xgb_is_consumption = xgb.XGBRegressor()
#param_grid = {
   # 'n_estimators': [4,100, 200, 300],  
   # 'max_depth': [3, 4, 5],  
   # 'learning_rate': [0.1, 0.01, 0.001]  
#}

#grid_search = GridSearchCV(estimator=xgb_is_consumption , param_grid=param_grid, scoring='neg_mean_squared_error', cv=3, verbose=1)
#grid_result = grid_search.fit(x_is_consumption_train,y_is_consumption_train)

#print("Best Parameters:", grid_result.best_params_)
#print("Best Score:", grid_result.best_score_)

In [None]:
xgb_is_consumption = RandomForestRegressor(n_estimators=4)
xgb_is_consumption.fit(x_is_consumption_train,y_is_consumption_train)
y_is_consumption_pred=xgb_is_consumption.predict(x_is_consumption_test)
mse = mean_squared_error(y_is_consumption_test, y_is_consumption_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_is_consumption_test, y_is_consumption_pred)
r_squared = r2_score(y_is_consumption_test, y_is_consumption_pred)
print(f"Mean Squared Error (MSE): {mse}")
print(f"Root Mean Squared Error (RMSE): {rmse}")
print(f"Mean Absolute Error (MAE): {mae}")
print(f"R-squared (R²): {r_squared}")
# Store predictions next to the true values (the model is a RandomForest despite the xgb_ prefix)
x_is_consumption_test['log_target_predicting']=y_is_consumption_pred
x_is_consumption_test['target']=y_is_consumption_test
x_is_consumption_test.loc[x_is_consumption_test['log_target_predicting']<0,'log_target_predicting']=0
x_is_consumption_test['target_predicting']=np.exp(y_is_consumption_pred)
x_is_consumption_test['diff']=x_is_consumption_test['log_target_predicting']-x_is_consumption_test['target']
x_is_consumption_test[['target_predicting','log_target_predicting','target','diff']].head()
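
If `target` was log-transformed by `skew_data`, the metrics printed above are in log units. A minimal sketch of scoring on the original scale follows, assuming a plain `np.log` transform so that `np.exp` inverts it. Zeros that were left at 0 would map back to 1 under this inverse. The helper name and the arrays are made up for illustration.

```python
import numpy as np

def mae_original_scale(y_true_log, y_pred_log):
    """Mean absolute error after mapping both arrays back with exp."""
    y_true = np.exp(np.asarray(y_true_log, dtype=float))
    y_pred = np.exp(np.asarray(y_pred_log, dtype=float))
    return float(np.mean(np.abs(y_true - y_pred)))

y_true_log = np.log([10.0, 100.0, 1000.0])
y_pred_log = np.log([12.0, 90.0, 1100.0])
mae_log = float(np.mean(np.abs(y_true_log - y_pred_log)))   # error in log units
mae_raw = mae_original_scale(y_true_log, y_pred_log)         # error in target units
```

The two numbers can differ by orders of magnitude, which matters when the error is compared against energy quantities.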

In [19]:
#cross validation 
#scores = cross_val_score(RandomForestRegressor(n_estimators=4),x_is_consumption,y_is_consumption, cv=5)
#print("%0.2f accuracy with a standard deviation of %0.2f" % (scores.mean(), scores.std()))

In [None]:
xgb_production = RandomForestRegressor(n_estimators=4)
xgb_production.fit(x_production_train,y_production_train)
y_production_pred=xgb_production.predict(x_production_test)
mse = mean_squared_error(y_production_test, y_production_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_production_test, y_production_pred)
r_squared = r2_score(y_production_test, y_production_pred)
print(f"Mean Squared Error (MSE): {mse}")
print(f"Root Mean Squared Error (RMSE): {rmse}")
print(f"Mean Absolute Error (MAE): {mae}")
print(f"R-squared (R²): {r_squared}")
# Store predictions next to the true values (the model is a RandomForest despite the xgb_ prefix)
x_production_test['log_target_predicting']=y_production_pred
x_production_test['target']=y_production_test
x_production_test.loc[x_production_test['log_target_predicting']<0,'log_target_predicting']=0
x_production_test['target_predicting']=np.exp(y_production_pred)
x_production_test['diff']=x_production_test['log_target_predicting']-x_production_test['target']
x_production_test[['target_predicting','log_target_predicting','target','diff']].head()

In [None]:
# Cross-validation (the default score for a regressor is R², not accuracy)
scores = cross_val_score(RandomForestRegressor(n_estimators=4), x_production, y_production, cv=5)
print("%0.2f mean R² with a standard deviation of %0.2f" % (scores.mean(), scores.std()))

**- A question may come to mind: why use only 4 estimators? We didn't see much difference in accuracy with more. Indeed, if this accuracy is achieved with the least model complexity, that is ideal.**

# - Execute..

**- Some useful tips:**

**- Some products have high energy consumption in certain special circumstances. There are two solutions: either dispense with these products in those circumstances, or reduce their working time as much as possible.**

**- We noticed some weak points in each product, as written in the analysis stage; we must work on solving these weak points.**

**- Some products work only at certain times, and in some cases this is due to circumstances. Is it possible to change the time in order to reduce energy consumption?**

**- We must pay attention to the difference between a product's energy production and its energy consumption over half a day. Why? Because if it is negative, there is a problem, and it is most likely caused by weather conditions.**

**- We noticed a climate pattern for each county at the same time each year.
We don't know the nature of the products' locations, but if their locations can be changed, then there are counties that are not suitable for some products due to their weather conditions.**
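
The half-day balance check mentioned in the tips above can be sketched with a 12-hour resample. This assumes hourly production and consumption series that share a datetime index. `half_day_balance` and the toy values are hypothetical.

```python
import pandas as pd

def half_day_balance(production, consumption):
    """Sum of production minus consumption over consecutive 12-hour windows."""
    net = production - consumption
    return net.resample(pd.Timedelta(hours=12)).sum()

index = pd.date_range("2022-06-01", periods=24, freq=pd.Timedelta(hours=1))
production = pd.Series([0.0] * 12 + [2.0] * 12, index=index)
consumption = pd.Series([1.0] * 24, index=index)

balance = half_day_balance(production, consumption)
# Windows with a negative balance are the ones worth inspecting against the weather.
deficit_windows = balance[balance < 0]
```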