<h1>Project: Fundamentals of Information Systems </h1>

<h3>Introduction</h3>

This data was gathered at **two solar power plants** in India over a **34-day** period. It comes as **two** pairs of files: each pair has **one power generation dataset** and **one sensor readings dataset**. 
- The **power generation datasets** are gathered at the inverter level - each inverter has multiple lines of solar panels attached to it. 
- The **sensor data** is gathered at a plant level - a single array of sensors optimally placed at the plant.

# Output

**Questions:**
 - What is the **mean** value of **daily yield**? 
 - What is the **total irradiation per day**? 
 - What is the **max ambient** and **module temperature**? 
 - **How many inverters** are there **for each plant**? 
 - What is the **maximum/minimum amount** of **DC/AC Power generated** in a **time interval/day**? 
 - **Which inverter** (source_key) has produced **maximum DC/AC power**? 
 - **Rank the inverters** based on the **DC/AC power** they produce? Is there **any missing data**?
 
 
 - Graphs that show how each attribute behaves on its own, independent of the other variables. These usually plot the attribute against DATETIME, DATE, or TIME. 
**Examples.** How does DC or AC power change over time? How does irradiation change over time? How do ambient and module temperature change over time? How does yield change over time? Explore plotting each variable against different granularities of DATETIME and decide which granularity suits each variable best.
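One possible way to compare granularities is to resample a timestamp-indexed series before plotting it. The sketch below is only an illustration on a synthetic 15-minute temperature series, not on the plant data; the names `temp`, `hourly` and `daily` are made up for the example.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for AMBIENT_TEMPERATURE: two days of 15-minute readings
# following one sine cycle per day around 25 degrees (assumed values).
idx = pd.date_range("2020-05-15", periods=192, freq="15min")
temp = pd.Series(25 + 5 * np.sin(np.arange(192) * 2 * np.pi / 96),
                 index=idx, name="AMBIENT_TEMPERATURE")

# Same variable at two coarser granularities.
hourly = temp.resample("60min").mean()  # one value per hour
daily = temp.resample("D").mean()       # one value per day

print(len(temp), len(hourly), len(daily))
```

Each of the three series can then be passed to `plt.plot` to see which granularity shows the pattern of interest most clearly.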

# Variables

**Power generation data**
- DATE_TIME: Date and time for each observation. Observations recorded at 15 minute intervals.
 
- PLANT_ID: Plant ID - this will be common for the entire file.
 
- SOURCE_KEY: Source key in this file stands for the inverter id.
 
- DC_POWER: Amount of DC power generated by the inverter (source_key) in this 15 minute interval. Units - kW.
 
- AC_POWER: Amount of AC power generated by the inverter (source_key) in this 15 minute interval. Units - kW.
 
- DAILY_YIELD: Daily yield is a cumulative sum of power generated on that day, till that point in time.
 
- TOTAL_YIELD: This is the total yield for the inverter till that point in time.
 
**Weather sensor data**
 - DATE_TIME: Date and time for each observation. Observations recorded at 15 minute intervals.
 
 - PLANT_ID: Plant ID - this will be common for the entire file.
 
 - SOURCE_KEY: Stands for the sensor panel id. This will be common for the entire file because there's only one sensor panel for the plant.
 
 - AMBIENT_TEMPERATURE: This is the ambient temperature at the plant.
 
 - MODULE_TEMPERATURE: There's a module (solar panel) attached to the sensor panel. This is the temperature reading for that module.
 
 - IRRADIATION: Amount of irradiation for the 15 minute interval.



<h5>Libraries needed</h5>

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import datetime

**Data**

In [2]:
p1_gen = pd.read_csv('Plant_1_Generation_data.csv')
p1_wea = pd.read_csv('Plant_1_Weather_Sensor_Data.csv')
p2_gen = pd.read_csv("Plant_2_Generation_data.csv") 
p2_wea = pd.read_csv("Plant_2_Weather_Sensor_Data.csv")

In [3]:
print("Shape of the table for plant 1 generation data: ", p1_gen.shape)
print("Shape of the table for plant 2 generation data: ",p2_gen.shape)
print("Name of the columns for the dataframes: \n",list(p1_gen.columns))
assert(np.all(p1_gen.columns == p2_gen.columns))  # just making sure they have the same columns

Shape of the table for plant 1 generation data:  (68778, 7)
Shape of the table for plant 2 generation data:  (67698, 7)
Name of the columns for the dataframes: 
 ['DATE_TIME', 'PLANT_ID', 'SOURCE_KEY', 'DC_POWER', 'AC_POWER', 'DAILY_YIELD', 'TOTAL_YIELD']


In [4]:
print("Shape of the table for weather sensor plant 1: ", p1_wea.shape)
print("Shape of the table for weather sensor plant 2: ", p2_wea.shape)
print("Column names for weather sensor tables: \n", list(p1_wea.columns))
assert(np.all(p1_wea.columns == p2_wea.columns)) # just making sure they have the same columns

Shape of the table for weather sensor plant 1:  (3182, 6)
Shape of the table for weather sensor plant 2:  (3259, 6)
Column names for weather sensor tables: 
 ['DATE_TIME', 'PLANT_ID', 'SOURCE_KEY', 'AMBIENT_TEMPERATURE', 'MODULE_TEMPERATURE', 'IRRADIATION']


<h3> DATA CLEANING </h3>

In [5]:
### the DATE_TIME format of p1_gen ("15-05-2020 00:00") differs from that of the other
### three tables ("2020-05-15 00:00:00"); to join them we convert everything to the p1_gen format

def convert_dates(date):
    # "YYYY-MM-DD HH:MM:SS" -> "DD-MM-YYYY HH:MM": swap day and year, drop the seconds
    return date[8:10]+date[4:8]+date[:4]+date[10:-3]

p2_gen.DATE_TIME = p2_gen.DATE_TIME.apply(convert_dates)
p1_wea.DATE_TIME = p1_wea.DATE_TIME.apply(convert_dates)
p2_wea.DATE_TIME = p2_wea.DATE_TIME.apply(convert_dates)
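As an alternative to slicing the strings, the two layouts could be parsed into real timestamps, which also makes later grouping by day easier. This is a minimal sketch, assuming the only layouts present are `DD-MM-YYYY HH:MM` and `YYYY-MM-DD HH:MM:SS`; the helper `parse_date_time` and the sample values are hypothetical.

```python
import pandas as pd

def parse_date_time(series: pd.Series) -> pd.Series:
    """Parse a DATE_TIME column stored in either of the two assumed layouts."""
    # Try the day-first layout; values that do not match become NaT.
    day_first = pd.to_datetime(series, format="%d-%m-%Y %H:%M", errors="coerce")
    # Try the year-first layout on the same values.
    year_first = pd.to_datetime(series, format="%Y-%m-%d %H:%M:%S", errors="coerce")
    # Keep whichever parse succeeded for each row.
    return day_first.fillna(year_first)

sample_gen = pd.Series(["15-05-2020 00:00", "15-05-2020 00:15"])
sample_wea = pd.Series(["2020-05-15 00:00:00", "2020-05-15 00:15:00"])

parsed_gen = parse_date_time(sample_gen)
parsed_wea = parse_date_time(sample_wea)
print(parsed_gen.equals(parsed_wea))
```

With timestamps instead of strings, the day is available as `parsed.dt.date` and no manual slicing is needed.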

In [6]:
data1 = pd.merge(p1_gen, p1_wea, on = "DATE_TIME")
data1

Unnamed: 0,DATE_TIME,PLANT_ID_x,SOURCE_KEY_x,DC_POWER,AC_POWER,DAILY_YIELD,TOTAL_YIELD,PLANT_ID_y,SOURCE_KEY_y,AMBIENT_TEMPERATURE,MODULE_TEMPERATURE,IRRADIATION
0,15-05-2020 00:00,4135001,1BY6WEcLGh8j5v7,0.0,0.0,0.000,6259559.0,4135001,HmiyD2TTLFNqkNe,25.184316,22.857507,0.0
1,15-05-2020 00:00,4135001,1IF53ai7Xc0U56Y,0.0,0.0,0.000,6183645.0,4135001,HmiyD2TTLFNqkNe,25.184316,22.857507,0.0
2,15-05-2020 00:00,4135001,3PZuoBAID5Wc2HD,0.0,0.0,0.000,6987759.0,4135001,HmiyD2TTLFNqkNe,25.184316,22.857507,0.0
3,15-05-2020 00:00,4135001,7JYdWkrLSPkdwr4,0.0,0.0,0.000,7602960.0,4135001,HmiyD2TTLFNqkNe,25.184316,22.857507,0.0
4,15-05-2020 00:00,4135001,McdE0feGgRqW7Ca,0.0,0.0,0.000,7158964.0,4135001,HmiyD2TTLFNqkNe,25.184316,22.857507,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...
68769,17-06-2020 23:45,4135001,uHbuxQJl8lW7ozc,0.0,0.0,5967.000,7287002.0,4135001,HmiyD2TTLFNqkNe,21.909288,20.427972,0.0
68770,17-06-2020 23:45,4135001,wCURE6d3bPkepu2,0.0,0.0,5147.625,7028601.0,4135001,HmiyD2TTLFNqkNe,21.909288,20.427972,0.0
68771,17-06-2020 23:45,4135001,z9Y9gH1T5YWrNuG,0.0,0.0,5819.000,7251204.0,4135001,HmiyD2TTLFNqkNe,21.909288,20.427972,0.0
68772,17-06-2020 23:45,4135001,zBIq5rxdHJRwDNY,0.0,0.0,5817.000,6583369.0,4135001,HmiyD2TTLFNqkNe,21.909288,20.427972,0.0


In [7]:
data2 = pd.merge(p2_gen, p2_wea, on = "DATE_TIME")
data2

Unnamed: 0,DATE_TIME,PLANT_ID_x,SOURCE_KEY_x,DC_POWER,AC_POWER,DAILY_YIELD,TOTAL_YIELD,PLANT_ID_y,SOURCE_KEY_y,AMBIENT_TEMPERATURE,MODULE_TEMPERATURE,IRRADIATION
0,15-05-2020 00:00,4136001,4UPUqMRk7TRMgml,0.0,0.0,9425.000000,2.429011e+06,4136001,iq8k7ZNt4Mwm3w0,27.004764,25.060789,0.0
1,15-05-2020 00:00,4136001,81aHJ1q11NBPMrL,0.0,0.0,0.000000,1.215279e+09,4136001,iq8k7ZNt4Mwm3w0,27.004764,25.060789,0.0
2,15-05-2020 00:00,4136001,9kRcWv60rDACzjR,0.0,0.0,3075.333333,2.247720e+09,4136001,iq8k7ZNt4Mwm3w0,27.004764,25.060789,0.0
3,15-05-2020 00:00,4136001,Et9kgGMDl729KT4,0.0,0.0,269.933333,1.704250e+06,4136001,iq8k7ZNt4Mwm3w0,27.004764,25.060789,0.0
4,15-05-2020 00:00,4136001,IQ2d7wF4YD8zU1Q,0.0,0.0,3177.000000,1.994153e+07,4136001,iq8k7ZNt4Mwm3w0,27.004764,25.060789,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...
67693,17-06-2020 23:45,4136001,q49J1IKaHRwDQnt,0.0,0.0,4157.000000,5.207580e+05,4136001,iq8k7ZNt4Mwm3w0,23.202871,22.535908,0.0
67694,17-06-2020 23:45,4136001,rrq4fwE8jgrTyWY,0.0,0.0,3931.000000,1.211314e+08,4136001,iq8k7ZNt4Mwm3w0,23.202871,22.535908,0.0
67695,17-06-2020 23:45,4136001,vOuJvMaM2sgwLmb,0.0,0.0,4322.000000,2.427691e+06,4136001,iq8k7ZNt4Mwm3w0,23.202871,22.535908,0.0
67696,17-06-2020 23:45,4136001,xMbIugepa2P7lBB,0.0,0.0,4218.000000,1.068964e+08,4136001,iq8k7ZNt4Mwm3w0,23.202871,22.535908,0.0
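The default inner merge silently drops generation rows whose timestamp has no weather reading. A possible check, sketched here on toy frames (the frames `gen` and `wea` and their values are invented for illustration), is a left merge with `indicator=True`:

```python
import pandas as pd

# Toy generation data: three timestamps for one hypothetical inverter.
gen = pd.DataFrame({
    "DATE_TIME": ["15-05-2020 00:00", "15-05-2020 00:15", "15-05-2020 00:30"],
    "SOURCE_KEY": ["inv_a", "inv_a", "inv_a"],
    "DC_POWER": [0.0, 0.0, 0.0],
})
# Toy weather data: the 00:15 reading is missing.
wea = pd.DataFrame({
    "DATE_TIME": ["15-05-2020 00:00", "15-05-2020 00:30"],
    "IRRADIATION": [0.0, 0.0],
})

# how="left" keeps every generation row; validate checks that each
# timestamp appears at most once on the weather side.
checked = pd.merge(gen, wea, on="DATE_TIME", how="left",
                   validate="many_to_one", indicator=True)
unmatched = checked[checked["_merge"] == "left_only"]
print(len(unmatched), "generation rows without a weather reading")
```

If `unmatched` is empty on the real tables, the inner merge used above loses nothing.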


From here on, I manipulate the data myself.

In [8]:
uniq1 = data1.SOURCE_KEY_x.unique()
uniq1

array(['1BY6WEcLGh8j5v7', '1IF53ai7Xc0U56Y', '3PZuoBAID5Wc2HD',
       '7JYdWkrLSPkdwr4', 'McdE0feGgRqW7Ca', 'VHMLBKoKgIrUVDU',
       'WRmjgnKYAwPKWDb', 'ZnxXDlPa8U1GXgE', 'ZoEaEvLYb1n2sOq',
       'adLQvlD726eNBSB', 'bvBOhCH3iADSZry', 'iCRJl6heRkivqQ3',
       'ih0vzX44oOqAx2f', 'pkci93gMrogZuBj', 'rGa61gmuvPhdLxV',
       'sjndEbLyjtCKgGv', 'uHbuxQJl8lW7ozc', 'wCURE6d3bPkepu2',
       'z9Y9gH1T5YWrNuG', 'zBIq5rxdHJRwDNY', 'zVJPv84UY57bAof',
       'YxYtjZvoooNbGkE'], dtype=object)

In [9]:
uniq2 = data2.SOURCE_KEY_x.unique()
uniq2

array(['4UPUqMRk7TRMgml', '81aHJ1q11NBPMrL', '9kRcWv60rDACzjR',
       'Et9kgGMDl729KT4', 'IQ2d7wF4YD8zU1Q', 'LYwnQax7tkwH5Cb',
       'LlT2YUhhzqhg5Sw', 'Mx2yZCDsyf6DPfv', 'NgDl19wMapZy17u',
       'PeE6FRyGXUgsRhN', 'Qf4GUc1pJu5T6c6', 'Quc1TzYxW2pYoWX',
       'V94E5Ben1TlhnDV', 'WcxssY2VbP4hApt', 'mqwcsP2rE7J0TFp',
       'oZ35aAeoifZaQzV', 'oZZkBaNadn6DNKz', 'q49J1IKaHRwDQnt',
       'rrq4fwE8jgrTyWY', 'vOuJvMaM2sgwLmb', 'xMbIugepa2P7lBB',
       'xoJJ8DcxJEcupym'], dtype=object)

In [10]:
esempio1 = data1[data1.SOURCE_KEY_x == uniq1[0]]
esempio1

Unnamed: 0,DATE_TIME,PLANT_ID_x,SOURCE_KEY_x,DC_POWER,AC_POWER,DAILY_YIELD,TOTAL_YIELD,PLANT_ID_y,SOURCE_KEY_y,AMBIENT_TEMPERATURE,MODULE_TEMPERATURE,IRRADIATION
0,15-05-2020 00:00,4135001,1BY6WEcLGh8j5v7,0.0,0.0,0.0,6259559.0,4135001,HmiyD2TTLFNqkNe,25.184316,22.857507,0.0
21,15-05-2020 00:15,4135001,1BY6WEcLGh8j5v7,0.0,0.0,0.0,6259559.0,4135001,HmiyD2TTLFNqkNe,25.084589,22.761668,0.0
42,15-05-2020 00:30,4135001,1BY6WEcLGh8j5v7,0.0,0.0,0.0,6259559.0,4135001,HmiyD2TTLFNqkNe,24.935753,22.592306,0.0
63,15-05-2020 00:45,4135001,1BY6WEcLGh8j5v7,0.0,0.0,0.0,6259559.0,4135001,HmiyD2TTLFNqkNe,24.846130,22.360852,0.0
84,15-05-2020 01:00,4135001,1BY6WEcLGh8j5v7,0.0,0.0,0.0,6259559.0,4135001,HmiyD2TTLFNqkNe,24.621525,22.165423,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...
68664,17-06-2020 22:45,4135001,1BY6WEcLGh8j5v7,0.0,0.0,5521.0,6485319.0,4135001,HmiyD2TTLFNqkNe,22.150570,21.480377,0.0
68686,17-06-2020 23:00,4135001,1BY6WEcLGh8j5v7,0.0,0.0,5521.0,6485319.0,4135001,HmiyD2TTLFNqkNe,22.129816,21.389024,0.0
68708,17-06-2020 23:15,4135001,1BY6WEcLGh8j5v7,0.0,0.0,5521.0,6485319.0,4135001,HmiyD2TTLFNqkNe,22.008275,20.709211,0.0
68730,17-06-2020 23:30,4135001,1BY6WEcLGh8j5v7,0.0,0.0,5521.0,6485319.0,4135001,HmiyD2TTLFNqkNe,21.969495,20.734963,0.0


In [11]:
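gior = []
def get_days(x):
    # collect the day-of-month part ("DD") of each timestamp
    gior.append(x[0:2])

esempio1.DATE_TIME.apply(get_days)

conta = []
def contare(x):
    # count consecutive pairs of readings that fall on the same day:
    # a day with n readings contributes n - 1, and the last day is never
    # appended because no change of day follows it
    uguali = 0
    for i in range(len(x) - 1):
        if x[i] == x[i + 1]:
            uguali += 1
        else:
            conta.append(uguali)
            uguali = 0
    return conta

conto = np.array(contare(gior))
conto.T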

array([92, 87, 95, 95, 92, 75, 64, 95, 88, 95, 92, 94, 95, 89, 70, 95, 95,
       95, 95, 94, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95])

In [12]:
copia = esempio1.copy()
copia

Unnamed: 0,DATE_TIME,PLANT_ID_x,SOURCE_KEY_x,DC_POWER,AC_POWER,DAILY_YIELD,TOTAL_YIELD,PLANT_ID_y,SOURCE_KEY_y,AMBIENT_TEMPERATURE,MODULE_TEMPERATURE,IRRADIATION
0,15-05-2020 00:00,4135001,1BY6WEcLGh8j5v7,0.0,0.0,0.0,6259559.0,4135001,HmiyD2TTLFNqkNe,25.184316,22.857507,0.0
21,15-05-2020 00:15,4135001,1BY6WEcLGh8j5v7,0.0,0.0,0.0,6259559.0,4135001,HmiyD2TTLFNqkNe,25.084589,22.761668,0.0
42,15-05-2020 00:30,4135001,1BY6WEcLGh8j5v7,0.0,0.0,0.0,6259559.0,4135001,HmiyD2TTLFNqkNe,24.935753,22.592306,0.0
63,15-05-2020 00:45,4135001,1BY6WEcLGh8j5v7,0.0,0.0,0.0,6259559.0,4135001,HmiyD2TTLFNqkNe,24.846130,22.360852,0.0
84,15-05-2020 01:00,4135001,1BY6WEcLGh8j5v7,0.0,0.0,0.0,6259559.0,4135001,HmiyD2TTLFNqkNe,24.621525,22.165423,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...
68664,17-06-2020 22:45,4135001,1BY6WEcLGh8j5v7,0.0,0.0,5521.0,6485319.0,4135001,HmiyD2TTLFNqkNe,22.150570,21.480377,0.0
68686,17-06-2020 23:00,4135001,1BY6WEcLGh8j5v7,0.0,0.0,5521.0,6485319.0,4135001,HmiyD2TTLFNqkNe,22.129816,21.389024,0.0
68708,17-06-2020 23:15,4135001,1BY6WEcLGh8j5v7,0.0,0.0,5521.0,6485319.0,4135001,HmiyD2TTLFNqkNe,22.008275,20.709211,0.0
68730,17-06-2020 23:30,4135001,1BY6WEcLGh8j5v7,0.0,0.0,5521.0,6485319.0,4135001,HmiyD2TTLFNqkNe,21.969495,20.734963,0.0


In [13]:
def datesemplici(date):
    return date[3:5]+date[0:2]

copia.DATE_TIME = copia.DATE_TIME.apply(datesemplici)
copia

Unnamed: 0,DATE_TIME,PLANT_ID_x,SOURCE_KEY_x,DC_POWER,AC_POWER,DAILY_YIELD,TOTAL_YIELD,PLANT_ID_y,SOURCE_KEY_y,AMBIENT_TEMPERATURE,MODULE_TEMPERATURE,IRRADIATION
0,0515,4135001,1BY6WEcLGh8j5v7,0.0,0.0,0.0,6259559.0,4135001,HmiyD2TTLFNqkNe,25.184316,22.857507,0.0
21,0515,4135001,1BY6WEcLGh8j5v7,0.0,0.0,0.0,6259559.0,4135001,HmiyD2TTLFNqkNe,25.084589,22.761668,0.0
42,0515,4135001,1BY6WEcLGh8j5v7,0.0,0.0,0.0,6259559.0,4135001,HmiyD2TTLFNqkNe,24.935753,22.592306,0.0
63,0515,4135001,1BY6WEcLGh8j5v7,0.0,0.0,0.0,6259559.0,4135001,HmiyD2TTLFNqkNe,24.846130,22.360852,0.0
84,0515,4135001,1BY6WEcLGh8j5v7,0.0,0.0,0.0,6259559.0,4135001,HmiyD2TTLFNqkNe,24.621525,22.165423,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...
68664,0617,4135001,1BY6WEcLGh8j5v7,0.0,0.0,5521.0,6485319.0,4135001,HmiyD2TTLFNqkNe,22.150570,21.480377,0.0
68686,0617,4135001,1BY6WEcLGh8j5v7,0.0,0.0,5521.0,6485319.0,4135001,HmiyD2TTLFNqkNe,22.129816,21.389024,0.0
68708,0617,4135001,1BY6WEcLGh8j5v7,0.0,0.0,5521.0,6485319.0,4135001,HmiyD2TTLFNqkNe,22.008275,20.709211,0.0
68730,0617,4135001,1BY6WEcLGh8j5v7,0.0,0.0,5521.0,6485319.0,4135001,HmiyD2TTLFNqkNe,21.969495,20.734963,0.0


In [14]:
tab1 = pd.DataFrame()

tab1["DATE"] = copia.DATE_TIME.unique()
tab1["DC_MEAN"] = np.array(copia.groupby("DATE_TIME").DC_POWER.mean())
#tab1["DC_MED"] = np.array(copia.groupby("DATE_TIME").DC_POWER.median())
tab1["DC_MAX"] = np.array(copia.groupby("DATE_TIME").DC_POWER.max())
tab1["AC_MEAN"] = np.array(copia.groupby("DATE_TIME").AC_POWER.mean())
#tab1["AC_MED"] = np.array(copia.groupby("DATE_TIME").AC_POWER.median())
tab1["AC_MAX"] = np.array(copia.groupby("DATE_TIME").AC_POWER.max())
tab1["DY_MEAN"] = np.array(copia.groupby("DATE_TIME").DAILY_YIELD.mean())
#tab1["DY_MED"] = np.array(copia.groupby("DATE_TIME").DAILY_YIELD.median())
tab1["DY_MAX"] = np.array(copia.groupby("DATE_TIME").DAILY_YIELD.max())
tab1["IRR_MEAN"] = np.array(copia.groupby("DATE_TIME").IRRADIATION.mean())
#tab1["IRR_MED"] = np.array(copia.groupby("DATE_TIME").IRRADIATION.median())
tab1["IRR_MAX"] = np.array(copia.groupby("DATE_TIME").IRRADIATION.max())

tab1

Unnamed: 0,DATE,DC_MEAN,DC_MAX,AC_MEAN,AC_MAX,DY_MEAN,DY_MAX,IRR_MEAN,IRR_MAX
0,515,2530.545123,10642.75,247.812372,1039.35,2641.120776,5754.0,0.204699,0.893661
1,516,2916.24858,11209.0,285.50558,1095.285714,3380.406047,6292.0,0.211951,0.812241
2,517,3000.414807,11416.42857,293.467187,1114.814286,3473.035714,7045.0,0.238869,0.997904
3,518,2125.315662,12238.85714,208.026116,1193.628571,2271.423549,4998.0,0.159026,0.971481
4,519,2497.605031,10854.5,244.528783,1059.8,3087.0649,6449.0,0.194031,0.835832
5,520,3031.744987,12094.5,296.266557,1179.225,3155.774279,8249.0,0.240073,0.975161
6,521,4441.17848,11813.875,434.202042,1152.2375,5266.323535,7243.0,0.362403,1.038991
7,522,2925.094494,13335.14286,286.006603,1300.171429,3356.12965,6848.0,0.230408,1.047775
8,523,3668.182785,12904.625,358.684611,1258.1875,4073.705123,7966.0,0.293333,1.112297
9,524,3219.913877,12591.75,314.795499,1227.7125,3702.991443,7537.0,0.259762,0.975827
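The table above pairs `unique()` (order of appearance) with `groupby` results (sorted keys), which line up here because the rows are already in date order. A single `agg` call with named aggregations keeps keys and values together by construction. The sketch below runs on toy data; the frame `toy` and its values are assumptions for illustration.

```python
import pandas as pd

# Toy readings for two days in the "MMDD" key format used above.
toy = pd.DataFrame({
    "DAY": ["0515", "0515", "0516", "0516"],
    "DC_POWER": [0.0, 100.0, 50.0, 150.0],
    "IRRADIATION": [0.0, 0.5, 0.2, 0.8],
})

# One pass over the groups; each output column is (source column, function).
daily = toy.groupby("DAY").agg(
    DC_MEAN=("DC_POWER", "mean"),
    DC_MAX=("DC_POWER", "max"),
    IRR_MEAN=("IRRADIATION", "mean"),
    IRR_MAX=("IRRADIATION", "max"),
).reset_index()
print(daily)
```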


In [15]:
tab1.mean(numeric_only=True)  # DATE holds strings, so it is left out of the mean

DC_MEAN      2.904187e+03
DC_MAX       1.195947e+04
AC_MEAN      2.840983e+02
AC_MAX       1.166773e+03
DY_MEAN      3.265246e+03
DY_MAX       6.639866e+03
IRR_MEAN     2.327109e-01
IRR_MAX      9.887491e-01
dtype: float64

I create a table of means for each individual SOURCE_KEY of the first plant.

In [16]:
def datesemplici(date):
    return date[3:5]+date[0:2]

def crea_summit(data, uniq):
    tab1 = pd.DataFrame()
    source_keys = []
    dc_mean = []
    dc_max = []
    ac_mean = []
    ac_max = []
    dy_max = []
    irr_mean = []
    irr_max = []
    amb_temp = []
    mod_temp = []
    
    for sou_key in uniq:
        # work on a copy so that rewriting DATE_TIME does not touch the original frame
        pannello = data[data.SOURCE_KEY_x == sou_key].copy()
        pannello.DATE_TIME = pannello.DATE_TIME.apply(datesemplici)
        source_keys.append(sou_key)
        dc_mean.append(pannello.groupby("DATE_TIME").DC_POWER.mean().mean())
        dc_max.append(pannello.groupby("DATE_TIME").DC_POWER.max().mean())
        ac_mean.append(pannello.groupby("DATE_TIME").AC_POWER.mean().mean())
        ac_max.append(pannello.groupby("DATE_TIME").AC_POWER.max().mean())
        dy_max.append(pannello.groupby("DATE_TIME").DAILY_YIELD.max().mean())
        irr_mean.append(pannello.groupby("DATE_TIME").IRRADIATION.mean().mean())
        irr_max.append(pannello.groupby("DATE_TIME").IRRADIATION.max().mean())
        amb_temp.append(pannello.groupby("DATE_TIME").AMBIENT_TEMPERATURE.mean().mean())
        mod_temp.append(pannello.groupby("DATE_TIME").MODULE_TEMPERATURE.mean().mean())
         
    tab1["SOURCE_KEY"] = np.array(source_keys)
    tab1["DC_MEAN"] = np.array(dc_mean)
    tab1["DC_MAX"] = np.array(dc_max)
    tab1["AC_MEAN"] = np.array(ac_mean)
    tab1["AC_MAX"] = np.array(ac_max)
    tab1["DY_MAX"] = np.array(dy_max)
    tab1["IRR_MEAN"] = np.array(irr_mean)
    tab1["IRR_MAX"] = np.array(irr_max)
    tab1["AMB_TEMP"] = np.array(amb_temp)
    tab1["MOD_TEMP"] = np.array(mod_temp)
    return tab1
        
tab1 = crea_summit(data1, uniq1)
tab1


Unnamed: 0,SOURCE_KEY,DC_MEAN,DC_MAX,AC_MEAN,AC_MAX,DY_MAX,IRR_MEAN,IRR_MAX,AMB_TEMP,MOD_TEMP
0,1BY6WEcLGh8j5v7,2904.186572,11959.468312,284.098314,1166.773389,6639.865546,0.232711,0.988749,25.603412,31.309793
1,1IF53ai7Xc0U56Y,3264.930737,12931.521009,319.257001,1261.170221,7348.806373,0.235808,0.988749,25.617678,31.422753
2,3PZuoBAID5Wc2HD,3260.505174,12974.507354,318.823751,1265.327363,7341.279412,0.235826,0.988749,25.617266,31.42305
3,7JYdWkrLSPkdwr4,3171.342839,12716.30042,310.139821,1240.220693,7172.323529,0.234996,0.988749,25.617426,31.39979
4,McdE0feGgRqW7Ca,3248.772036,12896.538341,317.702792,1257.733981,7341.778361,0.235442,0.988749,25.602581,31.402481
5,VHMLBKoKgIrUVDU,3249.088195,12936.119537,317.706009,1261.57792,7347.004902,0.234996,0.988749,25.617426,31.39979
6,WRmjgnKYAwPKWDb,3193.666422,12679.143907,312.314877,1236.587973,7199.757353,0.235826,0.988749,25.617266,31.42305
7,ZnxXDlPa8U1GXgE,3235.59806,12852.486869,316.401831,1253.483876,7310.443102,0.235358,0.988749,25.618357,31.412678
8,ZoEaEvLYb1n2sOq,3174.965815,12625.627626,310.490107,1231.447847,7166.02521,0.235517,0.988749,25.601868,31.404252
9,adLQvlD726eNBSB,3308.80741,13166.518908,323.525326,1284.00562,7443.841912,0.235808,0.988749,25.617678,31.422753


In [17]:
print(tab1[tab1.DC_MEAN == tab1.DC_MEAN.max()]["SOURCE_KEY"])
print(tab1[tab1.DC_MAX == tab1.DC_MAX.max()]["SOURCE_KEY"])
print(tab1[tab1.AC_MEAN == tab1.AC_MEAN.max()]["SOURCE_KEY"])
print(tab1[tab1.AC_MAX == tab1.AC_MAX.max()]["SOURCE_KEY"])
print(tab1[tab1.DY_MAX == tab1.DY_MAX.max()]["SOURCE_KEY"])
print(tab1[tab1.IRR_MEAN == tab1.IRR_MEAN.max()]["SOURCE_KEY"])
print(tab1[tab1.AMB_TEMP == tab1.AMB_TEMP.max()]["SOURCE_KEY"])
print(tab1[tab1.MOD_TEMP == tab1.MOD_TEMP.max()]["SOURCE_KEY"])

9    adLQvlD726eNBSB
Name: SOURCE_KEY, dtype: object
9    adLQvlD726eNBSB
Name: SOURCE_KEY, dtype: object
9    adLQvlD726eNBSB
Name: SOURCE_KEY, dtype: object
9    adLQvlD726eNBSB
Name: SOURCE_KEY, dtype: object
9    adLQvlD726eNBSB
Name: SOURCE_KEY, dtype: object
21    YxYtjZvoooNbGkE
Name: SOURCE_KEY, dtype: object
21    YxYtjZvoooNbGkE
Name: SOURCE_KEY, dtype: object
21    YxYtjZvoooNbGkE
Name: SOURCE_KEY, dtype: object


In [18]:
tab2 = crea_summit(data2, uniq2)
tab2


Unnamed: 0,SOURCE_KEY,DC_MEAN,DC_MAX,AC_MEAN,AC_MAX,DY_MAX,IRR_MEAN,IRR_MAX,AMB_TEMP,MOD_TEMP
0,4UPUqMRk7TRMgml,276.818778,1179.613263,270.660532,1151.181681,7703.903782,0.229919,0.914182,28.140243,32.786542
1,81aHJ1q11NBPMrL,230.752762,1111.707997,225.685761,1085.282549,5916.019608,0.23271,0.918425,28.068912,32.770801
2,9kRcWv60rDACzjR,244.682823,1087.866092,239.310767,1062.094384,5835.535294,0.23271,0.918425,28.068912,32.770801
3,Et9kgGMDl729KT4,188.657027,951.975777,184.649488,930.26514,4579.882353,0.229919,0.914182,28.140243,32.786542
4,IQ2d7wF4YD8zU1Q,281.698326,1179.599969,275.459547,1151.054683,7487.958608,0.207058,0.891533,27.643789,31.699897
5,LYwnQax7tkwH5Cb,196.333153,1023.663887,192.136332,1000.068375,4850.168627,0.23271,0.918425,28.068912,32.770801
6,LlT2YUhhzqhg5Sw,245.524546,1092.763291,240.141363,1066.766471,6074.411765,0.23271,0.918425,28.068912,32.770801
7,Mx2yZCDsyf6DPfv,284.494955,1189.859034,278.15656,1161.045672,7096.523109,0.229919,0.914182,28.140243,32.786542
8,NgDl19wMapZy17u,270.087739,1163.24684,264.137249,1135.162083,6925.028571,0.207058,0.891533,27.643789,31.699897
9,PeE6FRyGXUgsRhN,248.640609,1103.0257,243.175606,1076.753964,6326.855042,0.23271,0.918425,28.068912,32.770801


In [19]:
print(tab2[tab2.DC_MEAN == tab2.DC_MEAN.max()]["SOURCE_KEY"])
print(tab2[tab2.DC_MAX == tab2.DC_MAX.max()]["SOURCE_KEY"])
print(tab2[tab2.AC_MEAN == tab2.AC_MEAN.max()]["SOURCE_KEY"])
print(tab2[tab2.AC_MAX == tab2.AC_MAX.max()]["SOURCE_KEY"])
print(tab2[tab2.DY_MAX == tab2.DY_MAX.max()]["SOURCE_KEY"])
print(tab2[tab2.IRR_MEAN == tab2.IRR_MEAN.max()]["SOURCE_KEY"])
print(tab2[tab2.AMB_TEMP == tab2.AMB_TEMP.max()]["SOURCE_KEY"])
print(tab2[tab2.MOD_TEMP == tab2.MOD_TEMP.max()]["SOURCE_KEY"])

7    Mx2yZCDsyf6DPfv
Name: SOURCE_KEY, dtype: object
7    Mx2yZCDsyf6DPfv
Name: SOURCE_KEY, dtype: object
7    Mx2yZCDsyf6DPfv
Name: SOURCE_KEY, dtype: object
7    Mx2yZCDsyf6DPfv
Name: SOURCE_KEY, dtype: object
0    4UPUqMRk7TRMgml
Name: SOURCE_KEY, dtype: object
1     81aHJ1q11NBPMrL
2     9kRcWv60rDACzjR
5     LYwnQax7tkwH5Cb
6     LlT2YUhhzqhg5Sw
9     PeE6FRyGXUgsRhN
12    V94E5Ben1TlhnDV
13    WcxssY2VbP4hApt
16    oZZkBaNadn6DNKz
17    q49J1IKaHRwDQnt
18    rrq4fwE8jgrTyWY
19    vOuJvMaM2sgwLmb
21    xoJJ8DcxJEcupym
Name: SOURCE_KEY, dtype: object
0     4UPUqMRk7TRMgml
3     Et9kgGMDl729KT4
7     Mx2yZCDsyf6DPfv
10    Qf4GUc1pJu5T6c6
11    Quc1TzYxW2pYoWX
15    oZ35aAeoifZaQzV
Name: SOURCE_KEY, dtype: object
0     4UPUqMRk7TRMgml
3     Et9kgGMDl729KT4
7     Mx2yZCDsyf6DPfv
10    Qf4GUc1pJu5T6c6
11    Quc1TzYxW2pYoWX
15    oZ35aAeoifZaQzV
Name: SOURCE_KEY, dtype: object


In [20]:
risultati = pd.concat([tab1, tab2], ignore_index=True)
risultati.tail()

Unnamed: 0,SOURCE_KEY,DC_MEAN,DC_MAX,AC_MEAN,AC_MAX,DY_MAX,IRR_MEAN,IRR_MAX,AMB_TEMP,MOD_TEMP
39,q49J1IKaHRwDQnt,226.172926,1091.008683,221.262808,1065.217283,6269.882353,0.23271,0.918425,28.068912,32.770801
40,rrq4fwE8jgrTyWY,209.15914,1049.468053,204.665397,1024.943179,5182.382353,0.23271,0.918425,28.068912,32.770801
41,vOuJvMaM2sgwLmb,262.095831,1142.515182,256.216411,1114.802143,6839.470588,0.23271,0.918425,28.068912,32.770801
42,xMbIugepa2P7lBB,277.049914,1171.544518,270.934373,1143.215852,7493.028205,0.207058,0.891533,27.643789,31.699897
43,xoJJ8DcxJEcupym,240.852238,1128.430056,235.561469,1101.55965,6080.701961,0.23271,0.918425,28.068912,32.770801


In [31]:
rank1 = risultati.sort_values("DC_MEAN", ascending= False)["SOURCE_KEY"]
rank1 = {rank1.iloc[i] : i for i in range(len(rank1))}
rank1

{'adLQvlD726eNBSB': 0,
 '1IF53ai7Xc0U56Y': 1,
 '3PZuoBAID5Wc2HD': 2,
 'VHMLBKoKgIrUVDU': 3,
 'McdE0feGgRqW7Ca': 4,
 'ZnxXDlPa8U1GXgE': 5,
 'uHbuxQJl8lW7ozc': 6,
 'iCRJl6heRkivqQ3': 7,
 'zVJPv84UY57bAof': 8,
 'YxYtjZvoooNbGkE': 9,
 'wCURE6d3bPkepu2': 10,
 'pkci93gMrogZuBj': 11,
 'rGa61gmuvPhdLxV': 12,
 'WRmjgnKYAwPKWDb': 13,
 'sjndEbLyjtCKgGv': 14,
 'zBIq5rxdHJRwDNY': 15,
 'ZoEaEvLYb1n2sOq': 16,
 '7JYdWkrLSPkdwr4': 17,
 'z9Y9gH1T5YWrNuG': 18,
 'ih0vzX44oOqAx2f': 19,
 '1BY6WEcLGh8j5v7': 20,
 'bvBOhCH3iADSZry': 21,
 'Mx2yZCDsyf6DPfv': 22,
 'IQ2d7wF4YD8zU1Q': 23,
 'Qf4GUc1pJu5T6c6': 24,
 'xMbIugepa2P7lBB': 25,
 '4UPUqMRk7TRMgml': 26,
 'oZ35aAeoifZaQzV': 27,
 'mqwcsP2rE7J0TFp': 28,
 'NgDl19wMapZy17u': 29,
 'V94E5Ben1TlhnDV': 30,
 'vOuJvMaM2sgwLmb': 31,
 'oZZkBaNadn6DNKz': 32,
 'PeE6FRyGXUgsRhN': 33,
 'LlT2YUhhzqhg5Sw': 34,
 'WcxssY2VbP4hApt': 35,
 '9kRcWv60rDACzjR': 36,
 'xoJJ8DcxJEcupym': 37,
 '81aHJ1q11NBPMrL': 38,
 'q49J1IKaHRwDQnt': 39,
 'rrq4fwE8jgrTyWY': 40,
 'LYwnQax7tkwH5Cb': 41,
 '

In [22]:
rank2 = risultati.sort_values("DC_MAX", ascending= False)["SOURCE_KEY"]
rank2 = {key: i for i, key in enumerate(rank2)}  # position in the sorted order

In [23]:
rank3 = risultati.sort_values("AC_MEAN", ascending= False)["SOURCE_KEY"]
rank3 = {key: i for i, key in enumerate(rank3)}

In [24]:
rank4 = risultati.sort_values("AC_MAX", ascending= False)["SOURCE_KEY"]
rank4 = {key: i for i, key in enumerate(rank4)}

In [25]:
rank5 = risultati.sort_values("DY_MAX", ascending= False)["SOURCE_KEY"]
rank5 = {key: i for i, key in enumerate(rank5)}


In [26]:
ranking = {k: rank1.get(k, 0) + rank2.get(k, 0) + rank3.get(k, 0) + rank4.get(k, 0) + rank5.get(k, 0) for k in set(rank1)}
classifica = pd.DataFrame([ranking])
classifica = classifica.T
classifica.index.name = "SOURCE"
classifica = classifica.sort_values(0)
classifica

Unnamed: 0_level_0,0
SOURCE,Unnamed: 1_level_1
adLQvlD726eNBSB,3
1IF53ai7Xc0U56Y,12
3PZuoBAID5Wc2HD,13
VHMLBKoKgIrUVDU,15
McdE0feGgRqW7Ca,24
ZnxXDlPa8U1GXgE,30
iCRJl6heRkivqQ3,38
wCURE6d3bPkepu2,41
YxYtjZvoooNbGkE,48
zVJPv84UY57bAof,48


In [27]:
finale = pd.merge(classifica, risultati, left_on= "SOURCE", right_on= "SOURCE_KEY", how= "left")
finale = finale.sort_values(0, kind="stable")  # sort_values returns a new frame, so assign it
finale

Unnamed: 0,0,SOURCE_KEY,DC_MEAN,DC_MAX,AC_MEAN,AC_MAX,DY_MAX,IRR_MEAN,IRR_MAX,AMB_TEMP,MOD_TEMP
0,3,adLQvlD726eNBSB,3308.80741,13166.518908,323.525326,1284.00562,7443.841912,0.235808,0.988749,25.617678,31.422753
1,12,1IF53ai7Xc0U56Y,3264.930737,12931.521009,319.257001,1261.170221,7348.806373,0.235808,0.988749,25.617678,31.422753
2,13,3PZuoBAID5Wc2HD,3260.505174,12974.507354,318.823751,1265.327363,7341.279412,0.235826,0.988749,25.617266,31.42305
3,15,VHMLBKoKgIrUVDU,3249.088195,12936.119537,317.706009,1261.57792,7347.004902,0.234996,0.988749,25.617426,31.39979
4,24,McdE0feGgRqW7Ca,3248.772036,12896.538341,317.702792,1257.733981,7341.778361,0.235442,0.988749,25.602581,31.402481
5,30,ZnxXDlPa8U1GXgE,3235.59806,12852.486869,316.401831,1253.483876,7310.443102,0.235358,0.988749,25.618357,31.412678
6,38,iCRJl6heRkivqQ3,3232.509674,12851.616247,316.096061,1253.433508,7301.991597,0.235355,0.988749,25.601191,31.398116
7,41,wCURE6d3bPkepu2,3206.30725,12898.061099,313.537907,1257.866947,7235.289916,0.23535,0.988749,25.601242,31.398107
8,48,YxYtjZvoooNbGkE,3218.100411,12783.210609,314.701028,1246.744748,7219.845588,0.23708,0.988749,25.624307,31.478844
9,48,zVJPv84UY57bAof,3224.806457,12701.308824,315.34253,1238.772479,7268.209559,0.235442,0.988749,25.602581,31.402481


Other people's code

In [28]:
### we concatenate the two dataframes regarding the inverters

panels = pd.concat([p1_gen,p2_gen])

In [29]:
### grouping by day makes the computations below much easier
### we create a day column for the inverters dataframe, which ignores the time of the reading

panels["DAY"] = panels.DATE_TIME.apply(lambda x: datetime.datetime.strptime(x,"%d-%m-%Y %H:%M").date())

assert(panels.shape == (p1_gen.shape[0]+p2_gen.shape[0],p1_gen.shape[1]+1)) # just to check dimensions are fine after the merge
panels.head()

Unnamed: 0,DATE_TIME,PLANT_ID,SOURCE_KEY,DC_POWER,AC_POWER,DAILY_YIELD,TOTAL_YIELD,DAY
0,15-05-2020 00:00,4135001,1BY6WEcLGh8j5v7,0.0,0.0,0.0,6259559.0,2020-05-15
1,15-05-2020 00:00,4135001,1IF53ai7Xc0U56Y,0.0,0.0,0.0,6183645.0,2020-05-15
2,15-05-2020 00:00,4135001,3PZuoBAID5Wc2HD,0.0,0.0,0.0,6987759.0,2020-05-15
3,15-05-2020 00:00,4135001,7JYdWkrLSPkdwr4,0.0,0.0,0.0,7602960.0,2020-05-15
4,15-05-2020 00:00,4135001,McdE0feGgRqW7Ca,0.0,0.0,0.0,7158964.0,2020-05-15


In [30]:
### we do the same for the sensors, concatenating the two dataframes and creating a new column for the day

sensors = pd.concat([p1_wea,p2_wea])
# the weather DATE_TIME values were already converted to "DD-MM-YYYY HH:MM" above, so we parse that format
sensors["DAY"] = sensors.DATE_TIME.apply(lambda x: datetime.datetime.strptime(x,"%d-%m-%Y %H:%M").date())

assert(sensors.shape == (p1_wea.shape[0]+p2_wea.shape[0],p1_wea.shape[1]+1)) # just to check dimensions are fine after the merge
sensors.head()

<h3> QUESTIONS </h3>

<h6>What is the mean value of daily yield?</h6>

In [None]:
### we group by inverter ID and day, taking the maximum per day 
### (daily yield is reset at midnight, so we are effectively picking the last DAILY_YIELD reading per day per source)

grouped_df = panels.DAILY_YIELD.groupby([panels.SOURCE_KEY,panels.DAY]).max().reset_index()

In [None]:
grouped_df.head()

We can compute the mean daily yield per inverter (and plot the distribution)

In [None]:
## we consider the mean daily_yield per source_key and plot it as a histogram

mean_yields_per_inverter = grouped_df.DAILY_YIELD.groupby(grouped_df.SOURCE_KEY).mean().reset_index()

In [None]:
mean_yields_per_inverter.head()

In [None]:
plt.figure(figsize=(6,5))
plt.hist(mean_yields_per_inverter.DAILY_YIELD,density=True,bins=20)
plt.xlabel("mean daily yield",fontsize=12)
plt.ylabel("density",fontsize=12)
plt.show()

In [None]:
mean_inverter_daily_yield = mean_yields_per_inverter.DAILY_YIELD.mean()

In [None]:
print(f"The mean inverter daily yield, calculated considering all inverters, is {round(mean_inverter_daily_yield,2)} kW/day.")

We can instead compute the total daily yield considering every inverter and grouping by day

In [None]:
total_daily_yields = grouped_df.DAILY_YIELD.groupby(grouped_df.DAY).sum().reset_index()

In [None]:
total_daily_yields.head()

In [None]:
plt.figure(figsize=(6,5))
plt.hist(total_daily_yields.DAILY_YIELD,density=True,bins=10)
plt.xlabel("daily yields",fontsize=12)
plt.ylabel("density",fontsize=12)
plt.xticks(rotation=70)
plt.show()

In [None]:
mean_total_daily_yield = total_daily_yields.DAILY_YIELD.mean()
print(f"Mean total daily yield is {round(mean_total_daily_yield,2)} kW/day.")

Ideas and observations for the questions, probably worth flagging at the start: (**MISSING DATA**)

In [None]:
#### missing data: some inverters are missing data (indices 8,13,28,39 of command below)

# grouped_df.DAY.groupby(grouped_df.SOURCE_KEY).count().reset_index()

#### they all have 8 days of missing data (26 instead of 34) (TOTAL 4*8 = 32 missing data entries)

In [None]:
### in fact, between 21/05 and 28/05 there are 4 fewer data entries per day (8*4 = 32 missing data entries)
### command below shows this

# grouped_df.SOURCE_KEY.groupby(grouped_df.DAY).count().reset_index()
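The two commented-out checks count days per inverter and inverters per day. A sketch of how the missing (inverter, day) pairs themselves could be listed follows; it runs on a toy frame with hypothetical ids, where `inv_b` lacks one day.

```python
import pandas as pd

# Toy stand-in for grouped_df: inv_b has no row for 2020-05-16.
toy = pd.DataFrame({
    "SOURCE_KEY": ["inv_a"] * 3 + ["inv_b"] * 2,
    "DAY": ["2020-05-15", "2020-05-16", "2020-05-17",
            "2020-05-15", "2020-05-17"],
})

# Number of distinct days each inverter reports, against the overall number.
expected = toy["DAY"].nunique()
days_per_inverter = toy.groupby("SOURCE_KEY")["DAY"].nunique()
incomplete = days_per_inverter[days_per_inverter < expected]

# For each incomplete inverter, list exactly which days are absent.
all_days = set(toy["DAY"])
missing_days = {
    key: sorted(all_days - set(group["DAY"]))
    for key, group in toy.groupby("SOURCE_KEY")
    if group["DAY"].nunique() < expected
}
print(missing_days)
```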

<h6>What is the total irradiation per day?</h6>

In [None]:
sensors.head()

In [None]:
irr_per_day_df = sensors.IRRADIATION.groupby(sensors.DAY).sum().reset_index()

In [None]:
irr_per_day_df.head()

We can see how irradiation correlates with the total daily yield obtained:

In [None]:
fig,ax = plt.subplots()
plt.xticks(rotation=70)
ax.plot(irr_per_day_df.DAY,irr_per_day_df.IRRADIATION,color="red")
ax2 = ax.twinx()
ax2.plot(total_daily_yields.DAY,total_daily_yields.DAILY_YIELD,color="blue")
fig.legend(["Total Irradiation","Total daily yield"],loc=1)
plt.show()

# the discrepancy between 21/05 and 28/05 is due to the missing data reported above
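Besides comparing the two curves by eye, the relationship could be summarised with a correlation coefficient. The sketch below uses invented daily totals that are exactly proportional, so the Pearson coefficient comes out at 1 up to rounding; `irr`, `yld` and `both` are illustrative names, and real data will give a lower value.

```python
import pandas as pd

# Invented daily totals; the yield is exactly 10 times the irradiation.
irr = pd.DataFrame({
    "DAY": ["2020-05-15", "2020-05-16", "2020-05-17", "2020-05-18"],
    "IRRADIATION": [20.0, 25.0, 15.0, 30.0],
})
yld = pd.DataFrame({
    "DAY": ["2020-05-15", "2020-05-16", "2020-05-17", "2020-05-18"],
    "DAILY_YIELD": [200.0, 250.0, 150.0, 300.0],
})

# Align the two tables on the day, then compute the Pearson correlation.
both = irr.merge(yld, on="DAY")
r = both["IRRADIATION"].corr(both["DAILY_YIELD"])
print(round(r, 3))
```

On the real tables the merge would use `irr_per_day_df` and `total_daily_yields`, and days affected by missing inverter data would pull the coefficient down.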

<h6>What is the max ambient and module temperature?</h6>

In [None]:
ambient_temp = sensors.AMBIENT_TEMPERATURE.max()
print("max ambient temperature:",round(ambient_temp,2))
module_temp = sensors.MODULE_TEMPERATURE.max()
print("max module temperature:",round(module_temp,2))

<h6>How many inverters are there for each plant?</h6>

In [None]:
plant1_panels = len(p1_gen.SOURCE_KEY.unique())
plant2_panels = len(p2_gen.SOURCE_KEY.unique())

print(f"There are {plant1_panels} inverters in plant 1.")
print(f"There are {plant2_panels} inverters in plant 2.")

<h6>What is the maximum/minimum amount of DC/AC Power generated in a time interval/day?</h6>

In [None]:
dc_power_per_day = panels.DC_POWER.groupby(panels.DAY).sum().reset_index()
ac_power_per_day = panels.AC_POWER.groupby(panels.DAY).sum().reset_index()

In [None]:
dc_power_per_day.head()

In [None]:
ac_power_per_day.head()

In [None]:
print(f"Maximum DC POWER per day: {max(dc_power_per_day.DC_POWER)}")
print(f"Minimum DC POWER per day: {min(dc_power_per_day.DC_POWER)}")
print(f"Maximum AC POWER per day: {max(ac_power_per_day.AC_POWER)}")
print(f"Minimum AC POWER per day: {min(ac_power_per_day.AC_POWER)}")
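The question asks about both time intervals and days, and it can help to know when each extreme occurs. A sketch using `idxmax` and `idxmin` on toy data follows; the values and the two inverter rows per timestamp are invented.

```python
import pandas as pd

# Toy data: two inverter readings per timestamp.
toy = pd.DataFrame({
    "DATE_TIME": ["15-05-2020 12:00"] * 2 + ["15-05-2020 12:15"] * 2
                 + ["16-05-2020 12:00"] * 2,
    "DAY": ["2020-05-15"] * 4 + ["2020-05-16"] * 2,
    "DC_POWER": [100.0, 120.0, 90.0, 80.0, 50.0, 60.0],
})

# Total DC power over all inverters, per 15-minute interval and per day.
per_interval = toy.groupby("DATE_TIME")["DC_POWER"].sum()
per_day = toy.groupby("DAY")["DC_POWER"].sum()

# idxmax / idxmin return the index label (interval or day) of the extreme.
print("max interval:", per_interval.idxmax(), per_interval.max())
print("min interval:", per_interval.idxmin(), per_interval.min())
print("max day:", per_day.idxmax(), per_day.max())
print("min day:", per_day.idxmin(), per_day.min())
```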

**...other questions** (TODO)

<h5>Plots</h5> (still entirely to be done, this is just an example) (TODO)

E.g. how do ambient and module temperature change over time?

In [None]:
# for example, for one specific sensor (the plant 2 sensor panel)
first_sensor_df = p2_wea[p2_wea["SOURCE_KEY"]=="iq8k7ZNt4Mwm3w0"]
first_sensor_df.shape
plt.subplot(2,1,1)
plt.plot(first_sensor_df["DATE_TIME"][::24],first_sensor_df["AMBIENT_TEMPERATURE"][::24])
plt.subplot(2,1,2)
plt.plot(first_sensor_df["DATE_TIME"][::24],first_sensor_df["MODULE_TEMPERATURE"][::24])
plt.show()