# Project: Fundamentals of Information Systems 

### Introduction

This data has been gathered at **two solar power plants** in India over a **34-day** period. It consists of **two** pairs of files: each pair has **one power generation dataset** and **one sensor readings dataset**. 
- The **power generation datasets** are gathered at the inverter level - each inverter has multiple lines of solar panels attached to it. 
- The **sensor data** is gathered at a plant level - a single array of sensors optimally placed at the plant.

### Output

**Questions.**
- What is the **mean** value of **daily yield**? 
- What is the **total irradiation per day**? 
- What is the **max ambient** and **module temperature**? 
- **How many inverters** are there **for each plant**? 
- What is the **maximum/minimum amount** of **DC/AC Power generated** in a **time interval/day**? 
- **Which inverter** (source_key) has produced **maximum DC/AC power**? 
- **Rank the inverters** based on the **DC/AC power** they produce? Is there **any missing data**?
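
As a rough illustration of how some of these questions could be approached with pandas, the sketch below counts inverters per plant and ranks inverters by total DC power. The toy table and its values are made up; only the column names come from the datasets.

```python
import pandas as pd

# Toy stand-in for a generation table; the values are made up for illustration.
gen = pd.DataFrame({
    "PLANT_ID":   [1, 1, 1, 1, 1, 1],
    "SOURCE_KEY": ["a", "a", "b", "b", "c", "c"],
    "DC_POWER":   [10.0, 20.0, 5.0, 5.0, 30.0, 15.0],
})

# How many inverters are there for each plant?
inverters_per_plant = gen.groupby("PLANT_ID")["SOURCE_KEY"].nunique()

# Total DC power per inverter, highest first, with a dense rank (1 = highest).
dc_by_inverter = gen.groupby("SOURCE_KEY")["DC_POWER"].sum().sort_values(ascending=False)
dc_rank = dc_by_inverter.rank(ascending=False, method="dense").astype(int)

# Inverter with the highest total DC power.
top_inverter = dc_by_inverter.idxmax()
```

The same pattern should carry over to AC_POWER by swapping the column name.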


- Graphs that show the pattern of each attribute on its own, independently of the other variables. These will usually track how an attribute changes against DATETIME, DATE, or TIME. 

**Examples.** How does DC or AC power change over time? How does irradiation change over time? How do ambient and module temperature change over time? How does yield change over time? Explore plotting each variable against different granularities of DATETIME and determine which granularity works best for each variable.
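
A minimal sketch of what "different granularities" could look like in pandas, assuming the variable is available as a series indexed by timestamp. The toy series below is made up; each resulting series could then be plotted.

```python
import numpy as np
import pandas as pd

# Toy series: two days of 15-minute readings with a made-up constant value.
idx = pd.date_range("2020-05-15", periods=2 * 96, freq="15min")
power = pd.Series(np.ones(len(idx)), index=idx, name="DC_POWER")

# The same variable at three granularities of DATE_TIME.
per_interval = power                        # raw 15-minute readings
per_hour = power.resample("60min").mean()   # hourly average
per_day = power.resample("D").sum()         # daily total

# Average profile over the day, independent of the date.
by_time_of_day = power.groupby(power.index.time).mean()
```

Whether a mean or a sum is the appropriate aggregate depends on the variable: cumulative columns such as DAILY_YIELD would probably need a different treatment than instantaneous readings.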

### Variables

##### Power generation data
- DATE_TIME: Date and time for each observation. Observations recorded at 15 minute intervals.
 
- PLANT_ID: Plant ID - this will be common for the entire file.

- SOURCE_KEY: Source key in this file stands for the inverter id.

- DC_POWER: Amount of DC power **(direct current)** generated by the inverter (source_key) in this 15 minute interval. Units - kW.

- AC_POWER: Amount of AC power **(alternating current)** generated by the inverter (source_key) in this 15 minute interval. Units - kW.

- DAILY_YIELD: Daily yield is a cumulative sum of power generated on that day, till that point in time.

- TOTAL_YIELD: This is the total yield for the inverter till that point in time.

##### Weather sensor data
- DATE_TIME: Date and time for each observation. Observations recorded at 15 minute intervals.

- PLANT_ID: Plant ID - this will be common for the entire file.

- SOURCE_KEY: Stands for the sensor panel id. This will be common for the entire file because there's only one sensor panel for the plant.

- AMBIENT_TEMPERATURE: This is the ambient temperature at the plant.

- MODULE_TEMPERATURE: There's a module (solar panel) attached to the sensor panel. This is the temperature reading for that module.

- IRRADIATION: Amount of irradiation for the 15 minute interval.

###################################################################################################################

### Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import datetime
import re
import seaborn as sns

### Data

In [2]:
p1_gen = pd.read_csv('./Plant_1_Generation_Data.csv')
p1_wea = pd.read_csv('./Plant_1_Weather_Sensor_Data.csv')
p2_gen = pd.read_csv('./Plant_2_Generation_Data.csv')
p2_wea = pd.read_csv('./Plant_2_Weather_Sensor_Data.csv')

In [3]:
print("Shape of the table for plant 1 generation data: ", p1_gen.shape)
print("Shape of the table for plant 2 generation data: ",p2_gen.shape)
print("Name of the columns for the dataframes: \n",list(p1_gen.columns))
assert(np.all(p1_gen.columns == p2_gen.columns))  # just making sure they have the same columns

Shape of the table for plant 1 generation data:  (68778, 7)
Shape of the table for plant 2 generation data:  (67698, 7)
Name of the columns for the dataframes: 
 ['DATE_TIME', 'PLANT_ID', 'SOURCE_KEY', 'DC_POWER', 'AC_POWER', 'DAILY_YIELD', 'TOTAL_YIELD']


In [4]:
print("Shape of the table for weather sensor plant 1: ", p1_wea.shape)
print("Shape of the table for weather sensor plant 2: ", p2_wea.shape)
print("Column names for weather sensor tables: \n", list(p1_wea.columns))
assert(np.all(p1_wea.columns == p2_wea.columns)) # just making sure they have the same columns

Shape of the table for weather sensor plant 1:  (3182, 6)
Shape of the table for weather sensor plant 2:  (3259, 6)
Column names for weather sensor tables: 
 ['DATE_TIME', 'PLANT_ID', 'SOURCE_KEY', 'AMBIENT_TEMPERATURE', 'MODULE_TEMPERATURE', 'IRRADIATION']


In [5]:
p1_gen.head()

Unnamed: 0,DATE_TIME,PLANT_ID,SOURCE_KEY,DC_POWER,AC_POWER,DAILY_YIELD,TOTAL_YIELD
0,15-05-2020 00:00,4135001,1BY6WEcLGh8j5v7,0.0,0.0,0.0,6259559.0
1,15-05-2020 00:00,4135001,1IF53ai7Xc0U56Y,0.0,0.0,0.0,6183645.0
2,15-05-2020 00:00,4135001,3PZuoBAID5Wc2HD,0.0,0.0,0.0,6987759.0
3,15-05-2020 00:00,4135001,7JYdWkrLSPkdwr4,0.0,0.0,0.0,7602960.0
4,15-05-2020 00:00,4135001,McdE0feGgRqW7Ca,0.0,0.0,0.0,7158964.0


In [6]:
p2_gen.head()

Unnamed: 0,DATE_TIME,PLANT_ID,SOURCE_KEY,DC_POWER,AC_POWER,DAILY_YIELD,TOTAL_YIELD
0,2020-05-15 00:00:00,4136001,4UPUqMRk7TRMgml,0.0,0.0,9425.0,2429011.0
1,2020-05-15 00:00:00,4136001,81aHJ1q11NBPMrL,0.0,0.0,0.0,1215279000.0
2,2020-05-15 00:00:00,4136001,9kRcWv60rDACzjR,0.0,0.0,3075.333333,2247720000.0
3,2020-05-15 00:00:00,4136001,Et9kgGMDl729KT4,0.0,0.0,269.933333,1704250.0
4,2020-05-15 00:00:00,4136001,IQ2d7wF4YD8zU1Q,0.0,0.0,3177.0,19941530.0


In [7]:
p1_wea.head()

Unnamed: 0,DATE_TIME,PLANT_ID,SOURCE_KEY,AMBIENT_TEMPERATURE,MODULE_TEMPERATURE,IRRADIATION
0,2020-05-15 00:00:00,4135001,HmiyD2TTLFNqkNe,25.184316,22.857507,0.0
1,2020-05-15 00:15:00,4135001,HmiyD2TTLFNqkNe,25.084589,22.761668,0.0
2,2020-05-15 00:30:00,4135001,HmiyD2TTLFNqkNe,24.935753,22.592306,0.0
3,2020-05-15 00:45:00,4135001,HmiyD2TTLFNqkNe,24.84613,22.360852,0.0
4,2020-05-15 01:00:00,4135001,HmiyD2TTLFNqkNe,24.621525,22.165423,0.0


In [8]:
p2_wea.head()

Unnamed: 0,DATE_TIME,PLANT_ID,SOURCE_KEY,AMBIENT_TEMPERATURE,MODULE_TEMPERATURE,IRRADIATION
0,2020-05-15 00:00:00,4136001,iq8k7ZNt4Mwm3w0,27.004764,25.060789,0.0
1,2020-05-15 00:15:00,4136001,iq8k7ZNt4Mwm3w0,26.880811,24.421869,0.0
2,2020-05-15 00:30:00,4136001,iq8k7ZNt4Mwm3w0,26.682055,24.42729,0.0
3,2020-05-15 00:45:00,4136001,iq8k7ZNt4Mwm3w0,26.500589,24.420678,0.0
4,2020-05-15 01:00:00,4136001,iq8k7ZNt4Mwm3w0,26.596148,25.08821,0.0


### Missing Data

Since data is being collected every 15 minutes for 34 days we'd expect to have 3264 (34\*24\*4) entries for every source key. But this is not the case:

In [9]:
p1_gen.groupby("SOURCE_KEY").DATE_TIME.count()

SOURCE_KEY
1BY6WEcLGh8j5v7    3154
1IF53ai7Xc0U56Y    3119
3PZuoBAID5Wc2HD    3118
7JYdWkrLSPkdwr4    3133
McdE0feGgRqW7Ca    3124
VHMLBKoKgIrUVDU    3133
WRmjgnKYAwPKWDb    3118
YxYtjZvoooNbGkE    3104
ZnxXDlPa8U1GXgE    3130
ZoEaEvLYb1n2sOq    3123
adLQvlD726eNBSB    3119
bvBOhCH3iADSZry    3155
iCRJl6heRkivqQ3    3125
ih0vzX44oOqAx2f    3130
pkci93gMrogZuBj    3125
rGa61gmuvPhdLxV    3124
sjndEbLyjtCKgGv    3124
uHbuxQJl8lW7ozc    3125
wCURE6d3bPkepu2    3126
z9Y9gH1T5YWrNuG    3126
zBIq5rxdHJRwDNY    3119
zVJPv84UY57bAof    3124
Name: DATE_TIME, dtype: int64

Data is missing: we don't always have an entry every 15 minutes.

It will be easier to analyze the data properly, calculate aggregates and create plots if all inverters/sensors have the same number of entries. This may also help us explain inconsistencies between variables. Rows will therefore be added to the dataframes to account for the missing entries.
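
To make the size of the gaps explicit, one could compare the observed counts with the expected number of timestamps. The sketch below does this on a small made-up table; `full_range` and `toy` are illustrative names, not objects from the notebook.

```python
import pandas as pd

# Toy data: source "a" has all four slots of one hour, source "b" misses one.
full_range = pd.date_range("2020-05-15 00:00", periods=4, freq="15min")
toy = pd.DataFrame({
    "SOURCE_KEY": ["a"] * 4 + ["b"] * 3,
    "DATE_TIME": list(full_range) + list(full_range[[0, 1, 3]]),
})

expected = len(full_range)  # in the notebook this would be 34 * 24 * 4 = 3264
observed = toy.groupby("SOURCE_KEY")["DATE_TIME"].count()
missing = expected - observed

# Which timestamps are absent for source "b"?
seen_b = toy.loc[toy["SOURCE_KEY"] == "b", "DATE_TIME"]
missing_b = full_range.difference(pd.DatetimeIndex(seen_b))
```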

### Data Preparation

Since the inverters' identifiers are complex alphanumeric strings, we'll replace them with shorter identifiers that are easier to handle:

In [10]:
def simplify_source_keys(df, n):
    '''Replaces the complex source key identifiers with simpler, easier-to-handle identifiers.
    
    Args:
        df (pd.DataFrame): the dataframe to modify (changed in place)
        n (int): number of the plant we're referring to
    
    Returns:
        dict: dictionary to link old inverters' identifiers (keys) to new identifiers (values)
    '''
    repl_dict = {}
    # Sort the unique keys so that the numbering is reproducible
    for i, old_key in enumerate(sorted(set(df["SOURCE_KEY"]))):
        repl_dict[old_key] = "s" + str(i) + "_gen" + str(n)
    df["SOURCE_KEY"] = df["SOURCE_KEY"].replace(repl_dict)
    return repl_dict

In [11]:
inverters_ids_1 = simplify_source_keys(p1_gen,1) 
inverters_ids_2 = simplify_source_keys(p2_gen,2)

In [12]:
inverters_ids_1

{'1BY6WEcLGh8j5v7': 's0_gen1',
 '1IF53ai7Xc0U56Y': 's1_gen1',
 '3PZuoBAID5Wc2HD': 's2_gen1',
 '7JYdWkrLSPkdwr4': 's3_gen1',
 'McdE0feGgRqW7Ca': 's4_gen1',
 'VHMLBKoKgIrUVDU': 's5_gen1',
 'WRmjgnKYAwPKWDb': 's6_gen1',
 'YxYtjZvoooNbGkE': 's7_gen1',
 'ZnxXDlPa8U1GXgE': 's8_gen1',
 'ZoEaEvLYb1n2sOq': 's9_gen1',
 'adLQvlD726eNBSB': 's10_gen1',
 'bvBOhCH3iADSZry': 's11_gen1',
 'iCRJl6heRkivqQ3': 's12_gen1',
 'ih0vzX44oOqAx2f': 's13_gen1',
 'pkci93gMrogZuBj': 's14_gen1',
 'rGa61gmuvPhdLxV': 's15_gen1',
 'sjndEbLyjtCKgGv': 's16_gen1',
 'uHbuxQJl8lW7ozc': 's17_gen1',
 'wCURE6d3bPkepu2': 's18_gen1',
 'z9Y9gH1T5YWrNuG': 's19_gen1',
 'zBIq5rxdHJRwDNY': 's20_gen1',
 'zVJPv84UY57bAof': 's21_gen1'}
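
Since the function returns the old-to-new mapping, the original identifiers can be recovered later if needed. A small sketch, using a shortened copy of the dictionary above (`original_ids` is an illustrative name):

```python
# Shortened copy of the mapping shown above (old id -> new id).
inverters_ids = {
    "1BY6WEcLGh8j5v7": "s0_gen1",
    "1IF53ai7Xc0U56Y": "s1_gen1",
}

# Reverse lookup: new id -> original id.
original_ids = {new: old for old, new in inverters_ids.items()}
print(original_ids["s1_gen1"])
```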

The DATE_TIME columns of the two inverters' dataframes store the date and time of each entry in two different formats, so we'll convert them to a single format:

In [13]:
def convert_dates(date):
    '''Converts date string from format %d-%m-%Y %H:%M to date string format %Y-%m-%d %H:%M:%S.
    
    Args:
        date (string): the date string to convert
        
    Returns:
        string
    '''
    return date[6:10]+"-"+date[3:5]+"-"+date[:2]+date[10:]+":00"

In [14]:
p1_gen.DATE_TIME = p1_gen.DATE_TIME.apply(convert_dates)

p1_gen["DATE_TIME"] = p1_gen["DATE_TIME"].apply(lambda x: datetime.datetime.strptime(x,"%Y-%m-%d %H:%M:%S"))
p2_gen["DATE_TIME"] = p2_gen["DATE_TIME"].apply(lambda x: datetime.datetime.strptime(x,"%Y-%m-%d %H:%M:%S"))
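
As an alternative to slicing the strings by hand, `pd.to_datetime` with an explicit `format` can parse both layouts directly. This is only a sketch on sample strings, not the approach used in this notebook:

```python
import pandas as pd

# Sample strings in the two formats seen in the generation files.
plant1_style = pd.Series(["15-05-2020 00:00", "15-05-2020 00:15"])
plant2_style = pd.Series(["2020-05-15 00:00:00", "2020-05-15 00:15:00"])

# Parse each column with an explicit format; both end up as datetime64 values.
parsed1 = pd.to_datetime(plant1_style, format="%d-%m-%Y %H:%M")
parsed2 = pd.to_datetime(plant2_style, format="%Y-%m-%d %H:%M:%S")
```

Passing the format explicitly avoids any ambiguity between day-first and month-first dates.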

The missing data entries will now be added to the dataframes:

In [15]:
def add_missing_entries(df):
    '''Adds the rows missing from df, so that every source has an entry every 15 minutes.
    
    Args:
        df (pd.DataFrame): dataframe with incomplete data
    
    Returns:
        new_df (pd.DataFrame): dataframe with complete timestamps for every source
    '''
    
    # Full 15-minute grid from 2020-05-15 00:00 up to (but excluding) 2020-06-18 00:00
    init = datetime.datetime(year = 2020, month = 5, day = 15, hour = 0, minute = 0)
    finit = datetime.datetime(year = 2020, month = 6, day = 18, hour = 0, minute = 0)
    new_DATE_TIME = np.arange(init, finit, datetime.timedelta(minutes = 15)).astype(datetime.datetime)
    new_DATE_TIME = pd.DataFrame(new_DATE_TIME,columns=["DATE_TIME"])
    
    new_df = pd.DataFrame(columns=list(df.columns))
    
    for source in df.SOURCE_KEY.unique():
        # The right merge keeps every timestamp of the grid; rows without a match get NaN
        source_df = df.loc[df['SOURCE_KEY'] == source]
        source_df = pd.merge(source_df, new_DATE_TIME, on = 'DATE_TIME', how = 'right')
        source_df.loc[source_df['SOURCE_KEY'].isna(), 'SOURCE_KEY'] = source
        new_df = pd.concat([new_df,source_df],ignore_index=True)
    
    return new_df

In [16]:
p1_gen = add_missing_entries(p1_gen)
p2_gen = add_missing_entries(p2_gen)
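
The same completion could also be sketched without an explicit loop, by reindexing on the full (source, timestamp) grid. `complete_grid` is a hypothetical helper, not part of this notebook, and it assumes each (SOURCE_KEY, DATE_TIME) pair appears at most once:

```python
import pandas as pd

def complete_grid(df, start, end, freq="15min"):
    """Return df with one row per (SOURCE_KEY, DATE_TIME) pair on a regular grid.

    Hypothetical helper for illustration; `end` is inclusive.
    """
    grid = pd.date_range(start, end, freq=freq)
    full_index = pd.MultiIndex.from_product(
        [df["SOURCE_KEY"].unique(), grid], names=["SOURCE_KEY", "DATE_TIME"]
    )
    # Rows that did not exist before are created with NaN in the value columns.
    return df.set_index(["SOURCE_KEY", "DATE_TIME"]).reindex(full_index).reset_index()

toy = pd.DataFrame({
    "SOURCE_KEY": ["a", "a", "b"],
    "DATE_TIME": pd.to_datetime(
        ["2020-05-15 00:00", "2020-05-15 00:30", "2020-05-15 00:15"]
    ),
    "DC_POWER": [1.0, 2.0, 3.0],
})
filled = complete_grid(toy, "2020-05-15 00:00", "2020-05-15 00:30")
```

Here the toy table goes from 3 rows to 6 (2 sources times 3 timestamps), with NaN in DC_POWER for the added rows.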

It's useful to split the DATE_TIME column into a date column and a time column, since we're going to plot against different timeframes later on. The dataframes are also sorted by date and time:

In [17]:
p1_gen["DATE"] = p1_gen["DATE_TIME"].apply(lambda x: x.date())
p2_gen["DATE"] = p2_gen["DATE_TIME"].apply(lambda x: x.date())

p1_gen["TIME"] = p1_gen["DATE_TIME"].apply(lambda x: x.time())
p2_gen["TIME"] = p2_gen["DATE_TIME"].apply(lambda x: x.time())

p1_gen["DATE_TIME"] = p1_gen["DATE_TIME"].apply(lambda x: x.strftime("%Y-%m-%d %H:%M:%S"))
p2_gen["DATE_TIME"] = p2_gen["DATE_TIME"].apply(lambda x: x.strftime("%Y-%m-%d %H:%M:%S"))

p1_gen = p1_gen.sort_values(["DATE", "TIME"]).reset_index(drop=True)
p2_gen = p2_gen.sort_values(["DATE", "TIME"]).reset_index(drop=True)
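
When the column has a datetime dtype, the `.dt` accessor offers a vectorised way to do the same split. A short sketch on a made-up frame:

```python
import pandas as pd

toy = pd.DataFrame({
    "DATE_TIME": pd.to_datetime(["2020-05-15 00:15:00", "2020-05-16 13:30:00"])
})

# Vectorised equivalents of x.date() and x.time().
toy["DATE"] = toy["DATE_TIME"].dt.date
toy["TIME"] = toy["DATE_TIME"].dt.time
```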

Now we have NaN values in the dataframes corresponding to the previously missing entries:

In [18]:
p1_gen.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 71808 entries, 0 to 71807
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   DATE_TIME    71808 non-null  object 
 1   PLANT_ID     68778 non-null  float64
 2   SOURCE_KEY   71808 non-null  object 
 3   DC_POWER     68778 non-null  float64
 4   AC_POWER     68778 non-null  float64
 5   DAILY_YIELD  68778 non-null  float64
 6   TOTAL_YIELD  68778 non-null  float64
 7   DATE         71808 non-null  object 
 8   TIME         71808 non-null  object 
dtypes: float64(5), object(4)
memory usage: 4.9+ MB


In [19]:
p2_gen.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 71808 entries, 0 to 71807
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   DATE_TIME    71808 non-null  object 
 1   PLANT_ID     67698 non-null  float64
 2   SOURCE_KEY   71808 non-null  object 
 3   DC_POWER     67698 non-null  float64
 4   AC_POWER     67698 non-null  float64
 5   DAILY_YIELD  67698 non-null  float64
 6   TOTAL_YIELD  67698 non-null  float64
 7   DATE         71808 non-null  object 
 8   TIME         71808 non-null  object 
dtypes: float64(5), object(4)
memory usage: 4.9+ MB


We'll now fill in the appropriate values for all the columns:
- **PLANT_ID** will be filled with the corresponding unique id for the two dataframes
- A new column **MISS_VAL** is going to be created, denoting whether the data entry was previously missing. This might come in handy later on as well.
- **DC_POWER** and **AC_POWER** track how much power has been produced in the last 15 minutes (they're not cumulative). Missing data will be set to 0 instead of NaN, as we're going to assume that no power has been produced when data is missing. 

In [20]:
p1_gen.loc[p1_gen['PLANT_ID'].isna(), 'PLANT_ID'] = 4135001.0
p2_gen.loc[p2_gen['PLANT_ID'].isna(), 'PLANT_ID'] = 4136001.0

# Flag the rows that were added to fill the gaps (must run before DC_POWER is filled)
p1_gen["MISS_VAL"] = np.where(p1_gen["DC_POWER"].isna(),1,0)
p2_gen["MISS_VAL"] = np.where(p2_gen["DC_POWER"].isna(),1,0)

# Assume no power was produced in the intervals with missing data
p1_gen.loc[p1_gen['DC_POWER'].isna(), 'DC_POWER':'AC_POWER'] = 0
p2_gen.loc[p2_gen['DC_POWER'].isna(), 'DC_POWER':'AC_POWER'] = 0


Finally, **DAILY_YIELD** and **TOTAL_YIELD** are cumulative columns, so we have to treat them accordingly when dealing with missing data.
In particular, **DAILY_YIELD** is a daily cumulative measure that is reset around midnight, while **TOTAL_YIELD** is a cumulative measure of the total power produced, starting from before the period covered by this data.

In [21]:
# TODO: add how to handle missing data for DAILY_YIELD and TOTAL_YIELD
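
The cell above is still a placeholder. One possible way to handle the cumulative columns, shown here only as a sketch on made-up data and not as the notebook's chosen method, is to carry the last known reading forward. It assumes that a gap at the start of a day means nothing has been produced yet, which may not hold exactly since the reset only happens approximately at midnight.

```python
import numpy as np
import pandas as pd

# Toy frame for one inverter across midnight; NaN marks previously missing rows.
toy = pd.DataFrame({
    "SOURCE_KEY":  ["s0"] * 6,
    "DATE":        ["2020-05-15"] * 3 + ["2020-05-16"] * 3,
    "DAILY_YIELD": [100.0, np.nan, 150.0, np.nan, 0.0, np.nan],
    "TOTAL_YIELD": [1000.0, np.nan, 1050.0, np.nan, 1050.0, np.nan],
})

# DAILY_YIELD: carry the last reading forward within the same inverter and day;
# a gap at the start of a day is assumed to be 0 (nothing produced yet).
toy["DAILY_YIELD"] = (
    toy.groupby(["SOURCE_KEY", "DATE"])["DAILY_YIELD"].ffill().fillna(0)
)

# TOTAL_YIELD: carry forward within the inverter, then back-fill a leading gap.
toy["TOTAL_YIELD"] = toy.groupby("SOURCE_KEY")["TOTAL_YIELD"].ffill()
toy["TOTAL_YIELD"] = toy.groupby("SOURCE_KEY")["TOTAL_YIELD"].bfill()
```

Forward-filling keeps both columns non-decreasing within their reset period, which is the property a cumulative measure should have.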

Lastly, we're going to reorder the columns:

In [22]:
p1_gen = p1_gen.reindex(columns=['DATE_TIME', 'DATE', 'TIME', 'MISS_VAL', 'PLANT_ID', 'SOURCE_KEY',
                                 'DC_POWER', 'AC_POWER', 'DAILY_YIELD', 'TOTAL_YIELD'])
p2_gen = p2_gen.reindex(columns=['DATE_TIME', 'DATE', 'TIME', 'MISS_VAL', 'PLANT_ID', 'SOURCE_KEY',
                                 'DC_POWER', 'AC_POWER', 'DAILY_YIELD', 'TOTAL_YIELD'])

Recap:

In [23]:
p1_gen.head()

Unnamed: 0,DATE_TIME,DATE,TIME,MISS_VAL,PLANT_ID,SOURCE_KEY,DC_POWER,AC_POWER,DAILY_YIELD,TOTAL_YIELD
0,2020-05-15 00:00:00,2020-05-15,00:00:00,0,4135001.0,s0_gen1,0.0,0.0,0.0,6259559.0
1,2020-05-15 00:00:00,2020-05-15,00:00:00,0,4135001.0,s1_gen1,0.0,0.0,0.0,6183645.0
2,2020-05-15 00:00:00,2020-05-15,00:00:00,0,4135001.0,s2_gen1,0.0,0.0,0.0,6987759.0
3,2020-05-15 00:00:00,2020-05-15,00:00:00,0,4135001.0,s3_gen1,0.0,0.0,0.0,7602960.0
4,2020-05-15 00:00:00,2020-05-15,00:00:00,0,4135001.0,s4_gen1,0.0,0.0,0.0,7158964.0


####################

Data entries are also missing from the sensors' dataframes, so we're going to add them as well:

In [24]:
init = datetime.datetime(year = 2020, month = 5, day = 15, hour = 0, minute = 0, second = 0)
finit = datetime.datetime(year = 2020, month = 6, day = 18, hour = 0, minute = 0, second = 0)
new_DATE_TIME = np.arange(init, finit, datetime.timedelta(minutes = 15)).astype(datetime.datetime)
new_DATE_TIME = pd.DataFrame(new_DATE_TIME,columns = ['DATE_TIME'])
p1_wea["DATE_TIME"] = p1_wea["DATE_TIME"].apply(lambda x: datetime.datetime.strptime(x,"%Y-%m-%d %H:%M:%S"))
p2_wea["DATE_TIME"] = p2_wea["DATE_TIME"].apply(lambda x: datetime.datetime.strptime(x,"%Y-%m-%d %H:%M:%S"))
p1_wea = pd.merge(p1_wea, new_DATE_TIME, on = 'DATE_TIME', how = 'right')
p2_wea = pd.merge(p2_wea, new_DATE_TIME, on = 'DATE_TIME', how = 'right')

As we did for the inverters' dataframes, we're going to split **DATE_TIME** into **DATE** and **TIME** and sort by them:

In [25]:
p1_wea["DATE"] = p1_wea["DATE_TIME"].apply(lambda x: x.date())
p2_wea["DATE"] = p2_wea["DATE_TIME"].apply(lambda x: x.date())

p1_wea["TIME"] = p1_wea["DATE_TIME"].apply(lambda x: x.time())
p2_wea["TIME"] = p2_wea["DATE_TIME"].apply(lambda x: x.time())

p1_wea["DATE_TIME"] = p1_wea["DATE_TIME"].apply(lambda x: x.strftime("%Y-%m-%d %H:%M:%S"))
p2_wea["DATE_TIME"] = p2_wea["DATE_TIME"].apply(lambda x: x.strftime("%Y-%m-%d %H:%M:%S"))

p1_wea = p1_wea.sort_values(["DATE", "TIME"]).reset_index(drop=True)
p2_wea = p2_wea.sort_values(["DATE", "TIME"]).reset_index(drop=True)

In [26]:
p1_wea.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3264 entries, 0 to 3263
Data columns (total 8 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   DATE_TIME            3264 non-null   object 
 1   PLANT_ID             3182 non-null   float64
 2   SOURCE_KEY           3182 non-null   object 
 3   AMBIENT_TEMPERATURE  3182 non-null   float64
 4   MODULE_TEMPERATURE   3182 non-null   float64
 5   IRRADIATION          3182 non-null   float64
 6   DATE                 3264 non-null   object 
 7   TIME                 3264 non-null   object 
dtypes: float64(4), object(4)
memory usage: 204.1+ KB


In [27]:
p2_wea.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3264 entries, 0 to 3263
Data columns (total 8 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   DATE_TIME            3264 non-null   object 
 1   PLANT_ID             3259 non-null   float64
 2   SOURCE_KEY           3259 non-null   object 
 3   AMBIENT_TEMPERATURE  3259 non-null   float64
 4   MODULE_TEMPERATURE   3259 non-null   float64
 5   IRRADIATION          3259 non-null   float64
 6   DATE                 3264 non-null   object 
 7   TIME                 3264 non-null   object 
dtypes: float64(4), object(4)
memory usage: 204.1+ KB


We now add the missing basic information for the new data entries, plus a new column denoting whether the entry was previously missing:

In [28]:
p1_wea.loc[p1_wea['PLANT_ID'].isna(), 'PLANT_ID'] = 4135001.0
p2_wea.loc[p2_wea['PLANT_ID'].isna(), 'PLANT_ID'] = 4136001.0

# Flag the rows that were added to fill the gaps
p1_wea["MISS_VAL"] = np.where(p1_wea["AMBIENT_TEMPERATURE"].isna(),1,0)
p2_wea["MISS_VAL"] = np.where(p2_wea["AMBIENT_TEMPERATURE"].isna(),1,0)

# Each plant has a single sensor panel, so its id can be filled in directly
p1_wea.loc[p1_wea['AMBIENT_TEMPERATURE'].isna(), 'SOURCE_KEY'] = 'HmiyD2TTLFNqkNe'
p2_wea.loc[p2_wea['AMBIENT_TEMPERATURE'].isna(), 'SOURCE_KEY'] = 'iq8k7ZNt4Mwm3w0'

In [29]:
# TODO: decide how to handle missing AMBIENT_TEMPERATURE, MODULE_TEMPERATURE, IRRADIATION
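
The cell above is still a placeholder. One candidate approach, given here as a sketch on made-up readings and not as a decision taken in this notebook, is linear interpolation between neighbouring readings. It assumes the gaps are short enough for the values to vary smoothly across them.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2020-05-15 00:00", periods=5, freq="15min")
toy = pd.DataFrame({
    "AMBIENT_TEMPERATURE": [25.0, np.nan, 24.0, np.nan, np.nan],
    "IRRADIATION":         [0.0, np.nan, 0.0, 0.0, np.nan],
}, index=idx)

# Linear interpolation between neighbours; limit_direction="both" also fills
# gaps at the edges by repeating the nearest valid reading.
filled = toy.interpolate(method="linear", limit_direction="both")
```

For long gaps, interpolation may hide real variation, so keeping the MISS_VAL flag alongside the filled values would still be useful.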

Lastly, we're going to reorder the columns:

In [30]:
p1_wea = p1_wea.reindex(columns=['DATE_TIME', 'DATE', 'TIME', 'MISS_VAL', 'PLANT_ID', 'SOURCE_KEY',
                                 'AMBIENT_TEMPERATURE', 'MODULE_TEMPERATURE', 'IRRADIATION'])
p2_wea = p2_wea.reindex(columns=['DATE_TIME', 'DATE', 'TIME', 'MISS_VAL', 'PLANT_ID', 'SOURCE_KEY', 
                                 'AMBIENT_TEMPERATURE', 'MODULE_TEMPERATURE', 'IRRADIATION'])

Recap:

In [31]:
p1_wea.head()

Unnamed: 0,DATE_TIME,DATE,TIME,MISS_VAL,PLANT_ID,SOURCE_KEY,AMBIENT_TEMPERATURE,MODULE_TEMPERATURE,IRRADIATION
0,2020-05-15 00:00:00,2020-05-15,00:00:00,0,4135001.0,HmiyD2TTLFNqkNe,25.184316,22.857507,0.0
1,2020-05-15 00:15:00,2020-05-15,00:15:00,0,4135001.0,HmiyD2TTLFNqkNe,25.084589,22.761668,0.0
2,2020-05-15 00:30:00,2020-05-15,00:30:00,0,4135001.0,HmiyD2TTLFNqkNe,24.935753,22.592306,0.0
3,2020-05-15 00:45:00,2020-05-15,00:45:00,0,4135001.0,HmiyD2TTLFNqkNe,24.84613,22.360852,0.0
4,2020-05-15 01:00:00,2020-05-15,01:00:00,0,4135001.0,HmiyD2TTLFNqkNe,24.621525,22.165423,0.0


To save the new data (if needed):

In [32]:
# p1_gen.to_csv('./NEW_Plant_1_Generation_Data.csv', header = True, index = False)
# p1_wea.to_csv('./NEW_Plant_1_Weather_Sensor_Data.csv', header = True, index = False)
# p2_gen.to_csv('./NEW_Plant_2_Generation_Data.csv', header = True, index = False)
# p2_wea.to_csv('./NEW_Plant_2_Weather_Sensor_Data.csv', header = True, index = False)