# Explorative Data Analysis <a id='Explorative_Data_Analysis'></a>

### 1 Table of Contents<a id='Contents'></a>
* [Explorative Data Analysis](#Explorative_Data_Analysis)
  * [1 Contents](#Contents)
  * [2 Introduction](#2_Introduction)
      * [2.1 Recap](#2.1_Recap)
      * [2.2 Next Steps](#2.2_Next_Steps)
  * [3 Imports](#3_Imports)
  * [4 Load Data](#4_Load_Data)
  * [5 Looking at the Data](#5_Looking_at_the_Data)
      * [5.1 Scaling the Data](#5.1_Scaling_the_Data)
      * [5.2 PCA](#5.2_PCA)


### 2 Introduction<a id='2_Introduction'></a>

#### 2.1 Recap<a id='2.1_Recap'></a>

m

#### 2.2 Next Steps<a id='2.2_Next_Steps'></a>

m

### 3 Imports<a id='3_Imports'></a>

In [1]:
#import warnings
#warnings.simplefilter('ignore')
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
import pandas_profiling
from library.sb_utils import save_file
import os
import csv

### 4 Load Data<a id='4_Load_Data'></a>

In [2]:
YMCA = pd.read_csv('../data/YMCA.csv')
ForestRoad = pd.read_csv('../data/ForestRoad.csv')
MapDriveEast = pd.read_csv('../data/MapleDriveEast.csv')
UK_data = pd.read_csv('../data/UK_data.csv')

`UK_data` combines the data from all three sites: YMCA, Forest Road, and Maple Drive East.
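The combined file is loaded ready-made here. As a rough sketch of how a table like this could be assembled from per-site frames, the snippet below stacks them and tags each row with its site. This is an assumption for illustration, not necessarily how `UK_data.csv` was built; the helper name `combine_sites` and the toy frames are hypothetical.

```python
import pandas as pd

def combine_sites(frames_by_site):
    """Stack per-site frames into one table with a 'Site' label column."""
    labelled = [
        df.assign(Site=site)  # tag every row with its site name
        for site, df in frames_by_site.items()
    ]
    # ignore_index=True gives the combined table a fresh 0..n-1 index
    return pd.concat(labelled, ignore_index=True)

# Toy stand-ins for the real per-site data (values are illustrative only)
toy = {
    "YMCA": pd.DataFrame({"TempOut": [14.1, 14.4], "P_GEN": [0.014, 0.216]}),
    "Forest Road": pd.DataFrame({"TempOut": [15.0], "P_GEN": [0.5]}),
}
combined = combine_sites(toy)
print(combined["Site"].value_counts())
```

Keeping the site as a column rather than a separate frame per site is what makes the later `groupby(by='Site')` calls possible.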

### 5 Looking at the Data<a id='5_Looking_at_the_Data'></a>

In [3]:
UK_data.head()

,Site,TempOut,HiTemp,LowTemp,OutHum,DewPt,WindSpeed,WindRun,HiSpeed,WindChill,HeatIndex,THWIndex,Bar,Rain,RainRate,SolarRad,SolarEnergy,HiSolarRad,P_GEN
0,YMCA,14.1,14.2,14.1,91.0,12.6,0,0.0,0,14.1,14.1,14.1,755.5,0.0,0.0,11.0,0.47,18.0,0.014
1,YMCA,14.1,14.1,14.1,91.0,12.7,0,0.0,2,14.1,14.2,14.2,755.5,0.0,0.0,23.0,0.99,26.0,0.067
2,YMCA,14.4,14.4,14.1,91.0,12.9,0,0.0,2,14.4,14.4,14.4,755.7,0.0,0.0,34.0,1.46,81.0,0.216
3,YMCA,15.1,15.1,14.4,86.0,12.7,0,0.0,2,15.1,15.1,15.1,755.8,0.0,0.0,57.0,2.45,137.0,0.256
4,YMCA,16.0,16.0,15.1,77.0,12.0,0,0.0,1,16.0,15.9,15.9,756.2,0.0,0.0,56.0,2.41,192.0,0.284


In [4]:
UK_data.describe().T

,count,mean,std,min,25%,50%,75%,max
TempOut,11846.0,17.354001,4.01585,1.1,14.6,17.2,19.9,32.8
HiTemp,11846.0,17.582213,4.093306,1.1,14.7,17.4,20.3,32.9
LowTemp,11846.0,17.022548,3.984318,0.6,14.3,16.9,19.6,32.2
OutHum,11846.0,74.827706,14.671047,29.0,64.0,77.0,87.0,98.0
DewPt,11846.0,12.487675,2.848155,0.2,10.5,12.5,14.6,20.2
WindSpeed,11846.0,1.292504,1.507305,0.0,0.0,1.0,2.0,9.0
WindRun,11846.0,0.646252,0.753653,0.0,0.0,0.5,1.0,4.5
HiSpeed,11846.0,5.532669,4.181073,0.0,2.0,5.0,8.0,25.0
WindChill,11846.0,17.342875,4.031489,1.1,14.6,17.2,19.9,32.8
HeatIndex,11846.0,17.359666,4.078058,1.0,14.5,17.2,19.9,34.9


#### 5.1 Scaling the Data<a id='5.1_Scaling_the_Data'></a>

We should take a moment to look at the site-specific power generation data.

In [5]:
UK_data.groupby(by='Site')['P_GEN'].describe()

Site,count,mean,std,min,25%,50%,75%,max
Forest Road,5526.0,0.483572,0.714115,0.001,0.002,0.088,0.733,3.008
Maple Drive East,2579.0,1.00077,0.912627,0.0,0.234,0.717,1.574,3.884
YMCA,3741.0,0.101918,0.100662,0.0,0.021,0.063,0.158,0.444


YMCA is generating much less power than the other two sites. Is this because of the weather conditions? No: it is explained by the smaller size of the installation at the YMCA site. In fact, all three installations are of different sizes. The G83 register, part of UK Power Networks, reports the sizes of solar installations. Forest Road, Maple Drive East, and YMCA are registered at 3.29 kW, 3.83 kW, and 0.6 kW respectively.

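Feeding the data to our model before dealing with the discrepancy in installation sizes would make the model's results meaningless, because installation size rather than weather would drive most of the differences in `P_GEN`. The choice I am making here is to scale the Forest Road and Maple Drive East data down to the size of the YMCA installation, multiplying their `P_GEN` values by 0.6/3.29 and 0.6/3.83 respectively.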

In [6]:
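# Scale factors: YMCA capacity (0.6 kW) over each site's registered capacity
ForestScaleFactor = 0.6/3.29
MapleScaleFactor = 0.6/3.83
# Note: plain assignment does not copy, so UK_scaled and UK_data refer to the
# same DataFrame and the scaling below changes UK_data as well.
UK_scaled = UK_data
UK_scaled.loc[UK_scaled['Site'] == 'Forest Road', 'P_GEN'] *= ForestScaleFactor
UK_scaled.loc[UK_scaled['Site'] == 'Maple Drive East', 'P_GEN'] *= MapleScaleFactor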
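
For comparison, here is a minimal sketch of the same rescaling written as a function that works on a copy and looks up each row's capacity from a dictionary. It is an alternative under stated assumptions, not the approach used in this notebook: the names `CAPACITY_KW`, `REFERENCE_KW`, and `scale_to_reference` are hypothetical, the capacities are the registered sizes quoted above, and the toy rows reuse the per-site maxima from the `describe()` output.

```python
import pandas as pd

# Registered capacities in kW, as quoted in the text
CAPACITY_KW = {"Forest Road": 3.29, "Maple Drive East": 3.83, "YMCA": 0.6}
REFERENCE_KW = CAPACITY_KW["YMCA"]

def scale_to_reference(df, capacities=CAPACITY_KW, reference=REFERENCE_KW):
    """Return a copy of df with P_GEN rescaled to the reference capacity."""
    out = df.copy()                                   # leave the input untouched
    factor = reference / out["Site"].map(capacities)  # per-row scale factor
    out["P_GEN"] = out["P_GEN"] * factor
    return out

# Toy rows: the unscaled per-site maxima of P_GEN
toy = pd.DataFrame({
    "Site": ["Forest Road", "Maple Drive East", "YMCA"],
    "P_GEN": [3.008, 3.884, 0.444],
})
scaled = scale_to_reference(toy)
print(scaled)
```

A site name missing from the dictionary would map to `NaN` and produce a `NaN` in `P_GEN`, which makes a mislabelled site easy to spot.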

In [7]:
UK_data.groupby(by='Site')['P_GEN'].describe()

Site,count,mean,std,min,25%,50%,75%,max
Forest Road,5526.0,0.088189,0.130234,0.000182,0.000365,0.016049,0.133678,0.548571
Maple Drive East,2579.0,0.156779,0.14297,0.0,0.036658,0.112324,0.24658,0.60846
YMCA,3741.0,0.101918,0.100662,0.0,0.021,0.063,0.158,0.444


Much closer, but there is still some variability. This is to be expected: the sites are not in the same city, so weather conditions will differ between them.

#### 5.2 PCA<a id='5.2_PCA'></a>

Let's look at the correlation between the features.

In [None]:
plt.subplots(figsize = (10, 8))
# Correlate numeric columns only; 'Site' holds text labels
sns.heatmap(UK_data.select_dtypes('number').corr());

Solar radiation and solar energy are the features most strongly correlated with power generated, as expected.
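
A heatmap gives the overall picture, but reading exact rankings off the colours is hard. One way to check the claim numerically is to pull out the `P_GEN` column of the correlation matrix and sort it. The sketch below does this on synthetic data generated purely for illustration (not the real measurements); the helper name `rank_correlations` is hypothetical.

```python
import numpy as np
import pandas as pd

def rank_correlations(df, target="P_GEN"):
    """Pearson correlation of each numeric column with target, strongest first."""
    numeric = df.select_dtypes(include="number")  # drop text columns such as Site
    corr = numeric.corr()[target].drop(target)
    # Order by absolute value so strong negative correlations are not buried
    return corr.reindex(corr.abs().sort_values(ascending=False).index)

# Synthetic data: P_GEN is built to depend on SolarRad but not on OutHum
rng = np.random.default_rng(0)
solar = rng.uniform(0, 800, size=200)
toy = pd.DataFrame({
    "Site": ["YMCA"] * 200,
    "SolarRad": solar,
    "OutHum": rng.uniform(30, 98, size=200),
    "P_GEN": 0.0005 * solar + rng.normal(0, 0.02, size=200),
})
ranking = rank_correlations(toy)
print(ranking)
```

On the real data the call would be `rank_correlations(UK_data)`, assuming `UK_data` is loaded as above.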

In [None]:
def scatterplots(columns, ncol=None, figsize=(15, 8)):
    """Scatter P_GEN against each column in `columns`, on a grid `ncol` wide."""
    if ncol is None:
        ncol = len(columns)
    nrow = int(np.ceil(len(columns) / ncol))
    fig, axes = plt.subplots(nrow, ncol, figsize=figsize, squeeze=False)
    fig.subplots_adjust(wspace=0.5, hspace=0.6)
    for i, col in enumerate(columns):
        ax = axes.flatten()[i]
        ax.scatter(x = col, y = 'P_GEN', data=UK_data, alpha=0.5)
        ax.set(xlabel=col, ylabel='Power Generated')
    # Hide any unused axes left over in the last row of the grid
    nsubplots = nrow * ncol
    for empty in range(i+1, nsubplots):
        axes.flatten()[empty].set_visible(False)