# Exploratory Data Analysis

This notebook covers the exploratory part of the project. We look for features that correlate with the target, so that the machine learning algorithm can learn from the most informative data.

In [10]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime

%matplotlib inline

pd.set_option('display.max_rows', 200)
pd.set_option('display.max_columns', 200)

![Data Categorization](./pictures/data_categorization.png)


The picture above illustrates how the data is categorized.

In [16]:
train_a = pd.read_parquet('A/train_targets.parquet')
print("\n These are the column categories of TRAIN_A: ", train_a.columns.values)

X_train_estimated_a = pd.read_parquet('A/X_train_estimated.parquet')
print("\n These are the column categories of X_TRAIN_ESTIMATED_A: ", X_train_estimated_a.columns.values[0:7])
print("The total length is: ", len(X_train_estimated_a.columns.values))

X_train_observed_a = pd.read_parquet('A/X_train_observed.parquet')
print("\n These are the column categories of X_TRAIN_OBSERVED_A: ", X_train_observed_a.columns.values[0:7])
print("The total length is: ", len(X_train_observed_a.columns.values))

X_test_estimated_a = pd.read_parquet('A/X_test_estimated.parquet')
print("\n These are the column categories of X_TEST_ESTIMATED_A: ", X_test_estimated_a.columns.values[0:7])
print("The total length is: ", len(X_test_estimated_a.columns.values))


# Look for columns that differ between the sets.
# Note: `set(a) - set(b)` is one-directional; it only lists columns
# that are in `a` but missing from `b`.

dif = list(set(X_train_observed_a.columns.values) - set(X_train_estimated_a.columns.values))
print("\n The differences between X_TRAIN_OBSERVED_A and X_TRAIN_ESTIMATED_A:", dif)

dif = list(set(X_test_estimated_a.columns.values) - set(X_train_observed_a.columns.values))
print("\n The differences between X_TRAIN_OBSERVED_A and X_TEST_ESTIMATED_A:", dif)

dif = list(set(X_train_estimated_a.columns.values) - set(X_test_estimated_a.columns.values))
print("\n The differences between X_TEST_ESTIMATED_A and X_TRAIN_ESTIMATED_A:", dif)


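# Detect gaps in the target series, assuming one measurement per hour.
dates = train_a["time"].sort_values().reset_index(drop=True)
expected = dates.iloc[0]  # timestamp we expect to see next
missing_dates = []

count = 0
while count < len(dates):
    if dates.iloc[count] > expected:
        # The expected hour is absent from the data.
        missing_dates.append(expected)
        expected += datetime.timedelta(hours=1)
    else:
        # Timestamp is present, or is a duplicate / off-grid value: move on.
        if dates.iloc[count] == expected:
            expected += datetime.timedelta(hours=1)
        count += 1

for date in missing_dates:
    print(date)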

# TODO: add a missing-feature spotter that catches gaps in the feature data.

# TODO: create a set of boxplots that show different aspects of the data clearly.



 These are the column categories of TRAIN_A:  ['time' 'pv_measurement']

 These are the column categories of X_TRAIN_ESTIMATED_A:  ['date_calc' 'date_forecast' 'absolute_humidity_2m:gm3'
 'air_density_2m:kgm3' 'ceiling_height_agl:m' 'clear_sky_energy_1h:J'
 'clear_sky_rad:W']
The total length is:  47

 These are the column categories of X_TRAIN_OBSERVED_A:  ['date_forecast' 'absolute_humidity_2m:gm3' 'air_density_2m:kgm3'
 'ceiling_height_agl:m' 'clear_sky_energy_1h:J' 'clear_sky_rad:W'
 'cloud_base_agl:m']
The total length is:  46

 These are the column categories of X_TEST_ESTIMATED_A:  ['date_calc' 'date_forecast' 'absolute_humidity_2m:gm3'
 'air_density_2m:kgm3' 'ceiling_height_agl:m' 'clear_sky_energy_1h:J'
 'clear_sky_rad:W']
The total length is:  47

 The differences between X_TRAIN_OBSERVED_A and X_TRAIN_ESTIMATED_A: []

 The differences between X_TRAIN_OBSERVED_A and X_TEST_ESTIMATED_A: ['date_calc']

 The differences between X_TEST_ESTIMATED_A and X_TRAIN_ESTIMATED_A: []
202
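
The gap check in the cell above walks the timestamps one hour at a time. A minimal vectorised sketch of the same idea, together with a simple spotter for missing feature values, could look as follows. The function names, the one-hour step, and the small demo frame are assumptions made for illustration; they are not part of the project code.

```python
import pandas as pd


def find_missing_timestamps(times: pd.Series,
                            step: pd.Timedelta = pd.Timedelta(hours=1)) -> pd.DatetimeIndex:
    """Return the timestamps absent from `times` on a regular grid.

    Assumes the series is meant to contain one value per `step`
    between its first and last timestamp.
    """
    observed = pd.DatetimeIndex(times.dropna().unique())
    expected = pd.date_range(observed.min(), observed.max(), freq=step)
    return expected.difference(observed)


def missing_value_summary(frame: pd.DataFrame) -> pd.Series:
    """Return the share of missing values per column, largest first.

    Columns without missing values are left out.
    """
    share = frame.isna().mean()
    return share[share > 0].sort_values(ascending=False)


# Hypothetical demo data: two hours are missing and one measurement is NaN.
demo = pd.DataFrame({
    "time": pd.to_datetime(["2020-01-01 00:00", "2020-01-01 01:00", "2020-01-01 04:00"]),
    "pv_measurement": [0.0, None, 3.5],
})

missing = find_missing_timestamps(demo["time"])
nan_share = missing_value_summary(demo)

print(list(missing))
print(nan_share)
```

Applied to the real data, this would be called as `find_missing_timestamps(train_a["time"])` and `missing_value_summary(X_train_observed_a)`, assuming the `time` column holds hourly timestamps.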

## Discovery
The feature sets differ: `X_train_observed` has one column fewer (46) than `X_train_estimated` and `X_test_estimated` (47 each). The missing column is `date_calc`. Why it is absent from the observed data remains unclear.
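
The empty list printed for the first comparison follows from how set difference works: `set(a) - set(b)` only reports names in `a` that are missing from `b`, so a column that exists only in `b` goes unnoticed. A small helper that reports both directions makes such a mismatch visible. This is a sketch; `compare_columns` and the shortened column lists are hypothetical and only serve as an illustration.

```python
from typing import Dict, Iterable, List


def compare_columns(left: Iterable[str], right: Iterable[str]) -> Dict[str, List[str]]:
    """Report column names that appear on only one side."""
    left_set, right_set = set(left), set(right)
    return {
        "only_left": sorted(left_set - right_set),
        "only_right": sorted(right_set - left_set),
    }


# Shortened, illustrative column lists (not the full 46/47 columns).
observed_cols = ["date_forecast", "absolute_humidity_2m:gm3"]
estimated_cols = ["date_calc", "date_forecast", "absolute_humidity_2m:gm3"]

result = compare_columns(observed_cols, estimated_cols)
print(result)  # {'only_left': [], 'only_right': ['date_calc']}
```

With the real frames, the call would be `compare_columns(X_train_observed_a.columns, X_train_estimated_a.columns)`.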

![Model Testing Procedure](./pictures/model_testing_procedure.png)

The picture above describes the procedure that will be used to test the model.