<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Purpose-of-this-Notebook" data-toc-modified-id="Purpose-of-this-Notebook-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Purpose of this Notebook</a></span></li><li><span><a href="#Reading-the-txt-files-and-creating-the-dataframe" data-toc-modified-id="Reading-the-txt-files-and-creating-the-dataframe-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Reading the txt files and creating the dataframe</a></span><ul class="toc-item"><li><span><a href="#Importing-the-libraries" data-toc-modified-id="Importing-the-libraries-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Importing the libraries</a></span></li><li><span><a href="#Reading-the-data-from-the-txt-file" data-toc-modified-id="Reading-the-data-from-the-txt-file-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Reading the data from the txt file</a></span></li><li><span><a href="#Creating-the-dataframe" data-toc-modified-id="Creating-the-dataframe-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Creating the dataframe</a></span></li></ul></li><li><span><a href="#Pre-Processing-the-dataframe" data-toc-modified-id="Pre-Processing-the-dataframe-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Pre-Processing the dataframe</a></span><ul class="toc-item"><li><span><a href="#Converting-hour-values" data-toc-modified-id="Converting-hour-values-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Converting hour values</a></span></li><li><span><a href="#Calculate-hours-of-daylight" data-toc-modified-id="Calculate-hours-of-daylight-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Calculate hours of daylight</a></span></li><li><span><a href="#Changing-to-datetime-and-filtering" data-toc-modified-id="Changing-to-datetime-and-filtering-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Changing to datetime and filtering</a></span></li></ul></li><li><span><a href="#Conclusion" data-toc-modified-id="Conclusion-4"><span 
class="toc-item-num">4&nbsp;&nbsp;</span>Conclusion</a></span></li></ul></div>

# ORNL Research House Project (Pre-Processing Sunset & Sunrise Data)

This project looks at the dataset from the Oak Ridge National Laboratory (ORNL) research house, a house built solely for research in East Tennessee, USA. The house is equipped with various sensors that record energy usage and weather data every 15 minutes. For this project I will be using the dataset found in this [link](). It comprises records from the house between 2013 and 2014. The aim of the project is to uncover patterns in the energy usage of the house and, where possible, relationships between weather data and energy usage. 


![ORNL](img/ORNL_house.jpg)


The project will be divided into four distinct stages:

1. **Data Wrangling / Cleaning :** This is where we import the dataset and carry out some initial manipulation and cleaning before we proceed to analysis.


2. **Data Manipulation / Feature Extraction :** At this point we have a fairly clean dataset that is ready to be analysed. Before that, we will split the data into different parts to organise the study better, and derive new features from the existing ones for use in the next section. 


3. **Exploratory Data Analysis :** Using the cleaned data, we will use visualisation techniques in Python to get a better understanding of the data, establish relationships and identify potential modelling applications.


4. **Modeling :** This section will use the results of the second step to apply modeling techniques, with a view to producing useful predictive analytics.



This is an Appendix notebook that goes through the method of pre-processing the sunset dataset.

## Purpose of this Notebook
As mentioned above, this notebook takes the reader through the pre-processing of the sunset & sunrise dataset, which will be used as metadata for this project. The dataset was taken from this [website](). 

The following steps were carried out before working on this notebook:

1. Data was extracted from the website and copied to a text file
2. All titles were removed to make the job here easier
3. To make manipulation easier, blank values in the original dataset (which occur because some months have fewer than 31 days) were replaced with 1. These placeholders are removed later in Python (see below)
4. A text file was saved for each year (2013 & 2014) to be processed in this notebook.

## Reading the txt files and creating the dataframe

### Importing the libraries

In [1]:
import pandas as pd
import numpy as np
from datetime import date, timedelta
from collections import defaultdict

### Reading the data from the txt file

The form of the text file looks like this :

![textfile](img/txt_file.jpg)

Each row corresponds to a day of the month and each pair of columns to a month. Although inefficient, we go through the file 12 times, once for each month, and collect the data from the corresponding columns. For each line we need to extract the first (sunrise) and second (sunset) value of the month's column pair and store them in two separate lists. The function is shown below. 

In [2]:
# Define function on extracting data
#------------------------------------
def read_sunset_file(file):

    """
    Takes in a filename of the format above
    goes through each line once for each month
    collects the sunset and sunrise time and saves
    it in a dictionary
    
    """
    
    # Define the dictionaries
    sunsets = defaultdict(list)
    sunrises = defaultdict(list)
    
    # Month list to loop through
    months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

    for i, mon in enumerate(months):
        
        # Counter that allows us to find which column to address per pass (i.e. start at 0, 2, 4 etc.)
        count = i * 2
        
        # Open the file that was passed in and go through each line
        with open(file) as f: 

            for line in f: 
                
                # Split the line once and reuse the fields
                fields = line.split()
                
                # Append ignoring 1s, this is to get rid of the data 
                # we added manually for ease of parsing
                if fields[count] != "1":
                    
                    # Add each time to the corresponding list / dictionary
                    sunrises[mon].append(fields[count])
                    sunsets[mon].append(fields[count+1])
                    
        
    return sunsets, sunrises

In [3]:
# Run the function for both 2013 and 2014
sunset_13, sunrise_13 = read_sunset_file("data/sunset_2013.txt")
sunset_14, sunrise_14 = read_sunset_file("data/sunset_2014.txt")
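
The twelve passes over the file are not strictly necessary. Below is a minimal single-pass sketch, assuming the same layout as above: one row per day of the month, twelve sunrise/sunset column pairs, and a pair of `1` placeholders for days that do not exist. The name `parse_sun_lines` and the sample rows are hypothetical and only serve as illustration.

```python
from collections import defaultdict

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

def parse_sun_lines(lines):
    """Single-pass variant: split each line once and fan the values out per month."""
    sunsets = defaultdict(list)
    sunrises = defaultdict(list)
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        for i, mon in enumerate(MONTHS):
            rise, sset = fields[2 * i], fields[2 * i + 1]
            # "1" is the manual placeholder for days that do not exist
            if rise != "1":
                sunrises[mon].append(rise)
                sunsets[mon].append(sset)
    return sunsets, sunrises

# Two made-up rows: the second one has placeholders for February
sample = [
    " ".join(["0708", "1659"] * 12),
    " ".join(["0709", "1700", "1", "1"] + ["0650", "1730"] * 10),
]
demo_sunsets, demo_sunrises = parse_sun_lines(sample)
print(len(demo_sunrises["Jan"]), len(demo_sunrises["Feb"]))
```

Because the function accepts any iterable of lines, an open file handle can be passed in directly, e.g. `with open(path) as f: parse_sun_lines(f)`.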

### Creating the dataframe
The following function takes in a range of dates and their corresponding sunset and sunrise times and creates a dataframe of the form Date | Sunrise | Sunset

In [4]:
def create_sun_df(d1,d2, sunsets, sunrises):
    
    # Dates Dictionary
    #------------------
    
    # Find time period range
    delta = d2 - d1         
    
    # Define list to store all dates
    all_dates = []
    
    # Increment days sequentially and append to the list
    for i in range(delta.days + 1):
        all_dates.append(d1 + timedelta(i))
    
    
    # Sunset Dictionary
    #------------------
    
    # Define the list to add all times
    sunset_all = []
    
    # Go through all values and add them 
    # One month after the other
    for k,v in sunsets.items():

        sunset_all = sunset_all + v

        
    # Sunrise Dictionary
    #-------------------
    
    # Define the list to add all times
    sunrise_all = []
    
    # Go through all values and add them 
    # One month after the other
    for k,v in sunrises.items():

        sunrise_all = sunrise_all + v
    
    
    # Creating the dataframe
    #-----------------------
    
    # create d to combine all
    d = defaultdict(list)
    
    # Add entries
    d["Date"] = all_dates
    d["Sunrise"] = sunrise_all
    d["Sunset"] = sunset_all
    
    # Create the df
    sunset_df = pd.DataFrame(d)
    
    return sunset_df

In [5]:
# Define the ranges for the two cases

d1_13 = date(2013, 1, 1)  # start date
d2_13 = date(2013, 12, 31)  # end date
d1_14 = date(2014, 1, 1)  # start date
d2_14 = date(2014, 12, 31)  # end date

# Get the two datasets
sunset_df_13 = create_sun_df(d1_13, d2_13, sunset_13, sunrise_13)
sunset_df_14 = create_sun_df(d1_14, d2_14, sunset_14, sunrise_14)
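
The same construction can also be sketched with `pd.date_range`, checking the lengths before the frame is built. If a placeholder was not filtered out, the mismatch would otherwise only surface as a less descriptive error from pandas. The name `build_sun_frame` and the toy inputs are hypothetical, and the sketch assumes the dictionaries were filled in month order.

```python
import pandas as pd

def build_sun_frame(start, end, sunsets, sunrises):
    """Build a Date | Sunrise | Sunset frame from per-month dictionaries."""
    dates = pd.date_range(start, end, freq="D")
    # Flatten the month lists in insertion order (Jan, Feb, ...)
    sunrise_all = [t for times in sunrises.values() for t in times]
    sunset_all = [t for times in sunsets.values() for t in times]
    if not (len(dates) == len(sunrise_all) == len(sunset_all)):
        raise ValueError(
            f"{len(dates)} dates but {len(sunrise_all)} sunrises "
            f"and {len(sunset_all)} sunsets"
        )
    return pd.DataFrame({"Date": dates, "Sunrise": sunrise_all, "Sunset": sunset_all})

# Toy input covering three days of January only
toy_sunrises = {"Jan": ["0708", "0708", "0709"]}
toy_sunsets = {"Jan": ["1659", "1700", "1701"]}
toy_df = build_sun_frame("2013-01-01", "2013-01-03", toy_sunsets, toy_sunrises)
print(toy_df)
```

One difference from the function above is that `Date` is already a datetime column here, so a later `pd.to_datetime` call becomes a no-op.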

In [6]:
# Preview one of the datasets
sunset_df_13.head(10)

Unnamed: 0,Date,Sunrise,Sunset
0,2013-01-01,708,1659
1,2013-01-02,708,1700
2,2013-01-03,709,1701
3,2013-01-04,709,1702
4,2013-01-05,709,1703
5,2013-01-06,709,1703
6,2013-01-07,709,1704
7,2013-01-08,709,1705
8,2013-01-09,709,1706
9,2013-01-10,709,1707


This achieves exactly what we want. The only thing left to do is to concatenate the two dataframes.

In [7]:
df_sun = pd.concat([sunset_df_13, sunset_df_14], ignore_index=True)

# Show the first rows
df_sun.head(10)

Unnamed: 0,Date,Sunrise,Sunset
0,2013-01-01,708,1659
1,2013-01-02,708,1700
2,2013-01-03,709,1701
3,2013-01-04,709,1702
4,2013-01-05,709,1703
5,2013-01-06,709,1703
6,2013-01-07,709,1704
7,2013-01-08,709,1705
8,2013-01-09,709,1706
9,2013-01-10,709,1707


In [8]:
# Show the last rows to confirm
df_sun.tail(10)

Unnamed: 0,Date,Sunrise,Sunset
720,2014-12-22,705,1653
721,2014-12-23,705,1653
722,2014-12-24,706,1654
723,2014-12-25,706,1654
724,2014-12-26,707,1655
725,2014-12-27,707,1656
726,2014-12-28,707,1656
727,2014-12-29,708,1657
728,2014-12-30,708,1658
729,2014-12-31,708,1658


We now have a dataframe with all sunrise and sunset times. It needs slightly more processing to make it more applicable to the study, which is done in the next section.

## Pre-Processing the dataframe

In this section we will pre-process the data so it is more applicable to our study. The three changes / additions we need to make are:

- Change the time from a string such as "0715" to the decimal 7.25, i.e. 7 hours plus a quarter of an hour (15 / 60 = 0.25). This will allow us to easily graph the results.

- Calculate the number of hours of daylight for each day, as this will be a useful indicator to compare directly with energy consumption

- Convert the date column to datetime and keep only the rows up to the last date of our original data, which is 9 December 2014 (09/12/2014, or 12/09/2014 in US format)

### Converting hour values
We will define a function that takes in a time in string form, e.g. "0715", and converts it to decimal hours, e.g. 7.25. We will then apply it to the sunset and sunrise columns. 

In [9]:
# Define the function
#--------------------

def convert_time(x):
    
    """
    Takes in a string of time
    converts it to a decimal 
    e.g. "0715" --> 7.25
    
    """
    
    # Get the hour and minute
    hour = int(x[:2])
    minute = int(x[2:])

    # Calculate minute percentage
    minute_perc = round(minute / 60,2)
    
    # Return the sum
    return hour + minute_perc

In [10]:
# Apply to both columns, we save it to a new column
# to preserve the original data
df_sun["Sunrise Dec"] = df_sun["Sunrise"].apply(convert_time)
df_sun["Sunset Dec"] = df_sun["Sunset"].apply(convert_time)

# Show the data
df_sun.head(10)

Unnamed: 0,Date,Sunrise,Sunset,Sunrise Dec,Sunset Dec
0,2013-01-01,708,1659,7.13,16.98
1,2013-01-02,708,1700,7.13,17.0
2,2013-01-03,709,1701,7.15,17.02
3,2013-01-04,709,1702,7.15,17.03
4,2013-01-05,709,1703,7.15,17.05
5,2013-01-06,709,1703,7.15,17.05
6,2013-01-07,709,1704,7.15,17.07
7,2013-01-08,709,1705,7.15,17.08
8,2013-01-09,709,1706,7.15,17.1
9,2013-01-10,709,1707,7.15,17.12
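
For larger frames, the conversion can also be written with vectorised pandas string operations instead of `apply`. This is a sketch, assuming every value is a four-character "HHMM" string. The name `convert_time_series` is hypothetical, and unlike `convert_time` it rounds the final sum rather than the minute fraction.

```python
import pandas as pd

def convert_time_series(times):
    """Vectorised 'HHMM' -> decimal hours, rounded to two decimals."""
    hours = times.str[:2].astype(int)
    minutes = times.str[2:].astype(int)
    return (hours + minutes / 60).round(2)

# Made-up sample values in the same format as the Sunrise / Sunset columns
demo_times = pd.Series(["0715", "0708", "1659"])
demo_decimal = convert_time_series(demo_times)
print(demo_decimal.tolist())
```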


### Calculate hours of daylight
In this section we will define a new column, Daylight Hours, which holds the number of hours of daylight for each day, in other words the length of the day. To calculate it, all we need to do is subtract the sunrise time from the sunset time.

In [11]:
# Create a column with the length of day
df_sun["Daylight Hours"] = df_sun["Sunset Dec"] - df_sun["Sunrise Dec"]

# Present the results
df_sun.head(10)

Unnamed: 0,Date,Sunrise,Sunset,Sunrise Dec,Sunset Dec,Daylight Hours
0,2013-01-01,708,1659,7.13,16.98,9.85
1,2013-01-02,708,1700,7.13,17.0,9.87
2,2013-01-03,709,1701,7.15,17.02,9.87
3,2013-01-04,709,1702,7.15,17.03,9.88
4,2013-01-05,709,1703,7.15,17.05,9.9
5,2013-01-06,709,1703,7.15,17.05,9.9
6,2013-01-07,709,1704,7.15,17.07,9.92
7,2013-01-08,709,1705,7.15,17.08,9.93
8,2013-01-09,709,1706,7.15,17.1,9.95
9,2013-01-10,709,1707,7.15,17.12,9.97
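
Subtracting two values that were each rounded to two decimals can, in principle, drift by about 0.01 from the exact day length. A small sketch of an alternative that works in whole minutes and rounds only once at the end is given below. The helper names are hypothetical and the inputs are assumed to be "HHMM" strings.

```python
def minutes_since_midnight(hhmm):
    """Convert an 'HHMM' string to minutes since midnight."""
    return int(hhmm[:2]) * 60 + int(hhmm[2:])

def daylight_hours(sunrise, sunset):
    """Day length in decimal hours, rounded once at the end."""
    minutes = minutes_since_midnight(sunset) - minutes_since_midnight(sunrise)
    return round(minutes / 60, 2)

# 07:08 to 16:59 is 591 minutes of daylight
print(daylight_hours("0708", "1659"))
```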


### Changing to datetime and filtering
We will now change the date column to datetime and extract only the relevant data before saving the file.

In [12]:
# Convert to datetime
df_sun["Date"] = pd.to_datetime(df_sun["Date"])

# Keep only the rows up to the date we are interested in
df_sun = df_sun[df_sun["Date"] <= "2014-12-09"]

# Save the file
df_sun.to_csv("data/sun_data.csv")
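
To illustrate the date filter on its own, here is a small sketch on a made-up frame. A boolean mask on the `Date` column selects whole rows, so the result has to be assigned to a frame rather than to a single column. The values in `toy` are invented for the example.

```python
import pandas as pd

# Made-up rows around the cut-off date
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2014-12-08", "2014-12-09", "2014-12-10"]),
    "Daylight Hours": [9.8, 9.8, 9.78],
})

cutoff = pd.Timestamp("2014-12-09")

# Keep rows on or before the cut-off and renumber the index
filtered = toy[toy["Date"] <= cutoff].reset_index(drop=True)
print(filtered)
```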

## Conclusion

In this notebook we took sunset and sunrise times from the web for the years 2013 and 2014 for the area around the research house. We extracted the data from the text files, loaded them into a dataframe and then pre-processed the dataframe to be suitable for this study. 