# Notebook 1: Web Scraping of Energy and Weather Data
Please see the `README.md` file to learn about this project.

In this first project notebook the raw energy data and weather data are requested from https://www.caiso.com/ and https://www.ncdc.noaa.gov/, respectively.


## 1.1 Web Scraping of Energy Data

The California ISO website reports the daily energy demand in megawatts in 5-minute increments. Historical data can be achieved dating back to April of 2018. The "demand" values being requested in this code are the total system demand in California. Note that this is distinct from the "net demand" values, also available on the website, that have subtracted the wind and solar energy. The net demand will not be analyzed in this project.


### Using Selenium to Automate Data Retrieval

The California ISO website allows the public to download energy demand CSV files for dates back through 2018. This process would take a very long time to do manually for each day though, so the Selenium 2 WebDriver API will be used to automate the process. The Chrome implementation is used here.

In [283]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys #The `Keys` class provides all the keys in the keyboard.

Start Chromedriver

In [284]:
#Create an instance of the Chrome WebDriver
driver = webdriver.Chrome('C:\\Users\\18053\\Desktop\\chromedriver') #previously downloaded ChromeDriver, see README file for more info

#Driver.get to navigate to a given page
driver.get("http://www.caiso.com/TodaysOutlook/Pages/default.aspx") #page where data is housed

  driver = webdriver.Chrome('C:\\Users\\18053\\Desktop\\chromedriver')


In [285]:
#imports
import pandas as pd
import time
import datetime
import pprint
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

Set up date range to scrape from

In [286]:
start_dt = datetime.datetime(2018,4,10) #Earliest date available on the website

end_dt = datetime.datetime(2022,5,19) #Yesterday


Create a list of the dates in the format of MM/DD/YYYY

In [287]:
dates = []
for dt in pd.date_range(start_dt, end_dt):
    dates.append(dt.strftime("%m/%d/%Y"))

Print the dates to make sure the for loop worked correctly. Printing them using `PrettyPrinter`

In [288]:
pp = pprint.PrettyPrinter(indent=4,compact=True)
pp.pprint(dates)

[   '04/10/2018', '04/11/2018', '04/12/2018', '04/13/2018', '04/14/2018',
    '04/15/2018', '04/16/2018', '04/17/2018', '04/18/2018', '04/19/2018',
    '04/20/2018', '04/21/2018', '04/22/2018', '04/23/2018', '04/24/2018',
    '04/25/2018', '04/26/2018', '04/27/2018', '04/28/2018', '04/29/2018',
    '04/30/2018', '05/01/2018', '05/02/2018', '05/03/2018', '05/04/2018',
    '05/05/2018', '05/06/2018', '05/07/2018', '05/08/2018', '05/09/2018',
    '05/10/2018', '05/11/2018', '05/12/2018', '05/13/2018', '05/14/2018',
    '05/15/2018', '05/16/2018', '05/17/2018', '05/18/2018', '05/19/2018',
    '05/20/2018', '05/21/2018', '05/22/2018', '05/23/2018', '05/24/2018',
    '05/25/2018', '05/26/2018', '05/27/2018', '05/28/2018', '05/29/2018',
    '05/30/2018', '05/31/2018', '06/01/2018', '06/02/2018', '06/03/2018',
    '06/04/2018', '06/05/2018', '06/06/2018', '06/07/2018', '06/08/2018',
    '06/09/2018', '06/10/2018', '06/11/2018', '06/12/2018', '06/13/2018',
    '06/14/2018', '06/15/2018', '06/16

Define functions to select a given date on the website and download the CSV file for that date

In [289]:
def select_date(date,calendar_class):
    """
    input date in 'MM/DD/YYYY' form, and calendar_class as 'demand-date'
    
    """
    
    inputElement = driver.find_element(by=By.CLASS_NAME, value=calendar_class) #The HTML "calendar_class" is what is needed to select a date. This code grabs the date input box on the webpage.
    inputElement.clear() #Clears any prior selection from date box
    inputElement.send_keys(date) #Selects passed in date value into date box
    inputElement.send_keys(Keys.ENTER) #Enters selected date value

In [290]:
def download_file(button_id, hidden_button_id = ('','')):
    """Expecting the following:
    button_id: downloadDemandCSV
    Hidden_button_id: (dropdownMenu1,0), (dropdownMenu1,2)
    button_id: actual download button, hidden button is the one hiding it."""
    
    #On webpage, the download option must be clicked twice to actually obtain the CSV file; this code handles the hidden button
    if hidden_button_id[0] !='':
        if hidden_button_id[1]!='':
            inputElement = driver.find_elements(by=By.ID, value=hidden_button_id[0])[hidden_button_id[1]]
            inputElement.send_keys(Keys.ENTER)
            
        #Only needed if the net demand trend was requested; currently not being used    
        else:
            inputElement = driver.find_elements(by=By.ID, value=hidden_button_id[0])
            inputElement.send_keys(Keys.ENTER)
            
    wait = WebDriverWait(driver, 20)
    downloadButton= wait.until(EC.element_to_be_clickable((By.ID,button_id)))
    downloadButton.click()

Add a status bar for tracking for loop progress

In [292]:
import tqdm

Web scraping

In [293]:
for dt in tqdm.tqdm(dates):
    select_date(dt,'demand-date') #dates is the iterable object defined above, 'demand-date' is the HTML class needed to select the calendar
    download_file('downloadDemandCSV',('dropdownMenu1',0)) #HTML classes for the buttons

100%|██████████| 1501/1501 [11:05<00:00,  2.26it/s]


In [294]:
driver.close()

At this point the files were manually moved from the downloads folder to a new directory where they could readily be compiled into a single dataframe.

In [5]:
csv_download_path = os.getcwd()+'\\Energy_Demand_data' #Set dir for files

## 1.2 Compiling Raw Energy Demand Data

Examining format of a single day energy demand CSV file

In [4]:
df = pd.read_csv(csv_download_path + '\\CAISO-demand-20210827.csv')

In [5]:
df.head()

Unnamed: 0,Demand 08/27/2021,00:00,00:05,00:10,00:15,00:20,00:25,00:30,00:35,00:40,...,23:15,23:20,23:25,23:30,23:35,23:40,23:45,23:50,23:55,00:00.1
0,Day-ahead forecast,29164,27365,27365,27365,27365,27365,27365,27365,27365,...,30257,30257,30257,30257,30257,30257,30257,30257,30257,30257.0
1,Hour-ahead forecast,27593,27191,27191,27191,26760,26760,26760,26263,26263,...,30627,29995,29995,29995,29291,29291,29291,28743,28743,28743.0
2,Demand,27545,27444,27375,27237,27085,26900,26769,26583,26454,...,30700,30392,30163,29956,29763,29560,29384,29173,28959,


The energy demand is reported in 5 minute increments in the third row of the dataframe. The first two rows are forecasted demand. For now only the true demand values are of interest, but the forecasted demand may be worth examining later as a comparison to the model that is developed to predict energy demand.

In [6]:
#Calculating the daily energy demand by summing all the columns in the third row.
daily_energy_demand =df.iloc[2,1:].sum()

In [7]:
daily_energy_demand 

9087498

In [8]:
#grabbing the date from the dataframe
date = df.columns[0][-10:]

In [9]:
date

'08/27/2021'

Setting up a master dataframe where all the days will have their daily energy demand values

In [10]:
master_df= pd.DataFrame(columns = ['Date', 'Daily Energy Demand'])

In [11]:
master_df

Unnamed: 0,Date,Daily Energy Demand


Checking to see if the date and energy demand values are being correctly sliced from the original dataframe

In [12]:
master_df.loc[len(master_df.index)] = [date,daily_energy_demand] 

In [13]:
master_df

Unnamed: 0,Date,Daily Energy Demand
0,08/27/2021,9087498


Performing above steps for all 1501 energy demand files

In [2]:
import glob
import os

In [15]:
files = glob.glob(csv_download_path + '/*.csv') #Look for csv files in the dir created earlier

In [19]:
master_df= pd.DataFrame(columns = ['Date', 'Daily Energy Demand'])
for f in tqdm.tqdm(files):
   
    #same steps as was performed for 08/27/2021
    temp_df = pd.read_csv(f)
    daily_energy_demand =temp_df.iloc[2,1:].sum()
    date = temp_df.columns[0][-10:]
    master_df.loc[len(master_df.index)] = [date,daily_energy_demand] 



100%|██████████| 1501/1501 [00:12<00:00, 121.01it/s]


In [20]:
master_df

Unnamed: 0,Date,Daily Energy Demand
0,04/10/2018,7183786.0
1,04/11/2018,1961585.0
2,04/12/2018,6670701.0
3,04/13/2018,6643068.0
4,04/14/2018,6183992.0
...,...,...
1496,05/15/2022,6435770.0
1497,05/16/2022,6824397.0
1498,05/17/2022,6813256.0
1499,05/18/2022,7144889.0


In [21]:
energy_df=master_df.copy()

Looks good. The same number of rows in the dataframe as the number of files that were downloaded, so each date is occuring exactly once in the dataframe. As another check, the date range is what was specified in the initial scraping.

## 1.3 Web Scraping of Weather Data

NOAA's website has an API that can be used to request weather data. To use this API a web services token was first requested. With access gained to NCDC CDO Web Services the base URL can be modified to request specific data sets.


To keep the scope manageable for this project, it was decided that only the most populated areas of California would have their weather as features.

| City | Approximate Population |
| --- | --- |
| Los Angeles | 4,000,000 | 
| San Diego | 1,400,000 | 
| San Jose | 1,000,000 | 
| San Francisco | 900,000 | 
| Fresno | 500,000 | 
| Sacramento | 500,000 | 

Since Los Angeles has such a large population, and is geographically diverse, it was decided that two weather stations from the Los Angeles area would be used as features, one from the LA inland region and one from the LA coastal region. The daily minimum and maximum temperatures were collected from seven regions total.

An effort was made to find stations in each region with >99% data coverage in the date range of interest. Stations were found using https://www.ncdc.noaa.gov/cdo-web/datatools/findstation, California's id: FIPS:06 and the Daily Summaries datasetid: GHCND.






In [22]:
station_Locations = {'Los Angeles International Airport': 'GHCND:USW00023174', 'San Diego Airport':'GHCND:USW00023188',
                     'San Francisco Downtown': 'GHCND:USW00023272', 'FRESNO YOSEMITE INTERNATIONAL': 'GHCND:USW00093193', 'SACRAMENTO METROPOLITAN AIRPORT': 'GHCND:USW00093225',
                    'San Jose': 'GHCND:USW00023293', 'Ontario Airport': 'GHCND:USW00003102'}

Use Python's Requests library to request the data 

In [23]:
import requests
import json 
import numpy as np
from datetime import datetime

Request Web Serviced Token from: https://www.ncdc.noaa.gov/cdo-web/token

In [47]:
Token = ''

The API has a 1000 item maximum limit per call so it will be called once per year in the date range of interest.

In [48]:
#Initialize empty lists
dates = []
max_temp_values = []
min_temp_values = []
stations = []

#Assign the station list to the values from the dictionary of stations of interest
station_list = station_Locations.values()


#Scraping data from station_locations in 2018-2022
for station_id in station_list:
    for year in range(2018, 2023):
        try:
            year = str(year)
            #Print progress
            print('Working on year {} for station {}.'.format(year, station_id))

            #Make the API call
            #Units=standard means temp will be reported in F
            #datatypeid=TMAX and TMIN
            r = requests.get('https://www.ncdc.noaa.gov/cdo-web/api/v2/data?datasetid=GHCND&datatypeid=TMAX&datatypeid=TMIN&units=standard&limit=1000&stationid='+station_id+'&startdate='+year+'-01-01&enddate='+year+'-12-31', headers={'token':Token})
            # Using the JSON library to deserialize the text attribute of the r object
            d = json.loads(r.text)
            
            #Get all TMAX and TMIN items in the response 
            max_temps = [item for item in d['results'] if item['datatype']=='TMAX']
            min_temps = [item for item in d['results'] if item['datatype']=='TMIN']

            #Get the date field from all temp readings
            dates += [item['date'] for item in max_temps]
            
            #Get the actual values of the min and max temps 
            max_temp_values += [item['value'] for item in max_temps]
            min_temp_values += [item['value'] for item in min_temps]
            
            #Get the station field
            stations += [item['station'] for item in max_temps]
            
        except:
            pass



Working on year 2018 for station GHCND:USW00023174.
Working on year 2019 for station GHCND:USW00023174.
Working on year 2020 for station GHCND:USW00023174.
Working on year 2021 for station GHCND:USW00023174.
Working on year 2022 for station GHCND:USW00023174.
Working on year 2018 for station GHCND:USW00023188.
Working on year 2019 for station GHCND:USW00023188.
Working on year 2020 for station GHCND:USW00023188.
Working on year 2021 for station GHCND:USW00023188.
Working on year 2022 for station GHCND:USW00023188.
Working on year 2018 for station GHCND:USW00023272.
Working on year 2019 for station GHCND:USW00023272.
Working on year 2020 for station GHCND:USW00023272.
Working on year 2021 for station GHCND:USW00023272.
Working on year 2022 for station GHCND:USW00023272.
Working on year 2018 for station GHCND:USW00093193.
Working on year 2019 for station GHCND:USW00093193.
Working on year 2020 for station GHCND:USW00093193.
Working on year 2021 for station GHCND:USW00093193.
Working on y

The length of min_temp_values was != the length of the max_temp_values, so the script can be run again for the minimum values and collect those dates. First let's see how the max temp data are looking.

In [49]:
weather_df = pd.DataFrame(columns=['Date', 'Max Temp', 'Station ID'])
weather_df['Date'] = [datetime.strptime(d, "%Y-%m-%dT%H:%M:%S") for d in dates]
weather_df['Max Temp'] = max_temp_values #in deg F
weather_df['Station ID'] = stations

#Creating a column of station location names. Since original dictionary had the location names as the keys, a new dict was created with the names as values so they could be mapped.
swapped_dict = dict([(value, key) for key, value in station_Locations.items()])
weather_df['Station Location'] = weather_df['Station ID'].map(swapped_dict)

In [50]:
weather_df.head(10)

Unnamed: 0,Date,Max Temp,Station ID,Station Location
0,2018-01-01,67.0,GHCND:USW00023174,Los Angeles International Airport
1,2018-01-02,76.0,GHCND:USW00023174,Los Angeles International Airport
2,2018-01-03,76.0,GHCND:USW00023174,Los Angeles International Airport
3,2018-01-04,74.0,GHCND:USW00023174,Los Angeles International Airport
4,2018-01-05,69.0,GHCND:USW00023174,Los Angeles International Airport
5,2018-01-06,64.0,GHCND:USW00023174,Los Angeles International Airport
6,2018-01-07,69.0,GHCND:USW00023174,Los Angeles International Airport
7,2018-01-08,67.0,GHCND:USW00023174,Los Angeles International Airport
8,2018-01-09,63.0,GHCND:USW00023174,Los Angeles International Airport
9,2018-01-10,64.0,GHCND:USW00023174,Los Angeles International Airport


In [51]:
weather_df['Station Location'].unique()

array(['Los Angeles International Airport', 'San Diego Airport',
       'San Francisco Downtown', 'FRESNO YOSEMITE INTERNATIONAL',
       'SACRAMENTO METROPOLITAN AIRPORT', 'San Jose', 'Ontario Airport'],
      dtype=object)

In [52]:
max_temp_df = weather_df.copy()

Looks good! All 7 station locations are included in the data and the maximum temperature was pulled from each station for each day in 2018-2022.

In [53]:
#Perform data requests again but this time for minimum temperature values
dates = []
min_temp_values = []
stations = []

#Assign the station list to the values from the dictionary of stations of interest
station_list = station_Locations.values()


#Scraping data from station_locations in 2018-2022
for station_id in station_list:
    for year in range(2018, 2023):
        try:
            year = str(year)
            #print progress
            print('Working on year {} for station {}.'.format(year, station_id))

            #make the api call
            r = requests.get('https://www.ncdc.noaa.gov/cdo-web/api/v2/data?datasetid=GHCND&datatypeid=TMIN&units=standard&limit=1000&stationid='+station_id+'&startdate='+year+'-01-01&enddate='+year+'-12-31', headers={'token':Token})
            # Using the JSON library to deserialize the text attribute of the r object
            d = json.loads(r.text)
            #get all TMIN items in the response 
            min_temps = [item for item in d['results'] if item['datatype']=='TMIN']

            #get the date field from all temp readings
            dates += [item['date'] for item in min_temps]
            
            #get the actual values of the min and max temps 
            min_temp_values += [item['value'] for item in min_temps]
            
            #get the station field
            stations += [item['station'] for item in min_temps]
            
        except:
            pass


Working on year 2018 for station GHCND:USW00023174.
Working on year 2019 for station GHCND:USW00023174.
Working on year 2020 for station GHCND:USW00023174.
Working on year 2021 for station GHCND:USW00023174.
Working on year 2022 for station GHCND:USW00023174.
Working on year 2018 for station GHCND:USW00023188.
Working on year 2019 for station GHCND:USW00023188.
Working on year 2020 for station GHCND:USW00023188.
Working on year 2021 for station GHCND:USW00023188.
Working on year 2022 for station GHCND:USW00023188.
Working on year 2018 for station GHCND:USW00023272.
Working on year 2019 for station GHCND:USW00023272.
Working on year 2020 for station GHCND:USW00023272.
Working on year 2021 for station GHCND:USW00023272.
Working on year 2022 for station GHCND:USW00023272.
Working on year 2018 for station GHCND:USW00093193.
Working on year 2019 for station GHCND:USW00093193.
Working on year 2020 for station GHCND:USW00093193.
Working on year 2021 for station GHCND:USW00093193.
Working on y

In [54]:
min_weather_df = pd.DataFrame(columns=['Date', 'Min Temp', 'Station ID'])
min_weather_df['Date'] = [datetime.strptime(d, "%Y-%m-%dT%H:%M:%S") for d in dates]
min_weather_df['Min Temp'] = min_temp_values
min_weather_df['Station ID'] = stations
min_weather_df['Station Location'] = min_weather_df['Station ID'].map(swapped_dict)

In [55]:
min_weather_df

Unnamed: 0,Date,Min Temp,Station ID,Station Location
0,2018-01-01,48.0,GHCND:USW00023174,Los Angeles International Airport
1,2018-01-02,54.0,GHCND:USW00023174,Los Angeles International Airport
2,2018-01-03,54.0,GHCND:USW00023174,Los Angeles International Airport
3,2018-01-04,55.0,GHCND:USW00023174,Los Angeles International Airport
4,2018-01-05,56.0,GHCND:USW00023174,Los Angeles International Airport
...,...,...,...,...
11159,2022-05-12,48.0,GHCND:USW00003102,Ontario Airport
11160,2022-05-13,55.0,GHCND:USW00003102,Ontario Airport
11161,2022-05-14,59.0,GHCND:USW00003102,Ontario Airport
11162,2022-05-15,58.0,GHCND:USW00003102,Ontario Airport


In [56]:
min_temp_df= min_weather_df.copy()

Now time to compile the min and max temp dataframes into one. The date column is going to be transformed into a datetime object so that it can be used to merge on. 

In [57]:
all_temp_data = pd.merge(max_temp_df, min_temp_df, how = 'inner', on= ['Date', 'Station ID', 'Station Location'])
#inner merge to prevent null values

In [58]:
all_temp_data 

Unnamed: 0,Date,Max Temp,Station ID,Station Location,Min Temp
0,2018-01-01,67.0,GHCND:USW00023174,Los Angeles International Airport,48.0
1,2018-01-02,76.0,GHCND:USW00023174,Los Angeles International Airport,54.0
2,2018-01-03,76.0,GHCND:USW00023174,Los Angeles International Airport,54.0
3,2018-01-04,74.0,GHCND:USW00023174,Los Angeles International Airport,55.0
4,2018-01-05,69.0,GHCND:USW00023174,Los Angeles International Airport,56.0
...,...,...,...,...,...
11159,2022-05-12,84.0,GHCND:USW00003102,Ontario Airport,48.0
11160,2022-05-13,93.0,GHCND:USW00003102,Ontario Airport,55.0
11161,2022-05-14,97.0,GHCND:USW00003102,Ontario Airport,59.0
11162,2022-05-15,89.0,GHCND:USW00003102,Ontario Airport,58.0


Now to merge with the energy dataframe. The date formats are not the same between the dataframes so that needs to be fixed so they can be joined along the date column.

In [59]:
energy_df.head()

Unnamed: 0,Date,Daily Energy Demand
0,2018-04-10,7183786.0
1,2018-04-11,1961585.0
2,2018-04-12,6670701.0
3,2018-04-13,6643068.0
4,2018-04-14,6183992.0


In [60]:
all_temp_data['Date'] = pd.to_datetime((max_temp_df['Date']))
energy_df['Date'] = pd.to_datetime((energy_df['Date']))

In [61]:
all_temp_data['Date'].max()

Timestamp('2022-05-16 00:00:00')

In [62]:
all_temp_data['Date']

0       2018-01-01
1       2018-01-02
2       2018-01-03
3       2018-01-04
4       2018-01-05
           ...    
11159   2022-05-10
11160   2022-05-11
11161   2022-05-12
11162   2022-05-13
11163   2022-05-14
Name: Date, Length: 11164, dtype: datetime64[ns]

Notice that the temperature data doesn't end on the final date (05/16/2022), perhaps some weather stations don't have information for all the dates. Let's check what dates don't have values from all 7 stations.

In [63]:
test_df = pd.DataFrame({'date': all_temp_data['Date'].value_counts().index, 'value_counts': all_temp_data['Date'].value_counts()})

In [64]:
test_df[test_df['value_counts']<=6]

Unnamed: 0,date,value_counts
2019-11-18,2019-11-18,6
2018-02-28,2018-02-28,6
2022-05-15,2022-05-15,6
2021-04-23,2021-04-23,6
2018-08-03,2018-08-03,6
2018-08-04,2018-08-04,6
2018-08-05,2018-08-05,6
2018-08-06,2018-08-06,6
2018-08-07,2018-08-07,6
2018-08-08,2018-08-08,6


Looks like there are a handful of dates with missing data from one station and 1 date with data missing from 3 stations. Something to keep in mind during data cleaning.

Check that the date format was correctly transformed in energy_df

In [65]:
energy_df['Date']

0      2018-04-10
1      2018-04-11
2      2018-04-12
3      2018-04-13
4      2018-04-14
          ...    
1496   2022-05-15
1497   2022-05-16
1498   2022-05-17
1499   2022-05-18
1500   2022-05-19
Name: Date, Length: 1501, dtype: datetime64[ns]

In [66]:
energy_df['Date'].min()

Timestamp('2018-04-10 00:00:00')

In [67]:
df = pd.merge(energy_df, all_temp_data, how = 'inner', on= ['Date'])
#inner merge to only have dates where there is both energy and temperature data (04/10/2018 - 05/16/2022, 1498 days)

In [68]:
energy_df.head(14)

Unnamed: 0,Date,Daily Energy Demand
0,2018-04-10,7183786.0
1,2018-04-11,1961585.0
2,2018-04-12,6670701.0
3,2018-04-13,6643068.0
4,2018-04-14,6183992.0
5,2018-04-15,5926078.0
6,2018-04-16,6619344.0
7,2018-04-17,6567013.0
8,2018-04-18,6739403.0
9,2018-04-19,6649368.0


In [69]:
df.head(14)

Unnamed: 0,Date,Daily Energy Demand,Max Temp,Station ID,Station Location,Min Temp
0,2018-04-10,7183786.0,79.0,GHCND:USW00023174,Los Angeles International Airport,60.0
1,2018-04-10,7183786.0,85.0,GHCND:USW00023188,San Diego Airport,60.0
2,2018-04-10,7183786.0,63.0,GHCND:USW00023272,San Francisco Downtown,52.0
3,2018-04-10,7183786.0,86.0,GHCND:USW00093193,FRESNO YOSEMITE INTERNATIONAL,59.0
4,2018-04-10,7183786.0,68.0,GHCND:USW00093225,SACRAMENTO METROPOLITAN AIRPORT,51.0
5,2018-04-10,7183786.0,69.0,GHCND:USW00023293,San Jose,52.0
6,2018-04-10,7183786.0,72.0,GHCND:USW00003102,Ontario Airport,51.0
7,2018-04-11,1961585.0,74.0,GHCND:USW00023174,Los Angeles International Airport,57.0
8,2018-04-11,1961585.0,73.0,GHCND:USW00023188,San Diego Airport,60.0
9,2018-04-11,1961585.0,60.0,GHCND:USW00023272,San Francisco Downtown,49.0


In [70]:
df.to_csv('csv_files\df.csv')

Great! Now we have a master dataframe with all the temperature and energy data we want for our modeling. 

See Notebook 2 for exploratory data analysis and data cleaning.