




# Energy Consumption Data Notebook 3: Data Visualizations and Finding Trends in Data


---

### Goals For This Notebook:

1 - Create scatter plots to visualize how energy consumption changes with certain factors.<br>

2 - Create lines of best fit to see how strong the correlation between chosen variables is.<br>

---

### Table of Contents

1 - [Creating Scatter Plots and Lines of Best Fit](#section1)<br>

2 - [Which Appliance's Power Consumption Is Most Affected By Temperature?](#section2)<br>

---

In this notebook, we will continue to work with our merged weather and power dataset to create more data visualizations. We will create scatter plots, which allow us to examine the relationship between two variables. In particular, we want to see how factors such as outdoor air temperature affect the energy consumption of the building and various appliances.

We will also create lines of best fit to visualize how strong the correlation between chosen variables is.

Let's first get started by importing the libraries we need:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

## 1. Creating Scatter Plots and Lines of Best Fit <a id='section1'></a>

Refer to Energy Consumption Data Notebook 2 for this exercise. Import the merged weather and power dataset saved in the file `weather_and_power.csv` (in the *data* folder) into a dataframe called `weather_and_power`. Then remove the column named `Unnamed: 0`.

In [None]:
#EXERCISE - Import the data

weather_and_power = ...
weather_and_power.head()

In [None]:
#EXERCISE - Remove the column "Unnamed: 0"

# Your code here

weather_and_power.head()
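
If you would like to see what dropping a column looks like before trying it on the real data, here is a minimal sketch on a made-up table. The dataframe `toy` and its values are assumptions for illustration only; they are not taken from `weather_and_power.csv`.

```python
import pandas as pd

# Made-up stand-in for a CSV that was saved with an extra index column
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "outdoor air temperature (F)": [41.0, 45.5, 50.2],
    "total power consumption (Watts)": [5200.0, 5100.0, 4950.0],
})

# drop(columns=...) returns a new dataframe without the named column;
# the original "toy" dataframe is left unchanged
toy_clean = toy.drop(columns=["Unnamed: 0"])
print(list(toy_clean.columns))
```

Because `drop()` returns a new dataframe here, you would either save the result back to a variable or pass `inplace=True`.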

Let's review how to create a scatter plot using matplotlib. In the cell below, make a scatter plot that compares the total power consumption to the building total power consumption.

**Before you create the scatter plot, what sort of correlation do you expect? *(Positive, negative, or none?)* Why? Write your answer and reasoning below:**

*Your answer here*

In [None]:
# EXERCISE

x_data = weather_and_power["..."]
y_data = weather_and_power["..."]

plt.title('Comparing Total Power Consumption to the Building Total Power Consumption')
plt.xlabel('Building Total Power Consumption (Watts)')
plt.ylabel('Total Power Consumption (Watts)')
plt.scatter(..., ...);

Notice that we see an almost perfect positive correlation between total power consumption and building total power consumption (was this what you expected?).

However, there are a few points with values of 0 that are making it harder to see the relationship.

Let's drop the rows where total power consumption OR building total power consumption is zero. We will use boolean indexing (refer to notebook 07: pandas dataframes) to first find the rows that match these criteria and save them into a filter.

In [None]:
# EXERCISE
# Use boolean indexing to find rows where the total power consumption OR the building total power consumption are zero
# We will save this into a filter "zero_power_filter"

zero_power_filter = weather_and_power[(weather_and_power["..."]==0) | 
                 (weather_and_power["..."]==0)]

The next step is to remove the rows. As we are working with rows, not columns as before, we will use the `drop()` function and feed it our filter. We must attach `.index` to our filter name so that the `drop()` function knows the index labels of the rows it needs to drop. We will also use the argument `inplace = True` so we drop the rows from the original dataframe instead of creating a new dataframe.

In [None]:
# Run this cell to remove those rows!
weather_and_power.drop(zero_power_filter.index, inplace = True)

weather_and_power.head()
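
To make the two steps above concrete, here is a small sketch on a made-up table. The column names `total` and `building` and all of the values are assumptions for illustration, not the real dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    "total": [0.0, 5200.0, 5100.0, 4800.0],
    "building": [3000.0, 0.0, 2900.0, 2700.0],
})

# Rows where either column is zero. | is the element-wise OR,
# and each comparison needs its own parentheses.
zero_rows = toy[(toy["total"] == 0) | (toy["building"] == 0)]

# .index holds the row labels of the matching rows, which drop() removes
toy_nonzero = toy.drop(zero_rows.index)
print(toy_nonzero)
```

Here `drop()` is used without `inplace=True`, so the result is saved in a new variable and `toy` itself keeps all four rows.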

Let's create the same scatter plot again, this time with our 0 values removed:

In [None]:
# EXERCISE

x_data = weather_and_power["..."]
y_data = weather_and_power["..."]

plt.title('Comparing Total Power Consumption to the Building Total Power Consumption')
plt.xlabel('Building Total Power Consumption (Watts)')
plt.ylabel('Total Power Consumption (Watts)')
plt.scatter(..., ...);

In order to better visualize the correlation, let's add a line of best fit. However, before we do this, we need to remove all the null values from the data so that our functions for creating the line of best fit work.

First, let's see if we have any null values (refer to notebook 07 pandas dataframes, section 1.6):

In [None]:
#EXERCISE - Check if there are any null values

weather_and_power....

There are plenty! We will use the function `.dropna()` which will remove all the rows that contain null values. Again, we will use the argument `inplace = True` so we drop the rows from the original dataframe instead of creating a new dataframe. Make sure to check if we successfully removed the null values:

In [None]:
weather_and_power.dropna(inplace=True)

#EXERCISE - Check again to see if there are any null values left:
weather_and_power....
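
Here is one possible way to count and remove null values, shown on a made-up table. The column names and values are assumptions for illustration; the approach in notebook 07 may differ slightly.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "temperature": [41.0, np.nan, 50.2, 47.3],
    "power": [5200.0, 5100.0, np.nan, 4900.0],
})

# isnull() marks each missing cell as True; sum() counts them per column
nulls_before = toy.isnull().sum()
print(nulls_before)

# dropna() removes every row that has at least one missing value
toy_complete = toy.dropna()
nulls_after = toy_complete.isnull().sum()
print(nulls_after)
```

Notice that a row is dropped even if only one of its columns is missing, so two of the four rows remain in this toy example.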

Now let's create our model using the function `polyfit()` from the numpy library. `polyfit()` takes in your x and y data values. We also put in the argument '1' to indicate we are working with a linear model.

We will also use the function `poly1d()` from the numpy library. This allows us to visualize the best-fit line in the format we are familiar with, y = mx + b, where m is the slope and b is the y-intercept. `poly1d()` also allows us to easily plot it over our scatter plot from before.

In [None]:
#We are still working with building total power consumption and total power consumption:
x_data = weather_and_power["building total power consumption (Watts)"]
y_data = weather_and_power["total power consumption (Watts)"]

#Creating the model
model = np.polyfit(x_data, y_data, 1)
model_function = np.poly1d(model)
print(model_function)
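
To build some intuition for what `polyfit()` and `poly1d()` return, here is a small sketch with synthetic points that lie exactly on the line y = 2x + 1. The data is made up so that we know what slope and intercept to expect.

```python
import numpy as np

# Synthetic points that lie exactly on y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# For a degree-1 fit, polyfit returns [slope, intercept]
coefficients = np.polyfit(x, y, 1)
slope, intercept = coefficients

# poly1d turns the coefficients into a function we can call on new x values
line = np.poly1d(coefficients)

print(slope, intercept)
print(line(10.0))
```

Real measurements will not fall exactly on a line, so the slope and intercept you get from the power data are the values that fit the points best in the least-squares sense.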

Let's now look at our scatter plot and line of best fit together! In the cell below, copy and paste your code for creating the scatter plot. On the line below, create the line of best fit.

When plotting the line of best fit, notice that you use the model function applied to the x data for your y values.

In [None]:
# EXERCISE - Create your scatter plot with the line of best fit

#Copy your code to create the scatter plot here:



#Create the line of best fit
plt.plot(x_data, model_function(x_data), 'red')

#Create a legend
plt.legend(['scatter', 'line of best fit']);
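
A note on legends: when you pass a list of names to `plt.legend()`, the names are matched to the plotted items in an order that can depend on your matplotlib version. One alternative, sketched below on synthetic data (all values are made up), is to attach a `label` to each plotting call so that each name stays with its own item.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data scattered around a straight line
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 30)
y = 3.0 * x + 5.0 + rng.normal(0, 1.0, size=x.size)

line = np.poly1d(np.polyfit(x, y, 1))

fig, ax = plt.subplots()
ax.scatter(x, y, label="scatter")
ax.plot(x, line(x), color="red", label="line of best fit")

# With no arguments, legend() uses the label attached to each plotting call
legend = ax.legend()
```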

With the line of best fit, it's even clearer that there is a positive relationship between the total power consumption and building total power consumption.

However, we can quantify how good the line of best fit is by finding the R<sup>2</sup> (coefficient of determination) value. The closer R<sup>2</sup> is to 1, the better the fit.

To do this, we'll use a new library, `scikit-learn` (which is actually the [machine learning library](https://scikit-learn.org/stable/) for Python!). It has a function called `r2_score()` that lets us easily calculate R<sup>2</sup>. The arguments it takes are our original y data values and the model's y data values:

In [None]:
from sklearn.metrics import r2_score

r2_score(y_data, model_function(x_data))
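
If you are curious what R<sup>2</sup> measures, it can be computed by hand as 1 minus the ratio of the variation the line leaves unexplained to the total variation of the y values around their mean. The sketch below does this with numpy on made-up data; `r2_score()` computes the same quantity.

```python
import numpy as np

# Made-up points that lie close to, but not exactly on, a straight line
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

line = np.poly1d(np.polyfit(x, y, 1))
predicted = line(x)

ss_res = np.sum((y - predicted) ** 2)   # variation the line leaves unexplained
ss_tot = np.sum((y - y.mean()) ** 2)    # total variation around the mean
r_squared = 1 - ss_res / ss_tot
print(r_squared)
```

When the points sit close to the line, `ss_res` is small compared to `ss_tot`, so R<sup>2</sup> ends up close to 1.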

Now we are going to analyze a different relationship. We are going to look at total power consumption compared to the outdoor temperature. When do you think the most power is consumed? Keep in mind the only months we are looking at are January, February, March, and April.

**In this cell, write your hypothesis**

*Your answer here*

Let's test your hypothesis. First, create a scatter plot comparing outdoor air temperature to total power consumption. Put the temperature on the x-axis and total power consumption on the y-axis.

*Hint: You can always call .columns to refresh your memory on the name of the columns. Create a new code cell below if you would like to do that.*

In [None]:
# EXERCISE - Create a scatter plot comparing outdoor air temperature and total power consumption

x_data = ...
y_data = ...

plt.title(...)
plt.xlabel(...)
plt.ylabel(...)
plt.scatter(..., ...);

Copy your code from before to create your linear model:

In [None]:
#EXERCISE - Create the best-fit line
model = np.polyfit(..., ..., 1)
model_function = np.poly1d(...)
print(model_function)

In the cell below, create a scatter plot with the line of best fit. Start by copying and pasting your earlier code.

In [None]:
# EXERCISE - Create your scatter plot with the line of best fit

#Copy your code to create the scatter plot here:



#Create the line of best fit
plt.plot(..., ..., 'red')

#Create a legend
plt.legend(['line of best fit', 'scatter']);

Let's quantify how good the line of best fit is by finding the R<sup>2</sup> value. Remember, the closer R<sup>2</sup> is to 1, the better the fit.

In [None]:
#EXERCISE - find R^2

r2_score(..., ...)

### Answer the following questions:

Is there a positive, negative, or no correlation between the total power consumption and outdoor air temperature? How can you tell? Does this match your hypothesis?

*Your answer here*

How strong of a relationship is there between the total power consumption and the outdoor air temperature? How can you tell?

*Your answer here*

What are some ways to further investigate the correlation between the total power consumption and outdoor air temperature?

*Your answer here*

## 2. Which Appliance's Power Consumption Is Most Affected By Temperature? <a id='section2'></a>

We have data for power consumption of five different appliances: freezer, refrigerator, refrigerator fan, west AC, and east AC.

Let's investigate which appliance's power consumption is most affected by temperature. We will do this by creating a scatter plot including the line of best fit for each of the five appliances. We will also look at the R<sup>2</sup> value.

Before we start, let's make a hypothesis. **Of the five appliances, which do you think is most strongly correlated with the temperature? Why?**

*Your answer here*

In [None]:
#Let's again remind ourselves of the names of the columns:
weather_and_power.columns

As we will be plotting the power consumption of each of the appliances against the outdoor air temperature, the data on the x-axis for all the plots will be outdoor air temperature. Let's go ahead and create the `x_data` variable that we will use in all our plots:

In [None]:
#EXERCISE - Have the variable x data be the outdoor air temperature column from the weather_and_power dataframe

x_data = ...

For each of the appliances below, do the following:
1. Grab the appropriate column and save it in the y_data_*appliance name* variable
2. Use `polyfit()` and `poly1d()` to create your linear model (best-fit line)
3. Create a scatter plot with the best fit line
4. Find R<sup>2</sup>

Remember, to make your life easier, you can copy and paste code! Just make sure to rename variables, graph titles and labels appropriately.
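
If you find yourself copying the same model and R<sup>2</sup> code many times, one option is to wrap those steps in a function. The sketch below is only one way to do it: the helper name `fit_line_and_score` and the sample numbers are made up for illustration, and it computes R<sup>2</sup> with numpy rather than `r2_score()`.

```python
import numpy as np

def fit_line_and_score(x, y):
    """Hypothetical helper: fit y = m*x + b and return (model function, R^2)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    model_function = np.poly1d(np.polyfit(x, y, 1))
    residual = np.sum((y - model_function(x)) ** 2)
    total = np.sum((y - y.mean()) ** 2)
    return model_function, 1 - residual / total

# Made-up temperatures (F) and appliance power readings (Watts)
temps = [40, 45, 50, 55, 60]
watts = [3500, 3540, 3610, 3650, 3700]
fitted, r2 = fit_line_and_score(temps, watts)
print(fitted)
print(r2)
```

You could then call the same function once per appliance with that appliance's x and y data.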

### Freezer

In [None]:
#Grab the appropriate column and save it in y_data_freezer

y_data_freezer = weather_and_power["..."]

In [None]:
#Create the linear model
model_freezer = np.polyfit(..., ..., 1)
model_function_freezer = np.poly1d(...)
print(model_function_freezer)

In [None]:
#Create the scatter plot with the best-fit line
plt.title("...")
plt.ylabel("...")
plt.xlabel("...")
plt.scatter(..., ...)

plt.plot(x_data,model_function_freezer(x_data), 'red')

plt.legend(['line of best fit', 'scatter']);

You might notice that your graph looks odd - there are some horizontal lines. These horizontal lines represent fan-only or compressor-off mode, or regular "off states". If we want to find the correlation between the actual consumption and the outdoor air temperature, we will need to use only the data from when the freezer is on.

To do that, let's create a filter where we only look at freezer power consumption data that is above 3400 Watts, then apply that filter to create a subset of x and y data that only fulfills the filter.

In [None]:
#Create our filters
freezer_filter = weather_and_power["..."] > ...

#Apply our filters
x_data_freezer_on = weather_and_power[...]["outdoor air temperature (F)"]
y_data_freezer_on = weather_and_power[freezer_filter]["..."]
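
Here is a sketch of the same filtering idea on a made-up table, in case you would like to see it run from start to finish. The column names and readings are assumptions for illustration; only the 3400 Watt cutoff comes from the text above.

```python
import pandas as pd

toy = pd.DataFrame({
    "temperature": [40.0, 42.0, 44.0, 46.0, 48.0],
    "freezer": [120.0, 3450.0, 110.0, 3500.0, 3560.0],
})

# Boolean Series: True where the made-up reading is above the cutoff
on_filter = toy["freezer"] > 3400

# Apply the same filter to both columns so x and y stay paired row by row
x_on = toy[on_filter]["temperature"]
y_on = toy[on_filter]["freezer"]
print(len(x_on), len(y_on))
```

Using one filter for both x and y matters: `polyfit()` needs the two arrays to have the same length, with each temperature matched to its own power reading.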

Now recreate your model and scatter plot - you can copy code from before, but just make sure you are using the right variables!

In [None]:
#Create the linear model
model_freezer = np.polyfit(..., ..., 1)
model_function_freezer = np.poly1d(model_freezer)
print(model_function_freezer)

In [None]:
#Create the scatter plot with the best-fit line
plt.title("...")
plt.ylabel("...")
plt.xlabel("...")
plt.scatter(..., ...)

plt.plot(..., model_function_freezer(...), 'red')

plt.legend(['line of best fit', 'scatter']);

Now the scatter plot and linear model only use the data from when the freezer is on, and they look a bit more reasonable. Let's see what our R<sup>2</sup> returns:

In [None]:
#Find R^2

r2_score(y_data_freezer_on, model_function_freezer(x_data_freezer_on))

Feel free to experiment with the filter to see if you can get a better fit!

### Refrigerator

In [None]:
#Grab the appropriate column and save it in y_data_refrigerator

y_data_refrigerator = weather_and_power["..."]

In [None]:
#Create the linear model
model_refrigerator = np.polyfit(..., ..., 1)
model_function_refrigerator = np.poly1d(...)
print(model_function_refrigerator)

In [None]:
#Create the scatter plot with the best-fit line
plt.title("...")
plt.ylabel("...")
plt.xlabel("...")
plt.scatter(..., ...)

plt.plot(..., ..., 'red')

plt.legend(['line of best fit', 'scatter']);

Similar to the freezer, create a filter that ignores the "off values" and then recreate the linear model and scatter plot. You may need to experiment with your filter to find the right cutting point for data.

In [None]:
#Create our filters
refrigerator_filter = ...

#Apply our filters
x_data_refrigerator_on = weather_and_power[...]["..."]
y_data_refrigerator_on = weather_and_power[...]["..."]

In [None]:
#Re-create the linear model
model_refrigerator = np.polyfit(..., ..., 1)
model_function_refrigerator = np.poly1d(...)
print(model_function_refrigerator)

In [None]:
#Re-create the scatter plot with the best-fit line
plt.title("...")
plt.ylabel("...")
plt.xlabel("...")
plt.scatter(..., ...)

plt.plot(..., ..., 'red')

plt.legend(['line of best fit', 'scatter']);

In [None]:
#Find R^2

r2_score(..., model_function_refrigerator(...))

### Refrigerator Fan

In [None]:
#Grab the appropriate column and save it in y_data_refrigerator_fan

y_data_refrigerator_fan = ...

In [None]:
#Create the linear model
model_refrigerator_fan = np.polyfit(..., ..., 1)
model_function_refrigerator_fan = np.poly1d(...)
print(model_function_refrigerator_fan)

In [None]:
#Create the scatter plot with the best-fit line
plt.title("...")
plt.ylabel("...")
plt.xlabel("...")
plt.scatter(..., ...)

plt.plot(..., ..., 'red')

plt.legend(['line of best fit', 'scatter']);

This time, you might notice that there are three clusters of data close together that almost seem to form horizontal lines. You should create a filter, but it will be interesting to experiment with which data cluster to focus on.

In [None]:
#Create our filter
refrigerator_fan_filter = ...

#Apply our filter
x_data_refrigerator_fan_on = weather_and_power[...]["..."]
y_data_refrigerator_fan_on = weather_and_power[...]["..."]

In [None]:
#Re-create the linear model
model_refrigerator_fan = np.polyfit(..., ..., 1)
model_function_refrigerator_fan = np.poly1d(...)
print(model_function_refrigerator_fan)

In [None]:
#Re-create the scatter plot with the best-fit line
plt.title("...")
plt.ylabel("...")
plt.xlabel("...")
plt.scatter(..., ...)

plt.plot(..., ..., 'red')

plt.legend(['line of best fit', 'scatter']);

In [None]:
#Find R^2

r2_score(..., ...)

### West Air Conditioning

In [None]:
#Grab the appropriate column and save it in y_data_west_ac

y_data_west_ac = ...

In [None]:
#Create the linear model
model_west_ac = np.polyfit(..., ..., 1)
model_function_west_ac = np.poly1d(...)
print(model_function_west_ac)

In [None]:
#Create the scatter plot with the best-fit line
plt.title("...")
plt.ylabel("...")
plt.xlabel("...")
plt.scatter(..., ...)

plt.plot(..., ..., 'red')

plt.legend(['line of best fit', 'scatter']);

The west air conditioning system has three different modes of operation: fan-only, 1-stage, 2-stage. Fan-only is when the compressor is off and there is just ventilation happening, 1-stage is the first stage of cooling and 2-stage is what turns on when 1-stage isn't sufficient for cooling.

Try to filter your data to focus on just 2-stage. Note that there are some interesting data points around 6000W - according to the researchers who conducted this study, these are anomalies and should be ignored for now.

In [None]:
#Create our filters
west_ac_filter = (weather_and_power["..."] < ...) & (weather_and_power["..."] > ...)

#Apply our filters
x_data_west_ac_on = weather_and_power[...]["..."]
y_data_west_ac_on = weather_and_power[...]["..."]
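
A filter with both a lower and an upper cutoff keeps only the readings inside a band. The sketch below shows the pattern on a made-up table; the column names, the readings, and the cutoffs `lower` and `upper` are all assumptions for illustration, so you will still need to choose cutoffs by looking at your own scatter plot.

```python
import pandas as pd

toy = pd.DataFrame({
    "temperature": [50.0, 55.0, 60.0, 65.0, 70.0, 75.0],
    "ac": [300.0, 1500.0, 2900.0, 3100.0, 6000.0, 3000.0],
})

# Hypothetical cutoffs chosen only for this toy table
lower, upper = 2500, 4000

# & is the element-wise AND; each comparison needs its own parentheses
band_filter = (toy["ac"] > lower) & (toy["ac"] < upper)

x_band = toy[band_filter]["temperature"]
y_band = toy[band_filter]["ac"]
print(y_band.tolist())
```

The upper cutoff is what removes the unusually high reading in this toy table, while the lower cutoff removes the low readings.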

In [None]:
#Re-create the linear model
model_west_ac = np.polyfit(..., ..., 1)
model_function_west_ac = np.poly1d(...)
print(model_function_west_ac)

In [None]:
#Re-create the scatter plot with the best-fit line
plt.title("...")
plt.ylabel("...")
plt.xlabel("...")
plt.scatter(..., ...)

plt.plot(..., ..., 'red')

plt.legend(['line of best fit', 'scatter']);

In [None]:
#Find R^2

r2_score(..., ...)

### East Air Conditioning

In [None]:
#Grab the appropriate column and save it in y_data_east_ac

y_data_east_ac = ...

In [None]:
#Create the linear model
model_east_ac = ...
model_function_east_ac = ...
print(model_function_east_ac)

In [None]:
#Create the scatter plot with the best-fit line

#Your code here


plt.legend(['line of best fit', 'scatter']);

The east air conditioning system also has three different modes of operation (fan-only, 1-stage, 2-stage) like the west air conditioning system; however, it turns out that the east air conditioning system was constantly running and never in fan-only mode. The scatter plot shows data points for 1-stage and 2-stage.

Try to filter out your data to just focus on 2-stage.

In [None]:
#Create our filters
east_ac_filter = ...

#Apply our filters
x_data_east_ac_on = ...
y_data_east_ac_on = ...

In [None]:
#Re-create the linear model
model_east_ac = ...
model_function_east_ac = ...
print(model_function_east_ac)

In [None]:
#Re-create the scatter plot with the best-fit line

#Your code here

plt.legend(['line of best fit', 'scatter']);

In [None]:
#Find R^2

#Your code here

### Putting it all together

Revisit your hypothesis. Was your hypothesis correct/incorrect? How can you tell? Make sure to use the scatter plots with your linear models and the R<sup>2</sup> value as evidence in your answer.

*Type your answer here*

Notebook developed by: Rachel McCarty, Kseniya Usovich, Laurel Hales, Alisa Bettale