



# Energy Consumption Data Notebook 2: Data Analysis on the Cleaned Dataset


---

### Goals For This Notebook:

1 - Import our merged weather and power dataset.<br>

2 - Use the operation **GroupBy** to group data by time and month.<br>

3 - Create bar plots to examine which months use the most energy and why.<br>

---

### Table of Contents

1 - [Import Data](#section1)<br>

2 - [GroupBy](#section2)<br>

3 - [Data Visualization - Which Months Use the Most Energy?](#section3)<br>

---

In this notebook, we will work with the merged weather and power dataset from Energy Consumption Data Notebook 1. We will use an operation called GroupBy to group the data by time of day and by month. Finally, we will create our first data visualizations to see which months consumed the most energy and try to find out why.

Let's first get started by importing the libraries we need:

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

## 1. Import Data <a id='section1'></a>

Let's import the data from the last notebook. Recall that we saved it in the file `weather_and_power.csv` in the *data* folder.

In [None]:
# EXERCISE

#Import it below!
weather_and_power = pd....("...")
weather_and_power.head()

When we imported the table, we got an extra column called "Unnamed: 0". It does not contain any important information, so we do not need it.
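
If you are curious where a column like this comes from, the sketch below reproduces it on a tiny made-up table. It assumes the file was saved with the default settings of `to_csv()`, which is a guess about how Notebook 1 wrote the file; the names `toy`, `csv_text`, and `reloaded` are hypothetical and used only for illustration.

```python
import io
import pandas as pd

# Hypothetical toy table standing in for the real dataset
toy = pd.DataFrame({"month": [1, 2], "power": [100.0, 150.0]})

# By default, to_csv() also writes the row index as an unnamed first column
csv_text = toy.to_csv()

# Reading the text back in gives that unnamed column the label "Unnamed: 0"
reloaded = pd.read_csv(io.StringIO(csv_text))
print(list(reloaded.columns))  # → ['Unnamed: 0', 'month', 'power']
```

The extra column just holds the old row numbers (0, 1, ...), which is why it is safe to remove.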

Look back at your Energy Consumption Data Notebook 1 and remove the column.

In [None]:
# EXERCISE - Remove the column "Unnamed: 0"

# Your code here

weather_and_power.head()

Let's quickly get a feel for the data by using the `describe()` function on our dataframe:

In [None]:
# EXERCISE

...

## 2. GroupBy <a id='section2'></a>

When we used the function `describe()` above, we got useful statistics on each column, such as the mean. However, this is the mean over *all* time. What if we wanted the mean power consumption at each time of day, to see whether appliances use more energy at certain times? Or what if we wanted the mean power consumption for just one particular month?

We can do this with the pandas library by using an operation called `groupby()`. `groupby()` allows you to group data by the values in one or more columns of your choice, then apply a function (e.g. `sum()`, `mean()`, `count()`, etc.) to each group.
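
To make the idea concrete, here is a minimal sketch on a made-up table that is small enough to check by hand. The dataframe `toy` and its column names are hypothetical and are not part of our dataset.

```python
import pandas as pd

# Hypothetical toy table: two readings in month 1, three in month 2
toy = pd.DataFrame({
    "month": [1, 1, 2, 2, 2],
    "temperature": [50.0, 54.0, 60.0, 62.0, 64.0],
})

# Group the rows by month, pick one column, then average within each group
monthly_mean = toy.groupby("month")["temperature"].mean()
print(monthly_mean)
```

The result has one row per group: month 1 averages 52.0 (from 50 and 54), and month 2 averages 62.0 (from 60, 62, and 64).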

Let's do an example. We will group our weather and power dataset by `12-hr-time`. Pandas will put all the data points taken at 12:00am in one group, all those taken at 12:15am in another, and so on. Then we will look at the column `outdoor air temperature (F)` and use the function `mean()` to find the mean outdoor air temperature for each time.

In [None]:
#EXAMPLE

weather_and_power.groupby("12-hr-time")["outdoor air temperature (F)"].mean()

What did we get? For each 15-minute time of day, we have the average temperature of all the data points recorded at that time over the whole period (4 months).

But does the data make sense? We can use `sort_values()` to order the times from lowest average temperature to highest average temperature:

In [None]:
#EXAMPLE

weather_and_power.groupby("12-hr-time")["outdoor air temperature (F)"].mean().sort_values()

You should see that the lowest average temperature was measured at 6:30am, while the highest average temperature was measured at 1:45pm. This makes sense: temperatures are generally coolest after cooling overnight and before the sun rises, and highest around midday.
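
If you want to see what `sort_values()` does on something tiny, here is a sketch with a made-up series. The labels and numbers are invented for illustration and do not come from our dataset.

```python
import pandas as pd

# Hypothetical averages, keyed by a made-up time-of-day label
toy_means = pd.Series({"morning": 48.2, "midday": 66.1, "evening": 55.4})

# With no arguments, sort_values() orders the rows from smallest to largest value
ordered = toy_means.sort_values()
print(ordered)

# The first label is the coolest time, the last label is the warmest
print(ordered.index[0], ordered.index[-1])
```

Notice that the labels travel with their values, so after sorting you can read the coolest and warmest times straight off the top and bottom rows.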

For the below exercise, try to find the mean outdoor air temperature for each month. See if your results make sense.

In [None]:
#EXERCISE

weather_and_power.groupby("...")["..."].mean()

Another feature of `groupby()` is `.agg()`, which allows us to apply different statistics to several columns at once. The `#EXAMPLE` cell below finds the maximum power consumption of the west and east air conditioners, as well as the mean outdoor air temperature, for each month.
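
Before running `.agg()` on the full dataset, it may help to see it on a made-up table you can check by hand. The dataframe `toy` and its column names are hypothetical. The sketch names each statistic with a string (`"max"`, `"mean"`), which pandas accepts in place of a NumPy function.

```python
import pandas as pd

# Hypothetical toy table with two numeric columns
toy = pd.DataFrame({
    "month": [1, 1, 2, 2],
    "power": [100.0, 300.0, 200.0, 250.0],
    "temperature": [50.0, 54.0, 60.0, 64.0],
})

# The dictionary maps each column name to the statistic we want for it
summary = toy.groupby("month").agg({"power": "max", "temperature": "mean"})
print(summary)
```

Each row of `summary` is one month: month 1 has a maximum power of 300.0 and a mean temperature of 52.0, while month 2 has 250.0 and 62.0. The example cell that follows uses the same pattern on the real dataset.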

In [None]:
#EXAMPLE

# Pass the statistics as strings; recent pandas versions warn when given np.max or np.mean here
weather_and_power.groupby("month").agg({"west air conditioning power consumption (Watts)": "max",
                                        "east air conditioning power consumption (Watts)": "max",
                                        "outdoor air temperature (F)": "mean"})

In the cell below, group by the "12-hr-time" column, and find the mean of just the "building total power consumption (Watts)" and the "outdoor air temperature (F)" columns.

In [None]:
# EXERCISE

# Find the average building power consumption and average outdoor air temperature for every time in the 12-hr-time column
grouped = weather_and_power.groupby("...").agg({"...": ...,
                            "...": ...})

grouped

Which time has the greatest average building total power consumption? Which time has the lowest average outdoor air temperature?

*Hint: You can apply `sort_values()` on the grouped dataframe you created above.*

In [None]:
#EXERCISE - Which time has the greatest average building total power consumption?

grouped.sort_values("...", ascending=False) #What does ascending=False do?

In [None]:
#EXERCISE - Which time has the lowest average outdoor air temperature?

...

## 3. Data Visualization - Which Months Use the Most Energy?<a id='section3'></a>

Now we will apply `groupby()` to answer the question: which months use the most energy?

**Discuss with your group - which column from the data is most appropriate to measure power consumption of the whole pilot site?** You might want to take a look at the dataframe again below:

In [None]:
#Use the function .head() on the dataframe to refresh your memory on the dataframe you are working with:

weather_and_power.head()

In the cell below, group by "month" and find the average power consumption of the pilot site.

In [None]:
# EXERCISE

month_data_building = weather_and_power.groupby("...")["..."]....

month_data_building

Let's visualize this data! We will make a bar chart of the table we just created.

First, you may notice that `month_data_building` was returned as a series rather than a dataframe, with the months as its index. That makes it awkward to select the months as a column when we plot with matplotlib. Luckily, we just need to use the function `reset_index()` so the months aren't treated as the index anymore. We will save the result as a new dataframe:

In [None]:
month_data_building_df = month_data_building.reset_index()
month_data_building_df
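
To see exactly what `reset_index()` changes, here is a minimal sketch on a made-up table. The names `toy`, `as_series`, and `as_frame` are hypothetical and used only for illustration.

```python
import pandas as pd

# Hypothetical toy table
toy = pd.DataFrame({"month": [1, 1, 2], "power": [100.0, 200.0, 300.0]})

# Selecting one column after groupby gives a Series, with month as its index
as_series = toy.groupby("month")["power"].mean()
print(type(as_series).__name__)  # → Series

# reset_index() moves the index into an ordinary column and returns a DataFrame
as_frame = as_series.reset_index()
print(type(as_frame).__name__)   # → DataFrame
print(list(as_frame.columns))    # → ['month', 'power']
```

After the reset, both "month" and "power" are regular columns, so each can be selected by name when building a plot.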

Now use matplotlib to create a bar plot to show how much power is consumed over different months:

In [None]:
x_data = month_data_building_df["..."]
y_data = month_data_building_df["..."]

plt.title("Average Total Power Consumption (Watts) For Each Month")
plt.xlabel("Month")
plt.ylabel("Average Total Power Consumption (Watts)")
plt.bar(x_data, y_data, tick_label=["January", "February", "March", "April"]); 

#Note: If you are wondering why we are using the tick_label argument, try running the code without it!

Which month used the most energy? Which used the least? What reasons do you think explain that?

In the cell below, group by "month" again, and this time find the average outdoor temperature. Then make a bar chart of the result.

In [None]:
# EXERCISE - Use groupby

month_data_temp = weather_and_power...

month_data_temp

In [None]:
#EXERCISE - Create a bar chart

#Don't forget to use reset_index() first
month_data_temp_df = month_data_temp....

#Create your bar plot
x_data = ...
y_data = ...

plt.title("...")
plt.xlabel("...")
plt.ylabel("...")
plt.bar(..., ..., tick_label=["January", "February", "March", "April"]);

Seeing this chart, does it make sense that February and March use the most energy? Why or why not?

Notebook developed by: Rachel McCarty, Kseniya Usovich, Laurel Hales, Alisa Bettale