## Assignment 5: Pandas

October 6, 2019

#### Esin Alpturk

## Description 

This time, we're working with New York City gas consumption data. More information can be found at https://data.cityofnewyork.us/Environment/Natural-Gas-Consumption-by-ZIP-Code-2010/uedp-fegm/data .

As a data analyst working to improve energy efficiency in NYC, my first job is to ingest and explore this data. That's what we'll do here.

## Data Ingestion

We'll load the data into my local environment with the pandas function read_csv, which can read a CSV file directly from a URL.

In [1]:
import pandas as pd

nyc_gas = pd.read_csv('https://raw.githubusercontent.com/CorkCork/Analytics-Programming/master/Module%205/Data/Natural_Gas_Consumption_by_ZIP_Code_-_2010.csv')

## Quick Data Preview

Let's look at the first ten rows to get an idea of what the data looks like.

In [2]:
nyc_gas.head(10)

Unnamed: 0,Zip Code,Building type (service class,Consumption (therms),Consumption (GJ),Utility/Data Source
0,10300,Commercial,470.0,50.0,National Grid
1,10335,Commercial,647.0,68.0,National Grid
2,10360,Large residential,33762.0,3562.0,National Grid
3,11200,Commercial,32125.0,3389.0,National Grid
4,11200,Institutional,3605.0,380.0,National Grid
5,11200,Small residential,3960.0,418.0,National Grid
6,11254,Small residential,1896.0,200.0,National Grid
7,11274,Commercial,8364.0,882.0,National Grid
8,11279,Commercial,2579.0,272.0,National Grid
9,11279,Large residential,301.0,32.0,National Grid


So, it seems that we have five columns of data, which represent:

* ZIP code (postal code)
* Building type
* Consumption in therms 
* Consumption in GJ 
* Utility that reported this data

## Data Size and Labels

Next, let's get some additional basic details about our data: the number of rows and columns, and the column labels.

In [3]:
nyc_gas.shape

(1015, 5)

In [4]:
list(nyc_gas.columns)

['Zip Code',
 'Building type (service class',
 ' Consumption (therms) ',
 ' Consumption (GJ) ',
 'Utility/Data Source']

We have 1015 rows. As we already established, we have 5 columns. When looking at the column labels, we notice a few things that might affect how easy it is to use them in code:

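* The column names are hard to work with: they mix upper and lower case, contain spaces and punctuation (including an unclosed parenthesis), and two of them have leading and trailing spaces.
* Consumption is reported twice, in therms and in GJ. It would be better to pick one unit and use it consistently in calculations.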
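To check that the two consumption columns really carry the same information, I can compare their ratio against the therm-to-GJ conversion factor. This is a minimal sketch: the four rows are copied from the preview above, the frame name `sample` is my own, and the factor of roughly 0.1055 GJ per therm is the standard approximate conversion rather than something stated in the dataset.

```python
import pandas as pd

# A few rows copied from the preview above (not the full dataset).
sample = pd.DataFrame({
    "therms": [470.0, 647.0, 33762.0, 32125.0],
    "gj": [50.0, 68.0, 3562.0, 3389.0],
})

# Approximate conversion factor (assumption: 1 therm is about 0.1055 GJ).
GJ_PER_THERM = 0.1055

# If the columns are redundant, every ratio should sit close to the factor.
# Small deviations are expected because the GJ values are rounded.
ratio = sample["gj"] / sample["therms"]
is_redundant = ((ratio - GJ_PER_THERM).abs() < 0.002).all()
print(ratio.round(4).tolist(), is_redundant)
```

If the ratios stay close to the factor across the full data, either column can be used on its own.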

I will find it useful to rename the columns of my DataFrame. I prefer to use labels that are all lowercase and have no punctuation other than underscore (_).

In [5]:
nyc_gas.rename(columns={'Zip Code' : "zip",
              'Building type (service class' : "building_type", 
              ' Consumption (therms) ' : "consumption_therms", 
              ' Consumption (GJ) ' : "consumption_gj", 
              'Utility/Data Source' : "utility_reporter"} , inplace=True)
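Typing the mapping by hand works well for five columns. As an alternative, here is a sketch of how the same kind of cleanup could be done with a small function passed to rename. The helper name `clean_label` and the stand-in frame `raw` are hypothetical, and the labels this produces are mechanical (for example zip_code rather than zip), so they differ from the hand-picked names above.

```python
import re

import pandas as pd

# Stand-in frame with the raw labels listed above; values copied from the preview.
raw = pd.DataFrame({
    "Zip Code": [10300, 10335],
    "Building type (service class": ["Commercial", "Commercial"],
    " Consumption (therms) ": [470.0, 647.0],
    " Consumption (GJ) ": [50.0, 68.0],
    "Utility/Data Source": ["National Grid", "National Grid"],
})


def clean_label(label):
    """Trim, lowercase, and collapse runs of non-alphanumerics into one underscore."""
    lowered = label.strip().lower()
    return re.sub(r"[^0-9a-z]+", "_", lowered).strip("_")


# rename accepts a function and applies it to every column label.
cleaned = raw.rename(columns=clean_label)
print(list(cleaned.columns))
```

This returns a new DataFrame instead of modifying the original in place, which makes it easier to compare the labels before and after.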

Let's check our renamed DataFrame to see whether the new column names are easier to use in code while still being readable:

In [6]:
nyc_gas.head()

Unnamed: 0,zip,building_type,consumption_therms,consumption_gj,utility_reporter
0,10300,Commercial,470.0,50.0,National Grid
1,10335,Commercial,647.0,68.0,National Grid
2,10360,Large residential,33762.0,3562.0,National Grid
3,11200,Commercial,32125.0,3389.0,National Grid
4,11200,Institutional,3605.0,380.0,National Grid


All looks good. Now we can do some actual data analysis to explore the data. We'll start with building types.

## Building Types

We have a few questions we want to answer here, including:

* How many distinct building types are included?
* What is the median energy consumption for each type?
* How do building types compare?

We'll start by looking at the number of unique building types:

In [7]:
nyc_gas['building_type'].unique()

array(['Commercial', 'Large residential', 'Institutional',
       'Small residential', 'Industrial', 'Large Residential',
       'Residential'], dtype=object)

I see that there are seven kinds of buildings, but two of them seem to be the same thing, just written differently, with different capitalization. I want to combine "Large residential" and "Large Residential" into one group. Also, it's unclear what "Residential" is -- is it small? Large? I'll leave just plain "Residential" on its own until we get more information.
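One possible way to merge the two spellings, different from the filtering I use below, is to normalize the capitalization of the labels themselves. This is only a sketch on a standalone Series holding the seven values shown above; the names `types` and `normalized` are my own.

```python
import pandas as pd

# The seven distinct labels returned by unique() above.
types = pd.Series([
    "Commercial", "Large residential", "Institutional",
    "Small residential", "Industrial", "Large Residential",
    "Residential",
])

# str.capitalize uppercases the first character and lowercases the rest,
# so "Large Residential" becomes "Large residential".
normalized = types.str.strip().str.capitalize()
print(normalized.nunique())
```

Plain "Residential" is left untouched by this, so it stays a separate group.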

There are many ways to accomplish what I want to do here. One way is to filter the DataFrame so that I get several smaller DataFrames, one for each type of building. That's what I'll do here, doing a quick peek to make sure I have the right data. Then I can find median values on the columns, in my case the consumption_gj column!

In [8]:
commercial_df = nyc_gas[nyc_gas["building_type"] == "Commercial"]
commercial_df.head()

Unnamed: 0,zip,building_type,consumption_therms,consumption_gj,utility_reporter
0,10300,Commercial,470.0,50.0,National Grid
1,10335,Commercial,647.0,68.0,National Grid
3,11200,Commercial,32125.0,3389.0,National Grid
7,11274,Commercial,8364.0,882.0,National Grid
8,11279,Commercial,2579.0,272.0,National Grid


In [9]:
commercial_df["consumption_gj"].median()

189413.0

I'll do the same for the other building types, and make sure to include both "large" types when I create that data frame!

In [10]:
institutional_df = nyc_gas[nyc_gas["building_type"] == "Institutional"]
institutional_df.head()

Unnamed: 0,zip,building_type,consumption_therms,consumption_gj,utility_reporter
4,11200,Institutional,3605.0,380.0,National Grid
15,11315,Institutional,339.0,36.0,National Grid
27,11400,Institutional,93140.0,9827.0,National Grid
32,11438,Institutional,1770.0,187.0,National Grid
36,11468,Institutional,49184.0,5189.0,National Grid


In [11]:
institutional_df["consumption_gj"].median()

66027.0

In [12]:
small_residential_df = nyc_gas[nyc_gas["building_type"] == "Small residential"]
small_residential_df.head()

Unnamed: 0,zip,building_type,consumption_therms,consumption_gj,utility_reporter
5,11200,Small residential,3960.0,418.0,National Grid
6,11254,Small residential,1896.0,200.0,National Grid
11,11303,Small residential,3009.0,317.0,National Grid
12,11313,Small residential,3488.0,368.0,National Grid
13,11314,Small residential,6011.0,634.0,National Grid


In [13]:
small_residential_df["consumption_gj"].median()

599600.0

In [14]:
industrial_df = nyc_gas[nyc_gas["building_type"] == "Industrial"]
industrial_df.head()

Unnamed: 0,zip,building_type,consumption_therms,consumption_gj,utility_reporter
26,11400,Industrial,275.0,29.0,National Grid
53,"11226(40.646505002304, -73.957190099144)",Industrial,65835.0,6946.0,National Grid
66,"11210(40.627722263871, -73.946537919989)",Industrial,63151.0,6663.0,National Grid
82,"10314(40.596490302985, -74.165991118795)",Industrial,441639.0,46595.0,National Grid
83,"11420(40.673345242689, -73.817707171649)",Industrial,8901.0,939.0,National Grid


In [15]:
industrial_df["consumption_gj"].median()

16867.5

In [16]:
residential_df = nyc_gas[nyc_gas["building_type"] == "Residential"]
residential_df.head()

Unnamed: 0,zip,building_type,consumption_therms,consumption_gj,utility_reporter
59,"10003(40.731943947555, -73.98887214913)",Residential,3229606.0,340742.0,ConEd
62,"10468(40.869693003729, -73.898926992487)",Residential,907812.0,95779.0,ConEd
68,"10034(40.870617109292, -73.924223258796)",Residential,226754.0,23924.0,ConEd
70,"10044(40.761976486226, -73.949999674945)",Residential,93171.0,9830.0,ConEd
72,"11358(40.760350968822, -73.796326458199)",Residential,405678.0,42801.0,ConEd


In [17]:
residential_df["consumption_gj"].median()

53288.5

In [18]:
large_residential_df = nyc_gas[nyc_gas["building_type"].isin(["Large Residential", "Large residential"])]
large_residential_df.head(10)

Unnamed: 0,zip,building_type,consumption_therms,consumption_gj,utility_reporter
2,10360,Large residential,33762.0,3562.0,National Grid
9,11279,Large residential,301.0,32.0,National Grid
16,11315,Large residential,335091.0,35354.0,National Grid
20,11335,Large residential,,,National Grid
28,11400,Large residential,280.0,30.0,National Grid
43,11474,Large residential,2223970.0,234641.0,National Grid
46,11477,Large residential,493904.0,52110.0,National Grid
56,"10467(40.877047912132, -73.871532500824)",Large Residential,4811979.0,507691.0,ConEd
71,"10010(40.739140782121, -73.982898456399)",Large Residential,212584.0,22429.0,ConEd
76,"11220(40.641184928741, -74.016764726711)",Large residential,6756193.0,712816.0,National Grid


In [19]:
large_residential_df["consumption_gj"].median()

160960.0

It looks as though the "Industrial" building type has the lowest median consumption (16,867.5 GJ), while the highest median consumption belongs to the "Small residential" building type (599,600 GJ). Before we proceed much further in our energy consumption analysis, we should understand more about how buildings are classified.
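Filtering into one DataFrame per building type works, but it repeats the same steps several times. A more compact option is groupby, sketched here on a handful of rows copied from the previews above. Because this is not the full dataset, the medians it prints will not match the ones I computed; the frame name `sample` is my own.

```python
import pandas as pd

# A few rows copied from the previews above (not the full dataset).
sample = pd.DataFrame({
    "building_type": [
        "Commercial", "Commercial", "Commercial",
        "Large residential", "Large Residential",
        "Small residential", "Small residential",
    ],
    "consumption_gj": [50.0, 68.0, 3389.0, 3562.0, 507691.0, 418.0, 200.0],
})

# Merge the two spellings of "Large residential" before grouping.
sample["building_type"] = sample["building_type"].str.capitalize()

# One median per building type, sorted from lowest to highest.
medians = (
    sample.groupby("building_type")["consumption_gj"]
    .median()
    .sort_values()
)
print(medians)
```

Like the median method on a single column, the grouped median skips missing values by default.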

## Utility Reporters

Again, we have a few questions to answer here:

* How many utility data reporters are included?
* What are the mean and standard deviation of their energy consumption (in GJ)?
* How do the different utility types compare?

I'm going to use an approach similar to the one I used with building types.

In [20]:
nyc_gas['utility_reporter'].unique()

array(['National Grid', 'ConEd'], dtype=object)

Great, only two possibilities here! I'll make two data frames, as always, peeking in a bit to make sure what I'm doing makes sense.

In [21]:
national_grid = nyc_gas[nyc_gas['utility_reporter'] == "National Grid"]
national_grid.head()

Unnamed: 0,zip,building_type,consumption_therms,consumption_gj,utility_reporter
0,10300,Commercial,470.0,50.0,National Grid
1,10335,Commercial,647.0,68.0,National Grid
2,10360,Large residential,33762.0,3562.0,National Grid
3,11200,Commercial,32125.0,3389.0,National Grid
4,11200,Institutional,3605.0,380.0,National Grid


In [22]:
national_grid['consumption_gj'].mean()

357475.56048387097

In [23]:
national_grid['consumption_gj'].std()

562355.2736235436

In [24]:
coned = nyc_gas[nyc_gas['utility_reporter'] == "ConEd"]
coned.head()

Unnamed: 0,zip,building_type,consumption_therms,consumption_gj,utility_reporter
51,"11109(40.744414792409, -73.957702346686)",Commercial,45899.0,4843.0,ConEd
52,"11429(40.709913120494, -73.738640316098)",Commercial,755.0,80.0,ConEd
56,"10467(40.877047912132, -73.871532500824)",Large Residential,4811979.0,507691.0,ConEd
59,"10003(40.731943947555, -73.98887214913)",Residential,3229606.0,340742.0,ConEd
61,"10451(40.820696407114, -73.923841367985)",Commercial,8071587.0,851598.0,ConEd


In [25]:
coned['consumption_gj'].mean()

224575.75049115912

In [26]:
coned['consumption_gj'].std()

298958.0488076621
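The mean and standard deviation for both utilities can also be computed in one step with groupby and agg. This is a minimal sketch on six rows copied from the previews above, so its numbers will differ from the full-data results; the names `sample` and `summary` are my own.

```python
import pandas as pd

# Three rows per utility, copied from the previews above (not the full dataset).
sample = pd.DataFrame({
    "utility_reporter": [
        "National Grid", "National Grid", "National Grid",
        "ConEd", "ConEd", "ConEd",
    ],
    "consumption_gj": [50.0, 68.0, 3562.0, 4843.0, 80.0, 507691.0],
})

# agg with a list of names returns one column per statistic.
# pandas computes std as the sample standard deviation (ddof=1) by default.
summary = (
    sample.groupby("utility_reporter")["consumption_gj"]
    .agg(["mean", "std", "count"])
)
print(summary)
```

Including the count alongside the mean and standard deviation helps when comparing groups of very different sizes.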

I noticed that some of the zip codes are listed with latitude and longitude coordinates appended in parentheses. It would be better to drop the coordinates so that the zip column contains only the five-digit code.
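One possible way to do that cleanup is to keep only the leading five digits of each value. This is a sketch on three values copied from the previews above, assuming every entry starts with a five-digit zip code; the names `zips` and `zip_only` are my own.

```python
import pandas as pd

# Values copied from the previews above: one plain zip and two with coordinates.
zips = pd.Series([
    10300,
    "11226(40.646505002304, -73.957190099144)",
    "10003(40.731943947555, -73.98887214913)",
])

# Convert everything to text, then capture the five digits at the start.
# expand=False returns a Series instead of a one-column DataFrame.
zip_only = zips.astype(str).str.extract(r"^(\d{5})", expand=False)
print(zip_only.tolist())
```

Entries that do not start with five digits would come back as missing values, which makes them easy to spot.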

## Thank you for reading!