# Part 2 - Open your data & start exploring

If you're in the class, the data you need is already available. If not, you need to download:
* PEP_2014_PEPANNRES_with_ann.csv - US Census Population Estimates for populated places in Colorado - [from American Factfinder](http://factfinder.census.gov/bkmk/table/1.0/en/PEP/2014/PEPANNRES/0400000US08.16200)
* [njaccidents.csv](https://s3.amazonaws.com/nicar15/njaccidents.csv)



## Loading a data file


Before anything else, we import `pandas`. By convention, it's imported as `pd` to save typing. Reading in a CSV is just as easy as you'd guess. We'll load the Colorado population estimates. 

In [2]:
import pandas as pd
popest = pd.read_csv('PEP_2014_PEPANNRES_with_ann.csv') 

`popest` is a `DataFrame`. Mostly, you can think of a `DataFrame` as a single worksheet from a spreadsheet or, even better, as a database table.

Each column of your `DataFrame` has a name and specific datatype. Technically, the columns are `Series` objects. A lot of the time you don't have to worry about the terminology, but it can pay off to know the right words when googling or asking for help.
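
To make the `DataFrame`/`Series` distinction concrete, here is a minimal sketch on a tiny hand-typed frame instead of the real file. The name `toy` and its three rows are just for illustration.

```python
import pandas as pd

# Hand-typed toy data, standing in for the real Census file.
toy = pd.DataFrame({
    'name': ['Aguilar town', 'Akron town', 'Alma town'],
    'est2014': [479, 1694, 275],
})

print(type(toy))             # the whole table is a DataFrame
print(type(toy['est2014']))  # a single column is a Series
print(toy.dtypes)            # every column has its own datatype
```

If you ever get a confusing error message, checking `type(...)` and `.dtypes` like this is usually a good first step.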

You'll see that one reason people like `pandas` and `jupyter` is that you can just kinda 'look' at your data.

In [3]:
popest.head() # head() Defaults to giving you the first five rows of the data frame

Unnamed: 0,GEO.id,GEO.id2,GEO.display-label,rescen42010,resbase42010,respop72010,respop72011,respop72012,respop72013,respop72014
0,Id,Id2,Geography,"April 1, 2010 - Census","April 1, 2010 - Estimates Base",Population Estimate (as of July 1) - 2010,Population Estimate (as of July 1) - 2011,Population Estimate (as of July 1) - 2012,Population Estimate (as of July 1) - 2013,Population Estimate (as of July 1) - 2014
1,1620000US0800760,0800760,"Aguilar town, Colorado",538,538,535,520,519,496,479
2,1620000US0800925,0800925,"Akron town, Colorado",1702,1702,1699,1700,1678,1693,1694
3,1620000US0801090,0801090,"Alamosa city, Colorado",8780,8828,9268,9393,9423,9550,9531
4,1620000US0801530,0801530,"Alma town, Colorado",270,270,271,269,269,271,275


### Loading, refined

Notice row `0` -- it's not a data row, it's a second header. This kind of thing is common in Excel, but it can tangle you up when you use python code for data analysis. Let's fix that.
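
Before doing it on the real file, here is a rough sketch of the idea on a made-up miniature of it (the string `raw` is invented for illustration, and the code assumes Python 3). It also shows a pitfall worth knowing: when you pass `names=`, `header=None` keeps every remaining row, while `header=0` treats the first remaining row as a header and throws it away.

```python
import io
import pandas as pd

# Invented miniature of the Census file: a machine-readable header row,
# then a human-readable one, then two data rows.
raw = (
    "GEO.id2,respop72014\n"
    "Id2,Population Estimate (as of July 1) - 2014\n"
    "0800760,479\n"
    "0800925,1694\n"
)
names = ['GEO.id2', 'respop72014']

# Skip both header rows and say "there is no header left".
kept = pd.read_csv(io.StringIO(raw), skiprows=2, header=None, names=names)

# Same, but header=0: the first data row is consumed as a header and replaced.
lost = pd.read_csv(io.StringIO(raw), skiprows=2, header=0, names=names)

print(len(kept))  # both data rows survive
print(len(lost))  # one data row has gone missing
```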

In [4]:
popest = pd.read_csv('PEP_2014_PEPANNRES_with_ann.csv', 
                      skiprows=2, # we can't use row 0 and skip only row 1 so we skip both
                      header=None, # and tell pandas that it shouldn't treat the first non-skipped row as a header
                      names=[u'GEO.id', u'GEO.id2', # and since there's no header, we provide the names
                             u'GEO.display-label', u'rescen42010',
                             u'resbase42010', u'respop72010', u'respop72011', 
                             u'respop72012', u'respop72013', u'respop72014'])
popest.head()

Unnamed: 0,GEO.id,GEO.id2,GEO.display-label,rescen42010,resbase42010,respop72010,respop72011,respop72012,respop72013,respop72014
0,1620000US0800760,800760,"Aguilar town, Colorado",538,538,535,520,519,496,479
1,1620000US0800925,800925,"Akron town, Colorado",1702,1702,1699,1700,1678,1693,1694
2,1620000US0801090,801090,"Alamosa city, Colorado",8780,8828,9268,9393,9423,9550,9531
3,1620000US0801530,801530,"Alma town, Colorado",270,270,271,269,269,271,275
4,1620000US0802355,802355,"Antonito town, Colorado",781,781,782,784,779,776,775


Huh. This time when it loaded `GEO.id2`, which a lot of people would call a *FIPS code*, it lost the leading `0`. You've probably fought this battle before with US ZIP codes too. The software needs a clue that those leading zeroes are important to us.
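
Here is a small sketch of what that clue can look like, using an invented two-row CSV rather than the real file (assumes Python 3). Passing `dtype` tells `pandas` to keep the column as text instead of guessing that it's a number.

```python
import io
import pandas as pd

# Invented two-row CSV that mimics the ID column of the Census file.
raw = "FIPS,name\n0800760,Aguilar town\n0800925,Akron town\n"

guessed = pd.read_csv(io.StringIO(raw))                       # pandas guesses integers
as_text = pd.read_csv(io.StringIO(raw), dtype={'FIPS': str})  # keep the codes as text

print(guessed['FIPS'].tolist())  # leading zeroes are gone
print(as_text['FIPS'].tolist())  # leading zeroes preserved
```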

The reason we even care about the FIPS code is that it provides a clean unique identifier to link data between datasets. Maybe we have another dataset from the state of Colorado where the place names don't include _, Colorado_ -- the FIPS code saves us the headache of matching names which are _almost but not quite the same._ In database terms, it's a "unique identifier" or a "primary key". You can tell `pandas` that a certain column serves this purpose by making it the *index*. `DataFrames` and `Series` can be easily joined if they have compatible indexes, even if the rows are in a different order or the datasets have different numbers of rows.
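
As a rough illustration of joining on the index, here is a sketch with two hand-built frames. The `counties` frame is made up for the example; it is deliberately in a different row order and is missing one place.

```python
import pandas as pd

# Population keyed by FIPS code (values from the first rows of the file).
pop = pd.DataFrame(
    {'est2014': [479, 1694, 9531]},
    index=pd.Index(['0800760', '0800925', '0801090'], name='FIPS'),
)

# Made-up second dataset: different row order, and no row for Akron.
counties = pd.DataFrame(
    {'county': ['Alamosa', 'Las Animas']},
    index=pd.Index(['0801090', '0800760'], name='FIPS'),
)

joined = pop.join(counties)  # matches rows by index; a left join by default
print(joined)
```

Rows line up by FIPS code regardless of order, and the place with no match simply gets a missing value in the `county` column.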

So, two things:
* we want to reload the data without losing that leading zero
* we want to make it clear to `pandas` that the FIPS code column is the index

So, once more, from the top.

In [5]:
popest = pd.read_csv('PEP_2014_PEPANNRES_with_ann.csv', 
                      skiprows=2, # skip both header rows, as before
                      # careful: combined with names=, header=0 treats the first non-skipped
                      # row (Aguilar) as a header and replaces it, which is why Aguilar is
                      # missing below; header=None, as in the previous cell, keeps every row
                      header=0,
                      names=[u'GEO.id', u'FIPS', # we provide our own names, renaming GEO.id2 to FIPS
                             u'GEO.display-label', u'rescen42010', # just copy/pasted from the output above
                             u'resbase42010', u'respop72010', u'respop72011', 
                             u'respop72012', u'respop72013', u'respop72014'],
                      index_col=1, # use the second column (FIPS) as the index
                      dtype={'FIPS': 'S7'} # should be able to use object instead of 'S7', I filed a bug!
                    )
popest.head()

FIPS,GEO.id,GEO.display-label,rescen42010,resbase42010,respop72010,respop72011,respop72012,respop72013,respop72014
0800925,1620000US0800925,"Akron town, Colorado",1702,1702,1699,1700,1678,1693,1694
0801090,1620000US0801090,"Alamosa city, Colorado",8780,8828,9268,9393,9423,9550,9531
0801530,1620000US0801530,"Alma town, Colorado",270,270,271,269,269,271,275
0802355,1620000US0802355,"Antonito town, Colorado",781,781,782,784,779,776,775
0803235,1620000US0803235,"Arriba town, Colorado",193,193,193,193,193,191,194


OK, now we can trust that we'll be able to join this with another data set. (Note that `jupyter` displays the `index` in bold.) That `GEO.id` column seems redundant. Also, while they may be important in some cases, we don't really need the 2010 decennial census count or the base value used for the estimates. Let's simplify things a little.

In [6]:
# don't forget axis=1 if you're dropping columns; default is to try to drop a row.
popest = popest.drop(['GEO.id', 'rescen42010', 'resbase42010'], axis=1) 

And let's fix those column names to something easier to type.

In [7]:
popest = popest.rename(columns={ # yeah, we coulda done this when we read in the CSV but sometimes you realize it later.
        'GEO.display-label': 'name',
        'respop72010': 'est2010', 
        'respop72011': 'est2011', 
        'respop72012': 'est2012', 
        'respop72013': 'est2013', 
        'respop72014': 'est2014', 
    })
popest.head()

FIPS,name,est2010,est2011,est2012,est2013,est2014
0800925,"Akron town, Colorado",1699,1700,1678,1693,1694
0801090,"Alamosa city, Colorado",9268,9393,9423,9550,9531
0801530,"Alma town, Colorado",271,269,269,271,275
0802355,"Antonito town, Colorado",782,784,779,776,775
0803235,"Arriba town, Colorado",193,193,193,191,194


Ah, that's a lot tidier. One thing you should note: in both of the last two examples, we reassigned `popest` to the result of a method call. This is because these and many other operations in `pandas` don't change the original `DataFrame`. Instead, they return a new `DataFrame` with the change applied. You will inevitably forget this and then wonder why that change you thought you made isn't there. This is why. The good news is that it will become a habit, mostly.
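
If you want to see the "forgot to reassign" mistake in action, here is a minimal sketch on a throwaway frame (the names `df` and `before` are just for this example).

```python
import pandas as pd

df = pd.DataFrame({'respop72014': [479, 1694]})

# Calling rename() without keeping the result changes nothing in df.
df.rename(columns={'respop72014': 'est2014'})
before = list(df.columns)
print(before)            # still the old column name

# Reassigning keeps the renamed copy.
df = df.rename(columns={'respop72014': 'est2014'})
print(list(df.columns))  # now the new name
```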

## Exploring your data

OK, now that we have the data loaded, what do we have? 

Like we said before, each column of a `DataFrame` is a `Series`. You get at the columns by using the `DataFrame` like a Python `dict`, with the column names as keys.

In [8]:
popest['est2014'].head()

FIPS
0800925    1694
0801090    9531
0801530     275
0802355     775
0803235     194
Name: est2014, dtype: int64

You can get specific values from the `Series` by using the index value as a key.

In [9]:
print("The 2014 estimated population for FIPS 0800925 is {}".format(popest['est2014']['0800925']))

The 2014 estimated population for FIPS 0800925 is 1694


Like in Javascript, in some cases you can also get a column from the `DataFrame` by treating the column name as an attribute, although that doesn't work if the name has spaces or is a Python reserved word. For this lesson, we'll stick with the dictionary-access style, but just to show you, look at this:

In [10]:
popest.est2014.head()

FIPS
0800925    1694
0801090    9531
0801530     275
0802355     775
0803235     194
Name: est2014, dtype: int64

See? Same difference.
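
One more trap with attribute access, sketched on a made-up frame: if a column happens to share its name with a `DataFrame` method (here, a hypothetical column called `count`), the attribute gives you the method, not your data. Dictionary-style access doesn't have this problem.

```python
import pandas as pd

# Made-up frame with one "safe" column name and one that collides with a method.
df = pd.DataFrame({'est2014': [479, 1694], 'count': [1, 2]})

same = df.est2014.equals(df['est2014'])
print(same)                  # attribute and key access agree for est2014
print(callable(df.count))    # df.count is the DataFrame.count method, not the column
print(df['count'].tolist())  # key access still reaches the column
```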

By now you must be itching to actually learn something about the data. It's easy to get basic summary statistics from a `Series`.

In [11]:
popest['est2014'].describe() # describe just a series

count       270.000000
mean      14652.214815
std       57327.321746
min           8.000000
25%         426.500000
50%        1181.500000
75%        5967.000000
max      663862.000000
Name: est2014, dtype: float64

In [12]:
popest.describe() # or describe the whole dataset. only numeric columns are included in the results

Unnamed: 0,est2010,est2011,est2012,est2013,est2014
count,270.0,270.0,270.0,270.0,270.0
mean,13705.62963,13925.751852,14149.511111,14396.911111,14652.214815
std,52922.920974,54053.576092,55112.97791,56210.091428,57327.321746
min,8.0,8.0,8.0,8.0,8.0
25%,425.5,424.25,425.25,426.75,426.5
50%,1189.5,1184.0,1167.0,1175.5,1181.5
75%,5669.75,5742.75,5827.25,5902.5,5967.0
max,603365.0,619390.0,633868.0,648401.0,663862.0


There are 270 places in our dataset. The average 2014 estimated population is about 14,652, but the median is only 1181.5. 
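
Why are the mean and median so far apart? A few very large places pull the mean up while the median barely notices. Here is a small sketch using just five values that appear in the outputs above (the minimum, the maximum, and three from the first rows); it is only an illustration, not a summary of the full dataset.

```python
import pandas as pd

# Five 2014 estimates: the smallest place, three small towns, and the largest city.
sample = pd.Series([8, 275, 1694, 9531, 663862])

print(sample.mean())    # dragged far upward by the one huge value
print(sample.median())  # the middle value, unaffected by the outlier
```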

The smallest place has only 8 people. Which one is it?

In [13]:
popest[popest['est2014'] == 8]

FIPS,name,est2010,est2011,est2012,est2013,est2014
0842495,"Lakeside town, Colorado",8,8,8,8,8


Ah, scenic [Lakeside](http://censusreporter.org/profiles/16000US0842495-lakeside-co/).
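
That `popest[popest['est2014'] == 8]` expression does two things at once, so here is a sketch that pulls them apart on a three-row toy frame. The comparison builds a `Series` of `True`/`False` values (often called a mask), and indexing with it keeps only the `True` rows. The sketch also shows `idxmin()`, which finds the smallest value's index label without you having to know the value is 8.

```python
import pandas as pd

toy = pd.DataFrame(
    {'name': ['Akron town', 'Lakeside town', 'Alma town'],
     'est2014': [1694, 8, 275]},
    index=pd.Index(['0800925', '0842495', '0801530'], name='FIPS'),
)

mask = toy['est2014'] == 8      # one True/False per row
print(mask.tolist())

smallest = toy[mask]            # keep only the rows where the mask is True
print(smallest)

print(toy['est2014'].idxmin())  # index label of the smallest value
```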

Maybe you just want a sorted list. That's pretty easy. Just remember that the default sort is low-to-high.

In [14]:
popest.sort_values(by='est2014', ascending=False).head(10) # don't forget that default sort is ascending not descending.

FIPS,name,est2010,est2011,est2012,est2013,est2014
0820000,"Denver city, Colorado",603365,619390,633868,648401,663862
0816000,"Colorado Springs city, Colorado",420455,427416,433630,440137,445830
0804000,"Aurora city, Colorado",325977,332633,339302,346201,353108
0827425,"Fort Collins city, Colorado",144505,145977,148956,152365,156480
0843000,"Lakewood city, Colorado",143156,144184,145461,146992,149643
0877290,"Thornton city, Colorado",119395,121885,124360,127728,130307
0803455,"Arvada city, Colorado",106635,107578,109690,111600,113574
0883835,"Westminster city, Colorado",106446,107748,109261,110978,112090
0862000,"Pueblo city, Colorado",106887,107341,107813,108083,108423
0812815,"Centennial city, Colorado",100871,102479,104074,106431,107201


Maybe you only want to see some of the data.

In [15]:
popest[['name', 'est2014']].head()

FIPS,name,est2014
0800925,"Akron town, Colorado",1694
0801090,"Alamosa city, Colorado",9531
0801530,"Alma town, Colorado",275
0802355,"Antonito town, Colorado",775
0803235,"Arriba town, Colorado",194


Maybe you're interested in change over time. You can add a new column to your `DataFrame` as a computation, like this.

In [16]:
popest['change'] = ((popest['est2014']-popest['est2010'])/popest['est2010'])*100
popest['change'].head()

FIPS
0800925   -0.294291
0801090    2.837721
0801530    1.476015
0802355   -0.895141
0803235    0.518135
Name: change, dtype: float64
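
The arithmetic in that cell runs on whole columns at once: each row's 2014 value is paired with the same row's 2010 value. As a sanity check, here is a sketch that repeats the calculation on the first two rows and compares it to doing one row by hand (assumes Python 3, where `/` is true division).

```python
import pandas as pd

toy = pd.DataFrame(
    {'est2010': [1699, 9268], 'est2014': [1694, 9531]},
    index=pd.Index(['0800925', '0801090'], name='FIPS'),
)

# Column arithmetic: computed row by row, aligned on the index.
toy['change'] = (toy['est2014'] - toy['est2010']) / toy['est2010'] * 100

# The same formula for Akron, typed out by hand.
by_hand = (1694 - 1699) / 1699 * 100

print(toy['change'])
print(by_hand)
```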

Remember, you can treat a series (like our new `change` column) like a `dict`, looking values up by index label, or like a `list`, looking them up by position with `.iloc`.

In [17]:
print("FIPS 0800925 had {:.2f}% population change".format(popest['change']['0800925']))
print("The third place in the list had {:.2f}% population change".format(popest['change'].iloc[2])) # zero-indexed

FIPS 0800925 had -0.29% population change
The third place in the list had 1.48% population change
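
When you want to be explicit about which kind of lookup you mean, `pandas` has `.loc` for index labels and `.iloc` for positions. A minimal sketch, using the first three `change` values from the output above typed in by hand:

```python
import pandas as pd

change = pd.Series(
    [-0.294291, 2.837721, 1.476015],
    index=pd.Index(['0800925', '0801090', '0801530'], name='FIPS'),
    name='change',
)

by_label = change.loc['0800925']  # dict-style: look up by index label
by_position = change.iloc[2]      # list-style: third item, zero-indexed

print(by_label)
print(by_position)
```

Spelling out `.loc` or `.iloc` avoids any ambiguity about whether a value like `2` is meant as a label or a position.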


Which places had the most change?

In [18]:
popest.sort_values(by='change',ascending=False).head()

FIPS,name,est2010,est2011,est2012,est2013,est2014,change
0877510,"Timnath town, Colorado",628,796,1163,1538,1983,215.764331
0839855,"Johnstown town, Colorado",9982,10429,11074,12108,13306,33.29994
0828360,"Frederick town, Colorado",8720,8969,9417,10203,10927,25.309633
0869040,"Seibert town, Colorado",180,183,182,181,218,21.111111
0845955,"Lone Tree city, Colorado",11295,11564,11902,13286,13545,19.920319


Wow, Timnath more than tripled in size in four years! But we can see that it's an outlier. Maybe we want to just analyze bigger cities.

In [19]:
popest[popest['est2014'] > 100000].count()

name       11
est2010    11
est2011    11
est2012    11
est2013    11
est2014    11
change     11
dtype: int64

OK. There are 11 places in Colorado with more than 100,000 population. That output is kinda janky, though. Fortunately, we can also use `len`.

In [20]:
print("There are {} cities with population larger than 100,000.".format(len(popest[popest['est2014'] > 100000])))

There are 11 cities with population larger than 100,000.
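
Another way to count matches, sketched here on a four-row toy frame: since `True` counts as 1 and `False` as 0, you can sum the boolean mask directly without building the filtered `DataFrame` first.

```python
import pandas as pd

toy = pd.DataFrame({'est2014': [663862, 1694, 353108, 275]})

mask = toy['est2014'] > 100000

print(len(toy[mask]))   # filter, then count the rows
print(int(mask.sum()))  # or just add up the True values
```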


We can slice off just the bigger cities, here the places that already had more than 100,000 people in the 2010 estimate, into another dataframe.

In [21]:
bigtowns = popest[popest['est2010'] > 100000]
bigtowns.sort_values(by='change', ascending=False).head() 

FIPS,name,est2010,est2011,est2012,est2013,est2014,change
0820000,"Denver city, Colorado",603365,619390,633868,648401,663862,10.026601
0877290,"Thornton city, Colorado",119395,121885,124360,127728,130307,9.139411
0804000,"Aurora city, Colorado",325977,332633,339302,346201,353108,8.32298
0827425,"Fort Collins city, Colorado",144505,145977,148956,152365,156480,8.28691
0803455,"Arvada city, Colorado",106635,107578,109690,111600,113574,6.507244
