# Part 2 - Open your data & start exploring

If you're in the class, the data you need is already available. If not, you need to download:
* PEP_2014_PEPANNRES_with_ann.csv - US Census Population Estimates for populated places in Colorado - [from American FactFinder](http://factfinder.census.gov/bkmk/table/1.0/en/PEP/2014/PEPANNRES/0400000US08.16200)

To use this notebook, select each code cell in order and run it. The easiest way to run a cell is to type shift-enter, which runs the cell and moves the focus to the next cell. If the next cell is a markdown cell, use the arrow key to move to the next cell. If you feel like changing the code around a little and re-running it, go ahead! If you mess things up, you can always go to the **Kernel** menu above and choose _Restart & Clear Output_.

## Loading a data file


Before anything else, we import `pandas`. By convention, it's imported as `pd` to save typing. Reading in a CSV is just as easy as you'd guess. We'll load the Colorado population estimates. 

In [None]:
import pandas as pd
popest = pd.read_csv('PEP_2014_PEPANNRES_with_ann.csv') 

`popest` is a `DataFrame`. Mostly, you can think of a `DataFrame` as a single worksheet from a spreadsheet or, even better, as a database table.

Each column of your `DataFrame` has a name and specific datatype. Technically, the columns are `Series` objects. A lot of the time you don't have to worry about the terminology, but it can pay off to know the right words when googling or asking for help.
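Here's a minimal sketch of that terminology, using a tiny made-up table rather than the Census file (the names `toy` and `name_column` are just for illustration).

```python
import pandas as pd

# Toy data, invented for illustration; not taken from the Census file.
toy = pd.DataFrame({
    'name': ['Alpha', 'Beta', 'Gamma'],
    'population': [1200, 85, 40310],
})

name_column = toy['name']

print(type(toy).__name__)          # DataFrame
print(type(name_column).__name__)  # Series
print(toy.dtypes)                  # one datatype per column
```

If you ever get a confusing error, checking `dtypes` is a cheap first step.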

You'll see that one reason people like `pandas` and `jupyter` is that you can just kinda 'look' at your data.

In [None]:
popest.head() # head() defaults to giving you the first five rows of the data frame

### Loading, refined

Notice row `0` -- it's not a data row, it's a second header. This kind of thing is common in Excel, but it can tangle you up when you use Python code for data analysis. Let's fix that.

In [None]:
popest = pd.read_csv('PEP_2014_PEPANNRES_with_ann.csv', 
                      skiprows=2, # we can't use row 0 and skip only row 1 so we skip both
                      header=None, # and tell pandas that it shouldn't treat the first non-skipped row as a header
                      names=[u'GEO.id', u'GEO.id2', # and since there's no header, we provide the names
                             u'GEO.display-label', u'rescen42010',
                             u'resbase42010', u'respop72010', u'respop72011', 
                             u'respop72012', u'respop72013', u'respop72014'])
popest.head()

Huh. This time when it loaded `GEO.id2`, which a lot of people would call a *FIPS code*, it lost the leading `0`. You've probably fought this battle before with US ZIP codes too. The software needs a clue that those leading zeroes are important to us.
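To see the problem in isolation, here's a small sketch with a made-up two-row CSV (the column name `code` and the place names are assumptions for the example). Telling `read_csv` to treat the column as text is one way to give it that clue.

```python
import io
import pandas as pd

# Made-up two-row CSV standing in for the real file.
csv_text = u"code,name\n0800925,First Place\n0803455,Second Place\n"

# Left to guess, pandas reads the codes as integers and the zero disappears.
as_numbers = pd.read_csv(io.StringIO(csv_text))

# Asking for strings keeps every character of the code.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={'code': str})

print(as_numbers['code'].tolist())  # [800925, 803455]
print(as_text['code'].tolist())     # ['0800925', '0803455']
```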

The reason we even care about the FIPS code is that it provides a clean unique identifier to link data between datasets. Maybe we have another dataset from the state of Colorado where the place names don't include _, Colorado_ -- the FIPS code saves us the headache of matching names which are _almost but not quite the same._ In database terms, it's a "unique identifier" or a "primary key". You can tell `pandas` that a certain column serves this purpose by making it the *index*. `DataFrames` and `Series` can be easily joined if they have compatible indexes, even if the rows are in a different order, or the datasets have different numbers of rows.
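A rough sketch of what "compatible indexes" buys you, with two invented tables (the codes, the `sq_miles` column and all the values are placeholders, not real data):

```python
import pandas as pd

# Both tables are invented for illustration; the codes are placeholders.
population = pd.DataFrame(
    {'est2014': [100, 200, 300]},
    index=['0800001', '0800002', '0800003'],
)
area = pd.DataFrame(
    {'sq_miles': [7.5, 1.2]},
    index=['0800003', '0800001'],  # different order, and one row fewer
)

# join() lines rows up by index value, not by position.
# Rows with no match on the right get NaN.
combined = population.join(area)
print(combined)
```

Notice that nobody had to sort anything or match names by hand; the index did the work.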

So, two things:
* we want to reload the data without losing that leading zero
* we want to make it clear to `pandas` that the FIPS code column is the index

One more little thing: Colorado is one of a few states which has a few place names that are not simple ASCII (e.g. Cañon City). Pandas won't complain about this, but you should be on the lookout. We happen to know that the Census Bureau uses `Latin-1` to encode files which have names in them to handle this, so here we are also using the `encoding` argument to make sure strings are handled correctly.
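Here's a small sketch of why the encoding matters, using a made-up one-row file built in memory (the column names and the value `1` are placeholders). The same bytes that read cleanly as Latin-1 are not valid UTF-8.

```python
import io
import pandas as pd

# A made-up one-row file; \xf1 is the "n with tilde" character.
raw_bytes = u"name,est\nCa\xf1on City,1\n".encode('latin-1')

# Decoding those bytes as UTF-8 fails, which is the kind of
# trouble the `encoding` argument helps you avoid.
try:
    raw_bytes.decode('utf-8')
    decoded_as_utf8 = True
except UnicodeDecodeError:
    decoded_as_utf8 = False

places = pd.read_csv(io.BytesIO(raw_bytes), encoding='latin-1')
place_name = places['name'].iloc[0]

print(decoded_as_utf8)  # False
print(len(place_name))  # 10
```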

So, once more, from the top.

In [None]:
popest = pd.read_csv('PEP_2014_PEPANNRES_with_ann.csv', 
                      skiprows=2, # we can't use row 0 and skip only row 1 so we skip both
                      header=None, # and tell pandas that it shouldn't treat the first non-skipped row as a header
                      names=[u'GEO.id', u'FIPS', # and since there's no header, we provide the names
                             u'GEO.display-label', u'rescen42010', # just copy/pasted from the output above
                             u'resbase42010', u'respop72010', u'respop72011', 
                             u'respop72012', u'respop72013', u'respop72014'],
                      index_col=1,
                      encoding='latin-1',
                      dtype={'FIPS': 'S7'} # should be able to use object instead of 'S7', I filed a bug!
                    )
popest.head()

OK, now we can trust that we'll be able to join this with another data set. (Note that `jupyter` displays the `index` in bold.) That `GEO.id` column seems redundant. Also, while maybe important in some cases, we don't really need the 2010 Decennial Census count or the base value used for estimates. Let's drop the irrelevant columns. And let's fix those column names to something easier to type.

In [None]:
# for 'drop', don't forget axis=1 if you're dropping columns; default is to try to drop a row.
popest = popest.drop(['GEO.id', 'rescen42010', 'resbase42010'], axis=1) 
popest = popest.rename(columns={ # yeah, we coulda done this when we read in the CSV but sometimes you realize it later.
        'GEO.display-label': 'name',
        'respop72010': 'est2010', 
        'respop72011': 'est2011', 
        'respop72012': 'est2012', 
        'respop72013': 'est2013', 
        'respop72014': 'est2014', 
    })
popest.head()

Ah, that's a lot tidier. One thing you should note: in both of the last two examples, we reassigned `popest` to the result of a function call. This is because these and many other operations in `pandas` don't change the original `DataFrame`. Instead, they return a new `DataFrame` with the change applied. You will inevitably forget this and then wonder why that change you thought you made isn't there. This is why. The good news is that it will become a habit, mostly.
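If you want to see that gotcha on its own, here's a tiny sketch with invented data (the names `toy` and `columns_after_forgetting` are just for the example):

```python
import pandas as pd

toy = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})  # invented data

# Calling drop() without keeping the result changes nothing.
toy.drop(['b'], axis=1)
columns_after_forgetting = list(toy.columns)
print(columns_after_forgetting)  # ['a', 'b']

# Reassigning keeps the change.
toy = toy.drop(['b'], axis=1)
print(list(toy.columns))  # ['a']
```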

## Exploring your data

OK, now that we have the data loaded, what do we have? 

Like we said before, each column of a `DataFrame` is a `Series`. You get at the columns by using the `DataFrame` like a Python `dict`, with the column names as keys.

In [None]:
popest['est2014'].head()

You can get specific values from the `Series` by using the index value as a key.

In [None]:
print("The 2014 estimated population for FIPS 0800925 is {}".format(popest['est2014']['0800925']))

As in JavaScript, in some cases you can also get the values from the `DataFrame` by treating the column name as an attribute, although that doesn't work if the name has spaces or is a Python reserved word. For this lesson, we'll stick with the dictionary-access style, but just to show you, look at this:

In [None]:
popest.est2014.head()

See? Same difference.

By now you must be itching to actually learn something about the data. It's easy to get basic summary statistics from a `Series`.

In [None]:
popest['est2014'].describe() # describe just a series

In [None]:
popest.describe() # or describe the whole dataset. only numeric columns are included in the results

There are 270 places in our dataset. The average 2014 estimated population is about 14,652, but the median is only 1181.5. 

The smallest place has only 8 people. Which one is it?

In [None]:
popest[popest['est2014'] == 8]

Ah, scenic [Lakeside](http://censusreporter.org/profiles/16000US0842495-lakeside-co/).
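What's going on inside those square brackets? The comparison produces a `Series` of `True`/`False` values, and the `DataFrame` keeps only the rows where it's `True`. Here's a sketch with invented places (the names, codes and populations are made up):

```python
import pandas as pd

toy = pd.DataFrame(
    {'name': ['Tiny', 'Small', 'Large'], 'est2014': [8, 950, 120000]},
    index=['0800001', '0800002', '0800003'],  # placeholder codes
)

mask = toy['est2014'] == 8  # one True/False per row
print(mask.tolist())        # [True, False, False]

# You don't have to hard-code the 8; min() can find it for you.
smallest = toy[toy['est2014'] == toy['est2014'].min()]
print(smallest['name'].tolist())  # ['Tiny']
```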

Maybe you just want a sorted list. That's pretty easy. Just remember that the default sort is low-to-high.

In [None]:
popest.sort_values(by='est2014', ascending=False).head(10) # don't forget that default sort is ascending not descending.

Maybe you only want to see some of the data.

In [None]:
popest[['name', 'est2014']].head()

Maybe you're interested in change over time. You can add a new column to your `DataFrame` as a computation, like this.

In [None]:
popest['change'] = ((popest['est2014']-popest['est2010'])/popest['est2010'])*100
popest['change'].head()

Remember, you can treat a series (like our new `change` column) like a `dict` or like a `list`.

In [None]:
print("FIPS 0800925 had {:.2f}% population change".format(popest['change']['0800925']))
print("The third place in the list had {:.2f}% population change".format(popest['change'].iloc[2])) # zero-indexed; .iloc makes the positional lookup explicit
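If you'd rather be explicit about which kind of lookup you mean, `pandas` also offers `.loc` (by index label) and `.iloc` (by position). A quick sketch with an invented series:

```python
import pandas as pd

# Invented values and placeholder codes, just to show the two lookups.
change = pd.Series(
    [1.5, -0.25, 10.0],
    index=['0800001', '0800002', '0800003'],
)

by_label = change.loc['0800002']  # dict-style: look up by index label
by_position = change.iloc[2]      # list-style: look up by position, zero-indexed

print(by_label)     # -0.25
print(by_position)  # 10.0
```

Being explicit helps most when your index labels are themselves numbers, where a bare `[2]` could mean either thing.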

Which places had the most change?

In [None]:
popest.sort_values(by='change',ascending=False).head()

Wow, Timnath doubled in size between 2010 and 2014! But we can see that it's an outlier. Maybe we want to just analyze bigger cities.

In [None]:
popest[popest['est2014'] > 100000].count()

OK. There are 11 places in Colorado with a population of more than 100,000. That output is kinda janky, though. Fortunately, we can also use `len`.

In [None]:
print("There are {} cities with population larger than 100,000.".format(len(popest[popest['est2014'] > 100000])))
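The two aren't quite interchangeable, by the way: `count()` reports the number of non-null values in each column, while `len` just counts rows. A sketch with invented data (the `note` column is made up to show a missing value):

```python
import pandas as pd

toy = pd.DataFrame({
    'name': ['A', 'B', 'C'],
    'est2014': [150000, 90000, 250000],
    'note': ['x', None, None],  # invented column with missing values
})

big = toy[toy['est2014'] > 100000]
per_column = big.count()  # non-null values, column by column

print(per_column['name'])  # 2
print(per_column['note'])  # 1
print(len(big))            # 2
```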

We can slice off just the bigger cities into another dataframe.

In [None]:
bigtowns = popest[popest['est2014'] > 100000]
bigtowns.sort_values(by='change', ascending=False).head() 

Perfect! Now let's turn our `bigtowns` dataframe into a nice CSV file. Or an Excel file (writing `.xlsx` requires the `openpyxl` library).

In [None]:
bigtowns.to_csv('bigtowns.csv')
bigtowns.to_excel('bigtowns.xlsx')
%ls -l bigtowns.*