# Part 2 - Open your data & start exploring

For this class, we're using two data sets, both of which should be in this repository. If you want the originals, see:

* `PEP_2014_PEPANNRES_with_ann.csv` - US Census Population Estimates for populated places in Colorado - [from American Factfinder](http://factfinder.census.gov/bkmk/table/1.0/en/PEP/2014/PEPANNRES/0400000US08.16200)
* `RCDCFundingSummary030516.csv` - Estimates of Funding for Various Research, Condition, and Disease Categories (RCDC) [from the NIH](https://report.nih.gov/categorical_spending.aspx)

As you read through this notebook, practice typing the given code into the next cell. Try to get comfortable using tab-completion (which even works for filenames and dictionary keys). 

Or you can copy/paste the code if you really want to. 

## Loading a data file


Before anything else, we import `pandas`. By convention, it's imported as `pd` to save typing. Reading in a CSV is just as easy as you'd guess. We'll load the Colorado population estimates. 

```python
import pandas as pd
popest = pd.read_csv('PEP_2014_PEPANNRES_with_ann.csv') 
popest.head()
```

Remember, after you are done entering code, press _shift-enter_ to execute the code and move to the next cell. Also, you won't be able to tab-complete `read_csv` for this one because the `import` hasn't run, but you _can_ complete that long CSV filename. 

Above you should see the first five rows of the data that you read in.

### Loading, refined

Notice row `0` -- it's not a data row, it's a second header. This kind of thing is common in Excel, but it can tangle you up when you use Python code for data analysis. Let's fix that. This might be a good one for copy/paste!

```python
popest = pd.read_csv('PEP_2014_PEPANNRES_with_ann.csv', 
                      skiprows=2, # we can't use row 0 and skip only row 1 so we skip both
                      header=None, # don't treat the first non-skipped row as a header
                      names=popest.columns) # use the columns we just loaded for simplicity
popest.head()
```
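If you want to see what those three arguments do without touching the real file, here is a minimal sketch on a made-up, in-memory CSV. The place names and numbers are invented for illustration; only the column names mimic the Census file.

```python
import io
import pandas as pd

# Made-up miniature of a file with a second, human-readable header row
raw = (
    "GEO.id2,GEO.display-label,respop72014\n"
    "Id2,Geography,Population Estimate - 2014\n"
    "0800001,Place A,100\n"
    "0800002,Place B,250\n"
)

# Naive load: the second header ends up as data row 0
naive = pd.read_csv(io.StringIO(raw))

# Refined load: skip both header rows, then supply the column names ourselves
clean = pd.read_csv(io.StringIO(raw),
                    skiprows=2,
                    header=None,
                    names=naive.columns)
clean.head()
```

Because the stray header row is gone, the population column is read as numbers again instead of text.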


### Check and correct

Huh. Look at the `GEO.id2` column. How is it different from when you used the `head` command before?

Hopefully you see that it lost the leading `0`. This happens a lot, so get used to looking for it. In the US, the place most of us run into it is ZIP codes.

If you're never going to use a column that has a problem like this, just keep moving on. But in our case, we know that the FIPS code will help us link this data to other datasets. We'll see this in action below, but for now, trust us that it's worth taking care of. To take care of it, we specify the `dtype`. We're going to skip past the nuances here and just specify that the type is `S` (for "string").

One more little thing: Colorado is one of a few states that have place names that are not simple ASCII (e.g. Cañon City). Depending on your Python version, `pandas` may not complain about this, so you should be on the lookout. We happen to know that the Census Bureau uses `Latin-1` to encode files which have names in them to handle this, so here we are also using the `encoding` argument to make sure strings are handled correctly.
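To see why the encoding matters, here is a small sketch using a made-up, one-row file held in memory. The same bytes decode cleanly as `Latin-1` but are not valid UTF-8.

```python
import io
import pandas as pd

# A made-up one-row file, encoded the way we're told the Census Bureau does it
latin1_bytes = "name,pop\nCañon City,1\n".encode("latin-1")

# Plain Python shows the problem: these bytes are not valid UTF-8
try:
    latin1_bytes.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Telling pandas the right encoding gives us the name intact
toy = pd.read_csv(io.BytesIO(latin1_bytes), encoding="latin-1")
toy
```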

So, once more, from the top.
```python
popest = pd.read_csv('PEP_2014_PEPANNRES_with_ann.csv', 
                      skiprows=2, # we can't use row 0 and skip only row 1 so we skip both
                      header=None, # and tell pandas that it shouldn't treat the first non-skipped row as a header
                      names=popest.columns,
                      encoding='latin-1',
                      dtype={'GEO.id2': 'S'} 
                    )
popest.head()
```

OK, good, we have our leading zeros back.
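Here is the same idea in miniature, as a sketch with made-up codes. It uses Python's built-in `str` as the `dtype`, which is one common way to ask for strings. The `zfill` repair at the end is an assumption that only works when you know how wide the code is supposed to be.

```python
import io
import pandas as pd

raw = "fips,pop\n0800001,100\n0800002,250\n"  # made-up codes and counts

# Without a dtype, pandas guesses "integer" and the leading zero is gone
as_number = pd.read_csv(io.StringIO(raw))

# With a dtype, the column is kept as text exactly as written
as_text = pd.read_csv(io.StringIO(raw), dtype={"fips": str})

# If the data is already loaded, you can pad it back out (7 digits assumed here)
repaired = as_number["fips"].astype(str).str.zfill(7)
```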

### Drop and rename columns

That `GEO.id` column seems redundant. Also, while maybe important in some cases, we don't really need the Decennial 2010 estimate or the base value used for estimates. And some of those column names are ugly.

```python
# for 'drop', don't forget axis=1 if you're dropping columns; default is to try to drop a row.
popest = popest.drop(['GEO.id', 'rescen42010', 'resbase42010'], axis=1) 
popest = popest.rename(columns={ # yes, we could have done this when we read in the CSV
        'GEO.display-label': 'name',
        'GEO.id2': 'fips',
        'respop72010': '2010', 
        'respop72011': '2011', 
        'respop72012': '2012', 
        'respop72013': '2013', 
        'respop72014': '2014', 
    })
popest.head()
```

**Note:** that `drop` command is the kind of thing that will give you errors if you run it a second time on a `DataFrame` that no longer has those columns. Don't panic.

Ah, that's a lot tidier. One thing you should note: in both of the last two examples, we reassigned `popest` to the result of a function call. This is because these and many other operations in `pandas` don't change the original `DataFrame`. Instead, they return a new `DataFrame` with the change applied. 

You will inevitably forget this and then wonder why that change you thought you made isn't there. This is why. The good news is that it will become a habit, mostly.
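A tiny sketch of both gotchas, using a throwaway `DataFrame` with invented column names. The `errors='ignore'` argument to `drop` is one way to make a cell safe to run twice.

```python
import pandas as pd

df = pd.DataFrame({"name": ["A", "B"], "junk": [1, 2]})

# The result is thrown away here, so df still has the column
df.drop(["junk"], axis=1)
still_there = "junk" in df.columns

# Reassigning keeps the change
df = df.drop(["junk"], axis=1)

# Running it again would normally raise an error; errors='ignore' skips missing columns
df = df.drop(["junk"], axis=1, errors="ignore")
```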

## Exploring your data

### Sorting
OK, now that we have the data loaded, what do we have? Maybe you want to look at the extremes of the list: which places have the most people? Which ones have the fewest? `DataFrame` has a `sort_values` function which makes this easy:

```python
popest.sort_values(by='2014').head(10)
```


By default, sort is "ascending". If you want the ten biggest, you have to say so:

```python
popest.sort_values(by='2014',ascending=False).head(10)
```
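If all you want is the top or bottom few rows, `DataFrame` also has `nlargest` and `nsmallest`. This sketch uses made-up numbers to show that they give the same rows as sorting and taking the `head`.

```python
import pandas as pd

toy = pd.DataFrame({"name": ["A", "B", "C", "D"],
                    "2014": [50, 400, 10, 200]})  # invented values

top2 = toy.nlargest(2, "2014")      # two biggest
bottom2 = toy.nsmallest(2, "2014")  # two smallest

# Same rows as the sort-then-head approach
same_as_sort = toy.sort_values(by="2014", ascending=False).head(2)
```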


### Summary statistics
We can see there's a pretty big range between the biggest and smallest places, but what about the distribution? The `describe` function shows basic summary stats for every numeric column in your `DataFrame`.

```python
popest.describe()
```



We can see that the average population is trending upward decisively, although the quartile values don't change that much. That suggests that the action is really concentrated in the largest cities. There are a lot of other ways you can get a feel for the data in a dataframe, including visualization techniques, but we'll save those for later.
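To make that reasoning concrete, here is a sketch with invented numbers: three small places that don't change and one big place that grows. The mean moves, but the median (the `50%` row) stays put.

```python
import pandas as pd

# Invented populations: only the largest place grows
toy = pd.DataFrame({"2010": [100, 200, 300, 100000],
                    "2014": [100, 200, 300, 120000]})

stats = toy.describe()
mean_change = stats.loc["mean", "2014"] - stats.loc["mean", "2010"]
median_change = stats.loc["50%", "2014"] - stats.loc["50%", "2010"]
```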

### Computing your own values

Let's add a column reflecting the change in population from 2010 to 2014, as a fraction of the 2010 population.

```python
popest['change'] = ( (popest['2014'] - popest['2010']) / popest['2010'] )
popest.sort_values('change',ascending=False).head()
```

[Timnath](https://en.wikipedia.org/wiki/Timnath,_Colorado) is booming! But maybe we want to focus on the bigger population centers. Can we just see those?

### Filtering

`pandas` syntax for selecting a subset of a `DataFrame` is a little weird, although if you've used `R` at all, it will look familiar. Let's start with an example, and we'll explain after.

Let's see just the places that had more than 100,000 people in 2014.

```python
popest[ popest['2014'] > 100000 ]
```


OK, but which ones changed the most? You can just chain methods together.

```python
popest[ popest['2014'] > 100000 ].sort_values(by='change',ascending=False).head()
```

#### That filtering syntax

Technically, when you filter, you pass a `Series` whose values are all `True` or `False`. The function returns a dataframe with all the same columns, but only including rows for which the matching value was `True`. 
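Here is a sketch of that in slow motion, on a made-up `DataFrame`. It also shows one way to combine two conditions: wrap each in parentheses and join them with `&`.

```python
import pandas as pd

toy = pd.DataFrame({"name": ["A", "B", "C"],
                    "2014": [50, 150000, 300000],
                    "change": [0.5, 0.02, 0.10]})  # invented values

# The comparison by itself produces a Series of True/False values
mask = toy["2014"] > 100000

# Passing that Series in square brackets keeps only the True rows
big = toy[mask]

# Two conditions at once: parentheses around each, & between them
big_and_growing = toy[(toy["2014"] > 100000) & (toy["change"] > 0.05)]
```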

Let's see how many places lost population.

```python
num_declines = len(popest[ popest['change'] < 0 ])
print("There were {} places which lost population.".format(num_declines))
```
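Since `True` counts as 1 and `False` as 0, you can also count matches by summing the `True`/`False` values directly. A small sketch with invented numbers:

```python
import pandas as pd

toy = pd.DataFrame({"change": [-0.1, 0.0, 0.2, -0.3]})  # invented values

declined = toy["change"] < 0

num_declines = int(declined.sum())  # how many True values
share_declined = declined.mean()    # fraction of rows that are True
```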


## Merging datasets

Maybe you want to look on a map, to see if there are patterns to this growth. We're not going to get into mapping with `pandas` and `python`, but we'll grab another dataset which has the locations of these places, hook the two together and export the result so we can use it with another mapping tool.

The Census Bureau publishes [gazetteer files](https://www.census.gov/geo/maps-data/data/gazetteer2014.html) every year which provide the area of each shape and the coordinates of its centroid. 

### Loading data from the web

Just because we can, here's an example of how to load data into a `DataFrame` via a URL.

```python
from urllib.request import urlopen  # on Python 2 this lived in urllib2
url = 'http://www2.census.gov/geo/docs/maps-data/data/gazetteer/2014_Gazetteer/2014_gaz_place_08.txt'
response = urlopen(url)
# We happen to know that this file uses tab, not comma, to separate values,
# and note we're controlling the encoding and the GEOID dtype
# these are things you have to check out for yourself as you load data
gaz = pd.read_csv(response,delimiter='\t',dtype={'GEOID': 'S'},encoding='latin-1')
gaz.head()
```

(If you're having trouble with the internet, a copy of this file is in the repository too.)
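If you'd like to practice those "check it yourself" steps offline, here is a sketch on a made-up, tab-separated snippet held in memory. The column names imitate the gazetteer file, but the rows are invented, and the trailing spaces in the last header are put there on purpose.

```python
import io
import pandas as pd

# Invented tab-separated data; note the trailing spaces after INTPTLONG
raw = ("GEOID\tNAME\tINTPTLAT\tINTPTLONG   \n"
       "0800001\tPlace A\t39.5\t-105.0\n"
       "0800002\tPlace B\t40.25\t-104.5\n")

toy_gaz = pd.read_csv(io.StringIO(raw), delimiter="\t", dtype={"GEOID": str})

# Looking at the raw column names is a quick way to spot stray whitespace
print(list(toy_gaz.columns))
```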

### Doing the actual merge

This is super easy. There are a lot of options to explore if some rows in one `DataFrame` don't match anything in the other, but we're not faced with that here.

```python
merged = pd.merge(left=popest,right=gaz,left_on='fips',right_on='GEOID')
merged.head()
```

If you have experience with SQL, you may be interested in the `pandas` [comparison with SQL](http://pandas.pydata.org/pandas-docs/stable/comparison_with_sql.html#compare-with-sql-join).
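For the case where rows don't all match, here is a sketch with two made-up tables. The default merge keeps only the rows found on both sides. Asking for `how='outer'` keeps everything, and `indicator=True` adds a `_merge` column saying where each row came from.

```python
import pandas as pd

# Invented data: one code appears only on the left, one only on the right
left = pd.DataFrame({"fips": ["0800001", "0800002", "0800003"],
                     "name": ["A", "B", "C"]})
right = pd.DataFrame({"GEOID": ["0800001", "0800002", "0800009"],
                      "lat": [39.0, 40.0, 38.5]})

# Default: only rows that match on both sides survive
inner = pd.merge(left=left, right=right, left_on="fips", right_on="GEOID")

# Outer: keep everything, and label each row's origin
outer = pd.merge(left=left, right=right, left_on="fips", right_on="GEOID",
                 how="outer", indicator=True)
outer["_merge"].value_counts()
```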

#### Another way to reduce the columns

For this exercise, we only want a few of the columns. Besides the `drop` command which we saw before, you can pass in a list of column names to the dataframe to select a subset. We're doing a little bit of other cleanup here too.

```python
merged = merged.rename(columns=lambda x: x.strip()) # turns out one column name has bogus whitespace
merged = merged.rename(columns={'INTPTLAT': 'lat', 'INTPTLONG': 'lng'})
merged = merged[['fips','name','change','lat','lng']]
merged.head()
```

#### Exporting a dataset

This is as easy as could be too. Just remember that we're dealing with place names like _Cañon City_ and keep track of the encoding. On Python 2 the default is ASCII, so if you forgot the `encoding` argument with our current dataset, you'd get an error. On Python 3 the default is UTF-8, but being explicit doesn't hurt.

```python
merged.to_csv('ready_for_mapping.csv',encoding='utf-8')
%ls -l ready*
```
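It's worth reading an exported file back in to check that nothing got mangled. This sketch does a round trip with one invented row written to a temporary file. Passing `index=False` is an optional choice that leaves the row numbers out of the CSV.

```python
import os
import tempfile
import pandas as pd

toy = pd.DataFrame({"fips": ["0800001"],
                    "name": ["Cañon City"],
                    "change": [0.1]})  # invented row

# Write to a throwaway location so we don't clutter the working directory
path = os.path.join(tempfile.mkdtemp(), "roundtrip.csv")
toy.to_csv(path, index=False, encoding="utf-8")

# Read it back with the same encoding, keeping the code column as text
back = pd.read_csv(path, encoding="utf-8", dtype={"fips": str})
back
```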

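If you have the `openpyxl` library installed, you can also write directly to an Excel file. (Older versions of `pandas` used the `xlwt` library to write the legacy `.xls` format, which recent versions no longer support.)

```python
merged.to_excel('ready_for_mapping.xlsx')
```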

### Next steps

Now that you know some things about reading, writing, and exploring data with `pandas`, check out [part 3](Part%203.ipynb) to learn more ways to work with data.