### Census Data

Consider the following questions: 
- Who is my customer? 
- Where should I relocate my plant or headquarters, given the skill set I require of workers? 
- Where would be a good location for a new store, given my knowledge of my customer? 

**How would you answer these questions?** This lecture illustrates one set of tools for answering them: getting detailed demographic information about the people who live in a given area using the US census.

What is the [US census](https://en.wikipedia.org/wiki/United_States_Census)? Every 10 years the US government is required to count all people within the United States, and in doing so it constructs detailed demographic information about the people living and working within fine geographic levels. 

An innovation that is of interest to us is the [American Community Survey](https://en.wikipedia.org/wiki/American_Community_Survey). This is a survey (not a census), but it asks the long-form questions whose answers can then be matched up with the 10-year census to provide information at a **yearly** frequency. So you can find out median household income in zip code 90210 for 2015, 2014, etc. The data only go back to 2010, since this is a new development.

A second development is that the US census has a well-developed API through which we can directly access the data. In the past, the process looked like this: bulk download `.csv` files, pull what you need, store it, etc. Now you can get what you want directly, on the fly. (I think the main users of this are commercial vendors. For example, when you look at Zillow and it reports some characteristics of a zip code, this is a direct feed from the census.) 

**What are we going to do with it?** We will learn how to use the Census API and then use the information to ask who voted for Trump or Clinton in the 2016 Presidential election. This is a nice application because we know election results at very fine levels of geography, but we will never know individual votes. We can, however, `merge` the election results with demographic information at those fine geographic levels and make statements like "areas with a less educated population were more likely to vote for candidate X." Along the way, we will learn some more stuff:
- Census API
- More practice `merge`ing

Then this will fit with the next lecture on mapping.

#### Getting Started

So below are the packages that we need. The first three we know. The `census` and `us` packages are the new ones:

In [1]:
import pandas as pd
import numpy  as np
import matplotlib.pyplot as plt

from census import Census # This is new...
from us import states

A couple of points. 

First, you may not have the `census` package or the `us` package (which provides `states`). To get these packages, open your terminal or command prompt and type `pip install census`, and then do the same thing for `us` (just replace `census`). This should do the trick.

Now before we can use this, YOU need to get access. It is very easy, just go here:

https://api.census.gov/data/key_signup.html

And then follow the instructions. This will give you a personalized key that you can use when interfacing with the Census. If you are having trouble, just use my key.

Once you have a key, you create a session. The syntax looks like this:

In [2]:
my_api_key = '34e40301bda77077e24c859c6c6c0b721ad73fc7'
# This is my api_key

c = Census(my_api_key)
# This will create an object c which has methods associated with it.
# We will see  these below.

type(c) 
# Per the discussion below, try typing c. and pressing Tab to see the options. 

census.core.Census

First: here is the in-depth [documentation](https://www.census.gov/content/dam/Census/data/developers/api-user-guide/api-guide.pdf). This provides info about what datasets and geographies are available.

Now below is the basic syntax. Here are [some examples](https://pypi.python.org/pypi/census). The quick start is you do `c.acs5.get(stuff here)`. 

- The first bit `acs5` says use the 5-year American Community Survey. There are other options (`acs3` and `acs1`), with the key difference being the geographical level possible. The `acs5` is the slowest dataset to be updated, but it contains the finest geographic level of detail.
- The next bit `get` says get the data.
- Then the stuff in the parentheses tells it what to grab. There are essentially three elements: the first one, `code`, is the code associated with the data series you want (if you want multiple series, create a tuple); the second element describes the geography (we will work through several different levels of geography); the third element is the year.
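
To make the three elements concrete, here is a rough sketch of the kind of web request they describe. The helper `build_acs5_url` and the placeholder key `YOUR_KEY` are hypothetical, and the URL layout is my reading of the public Census API, not a description of what the `census` wrapper does internally. Nothing is sent over the network; the code only assembles the address.

```python
from urllib.parse import urlencode

def build_acs5_url(fields, geo, year, key="YOUR_KEY"):
    """Assemble (but do not send) an ACS 5-year request URL.

    Hypothetical helper for illustration; the path layout is an assumption.
    """
    base = f"https://api.census.gov/data/{year}/acs/acs5"
    # Element 1: the series codes, Element 2: the geography, Element 3: the year (in the path).
    params = {"get": ",".join(fields), **geo, "key": key}
    # Keep ':', ',' and '*' readable instead of percent-encoding them.
    return base + "?" + urlencode(params, safe=":,*")

url = build_acs5_url(("NAME", "B01001_001E"), {"for": "state:06"}, 2015)
print(url)
```

Pasting an address like this into a browser (with a real key) is a handy way to check a variable code before using it in Python.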

Let's do an example:

In [3]:
code = ("NAME","B01001_001E") # This says grab the geographical name, and  B01001_001E 
                               # is the population. 
    
state_pop_2015 = c.acs5.get(code, {'for': 'state:'+ states.CA.fips}, year=2015)
                                  # The second element is the geography: 'for' says which level
                                  # we want (here, state), followed by the specific state
                                  # you are looking for. Here is the trick: states are classified by FIPS numbers,
                                  # so you use `states.CA.fips`, which generates the correct
                                  # FIPS value for California.

print(states.CA.fips)
                    
state_pop_2015 = pd.DataFrame(state_pop_2015)


state_pop_2015.head()

06


Unnamed: 0,B01001_001E,NAME,state
0,38421464.0,California,6


In [4]:
state_pop_2015.head()

Unnamed: 0,B01001_001E,NAME,state
0,38421464.0,California,6


Let's do one more example: here is the population and the total foreign-born population in that state.

In [5]:
code = ("NAME","B01001_001E","B05006_001E") # This says grab the geographical name, and  B01001_001E 
                               # is the population; B05006_001E is foreign born population (i.e. immigrants)
    
state_pop_2015 = c.acs5.get(code, {'for': 'state:'+ states.CA.fips }, year=2015)
                                  # Same geography element as before: 'for' gives the level (state),
                                  # followed by the specific state, identified by its FIPS number.
                                  # `states.CA.fips` generates the correct
                                  # FIPS value for California.

state_pop_2015 = pd.DataFrame(state_pop_2015)


state_pop_2015.head()

Unnamed: 0,B01001_001E,B05006_001E,NAME,state
0,38421464.0,10389990.0,California,6


More than a quarter (about 27 percent) of the population in California is foreign born! Is this correct? As a quick check, google it and see what answer you get. 

**How do I get information for all the states?** The simple answer is to use `*`, which is the [wild card character](https://en.wikipedia.org/wiki/Wildcard_character) for their data. So you just do this:

In [6]:
code = ("NAME","B01001_001E","B05006_001E") # This says grab the geographical name, and  B01001_001E 
                               # is the population; B05006_001E is foreign born population (i.e. immigrants)
    
state_pop_2015 = c.acs5.get(code, {'for': 'state:*'}, year=2015)
                                  # Everything is the same now... but the * says take all states

state_pop_2015 = pd.DataFrame(state_pop_2015)

print(state_pop_2015.shape)

state_pop_2015.head()

#county_2015[code].astype(float).sum()

(52, 4)


Unnamed: 0,B01001_001E,B05006_001E,NAME,state
0,4830620.0,167224.0,Alabama,1
1,733375.0,54047.0,Alaska,2
2,6641928.0,896004.0,Arizona,4
3,2958208.0,138822.0,Arkansas,5
4,38421464.0,10389990.0,California,6


**Exercises**:

- **What is the population of the United States? How can I check if correct?**

- **What is the foreign born population of the United states?**

Navigating the variables. This is still hard, but you have to put in the work to get what you want. So here is an approach: first, the page below lists the data sets available for general API calls (most of these are not in the census Python wrapper). 

https://api.census.gov/data.html

Then find the "ACS 5-Year Detailed Tables". From there, click on "groups". This will take you here:

https://api.census.gov/data/2016/acs/acs5/groups.html

This then provides broad categories to select from. Pick your favorite one. For example, let's click on "PLACE OF BIRTH FOR THE FOREIGN-BORN POPULATION IN THE UNITED STATES", taking us here:

https://api.census.gov/data/2016/acs/acs5/groups/B05006.html

Then you will find the individual categories available for this subject matter.

In [7]:
code = ("NAME","B01001_001E","B05006_001E","B05006_002E") # This says grab the geographical name, and  B01001_001E 
                               # is the population; B05006_001E is foreign born population (i.e. immigrants);
                               # B05006_002E is one of the sub-categories listed on the B05006 group page.
    
state_pop_2015 = c.acs5.get(code, {'for': 'state:*'}, year=2015)
                                  # Everything is the same now... but the * says take all states

state_pop_2015 = pd.DataFrame(state_pop_2015)

print(state_pop_2015.shape)

state_pop_2015.head(10)

#county_2015[code].astype(float).sum()

(52, 5)


Unnamed: 0,B01001_001E,B05006_001E,B05006_002E,NAME,state
0,4830620.0,167224.0,18252.0,Alabama,1
1,733375.0,54047.0,8907.0,Alaska,2
2,6641928.0,896004.0,83855.0,Arizona,4
3,2958208.0,138822.0,10162.0,Arkansas,5
4,38421464.0,10389990.0,670660.0,California,6
5,5278906.0,515772.0,72698.0,Colorado,8
6,3593222.0,500147.0,131245.0,Connecticut,9
7,926454.0,80163.0,10406.0,Delaware,10
8,647484.0,91588.0,16972.0,District of Columbia,11
9,19645772.0,3875498.0,381239.0,Florida,12


**Exercises**:

- **Go back to the group level. Find a category you like (do control + F and search). Then find one or two variables you like. Grab them.**

----

#### Finer Levels of Geography

The state stuff is interesting, but what is really cool is that very detailed levels of geography can be found. Two that may be of interest are:
- Counties: We should have a sense of what these are. 
- [Zip Code Tabulation Areas](https://en.wikipedia.org/wiki/ZIP_Code_Tabulation_Area): This is close to a zip code, but not always. 

Let's check it out...

In [8]:
code = ("NAME","B01001_001E","B05006_001E") # Same Codes:

county_2015 = pd.DataFrame(c.acs5.get(code, 
                                         {'for': 'county:*'}, year=2015))
                                         # Same deal, but we specify county then the wild card
                                         # On the example page, there are ways to do this for only one state
county_2015.head()

# Here is another way to look at only one state...
        
county_2015[county_2015["NAME"].str.contains("Alaska")]

Unnamed: 0,B01001_001E,B05006_001E,NAME,county,state
64,3304.0,1491.0,"Aleutians East Borough, Alaska",13,2
65,5684.0,2012.0,"Aleutians West Census Area, Alaska",16,2
66,299107.0,29803.0,"Anchorage Municipality, Alaska",20,2
67,17776.0,328.0,"Bethel Census Area, Alaska",50,2
68,970.0,19.0,"Bristol Bay Borough, Alaska",60,2
69,2060.0,139.0,"Denali Borough, Alaska",68,2
70,4979.0,83.0,"Dillingham Census Area, Alaska",70,2
71,99705.0,5067.0,"Fairbanks North Star Borough, Alaska",90,2
72,2560.0,181.0,"Haines Borough, Alaska",100,2
73,2128.0,114.0,"Hoonah-Angoon Census Area, Alaska",105,2


In [9]:
code = ("NAME","B19013_001E", "B01001_001E") 
# The new code I added was median household income:
    
zip_2015 = pd.DataFrame(c.acs5.get(code, 
                                         {'for': 'zip code tabulation area: 90210, 90059'}, year=2015))

zip_2015.head()

Unnamed: 0,B01001_001E,B19013_001E,NAME,zip code tabulation area
0,44648.0,32506.0,ZCTA5 90059,90059
1,22052.0,145227.0,ZCTA5 90210,90210


This is interesting. We all know the zip code 90210. The zip code 90059 is also in the Los Angeles area; it happens to be part of the "Compton" neighborhood. Google it if you don't know what that is. Median household income in Beverly Hills is about 145 thousand dollars; in Compton it is about 32.5 thousand. If you were selling designer handbags, where would you want to locate? If you owned a "dollar store", where would be a good location? 

---

### Who voted for Trump? For Clinton?

The idea here is to `merge` some demographic characteristics with election results. Here is my outline of the approach:
- Look at the election data (and determine the appropriate geography for the Census data)
- Pull the Census data
- `merge` it
- Learn about `pivot` tables to report some simple "cuts" of the data 
- Next lecture: a more formal statistical analysis.

**Election Data** Below is a link to some election data that I pulled last year, very soon after the election. Note that it is a bit old, so the aggregate vote counts are off. One thing to do is to update this data set. 

In [None]:
url = "https://raw.githubusercontent.com/mwaugh0328/"
url = url + "Did-China-Cause-Trump/master/us-election-2016-results-by-county.csv"

election_2016 = pd.read_csv(url)

election_2016.head(10)

One thing to notice is that Alaska is not broken down by county. This was a problem with the dataset, so below we will just drop Alaska when we look at it. 

Now here we can use the `unique` method on a column of the dataframe to find the unique entries. This can answer the question: who ran for election?

In [None]:
print("\n 2016 Candidates:", election_2016.Candidate.unique())

So Trump, Clinton, and several third party candidates we have a hard time remembering now. Now who won the popular vote?

In [None]:
trump_vote = election_2016[election_2016.Candidate == "Trump"].VoteCount.sum()
clinton_vote = election_2016[election_2016.Candidate == "Clinton"].VoteCount.sum()

print("Clinton Vote", clinton_vote, "Trump Vote", trump_vote)

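Ok, so Clinton won the election??? Not quite: she won the popular vote, but the presidency is decided by the Electoral College, which Trump won.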
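
If you want the totals for every candidate at once, a `groupby` does the same job as the two boolean selections. Here is a minimal sketch on made-up numbers (these are not the real results); it assumes the column names `Candidate` and `VoteCount` used above.

```python
import pandas as pd

# Made-up vote counts, for illustration only (not the real results).
toy = pd.DataFrame({
    "CountyFips": [1001, 1001, 1003, 1003],
    "Candidate": ["Trump", "Clinton", "Trump", "Clinton"],
    "VoteCount": [100, 40, 70, 90],
})

# Sum votes within each candidate, then sort from most to fewest votes.
totals = toy.groupby("Candidate")["VoteCount"].sum().sort_values(ascending=False)
print(totals)
```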

Back to the data. What we **want** to do is to merge this with the Census at the county level. How do we do this? The key thing to notice about the election data is that there is this thing called the `CountyFips` code. [This is a five digit number that uniquely identifies a county](https://en.wikipedia.org/wiki/FIPS_county_code). The first two digits identify the state (so they are the same for every county in that state). The last three then pin down the county within the state. **Note** that in the `head` above you don't quite see this, since the column is stored as a number and the leading zero is not shown. For example, the Alabama entries are all ``01***`` but it only shows ``1***``.
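
Here is a small sketch of the leading-zero issue, and one way to recover the five digit text version from a numeric column. The two example codes are just illustrations (one in state `01`, one in state `06`), and the variable names are my own.

```python
import pandas as pd

# How the codes look when stored as numbers: the leading zero is gone.
fips_numeric = pd.Series([1001, 6037])

# Convert to text and pad with zeros on the left up to five characters.
fips_text = fips_numeric.astype(str).str.zfill(5)

# First two characters are the state, last three are the county.
state_part = fips_text.str[:2]
county_part = fips_text.str[2:]

print(pd.DataFrame({"fips": fips_text, "state": state_part, "county": county_part}))
```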

Now let's look at the Census data. Side note: since we do not have the ACS for 2016, we will just use the ACS for 2015. This should be OK, as my guess is that there is a very high correlation across years within narrowly defined geographies. 

In [None]:
code = ("NAME","B01001_001E","B19013_001E") # Same Codes:

county_2015 = pd.DataFrame(c.acs5.get(code, 
                                         {'for': 'county:*'}, year=2015))
                                         # Same deal, but we specify county then the wild card
                                         # On the example page, there are ways to do this for only one state
        
county_2015 = county_2015.rename(columns = {"B01001_001E":"population", "B19013_001E":"income"})

print(county_2015.head())

county_2015.dtypes

Note that this does not include the combined FIPS number, but the state and then the county. So we just need to append one to the other to create our own FIPS number. Notice that the county and the state are stored as strings. So the operation to append is simply just to add the strings (look at the head to note that this was NOT numerical addition). 

In [None]:
county_2015["FIPS"] = county_2015["state"] + county_2015["county"]

county_2015.head()

Let's merge them... but first, ask yourself the following questions:
- What kind of merge is this? One-to-one, many-to-one?
- What should you expect after the merge takes place?

Below is our syntax. Here is a slight modification of our earlier examples: we specify the key on the left and the key on the right (which in this case have slightly different names).

In [None]:
cens_election = pd.merge(county_2015, election_2016, left_on = "FIPS", right_on = "CountyFips", indicator = True)
cens_election.head(10)

**WHY IS THIS NOT WORKING!!!**

The datatypes are not the same. In the census data the FIPS number is a string, while in the election data `CountyFips` is numerical. So we need to convert the census FIPS number to a numerical value:

In [None]:
county_2015["FIPS"] = county_2015["FIPS"].astype(float)
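
To see the fix in isolation, here is a sketch on two toy tables (all values made up). It converts the text key to an integer instead of a float, which is just another way of making the two key columns agree. It also passes `validate="one_to_many"`, which asks pandas to raise an error if a county shows up more than once on the census side.

```python
import pandas as pd

# Toy stand-ins for the two tables (values are made up).
census_toy = pd.DataFrame({"FIPS": ["01001", "01003"], "income": [50000.0, 52000.0]})
election_toy = pd.DataFrame({
    "CountyFips": [1001, 1001, 1003],
    "Candidate": ["Trump", "Clinton", "Trump"],
})

# The two key columns do not have the same type: text on the left, integers on the right.
print(census_toy["FIPS"].dtype, election_toy["CountyFips"].dtype)

# Turn the text key into an integer so both sides match ("01001" becomes 1001).
census_toy["FIPS"] = census_toy["FIPS"].astype(int)

# One census row per county, several election rows per county (one per candidate).
merged = pd.merge(census_toy, election_toy, left_on="FIPS", right_on="CountyFips",
                  how="inner", validate="one_to_many")
print(merged)
```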

In [None]:
cens_election = pd.merge(county_2015, election_2016, how='inner',
                         left_on = "FIPS", right_on = "CountyFips", indicator = True)

# Note that by taking the inner merge, some stuff is going to be dropped. 
# There are no election results for parts of Alaska...

cens_election.head(10)

#how='outer', cens_election.shape

In [None]:
# Then lets look at the stuff that was thrown out?
#cens_election[cens_election["_merge"]!= "both"].head()

cens_election.dtypes

cens_election["VoteShare"] = cens_election.VoteCount / cens_election.CountyTotalVote
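
One way to look at what an inner merge throws out is to run an outer merge with `indicator = True` and inspect the `_merge` column. A minimal sketch on toy tables (the codes and values are made up):

```python
import pandas as pd

# Toy tables: each side has one key that the other side is missing.
left = pd.DataFrame({"FIPS": [1001, 1003, 2013], "income": [50.0, 52.0, 61.0]})
right = pd.DataFrame({"CountyFips": [1001, 1003, 99999], "VoteCount": [10, 20, 30]})

# The outer merge keeps everything; _merge records where each row came from.
outer = pd.merge(left, right, left_on="FIPS", right_on="CountyFips",
                 how="outer", indicator=True)

print(outer["_merge"].value_counts())

# Rows that an inner merge would have dropped.
unmatched = outer[outer["_merge"] != "both"]
print(unmatched)
```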

### Some Simple Data Analysis

Generally, a good approach to analyzing data is to (i) first provide some simple "cuts" of the data or plots that illustrate the point you are after, then (ii) use formal statistical modeling to establish the result. Here is why this is important: if the data do not pass (i), or the "plot test" as one of my former colleagues called it, then this suggests that you should view any results from (ii) with skepticism (it does not mean they are not true, just that more needs to be established).

So I want to explore the role of income and of the urban/rural divide. One way to get at this is the following: create bins by income level (like poor, middle, rich) and look at the share of votes going to Trump in each bin. If we see Trump's vote share declining as income rises, this suggests that income level is a factor in determining who voted for Trump. We can do the same by population (with less populated counties taken to be rural)...

So how do we do this? We can use a nice feature of pandas, `pd.qcut`, which creates quantiles of whatever we specify. Then we can `groupby` those quantiles and create the table we want.
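
Here is the plan on a toy table first (all numbers made up), so the mechanics are easy to check by hand. It also shows why we sum votes within a bin before dividing: the pooled share weights big counties more heavily, so it is not the same thing as the simple average of county shares.

```python
import pandas as pd

# Four made-up counties.
toy = pd.DataFrame({
    "income": [30.0, 40.0, 60.0, 80.0],
    "VoteCount": [80, 150, 400, 300],
    "CountyTotalVote": [100, 300, 1000, 1000],
})

# Two income bins with (roughly) the same number of counties in each.
bins = pd.qcut(toy["income"], 2, labels=["low", "high"])
grouped = toy.groupby(bins, observed=True)

# Pooled share: all votes for the candidate in the bin over all votes in the bin.
pooled_share = grouped["VoteCount"].sum() / grouped["CountyTotalVote"].sum()

# Simple average of county-level shares, for comparison.
county_share = toy["VoteCount"] / toy["CountyTotalVote"]
mean_of_shares = county_share.groupby(bins, observed=True).mean()

print(pooled_share)
print(mean_of_shares)
```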

Awesome plan! Now to execute, we need to convert the data types so they can be numerically evaluated. 

In [None]:
cens_election["population"] = cens_election["population"].astype(float)

cens_election["income"] = cens_election["income"].astype(float)

Now, let's take only the Trump votes. This is OK because each Trump entry carries both the Trump vote count and the total number of votes within that county, so we can construct total votes and total votes for Trump. We don't need to carry around all the other stuff if we are only interested in Trump versus not Trump.

In [None]:
only_trump = cens_election[cens_election.Candidate == "Trump"].copy()
# So look at only trump stuff...

only_trump["trump_share"] = only_trump.VoteCount / only_trump.CountyTotalVote

only_trump.head()
# Look at it again...

Below is the basic syntax to cut the data into quantiles, then, for each income quantile, add up all votes for Trump and divide by total votes...

In [None]:
nquantiles = 4 # This is the number of quantiles; it just allows me to change this at will.

labels = ["quantile " + str(var) for var in range(1,nquantiles + 1)]

# Here I'm going to use a list comprehension to create some labels, like quantile 1, etc.

inc_q = pd.qcut(only_trump["income"], # this says take quantiles by income
                nquantiles,           # The number of quantiles
                labels = labels)      # The labels to go with it.

grouped = only_trump.groupby(inc_q)   # Then this is the magic, I can group by it...

vote_income_quant = 100*(grouped.VoteCount.sum() / grouped.CountyTotalVote.sum())

                                       # Then this says, given the group, sum over all votes (for Trump)
                                       # Then divide by all votes, in total, for that group

print(vote_income_quant)

So what you see is that the Trump share systematically declines as household income rises. Here is a variation on the `qcut` command, which is to use `pd.cut` and specify how you want to cut the data...

In [None]:
labels = ["poor", "rich"]

rich_poor = pd.cut(only_trump["income"], # this says cut by income
                2,           # This does not cut by quantile; it splits the income range into two equal-width bins,
                             # so the bins need not contain the same number of counties.
                labels = labels)      # The labels to go with it.

grouped = only_trump.groupby(rich_poor)   # Then this is the magic, I can group by it...

vote_rich_poor = grouped.VoteCount.sum() / grouped.CountyTotalVote.sum()

                                       # Then this says, given the group, sum over all votes (for Trump)
                                       # Then divide by all votes, in total, for that group

print(vote_rich_poor)

I don't think this is the best way to illustrate this, but a similar message is emerging. 
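
Part of the reason is how the two commands make their bins. A quick sketch on six made-up incomes (in thousands) shows the difference: `pd.qcut` puts the same number of observations in each bin, while `pd.cut` with an integer splits the range of values into equal-width intervals, so one very rich county can end up in a bin by itself.

```python
import pandas as pd

# Six made-up incomes, with one outlier at the top.
income = pd.Series([20, 30, 40, 50, 60, 200])

by_count = pd.qcut(income, 2, labels=["poor", "rich"])  # equal number of observations per bin
by_width = pd.cut(income, 2, labels=["poor", "rich"])   # equal-width bins over the range 20 to 200

print(by_count.value_counts().sort_index())
print(by_width.value_counts().sort_index())
```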

Now, let's do the same thing by population:

In [None]:
labels = ["quantile " + str(var) for var in range(1,nquantiles + 1)]

pop_q = pd.qcut(only_trump["population"], nquantiles, labels = labels)

grouped = only_trump.groupby(pop_q)

pop_income_quant =  100*(grouped.VoteCount.sum() / grouped.CountyTotalVote.sum())

Now combine the two tables (each one is a pandas Series) to make one nice illustration. Here we use `pd.concat`, which is a way to "smush" two objects together. With `axis = 1` it lines the rows up by their index labels (here, the quantile labels) and adds each one as a column.

In [None]:
combo = pd.concat([vote_income_quant,pop_income_quant], axis = 1)
           # This is the concat option, axis = 1, says add the column.

combo.columns = ["Vote Share by Income Quantile", "Vote Share by Population Quantile"]
# Make some nice labels...

combo.head(10)
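
A small sketch of what `pd.concat` with `axis = 1` does with the row labels. The two toy Series below (made-up numbers, my own column names) have the same labels in a different order, and the values still end up on the right rows.

```python
import pandas as pd

a = pd.Series([60.0, 40.0], index=["quantile 1", "quantile 2"])
b = pd.Series([35.0, 65.0], index=["quantile 2", "quantile 1"])  # same labels, different order

# Rows are matched by label, not by position.
combo_toy = pd.concat([a, b], axis=1)
combo_toy.columns = ["by income", "by population"]

print(combo_toy)
```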

----

### More Advanced "Cuts of the Data" (Pivot Tables)

One issue with the cuts of the data above is that we don't really see how, within an income quantile, the vote share varies with population, or vice versa. This motivates the use of a `pivot` table, which is essentially a `groupby` operation but gets us what we want more quickly. 

I also want to add one more layer on top of the analysis. The cuts of the data above always used the continuous measure, the Trump share. Another approach is to create a discrete variable: call a county Red if the majority of it voted for Trump, and Blue otherwise. Then, in the stats module, we can estimate what is sometimes called a linear probability model.

**Creating the Dummy Variable**

So this uses the `np.where` command that issues a condition. The red, blue thing is nice, but I want to create a numerical value that takes the value one or zero. 

In [None]:
only_trump["red_blue"] = np.where(only_trump["trump_share"] > 0.50, 1.0, 0.0)
                         # The first part is the condition,
                         # The second part, 1.0 ("red"), is the value if the condition is met
                         # The third part, 0.0 ("blue"), is the value if the condition is not met.
only_trump.head()
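
A nice side effect of using one and zero is that the mean of the dummy is the fraction of "red" counties. A minimal sketch on four made-up shares (note that a county at exactly 0.50 does not satisfy the strict inequality, so it is coded as zero):

```python
import numpy as np
import pandas as pd

# Four made-up county-level Trump shares.
toy = pd.DataFrame({"trump_share": [0.72, 0.50, 0.31, 0.55]})

# 1.0 if a strict majority went for Trump, 0.0 otherwise.
toy["red_blue"] = np.where(toy["trump_share"] > 0.50, 1.0, 0.0)

# The mean of a 0/1 variable is the fraction of ones.
red_fraction = toy["red_blue"].mean()

print(toy)
print(red_fraction)
```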

**Pivot Table**

This is basically a `groupby`-like operation, but it can be done in a multi-dimensional manner. For example, suppose that we want to see, **within** a size category, the relationship between income and the propensity to vote for Trump. To answer this question, we want to take a slice within a population quantile and then look across the different income quantiles. We could do this using a `groupby` and a boolean operation for each different quantile. Or we could use a pivot table and do this in one fell swoop.

In [None]:
only_trump.pivot_table("red_blue", index = pop_q, columns = inc_q)
                       # The first element tells us the data we want to use, in this case the 1 or 0 of going for Trump
                       # The second element is the row dimension we want to work by, in this case population quantile
                       # The third element is the column dimension to work by...
                       #
                       # This is a simple example. Note that, like groupby, there is some aggregator function; the default is the mean. 
                       # You can specify it with aggfunc, e.g. a dictionary mapping each variable to its aggregator.

This is interesting... let's think about what this says. Within smaller counties, the propensity to vote for Trump is very high. The relationship with income is not as clear. At the bottom end of the income distribution, the propensity to go with Trump is low; it rises, then dips down. Think of it like this: within a size quantile, there is an inverted U with income. Then the peak of the inverted U decreases as size rises. 

The `pivot` table essentially allowed us to quickly see how the propensity to vote for Trump varies jointly with income and population. And it illustrated something that would be hard to see in a figure: some kind of inverted U shape between income and the Trump vote. 
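
To convince yourself that the pivot table really is a `groupby` in disguise, here is a sketch on a toy table (made-up values, hypothetical column names `size` and `inc`). Grouping by both dimensions, taking the mean, and unstacking gives the same table.

```python
import pandas as pd

# Six made-up counties: a size label, an income label, and the 1/0 dummy.
toy = pd.DataFrame({
    "size": ["S", "S", "S", "L", "L", "L"],
    "inc":  ["lo", "hi", "hi", "lo", "lo", "hi"],
    "red_blue": [1.0, 1.0, 0.0, 0.0, 1.0, 0.0],
})

# Rows are size, columns are income, cells are the mean of the dummy.
via_pivot = toy.pivot_table(values="red_blue", index="size", columns="inc")

# The same table built by hand: group on both, average, then move income to the columns.
via_groupby = toy.groupby(["size", "inc"])["red_blue"].mean().unstack()

print(via_pivot)
print(via_groupby)
```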

**Some Exercises:**
- Use the `pivot` table as above but change it to `VoteCount` and verify it is doing what it should by size.
- What about the `trump_share`? Try that option. What do you see?
- Does it matter that the index was population and the columns were income? If they were reversed, what happens?

Ok, let's plot the data and take a quick look:

In [None]:
fig, ax = plt.subplots()

ax.scatter(100*only_trump["trump_share"], 
           np.log(only_trump["income"]), 
           s= 0.000085*only_trump["population"], 
           alpha = 0.35)

ax.set_title("Income and Trump's Share of Vote \n")
ax.set_ylabel("Log Scale: Median Household Income") 
ax.set_xlabel("Percent of County Vote Going to Trump")

ax.spines["right"].set_visible(False)
ax.spines["top"].set_visible(False)

ylist = [float(var)*0.01 for var in range(975,1225,25)] # Tick positions in logs: 9.75, 10.0, ..., 12.0

ylabel_list = np.exp(ylist)            # Now create the list of labels by converting the log values to levels
                                       # by taking exp.
ylabel_list = np.round(ylabel_list,-2) # Then round it so it looks nice.

ax.set_yticks(ylist)            # Fix the tick positions first, so the labels line up with them,
ax.set_yticklabels(ylabel_list) # then set the ytick labels.

plt.show()

This is interesting... it also supports the different cuts of the data that we have seen. We see a general negative relationship between income and the Trump share (though even in log space this is not clearly monotonic). This is consistent with the results from the `pivot` table. A second thing that you can observe is that bigger markers (larger counties) typically have lower Trump vote shares.