# Data Bootcamp:  Code Practice A

Optional Code Practice A:  Jupyter basics and Python's **[graphics tools](https://davebackus.gitbooks.io/test/content/graphs1.html)** (the Matplotlib package). The goals are to become familiar with Jupyter and Matplotlib and to explore some **new datasets**.  The data management part of this goes beyond what we've done in class.  We recommend you just run the code provided and focus on the graphs for now.  

This notebook was written by Dave Backus for the NYU Stern course [Data Bootcamp](https://nyu.data-bootcamp.com/).  

**Check Jupyter before we start.** Run the code below and make sure it works.  

In [None]:
# to make sure things are working, run this 
import pandas as pd
print('Pandas version: ', pd.__version__) 

If you get something like "Pandas version:  0.17.1" you're fine.  If you get an error, bring your computer by and ask for help.  If you're unusually brave, go to [StackOverflow](http://stackoverflow.com/a/19961403/804513) and read the instructions.  Then come ask for help.  (This has to do with how your computer processes unicode.  When you hear that word -- unicode -- you should run away at high speed.)     

## Question 1. Setup 

Import packages, arrange for graphs to display in the notebook. 

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import datetime as dt 
%matplotlib inline

**Remind yourself:**

* What does the `pandas` package do?
* What does the `matplotlib` package do?  
* What does `%matplotlib inline` do?  

## Question 2.  Jupyter basics 

* We refer to the cell that's highlighted as the **current cell**.  
* Clicking once on any cell makes it the current cell.  Clicking again allows you to edit it.   
* The + in the toolbar at the top creates a new cell below the current cell.  
* Change a cell from Code to Markdown (in other words, text) with the dropdown menu in the toolbar.  
* To run a cell, hit shift-enter or click on the run-cell icon in the toolbar (sideways triangle and vertical line). 
* For more information, click on Help at the top.  User Interface Tour is a good place to start.

Practice with the following:  

* Make this cell the current cell. 
* Add an empty cell below it.  
* Add text to the new cell:  your name and the date, for example.  
* *Optional:*  Add a link to your LinkedIn or Facebook page.  *Hint:* Look at the text in the top cell to find an example of a link.  
* Run the cell.  

## Question 3. Winner take all and the long tail in the US beer industry

The internet has produced some interesting market behavior, music being a great example.  Two patterns stand out:  

* Winner take all.  The large producers (Beyonce, for example) take larger shares of the market than they had in the past.  
* The long tail.  At the same time, small producers in aggregate increase their share.  

Curiously enough, we see the same thing in the US beer industry:   

* Scale economies and a reduction in transportation costs (the interstate highway system was built in the 1950s and 60s) led to consolidation, with the large firms getting larger and the small ones either selling out or going bankrupt.  (How many beer brands can you think of that no longer exist?)  
* Starting in the 1980s, we saw a significant increase in the market share of small firms ("craft brewers") overall, even though each of them remains small.  

We illustrate this with data from Victor and Carol Tremblay that describe the output of the top 100 US beer producers from 1947 to 2004.  This is background data from their book, [The US Brewing Industry](http://www.amazon.com/The-US-Brewing-Industry-Economic/dp/0262512637), MIT Press, 2004.  See [here](http://people.oregonstate.edu/~tremblac/pdf/Appendix%20A%20Weinberg%20Data.pdf) for the names of the brewers.  Output is measured in thousands of 31-gallon barrels.  

**Data manipulation.** The data manipulation goes beyond what we've done in class.  You're free to ignore it, but here's the idea.  

* The spreadsheet contains output by firms ranked 1 to 100 in size.  Each row refers to a specific year and includes the outputs of firms in order of size.  We don't have their names.  
* We transpose this so that the columns are years and include output for the top-100 firms.  The row labels are the size rank of the firm.  
* We then plot the size against the rank for four years to see how it has changed.  
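
To see what the transpose step does, here is a minimal sketch on made-up numbers (the dataframe `toy` and its values are invented for illustration, not taken from the Tremblay data).  Rows start out as years and columns as size ranks; after `.T` the ranks run down the rows and each year is a column.

```python
import pandas as pd

# Toy data (made up for illustration): rows are years, columns are size ranks
toy = pd.DataFrame(
    {1: [100, 400], 2: [80, 150], 3: [60, 20]},
    index=[1950, 2000],
)
toy.index.name = 'Year'

# Transpose: rows become size ranks, columns become years
toy_t = toy.T
print(toy_t)

# Selecting a column now gives one year's output by rank
print(toy_t[2000])
```

Once the years are columns, picking out a single year to plot is just a column selection.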

In [None]:
url = 'http://pages.stern.nyu.edu/~dbackus/Data/beer_production_1947-2004.xlsx'
beer = pd.read_excel(url, skiprows=12, index_col=0)

print('Dimensions:', beer.shape)
beer[list(range(1,11))].head(3)

In [None]:
ranks = list(range(1,101))  # extract top 100 firms (avoid shadowing the built-in vars) 
pdf = beer[ranks].T         # transpose (flip rows and columns)
pdf[[1947, 1967, 1987, 2004]].head()

**Question.** Can you see consolidation here?

In [None]:
# a basic plot 
fig, ax = plt.subplots()

pdf[1947].plot(ax=ax, logy=True)
pdf[1967].plot(ax=ax, logy=True)
pdf[1987].plot(ax=ax, logy=True)
pdf[2004].plot(ax=ax, logy=True)
ax.legend()

**Answer these questions below.** Code is sufficient, but it's often helpful to add comments to remind yourself what you did, and why.  

* Get help for the `set_title` method by typing `ax.set_title?` in a new cell and running it.  Note that you can open the documentation this produces in a separate tab with the icon in the upper right (hover text = "Open the pager in an external window").  
* Add a title with `ax.set_title('Your title')`.  
* Change the fontsize of the title to 14.  
* What happens if we add the argument/parameter `lw=2` to the `plot()` statements?  
* Add a label to the x axis with `ax.set_xlabel()`.  
* Add a label to the y axis.  
* Why did we use a log scale (`logy=True`)?  What happens if we don't?
* Use the `color` argument/parameter to choose a more effective set of colors.  
* In what sense do you see "winner take all"?  A "long tail"?    
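
If you get stuck, the sketch below shows one way these pieces fit together.  It uses made-up numbers (the dataframe `toy`, the labels, and the colors are illustrative choices, not part of the assignment), so you still need to apply the same ideas to the beer data yourself.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up output by size rank for two hypothetical years
ranks = list(range(1, 6))
toy = pd.DataFrame({1950: [100, 80, 60, 40, 20],
                    2000: [900, 300, 50, 5, 1]}, index=ranks)

fig, ax = plt.subplots()

# lw sets the line width, color sets the line color
toy[1950].plot(ax=ax, logy=True, lw=2, color='gray')
toy[2000].plot(ax=ax, logy=True, lw=2, color='red')

# title and axis labels are methods of the axis object
ax.set_title('Output by size rank (toy data)', fontsize=14)
ax.set_xlabel('Size rank')
ax.set_ylabel('Output (log scale)')
ax.legend()
```

Try removing `logy=True` here to see how the small values get squeezed against the horizontal axis.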

## Question 4.  Japan's aging population 

Populations are getting older throughout the world, but Japan is a striking example.  One of our favorite quotes:

> Last year, for the first time, sales of adult diapers in Japan exceeded those for babies. 

Let's see what the numbers look like using projections from the [United Nations' Population Division](http://esa.un.org/unpd/wpp/Download/Standard/Population/).  They have several projections; we use what they call the "medium variant." 

We have a similar issue with the data:  population by age for a given country and date goes across rows, not down columns.  So we choose the ones we want and transpose them.  Again, more than we've done so far.  
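
The select-then-transpose idea can be sketched on a tiny made-up table (the dataframe `toy`, its column names, and its numbers are invented for illustration and are much simpler than the UN file).

```python
import pandas as pd

# Made-up rows: one row per country and year, age groups across columns
toy = pd.DataFrame({'Country': ['Japan', 'Japan', 'France'],
                    'Year': [2015, 2035, 2015],
                    '0-4': [5.0, 4.0, 3.9],
                    '5-9': [5.5, 4.2, 4.0]})

# Keep only the rows we want: a boolean condition on country and year
keep = toy['Country'].isin(['Japan']) & toy['Year'].isin([2015, 2035])
sub = toy[keep].drop(['Country'], axis=1)

# Transpose so age groups run down the rows and years label the columns
by_age = sub.set_index('Year').T
print(by_age)
```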

In [None]:
# data input (takes about 20 seconds on a wireless network)
url1 = 'http://esa.un.org/unpd/wpp/DVD/Files/'
url2 = '1_Indicators%20(Standard)/EXCEL_FILES/1_Population/'
url3 = 'WPP2017_POP_F07_1_POPULATION_BY_AGE_BOTH_SEXES.XLSX'
url = url1 + url2 + url3 

cols = [2, 4, 5] + list(range(6,28))
# note: older pandas versions used sheetname= and parse_cols= for these two arguments
prj = pd.read_excel(url, sheet_name=1, skiprows=16, usecols=cols, na_values=['…'])
print('Dimensions: ', prj.shape)
print('Column labels: ', prj.columns)

In [None]:
# rename some variables 
pop = prj
pop = pop.rename(columns={'Reference date (as of 1 July)': 'Year', 
                          'Region, subregion, country or area *': 'Country', 
                          'Country code': 'Code'})
# select Japan and years 
countries = ['Japan']
years     = [2015, 2035, 2055, 2075, 2095]
pop = pop[pop['Country'].isin(countries) & pop['Year'].isin(years)]
pop = pop.drop(['Country', 'Code'], axis=1)
pop = pop.set_index('Year').T
pop = pop/1000    # convert population from thousands to millions 
pop.head()

In [None]:
pop.tail()

**Comment.** Now we have the number of people in any five-year age group running down columns.  The column labels are the years.  

With the dataframe `pop`:  

* Plot the current age distribution with `pop[[2015]].plot()`.  Note that `2015` here does not have quotes around it:  it's an unusual case of integer column labels. 
* Plot the current age distribution as a bar chart.  Which do you think looks better?  
* Create figure and axis objects 
* Use the axis object to plot the age distribution for all the years in the dataframe.  
* Add titles and axis labels. 
* Plot the age distribution for each date in a separate subplot.  What argument/parameter does this?  *Bonus points:* Change the size of the figure to accommodate the subplots.  
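
As a hint for the last item, here is one possible approach on made-up numbers (the dataframe `toy` and the figure size are illustrative assumptions).  It relies on the `subplots` and `figsize` arguments of the dataframe `plot` method.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up population (millions) by age group for two hypothetical years
toy = pd.DataFrame({2015: [5.0, 6.0, 8.0, 7.0],
                    2055: [3.0, 4.0, 5.0, 8.0]},
                   index=['0-4', '20-24', '40-44', '60-64'])

# One bar-chart panel per column; figsize is (width, height) in inches
axes = toy.plot(kind='bar', subplots=True, sharex=True, figsize=(6, 6))

# plot returns an array of axis objects, one per panel
fig = axes[0].get_figure()
fig.suptitle('Population by age group (toy data)')
```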

## Question 5.  Dynamics of the yield curve 

One of our favorite topics is the yield curve:  a plot of the yield to maturity on a bond against the bond's maturity.  The foundation here is yields on zero coupon bonds, which are simpler objects than yields on coupon bonds.  

We often refer to bond yields rising or falling, but in fact the yield curve often does different things at different maturities.  We will see that here.  For several years, short yields have been stuck at zero, yet yields for bonds with maturities of two years and above have varied quite a bit.  

We use the Fed's well-known [Gurkaynak, Sack, and Wright data](http://www.federalreserve.gov/pubs/feds/2006/200628/200628abs.html), which provides daily data on US Treasury yields from 1961 to the present.  The Fed posts the data, but it's in an unfriendly format.  So we saved it as a csv file, which we read in below.  The variables are yields:  `SVENYnn` is the yield for maturity `nn` years.  

In [None]:
# data input (takes about 20 seconds on a wireless network)
url = 'http://pages.stern.nyu.edu/~dbackus/Data/feds200628.csv'
gsw = pd.read_csv(url, skiprows=9, index_col=0, usecols=list(range(11)), parse_dates=True) 
print('Dimensions: ', gsw.shape)
print('Column labels: ', gsw.columns)
print('Row labels: ', gsw.index)

In [None]:
# grab recent data 
df = gsw[gsw.index >= dt.datetime(2010,1,1)]
# convert to annual, last observation of each year
# ('A' means year-end frequency; recent pandas versions prefer 'YE')
df = df.resample('A').last().sort_index()
df.head()

In [None]:
df.columns = list(range(1,11))
ylds = df.T
ylds.head(3)

With the dataframe `ylds`:  

* Create figure and axis objects 
* Use the axis object to plot the yield curve for all the years in the dataframe.  
* Add titles and axis labels. 
* Explain what you see:  What happened to the yield curve over the past six years? 
* **Challenging.**  Compute the mean yield for each maturity.  Plot them on the same graph in black.  
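
For the challenge, the key step is averaging across columns rather than down rows.  The sketch below illustrates this on made-up yields (the dataframe `toy` and its numbers are invented for illustration); with `ylds` the same pattern should apply, since its rows are maturities and its columns are dates.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up yields (percent): rows are maturities in years, columns are years
toy = pd.DataFrame({2010: [0.3, 1.0, 2.0],
                    2011: [0.1, 0.4, 1.2],
                    2012: [0.2, 0.7, 1.6]},
                   index=[1, 5, 10])

# axis=1 averages across columns, giving one mean yield per maturity
mean_yld = toy.mean(axis=1)

fig, ax = plt.subplots()
toy.plot(ax=ax)                                           # one line per year
mean_yld.plot(ax=ax, color='black', lw=2, label='Mean')   # mean curve in black
ax.set_xlabel('Maturity (years)')
ax.set_ylabel('Yield (percent)')
ax.legend()
```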