# UMichigan Applied Plotting, Week 4 Assignment

In this report we analyze [...] in Ann Arbor, MI, as compared to statewide statistics. To cover comparable date ranges (1970–2010), we had to fall back on data for Washtenaw County, of which Ann Arbor is the seat. The approximation seems acceptable: the Ann Arbor metropolitan area had a population of 306,022 in 2010, while the county reported 344,791 the same year, a difference of 11.2% of the county figure.
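
As a quick check of the percentage quoted above, the arithmetic can be reproduced in a few lines. The variable names are ours, and the difference is taken relative to the county figure, which is the reading that matches the 11.2% in the text.

```python
# Population figures quoted in the text (2010).
metro_population = 306_022
county_population = 344_791

# Relative difference, expressed as a share of the county figure.
difference_pct = (county_population - metro_population) / county_population * 100
print(f"{difference_pct:.1f}%")  # prints 11.2%
```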

## Datasets source
Datasets were downloaded from the Federal Reserve Bank of St. Louis (FRED) website. Unfortunately, the website hides the actual URL for each file, so we can only list links to the pages from which the data can be downloaded (manually, or with rather sophisticated web scraping).
- City/county-level data
    - [Washtenaw County population](https://fred.stlouisfed.org/series/MIWASH1POP)
    - [Ann Arbor, MI mean per capita annual income](https://fred.stlouisfed.org/series/ANNA426PCPI)
    - [Ann Arbor, MI house price index](https://fred.stlouisfed.org/series/ATNHPIUS11460Q)
- State-level data
    - [Michigan population](https://fred.stlouisfed.org/series/MIPOP)
    - [Michigan mean per capita annual income](https://fred.stlouisfed.org/series/MIPCPI)
    - [Michigan house price index](https://fred.stlouisfed.org/series/MISTHPI)
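
One possible shortcut, offered as an assumption and not as something this notebook relied on: FRED appears to serve a CSV export of each series at a `fredgraph.csv` address that takes the series ID as a query parameter. The sketch below only builds such URLs for the Ann Arbor series linked above; the endpoint, the `fred_csv_url` helper and the file-name mapping are ours and should be verified before use.

```python
from urllib.parse import urlencode

# ASSUMPTION: FRED serves a CSV export at this address; verify before relying on it.
FRED_CSV_ENDPOINT = "https://fred.stlouisfed.org/graph/fredgraph.csv"

# Series IDs taken from the page links listed above; file names follow the
# `aa_<variable>.csv` convention used later in this notebook.
AA_SERIES = {
    "aa_population.csv": "MIWASH1POP",
    "aa_income.csv": "ANNA426PCPI",
    "aa_prices.csv": "ATNHPIUS11460Q",
}


def fred_csv_url(series_id: str) -> str:
    """Build the (assumed) CSV download URL for one FRED series."""
    return f"{FRED_CSV_ENDPOINT}?{urlencode({'id': series_id})}"


# Hypothetical helper output: one candidate URL per local file name.
urls = {filename: fred_csv_url(sid) for filename, sid in AA_SERIES.items()}
for filename, url in urls.items():
    print(filename, url)
```

If the endpoint behaves as assumed, each URL could be handed to `pd.read_csv` or saved under the matching file name. The sketch itself performs no network access.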

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import ticker as tck
from matplotlib import dates as mdates
from matplotlib.lines import Line2D as Line
import seaborn as sns

%matplotlib widget

In [2]:
# Direct file URLs are unavailable, so this cell only checks that the
# manually downloaded data files are present.
from pathlib import Path

datafiles = ["aa_population.csv", "aa_income.csv", "aa_prices.csv"]

# Map each file name to whether it exists in the working directory.
files = {name: Path(name).exists() for name in datafiles}
print(files)

# Report any file that still needs to be downloaded.
for name, found in files.items():
    if not found:
        print(f"Missing data file: {name}")

## Data Processing 

We define standard prefixes and variable (AKA _column_) names for the dataframes we'll be creating. We then loop through the prefixes (`aa_` for Ann Arbor, `mi_` for the state of Michigan) and, within each, through the column names, reading the data from the corresponding CSV file. This yields 3 dataframes per entity. Each DF is indexed by measurement date, and its single column is named after one of the predefined variable `names`.

In [3]:
# Create prefixes, names and suffixes for the data files.
prefixes = ['aa_', 'mi_']
names = ['income', 'population', 'prices']
suffixes = '.csv'
dfs = []

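# Loop over entities and variables, loading each CSV into its own DataFrame.
for pre in prefixes:
    for nomen in names:
        dataset = f"{pre}{nomen}{suffixes}"
        # globals() exposes each frame as a top-level variable, e.g. `aa_income`.
        globals()[f"{pre}{nomen}"] = pd.read_csv(dataset,
                                                 index_col='DATE',
                                                 names=['DATE', nomen],
                                                 header=0)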
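
To illustrate what the `read_csv` arguments in the loop do, here is a minimal sketch on an in-memory stand-in for one downloaded file. The CSV contents and the `fake_csv` name are made up for illustration; only the argument combination mirrors the cell above.

```python
import io

import pandas as pd

# Stand-in for one downloaded file; the values are made up for illustration.
fake_csv = io.StringIO(
    "DATE,ANNA426PCPI\n"
    "1975-01-01,6500\n"
    "1976-01-01,7100\n"
)

# header=0 discards the file's own header row, names= supplies replacements,
# and index_col picks the (renamed) date column as the index.
income = pd.read_csv(fake_csv, index_col="DATE", names=["DATE", "income"], header=0)

print(income.columns.tolist())
print(income.index.name)
```

The series ID in the file header is replaced by the friendlier column name, which is what later lets the joined frame carry columns such as `income_aa`.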



Here, we join the DFs using an 'inner' approach, so we only keep rows (_i.e._, years) for which we have all the data.

In [4]:
# Don't know how to add dynamically generated variables to an iterable, so:

aa_data = aa_income.join(aa_population, how='inner')
aa_data = aa_data.join(aa_prices, how='inner')

mi_data = mi_income.join(mi_population, how='inner')
mi_data = mi_data.join(mi_prices, how='inner')

data = aa_data.join(mi_data, how='inner', lsuffix='_aa', rsuffix='_mi')
data.index = pd.to_datetime(data.index)
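
As an alternative to dynamically named variables, the frames could be kept in a dictionary and folded together. The following is a sketch under that assumption, with made-up values and a hypothetical `frames` dictionary; it also shows how an inner join drops any row missing from one of the inputs.

```python
from functools import reduce

import pandas as pd

# Made-up values; only the index overlap matters here.
frames = {
    "income": pd.DataFrame({"income": [10, 11, 12]}, index=["1975", "1976", "1977"]),
    "population": pd.DataFrame({"population": [100.0, 101.0]}, index=["1976", "1977"]),
    "prices": pd.DataFrame({"prices": [50.0, 52.0, 55.0]}, index=["1975", "1976", "1977"]),
}

# Fold the frames together with successive inner joins.
joined = reduce(lambda left, right: left.join(right, how="inner"), frames.values())

print(joined.index.tolist())  # prints ['1976', '1977']
```

Because `population` has no 1975 row, that year disappears from the result even though the other two frames contain it.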


We express the population of the county (_i.e._, Ann Arbor) and of the state as a ratio to its initial value, so that 1.0 marks the first year in the data. We also compute `growthdiff`, the difference between the `AA` (Ann Arbor) and `MI` (state of Michigan) growth ratios; multiplied by 100, it reads as Ann Arbor's lead in percentage points.

In [5]:
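# .iloc[0] selects the first row by position; plain [0] on a date-indexed
# Series relies on a deprecated positional fallback.
data['popgrowth_aa'] = data.population_aa/data.population_aa.iloc[0]
data['popgrowth_mi'] = data.population_mi/data.population_mi.iloc[0]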
data['growthdiff'] = data.popgrowth_aa - data.popgrowth_mi
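
A worked example on toy numbers may make the three derived columns easier to read. The populations below are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
import pandas as pd

# Hypothetical populations, for illustration only.
toy = pd.DataFrame(
    {
        "population_aa": [200.0, 240.0, 280.0],
        "population_mi": [1000.0, 1050.0, 1100.0],
    },
    index=pd.to_datetime(["1975-01-01", "1976-01-01", "1977-01-01"]),
)

# Ratio to the first observation: 1.0 means "same as the starting year".
toy["popgrowth_aa"] = toy.population_aa / toy.population_aa.iloc[0]
toy["popgrowth_mi"] = toy.population_mi / toy.population_mi.iloc[0]

# Lead of the county over the state, as a difference of ratios.
toy["growthdiff"] = toy.popgrowth_aa - toy.popgrowth_mi

print(toy[["popgrowth_aa", "popgrowth_mi", "growthdiff"]].round(2))
```

In the last toy row the county stands at 1.4 times its starting population and the state at 1.1 times, so `growthdiff` is 0.3, i.e. a lead of 30 percentage points.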


This is the main plotting routine. We begin by creating a figure and a scatterplot from the [`seaborn`](https://seaborn.pydata.org) package, which extends the basic scatterplot by letting marker size and color encode additional variables.

In [26]:


fig2, ax2 = plt.subplots(figsize=[9, 7])

# Marker size encodes Ann Arbor's population growth; color encodes its lead
# over the state. Passing ax= makes the target axes explicit.
plot2 = sns.scatterplot(data=data,
            x='income_aa',
            y='prices_aa',
            size='popgrowth_aa',
            hue='growthdiff',
            palette=plt.get_cmap('GnBu'),
            sizes=(80, 800),
            legend='brief',
            hue_norm=(0.0, 0.3),
            ax=ax2
            )


ax2.set_xlabel("Mean Yearly Income (dollars)", fontsize=12, labelpad=10)
ax2.set_ylabel("House Price Index (All Transactions) for Ann Arbor, MI", fontsize=12, labelpad=10)


handles, labels = plot2.get_legend_handles_labels()
new_labels = ['Pop. Growth Rate\nDifference vs. State\n(approx. %)',
              'baseline', '10%', '15%', '25%', '30%',
              '\nPopulation Growth\nover Time\n(approx. %)',
              '5%', '10%', '20%', '30%', '35%', '40%']

plt.subplots_adjust(right=0.75)

# Create a 'fake' twin plot sharing the same y axis
axtwin = ax2.twiny()
# Plot the same data invisibly, just so the top x axis spans the dates we want.
# Plain `plot` accepts the DatetimeIndex directly, so no conversion is needed.
axtwin.plot(data.index, data['prices_aa'], linestyle='none')

fmt_date = mdates.DateFormatter('%Y')
loc_date = mdates.YearLocator(5)

#axtwin.set_xlim('1975-01-01', '2019-01-01')
axtwin.xaxis.set_major_formatter(fmt_date)
axtwin.xaxis.set_major_locator(loc_date)

ax2.set_title('House Prices by Mean Income Over Time in Ann Arbor, MI', pad=12.0, fontdict={'fontsize': 18})

#leg = plot2.legend_
plt.legend(loc=2,
           handles=handles,
           bbox_to_anchor=(1, 1),
           labels=new_labels,
           markerscale=1,
           frameon=False
          )

# This legend is created after the twin axes, so it belongs to the twin
# (convenience) plot. The seaborn call already attached its own legend to
# ax2, so to keep only this adjusted one we also have to remove the
# original.
ax2.get_legend().remove()

fig2.autofmt_xdate()
plt.show()

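[Figure: "House Prices by Mean Income Over Time in Ann Arbor, MI". Scatter plot of the House Price Index (All Transactions) for Ann Arbor, MI against mean yearly income (dollars), with a year axis on top; marker size shows population growth over time and color shows the growth lead over the state.]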
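
The twin-axis and legend handling in the cell above can be isolated in a small Matplotlib-only sketch. It uses made-up data and a hypothetical `toy` frame and leaves out seaborn, so it shows the general pattern and not the exact figure: an invisible line gives the top axis its date range, and the automatic legend is removed once a custom one has been attached to the twin axes.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless

import matplotlib.pyplot as plt
import pandas as pd
from matplotlib import dates as mdates

# Hypothetical values, for illustration only.
toy = pd.DataFrame(
    {"income": [10, 14, 19, 25], "prices": [50.0, 70.0, 95.0, 120.0]},
    index=pd.to_datetime(["1975-01-01", "1985-01-01", "1995-01-01", "2005-01-01"]),
)

fig, ax = plt.subplots()
ax.scatter(toy["income"], toy["prices"], label="auto")
ax.legend()  # stands in for the legend seaborn attaches automatically

# Twin axes: shares the y axis and gets its own x axis on top.
axtwin = ax.twiny()
# Invisible line whose only job is to give the top axis a date range.
axtwin.plot(toy.index, toy["prices"], linestyle="none")
axtwin.xaxis.set_major_locator(mdates.YearLocator(10))
axtwin.xaxis.set_major_formatter(mdates.DateFormatter("%Y"))

# Attach a customized legend to the twin axes, then drop the original one.
handles, _ = ax.get_legend_handles_labels()
axtwin.legend(handles=handles, labels=["Ann Arbor"], loc="upper left",
              bbox_to_anchor=(1, 1), frameon=False)
ax.get_legend().remove()

has_primary_legend = ax.get_legend() is not None
has_twin_legend = axtwin.get_legend() is not None
print(has_primary_legend, has_twin_legend)
plt.close(fig)
```

Calling `legend` on a specific axes object, instead of `plt.legend`, makes it explicit which axes ends up owning the legend.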

In [7]:
plt.close('all')

The data show that, over the 1975–2019 period, house prices (expressed as an index) rose about as much as the average yearly income in Ann Arbor: roughly 7-fold. A notable exception is the dip that preceded the 2008 subprime mortgage crisis, which shows real estate prices falling and, with them, the position of the individuals and institutions invested in the market deteriorating.

The plot appears to be an honest representation of the situation, although one feature might be misleading: from 2007 through 2009, average income actually decreased for the first and only time in this period. The corresponding markers are shifted to the left, which is correct when read against the lower `x` axis (income), but against the upper axis (years) they seem to go back in time.
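
Years in which income fell, and whose markers therefore move leftwards, could be located with a year-over-year difference. The sketch below uses a hypothetical income series, not the actual FRED values.

```python
import pandas as pd

# Hypothetical income series, for illustration only.
income = pd.Series(
    [40_000, 41_000, 40_500, 40_200, 42_000],
    index=pd.to_datetime(
        ["2006-01-01", "2007-01-01", "2008-01-01", "2009-01-01", "2010-01-01"]
    ),
    name="income_aa",
)

# Year-over-year change; negative values mark years in which income fell.
change = income.diff()
falling_years = change[change < 0].index.year.tolist()
print(falling_years)  # prints [2008, 2009]
```

Applied to `data.income_aa`, the same two lines would list the years whose markers sit to the left of the previous year's marker.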

The color scale represents Ann Arbor's lead over the state in population growth. It would appear that income and house prices rising at about the same pace (a ratio of ~1) has been seen as relatively favorable, as the population of Ann Arbor has grown significantly more than that of Michigan over these years (about 30 percentage points more).

Finally, marker size grows in step with the city/county population, which has grown ~40% overall in these 45 years.

In [None]:
# Figure.clear() takes no 'all' argument; that string belongs to plt.close().
plt.gcf().clear()

In [None]:

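# Quick look at the Ann Arbor house price index over time.
fig, ax = plt.subplots()
ax.plot(data.index, data.prices_aa)
# Rotate the date labels once the data has been drawn.
fig.autofmt_xdate()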

In [None]:
# Figure.clear() takes no 'all' argument; that string belongs to plt.close().
plt.gcf().clear()

In [16]:
data.size

405

In [18]:
data.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 45 entries, 1975-01-01 to 2019-01-01
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   income_aa      45 non-null     int64  
 1   population_aa  45 non-null     float64
 2   prices_aa      45 non-null     float64
 3   income_mi      45 non-null     float64
 4   population_mi  45 non-null     float64
 5   prices_mi      45 non-null     float64
 6   popgrowth_aa   45 non-null     float64
 7   popgrowth_mi   45 non-null     float64
 8   growthdiff     45 non-null     float64
dtypes: float64(8), int64(1)
memory usage: 3.5 KB
