# Baby Boomers Thru Time
## Demographics Report by Amber Vasquez, kav835


## Homework 4

### Creating Interactive Charts to Visualize Population Shifts over Time with Altair

Baby boomers (often shortened to boomers) are the demographic cohort following the Silent Generation and preceding Generation X. The generation is generally defined as people born from 1946 to 1964, during the post–World War II baby boom. The term is also used outside the United States, but the dates, the demographic context, and the cultural identifiers may vary. The baby boom has been described variously as a "shockwave" and as "the pig in the python." Baby boomers are often parents of late Gen Xers and Millennials ([from Wikipedia](https://en.wikipedia.org/wiki/Baby_boomers)).

Let's explore this "shockwave" by examining the US Census data available via the `vega_datasets` package. We'll start by doing some data engineering to add a column to our population data that denotes generational membership, then we'll juxtapose the sex distribution of the population using the brushing and linking technique we studied in the lab. Finally, we'll add a slider to animate the transition through time.

In [1]:
# Import the necessary libraries and data
import altair as alt
import pandas as pd
from vega_datasets import data

df_population = data.population()

In [2]:
df_population.head()

Unnamed: 0,year,age,sex,people
0,1850,0,1,1483789
1,1850,0,2,1450376
2,1850,5,1,1411067
3,1850,5,2,1359668
4,1850,10,1,1260099


## ✅Q1 - Add in the "Boomer" label

As we can see from inspecting the dataframe, our data only gives us information for: 
  - year
  - age
  - sex
  - people
  
But we want to be able to highlight just the people born between 1946 and 1964 as a separate group. To accomplish this, we want to create a new categorical attribute, `generation`.

Using pandas data manipulation techniques, add a new column to `df_population` named `generation` that has either the value `Baby Boomer` or `Other`.
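The labeling rule can be worked through by hand before touching the dataframe. The sketch below is a minimal illustration, not the notebook's implementation: the helper names `birth_year` and `label_generation` are hypothetical, and the two example rows are taken from sample outputs shown later in this notebook. Because ages in this data appear in 5-year steps, the derived birth year is approximate for people near the cohort boundaries.

```python
def birth_year(census_year: int, age: int) -> int:
    """Approximate birth year: the census year minus the reported age."""
    return census_year - age


def label_generation(census_year: int, age: int,
                     first: int = 1946, last: int = 1964) -> str:
    """Return 'Baby Boomer' if the birth year is in [first, last], else 'Other'."""
    if first <= birth_year(census_year, age) <= last:
        return "Baby Boomer"
    return "Other"


# 1980 census, age 20: 1980 - 20 = 1960, inside 1946-1964
print(label_generation(1980, 20))

# 1970 census, age 5: 1970 - 5 = 1965, one year past the cutoff
print(label_generation(1970, 5))
```

Both bounds are treated as inclusive here, matching the "born from 1946 to 1964" definition quoted above.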

In [3]:
# ADD GENERATION COLUMN 
df_population['generation']= ""
df_population

Unnamed: 0,year,age,sex,people,generation
0,1850,0,1,1483789,
1,1850,0,2,1450376,
2,1850,5,1,1411067,
3,1850,5,2,1359668,
4,1850,10,1,1260099,
...,...,...,...,...,...
565,2000,80,2,3221898,
566,2000,85,1,970357,
567,2000,85,2,1981156,
568,2000,90,1,336303,


In [4]:
# ADD A BIRTH YEAR COLUMN AND CALCULATE THE YEAR EACH PERSON IS BORN 

df_population['birth year']= df_population["year"]-df_population['age']
df_population

Unnamed: 0,year,age,sex,people,generation,birth year
0,1850,0,1,1483789,,1850
1,1850,0,2,1450376,,1850
2,1850,5,1,1411067,,1845
3,1850,5,2,1359668,,1845
4,1850,10,1,1260099,,1840
...,...,...,...,...,...,...
565,2000,80,2,3221898,,1920
566,2000,85,1,970357,,1915
567,2000,85,2,1981156,,1915
568,2000,90,1,336303,,1910


In [5]:
# Flag rows whose birth year falls in the 1946-1964 range (both bounds inclusive)
is_boomer = df_population['birth year'].between(1946, 1964)

# Map the boolean flag to the two generation labels
df_population['generation'] = is_boomer.map({True: 'Baby Boomer', False: 'Other'})

df_population.sample(20)


Unnamed: 0,year,age,sex,people,generation,birth year
209,1910,45,2,2114930,Other,1865
151,1880,90,2,15451,Other,1790
41,1860,5,2,1778772,Other,1855
405,1960,60,2,3723839,Other,1900
410,1960,75,1,1308913,Other,1885
258,1920,75,1,417292,Other,1845
55,1860,40,2,616826,Other,1820
298,1930,80,1,246708,Other,1850
482,1980,65,1,3893510,Other,1915
562,2000,75,1,2912655,Other,1925


## ✅Q2 - Change the encoding for `sex`

As in our lab in class, the sex "Male" is encoded as the number `1` and "Female" is encoded as `2`. Modify the dataframe `df_population` to replace these codes with strings so that the legend comes out correctly when we create our plots (note that you can also map numbers to labels in Altair).
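One way to do the recoding in a single step is a dictionary lookup with `Series.map`. This is a hedged alternative sketch, not the approach used in the cells below; the small `sample` frame copies the first three rows of `df_population.head()` shown earlier, and the `sex_labels` name is an assumption made for illustration.

```python
import pandas as pd

# First three rows of the population data, as printed by df_population.head()
sample = pd.DataFrame({
    "year": [1850, 1850, 1850],
    "age": [0, 0, 5],
    "sex": [1, 2, 1],
    "people": [1483789, 1450376, 1411067],
})

# Hypothetical lookup table from numeric code to label
sex_labels = {1: "Male", 2: "Female"}

# map() looks each code up in the dictionary; codes missing from it become NaN
sample["sex"] = sample["sex"].map(sex_labels)

print(sample["sex"].tolist())  # ['Male', 'Female', 'Male']
```

Because unmapped codes turn into `NaN`, a quick `sample["sex"].isna().sum()` check afterwards would reveal any unexpected values.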

In [6]:
df_population.sample(10)

Unnamed: 0,year,age,sex,people,generation,birth year
146,1880,80,1,67526,Other,1800
251,1920,55,2,1674672,Other,1865
83,1870,15,2,2077248,Other,1855
523,1990,70,2,4610222,Other,1920
125,1880,25,2,1974241,Other,1855
200,1910,25,1,4257755,Other,1885
420,1970,5,1,10411131,Other,1965
127,1880,30,2,1596772,Other,1850
226,1910,90,1,23929,Other,1820
465,1980,20,2,10614947,Baby Boomer,1960


In [7]:
df_population['sex'] = df_population['sex'].astype(str)
df_population.dtypes

year           int64
age            int64
sex           object
people         int64
generation    object
birth year     int64
dtype: object

In [8]:
# Assign the result back rather than calling replace(inplace=True) on a column selection
df_population['sex'] = df_population['sex'].replace('1', 'Male')
df_population.sample(10)

Unnamed: 0,year,age,sex,people,generation,birth year
541,2000,20,2,9324244,Other,1980
112,1870,90,Male,8649,Other,1780
123,1880,20,2,2512803,Other,1860
100,1870,60,Male,399264,Other,1810
289,1930,55,2,2298614,Other,1875
132,1880,45,Male,1065973,Other,1835
153,1900,0,2,4589196,Other,1900
105,1870,70,2,165501,Other,1800
230,1920,5,Male,5789008,Other,1915
193,1910,5,2,4866139,Other,1905


In [9]:
# Assign the result back rather than calling replace(inplace=True) on a column selection
df_population['sex'] = df_population['sex'].replace('2', 'Female')
df_population.sample(10)

Unnamed: 0,year,age,sex,people,generation,birth year
149,1880,85,Female,31227,Other,1795
297,1930,75,Female,522716,Other,1855
319,1940,35,Female,4791809,Other,1905
142,1880,70,Male,255422,Other,1810
295,1930,70,Female,918509,Other,1860
438,1970,50,Male,5298312,Other,1920
225,1910,85,Female,62725,Other,1825
206,1910,40,Male,2860229,Other,1870
65,1860,65,Female,141405,Other,1795
453,1970,85,Female,658511,Other,1885


## ✅Q3 Juxtapose Bar Charts Horizontally 

Create a bar chart of the population distribution in the year 1960, and horizontally juxtapose the bar chart of the population distribution for the year 1990. Plot the total number of people (ignoring the `sex` attribute).

Note - You can slice the data to a given year with pandas before you pass it to Altair. (You can also do this in Altair with filters, but we haven't covered that yet.)

Encode membership of the Baby Boomer generation with color, using "#7D3C98" (purple) for the boomers and "#F4D03F" (gold) for other.

Fix the y axis so it is equal in both plots.
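A shared y-axis limit does not have to be hard-coded; it could be derived from the data by summing `people` per year and age group and taking the largest total. The sketch below shows the idea on the five rows printed by `df_population.head()`, so the resulting number applies to that sample only, not to the full dataset. The names `sample`, `totals`, and `y_max` are assumptions for illustration.

```python
import pandas as pd

# The five rows shown by df_population.head() earlier in the notebook
sample = pd.DataFrame({
    "year": [1850, 1850, 1850, 1850, 1850],
    "age": [0, 0, 5, 5, 10],
    "sex": [1, 2, 1, 2, 1],
    "people": [1483789, 1450376, 1411067, 1359668, 1260099],
})

# Total people per (year, age) bar, ignoring sex
totals = sample.groupby(["year", "age"])["people"].sum()

# The tallest bar sets the upper end of the shared y domain
y_max = int(totals.max())
print(y_max)  # 2934165 for this five-row sample
```

Passing the same `[0, y_max]` domain to the y scale of both charts would then keep the two axes equal, in the same way the fixed upper limit does in the cell below.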

In [10]:
# Create a bar chart of the population distribution in the year 1960, 
# and horizontally juxtapose the bar chart of the population distribution for the year 1990. 
# Plot the total number of people (ignoring the sex attribute).

domain = ['Baby Boomer', 'Other']
range_ = ['#7D3C98', '#F4D03F']  # named range_ to avoid shadowing the built-in range()

# Both charts use the same fixed y domain so their axes are directly comparable
base1 = alt.Chart(df_population[(df_population['year'] == 1960)]).mark_bar().encode(
    x=alt.X('age:O',title='Age'),
    y=alt.Y("sum(people):Q", scale=alt.Scale(domain=[0,24000000]), title='Number of People'),
    color=alt.Color('generation', scale=alt.Scale(domain=domain, range=range_))
).properties(title='Distribution of Ages in 1960')

base2 = alt.Chart(df_population[(df_population['year'] == 1990)]).mark_bar().encode(
    x=alt.X('age:O',title='Age'),
    y=alt.Y("sum(people):Q", scale=alt.Scale(domain=[0,24000000]), title='Number of People'),
    color=alt.Color('generation', scale=alt.Scale(domain=domain, range=range_))
).properties(title='Distribution of Ages in 1990')

base1 | base2

## ✅Q5 - Show the Population Change Over Time with a Slider

Now we have a snapshot of two different years next to each other, but what about creating a crude animation by controlling the year displayed with a slider?

Create a slider using [this example](https://altair-viz.github.io/gallery/us_population_over_time.html) to help guide you. Our plot will look similar, except we have not split our bar chart up by `sex` yet. Name the slider 'Select Year:' (this is controlled in `binding_range`, and not in the `selection_single` parameters).

Encode membership of the Baby Boomer generation with color, using "#7D3C98" (purple) for the boomers and "#F4D03F" (gold) for other.

Start the slider at 1900.

In [11]:
domain = ['Baby Boomer', 'Other']
range_ = ['#7D3C98', '#F4D03F']  # named range_ to avoid shadowing the built-in range()

# Slider bound to the 'year' field, starting at 1900
slider = alt.binding_range(min=1900, max=2000, step=10, name='Select Year:')
select_year = alt.selection_single(name="Select", fields=['year'],
                                   bind=slider, init={'year': 1900})

base = alt.Chart(df_population).mark_bar().encode(
    x=alt.X('age:O',title='Age'),
    y=alt.Y("sum(people):Q", scale=alt.Scale(domain=[0,24000000]), title='Number of People'),
    color=alt.Color('generation', scale=alt.Scale(domain=domain, range=range_)),
).properties(title='Population Distribution by Age in the USA'
).add_selection(
    select_year
).transform_filter(
    select_year
).configure_facet(
    spacing=8
)

base

## ✅Q6 - Linking

Let's take a closer look at just the year 2000 data, and find what the distribution of sex is for each individual age grouping.  Plot the distribution of ages as a bar chart for just the year 2000, and link a histogram that will plot the distribution of sex for the current selection.  It should default to no age group selected.  The histogram for the sex distribution should appear below the year 2000 data (vertically concatenated). When a bar on the top chart is selected, indicate its selection by turning the other bars light gray.  The histogram of the sex distribution below it should be a horizontal bar chart. 

Encode membership of the Baby Boomer generation with color, using "#7D3C98" (purple) for the boomers and "#F4D03F" (gold) for other.

In [12]:
domain = ['Baby Boomer', 'Other']
range_ = ['#7D3C98', '#F4D03F']  # named range_ to avoid shadowing the built-in range()

# Only the year 2000 rows are needed for both charts
df_2000 = df_population[df_population['year'] == 2000]

# Clicking a bar selects its age group; with nothing selected, all bars stay colored
click = alt.selection_multi(fields=['age'])

base = alt.Chart(df_2000).mark_bar().encode(
    x=alt.X('age:O', title='Age'),
    y=alt.Y("sum(people):Q", scale=alt.Scale(domain=[0,24000000]), title='Number of People'),
    # Selected bars keep their generation color; the rest turn light gray
    color=alt.condition(click,
                        alt.Color('generation:N',
                                  scale=alt.Scale(domain=domain, range=range_)),
                        alt.value('lightgray'))
).properties(title='Distribution of Ages in 2000').add_selection(
    click)

# Horizontal bar chart of sex, filtered to the age group(s) selected above
hist = alt.Chart(df_2000).mark_bar().encode(
    x=alt.X("sum(people):Q", title='Number of People'),
    y=alt.Y('sex', title='Sex'),
    color=alt.Color('generation', scale=alt.Scale(domain=domain, range=range_))
).transform_filter(
    click
).properties(title='Distribution of Sex for Above Age Selection')

base & hist

## ✅Q7 - Combine Q5 and Q6 to One Chart

In question 6, we linked the distribution of sex to the age selection for just the year 2000.  Let's visualize all the data by incorporating the year selection slider from question 5 so that you can select which year of data you are viewing. Retain the ability to just select one age group for the sex distribution, and default to no age group selected.

Add a tooltip so you can see exactly how many people are in the age range for the top "Distribution of Ages for the Selected Year" histogram. 

Encode membership of the Baby Boomer generation with color, using "#7D3C98" (purple) for the boomers and "#F4D03F" (gold) for other.

In [13]:
domain = ['Baby Boomer', 'Other']
range_ = ['#7D3C98', '#F4D03F']  # named range_ to avoid shadowing the built-in range()

# Clicking a bar selects its age group; with nothing selected, all bars stay colored
click = alt.selection_multi(fields=['age'])

# Slider bound to the 'year' field, starting at 1900
slider = alt.binding_range(min=1900, max=2000, step=10, name='Select Year:')
select_year = alt.selection_single(name="Select", fields=['year'],
                                   bind=slider, init={'year': 1900})

base = alt.Chart(df_population).mark_bar().encode(
    x=alt.X('age:O', title='Age'),
    y=alt.Y("sum(people):Q", scale=alt.Scale(domain=[0,24000000]), title='Number of People'),
    color=alt.condition(click,
                        alt.Color('generation:N',
                                  scale=alt.Scale(domain=domain, range=range_)),
                        alt.value('lightgray')),
    # Show the aggregated total for the bar rather than a single underlying row
    tooltip=[alt.Tooltip('age:O', title='Age'),
             alt.Tooltip('sum(people):Q', title='Number of People', format=',')]
).properties(title='Population Distribution by Ages for Selected Year'
).add_selection(
    select_year,
    click
).transform_filter(
    select_year
)

# Use the full dataset here so the sex distribution follows the slider year
# as well as the clicked age group
hist = alt.Chart(df_population).mark_bar().encode(
    x=alt.X("sum(people):Q", title='Number of People'),
    y=alt.Y('sex', title='Sex'),
    color=alt.Color('generation', scale=alt.Scale(domain=domain, range=range_))
).properties(title='Distribution of Sex for Above Age Selection'
).transform_filter(
    select_year
).transform_filter(
    click
)

base & hist

