# Instructions

Your submission will be tested with the code tester. It is important to follow these instructions to ensure your work tests properly.

- Do not change the content of the cells under __SETUP__ and __TESTS__
- Work only in the __YOUR WORK__ area
- Rename the notebook with your group number at the end (substitute XX with your group number).
- Assign the results of each numbered question to the appropriate test variable. For example, the answer of `1.` should be assigned to `test_1`
- Rounding: use the supplied function `hround` to round decimal numbers when instructed. It's important to use this function because there are [multiple ways to round numbers in Python](https://www.knowledgehut.com/blog/programming/python-rounding-numbers) and they may not result in the same value that the tester is testing against.
- Ensure you run the cells under __SETUP__ before you run your work
- Before you submit your work, clean up your notebook. Your notebook has to run without errors in order to be tested. The easiest way to check this is `Kernel->Restart & Run All`
- Answers are provided along with this notebook in eLC (look for a picture named `solution_key`) for your convenience
- You will need to write a program to calculate the answers. Setting the answers to their correct values without solving them is considered *hardcoding* and will result in a zero grade for the assignment as well as a potential academic honesty violation.
- You can also test your submission using [the online code tester](https://notebook-tester.safadi-puzzler.com/)
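
The rounding note above can be illustrated with a short sketch. It copies the definition of `hround` from the __SETUP__ section and assumes `scenario = 0`, as set there; the sample numbers are made up for illustration only.

```python
# Assumes scenario = 0, the value set in the SETUP section of this notebook
scenario = 0

def hround(number):
    # Same definition as the supplied helper: keep (3 - scenario) decimal places
    return round(number, 3 - scenario)

# Python's built-in round() rounds exact halves to the nearest even integer,
# which is one reason different rounding approaches can disagree
half_cases = [round(0.5), round(1.5), round(2.5)]
print(half_cases)

# With scenario = 0, hround keeps three decimal places
rounded = hround(3.14159)
print(rounded)
```

Other rounding approaches may treat ties differently, so always use `hround` when a question asks for rounding.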


# SETUP

In [None]:
import pandas as pd
import numpy as np
import altair as alt

In [None]:
# DO NOT EDIT OR CHANGE THE CONTENT OF THIS CELL
scenario = 0
renderer = 'default'

In [None]:
# increase the number of Altair maximum allowed rows in dataset
alt.data_transformers.enable('default', max_rows=None)


In [None]:
alt.renderers.enable(renderer)

In [None]:
def hround(number):
    return round(number, 3 - scenario)

In [None]:
test_1=test_2=test_3=test_4=test_5=test_6=test_7=test_8=test_9=test_10=0.0
test_11=test_12=test_13=test_14=test_15=test_16=test_17=test_18=test_19=test_20=0.0

In this homework, we are going to use data from [Seshat: Global History Databank](https://seshatdatabank.info/). This data bank "collects what is currently known about the social and political organization of human societies and how civilizations have evolved over time."

This data has been used in several research projects to test hypotheses about human development and history. We are going to use a subset of this data that was curated in [Turchin et al. (2017)](https://doi.org/10.1073/pnas.1708800115). You do not need to read the paper, but you are strongly encouraged to do so.

The CSV file `pnas.1708800115.sd01.csv` contains data about the following variables:

- `NGA`: Natural Geographic Area. Each NGA is defined spatially by a boundary drawn on the world map that encloses an area delimited by naturally occurring geographical features (for example, river basins, coastal plains, valleys, and islands). The extent of an NGA does not change over time
- `PolID`: unique id of the polity
- `Time`: year in Gregorian/Julian calendar notation (CE years are positive, BCE years are negative), in 100-year increments
- `PolPop`: polity population
- `PolTerr`: polity territory
- `CapPop`: capital population
- `levels`: hierarchy levels
- `government`: features of governance (aggregate of data about Officers, Bureaucrats, Court, Merit Promotion, Soldiers, Lawyers, Judges, Government buildings, Priests, Exam system, and Legal code)
- `infrastr`: features of infrastructure (aggregate of data about Bridges, Canals, Ports, Mines, Roads, Irrigation, Market, Food storage, and Water supply)
- `writing`: features of information systems (aggregate of data about Mnemonic, Script, Lists, Alphabet, Records, Non-phonetic)
- `texts`: features of written records and texts (Calendar, Science literature, Sacred texts, History, Religious literature, Fiction, Practical literature, Philosophy)
- `money`: features of the monetary system (aggregate of Articles, Tokens, Metals, Foreign coins, Indigenous coins, Paper currency)



The CSV file `NGAClassification.csv` classifies each `NGA` along two categories: `Region`, the larger geographic area (e.g., Africa, Europe), and `Period`, the relative period when the `NGA` reached considerable social complexity (Early, Intermediate, or Late).
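
As a rough sketch of how such a classification table could be attached to the observations, the example below does a left merge on `NGA` with pandas. The two small frames are toy stand-ins, not the real files: the `Time` values and `Period` labels are placeholders, and the assumption that the classification file has a column named `NGA` is not confirmed by the text above. This notebook itself uses a dictionary lookup instead.

```python
import pandas as pd

# Toy stand-in for the observations file (values are made up)
observations = pd.DataFrame({
    "NGA": ["Latium", "Latium", "Kansai"],
    "Time": [-500, -400, 300],
})

# Toy stand-in for the classification file; Period labels are placeholders,
# not the real classifications
classification = pd.DataFrame({
    "NGA": ["Latium", "Kansai"],
    "Region": ["Europe", "East Asia"],
    "Period": ["placeholder A", "placeholder B"],
})

# A left merge keeps every observation row and adds Region and Period to it
merged = observations.merge(classification, on="NGA", how="left")
print(merged)
```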


In [None]:
data = pd.read_csv('pnas.1708800115.sd01.csv').iloc[scenario:]
data.head(2)

In [None]:
regions_dict = {'Upper Egypt': 'Africa',
 'Niger Inland Delta': 'Africa',
 'Ghanaian Coast': 'Africa',
 'Latium': 'Europe',
 'Paris Basin': 'Europe',
 'Iceland': 'Europe',
 'Sogdiana': 'Central Eurasia',
 'Orkhon Valley': 'Central Eurasia',
 'Lena River Valley': 'Central Eurasia',
 'Yemeni Coastal Plain': 'Southwest Asia',
 'Susiana': 'Southwest Asia',
 'Konya Plain': 'Southwest Asia',
 'Deccan': 'South Asia',
 'Garo Hills': 'South Asia',
 'Kachi Plain': 'South Asia',
 'Kapuasi Basin': 'Southeast Asia',
 'Central Java': 'Southeast Asia',
 'Cambodian Basin': 'Southeast Asia',
 'Southern China Hills': 'East Asia',
 'Middle Yellow River Valley': 'East Asia',
 'Kansai': 'East Asia',
 'Finger Lakes': 'North America',
 'Cahokia': 'North America',
 'Valley of Oaxaca': 'North America',
 'Cuzco': 'South America',
 'Lowland Andes': 'South America',
 'North Colombia': 'South America',
 'Big Island Hawaii': 'Ocenaia-Australia',
 'Chuuk Islands': 'Ocenaia-Australia',
 'Oro PNG': 'Ocenaia-Australia'}

In [None]:
# create a column region that contains the region for each row
data['Region'] = data['NGA'].map(regions_dict)
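
One thing worth knowing about `Series.map` with a dictionary is that keys missing from the dictionary become `NaN` rather than raising an error. The sketch below shows a quick check for unmapped names on a toy frame; `regions_subset`, `toy`, and the name `"Unknown Place"` are hypothetical and only serve the illustration.

```python
import pandas as pd

# Hypothetical two-entry subset of the region dictionary
regions_subset = {"Latium": "Europe", "Kansai": "East Asia"}

# Toy frame with one NGA name that is deliberately absent from the dictionary
toy = pd.DataFrame({"NGA": ["Latium", "Kansai", "Unknown Place"]})
toy["Region"] = toy["NGA"].map(regions_subset)

# Names without a dictionary entry end up with a missing Region
unmapped = sorted(toy.loc[toy["Region"].isna(), "NGA"].unique())
print(unmapped)
```

Running the same check on `data` after the mapping cell should give an empty list if every `NGA` has an entry in `regions_dict`.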

# Questions

Using the Altair library, create the following visualizations:

1. Create a bar chart showing the number of observations (count) per Region

2. Create a bar chart showing the average value of money per Region

3. Create a line plot showing the average value of PolPop over Time 

4. Build on the previous question but use color to draw a line per Region 

5. Create a line plot showing the average value of PolTerr over Time 

6. Put (3) and (5) side by side  

7. Create a histogram of PolPop, using 20 bins

8. Create a similar histogram for each variable in PolPop, PolTerr, CapPop, levels, government, infrastr, writing, texts and money. Organize all the histograms in a 3x3 grid

9. Create a plot representing the mean of money per region (y-axis) and time (x-axis). Use circle marks and color to represent the region and size to represent the mean of money

10. Create a similar plot using a stacked area chart. Use `center` as the value of the `stack` parameter of the y encoding

11. Create a scatter plot of money vs. PolPop and use circle as a graphical mark

12. Repeat the previous chart per `Region` using the column channel

13. Create a bar chart with a bar from the min of levels to the max of levels per NGA

14. Create a tick chart where ticks are the mean levels per NGA. The ticks should be red

15. Combine the two previous charts putting the ticks on top of the bars

16. Create an area chart with the area extending from the q1 of levels to the q3 of levels on the y-axis. Put Time on the x-axis

17. Create a red line with the median value of levels on the y-axis and Time on the x-axis

18. Put the line on top of the area chart

19. Create a heatmap (use rect mark) showing the average value of government variable in each region over Time

20. Repeat the previous chart for each `irep` using the row channel


# Your Work Here

In [None]:
# Question 1
test_1 = alt.Chart(data).mark_bar().encode(
    x='count(Region)',
    y='Region'
)

In [None]:
# Question 2
test_2 = alt.Chart(data).mark_bar().encode(
    x='mean(money)',
    y='Region'
)

In [None]:
# Question 3
test_3 = alt.Chart(data).mark_line().encode(
    x='Time',
    y='mean(PolPop)'
)

In [None]:
# Question 4
test_4 = alt.Chart(data).mark_line().encode(
    x='Time',
    y='mean(PolPop)',
    color='Region'
)

In [None]:
# Question 5
test_5 = alt.Chart(data).mark_line().encode(
    x='Time',
    y='mean(PolTerr)'
)

In [None]:
# Question 6
# reuse the line charts from Questions 3 and 5 and concatenate them horizontally
test_6 = test_3 | test_5

In [None]:
# Question 7
test_7 = alt.Chart(data).mark_bar().encode(
    alt.X('PolPop', bin=alt.Bin(maxbins=20)),
    y='count()'
)

In [None]:
# Question 8
test_8 = alt.Chart(data).mark_bar().encode(
    x=alt.X(alt.repeat('repeat'), type='quantitative', bin=alt.Bin(maxbins=20)),
    y='count()'
).repeat(
    repeat=['PolPop', 'PolTerr', 'CapPop', 'levels', 'government', 'infrastr', 'writing', 'texts', 'money'], 
    columns=3
)

In [None]:
# Question 9
new_df = data.groupby(['Region', 'Time'])['money'].mean().reset_index()
test_9 = alt.Chart(new_df).mark_circle().encode(
    x='Time',
    y='Region',
    size='money',
    color='Region'
)
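
A minimal sketch of what the `groupby` step in Question 9 produces, using a made-up toy frame rather than the homework data: each (`Region`, `Time`) pair collapses to a single row holding the mean of `money`, and `reset_index()` turns the group keys back into ordinary columns that Altair can encode.

```python
import pandas as pd

# Toy data: two Europe rows share the same Time and are averaged together
toy = pd.DataFrame({
    "Region": ["Europe", "Europe", "Africa"],
    "Time": [0, 0, 0],
    "money": [2.0, 4.0, 5.0],
})

# One row per (Region, Time) pair, with group keys restored as columns
toy_means = toy.groupby(["Region", "Time"])["money"].mean().reset_index()
print(toy_means)
```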

In [None]:
# Question 10
test_10 = alt.Chart(new_df).mark_area().encode(
    x='Time',
    y=alt.Y('money', stack='center'),
    color='Region'
)

In [None]:
# Question 11
test_11 = alt.Chart(data).mark_circle().encode(
    x='PolPop',
    y='money'
)

In [None]:
# Question 12
test_12 = alt.Chart(data).mark_circle().encode(
    x='PolPop',
    y='money',
    column='Region'
)

In [None]:
# Question 13
test_13 = alt.Chart(data).mark_bar().encode(
    x='NGA',
    y='min(levels)',
    y2='max(levels)'
)

In [None]:
# Question 14
test_14 = alt.Chart(data).mark_tick(color='red').encode(
    x='NGA',
    y='mean(levels)'
)

In [None]:
# Question 15
test_15 = test_13 + test_14

In [None]:
# Question 16
test_16 = alt.Chart(data).mark_area().encode(
    x='Time',
    y='q1(levels)',
    y2='q3(levels)'
)

In [None]:
# Question 17
test_17 = alt.Chart(data).mark_line(color='red').encode(
    x='Time',
    y='median(levels)'
)

In [None]:
# Question 18
test_18 = test_16 + test_17

In [None]:
# Question 19
test_19 = alt.Chart(data).mark_rect().encode(
    x='Time',
    y='Region',
    color='mean(government)'
)

In [None]:
# Question 20
test_20 = alt.Chart(data).mark_rect().encode(
    x='Time',
    y='Region',
    color='mean(government)'
).facet(row='irep')

# TESTS

In [None]:
### TEST 1
test_1

In [None]:
## TEST 2
test_2

In [None]:
## TEST 3
test_3

In [None]:
## TEST 4
test_4

In [None]:
## TEST 5
test_5

In [None]:
## TEST 6
test_6

In [None]:
## TEST 7
test_7

In [None]:
## TEST 8
test_8

In [None]:
## TEST 9
test_9

In [None]:
## TEST 10
test_10

In [None]:
## TEST 11
test_11

In [None]:
## TEST 12
test_12

In [None]:
## TEST 13
test_13

In [None]:
## TEST 14
test_14

In [None]:
## TEST 15
test_15

In [None]:
## TEST 16
test_16

In [None]:
## TEST 17
test_17

In [None]:
## TEST 18
test_18

In [None]:
## TEST 19
test_19

In [None]:
## TEST 20
test_20