<a href="https://colab.research.google.com/github/PUBPOL-2130/notebooks/blob/main/Week5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install -q geopandas networkx

In [None]:
%config InlineBackend.figure_formats = ["svg"]
import base64
import io
import json
import requests

import pandas as pd; pd.set_option('display.max_rows', 500)
import geopandas as gpd
import matplotlib.pyplot as plt
import networkx as nx

from shapely import Point

# Week 5: Graphs, flows, and arms

### Fetching SIPRI flow data

This is a case study in real-world data munging.  The data needs extensive pre-processing before it is usable, and we've chosen to show all of it in this notebook so that you get a realistic sense of the challenges of working with a new dataset.  (Of course, you can also just click through it and everything should work.)

We're going to be working with a dataset from SIPRI (the Stockholm International Peace Research Institute) that tracks how arms and weaponry move around the world.

They have created a unit of weaponry called TIV ("trend indicator value"), which is an attempt to measure comparable amounts of **military capability**. It's some mix of cost and lethality that is designed to make it possible to compare arms flows across type and time.

For more on TIV, see the SIPRI fact sheet ["Measuring International Arms Transfers"](https://www.sipri.org/sites/default/files/files/FS/SIPRIFS1212.pdf).

SIPRI doesn't intend for users to grab their full dataset all at once, but we've figured out how to pull the whole CSV without any filters.  This might take a minute or more to run.  If it times out, try it again!

**Note: We have received reports of errors in the next cell. Some of these errors are due to the fact that the SIPRI data may have gone offline. If you are encountering errors, you can use a backup version of the data.**

Please set the following variable `download_raw_data` to `False` if you want to use a backup version of the data.

In [None]:
# set the following variable to False to use the backup data
download_raw_data = False

In [None]:
if download_raw_data:
    raw_data = requests.post(
        "https://github.com/PUBPOL-2130/notebooks/blob/main/data/sipri_arms_transfers.json",
        json={"filters": []},
    ).json()

Next, the response payload turns out to be encoded in Base64, so we have to decode it before we can use it.  Base64 is a standard way to represent arbitrary bytes (including special characters) as plain ASCII text, which makes it convenient for embedding a file inside JSON.  It's important to notice the need to decode it, or subsequent manipulations won't work as expected.

When you run the next cell, we'll examine the first 15 lines of the decoded file, which turn out to be a text header that needs to be chopped off the CSV for the rest of our work.  So we split at each newline (`split("\n")`) before continuing.

In [None]:
if download_raw_data:
    csv_lines = base64.b64decode(raw_data["bytes"]).decode("iso-8859-1").split("\n")
    # an expression inside an `if` block is not echoed by the notebook, so print it
    print("\n".join(csv_lines[:15]))

In [None]:
if download_raw_data:
    # index of the CSV header row, which starts with "Recipient,"
    first_line_index = next(idx for idx, line in enumerate(csv_lines) if line.startswith("Recipient,"))
    print(first_line_index)
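
The decode-and-trim steps above can be tried on a small stand-in payload.  This is only a sketch: the text and the names `raw_text` and `payload` are made up for illustration, and it uses the standard-library `csv` module instead of pandas so that it runs without extra packages.

```python
import base64
import csv
import io

# Hypothetical payload mimicking the shape described above: a few lines of
# free-text header followed by the real CSV, all Base64-encoded.
raw_text = (
    "Example transfer register\n"
    "Generated for illustration only\n"
    "\n"
    "Recipient,Supplier,Year of order\n"
    "Atlantis,Ruritania,1980\n"
    "Ruritania,Freedonia,1981\n"
)
payload = {"bytes": base64.b64encode(raw_text.encode("iso-8859-1")).decode("ascii")}

# Decode Base64 -> bytes -> text, then split into lines.
lines = base64.b64decode(payload["bytes"]).decode("iso-8859-1").split("\n")

# Find the first line that looks like the CSV header row.
header_index = next(i for i, line in enumerate(lines) if line.startswith("Recipient,"))

# Parse everything from the header row onward.
rows = list(csv.DictReader(io.StringIO("\n".join(lines[header_index:]))))
print(header_index)
print(rows[0])
```

Searching for the header row by its first column name, rather than hard-coding a line count, keeps the code working if the length of the text header changes.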

Let's make the dataframe!  It's got a huge amount of information about the weapons, when ordered, whether new or second-hand, and comments about the conflict context.

In [None]:
if download_raw_data:
    arms_df = pd.read_csv(io.StringIO("\n".join(csv_lines[first_line_index:])))
else:
    # fall back to the backup copy hosted in the course repository
    arms_df = pd.read_json('https://github.com/PUBPOL-2130/notebooks/blob/main/data/sipri_arms_transfers.json?raw=true')
arms_df.to_csv("fulldata.csv")
arms_df

In [None]:
arms_df.groupby(["Recipient", "Supplier", "Year of order"]).count()

We can narrow down what we're looking at.  The next cell lets you toggle between looking for (a) [Mi-17s](https://en.wikipedia.org/wiki/Mil_Mi-17) that do not come from Russia/USSR, or (b) second-hand weapons.  Try it both ways.

In [None]:
arms_df[(arms_df["Weapon designation"] == "Mi-17") & (arms_df["Supplier"] != "Russia") & (arms_df["Supplier"] != "Soviet Union")]
# arms_df[arms_df["status"] == "Second hand"]

What other statuses are possible?

In [None]:
arms_df["status"].unique()

In [None]:
arms_df["Weapon designation"].value_counts()
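
The filters above all rely on boolean masks.  Here is a minimal sketch on a toy dataframe; the rows are invented, and only the column names mirror the ones used in this notebook.

```python
import pandas as pd

# Toy rows with made-up values; only the column names mirror the notebook.
toy_df = pd.DataFrame(
    {
        "Weapon designation": ["Mi-17", "Mi-17", "Mi-17", "F-16"],
        "Supplier": ["Russia", "Soviet Union", "Ruritania", "United States"],
        "status": ["New", "New", "Second hand", "New"],
    }
)

# Each comparison yields a boolean Series; combine them with & (and), | (or), ~ (not).
is_mi17 = toy_df["Weapon designation"] == "Mi-17"
not_russian = ~toy_df["Supplier"].isin(["Russia", "Soviet Union"])
mi17_elsewhere = toy_df[is_mi17 & not_russian]

second_hand = toy_df[toy_df["status"] == "Second hand"]
status_counts = toy_df["status"].value_counts()
print(mi17_elsewhere)
print(status_counts)
```

Using `isin` with a list is equivalent to chaining several `!=` comparisons, and it is easier to extend when more suppliers need to be excluded.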

# Mapping transfers


Visualizing the SIPRI dataset presents several problems:

* Recipients and suppliers are _usually_ countries, but this does not always hold. Notice that the dataset includes non-state actors (such as the [Houthi rebels](https://en.wikipedia.org/wiki/Houthis) or the [Mujahideen](https://en.wikipedia.org/wiki/Afghan_mujahideen)).  We will need one lat/long location per actor, so we'll try to find an appropriate capital city.
* This dataset goes back to the 1940s, so it references countries that no longer exist (such as [Biafra](https://en.wikipedia.org/wiki/Biafra) and [Yugoslavia](https://en.wikipedia.org/wiki/Yugoslavia)).
* There is some variability in names. For instance, "Ivory Coast" and "Cote d'Ivoire" are two different names for the same country.

For visualization purposes, we'll map these special cases to the nearest equivalent contemporary states. For instance, we'll map South Vietnam to Vietnam, even though Saigon (the former capital of South Vietnam, now Ho Chi Minh City) is not the capital of present-day Vietnam, which is Hanoi. We'll exclude flows to or from entities that do not clearly correspond to a single region (for instance, the United Nations).

The following dictionary associating SIPRI states to modern states was made by hand and represents a lot of labor!  You should at least glance through it to get a sense of some of the nitty-gritty involved here.  Note the comments at the end about non-obvious decisions (a good practice when cleaning data, so you'll know what you were thinking later on!) and the explicit list of exceptions.

In [None]:
capitals_map = {
    "ANC (South Africa)*": "South Africa",
    "Anti-Castro rebels (Cuba)*": "Cuba",
    "Amal (Lebanon)*": "Lebanon",
    "Armas (Guatemala)*": "Guatemala",
    "Contras (Nicaragua)*": "Nicaragua",
    "Darfur rebels (Sudan)*": "Sudan",
    "ELF (Ethiopia)*": "Ethiopia",
    "EPLF (Ethiopia)*": "Ethiopia",
    "FRELIMO (Portugal)*": "Portugal",
    "Haiti rebels*": "Haiti",
    "Hezbollah (Lebanon)*": "Lebanon",
    "Houthi rebels (Yemen)*": "Yemen",
    "Indonesia rebels*": "Indonesia",
    "Khmer Rouge (Cambodia)*": "Cambodia",
    "Kurdistan Regional Government (Iraq)*": "Iraq",
    "LF (Lebanon)*": "Lebanon",
    "LRA (Uganda)*": "Uganda",
    "LTTE (Sri Lanka)*": "Sri Lanka",
    "Libya GNC": "Libya",
    "Libya HoR*": "Libya",
    "Congo": "Congo (Brazzaville)",
    "DR Congo": "Congo (Kinshasa)",
    "MNLF (Philippines)*": "Philippines",
    "MPLA (Portugal)*": "Portugal",
    "MTA (Myanmar)*": "Myanmar",
    "Micronesia": "Federated States of Micronesia",
    "Mujahedin (Afghanistan)*": "Afghanistan",
    "NLA (Macedonia)*": "North Macedonia",
    "NTC (Libya)*": "Libya",
    "Northern Alliance (Afghanistan)*": "Afghanistan",
    "Northern Cyprus": "Cyprus",
    "PAIGC (Portugal)*": "Portugal",
    "PIJ (Israel/Palestine)*": "Israel",
    "PKK (Turkiye)*": "Turkey",
    "PLO (Israel)*": "Israel",
    "PRC (Israel/Palestine)*": "Israel",
    "Pathet Lao (Laos)*": "Laos",
    "Provisional IRA (UK)*": "United Kingdom",
    "RPF (Rwanda)*": "Rwanda",
    "RUF (Sierra Leone)*": "United Kingdom",
    "SLA (Lebanon)*": "Lebanon",
    "SNA (Somalia)*": "Somalia",
    "SPLA (Sudan)*": "Sudan",
    "Southern rebels (Yemen)*": "Yemen",
    "Syria rebels*": "Syria",
    "Turkiye": "Turkey",
    "UAE": "United Arab Emirates",
    "UIC (Somalia)*": "Somalia",
    "UNITA (Angola)*": "Angola",
    "Ukraine Rebels*": "Ukraine",
    "United States": "United States of America",
    "United Wa State (Myanmar)*": "Myanmar",
    "Viet Minh (France)*": "France",
    "Viet Nam": "Vietnam",
    "ZAPU (Zimbabwe)*": "Zimbabwe",
    "GUNT (Chad)*": "Chad",
    "FAN (Chad)*": "Chad",
    "FMLN (El Salvador)*": "El Salvador",
    "Gambia": "The Gambia",
    "Lebanon Palestinian rebels*": "Lebanon",
    "Cote d'Ivoire": "Ivory Coast",
    "Bahamas": "The Bahamas",
    "FNLA (Angola)*": "Angola",
    "Cabo Verde": "Cape Verde",
    "Timor-Leste": "East Timor",
    "Saint Vincent": "Saint Vincent and the Grenadines",
    "Guinea-Bissau": "Guinea Bissau",
    "South Vietnam": "Vietnam",  # Saigon is now Ho Chi Minh City
    "Viet Cong (South Vietnam)*": "Vietnam",
    "Hamas (Palestine)*": "Palestine",
    "Soviet Union": "Russia",
    "NATO**": "Belgium",  # NATO headquarters in Brussels
    'European Union**': "Belgium",  # EU headquarters in Brussels
    "OSCE**": "Austria",  # secretariat in Vienna
    "Yemen Arab Republic (North Yemen)": "Yemen",  # same capital as Yemen (Sanaa)
    "North Yemen": "Yemen",  # same capital as Yemen (Sanaa)
    "Czechoslovakia": "Czechia",  # same capital as the modern Czech Republic (Prague)
    "Yugoslavia": "Serbia",  # same capital as Serbia (Belgrade)
    "East Germany (GDR)": "Germany",  # for large-scale flow maps, approximate East Berlin with Berlin
    "Western Sahara": "Morocco",  # largely under Moroccan occupation,
}

exclude_flows = {
    "nan",
    "unknown rebel group*",
    "unknown recipient(s)",
    'unknown supplier(s)',
    "United Nations**",
    "Regional Security System**",
    "African Union**",
    '0.25',
    '3',
}
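
To see how the two structures above are meant to be used together, here is a small sketch.  The helper `resolve_entity` and the `_sample` names are hypothetical and not part of the notebook; the excerpts just copy a few entries from the dictionary and set above.

```python
# Small excerpts of the two lookup structures (same shape as the ones above).
capitals_map_sample = {
    "Soviet Union": "Russia",
    "Viet Nam": "Vietnam",
    "Houthi rebels (Yemen)*": "Yemen",
}
exclude_sample = {"United Nations**", "unknown supplier(s)"}


def resolve_entity(name):
    """Return the modern state used for mapping, or None if the flow is excluded."""
    if name in exclude_sample:
        return None
    # Names without a special case map to themselves.
    return capitals_map_sample.get(name, name)


for name in ["Soviet Union", "France", "United Nations**"]:
    print(name, "->", resolve_entity(name))
```

The `dict.get(name, name)` idiom is the same one used later in the notebook: only the exceptions need to be listed, and every other name passes through unchanged.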


Even more exceptions!  These lat/longs were added by hand.

In [None]:
# (long, lat) coordinates for capitals of entities not included in the places shapefile.
# Several of these entities are countries that no longer exist.
extra_capitals = {
    "Biafra": ("Enugu", 7.5139, 6.4483),  # 1967 capital (now part of Nigeria)
    "Bosnia-Herzegovina": ("Sarajevo", 18.4131, 43.8563),
    "South Yemen": ("Aden", 45.0176, 12.7906),
    "Katanga": ("Lubumbashi", 27.5026, -11.6876),
    "South Sudan": ("Juba",  31.5825, 4.8539),
    "Palestine": ("East Jerusalem", 35.217018, 31.771959),
    "Aruba": ("Oranjestad", -70.0353, 12.5227),
}

...and put into a format that's suitable for adding to the capitals list we're making.  Note the CRS (coordinate reference system) specified.

In [None]:
extra_capitals_gdf = gpd.GeoDataFrame(
    [
        {
            "adm0name": entity,
            "name": capital,
            "longitude": long,
            "latitude": lat,
            "geometry": Point(long, lat),
        }
        for entity, (capital, long, lat) in extra_capitals.items()
    ],
    crs="epsg:4326",
).set_index("adm0name")
extra_capitals_gdf

Next we're going to read in a simple shapefile from a cartographic society that gives locations for major world cities and identifies which ones are capital cities.  Note even more special-case handling:  some countries have more than one capital!  (South Africa famously has three.)

In [None]:
places_gdf = gpd.read_file("https://naciscdn.org/naturalearth/110m/cultural/ne_110m_populated_places_simple.zip")
capitals_gdf = places_gdf[places_gdf["adm0cap"] == 1].set_index("adm0name")
# force each nation to have exactly one capital
capitals_gdf = capitals_gdf[~capitals_gdf["name"].isin(["Sucre", "Yamoussoukro", "Bloemfontein", "Pretoria"])][["name", "latitude", "longitude", "geometry"]]
capitals_gdf = gpd.GeoDataFrame(pd.concat([capitals_gdf, extra_capitals_gdf]), crs="epsg:4326")

In [None]:
capitals_gdf.tail(100)
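
One way to check the "exactly one capital per nation" step is to look for duplicated index values.  This is a sketch on a made-up excerpt using plain pandas, not the real shapefile.

```python
import pandas as pd

# Made-up excerpt: one row per capital city, with some countries listed more than once.
toy_capitals = pd.DataFrame(
    {
        "adm0name": ["Bolivia", "Bolivia", "France", "South Africa", "South Africa", "South Africa"],
        "name": ["La Paz", "Sucre", "Paris", "Cape Town", "Pretoria", "Bloemfontein"],
    }
).set_index("adm0name")

# Countries that still have more than one capital listed.
duplicated = toy_capitals.index[toy_capitals.index.duplicated()].unique()
print(list(duplicated))

# Drop the extra capitals by name, then confirm there is one row per country.
dropped = ["Sucre", "Pretoria", "Bloemfontein"]
one_capital = toy_capitals[~toy_capitals["name"].isin(dropped)]
print(one_capital.index.is_unique)
```

A unique index matters here because later lookups by country name are expected to return a single point, not several rows.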

Hopefully this dataset gives you many ideas of things you could analyze.

Here are just a few...
* Study arms flows connected to a protracted conflict (e.g., Israel–Palestine)
* Pick a small window of time, like 1-2 years, and do a deeper look at that snapshot.
* The USSR collapsed circa 1991.  Do you see an immediate impact on arms flows?
* Consider second-hand arms sales.  How does weaponry "trickle down" from rich exporters to first and then second recipients.
* How does the popularity and regionality of a particular weapon (e.g., the Mi-17) change over time?

# Introduction to graphs

To make a visual that shows the movement, a natural data object is a **graph**, in the sense of discrete mathematics.  A graph is made up of vertices (also called nodes), with some of them connected by edges.  The edges can be weighted, and can have a direction.  So a flow can be represented as a weighted directed edge from vertex $v$ to vertex $w$.
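
The definition above can be sketched without any graph library.  This is a minimal illustration with invented vertices and weights, storing a weighted directed graph as a dictionary of dictionaries.

```python
from collections import defaultdict

# Hypothetical flows: (source vertex, target vertex, weight).
flows = [
    ("A", "B", 10.0),
    ("A", "C", 2.5),
    ("B", "A", 1.0),
]

# Adjacency mapping: graph[v][w] is the weight of the directed edge v -> w.
graph = defaultdict(dict)
for v, w, weight in flows:
    graph[v][w] = weight

# Direction matters: A -> B and B -> A are different edges.
print(graph["A"]["B"], graph["B"]["A"])

# Total outgoing weight per vertex.
out_weight = {v: sum(neighbors.values()) for v, neighbors in graph.items()}
print(out_weight)
```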

We'll set this up with a dataframe giving only a limited number of attributes per state, then a secondary dataframe grouping by year.

In [None]:
flowmap_arms_df = arms_df[~arms_df["Supplier"].isin(exclude_flows) & ~arms_df["Recipient"].isin(exclude_flows)].rename(
    columns={
        "Year of order": "order_year",
        "Recipient": "recipient",
        "Supplier": "supplier",
        "SIPRI TIV for total order": "order_sipri_tiv"
    }
)
flowmap_arms_df["order_year"] = flowmap_arms_df["order_year"].astype(int)
flowmap_arms_df = flowmap_arms_df[flowmap_arms_df["order_year"] >= 1950]

In [None]:
# select the TIV column before summing so that non-numeric columns are not aggregated
orders_by_year_df = flowmap_arms_df.groupby(["order_year", "recipient", "supplier"])["order_sipri_tiv"].sum()
orders_by_year_df
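
Grouping by three columns produces a Series with a three-level index, which is what makes the per-year lookup in the next section possible.  Here is a sketch on made-up orders; only the column names follow the renamed dataframe above.

```python
import pandas as pd

# Made-up orders; the column names follow the renamed dataframe above.
toy_orders = pd.DataFrame(
    {
        "order_year": [1980, 1980, 1980, 1981],
        "recipient": ["B", "B", "C", "B"],
        "supplier": ["A", "A", "A", "A"],
        "order_sipri_tiv": [1.0, 2.0, 5.0, 4.0],
    }
)

# Summing within each (year, recipient, supplier) group gives a Series
# indexed by a three-level MultiIndex.
by_year = toy_orders.groupby(["order_year", "recipient", "supplier"])["order_sipri_tiv"].sum()

# Selecting on the first level (the year) returns that year's flows.
flows_1980 = by_year.loc[1980].reset_index()
print(flows_1980)
```

The two 1980 orders from A to B are collapsed into a single flow of 3.0, which is the behavior we want for drawing one edge per supplier-recipient pair.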

## Example
Let's visualize arms flows to and from the U.S. in a particular year by constructing a graph.

In [None]:
year = 1980
country = "United States"

In [None]:
filtered_orders_df = orders_by_year_df[year].reset_index()
filtered_orders_df = filtered_orders_df[
    (filtered_orders_df["recipient"] == country)
    | (filtered_orders_df["supplier"] == country)
]
filtered_orders_df.head(10)

We're going to use a basic graph library in Python called NetworkX.  We imported it above as `nx`.  Our graph will have directed edges FROM the supplier TO the recipient of an arms shipment.

In [None]:
orders_graph = nx.DiGraph()

for _, row in filtered_orders_df.iterrows():
    orders_graph.add_edge(
        row.supplier,
        row.recipient,
        weight=row.order_sipri_tiv,
    )

Where did the U.S. send arms that year (or receive them from)?

In [None]:
orders_graph.edges
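
Beyond listing edges, a directed graph can answer "who did this node send to?" and "who did it receive from?" directly.  This is a sketch on invented flows, using standard NetworkX `DiGraph` methods; the TIV numbers are made up.

```python
import networkx as nx

# Made-up flows for illustration: (supplier, recipient, TIV).
toy_flows = [
    ("United States", "B", 100.0),
    ("United States", "C", 40.0),
    ("D", "United States", 5.0),
]

toy_graph = nx.DiGraph()
for supplier, recipient, tiv in toy_flows:
    toy_graph.add_edge(supplier, recipient, weight=tiv)

# Successors are the recipients of a node's exports; predecessors are its suppliers.
exports_to = sorted(toy_graph.successors("United States"))
imports_from = sorted(toy_graph.predecessors("United States"))

# Weighted degree sums the edge weights instead of counting edges.
total_out = toy_graph.out_degree("United States", weight="weight")
total_in = toy_graph.in_degree("United States", weight="weight")
print(exports_to, imports_from, total_out, total_in)
```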

Next we'll make a static map.  You can play around with thickness and color of the directed edges.

In [None]:
def cast_point(p):
    """Convert a shapely Point to an (x, y) tuple, i.e. (longitude, latitude)."""
    return (p.x, p.y)

In [None]:
fig, ax = plt.subplots(figsize=(20, 40))
ax.axis("off")
places_gdf.plot(ax=ax, edgecolor="0.1", linewidth=1, color="#e1f1fd", alpha=0.5)

pos = dict(capitals_gdf.geometry)

for edge in orders_graph.edges(data='weight'):
    if edge[0] in exclude_flows or edge[1] in exclude_flows:
        continue

    edge_pos = {
        edge[0]: cast_point(pos[capitals_map.get(edge[0], edge[0])]),
        edge[1]: cast_point(pos[capitals_map.get(edge[1], edge[1])]),
    }
    nx.draw_networkx_edges(orders_graph, edge_pos, edgelist=[edge], width=0.001 * edge[2], ax=ax, edge_color="blue")

## Advanced flow visualizations with FlowmapBlue

Now we'll use a very cool package called FlowmapBlue to create beautiful and interactive flowmaps.

In [None]:
!pip install "git+https://github.com/PUBPOL-2130/notebooks#egg=pubpol2130&subdirectory=lib"

The call to `google_sheets_credentials()` below will pop up a dialog asking for permission to generate Google Sheets credentials using your Google login.  If you're doing this in Colab, it should be particularly seamless.

In [None]:
from pubpol2130 import google_sheets_credentials, generate_flow_sheet

In [None]:
flowmap_locations_df = pd.DataFrame(
    [
        {
            "id": loc,
            "name": loc,
            "lat": capitals_gdf.loc[capitals_map.get(loc, loc), "latitude"],
            "lon": capitals_gdf.loc[capitals_map.get(loc, loc), "longitude"],
        }
        for loc in set(flowmap_arms_df["supplier"]) | set(flowmap_arms_df["recipient"])
    ]
)
flowmap_locations_df.head(5)

If you don't want to go through the whole Google permissions thing, read this [Medium article](https://medium.com/@a.marenkov/how-to-get-credentials-for-google-sheets-456b7e88c430) for info about getting credentials.

In [None]:
sheet_creds = google_sheets_credentials()

In [None]:
flow_sheet = generate_flow_sheet(
    sheet_creds=sheet_creds,
    locations_df=flowmap_locations_df,
    created_by_name="",  # YOUR NAME HERE
    created_by_email="", # YOUR EMAIL HERE
    data_source_name="SIPRI Arms Transfers Database",
    data_source_url="https://www.sipri.org/databases/armstransfers",
    incoming_tooltip="Inbound arms transfers (TIV)",
    outgoing_tooltip="Outbound arms transfers (TIV)",
    flow_tooltip="Arms transfer (TIV)",
    total_unit="TIVs",
    sheet_title="PUBPOL 2130: SIPRI arms transfers (orders over time)",
    flow_title="SIPRI Arms Transfers Database: orders over time",
    flows={
        f"Year: {year}": year_df.reset_index().rename(columns={
            "supplier": "origin",
            "recipient": "dest",
            "order_sipri_tiv": "count",
        })
        for year, year_df in orders_by_year_df.groupby(level=0)
    }
)
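
The `flows` argument above is built with a dictionary comprehension over the yearly groups.  Here is a sketch of just that reshaping step on a made-up Series; `toy_totals` is hypothetical, and nothing here calls `generate_flow_sheet` or touches Google Sheets.

```python
import pandas as pd

# Made-up Series with the same three-level index as the yearly totals above.
toy_totals = pd.Series(
    [3.0, 5.0, 4.0],
    index=pd.MultiIndex.from_tuples(
        [(1980, "B", "A"), (1980, "C", "A"), (1981, "B", "A")],
        names=["order_year", "recipient", "supplier"],
    ),
    name="order_sipri_tiv",
)

# One dataframe per year, with columns renamed to origin/dest/count.
flows_by_year = {
    f"Year: {year}": year_s.reset_index().rename(
        columns={"supplier": "origin", "recipient": "dest", "order_sipri_tiv": "count"}
    )
    for year, year_s in toy_totals.groupby(level=0)
}
print(list(flows_by_year))
print(flows_by_year["Year: 1980"])
```

Each dictionary key becomes the label for one year's flows, so the result has one table per year rather than one long table.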

In [None]:
print(flow_sheet.url)

In [None]:
print(f"https://www.flowmap.blue/{flow_sheet.url.split('/')[-1]}")
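
The last cell builds the flowmap link from the final path segment of the sheet URL.  Here is a sketch of that string manipulation, assuming the URL ends with the spreadsheet ID; the URL and ID below are placeholders, not a real sheet.

```python
# Hypothetical spreadsheet URL; the ID is a placeholder, not a real sheet.
sheet_url = "https://docs.google.com/spreadsheets/d/EXAMPLE_SHEET_ID"

# The last path segment is taken to be the spreadsheet ID.
sheet_id = sheet_url.split("/")[-1]
flowmap_url = f"https://www.flowmap.blue/{sheet_id}"
print(flowmap_url)
```

If a URL carried a trailing segment after the ID (such as `/edit`), this approach would pick up that segment instead, so it is worth printing the result and checking it.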

# Homework 4, due Tuesday March 4, 1:25pm

Your homework this week starts with creating a flowmap.  Then you should (1) choose a question about arms flows, (2) read a SIPRI background paper connected to that topic, and (3) make a data product (typically a plot, like in previous weeks) to illustrate a key fact of your choice.

To access SIPRI's background papers, go to their [publications page](https://www.sipri.org/publications), select "SIPRI background papers" as the publication type, and use a keyword to get closer to your topic.  (Note that most of these are regional rather than related to particular weapons.)