<a href="https://colab.research.google.com/github/PUBPOL-2130/notebooks/blob/main/helper-notebooks/Week5-supplemental.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install -q geopandas networkx

In [None]:
%config InlineBackend.figure_formats = ["svg"]
import base64
import io
import json
import requests

import pandas as pd; pd.set_option('display.max_rows', 500)
import geopandas as gpd
import matplotlib.pyplot as plt
import networkx as nx

from shapely import Point

# Week 5: Supplemental Materials

In this notebook, we first load the SIPRI data and then cut it down considerably to avoid timeout errors when writing it to a Google Sheet. This may help when you build your own Google Sheet for a flow map with FlowmapBlue.

## Data cleaning

We'll start with preparing our data exactly as in the Week 5 lab.

**Note: We have received reports of errors in the data-loading cells below. Some of these errors occur because the SIPRI data may have gone offline. If you encounter errors, you can use a backup version of the data.**

Set the variable `download_raw_data` to `False` to use the backup version of the data (this is the default in this notebook), or to `True` to download the raw data.

In [None]:
# set the following variable to False if using backup data
download_raw_data = False

In [None]:
# loading raw data
if download_raw_data:
    raw_data = requests.post(
        "https://github.com/PUBPOL-2130/notebooks/blob/main/data/sipri_arms_transfers.json",
        json={"filters": []},
    ).json()

In [None]:
# converting from base64
if download_raw_data:
    csv_lines = base64.b64decode(raw_data["bytes"]).decode("iso-8859-1").split("\n")
    print(csv_lines[:15])  # a bare expression inside `if` is not displayed, so print it

In [None]:
if download_raw_data:
    first_line_index = next(idx for idx, line in enumerate(csv_lines) if line.startswith("Recipient,"))
    print(first_line_index)  # a bare expression inside `if` is not displayed, so print it
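
To see what the decoding and header-finding steps do without contacting any server, here is a minimal sketch on a made-up payload. The sample text and the `fake_response` dictionary are assumptions for illustration; only the `"bytes"` key and the `"Recipient,"` header prefix come from the cells above.

```python
import base64
import csv
import io

# Hypothetical stand-in for the JSON response: a base64-encoded CSV
# preceded by two free-text preamble lines.
sample_text = (
    "Some preamble\n"
    "Another note\n"
    "Recipient,Supplier,Year of order\n"
    "India,France,2016\n"
)
fake_response = {
    "bytes": base64.b64encode(sample_text.encode("iso-8859-1")).decode("ascii")
}

# Decode the payload back into lines of text.
lines = base64.b64decode(fake_response["bytes"]).decode("iso-8859-1").split("\n")

# Find the first line that looks like the CSV header.
header_idx = next(
    idx for idx, line in enumerate(lines) if line.startswith("Recipient,")
)

# Parse everything from the header onward.
rows = list(csv.DictReader(io.StringIO("\n".join(lines[header_idx:]))))
print(header_idx, rows)
```

The notebook uses `pd.read_csv` for the last step; `csv.DictReader` is swapped in here only so the sketch runs with the standard library alone.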

In [None]:
# setting up the data frame, saving locally
if download_raw_data:
    arms_df = pd.read_csv(io.StringIO("\n".join(csv_lines[first_line_index:])))
else:
    # backup copy of the data hosted in the course repository
    arms_df = pd.read_json('https://github.com/PUBPOL-2130/notebooks/blob/main/data/sipri_arms_transfers.json?raw=true')
arms_df.to_csv("fulldata.csv")
# the last expression must sit at the top level of the cell to be displayed
arms_df

In [None]:
# data cleaning -- mapping transfers to general locations we can use
capitals_map = {
    "ANC (South Africa)*": "South Africa",
    "Anti-Castro rebels (Cuba)*": "Cuba",
    "Amal (Lebanon)*": "Lebanon",
    "Armas (Guatemala)*": "Guatemala",
    "Contras (Nicaragua)*": "Nicaragua",
    "Darfur rebels (Sudan)*": "Sudan",
    "ELF (Ethiopia)*": "Ethiopia",
    "EPLF (Ethiopia)*": "Ethiopia",
    "FRELIMO (Portugal)*": "Portugal",
    "Haiti rebels*": "Haiti",
    "Hezbollah (Lebanon)*": "Lebanon",
    "Houthi rebels (Yemen)*": "Yemen",
    "Indonesia rebels*": "Indonesia",
    "Khmer Rouge (Cambodia)*": "Cambodia",
    "Kurdistan Regional Government (Iraq)*": "Iraq",
    "LF (Lebanon)*": "Lebanon",
    "LRA (Uganda)*": "Uganda",
    "LTTE (Sri Lanka)*": "Sri Lanka",
    "Libya GNC": "Libya",
    "Libya HoR*": "Libya",
    "Congo": "Congo (Brazzaville)",
    "DR Congo": "Congo (Kinshasa)",
    "MNLF (Philippines)*": "Philippines",
    "MPLA (Portugal)*": "Portugal",
    "MTA (Myanmar)*": "Myanmar",
    "Micronesia": "Federated States of Micronesia",
    "Mujahedin (Afghanistan)*": "Afghanistan",
    "NLA (Macedonia)*": "North Macedonia",
    "NTC (Libya)*": "Libya",
    "Northern Alliance (Afghanistan)*": "Afghanistan",
    "Northern Cyprus": "Cyprus",
    "PAIGC (Portugal)*": "Portugal",
    "PIJ (Israel/Palestine)*": "Israel",
    "PKK (Turkiye)*": "Turkey",
    "PLO (Israel)*": "Israel",
    "PRC (Israel/Palestine)*": "Israel",
    "Pathet Lao (Laos)*": "Laos",
    "Provisional IRA (UK)*": "United Kingdom",
    "RPF (Rwanda)*": "Rwanda",
    "RUF (Sierra Leone)*": "Sierra Leone",
    "SLA (Lebanon)*": "Lebanon",
    "SNA (Somalia)*": "Somalia",
    "SPLA (Sudan)*": "Sudan",
    "Southern rebels (Yemen)*": "Yemen",
    "Syria rebels*": "Syria",
    "Turkiye": "Turkey",
    "UAE": "United Arab Emirates",
    "UIC (Somalia)*": "Somalia",
    "UNITA (Angola)*": "Angola",
    "Ukraine Rebels*": "Ukraine",
    "United States": "United States of America",
    "United Wa State (Myanmar)*": "Myanmar",
    "Viet Minh (France)*": "France",
    "Viet Nam": "Vietnam",
    "ZAPU (Zimbabwe)*": "Zimbabwe",
    "GUNT (Chad)*": "Chad",
    "FAN (Chad)*": "Chad",
    "FMLN (El Salvador)*": "El Salvador",
    "Gambia": "The Gambia",
    "Lebanon Palestinian rebels*": "Lebanon",
    "Cote d'Ivoire": "Ivory Coast",
    "Bahamas": "The Bahamas",
    "FNLA (Angola)*": "Angola",
    "Cabo Verde": "Cape Verde",
    "Timor-Leste": "East Timor",
    "Saint Vincent": "Saint Vincent and the Grenadines",
    "Guinea-Bissau": "Guinea Bissau",
    "South Vietnam": "Vietnam",  # Saigon is now Ho Chi Minh City
    "Viet Cong (South Vietnam)*": "Vietnam",
    "Hamas (Palestine)*": "Palestine",
    "Soviet Union": "Russia",
    "NATO**": "Belgium",  # NATO headquarters in Brussels
    'European Union**': "Belgium",  # EU headquarters in Brussels
    "OSCE**": "Austria",  # secretariat in Vienna
    "Yemen Arab Republic (North Yemen)": "Yemen",  # same capital as Yemen (Sanaa)
    "North Yemen": "Yemen",  # same capital as Yemen (Sanaa)
    "Czechoslovakia": "Czechia",  # same capital as the modern Czech Republic (Prague)
    "Yugoslavia": "Serbia",  # same capital as Serbia (Belgrade)
    "East Germany (GDR)": "Germany",  # for large-scale flow maps, approximate East Berlin with Berlin
    "Western Sahara": "Morocco",  # largely under Moroccan occupation,
}

exclude_flows = {
    "nan",
    "unknown rebel group*",
    "unknown recipient(s)",
    'unknown supplier(s)',
    "United Nations**",
    "Regional Security System**",
    "African Union**",
    '0.25',
    '3',
}

# Capital name and (long, lat) coordinates for entities not included in the places shapefile.
# Several of these entities are countries that no longer exist.
extra_capitals = {
    "Biafra": ("Enugu", 7.5139, 6.4483),  # 1967 capital (now part of Nigeria)
    "Bosnia-Herzegovina": ("Sarajevo", 18.4131, 43.8563),
    "South Yemen": ("Aden", 45.0176, 12.7906),
    "Katanga": ("Lubumbashi", 27.5026, -11.6876),
    "South Sudan": ("Juba",  31.5825, 4.8539),
    "Palestine": ("East Jerusalem", 35.217018, 31.771959),
    "Aruba": ("Oranjestad", -70.0353, 12.5227),
}

In [None]:
# putting into geodataframe format
extra_capitals_gdf = gpd.GeoDataFrame(
    [
        {
            "adm0name": entity,
            "name": capital,
            "longitude": long,
            "latitude": lat,
            "geometry": Point(long, lat),
        }
        for entity, (capital, long, lat) in extra_capitals.items()
    ],
    crs="epsg:4326",
).set_index("adm0name")

In [None]:
# reading in simple shapefiles for visualizations
places_gdf = gpd.read_file("https://naciscdn.org/naturalearth/110m/cultural/ne_110m_populated_places_simple.zip")
capitals_gdf = places_gdf[places_gdf["adm0cap"] == 1].set_index("adm0name")
# force each nation to have exactly one capital
capitals_gdf = capitals_gdf[~capitals_gdf["name"].isin(["Sucre", "Yamoussoukro", "Bloemfontein", "Pretoria"])][["name", "latitude", "longitude", "geometry"]]
capitals_gdf = gpd.GeoDataFrame(pd.concat([capitals_gdf, extra_capitals_gdf]), crs="epsg:4326")

In [None]:
flowmap_arms_df = arms_df[~arms_df["Supplier"].isin(exclude_flows) & ~arms_df["Recipient"].isin(exclude_flows)].rename(
    columns={
        "Year of order": "order_year",
        "Recipient": "recipient",
        "Supplier": "supplier",
        "SIPRI TIV for total order": "order_sipri_tiv"
    }
)
flowmap_arms_df["order_year"] = flowmap_arms_df["order_year"].astype(int)
flowmap_arms_df = flowmap_arms_df[flowmap_arms_df["order_year"] >= 1950]

In [None]:
# select the TIV column before summing so that non-numeric columns are never aggregated
orders_by_year_df = flowmap_arms_df.groupby(["order_year", "recipient", "supplier"])["order_sipri_tiv"].sum()
orders_by_year_df
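
To make the shape of the grouped result concrete, here is a small sketch on made-up numbers (the values and the single-letter country codes are illustrative assumptions, not SIPRI data). Grouping by three columns and summing yields one total per (year, recipient, supplier) combination, indexed by all three.

```python
import pandas as pd

# Made-up orders; values are illustrative only.
toy_df = pd.DataFrame({
    "order_year": [2020, 2020, 2020, 2021],
    "recipient": ["B", "B", "C", "B"],
    "supplier": ["A", "A", "A", "A"],
    "order_sipri_tiv": [1.5, 2.5, 4.0, 3.0],
})

# One entry per (year, recipient, supplier) combination, with TIVs summed.
toy_totals = toy_df.groupby(
    ["order_year", "recipient", "supplier"]
)["order_sipri_tiv"].sum()
print(toy_totals)
```

The two 2020 orders from A to B collapse into a single entry, which is why the grouped result has fewer rows than the input.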

## Integrating with FlowmapBlue

Here, we use FlowmapBlue to create beautiful, interactive flow maps. The steps are broadly similar to what you saw in the Week 5 lab, but now we'll explore different ways to filter the data.

In [None]:
!pip install "git+https://github.com/PUBPOL-2130/notebooks#egg=pubpol2130&subdirectory=lib"

As you saw in the Week 5 lab, calling `google_sheets_credentials()` (which we do a few cells below, after building the locations table) will pop up a dialog asking for permission to generate Google Sheets credentials using your Google login. **You should do this step in Google Colab.**

In [None]:
from pubpol2130 import google_sheets_credentials, generate_flow_sheet

In [None]:
flowmap_locations_df = pd.DataFrame(
    [
        {
            "id": loc,
            "name": loc,
            "lat": capitals_gdf.loc[capitals_map.get(loc, loc), "latitude"],
            "lon": capitals_gdf.loc[capitals_map.get(loc, loc), "longitude"],
        }
        for loc in set(flowmap_arms_df["supplier"]) | set(flowmap_arms_df["recipient"])
    ]
)
# we can visualize the first five rows of our location data
flowmap_locations_df.head(5)
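
The lookup `capitals_map.get(loc, loc)` returns the mapped name when one exists and falls back to the original name otherwise. If the resulting name is not in the capitals table, the `.loc` lookup raises a `KeyError`. The sketch below shows one way to spot such names ahead of time; `toy_map`, `known_capitals`, and the name `"Atlantis"` are hypothetical stand-ins, not part of the lab data.

```python
# Hypothetical miniature versions of capitals_map and the capitals index.
toy_map = {"Viet Nam": "Vietnam", "UAE": "United Arab Emirates"}
known_capitals = {"Vietnam", "United Arab Emirates", "France"}

names = ["Viet Nam", "France", "UAE", "Atlantis"]

# .get(name, name) returns the mapped name, or the name itself if unmapped.
resolved = {name: toy_map.get(name, name) for name in names}

# Names whose resolved form has no capital on record.
missing = sorted(
    name for name, target in resolved.items() if target not in known_capitals
)
print(resolved)
print(missing)
```

Any name that shows up in `missing` would need either a new entry in the mapping or a new entry in the extra capitals.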

In [None]:
sheet_creds = google_sheets_credentials()

**Now, we can explore filtering the data down so that it's easier to upload to Google Sheets.** If you encountered timeout errors before, this step may be particularly helpful.

Part of what makes filtering `orders_by_year_df` challenging is that it has what's known as a [`MultiIndex`](https://pandas.pydata.org/docs/user_guide/advanced.html). If we display its index values, you can see that three values identify each row: a year, a recipient, and a supplier.

To avoid indexing and slicing errors, we will use [`loc`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html). However, some of these filtering steps can be done simply with `[]`.

In [None]:
orders_by_year_df.index

You can view the values for each "level" of the index with the method `get_level_values()`. Remember, Python indexing starts at 0, so level 0 is the year, level 1 is the recipient, and level 2 is the supplier.

In [None]:
orders_by_year_df.index.get_level_values(0)

In [None]:
orders_by_year_df.index.get_level_values(1)

In [None]:
orders_by_year_df.index.get_level_values(2)
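
The outputs above repeat each value once per row. As a small sketch on a made-up series (the values and single-letter codes are assumptions for illustration), levels can be requested by name as well as by position, and `.unique()` collapses the repeats:

```python
import pandas as pd

# Toy series with the same three index levels as orders_by_year_df.
toy = pd.Series(
    [4.0, 4.0, 3.0],
    index=pd.MultiIndex.from_tuples(
        [(2020, "B", "A"), (2020, "C", "A"), (2021, "B", "A")],
        names=["order_year", "recipient", "supplier"],
    ),
)

# Levels can be requested by position or by name.
years_by_position = toy.index.get_level_values(0)
years_by_name = toy.index.get_level_values("order_year")

# .unique() collapses the repeated values.
unique_years = list(years_by_name.unique())
print(unique_years)
```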

You can also filter on values of the index to get subsets of the data. For example, the following code subsets the data to orders placed from 1950 through 1969.

Note: using `range()` in Python is an easy way to create a sequence of values. The end value is excluded, so `range(1950, 1970)` stops at 1969. To view the values in a list format, we can call `list()` as well.

In [None]:
orders_by_year_df.loc[range(1950, 1970)]

In [None]:
# range will produce a sequence of values from 1950 to 1969 (the end value, 1970, is excluded)
range(1950, 1970)

In [None]:
# calling list will put them into a list format
list(range(1950, 1970))
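
Because `range()` excludes its end value, it is easy to be off by one year. A quick check, using only built-in Python:

```python
# range(start, stop) stops one short of `stop`.
years_exclusive = list(range(1950, 1970))
years_inclusive = list(range(1950, 1971))

print(years_exclusive[-1])  # 1969
print(years_inclusive[-1])  # 1970
print(len(years_exclusive), len(years_inclusive))
```

To include 1970 itself, pass 1971 as the end value.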

Here, we subset to all orders where the United States was the recipient.

In [None]:
orders_by_year_df.loc[:,'United States']

And here we can filter to all cases where the United States is the supplier.

In [None]:
orders_by_year_df.loc[:,:,'United States']
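
The positional form can be hard to read, since the meaning of each `:` depends on the order of the index levels. The sketch below repeats both selections on a made-up series (values and single-letter codes are illustrative assumptions) and shows `xs`, a pandas method that selects by level name instead.

```python
import pandas as pd

# Toy series with the same three index levels as orders_by_year_df.
toy = pd.Series(
    [1.0, 2.0, 3.0, 4.0],
    index=pd.MultiIndex.from_tuples(
        [(2020, "B", "A"), (2020, "A", "C"), (2021, "B", "A"), (2021, "C", "B")],
        names=["order_year", "recipient", "supplier"],
    ),
).sort_index()

# Recipient is the second level; supplier is the third.
as_recipient = toy.loc[:, "B"]
as_supplier = toy.loc[:, :, "A"]

# xs ("cross-section") makes the same supplier selection by level name.
as_supplier_xs = toy.xs("A", level="supplier")
print(as_recipient)
print(as_supplier)
```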

Now you can try slicing and filtering your data in order to avoid potential timeout issues when creating a Google Sheet!

As an example, let's try filtering for:
* recent arms orders placed in the five years from 2019 through 2023 (the dataset doesn't include information from 2024 or 2025)
* involving the United States as a supplier

In [None]:
orders_by_year_df_filtered = orders_by_year_df.loc[range(2019,2024),:,['United States']]
orders_by_year_df_filtered
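
Note that the supplier is passed as a one-element list, `['United States']`, rather than a plain string. The later upload cell renames the `supplier` column, so the filtered result needs to keep that index level. The sketch below, on a made-up series with illustrative values, shows the difference and also an equivalent filter written as a boolean mask.

```python
import pandas as pd

# Toy series with the same three index levels as orders_by_year_df.
toy = pd.Series(
    [1.0, 2.0, 3.0, 4.0],
    index=pd.MultiIndex.from_tuples(
        [(2019, "B", "A"), (2020, "A", "C"), (2021, "B", "A"), (2022, "C", "B")],
        names=["order_year", "recipient", "supplier"],
    ),
).sort_index()

# A scalar label drops the supplier level from the result...
scalar_pick = toy.loc[:, :, "A"]
# ...while a one-element list keeps all three levels.
list_pick = toy.loc[:, :, ["A"]]

# An equivalent filter written as a boolean mask over the index levels.
years = toy.index.get_level_values("order_year")
suppliers = toy.index.get_level_values("supplier")
mask_pick = toy[(years >= 2019) & (years <= 2021) & (suppliers == "A")]

print(scalar_pick.index.nlevels, list_pick.index.nlevels)
print(len(toy), "rows before;", len(mask_pick), "rows after")
```

Comparing `len()` before and after filtering is a quick way to judge how much smaller the upload will be.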

Now, we can upload our **filtered** data to Google Sheets. This process should be much faster!

In [None]:
# upload our filtered data to Google sheets
flow_sheet = generate_flow_sheet(
    sheet_creds=sheet_creds,
    locations_df=flowmap_locations_df,
    created_by_name="",  # YOUR NAME HERE
    created_by_email="", # YOUR EMAIL HERE
    data_source_name="SIPRI Arms Transfers Database",
    data_source_url="https://www.sipri.org/databases/armstransfers",
    incoming_tooltip="Inbound arms transfers (TIV)",
    outgoing_tooltip="Outbound arms transfers (TIV)",
    flow_tooltip="Arms transfer (TIV)",
    total_unit="TIVs",
    sheet_title="PUBPOL 2130: SIPRI arms transfers (orders over time)",
    flow_title="SIPRI Arms Transfers Database: orders over time",
    flows={
        f"Year: {year}": year_df.reset_index().rename(columns={
            "supplier": "origin",
            "recipient": "dest",
            "order_sipri_tiv": "count",
        })
        # note that we use the filtered data here instead of orders_by_year_df
        for year, year_df in orders_by_year_df_filtered.groupby(level=0)
    }
)
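
The `flows` argument above is a dictionary with one table per year. To see what that dictionary looks like without uploading anything, here is a sketch of the same comprehension on a made-up series (values and single-letter codes are illustrative assumptions):

```python
import pandas as pd

# Toy series with the same index levels and name as the filtered data.
toy = pd.Series(
    [1.0, 2.0, 3.0],
    index=pd.MultiIndex.from_tuples(
        [(2020, "B", "A"), (2020, "C", "A"), (2021, "B", "A")],
        names=["order_year", "recipient", "supplier"],
    ),
    name="order_sipri_tiv",
)

# groupby(level=0) splits the series by year; reset_index turns the
# index levels into ordinary columns, which are then renamed.
toy_flows = {
    f"Year: {year}": year_s.reset_index().rename(columns={
        "supplier": "origin",
        "recipient": "dest",
        "order_sipri_tiv": "count",
    })
    for year, year_s in toy.groupby(level=0)
}

print(list(toy_flows))
print(toy_flows["Year: 2020"])
```

Each dictionary key becomes a label such as `Year: 2020`, and each value is a table with `origin`, `dest`, and `count` columns.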

In [None]:
print(flow_sheet.url)

In [None]:
print(f"https://www.flowmap.blue/{flow_sheet.url.split('/')[-1]}")
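
The last cell builds the FlowmapBlue link from the final path segment of the sheet URL. A minimal sketch of that string handling follows; the URL is a made-up placeholder, and the sketch assumes the sheet URL ends with the sheet ID (no trailing slash or `/edit` suffix).

```python
# Hypothetical sheet URL; "EXAMPLE_SHEET_ID" is a placeholder, not a real sheet.
sheet_url = "https://docs.google.com/spreadsheets/d/EXAMPLE_SHEET_ID"

# The sheet ID is taken to be the last path segment of the URL.
sheet_id = sheet_url.split("/")[-1]
flowmap_url = f"https://www.flowmap.blue/{sheet_id}"
print(flowmap_url)
```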