## Add Global Population data to Mass protest data
Add Population Density

Data provided by the United Nations<br>
https://population.un.org/wpp/Download/Standard/CSV/

* Total population by sex, annually from 1950 to 2100.
* PopMale: Total male population (thousands)
* PopFemale: Total female population (thousands)
* PopTotal: Total population, both sexes (thousands)
* PopDensity: Population density (persons per square kilometre)

In [214]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 

In [215]:
mass = pd.read_csv('../source/mmALL_073120_csv.csv')

In [216]:
mass.head(3)

Unnamed: 0,id,country,ccode,year,region,protest,protestnumber,startday,startmonth,startyear,...,protesterdemand4,stateresponse1,stateresponse2,stateresponse3,stateresponse4,stateresponse5,stateresponse6,stateresponse7,sources,notes
0,201990001,Canada,20,1990,North America,1,1,15.0,1.0,1990.0,...,,ignore,,,,,,,1. great canadian train journeys into history;...,canada s railway passenger system was finally ...
1,201990002,Canada,20,1990,North America,1,2,25.0,6.0,1990.0,...,,ignore,,,,,,,1. autonomy s cry revived in quebec the new yo...,protestors were only identified as young peopl...
2,201990003,Canada,20,1990,North America,1,3,1.0,7.0,1990.0,...,,ignore,,,,,,,1. quebec protest after queen calls for unity ...,"the queen, after calling on canadians to remai..."


In [217]:
mass = mass[mass['protest'] == 1]
mass.shape

(15239, 31)

In [218]:
all_population_df = pd.read_csv('../data/WPP2019_TotalPopulationBySex.csv')
all_population_df.head(3)


Unnamed: 0,LocID,Location,VarID,Variant,Time,MidPeriod,PopMale,PopFemale,PopTotal,PopDensity
0,4,Afghanistan,2,Medium,1950,1950.5,4099.243,3652.874,7752.117,11.874
1,4,Afghanistan,2,Medium,1951,1951.5,4134.756,3705.395,7840.151,12.009
2,4,Afghanistan,2,Medium,1952,1952.5,4174.45,3761.546,7935.996,12.156


The United Nations global population estimates include several projection variants, each based on different assumptions about fertility, mortality, etc.
For our purposes, we will use the Medium variant.

From the [U.N. web site](https://population.un.org/wpp/DefinitionOfProjectionVariants/):

>Medium-variant projection: in projecting future levels of fertility and mortality, probabilistic methods were used to reflect the uncertainty of the projections based on the historical variability of changes in each variable. The method takes into account the past experience of each country, while also reflecting uncertainty about future changes based on the past experience of other countries under similar conditions.  The medium-variant projection corresponds to the median of several thousand distinct trajectories of each demographic component derived using the probabilistic model of the variability in changes over time. 

In [219]:
pop_df = all_population_df[all_population_df['Variant']=='Medium'].copy()

# Drop Variant, since all rows are now the 'Medium' variant. VarID is the numeric code for the variant, so it is also unnecessary.
# Drop LocID: numeric code for the location, not used for any of our purposes.
# Drop MidPeriod: not used in any of our analysis.
pop_df.drop(columns=['LocID','Variant', 'VarID','MidPeriod'], inplace=True)
pop_df.rename(columns = {'Location':'country', 'Time':'year'}, inplace = True) 
pop_df.head()

Unnamed: 0,country,year,PopMale,PopFemale,PopTotal,PopDensity
0,Afghanistan,1950,4099.243,3652.874,7752.117,11.874
1,Afghanistan,1951,4134.756,3705.395,7840.151,12.009
2,Afghanistan,1952,4174.45,3761.546,7935.996,12.156
3,Afghanistan,1953,4218.336,3821.348,8039.684,12.315
4,Afghanistan,1954,4266.484,3884.832,8151.316,12.486


Build a list of all country names that appear in the Mass protest dataframe but have no exact match in the global population data.

In [220]:
global_pop_countries = sorted(set(list(pop_df['country'])))
mass_countries = sorted(set(list(mass['country'].value_counts().index)))

mass_not_global_pop = []

for country in mass_countries:
    if country not in global_pop_countries:
        mass_not_global_pop.append(country)
        
len(mass_not_global_pop)

30

In [221]:
mass_not_global_pop

['Bolivia',
 'Bosnia',
 'Cape Verde',
 'Congo Brazzaville',
 'Congo Kinshasa',
 'Czech Republic',
 'Czechoslovakia',
 'Germany East',
 'Germany West',
 'Iran',
 'Ivory Coast',
 'Kosovo',
 'Laos',
 'Macedonia',
 'Moldova',
 'North Korea',
 'Russia',
 'Serbia and Montenegro',
 'Slovak Republic',
 'South Korea',
 'Swaziland',
 'Syria',
 'Taiwan',
 'Tanzania',
 'Timor Leste',
 'USSR',
 'United Arab Emirate',
 'Venezuela',
 'Vietnam',
 'Yugoslavia']
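
The loop above is equivalent to a set difference. A minimal sketch of that idea, using two small made-up dataframes (`mass_demo`, `pop_demo`, and the helper `unmatched_labels` are illustrative names, not part of this notebook):

```python
import pandas as pd

# Toy stand-ins for the protest and population dataframes (illustrative only).
mass_demo = pd.DataFrame({'country': ['Canada', 'Iran', 'Laos', 'Iran']})
pop_demo = pd.DataFrame({'country': ['Canada',
                                     'Iran (Islamic Republic of)',
                                     "Lao People's Democratic Republic"]})

def unmatched_labels(left, right, column='country'):
    """Return the sorted labels found in left[column] but not in right[column]."""
    return sorted(set(left[column]) - set(right[column]))

missing = unmatched_labels(mass_demo, pop_demo)
print(missing)
```

Because sets drop duplicates, repeated countries in the protest data are only reported once.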

In [222]:
#global_pop_countries

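30 country labels in the Mass Protest dataframe have no exact match in the global population data. Most are simply labeled differently; a few (such as Kosovo, USSR, and Yugoslavia) have no population rows at all.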

Update every row in the global population dataframe to have a matching label.

Then create a function to loop through every row in the Mass dataframe and insert its population information for the corresponding year.
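
As an alternative to an explicit row loop, the population columns could be attached with a left merge on `country` and `year`. The sketch below is an assumption about how that step might look, not this notebook's final function; the two dataframes and all their values are made up for illustration.

```python
import pandas as pd

# Toy protest rows (illustrative only); 'Atlantis' has no population match.
mass_demo = pd.DataFrame({
    'id': [1, 2, 3],
    'country': ['Canada', 'Canada', 'Atlantis'],
    'year': [1990, 1991, 1990],
})

# Toy population rows with made-up values (thousands, persons per km^2).
pop_demo = pd.DataFrame({
    'country': ['Canada', 'Canada'],
    'year': [1990, 1991],
    'PopTotal': [27541.0, 27900.0],
    'PopDensity': [3.0, 3.1],
})

# A left merge keeps every protest row; validate raises an error if the
# population side has duplicate (country, year) keys.
merged = mass_demo.merge(pop_demo, on=['country', 'year'], how='left',
                         validate='many_to_one', indicator=True)

# Rows that found no population data are flagged 'left_only'.
unmatched = merged[merged['_merge'] == 'left_only']
print(merged)
print(unmatched[['country', 'year']])
```

The `indicator` column makes it easy to see which protest rows still lack a matching label after the renames.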

---
Rename 'Bolivia (Plurinational State of)' in Global Population to **'Bolivia'**

In [223]:
pop_df.loc[pop_df['country'] == 'Bolivia (Plurinational State of)' , 'country'] = 'Bolivia'

#Population data also exists for Bolivarian Alliance for the Americas (ALBA), but just ignore. 
#This is an alliance of 10 countries, not the country of Bolivia

Rename Bosnia and Herzegovina to just **Bosnia** to match mass protest dataframe.

In [224]:
pop_df.loc[pop_df['country'] == 'Bosnia and Herzegovina' , 'country'] = 'Bosnia'

**Cape Verde** is also known as Cabo Verde. The mass dataframe contains Cape Verde. Rename population dataframe values.

In [225]:
pop_df.loc[pop_df['country'] == 'Cabo Verde' , 'country'] = 'Cape Verde'

#### Congo values:
Mass protest dataframe contains both:
* Congo Brazzaville
* Congo Kinshasa

The Global Population data contains values for:
* Congo
* Democratic Republic of the Congo

Since Kinshasa is the capital of the Democratic Republic of the Congo, rename the Global Population label as Congo Kinshasa.

Brazzaville is the capital of the Republic of the Congo, so rename the Global Population label to Congo Brazzaville

In [226]:
pop_df.loc[pop_df['country'] == 'Congo' , 'country'] = 'Congo Brazzaville'
                
pop_df['country'] = pop_df['country'].str.replace('Democratic Republic of the Congo', 'Congo Kinshasa')

**Germany**

Create population data for East and West Germany in 1990.

The mass protest dataset contains several rows for East and West Germany in 1990, before unifying later that same year.  Unfortunately the U.N. population data only pertains to a unified Germany.

East Germany's [Wikipedia page](https://en.wikipedia.org/wiki/East_Germany) provides the total population and density, but the male/female breakdown has to be estimated. Use the 1990 ratio for unified Germany as a guideline.

West Germany's 1990 population comes from its [Wikipedia page](https://en.wikipedia.org/wiki/West_Germany).

In [227]:
pop_df[(pop_df['country']=='Germany') & (pop_df['year']==1990)]

Unnamed: 0,country,year,PopMale,PopFemale,PopTotal,PopDensity
99098,Germany,1990,38145.562,40908.422,79053.984,226.802


U.N. population values are in thousands, so East Germany's 1990 population of 16,111,000 becomes 16111 in our dataset.

West Germany's 1990 population of 63,254,000 becomes 63254.

In [228]:
germany_1990_total_pop = pop_df[(pop_df['country']=='Germany') & (pop_df['year']==1990)]['PopTotal'].item()
germany_1990_male_pop = pop_df[(pop_df['country']=='Germany') & (pop_df['year']==1990)]['PopMale'].item()

# Assume the male/female ratio is the same for East Germany and West Germany in 1990, which might not be the case.
# Unable to find this specific data.
germany_1990_male_ratio = germany_1990_male_pop / germany_1990_total_pop

east_germany_1990_total_pop = 16111  # Per Wikipedia
east_germany_1990_male_pop = east_germany_1990_total_pop * germany_1990_male_ratio
east_germany_1990_female_pop = east_germany_1990_total_pop * (1 - germany_1990_male_ratio)
east_germany_1990_pop_density = 149  # Per Wikipedia, in /km^2

west_germany_1990_total_pop = 63254  # Per Wikipedia
west_germany_1990_male_pop = west_germany_1990_total_pop * germany_1990_male_ratio
west_germany_1990_female_pop = west_germany_1990_total_pop * (1 - germany_1990_male_ratio)
west_germany_1990_pop_density = 254  # Per Wikipedia, in /km^2

# Create row for East Germany
east_germany_row = {'country':'Germany East', 'year':1990, 'PopMale':east_germany_1990_male_pop,
           'PopFemale':east_germany_1990_female_pop, 'PopTotal':east_germany_1990_total_pop,
          'PopDensity':east_germany_1990_pop_density}

# Create row for West Germany
west_germany_row = {'country':'Germany West', 'year':1990, 'PopMale':west_germany_1990_male_pop,
           'PopFemale':west_germany_1990_female_pop, 'PopTotal':west_germany_1990_total_pop,
          'PopDensity':west_germany_1990_pop_density}


# Append both rows to the dataframe.
# pd.concat is used because DataFrame.append was removed in pandas 2.0.
pop_df = pd.concat([pop_df, pd.DataFrame([east_germany_row, west_germany_row])], ignore_index=True)
pop_df.tail(3)

Unnamed: 0,country,year,PopMale,PopFemale,PopTotal,PopDensity
72026,Zimbabwe,2100,15001.252,15964.169,30965.421,80.045
72027,Germany East,1990,7773.968095,8337.031905,16111.0,149.0
72028,Germany West,1990,30521.667052,32732.332948,63254.0,254.0
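
As a rough sanity check on the Wikipedia figures, the East and West totals can be compared with the U.N. total for unified Germany shown earlier. The numbers below are copied from this notebook; the helper name `relative_gap` is made up for illustration.

```python
# All values are in thousands, as in the U.N. data.
un_germany_1990_total = 79053.984  # U.N. PopTotal for Germany, 1990
east_total = 16111                 # East Germany, per Wikipedia
west_total = 63254                 # West Germany, per Wikipedia

def relative_gap(estimate, reference):
    """Signed difference between estimate and reference, as a fraction of reference."""
    return (estimate - reference) / reference

combined = east_total + west_total
gap = relative_gap(combined, un_germany_1990_total)
print(f'East + West = {combined} thousand; gap vs. U.N. figure = {gap:.2%}')
```

A gap of well under one percent suggests the two sources are broadly consistent, although they may refer to different dates within 1990.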


Mass protest data uses the common name **Iran**, while U.N. population uses 'Iran (Islamic Republic of)'

In [229]:
pop_df.loc[pop_df['country'] == 'Iran (Islamic Republic of)', 'country'] = 'Iran'

**Ivory Coast** is identified as Côte d'Ivoire in the U.N. population data set. Rename to merge correctly with the mass protest dataframe.

In [230]:
pop_df.loc[pop_df['country'] == "Côte d'Ivoire", 'country'] = 'Ivory Coast'

**Kosovo** is in the Mass protest data but has no U.N. population data.

Kosovo's independence is disputed; it declared independence from Serbia in 2008.

https://en.wikipedia.org/wiki/International_recognition_of_Kosovo

Assign Serbia's population data to Kosovo as a stand-in.

In [231]:
# Copy every Serbia row and relabel the copy as Kosovo
kosovo_rows = pop_df[pop_df['country'] == 'Serbia'].copy()
kosovo_rows['country'] = 'Kosovo'

# Append the new rows in one step.
# pd.concat is used because DataFrame.append was removed in pandas 2.0.
pop_df = pd.concat([pop_df, kosovo_rows], ignore_index=True)

In [232]:
print (f'There are {pop_df.loc[pop_df["country"] == "Serbia"].shape[0]} rows for Serbia.')
print (f'There are {pop_df.loc[pop_df["country"] == "Kosovo"].shape[0]} rows for Kosovo.')

There are 151 rows for Serbia.
There are 151 rows for Kosovo.


**Laos** is identified as Lao People's Democratic Republic in the U.N. population data set. Rename to merge correctly with the mass protest dataframe.

In [233]:
pop_df.loc[pop_df['country'] == "Lao People's Democratic Republic", 'country'] = 'Laos'

**North Macedonia** was known as Macedonia until February 2019.  Rename North Macedonia in U.N. population to Macedonia to match with mass protest dataframe.
https://en.wikipedia.org/wiki/North_Macedonia

In [234]:
pop_df.loc[pop_df['country'] == 'North Macedonia', 'country'] = 'Macedonia'

**Moldova** is identified as Republic of Moldova in the U.N. population data set. Rename to merge correctly with the mass protest dataframe.

In [235]:
pop_df.loc[pop_df['country'] == 'Republic of Moldova', 'country'] = 'Moldova'

**North Korea** is identified as Dem. People's Republic of Korea in the U.N. population data set. Rename to merge correctly with the mass protest dataframe.

In [236]:
pop_df.loc[pop_df['country'] == "Dem. People's Republic of Korea", 'country'] = 'North Korea'

**Russia** is identified as Russian Federation in the U.N. population data set. Rename to merge correctly with the mass protest dataframe.

In [237]:
pop_df.loc[pop_df['country'] == 'Russian Federation', 'country'] = 'Russia'

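The state union of **Serbia and Montenegro** existed only from 2003 to 2006. The mass protest data has two protests under this label, both in 2003.

Unfortunately, the U.N. population data contains separate rows for Serbia and Montenegro.

We can add the populations together and average the two population densities. The averaged density is a rough approximation, since the two countries differ in land area.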

In [238]:
pop_df[(pop_df['country']=='Serbia') & (pop_df['year']==2003)]

Unnamed: 0,country,year,PopMale,PopFemale,PopTotal,PopDensity
54413,Serbia,2003,4551.672,4740.56,9292.232,106.246


In [239]:
pop_df[(pop_df['country']=='Montenegro') & (pop_df['year']==2003)]

Unnamed: 0,country,year,PopMale,PopFemale,PopTotal,PopDensity
42937,Montenegro,2003,302.621,311.46,614.081,45.657


In [240]:
# Add the populations together
serbia_2003_total_pop = pop_df[(pop_df['country']=='Serbia') & (pop_df['year']==2003)]['PopTotal'].item()
serbia_2003_male_pop = pop_df[(pop_df['country']=='Serbia') & (pop_df['year']==2003)]['PopMale'].item()
serbia_2003_female_pop = pop_df[(pop_df['country']=='Serbia') & (pop_df['year']==2003)]['PopFemale'].item()
serbia_2003_pop_density = pop_df[(pop_df['country']=='Serbia') & (pop_df['year']==2003)]['PopDensity'].item()

mont_2003_total_pop = pop_df[(pop_df['country']=='Montenegro') & (pop_df['year']==2003)]['PopTotal'].item()
mont_2003_male_pop = pop_df[(pop_df['country']=='Montenegro') & (pop_df['year']==2003)]['PopMale'].item()
mont_2003_female_pop = pop_df[(pop_df['country']=='Montenegro') & (pop_df['year']==2003)]['PopFemale'].item()
mont_2003_pop_density = pop_df[(pop_df['country']=='Montenegro') & (pop_df['year']==2003)]['PopDensity'].item()

# Create an average of the population density of the two countries.
serb_mont_avg_pop_density = (serbia_2003_pop_density + mont_2003_pop_density) / 2

# Create row for Serbia and Montenegro
serbia_mont_row = {'country':'Serbia and Montenegro', 'year':2003,
                    'PopMale': serbia_2003_male_pop + mont_2003_male_pop,
                    'PopFemale': serbia_2003_female_pop + mont_2003_female_pop,
                    'PopTotal': serbia_2003_total_pop + mont_2003_total_pop,
                    'PopDensity': serb_mont_avg_pop_density}


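# Append the row to the dataframe.
# pd.concat is used because DataFrame.append was removed in pandas 2.0.
pop_df = pd.concat([pop_df, pd.DataFrame([serbia_mont_row])], ignore_index=True)
pop_df.tail(3)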

Unnamed: 0,country,year,PopMale,PopFemale,PopTotal,PopDensity
72178,Kosovo,2099,2152.607,2111.586,4264.193,48.756
72179,Kosovo,2100,2129.297,2088.128,4217.425,48.221
72180,Serbia and Montenegro,2003,4854.293,5052.02,9906.313,75.9515
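
A plain average of the two densities weights Serbia and Montenegro equally even though their land areas differ. One alternative, sketched here as an assumption rather than this notebook's method, derives each implied land area from population and density, then divides the combined population by the combined area. It assumes `PopDensity` is in persons per km² and `PopTotal` is in thousands; the inputs are the 2003 values shown above, and the helper names are made up.

```python
# 2003 values copied from the U.N. rows shown above.
serbia = {'PopTotal': 9292.232, 'PopDensity': 106.246}
montenegro = {'PopTotal': 614.081, 'PopDensity': 45.657}

def implied_area_km2(row):
    """Land area implied by population (thousands) and density (persons per km^2)."""
    return row['PopTotal'] * 1000 / row['PopDensity']

def combined_density(rows):
    """Combined population divided by combined implied area."""
    total_persons = sum(r['PopTotal'] for r in rows) * 1000
    total_area = sum(implied_area_km2(r) for r in rows)
    return total_persons / total_area

weighted = combined_density([serbia, montenegro])
simple = (serbia['PopDensity'] + montenegro['PopDensity']) / 2
print(f'area-weighted: {weighted:.2f}, simple average: {simple:.4f}')
```

The area-weighted figure comes out noticeably higher than the simple average because Serbia, the denser of the two, accounts for most of the combined area.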


**Slovak Republic** is identified as Slovakia in the U.N. population data set. Rename to merge correctly with the mass protest dataframe.

In [241]:
pop_df.loc[pop_df['country'] == 'Slovakia', 'country'] = 'Slovak Republic'

**South Korea** is identified as Republic of Korea in the U.N. population data set. Rename to merge correctly with the mass protest dataframe.

In [242]:
pop_df.loc[pop_df['country'] == 'Republic of Korea', 'country'] = 'South Korea'

**Swaziland** is identified as Eswatini in the U.N. population data set. Rename to merge correctly with the mass protest dataframe.

In [243]:
pop_df.loc[pop_df['country'] == 'Eswatini', 'country'] = 'Swaziland'  

**Syria** is identified as Syrian Arab Republic in the U.N. population data set. Rename to merge correctly with the mass protest dataframe.

In [244]:
pop_df.loc[pop_df['country'] == 'Syrian Arab Republic', 'country'] = 'Syria'

**Taiwan** is identified as China, Taiwan Province of China in the U.N. population data set. Rename to merge correctly with the mass protest dataframe.

In [245]:
pop_df.loc[pop_df['country'] == 'China, Taiwan Province of China', 'country'] = 'Taiwan'

**Tanzania** is identified as United Republic of Tanzania in the U.N. population data set. Rename to merge correctly with the mass protest dataframe.

In [246]:
pop_df.loc[pop_df['country'] == 'United Republic of Tanzania', 'country'] = 'Tanzania'

**Timor Leste** is identified as Timor-Leste in the U.N. population data set. Rename to merge correctly with the mass protest dataframe.

In [247]:
pop_df.loc[pop_df['country'] == 'Timor-Leste', 'country'] = 'Timor Leste'

The **USSR** ended in 1991. The U.N. population data has no separate USSR label; all of its rows for Russia, including 1990 and 1991, are labeled 'Russian Federation'.

60 USSR protests took place in 1990 and 1991.

**United Arab Emirate** is identified as United Arab Emirates in the U.N. population data set. Rename to merge correctly with the mass protest dataframe.

In [248]:
pop_df.loc[pop_df['country'] == 'United Arab Emirates', 'country'] = 'United Arab Emirate'

**Venezuela** is identified as Venezuela (Bolivarian Republic of) in the U.N. population data set. Rename to merge correctly with the mass protest dataframe.

In [249]:
pop_df.loc[pop_df['country'] == 'Venezuela (Bolivarian Republic of)', 'country'] = 'Venezuela'

**Vietnam** is identified as Viet Nam in the U.N. population data set. Rename to merge correctly with the mass protest dataframe.

In [250]:
pop_df.loc[pop_df['country'] == 'Viet Nam', 'country'] = 'Vietnam'
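
The one-label-at-a-time renames above could also be collected into a single mapping and applied with `Series.replace`. This is a sketch of the same idea on a toy dataframe; `UN_TO_MASS_LABELS` is a hypothetical name, and only a few of the pairs from this section are included.

```python
import pandas as pd

# U.N. label -> mass protest label (a subset of the renames in this section).
UN_TO_MASS_LABELS = {
    'Bolivia (Plurinational State of)': 'Bolivia',
    'Bosnia and Herzegovina': 'Bosnia',
    'Cabo Verde': 'Cape Verde',
    "Côte d'Ivoire": 'Ivory Coast',
    'Viet Nam': 'Vietnam',
}

# Toy population dataframe (illustrative only).
pop_demo = pd.DataFrame({
    'country': ['Cabo Verde', 'Viet Nam', 'Canada'],
    'year': [2000, 2000, 2000],
})

# With a dict, replace matches whole values exactly; labels not in the
# mapping (here 'Canada') are left unchanged.
pop_demo['country'] = pop_demo['country'].replace(UN_TO_MASS_LABELS)
print(pop_demo)
```

Keeping the pairs in one dictionary makes the full list of relabelings easy to review in a single place.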

**Yugoslavia** has no U.N. Population data.  Attempt to recreate by combining its former countries?

Former [Yugoslavia](https://en.wikipedia.org/wiki/Yugoslavia)

* Bosnia
* Croatia
* Macedonia
* Montenegro
* Serbia
* Slovenia

There are 137 protests identified as Yugoslavia, ranging from 1990 to 2002.
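
One possible answer to the question above is to sum the constituent rows per year. The sketch below is only an illustration on made-up numbers, not a method this notebook adopts. It assumes the labels already match the renamed ones ('Bosnia', 'Macedonia'), and it leaves density out because densities cannot simply be added. The helper `combine_countries` is a hypothetical name.

```python
import pandas as pd

FORMER_YUGOSLAVIA = ['Bosnia', 'Croatia', 'Macedonia',
                     'Montenegro', 'Serbia', 'Slovenia']

def combine_countries(pop, members, new_label, years):
    """Sum the population columns of `members` per year under `new_label`."""
    mask = pop['country'].isin(members) & pop['year'].isin(list(years))
    combined = (pop[mask]
                .groupby('year', as_index=False)[['PopMale', 'PopFemale', 'PopTotal']]
                .sum())
    combined.insert(0, 'country', new_label)
    return combined

# Toy data with made-up values (illustrative only).
pop_demo = pd.DataFrame({
    'country': ['Serbia', 'Croatia', 'Serbia', 'Croatia', 'Canada'],
    'year': [1990, 1990, 1991, 1991, 1990],
    'PopMale': [10.0, 5.0, 11.0, 6.0, 99.0],
    'PopFemale': [12.0, 7.0, 13.0, 8.0, 99.0],
    'PopTotal': [22.0, 12.0, 24.0, 14.0, 198.0],
})

yugoslavia = combine_countries(pop_demo, FORMER_YUGOSLAVIA,
                               'Yugoslavia', range(1990, 2003))
print(yugoslavia)
```

Membership changed over the period (from 1992 the state comprised only Serbia and Montenegro), so the `members` list may need to vary by year.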

In [251]:
global_pop_countries = sorted(set(list(pop_df['country'])))
mass_countries = sorted(set(list(mass['country'].value_counts().index)))

mass_not_global_pop = []

for country in mass_countries:
    if country not in global_pop_countries:
        mass_not_global_pop.append(country)
        
len(mass_not_global_pop)

4

In [252]:
mass_not_global_pop

['Czech Republic', 'Czechoslovakia', 'USSR', 'Yugoslavia']
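
Of the four remaining labels, 'Czech Republic' may only need a rename. The sketch below assumes the U.N. file labels the country 'Czechia'; that label is not confirmed anywhere in this notebook, so the code checks for it before renaming. The dataframe is a toy stand-in.

```python
import pandas as pd

# Toy population dataframe (illustrative only).
pop_demo = pd.DataFrame({'country': ['Czechia', 'Slovakia'],
                         'year': [1995, 1995]})

# Assumption: the U.N. data uses the label 'Czechia'. Verify before renaming.
if (pop_demo['country'] == 'Czechia').any():
    pop_demo.loc[pop_demo['country'] == 'Czechia', 'country'] = 'Czech Republic'
else:
    print("Label 'Czechia' not found; inspect the U.N. country list instead.")

print(pop_demo)
```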

In [253]:
mass.columns

Index(['id', 'country', 'ccode', 'year', 'region', 'protest', 'protestnumber',
       'startday', 'startmonth', 'startyear', 'endday', 'endmonth', 'endyear',
       'protesterviolence', 'location', 'participants_category',
       'participants', 'protesteridentity', 'protesterdemand1',
       'protesterdemand2', 'protesterdemand3', 'protesterdemand4',
       'stateresponse1', 'stateresponse2', 'stateresponse3', 'stateresponse4',
       'stateresponse5', 'stateresponse6', 'stateresponse7', 'sources',
       'notes'],
      dtype='object')

In [254]:
#mass[mass['country']=='Czechoslovakia'][['participants','startmonth','startyear','endmonth','endyear','protesterdemand1','stateresponse1']]

mass[mass['country']=='Serbia and Montenegro'][['participants','startmonth','startyear','endmonth','endyear','protesterdemand1','stateresponse1']]

Unnamed: 0,participants,startmonth,startyear,endmonth,endyear,protesterdemand1,stateresponse1
6114,1000+,6.0,2003.0,6.0,2003.0,police brutality,beatings
6115,2000,10.0,2003.0,10.0,2003.0,"political behavior, process",crowd dispersal


In [255]:
#mass[mass['country']=='Czech Republic'][['startmonth','startyear','endmonth','endyear','protesterdemand1','stateresponse1']].head()
mass[mass['country']=='Serbia'][['startmonth','startyear','endmonth','endyear','protesterdemand1','stateresponse1']].head()

Unnamed: 0,startmonth,startyear,endmonth,endyear,protesterdemand1,stateresponse1
5750,12.0,2007.0,12.0,2007.0,"political behavior, process",ignore
5751,7.0,2008.0,7.0,2008.0,"political behavior, process",crowd dispersal
5754,5.0,2011.0,5.0,2011.0,"political behavior, process",beatings
5757,9.0,2014.0,9.0,2014.0,social restrictions,ignore
5760,4.0,2017.0,4.0,2017.0,"political behavior, process",ignore


In [256]:
#mass[mass['country']=='Slovak Republic'][['startmonth','startyear','endmonth','endyear','protesterdemand1','stateresponse1']].head()
mass[mass['country']=='Kosovo'][['startmonth','startyear','endmonth','endyear','protesterdemand1','stateresponse1']].head()

Unnamed: 0,startmonth,startyear,endmonth,endyear,protesterdemand1,stateresponse1
5726,12.0,2008.0,12.0,2008.0,"political behavior, process",ignore
5727,8.0,2009.0,8.0,2009.0,"political behavior, process",crowd dispersal
5728,3.0,2010.0,3.0,2010.0,"political behavior, process",crowd dispersal
5729,6.0,2010.0,6.0,2010.0,"political behavior, process",ignore
5730,6.0,2011.0,6.0,2011.0,"political behavior, process",accomodation


In [257]:
mass[mass['country']=='Yugoslavia'][['startmonth','startyear','endmonth','endyear','protesterdemand1','stateresponse1']]

Unnamed: 0,startmonth,startyear,endmonth,endyear,protesterdemand1,stateresponse1
5920,1.0,1990.0,1.0,1990.0,"political behavior, process",ignore
5921,1.0,1990.0,1.0,1990.0,"political behavior, process",crowd dispersal
5922,1.0,1990.0,1.0,1990.0,"political behavior, process",crowd dispersal
5923,1.0,1990.0,1.0,1990.0,removal of politician,crowd dispersal
5924,1.0,1990.0,1.0,1990.0,"price increases, tax policy",ignore
...,...,...,...,...,...,...
6109,2.0,2001.0,2.0,2001.0,"political behavior, process",ignore
6110,6.0,2001.0,6.0,2001.0,"political behavior, process",ignore
6111,11.0,2001.0,11.0,2001.0,"political behavior, process",ignore
6112,2.0,2002.0,2.0,2002.0,"political behavior, process",ignore


In [258]:
mass[mass['country']=='Germany'][['startmonth','startyear','endmonth','endyear','protesterdemand1','stateresponse1']].head()

Unnamed: 0,startmonth,startyear,endmonth,endyear,protesterdemand1,stateresponse1
4742,10.0,1990.0,10.0,1990.0,"political behavior, process",crowd dispersal
4743,11.0,1990.0,11.0,1990.0,"political behavior, process",crowd dispersal
4744,11.0,1990.0,11.0,1990.0,police brutality,ignore
4745,11.0,1990.0,11.0,1990.0,"political behavior, process",ignore
4746,12.0,1990.0,12.0,1990.0,"political behavior, process",ignore
