## Add Global Population data to Mass protest data
Add Population Density

Data provided by the United Nations<br>
https://population.un.org/wpp/Download/Standard/CSV/

* Total population by sex, annually from 1950 to 2100.
* PopMale: Total male population (thousands)
* PopFemale: Total female population (thousands)
* PopTotal: Total population, both sexes (thousands)
* PopDensity: Population density (persons per square kilometre)

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 

In [2]:
mass = pd.read_csv('../source/mmALL_073120_csv.csv')

In [3]:
mass.head(3)

Unnamed: 0,id,country,ccode,year,region,protest,protestnumber,startday,startmonth,startyear,...,protesterdemand4,stateresponse1,stateresponse2,stateresponse3,stateresponse4,stateresponse5,stateresponse6,stateresponse7,sources,notes
0,201990001,Canada,20,1990,North America,1,1,15.0,1.0,1990.0,...,,ignore,,,,,,,1. great canadian train journeys into history;...,canada s railway passenger system was finally ...
1,201990002,Canada,20,1990,North America,1,2,25.0,6.0,1990.0,...,,ignore,,,,,,,1. autonomy s cry revived in quebec the new yo...,protestors were only identified as young peopl...
2,201990003,Canada,20,1990,North America,1,3,1.0,7.0,1990.0,...,,ignore,,,,,,,1. quebec protest after queen calls for unity ...,"the queen, after calling on canadians to remai..."


In [4]:
mass = mass[mass['protest'] == 1]
mass.shape

(15239, 31)

In [5]:
pop_df = pd.read_csv('../data/WPP2019_TotalPopulationBySex.csv')
pop_df.head(3)

Unnamed: 0,LocID,Location,VarID,Variant,Time,MidPeriod,PopMale,PopFemale,PopTotal,PopDensity
0,4,Afghanistan,2,Medium,1950,1950.5,4099.243,3652.874,7752.117,11.874
1,4,Afghanistan,2,Medium,1951,1951.5,4134.756,3705.395,7840.151,12.009
2,4,Afghanistan,2,Medium,1952,1952.5,4174.45,3761.546,7935.996,12.156


The United Nations global population estimate data contains several projection variants based on different assumptions about fertility, mortality, etc.
For our purposes, we will use the Medium variant.

From the [U.N. web site](https://population.un.org/wpp/DefinitionOfProjectionVariants/):

>Medium-variant projection: in projecting future levels of fertility and mortality, probabilistic methods were used to reflect the uncertainty of the projections based on the historical variability of changes in each variable. The method takes into account the past experience of each country, while also reflecting uncertainty about future changes based on the past experience of other countries under similar conditions.  The medium-variant projection corresponds to the median of several thousand distinct trajectories of each demographic component derived using the probabilistic model of the variability in changes over time. 

In [6]:
pop_df = pop_df[pop_df['Variant'] == 'Medium'].copy()

# Drop Variant, since all rows are now the 'Medium' variant. VarID is a numeric code for the variant, so it is also unnecessary.
# Drop LocID: numeric code for the location, not used for any of our purposes.
# Drop MidPeriod: not used in any of our analysis.
pop_df.drop(columns=['LocID', 'Variant', 'VarID', 'MidPeriod'], inplace=True)
pop_df.rename(columns={'Location': 'country', 'Time': 'year'}, inplace=True)
pop_df.head()

Build a list of the country names that appear in the Mass protest dataframe but have no exact match in the global population data.

In [None]:
global_pop_countries = sorted(set(list(pop_df['country'])))
mass_countries = sorted(set(list(mass['country'].value_counts().index)))

mass_not_global_pop = []

for country in mass_countries:
    if country not in global_pop_countries:
        mass_not_global_pop.append(country)
        
len(mass_not_global_pop)
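
The same comparison can also be written as a set difference. A small sketch with made-up label lists; the variable names mirror the cell above, but the values are illustrative only:

```python
# Illustrative label lists; the real ones come from pop_df and mass.
global_pop_countries = ['Canada', 'Czechia', 'Viet Nam']
mass_countries = ['Canada', 'Czech Republic', 'Vietnam']

# Labels used in the mass protest data that have no exact U.N. match.
mass_not_global_pop = sorted(set(mass_countries) - set(global_pop_countries))
print(mass_not_global_pop)
```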

In [None]:
mass_not_global_pop

In [None]:
#global_pop_countries

30 countries are labeled differently in the Mass Protest dataframe.

Update every row in the global population dataframe to have a matching label.

Once all rows of U.N. population are cleaned, merge with Mass Protest data.
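
The renaming step can also be expressed as a single lookup table instead of one statement per country. A minimal sketch, assuming a toy dataframe and only three of the label pairs handled in this notebook; `toy_pop` and `un_to_mass` are hypothetical names, not part of the original code:

```python
import pandas as pd

# Toy stand-in for the U.N. population dataframe (illustrative rows only).
toy_pop = pd.DataFrame({
    'country': ['Czechia', 'Viet Nam', 'Canada', 'Russian Federation'],
    'year': [1990, 1990, 1990, 1990],
})

# Hypothetical mapping: U.N. label -> mass protest label.
un_to_mass = {
    'Czechia': 'Czech Republic',
    'Viet Nam': 'Vietnam',
    'Russian Federation': 'Russia',
}

# Series.replace with a dict swaps exact matches and leaves other labels untouched.
toy_pop['country'] = toy_pop['country'].replace(un_to_mass)
print(toy_pop['country'].tolist())
```

Keeping the pairs in one dictionary makes it easier to review all label changes at a glance, at the cost of losing the per-country notes that the individual cells carry.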

---
Create a function that builds population data missing from the U.N. population data set.

This happens when a country was broken up: the previous country name is no longer tracked in the U.N. population data, but it still appears in recorded protests.

For example, Czechoslovakia was split into the Czech Republic (also referred to as Czechia) and Slovakia in 1993. 

Unfortunately, the U.N. population data no longer contains population information under the 'Czechoslovakia' label.

In [None]:
# This code is used in several places - creating records for Czechoslovakia, Yugoslavia, Serbia and Montenegro and the Soviet Union. 

def create_country_population_data (parent, child_countries, year_start, year_end):
    # must use global to change the main dataframe
    global pop_df
    
    # child_countries should be a list of countries to combine into one.

    # Create a list of Populations for male, female and total to add together for a combined total.
    # Create an average of all of the countries PopDensity.

    for year in range (year_start, year_end):

        all_males = 0
        all_females = 0
        total_pop = 0

        total_pop_density = 0

        # for all countries in this group
        for country in child_countries:

            # Keep running totals of populations in this block of countries
            all_males = all_males + pop_df[(pop_df['country']==country) & (pop_df['year']==year)]['PopMale'].item()
            all_females = all_females + pop_df[(pop_df['country']==country) & (pop_df['year']==year)]['PopFemale'].item()
            total_pop = total_pop + pop_df[(pop_df['country']==country) & (pop_df['year']==year)]['PopTotal'].item()

            # Keep running totals of all population densities. Will divide later for an average
            total_pop_density = total_pop_density + pop_df[(pop_df['country']==country) & (pop_df['year']==year)]['PopDensity'].item()

        # Create row for the parent country for this year
        new_row = {'country': parent, 'year':year,
                    'PopMale': all_males,
                    'PopFemale': all_females,
                    'PopTotal': total_pop,
                    'PopDensity': total_pop_density / len(child_countries)}
        
        #append row to the dataframe
        pop_df = pd.concat([pop_df, pd.DataFrame([new_row])], ignore_index=True)
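
The function averages the child countries' densities with equal weight, so a small, dense country counts as much as a large, sparse one. If an area-weighted figure is preferred, the area implied by each row can be recovered as PopTotal / PopDensity, and the combined density recomputed from the summed population and summed area. A sketch with made-up numbers (not U.N. values); `combined_density` is a hypothetical helper and is not used elsewhere in this notebook:

```python
import pandas as pd

# Made-up rows for two child countries in one year (not real U.N. values).
children = pd.DataFrame({
    'country': ['A', 'B'],
    'PopTotal': [1000.0, 3000.0],   # thousands of people
    'PopDensity': [100.0, 50.0],    # people per square kilometre
})

def combined_density(rows):
    # Implied area, in units of (thousand people) / (people per km^2).
    implied_area = rows['PopTotal'] / rows['PopDensity']
    # The factor of 1000 cancels, so the result is again people per km^2.
    return rows['PopTotal'].sum() / implied_area.sum()

simple_mean = children['PopDensity'].mean()
area_weighted = combined_density(children)
print(simple_mean, round(area_weighted, 2))
```

For the handful of country-years recreated here the difference may not matter, but the two figures can diverge noticeably when the child countries differ a lot in size.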
   


---
Rename 'Bolivia (Plurinational State of)' in Global Population to **'Bolivia'**

In [None]:
pop_df.loc[pop_df['country'] == 'Bolivia (Plurinational State of)' , 'country'] = 'Bolivia'

#Population data also exists for Bolivarian Alliance for the Americas (ALBA), but just ignore. 
#This is an alliance of 10 countries, not the country of Bolivia

Rename Bosnia and Herzegovina to just **Bosnia** to match mass protest dataframe.

In [None]:
pop_df.loc[pop_df['country'] == 'Bosnia and Herzegovina' , 'country'] = 'Bosnia'

**Cape Verde** is identified as Cabo Verde in the U.N. population data set. Rename to merge correctly with the mass protest dataframe.

In [None]:
pop_df.loc[pop_df['country'] == 'Cabo Verde' , 'country'] = 'Cape Verde'

**Czech Republic** is identified as Czechia in the U.N. population data set. Rename to merge correctly with the mass protest dataframe.

In [None]:
pop_df.loc[pop_df['country'] == 'Czechia' , 'country'] = 'Czech Republic'

#### Congo values:
Mass protest dataframe contains both:
* Congo Brazzaville
* Congo Kinshasa

The Global Population data contains values for:
* Congo
* Democratic Republic of the Congo

Since Kinshasa is the capital of the Democratic Republic of the Congo, rename the Global Population label as Congo Kinshasa.

Brazzaville is the capital of the Republic of the Congo, so rename the Global Population label to Congo Brazzaville

In [None]:
pop_df.loc[pop_df['country'] == 'Congo' , 'country'] = 'Congo Brazzaville'
                
pop_df['country'] = pop_df['country'].str.replace('Democratic Republic of the Congo', 'Congo Kinshasa', regex=False)

**Germany**

Create population data for East and West Germany in 1990.

The mass protest dataset contains several rows for East and West Germany in 1990, before unifying later that same year.  Unfortunately the U.N. population data only pertains to a unified Germany.

East Germany's [Wikipedia page](https://en.wikipedia.org/wiki/East_Germany) provides some of the necessary data, but we have to estimate the male/female breakdown, using the 1990 ratio for Germany as a guideline.

Data for West Germany's population in 1990 found on [Wikipedia](https://en.wikipedia.org/wiki/West_Germany)

In [None]:
pop_df[(pop_df['country']=='Germany') & (pop_df['year']==1990)]

U.N. population values are in thousands, so East Germany's 1990 population of 16,111,000 becomes 16111 in our dataset.

West Germany's 1990 population of 63,254,000 becomes 63254

In [None]:
germany_1990_total_pop = pop_df[(pop_df['country']=='Germany') & (pop_df['year']==1990)]['PopTotal'].item()
germany_1990_male_pop = pop_df[(pop_df['country']=='Germany') & (pop_df['year']==1990)]['PopMale'].item()

# Assume the male/female ratio is the same for East Germany and West Germany in 1990, which might not be the case.
# Unable to find this specific data.
germany_1990_male_ratio = germany_1990_male_pop / germany_1990_total_pop

east_germany_1990_total_pop = 16111  # Per Wikipedia
east_germany_1990_male_pop = east_germany_1990_total_pop * germany_1990_male_ratio
east_germany_1990_female_pop = east_germany_1990_total_pop * (1 - germany_1990_male_ratio)
east_germany_1990_pop_density = 149  # Per Wikipedia, in /km^2

west_germany_1990_total_pop = 63254  # Per Wikipedia
west_germany_1990_male_pop = west_germany_1990_total_pop * germany_1990_male_ratio
west_germany_1990_female_pop = west_germany_1990_total_pop * (1 - germany_1990_male_ratio)
west_germany_1990_pop_density = 254  # Per Wikipedia, in /km^2

# Create row for East Germany
east_germany_row = {'country':'Germany East', 'year':1990, 'PopMale':east_germany_1990_male_pop,
           'PopFemale':east_germany_1990_female_pop, 'PopTotal':east_germany_1990_total_pop,
          'PopDensity':east_germany_1990_pop_density}

# Create row for West Germany
west_germany_row = {'country':'Germany West', 'year':1990, 'PopMale':west_germany_1990_male_pop,
           'PopFemale':west_germany_1990_female_pop, 'PopTotal':west_germany_1990_total_pop,
          'PopDensity':west_germany_1990_pop_density}


# Append both rows to the dataframe
pop_df = pd.concat([pop_df, pd.DataFrame([east_germany_row, west_germany_row])], ignore_index=True)
pop_df.tail(3)
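
Because the male and female estimates are both derived from the same ratio, they add back up to the total by construction. A quick check of that arithmetic, using a placeholder male share (0.48 is an assumption for illustration, not the value computed from the U.N. data):

```python
# Placeholder male share; the notebook derives the real one from Germany's 1990 row.
male_ratio = 0.48

east_total = 16111  # thousands, the East Germany figure quoted above
east_male = east_total * male_ratio
east_female = east_total * (1 - male_ratio)

# The two parts should add back up to the total (up to floating point error).
assert abs((east_male + east_female) - east_total) < 1e-6
print(round(east_male, 2), round(east_female, 2))
```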

Mass protest data uses the common name **Iran**, while U.N. population uses 'Iran (Islamic Republic of)'

In [None]:
pop_df.loc[pop_df['country'] == 'Iran (Islamic Republic of)', 'country'] = 'Iran'

**Ivory Coast** is identified as Côte d'Ivoire in the U.N. population data set. Rename to merge correctly with the mass protest dataframe.

In [None]:
pop_df.loc[pop_df['country'] == "Côte d'Ivoire", 'country'] = 'Ivory Coast'

**Kosovo** is in Mass protest data but has no U.N population data.

Kosovo's independence is disputed, but it declared independence from Serbia in 2008.

https://en.wikipedia.org/wiki/International_recognition_of_Kosovo

Copy all of Serbia's Population data, and create new rows with that same data labeled as 'Kosovo'.

In [None]:
create_country_population_data('Kosovo', ['Serbia'], 1990, 2021)
pop_df.tail(3)

In [None]:
print (f'There are {pop_df.loc[pop_df["country"] == "Serbia"].shape[0]} rows for Serbia.')
print (f'There are {pop_df.loc[pop_df["country"] == "Kosovo"].shape[0]} rows for Kosovo.')

**Laos** is identified as Lao People's Democratic Republic in the U.N. population data set. Rename to merge correctly with the mass protest dataframe.

In [None]:
pop_df.loc[pop_df['country'] == "Lao People's Democratic Republic", 'country'] = 'Laos'

**North Macedonia** was known as Macedonia until February 2019.  Rename North Macedonia in U.N. population to Macedonia to match with mass protest dataframe.
https://en.wikipedia.org/wiki/North_Macedonia

In [None]:
pop_df.loc[pop_df['country'] == 'North Macedonia', 'country'] = 'Macedonia'

**Moldova** is identified as Republic of Moldova in the U.N. population data set. Rename to merge correctly with the mass protest dataframe.

In [None]:
pop_df.loc[pop_df['country'] == 'Republic of Moldova', 'country'] = 'Moldova'

**North Korea** is identified as Dem. People's Republic of Korea in the U.N. population data set. Rename to merge correctly with the mass protest dataframe.

In [None]:
pop_df.loc[pop_df['country'] == "Dem. People's Republic of Korea", 'country'] = 'North Korea'

**Russia** is identified as Russian Federation in the U.N. population data set. Rename to merge correctly with the mass protest dataframe.

In [None]:
pop_df.loc[pop_df['country'] == 'Russian Federation', 'country'] = 'Russia'

**Serbia and Montenegro** only existed 2003-2006.  It has two protests, both of which were in 2003.

Unfortunately, the U.N. population data contains separate rows for Serbia and Montenegro.

We can add the populations together, and create an average of the population density using our *create_country_population_data* function

In [None]:
# Combine the populations of Serbia and Montenegro

create_country_population_data ('Serbia and Montenegro', ['Serbia', 'Montenegro'], 2003, 2004)

pop_df.tail(3)

**Slovak Republic** is identified as Slovakia in the U.N. population data set. Rename to merge correctly with the mass protest dataframe.

In [None]:
pop_df.loc[pop_df['country'] == 'Slovakia', 'country'] = 'Slovak Republic'

#### Create Czechoslovakia records in U.N. Population
Now that Czechia and Slovak Republic have been renamed, we can use both to recreate values for Czechoslovakia in 1990.

In [None]:
# This only needs to be done for 1990, because that is the only year there were recorded protests in Czechoslovakia.
create_country_population_data ('Czechoslovakia', ['Czech Republic', 'Slovak Republic'], 1990, 1991)
pop_df.tail(3)

---
**South Korea** is identified as Republic of Korea in the U.N. population data set. Rename to merge correctly with the mass protest dataframe.

In [None]:
pop_df.loc[pop_df['country'] == 'Republic of Korea', 'country'] = 'South Korea'

**Swaziland** is identified as Eswatini in the U.N. population data set. Rename to merge correctly with the mass protest dataframe.

In [None]:
pop_df.loc[pop_df['country'] == 'Eswatini', 'country'] = 'Swaziland'  

**Syria** is identified as Syrian Arab Republic in the U.N. population data set. Rename to merge correctly with the mass protest dataframe.

In [None]:
pop_df.loc[pop_df['country'] == 'Syrian Arab Republic', 'country'] = 'Syria'

**Taiwan** is identified as China, Taiwan Province of China in the U.N. population data set. Rename to merge correctly with the mass protest dataframe.

In [None]:
pop_df.loc[pop_df['country'] == 'China, Taiwan Province of China', 'country'] = 'Taiwan'

**Tanzania** is identified as United Republic of Tanzania in the U.N. population data set. Rename to merge correctly with the mass protest dataframe.

In [None]:
pop_df.loc[pop_df['country'] == 'United Republic of Tanzania', 'country'] = 'Tanzania'

**Timor Leste** is identified as Timor-Leste in the U.N. population data set. Rename to merge correctly with the mass protest dataframe.

In [None]:
pop_df.loc[pop_df['country'] == 'Timor-Leste', 'country'] = 'Timor Leste'

**USSR** was dissolved in 1991. The U.N. population data has no 'USSR' label; Russia's rows were labeled 'Russian Federation' and were renamed to 'Russia' above. Recreate USSR values by combining the former Soviet republics.

60 USSR protests took place in 1990 and 1991.

In [None]:
soviet_countries = ['Armenia', 'Azerbaijan','Belarus','Estonia','Georgia', 'Kazakhstan', 'Kyrgyzstan', 'Latvia', 'Lithuania', 'Moldova', 'Russia', 'Tajikistan',
                    'Turkmenistan','Ukraine','Uzbekistan' ]
           
create_country_population_data ('USSR', soviet_countries, 1990, 1992)
pop_df.tail(3)

**United Arab Emirate** is identified as United Arab Emirates in the U.N. population data set. Rename to merge correctly with the mass protest dataframe.

In [None]:
pop_df.loc[pop_df['country'] == 'United Arab Emirates', 'country'] = 'United Arab Emirate'

**Venezuela** is identified as Venezuela (Bolivarian Republic of) in the U.N. population data set. Rename to merge correctly with the mass protest dataframe.

In [None]:
pop_df.loc[pop_df['country'] == 'Venezuela (Bolivarian Republic of)', 'country'] = 'Venezuela'

**Vietnam** is identified as Viet Nam in the U.N. population data set. Rename to merge correctly with the mass protest dataframe.

In [None]:
pop_df.loc[pop_df['country'] == 'Viet Nam', 'country'] = 'Vietnam'

**Yugoslavia** has no U.N. Population data.  Attempt to recreate by combining its former countries.

Former [Yugoslavia](https://en.wikipedia.org/wiki/Yugoslavia)

* Bosnia
* Croatia
* Macedonia
* Montenegro
* Serbia
* Slovenia

There are 137 protests identified as Yugoslavia, ranging from 1990 to 2002.

In [None]:
yugo_countries = ['Bosnia', 'Croatia', 'Macedonia', 'Montenegro', 'Serbia', 'Slovenia']

create_country_population_data ('Yugoslavia', yugo_countries, 1990, 2003)
pop_df.tail(3)
    

---
Double-check whether there are any countries in Mass protest that do not have corresponding U.N. population data.

In [None]:
global_pop_countries = sorted(set(list(pop_df['country'])))
mass_countries = sorted(set(list(mass['country'].value_counts().index)))

mass_not_global_pop = []

for country in mass_countries:
    if country not in global_pop_countries:
        mass_not_global_pop.append(country)
        
len(mass_not_global_pop)

In [None]:
print (f'Mass protest data starts in {mass["year"].min()} and goes through {mass["year"].max()} ')

In [None]:
# Used to spot-check a country's mass protest data
#mass[mass['country']=='Czech Republic'][['startmonth','startyear','endmonth','endyear','protesterdemand1','stateresponse1']].head()

Drop all rows in U.N. population data with extraneous data.

In [None]:
# Drop all years that are not in Mass Protests
pop_df = pop_df[ (pop_df['year']>= 1990)  & (pop_df['year']<= 2020) ]

# Drop all rows with countries not in Mass Protests
pop_df = pop_df[(pop_df['country'].isin (mass_countries))]

print (f'U.N. Population data now has {pop_df.shape[0]} rows and {pop_df.shape[1]} columns.')

In [None]:
pop_df.head()

In [None]:
print (f'Mass protest data has {mass.shape[0]} rows and {mass.shape[1]} columns.')

In [None]:
mass.columns

In [None]:
mass[mass['country']=='Canada'][['id','country','ccode','year']].head()

### Save U.N. population data as csv for merging into other notebook

In [None]:
pop_df.to_csv('../data/population_clean.csv')

#### Add U.N. Population data to Mass Protest data using merge

The merge copies each country's population values for a given year onto every protest row with that country and year.

This merge code will be copied and run in our main EDA notebook.

In [None]:
mass = mass.merge(pop_df, on=['country', 'year'])
mass.head()
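
Because `merge` defaults to an inner join, protests without a matching country/year row in the population data are dropped without warning. One way to check for that is sketched below on toy frames. The data is invented and `toy_mass` / `toy_pop` are hypothetical names; `how`, `indicator` and `validate` are standard `DataFrame.merge` parameters:

```python
import pandas as pd

# Toy protest rows: two protests in the same country/year, one with no population match.
toy_mass = pd.DataFrame({
    'id': [1, 2, 3],
    'country': ['Canada', 'Canada', 'Atlantis'],
    'year': [1990, 1990, 1990],
})

# Toy population rows: one row per country/year.
toy_pop = pd.DataFrame({
    'country': ['Canada'],
    'year': [1990],
    'PopTotal': [27541.0],  # illustrative value only
})

# how='left' keeps every protest; validate='m:1' raises if population keys repeat.
checked = toy_mass.merge(toy_pop, on=['country', 'year'], how='left',
                         indicator=True, validate='m:1')

# Rows flagged 'left_only' found no population data.
unmatched = checked.loc[checked['_merge'] == 'left_only', 'country'].tolist()
print(unmatched)
```

If `unmatched` is empty on the real data, the inner join used above loses no protests.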

In [None]:

#mass[mass['country']=='Papua New Guinea'][['id','country','ccode','year']].tail()

In [None]:
pop_df[pop_df['country']=='Canada'][['country','year','PopMale','PopFemale','PopTotal','PopDensity']].head()