In [1]:
library(ggplot2)
library(dplyr)
library(tidyr)
library(dygraphs)
library(lubridate)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union


Attaching package: 'lubridate'

The following object is masked from 'package:base':

    date



In [2]:
contrib <- read.csv('contribsandspend.csv', check.names=FALSE)
head(contrib)

Unnamed: 0,Date,Contributor,Contributor_ISO-3,Contributor_Capital,Contributor_Capital_Latitude,Contributor_Capital_Longitude,Contributor_Continent,Contributor_Region,Contributor UN_Bloc,⋯,Mission_SADC,Experts_on_Mission,Formed_Police_Units,Inidividual_Police,Civilian_Police,Troops,Observers,Total,Year,DefSpendGDP%
0,1990-11-30,Argentina,ARG,Buenos Aires,-36.5001,-60.0,South America,South America,GRULAC,,0,,,,,29.0,,29,1990,1.450909
1,1990-11-30,Argentina,ARG,Buenos Aires,-36.5001,-60.0,South America,South America,GRULAC,,1,,,,,6.0,,6,1990,1.450909
2,1990-11-30,Argentina,ARG,Buenos Aires,-36.5001,-60.0,South America,South America,GRULAC,,0,,,,,4.0,,4,1990,1.450909
3,1990-11-30,Australia,AUS,Canberra,-35.25,149.133,Oceania,Australia and New Zealand,WEOG,,0,,,,26.0,,,26,1990,2.075906
4,1990-11-30,Australia,AUS,Canberra,-35.25,149.133,Oceania,Australia and New Zealand,WEOG,,0,,,,,2.0,,2,1990,2.075906
5,1990-11-30,Australia,AUS,Canberra,-35.25,149.133,Oceania,Australia and New Zealand,WEOG,,0,,,,,13.0,,13,1990,2.075906


In [3]:
str(contrib)

'data.frame':	151845 obs. of  76 variables:
 $                              : int  0 1 2 3 4 5 6 7 8 9 ...
 $ Date                         : Factor w/ 334 levels "1990-11-30","1990-12-31",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Contributor                  : Factor w/ 156 levels "Albania","Algeria",..: 4 4 4 6 6 6 7 7 7 7 ...
 $ Contributor_ISO-3            : Factor w/ 155 levels "ALB","ARE","ARG",..: 3 3 3 6 6 6 7 7 7 7 ...
 $ Contributor_Capital          : Factor w/ 157 levels "Abu Dhabi","Abuja",..: 35 35 35 38 38 38 146 146 146 146 ...
 $ Contributor_Capital_Latitude : num  -36.5 -36.5 -36.5 -35.2 -35.2 ...
 $ Contributor_Capital_Longitude: num  -60 -60 -60 149 149 ...
 $ Contributor_Continent        : Factor w/ 6 levels "Africa","Asia",..: 6 6 6 5 5 5 3 3 3 3 ...
 $ Contributor_Region           : Factor w/ 22 levels "Australia and New Zealand",..: 15 15 15 1 1 1 22 22 22 22 ...
 $ Contributor UN_Bloc          : Factor w/ 6 levels "AG","APG","EEG",..: 4 4 4 5 5 5 5 5 5 5 ...
 $ Contributor_

In [4]:
# Keep Date, DefSpendGDP%, and the personnel-count columns; make everything numeric for regression
vars <- c("Date", "Experts_on_Mission", "Formed_Police_Units", "Inidividual_Police", "Civilian_Police", "Troops", "Observers", "Total", "DefSpendGDP%")
contrib <- contrib[, colnames(contrib) %in% vars]
# Convert Date to a number (days since 1970-01-01) so it can be used in cor() and lm()
contrib$Date <- as_date(contrib$Date)
contrib$Date <- as.numeric(contrib$Date)
# Treat missing values as zero (note: this also zero-fills any missing DefSpendGDP% values)
contrib[is.na(contrib)] <- 0
str(contrib)

'data.frame':	151845 obs. of  9 variables:
 $ Date               : num  7638 7638 7638 7638 7638 ...
 $ Experts_on_Mission : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Formed_Police_Units: num  0 0 0 0 0 0 0 0 0 0 ...
 $ Inidividual_Police : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Civilian_Police    : num  0 0 0 26 0 0 0 0 0 0 ...
 $ Troops             : num  29 6 4 0 2 13 532 410 11 14 ...
 $ Observers          : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Total              : num  29 6 4 26 2 13 532 410 11 14 ...
 $ DefSpendGDP%       : num  1.45 1.45 1.45 2.08 2.08 ...
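
The numeric Date values above are day counts since 1970-01-01, which is how R stores dates internally. A minimal base-R sketch of that conversion follows; the rescaling to years at the end is an illustrative assumption, not something the notebook does.

```r
# A Date is stored as the number of days since 1970-01-01.
d <- as.Date("1990-11-30")
days <- as.numeric(d)
days   # 7638, the value shown for the first rows in str(contrib)

# Round trip back to a calendar date
back <- as.Date(days, origin = "1970-01-01")

# Optional rescaling (an assumption for illustration): expressing time in
# years makes a regression slope on time read as "per year", not "per day".
years_since_1990 <- (days - as.numeric(as.Date("1990-01-01"))) / 365.25
```

With Date left in days, any regression coefficient on it is a change per day, so it looks small even when the yearly change is not.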


In [5]:
# Rename some columns that bug me (and get rid of the % that R doesn't like)
names(contrib)[names(contrib) == "Inidividual_Police"] <- "Individual_Police"
names(contrib)[names(contrib) == "DefSpendGDP%"] <- "DefSpend"
head(contrib)


Date,Experts_on_Mission,Formed_Police_Units,Individual_Police,Civilian_Police,Troops,Observers,Total,DefSpend
7638,0,0,0,0,29,0,29,1.450909
7638,0,0,0,0,6,0,6,1.450909
7638,0,0,0,0,4,0,4,1.450909
7638,0,0,0,26,0,0,26,2.075906
7638,0,0,0,0,2,0,2,2.075906
7638,0,0,0,0,13,0,13,2.075906


In [6]:
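library(caTools)
set.seed(1000) # fix the RNG seed so the split is reproducible

# Caution: sample.split() expects a vector of labels, one per row. Given the
# whole data frame it returns one flag per column (9 here), and subset()
# recycles those flags over the rows, so the split is roughly 2/3 : 1/3, not 70/30.
# Passing a single column (e.g. contrib$Troops) would give a 70/30 row split.
split = sample.split(contrib, SplitRatio = 0.7)

# Training data (about two thirds of the rows with the call above)
train_data = subset(contrib, split==TRUE)

# Test data (the remaining rows)
test_data  = subset(contrib, split==FALSE)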
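
A row-wise 70/30 split can also be drawn with base R alone. The sketch below uses a small toy data frame (`toy`, with made-up values) as a stand-in for `contrib`; it illustrates the technique and is not the split used for the results in this notebook.

```r
set.seed(1000)
# Toy stand-in for contrib (hypothetical values)
toy <- data.frame(Troops = rpois(100, 20), DefSpend = runif(100, 0.5, 4))

# Draw 70% of the row indices without replacement
n_train <- round(0.7 * nrow(toy))
train_idx <- sample(seq_len(nrow(toy)), size = n_train)

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# Row counts of the two pieces
c(train = nrow(toy_train), test = nrow(toy_test))
```

Indexing by row number keeps the split tied to rows, so the row counts of the two pieces can be checked directly.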

In [7]:
cor(train_data)

,Date,Experts_on_Mission,Formed_Police_Units,Individual_Police,Civilian_Police,Troops,Observers,Total,DefSpend
Date,1.0,0.28646502,0.14327954,0.1929675851,0.08582623,0.02683658,-0.09091557,0.03472192,-0.1705215839
Experts_on_Mission,0.28646502,1.0,0.21997439,0.2053625588,0.17084971,0.33529559,0.46221204,0.35110962,-0.0100090933
Formed_Police_Units,0.14327954,0.21997439,1.0,0.2609786941,0.75018136,0.20667393,0.07556786,0.29208203,0.0455272277
Individual_Police,0.19296759,0.20536256,0.26097869,1.0,0.45146189,0.09894434,0.04195991,0.15129567,0.0003110808
Civilian_Police,0.08582623,0.17084971,0.75018136,0.4514618852,1.0,0.2210611,0.07049268,0.33654846,0.096560609
Troops,0.02683658,0.33529559,0.20667393,0.098944345,0.2210611,1.0,0.30046829,0.99262007,0.0450727009
Observers,-0.09091557,0.46221204,0.07556786,0.0419599069,0.07049268,0.30046829,1.0,0.31482117,0.0696045275
Total,0.03472192,0.35110962,0.29208203,0.1512956735,0.33654846,0.99262007,0.31482117,1.0,0.0563869971
DefSpend,-0.17052158,-0.01000909,0.04552723,0.0003110808,0.09656061,0.0450727,0.06960453,0.056387,1.0


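The strongest correlations are between the contribution types themselves: one kind of contribution tends to come with others (for example, Formed_Police_Units and Civilian_Police at 0.75; the 0.99 between Troops and Total mostly reflects that troops are counted in Total). This makes sense, as nations send "packages" of forces to missions that include multiple kinds of personnel. DefSpend, by contrast, is only weakly correlated with every contribution type (at most about 0.10). The specific theory I want to test, though, is that nations with low defense spending send extra peacekeepers, both for the training and for the reimbursement money.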
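
Reading the largest entries off a correlation matrix by eye gets tedious, so here is a small sketch that ranks the variable pairs by absolute correlation. The data frame `toy` and its values are invented for illustration; the same steps should apply to `cor(train_data)`.

```r
set.seed(1)
# Toy data (hypothetical): Total is built from Troops plus Observers
n <- 200
toy <- data.frame(Troops = rpois(n, 50),
                  Observers = rpois(n, 5),
                  DefSpend = runif(n, 0.5, 4))
toy$Total <- toy$Troops + toy$Observers

cm <- cor(toy)
# Keep each pair once by blanking the diagonal and the lower triangle
cm[lower.tri(cm, diag = TRUE)] <- NA

# Reshape the matrix to long form: one row per variable pair
ranked <- as.data.frame(as.table(cm), stringsAsFactors = FALSE)
ranked <- ranked[!is.na(ranked$Freq), ]
names(ranked) <- c("var1", "var2", "r")

# Sort by absolute correlation, strongest first
ranked <- ranked[order(-abs(ranked$r)), ]
head(ranked, 3)
```

A component and the total that contains it will sit at the top of such a ranking by construction, so pairs like that are best set aside before interpreting the rest.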

In [8]:
spend_reg <- lm(train_data$Troops ~ train_data$DefSpend)
summary(spend_reg)


Call:
lm(formula = train_data$Troops ~ train_data$DefSpend)

Residuals:
   Min     1Q Median     3Q    Max 
-256.6 -124.7 -114.4  -99.2 9633.0 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)           92.205      2.481   37.16   <2e-16 ***
train_data$DefSpend   17.084      1.190   14.36   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 420.9 on 101228 degrees of freedom
Multiple R-squared:  0.002032,	Adjusted R-squared:  0.002022 
F-statistic: 206.1 on 1 and 101228 DF,  p-value: < 2.2e-16
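
The held-out test set is never scored above. One way to do that is to write the formula with bare column names and a `data=` argument, so that `predict()` can find the same columns in `newdata`; with `train_data$...` terms in the formula, `predict()` generally cannot substitute new rows. The sketch below uses invented toy data (`toy`, `toy_train`, `toy_test`) and is not a rerun of the model above.

```r
set.seed(42)
# Toy stand-in for the notebook's data (hypothetical values)
toy <- data.frame(DefSpend = runif(300, 0.5, 4))
toy$Troops <- 90 + 17 * toy$DefSpend + rnorm(300, sd = 50)

train_idx <- sample(seq_len(nrow(toy)), size = 200)
toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# Bare column names plus data= let predict() look up DefSpend in newdata
fit <- lm(Troops ~ DefSpend, data = toy_train)

# Score the held-out rows
pred <- predict(fit, newdata = toy_test)
rmse <- sqrt(mean((toy_test$Troops - pred)^2))

# Baseline for comparison: always predict the training mean
rmse_base <- sqrt(mean((toy_test$Troops - mean(toy_train$Troops))^2))
c(model = rmse, baseline = rmse_base)
```

Comparing the model's error with the mean-only baseline is a quick check on whether a statistically significant coefficient actually improves predictions.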


In [9]:
# Defense spending alone has very little predictive value (R-squared is about 0.002),
# and its coefficient is positive rather than negative as the theory would predict.
# Maybe it changes over time?
spend_reg2 <- lm(train_data$Troops ~ train_data$DefSpend + train_data$Date)
summary(spend_reg2)


Call:
lm(formula = train_data$Troops ~ train_data$DefSpend + train_data$Date)

Residuals:
   Min     1Q Median     3Q    Max 
-248.1 -128.5 -113.2  -91.8 9654.9 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)         1.162e+01  7.631e+00   1.523    0.128    
train_data$DefSpend 1.938e+01  1.207e+00  16.057   <2e-16 ***
train_data$Date     5.629e-03  5.041e-04  11.165   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 420.6 on 101227 degrees of freedom
Multiple R-squared:  0.003259,	Adjusted R-squared:  0.003239 
F-statistic: 165.5 on 2 and 101227 DF,  p-value: < 2.2e-16
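
Adding Date as a separate term controls for a time trend, but it does not let the effect of defense spending itself vary over time. An interaction term is one way to ask that question. The sketch below is on simulated toy data in which the slope is built to grow with time; the names `toy` and `Year` and all values are assumptions for illustration.

```r
set.seed(7)
n <- 500
toy <- data.frame(DefSpend = runif(n, 0.5, 4),
                  Year = runif(n, 0, 25))   # years since start, hypothetical
# Simulated so that the DefSpend slope increases by 2 per year
toy$Troops <- 100 + (5 + 2 * toy$Year) * toy$DefSpend + rnorm(n, sd = 20)

# Additive model: one DefSpend slope for all years
additive <- lm(Troops ~ DefSpend + Year, data = toy)
# Interaction model: DefSpend slope is allowed to change with Year
interact <- lm(Troops ~ DefSpend * Year, data = toy)

# Estimated change in the DefSpend slope per year
coef(interact)["DefSpend:Year"]

# F-test of whether the interaction improves on the additive model
anova(additive, interact)
```

In a formula, `DefSpend * Year` expands to both main effects plus the `DefSpend:Year` product term.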


In [10]:
# Maybe defense spending relates more strongly to total contributions than to Troops alone
spend_reg3 <- lm(train_data$Total ~ train_data$DefSpend + train_data$Date)
summary(spend_reg3)


Call:
lm(formula = train_data$Total ~ train_data$DefSpend + train_data$Date)

Residuals:
   Min     1Q Median     3Q    Max 
-306.8 -143.4 -120.9  -82.6 9647.6 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)         -4.1483589  7.9405719  -0.522    0.601    
train_data$DefSpend 25.3351792  1.2560079  20.171   <2e-16 ***
train_data$Date      0.0075292  0.0005246  14.353   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 437.7 on 101227 degrees of freedom
Multiple R-squared:  0.005204,	Adjusted R-squared:  0.005184 
F-statistic: 264.8 on 2 and 101227 DF,  p-value: < 2.2e-16


In [12]:
# Maybe try a logistic regression with bins for different quantities of Troops: none, some, more, lots...
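
The binning idea could be sketched as follows. The bin thresholds, the toy data, and the choice to start with a simple "any troops vs none" logistic model are all assumptions for illustration; nothing here comes from the actual dataset.

```r
set.seed(3)
n <- 400
toy <- data.frame(DefSpend = runif(n, 0.5, 4))
# Hypothetical counts with many zeros, loosely mimicking sparse deployments
toy$Troops <- rpois(n, lambda = 30) * rbinom(n, 1, 0.6)

# Ordered bins; the cut points (0, 25, 35) are illustrative guesses
toy$TroopBin <- cut(toy$Troops,
                    breaks = c(-Inf, 0, 25, 35, Inf),
                    labels = c("none", "some", "more", "lots"),
                    ordered_result = TRUE)
table(toy$TroopBin)

# Simplest logistic model: any troops at all vs none
toy$AnyTroops <- as.integer(toy$Troops > 0)
fit <- glm(AnyTroops ~ DefSpend, data = toy, family = binomial)
summary(fit)$coefficients
```

For the full four-level outcome, an ordinal model such as `polr()` from the MASS package would be one option, at the cost of a proportional-odds assumption that would need checking.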
