We then opened the file and built our pre-processed dataset by removing rows with NA values. We were interested only in the social variables we considered most important: "gdp", "fertilityrate", "netmigration", "unemployment", "primaryeducation", "mortalityratefemale", "mortalityratemale", "labourforce", "grossfixedcapital" and "secondaryeducation".

In [None]:
# load the raw dataset
raw_data = read.csv("Researchdata.csv")
# inspect the raw data in a viewer
View(raw_data)
# set up the data for regression: drop every row that contains an NA
df = na.omit(raw_data)
all_variables = c("gdp", "fertilityrate", "netmigration", "unemployment", "primaryeducation", "mortalityratefemale", "mortalityratemale", "labourforce", "grossfixedcapital", "secondaryeducation")

Since Cobb-Douglas production functions are among the most common, and in them GDP is a product of powers of capital and labor, we took the logarithm of all variables to obtain a linear relationship. This step produced NA values wherever the data were negative, so we removed those rows.
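
For reference, the transformation can be written out explicitly. This is a sketch that assumes the standard two-factor Cobb-Douglas form; the symbols $Y$ (GDP), $K$ (capital), $L$ (labor), $A$, $\alpha$ and $\beta$ are our own notation and are not column names in the dataset.

$$Y = A K^{\alpha} L^{\beta} \quad\Longrightarrow\quad \log Y = \log A + \alpha \log K + \beta \log L$$

Under this reading, a linear regression of log "gdp" on log "grossfixedcapital" and log "labourforce" has an intercept that estimates $\log A$ and slopes that estimate $\alpha$ and $\beta$.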

We then run our first regression, of GDP on capital and labor. We test the normality of the residuals and produce diagnostic and added-variable plots.

In [None]:
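# take the logarithm of the selected variables
df = log(df[all_variables])
# NAs were generated when taking the logarithm of negative data (netmigration has some negative values), so we drop those rows again
df = na.omit(df)
attach(df)
# first regression: GDP on capital and labor (data frame passed explicitly rather than relying only on attach())
regression1 = lm(gdp ~ grossfixedcapital + labourforce, data = df)
summary(regression1)
# residual plots and normality tests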
library(olsrr)
ols_plot_resid_fit(regression1)
ols_plot_resid_qq(regression1)
ols_test_normality(regression1)
#generate added variable plots
library(car)
avPlots(regression1)

We conclude our work with the second regression. We check for potential outliers, perform model selection and, in the final step, remove the least significant variables to obtain a more precise model.

In [None]:
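# second regression: GDP on the social variables (data frame passed explicitly rather than relying only on attach())
regression2 = lm(gdp ~ fertilityrate + netmigration + unemployment + primaryeducation + mortalityratefemale + mortalityratemale + secondaryeducation, data = df)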
summary(regression2)
#testing for potential outliers and testing normality of residuals
boxplot(df["gdp"], xlab="gdp")
plot(regression2)
hist(residuals(regression2), main="Residuals histogram", xlab="residuals")
ols_test_normality(regression2)
#generate added variable plots
avPlots(regression2)
# model selection, stepwise methods based on p-values
ols_step_forward_p(regression2)
ols_step_backward_p(regression2)
ols_step_both_p(regression2)
# model selection, best subset
ols_step_best_subset(regression2)
# third regression: keep only the variables retained after model selection
regression3 = lm(gdp ~ netmigration + secondaryeducation + unemployment + mortalityratemale, data = df)
summary(regression3)