### Which ISIC indicator is most important in affecting the GDP of Singapore? 

To explore this question, we fit a multiple linear regression model of Singapore's GDP against the ISIC indicators and check which indicators are statistically significant.

In [None]:
#Remove null values
clean_data <- na.omit(data_subset)

#Get Singapore-related data only
data_singapore <- subset(clean_data, Country == "Singapore")

Linear regression cannot use rows with missing values, so we remove every row that contains a null value.<br>
We also extract Singapore's data specifically.
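
As a minimal sketch of what these two steps do, the toy data frame below (hypothetical values, not taken from the dataset) shows `na.omit()` dropping any row that has at least one missing value, followed by `subset()` keeping a single country. The names `toy`, `toy_clean` and `toy_sg` are assumptions made for illustration only.

```r
# Toy data frame with made-up values; one GDP entry is missing
toy <- data.frame(
  Country = c("Singapore", "Singapore", "Malaysia", "Malaysia"),
  GDP     = c(100, NA, 80, 90),
  stringsAsFactors = FALSE
)

# na.omit() drops every row that has at least one NA
toy_clean <- na.omit(toy)

# subset() then keeps only the rows for one country
toy_sg <- subset(toy_clean, Country == "Singapore")

nrow(toy_clean)  # 3 rows remain after removing the NA row
nrow(toy_sg)     # 1 Singapore row remains
```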

In [None]:
#Linear Regression Formula 1
formula <- `Gross Domestic Product (GDP)` ~ 
  ` Agriculture, hunting, forestry, fishing (ISIC A-B) ` +
  `Construction (ISIC F)` +
  `Manufacturing (ISIC D)` +
  ` Mining, Manufacturing, Utilities (ISIC C-E) ` +
  ` Transport, storage and communication (ISIC I) ` +
  ` Wholesale, retail trade, restaurants and hotels (ISIC G-H) `

model <- lm(formula, data = data_singapore)
summary(model)

From the above results, we can see that the R-squared and Adjusted R-squared values are both very high, which indicates that the model explains most of the variation in Singapore's GDP.

Three variables are statistically significant predictors of GDP: Manufacturing (ISIC D); Transport, Storage and Communication (ISIC I); and Wholesale, Retail Trade, Restaurants and Hotels (ISIC G-H). If we choose a single best indicator based on the lowest p-value, ISIC G-H is the most important indicator affecting the GDP of Singapore.
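
Reading p-values off a printed summary can be error-prone, so one possible approach is to pull the coefficient table out of the summary and sort it. The sketch below uses R's built-in `mtcars` data as a stand-in (an assumption for illustration; it is not the ISIC dataset), and the names `fit`, `coef_table` and `ranked` are hypothetical.

```r
# Illustrative fit on R's built-in mtcars data (not the ISIC dataset)
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Coefficient table: one row per term, with columns for the estimate,
# standard error, t value and p-value
coef_table <- coef(summary(fit))

# Reorder the rows so the term with the smallest p-value comes first
ranked <- coef_table[order(coef_table[, "Pr(>|t|)"]), ]
ranked
```

Note that a small p-value measures the strength of evidence against a zero coefficient, not the size of the effect, so the ranking is only one way of judging importance.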


### How does the above ISIC indicator get affected by other ISIC factors?

To explore this question, we fit a multiple linear regression model of Wholesale, Retail Trade, Restaurants and Hotels (ISIC G-H) against the rest of the ISIC indicators. We do this for every country available in the dataset.

In [None]:
#Getting a list of all the different countries 
countries <- unique(clean_data$Country)

#Creating a dataframe to store the most significant indicator for each country
most_significant_variables <- data.frame(Country = character(), 
                                         MostSignificantVariable = character(), 
                                         stringsAsFactors = FALSE)

As an example, we first fit the multiple linear regression model for Singapore only.

In [None]:
#Linear Regression Formula 2
formula2 <- ` Wholesale, retail trade, restaurants and hotels (ISIC G-H) ` ~ 
  ` Agriculture, hunting, forestry, fishing (ISIC A-B) ` +
  `Construction (ISIC F)` +
  `Manufacturing (ISIC D)` +
  ` Mining, Manufacturing, Utilities (ISIC C-E) ` +
  ` Transport, storage and communication (ISIC I) `

#Example of doing Linear regression with formula 2 for Singapore
model2 <- lm(formula2, data = data_singapore)
summary(model2)
#Name of the term with the lowest p-value
names(which.min(summary(model2)$coefficients[, "Pr(>|t|)"]))

From the above results, we extract the term with the lowest p-value and store it in a dataframe. As seen above, for Singapore specifically, the most significant term for ISIC G-H is not any of the other ISIC indicators but the intercept.
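
Because the intercept is part of the coefficient table, `which.min()` can select it even though it is not an ISIC indicator. A minimal sketch of one way to make that choice explicit is shown below. The helper `lowest_p_term()` and its `include_intercept` argument are hypothetical names introduced for illustration, and the fit uses R's built-in `mtcars` data rather than the ISIC dataset.

```r
# Hypothetical helper: name of the term with the smallest p-value,
# optionally ignoring the intercept row
lowest_p_term <- function(model, include_intercept = FALSE) {
  p <- coef(summary(model))[, "Pr(>|t|)"]
  if (!include_intercept) {
    p <- p[names(p) != "(Intercept)"]
  }
  names(which.min(p))
}

# Illustrative fit on built-in data (not the ISIC dataset)
fit <- lm(mpg ~ wt + hp, data = mtcars)

lowest_p_term(fit, include_intercept = TRUE)  # intercept is allowed to win
lowest_p_term(fit)                            # only predictors are considered
```

Whether to exclude the intercept is a design choice: the analysis here keeps it, which is why it can appear among the results.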

We now repeat this for all the different countries.

In [None]:
#Doing linear regression with formula 2 for all countries
for(country in countries){
  model2 <- lm(formula2, data = subset(clean_data, Country == country))
  model_summary <- summary(model2)
  most_significant_variable <- names(which.min(model_summary$coefficients[, "Pr(>|t|)"])) 
  most_significant_variables <- rbind(most_significant_variables,
                                      data.frame(Country = country,
                                                 MostSignificantVariable = most_significant_variable,
                                                 stringsAsFactors = FALSE))
}
most_significant_variables

With the above results, we count how often each indicator appears and draw a barplot of the counts.
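
The counting step can be sketched on its own with `table()` and `sort()`. The vector `winners` below holds made-up values for illustration and does not reflect the actual per-country results.

```r
# Hypothetical per-country results (made-up values for illustration)
winners <- c("ISIC I", "ISIC D", "ISIC I", "ISIC F", "ISIC I", "ISIC D")

# table() counts each distinct value; sort() puts the largest count first
counts <- sort(table(winners), decreasing = TRUE)
counts

# The name of the first entry is the most frequent value
names(counts)[1]  # "ISIC I"
```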

In [None]:
#Plotting barplot (the axis labels are matched by position to the sorted values of MostSignificantVariable)
ggplot(most_significant_variables, aes(x = MostSignificantVariable, fill = MostSignificantVariable)) +
  geom_bar() +
  scale_x_discrete(labels = c('Intercept', 'ISIC A-B', 'ISIC C-E', 'ISIC I', 'ISIC F', 'ISIC D')) +
  labs(title = "Number of Occurrences of Each Indicator",
       x = "Indicator",
       y = "Count") +
  theme_minimal()

In [None]:
#Finding the variable that appears most often
most_common_variable <- names(sort(table(most_significant_variables$MostSignificantVariable), decreasing = TRUE)[1])
most_common_variable

As seen from the barplot and the above result, the indicator that appears most often is Transport, Storage and Communication (ISIC I). Thus, across the countries in the dataset, ISIC I is the indicator that most often has the most significant effect on ISIC G-H.