# Predicting population growth based on development indicators


Can you train a model to predict the population growth based on other development indicators?  

First, I'm going to look at how population growth varies across the world.

In [None]:
%pip install folium

In [None]:
import pandas as pd
import numpy as np
import random
# import matplotlib.pyplot as plt
# from sklearn.preprocessing import StandardScaler
import folium

First, I need to load the dataset into a DataFrame.

In [None]:
data = pd.read_csv('./Indikatorer/Indicators.csv')

In [None]:
# quick look at the columns
data.head()

Before going further, I want to scan through the indicators, and I find Excel a decent way of doing that. So I'm making a dataframe that contains only the unique indicator names and their corresponding codes. Further on, it will probably be easier to use the codes than the names.

In [None]:
data2 = data[['IndicatorName', 'IndicatorCode']].drop_duplicates()
data2.head()
type(data2)

In [None]:
data2.to_excel('./Indikatorer.xlsx')

I am trying to train a model for population growth across different countries. So first I'd like to look at how the growth differs across countries and years.

### Coordinates:

Source is the same as the one used in the course: https://github.com/python-visualization/folium/blob/master/examples/data/world-countries.json
Raw form: https://raw.githubusercontent.com/python-visualization/folium/588670cf1e9518f159b0eee02f75185301327342/examples/data/world-countries.json

In [None]:
country_geo = 'geo/world-countries.json'

In [None]:
country_geo

In [None]:
# select population growth for all countries in 2011 and in 1960
growthIndicator = 'SP.POP.GROW'

# exact match on the code; str.contains would treat '.' as a regex wildcard
mask1 = data['IndicatorCode'] == growthIndicator
year2011 = data['Year'] == 2011
year1960 = data['Year'] == 1960

# apply our mask
stage2011 = data[mask1 & year2011]
stage1960 = data[mask1 & year1960]
stage1960.head()

In [None]:
# Now I am creating a dataframe containing only the data I need: the country codes and the values for population growth
plot_data2011 = stage2011[['CountryCode','Value']]
plot_data2011.describe()

In [None]:
# This is the label for the legend
growthIndicator = stage2011.iloc[0]['IndicatorName']

In [None]:
# Plotting the population growth on a geographical overlay
map = folium.Map(location=[0, 0], zoom_start=1.5)

In [None]:
# bin edges for the colour scale, in percentage points of annual growth
bins = [-6, -4, -2, 0, 2, 4, 6, 8, 10]

In [None]:
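# Map.choropleth is deprecated in newer folium releases; folium.Choropleth is the replacement
folium.Choropleth(geo_data=country_geo, data=plot_data2011,
                  columns=['CountryCode', 'Value'],
                  key_on='feature.id',
                  fill_color='YlGnBu', fill_opacity=0.7, line_opacity=0.2,
                  bins=bins, legend_name=growthIndicator).add_to(map)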

In [None]:
# Create Folium plot
map.save('plot_data2011.html')

In [None]:
# Import the Folium interactive html file
from IPython.display import HTML
HTML('<iframe src=plot_data2011.html width=700 height=600></iframe>')

In [None]:
# And for a different year (1960):
plot_data1960 = stage1960[['CountryCode','Value']]
plot_data1960.describe()

In [None]:
# This is the label for the legend of the 1960 map
growthIndicator = stage1960.iloc[0]['IndicatorName']

In [None]:
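# Start from a fresh map so the 1960 layer is not drawn on top of the 2011 layer
map = folium.Map(location=[0, 0], zoom_start=1.5)
folium.Choropleth(geo_data=country_geo, data=plot_data1960,
                  columns=['CountryCode', 'Value'],
                  key_on='feature.id',
                  fill_color='YlGnBu', fill_opacity=0.7, line_opacity=0.2,
                  bins=bins, legend_name=growthIndicator).add_to(map)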

In [None]:
# Create Folium plot
map.save('plot_data1960.html')

In [None]:
# Import the Folium interactive html file
from IPython.display import HTML
HTML('<iframe src=plot_data1960.html width=700 height=600></iframe>')

## Now to check if the indicators can predict population growth. 

I think linear regression is a good starting point for this task, and I will compare it with a decision tree regressor. I will therefore import the relevant packages.

In [None]:
# import sqlite3
# import pandas as pd 
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from math import sqrt

I'll make a list of the indicator codes and build a smaller dataset containing only the ones that I want to investigate.

In [None]:
dev  = ['NY.GDP.PCAP.CD', 
        'SI.DST.10TH.10',
        'SP.DYN.LE00.IN',
        'SL.UEM.LTRM.ZS',
        'MS.MIL.XPND.GD.ZS',
        'SH.MED.PHYS.ZS',
        'IS.RRS.TOTL.KM',
        'SP.RUR.TOTL.ZG',
        'SP.URB.TOTL.IN.ZS',
        'EN.ATM.CO2E.KD.GD',
        'SP.DYN.TFRT.IN',
        'SP.POP.GROW']
small = data.loc[data['IndicatorCode'].isin(dev)]

small.head()

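In order to use the techniques from the course, I need to transform the dataset from a "tall" layout (one row per country, year and indicator) to a "flat" layout (one row per country and year, with one column per indicator). I will do this in a series of steps: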

In [None]:
# First drop unnecessary columns
small2 = small.drop(['CountryCode', 'IndicatorCode'], axis=1)
small2.head(15)

Then I'll pivot the data and make a new flat dataset. 

In [None]:
df_pivot=small2.pivot(index=['CountryName','Year'], columns='IndicatorName',values=['Value'])
df_pivot.head()

In [None]:
# Drop the outer 'Value' column level so that only the indicator names remain as column labels
df_pivot.columns = df_pivot.columns.droplevel()
df_pivot

In [None]:
modified_df=df_pivot.rename_axis(None,axis=1)
modified_df

In [None]:
# Finally we set the indexes at the same level:
modified_df=modified_df.reset_index()
modified_df.head()
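
The reshaping steps above can be condensed into one chained expression. The following is a minimal sketch on a hypothetical toy frame (the country names, indicator names and values are made up for illustration and are not taken from `Indicators.csv`). Passing `values='Value'` as a string rather than a list avoids the extra column level, so no `droplevel` is needed.

```python
import pandas as pd

# Hypothetical toy data in the same "tall" layout as the indicator dataset
tall = pd.DataFrame({
    'CountryName': ['A', 'A', 'B', 'B'],
    'Year': [2000, 2000, 2000, 2000],
    'IndicatorName': ['GDP', 'Growth', 'GDP', 'Growth'],
    'Value': [100.0, 1.5, 200.0, 0.5],
})

# One row per (country, year), one column per indicator
wide = (tall.pivot(index=['CountryName', 'Year'],
                   columns='IndicatorName',
                   values='Value')
            .rename_axis(None, axis=1)   # remove the 'IndicatorName' label from the columns
            .reset_index())              # turn CountryName and Year back into ordinary columns

print(wide)
```

Note that `pivot` raises an error if the same (country, year, indicator) combination occurs more than once, which is a useful check on the data.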

In [None]:
modified_df.describe()

Looking at the data in the table above, I immediately see that several of the indicators are missing for many of the years. I will therefore drop those columns.

In [None]:
modified_df.drop(['CO2 emissions (kg per 2005 US$ of GDP)',
                  'Income share held by highest 10%',
                  'Long-term unemployment (% of total unemployment)',
                  'Military expenditure (% of GDP)',
                  'Physicians (per 1,000 people)',
                  'Rail lines (total route-km)'], axis=1, inplace=True)
modified_df.head()
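
Rather than reading the missing columns off the summary table by eye, the share of missing values per column can be computed directly. This is a minimal sketch on a hypothetical toy frame; the column names and the 0.5 cut-off are assumptions chosen for illustration, not values used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 'sparse' is mostly missing, 'dense' is complete
toy = pd.DataFrame({
    'dense': [1.0, 2.0, 3.0, 4.0],
    'sparse': [np.nan, np.nan, np.nan, 1.0],
})

# Share of missing values per column (0.0 = complete, 1.0 = entirely empty)
missing_share = toy.isna().mean()

# Arbitrary illustrative cut-off: drop columns with more than half their values missing
threshold = 0.5
to_drop = missing_share[missing_share > threshold].index.tolist()
trimmed = toy.drop(columns=to_drop)

print(missing_share)
print(to_drop)
```

Applied to `modified_df`, the same `isna().mean()` call would show how sparse each indicator is before deciding which ones to drop.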

Cleaning the data by dropping every row that still has a missing value:

In [None]:
cleaned=modified_df.dropna()
cleaned.describe()

In [None]:
cleaned.shape

In [None]:
cleaned.iloc[100]

First, let's take a look at the correlations between the indicators.

In [None]:
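# CountryName is text, so restrict the correlation matrix to the numeric columns
cleaned.select_dtypes('number').corr()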

From the table we can see that population growth is not particularly strongly correlated with any of the indicators.
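
To read the relationship with a single target off a full correlation matrix more easily, the target's column can be extracted and sorted by absolute value. The sketch below uses a hypothetical toy frame with made-up column names (`target`, `strong`, `weak`); it only illustrates the pattern and says nothing about the real indicators.

```python
import pandas as pd

# Hypothetical toy data: 'strong' tracks the target closely, 'weak' does not
toy = pd.DataFrame({
    'target': [1.0, 2.0, 3.0, 4.0, 5.0],
    'strong': [2.0, 4.1, 5.9, 8.2, 10.0],
    'weak': [3.0, 1.0, 4.0, 1.0, 5.0],
})

# Pearson correlation of every other column with the target,
# ordered from the strongest to the weakest absolute correlation
corr_with_target = (toy.corr()['target']
                       .drop('target')
                       .sort_values(key=abs, ascending=False))

print(corr_with_target)
```

Keep in mind that Pearson correlation only measures linear association, so a weak value does not rule out a non-linear relationship.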

I now have a dataset with 9560 rows, and I would like to see whether these five indicators:
- Fertility rate, total (births per woman)
- GDP per capita (current US$)
- Life expectancy at birth, total (years)
- Rural population growth (annual %)
- Urban population (% of total)

can be used to predict the population growth:
- Population growth (annual %)

First I'll set up a list of the development indicators:

In [None]:
development  = ['Fertility rate, total (births per woman)',
        'GDP per capita (current US$)',
        'Life expectancy at birth, total (years)',
        'Rural population growth (annual %)',
        'Urban population (% of total)']


... and declare what the target indicator is

In [None]:
target =['Population growth (annual %)']

Extracting the development indicators and the target values into separate dataframes so that I can fit the model:

In [None]:
X = cleaned[development]
X

In [None]:
y = cleaned[target]
y

Splitting the dataset into training and testing 

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=3)

(1) Linear Regression: Fit a model to the training set

In [None]:
regressor = LinearRegression()
regressor.fit(X_train, y_train)

Perform Prediction using Linear Regression Model

In [None]:
y_prediction = regressor.predict(X_test)
y_prediction[:10]

In [None]:
print(y_test[:10])

In [None]:
y_test.shape

What is the mean of the expected target value in the test set?

In [None]:
prediction =pd.DataFrame(y_prediction[:10])
print(prediction)

In [None]:
y_test.describe()


Evaluate Linear Regression Accuracy using Root Mean Square Error

In [None]:
RMSE = sqrt(mean_squared_error(y_true = y_test, y_pred = y_prediction))
print(RMSE)
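
For reference, the RMSE is the square root of the mean of the squared differences between the true and predicted values. The sketch below computes it by hand with NumPy on three made-up values; the helper name `rmse` and the toy numbers are assumptions for illustration only.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up values: two perfect predictions and one that is off by 2
y_true_toy = [1.0, 2.0, 3.0]
y_pred_toy = [1.0, 2.0, 5.0]

# Squared errors are 0, 0 and 4, so the result is sqrt(4 / 3)
toy_rmse = rmse(y_true_toy, y_pred_toy)
print(toy_rmse)
```

Because the errors are squared before averaging, a few large misses weigh more heavily than many small ones. The result is expressed in the same unit as the target, here percentage points of annual growth.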


(2) Decision Tree Regressor: Fit a new regression model to the training set

In [None]:
regressor = DecisionTreeRegressor(max_depth=10)
regressor.fit(X_train, y_train)


Perform Prediction using Decision Tree Regressor

In [None]:
y_prediction = regressor.predict(X_test)
y_prediction

In [None]:
y_test.describe()

Evaluate Decision Tree Regression Accuracy using Root Mean Square Error

In [None]:
RMSE = sqrt(mean_squared_error(y_true = y_test, y_pred = y_prediction))
print(RMSE)

We have reduced the RMSE, but I still think it's a bit too high. 
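
One way to judge whether an RMSE is "too high" is to compare it with a naive baseline that always predicts the mean of the training targets. The sketch below does this on synthetic data (the generated `X_toy` and `y_toy` are assumptions and have nothing to do with the indicator dataset); `DummyRegressor` from scikit-learn provides the baseline.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data: a linear signal plus noise, for illustration only
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 1))
y_toy = 2.0 * X_toy[:, 0] + rng.normal(scale=0.5, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.33, random_state=3)

# Baseline: ignore the features and always predict the training mean
baseline = DummyRegressor(strategy='mean').fit(X_tr, y_tr)
model = LinearRegression().fit(X_tr, y_tr)

baseline_rmse = np.sqrt(mean_squared_error(y_te, baseline.predict(X_te)))
model_rmse = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))

print(baseline_rmse, model_rmse)
```

If a model's RMSE is only slightly below the baseline's, the features add little predictive value; the same comparison could be run with `X_train`, `X_test`, `y_train` and `y_test` from this notebook.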