<a href="https://colab.research.google.com/github/Antony-gitau/probabilistic_AI_playgraound/blob/main/bayesian_linear_regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook, we work through an example of Bayesian linear regression using Pyro, a probabilistic programming library.

We want to confirm that terrain ruggedness is associated with poorer economic performance outside Africa but has the reverse effect on income for African countries, which is the conclusion of this [paper](https://diegopuga.org/papers/rugged.pdf).

*We are following an example from the Pyro website.*
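
The hypothesis above amounts to a linear regression with an interaction term. As a sketch, assuming the form used in the Pyro tutorial this notebook follows (the symbol names are our own choice for illustration), the model for log GDP per capita can be written as:

```latex
\mu = a + b_a \,\mathrm{africa} + b_r \,\mathrm{rugged} + b_{ar} \,(\mathrm{africa} \times \mathrm{rugged}),
\qquad
\log(\mathrm{GDP}) \sim \mathcal{N}(\mu, \sigma)
```

Under this form, the slope of log GDP with respect to ruggedness is $b_r$ outside Africa (where the indicator is 0) and $b_r + b_{ar}$ inside Africa (where it is 1). The claim we want to check would then correspond to $b_r < 0$ and $b_r + b_{ar} > 0$.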

In [11]:
%%capture
!pip install -q --upgrade pyro-ppl torch

import pandas as pd
import numpy as np

import pyro
import pyro.distributions as dist



Data:
We will use the rugged dataset, from which we only need three things:
1. Ruggedness index (`rugged`): quantifies the terrain ruggedness of each country.
2. Africa indicator (`cont_africa`): whether a nation is in Africa (1) or not (0).
3. GDP per capita (`rgdppc_2000`): the real GDP per capita of each nation for the year 2000.

In [14]:
data_url = "https://d2hg8soec8ck9v.cloudfront.net/datasets/rugged_data.csv"

# The CSV file is encoded as Latin-1 (ISO-8859-1), so we tell pandas to decode it with that encoding
rugged_data = pd.read_csv(data_url, encoding="ISO-8859-1")
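
To see why the `encoding` argument matters, here is a minimal sketch using only the standard library. The string is just an example of text with a non-ASCII character; we have not checked which entries in the file actually contain such characters.

```python
# Illustrative string with a non-ASCII character (an assumption, not read from the file).
name = "Côte d'Ivoire"

# Latin-1 (ISO-8859-1) stores "ô" as the single byte 0xF4.
raw = name.encode("latin-1")

# Decoding with the matching codec round-trips the text.
decoded = raw.decode("ISO-8859-1")

# UTF-8 rejects these bytes: 0xF4 must be followed by continuation bytes, and "t" is not one.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(decoded, utf8_ok)
```

A file that contains such bytes would likely fail to load with the default UTF-8 decoding, which is why the encoding is passed explicitly.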

In [15]:
rugged_data.head()

Unnamed: 0,isocode,isonum,country,rugged,rugged_popw,rugged_slope,rugged_lsd,rugged_pc,land_area,lat,...,africa_region_w,africa_region_e,africa_region_c,slave_exports,dist_slavemkt_atlantic,dist_slavemkt_indian,dist_slavemkt_saharan,dist_slavemkt_redsea,pop_1400,european_descent
0,ABW,533,Aruba,0.462,0.38,1.226,0.144,0.0,18.0,12.508,...,0,0,0,0.0,,,,,614.0,
1,AFG,4,Afghanistan,2.518,1.469,7.414,0.72,39.004,65209.0,33.833,...,0,0,0,0.0,,,,,1870829.0,0.0
2,AGO,24,Angola,0.858,0.714,2.274,0.228,4.906,124670.0,-12.299,...,0,0,1,3610000.0,5.669,6.981,4.926,3.872,1223208.0,2.0
3,AIA,660,Anguilla,0.013,0.01,0.026,0.006,0.0,9.0,18.231,...,0,0,0,0.0,,,,,,
4,ALB,8,Albania,3.427,1.597,10.451,1.006,62.133,2740.0,41.143,...,0,0,0,0.0,,,,,200000.0,100.0


In [17]:
rugged_data.columns

Index(['isocode', 'isonum', 'country', 'rugged', 'rugged_popw', 'rugged_slope',
       'rugged_lsd', 'rugged_pc', 'land_area', 'lat', 'lon', 'soil', 'desert',
       'tropical', 'dist_coast', 'near_coast', 'gemstones', 'rgdppc_2000',
       'rgdppc_1950_m', 'rgdppc_1975_m', 'rgdppc_2000_m', 'rgdppc_1950_2000_m',
       'q_rule_law', 'cont_africa', 'cont_asia', 'cont_europe', 'cont_oceania',
       'cont_north_america', 'cont_south_america', 'legor_gbr', 'legor_fra',
       'legor_soc', 'legor_deu', 'legor_sca', 'colony_esp', 'colony_gbr',
       'colony_fra', 'colony_prt', 'colony_oeu', 'africa_region_n',
       'africa_region_s', 'africa_region_w', 'africa_region_e',
       'africa_region_c', 'slave_exports', 'dist_slavemkt_atlantic',
       'dist_slavemkt_indian', 'dist_slavemkt_saharan', 'dist_slavemkt_redsea',
       'pop_1400', 'european_descent'],
      dtype='object')

In [18]:
# Let's pick the three columns we care about (as stated earlier) and create a DataFrame
ruggedness_dataframe = rugged_data[["cont_africa","rugged","rgdppc_2000"]]
ruggedness_dataframe.head()

Unnamed: 0,cont_africa,rugged,rgdppc_2000
0,0,0.462,
1,0,2.518,
2,1,0.858,1794.729
3,0,0.013,
4,0,3.427,3703.113


In [19]:
ruggedness_dataframe.isnull().sum()

cont_africa     0
rugged          0
rgdppc_2000    64
dtype: int64
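
One way to drop the rows with missing GDP is a boolean mask built with `np.isfinite`. The sketch below uses a toy frame with made-up values (only the column names mirror the notebook) and compares the mask with `dropna`, which is an alternative we mention for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; column names mirror the notebook's.
toy = pd.DataFrame({
    "cont_africa": [0, 1, 0, 1],
    "rugged": [0.5, 0.9, 2.5, 1.2],
    "rgdppc_2000": [np.nan, 1800.0, np.nan, 950.0],
})

# True where GDP is a finite number, False where it is NaN (or +/-inf).
finite_mask = np.isfinite(toy["rgdppc_2000"])

# Boolean indexing keeps only the rows where the mask is True.
# .copy() returns an independent frame, which avoids SettingWithCopyWarning
# on later column assignments in pandas versions that emit it.
kept = toy[finite_mask].copy()

# dropna on the same column selects the same rows for this toy data.
same = kept.equals(toy.dropna(subset=["rgdppc_2000"]))

print(int(finite_mask.sum()), int((~finite_mask).sum()), list(kept.index), same)
```

Note that the surviving rows keep their original index labels, which is why the filtered frame in this notebook starts at index 2 rather than 0.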

In [32]:
# Let's remove the rows with NaN values in the rgdppc_2000 column.
# np.isfinite is True for real numbers and False for NaN, so this mask marks the rows to keep.
finite_mask = np.isfinite(ruggedness_dataframe.rgdppc_2000)
sum_true = np.sum(finite_mask)
sum_false = np.sum(~finite_mask)
print("Trues ", sum_true)
print("Falses ", sum_false)

# This is the dataframe before we remove the NaN values in the rgdppc_2000 column
print(ruggedness_dataframe.info())

# .copy() makes the filtered frame independent, avoiding a SettingWithCopyWarning on later assignments
ruggedness_dataframe = ruggedness_dataframe[finite_mask].copy()
# This is after we drop the rows
print(ruggedness_dataframe.info())
ruggedness_dataframe.head()

Trues  170
Falses  0
<class 'pandas.core.frame.DataFrame'>
Int64Index: 170 entries, 2 to 233
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   cont_africa  170 non-null    int64  
 1   rugged       170 non-null    float64
 2   rgdppc_2000  170 non-null    float64
dtypes: float64(2), int64(1)
memory usage: 5.3 KB
None
<class 'pandas.core.frame.DataFrame'>
Int64Index: 170 entries, 2 to 233
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   cont_africa  170 non-null    int64  
 1   rugged       170 non-null    float64
 2   rgdppc_2000  170 non-null    float64
dtypes: float64(2), int64(1)
memory usage: 5.3 KB
None


Unnamed: 0,cont_africa,rugged,rgdppc_2000
2,1,0.858,1794.729
4,0,3.427,3703.113
7,0,0.769,20604.46
8,0,0.775,12173.68
9,0,2.688,2421.985

Note that the output above was captured on a re-run of the cell, after the rows with missing GDP had already been dropped. That is why it reports 0 `False` values and 170 entries both before and after filtering. On a fresh run, the mask contains 64 `False` values, matching the missing-value count reported earlier.


In [34]:
# Let's log-transform the rgdppc_2000 column: GDP per capita is highly skewed, so we work on the log scale
ruggedness_dataframe['rgdppc_2000'] = np.log(ruggedness_dataframe['rgdppc_2000'])
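
As a small illustration of what the log transform does, the sketch below applies it to the five GDP values visible in the `head()` output earlier. It is only meant to show the effect on scale, not to reproduce the notebook's full column.

```python
import numpy as np

# GDP per capita values taken from the head() output shown earlier.
gdp = np.array([1794.729, 3703.113, 20604.46, 12173.68, 2421.985])
log_gdp = np.log(gdp)

# On the raw scale the largest value is more than 11 times the smallest.
raw_ratio = gdp.max() / gdp.min()

# On the log scale that ratio becomes a difference: log(a / b) = log(a) - log(b).
log_gap = log_gdp.max() - log_gdp.min()

# The transform is invertible: exp recovers the original values.
recovered = np.exp(log_gdp)

print(round(raw_ratio, 2), round(log_gap, 2), np.allclose(recovered, gdp))
```

Because ratios turn into differences, a regression coefficient on the log scale can be read as an approximate proportional change in GDP rather than an absolute one.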

Let's get started with the Bayesian linear regression.