# Project 1: WorldWide Deaths by tuberculosis

Developed by Marcelo José Rovai, 8 May 2018 

Based on Michel Wermelinger, 14 July 2015, edited 5 April 2016, updated 18 October and 20 December 2017 

This is the project notebook for the first part of The Open University's _Learn to code for Data Analysis_ course.

In 2000, the United Nations set eight Millennium Development Goals (MDGs) to reduce poverty and diseases, improve gender equality and environmental sustainability, etc. Each goal is quantified and time-bound, to be achieved by the end of 2015. Goal 6 is to have halted and started reversing the spread of HIV, malaria and tuberculosis (TB).
TB doesn't make headlines like Ebola, SARS (severe acute respiratory syndrome) and other epidemics, but is far deadlier. For more information, see the World Health Organisation (WHO) page <http://www.who.int/gho/tb/en/>.

<p><img src="TB_map.jpg?raw=true"></p>
Given the population and number of deaths due to TB during one year, the following questions will be answered: 

- What is the total, maximum, minimum and average number of deaths in that year?
- Which countries have the most and the least deaths?
- What is the death rate (deaths per 100,000 inhabitants) for each country?
- Which countries have the lowest and highest death rate?

The death rate allows for a better comparison of countries with widely different population sizes.

In [1]:
import warnings
warnings.simplefilter('ignore', FutureWarning)

from pandas import *

## The data

The data consists of the total population and the total number of deaths due to TB (excluding HIV) in 2013 for countries all over the world. 

The data was taken in July 2015 from <http://apps.who.int/gho/data/node.main.POP107?lang=en> (population) and <http://apps.who.int/gho/data/node.main.1317?lang=en> (deaths). The uncertainty bounds of the number of deaths were ignored.

The data was collected into an Excel file which should be in the same folder as this notebook.

In [32]:
! ls # List files in the working directory

Exercises_1.ipynb                  WHO POP TB all.xls
GMAPS example.ipynb                WHO POP TB some.xls
TB deaths all world_MJR.ipynb      WHO_POP_TB_all_country_code.xlsx
TB worldwide deaths 2013.html      geo
TB_worldwide_deaths_2013.html      ~$WHO_POP_TB_all_country_code.xlsx
Untitled.ipynb


In [33]:
# Open data for all world
data = read_excel('WHO_POP_TB_all_country_code.xlsx')
data.shape # 194 countries (rows) and 5 columns

(194, 5)

In [45]:
data.head()

Unnamed: 0,Region,Country_code,Country,Population (1000s),TB deaths
0,Eastern Mediterranean,AFG,Afghanistan,32526.6,13000
1,Africa,AGO,Angola,25022.0,15000
2,Europe,ALB,Albania,2896.7,8
3,Europe,AND,Andorra,70.5,0
4,Eastern Mediterranean,ARE,United Arab Emirates,9157.0,38


## The range of the problem

The column of interest is the last one.

In [46]:
tbDeaths = data['TB deaths']

The total number of deaths in 2013 is:

In [47]:
totalDeathsWW = int(tbDeaths.sum())
totalDeathsWW

1359455

The largest and smallest number of deaths in a single country are:

In [48]:
maxDeathsWW = int(tbDeaths.max())
maxDeathsWW

455000

In [49]:
minDeathsWW = int(tbDeaths.min())
minDeathsWW

0

From 0 to almost half a million deaths is a huge range. The average number of deaths, over all countries in the data, can give a better idea of the seriousness of the problem in each country.
The average can be computed as the mean or the median. Given the wide range of deaths, the median is probably a more sensible average measure.

In [50]:
meanDeathsWW = int(tbDeaths.mean())
meanDeathsWW

7007

In [51]:
medianDeathsWW = int(tbDeaths.median())
medianDeathsWW

385

The median is far lower than the mean. This indicates that some of the countries had a very high number of TB deaths in 2013, pushing the value of the mean up.
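
The effect of a single extreme value on the two averages can be seen on a handful of numbers. The sketch below is only an illustration: it uses the five death counts shown by `data.head()` plus the maximum of 455000, not the full data set, and the name `sample` is introduced here for the example.

```python
from pandas import Series

# Death counts of the five countries shown by data.head(), plus the maximum (455000).
sample = Series([13000, 15000, 8, 0, 38, 455000])

sample_mean = sample.mean()
sample_median = sample.median()

# Count how many of the six values lie above the mean.
above_mean = int((sample > sample_mean).sum())

print('mean:', round(sample_mean))
print('median:', round(sample_median))
print('values above the mean:', above_mean)
```

Only the extreme value lies above the mean in this sample, which is why the median is usually the more representative average for data this skewed.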

## The most affected

To see the most affected countries, the table is sorted in descending order by the last column, which puts those countries in the first rows.

In [52]:
tbDeathsSorted = data.sort_values('TB deaths', ascending=False)
tbDeathsSorted.head(10)

Unnamed: 0,Region,Country_code,Country,Population (1000s),TB deaths
78,South-East Asia,IND,India,1300000.0,455000
77,South-East Asia,IDN,Indonesia,257564.0,119000
126,Africa,NGA,Nigeria,182202.0,97000
15,South-East Asia,BGD,Bangladesh,160996.0,76000
32,Western Pacific,CHN,China,1400000.0,54000
35,Africa,COD,Democratic Republic of the Congo,77266.8,49000
135,Eastern Mediterranean,PAK,Pakistan,188925.0,46000
56,Africa,ETH,Ethiopia,99390.8,34000
89,Africa,KEN,Kenya,46050.3,31000
179,Africa,TZA,United Republic of Tanzania,53470.4,30000
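
When only the top rows are needed, pandas also offers `nlargest`, which should give the same rows as sorting in descending order and taking the head. The sketch below assumes a small hand-typed table (`rows`, a name made up for this example) with four rows copied from the outputs in this notebook.

```python
from pandas import DataFrame

# Four rows copied from the tables in this notebook.
rows = DataFrame({
    'Country': ['Afghanistan', 'India', 'Albania', 'Indonesia'],
    'TB deaths': [13000, 455000, 8, 119000],
})

# nlargest(n, column) returns the n rows with the largest values, in descending order.
top2 = rows.nlargest(2, 'TB deaths')

# The same selection, written as a sort followed by head().
same = rows.sort_values('TB deaths', ascending=False).head(2)

print(top2)
```

The sorted table is still the better choice in this notebook, because both its head and its tail are used.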


For comparison, the least affected countries:

In [53]:
tbDeathsSorted.tail(10)

Unnamed: 0,Region,Country_code,Country,Population (1000s),TB deaths
117,Europe,MNE,Montenegro,625.8,1
104,Europe,LUX,Luxembourg,567.1,0
128,Western Pacific,NIU,Niue,1.6,0
3,Europe,AND,Andorra,70.5,0
107,Europe,MCO,Monaco,37.7,0
37,Western Pacific,COK,Cook Islands,20.8,0
82,Europe,ISL,Iceland,329.4,0
7,Americas,ATG,Antigua and Barbuda,91.8,0
70,Americas,GRD,Grenada,106.8,0
156,Europe,SMR,San Marino,31.8,0


The table raises the possibility that a large number of deaths may be partly due to a large population. To compare the countries on an equal footing, the death rate per 100,000 inhabitants is computed.

In [54]:
population = data['Population (1000s)']
data['TB deaths (per 100,000)'] = tbDeaths * 100 / population
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 194 entries, 0 to 193
Data columns (total 6 columns):
Region                     194 non-null object
Country_code               194 non-null object
Country                    194 non-null object
Population (1000s)         194 non-null float64
TB deaths                  194 non-null int64
TB deaths (per 100,000)    194 non-null float64
dtypes: float64(2), int64(1), object(3)
memory usage: 9.2+ KB
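
The factor of 100 in the cell above comes from the units: the population column is in thousands, so deaths / (population × 1000) × 100,000 simplifies to deaths × 100 / population. A minimal sketch of that arithmetic, using Afghanistan's row from `data.head()`; the helper `death_rate_per_100k` is a name introduced here for illustration only.

```python
def death_rate_per_100k(deaths, population_thousands):
    """Deaths per 100,000 inhabitants, with the population given in thousands."""
    inhabitants = population_thousands * 1000
    return deaths / inhabitants * 100_000

# Afghanistan's row from data.head(): 13000 deaths, population 32526.6 thousand.
explicit = death_rate_per_100k(13000, 32526.6)
shortcut = 13000 * 100 / 32526.6  # the form used in the notebook

print(round(explicit, 2), round(shortcut, 2))
```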


In [55]:
tbRate = data.sort_values('TB deaths (per 100,000)', ascending=False)
tbRate.head(10)

Unnamed: 0,Region,Country_code,Country,Population (1000s),TB deaths,"TB deaths (per 100,000)"
157,Eastern Mediterranean,SOM,Somalia,10787.1,11000,101.973654
173,South-East Asia,TLS,Timor-Leste,1184.8,1000,84.402431
119,Africa,MOZ,Mozambique,27977.9,19000,67.910744
89,Africa,KEN,Kenya,46050.3,31000,67.317694
67,Africa,GNB,Guinea-Bissau,1844.3,1200,65.065336
61,Africa,GAB,Gabon,1725.3,1100,63.757028
142,South-East Asia,PRK,Democratic People's Republic of Korea,25155.3,16000,63.604886
35,Africa,COD,Democratic Republic of the Congo,77266.8,49000,63.416629
154,Africa,SLE,Sierra Leone,6453.2,3900,60.435133
1,Africa,AGO,Angola,25022.0,15000,59.947246


In [56]:
tbRate.tail(10)

Unnamed: 0,Region,Country_code,Country,Population (1000s),TB deaths,"TB deaths (per 100,000)"
43,Europe,CYP,Cyprus,1165.3,1,0.085815
3,Europe,AND,Andorra,70.5,0,0.0
107,Europe,MCO,Monaco,37.7,0,0.0
128,Western Pacific,NIU,Niue,1.6,0,0.0
70,Americas,GRD,Grenada,106.8,0,0.0
7,Americas,ATG,Antigua and Barbuda,91.8,0,0.0
82,Europe,ISL,Iceland,329.4,0,0.0
104,Europe,LUX,Luxembourg,567.1,0,0.0
156,Europe,SMR,San Marino,31.8,0,0.0
37,Western Pacific,COK,Cook Islands,20.8,0,0.0


## Map of the worldwide TB death rate

### Using Folium Library for Geographic Overlays

In [57]:
import folium

### Country coordinates for plotting

Source: <https://github.com/python-visualization/folium/blob/master/examples/data/world-countries.json>
- Download the file "world-countries.json" and save it in a new subdirectory named "geo".

In [58]:
country_geo = 'geo/world-countries.json'

### Set up the data for plotting
Create a data frame with just the country codes and the values we want plotted.

In [59]:
plot_data = data[['Country_code','TB deaths (per 100,000)']]
plot_data.head()

Unnamed: 0,Country_code,"TB deaths (per 100,000)"
0,AFG,39.967288
1,AGO,59.947246
2,ALB,0.276176
3,AND,0.0
4,ARE,0.414983


In [68]:
# Set up a folium map at a low zoom level, so that the whole world is visible
tb_map = folium.Map(location=[20, 10], zoom_start=1.5)

# A choropleth binds a pandas DataFrame to GeoJSON geometries.
# folium.Choropleth replaces the older Map.choropleth() method,
# which newer folium releases no longer provide.
folium.Choropleth(
    geo_data=country_geo,
    data=plot_data,
    columns=['Country_code', 'TB deaths (per 100,000)'],
    key_on='feature.id',
    fill_color='YlOrRd',
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name='TB deaths (per 100,000)').add_to(tb_map)
tb_map.save('TB_map.html')
tb_map
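
The choropleth matches each row to a geometry through `key_on='feature.id'`, so a country whose code has no feature with the same `id` cannot be coloured from the data. A quick check of which codes lack a geometry might look like the sketch below. It uses a tiny hand-written stand-in (`geo`) instead of the real file, so the result says nothing about the actual contents of "world-countries.json"; with the real file, `geo` would be loaded with `json.load`.

```python
# A tiny stand-in for the GeoJSON file; geometries are omitted for brevity.
geo = {
    'type': 'FeatureCollection',
    'features': [
        {'type': 'Feature', 'id': 'AFG', 'properties': {'name': 'Afghanistan'}},
        {'type': 'Feature', 'id': 'AGO', 'properties': {'name': 'Angola'}},
        {'type': 'Feature', 'id': 'ALB', 'properties': {'name': 'Albania'}},
    ],
}

# Country codes from the first rows of plot_data.
codes = ['AFG', 'AGO', 'ALB', 'AND', 'ARE']

# Collect the ids present in the geometry file, then list the codes without a match.
geo_ids = {feature['id'] for feature in geo['features']}
unmatched = [code for code in codes if code not in geo_ids]

print(unmatched)
```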

## Summary

In [61]:
print("\nWorldwide general numbers\n")

print(" - Total number of countries with data for 2013:", len(data.index))
print(" - Total number of TB deaths (excluding HIV)   :", totalDeathsWW)
print(" - Max number of deaths in a single country    :", maxDeathsWW)
print(" - Min number of deaths in a single country    :", minDeathsWW)
print(" - Mean number of deaths                       :", meanDeathsWW)
print(" - Median number of deaths                     :", medianDeathsWW)


Worldwide general numbers

 - Total number of countries with data for 2013: 194
 - Total number of TB deaths (excluding HIV)   : 1359455
 - Max number of deaths in a single country    : 455000
 - Min number of deaths in a single country    : 0
 - Mean number of deaths                       : 7007
 - Median number of deaths                     : 385


## Conclusions 

Considering data from 194 countries in 2013, the world suffered around 1.4 million deaths due to TB (these figures exclude deaths due to HIV). In absolute numbers, deaths per country range from zero (for example, San Marino) to 455,000 (India). The mean is around 7,000 deaths per country, but a closer look shows that half of the countries had fewer than 400 deaths, which means that the deaths are concentrated in a small number of countries.

### Top 10 countries by number of TB deaths:

In [62]:
tbDeathsSorted[['Country', 'TB deaths']].head(10)

Unnamed: 0,Country,TB deaths
78,India,455000
77,Indonesia,119000
126,Nigeria,97000
15,Bangladesh,76000
32,China,54000
35,Democratic Republic of the Congo,49000
135,Pakistan,46000
56,Ethiopia,34000
89,Kenya,31000
179,United Republic of Tanzania,30000
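
How concentrated the deaths are can be put into a single number: the share of the worldwide total accounted for by these ten countries. The sketch below simply retypes the figures from the table above and the total computed earlier; the variable names are introduced here for the example.

```python
# Figures copied from the tables in this notebook.
top10_deaths = [455000, 119000, 97000, 76000, 54000,
                49000, 46000, 34000, 31000, 30000]
total_deaths = 1359455

# Fraction of all recorded TB deaths that occurred in the ten most affected countries.
share = sum(top10_deaths) / total_deaths

print(f'Top 10 countries: {share:.1%} of all TB deaths')
```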


### Bottom 10 countries by number of TB deaths:

In [63]:
tbDeathsSorted[['Country', 'TB deaths']].tail(10)

Unnamed: 0,Country,TB deaths
117,Montenegro,1
104,Luxembourg,0
128,Niue,0
3,Andorra,0
107,Monaco,0
37,Cook Islands,0
82,Iceland,0
7,Antigua and Barbuda,0
70,Grenada,0
156,San Marino,0


However, taking population size into account, the least affected countries are still those with zero deaths, such as San Marino, but the most affected is Somalia, an East African country in the Horn of Africa, near Ethiopia, with more than 100 deaths per 100,000 inhabitants.

### Top 10 countries regarding TB deaths per 100,000 inhabitants:

In [65]:
tbRate[['Country', 'TB deaths (per 100,000)']].head(10)

Unnamed: 0,Country,"TB deaths (per 100,000)"
157,Somalia,101.973654
173,Timor-Leste,84.402431
119,Mozambique,67.910744
89,Kenya,67.317694
67,Guinea-Bissau,65.065336
61,Gabon,63.757028
142,Democratic People's Republic of Korea,63.604886
35,Democratic Republic of the Congo,63.416629
154,Sierra Leone,60.435133
1,Angola,59.947246


### Bottom 10 countries by TB deaths per 100,000 inhabitants:

In [66]:
tbRate[['Country', 'TB deaths (per 100,000)']].tail(10)

Unnamed: 0,Country,"TB deaths (per 100,000)"
43,Cyprus,0.085815
3,Andorra,0.0
107,Monaco,0.0
128,Niue,0.0
70,Grenada,0.0
7,Antigua and Barbuda,0.0
82,Iceland,0.0
104,Luxembourg,0.0
156,San Marino,0.0
37,Cook Islands,0.0


<p><img src="TB_map.jpg?raw=true"></p>

Looking at the map, it is clear that Central Africa and South-East Asia have the highest concentration of TB deaths per 100,000 inhabitants.