# TM351 Data Management & Analysis


#### TMA02 Preparation Tutorial
## 24th January 2026

In [1]:
# This cell imports the libraries needed for the tutorial

import pandas as pd
import folium

##### This tutorial is made up of several notebooks:

* `TMA02 Preparation 25b.ipynb`: (this one), which includes a review of TMA01.
* `TMA02 Overview 25b.ipynb`: a review of TMA02
* `MongoDB.ipynb`: (optional) includes extra examples of using MongoDB

### TMA01 Review

By now you will have completed TMA01. Let us look at some issues that were highlighted and that are helpful to review when preparing for TMA02 and the EMA.

Please note, this notebook will not fully clean the datasets used in TMA01, but will point out the key things to address.

TMA01 used several datasets:

Q1:
* Bats Conservation Trust dataset: `records-2025-05-12/records-2025-05-12.csv`
* Towns and cities in the world which have a population of more than 5000 people: `cities5000.txt`

Q2:

* Tuberculosis data: `WHO_TB_data.csv`
* GINI coefficient and wealth ownership data: `API_11_DS2_en_csv_v2_88104` (folder containing several files)

All were found in the data directory.


In the TMAs and EMA, you **must** show that you have investigated the files for any problems, even if it turns out there are none.

These can include:
- checking for issues in the files
- checking for issues in the data
- checking the common columns if you are joining datasets

Any thoughts on what you can do to carry out these checks?

Remember the pipeline you should go through:

!["Data pipeline"](images/tm351_pt1_f04.eps.jpg)

## Checking the data

## Cities dataset

Even if you have had a quick look in Excel or a text editor at a `CSV` or `txt` file, you should still review it in the notebook, so that you can show your reader that the file has no problems.

An easy way to do this is to review the file before importing it using the operating system's `head` and `tail` commands.
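Where the `head` and `tail` commands are not available, much the same raw view can be had from Python itself. The following is a minimal sketch under that assumption; the helper `peek` and the throwaway file `example.txt` are hypothetical and are not part of the TMA materials.

```python
from collections import deque
from itertools import islice
from pathlib import Path
import tempfile


def peek(path, n=3, encoding="utf-8"):
    """Return the first n and the last n raw lines of a text file."""
    with open(path, encoding=encoding) as f:
        first = [line.rstrip("\n") for line in islice(f, n)]
    with open(path, encoding=encoding) as f:
        # a deque with maxlen keeps only the final n lines as the file streams by
        last = [line.rstrip("\n") for line in deque(f, maxlen=n)]
    return first, last


# build a small tab-separated file with no header row, purely for illustration
example_path = Path(tempfile.mkdtemp()) / "example.txt"
example_path.write_text(
    "\n".join(f"{i}\tplace{i}" for i in range(10)) + "\n", encoding="utf-8"
)

first_lines, last_lines = peek(example_path, n=2)
print(first_lines)
print(last_lines)
```

Printing the raw lines as a list makes separators such as `\t` visible, which helps when deciding on the `sep` value to use later.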

For instance, the cities5000.txt file in particular had a few issues:

In [2]:
!head data/cities5000.txt

3039163	Sant Julià de Lòria	Sant Julia de Loria	San Julia,San Julià,Sant Julia de Loria,Sant Julià de Lòria,Sant-Zhulija-de-Lorija,sheng hu li ya-de luo li ya,Сант-Жулия-де-Лория,サン・ジュリア・デ・ロリア教区,圣胡利娅-德洛里亚,圣胡利娅－德洛里亚	42.46372	1.49129	P	PPLA	AD		06				8022		921	Europe/Andorra	2013-11-23
3039678	Ordino	Ordino	Ordino,ao er di nuo,orudino jiao qu,Ордино,オルディノ教区,奥尔迪诺	42.55623	1.53319	P	PPLA	AD		05				3066		1296	Europe/Andorra	2018-10-26
3040051	les Escaldes	les Escaldes	Ehskal'des-Ehndzhordani,Escaldes,Escaldes-Engordany,Les Escaldes,esukarudesu=engorudani jiao qu,lai sai si ka er de-en ge er da,Эскальдес-Энджордани,エスカルデス＝エンゴルダニ教区,萊塞斯卡爾德-恩戈爾達,萊塞斯卡爾德－恩戈爾達	42.50729	1.53414	P	PPLA	AD		08				15853		1033	Europe/Andorra	2024-06-20
3040132	la Massana	la Massana	La Macana,La Massana,La Maçana,La-Massana,la Massana,ma sa na,Ла-Массана,ラ・マサナ教区,马萨纳	42.54499	1.51483	P	PPLA	AD		04				7211		1245	Europe/Andorra	2008-10-15
3040686	Encamp	Encamp	Ehnkam,Encamp,en kan pu,enkanpu jiao qu,Энкам,エンカンプ教区,恩坎普	42.53

In [3]:
# note the cities file has been reduced to 100 records to avoid uploading large datasets
!tail data/cities5000.txt

2633476	Wroughton	Wroughton		51.52411	-1.79559	P	PPL	GB		ENG	N9	00HX015		6474		118	Europe/London	2017-06-12
2633485	Wrexham	Wrexham	Reksamas,Reksem,Reksum,Rexam,Wrecsam,Wreksam,Wrexham,legseom,lei ke si han mu,rekusamu,wrksam,Ρέξαμ,Рексем,Рексъм,Ռեքսհեմ,רקסהאם,ورکسام,レクサム,雷克斯漢姆,렉섬	53.04664	-2.99132	P	PPLA2	GB		WLS	Z4	00NL007		65692		87	Europe/London	2017-06-12
2633511	Wotton-under-Edge	Wotton-under-Edge	Wotton-under-Edge,awtwn-andr-aj,اوتون-آندر-اج	51.63242	-2.34512	P	PPL	GB		ENG	E6	23UF	23UF052	5627		82	Europe/London	2019-10-02
2633521	Worthing	Worthing	Vorting,Vortingo,Worthing,wajingu,wwrtyng,Вортинг,وورتینگ,ワージング	50.81795	-0.37538	P	PPL	GB		ENG	P6	45UH		99110		7	Europe/London	2021-08-08
2633551	Worksop	Worksop	Uehrksop,Uurksop,Worksop,wwrksap,Уърксоп,Уэрксоп,وورکساپ	53.30182	-1.12404	P	PPL	GB		ENG	J9	37UC		43252		46	Europe/London	2017-06-12
2633553	Workington	Workington	Uurkingtun,wo jin dun,wwrkyngtwn,Уъркингтън,وورکینگتون,沃金頓	54.6425	-3.54413	P	PPL	GB		ENG	C9	E06000063	16UB061	27

What are the issues here?

In [4]:
!head -3 data/cities5000.txt

3039163	Sant Julià de Lòria	Sant Julia de Loria	San Julia,San Julià,Sant Julia de Loria,Sant Julià de Lòria,Sant-Zhulija-de-Lorija,sheng hu li ya-de luo li ya,Сант-Жулия-де-Лория,サン・ジュリア・デ・ロリア教区,圣胡利娅-德洛里亚,圣胡利娅－德洛里亚	42.46372	1.49129	P	PPLA	AD		06				8022		921	Europe/Andorra	2013-11-23
3039678	Ordino	Ordino	Ordino,ao er di nuo,orudino jiao qu,Ордино,オルディノ教区,奥尔迪诺	42.55623	1.53319	P	PPLA	AD		05				3066		1296	Europe/Andorra	2018-10-26
3040051	les Escaldes	les Escaldes	Ehskal'des-Ehndzhordani,Escaldes,Escaldes-Engordany,Les Escaldes,esukarudesu=engorudani jiao qu,lai sai si ka er de-en ge er da,Эскальдес-Энджордани,エスカルデス＝エンゴルダニ教区,萊塞斯卡爾德-恩戈爾達,萊塞斯卡爾德－恩戈爾達	42.50729	1.53414	P	PPLA	AD		08				15853		1033	Europe/Andorra	2024-06-20


Most people probably spotted there was an issue when they tried to import the data.

For example, if you just tried to load it without checking, you would have got:

In [5]:
cities_df=pd.read_csv('data/cities5000.txt')

ParserError: Error tokenizing data. C error: Expected 10 fields in line 7, saw 33


There are two issues in this file:

* no header
* data is not comma separated and appears to be tab separated instead

The latter is easily resolved using the separator (`sep`) parameter of the `read_csv` function.

What to do about the missing header?


In [6]:
# tell it there is no header
cities_df=pd.read_csv('data/cities5000.txt', sep='\t', encoding='utf-8', header=None)
cities_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
0,3039163,Sant Julià de Lòria,Sant Julia de Loria,"San Julia,San Julià,Sant Julia de Loria,Sant J...",42.46372,1.49129,P,PPLA,AD,,6,,,,8022,,921,Europe/Andorra,2013-11-23
1,3039678,Ordino,Ordino,"Ordino,ao er di nuo,orudino jiao qu,Ордино,オルデ...",42.55623,1.53319,P,PPLA,AD,,5,,,,3066,,1296,Europe/Andorra,2018-10-26
2,3040051,les Escaldes,les Escaldes,"Ehskal'des-Ehndzhordani,Escaldes,Escaldes-Engo...",42.50729,1.53414,P,PPLA,AD,,8,,,,15853,,1033,Europe/Andorra,2024-06-20
3,3040132,la Massana,la Massana,"La Macana,La Massana,La Maçana,La-Massana,la M...",42.54499,1.51483,P,PPLA,AD,,4,,,,7211,,1245,Europe/Andorra,2008-10-15
4,3040686,Encamp,Encamp,"Ehnkam,Encamp,en kan pu,enkanpu jiao qu,Энкам,...",42.53474,1.58014,P,PPLA,AD,,3,,,,11223,,1257,Europe/Andorra,2018-10-26


In [7]:
# don't forget to check the end of the file too
cities_df.tail()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
111,2633553,Workington,Workington,"Uurkingtun,wo jin dun,wwrkyngtwn,Уъркингтън,وو...",54.6425,-3.54413,P,PPL,GB,,ENG,C9,E06000063,16UB061,27120,,20,Europe/London,2023-08-03
112,2633561,Worcester Park,Worcester Park,"Old Malden,Worcester Park",51.37992,-0.24445,P,PPL,GB,,ENG,GLA,N8,,16031,,24,Europe/London,2019-08-22
113,2633563,Worcester,Worcester,"Caerwrangon,City of Worcester,UWC,Ustur,Vigorn...",52.18935,-2.22001,P,PPLA2,GB,,ENG,Q4,47UE,,101659,,29,Europe/London,2023-02-25
114,2633571,Royal Wootton Bassett,Royal Wootton Bassett,"Royal Wootton Bassett,Wooton Bassett,Wootton B...",51.5419,-1.9045,P,PPL,GB,,ENG,P8,00HY254,,11265,,129,Europe/London,2017-06-12
115,2633586,Woolton,Woolton,"Woolton,hu er dun,uruton,wu er dun,wwltwn,וולט...",53.37401,-2.87,P,PPL,GB,,ENG,H8,,,12921,,62,Europe/London,2024-04-14


Now the column names are just numbers, which is not very meaningful.

This is where we need to look at what information there is about the data. The website will have information, but the contents of the file are helpfully described in the `geonames_readme.txt` file:

    The main 'geoname' table has the following fields :
    ---------------------------------------------------
    geonameid         : integer id of record in geonames database
    name              : name of geographical point (utf8) varchar(200)
    asciiname         : name of geographical point in plain ascii characters, varchar(200)
    alternatenames    : alternatenames, comma separated, ascii names automatically transliterated, convenience attribute from alternatename table, varchar(10000)
    latitude          : latitude in decimal degrees (wgs84)
    longitude         : longitude in decimal degrees (wgs84)
    feature class     : see http://www.geonames.org/export/codes.html, char(1)
    feature code      : see http://www.geonames.org/export/codes.html, varchar(10)
    country code      : ISO-3166 2-letter country code, 2 characters
    cc2               : alternate country codes, comma separated, ISO-3166 2-letter country code, 200 characters
    admin1 code       : fipscode (subject to change to iso code), see exceptions below, see file admin1Codes.txt for display names of this code; varchar(20)
    admin2 code       : code for the second administrative division, a county in the US, see file admin2Codes.txt; varchar(80) 
    admin3 code       : code for third level administrative division, varchar(20)
    admin4 code       : code for fourth level administrative division, varchar(20)
    population        : bigint (8 byte int) 
    elevation         : in meters, integer
    dem               : digital elevation model, srtm3 or gtopo30, average elevation of 3''x3'' (ca 90mx90m) or 30''x30'' (ca 900mx900m) area in meters, integer. srtm processed by cgiar/ciat.
    timezone          : the iana timezone id (see file timeZone.txt) varchar(40)
    modification date : date of last modification in yyyy-MM-dd format
    
The site also states that the file is in tab separated format (which is borne out by the OS `head` command), and that the encoding is utf-8, which is the default for `read_csv`.

Do always check the documentation if not sure about any of the options: 
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

The question only asks for certain fields to be used. Different approaches can be used to pull out the required columns:

* use the `usecols` parameter with the `read_csv()`
* import the whole file, then reshape the data frame, filtering out the required columns and renaming them


With small datasets either approach is fine. When working with huge amounts of data, which might be preferable?

Reminder of what we need to import: 

*Import the file cities5000.txt, and create a DataFrame named uk_cities_df which contains:*

- *a column `city_name` which contains the name of the town or city, and*
- *a column `population` which contains the city's population,*
- *columns `latitude` and `longitude` which contain the coordinates of the city.*

*The DataFrame should only contain those towns and cities in the United Kingdom. In the file `cities5000.txt`, towns and cities in the United Kingdom are represented by the two letter ISO code `GB`. (The ISO code GB covers the whole of the United Kingdom of Great Britain and Northern Ireland.)*

So we need the fields: 1: name, 4: latitude, 5: longitude, 14: population and 8: country code so we can retrieve just the UK towns and cities.
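The first approach (`usecols`) could be sketched as follows. This is an illustration only: it reads two sample rows (taken from the output above, with the alternate names shortened) from an in-memory string rather than from `data/cities5000.txt`, and the names `wanted`, `sample_cities_df` and `sample_uk_df` are made up for the example.

```python
import io

import pandas as pd

# two sample rows in the 19-field GeoNames layout (tab separated, no header)
rows = [
    ["3039678", "Ordino", "Ordino", "Ordino", "42.55623", "1.53319", "P", "PPLA",
     "AD", "", "05", "", "", "", "3066", "", "1296", "Europe/Andorra", "2018-10-26"],
    ["2633521", "Worthing", "Worthing", "Vorting", "50.81795", "-0.37538", "P", "PPL",
     "GB", "", "ENG", "P6", "45UH", "", "99110", "", "7", "Europe/London", "2021-08-08"],
]
sample_text = "\n".join("\t".join(row) for row in rows)

# column positions we want to keep, and the names to give them
wanted = {1: "city_name", 4: "latitude", 5: "longitude", 8: "country", 14: "population"}

# only the listed columns are parsed; the other fields are skipped
sample_cities_df = pd.read_csv(
    io.StringIO(sample_text), sep="\t", header=None, usecols=list(wanted)
).rename(columns=wanted)

# keep the GB rows, then drop the country column as it is no longer needed
sample_uk_df = sample_cities_df[sample_cities_df["country"] == "GB"].drop(
    columns="country"
)
print(sample_uk_df)
```

Because the unwanted columns are never loaded, this approach tends to use less memory on very large files.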

In [8]:
# For example, this is the second approach - after importing the file as above 
# We need the country at this stage, so we can keep only the GB rows afterwards
uk_cities_df=(cities_df
              .rename({1: 'city_name', 4: 'latitude', 5: 'longitude', 8: 'country', 14: 'population'}, axis='columns')
              .filter(['city_name', 'latitude', 'longitude', 'country', 'population'], axis='columns')
)

uk_cities_df.head()

Unnamed: 0,city_name,latitude,longitude,country,population
0,Sant Julià de Lòria,42.46372,1.49129,AD,8022
1,Ordino,42.55623,1.53319,AD,3066
2,les Escaldes,42.50729,1.53414,AD,15853
3,la Massana,42.54499,1.51483,AD,7211
4,Encamp,42.53474,1.58014,AD,11223


In [9]:
# quick check of the countries left
uk_cities_df["country"].unique()

array(['AD', 'AE', 'ZM', 'ZW', 'GB'], dtype=object)

Even if the data has no issues at all, do comment on this and show some data to prove this. Do not just say something along the lines of *"I looked at the data in Excel and it seemed fine".*

## Bats dataset

Looking at the bats data (note the version included in the download has been reduced to 100 rows due to file limits):

In [10]:
! head -3 data/records-2025-05-12/records-2025-05-12.csv

﻿NBN Atlas record ID,Occurrence ID,Licence,Rightsholder,Scientific name,Taxon author,Name qualifier,Common name,Species ID (TVK),Taxon Rank,Occurrence status,Start date,Start date day,Start date month,Start date year,End date,End date day,End date month,End date year,Locality,OSGR,Latitude (WGS84),Longitude (WGS84),Coordinate uncertainty (m),Verbatim depth,Recorder,Determiner,Individual count,Abundance,Abundance scale,Organism scope,Organism remarks,Sex,Life stage,Occurrence remarks,Identification verification status,Basis of record,Survey key,Dataset name,Dataset ID,Data provider,Data provider ID,Institution code,Kingdom,Phylum,Class,Order,Family,Genus,OSGR 100km,OSGR 10km,OSGR 2km,OSGR 1km,Country,State/Province,Vitality,public _ resolution _ in _ meters
f8cc13a7-45b5-409e-aa7e-401fcfd94f6e,9715,OGL,Bat Conservation Trust,Rhinolophus hipposideros,"(Bechstein, 1800)",,Lesser Horseshoe Bat,NHMSYS0000080177,species,present,05/06/1995,05,06,1995,,,,,,SN10,51.713143,-4.679273,7071.1,,Undi

In [11]:
! tail -3 data/records-2025-05-12/records-2025-05-12.csv

f9ee4eb6-6a97-4f3b-9111-dcf49b746e5a,36324,OGL,Bat Conservation Trust,Myotis nattereri,"(Kuhl, 1817)",,Natterer's Bat,NHMSYS0000080184,species,present,22/10/2019,22,10,2019,,,,,,TQ35,51.278158,-0.065855,7071.1,,Undisclosed,,,,,,,,,For Metadata go to https://registry.nbnatlas.org/public/showDataResource/dr945,Accepted,HumanObservation,,Hibernation Survey,dr945,Bat Conservation Trust,dp57,BCT,Animalia,Chordata,Mammalia,Chiroptera,Vespertilionidae,Myotis,TQ,TQ35,,,United Kingdom,England,,
f874aab3-81f9-466a-a098-2e55323ac295,30591,OGL,Bat Conservation Trust,Rhinolophus hipposideros,"(Bechstein, 1800)",,Lesser Horseshoe Bat,NHMSYS0000080177,species,present,01/10/1992,01,10,1992,,,,,,SJ11,52.725941,-3.260043,7071.1,,Undisclosed,,,,,,,,,For Metadata go to https://registry.nbnatlas.org/public/showDataResource/dr945,Accepted,HumanObservation,,Hibernation Survey,dr945,Bat Conservation Trust,dp57,BCT,Animalia,Chordata,Mammalia,Chiroptera,Rhinolophidae,Rhinolophus,SJ,SJ11,,,United Kingdom,Wales,,

The file is comma separated, has a single header row, and doesn't appear to have any additional information at the bottom of the file. The data can be imported using the pandas.read_csv function directly, since it defaults to a comma separator and a header row.

This shows that you have looked at the data and are aware of how it can be imported correctly.

### Excel files

The OS `head` and `tail` commands are not suitable for looking at Excel data. Instead, you could provide some screenshots of the file open in MS Excel to show whether there are any issues.

Things to watch out for:
* more than one worksheet
* extra data at the start and end of a worksheet

Excel data can be imported using the `pandas.read_excel` function. If you discover issues, do review the documentation to see what can be used to resolve this:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html

For example (version 2.3):
<pre>
pandas.read_excel(io, sheet_name=0, *, header=0, names=None, index_col=None, usecols=None, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skiprows=None, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, parse_dates=False, date_parser=<no_default>, date_format=None, thousands=None, decimal='.', comment=None, skipfooter=0, storage_options=None, dtype_backend=<no_default>, engine_kwargs=None)
</pre>


**Issues in the data**

The next stage after importing and reshaping appropriately is to look at the data for any issues.

Things to look out for:
* missing data
* outliers
* incorrect data

Again there may not be any issues, but do check and comment on what you have found.

The `describe()` and `info()` methods, or the `dtypes` attribute, can be used to do a quick check. You can then explore further if you spot anything out of the ordinary.

In [22]:
# extract just the UK cities
uk_cities_df=uk_cities_df[uk_cities_df["country"]=="GB"]
uk_cities_df=uk_cities_df.drop("country", axis=1)
uk_cities_df.head()

Unnamed: 0,city_name,latitude,longitude,population
100,Yeadon,53.86437,-1.68743,37379
101,Yaxley,52.51768,-0.25852,9174
102,Yatton,51.38839,-2.82353,10251
103,Yate,51.54074,-2.41839,34406
104,Yarm,54.50364,-1.35793,19184


In [23]:
uk_cities_df.describe()

Unnamed: 0,latitude,longitude,population
count,16.0,16.0,16.0
mean,52.380396,-1.785834,31562.75
std,1.257565,1.046164,31551.179239
min,50.81795,-3.54413,5459.0
25%,51.49018,-2.519675,9981.75
50%,51.910885,-1.850045,17607.5
75%,53.319868,-0.99628,38847.25
max,54.6425,-0.24445,101659.0


In [24]:
# this and the next command can be used to check if numeric fields are indeed numeric
uk_cities_df.dtypes

city_name      object
latitude      float64
longitude     float64
population      int64
dtype: object

In [25]:
uk_cities_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 16 entries, 100 to 115
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   city_name   16 non-null     object 
 1   latitude    16 non-null     float64
 2   longitude   16 non-null     float64
 3   population  16 non-null     int64  
dtypes: float64(2), int64(1), object(1)
memory usage: 640.0+ bytes


What might these highlight?
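Beyond these summaries, explicit checks for missing values and impossible coordinates can be written in a few lines. The sketch below uses a small made-up DataFrame (`check_df`, not the TMA data) with one missing population and one latitude outside the valid range. Note that a single missing value is enough to turn an integer column into `float64`, which is one of the things `dtypes` can reveal.

```python
import pandas as pd

# small made-up frame: one missing population and one impossible latitude
check_df = pd.DataFrame(
    {
        "city_name": ["A", "B", "C"],
        "latitude": [53.8, 152.5, 51.4],
        "longitude": [-1.7, -0.3, -2.8],
        "population": [37379, None, 10251],
    }
)

# number of missing values in each column
missing_counts = check_df.isna().sum()

# rows whose coordinates fall outside the valid latitude/longitude ranges
bad_coords = check_df[
    ~check_df["latitude"].between(-90, 90)
    | ~check_df["longitude"].between(-180, 180)
]

print(missing_counts)
print(bad_coords)
print(check_df["population"].dtype)
```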

## Question 2

The data for Q2 in particular had a number of issues.

### TB Data

In [26]:
# TB Data
!head -n 5 data/WHO_TB_data.csv

"Countries, territories and areas","Year","Number of incident tuberculosis cases","Incidence of tuberculosis (per 100 000 population per year)","Number of incident tuberculosis cases in children aged 0 - 14","Number of incident tuberculosis cases,  (HIV-positive cases)","Incidence of tuberculosis (per 100 000 population) (HIV-positive cases)"
"Afghanistan"," 2022","76000 [48000-105000]","185 [117-255]","15000 [8300-22000]","23 [11-41]","0.06 [0.03-0.10]"
"Afghanistan"," 2021","74000 [47000-103000]","185 [118-257]","","5 [1-13]","0.01 [0.00-0.03]"
"Afghanistan"," 2020","71000 [46000-101000]","183 [118-260]","","13 [4-26]","0.03 [0.01-0.07]"
"Afghanistan"," 2019","71000 [46000-102000]","189 [122-270]","","20 [0-76]","0.05 [0.00-0.20]"


In [27]:
!tail -n 5 data/WHO_TB_data.csv

"Zimbabwe"," 2004","74000 [25000-148000]","607 [208-1210]","","52000 [17000-106000]","426 [136-875]"
"Zimbabwe"," 2003","75000 [26000-148000]","617 [213-1230]","","53000 [17000-108000]","436 [141-893]"
"Zimbabwe"," 2002","74000 [30000-138000]","617 [247-1150]","","53000 [20000-102000]","442 [166-850]"
"Zimbabwe"," 2001","73000 [25000-147000]","617 [211-1230]","","54000 [17000-110000]","450 [146-920]"
"Zimbabwe"," 2000","72000 [20000-156000]","605 [166-1320]","","53000 [14000-118000]","449 [117-999]"


The data appears to be comma separated and has a header, so we can import it:

In [53]:
who_tb_df = pd.read_csv("data/WHO_TB_data.csv")
who_tb_df.head()

Unnamed: 0,"Countries, territories and areas",Year,Number of incident tuberculosis cases,Incidence of tuberculosis (per 100 000 population per year),Number of incident tuberculosis cases in children aged 0 - 14,"Number of incident tuberculosis cases, (HIV-positive cases)",Incidence of tuberculosis (per 100 000 population) (HIV-positive cases)
0,Afghanistan,2022,76000 [48000-105000],185 [117-255],15000 [8300-22000],23 [11-41],0.06 [0.03-0.10]
1,Afghanistan,2021,74000 [47000-103000],185 [118-257],,5 [1-13],0.01 [0.00-0.03]
2,Afghanistan,2020,71000 [46000-101000],183 [118-260],,13 [4-26],0.03 [0.01-0.07]
3,Afghanistan,2019,71000 [46000-102000],189 [122-270],,20 [0-76],0.05 [0.00-0.20]
4,Afghanistan,2018,69000 [45000-99000],189 [122-270],,19 [0-71],0.05 [0.00-0.19]


In [54]:
# drop the columns not required
who_tb_df=who_tb_df.drop(["Number of incident tuberculosis cases", "Number of incident tuberculosis cases in children aged 0 - 14", "Incidence of tuberculosis (per 100 000 population) (HIV-positive cases)"], axis=1)

# and pull out 2022 (take a copy so later column assignments do not raise SettingWithCopyWarning)
who_tb_df = who_tb_df.query("`Year`==2022").copy()


In [55]:
who_tb_df 

Unnamed: 0,"Countries, territories and areas",Year,Incidence of tuberculosis (per 100 000 population per year),"Number of incident tuberculosis cases, (HIV-positive cases)"
0,Afghanistan,2022,185 [117-255],23 [11-41]
23,Albania,2022,15 [13-18],2 [0-7]
46,Algeria,2022,51 [37-67],190 [98-310]
69,Andorra,2022,5.80 [4.90-6.70],
92,Angola,2022,333 [202-492],9100 [5600-14000]
...,...,...,...,...
4324,Venezuela (Bolivarian Republic of),2022,46 [34-58],1000 [380-2000]
4347,Viet Nam,2022,176 [121-251],4300 [2900-6100]
4370,Yemen,2022,48 [41-55],120 [65-180]
4393,Zambia,2022,295 [184-431],19000 [12000-28000]


We only want one numeric value, not the ranges, so need to extract the first value. One way to do this is to use the `split()` function, for example:

In [56]:
who_tb_df["Incidence of tuberculosis (per 100 000 population per year)"] = who_tb_df["Incidence of tuberculosis (per 100 000 population per year)"].apply(
    lambda x: x.split()[0]
)

In [57]:
who_tb_df.head()

Unnamed: 0,"Countries, territories and areas",Year,Incidence of tuberculosis (per 100 000 population per year),"Number of incident tuberculosis cases, (HIV-positive cases)"
0,Afghanistan,2022,185.0,23 [11-41]
23,Albania,2022,15.0,2 [0-7]
46,Algeria,2022,51.0,190 [98-310]
69,Andorra,2022,5.8,
92,Angola,2022,333.0,9100 [5600-14000]


The same can be done for the `Number of incident tuberculosis cases, (HIV-positive cases)` column too.

Do check afterwards that the datatypes are as required. We want these two fields to be numeric when producing the plots, otherwise you will get unexpected results.

In [58]:
who_tb_df.dtypes

Countries, territories and areas                                object
Year                                                             int64
Incidence of tuberculosis (per 100 000 population per year)     object
Number of incident tuberculosis cases,  (HIV-positive cases)    object
dtype: object

As we can see the columns are objects, so as part of the cleaning exercise you also need to change these two columns to type float - `astype()` can be used for this.
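One thing to watch for is that the HIV-positive cases column contains missing values (Andorra, for example), and calling `x.split()` inside `apply` fails on a missing value because `NaN` is a float. A possible way round this, sketched here on a made-up Series (`sample_tb`) in the same "value [low-high]" layout, is to use the vectorised `.str` methods, which pass missing values through, and then convert with `astype()`:

```python
import numpy as np
import pandas as pd

# made-up sample in the same "value [low-high]" layout, with one missing entry
sample_tb = pd.Series(
    ["23 [11-41]", "2 [0-7]", "190 [98-310]", np.nan, "0.06 [0.03-0.10]"],
    name="cases",
)

# .str.split() leaves missing values as NaN instead of raising an error
first_values = sample_tb.str.split().str[0]

# convert the remaining strings to floats; NaN stays as NaN
numeric_values = first_values.astype(float)

print(numeric_values)
print(numeric_values.dtype)
```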

### World bank data

Check the file again:

In [18]:
!head -n 7 data/API_11_DS2_en_csv_v2_88104/API_11_DS2_en_csv_v2_88104.csv

﻿"Data Source","World Development Indicators",

"Last Updated Date","2025-04-15",

"Country Name","Country Code","Indicator Name","Indicator Code","1960","1961","1962","1963","1964","1965","1966","1967","1968","1969","1970","1971","1972","1973","1974","1975","1976","1977","1978","1979","1980","1981","1982","1983","1984","1985","1986","1987","1988","1989","1990","1991","1992","1993","1994","1995","1996","1997","1998","1999","2000","2001","2002","2003","2004","2005","2006","2007","2008","2009","2010","2011","2012","2013","2014","2015","2016","2017","2018","2019","2020","2021","2022","2023","2024",
"Aruba","ABW","Annualized average growth rate in per capita real survey mean consumption or income, total population (%)","SI.SPR.PCAP.ZG","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","",
"Aruba","ABW","Survey mean consumption or income per capita, t

In [32]:
!tail -n 3 data/API_11_DS2_en_csv_v2_88104/API_11_DS2_en_csv_v2_88104.csv

"Zimbabwe","ZWE","Income share held by third 20%","SI.DST.03RD.20","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","14","","","","","","13.2","","","","","","","",
"Zimbabwe","ZWE","Income share held by second 20%","SI.DST.02ND.20","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","9.5","","","","","","9.1","","","","","","","",
"Zimbabwe","ZWE","Population living in slums (% of urban population)","EN.POP.SLUM.UR.ZS","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","27.52175","","26.92639","","26.33104","","25.73569","","25.14033","","24.54498","","23.94962","","23.35427","","22.75892","","22.16356","","21.56821","","","","",


Hmmm, not so good. The data is comma separated, but we can see there appears to be some metadata at the start of the file and a lot of nulls!

In [19]:
# skip the metadata at the start
worldbank_df = pd.read_csv(
    "data/API_11_DS2_en_csv_v2_88104/API_11_DS2_en_csv_v2_88104.csv", skiprows=4
)
worldbank_df.head()

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2016,2017,2018,2019,2020,2021,2022,2023,2024,Unnamed: 69
0,Aruba,ABW,Annualized average growth rate in per capita r...,SI.SPR.PCAP.ZG,,,,,,,...,,,,,,,,,,
1,Aruba,ABW,"Survey mean consumption or income per capita, ...",SI.SPR.PCAP,,,,,,,...,,,,,,,,,,
2,Aruba,ABW,Annualized average growth rate in per capita r...,SI.SPR.PC40.ZG,,,,,,,...,,,,,,,,,,
3,Aruba,ABW,"Survey mean consumption or income per capita, ...",SI.SPR.PC40,,,,,,,...,,,,,,,,,,
4,Aruba,ABW,Poverty gap at $6.85 a day (2017 PPP) (%),SI.POV.UMIC.GP,,,,,,,...,,,,,,,,,,


We can see there are a lot of nulls. Some investigation could be made to see how bad this is, e.g., a quick count:

In [44]:
worldbank_df.count()

Country Name      5586
Country Code      5586
Indicator Name    5586
Indicator Code    5586
1960                 0
                  ... 
2021              1383
2022               568
2023                79
2024                 0
Unnamed: 69          0
Length: 70, dtype: int64

Assuming the country name and codes are likely to be fully populated, 2022 has only 568 values out of 5586 rows.

We only want certain indicators. This is where the metadata comes to the rescue again to see which codes are needed: `Metadata_Indicator_API_11_DS2_en_csv_v2_88104.csv`. For instance, the indicator code for the Gini data is `SI.POV.GINI`.


In [22]:
gini_df = worldbank_df.query("`Indicator Code`=='SI.POV.GINI'").drop(
    ["Indicator Name", "Indicator Code"], axis="columns"
)

gini_df.head()

Unnamed: 0,Country Name,Country Code,1960,1961,1962,1963,1964,1965,1966,1967,...,2016,2017,2018,2019,2020,2021,2022,2023,2024,Unnamed: 69
9,Aruba,ABW,,,,,,,,,...,,,,,,,,,,
30,Africa Eastern and Southern,AFE,,,,,,,,,...,,,,,,,,,,
51,Afghanistan,AFG,,,,,,,,,...,,,,,,,,,,
72,Africa Western and Central,AFW,,,,,,,,,...,,,,,,,,,,
93,Angola,AGO,,,,,,,,,...,,,51.3,,,,,,,


Now 2022 has only 28 values out of 266 rows:

In [45]:
gini_df.count()

Country Name    266
Country Code    266
1960              0
1961              0
1962              0
               ... 
2021             71
2022             28
2023              4
2024              0
Unnamed: 69       0
Length: 68, dtype: int64

Different approaches could be taken to try and salvage some data. If you investigate the data a bit more, the values do not vary significantly from year to year for each country.

One approach is to use the forward fill function, `ffill()`. Details can be found here: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.ffill.html

In [24]:
gini_df.ffill(axis="columns")

Unnamed: 0,Country Name,Country Code,1960,1961,1962,1963,1964,1965,1966,1967,...,2016,2017,2018,2019,2020,2021,2022,2023,2024,Unnamed: 69
9,Aruba,ABW,ABW,ABW,ABW,ABW,ABW,ABW,ABW,ABW,...,ABW,ABW,ABW,ABW,ABW,ABW,ABW,ABW,ABW,ABW
30,Africa Eastern and Southern,AFE,AFE,AFE,AFE,AFE,AFE,AFE,AFE,AFE,...,AFE,AFE,AFE,AFE,AFE,AFE,AFE,AFE,AFE,AFE
51,Afghanistan,AFG,AFG,AFG,AFG,AFG,AFG,AFG,AFG,AFG,...,AFG,AFG,AFG,AFG,AFG,AFG,AFG,AFG,AFG,AFG
72,Africa Western and Central,AFW,AFW,AFW,AFW,AFW,AFW,AFW,AFW,AFW,...,AFW,AFW,AFW,AFW,AFW,AFW,AFW,AFW,AFW,AFW
93,Angola,AGO,AGO,AGO,AGO,AGO,AGO,AGO,AGO,AGO,...,42.7,42.7,51.3,51.3,51.3,51.3,51.3,51.3,51.3,51.3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5490,Kosovo,XKX,XKX,XKX,XKX,XKX,XKX,XKX,XKX,XKX,...,26.7,29.0,29.0,29.0,29.0,29.0,29.0,29.0,29.0,29.0
5511,"Yemen, Rep.",YEM,YEM,YEM,YEM,YEM,YEM,YEM,YEM,YEM,...,36.7,36.7,36.7,36.7,36.7,36.7,36.7,36.7,36.7,36.7
5532,South Africa,ZAF,ZAF,ZAF,ZAF,ZAF,ZAF,ZAF,ZAF,ZAF,...,63.0,63.0,63.0,63.0,63.0,63.0,63.0,63.0,63.0,63.0
5553,Zambia,ZMB,ZMB,ZMB,ZMB,ZMB,ZMB,ZMB,ZMB,ZMB,...,55.8,55.8,55.8,55.8,55.8,55.8,51.5,51.5,51.5,51.5


This has unintended results: the text in the country columns is filled forward into the empty year columns. Since we are only interested in the numeric values, drop the country details before applying the function:

In [27]:
final_gini_df = (
    gini_df[["Country Name", "Country Code"]].merge(
        (
            gini_df.drop(["Country Name", "Country Code"], axis="columns").ffill(
                axis="columns"
            )
        )["2022"],
        left_index=True,
        right_index=True,
    )
).rename({"2022": "Gini coefficient"}, axis="columns")

final_gini_df.head()

Unnamed: 0,Country Name,Country Code,Gini coefficient
9,Aruba,ABW,
30,Africa Eastern and Southern,AFE,
51,Afghanistan,AFG,
72,Africa Western and Central,AFW,
93,Angola,AGO,51.3


In [29]:
final_gini_df.count()

Country Name        266
Country Code        266
Gini coefficient    169
dtype: int64

We now have 169 values rather than 28. The rows with nulls can now be removed. A similar exercise can be carried out with the poorest 20% data, which is identified with the series code `SI.DST.FRST.20`.
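Removing the rows with nulls can be done with `dropna()`. The sketch below runs the whole fill-then-drop idea on a tiny made-up DataFrame (`toy_gini`, with invented country names and values). It uses `assign()` rather than the `merge()` shown above, which is a simpler alternative that relies on the shared index rather than the approach taken in the tutorial.

```python
import numpy as np
import pandas as pd

# tiny made-up frame: country names and values are invented for illustration
toy_gini = pd.DataFrame(
    {
        "Country Name": ["Aland", "Borduria", "Carpania"],
        "2020": [np.nan, 40.1, np.nan],
        "2021": [35.2, np.nan, np.nan],
        "2022": [np.nan, np.nan, np.nan],
    }
)

year_cols = ["2020", "2021", "2022"]

# fill forward across the year columns only, so no text leaks into them
filled = toy_gini[year_cols].ffill(axis="columns")

# attach the filled 2022 value (rows align on the index) and drop the nulls
toy_result = (
    toy_gini[["Country Name"]]
    .assign(**{"Gini coefficient": filled["2022"]})
    .dropna(subset=["Gini coefficient"])
)

print(toy_result)
```

The third country has no value in any year, so forward filling cannot help and the row is dropped.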

## Joining datasets

You will normally have to merge datasets as part of your investigation. 

Normally this involves joining over one field, such as the country name seen in Q2.

Do you just join them and get on with the rest of the investigation?

Then wonder why the plots have little data, or maybe more than expected...


Things to think about and check:
* how far do the values in the common field match
* can anything be done about the unmatched values
* is one field enough for the merge

Sometimes different datasets use different names for countries, such as just plain `China` or `People's Republic of China` (with or without the apostrophe); some might have the data in upper case, others mixed case, etc. 

Some work could be done to try and match obvious differences such as these.
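A small sketch of this kind of tidying is given below. The mapping `name_map` is illustrative only and would need to be built by inspecting the actual mismatches between your own datasets; `wb_names` is a made-up Series rather than the real column.

```python
import pandas as pd

# made-up sample of names in one dataset's style
wb_names = pd.Series(["Bahamas, The", "Egypt, Arab Rep.", "Gambia, The", "Angola"])

# illustrative mapping from one dataset's spelling to the other's
name_map = {
    "Bahamas, The": "Bahamas",
    "Egypt, Arab Rep.": "Egypt",
    "Gambia, The": "Gambia",
}

# replace() swaps exact matches and leaves every other name untouched
harmonised = wb_names.replace(name_map)

# a case- and whitespace-insensitive key can help when comparing the two columns
match_key = harmonised.str.strip().str.casefold()

print(harmonised.tolist())
print(match_key.tolist())
```

Keeping the mapping in a dictionary also documents, for your reader, exactly which names were changed.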

Sometimes there may just not be data available for all countries in both datasets, in which case nothing can be done, but the key thing is you are aware of this and understand why your final results are not perfect because they are based on incomplete data.


Like joining tables in relational databases, sometimes there is more than one field that is in common with the datasets to be merged, such as the longitude and latitude seen in Q1, or if you have yearly data for countries you may need to join on both the country name and year.
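To illustrate why the choice of join fields matters, the sketch below merges two made-up yearly DataFrames (`left_df` and `right_df`, with invented values) first on the country alone and then on both country and year.

```python
import pandas as pd

# made-up yearly data for two countries
left_df = pd.DataFrame(
    {"country": ["A", "A", "B"], "year": [2021, 2022, 2022], "tb": [10.0, 12.0, 30.0]}
)
right_df = pd.DataFrame(
    {"country": ["A", "B", "B"], "year": [2022, 2021, 2022], "gini": [35.0, 41.0, 40.0]}
)

# joining on country only pairs every year with every other year
country_only = left_df.merge(right_df, on="country")

# joining on both fields keeps only rows where country and year agree
country_year = left_df.merge(right_df, on=["country", "year"])

print(len(country_only), len(country_year))
```

The country-only merge produces more rows than either input, which is the "more than expected" situation mentioned earlier.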


Q2 did require you to join the World Bank and TB data. A quick reminder of the two data frames:

In [59]:
who_tb_df.head()

Unnamed: 0,"Countries, territories and areas",Year,Incidence of tuberculosis (per 100 000 population per year),"Number of incident tuberculosis cases, (HIV-positive cases)"
0,Afghanistan,2022,185.0,23 [11-41]
23,Albania,2022,15.0,2 [0-7]
46,Algeria,2022,51.0,190 [98-310]
69,Andorra,2022,5.8,
92,Angola,2022,333.0,9100 [5600-14000]


In [60]:
final_gini_df.head()

Unnamed: 0,Country Name,Country Code,Gini coefficient
9,Aruba,ABW,
30,Africa Eastern and Southern,AFE,
51,Afghanistan,AFG,
72,Africa Western and Central,AFW,
93,Angola,AGO,51.3


Both datasets have the country name in common, but do both datasets contain information for the same countries, and are the names in the same format?

The key thing is to check for differences. Do not just merge them without at least checking the size of the resulting data frame.

One way is to look for any differences:

In [61]:
set(final_gini_df["Country Name"]) - set(who_tb_df["Countries, territories and areas"]) 

{'Africa Eastern and Southern',
 'Africa Western and Central',
 'American Samoa',
 'Arab World',
 'Aruba',
 'Bahamas, The',
 'Bermuda',
 'Bolivia',
 'British Virgin Islands',
 'Caribbean small states',
 'Cayman Islands',
 'Central Europe and the Baltics',
 'Channel Islands',
 'Congo, Dem. Rep.',
 'Congo, Rep.',
 'Curacao',
 'Early-demographic dividend',
 'East Asia & Pacific',
 'East Asia & Pacific (IDA & IBRD countries)',
 'East Asia & Pacific (excluding high income)',
 'Egypt, Arab Rep.',
 'Euro area',
 'Europe & Central Asia',
 'Europe & Central Asia (IDA & IBRD countries)',
 'Europe & Central Asia (excluding high income)',
 'European Union',
 'Faroe Islands',
 'Fragile and conflict affected situations',
 'French Polynesia',
 'Gambia, The',
 'Gibraltar',
 'Greenland',
 'Guam',
 'Heavily indebted poor countries (HIPC)',
 'High income',
 'Hong Kong SAR, China',
 'IBRD only',
 'IDA & IBRD total',
 'IDA blend',
 'IDA only',
 'IDA total',
 'Iran, Islamic Rep.',
 'Isle of Man',
 "Korea, D

In [62]:
# check both ways too
set(who_tb_df["Countries, territories and areas"]) - set(final_gini_df["Country Name"])  

{'Bahamas',
 'Bolivia (Plurinational State of)',
 'Congo',
 'Cook Islands',
 "Democratic People's Republic of Korea",
 'Democratic Republic of the Congo',
 'Egypt',
 'Gambia',
 'Iran (Islamic Republic of)',
 'Kyrgyzstan',
 "Lao People's Democratic Republic",
 'Micronesia (Federated States of)',
 'Netherlands (Kingdom of the)',
 'Niue',
 'Republic of Korea',
 'Republic of Moldova',
 'Saint Kitts and Nevis',
 'Saint Lucia',
 'Saint Vincent and the Grenadines',
 'Slovakia',
 'United Kingdom of Great Britain and Northern Ireland',
 'United Republic of Tanzania',
 'United States of America',
 'Venezuela (Bolivarian Republic of)',
 'Yemen'}

The economic data also appears to contain regional groupings, such as "Africa Eastern and Southern" and "South Asia". These are likely summary rows rather than countries, so nothing can be done about them. Some countries exist in one dataset but not the other, such as "Guam" or "Niue", and again nothing can be done.

The good news is that some are obviously the same countries under different names, such as "United Kingdom" versus "United Kingdom of Great Britain and Northern Ireland", or "St. Lucia" versus "Saint Lucia". A mapping exercise could be carried out for these, similar to Step 5 in Q1 for the bats.
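One possible way to sketch such a mapping is with a dictionary and `Series.replace`. The name pairs below are taken from the two set differences shown earlier; the small data frame `gini_sample`, the dictionary `name_map` and the new column name are hypothetical stand-ins, not the real `final_gini_df`:

```python
import pandas as pd

# Hypothetical stand-in for a few rows of the World Bank data
gini_sample = pd.DataFrame({
    "Country Name": ["Bahamas, The", "Egypt, Arab Rep.", "Gambia, The", "Angola"],
})

# World Bank spelling -> WHO spelling (pairs visible in the outputs above)
name_map = {
    "Bahamas, The": "Bahamas",
    "Egypt, Arab Rep.": "Egypt",
    "Gambia, The": "Gambia",
}

# Write to a new column so the original names remain available for checking
gini_sample["Country Name (WHO)"] = gini_sample["Country Name"].replace(name_map)

print(gini_sample)
```

Names that are not in the dictionary, such as "Angola" here, pass through unchanged. Keeping the original column alongside the mapped one makes it easy to show your reader exactly which names were altered.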

Once this is done, you can then merge the two datasets. Do check how many countries survived the merge, so that you are aware of its impact. This information can then be used when evaluating the results at the end.
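One way to count the survivors, sketched here on two invented data frames, is an outer merge with pandas' `indicator=True` option, which adds a `_merge` column recording where each row came from. The frames `tb_demo` and `gini_demo` and their values are made up; only the two country-name column headings come from the real datasets:

```python
import pandas as pd

# Invented stand-ins for the two datasets (values are not real)
tb_demo = pd.DataFrame({
    "Countries, territories and areas": ["Angola", "Egypt", "Niue"],
    "tb_incidence": [1.0, 2.0, 3.0],
})
gini_demo = pd.DataFrame({
    "Country Name": ["Angola", "Egypt", "Guam"],
    "gini": [0.1, 0.2, 0.3],
})

# An outer merge keeps unmatched rows; _merge says which side each row came from
merged = tb_demo.merge(
    gini_demo,
    left_on="Countries, territories and areas",
    right_on="Country Name",
    how="outer",
    indicator=True,
)

counts = merged["_merge"].value_counts()
print(counts)

# Keep only the countries present in both datasets
matched = merged[merged["_merge"] == "both"]
print(f"{len(matched)} of {len(tb_demo)} TB countries survived the merge")
```

The `left_only` and `right_only` counts give you the numbers to quote when you discuss how much data was lost.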

### Summary

Things to think about in TMA02 and the EMA:

- do not skip data checks: this had the biggest impact. Skipped checks often meant data was unknowingly lost, such as losing 40% of the countries when merging the data.
- do explain your decisions: do not just produce lots of code followed by "ta-da, here are the plots and maps required". It is important to explain design decisions. There is no one right answer to these assessments, but we want to know the logic behind the code.
- avoid putting large amounts of code in one cell. Aim for one cell, one purpose, one outcome. This helps with the storytelling.
- make sure the code and commentary are in sync
- make sure the plots answer the questions asked
- some of you used AI-generated code, but you are still responsible for understanding and justifying every transformation. Unexplained logic is still invisible reasoning.
- make sure your notebook is runnable. Before submitting, always do a "Kernel>Restart Kernel and Run All Cells...". It is not a problem if you have left in errors to show issues with the data, such as the initial read of the cities5000.txt file above, so long as the notebook can be run again after the error.
- if you have done some cleaning in OpenRefine and produced cleaned files, do make sure you include these files with your submission, along with the images that show the cleaning.
- use relative filenames in the notebook. Remember your tutor will not have the same folder structure as your computer! So avoid absolute paths such as: `pd.read_csv('c:/MyComputer/OU/TM351/TMA02/data/filename.csv')`
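As a brief sketch of the last point, a relative path can be built with `pathlib`. This assumes the notebook sits next to a `data` folder, as in TMA01; the variable names `DATA_DIR` and `tb_path` are illustrative:

```python
from pathlib import Path

# Relative to the notebook's own folder, so it works on any machine
# that has the same submission folder layout
DATA_DIR = Path("data")
tb_path = DATA_DIR / "WHO_TB_data.csv"

print(tb_path.as_posix())
print("Absolute path?", tb_path.is_absolute())

# The file would then be read with, for example: pd.read_csv(tb_path)
```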

Remember this is a final year module, so coding skills, whilst necessary, are not the only requirement. You need to be able to explain and justify your choices. Ensure you understand the impact of any transformations carried out and be aware of how much data was lost during this process. 

**Next Steps**

We will now move onto a review of TMA02: `TMA02 Overview 25b.ipynb`.

