# TM351 Data Management & Analysis


## TMA02 Preparation Tutorial
## 25th January 2025

In [1]:
# This cell imports the libraries needed for the tutorial (pandas and folium)

import pandas as pd
import folium

This tutorial is made up of several notebooks:

* `TMA02 Preparation 24b.ipynb`: (this one), which includes a review of TMA01.
* `TMA02 Overview 24b.ipynb`: a review of TMA02
* `MongoDB.ipynb`: (optional) includes extra examples of using MongoDB

### TMA01 Review

By now you will have completed TMA01. Let us look at some of the issues that were highlighted, as they are helpful to review when preparing for TMA02 and the EMA.

TMA01 used several datasets:

Q1:
* Museums dataset: `MappingMuseumsData2021_09_30.csv`
* Towns and cities in the world which have a population of more than 500 people: `cities500.txt`

Q2:
* World happiness data: `happiness_2024.xls`

* Economic data - spending on cultural heritage: `cultural_heritage_data` (folder containing several files)

All were found in the data directory.


In the TMAs and EMA, you want to show that you have investigated the files for any problems.

These can include:
- checking for issues in the files
- checking for issues in the data

Any thoughts on what checks you could carry out?

Remember the pipeline you should go through:

!["Data pipeline"](images/tm351_pt1_f04.eps.jpg)


**Issues in the files**

Even if you have had a quick look at a `CSV` or `txt` file in Excel or a text editor, you should still review it in the notebook, so that you can show your reader whether or not the file has any problems.

An easy way to do this is to review the file before importing it using the operating system's `head` and `tail` commands.

For instance, the `cities500.txt` file had a few issues:

In [2]:
!head data/cities500.txt

3038999	Soldeu	Soldeu		42.57688	1.66769	P	PPL	AD		02				602		1832	Europe/Andorra	2017-11-06
3039154	El Tarter	El Tarter	Ehl Tarter,Эл Тартер	42.57952	1.65362	P	PPL	AD		02				1052		1721	Europe/Andorra	2012-11-03
3039163	Sant Julià de Lòria	Sant Julia de Loria	San Julia,San Julià,Sant Julia de Loria,Sant Julià de Lòria,Sant-Zhulija-de-Lorija,sheng hu li ya-de luo li ya,Сант-Жулия-де-Лория,サン・ジュリア・デ・ロリア教区,圣胡利娅-德洛里亚,圣胡利娅－德洛里亚	42.46372	1.49129	P	PPLA	AD		06				8022		921	Europe/Andorra	2013-11-23
3039604	Pas de la Casa	Pas de la Casa	Pas de la Kasa,Пас де ла Каса	42.54277	1.73361	P	PPL	AD		03				2363	2050	2106	Europe/Andorra	2008-06-09
3039678	Ordino	Ordino	Ordino,ao er di nuo,orudino jiao qu,Ордино,オルディノ教区,奥尔迪诺	42.55623	1.53319	P	PPLA	AD		05				3066		1296	Europe/Andorra	2018-10-26
3040051	les Escaldes	les Escaldes	Ehskal'des-Ehndzhordani,Escaldes,Escaldes-Engordany,Les Escaldes,esukarudesu=engorudani jiao qu,lai sai si ka er de-en ge er da,Эскальдес-Энджордани,エスカルデス＝エンゴルダニ教区,萊塞斯卡爾德-恩戈爾達,萊塞

In [3]:
# note the cities file has been reduced to 70 records to avoid uploading large datasets
!tail data/cities500.txt

1123004	Taloqan	Taloqan	Khanabad,TQN,Taikhan,Taleqan,Talikan,Talikhan,Taliqan,Talkan,Talokan,Taloqan,Talugan,Talukan,Talukanas,Taluqan,Tologan,Tâloqân,Tāleqān,Tāloqān,Tāluqān,Tālīqān,ta lu kan,talokam,talokuvan,taloqana,talqan,Таликан,Талукан,تالقان,तालोक़ान,तालोकां,தலோகுவான்,თალიკანი,塔卢坎	36.73605	69.53451	P	PPLA	AF		26	1201			64256		801	Asia/Kabul	2018-02-17
1123343	Tagāw-Bāy	Tagaw-Bay	Bai,Bay,Bāy,Tagaw-Bay,Tagaw-bay,Tagaw-bāy,Tagow Bay,Tagow Bāy,Tagāw-Bāy,tgaw bay,تگاو بای	35.69941	66.06164	P	PPL	AF		33	3103			9096		1666	Asia/Kabul	2020-06-09
1123424	Tagāb	Tagab	Pagab,Tagab,Tagao,Tagāb,tgab,تگاب	34.85501	69.64917	P	PPL	AF		14	205			6400		1335	Asia/Kabul	2020-06-09
1123666	Markaz-e Ḩukūmat-e Sulţān-e Bakwāh	Markaz-e Hukumat-e Sultan-e Bakwah	Bakva,Bakvā,Bakwah,Bakwāh,Markaz-e Hokumat-e Soltan-e Bakva,Markaz-e Hukumat-e Sultan-e Bakwa,Markaz-e Hukumat-e Sultan-e Bakwah,Markaz-e Ḩokūmat-e Solţān-e Bakvā,Markaz-e Ḩukūmat-e Sulţān-e Bakwā,Markaz-e Ḩukūmat-e Sulţān-e Bakwāh,Markaze Hokumat

What are the issues here?

In [4]:
!head -3 data/cities500.txt

3038999	Soldeu	Soldeu		42.57688	1.66769	P	PPL	AD		02				602		1832	Europe/Andorra	2017-11-06
3039154	El Tarter	El Tarter	Ehl Tarter,Эл Тартер	42.57952	1.65362	P	PPL	AD		02				1052		1721	Europe/Andorra	2012-11-03
3039163	Sant Julià de Lòria	Sant Julia de Loria	San Julia,San Julià,Sant Julia de Loria,Sant Julià de Lòria,Sant-Zhulija-de-Lorija,sheng hu li ya-de luo li ya,Сант-Жулия-де-Лория,サン・ジュリア・デ・ロリア教区,圣胡利娅-德洛里亚,圣胡利娅－德洛里亚	42.46372	1.49129	P	PPLA	AD		06				8022		921	Europe/Andorra	2013-11-23


Most people probably spotted there was an issue when they tried to import the data.

For example, if you just tried to load it without checking, you would have got:

In [5]:
cities_df=pd.read_csv('data/cities500.txt')

ParserError: Error tokenizing data. C error: Expected 2 fields in line 3, saw 10


There are two issues in this file:

* no header
* data is not comma separated and appears to be tab separated instead

The latter is easily resolved using the separator (`sep`) parameter of the `read_csv` function.
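Before choosing a separator, it can help to count candidate delimiters in the first few lines of the file. The following is a minimal sketch rather than part of the original notebook: the helper name `count_delimiters` and the shortened sample records are assumptions made for illustration. In practice you would pass in lines read from `data/cities500.txt`.

```python
from collections import Counter


def count_delimiters(lines, candidates=("\t", ",", ";", "|")):
    """Count how often each candidate delimiter appears across the given lines."""
    counts = Counter()
    for line in lines:
        for delim in candidates:
            counts[delim] += line.count(delim)
    return counts


# Illustrative sample only: two shortened, tab-separated records
sample_lines = [
    "3038999\tSoldeu\tSoldeu\t\t42.57688\t1.66769",
    "3039154\tEl Tarter\tEl Tarter\tEhl Tarter,Эл Тартер\t42.57952\t1.65362",
]

delimiter_counts = count_delimiters(sample_lines)
most_common_delimiter = delimiter_counts.most_common(1)[0][0]
print(repr(most_common_delimiter), dict(delimiter_counts))
```

A delimiter that appears the same number of times on every line is a strong hint. Commas that appear only occasionally, as in the alternate names field here, are more likely to be part of the data.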

What to do about the missing header?


In [6]:
# tell it there is no header
# and set low_memory=False to avoid DtypeWarning: Columns (9,10,11,12,13) have mixed types
cities_df=pd.read_csv('data/cities500.txt', sep='\t', encoding='utf-8', header=None, low_memory=False)
cities_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
0,3038999,Soldeu,Soldeu,,42.57688,1.66769,P,PPL,AD,,2,,,,602,,1832,Europe/Andorra,2017-11-06
1,3039154,El Tarter,El Tarter,"Ehl Tarter,Эл Тартер",42.57952,1.65362,P,PPL,AD,,2,,,,1052,,1721,Europe/Andorra,2012-11-03
2,3039163,Sant Julià de Lòria,Sant Julia de Loria,"San Julia,San Julià,Sant Julia de Loria,Sant J...",42.46372,1.49129,P,PPLA,AD,,6,,,,8022,,921,Europe/Andorra,2013-11-23
3,3039604,Pas de la Casa,Pas de la Casa,"Pas de la Kasa,Пас де ла Каса",42.54277,1.73361,P,PPL,AD,,3,,,,2363,2050.0,2106,Europe/Andorra,2008-06-09
4,3039678,Ordino,Ordino,"Ordino,ao er di nuo,orudino jiao qu,Ордино,オルデ...",42.55623,1.53319,P,PPLA,AD,,5,,,,3066,,1296,Europe/Andorra,2018-10-26


But now the column names are just numbers, which is not very meaningful.

This is where we need to look at what information there is about the data. The website will have information, but the contents of the file are helpfully described in the `geonames_readme.txt` file:

    The main 'geoname' table has the following fields :
    ---------------------------------------------------
    geonameid         : integer id of record in geonames database
    name              : name of geographical point (utf8) varchar(200)
    asciiname         : name of geographical point in plain ascii characters, varchar(200)
    alternatenames    : alternatenames, comma separated, ascii names automatically transliterated, convenience attribute from alternatename table, varchar(10000)
    latitude          : latitude in decimal degrees (wgs84)
    longitude         : longitude in decimal degrees (wgs84)
    feature class     : see http://www.geonames.org/export/codes.html, char(1)
    feature code      : see http://www.geonames.org/export/codes.html, varchar(10)
    country code      : ISO-3166 2-letter country code, 2 characters
    cc2               : alternate country codes, comma separated, ISO-3166 2-letter country code, 200 characters
    admin1 code       : fipscode (subject to change to iso code), see exceptions below, see file admin1Codes.txt for display names of this code; varchar(20)
    admin2 code       : code for the second administrative division, a county in the US, see file admin2Codes.txt; varchar(80) 
    admin3 code       : code for third level administrative division, varchar(20)
    admin4 code       : code for fourth level administrative division, varchar(20)
    population        : bigint (8 byte int) 
    elevation         : in meters, integer
    dem               : digital elevation model, srtm3 or gtopo30, average elevation of 3''x3'' (ca 90mx90m) or 30''x30'' (ca 900mx900m) area in meters, integer. srtm processed by cgiar/ciat.
    timezone          : the iana timezone id (see file timeZone.txt) varchar(40)
    modification date : date of last modification in yyyy-MM-dd format
    
The site also states that the file is in tab separated format (which is borne out by the OS `head` command), and that the encoding is utf-8, which is the default for `read_csv`.

Do always check the documentation if you are not sure about any of the options: 
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

The question only asks for certain fields to be used. Different approaches can be used to pull out the required columns:

* use the `usecols` parameter of `read_csv()`
* import the whole file, then reshape the data frame, filtering out the required columns and renaming them


With small datasets, either approach is fine. When working with huge amounts of data, which might be preferable?
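The first approach can be sketched as follows. This is a minimal illustration, not the TMA solution: the two sample records (cut down to the first 15 fields) and the variable names are assumptions. With the real file you would pass `'data/cities500.txt'` in place of the `StringIO` object. Positions 1, 8 and 14 correspond to `name`, `country code` and `population` in the readme.

```python
from io import StringIO

import pandas as pd

# Illustrative sample only: two records cut down to the first 15 fields
rows = [
    ["3038999", "Soldeu", "Soldeu", "", "42.57688", "1.66769", "P", "PPL",
     "AD", "", "02", "", "", "", "602"],
    ["3039154", "El Tarter", "El Tarter", "Ehl Tarter", "42.57952", "1.65362",
     "P", "PPL", "AD", "", "02", "", "", "", "1052"],
]
sample_file = StringIO("\n".join("\t".join(row) for row in rows))

# Read only the name (1), country code (8) and population (14) columns,
# then give them meaningful names
selected_df = (
    pd.read_csv(sample_file, sep="\t", header=None, usecols=[1, 8, 14])
    .rename({1: "City", 8: "Country", 14: "Population"}, axis="columns")
)
print(selected_df)
```

Here only three columns are parsed and held in memory, whereas the second approach loads every column before discarding most of them.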

In [7]:
# For example, this is the second approach - after importing the file as above 
# We need Country at this stage, so we can keep only the GB rows afterwards
uk_cities_df=(cities_df
              .rename({1: 'City', 8:'Country', 14:'Population'}, axis='columns')
              .filter(['City', 'Country', 'Population'], axis='columns')
)

uk_cities_df.head()

Unnamed: 0,City,Country,Population
0,Soldeu,AD,602
1,El Tarter,AD,1052
2,Sant Julià de Lòria,AD,8022
3,Pas de la Casa,AD,2363
4,Ordino,AD,3066


If you find no issues, do also comment on that. 

Looking at the museum data:

In [8]:
! head -3 data/MappingMuseumsData2021_09_30.csv

head: cannot open 'data/MappingMuseumsData2021_09_30.csv' for reading: No such file or directory


In [9]:
! tail -3 data/MappingMuseumsData2021_09_30.csv

tail: cannot open 'data/MappingMuseumsData2021_09_30.csv' for reading: No such file or directory


The file is comma separated, has a single header row, and doesn't appear to have any additional information at the bottom of the file. The data can be imported using the `pandas.read_csv` function directly, since it defaults to a comma separator and a header row.

This shows that you have looked at the data and are aware of how it can be imported correctly.

The OS `head` and `tail` commands are not suitable for looking at Excel data. Instead, you could provide some screenshots of the file opened in MS Excel to show whether there are any issues.

Things to watch out for:
* more than one worksheet
* extra data at the start and end of a worksheet

Excel data can be imported using the `pandas.read_excel` function. If you discover issues, do review the documentation to see what can be used to resolve them:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html

For example (version 2.2.3):
<pre>
pandas.read_excel(io, sheet_name=0, *, header=0, names=None, index_col=None, usecols=None, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skiprows=None, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, parse_dates=False, date_parser=<no_default>, date_format=None, thousands=None, decimal='.', comment=None, skipfooter=0, storage_options=None, dtype_backend=<no_default>, engine_kwargs=None)
</pre>


**Issues in the data**

Once you have the data imported and reshaped appropriately, the next step is to look at the data for any issues.

Things to look out for:
* missing data
* outliers
* incorrect data

Again there may not be any issues, but do check and comment on what you have found.

The `describe` and `info` methods, or the `dtypes` attribute, can be used to do a quick check. You can then explore further if you spot anything out of the ordinary.
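As one way of exploring further, the sketch below counts missing values and flags implausible ones. The data frame, its values and the threshold of 500 are assumptions chosen for illustration. The threshold echoes the description of the cities file, but you should pick checks that suit your own data.

```python
import pandas as pd

# Illustrative data only: one missing population and one suspicious value
example_df = pd.DataFrame({
    "City": ["Alpha", "Beta", "Gamma", "Delta"],
    "Population": [1200.0, None, 850.0, -5.0],
})

# Count the missing values in each column
missing_counts = example_df.isna().sum()

# Flag rows whose population falls below the assumed minimum of 500
# (missing values compare as False, so they are not flagged here)
suspect_rows = example_df[example_df["Population"] < 500]

print(missing_counts)
print(suspect_rows)
```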

In [10]:
# extract just the UK cities
uk_cities_df=uk_cities_df[uk_cities_df["Country"]=="GB"]
uk_cities_df=uk_cities_df.drop('Country', axis=1)
uk_cities_df.head()

Unnamed: 0,City,Population


In [11]:
uk_cities_df.describe()

Unnamed: 0,Population
count,0.0
mean,
std,
min,
25%,
50%,
75%,
max,


In [12]:
# this and the next command can be used to check if numeric fields are indeed numeric
uk_cities_df.dtypes

City          object
Population     int64
dtype: object

In [13]:
uk_cities_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 0 entries
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   City        0 non-null      object
 1   Population  0 non-null      int64 
dtypes: int64(1), object(1)
memory usage: 0.0+ bytes


What might these highlight?

**Joining datasets**

You will normally have to merge datasets as part of your investigation. 

Normally this involves joining over one field, such as the country name seen in Q2.

Do you just join them and get on with the rest of the investigation?

Then wonder why the plots have little data, or maybe more than expected....


Things to think about and check:
* how far do the values in the common field match
* can anything be done about the unmatched values
* is one field enough for the merge

Sometimes different datasets use different names for countries, such as just plain `China` or `People's Republic of China` (with or without the apostrophe); some might have the data in upper case, others mixed case, etc. 

Some work could be done to try and match obvious differences such as these.
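A minimal sketch of this kind of tidying is shown below. The sample names and the `aliases` mapping are hypothetical; a real mapping would be built from the mismatches you actually find in your datasets.

```python
import pandas as pd

# Illustrative values only
names = pd.Series(["China", " CHINA ", "People's Republic of China", "Australia"])

# Hypothetical mapping from variant names to the preferred name
aliases = {"people's republic of china": "china"}

normalised = (
    names.str.strip()       # remove stray leading and trailing spaces
         .str.casefold()    # ignore differences in case
         .replace(aliases)  # map known variants to a single name
)
print(normalised.tolist())
```

Applying the same normalisation to the join column in both data frames before merging gives the values a better chance of matching.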

Sometimes there may just not be data available for all countries in both datasets, in which case nothing can be done, but the key thing is you are aware of this and understand why your final results are not perfect because they are based on incomplete data.
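One way to see how far the values match is an outer merge with pandas' `indicator=True` option, which records where each row came from. The data frames below are made up for illustration and are not the TMA datasets.

```python
import pandas as pd

# Illustrative data only
left_df = pd.DataFrame(
    {"country": ["Australia", "China", "Chad"], "value": [1, 2, 3]}
)
right_df = pd.DataFrame(
    {"country": ["Australia", "People's Republic of China"], "score": [7.1, 5.8]}
)

# An outer merge with indicator=True adds a _merge column showing whether
# each row was found in the left frame only, the right frame only, or both
check_df = left_df.merge(right_df, on="country", how="outer", indicator=True)
unmatched_df = check_df[check_df["_merge"] != "both"]
print(unmatched_df[["country", "_merge"]])
```

Listing the unmatched rows makes it easier to decide which mismatches can be repaired and which simply reflect missing data.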


As with joining tables in relational databases, sometimes the datasets to be merged have more than one field in common.

As seen in Q2, the economic data was provided not just by country, but also by year. If you join just on the country, you will get a semi-Cartesian product of the year values.

The following just looks at Australia, so you can see the effect of not joining the data frames correctly with just a small amount of data.

In [14]:
# load the national data
sdg11_data_national_df = pd.read_csv(
    "data/cultural_heritage_data/SDG11_DATA_NATIONAL.csv",
    encoding="UTF-8",
    engine="python",
)
sdg11_data_national_df.head()

Unnamed: 0,INDICATOR_ID,COUNTRY_ID,YEAR,VALUE,MAGNITUDE,QUALIFIER
0,HEXPCSTPPPCAP.CULHER,AUS,2020,42.5165,,UIS_EST
1,HEXPCSTPPPCAP.CULHER,BFA,2019,0.03098,,
2,HEXPCSTPPPCAP.CULHER,BIH,2019,50.8496,,
3,HEXPCSTPPPCAP.CULHER,BIH,2020,47.78392,,
4,HEXPCSTPPPCAP.CULHER,BIH,2021,49.58025,,


In [15]:
# load the country names so we can identify the country Ids
sdg11_country_df = pd.read_csv(
    "data/cultural_heritage_data/SDG11_COUNTRY.csv",
    encoding="UTF-8",
    engine="python",
)

In [16]:
# pull out just the Australian data and merge with the Country names
data_national_df = sdg11_data_national_df[sdg11_data_national_df["INDICATOR_ID"]=="HEXPCSTPPPCAP.CULHER"]
aus_national_df = data_national_df[data_national_df["COUNTRY_ID"]=="AUS"]
aus_national_df = aus_national_df.merge(sdg11_country_df)
aus_national_df

Unnamed: 0,INDICATOR_ID,COUNTRY_ID,YEAR,VALUE,MAGNITUDE,QUALIFIER,COUNTRY_NAME_EN
0,HEXPCSTPPPCAP.CULHER,AUS,2020,42.5165,,UIS_EST,Australia


In [17]:
# load the happiness data, but again just look at Australia
happiness_df = pd.read_excel("data/happiness_2024.xls").rename(
    {"Country name": "country_name"}, axis="columns"
)
aus_happiness_df = happiness_df[happiness_df["country_name"]=="Australia"]
aus_happiness_df

Unnamed: 0,country_name,year,Life Ladder,Log GDP per capita,Social support,Healthy life expectancy at birth,Freedom to make life choices,Generosity,Perceptions of corruption,Positive affect,Negative affect
81,Australia,2005,7.340688,10.662058,0.967892,69.800003,0.934973,,0.390416,0.76977,0.238012
82,Australia,2007,7.285391,10.694434,0.965276,69.959999,0.890682,0.341637,0.512578,0.762304,0.215351
83,Australia,2008,7.253757,10.709456,0.946635,70.040001,0.915733,0.299933,0.430811,0.728992,0.218427
84,Australia,2010,7.450047,10.713649,0.95452,70.199997,0.932059,0.311334,0.366127,0.761716,0.220073
85,Australia,2011,7.405616,10.723386,0.967029,70.279999,0.944586,0.363977,0.381772,0.724132,0.195324
86,Australia,2012,7.195586,10.744205,0.944599,70.360001,0.935146,0.268277,0.368252,0.728092,0.214397
87,Australia,2013,7.364169,10.752455,0.928205,70.440002,0.933379,0.263428,0.431539,0.770147,0.177142
88,Australia,2014,7.28855,10.763002,0.923799,70.519997,0.922932,0.313164,0.442021,0.739815,0.245304
89,Australia,2015,7.309061,10.769909,0.951862,70.599998,0.921871,0.326533,0.356554,0.749504,0.209637
90,Australia,2016,7.25008,10.781229,0.942334,70.675003,0.922316,0.233216,0.398545,0.735896,0.236086


These two datasets have been reduced to just the Australian rows. If we merge these two datasets, how many rows would you expect in the merged dataset?

Look at the `aus_national_df` results above - there is one record for Australia from 2020. `aus_happiness_df` has more rows since it has data for most years between 2005 and 2023 (though not all).

How many rows in `aus_happiness_df`? Let's do a quick describe:

In [18]:
aus_happiness_df.describe()

Unnamed: 0,year,Life Ladder,Log GDP per capita,Social support,Healthy life expectancy at birth,Freedom to make life choices,Generosity,Perceptions of corruption,Positive affect,Negative affect
count,17.0,17.0,17.0,17.0,17.0,17.0,16.0,17.0,17.0,17.0
mean,2014.764706,7.242307,10.764977,0.942319,70.570588,0.914413,0.254107,0.429199,0.737576,0.218661
std,5.425972,0.119461,0.051867,0.018212,0.4219,0.023014,0.076539,0.053819,0.019191,0.02096
min,2005.0,7.024582,10.662058,0.89646,69.800003,0.853777,0.115171,0.356554,0.705983,0.177142
25%,2011.0,7.176993,10.723386,0.936517,70.279999,0.91055,0.197963,0.390416,0.726976,0.205078
50%,2015.0,7.253757,10.769909,0.942774,70.599998,0.917537,0.265852,0.430209,0.731053,0.218427
75%,2019.0,7.309061,10.800653,0.951862,70.900002,0.932059,0.312309,0.453676,0.749504,0.236086
max,2023.0,7.450047,10.846434,0.967892,71.199997,0.944586,0.363977,0.545217,0.770147,0.248163


The count is 17, so there are 17 rows here. Should we get 1 row or 17 from the merge?

In [19]:
aus_combined_df = aus_national_df.merge(aus_happiness_df, left_on="COUNTRY_NAME_EN", right_on="country_name")
aus_combined_df

Unnamed: 0,INDICATOR_ID,COUNTRY_ID,YEAR,VALUE,MAGNITUDE,QUALIFIER,COUNTRY_NAME_EN,country_name,year,Life Ladder,Log GDP per capita,Social support,Healthy life expectancy at birth,Freedom to make life choices,Generosity,Perceptions of corruption,Positive affect,Negative affect
0,HEXPCSTPPPCAP.CULHER,AUS,2020,42.5165,,UIS_EST,Australia,Australia,2005,7.340688,10.662058,0.967892,69.800003,0.934973,,0.390416,0.76977,0.238012
1,HEXPCSTPPPCAP.CULHER,AUS,2020,42.5165,,UIS_EST,Australia,Australia,2007,7.285391,10.694434,0.965276,69.959999,0.890682,0.341637,0.512578,0.762304,0.215351
2,HEXPCSTPPPCAP.CULHER,AUS,2020,42.5165,,UIS_EST,Australia,Australia,2008,7.253757,10.709456,0.946635,70.040001,0.915733,0.299933,0.430811,0.728992,0.218427
3,HEXPCSTPPPCAP.CULHER,AUS,2020,42.5165,,UIS_EST,Australia,Australia,2010,7.450047,10.713649,0.95452,70.199997,0.932059,0.311334,0.366127,0.761716,0.220073
4,HEXPCSTPPPCAP.CULHER,AUS,2020,42.5165,,UIS_EST,Australia,Australia,2011,7.405616,10.723386,0.967029,70.279999,0.944586,0.363977,0.381772,0.724132,0.195324
5,HEXPCSTPPPCAP.CULHER,AUS,2020,42.5165,,UIS_EST,Australia,Australia,2012,7.195586,10.744205,0.944599,70.360001,0.935146,0.268277,0.368252,0.728092,0.214397
6,HEXPCSTPPPCAP.CULHER,AUS,2020,42.5165,,UIS_EST,Australia,Australia,2013,7.364169,10.752455,0.928205,70.440002,0.933379,0.263428,0.431539,0.770147,0.177142
7,HEXPCSTPPPCAP.CULHER,AUS,2020,42.5165,,UIS_EST,Australia,Australia,2014,7.28855,10.763002,0.923799,70.519997,0.922932,0.313164,0.442021,0.739815,0.245304
8,HEXPCSTPPPCAP.CULHER,AUS,2020,42.5165,,UIS_EST,Australia,Australia,2015,7.309061,10.769909,0.951862,70.599998,0.921871,0.326533,0.356554,0.749504,0.209637
9,HEXPCSTPPPCAP.CULHER,AUS,2020,42.5165,,UIS_EST,Australia,Australia,2016,7.25008,10.781229,0.942334,70.675003,0.922316,0.233216,0.398545,0.735896,0.236086


17 rows it is, but is this the right answer?

Look carefully at the results. You will have to scroll to see the year column from `happiness_df`. Is this what you should expect?

Really the answer should be only one row, for the 2020 record. The other years available in the happiness data should be left out, since we do not have any other years in the economic data for Australia.

In [20]:
# join over the two common columns
aus_combined_df = aus_national_df.merge(aus_happiness_df, left_on=["COUNTRY_NAME_EN", "YEAR"], right_on=["country_name","year"])
aus_combined_df

Unnamed: 0,INDICATOR_ID,COUNTRY_ID,YEAR,VALUE,MAGNITUDE,QUALIFIER,COUNTRY_NAME_EN,country_name,year,Life Ladder,Log GDP per capita,Social support,Healthy life expectancy at birth,Freedom to make life choices,Generosity,Perceptions of corruption,Positive affect,Negative affect
0,HEXPCSTPPPCAP.CULHER,AUS,2020,42.5165,,UIS_EST,Australia,Australia,2020,7.137368,10.794416,0.936517,70.974998,0.905283,0.201515,0.491095,0.725689,0.205078


This time we do only have the one record.

The moral of the story is: always check the results produced. Just because your dataframe has some data in it does not mean the data is correct!
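pandas can also help with this check. The `validate` parameter of `merge` raises a `MergeError` when the join keys do not have the relationship you expect. The sketch below uses small made-up data frames (the names and values are assumptions, not the TMA data) to show the idea.

```python
import pandas as pd
from pandas.errors import MergeError

# Illustrative data only: one spending record, several happiness records
spend_demo_df = pd.DataFrame(
    {"country": ["Australia"], "YEAR": [2020], "VALUE": [40.0]}
)
happy_demo_df = pd.DataFrame({
    "country": ["Australia", "Australia", "Australia"],
    "year": [2019, 2020, 2021],
    "ladder": [7.0, 7.1, 7.2],
})

# Joining on country alone is not one-to-one, so validate raises an error
try:
    spend_demo_df.merge(happy_demo_df, on="country", validate="one_to_one")
    country_only_ok = True
except MergeError:
    country_only_ok = False

# Joining on country and year is one-to-one and gives a single row
combined_demo_df = spend_demo_df.merge(
    happy_demo_df,
    left_on=["country", "YEAR"],
    right_on=["country", "year"],
    validate="one_to_one",
)
print(country_only_ok, len(combined_demo_df))
```

Stating the expected relationship up front means an unintended many-to-one join fails loudly instead of quietly inflating the row count.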

**Next Steps**

We will now move onto a review of TMA02: `TMA02 Overview 24b.ipynb`.
