# City data from Eurostat
## Urban Audit
We get the data from http://ec.europa.eu/eurostat/statistics-explained/index.php/Statistics_on_European_cities#Cities_.28Urban_Audit.29

Looking at the webpage and navigating around, we see an interesting statistics file for "Population on 1 January by age groups and sex - cities and greater cities (urb_cpop1)" at http://ec.europa.eu/eurostat/web/cities/data/database

A local copy of this file is available at ./data/urb_cpop1.tsv (the file format seems to be TSV).

In [7]:
import os # for getting file size
import csv # for handling csv/tsv files

In [52]:
urbanAuditFile="./data/urb_cpop1.tsv"
fSize = os.path.getsize(urbanAuditFile)

print('File size of '+urbanAuditFile+' is: '+str(fSize/1024/1024) + ' MBytes')

File size of ./data/urb_cpop1.tsv is: 4.1985626220703125 MBytes


Looks like a "bigger" file with about 4.2 MB.
Let's have a look at the first couple of lines.

In [53]:
with open(urbanAuditFile) as f:
    content=f.read()
    c=0
    for line in content.split("\n"): 
        c+=1
        if c<10:
            print("line"+str(c)+">> "+line)
print("total lines "+str(c))

line1>> indic_ur,cities\time	2015 	2014 	2013 	2012 	2011 	2010 	2009 	2008 	2007 	2006 	2005 	2004 	2003 	2002 	2001 	2000 	1999 	1998 	1997 	1996 	1995 	1994 	1993 	1992 	1991 	1990 
line2>> DE1001V,AT	: 	: 	8451860 	: 	: 	: 	8355260 	8318592 	8282984 	8254298 	8201359 	8140122 	: 	8083797 	8032875 	8011566 	7992323 	8075425 	8067812 	8054802 	8039865 	8015027 	7962003 	7867796 	7795786 	: 
line3>> DE1001V,AT001C1	: 	: 	1741246 	: 	: 	: 	1687271 	1674909 	1661246 	1652449 	1632569 	1598626 	: 	: 	1550123 	1615438 e	1608144 e	1606843 e	1609631 e	1595402 e	: 	: 	: 	: 	1539848 	: 
line4>> DE1001V,AT002C1	: 	: 	265778 	: 	: 	: 	253994 	250738 	247624 	244997 	241298 	235477 	: 	: 	226244 	: 	: 	: 	: 	: 	: 	: 	: 	: 	237810 	: 
line5>> DE1001V,AT003C1	: 	: 	191501 	: 	: 	: 	189122 	188593 	188393 	187936 	186781 	185530 	: 	: 	183504 	: 	: 	: 	: 	: 	: 	: 	: 	: 	203044 	: 
line6>> DE1001V,AT004C1	: 	: 	145871 	: 	: 	: 	147732 	147169 	146845 	146484 	145124 	145680 	: 	: 	142662 	: 	: 	: 	:

Interesting, this looks like a tab-separated ("\t") file. The only "," appears inside the first column; all other values are separated by tabs.
The lines seem to have 27 elements (the first column holds the indicator, followed by 26 columns for 2015 down to 1990).

Let's parse the file with the csv module to get nice "rows".

In [54]:
with open(urbanAuditFile) as f:
    csvfile = csv.reader(f, delimiter="\t")
    c=0 # counter for the parsed line numbers
    N=2 # we stop once more than N rows have been printed, i.e. after N+1 rows
    for row in csvfile:
        c+=1
        print(row)
        print("Length of row "+str(len(row)))
        if c>N:
            break

['indic_ur,cities\\time', '2015 ', '2014 ', '2013 ', '2012 ', '2011 ', '2010 ', '2009 ', '2008 ', '2007 ', '2006 ', '2005 ', '2004 ', '2003 ', '2002 ', '2001 ', '2000 ', '1999 ', '1998 ', '1997 ', '1996 ', '1995 ', '1994 ', '1993 ', '1992 ', '1991 ', '1990 ']
Length of row 27
['DE1001V,AT', ': ', ': ', '8451860 ', ': ', ': ', ': ', '8355260 ', '8318592 ', '8282984 ', '8254298 ', '8201359 ', '8140122 ', ': ', '8083797 ', '8032875 ', '8011566 ', '7992323 ', '8075425 ', '8067812 ', '8054802 ', '8039865 ', '8015027 ', '7962003 ', '7867796 ', '7795786 ', ': ']
Length of row 27
['DE1001V,AT001C1', ': ', ': ', '1741246 ', ': ', ': ', ': ', '1687271 ', '1674909 ', '1661246 ', '1652449 ', '1632569 ', '1598626 ', ': ', ': ', '1550123 ', '1615438 e', '1608144 e', '1606843 e', '1609631 e', '1595402 e', ': ', ': ', ': ', ': ', '1539848 ', ': ']
Length of row 27


Ok, that's what we thought: tab delimiter and 27 columns.
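So far we only looked at the first three rows. A minimal sketch of how one could check the column count for *all* rows is shown below. The helper name `count_row_lengths` and the in-memory sample are our own assumptions for illustration; they are not part of the original notebook.

```python
import csv
from collections import Counter

def count_row_lengths(lines, delimiter="\t"):
    """Count how many rows have each number of columns.

    `lines` is any iterable of text lines, e.g. an open file object.
    """
    lengths = Counter()
    for row in csv.reader(lines, delimiter=delimiter):
        lengths[len(row)] += 1
    return lengths

# Tiny in-memory sample in the same layout as the TSV file (shortened to two years)
sample = [
    "indic_ur,cities\\time\t2015 \t2014 \n",
    "DE1001V,AT\t: \t8451860 \n",
    "DE1001V,AT001C1\t: \t1741246 \n",
]
print(count_row_lengths(sample))
```

For the real file one would pass the open file object, e.g. `with open(urbanAuditFile) as f: print(count_row_lengths(f))`. If the result contains a single key, every row has the same number of columns.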

In [55]:
ind=[]
with open(urbanAuditFile) as f:
    csvfile = csv.reader(f, delimiter="\t")
    c=0
    for row in csvfile:
        c+=1
        if c>1: #we skip the first header row
            ind.append(row[0])
print("number of unique identifiers "+str(len(set(ind))))

number of unique identifiers 35812


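Ok, 35812 unique identifiers for 35814 lines. The line count presumably includes the header row and the empty string that `split("\n")` leaves after the final newline, so every data row has its own identifier.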
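Comparing counts by eye is a bit fragile. A small sketch of a more direct check is to look for identifiers that occur more than once; `find_duplicates` and the sample list are hypothetical names introduced here for illustration.

```python
from collections import Counter

def find_duplicates(identifiers):
    """Return the identifiers that occur more than once, with their counts."""
    counts = Counter(identifiers)
    return {ident: n for ident, n in counts.items() if n > 1}

# Made-up sample with one repeated identifier
sample_ids = ["DE1001V,AT", "DE1001V,AT001C1", "DE1001V,AT"]
print(find_duplicates(sample_ids))
```

Applied to the list from the cell above, `find_duplicates(ind)` should return an empty dictionary if each data row really has its own identifier.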

In [23]:
for indicator in ind[0:20]:
    print(indicator)#print first 20 indicators

DE1001V,AT
DE1001V,AT001C1
DE1001V,AT002C1
DE1001V,AT003C1
DE1001V,AT004C1
DE1001V,AT005C1
DE1001V,AT006C1
DE1001V,BE
DE1001V,BE001C1
DE1001V,BE002C1
DE1001V,BE003C1
DE1001V,BE004C1
DE1001V,BE005C1
DE1001V,BE006C1
DE1001V,BE007C1
DE1001V,BE008C1
DE1001V,BE009C1
DE1001V,BE010C1
DE1001V,BE011C1
DE1001V,BG


### Observations
Ah ok, there seems to be a pattern. Let's try to understand the indicator codes.
We see two types of codes, both of which use "," to separate the indicator from the country/city:
* DE1001V,AT
* DE1001V,AT001C1

The first part (before the ",") is the indicator variable; the second part refers to the country or city.
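As a small sketch, the two parts can be separated with `str.split`. The helper names are our own, and the rule "two letters means country" is an assumption derived from the examples above, not something the dataset documentation confirms here.

```python
def split_identifier(identifier):
    """Split 'DE1001V,AT001C1' into ('DE1001V', 'AT001C1')."""
    indicator, geo = identifier.split(",", 1)
    return indicator, geo

def is_country_code(geo):
    """Assumption: a bare two-letter code denotes a country, anything longer a city."""
    return len(geo) == 2

for ident in ["DE1001V,AT", "DE1001V,AT001C1"]:
    indicator, geo = split_identifier(ident)
    kind = "country" if is_country_code(geo) else "city"
    print(indicator, geo, kind)
```

Checking the length of the second part is a little more explicit than comparing the length of the whole identifier string, but both approaches rest on the same assumption.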


## Indicators and their meaning
Browsing through the Eurostat page shows that the indicator codes are listed at "http://dd.eionet.europa.eu/vocabulary/eurostat/indic_ur/" in different formats. 

We downloaded the CSV version to the local file "./data/indic_ur.csv".

In [40]:
indicatorCodesFile="./data/indic_ur.csv"
with open(indicatorCodesFile) as f:
    csvfile = csv.reader(f)
    for i, row in enumerate(csvfile):
        if i<3:
            print(row)
print("Total lines "+str(csvfile.line_num))

['\ufeff"URI"', 'Label', 'Definition', 'Notation', 'Status', 'AcceptedDate']
['http://dd.eionet.europa.eu/vocabulary/eurostat/indic_ur/DE1001V', 'Population on the 1st of January, total', '', 'DE1001V', 'valid', '2013-09-04']
['http://dd.eionet.europa.eu/vocabulary/eurostat/indic_ur/DE1002V', 'Population on the 1st of January, male', '', 'DE1002V', 'valid', '2013-09-04']
Total lines 537


Ok, 537 lines, so 536 different codes plus the header row (assuming no record spans several lines).
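The first header field in the output above starts with `\ufeff`, which is a byte order mark (BOM) at the beginning of the file. A minimal sketch of how Python's `utf-8-sig` codec removes it is shown below; the sample bytes and the `example.org` URI are made up for illustration.

```python
import csv
import io

# Made-up sample: UTF-8 bytes that start with a BOM, like the downloaded CSV
raw = '\ufeff"URI","Label"\n"http://example.org/X","Some label"\n'.encode("utf-8")

def read_header(raw_bytes, encoding):
    """Decode the bytes with the given encoding and return the first CSV row."""
    text = io.StringIO(raw_bytes.decode(encoding), newline="")
    return next(csv.reader(text))

print(read_header(raw, "utf-8"))      # BOM stays glued to the first field
print(read_header(raw, "utf-8-sig"))  # BOM is stripped while decoding
```

For the files in this notebook, the same effect should be achievable with `open(indicatorCodesFile, encoding="utf-8-sig", newline="")`. Since we only use the columns at index 1 and 3, the BOM does not affect the results here.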

In [34]:
# check if one of the codes in the first lines appears in this file
with open(indicatorCodesFile) as f:
    csvfile = csv.reader(f)
    for i, row in enumerate(csvfile):
        if 'DE1001V' in row[3]:
            print(row)

['http://dd.eionet.europa.eu/vocabulary/eurostat/indic_ur/DE1001V', 'Population on the 1st of January, total', '', 'DE1001V', 'valid', '2013-09-04']


Ok, let's keep that in mind; we will come back to this later.

## City & Country codes
The city codes are available at "http://dd.eionet.europa.eu/vocabulary/eurostat/cities/", either as "RDF":"http://dd.eionet.europa.eu/vocabulary/eurostat/cities/rdf;jsessionid=5AF9B4AF77C5DADF55069F856234E145" or as "CSV":"http://dd.eionet.europa.eu/vocabulary/eurostat/cities/csv;jsessionid=5AF9B4AF77C5DADF55069F856234E145"
A local copy of the CSV file is available at "./data/cities.csv"; it lists the cities including their codes and names.

In [41]:
cityCodeFile="./data/cities.csv"
with open(cityCodeFile) as f:
    csvfile = csv.reader(f)
    for i,row in enumerate(csvfile):
        if i<1:
            print(row)
print("Total lines "+str(csvfile.line_num))

['\ufeff"URI"', 'Label', 'Definition', 'Notation', 'Status', 'AcceptedDate', 'coreCityOf', 'hasDistrict', 'hasDistrict', 'hasDistrict', 'hasDistrict', 'hasDistrict', 'hasDistrict', 'hasDistrict', 'hasDistrict', 'hasDistrict', 'hasDistrict', 'hasDistrict', 'hasDistrict', 'hasDistrict', 'hasDistrict', 'hasDistrict', 'hasDistrict', 'hasDistrict', 'hasDistrict', 'hasDistrict', 'hasDistrict', 'hasDistrict', 'hasDistrict', 'hasDistrict', 'hasDistrict', 'hasDistrict', 'hasDistrict', 'hasDistrict', 'hasDistrict', 'hasDistrict', 'hasDistrict', 'hasDistrict', 'hasDistrict', 'hasDistrict', 'hasDistrict', 'hasDistrict', 'hasDistrict', 'hasDistrict', 'hasDistrict', 'hasDistrict', 'hasDistrict', 'hasDistrict', 'hasDistrict', 'hasDistrict', 'hasDistrict', 'hasDistrict', 'hasDistrict', 'hasDistrict', 'hasDistrict', 'hasDistrict', 'hasDistrict', 'hasDistrict', 'hasDistrict', 'hasDistrict', 'hasDistrict', 'hasDistrict', 'inCountry', 'owl:sameAs', 'skos:broader', 'skos:closeMatch', 'skos:exactMatch', 'sk

Okay, that is a lot of information for each city code, and we have 11595 lines. 

Maybe we want to keep only the columns Label (index 1) and Notation (index 3). 

In [47]:
with open(cityCodeFile) as f:
    csvfile = csv.reader(f)
    for i,row in enumerate(csvfile):
        if i<4:
            print(row[1],row[3])

Label Notation
Austria AT
Wien AT001C
Wien AT001C1


Let's check if they have "Wien":

In [48]:
with open(cityCodeFile) as f:
    csvfile = csv.reader(f)
    for i,row in enumerate(csvfile):
        if 'Wien' in row[1]:
            print(row[1],row[3])

Wien AT001C
Wien AT001C1
Wien AT001L2


Ah, three different codes for "Wien", go figure ;) The suffix seems to distinguish different spatial units of the same city.

### Country Codes

Countries use ISO two-letter codes, e.g. available on "datahub.io":"https://datahub.io/de/dataset/iso-3166-1-alpha-2-country-codes/resource/9c3b30dd-f5f3-4bbe-a3cb-d7b2c21d66ce"
 * "CSV":"./data/iso_3166_2_countries.csv" list of countries and country codes.

In [49]:
countryCodeFile="./data/iso_3166_2_countries.csv"
with open(countryCodeFile) as f:
    csvfile = csv.reader(f)
    for i,row in enumerate(csvfile):
        if i<1:
            print(row)
print("Total lines "+str(csvfile.line_num))

['Sort Order', 'Common Name', 'Formal Name', 'Type', 'Sub Type', 'Sovereignty', 'Capital', 'ISO 4217 Currency Code', 'ISO 4217 Currency Name', 'ITU-T Telephone Code', 'ISO 3166-1 2 Letter Code', 'ISO 3166-1 3 Letter Code', 'ISO 3166-1 Number', 'IANA Country Code TLD']
Total lines 269


Again, lots of columns, but we are mainly interested in the mapping between common name (index 1) and the ISO 2 Letter code (index 10)

In [51]:
with open(countryCodeFile) as f:
    csvfile = csv.reader(f)
    for i,row in enumerate(csvfile):
        if row[1] == 'Austria':
            print(row[1], row[10])


Austria AT


## Merging
At this stage, we are able to create dictionaries for

* Indicator -to-> name
* CityCode -to-> city name
* CountryCode -to-> country name

These mappings could be used to provide more "human friendly" labels in the output.
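A minimal sketch of how such a dictionary could be built with one generic helper is shown below. The function name `build_lookup` and the in-memory sample are our own assumptions; the column positions (Label at index 1, Notation at index 3) are the ones we saw in the outputs above.

```python
import csv

def build_lookup(lines, key_col, value_col, skip_header=True, **reader_kwargs):
    """Map row[key_col] -> row[value_col] for every row of a CSV source.

    `lines` is any iterable of text lines, e.g. an open file object.
    """
    reader = csv.reader(lines, **reader_kwargs)
    if skip_header:
        next(reader, None)  # drop the header row if there is one
    return {row[key_col]: row[value_col] for row in reader}

# Shortened stand-in for ./data/cities.csv
cities_sample = [
    "URI,Label,Definition,Notation\n",
    "u1,Austria,,AT\n",
    "u2,Wien,,AT001C1\n",
]
city_names = build_lookup(cities_sample, key_col=3, value_col=1)
print(city_names.get("AT001C1", "unknown code"))
```

The same helper would work for the indicator file (`key_col=3, value_col=1`) and the country file (`key_col=10, value_col=1`). Using `.get()` with a default avoids a `KeyError` for codes that are missing from the code list.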

# Investigating some "interesting" questions

## Which one is the biggest city?
So what exactly do we mean? 
* Biggest city in terms of biggest population over all 26 years? 
* Or biggest city in terms of overall population in 2015? 

#### Let's get the biggest city in 2015

This can be done with a single scan that always keeps the running maximum.

We filter all rows which have the indicator **DE1001V** (= Population on the 1st of January, total) and a city code, that is, rows whose identifier string is longer than "DE1001V,AT" (which would be only a country).

In [57]:
cityCode=''
cityPopulation2015=0

with open(urbanAuditFile) as f:
    csvfile = csv.reader(f, delimiter="\t")
    for row in csvfile:
        if 'DE1001V' in row[0] and len(row[0])> len("DE1001V,AT"):
            if int(row[1])>cityPopulation2015: # row[1] is the population of 2015
                cityPopulation2015=row[1]
                cityCode=row[0].split(",")[1] # split the indicator by "," and take the second part
print("Biggest city: "+cityCode+" with "+str(cityPopulation2015)+" citizens")

ValueError: invalid literal for int() with base 10: ': '

Interesting, row[1] is a str value by default, so we first need to cast it to int. 

Next, it seems that not all row[1] values are numbers: missing values are indicated by ":".

So let's adapt our filtering approach.

We first remove all leading and trailing whitespace with the string method **strip()**.

In [68]:
cityCode=''
cityPopulation2015=0

with open(urbanAuditFile) as f:
    csvfile = csv.reader(f, delimiter="\t")
    for row in csvfile:
        if 'DE1001V' in row[0] and len(row[0])> len("DE1001V,AT"):
            popCount=row[1].strip()
            if popCount != ":" and int(popCount)>cityPopulation2015: # row[1] is the population of 2015
                cityPopulation2015=int(popCount)
                cityCode=row[0].split(",")[1] # split the indicator by "," and take the second part
print("Biggest city: "+cityCode+" with "+str(cityPopulation2015)+" citizens")

ValueError: invalid literal for int() with base 10: '205768 de'

Hmm, clean data definitely looks different!

It seems that some population values carry an additional suffix, separated by a space (here "de"; presumably one of Eurostat's flags, such as "e" for estimated values).

We can remove it as well by splitting the string at the " " (whitespace) and keeping only the first part.
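The cleaning rules we collected so far (strip whitespace, treat ":" as missing, drop the flag suffix) can be bundled into one small function. This is only a sketch: the name `parse_value` is our own, and it assumes that the numeric part is always a plain integer.

```python
def parse_value(cell):
    """Turn a raw cell such as '1615438 e' or ': ' into an int, or None if missing."""
    parts = cell.strip().split()  # split on whitespace: ['1615438', 'e'] or [':']
    if not parts or parts[0] == ":":
        return None               # empty cell or missing-value marker
    return int(parts[0])          # keep the number, ignore any flag suffix

for cell in ["8451860 ", ": ", "205768 de", "1615438 e"]:
    print(repr(cell), "->", parse_value(cell))
```

Returning `None` for missing values keeps "no value" clearly separate from a real population of 0.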

In [71]:
cityCode=''
cityPopulation=0
year=1 #2015

with open(urbanAuditFile) as f:
    csvfile = csv.reader(f, delimiter="\t")
    for row in csvfile:
        if 'DE1001V' in row[0] and len(row[0])> len("DE1001V,AT"):
            popCount=row[year].strip().split(" ")[0]
            if popCount != ":" and int(popCount)>cityPopulation: # row[year] is the population of the selected year
                cityPopulation=int(popCount)
                cityCode=row[0].split(",")[1] # split the indicator by "," and take the second part
print("Biggest city: "+cityCode+" with "+str(cityPopulation)+" citizens")

Biggest city: PT001K1 with 1835785 citizens


Ah cool, finally it works, and **PT001K1** is the biggest city in 2015 (among those for which we have population values). 

Let's try to convert this city code to a more human-readable string (or do you know which city it is?).

In [72]:
#Building the cityCode to Label map
cityCodeMap={}
with open(cityCodeFile) as f:
    csvfile = csv.reader(f)
    for i,row in enumerate(csvfile):
        cityCodeMap[row[3]]= row[1]

In [73]:
#let's convert the city code to the label
print("Biggest city: "+cityCodeMap[cityCode]+" with "+str(cityPopulation)+" citizens")

Biggest city: Lisboa (greater city) with 1835785 citizens


Better, but it is somewhat strange that Lisboa is the biggest city. Maybe this is due to missing values; let's check 2014.

In [77]:
cityCode=''
cityPopulation=0
year=2 #2014

with open(urbanAuditFile) as f:
    csvfile = csv.reader(f, delimiter="\t")
    for row in csvfile:
        if 'DE1001V' in row[0] and len(row[0])> len("DE1001V,AT"):
            popCount=row[year].strip().split(" ")[0]
            if popCount != ":" and int(popCount)>cityPopulation: # row[year] is the population of the selected year (here 2014)
                cityPopulation=int(popCount)
                cityCode=row[0].split(",")[1] # split the indicator by "," and take the second part
print("Biggest city: "+cityCodeMap[cityCode]+" with "+str(cityPopulation)+" citizens")

Biggest city: London (greater city) with 8477600 citizens


Ah, ok, this looks more plausible: considering that this is Eurostat data, London might be the biggest city in Europe.

However, Wikipedia lists Istanbul (14,657,434) and Moscow (12,330,126) as bigger cities in Europe (source: https://en.wikipedia.org/wiki/List_of_European_cities_by_population_within_city_limits) 

**An interesting question: are these two cities in the dataset or not? How could you figure this out?**


## What are the (most recent) populations per city?

Hmm, well, this is just an adaptation of our algorithm for finding the biggest city in 2015 (which is the most recent year). However, if the value for 2015 is missing, we can take the next most recent value. 

Let's try this.

We start with a dictionary in which the key is the city and the value is another dictionary holding the population number and the most recent year. 

In addition, we need to loop over the years to find the most recent year with an existing population value.
Since we have a list and operate with indices rather than labels, we want to 
* store the first csv row in a variable headerRow
* use the headerRow to convert the year index into a year label

But let's do it step by step.

In [87]:
recentCityPopulation={} # key= city , value = {'year': year, 'pop':population}

with open(urbanAuditFile) as f:
    csvfile = csv.reader(f, delimiter="\t")
    for i, row in enumerate(csvfile):
        if i==0:
            headerRow=row
        if 'DE1001V' in row[0] and len(row[0])> len("DE1001V,AT"): # same filter condition as before
            for yearIndex in range (1,27): # this generates all numbers from 1 to 26
                popCount=row[yearIndex].strip().split(" ")[0]
                if popCount != ":" and int(popCount)>0: # check if value exists and we can convert it
                    #ok we have a population value >0, let's store it (without a break, older years overwrite it later)
                    cityPopulation=int(popCount)
                    cityCode=row[0].split(",")[1]
                    recentCityPopulation[cityCode]={'year':yearIndex,'pop': cityPopulation}
                    
print(recentCityPopulation)


{'FR219C1': {'year': 7, 'pop': '130194'}, 'NL509C1': {'year': 6, 'pop': '121532'}, 'UK524C1': {'year': 15, 'pop': '210160'}, 'PT508C1': {'year': 16, 'pop': '122596'}, 'UK517C1': {'year': 15, 'pop': '223302'}, 'RO005C1': {'year': 25, 'pop': '249662'}, 'DK001C1': {'year': 25, 'pop': '464773'}, 'SE005C1': {'year': 26, 'pop': '90004'}, 'ES515C1': {'year': 15, 'pop': '166187'}, 'PL516C1': {'year': 16, 'pop': '107416'}, 'IT012C1': {'year': 25, 'pop': '255824'}, 'FR051C2': {'year': 7, 'pop': '126477'}, 'PL506C1': {'year': 16, 'pop': '178611'}, 'PT019C1': {'year': 6, 'pop': '63674'}, 'IT004C1': {'year': 25, 'pop': '962507'}, 'UK002K1': {'year': 6, 'pop': '2390000'}, 'DE513C1': {'year': 15, 'pop': '194748'}, 'FR009C1': {'year': 26, 'pop': '1067345'}, 'IT058C1': {'year': 6, 'pop': '51404'}, 'PT501C1': {'year': 16, 'pop': '362976'}, 'RO016C1': {'year': 15, 'pop': '95840'}, 'FR024C2': {'year': 7, 'pop': '199096'}, 'AT003C1': {'year': 25, 'pop': '203044'}, 'RO012C1': {'year': 25, 'pop': '75954'}, '

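Ok, looks "goodish", but it is hard to read for a human and I do not want to present the values like that. Note also that the loop above has no `break`, so older years keep overwriting the entry and we end up with high year indices. Let's fix it. 
We can convert the cityCode to city labels, replace the year indices with their column headers, and stop at the first (most recent) value.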

In [114]:
recentCityPopulation={} # key= city , value = {'year': year, 'pop':population}



## a quick trick to be able to convert the yearIndex to its label
headerRow=[]
with open(urbanAuditFile) as f:
    csvfile = csv.reader(f, delimiter="\t")
    for i, row in enumerate(csvfile):
        if i==0:
            headerRow=row
        if 'DE1001V' in row[0] and len(row[0])> len("DE1001V,AT"): # same filter condition as before
            for yearIndex in range (1,27): # this generates all numbers from 1 to 26
                popCount=row[yearIndex].strip().split(" ")[0]
                if popCount != ":" and int(popCount)>0: # check if value exists and we can convert it
                    #ok we have a population value >0, that is the most recent year, let's store this
                    cityPopulation=int(popCount)
                    cityCode=row[0].split(",")[1]
                    cityLabel=cityCodeMap[cityCode] # let's just merge the mapping from city code to label in this code
                    recentCityPopulation[cityLabel]={'year':headerRow[yearIndex],'pop': cityPopulation} # we also convert the year index to the year label
                    break
## let's also fix the printing 
print("Number of cities: "+str(len(recentCityPopulation.keys())))
for city, pop in recentCityPopulation.items():
    print("  "+city+":"+str(pop['pop'])+" in "+str(pop['year']))

Number of cities: 975
  Liverpool:471900 in 2014 
  Guadalajara:83720 in 2014 
  Stoke-on-trent:250600 in 2014 
  Reims:210177 in 2013 
  Roosendaal:77155 in 2013 
  Versailles:182105 in 2013 
  Tromsø:70358 in 2013 
  Frankenthal (Pfalz):47332 in 2014 
  Trabzon:230618 in 2004 
  Tilburg:208527 in 2013 
  Alcobendas:112188 in 2014 
  Ede:109823 in 2013 
  Vitoria/Gasteiz:242082 in 2014 
  Douai:150935 in 2013 
  Tarbes:73340 in 2013 
  Trier:107233 in 2014 
  CC des Coteaux de la Seine:37060 in 2013 
  Nitra:78033 in 2014 
  Verona:259966 in 2014 
  Enfield:322500 in 2014 
  Banská Bystrica:79027 in 2014 
  Wolfsburg:122457 in 2014 
  Almada:170139 in 2015 
  Charleroi:203640 in 2014 
  Cartagena:216451 in 2014 
  Bristol:440000 in 2014 
  Lübeck:212958 in 2014 
  Jönköping:128305 in 2011 
  Bielefeld:328864 in 2014 
  Koszalin:109170 in 2013 
  Barreiro:76775 in 2015 
  Palencia:80178 in 2014 
  Jelgava:57332 in 2014 
  Lubin:74053 in 2013 
  Milton Keynes:257500 in 2014 
  Pisa:8862
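The inner loop over the year columns can also be isolated in a small helper, which makes the "stop at the first hit" logic easier to see. This is a sketch under the assumption that the columns are ordered from the newest to the oldest year, as in the header row above; the name `most_recent_value` is our own.

```python
def most_recent_value(row, header):
    """Return (year_label, population) for the first non-missing column, or None."""
    # Skip column 0 (the identifier) and walk through the years, newest first
    for year_label, cell in zip(header[1:], row[1:]):
        parts = cell.split()
        if parts and parts[0] != ":":
            # First hit is the most recent year; also strip the trailing space of the label
            return year_label.strip(), int(parts[0])
    return None  # the row has no population value at all

# Shortened sample rows in the layout of the TSV file
header = ["indic_ur,cities\\time", "2015 ", "2014 ", "2013 "]
row = ["DE1001V,AT001C1", ": ", ": ", "1741246 "]
print(most_recent_value(row, header))
```

Returning from the function at the first hit plays the same role as the `break` in the cell above.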

### Further exercise

We leave it up to you to play around with this interesting data. Some ideas you could explore are:
* How many cities have a value in 2015?
* Sort them first by year and then by population

### Which ones are the 10 biggest cities?

We show here an algorithm which performs a scan over the data and *remembers* the top-k biggest elements.

Another option is to store all values in a list or dictionary and then sort the data structure after a full parse.

We use the code example from http://stevehanov.ca/blog/index.php?id=122 and adapt it for "tuples" as per https://docs.python.org/2/library/heapq.html

In [109]:
#USING A HEAP ( complexity O(n log(k))) 
import heapq
topKCities = []
k=10
yearIndex=2 # column index 2 is the year 2014
with open(urbanAuditFile) as f:
    csvfile = csv.reader(f, delimiter="\t")
    for i, row in enumerate(csvfile):
        if 'DE1001V' in row[0] and len(row[0])> len("DE1001V,AT"): # same filter condition as before
            popCount=row[yearIndex].strip().split(" ")[0]
            if popCount != ":" and int(popCount)>0: # check if value exists and we can convert it
                #ok we have a population value >0 for the selected year, let's store this
                cityCode=row[0].split(",")[1]
                cityLabel=cityCodeMap[cityCode] # lets just merge the mapping from city code to label in this code
                cityPopulation=int(popCount)
                #CODE FROM http://stevehanov.ca/blog/index.php?id=122
                #ADAPTION TO TUPLES FROM https://docs.python.org/2/library/heapq.html
                item=(cityPopulation,cityLabel)
                # If we have not yet found k items, or the current item is larger than
                # the smallest item on the heap,
                if len(topKCities) < k or item > topKCities[0]:
                    # If the heap is full, remove the smallest element on the heap.
                    if len(topKCities) == k: heapq.heappop( topKCities )
                    # add the current element to the heap.
                    heapq.heappush( topKCities, item )
for city in heapq.nlargest(k,topKCities): 
    print(city) #see https://docs.python.org/2/library/heapq.html

(8477600, 'London (greater city)')
(3421829, 'Berlin')
(3207006, 'Milano (greater city)')
(3176357, 'Barcelona (greater city)')
(3176107, 'Napoli (greater city)')
(3165235, 'Madrid')
(2863322, 'Roma')
(2723900, 'Greater Manchester')
(2462300, 'West Midlands urban area')
(2110878, 'Bucuresti')


In [112]:
#USING A Dictionary ( complexity O(n log(n))) 

cities = {}
k=10
yearIndex=2 # column index 2 is the year 2014
with open(urbanAuditFile) as f:
    csvfile = csv.reader(f, delimiter="\t")
    for i, row in enumerate(csvfile):
        if 'DE1001V' in row[0] and len(row[0])> len("DE1001V,AT"): # same filter condition as before
            popCount=row[yearIndex].strip().split(" ")[0]
            if popCount != ":" and int(popCount)>0: # check if value exists and we can convert it
                #ok we have a population value >0 for the selected year, let's store this
                cityCode=row[0].split(",")[1]
                cityLabel=cityCodeMap[cityCode] # lets just merge the mapping from city code to label in this code
                cityPopulation=int(popCount)
                cities[cityLabel]=cityPopulation

#Option1
from operator import itemgetter
topkCitites= sorted(cities.items(), key=itemgetter(1), reverse=True)[0:k] # sort descending
for city in topkCitites:
    print(city)

('London (greater city)', 8477600)
('Berlin', 3421829)
('Milano (greater city)', 3207006)
('Barcelona (greater city)', 3176357)
('Napoli (greater city)', 3176107)
('Madrid', 3165235)
('Roma', 2863322)
('Greater Manchester', 2723900)
('West Midlands urban area', 2462300)
('Bucuresti', 2110878)


In [110]:
#Another Option for the output
from collections import Counter

sortedCitites = Counter(cities)
for city, population in sortedCitites.most_common(k): # separate loop names, so that k is not overwritten
    print(city, population) # most_common(k) returns the k entries with the highest values

London (greater city) 8477600
Berlin 3421829
Milano (greater city) 3207006
Barcelona (greater city) 3176357
Napoli (greater city) 3176107
Madrid 3165235
Roma 2863322
Greater Manchester 2723900
West Midlands urban area 2462300
Bucuresti 2110878
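A further variant, shown here only as a sketch, is to let `heapq.nlargest` do the top-k bookkeeping for us once the values are in a dictionary. The helper name `top_k` is our own; the sample numbers are taken from the output above.

```python
import heapq

def top_k(populations, k):
    """Return the k (label, population) pairs with the largest population."""
    # key=... tells nlargest to compare the pairs by their population value
    return heapq.nlargest(k, populations.items(), key=lambda item: item[1])

sample = {"Berlin": 3421829, "Roma": 2863322, "Madrid": 3165235}
for label, population in top_k(sample, 2):
    print(label, population)
```

With the dictionary from the cell above, this would be `top_k(cities, 10)`. It avoids sorting all entries when only a few of them are needed.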


### How many cities does each country have?

This question will be answered in the lecture