# Some thoughts on missing values

**Prerequisite:** We have lists of cities and many indicators from EUROSTAT. 





**What about non-European cities?** For instance, let's check whether we could/should use imputation by "mean substitution" for US cities for the EUROSTAT indicator "Number of registered cars per 1000 population".
Let's look at the [data](http://ec.europa.eu/eurostat/web/cities/data/database) (cf. dataset urb_ctran) that EUROSTAT provides, starting with this [map](http://ec.europa.eu/eurostat/documents/3217494/5784093/KS-HA-13-001-12-EN.PDF/89d2a7bb-0860-46cc-a494-cd5eeadbafc8?version=1.0) (Map 12.3):

![MapEU](./figures/Europe_cars_per_1000_population.png)

We see that hardly any European city has more than 600 cars per 1000 population.

Let's have a closer look at the CSV data and compute the actual mean of cars per 1000 population in European cities. First, let's try to find the relevant indicator ID:

In [4]:
import csv # for handling csv/tsv files
from statistics import mean

# first, find out which indicator is the number of registered cars per 1000 population:
with open('./data/indic_ur.csv') as f:
    csvfile = csv.reader(f)
    for row in csvfile:
        if "registered cars per 1000" in row[1]:
            print(row[1],':',row[3])
            indic = row[3]

Number of registered cars per 1000 population : TT1057I


In [5]:
# Now, compute the mean for this indicator, using the first available value in each matching row:
cars_per_1000 = []
with open('./data/urb_ctran.tsv') as f:
    csvfile = csv.reader(f, delimiter="\t")
    for row in csvfile:
        if indic in row[0]:
            # cells starting with ':' are missing; a value may be followed by a flag after a space
            n = next((float(s.split(' ')[0]) for s in row[1:] if s and s[0] != ':'), None)
            if n is not None:
                cars_per_1000.append(n)
print(mean(cars_per_1000))

441.3053376906318
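
The generator expression in the cell above is rather dense, so here is a minimal sketch of the same idea as a separate helper. The function name `first_available` and the example row are made up for illustration. The sketch assumes, as the cell above does, that missing cells start with `:` and that a value may be followed by a flag after a space.

```python
def first_available(cells):
    """Return the first non-missing value in a list of raw TSV cells, or None."""
    for cell in cells:
        cell = cell.strip()
        if cell and not cell.startswith(':'):
            # keep the number, drop anything after the first space (e.g. a flag)
            return float(cell.split(' ')[0])
    return None


# hypothetical row, not taken from urb_ctran.tsv
row = ['TT1057I,XX001C1', ': ', '412.5 e', '398 ']
print(first_available(row[1:]))
```

Rows for which every cell is missing yield `None` and would simply be skipped when collecting values for the mean.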


Ok, looks like the number of cars per 1000 in Europe is around 441 on average.

Is this also a plausible/good value to assume for, e.g. US cities? Let's see...


Unfortunately, this data is missing entirely in our UN data. Let's check whether it would be a good idea to work with mean substitution here, i.e. to assume 441 for each US city.
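
To make explicit what mean substitution would do, here is a minimal sketch. The helper `mean_substitute` and the numbers are made up for illustration; missing values are assumed to be represented as `None`.

```python
from statistics import mean


def mean_substitute(values):
    """Replace every None in values by the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# made-up numbers for illustration only
cars = [400.0, None, 500.0, None]
print(mean_substitute(cars))  # [400.0, 450.0, 500.0, 450.0]
```

In our setting the observed entries would all be European cities and the missing ones all US cities, so every US city would receive exactly the same value.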

Probably not a good idea... Why?

Let's look at this other [map](http://www.governing.com/gov-data/car-ownership-numbers-of-vehicles-by-city-map.html):

![MapUS](./figures/US_cars_per_household.png)

It seems to indicate that the mean number of cars per household in US cities is roughly 1.5 (US-wide it is even 1.8). To relate this to our indicator, we need the [average household size](https://www.statista.com/statistics/183648/average-size-of-households-in-the-us/), which is, again roughly, 2.54.

This gives the following estimated mean of cars per 1000 population in US cities:

In [7]:
1.5 * 1000 / 2.54

590.5511811023622

Some thoughts on this: of course, we were working with very rough estimates and assumptions here. Still, the result (about 590 versus 441 cars per 1000 population, roughly a third more) indicates that it is probably not a good idea to impute missing data for cities outside of Europe based on EUROSTAT data only; such cities likely follow different regularities.
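
As a rough sketch, the comparison can be put side by side in code. The helper name `to_cars_per_1000` is made up for illustration; the inputs are the rough figures quoted above (1.5 and 1.8 cars per household, 2.54 persons per household, and the EUROSTAT mean computed earlier), so the results are only as reliable as those estimates.

```python
def to_cars_per_1000(cars_per_household, household_size):
    """Convert cars per household into cars per 1000 population."""
    return cars_per_household * 1000 / household_size


eu_mean = 441.3053376906318            # mean computed from urb_ctran above
us_cities = to_cars_per_1000(1.5, 2.54)  # rough estimate for US cities
us_wide = to_cars_per_1000(1.8, 2.54)    # rough US-wide estimate

for label, value in [("US cities", us_cities), ("US wide", us_wide)]:
    # relative difference to the European mean, as a signed percentage
    print(f"{label}: {value:.0f} cars per 1000 population, "
          f"{value / eu_mean - 1:+.0%} relative to the EUROSTAT mean")
```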

**Discussion:** What else could we do?