## Exercise 02: Loading a sample dataset and calculating the mean.

In this exercise, we will load the `world_population.csv` dataset and calculate the mean of some of its rows and columns.   
Our dataset holds the yearly population density for each country, so Pandas can give us some quick insights with very little code.

#### Loading our dataset

In [45]:
# importing the necessary dependencies
import pandas as pd

In [49]:
# loading the Dataset
dataset = pd.read_csv('./data/world_population.csv', index_col=0)

**Note:**   
`index_col` lets you use any column as the index instead of the auto-incrementing integer index that is added by default. In our case, we want column 0, which holds the country names, as the index.
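
To see what `index_col=0` changes, here is a minimal sketch that reads a tiny in-memory CSV twice. The CSV text and the variable names are made up for illustration and are not taken from `world_population.csv`.

```python
import io
import pandas as pd

# Hypothetical two-row CSV standing in for the real file
csv_text = "Country Name,Country Code,1961\nAruba,ABW,307.97\nAndorra,AND,30.59\n"

# Default behaviour: pandas adds an integer index 0, 1, ...
default_index = pd.read_csv(io.StringIO(csv_text))

# index_col=0: the first column (country names) becomes the index
named_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default_index.index.tolist())  # [0, 1]
print(named_index.index.tolist())    # ['Aruba', 'Andorra']

# Rows can now be looked up by country name
print(named_index.loc["Aruba", "1961"])
```

With the country names as the index, label-based lookups such as `.loc["Aruba"]` work without an extra filtering step.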

In [38]:
# looking at the dataset
dataset.head()

| Country Name | Country Code | Indicator Name | Indicator Code | 1960 | 1961 | 1962 | 1963 | 1964 | 1965 | 1966 | ... | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Aruba | ABW | Population density (people per sq. km of land ... | EN.POP.DNST | NaN | 307.972222 | 312.366667 | 314.983333 | 316.827778 | 318.666667 | 320.622222 | ... | 562.322222 | 563.011111 | 563.422222 | 564.427778 | 566.311111 | 568.85 | 571.783333 | 574.672222 | 577.161111 | NaN |
| Andorra | AND | Population density (people per sq. km of land ... | EN.POP.DNST | NaN | 30.587234 | 32.714894 | 34.914894 | 37.170213 | 39.470213 | 41.8 | ... | 180.591489 | 182.161702 | 181.859574 | 179.614894 | 175.161702 | 168.757447 | 161.493617 | 154.86383 | 149.942553 | NaN |
| Afghanistan | AFG | Population density (people per sq. km of land ... | EN.POP.DNST | NaN | 14.038148 | 14.312061 | 14.599692 | 14.901579 | 15.218206 | 15.545203 | ... | 39.637202 | 40.634655 | 41.674005 | 42.830327 | 44.127634 | 45.533197 | 46.997059 | 48.444546 | 49.821649 | NaN |
| Angola | AGO | Population density (people per sq. km of land ... | EN.POP.DNST | NaN | 4.305195 | 4.384299 | 4.464433 | 4.544558 | 4.624228 | 4.703271 | ... | 15.387749 | 15.915819 | 16.459536 | 17.020898 | 17.600302 | 18.196544 | 18.808215 | 19.433323 | 20.070565 | NaN |
| Albania | ALB | Population density (people per sq. km of land ... | EN.POP.DNST | NaN | 60.576642 | 62.456898 | 64.329234 | 66.209307 | 68.058066 | 69.874927 | ... | 108.394781 | 107.566204 | 106.843759 | 106.314635 | 106.013869 | 105.848431 | 105.717226 | 105.60781 | 105.444051 | NaN |


---

#### After loading our dataset

To get a quick overview of our dataset, we print its shape.   
This gives us a tuple of the form `(rows, columns)`.

In [39]:
# printing the shape of our dataset
dataset.shape

(264, 60)
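
Because `shape` is a plain tuple, it can be unpacked directly. The sketch below uses a small made-up DataFrame (the name `toy` and its values are assumptions for illustration, not the real dataset) to show that the index is not counted as a column.

```python
import pandas as pd

# Hypothetical stand-in frame: 2 rows, 3 columns, country names as index
toy = pd.DataFrame(
    {
        "Country Code": ["ABW", "AND"],
        "1961": [307.97, 30.59],
        "1962": [312.37, 32.71],
    },
    index=["Aruba", "Andorra"],
)

# shape is a (rows, columns) tuple, so it can be unpacked
n_rows, n_cols = toy.shape
print(n_rows, n_cols)  # 2 3
```

Applied to the real dataset, the 60 columns are consistent with the three descriptive columns plus one column for each of the 57 years from 1960 to 2016.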

In [40]:
# calculating the mean for 1961 column
dataset["1961"].mean()

176.91514132840538

In [41]:
# calculating the mean for 2015 column
dataset["2015"].mean()

368.7066010400187

**Note:**   
Just by comparing the overall means of the two years, 1961 and 2015, we can already see that the mean population density **more than doubled** over this period.
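
The "more than doubled" statement can be checked with one division. This short sketch copies the two means from the cell outputs above; the variable names are chosen here for illustration.

```python
# Means copied from the two cell outputs above
mean_1961 = 176.91514132840538
mean_2015 = 368.7066010400187

# A ratio above 2 means the value more than doubled
ratio = mean_2015 / mean_1961
print(f"The 2015 mean is {ratio:.2f} times the 1961 mean")
```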

In [47]:
# mean for each country (row); numeric_only=True skips the text columns
dataset.mean(axis=1, numeric_only=True).head(10)

Country Name
Aruba                   413.944949
Andorra                 106.838839
Afghanistan              25.373379
Angola                    9.649583
Albania                  99.159197
Arab World               16.118586
United Arab Emirates     31.321721
Argentina                11.634028
Armenia                 103.415539
American Samoa          211.855636
dtype: float64

In [48]:
# mean for each feature (col); numeric_only=True skips the text columns
dataset.mean(axis=0, numeric_only=True).tail(10)

2007    331.995474
2008    338.688417
2009    343.649206
2010    347.967029
2011    351.942027
2012    357.787305
2013    360.985726
2014    364.849194
2015    368.706601
2016           NaN
dtype: float64

**Note:**   
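The `axis` parameter again controls the direction of the aggregation: `axis=1` averages across the columns and returns one value per row (country), while `axis=0` averages down the rows and returns one value per column (year).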
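
The effect of `axis` is easiest to see on a frame small enough to check by hand. This is a minimal sketch with made-up values; `toy`, `row_means` and `col_means` are illustrative names.

```python
import pandas as pd

# Hypothetical 2x2 frame: rows are countries, columns are years
toy = pd.DataFrame(
    {"1961": [10.0, 20.0], "1962": [30.0, 40.0]},
    index=["A", "B"],
)

# axis=1: average across the columns, one value per row
row_means = toy.mean(axis=1)
print(row_means)  # A -> 20.0, B -> 30.0

# axis=0: average down the rows, one value per column
col_means = toy.mean(axis=0)
print(col_means)  # 1961 -> 15.0, 1962 -> 35.0
```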

In [52]:
# calculating the mean without specifying an axis
dataset.mean(numeric_only=True)

1960           NaN
1961    176.915141
1962    180.703231
1963    184.572413
1964    188.461797
1965    192.412363
1966    196.145042
1967    200.118063
1968    203.879464
1969    207.336102
1970    210.607871
1971    213.489694
1972    215.998475
1973    218.438708
1974    220.621210
1975    223.046375
1976    224.960258
1977    227.006734
1978    229.187306
1979    232.510772
1980    236.185357
1981    240.789508
1982    246.175178
1983    251.342389
1984    256.647822
1985    261.680751
1986    266.647038
1987    271.768300
1988    276.813259
1989    281.850054
1990    286.062387
1991    288.292566
1992    293.305416
1993    297.759160
1994    302.275463
1995    304.537276
1996    309.714948
1997    313.896935
1998    320.405981
1999    324.004669
2000    327.270760
2001    312.259570
2002    313.269043
2003    315.847613
2004    317.746559
2005    322.669534
2006    326.907971
2007    331.995474
2008    338.688417
2009    343.649206
2010    347.967029
2011    351.942027
2012    357.787305
2013    360.985726
2014    364.849194
2015    368.706601
2016           NaN
dtype: float64

**Note:**   
If you compare the result of this last cell with the `# mean for each feature (col)` cell, you can see that they match: the default is `axis=0`, so calling `mean()` without an axis returns one mean per column rather than a single value for the whole matrix.
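
The default-axis behaviour can be confirmed programmatically. The sketch below uses a made-up frame with one text column. It passes `numeric_only=True` because recent pandas versions raise an error when asked to average text columns, whereas older versions silently skipped them.

```python
import pandas as pd

# Hypothetical frame with one text column and one missing value
toy = pd.DataFrame(
    {"Code": ["A", "B"], "1961": [10.0, 20.0], "1962": [30.0, None]}
)

# No axis given: pandas falls back to axis=0 (one mean per column)
default_means = toy.mean(numeric_only=True)
explicit_means = toy.mean(axis=0, numeric_only=True)

print(default_means.equals(explicit_means))  # True

# Missing values are skipped by default, so 1962 averages to 30.0
print(default_means["1962"])
```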

---

Even a few Pandas calls on a real dataset can give us quick insights into our data.  
In this case, we can see that the mean population density increased in almost every year from 1961 to 2015; the one visible exception is the drop from 2000 (327.27) to 2001 (312.26).
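
One way to check how steady the increase is: take the year-over-year differences and look for negative values. This sketch copies only the 1999 to 2002 means from the output above into a Series; `yearly_means` and `changes` are illustrative names.

```python
import pandas as pd

# Yearly means copied from the output above (1999 to 2002 only)
yearly_means = pd.Series(
    {
        "1999": 324.004669,
        "2000": 327.270760,
        "2001": 312.259570,
        "2002": 313.269043,
    }
)

# diff() gives the change relative to the previous year
changes = yearly_means.diff()

print(yearly_means.is_monotonic_increasing)  # False
print(changes[changes < 0])  # only 2001 shows a decrease
```

The same two calls could be run on the full per-column result to look for any other years in which the mean fell.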