# Elements Of Data Processing (2020S1) - Week 2


# DataFrames

A DataFrame represents a tabular data structure and can contain multiple rows and columns.  It can be thought of as a dictionary of Series objects, and is one of the most important data structures you will use to store and manipulate information in data science.

A DataFrame has both row and column indices.
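
As a minimal sketch of the two indices, the snippet below builds a tiny DataFrame from two Series. The names `heights`, `weights` and `people`, and all of the values, are made up purely for illustration.

```python
import pandas as pd

# two small Series sharing the same row labels (illustrative values only)
heights = pd.Series({'Ann': 165, 'Bob': 180, 'Cat': 172})
weights = pd.Series({'Ann': 60.5, 'Bob': 82.0, 'Cat': 68.3})

# a DataFrame behaves like a dictionary of Series: each key becomes a column
people = pd.DataFrame({'height': heights, 'weight': weights})

print(people.index)    # row index: the labels shared by the Series
print(people.columns)  # column index: the dictionary keys

# selecting one column gives back a Series
print(type(people['height']))
```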

The Pandas DataFrame structure provides many useful methods to aid your analysis.  Recall from week 1 that the [API reference](https://pandas.pydata.org/pandas-docs/stable/reference/index.html) details all of the functionality provided by pandas.  You will particularly need to consult the [DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html) reference page.


<img src="images/DF.jpg">


In [1]:
# as before, begin by importing the pandas library
import pandas as pd

In [2]:
# create a new series of the population
Aus_Population = {'1990':17065100, '2000':19153000, '2007':20827600,
                 '2008':21249200,'2009':21691700,'2010':22031750,
                 '2011':22340024, '2012':22728254, '2013':23117353}
population = pd.Series(Aus_Population)

In [3]:
# we will reuse the emissions data from last week
Aus_Emission = {'1990':15.45288167, '2000':17.20060983, '2007':17.86526004,
                '2008':18.16087566,'2009':18.20018196,'2010':16.92095367,
                '2011':16.86260095, '2012':16.51938578, '2013':16.34730205}

co2_Emission = pd.Series(Aus_Emission)

In [4]:
# verify the values in the series
population

1990    17065100
2000    19153000
2007    20827600
2008    21249200
2009    21691700
2010    22031750
2011    22340024
2012    22728254
2013    23117353
dtype: int64

In [5]:
# create a DataFrame object from the series objects
australia = pd.DataFrame({'co2_emission':co2_Emission, 'Population':population})
australia

,co2_emission,Population
1990,15.452882,17065100
2000,17.20061,19153000
2007,17.86526,20827600
2008,18.160876,21249200
2009,18.200182,21691700
2010,16.920954,22031750
2011,16.862601,22340024
2012,16.519386,22728254
2013,16.347302,23117353


In [7]:
# create a DataFrame from a csv file
countries = pd.read_csv('data/countries.csv',encoding = 'ISO-8859-1')

In [8]:
# check the top 10 countries in the DataFrame
countries.head(10) # the default value is set to 5

,Country,Region,IncomeGroup
0,Afghanistan,South Asia,Low income
1,Albania,Europe & Central Asia,Upper middle income
2,Algeria,Middle East & North Africa,Upper middle income
3,American Samoa,East Asia & Pacific,Upper middle income
4,Andorra,Europe & Central Asia,High income
5,Angola,Sub-Saharan Africa,Upper middle income
6,Antigua and Barbuda,Latin America & Caribbean,High income
7,Argentina,Latin America & Caribbean,Upper middle income
8,Armenia,Europe & Central Asia,Lower middle income
9,Aruba,Latin America & Caribbean,High income


In [9]:
# count the number of countries in each region
countries.Region.value_counts()

Europe & Central Asia         58
Sub-Saharan Africa            48
Latin America & Caribbean     42
East Asia & Pacific           37
Middle East & North Africa    21
South Asia                     8
North America                  3
Name: Region, dtype: int64

In [10]:
# set the name of countries as the index
# (set_index returns a new DataFrame; countries itself is left unchanged)
countries.set_index('Country')


Country,Region,IncomeGroup
Afghanistan,South Asia,Low income
Albania,Europe & Central Asia,Upper middle income
Algeria,Middle East & North Africa,Upper middle income
American Samoa,East Asia & Pacific,Upper middle income
Andorra,Europe & Central Asia,High income
Angola,Sub-Saharan Africa,Upper middle income
Antigua and Barbuda,Latin America & Caribbean,High income
Argentina,Latin America & Caribbean,Upper middle income
Armenia,Europe & Central Asia,Lower middle income
Aruba,Latin America & Caribbean,High income


In [11]:
# create a new DataFrame for the CO2 emission from a csv file
emission = pd.read_csv('data/emission.csv',encoding = 'ISO-8859-1')
emission.head()

,Country,1990,2000,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016
0,Afghanistan,0.216661,0.039272,0.087858,0.158962,0.249074,0.302936,0.425262,0.688084,0.693183,,,
1,Albania,1.615624,0.978175,1.322335,1.484311,1.4956,1.578574,1.803972,1.624722,1.662185,,,
2,Algeria,3.007911,2.819778,3.195865,3.168524,3.430129,3.307164,3.300558,3.47195,3.51478,,,
3,American Samoa,,,,,,,,,,,,
4,Andorra,,8.018181,6.350868,6.296125,6.049173,6.12477,5.968685,6.195194,6.473848,,,


In [None]:
# Create a subset of emission dataset for Year 2010
yr2010 = emission['2010']
names  = emission['Country']
yr2010.index = names
type(yr2010)

In [None]:
# Sort column values using sort_values 
yr2010.sort_values()


In [None]:
#Sort column values to find the top countries
yr2010.sort_values(ascending = False)

### <span style="color:blue"> Exercise 1 </span>

- Retrieve the mean and median of the CO2 emissions generated in 2012 by all countries.
- Retrieve the top 5 countries with the highest CO2 emissions in 2012. How about the 5 countries with the lowest emissions? (Remember that sort_values has an **ascending** parameter that is set to True by default.)
- Retrieve the sum of CO2 emissions for each year and find the 2 years with the maximum CO2 emissions.





In [None]:
##answer here



# More Sort Operations
Pandas allows you to sort your DataFrame by rows/columns as follows:

In [None]:
# Sort column values of a DataFrame
sorted2012 = emission.sort_values( by = '2012',ascending = False )
sorted2012

In [None]:
# Sort column values using two columns
sorted2012 = emission.sort_values( by = ['2012','2013'],ascending = [False, True] )
sorted2012
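
To see how the second column acts as a tie-breaker, here is a small sketch on made-up data (the `scores` DataFrame and its columns `a` and `b` are hypothetical and not part of the emission dataset).

```python
import pandas as pd

# toy data (made up): rows 'x' and 'y' tie on column 'a'
scores = pd.DataFrame({'name': ['w', 'x', 'y', 'z'],
                       'a': [2, 3, 3, 1],
                       'b': [5, 9, 4, 7]})

# sort by 'a' descending; rows with equal 'a' are then ordered by 'b' ascending
ranked = scores.sort_values(by=['a', 'b'], ascending=[False, True])
print(ranked)
```

Note that `sort_values` returns a new, sorted DataFrame; `scores` keeps its original row order.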

#### Slicing using .loc and .iloc
Slicing allows you to take part of your DataFrame.  You can use .iloc to select data by row/column position, or .loc to select data by row/column label.  See [this article](https://www.shanelynn.ie/select-pandas-dataframe-rows-and-columns-using-iloc-loc-and-ix/) for more examples.

In [None]:
# Slicing using a range of rows and range of columns 
emission.iloc[2:5,2:6]

In [None]:
# Slicing using specific rows and specific columns
emission.loc[[3,5],['Country','1990']]

In [None]:
# Specific rows and all columns

emission.loc[[3,5],:]

In [None]:
# All rows and specific columns
emission.loc[:,['Country','1990']]
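
The difference between the two is easier to see when the row labels are not the default 0, 1, 2, ... The sketch below uses a made-up DataFrame `df` with string labels; it is only an illustration, not part of the emission data.

```python
import pandas as pd

# toy data (made up) with string row labels instead of the default integers
df = pd.DataFrame({'x': [10, 20, 30, 40], 'y': [1.5, 2.5, 3.5, 4.5]},
                  index=['a', 'b', 'c', 'd'])

# .iloc uses integer positions; the end of a slice is excluded
by_position = df.iloc[1:3, 0:1]    # rows at positions 1 and 2, first column

# .loc uses labels; the end of a slice is included
by_label = df.loc['b':'c', ['x']]  # rows labelled 'b' to 'c', column 'x'

# both selections pick the same rows and column
print(by_position.equals(by_label))
```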

### <span style="color:blue"> Exercise 2 </span>

Create a DataFrame object that has the name, region and IncomeGroup of the top 10 emitting countries in 2012.






In [None]:
##answer here



## Groupby
The Groupby method lets you separate the data into different groups based on shared characteristics.  For example, we could group countries by region or income range and then analyse those groups individually.  The official documentation on groupby can be found [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html).  [This tutorial](https://www.marsja.se/python-pandas-groupby-tutorial-examples/) is also well worth reading.
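
As a rough sketch of the idea, the snippet below groups a small made-up table of fruit by its `kind` column and then summarises each group. The `fruit` DataFrame and its values are hypothetical and only serve as an illustration.

```python
import pandas as pd

# toy data (made up): a few fruits with a category and a price
fruit = pd.DataFrame({'name': ['apple', 'pear', 'lime', 'lemon', 'plum'],
                      'kind': ['pome', 'pome', 'citrus', 'citrus', 'stone'],
                      'price': [1.0, 1.5, 0.5, 0.7, 2.0]})

# rows sharing the same value of 'kind' end up in the same group
groups = fruit.groupby('kind')

print(groups.size())           # number of rows in each group
print(groups['price'].mean())  # average price within each group
```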

#### Groupby
<img src="files/images/groupby1.jpg">

### <span style="color:blue"> Exercise 3 </span>

Using the `countries` DataFrame, group the rows using the Region column.
* Show the size of each group
* Find the number of high income and low income countries by region

In [None]:
##answer here



# Visualization

In these exercises you will:

- learn how to visualize a set of data using a Python library called `matplotlib`.
- explore different forms of visualization, such as bar charts, histograms, scatter plots, and line plots.
- customize the visualization output, for example by modifying axis properties or adding labels.

You will be able to transform a set of data into an appropriate visualization form.

## Why Visualization?

The power of 'preattentive perception' is the foundation of visualization. People see some things preattentively, without the need for focused attention. These visual properties can be distinguished in less than 200 milliseconds (eye movements take 200 msecs) [Healey, 2005]. What 'preattentive perception' means is clarified in the next example.

The following example uses the maximum temperature data. The CSV-formatted data contains the average maximum temperature recorded for all major Australian cities during the period March 2007 to February 2008 (obtained from the Australian Government's [Bureau of Meteorology](http://www.bom.gov.au/climate/data/)). The data is presented below in two forms: as text in a table and as a multi-line plot.

city/month | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec
- | - | - | - | - | - | - | - | - | - | - | - | -
Melbourne | 41.2 | 35.5 | 37.4 | 29.3 | 23.9 | 16.8 | 18.2 | 25.7 | 22.3 | 33.5 | 36.9 | 41.1
Brisbane | 31.3 | 40.2 | 37.9 | 29 | 30 | 26.7 | 26.7 | 28.8 | 31.2 | 34.1 | 31.1 | 31.2
Darwin | 34 | 34 | 33.2 | 34.5 | 34.8 | 33.9 | 32 | 34.3 | 36.1 | 35.4 | 37 | 35.5
Perth | 41.9 | 41.5 | 42.4 | 36 | 26.9 | 24.5 | 23.8 | 24.3 | 27.6 | 30.7 | 39.8 | 44.2
Adelaide | 42.1 | 38.1 | 39.7 | 33.5 | 26.3 | 16.5 | 21.4 | 30.4 | 30.2 | 34.9 | 37.1 | 42.2
Canberra | 35.8 | 29.6 | 35.1 | 26.5 | 22.4 | 15.3 | 15.7 | 21.9 | 22.1 | 30.8 | 33.4 | 35
Hobart | 35.5 | 34.1 | 30.7 | 26 | 20.9 | 15.1 | 17.5 | 21.7 | 20.9 | 24.2 | 30.1 | 33.4
Sydney | 30.6 | 29 | 35.1 | 27.1 | 28.6 | 20.7 | 23.4 | 27.7 | 28.6 | 34.8 | 26.4 | 30.2

If you have to find out from the table which Australian city has the highest temperature, you have to look carefully through the data. Your eyes need to scan the table, scurrying across all the table cells and comparing values, before you can finally answer the question.

<img src="images/maxtemp.png">

On the other hand, using the visualization of the same data (see the figure above), you can easily notice that the light blue line contains the highest temperature of the year. Thus, you can conclude that Perth, the city represented by that line, is the hottest city of the year, without really bothering about the rest of the data. You can also almost instantly notice that Darwin's temperature is the most stable over the year compared to the other cities. This quick observation is hardly possible by just looking at the raw textual data.

### <span style="color:blue"> Exercise 4 </span> 




Find out the city with the lowest maximum temperature. First, try to do that with the table. Then, try to do the same using the multi-line plot in the figure above. Also find the city with the most stable temperature. Do you think visualization is helpful in drawing your conclusion?


In [None]:
##answer here





## Elements of Visualization
All forms of visualization are built with some basic visual elements such as:

- Location (x,y coordinate in the screen)
- Brightness
- Color (Hue)
- Pattern/Texture
- Shape
- Line
- Text

Visualization, in principle, transforms the numerical and symbolic data into these basic visual elements. In the previous example, the cities are translated into the colors of the lines and the temperature data is used to plot the location of the lines.

From those simple elements, some popular types of visualization can be built such as:

- graphs, representation of a set of relationships between various entities, like family trees, network diagrams, grammar trees
- maps, representation of a particular space (and its properties), like geographical map, brain activity map
- charts, representation of numerical data either from a given set of real-world data or generated by mathematical functions

In this worksheet, you will mostly learn to generate various types of charts: line plot (line chart), bar chart, pie chart and histogram.

## Visualization with Python

`matplotlib` is a Python 2D plotting library that enables you to produce figures and charts, either on screen or in an image file. You can use `matplotlib`'s interactive environment to display figures on your screen if you have Python and `matplotlib` installed on your own computer.

The following example demonstrates a simple plot of the Fibonacci sequence.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

plt.plot([1,1,2,3,5,8,13]) # a plot of the first seven Fibonacci numbers

`matplotlib` allows you to produce plots, histograms, bar charts, pie charts, error bar charts, and scatter plots. All these types of graphics will become clearer as you go through the examples. matplotlib also provides flexibility in customizing those graphics. It permits you to modify the line styles, font properties, axes properties, and many other properties.

## The Structure of matplotlib

`matplotlib` is conceptually divided into three parts. The first part, the matplotlib API, is the library that does the hard work: managing and creating figures, text, lines, plots and so on. We access this library by issuing the following command:

    >>> import matplotlib

The device-dependent backend is the second part. It is the drawing engine that is responsible for rendering the visual representation to a file or a display device. Examples of backends: the 'PS' backend is used to create PostScript files (suitable for hardcopy printouts), 'SVG' creates Scalable Vector Graphics (SVG) files, and 'TkAgg' (built on Tkinter) provides an interactive interface to the visualization. For example, one can use the Agg backend to produce a PNG file:

     >>> matplotlib.use('Agg')
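
As a minimal sketch of how a backend is used in a script, the snippet below selects Agg and renders the Fibonacci plot to PNG data. Writing into an in-memory buffer is an assumption made here to keep the example self-contained; passing a file name to `savefig()` works the same way.

```python
import io
import matplotlib
matplotlib.use('Agg')            # select the non-interactive Agg backend
import matplotlib.pyplot as plt

plt.plot([1, 1, 2, 3, 5, 8, 13])

# render the current figure as PNG into an in-memory buffer
buffer = io.BytesIO()
plt.savefig(buffer, format='png')
png_bytes = buffer.getvalue()
print(len(png_bytes), 'bytes written')
```

In a notebook, switching to Agg may stop figures from being displayed inline, so this is mainly useful in standalone scripts.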

The `pyplot` interface is the last part of the matplotlib package. The `pyplot` module provides a set of functions that utilize the underlying matplotlib library. High-level visualization functions like plot, boxplot, and bar are available through the `pyplot` interface. To import these functions, issue the following command:

    >>> import matplotlib.pyplot as plt

## The Powerful Plot
Plot is probably the most important function in matplotlib. It draws lines and/or markers using the coordinates of the points, or x,y pairs, supplied as arguments to the function. Both x and y are generally lists or arrays of values. For example, the following commands plot a simple quadratic function `y = x * x`.

    >>> x = [1,2,3,4]
    >>> y = [1,4,9,16]
    >>> plt.plot(x,y)

A single list argument to the plot command, like `plt.plot(y)`, is treated as a list of y-values, and matplotlib automatically generates the x-values (0, 1, 2, ...) for you. This is displayed in the next example, which plots the monthly averages of maximum temperature in Melbourne (see the table earlier in this worksheet).

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import calendar

# Melbourne monthly maximum temperature, listed Jan - Dec (see the table above)
t = [41.2,35.5,37.4,29.3,23.9,16.8,18.2,25.7,22.3,33.5,36.9,41.1]
s=pd.Series(t)
# label the x-axis ticks 'Jan' - 'Dec'
# calendar.month_abbr returns an empty string for the first element
plt.xticks(range(12),list(calendar.month_abbr)[1:])
plt.plot(s)
#plt.show()
#plt.clf()
#plt.plot(s*2)

`xticks()` is used to annotate the ticks on the x-axis. You need to supply `xticks()` with a list of x-values and a list of text labels that go with those values.

**Note:** In shell script mode, every time you issue a `plot()` command, the output is added to the results of the earlier `plot()` calls. The `clf()` function can be called to empty the canvas.
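
The accumulation and the effect of `clf()` can be checked with a short sketch like the one below. The variable names `lines_before` and `axes_after` are only illustrative.

```python
import matplotlib.pyplot as plt

plt.figure()                          # start from a fresh figure
plt.plot([1, 2, 3])
plt.plot([3, 2, 1])
lines_before = len(plt.gca().lines)   # both plots were drawn on the same axes

plt.clf()                             # empty the current figure
axes_after = len(plt.gcf().axes)      # the cleared figure has no axes left

print(lines_before, axes_after)
```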

The plot function is also commonly used to plot mathematical and scientific formulas, as shown below.

In [None]:
%matplotlib inline
from pylab import *


def f(t):
    return cos(2*pi*t)*log(1+t)

precision = 0.1 # spacing between consecutive values of t
t = arange(0.0, 5.0, precision)
plot(t,f(t),'m')  # 'm' is the magenta colour

### <span style="color:blue"> Exercise 5 </span> 


Modify the example on Melbourne's maximum temperature to display Sydney's maximum temperature from April 2007 to November 2007. Have your code load the temperature data from [this file](data/max_temp.csv).


In [None]:
##answer here





### <span style="color:blue"> Exercise 6 </span> 

In the mathematical formula example, change the definition of f(t) to sin(2*pi*t)*exp(-t), and play around with the variable precision too. Observe the impact of the changes on the result.

In [None]:
##answer here





## Plot line properties

You can supply an optional argument to customize the color and the linestyle of the plot output. For example, to plot with red circles, you would issue `plot([1,2,3,4],'ro')`. `'r'` represents the color red and `'o'` refers to a circle-shaped marker. `plot([1,2,3,4],'bs:')` draws a blue dotted line with square markers. The line color, the linestyle, and the marker type are respectively given by `'b'`,`':'`, and `'s'`. Select your preferred color from the set of matplotlib colors. The choices for linestyle and marker can be seen in the table below.

**Line properties**

Property | Values
- | -
alpha | The alpha transparency on 0-1 scale
antialiased | True or False - use antialiased rendering
color | a matplotlib color arg
label | a string optionally used for legend
linestyle | One of -- : -. -
linewidth | a float, the line width in points
marker | One of + , o . s v x > < ^
markeredgewidth | The line width around the marker symbol
markeredgecolor | The edge color if a marker is used
markerfacecolor | The face color if a marker is used
markersize | The size of the marker in points
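
As a small sketch, the format string can be checked by inspecting the line object that `plot()` returns. The getter methods used here belong to matplotlib's line objects; the variable name `line` is just illustrative.

```python
import matplotlib.pyplot as plt

plt.figure()
# 'bs:' = blue ('b'), square markers ('s'), dotted line (':')
(line,) = plt.plot([1, 2, 3, 4], 'bs:')

# the returned line object reports the properties taken from the format string
print(line.get_color(), line.get_marker(), line.get_linestyle())
```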

There are other methods to set plot line properties. First, you can use the keyword arguments listed in the table above. For example:

    >>> plot(x, y, linewidth=3.0)

The second method uses the `setp()` command. As shown below, `setp()` allows you to modify multiple properties of a collection of lines.

    >>> lines = plot(x1, y1, x2, y2)
    >>> setp(lines, color='b', linewidth=4.0)

`plot()` returns a list of lines, e.g. `line1,line2 = plot(x1,y1,x2,y2)`. The third method utilizes the various set functions on the lines returned by a plot command. The list of the set functions is available [here](http://matplotlib.org/api/lines_api.html).

    >>> line1,line2 = plot(x1, y1, x2, y2)
    >>> line1.set_antialiased(False) # turn off antialiasing on the first line

The use of these three methods is demonstrated in the following example:

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

t = np.arange(0.0, 2.05, 0.05)

# method 1: keyword arguments
plt.plot(t, np.sin(t*10), 'k-', linewidth=3.0)
# plot a solid black line with thickness=3

# method 2: setp command
lines = plt.plot(t, [4 for i in t], t, 4*t)
plt.setp(lines, 'color', 'r' )
plt.setp(lines, 'linestyle', ':' )
# plot two red dotted lines

# method 3: set functions on the returned lines
line1,line2 = plt.plot(t, t**2, t, np.exp(t))
line1.set_marker('s')
line1.set_color('g')
line2.set_marker('^')
line2.set_color('b')
# plot two lines
# first line is drawn with green square marker
# second line is drawn with blue triangle marker

Note: As you can see in the result of the example above, the output of multiple plot commands is accumulated. You can use clf() if you would like to start with a blank slate.

### <span style="color:blue"> Exercise 7 </span> 


Modify the example on Melbourne maximum temperature in the previous section to produce a plot with magenta colored triangle markers. Increase the thickness of the plot line, too.


In [None]:
##answer here






## Adding Text to the Charts

You can add labels to the x and y axes of the plot using `xlabel()` and `ylabel()`. As you have seen and used earlier, `xticks()` is used to put text on the x-axis ticks. `xticks()` needs two arguments: a list of x-values and a list of text labels that go with those values. The same rules apply to `yticks()`.

### <span style="color:blue"> Exercise 8 </span> 


Add the following lines of code to the example on Melbourne maximum temperature. Replace the `xticks()` command with the supplied code. Run to see the effect.


    >>> plt.xticks( range(12), list(calendar.month_abbr)[1:], rotation=40 )
    >>> plt.ylabel("temperature in Celsius", fontsize=14)
    >>> plt.xlabel("months", fontsize=14)
    >>> plt.title("Melbourne maximum temperature (Mar 07 - Feb 08)", fontsize=18)

In [None]:
##answer here





As is apparent in the code above, you can supply additional arguments to change the properties of the text. The options for these properties are available in the table below:

Property | Values
- | -
alpha | The alpha transparency on 0-1 scale
color | a matplotlib color argument
fontstyle | italic &#124; normal &#124; oblique
fontname | Sans &#124; Helvetica &#124; Courier &#124; Times &#124; Others
fontsize | a scalar, eg, 10
fontweight | normal &#124; bold &#124; light &#124; 4
horizontalalignment | left &#124; center &#124; right
rotation | horizontal &#124; vertical &#124; an angle in degrees, eg, 40
verticalalignment | bottom &#124; center &#124; top

You can also attach a legend to the output using the `legend()` command to describe each line produced by `plot()`. You need to supply a list of text labels, one for each line; in the command below, `all_species` stands for such a list.

    >>> legend(all_species,loc='lower right')
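
A minimal, self-contained sketch of `legend()` is shown below. The two plotted lines and the `descriptions` list are made up for illustration.

```python
import matplotlib.pyplot as plt

plt.figure()
x = [0, 1, 2, 3]
plt.plot(x, [0, 1, 2, 3])
plt.plot(x, [0, 1, 4, 9])

# one description per line, in the order the lines were plotted
descriptions = ['linear', 'quadratic']
leg = plt.legend(descriptions, loc='lower right')

print([text.get_text() for text in leg.get_texts()])
```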

## Customizing your graphic
You can customize your plot result further by using the following commands:

- `xlim()` to set the range of the x axis, for example `xlim(0,10)`
- `ylim()` to set the range of the y axis, for example `ylim(-1,1)`
- `axis()` to do both at the same time. The two commands above are equivalent to `axis([0,10,-1,1])`
- `grid(True)` to turn on the grid, or `grid(False)` to do otherwise
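
The equivalence between the separate and the combined commands can be sketched as follows. The plotted values are arbitrary and only there to give the axes some content.

```python
import matplotlib.pyplot as plt

x = [0, 3, 6, 9, 12]
y = [0, 2, -2, 1, 0]

# first figure: set the two ranges separately
plt.figure()
plt.plot(x, y)
plt.xlim(0, 10)
plt.ylim(-1, 1)
separate = (plt.xlim(), plt.ylim())   # called without arguments, they return the current limits

# second figure: set both ranges with a single axis() call
plt.figure()
plt.plot(x, y)
plt.axis([0, 10, -1, 1])
plt.grid(True)
combined = (plt.xlim(), plt.ylim())

print(separate == combined)
```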

## Scatter plot

With real-world data, the `plot()` function can be used to generate time series and scatter plots.

A scatter plot is often used to display the relationship between two variables (plotted as x-y pairs). In this scatter plot example, we use the [famous Iris data set](http://en.wikipedia.org/wiki/Iris_flower_data_set). The data is available [here](data/iris.csv). This data set provides measurements of various parts of three types of Iris flower (Iris setosa, Iris versicolor, and Iris virginica). For each type, there are 50 measurements, or samples. Each data row in the CSV file contains (1) sepal length, (2) sepal width, (3) petal length, (4) petal width, and (5) the type of Iris flower.

The following code generates the scatter plot of petal length against petal width for the three Iris types.

In [None]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt

iris=pd.read_csv('data/iris.csv',encoding = 'ISO-8859-1',header=None)
setosa=iris.loc[iris[4]=='Iris-setosa']
versicolor=iris.loc[iris[4]=='Iris-versicolor']
virginica=iris.loc[iris[4]=='Iris-virginica']


plt.scatter(setosa.iloc[:,2],setosa.iloc[:,3],color='green')
plt.scatter(versicolor.iloc[:,2],versicolor.iloc[:,3],color='red')
plt.scatter(virginica.iloc[:,2],virginica.iloc[:,3],color='blue')
plt.xlim(0.5,7.5)
plt.ylim(0,3)
plt.ylabel("petal width")
plt.xlabel("petal length")
plt.grid(True)



From the scatter plot, we may be able to suggest a particular type of relationship or a formation of clusters. In the example above you may notice that, for Iris versicolor, the samples with longer petals tend to have wider petals. You can also see clearly that there are clusters for these three Irises. As such, the measurements of petal and sepal can help in identifying the type of Iris flower. This example demonstrates how botanists may identify a certain species from phenotype characteristics.

### <span style="color:blue"> Exercise 9 </span> 


Modify the example above to generate the scatter plot of petal length and sepal length.


In [None]:
##answer here





## Bar chart

A bar chart is probably the most common type of chart. It displays a property or properties of a set of different entities. A bar chart is typically used to provide a comparison, or to show contrast, between different entities. For example, the bar chart below displays the GNP per capita of the three poorest and the three richest countries in the world (based on 2004 GNP per capita):

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import calendar
from numpy import arange

countries = ['Burundi','Ethiopia','Rep of Congo','Switzerland','Norway','Luxembourg']
gnp = [90,110,110,49600,51810,56380] # GNP per capita (2004)
plt.bar(arange(len(gnp)),gnp)
plt.xticks( arange(len(countries)),countries, rotation=30)
plt.show()

### <span style="color:blue"> Exercise 10 </span> 


Modify the bar chart example to plot the average maximum temperature in all major Australian cities. The data is available [here](data/max_temp.csv). 


In [None]:
##answer here





In a clustered bar chart, you can display a few measurements from the entities of interest. For example, the clustered bar chart below simultaneously shows the number of births and deaths in four countries of interest. The number of births is displayed as the blue-colored bar and the number of deaths as the red-colored bar:

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import calendar
from numpy import arange

countries = ['Afghanistan', 'Albania', 'Algeria', 'Angola']
births = [1143717, 53367, 598519, 498887]
deaths = [529623, 16474, 144694, 285380]
plt.bar(arange(len(births))-0.3, births, width=0.3)
plt.bar(arange(len(deaths)),deaths, width=0.3,color='r')
plt.xticks(arange(len(countries)),countries, rotation=30)

## Histogram

A histogram displays the distribution of population samples (typically a large set of data like digital images or the ages of a population). The following example creates a histogram of ages from a small number of samples (assume these are the ages of your classmates).

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt


ages = [17,18,18,19,21,19,19,21,20,23,19,22,20,21,19,19,14,23,16,17]
plt.hist(ages, bins=10)
plt.grid(which='major', axis='y')
plt.show()
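
As a short sketch of what `hist()` computes, the values it returns can be inspected directly. The names `counts`, `edges` and `bars` are only illustrative; the `ages` list is the one from the example above.

```python
import matplotlib.pyplot as plt

ages = [17,18,18,19,21,19,19,21,20,23,19,22,20,21,19,19,14,23,16,17]

plt.figure()
# hist() returns the bar heights, the bin edges and the drawn bar objects
counts, edges, bars = plt.hist(ages, bins=10)

print(counts.sum())   # every sample falls into exactly one bin
print(len(edges))     # 10 bins are delimited by 11 edges
```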