## Lesson 1: Data Analysis and Visualisation

Data analysis is the process of taking raw data and applying different methods to make sense of it, so that it can be presented clearly and decisions can be based on it. Python is commonly used for this purpose: its learning curve is less steep than that of many other languages, it has powerful libraries for a wide range of analysis tasks, and there are a lot of resources available for documentation. This lesson will make you more familiar with the tools and libraries that are commonly used for data analysis tasks. It has two sections. The first section covers the Pandas library, which is used for analysing the data. The second section covers two libraries, Matplotlib and Seaborn, which are used to visualise the data we have cleaned and analysed.

#### Learning Objectives


After this lesson, you should

* Be able to work with Pandas `DataFrame` and `Series` objects and Matplotlib and Seaborn functions;
    in particular,
    
    * to create,
    * to inspect and extract parts from,
    * to modify,
    * to compute summary statistics for, and
    * to visualise and create custom plots of
    
  `DataFrame` objects.

## Pandas

This section will introduce you to the **Pandas** library. This library deals with data in different forms and can be used to create datasets, read them, write to them and change them. It also has a few options for visualisation.

To help you along if you have any issues while coding, there are a few resources you can access: 
* The _Python Data Science Handbook_ by Jake VanderPlas,
  the chapter [_Data Manipulation with Pandas_](https://github.com/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/03.00-Introduction-to-Pandas.ipynb)
* The official [Pandas documentation](http://pandas.pydata.org/), which describes the available functions and parameters and how to use them.

To get started, we need to import both the NumPy and Pandas libraries. The most common way is to import them under the abbreviations `np` and `pd`. We will use these abbreviations to call the functions of each library.

In [2]:
import numpy as np
import pandas as pd

### Dataframes

The most important data type of the Pandas library is **`pd.DataFrame`**.
It is a _composite_ data type, whose values are called **data frames**.

A data frame is a two-dimensional arrangement of data values. It can be helpful to think of it as a table with rows and columns. 
The data itself is typically of data type `int`, `float`, `bool`, `str`,
or another data type offered by the NumPy library.

Here is an example in which we create a data frame from scratch; we name it `df`.
It has

* four rows indexed from 0 through 3, and
* three columns labeled `'A'`, `'B'`, and `'C'`.

In [3]:
df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': ['one', 'two', 'three', 'four'], 'C': [False, True, False, True]})
df

Unnamed: 0,A,B,C
0,1,one,False
1,2,two,True
2,3,three,False
3,4,four,True


From the table we can see the parts that make up a `DataFrame`:
* A **row** is a horizontal selection of data values.
* A **column** is a vertical selection of data values.
* The **index** provides an identification of the rows (0, 1, 2, 3).
* The **column labels** provide an identification of the columns (`'A'`, `'B'`, `'C'`).

To obtain this information through Python code, we can use the following attributes:

* **`df.shape`** : the number of rows and number of columns
* **`df.index`** : the row index
* **`df.columns`** : the column labels
* **`df.dtypes`** : the types of the values in each column

In [4]:
df.shape

(4, 3)

In [5]:
df.index

RangeIndex(start=0, stop=4, step=1)

In [6]:
df.columns

Index(['A', 'B', 'C'], dtype='object')

In [7]:
df.dtypes

A     int64
B    object
C      bool
dtype: object

If you want all of this information in one place, you can call the method **`df.info()`**. It prints a summary that contains the information displayed above.

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   A       4 non-null      int64 
 1   B       4 non-null      object
 2   C       4 non-null      bool  
dtypes: bool(1), int64(1), object(1)
memory usage: 196.0+ bytes
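
Because these attributes are ordinary Python values, you can also store them in variables and use them in your own code. The following is a minimal sketch under that assumption; the variable names `n_rows` and `n_cols` are chosen only for illustration.

```python
import pandas as pd

# Recreate the example data frame from above
df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': ['one', 'two', 'three', 'four'],
                   'C': [False, True, False, True]})

# df.shape is a tuple, so it can be unpacked into two variables
n_rows, n_cols = df.shape
print(n_rows, n_cols)        # 4 3

# df.columns can be turned into a plain Python list
print(list(df.columns))      # ['A', 'B', 'C']

# df.dtypes is itself a Series, indexed by the column labels
print(df.dtypes['C'])        # bool
```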


#### Example

Say we want to create a little dataframe based on information we gathered about people's birthdays. 

In [9]:
names = ['Peter', 'Anna', 'Tom', 'John', 'Simone']
years = [ 1998, 2002, 1946, 1973, 1962 ]

This can be done in the following way: 

In [10]:
df_years = pd.DataFrame({'Names': names, 'Years of Birth': years})
df_years

Unnamed: 0,Names,Years of Birth
0,Peter,1998
1,Anna,2002
2,Tom,1946
3,John,1973
4,Simone,1962


We can then use **`info()`** again to get an overview of the data frame.

In [11]:
df_years.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 2 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Names           5 non-null      object
 1   Years of Birth  5 non-null      int64 
dtypes: int64(1), object(1)
memory usage: 208.0+ bytes


### Getting data from a Dataframe

We can select a column of a `DataFrame` by putting its column label in square brackets. This is similar to how you look up a value in a dictionary by its key. The result is a `Series` object. This is useful if you want to know something about a particular feature of the dataset.

In [12]:
df['B']

0      one
1      two
2    three
3     four
Name: B, dtype: object
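
To check that a selected column really is a `Series`, you can inspect the result. This is a small sketch; the variable name `col` is only an illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': ['one', 'two', 'three', 'four'],
                   'C': [False, True, False, True]})

col = df['B']                 # select the column by its label
print(type(col).__name__)     # Series
print(col.name)               # the Series keeps the column label: B
print(col.tolist())           # ['one', 'two', 'three', 'four']
```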

To get a row from a data frame, you can use the `loc` indexer. Like indexing a list, it uses square brackets, but it looks up the row by its index label rather than by its position. The result is also a `Series` object.

In [13]:
df.loc[2]

A        3
B    three
C    False
Name: 2, dtype: object
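
In `df` the row labels happen to match the row positions, so it is easy to overlook that `loc` works with labels. The sketch below uses a hypothetical data frame `df_labeled` with the made-up index labels 10, 20, 30 and 40 to show the difference. It also uses `iloc`, the position-based counterpart of `loc` in pandas, which is not covered elsewhere in this section.

```python
import pandas as pd

# Hypothetical example: the row labels are 10, 20, 30, 40 instead of 0..3
df_labeled = pd.DataFrame({'A': [1, 2, 3, 4],
                           'B': ['one', 'two', 'three', 'four']},
                          index=[10, 20, 30, 40])

row_by_label = df_labeled.loc[30]      # the row whose label is 30
row_by_position = df_labeled.iloc[2]   # the third row, counting from 0

print(row_by_label['B'], row_by_position['B'])   # three three
```

Here `df_labeled.loc[2]` would raise a `KeyError`, because no row has the label 2.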

Say that we want a particular **value** at a given location in the data frame. We can find this value in several ways. 

* **`df[column_label][row_index]`** : first get the column,
    then get the value from the resulting `Series` object

In [14]:
df['B'][2]

'three'

* **`df.loc[row_index, column_label]`** : get the value directly,
    using `loc` with the row index and column label

In [None]:
df.loc[2, 'B']
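
As a quick check, the sketch below compares both ways of getting a single value. It also shows `.at`, a pandas accessor for reading one value by row and column label; `.at` is not used elsewhere in this lesson and is included only for comparison.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': ['one', 'two', 'three', 'four'],
                   'C': [False, True, False, True]})

value_chained = df['B'][2]     # column first, then the row label
value_loc = df.loc[2, 'B']     # row label and column label in one step
value_at = df.at[2, 'B']       # accessor for a single value

print(value_chained, value_loc, value_at)   # three three three
```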

You can also get information from larger parts of the data frame by using the _Slicing_ technique. This is similar to how you slice lists. 

> Keep in mind: When slicing `DataFrame` and `Series` objects you use the syntax `.loc[start:stop]`.
> Here, the **`stop` value is included**, unlike with lists, where the stop value is not included.

In [15]:
df.loc[1:2]

Unnamed: 0,A,B,C
1,2,two,True
2,3,three,False


You can also slice columns: 

In [17]:
df.loc[:, 'A':'B']

Unnamed: 0,A,B
0,1,one
1,2,two
2,3,three
3,4,four


Here, the `:` in the first position selects all the rows, and `'A':'B'` selects the columns from `'A'` through `'B'`. With `.loc` you can also extract a particular slice of rows and columns in one go:

In [18]:
df.loc[1:2, 'A':'B']

Unnamed: 0,A,B
1,2,two
2,3,three


You can also get **non-adjacent** rows and columns. For rows you use `.loc` with a list of the indices of the rows you want to see. The result is a new data frame: 

In [None]:
df.loc[[1, 3]]
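
The cell above is shown without output. As a sketch of what to expect, the following stores the result in a variable (named `picked` only for illustration) and inspects it:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': ['one', 'two', 'three', 'four'],
                   'C': [False, True, False, True]})

picked = df.loc[[1, 3]]          # a list of row labels, not a slice
print(type(picked).__name__)     # DataFrame
print(picked.index.tolist())     # [1, 3]
print(picked['B'].tolist())      # ['two', 'four']
```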

For columns, you do something similar, but instead of using `.loc` you can directly provide a list of the labels of the columns you want to see. Again, the result is a new data frame: 

In [19]:
df[['A', 'C']]

Unnamed: 0,A,C
0,1,False
1,2,True
2,3,False
3,4,True


You can also use `.loc`, as we did before, to get non-adjacent columns: 

In [23]:
df.loc[:, ['A', 'C']]

Unnamed: 0,A,C
0,1,False
1,2,True
2,3,False
3,4,True


The method **`df.head(n)`** returns the first `n` rows of the data frame, regardless of how the rows are indexed. The default value of `n` is 5, so calling `df.head()` is the same as calling `df.head(5)`. 

In [16]:
df.head(2)

Unnamed: 0,A,B,C
0,1,one,False
1,2,two,True
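
To illustrate that `head` counts rows by position and not by label, here is a small sketch with a hypothetical data frame whose index labels are 10, 20, 30 and 40:

```python
import pandas as pd

# Hypothetical example with row labels that do not start at 0
df_labeled = pd.DataFrame({'A': [1, 2, 3, 4]}, index=[10, 20, 30, 40])

first_two = df_labeled.head(2)
print(first_two.index.tolist())   # [10, 20]

# With fewer than 5 rows, head() simply returns all of them
print(len(df_labeled.head()))     # 4
```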


#### Example 

We have the following information that needs to be organised. 

In [24]:
prices = [2, 1, 3, 2.5, 1.5 ]
items  = ['bread', 'milk', 'chips', 'bananas', 'carrots']

We will put it in a data frame called `df_shopping_list`.

In [25]:
df_shopping_list = pd.DataFrame( { 'Item': items, 'Price (Euro)': prices } )
df_shopping_list

Unnamed: 0,Item,Price (Euro)
0,bread,2.0
1,milk,1.0
2,chips,3.0
3,bananas,2.5
4,carrots,1.5


We now want to get the first four rows of this data frame. The resulting data frame will be called `df_shopping_short`.

In [27]:
df_shopping_short = df_shopping_list.head(4)
df_shopping_short

Unnamed: 0,Item,Price (Euro)
0,bread,2.0
1,milk,1.0
2,chips,3.0
3,bananas,2.5


We can now select the `'Item'` column from this new data frame. The result of this expression is a `Series` object. 

In [28]:
df_shopping_short['Item']

0      bread
1       milk
2      chips
3    bananas
Name: Item, dtype: object

We specifically want to know how much milk and chips cost, so we select only the rows with indices 1 and 2. Because we want the result to be a data frame, we do the following: 

In [29]:
df_shopping_short.loc[1:2]

Unnamed: 0,Item,Price (Euro)
1,milk,1.0
2,chips,3.0


### Modifying a DataFrame

`DataFrame` objects can be changed with pretty basic operations. 

#### Changing a value in a DataFrame 

To change a value at a particular location of a data frame, you can use the following expression: 

`df.loc[row, column] = new_value`

In [30]:
df.loc[0,'B'] = 'ACE'
df

Unnamed: 0,A,B,C
0,1,ACE,False
1,2,two,True
2,3,three,False
3,4,four,True


#### Changing multiple values in a slice of a DataFrame

Modifying values in slices of a data frame takes a few more steps. To start with, it is recommended to explicitly create a copy of the slice whose values you want to modify, so that it is unambiguous that your changes apply to the copy and not to the original data frame. 

If we take our original data frame `df`

In [31]:
df

Unnamed: 0,A,B,C
0,1,ACE,False
1,2,two,True
2,3,three,False
3,4,four,True


and then take a slice of it

In [32]:
df_slice = df.loc[1:2].copy()
df_slice

Unnamed: 0,A,B,C
1,2,two,True
2,3,three,False


You can then modify the first row of the slice: 

In [33]:
df_slice.loc[1, 'A'] = 8
df_slice.loc[1, 'B'] = "EIGHT" 
df_slice

Unnamed: 0,A,B,C
1,8,EIGHT,True
2,3,three,False


This does not affect the original data frame. 

In [34]:
df

Unnamed: 0,A,B,C
0,1,ACE,False
1,2,two,True
2,3,three,False
3,4,four,True
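
The example above changes one value at a time. Several values of a copied slice can also be changed in a single assignment by giving `.loc` a row slice and a column label. This is a minimal sketch; the new values 20 and 30 are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': ['one', 'two', 'three', 'four'],
                   'C': [False, True, False, True]})

# Work on an explicit copy of rows 1 through 2
df_slice = df.loc[1:2].copy()

# Assign one new value per selected row in column 'A'
df_slice.loc[1:2, 'A'] = [20, 30]

print(df_slice['A'].tolist())   # [20, 30]
print(df['A'].tolist())         # [1, 2, 3, 4] (the original is unchanged)
```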


#### Example

If we continue with the shopping list example: 

In [35]:
prices = [2, 1, 3, 2.5, 1.5 ]
items  = ['bread', 'milk', 'chips', 'bananas', 'carrots']
df_shopping_list = pd.DataFrame( { 'Item': items, 'Price (Euro)': prices } )
df_shopping_list

Unnamed: 0,Item,Price (Euro)
0,bread,2.0
1,milk,1.0
2,chips,3.0
3,bananas,2.5
4,carrots,1.5


We can now change the price of bananas to 2.0 Euros instead of 2.5.

In [36]:
df_shopping_list.loc[3, 'Price (Euro)'] = 2.0
df_shopping_list

Unnamed: 0,Item,Price (Euro)
0,bread,2.0
1,milk,1.0
2,chips,3.0
3,bananas,2.0
4,carrots,1.5


Say we don't need all of these items and only need bread, milk and chips. We can then create a shorter shopping list. We also notice that the price of chips has gone down from 3.0 to 2.7 euros, so we change that as well. 

In [37]:
df_shorter_list = df_shopping_list.loc[0:2].copy()
df_shorter_list.loc[2, 'Price (Euro)'] = 2.7

df_shorter_list

Unnamed: 0,Item,Price (Euro)
0,bread,2.0
1,milk,1.0
2,chips,2.7


### Working with large datasets from a CSV file

The examples you just saw used very small datasets that were just made up. In reality you will most likely be working with much larger datasets that you will either need to read from your computer, or download from a URL. For the sake of this example, you will get a dataset provided by us.

The example file is named `countries.csv`.
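
CSV files are usually loaded with `pd.read_csv`. The sketch below is only an illustration: the contents of `countries.csv` are not shown in this section, so it reads a small made-up CSV text from memory instead. The column names `country` and `population` and all values are hypothetical stand-ins.

```python
import io
import pandas as pd

# Hypothetical stand-in for the contents of a CSV file;
# the real columns of countries.csv may be different.
csv_text = """country,population
Country A,100
Country B,250
"""

# With a real file on disk you would pass its path instead, for example:
# df_countries = pd.read_csv('countries.csv')
df_countries = pd.read_csv(io.StringIO(csv_text))

print(df_countries.shape)            # (2, 2)
print(list(df_countries.columns))    # ['country', 'population']
```

After loading, `head()` and `info()` work on the result just as they do on the small data frames above.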