## Coding Basics for Researchers - Day 1

*Notebook by [Pedro V Hernandez Serrano](https://github.com/pedrohserrano)*


---
# 1. Python Building Blocks
* [1.1. Python Basic Commands](#1.1)
* [1.2. Strings](#1.2)
* [1.3. Lists](#1.3)

---

Guido van Rossum | Monty Python
- | - 
![](https://gvanrossum.github.io/images/guido-portrait-dan-stroud.jpg) | ![](https://upload.wikimedia.org/wikipedia/en/c/cd/Monty_Python%27s_Flying_Circus_Title_Card.png)


## A bit of history

#### Python starts with ABC.

- ABC is a general-purpose programming language and programming 
environment that was developed at the CWI (Centrum Wiskunde & Informatica) 
in Amsterdam, the Netherlands.

- The greatest achievement of ABC was to influence the design of Python.  
Python emphasizes the DRY (Don’t Repeat Yourself) principle and readability.

- Python was conceptualized in the late 1980s. At that time, Guido van Rossum 
was working at the CWI on a project called Amoeba, a distributed operating system.

- Python was designed as a simple scripting language that possessed some of 
ABC's better properties, but without its problems.

-  So, what about the name "Python": Most people think about snakes, but the 
name has something to do with excellent British humour. A show called Monty Python's Flying Circus was the culprit.

## Tutorials for Learning Python
    
- [Codecademy](https://www.codecademy.com/tracks/python) is great for beginner levels.
- There is also the [Official Beginners Guide](https://wiki.python.org/moin/BeginnersGuide).
- [Learn Python the Hard Way](https://learnpythonthehardway.org/book/) is a great tutorial for a more in-depth overview.
    - It isn't actually particularly hard, although note that the currently available version is in Python 2.
- [Whirlwind Tour of Python](https://github.com/jakevdp/WhirlwindTourOfPython) is a free collection of Jupyter notebooks that takes you through Python. 
- [LeetCode](https://leetcode.com/) is a place for more intense technical coding questions and challenges (geared towards industry interviews).

## Getting Un-Stuck
At some point, you will get stuck. It happens. The internet is your friend.
    
If you get an error, or aren't sure how to proceed, use {your favourite search engine} with specific search terms relating to what you are trying to do. Sometimes this just means searching the error that you got.
   
You are likely to find responses on [StackOverflow](https://stackoverflow.com), which is basically a forum for programming questions and a good place to find answers.

## Managing Cells in the Notebooks

__Add__ a new cell to the notebook by:
 - click the + button on the toolbar
 - `Insert -> Insert Cell Above` or `ESC-A`
 - `Insert -> Insert Cell Below` or `ESC-B`
 
__Delete__ a cell by selecting it and:
 - click the scissors button on the toolbar
 - `Edit -> Delete cells` or `ESC-DD`

__Undelete__ the last deleted cell:
- `Edit -> Undo Delete cells` or `ESC-Z`

Each cell has a __cell history__ associated with it. Use `CMD-Z` to step back through previous cell contents.
 
__Reorder__ cells by:
- moving them up and down the notebook using the up and down arrows on the toolbar
- `Edit -> Move Cell Up` or `Edit -> Move Cell Down` 
- cutting and pasting them:
 - `Edit -> Cut Cells`, then `Edit -> Paste Cells Above` or `Edit -> Paste Cells Below`
 - on the toolbar, `Cut selected cells` then `Paste selected cells`

Copy and cut selected cells from the menu or with keyboard shortcuts:
- `Edit -> Copy Cells` or `ESC-C`.
- `Edit -> Cut Cells` or `ESC-X`.

## Packages

Packages are basically just collections of code. The Anaconda distribution comes with all the core packages you will need for this class. 
  
To get other packages, Anaconda comes with [conda](https://conda.io/docs/using/pkgs.html), a package manager that can download and install other packages.

---
## 1.1. Python basic commands
<a id="1.1"></a>

Many of the things I used to use a calculator for, I now use Python for:

In [1]:
2+2

4

In [2]:
(50-5*6)/4

5.0

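There are some gotchas compared to using a normal calculator. In Python 3, dividing two integers with `/` always returns a floating point number, even when you might expect a whole number: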

In [3]:
7/3

2.3333333333333335

In Python 2, `7/3` returned the integer `2`, so older code often converts one of the integers to a floating point number before dividing. In Python 3 this is no longer necessary, and the result is the same:

In [4]:
7/3.0

2.3333333333333335

In [5]:
7/float(3)

2.3333333333333335

Checking the datatype

In [6]:
type(7/3)

float

In the last few lines, we have sped by a lot of things that we should stop for a moment and explore a little more fully. We've seen, however briefly, two different data types: 
- **integers**, also known as *whole numbers* to the non-programming world, and 
- **floating point numbers**, also known as *decimal numbers* to the rest of the world.
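
The difference between the two types can be checked directly. The short sketch below uses only built-in operators; the variable names are illustrative and not part of the original notebook.

```python
# Illustrative names, not from the original notebook
whole = 7 // 3       # floor division keeps only the whole part (an integer)
remainder = 7 % 3    # modulo gives what is left over (an integer)
ratio = 7 / 3        # true division always returns a floating point number

print(type(whole), whole)
print(type(remainder), remainder)
print(type(ratio), ratio)
```

Floor division (`//`) is the operator to reach for when a whole-number result is really what is wanted.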


Besides doing calculations, it is also important to assign values to names:
- **Variables** are names for values.
- In Python the `=` symbol assigns the value on the right to the name on the left. (Similar to `<-` in R)
- A variable is created when a value is assigned to it.

In [11]:
width = 20
length = 30
area = length*width

In [12]:
print(area)

600


But if you try to access a variable that you haven't yet defined, you get an error:


```Python
> volume
```

```Python
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-7-6211527fe2c2> in <module>
----> 1 volume

NameError: name 'volume' is not defined
```



Variables must be created before they are used

In [13]:
depth = 10
volume = area*depth

In [14]:
print(volume)

6000


You can name a variable *almost* anything you want. The name must start with a letter or an underscore ("\_") and can contain only letters, digits, and underscores. Certain words, however, are reserved for the language (the list below is for Python 3):

    False, None, True, and, as, assert, async, await, break, class, continue,
    def, del, elif, else, except, finally, for, from, global, if, import, in,
    is, lambda, nonlocal, not, or, pass, raise, return, try, while, with, yield

Trying to define a variable using one of these will result in a syntax error:

```Python
return = 0

File "<ipython-input-12-c7a05f6eb55e>", line 1
    return = 0
           ^
SyntaxError: invalid syntax
```
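
The exact list of reserved words depends on the Python version you are running. A small sketch using the standard library `keyword` module shows how to check it yourself; the two names tested are just examples.

```python
import keyword

# The full list of reserved words for the running interpreter
print(keyword.kwlist)

# Check individual candidate names (illustrative examples)
print(keyword.iskeyword('return'))
print(keyword.iskeyword('volume'))
```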

The [Python Tutorial](http://docs.python.org/2/tutorial/introduction.html#using-python-as-a-calculator) has more on using Python as an interactive shell. The [IPython tutorial](http://ipython.org/ipython-doc/dev/interactive/tutorial.html) makes a nice complement to this, since IPython has a much more sophisticated interactive shell.

---
## 1.2. Strings
<a id="1.2"></a>

Strings are sequences of characters, and can be defined using either single quotes

In [15]:
'Hello, Maastricht!'

'Hello, Maastricht!'

or double quotes

In [16]:
"Hello, Maastricht!"

'Hello, Maastricht!'

But not both at the same time, unless you want one of the symbols to be part of the string.

In [17]:
"She's a Researcher"

"She's a Researcher"

In [18]:
'She asked, "How are you today?"'

'She asked, "How are you today?"'

Just like the other two data objects we're familiar with (ints and floats), you can assign a string to a variable

In [19]:
greeting = "Hello, Maastricht! "

In [20]:
subject = "She's a Researcher"

The **print** function is often used for printing character strings:

In [21]:
example_text = greeting + subject

print(example_text)

Hello, Maastricht! She's a Researcher


In [22]:
type(example_text)

str

Use an index to get a single character from a string.
* The characters (individual letters, numbers) in a string are ordered. We can then treat the string as a list of characters.
* Each position in the string is given a number called an **index**.
* Indices are numbered from 0.
* Use the position's index in square brackets to get the character at that position.

![](https://swcarpentry.github.io/python-novice-gapminder/fig/2_indexing.svg)

In [23]:
# assign variable 
atom_name = "helium"

#print index 0 position
print(atom_name[0])

h


Use a slice to get a substring

* A part of a string is called a substring. A substring can be as short as a single character.
* An item in a list is called an element. Whenever we treat a string as if it were a list, the string’s elements are its individual characters.
* A slice is a part of a string.
* We take a slice by using `[start:stop]`, where `start` is replaced with the index of the first element we want and `stop` is replaced with the index of the element just after the last element we want.

In [24]:
# print the substring made of the first 5 characters
print(example_text[0:5])

Hello
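
A few more slicing forms are worth trying. This is a small sketch on the same `atom_name` string used earlier; the variable names holding the results are illustrative.

```python
atom_name = "helium"

first_three = atom_name[0:3]   # characters at index 0, 1 and 2
same_three = atom_name[:3]     # an omitted start means "from the beginning"
last_three = atom_name[3:]     # an omitted stop means "up to the end"
last_char = atom_name[-1]      # negative indices count from the end

print(first_three, same_three, last_three, last_char)
```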


The `print` function can also display values of several data types, separated by commas:

In [25]:
print ("The area is ",area, volume, 10, 5*4, example_text)

The area is  600 6000 10 20 Hello, Maastricht! She's a Researcher


Also possible with the format method

In [26]:
print ("The area is {} and volume is {}".format(area, volume))

The area is 600 and volume is 6000


In the above snippet, the number 600 (stored in the variable `area`) is converted into a string before being printed out.

If you have a lot of words to concatenate together, there are other, more efficient ways to do this. But this is fine for linking a few strings together.
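
Two of those other ways are sketched below: f-strings (available from Python 3.6) and the `join` method. The values repeat the ones used above; the names `message`, `words` and `sentence` are illustrative.

```python
area = 600
volume = 6000

# f-strings embed values directly in the text
message = f"The area is {area} and volume is {volume}"
print(message)

# str.join links many strings at once, here with a space in between
words = ["Hello,", "Maastricht!", "She's", "a", "Researcher"]
sentence = " ".join(words)
print(sentence)
```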

In [27]:
# Number of characters in the text
len(example_text) 

37

Use the `split` method to get the individual words

In [28]:
split_text = example_text.split(' ') # Return a list of the words in example_text, separating by ' '.

In [29]:
print(split_text)

['Hello,', 'Maastricht!', "She's", 'a', 'Researcher']


In [30]:
len(split_text)

5

More advanced functionalities allow us to find different types of words

In [31]:
[w for w in split_text if len(w) < 6] # Words in example_text that are shorter than 6 characters

["She's", 'a']

In [32]:
[w for w in split_text if w.istitle()] # Title-cased words in example_text

['Hello,', 'Maastricht!', 'Researcher']

In [33]:
[w for w in split_text if w.endswith('!')] # Words in example_text that end in '!'

['Maastricht!']

All the string tricks could be used for:
- Data cleaning of clinical records
- Text analysis on policy documents
- Data analysis of gene sequence
- ...etc.
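
As a small illustration of the gene sequence case, the sketch below cleans a made-up sequence string and counts one of its bases. The raw text is hypothetical and only string methods are used.

```python
# Hypothetical raw record: mixed case with stray whitespace
raw_sequence = "  acttGCTTgac \n"

# Clean the text: drop surrounding whitespace and use one letter case
sequence = raw_sequence.strip().upper()

print(sequence)
print(len(sequence))               # number of bases
print(sequence.count("T"))         # how often the base T occurs
print(sequence.startswith("ACT"))  # does the sequence start with ACT?
```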

---
## 1.3. Lists
<a id="1.3"></a>

Very often in a programming language, one wants to keep a group of similar items together. 
The `split_text` object we created in the example above is an instance of a Python data type called a **list**.

In [34]:
days_of_the_week = ["Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday"]

In [35]:
type(days_of_the_week)

list

You can access members of the list using the **index** of that item:

In [36]:
# Index the 3rd element of the list, then
# index the 2nd element of the word
days_of_the_week[2][1]

'u'

Python lists, like C, but unlike Fortran, use 0 as the index of the first element of a list. Thus, in this example, the 0 element is "Sunday", 1 is "Monday", and so on. If you need to access the *n*th element from the end of the list, you can use a negative index. For example, the -1 element of a list is the last element:

In [37]:
print(days_of_the_week[-2] == days_of_the_week[5])

True


You can add additional items to the list using the `.append()` method:

In [38]:
# set a list of elements
languages = ["Java","R","C++"]

# append a new element
languages.append("Python")

# print the object 
print(languages)

['Java', 'R', 'C++', 'Python']


We can also remove an element, either by value with `.remove()` or by position with `del`:

In [39]:
languages.remove('Java')

In [40]:
languages

['R', 'C++', 'Python']

In [41]:
del languages[-2]

In [42]:
languages

['R', 'Python']

The **range()** function is a convenient way to make sequences of consecutive numbers. In Python 3 it returns a `range` object rather than a list:

In [43]:
range(10)

range(0, 10)

Note that range(n) starts at 0 and gives the consecutive integers less than n. Wrap it in `list()` to see the numbers. If you want to start at a different number, use range(start,stop)

In [44]:
list(range(2,8))

[2, 3, 4, 5, 6, 7]

Lists do not have to hold the same data type. For example,

In [45]:
["Today",7,99.3,"", languages, days_of_the_week]

['Today',
 7,
 99.3,
 '',
 ['R', 'Python'],
 ['Sunday',
  'Monday',
  'Tuesday',
  'Wednesday',
  'Thursday',
  'Friday',
  'Saturday']]

However, it's good (but not essential) to use lists for similar objects that are somehow logically connected. If you want to group different data types together into a composite data object, it's best to use **tuples**, which we will learn about below.
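
For a first impression of tuples, here is a minimal sketch. The record and its field meanings are made up for illustration.

```python
# A hypothetical record mixing data types: (name, age, height in metres)
participant = ("Ada", 36, 1.68)

print(participant[0])    # tuples are indexed like lists
print(len(participant))

# Unpacking assigns each element to its own name
name, age, height = participant
print(name, age, height)

# Unlike lists, tuples cannot be changed after creation:
# participant[1] = 37 would raise a TypeError
```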

You can find out how long a list is using the **len()** function. The built-in `help()` function shows its documentation:

In [46]:
help(len)

Help on built-in function len in module builtins:

len(obj, /)
    Return the number of items in a container.



- Iteration in Python  
One of the most useful things you can do with lists is to *iterate* through them, i.e. to go through each element one at a time. To do this in Python, we use the **for** statement:

In [47]:
for day in days_of_the_week:
    print (day)

Sunday
Monday
Tuesday
Wednesday
Thursday
Friday
Saturday


This code snippet goes through each element of the list called **days_of_the_week** and assigns it to the variable **day**. It then executes everything in the indented block (in this case only one line of code, the print statement) using those variable assignments. When the program has gone through every element of the list, it exits the block.

(Almost) every programming language defines blocks of code in some way. In Fortran, one uses END statements (ENDDO, ENDIF, etc.) to define code blocks. In C, C++, and Perl, one uses curly braces {} to define these blocks.

Python uses a colon (":") followed by an increased indentation level to define code blocks. All consecutive lines at that indentation level belong to the same block. In the above example the block was only a single line, but blocks can be longer as well.
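
A longer block could look like the sketch below. The variable `statement` and the final message are illustrative additions.

```python
days_of_the_week = ["Sunday", "Monday", "Tuesday", "Wednesday",
                    "Thursday", "Friday", "Saturday"]

for day in days_of_the_week:
    # Both indented lines belong to the loop and run once per day
    statement = "Today is " + day
    print(statement)

# This line is not indented, so it runs once, after the loop has finished
print("Done:", len(days_of_the_week), "days printed")
```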

---

#### <i style="color:red">**EXERCISES - **</i>

+ _1. If you assign `a = 123`, what happens if you try to get the second digit of `a` via index `a[1]`?_
___

+ _2. Which is a better variable name, m, min, or minutes? Why?   
Hint: think about which code you would rather inherit from someone who is leaving the lab:_

```python
1. ts = m * 60 + s
2. tot_sec = min * 60 + sec
3. total_seconds = minutes * 60 + seconds
```
___

+ _3. What do you think is the error in the code, and how would you fix it?_

```python
atom_name = 'carbon'
print('atom_name[1:3] is', atom_name[1:3])
```
___

+ _3. What type of value (integer, floating point number, or character string) would you use to represent each of the following? 
Try to come up with more than one good answer for each problem. For example, in # 1, when would counting days with a floating point variable make more sense than using an integer?_

1. Number of days since the start of the year.
2. Serial number of a piece of lab equipment.
3. A lab specimen’s age
4. Current population of a city.
5. Average population of a city over time.

---

+ _4. Which of the following will return the floating point number 2.0? Note: there may be more than one right answer._

```python
1. first + float(second)
2. float(second) + float(third)
3. first + int(third)
4. first + int(float(third))
5. int(first) + int(float(third))
6. 2.0 * second
```
---

+ _5. You want to select a random character from a string_
```python
bases = 'ACTTGCTTGAC'
```

    + _5.1. Which [standard library](https://docs.python.org/3/library/) module could help you?_

    + _5.2. Which function would you select from that module?_

    + _5.3. Try to write a program that uses the function._
___

+ _6. When a colleague of yours types help(math), Python reports an error:_

```python
NameError: name 'math' is not defined
```
    How would you help her?
   
---

+ _7. Given the following:_

```python
print('string to list:', list('silver'))
print('list to string:', ''.join(['g', 'o', 'l', 'd']))
```


+ _What does `list('some string')` do?_

+ _What does `'-'.join(['x', 'y', 'z'])` generate?_
___

+ _8. How many words are in the following text? Use Python to find out:_


```Python 
The future is in Maastricht:
UM multidisciplinary collaborations contribute to solving major societal issues within our primary research themes. We develop new methods to make plastics from organic materials, but we also conduct research into migration, and look into methods to get more people interested taking the necessary financial preparations for their retirement. Whenever possible, UM research is translated into economic, financial, or social value. UM participates in centres of excellence, both technological and social, to allow scientific discoveries to be swiftly converted into practical applications. What is more, research is integrated into education at every level. Our educational method, Problem-Based Learning, lays the groundwork for students to embrace research and the scientific method from the very first day of their studies.
```

---
# 2. Pandas Fundamentals
* [2.1. Dealing with different data sources](#2.1)
* [2.2. Data structures](#2.2)

---

![](http://warscapes.com/sites/default/files/assam_1.png)

## A research use case

#### Insurgency-Civilian Relations & EU Policy

- International Relations, Peace and Policy Studies
- PhD proposal, S. Roerigh 2020
- Understand patterns in international relations in order to make effective policies and strategic decisions.
- Normally studied: 
    - the social order, and
    - the lived experiences of civilians living under rebel rule
- Proposed:
    - how and when such factors are perceived, discussed and considered in third-party policy decisions
    - what impact third-party policy decisions have on rebel-civilian relations. 
- It contributes to a critical perspective and discussion on the motivations, interests, and effectiveness of third-party policymakers

---
## 2.1. Dealing with different data sources
<a id="2.1"></a>

* Pandas is a widely used Python library for handling data, particularly tabular data.
* Borrows many features from R's dataframes:
  - A dataframe is a two-dimensional table whose columns have names and can hold different data types.
* Load it with `import pandas as pd`. The alias `pd` is commonly used for Pandas.
* Pandas can read and write a wide range of file formats.

![](https://pandas.pydata.org/docs/_images/02_io_readwrite.svg)

In [1]:
import pandas as pd

In [2]:
path = '../data/'

* The columns in a dataframe are the observed variables, and the rows are the observations.
* Pandas uses backslash `\` to show wrapped lines when output is too wide to fit the screen.

**File Not Found:**

> Our lessons store their data files in a `data` sub-directory, which is why the path to the file is `data/...csv`. If you forget to include `data/`, or if you include it but your copy of the file is somewhere else, you will get a runtime error that ends with a line like this:

`ERROR`: _OSError: File b'gapminder_gdp_oceania.csv' does not exist_
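
One way to avoid this error is to check the path before reading the file. The sketch below uses the standard library `pathlib` module with the folder and file name used in this notebook; adjust both to wherever your own copy is stored.

```python
from pathlib import Path

# Folder and file name as used in this notebook (adjust to your own setup)
file_path = Path('../data/') / 'replicationdatajpr-oldstata.dta'

print(Path.cwd())          # the folder Python is currently working in
print(file_path)           # the path that will be searched
print(file_path.exists())  # True only if the file is really there
```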

#### Reputation of Terror Groups (RTG) Dataset
Description: The dataset contains all domestic terrorist groups, as defined in Enders et al. (2011) and based on the Global Terrorism Database, with more than 5 terrorist attacks from 1980 to 2011. The data is in group name - year format. The data codes terrorist groups' actions that can build reputation among their constituency and out-groups. Researchers can find originally coded variables on building positive and negative reputation among the audience, as well as existing group-level variables.

[Link to data](http://www.efetokdemir.com/data.html)

In [3]:
# READING A STATA FILE
rtg_table = pd.read_stata(path+'replicationdatajpr-oldstata.dta')

In [4]:
rtg_table.head()

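| | year | gname | ffund | childrec | frec | rebel | parterr | terpwing | teraff | govcaus | ... | nat | civcausreal | civcauseffreal | outnegrep | cleavage | reputation | last | counter | endedtype | endedtype2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1989 | 1 May Group | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0.0 | 0.25 | 1.0 | 1.0 | 1.0 | 0.0 | 3.0 | 1.0 | 0.0 | 0.0 |
| 1 | 1991 | 1 May Group | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0.0 | 2.333333 | 0.0 | 0.0 | 1.0 | 0.0 | 3.0 | 2.0 | 0.0 | 0.0 |
| 2 | 1992 | 1 May Group | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 3.0 | 3.0 | 1.0 | 1.0 |
| 3 | 1989 | 16 January Organization for the Liberation of ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | | 22.625 | 0.0 | 0.0 | | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 4 | 1983 | 2 April Group | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |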


#### The Foundations of Rebel Group Emergence (FORGE) Dataset

It provides information on the origins of violent non-state actors engaged in armed conflict against their government resulting in 25+ yearly battle deaths, active between 1946 and 2011. The unit of observation in this dataset is the rebel group organization. We also include information on the dyad and conflict in which these groups are participants for ease of integration with various Uppsala Conflict Data Program (UCDP) datasets. We draw upon the population of groups included in the Non-State Actor database described in greater detail here:
    
[Link to data](http://ksgleditsch.com/eacd.html)

In [5]:
# READING A TAB-DELIMITED ASCII FILE WITH THE CSV FUNCTION
forge_table = pd.read_csv(path+'nsa_v3.4_21November2013.asc', delimiter='\t')

In [6]:
forge_table.head()

| | obsid | ucdpid | dyadid | side_a | acr | side_b | startdate | enddate | oldid | oldconfid | ... | rsupname | gov.support | gtypesup | gsupname | govextpart | type.of.termination | victory.side | prevactive | prevact.ref | oldobsid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NSA.3.4-1 | 1 | 462 | Bolivia | BOL | Popular Revolutionary Movement | 1946-06-01 | 1946-07-21 | 1010 | 1010.1 | ... | | no | | | no | 4.0 | 2.0 | 0 | | NSA.3.3-1 |
| 1 | NSA.3.4-4 | 1 | 463 | Bolivia | BOL | MNR | 1952-04-09 | 1952-04-12 | 1010 | 1010.2 | ... | | no | | | no | 4.0 | 2.0 | 0 | | NSA.3.3-4 |
| 2 | NSA.3.4-7 | 1 | 464 | Bolivia | BOL | ELN | 1967-03-01 | 1967-10-16 | 1010 | 1010.3 | ... | Cuba | explicit | military | USA | no | 4.0 | 1.0 | 0 | | NSA.3.3-7 |
| 3 | NSA.3.4-10 | 2 | 654 | France | FRN | Khmer Issarak | 1946-08-01 | 1953-11-09 | 1020 | 1020.0 | ... | Thailand | explicit | military | USA | no | 7.0 | | 0 | | NSA.3.3-10 |
| 4 | NSA.3.4-13 | 3 | 466 | China | CHN | Peoples Liberation Army | 1946-01-01 | 1949-10-1 | 1030 | 1030.0 | ... | USSR | explicit | military | USA | no | 4.0 | 2.0 | 0 | | NSA.3.3-13 |
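
The role of the delimiter can be tried out without any file on disk. In the sketch below, a short tab-separated text (a few values borrowed from the table above) stands in for the file; `io.StringIO` makes the text behave like a file. Note that `sep` and `delimiter` are two names for the same `read_csv` argument.

```python
import io
import pandas as pd

# A short tab-separated text standing in for a file on disk
raw_text = ("side_a\tside_b\tstartdate\n"
            "Bolivia\tMNR\t1952-04-09\n"
            "Bolivia\tELN\t1967-03-01\n")

# sep='\t' tells read_csv that columns are separated by tabs, not commas
example_table = pd.read_csv(io.StringIO(raw_text), sep='\t')

print(example_table.shape)
print(example_table.columns.tolist())
```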


#### The CEPS EurLex dataset: EU laws from 1952-2019 with full text and 22 variables

The dataset contains 142,036 EU laws, almost the entire corpus of the EU's digitally available legal acts passed between 1952 and 2019. It encompasses the three types of legally binding acts passed by the EU institutions: 102,304 regulations, 4,070 directives, and 35,798 decisions in the English language. The dataset was scraped from the official EU legal database (Eur-lex.eu) and transformed into machine-readable CSV format with the programming languages R and Python. 
The dataset was collected by the Centre for European Policy Studies (CEPS) for the TRIGGER project (https://trigger-project.eu/). We hope that it will facilitate future quantitative and computational research on the EU. 

[Link to data](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/0EGYWY)

In [7]:
# READING AN EXCEL FILE
eurlex_table = pd.read_excel(path+'EurLex_all_no_text.xlsx')

In [8]:
eurlex_table.head()

| | CELEX | Act_name | Act_type | Status | EUROVOC | Subject_matter | Treaty | Legal_basis_celex | Authors | Procedure_number | ... | Temporal_status | Act_cites | Cites_links | Act_ammends | Ammends_links | Eurlex_link | ELI_link | Proposal_link | Oeil_link | Additional_info |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 32019D0276 | Decision (EU) 2019/276 of the European Parliam... | Decision | In Force | aid to refugees; budget appropriation; EC gene... | cooperation policy;  budget;  EU finance;  int... | TFEU | 32013Q1220(01) | European Parliament; European Council | | ... | | 32013R1311 | http://data.europa.eu/eli/reg/2013/1311/oj | | | | | | | |
| 1 | 32019D0277 | Decision (EU) 2019/277 of the European Parliam... | Decision | In Force | aid to catastrophe victims; emergency aid; EC ... | cooperation policy;  EU finance;  budget;  det... | TFEU | 32002R2012; 32013Q1220(01) | European Parliament; European Council | | ... | | 32013R1311 | http://data.europa.eu/eli/reg/2013/1311/oj | | | | | | | |
| 2 | 32019D0275 | Decision (EU) 2019/275 of the European Parliam... | Decision | In Force | professional reintegration; Attica; EGF; EC ge... | employment;  regions of EU Member States;  EU ... | TFEU | 32013Q1220(01); 32013R1309 | European Parliament; European Council | | ... | | 32013R1311 | http://data.europa.eu/eli/reg/2013/1311/oj | | | | | | | |
| 3 | 32018D1859 | Decision (EU) 2018/1859 of the European Parlia... | Decision | In Force | commitment appropriation; Latvia; payment appr... | budget;  Europe;  EU finance;  cooperation pol... | TFEU | 32002R2012; 32013Q1220(01) | European Council; European Parliament | | ... | | 32018D508; 32013R1311 | http://data.europa.eu/eli/dec/2018/508/oj; htt... | | | | | | | |
| 4 | 32018D1720 | Decision (EU) 2018/1720 of the European Parlia... | Decision | In Force | Northern Portugal; Portugal; employment aid; e... | regions of EU Member States;  Europe;  economi... | TFEU | 32013Q1220(01); 32013R1309 | European Council; European Parliament | | ... | | 32013R1311 | http://data.europa.eu/eli/reg/2013/1311/oj | | | | | | | |


---
## 2.2. Data structures
<a id="2.2"></a>

A DataFrame is a collection of Series: the DataFrame is the way Pandas represents a table, and a Series is the data structure Pandas uses to represent a column.

Pandas is built on top of the NumPy library, which in practice means that most of the methods defined for NumPy arrays apply to Pandas Series/DataFrames.

What makes Pandas so attractive is the powerful interface to access individual records of the table, proper handling of missing values, and relational database operations between DataFrames.

In [9]:
# Basic representation of a dataframe
dictionary = {'age': [20, 30], 'height': [1.80, 1.60], 'course':['Python', None]}

# Define a dataframe
df = pd.DataFrame(data=dictionary)

# Print the dataframe
df

| | age | height | course |
|---|---|---|---|
| 0 | 20 | 1.8 | Python |
| 1 | 30 | 1.6 | |
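
The relation between a DataFrame, its Series, and missing values can be explored with a few lines. This is a minimal sketch that rebuilds the same small dataframe so it can run on its own; the name `ages` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'age': [20, 30],
                   'height': [1.80, 1.60],
                   'course': ['Python', None]})

# Selecting one column by name returns a Series
ages = df['age']
print(type(ages))

# Series support NumPy-style operations such as mean()
print(ages.mean())

# isna() marks missing values; sum() then counts them per column
print(df.isna().sum())
```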


Use the `DataFrame.info()` method to find out more about a dataframe

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   age     2 non-null      int64  
 1   height  2 non-null      float64
 2   course  1 non-null      object 
dtypes: float64(1), int64(1), object(1)
memory usage: 176.0+ bytes


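* This is a DataFrame with two rows, indexed from 0 to 1
* It has three columns named `age`, `height` and `course`
* `age` holds 64-bit integers, `height` holds 64-bit floating point numbers, and `course` holds generic Python objects (here, strings)
* `age` and `height` have no missing values; `course` has one missing value
* Uses at least 176 bytes of memory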

The `DataFrame.columns` attribute stores information about the dataframe's columns

* Note that this is data, _not_ a method. (it doesn't have parentheses)
  - Like `math.pi`
  - So do not use `()` to call it

In [11]:
df.columns

Index(['age', 'height', 'course'], dtype='object')

Use `DataFrame.T` to transpose a dataframe:

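* Sometimes we want to treat columns as rows and vice versa.
* Transpose (written `.T`) can avoid copying the data when all columns share one data type, in which case it just changes the program's view of it. With mixed data types, as in this example, pandas makes a copy.
* Like `columns`, it is a member variable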


In [12]:
df.T

| | 0 | 1 |
|---|---|---|
| age | 20 | 30.0 |
| height | 1.8 | 1.6 |
| course | Python | |


Use `DataFrame.describe()` to get summary statistics about data

`DataFrame.describe()` gets the summary statistics of only the columns that have numerical data. All other columns are ignored, unless you use the argument `include = 'all'` 

In [13]:
df.describe()

| | age | height |
|---|---|---|
| count | 2.0 | 2.0 |
| mean | 25.0 | 1.7 |
| std | 7.071068 | 0.141421 |
| min | 20.0 | 1.6 |
| 25% | 22.5 | 1.65 |
| 50% | 25.0 | 1.7 |
| 75% | 27.5 | 1.75 |
| max | 30.0 | 1.8 |
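
To include the non-numerical `course` column as well, pass `include='all'`. The sketch below rebuilds the small dataframe so it can run on its own; the name `summary` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'age': [20, 30],
                   'height': [1.80, 1.60],
                   'course': ['Python', None]})

# include='all' adds text columns, with rows such as unique, top and freq
summary = df.describe(include='all')
print(summary)
```

Statistics that do not apply to a column (for example the mean of `course`) are shown as missing values.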


Make use of a list to select desired columns and create a subset

In [14]:
# Desired columns
my_columns = ['age','course']

# Selection of subset
df[my_columns]

| | age | course |
|---|---|---|
| 0 | 20 | Python |
| 1 | 30 | |


Use `DataFrame.to_csv()` to generate a CSV file as result of the analysed dataframe

In [15]:
df[my_columns].to_csv('example.csv', index=False)
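
To see what ends up in the file, `to_csv` can also be called without a file name, in which case it returns the CSV text. The sketch below rebuilds the small dataframe and reads the text back; the names `csv_text` and `restored` are illustrative.

```python
import io
import pandas as pd

df = pd.DataFrame({'age': [20, 30],
                   'height': [1.80, 1.60],
                   'course': ['Python', None]})

# With no file name, to_csv returns the CSV text instead of writing a file
csv_text = df[['age', 'course']].to_csv(index=False)
print(csv_text)

# Reading the text back gives a dataframe with the same columns
restored = pd.read_csv(io.StringIO(csv_text))
print(restored.columns.tolist())
```

Setting `index=False` leaves the row labels (0, 1) out of the file, which is usually what you want when sharing data.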

#### <i style="color:red">**EXERCISES - **</i>

+ _1. Reproduce the examples given in the introduction of this notebook_
    - Download the datasets
    - Import Pandas and define a path to the files
    - Read the files and save them in Python variables
___

+ _2.  Use the `help()` function to find out what `DataFrame.head` and `DataFrame.tail` do_
---


+ _3.  Use the `DataFrame.info()` and `DataFrame.describe()` functions to learn general information about the datasets_
---


+ _4.  Using the `CEPS EurLex dataset`:_
- Get the names of the columns with `DataFrame.columns`
- Select the columns `Act_name` and `Eurlex_link` to create a subset of the original dataset
- Select the first 50 entries of the data using the `head()` function
- Save the resulting subset in a `CSV` file

---


---
# 3. Python and Automation
* [3.1. Creating basic functions](#3.1)
* [3.2. Sharing is caring](#3.2)


---
## 3.1. Creating basic functions
<a id="3.1"></a>


A function is a block of organized, reusable code that can make your scripts more effective, easier to read, and simple to manage. You can think of functions as little self-contained programs that perform a specific task and that you can use repeatedly in your code.

We have already used some functions such as the `print()` command which is actually a built-in function in Python.  

Steps:

- Begin the definition of a new function with def.
- Followed by the name of the function.
    - Must obey the same rules as variable names.
- Then parameters in parentheses.
    - Empty parentheses if the function doesn’t take any inputs.
- Then a colon.
- Then an indented block of code.

In [1]:
def print_greeting():
    print('Hello!')

- Defining a function does not run it.
    - Like assigning a value to a variable.
- Must call the function to execute the code it contains.

In [2]:
print_greeting

<function __main__.print_greeting()>
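
The cell above names the function without parentheses, so Python only displays the function object and nothing is executed. A minimal sketch of the difference, where `greeter` and `result` are illustrative names:

```python
def print_greeting():
    print('Hello!')

# Without parentheses: refers to the function object, nothing is executed
greeter = print_greeting

# With parentheses: the function body runs and prints the greeting
print_greeting()

# A function without a return statement gives back None
result = print_greeting()
print(result)
```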

- More useful when we can specify parameters when defining a function.
    - These become variables when the function is executed.
    - Are assigned the arguments in the call (i.e., the values passed to the function).
    - If you don’t name the arguments when using them in the call, the arguments will be matched to parameters in the order the parameters are defined in the function.

In [3]:
def print_date(year, month, day):
    joined = str(year) + '/' + str(month) + '/' + str(day)
    print(joined)

In [4]:
print_date(1871, 3, 19)

1871/3/19


In [5]:
print_date(month=3, day=19, year=1871)

1871/3/19


- Let's create a temperature converter

In [6]:
def celsiusToFahr(tempCelsius):
    '''This function converts celsius to fahrenheit'''
    fahrenheit_value = 9/5 * tempCelsius + 32
    return fahrenheit_value

In [7]:
freezingPoint =  celsiusToFahr(0)

print('The freezing point of water in Fahrenheit is:', freezingPoint)
print('The boiling point of water in Fahrenheit is:', celsiusToFahr(100))

The freezing point of water in Fahrenheit is: 32.0
The boiling point of water in Fahrenheit is: 212.0


Having a **docstring** in the function helps users find out what the function does, using the `help()` function:

In [8]:
help(celsiusToFahr)

Help on function celsiusToFahr in module __main__:

celsiusToFahr(tempCelsius)
    This function converts celsius to fahrenheit



If you define a docstring for all of your functions, it makes it easier for other people to use them, since they can get help on the arguments and return values of the function.

Beyond docstrings, rather than only putting in a comment about which input values lead to errors, a function can test its input values, issue a warning if a value is invalid, and use conditional code to handle special cases.
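
A minimal sketch of such input checking, applied to the temperature converter. The function name `celsius_to_fahr_checked`, the constant, and the choice of which inputs to reject or warn about are assumptions made for illustration, not part of the original notebook.

```python
import warnings

ABSOLUTE_ZERO_CELSIUS = -273.15  # lowest physically possible temperature

def celsius_to_fahr_checked(temp_celsius):
    '''Convert Celsius to Fahrenheit, checking the input first.'''
    # Invalid type: stop with a clear error message
    if not isinstance(temp_celsius, (int, float)):
        raise TypeError('temp_celsius must be a number')
    # Physically impossible value: warn, but still convert
    if temp_celsius < ABSOLUTE_ZERO_CELSIUS:
        warnings.warn('temperature is below absolute zero')
    return 9/5 * temp_celsius + 32

print(celsius_to_fahr_checked(100))
```

Raising an error stops the program at the point of the mistake, while a warning lets it continue; which one fits depends on how serious the invalid input is for your analysis.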

---
## 3.2. Sharing is caring
<a id="3.2"></a>


- Notebooks posted on GitHub can be rendered with **NBviewer** (https://nbviewer.jupyter.org/)

- Uploading your work to **Google Colab** makes it shareable immediately (https://colab.research.google.com/)
    

- Markdown cells can contain embedded links and images

Add a link using the following pattern: `[link text](URL_or_relative_path)`
gives the clickable link: [Maastricht University](https://www.maastrichtuniversity.nl).

Add an image using the following pattern: `![image alt text](URL_or_path)`
 embeds the following image: ![UM logo](https://logos-download.com/wp-content/uploads/2017/11/Maastricht_University_logo.png)

- Markdown cells can include LaTeX expressions

Mathematical expressions can be rendered inline by wrapping a LaTeX expression (no spaces) with a $ either side.

For example, `$e^x=\sum_{i=0}^\infty \frac{1}{i!}x^i$` is rendered as the inline $e^x=\sum_{i=0}^\infty \frac{1}{i!}x^i$ expression.

Wrapping the expression with `$$` either side forces it to be rendered on a new line in the centre of the cell: $$e^x=\sum_{i=0}^\infty \frac{1}{i!}x^i$$

- Checking Reproducibility

One of the aims of using notebooks is to produce an executable document that can be rerun to reproduce the results.

To run cells from scratch (i.e. from a fresh kernel), `Kernel -> Restart and Clear Output` and then run the cells you want.

To run all the cells in the notebook from scratch: `Kernel -> Restart and Run All`

- Licensing

[Attribution 4.0 International (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/)  
More info: https://reproducible-science-curriculum.github.io/sharing-RR-Jupyter/LICENSE.html

---

#### <i style="color:red">**EXERCISES - **</i>

+ _1. Create a simple function that adds 100 to something._   
    Fill the blanks and test it.

```Python
def adding100(i):
    value = i + _____
    return _____   
```


+ _2. Following up on the example._

    We know now how to create a function to convert Celsius to Fahrenheit, let’s create another function called `kelvinsToCelsius`   
    Fill the blanks and test it.
    

```Python
___ kelvinsToCelsius(tempKelvins):
    return tempKelvins ______
```

And let’s use it in the same way as the earlier one

```Python

absoluteZero = kelvinsToCelsius(tempKelvins=0)

print('Absolute zero in Celsius is:', absoluteZero)

```

What about converting Kelvins to Fahrenheit? We could write out a new formula for it, but we don’t need to. Instead, we can do the conversion using the two functions we have already created and calling those from the function we are now creating

```Python
def kelvinsToFahrenheit(______):
    '''This function converts kelvin to fahrenheit'''
    ______
    ______
    return ______
```

Finally use the function

``` Python
absoluteZeroF = kelvinsToFahrenheit(tempKelvins=0)

print('Absolute zero in Fahrenheit is:', absoluteZeroF)
```

+ _3. Add a License to your notebook and upload it to Google Colab_