# First Tutorial
### Learning Goals of the first Tutorial

1. What is Python?
2. How to use Python as a basic calculator
3. Data structures in Python: strings, lists, sets, tuples, dictionaries and matrices
4. Select elements in a string, list or dataframe
5. Working with DataFrames: loading and manipulating data
6. Plotting data and saving plots

## What is Python

Python is an open-source programming language that you can use to create web applications, perform data analysis and machine learning, create graphics and much more.

Python is popular for having a **simple syntax**, being **versatile** (used for many different tasks), being **beginner friendly** and having a **large and active community**.

## Python as a basic calculator

Python can perform arithmetic operations. Here are some examples:

In [1]:
# Addition
1 + 2

3

In [2]:
# Subtraction
4 - 2

2

In [3]:
# Multiplication
3 * 4

12

In [4]:
# Division
8 / 3

2.6666666666666665

In [5]:
# Exponentiation
2 ** 3

8

In [6]:
# Modulus operator: it returns the remainder from the division of the first argument by the second
3 % 2

1

In [7]:
# Floor division operator: behaves like normal division except that it rounds the result down to the nearest integer
5 // 2

2
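The modulus and floor division operators belong together: the quotient times the divisor plus the remainder gives back the original number. The following minimal sketch illustrates this, including the case of a negative number, where floor division rounds down (towards negative infinity) rather than towards zero. The variable names are chosen for illustration only.

```python
# Floor division and modulus for a positive number
quotient = 7 // 2      # 3
remainder = 7 % 2      # 1

# Both results at once with the built-in divmod function
both = divmod(7, 2)    # (3, 1)

# For negative numbers, floor division rounds down, not towards zero
negative_quotient = -7 // 2    # -4
negative_remainder = -7 % 2    # 1

# The original number can always be rebuilt from quotient and remainder
rebuilt = negative_quotient * 2 + negative_remainder

print(quotient, remainder)                    # 3 1
print(both)                                   # (3, 1)
print(negative_quotient, negative_remainder)  # -4 1
print(rebuilt)                                # -7
```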

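**Order of operations** matters in Python and looks as follows: exponents first, then multiplication and division (including `//` and `%`), then addition and subtraction. Operators on the same level are evaluated from left to right, except exponentiation, which is evaluated from right to left. However, you can use parentheses to force an expression to evaluate in the order you want.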

In [9]:
# Since expressions in parentheses are evaluated first, the result is 64
((2 + 2) * 2) ** 2

64
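To see why the parentheses matter, it can help to compare the same numbers with and without them. This is a small sketch; the variable names are made up for illustration.

```python
# Without parentheses: exponent first, then multiplication, then addition
without_parentheses = 2 + 2 * 2 ** 2    # 2 + (2 * (2 ** 2)) = 2 + 8

# With parentheses: the innermost expression is evaluated first
with_parentheses = ((2 + 2) * 2) ** 2   # (4 * 2) ** 2 = 8 ** 2

# Operators of equal precedence are evaluated from left to right ...
left_to_right = 8 / 4 * 2               # (8 / 4) * 2

# ... except exponentiation, which is evaluated from right to left
right_to_left = 2 ** 3 ** 2             # 2 ** (3 ** 2) = 2 ** 9

print(without_parentheses)  # 10
print(with_parentheses)     # 64
print(left_to_right)        # 4.0
print(right_to_left)        # 512
```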

## Data Structures

Python offers quite a few data structures. The built-in ones are lists, tuples, strings, dictionaries and sets. Lists, strings and tuples are ordered sequences of objects. Unlike strings, which contain only characters, lists and tuples can contain objects of any type. Unlike tuples and strings, lists are mutable and can be extended or shortened at any time. Sets are mutable, unordered collections of unique elements. Dictionaries are collections of key/value pairs; since Python 3.7 they preserve the order in which items were inserted.

Lists are enclosed in brackets:

In [11]:
l = [4, 5, "a"]

Tuples are enclosed in parentheses:

In [12]:
t = (4, 5, "a")

Dictionaries are enclosed in curly brackets:

In [13]:
d = {"a":4, "b":5, "c":6}

Strings are enclosed in double or single quotes:

In [14]:
s = "This is a character object"
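The differences between these structures show up as soon as we try to change them. The sketch below is only an illustration with made-up variable names: it extends a list, shows that a tuple refuses item assignment, lets a set drop duplicates, and adds a key to a dictionary.

```python
# Lists are mutable: we can append and replace elements
values_list = [4, 5, "a"]
values_list.append("b")
values_list[0] = 40

# Tuples are immutable: item assignment raises a TypeError
values_tuple = (4, 5, "a")
try:
    values_tuple[0] = 40
    tuple_changed = True
except TypeError:
    tuple_changed = False

# Sets keep only unique elements
unique_values = {1, 2, 2, 3, 3, 3}

# Dictionaries map keys to values and can be extended with new keys
d = {"a": 4, "b": 5, "c": 6}
d["d"] = 7
value_of_b = d["b"]

print(values_list)         # [40, 5, 'a', 'b']
print(tuple_changed)       # False
print(len(unique_values))  # 3
print(value_of_b)          # 5
```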

### Matrices

A matrix is a two-dimensional data structure in which values are arranged in rows and columns. The values in a matrix must all be of the same type. Python does not have a built-in matrix type. However, one can build a matrix as a list of lists, where each inner list is one row:

In [15]:
M = [[1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]]
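With a plain list of lists, a value is reached by indexing twice: first the row, then the position within that row. The following sketch shows this under the assumption that every inner list has the same length; the helper names are illustrative.

```python
M = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]

# A whole row is simply one of the inner lists
second_row = M[1]

# A single value: row index first, then column index
value = M[2][1]

# A column has to be collected row by row
first_column = [row[0] for row in M]

# Number of rows and columns
n_rows, n_cols = len(M), len(M[0])

print(second_row)      # [4, 5, 6]
print(value)           # 8
print(first_column)    # [1, 4, 7]
print(n_rows, n_cols)  # 3 3
```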

### Dataframes

Just like matrices, dataframes are also two-dimensional table objects. Unlike matrices, dataframes can contain heterogeneous values, and named (labelled) columns and rows. There are different ways to build a dataframe. One way is to create a dictionary with keys as column names and values as column values.

So far we only used Python’s base functions. In order to use some more sophisticated or special Python functions, we need to load libraries or packages first.

Since Python has no built-in dataframe type, we use the `DataFrame` class from the pandas package. In order to use it, we first need to import pandas.

In [1]:
# Import pandas
import pandas as pd

# Create dictionary data
data = {'Name': ['Belal', 'Bob', 'Gino', 'Cris'], 'Age': [25, 18, 23, 30], 'Skill': ['Python', 'Java', 'Spark', 'SQL']}

# Create DataFrame
df = pd.DataFrame(data)
print(df)

    Name  Age   Skill
0  Belal   25  Python
1    Bob   18    Java
2   Gino   23   Spark
3   Cris   30     SQL


# Selecting elements in a list, matrix or dataframe

Sometimes we want to select one or multiple data entries from our data objects. We can do so by selecting elements via [ ].

We first do this for a list:

In [2]:
cities = ['Berlin', 'Tokyo', 'Madrid', 'Paris', 'London']

Let's say we only want to select *Berlin* from the above list. We can achieve this by accessing the value via its position in the list which is zero in this case (Python indexing starts at zero):

In [3]:
cities[0]

'Berlin'

The first three elements can be accessed as follows:

In [4]:
cities[0:3]

['Berlin', 'Tokyo', 'Madrid']

Python also supports negative or reverse indexing. To access the last element in the list, we can proceed as follows:

In [6]:
cities[-1]

'London'
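Slices and negative indices can be combined. In a slice `start:stop:step`, the element at `stop` is not included. A short sketch with the same list (the result names are illustrative):

```python
cities = ['Berlin', 'Tokyo', 'Madrid', 'Paris', 'London']

# Elements at positions 1, 2 and 3 (position 4 is excluded)
middle = cities[1:4]

# The last two elements
last_two = cities[-2:]

# Every second element, starting with the first
every_second = cities[::2]

# The whole list in reverse order
reversed_cities = cities[::-1]

print(middle)           # ['Tokyo', 'Madrid', 'Paris']
print(last_two)         # ['Paris', 'London']
print(every_second)     # ['Berlin', 'Madrid', 'London']
print(reversed_cities)  # ['London', 'Paris', 'Madrid', 'Tokyo', 'Berlin']
```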

### Selecting elements in two-dimensional objects

As with lists, we can access the values of a matrix by index. However, we need to think of one additional dimension. For a NumPy array, we generally type `object[row, column]` to access specific rows and columns. Let's consider the following matrix:

In [19]:
import numpy as np

M = np.array([[1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]])
print(M)

[[1 2 3]
 [4 5 6]
 [7 8 9]]


Now we want to access the value 8. It’s in the third row and the second column. Recall that Python indexing for rows and columns starts at zero:

In [20]:
M[2, 1]

8
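The same `[row, column]` notation also accepts slices, so whole rows, whole columns or rectangular blocks can be selected. A minimal sketch using the matrix from above; the result names are illustrative.

```python
import numpy as np

M = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

# The third row: fixed row index, all columns
third_row = M[2, :]

# The second column: all rows, fixed column index
second_column = M[:, 1]

# The upper-left 2 x 2 block: rows 0-1 and columns 0-1
upper_left = M[0:2, 0:2]

print(third_row)      # [7 8 9]
print(second_column)  # [2 5 8]
print(upper_left)
```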

### Selecting with conditions

This is a good start to access data in an object. However, it might be a bit exhausting (maybe even impossible) to always look up the exact position in the object.

Fortunately, Python also allows us to select elements based on conditions. Instead of the position, we put a condition in the [ ] square brackets.

- For this, we can use the comparison operators in Python:

    - Is equal to:            `==`
    - Is not equal to:        `!=`
    - Is smaller than:        `<`
    - Is greater than:        `>`
    - Is smaller or equal to: `<=`
    - Is greater or equal to: `>=`
    
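- Conditions can be combined with logical operators:
    - AND: keyword `and` for single True/False values, element-wise `&` for NumPy arrays and pandas objects
    - OR:  keyword `or` for single True/False values, element-wise `|` for NumPy arrays and pandas objects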

So how do we subset with conditions? Let's create a matrix to work with. For this purpose, we use the numpy `arange` function, which returns evenly spaced values within a given interval. We then reshape the result into a matrix with 4 rows and 5 columns.

In [4]:
import numpy as np

example_mat = np.arange(0, 20).reshape(4, 5)
print(example_mat)

[[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]
 [15 16 17 18 19]]


Let's say we are interested in all values greater than 4. We can proceed as follows: `example_mat > 4`. This returns True or False for each value.

Now if we put this condition in square brackets we get the values for which the condition is true.

In [5]:
example_mat[example_mat > 4]

array([ 5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19])
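Several conditions can be combined with `&` and `|`. Each condition needs its own parentheses, because `&` and `|` bind more tightly than the comparison operators. The sketch below reuses `example_mat`; the other names are illustrative.

```python
import numpy as np

example_mat = np.arange(0, 20).reshape(4, 5)

# The condition alone gives a matrix of True/False values
mask = example_mat > 4

# AND: values greater than 4 and smaller than 10
between = example_mat[(example_mat > 4) & (example_mat < 10)]

# OR: values smaller than 2 or greater than 17
extremes = example_mat[(example_mat < 2) | (example_mat > 17)]

print(mask.sum())  # 15 values are greater than 4
print(between)     # [5 6 7 8 9]
print(extremes)    # [ 0  1 18 19]
```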

# Working with DataFrames

Working with dataframes is similar to working with matrices. 

## Loading and manipulating data

Most of the time we want to work with dataframes in Python. Before working with dataframes, we need to load them. Luckily, pandas can load most standard dataset formats.

The dataset we want to work with is stored as a `csv` (comma-separated values) file. Reading the `csv` into a dataframe is straightforward with pandas' `pandas.read_csv` function.

In [18]:
import pandas as pd
df = pd.read_csv("raw_data/example_data.csv")

With `DataFrame.head()` we can look at the first five rows of the dataset:

In [19]:
df.head()

   age  gender  income
0   34    Male    3291
1   28  Female    3666
2   76  Female    3873
3   69  Female    3067
4   92    Male    4231


If we only want to look at the column names, we can use the `columns` attribute of the dataframe like this: `DataFrame.columns`

In [20]:
df.columns

Index(['age', 'gender', 'income'], dtype='object')
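Besides `head()` and `columns`, a few other attributes give a quick overview of a dataframe. Since `raw_data/example_data.csv` is not available here, the sketch below rebuilds only the five rows shown by `df.head()` as a stand-in, so the shape differs from the full dataset.

```python
import pandas as pd

# Stand-in for the csv file: only the five rows printed by df.head() above
df = pd.DataFrame({
    "age": [34, 28, 76, 69, 92],
    "gender": ["Male", "Female", "Female", "Female", "Male"],
    "income": [3291, 3666, 3873, 3067, 4231],
})

# head() accepts the number of rows to show
first_two = df.head(2)

# shape holds the number of rows and columns
n_rows, n_cols = df.shape

# The column names as a plain Python list
column_names = df.columns.tolist()

print(first_two)
print(n_rows, n_cols)  # 5 3
print(column_names)    # ['age', 'gender', 'income']
```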

## Selecting data from a DataFrame

Now we can use our selecting abilities on a data frame. We can select elements in three different ways:

1. Dataframe[ ] : selecting data with indexing operator
2. Dataframe.loc[ ] : selecting data using row or column labels.
3. Dataframe.iloc[ ] : selecting data using index positions

### 1. Indexing a DataFrame using df[ ] :

##### Selecting one column

In order to select a single column, we simply put the name of the column between the brackets like this: `DataFrame["column_name"]`

In [21]:
df["age"]

0      34
1      28
2      76
3      69
4      92
       ..
995    60
996    40
997    82
998    23
999    25
Name: age, Length: 1000, dtype: int64

##### Selecting multiple columns

In order to select multiple columns, we have to pass a list of column names inside the indexing operator like this: `DataFrame[["col1_name", "col2_name", "col3_name"]]`

In [22]:
df[["age", "gender", "income"]]

     age  gender  income
0     34    Male    3291
1     28  Female    3666
2     76  Female    3873
3     69  Female    3067
4     92    Male    4231
..   ...     ...     ...
995   60    Male    2260
996   40    Male    3355
997   82    Male    1848
998   23    Male    3000
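Two details of the indexing operator are worth a sketch. Single brackets return a one-dimensional Series, while double brackets return a DataFrame. The operator also accepts a condition, just like the matrix example earlier. The data below is a stand-in built from the five rows shown by `df.head()`, not the full csv file.

```python
import pandas as pd

# Stand-in data: the first five rows of the example dataset
df = pd.DataFrame({
    "age": [34, 28, 76, 69, 92],
    "gender": ["Male", "Female", "Female", "Female", "Male"],
    "income": [3291, 3666, 3873, 3067, 4231],
})

# Single brackets: a Series
age_series = df["age"]

# Double brackets: a DataFrame with one column
age_frame = df[["age"]]

# A condition inside the brackets keeps the rows where it is True
older = df[df["age"] > 50]

# Conditions are combined with & and |, each wrapped in parentheses
older_men = df[(df["age"] > 50) & (df["gender"] == "Male")]

print(type(age_series).__name__)  # Series
print(type(age_frame).__name__)   # DataFrame
print(older)
print(older_men)
```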


### 2. Indexing a DataFrame using .loc[ ] :

The `.loc` indexer selects data by the labels of the rows and columns. Unlike the indexing operator, which selects entire column(s), `df.loc` can select subsets of rows and columns. Let's take a look at a dataframe with labelled rows. `first_name` is used as the row label for every sample in the dataset:

In [23]:
import pandas as pd
df = pd.read_csv("raw_data/example_data2.csv", index_col= "first_name")
df.head()

            last_name                    email       gender  income
first_name
Elnar        Sargison    esargison0@ebay.co.uk  Genderfluid    1638
Sunny         Coopper        scoopper1@loc.gov     Bigender    3408
Neils         Philott        nphilott2@home.pl  Genderfluid    4497
Chaunce        Espina  cespina3@guardian.co.uk  Genderqueer    2530
Deeann       Nazareth     dnazareth4@nifty.com  Genderqueer    3033


##### Selecting a single row using .loc[ ]

In order to select a single row using .loc[ ], we put a single row label in the .loc indexer. As a result, we get the name of each column with its corresponding value for the requested row:

In [24]:
df.loc["Elnar"]

last_name                 Sargison
email        esargison0@ebay.co.uk
gender                 Genderfluid
income                        1638
Name: Elnar, dtype: object

##### Selecting multiple rows using .loc[ ]

In order to select multiple rows, we put all the row labels in a list and pass that list to the .loc indexer:

df.loc[["Elnar", "Sunny", "Deeann"]]

##### Selecting multiple rows and columns using .loc[ ]

In order to select different rows and columns, we pass row and column labels in two different lists separated by a comma like this: `df.loc[["row1", "row2", "row3"], ["column1", "column2"]]`

In [28]:
df.loc[["Elnar", "Sunny"], ["email", "gender", "income"]]

                            email       gender  income
first_name
Elnar       esargison0@ebay.co.uk  Genderfluid    1638
Sunny           scoopper1@loc.gov     Bigender    3408
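Besides lists of labels, `.loc` also accepts a single label pair, label slices and conditions. Note that label slices include both end points, unlike position slices. The sketch below rebuilds the five rows shown by `df.head()` as a stand-in for `raw_data/example_data2.csv`; the result names are illustrative.

```python
import pandas as pd

# Stand-in data: the five rows shown above, with first_name as row label
df = pd.DataFrame({
    "first_name": ["Elnar", "Sunny", "Neils", "Chaunce", "Deeann"],
    "last_name": ["Sargison", "Coopper", "Philott", "Espina", "Nazareth"],
    "email": ["esargison0@ebay.co.uk", "scoopper1@loc.gov", "nphilott2@home.pl",
              "cespina3@guardian.co.uk", "dnazareth4@nifty.com"],
    "gender": ["Genderfluid", "Bigender", "Genderfluid", "Genderqueer", "Genderqueer"],
    "income": [1638, 3408, 4497, 2530, 3033],
}).set_index("first_name")

# One row label and one column label give a single value
sunny_income = df.loc["Sunny", "income"]

# Label slices include both the start and the end label
block = df.loc["Elnar":"Neils", "email":"income"]

# A condition selects rows, a list of labels selects columns
high_income = df.loc[df["income"] > 3000, ["email", "income"]]

print(sunny_income)  # 3408
print(block)
print(high_income)
```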


### 3. Indexing a DataFrame using .iloc[ ] :

#### Selecting a single row

Pass a single integer to the .iloc[] indexer to select only one row. Since positions start at zero, `5` selects the sixth row:

In [29]:
df.iloc[5]

last_name             Seamans
email        dseamans5@ft.com
gender             Polygender
income                   4513
Name: Dulciana, dtype: object

#### Selecting multiple rows

Pass a list of integers to the .iloc[] indexer to select more than one row:

In [31]:
df.iloc[[2, 4, 5]]

            last_name                 email       gender  income
first_name
Neils         Philott     nphilott2@home.pl  Genderfluid    4497
Deeann       Nazareth  dnazareth4@nifty.com  Genderqueer    3033
Dulciana      Seamans      dseamans5@ft.com   Polygender    4513


#### Selecting multiple rows and columns

Pass two lists of integers (one for rows, one for columns), separated by a comma, to the .iloc[] indexer in order to select multiple rows and columns:

In [32]:
df.iloc[[2, 4], [1, 3]]

                           email  income
first_name
Neils          nphilott2@home.pl    4497
Deeann      dnazareth4@nifty.com    3033
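Like lists, `.iloc` also understands slices and negative positions. As with list slices, the end position is excluded. The sketch below again uses the five rows shown by `df.head()` as a stand-in for the full dataset, so the last row here is not the last row of the real file.

```python
import pandas as pd

# Stand-in data: the five rows shown above, with first_name as row label
df = pd.DataFrame({
    "first_name": ["Elnar", "Sunny", "Neils", "Chaunce", "Deeann"],
    "last_name": ["Sargison", "Coopper", "Philott", "Espina", "Nazareth"],
    "email": ["esargison0@ebay.co.uk", "scoopper1@loc.gov", "nphilott2@home.pl",
              "cespina3@guardian.co.uk", "dnazareth4@nifty.com"],
    "gender": ["Genderfluid", "Bigender", "Genderfluid", "Genderqueer", "Genderqueer"],
    "income": [1638, 3408, 4497, 2530, 3033],
}).set_index("first_name")

# The first two rows (position 2 is excluded)
first_two = df.iloc[0:2]

# The last row of this stand-in dataframe
last_row = df.iloc[-1]

# All rows of the first column
first_column = df.iloc[:, 0]

# A slice of rows combined with a list of column positions
block = df.iloc[1:3, [1, 3]]

print(first_two)
print(last_row)
print(first_column)
print(block)
```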
