# Introduction to Python (on UCloud)

### *CompUp workshop, October 27th 2021*

## Purpose of the Python workshops

- Provide general insights into using Python for social data science
- First steps in doing various forms of analysis with Python
- The first three workshops provide the foundation for understanding the tools introduced in the theme-specific workshops
    - The first three workshops form a coherent series
    - The theme-specific workshops stand on their own

## Content of today's workshop

- **Introduction to UCloud**
- **Launching your first app on UCloud (JupyterLab)**
- **What is "Python"?**
- **Working with Python (in JupyterLab)**
- **The Python basics** (variables, types, functions, methods, packages)
- **Basic data structures in Python** (lists, tuples, dictionaries)
- **Basic control structures in Python** (for loops, while loops, if statements)
- **Introduction to pandas DataFrames** (if time permits)

## Teaching format

- Live-coding - Write along as we go
- Exercises throughout - feel free to help each other
- Raise hand for questions
- Use stickers to indicate status
    - Green: I'm good
    - Red: I need assistance

# Introduction to UCloud

## What is "UCloud"?
- Interactive digital research environment
- Cloud service
- Computing and data management
- Created for researchers
- App-based

## Why use UCloud?
- Easy access to files and programs
- One location for analysis solutions (no need to deal with differing installations)
- GDPR-compliant storage
- Eases collaboration

## Introduction to UCloud

### App-based?
- UCloud works via virtual machines
- Start up a virtual machine based on the analysis software needed
- You "pay" for the amount of time the app is running

***You have 1000 DKK on your account corresponding to 2907 hours of running a machine non-stop with 23 GB RAM and 4 VCPUs***

## Introduction to UCloud

### Navigating UCloud

(live demo)

### Files on UCloud

(live demo)

### Launching your first app

(live demo)

# What is Python?

- General purpose programming language
- Popular within "data science" (but is used for so much more than that)

### Working with Python

- No GUI (graphical user interface)
- Command-based (speak Python and you shall receive; speak anything else and you shall receive errors)

### Why use Python?

- Versatile (not locked to specific file or data formats)
- Expandable (write your own functions)
- Free
- Portable (Windows, Linux, Mac)
- Large community and ecosystem

## What is Python

### Characteristics of the Python programming language

**Python is "general purpose"**

**Python is "object-oriented"**

**Python is "cross-platform"**

## Working with Python (in JupyterLab)

- Python code is typically written in an IDE (Integrated Development Environment)
- [JupyterLab](https://jupyter.org/) is a Python IDE
- JupyterLab is a web-based interactive computational environment for creating notebooks containing code.

- It was originally developed for Python but can be used with other languages as well (for example R).

- Jupyter notebooks are structured in cells. Each notebook is composed of cells, which can be either code cells or markdown cells (or raw cells)
    - Code cells: Contain Python code
    - Markdown cells: Contain formatted text
    - Markdown cells: Contains formatted text


## Python basics: The Python language

Working with Python means working almost entirely with code written in the Python language.

Python works by writing lines of code (commands) and having Python interpret that code (running commands or cells).

You tell Python to do something by writing a command and Python will do that (if Python can understand you).

Python, for example, understands mathematical expressions:

In [1]:
2 + 5

## Python basics: The Python language

When Python does not understand our commands, it will return an error:

In [3]:
give recipe for pancakes

SyntaxError: invalid syntax (Temp/ipykernel_23168/2584914264.py, line 1)

Python returns a `SyntaxError` from the code above because the command does not comply with Python conventions (that is not the only problem with the code but it is the first problem that is encountered when interpreting the code).

## Python basics: The Python language

Working with Python means working with functions!

A function takes one or several inputs (called "arguments") and does something with them.

The function `print()` takes a piece of text or number as input and prints it:

In [4]:
print("Hello there!")
print(921 - 20)

Hello there!
901


Use `#` to write comments in your code. Text following a `#` is ignored by Python.
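A minimal sketch of how comments behave. The variable names and values are made up for illustration:

```python
# This whole line is a comment and is ignored by Python
total = 921 - 20  # a comment can also follow code on the same line

# print("This line is commented out and never runs")
print(total)

# A "#" inside quotes is part of the text, not a comment
hashtag = "#python"
print(hashtag)
```

Commenting out a line (placing `#` in front of it) is a common way to temporarily disable code without deleting it.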

## Python basics: Variables in Python

- Python is an object-oriented programming language

- You continuously create and work with different objects (called "variables" in Python)

- Python variables are basically containers for some information

- All of these can be variables in Python:
    - A number
    - A word
    - A text
    - A dataset
    - A file path
    - A URL
    - An image

## Python basics: Variables in Python

### Defining variables

Variables are defined using `=`

In [5]:
a = 2 + 5

In [6]:
print(a)

7


In [7]:
b = 'jedi'

In [8]:
print(b)

jedi


Using `' '` or `" "` denotes that the code should be read as text.

## Python basics: Variables in Python

Note that Python reads code very literally!

Python differentiates between upper and lower case:

In [15]:
print(b)
print(B)

jedi


NameError: name 'B' is not defined

The code above returns a `NameError`. This error usually occurs when the variable or function has not been defined (or imported).

## Python basics: Variables in Python

### Changing a variable

As a rule, variables are not altered by using them:

In [16]:
a = 42
print(a + 7) # prints a + 7 = 49
print(a)     # a is not altered. still 42

49
42


The variable has to be re-assigned if the content should be changed:

In [17]:
a = a + 7    # a is overwritten with a + 7
print(a)     # a is now altered

49


## Python basics: Variables in Python

### Naming variables
Variables can be named almost anything but a good rule of thumb is to use names that are indicative of what the variable contains.

#### Restrictions for naming variables
- Most special characters are not allowed: `/`, `?`, `*`, `+`, `.` and so on (most characters mean something in Python and will be read as an expression)
- Already existing names in Python should be avoided (reusing them will overwrite the function/variable in the environment)

#### Good naming conventions 
- Using '`_`': `my_variable`, `room_number`

or:

- Capitalize each word except the first: `myVariable`, `roomNumber`
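One way to check whether a name is allowed is sketched below. It relies on the string method `.isidentifier()` and the standard `keyword` module, neither of which is covered in the workshop; the example names are made up:

```python
import keyword

valid = "my_variable".isidentifier()        # letters and underscores are fine
with_minus = "room-number".isidentifier()   # "-" is read as the minus operator
digit_first = "2nd_room".isidentifier()     # names cannot start with a digit
reserved = keyword.iskeyword("for")         # reserved words cannot be used as names

print(valid, with_minus, digit_first, reserved)
```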


## Python basics: Types of variables

Python distinguishes between different types of variables.

A variable is stored as a *type*. The type denotes what kind of variable it is and affects what operations are possible.

## Python basics: Types of variables

### Numeric and character types
As you work with Python, you will encounter a lot of different types. For now we will be focusing on two of the more common ones:
- Numeric types (like integers or floats)
- Character types (like strings)

Numbers are automatically stored as a numeric type (like a float, integer etc.).

When using `''` or `""` around the information to be stored in the variable, Python will interpret that as text. Variables containing text are referred to as *strings*.

*Numbers enclosed in `''` or `""` are therefore stored as strings, as Python interprets them as text!*

Python has to be told that something is text, as Python would otherwise interpret it as the name of an existing variable.

In [18]:
my_name2 = vader

NameError: name 'vader' is not defined

Note that Python can interpret multiplying and adding text:

In [19]:
my_name = "kenobi"

2 * my_name

'kenobikenobi'

In [20]:
my_num = "42"

2 * my_num

'4242'

In [21]:
'obi-wan ' + my_name

'obi-wan kenobi'

## Python basics: Types of variables

### Casting types

The type of a variable can be examined with `type(variable)`.

Variables can be coerced with specific functions:

- Coerce to character type: `str(variable)`
- Coerce to numeric type: `int(variable)` or `float(variable)`

Python will always try to "guess" the type. If Python guesses wrong, you can tell Python what type it should be (if possible).

In [22]:
a_number = '56'
print(type(a_number))

<class 'str'>


In [23]:
a_number = int(a_number)
print(type(a_number))

<class 'int'>
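The conversions can also be chained, as in the sketch below. The variable names and the value `"19.5"` are made up for illustration:

```python
price_text = "19.5"          # a number stored as a string
price = float(price_text)    # string -> float
whole = int(price)           # float -> int (the decimals are dropped, not rounded)
label = str(whole)           # int -> string

print(price, whole, label)
print(type(price), type(whole), type(label))

# Not every string can be converted:
# int("jedi") would raise a ValueError
```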


## EXERCISE: Variables and types

Create the variables: `my_number1` and `my_number2`.

* `my_number1` should contain `21`
* `my_number2` should contain `"12"` (with quotes!)

Try dividing one variable by the other (`my_number1 / my_number2`). Examine the error.

Examine the types for `my_number1` and `my_number2` with `type()`. Both should be numeric in order to be able to perform division. Change them if necessary.

Try dividing the numbers again. If you have converted the type correctly, it should return the number `1.75`.

## Python basics: Functions

An essential part of programming is using functions.

Functions take one or several inputs ("arguments"), do something and return (for the most part) an output.

Functions in Python have the following format:

- `function(argument1, argument2, ...)`

Arguments can be in the form of "keyword arguments". These can be seen as a kind of setting for the function. Keyword arguments are passed to a function by writing the name of the argument, then `=`, then what the setting should be:

- `function(argument1, argument2, keywordargument1 = "something")`

`print()` and `type()` are both functions.
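A short sketch of positional and keyword arguments using two built-in functions, `print()` and `round()`. The values are made up for illustration:

```python
# Positional arguments only
print("Obi-Wan", "Kenobi")

# The keyword argument sep changes what print() puts between the arguments
print("Obi-Wan", "Kenobi", sep="-")

# round() takes the number first and the number of decimals second
rounded = round(3.14159, 2)
print(rounded)

# The same call with the second argument given by name
rounded_kw = round(3.14159, ndigits=2)
print(rounded_kw)
```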

## Python basics: Methods

Besides functions, Python also contains "methods". Like functions, methods accept some kind of input and return an output.

Methods are bound to certain variable types. The variable has to be a certain type (or "class") for the method to be used.

Methods in Python have the following format:

- `variable.method(option1 = something)`

Strings, for example, have a lot of methods associated with them:

In [24]:
word = "Hello"

print(word.upper())  # Convert to upper-case
print(word.lower())  # Convert to lower-case

HELLO
hello


Notice that methods as a rule do not change the variable; they return a new value.

In [25]:
print(word)

Hello


If one tries to use methods on variables of the wrong type, an error is returned:

In [26]:
number = 627

number.upper()

AttributeError: 'int' object has no attribute 'upper'

---
## KNOWLEDGE CHECK

Inspect the code:

```
words = "Hello there!"
words.replace("there", "world")
```

The code above replaces the word "there" with "world" in the variable `words`.

*What does `words` contain after `.replace()` has been used?*

---

## Python basics: Booleans

A large part of programming involves working with logical or boolean values. These are stored as the type `bool`.

Variables of type `bool` can *only* take the value true (`True`) or false (`False`).

Certain operators always return a boolean value:

In [27]:
a = 10
b = 12

a == b

False

- `==` is used to ask: "Does `a` equal `b`?" (`=` is reserved for creating variables)

## Python basics: Booleans

These operators always return a boolean value:

```
a == b  # Equals
a != b  # Does not equal
a > b   # Greater than
a >= b  # Greater than or equals
a < b   # Less than
a <= b  # Less than or equals
```

A lot of functions and methods return boolean values.

The method `.startswith()` returns a boolean value depending on whether the string starts with a specific piece of text:

In [28]:
words = "Hello there!"

print(words.startswith("Hello"))

True
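Boolean values can be stored in variables like any other value. The sketch below reuses the values from the examples above; the string methods `.endswith()` and `.isdigit()` are not covered in the workshop but work the same way as `.startswith()`:

```python
a = 10
b = 12

is_equal = a == b                      # the result of a comparison can be stored
print(is_equal, type(is_equal))

words = "Hello there!"
ends_with_mark = words.endswith("!")   # string method returning a boolean
print(ends_with_mark)

digits_only = "42".isdigit()           # True when every character is a digit
print(digits_only)
```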


## Python basics: Packages

Base Python has limited functionality for data analysis. You will usually have to import various packages/modules in order to perform your analysis.

A package is a collection of functions, variables and methods that can be loaded into your Python environment.

Once loaded, the contents of the module are usable in the Python environment.

It is possible to either import whole modules or parts of a module.

In [29]:
c = sqrt(a**2 + b**2)
print(c)

NameError: name 'sqrt' is not defined

In [30]:
import math #whole module/package

# Or...

from math import sqrt #specific function/method

In [31]:
c = sqrt(a**2 + b**2)
print(c)

15.620499351813308
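The way a module is imported decides how its functions are called. A sketch of three common import styles, assuming `a` and `b` hold the values used earlier (the alias `m` is an arbitrary choice):

```python
import math              # whole module: functions are called as math.<name>
from math import sqrt    # a single function: called directly as sqrt()
import math as m         # whole module under a shorter alias

a = 10
b = 12

c1 = math.sqrt(a**2 + b**2)
c2 = sqrt(a**2 + b**2)
c3 = m.sqrt(a**2 + b**2)

print(c1, c2, c3)
```

All three calls use the same function and return the same number.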


## Basic data structures: Lists

So far we have looked at Python variables containing single values: a number, a word or a boolean.

Python has different ways of storing a series of elements (values, variables, etc.). One of the more common is the *list*.

A list is an ordered collection of elements. Lists are created by enclosing the values in `[]`:

In [32]:
my_list = [1, 9, 7, 3]
print(my_list)

[1, 9, 7, 3]


Note that lists can contain variables of different types.

In [33]:
my_list2 = ["kenobi", 3, 42.0, True]
print(my_list2)

['kenobi', 3, 42.0, True]


## Basic data structures: Lists

### Adding to lists

Elements can be added to the list with the method `append`.

*Note that using this method changes the contents of the list*

In [34]:
my_list.append(22)
print(my_list)

[1, 9, 7, 3, 22]


## Basic data structures: Lists

### Indexes

Each element in the list is assigned an index running from 0 to the number of elements - 1. We can use the index to refer to specific elements with `[]`:

In [35]:
my_list2[0]

'kenobi'

Single elements in a list can be changed by referring to their index:

In [36]:
my_list2[0] = 'vader'
print(my_list2)

['vader', 3, 42.0, True]
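A short sketch of how the index relates to the length of the list. `len()` and negative indexes are not covered above and are included as an illustration:

```python
my_list2 = ["vader", 3, 42.0, True]

n_elements = len(my_list2)          # number of elements in the list
last = my_list2[n_elements - 1]     # the last index is the number of elements - 1
also_last = my_list2[-1]            # negative indexes count from the end

print(n_elements, last, also_last)

# my_list2[4] would raise IndexError: list index out of range
```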


## Basic data structures: Tuples

*Tuples* are like lists, but immutable, meaning their values cannot be changed.

They are created by enclosing elements in `()`:

In [37]:
t = (1.0, 4.0)
t, type(t)

((1.0, 4.0), tuple)

In [38]:
t[1]

4.0

In [39]:
t[1] = 2

TypeError: 'tuple' object does not support item assignment
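If the values of a tuple need to change, one possible workaround is to build a new tuple. A minimal sketch (the names `as_list` and `t_new` are made up):

```python
t = (1.0, 4.0)

as_list = list(t)        # copy the values into a list, which can be changed
as_list[1] = 2.0
t_new = tuple(as_list)   # build a new tuple from the changed list

print(t)       # the original tuple is untouched
print(t_new)
```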

## Basic data structures: Dictionaries

Dictionaries consist of a collection of key-value pairs.

Dictionaries are defined using `{}`:

In [40]:
my_dict = {"jedi": "Katarn", "sith": "Desann"}

Each key-value pair is defined with a key (here a string), followed by `:` and the value.

The value can be a number, text, a list, another dictionary and so on.

The different key-value pairs are separated using `,`.

Dictionaries are not indexed by position. One refers to a value through its key (which has to be unique):

In [41]:
print(my_dict["jedi"])

Katarn


## Basic data structures: Dictionaries

Pairs can be added by just adding a key that has not been used:

In [42]:
my_dict["dealer"] = "Watto"

print(my_dict)

{'jedi': 'Katarn', 'sith': 'Desann', 'dealer': 'Watto'}


Keys have to be unique. Reusing an existing key will simply overwrite its value.

In [43]:
my_dict["jedi"] = "Kenobi"

print(my_dict)

{'jedi': 'Kenobi', 'sith': 'Desann', 'dealer': 'Watto'}
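A few other common ways of inspecting a dictionary, sketched with the dictionary from above. The methods `.keys()`, `.values()` and `.get()` are not covered in the workshop; the key `"droid"` and the fallback `"unknown"` are made up:

```python
my_dict = {"jedi": "Kenobi", "sith": "Desann", "dealer": "Watto"}

print(list(my_dict.keys()))     # all keys
print(list(my_dict.values()))   # all values

has_jedi = "jedi" in my_dict    # check whether a key exists
has_droid = "droid" in my_dict
print(has_jedi, has_droid)

# my_dict["droid"] would raise a KeyError.
# .get() returns a fallback value instead when the key is missing.
droid = my_dict.get("droid", "unknown")
print(droid)
```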


---
## KNOWLEDGE CHECK

*Can a list contain different types of variables?*

---

## Basic control structures: For loops

In Python (and many other programming languages), "loops" are used to repeat commands.

The most common loops are "for loops" and "while loops"
- for loops: repeat one or several commands for a range of values
- while loops: repeat one or several commands while a condition is true

## Basic control structures: For loops

Below three variables are created and stored in a list (`ages`)

In [44]:
age1 = 29
age2 = "41"
age3 = 87

ages = [age1, age2, age3]

A for loop is used below to print the type of each of them:

In [45]:
for age in ages:
    print(type(age))

<class 'int'>
<class 'str'>
<class 'int'>


## Basic control structures: For loops

### Construction of a for loop

A for loop consists of two parts:
- A range of values for the loop to iterate over
- One or more commands that have to be repeated for each value

The previous example uses the list `ages` as the range of values to iterate over.

`age` is redefined with each loop, using the values from `ages`.

`age` is used in the commands to refer to the individual values.

Commands are written on the lines following `:`. Indentation matters here!
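The role of indentation can be sketched with a loop that converts each value in `ages` to an integer and collects the results. The list `ages_as_int` is a made-up name:

```python
ages = [29, "41", 87]

ages_as_int = []                  # empty list to collect the converted values
for age in ages:
    ages_as_int.append(int(age))  # indented: repeated for each value in ages

print(ages_as_int)                # not indented: runs once, after the loop
```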

---
## EXERCISE: For loops

You have previously been introduced to the method `.upper()` which converts a string to upper-case.

Write a for loop that prints each word in the list `words` in upper-case:

```
words = ["potato", "cat", "scrumptious", "monitor", "carpenter"]
```

## Basic control structures: While loops

While loops repeat a command while a condition is true.

Example:

In [46]:
x = 1

while x < 5:
    print("the loop keeps running")
    x = x + 1

the loop keeps running
the loop keeps running
the loop keeps running
the loop keeps running


## Basic control structures: While loops

#### Construction of a while loop

A while loop consists of two parts:
- A condition that has to be met for the loop to be repeated
- One or several commands to be repeated with each loop.

The previous example defines `x` as 1.

The loop runs while `x < 5`.

The loop contains a line that adds 1 to `x`. That way `x` reaches 5 after four iterations and the loop stops. DANGER: Infinite loop if this line is not included!
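The same loop can be traced by counting the iterations. A minimal sketch (the counter `iterations` is added for illustration):

```python
x = 1
iterations = 0

while x < 5:
    print("x is", x)
    iterations = iterations + 1
    x = x + 1    # without this line the condition would stay true forever

print("x ended at", x, "after", iterations, "iterations")
```

In JupyterLab, a loop that never ends can be stopped by interrupting the kernel.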

## Basic control structures: If statements

If statements are chunks of code where certain commands are only run if a condition is met:

In [47]:
x = 12

if x > 10:
    print("The number is larger than 10!")
else:
    print("The number is not larger than 10!")

The number is larger than 10!


## Basic control structures: If statements

The example consists of two blocks: an if-block and an else-block.

The code first evaluates the if-condition: `x > 10`.
- If true, the if-block is run
- If false, the else-block is run

An else-block is run in all cases where the if-condition is not met. It is optional.

If left out, nothing happens when the if-condition is not met:

In [48]:
x = 8

if x > 10:
    print("The number is larger than 10!")

## Basic control structures: If statements

### Multiple if-conditions

It is possible to specify multiple conditions with `elif` ("else if").

The conditions are evaluated in order: if the if-condition is false, the next elif-condition is evaluated.

This continues until a condition evaluates to true or an else-block is reached:

In [49]:
x = 7

if x > 10:
    print("The number is larger than 10!")
elif x > 5:
    print("The number is larger than 5!")
else:
    print("The number is not larger than 5!")

The number is larger than 5!
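If statements are often combined with for loops. The sketch below runs the same conditions on several made-up numbers and stores a label for each:

```python
numbers = [12, 7, 3]
labels = []

for x in numbers:
    if x > 10:
        labels.append("larger than 10")
    elif x > 5:                      # only checked when x > 10 is false
        labels.append("larger than 5")
    else:                            # runs when no condition above is true
        labels.append("not larger than 5")

print(labels)
```

Note that 12 is also larger than 5, but only the first block with a true condition is run.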


---
## Knowledge check: If statements

Inspect the following code:

```python
master = "Obi-Wan Kenobi"

if master == "Luke Skywalker":
    apprentice = "Ben Solo"
elif master == "Qui-Gon Jinn":
    apprentice = "Obi-Wan Kenobi"
elif master == "Obi-Wan Kenobi":
    apprentice = "Anakin Skywalker"
elif master == "Yoda":
    apprentice = "Mace Windu"
else:
    apprentice = "Nobody"
```

*What does the variable `apprentice` contain when the code is run? (try and solve it without running it)*

---

## Introduction to pandas DataFrames

*Pandas* is a module that allows you to work with dataframes in Python: a spreadsheet-like data structure with rows and columns.

The pandas module contains a lot of functions and methods for data handling and processing.

With the pandas module, various files can be imported directly as dataframes (also from the web).

Below, a subset of the Danish section of the 2014 European Social Survey (http://www.europeansocialsurvey.org/) is imported:

In [50]:
import pandas as pd

ess = pd.read_csv('https://github.com/CALDISS-AAU/workshop_python-intro/raw/master/datasets/ESS2014DK_subset.csv')
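Dataframes do not have to come from a file. As a sketch, a small dataframe can be built from a dictionary of lists. The values below are made up and are not taken from the ESS data:

```python
import pandas as pd

# Made-up values for illustration
data = {
    "height": [180.0, 165.0, 171.0],
    "weight": [82.0, 58.0, 70.0],
    "gndr": ["Male", "Female", "Female"],
}

df = pd.DataFrame(data)   # each key becomes a column, each list the column's values

print(df)
print(df.shape)           # (number of rows, number of columns)
```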

## Introduction to pandas DataFrames

### Inspecting DataFrames

Use the method `.head()` to inspect the first 5 rows of the data.

In [51]:
ess.head()

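| | idno | ppltrst | happy | cgtsday | alcfreq | height | weight | gndr | yrbrn |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 921018 | 6 | 9 | 10.0 | 2-3 times a month | 178.0 | 64.0 | Male | 1990.0 |
| 1 | 921026 | 8 | 8 | NaN | Several times a week | 172.0 | 64.0 | Female | 1948.0 |
| 2 | 921034 | 8 | 8 | NaN | Every day | 176.0 | 87.0 | Male | 1957.0 |
| 3 | 921076 | 8 | 8 | NaN | Several times a week | 162.0 | 70.0 | Female | 1958.0 |
| 4 | 921084 | 5 | 8 | NaN | Every day | 175.0 | 80.0 | Male | 1936.0 |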


Inspect column names with the attribute `.columns`:

In [52]:
list(ess.columns)

['idno',
 'ppltrst',
 'happy',
 'cgtsday',
 'alcfreq',
 'height',
 'weight',
 'gndr',
 'yrbrn']

## Introduction to pandas DataFrames

See key summary statistics using `.describe()` (n, mean, std, min, max, quartiles).

In [53]:
ess.describe()

| | idno | cgtsday | height | weight | yrbrn |
|---|---|---|---|---|---|
| count | 1502.0 | 330.0 | 1497.0 | 1473.0 | 1502.0 |
| mean | 935551.059254 | 12.0 | 173.752171 | 75.855261 | 1965.891478 |
| std | 8588.682562 | 9.237495 | 9.625371 | 15.599516 | 18.94614 |
| min | 921018.0 | 0.0 | 142.0 | 38.0 | 1914.0 |
| 25% | 928066.5 | 5.0 | 167.0 | 65.0 | 1951.0 |
| 50% | 935772.0 | 10.0 | 173.0 | 74.0 | 1966.0 |
| 75% | 942859.5 | 20.0 | 180.0 | 85.0 | 1981.0 |
| max | 950516.0 | 102.0 | 204.0 | 137.0 | 1999.0 |


## Introduction to pandas DataFrames

### Slicing rows and selecting columns

Selecting rows is referred to as *slicing*. Rows can be selected by their index using `[]`. The slice excludes the last index.

Columns can be selected the same way by referring to the column name.

In [54]:
ess[0:1] # First row

| | idno | ppltrst | happy | cgtsday | alcfreq | height | weight | gndr | yrbrn |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 921018 | 6 | 9 | 10.0 | 2-3 times a month | 178.0 | 64.0 | Male | 1990.0 |


In [55]:
ess['alcfreq'] # Selecting alcfreq column.

0          2-3 times a month
1       Several times a week
2                  Every day
3       Several times a week
4                  Every day
                ...         
1497    Several times a week
1498       2-3 times a month
1499             Once a week
1500             Once a week
1501       2-3 times a month
Name: alcfreq, Length: 1502, dtype: object

## Introduction to pandas DataFrames

The indexer `.loc[]` is used for subsetting the data. First rows, then columns. Columns have to be specified by their name.

Several columns can be selected by referring to a list of column names.

Unlike the "standard" indexing/slicing, using `.loc[]` includes the last index.

*NOTE*: `.loc[]` is also used for recoding specific values.

In [56]:
ess.loc[2:4, 'alcfreq'] # Returns as a series

2               Every day
3    Several times a week
4               Every day
Name: alcfreq, dtype: object

In [57]:
ess.loc[2:4, ['alcfreq', 'yrbrn']] # Returns as a dataframe

| | alcfreq | yrbrn |
|---|---|---|
| 2 | Every day | 1957.0 |
| 3 | Several times a week | 1958.0 |
| 4 | Every day | 1936.0 |


In [58]:
ess[2:4][['alcfreq', 'yrbrn']] # Alternative - excludes last index

| | alcfreq | yrbrn |
|---|---|---|
| 2 | Every day | 1957.0 |
| 3 | Several times a week | 1958.0 |
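Besides row labels, `.loc[]` also accepts a condition for selecting rows, which is what makes recoding possible. A minimal sketch on a made-up mini-dataset (the values and the recoded label `"Daily"` are assumptions for illustration):

```python
import pandas as pd

# Made-up mini-dataset, not the ESS data
df = pd.DataFrame({
    "alcfreq": ["Every day", "Once a week", "Every day", "Never"],
    "yrbrn": [1957.0, 1958.0, 1936.0, 1990.0],
})

# A condition on a column gives a boolean series (one True/False per row)
born_before_1958 = df["yrbrn"] < 1958

# Rows: where the condition is True. Columns: by name.
print(df.loc[born_before_1958, ["alcfreq", "yrbrn"]])

# Recoding: assign a new value to the selected rows of one column
df.loc[df["alcfreq"] == "Every day", "alcfreq"] = "Daily"
print(df["alcfreq"].tolist())
```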


## Introduction to pandas DataFrames

### Series

A `DataFrame` is a type. A closely related type is the `Series`: a one-dimensional data structure with an index (like a type-specific list or a variable, as it is understood in statistics). Each column of a dataframe is a series.

In [59]:
a = pd.Series([4, 2, 7, 8, 4, 4])
print(a)

0    4
1    2
2    7
3    8
4    4
5    4
dtype: int64


In [60]:
print(a*2 + 4)

0    12
1     8
2    18
3    20
4    12
5    12
dtype: int64


A wide range of operations can be performed on series (like variables in any other statistics software).

In [61]:
print(a.unique())

[4 2 7 8]


In [62]:
print(a.isin([2, 4]))

0     True
1     True
2    False
3    False
4     True
5     True
dtype: bool


## Introduction to pandas DataFrames

### Operations on dataframes
Operations can be performed on pandas series.

Operations on pandas series are not restricted to pandas functions!

In [63]:
(ess['height'] / 100).head() #converting to meters - first 5 rows

0    1.78
1    1.72
2    1.76
3    1.62
4    1.75
Name: height, dtype: float64

In [64]:
ess['height'].mean()

173.75217100868403

The type of a dataframe column (series) can be inspected using the attribute `dtypes`.

In [65]:
ess['height'].dtypes

dtype('float64')

## Introduction to pandas DataFrames

### Creating variables

New variables (columns) are created by assigning values to a column name not yet in the dataframe.

In [66]:
ess['height_m'] = ess['height'] / 100

In [67]:
list(ess.columns)

['idno',
 'ppltrst',
 'happy',
 'cgtsday',
 'alcfreq',
 'height',
 'weight',
 'gndr',
 'yrbrn',
 'height_m']

## Introduction to pandas DataFrames

### NaN: "Not a Number"

`NaN` is the pandas equivalent of a missing value.

Notice that Python does not treat NaN-values as larger or smaller than zero. NaN-values do not have a value. We therefore need to use specific methods to refer to them (like `.isnull()`).

In [68]:
ess['cgtsday'].isnull()

0       False
1        True
2        True
3        True
4        True
        ...  
1497     True
1498     True
1499     True
1500     True
1501     True
Name: cgtsday, Length: 1502, dtype: bool
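How NaN-values behave in comparisons and calculations can be sketched with a small made-up series. `float("nan")` is used here only as a way of creating missing values by hand:

```python
import pandas as pd

# Made-up values; two of them are missing
cigs = pd.Series([10.0, float("nan"), float("nan"), 5.0])

print(cigs > 0)    # comparisons with NaN are False ...
print(cigs < 0)    # ... in both directions

n_missing = cigs.isnull().sum()   # True counts as 1, so the sum is the number of missing values
print(n_missing)

mean_cigs = cigs.mean()           # pandas skips NaN by default when computing the mean
print(mean_cigs)
```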

## EXERCISE: DataFrames

1. Create a new variable/column called `bmi` containing the BMI of the respondents.
    - BMI = kg/m<sup>2</sup> (power of 2 in Python is written with `**2`)
2. What is the lowest BMI? Use either `.describe()` or `.min()`