## Introduction to Python for data science

This notebook contains the code examples from Monday of the MASSHINE Social Data Science summer school 2024.

## Python notebooks

This file is a Python notebook, which is one way of interacting with Python. A notebook consists of two types of cells: code cells and text (Markdown) cells.

Code cells can contain and run Python code.

Text cells contain normal text (written in [Markdown](https://www.markdownguide.org/)).

### Python code

Code cells can only contain Python code. Use text cells to write documentation and notes for the code. If you want to add comments to the code itself, use `#`. Everything that follows a `#` on the same line is ignored by the Python interpreter.

Spaces affect the interpretation of the code when they appear at the beginning of a line (indentation). Elsewhere in a line of code, Python ignores spaces, but they are often added for readability.

In [1]:
# Example of a code cell
2+2

4
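The rules about comments and spaces described above can be sketched as follows. The variable names and values are illustrative and not part of the course material; the `if` statement is only used here to show an indented line and is explained later in the notebook.

```python
# A full-line comment: Python ignores this line
total = 2 + 2  # an inline comment: everything after # on this line is ignored

# Spaces around operators do not change the result
same_total = 2+2

# Indentation, however, is part of the syntax:
# the indented line belongs to the if-statement above it
if total == same_total:
    message = "spacing around + made no difference"

print(message)
```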

## Variables and types

Use = to define variables.

In [2]:
a = 10 # numeric type
b = "hello" # string type
c = True # boolean type

### Types matter

A variable supports different operations depending on its type (class).

In [3]:
# works - numeric + numeric
a + 5

15

In [4]:
# gives an error - string + numeric = ?
b + 5

TypeError: can only concatenate str (not "int") to str
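One common way around this error is to convert one of the values so both have the same type. The sketch below uses the built-in functions `str()`, `int()` and `type()`; the variable `number_text` is a hypothetical example value.

```python
a = 10       # numeric type (int)
b = "hello"  # string type

# Convert the number to a string before concatenating
combined = b + str(a)
print(combined)  # hello10

# Convert a string of digits to a number before adding
number_text = "5"  # hypothetical example value
total = a + int(number_text)
print(total)  # 15

# type() shows which type a variable has
print(type(a), type(b))
```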

### The boolean type

Booleans can only be `True` or `False`. A range of operators return a boolean when used on simple values such as numbers and strings:

- `<` less than
- `>` greater than
- `==` equals
- `<=` less than or equal
- `>=` greater than or equal
- `not` negates the boolean

In [5]:
this_is_true = 5 > 2

print(this_is_true)

True


In [6]:
print(not this_is_true)

False
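Comparisons can also be combined. The sketch below uses `!=`, `and` and `or`, which are standard Python operators that are not listed above; the variable names are illustrative.

```python
x = 5
y = 2

is_greater = x > y      # True
is_equal = x == y       # False
is_different = x != y   # True: != means "not equal"

# "and" is True only if both sides are True
both = is_greater and is_equal    # False
# "or" is True if at least one side is True
either = is_greater or is_equal   # True

print(is_greater, is_equal, is_different, both, either)
```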


## Defining functions

In [7]:
def say_hello(name): # say_hello is the name of the function. name is a placeholder variable - the name used for the input (argument)
  print("Hello " + "there " + name) # the function concatenates/pastes the strings "Hello ", "there " and the input variable (name). Notice the strings contain spaces.
  return

In [8]:
# running the function with "Mathieu" as input
say_hello("Mathieu")

Hello there Mathieu
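`say_hello` prints its result but does not return anything. As a sketch of the difference, the hypothetical function `make_greeting` below returns the string instead, so the result can be stored in a variable and reused.

```python
def make_greeting(name):
    # Build the greeting and hand it back to the caller instead of printing it
    return "Hello " + "there " + name

greeting = make_greeting("Mathieu")  # the returned string is stored in a variable
print(greeting)
print(greeting.upper())  # the stored result can be used again
```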


## Cell order vs. running order

In Python notebooks, cells can be arranged in any order and run in any order. Each time a code cell is run, it is assigned a running number (the number in the square brackets `[]` to the left of the code cell). Deleting a code cell does not delete the variables that were defined in it.

Good practice is to make sure that the cell order reflects the running order. Otherwise, you risk ending up with code that works in the current session but fails in a new session, because a cell it depends on was accidentally deleted at some point.

The example below is purely illustrative:

Imagine you have written a lot of code that you are using to produce your analysis (`my_entire_phd`). This code expects a variable defined outside the function (`really_important_variable`). Assume, then, that you keep working on your code, but during this work you accidentally delete the cell defining `really_important_variable`. As long as you are in the same session, all the code will still work, as the variable is still defined (part of the vocabulary - the "namespace"). When you restart the session and run the cells again, the code will all of a sudden stop working, as the cell with `really_important_variable` was lost.

In [9]:
# I hope I don't lose this
really_important_variable = 42

In [10]:
def my_entire_phd(mywork):
  some_work = (mywork + 4 + 4 + 2)*2

  really_important_result = really_important_variable + some_work

  return(really_important_result)



In [11]:
my_entire_phd(9000)

18062
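One way to make this kind of hidden dependency visible is to pass the value in as an argument. This is a sketch of one possible design, not the approach used in the notebook; `my_entire_phd_v2` and its second parameter `important_value` are hypothetical names.

```python
def my_entire_phd_v2(mywork, important_value):
    # Same calculation as above, but the value the function depends on
    # is now an explicit argument instead of a variable defined in another cell
    some_work = (mywork + 4 + 4 + 2) * 2
    return important_value + some_work

result = my_entire_phd_v2(9000, 42)
print(result)  # 18062
```

If the cell defining the value is lost, the function call itself still shows which input is needed.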

## Hands-on: Defining a function

How can we create a function that calculates the area of a circle from a given radius?

$A = \pi r^2$

In [12]:
def area(r): # function is called "area". "r" is the name of the placeholder variable

  pi=3.1415926 # Python has no built-in pi value, so it is defined within the function

  calc = pi*r**2 # calculating the area based on input r and pi as defined above

  return(calc) # return the calculation

In [13]:
# using the function
area(8)

201.0619264

## Packages

Packages are imported using `import` followed by the package name.

In [14]:
import math

In [15]:
# pi can now be accessed from the math package

math.pi

3.141592653589793

In [16]:
# Version 2 of the area function using pi from the math package
def areav2(r):

  calc = math.pi*r**2

  return(calc)

In [17]:
# using the function
areav2(8)

201.06192982974676

In [18]:
# importing specific function/variables from a package

from math import pi

In [19]:
pi

3.141592653589793

In [20]:
# importing and assigning a nickname (silly example)

from math import pi as ploptidoo

In [21]:
ploptidoo

3.141592653589793
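In practice, nicknames (aliases) are mostly used to shorten package names. The sketch below assumes the standard-library `statistics` package as an example; the alias `stats` and the numbers in `ages` are arbitrary choices for illustration.

```python
import statistics as stats  # import the whole package under a shorter name

ages = [29, 41, 33]  # illustrative values

mean_age = stats.mean(ages)      # functions are accessed through the alias
median_age = stats.median(ages)

print(mean_age, median_age)
```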

## Lists

Lists are defined using `[]` with elements separated by `,`.

Each element in a list is assigned an index based on the order of the elements. Indices start with 0.

In [22]:
numbers = [3,5,7,9,10]

In [23]:
numbers

[3, 5, 7, 9, 10]

In [24]:
numbers[0] # accessing the first element - index 0

3

In [25]:
numbers[4] # accessing the last element - index 4

10

In [26]:
numbers[0] * 2 # operation on specific element (does not change the element)

6

In [27]:
things = [27, "hej", True, [2,3,4]] # list with mixed types: numeric, string, boolean, list

In [28]:
things

[27, 'hej', True, [2, 3, 4]]

In [29]:
things[1]

'hej'

In [30]:
things[3][1] # last item (index 3) is a list which also can be indexed

3
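A few more standard list operations, sketched on the same `numbers` list as above: `len()` for the number of elements, negative indices, slices, and assignment to an index.

```python
numbers = [3, 5, 7, 9, 10]

print(len(numbers))   # number of elements: 5
print(numbers[-1])    # negative indices count from the end: 10
print(numbers[1:3])   # slice from index 1 up to (not including) index 3: [5, 7]

numbers[0] = 100      # assigning to an index changes the element
print(numbers)        # [100, 5, 7, 9, 10]
```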

## Dictionaries

Dictionaries store data in key-value pairs. They are defined using `{}`. Keys are typically strings (as in the examples below), and each value follows its key after a `:`. Key-value pairs are separated by commas `,`.

Keys in a dictionary must be unique.

Values can be of any type (including other dictionaries).

Dictionaries closely resemble objects in the JSON data format.

In [31]:
student_ages = {
    "Alice": 29,
    "Ole": 41,
    "Yurgio": 33,
    "Unknown": ["Carl", "Maria", "Frido"], # a list
    "Aalborg": { # another dictionary
        "Rolf": 38,
        "Kristian": 35
    }
}

In [32]:
student_ages

{'Alice': 29,
 'Ole': 41,
 'Yurgio': 33,
 'Unknown': ['Carl', 'Maria', 'Frido'],
 'Aalborg': {'Rolf': 38, 'Kristian': 35}}

In [33]:
student_ages["Unknown"] # accessing the element with key "Unknown" - the list

['Carl', 'Maria', 'Frido']
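Accessing a key that does not exist raises a `KeyError`. The sketch below shows two standard ways to handle that, the `in` operator and the `.get()` method; the small dictionary `ages` is an illustrative stand-in for `student_ages`.

```python
ages = {"Alice": 29, "Ole": 41}  # illustrative stand-in for student_ages

print("Alice" in ages)   # True: the key exists
print("Tom" in ages)     # False: the key does not exist

# .get() returns None (or a chosen fallback) instead of raising a KeyError
missing = ages.get("Tom")
fallback = ages.get("Tom", 0)
print(missing, fallback)  # None 0
```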

In [34]:
student_ages["Tom"] = 10 # adding a new key to the dictionary (Tom) with value 10

In [35]:
student_ages

{'Alice': 29,
 'Ole': 41,
 'Yurgio': 33,
 'Unknown': ['Carl', 'Maria', 'Frido'],
 'Aalborg': {'Rolf': 38, 'Kristian': 35},
 'Tom': 10}

In [36]:
student_ages["Unknown"].append("Paul") # adding element to the "Unknown"-list

In [37]:
student_ages

{'Alice': 29,
 'Ole': 41,
 'Yurgio': 33,
 'Unknown': ['Carl', 'Maria', 'Frido', 'Paul'],
 'Aalborg': {'Rolf': 38, 'Kristian': 35},
 'Tom': 10}
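The relationship between dictionaries and JSON can be sketched with the standard-library `json` package, which converts between dictionaries and JSON text. The small dictionary `ages` is an illustrative stand-in for `student_ages`.

```python
import json

ages = {"Alice": 29, "Unknown": ["Carl", "Maria"], "Aalborg": {"Rolf": 38}}

json_text = json.dumps(ages)      # dictionary -> JSON-formatted string
restored = json.loads(json_text)  # JSON-formatted string -> dictionary

print(json_text)
print(restored == ages)             # True: the round trip preserves the content
print(restored["Aalborg"]["Rolf"])  # nested values are accessed key by key: 38
```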

## Sets

`{}` are also used to define sets. Sets are collections of unique values.

In [38]:
something = {2,3,4,4}

In [39]:
something


{2, 3, 4}
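A short sketch of what sets are commonly used for: removing duplicates and comparing collections. The variable names and values are illustrative. Note that `{}` on its own creates an empty dictionary, so an empty set is written `set()`.

```python
first = {2, 3, 4, 4}   # the duplicate 4 is stored only once
second = {4, 5}

print(len(first))        # 3
print(first | second)    # union: values in either set
print(first & second)    # intersection: values in both sets -> {4}

# {} on its own creates an empty dictionary; use set() for an empty set
empty_dict = {}
empty_set = set()
print(type(empty_dict), type(empty_set))

# set() also removes duplicates from a list
unique_values = set([1, 1, 2])
print(unique_values)
```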

## Classes

All variables are instances of a specific *class* with its own attributes and methods.

Classes differ in how calling a method affects the variable: some methods return a new value and leave the variable unchanged, while others change the variable itself.

In [40]:
word = "tree" # instance of a string class
things = [3,6,8] # instance of a list class

In [41]:
word.upper() # strings contain string manipulation methods like .upper() - converts to upper-case

'TREE'

In [42]:
word # notice that the variable has not been changed by calling the method

'tree'

In [43]:
things.append(12) # lists contain the .append() method for adding elements to a list

In [44]:
things # notice that the variable has been changed - 12 is now added to the list

[3, 6, 8, 12]
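The same distinction can be sketched with sorting, where Python offers both variants: the built-in function `sorted()` returns a new list, while the list method `.sort()` changes the list itself. The values below are illustrative.

```python
word = "tree"
things = [8, 3, 6]

print(type(word))    # <class 'str'>
print(type(things))  # <class 'list'>

# Strings cannot be changed in place: keep the result by assigning it
shout = word.upper()

# sorted() returns a new list and leaves the original unchanged ...
ordered = sorted(things)
print(things)   # [8, 3, 6]

# ... while the list method .sort() changes the list itself and returns None
things.sort()
print(things)   # [3, 6, 8]
```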

## if-else statements

In [45]:
age = 20

if age < 18: # if the variable age is below 18 (the condition evaluates to True), execute the code below (notice the indent)
  print("Minor")
elif age >= 18: # if the variable age is above or equal to 18, execute the code below
  print("Adult")
else: # if all conditions evaluate to False, execute the code below
  print("whatever")

Adult
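In the cell above, the two conditions together cover every possible age, so the `else` branch can never run. A minimal sketch where all three branches are reachable (the function name `age_group` and the threshold 65 are arbitrary choices for illustration):

```python
def age_group(age):
    # Conditions are checked from top to bottom; only the first match runs
    if age < 18:
        return "Minor"
    elif age < 65:
        return "Adult"
    else:  # reached only when none of the conditions above are True
        return "Senior"

print(age_group(20))  # Adult
print(age_group(70))  # Senior
```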


## Functions and keyword arguments

Functions can take many different arguments (inputs). They can have both non-keyword (positional) arguments and keyword arguments. Keyword arguments are named arguments that are usually assigned a default value. This means that when you run the function without setting the keyword arguments, the function uses the default values for those arguments.

**Example with print**

The `print()` function prints out variables. The variables to be printed are separated by commas `,`. These are the non-keyword arguments.

`print()` also has a range of keyword arguments, including `sep`. This argument sets which character(s) should separate the printed variables. The default, `sep = ' '` (a single space), is used unless you set `sep` to something else when calling the function.

In [46]:
print("hello", "there", "darth")

hello there darth


In [47]:
print("hello", "there", "darth", sep = "-")

hello-there-darth
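Keyword arguments with defaults can also be defined in your own functions. The function `greet` and its parameters below are hypothetical and only meant as a sketch of the syntax.

```python
def greet(name, greeting="Hello", punctuation="!"):
    # name is a non-keyword argument; greeting and punctuation have defaults
    return greeting + " " + name + punctuation

print(greet("Mathieu"))                   # Hello Mathieu!
print(greet("Mathieu", greeting="Hej"))   # Hej Mathieu!
print(greet("Mathieu", punctuation="?"))  # Hello Mathieu?
```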


## Hands-on: Writing a function using data and control structures

Write a function that returns all even numbers in a list of numbers.

*Tip*: Use `%` for remainder division (e.g. `9%2` returns 1, as 1 is left over after 9 has been divided by 2).

In [48]:
def even_numbers(input_list): # even_numbers is the name of the function; input_list the placeholder variable (main argument)

  even_number_list = [] # create an empty list inside the function - used to store the even numbers

  for e in input_list: # iterate over numbers in the input list with "e" as a placeholder

    remainder = e%2 # using remainder division on number in list (e)

    if remainder == 0: # if the remainder == 0 (i.e. the number is even), execute the code below
      even_number_list.append(e) # add the number to the even_number_list

  return(even_number_list) # return the list - notice the indent! Should be the same indent as the for-statement. If further indented, it will return after running through the first number

In [49]:
# list of numbers
numbers = [2,52,3,15,976,231,251,24,57,10]

In [50]:
# run the function and store output in new list
keep_numbers = even_numbers(numbers)

In [51]:
keep_numbers

[2, 52, 976, 24, 10]
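The same result can be written more compactly with a list comprehension, a standard Python construct that combines the loop, the condition and the new list in one line. This is an alternative sketch, not a replacement for the solution above; the name `even_numbers_v2` is hypothetical.

```python
def even_numbers_v2(input_list):
    # Keep each element e whose remainder after division by 2 is 0
    return [e for e in input_list if e % 2 == 0]

numbers = [2, 52, 3, 15, 976, 231, 251, 24, 57, 10]
print(even_numbers_v2(numbers))  # [2, 52, 976, 24, 10]
```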