# Module P02-introduction-Python

Welcome to the course "CPP Data Science & AI"!



## Jupyter notebooks

During this course, you will learn Python via Jupyter notebooks. 

Jupyter notebooks are composed of cells. In each cell, you can write either text or code. When you run a cell, its code is executed and the output is displayed.

Jupyter notebooks are interactive, which means that you can easily modify the contents of a cell, and see whether the output is to your liking or not.

Another advantage of Jupyter notebooks is that they can be converted to a slide deck. All presentations in this course are made this way.

## Basics

Let's try to add two numbers

In [1]:
1+1

2

Assign a value to a variable

In [2]:
# the "=" is used to assign a value to a variable in Python

x = 5
print(x) # the print() function is used to check the value of x

5


In [3]:
y = 3
x + y

8

### Variable types

With the type() function, you can check the type of a variable. You do not have to declare a variable before assigning a value to it: Python sets the type for you, based on the assigned value.

In [4]:
type(x) # the variable type is an integer, a whole number

int

In [5]:
z = 5.0
type(z) # the variable type is a float, a number with a decimal point

float

In [6]:
# x and z have the exact same values assigned
x == z

True

In [7]:
# ... but x and z are not the same object: "is" compares identity, not value
# (their types, int and float, differ as well)
x is z

False
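
The three kinds of comparison touched on above (value, type and identity) can be sketched side by side. The lists `a`, `b` and `c` are hypothetical names introduced only for this illustration.

```python
x = 5
z = 5.0

# == compares values: 5 and 5.0 are numerically equal
values_equal = (x == z)

# comparing the results of type() checks whether the types match
same_type = (type(x) == type(z))

# "is" checks whether two names refer to the very same object
a = [1, 2, 3]
b = [1, 2, 3]   # equal contents, but a separate list object
c = a           # a second name for the same list object

print(values_equal)     # True
print(same_type)        # False
print(a == b, a is b)   # True False
print(a is c)           # True
```

In everyday code, `==` is what you want for comparing values; `is` is mostly used for checks such as `value is None`.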

## Python object types

Four important Python object types are:

- strings
- lists
- dictionaries
- tuples

In this module, we will explain strings, lists and dictionaries.

### Python object type 1: strings

A string is a sequence of characters. 

In [8]:
# to generate an empty string
str_empty = ""

# use the type-function to check the type of object
type(str_empty)

str

An example is the string "Welcome to this course!"

The function len() returns the length of a string. Please note that whitespaces and punctuation are also counted as characters.

In [9]:
# generate string and assign this to the object welcome
welcome = "Welcome to this course!"

print(welcome)

Welcome to this course!


In [10]:
# length of this string
len(welcome)

23

#### Indexing

An index refers to a position in an ordered sequence. A string can be seen as a sequence of characters.

The index() method, called on a string, returns the position of the first occurrence of a given element in that string, i.e. the lowest index for this element.

Python uses 0-based indexing, which means that index = 0 refers to the first element, index = 1 to the second element, etc.
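
As a small sketch of 0-based indexing, individual characters can be looked up directly with square brackets. The variable names `first_char`, `second_char` and `last_char` are chosen here for illustration only.

```python
welcome = "Welcome to this course!"

first_char = welcome[0]                  # index 0 is the first character
second_char = welcome[1]                 # index 1 is the second character
last_char = welcome[len(welcome) - 1]    # the last index is the length minus 1

print(first_char, second_char, last_char)   # W e !
```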

In [11]:
# let's see the string again
print(welcome)

Welcome to this course!


In [12]:
# the outcome is 1, since the second element of welcome is the first occurrence of the character "e" 
welcome.index("e")

1

In [13]:
# an element can also be a combination of characters
# the substring "co" starts at position nr 4 in the string, thus the index is 3 due to 0-based indexing
welcome.index("co")

3

#### Slicing

Now we do the opposite: let's use indices to get a particular substring, a run of consecutive characters, from a string.

In general, slicing has the following form: 

a[start:stop]

where:
- a = an object (for now, we start with slicing a string)
- start = the starting position of the first character in the substring
- stop  = the position of the first character which is NOT included in the selected slice

Round brackets are used for calling functions; square brackets are used for indexing and slicing.

From: https://stackoverflow.com/questions/509211/understanding-slice-notation

Let's use indices to get a particular word from our welcome string, in this example the word "to"

Remember, Python uses 0-based indexing, e.g. index 8 refers to the 9th position in the string.

In [14]:
# let's see the string again
print(welcome)

Welcome to this course!


In [15]:
# "to" starts with the 9th position in the string (whitespaces also count as characters)
# the first number between brackets, before the colon, refers to the starting position
# thus the first number is 8
# the second number, after the colon, refers to the 11th position, which is the first position not included
# in the slice, thus the second number is 10

# tip for checking: you get the number of characters in this substring by subtracting the two numbers from each other: 
# 10 minus 8 equals 2 characters

# use slicing to get the substring "to"
welcome[8:10]


'to'

For slicing, you do not need to specify both numbers.

- a[start:] - the slice starts at the specified index, and includes the rest of the string
- a[:stop] - the slice starts at the beginning of the string, and stops just before index stop (the last character included has index stop-1)

In [16]:
# to get the first word, without a subsequent whitespace
welcome[:7]

'Welcome'

In [17]:
# to get the last two words, including the exclamation mark
welcome[11:]

'this course!'

In [18]:
# when you do not specify any number, you get the whole object again
welcome[:]

'Welcome to this course!'

Negative numbers can also be used for slicing


In [19]:
# to get the last word without the exclamation mark
welcome[-7:-1]


'course'

#### Explicit step-argument in slicing

You can also specify the step-argument for slicing.

a[start:stop:step]

where:

- a = an object
- start = the starting position of the first element
- stop = the position of the first element which is NOT included in the selected slice
- step = the amount by which the index increases per step. When the step argument is not specified, the default is 1

From: https://stackoverflow.com/questions/509211/understanding-slice-notation

In [20]:
# let's create an object with only numbers
numbers = [2,-5,6,20,7,10,-5,-3,7,10,5,-4]

In [21]:
# now get only numbers on the even positions (position nr 2, nr 4, etc)
# the start argument equals 1, and corresponds with the number -5 on position nr. 2
# the step argument is 2: the index increases with 2 by each step
# the stop argument is not specified, thus the slicing continues up to and including the end
numbers[1::2]


[-5, 20, 10, -3, 10, -4]

In [22]:
# to get the same numbers in reverse order
numbers[:-12:-2]

[-4, 10, -3, 10, 20, -5]
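
A few more step values, as a minimal sketch: leaving out both start and stop while giving a step applies that step to the whole object, and the same notation works on strings. The names `reversed_numbers`, `every_third` and `reversed_welcome` are introduced here for illustration.

```python
numbers = [2, -5, 6, 20, 7, 10, -5, -3, 7, 10, 5, -4]
welcome = "Welcome to this course!"

reversed_numbers = numbers[::-1]   # whole list, walking backwards
every_third = numbers[::3]         # indices 0, 3, 6 and 9
reversed_welcome = welcome[::-1]   # a step also works when slicing strings

print(reversed_numbers)   # [-4, 5, 10, 7, -3, -5, 10, 7, 20, 6, -5, 2]
print(every_third)        # [2, 20, -5, 10]
print(reversed_welcome)   # !esruoc siht ot emocleW
```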

#### Moving from strings to lists

Our familiar welcome string contains several words. The split() method returns a list of strings. By default, whitespace is used as the separator, so each string in the resulting list contains one word.

In [23]:
# to get a list with separate words as strings
list_welcome = welcome.split()
list_welcome

['Welcome', 'to', 'this', 'course!']

In [24]:
# for this string, the same result is obtained by explicitly passing a single space as separator
list_welcome_alt = welcome.split(" ")
list_welcome_alt

['Welcome', 'to', 'this', 'course!']

In [25]:
# you can also use a different separator, let's say a comma
welcome_long = "Welcome to this course, put in the hours, and you can use Python for analysis"
list_welcome_long = welcome_long.split(",")
list_welcome_long

['Welcome to this course',
 ' put in the hours',
 ' and you can use Python for analysis']

This results in a list containing three elements, substrings from the original string.
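
Note that the second and third substrings start with a space. The sketch below shows one possible way to clean that up with strip(), how split() and split(" ") differ, and how join() reverses a split. The string `spaced` is a hypothetical example.

```python
welcome_long = "Welcome to this course, put in the hours, and you can use Python for analysis"

# strip() removes leading and trailing whitespace from each part
parts_clean = [part.strip() for part in welcome_long.split(",")]
print(parts_clean)

# split() without an argument treats any run of whitespace as one separator,
# whereas split(" ") splits on every single space
spaced = "two  spaces"            # hypothetical string containing a double space
print(spaced.split())             # ['two', 'spaces']
print(spaced.split(" "))          # ['two', '', 'spaces']

# join() is the counterpart of split(): it glues a list of strings together
rejoined = ", ".join(parts_clean)
print(rejoined == welcome_long)   # True
```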

### Python object type 2: Lists

In Python, a list is an ordered sequence of items. The items of a list are put between square brackets, and commas separate the items from each other.

Lists are very flexible: they can contain items of various data types, and they can also contain other lists (lists within lists are called nested lists).

In [26]:
# to create a new list, simply use square brackets
list_empty = []
type(list_empty)

list

In [27]:
# this list contains four items: 
list_new = ['first_item', 58, 7, 12.25]

# The length of the list shows the number of items in a list
len(list_new) 

4

In [28]:
# this list also contains four items:
list_new2 = ['first_item', 58, [5.00, 7, 'last_item_nested_list'], 12.25]

len(list_new2)

4

In [29]:
# to print the nested list (third item of list_new2)
print(list_new2[2])

[5.0, 7, 'last_item_nested_list']
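
Items inside a nested list can be reached by chaining square brackets, as in this short sketch (the names `nested` and `last_nested` are chosen here for illustration):

```python
list_new2 = ['first_item', 58, [5.00, 7, 'last_item_nested_list'], 12.25]

nested = list_new2[2]           # the nested list itself
last_nested = list_new2[2][2]   # third item of the nested list

print(len(nested))    # 3
print(last_nested)    # last_item_nested_list
```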


#### Slicing list

Lists can be sliced in similar ways as strings. 

In [30]:
# let's create a new list
list_long = [2, 5, 9, 4.57, 'dogs', 7, 'cats', 80, 9.34, 'snakes']

# to select the first four items of this list
# the fifth item (with index 4) is not selected anymore
list_long[:4]

[2, 5, 9, 4.57]

In [31]:
# to get the items at the even positions (2nd, 4th, etc.) of this list
list_long[1::2]

[5, 4.57, 7, 80, 'snakes']

In [32]:
# use for-loop to get only strings from a list
# to create an empty list
list_strings_only = []

for item in list_long:                     
    if type(item) == str: # check whether the item is a string, result is either True or False
        #print(item)
        list_strings_only.append(item)      # add item to list only if the if-condition is True

print(list_strings_only)

['dogs', 'cats', 'snakes']
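
The same selection can be written more compactly. This is only an alternative sketch, not a replacement for the loop above: it uses the built-in isinstance() function for the type check and a list comprehension instead of append().

```python
list_long = [2, 5, 9, 4.57, 'dogs', 7, 'cats', 80, 9.34, 'snakes']

# keep an item only if it is a string
list_strings_only = [item for item in list_long if isinstance(item, str)]

print(list_strings_only)   # ['dogs', 'cats', 'snakes']
```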


#### Indentation in loops

In Python, indentation is important. Compare the output of the next code block with that of the previous one. Here the list is printed every time an item is appended, because the print() call is indented inside the if-statement: it only runs when the condition is true.

In [33]:
list_strings_only = []

for item in list_long:                     
    if type(item) == str:                  # check whether the item is a string, result is either True or False
        list_strings_only.append(item)     # add item to list only if the if-condition is True
        print(list_strings_only)

['dogs']
['dogs', 'cats']
['dogs', 'cats', 'snakes']


Can you explain the following result?

In [34]:
list_strings_only = []

for item in list_long:                     
    if type(item) == str:                  # check whether the item is a string, result is either True or False
        list_strings_only.append(item)     # add item to list only if the if-condition is True
    print(list_strings_only)

[]
[]
[]
[]
['dogs']
['dogs']
['dogs', 'cats']
['dogs', 'cats']
['dogs', 'cats']
['dogs', 'cats', 'snakes']


#### Differences between lists and strings

Lists are mutable, strings are not.

In [35]:
# recap the string welcome_long
welcome_long 

'Welcome to this course, put in the hours, and you can use Python for analysis'

In [36]:
# welcome_long[1] = "e"

# uncommenting the line above raises a TypeError, because strings are immutable
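
To see the error without stopping the notebook, the assignment can be wrapped in try/except, as in this sketch. It also shows the usual workaround: building a new string from slices. The short string `text` and the name `changed` are hypothetical.

```python
text = "Welcome"   # hypothetical short string

try:
    text[1] = "E"                 # strings do not support item assignment
    assignment_worked = True
except TypeError as error:
    assignment_worked = False
    print("TypeError:", error)

# workaround: create a new string from slices of the old one
changed = text[:1] + "E" + text[2:]
print(changed)   # WElcome
print(text)      # Welcome (the original string is unchanged)
```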

In [37]:
# recap long list
list_long

[2, 5, 9, 4.57, 'dogs', 7, 'cats', 80, 9.34, 'snakes']

In [38]:
# change the second item (index 1)
list_long[1] = "e"
list_long

[2, 'e', 9, 4.57, 'dogs', 7, 'cats', 80, 9.34, 'snakes']

#### Functions for mutable lists



In [39]:
# with the append-function, you can add an item to the end of the list
list_long.append('rabbits')
list_long

[2, 'e', 9, 4.57, 'dogs', 7, 'cats', 80, 9.34, 'snakes', 'rabbits']

In [40]:
# with the insert-function, you can add an item to the list at a position specified by an index
list_long.insert(3,'pigs')
list_long

[2, 'e', 9, 'pigs', 4.57, 'dogs', 7, 'cats', 80, 9.34, 'snakes', 'rabbits']

In [41]:
# with the remove-function, you can remove a specific item from the list
list_long.remove('pigs')
list_long

[2, 'e', 9, 4.57, 'dogs', 7, 'cats', 80, 9.34, 'snakes', 'rabbits']

### Python object type 3: Dictionaries

A dictionary is a changeable collection of key-value pairs. Since Python 3.7, dictionaries keep their items in insertion order, but items are looked up by key, not by position.

Specific values can be looked up by using their key.

In [42]:
# dictionaries are written with curly braces.
# to create a new dictionary
python_scores = {}

In [43]:
# example of dictionary with Python scores
# names are the keys, with Python scores as corresponding values
# key-value pairs ('key: value') are separated by commas
python_scores = {'bas': '7', 'robert': '8', 'susie': '7', 'timmy': '6', 'michael': '5', 'richard': '7'}

In [44]:
# use the key to get a specific value
python_scores['robert']

'8'
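
Because dictionaries are changeable, pairs can be added or overwritten after creation. The sketch below uses a shortened version of the dictionary; the extra student 'laura' and the names `has_timmy` and `timmy_score` are hypothetical.

```python
python_scores = {'bas': '7', 'robert': '8', 'susie': '7'}

# add a new key-value pair ('laura' is a hypothetical extra student)
python_scores['laura'] = '9'

# assigning to an existing key overwrites its value
python_scores['bas'] = '8'

# check whether a key is present before looking it up
has_timmy = 'timmy' in python_scores

# get() returns a default value instead of raising a KeyError for a missing key
timmy_score = python_scores.get('timmy', 'no score')

# items() yields the key-value pairs one by one
for name, score in python_scores.items():
    print(name, score)
```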

#### Slicing with dictionaries...

Does not work: since a dictionary is not a sequence, we cannot slice a dictionary.

However, we can use a selection of keys to retrieve corresponding values.

In [45]:
score_keys = python_scores.keys()

print(score_keys)

dict_keys(['bas', 'robert', 'susie', 'timmy', 'michael', 'richard'])


In [46]:
# create an empty list
keys_selected = []

for key in python_scores:
    if key[0] == 'r':
        keys_selected.append(key)
        
print(keys_selected)

['robert', 'richard']


In [47]:
# to create a list with values corresponding with keys
list_scores = []
for key in keys_selected:
    list_scores.append(python_scores[key])

print(list_scores)

['8', '7']


In [48]:
# alternatively, using list comprehension
list_scores2 = [python_scores[key] for key in keys_selected]
print(list_scores2)

['8', '7']
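
If you want to keep the names together with the scores, a dictionary comprehension is one option. This is a sketch under the same selection rule (names starting with "r"); the name `scores_r` is introduced here for illustration.

```python
python_scores = {'bas': '7', 'robert': '8', 'susie': '7', 'timmy': '6', 'michael': '5', 'richard': '7'}

# build a new dictionary containing only the pairs whose key starts with 'r'
scores_r = {name: score for name, score in python_scores.items() if name.startswith('r')}

print(scores_r)   # {'robert': '8', 'richard': '7'}
```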


### Intermezzo: libraries and functions

Python has standard built-in functions. However, quite often you need a specific function from a specific library.

Libraries need to be installed on your machine before you can use them.

The standard way to install them is with pip, Python's preferred installer program.

To install a library, run the following code from the command line:

`python -m pip install SomeLibrary`

More information: https://docs.python.org/3/installing/index.html

After installation, you need to use the import statement to load a library.

In [49]:
# let's import the library pandas, a Python library which contains many functions for data manipulation
import pandas as pd


In [None]:
# sys.modules is a dictionary of all modules loaded so far;
# we can use it to check whether a specific library was imported
import sys
'pandas' in sys.modules
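
Checking sys.modules only tells you whether a library has already been imported. To check whether a library is installed at all, without importing it, one possible approach is importlib.util.find_spec() from the standard library. The helper names `is_installed` and `is_imported` are hypothetical, and the standard-library module json is used so the sketch runs anywhere.

```python
import importlib.util
import sys


def is_installed(name):
    """Return True if Python can find the library, without importing it."""
    return importlib.util.find_spec(name) is not None


def is_imported(name):
    """Return True if the library has already been imported in this session."""
    return name in sys.modules


print(is_installed("json"))                        # True: part of the standard library
print(is_installed("surely_not_a_real_library"))   # False

import json
print(is_imported("json"))                         # True after the import
```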

## Dataframes

A dataframe is a 2-dimensional labeled data structure with columns of potentially different types. Data is aligned in a tabular fashion, with rows and columns. Rows have indices assigned to them, and columns are identified by labels. You can think of a dataframe as a dictionary of columns that all have the same length.

Dataframes can be created in various ways with pandas:

- way 1: creating a DataFrame from a dictionary of lists
- way 2: creating a DataFrame from a list of dictionaries
- way 3: creating a DataFrame from reading files

### Way 1: Creating a dataframe from a dictionary of lists

This way, the dictionary keys become the column labels and the dictionary values become the data values in the DataFrame.

In [None]:
subjects_scores = {
    'name': ['laura', 'robert', 'susie', 'timmy', 'bas'],
    'subjects': ['maths', 'physics', 'programming', 'chemistry', 'maths'],
    'scores': ['8', '7', '7', '6', '5'],
    'completed': 100        # column specifying how much of the subject is completed,
                            # when using one value (here 100), this value will be assigned to every record
}

df_scores = pd.DataFrame(subjects_scores, columns=['name', 'subjects', 'scores', 'completed'])
print(df_scores)

In [None]:
# to get the number of dimensions of a dataframe
df_scores.ndim


In [None]:
# to get the number of data values across each dimension
df_scores.shape

In [None]:
# to get the number of rows in a DataFrame
df_scores.shape[0]

In [None]:
# to get the number of columns in a DataFrame
df_scores.shape[1]

In [None]:
# to get the number of elements in a DataFrame
df_scores.size

### Way 2: Creating a dataframe from a list of dictionaries
When you have several dictionaries with the same keys, you can put them in a list and pass that list to the DataFrame() constructor of the pandas library. Each dictionary becomes one row of the dataframe.

More info for next time: https://thispointer.com/pandas-create-dataframe-from-list-of-dictionaries/

In [None]:
subjects_scores_list = [
    {'name': 'laura', 'subjects': 'maths', 'scores': 8, 'completed': 100},
    {'name': 'robert', 'subjects': 'physics', 'scores': 7, 'completed': 100},
    {'name': 'susie', 'subjects': 'programming', 'scores': 7, 'completed': 100},
    {'name': 'timmy', 'subjects': 'chemistry', 'scores': 6, 'completed': 100},
    {'name': 'bas', 'subjects': 'maths', 'scores': 5, 'completed': 100}
]

df_scores2 = pd.DataFrame(subjects_scores_list) # when not further specified,
                                                # pandas automatically assigns indices starting from 0
print(df_scores2)

In [None]:
# you can also specify the indices assigned to the various records

df_scores3 = pd.DataFrame(subjects_scores_list, index = ['a', 'b', 'c', 'd', 'e'])
print(df_scores3)

In [None]:
# you can also rearrange columns in your dataframe

df_scores4 = pd.DataFrame(subjects_scores_list, columns = ['name', 'subjects', 'completed', 'scores'])
print(df_scores4)

### Way 3: Creating a dataframe from reading files

A Comma Separated Values (CSV) file is a commonly used format. With the read_csv() function of the pandas library, you can read such a file directly into a dataframe.

Further reading: https://realpython.com/pandas-read-write-files/#read-a-csv-file
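
As a minimal sketch that does not depend on any file being present, read_csv() can also read from a text buffer. The CSV content below is hypothetical and only mimics the scores example; in practice you would pass a file path instead of `io.StringIO(...)`.

```python
import io

import pandas as pd

# hypothetical CSV content: the first line holds the column labels
csv_text = "name,subjects,scores\nlaura,maths,8\nrobert,physics,7\n"

# io.StringIO makes the string behave like an open file
df_from_csv = pd.read_csv(io.StringIO(csv_text))

print(df_from_csv)
print(df_from_csv.shape)   # (2, 3)
```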



### Example mtcars

In the following slides, we will use the mtcars dataset to demonstrate functions.

Use the pathlib module so that the Python code works on Windows as well as on Mac/Linux.

(https://medium.com/@ageitgey/python-3-quick-tip-the-easy-way-to-deal-with-file-paths-on-windows-mac-and-linux-11a072b58d5f)
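
A short sketch of what pathlib does: the `/` operator joins path parts with the correct separator for the operating system, and the resulting object exposes useful attributes. The folder name "datasets" is hypothetical.

```python
from pathlib import Path

data_folder = Path("datasets")           # hypothetical folder name
file_path = data_folder / "mtcars.csv"   # "/" joins the parts of the path

print(file_path.name)       # mtcars.csv
print(file_path.suffix)     # .csv
print(file_path.parent)     # datasets
print(file_path.exists())   # whether the file is present on this machine
```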


In [None]:
from pathlib import Path

# you can specify your own data_folder 
data_folder = Path("../../programma/datasets/")

file_to_open = data_folder / "mtcars.csv"

mtcars = pd.read_csv(file_to_open)

mtcars.head()

In [None]:
list(mtcars.columns)

In [None]:
# Edit element of column header, replace "Unnamed: 0" with "brand"
mtcars = mtcars.rename(columns={"Unnamed: 0":"brand"})

mtcars.head()

In [None]:
# to check column names
print(mtcars.columns)

In [None]:
list(mtcars.columns)

In [None]:
mtcars.shape

In [None]:
# mtcars has 32 rows
mtcars.shape[0]

In [None]:
# ... and 12 columns. 
mtcars.shape[1]

In [None]:
# please note that the first column of mtcars is the "brand" column, not one containing indices
mtcars.iloc[:,0]

In [None]:
# display the structure of a dataframe
mtcars.info()

### Slicing data frames by row

In [None]:
# Selecting a single row by its integer position
mtcars.iloc[23]

In [None]:
# Slicing using brand name
mtcars[mtcars.brand == "Camaro Z28"]

### Slicing data frames by row

Index by **logical expression**, for instance all cars with manual transmission (in mtcars, am is 0 for automatic and 1 for manual).

In [None]:
mtcars[mtcars.am == 1]

Other comparison operators are <, >, <=, >= and !=; conditions can be combined with | (or) and & (and). **Try them!**
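
How a logical expression selects rows can be sketched on a tiny dataframe. The data below is made up for illustration and is not taken from mtcars; the names `cars`, `mask`, `strong_or_three` and `not_five` are hypothetical.

```python
import pandas as pd

# hypothetical miniature dataset
cars = pd.DataFrame({
    "brand": ["car_a", "car_b", "car_c", "car_d"],
    "hp": [90, 150, 200, 110],
    "gear": [4, 5, 5, 3],
})

# a comparison on a column yields one True/False value per row
mask = cars.hp > 100
print(mask.tolist())   # [False, True, True, True]

# rows where the mask is True are kept
print(cars[mask])

# | combines two conditions with "or"; each condition goes in parentheses
strong_or_three = cars[(cars.hp > 180) | (cars.gear == 3)]

# != selects the rows where the value differs
not_five = cars[cars.gear != 5]
```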

### Slicing data frames by column

To select only one column, use square brackets

In [None]:
mtcars["hp"]

### Slicing data frames by column

The same can be achieved by using the index for column hp. Since this is the fifth column, the corresponding index is 4.

In [None]:
mtcars.iloc[:,4]

### Exercise: slicing data frames by row

Select only rows of cars with 100+ horsepower and 5 gears

### Solution: slicing data frames by row

Select only rows of cars with 100+ horsepower and 5 gears

In [None]:
# the parentheses are needed because & binds more tightly than the comparison operators
mtcars[(mtcars.hp >= 100) & (mtcars.gear == 5)]

### Help function in Python

Python's help() function shows the documentation of the object you pass to it. Called without arguments, it starts the interactive built-in help system.

In [None]:
help(min)

### Loops and functions

- For loops
- While loops, break
- Basic functions
- Create your own functions

### For loop

Use a For-loop when the number of iterations is predefined.
Also be aware of the indentation.

Syntax in Python:

In [None]:
for i in range(10):
    print(i)    # this is an example of a statement
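
A for-loop is not limited to range(): it can walk over any sequence, such as a list. The sketch below also uses the built-in enumerate() to get the position alongside each item, and accumulates a sum. The list `animals` and the other names are hypothetical.

```python
animals = ['dogs', 'cats', 'snakes']   # hypothetical list

# loop directly over the items of a list
for animal in animals:
    print(animal)

# enumerate() yields the index together with each item
labels = []
for position, animal in enumerate(animals):
    labels.append(str(position) + ": " + animal)
print(labels)   # ['0: dogs', '1: cats', '2: snakes']

# range(1, 6) produces 1, 2, 3, 4, 5 (the stop value is not included)
total = 0
for i in range(1, 6):
    total += i
print(total)    # 15
```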

### While loop

Use a While-loop when the number of iterations is not predefined

In [None]:
number = 0
while number**2 < 39:
    print(number)
    number += 1    # this means you add 1 to the value of number

Also note that the print() call must be inside the loop if you want to print the number during every iteration. Compare the following two code blocks.

In [None]:
number = 0
while number**2 < 39:
    number += 1
    print(number)    

In [None]:
number = 0
while number**2 < 39:
    number += 1
print(number)  
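
The break statement leaves a loop immediately. As a sketch, the loop above can be rewritten with an endless `while True:` loop that stops as soon as the square reaches 39; this is an alternative formulation for illustration, not a recommendation over the plain condition.

```python
number = 0
while True:                # would run forever without the break below
    if number**2 >= 39:    # stop condition
        break              # leave the loop immediately
    number += 1

print(number)   # 7
```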

### Functions

There are numerous basic built-in functions implemented in Python. You can even get more functions by installing more Python libraries. 

Some built-in functions we have seen: len(), type(), print(), range().

A list of built-in functions can be found [here](https://docs.python.org/3/library/functions.html)

### Functions

It is also possible in Python to create your own function. 

First you need to define your function (beware of the indentation!):

In [None]:
# to concatenate with "+", both parts need to be strings, hence the str() conversion

def printMaxAndMin(vData):
    print("The maximum is: " + str(max(vData)))
    print("The minimum is: " + str(min(vData)))
   

In [None]:
# create data as input for defined function printMaxAndMin
import numpy as np
# we use the random.normal() function in numpy to generate 100 numbers from a normal distribution with mean = 50 and sd = 10
# we convert the outcome into a Pandas series object so we can use the describe() function within the Pandas library
var1 = pd.Series(np.random.normal(loc = 50, scale = 10, size = 100))
var1.describe()

In [None]:
# let's invoke our function now!
printMaxAndMin(var1)
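
A function can also hand its result back with the return statement instead of printing it, so the values can be reused later. The function name `get_max_and_min` and the input list are hypothetical; this is a sketch of the idea, not a replacement for printMaxAndMin.

```python
def get_max_and_min(values):
    """Return the maximum and the minimum of a collection as a tuple."""
    return max(values), min(values)


# the two returned values are unpacked into two variables
highest, lowest = get_max_and_min([3, 8, -2, 5])

print("The maximum is: " + str(highest))   # The maximum is: 8
print("The minimum is: " + str(lowest))    # The minimum is: -2
```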

### String format method

With this method, you can insert values into placeholders ({}) in a string.

In [None]:
"For {}, the maximum group size is {} students".format("CPP DS&AI", 15)

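An alternative to format() is the f-string (available since Python 3.6), where the variable names go directly inside the curly braces. The sketch below compares both; the variables `course`, `group_size` and `average` are hypothetical.

```python
course = "CPP DS&AI"
group_size = 15

# the format() method fills the placeholders in order
with_format = "For {}, the maximum group size is {} students".format(course, group_size)

# an f-string (note the f before the quote) puts the names inside the braces
with_fstring = f"For {course}, the maximum group size is {group_size} students"

print(with_format == with_fstring)   # True

# a format specification after the colon controls the layout, here 2 decimals
average = 7.256   # hypothetical value
print(f"The average score is {average:.2f}")   # The average score is 7.26
```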
### Exercise functions

Now create your own function! Make a function that prints the mpg and hp for a given car in the mtcars dataset.

Advanced: Then use a for-statement to print the information for all cars.

Tip: use mtcars.brand[0] to get the name of the first car.

In [None]:
mtcars.brand[0]

### Solution functions (I)

In [None]:
# define the function
def printCarInformation(car):
    str_combined = "The {} car drives {} miles per gallon and has {} hp".format(mtcars.brand[car], mtcars.mpg[car], mtcars.hp[car])
    print(str_combined)

In [None]:
# invoke function for the first car
printCarInformation(0)

### Solution functions (II)

In [None]:
# invoke function for all cars
for i in range(len(mtcars)):
    printCarInformation(i)