<a href="https://colab.research.google.com/github/WahlerP/csfundamentals-hsg/blob/master/Data_Science.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **DS - Basics**

In this notebook we will explore and expand on the Data Science part of the Fundamentals of Data Science course @ HSG. We will look at the following:



> Recap on important data types


> Correct use of functions


> Numerical Python


> Functional Programming









# Data Science Basics

Let's start our introduction to Data Science by taking a look at the most useful methods and data types for Data Science. After that we will introduce the NumPy library as well as the basics of functional programming.
<br><br>
Note: this is not an introduction to Python programming. If you are not yet familiar with the fundamentals, refer to the Coding Crashcourse on our [GitHub page](https://github.com/WahlerP/csfundamentals-hsg).

## **Data types in Python**

Every value in Python has a data type. Since everything in Python is an object, data types are actually classes and variables are instances (objects) of these classes.

There are various data types in Python. Some of the important types are listed below.



### Python Numbers

Integers and floating point numbers fall under the *Python numbers* category. They are defined as `int` and `float` in Python.

We can use the **`type()`** function to find out which class a variable or a value belongs to and the **`isinstance()`** function to check if an object belongs to a particular class.

In [0]:
a = 5
print(a, "is of type", type(a))

a = 2.0
print(a, "is of type", type(a))


5 is of type <class 'int'>
2.0 is of type <class 'float'>
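
The cell above only uses `type()`. As a small complementary sketch, `isinstance()` answers a yes/no question and also accepts a tuple of classes. The variable names are chosen for illustration only.

```python
a = 5

# isinstance() returns True or False instead of the class itself
is_int = isinstance(a, int)
is_float = isinstance(a, float)

# a tuple of classes checks "is it any of these?"
is_number = isinstance(a, (int, float))

print(is_int, is_float, is_number)

# bool is a subclass of int, so this check is True as well
print(isinstance(True, int))
```

This makes `isinstance()` the more convenient choice when a function should accept several numeric types.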


Integers can be of any length; they are only limited by the available memory.

A floating point number is usually stored as a 64-bit double and is accurate to about 15 to 17 significant digits. A decimal point distinguishes the two types: `1` is an integer, `1.0` is a floating point number.

In [0]:
a = 1234567890123456789
print(a)

1234567890123456789


In [0]:
b = 0.1234567890123456789
print(b)

0.12345678901234568


Notice that the `float` variable `b` was rounded to the nearest value a `float` can represent, so the trailing digits are lost.
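
The limited precision of floats also affects comparisons. The following minimal sketch shows the classic `0.1 + 0.2` case and one common way to compare floats with a tolerance, using `math.isclose` from the standard library.

```python
import math

total = 0.1 + 0.2

# the sum is not stored as exactly 0.3
print(total)
print(total == 0.3)               # False

# compare with a tolerance instead of ==
print(math.isclose(total, 0.3))   # True
```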

### Python Strings

A string is a sequence of text characters. We can use single quotes or double quotes to represent strings. Multi-line strings can be denoted using triple quotes, `'''` or `"""`.

In [0]:
s = "This is a string"
s = '''a multiline
dumdiedum lalala '''

As with lists and tuples, the indexing and slicing operator `[]` can be used to access specific characters within a string. Strings are immutable.

In [0]:
s = 'Hello world!'

# s[4] = 'o'
print("s[4] = ", s[4])

# s[6:11] = 'world'
print("s[6:11] = ", s[6:11])

s[4] =  o
s[6:11] =  world


In [0]:
# Generates error
# Strings are immutable in Python
s[5] = 'd'

TypeError: 'str' object does not support item assignment

### None type



`None` is a special value with its own data type, `NoneType`. It stands for the absence of a value and is used to indicate that something (e.g. a variable or a parameter) has not been set.

In [0]:
nothing = None
print(type(nothing))

It is important to realize that `None` is, counterintuitively, not the same as actual "nothingness". Let's look at an example:



In [0]:
none_list = [None]
empty_list = []

print(none_list == empty_list) # check whether the two lists are equal

In our example we can see that the two lists are not equal. This is due to the fact that `None` serves as a sort of placeholder and, as such, is still considered an element. Our `empty_list` is a list with no elements, whereas our `none_list` contains one element, which happens to be `None`.

In [0]:
# let's find out the number of elements within both lists
print(len(none_list))
print(len(empty_list))
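
Because `None` is a single, unique object, the usual way to test for it is `is None` rather than `==` or a truthiness check. The sketch below uses a hypothetical helper `describe` to show that "empty" values such as `0` or `[]` are not the same as `None`.

```python
def describe(value=None):
    # hypothetical helper: distinguishes "nothing passed" from "empty" values
    if value is None:
        return "no value given"
    return "got " + repr(value)

print(describe())     # no value given
print(describe(0))    # got 0
print(describe([]))   # got []
```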

### Iterables

An iterable is any Python object capable of returning its members one at a time, permitting it to be iterated over. Familiar examples of iterables include lists, tuples, sets and dictionaries.

In Data Science, where we often have to access elements individually, iterables are highly useful.

**Python List**

A `list` is an ordered sequence of items. It is one of the most used data types in Python and is very flexible. The items in a list do not need to be of the same type.

Declaring a list is pretty straightforward. Items separated by commas are enclosed within brackets `[` `]`. We can also use `list()` to create a list from another iterable.

In [0]:
a = [1, 2.2, 'python'] # or a = list((1, 2.2, 'python'))

We can use the indexing and slicing operator `[]` to extract an item or a range of items from a list. Indexing starts at 0 in Python.

In [0]:
lst = [5,10,15,20,25,30,35,40]

# lst[2] = 15
print("lst[2] = ", lst[2])

# lst[0:3] = [5, 10, 15]
print("lst[0:3] = ", lst[0:3])

# lst[5:] = [30, 35, 40]
print("lst[5:] = ", lst[5:])

Lists are mutable, meaning that the values of a list's elements can be altered.

In [0]:
lst = [1,2,3]
lst[2]=4
print(lst)

The `.append()` method adds a single element to the end of an existing list. If we pass a list to `.append()`, it is added as one nested element.

In [0]:
lst = [1,2,3]
lst.append("test")
print(lst)
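
To make the difference concrete, here is a short sketch comparing `.append()` with the related list method `.extend()`, which adds each element of an iterable individually. The variable names are illustrative.

```python
nested = [1, 2, 3]
nested.append([4, 5])    # the whole list becomes ONE new element
print(nested)            # [1, 2, 3, [4, 5]]

flat = [1, 2, 3]
flat.extend([4, 5])      # each element is added separately
print(flat)              # [1, 2, 3, 4, 5]
```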


**Python Tuple**

A tuple is an ordered sequence of items, just like a list. The only difference is that tuples are immutable: once created, a tuple cannot be modified.

Tuples are used to write-protect data and are usually slightly faster than lists because they cannot change dynamically.

A tuple is defined within parentheses `()` where items are separated by commas.

In [0]:
t = (5,'program', 1+3j) 
print(t)

We can use the indexing and slicing operator `[]` to extract items, but we cannot change their values.

In [0]:
t = (5,'program', 1+3j)

# t[1] = 'program'
print("t[1] = ", t[1])

# t[0:3] = (5, 'program', (1+3j))
print("t[0:3] = ", t[0:3])

In [0]:
# Generates error
# Tuples are immutable
t[0] = 10

**Python Set**

A set is defined by values separated by commas inside braces `{` `}`. Items in a set are **unordered** and **unique**.

In [0]:
a = {5,2,3,1,4}

# printing set variable
print("a = ", a)

# data type of variable a
print(type(a))

We can perform set operations like union and intersection on two sets. Sets contain only unique values, so duplicates are eliminated.

In [0]:
a = {1,2,2,3,3,3}
print(a)
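
The set operations mentioned above are available as operators. A minimal sketch with two illustrative sets:

```python
left = {1, 2, 3}
right = {3, 4, 5}

union = left | right          # elements that are in either set
intersection = left & right   # elements that are in both sets
difference = left - right     # elements only in the left set

print(union)
print(intersection)
print(difference)
```

The same operations also exist as methods, e.g. `left.union(right)`.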

Since sets are unordered collections, indexing has no meaning. Hence the indexing and slicing operator `[]` does not work.

In [0]:
# Generates error
# Sets do not support indexing
a = {1,2,3}
print(a[1])

**Python Dictionary**

A `dictionary` is a collection of key-value pairs. Since Python 3.7, dictionaries preserve the order in which items were inserted.

Dictionaries are optimized for retrieving data: looking up a value by its key is fast, even for large amounts of data. We must know the key to retrieve the value.

In Python, dictionaries are defined within braces `{` `}` with each item being a pair in the form `key:value`. Values can be of any type; keys must be hashable (e.g. numbers, strings or tuples).

In [0]:
d = {1:'value','key':2}
type(d)

We use a key to retrieve the respective value, but not the other way around.

In [0]:
d = {1:'value','key':2}
print(type(d))

print("d[1] = ", d[1])

print("d['key'] = ", d['key'])

In [0]:
# Generates a KeyError: 2 is a value in d, not a key
print("d[2] = ", d[2])
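
If we are not sure whether a key exists, a `KeyError` can be avoided with `dict.get()` or the `in` operator. The sketch below also shows one possible way to search for the key(s) belonging to a value, which requires scanning all items.

```python
d = {1: 'value', 'key': 2}

# .get() returns None (or a default) instead of raising a KeyError
print(d.get(2))              # None
print(d.get(2, 'missing'))   # missing

# "in" checks the keys, not the values
print(2 in d)                # False
print('key' in d)            # True

# reverse lookup: scan all pairs for a given value
keys_for_2 = [k for k, v in d.items() if v == 2]
print(keys_for_2)            # ['key']
```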

### Casting

We can convert between different data types by using different type conversion functions like `int()`, `float()`, `str()` etc.

In [0]:
float(5)

5.0

Conversion from float to int will truncate the value (make it closer to zero).

In [0]:
int(10.6)

10

In [0]:
int(-10.6)

-10

When converting to and from strings, the values must be compatible.

In [0]:
float('2.5')

In [0]:
str(25)

In [0]:
# Generates error: '1p' is not a valid integer
int('1p')
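
When the input may be invalid, the `ValueError` raised by `int()` can be caught. The helper `to_int` below is hypothetical and only sketches one possible approach.

```python
def to_int(text, default=None):
    # hypothetical helper: returns `default` if the text is not a valid integer
    try:
        return int(text)
    except ValueError:
        return default

print(to_int('25'))      # 25
print(to_int('1p'))      # None
print(to_int('1p', 0))   # 0
```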

We can even convert one sequence to another.

In [0]:
set([1,2,3])

In [0]:
tuple({5,6,7})

In [0]:
list('hello')

To convert to a dictionary, each element must be a pair.

In [0]:
dict([[1,2],[3,4]])

In [0]:
dict([(3,26),(4,44)])

### Slicing Data Types

When working with sequences we often want to access certain elements without having to iterate over the whole sequence. Using the slicing operator `[:]` we are able to extract the indicated parts of a sequence.

Slicing can be used on `lists`, `tuples` and `strings`. It does not work on `sets`, since they are unordered.

**Syntax**:

**`sequence[start:end:step]`**



*   The `start` parameter indicates the index of the first element that should be accessed. If no colon `:` is provided, only the element at `start` will be accessed.
*   The `end` parameter indicates the index of the element **after** the last element that is accessed. If it is not given, the slice runs to the end of the sequence and includes the last element.
*   The `step` parameter can be used to specify that only every n-th element should be accessed between `start` and `end`.

Keep in mind that Python uses **zero-based indexing**, i.e. an index of 0 stands for the first element of a sequence.





In [0]:
seq = [0,1,2,3,4,5]

# access the first element of the sequence
seq[0]

0

In [0]:
# access the first three elements of the sequence
seq[0:3]

[0, 1, 2]

In [0]:
# access all elements except the first
seq[1:]

[1, 2, 3, 4, 5]

In [0]:
# only access every second element (via step)
seq[::2]

[0, 2, 4]

**Negative Slicing**

In case we want to access the elements in our sequence from the end, we use negative indexing.

*   A negative `start` and `end` parameter is counted from the end of the sequence.
*   A negative `step` parameter indicates that the step goes backwards.



In [0]:
# access last element of our sequence
seq = [0,1,2,3,4,5]

seq[-1]

5

In [0]:
# access our sequence except for the last element
seq[0:-1]

[0, 1, 2, 3, 4]

In [0]:
# reverse our sequence
seq[-1::-1]

[5, 4, 3, 2, 1, 0]
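
The same slicing syntax works on strings and tuples. A brief sketch with illustrative values:

```python
word = "Data Science"
first_word = word[:4]       # characters 0 to 3
second_word = word[5:]      # from index 5 to the end
reversed_word = word[::-1]  # negative step walks backwards

print(first_word)      # Data
print(second_word)     # Science
print(reversed_word)   # ecneicS ataD

point = (10, 20, 30, 40)
print(point[1:3])      # (20, 30)
print(point[-2:])      # (30, 40)
```

Note that slicing a string returns a string and slicing a tuple returns a tuple.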

## **Functions**

As we have seen previously, functions can be created as follows:

In [0]:
def function(a,b):
  return a+b

Calling the function is also rather easy:

In [0]:
function(1,2)

3

It is also possible to add an optional parameter with a default value to our functions.

In [0]:
def add(a, b, c=None):
  if c is None:
    return a + b
  else:
    return a + b + c

In [0]:
# leaving out the parameter c makes it default to None
add(1,2)

3

In [0]:
add(1,2,3)

6

We can also store functions in variables to re-use them later on.

In [0]:
def test_function(parameter):
  return "The parameter is " + str(parameter) 

In [0]:
#store function in variable and call it

para = test_function

para(123)

'The parameter is 123'
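
Since functions are ordinary objects, they can also be passed to other functions as arguments. This idea is used heavily in the functional programming part. The names `apply_twice` and `add_three` below are hypothetical and serve only as an illustration.

```python
def apply_twice(func, value):
    # hypothetical helper: calls the given function two times in a row
    return func(func(value))

def add_three(x):
    return x + 3

print(apply_twice(add_three, 10))        # 16
print(apply_twice(str.strip, "  hi  "))  # hi
```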

## **NumPy (Numerical Python)**

NumPy is an important library which provides fundamental mathematical features such as arrays and matrices. Many other libraries (pandas, scikit-learn, keras, torchvision) build on top of NumPy.

Let's start by importing NumPy. Conventionally we abbreviate NumPy with `np`.

In [0]:
import numpy as np

### Creating Arrays

A NumPy array is an iterable data type which is in essence very similar to a list. However, all elements within an array must be of the same type!

Due to the many possible applications of arrays, they are one of the most used objects in the NumPy library.

We can create an array by defining a list and passing this list into NumPy's `array` function.

In [0]:
mylist = [1, 2, 3]
x = np.array(mylist)
x

array([1, 2, 3])

We can also directly pass a list into the `array` function:

In [0]:
y = np.array([4, 5, 6])
y

array([4, 5, 6])

NumPy can also create multidimensional arrays. They correspond to nested lists and are often used to represent rows and columns in a table.

In [0]:
multi_a = np.array([[7, 8, 9], [10, 11, 12]])
multi_a

array([[ 7,  8,  9],
       [10, 11, 12]])

We can find the number of rows and columns of a (multidimensional) array using the `shape` attribute. For a two-dimensional array it returns `(# rows, # columns)`.

In [0]:
multi_a.shape
# our array has two rows and three columns

(2, 3)
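
Besides `shape`, arrays expose a few other attributes that are handy for a quick inspection. The sketch below also illustrates the "same type" rule: mixing integers and floats yields an array of floats. The variable names are illustrative.

```python
import numpy as np

table = np.array([[7, 8, 9], [10, 11, 12]])

print(table.shape)   # (2, 3): rows, columns
print(table.ndim)    # 2: number of dimensions
print(table.size)    # 6: total number of elements

# all elements share one type: the integers are converted to floats
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)   # float64
```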

In case we do not want to type in every single value of our array, we can use the `arange` function to create an array of evenly spaced values.

**`np.arange(start, stop, step)`**

*   The **start** parameter indicates the starting value of our array. It defaults to 0.
*   The **stop** parameter tells us before which value the array should stop.
*   The **step** parameter defines the interval between the individual elements. It defaults to 1.





In [0]:
n = np.arange(0, 30, 2) # start at 0 count up by 2, stop before 30
n

array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28])

A further option to fill an array with values is `linspace`.
Simply put, `linspace` returns evenly spaced numbers over a specified interval.

While this sounds a lot like `np.arange`, its arguments have a different meaning:

**`np.linspace(start, stop, num)`**


*   `start` indicates the starting value.
*   `stop` indicates the end value, which is included by default.
*   The `num` parameter indicates the number of evenly spaced values you want between `start` and `stop`. It defaults to `50`.




In [0]:
o = np.linspace(0, 4, 9) # return 9 evenly spaced values from 0 to 4
o

array([0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. ])

In [0]:
o = np.linspace(0, 4) # return 50 evenly spaced values from 0 to 4
o
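
To contrast the two functions, the following sketch builds similar arrays once by step size and once by number of values. It also uses `linspace`'s `endpoint` parameter to exclude the end value.

```python
import numpy as np

# arange: we choose the step, the stop value 1 is excluded
by_step = np.arange(0, 1, 0.25)               # 0, 0.25, 0.5, 0.75

# linspace: we choose how many values, the stop value 1 is included
by_count = np.linspace(0, 1, 5)               # 0, 0.25, 0.5, 0.75, 1

# endpoint=False excludes the stop value
half_open = np.linspace(0, 1, 4, endpoint=False)

print(by_step)
print(by_count)
print(half_open)
```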

If we want to combine several arrays in element-wise operations, we generally must ensure that they have the same number of rows and columns.

Let's change the shape of our array using the `reshape` method. Again you can specify the intended number of rows and columns of the desired multidimensional array. Be aware, however, that the new shape must match the number of elements within the array (here 3 × 5 = 15).

In [0]:
n = n.reshape(3, 5) # reshape array to be 3x5
n

array([[ 0,  2,  4,  6,  8],
       [10, 12, 14, 16, 18],
       [20, 22, 24, 26, 28]])

If we want to change the shape of our array in place, i.e. **without assigning the result back to a variable**, we can use `resize` instead of `reshape`. `resize` thus does not return a new array, but applies the change directly to the existing one.

In [0]:
n.resize(5, 3)
n

array([[ 0,  2,  4],
       [ 6,  8, 10],
       [12, 14, 16],
       [18, 20, 22],
       [24, 26, 28]])
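
Two details of `reshape` are worth a quick sketch: one dimension can be given as `-1` so that NumPy infers it, and a shape that does not match the number of elements raises a `ValueError`. The variable names are illustrative.

```python
import numpy as np

values = np.arange(12)          # 12 elements

# -1 lets NumPy work out the missing dimension: 12 / 3 = 4 columns
grid = values.reshape(3, -1)
print(grid.shape)               # (3, 4)

# 5 x 3 = 15 does not match 12 elements
try:
    values.reshape(5, 3)
    reshape_failed = False
except ValueError as err:
    reshape_failed = True
    print("reshape failed:", err)
```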

Numpy also includes the option to create an array with a specific shape and fill this array with either ones (`np.ones`) or zeros (`np.zeros`).

In [0]:
np.ones((3, 2)) # returns a new array with the given shape, filled with ones

array([[1., 1.],
       [1., 1.],
       [1., 1.]])

In [0]:
np.zeros((2, 3)) # returns a new array of given shape and type, filled with zeros.

array([[0., 0., 0.],
       [0., 0., 0.]])

`eye` returns a 2-D array with ones on the diagonal and zeros elsewhere.

In [0]:
np.eye(3)

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

`diag` extracts a diagonal or constructs a diagonal array.

In [0]:
diag_ = np.array([4, 5, 6])
print(diag_)
np.diag(diag_)

[4 5 6]


array([[4, 0, 0],
       [0, 5, 0],
       [0, 0, 6]])

### Mathematical Operations with Arrays

As with vanilla Python, we have several possibilities to apply mathematical operations to our arrays.  `+`, `-`, `*`, `/` and `**` can be used to perform element-wise addition, subtraction, multiplication, division and power.

To be able to do so, several conditions must hold:


*   Array shapes must be identical (generally speaking; NumPy can also *broadcast* compatible shapes, e.g. an array and a single number)
*   Arrays must hold compatible data types (e.g. numbers with numbers, not numbers with strings)



In [0]:
x = np.array([1,2,3])
y = np.array([4,5,6])

print(x + y) # elementwise addition     [1 2 3] + [4 5 6] = [5  7  9]
print(x - y) # elementwise subtraction  [1 2 3] - [4 5 6] = [-3 -3 -3]

[5 7 9]
[-3 -3 -3]


In [0]:
print(x * y) # elementwise multiplication  [1 2 3] * [4 5 6] = [4  10  18]
print(x / y) # elementwise division        [1 2 3] / [4 5 6] = [0.25  0.4  0.5]

[ 4 10 18]
[0.25 0.4  0.5 ]


In [0]:
print(x**2) # elementwise power  [1 2 3] ^2 =  [1 4 9]

[1 4 9]


In [0]:
z = np.array(["1", "2", "3"])

print(x + z) # this will fail since the arrays have incompatible data types (numbers and strings)

UFuncTypeError: ignored
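
The two conditions are less strict than they may look. As a minimal sketch: integers and floats can be mixed, and NumPy's broadcasting allows combining an array with a single number or with an array whose shape is compatible.

```python
import numpy as np

x = np.array([1, 2, 3])

# a single number is applied to every element
scaled = x * 10                                # [10 20 30]

# integers and floats are compatible, the result holds floats
shifted = x + np.array([0.5, 0.5, 0.5])        # [1.5 2.5 3.5]

# a (2, 3) array plus a (3,) array: x is added to every row
matrix = np.array([[1, 2, 3], [4, 5, 6]])
row_added = matrix + x

print(scaled)
print(shifted)
print(row_added)
```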

**Dot Product:**  

As you may have guessed, arrays are often used to represent vectors in linear algebra. We can therefore also compute the dot product of two vectors.

$ \begin{bmatrix}x_1 & x_2 & x_3\end{bmatrix}
\cdot
\begin{bmatrix}y_1 \\ y_2 \\ y_3\end{bmatrix}
= x_1 y_1 + x_2 y_2 + x_3 y_3$

In [0]:
print(x)
print(y)
x.dot(y) # dot product  1*4 + 2*5 + 3*6

[1 2 3]
[4 5 6]


32
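
The dot product can be written in several equivalent ways. The sketch below compares the method call with `np.dot`, the `@` operator and a plain Python sum, and applies `@` to a small illustrative matrix as well.

```python
import numpy as np

x = np.array([1, 2, 3])
y = np.array([4, 5, 6])

via_function = np.dot(x, y)
via_operator = x @ y
via_python = sum(a * b for a, b in zip(x, y))   # 1*4 + 2*5 + 3*6

print(via_function, via_operator, via_python)   # 32 32 32

# @ also multiplies a matrix with a vector (one dot product per row)
A = np.array([[1, 0, 0],
              [0, 2, 0]])
print(A @ x)                                    # [1 4]
```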

## **Functional Programming**

Functional programming decomposes a problem into a set of functions. Ideally, functions only take inputs and produce outputs, and don’t have any internal state that affects the output produced for a given input.

### Lambda

Similar to `def`, a `lambda` expression creates a function, which can be called later. However, instead of assigning it to a name (as `def` does), `lambda` directly returns the function. Lambdas are thus also called **anonymous functions**.

Let's see how that difference looks in practice.

In [0]:
def func(x):    # def directly assigns the function to the name "func"
  return x ** 3
print(func(5))

 
test = lambda x: x ** 3 # lambda returns the result directly, we do not have to write "return"
test(5)

125


125

Ok, got it!
Let's see how we can build a lambda function.

**Syntax:**

**`lambda [arg1, arg2, ..., argi] : [expression]`**


*   Define the function by starting with the keyword `lambda`
*   This is followed by one or several arguments, which are separated by commas. The arguments must be placed in front of the colon `:`
*   After the colon, we specify the expression whose value is returned by the function







In [0]:
add = lambda x,y:x+y # the function takes two arguments and adds them together
add(1,2)

3

So far, we have always assigned our lambda function to a variable. In reality, however, this is bad practice (as you could just create a normal `def` function instead).

Lambdas are useful when you need a throwaway function, i.e. you do not intend to re-use the respective operation.

### Sorting with Lambdas

Sorting is a common example of when lambda functions are useful.
Generally, we can sort an iterable via the `sorted` function.

In [0]:
lst = [3,2,4,6,1]
sorted(lst)

[1, 2, 3, 4, 6]

Let's say you now want to sort a list of tuples `lst = [("a", 33),("b",44),("c", 12)]` according to the second element within every tuple.

How do we tell `sorted` what to sort by? Let's have a look at which parameters `sorted` takes.

In [0]:
#run this to get further infos on the sorted method

sorted?

We can see that `sorted` takes a "custom key function to customize the sort order". Let's use a lambda function for this key function.

In [0]:
lst = [("a", 33),("b",44),("c", 12)]
print(lst)
lst = sorted(lst, key=lambda tuple_:tuple_[1]) # we tell the function to sort by the second element of each tuple
print(lst)

[('a', 33), ('b', 44), ('c', 12)]
[('c', 12), ('a', 33), ('b', 44)]
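
The `key` function combines well with other options. As a sketch with an illustrative list `people`: `reverse=True` sorts in descending order, a tuple as key sorts by several criteria, and `max` accepts the same kind of `key` function.

```python
people = [("a", 33), ("b", 44), ("c", 12), ("d", 33)]

# descending by the second element
descending = sorted(people, key=lambda p: p[1], reverse=True)
print(descending)   # [('b', 44), ('a', 33), ('d', 33), ('c', 12)]

# sort by the number first, then by the letter to break ties
by_two_keys = sorted(people, key=lambda p: (p[1], p[0]))
print(by_two_keys)  # [('c', 12), ('a', 33), ('d', 33), ('b', 44)]

# max() and min() take a key function as well
largest = max(people, key=lambda p: p[1])
print(largest)      # ('b', 44)
```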


### List Comprehension

A further fundamental aspect of functional programming is the use of list comprehensions.

So far, we have created and filled lists using `for` loops:

In [0]:
inp = "hello"
outp = []

for i in inp:
  outp.append(i)

print(outp)

['h', 'e', 'l', 'l', 'o']


List comprehensions accomplish the same but in a more concise manner. 

**`[(expression) for (variable) in (iterable) (optional if)]`**

They consist of brackets containing an expression followed by a for clause, then
zero or more for or if clauses. The expressions can be anything, meaning you can
put in all kinds of objects in lists.

The result will be a new list resulting from evaluating the expression in the
context of the for and if clauses which follow it. 

The list comprehension always returns a result list. 

In case you are still uncertain on how you would go about building a list comprehension, make sure to read this helpful [article](https://treyhunner.com/2015/12/python-list-comprehensions-now-in-color/) on the topic.

In [0]:
# let's re-write the example above:
inp = "hello"

outp = [i for i in inp]
print(outp)

['h', 'e', 'l', 'l', 'o']

As the syntax above suggests, we can also include if-statements as well as nested for-loops in our list comprehensions.

In [0]:
# let's look at some examples

a = [x for y in range(5) for x in range(y)]
b = [x if x < 3 else "NO" for x in [1, 2, 3, 4, 5]]
c = [x for x in [y.lower() for y in "HELLO"]]
d = [number for number in range(0,100) if number % 2 == 0] # collects all even numbers from 0 to 98

print(a, "\n", b, "\n", c, "\n", d)

[0, 0, 1, 0, 1, 2, 0, 1, 2, 3] 
 [1, 2, 'NO', 'NO', 'NO'] 
 ['h', 'e', 'l', 'l', 'o'] 
 [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66, 68, 70, 72, 74, 76, 78, 80, 82, 84, 86, 88, 90, 92, 94, 96, 98]


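We can also apply the intuition behind list comprehensions to sets and dictionaries by swapping the `[]` for `{}`. Note that swapping in `()` does not create a tuple but a generator expression, which produces its values lazily; wrap it in `tuple()` to get a tuple.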
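
A short sketch of what the different brackets produce. The variable names are illustrative.

```python
# braces with single values: a set comprehension (duplicates collapse)
squares_set = {x * x for x in [-2, -1, 0, 1, 2]}
print(squares_set == {0, 1, 4})      # True

# parentheses: a generator expression, not a tuple
squares_gen = (x * x for x in range(4))
print(type(squares_gen).__name__)    # generator

# tuple() consumes the generator expression and builds a tuple
squares_tuple = tuple(x * x for x in range(4))
print(squares_tuple)                 # (0, 1, 4, 9)

# generator expressions can be passed directly to functions like sum()
total = sum(x * x for x in range(4))
print(total)                         # 14
```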

### Dictionary Comprehensions

In [0]:
# create a dictionary the standard way

dct = {}
lst = [("Hello", "World"), ("Goodbye", "Life")]

for i, z in lst:
  dct[i] = z

print(dct)

{'Hello': 'World', 'Goodbye': 'Life'}


In [0]:
# let's do the same using a dictionary comprehension

dct = {i: z for i, z in [("Hello", "World"), ("Goodbye", "Life")]}
print(dct)

{'Hello': 'World', 'Goodbye': 'Life'}
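
Dictionary comprehensions support the same optional `if` clause as list comprehensions. The sketch below uses illustrative data and the built-in `zip` to pair up two lists, then filters and inverts the resulting dictionary.

```python
names = ["a", "b", "c"]
scores = [33, 44, 12]

# pair up the two lists and build a dictionary from the pairs
score_by_name = {name: score for name, score in zip(names, scores)}
print(score_by_name)   # {'a': 33, 'b': 44, 'c': 12}

# keep only the entries that satisfy a condition
passed = {name: score for name, score in score_by_name.items() if score >= 30}
print(passed)          # {'a': 33, 'b': 44}

# swap keys and values (works here because all scores are unique)
inverted = {score: name for name, score in score_by_name.items()}
print(inverted)        # {33: 'a', 44: 'b', 12: 'c'}
```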


### Python map()

The Python `map()` function applies a given function to each item of an iterable (list, tuple etc.) and returns an iterator over the results. To obtain a list, we wrap the call in `list()`. Often, `map` is used in combination with a lambda function.

The syntax of map() is:

**`map(function, iterable, ...)`**

In [0]:
lst = [1,2,3,4,5]

# let's multiply each element of the list with itself and return a list with the outputs

lst = list(map(lambda x: x*x, lst)) # lambda takes every element of the iterable "lst" as an input x and multiplies it with itself
print(lst)

[1, 4, 9, 16, 25]


Task: look at the following example and try guessing the output

In [0]:
x = ["abc", "cde", "f", "gh"]
print(list(map(list, x)))

[['a', 'b', 'c'], ['c', 'd', 'e'], ['f'], ['g', 'h']]
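
The `...` in the syntax means that `map()` accepts several iterables; the function then receives one element from each. The sketch below, with illustrative data, also shows that the returned iterator is lazy and can only be consumed once.

```python
prices = [10, 20, 30]
quantities = [1, 2, 3]

# the lambda receives one element of each list per call
totals = list(map(lambda p, q: p * q, prices, quantities))
print(totals)         # [10, 40, 90]

# map() returns an iterator: values are produced on demand
lazy = map(str, prices)
first = next(lazy)    # '10'
rest = list(lazy)     # ['20', '30']
again = list(lazy)    # [] because the iterator is exhausted
print(first, rest, again)
```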


### Python filter()

The `filter()` function constructs an iterator from those elements of an iterable for which a function returns true.

In simple words, `filter()` tests each element of the given iterable with the help of a function and keeps only the elements that pass the test.

The syntax of `filter()` is:

**`filter(function, iterable)`**

The following example checks whether each number in range(10) can be divided by 2 (i.e. is even) and returns the even numbers in a list.

In [0]:
out = filter(lambda k: k%2 == 0, range(10))
print(list(out))

[0, 2, 4, 6, 8]


In [0]:
# again try guessing what the following code snippet accomplishes
x = ["abc", "cde", "f", "gh"]
print(list(filter(lambda k: len(k) > 2, x)))

['abc', 'cde']
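
A `filter()` call with a lambda can usually be written as a list comprehension with an `if` clause as well. The sketch below compares both and additionally shows that passing `None` as the function keeps only the truthy elements. The data is illustrative.

```python
words = ["abc", "cde", "f", "gh"]

with_filter = list(filter(lambda w: len(w) > 2, words))
with_comprehension = [w for w in words if len(w) > 2]
print(with_filter == with_comprehension)   # True

# filter(None, ...) drops all falsy values (0, "", None, empty lists, ...)
values = [0, 1, "", "text", None, [], [1]]
truthy = list(filter(None, values))
print(truthy)                              # [1, 'text', [1]]
```

Which form to use is largely a matter of readability.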


### Python reduce()

The `reduce()` function applies a function of two arguments cumulatively to the elements of an iterable, from left to right, so that the iterable is reduced to a single value.

The syntax looks like this:

**`reduce(function, iterable, initializer)`**

The function is first called with the first two elements; its result is then combined with the third element, and so on. If an initializer is provided (optional), it is used as the first value before the values of the iterable.

Note: we need to import `reduce` from the module `functools` to be able to use it.

In [0]:
from functools import reduce

In [0]:
lst = [0,1,2,3,4]
print(reduce(lambda acc, y: acc + y, lst, 10)) # acc holds the running total

# start with 10, then add 0, 1, 2, 3 and 4

20


In [0]:
#guess the output
x = ["abc", "cde", "f", "gh"]
print(reduce(lambda x,y: x+y, x, "HELLO"))

HELLOabccdefgh
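
To see how `reduce()` works step by step, the following sketch records every call in a list. The helper `add_and_log` is hypothetical. The sketch also shows a call without an initializer and the role of the initializer for an empty iterable.

```python
from functools import reduce

numbers = [1, 2, 3, 4]

# without an initializer, the first element is the starting value
product = reduce(lambda acc, n: acc * n, numbers)
print(product)   # 24

# record every (accumulator, element) pair that reduce passes in
steps = []

def add_and_log(acc, n):
    # hypothetical helper: adds two values and remembers the call
    steps.append((acc, n))
    return acc + n

total = reduce(add_and_log, numbers, 10)
print(steps)     # [(10, 1), (11, 2), (13, 3), (16, 4)]
print(total)     # 20

# with an empty iterable, the initializer is returned unchanged
empty_total = reduce(lambda acc, n: acc + n, [], 0)
print(empty_total)   # 0
```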
