# Data
Without data there is not much computing to be done. All programs take data as input (from a file, a database, the keyboard, a GUI) and most of them also generate data as output.  
In essence, there are only two basic kinds of data: text and numbers. Data can be aggregated in more complex structures such as collections and objects.

In this chapter we'll explore the different base data types of Python, both **scalar** and **collection**.  
 

## Scalar types

A scalar data type, or just **scalar**, is any non-collection value. Put differently, a scalar is a singular value. There are just a few different scalar values in Python to be found in the vast majority of Python code, and they are the same or at least similar across the vast majority of programming languages.

- **Numeric types** represent numbers and they come in two flavours; with and without decimal part. 
    - integers have no decimal part (`int`) e.g. `2`, `3`, `140000000198`
    - floating point numbers do have a decimal part (`float`) e.g. `3.211`, `2.0`, `5E-5`
- **Text**: aka string data (`str`) e.g. `"It's a wonderful world"` or `'What a challenge!'`
- **Logical**: aka boolean (`bool`) with only two possible values: `True` and `False`

(I skipped binary and complex types for conciseness)


Since Python is a **dynamically typed language** you do not need to specify the type when creating a variable. However, you can inspect the type by using the `type()` function on any literal or variable. 

In [1]:
print(type(True))

m = "It's a wonderful world" # You can use single quotes in a double-quoted string, but not double quotes (unless escaped)

print(type(m))

print(type(5E5)) #exponents are always floats

<class 'bool'>
<class 'str'>
<class 'float'>


Variables can change type when their value changes:

In [2]:
x = 42

print(f'x is of type {type(x)} and has value {x}')

x = True

print(f'x is of type {type(x)} and has value {x}')

x is of type <class 'int'> and has value 42
x is of type <class 'bool'> and has value True


You can also explicitly change between types, as long as it is a legal conversion.

In [3]:
print(float("42.0"))  # OK
#print(int("42.0"))   # fails with a ValueError: "42.0" is not an integer literal
print(int("42"))      # OK
print(bool("42.0"))   # OK - any non-empty string is considered True
print(bool(""))       # OK - an empty string is False
# print(int(""))      # fails with a ValueError as well

42.0
42
True
False


When reading from command-line arguments (the terminal) or from a file, your data will always be character data, even when it looks numeric. You always need to do the conversion yourself (unless you use dedicated libraries for it). 
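Here is a minimal sketch of that conversion step. The variables `raw_count` and `raw_price` are made-up stand-ins for values that would arrive from the command line or from a file; they are not taken from a real program.

```python
# Hypothetical values, as they would arrive from the terminal or a file: always str
raw_count = "12"
raw_price = "3.5"

# Without conversion, + simply glues the two pieces of text together
print(raw_count + raw_price)    # 123.5 (a string, not a number)

# Convert explicitly before doing arithmetic
count = int(raw_count)
price = float(raw_price)
total = count * price
print(total)                    # 42.0
```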

### Exercise 2.1
Try some conversions yourself.  
Figure out why some fail and some do not (sometimes where you expected it), what the result is and what the logic behind the conversion is. Especially conversions to `bool` are interesting and very relevant in `if <condition>:` blocks. 

### Mathematical operators with strings?

Yes, some math operators work with strings as well, but not all:

In [4]:
print('Hello, ' + 'world!')
print('Hello, ' * 2)
print('Hello' + 2) # gives a TypeError

Hello, world!
Hello, Hello, 


TypeError: can only concatenate str (not "int") to str

## Collection types

Collection types, also named Container types, are exactly what their name implies; they are collections of other types (scalar or collection). The number of collection types in the base language is limited, and there are only a few that are used in the majority of cases; these are listed here. A very important aspect of collections is that they are **iterable**: they can be traversed to inspect or access all individual elements.



- **Sequence types** These all have elements in a specific order and these can be addressed using the position of the element - its **index**.
    - **list** In a list, order matters, and that is why you can fetch elements by their position, starting at zero. Lists can change: you can add and delete elements.
    - **tuple** The tuple is much like a list, but with a very important distinction: they are **immutable**. Once created they can't change.
    - **range** A range is a series of numbers that can be used for iteration or for creating lists or tuples.
    - (**str**) Strings behave A LOT like other sequence types!
- **set** A set is a collection of unique elements; no duplicates are allowed.
- **dict** In a dictionary (called a map in other languages) there are **entries** in which a **key** is coupled to a corresponding **value**, so that the value can be retrieved by its key.
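To make the list above a bit more concrete, here is a small sketch that creates one of each and iterates over some of them. All variable names and values are invented for illustration.

```python
fruits_list = ["apple", "kiwi", "apple"]     # list: ordered, mutable
fruits_tuple = ("apple", "kiwi", "apple")    # tuple: ordered, immutable
numbers = range(3)                           # range: the numbers 0, 1, 2
fruits_set = {"apple", "kiwi", "apple"}      # set: the duplicate is dropped
prices = {"apple": 0.5, "kiwi": 0.8}         # dict: key -> value

print(len(fruits_list))    # 3
print(len(fruits_set))     # 2
print(list(numbers))       # [0, 1, 2]

# All of them are iterable, so the same for loop works on each
for fruit in fruits_tuple:
    print(fruit)

# Iterating over a dict gives you its keys
for name in prices:
    print(name, prices[name])
```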



Below, common aspects of collections are discussed first. These are **slicing** and accessing methods via the **dot operator** (`object.method()`). Some details of individual collection types are discussed in later sections.

### Slicing
Any sequence type can be considered a street with houses. The house number identifies a house within the street. In Python these addresses start at zero, but unlike house numbers you can also count from the end, starting with -1:   

<code style="font-size: x-large; font-weight: bold; color: darkred">
 0  1  2  3  4  5  6  7 	
 A  B  C  D  E  F  G  H
-8 -7 -6 -5 -4 -3 -2 -1
</code>

You can access a single house, a range of houses or houses in a pattern. All this is done using **slicing**. Its general syntax is `[start:stop:step]`.  


The `step` is 1 by default and it is not mandatory, and if `start` or `stop` is omitted this means "from the beginning (0)" or "to the end".  
Note that `stop` is NOT inclusive!  
If you try to access a single element (character) at an index that does not exist you will get an `IndexError`; a slice that runs past the end is silently cut off instead.    
Here are some examples using strings. They work the same in lists and tuples as you will discover.

In [None]:
letters = 'ABCDEFGHIJK'

print(f'The character(s) selected by letters[0] are {letters[0]}')
print(f'The character(s) selected by letters[3] are {letters[3]}')
print(f'The character(s) selected by letters[-3] are {letters[-3]}')
print(f'The character(s) selected by letters[2:6] are {letters[2:6]}')


In [None]:
letters = 'ABCDEFGHIJK'

print(f'The character(s) selected by letters[::2] are {letters[::2]}')
print(f'The character(s) selected by letters[:] are {letters[:]}')
print(f'The character(s) selected by letters[::-2] are {letters[::-2]}')
print(f'The character(s) selected by letters[:-5:-2] are {letters[:-5:-2]}')


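As you can see, when the step is negative (last 2 examples), the defaults for start and stop are "reversed": an omitted start now means the end of the sequence and an omitted stop means the beginning. I don't understand this design decision.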
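The following sketch may help to see the logic of negative steps: the sequence is simply walked from right to left, so `start` has to lie to the right of `stop`. The index values are picked purely for illustration.

```python
letters = 'ABCDEFGHIJK'

# Walking right to left: start at index 8 (I), stop before index 2 (C)
backwards = letters[8:2:-2]
print(backwards)            # IGE

# With start to the left of stop there is nothing to walk over
empty = letters[2:8:-2]
print(repr(empty))          # ''

# Omitting both start and stop with step -1 reverses the whole string
reversed_letters = letters[::-1]
print(reversed_letters)     # KJIHGFEDCBA
```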

#### Exercise

Given the string `txt = 'aA.bB.cC.dD.eE.fF.gG.hH.zZ'`, write code using string slicing to print to screen  
 
- `"abcdefgz"`
- `"........"` 
- `"BCDEFGH"` 
- `'bC.fG.'`

In [None]:
txt = 'aA.bB.cC.dD.eE.fF.gG.hH.zZ'
# YOUR CODE

### Meet the dot operator and objects

So far I have skipped the point that Python is an **Object-Oriented** programming language.  
Being object-oriented means that (almost) everything is being modeled as an entity with data and behaviour (e.g. methods) where the object's **class** holds the _blueprint_ for creating objects (instances) according to that blueprint.  


Let's explore this concept with the string type. In Python, strings have only a single property - their character sequence. They do, however, have many methods.  
Properties and methods can both be accessed on an object using the **dot operator**.  
The difference lies in the fact that -unlike properties- methods have parentheses after their name that (optionally) define method arguments.  



Here are some examples of methods on string objects.

In [None]:
s1 = "Hello"
print(s1.upper())      # every character to uppercase

print(s1.find("ll"))   # find('str') gives the index where a given substring occurs (or -1 if not present)

print("Hi I am a programmer".split(" "))    # split on space to get a list of words

print("-".join("ABC")) # combine individual characters with a separator (here a dash)

print(s1.rjust(10))    # right justify at 10 characters

print("foo bar baz".title())  # make all first letters of words uppercase. Alternatively, call on the class itself: str.title("foo bar baz")

#### Exercise
Given this variable (use your own name!)

```python
name = 'Michiel Noback'
```

Use any combination of `find()` and slicing to print only your first name.

In [None]:
name = 'Michiel Noback'
# your code

The dot operator works on **all objects** in Python. Of course, the set of available methods differs from type to type; a list for instance does not have `upper()`.  You may be wondering how you can recognize what is an object and what is not? Well, that one is simple: _everything in Python is an object_.  

Try it out!  

Use the function `type()` again to get the object type of any variable or type. There are (of course) a few exceptions: the reserved **keywords** of the language do not have a type: `and`, `if`, `not`, `for`, etc; type `help("keywords")` or see [Python keywords](https://realpython.com/python-keywords/). Here are some built-in and custom types.   

In [None]:
print(type(42))
print(type(print))
print(type(type))

#define a custom type
class Foo:
    pass

print(type(Foo()))
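Since the set of available methods differs from type to type, it can be handy to ask an object what it offers. The sketch below uses the built-ins `hasattr()` and `dir()` for this; the variables `word` and `fruits` are just examples.

```python
word = "kiwi"
fruits = ["kiwi"]

print(hasattr(word, "upper"))      # True: strings have upper()
print(hasattr(fruits, "upper"))    # False: lists do not
print(hasattr(fruits, "append"))   # True: but lists have append()

# dir() lists all attribute names; skip the ones starting with an underscore
string_methods = [name for name in dir(str) if not name.startswith("_")]
print(string_methods[:5])          # the first few public string methods
```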


### Collection type `list`: []

A list is an **ordered mutable sequence of elements**.  

**Ordered** means that individual elements can be accessed using their index.  
**Mutable** means that lists can be extended, shortened, and elements can be changed (swapped).  

As with string objects, slicing is a very important technique for working with lists. Accessing elements is done using square brackets `[]`.
Here are a few basic operations with lists.

In [None]:
fruits = ["apple", "orange", "kiwi", "pear"]
fruits += ["banana", "plum"]       # extend the list in place
print(fruits)
fruits[2:5] = [] # delete the third through fifth elements (indices 2, 3 and 4)
print(fruits)
fruits[1:1] = ["grapefruit", "strawberry"]  # insert new elements
print(fruits)
# fruits[1] = ["grapefruit", "strawberry"] gives an embedded list!


#### Exercise

Given the list below, investigate whether list slicing behaves the same as with strings.

In [None]:
fruits = ["apple", "orange", "kiwi", "pear", "banana", "plum"]
#your code

#### Multidimensional lists

Since you can put any Python data type in a list, it is easy to create a multidimensional list (matrix). 

```python
numbers = [[1, 2], [3, 4], [5, 6]]
numbers[1]    # gives [3, 4]
numbers[2][1] # gives 6
```

Note that Python has dedicated libraries for working with data structures of this type: NumPy (number crunching) and pandas (spreadsheet type data).
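With plain nested lists you can already get quite far. Below is a small sketch of reading, changing and traversing such a matrix; the names `row` and `first_column` are only chosen for this example.

```python
numbers = [[1, 2], [3, 4], [5, 6]]

print(len(numbers))       # 3 rows
print(len(numbers[0]))    # 2 columns in the first row

numbers[0][1] = 20        # change the element in row 0, column 1

# Traverse the matrix row by row
for row in numbers:
    print(row, sum(row))

# Collect the first element of every row
first_column = [row[0] for row in numbers]
print(first_column)       # [1, 3, 5]
```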

#### Methods of type list

Class `list` also has an extensive collection of methods to apply to lists. Here are just a few. Also, there is the Python builtin `len()` that will give you the length of any collection. Again, this is only a glimpse of what is available. *Any* operation you can think of with lists is probably already implemented, if not in the core data types then for sure in NumPy or pandas.


In [None]:
fruits = ["apple", "orange", "kiwi"] # note the square brackets
fruits.append(["banana", "plum"])    # adds an embedded list
fruits.extend(["guava", "cherry"])    # adds each element separately
print(fruits)
print(fruits.pop()) # removes and returns last element
print(len(fruits))  # len() is one of the built-in functions. See chapter on functions for more info
fruits.reverse()
print(fruits)

### Collection type `tuple`: ()

A tuple is an **ordered immutable sequence of elements**. 

**Ordered** means that individual elements can be accessed using their index.  
**Immutable** means that tuples cannot be changed once they are created.  

As with string and list objects, slicing is a very important technique for working with tuples, as long as the operation does not mutate the tuple. Accessing elements is done using square brackets `[]`. 
Tuples are often encountered as return value of a method.
Here are a few basic operations with tuples.

In [None]:
animals = ("bear", "horse", "ant")  # note the parentheses ()
print(animals[:2])                  # OK
animals += ("platypus", )           # surprisingly, OK 
#animals += ("platypus")            # Gives a TypeError because this is a string surrounded by parentheses
#animals[1:2] = ()                  # TypeError because tuples are immutable
print(animals)

You may be puzzled by the above snippet. This is allowed:

```python
animals += ("platypus", )
```

whereas this is not:

```python
animals[1:2] = () 
```

Type tuple is immutable, but you _can_ create a _modified copy_, which is exactly what happens in the first example: a new tuple is built and the name `animals` is bound to it. The second line attempts to delete an element in place, which is not allowed.
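You can make this visible with a second name for the same object. The sketch below is only an illustration; the names `original`, `fruits` and `same_list` are made up for the purpose.

```python
animals = ("bear", "horse", "ant")
original = animals               # a second name for the same tuple object

animals += ("platypus",)         # builds a new tuple and rebinds the name animals
print(animals)                   # ('bear', 'horse', 'ant', 'platypus')
print(original)                  # ('bear', 'horse', 'ant')
print(animals is original)       # False: two different objects

# Compare with a list, which is changed in place by +=
fruits = ["apple"]
same_list = fruits
fruits += ["kiwi"]
print(same_list)                 # ['apple', 'kiwi']
print(fruits is same_list)       # True: still one and the same object
```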

It is very informative to visualize this kind of stuff in https://pythontutor.com/

### Collection type `set`: {}

A set is a collection type that can only hold unique values.  
If a value is added that is already present this will have no effect (the value will be ignored).  
If you attempt to add a mutable type (an unhashable element) you get a `TypeError`.  

They are quite limited in functionality when compared to other collection types: they are not sliceable or indexable (using brackets `[index]` or `[from:to]`).

Below you see the most common usage scenarios.

In [None]:
animals = set()
animals.add("deer") 

## could also have been created as a literal: 
# animals = {'deer'}
animals.add("deer")
animals.add("beetle")

print(animals)

In [None]:

#animals.add(["gorilla", "horse"]) # TypeError: a list is mutable and therefore not hashable
animals.add(("zebra", "armadillo")) # OK

print(animals)

animals.remove("deer") # remove an element
print(animals)

#### Why sets (and dicts)?
Although any collection in Python supports membership tests, like `element in collection`, sets have several properties that make them extremely useful.

- Sets are guaranteed to have unique elements
- Element lookup is extremely efficient: O(1) on average for sets (and dicts) instead of O(n) for lists
- Sets support all operations of set theory (see below)
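The first two points can be sketched as follows. The list `visitors` is invented example data.

```python
visitors = ["ann", "bob", "ann", "cleo", "bob"]   # hypothetical data with duplicates

# Converting to a set removes the duplicates
unique_visitors = set(visitors)
print(len(visitors))                  # 5
print(len(unique_visitors))           # 3

# Membership tests read the same as for lists, but do not scan every element
print("cleo" in unique_visitors)      # True
print("dave" in unique_visitors)      # False
```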


### Set theory operations

Given the sets depicted here; set A and B:

![sets](./pics/sets.png)


Where set A contains the elements `{'apple', 'banana', 'kiwi'}` and  
set B contains the elements `{'kiwi', 'orange', 'pear'}`


- the **union** of two sets is the set of elements that are in A or in B (`A ∪ B`)
- the **intersection** of two sets is the set of elements that are in A and in B (`A ∩ B`)
- set A and set B are **disjoint** if they have no elements in common.
- set B is a **proper subset** of A if all elements of B are also in A but `B ≠ A`
- the **symmetric difference** is the set of elements that are in A or in B but not in both (the union minus the intersection)

In [None]:
A = {'apple', 'banana', 'kiwi'}
B = {'kiwi', 'orange', 'pear'}

print(A | B)  # union, same as A.union(B)
print(A & B)  # intersection, same as A.intersection(B)
print(A.isdisjoint(B))
print(A <= B) # subset, same as A.issubset(B); use A < B for a proper subset
print(A ^ B)  # symmetric difference, same as A.symmetric_difference(B)

### Collection type `dict`: {}

The last collection type discussed here is the dict, which is an abbreviation of _dictionary_.  
It is used to hold mappings between _keys_ and _values_, much like real dictionaries.  
Like the elements of a set, the keys of a dict must be immutable (hashable) types. There are no restrictions for the values. Also like set elements, the keys are always unique. The values, however, can contain duplicates.  
Dicts cannot be sliced, but finding values is done using `dictionary[key]`.


In [None]:
zoo = {'lion' : 'Carl',
      'leopard' : ['Sue', "Lilith"],
      'rhino' : 'Bobolin'}

print(zoo.get("leopard"))

if (zoo.get('lion')):       # works because get() returns None when the key is absent, and None evaluates to False
    print('we have lions!')

# better solution 
if ('lion' in zoo):
    print('we have lions!')

print(zoo['rhino'])

#print(zoo['giraffe'])     # gives a KeyError
print(zoo.get('giraffe'))  # returns None
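Dicts are mutable, so entries can be added, replaced and removed using the same square bracket notation. Here is a short sketch; the animal names added here are made up and not part of the example above.

```python
zoo = {'lion': 'Carl', 'rhino': 'Bobolin'}

zoo['giraffe'] = 'Melman'    # new key: an entry is added (hypothetical name)
zoo['lion'] = 'Simba'        # existing key: the value is replaced
del zoo['rhino']             # remove an entry by its key

print(len(zoo))              # 2

# items() gives key and value together when iterating
for species, name in zoo.items():
    print(species, name)
```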


#### DIY
Now take some time to figure out what the `dict` methods `setdefault()` and `get()` (with a default value as second argument) are for and how they are used. Try them out!

### Shared operations of collection types

Many operations (operators and methods) are applicable to all or a subset of collection types. Slicing only works on sequence types (list, tuple, string), but not on dict or set. Here are some of the most-used collection operations.




Operation    | Meaning
-------------|---------------------------------
x in s       | True if x is an item in or member of s, False otherwise
x not in s   | True if x is not an item in or member of s, False otherwise
s + t        | the concatenation of s and t
s * n, n * s | n shallow copies of s concatenated
s\[i\]       | i'th item of s, origin 0
s\[i:j\]     | slice of s from i to j (j not included)
s\[i:j:k\]   | slice of s from i to j with step k
len(s)       | length of s
min(s)       | smallest item of s
max(s)       | largest item of s
s.index(x)   | index of the first occurrence of x in s
s.count(x)   | total number of occurrences of x in s
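A quick sketch of these operations on three different sequence types. The variables `s`, `t` and `w` are arbitrary examples.

```python
s = [3, 1, 2, 1]      # list
t = (3, 1, 2, 1)      # tuple
w = "banana"          # string

print(1 in s, 5 not in t, "nan" in w)          # True True True
print(s + [9])                                 # [3, 1, 2, 1, 9]
print(t * 2)                                   # (3, 1, 2, 1, 3, 1, 2, 1)
print(len(w), min(w), max(w))                  # 6 a n
print(s.index(1), t.count(1), w.count("an"))   # 1 2 2

# in and len() also work on sets and dicts
print(2 in {1, 2, 3}, len({"a": 1}))           # True 1
```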


### What do I need to memorize?

Do you need to know all these string methods, and methods from other datatypes?  

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**NO!**

You will simply remember the ones you use most without any problem.  
For the most part you will need to learn how to quickly find answers to your problems. Here are the resources that are most relevant (in logical order of usage):

1. Use the dot operator within your editor. This usually suggests possible methods on an object (or class).
2. Use `help()` in Jupyter or the Python console, e.g. `help(str)`
3. Use [the Python docs](https://docs.python.org/3/), in particular [The Python Standard Library](https://docs.python.org/3/library/index.html)
4. GIYF (Google Is Your Friend)
5. Ask a colleague/friend/random tech wizard on the street. Possibly preceded by asking the [debug duck](https://rubberduckdebugging.com/)

Also, it may be worth your while to have some cheat sheets copied to your Desktop (physical or virtual).

#### Exercise

Use the above (sequence of) resources to find out how to ...
- print three string variables as one, with a `+` between each string
- split a sentence into a list of separate words
- get a random number between 1 and 100
- round a number to 2 decimals


# Key concepts

- **collection**: a value that is composed of multiple values, e.g. `list`, `set` etc.
- **class**: A blueprint for creating objects of that type. Classes define data and -mainly- behaviour in the form of methods. These methods may be blueprint-scoped (class methods) or instance-scoped (object methods).
- **(im)mutable**: Indicates whether an object (such as a collection) can change after it has been created.
- **keywords**: The reserved words of the programming language. These include words like `for`, `in`, `not` etc.
- **scalar**: a single, non-collection value.
