# Some Python data structures

This section shows how to use some of the most important data structures needed in data analysis. Data analysis often requires arrays and matrices, which store many values under one symbolic name.

## Python list

A list is a variable that can store a sequence of values, possibly of different types. A list is created by surrounding the values with square brackets and separating them with commas:

In [1]:
from numpy import pi

list_of_integers=[1,2,3,4,5]
print(list_of_integers)

list_of_floats=[3.141, 1.412, pi]
print(list_of_floats)

list_of_characters=['a', 'b', 'c', 'd']
print(list_of_characters)

list_of_strings=['eka', 'toka', 'kolmas']
print(list_of_strings)

an_empty_list=[]
print(an_empty_list)

# Get the lengths
print("The length of integer list is", len(list_of_integers))
print("The length of empty list is", len(an_empty_list))

[1, 2, 3, 4, 5]
[3.141, 1.412, 3.141592653589793]
['a', 'b', 'c', 'd']
['eka', 'toka', 'kolmas']
[]
The length of integer list is 5
The length of empty list is 0


## Accessing values

In [2]:
print("The first integer is ", list_of_integers[0])
print("The second float is ", list_of_floats[1])
print("The last string ", list_of_strings[-1])

# Lists can also be sliced
print("The second and third integers are ", list_of_integers[1:3])
print("All integers, except the first one are", list_of_integers[1:])
print("All integers, except the last one are", list_of_integers[:-1])

The first integer is  1
The second float is  1.412
The last string  kolmas
The second and third integers are  [2, 3]
All integers, except the first one are [2, 3, 4, 5]
All integers, except the last one are [1, 2, 3, 4]
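
Slices also accept an optional third value, the step. The following is a small sketch that uses an assumed sample list called `numbers`, which is not one of the variables defined in the cells above:

```python
# Assumed sample data, similar to the list of integers used earlier
numbers = [1, 2, 3, 4, 5]

every_second = numbers[::2]    # start at index 0 and take every second item
reversed_copy = numbers[::-1]  # a negative step walks the list backwards
last_two = numbers[-2:]        # negative indices count from the end

print(every_second)   # [1, 3, 5]
print(reversed_copy)  # [5, 4, 3, 2, 1]
print(last_two)       # [4, 5]
```

A slice always returns a new list, so the original `numbers` list is left unchanged.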


### The values in the list can be changed

In [3]:
list_of_integers[2]=99
print(list_of_integers)

list_of_integers[1:4]=[21,22,23]
print(list_of_integers)

[1, 2, 99, 4, 5]
[1, 21, 22, 23, 5]


The output of many functions, such as range(), can be converted to a list

In [4]:
x=list(range(15))
print(x)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]


### The values of the list can be iterated

Each list is an iterable, which means that its values can be visited one at a time in a loop. The following loop assigns each value in the list x to the variable i in turn. Inside the loop, the value is raised to the second power and the result is printed.

In [5]:
for i in x:
    print(i**2)

0
1
4
9
16
25
36
49
64
81
100
121
144
169
196


In [6]:
# It also works for floats
for i in list_of_floats:
    print(i**2)

9.865881
1.9937439999999997
9.869604401089358


In [7]:
# And for strings
for i in list_of_strings:
    print(i)

eka
toka
kolmas


In [8]:
# But you cannot raise a string to the power of 2
for i in list_of_strings:
    print(i**2)

TypeError: unsupported operand type(s) for ** or pow(): 'str' and 'int'

In [None]:
# A string is also a sequence, so its characters can be iterated like list items
s='Hello!'
for i in s:
    print(i)
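
The loops in this section only read the values. When the position of each value is needed as well, or when the loop is used to build a new list, the built-in `enumerate()` function and list comprehensions are common alternatives. A minimal sketch, with `words` and `squares` as illustrative names:

```python
words = ['eka', 'toka', 'kolmas']  # assumed sample data

# enumerate() yields (index, value) pairs
for index, word in enumerate(words):
    print(index, word)

# A list comprehension builds a new list from a loop in one expression
squares = [i**2 for i in range(5)]
print(squares)  # [0, 1, 4, 9, 16]
```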

## Python Tuples

A tuple is similar to a list, except that it is immutable: its values cannot be changed after creation. A tuple is created by listing the values in parentheses.

Tuples are indexed exactly like lists.

In [None]:
tuple_of_integers=(1,2,3)
print(tuple_of_integers)

tuple_of_strings=('eka', 'toka', 'kolme')
print(tuple_of_strings)

print("The second integer", tuple_of_integers[1])

# A tuple cannot be changed: this assignment raises a TypeError
tuple_of_integers[1]=3
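
The immutability of tuples can be seen without stopping the program by catching the error. The sketch below uses a hypothetical tuple called `point`. It also shows unpacking, and how a "changed" tuple is in practice a new tuple:

```python
point = (3, 4)  # hypothetical example tuple

try:
    point[0] = 10          # item assignment is not allowed for tuples
    changed = True
except TypeError as err:
    changed = False
    print("Cannot modify a tuple:", err)

# Unpacking assigns each item to its own variable
px, py = point
print(px, py)  # 3 4

# To "change" a tuple, build a new one from the old values
moved = (point[0] + 1, point[1])
print(moved)   # (4, 4)
```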

## Multi-dimensional lists

A list can have more than one dimension. The following multidimensional list, M, is actually a list that contains four sublists of three items each.

![](https://github.com/pevalisuo/AML/blob/master/lectures/kuvat/matrix.png?raw=1)

In [None]:
M=[[1,2,3], [4,5,6], [7,8,9], [10,11,12]]
print(M)

print("The second sublist is ", M[1])
print("The first item of the second sublist is ", M[1][0])

# Get the dimensions
nrows = len(M)
ncols = len(M[0])

print("The dimensions of M are %d %s %d" % (nrows, 'x', ncols))
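
Because M is a list of row sublists, a row can be read directly but a column has to be collected from every sublist. A short sketch of both, which repeats the definition of M so that it runs on its own:

```python
M = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]

# A row is simply one sublist
second_row = M[1]

# A column has to be collected item by item from every sublist
first_column = [row[0] for row in M]

# zip(*M) regroups the items column by column, which transposes the list
transposed = [list(col) for col in zip(*M)]

print(second_row)    # [4, 5, 6]
print(first_column)  # [1, 4, 7, 10]
print(transposed)    # [[1, 4, 7, 10], [2, 5, 8, 11], [3, 6, 9, 12]]
```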


## Dictionaries

Sometimes it is convenient to index a collection with something other than integers. In this case you can use a dictionary, which maps keys to values.

In [9]:
d={'eka' : 'first',
     'toka': 'second',
     'kolmas': 'third'}

s='toka'
print("%s in Finnish means %s in English" % (s, d[s]))
d

toka in Finnish means second in English


{'eka': 'first', 'toka': 'second', 'kolmas': 'third'}
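
A few other common dictionary operations are sketched below: adding a key, testing for a key, reading with a default value, and iterating over key-value pairs. The extra key `'neljäs'` and the missing key `'viides'` are made up for this illustration.

```python
d = {'eka': 'first', 'toka': 'second', 'kolmas': 'third'}

d['neljäs'] = 'fourth'               # add a new key-value pair
has_eka = 'eka' in d                 # membership tests look at the keys
missing = d.get('viides', 'unknown') # get() returns a default instead of raising KeyError

print(has_eka)   # True
print(missing)   # unknown

# items() yields (key, value) pairs
for finnish, english in d.items():
    print(finnish, '->', english)
```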

## Numpy arrays (ndarray)

NumPy is short for Numerical Python. It is a library of data structures and functions for efficient numerical computing. The core of NumPy is written in C and compiled to native machine code, which makes its functions very efficient.

Native Python data structures are convenient, but NumPy arrays are much better at handling large amounts of data. The benefits of NumPy arrays are:
 - They are used throughout NumPy and the libraries built on it
 - They support easy vectorization of operations
 - They support very convenient indexing and slicing
 - They are much more efficient in terms of memory usage than Python lists
 - They can be used in place of Python lists in most situations
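
The difference between a list and an array shows up most clearly in arithmetic. The following sketch uses made-up sample values. It is meant only as an illustration of vectorization and compact storage, not as a benchmark.

```python
import numpy as np

values = [1.0, 2.0, 3.0]   # assumed sample data
arr = np.array(values)

# Multiplying a list repeats it, multiplying an array scales every element
print(values * 2)   # [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
print(arr * 2)      # [2. 4. 6.]

# With a plain list, element-wise arithmetic needs an explicit loop
doubled = [v * 2 for v in values]

# A numeric array stores its raw values in one block: 3 values of 8 bytes each
print(arr.dtype, arr.nbytes)  # float64 24
```

Note that operators such as `*` and `+` behave differently for lists and arrays, so an array is not a drop-in replacement for a list in every expression.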



In [10]:
# First import the Numerical Python library and give it a short alias, np
import numpy as np

# Then we can create Numpy arrays

# 1) From python list
x=np.array([1,2,3])
y=np.array(list_of_floats)
s=np.array(list_of_strings)

print(x)
print(y)
print(s)

[1 2 3]
[3.141      1.412      3.14159265]
['eka' 'toka' 'kolmas']


They can be indexed just like lists or tuples

In [11]:
s[0]

np.str_('eka')

In [12]:
# Multidimensional list can be converted to Numpy array
A=np.array(M)
print("A=", A)

# We can create a multidimensional zero array using the
# Numpy function zeros, and by giving the dimensions of
# the array as tuple

B=np.zeros((3,4))
print("B=", B)


## Indexing Numpy arrays
Numpy arrays support flexible indexing.
![](https://github.com/pevalisuo/AML/blob/master/lectures/kuvat/matrix.png?raw=1)

In [None]:
# Get the dimensions of the array
nrows,ncols = A.shape
print("The dimensions of A are ", (nrows, ncols))

# The shape property of Numpy array is a tuple of integers
A.shape

In [None]:
# Get a specific value
row=3 # The fourth row
col=2 # The third column

# This works like in MATLAB, except that indices start from zero
A[row,col]


In [None]:
# Get the first row
A[0]

In [None]:
# Get the first column, again just like in MATLAB
A[:,0]
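
A few more indexing patterns are sketched below. The array A is defined again here so that the example runs on its own, and the threshold 8 in the boolean mask is an arbitrary choice for illustration.

```python
import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])

last_row = A[-1]         # negative indices count from the end
block = A[1:3, :2]       # rows 1-2 and columns 0-1
large = A[A > 8]         # boolean mask, the result is a flat array
two_cols = A[:, [0, 2]]  # a list of indices picks the first and third columns

print(last_row)   # [10 11 12]
print(block)      # [[4 5] [7 8]]
print(large)      # [ 9 10 11 12]
print(two_cols)
```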

In [None]:
import numpy as np
234**234

## Vectorization of operators

In [None]:
# Iterating over the values of a large vector in a Python loop is rather slow
def sumofvalues(x):
    total=0
    for i in x:
        total += i
    return total

# Let's create a NumPy array that holds the first N integers and calculate their sum
v=np.arange(100000000)

# Let's use the %time magic command to measure the time needed
# for the calculation. On my computer it took about 300 ms
# The result is stored in total, so that the built-in sum() is not overwritten
%time total=sumofvalues(v)
print("Sum of values is ", total)

Now we do the same thing using a function from the NumPy library. This took only 5 ms on my computer, so it was 60 times faster.

In [None]:
%time np.sum(v)

The array is actually an object with many built-in methods, including `sum()`. You can get the sum with that method as well, and it is about as fast as calling `np.sum()`.

In [None]:
%time v.sum()

By vectorizing operations with built-in functions you can make the code significantly more efficient with fewer lines of code.
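
The `%time` command is available only in IPython and Jupyter. In a plain Python script, the same comparison can be sketched with `time.perf_counter()` from the standard library. The array below is smaller than the one used above to keep the run short, and the measured times depend on the computer, so none are quoted here.

```python
import time
import numpy as np

v = np.arange(1_000_000, dtype=np.int64)  # smaller array than in the text

# Time the explicit Python loop
start = time.perf_counter()
loop_total = 0
for value in v:
    loop_total += value
loop_seconds = time.perf_counter() - start

# Time the vectorized sum
start = time.perf_counter()
vector_total = v.sum()
vector_seconds = time.perf_counter() - start

print("loop:  ", loop_seconds, "s")
print("v.sum:", vector_seconds, "s")
print("same result:", loop_total == vector_total)
```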

As another example, we can raise all the cells in an array to the power of two

In [None]:
A**2

Because the result is also an array, we can chain the array operators like this:

In [None]:
(A**2).sum()

In [None]:
np.sqrt((A**2).sum())
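
The chained expression above is the square root of the sum of squares of all cells, which is the Frobenius norm of the matrix. As a cross-check, the sketch below compares it with `np.linalg.norm`, which computes the same norm by default for a two-dimensional array. The array A is defined again so that the example runs on its own.

```python
import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])

sum_of_squares = (A**2).sum()      # 1 + 4 + 9 + ... + 144
chained = np.sqrt(sum_of_squares)  # the chained expression from the cell above
norm = np.linalg.norm(A)           # Frobenius norm for a 2-D array

print(sum_of_squares)              # 650
print(chained, norm)
print(np.isclose(chained, norm))   # True
```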

## Exercises

1. Study which built-in functions NumPy arrays include by reading the [documentation](https://numpy.org/doc/stable/reference/arrays.ndarray.html).
1. Create a NumPy array A, and then use the interactive help function to study the documentation. Do this by writing A. ("A" and ".") at the beginning of a line in a code cell, placing the cursor immediately after the characters A., and pressing Shift-Tab. You can expand the documentation window by clicking the `^` mark in the help window.
1. Study the list of available properties and functions of the NumPy array A by placing the cursor after "A." and pressing Tab.

![](https://github.com/pevalisuo/AML/blob/master/lectures/kuvat/numpyArrayHelp.png?raw=1)



In [None]:
A.

# Pandas

Pandas is a Python data analysis library, which makes data reading, handling and plotting even more convenient.

Let's import the pandas library and give it the short name pd.

In [None]:
import pandas as pd
import numpy as np


In pandas, the data is stored in data frames (like in R). A DataFrame is a two-dimensional table with labeled rows and columns, and each column can hold a different data type.

We can now create a pandas data frame from a nested list. An existing NumPy array can be used in the same way.

In [None]:
A=[[1,2,3], [4,5,6], [7,8,9], [10,11,12]]
pd.DataFrame(data=A)

As can be seen, pandas automatically displays the data as a nicely formatted table in a notebook.

The dataframe can have symbolic column and row names:

In [None]:
D=pd.DataFrame(data=A, columns=('a', 'b', 'c'),
               index=('Mon','Tue', 'Wed', 'Fri'))
D

A DataFrame can be indexed either by symbolic names (loc) or by numerical indices (iloc).

In [None]:
D.iloc[1,2]

In [None]:
D.loc['Tue':, :'b']

In [None]:
D.loc['Tue', 'b']
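
One difference between the two indexers is easy to miss: a label slice with `loc` includes its end label, while a position slice with `iloc` excludes its end index, as in ordinary Python slicing. The sketch below rebuilds the data frame D so that it runs on its own, and selects the same block in both ways.

```python
import pandas as pd

D = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]],
                 columns=['a', 'b', 'c'],
                 index=['Mon', 'Tue', 'Wed', 'Fri'])

by_label = D.loc['Tue':'Wed', 'a':'b']  # label slices include the end label
by_position = D.iloc[1:3, 0:2]          # position slices exclude the end index

print(by_label)
print(by_label.equals(by_position))     # True
```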

A pandas DataFrame also has a large number of built-in functions. They usually operate column-wise by default, but can be changed to operate row-wise with the axis parameter.

In [None]:
D.sum()

In [None]:
D?

In [None]:
D.std(axis=1)
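
The effect of the axis parameter can be sketched with `max()`, which is used here only as an example of a reduction. With `axis=0` the result has one value per column, and with `axis=1` one value per row. The data frame D is rebuilt so that the example runs on its own.

```python
import pandas as pd

D = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]],
                 columns=['a', 'b', 'c'],
                 index=['Mon', 'Tue', 'Wed', 'Fri'])

col_max = D.max(axis=0)  # default direction: one result per column (a, b, c)
row_max = D.max(axis=1)  # one result per row (Mon, Tue, Wed, Fri)

print(col_max)
print(row_max)
print(type(col_max))     # the result of a reduction is a pandas Series
```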

### Exercise on pandas

1. Study the on-line documentation of DataFrame by entering `D.` and pressing Shift-Tab after the dot.
1. Study the on-line documentation of the sum() function of the DataFrame by writing `D.sum` and pressing Shift-Tab after that.
1. Calculate the average of the dataframe columnwise and row-wise.
1. Calculate the sum of all values in a dataframe. Notice that the result of the column-wise sum is a Series, which has a sum() function of its own.

# Object oriented programming

All the data structures listed above are objects. This means that they not only store data, but also contain methods (member functions) and attributes (member variables), which can be accessed using dot notation.

Object-oriented programming is actually [more](https://www.tutorialspoint.com/What-is-object-oriented-programming-OOP) than that, but this is the minimum you need to know for now.

![](https://github.com/pevalisuo/AML/blob/master/lectures/kuvat/ooprogramming.svg?raw=1)

In [None]:
# This way you can access the NumPy array inside the data frame
D.values

In [None]:
# You can ask the type of an object with the type() function
print(type(D))
print(type(D.values))

In [None]:
# Read the shape property of the NumPy array inside the data frame
D.values.shape
#D.shape
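
To make the terms attribute and method concrete, the sketch below defines a small class of its own. The class `Measurement` and its members are hypothetical and are not part of NumPy or pandas. They only illustrate how data and functions are bundled in one object and accessed with dot notation.

```python
class Measurement:
    """Hypothetical class, used only to illustrate attributes and methods."""

    def __init__(self, values, unit):
        self.values = list(values)  # attribute (member variable)
        self.unit = unit            # attribute (member variable)

    def mean(self):
        """Method (member function) that uses the data stored in the object."""
        return sum(self.values) / len(self.values)


m = Measurement([1.0, 2.0, 3.0], 'V')

print(m.unit)    # attribute access with dot notation: V
print(m.mean())  # method call with dot notation: 2.0
print(type(m))
```

In the same way, `D.values` above reads an attribute of the data frame, while `D.sum()` calls one of its methods.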