# Essential Data Structures in Python
## Week 4 

This notebook has 4 sections, corresponding to the topics covered this week: 
1. List
    - 1.1 List comprehension 
    - 1.2 Working with lists 
2. Array
    - 2.1 Creating arrays 
    - 2.2 Reshaping arrays 
    - 2.3 Array Truthy and Falsy
3. Dictionary
    - 3.1 Dictionary comprehension 
4. Data frame
    - 4.1 the Index 
    - 4.2 Working with data frames
    - 4.3 Missing values in data frames 

In [1]:
# start by loading required packages and modules 
import pandas as pd 
import numpy as np

## 1. List 

We can create lists in Python using square brackets `[]`. Lists are heterogeneous data structures, so they can include data elements of various data types, including lists themselves!

In [2]:
list_a = ["cat", "dog", 45.3, "horse", "fish", 68]

print(list_a)

['cat', 'dog', 45.3, 'horse', 'fish', 68]


In [3]:
print(list_a[0])

cat


In [4]:
print(list_a[-2])

fish


Lists are a built-in data structure, so there are a lot of useful methods available for working with them. To see these, type the name of the list object followed by a dot and hit the [TAB] key.

In [None]:
## try it here - put your cursor after the . and hit [TAB]
list_a.

We can also request information about lists, such as the length using `len()`:

In [5]:
len(list_a)

6

### 1.1 **List comprehension** 
List comprehension is an elegant way of creating lists, often in a single line of code. Compare the following two cells, both of which produce the same output: a new list containing only the fruits with the letter `"a"` in the name. 

List comprehension offers a streamlined syntax as follows: 
* `newlist = [expression for item in iterable if condition == True]`

In [6]:
fruits = ["apple", "banana", "cherry", "kiwi", "mango"]
# create an empty list 
newlist = []

# use a for loop to fill the empty list 
for x in fruits:
    if "a" in x:
        newlist.append(x)

print(newlist)

['apple', 'banana', 'mango']


In [7]:
## list comprehension solution 
newlist_comp = [x for x in fruits if "a" in x]

print(newlist_comp)

['apple', 'banana', 'mango']
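The `expression` part of the template can also transform each item rather than copy it unchanged. The following is a small sketch of that idea, reusing the fruit names from above; the variable names `upper_a` and `lengths` are made up for illustration.

```python
fruits = ["apple", "banana", "cherry", "kiwi", "mango"]

# transform each kept item: upper-case the fruits that contain an "a"
upper_a = [x.upper() for x in fruits if "a" in x]
print(upper_a)

# the condition is optional: here every item is kept and replaced by its length
lengths = [len(x) for x in fruits]
print(lengths)
```

Leaving out the `if` part keeps every item, so the new list always has the same length as the original.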


### 1.2 Working with lists 

To add an item to the end of a list, for example, we can use `listobject.append()`. 

In [8]:
list_a.append("antelope")

print(list_a) 

## run this code cell a few times and see what happens

['cat', 'dog', 45.3, 'horse', 'fish', 68, 'antelope']


To remove a specific item from a list we can use `listobject.remove()`

In [9]:
list_a.remove(45.3)

print(list_a)

['cat', 'dog', 'horse', 'fish', 68, 'antelope']


We can join two lists together in Python by simply using the `+` sign 

In [10]:
list_a + fruits

['cat',
 'dog',
 'horse',
 'fish',
 68,
 'antelope',
 'apple',
 'banana',
 'cherry',
 'kiwi',
 'mango']

You can sequentially iterate over a list using a for loop. In Python, we use a "for in" loop construction, which is similar to the "for each" loops you find in C++ and Java. The for loop syntax is as follows: 

` for iterator in sequence: 
     statement(s)`

Note that white space indentation, as mentioned in Week 1, is important here! If you go to a new line after the colon `:` (by pressing ENTER), Jupyter will automatically do this indentation for you. 

In [11]:
# using a for loop to iterate over a list and print each element 
## using range(len(list)) to obtain the index of each element 
for i in range(len(list_a)):
    print( list_a[i] )

cat
dog
horse
fish
68
antelope
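Using `range(len(...))` works, but a for loop can also walk over the items themselves. The sketch below shows both that and the built-in `enumerate()`, which is handy when the position is needed too; the list `animals` is a made-up example.

```python
animals = ["cat", "dog", "horse"]

# loop over the items directly, no index needed
for animal in animals:
    print(animal)

# enumerate() yields (index, item) pairs when the position is also needed
for i, animal in enumerate(animals):
    print(i, animal)

# the pairs can be collected into a list to inspect them
pairs = list(enumerate(animals))
print(pairs)
```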


As mentioned above, lists can contain other lists. A list of lists has 2 dimensions, and we can index it with `list[dim1][dim2]`. Let's see how this works.

In [12]:
# let's create a two-dimensional list:
list_of_lists = [[3,4,5,6,7], [30,40,50,60,70]] 

# print the big list
print(list_of_lists) 

# print first sub-list
print(list_of_lists[0]) 

# print first number in first sub-list
print(list_of_lists[0][0])

[[3, 4, 5, 6, 7], [30, 40, 50, 60, 70]]
[3, 4, 5, 6, 7]
3


In [13]:
# remember indexes count from 0. So first element has index 0, second index 1, third index 2, etc
# to go into the second list, then to its fourth element (which should have value 60), you would use
print(list_of_lists[1][3])

# negative numbers count from the end
# to get from the first list its last element (which should be 7), you would use
print(list_of_lists[0][-1])

60
7


In most instances, your data will have many dimensions. As a recap: 
    
- a variable has **zero dimensions** - you do not need any more index/address to know what's in it
- a list/array has **one dimension** - index is the way to address individual items in a list
- a list of lists has **two dimensions** - index of the top list, and then an index of the inner list  

## 2. Array 

An Array is basically like a List, but has a number of powerful new methods and syntax that make data operations easier and faster. Theoretically you could do everything that we do with Arrays by just using good old Lists, but it would take more time and be less compatible with other libraries.

To create an Array, you can pass a list to `np.array(your_list)`. Notice the `np.array` at the beginning - it means that you are using the `array` function from the ```np``` library. `np` is a short name for `numpy` (which we gave it in `import numpy as np` at the top of this notebook).

To work with array data structures, we need the `NumPy` package. In fact, `NumPy` was designed specifically to perform numerical operations with n-dimensional arrays. Arrays store values of the same data type. NumPy's vectorized operations on arrays significantly improve performance.
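To make "vectorization" a little more concrete, here is a minimal sketch comparing the same arithmetic on a list and on an array. The names `prices_list` and `prices_array` and their values are hypothetical.

```python
import numpy as np

prices_list = [3, 7, 5, 5]            # hypothetical example values
prices_array = np.array(prices_list)

# with a list, arithmetic on every element needs a loop or comprehension
doubled_list = [p * 2 for p in prices_list]

# with an array, the operation is applied to every element at once
doubled_array = prices_array * 2

print(doubled_list)
print(doubled_array)

# careful: * on a list repeats the list instead of multiplying its values
print(prices_list * 2)
```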

In [14]:
my_list = [3, 7, 5, 5]
print(my_list)

[3, 7, 5, 5]


In [15]:
# you can create an array by feeding in a List directly or as a data object:
my_array = np.array(my_list)
print(my_array)

my_array2 = np.array([3, 0, 5, 0])
print(my_array2)

# notice it is printed a bit differently than a list! 
# the commas are not present compared to the list print out in the cell above

[3 7 5 5]
[3 0 5 0]


Arrays are often used in situations where there are many dimensions. So a grid of 

`
1,  2,  3,  4
11, 12, 13, 14
21, 22, 23, 24
`

can be represented as:

In [16]:
scores = np.array([ [1, 2, 3, 4],
                    [11, 12, 13, 14],
                    [21, 22, 23, 24]
                   ])
# do you see it? A list with 3 lists, each with 4 items inside!

print(scores)

# let's print what type of a thing it is:
print(type(scores))

[[ 1  2  3  4]
 [11 12 13 14]
 [21 22 23 24]]
<class 'numpy.ndarray'>


`NumPy` arrays bring with them a new addressing style. We can use `array[first_dimension, second_dimension, third_dimension, ....]`, meaning you pass an index for each dimension, separated by commas. You can still use the same pure-Python style as with the lists we looked at above, ```my_list[first_dimension][second_dimension]```, but using a single pair of square brackets with commas between dimensions is more common. Let's see how this works with some multidimensional arrays.

In [17]:
## above we created a multidimensional array called scores 
print(scores[0, 0])
print(scores[0, 1])
print(scores[1, 2])
print(scores[1, -1])
print(scores[-1, -1])

# look at the printed output and see if you understand why these numbers are printed

1
2
13
14
24
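As a quick check that the two addressing styles agree, the sketch below reads the same element both ways. The array `grid` simply repeats the values of `scores` under a different, made-up name.

```python
import numpy as np

grid = np.array([[1, 2, 3, 4],
                 [11, 12, 13, 14],
                 [21, 22, 23, 24]])

# both styles reach the same element: row 1, column 2
comma_style = grid[1, 2]
chained_style = grid[1][2]
print(comma_style, chained_style)

# a plain list of lists only supports the chained style
nested = grid.tolist()
print(nested[1][2])
# nested[1, 2] would raise a TypeError
```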


We can use the index to not only get a value, but also change it. 

In [18]:
# let's change some items

# reassign values 
scores[0,0] = 10
scores[0,1] = 20

# change values mathematically 
scores[1,3] += 30
scores[-1,-1] += 40

print(scores)

[[10 20  3  4]
 [11 12 13 44]
 [21 22 23 64]]


You can also omit one of the dimensions, and replace everything in a row or column. Use a colon `:` to indicate 'everything'.

In [19]:
scores = np.array([ [1,  2,   3,  4],
                    [11, 12, 13, 14],
                    [21, 22, 23, 24]
                   ])
 
scores[0,:] = 10 
print(scores)
# first dimension: value 0, second dimension: all values
# note you could also use the simpler version scores[0] = 10

[[10 10 10 10]
 [11 12 13 14]
 [21 22 23 24]]


In [20]:
scores = np.array([ [1,  2,   3,  4],
                    [11, 12, 13, 14],
                    [21, 22, 23, 24]
                   ])
 
scores[:,2] = 30  
print(scores)
# first dimension: all values, second dimension: value 2
# but note that here you cannot simplify it to scores[,2] = 30

[[ 1  2 30  4]
 [11 12 30 14]
 [21 22 30 24]]


To slice arrays there is a new syntax, which can be used with lists and data frames as well: `my_array[start_index : stop_index : step/jump]`. The element at `stop_index` itself is not included.

In [21]:
digits = np.arange(10,20)
print(digits)
print(digits[2:7:2]) # from index 2, till index 7, jumping every 2

[10 11 12 13 14 15 16 17 18 19]
[12 14 16]


In [22]:
print(digits[:7:2]) # from beginning, till index 7, jumping every 2

[10 12 14 16]


In [23]:
print(digits[2::2]) # from index 2, till end, jumping every 2

[12 14 16 18]


In [24]:
print(digits[::2]) # all, jumping every 2

[10 12 14 16 18]


In [25]:
# with negative step/jump the array gets reversed
print(digits[::-1]) # all, but index counting down
print(digits[::-3]) # all, but index counting down every 3

[19 18 17 16 15 14 13 12 11 10]
[19 16 13 10]
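The same `start : stop : step` notation carries over to plain lists, and in a multi-dimensional array each dimension can have its own slice. A short sketch, where `letters` and `grid` are made-up examples:

```python
import numpy as np

letters = ["a", "b", "c", "d", "e", "f"]
print(letters[1:5:2])   # items at index 1 and 3
print(letters[::-1])    # reversed copy of the list

grid = np.array([[1, 2, 3, 4],
                 [11, 12, 13, 14],
                 [21, 22, 23, 24]])

# in a 2-D array each dimension gets its own slice, separated by a comma
corner = grid[0:2, ::2]   # rows 0 and 1, every second column
print(corner)
```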


There is a variety of information you can request about arrays through attributes, including: 
* dimension with `.ndim`
* shape with `.shape`
* size with `.size`
* data type with `.dtype`

In [26]:
# you can request some info about a multi-dimensional array:
scores = np.array([ [1, 2, 3, 4],
                    [11, 12, 13, 14],
                    [21, 22, 23, 24]
                   ])

print("dimensions:", scores.ndim)
print("shape:", scores.shape)
print("size:", scores.size)
print("data type:", scores.dtype)

dimensions: 2
shape: (3, 4)
size: 12
data type: int64


### 2.1 Creating arrays 

You can specify the default value and type of your new empty array:

- full of zeros with `np.zeros()`
- full of ones with `np.ones()`
- full of some other value with `np.full(shape, some_value)`
- full of random values

In [None]:
np.zeros(4)

As you can see, by default the values created are floats. But you can specify the data type with the `dtype` argument. When creating arrays with specified values, it is good practice to state the desired data type explicitly. 

In [27]:
np.zeros(5, dtype = 'int')

array([0, 0, 0, 0, 0])

In [28]:
np.zeros(5, dtype = 'bool')

array([False, False, False, False, False])

You can also create multi-dimensional arrays by passing the sizes of all dimensions in a tuple (which we will learn more about next week - essentially it is an immutable list made with `()`):

- `(10)` - array of 10 elements (with a single size the parentheses are optional; a one-element tuple is written `(10,)`)
- `(5,10)` - 5 sets of 10 elements
- `(3,5,10)` - 3 sets of 5 sets of 10 elements
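The following sketch checks how these size arguments are interpreted, by looking at the `.shape` of the result. The variable names `a`, `b` and `c` are only for illustration.

```python
import numpy as np

# (10) is just the integer 10 in parentheses; (10,) is a one-element tuple
print(type((10)), type((10,)))

# np.zeros accepts either form for a one-dimensional array
a = np.zeros(10)
b = np.zeros((10,))
print(a.shape, b.shape)

# for more dimensions the tuple is required
c = np.zeros((5, 10))
print(c.shape)
```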

In [29]:
np.zeros((10), dtype = float)

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [30]:
np.zeros((5,10), dtype = int)

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

In [31]:
np.zeros((3, 5,10), dtype = int)

array([[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],

       [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],

       [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]])

You can also specify values other than zero with `np.ones( dimensions )` or with `np.full(dimensions, value)`

In [32]:
np.ones((2,5), dtype = int)

array([[1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1]])

In [33]:
np.full((2,5), 3.14)

array([[3.14, 3.14, 3.14, 3.14, 3.14],
       [3.14, 3.14, 3.14, 3.14, 3.14]])

In [34]:
# so as you can see these two do the same thing:
print(np.ones((2,5), dtype = int))
print(np.full((2,5), 1))

# in programming there are often multiple ways of doing the same thing! 

[[1 1 1 1 1]
 [1 1 1 1 1]]
[[1 1 1 1 1]
 [1 1 1 1 1]]


You can also create arrays

- from a range with `np.arange(start, stop, jump)`
- with an even split between two values with `np.linspace(start, end, slices)`
- as an identity matrix with `np.eye(size)`
- that repeat a pattern with `np.tile()`
- full of random data with `np.random.randint(max_value, size = (size_tuple))`

In [35]:
np.arange(10)

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [36]:
np.arange(5,15)

array([ 5,  6,  7,  8,  9, 10, 11, 12, 13, 14])

In [37]:
np.arange(5, 50, 10) # range has start, end, jump

array([ 5, 15, 25, 35, 45])

In [38]:
np.linspace(0, 1, 5) # from_value, to_value, how_many_slices

array([0.  , 0.25, 0.5 , 0.75, 1.  ])

In [39]:
np.linspace(0, 100, 3, dtype = int) # from_value, to_value, how_many_slices

array([  0,  50, 100])

In [40]:
np.eye(4)

array([[1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.]])

In [41]:
pattern = np.array([0, 1, 2])

np.tile(pattern, 2) # the number (or tuple) passed to tile sets how many times the pattern is repeated - here 2 times

array([0, 1, 2, 0, 1, 2])

In [42]:
np.tile(pattern, (3,2))
# repeat 2 times in one dimension, and 3 times in another dimension 

array([[0, 1, 2, 0, 1, 2],
       [0, 1, 2, 0, 1, 2],
       [0, 1, 2, 0, 1, 2]])

In [47]:
# random numbers
print(np.random.randint(10, size = 6))

# every time you run this cell, your random numbers will be different. 
## Try it!

[8 1 5 9 8 9]


In [46]:
# but if you specify a seed, your random numbers will be the same every time you run this cell! 
# this can be very helpful when debugging or in some statistical analyses using random numbers or sampling 

np.random.seed(0) # plant the seed 0 - this could be any number
print(np.random.randint(10, size=6))
print(np.random.randint(10, size=6))
print(np.random.randint(10, size=6))

## run this cell a few times to see that the numbers do not change

[5 0 3 3 7 9]
[3 5 2 4 7 6]
[8 8 1 6 7 7]
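One way to convince yourself that the seed controls the sequence is to plant the same seed twice and compare the draws. This is a minimal sketch; the seed value 42 and the names `first_draw` and `second_draw` are arbitrary choices.

```python
import numpy as np

np.random.seed(42)                            # 42 is an arbitrary choice
first_draw = np.random.randint(10, size=6)

np.random.seed(42)                            # re-planting the same seed...
second_draw = np.random.randint(10, size=6)   # ...replays the same numbers

print(first_draw)
print(second_draw)
print(np.array_equal(first_draw, second_draw))
```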


Let's now create some arrays full of random values. 

In [48]:
# ONE DIMENSION - with size 6
array1 = np.random.randint(100, size = 6)
print(array1)
print(array1[0])

[20 80 69 79 47 64]
20


In [49]:
# TWO DIMENSIONS - with size 4 for the first and 5 for the second
# these could be scores for four courses, each with 5 students

array2 = np.random.randint(100, size = (4, 5))
print(array2)
print() # empty print statement to organise output a bit easier 

print(array2[3])
print()
print(array2[3, 4])

[[82 99 88 49 29]
 [19 19 14 39 32]
 [65  9 57 32 31]
 [74 23 35 75 55]]

[74 23 35 75 55]

55


In [50]:
# THREE DIMENSIONS - with size 3 for the first, 4 for the second and 5 for the third
array3 = np.random.randint(100, size = (3, 4, 5))

print(array3)

# change the numbers in size to ensure you are understanding what they are doing 

[[[28 34  0  0 36]
  [53  5 38 17 79]
  [ 4 42 58 31  1]
  [65 41 57 35 11]]

 [[46 82 91  0 14]
  [99 53 12 42 84]
  [75 68  6 68 47]
  [ 3 76 52 78 15]]

 [[20 99 58 23 79]
  [13 85 48 49 69]
  [41 35 64 95 69]
  [94  0 50 36 34]]]


### 2.2 Reshaping arrays 

We can change the dimensions of an array with `.reshape()`. We can concatenate or flatten arrays (and lists!) using `np.concatenate()`, which will remove 1 dimension. We can also split one array into many arrays using predefined indexes with the `np.split()` function. 

In [51]:
# lets start out with a one dimensional array 
print(np.arange(12)) 

[ 0  1  2  3  4  5  6  7  8  9 10 11]


In [52]:
print(np.arange(12).reshape((2, 6)))

[[ 0  1  2  3  4  5]
 [ 6  7  8  9 10 11]]


In [53]:
print(np.arange(12).reshape((4, 3)))

[[ 0  1  2]
 [ 3  4  5]
 [ 6  7  8]
 [ 9 10 11]]


In [54]:
print(np.arange(12).reshape((3, 2, 2)))

[[[ 0  1]
  [ 2  3]]

 [[ 4  5]
  [ 6  7]]

 [[ 8  9]
  [10 11]]]


In [55]:
# This causes an error - you can't split 12 numbers into 4 sets of 5 numbers!
print(np.arange(12).reshape((4, 5)))

ValueError: cannot reshape array of size 12 into shape (4,5)
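Since the new shape has to account for exactly the same number of elements, `.reshape()` also lets you pass `-1` for one dimension and works out that size itself. The sketch below shows this, plus a simple size check that avoids the `ValueError` shown above; the variable names are illustrative.

```python
import numpy as np

values = np.arange(12)

# -1 asks NumPy to work out that dimension from the total size
auto_cols = values.reshape((3, -1))   # 12 / 3 = 4 columns
auto_rows = values.reshape((-1, 6))   # 12 / 6 = 2 rows
print(auto_cols.shape, auto_rows.shape)

# checking the size first avoids the ValueError
rows, cols = 4, 5
if values.size == rows * cols:
    print(values.reshape((rows, cols)))
else:
    print("cannot reshape", values.size, "values into", (rows, cols))
```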

In [56]:
arraya = np.array([10,20,30])
arrayb = np.array([40,50,60])
parent_array = [arraya, arrayb]

print(parent_array)

[array([10, 20, 30]), array([40, 50, 60])]


In [57]:
print( np.concatenate(parent_array) )

[10 20 30 40 50 60]


In [58]:
## works the same for 2-D lists 
lista = [70, 80,90]
listb = [1, 2, 3]

print( np.concatenate([lista, listb]) )

[70 80 90  1  2  3]


In [59]:
# you can even concatenate lists and arrays together. They really are very similar
arraya = np.array([10,20,30])
arrayb = np.array([40,50,60])
lista = [70, 80,90]

print( np.concatenate([arraya, arrayb, lista]) )

[10 20 30 40 50 60 70 80 90]


In [60]:
# concatenation respects dimensions, it will flatten only the top dimension
two_dimension_array1 = np.array([ [1,2,3],    [4,5,6] ])
two_dimension_array2 = np.array([ [10,20,30], [40,50,60] ])

print(two_dimension_array1)
print()
print(two_dimension_array2)

[[1 2 3]
 [4 5 6]]

[[10 20 30]
 [40 50 60]]


In [65]:
array_con = np.concatenate([two_dimension_array1,two_dimension_array2])
print(array_con)
print(np.concatenate([array_con[0], array_con[1], array_con[2], array_con[3]]))

[[ 1  2  3]
 [ 4  5  6]
 [10 20 30]
 [40 50 60]]
[ 1  2  3  4  5  6 10 20 30 40 50 60]


`np.concatenate` has an extra argument called `axis`: ```np.concatenate([arr1, arr2], axis = 0)```, which is 0 by default.

- `axis = 0` (the default) - join along the first dimension: the rows of the second array are stacked below the rows of the first
- `axis = 1` - join along the second dimension: each row of the second array is appended to the matching row of the first

In [66]:
two_dimension_array1 = np.array([ [1,2,3], [4,5,6] ])
two_dimension_array2 = np.array([ [10,20,30], [40,50,60] ])

print( np.concatenate([two_dimension_array1,two_dimension_array2], axis = 0) )

[[ 1  2  3]
 [ 4  5  6]
 [10 20 30]
 [40 50 60]]


In [67]:
print( np.concatenate([two_dimension_array1, two_dimension_array2], axis = 1) )

[[ 1  2  3 10 20 30]
 [ 4  5  6 40 50 60]]
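Looking at the shapes can make the effect of `axis` easier to see. In this sketch, `top` and `bottom` are made-up names for two arrays of shape `(2, 3)`.

```python
import numpy as np

top = np.array([[1, 2, 3], [4, 5, 6]])            # shape (2, 3)
bottom = np.array([[10, 20, 30], [40, 50, 60]])   # shape (2, 3)

stacked_rows = np.concatenate([top, bottom], axis=0)
stacked_cols = np.concatenate([top, bottom], axis=1)

# axis=0 grows the first dimension (rows), axis=1 grows the second (columns)
print(stacked_rows.shape)
print(stacked_cols.shape)

# flatten() is another way to collapse any array to one dimension
flat = stacked_rows.flatten()
print(flat)
```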


In [68]:
digits = np.arange(1, 10)
print(digits)

three_sub_arrays = np.split(digits, [3, 6])
print(three_sub_arrays)

[1 2 3 4 5 6 7 8 9]
[array([1, 2, 3]), array([4, 5, 6]), array([7, 8, 9])]


In [70]:
# We have not seen this syntax yet; it is typically used in more advanced Python. 
# You can set many variables in one line by assigning a List to them - see how useful data structures can be! 

a, b, c = [10,20,30]
print(a)
print(a, b, c)

10
10 20 30


In [71]:
# so split can be used as follows: 

start, middle, end = np.split(digits, [3, 6])
print(start, middle, end)

[1 2 3] [4 5 6] [7 8 9]
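The second argument of `np.split()` can be either a list of cut positions or a single integer. A short sketch of both forms, with made-up variable names:

```python
import numpy as np

digits = np.arange(1, 10)

# a list gives the indexes at which to cut: before index 2 and before index 5
head, body, tail = np.split(digits, [2, 5])
print(head, body, tail)

# a single integer asks for that many equal-sized pieces
thirds = np.split(digits, 3)
print(thirds)
```

With an integer, the array length must divide evenly, otherwise `np.split()` raises an error.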


In [72]:
# Note: you could achieve the same effect with several separate calls to np.arange()
# but that requires much more thinking and leaves more opportunities for mistakes 

first = np.arange(1, 4)
second = np.arange(4, 7)
third = np.arange(7, 10)
print(first, second, third)

# but why do something the hard way if there is a proper syntax for it?

[1 2 3] [4 5 6] [7 8 9]


### 2.3 Array Truthy and Falsy
Like the numbers we learned about in Weeks 2 and 3, arrays evaluate as truthy or falsy depending on how their values compare to 0. 

In [74]:
a1 = np.array([0])

print(a1)

print(len(a1))

print(bool(a1))

[0]
1
False

Even though `a1` has a length of 1, it is still falsy because its value is 0. When arrays have more than one element, some elements might be falsy and some might be truthy. In those cases, NumPy will raise a `ValueError`.

In [75]:
a2 = np.array([0, 1])

bool(a2) #produces a value error

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

In [76]:
## are any values truthy
print(a2.any())

## are all values truthy
print(a2.all())

True
False
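In practice `.any()` and `.all()` are most often applied to the result of a comparison, so that an `if` statement receives a single True or False. A small sketch, with hypothetical exam marks and an assumed pass mark of 40:

```python
import numpy as np

marks = np.array([55, 72, 38, 90])   # hypothetical exam marks

passed = marks >= 40                 # element-wise comparison gives a boolean array
print(passed)

# reduce the boolean array to a single True/False before using it in an if
if passed.all():
    print("everyone passed")
elif passed.any():
    print("some passed:", passed.sum(), "of", marks.size)
else:
    print("nobody passed")
```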


## 3. Dictionary 

As a reminder, we create dict data structures in Python using curly brackets `{}` and specifying the `key:value` pair.

In [77]:
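# note: the name `dict` shadows Python's built-in dict type; it is used here and in the cells below
dict = {"city" : "Edinburgh",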
       "univeristy" : "UoE",
       "year" : 1583}

print(dict)

{'city': 'Edinburgh', 'univeristy': 'UoE', 'year': 1583}


Dictionaries are a built-in data structure, like lists, so there are a lot of useful methods for working with them. To see these, type the name of the dictionary object followed by a dot and hit the [TAB] key.

In [None]:
## try this here - put your cursor after the . and hit [TAB]
dict.

In [78]:
# to see the list of keys 
print(dict.keys())

#to see the list of values 
print(dict.values())

dict_keys(['city', 'univeristy', 'year'])
dict_values(['Edinburgh', 'UoE', 1583])


You can also access each key-value pair within a dictionary using the `.items()` method

In [79]:
print(dict.items())

dict_items([('city', 'Edinburgh'), ('univeristy', 'UoE'), ('year', 1583)])
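Because `.items()` yields `(key, value)` pairs, it combines naturally with a for loop. The sketch below uses a made-up dictionary called `course` and also shows two related conveniences: `.get()` and the `in` test.

```python
course = {"title": "Data Structures", "week": 4}   # hypothetical dictionary

# .items() yields (key, value) tuples, which can be unpacked in a for loop
for key, value in course.items():
    print(key, "->", value)

# .get() returns a fallback instead of raising KeyError for a missing key
room = course.get("room", "unknown")
print(room)

# membership tests look at the keys
print("week" in course)
```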


Since we cannot have duplicate keys, you can change the `key:value` mapping by setting the key to a new value.

In [80]:
dict["city"] = "Glasgow"

print(dict)

{'city': 'Glasgow', 'univeristy': 'UoE', 'year': 1583}


Using the same syntax though with a unique key, you can add a new item to a dict data structure. 

In [81]:
dict["campus"] = "George square"

print(dict)

{'city': 'Glasgow', 'univeristy': 'UoE', 'year': 1583, 'campus': 'George square'}


### 3.1 Dictionary comprehension 

Dictionary comprehension has been available since Python 2.7 and, like list comprehension, is an efficient way of creating new dictionaries. Dictionary comprehensions take the form: `{key: value for (key, value) in iterable}`

In [82]:
dict_comp = {x: x**2 for x in [1,2,3,4,5]} 

print(dict_comp)

{1: 1, 2: 4, 3: 9, 4: 16, 5: 25}


Let's say that now we want to double each value in our dictionary that we created called `dict_comp`

In [83]:
dict_double = {k:v*2 for (k,v) in dict_comp.items()}

print(dict_double)

{1: 2, 2: 8, 3: 18, 4: 32, 5: 50}


You can also use dictionary comprehension to make changes to the keys. Let's make the same dictionary as `dict_double` but also change the keys.

In [84]:
dict_keys = {k*10:v for (k,v) in dict_double.items()}

print(dict_keys)

{10: 2, 20: 8, 30: 18, 40: 32, 50: 50}
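As with list comprehension, a condition can be added at the end to keep only some pairs, and the iterable does not have to be another dictionary. A brief sketch with made-up names:

```python
squares = {x: x**2 for x in range(1, 6)}

# keep only the pairs whose value is even
even_squares = {k: v for (k, v) in squares.items() if v % 2 == 0}
print(even_squares)

# build a dictionary from two lists with zip()
names = ["a", "b", "c"]
numbers = [1, 2, 3]
paired = {name: number for (name, number) in zip(names, numbers)}
print(paired)
```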


A useful feature of Python dictionaries is that they can be converted to data frames. When a dictionary is converted to a data frame, each key becomes a column name and its value becomes the elements of the series (in other words, the rows of a column). The function we use for this is `pd.DataFrame.from_dict()`. Use `orient='index'` to create the DataFrame using dictionary keys as rows instead.

In [87]:
pd.DataFrame.from_dict(dict, orient = "index")

                        0
city              Glasgow
univeristy            UoE
year                 1583
campus      George square
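To compare the two orientations side by side, here is a sketch using a small made-up dictionary whose values are lists of equal length. The name `cities` and its contents are hypothetical.

```python
import pandas as pd

# hypothetical data: each key maps to a list of equal length
cities = {"city": ["Edinburgh", "Glasgow"],
          "visits": [3, 1]}

# default orient="columns": keys become column names
by_columns = pd.DataFrame.from_dict(cities)
print(by_columns)

# orient="index": keys become row labels instead
by_index = pd.DataFrame.from_dict(cities, orient="index")
print(by_index)
```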


## 4. Data frames

A data frame is a 2-dimensional structure (rows and columns) which can contain elements of heterogeneous data types. Data frames are not a built-in Python data structure, thus we need the `pandas` package in order to access and work with this data structure.

We will be using the gapminder data set, which is available in R in the `gapminder` package. We saw this dataset briefly in the Week 1 tutorial. Let's read in the data.

In [89]:
gap_data = pd.read_csv("../data/gapminder_data_unfiltered.csv")

In [90]:
## check data descriptions 
print(gap_data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3313 entries, 0 to 3312
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   country    3313 non-null   object 
 1   continent  3313 non-null   object 
 2   year       3313 non-null   int64  
 3   lifeExp    3313 non-null   float64
 4   pop        3313 non-null   int64  
 5   gdpPercap  3313 non-null   float64
dtypes: float64(2), int64(2), object(2)
memory usage: 155.4+ KB
None


In [91]:
gap_data.describe()

              year    lifeExp           pop     gdpPercap
count       3313.0     3313.0        3313.0        3313.0
mean   1980.293088  65.240457    31773250.0  11313.820704
std      16.931879  11.772424   104501900.0  11369.008362
min         1950.0     23.599       59412.0    241.165876
25%         1967.0     58.333     2680018.0   2505.291423
50%         1982.0      69.61     7559776.0   7825.823398
75%         1996.0     73.657    19610540.0    17355.7471
max         2007.0      82.67  1318683000.0   113523.1329


In [92]:
# include all columns not just numeric data - by default the describe method includes only numeric data  
print(gap_data.describe(include = "all"))

       country continent         year      lifeExp           pop  \
count     3313      3313  3313.000000  3313.000000  3.313000e+03   
unique     187         6          NaN          NaN           NaN   
top      Spain    Europe          NaN          NaN           NaN   
freq        58      1302          NaN          NaN           NaN   
mean       NaN       NaN  1980.293088    65.240457  3.177325e+07   
std        NaN       NaN    16.931879    11.772424  1.045019e+08   
min        NaN       NaN  1950.000000    23.599000  5.941200e+04   
25%        NaN       NaN  1967.000000    58.333000  2.680018e+06   
50%        NaN       NaN  1982.000000    69.610000  7.559776e+06   
75%        NaN       NaN  1996.000000    73.657000  1.961054e+07   
max        NaN       NaN  2007.000000    82.670000  1.318683e+09   

            gdpPercap  
count     3313.000000  
unique            NaN  
top               NaN  
freq              NaN  
mean     11313.820704  
std      11369.008362  
min        241.

In [93]:
## check df dtype of data frame columns
gap_data.dtypes

country       object
continent     object
year           int64
lifeExp      float64
pop            int64
gdpPercap    float64
dtype: object

To slice or subset rows, we can use the same notation as with arrays `dataframe[start: stop: step]`

In [94]:
print(gap_data[1:3])

       country continent  year  lifeExp       pop  gdpPercap
1  Afghanistan      Asia  1957   30.332   9240934  820.85303
2  Afghanistan      Asia  1962   31.997  10267083  853.10071


To access columns, we can use `[]` with the column name or a dot.

In [95]:
# get a column with [] 
# Notice meta information about names and data types at the bottom!
print(gap_data['year'])

0       1952
1       1957
2       1962
3       1967
4       1972
        ... 
3308    1987
3309    1992
3310    1997
3311    2002
3312    2007
Name: year, Length: 3313, dtype: int64


In [96]:
# but quite frequently you would use a . dot notation, like this:
print(gap_data.year)

0       1952
1       1957
2       1962
3       1967
4       1972
        ... 
3308    1987
3309    1992
3310    1997
3311    2002
3312    2007
Name: year, Length: 3313, dtype: int64


In [97]:
# to get individual items
print(gap_data.year[0])

# same as - just a different style of syntax 
print(gap_data['year'][0])

1952
1952


In [98]:
# or a few individual items
print(gap_data.year[0:5])

0    1952
1    1957
2    1962
3    1967
4    1972
Name: year, dtype: int64


### 4.1 the Index 

The index is the most important part of your data. It should be unique, but does not have to be. 

If you do not specify the index in your data, `pandas` will just use consecutive numbers starting from 0 (like 0,1,2,3,4,...). It's sort of like a row name in Excel.

```.set_index(a_column_name)``` will set a column with name `a_column_name` to be the index

```drop=False``` will make the old column stay (it will sort-of get duplicated and you'd have two identical columns: the original one, and the new index column)

You could also have many columns act as indexes, but we will not go into that. If you wanted to do that, just pass a List of column names to `set_index` rather than one column name.
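Although it is not needed for the rest of this notebook, the sketch below illustrates the points above on a tiny made-up data frame (its column names are only loosely modelled on the gapminder data): an index with repeated values, and a list of columns used as the index.

```python
import pandas as pd

# small hypothetical data frame, loosely modelled on the gapminder columns
df = pd.DataFrame({"country": ["A", "A", "B", "B"],
                   "year": [2002, 2007, 2002, 2007],
                   "pop": [10, 12, 50, 55]})

by_year = df.set_index("year")
print(by_year.index.is_unique)      # years repeat, so the index is not unique

# a list of column names gives a two-level index that is unique here
by_country_year = df.set_index(["country", "year"])
print(by_country_year.index.is_unique)
print(by_country_year.loc[("B", 2007)])
```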

In [99]:
# the index is the numbers on the left (0-3312)
# notice it does not have a column/variable name 
gap_data

          country continent  year  lifeExp       pop   gdpPercap
0     Afghanistan      Asia  1952   28.801   8425333  779.445314
1     Afghanistan      Asia  1957   30.332   9240934  820.853030
2     Afghanistan      Asia  1962   31.997  10267083  853.100710
3     Afghanistan      Asia  1967   34.020  11537966  836.197138
4     Afghanistan      Asia  1972   36.088  13079460  739.981106
...           ...       ...   ...      ...       ...         ...
3308     Zimbabwe    Africa  1987   62.351   9216418  706.157306
3309     Zimbabwe    Africa  1992   60.377  10704340  693.420786
3310     Zimbabwe    Africa  1997   46.809  11404948  792.449960
3311     Zimbabwe    Africa  2002   39.989  11926563  672.038623


In [100]:
# we can turn 1 of the columns into the index 
# notice the spacing difference between the column names and the index name 
gap_data.set_index("year")

          country continent  lifeExp       pop   gdpPercap
year
1952  Afghanistan      Asia   28.801   8425333  779.445314
1957  Afghanistan      Asia   30.332   9240934  820.853030
1962  Afghanistan      Asia   31.997  10267083  853.100710
1967  Afghanistan      Asia   34.020  11537966  836.197138
1972  Afghanistan      Asia   36.088  13079460  739.981106
...           ...       ...      ...       ...         ...
1987     Zimbabwe    Africa   62.351   9216418  706.157306
1992     Zimbabwe    Africa   60.377  10704340  693.420786
1997     Zimbabwe    Africa   46.809  11404948  792.449960
2002     Zimbabwe    Africa   39.989  11926563  672.038623


In [101]:
# but notice above year is now gone. to keep that column as it was, add drop = False
gap_data_yindex = gap_data.set_index("year", drop = False)
gap_data_yindex

          country continent  year  lifeExp       pop   gdpPercap
year
1952  Afghanistan      Asia  1952   28.801   8425333  779.445314
1957  Afghanistan      Asia  1957   30.332   9240934  820.853030
1962  Afghanistan      Asia  1962   31.997  10267083  853.100710
1967  Afghanistan      Asia  1967   34.020  11537966  836.197138
1972  Afghanistan      Asia  1972   36.088  13079460  739.981106
...           ...       ...   ...      ...       ...         ...
1987     Zimbabwe    Africa  1987   62.351   9216418  706.157306
1992     Zimbabwe    Africa  1992   60.377  10704340  693.420786
1997     Zimbabwe    Africa  1997   46.809  11404948  792.449960
2002     Zimbabwe    Africa  2002   39.989  11926563  672.038623


In [105]:
# so now you can use your index to get whole rows from the dataframe
# this can be bit cleaner than indexes 1,2,3,4... depending on your data
gap_data_yindex.loc[1987]

## see section below for what the .loc method does 

                 country continent  year  lifeExp       pop    gdpPercap
year
1987         Afghanistan      Asia  1987   40.822  13867957   852.395945
1987             Albania    Europe  1987   72.000   3075321  3738.932735
1987             Algeria    Africa  1987   65.799  23254956  5681.358539
1987              Angola    Africa  1987   39.906   7874230  2430.208311
1987           Argentina  Americas  1987   70.774  31620918  9139.671389
...                  ...       ...   ...      ...       ...          ...
1987             Vietnam      Asia  1987   62.820  62826491   820.799445
1987  West Bank and Gaza      Asia  1987   67.046   1691210  5107.197384
1987         Yemen, Rep.      Asia  1987   52.922  11219340  1971.741538
1987              Zambia    Africa  1987   50.821   7272406  1213.315116


### 4.2 Working with data frames 

When working with data frames, there are a wide variety of functionalities available to you through `pandas`. Here we will focus on some key functions and attributes: 

* `groupby()` to perform subgroup analysis on your data
* `reset_index()` to reset the index of your data frame. It is good practice to do this after `groupby()` in particular to avoid funky and unexpected results
* `assign()` to assign or calculate a new column. `assign()` *does not* change the original data frame, which is a special change of pace! It returns a new data frame instead, and it has no `inplace` option: if you wrote `inplace=True`, it would just add a column called 'inplace' and put the value True in every row of that column. 
* `agg()` to aggregate using one or more operations over specified columns or rows 
* `isin()` alongside slicing to filter specific rows of a data frame
* `query()` to query or filter rows of a data frame with a boolean expression 
* `loc` to select rows and columns by labels or a boolean array, stands for location
* `iloc` to select rows and columns by integer-location indexing for selecting by position, stands for integer-location
* `df.astype()` to change the data type of a `pandas` object
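To show a few of these functions working together, here is a minimal sketch on a tiny made-up data frame. The data, and the names `summary`, `with_gdp` and `pop_as_float`, are assumptions for illustration and are not taken from the gapminder file.

```python
import pandas as pd

# small hypothetical data frame, loosely modelled on the gapminder columns
df = pd.DataFrame({"country": ["A", "B", "C", "D"],
                   "continent": ["Asia", "Asia", "Europe", "Europe"],
                   "pop": [10, 30, 5, 15],
                   "gdpPercap": [1.0, 2.0, 4.0, 6.0]})

# groupby + agg: several summaries per continent; reset_index turns the
# group labels back into an ordinary column
summary = (df.groupby("continent")
             .agg(total_pop=("pop", "sum"), mean_gdp=("gdpPercap", "mean"))
             .reset_index())
print(summary)

# assign returns a new data frame; df itself keeps its original columns
with_gdp = df.assign(gdp=df["pop"] * df["gdpPercap"])
print(with_gdp.columns.tolist())
print(df.columns.tolist())

# astype converts a column to another data type
pop_as_float = df["pop"].astype(float)
print(pop_as_float.dtype)
```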

You can also assign new values to a selection of your data frame using `loc`/`iloc`.
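A short sketch of such assignments, on a made-up data frame with hypothetical row labels `x`, `y` and `z`, so that selecting by label (`loc`) and by position (`iloc`) are easy to tell apart:

```python
import pandas as pd

df = pd.DataFrame({"country": ["A", "B", "C"],
                   "pop": [10, 30, 5]},
                  index=["x", "y", "z"])     # hypothetical row labels

# loc selects by label: row "y", column "pop"
df.loc["y", "pop"] = 35

# iloc selects by position: first row, second column
df.iloc[0, 1] = 11

# loc also accepts a boolean condition to pick the rows to change
df.loc[df["pop"] < 10, "pop"] = 0
print(df)
```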

When combining multiple conditional statements, each condition must be surrounded by parentheses `()`. Additionally, in `pandas` you *cannot* use `or`/`and` between such conditions but need to use the 'or' operator `|` and the 'and' operator `&`.
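The sketch below applies these rules to a small made-up data frame, and shows `isin()` and `query()` as alternatives for the same kind of row filtering. All names and values are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"country": ["A", "B", "C", "D"],
                   "continent": ["Asia", "Asia", "Europe", "Africa"],
                   "year": [1977, 2007, 2007, 2007]})

# each condition in its own parentheses, joined with & (and) or | (or)
asia_2007 = df[(df["continent"] == "Asia") & (df["year"] == 2007)]
asia_or_old = df[(df["continent"] == "Asia") | (df["year"] < 1980)]

# isin() replaces a chain of | comparisons against several values
two_continents = df[df["continent"].isin(["Europe", "Africa"])]

# query() takes the condition as a string, where and/or are allowed
same_as_first = df.query("continent == 'Asia' and year == 2007")

print(asia_2007)
print(two_continents)
```

Inside the string passed to `query()`, the words `and` and `or` are accepted, which some people find easier to read than `&` and `|`.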

`pandas` data frames also provide a full set of statistical methods. If you need something specific, always [look in the documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html).

In [107]:
# mean of all populations in the dataset grouped by continent 
print(gap_data["pop"].groupby(gap_data["continent"]).mean())

# the same summary, but only for the year 2007
gap_data_2007 = gap_data[gap_data['year'] == 2007]
print(gap_data_2007["pop"].groupby(gap_data_2007["continent"]).mean())

continent
Africa      9.728850e+06
Americas    3.941673e+07
Asia        9.544418e+07
Europe      1.531594e+07
FSU         3.179300e+07
Oceania     5.424172e+06
Name: pop, dtype: float64
continent
Africa      1.754648e+07
Americas    2.731354e+07
Asia        8.932016e+07
Europe      1.744369e+07
FSU         2.840137e+07
Oceania     2.995027e+06
Name: pop, dtype: float64


In [108]:
# subset data to be only the country and continent variables
gap_filter = gap_data.iloc[:, 0:2].copy()

## [:, 0:2] follows [rows, columns]: all rows, and only the first 2 columns
## .copy() returns an independent copy, so that changing gap_filter does not also change gap_data

gap_filter

Unnamed: 0,country,continent
0,Afghanistan,Asia
1,Afghanistan,Asia
2,Afghanistan,Asia
3,Afghanistan,Asia
4,Afghanistan,Asia
...,...,...
3308,Zimbabwe,Africa
3309,Zimbabwe,Africa
3310,Zimbabwe,Africa
3311,Zimbabwe,Africa


In [109]:
# filter data to be only 1980 and after
gap_data.loc[gap_data["year"] >= 1980]

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
6,Afghanistan,Asia,1982,39.854,12881816,978.011439
7,Afghanistan,Asia,1987,40.822,13867957,852.395945
8,Afghanistan,Asia,1992,41.674,16317921,649.341395
9,Afghanistan,Asia,1997,41.763,22227415,635.341351
10,Afghanistan,Asia,2002,42.129,25268405,726.734055
...,...,...,...,...,...,...
3308,Zimbabwe,Africa,1987,62.351,9216418,706.157306
3309,Zimbabwe,Africa,1992,60.377,10704340,693.420786
3310,Zimbabwe,Africa,1997,46.809,11404948,792.449960
3311,Zimbabwe,Africa,2002,39.989,11926563,672.038623


In [110]:
# filter data to be specific countries
gap_data.loc[gap_data["country"].isin(["Thailand", "Peru"])]

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
2313,Peru,Americas,1952,43.902,8025700,3758.523437
2314,Peru,Americas,1957,46.263,9146100,4245.256698
2315,Peru,Americas,1962,49.096,10516500,4957.037982
2316,Peru,Americas,1967,51.445,12132200,5788.09333
2317,Peru,Americas,1972,55.448,13954700,5937.827283
2318,Peru,Americas,1977,58.447,15990099,6281.290855
2319,Peru,Americas,1982,61.406,18125129,6434.501797
2320,Peru,Americas,1987,64.134,20195924,6360.943444
2321,Peru,Americas,1992,66.458,22430449,4446.380924
2322,Peru,Americas,1997,68.386,24748122,5838.347657
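
The same kind of filter can also be written with `query()`, which was listed above but not yet shown. The sketch below uses a made-up data frame (`toy` is an assumption for illustration), and compares the boolean-mask approach with the query-string approach.

```python
import pandas as pd

# hypothetical toy data, used only for illustration
toy = pd.DataFrame(
    {"country": ["Peru", "Thailand", "Chile", "Peru"],
     "year": [1952, 1952, 1952, 2007]})

# boolean mask with isin(), as in the cell above
by_mask = toy.loc[toy["country"].isin(["Thailand", "Peru"])]

# the same filter written as a query string
by_query = toy.query('country in ["Thailand", "Peru"]')

# inside a query string, the words and/or can be used to combine conditions
recent_peru = toy.query('country == "Peru" and year >= 1980')

print(by_query)
print(recent_peru)
```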


In [111]:
# get the mean of a column 
print(gap_data["gdpPercap"].mean())

## or you could use 
print(gap_data.gdpPercap.mean())

11313.820704426891
11313.820704426891


In [112]:
# create a new variable called mean_gdpPercap with the mean of the whole data set using assign 
gap_data.assign(mean_gdpPercap = gap_data["gdpPercap"].mean())

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap,mean_gdpPercap
0,Afghanistan,Asia,1952,28.801,8425333,779.445314,11313.820704
1,Afghanistan,Asia,1957,30.332,9240934,820.853030,11313.820704
2,Afghanistan,Asia,1962,31.997,10267083,853.100710,11313.820704
3,Afghanistan,Asia,1967,34.020,11537966,836.197138,11313.820704
4,Afghanistan,Asia,1972,36.088,13079460,739.981106,11313.820704
...,...,...,...,...,...,...,...
3308,Zimbabwe,Africa,1987,62.351,9216418,706.157306,11313.820704
3309,Zimbabwe,Africa,1992,60.377,10704340,693.420786,11313.820704
3310,Zimbabwe,Africa,1997,46.809,11404948,792.449960,11313.820704
3311,Zimbabwe,Africa,2002,39.989,11926563,672.038623,11313.820704


In [113]:
# let's try again and see how inplace does something unexpected, as mentioned above 
gap_data.assign(mean_gdpPercap = gap_data["gdpPercap"].mean(), inplace = True)

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap,mean_gdpPercap,inplace
0,Afghanistan,Asia,1952,28.801,8425333,779.445314,11313.820704,True
1,Afghanistan,Asia,1957,30.332,9240934,820.853030,11313.820704,True
2,Afghanistan,Asia,1962,31.997,10267083,853.100710,11313.820704,True
3,Afghanistan,Asia,1967,34.020,11537966,836.197138,11313.820704,True
4,Afghanistan,Asia,1972,36.088,13079460,739.981106,11313.820704,True
...,...,...,...,...,...,...,...,...
3308,Zimbabwe,Africa,1987,62.351,9216418,706.157306,11313.820704,True
3309,Zimbabwe,Africa,1992,60.377,10704340,693.420786,11313.820704,True
3310,Zimbabwe,Africa,1997,46.809,11404948,792.449960,11313.820704,True
3311,Zimbabwe,Africa,2002,39.989,11926563,672.038623,11313.820704,True
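
Because `assign()` returns a new data frame, the usual pattern is to save the result under a name. A minimal sketch on made-up data (`toy` is an assumption for illustration):

```python
import pandas as pd

# hypothetical toy data, used only for illustration
toy = pd.DataFrame({"gdpPercap": [100.0, 200.0, 300.0]})

# assign() returns a new data frame; toy itself is unchanged
toy.assign(mean_gdpPercap = toy["gdpPercap"].mean())
print(list(toy.columns))

# reassign the result to keep the new column
toy = toy.assign(mean_gdpPercap = toy["gdpPercap"].mean())
print(list(toy.columns))
```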


In [114]:
# to get the mean population across all years per country, we first need to group the data by country 
gap_data["pop"].groupby(gap_data["country"]).mean()

country
Afghanistan           1.582372e+07
Albania               2.580249e+06
Algeria               1.987541e+07
Angola                7.309390e+06
Argentina             2.860224e+07
                          ...     
Vietnam               5.456857e+07
West Bank and Gaza    1.848606e+06
Yemen, Rep.           1.084319e+07
Zambia                6.353805e+06
Zimbabwe              7.641966e+06
Name: pop, Length: 187, dtype: float64
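
The result above is a Series whose index is the country name. As mentioned earlier, `reset_index()` is a handy follow-up to `groupby()`: it moves the group labels back into a regular column. A small sketch on made-up data (`toy` is an assumption for illustration):

```python
import pandas as pd

# hypothetical toy data, used only for illustration
toy = pd.DataFrame(
    {"country": ["A", "A", "B", "B"],
     "pop": [10, 30, 100, 300]})

# groupby + mean returns a Series indexed by country
mean_pop = toy["pop"].groupby(toy["country"]).mean()

# reset_index() turns it into a data frame with country as a regular column
mean_pop_df = mean_pop.reset_index()

print(mean_pop_df)
```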

We can use our knowledge of the dictionary data structure to get a summary table of our data with aggregate statistics per column of interest: 

In [117]:
# to get a summary table with aggregations per column
gap_data.agg({'year' : ['min', 'max'], 'pop' : ['sum', 'min', 'max']})

Unnamed: 0,year,pop
min,1950.0,59412
max,2007.0,1318683096
sum,,105264781912


We can cast specific columns in a data frame to another dtype using `astype()` and a dictionary. Let's make `year` a category:

In [132]:
gap_data_ycat = gap_data.astype({"year": "category"})
gap_data_ycat['year'] = gap_data_ycat['year'].cat.as_ordered()
gap_data_ycat.agg({'year' : ['min', 'max'], 'pop' : ['sum', 'min', 'max']})
# to check our work, we could instead end the cell with gap_data_ycat.dtypes 
## this prints out the dtype of each column after changing the year dtype 

Unnamed: 0,year,pop
min,1950.0,59412
max,2007.0,1318683096
sum,,105264781912
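
To check the result of a cast, we can look at the `.dtypes` attribute. The sketch below uses a made-up data frame (`toy` is an assumption for illustration) and casts two columns in one call.

```python
import pandas as pd

# hypothetical toy data, used only for illustration
toy = pd.DataFrame(
    {"year": [1952, 1957, 1952],
     "pop": [10, 20, 30]})

# the dictionary maps column names to the dtype we want
toy_cast = toy.astype({"year": "category", "pop": "float64"})

# .dtypes lists the dtype of every column, so we can check our work
print(toy_cast.dtypes)
```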


### 4.3 Missing values in data frames

`pandas` treats `None` and `NaN` as essentially interchangeable for indicating missing or null values. To facilitate this convention, there are several useful functions for detecting, removing, and replacing null values in a `pandas` DataFrame:

* `isnull()` returns a data frame of Boolean values which are `True` for NaN values 
* `notnull()` returns a data frame of Boolean values which are `False` for NaN values 
* `df.dropna()` allows you to drop rows or columns with missing values in different ways. By default it uses row-wise deletion, so all rows with any missing data are deleted
* `df.fillna()` allows you to replace missing values with some other specified value 
* `df.replace()` replaces a string, regex, list, dictionary, series, number, etc. in a `pandas` data frame with another value
* `df.interpolate()` uses various interpolation techniques to fill the missing values rather than hard-coding the value. This is particularly useful with numeric data, and there are numerous methods available - [see the documentation if you are interested in learning more](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.interpolate.html)

We looked at dealing with missing data in Weeks 2 & 3, but let's create a data frame with some missing data to remind ourselves how this works. 

In [133]:
# to create a DataFrame we pass a dict to its constructor; remember that the values need to be lists
patients = pd.DataFrame(
    {"names": ["Angela", "Shondra", np.nan, "Ben"],
     "age": [27, np.nan, 57, 44],
     "result": [True, False, np.nan, False]})

patients.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   names   3 non-null      object 
 1   age     3 non-null      float64
 2   result  3 non-null      object 
dtypes: float64(1), object(2)
memory usage: 228.0+ bytes


In [134]:
patients.isnull()

Unnamed: 0,names,age,result
0,False,False,False
1,False,True,False
2,True,False,True
3,False,False,False


In [135]:
# to check for missing values in a specific column
patients.names.isnull()

0    False
1    False
2     True
3    False
Name: names, dtype: bool
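
Two common follow-ups are counting the missing values per column and using `notnull()` to keep the rows where a value is known. The sketch below rebuilds the same `patients` data frame so that it runs on its own.

```python
import numpy as np
import pandas as pd

# same structure as the patients data frame above
patients = pd.DataFrame(
    {"names": ["Angela", "Shondra", np.nan, "Ben"],
     "age": [27, np.nan, 57, 44],
     "result": [True, False, np.nan, False]})

# True counts as 1, so summing isnull() gives the missing values per column
missing_per_column = patients.isnull().sum()
print(missing_per_column)

# notnull() is the mirror image; here it keeps only rows with a known age
known_age = patients[patients["age"].notnull()]
print(known_age)
```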

In [136]:
patients.dropna()

Unnamed: 0,names,age,result
0,Angela,27.0,True
3,Ben,44.0,False
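
The default above drops every row with any missing value. As a sketch of the "different ways" mentioned earlier, `dropna()` also accepts the `subset`, `thresh`, and `axis` arguments. The cell rebuilds `patients` so that it runs on its own.

```python
import numpy as np
import pandas as pd

# same structure as the patients data frame above
patients = pd.DataFrame(
    {"names": ["Angela", "Shondra", np.nan, "Ben"],
     "age": [27, np.nan, 57, 44],
     "result": [True, False, np.nan, False]})

# drop rows only when age is missing
age_known = patients.dropna(subset = ["age"])

# keep rows that have at least 2 non-missing values
mostly_complete = patients.dropna(thresh = 2)

# drop columns (rather than rows) that contain any missing value
complete_columns = patients.dropna(axis = "columns")

print(age_known)
print(mostly_complete)
print(complete_columns.shape)
```

In this small example every column contains a missing value, so the column-wise version leaves no columns at all.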


In [137]:
patients.fillna("missing")

Unnamed: 0,names,age,result
0,Angela,27.0,True
1,Shondra,missing,False
2,missing,57.0,missing
3,Ben,44.0,False
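
Filling every column with the same string is rarely what we want for numeric data. As a sketch, `fillna()` also accepts a dictionary that maps column names to replacement values. The choice of the mean age here is only an illustration, not a recommendation for real analyses.

```python
import numpy as np
import pandas as pd

# same structure as the patients data frame above
patients = pd.DataFrame(
    {"names": ["Angela", "Shondra", np.nan, "Ben"],
     "age": [27, np.nan, 57, 44],
     "result": [True, False, np.nan, False]})

# fill each listed column with its own value; result is left untouched
filled = patients.fillna(
    {"names": "unknown",
     "age": patients["age"].mean()})

print(filled)
```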


In [143]:
# use the method argument in fillna to fill missing data with the previous item 
patients.fillna(method = "ffill") ## ffill being forward fill; newer pandas versions deprecate this argument in favor of patients.ffill() 

Unnamed: 0,names,age,result
0,Angela,27.0,True
1,Shondra,27.0,False
2,Shondra,57.0,False
3,Ben,44.0,False


In [145]:
# or to fill the missing data with the next valid observation 
patients.bfill() # bfill meaning back fill 

Unnamed: 0,names,age,result
0,Angela,27.0,True
1,Shondra,57.0,False
2,Ben,57.0,False
3,Ben,44.0,False


In [146]:
# the default method is linear which applies only to numeric data
patients.interpolate(method = "linear")

Unnamed: 0,names,age,result
0,Angela,27.0,True
1,Shondra,42.0,False
2,,57.0,
3,Ben,44.0,False
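
Since linear interpolation only makes sense for numeric data, one option is to apply it to the numeric column alone. This is a minimal sketch that rebuilds `patients` so that it runs on its own. The missing age in row 1 is replaced by the midpoint of its neighbours, 27 and 57.

```python
import numpy as np
import pandas as pd

# same structure as the patients data frame above
patients = pd.DataFrame(
    {"names": ["Angela", "Shondra", np.nan, "Ben"],
     "age": [27, np.nan, 57, 44],
     "result": [True, False, np.nan, False]})

# interpolate only the numeric age column
age_interp = patients["age"].interpolate(method = "linear")

print(age_interp)
```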


In [147]:
patients.interpolate(method = "pad") # this method is the same as ffill 


Unnamed: 0,names,age,result
0,Angela,27.0,True
1,Shondra,27.0,False
2,Shondra,57.0,False
3,Ben,44.0,False


---

## You did it! 🎉 

Well done for making it to the end of this notebook. If you have not done so yet, move to the Week 4 data structures in R RMarkdown notebook next. 

⭐⭐⭐❓👣 Do not forget your 3 stars, a wish, and a step mini-diaries once you have completed the content for this week. 

---
*Dr. Brittany Blankinship (2024)*