# Session 1: Python


---
## 1. Preliminaries

### 1.1 Getting started

We will be working in Google Colab throughout the course. You can find more details about Colab [here](https://colab.research.google.com/notebooks/basic_features_overview.ipynb).

There are different styles of writing code, but try to stick consistently to a (sensible) style guide: it is essential that the code not only works but is also understandable by everyone. The "official" guide for `python` can be found here:

* Style guide: https://www.python.org/dev/peps/pep-0008/

The rules for writing good code are mostly common sense, just like those of natural language. You can make them your own (within reason); the important part is to be **consistent**.

### 1.2 Printing

It can be useful to *print* what we are running to the console. This can be done with the built-in `print` function. You might have noticed that Colab notebooks automatically display the value of the last expression in a cell when you execute it, so you don't need to print that.

In [1]:
# Colab automatically prints "y" to "out", 
# but we need to manually print "x" if we want to see it
x = 15 / 2
print(x)
y = x > 2
y

7.5


True

We may want to use placeholders to print strings. There are a number of ways to do that.

In [2]:
a = 15
b = 2
c = 2
f'{a} divided by {b} is {a/b}, and it is {(a/b) > c} that this is greater than {c}'

'15 divided by 2 is 7.5, and it is True that this is greater than 2'

We will go into what these objects are shortly.
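Besides the f-string used above, two other built-in ways of filling placeholders are `str.format` and the older `%` operator. The following is a minimal sketch that reuses the values of `a` and `b` from the cell above; the variable names starting with `msg_` are made up for illustration.

```python
a = 15
b = 2

# 1. f-string: expressions are written directly inside the braces
msg_f = f"{a} divided by {b} is {a / b}"

# 2. str.format: the braces are filled in order by the arguments
msg_format = "{} divided by {} is {}".format(a, b, a / b)

# 3. %-formatting: an older style that you may still find in existing code
msg_percent = "%d divided by %d is %.1f" % (a, b, a / b)

# A format specification after ":" controls how a value is displayed
msg_rounded = f"{a / 7:.2f}"  # keep two decimals

print(msg_f)
print(msg_format)
print(msg_percent)
print(msg_rounded)
```

All three approaches build the same string here; f-strings are usually the most readable option in recent versions of Python.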

---
## 2. Basic data types

Here we highlight the most important native operations you can apply to basic objects in `python`. We will look at the structure of these operations in more detail in the `R` session later in the afternoon.

### 2.1 Numerical values

In [3]:
# Assignment
a = 10  # 10

# Increment/Decrement
a += 1  # 11 (a = a + 1)
a -= 1  # 10 (a = a - 1)

# Operations
b = a + 1  # 11
c = a - 1  # 9

d = a * 2  # 20 
e = a / 2  # 5 
f = a % 3  # 1 (remainder)
g = a ** 2  # 100 (exponentiation)

# Operations with other variables
d = a + b  # 21

### 2.2 String values

You can concatenate strings together with the `+` operator:

In [4]:
"Adding" + " " + "strings" + " " + "is" + " " + "pasting"

'Adding strings is pasting'

The built-in function `len` can be used to find the length of a string:

In [5]:
len("four")

4

In [6]:
len("Doha")

4

### 2.3 Logical values

We can evaluate the relationships between different types of data in `python`. The outputs of such comparisons/operations are boolean values. Some examples:

In [7]:
# Comparing numbers
x = (1 >= 2)  # greater or equal
y = (1 == 2)  # equal
w = (1 != 2)  # different

print(x)
print(y)
print(w)

False
False
True


In [8]:
# Parentheses are not required, but they help readability
x = (1 <= 2) and (1 > 0)  # both statements are true
y = (1 >  2) or  (1 < 3)  # at least one statement is true

print(x)
print(y)

True
True


In [9]:
# It is good practice to use "is" instead of == for checking for NoneType:
x = None
y = x is None
z = x is not None
print(x)
print(y)
print(z)

None
True
False


### 2.4 Lists

Lists are one of the simplest multi-value objects. They are created with square brackets:

In [10]:
a_list = ["This", "is", "a", "list", "of", "strings"]
a_list

['This', 'is', 'a', 'list', 'of', 'strings']

Here you can see we created a list of strings. We can also create a list of integers:

In [11]:
num_list = [1, 5, 10]
num_list

[1, 5, 10]

Lists in `python` need not be homogeneous; you can mix object types:

In [12]:
mix_list = ['a', 'b', 1, 2, 3, True, None]
mix_list

['a', 'b', 1, 2, 3, True, None]

Sometimes you want to access individual elements from a list. You can do this using square brackets together with the index of the element:

In [13]:
mix_list[0]  # first element

'a'

In [14]:
mix_list[-1]  # last element; it is None, so Colab displays nothing

In [15]:
mix_list[-2]

True

Notice that the first element is indexed at `0`, the second element at `1`, and so on. You can also access a contiguous range of elements:

In [16]:
mix_list[1:3]  # second item (index 1) and third item (index 2) only!

['b', 1]

You can also use negative indices to access items from the end. For example, the last item:

In [17]:
num_list

[1, 5, 10]

In [18]:
num_list[-1]

10

You can concatenate multiple lists together with the operator `+`:

In [19]:
num_list + [40, 50, 60]

[1, 5, 10, 40, 50, 60]

And you can check for membership with the operator `in`:

In [20]:
"abc" in ["abc", "def", "ghi"]

True

### 2.5 Tuples

Tuples are created with parentheses:

In [21]:
x = ("foo", 1)

But they can also be created without any parentheses, implied by the comma:

In [22]:
x = "foo", 1

Elements in the tuple are also accessed via the index (like lists).

Lists can be used most places that a tuple is used, so it can be confusing what the difference is between the two. Besides technicalities, the following rules can help you decide when to use a tuple and when to use a list:

* `list`: many elements (potentially), unknown number, relatively homogeneous, mutable.
* `tuple`: few elements, fixed number, completely heterogeneous, immutable (fixed).

The name comes from here: double, triple, quadruple... This hints that they should be of fixed length.
Since their length is fixed, we often use them with destructuring:

In [23]:
name, num = x

In [24]:
name

'foo'

In [25]:
num

1

Now the variable `name` contains the value `"foo"` and the variable `num` contains the value `1`.
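The mutable/immutable distinction mentioned above can be checked directly. The following is a small sketch with made-up variable names (`scores`, `point`): a list accepts in-place changes, while the same assignment on a tuple raises a `TypeError`.

```python
# Lists are mutable: elements can be replaced and appended in place
scores = [1, 2, 3]
scores[0] = 99
scores.append(4)
print(scores)

# Tuples are immutable: item assignment raises a TypeError
point = (1, 2, 3)
try:
    point[0] = 99
except TypeError as err:
    error_message = str(err)
    print("Cannot modify a tuple:", error_message)
```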

### 2.6 Dictionaries

Dictionaries are another basic type in `python`. They are *associative* data structures. Like a standard dictionary, Python dictionaries associate a `KEY` with a `VALUE` and are created with curly braces, `{` and `}`:


In [26]:
player = {"name": "Jane", "score": 10000}

In [27]:
player

{'name': 'Jane', 'score': 10000}

You can access the value via the *key*, and set it in a similar way:


In [28]:
print(player["name"])
player["name"] = "Jane Smith"

Jane


In [29]:
print(player['name'])

Jane Smith


Each key can only have one value. In the above example, we have overwritten the original `"name"`-key with a new value.

In [30]:
# Create a list whose elements are of type dictionary
# Each element on the list is a player, and each player has three attributes.
players = [{"name": "John", "score": 100, "likes": ["R"]},
           {"name": "Jane", "score": 10000, "likes": ["python"]},
           {"name": "Stephen", "score": 55, "likes": ["julia"]}]
print(players[0])

# We can fetch elements of the dictionary
print(players[0]['name'])
print(players[0].get('name'))

{'name': 'John', 'score': 100, 'likes': ['R']}
John
John


In [31]:
players[0].get('name')

'John'
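The two ways of fetching a value shown above differ when the key is missing. A minimal sketch, using a key (`"country"`) that is deliberately absent from the dictionary:

```python
player = {"name": "Jane", "score": 10000}

# Square brackets raise a KeyError when the key does not exist
try:
    player["country"]
except KeyError as err:
    print("Missing key:", err)

# .get returns None instead, or a default value if you supply one
country = player.get("country")
country_default = player.get("country", "unknown")
print(country, country_default)

# New keys are added by assignment; "in" checks for the presence of a key
player["likes"] = ["python"]
has_score = "score" in player
keys = list(player)  # converting a dict to a list gives its keys
print(has_score, keys)
```

Using `.get` is convenient when a missing key is an expected situation; square brackets are preferable when a missing key should be treated as an error.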

### 2.7 Properties of an object: instances, attributes and methods

Objects in `python` have *classes*. For example:

In [32]:
new_list = [1, 2, 3]  # Create a list
type(new_list)  # This function tells you the type of object

list

In [33]:
type(new_list[0])

int

`new_list` is an ***instance*** of the class `list`. Instances can have ***attributes***. These attributes are just properties that are attached to the instance. They are accessed with dot notation:

In [34]:
# I define my own class: custom types
class invented_class: 
    name = "John"
    score = 100
    def show(self): 
        print(self.name)
        print(self.score)
 
# Create an object of this new class
new_obj = invented_class()
print(type(new_obj))
print(new_obj.score)

<class '__main__.invented_class'>
100


Here `new_obj` is an instance of the class `invented_class`, and it has attributes `name` and `score`.

If the attribute happens to be a function, we call it a ***method***. Methods are functions that have a special purpose: they interact with the instance itself in some way.
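To make the idea of a method concrete, here is a small sketch. The class `Player` and its method `add_points` are hypothetical and only meant as an illustration; unlike the class above, the attributes are set per instance inside `__init__`.

```python
class Player:  # hypothetical class, for illustration only
    def __init__(self, name, score):
        # Instance attributes: each instance stores its own values
        self.name = name
        self.score = score

    def add_points(self, points):
        # A method: it reads and updates the instance it is called on (self)
        self.score += points
        return self.score


jane = Player("Jane", 10000)
jane.add_points(50)  # called with dot notation, like an attribute
print(jane.name, jane.score)

# Built-in types have methods too
words = ["b", "a"]
words.append("c")  # list method that modifies the list in place
upper_name = jane.name.upper()  # string method that returns a new string
print(words, upper_name)
```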

---
## 3. Iterables and control flow

### 3.1 Iterate over elements of an object (`for`-loops)

You may want to perform an operation for each element in an *iterable* object, and (possibly) store the result of that operation. We do that by *looping* over the object, operating on each of its elements sequentially.

In [35]:
vector = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
squared = []
for num in vector:
    squared_num = num ** 2
    squared = squared + [squared_num]

squared

[1, 4, 9, 16, 25, 36, 49, 64, 81, 100]

In [36]:
vec = [1, 4, 9]
roots = []
for a in vec:
    root_a = a ** 0.5  # square root of each element
    roots = roots + [root_a]

roots

[1.0, 2.0, 3.0]

The structure of the loop is critical and always the same: 

```
## NOT RUN
for ELEMENT in EXISTING_OBJECT:  # Use "for", "in" and ":"
    operation_on(ELEMENT)  # Indentation is MANDATORY (4 spaces by convention)

# Loop ENDS when indentation is over
## END NOT RUN
```

`for`-loops can be applied to any instance of an iterable class, such as `list`, `tuple` or `dict`.
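As a brief sketch of looping over iterables other than lists: iterating over a dictionary yields its keys, `.items()` yields key-value pairs, and the built-in `enumerate` pairs each element with its position. The variable names below are made up for illustration.

```python
player = {"name": "Jane", "score": 10000}

# Looping over a dictionary yields its keys
keys = []
for key in player:
    keys.append(key)

# .items() yields (key, value) tuples, which we destructure in the loop
pairs = []
for key, value in player.items():
    pairs.append((key, value))

# enumerate gives the position together with each element of a tuple
indexed = []
for i, letter in enumerate(("a", "b", "c")):
    indexed.append((i, letter))

print(keys)
print(pairs)
print(indexed)
```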

### 3.2 Operations on lists

There are three main operations we perform on a list:
1. Aggregate (*reduce* operation)
2. Apply a function to each element (*map* operation)
3. Select elements (*filter* operation)

We illustrate these with some examples.

In [37]:
# 1. AGGREGATION
# Summing the numbers in a list: 
nums = [30, 1, 4, 3, 10.5, 100]
total = 0 
for num in nums:
    total += num
    
print(total)

148.5


In [38]:
sum(nums)

148.5

In [39]:
numbers = [1, 2, 3, 4, 5]
summ = 0
for num in numbers:
  summ += num

print(summ)

15


In [40]:
sum(numbers)

15

In [41]:
# 2. APPLY A FUNCTION 
# Squaring each number in a list
nums = [30, 1, 4, 3, 10.5, 100]

# This is called a "list comprehension"
# and is the python way to apply a function to 
# every element in a list
squared_nums = [num ** 2 for num in nums]
squared_nums

[900, 1, 16, 9, 110.25, 10000]

In [42]:
# 3. FILTER
# Remove all values less than 18:
ages = [0, 3, 21, 45, 10, 97]
adults = [a for a in ages if a > 17]
adults

[21, 45, 97]

In [43]:
ages = [0, 3, 21, 45, 10, 97]
teen = [a for a in ages if a <= 17]
teen

[0, 3, 10]

In [44]:
numb = [1, 2, 3, 4]
sq_numb = [a ** 2 for a in numb]
sq_numb

[1, 4, 9, 16]

### 3.3 Control flow

Sometimes you may want to operate only if an object satisfies some relevant criterion, for example to make a decision based on whether some condition holds. In these cases we use `if`-statements.

In [45]:
gender = "male"
age = 2 

# Start of if-statement
if gender == "female":
    if age >= 18:  # 18 and over counts as an adult
        print("woman")
    else:
        print("girl")
elif gender == "male":
    if age >= 18:
        print("man")
    else:
        print("boy")
else:
    print("other")


boy


In [46]:
salary = 700
country = 'America'

if country == 'America':
  if salary > 500:
    print('Middle class')
  else:
    print('Poor')
elif country == 'China':
  if salary > 300:
    print('Middle class')
  else:
    print('Poor') 

Middle class


Again, the structure of an `if`-statement is always the same, so it is essential that we use it correctly:

```
## NOT RUN
if BOOLEAN:  # Use: "if", ":" and supply a boolean condition
    ACTION1a  # Indent with 4 spaces
    ACTION1b
elif BOOLEAN:  # Second layer to "if" (else-if)
    ACTION2
else:  # Rest of scenarios (make sure all contingencies are covered)
    ACTION3

## END NOT RUN
```

Like in `for`-loops, statements are closed by indentation.

---
## 4. Input/output

`python` treats external files like any other object, so they have attributes and methods. The two main actions to perform on these files are reading and writing.

In short, the built-in function `open` creates a `python` file object, which serves as a link to a file residing on your machine. After calling `open`, you can transfer strings of data to and from the associated external file by calling the returned file object’s methods.

Then, `with` is a statement that works with context managers such as file objects: it wraps the file-processing code in a block and ensures that the file is closed automatically on exit.

In [47]:
with open('simple_example.txt', 'w') as file:
    for i in range(10):
        file.write(f'I have written {i+1} lines \n')

In [48]:
# Let's read the file into a variable:
with open('simple_example.txt') as file:
    content = file.read()

print(content)

I have written 1 lines 
I have written 2 lines 
I have written 3 lines 
I have written 4 lines 
I have written 5 lines 
I have written 6 lines 
I have written 7 lines 
I have written 8 lines 
I have written 9 lines 
I have written 10 lines 



In [49]:
# The file object is actually an iterator, so we can do our usual tricks:
with open('simple_example.txt') as file:
    for line in file: 
        print(line)

I have written 1 lines 

I have written 2 lines 

I have written 3 lines 

I have written 4 lines 

I have written 5 lines 

I have written 6 lines 

I have written 7 lines 

I have written 8 lines 

I have written 9 lines 

I have written 10 lines 
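The blank lines in the output above appear because each line read from the file still ends with `\n`, and `print` adds a line break of its own. A minimal sketch of one way to avoid this, assuming a throwaway file (the name `strip_example.txt` is made up) written to a temporary directory:

```python
import os
import tempfile

# Write a small throwaway file in a temporary directory
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "strip_example.txt")
with open(path, "w") as file:
    for i in range(3):
        file.write(f"line {i + 1}\n")

# rstrip("\n") removes the trailing line break from each line,
# so print only adds a single line break
clean_lines = []
with open(path) as file:
    for line in file:
        clean_line = line.rstrip("\n")
        clean_lines.append(clean_line)
        print(clean_line)
```

An alternative is `print(line, end="")`, which keeps the line break stored in the file and stops `print` from adding a second one.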



In [50]:
# We can also read it into a list by converting
# the iterator into a list directly:
with open('simple_example.txt') as f:
    content = list(f)

content[0]

'I have written 1 lines \n'

Similarly, you can write to an external file by using `file.write()`.

In [51]:
# 'r' is to read, 'w' to write:
with open('simple_example.txt', 'r') as file_in, open('simple_example2.txt', 'w') as file_out:
    for line in file_in:
        file_out.write(line)


If your file is located on a website rather than at a local path, we need to proceed differently, using the `urllib` package. The first task is to open a connection to the URL. The main differences with respect to local files are how we read the file and the structure of the imported object: after `.read().decode()` we hold the whole content as a single string rather than looping over its lines, so if we want to split the loaded content into lines we need to do it manually.

In [52]:
# We load the data
import urllib.request

url_path = "https://raw.githubusercontent.com/barcelonagse-datascience/academic_files/master/data/textfile.txt"
file_conn = urllib.request.urlopen(url_path)  # This opens the connection to the URL

raw_txt = file_conn.read().decode() # decode is used to convert to string format
raw_txt

'Why Do People Use Python?\n\nBecause there are many programming languages available today, this is the usual first question of newcomers. Given that there are roughly 1 million Python users out there at the moment, there really is no way to answer this question with complete accuracy; the choice of development tools is sometimes based on unique constraints or personal preference.\n\nBut after teaching Python to roughly 225 groups and over 3,000 students during the last 12 years, some common themes have emerged. The primary factors cited by Python users seem to be these:\n\nSoftware quality\nFor many, Python’s focus on readability, coherence, and software quality in general sets it apart from other tools in the scripting world. Python code is designed to be readable, and hence reusable and maintainable—much more so than traditional scripting languages. The uniformity of Python code makes it easy to understand, even if you did not write it. In addition, Python has deep support for more 

In [53]:
# Let's recover lines using split and the target line break
#split_txt = raw_txt.split('\n\n')
split_txt = raw_txt.split('\n')
split_txt[0:10]

['Why Do People Use Python?',
 '',
 'Because there are many programming languages available today, this is the usual first question of newcomers. Given that there are roughly 1 million Python users out there at the moment, there really is no way to answer this question with complete accuracy; the choice of development tools is sometimes based on unique constraints or personal preference.',
 '',
 'But after teaching Python to roughly 225 groups and over 3,000 students during the last 12 years, some common themes have emerged. The primary factors cited by Python users seem to be these:',
 '',
 'Software quality',
 'For many, Python’s focus on readability, coherence, and software quality in general sets it apart from other tools in the scripting world. Python code is designed to be readable, and hence reusable and maintainable—much more so than traditional scripting languages. The uniformity of Python code makes it easy to understand, even if you did not write it. In addition, Python has 
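As a sketch of an alternative to `split('\n')`, the string method `splitlines` also breaks a text at line breaks, and a list comprehension can then drop the empty lines. The short text below is made up so that the example does not depend on the download.

```python
# A made-up text with the same structure: paragraphs separated by blank lines
raw = "Title\n\nFirst paragraph.\n\nSecond paragraph.\n"

with_split = raw.split("\n")  # leaves an empty string after the final "\n"
with_splitlines = raw.splitlines()  # no trailing empty string

# Keep only the lines that contain something other than whitespace
non_empty = [line for line in with_splitlines if line.strip()]

print(with_split)
print(with_splitlines)
print(non_empty)
```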

---
## 5. Functions

### 5.1 Coding functions

Functions (may) use an input to produce an output through a set of operations.

```
## NOT RUN
def function_name(input):  # Use "def", parentheses and ":"
    operations
    return output  # End with "return"

## END NOT RUN
```

For example, here is a function that takes a number, and returns its square:

In [54]:
def squared(x):
    return x**2

squared(7)

49

In [55]:
def square_root(x):
  return x**0.5

square_root(9)

3.0

Here is a more general function that takes a number and a power and returns the number to that power:

In [56]:
def power(x, n):
    return x ** n

power(4, 3)

64

Here is a function that returns the minimum and the sum of a list of numbers:

In [57]:
def min_sum_fun(x):
    minx = x[0] if len(x) > 0 else None
    sumx = 0.0  
    for val in x: 
        if val < minx:
            minx = val
        sumx += val 
    return minx, sumx  # Notice multiple outputs (technically a tuple)


Now we can call that function:

In [58]:
m, s = min_sum_fun([1, 5, 0.3, -1])  # Destructuring
w = (m, s) = min_sum_fun([1, 5, 0.3, -1])  # You can assign directly to a tuple

print(m)
print(s)
print(w)
print(type(w))

-1
5.3
(-1, 5.3)
<class 'tuple'>


In [59]:
def min_sum_f(x):
  mini = min(x)
  sums = sum(x)
  return mini, sums

In [60]:
mi, su = min_sum_f([1, 5, 0.3, -1])
w = mi, su
print(mi)
print(su)
print(w)
type(w)

-1
5.3
(-1, 5.3)


tuple

### 5.2 Map, reduce and filter

`map` is a function used to apply a certain (set of) operation(s) sequentially to every element in a list. This is similar to a `for`-loop but with a simpler syntax.

In [61]:
items = [1, 4, 9, 16, 25]
sqroot = map(lambda x: x**(1/2), items)  # map(function, iterable_object)
sqroot = list(sqroot)  # Put into a list
print(sqroot)

[1.0, 2.0, 3.0, 4.0, 5.0]


In [62]:
vect = [1, 2, 3, 4]
square = map(lambda x: x ** 2, vect)
square = list(square)
print(square)

[1, 4, 9, 16]


Note that here `map` is using a so-called `lambda` function: this is an anonymous function (it is never given a name with `def`) that takes an argument `x`, where `x` is each individual element of the `list`, and returns the expression after the colon. Note, however, that the function applied by `map` need not be a `lambda` function.
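To illustrate that a `lambda` is just a short way of writing a function, the sketch below applies the same operation in three ways: with a `lambda`, with a named function (the name `double` is made up), and with a list comprehension.

```python
items = [1, 4, 9]


def double(x):
    return 2 * x


with_lambda = list(map(lambda x: 2 * x, items))  # anonymous function
with_def = list(map(double, items))  # named function, passed without parentheses
with_comprehension = [2 * x for x in items]  # list comprehension

print(with_lambda, with_def, with_comprehension)
```

All three produce the same list; which one to use is mostly a matter of readability.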

Also, the iterable need not be a list of data values; it can be a list of functions:

In [63]:
def sqroot(x):
    return x**(1/2)

def squared(x):
    return x**2

functions = [sqroot, squared]  # List of functions
for i in [1, 4, 9, 16, 25]:
    results = list(map(lambda x: x(i), functions))
    print(results)


[1.0, 1]
[2.0, 16]
[3.0, 81]
[4.0, 256]
[5.0, 625]


`reduce` is a similar function, in this case employed to perform rolling computations on sequential pairs of values in a list. In other words, the operation on each element of the list takes the result computed for the previous elements as one of its inputs, and this accumulates (*reducing* the list to a single result).

Say you want to compute the factorial of a number:

In [64]:
from functools import reduce

# This gives a list with the integers needed to compute the factorial
def factorial_elements(x):
    return [i + 1 for i in range(x)]

# Factorial of 5 (5!)
factorial5 = reduce(lambda x, y: x * y, factorial_elements(5))  # Needs two arguments
factorial5

120

In [65]:
from functools import reduce

def cum_sum1(x):
    return [i + 1 for i in range(x)]

cum_sum = reduce(lambda x, y: x + y, cum_sum1(5))
cum_sum

15
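To see the "rolling" computation that `reduce` performs, the sketch below uses `itertools.accumulate`, which applies the same two-argument function but keeps every intermediate result instead of only the last one. It also shows the optional third argument of `reduce`, an initial value, which makes the call safe on an empty list.

```python
from functools import reduce
from itertools import accumulate

values = [1, 2, 3, 4, 5]

# accumulate yields each intermediate result of the rolling product
running_products = list(accumulate(values, lambda x, y: x * y))

# reduce keeps only the final result
final_product = reduce(lambda x, y: x * y, values)

# With an initial value, reduce also works on an empty list
empty_sum = reduce(lambda x, y: x + y, [], 0)

print(running_products)
print(final_product)
print(empty_sum)
```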

One final useful function is `filter`. It selects the elements of an original list that satisfy a logical condition (i.e. it applies a function that returns a logical value and keeps the elements for which that value is true). A simple example:

In [66]:
num_list = [-1, 2, -3, 4, -5, 6]
pos_vals = filter(lambda x: x > 0, num_list)  # Same syntax as map/reduce
pos_vals = list(pos_vals)  # Needs to be coerced to a list too
print(pos_vals)

[2, 4, 6]


In [67]:
from statistics import mean


num = [1, 2, 3, 4, 5, 6, 7, 8]
high = filter(lambda x: x > mean(num), num)
high = list(high)
print(high)
print(mean(num))

[5, 6, 7, 8]
4.5


---
## 6. Modules and imports

*Modules* are `python` files, recognised in the computer as `<filename>.py`. Data and methods defined in the module can become part of the namespace by using `import`.

In [68]:
x = sin(5)  # The "sin" function does not exist

NameError: name 'sin' is not defined

In [None]:
from math import sin  # You need to "import" it
x = sin(5)
print(x)

-0.9589242746631385


There are some other useful tricks to import data and methods:

In [None]:
from math import sin  # Imports a single function
from math import sin as sinus  # Nickname, useful when you import something with a long name
print(sinus(3))

import math  # Imports the module in the namespace, methods can then be accessed e.g.
print(math.sin(3))

0.1411200080598672
0.1411200080598672


Let us discuss here a couple of fundamental imports in `python`:

### 6.1 `numpy` (Numerical Python)

*   User's manual: https://numpy.org/doc/stable/

This is Python's stack for scientific computing. The fundamental new data type is that of a **`numpy` `array`**, Python's matrix-type object, which is used in the majority of its data analysis, statistics and machine learning modules.

All the elements of an array have the same data type (`dtype`), which can be one of several numerical types or boolean:

In [None]:
# Import module 
import numpy as np  # usually imported with name "np"

# Creating an array
A = np.array([[1, 2, 3, 4, 5, 6],
              [42, 53, 43, 62, 7, 4],
              [-3, -1, -4, -8, -52, -4],
              [10, 0, 4, 1, 0, 1]])

We can access the elements in an array using multi-index notation (counting starts from 0, slicing `a:b` is inclusive:exclusive, negative indices, etc.)

In [None]:
print(A)
# Print from row 3 onwards, columns 3 and 5
print(A[2:, [2, 4]])

[[  1   2   3   4   5   6]
 [ 42  53  43  62   7   4]
 [ -3  -1  -4  -8 -52  -4]
 [ 10   0   4   1   0   1]]
[[ -4 -52]
 [  4   0]]
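The indexing rules mentioned above (counting from 0, `a:b` slices that include `a` and exclude `b`, negative indices) can be tried on a smaller array. The following is a minimal sketch with a made-up 3 x 4 array named `M`:

```python
import numpy as np

# A small 3 x 4 array holding the integers 0 to 11, filled row by row
M = np.arange(12).reshape(3, 4)

first_row = M[0]  # a single index selects a whole row
last_col = M[:, -1]  # all rows, last column
block = M[0:2, 1:3]  # rows 0 and 1, columns 1 and 2 (the end is excluded)
single = M[2, 0]  # one element: third row, first column

print(M)
print(first_row)
print(last_col)
print(block)
print(single)
```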


In [None]:
A.shape

(4, 6)

In [None]:
A.diagonal()

array([ 1, 53, -4,  1])

In [None]:
A.transpose()

array([[  1,  42,  -3,  10],
       [  2,  53,  -1,   0],
       [  3,  43,  -4,   4],
       [  4,  62,  -8,   1],
       [  5,   7, -52,   0],
       [  6,   4,  -4,   1]])

In [None]:
A.reshape(6,4)

array([[  1,   2,   3,   4],
       [  5,   6,  42,  53],
       [ 43,  62,   7,   4],
       [ -3,  -1,  -4,  -8],
       [-52,  -4,  10,   0],
       [  4,   1,   0,   1]])

In [None]:
# Attributes and methods
A.shape  # dimension
A.min()  # minimum (you can similarly use max, sum, etc.)
A.diagonal()  # diagonal
B = A.transpose()  # transposing
C = A.reshape(6, 4)  # rearrange values to change dimension (CAREFUL!)
print(A.shape)
print(B.shape)
print(C.shape)
A.dot(B)  # dot-product (with array B)


(4, 6)
(6, 4)
(6, 4)


array([[   91,   584,  -333,    32],
       [  584, 10331, -1227,   658],
       [ -333, -1227,  2810,   -58],
       [   32,   658,   -58,   118]])

### 6.1.1 `array` operations

Mathematical symbols take on mathematical meanings in `numpy`, and so the `+` operator between two `np.array` just tries to add them together elementwise (it works differently to `list`-type objects). To concatenate you need a specific function.


In [None]:
# Numpy array addition: 
a, b = np.array([1, 2, 3]), np.array([4, 5, 6])
print(a + b)  # Addition
print(np.concatenate([a,b]))  # Concatenation

[5 7 9]
[1 2 3 4 5 6]
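The contrast with lists can be seen side by side. In this sketch the same operators are applied to two lists and to the arrays built from them; the variable names are made up for illustration.

```python
import numpy as np

a_list, b_list = [1, 2, 3], [4, 5, 6]
a, b = np.array(a_list), np.array(b_list)

list_sum = a_list + b_list  # lists: "+" concatenates
array_sum = a + b  # arrays: "+" adds element by element

list_repeated = 2 * a_list  # lists: "*" repeats the list
array_scaled = 2 * a  # arrays: the scalar multiplies every element
array_prod = a * b  # arrays: elementwise product (not a dot product)

print(list_sum, array_sum)
print(list_repeated, array_scaled)
print(array_prod)
```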


### 6.2 `pandas` (Panel Data Structures)

This is the module in Python for doing rectangular-data management, analysis and plotting. It provides tools to read and write external data.


In [None]:
import pandas as pd  # Usually imported with name "pd"

file_path = "https://raw.githubusercontent.com/barcelonagse-datascience/academic_files/master/data/tips.csv"
tips = pd.read_csv(file_path)  # read_csv allows importing .CSV files
print(tips.head(5))  # head gives first rows (arg = number of rows)

# Data formats in pandas
print(type(tips))
print(type(tips['tip']))

   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>


There are two basic data formats in `pandas`: `Series` and `DataFrame`. A `Series` is the equivalent of a vector in linear algebra. A `DataFrame` is the equivalent of a rectangular data structure. These types are equipped with several attributes and methods useful for data management and analysis. In the tips example, the object `tips` is a `DataFrame`, while any individual column is a `Series`.

---
## 7. Structured data operations (`pandas`)

### 7.1 `Series`

The values of a `Series` are accessed via its index, whose labels can be either numbers or strings. `Series` are typically obtained by reading a dataset from an external file or when operating on a `DataFrame`, but they can be defined manually:

In [None]:
# Here no indices are specified (there are defaults)
a_series = pd.Series([1, 15, -5, None, 4, 123, 0, 78, 0, 1, -4])
a_series

0       1.0
1      15.0
2      -5.0
3       NaN
4       4.0
5     123.0
6       0.0
7      78.0
8       0.0
9       1.0
10     -4.0
dtype: float64

In [None]:
# Accessing a certain value via the index
a_series[0]

1.0

In [None]:
# .values returns a numpy.ndarray of the values
a_series.values
a_series.index, a_series.index.values

(RangeIndex(start=0, stop=11, step=1),
 array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10], dtype=int64))

In [None]:
# You can overwrite the index directly: 
a_series.index = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k"]
a_series

a      1.0
b     15.0
c     -5.0
d      NaN
e      4.0
f    123.0
g      0.0
h     78.0
i      0.0
j      1.0
k     -4.0
dtype: float64

Accessing values via the index can be very useful, but sometimes you want to access the values as though they were a Python list. In other words "I want the first value!", without having to know the name of the label. This can be achieved with `.iloc`:


In [None]:
a_series.iloc[0], a_series["a"], a_series.iloc[-1]


(1.0, 1.0, -4.0)

In [None]:
# This just resets the index to be as we found it originally
a_series = a_series.reset_index(drop = True)
x = a_series.sort_values()

x[0], x.iloc[0] 
# the indices remain, so x indexed by 0 is 1
# the ordering changes, so the first element of x is -5.0, the minimum

(1.0, -5.0)

In [None]:
x

2      -5.0
10     -4.0
6       0.0
8       0.0
0       1.0
9       1.0
4       4.0
1      15.0
7      78.0
5     123.0
3       NaN
dtype: float64

As usual, one should explore the attributes and methods of any python object one ends up working with. We have already accessed `.index` and `.values`. Some others that are worth highlighting:

*   `.map`
*   `.corr`
*   `.describe`
*   `.hist`
*   `.plot`
*   `.size`
*   `.value_counts`
*   `.sort_values`

For example:

In [None]:
tips.describe()

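|       | total_bill | tip      | size     |
|-------|------------|----------|----------|
| count | 244.0      | 244.0    | 244.0    |
| mean  | 19.785943  | 2.998279 | 2.569672 |
| std   | 8.902412   | 1.383638 | 0.9511   |
| min   | 3.07       | 1.0      | 1.0      |
| 25%   | 13.3475    | 2.0      | 2.0      |
| 50%   | 17.795     | 2.9      | 2.0      |
| 75%   | 24.1275    | 3.5625   | 3.0      |
| max   | 50.81      | 10.0     | 6.0      |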


### 7.1.1 Operations with `Series`

`Series` are `numpy` arrays behind the scenes, so we can compute element-wise functions on one series or several series at the same time. The result is another series with data type depending on the type of operations performed.


In [None]:
series1 = pd.Series([1, 3, 5, 7])
series2 = pd.Series([0, 10, -1, 6])

series3 = 2 * series1 + abs(series2)
series4 = series1 > series2 

print(series3)
print(series4)
# Take a look at the different Series objects!

0     2
1    16
2    11
3    20
dtype: int64
0     True
1    False
2     True
3     True
dtype: bool


What goes on in the previous examples is more subtle than it looks. How does `pandas` know which elements from each series to combine in the required operation? The answer is that the indices happened to be the same. So when we ask for something like

`series3 = series1 + series2`,

`pandas` looks for entries in each series with the same index label, adds them elementwise, and stores each result under that same label in `series3`.

Consider instead the following example:


In [None]:
series1 = pd.Series([1, 10], index=["A", "B"])
series2 = pd.Series([4, -1], index=["C", "D"])
series3 = series1 + series2
print(series3)

A   NaN
B   NaN
C   NaN
D   NaN
dtype: float64


In [None]:
series1

A     1
B    10
dtype: int64

This aspect makes it very easy to work with series that we have sorted or manipulated otherwise; there is always the address to access a value. This helps prevent accidentally combining values we didn't mean to combine!
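A case in between the two examples above is a partial overlap of the indices. The following sketch uses made-up series `s1` and `s2`; it also uses the `Series.add` method with `fill_value`, which treats a label missing from one of the series as the given value instead of producing `NaN`.

```python
import pandas as pd

s1 = pd.Series([1, 10, 100], index=["A", "B", "C"])
s2 = pd.Series([4, -1], index=["B", "C"])

# Labels B and C are matched; A has no partner in s2, so it becomes NaN
aligned = s1 + s2

# With fill_value=0 the missing partner of A is treated as 0
filled = s1.add(s2, fill_value=0)

print(aligned)
print(filled)
```

Whether filling with 0 is appropriate depends on what the data mean; the default `NaN` is the safer signal that a value could not be matched.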

In [None]:
a_series  

0       1.0
1      15.0
2      -5.0
3       NaN
4       4.0
5     123.0
6       0.0
7      78.0
8       0.0
9       1.0
10     -4.0
dtype: float64

In [None]:
# accessing by list of index labels
a_series.index = ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K"]
x = a_series[["A", "K"]]

In [None]:
a_series

A      1.0
B     15.0
C     -5.0
D      NaN
E      4.0
F    123.0
G      0.0
H     78.0
I      0.0
J      1.0
K     -4.0
dtype: float64

In [None]:
x

A    1.0
K   -4.0
dtype: float64

In [None]:
# Getting a boolean-valued series by checking a condition
choose = (a_series == 0.0)
choose

A    False
B    False
C    False
D    False
E    False
F    False
G     True
H    False
I     True
J    False
K    False
dtype: bool

In [None]:
# Notice the index of x is a SUBSET of the index of "a_series"
# This can be useful when needing to relate values back to the original "a_series"!
x = a_series[choose]
print(x)
# or the complement
a_series[~choose]

G    0.0
I    0.0
dtype: float64


A      1.0
B     15.0
C     -5.0
D      NaN
E      4.0
F    123.0
H     78.0
J      1.0
K     -4.0
dtype: float64

We often use boolean masks to filter data in Pandas. We also get special boolean algebra operators to use in `numpy`/`pandas`, distinct from the and/or/not you will use in regular Python: `&` (AND), `|` (OR), `~` (NOT).
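A short sketch of these operators on a made-up series `s`. Each comparison is wrapped in parentheses because `&` and `|` bind more tightly than `>` or `<`, so leaving the parentheses out changes how the expression is read.

```python
import pandas as pd

s = pd.Series([1, 15, -5, 4, 123, 0])

between = s[(s > 0) & (s < 20)]  # AND: both conditions hold
extreme = s[(s < 0) | (s > 100)]  # OR: at least one condition holds
not_zero = s[~(s == 0)]  # NOT: the condition does not hold

print(between.tolist())
print(extreme.tolist())
print(not_zero.tolist())
```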


### 7.1.2 Missing values

A `Series` object in `pandas` can help us deal with missing data. We have already seen how data management can quickly lead to missing data.

In [None]:
series3 = series1 + series2
print(series3)

A   NaN
B   NaN
C   NaN
D   NaN
dtype: float64


What happened there is that the index labels of the two series could not be matched, so for every label `pandas` tried to add a numeric value to a missing value, the result of which is a missing value.
One way to manually specify in `pandas` that a value is missing is to use `None`, as below:

In [None]:
temp = pd.Series([1, None, 2])
print(temp)

0    1.0
1    NaN
2    2.0
dtype: float64


`pandas` coerces `None` values to `NaN` ("Not a Number") values. We can create boolean masks on the basis of such values. To identify `NaN` or `None` values in a `Series`, use either of the two equivalent methods `.isna()` and `.isnull()`; their opposites are `.notna()` and `.notnull()`.

In [None]:
temp.isna()
temp.isnull()

0    False
1     True
2    False
dtype: bool

In [None]:
temp.notna()
temp.notnull()

0     True
1    False
2     True
dtype: bool
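Once missing values are identified, common next steps are to count them, drop them or replace them. A minimal sketch using the `temp` series from above; replacing with 0 is only an illustration, not a recommendation for real data.

```python
import pandas as pd

temp = pd.Series([1, None, 2])

n_missing = temp.isna().sum()  # True counts as 1, so this counts the NaNs
dropped = temp.dropna()  # removes the missing entries, keeps the index labels
filled = temp.fillna(0)  # replaces the missing entries with a value
mean_value = temp.mean()  # by default the mean ignores missing values

print(n_missing)
print(dropped)
print(filled)
print(mean_value)
```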

### 7.2 `DataFrame`

This is the `pandas` model for rectangular data. It is operationally similar to a dictionary of `Series`: each column of the `DataFrame` is a `Series` object and comes with all the attributes and methods of a `Series`. One implication is that the data type is common within each column, but it can differ across columns.

In [None]:
# Recall our "tips" dataset
tips.head(5)  # The first method of our DataFrame object

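|   | total_bill | tip  | sex    | smoker | day | time   | size |
|---|------------|------|--------|--------|-----|--------|------|
| 0 | 16.99      | 1.01 | Female | No     | Sun | Dinner | 2    |
| 1 | 10.34      | 1.66 | Male   | No     | Sun | Dinner | 3    |
| 2 | 21.01      | 3.5  | Male   | No     | Sun | Dinner | 3    |
| 3 | 23.68      | 3.31 | Male   | No     | Sun | Dinner | 2    |
| 4 | 24.59      | 3.61 | Female | No     | Sun | Dinner | 4    |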


In [None]:
# Other important attributes: the shape, and the names of rows and columns
print(tips.shape)
print(tips.index)
print(tips.columns)

(244, 7)
RangeIndex(start=0, stop=244, step=1)
Index(['total_bill', 'tip', 'sex', 'smoker', 'day', 'time', 'size'], dtype='object')


To access the columns of a `DataFrame` there are two standard ways.

In [None]:
tips.tip  # As if it were an attribute (NOT recommended)
tips["size"]  # As if it were a named index (RECOMMENDED)

0      2
1      3
2      3
3      2
4      4
      ..
239    3
240    2
241    2
242    2
243    2
Name: size, Length: 244, dtype: int64

The latter is recommended because, for example, the column `size` cannot be accessed via the first option (it would be confused with the `DataFrame` attribute `.size`). The result of this operation is an object of type `Series`.

We can access several columns at a time by supplying a list of column names. The result is a `DataFrame` with the same index as the original and the chosen subset of columns.

In [None]:
tips[["tip", "size", "sex"]].head(5)

Unnamed: 0,tip,size,sex
0,1.01,2,Female
1,1.66,3,Male
2,3.5,3,Male
3,3.31,2,Male
4,3.61,4,Female


Similarly, you can access rows instead of columns with the same philosophy.

*   Using a list of index labels: `tips.loc[ [index1, index2, ...] ]`
*   Using a list of integer index locations (i-loc): `tips.iloc[ [integer1, integer2, ...] ]`



In [None]:
# Accessing rows AND columns!
# Example of 2-dimension loc
tips.loc[[1, 3], ['sex', 'smoker']]

Unnamed: 0,sex,smoker
1,Male,No
3,Male,No


In [None]:
# Accessing rows AND columns!
# Example of 2-dimensional iloc
tips.iloc[[1, 3], 2:5]

Unnamed: 0,sex,smoker,day
1,Male,No,Sun
3,Male,No,Sun


Note that certain operations are interchangeable: the 3rd element of column "sex" can be obtained in any of the following ways:

In [None]:
tips.sex[2]  # Access col as series, then the 3rd element of that
tips.loc[2, "sex"]  # Access the entry in DF by giving the index labels of row and col
tips.loc[2]["sex"]  # Accessing the whole row as a series, then using the column name as index label

'Male'

As with `Series`, we can use a boolean-valued series to index a `DataFrame`, provided they share the same index labels. The simplest instance of this is to use series produced as boolean masks of columns of the dataframe. The output of this *filtering* operation is a dataframe with the subset of rows corresponding to the `True` values in the boolean mask.

In [None]:
tips[tips['sex'] == "Male"].head(5)

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
5,25.29,4.71,Male,No,Sun,Dinner,4
6,8.77,2.0,Male,No,Sun,Dinner,2


In [None]:
tips[(tips['sex'] == "Male") & (tips['day'] == "Sun")].head(5)  # Multiple booleans ("&", "|", "~")

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
5,25.29,4.71,Male,No,Sun,Dinner,4
6,8.77,2.0,Male,No,Sun,Dinner,2


A `DataFrame` comes with several methods for computing column-wise statistics and summaries. We highlight some of them:

*   `.boxplot` (check out the `by = ` option)
*   `.corrwith` and `.corr` (within and across `DataFrame`'s)
*   `.dot`
*   `.mean/median/max/quantile/sum`, etc.
*   `.sample`
*   `.sort_values`
*   `.unique` (a `Series` method, so apply it to a single column)


### 7.2.1 `GroupBy`

This `DataFrame` method groups the `DataFrame` according to the values of a column, treating them as categorical values. It returns a groupby object.

In [None]:
# Group tips DataFrame by size of table
by_size = tips.groupby("size")
print(by_size)

# If we coerce it to a list, we see something interesting:
# it's basically a list of tuples.
# The first element is the "category" variable, the second
# is a DataFrame.
list(by_size)

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000213FDE3FFD0>


[(1,
       total_bill   tip     sex smoker   day    time  size
  67         3.07  1.00  Female    Yes   Sat  Dinner     1
  82        10.07  1.83  Female     No  Thur   Lunch     1
  111        7.25  1.00  Female     No   Sat  Dinner     1
  222        8.58  1.92    Male    Yes   Fri   Lunch     1),
 (2,
       total_bill   tip     sex smoker   day    time  size
  0         16.99  1.01  Female     No   Sun  Dinner     2
  3         23.68  3.31    Male     No   Sun  Dinner     2
  6          8.77  2.00    Male     No   Sun  Dinner     2
  8         15.04  1.96    Male     No   Sun  Dinner     2
  9         14.78  3.23    Male     No   Sun  Dinner     2
  ..          ...   ...     ...    ...   ...     ...   ...
  237       32.83  1.17    Male    Yes   Sat  Dinner     2
  240       27.18  2.00  Female    Yes   Sat  Dinner     2
  241       22.67  2.00    Male    Yes   Sat  Dinner     2
  242       17.82  1.75    Male     No   Sat  Dinner     2
  243       18.78  3.00  Female     No  Thur

In [None]:
list(tips.groupby("sex"))

[('Female',
       total_bill   tip     sex smoker   day    time  size
  0         16.99  1.01  Female     No   Sun  Dinner     2
  4         24.59  3.61  Female     No   Sun  Dinner     4
  11        35.26  5.00  Female     No   Sun  Dinner     4
  14        14.83  3.02  Female     No   Sun  Dinner     2
  16        10.33  1.67  Female     No   Sun  Dinner     3
  ..          ...   ...     ...    ...   ...     ...   ...
  226       10.09  2.00  Female    Yes   Fri   Lunch     2
  229       22.12  2.88  Female    Yes   Sat  Dinner     2
  238       35.83  4.67  Female     No   Sat  Dinner     3
  240       27.18  2.00  Female    Yes   Sat  Dinner     2
  243       18.78  3.00  Female     No  Thur  Dinner     2
  
  [87 rows x 7 columns]),
 ('Male',
       total_bill   tip   sex smoker  day    time  size
  1         10.34  1.66  Male     No  Sun  Dinner     3
  2         21.01  3.50  Male     No  Sun  Dinner     3
  3         23.68  3.31  Male     No  Sun  Dinner     2
  5         25.29

In [None]:
# We can iterate through the groupby just like we would a list of tuples!
for sex, data in tips.groupby("sex"):
    print(sex)
    print(data.mean(numeric_only=True))  # average the numeric columns only


Female
total_bill    18.056897
tip            2.833448
size           2.459770
dtype: float64
Male
total_bill    20.744076
tip            3.089618
size           2.630573
dtype: float64



We `groupby` to perform *some* operation on each group: to *map* over the groups, applying a function to each element. Very often this function is itself an aggregation (*reduction*): we want to aggregate each group into a value or set of values that *describe* it.

To apply functions to each element of a `groupby`, we use `.apply`:


In [None]:
# Get the maximum bill by gender: 
def max_bill(df):
    return df['total_bill'].max()

tips.groupby("sex").apply(max_bill)

sex
Female    44.30
Male      50.81
dtype: float64

Many aggregation functions that exist on `Series` and `DataFrame` objects (`mean`, `max`, `min`, etc.) can be called directly via the groupby object:

In [None]:
print(tips.groupby("sex").max())
print(tips.groupby("sex").mean(numeric_only=True))  # average the numeric columns only

        total_bill   tip smoker   day   time  size
sex                                               
Female       44.30   6.5    Yes  Thur  Lunch     6
Male         50.81  10.0    Yes  Thur  Lunch     6
        total_bill       tip      size
sex                                   
Female   18.056897  2.833448  2.459770
Male     20.744076  3.089618  2.630573


We can actually `groupby` more than one column:

In [None]:
tips.groupby(["sex", "day"])['tip'].mean()

sex     day 
Female  Fri     2.781111
        Sat     2.801786
        Sun     3.367222
        Thur    2.575625
Male    Fri     2.693000
        Sat     3.083898
        Sun     3.220345
        Thur    2.980333
Name: tip, dtype: float64
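
To make the map-then-reduce idea behind `groupby` concrete, here is a minimal sketch on a made-up miniature table. The `mini` data and the names `by_hand` and `built_in` are assumptions for illustration, not part of the `tips` dataset. It writes the reduction out by hand and compares it with the built-in aggregation.

```python
import pandas as pd

# Hypothetical miniature version of the tips data
mini = pd.DataFrame({
    "sex": ["Female", "Male", "Female", "Male"],
    "total_bill": [16.99, 10.34, 24.59, 21.01],
})

# Map over the groups and reduce each one to a single value, by hand
by_hand = {sex: group["total_bill"].max() for sex, group in mini.groupby("sex")}

# The same reduction requested directly from the groupby object
built_in = mini.groupby("sex")["total_bill"].max()

print(by_hand)
print(built_in)
```

Both routes give one value per group. The built-in form is usually preferable because it is shorter and lets `pandas` do the looping.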

### 7.2.2 Combining `DataFrame`'s

There are many ways to combine various `DataFrame`s into a new one, extending in many ways what we already saw for operations on `Series`. The main ways of doing this are:

*   **Concatenate**: paste `DataFrame`s together row-wise or column-wise, deciding how to handle the resulting `NaN`s (this works more on the rectangular structure of the data)
*   **Merge**: combine `DataFrame`s using a common piece of information, e.g. an identifier column (this works more as a database operation)



In [None]:
# Concatenate
df1 = pd.DataFrame({"A": pd.Series([1, 2, 3]), "B": pd.Series([4, 5, 6])})
df2 = pd.DataFrame({"A": pd.Series([4]), "C": pd.Series([7])})
pd.concat([df1, df2], axis = 0)
# axis: 0 for pasting below, 1 for pasting on the side

Unnamed: 0,A,B,C
0,1,4.0,
1,2,5.0,
2,3,6.0,
0,4,,7.0


In [None]:
df1 = pd.DataFrame({"A": pd.Series([1, 2, 3]), "B": pd.Series([4, 5, 6])})
df2 = pd.DataFrame({"A": pd.Series([4]), "C": pd.Series([7])})
pd.concat([df1, df2], axis = 1, join = "inner")  # "inner" join

Unnamed: 0,A,B,A.1,C
0,1,4,4,7


In [None]:
# Concatenation is mostly used when the rows or columns are shared. 
# For example, you might have data with the same columns and want to concatenate them on axis 0:
# But note: what happened to the index? 
# We might want to reset it. 
df1 = pd.DataFrame({"A": pd.Series([1, 2, 3]), "B": pd.Series([4, 5, 6])})
df2 = pd.DataFrame({"A": pd.Series([4]), "B": pd.Series([7])})
df3 = pd.concat([df1, df2], axis = 0)
df3

Unnamed: 0,A,B
0,1,4
1,2,5
2,3,6
0,4,7


In [None]:
# Similarly, you might have data with the same rows and different columns:
df1 = pd.DataFrame({"A": pd.Series([1, 2, 3]), "B": pd.Series([4, 5, 6])})
df2 = pd.DataFrame({"C": pd.Series([4]), "D": pd.Series([7])})
pd.concat([df1, df2], axis = 1)

Unnamed: 0,A,B,C,D
0,1,4,4.0,7.0
1,2,5,,
2,3,6,,


In [None]:
# But note what happens if the row labels do not align and you concatenate on axis 1 (side by side): missing rows are filled with NaN
df1 = pd.DataFrame({"A": pd.Series([1, 2, 3]), "B": pd.Series([4, 5, 6])})
df2 = pd.DataFrame({"C": pd.Series([7]), "D": pd.Series([10])})
pd.concat([df1, df2], axis = 1)

Unnamed: 0,A,B,C,D
0,1,4,7.0,10.0
1,2,5,,
2,3,6,,


Let's turn to merging now: `merge` is commonly used when two `DataFrame`s must be connected but do not share the same index or the same set of columns. With `merge` we connect two `DataFrame`s on some common piece of information (the *key*), e.g. a common column. There are four types of `join` operations:

*   `inner`-join: **intersection** of *keys*
*   `outer`-join: **union** of *keys*
*   `left`-join: use *keys* from **left only**
*   `right`-join: use *keys* from **right only**

In [None]:
# Merging
df1 = pd.DataFrame({"A": pd.Series([1, 2, 3]), "B": pd.Series([4, 5, 6])})
df2 = pd.DataFrame({"A": pd.Series([3, 4]), "C": pd.Series([7, 8])})

# "on" defines on what piece of information the DataFrame's will merge
pd.merge(df1, df2, on = 'A', how = 'right')  # if column names differ, use "left_on" and "right_on"

Unnamed: 0,A,B,C
0,3,6.0,7
1,4,,8


In [None]:
pd.merge(df1, df2, on = 'A', how = 'left')

Unnamed: 0,A,B,C
0,1,4,
1,2,5,
2,3,6,7.0


In [None]:
pd.merge(df1, df2, on = 'A', how = 'outer')

Unnamed: 0,A,B,C
0,1,4.0,
1,2,5.0,
2,3,6.0,7.0
3,4,,8.0


In [None]:
pd.merge(df1, df2, on='A', how = 'inner')

Unnamed: 0,A,B,C
0,3,6,7
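
The comment in the first merge example mentions `left_on` and `right_on` for key columns with different names. A minimal sketch of that case follows; the `customers` and `orders` tables and their column names are made up for illustration.

```python
import pandas as pd

# Hypothetical tables whose key columns have different names
customers = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Ana", "Ben", "Cleo"]})
orders = pd.DataFrame({"buyer": [3, 4], "amount": [7, 8]})

# Inner join: only keys present in both tables survive
merged = pd.merge(customers, orders,
                  left_on="customer_id", right_on="buyer", how="inner")
print(merged)
```

Because the key columns have different names, both of them are kept in the result, and one of them can be dropped afterwards if it is redundant.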


---
## 8. String manipulation

### 8.1 Simple operations

You may find yourself working with text data, a.k.a. strings. 

Strings are actually sequences, just like lists, and can be sliced analogously:

In [None]:
x = "one python string"
x[4:10]

'python'

In [None]:
# You can also turn a string into a list of strings via the "split" method:
x = "one python string"
y = x.split(" ")
y == ["one", "python", "string"]

True

In [None]:
# The reverse is also possible via the "join" method:
space = " "
z = space.join(y)
z == x

True

You can also make everything lower (or upper) case, replace certain substrings with other substrings, and check for the existence of a substring with `in`:

In [None]:
z = "My Python String"
z.lower()

'my python string'

In [None]:
z.upper()

'MY PYTHON STRING'

In [None]:
w = z.replace("Python", "R")
print(w)
"Python" in w

My R String


False

There are many more easy-to-use, built-in tools for working with text data in Python. You can read more here:

*   https://docs.python.org/3/library/stdtypes.html#string-methods

### 8.2 Regular expressions

A *regular expression* is a special sequence of characters that helps you match or find other strings or sets of strings, using a specialized syntax held in a pattern. Regular expressions are used in many languages; we'll also review them in the `R` session. Python uses the module `re` to deal with them.

Let's pretend we have an `html` text. Find **all** words tagged as bold (`<b>word</b>`) in `html`, turn them to italics (`<i>word</i>`) and add the word "freaking" before them.

In [None]:
import re

a_text = 'Why am I taking this <b>course</b>? I want to go back to the <b>holidays</b>.'

# Prefix 'r' marks a raw string, so backslashes reach the regex engine unchanged
the_regex = r'<b>([a-z\s]+)</b>'
replacement = r'<i>freaking \1</i>'

# Function to replace a regular expression with another
re.sub(the_regex, replacement, a_text)

'Why am I taking this <i>freaking course</i>? I want to go back to the <i>freaking holidays</i>.'

Some helpful functions for working with regular expressions in `re`:

*  `re.search(pattern, string)`: scan through `string` looking for the first location where the regular expression `pattern` produces a match, and return a corresponding match object.
*  `re.match(pattern, string)`: if zero or more characters at the beginning of `string` match the regular expression `pattern`, return a corresponding match object.
*  `re.split(pattern, string)`: split `string` by the occurrences of `pattern`
*  `re.sub(pattern, repl, string)`: return the string obtained by replacing the leftmost non-overlapping occurrences of `pattern` in `string` by the replacement `repl`
*  `re.findall(pattern, string)`: return *all* non-overlapping matches of `pattern` in `string` as a list
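
The following sketch exercises several of the functions listed above on one made-up sentence. The text and the variable names are assumptions chosen for illustration.

```python
import re

text = "Call 555-1234 or 555-9876 before 5pm"
phone = r"\d{3}-\d{4}"

# re.search finds the first match anywhere in the string
first = re.search(phone, text).group()

# re.match only succeeds if the pattern matches at the very start
at_start = re.match(phone, text)          # no match: the text starts with "Call"
starts_with_call = re.match(r"Call", text) is not None

# re.findall returns every non-overlapping match as a list
numbers = re.findall(phone, text)

# re.split cuts the string wherever the pattern occurs
parts = re.split(r"\s+or\s+", text)

print(first, at_start, starts_with_call)
print(numbers)
print(parts)
```

The difference between `re.search` and `re.match` is a common source of confusion: `re.match` returns `None` here because the phone number does not sit at the beginning of the string.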

Regular expressions are relatively complex and there is a long list of combinations we can make, which are well outside the scope of this introduction. We limit to mentioning some special characters to form regular expressions:

*  `^`: Start of string
*  `$`: End of string
*  `.`: One character, no matter which (except line break)
*  `*`: match 0 or more repetitions of the preceding RE
*  `+`: match 1 or more repetitions of the preceding RE
*  `?`: match 0 or 1 repetition of the preceding RE
*  `{m,n}`: Causes the resulting RE to match from m to n repetitions of the preceding RE. The comma and the n are optional depending on the case
*  `|`: OR. E.g. A|B means match RE A or B
*  `(...)`: Matches whatever regular expression is inside the parentheses
*  `[]`: Defines a subset of characters to match
*  `\`: Either escapes special characters (permitting you to match characters like '*', '?', and so forth), or signals a special sequence (e.g. \s means a white space).
*  `\w`: Matches a word character (letters, digits and the underscore; equivalent to `[a-zA-Z0-9_]` for ASCII text)
*  `\W`: Matches any non-word character (equivalent to `[^a-zA-Z0-9_]` for ASCII text)
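
A few of these special characters in action, as a minimal sketch. The sample strings are made up, and the comments describe what each pattern is meant to show.

```python
import re

# ^ and $ anchor the pattern to the start and end of the string
anchored = [bool(re.search(r"^abc$", s)) for s in ["abc", "xabc"]]

# {m,n}: runs of two to three digits
digit_runs = re.findall(r"\d{2,3}", "7 42 12345")

# \w matches word characters (letters, digits, underscore); \W matches the rest
words = re.findall(r"\w+", "snake_case, v2!")
non_word = re.findall(r"\W", "snake_case, v2!")

# | offers alternatives; (...) captures a group, reused as \1 in the replacement
singular = re.sub(r"(cat|dog)s", r"\1", "cats and dogs")

print(anchored)
print(digit_runs)
print(words, non_word)
print(singular)
```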

---
## 9. Some advanced concepts


### 9.1 Copy vs. assignment

Look at the following example:

In [None]:
a = [1, 2, 5]
b = a
b[2] = 10

In [None]:
print(a)
print(b)

[1, 2, 10]
[1, 2, 10]


What happens is that `a` and `b` really point to the same place in memory and share the same data. To create an object that *copies* the data in `a` but does not *share* it with `a`, use the method `copy()`.

In [None]:
a = [1, 2, 5]
b = a.copy()
b[2] = 10

In [None]:
print(a)
print(b)

[1, 2, 5]
[1, 2, 10]


This goes a bit deeper into how `python` uses memory in your computer, but it is important to have a grasp of it in order to avoid potential problems when memory becomes an issue.

In [None]:
a = [1, [2, 3], 5, 'abc'] 
b = a.copy()
b[1].append(100)

In [None]:
print(a)
print(b)

[1, [2, 3, 100], 5, 'abc']
[1, [2, 3, 100], 5, 'abc']


Things become a little trickier when you deal with lists of lists; for this reason there is also `deepcopy`.


(What happened here is that both `a` and `b` still point to the same inner list `[2, 3]`.)

In [None]:
# Try now 
from copy import deepcopy
b = deepcopy(a)
b[1].pop()  # Removes (last) item on the list
print(a)
print(b)

[1, [2, 3, 100], 5, 'abc']
[1, [2, 3], 5, 'abc']
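
One way to check which objects are shared is the `is` operator, which tests whether two names refer to the very same object in memory. The sketch below is an illustration with made-up variable names (`alias`, `shallow`, `deep`) and compares the three situations side by side.

```python
from copy import deepcopy

a = [1, [2, 3], 5, 'abc']
alias = a            # plain assignment: a second name for the same list
shallow = a.copy()   # new outer list, but the inner list is shared
deep = deepcopy(a)   # new outer list and new inner list

print(alias is a)          # same outer object
print(shallow is a)        # different outer object
print(shallow[1] is a[1])  # inner list still shared
print(deep[1] is a[1])     # inner list copied as well
```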


### 9.2 Error handling

Sometimes problems come up in the code that make it not function properly. When that happens and no error is raised, the problem could be anywhere, which makes it very hard to fix.

Setting up informative *errors* when something is not working well, on the other hand, tells us where things go wrong and points us to what we need to fix.

In [None]:
# Errors in Python are called Exceptions. 
e = Exception()
type(e)

Exception

In [None]:
# Exceptions are raised with the "raise" keyword: 
raise Exception('Oops!')

Exception: Oops!

Every *part* of your program, for example each function, must be in charge of things going as expected inside its body. If something goes wrong, it should tell us what happened.

One way to do this is to check for possible problems before they occur:

In [None]:
def age_a_person(person):
    if not hasattr(person, 'age'):
        raise Exception(f'The person must have an age attribute! Given: {person}')
    return person.age + 1

age_a_person('notaperson')

Exception: The person must have an age attribute! Given: notaperson

Python encourages a different style: first try, then *catch* any expected errors that occur and handle them at that point. By *catching an error* we mean the following:


In [None]:
def age_a_person(person):
    try:
        return person.age + 1
    except AttributeError as e:
        raise Exception(f'The person must have an age attribute! Given: {person}') from e

age_a_person('notaperson')

Exception: The person must have an age attribute! Given: notaperson
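
The caller can catch the re-raised error in turn and carry on. The sketch below is one possible arrangement, not the only one: it swaps the generic `Exception` for the more specific built-in `TypeError` (a choice made here so the caller can catch exactly that error), and the `Person` class is a hypothetical stand-in.

```python
class Person:
    """Hypothetical minimal class with an age attribute."""
    def __init__(self, age):
        self.age = age


def age_a_person(person):
    try:
        return person.age + 1
    except AttributeError as e:
        # Assumption: TypeError instead of Exception, so callers can be specific
        raise TypeError(f'The person must have an age attribute! Given: {person!r}') from e


results = []
for candidate in [Person(30), 'notaperson']:
    try:
        results.append(age_a_person(candidate))
    except TypeError as err:
        # Handle the expected problem and keep processing the other items
        results.append(f'skipped: {err}')

print(results)
```

Catching a specific exception type keeps unexpected errors visible: anything that is not a `TypeError` still stops the program and shows its traceback.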

### 9.3 Default values in functions

In `python` we get to assign default values to inputs of functions. For example:

In [None]:
# New function "f"
def f(a = 1, b = 2):
    return a + b


In [None]:
# Function "f" can validly be called in the following ways
print(f())
print(f(10))
print(f(b = 4))
print(f(10, 4))
print(f(a = 10, b = 4))
print(f(b = 4, a = 10))

3
12
5
14
14
14


In [None]:
# But NOT like this
f(a = 10, 4)

SyntaxError: positional argument follows keyword argument (2957630882.py, line 2)

---
## 10. Particular topics

### 10.1 Date and time objects

We now introduce how to deal with time and date objects in Python. To do this, we need a module called `datetime`, which also contains a class named `datetime`.

In [None]:
from datetime import datetime  # This is a module and a class

# Current date
current_time = datetime.now()
print(current_time)
print(type(current_time))

2022-09-28 11:02:52.847914
<class 'datetime.datetime'>


`current_time` is of class `datetime`, but this is just one of five distinct time-object classes:

*   `datetime`: allows us to work with times and dates together (month, day, year, hour, minute, second, microsecond).
*   `date`: works with dates only (month, day, year), independent of time.
*   `time`: works with time only (hour, minute, second, microsecond), independent of date.
*   `timedelta`: a duration of time used for measuring the distance between two time points.
*   `tzinfo`: to deal with time zones.

In [None]:
from datetime import date

date.today()


datetime.date(2022, 9, 28)

The most common scenario with time data is translating from and to regular strings, which is the most frequent format that we will encounter when importing times and dates. We use two functions for this, `strptime()` (string to time) and `strftime()` (time to string):

In [None]:
today_date = '2022-01-04'

# Create date object in given time format yyyy-mm-dd
today_date = datetime.strptime(today_date, '%Y-%m-%d')

print(today_date)
print(type(today_date))

2022-01-04 00:00:00
<class 'datetime.datetime'>


Note we used the *pattern* `%Y-%m-%d` to indicate the year-month-day format in which the string is written. A full list of available patterns can be found in the library's [documentation](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior).

In [None]:
print('* Month:', today_date.month)  # Get month from date
print('* Year:', today_date.year)  # Get year from date
print('* Day of month:', today_date.day)  # ...
print('* Day of Week (number):', today_date.weekday())  # Monday is 0 (recall zero-based indexing)

* Month: 1
* Year: 2022
* Day of month: 4
* Day of Week (number): 1


In [None]:
print('* Hour: ', current_time.hour)
print('* Minute: ', current_time.minute)
print(current_time.isocalendar())  # Returns (year, # week, # day)

* Hour:  11
* Minute:  2
datetime.IsoCalendarDate(year=2022, week=39, weekday=3)


As mentioned, you can convert from time to string with the function `strftime()`.

In [None]:
today_str = today_date.strftime('%Y-%m-%d')  # called as a method of the datetime object
type(today_str)

str
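
The two directions can be combined into a round trip. The sketch below parses a string written in a day/month/year layout and writes it back out in two formats; the input string and variable names are made up for illustration.

```python
from datetime import datetime

raw = '04/01/2022 18:30'  # day/month/year hour:minute

# String -> datetime: the pattern must describe how the string is written
parsed = datetime.strptime(raw, '%d/%m/%Y %H:%M')

# datetime -> string, in a different layout and in the original one
iso_like = parsed.strftime('%Y-%m-%d %H:%M')
back = parsed.strftime('%d/%m/%Y %H:%M')

print(parsed)
print(iso_like)
print(back == raw)
```

If the pattern does not match the string, `strptime` raises a `ValueError`, which is a useful early warning that dates in a file are not in the format you assumed.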

Recall we mentioned that to measure time spans, or to operate on dates or times (add/subtract), we could use the `timedelta` type of object. Mind you, these objects need not be anchored on a specific date: they can represent a generic time frame.

In [None]:
from datetime import timedelta

# timedelta objects
three_weeks = timedelta(weeks = 3)
one_year = timedelta(days = 365)

print(three_weeks)
print(type(three_weeks))
print(three_weeks.days)
print(one_year.days)

21 days, 0:00:00
<class 'datetime.timedelta'>
21
365


Let us now operate on these objects.

In [None]:
from datetime import datetime, timedelta

# Current time
now = datetime.now()
print("Today's date: ", str(now))

# Add two weeks to current date
two_weeks = timedelta(days = 14)
now_in_2weeks = now + two_weeks
print('Date after two weeks: ', now_in_2weeks)

# Subtract two years from current date
two_years = timedelta(days = 730)
two_years_ago = now - two_years
print('Date two years ago: ', two_years_ago)
print(type(two_years_ago))

Today's date:  2022-09-28 11:17:11.452195
Date after two weeks:  2022-10-12 11:17:11.452195
Date two years ago:  2020-09-28 11:17:11.452195
<class 'datetime.datetime'>


In [None]:
from datetime import date

# Create two dates
date1 = date(2011, 5, 28)
date2 = date(2015, 6, 6)
# create two dates with year, month, day, hour, minute, and second
date1b = datetime(2011, 5, 28, 23, 1, 0)
date2b = datetime(2015, 6, 6, 22, 52, 10)

# Difference between two dates
date_diff = date2 - date1
date_diffb = date2b - date1b
print("Time difference (days): ", date_diff.days)
print("Time difference: ", date_diffb)
print(type(date_diff))

Time difference (days):  1470
Time difference:  1469 days, 23:51:10
<class 'datetime.timedelta'>
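
A point worth knowing when reading a `timedelta`: `.days` and `.seconds` are *components* of the duration, not totals. The sketch below reuses the two timestamps from the example above (the variable names `start`, `end` and `diff` are new here) and shows how to get the whole duration in a single unit.

```python
from datetime import datetime, timedelta

start = datetime(2011, 5, 28, 23, 1, 0)
end = datetime(2015, 6, 6, 22, 52, 10)
diff = end - start

# Components: whole days, plus the leftover seconds within the last day
print(diff.days)
print(diff.seconds)

# Totals: the full duration expressed in one unit
total_seconds = diff.total_seconds()
total_hours = diff / timedelta(hours=1)
print(total_seconds, total_hours)
```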


In [None]:
# To work with time zones:
from pytz import timezone

# Create timezone US/Eastern
est = timezone('US/Eastern')

# Re-set date to local time
loc_time = est.localize(datetime(2015, 6, 6, 22, 52, 10))
print(loc_time)

2015-06-06 22:52:10-04:00


In [71]:
from datetime import datetime
datetime.today()

datetime.datetime(2022, 9, 28, 11, 51, 46, 176673)

You can also work with time objects using `pandas`. You can convert text strings into `pandas` datetime and timedelta objects using:

*  `to_datetime()`: to convert string dates/times to `datetime` objects.
*  `to_timedelta()`: to convert values to `Timedelta` objects, which represent differences in times.


In [74]:
import pandas as pd
import numpy as np

# String to datetime
good_date = pd.to_datetime("6th of June, 2015")
print(good_date)

# Create date series with to_timedelta() (uses numpy)
date_series = good_date + pd.to_timedelta(np.arange(12), 'D')
print(date_series)

# Create date series using date_range() function
date_series = pd.date_range('06/06/2015', periods = 12, freq = 'D')
print(date_series)

2015-06-06 00:00:00
DatetimeIndex(['2015-06-06', '2015-06-07', '2015-06-08', '2015-06-09',
               '2015-06-10', '2015-06-11', '2015-06-12', '2015-06-13',
               '2015-06-14', '2015-06-15', '2015-06-16', '2015-06-17'],
              dtype='datetime64[ns]', freq=None)
DatetimeIndex(['2015-06-06', '2015-06-07', '2015-06-08', '2015-06-09',
               '2015-06-10', '2015-06-11', '2015-06-12', '2015-06-13',
               '2015-06-14', '2015-06-15', '2015-06-16', '2015-06-17'],
              dtype='datetime64[ns]', freq='D')


In [75]:
# Create a DataFrame with date as a column
data = pd.DataFrame()
data['date'] = date_series
data.head()

Unnamed: 0,date
0,2015-06-06
1,2015-06-07
2,2015-06-08
3,2015-06-09
4,2015-06-10


In [76]:
# Extract year, month, day, hour, and minute; and assign to new columns
data['year'] = data['date'].dt.year
data['month'] = data['date'].dt.month
data['day'] = data['date'].dt.day
data['hour'] = data['date'].dt.hour
data['minute'] = data['date'].dt.minute
data.head()

Unnamed: 0,date,year,month,day,hour,minute
0,2015-06-06,2015,6,6,0,0
1,2015-06-07,2015,6,7,0,0
2,2015-06-08,2015,6,8,0,0
3,2015-06-09,2015,6,9,0,0
4,2015-06-10,2015,6,10,0,0


### 10.2 Web scraping and `html` parsing

Web *scraping* deals with the retrieval of information featured on a web page. It basically consists of reading the content at a URL and then *parsing* it to extract the relevant information that you are looking for. This is an extensive topic in itself, so we will just introduce the plain basics to get you started.

Before anything, note that most information in an HTML source is of little use to us: it is there to render and format the webpage itself. It is therefore wise to familiarise yourself with HTML tags as and when needed. We must also consider some design aspects of creating a *spider* that will *crawl* the target webpage and get us the desired content.

1.   Identify tags that contain useful information
2.   Add randomised waiting periods between every access to website
3.   Make use of logs to monitor progress
4.   Regularly write collected data to an external file
5.   **Always** respect `robots.txt` file defined by websites
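
Points 2 to 5 above can be sketched with the standard library alone. Everything specific in this sketch is an assumption made for illustration: the `fetch` function is a placeholder that downloads nothing, the URLs and the bot name are made up, the `robots.txt` rules are parsed from a hard-coded list instead of being downloaded, and an in-memory buffer stands in for the external file.

```python
import csv
import io
import logging
import random
import time
from urllib import robotparser

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("spider")

# Hypothetical robots.txt rules, parsed offline so the sketch needs no network
robots = robotparser.RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /private/"])


def fetch(url):
    """Placeholder for a real download; returns fake content."""
    return f"<html>content of {url}</html>"


urls = ["https://example.org/a",
        "https://example.org/private/b",
        "https://example.org/c"]

out = io.StringIO()  # stands in for a file opened with open(..., "a")
writer = csv.writer(out)
collected = []

for url in urls:
    # 5. Respect robots.txt
    if not robots.can_fetch("my-course-bot", url):
        log.info("skipping %s (disallowed by robots.txt)", url)
        continue
    html = fetch(url)
    # 4. Write collected data as we go, so a crash does not lose everything
    writer.writerow([url, len(html)])
    collected.append(url)
    # 3. Log progress
    log.info("fetched %s", url)
    # 2. Randomised wait (a real crawler would wait seconds, not milliseconds)
    time.sleep(random.uniform(0.01, 0.05))

print(collected)
```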

We use Wikipedia as an example. The target will be to retrieve the summary of the featured article on Wikipedia along with all the URLs in it.

In [77]:
from bs4 import BeautifulSoup
#import urllib  # If you're using Python 2.x
import urllib.request  # If you're using Python 3

url = 'https://en.wikipedia.org/wiki/Main_Page'
#html = urllib.urlopen(url).read()  # Python 2.x
with urllib.request.urlopen(url) as url_content:
    html = url_content.read()

# BeautifulSoup parses html/xml into a navigable tree; we name the parser explicitly
soup = BeautifulSoup(html, 'html.parser')
print(soup)

<!DOCTYPE html>

<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>Wikipedia, the free encyclopedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"94bc46c3-d81b-44fe-8249-d71333f40661","wgCSPNonce":false,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Main_Page","wgTitle":"Main Page","wgCurRevisionId":1108085777,"wgRevisionId":1108085777,"wgArticleId":15580374,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles with short description","Short description is different from Wikidata"],"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgRelevan

In [78]:
print(type(soup))
print(soup.prettify()[0:1000])

<class 'bs4.BeautifulSoup'>
<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Wikipedia, the free encyclopedia
  </title>
  <script>
   document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"94bc46c3-d81b-44fe-8249-d71333f40661","wgCSPNonce":false,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Main_Page","wgTitle":"Main Page","wgCurRevisionId":1108085777,"wgRevisionId":1108085777,"wgArticleId":15580374,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles with short description","Short description is different from Wikidata"],"wgPageContentLanguage":"e

The text is structured with HTML tags organised in a systematic way. This structure is referred to as an *HTML DOM* (Document Object Model).

We are interested in grabbing the summary of the article of the day from the Wikipedia main page, along with all the links in it. So we'll use `BeautifulSoup`'s functions to parse the HTML tags that contain this information or lead us to it in the form of embedded URLs.

On inspection of the HTML source we identify that the article summary is enclosed in a `table` tag with the attribute `id = mp-upper`. The idea is as follows:

1.   Using the `.find_all()` method in `BeautifulSoup`, search for all tags of the type `table`.
2.   Search the returned list for the tag that has an attribute `id = mp-upper`.
3.   Now that you have the table, notice that the information needed is in the tags called `p`. Search for all `p` tags.
4.   Identify the correct paragraph and extract URLs.
5.   Extract text.
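
Because live pages change their layout over time, it can help to rehearse these five steps on a fixed piece of HTML first. The fragment below is a made-up stand-in for a downloaded page (only the `id = mp-upper` attribute is taken from the description above), so this is a sketch of the procedure under that assumption, not a picture of the real Wikipedia source.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment standing in for a downloaded page
html = """
<table id="other"><tr><td><p>Not this one</p></td></tr></table>
<table id="mp-upper"><tr><td>
<p>The <a href="/wiki/Python">Python</a> language is <b>fun</b>.</p>
</td></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# Steps 1-2: collect all tables, keep the one with the right id
article_table = None
for table in soup.find_all(name="table"):
    if table.get("id") == "mp-upper":  # .get returns None if there is no id
        article_table = table

# Step 3: first paragraph inside that table
paragraph = article_table.find_all(name="p")[0]

# Step 4: URLs of all links in the paragraph
urls = [tag["href"] for tag in paragraph.find_all("a", href=True)]

# Step 5: text of the paragraph with the tags stripped
text = paragraph.get_text()

print(urls)
print(text)
```

Starting with `article_table = None` makes a failed search easy to detect: if no table carries the expected `id`, the variable is still `None` and you know the page structure has changed.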



In [79]:
# 1. Look for all tags called table
tables_list = soup.find_all(name = "table")
tables_list

[<table role="presentation" style="margin:0 3px 3px; width:100%; box-sizing:border-box; text-align:center; background-color:transparent; border-collapse:collapse; padding:0.9em">
 <tbody><tr>
 <td style="padding:0 0.9em; text-align:left;">
 </td></tr>
 <tr>
 <td style="padding:0 0.9em; text-align:left;">
 <ul class="gallery mw-gallery-packed center">
 <li class="gallerybox" style="width: 140px"><div style="width: 140px">
 <div class="thumb" style="width: 138px;"><div style="margin:0px auto;"><a class="image" href="/wiki/File:US-NBN-IL-Lebanon-2057-Orig-1-400-C.jpg" title="$1 (original): First National Bank, Lebanon, Indiana"><img alt="Obverse and reverse of a one-dollar National Bank Note" data-file-height="3048" data-file-width="3500" decoding="async" height="120" src="//upload.wikimedia.org/wikipedia/commons/thumb/b/b5/US-NBN-IL-Lebanon-2057-Orig-1-400-C.jpg/207px-US-NBN-IL-Lebanon-2057-Orig-1-400-C.jpg" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/b/b5/US-NBN-IL-Lebanon-20

In [80]:
# 2. Search for the table of interest by matching on the key 'id'
for table in tables_list:
    try:
        if table['id'] == 'mp-upper':
            article_table = table
    except KeyError:  # this table has no 'id' attribute
        pass

print('object type of "article_table" is ' + str(type(article_table)) + '\n')

NameError: name 'article_table' is not defined

In [81]:
# 3. Search for the tag "p"
paragraph = article_table.find_all(name = 'p')[0]
paragraph

NameError: name 'article_table' is not defined

In [None]:
# 4. Extract URLs
urls = [tag['href'] for tag in paragraph.find_all('a', href = True)]
for url in urls:
    print(url)

print('\n')

NameError: name 'paragraph' is not defined

In [None]:
# 5. Build the text of the article summary by looping through all the children and concatenating the text
text = ''
for ch in paragraph.children:
    text = text + (ch.string or '')  # .string is None for tags with several children

print(text + '\n')

NameError: name 'paragraph' is not defined