In [None]:
import time
import numpy as np
import pandas as pd

# Exercise 2: Parsing text files

In this exercise, we learn how to use Python to parse a text file. In particular, we will see how to transform regular for-loop code into "pythonic" code using so-called list comprehensions, resulting in equivalent but more compact and easier-to-read code. The goal of the exercise is to implement an algorithm that extracts the hidden word from a given acrostic text. An acrostic is a text in which the first letters of the lines form a word when concatenated. This algorithm has the following pseudocode:

In [None]:
# Load 'acrostic.txt' (read mode only) inside a 'File' object 'text':
text = open("acrostic.txt", 'r')
print(type(text))
# Close 'acrostic.txt':
text.close()
print(acrostic_word)

Never forget to call the 'close()' method once you are done processing your text file! A convenient way to avoid having to remember this rule is to use a context manager:

In [None]:
with open("acrostic.txt", 'r') as text:
    # 'acrostic.txt' is now open in 'text'!
    '''
    Extract the first letter of each line in 
    the variable 'text' to reconstruct the 
    hidden word in a variable 'acrostic_word'
    '''
# 'acrostic.txt' is now closed !
print(acrostic_word)

With the context manager syntax, the file is closed automatically as soon as the indented block ends! The two syntaxes read the file in the same way; the context manager additionally closes the file if an error is raised inside the block.
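The difference between the two ways of opening a file can be checked on a small throwaway file. The sketch below is only an illustration: the file name 'demo.txt' and its contents are made up, and a temporary directory is used so that nothing is left on disk.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create a small throwaway file (name and contents are made up).
    path = os.path.join(tmp_dir, "demo.txt")
    with open(path, "w") as f:
        f.write("first line\nsecond line\n")

    # Explicit open()/close(): try/finally guarantees that close()
    # is called even if reading raises an exception.
    text = open(path, "r")
    try:
        lines_manual = text.readlines()
    finally:
        text.close()

    # Context manager: the file is closed when the indented block ends.
    with open(path, "r") as text:
        lines_with = text.readlines()

    print(text.closed)                 # the object still exists, but is closed
    print(lines_manual == lines_with)  # both approaches read the same lines
```

The 'closed' attribute of a file object is a convenient way to verify, while debugging, that a file has really been released.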

1) Load the dataset 'acrostic.txt' inside a 'File' object 'text', then extract the first letter of each line in the variable 'text' to reconstruct the hidden word in a variable 'acrostic_word'. Hint: loop over the elements returned by the 'File.readlines()' method. Also remember that to access the j-th character in a string s (counting from 0), you would write '$s[j]$'.

It is often possible to transform your for loops into list comprehensions, which makes your code more compact, more elegant and sometimes even more efficient. A simple example is as follows: suppose we are given the integers in $[0,N[$ and we want to compute the sum of their squares, that is: compute $S=0^{2}+\dots+(N-1)^{2}$.

In [None]:
N = 10000000

t0 = time.monotonic()

# Regular for loop -> 3 lines:
S = 0
for n in range(N):
    S += n * n
    
t1 = time.monotonic()
print(t1 - t0)
print(S)

In [None]:
N = 10000000

t0 = time.monotonic()

# List comprehension -> 1 line:
S = sum([n * n for n in range(N)])

t1 = time.monotonic()
print(t1 - t0)
print(S)
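A possible variation, not required by the exercise: removing the square brackets turns the list comprehension into a generator expression, which hands the values to 'sum()' one at a time instead of first building a list of N elements. The sketch below uses a smaller N so that it runs quickly; the timings will depend on your machine.

```python
import time

N = 1_000_000  # smaller than in the cells above so this runs quickly

t0 = time.monotonic()
# Generator expression: no intermediate list is built in memory.
S_gen = sum(n * n for n in range(N))
t1 = time.monotonic()
print(t1 - t0)
print(S_gen)

# Sanity check against the closed form 0^2 + ... + (N-1)^2 = (N-1)N(2N-1)/6.
print(S_gen == (N - 1) * N * (2 * N - 1) // 6)
```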

2) Transform your code for 1) with a list comprehension. Hint: consider using the 'String.join()' method.

3) If not already done, refactor your code into a function 'extract_acrostic_word()' that takes as input the path to a text file and returns the acrostic word hidden in the text. 
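To illustrate the 'String.join()' hint from question 2) on an unrelated task, here is a minimal sketch that upper-cases a made-up list of words, first with a regular for loop and then with a list comprehension. The variable names are chosen for this illustration only.

```python
words = ["list", "comprehensions", "are", "compact"]

# Regular for loop: build the list of upper-cased words step by step.
shouted = []
for word in words:
    shouted.append(word.upper())
joined_loop = " ".join(shouted)

# Same result with a list comprehension passed directly to join().
joined_comprehension = " ".join([word.upper() for word in words])

print(joined_comprehension)
print(joined_loop == joined_comprehension)
```

Note that 'join()' is called on the separator string (here a single space) and receives the list of strings to glue together.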

# Exercise 3: The danger with nested for loops

In this small exercise, we show that choosing the right order for your nested loops is crucial in Python to avoid bad performance.

1) Implement a nested loop that does nothing (use the 'pass' instruction), with the outer loop indexed on $[0,2[$ and the inner loop indexed on $[0,N[$ where N is very large. 

2) Same as in 1), but this time with the outer loop indexed on $[0,N[$ and the inner loop indexed on $[0,2[$.

3) Compare the performance of the two nested loops (measure the time needed to run each nested loop). Do so for increasing values of N (you can try $10^{5}$, $10^{6}$, etc.). What can you conclude?

In [None]:
t0 = time.monotonic()
### Implement the first nested loop here
t1 = time.monotonic()
print(t1 - t0) # Time needed to run the first nested loop

In [None]:
t0 = time.monotonic()
### Implement the second nested loop here
t1 = time.monotonic()
print(t1 - t0) # Time needed to run the second nested loop

# Exercise 4: Writing efficient Python code

In this exercise, we want to emphasize how different implementations of the same algorithm can lead to dramatic changes in performance in Python. This is particularly true when your code can be vectorized, that is, when you can operate directly on matrices and vectors instead of relying on Python's for loops. We'll illustrate this with the following simple algorithm: given an integer N, compute the sum of all odd numbers minus the sum of all even numbers in $[0,N[$, that is (assuming N is even): compute $S=-0+1-2+\dots-(N-2)+(N-1)$.

1) Implement the algorithm "naively": initialize S=0, then in a for loop indexed on $[0,N[$, check if the current number is odd or even and add it to or subtract it from S accordingly. Hint: to assess parity, use the modulo operator ('%' in Python).

2) Implement the algorithm "smartly": create two vectors, one containing all odd numbers and the other containing all even numbers in $[0,N[$, then use Python's 'sum()' function on those vectors. Hint: consider Python's 'range()' function to easily construct the two vectors.

3) Implement the algorithm the "numpy" way: you can for instance replace Python's 'range()' and 'sum()' functions by the functions 'numpy.arange()' and 'numpy.sum()'. 

4) Compare the performance of your three implementations for increasing values of N. What do you notice about the relative performance of these three implementations of the same algorithm?
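As an illustration of vectorization and of how such a comparison can be organized, the sketch below applies the idea to the sum of squares from Exercise 2 rather than to the alternating sum asked for here. The helper 'time_call()' and the two function names are hypothetical, introduced only for this example.

```python
import time
import numpy as np

def time_call(func, *args):
    """Hypothetical helper: run func(*args) once, return (result, elapsed seconds)."""
    t0 = time.monotonic()
    result = func(*args)
    t1 = time.monotonic()
    return result, t1 - t0

def sum_squares_loop(N):
    # Plain Python for loop over [0, N[.
    S = 0
    for n in range(N):
        S += n * n
    return S

def sum_squares_numpy(N):
    # int64 is requested explicitly so that the squares do not overflow
    # on platforms where numpy's default integer is 32 bits wide.
    v = np.arange(N, dtype=np.int64)
    return int(np.sum(v * v))

for N in (10**4, 10**5, 10**6):
    S_loop, t_loop = time_call(sum_squares_loop, N)
    S_numpy, t_numpy = time_call(sum_squares_numpy, N)
    print(N, S_loop == S_numpy, t_loop, t_numpy)
```

A single run per value of N is a rough measurement; repeating each call a few times and keeping the smallest time usually gives more stable numbers.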

# Exercise 5: Loading and preprocessing data with pandas and numpy

In this exercise, we will see how to easily and efficiently manipulate datasets thanks to numpy and pandas. This is an important exercise in the sense that throughout this course we will deal with potentially large datasets, thus it is crucial to do it the "right way".

1) Load the dataset 'colors_dataset.csv' inside a DataFrame structure using the 'pandas.read_csv()' function and print it. Inspect the dataset. What kind of dataset is this?

2) Convert the DataFrame into a numpy array using the 'pandas.DataFrame.to_numpy()' method and print it. Between pandas DataFrames and numpy arrays, which is better for visualization purposes?

3) Preprocess the dataset:
- Store the first column of your numpy array (the labels) into a variable y, then convert y to a numpy array of strings using the 'numpy.ndarray.astype()' method.
- Store the next three columns (the features) into a variable X (we choose to discard the last column since it is uninformative), then convert X to a numpy array of floats using the 'numpy.ndarray.astype()' method. Finally, rescale all values in X by a factor of 1/255.
Hint: to select columns of a numpy array, we use "indexing" (see https://numpy.org/doc/stable/reference/arrays.indexing.html for a good introduction to this topic).

4) How many different labels are there in the dataset? What is the number of samples for each label? Hint: you can use the 'numpy.unique()' function, with the 'return_counts' flag set to True.

5) Let us investigate the dataset further: for each label in the dataset, extract from X the features of all samples with this label, then compute the mean value of those features over the samples (we have three features, so you should end up with a mean vector of size 3). Hint: to extract the rows of a matrix X for which a certain boolean condition is True, you would write '$X[cond,:]$' where cond is a boolean vector whose size is the number of rows in X. In order to compute the mean vector of a matrix X along a given dimension d, you would write 'numpy.mean(X, axis=d)' (d=0 averages over the rows and returns one value per column, d=1 averages over the columns and returns one value per row). Do the mean vectors look consistent for each label?

6) It is often crucial in machine learning that the data is properly shuffled in order to learn good models. Therefore, it is usually recommended to shuffle the data as a preprocessing step, even if the data appears to be shuffled already. Shuffle your numpy arrays X and y (they must be shuffled by the same permutation!). Hint: you can build a permutation using the functions 'numpy.random.shuffle()' and 'numpy.arange()'.
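The numpy operations named in the hints above can be tried out on a tiny made-up array before applying them to the real dataset. The sketch below is only an illustration: the table 'data' is invented (it is not 'colors_dataset.csv'), and the labels "a" and "b" are placeholders.

```python
import numpy as np

# Made-up toy table: one label followed by three values in [0, 255].
data = np.array([
    ["a", 255, 0, 0],
    ["b", 0, 255, 0],
    ["a", 205, 50, 0],
    ["b", 0, 205, 50],
], dtype=object)

# Indexing: column 0 holds the labels, columns 1 to 3 hold the features.
y = data[:, 0].astype(str)
X = data[:, 1:4].astype(float) / 255.0

# Distinct labels and the number of samples carrying each of them.
labels, counts = np.unique(y, return_counts=True)
print(labels, counts)

# Boolean mask: keep the rows labelled "a", then average over those rows.
cond = (y == "a")
mean_a = np.mean(X[cond, :], axis=0)
print(mean_a)

# One shared permutation keeps each row of X aligned with its label in y.
perm = np.arange(X.shape[0])
np.random.shuffle(perm)  # shuffles 'perm' in place and returns None
X_shuffled, y_shuffled = X[perm], y[perm]
print(y_shuffled)
```

Indexing both arrays with the same 'perm' is what guarantees that features and labels stay paired; shuffling X and y separately would break the correspondence.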

# Exercise 6: Basics of object-oriented programming

Python is a language that fully supports object-oriented programming (OOP). This paradigm is a different way of thinking, organizing and writing your code. Although OOP is not "essential" in machine learning, in some cases it may let us implement non-trivial machine learning algorithms (e.g. decision trees) in a rather intuitive way.

The basic idea behind OOP is to think of your code as a hierarchy of concepts. For instance, if one wishes to make a naval video game, a very broad concept could be "Naval ship". This concept could be divided into subconcepts, like "Submarine" and "Boat", which are two different types of naval ship: though different, they still share some similarities (they are made to be used at sea, they carry a crew...).

These concepts form what we call "classes" in OOP. Objects are instances of a class: you can understand this as the class being the abstraction of an object. As such, in OOP most of the implementation goes into writing a hierarchy of classes. A class consists of two main ingredients: attributes and methods. 

As an example, we could implement a class "Submarine" with the attributes 'submerged', 'current_depth', 'max_depth', 'condition', 'size_crew', 'current_position' and with the methods 'go_underwater()', 'reach_surface()' and 'navigate()'. In this example, the role of attributes and methods becomes obvious: attributes are inherent characteristics of the objects abstracted by the class while methods are functions representing the actions objects of the class will be able to perform. 

Since the attributes 'condition', 'size_crew', 'current_position' and the method 'navigate()' are non-specific to submarines but concern all kinds of naval ship, they can be implemented as part of a larger class "NavalShip". Then, by having our "Submarine" class "inherit" from the larger "NavalShip" class, our submarines will also have access to attributes and methods of naval ships.

In Python, our "Submarine" and "NavalShip" classes could have the following implementation:

In [None]:
class NavalShip:
    '''
    Parent class NavalShip
    '''
    # Constructor of the NavalShip class:
    def __init__(self, size_crew, current_position=(0.0,0.0)):
        # Set attributes that are specific to naval ships:
        self.size_crew = size_crew
        self.current_position = current_position
        self.condition = "Good"
        
    def navigate(self, position):
        if self.condition == "Good":
            self.current_position = position
        else:
            print("Repair me first!")
            
    def report_status(self):
        print()
        print("Size crew:", self.size_crew)
        print("Condition:", self.condition)
        print("Current position:", self.current_position)

class Submarine(NavalShip):
    '''
    Child class Submarine that inherits from the parent class NavalShip
    '''
    # Constructor of the Submarine class:
    def __init__(self, max_depth, size_crew, current_position=(0.0,0.0)):
        # Call the constructor of the parent class NavalShip to set attributes that are specific to naval ships:
        NavalShip.__init__(self, size_crew, current_position)
        # Set attributes that are specific to submarines:
        self.max_depth = max_depth
        self.current_depth = 0.0
        self.submerged = False
        
    def go_underwater(self, depth):
        if self.condition == "Good":
            self.current_depth = min(depth, self.max_depth)
            self.submerged = self.current_depth > 0.0
        else:
            print("Repair me first!")
            
    def reach_surface(self):
        self.go_underwater(0.0)
        
    def report_status(self):
        # Call the 'report_status()' method of the parent class NavalShip to report the state of the attributes 
        # that are specific to naval ships:
        NavalShip.report_status(self)
        # Also report the state of the attributes that are specific to submarines:
        print("Current depth:", self.current_depth)
        print("Submerged:", self.submerged)

### In the main:
# Create a submarine object (this will automatically call the 'Submarine.__init__()' constructor):
submarine = Submarine(max_depth=3000.0, size_crew=50)
# Print a status report of the submarine:
submarine.report_status()
# Dive 500m underwater:
submarine.go_underwater(500.0)
# Print a status report of the submarine:
submarine.report_status()
# Navigate to position (726.0, 341.0):
submarine.navigate((726.0, 341.0))
# Print a status report of the submarine:
submarine.report_status()
# Go back to the surface:
submarine.reach_surface()
# Print a status report of the submarine:
submarine.report_status()
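The code above calls the parent class by name ('NavalShip.__init__(self, ...)'). Python also provides the built-in 'super()', which reaches the parent class without naming it. The sketch below shows this alternative on a hypothetical 'Boat' class with a made-up 'has_sails' attribute; a trimmed copy of 'NavalShip' is repeated so that the block runs on its own.

```python
class NavalShip:
    '''
    Trimmed copy of the parent class, repeated to keep this block self-contained
    '''
    def __init__(self, size_crew, current_position=(0.0, 0.0)):
        self.size_crew = size_crew
        self.current_position = current_position
        self.condition = "Good"

    def report_status(self):
        print("Size crew:", self.size_crew)
        print("Condition:", self.condition)
        print("Current position:", self.current_position)


class Boat(NavalShip):
    '''
    Hypothetical child class, used for illustration only
    '''
    def __init__(self, has_sails, size_crew, current_position=(0.0, 0.0)):
        # super() resolves to the parent class, so its name is not repeated:
        super().__init__(size_crew, current_position)
        self.has_sails = has_sails

    def report_status(self):
        # Reuse the parent report, then add the boat-specific attribute:
        super().report_status()
        print("Has sails:", self.has_sails)


boat = Boat(has_sails=True, size_crew=4)
boat.report_status()
print(isinstance(boat, NavalShip))
```

Both styles work for single inheritance as used here; 'super()' has the practical advantage that renaming the parent class does not require touching the child's methods.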

The task of this exercise is the following: reorganize your code from Exercise 5 in order to write a class "ColorsDatasetPreprocessor" whose purpose is to load and preprocess a dataset with the same structure as the 'colors_dataset.csv' dataset. Hint: you could write a single class with methods 'load_dataset()' and 'preprocess_dataset()', where 'load_dataset()' takes as input the path to a '.csv' file and returns two numpy arrays X and y (that may be set as attributes of the class). Note that there are many ways to organize code in OOP, hence the crucial step is to think through and plan your implementation (be it in your head or on paper) before jumping into the writing itself.

# Exercise 7: Learning to debug your code

In this exercise, you are given a series of code snippets that you have to debug. Errors can be obvious, typically when Python interrupts the program and raises an error, but they can also be tricky to spot, for instance when your program does not behave as expected but still does something "valid", or when your program works correctly except for a very specific set of input arguments. In particular, Python has an annoying tendency not to stop you from writing dangerous things, so you should regularly double-check that everything behaves exactly the way you want!

In [None]:
def perform_operation(a, b, operation_name)
    if operation_name == "addition":
        result = a + b
    else operation_name == "substraction":
        result = a - b
    elif operation_name == "multiplication"
        result = a * b
    elif operatin_name == "division":
        result = a / b

output_1, output_2 = perform_operation(2, 0)
print(output)

In [None]:
def compute_factorial(n):
    '''
    Recursively compute n!
    '''
    if n <= 1:
        return 1
    else:
        n * compute_factorial(n-1)
        
output = compute_factorial(5)
print(output)

In [None]:
def Fibonacci(n):
    '''
    Recursively compute F(n) = F(n-1) + F(n-2)
    '''
    Fibonacci = 0
    if n == 1:
        Fibonacci = 1
    elif n > 1:
        Fibonacci = Fibonacci(n-1) + Fibonacci(n-2)
    return Fibonacci
    
output = Fibonacci(3)
print(output)

In [None]:
def multiply_vectors_with_random_permutation(v1, v2):
    '''
    Randomly permute input vectors v1 and v2 
    then compute their pointwise product
    '''
    v1 = np.random.shuffle(v1)
    v2 = np.random.shuffle(v2)
    return v1 * v2
        
output = multiply_vectors_with_random_permutation(np.array([1.0, 2.0]), np.array([3.0, 4.0]))
print(output)

In [None]:
def toss_coins(max_tails=100):
    '''
    Keep tossing a perfectly balanced coin 
    until max_tails tails are obtained
    '''
    coin_tosses = []
    count_tails = 0
    stop = False
    while stop = False:
        new_toss = np.random.randint(low=0, high=2, size=1)
        if new_toss == 0:
            coin_tosses.append("H")
        else:
            coin_tosses.append("T")
        count_tails += 1
        if count_tails == max_tails:
            stop == True
    return np.array(coin_tosses)
        
coin_tosses = toss_coins()
print(coin_tosses)
print("Tails count:", coin_tosses[coin_tosses == "T"].shape[0])