<h2><center>Week 3 - Assignment</center></h2>
<h3><center>Programming for Data Science 2024</center></h3>

Exercises for the topics covered in the third lecture.

The exercise will be marked as passed if you get **at least 10/15** points.

Exercises must be handed in via **ILIAS** (Homework assignments). Deliver your submission as a compressed file (zip) containing one .py or .ipynb file with all exercises. The name of both the .zip and the .py/.ipynb file must be *SurnameName* of the two members of the group. Example: Riccardo Cusinato + Athina Tzovara = *CusinatoRiccardo_TzovaraAthina.zip* .

It's important to use comments to explain your code and show that you're able to take ownership of the exercises and discuss them.

You are not expected to collaborate outside of the group on exercises, and submitting other groups’ code as your own will result in 0 points.

For questions contact: *riccardo.cusinato@unibe.ch* with the subject: *Programming for Data Science 2024*.

**Deadline: 14:00, March 14, 2024.**

<h3 style="text-align:left;">Exercise 1 - Error investigation<span style="float: right">2 points</span></h3>

The code below squares and sums the numbers in the array *arr*, and holds the result in the variable *squared_sum*, which should be 1135. However, that is not the case. Correct the code and explain in a comment, clearly and amply, what was wrong.

In [1]:
import numpy as np

arr = np.array([13, 14, 15, 16, 17], dtype=np.int8)
squared_sum = np.sum(arr ** 2)
squared_sum

-145

In [2]:
###

# The array arr is created with dtype int8, so each element is stored in a single byte (8 bits)
# and can only hold values in the range -128 to 127. This can be verified using:
# np.iinfo(np.int8)

# The problem is the squaring step, not the sum: arr ** 2 keeps the int8 dtype, and every
# square (169, 196, 225, 256, 289) exceeds 127. The values silently wrap around modulo 256
# to [-87, -60, -31, 0, 33], and np.sum then adds these wrapped values, giving -145 instead of 1135.
# To fix the issue, we should use a larger datatype.

# For a 16-bit signed integer, the range is -32,768 to 32,767, which can be verified using:
# np.iinfo(np.int16)

# With int16, each square fits in one element, and the total (1135) is also well within range.
# Therefore, we perform the operation using int16 instead of int8:

arr = np.array([13, 14, 15, 16, 17], dtype=np.int16)
squared_sum = np.sum(arr ** 2)
squared_sum

###

1135
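
To make the location of the overflow concrete, here is a minimal sketch that inspects the intermediate values. The names `squared`, `wrapped_sum`, `still_wrapped`, and `fixed_sum` are introduced only for illustration. It also shows that widening the array before squaring is an alternative to changing the dtype at creation.

```python
import numpy as np

arr = np.array([13, 14, 15, 16, 17], dtype=np.int8)

# Squaring keeps the int8 dtype, so each square wraps around modulo 256
squared = arr ** 2
print(squared.dtype, squared)  # int8 values: -87, -60, -31, 0, 33

# np.sum adds the already wrapped values
wrapped_sum = np.sum(squared)
print(wrapped_sum)  # -145

# Asking np.sum for a wider accumulator does not help: the squares are already wrapped
still_wrapped = np.sum(arr ** 2, dtype=np.int64)
print(still_wrapped)  # -145

# Widening the array before squaring avoids the overflow
fixed_sum = np.sum(arr.astype(np.int64) ** 2)
print(fixed_sum)  # 1135
```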

<h3 style="text-align:left;">Exercise 2 - Vacation selector<span style="float: right">3 points</span></h3>

The code below defines five vacation destinations (*locations*) and four attributes for each (*attributes*). Each row describes one destination, and the columns represent scores on the factors scenery, activities, food, and nightlife.

Write a function *vacation_advisor* that asks the user whether they find each of the attributes important or not, and suggests the best vacation spot based on these preferences.

Use techniques from the third lecture to solve the exercise.

Example interaction:
```python
Is scenery important to you [y/n]?    > y
Is activities important to you [y/n]? > y
Is food important to you [y/n]?       > n
Is nightlife important to you [y/n]?  > n
Based on your preferences, the best destination is Australia
```

In [3]:
# List of destinations
locations = np.array([ "Hawaii", "Thailand", "Italy", "Australia", "Japan" ])

# List of attributes for each destination. Each column is an attribute. Each row a destination.
attributes = np.array([
    [8, 8, 7, 6],
    [7, 9, 8, 7],
    [8, 6, 9, 7],
    [9, 8, 8, 6],
    [7, 9, 7, 8]
])

# Declare attribute names (one per column of attributes, in the same order)
attribute_names = ['scenery', 'activities', 'food', 'nightlife']


In [31]:
###

def vacation_advisor(locations, attributes, attribute_names):
    """ Ask the user which attributes matter to them and print the destination
     with the highest total score over those attributes """

    # First, ask the user whether each attribute is important and store the answers
    # as a boolean array: True if the answer is 'y', False otherwise.
    # The answer is stripped and lower-cased so that ' Y ' also counts as yes.
    ask_user_preferences = np.array([input(f"Is {attribute} important to you [y/n]? ").strip().lower() == 'y' for attribute in attribute_names])

    # Boolean indexing on the columns keeps only the attributes the user marked as important
    extract_important_attributes = attributes[:, ask_user_preferences]

    # Sum the selected scores for each destination (one total per row)
    total_attribute_values = np.sum(extract_important_attributes, axis=1)

    # argmax returns the index of the highest total. On a tie, or if every answer is 'n'
    # (all totals are 0), it returns the first such index.
    index_suggested_location = np.argmax(total_attribute_values)

    # Display the best destination based on user preferences
    print(f"Based on your preferences, the best destination is {locations[index_suggested_location]}")

# Call function 
vacation_advisor(locations, attributes, attribute_names)

###


Based on your preferences, the best destination is Australia
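
Because the function above reads from `input`, it is awkward to check automatically. The following is a sketch of the same selection logic with the preferences passed in as a boolean list; `suggest_destination` is a hypothetical helper name and not part of the assignment.

```python
import numpy as np

locations = np.array(["Hawaii", "Thailand", "Italy", "Australia", "Japan"])
attributes = np.array([
    [8, 8, 7, 6],
    [7, 9, 8, 7],
    [8, 6, 9, 7],
    [9, 8, 8, 6],
    [7, 9, 7, 8],
])

def suggest_destination(locations, attributes, preferences):
    """Hypothetical helper: return the destination with the highest total
    over the columns flagged True in preferences."""
    preferences = np.asarray(preferences, dtype=bool)
    # Boolean indexing selects the important columns; sum gives one total per row
    totals = attributes[:, preferences].sum(axis=1)
    return str(locations[np.argmax(totals)])

scenery_and_activities = suggest_destination(locations, attributes, [True, True, False, False])
food_only = suggest_destination(locations, attributes, [False, False, True, False])
print(scenery_and_activities)  # Australia (totals: 16, 16, 14, 17, 16)
print(food_only)               # Italy (food scores: 7, 8, 9, 8, 7)
```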


<h3 style="text-align:left;">Exercise 3 - Indexing<span style="float: right">3 points</span></h3>

You have two arrays of the same length: temperature *temp*, and humidity, *rh*. Write a program that:
1) Substitutes the values of *temp* for which the corresponding value of *rh* is less than 0.3 with *np.nan*.
2) Calculates the mean value of this new temperature array (do **not** calculate it on the original array).

As an example:

```python
temp = [70, 80, 90]
rh = [0.5, 0.2, 0.6]

temp_nan --> [70, np.nan, 90]
temp_avg --> 80
```

In [6]:
# Generate some surrogate data

np.random.seed(29041996)  # Make sure we all have the same data
temp = 20 * np.cos(np.linspace(0, 2 * np.pi, 100)) + 80 + 2 * np.random.randn(100)
rh = np.abs(0.1 * np.cos(np.linspace(0, 4 * np.pi, 100)) 
            + 0.3 + 0.05 * np.random.randn(100))


In [7]:
###

def calculate_mean_temp(temp, rh):
    """ Replace temperature values with nan where humidity is less than 0.3,
    then calculate the average temperature """

    # Convert the inputs to arrays. temp must be float because an integer array cannot hold NaN
    # ("ValueError: cannot convert float NaN to integer")
    temp = np.array(temp, dtype=float)
    rh = np.array(rh)
    
    # Create a boolean mask that is True where humidity is less than 0.3
    mask = rh < 0.3

    # Work on a copy of the temperature array so the original data stays unchanged
    temp_nan = temp.copy()

    # Boolean indexing: set the temperature to NaN wherever the mask is True
    temp_nan[mask] = np.nan
    
    # Calculate the mean of the remaining temperatures; np.nanmean ignores NaN values
    temp_avg = np.nanmean(temp_nan)
    
    # Return the temperature array with NaN values and the average temperature
    return temp_nan, temp_avg

# Call function
temp_nan, temp_avg = calculate_mean_temp(temp, rh)

# Display values
print("temp_nan: ", temp_nan)
print("temp_avg: ", temp_avg)

###

temp_nan:  [ 97.61100965  98.29758553 100.01692772  98.73758771  98.38710799
 100.80608254 100.42365343          nan  97.34895106  99.28378744
          nan  94.70357922          nan          nan          nan
  89.2217891   89.41943969          nan          nan          nan
  81.47344699          nan          nan          nan          nan
          nan          nan          nan          nan          nan
          nan          nan          nan          nan  68.73828046
  67.47413002  68.54133082  65.53398191  64.56197892  64.19909813
          nan          nan  59.92352112  63.86983538  61.88582567
  59.71619218  59.84554475  60.32812302  59.09837842  60.02298563
  58.56227652  58.99225298  57.80804413  61.99996728  61.91705067
  60.34955294  62.5543744   62.38104106          nan  63.96025183
  64.60977283  66.26614781  65.304059    68.41834429  65.57144047
          nan          nan          nan          nan          nan
          nan          nan          nan          nan          nan
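
As a quick check of the approach on the small example from the exercise text, here is a sketch that uses `np.where` as an alternative to mask assignment. It also shows why `np.nanmean` is needed instead of `np.mean`. The names `plain_mean` and `nan_aware_mean` are introduced only for illustration.

```python
import numpy as np

temp = np.array([70, 80, 90], dtype=float)
rh = np.array([0.5, 0.2, 0.6])

# np.where builds a new array: NaN where humidity is below 0.3, temp elsewhere
temp_nan = np.where(rh < 0.3, np.nan, temp)
print(temp_nan)  # values: 70, nan, 90

# A plain mean propagates NaN, while nanmean skips it
plain_mean = np.mean(temp_nan)
nan_aware_mean = np.nanmean(temp_nan)
print(plain_mean)      # nan
print(nan_aware_mean)  # 80.0
```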

<h3 style="text-align:left;">Exercise 4 - Base converter<span style="float: right">2 points</span></h3>

Write a function *int_to_bin* that takes a positive integer as input and returns the binary equivalent of that integer.

You can **not** use built-in methods such as *bin()* in your solution.

In [22]:
###

def int_to_bin(n):
    """ Takes a non-negative integer and returns its binary equivalent as a string """

    if n == 0:
        return "0"  # the loop below would produce no digits for 0
    
    binary_equivalent = []  # list to collect the binary digits, least significant first
    while n > 0:
        # The remainder of dividing the number by 2 is the current least significant bit
        # (the rightmost digit, i.e. the coefficient of 2^0)
        binary_equivalent.append(n % 2)
        # Integer division by 2 drops that bit (equivalent to a right shift by one bit);
        # repeat until we reach 0.
        n //= 2
    
    # The digits were collected from right to left, so reverse them before joining into a string
    return ''.join(map(str, binary_equivalent[::-1]))

# Call function testing integer values n
n = 100
# Display the binary representation
print(f"Binary equivalent of {n}: {int_to_bin(n)}")

n = 0
print(f"Binary equivalent of {n}: {int_to_bin(n)}")

###


Binary equivalent of 100: 1100100
Binary equivalent of 0: 0
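
Two test values leave most inputs unchecked. Below is a sketch of a broader check, assuming a string-returning converter; `to_binary` is a hypothetical compact variant written for this example. The built-in `format(n, "b")` is used only as a reference for comparison, not inside the conversion itself.

```python
def to_binary(n):
    """Hypothetical compact variant: repeated division by 2, digits returned as a string."""
    if n == 0:
        return "0"
    digits = []
    while n > 0:
        # divmod returns the quotient and the remainder in one step
        n, remainder = divmod(n, 2)
        digits.append(str(remainder))
    # Digits were collected least significant first, so reverse them
    return "".join(reversed(digits))

# Compare against the built-in formatter for a range of inputs
mismatches = [n for n in range(1024) if to_binary(n) != format(n, "b")]
print(to_binary(100))  # 1100100
print(mismatches)      # []
```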


<h3 style="text-align:left;">Exercise 5 - Broadcasting<span style="float: right">2 points</span></h3>

Reshape *a* so it is possible to multiply *a* and *b*, and explain why you had to reshape *a* to be able to multiply the two arrays.

In [38]:
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([2, 3])

# Display the array a and b 
print("a: ", a)
print("b: ", b)

# Without reshaping, a * b raises a ValueError: "operands could not be broadcast together with shapes (2,3) (2,) "

# Reshape a so that its last dimension has the same length as b (2)
a = a.reshape(3, 2)
print("reshaped a: ", a)
print("multiplication: ", a * b)

a:  [[1 2 3]
 [4 5 6]]
b:  [2 3]
reshaped a:  [[1 2]
 [3 4]
 [5 6]]
multiplication:  [[ 2  6]
 [ 6 12]
 [10 18]]


In [15]:
# Broadcasting compares the shapes of the two arrays starting from the last (trailing) dimension.
# Two dimensions are compatible only if they are equal or if one of them is 1.
# Array a has shape (2, 3): two rows and three columns.
# Array b has shape (2,): it is one-dimensional with two elements.
# The trailing dimensions are 3 and 2, which are neither equal nor 1, so a * b fails.
# After reshaping a to (3, 2), the trailing dimensions are 2 and 2, so b is stretched across
# the three rows of a and multiplied element-wise with each row.
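
The shape rule can be checked directly. The sketch below first confirms that the original shapes fail, then repeats the reshape of *a*. For comparison only, it also shows that giving *b* a trailing axis of length 1 would scale each original row of *a* instead. The exercise asks for reshaping *a*, so that last variant is an illustration rather than the requested answer; `broadcast_ok`, `reshaped`, and `per_row` are names introduced here.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([2, 3])

# Shapes (2, 3) and (2,): trailing dimensions 3 and 2 are incompatible
try:
    a * b
    broadcast_ok = True
except ValueError:
    broadcast_ok = False
print(broadcast_ok)  # False

# Shapes (3, 2) and (2,): trailing dimensions match, b is applied to every row
reshaped = a.reshape(3, 2) * b
print(reshaped)  # rows: [2, 6], [6, 12], [10, 18]

# Shapes (2, 3) and (2, 1): the length-1 axis of b is stretched across the columns
per_row = a * b[:, np.newaxis]
print(per_row)  # rows: [2, 4, 6], [12, 15, 18]
```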

<h3 style="text-align:left;">Exercise 6 - Moving average<span style="float: right">3 points</span></h3>

Given the array of values, *a*, we can calculate the moving average by averaging nearby values and repeating the procedure sliding along the array. Here's an example of a 3-point moving average (ignoring the edges), with a for loop:

In [14]:
a = np.round(30 + np.random.randn(20) * 2, 1)
print(a)

# Moving average
a_avg = np.zeros_like(a)
# We're just ignoring the edge effects here
for i in range(1, len(a) - 1):
    sub = a[i - 1:i + 2]
    a_avg[i] = sub.mean()
# For the first and last point, we use the original values.
a_avg[[0, -1]] = a[[0, -1]]
print(a_avg)

[35.3 30.8 32.2 29.8 28.7 28.2 33.6 31.3 28.6 31.3 28.5 28.6 30.8 29.4
 31.7 31.9 31.2 29.3 30.7 33.3]
[35.3        32.76666667 30.93333333 30.23333333 28.9        30.16666667
 31.03333333 31.16666667 30.4        29.46666667 29.46666667 29.3
 29.6        30.63333333 31.         31.6        30.8        30.4
 31.1        33.3       ]


Write a function *mov_avg* that takes an array as input and returns its 3-point moving average. You **have to use broadcasting** to compute the moving average. As in the example, use the original array values at the borders.

In [29]:
###

def mov_avg(array):
    """ Compute the 3-point moving average of the input array using broadcasting """
    
    # First, create a float array of zeros with the same shape as the input
    # (dtype=float keeps the averages from being truncated if the input holds integers)
    a_avg = np.zeros_like(array, dtype=float)
    
    # Then, we take 3 shifted sub-arrays of equal length:
    # array[:-2]: selects all elements of the array except for the last two.
    # array[1:-1]: selects all elements of the array except for the first and last.
    # array[2:]: selects all elements of the array except for the first two.
    # At each index, the three sub-arrays hold the left neighbour, the centre, and the right neighbour,
    # so their element-wise sum divided by 3 is the three-point sliding window average.
    # The additions are element-wise operations on arrays of identical shape, so nothing is stretched there.
    # Broadcasting comes in when the summed array is divided by the scalar 3, which is "stretched" to every element.
    a_avg[1:-1] = (array[:-2] + array[1:-1] + array[2:]) / 3
    # print(array[:-2], "subarray1") # Print subarrays for debugging 
    # print(array[1:-1], "subarray2")
    # print(array[2:], "subarray3")
    
    # For the first and last point, we use the original values.
    a_avg[[0, -1]] = array[[0, -1]]
    
    return a_avg

# Call function using random data 
a = np.round(30 + np.random.randn(20) * 2, 1)
print("Original array:", a)
a_moving_avg = mov_avg(a)
print("Moving average:", a_moving_avg)

###

Original array: [30.5 31.4 31.2 27.9 27.9 32.  30.1 28.4 27.7 35.7 31.1 30.3 31.9 30.2
 30.3 28.3 33.8 31.9 32.3 29.7]
Moving average: [30.5        31.03333333 30.16666667 29.         29.26666667 30.
 30.16666667 28.73333333 30.6        31.5        32.36666667 31.1
 30.8        30.8        29.6        30.8        31.33333333 32.66666667
 31.3        29.7       ]
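
Since the exercise insists on broadcasting, here is a sketch of a variant in which broadcasting does more of the work: a column of centre indices is added to a row of offsets, producing an index array with one 3-point window per row. `mov_avg_windows` is a hypothetical name, and this is one possible approach under that reading of the task, not the required solution.

```python
import numpy as np

def mov_avg_windows(array):
    """Hypothetical variant: build all 3-point windows with a broadcast index array."""
    array = np.asarray(array, dtype=float)
    centers = np.arange(1, len(array) - 1)         # shape (n-2,)
    offsets = np.array([-1, 0, 1])                 # shape (3,)
    # (n-2, 1) + (3,) broadcasts to (n-2, 3): each row holds the indices of one window
    window_idx = centers[:, np.newaxis] + offsets
    # Start from a copy so the border values keep the original data
    result = array.copy()
    result[1:-1] = array[window_idx].mean(axis=1)
    return result

example = np.array([1, 2, 6, 4, 5])
smoothed = mov_avg_windows(example)
print(smoothed)  # values: 1, 3, 4, 5, 5
```

The index array costs extra memory compared with summing three slices, but the same pattern extends to wider windows by changing `offsets` and the range of `centers`.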
