## NumPy Library Exercises

In [2]:
import numpy as np

**1.** Given a random array, change the sign of the elements whose values are between 3 and 8.

In [11]:
# your code
arr = np.random.randint(1,11, size=10)
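print(f'Original array: {arr}')

# boolean mask for values in the closed interval [3, 8]
mask = (arr >= 3) & (arr <= 8)

# flip the sign in place, only where the mask is True
arr[mask] *= -1

print(f'Modified: {arr}')

Original array: [ 4  9  1 10 10  7  6  7  4  4]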
Modified: [-4  9  1 10 10 -7 -6 -7 -4 -4]


**2.** Replace the maximum element of a random array with 0.

In [14]:
# your code
arr = np.random.random(10)
print(f'Original array: {arr}')

# argmax returns the index of the (first) maximum element
arr[arr.argmax()] = 0
print(f'Modified: {arr}')

Original array: [0.75070898 0.01438804 0.38366651 0.88835257 0.63754886 0.02467646
 0.75022207 0.03388368 0.50324648 0.31754171]
Modified: [0.75070898 0.01438804 0.38366651 0.         0.63754886 0.02467646
 0.75022207 0.03388368 0.50324648 0.31754171]


**3.** Construct the Cartesian (direct) product of arrays (every combination of one element from each array).
The input is a 2D array.

In [151]:
# your code
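
One possible approach, shown as a sketch rather than the expected answer: treat each row of the 2D input as one array and combine the rows with `np.meshgrid`. The function name `cartesian_product` and the sample input are illustrative assumptions.

```python
import numpy as np

def cartesian_product(arrays):
    """Return every combination that takes one element from each row of `arrays`."""
    arrays = np.asarray(arrays)
    # one coordinate grid per input row; "ij" keeps the row order of the input
    grids = np.meshgrid(*arrays, indexing="ij")
    # stack the grids as columns and flatten to (n_combinations, n_arrays)
    return np.stack(grids, axis=-1).reshape(-1, len(arrays))

vectors = np.array([[1, 2, 3], [4, 5, 6]])  # hypothetical input
product = cartesian_product(vectors)
print(product)
```

For two rows of length 3 this yields 9 pairs, starting with `[1, 4]` and ending with `[3, 6]`.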

**4.** Given two arrays, A (8×3) and B (2×2), find the rows of A that contain elements from each row of B, regardless of the order of the elements in B.

In [152]:
# your code
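
A sketch under one reading of the task: a row of A qualifies if it shares at least one element with every row of B. The fixed inputs below are hypothetical and chosen so the result is reproducible; the exercise itself would use random arrays.

```python
import numpy as np

# Hypothetical fixed inputs (8x3 and 2x2)
A = np.array([[1, 3, 9], [2, 2, 2], [4, 4, 1], [5, 6, 7],
              [3, 4, 0], [2, 0, 4], [0, 0, 0], [9, 8, 3]])
B = np.array([[1, 2], [3, 4]])

# C[i, j, k, l] is True when A[i, j] == B[k, l]; shape (8, 3, 2, 2)
C = A[..., np.newaxis, np.newaxis] == B
# any over axes (3, 1): does row i of A share an element with row k of B?
# all over axis 1: that must hold for every row of B
rows = np.where(C.any(axis=(3, 1)).all(axis=1))[0]
print(rows)  # [0 2 5]
```

The broadcasted comparison builds an 8×3×2×2 boolean array, which is fine at this size but grows quickly for large inputs.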

**5.** Given a 10×3 matrix, extract the rows whose values are not all equal
(e.g. row [2, 2, 3] is kept, row [3, 3, 3] is removed).

In [153]:
# your code
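
A minimal sketch, assuming numeric data: a row has unequal values exactly when its maximum differs from its minimum. The helper name `rows_with_unequal_values` and the small sample matrix are illustrative; the same call works on a 10×3 matrix.

```python
import numpy as np

def rows_with_unequal_values(Z):
    """Keep only the rows of Z whose entries are not all identical."""
    return Z[Z.max(axis=1) != Z.min(axis=1)]

Z = np.array([[2, 2, 3], [3, 3, 3], [1, 1, 1], [0, 5, 0]])  # hypothetical input
kept = rows_with_unequal_values(Z)
print(kept)
```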

**6.** Given a 2D array, remove the rows that are duplicates.

In [154]:
# your code
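
A sketch using `np.unique` with `axis=0`, which treats each row as one item. Note that `np.unique` returns the rows sorted; the second variant, which keeps the original order of first occurrences, is an optional extra and not something the task demands. The sample array is hypothetical.

```python
import numpy as np

Z = np.array([[1, 2], [3, 4], [1, 2], [0, 0]])  # hypothetical input

# unique rows, returned in sorted order
unique_sorted = np.unique(Z, axis=0)

# unique rows in order of first appearance
_, first_idx = np.unique(Z, axis=0, return_index=True)
unique_ordered = Z[np.sort(first_idx)]

print(unique_sorted)
print(unique_ordered)
```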


______
______

For each of the following tasks (1-5), provide two implementations: one without using NumPy (assume that where NumPy arrays are expected as input or output, standard Python lists will be used instead), and a second version that is fully vectorized using NumPy (without using Python loops, map, or list comprehensions).

Note 1. You may assume that all specified objects are non-empty (for example, in Task 1, there are non-zero elements on the matrix diagonal).

Note 2. For most tasks, the solution requires no more than 1–2 lines of code.

___

Task 1: Calculate the product of non-zero elements on the diagonal of a rectangular matrix. For example, for X = np.array([[1, 0, 1], [2, 0, 2], [3, 0, 3], [4, 4, 4]]), the answer is 3.

In [20]:
# your code
X = np.array([[1, 0, 1], [2, 0, 2], [3, 0, 3], [4, 4, 4]])
diag = np.diagonal(X)              # main diagonal: [1, 0, 3]
result = np.prod(diag[diag != 0])  # product of the non-zero entries
print(result)

3
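
The task statement also asks for a version without NumPy. A minimal sketch, assuming the matrix is a non-empty list of equal-length lists; the name `diag_product_py` is illustrative.

```python
from math import prod

def diag_product_py(X):
    """Product of the non-zero main-diagonal entries of a list-of-lists matrix."""
    n = min(len(X), len(X[0]))  # the diagonal of a rectangular matrix stops at the shorter side
    return prod(X[i][i] for i in range(n) if X[i][i] != 0)

X_list = [[1, 0, 1], [2, 0, 2], [3, 0, 3], [4, 4, 4]]
print(diag_product_py(X_list))  # 3
```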

Task 2: Given two vectors, x and y, check whether they represent the same multiset (i.e., whether they contain the same elements with the same frequencies, regardless of order). For example, for x = np.array([1, 2, 2, 4]) and y = np.array([4, 2, 1, 2]), the answer is True.

In [6]:
# your code
x = np.array([1,2,2,4])
y = np.array([4,2,1,2])

is_equal = np.array_equal(np.sort(x), np.sort(y))

is_equal

True
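
For the non-NumPy version that the task statement asks for, one option is `collections.Counter`, which compares element frequencies directly. This is a sketch assuming plain lists of hashable values; the name `same_multiset_py` is illustrative.

```python
from collections import Counter

def same_multiset_py(x, y):
    """True when x and y hold the same elements with the same frequencies."""
    return Counter(x) == Counter(y)

print(same_multiset_py([1, 2, 2, 4], [4, 2, 1, 2]))  # True
```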

Task 3: Find the maximum element in vector x among those elements that are preceded by a zero. For example, for x = np.array([6, 2, 0, 3, 0, 0, 5, 7, 0]), the answer is 5.

In [None]:
# your code
def func3(x):
    zero_mask = (x[:-1] == 0)       # True where the next element follows a zero
    after_zeros = x[1:][zero_mask]  # elements immediately preceded by a zero
    return after_zeros.max()

x = np.array([6, 2, 0, 3, 0, 0, 5, 7, 0])
print(f"Result: {func3(x)}")

Result: 5
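
A non-NumPy counterpart can pair each element with its predecessor using `zip`. This sketch assumes a plain list containing at least one zero that is not the last element; the name `max_after_zero_py` is illustrative.

```python
def max_after_zero_py(x):
    """Largest element of x whose immediate predecessor is zero."""
    return max(curr for prev, curr in zip(x, x[1:]) if prev == 0)

print(max_after_zero_py([6, 2, 0, 3, 0, 0, 5, 7, 0]))  # 5
```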


Task 4: Implement Run-length encoding (RLE). For a given vector x, return a tuple of two vectors of the same length. The first vector contains the values, and the second contains the counts of how many times each value is repeated consecutively. 
For example, for x = np.array([2, 2, 2, 3, 3, 3, 5]), the answer is (np.array([2, 3, 5]), np.array([3, 3, 1])).

In [12]:
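# your code
def func4(x):
    if x.size == 0:
        return (np.array([]), np.array([]))

    # indices where a run ends, i.e. where the next element differs
    locs = np.where(x[1:] != x[:-1])[0]

    # run-end positions, with -1 as a sentinel before the first run
    boundaries = np.concatenate(([-1], locs, [len(x) - 1]))

    values = x[boundaries[1:]]    # the value at the end of each run
    counts = np.diff(boundaries)  # run lengths

    return (values, counts)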

x = np.array([2, 2, 2, 3, 3, 3, 5])
print(f"Result: {func4(x)}")

Result: (array([2, 3, 5]), array([3, 3, 1]))
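
A loop-based version without NumPy, as the task statement requests, might look like the sketch below. It assumes a plain list and returns two lists; the name `rle_py` is illustrative.

```python
def rle_py(x):
    """Run-length encode a list: returns (values, counts) as two lists."""
    values, counts = [], []
    for item in x:
        if values and values[-1] == item:
            counts[-1] += 1       # still inside the current run
        else:
            values.append(item)   # a new run starts here
            counts.append(1)
    return values, counts

print(rle_py([2, 2, 2, 3, 3, 3, 5]))  # ([2, 3, 5], [3, 3, 1])
```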


Task 5: Given two sets of objects (samples), $X$ and $Y$, calculate the matrix of Euclidean distances between all pairs of objects. Compare your implementation's performance (speed) with the scipy.spatial.distance.cdist function.

In [14]:
# your code
import time
from scipy.spatial.distance import cdist

def task_5_numpy(X, Y):
    diff = X[:, np.newaxis, :] - Y[np.newaxis, :, :]
    return np.sqrt(np.sum(diff**2, axis=-1))

X = np.random.rand(500, 3)
Y = np.random.rand(500, 3)

start = time.perf_counter()
res_numpy = task_5_numpy(X, Y)
print(f"NumPy execution time: {time.perf_counter() - start:.5f} s")

start = time.perf_counter()
res_scipy = cdist(X, Y, 'euclidean')
print(f"SciPy cdist execution time: {time.perf_counter() - start:.5f} s")

NumPy execution time: 0.01521 s
SciPy cdist execution time: 0.00199 s
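
For the non-NumPy version, a sketch using `math.dist` (available from Python 3.8) is shown below. It assumes the samples are lists of equal-length coordinate lists; the name `pairwise_distances_py` is illustrative, and this version is expected to be much slower than either vectorized approach on large inputs.

```python
import math

def pairwise_distances_py(X, Y):
    """Matrix (list of lists) of Euclidean distances between rows of X and rows of Y."""
    return [[math.dist(x, y) for y in Y] for x in X]

D = pairwise_distances_py([[0, 0], [3, 4]], [[0, 0], [0, 4]])
print(D)  # [[0.0, 4.0], [5.0, 3.0]]
```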


_______
________

Task 6: CrunchieMunchies
You work in the marketing department of a food company, MyCrunch, which is developing a new type of tasty and healthy cereal called CrunchieMunchies.

You want to demonstrate to consumers how healthy your cereal is compared to other leading brands, so you have collected nutritional data on several different competitors.

Your task is to use NumPy calculations to analyze this data and prove that CrunchieMunchies is the healthiest choice for consumers.

In [77]:
import numpy as np

Task 1: Review the cereal.csv file. This file contains the calorie counts for various cereal brands. Load the data from the file and save it as calorie_stats.

In [16]:
calorie_stats = np.loadtxt("./data/cereal.csv", delimiter=",")
calorie_stats

array([ 70., 120.,  70.,  50., 110., 110., 110., 130.,  90.,  90., 120.,
       110., 120., 110., 110., 110., 100., 110., 110., 110., 100., 110.,
       100., 100., 110., 110., 100., 120., 120., 110., 100., 110., 100.,
       110., 120., 120., 110., 110., 110., 140., 110., 100., 110., 100.,
       150., 150., 160., 100., 120., 140.,  90., 130., 120., 100.,  50.,
        50., 100., 100., 120., 100.,  90., 110., 110.,  80.,  90.,  90.,
       110., 110.,  90., 110., 140., 100., 110., 110., 100., 100., 110.])

Task 2: One serving of CrunchieMunchies contains 60 calories. How much higher is the average calorie count of your competitors? Save the answer in a variable named average_calories and print it to the terminal.

In [17]:
# your code
average_calories = np.mean(calorie_stats) - 60

average_calories

np.float64(46.883116883116884)

Task 3: Does the average calorie count accurately reflect the distribution of the dataset? Let’s sort the data and find out. Sort the data and save the result in a variable named calorie_stats_sorted. Print the sorted information.

In [18]:
# your code
calorie_stats_sorted = np.sort(calorie_stats)

calorie_stats_sorted

array([ 50.,  50.,  50.,  70.,  70.,  80.,  90.,  90.,  90.,  90.,  90.,
        90.,  90., 100., 100., 100., 100., 100., 100., 100., 100., 100.,
       100., 100., 100., 100., 100., 100., 100., 100., 110., 110., 110.,
       110., 110., 110., 110., 110., 110., 110., 110., 110., 110., 110.,
       110., 110., 110., 110., 110., 110., 110., 110., 110., 110., 110.,
       110., 110., 110., 110., 120., 120., 120., 120., 120., 120., 120.,
       120., 120., 120., 130., 130., 140., 140., 140., 150., 150., 160.])

Task 4: It looks like the majority of values are above the mean. Let’s see if the median is a more accurate indicator for this dataset. Calculate the median of the dataset and save your answer in median_calories. Print the median so you can see how it compares to the mean.

In [19]:
# your code
median_calories = np.median(calorie_stats_sorted)

print(f"Median: {median_calories}")

Median: 110.0


Task 5: While the median shows that at least half of the values are over 100 calories, it would be more impressive to show that a significant majority of competitors have a higher calorie count than CrunchieMunchies. Calculate various percentiles and print them until you find the lowest percentile that is greater than 60 calories. Save this value in a variable named nth_percentile.

In [None]:
# your code
nth_percentile = 0

for i in range(1, 101):
    percentile_value = np.percentile(calorie_stats, i)
    print(f"{i}-percentile: {percentile_value} kcal")
    
    if percentile_value > 60:
        nth_percentile = i
        break

print(f"\nThe smallest percentile higher than CrunchieMunchies (60 kcal): {nth_percentile}")

1-percentile: 50.0 kcal
2-percentile: 50.0 kcal
3-percentile: 55.599999999999994 kcal
4-percentile: 70.0 kcal

The smallest percentile higher than CrunchieMunchies (60 kcal): 4
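
The same search can be done without a loop, since `np.percentile` accepts an array of percentile ranks. The sketch below uses a small hypothetical sample instead of `cereal.csv` so that it runs on its own, and it assumes that at least one percentile exceeds the threshold.

```python
import numpy as np

# Hypothetical sample standing in for calorie_stats
sample = np.array([50., 50., 70., 90., 100., 110., 110., 120.])
threshold = 60

ranks = np.arange(1, 101)               # percentile ranks 1..100
values = np.percentile(sample, ranks)   # all percentiles in one call
first = np.argmax(values > threshold)   # index of the first True
lowest_rank = ranks[first]
lowest_value = values[first]

print(f"Lowest percentile above {threshold} kcal: {lowest_rank} ({lowest_value:.1f} kcal)")
```

`np.argmax` on a boolean array returns the position of the first `True`, which is why the "at least one percentile exceeds the threshold" assumption matters: with no `True` it would silently return 0.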


Task 6: While percentiles show us that most competitors have a much higher calorie count, it's an awkward concept to use in marketing materials. Instead, let's calculate the percentage of cereals that contain more than 60 calories per serving. Save your answer in the variable more_calories and print it.

In [23]:
# your code

# fraction of cereals with more than 60 calories per serving
more_calories = np.mean(calorie_stats > 60)
more_calories

np.float64(0.961038961038961)

Task 7: That's a really high percentage! This will be very useful when we promote CrunchieMunchies. But one question is, how much variation is there in the dataset? Can we generalize that most cereals contain around 100 calories, or is the spread even wider? Calculate the amount of variation by finding the standard deviation. Save your answer in calorie_std and print it to the terminal. How can we incorporate this value into our analysis?

In [24]:
# your code
calorie_std = np.std(calorie_stats)

calorie_std

np.float64(19.35718533390827)
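
One way to use the standard deviation, offered as a suggestion rather than the required answer, is to express how far the 60-calorie serving sits below the competitor mean in units of standard deviation. The sketch plugs in the values printed above instead of reloading the data; the variable names are illustrative.

```python
# Values taken from the outputs of Tasks 2 and 7
mean_calories = 60 + 46.883116883116884  # competitor mean implied by Task 2
calorie_std = 19.35718533390827
crunchie_calories = 60

# number of standard deviations between CrunchieMunchies and the mean
z_score = (crunchie_calories - mean_calories) / calorie_std
print(f"CrunchieMunchies z-score: {z_score:.2f}")
```

A result of roughly -2.4 means the product lies about 2.4 standard deviations below the competitor average.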

Task 8: Write a short paragraph summarizing your findings and how you think this data can be used to MyCrunch's advantage when marketing CrunchieMunchies.

Based on our NumPy analysis, a 60-calorie serving of CrunchieMunchies is nearly 47 calories below the competitor average of about 107 calories. The competitor median is 110 calories, the standard deviation is 19.36, and about 96% of competitor cereals contain more than 60 calories per serving. MyCrunch can therefore market CrunchieMunchies as a lighter, healthier option that stands well below the high-calorie industry norm.