# NumPy Exercise

## Instructions
The same instructions from previous exercises apply.

Write the Python program in the cell below the text of the exercise (or create a new one). Always print the final result to the screen to verify the correctness of the exercise. Sometimes we provide important concepts in a code cell below the exercise text; try running it, and modify it if necessary, to make sure you understand the required concepts.

## Submission
The same rules from previous exercises apply.

It is mandatory to **submit the solution for all exercises** (except for those marked as optional) **before the beginning of the next lesson** in the appropriate assignment on iCorsi. To submit:
- Run the entire notebook from scratch (`Kernel -> Restart & Run All`) and ensure that the solutions are as expected;
- Export the notebook in HTML format (`File -> Download as...`) and submit the resulting file.

If you were unable to complete one or more exercises, describe the problem encountered and **still submit the file with the rest of the solutions**.


In [1]:
# Import packages for later use
import pandas as pd
import numpy as np
import os

## Exercise 0
Various warmup exercises

- Create a 1D array with values ranging from 10 to 49 (use `np.arange`, find its docs!)
- Create a 5x5 array with random values and find the minimum and maximum values (use `np.random.rand`, `np.min`, `np.max`)
- Create a 2D array with 1 on the border and 0 inside
- Write a function that takes a 1D array as input and returns a new array with the same values, plus a 0 added before the first element and after the last. E.g. `[1,2,3]` becomes `[0,1,2,3,0]`
- Same as above, but take a parameter `n` which indicates how many zeros to place.
- Create an 8x8 boolean array and fill it with a checkerboard pattern
- Given a 1D array of numbers, return a new array that contains the same values, except that all values between 5 and 8 (inclusive) will be negated.  E.g. `[-1,3,6,9,5]` becomes `[-1,3,-6,9,-5]`
- Given a 1D array of numbers `a` and a number `x`, return the index of the element in `a` that is closest to `x`. E.g. `a=[-1,3,6,9,5], x=2` returns `1`, because `a[1]=3` is closest to `2`.
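
The padding and border tasks in the list above can also be approached with `np.pad`. The following is a minimal sketch of that alternative, not the expected solution; `pad_with_zeros` is a hypothetical helper name chosen for illustration.

```python
import numpy as np

def pad_with_zeros(array, n=1):
    """Return a copy of `array` with `n` zeros added at each end (hypothetical helper)."""
    # mode="constant" fills the padding with `constant_values`; the input dtype is kept
    return np.pad(array, n, mode="constant", constant_values=0)

print(pad_with_zeros(np.array([1, 2, 3])))        # one zero on each side
print(pad_with_zeros(np.array([1, 2, 3]), n=2))   # two zeros on each side

# Same idea in 2D: a 3x3 block of zeros surrounded by a one-cell border of ones
border = np.pad(np.zeros((3, 3), dtype=int), 1, mode="constant", constant_values=1)
print(border)
```

Compared with `np.concatenate`, `np.pad` works unchanged for arrays of any dimension, which is why the same call also builds the bordered 2D array.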

In [2]:
# Solution 0

#1
a = np.arange(10, 50)
print("Array from 10 to 49:", a)

#2
b = np.random.rand(5, 5)
print("\nRandom matrix:\n", b)
print("Min:", np.min(b))
print("Max:", np.max(b))

#3
c = np.ones((5, 5))
c[1:-1, 1:-1] = 0
print("\nMatrix with 1 on the border:\n", c)

#4
def add_borders(array):
    return np.concatenate(([0], array, [0]))
print("\nArray with zero borders:", add_borders(np.array([1, 2, 3])))

#5
def add_n_zero_borders(array, n):
    # Match the input dtype so integer arrays are not silently converted to float
    zeros = np.zeros(n, dtype=array.dtype)
    return np.concatenate((zeros, array, zeros))
print("\nArray with 'n' zeros on the borders:", add_n_zero_borders(np.array([1, 2, 3]), 2))

#6
d = np.zeros((8, 8), dtype=bool)
d[1::2, ::2] = True
d[::2, 1::2] = True
print("\n8x8 boolean array:\n", d)

#7
def negate_values_5to8(array):
    array = array.copy()
    array[(array >= 5) & (array <= 8)] *= -1
    return array
print("\nNegate values 5 to 8:", negate_values_5to8(np.array([-1, 3, 6, 9, 5])))

#8
def closest_index(array, x):
    return np.abs(array - x).argmin()
print("\nIndex of closest value:", closest_index(np.array([-1, 3, 6, 9, 5]), 2))


Array from 10 to 49: [10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49]

Random matrix:
 [[0.80838609 0.09069151 0.53434491 0.37568694 0.01053505]
 [0.06639107 0.16983402 0.74683074 0.10286074 0.35961161]
 [0.72237469 0.82930485 0.40275266 0.99849766 0.5364941 ]
 [0.74505001 0.19285015 0.04923652 0.5117509  0.24905403]
 [0.35209646 0.42337822 0.40730447 0.35101939 0.22109484]]
Min: 0.010535052464391348
Max: 0.9984976580394596

Matrix with 1 on the border:
 [[1. 1. 1. 1. 1.]
 [1. 0. 0. 0. 1.]
 [1. 0. 0. 0. 1.]
 [1. 0. 0. 0. 1.]
 [1. 1. 1. 1. 1.]]

Array with zero borders: [0 1 2 3 0]

Array with 'n' zeros on the borders: [0 0 1 2 3 0 0]

8x8 boolean array:
 [[False  True False  True False  True False  True]
 [ True False  True False  True False  True False]
 [False  True False  True False  True False  True]
 [ True False  True False  True False  True False]
 [False  True False  True False  True False  True]
 [ Tr

## Exercise 1

Let's consider the online dating profiles dataset, which we mentioned in class.

Download the zip files from [this link](https://github.com/rudeboybert/JSE_OkCupid), unzip them, and place the files `profiles_revised.csv` and `essays_revised_and_shuffled.csv` in a `data` subdirectory in the current directory. Then, run the following cell, which uses the pandas library (which we will cover in detail later) to parse the CSV.

The `data` subdirectory of the current directory, where we expect to find the files, is this:

In [3]:
print(os.getcwd()+"/data")

C:\Users\ivohe\Desktop\Datascience/data


In [4]:
df = pd.read_csv("data/profiles_revised.csv")
age = df["age"].values
sex = df["sex"].values

dfe = pd.read_csv("data/essays_revised_and_shuffled.csv")
essay = dfe["essay0"].values

### 1.1

Read the [codebook](https://github.com/rudeboybert/JSE_OkCupid/blob/master/okcupid_codebook_revised.txt) for the dataset, focusing on the three columns "age", "sex", and "essay0".

Explain the type of the `age`, `sex`, and `essay` arrays; visualize some elements from them.
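
One way to examine the type of an array is to look at both its NumPy `dtype` and the Python type of its individual elements. The sketch below does this on small made-up stand-ins (`toy_age`, `toy_sex`, `toy_essay` are hypothetical names, and `dtype=object` is an assumption mirroring what pandas typically yields for text columns); the same loop can be pointed at the real `age`, `sex`, and `essay` arrays.

```python
import numpy as np

# Toy stand-ins for the real columns (values are made up for illustration;
# dtype=object is an assumption about how text columns usually arrive from pandas)
toy_age = np.array([22, 36, 37])
toy_sex = np.array(["m", "m", "f"], dtype=object)
toy_essay = np.array(["well hi there.", np.nan, "musician / writer"], dtype=object)

for name, arr in [("age", toy_age), ("sex", toy_sex), ("essay", toy_essay)]:
    # The dtype describes the array as a whole; object arrays can mix element types
    element_types = sorted({type(v).__name__ for v in arr})
    print(f"{name}: dtype={arr.dtype}, shape={arr.shape}, element types={element_types}")

# Missing essays appear as float NaN inside an otherwise string-valued object array
is_missing = np.array([not isinstance(e, str) for e in toy_essay])
print("missing essays:", is_missing.sum())
```

The mixed element types of the essay array are the reason any per-essay computation has to check `isinstance(e, str)` before calling string methods.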

In [5]:
# Solution 1.1

print("\nAge examples:", age[:3])
print("Sex examples:", sex[:3])
print("Essay examples:\n", essay[:3])


Age examples: [22 36 37]
Sex examples: ['m' 'm' 'm']
Essay examples:
 ['well hi there. my mantra this year is "<a class="ilink" href=\n"/interests?i=explore">explore</a>." that may mean walking down an\nuntrodden path, trying russian food, visiting another continent, or\njust staring at a snail to see who blinks first. it may be a\nstrange year, but it won\'t be boring.'
 nan
 'musician / writer / programmer from the woods of montana. dance\nmusic enthusiast, secondhand film critic. owner of too many\nguitars.']


### 1.2

Analyze the arrays with the numpy methods you know and answer the following questions:
- What is the average, minimum, and maximum age of the users?
- How many are male, and how many are female?
- Does the average age of males differ significantly from the average age of females?
- How long are their introductions on average?
- Show the longest introduction.
- Can we determine if males tend to write more compared to females?
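
The last question in the list above comes down to a per-group mean computed with a Boolean mask. The sketch below shows only the mechanics, on made-up data in which `toy_sex[i]` and `toy_essay[i]` are assumed to belong to the same user (both names are hypothetical). Whether that alignment holds for the real files should be checked first: the filename `essays_revised_and_shuffled.csv` suggests the essay rows may have been shuffled relative to the profiles, in which case such a comparison would not be meaningful.

```python
import numpy as np

# Hypothetical aligned toy data: toy_sex[i] and toy_essay[i] describe the same user
toy_sex = np.array(["m", "f", "m", "f", "m"], dtype=object)
toy_essay = np.array(
    ["short intro", np.nan, "a slightly longer intro here", "one two three", "hi"],
    dtype=object,
)

# Word count per essay; missing essays become NaN so they can be excluded from the mean
lengths = np.array(
    [len(e.split()) if isinstance(e, str) else np.nan for e in toy_essay],
    dtype=float,
)

for label in ("m", "f"):
    group = lengths[toy_sex == label]          # Boolean mask selects one group
    n_written = int(np.sum(~np.isnan(group)))  # essays actually present in the group
    print(label, "mean words:", np.nanmean(group), "essays:", n_written)
```

Treating missing essays as NaN and using `np.nanmean` gives the mean over essays that were actually written; counting them as length 0 instead answers a different question, namely the mean over all users.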

In [6]:
# Solution 1.2

average_age = np.mean(age)
min_age = np.min(age)
max_age = np.max(age)
print(f"Average age: {average_age}")
print(f"Minimum age: {min_age}")
print(f"Maximum age: {max_age}")


num_males = np.sum(sex == 'm')
num_females = np.sum(sex == 'f')
print("\nNumber of males:", num_males)
print("Number of females:", num_females)


average_age_m = np.mean(age[sex == 'm'])
average_age_f = np.mean(age[sex == 'f'])
print("\nAverage age of males:", average_age_m)
print("Average age of females:", average_age_f)


# Length is measured in words; missing essays (NaN) count as length 0
essay_lengths = np.array([len(e.split()) if isinstance(e, str) else 0 for e in essay])
avg_length = np.mean(essay_lengths)
print("\nAverage introduction length:", avg_length)


idx_longest = np.argmax(essay_lengths)
print("\nLongest introduction (index {}):".format(idx_longest))
print(essay[idx_longest])

Average age: 32.33540186167551
Minimum age: 17
Maximum age: 111

Number of males: 35829
Number of females: 24117

Average age of males: 32.012950403304586
Average age of females: 32.81444624124062

Average introduction length: 111.30620892136255

Longest introduction (index 8650):
before i say anything, can i offer, please check out<br />
<br />
<a class="ilink" href=
"/interests?i=the+book%2c+anastasia%2c+by+megre%2c+the+ringing+cedar+series%2c%3cbr+%2f%3e%0aand+the+films%2c+%22zeitgeist%22%2c+%22kymatica%22%2c+the+esoteric+agenda+and%0a%22thrive%22+able+to+be+seen+on+youtube">
the book, anastasia, by megre, the ringing cedar series,<br />
and the films, "zeitgeist", "kymatica", the esoteric agenda and
"thrive" able to be seen on youtube</a><br />
<br />
am reikiteacher redwood on facebook (where you may find more
information on my wall).<br />
<br />
ringing a few bells<br />
<a class="ilink" href=
"/interests?i=awakening...krysthl-a%3cbr+%2f%3e%0a...melchizedek...heinlein...nacaals...%2

## Exercise 2

### 2.1
Write a function `primeNumbers(N)` that takes as an (optional) input argument an integer `N` (default value `N=1000`) and returns an array with all prime numbers less than or equal to `N` (neither 0 nor 1 should be considered).

In [7]:
# Solution 2.1

def primeNumbers(N=1000):
    if N < 2:
        return np.array([], dtype=int)

    sieve = np.ones(N+1, dtype=bool)
    sieve[:2] = False
    
    for i in range(2, int(np.sqrt(N)) + 1):
        if sieve[i]:
            sieve[i*i:N+1:i] = False
    return np.nonzero(sieve)[0]
    
print(primeNumbers(50))

[ 2  3  5  7 11 13 17 19 23 29 31 37 41 43 47]
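
One way to gain confidence in the sieve is to compare it against a slow but obviously correct reference. The sketch below repeats the sieve so that it runs on its own and checks it against trial division for a few values of `N`, including the edge cases below 2; `is_prime_naive` is a hypothetical helper introduced only for this check.

```python
import numpy as np

def primeNumbers(N=1000):
    # Sieve of Eratosthenes, same logic as the solution cell
    if N < 2:
        return np.array([], dtype=int)
    sieve = np.ones(N + 1, dtype=bool)
    sieve[:2] = False
    for i in range(2, int(np.sqrt(N)) + 1):
        if sieve[i]:
            sieve[i * i:N + 1:i] = False
    return np.nonzero(sieve)[0]

def is_prime_naive(n):
    """Trial division, used only as a slow reference (hypothetical helper)."""
    if n < 2:
        return False
    return all(n % d != 0 for d in range(2, int(n ** 0.5) + 1))

# Compare both methods, including edge cases and a perfect square (49)
for N in (0, 1, 2, 3, 10, 49, 100, 1000):
    expected = [n for n in range(N + 1) if is_prime_naive(n)]
    assert primeNumbers(N).tolist() == expected, f"mismatch for N={N}"

print("primes up to 1000:", len(primeNumbers()))  # 168
```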


### 2.2
Write a function `createMyDictionary(N)` that takes as an (optional) input argument an integer `N` (default value `N=1000`) and returns a dictionary with two elements. The first element has the key "middle", associated with an array of all prime numbers between $\frac{1}{4}N$ and $\frac{3}{4}N$ (inclusive). The second element has the key "extremes" and contains the remaining prime numbers less than $N$. Hint: Use the previously created function `primeNumbers(N)`.

In [8]:
# Hint

# Remember that the element-wise logical AND (or OR) between two Boolean arrays
# is performed in NumPy using the operator `&` (or `|` respectively).
# Wrap each comparison in parentheses, since `&` and `|` bind more tightly than `>` or `==`.

array1 = np.array(np.arange(10))
print(f"array1: {array1}")

mask1 = (array1 > 2) & (array1 < 8)
print("mask1: ", mask1)

mask2 = (array1[0:3] == 2) | (array1[0:3] == 1)
print("mask2: ", mask2)

array1: [0 1 2 3 4 5 6 7 8 9]
mask1:  [False False False  True  True  True  True  True False False]
mask2:  [False  True  True]
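
Combining the hint with the sieve, the split into "middle" and "extremes" can be sketched with a single Boolean mask and its negation. This is one possible reading of the exercise, not a reference solution: it assumes both bounds are inclusive and that "remaining" means every prime returned by `primeNumbers(N)` that falls outside the middle range. The sieve is repeated so that the block runs on its own.

```python
import numpy as np

def primeNumbers(N=1000):
    # Sieve of Eratosthenes, same logic as the solution to 2.1
    if N < 2:
        return np.array([], dtype=int)
    sieve = np.ones(N + 1, dtype=bool)
    sieve[:2] = False
    for i in range(2, int(np.sqrt(N)) + 1):
        if sieve[i]:
            sieve[i * i:N + 1:i] = False
    return np.nonzero(sieve)[0]

def createMyDictionary(N=1000):
    primes = primeNumbers(N)
    # Assumption: both bounds N/4 and 3N/4 are inclusive
    in_middle = (primes >= N / 4) & (primes <= 3 * N / 4)
    # `~` negates the mask, so every prime lands in exactly one of the two arrays
    return {"middle": primes[in_middle], "extremes": primes[~in_middle]}

result = createMyDictionary(20)
print("middle:  ", result["middle"])
print("extremes:", result["extremes"])
```

Using the negated mask for "extremes" guarantees that the two arrays together contain each prime exactly once.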
