---
---
Problem Set 2: Numerical Python (NumPy)

Applied Data Science using Python

New York University, Abu Dhabi

Out: 13th Sept 2023 || **Due: 20th Sept 2023 at 23:59**

---
---
# Start Here
## Learning Goals
### General Goals
- Learn the basics of programming in Python
- Learn to process data
- Learn to represent arrays for efficient computation

### Specific Goals
- Learn the numpy library and some of its basic functions
- Learn to use numpy for different problems

## Collaboration Policy
- You are allowed to talk with / work with other students on homework assignments.
- You can share ideas but not code, analyses or results; you must submit your own code and results. All submitted code will be compared against all code submitted this and previous semesters and online using MOSS. We will also critically analyze the similarities in the submitted reports, methodologies, and results, **but we will not police you**. We expect you all to be mature and responsible enough to finish your work with full integrity.
- You are expected to comply with the [University Policy on Academic Integrity and Plagiarism](https://www.nyu.edu/about/policies-guidelines-compliance/policies-and-guidelines/academic-integrity-for-students-at-nyu.html). Violations may result in penalties, such as failure in a particular assignment.

## Late Submission Policy
You can submit the homework up to 3 days late. However, we will deduct **20 points** from your homework grade **for each late day you take**. We will not accept the homework after 3 late days.

## Distribution of Class Materials
These problem sets and recitations are the intellectual property of NYUAD, and we request that students **not** distribute them or their solutions to other students who have not signed up for this class and/or intend to sign up in the future. We also request that you don't post these problem sets and recitations online or on any public platforms.

## Disclaimer
The number of points does not necessarily signify or correlate with the difficulty level of the tasks.

## Submission
You will submit all your code as a Python Notebook through [Brightspace](https://brightspace.nyu.edu/).

---




# General Instructions
This homework is worth 100 points. It has 4 parts. Below each part, we provide a set of concepts required to complete that part. All the parts need to be completed in this Jupyter (Colab) Notebook.



# Part I: Augmenting the Iconic Iris Dataset (25 points)

The **Iris dataset** is arguably one of the oldest and most over-prescribed datasets in the machine learning and statistics community. It is a small dataset released in 1936, and it is still often used for testing out *machine learning classification* algorithms and visualizations. If you don't know what the terms **machine learning** or **classification** mean, don't worry, we'll cover these concepts thoroughly in the second half of this course. But for now, we would like to manipulate this dataset a bit just for fun (and of course for grades :--))

The Iris dataset is very small, consisting of 150 data instances. Each row of the table represents an iris flower, including its species and the dimensions of its botanical parts (sepal and petal) in centimeters.

Here's how the dataset looks:

In [43]:
# Importing numpy library and using np as a short form
import numpy as np

# The URL where the IRIS dataset can be downloaded from
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'

# Using numpy's genfromtxt function to read the data as a numpy array
iris_2d = np.genfromtxt(url, delimiter=',', dtype='str')

# Printing the data
iris_2d

array([['5.1', '3.5', '1.4', '0.2', 'Iris-setosa'],
       ['4.9', '3.0', '1.4', '0.2', 'Iris-setosa'],
       ['4.7', '3.2', '1.3', '0.2', 'Iris-setosa'],
       ['4.6', '3.1', '1.5', '0.2', 'Iris-setosa'],
       ['5.0', '3.6', '1.4', '0.2', 'Iris-setosa'],
       ['5.4', '3.9', '1.7', '0.4', 'Iris-setosa'],
       ['4.6', '3.4', '1.4', '0.3', 'Iris-setosa'],
       ['5.0', '3.4', '1.5', '0.2', 'Iris-setosa'],
       ['4.4', '2.9', '1.4', '0.2', 'Iris-setosa'],
       ['4.9', '3.1', '1.5', '0.1', 'Iris-setosa'],
       ['5.4', '3.7', '1.5', '0.2', 'Iris-setosa'],
       ['4.8', '3.4', '1.6', '0.2', 'Iris-setosa'],
       ['4.8', '3.0', '1.4', '0.1', 'Iris-setosa'],
       ['4.3', '3.0', '1.1', '0.1', 'Iris-setosa'],
       ['5.8', '4.0', '1.2', '0.2', 'Iris-setosa'],
       ['5.7', '4.4', '1.5', '0.4', 'Iris-setosa'],
       ['5.4', '3.9', '1.3', '0.4', 'Iris-setosa'],
       ['5.1', '3.5', '1.4', '0.3', 'Iris-setosa'],
       ['5.7', '3.8', '1.7', '0.3', 'Iris-setosa'],
       ['5.1

Each row in the Iris dataset has 5 elements. These correspond to *sepal_length*, *sepal_width*, *petal_length*, *petal_width*, and *type_of_iris_flower (setosa, versicolor, or virginica)* respectively.

One other thing to understand is that we used a new function, `np.genfromtxt()`, above to read the text file as a numpy array. `np.genfromtxt()` takes the file path or URL (`url`), an optional string that separates values (`delimiter`), and an optional data type (`dtype`). It reads the file from the given path, splits each line into elements at the delimiter, casts the elements to the given data type, and returns an n-dimensional numpy array. In this case, the result is a 2-dimensional numpy array of shape (150, 5).
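As a minimal offline sketch of the same call, the snippet below feeds `np.genfromtxt()` a small in-memory file instead of a URL. The name `sample_text` and the choice of three sample rows are assumptions made for illustration; `io.StringIO` simply stands in for a file.

```python
import io
import numpy as np

# Three sample rows in the same comma-separated layout as iris.data
sample_text = (
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
)

# StringIO wraps the string in a file-like object that genfromtxt can read
sample_2d = np.genfromtxt(io.StringIO(sample_text), delimiter=',', dtype='str')

print(sample_2d.shape)       # (rows, columns)
print(sample_2d.dtype.kind)  # 'U' means every element is stored as a string
```

Because `dtype='str'` is used, even the measurements come back as strings, which is why they need to be cast to `float` before doing arithmetic on them.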

*There are several other optional arguments one can provide to this function, a comprehensive definition of which can be found [here](https://numpy.org/doc/stable/reference/generated/numpy.genfromtxt.html). In the near future, we will see a more robust function to read data from files using `pandas`*.


## Prompt

Given the `iris_2d` array with 150 rows and 5 columns, add a new column for volume to `iris_2d`, where volume is calculated as

$volume = \frac{\pi * petal\_length * (sepal\_length)^2}{3}$.

For example, if the row you are dealing with is

`['5.1', '3.5', '1.4', '0.2', 'Iris-setosa']`

with `petal_length` of `1.4` and `sepal_length` of `5.1`, then the volume will be:

$volume = \frac{\pi * 1.4 * (5.1)^2}{3} = 38.13265162927291$

Notes:

1. You are not allowed to use iterations (loops, map, list comprehensions). Instead think of a numpy-based solution.
2. Your solution should not be more than a couple of lines
3. For the value of pi, use `np.pi`

In [44]:
iris_2d_with_volume = None #Assign your new array with an added column to this variable

# Write your implementation below this line
############### SOLUTION ###############
# Calculating the volumes by getting required values from the iris_2d array by slicing and reshaping the array to a column vector
volume = ((np.pi * iris_2d[:,2].astype('float') * iris_2d[:,0].astype('float')**2)/3).reshape(-1,1)
iris_2d_with_volume = np.hstack([iris_2d,volume]) # Appending the volume column to the iris_2d array
iris_2d_with_volume # Printing the array with the volume column appended

############### SOLUTION END ###############


array([['5.1', '3.5', '1.4', '0.2', 'Iris-setosa', '38.13265162927291'],
       ['4.9', '3.0', '1.4', '0.2', 'Iris-setosa', '35.200498485922445'],
       ['4.7', '3.2', '1.3', '0.2', 'Iris-setosa', '30.0723720777127'],
       ['4.6', '3.1', '1.5', '0.2', 'Iris-setosa', '33.238050274980004'],
       ['5.0', '3.6', '1.4', '0.2', 'Iris-setosa', '36.65191429188092'],
       ['5.4', '3.9', '1.7', '0.4', 'Iris-setosa', '51.911677007917746'],
       ['4.6', '3.4', '1.4', '0.3', 'Iris-setosa', '31.022180256648003'],
       ['5.0', '3.4', '1.5', '0.2', 'Iris-setosa', '39.269908169872416'],
       ['4.4', '2.9', '1.4', '0.2', 'Iris-setosa', '28.38324242763259'],
       ['4.9', '3.1', '1.5', '0.1', 'Iris-setosa', '37.714819806345474'],
       ['5.4', '3.7', '1.5', '0.2', 'Iris-setosa', '45.80442088933919'],
       ['4.8', '3.4', '1.6', '0.2', 'Iris-setosa', '38.60389052731138'],
       ['4.8', '3.0', '1.4', '0.1', 'Iris-setosa', '33.77840421139745'],
       ['4.3', '3.0', '1.1', '0.1', 'Iris-seto

## *Concepts required to complete this task*

*   Array indexing and slicing (Also optionally `np.newaxis`)
*   Array operations
*   Combining Arrays
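
The concepts listed above can be sketched on a two-row toy array. This is only an illustration under stated assumptions: the name `toy` and its two rows are chosen for the example, and `np.newaxis` is shown as the optional alternative to `reshape(-1, 1)`. The volume column is converted to strings explicitly so that all columns share one dtype before stacking.

```python
import numpy as np

# Toy array in the same layout as iris_2d (all values are strings)
toy = np.array([['5.1', '3.5', '1.4', '0.2', 'Iris-setosa'],
                ['4.9', '3.0', '1.4', '0.2', 'Iris-setosa']])

# Slicing a column and casting it to float so arithmetic works
sepal_length = toy[:, 0].astype(float)
petal_length = toy[:, 2].astype(float)

# Element-wise array operations; np.newaxis turns shape (2,) into (2, 1)
volume_col = (np.pi * petal_length * sepal_length ** 2 / 3)[:, np.newaxis]

# Combining arrays column-wise; the floats are cast to strings to match toy
toy_with_volume = np.hstack([toy, volume_col.astype(str)])

print(toy_with_volume.shape)
print(toy_with_volume[0])
```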




## Rubric

- +12 points for correctness (using `numpy` wherever necessary to achieve the desired output)
- +8 points for conciseness
- +5 points for proper comments and variable names

# Part II: One-hot encodings (25 points)

**One-hot encoding** is a type of representation of data that is very popular in machine learning and natural language processing applications. One-hot encoding is an array (or a vector) representation where all the elements of the array are 0 except one, which has 1 as its value.

For example, `[0,0,0,1,0,0]` is a one-hot array. That is simply because all the elements are `0` (or cold) except one which is `1` (or hot).

Why do we use this? Where do we use this? Well, in layman's terms, let's say we had to create an *intelligent* program that takes the `petal_length` and the `sepal_length` as inputs and predicts the type of iris flower as output, i.e., whether the flower is *setosa*, *versicolor*, or *virginica*. Now, our computers do not understand anything except 0s and 1s, so we need a way to represent *setosa*, *versicolor*, and *virginica*. How do we do that? There are many ways. One of the ways we do that in practice is to create a vector representation where *setosa* gets mapped to `[1,0,0]`, *versicolor* gets mapped to `[0,1,0]`, and *virginica* gets mapped to `[0,0,1]`. This makes it easier to process this data in machine learning algorithms. There are many intricate details related to one-hot encodings that are not important for this task, but you will learn those later in the course when we talk about machine learning and classification algorithms.

## Prompt

Write a function called `create_one_hot()` that takes in an arbitrary list of elements, and returns the one-hot encodings for the list in the form of a 2D numpy array.

Here are some examples for what your program should return on different inputs.






### Example 1

If the input was

`['setosa', 'versicolor', 'virginica', 'virginica', 'setosa']`

Your function should return

`[[1,0,0],[0,1,0],[0,0,1],[0,0,1],[1,0,0]]`

**A very important note:** In this example, we assume `setosa` to be mapped to `[1,0,0]`, `versicolor` to be mapped to `[0,1,0]`, and `virginica` to be mapped to `[0,0,1]`. However, in practice, you can assign any place to any element in the input list as long as it is **consistent**. For example, if you assume `virginica` to be `[1,0,0]`, `setosa` to be `[0,1,0]` and `versicolor` to be `[0,0,1]`, then your program should return

`[[0,1,0],[0,0,1],[1,0,0],[1,0,0],[0,1,0]]`

Both of these are valid outputs from your function. It will depend on how you map each element in the list.

### Example 2

If the input was

`[2,3,2,2,1,4]`

Your function should return

`[
    [0,1,0,0],
    [0,0,1,0],
    [0,1,0,0],
    [0,1,0,0],
    [1,0,0,0],
    [0,0,0,1]
]`

Notes:

1. The input to your function is a list, and not a numpy array.
2. Your function should use numpy functions.
3. Our reference solution is no more than 10 lines of code.
4. For an efficient implementation, you should look at `np.unique()`, `np.zeros()` and use `enumerate`.
5. For an even more compact implementation (1 line), you could optionally look at the array method `ndarray.view()` in conjunction with `np.unique()` and array slicing.
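
The approach hinted at in note 4 can be sketched as follows. This is one possible reading of the hint, not the reference solution; the function name `one_hot_with_zeros` is hypothetical, and the sketch returns a numpy array with columns ordered by the sorted unique values.

```python
import numpy as np

def one_hot_with_zeros(lst):
    # Hypothetical helper illustrating np.unique + np.zeros + enumerate
    values = np.asarray(lst)
    unique_vals = np.unique(values)  # sorted unique values, one column each
    one_hot = np.zeros((len(values), len(unique_vals)), dtype=int)
    # Loop over the unique values only (not over every element of the list)
    for col, val in enumerate(unique_vals):
        one_hot[values == val, col] = 1  # boolean mask selects matching rows
    return one_hot

print(one_hot_with_zeros([2, 3, 2, 2, 1, 4]))
```

Each row contains exactly one `1`, and equal inputs always land in the same column, which is the consistency requirement described in Example 1.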

In [45]:
def create_one_hot(lst):

    # Write your implementation below this line

    ############### SOLUTION ###############
    '''
    The idea is to map each unique element in the list to a row (or column) in an identity matrix
    So, for instance, if we have 2 unique elements in the list, we will create a 2x2 identity matrix [[1, 0], [0, 1]]
    These will be our one-hot encoded vectors for each element in the list
    '''
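    # np.unique sorts the unique values; return_inverse=True also gives, for each
    # element of lst, the index of its value among those unique values
    unique_vals, indices = np.unique(lst, return_inverse=True)
    # Row i of the identity matrix is the one-hot vector of the i-th unique value
    return np.eye(len(unique_vals))[indices].astype(int).tolist()  # One row per element, as a nested list of ints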

    ############### SOLUTION END ###############


# How we will call your function
print(create_one_hot(['setosa', 'versicolor', 'virginica', 'virginica', 'setosa']))
print(create_one_hot(['1', '1', '2', '2', '3']))
print(create_one_hot([1, 1, 2, 3, 3]))


[[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 1], [1, 0, 0]]
[[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1]]
[[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 1]]


## *Concepts required to complete this task*

*   Creating arrays
*   Iterating over arrays
*   Array Indexing and Slicing

## Rubric

- +12 points for correctness (using `numpy` wherever necessary to achieve the desired output)
- +8 points for conciseness
- +5 points for proper comments and variable names

# Part III: *That's not even a number!* (25 points)

A huge chunk of your time in this course and in your projects will be spent in the task of **data wrangling**. *Data wrangling* is a popular term in data science that is used to define the initial process of refining and cleaning raw data into content or formats better-suited for consumption by downstream tasks. Legend has it that data scientists spend about 80% of their time cleaning and manipulating data.

Now, in computing, there is such a thing as `NaN`, standing for **Not a Number**. It is a member of a numeric data type that can be interpreted as a value that is undefined or unrepresentable. NaNs typically represent *missing values* in data.

NaNs are not good for many downstream tasks such as visualizations or training machine learning algorithms. Therefore, we must deal with them somehow, either by filtering them out or by *imputing* them with appropriate values such as 0s or the mean of the data, row, or column. In this task we will impute them with the *mean* of the column.


## Prompt

We have a 2D numpy array with NaNs at many places in the array, and we would like to replace NaNs with the mean of the column they are part of. Write a function called `impute_NaNs()` that takes as input the `array_2d` and returns the appropriately imputed `imputed_array_2d`.



### Example

If your input array was

```
[ [1,  NaN,  3],
  [NaN, 2,   4],
  [3,   3,   5]]
```

Then your function should output

```
[ [1,  2.5,  3],
  [2,   2,   4],
  [3,   3,   5]]
```

Notes:

1. Use numpy functions, and avoid loops wherever possible.
2. NaN is represented by `np.nan`
3. Look up the `np.mean` and `np.nanmean` functions, either of which may be useful depending on how you implement your function.
4. Also look up `np.isnan`, as it may come in handy.
5. You should not need to use any other numpy functions besides the ones taught in class and recitation, or explicitly mentioned in this handout. If you do, comment them well, and be prepared to explain them.
6. Our reference solution is no more than 3 lines of code.

In [46]:
def impute_NaNs(array_2d):
  # Write your implementation below this line
  ############### SOLUTION ###############
  col_mean = np.nanmean(array_2d, axis=0) # Compute nanmean of each column
  inds = np.where(np.isnan(array_2d)) # Find indices where nan values are present
  array_2d[inds] = np.take(col_mean, inds[1]) # Replace the nan values with the mean values

  ############### SOLUTION END ###########
  # Returning the imputed array
  return array_2d


In [47]:
# How we may test your solution
array_2d = np.array([[1, np.nan, np.nan],
                      [np.nan, 2, 4],
                      [3, 3, 5]])
print(impute_NaNs(array_2d))

[[1.  2.5 4.5]
 [2.  2.  4. ]
 [3.  3.  5. ]]
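
Note that the solution above assigns into `array_2d` directly, so the caller's array is modified as well. If that side effect is not wanted, a non-mutating variant can be sketched with the three-argument form of `np.where`. The name `impute_nans_copy` is hypothetical and this is an alternative sketch, not the reference solution.

```python
import numpy as np

def impute_nans_copy(array_2d):
    # Hypothetical variant: returns a new array and leaves the input untouched
    col_mean = np.nanmean(array_2d, axis=0)  # one mean per column, NaNs ignored
    # Where a value is NaN take the column mean (broadcast across rows),
    # otherwise keep the original value
    return np.where(np.isnan(array_2d), col_mean, array_2d)

original = np.array([[1, np.nan, np.nan],
                     [np.nan, 2, 4],
                     [3, 3, 5]])
imputed = impute_nans_copy(original)

print(imputed)
print(np.isnan(original).sum())  # the NaNs are still present in the input
```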


## *Concepts required to complete this task*

*   Array indexing and slicing
*   Iterating over arrays
*   Numpy library (`np.where`, `np.isnan`, `np.mean`, etc.)


## Rubric

- +12 points for correctness (using `numpy` wherever necessary to achieve the desired output)
- +8 points for conciseness
- +5 points for proper comments and variable names

## Interesting things about NaNs

Here's one interesting fact about NaNs:

In [48]:
np.nan == np.nan # :D

False

Another interesting fact about NaNs:

In [49]:
np.nan in [np.nan] # Mind == Blown! (Not really if you give it some thought :))
# Is it because something that's not a number does in fact exist in a list of things that are not numbers?

True
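
One way to make sense of the two results above: for lists, Python's `in` operator first checks identity (`is`) and only then equality (`==`). `np.nan` is a single object, so it is found in a list that contains that same object, even though it never compares equal to itself. The sketch below uses a separately created NaN (the name `other_nan` is just for illustration) to show the difference.

```python
import math
import numpy as np

other_nan = float('nan')      # a NaN that is a different object from np.nan

print(np.nan is np.nan)       # same object, so the identity check succeeds
print(other_nan == np.nan)    # NaN never compares equal, even to another NaN
print(other_nan in [np.nan])  # neither identical nor equal to the list element
print(np.isnan(other_nan), math.isnan(np.nan))  # reliable ways to test for NaN
```

In practice this means `np.isnan` (or `math.isnan`) should be used to detect NaNs rather than `==` or `in`.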

# Part IV: Softmax (25 points)

If you ever pursue a career in machine learning or deep learning, **softmax** is a mathematical function that you will most certainly come across. In simple terms, softmax turns arbitrary real values into probabilities that sum up to 1, transforming the values into a probability distribution.

In fancy terms, given $n$ numbers, $x_1 \ldots x_n$, softmax performs the following transform on $n$ numbers:

$$s(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{n}e^{x_j}}$$

Don't get intimidated by the equation. What is basically happening here is the following:

1. We raise $e$ (the mathematical constant) to the power of each of the numbers, and sum up all the exponentials (powers of e). This result is the *denominator*.
2. Then we use each number's exponential as its *numerator*.
3. Finally, $softmax = \frac{numerator}{denominator}$
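
The three steps above can be traced on a tiny input. The values in `x` are made up for illustration, and the intermediate variable names are chosen only to mirror the steps.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])   # made-up input values

exponentials = np.exp(x)         # e raised to each number (the numerators)
denominator = exponentials.sum() # sum of all the exponentials
probabilities = exponentials / denominator  # numerator / denominator

print(probabilities)
print(probabilities.sum())
```

Larger inputs receive larger probabilities, and the outputs add up to 1 (up to floating-point rounding).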


## Prompt

In this part of the assignment, we want you to simply implement softmax using numpy.

More concretely, given a Python list, write a function that takes in the list and returns a numpy array of softmax values of the elements in the list. Your function should not be more than 2 lines of code, and you may use the `np.exp()` function to compute the exponentials. `np.exp()` takes in a numpy array and returns $e$ raised to the power of each element in the array. Once you are done, validate using numpy that the sum of the elements in your output array equals 1.

In [50]:
def softmax(lst):
  # Write your implementation of the function below this line

  ######### SOLUTION #########
  return np.exp(lst)/np.sum(np.exp(lst),axis=0)

  ######### SOLUTION END #########

example_lst = [-1, 0, 3, 5, 6, -1, 0, -3]
output_array = softmax(example_lst)

# Write your implementation validating that the sum equals 1 here
######### SOLUTION #########
np.sum(output_array)

######### SOLUTION END #########

1.0
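
As an optional aside that goes beyond what the task requires: for very large inputs, `np.exp()` can overflow. A common workaround is to subtract the maximum value before exponentiating, which leaves the result mathematically unchanged. The sketch below assumes a hypothetical function name `stable_softmax` and also uses `np.isclose` for the sum check, since floating-point sums may differ from 1 by a tiny rounding error.

```python
import numpy as np

def stable_softmax(lst):
    # Hypothetical variant: shifting by the max does not change the ratios
    # but keeps every exponent at or below 0, so np.exp cannot overflow
    shifted = np.asarray(lst, dtype=float) - np.max(lst)
    exps = np.exp(shifted)
    return exps / exps.sum()

large_inputs = [1000, 1001, 1002]  # made-up values that overflow a plain np.exp
stable_output = stable_softmax(large_inputs)

print(stable_output)
print(np.isclose(stable_output.sum(), 1.0))  # tolerant check that the sum is 1
```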

## *Concepts required to complete this task*

*   Array Operations
*   `np.exp()`

## Rubric

- +12 points for correctness (using `numpy` wherever necessary to achieve the desired output)
- +8 points for conciseness
- +5 points for proper comments and variable names