# Hw 1: Data Science Workflow and More NumPy 📈

Name: Mahoto Sasaki

Student ID: 467695

Collaborators: Chenthuran Abeyakaran (TA)


## Instructions

For this homework, work through **Lab 0 (Introduction to Python)** first. Most of the things we ask you to do in this homework are explained in the lab. In general, you should feel free to import any package that we have previously used in class. Ensure that all plots have the necessary components that a plot should have (e.g. axes labels, a title, a legend).

Work your way through these problems. They will provide some more practice for working with NumPy arrays. 
> Use NumPy arrays and functions in **all** problems unless explicitly stated otherwise! 

Frequently **save** your notebook!

### Collaborators and Sources
Furthermore, in addition to recording your **collaborators** on this homework, please also remember to **cite/indicate all external sources** used when finishing this assignment. 
> This includes peers, TAs, and links to online sources. 

Note that these citations will be taken into account during the grading and regrading process.

### Submission instructions
* Submit this Python notebook, including your answers in the code cells, as your homework submission.
* **Do not change the number of cells!** Your submission notebook should have exactly one code cell per problem. 
* Do **not** remove the `# your code here` line and add your solution **after** that line. 

## 1. Data Science Use Case I

Consider the following data science use case.

> *Assume you just landed a great analytical job with MegaTelCo, one of the largest telecommunication firms in the United States. They are having a major problem with customer retention in their wireless business. In the mid-Atlantic region, 20% of cell phone customers leave when their contracts expire, and it is getting increasingly difficult to acquire new customers. Since the cell phone market is now saturated, the huge growth in the wireless market has tapered off. Communications companies are now engaged in battles to attract each other’s customers while retaining their own. Customers switching from one company to another is called churn, and it is expensive all around: one company must spend on incentives to attract a customer while another company loses revenue when the customer departs.
You have been called in to help understand the problem and to devise a solution. Attracting new customers is much more expensive than retaining existing ones, so a good deal of marketing budget is allocated to prevent churn. Marketing has already designed a special retention offer. Your task is to devise a precise, step-by-step plan for how the data science team should use MegaTelCo’s vast data resources to decide which customers should be offered the special retention deal prior to the expiration of their contracts.*

### Problem 1.1

**Write-up!** Is this a scientific, social, or business problem? Briefly justify your answer. 

### Problem 1.2

**Write-up!** Using the **data science workflow** discussed in class, describe how you would tackle this task. Think carefully about what data you might use and how they would be used. Specifically, how should MegaTelCo choose a set of customers to receive their offer in order to best reduce churn for a particular incentive budget?

## 2. Data Science Use Case II

Consider the following data science use case.

> *Congratulations! You’ve just been hired to lead the data science efforts at DataSciencester, a not-for-profit  social network for data scientists. It’s your first day on the job at DataSciencester, and the VP of Networking is full of questions about the DataSciencester users.
In particular, he wants you to identify who the “key connectors” are among data scientists to better understand the data science community, its structure and the roles of its members.*

### Problem 2.1

**Write-up!** Is this a scientific, social, or business problem? Briefly justify your answer. 

### Problem 2.2

**Write-up!** Using the **data science workflow** discussed in class, describe how you would tackle this task. Think carefully about what data you might use, how they would be used to derive a solution, and how you would present the results to the VP of Networking. 

## 3. Timing Comparison

There were some questions about why we emphasize NumPy functions and indexing so intensely. To help explain and motivate this, let's do some performance comparisons.

In [1]:
# run me!
import numpy as np

For this experiment, we will be testing various operations with random arrays. Here is an example of how you can generate a random array using `np.random.rand`.

In [4]:
np.random.rand(10)

array([0.28051483, 0.50868897, 0.95390907, 0.59180934, 0.45654595,
       0.585108  , 0.63876377, 0.4115953 , 0.33839472, 0.69312673])

### Problem 3.1

**Implement this!** Complete the function below so that it returns a random array of size `n`. Assign your new array to the `result` variable.

In [5]:
def generate_random_array(n):
    '''Returns a random array of a given shape.'''
    
    # your code here
    result = np.random.rand(n)
    
    return result

Let's try it out! You should see an array similar to the one from earlier.

In [6]:
generate_random_array(10)

array([0.99097949, 0.98931036, 0.78643025, 0.62918432, 0.58361681,
       0.58444073, 0.63621384, 0.82900832, 0.58345115, 0.14792089])

### 3.1 Summing Arrays

Let's try a sum operation. There are two ways of computing the sum of an array using built-in features. The first is to use a `for` loop and iterate through the values in an array. The second is to use the `sum` function.

### Problem 3.2

**Implement this!** Complete the function below so that it returns the sum of an array `a`, making sure to use a `for` loop. Again, assign your result to the `result` variable.

In [7]:
def loop_sum(a):
    '''Computes the sum of array a using a for loop.'''
    
    # your code here
    result = 0
    for i in a:
        result += i
    
    return result

Let's check that your function works.

In [8]:
an_array = generate_random_array(100)

'Nice!' if loop_sum(an_array) == sum(an_array) else 'Something went wrong...'

'Nice!'

### Problem 3.3

Now that we have a working `for` loop sum implementation, `loop_sum`, let's compare its performance to both the Python built-in `sum` function and the NumPy `np.sum` function. We will use an [IPython magic command 🧙‍♀️](https://ipython.readthedocs.io/en/stable/interactive/tutorial.html#magic-functions) (I'm not kidding) called `%timeit`. This command runs a given statement repeatedly and reports the mean runtime and its standard deviation.

In [9]:
another_array = generate_random_array(1000)

%timeit loop_sum(another_array)
%timeit sum(another_array)
%timeit np.sum(another_array)

212 µs ± 18.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
162 µs ± 4.04 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
4.59 µs ± 353 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
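
Outside of IPython, the `%timeit` magic is not available, but a similar comparison can be sketched with the standard-library `timeit` module. This is a minimal sketch, not the course's method: `loop_sum` is redefined here so the block is self-contained, and the names `data`, `candidates`, and `timings` as well as the choice of 200 repetitions are assumptions made for illustration. Absolute numbers will differ from machine to machine.

```python
import timeit

import numpy as np


def loop_sum(a):
    """Sum the values of `a` with an explicit Python for loop."""
    total = 0.0
    for value in a:
        total += value
    return total


data = np.random.rand(1000)

# Each candidate is a zero-argument callable so timeit can call it directly.
candidates = {
    "loop_sum": lambda: loop_sum(data),
    "built-in sum": lambda: sum(data),
    "np.sum": lambda: np.sum(data),
}

repetitions = 200  # assumed value; larger numbers give steadier averages
timings = {}
for label, func in candidates.items():
    # timeit.timeit returns the total seconds for `number` calls.
    total_seconds = timeit.timeit(func, number=repetitions)
    timings[label] = total_seconds / repetitions
    print(f"{label:>12}: {timings[label] * 1e6:8.2f} µs per call")
```

Unlike `%timeit`, this sketch reports a single average per candidate rather than a mean and standard deviation over several runs.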


**Write-up!** What do you notice about these outputs? What would happen if you added more dimensions to your array? What do these results tell us?

### 3.2 Finding a Value

In Lab 0 and HW 1, we needed to find the age and class of the student from a roster who would graduate first. Let's use this setup to do another comparison. Here is the data that we worked with.

In [13]:
names = ['Billy','Meghan','Jeff','Alex','Cate']
roster = np.array([
    [50, 2021],
    [18, 2020],
    [21, 2019],
    [21, 2021],
    [21, 2020]
])

In Problem 7 from HW 1, we used the Python built-in `min` function with a lambda to accomplish this for a `List` version of the roster. What would happen if we had used the same method to do the same for a NumPy array?

In [14]:
min(roster, key=lambda student: student[1])

array([  21, 2019])

An appropriate NumPy equivalent of the code is this:

In [15]:
roster[np.argmin(roster[:, 1])]

array([  21, 2019])
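
To see why the NumPy expression works, it can help to split it into steps. The sketch below repeats the small roster so that it runs on its own; the intermediate names `grad_years`, `first_index`, `first_row`, and `first_name` are introduced here for illustration only.

```python
import numpy as np

names = ['Billy', 'Meghan', 'Jeff', 'Alex', 'Cate']
roster = np.array([
    [50, 2021],
    [18, 2020],
    [21, 2019],
    [21, 2021],
    [21, 2020],
])

grad_years = roster[:, 1]            # second column: every graduation year
first_index = np.argmin(grad_years)  # position of the smallest year (2 here)
first_row = roster[first_index]      # the full [age, year] row at that position
first_name = names[first_index]      # same position in the parallel names list

print(first_index, first_row, first_name)
```

Note that `np.argmin` returns the index of the first occurrence when several rows share the minimum value.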

### Problem 3.4

Before we check out the performance of each of these implementations, let's expand our roster a bit. The following cell generates a new roster with entries for 1000 students.

In [16]:
roster = np.array([np.random.randint(16, 100, size=1000),
                   np.random.randint(2018, 2022, size=1000)]).T

Here's a preview of the new roster containing the first ten rows in the array:

In [17]:
roster[:10, :]

array([[  41, 2021],
       [  88, 2019],
       [  29, 2020],
       [  18, 2018],
       [  22, 2020],
       [  20, 2020],
       [  85, 2019],
       [  50, 2018],
       [  55, 2021],
       [  98, 2020]])

Now we are ready to evaluate these implementations using the `%timeit` magic command from earlier. 

In [41]:
%timeit min(roster, key=lambda student: student[1])
%timeit roster[np.argmin(roster[:, 1])]

366 µs ± 25.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
2.4 µs ± 24.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


**Write-up!** What do you notice about these outputs? What would happen if you added more dimensions to your array? What do these results tell us?

## 4. Indexing Review

Indexing, especially with NumPy, can be a tricky feature to truly wrap one's head around, but the benefits of (working towards) mastering it make it a worthy endeavor. The more practice you get, the easier it will become. Eventually you won't even be able to imagine how else you would do things.

In this section, we will start with some review and then move on to more complex features.

### 4.1 Basic Indexing

At the risk of belaboring this topic, let's quickly do some practicing. The following cell produces `yet_another_array` using the `arange` [(array range) NumPy function](https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.arange.html), which we will use in the next few problems.

In [19]:
yet_another_array = np.arange(10)
yet_another_array

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

### Problem 4.1

**Implement this!** Retrieve the 4th element of `yet_another_array`.

In [20]:
# your code here
yet_another_array[3]

3

### Problem 4.2

Here's something new: specifically for NumPy arrays, you can also pass in a Python list or NumPy array of indices to retrieve.

For example, `some_array[[0, 2, 4, 6]]`.
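
As a small illustration of indexing with a list or array of indices, here is a sketch on a made-up array. The contents of `some_array` and the names `picked_with_list`, `picked_with_array`, and `repeated` are assumptions chosen for this example.

```python
import numpy as np

some_array = np.arange(10, 20)  # values 10 through 19

# A Python list of positions selects exactly those elements.
picked_with_list = some_array[[0, 2, 4, 6]]             # [10, 12, 14, 16]

# A NumPy array of positions behaves the same way.
picked_with_array = some_array[np.array([0, 2, 4, 6])]  # [10, 12, 14, 16]

# The result follows the order of the index list, and repeats are allowed.
repeated = some_array[[9, 0, 0]]                        # [19, 10, 10]

print(picked_with_list, picked_with_array, repeated)
```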


**Implement this!** Retrieve the 2nd, 5th, and 9th elements of `yet_another_array`.

In [47]:
# your code here
yet_another_array[[1, 4, 8]]

array([1, 4, 8])

### Problem 4.3

Let's do this with a 2D array, too. Here we will generate `a_2d_array` using the `np.random.rand` function.

In [21]:
a_2d_array = np.random.rand(5, 5)
a_2d_array

array([[0.54687084, 0.3321218 , 0.7415408 , 0.96675191, 0.38726851],
       [0.68939549, 0.51138471, 0.67003501, 0.61476236, 0.08245001],
       [0.75741783, 0.91696026, 0.32407569, 0.5308505 , 0.78418507],
       [0.20682688, 0.99466032, 0.34140406, 0.53957635, 0.37166742],
       [0.76973912, 0.5132058 , 0.71670305, 0.8432723 , 0.41019939]])

Remember that you can index into a 2D array like this: `another_2D_array[row, column]`, where `row` and `column` are indices, slices, `:`s, or some mix of these three.
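
The combinations of `row` and `column` can be sketched on a small array whose values are easy to read. The array `grid` and the result names below are hypothetical and exist only for this example.

```python
import numpy as np

# 3 rows by 4 columns:
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]
grid = np.arange(12).reshape(3, 4)

single_value = grid[1, 2]   # index, index -> one value: 6
first_row = grid[0, :]      # index, `:`   -> a whole row: [0, 1, 2, 3]
second_col = grid[:, 1]     # `:`, index   -> a whole column: [1, 5, 9]
block = grid[:2, 1:3]       # slice, slice -> a 2x2 sub-array: [[1, 2], [5, 6]]

print(single_value, first_row, second_col, block, sep="\n")
```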

**Implement this!** Retrieve the value at position (3, 4) from `a_2d_array`.

In [22]:
# your code here
a_2d_array[3, 4]

0.37166741792531277

### 4.2 Slice Indexing

You've already gotten familiar with slices, but here is some more practice.
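
As a quick refresher, a slice has the form `start:stop:step`, where `stop` is exclusive and any part may be omitted. The sketch below uses a made-up array, `values`, so that positions and contents are easy to tell apart.

```python
import numpy as np

values = np.arange(10) * 10   # [0, 10, 20, ..., 90]

middle = values[2:5]          # positions 2, 3, 4 -> [20, 30, 40]
first_three = values[:3]      # start omitted     -> [0, 10, 20]
last_three = values[7:]       # stop omitted      -> [70, 80, 90]
every_other = values[::2]     # step of 2         -> [0, 20, 40, 60, 80]

print(middle, first_three, last_three, every_other)
```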

### Problem 4.4

**Implement this!** Retrieve the 6th, 7th, and 8th elements of `yet_another_array`.

In [23]:
# your code here
yet_another_array[5:8]

array([5, 6, 7])

### Problem 4.5

Again, we can do this with a 2D array, too. Here is a reminder of what `a_2d_array` looks like.

In [24]:
a_2d_array

array([[0.54687084, 0.3321218 , 0.7415408 , 0.96675191, 0.38726851],
       [0.68939549, 0.51138471, 0.67003501, 0.61476236, 0.08245001],
       [0.75741783, 0.91696026, 0.32407569, 0.5308505 , 0.78418507],
       [0.20682688, 0.99466032, 0.34140406, 0.53957635, 0.37166742],
       [0.76973912, 0.5132058 , 0.71670305, 0.8432723 , 0.41019939]])

**Implement this!** Retrieve the values in the 3rd column of `a_2d_array`.

In [26]:
# your code here
a_2d_array[:, 2]

array([0.7415408 , 0.67003501, 0.32407569, 0.34140406, 0.71670305])

### 4.3 Logical Indexing

Now for something completely new! You might be interested in finding all of the values in an array that fulfill some condition, like _get all values greater than 5_. NumPy enables this by supporting [**logical indexing**](https://docs.scipy.org/doc/numpy-1.13.0/user/basics.indexing.html#boolean-or-mask-index-arrays).

This idea is also referred to as **masking**. You can use an array of boolean values _of the same shape_ as the target array as the "index" into the target array. This will return all of the values that are in the same positions as the `True` values in the boolean array. These logical arrays are called "masks" because they are analogous to physical masks, which let some of the underlying surface show through but hide the rest.

Let's try making our first logical array. In the cell below, we use a comparison operation on `yet_another_array`, returning an array of the same size with `True` in the positions where the values meet the condition.

In [54]:
yet_another_array > 5

array([False, False, False, False, False, False,  True,  True,  True,
        True])

Now let's see how this array "lines up" with `yet_another_array`.

In [55]:
yet_another_array

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

Here is another example using an array that is not sorted already.

In [56]:
shuffled_array = np.random.permutation(yet_another_array)
shuffled_array

array([1, 3, 7, 4, 5, 0, 8, 2, 6, 9])

In [57]:
shuffled_array > 5

array([False, False,  True, False, False, False,  True, False,  True,
        True])

Is this what you would expect? Let's actually get those values.

In [58]:
shuffled_array[shuffled_array > 5]

array([7, 8, 6, 9])
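
To make the "same shape" idea concrete, the sketch below builds a mask by hand and checks that it matches the one produced by a comparison. The array `values`, the threshold of 4, and the names `hand_mask`, `computed_mask`, and `how_many` are assumptions made for this example.

```python
import numpy as np

values = np.array([1, 3, 7, 4, 5])

# A mask written out by hand: one boolean per element of `values`.
hand_mask = np.array([False, False, True, False, True])
selected_by_hand = values[hand_mask]      # keeps positions 2 and 4 -> [7, 5]

# The same mask produced by a comparison.
computed_mask = values > 4
selected_by_rule = values[computed_mask]  # [7, 5]

# True counts as 1, so summing a mask counts the matching elements.
how_many = computed_mask.sum()            # 2

print(selected_by_hand, selected_by_rule, how_many)
```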

### Problem 4.6

**Implement this!** Retrieve the elements of `yet_another_array` that are even (hint: `%`).

In [61]:
# your code here
yet_another_array[yet_another_array % 2 == 0]

array([0, 2, 4, 6, 8])

And that's it!