# Homework 7

**Due: 11/22/2020 on gradescope**

## References

+ Lectures 12-13 (inclusive).


## Instructions

+ Type your name and email in the "Student details" section below.
+ Develop the code and generate the figures you need to solve the problems using this notebook.
+ For the answers that require a mathematical proof or derivation you can either:
    
    - Type the answer using the built-in latex capabilities. In this case, simply export the notebook as a pdf and upload it on gradescope; or
    - You can print the notebook (after you are done with all the code), write your answers by hand, scan, turn your response to a single pdf, and upload on gradescope.

+ The total homework points are 100. Please note that the problems are not weighted equally.

**Note**: Please match all the pages corresponding to each of the questions when you submit on gradescope. 

## Student details

+ **First Name:**
+ **Last Name:**
+ **Email:**

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import pandas as pd
import seaborn as sns
sns.set_context('paper')
sns.set_style('white')
import scipy.stats as st
# A helper function for downloading files
import requests
import os
def download(url, local_filename=None):
    """
    Downloads the file in the ``url`` and saves it in the current working directory.
    """
    data = requests.get(url)
    # Fail loudly on HTTP errors instead of saving an error page to disk
    data.raise_for_status()
    if local_filename is None:
        local_filename = os.path.basename(url)
    with open(local_filename, 'wb') as fd:
        fd.write(data.content)

# Problem 1 - Clustering Uber Pickup Data

In this problem you will analyze Uber pickup data collected during April 2014 around New York City.
The complete data are freely available on [Kaggle](https://www.kaggle.com/fivethirtyeight/uber-pickups-in-new-york-city/).
The data consist of a timestamp (which we are going to ignore), the latitude and longitude of the Uber pickup, and a base code (which we are also ignoring).
The data file we are going to use is [uber-raw-data-apr14.csv](https://raw.githubusercontent.com/PurdueMechanicalEngineering/me-297-intro-to-data-science/master/homework/uber-raw-data-apr14.csv).
As usual, you have to make it visible to this Jupyter notebook.
On Google Colab, just run this:

In [None]:
url = 'https://raw.githubusercontent.com/PurdueMechanicalEngineering/me-297-intro-to-data-science/master/homework/uber-raw-data-apr14.csv'
download(url)

And you can load it using pandas:

In [None]:
p1_data = pd.read_csv('uber-raw-data-apr14.csv')

Here is a text view:

In [None]:
p1_data

As you can see, there were about half a million Uber pickups during April 2014...
Let's extract only the latitude and longitude data (this is needed for passing them to scikit-learn algorithms).
Here is how you can do this in pandas:

In [None]:
# Just use the column names as indices.
# The two brackets are required because you are actually
# passing a list of columns
loc_data = p1_data[['Lat', 'Lon']]
loc_data

Let's also visualize these points:

In [None]:
fig, ax = plt.subplots(dpi=150)
ax.scatter(loc_data.Lon, loc_data.Lat, s=0.01)
# ``s=0.01`` specifies the size. I am using a small size because
# these are too many points to visualize
ax.set_xlabel('Longitude')
ax.set_ylabel('Latitude');

This is nice, but it would be even nicer if we had a map of New York City on the background.
We can make such a map on [www.openstreetmap.org](https://www.openstreetmap.org/export#map=11/40.7855/-73.8964).
We just need a box of longitudes and latitudes that overlaps with our data.
Here is how to get such a *bounding box*:

In [None]:
box = ((loc_data.Lon.min(), loc_data.Lon.max(),
        loc_data.Lat.min(), loc_data.Lat.max()))
box

I have already extracted this picture for you and you can find it [here](https://github.com/PredictiveScienceLab/data-analytics-se/blob/master/homework/ny_map.png).
As always, it needs to be visible from the Jupyter notebook.
On Google Colab run:

In [None]:
url = 'https://github.com/PurdueMechanicalEngineering/me-297-intro-to-data-science/raw/master/homework/ny_map.png'
download(url)

If you have it at the right place, you should be able to see the image here:

![New York City Map](ny_map.png)

Now let's load the image as a matrix:

In [None]:
ny_map = plt.imread('ny_map.png')

And we can visualize it with ``plt.imshow`` and draw the Uber pickups on top of it.
Here is how:

In [None]:
fig, ax = plt.subplots(dpi=600)
ax.scatter(loc_data.Lon, loc_data.Lat, zorder=1, alpha= 0.5, c='b', s=0.001)
ax.set_xlim(box[0],box[1])
ax.set_ylim(box[2],box[3])
ax.imshow(ny_map, zorder=0, extent=box, aspect= 'equal')
ax.set_xlabel('Longitude')
ax.set_ylabel('Latitude');

Because we have over half a million data points, machine learning algorithms may be a bit slow.
So, as you develop your code, use only the first 100K observations.
Once you have a stable version of your code, modify the following code segment to make use of the entire dataset.

In [None]:
# While you are developing your code use this:
p1_train_data = loc_data[:100000]
# When you have a stable code, use this:
# p1_train_data = loc_data

## Part A - Splitting New York City into Subregions

Suppose that you are assigned the task of splitting New York City into operating subregions with pretty much equal demand.
When a pickup is requested in each subregion only the drivers in that region are called.
Note that this can become a very difficult problem very quickly.
We are not looking for the best possible answer here.
This would require posing and solving a suitable optimization problem.
We are looking for a data-informed solution that is compatible with common sense.

Do (at least) the following:
+ Use Kmeans clustering on the pickup data with different numbers of clusters;
+ Visualize the labels of the clusters on the map using different colors (see the hands-on activities);
+ Visualize the centers of the discovered Kmeans clusters (in red color);
+ Use your common sense, e.g., make sure that you have enough clusters so that no region crosses the water (even if it is possible the drivers may have to pay tolls to cross). If it is impossible to get perfect results simply by Kmeans, feel free to ignore a small number of outliers as they could be handled manually;
+ Use [MiniBatchKMeans](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html#sklearn.cluster.MiniBatchKMeans), which is a much faster version of Kmeans suitable for large datasets (>10K observations).
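
To make the API concrete, here is a minimal sketch of the `MiniBatchKMeans` workflow on synthetic two-dimensional points. The synthetic blobs, the value `n_clusters=3`, and all variable names are assumptions made for illustration only; they are neither the pickup data nor a suggested number of subregions.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Synthetic stand-in for (Lat, Lon) pairs: three well-separated blobs.
rng = np.random.default_rng(0)
blob_centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
points = np.vstack(
    [c + 0.3 * rng.standard_normal((500, 2)) for c in blob_centers]
)

# n_clusters=3 is an assumption that only suits this toy example.
kmeans = MiniBatchKMeans(n_clusters=3, n_init=3, random_state=0).fit(points)

labels = kmeans.predict(points)      # cluster index of every point
centers = kmeans.cluster_centers_    # array of shape (n_clusters, 2)
```

Passing `c=labels` to `ax.scatter` colors each point by its cluster, and a second `ax.scatter` call with `c='r'` can overlay `centers`.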

Answer with as many text blocks and code blocks as you like right below.

In [None]:
# Your code here

## Part B - Create a Stochastic Model of Pickups

One of the key ingredients for a more sophisticated approach to optimizing the operations of Uber would involve the construction of a stochastic model of the demand for pickups.
The ideal model for this problem is the [Poisson Point Process](https://en.wikipedia.org/wiki/Poisson_point_process).
However, we are going to do something simpler, using the Gaussian mixture model and a Poisson random variable.
The model will not have a time component, but it will allow us to sample the number and locations of pickups during a typical month.
We will guide you through the process of constructing this model.

### Subpart B.I - Random variable capturing number of monthly pickups

Find the rate of monthly pickups (ignore the fact that months may differ by a few days) and use it to define a Poisson random variable corresponding to the monthly number of pickups.
Use ``scipy.stats.poisson`` to initialize this random variable. Sample from it 10,000 times and plot the histogram of the samples to get a feeling about the corresponding probability mass function.
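
As a rough illustration of the `scipy.stats.poisson` calls involved, the sketch below uses a made-up rate. The value `rate = 1000.0` is purely hypothetical; in the homework the rate has to come from the data.

```python
import numpy as np
import scipy.stats as st

# Hypothetical rate, chosen only so that the example runs.
rate = 1000.0

# A "frozen" Poisson random variable with that rate.
N = st.poisson(rate)

# Draw 10,000 independent samples (seeded for reproducibility).
samples = N.rvs(size=10000, random_state=0)

# For a Poisson variable, both the mean and the variance equal the rate.
print(samples.mean(), samples.var())
```

A call such as `ax.hist(samples, bins=50, density=True)` then gives a quick picture of the probability mass function.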

In [None]:
# Your code here

### Subpart B.II - Estimate the spatial density of pickups

Fit a Gaussian Mixture model to the pickup data.
**Do not use the Bayesian Information Criterion** to decide how many components to keep.
This would take quite a bit of time for this problem. Simply use 40 mixture components.
Plot the contour of the logarithm of the probability density on the New York City map.
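
The sketch below shows one way to evaluate the log density of a fitted `GaussianMixture` on a grid, which is the array a contour plot needs. The synthetic data and `n_components=2` are assumptions for this toy example, not the settings requested above.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic two-dimensional data with two clumps (not the pickup data).
rng = np.random.default_rng(0)
data = np.vstack([
    rng.normal([0.0, 0.0], 0.5, size=(300, 2)),
    rng.normal([3.0, 3.0], 0.5, size=(300, 2)),
])

gmm = GaussianMixture(n_components=2, random_state=0).fit(data)

# Build a grid that covers the data.
x = np.linspace(data[:, 0].min(), data[:, 0].max(), 50)
y = np.linspace(data[:, 1].min(), data[:, 1].max(), 50)
X, Y = np.meshgrid(x, y)
grid = np.column_stack([X.ravel(), Y.ravel()])

# score_samples returns the log of the probability density at each point.
log_pdf = gmm.score_samples(grid).reshape(X.shape)
```

`ax.contour(X, Y, log_pdf)` would then draw the contours. If a model is fit on columns ordered as (`Lat`, `Lon`), the grid passed to `score_samples` must use that same column order, even though longitude goes on the horizontal axis of the plot.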

In [None]:
# Your code here

### Subpart B.III - Sample some random months of pickups

Now that you have a model that gives you the number of pickups and a model that allows you to sample a pickup location, sample five different datasets (number of pickups and location of each pickup) from the combined model and visualize them on the New York map.

**Hint:** Don't get obsessed with making the model perfect. It's okay if a few of the pickups are on water...
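
A possible shape for the combined sampler is sketched below on toy data. The rate of `200.0`, the two-component mixture, and the names `count_rv` and `datasets` are all assumptions for illustration.

```python
import numpy as np
import scipy.stats as st
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Toy spatial model fit to synthetic points (not the pickup data).
data = rng.normal([0.0, 0.0], 1.0, size=(500, 2))
# A RandomState instance (rather than an int seed) is passed so that
# repeated calls to sample() keep advancing the same random stream.
gmm = GaussianMixture(
    n_components=2, random_state=np.random.RandomState(0)
).fit(data)

# Toy count model with a hypothetical rate.
count_rv = st.poisson(200.0)

datasets = []
for _ in range(5):
    # First sample how many points there are...
    n = int(count_rv.rvs(random_state=rng))
    # ...then sample that many locations from the mixture.
    locations, _components = gmm.sample(n)
    datasets.append(locations)
```

Each entry of `datasets` is an array with one row per sampled location, so it can be passed to `ax.scatter` column by column.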

In [None]:
# Your code here

# Problem 2 - Counting Celestial Objects

Consider [this picture](https://github.com/PurdueMechanicalEngineering/me-297-intro-to-data-science/raw/master/datasets/galaxies.png) of a patch of sky taken by the [Hubble Space Telescope](https://www.nasa.gov/mission_pages/hubble/story/index.html):

In [None]:
url = 'https://github.com/PurdueMechanicalEngineering/me-297-intro-to-data-science/raw/master/homework/galaxies.png'
download(url)

![galaxies](galaxies.png)

This picture includes many galaxies, but also some stars.
We are going to create a machine learning model capable of counting the number of objects in such images.
Our model is not going to be able to differentiate between the different types of objects, and it will not be very accurate, but it does form the basis of more sophisticated approaches.
The idea is as follows:
+ Convert the picture to points sampled according to the intensity of light.
+ Apply Gaussian mixture on the resulting points.
+ Use the Bayesian Information Criterion to identify the number of components in the picture.
+ Associate the number of components with the actual number of celestial objects.
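
For reference, the generic pattern of selecting the number of mixture components with the BIC can be sketched as follows. The synthetic points, the candidate range 1 to 6, and the variable names are assumptions; this is an illustration of the idea on toy data rather than a solution for the image.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic points clustered around three centers in the unit square.
rng = np.random.default_rng(0)
true_centers = np.array([[0.2, 0.2], [0.8, 0.3], [0.5, 0.8]])
points = np.vstack(
    [c + 0.03 * rng.standard_normal((200, 2)) for c in true_centers]
)

# Fit one model per candidate number of components and record its BIC.
candidates = range(1, 7)
models = []
bics = []
for nc in candidates:
    m = GaussianMixture(n_components=nc, random_state=0).fit(points)
    models.append(m)
    bics.append(m.bic(points))

# The preferred model is the one with the smallest BIC.
best_idx = int(np.argmin(bics))
best_nc = candidates[best_idx]
best_model = models[best_idx]
```

Plotting `bics` against `list(candidates)` is a quick way to check that the minimum is not sitting at the edge of the search range.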

I will set you up with the first step. You will have to do the last three.

We are going to load the image with the [Python Imaging Library (PIL)](https://en.wikipedia.org/wiki/Python_Imaging_Library) which allows us to apply a few basic transformations to the image:

In [None]:
from PIL import Image
hubble_image = Image.open('galaxies.png')
# here is how to see the image
hubble_image

Now we are going to convert it to gray scale and crop it to make the problem a little bit easier:

In [None]:
img = hubble_image.convert('L').crop((100, 100, 300, 300))
img

Remember that black-and-white images are matrices:

In [None]:
img_ar = np.array(img)
img_ar

The minimum number is $0$, corresponding to black, and the maximum number is $255$, corresponding to white.
Anything in between is some shade of gray.

Now, imagine that each pixel is associated with some coordinates.
Without loss of generality, let's assume that each pixel is some coordinate in $[0,1]^2$.
We will loop over each of the pixels and sample its coordinates with a probability that increases with light intensity.
To achieve this, we will pass the intensity value of each pixel through a sigmoid with parameters that can be tuned.
Here is this sigmoid:

In [None]:
intensities = np.linspace(0, 255, 255)
fig, ax = plt.subplots(dpi=150)
alpha = 0.1
beta = 255 / 3
ax.plot(intensities, 1.0 / (1.0 + np.exp(-alpha * (intensities - beta))));
ax.set_xlabel('Light intensities')
ax.set_ylabel('Probability of sampling the pixel coordinates');

And here is the code that samples the pixel coordinates.
I am organizing it into a function because we may want to use it with different pictures:

In [None]:
def sample_pixel_coords(img, alpha, beta):
    """
    Samples pixel coordinates based on a probability defined as the sigmoid of the intensity.
    
    Arguments:
    
        img       -     The gray scale picture from which we sample as an array
        alpha     -     The scale of the sigmoid
        beta      -     The offset of the sigmoid
    """
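    # Cast to float so that subtracting an integer ``beta`` cannot
    # wrap around in unsigned 8-bit arithmetic
    img_ar = np.array(img, dtype=float)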
    x = np.linspace(0, 1, img_ar.shape[0])
    y = np.linspace(0, 1, img_ar.shape[1])
    X, Y = np.meshgrid(x, y)
    img_to_locs = []
    # Loop over pixels
    for i in range(img_ar.shape[1]):
        for j in range(img_ar.shape[0]):
            # Calculate the probability of sampling the pixel from its
            # light intensity
            prob = 1.0 / (1.0 + np.exp(-alpha * (img_ar[j, i] - beta)))
            # Pick a uniform random number
            u = np.random.rand()
            # If u is smaller than the desired probability,
            # then consider the coordinates of the pixel sampled
            if u <= prob:
                img_to_locs.append((Y[i, j], X[-i-1, -j-1]))
    # Turn img_to_locs into a numpy array
    img_to_locs = np.array(img_to_locs)
    return img_to_locs
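
For larger images the double loop can be slow. The sketch below is a vectorized variant built on the same idea; the function name `sample_pixel_coords_vectorized` and the optional `rng` argument are hypothetical additions, not part of the assignment. It maps column index to the horizontal coordinate and flips the row index so that the points line up with `imshow(..., extent=(0, 1, 0, 1))`.

```python
import numpy as np

def sample_pixel_coords_vectorized(img, alpha, beta, rng=None):
    """Sample pixel coordinates in [0, 1]^2 with sigmoid probabilities."""
    rng = np.random.default_rng() if rng is None else rng
    # Float cast so that ``img_ar - beta`` cannot wrap around in uint8.
    img_ar = np.asarray(img, dtype=float)
    n_rows, n_cols = img_ar.shape
    # Sampling probability of every pixel at once.
    prob = 1.0 / (1.0 + np.exp(-alpha * (img_ar - beta)))
    # One uniform draw per pixel; keep the pixel if the draw is small enough.
    keep = rng.random(img_ar.shape) <= prob
    rows, cols = np.nonzero(keep)
    # Columns run left to right; rows run top to bottom, hence the flip.
    x = cols / (n_cols - 1)
    y = 1.0 - rows / (n_rows - 1)
    return np.column_stack([x, y])

# Tiny check: a dark 10x10 image with a single bright pixel.
toy = np.zeros((10, 10), dtype=np.uint8)
toy[2, 7] = 255
toy_locs = sample_pixel_coords_vectorized(
    toy, alpha=1.0, beta=128, rng=np.random.default_rng(0)
)
```

With these settings the dark pixels have a sampling probability that is numerically negligible and the bright pixel has probability one, so `toy_locs` contains only the coordinates of the bright pixel.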

Let's test the code:

In [None]:
locs = sample_pixel_coords(img, alpha=0.1, beta=200)
fig, ax = plt.subplots(dpi=150)
ax.imshow(img, extent=((0, 1, 0, 1)), zorder=0)
ax.scatter(locs[:, 0], locs[:, 1], zorder=1, alpha=0.5, c='b', s=1);

Note that by playing with $\alpha$ and $\beta$ you can make the whole thing more or less sensitive to the light intensity.

In [None]:
from sklearn.mixture import GaussianMixture

def count_objs(img, alpha, beta, nc_min=1, nc_max=50):
    """
    Counts objects in image.
    
    Arguments:
        img       -     The image
        alpha     -     The scale of the sigmoid
        beta      -     The offset of the sigmoid
        nc_min    -     The minimum number of components to consider
        nc_max    -     The maximum number of components to consider
    """
    locs = sample_pixel_coords(img, alpha, beta)
    # Use BIC to search for the best GaussianMixture model
    # with components between nc_min and nc_max 
    # Set the following variables
    best_nc = NotImplemented('Set this equal to the number of components of the best model.')
    best_model = NotImplemented('Set this equal to the best model.')
    return best_nc, best_model, locs

Once you have completed the code, try it out on the following images.
Feel free to play with $\alpha$ and $\beta$ to improve the performance.
**Do not try to make a perfect model. To do so, we would have to go beyond the Gaussian mixture model. This is just a homework problem.**

In [None]:
objs, model, locs = count_objs(img, alpha=1.0, beta=200)
fig, ax = plt.subplots(dpi=150)
ax.imshow(img, extent=((0, 1, 0, 1)))
for i in range(model.means_.shape[0]):
    ax.plot(model.means_[i, 0], model.means_[i, 1], 'rx', 
            markersize=10.0 * model.weights_.shape[0] * model.weights_[i])
ax.scatter(locs[:, 0], locs[:, 1], zorder=1, alpha=0.5, c='b', s=1)
ax.set_title('Counted {0:d} objects!'.format(objs));

Try this one:

In [None]:
img = hubble_image.convert('L').crop((200, 200, 400, 400))
objs, model, locs = count_objs(img, alpha=.1, beta=250)
fig, ax = plt.subplots(dpi=150)
ax.imshow(img, extent=((0, 1, 0, 1)))
for i in range(model.means_.shape[0]):
    ax.plot(model.means_[i, 0], model.means_[i, 1], 'rx', 
            markersize=10.0 * model.weights_.shape[0] * model.weights_[i])
ax.scatter(locs[:, 0], locs[:, 1], zorder=1, alpha=0.5, c='b', s=1)
ax.set_title('Counted {0:d} objects!'.format(objs));

And try this one:

In [None]:
img = hubble_image.convert('L').crop((300, 300, 500, 500))
objs, model, locs = count_objs(img, alpha=.1, beta=250)
fig, ax = plt.subplots(dpi=150)
ax.imshow(img, extent=((0, 1, 0, 1)))
for i in range(model.means_.shape[0]):
    ax.plot(model.means_[i, 0], model.means_[i, 1], 'rx', 
            markersize=10.0 * model.weights_.shape[0] * model.weights_[i])
ax.scatter(locs[:, 0], locs[:, 1], zorder=1, alpha=0.5, c='b', s=1)
ax.set_title('Counted {0:d} objects!'.format(objs));

# Problem 3 - Quantifying Uncertainties in Steel Magnetic Properties

The magnetic properties of steel are captured in the so-called [$B-H$ curve](https://en.wikipedia.org/wiki/Saturation_(magnetic)) which connects the magnetic field $H$ to the magnetic flux density $B$.
The shape of this curve depends on the manufacturing process of the steel. As a result, the $B-H$ curve differs not only across suppliers but also across time for the same supplier.

Let's use some real manufacturer data to visualize these differences.
The data are [here](https://github.com/PurdueMechanicalEngineering/me-297-intro-to-data-science/raw/master/homework/B_data.csv).
Start by downloading the data:

In [None]:
url = 'https://github.com/PurdueMechanicalEngineering/me-297-intro-to-data-science/raw/master/homework/B_data.csv'
download(url)

Let's load and visualize the data:

In [None]:
bh_data = np.loadtxt('B_data.csv')
fig, ax = plt.subplots(dpi=150)
ax.plot(bh_data.T, 'r', lw=0.1)
ax.set_xlabel('Index $j$')
ax.set_ylabel('$B$ (T)');

The dimensions of the data are:

In [None]:
bh_data.shape

Thus, we have 200 observations of $B-H$ curves, each of which is a 1,500-dimensional vector.
Our goal is to build a stochastic model of B-H curves.
We did the same thing in Problem 5 of Homework 2.
Now we are going to do it better since we are going to use Principal Component Analysis to reduce the dimensionality of the data.
Do the following:
+ Use PCA to reduce the dimensionality of the data. Make sure that you follow the same routine we followed in the hands-on activity:
    - separate into training and validation datasets
    - perform PCA on the training dataset
    - visualize the explained variance as a function of the number of components and figure out how many components you need to keep so that you capture 98% of the variance of the data
    - visualize the mean and the first three eigenvectors
    - visualize the reconstruction error on some points of the validation set and make sure that you have enough PCA components so that it is negligible
    - do the scatter plot of the two first principal components of the training data
    - use a Gaussian mixture model to find the density of the principal components you get (use the BIC to determine the number of Gaussian mixture components)
    
Once you have done all the above, demonstrate the resulting stochastic model by sampling from it 10 times, i.e., sample the principal components from the mixture, reconstruct the 1,500-dimensional vector of $B$'s, and plot it as a curve.
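
The overall pipeline (PCA, density estimation on the principal components, sampling, reconstruction) can be sketched on synthetic curves as follows. The `tanh`-shaped curves, the 120/30 split, and the single mixture component are assumptions chosen to keep the example short; in particular, the number of mixture components is fixed here rather than selected with the BIC.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

# 150 synthetic saturating curves on 100 grid points (not the B-H data).
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 100)
amp = 1.0 + 0.1 * rng.standard_normal((150, 1))
rate = 5.0 + 0.5 * rng.standard_normal((150, 1))
curves = amp * np.tanh(rate * t)

# Separate into training and validation sets.
train, valid = curves[:120], curves[120:]

# A float n_components keeps enough components to explain that
# fraction of the variance.
pca = PCA(n_components=0.98).fit(train)
Z_train = pca.transform(train)

# Relative reconstruction error on the validation set.
valid_rec = pca.inverse_transform(pca.transform(valid))
rel_err = (np.linalg.norm(valid_rec - valid, axis=1)
           / np.linalg.norm(valid, axis=1))

# Density of the principal components (one component, for brevity only).
gmm = GaussianMixture(n_components=1, random_state=0).fit(Z_train)

# Sample principal components and map them back to full curves.
Z_new, _ = gmm.sample(10)
new_curves = pca.inverse_transform(Z_new)
```

`np.cumsum(pca.explained_variance_ratio_)` gives the cumulative explained variance as a function of the number of components, and `pca.mean_` and `pca.components_` hold the mean and the eigenvectors for plotting.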

In [None]:
# Your code and text here