<a href="https://colab.research.google.com/github/CMU-MS-DAS-Modern-Programming-Mini/fall2022-homework-2/blob/main/Modern_Programming_for_Data_Analytics_Homework_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Modern Programming for Data Analytics
Name:

Andrew ID:

# Exercise - GeoIP
a. Install the packages [ip2geotools](https://pypi.org/project/ip2geotools/) and [Faker](https://pypi.org/project/Faker/) using `pip`.

In [None]:
# INSERT CODE IN THIS CELL

b. Use `Faker` to generate a list of 50 random [IPv4](https://faker.readthedocs.io/en/master/providers/faker.providers.internet.html#) addresses. Instructions to generate these random values can be found in the documentation.

**Hint**
* Set seed to `123`, i.e. 
```
seed = 123
Faker.seed( seed )
```

In [None]:
# INSERT CODE IN THIS CELL

c. Use `ip2geotools` to get information about the IPs. Use the method `get` from `DbIpCity`. Read the [documentation](https://pypi.org/project/ip2geotools/) on how to properly use the method. 

* The responses should be saved to a list named `responses`.

**Hint**
* Set the `api_key` to `free`.
* Export the response from the get method to a JSON block using the method `to_json`.

In [None]:
# INSERT CODE IN THIS CELL

d. Data cleanup. Some of the responses have empty fields for `latitude` and `longitude`. Remove the entries from the list that are missing either of these values.

**Hint**
* If the latitude and longitude are missing, then the value of either of these is set to `None`.
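
The filtering idea can be sketched on made-up data. This is only an illustration under stated assumptions: the `sample_responses` list, its field values, and the helper name `has_coordinates` are hypothetical stand-ins for the JSON strings stored in `responses`, not the actual output of `ip2geotools`.

```python
import json

# Hypothetical stand-ins for the JSON strings produced by `to_json`.
sample_responses = [
    json.dumps({"ip_address": "192.0.2.1", "latitude": 40.44, "longitude": -79.99}),
    json.dumps({"ip_address": "192.0.2.2", "latitude": None, "longitude": None}),
    json.dumps({"ip_address": "192.0.2.3", "latitude": 51.5, "longitude": None}),
]


def has_coordinates(response_json):
    """Return True when both latitude and longitude are present (not None)."""
    record = json.loads(response_json)
    return record.get("latitude") is not None and record.get("longitude") is not None


# Keep only the entries with both coordinates; the strings stay unparsed.
cleaned = [r for r in sample_responses if has_coordinates(r)]
print(len(cleaned))
```

Comparing against `None` with `is not None` keeps legitimate coordinates equal to `0.0`, which a plain truthiness check would discard.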

In [None]:
# INSERT CODE IN THIS CELL

e. Save variables to disk. Use [pickle](https://docs.python.org/3/library/pickle.html) to serialize the variable `responses`. Save the pickle file to the course folder in your Google Drive as a file named `ips.pkl`.
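
As a rough sketch of the pickle round trip, the snippet below writes a list to disk and reads it back. The list `responses_example` and the temporary directory are assumptions made for illustration; in Colab the target would instead be a path inside your mounted Google Drive.

```python
import os
import pickle
import tempfile

# Hypothetical data standing in for the `responses` list.
responses_example = ['{"latitude": 40.44, "longitude": -79.99}']

# Hypothetical location: a temporary folder instead of the Drive course folder.
target_path = os.path.join(tempfile.mkdtemp(), "ips.pkl")

# Pickle files are binary, so open with "wb" to write and "rb" to read.
with open(target_path, "wb") as handle:
    pickle.dump(responses_example, handle)

with open(target_path, "rb") as handle:
    restored = pickle.load(handle)

print(restored == responses_example)
```

Loading the file back right after saving is a cheap way to confirm that the data survived serialization.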

In [None]:
# INSERT CODE IN THIS CELL

This section plots the data above. The plot will fail if any entries are missing data, so make sure the data is clean. There is nothing to do here.

**Hint**
* If the plot is printed, then you are good to go. Keep in mind the shape of `responses`: it is a list of strings that need to be converted to JSON.

In [None]:
# DO NOT MODIFY THIS BLOCK
!pip install basemap
!pip install basemap-data-hires

import matplotlib
import matplotlib.pyplot as plt
import matplotlib.cm
import json
import math
from mpl_toolkits.basemap import Basemap
from matplotlib.patches import Polygon
from matplotlib.collections import PatchCollection
from matplotlib.colors import Normalize

fig, ax = plt.subplots(figsize=(25,25))
m = Basemap(resolution='i',  # c, l, i, h, f or None
            projection='merc',
            lat_1=45., lat_2=55, lat_0=50, lon_0=-107,
            llcrnrlon=-180, llcrnrlat=-70, urcrnrlon=180, urcrnrlat=80)
m.drawmapboundary(fill_color='#45bcec')
m.fillcontinents(color='#f2f2f2',lake_color='#46bcec')

scale = 0.1
for response in responses:
    response = json.loads(response)

    if response['longitude'] is not None:
        lon = response['longitude']
        lat = response['latitude']
        markerSize = scale*response['longitude']
        x, y = m(lon,lat)
        plt.plot(x, y, markersize = markerSize, color = 'red', marker = 'o')

plt.show()

# Exercise - Random sampling
NumPy has a very robust library for sampling from random distributions. For a detailed list of discrete and continuous distributions that can be sampled from, see the [documentation](https://numpy.org/doc/stable/reference/random/index.html).

For example, sampling from an exponential distribution can be achieved as follows:

In [None]:
# DO NOT EDIT THIS CELL
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(123)

# Sample from an exponential distribution (NumPy's `scale` is 1/lambda; both equal 1 here)
elambda = 1
values = np.random.exponential(scale=elambda, size=10000)

num_bins = 100
fig, ax = plt.subplots()
n, bins, patches = ax.hist(values, num_bins, density=True)
ax.set_xlabel('Values')
ax.set_ylabel('Probability density')
ax.set_title(r'Random sampling')
fig.tight_layout()

plt.show()

a. Inverse Transform Sampling. The inverse CDF method is a widely documented method for generating random samples.

In this exercise you will use this method to sample from an exponential distribution with parameter `lambda=1`.

Since this is a widely documented method, part of this exercise includes finding the method and implementing it yourself.

* Set `lambda=1`.
* Generate `10000` samples.
* Save the samples to a variable named `samples`.

**Hint**
* This is not complicated; you should be able to write it in a couple of lines.
* Feel free to use online resources like Stack Overflow.
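
To illustrate the general recipe without giving away the exercise, here is a minimal sketch for a different distribution, chosen for this example only: the density f(x) = 2x on [0, 1], whose CDF is F(x) = x² and whose inverse CDF is therefore F⁻¹(u) = √u. The function name `sample_triangular_density` is hypothetical, and the sketch uses only the standard library rather than NumPy.

```python
import math
import random


def sample_triangular_density(n, seed=123):
    """Draw n samples from f(x) = 2x on [0, 1] via inverse transform sampling."""
    rng = random.Random(seed)
    # Step 1: draw u uniformly from [0, 1).
    # Step 2: push u through the inverse CDF, F^{-1}(u) = sqrt(u).
    return [math.sqrt(rng.random()) for _ in range(n)]


demo_samples = sample_triangular_density(10000)
demo_mean = sum(demo_samples) / len(demo_samples)

# The theoretical mean of f(x) = 2x on [0, 1] is 2/3.
print(demo_mean)
```

The same two steps apply to the exponential distribution once you work out its inverse CDF.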

In [None]:
# INSERT CODE IN THIS CELL
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(123)

# <--- INSERT CODE HERE -->
# <--- INSERT CODE HERE -->

num_bins=100
fig, ax = plt.subplots()
n, bins, patches = ax.hist(samples, num_bins, density=True)
ax.set_xlabel('Values')
ax.set_ylabel('Probability density')
ax.set_title(r'Random sampling')
fig.tight_layout()
plt.show()

# Exercise - SIR model
The SIR model is a simple mathematical model of epidemics. The entities in this model stand for

* (S)usceptible: individuals that are not infected with the disease yet. However, they are not immune to it either, and so they can become infected with the disease in the future.
* (I)nfected or infectious: individuals that are infected with the disease and can transmit the disease to susceptible people.
* (R)ecovered: individuals who have recovered from the disease and are immune, so they can no longer be infected.

In its most basic form, this model can be represented as

<img src="https://wikimedia.org/api/rest_v1/media/math/render/svg/c2a8fd2e93bfcf1092a44cfec7ef32c1a80a26f4" />

where 

* β is the average number of contacts per person per time
* γ is the probability of a contagious person becoming non-contagious
* N is the population size (constant)
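
For reference, the system can also be written out in LaTeX. This transcription is taken from the `model` function in the code cell of this exercise, so it assumes that code matches the image above:

$$
\begin{aligned}
\frac{dS}{dt} &= -\frac{\beta S I}{N} \\
\frac{dI}{dt} &= \frac{\beta S I}{N} - \gamma I \\
\frac{dR}{dt} &= \gamma I
\end{aligned}
$$

Adding the three right-hand sides gives zero, so S + I + R stays equal to N at all times. This is a quick sanity check for any numerical solution.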

Consider the numerical solution below using the parameters

* `beta = 0.2`
* `gamma = 0.07`
* `N = 100`
* `S0 = 99`
* `I0 = 1`
* `R0 = 0`





In [None]:
# DO NOT MODIFY THIS BLOCK
!pip install numpy
!pip install scipy

import numpy as np
from scipy.integrate import odeint

# Model
def model(y, t, N, beta, gamma):
    S, I, R = y
    dSdt = -beta * S * I / N
    dIdt = beta * S * I / N - gamma * I
    dRdt = gamma * I
    return dSdt, dIdt, dRdt

# Parameters
N = 100
beta = 0.2
gamma = 0.07
S_0 = 99
I_0 = 1
R_0 = 0
t = np.linspace(0, 360, 360)

# Initial conditions vector
y0 = (S_0,I_0,R_0)

# Solve using ODE solver
results = odeint(model, y0, t, args=(N,beta,gamma))

## Pretty plot
Use matplotlib and seaborn to make a figure. 

* The figure size should be `25x25`.
* The figure should have one plot with three lineplots, `t vs S`, `t vs I` and `t vs R`.
* Set title to `SIR model`.
* Set x-label to `Time (t)`.
* Set y-label to `Population size`.
* Make sure each line plot uses different colors.

In [None]:
# INSERT CODE IN THIS CELL

## Save plot to disk
Save the plot above to the course folder in your Google Drive as a file named `sir.png`.

In [None]:
# INSERT CODE IN THIS CELL

# Exercise - Matrix multiplication using `numpy` (updated edition).

Design and implement a method called `can_be_multiplied`.

* Use only the standard library and `numpy`.
* This method takes two NumPy arrays and returns `True` if the two matrices can be multiplied, and `False` otherwise.
* Work on the assumption that if the input arguments are `a` and `b` (in that order), then the operation `a*b` will be checked by this method and not `b*a`.
* If the input argument is not a NumPy array, then the method should return `None`.
* **NEW**. If any of the arrays is a NumPy array, then check that the `dtype` of these arrays is numeric.
* **NEW**. If any of the arrays is empty, then issue a warning message letting the user know which of the arrays is empty.
* Write docstrings for this method.
* Write at least 6 assertions to test your method.

**Hint**
* Test for empty matrices. If any of the matrices are empty, then this method should return `False`.
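
The core shape rule can be sketched on plain shape tuples. This is a partial illustration only, not the requested method: the helper name `shapes_compatible` is hypothetical, it assumes 2-D inputs, and it leaves out the type checks, `dtype` checks, and warnings that the exercise asks for.

```python
def shapes_compatible(shape_a, shape_b):
    """Return True if a 2-D array of shape_a can left-multiply one of shape_b."""
    # Assumption for this sketch: only 2-D (matrix) shapes are considered.
    if len(shape_a) != 2 or len(shape_b) != 2:
        return False
    # A dimension of size 0 means the matrix is empty.
    if 0 in shape_a or 0 in shape_b:
        return False
    # (m x n) times (n x p): the inner dimensions must agree.
    return shape_a[1] == shape_b[0]


print(shapes_compatible((2, 3), (3, 4)))
print(shapes_compatible((3, 4), (2, 3)))
```

Swapping the arguments changes the answer in the second call, which is why the order of `a` and `b` matters.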

In [None]:
# INSERT CODE IN THIS CELL