# Homework 5

## Problem 1: Changing colors

Take a picture of something and put the file in the same folder as this notebook. This will be a lot more fun if the picture has multiple colors!

In [None]:
import numpy as np
from matplotlib import pyplot as plt
import matplotlib.image as mpimg

Write code below that reads your image file, saves it to a variable `img`, and plots it. You can copy and paste code from the lecture.

My example:

![image1.png](image1.png)



In [None]:
#your code here

How many pixels does your image have?
What are the minimum and maximum values of your image? What is the data type?

You should use https://numpy.org/doc/stable/reference/generated/numpy.ndarray.dtype.html
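
As a rough illustration of the attributes involved, here is a sketch on a tiny made-up array (`toy` is a hypothetical stand-in, not your image). As far as the matplotlib documentation describes it, `mpimg.imread` returns PNG data as floats between 0 and 1 and other formats as integers, so it is worth checking the range and data type of your own image rather than assuming 0 to 255.

```python
import numpy as np

# Hypothetical stand-in for an image: 2 rows, 3 columns, 3 color channels
toy = np.array([[[0, 10, 20], [30, 40, 50], [60, 70, 80]],
                [[90, 100, 110], [120, 130, 140], [150, 160, 255]]],
               dtype=np.uint8)

# Pixels are counted over rows and columns only, not over the channel axis
n_pixels = toy.shape[0] * toy.shape[1]

print(n_pixels)              # 6
print(toy.min(), toy.max())  # 0 255
print(toy.dtype)             # uint8
```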

In [None]:
# write and run code here that displays your answer


Run the code below and concisely explain what it does in the comments.

In [None]:
fig, ax = plt.subplots(1)

# write here
ax.set(xticks=[], yticks=[])

# write something here that involves RGB
ax.imshow(img[:,:,::-1])

Let's apply a sepia filter to your picture. You've probably seen this effect in movies and photo apps: https://helpx.adobe.com/photoshop/key-concepts/sepia-tone.html

The algorithm is actually very simple:
1. Let R, G, B be the RGB channels of the image.
2. Calculate:
```
newRed = 0.393*R + 0.769*G + 0.189*B
newGreen = 0.349*R + 0.686*G + 0.168*B
newBlue = 0.272*R + 0.534*G + 0.131*B
```
3. Clip newRed, newGreen, newBlue such that their values stay <= 255.
4. Turn values of newRed, newGreen, newBlue to integers.
5. Set newRed, newGreen, newBlue to be the new RGB channels of the image.

Links that might be helpful:
- https://numpy.org/doc/stable/reference/generated/numpy.minimum.html
- https://numpy.org/doc/stable/reference/generated/numpy.ndarray.astype.html
- https://numpy.org/doc/stable/reference/generated/numpy.dstack.html
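
To see how the three linked functions fit together, here is a minimal sketch on a made-up 1x2 image. It is not the sepia filter: the factor 1.5 and the names `toy`, `brighter_R`, and `result` are assumptions chosen only for illustration.

```python
import numpy as np

# Hypothetical 1x2 "image" with 3 channels
toy = np.array([[[200, 100, 50], [10, 20, 30]]], dtype=np.uint8)

# Slice out the channels (these are views; toy itself is never written to)
R, G, B = toy[:, :, 0], toy[:, :, 1], toy[:, :, 2]

brighter_R = 1.5 * R                      # new float array; 200 becomes 300.0
brighter_R = np.minimum(brighter_R, 255)  # elementwise cap at 255
brighter_R = brighter_R.astype(np.uint8)  # back to integers

# Stack three 2D channels along a new third axis
result = np.dstack([brighter_R, G, B])
print(result.shape)  # (1, 2, 3)
```

Because every step builds a new array, `toy` is left unchanged, which is the behavior the full-credit rules ask for.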

To get full credit:
- No for loops over the pixels.
- No modifying the original image.
- Add a short docstring.

![image2.png](image2.png)

In [None]:
def make_sepia(img):
    pass

In [None]:
#call the function here to show the image

This time, turn your color image into black & white (grayscale). 
You need to:

1. Take the average of the R, G, B channels.
2. Turn the averaged values into integers.
3. Set that as the new RGB channels of the image.
4. Write a short docstring.
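
One pitfall to keep in mind when averaging, assuming your image is stored as 8-bit integers: adding `uint8` arrays wraps around at 256 instead of growing. The sketch below uses made-up one-element arrays `a` and `b` to show the effect and one way around it.

```python
import numpy as np

a = np.array([200], dtype=np.uint8)
b = np.array([100], dtype=np.uint8)

wrapped = a + b             # uint8 arithmetic wraps around: 300 - 256 = 44
safe = a.astype(float) + b  # float arithmetic keeps the true sum: 300.0

print(wrapped[0], safe[0])
```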

![image3.png](image3.png)

In [None]:
def make_grayscale(img):
    pass

In [None]:
#call the function here to show the image

(extra credit)

Now that you can make grayscale images, let's try making one look old. Many old photos have salt-and-pepper noise, which is just randomly occurring black and white pixels: https://en.wikipedia.org/wiki/Salt-and-pepper_noise.

You need to:
1. Turn your image grayscale by calling your function from the previous part.
2. Based on the percent parameter `pct`, figure out how many pixels in your image will be modified. Call that number `n`.
3. Randomly select `0.5n` pixels and set those pixel values to be 0.
4. Randomly select `0.5n` pixels and set those pixel values to be 255.
5. Write a short docstring.

This might be useful: https://numpy.org/doc/stable/reference/random/generated/numpy.random.randint.html
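
As a hint at how `np.random.randint` can pick pixel positions, here is a small sketch on a made-up single-channel array. The names `toy`, `rows`, `cols`, and `noisy` are hypothetical, and the seed is there only to make the run repeatable. Note that the upper bound of `randint` is exclusive and that the same position can be drawn more than once.

```python
import numpy as np

np.random.seed(0)  # assumption: fixed seed, only for reproducibility

toy = np.full((4, 5), 128, dtype=np.uint8)  # hypothetical 4x5 gray image
n = 6                                       # number of positions to draw

# One random row index and one random column index per draw
rows = np.random.randint(0, toy.shape[0], size=n)
cols = np.random.randint(0, toy.shape[1], size=n)

noisy = toy.copy()     # work on a copy so the original stays intact
noisy[rows, cols] = 0  # set the drawn positions to black
```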

![image4.png](image4.png)

In [None]:
def salt_pepper_noise(img, pct):
    pass

In [None]:
#call the function here to show the image

Congratulations! You're basically Photoshop now!

## Problem 2: Scatterplot Matrices

Run the following code to download, import, and display a data set from the 2021 World Happiness Report. 

https://worldhappiness.report/ed/2021/

In [None]:
# if you experience ConnectionRefused errors, you may instead 
# copy the url into your browser and save the file manually

import pandas as pd

url = "https://happiness-report.s3.amazonaws.com/2021/DataForFigure2.1WHR2021C2.xls"
happiness = pd.read_excel(url)

# uncomment to see what the raw data looks like
# happiness

In [None]:
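# only take some of the columns; .copy() gives an independent table so that
# adding a column below does not trigger a SettingWithCopyWarning
happiness = happiness[happiness.columns[[0, 2, 13, 14, 15, 16, 17, 18]]].copy()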

# add overall rank
happiness.loc[:, "Overall rank"] = range(1, len(happiness)+1) 

# renaming columns
cols = [c.replace("Explained by: ", "") for c in happiness.columns]
cols[0] = "Country or region"
cols[1] = "Score"
cols[2] = "GDP per capita"
happiness = happiness.rename(columns= dict(zip(happiness.columns, cols)))

# switching column order
cols = ["Overall rank"] + cols[:-1]
happiness = happiness[cols]

# save changed table to csv file
happiness.to_csv('happiness_report_2021.csv', header=True)

In [None]:
happiness

This is a `pandas` data frame. Observe the following: 

1. Each row corresponds to a country or region. 
2. The `Score` column is the overall happiness score of the country, evaluated via surveys. 
3. The other columns give indicators of different features of life in the country, including GDP, level of social support, life expectancy, freedom, generosity of compatriots, and perceptions of corruption in governmental institutions. 

 You can extract each of these columns using dictionary-like syntax: 
 
 ```python
 happiness["Score"]
 ```
 
 ```
0      7.769
1      7.600
2      7.554
3      7.494
4      7.488
       ...  
151    3.334
152    3.231
153    3.203
154    3.083
155    2.853
Name: Score, Length: 156, dtype: float64
 ```
 
Technically, this output is a `pandas` `Series`; however, in this context (and most others) it's fine to simply think of it as a 1-dimensional `np.array()`. 

### Part A

As a warmup, create a scatterplot of the overall `Score` column against a numerical column of your choice. Give the horizontal and vertical axes appropriate labels. Discuss your result. Is there a correlation? Does that correlation make sense to you? 

In [None]:
# plotting code here

---

*Discuss here*

---

### Part B

That plot you made may have helped you understand whether or not there's a relationship between the overall happiness score and the variable that you chose to plot. However, there are several variables in this data set, and we don't want to manually re-run the plot for each pair of variables. Let's see if we can get a more systematic view of the correlations in the data. 

Write a function called `scatterplot_matrix()`, with arguments `cols` and `figsize`. The `cols` argument should be a list of strings, each of which is the name of one of the columns above, for example `cols = ["Score", "GDP per capita", "Social support"]`. Your function should create a *scatterplot matrix*, like this:  

```python
cols = ["Score", 
        "GDP per capita", 
        "Social support"]
           
scatterplot_matrix(cols,figsize = (7,7))
```

<figure class="image" style="width:70%">
  <img src="scatterplot_matrix.png" alt="A 3x3 grid of scatterplots, showing relationships between Score and GDP, Score and Social support, and GDP and Social support. The diagonal is blank. All three variables are strongly correlated with each other." width="400px">
</figure>

There is a separate scatterplot for each possible pair of variables. In fact, there are two: one where the first variable is on the horizontal axis, and one where it's on the vertical axis. Some analysts prefer to remove half the plots to avoid redundancy, but you don't have to bother with that. The diagonal is empty, since there's no point in investigating the relationship between a variable and itself. 
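
The grid layout itself can be sketched without any data. The snippet below only builds an n-by-n array of axes and writes a label in each off-diagonal panel; the list `names` is a hypothetical placeholder, and filling the panels with actual scatterplots is left to your function.

```python
from matplotlib import pyplot as plt

names = ["a", "b", "c"]  # hypothetical placeholder labels
n = len(names)

# plt.subplots(n, n) returns a 2D array of axes, indexed as ax[row, column]
fig, ax = plt.subplots(n, n, figsize=(6, 6))

for i in range(n):
    for j in range(n):
        ax[i, j].set(xticks=[], yticks=[])
        if i != j:  # leave the diagonal panels blank
            ax[i, j].text(0.5, 0.5, f"{names[i]} vs {names[j]}",
                          ha="center", va="center")

plt.tight_layout()
```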

Don't forget comments and docstrings! 

In [None]:
# define your function
def scatterplot_matrix(cols, figsize):
    """
    Create a grid of scatterplots between each pair of the specified columns in the happiness
    data set. 
    cols: a list of column names in the happiness data set. Only these columns will be plotted. 
    figsize: the desired dimensions of the figure. 
    """
    pass

In [None]:
# test your code, several times if needed, and discuss the correlations you observe. 
# Add code cells if needed to show multiple outputs. 

### Part C

The *correlation coefficient* is a measure of linear correlation between two variables. The correlation coefficient between $X$ and $Y$ is high if $X$ tends to be high when $Y$ is, and vice versa. Correlation coefficients lie in the interval $[-1, 1]$. 
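
For reference, the usual meaning of the term is the Pearson correlation coefficient. Assuming paired observations $(x_1, y_1), \dots, (x_n, y_n)$ with sample means $\bar{x}$ and $\bar{y}$ (notation introduced here, not used elsewhere in this assignment), it can be written as:

$$
\rho_{X,Y} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
$$

The numerator is positive when $X$ and $Y$ tend to be above or below their means together, and the denominator rescales the result into $[-1, 1]$.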

`numpy` provides a function to conveniently compute the correlation coefficient between two or more variables. Find it, and then use it to add "captions" (as horizontal axis labels) to each panel of your plot giving the correlation coefficient between the plotted variables. For example, 

```python
cols = ["Score", 
        "GDP per capita", 
        "Social support"]
           
scatterplot_matrix(cols,figsize = (7,7))
```

<figure class="image" style="width:70%">
  <img src="scatterplot_matrix_with_cors.png" alt="A 3x3 grid of scatterplots, showing relationships between Score and GDP, Score and Social support, and GDP and Social support. The diagonal is blank. All three variables are strongly correlated with each other. Correlation coefficients between 0.70 and 0.84 lie beneath each plot." width="400px">
</figure>

It's not required that you add the Greek letter $\rho$ (the classical symbol for correlation coefficients), but if you do want to, here's how. You can also tweak the rounding as desired.  

```python
ax.set(xlabel = r"$\rho$ = " + str(np.round(my_number, 2)))
```


Run your code on several different subsets of the columns. Discuss your findings. What positive correlations do you observe? Do they make sense? Are there any negative correlations? Do the quantitative results match what you see "by eye"? 

If you were going to create a model to attempt to predict overall happiness from other indicators, which columns would you use? Why? 

---

*Discuss here*

---

## Problem 3: Plotting Time Series

Run the following code to download two time series data sets: 
- Historical data on the Dow Jones Industrial Average (a composite performance measure of the US stock market), retrieved from Yahoo Finance. 
- Cumulative COVID19 cases over time, from the [New York Times](https://github.com/nytimes/covid-19-data). 

In [None]:
# run this block
# if you experience ConnectionRefused errors, you may instead 
# copy the urls into your browser, save the files as DJI.csv 
# and COVID.csv respectively in the same directory as the notebook.
# Then, in the lines using the function pd.read_csv(), replace 
# the url with "DJI.csv" and "COVID.csv" 

import datetime

url = "https://query1.finance.yahoo.com/v7/finance/download/%5EDJI?period1=1580750232&period2=1712372632&interval=1d&events=history&includeAdjustedClose=true"
DJI = pd.read_csv(url)
DJI['date'] = pd.to_datetime(DJI['Date'])
# drop() returns a new table, so assign the result back
DJI = DJI.drop(["Date"], axis = 1)

url = "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us.csv"
COVID = pd.read_csv(url)
COVID['date'] = pd.to_datetime(COVID['date'])

### Part A

The series `COVID['cases']` is essentially a `numpy` array containing the cumulative case counts over time. The COVID19 case data is cumulative, but we would like to see the number of new cases per day (as in [this kind of plot](https://www.google.com/search?q=covid+stats)). Check the documentation for the `np.diff` function and figure out what it does. Use it appropriately to construct a new array, called `per_day`, giving the number of new cases per day. Then, make a new array called `per_day_date` that gives the appropriate date for each case count. In particular, you will need to ensure that `per_day` and `per_day_date` have the same shape. 
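
To get a feel for `np.diff` before applying it to the real data, here is a tiny sketch with made-up numbers (`cumulative` and `steps` are hypothetical names):

```python
import numpy as np

cumulative = np.array([1, 3, 6, 10])  # made-up running totals
steps = np.diff(cumulative)           # differences between neighbors

print(steps)                         # [2 3 4]
print(len(cumulative), len(steps))   # 4 3
```

The result is one element shorter than the input, which is why the shapes of `per_day` and `per_day_date` need attention.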

In [None]:
# your solution here

### Part B

Create a figure with two very wide axes, one on top of the other (i.e. two rows, one column). Use the `sharex` argument of `plt.subplots()` to ensure that these two plots will share the same horizontal axis. 

Then: 

1. On the upper axis, plot the Dow Jones Industrial Average over time. For the horizontal axis use `DJI['date']`; for the vertical axis use `DJI['Close']`. 
2. On the lower axis, plot the variables `per_day_date` and `per_day` to visualize the progress of the COVID19 pandemic over time. Use a different color for the time series. 

Give your plot horizontal and vertical axis labels. 
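
If the `sharex` layout is unfamiliar, the sketch below shows the general shape on synthetic data. The series `days`, `upper`, and `lower`, the figure size, and the color are all assumptions made up for illustration; your figure should use the real data instead.

```python
import datetime
import numpy as np
from matplotlib import pyplot as plt

# Synthetic stand-ins for two time series over the same 30 days
days = [datetime.datetime(2020, 1, 1) + datetime.timedelta(days=i)
        for i in range(30)]
upper = np.linspace(0, 1, 30)
lower = np.linspace(1, 0, 30) ** 2

# Two rows, one column; sharex=True ties the horizontal axes together
fig, ax = plt.subplots(2, 1, figsize=(12, 5), sharex=True)

ax[0].plot(days, upper)
ax[1].plot(days, lower, color="firebrick")

ax[0].set(ylabel="upper series")
ax[1].set(xlabel="date", ylabel="lower series")
```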

In [None]:
#your code here

### Part C

The command 

```python
ax[0].axvspan(datetime.datetime(2020,6,1),
              datetime.datetime(2020,6,30), 
              alpha = .3, 
              color = "gray")
```

will add a simple rectangular shade which can be used to highlight specific portions of a time-series. In the given code, this shade runs through the month of June 2020. Add at least two such rectangular shades to your figure corresponding to important time intervals. You can put two shades on one axis, or one on each. If you're not sure what time periods are important, just choose intervals at random. Feel free to modify the color and transparency as desired. You can modify your figure code from Part B -- no need for copy/paste. 

### Part D

The command 

```python
ax[0].text(datetime.datetime(2020,9,15), 
           22000, 
           "penguins?\npenguins!")
```

will add a fun text annotation to your plot, with the first letter at the horizontal position corresponding to September 15th, 2020, and at vertical position 22,000. Annotate each of your shaded regions with a few words describing their significance. Again, just modify your Part B code. 

### Part E

Add an overall title, spruce up your axis labels, and add anything else you think will make the plot look good. Again, you can just modify your Part B code, without copy/paste. 

Then, submit a job application at www.FiveThirtyEight.com and show Nate Silver your cool data visualization.  