## Discussion Question
Personal data is increasingly stored in encrypted formats managed by large tech companies. In cases where a crime has occurred, this data may be important to investigations by law enforcement. Apple has famously [refused to crack their encryption or build backdoors](https://www.theverge.com/2020/1/7/21054836/fbi-iphone-unlock-apple-encryption-debate-pensacola-ios-security) into their software to help law enforcement gain access to potential evidence, while [Google has charged law enforcement](https://www.nytimes.com/2020/01/24/technology/google-search-warrants-legal-fees.html) a fee to execute search warrants for user data. **What roles and responsibilities do these private tech companies have to both users and law enforcement when it comes to either sharing or safeguarding user data when a crime has occurred?**

# HW 5

**To make it easier for TAs to grade your work, make sure that all cells are executed so that we can see your results without having to run anything.**

Resources: Jake VanderPlas, [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/index.html)

Goals:

* Practice calculating correlation and $r^2$; visually interpret the meaning of these numbers.

* Run linear regressions with `sklearn` and understand the intuition behind linear models.

* Practice pattern matching with regular expressions.


In [1]:
import numpy
import pandas as pd
from matplotlib import pyplot
from sklearn.linear_model import LinearRegression

## Covariance, correlation, and $r^2$

In this problem you will calculate statistics of the following array. The first column will be $X$, the second column will be $Y$.

In [None]:
q1_array = numpy.array([[ 4.35994902, 10.84003387],
                       [ 0.25926232,  6.64625283],
                       [ 5.49662478, 12.70709295],
                       [ 4.35322393, 10.33309438],
                       [ 4.20367802, 10.21407197],
                       [ 3.30334821,  8.47051026],
                       [ 2.04648634,  7.93047576],
                       [ 6.19270966, 11.491082  ],
                       [ 2.99654674,  7.60715331],
                       [ 2.66827275,  9.15886831],
                       [ 6.21133833, 11.99190195],
                       [ 5.29142094, 10.98304651],
                       [ 1.34579945,  6.89008468],
                       [ 5.13578121,  9.01125354],
                       [ 1.84439866,  8.3941969 ]])

### A. Visual Display

**Problem 1 (1 point):** Display a scatter plot of your data with one point per $X,Y$ pair. Both the axes should start at zero and show all points. Set the figure size to `(5,5)` (Google this to find the correct syntax if you need to). Don't worry about labeling axes.

In [None]:
## insert solution here

**Problem 2 (1 point):** Repeat the same figure, but swap the axes (show $X$ on the vertical axis, $Y$ on the horizontal axis). Both the axes should start at zero and show all points. Set the figure size to `(5,5)`.

In [None]:
## insert solution here

**Problem 3 (1 point):** Based on the first plot, are $X$ and $Y$ positively correlated, negatively correlated, or independent?  Support your answer *qualitatively* (describe the intuition in words, not numbers).

In [None]:
## insert solution here

### B. Quantitative statistics

In the following exercises, you will write code to calculate:
* the variance of $X$ and the variance of $Y$
* the covariance of $X$ and $Y$
* the correlation between $X$ and $Y$ 
* the slope, intercept, and standard deviation of residuals for $Y$ given $X$
* the slope, intercept, and standard deviation of residuals for $X$ given $Y$
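
For reference, the quantities listed above are commonly defined as follows. This is a sketch using the divide-by-$N$ convention (an assumption; using $N-1$ instead is equally acceptable if applied consistently). The symbols $\bar{x}$, $\bar{y}$, $a$, $b$, and $\hat{y}_i$ are notation introduced here for illustration and are not used elsewhere in this notebook.

$$
\mathrm{Var}(X) = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2,
\qquad
\mathrm{Cov}(X, Y) = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})(y_i - \bar{y})
$$

$$
r = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}
$$

For the regression of $Y$ given $X$, with fitted values $\hat{y}_i = a + b\,x_i$:

$$
b = \frac{\mathrm{Cov}(X, Y)}{\mathrm{Var}(X)},
\qquad
a = \bar{y} - b\,\bar{x},
\qquad
\sigma_{\text{residuals}} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2}
$$

The regression of $X$ given $Y$ follows by swapping the roles of $X$ and $Y$ in these expressions.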

You may use `numpy` functions such as `var`, `cov`, and `corrcoef` only to *confidence check* your answers; **you must write your own code to implement these functions.** Remember that there are two ways to calculate variance: dividing by $N$ and dividing by $N-1$. You may use either, but be consistent. Use print statements with descriptive text so we can clearly see your answers.

To make your code clearer, feel free to create variables for values that you use repeatedly. You can use them across cells. **Watch out for copy-paste errors!** Show that your code outputs the same result as either numpy's or pandas' `.var()` *(Hint: numpy and pandas have slightly different default implementations of variance; one divides by n-1, the other divides by n)*
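
The difference between the two variance conventions can be seen on a small example. The sketch below uses a hypothetical toy array (not `q1_array`) and checks a hand-written calculation against `numpy.var`, whose `ddof` argument selects the divisor ($N$ when `ddof=0`, the default, and $N-1$ when `ddof=1`). The variable names are illustrative only.

```python
import numpy

# Hypothetical toy data, chosen so the arithmetic is easy to follow
values = numpy.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

n = len(values)
mean = values.sum() / n
squared_deviations = (values - mean) ** 2

# Two conventions: divide by N, or divide by N-1
population_variance = squared_deviations.sum() / n
sample_variance = squared_deviations.sum() / (n - 1)

print("variance dividing by N:   ", population_variance)
print("numpy.var with ddof=0:    ", numpy.var(values))
print("variance dividing by N-1: ", sample_variance)
print("numpy.var with ddof=1:    ", numpy.var(values, ddof=1))
```

By default, `numpy.var` uses `ddof=0`, while the pandas `.var()` method uses `ddof=1`, which is why the two libraries can report slightly different numbers for the same data.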

**Problem 4 (2 points):** Calculate the variance of $X$ and the variance of $Y$.

In [None]:
## insert solution here

**Problem 5 (2 points)**: Calculate the covariance of $X$ and $Y$.

In [None]:
## insert solution here

**Problem 6 (2 points):** Calculate the correlation between $X$ and $Y$.

In [None]:
## insert solution here

**Problem 7 (3 points):** Calculate the slope, intercept, and standard deviation of residuals for $Y$ given $X$

In [None]:
## insert solution here

**Problem 8 (3 points):** Calculate the slope, intercept, and standard deviation of residuals for $X$ given $Y$

In [None]:
## insert solution here

### C. Quantitative statistics for almost the same data

**Problem 9 (5 points):** Calculate and print the same values, but use $2Y$ (multiply $Y$ by 2)

* the variance of $X$ and the variance of $2Y$
* the covariance of $X$ and $2Y$
* the correlation between $X$ and $2Y$ 
* the slope, intercept, and standard deviation of residuals for $2Y$ given $X$
* the slope, intercept, and standard deviation of residuals for $X$ given $2Y$

Remember to double check your answers using the standard package functions!

In [None]:
## insert solution here

**Problem 10 (6 points):** For each of the quantities, state whether it is different from or the same as when you used $X$ and $Y$. For each, explain why or why not.

In [None]:
## insert solution here

### D. Again, but in `pandas`

**Problem 11 (6 points):** Create a `pandas` data frame named `q1_dataframe` with two columns, `x` and `y`, with values equal to the columns of `q1_array`.

Use `sklearn` to calculate a linear regression for `y` given `x`. Save the `LinearRegression` object as `q1_model`. Print the values of the regression (slope and intercept).

Create an array `predictions` using the `predict` function of `q1_model`. Calculate the sum of the squared differences between $Y$ and `predictions` as `ssr` (sum of squared residuals). Calculate the sum of the squared differences between $Y$ and the mean of $Y$ as `sst` (sum of squares, total). Print `ssr` and `sst`.

Calculate $1 - \frac{SSR}{SST}$ and print it.

Use the `score` function of `q1_model` to calculate and print the $r^2$ score. (Confidence check: this should equal the previous value.)
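
To see why $1 - \frac{SSR}{SST}$ matches the squared correlation for a simple linear fit, here is a minimal sketch on hypothetical toy data. It swaps in `numpy.polyfit` for the line fit instead of `sklearn`, purely to keep the example dependency-light; the homework itself should still use `LinearRegression`.

```python
import numpy

# Hypothetical toy data (not q1_array)
x = numpy.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = numpy.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

# Least-squares line; polyfit returns the highest-degree coefficient first
slope, intercept = numpy.polyfit(x, y, 1)
predicted = slope * x + intercept

ssr = numpy.sum((y - predicted) ** 2)   # sum of squared residuals
sst = numpy.sum((y - y.mean()) ** 2)    # total sum of squares
r_squared = 1 - ssr / sst

# Squared Pearson correlation for comparison
r = numpy.corrcoef(x, y)[0, 1]

print("1 - SSR/SST:", r_squared)
print("r squared:  ", r ** 2)
```

The two printed values should agree up to floating-point rounding, which is the same confidence check the problem asks for with `score`.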

In [None]:
## insert solution here

### E. Commentary

**Problem 12 (2 points):** Based on your *quantitative* analysis of this data set, argue whether or not there is evidence of a linear relationship between $X$ and $Y$.

In [None]:
## insert solution here

## More data and correlation

In these problems, you will create paired data which exhibits certain statistical properties.

Write each array in the `numpy.array(...)` format used for `q1_array` at the very beginning.

The coefficient of determination $r^2$, or r-squared value, is a measure of model fit which encodes the extent to which a model can explain variability in a given output variable (phenomenon). It is the square of the correlation coefficient you calculated in the previous problem. It can often be interpreted as the percent of variability in outcome behavior explained by the model. It is useful but has limitations.

Each array should have two columns and at least eight (8) rows.

### A. Perfect correlation

**Problem 13 (1 point):** Create an array with two columns with coefficient of determination 1.0.

In [None]:
## insert solution here

### B. Perfect correlation is not unique

**Problem 14 (2 points):** Create an array with the same $X$ values from your answer to Problem 13 but different $Y$ values, such that the coefficient of determination is also 1.0.

In [None]:
## insert solution here

### C. Imperfect correlation, $r^2 = 0.25$

**Problem 15 (3 points):** Create an array with coefficient of determination 0.25 (+/- 0.01) by adding noise. First create $X$. Then create a new array `noise` of the same length using `numpy.random.normal()` with mean zero and standard deviation 1.0.

In [None]:
## insert solution here

Set a variable `scale`, initially equal to 1.0. Let $Y$ equal $X$ plus `scale` times `noise` and calculate $r^2$. If $r^2$ is not within 0.01 of 0.25, change the value of `scale` and repeat. Print the array with your $X$ and $Y$ values.
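
The effect of `scale` on $r^2$ can be previewed with a rough sketch. The code below is an illustration under assumptions: it uses a seeded `numpy.random.default_rng` generator (rather than `numpy.random.normal()` directly) so the output is repeatable, a much longer hypothetical $X$ than the homework requires, and a hypothetical helper `r_squared`. It does not search for the target value; it only shows the direction of the trend.

```python
import numpy

# Seeded generator so the illustration is repeatable (an assumption, not required)
rng = numpy.random.default_rng(0)

# Hypothetical X with many points so the trend is easy to see
x = numpy.linspace(0, 10, 200)
noise = rng.normal(0.0, 1.0, size=x.shape)


def r_squared(a, b):
    """Squared Pearson correlation between two 1-D arrays."""
    return numpy.corrcoef(a, b)[0, 1] ** 2


results = {}
for scale in (0.1, 1.0, 10.0):
    y = x + scale * noise
    results[scale] = r_squared(x, y)
    print(f"scale={scale:5.1f}  r^2={results[scale]:.3f}")
```

Larger values of `scale` bury the linear signal in noise and push $r^2$ toward 0, while smaller values push it toward 1, so adjusting `scale` up or down moves $r^2$ in a predictable direction.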

In [None]:
## insert solution here

### D. $r^2 = 0$, random

**Problem 16 (2 points):** Create an array with two columns and eight rows, with coefficient of determination less than or equal to 0.01, with no visible pattern.

In [None]:
## insert solution here

**Problem 17 (2 points):** Use sklearn's `LinearRegression().fit()` to calculate a "line of best fit" for the data points you created in **Problem 16**. Plot the data points and the line. (*Hint:* to plot the linear model, you can plot the predicted y_values against the original x_values. To find the predicted y_values, call the fitted model's `.predict()` method, passing the original x_values as the argument.)

In [None]:
## insert solution here

### E. $r^2 = 0$, not random

**Problem 18 (3 points):** Create an array $X$ and then generate an array $Y$ that is a deterministic function of $X$ *and* with coefficient of determination less than or equal to 0.01. In other words, given $X$, you know *exactly* what $Y$ will be, but the $r^2$ value is still close to 0.

In [None]:
## insert solution here

**Problem 19 (2 points):** Again, use sklearn's `LinearRegression().fit()` to calculate a "line of best fit" for the data you created in **Problem 18**. Plot the data points and the line.

In [None]:
## insert solution here

## Pattern Matching with Regular Expressions

**Problem 20 (4 points):** The following string `dynamite` contains the lyrics to BTS's Dynamite. Use regex pattern matching and `re.sub` to replace each hyphenated word with the word "banana", then print out the last twelve lines. You can use [regex101](https://regex101.com/) to help test your regex pattern (*hint: make sure you select the Python flavor*). Also, [here](https://www.debuggex.com/cheatsheet/regex/python) is a helpful list of regular expressions.
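
As a reminder of how `re.sub` behaves, here is a small sketch on a hypothetical sentence with a different pattern (runs of digits rather than hyphenated words). Every non-overlapping match of the pattern is replaced by the replacement string.

```python
import re

# Hypothetical example text, unrelated to the lyrics
text = "Room 12 is on floor 3, next to room 140."

# Replace each run of one or more digits with the letter N
masked = re.sub(r"\d+", "N", text)

print(masked)  # Room N is on floor N, next to room N.
```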

The last 12 lines (or approximately 545 characters) should look like this:

```
So watch me bring the fire and set the night alight 
 Shining through the city with a little funk and soul 
 So Ima light it up like dynamite 
 (This is ah) 
 Cause, banana, Im in the stars tonight 
 So watch me bring the fire and set the night alight 
 Shining through the city with a little funk and soul 
 So Ima light it up like dynamite, whoa 
 banana, banana, banana, banana, life is dynamite 
 banana, banana, banana, banana, life is dynamite 
 Shining through the city with a little funk and soul 
 Ima light it up like dynamite, whoa
```


In [1]:
import re
dynamite = "Cause, ah-ah, Im in the stars tonight \n So watch me bring the fire and set the night alight \n Shoes on, get up in the morn \n Cup of milk, lets rock and roll \n King Kong, kick the drum, rolling on like a rolling stone \n Sing song when Im walking home \n Jump up to the top, LeBron \n Ding dong, call me on my phone \n Ice tea and a game of ping pong \n This is getting heavy \n Can you hear the bass boom? Im ready \n Life is sweet as honey \n Yeah, this beat cha ching like money \n Disco overload, Im into that, Im good to go \n Im diamond, you know I glow up \n Hey, so lets go \n Cause, ah-ah, Im in the stars tonight \n So watch me bring the fire and set the night alight \n Shining through the city with a little funk and soul \n So Ima light it up like dynamite, whoa \n Bring a friend, join the crowd \n Whoever wanna come along \n Word up, talk the talk, just move like we off the wall \n Day or night the skys alight \n So we dance to the break of dawn \n Ladies and gentlemen, I got the medicine, so you should keep ya eyes on the ball, huh \n This is getting heavy \n Can you hear the bass boom? Im ready \n Life is sweet as honey \n Yeah, this beat cha ching like money \n Disco overload, Im into that, Im good to go \n Im diamond, you know I glow up \n Lets go \n Cause, ah-ah, Im in the stars tonight \n So watch me bring the fire and set the night alight \n Shining through the city with a little funk and soul \n So Ima light it up like dynamite, whoa \n Dy-na-na-na, na-na, na-na-na, na-na, life is dynamite \n Dy-na-na-na, na-na, na-na-na, na-na, life is dynamite \n Shining through the city with a little funk and soul \n So Ima light it up like dynamite, whoa \n Dy-na-na-na, na-na, na-na, eh \n Dy-na-na-na, na-na, na-na, eh \n Dy-na-na-na, na-na, na-na, eh \n Light it up, dynamite \n Dy-na-na-na, na-na, na-na, eh \n Dy-na-na-na, na-na, na-na, eh \n Dy-na-na-na, na-na, na-na, eh \n Light it up, dynamite \n Cause, ah-ah, Im in the stars tonight \n So watch me bring the fire and set the night alight \n Shining through the city with a little funk and soul \n So Ima light it up like dynamite \n (This is ah) \n Cause, ah-ah, Im in the stars tonight \n So watch me bring the fire and set the night alight \n Shining through the city with a little funk and soul \n So Ima light it up like dynamite, whoa \n Dy-na-na-na, na-na, na-na-na, na-na, life is dynamite \n Dy-na-na-na, na-na, na-na-na, na-na, life is dynamite \n Shining through the city with a little funk and soul \n Ima light it up like dynamite, whoa"
print(dynamite[-551:])


So watch me bring the fire and set the night alight 
 Shining through the city with a little funk and soul 
 So Ima light it up like dynamite 
 (This is ah) 
 Cause, ah-ah, Im in the stars tonight 
 So watch me bring the fire and set the night alight 
 Shining through the city with a little funk and soul 
 So Ima light it up like dynamite, whoa 
 Dy-na-na-na, na-na, na-na-na, na-na, life is dynamite 
 Dy-na-na-na, na-na, na-na-na, na-na, life is dynamite 
 Shining through the city with a little funk and soul 
 Ima light it up like dynamite, whoa


In [None]:
## insert solution here

**Problem 21 (3 points):** In the string `dynamite`, use regex pattern matching and `re.findall()` to find every four-letter word that starts with a capital letter (examples: "King," "Kong," and "Ding"). Print the list of words. (*Hint: [word boundaries](https://docs.python.org/3/library/re.html#regular-expression-syntax) could be useful here*)
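
To illustrate what word boundaries do, the sketch below runs a different pattern (three lowercase letters) on a hypothetical sentence, with and without `\b`. Without boundaries, the pattern also matches fragments inside longer words.

```python
import re

# Hypothetical example sentence, unrelated to the lyrics
sentence = "the cat sat on a mat today"

# \b anchors the match at the edge of a word, so only whole 3-letter words match
with_boundaries = re.findall(r"\b[a-z]{3}\b", sentence)

# Without \b, the first three letters of "today" also match
without_boundaries = re.findall(r"[a-z]{3}", sentence)

print(with_boundaries)     # ['the', 'cat', 'sat', 'mat']
print(without_boundaries)  # ['the', 'cat', 'sat', 'mat', 'tod']
```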

In [None]:
## insert solution here

**Problem 22 (2 points):** In the string `dynamite` use regex pattern matching and `re.findall()` to find each word immediately following the word "Im" then print this list of words.
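
When a pattern contains a single capturing group, `re.findall()` returns only the text captured by that group rather than the whole match, which is handy for extracting whatever follows a fixed prefix. A minimal sketch on a hypothetical string:

```python
import re

# Hypothetical example text, unrelated to the lyrics
log = "id=7; name=ada; id=42"

# With a capturing group, findall returns just the captured digits
ids = re.findall(r"id=(\d+)", log)

# Without a group, findall returns each full match
full_matches = re.findall(r"id=\d+", log)

print(ids)           # ['7', '42']
print(full_matches)  # ['id=7', 'id=42']
```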

In [None]:
## insert solution here