# INFO 2950 Homework 3

So far, we've been looking within one variable at a time: population, avocado price, etc. In this homework, we're going to start looking at ways of quantifying the relationship *between* two variables.

**No problem in this homework will involve `for` loops. Use only methods that operate on pandas DataFrames or Series.** (Series are one-dimensional arrays and often how DataFrame columns are represented when extracted.) These custom methods are written to operate efficiently on pandas objects, and are generally more computationally efficient than `for` loops.

## Part 1: Discussion exercises

### Python Functions

A Python function is a set of pre-defined code which is programmed for a certain purpose. To decrease code repetition (especially copying and pasting, which can introduce unwanted bugs), we write functions to perform repeated tasks. You can pass data and other arguments into a function through its "parameters". A function will run its internal code and can (optionally) output objects using a `return` statement. A function will only execute when it's explicitly called. Indentation in the function definition is essential.

Here is an example of a simple function that prints the value passed in as its first (and only) parameter (called `print_content` locally, within the function definition).

In [None]:
def print_sth(print_content):
    '''
    param print_content: a string parameter. This is what you want to print
    '''
    print(print_content)
    
print_sth("Homework 3")    

While this function prints a statement, it doesn't actually return anything:

In [None]:
print("type returned: " + str(type(print_sth("test"))))

The comment between the `'''` symbols in the function definition (called a "docstring") tells the user what the parameter `print_content` should be. This syntax can automatically generate documentation for your own functions. It is very important to document any code you bother spending time to convert to a function, because you may want to use it again or share your code with others. It's helpful to write notes about how the function can be used and what it does.
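As a small illustration of how that documentation is used, Python stores the docstring on the function object, where `help()` and the `__doc__` attribute can retrieve it. The function `rectangle_area` below is a hypothetical example and not part of the homework.

```python
def rectangle_area(width, height):
    '''
    param width: a number. The width of the rectangle
    param height: a number. The height of the rectangle
    return: a number, the area of the rectangle
    '''
    return width * height

# the docstring is stored on the function object itself
print(rectangle_area.__doc__)

# help() displays the signature together with the docstring
help(rectangle_area)
```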

Here is another function: this one takes *two* arguments and actually returns an object, all of which is documented in the comment at the top of the function definition. 

In [None]:
def course(dept,classcode):
    '''
    param dept: a string. It's the department code. 
    param classcode: an integer. It's the course number
    return: a string value which combines both department code and course number
    '''
    return dept+str(classcode)

output = course("INFO",2950)
print(output)
print("type returned: " + str(type(output)))

### String formatting

We often want to construct and print strings that include the values of calculated variables; it's good practice to add context to any values you print in your notebooks, so others reading them know what the number you're printing is. For instance, it's clearer to print `mean price: $27.80` instead of just `27.8`.
    
You can concatenate strings with the `+` operator, but you can't merge a string and a number without doing some type conversion (like we did above with `str(type(output))`). Sometimes there are multiple ways to display a variable, such as a float with either 2 or 3 decimal places. All strings have a method `.format()` that allows you to construct strings with placeholders where variables get inserted and to specify how the variables should appear.

We start by creating a string *template* (as a string). We insert placeholders `{}` into the string template where we want variable values to appear. Then for each of these placeholders, we include the corresponding variable as an argument to `.format()` (in the desired order of appearance in the string template). For each of these values, Python will convert the value to a string and insert it in the corresponding placeholder. You can also specify how you want a value to appear. To format a value as a 4-digit decimal integer with leading zeros, use `{:04d}`. To round a float to two decimal places, use `{:.2f}`. See [the documentation](https://docs.python.org/3.8/library/string.html#formatspec) for other options.

Python also supports an older string format style using the `%` operator, which we prefer you do *not* use. There is also a newer method called f-strings that you may use. You may be familiar with `.format()` in the context of `print()` statements, but it's really a function of strings, not printing strings.

In [None]:
"this is my {} string".format("favorite")

In [None]:
"The letter {} has Unicode codepoint {:d} (as an integer), which is {:x} in hexadecimal and {:08b} in binary".format("M", ord("M"), ord("M"), ord("M"))
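The format specifications described above can be tried out on some made-up values; the variable names and numbers here are purely illustrative. The last statement shows the f-string equivalent of the first `.format()` call.

```python
mean_price = 27.8   # hypothetical value
student_id = 42     # hypothetical value

# {:.2f} rounds a float to two decimal places
price_text = "mean price: ${:.2f}".format(mean_price)
print(price_text)

# {:04d} pads an integer with leading zeros to four digits
id_text = "student id: {:04d}".format(student_id)
print(id_text)

# the same price string written as an f-string
price_fstring = f"mean price: ${mean_price:.2f}"
print(price_fstring)
```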

### Correlation and Causation

We often want to measure the relationship between two variables because we want to know whether the value of one factor *causes* another factor to have a certain value. Usually this is interesting because there may be one factor we care about but cannot directly control, and another factor that we can control, but we don't necessarily care about in and of itself. For example, I don't care about the number on my thermostat for itself, I care about it because it has a causal effect on the temperature in the house.

**1. Describe a situation where one factor, which we can observe and control, influences the value of another factor. Without using specific quantitative measurements, describe how strong you consider this relationship to be.**

**2. Describe a situation where one factor does not *influence* another factor, but nevertheless allows you to *predict* the value of that second factor. What would you need to do to distinguish between this situation and the previous situation?**

Correlation does not necessarily imply causation. Most of the statistical methods we will study can only show correlation, though there exist careful experimental designs that can enable [causal inference](https://en.wikipedia.org/wiki/Causal_inference). 
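To see how a strong correlation can arise without one variable influencing the other, here is a minimal simulation sketch. All names and numbers are made up: a hidden factor drives two observed variables that never affect each other directly.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)   # fixed seed so the sketch is repeatable
n = 1000

hidden = rng.normal(size=n)      # unobserved common cause (hypothetical)
observed_a = pd.Series(hidden + 0.5 * rng.normal(size=n))
observed_b = pd.Series(hidden + 0.5 * rng.normal(size=n))

# observed_a never appears in the formula for observed_b (or vice versa),
# yet the two are strongly correlated because they share a cause
print("correlation: {:.2f}".format(observed_a.corr(observed_b)))
```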

### Beyond Correlation/Causation

You may be familiar with the previous discussion. In fact, mean, variance, and "correlation is not causation" are about the only things that we can reliably assume that everyone learns in a statistics class. But there's another problem that we face, which can be subtler and more dangerous.

Consider a system that predicts creditworthiness. The rows in the data table will correspond to people, and it is extremely easy to think of what you are doing as classifying *people* as creditworthy or not. But, as Princeton sociologist Ruha Benjamin [has pointed out](https://www.goodreads.com/en/book/show/42527493-race-after-technology), almost all of the actual variables are describing a person's *situation*. Sometimes situations can change quickly, as we all saw this year, and sometimes they can be nearly impossible to escape. 

**3. The dataset we will look at in this homework is about the educational achievements of kindergarteners. It includes demographic information, as you can see from the data description file (**`Data description ECLS_R7.pdf`**). What do you think that you should be able to say about these children based on these measurements, and what can you not say?**


---

## Problem 1 (9 pts)

Write down your thoughts about the three discussion questions. Ensure your answer is moderately detailed to get full marks.

---

(insert your answer for the first question here)

(insert your answer for the second question here)

(insert your answer for the third question here)

## Part 2: Correlation and Covariance
We will investigate data collected as part of the Early Childhood Longitudinal Studies program, which seeks to understand childhood development from birth through elementary school in the United States. The main dataset we will work with quantifies [the development of kindergarteners in 2010-2011](https://nces.ed.gov/ecls/kindergarten2011.asp) and is provided as `ECLS_R7.csv`. You may find it helpful to refer to the data description file (also sometimes referred to as a "codebook"), `Data description ECLS_R7.pdf`, which defines the variables in the data. While there is a lot of information in this file, we will only explore a few variables in this homework.

In [None]:
## load libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## load data
education_data = pd.read_csv("ECLS_R7.csv")

---

## Problem 2 (4 pts)
What does the variable `X7MTHETK4` measure? Calculate and print the mean and median of this variable. (Make sure to use `.format()` to round all printed floats to two decimal places.)

**Hint:** We will use the columns `X7MTHETK4`, `X7STHETK4`, and `X7RTHETK4` throughout this homework. You may find it helpful to rename these columns now (e.g. using the [`.rename()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html) dataframe method).
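As a sketch of how `.rename()` works, the toy DataFrame below uses made-up column names (`Q1RAW` and `Q2RAW` are hypothetical and not in the ECLS data). The `columns` argument takes a dictionary mapping old names to new ones. The method returns a new DataFrame, so the result is reassigned here.

```python
import pandas as pd

# toy data with hypothetical column names
toy = pd.DataFrame({"Q1RAW": [1.0, 2.0], "Q2RAW": [3.0, 4.0]})

# map old column names to new, more readable ones
toy = toy.rename(columns={"Q1RAW": "first_score", "Q2RAW": "second_score"})
print(toy.columns.tolist())
```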

---

Your answer: 

In [None]:
#Your code:

---

## Problem 3 (4 pts)

Calculate and print the 1st quartile and the 95th percentile of the science score (original column name: `X7STHETK4`) using the [`.quantile()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.quantile.html) Series method. How are quartiles and percentiles related?


**Hint:** What are ["quantiles", "quartiles", and "percentiles"](https://www.statsdirect.com/help/nonparametric_methods/quantiles.htm#:~:text=Quantiles%20are%20points%20in%20a,of%20values%20in%20that%20distribution.&text=Centiles%2Fpercentiles%20are%20descriptions%20of,sorted%20values%20of%20a%20sample)?
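A quick sketch of the `.quantile()` call on a made-up Series: the argument is a fraction between 0 and 1, so the median is the 0.5 quantile. Passing a list of fractions returns a Series with one value per fraction.

```python
import pandas as pd

toy_scores = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])  # made-up values

# the 0.5 quantile is the median
middle = toy_scores.quantile(0.5)
print("0.5 quantile: {:.2f}".format(middle))
print("median:       {:.2f}".format(toy_scores.median()))

# several quantiles at once
several = toy_scores.quantile([0.1, 0.5, 0.9])
print(several)
```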

---

In [None]:
# Your code:

Your answer: 

---
## Problem 4 (2 pts)

Calculate and print the mean and standard deviation of the students' reading scores (original column name: `X7RTHETK4`). Note that this variable measures "reading", not "reaching" as written in the data description file. Save these two values to Python variables (you will find these variables useful in the next problem).

In [None]:
#Your code:

---

## Problem 5 (10 pts)

Suppose we want to count the number of observations for which a variable falls in a given range. Write a [Python function](https://www.w3schools.com/python/python_functions.asp) `count_within_range(sub_data, lower, upper)` that takes three arguments: the subset data for calculation (as a pandas Series), and the lower and upper bounds (as numbers).

In the body of your function:
1. use the [`.between()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.between.html) subsetting method to select the observations in the given range
2. calculate and print the **minimum** and **maximum** values of the range, **the count of observations** within the range, and **the percentage (%) of all observations** that are found in the range.

The function should not return anything. And be sure to label each printed number and use [`.format()`](https://mkaz.blog/code/python-string-format-cookbook/) to round all printed floats to two decimal places (integers should have zero decimal places). 

Call this function for the reading score values with the following ranges:

* *one standard deviation below the mean* to *the mean*
* *the mean* to *one standard deviation above the mean*

Based on these results, do you think the distribution of this variable is *symmetric*? Explain.
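The `.between()` method returns a boolean Series, which can be used to subset and then count. A minimal sketch on made-up values (this is only the building block, not the requested function): by default, both endpoints are included in the range.

```python
import pandas as pd

toy_values = pd.Series([1, 2, 3, 4, 5])  # made-up values

# True where 2 <= value <= 4 (both endpoints included by default)
in_range = toy_values.between(2, 4)
print(in_range.tolist())

# a boolean Series selects the matching observations
selected = toy_values[in_range]
print("count in range: {:d}".format(len(selected)))
```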

---

In [None]:
#define your function


In [None]:
#call your function


Your answer:

---

## Problem 6 (4 pts)

Make a scatter plot [with matplotlib](https://matplotlib.org/3.3.1/api/_as_gen/matplotlib.pyplot.scatter.html) to show the distribution of data points in two dimensions (variables): math score on the x-axis and science score on the y-axis. Make sure to set the parameter `alpha` to 0.3 in `.scatter()` to control the dot opacity, and set the axis labels to *Math Score* and *Science Score*.

Describe the trend in the relationship between these two variables, as revealed by the scatterplot.

---

In [None]:
#Your code

Your answer:

---

## Problem 7 (6 pts)

Measure the strength of the association observed in the scatterplot from the previous problem by calculating and printing the **covariance** and (Pearson) **correlation** between the math and science scores. Use the [`.cov()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.cov.html) and [`.corr()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.corr.html) pandas Series methods.

Create two new Series:
* `math_times_ten` = the math score column times 10,
* `science_times_ten` = the science score column times 10.

Calculate and print the covariance and correlation between these two new series and describe the effect of scaling the original data by 10 on the covariance and correlation.
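For reference, the Pearson correlation is the covariance divided by the product of the two standard deviations. The sketch below checks this identity on two made-up Series; the names and values are illustrative only.

```python
import pandas as pd

x = pd.Series([1, 2, 3, 4, 5])   # made-up values
y = pd.Series([2, 4, 5, 4, 5])   # made-up values

covariance = x.cov(y)
correlation = x.corr(y)

# rebuild the correlation from the covariance and the standard deviations
by_hand = covariance / (x.std() * y.std())

print("covariance: {:.2f}".format(covariance))
print("correlation: {:.2f}".format(correlation))
print("cov / (sd_x * sd_y): {:.2f}".format(by_hand))
```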

---

In [None]:
#Your code:

Your answer:

---

## Problem 8 (6 pts)

Calculate and print the mean and standard deviation of the math and science scores. Save these values to Python variables.

Select the students whose **science score** is between its mean and one standard deviation above the mean. Calculate and print the mean and the standard deviation of the **math scores** for this subset of students.

For the math scores, how do the mean and standard deviation change when we go from looking at all students in the dataset to just those with a science score within one standard deviation above its mean? How does your observation relate to the correlation score between these two variables?

---

In [None]:
#Your code:

Your answer:

---

## Problem 9 (9 pts)

Group the data by school (indicated in the ID column `'S7_ID'`) and calculate the mean math and science scores for each school. Subset the data, keeping only schools whose average math *and* science scores are both strictly greater than 3. Save this dataframe as a variable called `school_mean_subset`. Print the number of rows in `school_mean_subset`.

**Hint:** When subsetting the mean math and science scores by school to satisfy two conditions, there are [several ways to accomplish this task](https://kanoki.org/2020/01/21/pandas-dataframe-filter-with-multiple-conditions/). Pick your favorite. And [`.reset_index()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reset_index.html) is a method you may need if you want to reset the index column of a groupby result.
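One possible pattern, sketched on a toy DataFrame with made-up column names (`group`, `u`, `v`) and made-up thresholds: compute group means, then combine two boolean conditions with `&`. Each condition needs its own parentheses because `&` binds more tightly than `>`.

```python
import pandas as pd

# toy data with hypothetical column names
toy = pd.DataFrame({
    "group": ["a", "a", "b", "b", "c", "c"],
    "u": [1, 3, 5, 7, 2, 2],
    "v": [4, 6, 1, 3, 8, 10],
})

# mean of u and v within each group; reset_index() turns "group" back into a column
group_means = toy.groupby("group")[["u", "v"]].mean().reset_index()

# keep only the groups that satisfy both conditions
both = group_means[(group_means["u"] > 1.5) & (group_means["v"] > 4)]
print(both)
print("rows kept: {:d}".format(len(both)))
```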

**Confidence check:** You should have 55 schools whose math and science scores are both strictly greater than 3.

Calculate and print the standard (Pearson) correlation and the rank (Spearman) correlation between the math and science scores in the subset data. The [`.corr()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.corr.html) method computes the Pearson correlation by default, but you can get the Spearman correlation by explicitly specifying the `method` parameter (check the docs).

Make a scatterplot with the average math scores on the x-axis, and the average science scores on the y-axis. Discuss any patterns you see in the scatterplot, and connect your observations to the two correlation scores.

---

In [None]:
#Your code:

Your answer:

---

## Problem 10 (8 pts)

Use the [`.rank()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.rank.html) Series method to add columns called `'math_rank'` and `'sci_rank'` to `school_mean_subset`; these columns should contain the rank of the average math and science scores (by school), respectively. Print the first few rows of these new columns.
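A small sketch of what `.rank()` returns, on made-up values: by default the smallest value gets rank 1, and tied values share the average of the ranks they would occupy.

```python
import pandas as pd

toy_means = pd.Series([10.0, 20.0, 20.0, 40.0])  # made-up values

# the two tied values would occupy ranks 2 and 3, so each gets 2.5
ranks = toy_means.rank()
print(ranks.tolist())
```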

**Confidence check:** For the school with `S7_ID==1022`, the mean math score rank is 28 and the mean science score rank is 15.

Calculate the Pearson and Spearman correlation between the two rank variables and make a scatterplot with `math_rank` on the x-axis and `sci_rank` on the y-axis.

Compare the correlations computed in this problem with those computed in the previous problem.

---

In [None]:
#Your code:

Your answer: