## Codio Activity 2.4: Measures of Center and Spread with `pandas`

**Expected Time: 60 Minutes**

**Total Points: 20**

Common measures of central tendency are the mean, median, and mode; variance and standard deviation measure spread. We will explore the mode in later modules. This assignment focuses on using pandas to compute numerical summaries of data. You will load a dataset and compute the mean, median, variance, and standard deviation of numeric features in the data. You will also examine the spread of the data using quartiles and the interquartile range, and identify outliers using the $1.5 \times IQR$ rule.


## Index:

- [Problem 1](#Problem-1:-Load-the-Data)
- [Problem 2](#Problem-2:-Base-Salary-Distribution-Mean)
- [Problem 3](#Problem-3:-Base-Salary-Distribution-Median)
- [Problem 4](#Problem-4:-Determining-Quartiles)
- [Problem 5](#Problem-5:-Interquartile-Range)
- [Problem 6](#Problem-6:-Rule-for-Outliers)
- [Problem 7](#Problem-7:-Dropping-Outliers)
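- [Problem 8](#Problem-8:-Finding-the-Mean-with-no-Outliers)
- [Problem 9](#Problem-9:-Finding-the-Median-with-no-Outliers)
- [Problem 10](#Problem-10:-Standard-Deviation-with-numpy)
- [Problem 11](#Problem-11:-Standard-Deviation-with-pandas)
- [Problem 12](#Problem-12:-Measure-of-Center)
- [Problem 13](#Problem-13:-Measure-of-Spread)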

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

### The Data


![](images/payroll.png)
The dataset for this assignment comes from New York City's Open Data and contains citywide payroll data. Below, you will load the data from a `.csv` file.

[Back to top](#Index:) 

### Problem 1: Load the Data

**1 Point**

Below, load the data file located in the `data` folder named `nyc_salaries.csv`.  Assign your DataFrame to `nyc_salary_data`.
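As a minimal sketch of how `pd.read_csv` builds a DataFrame, the example below reads a tiny in-memory CSV rather than the assignment file. The `agency` column and all values are made up for illustration; `toy_df` is a hypothetical name.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a file on disk.
csv_text = "agency,base_salary\nA,50000\nB,62000\nC,58000\n"

# read_csv accepts a file path or any file-like object.
toy_df = pd.read_csv(io.StringIO(csv_text))

print(toy_df.shape)    # (rows, columns)
print(toy_df.dtypes)   # inferred type of each column
```

Passing a path string instead of the `StringIO` object reads from disk in the same way.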

In [None]:
###GRADED

nyc_salary_data = ''

# YOUR CODE HERE
raise NotImplementedError()

### Examine the Data

Now that the data is loaded, use the `.info()` method to see the datatypes and missing values, and the `.head()` method for a quick look at the first five rows.

In [None]:
nyc_salary_data.info()

In [None]:
nyc_salary_data.head()

[Back to top](#Index:) 

### Problem 2: Base Salary Distribution Mean

**1 Point**

Use pandas to determine the mean of the `base_salary` column.  Assign your answer to the variable `ans_2` below.

In [None]:
nyc_salary_data['base_salary'].hist()

In [None]:
###GRADED

ans_2 = ''

# YOUR CODE HERE
raise NotImplementedError()

[Back to top](#Index:) 

### Problem 3: Base Salary Distribution Median


**1 Point**


Use the pandas `.median()` method to find the median of the `base_salary` column. Assign your answer to the variable `ans_3` below.
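To see why the mean and median can disagree, here is a small sketch on made-up numbers (not the payroll data). One large value pulls the mean upward while the median stays near the bulk of the data. The names `toy`, `toy_mean`, and `toy_median` are hypothetical.

```python
import pandas as pd

# Made-up values with one unusually large entry.
toy = pd.Series([40, 45, 50, 55, 300])

toy_mean = toy.mean()      # sum of the values divided by their count
toy_median = toy.median()  # middle value once the data is sorted

print(toy_mean, toy_median)
```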

In [None]:
###GRADED

ans_3 = ''

# YOUR CODE HERE
raise NotImplementedError()

[Back to top](#Index:) 

### Problem 4: Determining Quartiles

**2 Points**

The first quartile is found by arranging the data from least to greatest and determining the median of the lower half. The third quartile is the median of the upper half. Pandas has a `.quantile()` method that accepts either a single quantile or a list of quantiles to compute. Use the `.quantile()` method to determine the first and third quartiles of the `base_salary` column. Assign your answers to the variables `first_quartile` and `third_quartile` below.
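The sketch below shows both calling styles of `.quantile()` on a made-up Series; the variable names are hypothetical. Note that pandas interpolates linearly between data points by default, so its quartiles can differ slightly from the median-of-halves description on small datasets.

```python
import pandas as pd

# Made-up data: the integers 1 through 9.
toy = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])

# A single quantile returns a scalar.
toy_q1 = toy.quantile(0.25)
toy_q3 = toy.quantile(0.75)

# A list of quantiles returns a Series indexed by the requested quantiles.
toy_both = toy.quantile([0.25, 0.75])

print(toy_q1, toy_q3)
print(toy_both)
```

Here the median of the lower half (1, 2, 3, 4) would be 2.5, while the default linear interpolation in pandas places the first quartile at 3.0.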

In [None]:
###GRADED

first_quartile = ''
third_quartile = ''

# YOUR CODE HERE
raise NotImplementedError()

[Back to top](#Index:) 

### Problem 5: Interquartile Range

**1 Point**

The Interquartile Range (IQR) is the distance between the first and third quartiles.

$$\text{IQR} = Q_3 - Q_1$$

Use your work from problem 4 to determine the IQR for the `base_salary` column and assign it to the `iqr` variable below.

In [None]:
###GRADED

iqr = ''

# YOUR CODE HERE
raise NotImplementedError()

[Back to top](#Index:) 

### Problem 6: Rule for Outliers

**2 Points**


We can use the IQR to determine boundaries for outliers. One common rule of thumb is that any value more than $1.5 \times IQR$ above the third quartile or below the first quartile is considered an outlier. Determine the lower and upper boundaries for outliers using this rule, and assign your solutions to the variables `lower` and `upper` below.
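The arithmetic behind the rule can be sketched on made-up numbers as follows. The data and the `toy_`-prefixed names are hypothetical and chosen only so the result is easy to check by hand.

```python
import pandas as pd

# Made-up data with one suspiciously large value.
toy = pd.Series([10, 12, 13, 15, 16, 18, 20, 22, 100])

toy_q1 = toy.quantile(0.25)
toy_q3 = toy.quantile(0.75)
toy_iqr = toy_q3 - toy_q1

# Boundaries sit 1.5 IQRs beyond each quartile.
toy_lower = toy_q1 - 1.5 * toy_iqr
toy_upper = toy_q3 + 1.5 * toy_iqr

print(toy_lower, toy_upper)
```

With these values the value 100 lies above the upper boundary, so the rule flags it as an outlier.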

In [None]:
###GRADED

lower = ''
upper = ''

# YOUR CODE HERE
raise NotImplementedError()

[Back to top](#Index:) 

### Problem 7: Dropping Outliers


**1 Point**


With the upper and lower boundaries in place, we want to drop outliers from the dataset. Use pandas to keep only the rows whose `base_salary` falls within the boundaries from the $1.5 \times IQR$ rule. Assign the resulting DataFrame to `salaries_no_outlier` below.
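One way to filter rows is with a boolean mask, sketched here on a made-up DataFrame. The `value` column, the boundary values, and the `toy_` names are assumptions for illustration; `Series.between` is one of several ways to build the mask and includes both endpoints by default.

```python
import pandas as pd

# Made-up DataFrame with a single numeric column.
toy_df = pd.DataFrame({"value": [10, 12, 13, 15, 16, 18, 20, 22, 100]})

# Boundaries assumed to have been computed already for this toy data.
toy_lower, toy_upper = 2.5, 30.5

# True for rows inside the boundaries, False for outliers.
in_range = toy_df["value"].between(toy_lower, toy_upper)

# Indexing with the mask keeps only the rows marked True.
toy_kept = toy_df[in_range]

print(len(toy_df), len(toy_kept))
```

An equivalent mask can be written with comparison operators, for example `(toy_df["value"] >= toy_lower) & (toy_df["value"] <= toy_upper)`.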

In [None]:
###GRADED

salaries_no_outlier = ''

# YOUR CODE HERE
raise NotImplementedError()

[Back to top](#Index:) 

### Problem 8: Finding the Mean with no Outliers

**1 Point**

Now that the outliers have been removed, determine the mean of the `base_salary` column in `salaries_no_outlier`. Assign your answer to `mean_no_outliers` below.

In [None]:
###GRADED

mean_no_outliers = ''

# YOUR CODE HERE
raise NotImplementedError()

[Back to top](#Index:) 

### Problem 9: Finding the Median with no Outliers

**2 Points**


Now, determine the median of the `base_salary` column with the outliers removed. Assign your solution to the variable `median_no_outliers` below.

In [None]:
###GRADED

median_no_outliers = ''

# YOUR CODE HERE
raise NotImplementedError()

[Back to top](#Index:) 

### Problem 10: Standard Deviation with `numpy`

**2 Points**


The **standard deviation** is an alternative to the IQR for measuring the spread of a dataset. Both pandas and NumPy have functions for computing the standard deviation; however, they make different default assumptions about the data. To begin, use the `np.std()` function to determine the standard deviation of the `base_salary` column from both datasets (with and without outliers). Assign your results to the variables `std_numpy_outliers` and `std_numpy_no_outliers` below.

In [None]:
###GRADED

std_numpy_outliers = ''
std_numpy_no_outliers = ''

# YOUR CODE HERE
raise NotImplementedError()

[Back to top](#Index:) 

### Problem 11: Standard Deviation with `pandas`

**2 Points**



Now, use the pandas `.std()` method to determine the standard deviation of the `base_salary` column in both datasets. Assign the results to `std_pandas_outliers` (all data) and `std_pandas_no_outliers` (outliers removed) below.

In [None]:
###GRADED

std_pandas_outliers = ''
std_pandas_no_outliers = ''

# YOUR CODE HERE
raise NotImplementedError()

### Why are these different?

Pandas and NumPy make different assumptions about the data, which is why we receive slightly different results. The difference lies in the default value of the `ddof` (delta degrees of freedom) argument. From the NumPy docs we have:

```
ddof: int, optional
Means Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements. By default ddof is zero.
```

and from pandas we find:

```
ddof: int, default 1
Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.
```

The pandas default of `ddof=1` treats the data as a sample drawn from a larger population, while the NumPy default of `ddof=0` treats the data as the entire population. The $n-1$ divisor is motivated by the fact that the deviations from the mean always sum to zero. This means we always know the last deviation once we know the previous $n-1$. Only $n-1$ of the deviations are free to vary, so the average squared deviation is computed by dividing the total by $n-1$. In practice, the two values won't differ greatly for large datasets, but it is important to recognize that the libraries use different defaults.
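The effect of `ddof` can be checked on a small made-up dataset, sketched below with hypothetical variable names. The eight values have mean 5 and squared deviations summing to 32, so dividing by $N = 8$ gives a standard deviation of 2, while dividing by $N - 1 = 7$ gives a slightly larger value.

```python
import numpy as np
import pandas as pd

# Made-up values: mean is 5, squared deviations sum to 32.
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# NumPy default: ddof=0, divisor N.
std_population = np.std(values.to_numpy())

# pandas default: ddof=1, divisor N - 1.
std_sample = values.std()

# Passing ddof explicitly makes each library match the other.
std_numpy_ddof1 = np.std(values.to_numpy(), ddof=1)
std_pandas_ddof0 = values.std(ddof=0)

print(std_population, std_sample)
```

Setting `ddof` explicitly in either library removes the discrepancy, which is a simple way to confirm that the default is the only source of the difference.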

### A Second Small Example

The data below relates to salmon smolts (young salmon) and the reflectance of their skin as a measure for readiness for migration.

In [None]:
smolt = pd.read_csv('data/smolts.csv', index_col=0)

In [None]:
smolt.head()

[Back to top](#Index:) 

### Problem 12: Measure of Center

**2 Points**


Determine the mean and median of the `Reflectance` column, and save these to the variables `smolt_mean` and `smolt_median` below. Which of the measures do you believe is more appropriate?

In [None]:
###GRADED

smolt_mean = ''
smolt_median = ''

# YOUR CODE HERE
raise NotImplementedError()

[Back to top](#Index:) 

### Problem 13: Measure of Spread

**2 Points**

Compute the standard deviation and the quartiles of the `Reflectance` column using pandas. Save your results to `smolt_std`, `smolt_first_quartile` and `smolt_third_quartile` below.  Which of these do you prefer for describing the spread of the data?  

In [None]:
###GRADED

smolt_std = ''
smolt_first_quartile = ''
smolt_third_quartile = ''

# YOUR CODE HERE
raise NotImplementedError()