# Single Variable Exploratory Data Analysis

Throughout this notebook, we'll be using the `pandas` and `numpy` libraries to perform some of our calculations and the `pyplot` module from `matplotlib` to create plots.

A **library** bundles together functions and objects that have a common functionality (like data or statistical analysis).

These must be imported in order to use them. 

We'll *alias* `pandas` as `pd` so that when we refer to it later, we'll only need to type `pd`. Similarly, we'll alias `numpy` and `matplotlib.pyplot`.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

The tools that we use depend on whether we're looking at numeric or categorical variables.

## Qualitative/Categorical Variables

Recall that qualitative variables are those that fall into two or more levels/groups.

The most interesting information in regard to qualitative variables is the number of observations per level/group.

You can also look for the **mode** of a categorical variable, or the most frequent observation.

This can be displayed in a **frequency table**, which shows a count of observations per category.

For this example, we'll look at data from the 2018 Central Park Squirrel Census, which can be obtained from https://data.cityofnewyork.us/Environment/2018-Central-Park-Squirrel-Census-Squirrel-Data/vfnx-vebw.

The first step is to read the data into a pandas DataFrame so that we can manipulate it.

In [None]:
squirrels = pd.read_csv('../data/squirrels.csv')

Let's break apart what happened in the previous cell.

```
squirrels = pd.read_csv('../data/squirrels.csv')
```

* `pd`: This refers to the pandas library (remember the alias from above). We are using a function from the pandas library, so we need to indicate this.

* `.read_csv`: This is the name of the function we are using. Functions are used to perform different actions. In this case, we are reading the data from a csv file. Note that when you call a function, you'll have a set of parentheses on the end.

* `'../data/squirrels.csv'`: This function needs to know where to look for the data, so we must supply it with a filepath to locate the csv file.

* `squirrels =`: This assigns the result of the function call to a variable named squirrels so that we can refer to it and reuse it later.

We can look at the squirrels variable by calling it, like we do in the next cell.

In [None]:
squirrels

What we are seeing is a **DataFrame**, an object from the pandas library which is useful for working with tabular data.

DataFrames are made up of **rows** and **columns**. Each row consists of an observation and we have a column for each variable.

You can look at the first few rows by using the `.head()` **method**. This is a function which is built in to DataFrames.

To use a dataframe method, you normally type the name of the dataframe followed by a `.` and the name of the method you wish to use. 

Note also that when using methods, you need to put a set of parentheses after the name of the method.

In [None]:
squirrels.head()

**Question:** What do you notice when looking at the first few rows?

**Note:** To see the list of all columns, you can take a look at the `columns` attribute:

In [None]:
squirrels.columns

One of the variables in the dataset is the primary fur color. We can access a single column from a data frame by using square brackets.

In [None]:
squirrels['Primary Fur Color']

When extracting a single column like we did above, we get a pandas **Series**. A Series can basically be understood as a single column of a DataFrame.
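
To make the difference between a Series and a DataFrame concrete, here is a minimal sketch using a small made-up DataFrame (the `toy` data is hypothetical and is not part of the squirrel census). Single square brackets return a Series, while passing a list of column names returns a DataFrame.

```python
import pandas as pd

# Hypothetical toy DataFrame, used only for illustration
toy = pd.DataFrame({'color': ['Gray', 'Black'], 'age': ['Adult', 'Juvenile']})

single = toy['color']      # single brackets: a Series
double = toy[['color']]    # a list of column names: a DataFrame

print(type(single))
print(type(double))
```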

If you want to see how many squirrels there were for each fur color, you can use the `value_counts` method from `pandas` to create a frequency table.

In [None]:
squirrels['Primary Fur Color'].value_counts()

You can see that gray squirrels are by far the most commonly spotted squirrels in the dataset.

By default, `value_counts` will return a count of each category, but what if we want to modify the behavior of this function? This can be done by passing in some additional **arguments**. To get a list of available arguments and what they do, we can bring up the docstrings. 

For this function, we can bring up the docstring by referencing the function name, placing the cursor inside the parentheses and hitting Shift + Tab. If you do this 4 times, it'll pin the docstring to the bottom of the screen.

In [None]:
pd.Series.value_counts()

An alternative is to check the documentation, which for `value_counts` is located at https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html.

If you are not interested in the *number* of observations of each group, but instead the *proportion* of observations in each category, you can add the `normalize = True` argument. This gives the **relative frequency** of each category.

In [None]:
squirrels['Primary Fur Color'].value_counts(normalize = True)
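
If it helps to see counts and proportions side by side, the sketch below uses a small made-up Series (the `colors` values are hypothetical, not taken from the census). It also shows one way to pull out the mode with the `mode` method.

```python
import pandas as pd

# Hypothetical toy data, not taken from the squirrel census
colors = pd.Series(['Gray', 'Gray', 'Cinnamon', 'Gray', 'Black', 'Cinnamon'])

counts = colors.value_counts()                       # frequency table
proportions = colors.value_counts(normalize = True)  # relative frequencies

print(counts)
print(proportions)
print(colors.mode()[0])   # the most frequent category
```

The relative frequencies always add up to 1, since each one is a count divided by the total number of non-missing observations.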

If you want to visualize the frequency per color, you can create a **bar plot**.

In [None]:
squirrels['Primary Fur Color'].value_counts().plot(kind = 'bar')

A few small improvements you can make:
1. Add a semicolon to the last line, which suppresses the unneeded text output.
2. Use the `.xticks()` function from matplotlib to remove the rotation for the labels.
3. Use the `.title()` function from matplotlib to add a title to our plot.

In [None]:
squirrels['Primary Fur Color'].value_counts().plot(kind = 'bar')
plt.xticks(rotation = 0)
plt.title('Number of Squirrels by Each Primary Fur Color');

### Your Turn 

**Question 1:** What percentage of the time were squirrels observed approaching (as indicated by the "Approaches" column)?

In [None]:
# Your Code Here

**Question 2:** Which age group (contained in the "Age" column) was most commonly spotted? Make a bar chart to show this.

In [None]:
# Your Code Here

**Question 3:** Are there any duplicate Unique Squirrel IDs? 

Bonus: How many are duplicated?

In [None]:
# Your Code Here

You may have noticed that the DataFrame contains a lot of NaN values. If we want to be able to count these, we can utilize the `dropna` argument.

In [None]:
squirrels['Primary Fur Color'].value_counts(dropna = False)

If we just want a general overview of the number of missing values per column, we can use the `isna` method followed by the `sum` method. If we instead want the proportion of missing values (multiply by 100 for a percentage), we can use `isna` in combination with `mean`.

In [None]:
squirrels.isna().sum()

In [None]:
squirrels.isna().mean()
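
Why does this work? `isna` returns a DataFrame of `True`/`False` values, and `True` is treated as 1 when summing or averaging. A minimal sketch on a made-up DataFrame (the `toy` data is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy DataFrame with a few missing values
toy = pd.DataFrame({
    'color': ['Gray', None, 'Black', 'Gray'],
    'age': ['Adult', 'Adult', np.nan, np.nan]
})

missing_counts = toy.isna().sum()    # number of missing values per column
missing_share = toy.isna().mean()    # proportion of missing values per column

print(missing_counts)
print(missing_share)
```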

## Quantitative/Numerical Variables

Numerical variables are those which can be counted or measured. There are a number of ways we can examine numerical variables.

Let's look at a new dataset, one which contains stats for all active NBA players.

In [None]:
nba = pd.read_csv('../data/nba_players.csv')

In [None]:
nba.head()

**Question:** Which of the variables in this DataFrame are quantitative? Which are categorical?

There are three major categories of descriptive statistics for quantitative variables:
* Measures of Central Tendency
* Measures of Variability/Spread
* Measures of Position

# Measures of Central Tendency

**Goal:** Give a central or "typical" value of a data set.

Most common measures of central tendency:
* mean
* median

## Mean

Also known as the **average** or **arithmetic mean**. 

Defined as total (sum) of the values of a set of observations divided by the number of observations. 

The notation for the mean differs depending on if you are calculating it for a sample or for the entire population.

$$\text{Sample Mean: } \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum\limits_{i=1}^n x_i}{n}$$

$$\text{Population Mean: } \mu = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum\limits_{i=1}^n x_i}{n}$$

The mean represents the “balance point” of the data. It is the amount that all observations would have if the total amount of the variable was evenly distributed to all observations.

First, you'll manually calculate the mean, so that you can see how to use some of the methods available in `pandas`. 

In [None]:
nba['salary'].sum()

In [None]:
nba['salary'].count()

In [None]:
nba['salary'].sum() / nba['salary'].count()

The `pandas` library has many of the common descriptive statistics available as methods. For example, to compute the mean salary, you didn't need to compute the sum and count; instead, you could have taken advantage of the `mean` method.

In [None]:
nba['salary'].mean()

This says that if you distributed the total payroll evenly to all players, they would each receive a salary of $8,900,391.

## Median

The **median** is the number which divides the dataset exactly in half. It is the middle value if the data is arranged by size.

For an odd number of observations, the median will be a value from the data set.

For an even number of observations, the median is the mean of the two centermost observations.

In [None]:
nba['salary'].median()

This tells us that half of NBA players make less than \\$4,220,057 and half make more than \\$4,220,057. 
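
To see the odd and even cases side by side, here is a small sketch with made-up values (both Series are hypothetical):

```python
import pandas as pd

# Hypothetical toy values
odd = pd.Series([3, 1, 7, 5, 9])        # sorted: 1, 3, 5, 7, 9
even = pd.Series([3, 1, 7, 5, 9, 11])   # sorted: 1, 3, 5, 7, 9, 11

print(odd.median())    # 5.0, the middle value
print(even.median())   # 6.0, the mean of 5 and 7
```

Note that the data does not need to be sorted first; the `median` method handles the ordering.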

**Your Turn:** Find the mean and median height in inches of NBA players (which is contained in the `height_inches` column). What do you notice and how does it compare to what you saw with salaries?

In [None]:
# Your Code Here

Why do we see such a vast difference between the mean and median for salaries but not for heights? This may be caused by the fact that we have some very extreme values for the salary variable - players who make a lot more than the typical player. In fact, the salary of the top earner is more than 11 times the median salary. These extreme values can have an outsized impact on the calculation of the mean.

In [None]:
nba.nlargest(5, 'salary')

On the other hand, while there are very tall players, we do not have the same kind of extremes that we see with salary values.

In [None]:
nba.nlargest(5, 'height_inches')

What we have seen is that the median is not as impacted by extreme values as the mean is. The term for this is that the median is **robust**. This is the reason that the median is often used to report statistics on salaries or home values, where extreme values are a common occurrence.
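
Robustness can be illustrated with a tiny made-up example (the numbers below are hypothetical, in millions of dollars). Replacing the largest value with an extreme one changes the mean a lot but leaves the median untouched.

```python
import pandas as pd

# Hypothetical salaries in millions of dollars
salaries = pd.Series([2, 3, 4, 5, 6])
with_star = pd.Series([2, 3, 4, 5, 46])   # same data, but the top value is extreme

print(salaries.mean(), salaries.median())      # 4.0 4.0
print(with_star.mean(), with_star.median())    # 12.0 4.0
```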

## Distribution Shape

So far, we've seen the center of our distributions using the mean and median, but what if we want to get a better idea about the overall distribution of values? When looking at a numerical variable, we can also inspect the *shape* of the distribution of that variable.

The **distribution** refers to the possible values of that variable and which values occur more or less frequently than others.

When talking about the shape of a distribution, there are a few different aspects we can examine:
* **Symmetry:** Is the distribution symmetric? If so, is it "bell-shaped"? Is it flat?
* **Skewness:** If it is not symmetric, does it have a long tail to one side?
* **Peaks/Modes:** How many peaks does it have? Unimodal? Bimodal? Multimodal?
* **Spread:** How narrow/wide is the distribution?

### Histograms

If you are trying to understand the shape of the distribution of a variable, the most common tool to use is the histogram.

A histogram shows how many observations lie within a certain class interval. That is, it divides the dataset into *bins*, and the height of the plot above each interval is proportional to the number of observations that fall within that bin.

Procedure:
* Separate data into equal-width, non-overlapping bins
* Count number of data points in each bin
* Draw a bar for each bin whose height is equal to the number of observations in that bin.
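
The first two steps of this procedure can be sketched with NumPy's `np.histogram`, which does the binning and counting without drawing anything. The weights below are made-up values, not the NBA data, and the bin edges are chosen only for illustration.

```python
import numpy as np

# Hypothetical weights in pounds
weights = np.array([182, 195, 201, 204, 215, 218, 222, 236, 251])

# Equal-width, non-overlapping bins: [180, 200), [200, 220), [220, 240), [240, 260]
bin_edges = np.arange(start = 180, stop = 280, step = 20)

counts, edges = np.histogram(weights, bins = bin_edges)

print(edges)    # [180 200 220 240 260]
print(counts)   # [2 4 2 1]
```

Each count is the height of the bar that would be drawn above the corresponding bin.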

Let's look at the `weight_lbs` variable and examine the distribution.

In [None]:
plt.hist(
    data = nba,
    x = 'weight_lbs'
);

Let's make a few improvements to our plot.

**Note:** To see the possible _arguments_ for a function and a description of those arguments, you can press Shift + Tab inside the parentheses for that function to bring up the docstring.

In [None]:
fig,ax = plt.subplots(figsize = (10,6))               

plt.hist(
    data = nba,
    x = 'weight_lbs',
    edgecolor = 'black',
    linewidth = 2
);                              
plt.xlabel('weight (lbs.)')                            
plt.ylabel('count')
plt.title('Histogram of NBA Player Weights');

**Question:** There is a bar in the middle whose height is about 100. What does this bar tell us?

Now, let's look at our 4 questions:

* **Symmetry:** Is the distribution symmetric? If so, is it "bell-shaped"? Is it flat?
* **Skewness:** If it is not symmetric, does it have a long tail to one side?
* **Peaks/Modes:** How many peaks does it have? Unimodal? Bimodal? Multimodal?
* **Spread:** How narrow/wide is the distribution?

While it is not a perfect mirror image from left to right, this distribution is mostly **symmetric** and not skewed. It is close to being bell-shaped.

This distribution appears to be **unimodal**, with a large group of players whose weight is between about 215 and around 280 lbs. 

We can see that overall, the values are between about 160 and 300, with the distribution tailing off to either side.

The way that matplotlib created the bins, it appears that maybe there is another mode around 240 lbs. Let's take a closer look at this.

Histograms are very sensitive to how the bins are selected. If desired, we can supply our own bin edges.

In [None]:
bins = np.arange(start = 160, stop = 320, step = 20)
bins

In [None]:
fig,ax = plt.subplots(figsize = (10,6))               

plt.hist(
    data = nba,
    x = 'weight_lbs',
    edgecolor = 'black',
    linewidth = 2,
    bins = bins
);                             
plt.xlabel('weight (lbs.)')                            
plt.ylabel('count')
plt.title('Histogram of NBA Player Weights');

In [None]:
bins = np.arange(start = 160, stop = 320, step = 10)
bins

In [None]:
fig,ax = plt.subplots(figsize = (10,6))               

plt.hist(
    data = nba,
    x = 'weight_lbs',
    edgecolor = 'black',
    linewidth = 2,
    bins = bins
);                             
plt.xlabel('weight (lbs.)')                            
plt.ylabel('count')
plt.title('Histogram of NBA Player Weights');

In looking at our histogram in this view, we can see a little bit more about the symmetry of our data and see that it has a slight tail to the right, with some unusually high weights.

**Your Turn:** Create a histogram to examine the distribution of points per game (contained in the `pts_per_game` column). If you have time, also check on the `salary` column.

Describe what you find in terms of the 4 questions:

* **Symmetry:** Is the distribution symmetric? If so, is it "bell-shaped"? Is it flat?
* **Skewness:** If it is not symmetric, does it have a long tail to one side?
* **Peaks/Modes:** How many peaks does it have? Unimodal? Bimodal? Multimodal?
* **Spread:** How narrow/wide is the distribution?

In [None]:
# Your Code Here

#### Skewness

When a dataset has a long tail to the right, you say that it is **right-skewed**. 

Analogously, a dataset with a long tail to the left (unusually small observations) would be said to be **left-skewed**.

A long tail to one side will tend to pull the mean in that direction. The median is typically not affected as much by a long tail.
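
As a rough illustration, the sketch below uses made-up values with a long right tail (the data is hypothetical). It compares the mean and median and also calls the pandas `skew` method, which returns a numeric measure of skewness that is positive for right-skewed data and negative for left-skewed data.

```python
import pandas as pd

# Hypothetical right-skewed values: most are small, a few are very large
values = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 20, 35])

print(values.mean())     # pulled toward the long right tail
print(values.median())   # stays near the bulk of the data
print(values.skew())     # positive for a right-skewed sample
```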

#### Importance of Modes

Why do you need to care about how many modes a distribution has? When analyzing a bimodal distribution, using the mean or even the median can be misleading. It can easily be the case that there are very few observations close to the mean or median. This can be a problem if you are interpreting these measures of central tendency as "typical" values. This can happen when the overall population is made up of two or more heterogeneous groups.

See https://www.nalp.org/startingsalarydistributionclassof2009 for an example of a bimodal distribution where normal descriptive statistics are misleading. This website shows the distribution of starting salaries for lawyers in 2009. For lawyers, salaries are typically very high (for those that get full-time positions) or very low (for those that can only secure part-time employment). As a result, the salaries follow a bimodal distribution and almost no one makes the mean or median salary.

<img src="images/2009_bimodal_in_color_for_web.gif" width="500">

# Measures of Spread

**Goal:** Give an idea of how similar or varied the observations in the dataset are.

## Range

The range measures how "wide" a dataset is. It depends only on the largest and smallest observations, so it is highly influenced by outliers.

$$ \text{range} = \text{maximum observation} - \text{minimum observation}$$

In [None]:
nba['weight_lbs'].max()

If you want to know who has the largest weight, you can use the `nlargest` method.

In [None]:
nba.nlargest(1, 'weight_lbs')

Similarly for the minimum.

In [None]:
nba['weight_lbs'].min()

In [None]:
nba.nsmallest(1, 'weight_lbs')

To find the range, you can subtract the minimum value from the maximum value.

In [None]:
nba['weight_lbs'].max() - nba['weight_lbs'].min()

What is the range for heights?

In [None]:
# Your Code Here
nba['height_inches'].max() - nba['height_inches'].min()

## Variance and Standard Deviation

The range of a dataset gives a quick glance at how varied a dataset is. It does have a major drawback, though, in that it only depends on two data points: the largest and smallest. What if you want to consider the entire dataset?

Let's focus on the height variable. For each player, we'll consider their **deviation:** the difference between their height and the average height.

In [None]:
nba['height_deviation'] = nba['height_inches'] - nba['height_inches'].mean()
nba.head()

**Question:** How do we interpret the value of 4.37961 for Steven Adams? What about the -1.62039 for Ochai Agbaji?

**Question:** What is the average deviation from the mean? Why do you think that you get this result?

In [None]:
# Your Code Here

There is one problem with simply taking the average of the deviations: if you were to sum the deviations, you would get zero, meaning that, on average, the deviation is zero.

A solution for the problem that we just encountered is to, for each datapoint $x_i$, look at the squared deviation $(x_i - \mu)^2$.

If you now take the mean of these squared deviations, you get what is called the **variance**. Note that this formula is only valid if you are looking at a *population*. You'll see the difference for a sample shortly.
$$\text{Population Variance: } \sigma^2 = \frac{\sum\limits_{i = 1}^n(x_i - \mu)^2}{n}$$

In [None]:
nba['squared_height_deviation'] = nba['height_deviation']**2
nba

In [None]:
nba['squared_height_deviation'].mean()

There is only one problem now: the variance is in squared units, not in our original unit. If you want to convert it back to the starting units, you can take the square root and obtain what is called the **standard deviation**:

$$\text{Population Standard Deviation: } \sigma = \sqrt{\sigma^2} $$

In [None]:
np.sqrt(nba['squared_height_deviation'].mean())

When working with a sample, there is a small modification that must be made to calculate the variance and standard deviation. Instead of dividing by $n$, the number of observations, you instead divide by $n-1$:

$$\text{Sample Variance: } s^2 = \frac{\sum\limits_{i = 1}^n(x_i - \bar{x})^2}{n - 1}$$

$$\text{Sample Standard Deviation: } s = \sqrt{s^2} $$

**Question:** What is the effect of this difference? (Is the sample variance larger or smaller than the population variance?)


**Why do we have this difference?** 

Informally, the reason that you do this is that you are trying to approximate the population variance. You want to estimate the deviation from the mean, but at the same time, you don't know the true population mean to start with, only an estimate from the sample ($\bar{x}$). So you are making an estimate using an estimate. To compensate for this, you need to inflate your estimate of the variance slightly, by dividing by $n - 1$ instead of $n$.

By default, the `pandas` methods that calculate the variance or standard deviation assume that you are looking at a sample. (Defaults vary between libraries: NumPy's `np.var` and `np.std`, for example, assume a population.) However, in this case, you have the entire population, so you need to adjust it. If you are using `pandas` methods, you can specify `ddof = 0`, which sets the "delta degrees of freedom", or the amount that the "degrees of freedom" differ from the number of observations, to be 0.

If you are calculating the standard deviation of a sample, you need to use `ddof = 1` (which is the default behavior).
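
To connect the formulas to the `ddof` argument, here is a minimal sketch that computes both versions of the variance by hand on made-up heights (the five values are hypothetical) and compares them with the built-in methods.

```python
import numpy as np
import pandas as pd

# Hypothetical heights in inches
heights = pd.Series([72, 75, 78, 80, 85])

n = heights.count()
squared_deviations = (heights - heights.mean())**2

population_variance = squared_deviations.sum() / n        # divide by n
sample_variance = squared_deviations.sum() / (n - 1)      # divide by n - 1

print(population_variance, heights.var(ddof = 0))   # these two agree
print(sample_variance, heights.var())               # pandas defaults to ddof = 1
print(np.var(heights.to_numpy()))                   # NumPy defaults to ddof = 0
```

Because the default differs between `pandas` and NumPy, it can be safer to pass `ddof` explicitly whenever the distinction matters.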

In [None]:
nba['height_inches'].var(ddof = 0)

In [None]:
nba['height_inches'].std(ddof = 0)

Compare this to the default behavior:

In [None]:
nba['height_inches'].std()

In [None]:
from nssstats.plots import std_plot

In [None]:
plt.figure(figsize = (10,6))

std_plot(nba['height_inches'], edgecolor = 'black', linewidth = 2)

For distributions that are approximately bell-shaped, about 2/3 of observations will be within one standard deviation of the mean.

If, however, the distribution is not bell-shaped, this may not hold. For example, let's look at salaries.

In [None]:
nba['salary'].std()

In [None]:
plt.figure(figsize = (10,6))

std_plot(nba['salary'], edgecolor = 'black', linewidth = 2)

**Question:** What happens with this distribution? Why?

## $z$-scores



Often, you are not as interested in understanding "how much", but instead "how different from average?". That is, you may wish to measure how "unusual" a particular observation is. 

A $z$-score allows you to answer this question, in terms of the number of standard deviations from the mean. It is *unitless*: it does not depend on what is being measured or on the scale of the measurements, so you can compare across different types of measurements.

$$ z\text{-score} = \frac{\text{observation} - \text{mean}}{\text{standard deviation}}$$

A $z$-score of 1.4 says that an observation is 1.4 standard deviations larger than the average value, whereas a $z$-score of -2.8 says that an observation is 2.8 standard deviations lower than the mean.
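
As a small worked example, the sketch below applies the formula to made-up heights (the five values are hypothetical and are treated as a complete population, hence `ddof = 0`).

```python
import pandas as pd

# Hypothetical heights in inches
heights = pd.Series([72, 75, 78, 80, 85])

# Treat these five values as the whole population, so use ddof = 0
z_scores = (heights - heights.mean()) / heights.std(ddof = 0)

print(z_scores.round(2))
```

Here the value 78 equals the mean, so its $z$-score is 0, while the tallest value, 85, is 7 inches above the mean and gets the largest positive $z$-score.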

In [None]:
nba['height_z-score'] = (nba['height_inches'] - nba['height_inches'].mean()) / nba['height_inches'].std(ddof = 0)

**Question:** What are the mean and standard deviation of the z-scores that were just calculated? Why?

In [None]:
# Your code here

Let's go ahead and calculate the z-scores for weight, too.

In [None]:
nba['weight_z-score'] = (nba['weight_lbs'] - nba['weight_lbs'].mean()) / nba['weight_lbs'].std(ddof = 0)

**Question:** How do we interpret the z-scores for Zion Williamson? To see the values for Zion Williamson, we'll use the `.loc` property to filter down to his row.

In [None]:
nba.loc[(nba['first_name'] == 'Zion')]

# Measures of Position

Measures of position have to do with ranking where an observation is in the dataset with respect to all other values.

## Quartiles and Quantiles/Percentiles

You have already encountered a special case of quantiles and percentiles, in the form of the median. Recall that the median of a dataset is the middle observation, if the observations are placed in ascending order. Another way to view this is that the median separates the lower half of the dataset from the upper half.

Instead of dividing a dataset into halves, **quartiles** divide a dataset into quarters. 

The **first quartile** separates the smallest quarter of observations from the highest three-quarters, the **second quartile**, aka the median, separates the smallest half of observations from the largest half of observations, and the **third quartile** separates the smallest three-quarters from the largest quarter of observations.

Quartiles (and more generally, **quantiles**) can be calculated using the `quantile` method.

In [None]:
nba['weight_lbs'].quantile(q = 0.25)

In [None]:
nba['weight_lbs'].quantile(q = 0.5)

In [None]:
nba['weight_lbs'].quantile(q = 0.75)

Note that _pandas_ has a `describe` method which gives many of the summary statistics that we have mentioned.

In [None]:
nba['weight_lbs'].describe()

You can use the quantiles to find the **interquartile range**, which is defined as the distance from the first to the third quartile. In a way, it is a trimmed version of the range, which is not as sensitive to extreme values.

In [None]:
nba['weight_lbs'].quantile(q = 0.75) - nba['weight_lbs'].quantile(q = 0.25)

In [None]:
from nssstats.plots import iqr_plot

In [None]:
plt.figure(figsize = (10,6))

iqr_plot(nba['weight_lbs'], bins = 25, edgecolor = 'black', linewidth = 2)

More generally, you can look at the quantiles or percentiles. The $n$th percentile separates the lowest $n$% of observations from the rest. For example, the 90th percentile divides the lowest 90% of observations from the highest 10%.

To find percentiles, you can use the quantile function from pandas.

In [None]:
nba['weight_lbs'].quantile(q = 0.1)

In [None]:
nba['weight_lbs'].quantile(q = 0.9)

Percentiles can be used to identify unusual observations, or to trim outliers from a data set.
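
One possible way to trim with percentiles is sketched below on made-up values (the data and the choice of the 5th and 95th percentiles as cutoffs are assumptions for illustration). The `between` method builds a True/False mask that keeps only the observations inside the cutoffs.

```python
import pandas as pd

# Hypothetical values with one unusually small and one unusually large observation
values = pd.Series([1, 48, 50, 51, 52, 53, 55, 56, 58, 60, 150])

lower = values.quantile(q = 0.05)
upper = values.quantile(q = 0.95)

# Keep only observations between the 5th and 95th percentiles (inclusive)
trimmed = values[values.between(lower, upper)]

print(lower, upper)
print(trimmed.tolist())
```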

If you want to understand how a variable is distributed, you have already seen how to use a histogram. An alternative type of plot that you can use is a **boxplot** (aka **box-and-whiskers plot**). This type of plot displays a box which starts at the first quartile and extends to the third quartile, with the second quartile marked. It also has whiskers that extend to the last observations contained within the **outlier boundaries**.

These boundaries are (usually) defined as being at 1.5 times the interquartile range below the first quartile and above the third quartile. Any points outside of the outlier boundaries are plotted individually.
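
The boundaries can also be computed directly. The following is a minimal sketch on made-up weights (the values are hypothetical), using the usual 1.5 times the interquartile range rule described above.

```python
import pandas as pd

# Hypothetical weights in pounds
weights = pd.Series([180, 190, 200, 205, 210, 215, 220, 230, 240, 310])

q1 = weights.quantile(q = 0.25)
q3 = weights.quantile(q = 0.75)
iqr = q3 - q1

lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Observations outside the boundaries would be drawn as individual points
outliers = weights[(weights < lower_bound) | (weights > upper_bound)]

print(q1, q3, iqr)
print(lower_bound, upper_bound)
print(outliers.tolist())
```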

In [None]:
import seaborn as sns

In [None]:
plt.figure(figsize = (10,6))
sns.boxplot(x = nba['weight_lbs']);

In [None]:
plt.figure(figsize = (10,6))

sns.boxplot(x = nba['salary']);

You can add an argument to tell seaborn how to divide the data into categories. 

In [None]:
plt.figure(figsize = (10,8))

sns.boxplot(data = nba.sort_values('team'), x = "salary", y = "team");

**Question:** What can we learn from this plot?