<a href="https://colab.research.google.com/github/KordingLab/ENGR344/blob/master/tutorials/W3D1_What_should_we_do_when_data_has_problems/W3D1_Tutorial2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>




# Tutorial 2: Outliers
**Week 3: What should we do when data has problems?**

**Content creators**: Rob Lindgren

**Content reviewers**: Konrad Kording, Keervani Kandala

**Content modifiers**: ---

**Modified Content reviewer**: ---


___
# Tutorial Objectives

*Estimated timing of tutorial: 30 minutes*

This is Tutorial 2 in a 3-part series on how to handle data that has problems. In this tutorial, we will learn about outliers: what problems they can cause, how to identify them, and how to remove them. By the end of this tutorial, you will be able to:

- Explain the two possible causes of outliers: erroneous observations and natural variation
- Identify outliers using histograms, scatterplots, boxplots, and z-scores
- Remove outliers from a DataFrame using Boolean masks


In [1]:
# @title Tutorial slides
 
# @markdown These are the slides for the videos in all tutorials today
from IPython.display import IFrame

IFrame(src=f"https://mfr.ca-1.osf.io/render?url=https://osf.io/hncv7/?direct%26mode=render%26action=download%26mode=render", width=854, height=480)

---
# Setup

In [None]:
# Imports

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import random

In [None]:
# @title Plotting Function

# Solution
def plt_cars(df):
  """Plot histograms of 'Horsepower' and 'Highway mpg' from the Cars dataset,
  as well as a scatterplot of 'Horsepower' vs. 'Highway mpg'.

  Args:
    df (DataFrame): Cars dataset, with variables 'Horsepower' and 'Highway mpg'.

  Returns:
    None
  """

  # Compute means of the two plotted variables only
  means = df[['Horsepower', 'Highway mpg']].mean()

  # Create figure and axes objects
  fig_a, (ax1, ax2) = plt.subplots(1, 2)
  
  # Visualize 'Horsepower'
  ax1.hist('Horsepower', data=df)
  ax1.set_xlabel("Horsepower")
  ax1.set_ylabel("Number of vehicles")
  ax1.axvline(means['Horsepower'], color='Orange')

  # Visualize 'Highway mpg'
  ax2.hist('Highway mpg', data=df)
  ax2.set_xlabel("Highway mpg")
  ax2.set_ylabel("Number of vehicles")
  ax2.axvline(means['Highway mpg'], color='Orange')
  plt.show()  # display the histograms

  print('\n')
  
  # Visualize the relationship between 'Horsepower' and 'Highway mpg'
  fig_b, ax = plt.subplots(1, 1)
  ax.scatter('Horsepower', 'Highway mpg', data=df)
  ax.set_xlabel('Horsepower')
  ax.set_ylabel('Highway mpg')
  plt.show()  # display the scatterplot



In [None]:
# @title Regression Function
def regress(df_in, x_lab, y_lab):
  """Fit a linear regression of y_lab on x_lab.

  Args:
    df_in (DataFrame): Input data.
    x_lab (str): Name of the predictor column.
    y_lab (str): Name of the response column.

  Returns:
    dict: {'prediction': predicted values,
           'intercept': fitted intercept,
           'coef': fitted slope}
  """
  from sklearn.linear_model import LinearRegression

  df = df_in.copy()

  # .values converts each selection to a NumPy array;
  # reshape(-1, 1) keeps one column and infers the number of rows
  x = df.loc[:, [x_lab]].values.reshape(-1, 1)
  y = df.loc[:, [y_lab]].values.reshape(-1, 1)

  reg = LinearRegression()  # create the model object
  reg.fit(x, y)  # perform linear regression
  y_pred = reg.predict(x)  # make predictions

  out = {'prediction': y_pred,
         'intercept': reg.intercept_[0],
         'coef': reg.coef_[0][0]}

  return out

---
# Prepare Data

*Estimated timing to here from start of tutorial: ???*

Once again, let's load the Cars dataset, subset it, and verify that we got the subset that we expected.


In [None]:
data_url = 'https://raw.githubusercontent.com/RealTimeWeb/datasets/master/datasets/csv/cars/cars.csv'
df = pd.read_csv(data_url)[['ID', 'Horsepower', 'Highway mpg']]
df.head()

# Section 1: What problems do outliers pose?

Outliers generally arise in one of two ways:

- **Erroneous observations:** Outliers might indicate problems with measurement or data entry, so they must be identified and investigated. Data points that do not belong in the dataset should be removed, and the underlying measurement or data-entry issues should be corrected. E.g., we run a questionnaire and a respondent misunderstands a question.
- **Natural variation:** Even accurately measured and recorded data can contain outliers, and these outliers can have a substantial effect on the results of an analysis. In this case, you might need to narrow your sample so that it excludes the outliers. E.g., a data point is correct but unusual, so we may want to ignore it when making predictions.

**Philosophy of outliers**

How to deal with outliers is therefore a complex question, and the answer depends on the goal. If we are running a spam-sending operation, we may want to focus on modeling the outliers well, because we rely on exceptionally gullible people. If we want to build a tool to be used by many people, we may mostly care about people who are somewhat typical and ignore the outliers. It is important to think through what we are trying to achieve with our modeling.
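To make the effect of a single outlier concrete, here is a minimal sketch on made-up data (not the Cars dataset). It fits a straight line with `np.polyfit` before and after corrupting one observation, and compares the mean and median of the response. The arrays `x`, `y`, and `y_bad` are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical toy data: y falls by exactly 0.5 for every unit increase in x
x = np.arange(10, dtype=float)
y = 20.0 - 0.5 * x

# Fit a straight line; for deg=1, np.polyfit returns [slope, intercept]
slope_clean, _ = np.polyfit(x, y, 1)

# Corrupt a single observation, e.g. a data-entry error in the last point
y_bad = y.copy()
y_bad[-1] = 100.0

slope_bad, _ = np.polyfit(x, y_bad, 1)

print(f"Slope without the outlier: {slope_clean:.2f}")
print(f"Slope with one outlier:    {slope_bad:.2f}")
print(f"Mean of y   (clean / corrupted): {y.mean():.2f} / {y_bad.mean():.2f}")
print(f"Median of y (clean / corrupted): {np.median(y):.2f} / {np.median(y_bad):.2f}")
```

In this toy case, one bad point is enough to flip the sign of the fitted slope and shift the mean noticeably, while the median barely moves.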



In [None]:
# @title Video 6: How does horsepower relate to efficiency
from ipywidgets import widgets
from IPython.display import display, IFrame, YouTubeVideo

out1 = widgets.Output()
with out1:
  video = YouTubeVideo(id="HhSjVj2AOgc", width=854, height=480, fs=1, rel=0)
  print(f'Video available at https://youtube.com/watch?v={video.id}')
  display(video)

out = widgets.Tab([out1])
out.set_title(0, 'Youtube')

display(out)


## Coding Exercise 1: What is the relationship between Horsepower and Fuel Efficiency?

Use the `regress()` function, defined above, to estimate a linear model of the relationship between 'Horsepower' and 'Highway mpg'. Then perform the same regression on the subset of `df` for which 'Horsepower' is greater than 300. 

In [None]:
###########################################################################
## TODO for students: Estimate a linear model of Horsepower vs. Highway mpg
## for both the whole dataset and the subset with Horsepower > 300.
raise NotImplementedError('student exercise: Regression on whole dataset and subset')
###########################################################################

# Regression on full dataset
results = regress(..., ..., ...)
print('Slope of the linear model estimated on the whole dataset:')
print(results['coef'])

print('\n')

# Regression on subset
# Hint: an expression like df[df['Horsepower'] > 100] selects only cars with more than 100 horsepower
results_subset = regress(..., ..., ...)
print('Slope of the linear model estimated on only observations for which Horsepower > 300:')
print(results_subset['coef'])

In [None]:
# to_remove Solution

# Regression on full dataset
results = regress(df, 'Horsepower', 'Highway mpg')
print('Slope of the linear model estimated on the whole dataset:')
print(results['coef'])

print('\n')

# Regression on subset
results_subset = regress(df[df['Horsepower'] > 300], 'Horsepower', 'Highway mpg')
print('Slope of the linear model estimated on only observations for which Horsepower > 300:')
print(results_subset['coef'])

## Discussion Question 1

What do the results of our two regressions tell us about Horsepower and Fuel Efficiency?

- Is there a different relationship between 'Horsepower' and 'Highway mpg' for high-horsepower vehicles than for others?
- Why or why not?

In [None]:
#to remove (only for TA)
# The effect for high-HP cars is entirely driven by the one outlier we saw in the previous tutorial. Let the students figure that out themselves!

# Section 2: Identifying outliers
So how do we identify outliers? The first step should almost always be to visualize the data.

In [None]:
# @title Video 7: Let us visualize outliers
from ipywidgets import widgets
from IPython.display import display, IFrame, YouTubeVideo

out1 = widgets.Output()
with out1:
  video = YouTubeVideo(id="omVZoJ-62u8", width=854, height=480, fs=1, rel=0)
  print(f'Video available at https://youtube.com/watch?v={video.id}')
  display(video)

out = widgets.Tab([out1])
out.set_title(0, 'Youtube')

display(out)


## Section 2.1: Visualization is your friend

We already noticed in Tutorial 1 that the Cars dataset contains outliers. Let's revisit our histograms and scatterplot.

In [None]:
plt_cars(df)

Another good plot for identifying outliers is the boxplot. DataFrames have their own `boxplot()` method, which draws one box per numeric variable on a shared set of axes.

In [None]:
df.boxplot()

We can see, once again, that there is a clear outlier in 'Highway mpg'. It also looks like some 'Horsepower' data points lie beyond the 'whiskers', which by default extend to the most extreme data points within 1.5 times the interquartile range (IQR) of the box edges (the first and third quartiles).
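The 1.5 × IQR rule behind the whiskers can also be applied numerically. The sketch below is a minimal illustration on a made-up Series `s` (hypothetical values, not the Cars data): it computes the quartiles, builds the lower and upper "fences", and lists the values that fall outside them.

```python
import pandas as pd

# Hypothetical toy values; 60 sits far from the rest
s = pd.Series([20, 22, 23, 24, 25, 26, 27, 28, 30, 60])

# First and third quartiles, and the interquartile range
q1 = s.quantile(0.25)
q3 = s.quantile(0.75)
iqr = q3 - q1

# Fences: 1.5 * IQR below the first quartile and above the third quartile
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are the ones a boxplot would draw as individual points
outside = s[(s < lower_fence) | (s > upper_fence)]
print(f"Fences: [{lower_fence}, {upper_fence}]")
print(outside)
```

Because this rule relies on quartiles rather than on the mean and standard deviation, the fences themselves are not pulled far by the extreme values they are meant to detect.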

## Section 2.2: Z-scores

Another way to identify outliers in your data is to calculate a Z-score for each data point. A Z-score measures the distance of a data point from the mean in units of standard deviations, calculated as follows:

\begin{align}
z = \frac{x - \overline{x}}{s}
\end{align}

$z$ is the score, $x$ the data point for which we're measuring the score, $\overline{x}$ is the sample mean, and $s$ is the sample standard deviation. A data point with a Z-score of 2 is 2 standard deviations above the mean.
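As a small worked example of the formula, the sketch below computes Z-scores by hand for a made-up array `x` (hypothetical values). Note that `np.std` defaults to the population standard deviation (`ddof=0`), so `ddof=1` is passed explicitly to get the sample standard deviation $s$, which is also what pandas' `std()` uses by default.

```python
import numpy as np

# Hypothetical toy sample with mean 5
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

x_bar = x.mean()       # sample mean
s = x.std(ddof=1)      # sample standard deviation (divides by n - 1)

# Z-score of every data point: distance from the mean in units of s
z = (x - x_bar) / s

for value, score in zip(x, z):
    print(f"x = {value:4.1f}   z = {score:+.2f}")
```

By construction, the Z-scores of a sample have mean 0 and sample standard deviation 1, which is what makes scores comparable across variables measured in different units.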

How can we calculate Z-scores for all the data in our dataset? Pandas provides methods for calculating both mean and standard deviation by variable.

In [None]:
print(df.mean())
print('\n')
print(df.std())

Arithmetic between a DataFrame and these results is aligned on column labels, so we can subtract the mean of each column from each value in that column,

In [None]:
df - df.mean()

...as well as divide each value by its column's standard deviation.

In [None]:
df / df.std()

In [None]:
# @title Video 8: Identify outliers
from ipywidgets import widgets
from IPython.display import display, IFrame, YouTubeVideo

out1 = widgets.Output()
with out1:
  video = YouTubeVideo(id="xIv49TK7OrU", width=854, height=480, fs=1, rel=0)
  print(f'Video available at https://youtube.com/watch?v={video.id}')
  display(video)

out = widgets.Tab([out1])
out.set_title(0, 'Youtube')

display(out)


### Coding Exercise 2.2: Identifying Outliers by Z-scores

Fill in the function below so that it calculates the Z-score for each entry in the input DataFrame and returns a boolean DataFrame indicating whether each value is an outlier.

Note: It is common to use 3 as the threshold because, in normally distributed data, about 99.7% of all observations fall within 3 standard deviations of the mean.
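The 99.7% figure can be checked empirically. The following sketch draws simulated standard-normal samples (the seed and sample size are arbitrary choices, not part of the tutorial) and measures the fraction whose Z-score lies within 3 in absolute value.

```python
import numpy as np

# Simulated data: 100,000 draws from a standard normal distribution
rng = np.random.default_rng(seed=0)
samples = rng.normal(loc=0.0, scale=1.0, size=100_000)

# Z-scores relative to the sample's own mean and standard deviation
z = (samples - samples.mean()) / samples.std(ddof=1)

# Fraction of observations within 3 standard deviations of the mean
frac_within_3 = np.mean(np.abs(z) <= 3)
print(f"Fraction within 3 standard deviations: {frac_within_3:.4f}")
```

This holds for normally distributed data only. For skewed or heavy-tailed variables, a threshold of 3 may flag a quite different share of the observations.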

In [None]:
###########################################################################
## TODO for students: Fill in detect_outliers().
raise NotImplementedError('student exercise: detect outliers function')
###########################################################################

def detect_outliers(df_in, thresh):
  df = df_in.copy()
  df_z = (...) / ...
  df_out = ... > thresh # Hint: use np.abs() to get the absolute values of each entry
  return df_out

df_outliers = detect_outliers(df, 3)
df_outliers.head()


In [None]:
# to_remove Solution
def detect_outliers(df_in, thresh):
  df = df_in.copy()
  df_z = (df - df.mean()) / df.std()
  df_out = np.abs(df_z) > thresh
  return df_out

df_outliers = detect_outliers(df, 3)
df_outliers.sum()


# Section 3: Removing outliers

[Insert a discussion of when it is appropriate to remove outliers, explain np logic operations]
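The Boolean logic needed to drop rows with a mask can be sketched on a toy example. The DataFrame `toy` and the mask `is_outlier` below are hypothetical and written out by hand; in practice the mask would come from an outlier-detection step. The sketch uses `np.logical_not` for element-wise negation and the DataFrame methods `all(axis=1)` and `any(axis=1)` to combine values across each row.

```python
import numpy as np
import pandas as pd

# Hypothetical toy DataFrame and a matching Boolean "is outlier" mask
toy = pd.DataFrame({'a': [1, 2, 300, 4], 'b': [10, -500, 30, 40]})
is_outlier = pd.DataFrame({'a': [False, False, True, False],
                           'b': [False, True, False, False]})

# Element-wise NOT: True where a value is not an outlier (same as ~is_outlier)
not_outlier = np.logical_not(is_outlier)

# Row-wise AND: True only for rows in which every column is clean
keep_row = not_outlier.all(axis=1)

# Row-wise OR gives the complementary view: rows with at least one outlier
drop_row = is_outlier.any(axis=1)

# Indexing with a Boolean Series keeps only the rows where it is True
cleaned = toy[keep_row]
print(cleaned)
```

Here rows 1 and 2 each contain one flagged value, so only rows 0 and 3 survive. Dropping the whole row is a design choice: it keeps the remaining variables aligned, at the cost of discarding clean values that share a row with an outlier.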

In [None]:
# insert video

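## Coding Exercise 3: Detecting and Removing Outliers

Fill in the function below so that it uses `detect_outliers()` to drop every row of the input DataFrame that contains at least one outlier.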

In [None]:
###########################################################################
## TODO for students: Fill in remove_outliers().
raise NotImplementedError('student exercise: remove outliers function')
###########################################################################

def remove_outliers(df_in, thresh):
  df = df_in.copy()
  df_out = ...
  return df_out

df_rem = remove_outliers(df, 3)

plt_cars(df_rem)

In [None]:
# to_remove Solution

def remove_outliers(df_in, thresh):
  df = df_in.copy()
  df_out = df[np.logical_not(detect_outliers(df, thresh)).all(axis=1)]
  return df_out

df_rem = remove_outliers(df, 3)

plt_cars(df_rem)



# Section 4: Discussion

- How can outliers affect the results of an analysis?
- Will every method of outlier detection always give the same results?
- Should you always remove outliers?