<a href="https://colab.research.google.com/github/KordingLab/ENGR344/blob/master/tutorials/W3D1_What_should_we_do_when_data_has_problems/W3D1_Tutorial1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>




# Tutorial 1: How is Data Usually Stored?
**Week 3: What should we do when data has problems?**

**Content creators**: Rob Lindgren

**Content reviewers**: Konrad Kording, Keervani Kandala

**Content modifiers**: ---

**Modified Content reviewer**: ---


___
# Tutorial Objectives

*Estimated timing of tutorial: roughly 30 minutes per segment.*

This is tutorial 1 in a 3-part series on how to handle data that has problems. In this tutorial, we will introduce the Pandas library and its core data structure, the DataFrame. By the end of this tutorial, you will be able to:

- Explain when data is best stored in a DataFrame as opposed to a NumPy array
- Create and subset DataFrames
- Examine DataFrames using the head() and tail() methods and the shape attribute
- Create histograms and scatterplots from data stored in DataFrames

Before we start: you already know NumPy. Have you heard of pandas?

In [3]:
# @title Live portion slides
 
# @markdown These are the slides for the videos in all tutorials today
from IPython.display import IFrame

# IFrame(src=f"https://docs.google.com/presentation/d/1GaEnokeqNLk1goV-xAT2wxmBMwqhqljc/edit?usp=sharing&ouid=102615264973404864923&rtpof=true&sd=true", width=854, height=480)

IFrame(src=f"https://mfr.ca-1.osf.io/render?url=https://osf.io/hncv7/?direct%26mode=render%26action=download%26mode=render", width=854, height=480)

---
# Setup

Python requires you to explicitly "import" libraries before their functions are available to use. We will always specify our imports at the beginning of each notebook or script.

In [None]:
# Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import random

In [None]:
# @title Video 1: Real world data has problems
from ipywidgets import widgets
from IPython.display import display, IFrame, YouTubeVideo


out1 = widgets.Output()
with out1:
  video = YouTubeVideo(id="h4LFIXVH00k", width=854, height=480, fs=1, rel=0)
  print(f'Video available at https://youtube.com/watch?v={video.id}')
  display(video)

out = widgets.Tab([out1])
out.set_title(0, 'Youtube')

display(out)


# Section 0: The nature of data in science
Let us first think about the nature of data in science. Let's say we run a questionnaire to see if we can predict grades in CIS 344. We ask for the following:
* Minor
* Major
* Past Grades
* Age
* If they like pizza

We roll this out as a Google Form. Every question is optional. The goal is to predict the grades.

Discuss the following two questions:
* What problems may the data have?
* Is running such a study a good or a bad idea? 


---
# Section 1: Introducing Pandas

You've been working with NumPy for several weeks now and should be getting familiar with it. NumPy is a numerical computing library for Python that works with arrays of numbers (arrays which can have many dimensions). The core of NumPy is optimized to significantly speed up numerical calculations, making it essential to data science tasks where you have to repeat those calculations many times. 

Much of the data you will encounter in the world, however, is in the form of a table (like an Excel spreadsheet or a SQL table). Real-world data often has missing values and is of different types (e.g., Age is represented as integers and Name is represented as strings). Pandas is a library, built on top of NumPy, that is designed specifically for manipulating and analyzing this kind of messy tabular data. The object at the center of the Pandas library is called a *DataFrame*, a two-dimensional table with labeled rows and columns that allows for missing values and a different data type in each column.
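To make the contrast concrete, here is a minimal sketch of what happens when the same kind of mixed-type records go into a NumPy array and into a DataFrame. The names and ages are made up for illustration and are not part of the tutorial data.

```python
import numpy as np
import pandas as pd

# Hypothetical records: a name (string) and an age (number, one missing).
names = ['Ada', 'Grace', 'Alan']
ages = [36, None, 41]

# A NumPy array holds a single data type, so mixing a string and a number
# silently converts the number into a string.
mixed = np.array(['Ada', 36])
print(mixed.dtype.kind)  # 'U' stands for Unicode string

# A DataFrame keeps a separate type per column and marks the gap as NaN.
people = pd.DataFrame({'Name': names, 'Age': ages})
print(people.dtypes)
print(people['Age'].isna().sum())  # number of missing ages
```

The 'Age' column stays numeric even though one value is missing, which is exactly the situation an optional survey question tends to produce.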

In [None]:
# @title Video 2: Pandas
from ipywidgets import widgets
from IPython.display import display, IFrame, YouTubeVideo


out1 = widgets.Output()
with out1:
  video = YouTubeVideo(id="SwJQ221SUSA", width=854, height=480, fs=1, rel=0)
  print(f'Video available at https://youtube.com/watch?v={video.id}')
  display(video)

out = widgets.Tab([out1])
out.set_title(0, 'Youtube')

display(out)


## Section 1.1: Warming Up to DataFrames

One way to think of a DataFrame is like a Python dictionary where each key is a column name and each value is an array of column data. In fact, providing an appropriately structured dictionary as an input to `pd.DataFrame()` is an easy way to create a DataFrame.


In [None]:
data_dict = {'Speed' : [120, 150, 147, 128, 101],
             'Horsepower' : [200, 320, 280, 290, 110],
             'Mpg' : [15, 29, 12, 28, 40]}
df = pd.DataFrame(data_dict)
print(df)

An easy way to access the data in a single column is by using its name.

In [None]:
df['Speed']

This produces a Pandas *Series*, essentially a one-dimensional version of a DataFrame.

In [None]:
type(df['Speed'])

pandas.core.series.Series
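As a small sketch of what a Series carries around, the example below builds one directly rather than pulling it out of `df`. The variable names `speeds` and `values` are assumptions chosen for this illustration.

```python
import pandas as pd

# A standalone Series with the same kind of data as the 'Speed' column.
speeds = pd.Series([120, 150, 147], name='Speed')

print(speeds.name)           # the column label
print(list(speeds.index))    # the row labels, counted from 0 by default
values = speeds.to_numpy()   # the underlying data as a NumPy array
print(type(values))
```

The last line is a reminder that Pandas is built on top of NumPy: the data inside a numeric Series can be handed to NumPy code as a regular array.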

Note the row indices that appeared on the left side when we displayed the 'Speed' column. We can access individual elements or ranges of elements using these indices.

In [None]:
print('Individual element')
print(df['Speed'][0])
print('\nRange of elements')
print(df['Speed'][0:2])

Individual element
120

Range of elements
0    120
1    150
Name: Speed, dtype: int64


The range is a Series, like above, while the individual element is whatever data type is stored in the column (here it's a 64-bit integer).

In [None]:
type(df['Speed'][0])

numpy.int64

DataFrames allow for easy subsetting. Here we subset columns using a list of column names and subset rows using the boolean expression `df['Speed'] > 130`.

In [None]:
print('Subsetting by a list of column names')
print(df[['Speed', 'Mpg']])
print('\n')
print('Subsetting by row')
print(df[df['Speed'] > 130])

Subsetting by a list of column names
   Speed  Mpg
0    120   15
1    150   29
2    147   12
3    128   28
4    101   40


Subsetting by row
   Speed  Horsepower  Mpg
1    150         320   29
2    147         280   12
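Row subsetting works because the comparison itself produces a Series of True/False values, one per row. The sketch below makes that intermediate step visible on a toy table; the names `toy`, `mask`, `fast`, and `fast_and_efficient` are chosen for this illustration only.

```python
import pandas as pd

# Toy data mirroring the small cars table used in this section.
toy = pd.DataFrame({'Speed': [120, 150, 147, 128, 101],
                    'Mpg': [15, 29, 12, 28, 40]})

# The comparison produces one True/False value per row.
mask = toy['Speed'] > 130
print(mask.tolist())

# Indexing with the mask keeps only the rows where it is True.
fast = toy[mask]
print(fast)

# Conditions can be combined with & (and) or | (or);
# each condition needs its own parentheses.
fast_and_efficient = toy[(toy['Speed'] > 130) & (toy['Mpg'] > 20)]
print(fast_and_efficient)
```

Storing the mask in a variable first is optional, but it can make longer filters easier to read and debug.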


### Coding Exercise 1.1

*Exercise objective:*
- Print the first three rows of the entire dataset
- Print just the first three rows of the columns 'Speed' and 'Horsepower'.

In [None]:
#################################################################################
## TODO for students:
## Print subsets of DataFrame df
raise NotImplementedError("Student exercise: print subsets of df")
#################################################################################

print('Print the first three rows of df')
print(df[...])
print('\n')
print('Print the first three rows of the columns Speed and Horsepower')
print(df[...][...])

In [None]:
# to_remove Solution

print('Print the first three rows of df')
print(df[0:3])
print('\n')
print('Print the first three rows of the columns Speed and Horsepower')
print(df[['Speed', 'Horsepower']][0:3])

Print the first three rows of df
   Speed  Horsepower  Mpg
0    120         200   15
1    150         320   29
2    147         280   12


Print the first three rows of the columns Speed and Horsepower
   Speed  Horsepower
0    120         200
1    150         320
2    147         280


In [None]:
# @title Video 3: Corgis datasets (and more generally, there are great cleaned data sets out there)
from ipywidgets import widgets
from IPython.display import display, IFrame, YouTubeVideo


out1 = widgets.Output()
with out1:
  video = YouTubeVideo(id="MFxLC3PgER8", width=854, height=480, fs=1, rel=0)
  print(f'Video available at https://youtube.com/watch?v={video.id}')
  display(video)

out = widgets.Tab([out1])
out.set_title(0, 'Youtube')

display(out)


---

# Section 2: Exploring the Cars Dataset

Dataset from the CORGIS project: https://ct-vt.github.io

In this tutorial we will be working with the "Cars" dataset, which contains a number of variables we can use to understand the relationship between vehicle characteristics and fuel consumption.

## Section 2.1: df.head(), df.tail(), and df.shape

First we're going to load the dataset and view the first few rows using the `df.head()` method.

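`df.head()` returns the first five rows of a DataFrame by default; passing a number, as in `df.head(10)`, changes how many rows are shown.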

In [None]:
data_url = 'https://raw.githubusercontent.com/RealTimeWeb/datasets/master/datasets/csv/cars/cars.csv'
df_raw = pd.read_csv(data_url)
df_raw.head()

Unnamed: 0,City mpg,Classification,Driveline,Engine Type,Fuel Type,Height,Highway mpg,Horsepower,Hybrid,ID,Length,Make,Model Year,Number of Forward Gears,Torque,Transmission,Width,Year
0,18,Automatic transmission,All-wheel drive,Audi 3.2L 6 cylinder 250hp 236ft-lbs,Gasoline,140,25,250,False,2009 Audi A3 3.2,143,Audi,2009 Audi A3,6,236,6 Speed Automatic Select Shift,202,2009
1,22,Automatic transmission,Front-wheel drive,Audi 2.0L 4 cylinder 200 hp 207 ft-lbs Turbo,Gasoline,140,28,200,False,2009 Audi A3 2.0 T AT,143,Audi,2009 Audi A3,6,207,6 Speed Automatic Select Shift,202,2009
2,21,Manual transmission,Front-wheel drive,Audi 2.0L 4 cylinder 200 hp 207 ft-lbs Turbo,Gasoline,140,30,200,False,2009 Audi A3 2.0 T,143,Audi,2009 Audi A3,6,207,6 Speed Manual,202,2009
3,21,Automatic transmission,All-wheel drive,Audi 2.0L 4 cylinder 200 hp 207 ft-lbs Turbo,Gasoline,140,28,200,False,2009 Audi A3 2.0 T Quattro,143,Audi,2009 Audi A3,6,207,6 Speed Automatic Select Shift,202,2009
4,21,Automatic transmission,All-wheel drive,Audi 2.0L 4 cylinder 200 hp 207 ft-lbs Turbo,Gasoline,140,28,200,False,2009 Audi A3 2.0 T Quattro,143,Audi,2009 Audi A3,6,207,6 Speed Automatic Select Shift,202,2009


Similarly, we can easily view the end of the dataset using `df.tail()`.

In [None]:
df_raw.tail()
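Both methods accept an optional number of rows. The following sketch uses a made-up ten-row table (the name `demo` and its columns are assumptions for illustration) so the effect is easy to check by eye.

```python
import pandas as pd

# Hypothetical ten-row table, used only to illustrate head() and tail().
demo = pd.DataFrame({'x': range(10), 'y': [v ** 2 for v in range(10)]})

first_five = demo.head()    # default: first 5 rows
first_two = demo.head(2)    # first 2 rows
last_three = demo.tail(3)   # last 3 rows

print(first_two)
print(last_three)
```

Note that the row labels in `last_three` are the original ones (7, 8, 9); `tail()` does not renumber the rows.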

It's also helpful to check the shape of a DataFrame (# of rows by # of columns) to ensure that it's what you expected. Shape is an attribute of the DataFrame rather than a method, so we don't put parentheses after it.

In [None]:
df_raw.shape

The shape attribute is a tuple, so we can use indexing to retrieve the individual numbers if needed.

In [None]:
nrows = df_raw.shape[0]
ncols = df_raw.shape[1]

print('The Cars dataset has ' + str(nrows) + ' rows and ' + str(ncols) + ' columns.')
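Because the shape is a tuple, it can also be unpacked in a single step. This is a minimal sketch on a stand-in table; `small`, `n_rows`, `n_cols`, and `message` are names invented for the example.

```python
import pandas as pd

# Small stand-in table; in the tutorial the real data lives in df_raw.
small = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# Unpack the (rows, columns) tuple in one step.
n_rows, n_cols = small.shape

# An f-string avoids the manual str() conversions.
message = f'The table has {n_rows} rows and {n_cols} columns.'
print(message)
```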

In [None]:
# @title Video 4: Always visualize everything
from ipywidgets import widgets
from IPython.display import display, IFrame, YouTubeVideo


out1 = widgets.Output()
with out1:
  video = YouTubeVideo(id="0AbNrcQjhGs", width=854, height=480, fs=1, rel=0)
  print(f'Video available at https://youtube.com/watch?v={video.id}')
  display(video)

out = widgets.Tab([out1])
out.set_title(0, 'Youtube')

display(out)


### Coding Exercise 2.1: Subsetting the Cars Dataset

Now that we've gotten a feel for what our data looks like, let's extract just the variables that we need. Going forward, we will be interested in the relationship between two variables from the Cars dataset, 'Horsepower' and 'Highway mpg'. We also want to include the 'ID' variable, which gives the make and model of car. Subset `df_raw` to include only these three variables and store them in a new DataFrame called `df`. 

*Exercise objective:* Produce a new DataFrame `df` which contains only the columns 'ID', 'Horsepower', and 'Highway mpg'. Print the shape of the new DataFrame to verify that it now has only three columns.

In [None]:
###########################################################################
## TODO for students: Subset the df_raw dataset to just three variables,
## 'ID', 'Horsepower', and 'Highway mpg'. Name this subset 'df' and print
## the shape of df to confirm that you now have three variables.
raise NotImplementedError('student exercise: subset the Cars dataset')
###########################################################################

df = df_raw[...]
print(...)

In [None]:
# to_remove 
# Solution

df = df_raw[['ID', 'Horsepower', 'Highway mpg']]
print(df.shape)

## Section 2.2: Visualizing Horsepower and Fuel Efficiency

Now we're going to plot our two variables to get a feel for their distributions, as well as their relationship. To plot the distributions we will once again use `plt.hist()` from `matplotlib`.

We can supply our DataFrame to the `data=` parameter of `plt.hist()`, which allows us to specify our variables by column name. First we plot the distribution of 'Horsepower'.

In [None]:
plt.hist('Horsepower', bins=25, histtype="bar", data=df)
plt.xlabel("Horsepower")
plt.ylabel("Number of vehicles")

Next, we plot 'Highway mpg'.

In [None]:
plt.hist('Highway mpg', bins=25, histtype="bar", data=df)
plt.xlabel("Highway mpg")
plt.ylabel("Number of vehicles")

Notice that the x-axis extends far past the last bar, indicating the presence of at least one outlier. We'll see this more clearly when we look at the relationship between the two variables with a scatterplot (using `plt.scatter()`).



In [None]:
plt.scatter('Horsepower', 'Highway mpg', data=df)
plt.xlabel('Horsepower')
plt.ylabel('Highway mpg')

One data point sits far away from all the others. We will take a closer look at it, and at how to approach this kind of problem, in Section 2.3.

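Pandas provides an intuitive interface for calculating aggregate values like mean and median across a DataFrame: `df.mean()` and `df.median()`. Because `df` also contains the text column 'ID', we pass `numeric_only=True` so that only the numeric columns are aggregated; recent versions of Pandas raise an error otherwise.

In [None]:
print('Means:')
print(df.mean(numeric_only=True))
print('\n')
print('Medians:')
print(df.median(numeric_only=True))

These methods return a Series of values, with each value labeled with its corresponding variable name. Note that with `numeric_only=True` both methods skip the 'ID' variable, which has no mean or median.

In [None]:
type(df.median(numeric_only=True))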
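Since the result is a Series labeled by column name, a single statistic can be looked up by name. The sketch below uses an invented three-row table (`toy_cars` and its values are assumptions, not rows from the Cars dataset) that includes one extreme 'Highway mpg' value.

```python
import pandas as pd

# Toy table with one text column and two numeric columns.
toy_cars = pd.DataFrame({'ID': ['car a', 'car b', 'car c'],
                         'Horsepower': [100, 200, 300],
                         'Highway mpg': [40, 30, 200]})

# numeric_only=True restricts the calculation to the numeric columns.
toy_means = toy_cars.mean(numeric_only=True)
toy_medians = toy_cars.median(numeric_only=True)

# Each result is a Series indexed by column name.
print(toy_means['Highway mpg'])
print(toy_medians['Highway mpg'])
```

In this toy example the single extreme value pulls the mean well above the median, which is one quick numerical hint that an outlier may be present.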

It is often helpful to display plots side by side and to generate multiple plots within one block of code. For this we will use `plt.subplots()`, a function that returns a figure object and one or more axes objects, depending on the parameters given to `subplots()`.

By default, `plt.subplots()` returns a figure and a single axes object as a tuple.

`fig, ax = plt.subplots()`

You can treat `ax` much like you treated `plt` above, using `ax.hist()`, `ax.scatter()`, etc., though some methods have slightly different names (`ax.set_xlabel()` instead of `plt.xlabel()`).

You can also tell `plt.subplots()` to return more axes, depending on the numbers of rows and columns you assign to the figure. The following line will create a figure with two subplots, arranged next to each other on the same row.

`fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2)`
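The shape of what `plt.subplots()` returns depends on the grid you ask for. The following sketch, with titles and variable names chosen purely for illustration, shows a one-row layout and a two-by-two layout.

```python
import matplotlib.pyplot as plt

# One row, two columns: the two axes can be unpacked directly.
fig, (left, right) = plt.subplots(nrows=1, ncols=2, figsize=(8, 3))
left.set_title('left')
right.set_title('right')

# Two rows, two columns: the axes come back as a 2-D NumPy array,
# so an individual plot is selected with [row, column].
fig2, grid = plt.subplots(nrows=2, ncols=2)
print(grid.shape)
grid[0, 1].set_title('top right')

# Reduce overlap between neighbouring titles and axis labels.
fig.tight_layout()
fig2.tight_layout()
```

Calling `tight_layout()` is optional, but it often helps when axis labels of one subplot collide with a neighbouring subplot.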

Let's recreate the histograms above, but in a single figure.

In [None]:
fig, (ax1, ax2) = plt.subplots(nrows=2, ncols=1)

ax1.hist('Horsepower', bins=25, histtype="bar", data=df)
ax1.set_xlabel("Horsepower")
ax1.set_ylabel("Number of vehicles")

ax2.hist('Highway mpg', bins=25, histtype='bar', data=df)
ax2.set_xlabel("Highway mpg")
ax2.set_ylabel("Number of vehicles")

### Coding Exercise 2.2: Writing a Function to Visualize Our Data

Let's put the above plots together in one function that will allow us to easily visualize our data as we move through the tutorial.





In [None]:
def plt_cars(df):
  """Plot histograms of 'Horsepower' and 'Highway mpg' from the Cars dataset,
  as well as a scatterplot of 'Horsepower' vs. 'Highway mpg'.

  Args:
    df (DataFrame): Cars dataset, with variables 'Horsepower' and 'Highway mpg'.

  Returns:
    None
  """
  ###########################################################################
  ## TODO for students: Complete this function for plotting histograms of 
  ## 'Horsepower' and 'Highway mpg' and a scatterplot of 'Horsepower' vs.
  ## 'Highway mpg'.
  raise NotImplementedError('student exercise: write function for visualizing Cars')
  ###########################################################################

  # Compute means
  means = ...

  # Creates figure and axes objects
  fig_a, (ax1, ax2) = plt.subplots(1, 2)

  # Visualize 'Horsepower'
  ax1.hist(..., data=...)
  ax1.set_xlabel("Horsepower")
  ax1.set_ylabel("Number of vehicles")
  ax1.axvline(..., color='orange')

  # Visualize 'Highway mpg'
  ax2.hist(..., data=...)
  ax2.set_xlabel("Highway mpg")
  ax2.set_ylabel("Number of vehicles")
  ax2.axvline(..., color='orange')
  print(fig_a)

  print('\n')

  # Visualize the relationship between 'Horsepower' and 'Highway mpg'
  fig_b, ax = plt.subplots(1, 1)
  ax.scatter(x=..., y=..., data=...)
  ax.set_xlabel('Horsepower')
  ax.set_ylabel('Highway mpg')
  print(fig_b)

plt_cars(df)

In [None]:
# Solution
def plt_cars(df):
  """Plot histograms of 'Horsepower' and 'Highway mpg' from the Cars dataset,
  as well as a scatterplot of 'Horsepower' vs. 'Highway mpg'.

  Args:
    df (DataFrame): Cars dataset, with variables 'Horsepower' and 'Highway mpg'.

  Returns:
    None
  """

  # Compute means
  # numeric_only=True skips the text column 'ID'
  means = df.mean(numeric_only=True)

  # Create figure and axes objects
  fig_a, (ax1, ax2) = plt.subplots(1, 2)
  
  # Visualize 'Horsepower'
  ax1.hist('Horsepower', data=df)
  ax1.set_xlabel("Horsepower")
  ax1.set_ylabel("Number of vehicles")
  ax1.axvline(means['Horsepower'], color='Orange')

  # Visualize 'Highway mpg'
  ax2.hist('Highway mpg', data=df)
  ax2.set_xlabel("Highway mpg")
  ax2.set_ylabel("Number of vehicles")
  ax2.axvline(means['Highway mpg'], color='Orange')
  print(fig_a)

  print('\n')
  
  # Visualize the relationship between 'Horsepower' and 'Highway mpg'
  fig_b, ax = plt.subplots(1, 1)
  ax.scatter('Horsepower', 'Highway mpg', data=df)
  ax.set_xlabel('Horsepower')
  ax.set_ylabel('Highway mpg')
  print(fig_b)

plt_cars(df)

## Section 2.3: What's with that point?

By now, you've likely noticed that there is a clear outlier with a 'Highway mpg' value far greater than the other vehicles in our dataset. A good first step to take in addressing this point would be to identify it. 

In [None]:
# @title Video 5: What is this car?
from ipywidgets import widgets
from IPython.display import display, IFrame, YouTubeVideo

out1 = widgets.Output()
with out1:
  video = YouTubeVideo(id="tpa1deNQxDQ", width=854, height=480, fs=1, rel=0)
  print(f'Video available at https://youtube.com/watch?v={video.id}')
  display(video)

out = widgets.Tab([out1])
out.set_title(0, 'Youtube')

display(out)


### Coding Exercise 2.3: Identifying the outlier

We discussed in Section 1.1 how we can use variable names to subset our data by column and boolean expressions like `df['Speed'] > 130` to subset by row. Combine these two techniques to print the 'ID' of the outlier.

In [None]:
###########################################################################
## TODO for students: Print the 'ID' of the outlier. 
raise NotImplementedError('student exercise: print outlier ID')
###########################################################################

print(df[...][...])

In [None]:
# to_remove Solution
print(df['ID'][df['Highway mpg'] > 200])
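The threshold in the solution was read off the plot. An alternative that needs no threshold is to ask Pandas for the row holding the largest value. The sketch below shows this on an invented stand-in table (`toy_df`, its IDs, and its numbers are assumptions, not rows from the Cars dataset).

```python
import pandas as pd

# Toy stand-in for the Cars subset; the IDs and values are invented.
toy_df = pd.DataFrame({'ID': ['car a', 'car b', 'car c', 'car d'],
                       'Horsepower': [150, 220, 180, 300],
                       'Highway mpg': [31, 27, 223, 24]})

# idxmax() returns the row label of the largest value in the column.
outlier_label = toy_df['Highway mpg'].idxmax()

# .loc looks up that row by label, so every column can be inspected.
outlier_row = toy_df.loc[outlier_label]
print(outlier_row['ID'])
```

This only finds the single largest value, so it suits the case of one suspicious point; with several outliers, a boolean expression like the one in the solution is the more general tool.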

---

# Section 3: Discussion
- Is the outlier you found real or a coding error? How would you go about finding out?

- When is it appropriate to store your data in NumPy arrays?
- When should you instead utilize a Pandas DataFrame?

In [None]:
# to_remove solution: the outlier appears to be a coding error. The car does not actually reach 200 mpg; looking up the model online is a quick way to check.