In [53]:
# Reference: https://jupyterbook.org/interactive/hiding.html
# Use {hide, remove}-{input, output, cell} tags to hide content

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
from IPython.display import display

sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option('display.max_rows', 7)
pd.set_option('display.max_columns', 8)
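# Use the full option name; the bare 'precision' shorthand is ambiguous
# in recent pandas versions and raises an OptionError
pd.set_option('display.precision', 2)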
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)

def display_df(df, rows=pd.options.display.max_rows,
               cols=pd.options.display.max_columns):
    with pd.option_context('display.max_rows', rows,
                           'display.max_columns', cols):
        display(df)

In [60]:
dogs = pd.read_csv('dogs.csv')

(ch:sql)=
# Working With Relations Using SQL

:::{note}

This chapter repeats the data analyses in the {ref}`ch:pandas` chapter using
relations and SQL instead of dataframes and Python. The datasets, data
manipulations, and conclusions are nearly identical across the two chapters so
that it's easier for the reader to see how the same data manipulations are
performed in both `pandas` and SQL.

If you've already read the dataframe chapter, you can focus on this section,
where we introduce the relation, and on the specific SQL code examples in the
sections that follow.

:::

Data scientists work with data stored in tables. This chapter introduces
*relations*, one of the most widely used ways to represent data tables. We'll
also introduce SQL, the standard programming language for working with
relations. Here's an example of a relation that holds information about
popular dog breeds:

In [61]:
# Jupyter doesn't have a built-in way to display relations, so we jiggle the
# dataframe output a bit to make it look like a relation
from IPython.display import display, HTML
display(HTML(dogs.to_html(index=False)))

| breed | grooming | food_cost | kids | size |
| --- | --- | --- | --- | --- |
| Labrador Retriever | weekly | 466.0 | high | medium |
| German Shepherd | weekly | 466.0 | medium | large |
| Beagle | daily | 324.0 | high | small |
| Golden Retriever | weekly | 466.0 | high | medium |
| Yorkshire Terrier | daily | 324.0 | low | small |
| Bulldog | weekly | 466.0 | medium | medium |
| Boxer | weekly | 466.0 | high | medium |


In a relation, each row represents a single record (in this case, a single
dog breed). Each column represents a feature of the record. For example, the
`grooming` column represents how often each dog breed needs to be groomed.

Relations have labels for columns. For instance, this relation has a column
labeled `grooming`. Within a column, data have the same type. For instance, the
`food_cost` column contains numbers, and the `size` column contains categories.
But data types can be different within a row.

Because of these properties, relations support a range of useful operations,
such as filtering, sorting, grouping, and joining.
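
To make the column properties concrete, here is a minimal sketch that stores a few of the rows shown above in a relation and then lists each column's label and declared type. It assumes an in-memory SQLite database accessed through Python's built-in `sqlite3` module; the database choice and the specific type names (`TEXT`, `REAL`) are our assumptions for illustration.

```python
import sqlite3

# A few rows copied from the dog breeds relation shown above.
rows = [
    ("Labrador Retriever", "weekly", 466.0, "high", "medium"),
    ("German Shepherd", "weekly", 466.0, "medium", "large"),
    ("Beagle", "daily", 324.0, "high", "small"),
]

# Assumption: an in-memory SQLite database, used only for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE dogs (
        breed TEXT,
        grooming TEXT,
        food_cost REAL,
        kids TEXT,
        size TEXT
    )
    """
)
conn.executemany("INSERT INTO dogs VALUES (?, ?, ?, ?, ?)", rows)

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, dflt_value, pk).
schema = [
    (name, col_type)
    for _, name, col_type, *_ in conn.execute("PRAGMA table_info(dogs)")
]
for name, col_type in schema:
    print(f"{name}: {col_type}")
```

Each column has one label and one declared type, while a single row mixes text and numeric values. Note that SQLite treats declared types as preferences rather than strict rules, so other database systems may enforce column types more rigidly than this sketch suggests.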

:::{note}

As a data scientist, you'll often find yourself working with people from
different backgrounds who use different terms. For instance, computer
scientists say that the columns of a relation represent *features* of the
data, while statisticians call them *variables* instead.

Other times, people will use the same term to refer to slightly different
things. In a programming sense, a *data type* refers to how a computer stores
data internally. For instance, the `size` column has a string data type in
Python. But from a statistical point of view, the `size` column stores ordered
categorical data (ordinal data). We talk more about this specific distinction
in the {ref}`ch:eda` chapter.

:::
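
The difference between the two meanings of "data type" can be seen in a small sketch. Stored as plain Python strings, the `size` values sort alphabetically; treating them as ordinal data means supplying the intended order ourselves. The `SIZE_ORDER` list is a hypothetical helper introduced only for this illustration.

```python
# Sizes as they appear in the `size` column of the relation above.
sizes = ["medium", "large", "small", "medium", "small", "medium", "medium"]

# As strings, the distinct values sort alphabetically, which ignores
# what the categories mean.
alphabetical = sorted(set(sizes))

# As ordinal data, the categories have a meaningful order that we must
# state explicitly. SIZE_ORDER is a hypothetical helper for this sketch.
SIZE_ORDER = ["small", "medium", "large"]
ordinal = sorted(set(sizes), key=SIZE_ORDER.index)

print(alphabetical)  # ['large', 'medium', 'small']
print(ordinal)       # ['small', 'medium', 'large']
```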

In this chapter, we'll show you how to do common relation operations using SQL.
First, we'll explain the structure of SQL queries. Then, we'll show how to use
SQL to perform common data manipulation tasks, like slicing, filtering,
sorting, grouping, and joining.
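
As a preview, the sketch below runs two SQL queries against the dog breeds relation: one that filters and sorts, and one that groups. It assumes an in-memory SQLite database accessed through Python's built-in `sqlite3` module, which is our choice for illustration; the sections that follow explain the query structure in detail.

```python
import sqlite3

# The rows of the dog breeds relation shown earlier in this section.
rows = [
    ("Labrador Retriever", "weekly", 466.0, "high", "medium"),
    ("German Shepherd", "weekly", 466.0, "medium", "large"),
    ("Beagle", "daily", 324.0, "high", "small"),
    ("Golden Retriever", "weekly", 466.0, "high", "medium"),
    ("Yorkshire Terrier", "daily", 324.0, "low", "small"),
    ("Bulldog", "weekly", 466.0, "medium", "medium"),
    ("Boxer", "weekly", 466.0, "high", "medium"),
]

# Assumption: an in-memory SQLite database, used only for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dogs "
    "(breed TEXT, grooming TEXT, food_cost REAL, kids TEXT, size TEXT)"
)
conn.executemany("INSERT INTO dogs VALUES (?, ?, ?, ?, ?)", rows)

# Filtering and sorting: breeds rated 'high' for kids, in alphabetical order.
kid_friendly = conn.execute(
    """
    SELECT breed, food_cost
    FROM dogs
    WHERE kids = 'high'
    ORDER BY breed
    """
).fetchall()

# Grouping: the number of breeds of each size.
size_counts = conn.execute(
    """
    SELECT size, COUNT(*)
    FROM dogs
    GROUP BY size
    ORDER BY size
    """
).fetchall()

print(kid_friendly)
print(size_counts)
```

The first query returns the four kid-friendly breeds with their food costs, and the second returns one row per size with a count of breeds.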