# 7.4 - Exploring Two Columns

Bar charts and histograms are useful for one column at a time. But what if you want to compare two columns?
In this lesson, you’ll use scatter plots and crosstab charts to look at relationships between two columns of data.

## Preparation

Run the code cell below to load the libraries we need.

In [None]:
import pandas
import numpy

## Scatter Plots

A scatter plot shows two columns on one graph using dots.
Each dot represents one row of data:
- one column is placed on the x-axis
- the other column is placed on the y-axis

Scatter plots help us see whether two variables seem related.

Let’s look at a scatter plot from the `US States` dataset. Run the next code cell.

In [None]:
# Load the US States data from the CSV file
stateInfo = pandas.read_csv("US States.csv")

# make a scatter plot of Area (y-axis) against Admission Number (x-axis) for the first 48 states
first48 = stateInfo[stateInfo["Admission Number"] < 49]
first48.plot(kind="scatter", x="Admission Number", y="Area");


This scatter plot shows one dot for each of the first 48 states.
Dots are placed along the x-axis by `Admission Number` (the order states joined the union).
The y-axis shows state `Area` in square miles.

Do you see any pattern? How can you tell?

### Trend Lines

A trend line helps summarize the overall direction of the data.
- Upward trend line: as x increases, y tends to increase.
- Downward trend line: as x increases, y tends to decrease.

We’ll use `numpy` to calculate and draw a trend line. Run the next code cell.

In [None]:
# find the line function that best fits the data
coefficients = numpy.polyfit(first48["Admission Number"], first48["Area"], deg=1)
lineFunc = numpy.poly1d(coefficients)

# plot the line function on top of the scatter plot
ax = first48.plot(kind="scatter", x="Admission Number", y="Area")
ax.plot(first48["Admission Number"], lineFunc(first48["Admission Number"]), color="red");


To calculate the trend line, we used `numpy.polyfit()` to find the line of best fit. Then we used `numpy.poly1d()` to turn the resulting coefficients into a function we can graph.

`numpy.polyfit(..., deg=1)` returns two coefficients for a line:
- the first coefficient is the slope (how much y changes when x increases by 1)
- the second coefficient is the y-intercept (the y value when x = 0)

So the trend line follows the form `y = mx + b`, where:
- `m` is the slope
- `b` is the y-intercept
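
To see the slope and y-intercept as plain numbers, here is a minimal sketch using made-up points (not the `US States` data) that lie exactly on the line `y = 2x + 1`. The names `x`, `y`, `slope`, and `intercept` are chosen just for this illustration.

```python
import numpy

# Made-up data for illustration: these points lie exactly on y = 2x + 1
x = numpy.array([0, 1, 2, 3, 4])
y = numpy.array([1, 3, 5, 7, 9])

# deg=1 asks for a straight line; the result holds the slope, then the y-intercept
slope, intercept = numpy.polyfit(x, y, deg=1)

# poly1d turns the two coefficients into a function we can call
lineFunc = numpy.poly1d([slope, intercept])

print("slope:", round(slope, 2))
print("y-intercept:", round(intercept, 2))
print("predicted y when x = 10:", round(lineFunc(10), 2))
```

The slope should come out very close to 2 and the y-intercept very close to 1. Because the slope is positive, this trend line goes upward.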

Trend lines are helpful, but they do not tell the full story.
If the data is not close to linear, a trend line can be misleading.
In this example, the pattern is mostly linear, so the line is a reasonable summary.
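
As a rough illustration of how a trend line can mislead, the sketch below uses made-up points that follow a curve (`y` is `x` squared). The two columns are clearly related, but the best-fit straight line comes out almost flat.

```python
import numpy

# Made-up data for illustration: y depends entirely on x, but along a curve
x = numpy.array([-3, -2, -1, 0, 1, 2, 3])
y = x ** 2

# Fit a straight line to the curved data
slope, intercept = numpy.polyfit(x, y, deg=1)

# The slope is very close to 0, so the line alone suggests "no relationship"
print("slope:", round(slope, 2))
print("y-intercept:", round(intercept, 2))
```

This is one reason to look at the scatter plot itself before trusting the line.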

Scatter plots are useful for:
- spotting possible relationships
- finding outliers (points far from the pattern)
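
One possible way to find an outlier with code, rather than by eye, is to measure how far each point sits from the trend line. The sketch below uses made-up points and is only one simple approach. The names `distances` and `farthest` are chosen for this illustration.

```python
import numpy

# Made-up data for illustration: the points follow y = 2x, except at x = 5
x = numpy.array([1, 2, 3, 4, 5, 6, 7, 8])
y = numpy.array([2, 4, 6, 8, 30, 12, 14, 16])

# Fit a trend line, just like in the lesson
lineFunc = numpy.poly1d(numpy.polyfit(x, y, deg=1))

# How far each point is above (+) or below (-) the trend line
distances = y - lineFunc(x)

# Position of the point that sits farthest from the line
farthest = numpy.argmax(numpy.abs(distances))
print("farthest point: x =", x[farthest], "y =", y[farthest])
```

Here the point at `x = 5` is reported, which matches the one value that was deliberately placed off the pattern.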

If points look random, there may be little or no relationship.
If points form a pattern, there is likely some relationship.

Scatter plots can be less helpful when many points overlap.
When that happens, a crosstab chart can make patterns easier to see.

### Your Turn

In the code cell below, make a scatter plot comparing:
- `Median Household Income`
- `Percent of Adult College Graduates`

Then add a trend line.
What does the trend line suggest about the relationship between these two variables?

In [None]:
# use the stateInfo data


Look through the `US States` dataset and choose two other columns to compare.
What pattern do you think you might see?

Create a scatter plot and trend line for your two columns.
What do you notice about the relationship?

## Crosstab Charts

A crosstab chart is another way to compare two columns.
It is a table that counts how often combinations of values appear together.

- Rows represent values from one column.
- Columns represent values from another column.
- Each cell shows the count for that pair.
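
Here is a small sketch of that idea using a made-up table of six pets (the `pets` data and its column names are invented for this illustration). `pandas.crosstab()` counts how many rows fall into each pair of values.

```python
import pandas

# Made-up data for illustration: six pets, each with a type and a size
pets = pandas.DataFrame({
    "Type": ["dog", "dog", "cat", "cat", "dog", "cat"],
    "Size": ["large", "small", "small", "small", "large", "large"],
})

# index= picks the column used for the rows of the table
# columns= picks the column used for the columns of the table
petCounts = pandas.crosstab(index=pets["Type"], columns=pets["Size"])
print(petCounts)
```

In the result, the cell in the `dog` row and `large` column is 2, because two rows in `pets` are large dogs.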

Let’s examine a crosstab using the `dogs` dataset.

When creating a crosstab:
- use `index=` for row categories
- use `columns=` for column categories

In this example:
- rows are `Breed Group`
- columns are `Maximum Life Span`
- each cell counts how many breeds match that pair

Run the code cell below.

In [None]:
dogInfo = pandas.read_csv("dogs.csv")

# make the "Maximum Life Span" column an integer to make it easier to work with
dogInfo["Maximum Life Span"] = dogInfo["Maximum Life Span"].round(0)
dogInfo["Maximum Life Span"] = dogInfo["Maximum Life Span"].astype(int)

# generate a crosstab that counts breeds by breed group (rows) and maximum life span (columns)
crosstab = pandas.crosstab(index=dogInfo["Breed Group"], columns=dogInfo["Maximum Life Span"])
crosstab.style.background_gradient(cmap="bone", axis=None)


This crosstab shows how many breeds match each combination of `Breed Group` and `Maximum Life Span`.
For example, there are 2 breeds in the `Herding` group with a maximum life span of 15 years.

Try answering these questions:
* What is the most common maximum life span for `Working` breeds? What about `Toy` breeds?
* How many `Herding` breeds have a maximum life span of 12 years? What about 15 years?
* Which breed groups seem likely to live the longest?
* How confident are you in your answers?
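
If you would like to check answers like these with code instead of reading the table by eye, the sketch below shows one way. It uses a tiny made-up table (not the real `dogs` data), and the names `toyDogs` and `toyTable` are invented for this illustration.

```python
import pandas

# Made-up data for illustration (not the real dogs dataset)
toyDogs = pandas.DataFrame({
    "Group": ["Toy", "Toy", "Toy", "Working", "Working", "Working"],
    "Life Span": [15, 15, 14, 10, 10, 12],
})
toyTable = pandas.crosstab(index=toyDogs["Group"], columns=toyDogs["Life Span"])

# Look up one cell: give the row label first, then the column label
print(toyTable.loc["Toy", 15])

# For each row, find the column label with the largest count
print(toyTable.idxmax(axis=1))
```

`.loc[row, column]` reads a single count, and `.idxmax(axis=1)` reports the most common column value for each row. If two columns tie, `idxmax` reports only the first one, so it is still worth looking at the full table.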

Crosstab charts are useful for:
* finding the most or least common combinations
* spotting patterns across two columns
* comparing columns when one or both columns are categorical (text)

Crosstabs can be less useful when one or both columns have many unique numeric values.
In those cases, scatter plots and trend lines are often better.
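
To see why, consider this sketch with made-up data in which every weight is different (the `animals` table is invented for this illustration). The crosstab gets one column per unique number, and no cell counts higher than 1.

```python
import pandas

# Made-up data for illustration: every weight is a different number
animals = pandas.DataFrame({
    "Type": ["dog", "dog", "dog", "cat", "cat", "cat"],
    "Weight": [61.2, 48.7, 22.5, 9.8, 11.3, 8.1],
})

wideTable = pandas.crosstab(index=animals["Type"], columns=animals["Weight"])

# The table has one column for each unique weight
print(wideTable.shape)

# The largest count anywhere in the table
print(wideTable.values.max())
```

With only six rows the table is already six columns wide, and it would keep growing with more data without showing any pattern.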

You can also apply a color map to make patterns easier to see.
Use the `cmap` argument in `background_gradient()` to choose colors.
See the matplotlib documentation for the available [color map options](https://matplotlib.org/stable/tutorials/colors/colormaps.html#sequential).

### Your Turn

Download the `Words` dataset as a CSV from [code.org](https://learn.mycode.run/link/xzcnh).
Follow your instructor’s directions to export the dataset, then copy the CSV into the same folder as this notebook.

Use the `Words` dataset to create a crosstab showing combinations of `Length` and `Part of Speech`.
Copy the chart into your activity guide and answer the questions.

In [None]:
# load the words dataset


# create a crosstab of the length of the words vs the part of speech




Finally, use your cleaned class data tracker dataset to create a crosstab chart for `Grade` and `Favorite Subject`.
(You will need to repeat the automated cleaning from the last notebook.)
Copy this chart into your activity guide and use it to answer the questions.

In [None]:
# load the class data tracker dataset


# clean up the data


# create a crosstab of the grade vs the favorite subject

