# Data 6: Intro to Visualizations
* Line plot
* Scatter plot
* Bar chart

Source: [Data 8 Fall 2025 Lecture 07](https://github.com/data-8/materials-fa25/blob/main/lec/lec07/lec07.ipynb)

In [None]:
from datascience import *
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

### Attributes have types

In [None]:
full = Table.read_table('data/nc-est2019-agesex-res.csv')

In [None]:
# Keep only the columns we care about
partial = full.select('SEX', 'AGE', 'POPESTIMATE2019')

In [None]:
# Make things easier to read
us_pop_2019 = partial.relabeled(2, '2019')
us_pop_2019.show(5)

**STOP**

### Some plots for numerical attributes

**Task**: Plot the total population (in millions) at each age. From [_Inferential Thinking_](https://inferentialthinking.com/chapters/06/4/Example_Sex_Ratios.html):

> The SEX column contains numeric codes: 1 for male, 2 for female, and 0 for the total.

> The AGE column contains ages in completed years. The special value 999 represents the entire population regardless of age, and 100 represents “100 or more”. 

In [None]:
# NOTE: we will learn about filtering with `are` predicates soon...
# Keep rows for individual ages (AGE below 999) and for the total across sexes (SEX is 0)
total_below_999 = us_pop_2019.where('AGE', are.below(999)).where('SEX', 0)
total_below_999 = total_below_999.with_columns('Population (in millions)', total_below_999.column('2019')/1e6) 
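
As a rough sketch of what the two `where` calls and the division by `1e6` do, here is the same idea with plain NumPy arrays. The arrays `age`, `sex`, and `pop` are made-up toy values for illustration, not the census data, and this is not how the `datascience` library is implemented.

```python
import numpy as np

# Toy stand-ins for the AGE, SEX, and 2019 columns (made-up numbers)
age = np.array([0, 0, 1, 1, 999, 999])
sex = np.array([0, 1, 0, 1, 0, 1])
pop = np.array([3_800_000, 1_900_000, 3_900_000, 2_000_000,
                328_000_000, 161_000_000])

# Keep rows where AGE is below 999 and SEX equals 0 (the total)
keep = (age < 999) & (sex == 0)

ages_kept = age[keep]
pop_millions = pop[keep] / 1e6   # convert people to millions of people

print(ages_kept)
print(pop_millions)
```

Only the two rows with `SEX` equal to 0 and a real age survive the filter; the 999 "all ages" rows are dropped.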

In [None]:
total_below_999.scatter('AGE', 'Population (in millions)')

In [None]:
total_below_999.plot('AGE', 'Population (in millions)')

_____

**Discussion [1 min]:** What do you notice in the visualization below? I see at least two things!

In [None]:
(full.select('AGE', 'SEX', 'POPESTIMATE2019', 'POPESTIMATE2011')
     .where('AGE', are.below(999))
     .where('SEX', 0)
     .drop('SEX')
     .plot('AGE'))


______

## Actors

In [None]:
# Actors and their highest grossing movies
actors = Table.read_table('data/actors.csv')
actors

---

### Line plot? Or scatter plot?

**Task:** Plot the relationship between the number of movies each famous Hollywood actor has appeared in and how much those movies grossed (in total, then on average per movie).

In [None]:
actors.plot('Number of Movies', 'Total Gross')

In [None]:
actors.scatter('Number of Movies', 'Total Gross')

#### Remember: Investigate anomalies!!

In [None]:
actors.scatter('Number of Movies', 'Average per Movie')

In [None]:
# NOTE: we will learn about filtering with `are` predicates soon...
actors.where('Average per Movie', are.above(400))

**STOP**


## Categorical data: Bar Charts

### Example 1: Counts of categorical variable

In [None]:
# Highest grossing movies as of 2017
top_movies = Table.read_table('data/top_movies_2017.csv')
top_movies

#### **Task**: Visualize the number of movies released by each studio.

In [None]:
# NOTE: we will learn about `group` in lab.
# For now, assume it counts the rows (movies) for each distinct value in the named column.
studio_counts = top_movies.group('Studio')
studio_counts
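
To make the counting concrete, here is a minimal sketch of the same idea using Python's standard `collections.Counter`. The `studios` list is hypothetical toy data, and this only illustrates the result `group` produces, not how the library computes it.

```python
from collections import Counter

# Hypothetical studio labels, one per movie (not the real data)
studios = ['Fox', 'MGM', 'Fox', 'Disney', 'Fox', 'MGM']

# Count how many times each distinct label appears
counts = Counter(studios)

# Print each label with its count, in alphabetical order
for studio in sorted(counts):
    print(studio, counts[studio])
```

Each distinct studio appears once in the result, paired with the number of movies that carry that label.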

In [None]:
studio_counts.barh('Studio')

**Discussion [30 sec]:** How might this plot be improved to show a clearer message about the data? 

In [None]:
# your code here...

---

### Example 2: Categorical x Numerical

#### **Task**: Visualize the adjusted gross, in millions of US dollars, of the 10 highest-grossing movies.

In [None]:
# NOTE: we will cover np.arange next time.
# For now, the code below takes rows 0 through 9 (the first 10 rows) of the top_movies table.

top10_adjusted = top_movies.take(np.arange(10))
top10_adjusted
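
A small sketch of what `np.arange` contributes here: it produces the row positions to take. The `titles` array is a hypothetical stand-in for a table column. Note that taking the first rows only yields a "top 10" if the table is already sorted from highest to lowest adjusted gross, which this notebook appears to assume.

```python
import numpy as np

# Hypothetical titles, assumed already sorted from highest to lowest gross
titles = np.array(['A', 'B', 'C', 'D', 'E'])

first_three = np.arange(3)        # the positions 0, 1, 2
top_three = titles[first_three]   # the items at those positions

print(first_three)
print(top_three)
```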

Convert to millions of dollars for readability:

In [None]:
millions = np.round(top10_adjusted.column('Gross (Adjusted)') / 1000000, 3)
top10_adjusted = top10_adjusted.with_column('Millions', millions)
top10_adjusted
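
To see the conversion on its own, here is the same arithmetic applied to a few made-up dollar amounts (the `gross` array is illustrative, not taken from the table).

```python
import numpy as np

# Made-up dollar amounts, for illustration only
gross = np.array([1_234_567_890, 987_654_321, 250_000_000])

# Divide by one million, then round to 3 decimal places
millions = np.round(gross / 1000000, 3)

print(millions)
```

Rounding to three decimals keeps the values precise to the nearest thousand dollars while making the axis labels much shorter.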

In [None]:
top10_adjusted.barh('Title', 'Millions')

**STOP**

### Good visualization practices

#### Titling plots
- Just add a `print()` statement after your plotting code.

##### Bad title:

In [None]:
total_below_999.plot('AGE', 'Population (in millions)')
print('US Population in 2019 by age')  

##### Good title:

In [None]:
total_below_999.plot('AGE', 'Population (in millions)')
print('Population counts in 2019 show a steep decline around age 60')  

**GET TO HERE**

---

## Review

In [None]:
united_flights = Table.read_table('data/united.csv')

In [None]:
united_flights

**Review Question:** How long was flight 278 to `'SEA'` delayed? 

## **Challenge Task**:
- Generate the chart shown in the slides: a bar chart of age (number of years since release) for the 10 highest-grossing movies (by non-adjusted gross).