# Data 6: Grouping

* More practice with `where` Table method
* `group`
* (if time) Boolean values, comparison operators

In [None]:
from datascience import *
import numpy as np
import warnings
warnings.filterwarnings("ignore")

%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
plt.rcParams["patch.force_edgecolor"] = True

import seaborn as sns

## (continued) Boolean predicates with `where` Table method

### SAT Data

We continue our analysis of a dataset of aggregated (average) SAT scores by state ([source 1](https://commonwealthfoundation.org/2014/12/22/sat-scores-by-state-2014/), [source 2](https://reports.collegeboard.org/sat-suite-program-results/data-archive)).

**Note**: This data is from 2014, so the total score is out of 2400 (over three sections each out of 800) instead of 1600.

In [None]:
sat = Table.read_table('data/sat2014-lecture.csv')
sat = sat.with_columns(
    'Combined',
    sat.column('Critical Reading') + \
        sat.column('Math') + \
        sat.column('Writing')
)
sat

**Task**

Filter the `sat` table to include only the states listed in the `deep_south` array. Use the [Data 6 Python Reference](https://data6.org/notes/reference).

In [None]:
deep_south = np.array(['Alabama', 'Georgia',
                       'Louisiana', 'Mississippi',
                       'South Carolina'])

...

**Task**

Find the states in the deep south with participation lower than 10% and combined score greater than or equal to 1600.

In [None]:
...
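
As a rough illustration of what the two tasks above ask for, here is a plain-Python sketch. It is not the `datascience` solution. The rows and numbers are made up for illustration (they are not values from `sat`), and `toy_rows` and `wanted` are hypothetical names.

```python
# Made-up rows for illustration only; these are NOT real SAT figures.
toy_rows = [
    {"State": "A", "Participation Rate": 5.0, "Combined": 1700},
    {"State": "B", "Participation Rate": 60.0, "Combined": 1450},
    {"State": "C", "Participation Rate": 8.0, "Combined": 1550},
    {"State": "D", "Participation Rate": 4.0, "Combined": 1650},
]
wanted = ["A", "B", "C"]  # hypothetical stand-in for deep_south

# Step 1: keep only rows whose State appears in `wanted`.
in_region = [row for row in toy_rows if row["State"] in wanted]

# Step 2: of those, keep rows that meet both numeric conditions.
matches = [
    row for row in in_region
    if row["Participation Rate"] < 10 and row["Combined"] >= 1600
]

print([row["State"] for row in matches])  # ['A']
```

With a `Table`, each filtering step corresponds to one `where` call, and the calls can be chained.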

## What are we looking for?

Consider the scatter plot of all states' participation rates and combined SAT scores. Does this scatter plot imply that **lower participation _causes_ higher SAT scores? If not, what might be going on here?**

In [None]:
import plotly.express as px

px.scatter(data_frame = sat.to_df(), 
           x = 'Combined', 
           y = 'Participation Rate', 
           hover_data = {'State': True},
           title = 'SAT (2014) Participation Rate by state')

Read more: [https://educationtoworkforce.org/indicators/sat-and-act-participation-and-performance](https://educationtoworkforce.org/indicators/sat-and-act-participation-and-performance)

## Aggregating with `group`

Now, let's consider what we call a "toy dataset" (tiny made-up data):

In [None]:
cones = Table.read_table('data/cones.csv')
cones

In [None]:
cones = Table.read_table('data/cones.csv')
cones.group('Color')
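
To make the idea concrete, here is a plain-Python sketch of what grouping by one column and counting amounts to. It is not how `group` is implemented, and the rows below are made up, so they may not match `data/cones.csv`; `toy_cones` and `color_counts` are hypothetical names.

```python
from collections import Counter

# Made-up rows; the real cones.csv may differ.
toy_cones = [
    {"Flavor": "strawberry", "Color": "pink"},
    {"Flavor": "chocolate", "Color": "light brown"},
    {"Flavor": "chocolate", "Color": "dark brown"},
    {"Flavor": "strawberry", "Color": "pink"},
]

# Count how many rows share each distinct value of "Color".
color_counts = Counter(row["Color"] for row in toy_cones)

# Print one line per group, sorted by group label.
for color in sorted(color_counts):
    print(color, color_counts[color])
```

The result has one entry per distinct value, which mirrors how `cones.group('Color')` returns one row per color along with a count.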

### Bar chart variants

#### One categorical attribute

In [None]:
flavor_table = cones.group('Flavor')
flavor_table

In [None]:
flavor_table.barh('Flavor')

In [None]:
cones.group(['Color', 'Flavor'])

#### One categorical attribute, one numerical attribute

In [None]:
cone_average_price_table = cones.drop("Color").group('Flavor', np.average)
cone_average_price_table

In [None]:
cone_average_price_table.barh('Flavor')
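
Grouping with an aggregation function such as `np.average` can be thought of as two steps: collect the values that belong to each group, then collapse each collection to a single number. The sketch below shows those steps in plain Python under that assumption. The (flavor, price) pairs are made up and are not the contents of `data/cones.csv`.

```python
from collections import defaultdict

# Made-up (flavor, price) pairs; not the contents of cones.csv.
toy_prices = [
    ("strawberry", 3.0),
    ("chocolate", 5.0),
    ("chocolate", 4.0),
    ("strawberry", 5.0),
]

# Step 1: collect the prices that belong to each flavor.
prices_by_flavor = defaultdict(list)
for flavor, price in toy_prices:
    prices_by_flavor[flavor].append(price)

# Step 2: collapse each group's prices to a single number, the mean.
average_price = {
    flavor: sum(prices) / len(prices)
    for flavor, prices in prices_by_flavor.items()
}

print(average_price)  # {'strawberry': 4.0, 'chocolate': 4.5}
```

Dropping `Color` before grouping matters because the aggregation function is applied to each remaining column, and averaging strings is not meaningful.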

#### [discuss] Does grouping on numerical data make sense?

## NBA salaries

Dataset source: [ESPN](https://www.espn.com/nba/salaries/_/page/1/seasontype/1)

In [None]:
nba = Table.read_table("data/nba_salaries_2526.csv")
nba.show(6)

#### Task: Find the five teams paying the highest average salary (in millions) in 2025.

* Your results should be in the form of a table with 5 rows and 2 columns:
    * One column should have the team name.
    * The other column should have the average salary for that team, in millions.

**Challenge**: Try to use `Table.select()` or `Table.drop()` only once in your solution.

In [None]:
top_team_salaries = ...

<details>
  <summary>Solution</summary>

```  
top_team_salaries = (
    nba.with_column("salary (millions)",
                    nba.column("salary")/1e6)
       .where('season',2025)
       .select('team','salary (millions)')
       .group('team', np.average)
       .sort('salary (millions) average', descending=True)
       .take(np.arange(5))
)
top_team_salaries
```
    

</details>

In [None]:
top_team_salaries.barh("team")

---

Everything below this point is definitely not Quiz 2 material. We will cover it only if time permits.

## [if time] Booleans

In [None]:
3 > 1 + 1

In [None]:
3 < -1 * 2

In [None]:
1 < 1 + 1 < 3
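
Two Python rules explain the results of the cells above: arithmetic is evaluated before comparison, and a chained comparison is shorthand for two comparisons joined with `and`. A small sketch (the names `chained` and `spelled_out` are just for illustration):

```python
# A chained comparison...
chained = 1 < 1 + 1 < 3

# ...is equivalent to joining the two comparisons with `and`.
spelled_out = (1 < 1 + 1) and (1 + 1 < 3)

print(chained, spelled_out)  # True True

# Arithmetic happens before comparison, so 3 > 1 + 1 means 3 > 2.
print(3 > 1 + 1)   # True
print(3 < -1 * 2)  # False, because -1 * 2 is -2
```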

In [None]:
s = "Data " + "6"
s == "Data 6"

In [None]:
# is age at least age_limit?
age_limit = 21
age = 17
age >= age_limit

Note: To be clear, real password checkers are more secure than the comparison below.

In [None]:
# is password_guess equal to true_password?
true_password = 'qwerty1093x!'
password_guess = 'QWERTY1093x!'
true_password == password_guess
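
The comparison above evaluates to `False` because string equality is case-sensitive. Here is a short sketch of how Python compares strings, using only built-in functions. The example strings are arbitrary.

```python
# Equality requires every character to match exactly, including case.
print('qwerty' == 'QWERTY')          # False
print('qwerty' == 'QWERTY'.lower())  # True

# Ordering compares character by character, by Unicode code point.
print(ord('a'), ord('b'), ord('Z'))  # 97 98 90
print('apple' >= 'banana')           # False: 'a' comes before 'b'
print('Zebra' < 'apple')             # True: 'Z' (90) is below 'a' (97)
```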

### Comparison Operators

In [None]:
3 == 3

In [None]:
'hello' != 'howdy'

In [None]:
-3 > -2

In [None]:
-3 < -2

In [None]:
"apple" >= "banana"

### Be careful about *equality* vs. *assignment*...

`=` and `==` have very different meanings in Python.

In [None]:
# set x equal to 5
x = 5

In [None]:
# is x equal to 5?
x == 5

In [None]:
x = "some other value" # reset

In [None]:
# valid. what does it do?
y = x == 5

In [None]:
x

In [None]:
y
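
To summarize the cells above: in `y = x == 5`, Python first evaluates the comparison `x == 5` and then assigns the resulting Boolean to `y`. A minimal sketch of the same steps (`same_thing` is a hypothetical name added for illustration):

```python
x = "some other value"

# Python evaluates the comparison x == 5 first, then assigns the result.
y = x == 5
same_thing = (x == 5)  # parentheses make the order explicit

print(x)        # some other value  (x itself is unchanged)
print(y)        # False
print(type(y))  # <class 'bool'>
```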

### Comparisons across types

#### Equality across types

In [None]:
17 == '17'

In [None]:
'zebra' != True

In [None]:
True == 1.0

#### Inequality across types

In [None]:
banana = 10
'apple' >= banana

In [None]:
'alpha' >= 5 

In [None]:
5 > True

In the cell above, Python treats the Boolean value `True` as the integer `1`, so the comparison is evaluated as `5 > 1`.
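
A few checks that spell out how Python handles the cross-type comparisons in this section. Everything here is built-in Python behavior; `raised` is just a name used in this sketch.

```python
# bool is a subclass of int, so True behaves like 1 and False like 0.
print(isinstance(True, int))  # True
print(True == 1.0)            # True
print(5 > True)               # True, same as 5 > 1
print(True + True)            # 2

# Equality between unrelated types is allowed and is simply False.
print(17 == '17')             # False

# Ordering a str against an int is not defined and raises TypeError.
try:
    'alpha' >= 5
    raised = False
except TypeError as err:
    raised = True
    print("TypeError:", err)
```

This is why the two string-versus-number cells above produce an error, while `5 > True` does not.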

## Boolean operators