# Introduction to pandas

This notebook will introduce you to the [`pandas`](https://pandas.pydata.org/) data analysis library and demonstrate how to inspect, sort, filter, group and aggregate a data set.

The data for this exercise will be a CSV of [USA TODAY's opening-day MLB salaries](https://www.usatoday.com/sports/mlb/salaries/) from the 2018 season.

(If you're completely new to Python or your syntax is rusty, it might be useful to [keep this notebook open in a new tab](Python%20syntax%20cheat%20sheet.ipynb) as a reference.)

#### Session outline
- [Using Jupyter notebooks](#Using-Jupyter-notebooks)
- [Import pandas](#Import-pandas)
- [Load data into a data frame](#Load-data-into-a-data-frame)
- [Inspect the data](#Inspect-the-data)
- [Sort the data](#Sort-the-data)
- [Filter the data](#Filter-the-data)
- [Group and aggregate the data](#Group-and-aggregate-the-data)

### Using Jupyter notebooks

There are many ways to write and run Python code on your computer. One way -- the method we're using today -- is to use [Jupyter notebooks](https://jupyter.org/), which run in your browser and allow you to intersperse documentation with your code. They're handy for bundling your code with a human-readable explanation of what's happening at each step. Check out some examples from the [L.A. Times](https://github.com/datadesk/notebooks) and [BuzzFeed News](https://github.com/BuzzFeedNews/everything#data-and-analyses).

**To add a new cell to your notebook**: Click the + button in the menu.

**To run a cell of code**: Select the cell and click the "Run" button in the menu, or you can press Shift+Enter.

**One common gotcha**: The notebook doesn't "know" about code you've written until you've _run_ the cell containing it. For example, if you define a variable called `my_name` in one cell, then try to access it in another cell and get an error that says `NameError: name 'my_name' is not defined`, the most likely fix is to run (or re-run) the cell in which you defined `my_name`.

### Import pandas

Before you can use the functionality of `pandas`, a third-party library installed separately from Python, you need to _import_ it. The convention is to import the library under an alias that's easier to type: `as pd`.

Run this cell:

In [None]:
import pandas as pd

### Load data into a data frame

Before you can start poking at a data file, you need to load the data into a pandas _data frame_, which is sort of like a virtual spreadsheet with columns and rows.

You can load many different types of data files into a data frame, including CSVs (and other delimited text files), Excel files, JSON [and more](https://www.cbtnuggets.com/blog/2018/10/14-file-types-you-can-import-into-pandas/). ([Here's a quick reference notebook](https://github.com/ireapps/cfj-2018/blob/master/reference/Importing%20data%20into%20pandas.ipynb) demonstrating how to import some different data files, including live data from the Internet!)

For today, we'll focus on importing the MLB salary data using a pandas method called [`read_csv()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html). There are a ton of options you can supply when you read in the data file, but at minimum, you need to tell the method _where_ the file lives, which means you need to supply the path to the data file as a Python _string_ (some text enclosed in single or double quotes). The file is called `mlb.csv`, and it is located in the same directory as this notebook file, so we don't need to specify a longer path.

As we import the data, we'll also _assign_ the results of the loading operation to a new variable called _df_ (short for data frame -- easy to type, plus you'll see this pattern a lot when Googling around for help).

👉 [Click here for more information on Python variables](Python%20syntax%20cheat%20sheet.ipynb#Variable-assignment).

In [None]:
df = pd.read_csv('mlb.csv')

As a human sentence: "Go to the pandas library that we imported earlier as something called `pd` and use its `read_csv()` method to import a file called `mlb.csv` into a data frame -- and while we're at it, assign the results of that operation to a new variable called `df`."
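
To see the same call in isolation, here's a minimal sketch that reads a few made-up rows instead of the real file. The rows and the `sample_df` name are invented for illustration, `io.StringIO` just stands in for a file on disk, and the column names are assumed to match the ones used later in this notebook.

```python
import io

import pandas as pd

# Made-up rows standing in for mlb.csv (not real salary data)
csv_text = """NAME,TEAM,POS,SALARY
Player A,COL,C,535000
Player B,COL,SS,2500000
Player C,CHC,C,900000
"""

# read_csv() accepts a file path or any file-like object
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df)
```

With a real file, you'd pass the path string (e.g., `'mlb.csv'`) in place of the `StringIO` object.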

### Inspect the data

Let's take a look at what we've got using a few built-in methods and attributes of a pandas data frame:
- `df.head()` will display the first five records (or, if you prefer, you can specify a number, e.g., `df.head(10)`)
- `df.tail()` will display the last five records (or, if you prefer, you can specify a number, e.g., `df.tail(10)`)
- `df.describe()` will compute summary stats on numeric columns
- `df.sample()` will return a randomly selected record (or, if you prefer, you can specify a number, e.g., `df.sample(5)`)
- `df.shape` will tell you how many rows, how many columns
- `df.dtypes` will list the column names and tell you what kind of data is in each one
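
Here's a quick sketch of a few of these on a tiny, made-up data frame (the `toy` name and its values are invented for illustration, not taken from the salary file):

```python
import pandas as pd

# Made-up rows for illustration (not real salary data)
toy = pd.DataFrame({
    'NAME': ['Player A', 'Player B', 'Player C', 'Player D'],
    'TEAM': ['COL', 'COL', 'CHC', 'ARI'],
    'SALARY': [535000, 2500000, 900000, 1200000],
})

n_rows, n_cols = toy.shape   # shape is a (rows, columns) tuple
first_two = toy.head(2)      # the first two records
stats = toy.describe()       # summary stats for the numeric columns only

print(n_rows, n_cols)
print(toy.dtypes)
print(stats)
```

Note that `shape` and `dtypes` are attributes, not methods, so they don't take parentheses.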

### Sort the data

To sort a data frame, use the `sort_values()` method. At a minimum, you need to tell it which column to sort on.

In [None]:
df.sort_values('SALARY')

To sort descending, you need to pass in another argument to the `sort_values()` method: `ascending=False`. Note that the boolean value is _not_ a string, so it's not contained in quotes, and only the initial letter is capitalized. (If you are supplying multiple arguments to a function or method, separate them with commas.)

👉 [Click here for more information on Python booleans](Python%20syntax%20cheat%20sheet.ipynb#Booleans).

In [None]:
df.sort_values('SALARY', ascending=False)

You can use a process called "method chaining" to perform multiple operations in one line. For instance, to sort the data frame by salary descending and inspect the first 5 records returned:

In [None]:
df.sort_values('SALARY', ascending=False).head()

You can sort by multiple columns by passing in a _list_ of column names rather than the name of a single column. A list is a collection of items enclosed within square brackets `[]`.

👉 [Click here for more information on Python lists](Python%20syntax%20cheat%20sheet.ipynb#Lists).

To sort first by `SALARY`, then by `TEAM`:

In [None]:
df.sort_values(['SALARY', 'TEAM']).head()

You can specify the sort order (descending vs. ascending) for each sort column by passing another list to the `ascending` keyword with `True` and `False` items corresponding to the position of the columns in the first list. 

For example, to sort by `SALARY` descending, then by `TEAM` ascending:

In [None]:
df.sort_values(['SALARY', 'TEAM'], ascending=[False, True]).head()

The `False` goes with `SALARY` and the `True` with `TEAM` because they're in the same position in their respective lists.
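
To watch the tie-breaking happen, here's a small sketch on made-up rows where several players share a salary. The `toy` data frame and its values are invented for illustration.

```python
import pandas as pd

# Made-up rows with tied salaries (not real data)
toy = pd.DataFrame({
    'NAME': ['Player A', 'Player B', 'Player C', 'Player D'],
    'TEAM': ['COL', 'ARI', 'CHC', 'ARI'],
    'SALARY': [535000, 900000, 900000, 535000],
})

# SALARY descending first; ties are then broken by TEAM ascending
result = toy.sort_values(['SALARY', 'TEAM'], ascending=[False, True])

print(result)
```

Within each salary tier, the `ARI` player is listed ahead of the `CHC` or `COL` player.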

One other note: Despite all of this sorting we've been doing, the original `df` data frame is unchanged:

In [None]:
df.head()

That's because we haven't "saved" the results of those sorts by assigning them to a new variable. Typically, if you want to preserve a sort (or any other kind of manipulation), you'd assign the results to a new variable:

In [None]:
sorted_by_team = df.sort_values('TEAM')

In [None]:
sorted_by_team.head()

### ✍️ Your turn

In the cells below, practice sorting the `df` data frame:
- By `NAME`
- By `POS` descending
- By `SALARY` descending, then by `POS` ascending, and save the results to a new variable called `sorted_by_salary_then_pos`

### Filter the data

Let's go over two different kinds of filtering:

- Column filtering: Grabbing one or more columns of data to look at, like passing column names to a `SELECT` statement in SQL.
- Row filtering: Looking at a subset of your data that matches some criteria, like the criteria following a `WHERE` statement in SQL. (For instance, "Show me all records in my data frame where the value in the `TEAM` column is `ARI`.")

#### Column filtering

To access the values in a single column of data, you can use "dot notation" as long as the column name doesn't have spaces or other special characters:

In [None]:
df.TEAM

Otherwise, use "bracket notation" with the name of the column as a string.

This is equivalent to the previous command:

In [None]:
df['TEAM']

To select multiple columns in your data frame, use bracket notation but pass in a _list_ of column names instead of just one. To make things clearer, you could break this out into two steps:

In [None]:
columns_we_care_about = ['TEAM', 'SALARY']
df[columns_we_care_about]

Quick aside: When you access a single column in your data frame, you're getting back something called a `Series` object (as opposed to a `DataFrame` object).
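
A small sketch of the difference, using a made-up two-row data frame (the `toy` name and values are invented for illustration):

```python
import pandas as pd

# Made-up rows for illustration (not real salary data)
toy = pd.DataFrame({
    'TEAM': ['COL', 'CHC'],
    'SALARY': [535000, 900000],
})

one_column = toy['TEAM']     # a single column name returns a Series
as_frame = toy[['TEAM']]     # a one-item list returns a DataFrame

print(type(one_column).__name__)
print(type(as_frame).__name__)
```

So if you need a one-column result that still behaves like a data frame, wrap the column name in a list.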

For numeric columns, you can call methods on that Series to compute basic summary stats:
- `min()` to get the lowest value
- `max()` to get the greatest value
- `median()` to get the median
- `mean()` to get the average
- `mode()` to get the most common value

Check it out for the `SALARY` column:

In [None]:
df.SALARY.min()

In [None]:
df.SALARY.max()

In [None]:
df.SALARY.median()

In [None]:
df.SALARY.mean()

In [None]:
df.SALARY.mode()
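
One thing to watch for: `mode()` returns a Series rather than a single number, because more than one value can tie for most common. Here's a sketch on a handful of made-up salaries (invented for illustration):

```python
import pandas as pd

# Made-up salaries for illustration (not real data)
salaries = pd.Series([535000, 535000, 900000, 900000, 1200000])

lowest = salaries.min()
highest = salaries.max()
middle = salaries.median()
average = salaries.mean()
most_common = salaries.mode()   # a Series, because two values tie here

print(lowest, highest, middle, average)
print(most_common.tolist())
```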

You can look at the unique values in a column with `unique()` -- let's do that with the `TEAM` column:

In [None]:
df.TEAM.unique()

What we just did is the equivalent of dragging the "TEAM" column name into the "rows" area of a spreadsheet pivot table, or, in SQL,

```sql
SELECT DISTINCT TEAM
FROM mlb
```

You can also count up a total for each value using the `value_counts()` method:

In [None]:
df.TEAM.value_counts()
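
Here's a sketch of both methods side by side on a few made-up team codes (the `teams` Series is invented for illustration):

```python
import pandas as pd

# Made-up team codes for illustration (not real data)
teams = pd.Series(['COL', 'ARI', 'COL', 'CHC', 'COL', 'ARI'])

distinct = teams.unique()        # array of values, in order of first appearance
counts = teams.value_counts()    # Series of counts, largest first

print(distinct)
print(counts)
```

The values become the index of the `value_counts()` result, so you can look up a single count with something like `counts['COL']`.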

#### Row filtering

To make things maximally confusing, you _also_ use bracket notation for row filtering. Except in this case, instead of dropping the name of a column (or a list of column names) into the brackets, you hand it a _condition_ of some sort.

Let's filter our data to see players who make more than $1 million (in other words, return rows of data where the value in the `SALARY` column is greater than 1000000):

(The equivalent SQL statement would be:
```sql
SELECT *
FROM mlb
WHERE SALARY > 1000000
```
)

In [None]:
df[df.SALARY > 1000000]

For many filters, you'll use Python's comparison operators:
- `>` greater than
- `>=` greater than or equal to
- `<` less than
- `<=` less than or equal to
- `==` equal to
- `!=` not equal to
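
It can help to see what the condition produces on its own: a Series of `True`/`False` values, one per row, which the brackets then use to decide which rows to keep. A minimal sketch on made-up rows (the `toy` data and the `mask` name are invented for illustration):

```python
import pandas as pd

# Made-up rows for illustration (not real salary data)
toy = pd.DataFrame({
    'NAME': ['Player A', 'Player B', 'Player C', 'Player D'],
    'SALARY': [535000, 2500000, 900000, 1200000],
})

mask = toy.SALARY > 1000000               # True/False for each row
over_1m = toy[mask]                       # keeps rows where mask is True
not_minimum = toy[toy.SALARY != 535000]   # same idea with a different operator

print(mask.tolist())
print(over_1m)
```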

#### Multiple filter conditions

What if you want to use multiple filtering conditions? There is a way, but it usually makes more sense -- and is much easier for your colleagues and your future self to think about and debug -- to _save_ the results of each filtering operation by assigning the results to a new variable, then filter _that_ again instead of the original data frame.

For example, if you wanted to look at Colorado Rockies players who make more than $1 million, you might do something like:

In [None]:
rockies = df[df.TEAM == 'COL']
rockies_over_1m = rockies[rockies.SALARY > 1000000]

In [None]:
rockies_over_1m
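
For reference, the one-step version combines conditions with `&` ("and") or `|` ("or"), with each condition wrapped in its own parentheses. Here's a sketch on made-up rows (the `toy` data is invented for illustration) showing that it matches the step-by-step approach:

```python
import pandas as pd

# Made-up rows for illustration (not real salary data)
toy = pd.DataFrame({
    'NAME': ['Player A', 'Player B', 'Player C', 'Player D'],
    'TEAM': ['COL', 'COL', 'CHC', 'COL'],
    'SALARY': [535000, 2500000, 900000, 1200000],
})

# One step: the parentheses around each condition are required
combined = toy[(toy.TEAM == 'COL') & (toy.SALARY > 1000000)]

# Two steps, as in the cells above
rockies = toy[toy.TEAM == 'COL']
stepwise = rockies[rockies.SALARY > 1000000]

print(combined)
```

Python's plain `and` and `or` keywords won't work here; pandas needs the `&` and `|` operators to compare row by row.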

👉 [Check out some other filtering operations here]().

### ✍️ Your turn

In the cells below, practice filtering:
- Column filtering: Select the `NAME` column
- Column filtering: Select the `NAME` and `TEAM` columns
- Row filtering: Filter the rows to return only players who make the league minimum (535000)
- Row filtering: Filter the rows to return only catchers (`C`) who make at least 750000
- BONUS: Filter the rows to return only players for the Chicago Cubs (`CHC`), then use method chaining to order the results by `SALARY` descending

### Group and aggregate the data

Data frames have a `groupby` method for grouping and aggregating data, similar to what you might do in a pivot table or a `GROUP BY` statement in SQL. (They also have a [`pivot_table` method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot_table.html), which can be homework for you to research.)
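
As a head start on that homework, here's a rough sketch of `pivot_table` adding up salaries by team. The `toy` data frame is made up for illustration, and this is just one way to call the method.

```python
import pandas as pd

# Made-up rows for illustration (not real salary data)
toy = pd.DataFrame({
    'TEAM': ['COL', 'COL', 'CHC', 'ARI', 'ARI'],
    'SALARY': [535000, 2500000, 900000, 1200000, 545000],
})

# index = what goes in the rows, values = what to aggregate, aggfunc = how
payroll = toy.pivot_table(index='TEAM', values='SALARY', aggfunc='sum')

print(payroll)
```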

Let's say we wanted to see the top 10 teams by payroll. In other words, we want to:
- Group the data by the `TEAM` column: `groupby()`
- Add up the records in each group: `sum()`
- Sort the results by `SALARY` descending: `sort_values()`
- Take only the top 10 results: `head(10)`

Calling the `groupby()` method without telling it what to do with the grouped records isn't super helpful:

In [7]:
df.groupby('TEAM')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x11161f4a8>

At this point, it's basically telling us that it has successfully grouped the records -- now what? Using method chaining, describe what you would like to _do_ with the numeric columns once you've grouped the data. Let's start with `sum()`:

In [11]:
# numeric_only=True keeps text columns out of the sum on newer pandas versions
df.groupby('TEAM').sum(numeric_only=True)

| TEAM | SALARY | START_YEAR | END_YEAR | YEARS |
| --- | --- | --- | --- | --- |
| ARI | 90730499 | 56469 | 56485 | 44 |
| ATL | 137339527 | 60491 | 60525 | 64 |
| BAL | 161684185 | 56460 | 56485 | 53 |
| BOS | 174287098 | 62510 | 62541 | 62 |
| CHC | 170088502 | 52429 | 52456 | 53 |
| CIN | 82375785 | 62516 | 62539 | 54 |
| CLE | 115991166 | 56455 | 56490 | 63 |
| COL | 101513571 | 64534 | 64553 | 51 |
| CWS | 109591167 | 56463 | 56487 | 52 |
| DET | 180250600 | 52420 | 52457 | 63 |


Neat! Except it's summing _every_ numeric column, not just `SALARY`. To deal with this, use column filtering to select the two columns we're interested in -- `TEAM` for grouping and `SALARY` for summing -- and _then_ tack on the `groupby` statement, etc.

(Remember: To select columns from a data frame, use bracket notation and hand it a _list_ of column names.)

In [12]:
df[['TEAM', 'SALARY']].groupby('TEAM').sum()

| TEAM | SALARY |
| --- | --- |
| ARI | 90730499 |
| ATL | 137339527 |
| BAL | 161684185 |
| BOS | 174287098 |
| CHC | 170088502 |
| CIN | 82375785 |
| CLE | 115991166 |
| COL | 101513571 |
| CWS | 109591167 |
| DET | 180250600 |
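
Another common pattern is to group first and pick the column afterward. Here's a sketch on made-up rows (the `toy` data is invented for illustration) showing that both orderings give the same totals:

```python
import pandas as pd

# Made-up rows for illustration (not real salary data)
toy = pd.DataFrame({
    'TEAM': ['COL', 'COL', 'CHC', 'ARI', 'ARI'],
    'SALARY': [535000, 2500000, 900000, 1200000, 545000],
    'YEARS': [1, 3, 1, 2, 1],
})

# Select columns, then group (the approach used above)
filter_first = toy[['TEAM', 'SALARY']].groupby('TEAM').sum()

# Group, then select the column(s) to aggregate
select_after = toy.groupby('TEAM')[['SALARY']].sum()

print(select_after)
```

Either way, the `YEARS` column stays out of the result.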


Bang bang. Now, using method chaining, let's sort by `SALARY` descending and look at just the top 10:

In [13]:
df[['TEAM', 'SALARY']].groupby('TEAM').sum().sort_values('SALARY', ascending=False).head(10)

| TEAM | SALARY |
| --- | --- |
| LAD | 187989811 |
| DET | 180250600 |
| TEX | 178431396 |
| SF | 176531278 |
| NYM | 176284679 |
| BOS | 174287098 |
| NYY | 170389199 |
| CHC | 170088502 |
| WSH | 162742157 |
| TOR | 162353367 |


You can use aggregation methods other than `sum()` -- `mean()` and `median()`, for instance -- or you can use [the `agg()` method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.agg.html) to specify one or more aggregation methods to apply.

In [19]:
df[['TEAM', 'SALARY']].groupby('TEAM').median()

| TEAM | SALARY |
| --- | --- |
| ARI | 1300000 |
| ATL | 1250000 |
| BAL | 3462500 |
| BOS | 1950000 |
| CHC | 2750000 |
| CIN | 567000 |
| CLE | 2950000 |
| COL | 545000 |
| CWS | 875000 |
| DET | 1650000 |


In [18]:
df[['TEAM', 'SALARY']].groupby('TEAM').mean()

| TEAM | SALARY |
| --- | --- |
| ARI | 3240375.0 |
| ATL | 4577984.0 |
| BAL | 5774435.0 |
| BOS | 5622164.0 |
| CHC | 6541865.0 |
| CIN | 2657283.0 |
| CLE | 4142542.0 |
| COL | 3172299.0 |
| CWS | 3913970.0 |
| DET | 6932715.0 |


In [17]:
df[['TEAM', 'SALARY']].groupby('TEAM').agg(['sum', 'mean', 'median'])

| TEAM | SALARY sum | SALARY mean | SALARY median |
| --- | --- | --- | --- |
| ARI | 90730499 | 3240375.0 | 1300000 |
| ATL | 137339527 | 4577984.0 | 1250000 |
| BAL | 161684185 | 5774435.0 | 3462500 |
| BOS | 174287098 | 5622164.0 | 1950000 |
| CHC | 170088502 | 6541865.0 | 2750000 |
| CIN | 82375785 | 2657283.0 | 567000 |
| CLE | 115991166 | 4142542.0 | 2950000 |
| COL | 101513571 | 3172299.0 | 545000 |
| CWS | 109591167 | 3913970.0 | 875000 |
| DET | 180250600 | 6932715.0 | 1650000 |
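
If the two-level column header is more than you need, `agg()` also accepts keyword arguments that name each output column. This is sometimes called "named aggregation." Here's a sketch on made-up rows; the `toy` data and the output names `total`, `average` and `players` are invented for illustration.

```python
import pandas as pd

# Made-up rows for illustration (not real salary data)
toy = pd.DataFrame({
    'NAME': ['Player A', 'Player B', 'Player C', 'Player D', 'Player E'],
    'TEAM': ['COL', 'COL', 'CHC', 'ARI', 'ARI'],
    'SALARY': [535000, 2500000, 900000, 1200000, 545000],
})

# Each keyword is output_name=(column_to_aggregate, aggregation)
summary = toy.groupby('TEAM').agg(
    total=('SALARY', 'sum'),
    average=('SALARY', 'mean'),
    players=('NAME', 'count'),
)

print(summary)
```

The result has one flat row of column names, which makes a follow-up like `summary.sort_values('total', ascending=False)` easier to write.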


### ✍️ Your turn

In the cells below, practice grouping data:
- What's the median salary for each position? Group the data by `POS` and aggregate by `median()`, then sort by `SALARY` descending
- What's the average salary on each team? Group the data by `TEAM` and aggregate by `mean()`, then sort by `SALARY` descending
- What else?