In [1]:
import pandas as pd
import numpy as np
import os

import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (10, 5)

# Lecture 3 – More Pandas 🐼🐼

## DSC 80, Spring 2022

### Announcements

- Lab 1 is due on **Monday, April 4th at 11:59PM**.
    - Watch [this video 🎥](https://youtu.be/FpTo4AM9B30) for setup instructions.
- Discussion 1 is due for extra credit **tomorrow at 11:59PM**.
    - See [this post](https://campuswire.com/c/G325FA25B/feed/72) for a discussion on rounding errors.
    - Discussion podcasts appear alongside lecture podcasts at [podcast.ucsd.edu](https://podcast.ucsd.edu) – look for "A01" or "B01".
- Please submit the [Welcome + Alternate Exams Form](https://docs.google.com/forms/d/e/1FAIpQLSdBKLcPs4Xi0plaIw0MVZ0DyGcvnSZyHxKVC7S7LwEiCchepQ/viewform) by Monday.
- Project 1 will be released over the weekend.
    - The Checkpoint is due on **Thursday, April 7th at 11:59PM**.
    - The whole project is due on **Thursday, April 14th at 11:59PM**.

### Agenda

- `loc` and `iloc`.
- `pandas` and `numpy`.
- Useful Series and DataFrame methods.

## `loc` and `iloc`

In [2]:
enrollments = pd.DataFrame({
    'Name': ['Granger, Hermione', 'Potter, Harry', 'Weasley, Ron', 'Longbottom, Neville'],
    'PID': ['A13245986', 'A17645384', 'A32438694', 'A52342436'],
    'LVL': [1, 1, 1, 1]
})

enrollments

Unnamed: 0,Name,PID,LVL
0,"Granger, Hermione",A13245986,1
1,"Potter, Harry",A17645384,1
2,"Weasley, Ron",A32438694,1
3,"Longbottom, Neville",A52342436,1


### Selecting columns and rows simultaneously

So far, we used `[]` to select columns and `loc` to select rows.

In [3]:
enrollments.loc[enrollments['Name'] < 'M']['PID']

0    A13245986
3    A52342436
Name: PID, dtype: object

### Selecting columns and rows simultaneously

`loc` can also be used to select both rows and columns. The general pattern is:

```
df.loc[<row selector>, <column selector>]
```

Examples:
- `df.loc[idx_list, col_list]` returns a DataFrame containing the rows in `idx_list` and columns in `col_list`.
- `df.loc[bool_arr, col_list]` returns a DataFrame containing the rows for which `bool_arr` is `True` and columns in `col_list`.
- If `:` is used as the first input, all rows are kept. If `:` is used as the second input, all columns are kept.
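
A minimal sketch of these patterns, using a small made-up DataFrame (`toy` and its values are illustrative, not part of the lecture's data):

```python
import pandas as pd

# Hypothetical toy data for illustration.
toy = pd.DataFrame({
    'Name': ['Ada', 'Bo', 'Cy'],
    'PID': ['A1', 'A2', 'A3'],
    'LVL': [1, 2, 3],
})

# List of index labels and list of column labels -> DataFrame.
by_labels = toy.loc[[0, 2], ['Name', 'LVL']]

# Boolean array for rows and list of column labels -> DataFrame.
by_mask = toy.loc[toy['LVL'] >= 2, ['PID']]

# ':' keeps every row; a single column label gives a Series.
all_rows = toy.loc[:, 'PID']
```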

In [4]:
enrollments

Unnamed: 0,Name,PID,LVL
0,"Granger, Hermione",A13245986,1
1,"Potter, Harry",A17645384,1
2,"Weasley, Ron",A32438694,1
3,"Longbottom, Neville",A52342436,1


In [5]:
enrollments.loc[enrollments['Name'] < 'M', 'PID']

0    A13245986
3    A52342436
Name: PID, dtype: object

In [6]:
enrollments.loc[enrollments['Name'] < 'M', ['PID']]

Unnamed: 0,PID
0,A13245986
3,A52342436


### Even more ways of selecting rows and columns

In `df.loc[<row selection>, <column selection>]`:

- Both the first and second inputs can be Boolean sequences.
- Both the first and second inputs can be **slices**, which use `:` syntax (e.g. `0:2`, `'Name': 'PID'`).
- If both the first and second inputs are primitives (strings or numbers), the result is a single value, not a DataFrame or Series.
- The first input can be a **function** that takes the DataFrame as input and returns a valid row selector, such as a Boolean Series.

There are many more options; see the [`pandas` documentation](https://pandas.pydata.org/docs/user_guide/index.html#user-guide) for details.
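
A short sketch of a few of these selectors on made-up data (the names `toy`, `upper`, `only_lvl`, and `value` are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({'Name': ['Ada', 'Bo', 'Cy'], 'LVL': [1, 2, 3]})

# A function passed as the row selector receives the whole DataFrame
# and here returns a Boolean Series with one entry per row.
upper = toy.loc[lambda df: df['LVL'] > 1]

# A Boolean sequence as the column selector, one entry per column.
only_lvl = toy.loc[:, [False, True]]

# Two primitives select a single value.
value = toy.loc[0, 'Name']
```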

In [7]:
enrollments

Unnamed: 0,Name,PID,LVL
0,"Granger, Hermione",A13245986,1
1,"Potter, Harry",A17645384,1
2,"Weasley, Ron",A32438694,1
3,"Longbottom, Neville",A52342436,1


In [8]:
enrollments.loc[2, 'LVL']

1

In [9]:
enrollments.loc[0:2, 'Name':'PID']

Unnamed: 0,Name,PID
0,"Granger, Hermione",A13245986
1,"Potter, Harry",A17645384
2,"Weasley, Ron",A32438694


### Don't forget `iloc`!

- `iloc` stands for "integer location".
- `iloc` is like `loc`, but it selects rows and columns based on integer positions only.

In [10]:
enrollments

Unnamed: 0,Name,PID,LVL
0,"Granger, Hermione",A13245986,1
1,"Potter, Harry",A17645384,1
2,"Weasley, Ron",A32438694,1
3,"Longbottom, Neville",A52342436,1


In [11]:
enrollments.iloc[2:4, 0:2]

Unnamed: 0,Name,PID
2,"Weasley, Ron",A32438694
3,"Longbottom, Neville",A52342436


In [12]:
other = enrollments.set_index('Name')
other

Unnamed: 0_level_0,PID,LVL
Name,Unnamed: 1_level_1,Unnamed: 2_level_1
"Granger, Hermione",A13245986,1
"Potter, Harry",A17645384,1
"Weasley, Ron",A32438694,1
"Longbottom, Neville",A52342436,1


In [13]:
other.iloc[2]

PID    A32438694
LVL            1
Name: Weasley, Ron, dtype: object

In [14]:
other.loc[2]

KeyError: 2
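
One difference worth sketching: slices in `loc` include the endpoint label, while slices in `iloc` exclude the end position, like ordinary Python slicing. The toy DataFrame below is made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({'x': [10, 20, 30, 40]}, index=['a', 'b', 'c', 'd'])

# Label-based slice: includes both endpoints 'a' and 'c'.
label_slice = toy.loc['a':'c']

# Position-based slice: positions 0 and 1 only, position 2 is excluded.
position_slice = toy.iloc[0:2]
```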

### Practice Questions

Consider the DataFrame below.

In [None]:
jack = pd.DataFrame({1: ['fee', 'fi'], '1': ['fo', 'fum']})
jack

For each of the following pieces of code, predict what the output will be. Then, uncomment the line of code and see for yourself.

In [None]:
# jack[1]

In [None]:
# jack[[1]]

In [None]:
# jack['1']

In [None]:
# jack[[1, 1]]

In [None]:
# jack.loc[1]

In [None]:
# jack.loc[jack[1] == 'fo']

In [None]:
# jack[1, ['1', 1]]

In [None]:
# jack.loc[1, 1]

## Pandas and NumPy

<center><img src='imgs/python-stack.png' width=800></center>

### NumPy

- NumPy stands for "numerical Python". It is a commonly-used Python module that enables **fast** computation involving arrays and matrices.
- `numpy`'s main object is the **array**. In `numpy`, arrays are:
    - homogeneous (all values are of the same type), and
    - (potentially) multi-dimensional.
- Computation in `numpy` is fast because
    - Much of it is implemented in C.
    - `numpy` arrays are stored more efficiently in memory than, say, Python lists. 
- [This site](https://cloudxlab.com/blog/numpy-pandas-introduction/) provides a good overview of `numpy` arrays.

### `pandas` is built upon `numpy`

- A Series in `pandas` is a `numpy` array with an index.
- A DataFrame is like a dictionary of columns, each of which is a `numpy` array.
- Many operations in `pandas` are fast because they use `numpy`'s implementations.
- To access the array underlying a DataFrame or Series, use the `to_numpy` method.
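    - ⚠️ Warning: `to_numpy` can return a view of the original object's data rather than a copy, so modifying the array may modify the Series or DataFrame too! Read more in the [course notes](https://notes.dsc80.com/content/02/data-types.html#copies-and-views-in-pandas).
    - `.values` is an older way to access the same array; the `pandas` documentation recommends `.to_numpy()` instead.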

In [None]:
arr = np.array([4, 2, 9, 15, -1])
arr

In [None]:
ser = pd.Series(arr, index='a b c d e'.split(' '))
ser

In [None]:
conv = ser.to_numpy()
conv

In [None]:
conv[2] = 100
conv

In [None]:
ser
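
If an independent array is needed, `to_numpy` accepts a `copy=True` argument. A minimal sketch (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

ser_demo = pd.Series(np.array([4, 2, 9, 15, -1]), index=list('abcde'))

# copy=True asks for a new array that does not share memory with the Series.
independent = ser_demo.to_numpy(copy=True)
independent[2] = 100

# The Series still holds its original value at position 2.
original_value = ser_demo.iloc[2]
```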

### The dangers of `for`-loops

- `for`-loops are slow when processing large datasets.
- To illustrate how much faster `numpy` arithmetic is than using a `for`-loop, let's compute the distances between the origin $(0, 0)$ and 2000 random points $(x, y)$ in $\mathbb{R}^2$:
    - Using a `for`-loop.
    - Using vectorized arithmetic (through `numpy`).

### Aside: generating data

- First, we need to create a DataFrame containing 2000 random points in 2D. 
- `np.random.random(N)` returns an array containing `N` numbers selected uniformly at random from the interval $[0, 1)$.

In [None]:
N = 2000
x_arr = np.random.random(N)
y_arr = np.random.random(N)

coordinates = pd.DataFrame({"x": x_arr, "y": y_arr})
coordinates.head()

Next, let's define a function that takes in a DataFrame like the one above and returns the distances between each point and the origin, using a `for`-loop.

In [None]:
def distances(df):
    hyp_list = []
    for i in df.index:
        dist = (df.loc[i, 'x'] ** 2 + df.loc[i, 'y'] ** 2) ** 0.5
        hyp_list.append(dist)
    return hyp_list

The `%timeit` magic command can repeatedly run any snippet of code and give us its average runtime.

In [None]:
%timeit distances(coordinates)

Now, using a vectorized approach:

In [None]:
%timeit (coordinates['x'] ** 2 + coordinates['y'] ** 2) ** 0.5

Note that "µs" refers to microseconds, which are one-millionth of a second, whereas "ms" refers to milliseconds, which are one-thousandth of a second.

**Takeaway:** avoid `for`-loops whenever possible!
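
As a sanity check, the sketch below confirms that a loop and the vectorized expression compute the same distances on a small random sample. It also uses `np.hypot`, a NumPy function that computes $\sqrt{x^2 + y^2}$ element-wise. The seed and sample size are arbitrary choices.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(80)  # arbitrary seed, for reproducibility
pts = pd.DataFrame({'x': rng.random(100), 'y': rng.random(100)})

# Loop version: one Python-level iteration per row.
loop_result = [
    (pts.loc[i, 'x'] ** 2 + pts.loc[i, 'y'] ** 2) ** 0.5
    for i in pts.index
]

# Vectorized versions: the arithmetic runs on whole columns at once.
vectorized = (pts['x'] ** 2 + pts['y'] ** 2) ** 0.5
with_hypot = np.hypot(pts['x'], pts['y'])
```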

### `pandas` data types

- A **data type** in `pandas` refers to the type of values in a column.
- A column's data type determines which operations can be applied to it.
- `pandas` tries to guess the correct data type for a given DataFrame, and is often wrong.
    - This can lead to incorrect calculations and poor memory/time performance.
- As a result, you will often need to explicitly convert between data types.
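
One illustration of a wrong guess, sketched with made-up CSV text: ZIP codes look like integers, so `pd.read_csv` drops their leading zeros unless told otherwise. The column name and values here are hypothetical.

```python
import io
import pandas as pd

# Hypothetical CSV text; ZIP codes are labels, not quantities.
csv_text = "zip\n02134\n92093\n"

# pandas infers an integer column, so the leading zero is lost.
guessed = pd.read_csv(io.StringIO(csv_text))

# Requesting strings keeps the values exactly as written.
explicit = pd.read_csv(io.StringIO(csv_text), dtype={'zip': str})
```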

### `pandas` data types

|Pandas dtype|Python type|NumPy type|SQL type|Usage|
|---|---|---|---|---|
|int64|int|int_, int8,...,int64, uint8,...,uint64|INT, BIGINT| Integer numbers|
|float64|float|float_, float16, float32, float64|FLOAT| Floating point numbers|
|bool|bool|bool_|BOOL|True/False values|
|datetime64|NA|datetime64[ns]|DATETIME|Date and time values|
|timedelta64[ns]|NA|timedelta64[ns]|NA|Differences between two datetimes|
|category|NA|NA|ENUM|Finite list of text values|
|object|str|string, unicode|NA|Text|
|object|NA|object|NA|Mixed types|

[This article](https://www.dataquest.io/blog/pandas-big-data/) details how `pandas` stores different data types under the hood.

### Type conversion and the underlying `numpy` array(s)
* The `dtypes` attribute (of both Series and DataFrames) describes the data type of each column.
* The `to_numpy` method, when used on a Series, returns an array in which all values are of the data type specified by `dtypes`.
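* The `to_numpy` method, when used on a DataFrame, returns a two-dimensional array whose data type is one that can hold every column's values. This is `object` when the columns mix types such as strings and numbers, and a numeric type when all columns are numeric.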
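
A small sketch of how the data type of the array returned by a DataFrame's `to_numpy` depends on its columns (toy data; the names are illustrative). Numeric columns of different types are upcast to a common numeric type, while mixing numbers and strings yields `object`.

```python
import pandas as pd

numeric = pd.DataFrame({'a': [1, 2], 'b': [0.5, 1.5]})
mixed = pd.DataFrame({'a': [1, 2], 'c': ['x', 'y']})

# Integers and floats can share a single float type.
numeric_dtype = numeric.to_numpy().dtype

# Numbers and strings have no common numeric type, so the array holds objects.
mixed_dtype = mixed.to_numpy().dtype
```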

In [None]:
# Read in file
elections_fp = os.path.join('data', 'elections.csv')
elections = pd.read_csv(elections_fp)
elections.head()

In [None]:
elections.dtypes

In [None]:
elections['Year'].dtypes

In [None]:
elections['Year'].to_numpy().dtype

In [None]:
elections.to_numpy()

What do you think is happening here?

In [None]:
elections['Year'] ** 7

### ⚠️ Warning: `numpy` and `pandas` don't always make the same decisions! 

`numpy` prefers homogeneous data types to optimize memory and read/write speed. This leads to **type coercion**. Notice that the array created below contains only strings, even though there was an `int` in the argument list.

In [None]:
np.array(['a', 1])

On the other hand, `pandas` likes correctness and ease-of-use. The Series created below is of type `object`, which preserves the original data types in the argument list.

In [None]:
pd.Series(['a', 1])

In [None]:
pd.Series(['a', 1]).values

You can specify the data type of an array when initializing it by using the `dtype` argument.

In [None]:
np.array(['a', 1], dtype=object)

`pandas` does make some trade-offs for efficiency, however. For instance, a Series consisting of both `int`s and `float`s is coerced to the `float64` data type.

In [None]:
pd.Series([1, 1.0])

### Type conversion

You can change the data type of a Series using the `.astype` Series method.

In [None]:
ser = pd.Series(['1', '2', '3', '4'])
ser

In [None]:
ser.astype(int)

In [None]:
ser.astype(float)
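
`astype` raises an error if any value cannot be converted. One alternative, sketched here on made-up values, is `pd.to_numeric` with `errors='coerce'`, which turns unparseable entries into `NaN`:

```python
import pandas as pd

messy = pd.Series(['1', '2', 'three', '4'])

try:
    messy.astype(int)
    astype_failed = False
except ValueError:
    # 'three' cannot be parsed as an integer.
    astype_failed = True

# Unparseable entries become NaN instead of raising an error.
coerced = pd.to_numeric(messy, errors='coerce')
```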

### Performance and memory management

As we just discovered,
* `numpy` is optimized for speed and memory consumption.
* `pandas` makes implementation choices that: 
    - can be slower and use more memory, but
    - optimize for fast code development.

To demonstrate, let's create a large array in which all of the entries are non-negative integers no greater than 255, meaning that they can be represented with 8 bits (i.e. as `np.uint8`s, where the "u" stands for "unsigned").

In [None]:
# One million integers drawn uniformly at random from 0 through 7
data = np.random.choice(np.arange(8), 10 ** 6)

When we tell `pandas` to use a `dtype` of `uint8`, the size of the resulting DataFrame is under a megabyte.

In [None]:
ser1 = pd.Series(data, dtype=np.uint8).to_frame()
ser1.info()

But by default, even though the numbers are only 8-bit, `pandas` uses the `int64` dtype, and the resulting DataFrame is over 7 megabytes large.

In [None]:
ser2 = pd.Series(data).to_frame()
ser2.info()
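
The same comparison can be made numerically with `memory_usage`. This is a sketch on a smaller, deterministic array (the names and sizes are illustrative); `index=False` restricts the count to the bytes used by the values.

```python
import numpy as np
import pandas as pd

values = np.arange(8).repeat(1000)  # 8,000 small non-negative integers

small = pd.Series(values, dtype=np.uint8)
large = pd.Series(values, dtype=np.int64)

# uint8 uses 1 byte per value; int64 uses 8 bytes per value.
small_bytes = small.memory_usage(index=False)
large_bytes = large.memory_usage(index=False)
```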

## Useful Series and DataFrame methods

### Shared methods and attributes
* The `head`/`tail` methods return the first/last few rows (the default is 5).
* The `shape` attribute returns the number of rows (and columns).
* The `size` attribute returns the number of entries.

In [None]:
elections.head()

In [None]:
elections.shape

In [None]:
elections.size

### Series methods

|Method Name|Description|
|---|---|
|`count`|Returns the number of non-null entries in the Series|
|`unique`|Returns the unique values in the Series|
|`nunique`|Returns the number of unique values in the Series|
|`value_counts`|Returns a Series of counts of unique values|
|`describe`|Returns a Series of descriptive stats of values|
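
A quick sketch of how these methods treat missing values, using a made-up Series (`parties` is illustrative):

```python
import numpy as np
import pandas as pd

parties = pd.Series(['Dem', 'Rep', 'Dem', np.nan, 'Ind'])

n_non_null = parties.count()      # the missing value is not counted
n_distinct = parties.nunique()    # missing values are excluded by default
counts = parties.value_counts()   # sorted from most to least common
```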

In [None]:
elections.head()

In [None]:
# Distinct candidates
elections['Candidate'].unique()

In [None]:
# Number of distinct candidates
elections['Candidate'].nunique()

In [None]:
# Number of non-null entries in the 'Candidate' column
elections['Candidate'].count()

In [None]:
# 🤔
republicans = elections.loc[elections['Party'] == 'Republican']
republicans['Result'].value_counts()

In [None]:
republicans['%'].describe()

### DataFrame methods

* DataFrames share *many* of the same methods with Series.
    - In such cases, the DataFrame method applies the Series method to every row or column.
* Some DataFrame methods accept the `axis` keyword argument:
    - `axis=0`: the method is applied across the rows (i.e. to each column).
    - `axis=1`: the method is applied across the columns (i.e. to each row).
* Default value: `axis=0`.
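
A minimal sketch of the two `axis` settings on a made-up DataFrame:

```python
import pandas as pd

grid = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# axis=0 collapses the rows, giving one value per column.
per_column = grid.sum(axis=0)

# axis=1 collapses the columns, giving one value per row.
per_row = grid.sum(axis=1)
```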

In [None]:
elections.head()

In [None]:
elections[['%', 'Year']].mean()

The following piece of code works, but is meaningless. Why?

In [None]:
elections[['%', 'Year']].mean(axis=1)

### Even more DataFrame methods

|Method Name|Description|
|---|---|
|`sort_values`|Returns a DataFrame sorted by the specified column|
|`drop_duplicates`|Returns a DataFrame with duplicate rows dropped|
|`describe`|Returns descriptive stats of the DataFrame|

In [None]:
elections.sort_values('%', ascending=False).head(4)

In [None]:
# By default, drop_duplicates looks for duplicate entire rows, which elections does not have
elections.drop_duplicates(subset=['Candidate'])

In [None]:
elections.describe()

### Adding and modifying columns, using a copy

* To add a new column to a DataFrame, use the `assign` method.
* To add a new row to a DataFrame, use `pd.concat` (the older `append` method was deprecated in `pandas` 1.4 and removed in `pandas` 2.0).
* Both `assign` and `pd.concat` return a new DataFrame rather than modifying the original, **which is a great feature!**
* To change the values in a column, re-assign its name to a sequence of the desired values.

As an aside, you should try your best to write **chained** `pandas` code, as follows:

In [None]:
(
    elections
    .assign(proportion_of_vote=(elections['%'] / 100))
    .head()
)

You can chain together several steps at a time:

In [None]:
(
    elections
    .assign(proportion_of_vote=(elections['%'] / 100))
    .assign(Result=elections['Result'].str.upper())
    .head()
)

You can also use `assign` when the desired column name has spaces, by unpacking a dictionary into keyword arguments with `**`.

In [None]:
(
    elections
    .assign(**{'Proportion of Vote': elections['%'] / 100})
    .head()
)

### ⚠️ Warning!

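- Adding rows one at a time (with `append` in older versions of `pandas`, or with repeated calls to `pd.concat`) has terrible time complexity, since each call copies the whole DataFrame!
- Use it sparingly.
- Specifically, don't build a DataFrame by adding rows in a loop.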
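
A common alternative, sketched below with made-up rows, is to collect rows in a plain Python list and build the DataFrame once at the end, or to combine existing DataFrames with a single `pd.concat` call. The column names here are illustrative.

```python
import pandas as pd

# Collect rows as dictionaries; appending to a list is cheap.
rows = []
for year in [1976, 1980, 1984]:
    rows.append({'Year': year, 'Decade': year // 10 * 10})

# Build the DataFrame once, after the loop.
built = pd.DataFrame(rows)

# To add rows from another DataFrame, concatenate in one call.
extra = pd.DataFrame([{'Year': 1988, 'Decade': 1980}])
combined = pd.concat([built, extra], ignore_index=True)
```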

### Adding and modifying columns, in-place

* You can assign a new row or column to a DataFrame **in-place** using `loc` or `[]`.
    - Works like dictionary assignment.
    - Unlike `assign`, this **modifies** the underlying DataFrame rather than a copy of it.
* This is the more "common" way of adding/modifying columns. 
    - ⚠️ Warning: Exercise caution when using this approach, since this approach changes the values of existing variables.
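
The difference between the two approaches can be sketched on a made-up DataFrame (`votes` and `prop` are illustrative names):

```python
import pandas as pd

votes = pd.DataFrame({'%': [50.1, 48.0]})

# assign returns a new DataFrame; votes itself is unchanged.
with_prop = votes.assign(prop=votes['%'] / 100)
columns_after_assign = list(votes.columns)

# [] assignment adds the column to votes itself.
votes['prop'] = votes['%'] / 100
columns_after_setitem = list(votes.columns)
```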

In [None]:
# By default, .copy() returns a deep copy of the object it is called on,
# meaning that if you change the copy the original remains unmodified.
mod_elec = elections.copy()
mod_elec.head()

In [None]:
mod_elec['Proportion of Vote'] = mod_elec['%'] / 100
mod_elec.head()

In [None]:
mod_elec['Result'] = mod_elec['Result'].str.upper()
mod_elec.head()

In [None]:
# 🤔
mod_elec.loc[-1, :] = ['Carter', 'Democratic', 50.1, 1976, 'WIN', 0.501]
mod_elec.loc[-2, :] = ['Ford', 'Republican', 48.0, 1976, 'LOSS', 0.48]
mod_elec

In [None]:
mod_elec = mod_elec.sort_index()
mod_elec.head()

In [None]:
# df.reset_index(drop=True) drops the current index 
# of the DataFrame and replaces it with an index of increasing integers
mod_elec.reset_index(drop=True)

## Example: San Diego employee salaries (again)

Note: We probably won't finish looking at all of this code in lecture, but we will leave it here for you as a reference.

### Reading the data

Let's work with the same dataset that we did in Lecture 1, using our new knowledge of `pandas`.

In [None]:
salaries = pd.read_csv('https://transcal.s3.amazonaws.com/public/export/san-diego-2020.csv')
salaries['Employee Name'] = salaries['Employee Name'].str.split().str[0] + ' xxxxx'

In [None]:
salaries.head()

In [None]:
salaries.info()

### Data cleaning

Current issues with the dataset:

- Some columns have no information (`'Notes'`) or the same value in all rows (`'Year'` and `'Agency'`) – let's drop them.
- `'Other Pay'` should be numeric, but it's not currently.



In [None]:
# Dropping useless columns
salaries = salaries.drop(['Year', 'Notes', 'Agency'], axis=1)
salaries.head()

### Fixing the `'Other Pay'` column

In [None]:
salaries['Other Pay'].dtype

In [None]:
salaries['Other Pay'].unique()

It appears that most of the values in the `'Other Pay'` column are strings containing numbers. Which values are not numbers?

In [None]:
# ~ negates a Boolean Series; '.00' is treated as a regex, so '.' matches any character
salaries.loc[~salaries['Other Pay'].str.contains('.00')]

We can keep just the rows where the `'Other Pay'` is numeric, and then convert the `'Other Pay'` column to `float`.

In [None]:
salaries = salaries.loc[salaries['Other Pay'].str.contains('.00') == True]
salaries['Other Pay'] = salaries['Other Pay'].astype(float)
salaries.head()

The code above works, but it raises an error if you run it more than once. Why? 🤔

### Full-time vs. part-time

What happens when we use `normalize=True` with `value_counts`?

In [None]:
salaries['Status'].value_counts()

In [None]:
salaries['Status'].value_counts(normalize=True)

### Salary analysis

In [None]:
# Salary statistics
salaries.describe()

**Question:** Is `'Total Pay'` equal to the sum of `'Base Pay'`, `'Overtime Pay'`, and `'Other Pay'`?

We can answer this by summing the latter three columns and seeing if the resulting Series equals the former column.

In [None]:
salaries.loc[:, ['Base Pay', 'Overtime Pay', 'Other Pay']].sum(axis=1)

In [None]:
salaries['Total Pay']

In [None]:
(salaries.loc[:, ['Base Pay', 'Overtime Pay', 'Other Pay']].sum(axis=1) == salaries['Total Pay']).all()

Similarly, we might ask whether `'Total Pay & Benefits'` is truly the sum of `'Total Pay'` and `'Benefits'`.

In [None]:
(salaries.loc[:, ['Total Pay', 'Benefits']].sum(axis=1) == salaries.loc[:, 'Total Pay & Benefits']).all()
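
Exact `==` comparisons between floating-point sums can fail because of rounding, even when the values agree on paper. A hedged sketch on made-up numbers, using `np.isclose` to compare within a small tolerance instead (the column names are illustrative and are not those of `salaries`):

```python
import numpy as np
import pandas as pd

pay = pd.DataFrame({'Base': [0.1], 'Other': [0.2], 'Total': [0.3]})

summed = pay['Base'] + pay['Other']

# 0.1 + 0.2 is not exactly 0.3 in floating point, so == reports a mismatch.
exact_match = (summed == pay['Total']).all()

# np.isclose treats values within a small tolerance as equal.
close_match = np.isclose(summed, pay['Total']).all()
```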

### Visualization

In [None]:
salaries['Total Pay & Benefits'].plot(kind='hist', density=False, bins=20, ec='w');

In [None]:
salaries.plot(kind='scatter', x='Base Pay', y='Overtime Pay');

In [None]:
pd.plotting.scatter_matrix(salaries[['Base Pay', 'Overtime Pay', 'Total Pay']], figsize=(8, 8));

Think of your own questions about the dataset, and try and answer them!

## Summary, next time

### Summary

- `pandas` relies heavily on `numpy`. An understanding of how data types work in both will allow you to write more efficient and bug-free code.
- Series and DataFrames share many methods (refer to the [`pandas` documentation](https://pandas.pydata.org/docs/user_guide/index.html#user-guide) for more details).
- Most `pandas` methods return **copies** of Series/DataFrames. Be careful when using techniques that modify values in-place.
- **Next time:** How to work with **messy data**.