## Accessing Data in Pandas

In [1]:
import pandas as pd
import numpy as np

# initialize some sample data
dates = pd.date_range('20130101', periods=6)
random_df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
random_df_original = random_df.copy()


## Getting Specific Data

Below is a cheatsheet of the different ways to access data.

| Operation                                 | Syntax                            | Result    |
|-------------------------------------------|-----------------------------------|-----------|
| Select column                             | df[col]                           | Series    |
| Select row by label                       | df.loc[label]                     | Series    |
| Select row by integer location            | df.iloc[loc]                      | Series    |
| Select row AND column by label            | df.loc[row label, column label]   | Depends   |
| Select row AND column by integer location | df.iloc[row int, column int]      | Depends   |
| Slice rows                                | df[5:10]                          | DataFrame |
| Select rows by boolean vector             | df[bool_vec]                      | DataFrame |


The datatype of the output is determined automatically based on the output's dimensions. 

For example, if we index multiple rows and columns, the object remains a DataFrame: it's still 2D.

If we index only a single row, or a single column, we get a Series: it's 1D now.

If we index exactly one row and one column, we get a single value, so its type is whatever is stored there (e.g. int, float, str).

Two edge cases are shown in the last two rows of the table. A row slice always returns a DataFrame, even when it covers exactly one row, as in `df[0:1]`. Selecting rows with a boolean vector also always returns a DataFrame, even when the DataFrame has only one column.
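The shape rule above can be checked directly. The sketch below builds its own small frame (the name `demo_df` and its values are made up for illustration) and prints the type returned by each kind of selection.

```python
import numpy as np
import pandas as pd

# Small illustrative frame; `demo_df` is a made-up name for this sketch.
dates = pd.date_range('20130101', periods=6)
demo_df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list('ABCD'))

two_d = demo_df.loc['2013-01-01':'2013-01-02', ['A', 'B']]  # several rows and columns
one_d = demo_df.loc['2013-01-01']                            # a single row
scalar = demo_df.loc['2013-01-01', 'A']                      # one row and one column
one_row_slice = demo_df[0:1]                                 # a slice, even of length 1

print(type(two_d).__name__)          # DataFrame
print(type(one_d).__name__)          # Series
print(type(scalar).__name__)         # float64
print(type(one_row_slice).__name__)  # DataFrame
```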

### Select Columns With `[ ]`

In [None]:
random_df['A']

You can also index multiple columns by passing a list. Note that the syntax here is `df[['A', 'B']]`, with list brackets inside the index brackets, not `df['A', 'B']`.

In [None]:
random_df[['A', 'B', 'D']]
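To see why the inner list brackets matter, here is a minimal sketch on a made-up frame (`demo_df` is an illustrative name). A plain label returns a Series, a one-element list returns a one-column DataFrame, and a bare `'A', 'B'` is treated as a single tuple key, which does not exist as a column.

```python
import pandas as pd

# Illustrative data only.
demo_df = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'D': [5, 6]})

as_series = demo_df['A']    # plain label -> Series
as_frame = demo_df[['A']]   # one-element list -> DataFrame with one column

try:
    demo_df['A', 'B']       # read as the single key ('A', 'B')
    key_error_raised = False
except KeyError as err:
    key_error_raised = True
    print(f"KeyError: {err}")
```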

> Note: It is also possible to index columns using the syntax `df.A` which is equivalent to `df['A']`.

> However, it is not best practice to do this because of possible confusion with DataFrame methods (example below).

In [None]:
random_df['rank'] = range(1,7) # new column called rank with values 1-6

print((random_df['A'] == random_df.A).all()) # are these the same? Yes

print((random_df['rank'] == random_df.rank).all()) # are these the same? No

print(f"What's in random_df.rank? {type(random_df.rank)}")

> `df.rank` and `df['rank']` are not the same, because `.rank` is the name of a DataFrame method (like `.sort_values()` or similar), and attribute access returns the method rather than the column.

> The `df['rank']` syntax is safer, so it is preferable.

### Slicing Rows with `[ ]`

Square brackets can also slice rows.
- If you pass a single string or a list of strings, they are interpreted as column names.
- If you pass a slice (i.e. with `:`), it slices the rows.

Slicing a DataFrame with `[ ]` follows the same rules as slicing a NumPy array, except that you can only slice rows, not rows and columns together.

In [None]:
random_df[0:2]

# This one raises an error
# random_df[0:2, 0:2]

You can also slice using index labels. Interestingly, this also includes the final 'stop' label (`2013-01-03` here), while numerical slices do not.

In [None]:
random_df['2013-01-01':'2013-01-03']
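The difference in how the 'stop' value is treated is easy to confirm by counting rows. This is a small sketch on a made-up frame with the same date index as above (`demo_df` is an illustrative name).

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
demo_df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list('ABCD'))

by_position = demo_df[0:2]                     # positions 0 and 1; stop is excluded
by_label = demo_df['2013-01-01':'2013-01-03']  # Jan 1 through Jan 3; stop is included

print(len(by_position))  # 2
print(len(by_label))     # 3
```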

### Selection by Label with `.loc`

`.loc` and `.iloc` are the preferred ways to access specific data because they are flexible and can select specific rows AND specific columns in a single call.

Select first row based on index label:

In [None]:
random_df.loc["2013-01-01"]

Select multiple columns by name:

The `:` in the row position means we want all the rows, and the list `['A', 'B']` in the column position names the columns to keep.

In [None]:
random_df.loc[:, ['A', 'B']]

The `:` in the row position can be replaced with a label slice to select a range of rows.

The columns can also be given as a slice instead of a list, which returns every column from the 'start' label through the 'stop' label, inclusive.

In [None]:
random_df.loc['2013-01-01':'2013-01-03', 'A':'D'] # row slice, then column slice

Remember that the object types here depend on the shape of the output. For the above, we had dataframes.

Indexing a single row (or column), however, returns a Series.

In [None]:
random_df.loc['20130102', ['A', 'B']]

And accessing a single value is just that value's type.

In [None]:
random_df.loc['20130102', 'A']

Because `.loc` accepts single labels, lists of labels, label slices, and boolean vectors in both the row and column positions, it can express most subsets of the data you are likely to need.

### Selection by Numeric Index with `.iloc`

`.iloc` is similar to `.loc`, but selects by integer position rather than by label.

Indexing with `.iloc` works much like indexing NumPy arrays (or Python lists).

Example: select the 4th row (position 3, because positions start at 0).

In [None]:
random_df.iloc[3]

Slice specific rows and columns.

In [None]:
random_df.iloc[3:5, 0:2]

Slice specific columns with all rows.

In [None]:
random_df.iloc[:, 0:2]
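The similarity to NumPy indexing mentioned at the start of this section can be checked side by side. The sketch below uses made-up values (`values` and `demo_df` are illustrative names) and applies the same positional slice to a DataFrame and to the array it was built from.

```python
import numpy as np
import pandas as pd

values = np.arange(24.0).reshape(6, 4)
dates = pd.date_range('20130101', periods=6)
demo_df = pd.DataFrame(values, index=dates, columns=list('ABCD'))

# The same positional slice applied to the DataFrame and to the array.
frame_block = demo_df.iloc[3:5, 0:2]
array_block = values[3:5, 0:2]
print(np.array_equal(frame_block.to_numpy(), array_block))  # True

# Negative positions count from the end, as in NumPy and Python lists.
last_row = demo_df.iloc[-1]
print(last_row.name)  # 2013-01-06 00:00:00
```

Unlike the raw array, the `.iloc` result keeps its row and column labels.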

### Selection by Data Type with `select_dtypes()`

The `select_dtypes()` method implements subsetting of columns based on their datatype.

We pass a list of datatypes to include, and the method returns only the columns whose datatype is in that list.

In [None]:
dtypes_df = pd.DataFrame({'string': list('abc'),
                          'int64': list(range(1, 4)),
                          'uint8': np.arange(3, 6).astype('u1'),
                          'float64': np.arange(4.0, 7.0),
                          'bool1': [True, False, True],
                          'bool2': [False, True, False],
                          'dates': pd.date_range('now', periods=3),
                          'category': pd.Series(list("ABC")).astype('category')})



In [None]:
dtypes_df.select_dtypes(include=[bool]) # just the `bool` columns
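`select_dtypes()` also takes an `exclude` argument, and the string `'number'` stands for all numeric dtypes. The sketch below uses a made-up frame (`mixed_df` is an illustrative name) to show both; note that boolean columns do not count as numeric here.

```python
import pandas as pd

# Illustrative data only.
mixed_df = pd.DataFrame({
    'name': ['a', 'b', 'c'],
    'count': [1, 2, 3],
    'score': [1.5, 2.5, 3.5],
    'flag': [True, False, True],
})

numeric_cols = mixed_df.select_dtypes(include='number')      # ints and floats, not bools
non_numeric_cols = mixed_df.select_dtypes(exclude='number')  # everything else

print(list(numeric_cols.columns))      # ['count', 'score']
print(list(non_numeric_cols.columns))  # ['name', 'flag']
```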

### Boolean Indexing 

We can also use Boolean indexes to filter data based on a condition.

Below, we use `>= 0` to give us a Boolean vector corresponding to the values of `A` greater than or equal to 0.

In [None]:
random_df['A'] >= 0

If we filter `random_df` using this boolean vector (sometimes called a 'mask'), we keep only the rows where the mask is `True`, i.e. where `A` is >= 0.

In [None]:
random_df[random_df['A'] >= 0]
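Masks can also be combined, and a mask can be used in the row position of `.loc` to filter rows and pick columns in one step. The following is a minimal sketch on made-up data (`demo_df` and its values are illustrative).

```python
import pandas as pd

# Illustrative data only.
demo_df = pd.DataFrame({'A': [-1.0, 0.5, 2.0, -0.3],
                        'B': [10, 20, 30, 40]})

# Each comparison needs its own parentheses because & binds tighter than >= and <.
both = demo_df[(demo_df['A'] >= 0) & (demo_df['B'] < 30)]
print(both.index.tolist())  # [1]

# A mask in the row position of .loc, plus a column label, filters and selects at once.
b_where_a_nonneg = demo_df.loc[demo_df['A'] >= 0, 'B']
print(b_where_a_nonneg.tolist())  # [20, 30]
```

Use `&` for "and" and `|` for "or"; Python's `and` and `or` keywords do not work element-wise on a Series.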

We can also use `isin()` to check for specific values.

In [None]:
random_df['E'] = ['one', 'two', 'three', 'two', 'five', 'ten'] # make a new column 'E' with these values
random_df

In [None]:
random_df['E'].isin(['two', 'four']) # check if the values are 'two' or 'four'

In [None]:
random_df[random_df['E'].isin(['two', 'four'])] # filter for rows with 'two' or 'four' in 'E'
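A mask can be inverted with `~` to keep the rows that do not match. This short sketch reuses the same example values in a standalone frame (`demo_df` is an illustrative name).

```python
import pandas as pd

demo_df = pd.DataFrame({'E': ['one', 'two', 'three', 'two', 'five', 'ten']})

mask = demo_df['E'].isin(['two', 'four'])
kept = demo_df[~mask]  # ~ flips True and False, keeping rows NOT in the list

print(kept['E'].tolist())  # ['one', 'three', 'five', 'ten']
```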

### Setting Values

As shown above when we made a new column 'E', we can also set values in the DataFrame.

If we index a location of the dataframe that already exists, we will replace the current values with new ones.

If we do something like specify a column that doesn't exist yet, we'll create that column!

We can index the DataFrame using any of the previously seen methods, e.g. `.loc` or `.iloc`.

You can also set multiple values at once: a single value assigned to a larger selection is 'broadcast' to every selected cell.

In [None]:
random_df.iloc[0, 1] = 2 # set value at index [0,1] to 2
random_df.loc['2013-01-02':'2013-01-04', 'B'] = -20 # replace all values of B from rows '2013-01-02' to '2013-01-04' with -20
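To make the effect of these two assignments visible, here is the same pair applied to a made-up frame of zeros (`demo_df` is an illustrative name), so that every changed cell stands out.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
demo_df = pd.DataFrame(np.zeros((6, 2)), index=dates, columns=['A', 'B'])

demo_df.iloc[0, 1] = 2                             # one cell
demo_df.loc['2013-01-02':'2013-01-04', 'B'] = -20  # one scalar, three cells

print(demo_df['B'].tolist())  # [2.0, -20.0, -20.0, -20.0, 0.0, 0.0]
```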

We can also use arrays (or lists) to replace sections of the dataframe.

> Note 1: The dimensions need to line up to do this! You can't put a list or array of length 5 in a slice of length 6.

> Note 2: If using an object like a list that can store multiple dtypes - make sure the datatypes are compatible with the column.

In [None]:
random_df.loc[:, 'C'] = [5, 1, 3, 5, 6, 8] # replace all rows in column C with this list.

In this example, we're working entirely with column `C`, which is numeric, so all the list values should be numeric.
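The length requirement from Note 1 can be demonstrated with a quick sketch on a made-up frame (`demo_df` and `mismatch_rejected` are illustrative names). Assigning five values to six rows is rejected with a `ValueError`, while a list of the right length is accepted.

```python
import numpy as np
import pandas as pd

demo_df = pd.DataFrame(np.zeros((6, 2)), columns=['B', 'C'])

try:
    demo_df.loc[:, 'C'] = [5, 1, 3, 5, 6]  # only 5 values for 6 rows
    mismatch_rejected = False
except ValueError as err:
    mismatch_rejected = True
    print(f"ValueError: {err}")

demo_df.loc[:, 'C'] = [5, 1, 3, 5, 6, 8]   # 6 values for 6 rows
print(demo_df['C'].tolist())
```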

Again, we can also make new columns by using square brackets.

In [None]:
random_df['F'] = np.random.randn(6) # new column of random values, one per row
random_df['G'] = 10 # new column of all 10s

For reference, we saved a copy of the original `random_df` at the start of this walkthrough.

You can check through below to see all the changes we've made - see if you can account for each difference.

In [None]:
random_df

In [None]:
random_df_original
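One possible way to check your answer programmatically is sketched below. It uses a made-up pair of frames (`original` and `modified` are illustrative names) rather than the ones from this walkthrough, and it assumes there are no missing values, since `NaN != NaN` would be counted as a difference.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
original = pd.DataFrame(np.zeros((6, 4)), index=dates, columns=list('ABCD'))
modified = original.copy()

modified.iloc[0, 1] = 2  # change one existing value
modified['G'] = 10       # add one new column

# Columns present only in the modified frame.
added_columns = modified.columns.difference(original.columns).tolist()
print(added_columns)  # ['G']

# For each shared column, count how many cells differ from the original.
changed = (modified[original.columns] != original).sum()
print(changed.to_dict())  # {'A': 0, 'B': 1, 'C': 0, 'D': 0}
```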