In [2]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# DataFrame Essentials

In this section we cover some of the essential functionality, methods, and attributes of pandas DataFrames.

We will use the `penguins` dataset we saw in the `Importing Data` section.

In [3]:
import pandas as pd
import numpy as np

penguins = pd.read_csv('data/examples/penguins.csv')

penguins.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female
3,Adelie,Torgersen,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female


## Attributes and underlying data

pandas objects have a number of attributes enabling you to access metadata about your dataset.

* **shape**: gives the axis dimensions of the object
* **Axis labels**
  * **Series**: *index* (this is the only axis, since Series are 1-dimensional)
  * **DataFrame**: *index* and *columns*


In [4]:
penguins.columns #DataFrame
penguins['species'].index # Series

[x.upper() for x in penguins.columns]

Index(['species', 'island', 'bill_length_mm', 'bill_depth_mm',
       'flipper_length_mm', 'body_mass_g', 'sex'],
      dtype='object')

RangeIndex(start=0, stop=344, step=1)

['SPECIES',
 'ISLAND',
 'BILL_LENGTH_MM',
 'BILL_DEPTH_MM',
 'FLIPPER_LENGTH_MM',
 'BODY_MASS_G',
 'SEX']
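The cell above shows the axis labels but not `shape`. A minimal sketch of all three attributes, using a small hypothetical frame (not the penguins data) so the values are easy to check:

```python
import pandas as pd

# Hypothetical three-row frame standing in for the penguins data
df = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap"],
    "body_mass_g": [3750.0, 5700.0, 3800.0],
})

frame_shape = df.shape              # (n_rows, n_columns) for a DataFrame
series_shape = df["species"].shape  # one-element tuple for a Series

# The shape agrees with the lengths of the two axis labels
print(frame_shape, series_shape, len(df.index), len(df.columns))
```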

pandas objects (`Index`, `Series`, `DataFrame`) can be thought of as containers for arrays which hold the actual data. For many datatypes, the underlying array is a `numpy.ndarray`. To extract the actual array from an `Index` or `Series` object, use the `.array` property.

You can also extract the data as a numpy array using the `to_numpy()` method. 

In [5]:
penguins.bill_length_mm.array
penguins['bill_depth_mm'].to_numpy()

<PandasArray>
[39.1, 39.5, 40.3,  nan, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0,
 ...
 46.2, 55.1, 44.5, 48.8, 47.2,  nan, 46.8, 50.4, 45.2, 49.9]
Length: 344, dtype: float64

array([18.7, 17.4, 18. ,  nan, 19.3, 20.6, 17.8, 19.6, 18.1, 20.2, 17.1,
       17.3, 17.6, 21.2, 21.1, 17.8, 19. , 20.7, 18.4, 21.5, 18.3, 18.7,
       19.2, 18.1, 17.2, 18.9, 18.6, 17.9, 18.6, 18.9, 16.7, 18.1, 17.8,
       18.9, 17. , 21.1, 20. , 18.5, 19.3, 19.1, 18. , 18.4, 18.5, 19.7,
       16.9, 18.8, 19. , 18.9, 17.9, 21.2, 17.7, 18.9, 17.9, 19.5, 18.1,
       18.6, 17.5, 18.8, 16.6, 19.1, 16.9, 21.1, 17. , 18.2, 17.1, 18. ,
       16.2, 19.1, 16.6, 19.4, 19. , 18.4, 17.2, 18.9, 17.5, 18.5, 16.8,
       19.4, 16.1, 19.1, 17.2, 17.6, 18.8, 19.4, 17.8, 20.3, 19.5, 18.6,
       19.2, 18.8, 18. , 18.1, 17.1, 18.1, 17.3, 18.9, 18.6, 18.5, 16.1,
       18.5, 17.9, 20. , 16. , 20. , 18.6, 18.9, 17.2, 20. , 17. , 19. ,
       16.5, 20.3, 17.7, 19.5, 20.7, 18.3, 17. , 20.5, 17. , 18.6, 17.2,
       19.8, 17. , 18.5, 15.9, 19. , 17.6, 18.3, 17.1, 18. , 17.9, 19.2,
       18.5, 18.5, 17.6, 17.5, 17.5, 20.1, 16.5, 17.9, 17.1, 17.2, 15.5,
       17. , 16.8, 18.7, 18.6, 18.4, 17.8, 18.1, 17

You can also convert an entire DataFrame to a `numpy.ndarray`.

In [6]:
penguins.to_numpy()

array([['Adelie', 'Torgersen', 39.1, ..., 181.0, 3750.0, 'Male'],
       ['Adelie', 'Torgersen', 39.5, ..., 186.0, 3800.0, 'Female'],
       ['Adelie', 'Torgersen', 40.3, ..., 195.0, 3250.0, 'Female'],
       ...,
       ['Gentoo', 'Biscoe', 50.4, ..., 222.0, 5750.0, 'Male'],
       ['Gentoo', 'Biscoe', 45.2, ..., 212.0, 5200.0, 'Female'],
       ['Gentoo', 'Biscoe', 49.9, ..., 213.0, 5400.0, 'Male']],
      dtype=object)
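Note the `dtype=object` in the output above: when columns have different types, the result has to fall back to a common type. A minimal sketch of this, assuming a small hypothetical frame and using `select_dtypes()` to keep only the numeric columns before converting:

```python
import numpy as np
import pandas as pd

# Hypothetical mixed-type frame standing in for the penguins data
df = pd.DataFrame({
    "species": ["Adelie", "Gentoo"],
    "bill_length_mm": [39.1, 50.0],
    "body_mass_g": [3750.0, 5700.0],
})

# Strings and floats together have no common numeric type, so we get object
mixed = df.to_numpy()

# Restricting to the numeric columns first gives a float64 array
numeric = df.select_dtypes(include="number").to_numpy()

print(mixed.dtype, numeric.dtype)
```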

## Column selection, addition, deletion

You can treat a `DataFrame` like a dictionary of `Series` objects. Getting, setting, and deleting columns works the same way as the analogous `dict` operations.

In [7]:
penguins['body_mass_g']
penguins['body_mass_kg'] = penguins['body_mass_g']/1000
penguins['body_mass_flag'] = penguins['body_mass_kg'] > 3.5
penguins

0      3750.0
1      3800.0
2      3250.0
3         NaN
4      3450.0
        ...  
339       NaN
340    4850.0
341    5750.0
342    5200.0
343    5400.0
Name: body_mass_g, Length: 344, dtype: float64

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,body_mass_kg,body_mass_flag
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male,3.75,True
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female,3.80,True
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female,3.25,False
3,Adelie,Torgersen,,,,,,,False
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female,3.45,False
...,...,...,...,...,...,...,...,...,...
339,Gentoo,Biscoe,,,,,,,False
340,Gentoo,Biscoe,46.8,14.3,215.0,4850.0,Female,4.85,True
341,Gentoo,Biscoe,50.4,15.7,222.0,5750.0,Male,5.75,True
342,Gentoo,Biscoe,45.2,14.8,212.0,5200.0,Female,5.20,True


Columns can be deleted, popped, or dropped.

In [8]:
flag = penguins.pop('body_mass_flag')
penguins = penguins.drop('body_mass_kg', axis = 1)
penguins

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female
3,Adelie,Torgersen,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female
...,...,...,...,...,...,...,...
339,Gentoo,Biscoe,,,,,
340,Gentoo,Biscoe,46.8,14.3,215.0,4850.0,Female
341,Gentoo,Biscoe,50.4,15.7,222.0,5750.0,Male
342,Gentoo,Biscoe,45.2,14.8,212.0,5200.0,Female
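The cell above uses `pop()` and `drop()`. Deleting with the `del` statement follows the same `dict`-style pattern. A minimal sketch on a small hypothetical frame:

```python
import pandas as pd

# Hypothetical frame with two derived columns to remove
df = pd.DataFrame({
    "body_mass_g": [3750.0, 5700.0],
    "body_mass_kg": [3.75, 5.70],
    "body_mass_flag": [True, True],
})

del df["body_mass_flag"]      # removes the column in place, returns nothing
kg = df.pop("body_mass_kg")   # removes the column in place and returns it as a Series

remaining = list(df.columns)
print(remaining)
```

Unlike these two, `drop()` returns a new `DataFrame`, which is why the cell above reassigns its result.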


By default, a new column is inserted at the end of the `DataFrame`. However, `DataFrame.insert` can be used to insert a column at a particular location.

In [9]:
penguins.insert(1, "body_mass_kg", penguins['body_mass_g']/1000)
penguins

Unnamed: 0,species,body_mass_kg,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,3.75,Torgersen,39.1,18.7,181.0,3750.0,Male
1,Adelie,3.80,Torgersen,39.5,17.4,186.0,3800.0,Female
2,Adelie,3.25,Torgersen,40.3,18.0,195.0,3250.0,Female
3,Adelie,,Torgersen,,,,,
4,Adelie,3.45,Torgersen,36.7,19.3,193.0,3450.0,Female
...,...,...,...,...,...,...,...,...
339,Gentoo,,Biscoe,,,,,
340,Gentoo,4.85,Biscoe,46.8,14.3,215.0,4850.0,Female
341,Gentoo,5.75,Biscoe,50.4,15.7,222.0,5750.0,Male
342,Gentoo,5.20,Biscoe,45.2,14.8,212.0,5200.0,Female


### Assigning new columns in chained operation

We often want to assign a new column within a chain of operations (which we will see more of later). Inspired by the R package `dplyr`, pandas provides the `assign()` method, which returns a new `DataFrame` with additional columns. The new columns can be derived from existing columns or computed by a callable such as a lambda function.

> A note on chaining: because most `pd.DataFrame` methods return another `pd.DataFrame`, we can call methods one after another in a chain, like so: `pd.DataFrame(...).join(...).query(...)`. A long chain on a single line is hard to read, so we can put each step on its own line, which reads more like SQL:

In [10]:
%%script false --no-raise-error

# Illustrative only: `df` and `df2` are placeholder DataFrames
df = df \
    .join(df2, how='left') \
    .query('sex == "Male"') \
    .assign(foo=1)

In [11]:
penguins = pd.read_csv('data/examples/penguins.csv') 

penguins \
    .assign(body_mass_kg = penguins['body_mass_g']/1000) \
    .query('body_mass_kg>5.0')

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,body_mass_kg
221,Gentoo,Biscoe,50.0,16.3,230.0,5700.0,Male,5.70
223,Gentoo,Biscoe,50.0,15.2,218.0,5700.0,Male,5.70
224,Gentoo,Biscoe,47.6,14.5,215.0,5400.0,Male,5.40
227,Gentoo,Biscoe,46.7,15.3,219.0,5200.0,Male,5.20
229,Gentoo,Biscoe,46.8,15.4,215.0,5150.0,Male,5.15
...,...,...,...,...,...,...,...,...
335,Gentoo,Biscoe,55.1,16.0,230.0,5850.0,Male,5.85
337,Gentoo,Biscoe,48.8,16.2,222.0,6000.0,Male,6.00
341,Gentoo,Biscoe,50.4,15.7,222.0,5750.0,Male,5.75
342,Gentoo,Biscoe,45.2,14.8,212.0,5200.0,Female,5.20
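As a stylistic alternative to backslash continuation, the whole chain can be wrapped in parentheses. This is only a sketch on a small hypothetical frame, not the convention used in the rest of this guide:

```python
import pandas as pd

# Hypothetical frame standing in for the penguins data
df = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Gentoo"],
    "body_mass_g": [3750.0, 5700.0, 5200.0],
})

# Parentheses let the expression span lines without trailing backslashes
heavy = (
    df
    .assign(body_mass_kg=df["body_mass_g"] / 1000)
    .query("body_mass_kg > 5.0")
)

print(heavy)
```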


We can also pass a callable, such as a lambda function, which receives the `DataFrame` as it exists at that point in the chain. This is handy when we don't have a reference to the dataset (which is common when assigning in a chain of operations like this).

`assign()` is quite powerful as it allows us to leverage *dependent* assignment, where an expression later in `assign()` can refer to a column that was created earlier within the same `assign()`.

In [12]:
penguins = pd.read_csv('data/examples/penguins.csv') \
    .assign(body_mass_kg = lambda x: x['body_mass_g'] / 1000,
            body_mass_flag = lambda x: x['body_mass_kg'] > 5) \
    .query("body_mass_flag & sex=='Male'") \
    .head(10)

penguins

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,body_mass_kg,body_mass_flag
221,Gentoo,Biscoe,50.0,16.3,230.0,5700.0,Male,5.7,True
223,Gentoo,Biscoe,50.0,15.2,218.0,5700.0,Male,5.7,True
224,Gentoo,Biscoe,47.6,14.5,215.0,5400.0,Male,5.4,True
227,Gentoo,Biscoe,46.7,15.3,219.0,5200.0,Male,5.2,True
229,Gentoo,Biscoe,46.8,15.4,215.0,5150.0,Male,5.15,True
231,Gentoo,Biscoe,49.0,16.1,216.0,5550.0,Male,5.55,True
233,Gentoo,Biscoe,48.4,14.6,213.0,5850.0,Male,5.85,True
235,Gentoo,Biscoe,49.3,15.7,217.0,5850.0,Male,5.85,True
237,Gentoo,Biscoe,49.2,15.2,221.0,6300.0,Male,6.3,True
239,Gentoo,Biscoe,48.7,15.1,222.0,5350.0,Male,5.35,True


## Indexing & Selection

The basics of indexing are as follows:

| Operation | Syntax | Result |
| --------- | ------ | ------ |
| Select column | `df[col]` | Series |
| Select row by label | `df.loc[label]` | Series |
| Select row by integer location | `df.iloc[loc]` | Series |
| Slice rows | `df[5:10]` | DataFrame |
| Select rows by boolean vector | `df[bool_mask]` | DataFrame |

Selecting a single row with `loc` or `iloc` returns a `Series` whose index is the columns of the `DataFrame`.


In [13]:
penguins.iloc[5]

species              Gentoo
island               Biscoe
bill_length_mm         49.0
bill_depth_mm          16.1
flipper_length_mm     216.0
body_mass_g          5550.0
sex                    Male
body_mass_kg           5.55
body_mass_flag         True
Name: 231, dtype: object
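Each row of the table above can be sketched on a small hypothetical frame. String row labels are used here (an assumption for illustration) so that label-based and position-based selection are easy to tell apart:

```python
import pandas as pd

# Hypothetical frame with string row labels
df = pd.DataFrame(
    {
        "species": ["Adelie", "Gentoo", "Chinstrap"],
        "body_mass_g": [3750.0, 5700.0, 3800.0],
    },
    index=["a", "b", "c"],
)

col = df["species"]                     # select column -> Series
by_label = df.loc["b"]                  # select row by label -> Series
by_pos = df.iloc[1]                     # select row by integer location -> Series
sliced = df[0:2]                        # slice rows -> DataFrame
masked = df[df["body_mass_g"] > 3775]   # select rows by boolean vector -> DataFrame

print(type(col).__name__, type(by_label).__name__, type(sliced).__name__)
```

Here `df.loc["b"]` and `df.iloc[1]` pick the same row, once by its label and once by its position.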

## Casting to different `dtypes`

We often need to change the datatype of a column. There are a few ways to do this in pandas. Generally, casting between different types is pretty simple. However, there are often many gotchas that can make the process frustrating. We will highlight the simple cases here and then cover the more complex situations later in this guide.

There are 4 main options for converting types in pandas:

1. `to_numeric()` - provides functionality to safely convert non-numeric types (e.g. strings) to a suitable numeric type.
   * The functions `to_datetime()` and `to_timedelta()` operate similarly, but for `datetime` and `timedelta` types.
2. `astype()` - converts (almost) any type to (almost) any other type.
   * This will attempt the conversion even when the requested conversion may not make sense.
   * It allows you to convert to categorical data, which is very useful.
3. `infer_objects()` - a utility method that will convert object columns to a native pandas type (if possible).
4. `convert_dtypes()` - converts `DataFrame` columns to the "best possible" dtype that supports `pd.NA`.


### `to_numeric()`

The best way to convert one or more columns of a DataFrame to numeric values is to use `pandas.to_numeric()`.

This function will try to change non-numeric objects (such as strings) into integers or floating-point numbers as appropriate.

`to_numeric()` takes one-dimensional input, such as a `Series` or a single column of a `DataFrame`.

In [14]:
s = pd.Series(["8", 6, "7.5", 3, "0.9"])
s

pd.to_numeric(s)

0      8
1      6
2    7.5
3      3
4    0.9
dtype: object

0    8.0
1    6.0
2    7.5
3    3.0
4    0.9
dtype: float64
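Because `to_numeric()` works on one column at a time, converting several columns means applying it to each of them. A minimal sketch, assuming a hypothetical frame whose numeric columns were read in as strings. The `errors="coerce"` argument is not covered above: it turns values that cannot be parsed into `NaN` instead of raising an error.

```python
import pandas as pd

# Hypothetical frame whose numeric columns arrived as strings
df = pd.DataFrame({
    "bill_length_mm": ["39.1", "39.5", "n/a"],
    "body_mass_g": ["3750", "3800", "3250"],
})

# apply() passes each column to to_numeric as a Series;
# errors="coerce" replaces unparseable values such as "n/a" with NaN
converted = df.apply(pd.to_numeric, errors="coerce")

print(converted.dtypes)
```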

> By default, conversion with `to_numeric()` will give you either an `int64` or `float64` dtype. This is usually what you want. However, `to_numeric()` also gives you the option to "downcast" your datatype to a more compact dtype like `float32` or `int8`.

In [15]:
s = pd.Series([1, 2, -7])
s

pd.to_numeric(s, downcast='integer')

0    1
1    2
2   -7
dtype: int64

0    1
1    2
2   -7
dtype: int8

### `astype()`

The `astype()` method enables you to be explicit about the dtype you want your DataFrame or Series to have. It's very versatile in that you can try and go from one type to any other.

Just pick a type: you can use a NumPy dtype (e.g. `np.int16`), some Python types (e.g. bool), or pandas-specific types (like the categorical dtype).

Call the method on the object you want to convert and `astype()` will try and convert it for you:

In [16]:
df = pd.DataFrame({
    'a': [7.0, 5.0, 11.0],
    'b': [1+1j, 5, 3.4 + 1j]
})

# convert all DataFrame columns to the int64 dtype
df = df.astype(int)

# convert column "a" to int64 dtype and "b" to complex type
df = df.astype({"a": int, "b": complex})

# convert Series to float16 type
s = s.astype(np.float16)

# convert Series to Python strings
s = s.astype(str)

# convert Series to categorical type - see docs for more details
s = s.astype('category')



*NOTE*: `astype()` will try to convert the data. If it does not know how to convert a value in the `Series` or `DataFrame`, it will raise an error. For example, if you have `NaN` or `inf` values, you'll get an error when trying to convert to an integer.

There are a few ways around this which we will explore in the exercises.
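As a preview, here is a minimal sketch of the failure and two possible workarounds. It does not cover every case, and the choice of `0` as the fill value is purely illustrative:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

# NaN has no int64 representation, so the plain cast raises
try:
    s.astype(int)
    failed = False
except ValueError:
    failed = True

# Option 1: replace the missing values first, then cast
filled = s.fillna(0).astype(int)

# Option 2: use the nullable integer dtype, which keeps the gap as <NA>
nullable = s.astype("Int64")

print(failed, filled.tolist(), nullable.dtype)
```

Note the capital `I` in `"Int64"`: that is the pandas nullable integer dtype, as opposed to NumPy's `int64`.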


### `infer_objects()`

> This section coming soon
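In the meantime, a minimal sketch of the behaviour described in the list of conversion options above, assuming a hypothetical frame whose columns were deliberately created with the `object` dtype:

```python
import pandas as pd

# Hypothetical frame: integers stored in an object column
df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]}, dtype=object)

# infer_objects() returns a new frame with a more specific dtype where one fits
inferred = df.infer_objects()

print(df["a"].dtype, "->", inferred["a"].dtype)
```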


### `convert_dtypes()`

> This section coming soon
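In the meantime, a minimal sketch of the behaviour described in the list of conversion options above, assuming a small hypothetical frame with missing values:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: whole-number floats with a gap, and booleans stored as objects
df = pd.DataFrame({
    "count": [1.0, 2.0, np.nan],
    "flag": pd.Series([True, False, np.nan], dtype=object),
})

# convert_dtypes() picks dtypes that can represent missing values as pd.NA
converted = df.convert_dtypes()

print(converted.dtypes)
```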