# Pandas 1.0

## About Me

* Ted Petrou
* Founder of Dunder Data
* Books
    * [Master Data Analysis with Python][0]
    * [Master Machine Learning with Python][1]

[0]: https://www.dunderdata.com/master-data-analysis-with-python
[1]: https://www.dunderdata.com/master-machine-learning-with-python

### Versioning
The previous version was 0.25. Starting with 1.0, pandas follows a loose variant of semantic versioning:
* Deprecations will be introduced in minor releases (e.g. 1.1.0, 1.2.0, 2.1.0, …)
* Deprecations will be enforced in major releases (e.g. 1.0.0, 2.0.0, 3.0.0, …)
* API-breaking changes will be made only in major releases (except for experimental features)

### Summary 

* Good news - few changes
* Bad news - few changes

### Major changes

* New missing value `pd.NA`
* Nullable integer
* Nullable boolean
* Dedicated string data type

Find the changes for each version in the [release notes section][2]; the 1.0 changes are in the [1.0.0 release notes][3].


[2]: https://pandas.pydata.org/docs/whatsnew/index.html
[3]: https://pandas.pydata.org/docs/whatsnew/v1.0.0.html

In [1]:
import pandas as pd
import numpy as np
pd.__version__

'1.0.1'

In [2]:
pd.NA == pd.NA

<NA>

In [3]:
None == None

True

In [4]:
np.nan == np.nan

False

In [5]:
pd.NA > 5

<NA>

## Named Aggregations (0.25)

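Since 0.25, you can name the output columns directly in `agg` instead of renaming them afterwards. Pass each new column as a keyword argument whose value is a `(column, function)` tuple.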

```python
df.groupby('grouping column').agg(new_column=('aggregating column', 'aggregating function'))
```
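A minimal, self-contained sketch of the pattern above. The small DataFrame and its values are made up for illustration; only the column names mirror the college data.

```python
import pandas as pd

# Hypothetical toy data, not the real college dataset
df = pd.DataFrame({
    'state': ['AL', 'AL', 'AK'],
    'population': [100, 300, 50],
    'sat_math': [500, 600, 550],
})

# Each keyword becomes an output column: name=(input column, function)
result = df.groupby('state').agg(
    mean_pop=('population', 'mean'),
    max_sat=('sat_math', 'max'),
)
print(result)
```

The output columns arrive already named `mean_pop` and `max_sat`, so no separate renaming step is needed.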

In [7]:
college = pd.read_csv('data/college.csv')
college.head(3)

Unnamed: 0,name,state,distance,population,sat_verbal,sat_math,age_25_above
0,Alabama A & M University,AL,0.0,4206.0,424.0,420.0,0.1049
1,University of Alabama at Birmingham,AL,0.0,11383.0,570.0,565.0,0.2422
2,Amridge University,AL,1.0,291.0,,,0.854


### Old syntax

In [10]:
c1 = college.groupby('state').agg({'population': 'mean'}).head(3)
c1

state,population
AK,2493.2
AL,2789.865169
AR,1644.146341


In [11]:
c1.columns = ['mean population']
c1

state,mean population
AK,2493.2
AL,2789.865169
AR,1644.146341


### New syntax

In [13]:
college.groupby('state').agg(mean_pop=('population', 'mean'),
                             max_sat=('sat_math', 'max')).head(3)

state,mean_pop,max_sat
AK,2493.2,503.0
AL,2789.865169,590.0
AR,1644.146341,600.0


## New missing value - pd.NA

* All new data types use `pd.NA` for all missing values
* Comparisons evaluate differently than `np.nan`
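A short sketch of how the two missing values differ, assuming pandas 1.0 or later. `pd.NA` propagates through comparisons and follows three-valued logic in boolean operations, while a comparison with `np.nan` simply returns `False`.

```python
import numpy as np
import pandas as pd

# Comparisons with pd.NA propagate the missing value
na_eq = pd.NA == 1          # <NA>
nan_eq = np.nan == 1        # False

# Boolean operations follow three-valued logic:
# the result is known only when the other operand decides it
or_true = pd.NA | True      # True
and_false = pd.NA & False   # False
and_true = pd.NA & True     # <NA>

# pd.NA has no truth value, so bool() raises TypeError
try:
    bool(pd.NA)
    raised = False
except TypeError:
    raised = True

print(na_eq, nan_eq, or_true, and_false, and_true, raised)
```

Because `pd.NA` has no truth value, code such as `if value > 5:` fails loudly when `value` is missing instead of silently taking the `False` branch.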

In [15]:
a = pd.NA > 5

In [16]:
type(a)

pandas._libs.missing.NAType

In [17]:
a is pd.NA

True

In [None]:
np.nan

In [None]:
None

## Nullable Integer

* released in 0.24
* pandas-only data type
* original integer data type from numpy - no missing values allowed
* uses new pd.NA
* convert with `astype('Int64')` - that's capital `I`
* experimental
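The difference between the numpy and pandas integer types can be sketched on a tiny Series. The values here are illustrative only.

```python
import numpy as np
import pandas as pd

s = pd.Series([4206.0, 11383.0, np.nan])

# The numpy integer type cannot hold a missing value, so this cast fails
try:
    s.astype('int64')
    numpy_cast_failed = False
except ValueError:
    numpy_cast_failed = True

# The pandas nullable type (capital "I") stores the gap as pd.NA
s_int = s.astype('Int64')
print(numpy_cast_failed)
print(s_int)
```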

In [19]:
college.head(3)

Unnamed: 0,name,state,distance,population,sat_verbal,sat_math,age_25_above
0,Alabama A & M University,AL,0.0,4206.0,424.0,420.0,0.1049
1,University of Alabama at Birmingham,AL,0.0,11383.0,570.0,565.0,0.2422
2,Amridge University,AL,1.0,291.0,,,0.854


In [20]:
college.dtypes

name             object
state            object
distance        float64
population      float64
sat_verbal      float64
sat_math        float64
age_25_above    float64
dtype: object

In [23]:
college['population_int'] = college['population'].astype('Int64')

In [24]:
college.head(3)

Unnamed: 0,name,state,distance,population,sat_verbal,sat_math,age_25_above,population_int
0,Alabama A & M University,AL,0.0,4206.0,424.0,420.0,0.1049,4206
1,University of Alabama at Birmingham,AL,0.0,11383.0,570.0,565.0,0.2422,11383
2,Amridge University,AL,1.0,291.0,,,0.854,291


In [25]:
college.dtypes

name               object
state              object
distance          float64
population        float64
sat_verbal        float64
sat_math          float64
age_25_above      float64
population_int      Int64
dtype: object

### `values` attribute

* returns a pandas array for the new data types
* use `to_numpy` to get a numpy array
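A small sketch of the two bullets above, using a made-up Series. Passing `dtype` and `na_value` to `to_numpy` is one way to get a float array back instead of an object array; whether that is the right choice depends on what the downstream code expects.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, None], dtype='Int64')

# .values hands back the pandas extension array, not a numpy array
arr_pd = s.values
print(type(arr_pd).__name__)

# to_numpy with an explicit dtype and na_value gives a plain float array
arr_np = s.to_numpy(dtype='float64', na_value=np.nan)
print(arr_np)
```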

In [35]:
a_np = college['population'].values

In [36]:
a_pd = college['population_int'].values

In [37]:
a_np + a_pd

array([ 8412., 22766.,   582., ...,    nan,    nan,    nan])

In [30]:
type(a_pd)

pandas.core.arrays.integer.IntegerArray

In [33]:
college['population_int'].to_numpy()

array([4206, 11383, 291, ..., <NA>, <NA>, <NA>], dtype=object)

### Downside - cannot filter data with it

* No boolean selection while the mask contains `pd.NA`
* No `query` method
* Might change in a later release
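The workaround used in the cells that follow can be sketched on a toy DataFrame (hypothetical values): fill the missing entries of the mask with `False` before selecting rows.

```python
import pandas as pd

df = pd.DataFrame({
    'population': pd.array([4206, 11383, None], dtype='Int64'),
})

# Comparing a nullable integer column yields a nullable boolean mask
mask = df['population'] > 5_000
print(mask)

# Treat "unknown" as False so the mask has no missing values left
clean_mask = mask.fillna(False)
result = df[clean_mask]
print(result)
```

Filling with `False` drops rows whose value is unknown, which matches how a float column with `np.nan` behaves under the same comparison.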

In [39]:
college.head(3)

Unnamed: 0,name,state,distance,population,sat_verbal,sat_math,age_25_above,population_int
0,Alabama A & M University,AL,0.0,4206.0,424.0,420.0,0.1049,4206
1,University of Alabama at Birmingham,AL,0.0,11383.0,570.0,565.0,0.2422,11383
2,Amridge University,AL,1.0,291.0,,,0.854,291


In [40]:
college['population_int'].isna().sum()

661

In [41]:
filt = college['population_int'] > 5_000
filt.head(3)

0    False
1     True
2    False
Name: population_int, dtype: boolean

In [43]:
filt.tail()

7530    <NA>
7531    <NA>
7532    <NA>
7533    <NA>
7534    <NA>
Name: population_int, dtype: boolean

In [46]:
college[filt.fillna(False)].head(2)

Unnamed: 0,name,state,distance,population,sat_verbal,sat_math,age_25_above,population_int
1,University of Alabama at Birmingham,AL,0.0,11383.0,570.0,565.0,0.2422,11383
3,University of Alabama in Huntsville,AL,0.0,5451.0,595.0,590.0,0.264,5451


In [51]:
np.nan == 500

False

In [52]:
college.query('population > 5_000').head(3)

Unnamed: 0,name,state,distance,population,sat_verbal,sat_math,age_25_above,population_int
1,University of Alabama at Birmingham,AL,0.0,11383.0,570.0,565.0,0.2422,11383
3,University of Alabama in Huntsville,AL,0.0,5451.0,595.0,590.0,0.264,5451
5,The University of Alabama,AL,0.0,29851.0,555.0,565.0,0.0853,29851


## Dedicated string data type

* pandas-only data type
* By default, strings are read in as the numpy 'object' data type, which can contain anything (bad)
* only contains strings
* convert with `astype('string')`
* uses pd.NA
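A brief sketch contrasting the two types on made-up values. The `obj` and `s` names are only for illustration.

```python
import pandas as pd

# An object column happily mixes strings, numbers and None
obj = pd.Series(['AL', 1232, None])
print(obj.dtype)

# The dedicated string type represents the gap as pd.NA
s = pd.Series(['AL', 'ak', None], dtype='string')
print(s)

# str methods work as before and return nullable results
upper = s.str.upper()
lengths = s.str.len()
print(upper)
print(lengths)
```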

In [53]:
college.dtypes

name               object
state              object
distance          float64
population        float64
sat_verbal        float64
sat_math          float64
age_25_above      float64
population_int      Int64
dtype: object

In [54]:
college['state_str'] = college['state'].astype('string')
college.head(3)

Unnamed: 0,name,state,distance,population,sat_verbal,sat_math,age_25_above,population_int,state_str
0,Alabama A & M University,AL,0.0,4206.0,424.0,420.0,0.1049,4206,AL
1,University of Alabama at Birmingham,AL,0.0,11383.0,570.0,565.0,0.2422,11383,AL
2,Amridge University,AL,1.0,291.0,,,0.854,291,AL


In [67]:
college.loc[0, 'state'] = None
college.loc[1, 'state'] = np.nan
college.loc[2, 'state'] = pd.NA
college.head(3)

Unnamed: 0,name,state,distance,population,sat_verbal,sat_math,age_25_above,population_int,state_str
0,Alabama A & M University,,0.0,4206.0,424.0,420.0,0.1049,4206,
1,University of Alabama at Birmingham,,0.0,11383.0,570.0,565.0,0.2422,11383,AL
2,Amridge University,,1.0,291.0,,,0.854,291,AL


In [68]:
college.loc[0, 'population'] = None
college.loc[1, 'population'] = np.nan
college.loc[2, 'population'] = pd.NA
college.head(3)

Unnamed: 0,name,state,distance,population,sat_verbal,sat_math,age_25_above,population_int,state_str
0,Alabama A & M University,,0.0,,424.0,420.0,0.1049,4206,
1,University of Alabama at Birmingham,,0.0,,570.0,565.0,0.2422,11383,AL
2,Amridge University,,1.0,,,,0.854,291,AL


In [70]:
college['population'].values

array([nan, nan, <NA>, ..., nan, nan, nan], dtype=object)

In [65]:
college.loc[0, 'state_str'] = np.nan

In [66]:
college.head(2)

Unnamed: 0,name,state,distance,population,sat_verbal,sat_math,age_25_above,population_int,state_str
0,Alabama A & M University,1232,0.0,4206.0,424.0,420.0,0.1049,4206,
1,University of Alabama at Birmingham,AL,0.0,11383.0,570.0,565.0,0.2422,11383,AL


### Same `str` accessor methods available

In [61]:
college['state_str'].str.lower().head()

0    al
1    al
2    al
3    al
4    al
Name: state_str, dtype: string

### Use categorical instead

In [62]:
college['state'].astype('category')

0       1232
1         AL
2         AL
3         AL
4         AL
        ... 
7530      CA
7531      KS
7532      OH
7533      CA
7534      TX
Name: state, Length: 7535, dtype: category
Categories (60, object): [1232, AK, AL, AR, ..., WA, WI, WV, WY]

## Nullable Boolean

* pandas-only data type
* original boolean data type from numpy - no missing values allowed
* convert with `astype('boolean')`
* uses pd.NA
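A minimal sketch of the nullable boolean type on a made-up Series, showing that logical operations apply the same three-valued logic as `pd.NA` itself.

```python
import pandas as pd

b = pd.Series([True, False, None], dtype='boolean')
print(b)

# The missing entry resolves only when the other operand decides the result
or_true = b | True      # True, True, True
and_false = b & False   # False, False, False
and_true = b & True     # True, False, <NA>

print(or_true.tolist(), and_false.tolist(), and_true.tolist())
```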

In [76]:
college['distance'].astype('boolean')

0       False
1       False
2        True
3       False
4       False
        ...  
7530     <NA>
7531     <NA>
7532     <NA>
7533     <NA>
7534     <NA>
Name: distance, Length: 7535, dtype: boolean

## Convert all with `convert_dtypes`

In [77]:
college = pd.read_csv('data/college.csv').convert_dtypes()
college.head()

Unnamed: 0,name,state,distance,population,sat_verbal,sat_math,age_25_above
0,Alabama A & M University,AL,0,4206,424.0,420.0,0.1049
1,University of Alabama at Birmingham,AL,0,11383,570.0,565.0,0.2422
2,Amridge University,AL,1,291,,,0.854
3,University of Alabama in Huntsville,AL,0,5451,595.0,590.0,0.264
4,Alabama State University,AL,0,4811,425.0,430.0,0.127


In [78]:
college.dtypes

name             string
state            string
distance          Int64
population        Int64
sat_verbal        Int64
sat_math          Int64
age_25_above    float64
dtype: object
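The same conversion can be tried without the CSV file. This sketch builds a tiny DataFrame by hand (values copied from the rows shown earlier, with one population deliberately set to missing) and inspects the resulting dtypes; the exact type chosen for the float column depends on the pandas version, so it is printed rather than assumed.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name': ['Amridge University', 'Alabama State University'],
    'population': [291.0, np.nan],      # whole numbers stored as float
    'age_25_above': [0.854, 0.127],     # genuine floats
})

converted = df.convert_dtypes()
print(converted.dtypes)
```

Whole-number floats with gaps become `Int64`, and text columns become the dedicated string type.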

In [None]:
college.convert_dtypes

## New data types are experimental

* Do not use for serious work
* Functionality can change
* Missing value filtering might change
* Use float and categorical instead
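For comparison, a short sketch of the established alternatives on made-up data: float columns with `np.nan` filter without any extra step, and categorical columns handle repeated strings.

```python
import numpy as np
import pandas as pd

# Float column: missing values are np.nan and comparisons return False for them
population = pd.Series([4206.0, 11383.0, np.nan, 291.0])
big = population[population > 5_000]
print(big)

# Categorical column: each distinct label is stored once
state = pd.Series(['AL', 'AK', 'AL', None]).astype('category')
print(state.cat.categories.tolist())
print(state.isna().sum())
```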

## Questions? Buy my books! 