# Towards pandas 1.0

Marc Garcia - @datapythonista

## About me

Marc Garcia - @datapythonista

- 12 years working with Python
- pandas core developer
- Python fellow
- Organiser of the London Python sprints group
- Data scientist at Tesco

## About pandas

![](img/wes.jpg)

- Started by **Wes McKinney** in 2008 in his spare time
 - To have R's `data.frame` functionality in Python

- Huge API
 - `Series` has 325 public methods/attributes
 - `DataFrame` has 224 public methods/attributes
 - Native support for 14 data formats (besides loading from Python objects)
 - More than 1,200 docstrings

- Huge user base
 - Estimated to have around **10 million users**

- Developed by the community (contributors and maintainers rarely get paid for their work on pandas)
 - Supported by **NumFOCUS**

## Quick overview

## Features being deprecated

- `Panel` (3-dimensional counterpart of `DataFrame`)
 - Use a `DataFrame` with a `MultiIndex`, or the `xarray` package instead

- `.ix` indexer
 - Use `.loc` (label-based) and `.iloc` (position-based) instead

- `SparseDataFrame`
 - TBC
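
As a minimal sketch of the replacements listed above (the airline codes and delay values are made up for illustration): `.loc` selects by label, `.iloc` selects by position, and a `DataFrame` with a `MultiIndex` can hold the kind of 3-dimensional data `Panel` was used for.

```python
import pandas as pd

# Hypothetical data, only for illustration
df = pd.DataFrame({'delay': [5, 12, 0]}, index=['BA', 'VY', 'FR'])

# Instead of the ambiguous df.ix[...]
by_label = df.loc['VY', 'delay']   # label-based selection
by_position = df.iloc[1, 0]        # position-based selection

# Instead of Panel: a DataFrame indexed by (year, airline) pairs
index = pd.MultiIndex.from_product([['2017', '2018'], ['BA', 'VY']],
                                   names=['year', 'airline'])
panel_like = pd.DataFrame({'delay': [5, 12, 7, 3]}, index=index)

# Selecting one "item" of what would have been a Panel
delays_2018 = panel_like.loc['2018']
```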

## Features being deprecated

- `inplace=True` in `Series` and `DataFrame` methods
 - TBC

```python
import pandas

df = (pandas.read_csv('flights.csv')
            .rename(columns=str.lower)
            .drop('code', axis='columns')
            .query('country == "GB"')
            .assign(delay=lambda df: df['actual_arrival'] - df['expected_arrival'],
                    cancelled=lambda df: df['cancelled'].replace({'Yes': True, 'No': False}))
            .groupby('airline')
            .agg({'delay': 'sum', 'cancelled': 'mean'})
            .sort_values('delay', ascending=False))
```

- This syntax should avoid unnecessary memory copies
- Maybe, in the long term, operations could be lazy?
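
To try the chained style without a `flights.csv` file, here is a rough sketch on a small in-memory table. The rows are invented, and `.map` stands in for `.replace` when converting the `Yes`/`No` values. It also shows the `inplace=True` equivalent of the first two steps for comparison.

```python
import pandas as pd

# Made-up stand-in for flights.csv
flights = pd.DataFrame({
    'Airline': ['BA', 'BA', 'VY', 'AF'],
    'Country': ['GB', 'GB', 'GB', 'FR'],
    'Code': ['a1', 'a2', 'b1', 'c1'],
    'Expected_Arrival': [100, 200, 300, 400],
    'Actual_Arrival': [110, 205, 300, 460],
    'Cancelled': ['No', 'Yes', 'No', 'No'],
})

# inplace style: each call mutates the same object and returns None
inplace_df = flights.copy()
inplace_df.rename(columns=str.lower, inplace=True)
inplace_df.drop('code', axis='columns', inplace=True)

# Chained style: each call returns a new object, the original is untouched
chained_df = (flights.rename(columns=str.lower)
                     .drop('code', axis='columns'))

result = (chained_df.query('country == "GB"')
          .assign(delay=lambda df: df['actual_arrival'] - df['expected_arrival'],
                  cancelled=lambda df: df['cancelled'].map({'Yes': True, 'No': False}))
          .groupby('airline')
          .agg({'delay': 'sum', 'cancelled': 'mean'})
          .sort_values('delay', ascending=False))
```

Both styles give the same intermediate table; the chained version leaves `flights` unmodified and reads top to bottom as a single pipeline.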

## Dropping Python 2 support

- In January 2019 (yes, in 4.5 months)
 - Not only pandas, also numpy, matplotlib and others

## Some Python 3 features

Old:
```python
samples = 100000000
```

New:
```python
samples = 100_000_000
```

## Some Python 3 features

Old:
```python
print('samples: %s' % samples)
print('samples: {samples}'.format(samples=samples))
```

New:
```python
print(f'samples: {samples}')
```

## Some Python 3 features

```python
data = 'My hovercraft is full of eels.'.split()
```

Old:
```python
first, second, last = data[0], data[1], data[-1]
```

New:
```python
first, second, *discard, last = data
```

## Cost of supporting Python 2

Supporting only the latest version:
```python
def length(value):
    if isinstance(value, str):
        return len(value)
```

Supporting Python 2:
```python
def length(value):
    if isinstance(value, compat.string_types):
        return compat.strlen(value)
```

## Cost of supporting Python 2

Supporting only the latest version:
```python
def sorted_apply(func, items):
    return {x: func(x) for x in items}
```

Supporting Python 2:
```python
def sorted_apply(func, items):
    if compat.PY36:
        return {x: func(x) for x in items}
    else:
        result = collections.OrderedDict()
        for x in items:
            result[x] = func(x)
        return result
```

## New features

- Extension arrays
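
The slide does not say which extension arrays are meant, so the following is only an illustrative sketch. It assumes a pandas version that ships the nullable integer dtype `'Int64'` (0.24 or later), which is built on the extension array interface and lets an integer column hold missing values without being converted to floats.

```python
import pandas as pd

# Default behaviour: a missing value forces the integers to become floats
as_float = pd.Series([1, 2, None])

# With the nullable integer extension dtype, the values stay integers
as_int = pd.Series([1, 2, None], dtype='Int64')

print(as_float.dtype)  # float64
print(as_int.dtype)    # Int64
```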
