# Data processing: Pandas 

### Acknowledgments & Credits

This lesson is adapted largely from the excellent curriculum materials by Cliburn Chan (2021) at https://github.com/cliburn/bios-823-2021/ under the MIT License.

**References**

- [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/)
- [Python for Data Analysis, 2nd Edition](https://github.com/wesm/pydata-book)

In [None]:
import numpy as np
import pandas as pd

## Series and Data Frames

### Series objects

A `Series` is like a vector: all elements share a single type, although individual values may be missing (null).

In [None]:
s = pd.Series([1,1,2,3] + [None])
s
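As a small illustration of the single-type rule, the sketch below (with made-up values) shows how `pandas` settles on one dtype for the whole `Series`: integers combined with `None` are upcast to `float64` with `NaN`, mixed numbers and strings fall back to `object`, and the nullable `Int64` dtype keeps the integers.

```python
import pandas as pd

# Integers plus a missing value: pandas upcasts to float64 and stores NaN.
ints_with_null = pd.Series([1, 1, 2, 3, None])
print(ints_with_null.dtype)

# Numbers mixed with strings: the only common type is the generic object dtype.
mixed = pd.Series([1, 'a', 2.5])
print(mixed.dtype)

# The nullable integer dtype keeps integers and marks the gap with <NA>.
nullable = pd.Series([1, 1, 2, 3, None], dtype='Int64')
print(nullable.dtype, nullable.isna().sum())
```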

### Size

In [None]:
s.size

### Unique Values and Counts

In [None]:
s.value_counts()

In [None]:
s.nunique(), s.unique()

### Special types of series

#### Strings

In [None]:
words = 'the quick brown fox jumps over the lazy dog'.split()
s1 = pd.Series([' '.join(item) for item in zip(words[:-1], words[1:])])
s1

In [None]:
s1.str.upper()

In [None]:
s1.str.split()

In [None]:
s1.str.split().str[1]

#### Categories

In [None]:
b = pd.Series(['Adenine', 'Cytosine', 'Guanine', 'Thymine', 'Uracil'])
b

In [None]:
b = b.astype('category')
b

In [None]:
b.cat.categories

In [None]:
b.cat.codes

In [None]:
b = pd.Series(list('ACGTU')).astype('category')
b

In [None]:
b.cat.categories

In [None]:
# Iterating over a bytes object yields integer byte values, so the categories are ints.
b = pd.Series(list(b'ACGTU')).astype('category')
b

### DataFrame objects

A `DataFrame` is like a matrix whose columns can have different types. Each column in a `DataFrame` is a `Series`.

- Each row in a DataFrame represents an **observation**
- Each column in a DataFrame represents a **feature** (or **variable**)
- Each cell in a DataFrame represents a **value**

In [None]:
df = pd.DataFrame(dict(num=[1,2,3] + [None]))
df
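A minimal sketch of the relationship between a `DataFrame` and its `Series`, using a made-up two-row frame (the names `people`, `age` and `first` are only for illustration): selecting one column or one row gives back a `Series`, and each column keeps its own dtype.

```python
import pandas as pd

people = pd.DataFrame({'name': ['ann', 'bob'], 'age': [31, 45]})  # made-up values

age = people['age']      # a single column is a Series
first = people.iloc[0]   # a single row is also returned as a Series

print(type(age).__name__, age.dtype)
print(type(first).__name__)
print(people.dtypes)     # each column keeps its own dtype
```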

In [None]:
from sklearn import datasets
iris = datasets.load_iris()

In [None]:
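def sklearnds2df(ds):
    """Convert a scikit-learn dataset into a DataFrame with a categorical target."""
    df = pd.DataFrame(data=ds['data'], columns=ds['feature_names'])
    # Map the integer target codes to their class names.
    df['target'] = pd.Series(pd.Categorical.from_codes(ds['target'],
                                                       categories=ds['target_names']))
    return df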

In [None]:
df = sklearnds2df(iris)
df

### Index

Row and column identifiers are of `Index` type.

Somewhat confusingly, "index" is also a synonym for the row identifiers.

In [None]:
df.index

#### Making an index into a column

In [None]:
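# reset_index() returns a new frame with the old index as a column named 'index'
# (drop=True would discard the index instead); df itself is left unchanged.
df.reset_index()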
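The sketch below, on a small made-up frame, contrasts the two behaviours of `reset_index` and shows `set_index` as the reverse operation. All three calls return new frames and leave the original untouched.

```python
import pandas as pd

frame = pd.DataFrame({'x': [10, 20, 30]}, index=['a', 'b', 'c'])  # made-up values

as_column = frame.reset_index()            # old index becomes a column named 'index'
discarded = frame.reset_index(drop=True)   # old index is thrown away
restored = as_column.set_index('index')    # turn the column back into the index

print(list(as_column.columns))
print(list(discarded.index))
print(list(frame.index))                   # the original frame is unchanged
```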

### Columns

The column labels are stored in another `Index` object.

In [None]:
df.columns

### Getting raw values

Sometimes you just want a `numpy` array, and not a `pandas` object.

In [None]:
df.values
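`DataFrame.to_numpy()` is the method the `pandas` documentation recommends over `.values` for the same job. A short sketch with made-up frames: a purely numeric frame converts to a float array, while a frame with mixed column types has no common numeric type and becomes an `object` array.

```python
import numpy as np
import pandas as pd

numeric = pd.DataFrame({'a': [1.0, 2.0], 'b': [3.0, 4.0]})
arr = numeric.to_numpy()
print(type(arr).__name__, arr.dtype, arr.shape)

mixed = pd.DataFrame({'a': [1.0, 2.0], 'label': ['x', 'y']})
mixed_arr = mixed.to_numpy()
print(mixed_arr.dtype)   # object: the columns share no numeric type
```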

## Indexing Data Frames

### Implicit defaults

If you provide a slice, it is assumed that you are asking for rows.

In [None]:
df[1:3]

If you provide a single value or a list, it is assumed that you are asking for columns.

In [None]:
df[['sepal length (cm)', 'sepal width (cm)']]
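The three cases can be compared side by side on a small made-up frame. Note that a single name returns a `Series`, while a list of names returns a `DataFrame` even when the list has one element.

```python
import pandas as pd

frame = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [5, 6, 7, 8]})  # made-up values

rows = frame[1:3]        # slice -> rows at positions 1 and 2
column = frame['a']      # single name -> one column as a Series
subframe = frame[['a']]  # list of names -> DataFrame, even with one column

print(rows.shape, type(column).__name__, type(subframe).__name__)
```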

### Indexing by location

This is similar to `numpy` indexing.

In [None]:
df.iloc[1:3, :]

In [None]:
df.iloc[1:3, 1:4:2]

### Indexing by name

In [None]:
df.loc[1:3, 'sepal length (cm)':'petal length (cm)']

**Warning**: When using `loc`, the row slice refers to index labels, not positions, and both endpoints are included.

In [None]:
df1 = df.copy()
df1.index = df.index + 1
df1

In [None]:
df1.loc[1:3, 'sepal length (cm)':'petal length (cm)']

In [None]:
df1.iloc[1:3, 0:3]
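The difference is easiest to see on a tiny frame whose labels start at 1, as in the cells above. In this sketch (made-up values), the same slice `1:3` selects three rows by label but only two rows by position.

```python
import pandas as pd

frame = pd.DataFrame({'x': [10, 20, 30, 40]}, index=[1, 2, 3, 4])  # made-up values

by_label = frame.loc[1:3]      # labels 1, 2 and 3 (end label included)
by_position = frame.iloc[1:3]  # positions 1 and 2 (end position excluded)

print(list(by_label.index))
print(list(by_position.index))
```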

## Structure of a Data Frame

### Data types

In [None]:
df.dtypes

### Converting data types

#### Using `astype`, including on multiple columns

In [None]:
df1 = df.astype({'sepal length (cm)': int, 'sepal width (cm)':int})
df1

In [None]:
df1.dtypes

### Basic properties

In [None]:
df.size

In [None]:
df.shape

In [None]:
df.describe()

In [None]:
df.info()

In [None]:
df.sample(n=3)

In [None]:
df.sample(frac=0.1)

## Selecting, Renaming and Removing Columns

### Selecting columns

In [None]:
# Use a raw string so that \s reaches the regex engine unchanged.
df.filter(regex=r'\slength\s')

### By data type

In [None]:
df.select_dtypes(include=np.number)

In [None]:
df.select_dtypes(exclude=['category'])

#### Note that you can also use regular string methods on the columns

In [None]:
df.loc[:, df.columns.str.contains('length')]

### Renaming columns

In [None]:
df.rename(dict(target='species'), axis=1)

In [None]:
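# regex=False treats " (cm)" as literal text rather than as a pattern with a group.
df.columns = df.columns.str.replace(" (cm)", "", regex=False).str.replace(" ", "_", regex=False)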
df

#### You can also use regular indexing

In [None]:
df.loc[:, ~df.columns.str.contains('sepal')]

## Selecting, Renaming and Removing Rows

### Selecting rows

In [None]:
df[df.sepal_length.between(5,6)]

In [None]:
df.query('5 <= sepal_length <= 6')

## Sorting Data Frames

### Sort on indexes

In [None]:
df.sort_index(axis=1)

In [None]:
df.sort_index(axis=0, ascending=False)

### Sort on values

In [None]:
df.sort_values(by=['sepal_length', 'petal_length'], ascending=[True, False])

## Summarizing

### Apply an aggregation function

In [None]:
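# Aggregate the numeric columns only; sum and mean are not defined for the categorical target.
df.select_dtypes(include=np.number).agg(['sum', 'mean'])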

In [None]:
# A dict applies different aggregations to different columns.
df.agg({'sepal_length': ['median', 'sum'], 'petal_length': ['min', 'max']})

## Split-Apply-Combine

We often want to perform subgroup analysis (conditioning by some discrete or categorical variable). This is done with `groupby` followed by an aggregate function. Conceptually, we split the data frame into separate groups, apply the aggregate function to each group separately, then combine the aggregated results back into a single data frame.
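The three steps can be sketched by hand and compared with `groupby`. The cells in this section refer to columns such as `pid`, `weight`, `height` and `bmi` rather than to the iris measurements, so the frame below is a hypothetical stand-in: the column names are taken from those cells, while the values and the BMI formula are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical stand-in data: all values are invented for illustration.
df = pd.DataFrame({
    'pid': [649, 533, 217, 600, 845],
    'weight': [68.0, 82.5, 59.0, 91.0, 74.5],       # assumed to be in kg
    'height': [172.0, 180.0, 160.0, 185.0, 168.0],  # assumed to be in cm
    'treatment': list('ababa'),
})
df['bmi'] = df['weight'] / (df['height'] / 100) ** 2

cols = ['weight', 'height', 'bmi']

# Split: one sub-frame per treatment. Apply: the mean of each sub-frame.
pieces = {name: group[cols].mean() for name, group in df.groupby('treatment')}

# Combine: put the per-group results back into a single frame.
by_hand = pd.DataFrame(pieces).T
by_hand.index.name = 'treatment'

# groupby followed by an aggregation performs all three steps in one call.
shortcut = df.groupby('treatment')[cols].mean()
print(shortcut)
```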

In [None]:
df['treatment'] = list('ababa')

In [None]:
df

In [None]:
grouped = df.groupby('treatment')

In [None]:
grouped.get_group('a')

In [None]:
grouped.mean()

### Using `agg` with `groupby`

In [None]:
grouped.agg('mean')

In [None]:
grouped.agg(['mean', 'std'])

In [None]:
grouped.agg({'weight': ['mean', 'std'], 
             'height': ['min', 'max'], 'bmi': lambda x: (x**2).sum()})

### Using `transform` with `groupby`

A transform on a grouped object returns a result with the same index as the original data. With an aggregating function such as the mean, every member of a group receives the same group-level value, so all rows are represented.

In [None]:
# The string name selects the built-in pandas group mean.
g_mean = grouped[['weight', 'height']].transform('mean')
g_mean

In [None]:
# 'std' is the sample standard deviation (ddof=1) of each group.
g_std = grouped[['weight', 'height']].transform('std')
g_std

In [None]:
(df[['weight', 'height']] - g_mean)/g_std
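The same within-group standardization, as a self-contained sketch on made-up data (the names `data`, `group` and `value` are hypothetical). Because `transform` keeps one value per input row, the group mean and standard deviation line up with the original rows and can be used directly in arithmetic.

```python
import pandas as pd

data = pd.DataFrame({
    'group': list('aabbb'),
    'value': [1.0, 3.0, 10.0, 20.0, 30.0],   # made-up values
})

grouped_value = data.groupby('group')['value']

# transform returns one value per input row, aligned with data's index.
group_mean = grouped_value.transform('mean')
group_std = grouped_value.transform('std')   # sample standard deviation (ddof=1)

# z-score of each value relative to its own group
data['z'] = (data['value'] - group_mean) / group_std
print(data)
```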

## Combining Data Frames

In [None]:
df

In [None]:
df1 =  df.iloc[3:].copy()

In [None]:
# errors='ignore' keeps this cell from failing if the frame has no 'something' column.
df1 = df1.drop(columns='something', errors='ignore')
df1

### Adding rows

Note that `pandas` aligns by column indexes automatically.

In [None]:
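# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement.
pd.concat([df, df1], sort=False)

In [None]:
# ignore_index=True renumbers the rows instead of repeating the original labels.
pd.concat([df, df1], sort=False, ignore_index=True)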
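A minimal sketch of the column alignment, using two made-up frames (`top` and `bottom` are hypothetical names): the second frame lists its columns in a different order and has an extra column, yet the values land under the matching names and the gaps are filled with `NaN`.

```python
import pandas as pd

top = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
bottom = pd.DataFrame({'b': [5], 'a': [6], 'c': [7]})  # different order, extra column

# Columns are matched by name, not by position.
stacked = pd.concat([top, bottom], ignore_index=True, sort=False)
print(stacked)
```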

### Adding columns

In [None]:
df.pid

In [None]:
# A plain dict preserves insertion order, so OrderedDict is not needed.
df2 = pd.DataFrame(dict(pid=[649, 533, 400, 600], age=[23, 34, 45, 56]))

In [None]:
df2.pid

In [None]:
df.pid = df.pid.astype('int')

In [None]:
pd.merge(df, df2, on='pid', how='inner')

In [None]:
pd.merge(df, df2, on='pid', how='left')

In [None]:
pd.merge(df, df2, on='pid', how='right')

In [None]:
pd.merge(df, df2, on='pid', how='outer')
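The four join types differ only in which keys survive. In the sketch below, two made-up frames share the keys 2 and 3; `indicator=True` adds a `_merge` column recording whether each row came from the left frame, the right frame, or both.

```python
import pandas as pd

left = pd.DataFrame({'pid': [1, 2, 3], 'weight': [60, 70, 80]})   # made-up values
right = pd.DataFrame({'pid': [2, 3, 4], 'age': [30, 40, 50]})     # made-up values

sizes = {}
for how in ['inner', 'left', 'right', 'outer']:
    merged = pd.merge(left, right, on='pid', how=how, indicator=True)
    sizes[how] = len(merged)
    print(how, merged['pid'].tolist(), merged['_merge'].astype(str).tolist())
```

An inner join keeps only the shared keys, left and right joins keep every key from one side, and an outer join keeps the union.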

### Merging on the index

In [None]:
df1 = pd.DataFrame(dict(x=[1,2,3]), index=list('abc'))
df2 = pd.DataFrame(dict(y=[4,5,6]), index=list('abc'))
df3 = pd.DataFrame(dict(z=[7,8,9]), index=list('abc'))

In [None]:
df1

In [None]:
df2

In [None]:
df3

In [None]:
df1.join([df2, df3])

## Reshaping Data Frames

Sometimes we need to make rows into columns or vice versa.

### Converting multiple columns into a single column

This is often useful if you need to condition on some variable.

In [None]:
url = 'https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv'
iris = pd.read_csv(url)

In [None]:
iris.head()

In [None]:
iris.shape

In [None]:
df_iris = pd.melt(iris, id_vars='species')

In [None]:
df_iris.sample(10)
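To see what `melt` does without downloading anything, here is a sketch on two made-up rows laid out like the iris table. Each measurement column is folded into a pair of columns, `variable` (the old column name) and `value`, so a frame with `n` rows and `k` measurement columns becomes `n * k` rows.

```python
import pandas as pd

wide = pd.DataFrame({
    'species': ['setosa', 'virginica'],
    'sepal_length': [5.1, 6.3],   # made-up values
    'petal_length': [1.4, 6.0],   # made-up values
})

long = pd.melt(wide, id_vars='species')
print(long)
```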

## Pivoting

Sometimes we need to convert categorical values in a column into separate columns. This is often done at the same time as performing a summary.

In [None]:
df_iris.pivot_table(index='variable', columns='species', values='value', aggfunc='mean')
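A self-contained sketch of the same call on a small made-up long-format frame: the values of `species` become columns, the values of `variable` become rows, and repeated measurements in each cell are summarized by their mean.

```python
import pandas as pd

long = pd.DataFrame({
    'species': ['setosa', 'setosa', 'virginica', 'virginica'] * 2,
    'variable': ['sepal_length'] * 4 + ['petal_length'] * 4,
    'value': [5.0, 5.2, 6.0, 6.4, 1.4, 1.6, 5.5, 6.1],   # made-up measurements
})

table = long.pivot_table(index='variable', columns='species',
                         values='value', aggfunc='mean')
print(table)
```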

## Functional style - `apply`, `map`

`apply` can be used to apply a custom function.

In [None]:
scores = pd.DataFrame(
    np.around(np.clip(np.random.normal(90, 10, (5,3)), 0, 100), 1),
    columns = ['math', 'stat', 'biol'],
    index = ['anne', 'bob', 'charles', 'dirk', 'edgar']
)
scores

In [None]:
def convert_grade(score):
    return np.choose(
        pd.cut(score, [-1, 60, 70, 80, 90, 100], labels=False),
        ['F', 'D', 'C', 'B', 'A']
    )

In [None]:
scores.apply(convert_grade)

`apply` can be used to avoid explicit looping. It is also very handy for reductions along margins.

In [None]:
scores.apply(np.mean, axis=0)
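The `axis` argument decides which margin is reduced. A sketch with a tiny made-up score table: `axis=0` hands each column to the function and yields one value per column, while `axis=1` hands over each row and yields one value per row.

```python
import pandas as pd

scores = pd.DataFrame(
    {'math': [90.0, 80.0], 'stat': [70.0, 60.0]},   # made-up scores
    index=['anne', 'bob'],
)

per_subject = scores.apply(lambda col: col.mean(), axis=0)   # one value per column
per_student = scores.apply(lambda row: row.mean(), axis=1)   # one value per row
spread = scores.apply(lambda col: col.max() - col.min())     # custom reduction, axis=0 by default

print(per_subject)
print(per_student)
```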

If all else fails, you can loop over `pandas` data frames. Loops are frowned upon because they are not efficient, but sometimes pragmatism beats elegance.
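When a loop is the pragmatic choice, `itertuples` (rows) and `items` (columns) are two common ways to write it. The sketch below uses a tiny made-up score table; the `best` and `totals` dictionaries are only there to show what each loop yields.

```python
import pandas as pd

scores = pd.DataFrame(
    {'math': [90.0, 80.0], 'stat': [70.0, 60.0]},   # made-up scores
    index=['anne', 'bob'],
)

# itertuples yields one namedtuple per row; row.Index holds the row label.
best = {}
for row in scores.itertuples():
    best[row.Index] = 'math' if row.math >= row.stat else 'stat'

# items walks the columns as (name, Series) pairs.
totals = {name: column.sum() for name, column in scores.items()}

print(best)
print(totals)
```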