# Data processing: Pandas 

### Acknowledgments & Credits

This lesson is adapted largely from the excellent curriculum materials by Cliburn Chan (2021) at https://github.com/cliburn/bios-823-2021/ under the MIT License.

**References**

- [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/)
- [Python for Data Analysis, 2nd Edition](https://github.com/wesm/pydata-book)

In [None]:
import numpy as np
import pandas as pd

## Series and Data Frames

### Series objects

A `Series` is like a vector: a one-dimensional array with row labels. All elements share a single data type (dtype), and missing entries are stored as nulls.

In [None]:
s = pd.Series([1,1,2,3] + [None])
s
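
As a minimal sketch of what "same type or nulls" means in practice: the `None` forces the integers to be stored as `float64` with `NaN`. Converting to the nullable `Int64` dtype is one way to keep whole numbers; the name `s_int` is illustrative only.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 1, 2, 3] + [None])
print(s.dtype)         # the None is stored as NaN, so the dtype is float64
print(s.isna().sum())  # number of missing values

# The nullable integer dtype keeps whole numbers and marks the gap as <NA>.
s_int = s.astype("Int64")
print(s_int.dtype)
```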

### Size

In [None]:
s.size

### Unique Values and Counts

In [None]:
s.value_counts()

In [None]:
s.nunique(), s.unique()
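
These methods treat the missing value differently, which is easy to overlook. The sketch below rebuilds the same series and compares them; the variable names are illustrative.

```python
import pandas as pd

s = pd.Series([1, 1, 2, 3] + [None])

counts = s.value_counts()                  # missing value excluded by default
counts_all = s.value_counts(dropna=False)  # missing value gets its own row
n_distinct = s.nunique()                   # also ignores the missing value
distinct = s.unique()                      # array of distinct values, NaN included
```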

### Special types of series

#### Strings

In [None]:
words = 'the quick brown fox jumps over the lazy dog'.split()
s1 = pd.Series([' '.join(item) for item in zip(words[:-1], words[1:])])
s1

In [None]:
s1.str.upper()

In [None]:
s1.str.split()

In [None]:
s1.str.split().str[1]
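
If the pieces are wanted as separate columns rather than lists, `str.split` also accepts `expand=True`. A small sketch, with hypothetical column names `first` and `second`:

```python
import pandas as pd

words = 'the quick brown fox jumps over the lazy dog'.split()
s1 = pd.Series([' '.join(pair) for pair in zip(words[:-1], words[1:])])

# expand=True spreads the pieces over columns instead of returning lists.
parts = s1.str.split(expand=True)
parts.columns = ['first', 'second']  # illustrative column names
print(parts.head(3))
```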

### Categories

In [None]:
b = pd.Series(['Adenine', 'Cytosine', 'Guanine', 'Thymine', 'Uracil'])
b

In [None]:
b = b.astype('category')
b

In [None]:
b.cat.categories

In [None]:
b.cat.codes

In [None]:
b = pd.Series(list('ACGTU')).astype('category')
b

In [None]:
b.cat.categories

In [None]:
b = pd.Series(list(b'ACGTU')).astype('category')
b
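
To make the relationship between `categories` and `codes` concrete, here is a small sketch with repeated values: each code is the position of the value within the sorted categories, so the codes can be used to recover the labels.

```python
import pandas as pd

b = pd.Series(['Guanine', 'Adenine', 'Guanine', 'Thymine']).astype('category')

print(b.cat.categories)  # unique labels, sorted
print(b.cat.codes)       # integer position of each value in the categories

# Looking the codes up in the categories recovers the original labels.
recovered = b.cat.categories.take(b.cat.codes.to_numpy())
```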

### DataFrame objects

A `DataFrame` is like a matrix, except that each column can have its own data type. Each column of a `DataFrame` is a `Series`.

- Each row in a DataFrame represents an **observation**
- Each column in a DataFrame represents a **feature** (or **variable**)
- Each cell in a DataFrame represents a **value**
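
The three bullets above can be sketched on a tiny, made-up data frame (the names `toy`, `height_cm`, and `name` are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    'height_cm': [170.0, 182.5, 165.2],
    'name': ['Ann', 'Bob', 'Cy'],
})

feature = toy['height_cm']   # one column (feature) is a Series
observation = toy.iloc[0]    # one row (observation) is also returned as a Series
value = toy.at[1, 'name']    # one cell (value) is a scalar
```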

In [None]:
df = pd.DataFrame(dict(num=[1,2,3] + [None]))
df

In [None]:
from sklearn import datasets
iris = datasets.load_iris()
iris

In [None]:
print(iris['DESCR'])

In [None]:
def sklearnds2df(ds):
    df = pd.DataFrame(data=ds['data'], columns=ds['feature_names'])
    df['target'] = pd.Series(pd.Categorical.from_codes(ds['target'],
                                                       categories=ds['target_names']))
    return df

In [None]:
df = sklearnds2df(iris)
df

### Index

Row and column identifiers are of `Index` type.

Somewhat confusingly, "index" is also a synonym for the row identifiers.

In [None]:
df.index

#### Making an index into a column

In [None]:
df1 = df.reset_index(drop=False)
df1
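
The reverse direction, turning a column into the index, uses `set_index`. A minimal round-trip sketch on a made-up frame:

```python
import pandas as pd

toy = pd.DataFrame({'sample': ['a', 'b', 'c'], 'score': [0.1, 0.5, 0.9]})

indexed = toy.set_index('sample')         # column becomes the row index
restored = indexed.reset_index()          # index goes back to being a column
dropped = indexed.reset_index(drop=True)  # index is discarded instead
```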

### Columns

The column labels are just another `Index` object.

In [None]:
df.columns

### Getting raw values

Sometimes you just want a `numpy` array, and not a `pandas` object.

In [None]:
df.values
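
The pandas documentation recommends `DataFrame.to_numpy()` over `.values`. Note that a frame with mixed column types has to fall back to a generic `object` array, so it often helps to select the numeric columns first. A sketch on a made-up frame:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'x': [1.0, 2.0], 'y': [3.0, 4.0], 'label': ['a', 'b']})

mixed = toy.to_numpy()  # numbers and strings together -> object array
numeric = toy.select_dtypes(include=np.number).to_numpy()  # floats only -> float64
```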

## Indexing Data Frames

### Implicit defaults

If you provide a slice, it is assumed that you are asking for rows.

In [None]:
df[1:3]

If you provide a single label or a list of labels, it is assumed that you are asking for columns.

In [None]:
df[['sepal length (cm)', 'sepal width (cm)']]
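
The two defaults also differ in what they return. A small sketch on a made-up frame:

```python
import pandas as pd

toy = pd.DataFrame({'a': [10, 20, 30, 40], 'b': [1, 2, 3, 4]})

rows = toy[1:3]    # slice -> rows at positions 1 and 2
col = toy['a']     # single label -> one column as a Series
cols = toy[['a']]  # list of labels -> a DataFrame, even with one column
```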

### Indexing by location

This is similar to `numpy` indexing: positions are zero-based and the end of a slice is excluded.

In [None]:
df.iloc[1:3, :]

In [None]:
df.iloc[1:3, 1:4:2]

### Indexing by name

In [None]:
df.loc[1:3, 'sepal length (cm)':'petal length (cm)']

**Warning**: When using `loc`, the row slice indicates row names, not positions, and both endpoints of the slice are included.

In [None]:
df1 = df.copy()
df1.index = df.index + 1
df1

In [None]:
df1.loc[1:3, 'sepal length (cm)':'petal length (cm)']

In [None]:
df1.iloc[1:3, 0:3]
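
The same contrast on a four-row toy frame whose labels start at 1, where the difference is easy to check by eye:

```python
import pandas as pd

toy = pd.DataFrame({'x': [10, 20, 30, 40]}, index=[1, 2, 3, 4])

by_label = toy.loc[1:3]      # labels 1, 2 and 3 (end label included)
by_position = toy.iloc[1:3]  # positions 1 and 2 (end position excluded)
```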

## Structure of a Data Frame

### Data types

In [None]:
df.dtypes

### Converting data types

#### Using `astype`, including on multiple columns

In [None]:
df1 = df.astype({'sepal length (cm)': int, 'sepal width (cm)':int})
df1

In [None]:
df1.dtypes
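
Be aware that converting floats with `astype(int)` drops the fractional part rather than rounding. If rounding is what you want, one option is to call `round()` first, as in this sketch:

```python
import pandas as pd

lengths = pd.Series([5.1, 4.9, 6.7])

truncated = lengths.astype(int)        # fractional part is dropped
rounded = lengths.round().astype(int)  # round to nearest whole number first
```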

### Basic properties

In [None]:
df.size

In [None]:
df.shape

In [None]:
df.describe()

In [None]:
df.info()

In [None]:
df.sample(n=3)

In [None]:
df.sample(frac=0.1)

## Selecting, Renaming and Removing Columns

### Selecting columns

In [None]:
df.filter(regex=r'\slength\s')

In [None]:
df.filter(like="(cm)")

### By data type

In [None]:
df.select_dtypes(include=np.number)

In [None]:
df.select_dtypes(exclude=['category'])

#### Note that you can also use regular string methods on the columns

In [None]:
df.loc[:, df.columns.str.contains('length')]
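
The three selection styles in this section can be compared side by side. The sketch below uses a one-row stand-in frame with iris-like column names (the values are made up):

```python
import pandas as pd

toy = pd.DataFrame({
    'sepal length (cm)': [5.1],
    'sepal width (cm)': [3.5],
    'target': ['setosa'],
})

by_like = toy.filter(like='(cm)').columns.tolist()           # plain substring match
by_regex = toy.filter(regex=r'\slength\s').columns.tolist()  # regular expression
by_mask = toy.loc[:, toy.columns.str.contains('length')].columns.tolist()  # boolean mask
```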

### Renaming columns

In [None]:
df.rename(dict(target='species'), axis=1)

In [None]:
df.columns = df.columns.str.removesuffix(" (cm)").str.replace(" ","_")
df

#### You can also use regular indexing to leave columns out

In [None]:
df.loc[:, ~df.columns.str.contains('sepal')]

## Selecting, Renaming and Removing Rows

### Selecting rows

In [None]:
df[df.sepal_length.between(5,6)]

In [None]:
df.query('5 <= sepal_length <= 6')
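
Both forms are shorthand for a boolean mask. A sketch on made-up values showing that the three spellings select the same rows (`between` includes both ends by default):

```python
import pandas as pd

toy = pd.DataFrame({'sepal_length': [4.8, 5.0, 5.5, 6.0, 6.3]})

by_between = toy[toy.sepal_length.between(5, 6)]
by_mask = toy[(toy.sepal_length >= 5) & (toy.sepal_length <= 6)]  # note the parentheses
by_query = toy.query('5 <= sepal_length <= 6')
```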

## Sorting Data Frames

### Sort on indexes

In [None]:
df.sort_index(axis=1)

In [None]:
df.sort_index(axis=0, ascending=False)

### Sort on values

In [None]:
df.sort_values(by=['sepal_length', 'petal_length'], ascending=[True, False])

In [None]:
df['sepal_length'].argsort()

In [None]:
df.iloc[df['sepal_width'].argsort(),:]
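
`argsort` returns the positions that would sort the column, so feeding them to `iloc` reorders the rows. On a toy frame without ties this matches `sort_values`:

```python
import pandas as pd

toy = pd.DataFrame({'w': [3.1, 2.2, 4.0]}, index=['a', 'b', 'c'])

order = toy['w'].argsort()                # positions of the values in sorted order
via_argsort = toy.iloc[order.to_numpy()]  # reorder rows by position
via_sort = toy.sort_values('w')           # same result in one call
```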

## Summarizing

### Apply an aggregation function

In [None]:
df_X = df.select_dtypes(include=np.number)
df_X.agg(["min", "max", "std", "mean", "median"])

In [None]:
df_X.mean(axis=0)

In [None]:
df_X.mean(axis=1)

## Split-Apply-Combine

We often want to perform subgroup analysis (conditioning by some discrete or categorical variable). This is done with `groupby` followed by an aggregate function. Conceptually, we split the data frame into separate groups, apply the aggregate function to each group separately, then combine the aggregated results back into a single data frame.
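
To make the three conceptual steps explicit, here is a hand-rolled sketch on a made-up frame, next to the equivalent one-line `groupby`. The manual version is for illustration only.

```python
import pandas as pd

toy = pd.DataFrame({
    'target': ['a', 'a', 'b', 'b'],
    'x': [1.0, 3.0, 10.0, 20.0],
})

# Split: one sub-frame per group. Apply: take the mean of x in each.
pieces = {key: part['x'].mean() for key, part in toy.groupby('target')}

# Combine: gather the per-group results into a single Series.
manual = pd.Series(pieces, name='x').rename_axis('target')

# groupby followed by an aggregate performs all three steps in one call.
direct = toy.groupby('target')['x'].mean()
```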

In [None]:
grouped = df.groupby(by='target', observed=True)

In [None]:
grouped.get_group('versicolor')

In [None]:
grouped.mean()

### Using `agg` with `groupby`

In [None]:
grouped.agg(['mean'])

In [None]:
grouped.agg(['mean', 'std'])

In [None]:
df.select_dtypes(np.number).apply(lambda x: (np.mean(x), np.std(x)), axis=0)
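
One thing to watch when comparing standard deviations: pandas and NumPy use different defaults. `Series.std()` divides by `n - 1` (`ddof=1`), while `np.std` on an array divides by `n` (`ddof=0`). A small sketch of the difference:

```python
import numpy as np
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 4.0])

sample_sd = x.std()                   # pandas default: ddof=1
population_sd = np.std(x.to_numpy())  # NumPy default: ddof=0
matched = x.std(ddof=0)               # pandas with the NumPy convention
```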

### Using `transform` with `groupby`

When you apply `transform` to a grouped object, the result has the same index as the original data frame: every row receives the value computed for its group, so all rows are represented.

In [None]:
g_mean = grouped[['sepal_length', 'sepal_width']].transform('mean')
g_mean
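
A common use is centering each value on its own group mean. Because `transform` keeps one row per input row, the result lines up with the original column. A sketch on made-up data:

```python
import pandas as pd

toy = pd.DataFrame({
    'target': ['a', 'a', 'b', 'b'],
    'x': [1.0, 3.0, 10.0, 20.0],
})

group_mean = toy.groupby('target')['x'].transform('mean')  # one value per row
centered = toy['x'] - group_mean                           # deviation from own group mean
```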

## Combining Data Frames

### Adding rows

Note that `pandas` aligns by column indexes automatically.

In [None]:
grouped = df.groupby('target', observed=True)
df_s, df_v, df_vg = [grouped.get_group(species) 
                     for species in ['setosa', 'versicolor', 'virginica']]

**Note that `DataFrame.append()` was removed in pandas 2.0.** Instead, use `pandas.concat()`, which combines data frames along either axis:

In [None]:
pd.concat([df_v, df_vg, df_s], axis=0)
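
Two details are worth seeing on a small example: columns are matched by name even when their order differs, and the original row labels are kept unless you pass `ignore_index=True`. The frames below are made up.

```python
import pandas as pd

top = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
bottom = pd.DataFrame({'b': [30], 'a': [10]})  # same columns, different order

stacked = pd.concat([top, bottom], axis=0)                        # keeps original row labels
renumbered = pd.concat([top, bottom], axis=0, ignore_index=True)  # fresh 0..n-1 labels
```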

### Adding columns

In [None]:
df_X = df.select_dtypes(np.number)
df_Y = df.select_dtypes('category')

In [None]:
pd.merge(df_X, df_Y, left_index=True, right_index=True)

In [None]:
df_sepals = df[['sepal_length','sepal_width','target']]
df_petals = df[['petal_length','petal_width','target']]

In [None]:
pd.merge(df_sepals, df_petals, left_on='target', right_on='target', how='inner')

In [None]:
dfp = df_petals.astype({'target':str}).reset_index(drop=False)
dfs = df_sepals.astype({'target':str}).reset_index(drop=False)
dfs.merge(dfp, left_on='target', right_on='target', how='inner')
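
Merging on a key that repeats in both frames pairs every matching left row with every matching right row, so the result can be much larger than either input. The sketch below shows the effect on made-up data and contrasts it with a merge on the row index.

```python
import pandas as pd

left = pd.DataFrame({'target': ['a', 'a', 'b'], 'sepal': [1, 2, 3]})
right = pd.DataFrame({'target': ['a', 'a', 'b'], 'petal': [10, 20, 30]})

# Every 'a' row on the left pairs with every 'a' row on the right.
on_key = pd.merge(left, right, on='target', how='inner')

# Joining on the row index keeps one output row per input row.
on_index = pd.merge(left, right.drop(columns='target'),
                    left_index=True, right_index=True)
```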

In [None]:
pd.concat([df_sepals, df_petals], axis=1, join='inner')

## Reshaping Data Frames

Sometimes we need to make rows into columns or vice versa.

In [None]:
df.head()

In [None]:
df_triples = pd.melt(df, id_vars='target')
df_triples

### Pivoting

Sometimes we need to convert categorical values in a column into separate columns. This is often done at the same time as performing a summary.

In [None]:
df_triples.pivot_table(index='variable', columns='target',
                       values='value', aggfunc='mean', observed=True)
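
The same melt-then-pivot pattern on a three-row made-up frame, small enough to verify by hand:

```python
import pandas as pd

wide = pd.DataFrame({
    'target': ['a', 'a', 'b'],
    'x': [1.0, 3.0, 10.0],
    'y': [2.0, 4.0, 20.0],
})

# Wide to long: one row per (target, variable, value) triple.
long = pd.melt(wide, id_vars='target')

# Long back to wide, averaging the values within each target.
summary = long.pivot_table(index='variable', columns='target',
                           values='value', aggfunc='mean')
```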