# pandas

[pandas](https://pandas.pydata.org/) is an open source library for manipulating tabular, heterogeneous data. The core structures are `Series` and `DataFrame`, the latter of which can be seen as a collection of Series. In addition `pandas` provides the necessary means for data cleaning and preparation. `pandas` builds on the NumPy array structure and offers methods for conversion in both directions.

&#9888; A major difference between NumPy arrays and `pandas` Series and DataFrames lies in how indices are used. In NumPy the index is implicitly the position $0..(n-1)$. `pandas` Series and DataFrames use the same default but also allow labels as indices, and these indices are preserved when operations are applied.
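
The difference can be illustrated with a small sketch. The values and the labels `a` to `d` are made up for this example; filtering a NumPy array renumbers the result from position 0, whereas filtering a Series keeps the original labels.

```python
import numpy as np
import pandas as pd

arr = np.array([10, 20, 30, 40])
s = pd.Series(arr, index=["a", "b", "c", "d"])

# NumPy: the filtered array is indexed again from position 0
arr_big = arr[arr > 15]
first_np = arr_big[0]        # element now at position 0

# pandas: the filtered Series keeps the labels of the surviving elements
s_big = s[s > 15]
labels = list(s_big.index)   # labels that passed the filter
by_label = s_big["c"]        # access by label still works after filtering
```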

Many parallels can be drawn between `pandas` and the `tidyverse` collection of R packages. In terms of data structures, Series and DataFrame can be viewed as vectors and data.frame/tibble respectively. Furthermore, in terms of functionality most data manipulation operations available in tidyverse have counterparts in `pandas`.


In [None]:
# convention
import pandas as pd
import numpy as np
#
from numpy.random import default_rng
rng = default_rng()

## Series

`Series` is a sequence of values, possibly of heterogeneous types. You can create Series with the <tt>pd.Series</tt> function.

**Synopsis: &nbsp; &nbsp;**<tt>Series(data=None, index=None, dtype=None, name=None, copy=False)</tt>
 - data: array, iterable, dict, scalar
 - index: 1-dimensional array, otherwise $0..(n-1)$
 - dtype: [data types](https://pandas.pydata.org/docs/user_guide/basics.html#basics-dtypes), otherwise inferred
 - name: optional
 - copy: default False, data is not copied but is a reference

In [None]:
s = pd.Series([3,5,7])
s = pd.Series({'a':3, 'b':5, 'c':7})
s = pd.Series([3,5,7], index=['a','b','c'])
s = pd.Series([3,5,7])
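
The remaining arguments of the synopsis can be sketched in the same way. The name `odd` and the labels are arbitrary choices for this illustration.

```python
import pandas as pd

# explicit dtype and name instead of the inferred int64 and no name
s = pd.Series([3, 5, 7], index=["a", "b", "c"], dtype="float64", name="odd")

# a scalar is repeated for every label in the index
c = pd.Series(1, index=["a", "b", "c"])
```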

With `copy=False` the input data to `pd.Series` is not copied but referenced. In the following scenario an update to Series `s` propagates to the NumPy array `arr`:

In [None]:
arr = np.array(range(3,7+1,2)) # NumPy array [3, 5, 7]: from 3 up to and including 7 with step=2
s = pd.Series(arr, copy=False) # reference arr instead of copying it
s[1] = -1                      # set value s[1] to -1
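
If the propagation is not wanted, one option is to request a copy explicitly. The sketch below assumes the same values as above and uses `copy=True`, so the Series owns its data and the array stays untouched.

```python
import numpy as np
import pandas as pd

arr = np.array([3, 5, 7])
s = pd.Series(arr, copy=True)  # s gets its own copy of the data
s.iloc[1] = -1                 # update the element at position 1

# arr still holds the original values, only s has changed
```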

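Besides the ordered positions $0..(n-1)$, a Series may also be viewed as a dictionary in which index labels are mapped to values. When the index consists of labels, positional access goes through `.iloc`:

In [None]:
s = pd.Series({'a':3, 'b':5, 'c':7})
s.iloc[1] == s['b']  # position 1 and label 'b' refer to the same element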

Operations between Series are carried out on matching index labels rather than by position:

In [None]:
s1 = pd.Series({'a':3, 'b':5, 'c':2})
s2 = pd.Series({'b':3, 'a':5, 'c':2})
s1+s2

and they don't have to be the same size:

In [None]:
s3 = pd.Series({'b':3, 'a':5, 'c':2, 'd':10}) # there is no matching 'd' in s1 therefore d=NaN
s1+s3
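
When the `NaN` for unmatched labels is not desired, the arithmetic methods accept a `fill_value`. A minimal sketch with the same two Series, treating a missing entry as 0:

```python
import pandas as pd

s1 = pd.Series({'a': 3, 'b': 5, 'c': 2})
s3 = pd.Series({'b': 3, 'a': 5, 'c': 2, 'd': 10})

plain = s1 + s3                    # 'd' has no partner in s1, so the result is NaN
filled = s1.add(s3, fill_value=0)  # the missing 'd' in s1 is treated as 0
```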

Index membership:

In [None]:
"b" in s1 # s1 : {'a':3, 'b':5, 'c':2}

In contrast to NumPy arrays and R vectors, which are homogeneous containers, a Series may hold values of different types:

In [None]:
s = pd.Series({'a':3, 'b':5, 'c':'7'})
s.dtype
[type(v) for v in s]
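
A mixed Series like this one is stored with the generic `object` dtype. If the values are meant to be numbers, one possible way to obtain a homogeneous Series is `pd.to_numeric`, sketched here on the same data:

```python
import pandas as pd

s = pd.Series({'a': 3, 'b': 5, 'c': '7'})
mixed_dtype = s.dtype         # generic object dtype for mixed content

numeric = pd.to_numeric(s)    # parse the string '7' as a number
total = numeric.sum()         # arithmetic now works on all values
```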

## Series methods and submodules

An exhaustive review of [Series' methods and submodules](https://pandas.pydata.org/docs/reference/series.html#) is beyond the scope of this course. Here we only review several common uses.



In [None]:
s1 = pd.Series(['apple', 'watermelon', 'orange', 'pear', 'cherry', 'strawberry'],
               index=list("abcdef"))
s2 = pd.Series(['apple', 'kiwi', 'orange', 'pear', 'cherry', 'grape'],
               index=list("abcdef"))
s3 = pd.Series(np.log(np.arange(0.1,10,.1)), index=np.arange(0.1,10,.1))

s4 = np.array([""])

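s1.unique()                    # distinct values
s1.count()                     # number of non-missing values
s1.compare(s2)                 # entries where s1 and s2 differ
s1.filter(['a','b'])           # keep the entries with index labels 'a' and 'b'
s3.plot();                     # plot values against the index (requires matplotlib)
s1.drop(['b','f'])             # remove the entries with index labels 'b' and 'f'
s3.apply(lambda x: np.abs(x))  # apply a function to every value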

## Timestamp

In [None]:
dates = pd.Series(['1-4-1988', '1-1-1987', '1-12-2011', '1-6-2005', '1-5-2005'])
tss = pd.to_datetime(dates,format="%d-%m-%Y")
tss.min(), tss.max()
tss.sort_values()

# DataFrame

A `pandas` DataFrame is a 2-dimensional structure which may be viewed as a collection of Series. It has indices for both dimensions. We will use the terms observations and rows interchangeably, and likewise variables and columns. DataFrame and Series can also represent data with more than 2 dimensions through so-called `hierarchical indexing`, which is beyond the scope of this course.

&#9888; We will be working with homogeneous Series in the context of DataFrames.

To create a DataFrame use the function pd.DataFrame:

**Synopsis: &nbsp; &nbsp;**<tt>DataFrame(data=None, index=None, columns=None, dtype=None, copy=None)</tt>

Most arguments are familiar from pd.Series except the additional *columns* with which the indices of the second dimension are controlled.


In [None]:
df = pd.DataFrame(data=[[3,'a'], [5,'b'], [7,'c']],                    # list, tuple, or np.array
                  columns=['x', 'y'])                                  #
df = pd.DataFrame({'x': [3,5,7], 'y': ['a','b','c']})                  # dictionary of columns
df = pd.DataFrame({3:'a', 5:'b', 7: 'c'}.items(), columns= ['x','y'])  # dictionary of rows

## DataFrame : read/write

You may want to store the DataFrame you just created or share it with others. The most common data format to store a DataFrame is the comma-separated values (csv) format. Use the `to_csv` method to export a DataFrame and `pd.read_csv` to import one:

In [None]:
df = pd.DataFrame({'x': rng.standard_normal(10), 'y': rng.standard_normal(10)})
df.to_csv("df.csv",index=False)   # write df to file 'df.csv', do not include index
df = pd.read_csv("df.csv")        # read df.csv into df object
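
The same round trip can be tried without touching the file system. This sketch assumes an in-memory text buffer (`io.StringIO`) in place of a file name, and small made-up values so that the result is easy to check.

```python
import io
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3], 'y': ['a', 'b', 'c']})

buf = io.StringIO()
df.to_csv(buf, index=False)        # write csv text into the buffer, without the index
csv_text = buf.getvalue()          # the csv content as a string

restored = pd.read_csv(io.StringIO(csv_text))  # parse the text back into a DataFrame
```

With `index=False` the row labels are not written, so `read_csv` assigns a fresh default index.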

## Inspect content

In [None]:
df = pd.DataFrame({'x': rng.standard_normal(10), 'y': rng.standard_normal(10)}) # x and y two random variables

df.head()                   # top 5 (default) observations
df.tail(2)                  # last 2 observations
df.head(5).tail(2)          # composition
df.shape                    # size of the dimensions
df.size                     # total number of elements
df.columns                  # the columns indices/names
df.dtypes                   # listing of all columns' types
df.describe()               # descriptive summary of all variables

## Select columns

### Single column

You can select a column from a DataFrame using the square bracket `df["column_name"]` or `df.column_name`. When only one column name is given the result is a Series, with a list of columns the result is a DataFrame:

In [None]:
df["x"]   # Series
df.x      # <=>  df["x"]
df[["x"]] # DataFrame

Only column names that are `valid python names` (identifiers) can be accessed through the dot `.`:

In [None]:
pd.DataFrame({'valid_name': [1,2,3], 'another variable':[3,2,1]  }).valid_name
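
A column whose name contains a space remains reachable through square brackets. A short sketch on the same DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'valid_name': [1, 2, 3], 'another variable': [3, 2, 1]})

col = df['another variable']          # square brackets work for any column name
has_attr = hasattr(df, 'valid_name')  # dot access exists only for identifier-like names
```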

### Multiple columns

Use a list of indices to select multiple columns:

In [None]:
df[['x', 'y']]          # explicit
df[df.columns[[0,1]]]   # use positions on df.columns

## Select rows

### Using logical criteria

Similar to NumPy logical masks, we can select the rows for which a logical condition holds. A condition on the variables of a DataFrame returns a logical value for each row in the form of a `Series` object:

In [None]:
df = pd.DataFrame({'x': rng.standard_normal(10), 'y': rng.standard_normal(10)}) # x and y two random variables
df[((df.x < 0) & (df.y > 0))]  # parentheses around each comparison are required
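
With fixed values the mask itself can be inspected. The numbers below are made up for this sketch; it also shows the operators `|` (or) and `~` (not), which combine masks in the same way as `&`.

```python
import pandas as pd

df = pd.DataFrame({'x': [-1.0, 2.0, -3.0, 4.0],
                   'y': [1.0, 1.0, -1.0, -1.0]})

mask = (df.x < 0) & (df.y > 0)              # boolean Series, one value per row
selected = df[mask]                         # rows where both conditions hold
either = df[(df.x < 0) | (df.y > 0)]        # rows where at least one condition holds
neither = df[~((df.x < 0) | (df.y > 0))]    # rows where no condition holds
```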

### Using index : loc method

Rows in a DataFrame are by default labelled with the integers $[0,n)$. The `DataFrame` indexer `loc` can be used in the following forms:

- `df.loc[<row-label>]`                : select a row by its label
- `df.loc[<row-label>,<column-label>]` : select the indexed entry

Both row-label and column-label may be a single label, a list/array of labels, a slice, or a boolean array or Series. Though these indexing schemes may look similar to NumPy, there are two cautionary remarks:

- The labels are not positional indices.
- The slices used with `.loc` are inclusive of start and stop, i.e. [0,k].


In [None]:
df.loc[1]               # [.] row 1 as a Series
df.loc[[1]]             # [[.]] row 1 as a DataFrame
df.loc[1,'x']           # [.,.] labels
df.loc[0:3, 'x':'y']    # [.,.] slices
df.loc[df.x > df.y,'x'] # [.,.] boolean

**iloc:** Also take a look at the indexer `iloc`, which is similar to `loc` except that it selects rows and columns by integer position (single integers, lists, slices or boolean arrays). Its slices exclude the stop, as in NumPy.
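
The two cautionary remarks become visible once the row labels differ from the positions. The labels 10 to 40 below are an arbitrary choice for this sketch.

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3, 4]}, index=[10, 20, 30, 40])

by_label = df.loc[10:30]   # label slice, stop included: labels 10, 20 and 30
by_pos = df.iloc[0:2]      # position slice, stop excluded: positions 0 and 1
single = df.loc[20, 'x']   # the entry in row label 20, column 'x'
```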

## Update variables

DataFrame's columns can be updated with an assignment `=` with or without a row selection:

In [None]:
df.x=range(df.shape[0])            # variable size and the size of the new values must match.
df.loc[(df.x % 2 ==0),'y'] = None  # set y values to NaN where x is an even value

Updating values according to a selection should only be done with `.loc` (or `.iloc`). For example, both selections below are equivalent but only the `.loc` version can be used in an assignment:

In [None]:
s1 = df[0::2]['y']      # selection with composition (aka chained)
s2 = df.loc[0::2, 'y']  # selection with loc
s1.equals(s2)           # s1 == s2
df[0::2]['y'] = -2      # chained assignment: raises a warning and may leave df unchanged
df.loc[0::2, 'y'] = -2  # valid

## Merge Series and DataFrames

To combine Series or DataFrames use the `pd.concat` function:

**Synopsis: &nbsp; &nbsp;**<tt>concat(objs, axis=0, ignore_index=False, copy=True)</tt>

In [None]:
s1 = pd.Series(list("abcd"))          # ['a', 'b', 'c', 'd']
s2 = pd.Series(range(4))              # [0, 4)
pd.concat([s1,s2])                    # Series
pd.concat([s1,s2], ignore_index=True)  # Series
pd.concat([s1,s2], axis=1)            # DataFrame

In [None]:
df1 = pd.DataFrame({'a':range(3), 'b':list("abc")})
df2 = pd.DataFrame({'c':range(5), 'b':list("abcde"[::-1])})
pd.concat([df1,df2], axis=0, join='outer')  # along axis 0
pd.concat([df1,df2], axis=1, join='outer')  # along axis 1
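
The `join` argument decides what happens to labels that are not shared. As a sketch on the same two DataFrames, `join='inner'` keeps only the columns present in both, while the default `'outer'` keeps all columns and fills the gaps with missing values.

```python
import pandas as pd

df1 = pd.DataFrame({'a': range(3), 'b': list("abc")})
df2 = pd.DataFrame({'c': range(5), 'b': list("abcde"[::-1])})

outer = pd.concat([df1, df2], axis=0, join='outer')  # union of the columns
inner = pd.concat([df1, df2], axis=0, join='inner')  # only the shared column 'b'
```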

## Add row to DataFrame

For this we can use the `pd.concat` function:

In [None]:
df = pd.DataFrame({'Year': [2021, 2021], 'Month': [11, 12],'Day': [9, 16]})
new_row =  pd.DataFrame({'Year': [2023], 'Month': [3],'Day': [20]})
pd.concat([df, new_row])
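
Note that the new row brings its own index label 0, so the result contains that label twice. If consecutive labels are preferred, `ignore_index=True` renumbers the rows, as in this sketch:

```python
import pandas as pd

df = pd.DataFrame({'Year': [2021, 2021], 'Month': [11, 12], 'Day': [9, 16]})
new_row = pd.DataFrame({'Year': [2023], 'Month': [3], 'Day': [20]})

kept = pd.concat([df, new_row])                           # labels 0, 1, 0
renumbered = pd.concat([df, new_row], ignore_index=True)  # labels 0, 1, 2
```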

## Missing values

Recall from the lectures the special values None and NaN, representing no value and not a number respectively. They are of different types and have different properties. In the context of DataFrames we have the notion of missing values, and they can be represented by both.

NaN and None types:

In [None]:
s = pd.Series(["0", float('nan'), np.nan,  2, None])
[type(v) for v in s]
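
Two of the differing properties can be sketched briefly: NaN is a float that does not compare equal to itself, and a `None` placed among numbers is converted to NaN, which turns the Series into a float Series.

```python
import numpy as np
import pandas as pd

nan_equal = (np.nan == np.nan)   # NaN never compares equal, not even to itself
nan_type = type(np.nan)          # NaN is an ordinary Python float

s_num = pd.Series([1, None, 3])  # None among numbers is stored as NaN
missing = s_num.isna().tolist()  # both representations count as missing
```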

### Handling missing data

Possible actions when dealing with missing data are  *summarise*, *remove* or *replace* missing values.

To be able to do any action on missing values you'll need to first find them. DataFrame and Series have the methods `isna` and `isnull` (alias to `isna`) for finding missing values. Both return a logical mask with `True` marking the location of the missing values. We will use `isna` throughout the lectures.

In [None]:
s.isna()      # isna: boolean marking missing value
s[s.notna()]  # <=> s[~ s.isna()]
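
For the *summarise* action, the mask returned by `isna` can be counted or averaged. A minimal sketch on a small made-up DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                   'b': [np.nan, np.nan, 6.0]})

per_column = df.isna().sum()      # number of missing values in each column
per_row = df.isna().sum(axis=1)   # number of missing values in each row
share = df.isna().mean()          # fraction of missing values in each column
```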

With `dropna` you can discard all missing values from a Series object. With DataFrames you have more control over how the missing values are discarded:

**Synopsis: &nbsp; &nbsp;**<tt>pandas.DataFrame.dropna(axis=0, how='any', thresh=None, inplace=False)</tt>

In [None]:
sample_space = np.arange(10).tolist() + ([np.nan]*2)
df = pd.DataFrame(rng.choice(sample_space,25).reshape(5,5))

In [None]:
df.dropna(axis=0) # default : drop rows having any missing values
df.dropna(axis=1) # drop columns having any missing values
df.dropna(axis=1, how='all') # drop columns having only missing values
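
The `thresh` argument from the synopsis sets the minimum number of non-missing values a row must have to be kept. The sketch below uses fixed, made-up values; it also uses `subset`, a further `dropna` parameter not listed in the synopsis above, which restricts the check to the given columns.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, np.nan],
                   'b': [2.0, 5.0, np.nan],
                   'c': [3.0, 6.0, np.nan]})

at_least_two = df.dropna(thresh=2)   # keep rows with 2 or more non-missing values
only_a = df.dropna(subset=['a'])     # drop rows where column 'a' is missing
```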

With `fillna` we can replace the missing values, either with a fixed value or with a set of values (Series, DataFrame etc.) matched according to the indices. The examples below use a scalar and a Series of column means:

In [None]:
df.fillna(0)
df.fillna(df.mean(axis=0))
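
Two further variants, sketched on made-up values: a dictionary gives a separate replacement per column, and `ffill` carries the last observed value forward instead of using a fixed one.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                   'b': [np.nan, 5.0, np.nan]})

per_col = df.fillna({'a': 0, 'b': -1})  # column 'a' filled with 0, column 'b' with -1
forward = df.ffill()                    # propagate the previous value down each column
```

A leading missing value has nothing before it, so `ffill` leaves it missing.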

## Group operations

When the data has categorical variables we may be interested in descriptive statistics for each group. This can be done by first grouping the data with the `groupby` method and then summarising each group. I'll use the [diamonds](https://ggplot2.tidyverse.org/reference/diamonds.html) dataset for illustration.

In [None]:
diamonds = pd.read_csv("data/diamonds.csv") # read diamonds.csv

In [None]:
grp = diamonds[diamonds.columns.drop(['color'])].groupby(['cut', 'clarity'], as_index=False) # drop 'color', then group by cut and clarity
grp.indices
grp.ngroups
df = grp.mean()
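
Because the cell above depends on a local file, here is a self-contained sketch of the same pattern. The `toy` DataFrame and its prices are invented for this example and only borrow the column names `cut` and `price`; `agg` with named aggregations computes several summaries per group at once.

```python
import pandas as pd

# made-up data, not taken from the diamonds dataset
toy = pd.DataFrame({'cut': ['Ideal', 'Ideal', 'Good', 'Good', 'Good'],
                    'price': [100, 300, 50, 150, 100]})

summary = toy.groupby('cut', as_index=False).agg(
    mean_price=('price', 'mean'),   # average price per cut
    n=('price', 'count'),           # number of non-missing prices per cut
)
```

With `as_index=False` the grouping variable stays a regular column in the result, as in the `diamonds` example.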