# Pandas Cheatsheet

- Importing: 
    - `pd.read_csv(filepath,sep=',',delimiter=None,header='infer'...)` reads a CSV file into a DataFrame.
    - `pd.read_excel`
    - `pd.read_json`
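
As a minimal sketch of `pd.read_csv`, the snippet below parses a small in-memory CSV rather than a file on disk. The column names, values, and the `;` separator are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV content; io.StringIO lets read_csv treat a string like a file
csv_text = "name;score\nalice;1.5\nbob;2.0\n"

# sep sets the field delimiter; header='infer' (the default) takes
# the column names from the first row
scores = pd.read_csv(io.StringIO(csv_text), sep=";")
print(scores)
```

Passing a file path instead of the `StringIO` object reads from disk in the same way.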

## General functions 
- Data manipulations
    - `pd.melt(frame,id_vars,value_vars,...)` unpivots a DataFrame from wide to long format. It is useful for reshaping a DataFrame into a format where one or more columns are identifier variables, while all other columns, considered measured variables, are "unpivoted" to the row axis.
    - `pd.crosstab(index,columns,values,...)` computes a simple cross tabulation of two (or more) factors. By default, it computes a frequency table of the factors unless an array of values and an aggregation function are passed.
    - `pd.cut` bins values into discrete intervals.
    - `pd.qcut` is a quantile-based discretization function. It discretizes a variable into equal-sized buckets based on rank or on sample quantiles.
    - `pd.merge` merges a DataFrame or named Series objects with a database-style join. The join is done on columns or indexes.
    - `pd.merge_ordered` performs merge with optional filling/interpolation. Designed for ordered data like time series data. 
    - `pd.merge_asof` performs an asof merge. This is similar to a left-join except that we match on the nearest key rather than equal keys. Both DataFrames must be sorted by the key. 
    - `pd.concat` concatenates pandas objects along a particular axis with optional set logic along the other axes. Can also add a layer of hierarchical indexing on the concatenation axis.
    - `pd.get_dummies` converts categorical variables into dummy/indicator variables. 
    - `pd.factorize` encodes an object as an enumerated type or categorical variable. This method is useful for obtaining a numeric representation of an array when all that matters is identifying distinct values.
    - `pd.unique` returns unique values in order of appearance.
    - `pd.wide_to_long` transforms a wide panel to long format. Less flexible, but more user-friendly than melt. 
    - **`pd.to_numeric`** converts an argument to a numeric type. The default return dtype is *float64* or *int64* depending on the data supplied.
- Missing data
    - **`pd.isna`** detects missing values (i.e., NaN in numeric arrays, None or NaN in object arrays, NaT in datetimelike). Returns a bool or array-like bool.
    - `pd.isnull` is an alias for `pd.isna`. Same thing!
    - `pd.notna` detects non-missing values and returns a bool or array-like bool. 
    - `pd.notnull` is an alias for `pd.notna`. 
    
- Datetime manipulations
    - **`pd.to_datetime`** converts arguments to datetime. 
    - `pd.to_timedelta` converts arguments to timedelta. Timedeltas are absolute differences in times, expressed in different units (e.g., days, hours, minutes, seconds).
    - `pd.date_range` returns a fixed frequency DatetimeIndex. 
    - `pd.bdate_range` returns a fixed frequency DatetimeIndex, with business day as the default frequency.
    - `pd.period_range` returns a fixed frequency PeriodIndex.
    - `pd.timedelta_range` returns a fixed frequency TimedeltaIndex, with day as the default frequency.
    - `pd.infer_freq` infers the most likely frequency given the input index.
    - `pd.interval_range` returns a fixed frequency IntervalIndex.
    
- Evaluation
    - `pd.eval` evaluates a Python expression given as a string using various backends. Supported operations: `+`, `-`, `*`, `/`, `//`, `**`, `%`, `|`, `&`, `~`.
    
- Hashing
    - `pd.util.hash_array` when given a 1d array, returns an array of deterministic integers.
    - `pd.util.hash_pandas_object` returns a data hash of the Index/Series/DataFrame.
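
The reshaping and binning entries above (`pd.melt`, `pd.cut`, `pd.qcut`) can be sketched together as follows. The table, column names, bin edges, and labels are all hypothetical and chosen only to keep the output small.

```python
import pandas as pd

# Hypothetical wide table: one row per student, one column per exam
wide = pd.DataFrame({
    "student": ["ana", "ben"],
    "math": [90, 72],
    "physics": [85, 64],
})

# id_vars stay as identifier columns; value_vars are unpivoted into rows
long = pd.melt(wide, id_vars="student", value_vars=["math", "physics"],
               var_name="exam", value_name="score")

# pd.cut bins by value using explicit edges (right-inclusive by default)
long["band"] = pd.cut(long["score"], bins=[0, 70, 80, 100],
                      labels=["low", "mid", "high"])

# pd.qcut bins by sample quantiles, so each bucket holds about the same number of rows
long["half"] = pd.qcut(long["score"], q=2, labels=["bottom", "top"])
print(long)
```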
    
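For `pd.merge_asof`, a small sketch may help show what matching on a nearby key means in practice. The trade and quote data are invented. With the default `direction="backward"`, each left row is paired with the last right row whose key is at or before it.

```python
import pandas as pd

# Hypothetical data: trades and the quotes that preceded them
trades = pd.DataFrame({
    "time": pd.to_datetime(["2021-01-01 09:00:03", "2021-01-01 09:00:07"]),
    "qty": [100, 50],
})
quotes = pd.DataFrame({
    "time": pd.to_datetime(["2021-01-01 09:00:01",
                            "2021-01-01 09:00:05",
                            "2021-01-01 09:00:08"]),
    "price": [10.0, 10.5, 11.0],
})

# Both frames are already sorted by "time", as merge_asof requires.
# Each trade picks up the most recent quote at or before its timestamp.
matched = pd.merge_asof(trades, quotes, on="time")
print(matched)
```

Passing `direction="nearest"` or `direction="forward"` changes which neighbouring key is chosen.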

At the most basic level, Pandas objects can be thought of as an enhanced version of NumPy arrays in which the rows and columns are identified with labels rather than simple integer indices. 

Three fundamental Pandas structures are `Series`, `DataFrame`, and `Index`.

## Series
- A `Series` is a one-dimensional array of indexed data. The essential difference between the `Series` object and a one-dimensional NumPy array is the presence of the index: while the NumPy array has an *implicitly* defined integer index, the Pandas `Series` has an *explicitly* defined index associated with the values. 
    - The explicit index definition gives the `Series` object additional capabilities. 
        - The index doesn't need to be an integer; it can consist of values of any desired type (e.g., string, datetime, etc.).
        - The index can be non-contiguous or non-sequential. 
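
A short sketch of both points, using made-up labels and values:

```python
import pandas as pd

# String labels instead of the default 0..n-1 integers
labeled = pd.Series([0.25, 0.5, 0.75, 1.0], index=["a", "b", "c", "d"])
print(labeled["b"])      # label-based lookup

# Integer labels need not be contiguous or ordered
sparse = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])
print(sparse[5])         # 5 is a label here, not a position
print(sparse.iloc[1])    # position-based access to the same element
```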
        
### Constructing series objects
- General format: `pd.Series(data,index=index)`. Note that index is an optional argument. 
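    - `data` can be a dictionary, in which case `index` defaults to the dictionary keys. Recent pandas versions (0.23+ on Python 3.6+) keep the keys in insertion order; older versions sorted them.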
    
 Series attributes/functions (some of them):
 - `pd.Series.index`
 - `pd.Series.array`
 - `pd.Series.values`
 - `pd.Series.dtype`
 - `pd.Series.shape`
 

In [None]:
import numpy as np
import pandas as pd

In [None]:
data = pd.Series([0.25,0.5,0.75,1.0])
print(data)

In [None]:
data.values

In [None]:
data.index

In [31]:
df = pd.DataFrame({'B': [0, 1, 2, np.nan, 4]})
df

Unnamed: 0,B
0,0.0
1,1.0
2,2.0
3,
4,4.0


In [32]:
df.rolling(2, win_type='triang').sum()

Unnamed: 0,B
0,
1,0.5
2,1.5
3,
4,


In [33]:
df.expanding(2).sum()

Unnamed: 0,B
0,
1,1.0
2,3.0
3,3.0
4,7.0


In [None]:
pd.Series([2,4,5])

In [None]:
pd.Series({2:'a',1:'b',3:'c'})

Dictionary to Pandas Series. By default, a `Series` is created whose index is drawn from the dictionary keys (in insertion order on pandas 0.23+ with Python 3.6+; older versions sorted the keys).
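
The sketch below reuses the same dictionary to show how the index is derived and how an explicit `index` argument changes it. It assumes a reasonably recent pandas; `sort_index()` is shown as a version-independent way to get key-sorted order.

```python
import pandas as pd

letters = {2: "a", 1: "b", 3: "c"}

# With no index argument, the dictionary keys become the index
s = pd.Series(letters)
print(s.index.tolist())

# An explicit index selects and orders the keys; keys missing from the dict become NaN
subset = pd.Series(letters, index=[3, 2, 4])
print(subset)

# sort_index() returns a key-sorted Series regardless of construction order
print(s.sort_index())
```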

## DataFrame
A `DataFrame` is an analog of a two-dimensional array with both flexible row indices and flexible column names. 

`pd.DataFrame(data,index,columns,dtype,copy)`

DataFrame attributes/functions (some of them):

- Attributes and underlying data:
    - `pd.DataFrame.index` returns the index (row labels) of the DataFrame as well as the dtype of the index. 
    - `pd.DataFrame.columns` returns the names of the columns.
    - `pd.DataFrame.dtypes` returns a series with the dtypes of each column in the DataFrame. The result's index is the original DataFrame's columns. 
    - `pd.DataFrame.info(verbose=True)` prints information about a DataFrame including the index dtype and columns, non-null values and memory usage. 
    - `pd.DataFrame.shape` returns a tuple representing the dimensionality of the DataFrame.
    - `pd.DataFrame.values` returns a NumPy representation of the DataFrame. Only the values are returned; the axes labels are removed.
    - `pd.DataFrame.axes` returns a list representing the axes of the DataFrame.
    - `pd.DataFrame.ndim` returns an int representing the number of axes/array dimensions. 
    - `pd.DataFrame.size` returns an int representing the number of elements in this object.
    - `pd.DataFrame.memory_usage` returns the memory usage of each column in bytes. 
    - `pd.DataFrame.empty` returns True if the DataFrame is entirely empty (no items, i.e., any of the axes has length 0).
    - `pd.DataFrame.attrs` returns a dictionary storing global metadata for this DataFrame.
    
- Type conversion 
    - `pd.DataFrame.astype()` casts a pandas object to a specified dtype.
    - `pd.DataFrame.convert_dtypes()` converts columns to the best possible dtypes using dtypes supporting `pd.NA`.
    - `pd.DataFrame.infer_objects` attempts to infer better dtypes for object columns. 

- Info:
    - `pd.DataFrame.keys()` returns 'info axis' (columns)

- Selection:
    - `pd.DataFrame.head`
    - `pd.DataFrame.tail`
    - `pd.DataFrame.loc[]` accesses a group of rows and columns by label(s) or a boolean array. Note that passing a list, as in `[[label(s)/boolean array(s)]]`, returns a DataFrame.
    - `pd.DataFrame.iloc` accesses a group of rows and columns by integer position.
    - `pd.DataFrame.select_dtypes(include=None,exclude=None)` returns a subset of the DataFrame's columns based on the column dtypes specified to include or exclude. 
    - `pd.DataFrame.at[row,col]` accesses a single value for a row/column label pair.
    - `pd.DataFrame.iat[row_i,col_i]` accesses a single value for a row/column pair by integer position.
    - `pd.DataFrame.lookup` takes equal-length arrays of row and column labels and returns an array of the values corresponding to each (row, col) pair. Deprecated in pandas 1.2 and removed in 2.0.
    - `pd.DataFrame.pop`
    - `pd.DataFrame.at_time`
    - `pd.DataFrame.between_time`
    - `pd.DataFrame.filter`
    - `pd.DataFrame.first`
    - `pd.DataFrame.last`
    - `pd.DataFrame.sample`
    - `pd.DataFrame.groupby`
    - `pd.DataFrame.xs`
    - `pd.DataFrame.take`
    - `pd.DataFrame.get`
    - `pd.DataFrame.query`

- Changing: 
    - `pd.DataFrame.insert(loc,column,value)` inserts column into DataFrame at specified location. 
    - `pd.DataFrame.where`
    - `pd.DataFrame.mask`
    - `pd.DataFrame.update`

- Iteration:
    - `pd.DataFrame.__iter__()` iterates over the column labels.
    - `pd.DataFrame.iteritems()` iterates over the columns, returning a tuple with the column name and content as a Series. Deprecated in pandas 1.5 and removed in 2.0; use `items()` instead.
    - `pd.DataFrame.items()` iterates over the columns, returning a tuple with the column name and the content as a Series.
    - `pd.DataFrame.iterrows()` iterates over the DataFrame rows as (index, Series) pairs.
    - `pd.DataFrame.itertuples()` iterates over DataFrame rows as named tuples.
      
- Data manipulation
    - `pd.DataFrame.drop`
    - `pd.DataFrame.drop_duplicates`
    - `pd.DataFrame.replace`
    - `pd.DataFrame.clip`
    - `pd.DataFrame.explode`

- Sorting
    - `pd.DataFrame.reorder_levels`
    - `pd.DataFrame.sort_values`
    - `pd.DataFrame.sort_index`
    - `pd.DataFrame.nlargest`
    - `pd.DataFrame.nsmallest`
    
- Reshaping
    - `pd.DataFrame.melt`
    - `pd.DataFrame.pivot`
    - `pd.DataFrame.pivot_table`
    - `pd.DataFrame.T`
    - `pd.DataFrame.transpose`
    - `pd.DataFrame.stack`
    - `pd.DataFrame.unstack`
    - `pd.DataFrame.truncate`
    - `pd.DataFrame.squeeze`
    - `pd.DataFrame.swapaxes`
    - `pd.DataFrame.append`
    - `pd.DataFrame.to_xarray`
    - `pd.DataFrame.assign`
    - `pd.DataFrame.droplevel`

- Joining/merging
    - `pd.DataFrame.join`
    - `pd.DataFrame.merge`
    - `pd.DataFrame.align`
    - `pd.DataFrame.combine`
    - `pd.DataFrame.combine_first`

- Math: 
    - `pd.DataFrame.add`
    - `pd.DataFrame.sub`
    - `pd.DataFrame.div`
    - `pd.DataFrame.mul`
    - `pd.DataFrame.truediv`
    - `pd.DataFrame.floordiv`
    - `pd.DataFrame.mod`
    - `pd.DataFrame.pow`
    - `pd.DataFrame.dot`
    - `pd.DataFrame.radd`
    - `pd.DataFrame.rsub`
    - `pd.DataFrame.rmul`
    - `pd.DataFrame.rdiv`
    - `pd.DataFrame.rtruediv`
    - `pd.DataFrame.rfloordiv`
    - `pd.DataFrame.rmod`
    - `pd.DataFrame.rpow`
    
- Function application
    - `pd.DataFrame.apply()`
    - `pd.DataFrame.applymap`
    - `pd.DataFrame.pipe`
    - `pd.DataFrame.agg`
    - `pd.DataFrame.aggregate`
    - `pd.DataFrame.eval`
    - `pd.DataFrame.transform`
    - `pd.DataFrame.rolling`
    - `pd.DataFrame.expanding`

- Null values
    - `pd.DataFrame.isna`
    - `pd.DataFrame.isnull`
    - `pd.DataFrame.notna`
    - `pd.DataFrame.notnull`
    - `pd.DataFrame.fillna`
    - `pd.DataFrame.dropna`
    - `pd.DataFrame.backfill`
    - `pd.DataFrame.bfill`
    - `pd.DataFrame.ffill`
    - `pd.DataFrame.interpolate`
    - `pd.DataFrame.pad`

- Computations/descriptive stats
    - **`pd.DataFrame.describe`**
    - `pd.DataFrame.abs`
    - `pd.DataFrame.corr`
    - `pd.DataFrame.corrwith`
    - `pd.DataFrame.cov`
    - `pd.DataFrame.quantile`
    - `pd.DataFrame.cummax`
    - `pd.DataFrame.cummin`
    - `pd.DataFrame.cumprod`
    - `pd.DataFrame.cumsum`
    - `pd.DataFrame.diff`
    - `pd.DataFrame.kurt`
    - `pd.DataFrame.kurtosis`
    - `pd.DataFrame.mad`
    - `pd.DataFrame.max`
    - `pd.DataFrame.median`
    - `pd.DataFrame.min`
    - `pd.DataFrame.mode`
    - `pd.DataFrame.count`
    - `pd.DataFrame.rank`
    - `pd.DataFrame.round`
    - `pd.DataFrame.sem`
    - `pd.DataFrame.skew`
    - `pd.DataFrame.sum`
    - `pd.DataFrame.std`
    - `pd.DataFrame.var`
    - `pd.DataFrame.nunique`
    - `pd.DataFrame.value_counts`
    - `pd.DataFrame.pct_change`
    - `pd.DataFrame.prod`
    - `pd.DataFrame.product`
    
- Time Series-related
    - `pd.DataFrame.asfreq`
    - `pd.DataFrame.asof`
    - `pd.DataFrame.shift`
    - `pd.DataFrame.to_period`
    - `pd.DataFrame.to_timestamp`
    - `pd.DataFrame.resample`

- Reindex
    - `pd.DataFrame.reindex`
    - `pd.DataFrame.reindex_like`
    - `pd.DataFrame.set_index`
    - `pd.DataFrame.idxmin`
    - `pd.DataFrame.idxmax`  

- Label manipulation
    - `pd.DataFrame.add_prefix`
    - `pd.DataFrame.add_suffix`
    - `pd.DataFrame.rename`
    - `pd.DataFrame.rename_axis`
    - `pd.DataFrame.reset_index`
    - `pd.DataFrame.set_axis`

- Boolean
    - `pd.DataFrame.all`
    - `pd.DataFrame.any`
    - `pd.DataFrame.duplicated`
    - `pd.DataFrame.isin`

- Copies
    - `pd.DataFrame.copy(deep=True)` makes a copy of this object's indices and data. When `deep=True` (the default), modifications to the data or indices of the copy are not reflected in the original object. When `deep=False`, the copy shares data with the original, so changes to the original's data are reflected in the shallow copy (and vice versa).

- Comparison
    - `pd.DataFrame.equals` tests whether two objects contain the same elements.
    - `pd.DataFrame.compare` compares to another DataFrame and shows the differences.
    
- Style:
    - `pd.DataFrame.style` returns a Styler object which contains methods for building a styled HTML representation of the DataFrame. This can be used to do things like highlight null values, apply a color map, apply background gradients, etc. See [documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html).
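
For the `copy` entry above, here is a minimal sketch of the deep-copy case with made-up data. Only `deep=True` is shown, since how a shallow copy shares data can depend on the pandas version and its copy-on-write setting.

```python
import pandas as pd

original = pd.DataFrame({"x": [1, 2, 3]})

# deep=True (the default) copies the data, so the two objects are independent
snapshot = original.copy(deep=True)
original.loc[0, "x"] = 99

print(original["x"].tolist())   # reflects the modification
print(snapshot["x"].tolist())   # still holds the values from before the change
```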

In [36]:
data = {'col_0': [9, -3, 0, -1, 5], 'col_1': [-2, -7, 6, 8, -5]}
df= pd.DataFrame(data)
df

Unnamed: 0,col_0,col_1
0,9,-2
1,-3,-7
2,0,6
3,-1,8
4,5,-5


In [37]:
df.clip(-4,6)

Unnamed: 0,col_0,col_1
0,6,-2
1,-3,-4
2,0,6
3,-1,6
4,5,-4


In [40]:
df = pd.DataFrame({'A': [[1, 2, 3], 'foo', [], [3, 4]], 'B': 1})
df

Unnamed: 0,A,B
0,"[1, 2, 3]",1
1,foo,1
2,[],1
3,"[3, 4]",1
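
A frame like the one above is the typical input for `DataFrame.explode`. The following is a minimal sketch, assuming pandas 0.25 or later (where `explode` is available). The name `df_lists` is used here only to keep the example self-contained.

```python
import pandas as pd

df_lists = pd.DataFrame({"A": [[1, 2, 3], "foo", [], [3, 4]], "B": 1})

# Each list element gets its own row and the index label is repeated;
# scalars pass through unchanged, and an empty list becomes NaN
exploded = df_lists.explode("A")
print(exploded)
```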


In [46]:
pd.__version__

'0.23.4'

In [49]:
pip install --upgrade ipython


Requirement already up-to-date: ipython in /Applications/anaconda3/lib/python3.7/site-packages (7.18.1)
Note: you may need to restart the kernel to use updated packages.


In [47]:
!pip install --upgrade pandas

Requirement already up-to-date: pandas in /Applications/anaconda3/lib/python3.7/site-packages (1.1.3)


In [56]:
?pd.DataFrame.explode


In [None]:
df = pd.DataFrame([[0, 2, 3], [0, 4, 1], [10, 20, 30]],
                  index=[4, 5, 6], columns=['A', 'B', 'C'])

In [None]:
for i in df.items():  # iteritems() was removed in pandas 2.0; items() is equivalent
    print(i)

In [None]:
df = pd.DataFrame([[0, 2, 3], [0, 4, 1], [10, 20, 30]],
                  index=['a', 5, 6], columns=['A', 'B', 'C'])

In [None]:
df.index

In [None]:
df.info(memory_usage='deep', verbose=True, show_counts=True)  # show_counts replaces null_counts (pandas >= 1.2)

In [23]:
df = pd.DataFrame({'num_legs': [2, 4], 'num_wings': [2, 0]},
                  index=['falcon', 'dog'])
df

Unnamed: 0,num_legs,num_wings
falcon,2,2
dog,4,0


In [24]:
df.isin([0, 2])

Unnamed: 0,num_legs,num_wings
falcon,True,True
dog,False,True


In [29]:
d = {'num_legs': [4, 4, 2, 2],
     'num_wings': [0, 0, 2, 2],
     'class': ['mammal', 'mammal', 'mammal', 'bird'],
     'animal': ['cat', 'dog', 'bat', 'penguin'],
     'locomotion': ['walks', 'walks', 'flies', 'walks']}

df = pd.DataFrame(data=d)
df = df.set_index(['class', 'animal', 'locomotion'])

df.xs('mammal')

animal,locomotion,num_legs,num_wings
cat,walks,4,0
dog,walks,4,0
bat,flies,2,2


In [None]:
df.convert_dtypes()

In [None]:
df = pd.DataFrame(
    {
        "a": pd.Series([1, 2, 3], dtype=np.dtype("int32")),
        "b": pd.Series(["x", "y", "z"], dtype=np.dtype("O")),
        "c": pd.Series([True, False, np.nan], dtype=np.dtype("O")),
        "d": pd.Series(["h", "i", np.nan], dtype=np.dtype("O")),
        "e": pd.Series([10, np.nan, 20], dtype=np.dtype("float")),
        "f": pd.Series([np.nan, 100.5, 200], dtype=np.dtype("float")),
    }
)

In [20]:
df.axes

[Int64Index([0, 1, 2], dtype='int64'), Index(['A', 'B', 'C'], dtype='object')]

In [None]:
for i in df.__iter__():
    print(i)

In [None]:
df.dtypes

In [None]:
df.infer_objects()

In [None]:
df.convert_dtypes()

In [None]:
df= pd.DataFrame({"A": ["a", 1, 2, 3]})

In [None]:
df

In [None]:
for idx, row in df.iterrows():  # each item is an (index, Series) pair
    for value in row:           # iterate over the values in the row
        print(value)

In [None]:
df_types = pd.DataFrame([[1, 1.5]], columns=['int', 'float'])  # separate name keeps `df` intact for the cells below
row = next(df_types.iterrows())[1]
row
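
The cell above hints at a subtlety of `iterrows`: because each row is packed into a single `Series`, mixed numeric columns are upcast to one dtype. The sketch below contrasts this with `itertuples`, using the same tiny frame under the illustrative name `df_types`.

```python
import pandas as pd

df_types = pd.DataFrame([[1, 1.5]], columns=["int", "float"])

# iterrows() returns each row as a Series, so the integer is upcast to float
_, row = next(df_types.iterrows())
print(row["int"], type(row["int"]))

# itertuples() yields namedtuples and keeps each column's own value type
first = next(df_types.itertuples(index=False))
print(first)
```

When per-column types matter, or when iterating over many rows, `itertuples` is usually the better fit.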

In [None]:
df.at[1,'A']

In [None]:
df.iat[1,0]

In [None]:
df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
     index=['cobra', 'viper', 'sidewinder'],
     columns=['max_speed', 'shield'])

In [None]:
df

In [None]:
df.loc[['viper']]
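
The double brackets matter here. As a sketch of the selection entries listed earlier, the snippet below rebuilds the same frame under the illustrative name `df_snakes` and compares a few access patterns.

```python
import pandas as pd

df_snakes = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
                         index=["cobra", "viper", "sidewinder"],
                         columns=["max_speed", "shield"])

row_as_series = df_snakes.loc["viper"]     # single label -> Series
row_as_frame = df_snakes.loc[["viper"]]    # list of labels -> DataFrame

# Boolean mask on the rows combined with a column label
fast = df_snakes.loc[df_snakes["max_speed"] > 1, "shield"]

by_position = df_snakes.iloc[1, 0]         # row 1, column 0 by integer position
by_label = df_snakes.at["viper", "shield"] # single value by label pair
```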

In [None]:
for i in df.__iter__():
    print(i)

In [None]:
for i in df.items():  # iteritems() was removed in pandas 2.0; items() is equivalent
    print(i)

In [None]:
for i in df.iterrows():
    print(i)

In [None]:
df.xs('viper')

In [None]:
df.at['cobra', 'max_speed']  # lookup() needs equal-length label arrays and was removed in pandas 2.0

In [None]:
df = pd.DataFrame([['1990', 'a', 5, 4, 7, 2], ['1991', 'c', 10, 1, 2, 0], ['1992', 'd', 2, 1, 4, 12], ['1993', 'a', 5, 8, 11, 6]], columns=('Date', 'best', 'a', 'b', 'c', 'd'))
df

In [None]:
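idx, cols = pd.factorize(df['best'])  # lookup() was removed in pandas 2.0; factorize + NumPy indexing is the replacement
df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]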

- Exporting: 
    - `pd.DataFrame.to_csv` writes the object to a CSV file.
    - `pd.DataFrame.to_sql`
    - `pd.DataFrame.to_dict`
    - `pd.DataFrame.to_excel` writes the object to an Excel sheet.
    - `pd.DataFrame.to_json` converts the object to a JSON string.
    - `pd.DataFrame.to_html`
    - `pd.DataFrame.to_string`
    - `pd.DataFrame.to_markdown`
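
A small sketch of a few of these exporters, using an invented two-row frame. No path is passed, so `to_csv` and `to_json` return strings here rather than writing files.

```python
import io

import pandas as pd

df_out = pd.DataFrame({"name": ["alice", "bob"], "score": [1.5, 2.0]})

# With no path argument, to_csv and to_json return strings instead of writing files
csv_text = df_out.to_csv(index=False)
json_text = df_out.to_json(orient="records")
records = df_out.to_dict(orient="records")

# Round trip: the CSV text parses back into the same values
restored = pd.read_csv(io.StringIO(csv_text))
print(csv_text)
print(json_text)
```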