# Pandas

## 1. Overview

### 1.1. Introduction

The pandas library is specifically designed for data analysis and manipulation. 

It offers a variety of features and functionalities that make it an essential tool for anyone working with data in Python. 


**Pandas offers:**

* **Data Structures:** Pandas provides two main data structures: Series (one-dimensional labeled array) and DataFrame (two-dimensional labeled data structure similar to a spreadsheet). These structures allow us to store and organize our data efficiently.
* **Data Cleaning and Manipulation:** Pandas offers extensive tools for cleaning and manipulating our data, including handling missing values, filtering, sorting, and aggregating data.
* **Data Analysis:** Pandas comes with a rich set of statistical and analytical functions, allowing us to perform various analyses on our data, such as calculating descriptive statistics, finding correlations, and creating visualizations.
* **Time Series Analysis:** Pandas has specialized functionalities for working with time series data, making it a valuable tool for financial analysis, weather forecasting, and other time-dependent applications.
* **Integration with Other Libraries:** Pandas seamlessly integrates with other popular Python libraries like NumPy, Matplotlib, and Scikit-learn, allowing you to build powerful data analysis workflows.
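
As a minimal sketch of the features listed above, the snippet below builds a Series and a DataFrame, fills a missing value, filters and sorts rows, and aggregates by group. The fruit data and the variable names are made up for illustration.

```python
import pandas as pd

# A Series is a one-dimensional labeled array.
prices = pd.Series([10.5, 20.0, 7.25], index=["apple", "banana", "cherry"])

# A DataFrame is a two-dimensional labeled table; each column is a Series.
df = pd.DataFrame(
    {
        "fruit": ["apple", "banana", "cherry", "apple"],
        "qty": [3, 5, None, 2],  # None becomes a missing value (NaN)
        "price": [10.5, 20.0, 7.25, 10.5],
    }
)

# Cleaning: fill the missing quantity with 0.
df["qty"] = df["qty"].fillna(0)

# Manipulation: filter rows, then sort them by price.
expensive = df[df["price"] > 8].sort_values("price", ascending=False)

# Analysis: total quantity per fruit.
totals = df.groupby("fruit")["qty"].sum()
print(totals)
```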


**Advantages of pandas:**

* **Powerful and Efficient:** Pandas is designed for speed and efficiency, making it suitable for working with large datasets.
* **Easy to Learn:** The syntax of pandas is relatively simple and intuitive, even for beginners in Python.
* **Versatile:** Pandas can be used for a wide range of data analysis tasks, from basic cleaning and manipulation to complex statistical analysis and visualization.
* **Popular and Well-Supported:** Pandas has a large and active community, meaning we can easily find resources, tutorials, and help online.
* **Optimized Data Structures:** Pandas leverages optimized data structures like NumPy arrays and Series internally, providing efficient memory management and fast access to data elements.
* **Vectorized Operations:** Pandas uses vectorized operations instead of looping, allowing it to perform calculations on entire arrays of data simultaneously, significantly speeding up computations compared to traditional Python loops.
* **C-Level Optimizations:** Pandas utilizes C-level code for critical operations, further enhancing performance and memory efficiency compared to pure Python implementations.
* **Eager, Interactive Evaluation:** Pandas evaluates operations eagerly: each call runs immediately and returns its result, which suits interactive analysis and exploration. Pandas does not defer work through lazy evaluation; libraries such as Dask or Polars offer that model.
* **Wide Range of Data Types:** Pandas supports various data types, including numerics, strings, categorical, datetimes, and more, allowing for flexible data manipulation and analysis.
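
To illustrate the vectorized operations mentioned above, here is a small sketch that computes the same result once as a single vectorized expression and once with an explicit Python loop. The formula `x * 2 + 1` is arbitrary; on large inputs the vectorized form is typically much faster, though the exact speed-up depends on the data and the machine.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(1_000, dtype="float64"))

# Vectorized: one expression applied to the whole Series at once.
vectorized = s * 2 + 1

# Equivalent element-by-element Python loop.
looped = pd.Series([x * 2 + 1 for x in s])

# Both approaches produce the same values.
same = vectorized.equals(looped)
print(same)
```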


**Disadvantages of Pandas:**

* **Not Ideal for Big Data:** Although pandas can handle large datasets, it might not be the best choice for truly massive datasets typically associated with "big data" applications. For such scenarios, libraries like Spark are designed to handle distributed processing and scale efficiently.
* **Limited Support for Unstructured Data:** Pandas primarily focuses on structured data in tabular formats. While it can handle some unstructured data manipulation, it's not ideal for complex processing of text, images, or other non-tabular data types. Libraries like spaCy or OpenCV are better suited for such tasks.
* **Learning Curve for Advanced Features:** While the core functionalities of pandas are relatively easy to learn, mastering its advanced features like `groupby` operations, custom functions, and complex data transformations can have a steeper learning curve.
* **Memory Overhead:** Pandas data structures are designed for flexibility and ease of use, so they carry some memory overhead compared to raw NumPy arrays. This can become an issue for very large datasets, since pandas generally expects the data to fit in memory.
* **Limited Parallelism:** While pandas allows some parallelization, it's not optimized for large-scale distributed computing like libraries like Dask or Spark. This can limit performance for massive datasets requiring parallel processing across multiple cores or machines.
* **GIL (Global Interpreter Lock):** Python's GIL can limit performance for CPU-bound operations in pandas, especially on multi-core systems, because most pandas operations run on a single thread. Some operations can optionally be parallelized, for example by passing `engine="numba"` to certain rolling and groupby methods.
  

**Pandas vs NumPy**

* **When to use pandas:**
  * Structured data
  * Data cleaning and manipulation
  * Statistical analysis and exploration
  * Time series analysis
  * Data visualization
* **When to use NumPy:**
  * Purely numerical operations
  * Limited data size
  * Specific data types
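
The contrast above can be sketched with a tiny example: NumPy works on a homogeneous numeric array addressed by position, while pandas adds labels and allows a non-numeric column next to the numbers. The column names are illustrative only.

```python
import numpy as np
import pandas as pd

# NumPy: homogeneous numeric array, addressed purely by position.
arr = np.array([[1.0, 2.0], [3.0, 4.0]])
col_means_np = arr.mean(axis=0)

# pandas: the same numbers with column labels, plus a text column.
df = pd.DataFrame(arr, columns=["height", "width"])
df["kind"] = ["a", "b"]
col_means_pd = df[["height", "width"]].mean()

print(col_means_np)
print(col_means_pd["height"])
```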
  

[Docs Reference](https://pandas.pydata.org/docs/reference/index.html)

### 1.2. History

**Early Days (2008-2010):**

* **Conception:** Wes McKinney, frustrated by the lack of efficient tools for data analysis in Python, began developing Pandas in 2008.
* **Inspiration:** Drawing inspiration from R's DataFrames and NumPy's arrays, McKinney aimed to create a library that combined the strengths of both.
* **Initial Release:** In 2010, the first public version of Pandas (0.1.0) was released. It included basic DataFrame and Series functionalities, focused primarily on financial analysis.


**Growth and Adoption (2011-2015):**

* **Rapid Development:** Pandas gained significant traction due to its intuitive interface, efficient data structures, and growing feature set.
* **Community Contributions:** An active community of developers began contributing features, bug fixes, and documentation, accelerating Pandas' development.
* **Integration with the SciPy Stack:** Pandas became a core component of the scientific Python (SciPy) stack alongside NumPy and Matplotlib, and libraries such as Seaborn build on it, solidifying its position as a key tool for data analysis in Python.
* **Introduction of Time Series features:** Time series functionalities implemented for financial and scientific analysis in 2014.

### 1.3. Architecture of Pandas

**A. Core Data Structures:**

* **Series:** One-dimensional labeled array, like a column from a spreadsheet. It holds data of any type (numeric, string, etc.) and is indexed by labels.
* **DataFrame:** Two-dimensional labeled data structure, like a spreadsheet. It consists of columns (Series) with different data types and is indexed by rows and columns.
* **Panel (deprecated):** Three-dimensional analogue of the DataFrame. It was rarely used, was deprecated in version 0.20, and was removed in version 0.25.


**B. Internal Building Blocks:**

* **BlockManager:** The heart of Pandas, responsible for managing the physical memory layout of data in Series and DataFrames. It uses NumPy arrays internally for efficient storage and retrieval of data.
* **Index:** Represents the labels for rows and columns in Series and DataFrames. It can be numeric, categorical, or custom objects.
* **DataType:** Defines the type of data stored in a Series or DataFrame column (e.g., integer, string, datetime).


**C. Key Architectural Aspects:**

* **Vectorized Operations:** Pandas leverages vectorized operations, performing calculations on entire arrays at once instead of individual elements, leading to significant performance gains.
* **Mutable vs. Immutable:** Series and DataFrames are mutable (their values can be changed in place), while `Index` objects are immutable, which lets them be shared safely between objects. The underlying NumPy arrays are themselves mutable.
* **Eager Evaluation:** Pandas evaluates operations eagerly, so each call computes its result immediately. This keeps interactive analysis and exploration simple; lazy evaluation is not part of pandas itself.
* **Integration with NumPy:** Pandas builds upon NumPy arrays for efficient data storage and manipulation, offering a seamless experience for numerical operations.

### 1.4. Objects in Pandas

**A. Series:**

* Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). 
* The axis labels are collectively referred to as the index.
* Imagine it like a single column from a spreadsheet with labels for each element.
* Used for storing and manipulating sequences of data.
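* A Series holds a single data type (dtype), which is inferred or specified at creation and can be converted later with `astype`. Values can be modified in place; most operations that add or remove elements return a new Series.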
* Series supports vectorized operations.
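
A short sketch of the Series behaviour described above, using made-up labels and values: it shows the index, the single dtype, vectorized arithmetic, label-based access, and label alignment between two Series.

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=["a", "b", "c"], name="count")

print(s.index)     # the axis labels
print(s.dtype)     # one dtype for the whole Series

doubled = s * 2    # vectorized arithmetic keeps the labels
by_label = s["b"]  # label-based access

# Arithmetic between two Series aligns on labels, not on positions.
other = pd.Series([10, 20], index=["b", "c"])
aligned = s + other  # "a" has no partner, so the result there is NaN
```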


**B. DataFrame:**

* DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. 
* We can think of it like a spreadsheet or SQL table, or a dict of Series objects.
* The primary workhorse of Pandas, used for storing, manipulating, and analyzing tabular data.
* DataFrame accepts many different kinds of input:
  * Dict of 1D ndarrays, lists, dicts, or Series
  * 2-D numpy.ndarray
  * Structured or record ndarray
  * A Series
  * Another DataFrame
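
The input kinds listed above can be sketched as follows. All names and values are illustrative; each call uses the plain `pd.DataFrame` constructor.

```python
import numpy as np
import pandas as pd

# From a dict of lists (keys become column names).
from_dict = pd.DataFrame({"x": [1, 2], "y": ["a", "b"]})

# From a dict of Series (rows are aligned on the Series indexes).
from_series_dict = pd.DataFrame(
    {
        "x": pd.Series([1, 2], index=["r1", "r2"]),
        "y": pd.Series([3.0], index=["r1"]),
    }
)

# From a 2-D NumPy array, with explicit column labels.
from_array = pd.DataFrame(np.zeros((2, 3)), columns=["a", "b", "c"])

# From a structured (record) array; field names become columns.
records = np.array([(1, 2.5), (2, 3.5)], dtype=[("id", "i4"), ("score", "f8")])
from_records = pd.DataFrame(records)

# From a single Series (its name becomes the column name).
from_series = pd.DataFrame(pd.Series([1, 2], name="only"))
```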


**C. Index:**

* Represents the labels for rows and columns in Series and DataFrames.
* Can be numeric, categorical, or even custom objects.
* Provides row and column identifiers (labels need not be unique) and facilitates data retrieval, selection, and alignment.
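
As a small sketch with made-up labels, the example below selects rows through the index and shows that an `Index` cannot be modified element by element; relabeling is done by assigning a whole new index.

```python
import pandas as pd

df = pd.DataFrame({"score": [90, 75, 60]}, index=["ann", "bob", "cy"])

row = df.loc["bob"]             # label-based row selection
subset = df.loc[["ann", "cy"]]  # select several labels at once

# Item assignment on an Index raises TypeError.
try:
    df.index[0] = "zed"
except TypeError as exc:
    error_message = str(exc)

# To relabel, assign a complete new index instead.
df.index = ["a", "b", "c"]
```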


**D. Data Type:**

* Defines the type of data stored in a Series or DataFrame column (e.g., integer, string, datetime).
* Determines how data is stored and manipulated internally, impacting performance and operations.
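
To make the effect of the data type concrete, here is a brief sketch (column names are illustrative) that converts string columns to an integer and a datetime dtype, after which numeric and date operations become available.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "n": ["1", "2", "3"],
        "when": ["2024-01-01", "2024-01-02", "2024-01-03"],
    }
)

# Convert string columns to more specific dtypes.
df["n"] = df["n"].astype("int64")
df["when"] = pd.to_datetime(df["when"])

print(df.dtypes)

total = df["n"].sum()                  # numeric operations now work
first_day = df["when"].dt.day.iloc[0]  # datetime accessors now work
```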

**E. BlockManager:**

* (Internal) The core of Pandas, responsible for managing the physical memory layout of data in Series and DataFrames.
* Uses NumPy arrays internally for efficient storage and retrieval of data.
* We don't directly interact with this object, but it's crucial for Pandas functionality.


**F. Panel (deprecated):**

* A three-dimensional analogue of the DataFrame that was rarely used; it was deprecated in version 0.20 and removed in version 0.25.
* Considered less intuitive and efficient for most data analysis tasks.

## 2. Common Arguments for methods and classes throughout Pandas

- `sep` character that is treated as delimiter in the file.
- `delimiter` alias for `sep`
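- `header` row number(s) containing the column labels and marking the start of the data. The default is `'infer'`: it behaves like `header=0` (the first line holds the labels) when `names` is not passed, and like `header=None` when `names` is passed. Use `header=None` when the file has no header row.
- `names` sequence of column labels to apply. If the file has no header row, pass `header=None` together with `names`. If the file already has a header row, pass `header=0` together with `names` to replace the existing labels.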
- `index_col` columns to use as row labels, `index_col = False` can be used to force pandas to not use first column as row labels.
- `usecols` subset of columns to select, given either as column labels or as column indices. It can be a list-like of all `int` positions or all `str` labels, or a callable that is evaluated against each column name and keeps the columns for which it returns `True`. Ex - `usecols=[0, 2]`, `usecols=['a', 'c']` or `usecols=lambda x: x.upper() in ['AA', 'BB']`.
- `dtype` data type to apply. Pass a single type to apply it to every column, or a dictionary to give columns different types. Ex - `dtype='str'` or `dtype='int'` or `dtype={'col1': 'int', 'col2': 'str'}`.
- `engine` parser engine to use. It can be `c`, `python` or `pyarrow`. The C and pyarrow engines are faster, while the python engine is currently more feature-complete; multithreading is currently supported only by the `pyarrow` engine.
- `converters` functions for converting values in specified columns. It is a dict of {hashable: callable} where the keys can be either column labels or column indices.
- `true_values` list of values to consider as `True`, in addition to case-insensitive variants of 'True'.
- `false_values` list of values to consider as `False`, in addition to case-insensitive variants of 'False'.
- `skipinitialspace` bool, if `True` skip spaces after the delimiter.
- `skiprows` `int`, `list` of `int` or callable. Line numbers to skip (0-indexed) or number of lines to skip at the start of the file. Ex - to skip the first two lines `skiprows=2`, to skip specific lines `skiprows=[2, 5, 8]`, to skip lines for which a callable returns `True` `skiprows=lambda x: x in [0, 2]`.
- `skipfooter` `int`, default 0, number of lines at the bottom of the file to skip (unsupported with `engine='c'`).
- `nrows` its `int` and specifies number of lines to read from the file.
- `na_values` hashable, iterable of hashable or dict of {hashable: iterable}. Additional strings to recognize as NA/NaN. A scalar or list applies to all columns; a dict specifies NA values per column. Ex - for all the columns `na_values='NA'` or `na_values=['NA', 'Nan']`, different NA values in different columns `na_values={'a': 'NA', 'b': 'Nan'}`. By default the following values, in addition to the empty string, are interpreted as NaN: “#N/A”, “#N/A N/A”, “#NA”, “-1.#IND”, “-1.#QNAN”, “-NaN”, “-nan”, “1.#IND”, “1.#QNAN”, “<NA>”, “N/A”, “NA”, “NULL”, “NaN”, “None”, “n/a”, “nan”, “null”.
- `na_filter` bool, default `True`, detect missing value markers (empty strings and the value of `na_values`). In data without any NA values, passing `na_filter=False` can improve the performance of reading a large file.
- `verbose` bool, default `False`, indicate the number of NA values placed in non-numeric columns.
- `skip_blank_lines` bool, default `True`, If `True`, skip over blank lines rather than interpreting as NaN values.
- `parse_dates` bool, list of hashable, list of lists or dict of {hashable: list}, default is `False`.
  - if `parse_dates` is `True` try parsing the index.
  - if `parse_dates` is `list` of `int` or `str` try parsing columns each as a separate date column. Ex - `[1, 2, 3]` or `['a', 'b']`.
  - if `parse_dates` is `list` of `list` combine columns and parse as a single date column and values are joined with space before parsing. Ex - `[[1, 2]]` it will combine values from column 1 and 2.
  - if `parse_dates` is `dict` parse columns as date and call result the key, values are joined with space before parsing. Ex - `{'a': [1, 2]}` it will combine column 1 and 2 and call it `a`.
- `keep_date_col` bool, default `False`, if `True` and `parse_dates` specifies combining multiple columns then keep the original columns else not.
- `date_format` str or dict of column, format to use for parsing `parse_dates`. Ex - `%d%m%Y`.
- `dayfirst` bool, default `False`, parse dates in DD/MM format (international and European format).
- `cache_dates` bool, default `True`, if `True` use a cache of unique converted dates to apply the `datetime` conversion. It speeds up the parsing of duplicate date strings.
- `iterator` bool, default `False`, return `TextFileReader` object for iteration or getting chunks with `get_chunk()`.
- `chunksize` int, optional, number of lines to read from the file per chunk.
- `compression` str or dict, default `infer`, for on-the-fly decompression of on-disk data. If `infer`, some commonly used compression formats are detected automatically from the file extension.
- `thousands` str (length 1), optional, Character acting as the thousands separator in numerical values.
- `decimal` str (length 1), default ‘.’, Character to recognize as decimal point (e.g., use ‘,’ for European data).
- `lineterminator` str (length 1), optional, Character used to denote a line break. Only valid with C parser.
- `quotechar` str (length 1), optional, Character used to denote the start and end of a quoted item. Quoted items can include the delimiter and it will be ignored.
- `quoting` {0 or csv.QUOTE_MINIMAL, 1 or csv.QUOTE_ALL, 2 or csv.QUOTE_NONNUMERIC, 3 or csv.QUOTE_NONE}, default csv.QUOTE_MINIMAL, controls field quoting behavior per the `csv.QUOTE_*` constants. The default `csv.QUOTE_MINIMAL` (i.e., 0) implies that only fields containing special characters are quoted (e.g., characters defined in `quotechar`, `delimiter`, or `lineterminator`).
- `doublequote` bool, default True, When quotechar is specified and quoting is not `QUOTE_NONE`, indicate whether or not to interpret two consecutive quotechar elements INSIDE a field as a single quotechar element.
- `escapechar` str (length 1), optional, Character used to escape other characters.
- `comment` str (length 1), optional, character indicating that the remainder of the line should not be parsed. If found at the beginning of a line, the line will be ignored altogether. Like empty lines (as long as `skip_blank_lines=True`), fully commented lines are ignored by the `header` parameter but not by `skiprows`.
- `encoding` str, optional, default ‘utf-8’, encoding to use when reading/writing (ex. 'utf-8'). Any of the Python standard encodings can be used.
- `encoding_errors` str, optional, default ‘strict’, How encoding errors are treated.
- `dialect` str or csv.Dialect, optional, it will override values (default or not) for the following parameters: `delimiter`, `doublequote`, `escapechar`, `skipinitialspace`, `quotechar`, and `quoting`. 
- `on_bad_lines` {‘error’, ‘warn’, ‘skip’} or Callable, default ‘error’, Specifies what to do upon encountering a bad line (a line with too many fields). Allowed values are :
  - `error`, raise an Exception when a bad line is encountered.
  - `warn`, raise a warning when a bad line is encountered and skip that line.
  - `skip`, skip bad lines without raising or warning when they are encountered.
- `delim_whitespace` bool, default `False`, specifies whether white space will be used as `sep`.
- `low_memory` bool, default True, Internally process the file in chunks, resulting in lower memory use while parsing, but possibly mixed type inference. To ensure no mixed types either set False, or specify the type with the dtype parameter.
- `memory_map` bool, default False, if a filepath is provided for `filepath_or_buffer`, map the file object directly onto memory and access the data directly from there. Using this option can improve performance because there is no longer any I/O overhead.
- `float_precision` {‘high’, ‘legacy’, ‘round_trip’}, optional, specifies which converter the C engine should use for floating-point values. The options are None or `high` for the ordinary converter, `legacy` for the original lower precision pandas converter, and `round_trip` for the round-trip converter.
- `storage_options` dict, optional, Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. 
- `dtype_backend` {‘numpy_nullable’, ‘pyarrow’}, default ‘numpy_nullable’, back-end data type applied to the resultant DataFrame (still experimental). Behavior is as follows:
  - `numpy_nullable`: returns nullable-dtype-backed DataFrame (default).
  - `pyarrow`: returns pyarrow-backed nullable ArrowDtype DataFrame.
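
Several of the arguments above can be combined in one `read_csv` call. The following is a minimal sketch that reads from an in-memory string instead of a file; the sample data, column names, and the choice of arguments are assumptions made for illustration.

```python
import io
import pandas as pd

raw = io.StringIO(
    "# exported sample\n"
    "id;name;joined;score\n"
    "1;ann;2024-01-05;1,5\n"
    "2;bob;2024-02-10;n/a\n"
    "3;cy;2024-03-15;3,0\n"
)

df = pd.read_csv(
    raw,
    sep=";",                 # field delimiter
    comment="#",             # ignore the leading comment line
    header=0,                # first parsed line holds the column labels
    index_col="id",          # use the "id" column as row labels
    usecols=["id", "name", "joined", "score"],
    dtype={"name": "str"},   # per-column dtype via a dict
    parse_dates=["joined"],  # parse this column as datetime
    decimal=",",             # "1,5" is read as 1.5
    na_values=["n/a"],       # treat "n/a" as missing
)

print(df)
```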

## 3. Input/Output Functions in Pandas 

Various input/output general functions are available in pandas.

**Note:** In most cases, functions called on a `DataFrame` can be used in the same way on a `Series`.

### 3.1. Pickling

- `read_pickle(filepath_or_buffer, compression='infer', storage_options=None)` Load pickled pandas object (or any object) from file.
  - `compression` For on-the-fly decompression of on-disk data, if `infer` it detects automatically some commonly used compression formats.
  - `storage_options` extra options for a particular storage connection (for example a remote filesystem), such as host, port, username, or password.
- `DataFrame.to_pickle(path, *[, compression, ...])` Pickle (serialize) object to file.
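
A minimal round-trip sketch for the two pickling functions, assuming a temporary directory and a hypothetical file name `frame.pkl.gz`. The `.gz` extension lets the default `compression='infer'` pick gzip on both write and read.

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "frame.pkl.gz")
    df.to_pickle(path)               # compression inferred from ".gz"
    restored = pd.read_pickle(path)  # decompressed on the fly

print(restored.equals(df))
```

Pickle files can execute arbitrary code when loaded, so `read_pickle` should only be used on files from a trusted source.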

### 3.2. Flat File

- `read_table(filepath_or_buffer, *[, sep, ...])` Read general delimited file into DataFrame.
- `read_csv(filepath_or_buffer, *[, sep, ...])` Read a comma-separated values (csv) file into DataFrame.
- `DataFrame.to_csv([path_or_buf, sep, na_rep, ...])` Write object to a comma-separated values (csv) file.
- `read_fwf(filepath_or_buffer, *[, colspecs, ...])` Read a table of fixed-width formatted lines into DataFrame.
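
The flat-file functions can be tried without touching the disk by reading from in-memory strings. This sketch uses made-up data; the `colspecs` offsets are chosen to match the sample text.

```python
import io
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "city": ["Oslo", "Lima"]})

# to_csv returns a string when no path is given.
csv_text = df.to_csv(index=False)
round_trip = pd.read_csv(io.StringIO(csv_text))

# read_table works like read_csv but defaults to a tab delimiter.
tab_text = df.to_csv(index=False, sep="\t")
from_table = pd.read_table(io.StringIO(tab_text))

# read_fwf reads fixed-width columns; colspecs are half-open [start, end) offsets.
fwf_text = "id city\n 1 Oslo\n 2 Lima\n"
from_fwf = pd.read_fwf(io.StringIO(fwf_text), colspecs=[(0, 2), (3, 7)])
```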

### 3.3. Clipboard

- `read_clipboard([sep, dtype_backend])` Read text from clipboard and pass to `read_csv()`.
- `DataFrame.to_clipboard(*[, excel, sep])` Copy object to the system clipboard.

### 3.4. Excel

- `read_excel(io[, sheet_name, header, names, ...])` Read an Excel file into a pandas DataFrame.
- `DataFrame.to_excel(excel_writer, *[, ...])` Write object to an Excel sheet.
- `ExcelFile(path_or_buffer[, engine, ...])` Class for parsing tabular Excel sheets into DataFrame objects.
- `ExcelFile.parse([sheet_name, header, names, ...])` Parse specified sheet(s) into a DataFrame.
- `Styler.to_excel(excel_writer[, sheet_name, ...])` Write Styler to an Excel sheet, where `excel_writer` is a file path or an existing `ExcelWriter` object.
- `ExcelWriter(path[, engine, date_format, ...])` Class for writing DataFrame objects into excel sheets.

### 3.5. JSON

- `read_json(path_or_buf, *[, orient, typ, ...])` Convert a JSON string to pandas object.
- `json_normalize(data[, record_path, meta, ...])` Normalize semi-structured JSON data into a flat table.
- `build_table_schema(data[, index, ...])` Create a Table schema from data.
- `DataFrame.to_json([path_or_buf, orient, ...])` Convert the object to a JSON string.
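
A short sketch of the JSON helpers with illustrative data: a DataFrame is serialized with `to_json`, read back with `read_json`, and a nested list of records is flattened with `json_normalize`.

```python
import io
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "tag": ["a", "b"]})

# Serialize to a JSON string, one object per row.
json_text = df.to_json(orient="records")

# Wrap literal JSON in StringIO; newer pandas versions deprecate
# passing a raw JSON string directly to read_json.
restored = pd.read_json(io.StringIO(json_text), orient="records")

# Flatten nested records into dotted column names.
nested = [
    {"id": 1, "user": {"name": "ann", "age": 30}},
    {"id": 2, "user": {"name": "bob", "age": 25}},
]
flat = pd.json_normalize(nested)
print(flat.columns.tolist())
```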

### 3.6. HTML

- `read_html(io, *[, match, flavor, header, ...])` Read HTML tables into a list of DataFrame objects.
- `Styler.to_html([buf, table_uuid, ...])` Write Styler to a file, buffer or string in HTML-CSS format.
- `DataFrame.to_html([buf, columns, col_space, ...])` Render a DataFrame as an HTML table.

### 3.7. XML

- `read_xml(path_or_buffer, *[, xpath, ...])` Read XML document into a DataFrame object.
- `DataFrame.to_xml([path_or_buffer, index, ...])` Render a DataFrame to an XML document.

### 3.8. Latex

- `Styler.to_latex([buf, column_format, ...])` Write Styler to a file, buffer or string in LaTeX format.
- `DataFrame.to_latex([buf, columns, header, ...])` Render object to a LaTeX tabular, longtable, or nested table.

### 3.9. HDFStore: PyTables(HDF5)

- `read_hdf(path_or_buf[, key, mode, errors, ...])` Read from the store, close it if we opened it.
- `HDFStore.put(key, value[, format, index, ...])` Store object in HDFStore.
- `HDFStore.append(key, value[, format, axes, ...])` Append to Table in file.
- `HDFStore.get(key)` Retrieve pandas object stored in file.
- `HDFStore.select(key[, where, start, stop, ...])` Retrieve pandas object stored in file, optionally based on where criteria.
- `HDFStore.info()` Print detailed information on the store.
- `HDFStore.keys([include])` Return a list of keys corresponding to objects stored in HDFStore.
- `HDFStore.groups()` Return a list of all the top-level nodes.
- `HDFStore.walk([where])` Walk the pytables group hierarchy for pandas objects.

### 3.10. Feather

- `read_feather(path[, columns, use_threads, ...])` Load a feather-format object from the file path.
- `DataFrame.to_feather(path, **kwargs)` Write a DataFrame to the binary Feather format.

### 3.11. Parquet

- `read_parquet(path[, engine, columns, ...])` Load a parquet object from the file path, returning a DataFrame.
- `DataFrame.to_parquet([path, engine, ...])` Write a DataFrame to the binary parquet format.

### 3.12. ORC

- `read_orc(path[, columns, dtype_backend, ...])` Load an ORC object from the file path, returning a DataFrame.
- `DataFrame.to_orc([path, engine, index, ...])` Write a DataFrame to the ORC format.

### 3.13. SAS

- `read_sas(filepath_or_buffer, *[, format, ...])` Read SAS files stored as either XPORT or SAS7BDAT format files. 

### 3.14. SPSS

- `read_spss(path[, usecols, ...])` Load an SPSS file from the file path, returning a DataFrame.

### 3.15. SQL

- `read_sql_table(table_name, con[, schema, ...])` Read SQL database table into a DataFrame.
- `read_sql_query(sql, con[, index_col, ...])` Read SQL query into a DataFrame.
- `read_sql(sql, con[, index_col, ...])` Read SQL query or database table into a DataFrame.
- `DataFrame.to_sql(name, con, *[, schema, ...])` Write records stored in a DataFrame to a SQL database.
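
As a hedged sketch, the SQL functions can be exercised against an in-memory SQLite database from the standard library. The table name `results` and the data are made up for illustration.

```python
import sqlite3
import pandas as pd

df = pd.DataFrame({"name": ["ann", "bob", "cy"], "score": [90, 75, 60]})

# An in-memory SQLite database; pandas accepts a sqlite3 connection directly.
con = sqlite3.connect(":memory:")
try:
    df.to_sql("results", con, index=False)
    top = pd.read_sql_query(
        "SELECT name, score FROM results WHERE score >= ? ORDER BY score DESC",
        con,
        params=(75,),  # bound parameter instead of string formatting
    )
finally:
    con.close()

print(top)
```

Note that `read_sql_table` needs a SQLAlchemy connectable, so this sketch uses `read_sql_query`, which works with a plain `sqlite3` connection.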

### 3.16. Google BigQuery

- `read_gbq(query[, project_id, index_col, ...])` Load data from Google BigQuery.

### 3.17. STATA

- `read_stata(filepath_or_buffer, *[, ...])` Read Stata file into DataFrame.
- `DataFrame.to_stata(path, *[, convert_dates, ...])` Export DataFrame object to Stata dta format.
- `StataReader.data_label` Return data label of Stata file.
- `StataReader.value_labels()` Return a nested dict associating each variable name to its value and label.
- `StataReader.variable_labels()` Return a dict associating each variable name with corresponding label.
- `StataWriter.write_file()` Export DataFrame object to Stata dta format.

## 4. General Functions

Various general-purpose functions are available in pandas.

**Note:** In most cases, functions called on a `DataFrame` can be used in the same way on a `Series`.

### 4.1. Data Manipulation

- `pandas.melt(frame, id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None, ignore_index=True)` Unpivot a DataFrame from wide to long format, optionally leaving identifiers set. It is useful for massaging a DataFrame into a format where one or more columns are identifier variables (`id_vars`), while all other columns, considered measured variables (`value_vars`), are “unpivoted” to the row axis, leaving just two non-identifier columns, ‘variable’ and ‘value’.
  - `id_vars` scalar, tuple, list, ndarray - columns to use as identifier variables.
  - `value_vars` scalar, tuple, list, ndarray - columns to unpivot, if not set uses all columns that are not set as `id_vars`
  - `var_name` scalar - name to use for the variable column, if None uses default column name
  - `value_name` scalar - name to use for the value column, can't be an existing column label
  - `ignore_index` bool - if True the original index is ignored, else it is retained
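
A small sketch of `melt` on made-up exam data: the two subject columns are unpivoted into rows, with custom names for the variable and value columns.

```python
import pandas as pd

wide = pd.DataFrame(
    {"name": ["ann", "bob"], "math": [90, 70], "art": [85, 95]}
)

long = pd.melt(
    wide,
    id_vars="name",              # kept as the identifier column
    value_vars=["math", "art"],  # columns to unpivot
    var_name="subject",
    value_name="score",
)
print(long)
```
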
- `pandas.pivot(data, *, columns, index=_NoDefault.no_default, values=_NoDefault.no_default)` Return reshaped DataFrame organized by given index / column values. Reshape data (produce a “pivot” table) based on column values. Uses unique values from specified index / columns to form axes of the resulting DataFrame.
  - `data` DataFrame
  - `columns` str, list of str, object - column to use to make the new frame's columns
  - `index` str, list of str, object - column to use to make the new frame's index; if not given, uses the existing index
  - `values` str, object, list - column(s) to use for populating the new frame's values; if not specified, all remaining columns will be used and the result will have hierarchically indexed columns
- `pandas.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All', observed=_NoDefault.no_default, sort=True)` create a spreadsheet style pivot table as a dataframe.
  - `data` DataFrame
  - `values` list like, scalar - column or columns to be aggregated
  - `index` columns, grouper, array, list of all of them - keys to group by on the pivot table index
  - `columns` column, grouper, array, list of all of them - keys to group by on the pivot table column
  - `aggfunc` function, list of functions, dict - if a list of function is passed the output pivot table will have hierarchical columns whose top level are the functions names, if a dict is passed the key is column to aggregate and the value is function or list of functions
  - `fill_value` scalar - value to replace the missing values
  - `margins` bool - if `margins=True`, partial aggregates (subtotals and grand totals) are added as extra rows and columns
  - `dropna` bool - if True, do not include columns whose entries are all NaN
  - `margins_name` str - name of the row or column that will contain the totals when margins is True
  - `observed` bool - only applies if groupers are categorical, if True only show observed value else show all values
  - `sort` bool - if True sort values ascending else not
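
The difference between `pivot` and `pivot_table` can be sketched with made-up sales data: `pivot_table` aggregates duplicate index/column pairs, while `pivot` only reshapes and therefore needs each pair to be unique.

```python
import pandas as pd

sales = pd.DataFrame(
    {
        "region": ["north", "north", "south", "south", "south"],
        "product": ["a", "b", "a", "b", "b"],
        "units": [10, 4, 7, 3, 5],
    }
)

# pivot_table aggregates duplicates (south/b appears twice).
table = pd.pivot_table(
    sales,
    values="units",
    index="region",
    columns="product",
    aggfunc="sum",
    fill_value=0,
    margins=True,  # adds the "All" row and column
)

# pivot only reshapes, so duplicates are dropped first in this sketch.
unique = sales.drop_duplicates(["region", "product"])
reshaped = pd.pivot(unique, index="region", columns="product", values="units")
print(table)
```
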
- `pandas.crosstab(index, columns, values=None, rownames=None, colnames=None, aggfunc=None, margins=False, margins_name='All', dropna=True, normalize=False)` compute a cross tabulation of two or more factors.
  - `index` array like, series, list of arrays - values to group by in the rows
  - `columns` array like, series, list of arrays - values to group by in the columns
  - `rownames` sequence - names to be given to rows in the crosstab
  - `colnames` sequence - names to be given to columns in the crosstab
  - `aggfunc` function - aggregate function applied to `values` to compute the crosstab; requires `values` to be specified. If not given, the frequency of each combination is computed
  - `margins` bool - add row/column margins if True
  - `margins_name` str - name of the row/column margin if margins is True
  - `dropna` bool - if True, do not include columns whose entries are all NaN
  - `normalize` bool, {'all', 'index', 'columns'}, or {0, 1} - normalize by dividing all values by the sum of values
      - if `all` or `True` will normalize over all values
      - if `index` will normalize over each row
      - if `columns` will normalize over each column
      - if margins `True` will also normalize margin values
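
A brief `crosstab` sketch on made-up survey data, first as raw counts with margins and then normalized per row.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "gender": ["f", "m", "f", "m", "f"],
        "smoker": ["no", "no", "yes", "yes", "no"],
    }
)

# Frequency counts, with row and column totals in "All".
counts = pd.crosstab(df["gender"], df["smoker"], margins=True)

# Row-wise proportions: each row sums to 1.
shares = pd.crosstab(df["gender"], df["smoker"], normalize="index")
print(counts)
```
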
- `pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates='raise', ordered=True)` used to segment and sort data values into bins. We can also use it to convert continuous variables to categorical variables.
  - `x` array-like - input array to be binned, must be 1D
  - `bins` int, sequence of scalars, IntervalIndex
    - if `int` defines the number of equal-width bins in the range of `x`
    - if `sequence of scalars` defines the bin edges allowing for non uniform width
    - if `IntervalIndex` defines the exact bins to be formed
  - `right` bool - if True each bin includes its rightmost edge, e.g. `(1, 2]`, else its leftmost edge
  - `labels` array or False - labels for the returned bins, which must be the same length as the resulting bins; if False, only integer indicators of the bins are returned
  - `retbins` bool - if True bins are returned else not
  - `precision` int - precision at which to store & display labels
  - `include_lowest` bool - if True the first interval is left-inclusive, else not
  - `ordered` bool - if True labels are ordered, else not
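- `pandas.qcut(x, q, labels=None, retbins=False, precision=3, duplicates='raise')` quantile-based discretization: bins values into equal-sized buckets based on rank or sample quantiles.
  - `x` array-like - input array to be binned, must be 1D
  - `q` int, list like of float - number of quantiles (e.g. 10 for deciles, 4 for quartiles), or an array of quantiles such as `[0, .25, .5, .75, 1.]`
  - `labels` array or False - labels for the returned bins, which must be the same length as the resulting bins; if False, only integer indicators of the bins are returned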
  - `retbins` bool - if True bins are returned else not
  - `precision` int - precision at which to store & display labels
  - `duplicates` {'raise', 'drop'} - if bin edges are not unique, raise a ValueError or drop the non-unique edges
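
The two binning functions can be compared on a small made-up Series of ages: `cut` uses explicit edges, while `qcut` derives the edges from quantiles.

```python
import pandas as pd

ages = pd.Series([5, 17, 18, 40, 65, 90])

# cut: explicit bin edges; intervals are right-closed by default, e.g. (17, 64].
groups = pd.cut(ages, bins=[0, 17, 64, 120], labels=["child", "adult", "senior"])

# qcut: edges come from quantiles, so each bin holds about the same number of values.
halves = pd.qcut(ages, q=2, labels=["low", "high"])

print(groups.tolist())
print(halves.tolist())
```
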
- `pandas.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=None, indicator=False, validate=None)` merge DataFrame or named Series objects with a database-style join. The join is done on columns or on indexes.
  - `left` DataFrame, Series - first object to merge
  - `right` DataFrame, Series - second object to merge
  - `how` `{'left', 'right', 'inner', 'outer', 'cross'}` - type of merge to be performed   
  - `on` label, list - column or index level names to join on
  - `right_on` label, list, array like - column or index level names to join on in the right DataFrame/Series
  - `left_on` label, list, array like - column or index name to join on in the left DataFrames/Series
  - `left_index` bool - if True use the index from the left DataFrame/Series as the join keys
  - `right_index` bool - if True use the index from the right DataFrame/Series as the join keys
  - `sort` bool - if True sort the join keys lexicographically in the result DataFrame
  - `suffixes` list like, default `('_x', '_y')` - a length 2 sequence where each element is optionally a string indicating the suffix to add to overlapping column names in left and right respectively
  - `copy` bool - if False avoid copy if possible
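
A minimal `merge` sketch with made-up customer and order tables. It shows an inner join, then a left join with custom suffixes for the overlapping `name` column and the `indicator` flag from the signature above.

```python
import pandas as pd

customers = pd.DataFrame({"cust_id": [1, 2, 3], "name": ["ann", "bob", "cy"]})
orders = pd.DataFrame(
    {"order_id": [10, 11, 12], "customer": [1, 1, 4], "name": ["pen", "ink", "cap"]}
)

# Inner join: only keys present on both sides survive.
inner = pd.merge(
    customers, orders, how="inner", left_on="cust_id", right_on="customer"
)

# Left join: keep every customer; overlapping "name" columns get suffixes.
left = pd.merge(
    customers,
    orders,
    how="left",
    left_on="cust_id",
    right_on="customer",
    suffixes=("_customer", "_item"),
    indicator=True,  # adds a "_merge" column showing each row's origin
)
print(left)
```
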
- `pandas.merge_ordered(left, right, on=None, left_on=None, right_on=None, left_by=None, right_by=None, fill_method=None, suffixes=('_x', '_y'), how='outer')` perform merge for the ordered data with optional filling/interpolation
  - `left` DataFrame, Series - first object to merge
  - `right` DataFrame, Series - second object to merge 
  - `how` `{'left', 'right', 'outer', 'inner'}` - type of merge to be performed
  - `on` label, list - column or index level names to join on
  - `right_on` label, list, array like - field names to join on in the right DataFrame, or a vector/list of vectors (see `left_on`)
  - `left_on` label, list, array like - field names to join on in the left DataFrame; can also be a vector or list of vectors of the length of the DataFrame to use as the join key instead of columns
  - `left_by` column name, list of column names - Group left DataFrame by group columns and merge piece by piece with right DataFrame. Must be None if either left or right are a Series.
  - `right_by` column name, list of column names - Group right DataFrame by group columns and merge piece by piece with left DataFrame. Must be None if either left or right are a Series.
  - `suffixes` list like, default `('_x', '_y')` - a length 2 sequence where each element is optionally a string indicating the suffix to add to overlapping column names in left and right respectively
  - `fill_method` `{'ffill', None}` - interpolation method for data
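
A minimal sketch of `pandas.merge_ordered` with forward filling. The frames and the `day`, `price` and `volume` columns are invented for illustration.

```python
import pandas as pd

# Hypothetical ordered data keyed by "day"
left = pd.DataFrame({"day": [1, 3, 5], "price": [10.0, 11.0, 12.0]})
right = pd.DataFrame({"day": [2, 3, 6], "volume": [100, 200, 300]})

# Default how='outer' keeps all keys in order; fill_method='ffill'
# carries the last seen value forward into the gaps.
merged = pd.merge_ordered(left, right, on="day", fill_method="ffill")

print(merged["day"].tolist())    # [1, 2, 3, 5, 6]
print(merged["price"].tolist())  # [10.0, 10.0, 11.0, 12.0, 12.0]
```

The first `volume` value stays missing because there is nothing earlier to fill from.
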
- `pandas.merge_asof(left, right, on=None, left_on=None, right_on=None, left_index=False, right_index=False, by=None, left_by=None, right_by=None, suffixes=('_x', '_y'), tolerance=None, allow_exact_matches=True, direction='backward')` perform a merge by key distance (similar to left join except that we match on nearest key rather than equal keys).
  - `left` DataFrame, Series - first object to merge
  - `right` DataFrame, Series - second object to merge
  - `on` label - field name to join on
  - `right_on` label - field name to join on in right DataFrame
  - `left_on` label - field name to join on in the left DataFrame
  - `left_index` bool - if True use the index of the left DataFrame as the join key
  - `right_index` bool - if True use the index of the right DataFrame as the join key
  - `by` column name or list of column names - match on these columns before performing merge operation
  - `left_by` column name - field names to match on in the left DataFrame
  - `right_by` column name - field names to match on in the right DataFrame
  - `suffixes` list like - a length 2 sequence of suffixes to apply to overlapping column names in the left and right side respectively
  - `tolerance` int, Timedelta - select asof matches within this range, must be compatible with the merge index
  - `allow_exact_matches` bool - if True allow matching with the same `on` value (i.e. <=, >=), else only match strictly smaller or larger `on` values (i.e. <, >)
  - `direction` `{'backward', 'forward', 'nearest'}` - whether to search for prior, subsequent or closest matches
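
The effect of `direction` and `tolerance` in `pandas.merge_asof` is easiest to see on a small example. The `trades` and `quotes` frames below are hypothetical; both must already be sorted by the key.

```python
import pandas as pd

# Hypothetical sample data, sorted by the integer key "time"
trades = pd.DataFrame({"time": [2, 5, 12], "qty": [10, 20, 30]})
quotes = pd.DataFrame({"time": [1, 4, 7], "bid": [99.0, 100.0, 101.0]})

# Default direction='backward': take the last quote at or before each trade.
backward = pd.merge_asof(trades, quotes, on="time")

# direction='forward': take the first quote at or after each trade.
forward = pd.merge_asof(trades, quotes, on="time", direction="forward")

# tolerance=3: ignore quotes more than 3 time units away from the trade.
limited = pd.merge_asof(trades, quotes, on="time", tolerance=3)

print(backward["bid"].tolist())  # [99.0, 100.0, 101.0]
```

In `forward` and `limited` the last trade has no acceptable match, so its `bid` is NaN.
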
- `pandas.concat(objs, *, axis=0, join='outer', ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=False, copy=None)` concatenate pandas objects along a particular axis.
  - `objs` sequence, mapping of Series or DataFrame objects - if a mapping is passed, its keys will be used as the `keys` argument unless `keys` is passed
  - `axis` `{0/'index', 1/'columns'}` - axis to concatenate along
  - `join` `{'inner', 'outer'}` - how to handle indexes on other axis
  - `ignore_index` bool - if True do not use the index values along the concatenation axis (the resulting axis will be labeled 0, 1, 2, ...)
  - `keys` sequence - if multiple levels passed, should contain tuples. Construct hierarchical index using the passed keys as the outermost level
  - `levels` list of sequences - specific levels (unique values) to use for constructing a MultiIndex. Otherwise they will be inferred from the keys
  - `names` list - names for the levels in the resulting hierarchical index
  - `verify_integrity` bool - if True checks whether the new concatenated axis contains duplicates
  - `sort` bool - if True sort the non-concatenation axis if it is not already aligned (exception: when that axis is a DatetimeIndex, `join='outer'` and the axis is not already aligned, it is always sorted lexicographically)
  - `copy` bool - if False don't copy data unnecessarily
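
A small sketch of the most commonly used `pandas.concat` options (`join`, `ignore_index`, `keys`, `names`). The frames `a` and `b` are hypothetical.

```python
import pandas as pd

# Hypothetical frames that share only the column "x"
a = pd.DataFrame({"x": [1, 2], "y": [3, 4]})
b = pd.DataFrame({"x": [5, 6], "z": [7, 8]})

# Stack rows; join='outer' (default) keeps the union of columns.
stacked = pd.concat([a, b], ignore_index=True)

# join='inner' keeps only the columns present in every object.
common = pd.concat([a, b], join="inner", ignore_index=True)

# keys/names build a hierarchical index that records the source object.
keyed = pd.concat([a, b], keys=["a", "b"], names=["source", "row"])

print(list(stacked.columns))  # ['x', 'y', 'z']
print(list(common.columns))   # ['x']
```
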
- `pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None)` convert categorical variables into dummy/indicator variables. Each variable is converted into as many 0/1 variables as there are different values. Columns in the output are each named after a value.
  - `data` array like, Series, DataFrame - data of which to get dummy indicators
  - `prefix` str, list of str, dict of str - string(s) to prepend to the DataFrame column names
  - `prefix_sep` str - if appending prefix, this is the separator to use
  - `dummy_na` bool - if True add a column to indicate NaNs else ignored
  - `columns` list like - column names in the DataFrame to be encoded, if columns is None then all the columns with object, string, category dtype will be converted
  - `sparse` bool - if True dummy encoded columns should be backed by a SparseArray else by Numpy array
  - `drop_first` bool - if True will yield `k-1` dummies of `k` categorical levels by removing the first level
  - `dtype` dtype - data type for new columns(only a single dtype is allowed)
- `pandas.from_dummies(data, sep=None, default_category=None)` create a categorical DataFrame from a DataFrame of dummy variables.
  - `data` DataFrame - data which contains dummy coded variables
  - `sep` str - separator used in column names of dummy categories
  - `default_category` None, hashable, dict of hashables - the default category is the implied category when a value has none of the listed categories specified with a one, i.e. if all dummies in a row are zero. Can be a single value for all variables or a dict directly mapping the default categories to a prefix of a variable
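
A minimal round trip between `pandas.get_dummies` and `pandas.from_dummies`, assuming a pandas version that ships `from_dummies` (1.5 or later). The `color` column is hypothetical, and `dtype=int` is passed explicitly because the default dummy dtype differs between pandas versions.

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "red"]})  # hypothetical data

# One 0/1 column per distinct value, named <column><prefix_sep><value>.
dummies = pd.get_dummies(df, columns=["color"], dtype=int)
print(list(dummies.columns))  # ['color_blue', 'color_red']

# from_dummies inverts the encoding; sep tells it how to split the names.
restored = pd.from_dummies(dummies, sep="_")

# With drop_first=True the first level is implied by "all zeros",
# so default_category is needed to recover it.
reduced = pd.get_dummies(df, columns=["color"], drop_first=True, dtype=int)
restored_reduced = pd.from_dummies(reduced, sep="_", default_category="blue")
```
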
- `pandas.factorize(values, sort=False, use_na_sentinel=True, size_hint=None)` encode the object as an enumerated type or categorical variable. This method is useful for obtaining a numeric representation of an array when all that matters is identifying distinct values.
  - `values` sequence - a 1D sequence; sequences that aren't pandas objects are coerced to ndarrays before factorization
  - `sort` bool - if True sort uniques and shuffle codes to maintain the relationship
  - `use_na_sentinel` bool - if True the sentinel `-1` will be used for NaN values, else NaN values will be encoded as non-negative integers and NaN will not be dropped from the uniques of the values
  - `size_hint` int - hint to the hashtable sizer
- `pandas.unique(values)` return unique values based on a hash table. Uniques are returned in order of appearance; the result is not sorted. Significantly faster than `numpy.unique` for long enough sequences.
  - `values` 1D array like
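
A brief sketch of `pandas.factorize` and `pandas.unique` on made-up data. A Series is passed rather than a plain list, since recent pandas versions discourage list input for these functions.

```python
import pandas as pd

labels = pd.Series(["b", "a", None, "b"])  # hypothetical data with a missing value

# codes index into uniques; missing values get the sentinel -1 by default.
codes, uniques = pd.factorize(labels)
print(codes.tolist())    # [0, 1, -1, 0]
print(uniques.tolist())  # ['b', 'a']

# sort=True sorts the uniques and remaps the codes accordingly.
sorted_codes, sorted_uniques = pd.factorize(labels, sort=True)
print(sorted_codes.tolist())  # [1, 0, -1, 1]

# unique keeps order of first appearance and does not sort.
seen = pd.unique(pd.Series([3, 1, 3, 2, 1]))
print(seen.tolist())  # [3, 1, 2]
```
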
- `pandas.lreshape(data, groups, dropna=True)` reshape wide-format data to long. Generalized inverse of `DataFrame.pivot`. Accepts a dictionary, `groups`, in which each key is a new column name and each value is a list of old column names that will be “melted” under the new column name as part of the reshape.
  - `data` DataFrame - wide format DataFrame
  - `groups` dict - `{new_name: list_of_columns}`
  - `dropna` bool - if True do not include columns whose entries are all NaN
- `pandas.wide_to_long(df, stubnames, i, j, sep='', suffix='\\d+')` Unpivot a DataFrame from wide to long format. Less flexible but more user-friendly than melt.
  - `df` DataFrame - wide format DataFrame
  - `stubnames` str, list like - the stub names; the wide format variables are assumed to start with the stub names
  - `i` str, list like - columns to use as id variables
  - `j` str - name of the sub-observation variable, i.e. the name given to the suffix in the long format
  - `sep` str - character separating the stub name from the suffix in the wide format column names
  - `suffix` str - regular expression capturing the wanted suffixes; `'\d+'` captures numeric suffixes
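
A minimal sketch of `pandas.wide_to_long`. The wide table with `score2020` and `score2021` columns is hypothetical; `score` is the stub and the year is the suffix.

```python
import pandas as pd

# Hypothetical wide table: one column per year, all starting with the stub "score"
wide = pd.DataFrame({"id": [1, 2], "score2020": [70, 80], "score2021": [75, 85]})

# i identifies each observation, j names the new column built from the suffix.
long_df = pd.wide_to_long(wide, stubnames="score", i="id", j="year")

print(list(long_df.index.names))        # ['id', 'year']
print(long_df.loc[(1, 2021), "score"])  # 75
```
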

### 4.2. Top Level Missing Data

- `pandas.isna(obj)` detect missing values (NaN in numeric arrays, None or NaN in object arrays, NaT in datetime-like data). Returns a single bool for a scalar and an array of booleans for an array-like object, True where the value is missing.
- `pandas.isnull(obj)` alias of `pandas.isna`.
- `pandas.notna(obj)` detect non-missing values. Returns a single bool for a scalar and an array of booleans for an array-like object, True where the value is not missing.
- `pandas.notnull(obj)` alias of `pandas.notna`.
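
A quick sketch of the scalar versus array behaviour of these functions, using made-up values.

```python
import numpy as np
import pandas as pd

# Scalars give a single bool.
print(pd.isna(np.nan))  # True
print(pd.isna(None))    # True
print(pd.notna(1.5))    # True

# Array-like input gives an element-wise boolean result.
mask = pd.isna(pd.Series([1.0, np.nan, 3.0]))
print(mask.tolist())  # [False, True, False]

present = pd.notnull(np.array([1.0, np.nan]))
print(present.tolist())  # [True, False]
```
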

### 4.3. Top Level Dealing with Numeric Data

- `pandas.to_numeric(arg, errors='raise', downcast=None, dtype_backend=_NoDefault.no_default)` convert argument to a numeric type.
  - `arg` scalar, list, tuple, 1D array, Series - argument to be converted
  - `errors` `{'ignore', 'raise', 'coerce'}` - how to handle invalid parsing
    - `raise` then raise an exception
    - `ignore` then return the input (deprecated since pandas 2.2)
    - `coerce` then set the invalid value as NaN
  - `downcast` `{'integer', 'signed', 'unsigned', 'float'}` - downcast the resulting data to the smallest numerical dtype possible
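
A short sketch of the `errors` and `downcast` options of `pandas.to_numeric`, on hypothetical input strings.

```python
import pandas as pd

raw = pd.Series(["1", "2.5", "oops"])  # hypothetical input with one bad value

# errors='coerce' turns unparseable entries into NaN.
coerced = pd.to_numeric(raw, errors="coerce")
print(coerced.isna().tolist())  # [False, False, True]

# errors='raise' (the default) raises on the bad value instead.
try:
    pd.to_numeric(raw)
    raised = False
except ValueError:
    raised = True

# downcast picks the smallest dtype that can hold the values.
small = pd.to_numeric(pd.Series([1, 2, 3]), downcast="integer")
print(small.dtype)  # int8
```
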

### 4.4. Top Level Dealing with DateTime like Data

- `pandas.to_datetime(arg, errors='raise', dayfirst=False, yearfirst=False, utc=False, format=None, exact=_NoDefault.no_default, unit=None, infer_datetime_format=_NoDefault.no_default, origin='unix', cache=True)` convert argument to datetime.
  - `arg` int, float, str, datetime, list, tuple, 1D array, Series, DataFrame - the object to be converted to datetime
  - `errors` `{'ignore', 'raise', 'coerce'}` - how to handle invalid parsing
    - `raise` then raise an exception
    - `ignore` then return the input (deprecated since pandas 2.2)
    - `coerce` then set the invalid value as NaT
  - `dayfirst` bool - if True parses datetime with day first (day first in datetime)
  - `yearfirst` bool - if True parses datetime with year first (year first in datetime)
  - `utc` bool - if True the function returns a timezone-aware, UTC-localized Timestamp, Series or DatetimeIndex, else inputs will not be coerced to UTC
  - `format` str - the strftime format used to parse the time (e.g. "%d%m%Y")
  - `exact` bool - if True require an exact format match, else allow the format to match anywhere in the target string
  - `unit` str - the unit of `arg` (D, s, ms, etc.)
  - `infer_datetime_format` bool - if True and no format is given, attempt to infer the format of the datetime strings based on the first non-NaN element and, if it can be inferred, switch to a faster method of parsing them (deprecated since pandas 2.0, where a strict version of this behavior is the default)
  - `origin` scalar, `{'unix', 'julian'}` - define the reference date (if 'julian' is used, `unit` must be 'D')
  - `cache` bool - if True use a cache of unique converted dates to apply the datetime conversion
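
A minimal sketch of common `pandas.to_datetime` calls covering `format`, `errors`, `unit` and `utc`. The input strings are invented for the example.

```python
import pandas as pd

# Explicit strftime format: day/month/year.
parsed = pd.to_datetime("25/12/2023", format="%d/%m/%Y")
print(parsed)  # 2023-12-25 00:00:00

# errors='coerce' turns unparseable entries into NaT.
coerced = pd.to_datetime(pd.Series(["2023-01-15", "not a date"]), errors="coerce")
print(coerced.isna().tolist())  # [False, True]

# Numeric input is interpreted as a count of `unit` since `origin` (Unix epoch).
from_epoch = pd.to_datetime(86400, unit="s")
print(from_epoch)  # 1970-01-02 00:00:00

# utc=True returns a timezone-aware result localized to UTC.
aware = pd.to_datetime("2023-01-01 12:00", utc=True)
```
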
- `pandas.to_timedelta(arg, unit=None, errors='raise')` convert argument to timedelta (timedeltas are absolute differences in times, expressed in different units such as days, hours, minutes, seconds)
  - `arg` str, timedelta, list like, Series - the data to be converted to timedelta
  - `unit` str - denotes the unit of time of `arg` (possible values = ['W', 'D' / 'days' / 'day', 'hours' / 'hour' / 'hr' / 'h' / 'H', 'm' / 'minute' / 'min' / 'minutes' / 'T', 's' / 'seconds' / 'sec' / 'second' / 'S', 'ms' / 'milliseconds' / 'millisecond' / 'milli' / 'millis' / 'L', 'us' / 'microseconds' / 'microsecond' / 'micro' / 'micros' / 'U', 'ns' / 'nanoseconds' / 'nano' / 'nanos' / 'nanosecond' / 'N'])
  - `errors` `{'ignore', 'raise', 'coerce'}` - how to handle invalid parsing
    - `raise` then raise an exception
    - `ignore` then return the input (deprecated since pandas 2.2)
    - `coerce` then set the invalid value as NaT
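
A small sketch of `pandas.to_timedelta` with string input, numeric input plus `unit`, and `errors='coerce'`. The values are made up.

```python
import pandas as pd

# A string describing a duration.
td = pd.to_timedelta("1 days 06:00:00")
print(td == pd.Timedelta(hours=30))  # True

# Numeric input needs a unit to be meaningful.
from_seconds = pd.to_timedelta([30, 90], unit="s")
print(from_seconds[1] == pd.Timedelta(minutes=1, seconds=30))  # True

# Unparseable entries become NaT with errors='coerce'.
cleaned = pd.to_timedelta(pd.Series(["1 day", "oops"]), errors="coerce")
print(cleaned.isna().tolist())  # [False, True]
```
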
- `pandas.date_range(start=None, end=None, periods=None, freq=None, tz=None, normalize=False, name=None, inclusive='both', *, unit=None, **kwargs)` return a fixed frequency DatetimeIndex (a range of equally spaced time points)
  - `start` str, datetime like - left bound for generating dates
  - `end` str, datetime like - right bound for generating dates
  - `periods` int - number of periods to generate
  - `freq` str, Timedelta, datetime.timedelta, DateOffset - frequency of range
  - `tz` str, tzinfo - time zone name for returning a localized DatetimeIndex
  - `normalize` bool - normalize start/end dates to midnight before generating date range
  - `name` str - name of the resulting DatetimeIndex
  - `inclusive` `{'both', 'neither', 'left', 'right'}` - specify which boundary to include or not
  - `unit` str - desired unit of the result
  - `**kwargs` for compatibility, no effect on results
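
A minimal sketch of `pandas.date_range`, showing that any three of `start`, `end`, `periods` and `freq` determine the range. The dates are arbitrary.

```python
import pandas as pd

# start + end + freq
daily = pd.date_range(start="2024-01-01", end="2024-01-05", freq="D")
print(len(daily))  # 5

# start + periods + freq (every second day)
every_other = pd.date_range(start="2024-01-01", periods=3, freq="2D")
print(every_other.day.tolist())  # [1, 3, 5]

# inclusive='left' drops the right boundary.
left_only = pd.date_range("2024-01-01", "2024-01-05", inclusive="left")
print(len(left_only))  # 4
```
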
- `pandas.bdate_range(start=None, end=None, periods=None, freq='B', tz=None, normalize=True, name=None, weekmask=None, holidays=None, inclusive='both', **kwargs)` return a fixed frequency DatetimeIndex with business day as the default.
  - `start` str, datetime like - left bound for generating dates
  - `end` str, datetime like - right bound for generating dates
  - `periods` int - number of periods to generate
  - `freq` str, Timedelta, datetime.timedelta, DateOffset - frequency of range
  - `tz` str, tzinfo - time zone name for returning a localized DatetimeIndex
  - `normalize` bool - normalize start/end dates to midnight before generating date range
  - `name` str - name of the resulting DatetimeIndex
  - `inclusive` `{'both', 'neither', 'left', 'right'}` - specify which boundary to include or not
  - `weekmask` str - weekmask of valid business days (only used if custom `freq` passed)
  - `holidays` list like - dates to exclude from the set of business days (only used if custom `freq` passed) 
  - `**kwargs` for compatibility, no effect on results
- `pandas.period_range(start=None, end=None, periods=None, freq=None, name=None)` return a fixed frequency PeriodIndex.
  - `start` str, datetime like - left bound for generating periods
  - `end` str, datetime like - right bound for generating periods
  - `periods` int - number of periods to generate
  - `freq` str, Timedelta, datetime.timedelta, DateOffset - frequency of range
  - `name` str - name of the resulting PeriodIndex
- `pandas.timedelta_range(start=None, end=None, periods=None, freq=None, name=None, closed=None, *, unit=None)` return a fixed frequency TimedeltaIndex with day as the default.
  - `start` str, timedelta like - left bound for generating timedeltas
  - `end` str, timedelta like - right bound for generating timedeltas
  - `periods` int - number of periods to generate
  - `freq` str, Timedelta, datetime.timedelta, DateOffset - frequency of range
  - `name` str - name of the resulting TimedeltaIndex
  - `closed` str - make the interval closed with respect to the given boundary ('left', 'right', or both sides when None)
  - `unit` str - desired unit of the result
- `pandas.infer_freq(index)` infer the most likely frequency given the input index.
  - `index` DatetimeIndex, TimedeltaIndex, Series, array like - if passed a Series will use the values of the Series
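
The remaining range helpers and `pandas.infer_freq` can be sketched together. The dates are arbitrary (1 January 2024 is a Monday), and only daily and monthly-period frequencies are used to stay clear of frequency aliases that changed between pandas versions.

```python
import pandas as pd

# Business days only: Friday 5th, then Monday 8th and Tuesday 9th.
business = pd.bdate_range(start="2024-01-05", end="2024-01-09")
print(business.day.tolist())  # [5, 8, 9]

# Three monthly periods starting in January 2024.
months = pd.period_range(start="2024-01", periods=3, freq="M")
print([str(p) for p in months])  # ['2024-01', '2024-02', '2024-03']

# Three daily steps starting at one day.
steps = pd.timedelta_range(start="1 day", periods=3, freq="D")
print(steps[-1] == pd.Timedelta(days=3))  # True

# infer_freq recovers the frequency of a regular index.
inferred = pd.infer_freq(pd.date_range("2024-01-01", periods=5, freq="D"))
print(inferred)  # D
```
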

### 4.5. Top Level Dealing with Interval Data

- `pandas.interval_range(start=None, end=None, periods=None, freq=None, name=None, closed='right')` return a fixed frequency IntervalIndex.
  - `start` numeric, datetime like - left bound for generating intervals
  - `end` numeric, datetime like - right bound for generating intervals
  - `periods` int - number of periods to generate
  - `freq` str, Timedelta, datetime.timedelta, DateOffset - frequency of range/length of each interval
  - `name` str - name of the resulting IntervalIndex
  - `closed` str - make the interval closed with respect to boundary (both, left, right, neither)
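
A short sketch of `pandas.interval_range` and of how `closed` changes which endpoint belongs to each interval. The bounds are arbitrary.

```python
import pandas as pd

# Intervals of length 2 covering 0..6, closed on the right by default:
# (0, 2], (2, 4], (4, 6]
bins = pd.interval_range(start=0, end=6, freq=2)
print(len(bins))     # 3
print(2 in bins[0])  # True  (right endpoint included)
print(0 in bins[0])  # False (left endpoint excluded)

# closed='left' flips the included endpoint: [0, 5), [5, 10)
left_closed = pd.interval_range(start=0, periods=2, freq=5, closed="left")
print(5 in left_closed[1])  # True
```
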

### 4.6. Top Level Evaluation

- `pandas.eval(expr, parser='pandas', engine=None, local_dict=None, global_dict=None, resolvers=(), level=0, target=None, inplace=False)` evaluate a python expression as a string using various backends.
  - `expr` str - expression to evaluate
  - `parser` `{'python', 'pandas'}` - parser to be used to construct the syntax tree
  - `engine` `{'python', 'numexpr'}` - engine used to evaluate the expression
  - `local_dict` dict - dictionary of local variables, taken from `locals()` by default
  - `global_dict` dict - dictionary of global variables, taken from `globals()` by default
  - `resolvers` list of dict - list of objects implementing `__getitem__`, used to inject an additional namespace for variable lookup
  - `level` int - number of prior stack frames to traverse and add to the current scope
  - `target` object - target object for assignment
  - `inplace` bool - if `target` is provided and the expression mutates `target`, whether to modify `target` in place; otherwise a copy of `target` with the mutation is returned
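
A minimal sketch of `pandas.eval` for a plain expression and for an assignment with `target`. The frame is hypothetical, and `local_dict` is passed explicitly here only so that the example does not depend on the calling scope.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})  # hypothetical data

# Evaluate an arithmetic expression given as a string.
total = pd.eval("df.a + df.b", local_dict={"df": df}, engine="python")
print(total.tolist())  # [11, 22, 33]

# An assignment needs a target; with inplace=False (default) a modified
# copy is returned and the original frame is left untouched.
with_c = pd.eval("c = df.a * df.b", target=df, local_dict={"df": df}, engine="python")
print(with_c["c"].tolist())  # [10, 40, 90]
print("c" in df.columns)     # False
```
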

### 4.7. DateTime Formats

- `pandas.tseries.api.guess_datetime_format(dt_str, dayfirst=False)` guess the datetime format of a given datetime string.
  - `dt_str` str - datetime string to guess the format of
  - `dayfirst` bool - if True parses dates with the day first

### 4.8. Hashing

- `pandas.util.hash_array(vals, encoding='utf8', hash_key='0123456789123456', categorize=True)` given a 1D array it returns an array of deterministic integers.
  - `vals` ndarray, ExtensionArray - the 1D array to hash
  - `encoding` str - encoding for data and key when strings
  - `hash_key` str - hash key for string key to encode
  - `categorize` bool - if True first categorize object array before hashing
- `pandas.util.hash_pandas_object(obj, index=True, encoding='utf8', hash_key='0123456789123456', categorize=True)` return a data hash of the Index/Series/DataFrame.
  - `obj` Index, Series, DataFrame 
  - `index` bool - if True includes the index in the hash(if series/DataFrame)
  - `encoding` str - encoding for data and key when strings
  - `hash_key` str - hash key for string key to encode
  - `categorize` bool - if True first categorize object array before hashing
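
A minimal sketch of the two hashing helpers on made-up data. The hash values themselves are not shown; the point is that they are unsigned 64-bit integers and repeatable for the same input.

```python
import numpy as np
import pandas as pd

# One deterministic uint64 hash per array element.
hashes = pd.util.hash_array(np.array([1, 2, 3]))
print(hashes.dtype)  # uint64

# Hashing the same values again gives the same result.
same = (hashes == pd.util.hash_array(np.array([1, 2, 3]))).all()
print(same)  # True

# One hash per row of a DataFrame; index=False leaves the index out of the hash.
frame = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})  # hypothetical data
row_hashes = pd.util.hash_pandas_object(frame, index=False)
print(len(row_hashes))  # 2
```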