# Pandas

## 1. Overview

### 1.1. Introduction

The pandas library is specifically designed for data analysis and manipulation. 

It offers a variety of features and functionalities that make it an essential tool for anyone working with data in Python. 


**Pandas offers:**

* **Data Structures:** Pandas provides two main data structures: Series (one-dimensional labeled array) and DataFrame (two-dimensional labeled data structure similar to a spreadsheet). These structures allow us to store and organize our data efficiently.
* **Data Cleaning and Manipulation:** Pandas offers extensive tools for cleaning and manipulating our data, including handling missing values, filtering, sorting, and aggregating data.
* **Data Analysis:** Pandas comes with a rich set of statistical and analytical functions, allowing us to perform various analyses on our data, such as calculating descriptive statistics, finding correlations, and creating basic plots.
* **Time Series Analysis:** Pandas has specialized functionalities for working with time series data, making it a valuable tool for financial analysis, weather forecasting, and other time-dependent applications.
* **Integration with Other Libraries:** Pandas integrates with other popular Python libraries like NumPy, Matplotlib, and Scikit-learn, allowing us to build powerful data analysis workflows.
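
The features listed above can be sketched in a few lines. This is a minimal illustration, not a recommended workflow: the `sales` table, its column names, and its values are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical sales records; one price is missing on purpose.
sales = pd.DataFrame(
    {
        "city": ["Pune", "Delhi", "Pune", "Delhi"],
        "units": [3, 5, 2, 4],
        "price": [10.0, np.nan, 12.0, 8.0],
    }
)

# Cleaning: fill the missing price with the mean of the known prices.
sales["price"] = sales["price"].fillna(sales["price"].mean())

# Manipulation: derive a column, filter rows, and sort the result.
sales["revenue"] = sales["units"] * sales["price"]
large_orders = sales[sales["units"] >= 3].sort_values("revenue", ascending=False)

# Aggregation: total revenue per city.
revenue_by_city = sales.groupby("city")["revenue"].sum()
print(revenue_by_city)
```

Each step returns a regular Series or DataFrame, so the steps can be combined or reordered freely.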


**Advantages of pandas:**

* **Powerful and Efficient:** Pandas is designed for speed and efficiency, making it suitable for datasets that fit comfortably in memory.
* **Easy to Learn:** The syntax of pandas is relatively simple and intuitive, even for beginners in Python.
* **Versatile:** Pandas can be used for a wide range of data analysis tasks, from basic cleaning and manipulation to complex statistical analysis and visualization.
* **Popular and Well-Supported:** Pandas has a large and active community, meaning we can easily find resources, tutorials, and help online.
* **Optimized Data Structures:** Pandas stores column data in NumPy arrays (and its own extension arrays) internally, providing compact storage and fast access to data elements.
* **Vectorized Operations:** Pandas applies operations to entire columns at once rather than element by element in Python, which is typically much faster than an explicit Python loop.
* **C-Level Optimizations:** Pandas utilizes C-level code for critical operations, further enhancing performance and memory efficiency compared to pure Python implementations.
* **Eager Evaluation:** Pandas evaluates each operation as soon as it is called and returns a materialized result, which gives immediate feedback during interactive analysis and exploration. It does not defer computation the way lazy frameworks such as Dask do.
* **Wide Range of Data Types:** Pandas supports various data types, including numerics, strings, categorical, datetimes, and more, allowing for flexible data manipulation and analysis.
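
To make the vectorization point concrete, the sketch below computes the same result twice: once with a Python-level loop and once with a single vectorized expression. The `prices` data and the `1.18` multiplier are arbitrary values chosen for illustration; no timings are shown because they depend on the machine and the pandas version.

```python
import numpy as np
import pandas as pd

prices = pd.Series(np.arange(1, 10_001, dtype="float64"))

# Loop version: one Python-level multiplication per element.
looped = pd.Series([p * 1.18 for p in prices])

# Vectorized version: one expression applied to the whole Series.
vectorized = prices * 1.18

# Both approaches produce the same values.
print(np.allclose(looped, vectorized))
```

Wrapping each version in `timeit` is an easy way to compare them on a given machine.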


**Disadvantages of Pandas:**

* **Not Ideal for Big Data:** Although pandas can handle large datasets, it might not be the best choice for truly massive datasets typically associated with "big data" applications. For such scenarios, libraries like Spark are designed to handle distributed processing and scale efficiently.
* **Limited Support for Unstructured Data:** Pandas primarily focuses on structured data in tabular formats. While it can handle some unstructured data manipulation, it's not ideal for complex processing of text, images, or other non-tabular data types. Libraries like spaCy or OpenCV are better suited for such tasks.
* **Learning Curve for Advanced Features:** While the core functionalities of pandas are relatively easy to learn, mastering its advanced features like `groupby` operations, custom functions, and complex data transformations can have a steeper learning curve.
* **Memory Overhead:** Pandas data structures trade memory for flexibility and ease of use. Index labels, object-dtype columns, and intermediate copies add overhead compared to raw NumPy arrays, which becomes a problem as datasets approach the size of available RAM.
* **Limited Parallelism:** Most pandas operations run on a single core, and pandas is not built for distributed computing the way Dask or Spark are. This limits performance for datasets that need parallel processing across multiple cores or machines.
* **GIL (Global Interpreter Lock):** Python's GIL prevents threads from running Python bytecode in parallel, which limits CPU-bound pandas code such as `apply` with a custom Python function. Some methods accept an optional Numba engine (`engine="numba"`) that can mitigate this for supported operations.
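
The memory overhead mentioned above can be measured with `memory_usage`. The following sketch uses made-up data to compare a column of repeated strings stored as Python objects against the same column stored as a categorical, and an `int64` column against an `int8` copy. Exact byte counts vary by platform and pandas version, so only the relative sizes matter here.

```python
import numpy as np
import pandas as pd

n = 100_000
rng = np.random.default_rng(0)

# A column with many repeated string labels (hypothetical data).
labels = pd.Series(rng.choice(["red", "green", "blue"], size=n))

# deep=True also counts the memory held by the Python string objects.
as_object = labels.astype(object).memory_usage(deep=True)
as_category = labels.astype("category").memory_usage(deep=True)

# Small integers fit in int8, which uses one byte per value instead of eight.
counts = pd.Series(np.arange(n, dtype="int64") % 100)
as_int64 = counts.memory_usage(index=False)
as_int8 = counts.astype("int8").memory_usage(index=False)

print(as_object, as_category, as_int64, as_int8)
```

Choosing narrower dtypes like this is a common first step before reaching for Dask or Spark.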
  

**Pandas vs NumPy**

* **When to use pandas:**
  * Structured data
  * Data cleaning and manipulation
  * Statistical analysis and exploration
  * Time series analysis
  * Data visualization
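* **When to use NumPy:**
  * Purely numerical operations (linear algebra, element-wise math)
  * Homogeneous arrays where row and column labels are not needed
  * N-dimensional data (more than two dimensions)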
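
A small side-by-side sketch of the distinction, using invented data: NumPy works on a homogeneous array addressed by position, while pandas holds mixed column types addressed by label. `to_numpy` moves the numeric columns from one to the other.

```python
import numpy as np
import pandas as pd

# NumPy: a homogeneous numeric matrix, accessed by position.
matrix = np.array([[1.0, 2.0], [3.0, 4.0]])
column_means = matrix.mean(axis=0)

# pandas: columns of different types, accessed by label.
people = pd.DataFrame(
    {"name": ["Ana", "Ben"], "age": [31, 45], "height_m": [1.65, 1.80]}
).set_index("name")
bens_age = people.loc["Ben", "age"]

# Hand the numeric columns to NumPy when only the numbers matter.
numeric_block = people[["age", "height_m"]].to_numpy()
print(column_means, bens_age, numeric_block.shape)
```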
  

[Docs Reference](https://pandas.pydata.org/docs/reference/index.html)

### 1.2. History

**Early Days (2008-2010):**

* **Conception:** Wes McKinney, frustrated by the lack of efficient tools for data analysis in Python, began developing Pandas in 2008.
* **Inspiration:** Drawing inspiration from R's DataFrames and NumPy's arrays, McKinney aimed to create a library that combined the strengths of both.
* **Initial Release:** Pandas was open-sourced at the end of 2009. The early public versions included basic DataFrame and Series functionality, focused primarily on financial analysis.


**Growth and Adoption (2011-2015):**

* **Rapid Development:** Pandas gained significant traction due to its intuitive interface, efficient data structures, and growing feature set.
* **Community Contributions:** An active community of developers began contributing features, bug fixes, and documentation, accelerating Pandas' development.
* **Integration with the SciPy Ecosystem:** Pandas became a core component of the scientific Python ecosystem alongside NumPy, SciPy, Matplotlib, and later Seaborn, solidifying its position as a key tool for data analysis in Python.
* **Expansion of Time Series Features:** Time series functionality was substantially expanded during this period to support financial and scientific analysis.

### 1.3. Architecture of Pandas

**A. Core Data Structures:**

* **Series:** One-dimensional labeled array, like a column from a spreadsheet. It holds data of any type (numeric, string, etc.) and is indexed by labels.
* **DataFrame:** Two-dimensional labeled data structure, like a spreadsheet. It consists of columns (Series) with different data types and is indexed by rows and columns.
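* **Panel (removed):** A three-dimensional analogue of the DataFrame. It was rarely used, was deprecated in pandas 0.20.0, and was removed in 0.25.0. A DataFrame with a MultiIndex, or the xarray library, covers the same use cases.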


**B. Internal Building Blocks:**

* **BlockManager:** The heart of Pandas, responsible for managing the physical memory layout of data in Series and DataFrames. It uses NumPy arrays internally for efficient storage and retrieval of data.
* **Index:** Represents the labels for rows and columns in Series and DataFrames. It can be numeric, categorical, or custom objects.
* **DataType:** Defines the type of data stored in a Series or DataFrame column (e.g., integer, string, datetime).


**C. Key Architectural Aspects:**

* **Vectorized Operations:** Pandas leverages vectorized operations, performing calculations on entire arrays at once instead of individual elements, leading to significant performance gains.
* **Mutable vs. Immutable:** Series and DataFrames are value-mutable (their contents can be changed in place), whereas Index objects are immutable, which makes labels safe to share between objects.
* **Eager Evaluation:** Pandas evaluates each operation as soon as it is called and returns a materialized result. It does not defer computation the way lazy frameworks such as Dask do.
* **Integration with NumPy:** Pandas builds upon NumPy arrays for efficient data storage and manipulation, offering a seamless experience for numerical operations.
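
Some of these architectural aspects can be observed from the public API. The sketch below uses a throwaway frame to show per-column dtypes, in-place mutation of values, the immutability of an Index, and the fact that an arithmetic result is available as soon as the expression runs. The BlockManager itself is internal and is deliberately not touched here.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [0.5, 1.5, 2.5]}, index=["x", "y", "z"])

# Each column carries its own dtype.
print(df.dtypes)

# Values are mutable: label-based assignment changes the frame in place.
df.loc["x", "a"] = 10

# Index objects are immutable: item assignment raises TypeError.
try:
    df.index[0] = "w"
    index_is_immutable = False
except TypeError:
    index_is_immutable = True

# The sum is computed right here and returned as a new Series.
total = df["a"] + df["b"]
print(index_is_immutable, total.tolist())
```

To rename labels, assign a whole new index (for example with `df.rename` or `df.set_axis`) rather than editing one in place.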

### 1.4. Objects in Pandas

**A. Series:**

* Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). 
* The axis labels are collectively referred to as the index.
* Imagine it like a single column from a spreadsheet with labels for each element.
* Used for storing and manipulating sequences of data.
* A Series has a single dtype, which is inferred from the data or set explicitly at creation. Most operations that add or remove elements return a new Series rather than resizing the original.
* Series supports vectorized operations.
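
A short sketch of these Series properties, using invented temperature readings: label-based lookup, a single dtype, vectorized arithmetic, and automatic alignment on index labels when two Series are combined.

```python
import pandas as pd

temps = pd.Series([21.5, 23.0, 19.5], index=["mon", "tue", "wed"], name="temp_c")

# Label-based lookup and the single dtype shared by all values.
tuesday = temps["tue"]
print(tuesday, temps.dtype)

# Vectorized arithmetic: convert every reading to Fahrenheit at once.
temps_f = temps * 9 / 5 + 32

# Arithmetic aligns on labels, not positions; "tue" has no partner,
# so the result for that label is NaN.
adjust = pd.Series([1.0, -1.0], index=["wed", "mon"])
adjusted = temps + adjust
print(adjusted)
```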


**B. DataFrame:**

* DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. 
* We can think of it like a spreadsheet or SQL table, or a dict of Series objects.
* The primary workhorse of Pandas, used for storing, manipulating, and analyzing tabular data.
* DataFrame accepts many different kinds of input:
  * Dict of 1D ndarrays, lists, dicts, or Series
  * 2-D numpy.ndarray
  * Structured or record ndarray
  * A Series
  * Another DataFrame
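
The input kinds listed above can each be passed straight to the `DataFrame` constructor. The sketch below builds the same small table several ways; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Dict of lists and dict of Series: keys become column names.
from_dict_of_lists = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
from_dict_of_series = pd.DataFrame({"a": pd.Series([1, 2]), "b": pd.Series([3, 4])})

# 2-D ndarray: column names must be supplied separately.
from_2d_array = pd.DataFrame(
    np.array([[1, 3], [2, 4]], dtype="int64"), columns=["a", "b"]
)

# Structured ndarray: field names become column names.
records = np.array([(1, 3), (2, 4)], dtype=[("a", "i8"), ("b", "i8")])
from_structured = pd.DataFrame(records)

# A named Series becomes a one-column frame; a DataFrame can wrap another.
from_series = pd.DataFrame(pd.Series([1, 2], name="a"))
from_frame = pd.DataFrame(from_dict_of_lists)

print(from_structured)
```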


**C. Index:**

* Represents the labels for rows and columns in Series and DataFrames.
* Labels can be integers, strings, datetimes, categoricals, or other hashable objects.
* Identifies rows and columns for data retrieval, selection, and alignment. Labels are usually unique, but duplicates are allowed.
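
A brief sketch of how an Index behaves, with made-up labels: it supports set-like operations, can hold datetimes for label-based selection, and reports whether its labels are unique.

```python
import pandas as pd

labels = pd.Index(["a", "b", "c"])
other = pd.Index(["b", "c", "d"])

# Set-like operations return new Index objects.
common = labels.intersection(other)
combined = labels.union(other)

# A DatetimeIndex lets us select rows by timestamp.
dates = pd.date_range("2024-01-01", periods=3, freq="D")
readings = pd.Series([10, 20, 30], index=dates)
second_day = readings.loc[pd.Timestamp("2024-01-02")]

# Duplicate labels are allowed; is_unique reports whether any exist.
has_duplicates = not pd.Index(["a", "a", "b"]).is_unique
print(list(common), list(combined), second_day, has_duplicates)
```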


**D. Data Type:**

* Defines the type of data stored in a Series or DataFrame column (e.g., integer, string, datetime).
* Determines how data is stored and manipulated internally, impacting performance and operations.
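
The effect of a column's dtype on its behavior can be seen by converting text columns to more specific types. The `raw` table below is invented for the example; `astype` and `pd.to_datetime` are the conversion tools used.

```python
import pandas as pd

# Everything starts out as text, as it might after reading a loosely typed file.
raw = pd.DataFrame(
    {
        "id": ["1", "2", "3"],
        "when": ["2024-01-01", "2024-01-02", "2024-01-03"],
        "grade": ["a", "b", "a"],
    }
)

typed = raw.assign(
    id=raw["id"].astype("int64"),
    when=pd.to_datetime(raw["when"]),
    grade=raw["grade"].astype("category"),
)

# The same "+" means concatenation for text and addition for integers.
text_sum = (raw["id"] + raw["id"]).tolist()
int_sum = (typed["id"] + typed["id"]).tolist()

# Datetime columns expose date parts through the .dt accessor.
days = typed["when"].dt.day.tolist()
print(typed.dtypes)
```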

**E. BlockManager:**

* (Internal) The core of Pandas, responsible for managing the physical memory layout of data in Series and DataFrames.
* Uses NumPy arrays internally for efficient storage and retrieval of data.
* We don't directly interact with this object, but it's crucial for Pandas functionality.


**F. Panel (deprecated):**

* A three-dimensional analogue of the DataFrame that was rarely used. It was deprecated in pandas 0.20.0 and removed in 0.25.0.
* It was considered less intuitive and efficient for most data analysis tasks than a DataFrame with a MultiIndex.

## 2. Input/Output Functions in Pandas 

Pandas provides top-level `read_*` functions for loading data and matching `to_*` methods on its objects for writing data back out.

### 2.1. Pickling

- `read_pickle(filepath_or_buffer[, ...])` Load pickled pandas object (or any object) from file.
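
A minimal round trip, assuming a temporary directory and a hypothetical file name `frame.pkl`: `DataFrame.to_pickle` writes the object and `read_pickle` loads it back with dtypes and index intact.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "frame.pkl")  # hypothetical file name
    df.to_pickle(path)                     # write the frame to disk
    restored = pd.read_pickle(path)        # load it back

print(restored.equals(df))
```

Pickle files can execute arbitrary code when loaded, so `read_pickle` should only be used on files from a trusted source.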