# Pandas: Data Handling and Data Engineering with Python

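Origin of the name:

The name "pandas" comes from "panel data", an econometrics term for multidimensional structured datasets, and is also a play on "Python data analysis". A common mnemonic maps the letters onto the library's data structures:

- PAN: Panel data (3D data). The `Panel` class was removed in pandas 0.25.
- DA: DataFrame (2D data)
- S: Series (1D data)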
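The dimensionality of the two structures still in use can be checked directly. The following is a minimal sketch; the values and the labels `a`, `b`, `c`, `x`, `y` are made up for illustration.

```python
import pandas as pd

# A Series is one-dimensional: values plus an index of labels.
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

# A DataFrame is two-dimensional: rows and named columns.
df = pd.DataFrame({"x": [1, 2], "y": [3, 4]})

print(s.ndim, s.shape)    # 1 (3,)
print(df.ndim, df.shape)  # 2 (2, 2)

# Selecting a single column of a DataFrame gives back a Series.
col = df["x"]
print(type(col).__name__)  # Series
```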

![image.png](attachment:image.png)

Pandas is a powerful open-source data analysis and manipulation library for Python. It provides data structures like **Series** (one-dimensional) and **DataFrame** (two-dimensional), making it easy to work with structured data, such as tables. Here’s a quick overview of its key features:

**Key Features:**
1. **DataFrame & Series**: Two main data structures to store and manipulate tabular data.
   - **Series**: A one-dimensional labeled array, similar to a list or a single table column, but with an index.
   - **DataFrame**: A two-dimensional table with labeled axes (rows and columns).
   
2. **Data Cleaning**: Tools for handling missing data, replacing values, and dropping rows/columns.

3. **Data Manipulation**:
   - **Filtering**: Select rows and columns based on conditions.
   - **Aggregation**: Grouping and summarizing data using operations like mean, sum, count, etc.
   - **Merging/Joining**: Combining multiple DataFrames.

4. **Handling CSV and other File Formats**: Easily read and write data to/from CSV, Excel, SQL, and other file formats.

5. **Time Series**: Built-in support for date ranges, date-based indexing, resampling, and rolling-window calculations.

Pandas is a core tool for data analysis in Python and is often used alongside libraries like **NumPy** and **Matplotlib**.
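The features listed above can be sketched in one short script. This is an illustrative example only: the employee and department data, the column names, and the 55,000 threshold are all hypothetical, and an in-memory buffer stands in for a real CSV file.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical employee data; one salary is missing on purpose.
employees = pd.DataFrame({
    "emp_id": [1, 2, 3, 4],
    "name": ["Asha", "Ben", "Chen", "Dina"],
    "dept_id": [10, 10, 20, 20],
    "salary": [50000.0, np.nan, 70000.0, 60000.0],
})
departments = pd.DataFrame({"dept_id": [10, 20], "dept": ["Sales", "IT"]})

# Data cleaning: fill the missing salary with the column mean.
employees["salary"] = employees["salary"].fillna(employees["salary"].mean())

# Filtering: keep rows where salary is above 55,000.
high_paid = employees[employees["salary"] > 55000]

# Merging: attach department names via the shared dept_id column.
merged = employees.merge(departments, on="dept_id", how="left")

# Aggregation: mean salary per department.
mean_by_dept = merged.groupby("dept")["salary"].mean()

# File formats: round-trip through CSV using an in-memory buffer.
buffer = io.StringIO()
merged.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# Time series: resample 14 daily values into weekly totals.
daily = pd.Series(
    range(14),
    index=pd.date_range("2024-01-01", periods=14, freq="D"),
)
weekly = daily.resample("W").sum()

print(mean_by_dept)
print(weekly)
```

Each step returns a new object, so intermediate results such as `high_paid` or `merged` can be inspected on their own before moving on.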

# Comparison of **Pandas**, **PySpark**, and **Polars**

Here’s a comparison of **Pandas**, **PySpark**, and **Polars** in a table format to highlight their key features, performance, and use cases:

| Feature/Aspect       | **Pandas**                         | **PySpark**                          | **Polars**                           |
|----------------------|------------------------------------|--------------------------------------|--------------------------------------|
| **Language**         | Python                             | Python API for Apache Spark          | Python (Rust backend)                |
| **Performance**      | Single-threaded, slower on large data | Distributed, highly scalable        | Multi-threaded, faster than Pandas   |
| **Execution Model**  | Eager (immediate execution)        | Lazy (execution triggered by actions)| Eager by default; optional lazy API with query optimization |
| **Data Size**        | Limited by single machine’s memory | Handles big data across clusters     | Can process some larger-than-memory datasets on a single machine via its streaming mode |
| **Concurrency**      | Single-threaded                   | Multi-threaded, distributed across clusters | Multi-threaded on a single machine   |
| **Data Structures**  | Series and DataFrame               | RDDs and DataFrames                 | Series, DataFrame, and LazyFrame     |
| **Setup Complexity** | Simple, requires only Python       | Requires cluster setup (for distributed use) | Simple, requires only Python         |
| **Data Format Support** | CSV, Excel, SQL, JSON, HDF5, Parquet | CSV, JSON, Parquet, ORC, SQL        | CSV, JSON, Parquet, Arrow            |
| **Memory Usage**     | Higher memory usage (in-memory)    | Distributed memory across nodes     | Lower memory usage (in-memory, columnar format) |
| **Use Case**         | Small to medium datasets           | Large-scale, distributed big data   | Large datasets, high performance on a single machine |
| **Parallelism**      | No inherent parallelism            | Parallelism across cluster nodes    | Parallelism within a single machine  |
| **Learning Curve**   | Easy for Python users              | Moderate, requires knowledge of Spark | Easy, similar to Pandas with some differences |
| **Key Libraries**    | NumPy, Matplotlib, Scikit-learn    | Spark MLlib, Spark SQL, Spark Streaming | Arrow (for efficient columnar data handling) |
| **Real-time Processing** | No                             | Yes (with Spark Streaming)           | No                                   |
| **Fault Tolerance**  | No                                | Yes (RDDs are resilient to failures) | No                                   |
| **Typical Use Cases** | Data analysis, small-scale analytics | Big data processing, real-time analytics | Large-scale data processing on a single machine |
| **Pros**             | Simple, widely used, lots of community support | Handles big data, highly scalable   | Fast, memory-efficient, multi-threaded |
| **Cons**             | Limited to single-machine memory, slow on large datasets | Requires cluster setup, overhead for small datasets | Newer, smaller community, fewer features than Pandas |

***Summary:***
- **Pandas**: Best for small to medium datasets, with a simple API for single-machine data analysis.
- **PySpark**: Designed for large-scale, distributed data processing across multiple nodes, which makes it well suited to massive datasets and streaming workloads.
- **Polars**: A high-performance alternative to Pandas for large datasets on a single machine, offering fast processing and lower memory usage.

Each library has its strengths depending on the size and nature of the data, as well as the required computational power.
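Because Pandas keeps the whole DataFrame in memory, it can help to measure a DataFrame's footprint before deciding whether a dataset is still "small to medium". The sketch below uses randomly generated, hypothetical data (the `city` and `sales` columns are assumptions) and shows one common way to reduce memory within Pandas itself; exact byte counts depend on the Pandas version.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: 100,000 rows with a repetitive string column.
n = 100_000
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "city": rng.choice(["Delhi", "Mumbai", "Pune"], size=n),
    "sales": np.arange(n, dtype="int64"),
})

# memory_usage(deep=True) reports bytes per column, including string contents.
before = df.memory_usage(deep=True).sum()

# A low-cardinality string column stored as 'category' keeps each distinct
# value once, plus small integer codes per row.
df["city"] = df["city"].astype("category")
after = df.memory_usage(deep=True).sum()

print(f"before: {before / 1e6:.1f} MB, after: {after / 1e6:.1f} MB")
```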

# Pandas vs PySpark Code Syntax

Here’s a side-by-side comparison of **Pandas** and **PySpark** code syntax for common data manipulation tasks like creating a DataFrame, filtering, grouping, and more:

| Task                         | **Pandas Syntax**                                | **PySpark Syntax**                               |
|------------------------------|-------------------------------------------------|--------------------------------------------------|
| **Import Library**            | `import pandas as pd`                           | `from pyspark.sql import SparkSession`<br>`from pyspark.sql import functions as F` |
| **Create a DataFrame**        | `df = pd.DataFrame(data)`                       | `spark = SparkSession.builder.appName("example").getOrCreate()`<br>`df = spark.createDataFrame(data, schema)` |
| **Read CSV File**             | `df = pd.read_csv("file.csv")`                  | `df = spark.read.csv("file.csv", header=True, inferSchema=True)` |
| **Show First 5 Rows**         | `print(df.head())`                              | `df.show(5)`                                     |
| **Filter Rows**               | `df[df['age'] > 30]`                            | `df.filter(df.age > 30)`                         |
| **Select Specific Columns**   | `df[['name', 'age']]`                           | `df.select('name', 'age')`                       |
| **Group By and Aggregate**    | `df.groupby('gender')['salary'].mean()`         | `df.groupBy('gender').agg(F.mean('salary'))`     |
| **Add a New Column**          | `df['new_col'] = df['age'] + 10`                | `df = df.withColumn('new_col', df['age'] + 10)`  |
| **Drop a Column**             | `df = df.drop('col_name', axis=1)`              | `df = df.drop('col_name')`                       |
| **Rename Columns**            | `df = df.rename(columns={'old_name': 'new_name'})` | `df = df.withColumnRenamed('old_name', 'new_name')` |
| **Join Two DataFrames**       | `df1.merge(df2, on='id', how='inner')`          | `df1.join(df2, on='id', how='inner')`            |
| **Sort Values**               | `df.sort_values('age', ascending=False)`        | `df.orderBy('age', ascending=False)`             |
| **Handle Missing Data**       | `df.fillna(0)`                                  | `df.fillna(0)`                                   |
| **Drop Duplicates**           | `df.drop_duplicates()`                          | `df.dropDuplicates()`                            |
| **Basic Statistics**          | `df.describe()`                                 | `df.describe().show()`                           |
| **Apply Function to Column**  | `df['new_age'] = df['age'].apply(lambda x: x + 10)` | `df = df.withColumn('new_age', F.expr('age + 10'))` |
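Several rows of the table can be chained into one small pipeline. The sketch below runs the Pandas version on hypothetical sample data; the PySpark equivalents appear only as comments and are not executed here, since they assume an existing `spark` session and the `F` import from the table.

```python
import pandas as pd

# Hypothetical sample data with one duplicated row.
data = {
    "name": ["Asha", "Ben", "Chen", "Dina", "Dina"],
    "gender": ["F", "M", "M", "F", "F"],
    "age": [34, 28, 45, 31, 31],
    "salary": [60000, 50000, 80000, 70000, 70000],
}
df = pd.DataFrame(data)
# PySpark: df = spark.createDataFrame(pd.DataFrame(data))

# Drop exact duplicate rows.
df = df.drop_duplicates()
# PySpark: df = df.dropDuplicates()

# Filter rows.
over_30 = df[df["age"] > 30]
# PySpark: over_30 = df.filter(df.age > 30)

# Add a new column.
over_30 = over_30.assign(age_plus_10=over_30["age"] + 10)
# PySpark: over_30 = over_30.withColumn("age_plus_10", over_30["age"] + 10)

# Group by and aggregate.
mean_salary = over_30.groupby("gender")["salary"].mean()
# PySpark: mean_salary = over_30.groupBy("gender").agg(F.mean("salary"))

# Sort and select specific columns.
result = over_30.sort_values("age", ascending=False)[["name", "age"]]
# PySpark: result = over_30.orderBy("age", ascending=False).select("name", "age")

print(mean_salary)
print(result)
```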

### Summary of Key Differences:
1. **Pandas** operates on data stored in-memory and is great for small to medium datasets.
2. **PySpark** is designed for distributed computing, ideal for large datasets spread across a cluster of machines.
3. **DataFrame Creation**: Pandas is simpler for local usage, while PySpark requires creating a `SparkSession`.
4. **DataFrame Operations**: Syntax is similar for many operations, but PySpark typically relies on the `pyspark.sql.functions` module (conventionally imported as `F`) for aggregations, expressions, and transformations.
5. **Execution**: Pandas executes each operation immediately (eager execution), while PySpark only builds a query plan until an action such as `show()`, `collect()`, `count()`, or a write (for example `df.write.save(...)`) triggers it (lazy execution).

If you’re working with large datasets or distributed systems, PySpark is the better option. For local analysis on small datasets, Pandas is more convenient.