## Pandas: Theoretical Overview

Pandas is an open-source Python library for data manipulation, analysis, and cleaning. It provides the data structures and functions needed to work with structured (tabular, multidimensional, heterogeneous) and time-series data.

#### Key Features
| Feature                   | Description                                                           |
| ------------------------- | --------------------------------------------------------------------- |
| **Data Structures**       | `Series` (1D) and `DataFrame` (2D)                                    |
| **Handling Missing Data** | Built-in support using `NaN`, `fillna()`, `dropna()`                  |
| **Flexible Indexing**     | Label-based (`.loc`) and position-based (`.iloc`) indexing            |
| **Data Alignment**        | Automatic data alignment in operations                                |
| **File I/O**              | Read/write support for formats such as CSV, Excel, JSON, SQL, Parquet |
| **GroupBy Operations**    | Powerful grouping and aggregation using `groupby()`                   |
| **Time Series Support**   | Tools for resampling, time shifting, and date parsing                 |
| **Vectorized Operations** | Efficient operations across entire datasets (NumPy-based)             |
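
Two of the features above, automatic alignment and missing-data handling, are easiest to see together. The following is a minimal sketch; the Series names and values are made up for illustration.

```python
import pandas as pd

# Two Series with partially overlapping labels (illustrative data).
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# Addition aligns on index labels, not on position.
# Labels present on only one side ('a' and 'd') produce NaN.
total = s1 + s2

# fillna() replaces the missing results; dropna() discards them instead.
filled = total.fillna(0)
dropped = total.dropna()

print(total)
```

Because alignment is label-based, the order of the values in each Series does not affect the result.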


#### Core Data Structures

## 1. Series - One-dimensional labeled array.
## Can hold data of any type (integers, strings, floats, Python objects).

import pandas as pd
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

## 2. DataFrame - Two-dimensional labeled data structure with columns of potentially different types.
## Think of it as an in-memory SQL table or an Excel spreadsheet.

df = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Age': [25, 30]
})
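
As a small follow-up sketch, the same two objects can be inspected as below. The variable names `value_b`, `n_rows`, `n_cols`, and `ages` are illustrative only.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
df = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Age': [25, 30]
})

# Label-based lookup on the Series.
value_b = s['b']

# A DataFrame reports its size as (rows, columns).
n_rows, n_cols = df.shape

# Selecting a single column of a DataFrame yields a Series.
ages = df['Age']

# Without an explicit index, rows are labeled 0, 1, 2, ...
print(list(df.index))
```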

#### Core Functionalities
| Operation         | Method                                     | Example                       |
| ----------------- | ------------------------------------------ | ----------------------------- |
| **Read data**     | `read_csv`, `read_excel`, `read_json`      | `pd.read_csv('file.csv')`     |
| **View data**     | `head()`, `tail()`, `info()`, `describe()` | `df.head()`                   |
| **Selection**     | `df['col']`, `df.loc[]`, `df.iloc[]`       | `df.loc[0, 'Age']`            |
| **Filtering**     | Conditional indexing                       | `df[df['Age'] > 25]`          |
| **Add column**    | Assignment                                 | `df['Salary'] = [5000, 6000]` |
| **Drop column**   | `drop()`                                   | `df.drop('Age', axis=1)`      |
| **Aggregation**   | `mean()`, `sum()`, `groupby()`             | `df.groupby('Dept')['Salary'].mean()` |
| **Sort**          | `sort_values()`                            | `df.sort_values('Age')`       |
| **Null handling** | `isnull()`, `fillna()`, `dropna()`         | `df.fillna(0)`                |
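
The sketch below strings several of these operations together on the small `Name`/`Age` frame. One point worth noting is that `drop()` and `sort_values()` return new DataFrames by default and leave the original unchanged. The result variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

# Add a column by assignment (modifies df in place).
df['Salary'] = [5000, 6000]

# Filter rows with a boolean condition.
older = df[df['Age'] > 25]

# drop() returns a new DataFrame; df itself still has the 'Age' column.
without_age = df.drop('Age', axis=1)

# sort_values() also returns a new, sorted DataFrame.
by_age_desc = df.sort_values('Age', ascending=False)

print(without_age.columns.tolist())
```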


#### Use Cases in AI / ML Pipelines
| Phase                               | Role of Pandas                                                |
| ----------------------------------- | ------------------------------------------------------------- |
| **Data Ingestion**                  | Load structured data from various sources                     |
| **Data Cleaning**                   | Handle missing values, remove duplicates                      |
| **Feature Engineering**             | Create new features, encode categorical data                  |
| **Exploratory Data Analysis (EDA)** | Descriptive statistics, visualization with seaborn/matplotlib |
| **Preprocessing**                   | Normalize/standardize, binning, transformation                |
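
To make the cleaning, feature-engineering, and preprocessing phases concrete, here is a minimal sketch on invented data. The column names, bin edges, and labels are assumptions chosen for illustration, not a recommended pipeline.

```python
import pandas as pd

# Illustrative raw data with one duplicate row and one missing salary.
raw = pd.DataFrame({
    'Age': [22, 35, 35, 58],
    'Dept': ['HR', 'IT', 'IT', 'HR'],
    'Salary': [3000.0, 5000.0, 5000.0, None],
})

# Data cleaning: remove duplicate rows, then fill the missing salary with the mean.
clean = raw.drop_duplicates().copy()
clean['Salary'] = clean['Salary'].fillna(clean['Salary'].mean())

# Preprocessing: bin ages into labeled groups.
clean['AgeGroup'] = pd.cut(
    clean['Age'], bins=[0, 30, 50, 100], labels=['young', 'mid', 'senior']
)

# Preprocessing: standardize salary to zero mean and unit standard deviation.
clean['Salary_z'] = (clean['Salary'] - clean['Salary'].mean()) / clean['Salary'].std()

# Feature engineering: one-hot encode the categorical 'Dept' column.
features = pd.get_dummies(clean, columns=['Dept'])

print(features.columns.tolist())
```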


#### Interoperability
Pandas ↔ NumPy: Pandas is built on NumPy; `df.to_numpy()` returns a NumPy array, and a DataFrame can be constructed from one.

Pandas ↔ Scikit-learn: Pass DataFrames directly for ML model inputs.

Pandas ↔ Matplotlib/Seaborn: Visualization libraries accept DataFrames.
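
The NumPy round trip can be sketched as follows. It uses only pandas and NumPy, so the Scikit-learn and plotting cases are not shown. The frame contents are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Age': [25, 30], 'Salary': [5000, 6000]})

# DataFrame -> NumPy array of shape (rows, columns).
arr = df.to_numpy()

# NumPy array -> DataFrame, reusing the original column labels.
doubled = pd.DataFrame(arr * 2, columns=df.columns)

# NumPy functions also accept a Series directly.
col_mean = np.mean(df['Age'])

print(arr.shape, col_mean)
```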

#### Installation
To install the latest release:

pip install pandas

To pin a specific version instead (1.5.3 here is only an example):

pip install pandas==1.5.3


### Pandas CSV / Excel I/O & DataFrame Operations — Syntax & Explanation Table


#### 1. File Reading (Input)
| **Function**               | **Syntax**                                         | **Description**                  |
| -------------------------- | -------------------------------------------------- | -------------------------------- |
| Read CSV                   | `pd.read_csv('file.csv')`                          | Load a CSV file into a DataFrame |
| Read CSV with delimiter    | `pd.read_csv('file.csv', sep=';')`                 | Specify custom delimiter         |
| Read Excel                 | `pd.read_excel('file.xlsx')`                       | Load Excel file                  |
| Read specific sheet        | `pd.read_excel('file.xlsx', sheet_name='Sheet1')`  | Read specific Excel sheet        |
| Read with column names     | `pd.read_csv('file.csv', names=['A', 'B'])`        | Name columns of headerless file  |
| Read only selected columns | `pd.read_csv('file.csv', usecols=['Name', 'Age'])` | Read specific columns            |
| Read with index column     | `pd.read_csv('file.csv', index_col='ID')`          | Set column as index              |
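
The reading options can be tried without any file on disk, because `read_csv` also accepts file-like objects. The sketch below feeds it in-memory text through `io.StringIO`; the CSV contents are invented for illustration.

```python
import io
import pandas as pd

# Semicolon-delimited text standing in for a file on disk.
csv_text = "ID;Name;Age;City\n1;Alice;25;Paris\n2;Bob;30;Oslo\n"

# Custom delimiter, a subset of columns, and 'ID' used as the index.
df = pd.read_csv(
    io.StringIO(csv_text),
    sep=';',
    usecols=['ID', 'Name', 'Age'],
    index_col='ID',
)

# Data without a header row: names= supplies the column labels.
headerless = "1,Alice\n2,Bob\n"
named = pd.read_csv(io.StringIO(headerless), names=['ID', 'Name'])

print(df)
```

When `names=` is passed to a file that does have a header row, that row is read as data unless `header=0` is also given.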


#### 2. File Writing (Output)
| **Function**                   | **Syntax**                                       | **Description**               |
| ------------------------------ | ------------------------------------------------ | ----------------------------- |
| Write to CSV                   | `df.to_csv('output.csv')`                        | Export to CSV including index |
| Write to CSV (no index)        | `df.to_csv('output.csv', index=False)`           | Exclude index in export       |
| Write to Excel                 | `df.to_excel('output.xlsx')`                     | Export DataFrame to Excel     |
| Write Excel specific sheet     | `df.to_excel('output.xlsx', sheet_name='Sales')` | Custom sheet name             |
| Write with null as custom text | `df.to_csv('out.csv', na_rep='N/A')`             | Replace NaN with text         |
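
A quick way to see the effect of `index=False` and `na_rep` is to call `to_csv()` without a path, in which case it returns the CSV text as a string. This is a minimal sketch with invented data; the Excel writers are not shown because they need an extra engine package installed.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Salary': [5000.0, None]})

# Default: the index is written as an unnamed first column.
with_index = df.to_csv()

# No index column, and missing values written as the text 'N/A'.
no_index = df.to_csv(index=False, na_rep='N/A')

print(no_index)
```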


#### 3. Viewing & Inspecting Data
| **Function**        | **Syntax**      | **Description**          |
| ------------------- | --------------- | ------------------------ |
| View top rows       | `df.head()`     | First 5 rows             |
| View bottom rows    | `df.tail(10)`   | Last 10 rows             |
| Data dimensions     | `df.shape`      | Tuple of (rows, columns) |
| Column names        | `df.columns`    | All column headers       |
| Data types & nulls  | `df.info()`     | Summary of structure     |
| Statistical summary | `df.describe()` | Mean, std, min, max etc. |
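
The inspection calls can be sketched on a small frame with invented values. Note that `head()` and `tail()` default to 5 rows when no argument is given.

```python
import pandas as pd

df = pd.DataFrame({
    'Age': [25, 30, 35, 40],
    'Salary': [5000, 6000, 7000, 8000],
})

shape = df.shape               # (rows, columns)
columns = list(df.columns)     # column labels as a plain list
first_two = df.head(2)         # first 2 rows
summary = df.describe()        # count, mean, std, min, quartiles, max per numeric column

df.info()                      # prints dtypes and non-null counts
print(summary.loc['mean'])
```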


#### 4. Selection & Filtering
| **Task**                | **Syntax**                                    | **Description**      |
| ----------------------- | --------------------------------------------- | -------------------- |
| Select column           | `df['Age']`                                   | Get column as Series |
| Select multiple columns | `df[['Age', 'Salary']]`                       | Return as DataFrame  |
| Select row by label     | `df.loc[0]`                                   | By index label       |
| Select row by position  | `df.iloc[0]`                                  | By integer position  |
| Conditional filter      | `df[df['Age'] > 30]`                          | Filter rows          |
| Compound condition      | `df[(df['Age']>30) & (df['Gender']=='Male')]` | Combine filters      |
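
The difference between `.loc` and `.iloc` only becomes visible when the index is not the default 0, 1, 2, and so on. The sketch below uses an invented frame whose index labels are 10, 20, 30.

```python
import pandas as pd

df = pd.DataFrame(
    {'Age': [25, 35, 45], 'Gender': ['Female', 'Male', 'Male']},
    index=[10, 20, 30],
)

# .loc selects by index label: the row labeled 10.
by_label = df.loc[10]

# .iloc selects by integer position: the first row, whatever its label.
by_position = df.iloc[0]

# Each condition needs its own parentheses, because & binds more
# tightly than the comparison operators.
both = df[(df['Age'] > 30) & (df['Gender'] == 'Male')]

print(both)
```

Here `df.loc[0]` would raise a `KeyError`, since no row carries the label 0.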


#### 5. Aggregation & Grouping
| **Operation**        | **Syntax**                               | **Description**  |
| -------------------- | ---------------------------------------- | ---------------- |
| Column mean          | `df['Salary'].mean()`                    | Average salary   |
| Grouping             | `df.groupby('Dept')['Salary'].sum()`     | Sum by group     |
| Multiple aggregation | `df.agg({'Age':'mean', 'Salary':'max'})` | Multiple metrics |
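
A minimal sketch of these aggregations on invented data follows. The last line combines `groupby()` with a per-column `agg()` dictionary, which is an extension of the table rather than something it lists.

```python
import pandas as pd

df = pd.DataFrame({
    'Dept': ['HR', 'HR', 'IT'],
    'Age': [30, 40, 50],
    'Salary': [4000, 6000, 9000],
})

# Single-column aggregate.
mean_salary = df['Salary'].mean()

# One aggregate per group: a Series indexed by 'Dept'.
by_dept = df.groupby('Dept')['Salary'].sum()

# Different aggregate per column over the whole frame: a Series.
overall = df.agg({'Age': 'mean', 'Salary': 'max'})

# The same dictionary applied per group: a DataFrame indexed by 'Dept'.
per_dept = df.groupby('Dept').agg({'Age': 'mean', 'Salary': 'max'})

print(per_dept)
```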


#### 6. Modifying Data
| **Task**       | **Syntax**                                   | **Description**       |
| -------------- | -------------------------------------------- | --------------------- |
| Add column     | `df['Bonus'] = df['Salary'] * 0.10`          | Derived column        |
| Update values  | `df.loc[df['Dept']=='HR', 'Salary'] += 1000` | Conditional update    |
| Rename columns | `df.rename(columns={'old':'new'})`           | Rename headers        |
| Replace values | `df.replace('Sales', 'Marketing')`           | Replace string/values |
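
The sketch below runs the four modifications on invented data. Column assignment and the `.loc` update change `df` in place, whereas `rename()` and `replace()` return new DataFrames unless the result is assigned back.

```python
import pandas as pd

df = pd.DataFrame({'Dept': ['HR', 'Sales'], 'Salary': [4000, 5000]})

# Derived column, computed from the current salaries.
df['Bonus'] = df['Salary'] * 0.10

# Conditional update: only rows where Dept is 'HR' are changed.
df.loc[df['Dept'] == 'HR', 'Salary'] += 1000

# rename() and replace() return modified copies; df keeps its old labels and values.
renamed = df.rename(columns={'Dept': 'Department'})
replaced = df.replace('Sales', 'Marketing')

print(replaced)
```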


#### 7. Cleaning & Handling Nulls
| **Task**              | **Syntax**                  | **Description**            |
| --------------------- | --------------------------- | -------------------------- |
| Null count            | `df.isnull().sum()`         | Missing value count        |
| Drop nulls            | `df.dropna()`               | Remove rows with NaN       |
| Fill nulls with value | `df.fillna(0)`              | Replace NaN with 0         |
| Forward fill          | `df.ffill()`                | Propagate last valid value |
| Drop duplicates       | `df.drop_duplicates()`      | Remove duplicate rows      |
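
The cleaning calls can be compared side by side on a small invented frame. The sketch uses `df.ffill()` for forward fill, since the `method=` argument of `fillna()` is deprecated in recent pandas releases.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1.0, np.nan, 3.0, 3.0],
    'B': ['x', 'y', 'z', 'z'],
})

null_counts = df.isnull().sum()   # missing values per column
dropped = df.dropna()             # rows containing any NaN removed
zero_filled = df.fillna(0)        # NaN replaced by 0
forward = df.ffill()              # NaN replaced by the previous valid value
deduped = df.drop_duplicates()    # the repeated (3.0, 'z') row removed

print(null_counts)
```

Each call returns a new DataFrame, so `df` itself still contains the missing value and the duplicate row.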


#### 8. Sorting, Indexing & Reset
| **Task**        | **Syntax**                                     | **Description**     |
| --------------- | ---------------------------------------------- | ------------------- |
| Sort values     | `df.sort_values(by='Salary')`                  | Ascending           |
| Sort descending | `df.sort_values(by='Salary', ascending=False)` | Descending          |
| Reset index     | `df.reset_index(drop=True)`                    | Rebuild index       |
| Set index       | `df.set_index('ID')`                           | Change index column |
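
Sorting keeps each row's original index label, which is why `reset_index(drop=True)` often follows a sort. A minimal sketch with invented IDs and salaries:

```python
import pandas as pd

df = pd.DataFrame({'ID': [101, 102, 103], 'Salary': [7000, 5000, 6000]})

# Rows are reordered, but they keep their original index labels.
ascending = df.sort_values(by='Salary')
descending = df.sort_values(by='Salary', ascending=False)

# Discard the old labels and renumber from 0.
renumbered = ascending.reset_index(drop=True)

# Use the 'ID' column as the index so rows can be looked up by ID.
indexed = df.set_index('ID')

print(ascending.index.tolist(), renumbered.index.tolist())
```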


#### 9. Merging / Joining / Concatenating
| **Task**     | **Syntax**                                 | **Description**        |
| ------------ | ------------------------------------------ | ---------------------- |
| Merge on key | `pd.merge(df1, df2, on='ID')`              | Inner join by default  |
| Left join    | `pd.merge(df1, df2, on='ID', how='left')`  | Keep all rows from df1 |
| Outer join   | `pd.merge(df1, df2, on='ID', how='outer')` | Keep rows from both    |
| Concatenate  | `pd.concat([df1, df2])`                    | Stack vertically       |
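
The join types differ only in which keys survive, which the following sketch shows on two invented frames that share IDs 2 and 3. The `ignore_index=True` argument to `concat` is an addition to the table; it renumbers the stacked rows.

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Cara']})
df2 = pd.DataFrame({'ID': [2, 3, 4], 'Salary': [5000, 6000, 7000]})

# Inner join (default): only IDs present in both frames.
inner = pd.merge(df1, df2, on='ID')

# Left join: every row of df1; Salary is NaN where df2 has no match.
left = pd.merge(df1, df2, on='ID', how='left')

# Outer join: every ID from either frame.
outer = pd.merge(df1, df2, on='ID', how='outer')

# Stack frames vertically and renumber the rows.
stacked = pd.concat([df1, df1], ignore_index=True)

print(outer)
```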
